Abstract

For many applications of machine learning the goal is to predict the value of a vector c given the value of a vector x of input features. In a classification problem c represents a discrete class label, whereas in a regression problem it corresponds to one or more continuous variables. From a probabilistic perspective, the goal is to find the conditional distribution p(c|x). The most common approach to this problem is to represent the conditional distribution using a parametric model, and then to determine the parameters using a training set consisting of pairs {x_n, c_n} of input vectors along with their corresponding target output vectors. The resulting conditional distribution can be used to make predictions of c for new values of x. This is known as a discriminative approach, since the conditional distribution discriminates directly between the different values of c.

An alternative approach is to find the joint distribution p(x, c), expressed for instance as a parametric model, and then to use this joint distribution to evaluate the conditional p(c|x) in order to make predictions of c for new values of x. This is known as a generative approach since by sampling from the joint distribution it is possible to generate synthetic examples of the feature vector x. In practice, the generalization performance of generative models is often found to be poorer than that of discriminative models due to differences between the model and the true distribution of the data. When labelled training data is plentiful, discriminative techniques are widely used since they give excellent generalization performance. However, although collection of data is often easy, the process of labelling it can be expensive. Consequently there is increasing interest in generative methods since these can exploit unlabelled data in addition to labelled data. Although the generalization performance of generative models can often be improved by ‘training them discriminatively’, they can then no longer make use of unlabelled data. In an attempt to gain the benefit of both generative and discriminative approaches, heuristic procedures have been proposed which interpolate between these two extremes by taking a convex combination of the generative and discriminative objective functions.
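
As a rough illustration of the quantities named above, the following minimal sketch uses a small discrete joint table to evaluate the conditional p(c|x) from the joint p(x, c), and then forms a convex combination of the joint (generative) and conditional (discriminative) log-likelihoods. The table values, the data, the function names, and the weight `alpha` (with `alpha = 1` taken to mean purely generative) are all assumptions made for illustration; the text does not specify a particular parameterization or weighting convention.

```python
import math

# Hypothetical joint table: joint[x][c] = p(x, c), entries sum to 1.
joint = [[0.1, 0.3],
         [0.4, 0.2]]

# Hypothetical labelled training pairs (x_n, c_n).
data = [(0, 1), (1, 0), (1, 0), (0, 0)]


def conditional_from_joint(joint, x):
    """Return [p(c|x) for each c] using p(c|x) = p(x, c) / sum_c' p(x, c')."""
    evidence = sum(joint[x])  # p(x), obtained by marginalizing over c
    return [p_xc / evidence for p_xc in joint[x]]


def generative_objective(joint, data):
    """Joint log-likelihood: sum_n log p(x_n, c_n)."""
    return sum(math.log(joint[x][c]) for x, c in data)


def discriminative_objective(joint, data):
    """Conditional log-likelihood: sum_n log p(c_n | x_n)."""
    return sum(math.log(conditional_from_joint(joint, x)[c]) for x, c in data)


def blended_objective(joint, data, alpha):
    """Convex combination of the two objectives (alpha in [0, 1], assumed convention)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (alpha * generative_objective(joint, data)
            + (1.0 - alpha) * discriminative_objective(joint, data))


if __name__ == "__main__":
    print(conditional_from_joint(joint, 0))
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, blended_objective(joint, data, alpha))
```

In a real model the table entries would be functions of the parameters, and the blended objective would be maximized with respect to those parameters for a chosen `alpha`.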

Here we discuss a new perspective which says that there is only one correct way to train a given model, and that a ‘discriminatively trained’ generative model is fundamentally a new model (Minka, 2006). From this viewpoint, generative and discriminative models correspond to specific choices for the prior over parameters. As well as giving a principled interpretation of ‘discriminative training’, this approach opens the door to very general ways of interpolating between generative and discriminative extremes through alternative choices of prior. We illustrate this framework using both synthetic data and a practical example in the domain of multi-class object recognition. Our results show that, when the supply of labelled training data is limited, the optimum performance corresponds to a balance between the purely generative and the purely discriminative. We conclude by discussing how to use a Bayesian approach to find automatically the appropriate trade-off between the generative and discriminative extremes.