Mixtures of Gaussian Processes 
Volker Tresp 
Siemens AG, Corporate Technology, Department of Neural Computation 
Otto-Hahn-Ring 6, 81730 München, Germany
Volker.Tresp@mchp.siemens.de
Abstract 
We introduce the mixture of Gaussian processes (MGP) model which is 
useful for applications in which the optimal bandwidth of a map is input 
dependent. The MGP is derived from the mixture of experts model and 
can also be used for modeling general conditional probability densities. 
We discuss how Gaussian processes --in particular in the form of Gaussian
process classification, the support vector machine and the MGP model--
can be used for quantifying the dependencies in graphical models.
1 Introduction 
Gaussian processes are typically used for regression where it is assumed that the underly- 
ing function is generated by one infinite-dimensional Gaussian distribution (i.e. we assume 
a Gaussian prior distribution). In Gaussian process regression (GPR) we further assume 
that output data are generated by additive Gaussian noise, i.e. we assume a Gaussian like- 
lihood model. GPR can be generalized by using likelihood models from the exponential 
family of distributions which is useful for classification and the prediction of lifetimes or 
counts. The support vector machine (SVM) is a variant in which the likelihood model is 
not derived from the exponential family of distributions but rather uses functions with a 
discontinuous first derivative. In this paper we introduce another generalization of GPR
in the form of the mixture of Gaussian processes (MGP) model, which is a variant of the well
known mixture of experts (ME) model of Jacobs et al. (1991). The MGP model allows
Gaussian processes to model general conditional probability densities. An advantage of
the MGP model is that it is fast to train compared to the neural network ME model.
Even more interesting, the MGP model is one possible approach to addressing the problem
of input-dependent bandwidth requirements in GPR. Input-dependent bandwidth is useful
if either the complexity of the map is input dependent --requiring a higher bandwidth in 
regions of high complexity-- or if the input data distribution is input dependent. In the 
latter case, one would prefer Gaussian processes with a higher bandwidth in regions with 
many data points and a lower bandwidth in regions with lower data density. If GPR models 
with different bandwidths are used, the MGP approach allows the system to self-organize 
by locally selecting the GPR model with the appropriate optimal bandwidth. 
Gaussian process classifiers, the support vector machine and the MGP can be used to model 
the local dependencies in graphical models. Here, we are mostly interested in the case that
the dependencies of a set of variables y are modified via Gaussian processes by a set of
exogenous variables x. As an example consider a medical domain in which a Bayesian
network of discrete variables y models the dependencies between diseases and symptoms and
where these dependencies are modified by exogenous (often continuous) variables x
representing quantities such as the patient's age, weight or blood pressure. Another example
would be collaborative filtering where y might represent a set of goods and the correlation 
between customer preferences is modeled by a dependency network (another example of 
a graphical model). Here, exogenous variables such as income, gender and social status 
might be useful quantities to modify those dependencies. 
The paper is organized as follows. In the next section we briefly review Gaussian processes 
and their application to regression. In Section 3 we discuss generalizations of the simple 
GPR model. In Section 4 we introduce the MGP model and present experimental results. 
In Section 5 we discuss Gaussian processes in context with graphical models. In Section 6 
we present conclusions. 
2 Gaussian Processes 
In Gaussian process regression (GPR) one assumes that a priori a function $f(x)$ is
generated from an infinite-dimensional Gaussian distribution with zero mean and covariance
$K(x_i, x_k) = \mathrm{cov}(f(x_i), f(x_k))$, where the $K(x_i, x_k)$ are positive definite
kernel functions. In this paper we will only use Gaussian kernel functions of the form

$$K(x_i, x_k) = A \exp\left(-\frac{1}{2s^2} \|x_i - x_k\|^2\right)$$

with scale parameter $s$ and amplitude $A$. Furthermore, we assume a set of $N$ training
data $D = \{(x_k, y_k)\}_{k=1}^{N}$ where targets are generated following a normal distribution with
variance $\sigma^2$ such that

$$P(y | f(x)) \propto \exp\left(-\frac{1}{2\sigma^2} (f(x) - y)^2\right). \qquad (1)$$
The expected value $\hat{f}(x)$ to an input $x$ given the training data is a superposition of the
kernel functions of the form

$$\hat{f}(x) = \sum_{k=1}^{N} w_k K(x, x_k). \qquad (2)$$
Here, $w_k$ is the weight on the $k$-th kernel. Let $K$ be the $N \times N$ Gram matrix with
$(K)_{k,j} = \mathrm{cov}(f(x_k), f(x_j))$. Then we have the relation $f^m = Kw$ where the
components of $f^m = (f(x_1), \ldots, f(x_N))'$ are the values of $f$ at the locations of the
training data and $w = (w_1, \ldots, w_N)'$. As a result of this relationship we can either
calculate the optimal $w$ directly, or we can calculate the optimal $f^m$ and then deduce the
corresponding $w$-vector by matrix inversion. The latter approach is taken in this paper.
Following the assumptions, the optimal $f^m$ minimizes the cost function

$$(f^m)' K^{-1} f^m + \frac{1}{\sigma^2} (f^m - y)'(f^m - y) \qquad (3)$$

such that

$$\hat{f}^m = K (K + \sigma^2 I)^{-1} y.$$

Here $y = (y_1, \ldots, y_N)'$ is the vector of targets and $I$ is the $N$-dimensional unit matrix.
3 Generalized Gaussian Processes and the Support Vector Machine 
In generalized Gaussian processes the Gaussian prior assumption is maintained but the 
likelihood model is now derived from the exponential family of distributions. The most 
important special cases are two-class classification

$$P(y = 1 | f(x)) = \frac{1}{1 + \exp(-f(x))}$$

and multiple-class classification. Here, $y$ is a discrete variable with $C$ states and

$$P(y = i | f_1(x), \ldots, f_C(x)) = \frac{\exp(f_i(x))}{\sum_{j=1}^{C} \exp(f_j(x))}. \qquad (4)$$

Note that for multiple-class classification $C$ Gaussian processes $f_1(x), \ldots, f_C(x)$ are
used. Generalized Gaussian processes are discussed in Tresp (2000). The special case
of classification was discussed by Williams and Barber (1998) from a Bayesian perspec- 
tive. The related smoothing splines approaches are discussed in Fahrmeir and Tutz (1994). 
For generalized Gaussian processes, the optimization of the cost function is based on an 
iterative Fisher scoring procedure. 
Incidentally, the support vector machine (SVM) can also be considered to be a generalized
Gaussian process model with

$$P(y | f(x)) \propto \exp\left(-\mathrm{const} \, (1 - y f(x))_+\right).$$

Here, $y \in \{-1, 1\}$, the operation $(\cdot)_+$ sets all negative values equal to zero and const is
a constant (Sollich (2000)).$^1$ The SVM cost function is particularly interesting since, due
to its discontinuous first derivative, many components of the optimal weight vector $w$ are
zero, i.e. we obtain sparse solutions.
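To make the difference between the two likelihood models concrete, the following sketch evaluates the two-class logistic likelihood and the negative logarithm of the unnormalized SVM likelihood. The function names and the value const = 1 are illustrative assumptions.

```python
import numpy as np

def logistic_likelihood(f):
    """P(y = 1 | f(x)) for two-class classification."""
    return 1.0 / (1.0 + np.exp(-f))

def svm_neg_log_likelihood(y, f, const=1.0):
    """Negative log of the unnormalized SVM likelihood: const * (1 - y f)_+."""
    return const * np.maximum(0.0, 1.0 - y * f)

f = np.array([-1.0, 0.0, 1.0, 2.0])
p = logistic_likelihood(f)             # smooth in f everywhere
hinge = svm_neg_log_likelihood(1, f)   # exactly zero wherever y * f >= 1
```

The SVM term has slope -const for $y f(x) < 1$ and slope zero beyond that point. Training points in the flat region exert no pull on the solution, which is where the sparsity mentioned above comes from.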
4 Mixtures of Gaussian Processes 
GPR employs a global scale parameter s. In many applications it might be more desirable 
to permit an input-dependent scale parameter: the complexity of the map might be input de- 
pendent or the input data density might be nonuniform. In the latter case one might want to 
use a smaller scale parameter in regions with high data density. This is the main motivation 
for introducing another generalization of the simple GPR model, the mixture of Gaussian 
processes (MGP) model, which is a variant of the mixture of experts model of Jacobs et al. 
(1991). Here, a set of GPR models with different scale parameters is used and the system 
can autonomously decide which GPR model is appropriate for a particular region of input 
space. Let $F^\mu(x) = \{f_1^\mu(x), \ldots, f_M^\mu(x)\}$ denote this set of $M$ GPR models. The state of a
discrete $M$-state variable $z$ determines which of the GPR models is active for a given input
$x$. The state of $z$ is estimated by an $M$-class classification Gaussian process model with

$$P(z = i | F^z(x)) = \frac{\exp(f_i^z(x))}{\sum_{j=1}^{M} \exp(f_j^z(x))}$$

where $F^z(x) = \{f_1^z(x), \ldots, f_M^z(x)\}$ denotes a second set of $M$ Gaussian processes.
Finally, we use a set of $M$ Gaussian processes $F^\sigma(x) = \{f_1^\sigma(x), \ldots, f_M^\sigma(x)\}$ to model the
input-dependent noise variance of the GPR models. The likelihood model given the state
of $z$

$$P(y | z, F^\mu(x), F^\sigma(x)) = G(y; f_z^\mu(x), \exp(2 f_z^\sigma(x)))$$

is a Gaussian centered at $f_z^\mu(x)$ with variance $\exp(2 f_z^\sigma(x))$. The exponential is used
to ensure positivity. Note that $G(a; b, c)$ is our notation for a Gaussian density with mean
$b$ and variance $c$, evaluated at $a$. In the remaining parts of the paper we will not denote the
dependency on the Gaussian processes explicitly, e.g. we will write $P(y | z, x)$ instead of
$P(y | z, F^\mu(x), F^\sigma(x))$. Since $z$ is a latent variable we obtain with

$$P(y | x) = \sum_{i=1}^{M} P(z = i | x) \, G(y; f_i^\mu(x), \exp(2 f_i^\sigma(x))), \qquad E(y | x) = \sum_{i=1}^{M} P(z = i | x) \, f_i^\mu(x)$$

the well known mixture of experts network of Jacobs et al. (1991) where the $f_i^\mu(x)$ are the
(Gaussian process) experts and $P(z = i | x)$ is the gating network. Figure 2 (left) illustrates
the dependencies in the MGP model.

$^1$ Properly normalizing the conditional probability density is somewhat tricky and is discussed in
detail in Sollich (2000).
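The mixture density and its mean can be evaluated directly once the values of the three sets of Gaussian processes at an input are available. The sketch below assumes these values are given as arrays of shape (number of inputs, M); the function and argument names are illustrative.

```python
import numpy as np

def softmax(F):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    E = np.exp(F - F.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gaussian_density(y, mean, var):
    """G(y; mean, var): Gaussian density with the given mean and variance."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mgp_mean(f_mu, f_z):
    """E(y | x) = sum_i P(z = i | x) f_i^mu(x)."""
    return (softmax(f_z) * f_mu).sum(axis=1)

def mgp_density(y, f_mu, f_z, f_sigma):
    """P(y | x) = sum_i P(z = i | x) G(y; f_i^mu(x), exp(2 f_i^sigma(x)))."""
    comp = gaussian_density(y[:, None], f_mu, np.exp(2.0 * f_sigma))
    return (softmax(f_z) * comp).sum(axis=1)

# One input, two experts with equal gating weight, means 0 and 2, unit variance.
f_mu = np.array([[0.0, 2.0]])
f_z = np.zeros((1, 2))
f_sigma = np.zeros((1, 2))
mean = mgp_mean(f_mu, f_z)
dens = mgp_density(np.array([1.0]), f_mu, f_z, f_sigma)
```

In this toy case the gating probabilities are both 0.5, so the predicted mean lies halfway between the two expert means, while the density is bimodal with modes near 0 and 2.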
4.1 EM Fisher Scoring Learning Rules 
Although a priori the functions f are Gaussian distributed, this is not necessarily true -in 
contrast to simple GPR in Section 2- for the posterior distribution due to the nonlinear 
nature of the model. Therefore one is typically interested in the minimum of the negative 
logarithm of the posterior density 
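$$-\sum_{k=1}^{N} \log \sum_{i=1}^{M} P(z = i | x_k) \, G(y_k; f_i^\mu(x_k), \exp(2 f_i^\sigma(x_k)))$$

$$+ \frac{1}{2} \sum_{i=1}^{M} (f_i^{\mu,m})' (\Sigma_i^{\mu,m})^{-1} f_i^{\mu,m} + \frac{1}{2} \sum_{i=1}^{M} (f_i^{z,m})' (\Sigma_i^{z,m})^{-1} f_i^{z,m} + \frac{1}{2} \sum_{i=1}^{M} (f_i^{\sigma,m})' (\Sigma_i^{\sigma,m})^{-1} f_i^{\sigma,m}.$$

Here $\Sigma_i^{\mu,m}$, $\Sigma_i^{z,m}$ and $\Sigma_i^{\sigma,m}$ denote the covariance (Gram) matrices of the
respective Gaussian processes at the training inputs.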
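The superscript $m$ denotes the vectors and matrices defined at the measurement points, e.g.
$f_i^{\mu,m} = (f_i^\mu(x_1), \ldots, f_i^\mu(x_N))'$. In the E-step, based on the current estimates of the
Gaussian processes at the data points, the state of the latent variable is estimated as

$$\hat{P}(z = i | x_k, y_k) = \frac{\hat{P}(z = i | x_k) \, G(y_k; \hat{f}_i^\mu(x_k), \exp(2 \hat{f}_i^\sigma(x_k)))}{\sum_{j=1}^{M} \hat{P}(z = j | x_k) \, G(y_k; \hat{f}_j^\mu(x_k), \exp(2 \hat{f}_j^\sigma(x_k)))}.$$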
In the M-step, based on the E-step, the Gaussian processes at the data points are updated.
We obtain

$$\hat{f}_i^{\mu,m} = \Sigma_i^{\mu,m} (\Sigma_i^{\mu,m} + \Psi_i^{\mu,m})^{-1} y^m$$

where $\Psi_i^{\mu,m}$ is a diagonal matrix with entries

$$(\Psi_i^{\mu,m})_{kk} = \exp(2 \hat{f}_i^\sigma(x_k)) / \hat{P}(z = i | x_k, y_k).$$

Note that data with a small $\hat{P}(z = i | x_k, y_k)$ obtain a small weight. To update the other
Gaussian processes, iterative Fisher scoring steps have to be used as shown in the appendix.
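One EM iteration for the expert means can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the gating probabilities and noise processes are held fixed, the names `e_step` and `m_step_expert`, the toy data and the scale values are hypothetical, and the small `floor` on the responsibilities is an added numerical safeguard that is not part of the equations above.

```python
import numpy as np

def e_step(y, gate, f_mu, f_sigma):
    """Responsibilities P(z = i | x_k, y_k); y has shape (N,), the rest (N, M)."""
    var = np.exp(2.0 * f_sigma)
    lik = np.exp(-0.5 * (y[:, None] - f_mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    joint = gate * lik
    return joint / joint.sum(axis=1, keepdims=True)

def m_step_expert(gram, y, resp_i, f_sigma_i, floor=1e-12):
    """Update expert i at the training inputs: Sigma (Sigma + Psi)^{-1} y."""
    # Psi_kk = exp(2 f_i^sigma(x_k)) / P(z = i | x_k, y_k); floor avoids division by zero.
    psi = np.exp(2.0 * f_sigma_i) / np.maximum(resp_i, floor)
    return gram @ np.linalg.solve(gram + np.diag(psi), y)

# Toy setup (hypothetical): step-function targets, two experts with different scales.
rng = np.random.default_rng(1)
x = np.sort(rng.normal(size=15))
y = np.sign(x)
scales = [0.2, 2.0]
grams = [np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * s ** 2)) for s in scales]
gate = np.full((15, 2), 0.5)
f_sigma = np.full((15, 2), np.log(0.3))
# Initialize each expert as a plain GPR fit with noise variance 0.3^2 = 0.09.
f_mu = np.column_stack([G @ np.linalg.solve(G + 0.09 * np.eye(15), y) for G in grams])

resp = e_step(y, gate, f_mu, f_sigma)
f_mu_new = np.column_stack(
    [m_step_expert(grams[i], y, resp[:, i], f_sigma[:, i]) for i in range(2)]
)
```

Dividing the noise variance by the responsibility inflates the effective noise of points that an expert is not responsible for, so those points barely influence that expert. With all responsibilities equal to one the update reduces to ordinary GPR.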
There is a serious problem with overtraining in the MGP approach. The reason is that the 
GPR model with the highest bandwidth tends to obtain the highest weight in the E-step 
since it provides the best fit to the data. There is an easy fix for the MGP: for calculating
the responses of the Gaussian processes at $x_k$ in the E-step we use all training data except
$(x_k, y_k)$. Fortunately, this calculation is very cheap in the case of Gaussian processes since
for example

$$\tilde{f}_i^\mu(x_k) = \frac{\hat{f}_i^\mu(x_k) - S_{i,kk} \, y_k}{1 - S_{i,kk}}$$

where $\tilde{f}_i^\mu(x_k)$ denotes the estimate at the training data point $x_k$ not using $(x_k, y_k)$. Here,
$S_{i,kk}$ is the $k$-th diagonal element of $S_i = \Sigma_i^{\mu,m} (\Sigma_i^{\mu,m} + \Psi_i^{\mu,m})^{-1}$.$^2$

$^2$ See Hofmann (2000) for a discussion of the convergence of this type of algorithm.
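The leave-one-out shortcut can be checked numerically against an explicit refit. The sketch below assumes the standard leave-one-out identity for linear smoothers, (fitted value minus $S_{kk} y_k$) divided by $(1 - S_{kk})$, with smoother matrix $S$ = Gram matrix times the inverse of (Gram matrix plus a positive diagonal noise matrix). Function names, data and the diagonal entries `psi` are illustrative assumptions.

```python
import numpy as np

def loo_shortcut(gram, psi, y):
    """Leave-one-out estimates at all training inputs from a single fit."""
    S = gram @ np.linalg.inv(gram + np.diag(psi))  # smoother matrix
    f_hat = S @ y
    s_kk = np.diag(S)
    return (f_hat - s_kk * y) / (1.0 - s_kk)

def loo_refit(gram, psi, y):
    """Reference computation: refit N times, each time without (x_k, y_k)."""
    n = len(y)
    out = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k
        sub = gram[np.ix_(keep, keep)] + np.diag(psi[keep])
        out[k] = gram[k, keep] @ np.linalg.solve(sub, y[keep])
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=12)
y = np.sign(x) + 0.1 * rng.normal(size=12)
gram = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 0.7 ** 2))
psi = rng.uniform(0.05, 0.5, size=12)
agree = np.allclose(loo_shortcut(gram, psi, y), loo_refit(gram, psi, y))
```

The shortcut needs one matrix inversion per expert, whereas the explicit refit solves N linear systems, which is why the correction adds little cost to the E-step.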
Figure 1: The input data are generated from a Gaussian distribution with unit variance and mean
0. The output data are generated from a step function (o, bottom right). The top left plot shows the
map formed by three GPR models with different bandwidths. As can be seen, no individual model
achieves a good map. Then an MGP model was trained using the three GPR models. The top right
plot shows the GPR models after convergence. The bottom left plot shows $P(z = i | x)$. The GPR
model with the highest bandwidth models the transition at zero, the GPR model with an intermediate 
bandwidth models the intermediate region and the GPR model with the lowest bandwidth models the 
extreme regions. The bottom right plot shows the data o and the fit obtained by the complete MGP 
model which is better than the map formed by any of the individual GPR models. 
4.2 Experiments 
Figure 1 illustrates how the MGP divides up a complex task into subtasks modeled by 
the individual GPR models (see caption). By dividing up the task, the MGP model can 
potentially achieve a performance which is better than the performance of any individual 
model. Table 1 shows results from artificial data sets and real world data sets. In all cases, 
the performance of the MGP is better than the mean performance of the GPR models and 
also better than the performance of the mean (obtained by averaging the predictions of all 
GPR models). 
Table 1: Results using artificial and real data sets of size N = 100 using
M = 10 GPR models. The data set ART is generated by adding Gaussian noise with a
standard deviation of 0.2 to a map defined by 5 normalized Gaussian bumps. numin is
the number of inputs. The bandwidth s was generated randomly between 0 and max. s.
Furthermore, mean perf. is the mean squared test set error of all GPR networks and perf. of
mean is the mean squared test set error achieved by simply averaging the predictions. The
last column shows the performance of the MGP.

Data        numin   max. s   mean perf.   perf. of mean   MGP
ART             1        1       0.0167          0.0080   0.0054
ART             2        3       0.0573          0.0345   0.0239
ART             5        6       0.1994          0.1383   0.0808
ART            10       10       0.1670          0.1135   0.0739
ART            20       20       0.1716          0.1203   0.0662
HOUSING        13       10       0.4677          0.3568   0.2634
BUPA            6       20       0.9654          0.9067   0.8804
DIABETES        8       40       0.8230          0.7660   0.7275
WAVEFORM       21       40       0.6295          0.5979   0.4453

5 Gaussian Processes for Graphical Models
Gaussian processes can be useful models for quantifying the dependencies in Bayesian
networks and dependency networks (the latter were introduced in Hofmann and Tresp, 1998,
Heckerman et al., 2000), in particular when parent variables are continuous quantities. If
the child variable is discrete, Gaussian process classification or the SVM are appropriate
models, whereas when the child variable is continuous, the MGP model can be employed
as a general conditional density estimator. Typically one would require that the continuous
input variables x to the Gaussian process systems are known. It might therefore be
useful to consider those as exogenous variables which modify the dependencies in a
graphical model of y-variables as shown in Figure 2 (right). As an example consider a medical
domain in which a Bayesian network of discrete variables y models the dependencies
between diseases and symptoms and where these dependencies are modified by exogenous
(often continuous) variables x representing quantities such as the patient's age, weight or
blood pressure. Another example would be collaborative filtering where y might represent
a set of goods and the correlation between customer preferences is modeled by a
dependency network as in Heckerman et al. (2000). Here, exogenous variables such as income,
gender and social status might be useful quantities to modify those correlations. Note that
the GPR model itself can also be considered to be a graphical model with dependencies
modeled as Gaussian processes (compare Figure 2).
Readers might also be interested in the related and independent paper by Friedman and
Nachman (2000) in which those authors used GPR systems (not in the form of the MGP) to
perform structural learning in Bayesian networks of continuous variables.
6 Conclusions 
We demonstrated that Gaussian processes can be useful building blocks for forming com- 
plex probabilistic models. In particular we introduced the MGP model and demonstrated 
how Gaussian processes can model the dependencies in graphical models. 
7 Appendix 
For fz and f the mode estimates are found by iterating Newton-Raphson equations f(/-]-l) : f(1) _ 
r-l(1)J(1) where J(1) is the Jacobian and t(1) the Hessian matrix for which certain interactions 
are ignored. One obtains for (l = 1, 2,...) the following update equations. 
d, z.,m,(1) : ((z: ilx,y ) - (1)(z: ilxk))kN__l , 
. ,.,(1) _- diag ([P(/)(z -- ilx )(1 - p(1) (z -- ilx ) )]-l ) v 
* k:l ' 
Figure 2: Left: The graphical structure of an MGP model consisting of the discrete latent variable 
z, the continuous variable y and input variable x. The probability density of z is dependent on the 
Gaussian processes F z. The probability distribution of y is dependent on the state of z and of the 
Gaussian processes F t', F . Right: An example of a Bayesian network which contains the variables 
//1, y2, Y3,//4. Some of the dependencies are modified by x via Gaussian processes fl, f2, f3. 
Similarly, 
where e is an N-dimensional vector of ones and 
)er,m,(/) = (exp?2f;'(1)(xl))) N 
i 2(f(xk) --yk)2 k----1 
( exp(2f,(l)(xk)) )N 
'm'(1):dia9 2P(z=ilx,Y)(](x)-Y) 2 k=l 
References 
[1] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., Hinton, G. E. (1991). Adaptive Mixtures of Local
Experts. Neural Computation, 3.
[2] Tresp, V. (2000). The Generalized Bayesian Committee Machine. Proceedings of the Sixth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD-2000.
[3] Williams, C. K. I., Barber, D. (1998). Bayesian Classification with Gaussian Processes. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(12).
[4] Fahrmeir, L., Tutz, G. (1994). Multivariate Statistical Modeling Based on Generalized Linear
Models. Springer.
[5] Sollich, P. (2000). Probabilistic Methods for Support Vector Machines. In Solla, S. A., Leen, T.
K., Müller, K.-R. (Eds.), Advances in Neural Information Processing Systems 12, MIT Press.
[6] Hofmann, R. (2000). Lernen der Struktur nichtlinearer Abhängigkeiten mit graphischen
Modellen. PhD Dissertation.
[7] Hofmann, R., Tresp, V. (1998). Nonlinear Markov Networks for Continuous Variables. In Jordan,
M. I., Kearns, M. S., Solla, S. A. (Eds.), Advances in Neural Information Processing Systems 10,
MIT Press.
[8] Heckerman, D., Chickering, D., Meek, C., Rounthwaite, R., Kadie, C. (2000). Dependency
Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning
Research, 1.
[9] Friedman, N., Nachman, I. (2000). Gaussian Process Networks. In Boutilier, C., Goldszmidt, M.
(Eds.), Proc. Sixteenth Conf. on Uncertainty in Artificial Intelligence (UAI).
