We first review the definition and properties of the Gaussian distribution:

A Gaussian random variable $X\sim \mathcal{N}(\mu,\Sigma)$, where $\mu$ is the mean and $\Sigma$ is the covariance matrix, has the following probability density function: $$P(x;\mu,\Sigma)=\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)},$$ where $d$ is the dimension and $|\Sigma|$ is the determinant of $\Sigma$.
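As a quick numerical sanity check, the density can be implemented directly in NumPy (a sketch; `gaussian_pdf` is an illustrative name, not a library function):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density P(x; mu, Sigma) of a d-dimensional Gaussian."""
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# In 1D, the density at x = mu is 1 / sqrt(2 * pi * sigma^2).
p = gaussian_pdf(np.array([0.0]), np.array([0.0]), np.array([[4.0]]))
```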

The Gaussian distribution occurs very often in real-world data. This is for a good reason: the Central Limit Theorem (CLT). The CLT states that the arithmetic mean of $m>0$ i.i.d. samples is approximately normally distributed, independent of the original sample distribution (provided it has finite mean and variance).
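The CLT is easy to verify empirically, e.g. with means of uniform samples (a small NumPy sketch; the sample sizes and seed are arbitrary choices):

```python
import numpy as np

# Means of m uniform samples are approximately normal, even though the
# base distribution (uniform) is not.
rng = np.random.default_rng(0)
m = 100                                   # samples per mean
means = rng.random((50000, m)).mean(axis=1)

# Uniform(0,1) has mean 1/2 and variance 1/12, so the sample mean is
# approximately N(1/2, 1/(12 m)).
```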

Let $y=\begin{bmatrix} y_A\\ y_B \end{bmatrix}$ be a Gaussian random vector with mean $\mu=\begin{bmatrix} \mu_A\\ \mu_B \end{bmatrix}$ and covariance matrix $\Sigma=\begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix}$. We have the following properties:

**1. Marginalization:** $y_A \sim \mathcal{N}(\mu_A, \Sigma_{AA})$.

**2. Conditioning:** $y_A \mid y_B \sim \mathcal{N}\big(\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(y_B-\mu_B),\; \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}\big)$.
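The Gaussian conditioning property can be checked numerically: the conditional covariance $\Sigma_{AA}-\Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}$ (a Schur complement) equals the inverse of the $AA$-block of the precision matrix $\Sigma^{-1}$, an independent identity we can verify in NumPy:

```python
import numpy as np

# Conditioning a joint Gaussian y = [y_A, y_B]:
#   cov(y_A | y_B) = Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA
# This Schur complement equals the inverse of the A-block of the
# precision matrix Lambda = Sigma^{-1}.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
Sigma = M @ M.T + 4 * np.eye(4)            # random SPD covariance
Saa, Sab = Sigma[:2, :2], Sigma[:2, 2:]
Sba, Sbb = Sigma[2:, :2], Sigma[2:, 2:]

schur = Saa - Sab @ np.linalg.solve(Sbb, Sba)
cond_cov = np.linalg.inv(np.linalg.inv(Sigma)[:2, :2])
```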

In general, the posterior predictive distribution is $$ P(Y\mid D,X) = \int_{\mathbf{w}}P(Y,\mathbf{w} \mid D,X) d\mathbf{w} = \int_{\mathbf{w}} P(Y \mid \mathbf{w}, D,X) P(\mathbf{w} \mid D) d\mathbf{w} $$ Unfortunately, the above integral is in general analytically intractable.

**Problem:** $f$ is an *infinite-dimensional function*! But the multivariate Gaussian distribution is for *finite-dimensional random vectors*.

**Definition:** A GP is a (potentially infinite) collection of random variables (RVs) such that the joint distribution of every finite subset of RVs is multivariate Gaussian:
$$f \sim GP(\mu, k), $$
where $\mu(\mathbf{x})$ is the mean function and $k(\mathbf{x}, \mathbf{x}')$ is the covariance (kernel) function.
Now, in order to model the predictive distribution $P(f_* \mid \mathbf{x}_*, D)$ we can use a Bayesian approach: place a **GP prior** $P(f\mid \mathbf{x}) = \mathcal{N}(\mu, \Sigma)$ and condition it on the training data $D$ to model the joint distribution of $f = f(X)$ (the vector of training observations) and $f_* = f(\mathbf{x}_*)$ (the prediction at the test input).

We assume that, before we observe the training labels, the labels are drawn from the zero-mean prior Gaussian distribution:
$$
\begin{bmatrix}
y_1\\
y_2\\
\vdots\\
y_n\\
y_t
\end{bmatrix}
\sim \mathcal{N}(0,\Sigma)$$

W.l.o.g., assuming zero mean is always possible, since we can subtract the sample mean.
All training and test labels are drawn from an $(n+m)$-dimensional Gaussian distribution, where $n$ is the number of training points and $m$ is the number of test points (in the display above, $m=1$ with test label $y_t$). Note that the training labels $y_1,\dots,y_n$ we observe are realizations of the random variables $Y_1,\dots,Y_n$.

Whether this distribution gives us meaningful distribution or not depends on how we choose the covariance matrix $\Sigma$. We consider the following properties of $\Sigma$:

**1.** $\Sigma_{ij}=E((Y_i-\mu_i)(Y_j-\mu_j))$.

**2.** $\Sigma$ is always positive semi-definite.

**3.** $\Sigma_{ii}=\text{Variance}(Y_i)$, thus $\Sigma_{ii}\geq 0$.

**4.** If $Y_i$ and $Y_j$ are independent, e.g. because $\mathbf{x}_i$ is very different from $\mathbf{x}_j$, then $\Sigma_{ij}=\Sigma_{ji}=0$.

**5.** If $\mathbf{x}_i$ is similar to $\mathbf{x}_j$, then $\Sigma_{ij}=\Sigma_{ji}>0$.

We can observe that this is very similar to the kernel matrix in SVMs. Therefore, we can simply let $\Sigma_{ij}=K(\mathbf{x}_i,\mathbf{x}_j)$. For example, if we use the RBF kernel (aka "squared exponential kernel"), then \begin{equation} \Sigma_{ij}=\tau e^{-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|^2}{\sigma^2}}. \end{equation} If we use the polynomial kernel, then $\Sigma_{ij}=\tau (1+\mathbf{x}_i^\top \mathbf{x}_j)^d$.
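As a concrete sketch, the RBF covariance matrix can be built with NumPy (the function name `rbf_kernel` is illustrative, not from any particular library):

```python
import numpy as np

def rbf_kernel(X1, X2, tau=1.0, sigma=1.0):
    """Sigma_ij = tau * exp(-||x_i - x_j||^2 / sigma^2); rows are points."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return tau * np.exp(-sq_dist / sigma ** 2)

X = np.array([[0.0], [1.0], [5.0]])
Sigma = rbf_kernel(X, X)   # 3x3 covariance: nearby inputs get large entries
```

Note the properties above hold by construction: the diagonal is $\tau$, nearby inputs are strongly correlated, and the matrix is positive semi-definite.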

Thus, we can decompose $\Sigma$ as $\begin{pmatrix} K & K_* \\ K_*^\top & K_{**} \end{pmatrix}$, where $K$ is the training kernel matrix, $K_*$ is the training-testing kernel matrix, $K_*^\top$ is the testing-training kernel matrix and $K_{**}$ is the testing kernel matrix. The conditional distribution of the (noise-free) values of the latent function $f$ can be written as: \begin{equation} f_*\mid(Y_1=y_1,...,Y_n=y_n,\mathbf{x}_1,...,\mathbf{x}_n,\mathbf{x}_*)\sim \mathcal{N}(K_*^\top K^{-1}y,\;K_{**}-K_*^\top K^{-1}K_*), \end{equation} where the kernel matrices $K_*, K_{**}, K$ are functions of $\mathbf{x}_1,\dots,\mathbf{x}_n,\mathbf{x}_*$.
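The noise-free conditional can be sketched in a few lines of NumPy (the helper names `gp_posterior` and `rbf` are illustrative; a tiny jitter is added to $K$ for numerical stability):

```python
import numpy as np

def rbf(X1, X2, tau=1.0, sigma=1.0):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return tau * np.exp(-sq / sigma ** 2)

def gp_posterior(X, y, Xstar, kernel):
    """Noise-free conditional: N(K_*^T K^{-1} y, K_** - K_*^T K^{-1} K_*)."""
    K = kernel(X, X) + 1e-10 * np.eye(len(X))   # tiny jitter for stability
    Ks = kernel(X, Xstar)
    Kss = kernel(Xstar, Xstar)
    mean = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X).ravel()
mean, cov = gp_posterior(X, y, np.array([[0.5]]), rbf)
```

Because there is no noise term, the posterior mean interpolates the training labels exactly, with (near-)zero variance at the training inputs.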

In many applications the observed labels can be noisy. If we assume this noise is independent and zero-mean Gaussian, then we observe $\hat Y_i=f_i+\epsilon_i$, where $f_i$ is the true (unobserved, i.e. latent) target and the noise is denoted by $\epsilon_i\sim \mathcal{N}(0,\sigma^2)$. In this case the new covariance matrix becomes $\hat\Sigma=\Sigma+\sigma^2\mathbf{I}$. We can derive this fact, first for the off-diagonal terms where $i\neq j$:
\begin{equation}
\hat\Sigma_{ij}=\mathbb{E}[(f_i+\epsilon_i)(f_j+\epsilon_j)]=\mathbb{E}[f_if_j]+\mathbb{E}[f_i]\mathbb{E}[\epsilon_j]+\mathbb{E}[f_j]\mathbb{E}[\epsilon_i]+\mathbb{E}[\epsilon_i]\mathbb{E}[\epsilon_j]=\mathbb{E}[f_if_j]=\Sigma_{ij},
\end{equation}
as $\mathbb{E}[\epsilon_i]=\mathbb{E}[\epsilon_j]=0$ and where we use the fact that $\epsilon_i$ is independent from all other random variables.
For the diagonal entries of $\hat\Sigma$, *i.e.* the case where $i=j$, we obtain
\begin{equation}
\hat\Sigma_{ii}=\mathbb{E}[(f_i+\epsilon_i)^2]=\mathbb{E}[f_i^2]+2\mathbb{E}[f_i]\mathbb{E}[\epsilon_i]+\mathbb{E}[\epsilon_i^2]=\mathbb{E}[f_i^2]+\mathbb{E}[\epsilon_i^2]=\Sigma_{ii}+\sigma^2,
\end{equation}
because $\mathbb{E}[\epsilon_i]=0$ and $\mathbb{E}[\epsilon_i^2]=\sigma^2$, which is the variance of $\epsilon_i$.

Plugging this updated covariance matrix into the Gaussian Process posterior distribution leads to
\begin{equation}
Y_*|(Y_1=y_1,...,Y_n=y_n,\mathbf{x}_1,...,\mathbf{x}_n)\sim \mathcal{N}(K_*^\top (K+\sigma^2 I)^{-1}y,K_{**}+\sigma^2 I-K_*^\top (K+\sigma^2 I)^{-1}K_*).\label{eq:GP:withnoise}
\end{equation}
In practice the above equation is also numerically more stable, because the matrix $K+\sigma^2 I$ is always invertible for any $\sigma^2>0$: $K$ is positive semi-definite, so every eigenvalue of $K+\sigma^2 I$ is at least $\sigma^2$.
So, for **predictions** we can use the posterior mean and additionally we get the predictive variance as measure of confidence or (un)certainty about the point prediction.
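A minimal NumPy sketch of the noisy predictive equations (illustrative helper names; `noise` plays the role of $\sigma^2$):

```python
import numpy as np

def rbf(X1, X2, tau=1.0, sigma=1.0):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return tau * np.exp(-sq / sigma ** 2)

def gp_predict(X, y, Xstar, noise=0.1):
    """Posterior mean and covariance with label noise sigma^2 = `noise`."""
    K = rbf(X, X) + noise * np.eye(len(X))            # K + sigma^2 I
    Ks = rbf(X, Xstar)                                # K_*
    Kss = rbf(Xstar, Xstar) + noise * np.eye(len(Xstar))  # K_** + sigma^2 I
    mean = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

X = np.array([[0.0], [1.0]])
y = np.array([0.0, 1.0])
# Predict close to the data (x = 0) and far away (x = 100).
mean, cov = gp_predict(X, y, np.array([[0.0], [100.0]]))
```

Far from the training data the prediction falls back to the prior (mean $0$, variance $\tau+\sigma^2$), while near the data the predictive variance shrinks, which is exactly the confidence measure mentioned above.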

Cf. equations (2.28)-(2.30) in GPML, Chapter 2 (p. 19). Once you have the marginal likelihood and its derivatives, you can use any out-of-the-box solver such as (stochastic) gradient descent or conjugate gradients (caution: minimize the *negative* log marginal likelihood).
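The objective to minimize is the negative log marginal likelihood $\frac{1}{2}y^\top(K+\sigma^2 I)^{-1}y+\frac{1}{2}\log|K+\sigma^2 I|+\frac{n}{2}\log 2\pi$ (GPML eq. 2.30). A sketch using a Cholesky factorization for stability (the function name is illustrative):

```python
import numpy as np

def neg_log_marginal_likelihood(K, y, noise):
    """-log p(y|X) for a zero-mean GP (cf. GPML eq. 2.30)."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))    # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha                          # 0.5 y^T (K+s^2 I)^-1 y
            + np.log(np.diag(L)).sum()               # 0.5 log|K + s^2 I|
            + 0.5 * n * np.log(2 * np.pi))
```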

- Expert knowledge (awesome to have -- difficult to get)
- Bayesian model selection (more possibly analytically intractable integrals!!)
- Cross-validation (time consuming -- but simple to implement)

- GPs are an elegant and powerful ML method
- We get a measure of (un)certainty for the predictions for free.
- GPs work very well for regression problems with small training data set sizes.
- Running time $O(n^3) \leftarrow $ matrix inversion (gets slow when $n\gg 0$) $\Rightarrow$ use sparse GPs for large $n$.
- GPs are a little bit more involved for classification (non-Gaussian likelihood).
- We can model non-Gaussian likelihoods in regression and do approximate inference, e.g. for count data (Poisson likelihood).
- GP implementations: GPyTorch, GPy, pyGPs, and scikit-learn (Python), and GPML (MATLAB)

A nice application of GP regression is Bayesian global optimization. Here, the goal is to optimize the hyper-parameters of a machine learning algorithm to do well on a fixed validation data set. Imagine you have $d$ hyper-parameters to tune; then your data set consists of $d$-dimensional vectors $\mathbf{x}_i\in\mathbb{R}^d$, where each training point represents a particular hyper-parameter setting, and the label $y_i\in\mathbb{R}$ represents the validation error.

Initially you train your classifier under a few random hyper-parameter settings and evaluate the classifier on the validation set. This gives you $\mathbf{x}_1,\dots,\mathbf{x}_m$ with labels $y_1,\dots,y_m$. You can now train a Gaussian Process to predict the validation error $y_t$ at any new hyper-parameter setting $\mathbf{x}_t$. In fact, you obtain a mean prediction $y_t=h(\mathbf{x}_t)$ and its variance $v(\mathbf{x}_t)$. If you have more time you can now explore a new point. The most promising point is the one with the lowest lower confidence bound, i.e. \begin{equation} \textrm{argmin}_{\mathbf{x}_t} h(\mathbf{x}_t)-\kappa\sqrt{v(\mathbf{x}_t)}. \end{equation} The constant $\kappa>0$ trades off how much you want to explore regions of high uncertainty versus exploit settings with a low predicted mean error (exploration vs. exploitation).

For $i$ = 1 to $m$:

- Sample $\mathbf{x}_i$ randomly, e.g. uniformly within a reasonable range.
- Compute the validation error $y_i=\mathcal{A}(\mathbf{x}_i)$.

For $i$ = $m+1$ to $n$:

- Update the kernel matrix $K$ based on $\mathbf{x}_1,\dots,\mathbf{x}_{i-1}$.
- Select the point with the lowest lower confidence bound: $\mathbf{x}_i=\textrm{argmin}_{\mathbf{x}_t}\, K_t^\top(K+\sigma^2 I)^{-1}y-\kappa\sqrt{K_{tt}+\sigma^2-K_t^\top (K+\sigma^2 I)^{-1}K_t}$.
- Compute the validation error $y_i=\mathcal{A}(\mathbf{x}_i)$.

$i_{\text{best}}=\textrm{argmin}_{i} \{y_1,\dots,y_n\}$: find the best hyper-parameter setting explored.

Return $\mathbf{x}_{i_{\text{best}}}$, the best hyper-parameter setting explored.
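The loop above can be sketched end-to-end in NumPy for a one-dimensional hyper-parameter (the candidate grid, helper names, and constants are illustrative assumptions, not a prescribed implementation):

```python
import numpy as np

def rbf(X1, X2, tau=1.0, sigma=1.0):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return tau * np.exp(-sq / sigma ** 2)

def lcb_bayes_opt(A, bounds, m=5, n=20, kappa=2.0, noise=1e-4, seed=0):
    """Minimize a black-box function A on a 1-D interval via the GP
    lower confidence bound (illustrative sketch, not tuned)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, (m, 1))             # random initial design
    y = np.array([A(x[0]) for x in X])
    cand = np.linspace(lo, hi, 200)[:, None]    # candidate grid for argmin
    for _ in range(m, n):
        K = rbf(X, X) + noise * np.eye(len(X))
        Ks = rbf(X, cand)
        mean = Ks.T @ np.linalg.solve(K, y)     # K_t^T (K + s^2 I)^-1 y
        var = 1.0 + noise - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
        lcb = mean - kappa * np.sqrt(np.maximum(var, 0.0))
        x_next = cand[np.argmin(lcb)]           # lowest LCB wins
        X = np.vstack([X, x_next])
        y = np.append(y, A(x_next[0]))
    return X[np.argmin(y)]                      # best setting explored
```

Here the scalar prior variance $K_{tt}+\sigma^2$ equals $\tau+\sigma^2$ with $\tau=1$, matching the formula in the loop above.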

The figure shows a Gaussian process trained on four training points (black crosses) and evaluated on a dense grid within the $[-5,5]$ interval. The red line shows the predicted mean value at each test point. The shaded gray region depicts the uncertainty of the prediction (two standard deviations from the mean). Test points that are closer to training points (i.e. have more similar features) obtain more certain predictions.

Bayesian optimization typically needs far fewer function evaluations than grid search, because the GP model directs the search toward promising hyper-parameter settings.