A Gaussian process is a stochastic process that assumes that the outputs for any set of input points follow a multivariate normal distribution. To determine this normal distribution, we must select a mean function, which gives the mean at each point, and a covariance function, which gives the covariance between any pair of points.
Thus if we have a mean function μ, a covariance function Σ, and an n × d matrix X with the input vectors $\mathbf{x_1}, \ldots, \mathbf{x_n}$ in its rows, then the distribution of the outputs at these points, $\mathbf{y} = [y_1, \ldots, y_n]^T$, is given by:
y ∼ N(μ(X), Σ(X))
Or in full element notation:
$$ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \sim N \left( \begin{bmatrix} \mu(\mathbf{x_1}) \\ \vdots \\ \mu(\mathbf{x_n}) \end{bmatrix}, \begin{bmatrix} \Sigma(\mathbf{x_1},\mathbf{x_1}) &\cdots &\Sigma(\mathbf{x_1},\mathbf{x_n}) \\ \vdots &\ddots &\vdots \\ \Sigma(\mathbf{x_n},\mathbf{x_1}) &\cdots &\Sigma(\mathbf{x_n},\mathbf{x_n}) \end{bmatrix} \right)$$
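To make the definition concrete, here is a minimal sketch (assuming NumPy) that draws one realization of y at a fixed set of input points. It assumes a constant zero mean and the Gaussian correlation function discussed later in this section, with θ = 1 in every dimension; `draw_gp_realization` is an illustrative name, not a library function.

```python
import numpy as np

# A sketch of drawing one realization of y ~ N(mu(X), Sigma(X)).
# Assumptions: constant mean mu(x) = 0 and the Gaussian correlation
# Sigma(u, v) = exp(-sum((u - v)^2)), i.e. theta = 1 in every dimension.
def draw_gp_realization(X, rng):
    n = X.shape[0]
    mu = np.zeros(n)                                   # mu(X)
    # Sigma(X): n x n matrix of pairwise covariances
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    Sigma = np.exp(-sq_dists)
    Sigma += 1e-10 * np.eye(n)                         # jitter for numerical stability
    return rng.multivariate_normal(mu, Sigma)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)            # n = 5 points, d = 1
y = draw_gp_realization(X, rng)                        # one draw of [y_1, ..., y_n]
```

Nearby rows of X get covariances close to 1, so the sampled y values vary smoothly across the input space.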
The mean function can be any function mapping the input space to the real numbers. The most commonly used mean function is a constant, so μ(x) = μ. This means that, given no other information, the predicted mean is the same constant over the entire space. When fitting a GP model to data, μ is usually estimated from the data. Another common choice is the zero mean function. This works surprisingly well because the GP interpolates between your data points, so the mean function has little effect once there is enough data. A note on notation: for a vector u, μ(u) is sometimes written as $\mu_{\mathbf{u}}$ for simplicity. For a matrix X, $\mu(X) = \mu_X$ is the vector obtained by applying μ to each row of X.
A more advanced choice of mean function is a linear model, so $$\mu(\mathbf{x}) = \beta_0 + \sum_{i=1}^d \beta_i x_i$$. Again, these parameters must be estimated from the data. This can be generalized to a linear combination of functions of the input data, f1, …, fm, so the mean function is $$\mu(\mathbf{x}) = \beta_0 + \sum_{i=1}^m \beta_i f_i(\mathbf{x})$$.
It is generally recommended to use a constant mean, since the data itself should provide enough information to fit the true function. Some researchers report that using a linear model mean can actually hurt the quality of the fitted model.
The covariance function determines how strongly points are correlated. A note on notation: the covariance/correlation functions are heavily overloaded, meaning their interpretation depends on the arguments. For vectors u and v, Σ(u, v) is the covariance between the two points, sometimes written $\Sigma_{\mathbf{u},\mathbf{v}}$ or $\Sigma_{\mathbf{u}\mathbf{v}}$. Σ(u), or $\Sigma_{\mathbf{u}}$, is the covariance of u with itself, i.e. Σ(u, u). For a matrix X, Σ(X, u), or $\Sigma_{X\mathbf{u}}$, is a column vector whose elements are the covariances of the rows of X with u. For another matrix W, Σ(X, W) is the matrix whose (i, j) element is the covariance of the i-th row of X with the j-th row of W. Σ(X), or $\Sigma_X$, means the same thing as Σ(X, X).
Often a correlation function, R, is used instead of a covariance function. The correlation function should map any pair of points to [0, 1], and the correlation of any point with itself should be 1, i.e. R(x, x) = 1. When a correlation function is used, a variance parameter $\sigma^2$ must be estimated to scale the correlation matrix into a covariance matrix, so $\Sigma(X) = \sigma^2 R(X)$. R is overloaded in the same way as Σ.
The most commonly used correlation function is the Gaussian.
$$ R(\mathbf{u}, \mathbf{v}) = \exp \left( -\sum_{i=1}^d \theta_i (u_i - v_i)^2 \right)$$
The parameters $\mathbf{\theta} = (\theta_1, \ldots, \theta_d)$ are the correlation parameters, one for each dimension. Generally, they must be estimated from the data when fitting a Gaussian process model.
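As a quick sketch (NumPy assumed), the Gaussian correlation function can be written directly from the formula above; `gaussian_corr` and the example values are illustrative, not from any particular library:

```python
import numpy as np

# The Gaussian correlation function R(u, v) = exp(-sum_i theta_i (u_i - v_i)^2).
def gaussian_corr(u, v, theta):
    return float(np.exp(-np.sum(theta * (u - v) ** 2)))

theta = np.array([2.0, 0.5])        # one correlation parameter per dimension
u = np.array([0.1, 0.4])
v = np.array([0.3, 0.0])

r_uu = gaussian_corr(u, u, theta)   # 1.0: a point is perfectly correlated with itself
r_uv = gaussian_corr(u, v, theta)   # strictly between 0 and 1, shrinking with distance
```

Larger values of a θ_i make the correlation decay faster along dimension i, so θ controls how quickly the fitted function is allowed to vary in each direction.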
The parameters are often estimated by finding the parameters that maximize the likelihood given a data set.
The likelihood function is the usual multivariate normal pdf shown below, where μ = μ(X) and Σ = Σ(X):
$$ f(\mathbf{\theta};~X,~\mathbf{y}) = f(X,~\mathbf{y};~\mathbf{\theta}) = \frac{1}{(2 \pi)^{n/2} |\Sigma|^{1/2} } \exp{(-\frac{1}{2} (\mathbf{y} - \mathbf{\mu})^T \Sigma^{-1} (\mathbf{y} - \mathbf{\mu}))}$$
As usual, we use negative two times the log-likelihood for simplicity, ignoring the constant terms.
$$ \ell(\mathbf{\theta}) = \ln |\Sigma| + (\mathbf{y} - \mathbf{\mu})^T \Sigma^{-1} (\mathbf{y} - \mathbf{\mu})$$
This function is minimized with respect to the correlation parameters to find the parameters that give the greatest likelihood. Since each evaluation requires a determinant and a matrix solve, this is an expensive function to optimize: each evaluation is $O(n^3)$.
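Here is a sketch of a single objective evaluation (NumPy assumed; `neg2_loglik` is an illustrative name). It builds Σ from the Gaussian correlation with a given θ and uses `slogdet` and `solve` rather than forming $\Sigma^{-1}$ explicitly:

```python
import numpy as np

# One evaluation of ell(theta) = ln|Sigma| + (y - mu)^T Sigma^{-1} (y - mu).
# The determinant and the linear solve each cost O(n^3).
# For simplicity this sketch uses the correlation matrix directly as Sigma
# (i.e. it takes sigma^2 = 1).
def neg2_loglik(theta, X, y, mu):
    sq = (X[:, None, :] - X[None, :, :]) ** 2
    Sigma = np.exp(-(sq * theta).sum(axis=2))          # Gaussian correlation matrix
    sign, logdet = np.linalg.slogdet(Sigma)
    resid = y - mu
    return logdet + resid @ np.linalg.solve(Sigma, resid)

X = np.array([[0.0], [0.4], [1.0]])
y = np.array([0.2, -0.1, 0.5])
mu = np.zeros(3)
val = neg2_loglik(np.array([1.0]), X, y, mu)           # value to minimize over theta
```

In practice this function would be handed to a numerical optimizer that searches over θ.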
If the mean is set to be constant, $\mu(X) = \mu \mathbf{1_n}$, then there is a single parameter μ to estimate. Differentiating ℓ with respect to μ gives $$ \frac{d \ell}{d \mu} = -2\, \mathbf{1_n}^T \Sigma^{-1}(\mathbf{y} - \mu \mathbf{1_n})$$ Setting this equal to zero and solving for μ gives the maximum likelihood estimate $$ \hat{\mu} = \frac{\mathbf{1_n}^T \Sigma^{-1}\mathbf{y}}{\mathbf{1_n}^T \Sigma^{-1}\mathbf{1_n}}$$
When using a correlation matrix, so that $\Sigma = \sigma^2 R$, the variance $\sigma^2$ must also be estimated by maximum likelihood.
$$ \frac{d \ell}{d \sigma^2} = \frac{n}{\sigma^2} - \frac{1}{\sigma^4}(\mathbf{y} - \mathbf{\mu})^T R^{-1} (\mathbf{y} - \mathbf{\mu})$$ Setting this equal to zero and solving for $\sigma^2$ gives $$ \hat{\sigma}^2 = \frac{1}{n} (\mathbf{y} - \mathbf{\mu})^T R^{-1} (\mathbf{y} - \mathbf{\mu})$$ When estimating μ and $\sigma^2$ simultaneously, these formulas remain valid: $\hat{\mu}$ does not depend on $\sigma^2$ (the $\sigma^2$ factors cancel in its formula), so $\hat{\mu}$ can simply be plugged in for μ when computing $\hat{\sigma}^2$.
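The two closed-form estimates can be sketched as follows (NumPy assumed; `mle_mu_sigma2` is an illustrative name), computing $\hat{\mu}$ first and plugging it into $\hat{\sigma}^2$:

```python
import numpy as np

# Closed-form maximum likelihood estimates for a constant mean with
# correlation matrix R:
#   mu_hat     = (1^T R^{-1} y) / (1^T R^{-1} 1)
#   sigma2_hat = (1/n) (y - mu_hat 1)^T R^{-1} (y - mu_hat 1)
def mle_mu_sigma2(R, y):
    n = len(y)
    ones = np.ones(n)
    Rinv_y = np.linalg.solve(R, y)
    Rinv_ones = np.linalg.solve(R, ones)
    mu_hat = (ones @ Rinv_y) / (ones @ Rinv_ones)
    resid = y - mu_hat * ones
    sigma2_hat = (resid @ np.linalg.solve(R, resid)) / n
    return mu_hat, sigma2_hat

R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.6],
              [0.3, 0.6, 1.0]])
y = np.array([1.2, 0.8, 1.5])
mu_hat, sigma2_hat = mle_mu_sigma2(R, y)
```

A handy sanity check: with R equal to the identity, $\hat{\mu}$ reduces to the sample mean and $\hat{\sigma}^2$ to the (biased) sample variance.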
Suppose there are vectors $\mathbf{y_1}$ and $\mathbf{y_2}$ that are jointly multivariate normal. The joint distribution is $$ \begin{bmatrix} \mathbf{y_1} \\ \mathbf{y_2} \end{bmatrix} \sim N \left( \begin{bmatrix} \mathbf{\mu_1} \\ \mathbf{\mu_2} \end{bmatrix}, ~\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right) $$
The conditional distribution of y1 given y2 is
$$ \mathbf{y_1} \mid \mathbf{y_2} \sim N \left( \mathbf{\mu_1} + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{y_2} - \mathbf{\mu_2}),~ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)$$
Suppose there are two input matrices, X1 and X2, whose rows are the input points, with corresponding output vectors y1 and y2. Suppose we have the actual values for y2, and want to estimate, or predict, y1. We can use the conditional distribution above to get a posterior distribution for y1.
If we only want to predict for a single point, i.e. we want to predict the output y at a point x, then the equation above gives

$$ y \mid \mathbf{y_2} \sim N(\hat{y},~ \hat{\sigma}^2(y))$$

where

$$ \hat{y} = \hat{\mu} + R(\mathbf{x}, X_2) R(X_2)^{-1} (\mathbf{y_2} - \hat{\mu} \mathbf{1_n})$$

and

$$ \hat{\sigma}^2(y) = \hat{\sigma}^2 \left( R(\mathbf{x}) - R(\mathbf{x}, X_2) R(X_2)^{-1} R(X_2, \mathbf{x}) \right)$$
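Putting the pieces together, here is a sketch (NumPy assumed; `gp_predict` is an illustrative name) that predicts at one point x from data (X₂, y₂), using the Gaussian correlation with a fixed θ and the plug-in estimates $\hat{\mu}$ and $\hat{\sigma}^2$ derived above:

```python
import numpy as np

# Predict y at a single point x given observed data (X2, y2), using the
# Gaussian correlation with fixed theta and the plug-in estimates of
# mu and sigma^2. Returns the predictive mean and variance.
def gp_predict(x, X2, y2, theta):
    def corr(a, b):
        return np.exp(-np.sum(theta * (a - b) ** 2))
    n = len(y2)
    R = np.array([[corr(X2[i], X2[j]) for j in range(n)] for i in range(n)])
    r = np.array([corr(x, X2[i]) for i in range(n)])   # R(x, X2)
    ones = np.ones(n)
    Rinv = np.linalg.inv(R)                            # fine for small n
    mu_hat = (ones @ Rinv @ y2) / (ones @ Rinv @ ones)
    resid = y2 - mu_hat * ones
    sigma2_hat = (resid @ Rinv @ resid) / n
    y_hat = mu_hat + r @ Rinv @ resid
    var_hat = sigma2_hat * (1.0 - r @ Rinv @ r)        # R(x, x) = 1
    return y_hat, var_hat

X2 = np.array([[0.0], [0.5], [1.0]])
y2 = np.array([0.0, 0.7, 1.0])
theta = np.array([1.0])
y_hat, var_hat = gp_predict(np.array([0.5]), X2, y2, theta)
```

Note that predicting at one of the observed inputs reproduces the observed output with zero predictive variance, reflecting the interpolation property mentioned earlier.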
Notice we get an estimate not only for the value of y, but also the standard error. This can be useful when we need a way to judge the prediction accuracy of the model.