An Introduction to Gaussian Process Models
by Thomas Beckers, [email protected]
Within the past two decades, Gaussian process regression has been increasingly used for modeling dynamical systems due to beneficial properties such as the bias-variance trade-off and the strong connection to Bayesian mathematics. As a data-driven method, a Gaussian process is a powerful tool for nonlinear function regression without the need for much prior knowledge. In contrast to most other techniques, Gaussian process modeling provides not only a mean prediction but also a measure of the model fidelity. In this article, we give an introduction to Gaussian processes and their usage in regression tasks for dynamical systems. Try it yourself: gpr.tbeckers.com
Original work: April 2020. Current revision: February 10, 2021. Chair of Information-oriented Control, Technical University of Munich.

A Gaussian process (GP) is a stochastic process that is in general a collection of random variables indexed by time or space. Its special property is that any finite collection of these variables follows a multivariate Gaussian distribution. Thus, the GP is a distribution over infinitely many variables and, therefore, a distribution over functions with a continuous domain. Consequently, it describes a probability distribution over an infinite-dimensional vector space. For engineering applications, the GP has gained increasing attention as a supervised machine learning technique, where it is used as a prior probability distribution over functions in Bayesian inference. The inference of continuous variables leads to Gaussian process regression (GPR), where the prior GP model is updated with training data to obtain a posterior GP distribution. Historically, GPR was used for the prediction of time series, first presented by Wiener and Kolmogorov in the 1940s. Afterwards, it became increasingly popular in geostatistics in the 1970s, where GPR is known as kriging. Recently, it came back in the area of machine learning [Rad96; WR96], especially boosted by the rapidly increasing computational power.

In this article, we present background information about GPs and GPR, mainly based on [Ras06], focusing on the application in control. We start with an introduction of GPs, explain the role of the underlying kernel function and show its relation to reproducing kernel Hilbert spaces. Afterwards, the embedding in dynamical systems and the interpretation of the model uncertainty as error bounds is presented. Several examples are included for an intuitive understanding in addition to the formal notation.
Let $(\Omega_{ss}, \mathcal{F}_\sigma, P)$ be a probability space with the sample space $\Omega_{ss}$, the corresponding $\sigma$-algebra $\mathcal{F}_\sigma$ and the probability measure $P$. The index set is given by $\mathcal{Z} \subseteq \mathbb{R}^{n_z}$ with positive integer $n_z$. Then, a function $f_{GP}(z, \omega_{ss})$, which is a measurable function of $\omega_{ss} \in \Omega_{ss}$ with index $z \in \mathcal{Z}$, is called a stochastic process. The function $f_{GP}(z, \omega_{ss})$ is a random variable on $\Omega_{ss}$ if $z \in \mathcal{Z}$ is specified. It is written in simplified form as $f_{GP}(z)$. A GP is a stochastic process which is fully described by a mean function $m\colon \mathcal{Z} \to \mathbb{R}$ and a covariance function $k\colon \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ such that

$f_{GP}(z) \sim \mathcal{GP}(m(z), k(z, z'))$ (1)

$m(z) = \mathrm{E}[f_{GP}(z)], \qquad k(z, z') = \mathrm{E}[(f_{GP}(z) - m(z))(f_{GP}(z') - m(z'))]$ (2)

with $z, z' \in \mathcal{Z}$. The covariance function is a measure for the correlation of two states $(z, z')$ and is called kernel in combination with GPs. Even though no analytic description of the probability density function of the GP exists in general, the interesting property is that any finite collection of its random variables $\{f_{GP}(z_1), \ldots, f_{GP}(z_{n_{GP}})\}$ follows an $n_{GP}$-dimensional multivariate Gaussian distribution. As a GP defines a distribution over functions, each realization is also a function over the index set $\mathcal{Z}$.

Example 1.
A GP $f_{GP}(t_c) \sim \mathcal{GP}(m(t_c), k(t_c, t_c'))$ with time $t_c \in \mathbb{R}_{\geq 0}$, where

$m(t_c) = 1\,\mathrm{A}, \qquad k(t_c, t_c') = \begin{cases} (0.1\,\mathrm{A})^2 & t_c = t_c' \\ (0\,\mathrm{A})^2 & t_c \neq t_c' \end{cases}$

describes a time-dependent electric current signal with Gaussian white noise with a standard deviation of 0.1 A and a mean of 1 A.

The GP can be utilized as prior probability distribution in Bayesian inference, which allows to perform function regression. Following the Bayesian methodology, new information is combined with existing information: using Bayes' theorem, the prior is combined with new data to obtain a posterior distribution. The new information is expressed as a training data set $\mathcal{D} = \{X, Y\}$. It contains the input values $X = [x_{dat}^{\{1\}}, x_{dat}^{\{2\}}, \ldots, x_{dat}^{\{n_D\}}] \in \mathcal{Z}^{n_D}$ and output values $Y = [\tilde{y}_{dat}^{\{1\}}, \tilde{y}_{dat}^{\{2\}}, \ldots, \tilde{y}_{dat}^{\{n_D\}}]^\top \in \mathbb{R}^{n_D}$, where

$\tilde{y}_{dat}^{\{i\}} = f_{GP}(x_{dat}^{\{i\}}) + \nu$ (3)

for all $i = 1, \ldots, n_D$. The output data might be corrupted by Gaussian noise $\nu \sim \mathcal{N}(0, \sigma_n^2)$.

Remark 1.
Note that we always use the standard notation $X$ for the input training data and $Y$ for the output training data throughout this report.

As any finite subset of a GP follows a multivariate Gaussian distribution, we can write the joint distribution

$\begin{bmatrix} Y \\ f_{GP}(z^*) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_{dat}^{\{1\}}) \\ \vdots \\ m(x_{dat}^{\{n_D\}}) \\ m(z^*) \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_n^2 I_{n_D} & k(z^*, X) \\ k(z^*, X)^\top & k(z^*, z^*) \end{bmatrix} \right)$ (4)

for any arbitrary test point $z^* \in \mathcal{Z}$. The function $m\colon \mathcal{Z} \to \mathbb{R}$ denotes the mean function. The matrix function $K\colon \mathcal{Z}^{n_D} \times \mathcal{Z}^{n_D} \to \mathbb{R}^{n_D \times n_D}$ is called the covariance or Gram matrix with

$K_{j,l}(X, X) = k(X_{:,l}, X_{:,j})$ for all $j, l \in \{1, \ldots, n_D\}$, (5)

where each element of the matrix represents the covariance between two elements of the training data $X$. The expression $X_{:,l}$ denotes the $l$-th column of $X$. For notational simplification, we shorten $K(X, X)$ to $K$ when necessary. The vector-valued kernel function $k\colon \mathcal{Z} \times \mathcal{Z}^{n_D} \to \mathbb{R}^{n_D}$ calculates the covariance between the test input $z^*$ and the input training data $X$, i.e.,

$k(z^*, X) = [k(z^*, X_{:,1}), \ldots, k(z^*, X_{:,n_D})]^\top$. (6)

To obtain the posterior predictive distribution of $f_{GP}(z^*)$, we condition on the test point $z^*$ and the training data set $\mathcal{D}$, given by

$p(f_{GP}(z^*) \mid z^*, \mathcal{D}) = \dfrac{p(f_{GP}(z^*), Y \mid X, z^*)}{p(Y \mid X)}$. (7)

Thus, the conditional posterior Gaussian distribution is defined by the mean and the variance

$\mu(f_{GP}(z^*) \mid z^*, \mathcal{D}) = m(z^*) + k(z^*, X)^\top (K + \sigma_n^2 I_{n_D})^{-1} \left( Y - [m(X_{:,1}), \ldots, m(X_{:,n_D})]^\top \right)$
$\operatorname{var}(f_{GP}(z^*) \mid z^*, \mathcal{D}) = k(z^*, z^*) - k(z^*, X)^\top (K + \sigma_n^2 I_{n_D})^{-1} k(z^*, X)$. (8)

A detailed derivation of the posterior mean and variance based on the joint distribution (4) can be found in Appendix A. Analyzing (8), we can make the following observations:

i) The mean prediction can be written as

$\mu(f_{GP}(z^*) \mid z^*, \mathcal{D}) = m(z^*) + \sum_{j=1}^{n_D} \alpha_j k(z^*, X_{:,j})$ (9)

with $\alpha = (K + \sigma_n^2 I_{n_D})^{-1} \left( Y - [m(X_{:,1}), \ldots, m(X_{:,n_D})]^\top \right) \in \mathbb{R}^{n_D}$. This formulation highlights the data-driven characteristic of the GPR, as the posterior mean is a sum of kernel functions whose number grows with the number $n_D$ of training data.

ii) The variance does not depend on the observed outputs, but only on the inputs, which is a property of the Gaussian distribution. The variance is the difference between two terms: the first term $k(z^*, z^*)$ is simply the prior covariance, from which a (positive) term is subtracted, representing the information the observations contain about the function. The uncertainty of the prediction, expressed in the variance, holds only for $f_{GP}(z^*)$ and does not consider the noise in the training data. For this purpose, an additional noise term $\sigma_n^2$ can be added to the variance in (8). Finally, (8) clearly shows the strong dependence of the posterior mean and variance on the kernel $k$, which we will discuss in depth in Section 3.
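The prediction equations (8) can be turned into a few lines of code. The following is a minimal sketch in Python/NumPy, assuming a zero prior mean and a squared exponential kernel; the function names and the toy data are our choices and only serve as an illustration.

```python
import numpy as np

def se_kernel(z1, z2, sig_f=1.0, ell=1.0):
    """Squared exponential kernel k(z, z') = sig_f^2 * exp(-||z - z'||^2 / (2 ell^2))."""
    d = np.linalg.norm(z1 - z2)
    return sig_f**2 * np.exp(-d**2 / (2 * ell**2))

def gp_predict(z_star, X, Y, sigma_n=0.1, kernel=se_kernel):
    """Posterior mean and variance of f_GP(z_star), cf. (8), with zero prior mean.

    X: (n_z, n_D) array of training inputs (columns are data points)
    Y: (n_D,) array of noisy training outputs
    """
    n_D = X.shape[1]
    K = np.array([[kernel(X[:, i], X[:, j]) for j in range(n_D)] for i in range(n_D)])
    k_star = np.array([kernel(z_star, X[:, i]) for i in range(n_D)])
    # Solve (K + sigma_n^2 I) alpha = Y instead of forming the inverse explicitly
    alpha = np.linalg.solve(K + sigma_n**2 * np.eye(n_D), Y)
    mean = k_star @ alpha
    v = np.linalg.solve(K + sigma_n**2 * np.eye(n_D), k_star)
    var = kernel(z_star, z_star) - k_star @ v
    return mean, var

# toy usage
X = np.array([[0.0, 2.0, 4.0, 6.0]])   # 1-D inputs as columns
Y = np.sin(X[0])                        # toy targets
m, v = gp_predict(np.array([3.0]), X, Y)
print(m, v)
```

In practice, a Cholesky factorization of $K + \sigma_n^2 I_{n_D}$ is typically computed once and reused for both the mean and the variance, instead of the repeated solves shown here.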
Example 2. We assume a GP with zero mean and a squared exponential kernel with signal variance $\varphi_0^2 = 0.3679$ as prior distribution. The training data set $\mathcal{D} = \{X, Y\}$ is given, where the output is corrupted by Gaussian noise with standard deviation $\sigma_n = 0.0498$, and the test point is assumed to be $z^* = 5$. According to (5) to (8), the Gram matrix $K(X, X)$ and the kernel vector $k(z^*, X)$ are calculated, and $k(z^*, z^*) = 0.1378$ is obtained. Finally, with (8), we compute the predicted mean and variance for $f_{GP}(z^*)$,

$\mu(f_{GP}(z^*) \mid z^*, \mathcal{D}) = 0.0278, \qquad \operatorname{var}(f_{GP}(z^*) \mid z^*, \mathcal{D}) = 0.0015$,

which is equivalent to a $2\sigma$ standard deviation of 0.0775. Figure 1 shows the prior distribution (left), the posterior distribution with two training points (black crosses) in the middle, and the posterior distribution given the full training set $\mathcal{D}$ (right). The solid red line is the mean function and the gray shaded area indicates the $2\sigma$ standard deviation. Five realizations (dashed lines) visualize the character of the distribution over functions.

Figure 1: The prior distribution of a GP is updated with data, which leads to the posterior distribution.

So far, the GP regression allows functions with scalar outputs as in (8). For the extension to vector-valued outputs, multiple approaches exist: i) extending the kernel to multivariate outputs [ÁRL12], ii) adding the output dimension as training data [Ber+17], or iii) using a separate GPR for each output [Ras06]. While the first two approaches set a prior on the correlation between the output dimensions, the latter disregards such a correlation without loss of generality. Following approach iii), the previous definition of the training set $\mathcal{D}$ is extended to a vector-valued output with

$X = [x_{dat}^{\{1\}}, x_{dat}^{\{2\}}, \ldots, x_{dat}^{\{n_D\}}] \in \mathcal{Z}^{n_D}, \qquad Y = [\tilde{y}_{dat}^{\{1\}}, \tilde{y}_{dat}^{\{2\}}, \ldots, \tilde{y}_{dat}^{\{n_D\}}]^\top \in \mathbb{R}^{n_D \times n_{y_{dat}}}$, (10)

where $n_{y_{dat}} \in \mathbb{N}$ is the dimension of the output, and the vector-valued GP is defined by

$f_{GP}(z) \sim \begin{bmatrix} \mathcal{GP}(m_1(z), k_1(z, z')) \\ \vdots \\ \mathcal{GP}(m_{n_{y_{dat}}}(z), k_{n_{y_{dat}}}(z, z')) \end{bmatrix}$ (11)

$m(z) := [m_1(z), \ldots, m_{n_{y_{dat}}}(z)]^\top$. (12)

Following (4) to (8), we obtain for the predicted mean and variance

$\mu(f_{GP,i}(z^*) \mid z^*, \mathcal{D}) = m_i(z^*) + k_i(z^*, X)^\top (K_i + \sigma_{n,i}^2 I_{n_D})^{-1} \left( Y_{:,i} - [m_i(X_{:,1}), \ldots, m_i(X_{:,n_D})]^\top \right)$
$\operatorname{var}(f_{GP,i}(z^*) \mid z^*, \mathcal{D}) = k_i(z^*, z^*) - k_i(z^*, X)^\top (K_i + \sigma_{n,i}^2 I_{n_D})^{-1} k_i(z^*, X)$ (13)

for each output dimension $i \in \{1, \ldots, n_{y_{dat}}\}$ with respect to the kernels $k_1, \ldots, k_{n_{y_{dat}}}$. The variable $\sigma_{n,i}$ denotes the standard deviation of the Gaussian noise that corrupts the $i$-th dimension of the output measurements. The $n_{y_{dat}}$ components of $f_{GP} \mid z^*, \mathcal{D}$ are combined into a multi-variable Gaussian distribution with

$\mu(f_{GP} \mid z^*, \mathcal{D}) = [\mu(f_{GP,1} \mid z^*, \mathcal{D}), \ldots, \mu(f_{GP,n_{y_{dat}}} \mid z^*, \mathcal{D})]^\top$
$\Sigma(f_{GP} \mid z^*, \mathcal{D}) = \operatorname{diag}\left( \operatorname{var}(f_{GP,1} \mid z^*, \mathcal{D}), \ldots, \operatorname{var}(f_{GP,n_{y_{dat}}} \mid z^*, \mathcal{D}) \right)$, (14)

where $\Sigma(f_{GP} \mid z^*, \mathcal{D})$ denotes the posterior variance matrix. This formulation allows to use a GP prior on vector-valued functions to perform predictions for test points $z^*$. This approach treats each output dimension separately, which is mostly sufficient and easy to handle.
An alternative approach is to include the dimension as additional input, e.g., as in [Ber+17], with the benefit of a single GP at the price of a loss of interpretability. For highly correlated output data, a multi-output kernel might be beneficial, see [ÁRL12].

Remark 2.
Without specific knowledge about a trend in the data, the prior mean functions $m_1, \ldots, m_{n_{y_{dat}}}$ are often set to zero, see [Ras06]. Therefore, we set the mean functions to zero for the remainder of the report if not stated otherwise.

In Section 2.1, we approach the GPR from a Bayesian perspective. However, for some applications of GPR a different point of view is beneficial, namely the kernel perspective. In the following, we derive GPR from linear regression that is extended with a kernel transformation. In general, the prediction of parametric models is based on a parameter vector $w$ which is typically learned using a set of training data points. In contrast, non-parametric models typically maintain at least a subset of the training data points in memory in order to make predictions for new data points. Many linear models can be transformed into a dual representation where the prediction is based on a linear combination of kernel functions. The idea is to transform the data points of a model to an often high-dimensional feature space where a linear regression can be applied to predict the model output, as depicted in Fig. 2. For a nonlinear feature map $\phi\colon \mathcal{Z} \to \mathcal{F}$, where $\mathcal{F}$ is an $n_\phi \in \mathbb{N} \cup \{\infty\}$ dimensional Hilbert space, the kernel function is given by the inner product $k(z, z') = \langle \phi(z), \phi(z') \rangle$ for all $z, z' \in \mathcal{Z}$. Thus, the kernel implicitly encodes the way the data points are transformed into a higher-dimensional space. The formulation as an inner product in a feature space allows to extend many standard regression methods. Also GPR can be derived using a standard linear regression model

$f_{lin}(z) = z^\top w, \qquad \tilde{y}_{dat}^{\{i\}} = f_{lin}(x_{dat}^{\{i\}}) + \nu$, (15)

where $z \in \mathcal{Z}$ is the input vector, $w \in \mathbb{R}^{n_z}$ the vector of weights with $n_z = \dim(\mathcal{Z})$ and $f_{lin}\colon \mathcal{Z} \to \mathbb{R}$ the unknown function. The observed value $\tilde{y}_{dat}^{\{i\}} \in \mathbb{R}$ for the input $x_{dat}^{\{i\}} \in \mathcal{Z}$ is corrupted by Gaussian noise $\nu \sim \mathcal{N}(0, \sigma_n^2)$ for all $i = 1, \ldots, n_D$.

Figure 2: The mapping $\phi$ transforms the data points into a feature space where linear regressors can be applied to predict the output.

The analysis of this model is analogous to standard linear regression, i.e., we put a prior on the weights such that $w \sim \mathcal{N}(0, \Sigma_p)$ with $\Sigma_p \in \mathbb{R}^{n_z \times n_z}$. Based on $n_D$ collected training data points as defined in Section 2.1, this leads to the well-known linear Bayesian regression

$p(f_{lin}(z^*) \mid z^*, \mathcal{D}) = \mathcal{N}\big( \underbrace{\sigma_n^{-2} z^{*\top} A_{lin}^{-1} X Y}_{\mu(f_{lin}(z^*) \mid z^*, \mathcal{D})}, \; \underbrace{z^{*\top} A_{lin}^{-1} z^*}_{\operatorname{var}(f_{lin}(z^*) \mid z^*, \mathcal{D})} \big)$ (16)

where $A_{lin} = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}$. Now, using the feature map $\phi(z)$ instead of $z$ directly leads to $f_{GP}(z) = \phi(z)^\top \check{w}$ with $\check{w} \sim \mathcal{N}(0, \check{\Sigma}_p)$, $\check{\Sigma}_p \in \mathbb{R}^{n_\phi \times n_\phi}$. As long as the projections are fixed functions, i.e., independent of the parameters $w$, the model is still linear in the parameters and, thus, analytically tractable. In particular, the Bayesian regression (16) with the mapping $\phi(z)$ can be written as

$(f_{GP}(z^*) \mid z^*, \mathcal{D}) \sim \mathcal{N}\big( \sigma_n^{-2} \phi(z^*)^\top A_{GP}^{-1} [\phi(X_{:,1}), \ldots, \phi(X_{:,n_D})] Y, \; \phi(z^*)^\top A_{GP}^{-1} \phi(z^*) \big)$, (17)

with the matrix $A_{GP} \in \mathbb{R}^{n_\phi \times n_\phi}$ given by

$A_{GP} = \sigma_n^{-2} [\phi(X_{:,1}), \ldots, \phi(X_{:,n_D})] [\phi(X_{:,1}), \ldots, \phi(X_{:,n_D})]^\top + \check{\Sigma}_p^{-1}$. (18)

This equation can be simplified and rewritten as

$(f_{GP}(z^*) \mid z^*, \mathcal{D}) \sim \mathcal{N}\big( \underbrace{k(z^*, X)^\top (K + \sigma_n^2 I_{n_D})^{-1} Y}_{\mu(f_{GP}(z^*) \mid z^*, \mathcal{D})}, \; \underbrace{k(z^*, z^*) - k(z^*, X)^\top (K + \sigma_n^2 I_{n_D})^{-1} k(z^*, X)}_{\operatorname{var}(f_{GP}(z^*) \mid z^*, \mathcal{D})} \big)$, (19)

with $k(z, z') = \phi(z)^\top \check{\Sigma}_p \phi(z')$, which equals (8) with zero prior mean.
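The equivalence between the weight-space form (17) and the kernel form (19) can be checked numerically. The sketch below does this for the degree-2 polynomial kernel and its explicit feature map; the prior $\check{\Sigma}_p = I$, the random data, and all names are our assumptions for the illustration.

```python
import numpy as np

# Degree-2 polynomial feature map phi(z) = [z1^2, sqrt(2) z1 z2, z2^2] and its kernel
phi = lambda z: np.array([z[0]**2, np.sqrt(2) * z[0] * z[1], z[1]**2])
k = lambda a, b: float(a @ b) ** 2          # k(z, z') = (z^T z')^2 = phi(z)^T phi(z')

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 8))                 # columns are training inputs
Y = rng.normal(size=8)
sigma_n, n_phi = 0.1, 3

# Weight-space view (17)-(18) with prior w ~ N(0, I)
Phi = np.column_stack([phi(X[:, i]) for i in range(8)])   # n_phi x n_D
A = Phi @ Phi.T / sigma_n**2 + np.eye(n_phi)
z_star = np.array([0.3, -0.7])
mean_w = phi(z_star) @ np.linalg.solve(A, Phi @ Y) / sigma_n**2
var_w = phi(z_star) @ np.linalg.solve(A, phi(z_star))

# Function-space view (19): only kernel evaluations are needed
K = np.array([[k(X[:, i], X[:, j]) for j in range(8)] for i in range(8)])
ks = np.array([k(z_star, X[:, i]) for i in range(8)])
mean_f = ks @ np.linalg.solve(K + sigma_n**2 * np.eye(8), Y)
var_f = k(z_star, z_star) - ks @ np.linalg.solve(K + sigma_n**2 * np.eye(8), ks)

print(np.isclose(mean_w, mean_f), np.isclose(var_w, var_f))   # True True
```

The kernel form only involves $n_D \times n_D$ linear algebra and never touches the feature space explicitly, which is what makes very high- or infinite-dimensional feature maps tractable.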
The fact that in (19) the feature map $\phi(z)$ is not needed explicitly is known as the kernel trick. This trick is also used in other kernel-based models, e.g., support vector machines (SVMs), see [SC08] for more details.

Even though a kernel neither uniquely defines the feature map nor the feature space, one can always construct a canonical feature space, namely the reproducing kernel Hilbert space (RKHS), given a certain kernel. After the introduction of the theory, illustrative examples for an intuitive understanding are presented. We will now formally present this construction procedure, starting with the concept of Hilbert spaces, following [BLG16]: A Hilbert space $\mathcal{F}$ represents all possible realizations of some class of functions, for example all functions of continuity degree $i$, denoted by $C^i$. Moreover, a Hilbert space is a vector space such that any function $f_F \in \mathcal{F}$ has a non-negative norm $\|f_F\|_{\mathcal{F}} \geq 0$, with $\|f_F\|_{\mathcal{F}} = 0$ only for $f_F = 0$. All functions $f_F$ must additionally be equipped with an inner product in $\mathcal{F}$. Simply speaking, a Hilbert space is an infinite-dimensional vector space where many operations behave like in the finite-dimensional case. The properties of Hilbert spaces have been explored in great detail in the literature, e.g., in [DM+05]. An extremely useful property is that such spaces are equivalent to an associated kernel function [Aro50]. This equivalence allows to simply define a kernel instead of fully defining the associated vector space. Formally speaking, if a Hilbert space $\mathcal{H}$ is a RKHS, it has a unique positive definite kernel $k\colon \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$, which spans the space $\mathcal{H}$.

Theorem 1 (Moore-Aronszajn [Aro50]). Every positive definite kernel $k$ is associated with a unique RKHS $\mathcal{H}$.

Theorem 2 ([Aro50]). Let $\mathcal{F}$ be a Hilbert space, $\mathcal{Z}$ a non-empty set and $\phi\colon \mathcal{Z} \to \mathcal{F}$. Then, the inner product $\langle \phi(z), \phi(z') \rangle_{\mathcal{F}} := k(z, z')$ is positive definite.

Importantly, any function $f_H$ in $\mathcal{H}$ can be represented as a weighted linear sum of this kernel evaluated over the space $\mathcal{H}$, as

$f_H(\cdot) = \langle f_H(\cdot), k(x, \cdot) \rangle_{\mathcal{H}} = \sum_{i=1}^{n_\phi} \alpha_i k(x_{dat}^{\{i\}}, \cdot)$, (20)

with $\alpha_i \in \mathbb{R}$ for all $i \in \{1, \ldots, n_\phi\}$, where $n_\phi \in \mathbb{N} \cup \{\infty\}$ is the dimension of the feature space $\mathcal{F}$. Thus, the RKHS is equipped with the inner product

$\langle f_H, f \rangle_{\mathcal{H}} = \sum_{i=1}^{n_\phi} \sum_{j=1}^{n_\phi} \alpha_i \alpha_j k(x_{dat}^{\{i\}}, x_{dat}^{\{j\}})$, (21)

with $f(\cdot) = \sum_{j=1}^{n_\phi} \alpha_j k(x_{dat}^{\{j\}}, \cdot) \in \mathcal{H}$, $\alpha_j \in \mathbb{R}$. Now, the reproducing property manifests as: for all $z \in \mathcal{Z}$ and all $f_H \in \mathcal{H}$, $\langle f_H, k(z, \cdot) \rangle_{\mathcal{H}} = f_H(z)$, in particular

$k(z, z') = \langle k(\cdot, z), k(\cdot, z') \rangle_{\mathcal{H}}$. (22)

According to [SHS06], the RKHS is then defined as

$\mathcal{H} = \{ f_H\colon \mathcal{Z} \to \mathbb{R} \mid \exists c \in \mathcal{F}, \; f_H(z) = \langle c, \phi(z) \rangle_{\mathcal{F}}, \; \forall z \in \mathcal{Z} \}$, (23)

where $\phi(z)$ is the feature map constructing the kernel through $k(z, z') = \langle \phi(z), \phi(z') \rangle_{\mathcal{F}}$.

Example 3.
We want to find the RKHS for the polynomial kernel with degree 2, which is given by

$k(z, z') = (z^\top z')^2 = (z_1 z_1')^2 + 2 z_1 z_1' z_2 z_2' + (z_2 z_2')^2$

for any $z, z' \in \mathbb{R}^2$. First, we have to find a feature map $\phi$ such that the kernel corresponds to the inner product $k(z, z') = \langle \phi(z), \phi(z') \rangle$. A possible candidate for the feature map is $\phi(z) = [z_1^2, \sqrt{2} z_1 z_2, z_2^2]^\top$, because

$\langle \phi(z), \phi(z') \rangle_{\mathbb{R}^3} = \phi(z)^\top \phi(z') = (z_1 z_1')^2 + 2 z_1 z_1' z_2 z_2' + (z_2 z_2')^2 = k(z, z')$.

We know that the RKHS contains all linear combinations of the form

$f_H(z) = \sum_i \alpha_i k(x_{dat}^{\{i\}}, z) = \sum_i \alpha_i \langle \phi(x_{dat}^{\{i\}}), \phi(z) \rangle_{\mathbb{R}^3} = \langle c, \phi(z) \rangle_{\mathbb{R}^3} = c_1 z_1^2 + c_2 \sqrt{2} z_1 z_2 + c_3 z_2^2$,

with $\alpha_i \in \mathbb{R}$, $c \in \mathbb{R}^3$, $x_{dat}^{\{i\}} \in \mathbb{R}^2$. Therefore, a possible candidate for the RKHS $\mathcal{H}$ is given by

$\mathcal{H} = \{ f_H\colon \mathbb{R}^2 \to \mathbb{R} \mid f_H(z) = c_1 z_1^2 + c_2 \sqrt{2} z_1 z_2 + c_3 z_2^2, \; c \in \mathbb{R}^3 \}$. (24)

Next, it must be checked whether the proposed Hilbert space is the RKHS related to the polynomial kernel with degree 2. This is achieved in two steps: i) checking whether the space is a Hilbert space and ii) confirming the reproducing property. First, we can easily prove that this is a Hilbert space by rewriting $f_H(z) = z^\top S z$ with a symmetric matrix $S \in \mathbb{R}^{2 \times 2}$ and using the fact that $\mathcal{H}$ is Euclidean and isomorphic to $S$. Second, the condition for an RKHS must be fulfilled, i.e., the reproducing property $f_H(z) = \langle f_H(\cdot), k(\cdot, z) \rangle_{\mathcal{H}}$. Since we can write

$\langle f_H(\cdot), k(\cdot, z) \rangle_{\mathcal{H}} = \langle c^\top \phi(\cdot), k(\cdot, z) \rangle_{\mathcal{H}} = \sum_{i=1}^{3} c_i \phi_i(z) = c^\top \phi(z) = f_H(z)$,

property (22) is fulfilled and, thus, $\mathcal{H}$ is the RKHS for the polynomial kernel with degree 2. Note that, even though the mapping $\phi$ is not unique for the kernel $k$, the relation of $k$ and the RKHS $\mathcal{H}$ is unique.

Given a function $f_H \in \mathcal{H}$ defined by $n_D$ observations, its RKHS norm is defined by

$\|f_H\|_{\mathcal{H}}^2 = \langle f_H, f_H \rangle_{\mathcal{H}} = \sum_{i=1}^{n_D} \sum_{j=1}^{n_D} \alpha_i \alpha_j k(x_{dat}^{\{i\}}, x_{dat}^{\{j\}}) = \alpha^\top K(X, X) \alpha$, (25)

with $\alpha \in \mathbb{R}^{n_D}$ and $K(X, X)$ given by (5). We can also use the feature map such that

$\|f_H\|_{\mathcal{H}} = \inf\{ \|c\|_{\mathcal{F}} : c \in \mathcal{F}, \; f_H(z) = \langle c, \phi(z) \rangle_{\mathcal{F}}, \; \forall z \in \mathcal{Z} \}$. (26)

As there is a unique relation between the RKHS $\mathcal{H}$ and the kernel $k$, the norm $\|f_H\|_{\mathcal{H}}$ can equivalently be written as $\|f_H\|_k$. The norm of a function in the RKHS indicates how fast the function varies over $\mathcal{Z}$ with respect to the geometry defined by the kernel. Formally, it can be written as

$\dfrac{|f_H(z) - f_H(z')|}{d(z, z')} \leq \|f_H\|_{\mathcal{H}}$, (27)

with the distance $d(z, z') = \sqrt{k(z, z) - 2 k(z, z') + k(z', z')}$. A function with finite RKHS norm is also an element of the RKHS. A more detailed discussion about RKHSs and norms is given in [Wah90].

Example 4.
We want to find the RKHS norm of a function $f_H$ that is an element of the RKHS of the polynomial kernel with degree 2, which is given by

$k(z, z') = (z^\top z')^2 = (z_1 z_1')^2 + 2 z_1 z_1' z_2 z_2' + (z_2 z_2')^2$.

Let the function be

$f_H(z) = \sum_{i=1}^{3} \alpha_i k(x_{dat}^{\{i\}}, z)$, with (28)
$\alpha_1 = 1, \quad \alpha_2 = -2, \quad \alpha_3 = 3$ (29)
$x_{dat}^{\{1\}} = [1, 1]^\top, \quad x_{dat}^{\{2\}} = [1, 2]^\top, \quad x_{dat}^{\{3\}} = [2, 1]^\top$. (30)

Hence, function (28) with (29) and (30) corresponds to $f_H(z) = 11 z_1^2 + 6 z_1 z_2 - 4 z_2^2$. Now, we have two possibilities to calculate the RKHS norm. First, the RKHS norm of $f_H$ is calculated using (25) by

$\|f_H\|_{\mathcal{H}}^2 = \alpha^\top K(X, X) \alpha = \begin{bmatrix} 1 & -2 & 3 \end{bmatrix} \begin{bmatrix} 4 & 9 & 9 \\ 9 & 25 & 16 \\ 9 & 16 & 25 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \\ 3 \end{bmatrix} = 155$

with $X = [x_{dat}^{\{1\}}, x_{dat}^{\{2\}}, x_{dat}^{\{3\}}]$. Alternatively, we can use (26), which results in $\|f_H\|_{\mathcal{H}} = \|c\|$, where $c$ is defined through (24). Thus, the norm is computed as

$f_H(z) = 11 z_1^2 + 6 z_1 z_2 - 4 z_2^2 \;\Rightarrow\; c_1 = 11, \; c_2 = \tfrac{6}{\sqrt{2}}, \; c_3 = -4 \;\Rightarrow\; \|f_H\|_{\mathcal{H}}^2 = \|c\|^2 = 155$.
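As a quick numerical cross-check of Example 4, the following snippet (ours, using NumPy) computes the squared RKHS norm both via the Gram matrix (25) and via the explicit feature map route (26).

```python
import numpy as np

# Polynomial kernel of degree 2, k(z, z') = (z^T z')^2
def poly2(z1, z2):
    return float(z1 @ z2) ** 2

# Data points and coefficients from Example 4
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])   # rows are x_dat^{1}, x_dat^{2}, x_dat^{3}
alpha = np.array([1.0, -2.0, 3.0])

# Gram matrix K(X, X) and squared RKHS norm alpha^T K alpha, cf. (25)
K = np.array([[poly2(xi, xj) for xj in X] for xi in X])
print(alpha @ K @ alpha)            # 155.0

# Alternative route via the feature map phi(z) = [z1^2, sqrt(2) z1 z2, z2^2], cf. (26)
c = sum(a * np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])
        for a, x in zip(alpha, X))
print(c, c @ c)                     # [11.  4.2426...  -4.], 155.0
```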
Example 5. In this example, we visualize the meaning of the RKHS norm. Figure 3 shows different quadratic functions with the same RKHS norm (top left and top right), a smaller RKHS norm (bottom left) and a larger RKHS norm (bottom right). An identical norm indicates a similar variation of the functions, whereas a higher norm leads to a more strongly varying function.

Figure 3: Functions with different RKHS norms.

In summary, we investigate the unique relation between the kernel and its RKHS. The reproducing property allows us to write the inner product as a tractable function which implicitly defines a higher- (or even infinite-) dimensional feature space. The RKHS norm of a function is a Lipschitz-like indicator based on the metric defined by the kernel. This view of the RKHS is related to the kernel trick in machine learning. In the next section, the RKHS norm is exploited to determine the error between the prediction of GPR and the actual data-generating function.

One of the most interesting properties of GPR is the uncertainty description encoded in the predicted variance. This uncertainty is beneficial to quantify the error between the actual underlying data-generating process and the GPR. In this section, we assume that there is an unknown function $f_{uk}\colon \mathbb{R}^{n_z} \to \mathbb{R}$ that generates the training data. In detail, the data set $\mathcal{D} = \{X, Y\}$ consists of

$X = [x_{dat}^{\{1\}}, x_{dat}^{\{2\}}, \ldots, x_{dat}^{\{n_D\}}] \in \mathbb{R}^{n_z \times n_D}, \qquad Y = [\tilde{y}_{dat}^{\{1\}}, \tilde{y}_{dat}^{\{2\}}, \ldots, \tilde{y}_{dat}^{\{n_D\}}]^\top \in \mathbb{R}^{n_D}$, (31)

where the data is generated by

$\tilde{y}_{dat}^{\{i\}} = f_{uk}(x_{dat}^{\{i\}}) + \nu, \qquad \nu \sim \mathcal{N}(0, \sigma_n^2)$ (32)

for all $i \in \{1, \ldots, n_D\}$. Without any assumptions on $f_{uk}$, it is obviously not possible to quantify the model error. Loosely speaking, the prior distribution of the GPR with kernel $k$ must be suitable to learn the unknown function. More technically, $f_{uk}$ must be an element of the RKHS spanned by the kernel as described in (23). This leads to the following assumption.

Assumption 1.
The function $f_{uk}$ has a finite RKHS norm with respect to the kernel $k$, i.e., $\|f_{uk}\|_{\mathcal{H}} < \infty$, where $\mathcal{H}$ is the RKHS spanned by $k$.

This sounds paradoxical, as $f_{uk}$ is assumed to be unknown. However, there exist kernels that can approximate any continuous function arbitrarily exactly. Thus, for any continuous function, an arbitrarily close function is an element of the RKHS of a universal kernel. For more details, we refer to Section 2.4. More information about the model error of misspecified Gaussian process models can be found in [BUH18].

We classify the error quantification in three different approaches: i) the robust approach, ii) the scenario approach, and iii) the information-theoretic approach. The different techniques are presented in the following and visualized in Fig. 4. For the remainder of this section, we assume that a GPR is trained with the data set (31) and Assumption 1 holds.

The robust approach exploits the fact that the prediction of the GPR is Gaussian distributed. Thus, for any $z^* \in \mathbb{R}^{n_z}$, the model error is bounded by

$|f_{uk}(z^*) - \mu(f_{GP} \mid z^*, \mathcal{D})| \leq c \sqrt{\operatorname{var}(f_{GP} \mid z^*, \mathcal{D})}$ (33)

with high probability, where $c \in \mathbb{R}_{>0}$ adjusts the probability. However, for multiple test points $z_1^*, z_2^*, \ldots \in \mathbb{R}^{n_z}$, this approach neglects any correlation between $f_{GP}(z_1^*), f_{GP}(z_2^*), \ldots$. Figure 4 shows how, for given $z_1^*$ and $z_2^*$, the variance is exploited as upper bound. Thus, every prediction is handled independently, which leads to a very conservative bound, see [UBH18].

Instead of using the mean and the variance as in the robust approach, the scenario approach deals with samples of the GPR directly. In contrast to the other methods, there is no direct model error quantification but rather a sample-based quantification. The idea is to draw a large number $n_{scen} \in \mathbb{N}$ of sample functions $f_{GP}^1, f_{GP}^2, \ldots, f_{GP}^{n_{scen}}$ over $n_s \in \mathbb{N}$ sampling points. The sampling is performed by drawing multiple instances from $f_{GP}$ given by the multivariate Gaussian distribution

$\begin{bmatrix} Y \\ f_{GP}(z_1^*) \\ \vdots \\ f_{GP}(z_{n_s}^*) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_{dat}^{\{1\}}) \\ \vdots \\ m(x_{dat}^{\{n_D\}}) \\ m(z_1^*) \\ \vdots \\ m(z_{n_s}^*) \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_n^2 I_{n_D} & K(X^*, X)^\top \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$, (34)

where $X^* = [z_1^*, \ldots, z_{n_s}^*]$ contains the sampling points. Each sample can then be used in the application instead of the unknown function. For a large number of samples, it is assumed that the unknown function is close to one of these samples. However, the crux of this approach is to determine, for a given model error $c \in \mathbb{R}_{>0}$, the required number of samples $n_{scen}$ and probability $\delta_{scen}$ such that

$P\big( |f_{uk}(z^*) - f_{GP}^i(z^*)| \leq c, \; i \in \{1, \ldots, n_{scen}\} \big) \geq \delta_{scen}$ (35)

for all $z^* \in \mathcal{Z}$. In Fig. 4, five different samples of a GP model are drawn as an example.
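A minimal sketch of the scenario approach: draw several posterior sample functions jointly over a grid of sampling points, following (34) with zero prior mean and a squared exponential kernel. The kernel choice, the toy data, and all names are our assumptions.

```python
import numpy as np

def se_kernel_mat(A, B, sig_f=1.0, ell=1.0):
    """Squared exponential Gram matrix between the entries of the 1-D input arrays A and B."""
    d = A.reshape(-1, 1) - B.reshape(1, -1)
    return sig_f**2 * np.exp(-d**2 / (2 * ell**2))

def draw_scenarios(X, Y, X_star, sigma_n=0.1, n_scen=5):
    """Draw n_scen sample functions of the posterior GP at the points X_star, cf. (34)."""
    Kxx = se_kernel_mat(X, X) + sigma_n**2 * np.eye(len(X))
    Kxs = se_kernel_mat(X, X_star)          # cross-covariance between training and sampling points
    Kss = se_kernel_mat(X_star, X_star)
    # Condition the joint Gaussian (34) on the observed outputs Y (Appendix A)
    L = np.linalg.solve(Kxx, Kxs)
    mean = L.T @ Y
    cov = Kss - Kxs.T @ L
    rng = np.random.default_rng(0)
    return rng.multivariate_normal(mean, cov + 1e-10 * np.eye(len(X_star)), size=n_scen)

X = np.array([0.0, 1.0, 3.0, 4.0, 6.0])
Y = np.sin(X)
samples = draw_scenarios(X, Y, np.linspace(0.0, 7.0, 100))
print(samples.shape)   # (5, 100)
```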
Alternatively, the work in [Sri+12] derives an upper bound for samples of the GPR on a compact set with a specific probability. In contrast to the robust approach, the correlation between the function values is considered. We restate here the theorem of [Sri+12].

Theorem 3 ([Sri+12]). Given Assumption 1, the model error $\Delta \in \mathbb{R}$,

$\Delta = |\mu(f_{GP} \mid z, \mathcal{D}) - f_{uk}(z)|$, (36)

is bounded for all $z$ on a compact set $\Omega \subset \mathbb{R}^{n_z}$ with a probability of at least $\delta \in (0, 1)$ by

$P\big\{ \forall z \in \Omega, \; \Delta \leq |\beta \, \Sigma^{1/2}(f_{GP} \mid z, \mathcal{D})| \big\} \geq \delta$, (37)

where $\beta \in \mathbb{R}$ is defined as

$\beta = \sqrt{ 2 \|f_{uk}\|_k^2 + 300 \, \gamma_{max} \ln^3\!\left( \dfrac{n_D + 1}{1 - \delta} \right) }$. (38)

The variable $\gamma_{max} \in \mathbb{R}$ is the maximum of the information gain

$\gamma_{max} = \max_{x_{dat}^{\{1\}}, \ldots, x_{dat}^{\{n_D+1\}} \in \Omega} \; \dfrac{1}{2} \log \left| I_{n_D+1} + \sigma_n^{-2} K(z, z') \right|$ (39)

with Gram matrix $K(z, z')$ and the input elements $z, z' \in \{x_{dat}^{\{1\}}, \ldots, x_{dat}^{\{n_D+1\}}\}$.

To compute this bound, the RKHS norm of $f_{uk}$ must be known, which is usually not the case in applications. However, often the norm can be upper bounded and thus, the bound in Theorem 3 can be upper bounded. For this purpose, the relation of the RKHS norm to the Lipschitz constant given by (27) is beneficial, as the Lipschitz constant is more likely to be known. In general, the computation of the information gain is a non-convex optimization problem. However, the information capacity $\gamma_{max}$ has a sub-linear dependency on the number of training points for many commonly used kernel functions [Sri+12]. Therefore, even though $\beta$ increases with the number of training data, it is possible to learn the true function $f_{uk}$ arbitrarily exactly [Ber+16]. In contrast to the other approaches, Theorem 3 allows to bound the error for any test point in a compact set. In [BKH19], we exploit this approach in GP model based control tasks. The right illustration of Fig. 4 visualizes the information-theoretic bound.

Figure 4: Different approaches to quantify the model error: robust approach (left), scenario approach (middle), information-theoretic approach (right).

Equation (8) clearly shows the immense impact of the kernel on the posterior mean and variance. However, this is not surprising, as the kernel is an essential part of the prior model. For practical applications, this leads to the question how to choose the kernel. Additionally, most kernels depend on a set of hyperparameters that must be defined. Thus, in order to turn GPR into a powerful practical tool, it is essential to develop methods that address the model selection problem. We see the model selection as the determination of the kernel and its hyperparameters. We only focus on kernels that are defined on
$\mathcal{Z} \subseteq \mathbb{R}^{n_z}$. In the next two subsections, we present different kernels and explain the role of the hyperparameters and their selection, mainly based on [Ras06].

Remark 3.
The selection of the kernel functions seems to be similar to the model selection for parametric models. However, there are two major differences: i) the selection is fully covered by the Bayesian methodology and ii) many kernels allow to model a wide range of different functions, whereas parametric models are typically limited to very specific types of functions.
The value of the kernel function $k(z, z')$ is an indicator of the interaction of two states $(z, z')$. Thus, an essential part of GPR is the selection of the kernel function and the estimation of its free parameters $\varphi_1, \varphi_2, \ldots, \varphi_{n_\varphi}$, called hyperparameters. The number $n_\varphi$ of hyperparameters depends on the kernel function. The choice of the kernel function and the determination of the corresponding hyperparameters can be seen as degrees of freedom of the regression. First of all, we start with the general properties of a function to be qualified as a kernel for GPR. A necessary and sufficient condition for the function $k\colon \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ to be a valid kernel is that the Gram matrix, see (5), is positive semidefinite for all possible input values [SC04].

Remark 4.
As shown in Section 2.4, the kernel function must be positive definite to span a unique RKHS. That seems to be contradictory to the required positive semi-definiteness of the Gram matrix. The solution is the definition of positive definite kernels, as it is equivalent to a positive semi-definite Gram matrix. In detail, a symmetric function $k\colon \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ is a positive definite kernel on $\mathcal{Z}$ if

$\sum_{j=1}^{n_D} \sum_{i=1}^{n_D} k(x_{dat}^{\{i\}}, x_{dat}^{\{j\}}) c_i c_j \geq 0$

holds for any $n_D \in \mathbb{N}$, $x_{dat}^{\{1\}}, \ldots, x_{dat}^{\{n_D\}} \in \mathcal{Z}$ and $c_1, \ldots, c_{n_D} \in \mathbb{R}$. Thus, there exists a positive semi-definite matrix $A_G \in \mathbb{R}^{n_D \times n_D}$ such that

$c^\top A_G c = \sum_{j=1}^{n_D} \sum_{i=1}^{n_D} k(x_{dat}^{\{i\}}, x_{dat}^{\{j\}}) c_i c_j$ (41)

holds for any $n_D \in \mathbb{N}$, $x_{dat}^{\{1\}}, \ldots, x_{dat}^{\{n_D\}} \in \mathcal{Z}$ and $c \in \mathbb{R}^{n_D}$.

The set of functions $k$ which fulfill this condition is denoted by $\mathcal{K}$. Kernel functions can be separated into two classes, stationary and non-stationary kernels. A stationary kernel is a function of the distance $z - z'$. Thus, it is invariant to translations in the input space. In contrast, non-stationary kernels depend directly on $z, z'$ and are often functions of a dot product $z^\top z'$. In the following, we list some common kernel functions with their basic properties. Even though the number of presented kernels is limited, new kernels can be constructed easily, as $\mathcal{K}$ is closed under specific operations such as addition and scalar multiplication. At the end, we summarize the equation of each kernel in Table 1 and provide a comparative example.

The equation for the constant kernel is given by

$k(z, z') = \varphi_0^2$. (42)

This kernel is mostly used in addition to other kernel functions. It depends on a single hyperparameter $\varphi_0 \in \mathbb{R}_{\geq 0}$.

The equation for the linear kernel is given by

$k(z, z') = z^\top z'$. (43)

The linear kernel is a dot-product kernel and thus non-stationary. The kernel can be obtained from Bayesian linear regression as shown in Section 2.3. The linear kernel is often used in combination with the constant kernel to include a bias.

The equation for the polynomial kernel is given by

$k(z, z') = \left( z^\top z' + \varphi_0^2 \right)^p, \quad p \in \mathbb{N}$. (44)

The polynomial kernel has an additional parameter $p \in \mathbb{N}$ that determines the degree of the polynomial. Since a dot product is contained, the kernel is also non-stationary. The prior variance grows rapidly for $\|z\| > 1$; the hyperparameter is $\varphi_0 \in \mathbb{R}_{\geq 0}$.

The equation for the Matérn kernel is given by

$k(z, z') = \varphi_0^2 \exp\left( -\dfrac{\sqrt{2\check{p}} \, \|z - z'\|}{\varphi_1} \right) \dfrac{p!}{(2p)!} \sum_{i=0}^{p} \dfrac{(p+i)!}{i!(p-i)!} \left( \dfrac{2\sqrt{2\check{p}} \, \|z - z'\|}{\varphi_1} \right)^{p-i}$ (45)

with $\check{p} = p + \frac{1}{2}$, $p \in \mathbb{N}$. The Matérn kernel is a very powerful kernel and is presented here with the most common parameterization for $\check{p}$. Functions drawn from a GP model with Matérn kernel are $p$-times differentiable. The more general equation of this stationary kernel can be found in [Bis06]. This kernel is a universal kernel, which is explained in the following.

Lemma 1 ([SC08, Lemma 4.55]). Consider the RKHS $\mathcal{H}(\mathcal{Z}_c)$ of a universal kernel on any prescribed compact subset $\mathcal{Z}_c \subset \mathcal{Z}$. Given any positive number $\varepsilon$ and any function $f_C \in C(\mathcal{Z}_c)$, there is a function $f_H \in \mathcal{H}(\mathcal{Z}_c)$ such that $\|f_C - f_H\|_{\mathcal{Z}_c} \leq \varepsilon$.

Intuitively speaking, a GPR with a universal kernel can approximate any continuous function arbitrarily exactly on a compact set. For $p \to \infty$, the Matérn kernel results in the squared exponential kernel. The two hyperparameters are $\varphi_0 \in \mathbb{R}_{\geq 0}$ and $\varphi_1 \in \mathbb{R}_{>0}$.

The equation for the squared exponential kernel is given by

$k(z, z') = \varphi_0^2 \exp\left( -\dfrac{\|z - z'\|^2}{2\varphi_1^2} \right)$. (46)

Probably the most widely used kernel function for GPR is the squared exponential kernel, see [Ras06].
The hyperparameter $\varphi_0^2$ describes the signal variance, which determines the average distance of the data-generating function from its mean. The lengthscale $\varphi_1$ defines how far one needs to move along a particular axis in input space for the function values to become uncorrelated. Formally, the lengthscale determines the number of expected upcrossings of the level zero in a unit interval by a zero-mean GP. The squared exponential kernel is infinitely differentiable, which means that the GPR exhibits a smooth behavior. As the limit of the Matérn kernel, it is also a universal kernel, see [MXZ06].

Example 6.
Figure 5 shows the regression power of universal kernel functions. In this example, a GPR with squared exponential kernel is used for different training data sets. The hyperparameters are optimized individually for each training data set by means of the likelihood, see Section 3.2. Note that all presented regressions are based on the same
GP model, i.e., the same kernel function, but with different data sets. This highlights again the superior flexibility of GPR.

Figure 5: Examples of the flexibility of the regression; all are based on the same GP model.
The equation for the rational quadratic kernel is given by

$k(z, z') = \varphi_0^2 \left( 1 + \dfrac{\|z - z'\|^2}{2 p \varphi_1^2} \right)^{-p}, \quad p \in \mathbb{N}$. (47)

This kernel is equivalent to summing over infinitely many squared exponential kernels with different lengthscales. Hence, GP priors with this kernel expect to see functions which vary smoothly across many lengthscales. The parameter $p$ determines the relative weighting of large-scale and small-scale variations. For $p \to \infty$, the rational quadratic kernel is identical to the squared exponential kernel.

The equation for the squared exponential ARD kernel is given by

$k(z, z') = \varphi_0^2 \exp\left( -\tfrac{1}{2} (z - z')^\top P^{-1} (z - z') \right), \quad P = \operatorname{diag}(\varphi_1^2, \ldots, \varphi_{n_z}^2)$. (48)

The automatic relevance determination (ARD) extension to the squared exponential kernel allows for independent lengthscales $\varphi_1, \ldots, \varphi_{n_z} \in \mathbb{R}_{>0}$ for each dimension of $z, z' \in \mathbb{R}^{n_z}$. The individual lengthscales are typically larger for dimensions which are irrelevant, as the covariance becomes almost independent of that input. A more detailed discussion about the advantages of different kernels can be found, for instance, in [Mac97] and [Bis06].
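To make the kernels presented in this section (summarized in Table 1 below) concrete, here is a small Python sketch, ours, implementing a few of them and checking that their Gram matrices are positive semidefinite, as required for a valid kernel [SC04]. The function names and hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def k_linear(z1, z2):
    return float(z1 @ z2)

def k_poly(z1, z2, p=2, phi0=1.0):
    return (float(z1 @ z2) + phi0**2) ** p

def k_se(z1, z2, phi0=1.0, phi1=1.0):
    return phi0**2 * np.exp(-np.sum((z1 - z2)**2) / (2 * phi1**2))

def k_se_ard(z1, z2, phi0=1.0, lengthscales=(1.0, 2.0)):
    P_inv = np.diag(1.0 / np.asarray(lengthscales)**2)
    d = z1 - z2
    return phi0**2 * np.exp(-0.5 * d @ P_inv @ d)

def gram(kernel, X):
    """Gram matrix (5) for the columns of X."""
    n = X.shape[1]
    return np.array([[kernel(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])

# A valid kernel must yield a positive semidefinite Gram matrix for any inputs
X = np.random.default_rng(1).normal(size=(2, 6))
for kern in (k_linear, k_poly, k_se, k_se_ard):
    eigvals = np.linalg.eigvalsh(gram(kern, X))
    print(kern.__name__, eigvals.min() >= -1e-9)
```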
Example 7. In this example, we use three GPRs with the same set of training data

$X = [1, 3, 5, 7, 9], \qquad Y = [0, 1, 2, 3, 6]$ (49)

but with different kernels, namely the squared exponential (46), the linear (43), and the polynomial (44) kernel. Figure 6 shows the different shapes of the regressions with the posterior mean (red), the posterior variance (gray shaded) and the training points (black). Even for this simple data set, the flexibility of the squared exponential kernel is already visible.

Constant: $k(z, z') = \varphi_0^2$
Linear: $k(z, z') = z^\top z'$
Polynomial ($p \in \mathbb{N}$): $k(z, z') = \left( z^\top z' + \varphi_0^2 \right)^p$
Matérn ($\check{p} = p + \frac{1}{2}$, $p \in \mathbb{N}$): $k(z, z') = \varphi_0^2 \exp\left( -\frac{\sqrt{2\check{p}} \|z - z'\|}{\varphi_1} \right) \frac{p!}{(2p)!} \sum_{i=0}^{p} \frac{(p+i)!}{i!(p-i)!} \left( \frac{2\sqrt{2\check{p}} \|z - z'\|}{\varphi_1} \right)^{p-i}$
Squared exponential: $k(z, z') = \varphi_0^2 \exp\left( -\frac{\|z - z'\|^2}{2\varphi_1^2} \right)$
Rational quadratic ($p \in \mathbb{N}$): $k(z, z') = \varphi_0^2 \left( 1 + \frac{\|z - z'\|^2}{2 p \varphi_1^2} \right)^{-p}$
Squared exponential ARD: $k(z, z') = \varphi_0^2 \exp\left( -\frac{1}{2} (z - z')^\top P^{-1} (z - z') \right)$, $P = \operatorname{diag}(\varphi_1^2, \ldots, \varphi_{n_z}^2)$

Table 1: Overview of some commonly used kernel functions.

In addition to the selection of a kernel function, values for the hyperparameters must be determined to perform the regression. The number of hyperparameters depends on the kernel function used. We concatenate all hyperparameters in a vector $\varphi$ of size $n_\varphi \in \mathbb{N}$, where $\varphi \in \Phi \subseteq \mathbb{R}^{n_\varphi}$. The hyperparameter set $\Phi$ is introduced to cover the different spaces of the individual hyperparameters, as defined in the following.

Definition 1.
The set $\Phi$ is called a hyperparameter set for a kernel function $k$ if and only if $\Phi$ is a domain for the hyperparameters $\varphi$ of $k$.

Often, the signal noise $\sigma_n$, see (4), is also treated as a hyperparameter. For a better understanding, we keep the signal noise separated from the hyperparameters. There exist several techniques that allow computing the hyperparameters and the signal noise with respect to an optimality criterion. From a Bayesian perspective, we want to find the vector of hyperparameters $\varphi$ which is most likely for the output data $Y$ given the inputs $X$ and a GP model. For this purpose, one approach is to optimize the log marginal likelihood function of the GP. Another idea is to split the training set into two disjoint sets, one which is actually used for training, and the other, the validation set, which is used to monitor performance. This approach is known as cross-validation. In the following, these two techniques for the selection of hyperparameters are presented.

A very common method for the optimization of the hyperparameters is by means of the negative log marginal likelihood function, often simply named the (negative log) likelihood function. It is marginal since it is obtained through marginalization over the function $f_{GP}$. The marginal likelihood is the likelihood that the output data $Y \in \mathbb{R}^{n_D}$ fits to the input data $X$ with the hyperparameters $\varphi$. It is given by
$\log p(Y \mid X, \varphi) = -\dfrac{1}{2} Y^\top (K + \sigma_n^2 I_{n_D})^{-1} Y - \dfrac{1}{2} \log\left| K + \sigma_n^2 I_{n_D} \right| - \dfrac{n_D}{2} \log 2\pi$. (50)

A detailed derivation can be found in [Ras06]. The three terms of the marginal likelihood in (50) have the following roles:
• $Y^\top (K + \sigma_n^2 I_{n_D})^{-1} Y$ is the only term that depends on the output data $Y$ and represents the data fit.
• $\log\left| K + \sigma_n^2 I_{n_D} \right|$ penalizes the complexity depending on the kernel function and the input data $X$.
• $\frac{n_D}{2} \log 2\pi$ is a normalization constant.

Remark 5.
For the sake of notational simplicity, we suppress the dependency of the kernel function $k$ on the hyperparameters whenever possible.

The optimal hyperparameters $\varphi^* \in \Phi$ and signal noise $\sigma_n^*$ in the sense of the likelihood are obtained as the minimum of the negative log marginal likelihood function

$\begin{bmatrix} \varphi^* \\ \sigma_n^* \end{bmatrix} = \operatorname*{arg\,min}_{\varphi \in \Phi, \, \sigma_n \in \mathbb{R}_{\geq 0}} \; -\log p(Y \mid X, \varphi)$. (51)

Since setting the derivative of (50) to zero admits no analytic solution in general, a gradient-based optimization algorithm is typically used to minimize the function. However, the negative log likelihood is non-convex in general, such that there is no guarantee to find the optimum $\varphi^*, \sigma_n^*$. In fact, every local minimum corresponds to a particular interpretation of the data. In the following example, we visualize how the hyperparameters affect the regression.
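A compact sketch of this procedure in Python, assuming a squared exponential kernel, scalar inputs and zero prior mean: it evaluates (50) and minimizes its negative with a standard optimizer. The log-parameterization of the hyperparameters and all names are our choices, not prescribed by the text.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, Y):
    """Negative log marginal likelihood (50) for a squared exponential kernel.

    params = log([phi_0, phi_1, sigma_n]); the log-parameterization keeps them positive.
    """
    phi0, phi1, sigma_n = np.exp(params)
    d = X.reshape(-1, 1) - X.reshape(1, -1)
    K = phi0**2 * np.exp(-d**2 / (2 * phi1**2)) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return (0.5 * Y @ alpha                          # data-fit term
            + np.sum(np.log(np.diag(L)))             # 0.5 * log|K|, complexity term
            + 0.5 * len(X) * np.log(2 * np.pi))      # normalization constant

X = np.array([0.0, 1.0, 2.5, 4.0, 5.0, 7.0, 8.5, 10.0])
Y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=len(X))

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, Y), method="L-BFGS-B")
print("optimized [phi_0, phi_1, sigma_n]:", np.exp(res.x))
```

Because of the non-convexity mentioned above, restarting the optimizer from several initial values is common practice.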
Example 8. A GPR with the squared exponential kernel is trained on eight data points. The signal variance is fixed to $\varphi_0 = 2.13$. First, we visualize the influence of the lengthscale. For this purpose, the signal noise is fixed to $\sigma_n = 0.21$. Figure 7 shows the posterior mean of the regression and the negative log likelihood function. On the left side are three posterior means for different lengthscales. A short lengthscale results in overfitting, whereas a large lengthscale smooths out the training data (black crosses). The dotted red function represents the mean with the lengthscale optimized by a gradient descent algorithm with respect to (51). The right plot shows the negative log likelihood over the signal variance $\varphi_0$ and lengthscale $\varphi_1$. The minimum is here at $\varphi^* = [2.13, 1.58]^\top$.

Figure 7: Left: Posterior means for the lengthscales $\varphi_1 = 0.67$ (cyan, solid), $\varphi_1 = 7.39$ (brown, dashed), and $\varphi_1 = 1.58$ (red, dotted). Right: Negative log likelihood function over signal variance $\varphi_0$ and lengthscale $\varphi_1$.

Next, the meaning of different interpretations of the data is visualized by varying the signal noise $\sigma_n$ and the lengthscale $\varphi_1$. The right plot of Fig. 8 shows two minima of the negative log likelihood function, for instance the lower left minimum at $\log(\sigma_n) = 0.73$; each minimum corresponds to a different interpretation of the data.

Figure 8: Negative log likelihood function over signal noise $\sigma_n$ and lengthscale $\varphi_1$.

The cross-validation approach works with a separation of the data set $\mathcal{D}$ into two classes: one for training and one for validation. Cross-validation is almost always used in the $k_{cv}$-fold cross-validation setting: the data is split into $k_{cv}$ disjoint, equally sized subsets; validation is done on a single subset and training is done using the union of the remaining $k_{cv} - 1$ subsets, the entire procedure being repeated $k_{cv}$ times, each time with a different subset for validation. Here, without loss of generality, we present the leave-one-out cross-validation, which means $k_{cv} = n_D$. The predictive log probability when leaving out a training point $\{x_{dat}^{\{i\}}, \tilde{y}_{dat}^{\{i\}}\}$ is given by
$\log p(\tilde{y}_{dat}^{\{i\}} \mid X, Y_{-i}, \varphi) = -\dfrac{1}{2} \log(\operatorname{var}_{-i}) - \dfrac{\big(\tilde{y}_{dat}^{\{i\}} - \mu_{-i}\big)^2}{2 \operatorname{var}_{-i}} - \dfrac{1}{2} \log 2\pi$, (52)

where $\mu_{-i} = \mu(f_{GP}(x_{dat}^{\{i\}}) \mid x_{dat}^{\{i\}}, X_{:,-i}, Y_{-i})$ and $\operatorname{var}_{-i} = \operatorname{var}(f_{GP}(x_{dat}^{\{i\}}) \mid x_{dat}^{\{i\}}, X_{:,-i}, Y_{-i})$. The index $-i$ indicates $X$ and $Y$ without the element $x_{dat}^{\{i\}}$ and $\tilde{y}_{dat}^{\{i\}}$, respectively. Thus, (52) is the probability for the output $\tilde{y}_{dat}^{\{i\}}$ at $x_{dat}^{\{i\}}$ but without the training point $\{x_{dat}^{\{i\}}, \tilde{y}_{dat}^{\{i\}}\}$. Accordingly, the leave-one-out log predictive probability $L_{LOO} \in \mathbb{R}$ is

$L_{LOO} = \sum_{i=1}^{n_D} \log p(\tilde{y}_{dat}^{\{i\}} \mid X, Y_{-i}, \varphi)$. (53)

In comparison to the log likelihood approach (51), cross-validation is in general more computationally expensive but might find a better representation of the data set, see [GE79] for a discussion and related approaches.
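A short sketch, ours, of how $L_{LOO}$ from (52) and (53) can be evaluated by re-fitting a zero-mean SE-kernel predictor once per left-out point; for a fixed kernel this can also be computed more efficiently in closed form, see [Ras06]. Data and names are illustrative.

```python
import numpy as np

def gp_mean_var(z_star, X, Y, phi0=1.0, phi1=1.0, sigma_n=0.1):
    """Posterior mean and variance (8) for scalar inputs, zero mean, SE kernel."""
    k = lambda a, b: phi0**2 * np.exp(-(a - b)**2 / (2 * phi1**2))
    K = k(X.reshape(-1, 1), X.reshape(1, -1)) + sigma_n**2 * np.eye(len(X))
    ks = k(z_star, X)
    w = np.linalg.solve(K, ks)
    return w @ Y, k(z_star, z_star) - ks @ w

def loo_log_predictive(X, Y, **hyp):
    """Leave-one-out log predictive probability L_LOO, cf. (52) and (53)."""
    L = 0.0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        mu_i, var_i = gp_mean_var(X[i], X[mask], Y[mask], **hyp)
        L += -0.5 * np.log(var_i) - (Y[i] - mu_i)**2 / (2 * var_i) - 0.5 * np.log(2 * np.pi)
    return L

X = np.linspace(0.0, 10.0, 8)
Y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=8)
print(loo_log_predictive(X, Y, phi1=1.5, sigma_n=0.1))
```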
So far, we considered GPR in non-dynamical settings where only an input-to-output mapping is considered. However, Gaussian process dynamical models (GPDMs) have recently become a versatile tool in system identification because of their beneficial properties such as the bias-variance trade-off and the strong connection to Bayesian mathematics, see [FCR14]. In many works where GPs are applied to dynamical models, only the mean function of the process is employed, e.g., in [WHB05] and [Cho+13]. This is mainly because GP models are often used to replace deterministic parametric models. However, GPDMs contain a much richer description of the underlying dynamics, including the uncertainty about the model itself when the full probabilistic representation is considered. One main aspect of GPDMs is to distinguish between recurrent and non-recurrent structures. A model is called recurrent if parts of the regression vector depend on the outputs of the model. Even though recurrent models become more complex in terms of their behavior, they allow to model sequences of data, see [Sjö+95]. If all states are fed back from the model itself, we get a simulation model, which is a special case of the recurrent structure. The advantage of such a model is its property to be independent from the real system. Thus, it is suitable for simulations, as it allows multi-step ahead predictions. In this report, we focus on two often-used recurrent structures: the Gaussian process state space model (GP-SSM) and the Gaussian process nonlinear output error (GP-NOE) model.

Gaussian process state space models are structured as a discrete-time system. In this case, the states are the regressors, which is visualized in Fig. 9. This approach allows to be more efficient, since the regressors are less restricted in their internal structure than in input-output models. Thus, a very efficient model in terms of the number of regressors might be possible. The mapping from the states to the output is often assumed to be known. The situation where the output mapping describes a known sensor model is such an example. It is mentioned in [Fri+13] that using too flexible models for both the state mapping $f$ and the output mapping can result in problems of non-identifiability. Therefore, we focus on a known output mapping. The mathematical model of the GP-SSM is thus given by

$x_{t+1} = f(\xi_t) = \begin{bmatrix} f_1(\xi_t) \sim \mathcal{GP}(m_1(\xi_t), k_1(\xi_t, \xi_t')) \\ \vdots \\ f_{n_x}(\xi_t) \sim \mathcal{GP}(m_{n_x}(\xi_t), k_{n_x}(\xi_t, \xi_t')) \end{bmatrix}, \qquad y_t \sim p(y_t \mid x_t, \gamma_y)$, (54)

where $\xi_t \in \mathbb{R}^{n_\xi}$, $n_\xi = n_x + n_u$, is the concatenation of the state vector $x_t \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$ and the input $u_t \in \mathcal{U} \subseteq \mathbb{R}^{n_u}$ such that $\xi_t = [x_t; u_t]$. The mean functions are given by continuous functions $m_1, \ldots, m_{n_x}\colon \mathbb{R}^{n_\xi} \to \mathbb{R}$. The output mapping is parametrized by a known vector $\gamma_y \in \mathbb{R}^{n_\gamma}$ with $n_\gamma \in \mathbb{N}$. The system identification task for the GP-SSM mainly focuses on $f$ in particular. It can be described as finding the state-transition probability conditioned on the observed training data.

Remark 6.
The potentially unknown number of regressors can be determined using established nonlinear identification techniques as presented in [KL99], or by exploiting embedded techniques such as automatic relevance determination [Koc16]. A mismatch leads to similar issues as in parametric system identification.

Figure 9: Structure of a GP-SSM with $P$ as backshift operator, such that $P^{-1} x_{t+1} = x_t$.

The GP-NOE model uses the past $n_{in} \in \mathbb{N}_{>0}$ input values $u_t \in \mathcal{U}$ and the past $n_{out} \in \mathbb{N}_{>0}$ output values $y_t \in \mathbb{R}^{n_y}$ of the model as the regressors. Figure 10 shows the structure of the GP-NOE model, where the outputs are fed back. Analogously to the GP-SSM, the mathematical model of the GP-NOE is given by

$y_{t+1} = h(\zeta_t) = \begin{bmatrix} h_1(\zeta_t) \sim \mathcal{GP}(m_1(\zeta_t), k_1(\zeta_t, \zeta_t')) \\ \vdots \\ h_{n_y}(\zeta_t) \sim \mathcal{GP}(m_{n_y}(\zeta_t), k_{n_y}(\zeta_t, \zeta_t')) \end{bmatrix}$, (55)

where $\zeta_t \in \mathbb{R}^{n_\zeta}$, $n_\zeta = n_{out} n_y + n_{in} n_u$, is the concatenation of the past outputs $y_t$ and inputs $u_t$ such that $\zeta_t = [y_{t-n_{out}+1}; \ldots; y_t; u_{t-n_{in}+1}; \ldots; u_t]$. The mean functions are given by continuous functions $m_1, \ldots, m_{n_y}\colon \mathbb{R}^{n_\zeta} \to \mathbb{R}$. In contrast to nonlinear autoregressive exogenous models, which focus on one-step ahead prediction, an NOE model is more suitable for simulations as it considers the multi-step ahead prediction [Nel13]. However, the drawback is a more complex training procedure that requires a nonlinear optimization scheme due to the recurrent structure [Koc16].

Figure 10: Structure of a GP-NOE model with $P$ as backshift operator ($P^{-1} y_{t+1} = y_t$).
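The following sketch illustrates a multi-step ahead simulation with a GP-SSM where, as in the works mentioned above that replace deterministic parametric models, only the posterior mean of the state transition is propagated and fed back. The scalar dynamics, the training data, and all names are invented for the illustration.

```python
import numpy as np

def gp_mean(z_star, Z, T, phi1=1.0, sigma_n=0.05):
    """Posterior mean (8) with an SE kernel; Z holds the training regressors row-wise."""
    sq = lambda A, B: np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    K = np.exp(-sq(Z, Z) / (2 * phi1**2)) + sigma_n**2 * np.eye(len(Z))
    ks = np.exp(-sq(z_star[None, :], Z) / (2 * phi1**2))[0]
    return ks @ np.linalg.solve(K, T)

# Illustrative training data for a scalar GP-SSM: regressor xi_t = [x_t, u_t], target x_{t+1}
rng = np.random.default_rng(0)
Xi = rng.uniform(-2, 2, size=(30, 2))
X_next = 0.8 * Xi[:, 0] + 0.3 * np.sin(Xi[:, 1])      # assumed true transition, noise-free

def simulate(x0, u_seq):
    """Multi-step ahead prediction: feed the predicted state back as regressor."""
    x, traj = x0, [x0]
    for u in u_seq:
        x = gp_mean(np.array([x, u]), Xi, X_next)     # x_{t+1} = posterior mean of f(xi_t)
        traj.append(x)
    return np.array(traj)

print(simulate(0.5, np.sin(0.3 * np.arange(20))))
```

A full probabilistic rollout would instead propagate the predictive distribution, e.g., by sampling as in the scenario approach of Section 4.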
Remark 7. It is always possible to convert an identified input-output model into a state-space model, see [PL70]. However, focusing on state-space models only would preclude the development of a large number of useful identification results.
Remark 8.
Control-relevant properties of GP-SSMs and GP-NOE models are discussed in [BH16b; BH16a; BH20].

In this article, we introduce the GP and its usage in GPR. Based on the property that every finite subset of a GP follows a multivariate Gaussian distribution, a closed-form equation can be derived to predict the mean and variance for a new test point. The GPR can intrinsically handle noisy output data if the noise is Gaussian distributed. As GPR is a data-driven method, only little prior knowledge is necessary for the regression. Further, the complexity of GP models scales with the number of training points. A degree of freedom in the modeling is the selection of the kernel function and its hyperparameters. We present an overview of common kernels and the necessary properties of a valid kernel function. For the determination of the hyperparameters, two approaches based on numerical optimization are shown. The kernel of the GP is uniquely related to an RKHS, which determines the shape of the samples of the GP. Based on this, we compare different approaches for the quantification of the model error, i.e., the error between the GPR prediction and the actual data-generating function. Finally, we introduce how GP models can be used as dynamical models in GP-SSMs and GP-NOE models.
A Conditional Distribution
Let $\nu_1 \in \mathbb{R}^{n_{\nu_1}}$, $\nu_2 \in \mathbb{R}^{n_{\nu_2}}$ with $n_{\nu_1}, n_{\nu_2} \in \mathbb{N}$ be random variables which follow a joint multivariate Gaussian distribution

$\begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix} \right)$ (56)

with means $\mu_1 \in \mathbb{R}^{n_{\nu_1}}$, $\mu_2 \in \mathbb{R}^{n_{\nu_2}}$ and covariances $\Sigma_{11} \in \mathbb{R}^{n_{\nu_1} \times n_{\nu_1}}$, $\Sigma_{12} \in \mathbb{R}^{n_{\nu_1} \times n_{\nu_2}}$, $\Sigma_{22} \in \mathbb{R}^{n_{\nu_2} \times n_{\nu_2}}$. The task is to determine the conditional probability

$p(\nu_2 \mid \nu_1) = \dfrac{p(\nu_1, \nu_2)}{p(\nu_1)}$. (57)

The joint probability $p(\nu_1, \nu_2)$ is a multivariate Gaussian distribution with

$p(\nu_1, \nu_2) = \dfrac{1}{(2\pi)^{(n_{\nu_1} + n_{\nu_2})/2} \sqrt{\det(\Sigma)}} \exp\left( -\dfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$ (58)

$\mu := \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma := \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}$, (59)

where $x = [x_1; x_2]$, $x_1 \in \mathbb{R}^{n_{\nu_1}}$, $x_2 \in \mathbb{R}^{n_{\nu_2}}$. The marginal distribution of $\nu_1$ is defined by the mean $\mu_1$ and the variance $\Sigma_{11}$ such that

$p(\nu_1) = \dfrac{1}{(2\pi)^{n_{\nu_1}/2} \sqrt{\det(\Sigma_{11})}} \exp\left( -\dfrac{1}{2} (x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1) \right)$. (60)

The division of the joint distribution by the marginal distribution results again in a Gaussian distribution with

$p(\nu_2 \mid \nu_1) = \underbrace{\dfrac{\sqrt{\det(\Sigma_{11})}}{(2\pi)^{n_{\nu_2}/2} \sqrt{\det(\Sigma)}}}_{*} \exp\left( -\dfrac{1}{2} \underbrace{\left[ (x - \mu)^\top \Sigma^{-1} (x - \mu) - (x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1) \right]}_{**} \right)$, (61)

where the first part $*$ can be rewritten as

$* = \dfrac{1}{(2\pi)^{n_{\nu_2}/2}} \sqrt{\dfrac{\det(\Sigma_{11})}{\det(\Sigma_{11}) \det(\Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12})}} = \dfrac{1}{(2\pi)^{n_{\nu_2}/2} \sqrt{\det(\Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12})}}$. (62)

Thus, the covariance matrix $\Sigma_{2|1}$ of the conditional distribution $p(\nu_2 \mid \nu_1)$ is given by

$\Sigma_{2|1} = \Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12}$. (63)

For the simplification of the second part $**$ of (61), we exploit the special block structure of $\Sigma$, such that its inverse is given by

$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}, \qquad \Sigma^{-1} = \begin{bmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{bmatrix}$ (64)

$\tilde{\Sigma}_{11} = \Sigma_{11}^{-1} + \Sigma_{11}^{-1} \Sigma_{12} N \Sigma_{12}^\top \Sigma_{11}^{-1}, \quad \tilde{\Sigma}_{12} = -\Sigma_{11}^{-1} \Sigma_{12} N, \quad \tilde{\Sigma}_{21} = -N \Sigma_{12}^\top \Sigma_{11}^{-1}, \quad \tilde{\Sigma}_{22} = N$ (65)

with $N = (\Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12})^{-1} = \Sigma_{2|1}^{-1}$. Thus, we compute $**$ as

$** = \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \begin{bmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} - (x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1)$ (66)

$= (x_2 - \mu_2)^\top \Sigma_{2|1}^{-1} (x_2 - \mu_2) + 2 (x_2 - \mu_2)^\top \left( -\Sigma_{2|1}^{-1} \Sigma_{12}^\top \Sigma_{11}^{-1} \right) (x_1 - \mu_1) + (x_1 - \mu_1)^\top \left( \Sigma_{11}^{-1} + \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{2|1}^{-1} \Sigma_{12}^\top \Sigma_{11}^{-1} \right) (x_1 - \mu_1) - (x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1)$ (67)

$= \big( x_2 - \underbrace{(\mu_2 + \Sigma_{12}^\top \Sigma_{11}^{-1} (x_1 - \mu_1))}_{\mu_{2|1}} \big)^\top \Sigma_{2|1}^{-1} \big( x_2 - \underbrace{(\mu_2 + \Sigma_{12}^\top \Sigma_{11}^{-1} (x_1 - \mu_1))}_{\mu_{2|1}} \big)$ (68)

Finally, the conditional probability is given with the conditional mean $\mu_{2|1} \in \mathbb{R}^{n_{\nu_2}}$ and the conditional covariance matrix $\Sigma_{2|1} \in \mathbb{R}^{n_{\nu_2} \times n_{\nu_2}}$ by
$p(\nu_2 \mid \nu_1) = \dfrac{1}{(2\pi)^{n_{\nu_2}/2} \sqrt{\det(\Sigma_{2|1})}} \exp\left( -\dfrac{1}{2} (x_2 - \mu_{2|1})^\top \Sigma_{2|1}^{-1} (x_2 - \mu_{2|1}) \right)$ (69)

$\mu_{2|1} = \mu_2 + \Sigma_{12}^\top \Sigma_{11}^{-1} (x_1 - \mu_1), \qquad \Sigma_{2|1} = \Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12}$. (70)
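As a small sanity check of (70) and of the block-inverse identities (64)-(65) used in the derivation, the following NumPy snippet (ours) builds a random joint covariance and compares the conditional covariance from (70) with the inverse of the lower-right block of $\Sigma^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 2

# Random symmetric positive definite joint covariance with blocks S11, S12, S22, cf. (59)
A = rng.normal(size=(n1 + n2, n1 + n2))
S = A @ A.T + (n1 + n2) * np.eye(n1 + n2)
S11, S12, S22 = S[:n1, :n1], S[:n1, n1:], S[n1:, n1:]

# Conditional covariance from (70): Sigma_{2|1} = S22 - S12^T S11^{-1} S12
S_cond = S22 - S12.T @ np.linalg.solve(S11, S12)

# Block-inverse identity (64)-(65): the lower-right block of Sigma^{-1} equals Sigma_{2|1}^{-1}
S_inv = np.linalg.inv(S)
print(np.allclose(np.linalg.inv(S_inv[n1:, n1:]), S_cond))   # True

# Conditional mean from (70) for an arbitrary observed x1
mu1, mu2 = rng.normal(size=n1), rng.normal(size=n2)
x1 = rng.normal(size=n1)
mu_cond = mu2 + S12.T @ np.linalg.solve(S11, x1 - mu1)
print(mu_cond)
```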
References

[ÁRL12] Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. "Kernels for Vector-Valued Functions: A Review". In: Foundations and Trends in Machine Learning (2012).
[Aro50] Nachman Aronszajn. "Theory of reproducing kernels". In: Transactions of the American Mathematical Society (1950).
[Ber+16] Felix Berkenkamp, Riccardo Moriconi, Angela P. Schoellig, and Andreas Krause. "Safe Learning of Regions of Attraction for Uncertain, Nonlinear Systems with Gaussian Processes". 2016, pp. 4661-4666.
[Ber+17] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. "Safe Model-based Reinforcement Learning with Stability Guarantees". In: Advances in Neural Information Processing Systems. 2017, pp. 908-918.
[BH16a] Thomas Beckers and Sandra Hirche. "Equilibrium distributions and stability analysis of Gaussian Process State Space Models". 2016, pp. 6355-6361.
[BH16b] Thomas Beckers and Sandra Hirche. "Stability of Gaussian Process State Space Models". 2016, pp. 2275-2281.
[BH20] Thomas Beckers and Sandra Hirche. "Prediction with Gaussian Process Dynamical Models". In: Transactions on Automatic Control (2020). Submitted.
[Bis06] Christopher Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006. ISBN: 9780387310732.
[BKH19] Thomas Beckers, Dana Kulić, and Sandra Hirche. "Stable Gaussian Process based Tracking Control of Euler-Lagrange Systems". In: Automatica 103 (2019), pp. 390-397.
[BLG16] Yusuf Bhujwalla, Vincent Laurain, and Marion Gilson. "The impact of smoothness on model class selection in nonlinear system identification: An application of derivatives in the RKHS". 2016, pp. 1808-1813.
[BUH18] Thomas Beckers, Jonas Umlauft, and Sandra Hirche. "Mean Square Prediction Error of Misspecified Gaussian Process Models". 2018, pp. 1162-1167.
[Cho+13] Girish Chowdhary, Hassan Kingravi, Jonathan How, and Patricio A. Vela. "Bayesian nonparametric adaptive control of time-varying systems using Gaussian processes". 2013, pp. 2655-2661.
[DM+05] Lokenath Debnath, Piotr Mikusinski, et al. Introduction to Hilbert Spaces with Applications. Academic Press, 2005.
[FCR14] Roger Frigola, Yutian Chen, and Carl E. Rasmussen. Variational Gaussian Process State-Space Models. 2014. arXiv preprint.
[Fri+13] Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl E. Rasmussen. "Bayesian inference and learning in Gaussian process state-space models with particle MCMC". In: Advances in Neural Information Processing Systems. 2013, pp. 3156-3164.
[GE79] Seymour Geisser and William F. Eddy. "A Predictive Approach to Model Selection". In: Journal of the American Statistical Association (1979).
[KL99] Robert Haber and László Keviczky. Nonlinear System Identification: Input-Output Modeling Approach. Springer Netherlands, 1999. ISBN: 9780792358589.
[Koc16] Juš Kocijan. Modelling and Control of Dynamic Systems Using Gaussian Process Models. Springer International Publishing, 2016. DOI: 10.1007/978-3-319-21021-6.
[Mac97] David J. MacKay. Gaussian Processes - A Replacement for Supervised Neural Networks? 1997.
[MXZ06] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. "Universal Kernels". In: Journal of Machine Learning Research (2006).
[Nel13] Oliver Nelles. Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Springer-Verlag Berlin Heidelberg, 2013. DOI: 10.1007/978-3-662-04323-3.
[PL70] M. Q. Phan and R. W. Longman. "Relationship between state-space and input-output models via observer Markov parameters". In: WIT Transactions on The Built Environment 22 (1970).
[Rad96] Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, 1996.
[Ras06] Carl E. Rasmussen. Gaussian Processes for Machine Learning. The MIT Press, 2006. ISBN: 9780262182539.
[SC04] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[SC08] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[SHS06] Ingo Steinwart, Don Hush, and Clint Scovel. "An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels". In: IEEE Transactions on Information Theory (2006).
[Sjö+95] Jonas Sjöberg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. "Nonlinear black-box modeling in system identification: a unified overview". In: Automatica (1995).
[Sri+12] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. "Information-theoretic regret bounds for Gaussian process optimization in the bandit setting". In: IEEE Transactions on Information Theory (2012).
[UBH18] Jonas Umlauft, Thomas Beckers, and Sandra Hirche. "A Scenario-based Optimal Control Approach for Gaussian Process State Space Models". 2018, pp. 1386-1392. DOI: 10.23919/ECC.2018.8550458.
[Wah90] Grace Wahba.