An Introduction to Gaussian Process Models
by Thomas Beckers, [email protected]
Within the past two decades, Gaussian process regression has been increasingly used for modeling dynamical systems due to beneficial properties such as the bias-variance trade-off and the strong connection to Bayesian mathematics. As a data-driven method, a Gaussian process is a powerful tool for nonlinear function regression without the need for much prior knowledge. In contrast to most other techniques, Gaussian process modeling provides not only a mean prediction but also a measure of the model fidelity. In this article, we give an introduction to Gaussian processes and their usage in regression tasks for dynamical systems. Try it yourself: gpr.tbeckers.com
Original work: April 2020. Current revision: February 10, 2021. Chair of Information-oriented Control, Technical University of Munich.

A Gaussian process (GP) is a stochastic process that is in general a collection of random variables indexed by time or space. Its special property is that any finite collection of these variables follows a multivariate Gaussian distribution. Thus, the GP is a distribution over infinitely many variables and, therefore, a distribution over functions with a continuous domain. Consequently, it describes a probability distribution over an infinite-dimensional vector space. For engineering applications, the GP has gained increasing attention as a supervised machine learning technique, where it is used as a prior probability distribution over functions in Bayesian inference. The inference of continuous variables leads to Gaussian process regression (GPR), where the prior GP model is updated with training data to obtain a posterior GP distribution. Historically, GPR was used for the prediction of time series, first presented by Wiener and Kolmogorov in the 1940s. Afterwards, it became increasingly popular in geostatistics in the 1970s, where GPR is known as kriging. Recently, it came back in the area of machine learning [Rad96; WR96], especially boosted by the rapidly increasing computational power.

In this article, we present background information about GPs and GPR, mainly based on [Ras06], focusing on the application in control. We start with an introduction of GPs, explain the role of the underlying kernel function and show its relation to reproducing kernel Hilbert spaces. Afterwards, the embedding in dynamical systems and the interpretation of the model uncertainty as error bounds is presented. Several examples are included for an intuitive understanding in addition to the formal notation.
Let $(\Omega_{ss}, \mathcal{F}_\sigma, P)$ be a probability space with the sample space $\Omega_{ss}$, the corresponding $\sigma$-algebra $\mathcal{F}_\sigma$ and the probability measure $P$. The index set is given by $\mathcal{Z} \subseteq \mathbb{R}^{n_z}$ with positive integer $n_z$. Then, a function $f_{GP}(z, \omega_{ss})$, which is a measurable function of $\omega_{ss} \in \Omega_{ss}$ with index $z \in \mathcal{Z}$, is called a stochastic process. The function $f_{GP}(z, \omega_{ss})$ is a random variable on $\Omega_{ss}$ if $z \in \mathcal{Z}$ is specified. It is written in simplified form as $f_{GP}(z)$. A GP is a stochastic process which is fully described by a mean function $m\colon \mathcal{Z} \to \mathbb{R}$ and a covariance function $k\colon \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ such that

$f_{GP}(z) \sim \mathcal{GP}(m(z), k(z, z'))$ (1)

$m(z) = \mathrm{E}[f_{GP}(z)], \qquad k(z, z') = \mathrm{E}[(f_{GP}(z) - m(z))(f_{GP}(z') - m(z'))]$ (2)

with $z, z' \in \mathcal{Z}$. The covariance function is a measure for the correlation of two states $(z, z')$ and is called kernel in combination with GPs. Even though no analytic description of the probability density function of the GP exists in general, the interesting property is that any finite collection of its random variables $\{f_{GP}(z_1), \ldots, f_{GP}(z_{n_{GP}})\}$ follows an $n_{GP}$-dimensional multivariate Gaussian distribution. As a GP defines a distribution over functions, each realization is also a function over the index set $\mathcal{Z}$.

Example 1.
A GP $f_{GP}(t_c) \sim \mathcal{GP}(m(t_c), k(t_c, t_c'))$ with time $t_c \in \mathbb{R}_{\geq 0}$, where

$m(t_c) = 1\,\mathrm{A}, \qquad k(t_c, t_c') = \begin{cases} (0.1\,\mathrm{A})^2 & t_c = t_c' \\ (0\,\mathrm{A})^2 & t_c \neq t_c' \end{cases}$

describes a time-dependent electric current signal with Gaussian white noise with a standard deviation of 0.1 A and a mean of 1 A.

The GP can be utilized as prior probability distribution in Bayesian inference, which allows to perform function regression. Following the Bayesian methodology, new information is combined with existing information: using Bayes' theorem, the prior is combined with new data to obtain a posterior distribution. The new information is expressed as a training data set $\mathcal{D} = \{X, Y\}$. It contains the input values $X = [x_{dat}^{\{1\}}, x_{dat}^{\{2\}}, \ldots, x_{dat}^{\{n_D\}}] \in \mathcal{Z}^{n_D}$ and output values $Y = [\tilde{y}_{dat}^{\{1\}}, \tilde{y}_{dat}^{\{2\}}, \ldots, \tilde{y}_{dat}^{\{n_D\}}]^\top \in \mathbb{R}^{n_D}$, where

$\tilde{y}_{dat}^{\{i\}} = f_{GP}(x_{dat}^{\{i\}}) + \nu$ (3)

for all $i = 1, \ldots, n_D$. The output data might be corrupted by Gaussian noise $\nu \sim \mathcal{N}(0, \sigma_n^2)$.

Remark 1.
Note that we always use the standard notation $X$ for the input training data and $Y$ for the output training data throughout this report.

As any finite subset of a GP follows a multivariate Gaussian distribution, we can write the joint distribution

$\begin{bmatrix} Y \\ f_{GP}(z^*) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_{dat}^{\{1\}}) \\ \vdots \\ m(x_{dat}^{\{n_D\}}) \\ m(z^*) \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_n^2 I_{n_D} & k(z^*, X) \\ k(z^*, X)^\top & k(z^*, z^*) \end{bmatrix} \right)$ (4)

for any arbitrary test point $z^* \in \mathcal{Z}$. The function $m\colon \mathcal{Z} \to \mathbb{R}$ denotes the mean function. The matrix function $K\colon \mathcal{Z}^{n_D} \times \mathcal{Z}^{n_D} \to \mathbb{R}^{n_D \times n_D}$ is called the covariance or Gram matrix with

$K_{j,l}(X, X) = k(X_{:,l}, X_{:,j})$ for all $j, l \in \{1, \ldots, n_D\}$, (5)

where each element of the matrix represents the covariance between two elements of the training data $X$. The expression $X_{:,l}$ denotes the $l$-th column of $X$. For notational simplification, we shorten $K(X, X)$ to $K$ when necessary. The vector-valued kernel function $k\colon \mathcal{Z} \times \mathcal{Z}^{n_D} \to \mathbb{R}^{n_D}$ calculates the covariance between the test input $z^*$ and the input training data $X$, i.e.,

$k(z^*, X) = [k(z^*, X_{:,1}), \ldots, k(z^*, X_{:,n_D})]^\top$. (6)

To obtain the posterior predictive distribution of $f_{GP}(z^*)$, we condition on the test point $z^*$ and the training data set $\mathcal{D}$, given by

$p(f_{GP}(z^*) \mid z^*, \mathcal{D}) = \dfrac{p(f_{GP}(z^*), Y \mid X, z^*)}{p(Y \mid X)}$. (7)

Thus, the conditional posterior Gaussian distribution is defined by the mean and the variance

$\mu(f_{GP}(z^*) \mid z^*, \mathcal{D}) = m(z^*) + k(z^*, X)^\top (K + \sigma_n^2 I_{n_D})^{-1} \left( Y - [m(X_{:,1}), \ldots, m(X_{:,n_D})]^\top \right)$
$\operatorname{var}(f_{GP}(z^*) \mid z^*, \mathcal{D}) = k(z^*, z^*) - k(z^*, X)^\top (K + \sigma_n^2 I_{n_D})^{-1} k(z^*, X)$. (8)

A detailed derivation of the posterior mean and variance based on the joint distribution (4) can be found in Appendix A. Analyzing (8), we can make the following observations:

i) The mean prediction can be written as

$\mu(f_{GP}(z^*) \mid z^*, \mathcal{D}) = m(z^*) + \sum_{j=1}^{n_D} \alpha_j k(z^*, X_{:,j})$ (9)

with $\alpha = (K + \sigma_n^2 I_{n_D})^{-1} \left( Y - [m(X_{:,1}), \ldots, m(X_{:,n_D})]^\top \right) \in \mathbb{R}^{n_D}$. This formulation highlights the data-driven characteristic of the GPR, as the posterior mean is a sum of kernel functions whose number grows with the number $n_D$ of training data.

ii) The variance does not depend on the observed outputs, but only on the inputs, which is a property of the Gaussian distribution. The variance is the difference between two terms: the first term $k(z^*, z^*)$ is simply the prior covariance, from which a (positive) term is subtracted, representing the information the observations contain about the function. The uncertainty of the prediction, expressed in the variance, holds only for $f_{GP}(z^*)$ and does not consider the noise in the training data. For this purpose, an additional noise term $\sigma_n^2$ can be added to the variance in (8). Finally, (8) clearly shows the strong dependence of the posterior mean and variance on the kernel $k$, which we will discuss in depth in Section 3.
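The prediction equations (8) can be turned into a few lines of code. The following is a minimal sketch in Python/NumPy, assuming a zero prior mean and a squared exponential kernel; the function names and the toy data are our choices and only serve as an illustration.

```python
import numpy as np

def se_kernel(z1, z2, sig_f=1.0, ell=1.0):
    """Squared exponential kernel k(z, z') = sig_f^2 * exp(-||z - z'||^2 / (2 ell^2))."""
    d = np.linalg.norm(z1 - z2)
    return sig_f**2 * np.exp(-d**2 / (2 * ell**2))

def gp_predict(z_star, X, Y, sigma_n=0.1, kernel=se_kernel):
    """Posterior mean and variance of f_GP(z_star), cf. (8), with zero prior mean.

    X: (n_z, n_D) array of training inputs (columns are data points)
    Y: (n_D,) array of noisy training outputs
    """
    n_D = X.shape[1]
    K = np.array([[kernel(X[:, i], X[:, j]) for j in range(n_D)] for i in range(n_D)])
    k_star = np.array([kernel(z_star, X[:, i]) for i in range(n_D)])
    # Solve (K + sigma_n^2 I) alpha = Y instead of forming the inverse explicitly
    alpha = np.linalg.solve(K + sigma_n**2 * np.eye(n_D), Y)
    mean = k_star @ alpha
    v = np.linalg.solve(K + sigma_n**2 * np.eye(n_D), k_star)
    var = kernel(z_star, z_star) - k_star @ v
    return mean, var

# toy usage
X = np.array([[0.0, 2.0, 4.0, 6.0]])   # 1-D inputs as columns
Y = np.sin(X[0])                        # toy targets
m, v = gp_predict(np.array([3.0]), X, Y)
print(m, v)
```

In practice, a Cholesky factorization of $K + \sigma_n^2 I_{n_D}$ is typically computed once and reused for both the mean and the variance, instead of the repeated solves shown here.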
Example 2. We assume a GP with zero mean and a squared exponential kernel with signal variance $\varphi_0^2 = 0.3679$ as prior distribution. The training data set $\mathcal{D} = \{X, Y\}$ is given, where the output is corrupted by Gaussian noise with standard deviation $\sigma_n = 0.0498$, and the test point is assumed to be $z^* = 5$. According to (5) to (8), the Gram matrix $K(X, X)$ and the kernel vector $k(z^*, X)$ are calculated, and $k(z^*, z^*) = 0.1378$ is obtained. Finally, with (8), we compute the predicted mean and variance for $f_{GP}(z^*)$,

$\mu(f_{GP}(z^*) \mid z^*, \mathcal{D}) = 0.0278, \qquad \operatorname{var}(f_{GP}(z^*) \mid z^*, \mathcal{D}) = 0.0015$,

which is equivalent to a $2\sigma$ standard deviation of 0.0775. Figure 1 shows the prior distribution (left), the posterior distribution with two training points (black crosses) in the middle, and the posterior distribution given the full training set $\mathcal{D}$ (right). The solid red line is the mean function and the gray shaded area indicates the $2\sigma$ standard deviation. Five realizations (dashed lines) visualize the character of the distribution over functions.

Figure 1: The prior distribution of a GP is updated with data, which leads to the posterior distribution.

So far, the GP regression allows functions with scalar outputs as in (8). For the extension to vector-valued outputs, multiple approaches exist: i) extending the kernel to multivariate outputs [ÁRL12], ii) adding the output dimension as training data [Ber+17], or iii) using a separate GPR for each output [Ras06]. While the first two approaches set a prior on the correlation between the output dimensions, the latter disregards such a correlation without loss of generality. Following approach iii), the previous definition of the training set $\mathcal{D}$ is extended to a vector-valued output with

$X = [x_{dat}^{\{1\}}, x_{dat}^{\{2\}}, \ldots, x_{dat}^{\{n_D\}}] \in \mathcal{Z}^{n_D}, \qquad Y = [\tilde{y}_{dat}^{\{1\}}, \tilde{y}_{dat}^{\{2\}}, \ldots, \tilde{y}_{dat}^{\{n_D\}}]^\top \in \mathbb{R}^{n_D \times n_{y_{dat}}}$, (10)

where $n_{y_{dat}} \in \mathbb{N}$ is the dimension of the output, and the vector-valued GP is defined by

$f_{GP}(z) \sim \begin{bmatrix} \mathcal{GP}(m_1(z), k_1(z, z')) \\ \vdots \\ \mathcal{GP}(m_{n_{y_{dat}}}(z), k_{n_{y_{dat}}}(z, z')) \end{bmatrix}$ (11)

$m(z) := [m_1(z), \ldots, m_{n_{y_{dat}}}(z)]^\top$. (12)

Following (4) to (8), we obtain for the predicted mean and variance

$\mu(f_{GP,i}(z^*) \mid z^*, \mathcal{D}) = m_i(z^*) + k_i(z^*, X)^\top (K_i + \sigma_{n,i}^2 I_{n_D})^{-1} \left( Y_{:,i} - [m_i(X_{:,1}), \ldots, m_i(X_{:,n_D})]^\top \right)$
$\operatorname{var}(f_{GP,i}(z^*) \mid z^*, \mathcal{D}) = k_i(z^*, z^*) - k_i(z^*, X)^\top (K_i + \sigma_{n,i}^2 I_{n_D})^{-1} k_i(z^*, X)$ (13)

for each output dimension $i \in \{1, \ldots, n_{y_{dat}}\}$ with respect to the kernels $k_1, \ldots, k_{n_{y_{dat}}}$. The variable $\sigma_{n,i}$ denotes the standard deviation of the Gaussian noise that corrupts the $i$-th dimension of the output measurements. The $n_{y_{dat}}$ components of $f_{GP} \mid z^*, \mathcal{D}$ are combined into a multi-variable Gaussian distribution with

$\mu(f_{GP} \mid z^*, \mathcal{D}) = [\mu(f_{GP,1} \mid z^*, \mathcal{D}), \ldots, \mu(f_{GP,n_{y_{dat}}} \mid z^*, \mathcal{D})]^\top$
$\Sigma(f_{GP} \mid z^*, \mathcal{D}) = \operatorname{diag}\left( \operatorname{var}(f_{GP,1} \mid z^*, \mathcal{D}), \ldots, \operatorname{var}(f_{GP,n_{y_{dat}}} \mid z^*, \mathcal{D}) \right)$, (14)

where $\Sigma(f_{GP} \mid z^*, \mathcal{D})$ denotes the posterior variance matrix. This formulation allows to use a GP prior on vector-valued functions to perform predictions for test points $z^*$. This approach treats each output dimension separately, which is mostly sufficient and easy to handle.
An alternative approach is to include the dimension as additional input, e.g., as in [Ber+17], with the benefit of a single GP at the price of a loss of interpretability. For highly correlated output data, a multi-output kernel might be beneficial, see [ÁRL12].

Remark 2.
Without specific knowledge about a trend in the data, the prior mean functions $m_1, \ldots, m_{n_{y_{dat}}}$ are often set to zero, see [Ras06]. Therefore, we set the mean functions to zero for the remainder of the report if not stated otherwise.

In Section 2.1, we approach the GPR from a Bayesian perspective. However, for some applications of GPR a different point of view is beneficial, namely the kernel perspective. In the following, we derive GPR from linear regression that is extended with a kernel transformation. In general, the prediction of parametric models is based on a parameter vector $w$ which is typically learned using a set of training data points. In contrast, non-parametric models typically maintain at least a subset of the training data points in memory in order to make predictions for new data points. Many linear models can be transformed into a dual representation where the prediction is based on a linear combination of kernel functions. The idea is to transform the data points of a model to an often high-dimensional feature space where a linear regression can be applied to predict the model output, as depicted in Fig. 2. For a nonlinear feature map $\phi\colon \mathcal{Z} \to \mathcal{F}$, where $\mathcal{F}$ is an $n_\phi \in \mathbb{N} \cup \{\infty\}$ dimensional Hilbert space, the kernel function is given by the inner product $k(z, z') = \langle \phi(z), \phi(z') \rangle$ for all $z, z' \in \mathcal{Z}$. Thus, the kernel implicitly encodes the way the data points are transformed into a higher-dimensional space. The formulation as an inner product in a feature space allows to extend many standard regression methods. Also GPR can be derived using a standard linear regression model

$f_{lin}(z) = z^\top w, \qquad \tilde{y}_{dat}^{\{i\}} = f_{lin}(x_{dat}^{\{i\}}) + \nu$, (15)

where $z \in \mathcal{Z}$ is the input vector, $w \in \mathbb{R}^{n_z}$ the vector of weights with $n_z = \dim(\mathcal{Z})$ and $f_{lin}\colon \mathcal{Z} \to \mathbb{R}$ the unknown function. The observed value $\tilde{y}_{dat}^{\{i\}} \in \mathbb{R}$ for the input $x_{dat}^{\{i\}} \in \mathcal{Z}$ is corrupted by Gaussian noise $\nu \sim \mathcal{N}(0, \sigma_n^2)$ for all $i = 1, \ldots, n_D$.

Figure 2: The mapping $\phi$ transforms the data points into a feature space where linear regressors can be applied to predict the output.

The analysis of this model is analogous to standard linear regression, i.e., we put a prior on the weights such that $w \sim \mathcal{N}(0, \Sigma_p)$ with $\Sigma_p \in \mathbb{R}^{n_z \times n_z}$. Based on $n_D$ collected training data points as defined in Section 2.1, this leads to the well-known linear Bayesian regression

$p(f_{lin}(z^*) \mid z^*, \mathcal{D}) = \mathcal{N}\big( \underbrace{\sigma_n^{-2} z^{*\top} A_{lin}^{-1} X Y}_{\mu(f_{lin}(z^*) \mid z^*, \mathcal{D})}, \; \underbrace{z^{*\top} A_{lin}^{-1} z^*}_{\operatorname{var}(f_{lin}(z^*) \mid z^*, \mathcal{D})} \big)$ (16)

where $A_{lin} = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}$. Now, using the feature map $\phi(z)$ instead of $z$ directly leads to $f_{GP}(z) = \phi(z)^\top \check{w}$ with $\check{w} \sim \mathcal{N}(0, \check{\Sigma}_p)$, $\check{\Sigma}_p \in \mathbb{R}^{n_\phi \times n_\phi}$. As long as the projections are fixed functions, i.e., independent of the parameters $w$, the model is still linear in the parameters and, thus, analytically tractable. In particular, the Bayesian regression (16) with the mapping $\phi(z)$ can be written as

$(f_{GP}(z^*) \mid z^*, \mathcal{D}) \sim \mathcal{N}\big( \sigma_n^{-2} \phi(z^*)^\top A_{GP}^{-1} [\phi(X_{:,1}), \ldots, \phi(X_{:,n_D})] Y, \; \phi(z^*)^\top A_{GP}^{-1} \phi(z^*) \big)$, (17)

with the matrix $A_{GP} \in \mathbb{R}^{n_\phi \times n_\phi}$ given by

$A_{GP} = \sigma_n^{-2} [\phi(X_{:,1}), \ldots, \phi(X_{:,n_D})] [\phi(X_{:,1}), \ldots, \phi(X_{:,n_D})]^\top + \check{\Sigma}_p^{-1}$. (18)

This equation can be simplified and rewritten as

$(f_{GP}(z^*) \mid z^*, \mathcal{D}) \sim \mathcal{N}\big( \underbrace{k(z^*, X)^\top (K + \sigma_n^2 I_{n_D})^{-1} Y}_{\mu(f_{GP}(z^*) \mid z^*, \mathcal{D})}, \; \underbrace{k(z^*, z^*) - k(z^*, X)^\top (K + \sigma_n^2 I_{n_D})^{-1} k(z^*, X)}_{\operatorname{var}(f_{GP}(z^*) \mid z^*, \mathcal{D})} \big)$, (19)

with $k(z, z') = \phi(z)^\top \check{\Sigma}_p \phi(z')$, which equals (8) with zero prior mean.
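The equivalence between the weight-space form (17) and the kernel form (19) can be checked numerically. The sketch below does this for the degree-2 polynomial kernel and its explicit feature map; the prior $\check{\Sigma}_p = I$, the random data, and all names are our assumptions for the illustration.

```python
import numpy as np

# Degree-2 polynomial feature map phi(z) = [z1^2, sqrt(2) z1 z2, z2^2] and its kernel
phi = lambda z: np.array([z[0]**2, np.sqrt(2) * z[0] * z[1], z[1]**2])
k = lambda a, b: float(a @ b) ** 2          # k(z, z') = (z^T z')^2 = phi(z)^T phi(z')

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 8))                 # columns are training inputs
Y = rng.normal(size=8)
sigma_n, n_phi = 0.1, 3

# Weight-space view (17)-(18) with prior w ~ N(0, I)
Phi = np.column_stack([phi(X[:, i]) for i in range(8)])   # n_phi x n_D
A = Phi @ Phi.T / sigma_n**2 + np.eye(n_phi)
z_star = np.array([0.3, -0.7])
mean_w = phi(z_star) @ np.linalg.solve(A, Phi @ Y) / sigma_n**2
var_w = phi(z_star) @ np.linalg.solve(A, phi(z_star))

# Function-space view (19): only kernel evaluations are needed
K = np.array([[k(X[:, i], X[:, j]) for j in range(8)] for i in range(8)])
ks = np.array([k(z_star, X[:, i]) for i in range(8)])
mean_f = ks @ np.linalg.solve(K + sigma_n**2 * np.eye(8), Y)
var_f = k(z_star, z_star) - ks @ np.linalg.solve(K + sigma_n**2 * np.eye(8), ks)

print(np.isclose(mean_w, mean_f), np.isclose(var_w, var_f))   # True True
```

The kernel form only involves $n_D \times n_D$ linear algebra and never touches the feature space explicitly, which is what makes very high- or infinite-dimensional feature maps tractable.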
The fact that in (19) the feature map $\phi(z)$ is not needed explicitly is known as the kernel trick. This trick is also used in other kernel-based models, e.g., support vector machines (SVMs), see [SC08] for more details.

Even though a kernel neither uniquely defines the feature map nor the feature space, one can always construct a canonical feature space, namely the reproducing kernel Hilbert space (RKHS), given a certain kernel. After the introduction of the theory, illustrative examples for an intuitive understanding are presented. We will now formally present this construction procedure, starting with the concept of Hilbert spaces, following [BLG16]: A Hilbert space $\mathcal{F}$ represents all possible realizations of some class of functions, for example all functions of continuity degree $i$, denoted by $C^i$. Moreover, a Hilbert space is a vector space such that any function $f_F \in \mathcal{F}$ has a non-negative norm $\|f_F\|_{\mathcal{F}} \geq 0$, with $\|f_F\|_{\mathcal{F}} = 0$ only for $f_F = 0$. All functions $f_F$ must additionally be equipped with an inner product in $\mathcal{F}$. Simply speaking, a Hilbert space is an infinite-dimensional vector space where many operations behave like in the finite-dimensional case. The properties of Hilbert spaces have been explored in great detail in the literature, e.g., in [DM+05]. An extremely useful property is that such spaces are equivalent to an associated kernel function [Aro50]. This equivalence allows to simply define a kernel instead of fully defining the associated vector space. Formally speaking, if a Hilbert space $\mathcal{H}$ is a RKHS, it has a unique positive definite kernel $k\colon \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$, which spans the space $\mathcal{H}$.

Theorem 1 (Moore-Aronszajn [Aro50]). Every positive definite kernel $k$ is associated with a unique RKHS $\mathcal{H}$.

Theorem 2 ([Aro50]). Let $\mathcal{F}$ be a Hilbert space, $\mathcal{Z}$ a non-empty set and $\phi\colon \mathcal{Z} \to \mathcal{F}$. Then, the inner product $\langle \phi(z), \phi(z') \rangle_{\mathcal{F}} := k(z, z')$ is positive definite.

Importantly, any function $f_H$ in $\mathcal{H}$ can be represented as a weighted linear sum of this kernel evaluated over the space $\mathcal{H}$, as

$f_H(\cdot) = \langle f_H(\cdot), k(x, \cdot) \rangle_{\mathcal{H}} = \sum_{i=1}^{n_\phi} \alpha_i k(x_{dat}^{\{i\}}, \cdot)$, (20)

with $\alpha_i \in \mathbb{R}$ for all $i \in \{1, \ldots, n_\phi\}$, where $n_\phi \in \mathbb{N} \cup \{\infty\}$ is the dimension of the feature space $\mathcal{F}$. Thus, the RKHS is equipped with the inner product

$\langle f_H, f \rangle_{\mathcal{H}} = \sum_{i=1}^{n_\phi} \sum_{j=1}^{n_\phi} \alpha_i \alpha_j k(x_{dat}^{\{i\}}, x_{dat}^{\{j\}})$, (21)

with $f(\cdot) = \sum_{j=1}^{n_\phi} \alpha_j k(x_{dat}^{\{j\}}, \cdot) \in \mathcal{H}$, $\alpha_j \in \mathbb{R}$. Now, the reproducing property manifests as: for all $z \in \mathcal{Z}$ and all $f_H \in \mathcal{H}$, $\langle f_H, k(z, \cdot) \rangle_{\mathcal{H}} = f_H(z)$, in particular

$k(z, z') = \langle k(\cdot, z), k(\cdot, z') \rangle_{\mathcal{H}}$. (22)

According to [SHS06], the RKHS is then defined as

$\mathcal{H} = \{ f_H\colon \mathcal{Z} \to \mathbb{R} \mid \exists c \in \mathcal{F}, \; f_H(z) = \langle c, \phi(z) \rangle_{\mathcal{F}}, \; \forall z \in \mathcal{Z} \}$, (23)

where $\phi(z)$ is the feature map constructing the kernel through $k(z, z') = \langle \phi(z), \phi(z') \rangle_{\mathcal{F}}$.

Example 3.
We want to find the RKHS for the polynomial kernel with degree 2, which is given by

$k(z, z') = (z^\top z')^2 = (z_1 z_1')^2 + 2 z_1 z_1' z_2 z_2' + (z_2 z_2')^2$

for any $z, z' \in \mathbb{R}^2$. First, we have to find a feature map $\phi$ such that the kernel corresponds to the inner product $k(z, z') = \langle \phi(z), \phi(z') \rangle$. A possible candidate for the feature map is $\phi(z) = [z_1^2, \sqrt{2} z_1 z_2, z_2^2]^\top$, because

$\langle \phi(z), \phi(z') \rangle_{\mathbb{R}^3} = \phi(z)^\top \phi(z') = (z_1 z_1')^2 + 2 z_1 z_1' z_2 z_2' + (z_2 z_2')^2 = k(z, z')$.

We know that the RKHS contains all linear combinations of the form

$f_H(z) = \sum_i \alpha_i k(x_{dat}^{\{i\}}, z) = \sum_i \alpha_i \langle \phi(x_{dat}^{\{i\}}), \phi(z) \rangle_{\mathbb{R}^3} = \langle c, \phi(z) \rangle_{\mathbb{R}^3} = c_1 z_1^2 + c_2 \sqrt{2} z_1 z_2 + c_3 z_2^2$,

with $\alpha_i \in \mathbb{R}$, $c \in \mathbb{R}^3$, $x_{dat}^{\{i\}} \in \mathbb{R}^2$. Therefore, a possible candidate for the RKHS $\mathcal{H}$ is given by

$\mathcal{H} = \{ f_H\colon \mathbb{R}^2 \to \mathbb{R} \mid f_H(z) = c_1 z_1^2 + c_2 \sqrt{2} z_1 z_2 + c_3 z_2^2, \; c \in \mathbb{R}^3 \}$. (24)

Next, it must be checked whether the proposed Hilbert space is the RKHS related to the polynomial kernel with degree 2. This is achieved in two steps: i) checking whether the space is a Hilbert space and ii) confirming the reproducing property. First, we can easily prove that this is a Hilbert space by rewriting $f_H(z) = z^\top S z$ with a symmetric matrix $S \in \mathbb{R}^{2 \times 2}$ and using the fact that $\mathcal{H}$ is Euclidean and isomorphic to $S$. Second, the condition for an RKHS must be fulfilled, i.e., the reproducing property $f_H(z) = \langle f_H(\cdot), k(\cdot, z) \rangle_{\mathcal{H}}$. Since we can write

$\langle f_H(\cdot), k(\cdot, z) \rangle_{\mathcal{H}} = \langle c^\top \phi(\cdot), k(\cdot, z) \rangle_{\mathcal{H}} = \sum_{i=1}^{3} c_i \phi_i(z) = c^\top \phi(z) = f_H(z)$,

property (22) is fulfilled and, thus, $\mathcal{H}$ is the RKHS for the polynomial kernel with degree 2. Note that, even though the mapping $\phi$ is not unique for the kernel $k$, the relation of $k$ and the RKHS $\mathcal{H}$ is unique.

Given a function $f_H \in \mathcal{H}$ defined by $n_D$ observations, its RKHS norm is defined by

$\|f_H\|_{\mathcal{H}}^2 = \langle f_H, f_H \rangle_{\mathcal{H}} = \sum_{i=1}^{n_D} \sum_{j=1}^{n_D} \alpha_i \alpha_j k(x_{dat}^{\{i\}}, x_{dat}^{\{j\}}) = \alpha^\top K(X, X) \alpha$, (25)

with $\alpha \in \mathbb{R}^{n_D}$ and $K(X, X)$ given by (5). We can also use the feature map such that

$\|f_H\|_{\mathcal{H}} = \inf\{ \|c\|_{\mathcal{F}} : c \in \mathcal{F}, \; f_H(z) = \langle c, \phi(z) \rangle_{\mathcal{F}}, \; \forall z \in \mathcal{Z} \}$. (26)

As there is a unique relation between the RKHS $\mathcal{H}$ and the kernel $k$, the norm $\|f_H\|_{\mathcal{H}}$ can equivalently be written as $\|f_H\|_k$. The norm of a function in the RKHS indicates how fast the function varies over $\mathcal{Z}$ with respect to the geometry defined by the kernel. Formally, it can be written as

$\dfrac{|f_H(z) - f_H(z')|}{d(z, z')} \leq \|f_H\|_{\mathcal{H}}$, (27)

with the distance $d(z, z') = \sqrt{k(z, z) - 2 k(z, z') + k(z', z')}$. A function with finite RKHS norm is also an element of the RKHS. A more detailed discussion about RKHSs and norms is given in [Wah90].

Example 4.
We want to find the RKHS norm of a function $f_H$ that is an element of the RKHS of the polynomial kernel with degree 2, which is given by

$k(z, z') = (z^\top z')^2 = (z_1 z_1')^2 + 2 z_1 z_1' z_2 z_2' + (z_2 z_2')^2$.

Let the function be

$f_H(z) = \sum_{i=1}^{3} \alpha_i k(x_{dat}^{\{i\}}, z)$, with (28)
$\alpha_1 = 1, \quad \alpha_2 = -2, \quad \alpha_3 = 3$ (29)
$x_{dat}^{\{1\}} = [1, 1]^\top, \quad x_{dat}^{\{2\}} = [1, 2]^\top, \quad x_{dat}^{\{3\}} = [2, 1]^\top$. (30)

Hence, function (28) with (29) and (30) corresponds to $f_H(z) = 11 z_1^2 + 6 z_1 z_2 - 4 z_2^2$. Now, we have two possibilities to calculate the RKHS norm. First, the RKHS norm of $f_H$ is calculated using (25) by

$\|f_H\|_{\mathcal{H}}^2 = \alpha^\top K(X, X) \alpha = \begin{bmatrix} 1 & -2 & 3 \end{bmatrix} \begin{bmatrix} 4 & 9 & 9 \\ 9 & 25 & 16 \\ 9 & 16 & 25 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \\ 3 \end{bmatrix} = 155$

with $X = [x_{dat}^{\{1\}}, x_{dat}^{\{2\}}, x_{dat}^{\{3\}}]$. Alternatively, we can use (26), which results in $\|f_H\|_{\mathcal{H}} = \|c\|$, where $c$ is defined through (24). Thus, the norm is computed as

$f_H(z) = 11 z_1^2 + 6 z_1 z_2 - 4 z_2^2 \;\Rightarrow\; c_1 = 11, \; c_2 = \tfrac{6}{\sqrt{2}}, \; c_3 = -4 \;\Rightarrow\; \|f_H\|_{\mathcal{H}}^2 = \|c\|^2 = 155$.
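As a quick numerical cross-check of Example 4, the following snippet (ours, using NumPy) computes the squared RKHS norm both via the Gram matrix (25) and via the explicit feature map route (26).

```python
import numpy as np

# Polynomial kernel of degree 2, k(z, z') = (z^T z')^2
def poly2(z1, z2):
    return float(z1 @ z2) ** 2

# Data points and coefficients from Example 4
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])   # rows are x_dat^{1}, x_dat^{2}, x_dat^{3}
alpha = np.array([1.0, -2.0, 3.0])

# Gram matrix K(X, X) and squared RKHS norm alpha^T K alpha, cf. (25)
K = np.array([[poly2(xi, xj) for xj in X] for xi in X])
print(alpha @ K @ alpha)            # 155.0

# Alternative route via the feature map phi(z) = [z1^2, sqrt(2) z1 z2, z2^2], cf. (26)
c = sum(a * np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])
        for a, x in zip(alpha, X))
print(c, c @ c)                     # [11.  4.2426...  -4.], 155.0
```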
Example 5. In this example, we visualize the meaning of the RKHS norm. Figure 3 shows different quadratic functions with the same RKHS norm (top left and top right), a smaller RKHS norm (bottom left) and a larger RKHS norm (bottom right). An identical norm indicates a similar variation of the functions, whereas a higher norm leads to a more strongly varying function.

Figure 3: Functions with different RKHS norms.

In summary, we investigate the unique relation between the kernel and its RKHS. The reproducing property allows us to write the inner product as a tractable function which implicitly defines a higher- (or even infinite-) dimensional feature space. The RKHS norm of a function is a Lipschitz-like indicator based on the metric defined by the kernel. This view of the RKHS is related to the kernel trick in machine learning. In the next section, the RKHS norm is exploited to determine the error between the prediction of GPR and the actual data-generating function.

One of the most interesting properties of GPR is the uncertainty description encoded in the predicted variance. This uncertainty is beneficial to quantify the error between the actual underlying data-generating process and the GPR. In this section, we assume that there is an unknown function $f_{uk}\colon \mathbb{R}^{n_z} \to \mathbb{R}$ that generates the training data. In detail, the data set $\mathcal{D} = \{X, Y\}$ consists of

$X = [x_{dat}^{\{1\}}, x_{dat}^{\{2\}}, \ldots, x_{dat}^{\{n_D\}}] \in \mathbb{R}^{n_z \times n_D}, \qquad Y = [\tilde{y}_{dat}^{\{1\}}, \tilde{y}_{dat}^{\{2\}}, \ldots, \tilde{y}_{dat}^{\{n_D\}}]^\top \in \mathbb{R}^{n_D}$, (31)

where the data is generated by

$\tilde{y}_{dat}^{\{i\}} = f_{uk}(x_{dat}^{\{i\}}) + \nu, \qquad \nu \sim \mathcal{N}(0, \sigma_n^2)$ (32)

for all $i \in \{1, \ldots, n_D\}$. Without any assumptions on $f_{uk}$, it is obviously not possible to quantify the model error. Loosely speaking, the prior distribution of the GPR with kernel $k$ must be suitable to learn the unknown function. More technically, $f_{uk}$ must be an element of the RKHS spanned by the kernel as described in (23). This leads to the following assumption.

Assumption 1.
The function $f_{uk}$ has a finite RKHS norm with respect to the kernel $k$, i.e., $\|f_{uk}\|_{\mathcal{H}} < \infty$, where $\mathcal{H}$ is the RKHS spanned by $k$.

This sounds paradoxical, as $f_{uk}$ is assumed to be unknown. However, there exist kernels that can approximate any continuous function arbitrarily exactly. Thus, for any continuous function, an arbitrarily close function is an element of the RKHS of a universal kernel. For more details, we refer to Section 2.4. More information about the model error of misspecified Gaussian process models can be found in [BUH18].

We classify the error quantification in three different approaches: i) the robust approach, ii) the scenario approach, and iii) the information-theoretic approach. The different techniques are presented in the following and visualized in Fig. 4. For the remainder of this section, we assume that a GPR is trained with the data set (31) and Assumption 1 holds.

The robust approach exploits the fact that the prediction of the GPR is Gaussian distributed. Thus, for any $z^* \in \mathbb{R}^{n_z}$, the model error is bounded by

$|f_{uk}(z^*) - \mu(f_{GP} \mid z^*, \mathcal{D})| \leq c \sqrt{\operatorname{var}(f_{GP} \mid z^*, \mathcal{D})}$ (33)

with high probability, where $c \in \mathbb{R}_{>0}$ adjusts the probability. However, for multiple test points $z_1^*, z_2^*, \ldots \in \mathbb{R}^{n_z}$, this approach neglects any correlation between $f_{GP}(z_1^*), f_{GP}(z_2^*), \ldots$. Figure 4 shows how, for given $z_1^*$ and $z_2^*$, the variance is exploited as upper bound. Thus, every prediction is handled independently, which leads to a very conservative bound, see [UBH18].

Instead of using the mean and the variance as in the robust approach, the scenario approach deals with samples of the GPR directly. In contrast to the other methods, there is no direct model error quantification but rather a sample-based quantification. The idea is to draw a large number $n_{scen} \in \mathbb{N}$ of sample functions $f_{GP}^1, f_{GP}^2, \ldots, f_{GP}^{n_{scen}}$ over $n_s \in \mathbb{N}$ sampling points. The sampling is performed by drawing multiple instances from $f_{GP}$ given by the multivariate Gaussian distribution

$\begin{bmatrix} Y \\ f_{GP}(z_1^*) \\ \vdots \\ f_{GP}(z_{n_s}^*) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_{dat}^{\{1\}}) \\ \vdots \\ m(x_{dat}^{\{n_D\}}) \\ m(z_1^*) \\ \vdots \\ m(z_{n_s}^*) \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_n^2 I_{n_D} & K(X^*, X)^\top \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$, (34)

where $X^* = [z_1^*, \ldots, z_{n_s}^*]$ contains the sampling points. Each sample can then be used in the application instead of the unknown function. For a large number of samples, it is assumed that the unknown function is close to one of these samples. However, the crux of this approach is to determine, for a given model error $c \in \mathbb{R}_{>0}$, the required number of samples $n_{scen}$ and probability $\delta_{scen}$ such that

$P\big( |f_{uk}(z^*) - f_{GP}^i(z^*)| \leq c, \; i \in \{1, \ldots, n_{scen}\} \big) \geq \delta_{scen}$ (35)

for all $z^* \in \mathcal{Z}$. In Fig. 4, five different samples of a GP model are drawn as an example.
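A minimal sketch of the scenario approach: draw several posterior sample functions jointly over a grid of sampling points, following (34) with zero prior mean and a squared exponential kernel. The kernel choice, the toy data, and all names are our assumptions.

```python
import numpy as np

def se_kernel_mat(A, B, sig_f=1.0, ell=1.0):
    """Squared exponential Gram matrix between the entries of the 1-D input arrays A and B."""
    d = A.reshape(-1, 1) - B.reshape(1, -1)
    return sig_f**2 * np.exp(-d**2 / (2 * ell**2))

def draw_scenarios(X, Y, X_star, sigma_n=0.1, n_scen=5):
    """Draw n_scen sample functions of the posterior GP at the points X_star, cf. (34)."""
    Kxx = se_kernel_mat(X, X) + sigma_n**2 * np.eye(len(X))
    Kxs = se_kernel_mat(X, X_star)          # cross-covariance between training and sampling points
    Kss = se_kernel_mat(X_star, X_star)
    # Condition the joint Gaussian (34) on the observed outputs Y (Appendix A)
    L = np.linalg.solve(Kxx, Kxs)
    mean = L.T @ Y
    cov = Kss - Kxs.T @ L
    rng = np.random.default_rng(0)
    return rng.multivariate_normal(mean, cov + 1e-10 * np.eye(len(X_star)), size=n_scen)

X = np.array([0.0, 1.0, 3.0, 4.0, 6.0])
Y = np.sin(X)
samples = draw_scenarios(X, Y, np.linspace(0.0, 7.0, 100))
print(samples.shape)   # (5, 100)
```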
Alternatively, the work in [Sri+12] derives an upper bound for samples of the GPR on a compact set with a specific probability. In contrast to the robust approach, the correlation between the function values is considered. We restate here the theorem of [Sri+12].

Theorem 3 ([Sri+12]). Given Assumption 1, the model error $\Delta \in \mathbb{R}$,

$\Delta = |\mu(f_{GP} \mid z, \mathcal{D}) - f_{uk}(z)|$, (36)

is bounded for all $z$ on a compact set $\Omega \subset \mathbb{R}^{n_z}$ with a probability of at least $\delta \in (0, 1)$ by

$P\big\{ \forall z \in \Omega, \; \Delta \leq |\beta \, \Sigma^{1/2}(f_{GP} \mid z, \mathcal{D})| \big\} \geq \delta$, (37)

where $\beta \in \mathbb{R}$ is defined as

$\beta = \sqrt{ 2 \|f_{uk}\|_k^2 + 300 \, \gamma_{max} \ln^3\!\left( \dfrac{n_D + 1}{1 - \delta} \right) }$. (38)

The variable $\gamma_{max} \in \mathbb{R}$ is the maximum of the information gain

$\gamma_{max} = \max_{x_{dat}^{\{1\}}, \ldots, x_{dat}^{\{n_D+1\}} \in \Omega} \; \dfrac{1}{2} \log \left| I_{n_D+1} + \sigma_n^{-2} K(z, z') \right|$ (39)

with Gram matrix $K(z, z')$ and the input elements $z, z' \in \{x_{dat}^{\{1\}}, \ldots, x_{dat}^{\{n_D+1\}}\}$.

To compute this bound, the RKHS norm of $f_{uk}$ must be known, which is usually not the case in applications. However, often the norm can be upper bounded and thus, the bound in Theorem 3 can be upper bounded. For this purpose, the relation of the RKHS norm to the Lipschitz constant given by (27) is beneficial, as the Lipschitz constant is more likely to be known. In general, the computation of the information gain is a non-convex optimization problem. However, the information capacity $\gamma_{max}$ has a sub-linear dependency on the number of training points for many commonly used kernel functions [Sri+12]. Therefore, even though $\beta$ increases with the number of training data, it is possible to learn the true function $f_{uk}$ arbitrarily exactly [Ber+16]. In contrast to the other approaches, Theorem 3 allows to bound the error for any test point in a compact set. In [BKH19], we exploit this approach in GP model based control tasks. The right illustration of Fig. 4 visualizes the information-theoretic bound.

Figure 4: Different approaches to quantify the model error: robust approach (left), scenario approach (middle), information-theoretic approach (right).

Equation (8) clearly shows the immense impact of the kernel on the posterior mean and variance. However, this is not surprising, as the kernel is an essential part of the prior model. For practical applications, this leads to the question how to choose the kernel. Additionally, most kernels depend on a set of hyperparameters that must be defined. Thus, in order to turn GPR into a powerful practical tool, it is essential to develop methods that address the model selection problem. We see the model selection as the determination of the kernel and its hyperparameters. We only focus on kernels that are defined on
$\mathcal{Z} \subseteq \mathbb{R}^{n_z}$. In the next two subsections, we present different kernels and explain the role of the hyperparameters and their selection, mainly based on [Ras06].

Remark 3.
The selection of the kernel functions seems to be similar to the model selection for parametric models. However, there are two major differences: i) the selection is fully covered by the Bayesian methodology and ii) many kernels allow to model a wide range of different functions, whereas parametric models are typically limited to very specific types of functions.
The value of the kernel function $k(z, z')$ is an indicator of the interaction of two states $(z, z')$. Thus, an essential part of GPR is the selection of the kernel function and the estimation of its free parameters $\varphi_1, \varphi_2, \ldots, \varphi_{n_\varphi}$, called hyperparameters. The number $n_\varphi$ of hyperparameters depends on the kernel function. The choice of the kernel function and the determination of the corresponding hyperparameters can be seen as degrees of freedom of the regression. First of all, we start with the general properties of a function to be qualified as a kernel for GPR. A necessary and sufficient condition for the function $k\colon \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ to be a valid kernel is that the Gram matrix, see (5), is positive semidefinite for all possible input values [SC04].

Remark 4.
As shown in Section 2.4, the kernel function must be positive definite to span a unique RKHS. That seems to be contradictory to the required positive semi-definiteness of the Gram matrix. The solution is the definition of positive definite kernels, as it is equivalent to a positive semi-definite Gram matrix. In detail, a symmetric function $k\colon \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ is a positive definite kernel on $\mathcal{Z}$ if

$\sum_{j=1}^{n_D} \sum_{i=1}^{n_D} k(x_{dat}^{\{i\}}, x_{dat}^{\{j\}}) c_i c_j \geq 0$

holds for any $n_D \in \mathbb{N}$, $x_{dat}^{\{1\}}, \ldots, x_{dat}^{\{n_D\}} \in \mathcal{Z}$ and $c_1, \ldots, c_{n_D} \in \mathbb{R}$. Thus, there exists a positive semi-definite matrix $A_G \in \mathbb{R}^{n_D \times n_D}$ such that

$c^\top A_G c = \sum_{j=1}^{n_D} \sum_{i=1}^{n_D} k(x_{dat}^{\{i\}}, x_{dat}^{\{j\}}) c_i c_j$ (41)

holds for any $n_D \in \mathbb{N}$, $x_{dat}^{\{1\}}, \ldots, x_{dat}^{\{n_D\}} \in \mathcal{Z}$ and $c \in \mathbb{R}^{n_D}$.

The set of functions $k$ which fulfill this condition is denoted by $\mathcal{K}$. Kernel functions can be separated into two classes, stationary and non-stationary kernels. A stationary kernel is a function of the distance $z - z'$. Thus, it is invariant to translations in the input space. In contrast, non-stationary kernels depend directly on $z, z'$ and are often functions of a dot product $z^\top z'$. In the following, we list some common kernel functions with their basic properties. Even though the number of presented kernels is limited, new kernels can be constructed easily, as $\mathcal{K}$ is closed under specific operations such as addition and scalar multiplication. At the end, we summarize the equation of each kernel in Table 1 and provide a comparative example.

The equation for the constant kernel is given by

$k(z, z') = \varphi_0^2$. (42)

This kernel is mostly used in addition to other kernel functions. It depends on a single hyperparameter $\varphi_0 \in \mathbb{R}_{\geq 0}$.

The equation for the linear kernel is given by

$k(z, z') = z^\top z'$. (43)

The linear kernel is a dot-product kernel and thus non-stationary. The kernel can be obtained from Bayesian linear regression as shown in Section 2.3. The linear kernel is often used in combination with the constant kernel to include a bias.

The equation for the polynomial kernel is given by

$k(z, z') = \left( z^\top z' + \varphi_0^2 \right)^p, \quad p \in \mathbb{N}$. (44)

The polynomial kernel has an additional parameter $p \in \mathbb{N}$ that determines the degree of the polynomial. Since a dot product is contained, the kernel is also non-stationary. The prior variance grows rapidly for $\|z\| > 1$; the hyperparameter is $\varphi_0 \in \mathbb{R}_{\geq 0}$.

The equation for the Matérn kernel is given by

$k(z, z') = \varphi_0^2 \exp\left( -\dfrac{\sqrt{2\check{p}} \, \|z - z'\|}{\varphi_1} \right) \dfrac{p!}{(2p)!} \sum_{i=0}^{p} \dfrac{(p+i)!}{i!(p-i)!} \left( \dfrac{2\sqrt{2\check{p}} \, \|z - z'\|}{\varphi_1} \right)^{p-i}$ (45)

with $\check{p} = p + \frac{1}{2}$, $p \in \mathbb{N}$. The Matérn kernel is a very powerful kernel and is presented here with the most common parameterization for $\check{p}$. Functions drawn from a GP model with Matérn kernel are $p$-times differentiable. The more general equation of this stationary kernel can be found in [Bis06]. This kernel is a universal kernel, which is explained in the following.

Lemma 1 ([SC08, Lemma 4.55]). Consider the RKHS $\mathcal{H}(\mathcal{Z}_c)$ of a universal kernel on any prescribed compact subset $\mathcal{Z}_c \subset \mathcal{Z}$. Given any positive number $\varepsilon$ and any function $f_C \in C(\mathcal{Z}_c)$, there is a function $f_H \in \mathcal{H}(\mathcal{Z}_c)$ such that $\|f_C - f_H\|_{\mathcal{Z}_c} \leq \varepsilon$.

Intuitively speaking, a GPR with a universal kernel can approximate any continuous function arbitrarily exactly on a compact set. For $p \to \infty$, the Matérn kernel results in the squared exponential kernel. The two hyperparameters are $\varphi_0 \in \mathbb{R}_{\geq 0}$ and $\varphi_1 \in \mathbb{R}_{>0}$.

The equation for the squared exponential kernel is given by

$k(z, z') = \varphi_0^2 \exp\left( -\dfrac{\|z - z'\|^2}{2\varphi_1^2} \right)$. (46)

Probably the most widely used kernel function for GPR is the squared exponential kernel, see [Ras06].
The hyperparameter $\varphi_0^2$ describes the signal variance, which determines the average distance of the data-generating function from its mean. The lengthscale $\varphi_1$ defines how far one needs to move along a particular axis in input space for the function values to become uncorrelated. Formally, the lengthscale determines the number of expected upcrossings of the level zero in a unit interval by a zero-mean GP. The squared exponential kernel is infinitely differentiable, which means that the GPR exhibits a smooth behavior. As the limit of the Matérn kernel, it is also a universal kernel, see [MXZ06].

Example 6.
Figure 5 shows the regression power of universal kernel functions. In this example, a GPR with squared exponential kernel is used for different training data sets. The hyperparameters are optimized individually for each training data set by means of the likelihood, see Section 3.2. Note that all presented regressions are based on the same
GP model, i.e., the same kernel function, but with different data sets. This highlights again the superior flexibility of GPR.

Figure 5: Examples of the flexibility of the regression; all are based on the same GP model.
The equation for the rational quadratic kernel is given by

$k(z, z') = \varphi_0^2 \left( 1 + \dfrac{\|z - z'\|^2}{2 p \varphi_1^2} \right)^{-p}, \quad p \in \mathbb{N}$. (47)

This kernel is equivalent to summing over infinitely many squared exponential kernels with different lengthscales. Hence, GP priors with this kernel expect to see functions which vary smoothly across many lengthscales. The parameter $p$ determines the relative weighting of large-scale and small-scale variations. For $p \to \infty$, the rational quadratic kernel is identical to the squared exponential kernel.

The equation for the squared exponential ARD kernel is given by

$k(z, z') = \varphi_0^2 \exp\left( -\tfrac{1}{2} (z - z')^\top P^{-1} (z - z') \right), \quad P = \operatorname{diag}(\varphi_1^2, \ldots, \varphi_{n_z}^2)$. (48)

The automatic relevance determination (ARD) extension to the squared exponential kernel allows for independent lengthscales $\varphi_1, \ldots, \varphi_{n_z} \in \mathbb{R}_{>0}$ for each dimension of $z, z' \in \mathbb{R}^{n_z}$. The individual lengthscales are typically larger for dimensions which are irrelevant, as the covariance becomes almost independent of that input. A more detailed discussion about the advantages of different kernels can be found, for instance, in [Mac97] and [Bis06].
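To make the kernels presented in this section (summarized in Table 1 below) concrete, here is a small Python sketch, ours, implementing a few of them and checking that their Gram matrices are positive semidefinite, as required for a valid kernel [SC04]. The function names and hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def k_linear(z1, z2):
    return float(z1 @ z2)

def k_poly(z1, z2, p=2, phi0=1.0):
    return (float(z1 @ z2) + phi0**2) ** p

def k_se(z1, z2, phi0=1.0, phi1=1.0):
    return phi0**2 * np.exp(-np.sum((z1 - z2)**2) / (2 * phi1**2))

def k_se_ard(z1, z2, phi0=1.0, lengthscales=(1.0, 2.0)):
    P_inv = np.diag(1.0 / np.asarray(lengthscales)**2)
    d = z1 - z2
    return phi0**2 * np.exp(-0.5 * d @ P_inv @ d)

def gram(kernel, X):
    """Gram matrix (5) for the columns of X."""
    n = X.shape[1]
    return np.array([[kernel(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])

# A valid kernel must yield a positive semidefinite Gram matrix for any inputs
X = np.random.default_rng(1).normal(size=(2, 6))
for kern in (k_linear, k_poly, k_se, k_se_ard):
    eigvals = np.linalg.eigvalsh(gram(kern, X))
    print(kern.__name__, eigvals.min() >= -1e-9)
```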
Example 7. In this example, we use three GPRs with the same set of training data

$X = [1, 3, 5, 7, 9], \qquad Y = [0, 1, 2, 3, 6]$ (49)

but with different kernels, namely the squared exponential (46), the linear (43), and the polynomial (44) kernel. Figure 6 shows the different shapes of the regressions with the posterior mean (red), the posterior variance (gray shaded) and the training points (black). Even for this simple data set, the flexibility of the squared exponential kernel is already visible.

Constant: $k(z, z') = \varphi_0^2$
Linear: $k(z, z') = z^\top z'$
Polynomial ($p \in \mathbb{N}$): $k(z, z') = \left( z^\top z' + \varphi_0^2 \right)^p$
Matérn ($\check{p} = p + \frac{1}{2}$, $p \in \mathbb{N}$): $k(z, z') = \varphi_0^2 \exp\left( -\frac{\sqrt{2\check{p}} \|z - z'\|}{\varphi_1} \right) \frac{p!}{(2p)!} \sum_{i=0}^{p} \frac{(p+i)!}{i!(p-i)!} \left( \frac{2\sqrt{2\check{p}} \|z - z'\|}{\varphi_1} \right)^{p-i}$
Squared exponential: $k(z, z') = \varphi_0^2 \exp\left( -\frac{\|z - z'\|^2}{2\varphi_1^2} \right)$
Rational quadratic ($p \in \mathbb{N}$): $k(z, z') = \varphi_0^2 \left( 1 + \frac{\|z - z'\|^2}{2 p \varphi_1^2} \right)^{-p}$
Squared exponential ARD: $k(z, z') = \varphi_0^2 \exp\left( -\frac{1}{2} (z - z')^\top P^{-1} (z - z') \right)$, $P = \operatorname{diag}(\varphi_1^2, \ldots, \varphi_{n_z}^2)$

Table 1: Overview of some commonly used kernel functions.

In addition to the selection of a kernel function, values for the hyperparameters must be determined to perform the regression. The number of hyperparameters depends on the kernel function used. We concatenate all hyperparameters in a vector $\varphi$ of size $n_\varphi \in \mathbb{N}$, where $\varphi \in \Phi \subseteq \mathbb{R}^{n_\varphi}$. The hyperparameter set $\Phi$ is introduced to cover the different spaces of the individual hyperparameters, as defined in the following.

Definition 1.
The set $\Phi$ is called a hyperparameter set for a kernel function $k$ if and only if $\Phi$ is a domain for the hyperparameters $\varphi$ of $k$.

Often, the signal noise $\sigma_n$, see (4), is also treated as a hyperparameter. For a better understanding, we keep the signal noise separated from the hyperparameters. There exist several techniques that allow computing the hyperparameters and the signal noise with respect to an optimality criterion. From a Bayesian perspective, we want to find the vector of hyperparameters $\varphi$ which is most likely for the output data $Y$ given the inputs $X$ and a GP model. For this purpose, one approach is to optimize the log marginal likelihood function of the GP. Another idea is to split the training set into two disjoint sets, one which is actually used for training, and the other, the validation set, which is used to monitor performance. This approach is known as cross-validation. In the following, these two techniques for the selection of hyperparameters are presented.

A very common method for the optimization of the hyperparameters is by means of the negative log marginal likelihood function, often simply named the (negative log) likelihood function. It is marginal since it is obtained through marginalization over the function $f_{GP}$. The marginal likelihood is the likelihood that the output data $Y \in \mathbb{R}^{n_D}$ fits to the input data $X$ with the hyperparameters $\varphi$. It is given by
$\log p(Y \mid X, \varphi) = -\dfrac{1}{2} Y^\top (K + \sigma_n^2 I_{n_D})^{-1} Y - \dfrac{1}{2} \log\left| K + \sigma_n^2 I_{n_D} \right| - \dfrac{n_D}{2} \log 2\pi$. (50)

A detailed derivation can be found in [Ras06]. The three terms of the marginal likelihood in (50) have the following roles:
• $Y^\top (K + \sigma_n^2 I_{n_D})^{-1} Y$ is the only term that depends on the output data $Y$ and represents the data fit.
• $\log\left| K + \sigma_n^2 I_{n_D} \right|$ penalizes the complexity depending on the kernel function and the input data $X$.
• $\frac{n_D}{2} \log 2\pi$ is a normalization constant.

Remark 5.
For the sake of notational simplicity, we suppress the dependency of the kernel function $k$ on the hyperparameters whenever possible.

The optimal hyperparameters $\varphi^* \in \Phi$ and signal noise $\sigma_n^*$ in the sense of the likelihood are obtained as the minimum of the negative log marginal likelihood function

$\begin{bmatrix} \varphi^* \\ \sigma_n^* \end{bmatrix} = \operatorname*{arg\,min}_{\varphi \in \Phi, \, \sigma_n \in \mathbb{R}_{\geq 0}} \; -\log p(Y \mid X, \varphi)$. (51)

Since setting the derivative of (50) to zero admits no analytic solution in general, a gradient-based optimization algorithm is typically used to minimize the function. However, the negative log likelihood is non-convex in general, such that there is no guarantee to find the optimum $\varphi^*, \sigma_n^*$. In fact, every local minimum corresponds to a particular interpretation of the data. In the following example, we visualize how the hyperparameters affect the regression.
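A compact sketch of this procedure in Python, assuming a squared exponential kernel, scalar inputs and zero prior mean: it evaluates (50) and minimizes its negative with a standard optimizer. The log-parameterization of the hyperparameters and all names are our choices, not prescribed by the text.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, Y):
    """Negative log marginal likelihood (50) for a squared exponential kernel.

    params = log([phi_0, phi_1, sigma_n]); the log-parameterization keeps them positive.
    """
    phi0, phi1, sigma_n = np.exp(params)
    d = X.reshape(-1, 1) - X.reshape(1, -1)
    K = phi0**2 * np.exp(-d**2 / (2 * phi1**2)) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return (0.5 * Y @ alpha                          # data-fit term
            + np.sum(np.log(np.diag(L)))             # 0.5 * log|K|, complexity term
            + 0.5 * len(X) * np.log(2 * np.pi))      # normalization constant

X = np.array([0.0, 1.0, 2.5, 4.0, 5.0, 7.0, 8.5, 10.0])
Y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=len(X))

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, Y), method="L-BFGS-B")
print("optimized [phi_0, phi_1, sigma_n]:", np.exp(res.x))
```

Because of the non-convexity mentioned above, restarting the optimizer from several initial values is common practice.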
Example 8. A GPR with the squared exponential kernel is trained on eight data points. The signal variance is fixed to $\varphi_0 = 2.13$. First, we visualize the influence of the lengthscale. For this purpose, the signal noise is fixed to $\sigma_n = 0.21$. Figure 7 shows the posterior mean of the regression and the negative log likelihood function. On the left side are three posterior means for different lengthscales. A short lengthscale results in overfitting, whereas a large lengthscale smooths out the training data (black crosses). The dotted red function represents the mean with the lengthscale optimized by a gradient descent algorithm with respect to (51). The right plot shows the negative log likelihood over the signal variance $\varphi_0$ and lengthscale $\varphi_1$. The minimum is here at $\varphi^* = [2.13, 1.58]^\top$.

Figure 7: Left: Posterior means for the lengthscales $\varphi_1 = 0.67$ (cyan, solid), $\varphi_1 = 7.39$ (brown, dashed), and $\varphi_1 = 1.58$ (red, dotted). Right: Negative log likelihood function over signal variance $\varphi_0$ and lengthscale $\varphi_1$.

Next, the meaning of different interpretations of the data is visualized by varying the signal noise $\sigma_n$ and the lengthscale $\varphi_1$. The right plot of Fig. 8 shows two minima of the negative log likelihood function, for instance the lower left minimum at $\log(\sigma_n) = 0.73$; each minimum corresponds to a different interpretation of the data.

Figure 8: Negative log likelihood function over signal noise $\sigma_n$ and lengthscale $\varphi_1$.

The cross-validation approach works with a separation of the data set $\mathcal{D}$ into two classes: one for training and one for validation. Cross-validation is almost always used in the $k_{cv}$-fold cross-validation setting: the data is split into $k_{cv}$ disjoint, equally sized subsets; validation is done on a single subset and training is done using the union of the remaining $k_{cv} - 1$ subsets, the entire procedure being repeated $k_{cv}$ times, each time with a different subset for validation. Here, without loss of generality, we present the leave-one-out cross-validation, which means $k_{cv} = n_D$. The predictive log probability when leaving out a training point $\{x_{dat}^{\{i\}}, \tilde{y}_{dat}^{\{i\}}\}$ is given by
$\log p(\tilde{y}_{dat}^{\{i\}} \mid X, Y_{-i}, \varphi) = -\dfrac{1}{2} \log(\operatorname{var}_{-i}) - \dfrac{\big(\tilde{y}_{dat}^{\{i\}} - \mu_{-i}\big)^2}{2 \operatorname{var}_{-i}} - \dfrac{1}{2} \log 2\pi$, (52)

where $\mu_{-i} = \mu(f_{GP}(x_{dat}^{\{i\}}) \mid x_{dat}^{\{i\}}, X_{:,-i}, Y_{-i})$ and $\operatorname{var}_{-i} = \operatorname{var}(f_{GP}(x_{dat}^{\{i\}}) \mid x_{dat}^{\{i\}}, X_{:,-i}, Y_{-i})$. The index $-i$ indicates $X$ and $Y$ without the element $x_{dat}^{\{i\}}$ and $\tilde{y}_{dat}^{\{i\}}$, respectively. Thus, (52) is the probability for the output $\tilde{y}_{dat}^{\{i\}}$ at $x_{dat}^{\{i\}}$ but without the training point $\{x_{dat}^{\{i\}}, \tilde{y}_{dat}^{\{i\}}\}$. Accordingly, the leave-one-out log predictive probability $L_{LOO} \in \mathbb{R}$ is

$L_{LOO} = \sum_{i=1}^{n_D} \log p(\tilde{y}_{dat}^{\{i\}} \mid X, Y_{-i}, \varphi)$. (53)

In comparison to the log likelihood approach (51), cross-validation is in general more computationally expensive but might find a better representation of the data set, see [GE79] for a discussion and related approaches.
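A short sketch, ours, of how $L_{LOO}$ from (52) and (53) can be evaluated by re-fitting a zero-mean SE-kernel predictor once per left-out point; for a fixed kernel this can also be computed more efficiently in closed form, see [Ras06]. Data and names are illustrative.

```python
import numpy as np

def gp_mean_var(z_star, X, Y, phi0=1.0, phi1=1.0, sigma_n=0.1):
    """Posterior mean and variance (8) for scalar inputs, zero mean, SE kernel."""
    k = lambda a, b: phi0**2 * np.exp(-(a - b)**2 / (2 * phi1**2))
    K = k(X.reshape(-1, 1), X.reshape(1, -1)) + sigma_n**2 * np.eye(len(X))
    ks = k(z_star, X)
    w = np.linalg.solve(K, ks)
    return w @ Y, k(z_star, z_star) - ks @ w

def loo_log_predictive(X, Y, **hyp):
    """Leave-one-out log predictive probability L_LOO, cf. (52) and (53)."""
    L = 0.0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        mu_i, var_i = gp_mean_var(X[i], X[mask], Y[mask], **hyp)
        L += -0.5 * np.log(var_i) - (Y[i] - mu_i)**2 / (2 * var_i) - 0.5 * np.log(2 * np.pi)
    return L

X = np.linspace(0.0, 10.0, 8)
Y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=8)
print(loo_log_predictive(X, Y, phi1=1.5, sigma_n=0.1))
```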
So far, we considered GPR in non-dynamical settings where only an input-to-output mapping is considered. However, Gaussian process dynamical models (GPDMs) have recently become a versatile tool in system identification because of their beneficial properties such as the bias-variance trade-off and the strong connection to Bayesian mathematics, see [FCR14]. In many works where GPs are applied to dynamical models, only the mean function of the process is employed, e.g., in [WHB05] and [Cho+13]. This is mainly because GP models are often used to replace deterministic parametric models. However, GPDMs contain a much richer description of the underlying dynamics, including the uncertainty about the model itself when the full probabilistic representation is considered. One main aspect of GPDMs is to distinguish between recurrent and non-recurrent structures. A model is called recurrent if parts of the regression vector depend on the outputs of the model. Even though recurrent models become more complex in terms of their behavior, they allow to model sequences of data, see [Sjö+95]. If all states are fed back from the model itself, we get a simulation model, which is a special case of the recurrent structure. The advantage of such a model is its property to be independent from the real system. Thus, it is suitable for simulations, as it allows multi-step ahead predictions. In this report, we focus on two often-used recurrent structures: the Gaussian process state space model (GP-SSM) and the Gaussian process nonlinear output error (GP-NOE) model.

Gaussian process state space models are structured as a discrete-time system. In this case, the states are the regressors, which is visualized in Fig. 9. This approach allows to be more efficient, since the regressors are less restricted in their internal structure than in input-output models. Thus, a very efficient model in terms of the number of regressors might be possible. The mapping from the states to the output is often assumed to be known. The situation where the output mapping describes a known sensor model is such an example. It is mentioned in [Fri+13] that using too flexible models for both the state mapping $f$ and the output mapping can result in problems of non-identifiability. Therefore, we focus on a known output mapping. The mathematical model of the GP-SSM is thus given by

$x_{t+1} = f(\xi_t) = \begin{bmatrix} f_1(\xi_t) \sim \mathcal{GP}(m_1(\xi_t), k_1(\xi_t, \xi_t')) \\ \vdots \\ f_{n_x}(\xi_t) \sim \mathcal{GP}(m_{n_x}(\xi_t), k_{n_x}(\xi_t, \xi_t')) \end{bmatrix}, \qquad y_t \sim p(y_t \mid x_t, \gamma_y)$, (54)

where $\xi_t \in \mathbb{R}^{n_\xi}$, $n_\xi = n_x + n_u$, is the concatenation of the state vector $x_t \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$ and the input $u_t \in \mathcal{U} \subseteq \mathbb{R}^{n_u}$ such that $\xi_t = [x_t; u_t]$. The mean functions are given by continuous functions $m_1, \ldots, m_{n_x}\colon \mathbb{R}^{n_\xi} \to \mathbb{R}$. The output mapping is parametrized by a known vector $\gamma_y \in \mathbb{R}^{n_\gamma}$ with $n_\gamma \in \mathbb{N}$. The system identification task for the GP-SSM mainly focuses on $f$ in particular. It can be described as finding the state-transition probability conditioned on the observed training data.

Remark 6.
The potentially unknown number of regressors can be determined using established nonlinear identification techniques as presented in [KL99], or by exploiting embedded techniques such as automatic relevance determination [Koc16]. A mismatch leads to similar issues as in parametric system identification.

Figure 9: Structure of a GP-SSM with $P$ as backshift operator, such that $P^{-1} x_{t+1} = x_t$.

The GP-NOE model uses the past $n_{in} \in \mathbb{N}_{>0}$ input values $u_t \in \mathcal{U}$ and the past $n_{out} \in \mathbb{N}_{>0}$ output values $y_t \in \mathbb{R}^{n_y}$ of the model as the regressors. Figure 10 shows the structure of the GP-NOE model, where the outputs are fed back. Analogously to the GP-SSM, the mathematical model of the GP-NOE is given by

$y_{t+1} = h(\zeta_t) = \begin{bmatrix} h_1(\zeta_t) \sim \mathcal{GP}(m_1(\zeta_t), k_1(\zeta_t, \zeta_t')) \\ \vdots \\ h_{n_y}(\zeta_t) \sim \mathcal{GP}(m_{n_y}(\zeta_t), k_{n_y}(\zeta_t, \zeta_t')) \end{bmatrix}$, (55)

where $\zeta_t \in \mathbb{R}^{n_\zeta}$, $n_\zeta = n_{out} n_y + n_{in} n_u$, is the concatenation of the past outputs $y_t$ and inputs $u_t$ such that $\zeta_t = [y_{t-n_{out}+1}; \ldots; y_t; u_{t-n_{in}+1}; \ldots; u_t]$. The mean functions are given by continuous functions $m_1, \ldots, m_{n_y}\colon \mathbb{R}^{n_\zeta} \to \mathbb{R}$. In contrast to nonlinear autoregressive exogenous models, which focus on one-step ahead prediction, an NOE model is more suitable for simulations as it considers the multi-step ahead prediction [Nel13]. However, the drawback is a more complex training procedure that requires a nonlinear optimization scheme due to the recurrent structure [Koc16].

Figure 10: Structure of a GP-NOE model with $P$ as backshift operator ($P^{-1} y_{t+1} = y_t$).
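The following sketch illustrates a multi-step ahead simulation with a GP-SSM where, as in the works mentioned above that replace deterministic parametric models, only the posterior mean of the state transition is propagated and fed back. The scalar dynamics, the training data, and all names are invented for the illustration.

```python
import numpy as np

def gp_mean(z_star, Z, T, phi1=1.0, sigma_n=0.05):
    """Posterior mean (8) with an SE kernel; Z holds the training regressors row-wise."""
    sq = lambda A, B: np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    K = np.exp(-sq(Z, Z) / (2 * phi1**2)) + sigma_n**2 * np.eye(len(Z))
    ks = np.exp(-sq(z_star[None, :], Z) / (2 * phi1**2))[0]
    return ks @ np.linalg.solve(K, T)

# Illustrative training data for a scalar GP-SSM: regressor xi_t = [x_t, u_t], target x_{t+1}
rng = np.random.default_rng(0)
Xi = rng.uniform(-2, 2, size=(30, 2))
X_next = 0.8 * Xi[:, 0] + 0.3 * np.sin(Xi[:, 1])      # assumed true transition, noise-free

def simulate(x0, u_seq):
    """Multi-step ahead prediction: feed the predicted state back as regressor."""
    x, traj = x0, [x0]
    for u in u_seq:
        x = gp_mean(np.array([x, u]), Xi, X_next)     # x_{t+1} = posterior mean of f(xi_t)
        traj.append(x)
    return np.array(traj)

print(simulate(0.5, np.sin(0.3 * np.arange(20))))
```

A full probabilistic rollout would instead propagate the predictive distribution, e.g., by sampling as in the scenario approach of Section 4.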
Remark 7. It is always possible to convert an identified input-output model into a state-space model, see [PL70]. However, focusing on state-space models only would preclude the development of a large number of useful identification results.
Remark 8.
Control-relevant properties of GP-SSMs and GP-NOE models are discussed in [BH16b; BH16a; BH20].

In this article, we introduce the GP and its usage in GPR. Based on the property that every finite subset of a GP follows a multivariate Gaussian distribution, a closed-form equation can be derived to predict the mean and variance for a new test point. The GPR can intrinsically handle noisy output data if the noise is Gaussian distributed. As GPR is a data-driven method, only little prior knowledge is necessary for the regression. Further, the complexity of GP models scales with the number of training points. A degree of freedom in the modeling is the selection of the kernel function and its hyperparameters. We present an overview of common kernels and the necessary properties of a valid kernel function. For the determination of the hyperparameters, two approaches based on numerical optimization are shown. The kernel of the GP is uniquely related to an RKHS, which determines the shape of the samples of the GP. Based on this, we compare different approaches for the quantification of the model error, i.e., the error between the GPR prediction and the actual data-generating function. Finally, we introduce how GP models can be used as dynamical models in GP-SSMs and GP-NOE models.
A Conditional Distribution
Let $\nu_1 \in \mathbb{R}^{n_{\nu_1}}$, $\nu_2 \in \mathbb{R}^{n_{\nu_2}}$ with $n_{\nu_1}, n_{\nu_2} \in \mathbb{N}$ be random variables which follow a joint multivariate Gaussian distribution

$\begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix} \right)$ (56)

with means $\mu_1 \in \mathbb{R}^{n_{\nu_1}}$, $\mu_2 \in \mathbb{R}^{n_{\nu_2}}$ and covariances $\Sigma_{11} \in \mathbb{R}^{n_{\nu_1} \times n_{\nu_1}}$, $\Sigma_{12} \in \mathbb{R}^{n_{\nu_1} \times n_{\nu_2}}$, $\Sigma_{22} \in \mathbb{R}^{n_{\nu_2} \times n_{\nu_2}}$. The task is to determine the conditional probability

$p(\nu_2 \mid \nu_1) = \dfrac{p(\nu_1, \nu_2)}{p(\nu_1)}$. (57)

The joint probability $p(\nu_1, \nu_2)$ is a multivariate Gaussian distribution with

$p(\nu_1, \nu_2) = \dfrac{1}{(2\pi)^{(n_{\nu_1} + n_{\nu_2})/2} \sqrt{\det(\Sigma)}} \exp\left( -\dfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$ (58)

$\mu := \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma := \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}$, (59)

where $x = [x_1; x_2]$, $x_1 \in \mathbb{R}^{n_{\nu_1}}$, $x_2 \in \mathbb{R}^{n_{\nu_2}}$. The marginal distribution of $\nu_1$ is defined by the mean $\mu_1$ and the variance $\Sigma_{11}$ such that

$p(\nu_1) = \dfrac{1}{(2\pi)^{n_{\nu_1}/2} \sqrt{\det(\Sigma_{11})}} \exp\left( -\dfrac{1}{2} (x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1) \right)$. (60)

The division of the joint distribution by the marginal distribution results again in a Gaussian distribution with

$p(\nu_2 \mid \nu_1) = \underbrace{\dfrac{\sqrt{\det(\Sigma_{11})}}{(2\pi)^{n_{\nu_2}/2} \sqrt{\det(\Sigma)}}}_{*} \exp\left( -\dfrac{1}{2} \underbrace{\left[ (x - \mu)^\top \Sigma^{-1} (x - \mu) - (x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1) \right]}_{**} \right)$, (61)

where the first part $*$ can be rewritten as

$* = \dfrac{1}{(2\pi)^{n_{\nu_2}/2}} \sqrt{\dfrac{\det(\Sigma_{11})}{\det(\Sigma_{11}) \det(\Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12})}} = \dfrac{1}{(2\pi)^{n_{\nu_2}/2} \sqrt{\det(\Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12})}}$. (62)

Thus, the covariance matrix $\Sigma_{2|1}$ of the conditional distribution $p(\nu_2 \mid \nu_1)$ is given by

$\Sigma_{2|1} = \Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12}$. (63)

For the simplification of the second part $**$ of (61), we exploit the special block structure of $\Sigma$, such that its inverse is given by

$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}, \qquad \Sigma^{-1} = \begin{bmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{bmatrix}$ (64)

$\tilde{\Sigma}_{11} = \Sigma_{11}^{-1} + \Sigma_{11}^{-1} \Sigma_{12} N \Sigma_{12}^\top \Sigma_{11}^{-1}, \quad \tilde{\Sigma}_{12} = -\Sigma_{11}^{-1} \Sigma_{12} N, \quad \tilde{\Sigma}_{21} = -N \Sigma_{12}^\top \Sigma_{11}^{-1}, \quad \tilde{\Sigma}_{22} = N$ (65)

with $N = (\Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12})^{-1} = \Sigma_{2|1}^{-1}$. Thus, we compute $**$ as

$** = \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \begin{bmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} - (x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1)$ (66)

$= (x_2 - \mu_2)^\top \Sigma_{2|1}^{-1} (x_2 - \mu_2) + 2 (x_2 - \mu_2)^\top \left( -\Sigma_{2|1}^{-1} \Sigma_{12}^\top \Sigma_{11}^{-1} \right) (x_1 - \mu_1) + (x_1 - \mu_1)^\top \left( \Sigma_{11}^{-1} + \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{2|1}^{-1} \Sigma_{12}^\top \Sigma_{11}^{-1} \right) (x_1 - \mu_1) - (x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1)$ (67)

$= \big( x_2 - \underbrace{(\mu_2 + \Sigma_{12}^\top \Sigma_{11}^{-1} (x_1 - \mu_1))}_{\mu_{2|1}} \big)^\top \Sigma_{2|1}^{-1} \big( x_2 - \underbrace{(\mu_2 + \Sigma_{12}^\top \Sigma_{11}^{-1} (x_1 - \mu_1))}_{\mu_{2|1}} \big)$ (68)

Finally, the conditional probability is given with the conditional mean $\mu_{2|1} \in \mathbb{R}^{n_{\nu_2}}$ and the conditional covariance matrix $\Sigma_{2|1} \in \mathbb{R}^{n_{\nu_2} \times n_{\nu_2}}$ by
$p(\nu_2 \mid \nu_1) = \dfrac{1}{(2\pi)^{n_{\nu_2}/2} \sqrt{\det(\Sigma_{2|1})}} \exp\left( -\dfrac{1}{2} (x_2 - \mu_{2|1})^\top \Sigma_{2|1}^{-1} (x_2 - \mu_{2|1}) \right)$ (69)

$\mu_{2|1} = \mu_2 + \Sigma_{12}^\top \Sigma_{11}^{-1} (x_1 - \mu_1), \qquad \Sigma_{2|1} = \Sigma_{22} - \Sigma_{12}^\top \Sigma_{11}^{-1} \Sigma_{12}$. (70)
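As a small sanity check of (70) and of the block-inverse identities (64)-(65) used in the derivation, the following NumPy snippet (ours) builds a random joint covariance and compares the conditional covariance from (70) with the inverse of the lower-right block of $\Sigma^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 2

# Random symmetric positive definite joint covariance with blocks S11, S12, S22, cf. (59)
A = rng.normal(size=(n1 + n2, n1 + n2))
S = A @ A.T + (n1 + n2) * np.eye(n1 + n2)
S11, S12, S22 = S[:n1, :n1], S[:n1, n1:], S[n1:, n1:]

# Conditional covariance from (70): Sigma_{2|1} = S22 - S12^T S11^{-1} S12
S_cond = S22 - S12.T @ np.linalg.solve(S11, S12)

# Block-inverse identity (64)-(65): the lower-right block of Sigma^{-1} equals Sigma_{2|1}^{-1}
S_inv = np.linalg.inv(S)
print(np.allclose(np.linalg.inv(S_inv[n1:, n1:]), S_cond))   # True

# Conditional mean from (70) for an arbitrary observed x1
mu1, mu2 = rng.normal(size=n1), rng.normal(size=n2)
x1 = rng.normal(size=n1)
mu_cond = mu2 + S12.T @ np.linalg.solve(S11, x1 - mu1)
print(mu_cond)
```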
References

[ÁRL12] Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. "Kernels for Vector-Valued Functions: A Review". In: Foundations and Trends in Machine Learning (2012).
[Aro50] Nachman Aronszajn. "Theory of reproducing kernels". In: Transactions of the American Mathematical Society (1950).
[Ber+16] Felix Berkenkamp, Riccardo Moriconi, Angela P. Schoellig, and Andreas Krause. "Safe Learning of Regions of Attraction for Uncertain, Nonlinear Systems with Gaussian Processes". 2016, pp. 4661-4666.
[Ber+17] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. "Safe Model-based Reinforcement Learning with Stability Guarantees". In: Advances in Neural Information Processing Systems. 2017, pp. 908-918.
[BH16a] Thomas Beckers and Sandra Hirche. "Equilibrium distributions and stability analysis of Gaussian Process State Space Models". 2016, pp. 6355-6361.
[BH16b] Thomas Beckers and Sandra Hirche. "Stability of Gaussian Process State Space Models". 2016, pp. 2275-2281.
[BH20] Thomas Beckers and Sandra Hirche. "Prediction with Gaussian Process Dynamical Models". In: Transactions on Automatic Control (2020). Submitted.
[Bis06] Christopher Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006. ISBN: 9780387310732.
[BKH19] Thomas Beckers, Dana Kulić, and Sandra Hirche. "Stable Gaussian Process based Tracking Control of Euler-Lagrange Systems". In: Automatica 103 (2019), pp. 390-397.
[BLG16] Yusuf Bhujwalla, Vincent Laurain, and Marion Gilson. "The impact of smoothness on model class selection in nonlinear system identification: An application of derivatives in the RKHS". 2016, pp. 1808-1813.
[BUH18] Thomas Beckers, Jonas Umlauft, and Sandra Hirche. "Mean Square Prediction Error of Misspecified Gaussian Process Models". 2018, pp. 1162-1167.
[Cho+13] Girish Chowdhary, Hassan Kingravi, Jonathan How, and Patricio A. Vela. "Bayesian nonparametric adaptive control of time-varying systems using Gaussian processes". 2013, pp. 2655-2661.
[DM+05] Lokenath Debnath, Piotr Mikusinski, et al. Introduction to Hilbert Spaces with Applications. Academic Press, 2005.
[FCR14] Roger Frigola, Yutian Chen, and Carl E. Rasmussen. Variational Gaussian Process State-Space Models. 2014. arXiv preprint.
[Fri+13] Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl E. Rasmussen. "Bayesian inference and learning in Gaussian process state-space models with particle MCMC". In: Advances in Neural Information Processing Systems. 2013, pp. 3156-3164.
[GE79] Seymour Geisser and William F. Eddy. "A Predictive Approach to Model Selection". In: Journal of the American Statistical Association (1979).
[KL99] Robert Haber and László Keviczky. Nonlinear System Identification: Input-Output Modeling Approach. Springer Netherlands, 1999. ISBN: 9780792358589.
[Koc16] Juš Kocijan. Modelling and Control of Dynamic Systems Using Gaussian Process Models. Springer International Publishing, 2016. DOI: 10.1007/978-3-319-21021-6.
[Mac97] David J. MacKay. Gaussian Processes - A Replacement for Supervised Neural Networks? 1997.
[MXZ06] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. "Universal Kernels". In: Journal of Machine Learning Research (2006).
[Nel13] Oliver Nelles. Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Springer-Verlag Berlin Heidelberg, 2013. DOI: 10.1007/978-3-662-04323-3.
[PL70] M. Q. Phan and R. W. Longman. "Relationship between state-space and input-output models via observer Markov parameters". In: WIT Transactions on The Built Environment 22 (1970).
[Rad96] Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, 1996.
[Ras06] Carl E. Rasmussen. Gaussian Processes for Machine Learning. The MIT Press, 2006. ISBN: 9780262182539.
[SC04] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[SC08] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[SHS06] Ingo Steinwart, Don Hush, and Clint Scovel. "An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels". In: IEEE Transactions on Information Theory (2006).
[Sjö+95] Jonas Sjöberg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. "Nonlinear black-box modeling in system identification: a unified overview". In: Automatica (1995).
[Sri+12] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. "Information-theoretic regret bounds for Gaussian process optimization in the bandit setting". In: IEEE Transactions on Information Theory (2012).
[UBH18] Jonas Umlauft, Thomas Beckers, and Sandra Hirche. "A Scenario-based Optimal Control Approach for Gaussian Process State Space Models". 2018, pp. 1386-1392. DOI: 10.23919/ECC.2018.8550458.
[Wah90] Grace Wahba.