Multifidelity Data Fusion via Gradient-Enhanced Gaussian Process Regression
Yixiang Deng (School of Engineering and Division of Applied Mathematics, Brown University, USA), Guang Lin (Department of Mathematics and School of Engineering, Purdue University, USA), and Xiu Yang* (Department of Industrial and Systems Engineering, Lehigh University, USA)

*Corresponding author: [email protected]

August 4, 2020
Abstract
We propose a data fusion method based on the multi-fidelity Gaussian process regression (GPR) framework. This method combines available data on the quantity of interest (QoI) and its gradients at different fidelity levels; that is, it is a gradient-enhanced Cokriging (GE-Cokriging) method. It approximates both the QoI and its gradients simultaneously, with uncertainty estimates. We compare this method with the conventional multi-fidelity Cokriging method, which does not use gradient information, and the results suggest that GE-Cokriging predicts both the QoI and its gradients more accurately. Moreover, GE-Cokriging generalizes better in some cases where Cokriging performs poorly because of a singular (or near-singular) covariance matrix. We demonstrate the application of GE-Cokriging in several practical cases, including reconstructing the trajectory and velocity of an underdamped oscillator with respect to time simultaneously, and investigating the sensitivity of the power factor of a load bus with respect to varying power inputs of a generator bus in a large-scale power system. Although GE-Cokriging incurs a slightly higher computational cost than Cokriging, the accuracy comparison shows that this cost is usually worthwhile.
1 Introduction

The Gaussian process (GP) is one of the most well-studied stochastic processes in probability and statistics. Given its flexible form of data representation, the GP is a powerful tool for classification and regression, and it is widely used in probabilistic scientific computing, engineering design, geostatistics, data assimilation, machine learning, etc. In particular, given a data set comprising input/output pairs of locations and a quantity of interest (QoI), GP regression (GPR), also known as Kriging, provides a prediction along with a mean squared error (MSE) estimate of the QoI at any location. Alternatively, from the Bayesian perspective, GPR identifies a Gaussian random variable at any location with a posterior mean (corresponding to the prediction) and variance (corresponding to the MSE). Generally speaking, the larger the given data set is, the closer the GPR posterior mean is to the ground truth and the smaller the posterior variance is.

In many practical problems, obtaining a large amount of data can be difficult because of limited resources. There are several approaches to augment the data set in different manners. For example, the original Cokriging method exploits the correlation between multiple QoIs in geostatistical studies. The gradient-enhanced Kriging (GE-Kriging) method, also referred to as gradient-based Kriging in some literature, has been widely investigated in areas such as computational fluid dynamics, especially in aerodynamic optimization problems [4, 32, 15, 3]. Incorporating gradient information in different ways, this method consists of direct and indirect approaches: the former uses the gradient information through an augmented covariance matrix [12], while the latter approximates the gradient via finite differences [3, 37]. The gradient-enhanced Cokriging (GE-Cokriging) method in [16] refers to a GE-Kriging method that uses a covariance function between the QoI and its gradients different from that in conventional GE-Kriging. The GE-Cokriging method in [30] combines multi-fidelity information on the QoI and its gradients to predict the QoI only.

Most of the aforementioned works focus on enhancing the accuracy of predicting the QoI; hence, when gradient information is used, the method is a "gradient-enhanced" approach. However, in many applications, both the QoI and its gradient are important. For example, when studying the phase diagram of a dynamical system, one needs accurate predictions of both location and velocity. Another example is the sensitivity analysis of a system, where the gradient information is critical. Therefore, in this work, we propose a comprehensive multifidelity gradient-enhanced Cokriging method that predicts both the QoI and its gradients simultaneously, based on the GE-Cokriging of [30]. This method exploits the QoI and its gradient from models of different fidelities by combining GE-Kriging and Cokriging to improve the prediction accuracy. In terms of predicting the QoI, this method can be considered "gradient-enhanced", while from the perspective of estimating gradients, it can be considered "integral-enhanced". In this work, GE-Cokriging refers to our proposed multi-fidelity method, not to the GE-Cokriging of [16].

In this paper, we first review GPR (Kriging) and its extension to the multi-fidelity setting (Cokriging). Then we describe GE-Kriging and GE-Cokriging, as well as an equivalent integral-enhanced perspective. Finally, we use four examples to demonstrate the efficacy of our approach.

2 Methodology

2.1 GPR (Kriging)

We present a brief review of the GPR method, adopted from [1, 6, 34]. We denote the observation locations as $X = \{x^{(i)}\}_{i=1}^N$ ($x^{(i)} \in D$, $D \subseteq \mathbb{R}^d$) and the observed values of the QoI at these locations as $y = (y^{(1)}, y^{(2)}, \ldots, y^{(N)})^\top$ ($y^{(i)} \in \mathbb{R}$). For simplicity, we assume that the $y^{(i)}$ are scalars.
The GPR method aims to identify a GP $Y(x, \omega): D \times \Omega \to \mathbb{R}$ based on the input/output data set $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$, where $\Omega$ is the sample space of a probability triple. Here, $x$ can be considered as the parameter of this GP, such that $Y(x, \cdot): \Omega \to \mathbb{R}$ is a Gaussian random variable for any $x$ in $D$. A GP $Y(x, \omega)$ is usually denoted as

  $Y(x) \sim \mathcal{GP}(\mu(x), k(x, x')),$  (2.1)

where $\omega$ is not explicitly listed for brevity, and $\mu(\cdot): D \to \mathbb{R}$ and $k(\cdot, \cdot): D \times D \to \mathbb{R}$ are the mean and covariance functions (the latter also called the kernel function), respectively:

  $\mu(x) = \mathbb{E}\{Y(x)\},$  (2.2)
  $k(x, x') = \mathrm{Cov}\{Y(x), Y(x')\} = \mathbb{E}\{(Y(x) - \mu(x))(Y(x') - \mu(x'))\}.$  (2.3)

The variance of $Y(x)$ is $k(x, x)$, and its standard deviation is $\sigma(x) = \sqrt{k(x, x)}$. The covariance matrix, denoted as $C$, is defined as $C_{ij} = k(x^{(i)}, x^{(j)})$. The functions $\mu(x)$ and $k(x, x')$ are obtained by identifying their hyperparameters via maximizing the log marginal likelihood [31]:

  $\ln L = -\frac{1}{2}(y - \mu)^\top C^{-1}(y - \mu) - \frac{1}{2}\ln|C| - \frac{N}{2}\ln 2\pi,$  (2.4)

where $\mu = (\mu(x^{(1)}), \ldots, \mu(x^{(N)}))^\top$ and $|C|$ is the determinant of $C$. For any $x^* \in D$, the GPR posterior mean and variance are

  $\hat{y}(x^*) = \mu(x^*) + c(x^*)^\top C^{-1}(y - \mu),$  (2.5)
  $\hat{s}^2(x^*) = \sigma^2(x^*) - c(x^*)^\top C^{-1} c(x^*),$  (2.6)

where $c(x^*)$ is a vector of covariances: $(c(x^*))_i = k(x^{(i)}, x^*)$. In practice, it is common to use $\hat{y}(x^*)$ as the prediction, and $\hat{s}^2(x^*)$ is also called the mean squared error (MSE) of the prediction because $\hat{s}^2(x^*) = \mathbb{E}\{(\hat{y}(x^*) - Y(x^*))^2\}$ [6]. Consequently, $\hat{s}(x^*)$, the posterior standard deviation, is called the root mean squared error (RMSE). Moreover, to account for observation noise, one can assume that the noise consists of independent and identically distributed (i.i.d.) Gaussian random variables with zero mean and variance $\delta^2$, and replace $C$ with $C + \delta^2 I$. In this study, we assume that the observations $y$ are noiseless. If $C$ is not invertible or its condition number is very large, one can add a small regularization term $\alpha I$ ($\alpha$ is a small positive real number) to $C$, which is equivalent to assuming observation noise. In addition, $\hat{s}^2$ can be used in global optimization, or in greedy algorithms to identify locations for additional observations.

In the widely used ordinary Kriging method, a stationary GP is assumed [14]. Specifically, $\mu$ is set as a constant, $\mu(x) \equiv \mu$, and $k(x, x') = k(\tau)$, where $\tau = x - x'$. Consequently, $\sigma^2(x) = k(x, x) = k(0) = \sigma^2$ is a constant. The kernels most widely used in scientific computing are the Matérn functions, especially their two special cases, the exponential and squared-exponential (Gaussian) kernels. For example, the Gaussian kernel can be written as $k(\tau) = \sigma^2 \exp(-\|x - x'\|_w^2)$, where the weighted norm is defined as $\|x - x'\|_w^2 = \sum_{i=1}^d \left((x_i - x'_i)/l_i\right)^2$. Here the $l_i$ ($i = 1, \ldots, d$), the correlation lengths in each direction, are constants. Given a stationary covariance function, the covariance matrix $C$ can be written as $C = \sigma^2 \Psi$, where $\Psi_{ij} = \exp(-\|x^{(i)} - x^{(j)}\|_w^2)$. The estimators of $\mu$ and $\sigma^2$, denoted as $\hat{\mu}$ and $\hat{\sigma}^2$, are

  $\hat{\mu} = \frac{\mathbf{1}^\top \Psi^{-1} y}{\mathbf{1}^\top \Psi^{-1} \mathbf{1}}, \qquad \hat{\sigma}^2 = \frac{(y - \mathbf{1}\hat{\mu})^\top \Psi^{-1}(y - \mathbf{1}\hat{\mu})}{N},$  (2.7)

where $\mathbf{1}$ is a constant vector consisting of 1s [6]. It is also common to set $\mu = 0$ [31]. The hyperparameters $\sigma^2$ and $l_i$ are identified by maximizing the log marginal likelihood in Eq. (2.4). The terms $\hat{y}(x^*)$ and $\hat{s}^2(x^*)$ in Eqs. (2.5) and (2.6) take the following form:

  $\hat{y}(x^*) = \hat{\mu} + \psi^\top \Psi^{-1}(y - \mathbf{1}\hat{\mu}),$  (2.8)
  $\hat{s}^2(x^*) = \hat{\sigma}^2\left(1 - \psi^\top \Psi^{-1} \psi\right),$  (2.9)

where $\psi = \psi(x^*)$ is a (column) vector consisting of correlations between the observed data and the prediction, i.e., $\psi_i = k(x^{(i)}, x^*)/\sigma^2$.
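To make the ordinary Kriging formulas concrete, the following minimal sketch implements Eqs. (2.7)-(2.9) with a Gaussian kernel. It is an illustration only: the correlation lengths l are fixed by hand instead of being identified by maximizing Eq. (2.4), and the function names are ours, not the paper's.

```python
import numpy as np

def gauss_corr(X1, X2, l):
    """Gaussian correlation exp(-||x - x'||_w^2) with correlation lengths l."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) / l) ** 2
    return np.exp(-d2.sum(axis=-1))

def kriging_predict(X, y, Xs, l, alpha=1e-10):
    """Ordinary Kriging posterior mean and variance, Eqs. (2.7)-(2.9)."""
    N = len(y)
    Psi = gauss_corr(X, X, l) + alpha * np.eye(N)   # regularized correlation matrix
    ones = np.ones(N)
    mu = (ones @ np.linalg.solve(Psi, y)) / (ones @ np.linalg.solve(Psi, ones))  # Eq. (2.7)
    r = y - mu
    sigma2 = (r @ np.linalg.solve(Psi, r)) / N      # Eq. (2.7)
    psi = gauss_corr(Xs, X, l)                      # correlations between x* and the data
    y_hat = mu + psi @ np.linalg.solve(Psi, r)      # Eq. (2.8)
    s2 = sigma2 * (1.0 - np.sum(psi * np.linalg.solve(Psi, psi.T).T, axis=1))  # Eq. (2.9)
    return y_hat, s2

# Usage: X and Xs are arrays of shape (N, d) and (M, d); y has shape (N,).
```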
2.2 Cokriging

Next, we briefly review the formulation of multifidelity Cokriging, using the two-fidelity model for demonstration. Suppose that we have high-fidelity data (e.g., accurate measurements of the QoI) $y_H = (y_H^{(1)}, \ldots, y_H^{(N_H)})^\top$ at locations $X_H = \{x_H^{(i)}\}_{i=1}^{N_H}$, and low-fidelity data (e.g., measurements with lower accuracy or numerical approximations of the QoI) $y_L = (y_L^{(1)}, \ldots, y_L^{(N_L)})^\top$ at locations $X_L = \{x_L^{(i)}\}_{i=1}^{N_L}$, where $y_H^{(i)}, y_L^{(i)} \in \mathbb{R}$ and $x_H^{(i)}, x_L^{(i)} \in D \subseteq \mathbb{R}^d$. We denote $X = \{X_L, X_H\}$ and $\tilde{y} = (y_L^\top, y_H^\top)^\top$. Kennedy and O'Hagan [13] proposed a multifidelity formulation based on the auto-regressive model for the GP $Y_H(\cdot) \sim \mathcal{GP}(\mu_H(\cdot), k_H(\cdot, \cdot))$:

  $Y_H(x) = \rho Y_L(x) + Y_d(x),$  (2.10)

where $Y_L(\cdot) \sim \mathcal{GP}(\mu_L(\cdot), k_L(\cdot, \cdot))$ regresses the low-fidelity data, $\rho \in \mathbb{R}$ is a regression parameter, and $Y_d(\cdot) \sim \mathcal{GP}(\mu_d(\cdot), k_d(\cdot, \cdot))$ models the discrepancy between $Y_H$ and $\rho Y_L$. This model assumes that

  $\mathrm{Cov}\{Y_H(x), Y_L(x') \mid Y_L(x)\} = 0, \quad \text{for all } x' \neq x,\ x, x' \in D.$  (2.11)

The covariance matrix of the observations, $\tilde{C}$, is then given by

  $\tilde{C} = \begin{pmatrix} C_L(X_L, X_L) & \rho\, C_L(X_L, X_H) \\ \rho\, C_L(X_H, X_L) & \rho^2 C_L(X_H, X_H) + C_d(X_H, X_H) \end{pmatrix},$  (2.12)

where $C_L$ and $C_d$ are the covariance matrices computed from $k_L(\cdot, \cdot)$ and $k_d(\cdot, \cdot)$, respectively, i.e.,

  $[C_L(X_L, X_L)]_{ij} = k_L(x_L^{(i)}, x_L^{(j)}), \quad [C_L(X_L, X_H)]_{ij} = k_L(x_L^{(i)}, x_H^{(j)}), \quad [C_L(X_H, X_L)]_{ij} = k_L(x_H^{(i)}, x_L^{(j)}),$
  $[C_L(X_H, X_H)]_{ij} = k_L(x_H^{(i)}, x_H^{(j)}), \quad [C_d(X_H, X_H)]_{ij} = k_d(x_H^{(i)}, x_H^{(j)}).$  (2.13)

One can assume parameterized forms for these kernels (e.g., the Gaussian kernel) and employ the following two-step approach [7, 6] to identify the hyperparameters:

1. Use Kriging to construct $Y_L$ based on $\{X_L, y_L\}$.
2. Denote $y_d = y_H - \rho\, y_L(X_H)$, where $y_L(X_H)$ are the values of $y_L$ at the locations common to $X_H$; then construct $Y_d$ using $\{X_H, y_d\}$ via Kriging.

The posterior mean and variance of $Y_H$ at $x^* \in D$ are given by

  $\hat{y}(x^*) = \mu_H(x^*) + \tilde{c}(x^*)^\top \tilde{C}^{-1}(\tilde{y} - \tilde{\mu}),$  (2.14)
  $\hat{s}^2(x^*) = \rho^2 \sigma_L^2(x^*) + \sigma_d^2(x^*) - \tilde{c}(x^*)^\top \tilde{C}^{-1} \tilde{c}(x^*),$  (2.15)

where $\mu_H(x^*) = \rho\mu_L(x^*) + \mu_d(x^*)$, $\sigma_L^2(x^*) = k_L(x^*, x^*)$, $\sigma_d^2(x^*) = k_d(x^*, x^*)$, and

  $\tilde{\mu} = \begin{pmatrix} \mu_L \\ \mu_H \end{pmatrix} = \begin{pmatrix} (\mu_L(x_L^{(1)}), \ldots, \mu_L(x_L^{(N_L)}))^\top \\ (\mu_H(x_H^{(1)}), \ldots, \mu_H(x_H^{(N_H)}))^\top \end{pmatrix},$  (2.16)

  $\tilde{c}(x^*) = \begin{pmatrix} \rho\, c_L(x^*) \\ c_H(x^*) \end{pmatrix} = \begin{pmatrix} (\rho k_L(x_L^{(1)}, x^*), \ldots, \rho k_L(x_L^{(N_L)}, x^*))^\top \\ (k_H(x_H^{(1)}, x^*), \ldots, k_H(x_H^{(N_H)}, x^*))^\top \end{pmatrix},$  (2.17)

where $k_H(x, x') = \rho^2 k_L(x, x') + k_d(x, x')$. Alternatively, one can simultaneously identify the hyperparameters in $k_L(\cdot, \cdot)$ and $k_d(\cdot, \cdot)$ along with $\rho$ by maximizing the following log marginal likelihood:

  $\ln \tilde{L} = -\frac{1}{2}(\tilde{y} - \tilde{\mu})^\top \tilde{C}^{-1}(\tilde{y} - \tilde{\mu}) - \frac{1}{2}\ln|\tilde{C}| - \frac{N_H + N_L}{2}\ln 2\pi.$  (2.18)
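A minimal sketch of the two-step construction above, reusing gauss_corr and kriging_predict from the previous sketch. For simplicity, the regression parameter ρ is taken as given, although in practice it is estimated together with the kernel hyperparameters; the variance returned here only combines the two Kriging variances, as in the leading terms of Eq. (2.15), and neglects the cross-covariance blocks of the full C̃, so it is an approximation rather than the exact formula.

```python
import numpy as np

def cokriging_two_step(XL, yL, XH, yH, lL, ld, rho):
    """Two-step Cokriging (Sec. 2.2): Kriging on the low-fidelity data,
    then Kriging on the discrepancy y_d = y_H - rho * y_L(X_H)."""
    # Step 1: low-fidelity surrogate evaluated at the high-fidelity locations.
    yL_at_XH, _ = kriging_predict(XL, yL, XH, lL)
    # Step 2: discrepancy model on the high-fidelity locations.
    yd = yH - rho * yL_at_XH
    def predict(Xs):
        mL, s2L = kriging_predict(XL, yL, Xs, lL)
        md, s2d = kriging_predict(XH, yd, Xs, ld)
        return rho * mL + md, rho ** 2 * s2L + s2d
    return predict

# predict = cokriging_two_step(XL, yL, XH, yH, lL=0.2, ld=0.2, rho=2.0)
```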
2.3 GE-Kriging/Cokriging

GE-Kriging uses the fact that, under certain conditions, differentiation in physical space and integration in probability space are interchangeable:

  $\frac{\partial}{\partial x_i}\mu(x) = \frac{\partial}{\partial x_i}\mathbb{E}\{Y(x)\} = \mathbb{E}\left\{\frac{\partial}{\partial x_i}Y(x)\right\},$
  $\frac{\partial}{\partial x_i}k(x, x') = \frac{\partial}{\partial x_i}\mathrm{Cov}\{Y(x), Y(x')\} = \mathrm{Cov}\left\{\frac{\partial}{\partial x_i}Y(x), Y(x')\right\},$
  $\frac{\partial^2}{\partial x_i \partial x'_j}k(x, x') = \mathrm{Cov}\left\{\frac{\partial}{\partial x_i}Y(x), \frac{\partial}{\partial x'_j}Y(x')\right\}.$  (2.19)

These formulas specify the covariance between the QoI and its gradient, as well as the covariance between different components of the gradient. To simplify notation, we use $\partial_i$ and $\partial_{i'}$ to denote $\partial/\partial x_i$ and $\partial/\partial x'_i$, respectively, and $\nabla = (\partial_1, \partial_2, \ldots, \partial_d)^\top$, $\nabla' = (\partial_{1'}, \partial_{2'}, \ldots, \partial_{d'})$. Of note, for a scalar function $z$, $\nabla z$ is a column vector and $\nabla' z$ is a row vector. Since we use a stationary kernel in this work, i.e., $k(x, x') = k(x - x')$, we have

  $\partial_i k(x, x') = -\partial_{i'} k(x, x').$  (2.20)

The analytical forms of $\partial_i k(x, x')$ and $\partial_i \partial_{j'} k(x, x')$ can be found in the appendix of [30] for widely used kernel functions, e.g., Matérn kernels with several specific choices of $\nu$. Subsequently, GE-Kriging follows almost the same procedure as Kriging, with the following modifications [16]:

1. The observation vector is augmented to include gradient data, i.e., $y = (y^{(1)}, y^{(2)}, \ldots, y^{(N)}, (\nabla y^{(1)})^\top, (\nabla y^{(2)})^\top, \ldots, (\nabla y^{(N)})^\top)^\top$.
2. Given a constant posterior mean of the QoI, the posterior mean of the gradient is zero; hence $\mathbf{1}$ is replaced by $(\underbrace{1, \ldots, 1}_{N}, \underbrace{0, \ldots, 0}_{N \times d})^\top$.
3. The covariance matrix is $C = \sigma^2 \Psi$, where the correlation matrix $\Psi$ is expanded to include correlations between the QoI and its gradient as well as correlations between components of the gradient, i.e.,

  $\Psi = \begin{bmatrix} \Psi_{00} & \Psi_{01} \\ \Psi_{10} & \Psi_{11} \end{bmatrix},$  (2.21)

where $[\Psi_{00}]_{lm} = k(x^{(l)}, x^{(m)})/\sigma^2$ is the correlation matrix of the QoI observations, $\Psi_{01} = \nabla' \Psi_{00}$ consists of blocks of first-order derivatives $\partial_{j'} k(x^{(l)}, x^{(m)})/\sigma^2$, $\Psi_{10} = \Psi_{01}^\top$, and $\Psi_{11} = \nabla \nabla' \Psi_{00}$ consists of $d \times d$ blocks $\psi_{lm}$ with entries $[\psi_{lm}]_{ij} = \partial_i \partial_{j'} k(x^{(l)}, x^{(m)})/\sigma^2$.

The posterior mean and variance of the QoI at a new location $x^*$, denoted by $\hat{y}(x^*)$ and $\hat{s}^2(x^*)$, have the same form as in Kriging, i.e., Eqs. (2.8) and (2.9), except that $\psi = \begin{pmatrix} \psi(x^*) \\ \nabla \psi(x^*) \end{pmatrix}$, where $\nabla\psi(x^*)$ stacks the vectors $\nabla k(x^{(l)}, x^*)/\sigma^2$ for $l = 1, \ldots, N$. Furthermore, the posterior mean and variance of the QoI's gradient at $x^*$ are computed as

  $\widehat{\partial_i y}(x^*) = (\partial_{i'} \psi)^\top \Psi^{-1}(y - \mathbf{1}\hat{\mu}),$  (2.22)
  $\hat{s}_i^2(x^*) = \hat{\sigma}^2\left[\partial_i \partial_{i'} k(x^*, x^*)/\hat{\sigma}^2 - (\partial_{i'} \psi)^\top \Psi^{-1} \partial_{i'} \psi\right],$  (2.23)

where $\partial_{i'}\psi = \begin{pmatrix} \partial_{i'}\psi(x^*) \\ \partial_{i'}(\nabla\psi(x^*)) \end{pmatrix}$ and $i = 1, 2, \ldots, d$.
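For a 1D Gaussian kernel, the augmented correlation matrix in Eq. (2.21) can be assembled in closed form; a sketch (in our notation) is below. The derivative formulas follow from $k(\tau) = \sigma^2\exp(-\tau^2/l^2)$ with $\tau = x - x'$, and Eq. (2.20) fixes the sign of the off-diagonal blocks.

```python
import numpy as np

def ge_kriging_corr(x, l):
    """Augmented GE-Kriging correlation matrix, Eq. (2.21), for a 1D Gaussian kernel.
    Observations are stacked as (y(1), ..., y(N), dy(1)/dx, ..., dy(N)/dx)."""
    tau = x[:, None] - x[None, :]                     # pairwise differences x_l - x_m
    K   = np.exp(-(tau / l) ** 2)                     # k(x, x') / sigma^2
    Kd  = (2.0 * tau / l ** 2) * K                    # (d k / d x') / sigma^2
    Kdd = (2.0 / l ** 2 - 4.0 * tau ** 2 / l ** 4) * K  # (d^2 k / dx dx') / sigma^2
    # By Eq. (2.20), d k / d x = -d k / d x', hence the lower-left block is -Kd.
    return np.block([[K, Kd], [-Kd, Kdd]])

# Psi = ge_kriging_corr(np.linspace(0.0, 1.0, 5), l=0.3)  # shape (10, 10), symmetric
```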
Next, we introduce the details of the GE-Cokriging method, which shares a similar construction procedure with Cokriging except for some modifications to incorporate gradient information. The modifications are as follows:

1. The observation vector is augmented to $\tilde{y} = (y_L^\top, y_H^\top, (\nabla y_L)^\top, (\nabla y_H)^\top)^\top$, of length $N_L + N_H + (N_L + N_H)d$.
2. The covariance matrix of the observation data, $\tilde{C}$ in Eq. (2.12), is augmented to include the gradient information as well, i.e.,

  $\tilde{C} = \begin{pmatrix} \tilde{C}_{00} & \tilde{C}_{01} \\ \tilde{C}_{10} & \tilde{C}_{11} \end{pmatrix},$  (2.24)

where $\tilde{C}_{00}$ takes the form of the covariance matrix in Cokriging, see Eq. (2.12), and

  $\tilde{C}_{01} = \begin{bmatrix} \nabla' C_L(X_L, X_L) & \rho\, \nabla' C_L(X_L, X_H) \\ \rho\, \nabla' C_L(X_H, X_L) & \rho^2 \nabla' C_L(X_H, X_H) + \nabla' C_d(X_H, X_H) \end{bmatrix}, \quad \tilde{C}_{10} = \tilde{C}_{01}^\top,$
  $\tilde{C}_{11} = \begin{bmatrix} \nabla \nabla' C_L(X_L, X_L) & \rho\, \nabla \nabla' C_L(X_L, X_H) \\ \rho\, \nabla \nabla' C_L(X_H, X_L) & \rho^2 \nabla \nabla' C_L(X_H, X_H) + \nabla \nabla' C_d(X_H, X_H) \end{bmatrix}.$

Here $\nabla C_L(X_L, X_L)$ is constructed by replacing each element $[C_L(X_L, X_L)]_{ij}$ with its gradient $(\partial_1 [C_L(X_L, X_L)]_{ij}, \ldots, \partial_d [C_L(X_L, X_L)]_{ij})^\top$; $\nabla C_L(X_L, X_H)$, $\nabla C_L(X_H, X_L)$, $\nabla C_L(X_H, X_H)$, and $\nabla C_d(X_H, X_H)$ are constructed analogously from the corresponding matrices in Eq. (2.13). The matrix $\nabla \nabla' C_L(X_L, X_L)$ is constructed by replacing each element $[C_L(X_L, X_L)]_{ij}$ with the $d \times d$ matrix whose $(k, k')$ entry is $\partial_k \partial_{k'} [C_L(X_L, X_L)]_{ij}$; the other submatrices of $\tilde{C}_{11}$ are constructed in the same manner.

3. The posterior mean vector becomes

  $\tilde{\mu} = \left(\mu_L^\top,\ \mu_H^\top,\ \underbrace{0, \ldots, 0}_{N_L \cdot d},\ \underbrace{0, \ldots, 0}_{N_H \cdot d}\right)^\top,$  (2.25)

with $\mu_L$ and $\mu_H$ as in Eq. (2.16).

4. The covariance vector between the new observation location $x^*$ and the existing observation data $\{X_L, X_H\}$, denoted by $\tilde{c}(x^*)$, is given by

  $\tilde{c}(x^*) = \begin{pmatrix} \rho\, c_L(x^*) \\ c_H(x^*) \\ \rho\, \nabla c_L(x^*) \\ \nabla c_H(x^*) \end{pmatrix},$  (2.26)

where $c_L(x^*) = (k_L(x_L^{(1)}, x^*), \ldots, k_L(x_L^{(N_L)}, x^*))^\top$ and $c_H(x^*) = (k_H(x_H^{(1)}, x^*), \ldots, k_H(x_H^{(N_H)}, x^*))^\top$.

The estimators for the posterior mean and variance of the QoI at the new location $x^*$ in GE-Cokriging follow Eqs. (2.14) and (2.15) of the Cokriging method, with the corresponding components updated as above. The posterior mean and variance of the QoI's gradient at $x^*$ are

  $\widehat{\partial_i y}(x^*) = (\partial_{i'} \tilde{c}(x^*))^\top \tilde{C}^{-1}(\tilde{y} - \tilde{\mu}),$  (2.27)
  $\hat{s}_i^2(x^*) = \rho^2 \partial_i \partial_{i'} k_L(x^*, x^*) + \partial_i \partial_{i'} k_d(x^*, x^*) - (\partial_{i'} \tilde{c}(x^*))^\top \tilde{C}^{-1} \partial_{i'} \tilde{c}(x^*),$  (2.28)

where $i = 1, 2, \ldots, d$. The derivations of Eqs. (2.27) and (2.28) follow the same procedure as those of Eqs. (2.14) and (2.15) shown in [13, 6]. In other words, Eqs. (2.27) and (2.28) can be obtained by replacing $Y(x)$ in Eqs. (2.14) and (2.15) with $\partial_i Y(x)$: $\mu_H(x^*)$ is replaced with the mean of $\partial_i Y(x)$ (which is zero), $\tilde{c}$ is replaced with $\partial_{i'}\tilde{c}$, and $\rho^2 \sigma_L^2(x^*) + \sigma_d^2(x^*)$ (i.e., $\rho^2\,\mathrm{Var}\{Y_L(x^*)\} + \mathrm{Var}\{Y_d(x^*)\}$) is replaced with $\rho^2\,\mathrm{Var}\{\partial_i Y_L(x^*)\} + \mathrm{Var}\{\partial_i Y_d(x^*)\} = \rho^2 \partial_i \partial_{i'} k_L(x^*, x^*) + \partial_i \partial_{i'} k_d(x^*, x^*)$.

We note that GE-Cokriging exploits the relation between the QoI and its gradients, and once the hyperparameters of the model are identified, we can compute the posterior mean and variance of the QoI and its gradients simultaneously. It has the potential to improve the prediction accuracy for both the QoI and its gradients compared with predicting them separately. Also, in some cases, this approach can reduce the computational cost compared to, for example, constructing separate Cokriging models for the QoI and its gradients (see Section 3.5).
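Once the augmented vectors and covariance matrix of Eqs. (2.24)-(2.26) are assembled, the joint prediction of the QoI and its gradient reduces to a few linear solves. The sketch below (a hypothetical helper with our argument names) expresses Eqs. (2.14)-(2.15) and (2.27)-(2.28) in that generic form.

```python
import numpy as np

def ge_cokriging_predict(C, y_aug, mu_aug, mu_H_star, c_star, dc_star, var_y, var_dy):
    """Posterior mean/variance of the QoI and one gradient component at x*.
    C: augmented covariance matrix, Eq. (2.24); y_aug, mu_aug: augmented data/mean;
    c_star, dc_star: the covariance vector c~(x*) of Eq. (2.26) and its derivative;
    var_y  = rho^2 k_L(x*, x*) + k_d(x*, x*);
    var_dy = rho^2 d_i d_i' k_L(x*, x*) + d_i d_i' k_d(x*, x*)."""
    w = np.linalg.solve(C, y_aug - mu_aug)
    y_hat  = mu_H_star + c_star @ w                          # Eq. (2.14)
    dy_hat = dc_star @ w                                     # Eq. (2.27); gradient prior mean is 0
    s2_y   = var_y  - c_star  @ np.linalg.solve(C, c_star)   # Eq. (2.15)
    s2_dy  = var_dy - dc_star @ np.linalg.solve(C, dc_star)  # Eq. (2.28)
    return y_hat, dy_hat, s2_y, s2_dy
```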
2.4 An integral-enhanced perspective

In this section, we provide another perspective on using the QoI $f$ and its gradients $\nabla f$ in GPR simultaneously. The aforementioned gradient-enhanced methods first assume a GP model $Y(x)$ for $f$; the GP model for $\nabla f$ can then be constructed by taking (partial) derivatives of the mean and covariance functions of $Y(x)$. Alternatively, one can assume a GP model for $\nabla f$ first, e.g., $\partial_i f$ is modeled by $Y(x)$, and the QoI $f$ is then modeled by $\int Y(x)\,\mathrm{d}x_i$, which is a GP because integration is a linear operator. Here we use a univariate function to illustrate the concept. We model $f'$ with a GP $Y_{f'}(x) \sim \mathcal{GP}(\mu_{f'}(x), k_{f'}(x, x'))$; then, similar to Eqs. (2.19), the integrals in physical space and in probability space are interchangeable:

  $\int \mu_{f'}(x)\,\mathrm{d}x = \int \mathbb{E}\{Y_{f'}(x)\}\,\mathrm{d}x = \mathbb{E}\left\{\int Y_{f'}(x)\,\mathrm{d}x\right\},$
  $\int k_{f'}(x, x')\,\mathrm{d}x = \int \mathrm{Cov}\{Y_{f'}(x), Y_{f'}(x')\}\,\mathrm{d}x = \mathrm{Cov}\left\{\int Y_{f'}(x)\,\mathrm{d}x,\ Y_{f'}(x')\right\},$
  $\iint k_{f'}(x, x')\,\mathrm{d}x\,\mathrm{d}x' = \mathrm{Cov}\left\{\int Y_{f'}(x)\,\mathrm{d}x,\ \int Y_{f'}(x')\,\mathrm{d}x'\right\}.$  (2.29)

These formulas provide the mean and covariance of the GP $Y_f(x) = \int Y_{f'}(x)\,\mathrm{d}x$ as well as the covariance between $Y_f(x)$ and $Y_{f'}(x)$. Of note, we use indefinite integrals here, and the constant associated with the integral needs to be identified via maximizing the log marginal likelihood. This constant does not affect the covariance function, because $\mathrm{Cov}\{\int Y_{f'}(x)\,\mathrm{d}x + a,\ \int Y_{f'}(x')\,\mathrm{d}x' + b\} = \mathrm{Cov}\{\int Y_{f'}(x)\,\mathrm{d}x,\ \int Y_{f'}(x')\,\mathrm{d}x'\}$ for any constants $a$ and $b$.

We can then follow the same procedure as in GE-Kriging (Section 2.3) to construct the covariance matrix $C$ and compute the posterior mean and variance of $f$ and $f'$ at any location $x^*$. Of note, this "integral-enhanced" GPR/Kriging is equivalent to the gradient-enhanced version. For example, if we set the mean of $Y_{f'}(x)$ to zero, then the mean of $Y_f(x)$ is a constant $\mu$, which needs to be identified as in the gradient-enhanced version. The integral-enhanced Kriging is thus equivalent to the gradient-enhanced Kriging if the mean and covariance functions are selected appropriately. For example, if we assume zero mean and set $k_{f'}(x, x') = \partial^2 k_f(x, x')/(\partial x\,\partial x')$ for $Y_{f'}(x)$, where $k_f(x, x')$ is the Gaussian kernel, this integral-enhanced Kriging model is the same as the gradient-enhanced Kriging model that uses a Gaussian kernel and a constant mean for $Y_f(x)$. In most cases, it is easier to compute (partial) derivatives than integrals; therefore, it is more convenient to use the gradient-enhanced setting. A similar argument holds for Cokriging. In this work, we only show results for gradient-enhanced Kriging/Cokriging.

3 Numerical examples

We present four numerical examples to demonstrate the performance of GE-Cokriging. The first two prototype examples show GE-Cokriging's capability of approximating the QoI and its gradients for two 1D functions and a 2D function. The other two examples illustrate the high accuracy of GE-Cokriging in constructing the phase diagram of an underdamped oscillator and in analyzing the sensitivity of the power factor under varying power inputs in a large-scale power grid system. In all these examples, we assume that both the QoI and its gradients are collected at every observation location. The hyperparameters of the GP models are identified by maximizing the associated log marginal likelihood using a genetic algorithm, as in [6]. Lastly, we quantitatively compare the prediction accuracy of Cokriging, GE-Kriging, and GE-Cokriging in each case, as well as their computational cost.
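Before moving to the examples, the stationary-kernel identities that Sections 2.3 and 2.4 rely on can be checked symbolically in a few lines; the sketch below assumes the sympy library and verifies Eq. (2.20) and the mixed derivative of the Gaussian kernel used in the augmented covariance blocks.

```python
import sympy as sp

x, xp, l = sp.symbols('x x_p l', positive=True)
k = sp.exp(-(x - xp) ** 2 / l ** 2)   # Gaussian correlation k(x, x') / sigma^2

# Eq. (2.20): for a stationary kernel, dk/dx = -dk/dx'.
assert sp.simplify(sp.diff(k, x) + sp.diff(k, xp)) == 0

# Mixed derivative used in the Psi_11 / C~_11 blocks:
print(sp.simplify(sp.diff(k, x, xp)))
# -> (2/l**2 - 4*(x - x_p)**2/l**4) * exp(-(x - x_p)**2/l**2)
```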
3.1 1D function

3.1.1 Case 1

In this part, we compare the results of Cokriging and GE-Cokriging in approximating a 1D function. The target function to approximate is

  $f_H(x) = (6x - 2)^2 \sin(12x - 4),$  (3.1)

from which the high-fidelity data are sampled. The low-fidelity data are sampled from the function

  $f_L(x) = A f_H(x) + B(x - 0.5) + C.$  (3.2)

The observation locations of $f_H$ are four points $X_H \subset [0, 1]$, and those of $f_L$ are six points $X_L \subset [0, 1]$; the locations are chosen so that $X_H \subset X_L$. We first consider a well-studied case [6] where the parameters of the low-fidelity function are $A = 0.5$, $B = 10$, $C = -5$, i.e.,

  $f_L(x) = 0.5 f_H(x) + 10(x - 0.5) - 5.$  (3.3)

Of note, we use fewer observation points in $X_L$ than in [6].
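For reference, a sketch of this test problem in code. The exact entries of $X_H$ and $X_L$ are not printed above, so the sampling plans below (a six-point uniform grid and a four-point subset) are assumptions consistent with $X_H \subset X_L$; the analytical gradient is included because the GE methods observe it at every sample location.

```python
import numpy as np

def f_H(x):            # high-fidelity function, Eq. (3.1)
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def df_H(x):           # analytical gradient of f_H, observed by the GE methods
    return 12 * (6 * x - 2) * np.sin(12 * x - 4) \
         + 12 * (6 * x - 2) ** 2 * np.cos(12 * x - 4)

def f_L(x, A=0.5, B=10.0, C=-5.0):   # low-fidelity function, Eqs. (3.2)-(3.3)
    return A * f_H(x) + B * (x - 0.5) + C

X_L = np.linspace(0.0, 1.0, 6)        # assumed low-fidelity sampling plan
X_H = np.array([0.0, 0.4, 0.6, 1.0])  # assumed high-fidelity plan, subset of X_L
```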
The results of Cokriging and GE-Cokriging for reconstructing $f_H$ are shown in Fig. 1. Fig. 1a shows that Cokriging is able to capture $f_H$, as the posterior mean is generally close to the high-fidelity function values. However, $\hat{s}^2$ is large at most prediction locations, indicating that Cokriging yields considerable uncertainty there, whereas this uncertainty is very small at the observation locations because a simple relation between $f_H$ and $f_L$ has been identified from the available data [6]. As a comparison, Fig. 1b illustrates that the posterior mean of GE-Cokriging coincides with $f_H$, and the uncertainty of the prediction is very small on the entire interval, as the grey shaded area is almost invisible.

Figure 1: Prediction of the QoI for the 1D problem, case 1. Posterior mean (black solid line) and standard deviation (grey shaded area) of the QoI $f_H$ by (a) Cokriging and (b) GE-Cokriging. The low-fidelity function $f_L$ is denoted by red solid lines, high-fidelity samples by black diamonds, and low-fidelity samples by red circles. Colored online.

Next, we compare the performance in predicting the gradient of $f_H$, i.e., $\mathrm{d}f_H(x)/\mathrm{d}x$. Fig. 2 shows that Cokriging suffers from the singularity of the covariance matrix in this setup, implied by the sharp turns of the predicted curve between neighboring observations in Fig. 2a and the large standard deviations in Fig. 2b at locations where observations are not available. As for GE-Cokriging, the prediction of the gradient is accurate both in terms of the posterior mean (Fig. 2a) and the standard deviation (Fig. 2b); the prediction uncertainty of Cokriging is almost 10 times larger than that of GE-Cokriging. The performance of Cokriging is poor in this case because the covariance matrix $\tilde{C}$ is close to singular; the reason is that $\mathrm{d}y_L/\mathrm{d}x$ takes similar values at $x = 0.2$ and $x = 0.4$, as well as at $x = 0$ and $x = 0.6$. As we point out in Section 2.1, this singularity issue is common for GPR methods in practice, and the typical remedy is to add a diagonal matrix $\alpha I$ to the covariance matrix, which is equivalent to adding noise to the collected data. In this paper, we set $\alpha$ much smaller than values typically used in practice, to demonstrate that GE-Cokriging can alleviate the singularity issue without sacrificing the accuracy of matching the observation data.

Figure 2: Prediction of the gradient of the QoI for the 1D problem, case 1. (a) Posterior mean by Cokriging (blue solid line) and GE-Cokriging (green solid line); the gradient of the high-fidelity function $\mathrm{d}f_H/\mathrm{d}x$ is denoted by a black solid line, the gradient of the low-fidelity function $\mathrm{d}f_L/\mathrm{d}x$ by a red solid line, high-fidelity samples by black diamonds, and low-fidelity samples by red circles. (b) Standard deviation of the predicted $\mathrm{d}f_H/\mathrm{d}x$ by Cokriging (red solid line) and GE-Cokriging (black solid line). Colored online.

3.1.2 Case 2: shifted $f_L$

Next, we keep the sampling locations $X_H$ and $X_L$ the same as in Section 3.1.1 and only modify the low-fidelity function in Eq. (3.3) by slightly shifting its argument, i.e., replacing $x$ with $x - 0.2$:

  $f_L(x) = 0.5 f_H(x - 0.2) + 10(x - 0.2 - 0.5) - 5.$  (3.4)

The posterior means and standard deviations of Cokriging and GE-Cokriging are shown in Fig. 3. Fig. 3a shows that Cokriging is not able to obtain an accurate prediction of $f_H$, and the resulting uncertainty is large on the entire interval except at the locations in $X_H$. On the contrary, as shown in Fig. 3b, the GE-Cokriging result is much closer to $f_H$ and the uncertainty is very small.

Figure 3: Prediction of the QoI for the 1D problem, case 2. Posterior mean (black solid line) and standard deviation (grey shaded area) of the QoI $f_H$ by (a) Cokriging and (b) GE-Cokriging. The low-fidelity function $f_L$ is denoted by red solid lines, high-fidelity samples by black diamonds, and low-fidelity samples by red circles. Colored online.

We present the gradient predictions of GE-Cokriging and Cokriging in Fig. 4. Similar to the observations from Fig. 2a, Cokriging suffers from the singularity of the covariance matrix in this case, with the posterior mean deviating significantly from $\mathrm{d}f_H/\mathrm{d}x$ (see Fig. 4a) and the standard deviation being of an order comparable to the mean value (see Fig. 4b). In comparison, GE-Cokriging still yields a good result, with the posterior mean close to $\mathrm{d}f_H/\mathrm{d}x$ (see Fig. 4a) and low uncertainty, i.e., small standard deviations (see Fig. 4b). These contrasts between Cokriging and GE-Cokriging suggest that the gradient information from the high- and low-fidelity functions helps to improve the prediction accuracy of not only the QoI but also the corresponding gradients.

Figure 4: Prediction of the gradient of the QoI for the 1D problem, case 2. (a) Posterior mean by Cokriging (blue solid line) and GE-Cokriging (green solid line); the gradient of the high-fidelity function $\mathrm{d}f_H/\mathrm{d}x$ is denoted by a black solid line, the gradient of the low-fidelity function $\mathrm{d}f_L/\mathrm{d}x$ by a red solid line, high-fidelity samples by black diamonds, and low-fidelity samples by red circles. (b) Standard deviation of the predicted $\mathrm{d}f_H/\mathrm{d}x$ by Cokriging (red solid line) and GE-Cokriging (black solid line). Colored online.

3.2 2D function

We extend the application of GE-Cokriging to approximating a 2D function, namely a modified Branin function [6], given by

  $f_H(x, y) = a(\bar{x}_2 - b\bar{x}_1^2 + c\bar{x}_1 - r)^2 + g(1 - p)\cos(\bar{x}_1) + g + qx,$  (3.5)

where $\bar{x}_1 = 15x - 5$, $\bar{x}_2 = 15y$, $x \in [0, 1]$, $y \in [0, 1]$, with $a = 1$, $b = 5.1/(4\pi^2)$, $c = 5/\pi$, $r = 6$, $g = 10$, $p = 1/(8\pi)$, $q = 5$. The low-fidelity function is constructed as

  $f_L(x, y) = A f_H(Bx + (1 - B), Cy),$  (3.6)

where $A = 1.1$, $B = 0.95$, $C = 0.9$. The contour of the modified Branin function $f_H$ that we aim to approximate is shown in Fig. 5a, and the contour of the low-fidelity function $f_L$ is shown in Fig. 5d. Both functions can be implemented as in the sketch below.
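A sketch of the two 2D test functions; the constants follow Eqs. (3.5)-(3.6).

```python
import numpy as np

def f_H(x, y):   # modified Branin function, Eq. (3.5)
    a, b, c, r = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi, 6.0
    g, p, q = 10.0, 1.0 / (8 * np.pi), 5.0
    x1, x2 = 15 * x - 5, 15 * y
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 \
         + g * (1 - p) * np.cos(x1) + g + q * x

def f_L(x, y, A=1.1, B=0.95, C=0.9):   # low-fidelity function, Eq. (3.6)
    return A * f_H(B * x + (1 - B), C * y)
```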
The high-fidelity observation locations $X_H$ (black squares in Fig. 5a) and the low-fidelity observation locations $X_L$ (black circles in Fig. 5d) are randomly selected from a uniformly spaced $41 \times 41$ grid on $[0, 1] \times [0, 1] \subset \mathbb{R}^2$. We note that $X_H \subset X_L$ as before.

We first compare the results of reconstructing $f_H$ by Cokriging and GE-Cokriging, shown in Fig. 5. It is clear that the posterior mean of GE-Cokriging (Fig. 5c) is closer to $f_H$ than that of Cokriging (Fig. 5b). The degrees of uncertainty are also distinct, as the posterior standard deviation of Cokriging (Fig. 5e) is one order of magnitude larger than that of GE-Cokriging (Fig. 5f).

Figure 5: The high- and low-fidelity functions of the 2D problem and the posterior predictions of the high-fidelity function. (a) The high-fidelity function, namely the modified Branin function $f_H$ (contour), and observation locations (black squares). Posterior mean of the QoI prediction by (b) Cokriging and (c) GE-Cokriging. (d) Low-fidelity function $f_L$ (contour) and observation locations (black dots). Posterior standard deviation of the QoI by (e) Cokriging and (f) GE-Cokriging. Colored online.

Next, we compare the gradient predictions of Cokriging and GE-Cokriging. Figs. 6a and 6d show contours of the exact $\partial f_H/\partial x$ and $\partial f_H/\partial y$, respectively. For predicting $\partial f_H/\partial x$, GE-Cokriging (Fig. 6c) shows higher accuracy globally, while Cokriging (Fig. 6b) cannot produce an accurate prediction in the lower left corner, where the available observation data are scarce. As for $\partial f_H/\partial y$, since the target function is relatively smooth, both Cokriging (Fig. 6e) and GE-Cokriging (Fig. 6f) are capable of accurate prediction, while GE-Cokriging still outperforms Cokriging in terms of the total RMSE recorded in Tab. 1.

Figure 6: The high-fidelity gradients in the x and y directions of the 2D problem and the corresponding posterior predictions. (a) The gradient of the high-fidelity function in the x direction, $\partial f_H/\partial x$ (contour), and high-fidelity samples (black squares) of the gradient in the x direction. Posterior mean of the gradient prediction in the x direction by (b) Cokriging and (c) GE-Cokriging. (d) The gradient of the high-fidelity function in the y direction, $\partial f_H/\partial y$ (contour), and high-fidelity samples (black squares) of the gradient in the y direction. Posterior mean of the gradient prediction in the y direction by (e) Cokriging and (f) GE-Cokriging. Colored online.

3.3 Underdamped oscillator

We consider a driven harmonic oscillator described by the following second-order ODE:

  $m\ddot{x} + c\dot{x} + kx = F(t), \qquad x(0) = 1, \quad \dot{x}(0) = 0,$  (3.7)

where $m$ is the mass, $c$ is the damping coefficient, $k$ is a constant (e.g., the elasticity coefficient of a spring), and $F(t)$ is the external force. We rewrite the ODE in Eq. (3.7) as

  $\ddot{x} + 2\zeta\omega_0\dot{x} + \omega_0^2 x = F(t)/m,$  (3.8)

where $\omega_0 = \sqrt{k/m}$ is the undamped angular frequency and $\zeta = c/(2\sqrt{mk})$ is the damping ratio. We set $\zeta = 1/\sqrt{2}$ and $\omega_0 = 6\sqrt{1 - \zeta^2}$ in this study. The external force is set as the step response:

  $F(t)/m = \omega_0^2$ for $t \geq 0$, and $0$ for $t < 0$.  (3.9)

The analytical solution of Eq. (3.7) is

  $x_H(t) = e^{-\zeta\omega_0 t}\,\frac{\sin(\sqrt{1 - \zeta^2}\,\omega_0 t + \varphi)}{\sin\varphi}, \qquad \varphi = \arccos\zeta,$  (3.10)

and the velocity is

  $\dot{x}_H(t) = -\frac{\omega_0 e^{-\zeta\omega_0 t}}{\sin\varphi}\left[\zeta\sin(\sqrt{1 - \zeta^2}\,\omega_0 t + \varphi) - \sqrt{1 - \zeta^2}\cos(\sqrt{1 - \zeta^2}\,\omega_0 t + \varphi)\right].$  (3.11)

The low-fidelity model is a simple harmonic oscillator:

  $m\ddot{x} + kx = 0, \qquad x(0) = 1, \quad \dot{x}(0) = 0,$  (3.12)

which is equivalent to setting $\zeta = 0$ and $F(t) = 0$ in Eq. (3.8). The analytical solution of the low-fidelity model is

  $x_L(t) = \cos(\omega_0 t),$  (3.13)

and the velocity is

  $\dot{x}_L(t) = -\omega_0\sin(\omega_0 t).$  (3.14)

The observation locations for the high- and low-fidelity models, $T_H$ and $T_L$, are uniformly spaced grids on $[0, 3]$. We compare the reconstructed trajectory $x(t)$ and velocity $\dot{x}(t)$ on $[0, 3]$ by Cokriging and GE-Cokriging in Fig. 7. Cokriging again shows worse performance, both for the prediction of the QoI (Fig. 7a) and of the gradient (Fig. 7c), marked by significant deviations from the true values as well as large uncertainties at locations distant from the observation locations, while GE-Cokriging reconstructs the trajectory (Fig. 7b) and velocity (Fig. 7d) of the oscillator well, with small standard deviations. The overlap between the trajectory-velocity phase diagram produced by GE-Cokriging and the exact phase diagram (Fig. 7e) emphasizes that GE-Cokriging can provide accurate predictions of the QoI and the corresponding gradients simultaneously, while Cokriging fails to. We also note that Cokriging suffers from the singularity of the covariance matrix again, while GE-Cokriging does not have this concern. The two analytical models can be implemented as in the sketch below.
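A sketch of the two analytical models, Eqs. (3.10)-(3.14), used to generate the training data. Parts of the printed value of $\omega_0$ are lost above, so $\omega_0 = 6\sqrt{1 - \zeta^2}$ below is our reading of it, with $\zeta = 1/\sqrt{2}$; the observation grids are likewise assumed.

```python
import numpy as np

zeta = 1.0 / np.sqrt(2.0)                  # damping ratio
omega0 = 6.0 * np.sqrt(1.0 - zeta ** 2)    # assumed undamped angular frequency
phi = np.arccos(zeta)
wd = np.sqrt(1.0 - zeta ** 2) * omega0     # damped angular frequency

def x_H(t):   # high-fidelity trajectory, Eq. (3.10)
    return np.exp(-zeta * omega0 * t) * np.sin(wd * t + phi) / np.sin(phi)

def v_H(t):   # high-fidelity velocity, Eq. (3.11)
    return -(omega0 * np.exp(-zeta * omega0 * t) / np.sin(phi)) * (
        zeta * np.sin(wd * t + phi) - np.sqrt(1.0 - zeta ** 2) * np.cos(wd * t + phi))

def x_L(t):   # low-fidelity trajectory, Eq. (3.13)
    return np.cos(omega0 * t)

def v_L(t):   # low-fidelity velocity, Eq. (3.14)
    return -omega0 * np.sin(omega0 * t)

T_L = np.linspace(0.0, 3.0, 13)   # assumed uniform observation grids on [0, 3]
T_H = T_L[::2]
```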
3.4 Sensitivity of a power grid system

We now consider the relationship between the power input of a generator bus, denoted as $x$, and the real-time power factor of a load bus, denoted as $f(x)$, in a large-scale power system from the IEEE 118-bus test case [26]. We use MATPOWER [36], which provides a model of the IEEE 118-bus test case, to run simulations and generate sample points. Here, $f_H(x)$ and $f_L(x)$ represent the alternating current (AC) and direct current (DC) models approximating $f(x)$, respectively.

The observation locations for Cokriging and GE-Cokriging consist of 51 low-fidelity samples from the DC model on $X_L = \{20 + 2j\}_{j=0}^{50}$ and five samples from the AC model on $X_H$ (again, $X_H \subset X_L$). In addition to reconstructing $f_H$ accurately, estimating the change of the power factor of a load bus in response to the change of the power input of a generator bus, i.e., the sensitivity of $f$ with respect to $x$, is important for safety and energy-efficiency considerations. This change is reflected by the derivative of $f(x)$, i.e., $\mathrm{d}f(x)/\mathrm{d}x$. Therefore, we aim to approximate both $f_H$ and its derivative. Here we use a finite-difference method with a small step size to obtain $\mathrm{d}f_H/\mathrm{d}x$ and $\mathrm{d}f_L/\mathrm{d}x$ at $X_H$ and $X_L$, respectively.

Cokriging reconstructs $f_H$ with noticeable standard deviations (Fig. 8a), but it fails to reconstruct $\mathrm{d}f_H/\mathrm{d}x$ (Fig. 8c). On the other hand, GE-Cokriging reconstructs both $f_H$ (Fig. 8b) and $\mathrm{d}f_H/\mathrm{d}x$ (Fig. 8d) accurately with rather small uncertainty; the only noticeable discrepancy appears near the left boundary, because that region is far from the available data. Unlike in the other cases, here we notice wiggles in the high-fidelity gradient prediction of GE-Cokriging. This is caused by the aliasing error introduced by the finite-difference approximation of the gradient functions; recall that no wiggles are observed in the previous examples, where the gradients are observed directly. Again, reconstructing the gradient with Cokriging suffers from the singularity of the covariance matrix, as shown in Fig. 8c, whereas GE-Cokriging does not have this concern (see Fig. 8d).
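The gradient observations in this example come from finite differences rather than analytical formulas. A minimal central-difference sketch is below; the paper's exact step size is not recoverable here, so h is an assumption, and run_ac_model stands in for a hypothetical wrapper around the MATPOWER simulation.

```python
import numpy as np

def fd_gradient(f, X, h=0.01):
    """Central-difference approximation of df/dx at the sample locations X.
    The step size h is an assumed value, not the one used in the paper."""
    X = np.asarray(X, dtype=float)
    return (f(X + h) - f(X - h)) / (2.0 * h)

# Example (run_ac_model is a hypothetical wrapper around the MATPOWER AC model):
# dfH_obs = fd_gradient(run_ac_model, X_H)
```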
This is due to the fact that GE-Kriging and GE-Cokriging integrate both QoI data and the corresponding gradient data in the training step and henceprovides prediction of QoI as well as the gradient on the new locations simultaneously in the predictingstep. Whereas, Cokriging requires construction of a model for gradient data separately. Hence, the timefor the prediction of the gradients by GE-Kriging and GE-Cokriging, i.e., the last two columns in Tab. 2,are for prediction only and is relatively short. It is also noticed that the time consumption of GE-Kriging15 ase Cokriging GE-Kriging GE-Cokriging Cokriging ( ∇ ) GE-Kriging ( ∇ ) GE-Cokriging ( ∇ )1D1 0 . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . ∗ . ± . . ± . . ± . . ± . . ± . . ± . ∗∗ - - - 0 . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . Table 1: Relative mean squared error (mean ± standard deviation) of QoI and the corresponding gradients foreach numerical example averaged over 5 separate runs with random parameters initialization by Cokriging,GE-Kriging and GE-Cokriging. ∗ denotes gradient in x direction and ∗∗ denotes gradient in y direction. ∇ denotes prediction of the gradient of QoI.is smaller than that of GE-Cokriging, recall that GE-Kriging only used high-fidelity information while GE-Cokriging used both high-fidelity and low-fidelity information, which lead to a larger covariance matrix inGE-Cokriging compared to that in GE-Kriging. Although GE-Cokriging generally requires longer time inthe training step, almost doubles Cokriging’s training time, the total time cost of GE-Cokriging in QoI andgradients prediction is almost the same as that of Cokriging. Considering the significant improvement inaccuracy and robustness, we can conclude that GE-Cokriging is an accurate and efficient approach to obtainprediction both QoI and its gradients simultaneously. Case ID Cokriging GE-Kriging GE-Cokriging Cokriging ( ∇ ) GE-Kriging ( ∇ ) GE-Cokriging ( ∇ )1D1 1 . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . ∗ . ± . . ± . . ± . . ± . . ± . . ± . ∗∗ - - - 1 . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . . ± . Table 2: Runtime (mean ± standard deviation) of predicting QoI and its gradients for each numerical exampleaveraged over 5 separate runs with random parameters initialization by Cokriging, GE-Kriging and GE-Cokriging. ∗ denotes gradient in x direction and ∗∗ denotes gradient in y direction. ∇ denotes prediction ofthe gradient of QoI. In this work, we present a comprehensive gradient-enhanced multi-fidelity Cokriging method, namely GE-Cokriging, which incorporates available gradient information of multi-fidelity data, i.e., low-fidelity andhigh-fidelity observation of QoIs and its gradients. We present several numerical examples to study theperformance of GE-Cokriging. Our results show that GE-Cokriging can accurately predict the QoI and itsgradients simultaneously. We compare the performance of GE-Cokriging against GE-Kriging and multi-fidelity Cokriging, two popular GP-based prediction methods, and illustrate that GE-Cokriging is the mostaccurate, robust and efficient among these methods.In particular, our result suggests that GE-Cokriging achieves better accuracy than GE-Kriging, thisis because it exploits the information of the low-fidelity model. 
Also, GE-Cokriging yields more accurate results than applying Cokriging to the QoI and its gradients separately, because it takes advantage of the relation between these two quantities and makes use of the corresponding data jointly. Even when some of the low-fidelity gradient information is misleading, for example, when the gradient of the low-fidelity data is negative while that of the high-fidelity data is positive, the GE-Cokriging method can still be robust enough to predict the target functions accurately, with less uncertainty than Cokriging and GE-Kriging. Moreover, GE-Cokriging helps to alleviate the singularity issue of the covariance matrix, which is quite common in GPR methods. In terms of computational cost, training the GE-Cokriging model, i.e., identifying its hyperparameters, can take longer than training Cokriging for a high-dimensional problem, given that the dimension of the covariance matrix is expanded by the incorporation of gradient samples. However, once these hyperparameters are specified, the QoI and its gradients can be predicted simultaneously. This saves total computational time compared with Cokriging, which requires constructing models for the QoI and its gradients separately, and hence needs to train at least two models. Therefore, the overhead of training a model with a larger covariance matrix in GE-Cokriging is mitigated, and the overall times required to predict both the QoI and its gradients with these three methods are comparable.

We note that our gradient-enhanced framework is also flexible for further extensions. In all of the numerical examples, we apply the commonly used stationary radial-basis-function kernel. Other kernel functions, e.g., Matérn kernels with different smoothness, can be used to solve problems with desired regularity constraints. In addition, non-stationary kernels can be applied in this framework to model heterogeneous systems more accurately. Another extension is to relax the constraints on the sample data to address the situation of missing data. More specifically, in the numerical examples presented, the gradient information is available along with the QoI at each observation location; in practice, it is possible that at some observation locations either the QoI or its gradient is unavailable. In this scenario, modifications to the mean and covariance functions of the GP in our framework are needed. Moreover, we used the linear auto-regression form of multi-fidelity Cokriging from [13], which can be replaced by more general nonlinear auto-regression forms, e.g., the methods used in [23, 9, 17], or even deep neural networks, e.g., [19]. Finally, as we point out in Section 2.4, our framework can also be built from the "integral-enhanced" perspective, which can be useful in specific practical problems.
Acknowledgments
Yixiang Deng was supported by National Science Foundation (NSF) Award No. 1736088. Xiu Yang was supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research (ASCR) as part of Multifaceted Mathematics for Rare, Extreme Events in Complex Energy and Environment Systems (MACSER). Guang Lin gratefully acknowledges the support from the National Science Foundation (DMS-1555072, DMS-1736364, and CMMI-1634832) and Brookhaven National Laboratory Subcontract 382247.
References

[1] Petter Abrahamsen. A review of Gaussian random fields and correlation functions, 1997.
[2] Giancarlo Alfonsi. Reynolds-averaged Navier-Stokes equations for turbulence modeling. Appl. Mech. Rev., 62(4), 2009.
[3] Hyoung Seog Chung and Juan Alonso. Design of a low-boom supersonic business jet using cokriging approximation models. page 5598, 2002.
[4] Richard Dwight and Zhong-Hua Han. Efficient uncertainty quantification using gradient-enhanced kriging. page 2276, 2009.
[5] Pep Espanol and Patrick Warren. Statistical mechanics of dissipative particle dynamics. Europhys. Lett., 30(4):191, 1995.
[6] Alexander Forrester, Andy Keane, and András Sóbester. Engineering Design via Surrogate Modelling: A Practical Guide. John Wiley & Sons, 2008.
[7] Alexander IJ Forrester, András Sóbester, and Andy J Keane. Multi-fidelity optimization via surrogate modelling. Proc. R. Soc. A, 463(2088):3251-3269, 2007.
[8] Meixia Geng, Danian Huang, Qingjie Yang, and Yinping Liu. 3D inversion of airborne gravity-gradiometry data using cokriging. Geophysics, 79(4):G37-G47, 2014.
[9] Mark Girolami and Mingjun Zhong. Data integration for classification problems employing Gaussian process priors. In Adv. Neural Inf. Process. Syst., pages 465-472, 2007.
[10] Pierre Goovaerts. Ordinary cokriging revisited. Math. Geosci., 30(1):21-42, 1998.
[11] Loic Le Gratiet and Josselin Garnier. Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. Int. J. Uncertain. Quantif., 4(5):365-386, 2014.
[12] Zhong-Hua Han, Stefan Görtz, and Ralf Zimmermann. Improving variable-fidelity surrogate modeling via gradient-enhanced kriging and a generalized hybrid bridge function. Aerosp. Sci. Technol., 25(1):177-189, 2013.
[13] Marc C Kennedy and Anthony O'Hagan. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1-13, 2000.
[14] Peter K Kitanidis. Introduction to Geostatistics: Applications in Hydrogeology. Cambridge University Press, 1997.
[15] J Laurenceau, M Meaux, M Montagnac, and P Sagaut. Comparison of gradient-based and gradient-enhanced response-surface-based optimizers. AIAA J., 48(5):981-994, 2010.
[16] Luc Laurent, Rodolphe Le Riche, Bruno Soulier, and Pierre-Alain Boucard. An overview of gradient-enhanced metamodels with applications. Arch. Comput. Methods Eng., 26(1):61-106, 2019.
[17] Seungjoon Lee, Felix Dietrich, George E Karniadakis, and Ioannis G Kevrekidis. Linking Gaussian process regression with data-driven manifold embeddings for nonlinear data fusion. Interface Focus, 9(3):20180083, 2019.
[18] Seungjoon Lee, Ioannis G Kevrekidis, and George Em Karniadakis. A general CFD framework for fault-resilient simulations based on multi-resolution information fusion. J. Comput. Phys., 347:290-304, 2017.
[19] Xuhui Meng and George Em Karniadakis. A composite neural network that learns from multi-fidelity data: Application to function approximation and inverse PDE problems. J. Comput. Phys., 401:109020, 2020.
[20] Max D Morris, Toby J Mitchell, and Donald Ylvisaker. Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics, 35(3):243-255, 1993.
[21] Benjamin Peherstorfer, Karen Willcox, and Max Gunzburger. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Rev., 60(3):550-591, 2018.
[22] P Perdikaris, D Venturi, JO Royset, and GE Karniadakis. Multi-fidelity modelling via recursive co-kriging and Gaussian-Markov random fields. Proc. R. Soc. A, 471(2179):20150018, 2015.
[23] Paris Perdikaris, Maziar Raissi, Andreas Damianou, ND Lawrence, and George Em Karniadakis. Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proc. R. Soc. A, 473(2198):20160751, 2017.
[24] Ghanshyam Pilania, James E Gubernatis, and Turab Lookman. Multi-fidelity machine learning models for accurate bandgap predictions of solids. Comput. Mater. Sci., 129:156-163, 2017.
[25] Osborne Reynolds. IV. On the dynamical theory of incompressible viscous fluids and the determination of the criterion. Philos. Trans. R. Soc. Lond. A, (186):123-164, 1895.
[26] Richard Christie. Power systems test case archive, May 1993.
[27] Robert E Rudd and Jeremy Q Broughton. Coarse-grained molecular dynamics and the atomic limit of finite elements. Phys. Rev. B, 58(10):R5893, 1998.
[28] A Stein and LCA Corsten. Universal kriging and cokriging as a regression procedure. Biometrics, pages 575-587, 1991.
[29] A Stein, IG Staritsky, J Bouma, AC Van Eijnsbergen, and AK Bregt. Simulation of moisture deficits and areal interpolation by universal cokriging. Water Resour. Res., 27(8):1963-1973, 1991.
[30] Selvakumar Ulaganathan, Ivo Couckuyt, Francesco Ferranti, Eric Laermans, and Tom Dhaene. Performance study of multi-fidelity gradient enhanced kriging. Struct. Multidiscipl. Optim., 51(5):1017-1033, 2015.
[31] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
[32] Ying Xuan, JunHua Xiang, WeiHua Zhang, and YuLin Zhang. Gradient-based kriging approximate model and its application research to optimization design. Sci. China Technol. Sci., 52(4):1117-1124, 2009.
[33] Xiu Yang, David Barajas-Solano, Guzel Tartakovsky, and Alexandre M Tartakovsky. Physics-informed cokriging: A Gaussian-process-regression-based multifidelity method for data-model convergence. J. Comput. Phys., 395:410-431, 2019.
[34] Xiu Yang, Guzel Tartakovsky, and Alexandre Tartakovsky. Physics-informed kriging: A physics-informed Gaussian process regression method for data-model convergence. arXiv preprint arXiv:1809.03461, 2018.
[35] Xiu Yang, Xueyu Zhu, and Jing Li. When bifidelity meets cokriging: An efficient physics-informed multifidelity method. SIAM J. Sci. Comput., 42(1):A220-A249, 2020.
[36] Ray Daniel Zimmerman, Carlos Edmundo Murillo-Sánchez, and Robert John Thomas. MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Trans. Power Syst., 26(1):12-19, 2011.
[37] Ralf Zimmermann. On the maximum likelihood training of gradient-enhanced spatial Gaussian processes. SIAM J. Sci. Comput., 35(6):A2554-A2574, 2013.
Figure 7: Prediction of the trajectory (QoI), velocity (gradient of QoI), and the phase diagram of an underdamped oscillator. Posterior mean (blue solid lines) and standard deviation (grey shaded area) of the trajectory $x_H(t)$ by (a) Cokriging and (b) GE-Cokriging. Posterior mean (blue solid lines) and standard deviation (grey shaded area) of the velocity $\mathrm{d}x_H(t)/\mathrm{d}t$ by (c) Cokriging and (d) GE-Cokriging. (e) Prediction of the phase diagram by Cokriging (blue dashed line) and by GE-Cokriging (black dashed line). Black diamonds denote high-fidelity observations, red circles denote low-fidelity observations, black solid lines denote the high-fidelity models, and red solid lines denote the low-fidelity models. Colored online.

Figure 8: Prediction of the relationship between the power input of a generator bus $x$ and the real-time power factor of a load bus $f_H(x)$ given by an AC model. Posterior mean (blue solid lines) and standard deviation (grey shaded area) of $f_H(x)$ by (a) Cokriging and (b) GE-Cokriging. Posterior mean (blue solid lines) and standard deviation (grey shaded area) of the gradient of the QoI, $\mathrm{d}f_H(x)/\mathrm{d}x$, by (c) Cokriging and (d) GE-Cokriging.