Linking Gaussian Process regression with data-driven manifold embeddings for nonlinear data fusion
Seungjoon Lee and Felix Dietrich
Department of Chemical and Biomolecular Engineering, Johns Hopkins University
George E. Karniadakis
Division of Applied Mathematics, Brown University
Ioannis G. Kevrekidis
Department of Chemical and Biomolecular Engineering, Johns Hopkins University∗

∗ [email protected]; also at Department of Applied Mathematics and Statistics and Department of Medicine, Johns Hopkins University.

In statistical modeling with Gaussian Process regression, it has been shown that combining (few) high-fidelity data with (many) low-fidelity data can enhance prediction accuracy, compared to prediction based on the few high-fidelity data only. Such information fusion techniques for multifidelity data commonly approach the high-fidelity model f_h(t) as a function of two variables (t, y), and then use f_l(t) as the y data. More generally, the high-fidelity model can be written as a function of several variables (t, y_1, y_2, ...); the low-fidelity model f_l and, say, some of its derivatives can then be substituted for these variables. In this paper, we explore mathematical algorithms for multifidelity information fusion that use such an approach towards improving the representation of the high-fidelity function with only a few training data points. Given that f_h may not be a simple function (and sometimes not even a function) of f_l, we demonstrate that using additional functions of t, such as derivatives or shifts of f_l, can drastically improve the approximation of f_h through Gaussian Processes. We also point out a connection with "embedology" techniques from topology and dynamical systems. Our illustrative examples range from instructive caricatures to computational biology models, such as Hodgkin-Huxley neural oscillations.

I. INTRODUCTION
Recent advances in both algorithms and hardware are increasingly making machine learning an important component of mathematical modeling for physicochemical, engineering, as well as biological systems (e.g. [1–4]). Part of these developments focus on multiresolution and multifidelity data fusion [5, 6]. Fusing information from models constructed at different levels of resolution/fidelity has been shown to enhance prediction accuracy in data-driven scientific computing (e.g. [7, 8]). In addition, if high-fidelity data are costly to obtain (experimentally or computationally) while low-fidelity data are relatively cheap, a combination of a few high-fidelity data and many low-fidelity data can also lead to overall computational efficiency. Recently, Gaussian Process (GP) regression has been widely used to effectively combine data of multiple fidelities [9–13]. "Classical" information fusion by GP had focused only on linear correlations between high- and low-fidelity data via auto-regressive schemes, such as the coKriging approach of Kennedy and O'Hagan [14]. If two (or more) data sets have nonlinear correlations, such linear-type approaches will lose their effectiveness.

When a high-fidelity model f_h(t) is nonlinearly correlated with a low-fidelity model f_l(t), Nonlinear Auto-Regressive Gaussian Process regression (NARGP) [13] achieves highly accurate prediction results by introducing an additional dimension: f_h(t) is approximated as a curve on a two-dimensional manifold parametrized by t and a second variable, y. The data points on this manifold are of the form (t, f_l(t), f_h(t)). More generally, NARGP finds a strong correlation between d-dimensional high- and low-fidelity models in a (d+1)-dimensional space. In this framework, the low-fidelity model provides the additional "latent" variable of the high-fidelity model.

Deep multifidelity GPs were introduced [15] as an improvement, especially in the context of discontinuities in f_h with respect to t (and f_l). This approach focuses on constructing a useful transformation T(t) of the input variables t through a (deep) neural network. Then, the high-fidelity function f_h(t) is approximated as a GP of (just) f_l(T(t)). One must now, of course, perform optimization for the additional network hyperparameters.

In this paper we discuss a connection of NARGP with data-driven embeddings in topology/dynamical systems, and extend it (in the spirit of such data-driven embeddings) in an attempt to improve the numerical approximation of f_h in the "sparse f_h, rich f_l" data setting. In what follows we will (rather arbitrarily) ascribe the characterization "high fidelity" or "low fidelity" to the different functions used in our illustrations; this characterization is based solely on the number of available data points for each.

For our first example, the "high-fidelity" function f_h, for which we only have a few data points, is a function of t; but it actually also happens that we can describe it as a function of f_l:

    f_h(t) = \sin^2(8\pi t), \qquad f_l(t) = \sin(8\pi t),    (1)

see FIG. 1. When we choose t as the coordinate parametrizing f_h(t) (the green curve in FIG. 1(a)), the GP regression fails to represent the high-frequency sine function with just a few training data points, as shown in FIG. 1(b). However, as FIG. 1(c) shows, if we choose f_l as the coordinate of f_h(f_l) (colored red in FIG. 1(a)), the GP regression can represent the simple quadratic function quite effectively. If we still need to know the parametrization of f_h by t, we can obtain it through the "data rich" f_l: f_h(t) ≡ f_h(f_l(t)).

FIG. 1. Two alternative GP regressions for the high-fidelity model f_h(t) = sin²(8πt) (the highly oscillatory green curve in (a)). The blue curve visually suggests a smooth two-dimensional manifold g(t, f_l). If, on the other hand, the high-fidelity function is projected onto f_l(t), we can see it is a simple quadratic. (b) and (c): the prediction results with 2 standard deviations (dashed lines) for the high-fidelity function by GP regression with only t and with only f_l(t), respectively. We employ 15 high-fidelity data points and 100 uniformly distributed low-fidelity data points for the regression.
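The comparison in FIG. 1 is straightforward to reproduce. The sketch below is our own illustration (not the authors' script); it assumes the GPy package named later in the paper and the 15/100 point counts quoted in the caption:

```python
import numpy as np
import GPy

# The two models of equation (1): f_h is "data poor", f_l is "data rich".
f_h = lambda t: np.sin(8 * np.pi * t) ** 2
f_l = lambda t: np.sin(8 * np.pi * t)

t_h = np.sort(np.random.rand(15, 1), axis=0)   # 15 high-fidelity samples
y_h = f_h(t_h)

# FIG. 1(b): GP regression of f_h parametrized by t only.
gp_t = GPy.models.GPRegression(t_h, y_h, GPy.kern.RBF(input_dim=1))
gp_t.optimize()

# FIG. 1(c): GP regression of f_h parametrized by f_l(t) only;
# f_l is known explicitly here, so it can be evaluated anywhere.
gp_fl = GPy.models.GPRegression(f_l(t_h), y_h, GPy.kern.RBF(input_dim=1))
gp_fl.optimize()

t_star = np.linspace(0, 1, 500).reshape(-1, 1)
mean_t, var_t = gp_t.predict(t_star)           # fails: too few points per period
mean_fl, var_fl = gp_fl.predict(f_l(t_star))   # succeeds: f_h is a simple quadratic in f_l
```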
If f_h is not a function of f_l, however (as has been observed in the literature [13] and as can be seen in FIG. 2(a)), more variables are necessary to create a useful parametrization of f_h: more precisely, a domain over which f_h is a function.

In the NARGP framework, the variable used in addition to f_l is t itself. In this paper, we will also advocate the use of delays or derivatives of f_l as additional variables. This approach can also help remove the explicit dependence of f_h on t, since embedding theories in dynamical systems [16–21] guarantee that we can choose any generic observation of t, or derivatives and/or delays of this observation, as a replacement for t; see section II B for more details.

In section III D, we apply the proposed framework to the Hodgkin-Huxley model, describing the behavior of action potentials in a neuron. Here, the high- and the low-fidelity functions are action potentials at two different values of the external current. This is a case where f_h is a complicated function of t and does not depend only on f_l; yet, as we will see, delays of f_l will help us construct an accurate approximation of f_h.

The paper is organized as follows: In section II, we review the NARGP framework and concepts of "embedology", and illustrate why and when this framework is successful. In section III, we demonstrate the effectiveness of our proposed framework via some pedagogical examples and the biologically motivated application to the Hodgkin-Huxley model. In section IV, we summarize our results and discuss open issues for further development of general multifidelity information fusion in modeling and simulation practice across scientific domains.

II. METHODS

A. Nonlinear information fusion algorithms

"Classical" multifidelity data fusion algorithms require a linear (or almost linear) correlation between the different fidelity models. Under this constraint, we can merge two or more data sets by using scaling and shifting parameters, as in the Kennedy and O'Hagan coKriging approach [14]. More generally, however, the data sets are nonlinearly correlated, which typically degrades the quality of the results of linear information fusion algorithms.
In order to resolve correlation nonlinearity between data sets, the use of a space-dependent scaling factor ρ(x) [22] or, alternatively, deep multifidelity Gaussian Processes [15] have been introduced; clearly, the improvement they bring requires additional hyperparameter optimization.

When the high-fidelity model f_h is nonlinearly correlated with the low-fidelity model f_l, but can be written as a simple function of x and f_l, the Nonlinear Auto-Regressive GP (NARGP) [13] is an appropriate choice. In this framework, a one-dimensional high-fidelity function f_h is assumed to be a "simple" function g of two variables (x, y), i.e. it is a curve that lies on the two-dimensional manifold described by g. Then, GP regression in the two-dimensional space is performed, where the data for y are the f_l(x) data:

    g(x, y) \sim \mathcal{GP}(0, k((x, y), (x', y'))), \qquad f_h(x) = g(x, f_l(x)).    (2)

In [13], the gain in accuracy of NARGP was documented, compared both to an auto-regressive scheme with a constant scaling factor and to one with a scaling factor modeled as a (space-dependent) Gaussian process.

Algorithmically, classical auto-regressive GPs employ an explicit method, such as a scaling constant ρ between two covariance kernels k_1 and k_2:

    \begin{bmatrix} f_l(x) \\ f_h(x) \end{bmatrix} \sim \mathcal{GP}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k_1(x, x') & \rho k_1(x, x') \\ \rho k_1(x, x') & \rho^2 k_1(x, x') + k_2(x, x') \end{bmatrix} \right).    (3)

The NARGP framework, on the other hand, employs an implicit approach through automatic relevance determination (ARD) weights [23] in the extended space parametrized by x and y: a different scaling hyperparameter for each of the two dimensions in the kernel. In many applications, the radial basis function (see equation (4)) has been used as the covariance kernel (where ARD implies a different scaling hyperparameter θ_i for each dimension):

    k(x, x'; \theta) = \exp\left( -\sum_{i=1}^{d} \theta_i (x_i - x'_i)^2 \right).    (4)

FIG. 2 showcases an example where f_h cannot be written as a function of f_l:

    f_h(t) = t + \sin^2(8\pi t), \qquad f_l(t) = \sin(8\pi t), \qquad t \in [0, 0.25].    (5)

Following the NARGP framework, we choose the low-fidelity data as an additional variable y = f_l(t); we then approximate the two-dimensional function

    g(t, y) = t + y^2.    (6)

Approximating g only requires a few training data point pairs for the GP regression. Then, f_h can be written as f_h(t) = g(t, f_l(t)), see FIG. 2(b). FIG. 2(c) demonstrates that we can, alternatively, use delays of f_l instead of t as an additional variable. A rationalization of this follows in the next section.

FIG. 2. Two alternative GP regressions for the high-fidelity model f_h (see equation (5)). (a): the multivaluedness of f_h with respect to f_l is clearly visible when the high-fidelity data (the blue curve) are projected onto f_l(t) (the red curve). (b) and (c): the high-fidelity function (the yellow curve) versus posterior means with two standard deviations (dashed lines) of two alternative GP regressions. We use 7 high-fidelity data points and 100 uniformly distributed low-fidelity data points for training the GP regression models. (b) GP regression with {t, f_l(t)}. (c) GP regression with {f_l(t), f_l(t − τ)}, τ = 1/…
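In code, the NARGP-style regression of equations (2)–(6) amounts to a GP in the extended input space (t, f_l(t)) with an ARD kernel, one scaling hyperparameter per dimension as in equation (4). A minimal sketch follows; this is our paraphrase for the example of equation (5), not the reference implementation of [13]:

```python
import numpy as np
import GPy

f_l = lambda t: np.sin(8 * np.pi * t)
f_h = lambda t: t + np.sin(8 * np.pi * t) ** 2        # equation (5)

t_h = np.sort(np.random.rand(7, 1) * 0.25, axis=0)    # 7 high-fidelity points on [0, 0.25]
X = np.hstack([t_h, f_l(t_h)])                        # extended inputs (t, y), with y = f_l(t)

# ARD=True assigns a separate scaling hyperparameter theta_i to each
# input dimension, as in the kernel of equation (4).
g_gp = GPy.models.GPRegression(X, f_h(t_h), GPy.kern.RBF(input_dim=2, ARD=True))
g_gp.optimize_restarts(num_restarts=5, verbose=False)

# Prediction re-uses the data-rich f_l: f_h(t*) ~ g(t*, f_l(t*)).
t_star = np.linspace(0, 0.25, 200).reshape(-1, 1)
mean, var = g_gp.predict(np.hstack([t_star, f_l(t_star)]))
```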
B. Data-driven Higher-dimensional Embeddings

The theorem of Whitney [16] states that any sufficiently smooth manifold of dimension d ∈ ℕ can be embedded in Euclidean space ℝⁿ, with the tight bound n ≥ 2d + 1. Nash [17] showed that this embedding can even be isometric if the manifold is compact and Riemannian, even though the bound on n is higher. Many results on the reconstruction of invariant sets in the state spaces of dynamical systems are based on these two theorems [18–21]. Here, the n embedding dimensions are usually formed by n scalar observations of the system state variables. Instead of n different observations, recording n time delays of a single scalar observable is also possible, as originally formulated by Takens [18] (see also [24]). Given a smooth, d-dimensional manifold M, a vector field X on M with associated flow map f: M × ℝ → M, as well as an observable h: M → ℝ, it is possible to construct an embedding φ: M → ℝⁿ through

    \varphi(p) = \left[ h(p),\; h(f(p, -\Delta t)),\; \dots,\; h(f(p, -(n-1)\Delta t)) \right]^T.    (7)

Sauer et al. [20] further specified the conditions on the observable, with respect to the underlying vector field of the dynamical system, that have to be satisfied such that the state space can be embedded successfully. They also extended the results to embeddings of invariant sets with fractal dimension. The observable and the vector field together build a tuple that has to be generic (Takens [18]) or prevalent (Sauer et al. [20]). The two papers show that "almost all" tuples are admissible, where the notions of genericity and prevalence define that probabilistic statement in the respective function spaces. These results are crucial for the numerical exploitation of the embedding theorems, since they show that "almost all" observables of a dynamical system can be chosen for a delay embedding.

In many applications, the intrinsic dimension d of M is unknown, and n is preemptively chosen large (larger than necessary). Manifold learning techniques are then often capable of reducing the embedding dimension, bringing it closer to the minimal embedding dimension necessary for the given manifold.

We also note that in our framework, if we can obtain low-fidelity data from a (dynamic) process instead of just single measurements, the same embedding and manifold learning techniques can be used even if the "independent variable" t is not known [25].

Let us consider equation (5) again. NARGP performs GP regression with two observations {t, f_l(t)}. Using embedding theory, we can rationalize (a) why the two observations were necessary; and (b) why performing GP regression with an additional delay of the scalar observable, {f_l(t), f_l(t − τ)}, is equally appropriate for a relatively small time horizon, see FIG. 2(c). Notice that the delay coordinates f_l(t) and f_l(t − τ) lie on an ellipse with period 0.25. Hence, if the data are collected over times longer than 0.25, using only delays will fail to represent the high-fidelity function due to multivaluedness (see FIG. 3(e) in section III A); t itself, used as an observable, will resolve this.
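A delay map in the sense of equation (7) is straightforward to assemble numerically. The sketch below is ours; the delay τ = 1/32 and the sampling grids are illustrative choices, not values from the paper:

```python
import numpy as np

def delay_map(h, t, tau, n):
    """phi(t) = [h(t), h(t - tau), ..., h(t - (n - 1) tau)], cf. equation (7)."""
    return np.hstack([h(t - k * tau) for k in range(n)])

f_l = lambda t: np.sin(8 * np.pi * t)

# On a horizon shorter than the period 0.25, the delay map is one-to-one:
t_short = np.linspace(0, 0.25, 200).reshape(-1, 1)
Phi = delay_map(f_l, t_short, tau=1.0 / 32, n=2)   # points on (an arc of) an ellipse

# On a longer horizon the ellipse is traversed repeatedly, so distinct
# times map to the same delay coordinates: t is no longer recoverable.
t_long = np.linspace(0, 1, 800).reshape(-1, 1)
Phi_multivalued = delay_map(f_l, t_long, tau=1.0 / 32, n=2)
```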
C. Extending the formulation through delays

Now we provide a mathematical formulation of the approach. We assume that f_h: ℝ → ℝ and f_l: ℝ → ℝ are C^K-smooth functions with K ≥ 2, and we want to construct an interpolant of f_h with a numerical scheme (here, a Gaussian Process). We assume that

(1) only a small number of data points {x, f_h(x)}, as well as the function f_l and its K derivatives, are available,

(2) f_h can be written in the form

    f_h(x) = g(x, f_l(x), f_l^{(1)}(x), \dots, f_l^{(K)}(x)),    (8)

where f_l^{(i)}(x) denotes the i-th derivative, i ∈ {1, …, K}, and

(3) g: ℝ^{K+2} → ℝ is a C^K function with derivatives bounded in the L^∞ norm by a small constant c_g > 0:

    \left\| \frac{\partial}{\partial x_l} g(x_1, \dots, x_{K+2}) \right\|_{L^\infty} \le c_g \quad \forall\, l \in \{1, \dots, K+2\}.    (9)

The Taylor series of f_h reveals why assumptions (2) and (3) are required: Assume we want to evaluate f_h in a small neighborhood of a point x_0 ∈ ℝ (here, w.l.o.g. we set x_0 = 0). Then, assuming f_h is smooth enough, we can write

    f_h(x) = f_h(0) + \frac{\partial}{\partial x} f_h(0)\, x + \mathcal{O}\left( \left\| \frac{\partial^2}{\partial x^2} f_h \right\|\, x^2 \right).    (10)

Since we assume the form (8) for f_h, we can write

    \frac{\partial}{\partial x} f_h(x) = \frac{\partial}{\partial x} \left[ g(x, f_l(x), f_l^{(1)}(x), \dots, f_l^{(K)}(x)) \right]    (11)
    = \frac{\partial}{\partial x_1} g(x, f_l(x), f_l^{(1)}(x), \dots)    (12)
    \; + \sum_{i=2}^{K+1} \frac{\partial}{\partial x_i} g(x, f_l(x), f_l^{(1)}(x), \dots) \cdot f_l^{(i-1)}(x).    (13)

From this, we can see that ∂f_h/∂x can be large (because the derivatives f_l^{(i)}(x) can be large), but if we know all f_l^{(i)}(x), we only have to estimate g from the data f_h and f_l^{(i)}. Crucially, we do not approximate the function f_h(x) and its derivatives. The derivatives of g are bounded by c_g through assumption (3), and g is a C^K function, so only a few data points are necessary for a good fit.

If we have access to function values of f_l over "delays" (at discrete shifts, say in the form of a finite difference stencil) in space, rather than to its derivatives, we can use the Newton series approximation of f_h instead of equation (10) for an analogous argumentation: For functions f_h that are analytic around the expansion point (here, x_0 = 0), the function f_h can be evaluated at x close to it by

    f_h(x) = \sum_{m=0}^{\infty} \binom{x/\Delta x}{m} \Delta^m f_h(0)    (14)
    = f_h(0) + \frac{f_h(\Delta x) - f_h(0)}{\Delta x}\, x + \mathcal{O}\left( \left\| \Delta^2 f_h \right\| \right),    (15)

where Δ^m f_h is the m-th finite difference approximation of f_h, with a small step size Δx. By equation (8), these differences can be expressed through g and delays of f_l (instead of delays of f_h), analogously to equations (11)–(13). Using delays in space instead of derivatives has numerical advantages, especially in cases where f_h or f_l are not differentiable (or even have discontinuities). It also enables us to estimate the derivatives of f_l implicitly, in case only the function f_l is available (and not its derivatives).
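In practice, the finite differences of equations (14)–(15) are built from shifted function evaluations only. A small sketch of the corresponding implicit derivative estimates for f_l (ours; the step size is illustrative):

```python
import numpy as np
from math import comb

f_l = lambda t: np.sin(8 * np.pi * t)

def fd_derivative(f, t, dx, m):
    """Estimate the m-th derivative of f at t from shifted evaluations only:
    Delta^m f(t) / dx^m, with the forward difference
    Delta^m f(t) = sum_j (-1)^(m - j) * C(m, j) * f(t + j dx)."""
    delta = sum((-1) ** (m - j) * comb(m, j) * f(t + j * dx) for j in range(m + 1))
    return delta / dx ** m

# First and second derivative estimates of f_l at t = 0.1:
d1 = fd_derivative(f_l, 0.1, 1e-4, 1)   # close to 8*pi*cos(0.8*pi)
d2 = fd_derivative(f_l, 0.1, 1e-4, 2)   # close to -(8*pi)**2*sin(0.8*pi)
```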
D. Outline of the numerical approach

In order to employ the delay coordinates of the low-fidelity function, it is required to know shifts of it. A necessary condition of the proposed framework is that the low-fidelity function is given explicitly, or can be learned well enough from the given data that low-fidelity function values can be accurately approximated (interpolated) at arbitrary points. If the state variable t is not available, the low-fidelity model should be a generic observation of t in order to be useful in employing Takens' embedding theorem [18, 20]. Under these conditions, we now present a summary of the workflow.

If the low-fidelity model is given in the form of (rich) data, we train a GP regression model for it from these data {t_l, y_l(t_l)} by minimizing the negative log marginal likelihood. This data-driven step can be circumvented if the low-fidelity model is explicitly given, as in the above examples. After that, we compute predictive posterior means of the low-fidelity model at the points t_h where the high-fidelity data are available. We also compute a number of shifts of the low-fidelity function at the points t_h − kτ and at the test points t*. Next, we train another GP regression model for the high-fidelity data in the higher-dimensional space, {{t_h, f_l(t_h), f_l(t_h − τ), …, f_l(t_h − nτ)}, f_h(t_h)}. Finally, we compute the predictive posterior mean and variance at the test points t* in the higher-dimensional space. Each new delay burdens the optimization with a single additional hyperparameter. For more details, refer to [13, 23]. In this paper, all GP computations are performed with the open-source Python package GPy [26].
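The workflow just outlined can be condensed into a short script. The sketch below is our rendering of these steps; the function name, kernel choice, and number of optimizer restarts are our own (the paper reports using GPy [26] but does not list code):

```python
import numpy as np
import GPy

def multifidelity_delay_gp(t_l, y_l, t_h, y_h, t_star, tau, n_delays):
    """GP+E: GP regression for f_h over (t, f_l(t), f_l(t - tau), ..., f_l(t - n_delays*tau))."""
    # Step 1: GP surrogate for the low-fidelity model, trained by minimizing
    # the negative log marginal likelihood (skippable if f_l is known explicitly).
    gp_l = GPy.models.GPRegression(t_l, y_l, GPy.kern.RBF(input_dim=1))
    gp_l.optimize()
    f_l = lambda t: gp_l.predict(t)[0]            # predictive posterior mean

    # Step 2: extended inputs; each delay adds one input dimension.
    embed = lambda t: np.hstack([t] + [f_l(t - k * tau) for k in range(n_delays + 1)])

    # Step 3: high-fidelity GP in the (n_delays + 2)-dimensional space with an
    # ARD kernel; each new delay costs one additional hyperparameter.
    kern = GPy.kern.RBF(input_dim=n_delays + 2, ARD=True)
    gp_h = GPy.models.GPRegression(embed(t_h), y_h, kern)
    gp_h.optimize_restarts(num_restarts=5, verbose=False)

    # Step 4: predictive posterior mean and variance at the test points.
    return gp_h.predict(embed(t_star))
```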
III. RESULTS

We introduce three pedagogical examples to demonstrate our approach. First, we explore the case where f_h is a phase-shifted version of f_l (section III A). Then, we show that oscillations with different periods (leading to different recurrences) present a more challenging scenario, which however can still be resolved by using shifts of f_l. The third example involves discontinuities in f_h and f_l. After these three examples, in section III D, we demonstrate the approach in the context of the Hodgkin-Huxley model. We investigate the effectiveness of the proposed framework by comparing it to three established frameworks in the same computational environment: (1) single Gaussian Process regression with high-fidelity data only (GP or Kriging), (2) the auto-regressive method with a constant scaling parameter ρ (AR1 or coKriging), and (3) nonlinear auto-regressive GP (NARGP).

A. Phase Shifted Oscillations
Using models at different levels of resolution (e.g. for biological oscillators) will often give oscillations that have very comparable periods but are phase-shifted. Let us start by considering two functions with different phases on t ∈ [0, 1]:

    f_h(t) = t + \sin^2(8\pi t + \pi/10), \qquad f_l(t) = \sin(8\pi t).    (16)

The high-fidelity function can be rewritten by a trigonometric addition formula:

    f_h(t) = t + \left( \sin(8\pi t)\cos(\pi/10) + \cos(8\pi t)\sin(\pi/10) \right)^2.    (17)

Now we can explicitly see how the high-fidelity function can be thought of as a combination of three variables: t, the low-fidelity function f_l(t) = sin(8πt), and its first derivative f_l^{(1)}(t) ∝ cos(8πt). Using delays of f_l with a small stepsize τ contains enough information to numerically estimate its derivatives; hence we can also write f_h as

    f_h(t) \mapsto g(t, f_l(t), f_l(t - \tau), f_l(t - 2\tau)).    (18)

The GP regression model for g is trained on this 4-dimensional data. In addition, we perform GP regression in a three-dimensional space constructed from only three delays:

    f_h(t) \mapsto g(f_l(t), f_l(t - \tau), f_l(t - 2\tau)).    (19)

As shown in FIG. 3(b), the single GP regression model provides inaccurate predictive posterior means due to the lack of high-fidelity data. While the linear auto-regressive model (AR1) also fails to predict the high-fidelity values, the NARGP (with 10 high-fidelity data points and 100 low-fidelity data points) catches the trend of the high-fidelity data, yet still yields inaccurate results: NARGP is informed only by t and f_l(t), and not also by f_l^{(1)}(t). Similarly, the GP regression with only delays (no information about t) in FIG. 3(e) fails to represent the high-fidelity function for these long observation windows. Beyond 0.25, t cannot be recovered from the shifts of f_l, because f_l is only a generic observer of t ∈ [0, 0.25). The GP regression with t and three delays of f_l, on the other hand, provides an excellent prediction with only 10 high-fidelity data points (and 100 low-fidelity data points). This means that, in the 4-dimensional space, g (see equation (18)) has small derivatives, which then helps to employ GP regression successfully.

Next, we investigate the sensitivity and scalability of the proposed framework with respect to the number of high-fidelity data points (training data points). We train all GP regression models with 10, 15, 20, and 25 randomly-chosen high-fidelity data points and 100 uniformly distributed low-fidelity data points. The error is obtained by averaging 10 trials of random data selections. The log L₂ error with respect to the number of high-fidelity data points is presented in FIG. 4(a). The two established approaches (AR1 and NARGP) and the GP with only delays show no significant accuracy enhancement as the number of training points increases. The reason for the consistently large errors is the lack of the additional information provided by the derivatives. The GP in the higher-dimensional space that includes t, on the other hand, shows a strong correlation between accuracy and the number of training points: more high-fidelity points visibly improve the approximation.

FIG. 3. Examples of phase shifted oscillations. (a) correlation between t, f_l and f_h. (b)–(f) The high-fidelity function (the yellow curve) versus posterior means with 2 standard deviations (dashed lines) of 5 compared methods with 10 high-fidelity data points and 100 uniformly distributed low-fidelity data points. (b) GP (Kriging) and low-fidelity data (the red-dashed curve). (c) Auto-regressive GP (AR1 or coKriging). (d) Nonlinear auto-regressive GP (NARGP). (e) GP in the higher-dimensional space with only delays (f_l(t), f_l(t − τ), f_l(t − 2τ)). (f) GP in the higher-dimensional space (GP+E), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ)).
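As a usage illustration, the sketch from section II D can be applied to the phase-shifted pair of equation (16); this assumes the formulas as reconstructed above, reuses multifidelity_delay_gp from that sketch, and follows the 10/100 point counts of FIG. 3, while τ is an illustrative choice:

```python
import numpy as np

f_h = lambda t: t + np.sin(8 * np.pi * t + np.pi / 10) ** 2   # equation (16)
f_l = lambda t: np.sin(8 * np.pi * t)

t_l = np.linspace(0, 1, 100).reshape(-1, 1)     # 100 low-fidelity points
t_h = np.sort(np.random.rand(10, 1), axis=0)    # 10 high-fidelity points
t_star = np.linspace(0, 1, 500).reshape(-1, 1)

# GP+E of FIG. 3(f): inputs (t, f_l(t), f_l(t - tau), f_l(t - 2 tau))
mean, var = multifidelity_delay_gp(t_l, f_l(t_l), t_h, f_h(t_h),
                                   t_star, tau=1.0 / 32, n_delays=2)
```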
B. Different Periodicity

In this example, the high- and the low-fidelity model oscillations are not just phase-shifted, but are also characterized by different periods. In applications, this could arise if we tried to match observations of oscillations of the same model at two different parameter values. Different (possibly irrationally related) oscillation periods dramatically complicate the correlation across the two data sets.

FIG. 4. Log L₂ error (y-axis) of prediction by GP (Kriging), AR1 (coKriging), NARGP, and GPs in the higher-dimensional space (GP+E and GP+E(2)) with respect to the number of high-fidelity data points (x-axis). The error is obtained by averaging 10 trials of random data selections. (a) Phase shifted oscillations. (b) Different periodicity. (c) Models with discontinuities. (d) The Hodgkin-Huxley models.

We consider two different-period and phase-shifted data sets,

    f_h(t) = \sin(8\pi t + \pi/10), \qquad f_l(t) = \sin(6\sqrt{2}\,\pi t).    (20)

The high-fidelity function can be rewritten by a trigonometric addition formula:

    f_h(t) = \sin(8\pi t)\cos(\pi/10) + \cos(8\pi t)\sin(\pi/10).    (21)

In addition, the first term sin(8πt) can be rewritten again by a trigonometric subtraction formula:

    \sin(at - bt) = \cos(bt)\sin(at) - \cos(at)\sin(bt),    (22)

where a = 6√2 π and b = 6√2 π − 8π. Then,

    \sin(8\pi t) = \cos(bt)\, f_l(t) - \sin(bt)\, f_l^{(1)}(t).    (23)

The second term cos(8πt) can be rewritten in the same way. This shows that the high-fidelity function can be written in terms of sin(bt), cos(bt), f_l(t), and f_l^{(1)}(t). Since sin(bt) and cos(bt) have a lower frequency compared to the original frequency 8π, the bound c_g for the derivatives of g (see section II C) is smaller. It is then reasonable that we can approximate the high-fidelity function in the higher-dimensional space with a few training data points.

We perform the GP regression in two different extended spaces: (1) with 3 additional delays, totaling a 4-dimensional space (GP+E), and (2) with 5 additional delays, totaling a 6-dimensional space (GP+E(2)), and compare them to a single GP, AR1, and NARGP. Examples of regression results with 15 high-fidelity data points and 200 uniformly distributed low-fidelity data points are shown in FIG. 5. The GP in the 4-dimensional space provides better regression results than the other established methods, and the GP in the 6-dimensional space presents the best results.

Moreover, as shown in FIG. 5(b), the phase discrepancy between the high- and low-fidelity functions increases as time increases, resulting in larger errors for larger values of t, see FIG. 5(b)–(d). However, the GPs in the higher-dimensional spaces provide accurate prediction results over this entire observation window.

The sensitivity to the number of high-fidelity data points is shown in FIG. 4(b). The GPs in the 4- and 6-dimensional spaces show significant accuracy gains compared to all other methods. These results demonstrate the capability of the proposed framework for period-shifted and phase-shifted data fusion.

FIG. 5. Examples of different periodicity. (a) correlation between f_l and f_h. (b)–(f) The high-fidelity function (the yellow curve) versus posterior means with 2 standard deviations (dashed lines) of 5 compared methods with 15 high-fidelity data points and 200 uniformly distributed low-fidelity data points. (b) GP (Kriging) and low-fidelity data (the red-dashed curve). (c) Auto-regressive GP (AR1 or coKriging). (d) Nonlinear auto-regressive GP (NARGP). (e) GP in the 4-dimensional space (GP+E), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ)). (f) GP in the 6-dimensional space (GP+E(2)), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ), f_l(t − 3τ), f_l(t − 4τ)).
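The sensitivity studies summarized in FIG. 4 amount to an outer loop over the number of high-fidelity training points, averaged over random draws. A sketch (ours, reusing the definitions of the two previous snippets; the error metric is our reading of the "log L₂ error" in the caption):

```python
import numpy as np

def mean_log_l2_error(n_high, n_delays, trials=10):
    """Average relative L2 prediction error (log10) over random selections
    of n_high high-fidelity training points."""
    errs = []
    for _ in range(trials):
        t_h = np.sort(np.random.rand(n_high, 1), axis=0)
        mean, _ = multifidelity_delay_gp(t_l, f_l(t_l), t_h, f_h(t_h),
                                         t_star, tau=1.0 / 32, n_delays=n_delays)
        errs.append(np.linalg.norm(mean - f_h(t_star)) / np.linalg.norm(f_h(t_star)))
    return np.log10(np.mean(errs))

# GP+E (n_delays=2) and GP+E(2) (n_delays=4) for increasing training-set sizes:
curve_gpe = [mean_log_l2_error(n, 2) for n in (10, 15, 20, 25)]
curve_gpe2 = [mean_log_l2_error(n, 4) for n in (10, 15, 20, 25)]
```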
C. Models with Discontinuities

In general, a smooth stationary covariance kernel cannot capture discontinuous model data. In order to resolve this problem, non-stationary kernels have been introduced [27, 28], with space-dependent hyperparameters. Moreover, nested GPs have also been used successfully to alleviate this problem [29]. Both approaches introduce, of course, additional hyperparameters to optimize.

In this example, we introduce a discontinuous function f_l on t ∈ [0, 0.5) as

    f_l(t) = 0.5\,(6t - 2)^2 \sin(12t - 4) + 10(t - 0.5) - 5,    (24)

and on t ∈ [0.5, 1] as

    f_l(t) = 0.5\,(6t - 2)^2 \sin(12t - 4) + 10(t - 0.5),    (25)

and the high-fidelity function f_h as

    f_h(t) = 2 f_l(t) - 20t + 20 = g(t, f_l(t)).    (26)

In the scenario we describe here, the high-fidelity function f_h is discontinuous, but can be written as a linear function g of two variables.

Examples of regression results with 10 high-fidelity data points and 200 uniformly distributed low-fidelity data points are shown in FIG. 6. Since g is a linear function of t and f_l(t), the NARGP, as well as our GPs in the higher-dimensional spaces, provide highly accurate prediction results with just a few high-fidelity data.

In the sensitivity analysis with respect to the number of high-fidelity data points, see FIG. 4(c), there is no significant accuracy gain after 10 high-fidelity data points. That is because 10 training data points are enough to represent a linear function accurately. It is worth noting that, here, the NARGP provides better prediction results with 20 training data points compared to the GPs in the higher-dimensional space, possibly due to overfitting.

FIG. 6. Examples of models with discontinuities. (a) correlation between t, f_l and f_h. (b)–(f) The high-fidelity function (the yellow curve) versus posterior means with 2 standard deviations (dashed lines) of 5 compared methods with 10 high-fidelity data points and 200 uniformly distributed low-fidelity data points. (b) GP (Kriging) and low-fidelity data (the red-dashed curve). (c) Auto-regressive GP (AR1 or coKriging). (d) Nonlinear auto-regressive GP (NARGP). (e) GP in the 4-dimensional space (GP+E), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ)). (f) GP in the 6-dimensional space (GP+E(2)), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ), f_l(t − 3τ), f_l(t − 4τ)).
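For completeness, the discontinuous pair can be coded directly. The sketch below assumes the Forrester-type form as reconstructed in equations (24)–(26); the exact branch constants are partly illegible in the source, so treat them as assumptions:

```python
import numpy as np

def f_l(t):
    # Piecewise low-fidelity model, equations (24)-(25); the branch offset
    # (-5 on [0, 0.5) only) is an assumption about the reconstructed constants.
    base = 0.5 * (6 * t - 2) ** 2 * np.sin(12 * t - 4) + 10 * (t - 0.5)
    return np.where(t < 0.5, base - 5.0, base)

def f_h(t):
    # Equation (26): f_h inherits the discontinuity, but is linear in (t, f_l(t)).
    return 2.0 * f_l(t) - 20.0 * t + 20.0
```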
D. The Hodgkin-Huxley Model

Based on the results of the pedagogical examples, we apply the proposed framework to a famous model of an intracellular process, a version of the Hodgkin-Huxley equations [30].
FIG. 7. Examples of the two Hodgkin-Huxley model oscillations obtained with different external currents I_ext. (a) correlation between t, f_l and f_h. (b)–(f) The high-fidelity function (the yellow curve) versus posterior means with 2 standard deviations (dashed lines) of 5 compared methods with 20 high-fidelity data points and 300 uniformly distributed low-fidelity data points. (b) GP (Kriging) and low-fidelity data (the red-dashed curve). (c) Auto-regressive GP (AR1 or coKriging). (d) Nonlinear auto-regressive GP (NARGP). (e) GP in the 4-dimensional space (GP+E), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ)). (f) GP in the 6-dimensional space (GP+E(2)), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ), f_l(t − 3τ), f_l(t − 4τ)).

In 1952, Hodgkin and Huxley introduced a mathematical model which can describe the initiation and propagation of action potentials in a neuron. Specifically, they invented electrical equivalent circuits to mimic the ion channels through which ions traffic across the cell membrane. The model for the intracellular action potential V_m can be written as a simple ODE:

    C_m \frac{dV_m}{dt} + I_{ion} = I_{ext},    (27)

where C_m is the membrane capacitance, and I_ion and I_ext represent the total ionic current and the external current, respectively.

The total ionic current I_ion = I_Na + I_K + I_L is the sum of three individual currents: a sodium current (I_Na), a potassium current (I_K), and a leakage current (I_L). In order to calculate the three individual currents in time, the Hodgkin-Huxley model introduces gates which regulate the flow of ions through the channels. Specifically, the three ionic currents are affected by three different gates n, m, and h. Based on these gates, the total ionic current can be calculated as

    I_{ion} = \bar{g}_{Na}\, m^3 h\, (V_m - E_{Na}) + \bar{g}_K\, n^4\, (V_m - E_K) + \bar{g}_L\, (V_m - E_L),    (28)

where \bar{g}_* represents a normalized constant and E_* represents the equilibrium potential for the sodium (* ≡ Na), potassium (* ≡ K), and leakage (* ≡ L) currents. The three gates n, m, and h can be modeled by the following ODEs:

    \frac{dn}{dt} = \alpha_n(V)(1 - n) - \beta_n(V)\, n,    (29)
    \frac{dm}{dt} = \alpha_m(V)(1 - m) - \beta_m(V)\, m,    (30)
    \frac{dh}{dt} = \alpha_h(V)(1 - h) - \beta_h(V)\, h.    (31)

In this paper, we set the model parameter values to \bar{g}_{Na} = 1.2, \bar{g}_K = 0.36, \bar{g}_L = 0.003, E_{Na} = 55.17, E_K = −72.14, and E_L = −49.42 [31]. We assume that the different fidelity data come from simulations at different external currents I_ext; we use I_ext = 1.05 for the low-fidelity model and a slightly different value for the high-fidelity model, resulting in a different oscillation period (and a phase shift when we start at the same initial conditions). The action potentials V_m of the two different fidelity models are shown in FIG. 7(a,b).

Examples of regression results for the 5 different methods with 20 high-fidelity data points and 300 uniformly distributed low-fidelity data points are shown in FIG. 7(b–f). Since the two data sets are phase-shifted, the single GP, AR1, and NARGP fail to represent a proper function of the high-fidelity model effectively. However, the GPs in the higher-dimensional spaces provide reasonable prediction results. The GP in the 6-dimensional space (GP+E(2)) shows significant improvement, in the form of a large uncertainty reduction as well as high prediction accuracy.

The sensitivity to the number of high-fidelity data points is shown in FIG. 4(d). The proposed framework shows accuracy gains compared to all other methods, as well as marked improvement when new high-fidelity points are added.
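The two fidelity datasets for this section can be generated by integrating equations (27)–(31). The sketch below (ours) uses scipy and leaves the voltage-dependent rate functions α, β abstract, since the paper defers their standard forms to [31]; the capacitance value and integration settings are assumptions:

```python
import numpy as np
from scipy.integrate import solve_ivp

def hh_rhs(t, u, I_ext, rates):
    """Right-hand side of equations (27)-(31). `rates` is a dict of the
    voltage-dependent gating functions alpha_n, beta_n, ... (see [31])."""
    V, n, m, h = u
    g_Na, g_K, g_L = 1.2, 0.36, 0.003            # parameter values from the text
    E_Na, E_K, E_L = 55.17, -72.14, -49.42
    C_m = 0.01                                   # membrane capacitance (assumed, cf. [31])
    I_ion = (g_Na * m**3 * h * (V - E_Na)        # equation (28)
             + g_K * n**4 * (V - E_K)
             + g_L * (V - E_L))
    dV = (I_ext - I_ion) / C_m                   # equation (27)
    dn = rates['an'](V) * (1 - n) - rates['bn'](V) * n   # equation (29)
    dm = rates['am'](V) * (1 - m) - rates['bm'](V) * m   # equation (30)
    dh = rates['ah'](V) * (1 - h) - rates['bh'](V) * h   # equation (31)
    return [dV, dn, dm, dh]

# Low-fidelity trace at I_ext = 1.05; the high-fidelity trace uses a slightly
# different current, giving a different period and a phase shift, e.g.:
# sol = solve_ivp(hh_rhs, (0, 100), u0, args=(1.05, rates), max_step=0.01)
```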
IV. CONCLUSION

In this paper we explored mathematical algorithms for multifidelity information fusion and their links with "embedology", motivated by the NARGP approach. These modifications/extensions of kriging show promise in improving the representation of data-poor "high-fidelity" datasets by exploiting data-rich "low-fidelity" datasets. Given that f_h may not be a simple function (and sometimes not even a function) of f_l, we demonstrated that using additional functions of t, such as derivatives or shifts of f_l, can drastically improve the approximation of f_h through Gaussian Processes.

The limitations of the proposed framework arise in the form of the curse of dimensionality and of overfitting. As the number of hyperparameters in the GP framework grows in an increasingly higher-dimensional input space, the optimization cost grows (and there is always the possibility of converging to local, unsatisfactory minima). Adaptively testing for the "best" number of delays is possible, and will be pursued in future work. The natural option of using multiple low-fidelity models (instead of delays of just one of them) is also being explored. Techniques that systematically find all the local hyperparameter minima (in the spirit of the reduced gradient method [32]) may also be useful in this effort. Another promising research direction involves the construction of data-informed kernels (e.g. through "Neural-net-induced Gaussian Processes", NNGP [33]) for more realistic and unbiased predictions. Alternatively, it is interesting to consider transformations of the input space using manifold learning techniques and the so-called Mahalanobis distance [34, 35], which has been demonstrated to successfully match different (yet conjugate) models [36, 37].

What we believe is a most promising direction for the use of these techniques is the reconciliation of different-granularity multiscale models: having, say, an atomistic "high-fidelity" simulation enhanced by a continuum "low-fidelity" approximate closure. Thus, "heterogeneous data" fusion becomes a version of multifidelity data fusion [8]. In this paper the fusion tools simply "filled in the gaps" in a single manifestation of the high-fidelity data. In a time-dependent context, "full space, full time" low-fidelity simulations can help complete, and thus accelerate, "small space, small time" high-fidelity simulations, in a form reminiscent of the patch-dynamics approach in equation-free computation [38], see also [7, 8]. Using a qualitatively correct (even though quantitatively inaccurate) low-fidelity model, as opposed to just the local Taylor series that play the role of low-fidelity modeling in patch dynamics, may very much improve the computational savings of such multiscale computation schemes.

V. ACKNOWLEDGMENTS
S. Lee, F. Dietrich and I. G. Kevrekidis gratefully acknowledge the partial support of NSF, NIH, and DARPA. G. E. Karniadakis gratefully acknowledges the financial support of DARPA (grant no. N66001-15-2-4055) as well as the MIT ESRDC grant.

[1] Tarca AL, Carey VJ, Chen XW, Romero R, Drăghici S. Machine learning and its applications to biology. PLoS Computational Biology. 2007;3(6):e116.
[2] Holzinger A, Jurisica I. Interactive Knowledge Discovery and Data Mining in Biomedical Informatics: State-of-the-Art and Future Challenges. vol. 8401. Springer; 2014.
[3] Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nature Reviews Genetics. 2015;16(6):321.
[4] Villoutreix P, Andén J, Lim B, Lu H, Kevrekidis IG, Singer A, et al. Synthesizing developmental trajectories. PLoS Computational Biology. 2017;13(9):e1005742.
[5] Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004;20(16):2626–2635.
[6] Willett P. Combination of similarity rankings using data fusion. Journal of Chemical Information and Modeling. 2013;53(1):1–10.
[7] Lee S, Kevrekidis IG, Karniadakis GE. A general CFD framework for fault-resilient simulations based on multi-resolution information fusion. Journal of Computational Physics. 2017;347:290–304.
[8] Lee S, Kevrekidis IG, Karniadakis GE. A resilient and efficient CFD framework: Statistical learning tools for multi-fidelity and heterogeneous information fusion. Journal of Computational Physics. 2017;344:516–533.
[9] Le Gratiet L, Garnier J. Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. International Journal for Uncertainty Quantification. 2014;4(5).
[10] Perdikaris P, Venturi D, Royset J, Karniadakis G. Multi-fidelity modelling via recursive co-kriging and Gaussian-Markov random fields. Proceedings of the Royal Society A. 2015;471:20150018.
[11] Perdikaris P, Venturi D, Karniadakis GE. Multifidelity information fusion algorithms for high-dimensional systems and massive data sets. SIAM Journal on Scientific Computing. 2016;38(4):B521–B538.
[12] Parussini L, Venturi D, Perdikaris P, Karniadakis G. Multi-fidelity Gaussian process regression for prediction of random fields. Journal of Computational Physics. 2017;336:36–50.
[13] Perdikaris P, Raissi M, Damianou A, Lawrence N, Karniadakis G. Nonlinear information fusion algorithms for data-efficient multi-fidelity modeling. Proceedings of the Royal Society A. 2017;473:20160751.
[14] Kennedy MC, O'Hagan A. Predicting the output from a complex computer code when fast approximations are available. Biometrika. 2000;87(1):1–13.
[15] Raissi M, Karniadakis G. Deep multi-fidelity Gaussian processes. arXiv preprint arXiv:1604.07484. 2016.
[16] Whitney H. Differentiable manifolds. Annals of Mathematics. 1936;37(3):645–680.
[17] Nash J. Analyticity of the solutions of implicit function problems with analytic data. Annals of Mathematics. 1966;84(3):345–355.
[18] Takens F. Detecting strange attractors in turbulence. In: Rand D, Young LS, editors. Dynamical Systems and Turbulence, Warwick 1980. Berlin, Heidelberg: Springer; 1981. p. 366–381.
[19] Kostelich EJ, Yorke JA. Noise reduction: Finding the simplest dynamical system consistent with the data. Physica D: Nonlinear Phenomena. 1990;41(2):183–196.
[20] Sauer T, Yorke JA, Casdagli M. Embedology. Journal of Statistical Physics. 1991;65(3):579–616.
[21] Kennel MB, Brown R, Abarbanel HD. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A. 1992;45(6):3403.
[22] Le Gratiet L. Multi-fidelity Gaussian process regression for computer experiments. Université Paris-Diderot - Paris VII; 2013.
[23] Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. MIT Press; 2006.
[24] Packard NH, Crutchfield JP, Farmer JD, Shaw RS. Geometry from a time series. Physical Review Letters. 1980;45:712–716.
[25] Dietrich F, Kooshkbaghi M, Bollt EM, Kevrekidis IG. Manifold Learning for Bifurcation Diagram Observations. arXiv preprint. 2018.
[26] GPy. GPy: A Gaussian process framework in Python; since 2012. http://github.com/SheffieldML/GPy.
[27] Schmidt AM, O'Hagan A. Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(3):743–758.
[28] Paciorek CJ, Schervish MJ. Nonstationary covariance functions for Gaussian process regression. In: Advances in Neural Information Processing Systems; 2004. p. 273–280.
[29] Hensman J, Lawrence ND. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370. 2014.
[30] Hodgkin AL, Huxley AF. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology. 1952;117(4):500–544.
[31] Siciliano R. The Hodgkin-Huxley model: its extensions, analysis and numerics. McGill University, Department of Mathematics and Statistics, Montreal, Canada; 2012.
[32] Quapp W. Reduced gradient methods and their relation to reaction paths. Journal of Theoretical and Computational Chemistry. 2003;02(03):385–417.
[33] Pang G, Yang L, Karniadakis GE. Neural-net-induced Gaussian process regression for function approximation and PDE solution. arXiv preprint arXiv:1806.11187. 2018.
[34] Singer A, Coifman RR. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis. 2008;25(2):226–239.
[35] Singer A, Erban R, Kevrekidis IG, Coifman RR. Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diffusion maps. Proceedings of the National Academy of Sciences. 2009;106:16090–16095.
[36] Dsilva CJ, Talmon R, Gear CW, Coifman RR, Kevrekidis IG. Data-driven reduction for a class of multiscale fast-slow stochastic dynamical systems. SIAM Journal on Applied Dynamical Systems. 2016;15(3):1327–1351.
[37] Kemeth FP, Haugland SW, Dietrich F, Bertalan T, Höhlein K, Li Q, et al. An Emergent Space for Distributed Data with Hidden Internal Order through Manifold Learning. IEEE Access. 2018;6:77402–77413.