Linking Gaussian Process regression with data-driven manifold embeddings for nonlinear data fusion
Seungjoon Lee and Felix Dietrich
Department of Chemical and Biomolecular Engineering, Johns Hopkins University
George E. Karniadakis
Division of Applied Mathematics, Brown University
Ioannis G. Kevrekidis
Department of Chemical and Biomolecular Engineering, Johns Hopkins University∗

∗ [email protected]; also at Department of Applied Mathematics and Statistics and Department of Medicine, Johns Hopkins University.

In statistical modeling with Gaussian Process regression, it has been shown that combining (few) high-fidelity data with (many) low-fidelity data can enhance prediction accuracy, compared to prediction based on the few high-fidelity data only. Such information fusion techniques for multifidelity data commonly approach the high-fidelity model f_h(t) as a function of two variables (t, y), and then use f_l(t) as the y data. More generally, the high-fidelity model can be written as a function of several variables (t, y_1, y_2, ...); the low-fidelity model f_l and, say, some of its derivatives can then be substituted for these variables. In this paper, we explore mathematical algorithms for multifidelity information fusion that use such an approach towards improving the representation of the high-fidelity function with only a few training data points. Given that f_h may not be a simple function (and sometimes not even a function) of f_l, we demonstrate that using additional functions of t, such as derivatives or shifts of f_l, can drastically improve the approximation of f_h through Gaussian Processes. We also point out a connection with "embedology" techniques from topology and dynamical systems. Our illustrative examples range from instructive caricatures to computational biology models, such as Hodgkin-Huxley neural oscillations.

I. INTRODUCTION
Recent advances in both algorithms and hardware are increasingly making machine learning an important component of mathematical modeling for physicochemical, engineering, as well as biological systems (e.g. [1–4]). Part of these developments focus on multiresolution and multifidelity data fusion [5, 6]. Fusing information from models constructed at different levels of resolution/fidelity has been shown to enhance prediction accuracy in data-driven scientific computing (e.g. [7, 8]). In addition, if high-fidelity data are costly to obtain (experimentally or computationally) while low-fidelity data are relatively cheap, a combination of a few high-fidelity data and many low-fidelity data can also lead to overall computational efficiency. Recently, Gaussian Process (GP) regression has been widely used to effectively combine data of multiple fidelities [9–13]. "Classical" information fusion by GP had focused only on linear correlations between high- and low-fidelity data via auto-regressive schemes, such as the coKriging approach of Kennedy and O'Hagan [14]. If two (or more) data sets have nonlinear correlations, such linear-type approaches will lose their effectiveness.

When a high-fidelity model f_h(t) is nonlinearly correlated with a low-fidelity model f_l(t), Nonlinear Auto-Regressive Gaussian Process regression (NARGP) [13] achieves highly accurate prediction results by introducing an additional dimension: f_h(t) is approximated as a curve on a two-dimensional manifold parametrized by t and a second variable, y. The data points on this manifold are of the form (t, f_l(t), f_h(t)). More generally, NARGP finds a strong correlation between d-dimensional high- and low-fidelity models in a (d+1)-dimensional space. In this framework, the low-fidelity model provides the additional "latent" variable of the high-fidelity model.

Deep multifidelity GPs were introduced [15] as an improvement, especially in the context of discontinuities in f_h with respect to t (and f_l). This approach focuses on constructing a useful transformation T(t) of the input variables t through a (deep) neural network. Then, the high-fidelity function f_h(t) is approximated as a GP of (just) f_l(T(t)). One must now, of course, perform optimization for the additional network hyperparameters.

In this paper we discuss a connection of NARGP with data-driven embeddings in topology/dynamical systems, and extend it (in the spirit of such data-driven embeddings) in an attempt to improve the numerical approximation of f_h in the "sparse f_h, rich f_l" data setting. In what follows we will (rather arbitrarily) ascribe the characterization "high fidelity" or "low fidelity" to the different functions used in our illustrations; this characterization is based solely on the number of available data points for each.

For our first example, the "high-fidelity" function f_h, for which we only have a few data points, is a function of t; but it actually also happens that we can describe it as a function of f_l:

    f_h(t) = \sin^2(8\pi t), \qquad f_l(t) = \sin(8\pi t),    (1)

see FIG. 1. When we choose t as the coordinate parametrizing f_h(t) (the green curve in FIG. 1(a)), the GP regression fails to represent the high-frequency sine function with just a few training data points, as shown in FIG. 1(b). However, as FIG. 1(c) shows, if we choose f_l as the coordinate of f_h(f_l) (colored red in FIG. 1(a)), the GP regression can represent the simple quadratic function quite effectively. If we still need to know the parametrization of f_h by t, we can obtain it through the "data rich" f_l: f_h(t) ≡ f_h(f_l(t)).

FIG. 1. Two alternative GP regressions for the high-fidelity model f_h(t) = sin²(8πt) (the highly oscillatory green curve in (a)). The blue curve visually suggests a smooth two-dimensional manifold g(t, f_l). If, on the other hand, the high-fidelity function is projected onto f_l(t), we can see it is a simple quadratic. (b) and (c): the prediction results with 2 standard deviations (dashed lines) for the high-fidelity function by GP regression with only t and with only f_l(t), respectively. We employ 15 high-fidelity data points and 100 uniformly distributed low-fidelity data points for the regression.
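The comparison in FIG. 1 is straightforward to reproduce. The sketch below is our own illustration (not the authors' script); it assumes the GPy package named later in the paper and the 15/100 point counts quoted in the caption:

```python
import numpy as np
import GPy

# The two models of equation (1): f_h is "data poor", f_l is "data rich".
f_h = lambda t: np.sin(8 * np.pi * t) ** 2
f_l = lambda t: np.sin(8 * np.pi * t)

t_h = np.sort(np.random.rand(15, 1), axis=0)   # 15 high-fidelity samples
y_h = f_h(t_h)

# FIG. 1(b): GP regression of f_h parametrized by t only.
gp_t = GPy.models.GPRegression(t_h, y_h, GPy.kern.RBF(input_dim=1))
gp_t.optimize()

# FIG. 1(c): GP regression of f_h parametrized by f_l(t) only;
# f_l is known explicitly here, so it can be evaluated anywhere.
gp_fl = GPy.models.GPRegression(f_l(t_h), y_h, GPy.kern.RBF(input_dim=1))
gp_fl.optimize()

t_star = np.linspace(0, 1, 500).reshape(-1, 1)
mean_t, var_t = gp_t.predict(t_star)           # fails: too few points per period
mean_fl, var_fl = gp_fl.predict(f_l(t_star))   # succeeds: f_h is a simple quadratic in f_l
```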
If f_h is not a function of f_l, however (as has been observed in the literature [13] and as can be seen in FIG. 2(a)), more variables are necessary to create a useful parametrization of f_h: more precisely, a domain over which f_h is a function.

In the NARGP framework, the variable used in addition to f_l is t itself. In this paper, we will also advocate the use of delays or derivatives of f_l as additional variables. This approach can also help remove the explicit dependence of f_h on t, since embedding theories in dynamical systems [16–21] guarantee that we can choose any generic observation of t, or derivatives and/or delays of this observation, as a replacement for t; see section II B for more details.

In section III D, we apply the proposed framework to the Hodgkin-Huxley model, describing the behavior of action potentials in a neuron. Here, the high- and the low-fidelity functions are action potentials at two different values of the external current. This is a case where f_h is a complicated function of t and does not depend only on f_l; yet, as we will see, delays of f_l will help us construct an accurate approximation of f_h.

The paper is organized as follows: In section II, we review the NARGP framework and concepts of "embedology", and illustrate why and when this framework is successful. In section III, we demonstrate the effectiveness of our proposed framework via some pedagogical examples and the biologically motivated application to the Hodgkin-Huxley model. In section IV, we summarize our results and discuss open issues for further development of general multifidelity information fusion in modeling and simulation practice across scientific domains.

II. METHODS

A. Nonlinear information fusion algorithms

"Classical" multifidelity data fusion algorithms require a linear (or almost linear) correlation between the different fidelity models. Under this constraint, we can merge two or more data sets by using scaling and shifting parameters, as in the Kennedy and O'Hagan coKriging approach [14]. More generally, however, the data sets are nonlinearly correlated, which typically degrades the quality of the results of linear information fusion algorithms.
In order to resolve correlation nonlinearity between data sets, the use of a space-dependent scaling factor ρ(x) [22] or, alternatively, deep multifidelity Gaussian Processes [15] have been introduced; clearly, the improvement they bring requires additional hyperparameter optimization.

When the high-fidelity model f_h is nonlinearly correlated with the low-fidelity model f_l, but can be written as a simple function of x and f_l, the Nonlinear Auto-Regressive GP (NARGP) [13] is an appropriate choice. In this framework, a one-dimensional high-fidelity function f_h is assumed to be a "simple" function g of two variables (x, y), i.e. it is a curve that lies on the two-dimensional manifold described by g. Then, GP regression in the two-dimensional space is performed, where the data for y are the f_l(x) data:

    g(x, y) \sim \mathcal{GP}(0, k((x, y), (x', y'))), \qquad f_h(x) = g(x, f_l(x)).    (2)

In [13], the gain in accuracy of NARGP was documented, compared both to an auto-regressive scheme with a constant scaling factor and to one with a scaling factor modeled as a (space-dependent) Gaussian process.

Algorithmically, classical auto-regressive GPs employ an explicit method, such as a scaling constant ρ between two covariance kernels k_1 and k_2:

    \begin{bmatrix} f_l(x) \\ f_h(x) \end{bmatrix} \sim \mathcal{GP}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k_1(x, x') & \rho k_1(x, x') \\ \rho k_1(x, x') & \rho^2 k_1(x, x') + k_2(x, x') \end{bmatrix} \right).    (3)

The NARGP framework, on the other hand, employs an implicit approach through automatic relevance determination (ARD) weights [23] in the extended space parametrized by x and y: a different scaling hyperparameter for each of the two dimensions in the kernel. In many applications, the radial basis function (see equation (4)) has been used as the covariance kernel (where ARD implies a different scaling hyperparameter θ_i for each dimension):

    k(x, x'; \theta) = \exp\left( -\sum_{i=1}^{d} \theta_i (x_i - x'_i)^2 \right).    (4)

FIG. 2 showcases an example where f_h cannot be written as a function of f_l:

    f_h(t) = t + \sin^2(8\pi t), \qquad f_l(t) = \sin(8\pi t), \qquad t \in [0, 0.25].    (5)

Following the NARGP framework, we choose the low-fidelity data as an additional variable y = f_l(t); we then approximate the two-dimensional function

    g(t, y) = t + y^2.    (6)

Approximating g only requires a few training data point pairs for the GP regression. Then, f_h can be written as f_h(t) = g(t, f_l(t)), see FIG. 2(b). FIG. 2(c) demonstrates that we can, alternatively, use delays of f_l instead of t as an additional variable. A rationalization of this follows in the next section.

FIG. 2. Two alternative GP regressions for the high-fidelity model f_h (see equation (5)). (a): the multivaluedness of f_h with respect to f_l is clearly visible when the high-fidelity data (the blue curve) are projected onto f_l(t) (the red curve). (b) and (c): the high-fidelity function (the yellow curve) versus posterior means with two standard deviations (dashed lines) of two alternative GP regressions. We use 7 high-fidelity data points and 100 uniformly distributed low-fidelity data points for training the GP regression models. (b) GP regression with {t, f_l(t)}. (c) GP regression with {f_l(t), f_l(t − τ)}, τ = 1/…
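In code, the NARGP-style regression of equations (2)–(6) amounts to a GP in the extended input space (t, f_l(t)) with an ARD kernel, one scaling hyperparameter per dimension as in equation (4). A minimal sketch follows; this is our paraphrase for the example of equation (5), not the reference implementation of [13]:

```python
import numpy as np
import GPy

f_l = lambda t: np.sin(8 * np.pi * t)
f_h = lambda t: t + np.sin(8 * np.pi * t) ** 2        # equation (5)

t_h = np.sort(np.random.rand(7, 1) * 0.25, axis=0)    # 7 high-fidelity points on [0, 0.25]
X = np.hstack([t_h, f_l(t_h)])                        # extended inputs (t, y), with y = f_l(t)

# ARD=True assigns a separate scaling hyperparameter theta_i to each
# input dimension, as in the kernel of equation (4).
g_gp = GPy.models.GPRegression(X, f_h(t_h), GPy.kern.RBF(input_dim=2, ARD=True))
g_gp.optimize_restarts(num_restarts=5, verbose=False)

# Prediction re-uses the data-rich f_l: f_h(t*) ~ g(t*, f_l(t*)).
t_star = np.linspace(0, 0.25, 200).reshape(-1, 1)
mean, var = g_gp.predict(np.hstack([t_star, f_l(t_star)]))
```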
B. Data-driven Higher-dimensional Embeddings

The theorem of Whitney [16] states that any sufficiently smooth manifold of dimension d ∈ ℕ can be embedded in Euclidean space ℝⁿ, with the tight bound n ≥ 2d + 1. Nash [17] showed that this embedding can even be isometric if the manifold is compact and Riemannian, even though the bound on n is higher. Many results on the reconstruction of invariant sets in the state spaces of dynamical systems are based on these two theorems [18–21]. Here, the n embedding dimensions are usually formed by n scalar observations of the system state variables. Instead of n different observations, recording n time delays of a single scalar observable is also possible, as originally formulated by Takens [18] (see also [24]). Given a smooth, d-dimensional manifold M, a vector field X on M with associated flow map f: M × ℝ → M, as well as an observable h: M → ℝ, it is possible to construct an embedding φ: M → ℝⁿ through

    \varphi(p) = \left[ h(p),\; h(f(p, -\Delta t)),\; \dots,\; h(f(p, -(n-1)\Delta t)) \right]^T.    (7)

Sauer et al. [20] further specified the conditions on the observable, with respect to the underlying vector field of the dynamical system, that have to be satisfied such that the state space can be embedded successfully. They also extended the results to embeddings of invariant sets with fractal dimension. The observable and the vector field together build a tuple that has to be generic (Takens [18]) or prevalent (Sauer et al. [20]). The two papers show that "almost all" tuples are admissible, where the notions of genericity and prevalence define that probabilistic statement in the respective function spaces. These results are crucial for the numerical exploitation of the embedding theorems, since they show that "almost all" observables of a dynamical system can be chosen for a delay embedding.

In many applications, the intrinsic dimension d of M is unknown, and n is preemptively chosen large (larger than necessary). Manifold learning techniques are then often capable of reducing the embedding dimension, bringing it closer to the minimal embedding dimension necessary for the given manifold.

We also note that in our framework, if we can obtain low-fidelity data from a (dynamic) process instead of just single measurements, the same embedding and manifold learning techniques can be used even if the "independent variable" t is not known [25].

Let us consider equation (5) again. NARGP performs GP regression with two observations {t, f_l(t)}. Using embedding theory, we can rationalize (a) why the two observations were necessary; and (b) why performing GP regression with an additional delay of the scalar observable, {f_l(t), f_l(t − τ)}, is equally appropriate for a relatively small time horizon, see FIG. 2(c). Notice that the delay coordinates f_l(t) and f_l(t − τ) lie on an ellipse with period 0.25. Hence, if the data are collected over times longer than 0.25, using only delays will fail to represent the high-fidelity function due to multivaluedness (see FIG. 3(e) in section III A); t itself, used as an observable, will resolve this.
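A delay map in the sense of equation (7) is straightforward to assemble numerically. The sketch below is ours; the delay τ = 1/32 and the sampling grids are illustrative choices, not values from the paper:

```python
import numpy as np

def delay_map(h, t, tau, n):
    """phi(t) = [h(t), h(t - tau), ..., h(t - (n - 1) tau)], cf. equation (7)."""
    return np.hstack([h(t - k * tau) for k in range(n)])

f_l = lambda t: np.sin(8 * np.pi * t)

# On a horizon shorter than the period 0.25, the delay map is one-to-one:
t_short = np.linspace(0, 0.25, 200).reshape(-1, 1)
Phi = delay_map(f_l, t_short, tau=1.0 / 32, n=2)   # points on (an arc of) an ellipse

# On a longer horizon the ellipse is traversed repeatedly, so distinct
# times map to the same delay coordinates: t is no longer recoverable.
t_long = np.linspace(0, 1, 800).reshape(-1, 1)
Phi_multivalued = delay_map(f_l, t_long, tau=1.0 / 32, n=2)
```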
C. Extending the formulation through delays

Now we provide a mathematical formulation of the approach. We assume that f_h: ℝ → ℝ and f_l: ℝ → ℝ are C^K-smooth functions with K ≥ 2, and we want to construct an interpolant of f_h with a numerical scheme (here, a Gaussian Process). We assume that

(1) only a small number of data points {x, f_h(x)}, as well as the function f_l and its K derivatives, are available,

(2) f_h can be written in the form

    f_h(x) = g(x, f_l(x), f_l^{(1)}(x), \dots, f_l^{(K)}(x)),    (8)

where f_l^{(i)}(x) denotes the i-th derivative, i ∈ {1, …, K}, and

(3) g: ℝ^{K+2} → ℝ is a C^K function with derivatives bounded in the L^∞ norm by a small constant c_g > 0:

    \left\| \frac{\partial}{\partial x_l} g(x_1, \dots, x_{K+2}) \right\|_{L^\infty} \le c_g \quad \forall\, l \in \{1, \dots, K+2\}.    (9)

The Taylor series of f_h reveals why assumptions (2) and (3) are required: Assume we want to evaluate f_h in a small neighborhood of a point x_0 ∈ ℝ (here, w.l.o.g. we set x_0 = 0). Then, assuming f_h is smooth enough, we can write

    f_h(x) = f_h(0) + \frac{\partial}{\partial x} f_h(0)\, x + \mathcal{O}\left( \left\| \frac{\partial^2}{\partial x^2} f_h \right\|\, x^2 \right).    (10)

Since we assume the form (8) for f_h, we can write

    \frac{\partial}{\partial x} f_h(x) = \frac{\partial}{\partial x} \left[ g(x, f_l(x), f_l^{(1)}(x), \dots, f_l^{(K)}(x)) \right]    (11)
    = \frac{\partial}{\partial x_1} g(x, f_l(x), f_l^{(1)}(x), \dots)    (12)
    \; + \sum_{i=2}^{K+1} \frac{\partial}{\partial x_i} g(x, f_l(x), f_l^{(1)}(x), \dots) \cdot f_l^{(i-1)}(x).    (13)

From this, we can see that ∂f_h/∂x can be large (because the derivatives f_l^{(i)}(x) can be large), but if we know all f_l^{(i)}(x), we only have to estimate g from the data f_h and f_l^{(i)}. Crucially, we do not approximate the function f_h(x) and its derivatives. The derivatives of g are bounded by c_g through assumption (3), and g is a C^K function, so only a few data points are necessary for a good fit.

If we have access to function values of f_l over "delays" (at discrete shifts, say in the form of a finite difference stencil) in space, rather than to its derivatives, we can use the Newton series approximation of f_h instead of equation (10) for an analogous argumentation: For functions f_h that are analytic around the expansion point (here, x_0 = 0), the function f_h can be evaluated at x close to it by

    f_h(x) = \sum_{m=0}^{\infty} \binom{x/\Delta x}{m} \Delta^m f_h(0)    (14)
    = f_h(0) + \frac{f_h(\Delta x) - f_h(0)}{\Delta x}\, x + \mathcal{O}\left( \left\| \Delta^2 f_h \right\| \right),    (15)

where Δ^m f_h is the m-th finite difference approximation of f_h, with a small step size Δx. By equation (8), these differences can be expressed through g and delays of f_l (instead of delays of f_h), analogously to equations (11)–(13). Using delays in space instead of derivatives has numerical advantages, especially in cases where f_h or f_l are not differentiable (or even have discontinuities). It also enables us to estimate the derivatives of f_l implicitly, in case only the function f_l is available (and not its derivatives).
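In practice, the finite differences of equations (14)–(15) are built from shifted function evaluations only. A small sketch of the corresponding implicit derivative estimates for f_l (ours; the step size is illustrative):

```python
import numpy as np
from math import comb

f_l = lambda t: np.sin(8 * np.pi * t)

def fd_derivative(f, t, dx, m):
    """Estimate the m-th derivative of f at t from shifted evaluations only:
    Delta^m f(t) / dx^m, with the forward difference
    Delta^m f(t) = sum_j (-1)^(m - j) * C(m, j) * f(t + j dx)."""
    delta = sum((-1) ** (m - j) * comb(m, j) * f(t + j * dx) for j in range(m + 1))
    return delta / dx ** m

# First and second derivative estimates of f_l at t = 0.1:
d1 = fd_derivative(f_l, 0.1, 1e-4, 1)   # close to 8*pi*cos(0.8*pi)
d2 = fd_derivative(f_l, 0.1, 1e-4, 2)   # close to -(8*pi)**2*sin(0.8*pi)
```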
D. Outline of the numerical approach

In order to employ the delay coordinates of the low-fidelity function, it is required to know shifts of it. A necessary condition of the proposed framework is that the low-fidelity function is given explicitly, or can be learned well enough from the given data that low-fidelity function values can be accurately approximated (interpolated) at arbitrary points. If the state variable t is not available, the low-fidelity model should be a generic observation of t in order to be useful in employing Takens' embedding theorem [18, 20]. Under these conditions, we now present a summary of the workflow.

If the low-fidelity model is given in the form of (rich) data, we train a GP regression model for it from these data {t_l, y_l(t_l)} by minimizing the negative log marginal likelihood. This data-driven step can be circumvented if the low-fidelity model is explicitly given, as in the above examples. After that, we compute predictive posterior means of the low-fidelity model at the points t_h where the high-fidelity data are available. We also compute a number of shifts of the low-fidelity function at the points t_h − kτ and at the test points t*. Next, we train another GP regression model for the high-fidelity data in the higher-dimensional space, {{t_h, f_l(t_h), f_l(t_h − τ), …, f_l(t_h − nτ)}, f_h(t_h)}. Finally, we compute the predictive posterior mean and variance at the test points t* in the higher-dimensional space. Each new delay burdens the optimization with a single additional hyperparameter. For more details, refer to [13, 23]. In this paper, all GP computations are performed with the open-source Python package GPy [26].
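The workflow just outlined can be condensed into a short script. The sketch below is our rendering of these steps; the function name, kernel choice, and number of optimizer restarts are our own (the paper reports using GPy [26] but does not list code):

```python
import numpy as np
import GPy

def multifidelity_delay_gp(t_l, y_l, t_h, y_h, t_star, tau, n_delays):
    """GP+E: GP regression for f_h over (t, f_l(t), f_l(t - tau), ..., f_l(t - n_delays*tau))."""
    # Step 1: GP surrogate for the low-fidelity model, trained by minimizing
    # the negative log marginal likelihood (skippable if f_l is known explicitly).
    gp_l = GPy.models.GPRegression(t_l, y_l, GPy.kern.RBF(input_dim=1))
    gp_l.optimize()
    f_l = lambda t: gp_l.predict(t)[0]            # predictive posterior mean

    # Step 2: extended inputs; each delay adds one input dimension.
    embed = lambda t: np.hstack([t] + [f_l(t - k * tau) for k in range(n_delays + 1)])

    # Step 3: high-fidelity GP in the (n_delays + 2)-dimensional space with an
    # ARD kernel; each new delay costs one additional hyperparameter.
    kern = GPy.kern.RBF(input_dim=n_delays + 2, ARD=True)
    gp_h = GPy.models.GPRegression(embed(t_h), y_h, kern)
    gp_h.optimize_restarts(num_restarts=5, verbose=False)

    # Step 4: predictive posterior mean and variance at the test points.
    return gp_h.predict(embed(t_star))
```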
III. RESULTS

We introduce three pedagogical examples to demonstrate our approach. First, we explore the case where f_h is a phase-shifted version of f_l (section III A). Then, we show that oscillations with different periods (leading to different recurrences) present a more challenging scenario, which however can still be resolved by using shifts of f_l. The third example involves discontinuities in f_h and f_l. After these three examples, in section III D, we demonstrate the approach in the context of the Hodgkin-Huxley model. We investigate the effectiveness of the proposed framework by comparing it to three established frameworks in the same computational environment: (1) single Gaussian Process regression with high-fidelity data only (GP or Kriging), (2) the auto-regressive method with a constant scaling parameter ρ (AR1 or coKriging), and (3) nonlinear auto-regressive GP (NARGP).

A. Phase Shifted Oscillations
Using models at different levels of resolution (e.g. for biological oscillators) will often give oscillations that have very comparable periods but are phase-shifted. Let us start by considering two functions with different phases on t ∈ [0, 1]:

    f_h(t) = t + \sin^2(8\pi t + \pi/10), \qquad f_l(t) = \sin(8\pi t).    (16)

The high-fidelity function can be rewritten by a trigonometric addition formula:

    f_h(t) = t + \left( \sin(8\pi t)\cos(\pi/10) + \cos(8\pi t)\sin(\pi/10) \right)^2.    (17)

Now we can explicitly see how the high-fidelity function can be thought of as a combination of three variables: t, the low-fidelity function f_l(t) = sin(8πt), and its first derivative f_l^{(1)}(t) ∝ cos(8πt). Using delays of f_l with a small stepsize τ contains enough information to numerically estimate its derivatives; hence we can also write f_h as

    f_h(t) \mapsto g(t, f_l(t), f_l(t - \tau), f_l(t - 2\tau)).    (18)

The GP regression model for g is trained on this 4-dimensional data. In addition, we perform GP regression in a three-dimensional space constructed from only three delays:

    f_h(t) \mapsto g(f_l(t), f_l(t - \tau), f_l(t - 2\tau)).    (19)

As shown in FIG. 3(b), the single GP regression model provides inaccurate predictive posterior means due to the lack of high-fidelity data. While the linear auto-regressive model (AR1) also fails to predict the high-fidelity values, the NARGP (with 10 high-fidelity data points and 100 low-fidelity data points) catches the trend of the high-fidelity data, yet still yields inaccurate results: NARGP is informed only by t and f_l(t), and not also by f_l^{(1)}(t). Similarly, the GP regression with only delays (no information about t) in FIG. 3(e) fails to represent the high-fidelity function for these long observation windows. Beyond 0.25, t cannot be recovered from the shifts of f_l, because f_l is only a generic observer of t ∈ [0, 0.25). The GP regression with t and three delays of f_l, on the other hand, provides an excellent prediction with only 10 high-fidelity data points (and 100 low-fidelity data points). This means that, in the 4-dimensional space, g (see equation (18)) has small derivatives, which then helps to employ GP regression successfully.

Next, we investigate the sensitivity and scalability of the proposed framework with respect to the number of high-fidelity data points (training data points). We train all GP regression models with 10, 15, 20, and 25 randomly-chosen high-fidelity data points and 100 uniformly distributed low-fidelity data points. The error is obtained by averaging 10 trials of random data selections. The log L₂ error with respect to the number of high-fidelity data points is presented in FIG. 4(a). The two established approaches (AR1 and NARGP) and the GP with only delays show no significant accuracy enhancement as the number of training points increases. The reason for the consistently large errors is the lack of the additional information provided by the derivatives. The GP in the higher-dimensional space that includes t, on the other hand, shows a strong correlation between accuracy and the number of training points: more high-fidelity points visibly improve the approximation.

FIG. 3. Examples of phase shifted oscillations. (a) correlation between t, f_l and f_h. (b)–(f) The high-fidelity function (the yellow curve) versus posterior means with 2 standard deviations (dashed lines) of 5 compared methods with 10 high-fidelity data points and 100 uniformly distributed low-fidelity data points. (b) GP (Kriging) and low-fidelity data (the red-dashed curve). (c) Auto-regressive GP (AR1 or coKriging). (d) Nonlinear auto-regressive GP (NARGP). (e) GP in the higher-dimensional space with only delays (f_l(t), f_l(t − τ), f_l(t − 2τ)). (f) GP in the higher-dimensional space (GP+E), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ)).
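As a usage illustration, the sketch from section II D can be applied to the phase-shifted pair of equation (16); this assumes the formulas as reconstructed above, reuses multifidelity_delay_gp from that sketch, and follows the 10/100 point counts of FIG. 3, while τ is an illustrative choice:

```python
import numpy as np

f_h = lambda t: t + np.sin(8 * np.pi * t + np.pi / 10) ** 2   # equation (16)
f_l = lambda t: np.sin(8 * np.pi * t)

t_l = np.linspace(0, 1, 100).reshape(-1, 1)     # 100 low-fidelity points
t_h = np.sort(np.random.rand(10, 1), axis=0)    # 10 high-fidelity points
t_star = np.linspace(0, 1, 500).reshape(-1, 1)

# GP+E of FIG. 3(f): inputs (t, f_l(t), f_l(t - tau), f_l(t - 2 tau))
mean, var = multifidelity_delay_gp(t_l, f_l(t_l), t_h, f_h(t_h),
                                   t_star, tau=1.0 / 32, n_delays=2)
```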
B. Different Periodicity

In this example, the high- and the low-fidelity model oscillations are not just phase-shifted, but are also characterized by different periods. In applications, this could arise if we tried to match observations of oscillations of the same model at two different parameter values. Different (possibly irrationally related) oscillation periods dramatically complicate the correlation across the two data sets.

FIG. 4. Log L₂ error (y-axis) of prediction by GP (Kriging), AR1 (coKriging), NARGP, and GPs in the higher-dimensional space (GP+E and GP+E(2)) with respect to the number of high-fidelity data points (x-axis). The error is obtained by averaging 10 trials of random data selections. (a) Phase shifted oscillations. (b) Different periodicity. (c) Models with discontinuities. (d) The Hodgkin-Huxley models.

We consider two different-period and phase-shifted data sets,

    f_h(t) = \sin(8\pi t + \pi/10), \qquad f_l(t) = \sin(6\sqrt{2}\,\pi t).    (20)

The high-fidelity function can be rewritten by a trigonometric addition formula:

    f_h(t) = \sin(8\pi t)\cos(\pi/10) + \cos(8\pi t)\sin(\pi/10).    (21)

In addition, the first term sin(8πt) can be rewritten again by a trigonometric subtraction formula:

    \sin(at - bt) = \cos(bt)\sin(at) - \cos(at)\sin(bt),    (22)

where a = 6√2 π and b = 6√2 π − 8π. Then,

    \sin(8\pi t) = \cos(bt)\, f_l(t) - \sin(bt)\, f_l^{(1)}(t).    (23)

The second term cos(8πt) can be rewritten in the same way. This shows that the high-fidelity function can be written in terms of sin(bt), cos(bt), f_l(t), and f_l^{(1)}(t). Since sin(bt) and cos(bt) have a lower frequency compared to the original frequency 8π, the bound c_g for the derivatives of g (see section II C) is smaller. It is then reasonable that we can approximate the high-fidelity function in the higher-dimensional space with a few training data points.

We perform the GP regression in two different extended spaces: (1) with 3 additional delays, totaling a 4-dimensional space (GP+E), and (2) with 5 additional delays, totaling a 6-dimensional space (GP+E(2)), and compare them to a single GP, AR1, and NARGP. Examples of regression results with 15 high-fidelity data points and 200 uniformly distributed low-fidelity data points are shown in FIG. 5. The GP in the 4-dimensional space provides better regression results than the other established methods, and the GP in the 6-dimensional space presents the best results.

Moreover, as shown in FIG. 5(b), the phase discrepancy between the high- and low-fidelity functions increases as time increases, resulting in larger errors for larger values of t, see FIG. 5(b)–(d). However, the GPs in the higher-dimensional spaces provide accurate prediction results over this entire observation window.

The sensitivity to the number of high-fidelity data points is shown in FIG. 4(b). The GPs in the 4- and 6-dimensional spaces show significant accuracy gains compared to all other methods. These results demonstrate the capability of the proposed framework for period-shifted and phase-shifted data fusion.

FIG. 5. Examples of different periodicity. (a) correlation between f_l and f_h. (b)–(f) The high-fidelity function (the yellow curve) versus posterior means with 2 standard deviations (dashed lines) of 5 compared methods with 15 high-fidelity data points and 200 uniformly distributed low-fidelity data points. (b) GP (Kriging) and low-fidelity data (the red-dashed curve). (c) Auto-regressive GP (AR1 or coKriging). (d) Nonlinear auto-regressive GP (NARGP). (e) GP in the 4-dimensional space (GP+E), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ)). (f) GP in the 6-dimensional space (GP+E(2)), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ), f_l(t − 3τ), f_l(t − 4τ)).
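The sensitivity studies summarized in FIG. 4 amount to an outer loop over the number of high-fidelity training points, averaged over random draws. A sketch (ours, reusing the definitions of the two previous snippets; the error metric is our reading of the "log L₂ error" in the caption):

```python
import numpy as np

def mean_log_l2_error(n_high, n_delays, trials=10):
    """Average relative L2 prediction error (log10) over random selections
    of n_high high-fidelity training points."""
    errs = []
    for _ in range(trials):
        t_h = np.sort(np.random.rand(n_high, 1), axis=0)
        mean, _ = multifidelity_delay_gp(t_l, f_l(t_l), t_h, f_h(t_h),
                                         t_star, tau=1.0 / 32, n_delays=n_delays)
        errs.append(np.linalg.norm(mean - f_h(t_star)) / np.linalg.norm(f_h(t_star)))
    return np.log10(np.mean(errs))

# GP+E (n_delays=2) and GP+E(2) (n_delays=4) for increasing training-set sizes:
curve_gpe = [mean_log_l2_error(n, 2) for n in (10, 15, 20, 25)]
curve_gpe2 = [mean_log_l2_error(n, 4) for n in (10, 15, 20, 25)]
```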
C. Models with Discontinuities

In general, a smooth stationary covariance kernel cannot capture discontinuous model data. In order to resolve this problem, non-stationary kernels have been introduced [27, 28], with space-dependent hyperparameters. Moreover, nested GPs have also been used successfully to alleviate this problem [29]. Both approaches introduce, of course, additional hyperparameters to optimize.

In this example, we introduce a discontinuous function f_l on t ∈ [0, 0.5) as

    f_l(t) = 0.5\,(6t - 2)^2 \sin(12t - 4) + 10(t - 0.5) - 5,    (24)

and on t ∈ [0.5, 1] as

    f_l(t) = 0.5\,(6t - 2)^2 \sin(12t - 4) + 10(t - 0.5),    (25)

and the high-fidelity function f_h as

    f_h(t) = 2 f_l(t) - 20t + 20 = g(t, f_l(t)).    (26)

In the scenario we describe here, the high-fidelity function f_h is discontinuous, but can be written as a linear function g of two variables.

Examples of regression results with 10 high-fidelity data points and 200 uniformly distributed low-fidelity data points are shown in FIG. 6. Since g is a linear function of t and f_l(t), the NARGP, as well as our GPs in the higher-dimensional spaces, provide highly accurate prediction results with just a few high-fidelity data.

In the sensitivity analysis with respect to the number of high-fidelity data points, see FIG. 4(c), there is no significant accuracy gain after 10 high-fidelity data points. That is because 10 training data points are enough to represent a linear function accurately. It is worth noting that, here, the NARGP provides better prediction results with 20 training data points compared to the GPs in the higher-dimensional space, possibly due to overfitting.

FIG. 6. Examples of models with discontinuities. (a) correlation between t, f_l and f_h. (b)–(f) The high-fidelity function (the yellow curve) versus posterior means with 2 standard deviations (dashed lines) of 5 compared methods with 10 high-fidelity data points and 200 uniformly distributed low-fidelity data points. (b) GP (Kriging) and low-fidelity data (the red-dashed curve). (c) Auto-regressive GP (AR1 or coKriging). (d) Nonlinear auto-regressive GP (NARGP). (e) GP in the 4-dimensional space (GP+E), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ)). (f) GP in the 6-dimensional space (GP+E(2)), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ), f_l(t − 3τ), f_l(t − 4τ)).
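For completeness, the discontinuous pair can be coded directly. The sketch below assumes the Forrester-type form as reconstructed in equations (24)–(26); the exact branch constants are partly illegible in the source, so treat them as assumptions:

```python
import numpy as np

def f_l(t):
    # Piecewise low-fidelity model, equations (24)-(25); the branch offset
    # (-5 on [0, 0.5) only) is an assumption about the reconstructed constants.
    base = 0.5 * (6 * t - 2) ** 2 * np.sin(12 * t - 4) + 10 * (t - 0.5)
    return np.where(t < 0.5, base - 5.0, base)

def f_h(t):
    # Equation (26): f_h inherits the discontinuity, but is linear in (t, f_l(t)).
    return 2.0 * f_l(t) - 20.0 * t + 20.0
```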
D. The Hodgkin-Huxley Model

Based on the results of the pedagogical examples, we apply the proposed framework to a famous model of an intracellular process, a version of the Hodgkin-Huxley equations [30].
FIG. 7. Examples of the two Hodgkin-Huxley model oscillations obtained with different external currents I_ext. (a) correlation between t, f_l and f_h. (b)–(f) The high-fidelity function (the yellow curve) versus posterior means with 2 standard deviations (dashed lines) of 5 compared methods with 20 high-fidelity data points and 300 uniformly distributed low-fidelity data points. (b) GP (Kriging) and low-fidelity data (the red-dashed curve). (c) Auto-regressive GP (AR1 or coKriging). (d) Nonlinear auto-regressive GP (NARGP). (e) GP in the 4-dimensional space (GP+E), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ)). (f) GP in the 6-dimensional space (GP+E(2)), using (t, f_l(t), f_l(t − τ), f_l(t − 2τ), f_l(t − 3τ), f_l(t − 4τ)).

In 1952, Hodgkin and Huxley introduced a mathematical model which can describe the initiation and propagation of action potentials in a neuron. Specifically, they invented electrical equivalent circuits to mimic the ion channels through which ions traffic across the cell membrane. The model for the intracellular action potential V_m can be written as a simple ODE:

    C_m \frac{dV_m}{dt} + I_{ion} = I_{ext},    (27)

where C_m is the membrane capacitance, and I_ion and I_ext represent the total ionic current and the external current, respectively.

The total ionic current I_ion = I_Na + I_K + I_L is the sum of three individual currents: a sodium current (I_Na), a potassium current (I_K), and a leakage current (I_L). In order to calculate the three individual currents in time, the Hodgkin-Huxley model introduces gates which regulate the flow of ions through the channels. Specifically, the three ionic currents are affected by three different gates n, m, and h. Based on these gates, the total ionic current can be calculated as

    I_{ion} = \bar{g}_{Na}\, m^3 h\, (V_m - E_{Na}) + \bar{g}_K\, n^4\, (V_m - E_K) + \bar{g}_L\, (V_m - E_L),    (28)

where \bar{g}_* represents a normalized constant and E_* represents the equilibrium potential for the sodium (* ≡ Na), potassium (* ≡ K), and leakage (* ≡ L) currents. The three gates n, m, and h can be modeled by the following ODEs:

    \frac{dn}{dt} = \alpha_n(V)(1 - n) - \beta_n(V)\, n,    (29)
    \frac{dm}{dt} = \alpha_m(V)(1 - m) - \beta_m(V)\, m,    (30)
    \frac{dh}{dt} = \alpha_h(V)(1 - h) - \beta_h(V)\, h.    (31)

In this paper, we set the model parameter values to \bar{g}_{Na} = 1.2, \bar{g}_K = 0.36, \bar{g}_L = 0.003, E_{Na} = 55.17, E_K = −72.14, and E_L = −49.42 [31]. We assume that the different fidelity data come from simulations at different external currents I_ext; we use I_ext = 1.05 for the low-fidelity model and a slightly different value for the high-fidelity model, resulting in a different oscillation period (and a phase shift when we start at the same initial conditions). The action potentials V_m of the two different fidelity models are shown in FIG. 7(a,b).

Examples of regression results for the 5 different methods with 20 high-fidelity data points and 300 uniformly distributed low-fidelity data points are shown in FIG. 7(b–f). Since the two data sets are phase-shifted, the single GP, AR1, and NARGP fail to represent a proper function of the high-fidelity model effectively. However, the GPs in the higher-dimensional spaces provide reasonable prediction results. The GP in the 6-dimensional space (GP+E(2)) shows significant improvement, in the form of a large uncertainty reduction as well as high prediction accuracy.

The sensitivity to the number of high-fidelity data points is shown in FIG. 4(d). The proposed framework shows accuracy gains compared to all other methods, as well as marked improvement when new high-fidelity points are added.
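The two fidelity datasets for this section can be generated by integrating equations (27)–(31). The sketch below (ours) uses scipy and leaves the voltage-dependent rate functions α, β abstract, since the paper defers their standard forms to [31]; the capacitance value and integration settings are assumptions:

```python
import numpy as np
from scipy.integrate import solve_ivp

def hh_rhs(t, u, I_ext, rates):
    """Right-hand side of equations (27)-(31). `rates` is a dict of the
    voltage-dependent gating functions alpha_n, beta_n, ... (see [31])."""
    V, n, m, h = u
    g_Na, g_K, g_L = 1.2, 0.36, 0.003            # parameter values from the text
    E_Na, E_K, E_L = 55.17, -72.14, -49.42
    C_m = 0.01                                   # membrane capacitance (assumed, cf. [31])
    I_ion = (g_Na * m**3 * h * (V - E_Na)        # equation (28)
             + g_K * n**4 * (V - E_K)
             + g_L * (V - E_L))
    dV = (I_ext - I_ion) / C_m                   # equation (27)
    dn = rates['an'](V) * (1 - n) - rates['bn'](V) * n   # equation (29)
    dm = rates['am'](V) * (1 - m) - rates['bm'](V) * m   # equation (30)
    dh = rates['ah'](V) * (1 - h) - rates['bh'](V) * h   # equation (31)
    return [dV, dn, dm, dh]

# Low-fidelity trace at I_ext = 1.05; the high-fidelity trace uses a slightly
# different current, giving a different period and a phase shift, e.g.:
# sol = solve_ivp(hh_rhs, (0, 100), u0, args=(1.05, rates), max_step=0.01)
```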
IV. CONCLUSION

In this paper we explored mathematical algorithms for multifidelity information fusion and their links with "embedology", motivated by the NARGP approach. These modifications/extensions of kriging show promise in improving the representation of data-poor "high-fidelity" datasets by exploiting data-rich "low-fidelity" datasets. Given that f_h may not be a simple function (and sometimes not even a function) of f_l, we demonstrated that using additional functions of t, such as derivatives or shifts of f_l, can drastically improve the approximation of f_h through Gaussian Processes.

The limitations of the proposed framework arise in the form of the curse of dimensionality and of overfitting. As the number of hyperparameters in the GP framework grows in an increasingly higher-dimensional input space, the optimization cost grows (and there is always the possibility of converging to local, unsatisfactory minima). Adaptively testing for the "best" number of delays is possible, and will be pursued in future work. The natural option of using multiple low-fidelity models (instead of delays of just one of them) is also being explored. Techniques that systematically find all the local hyperparameter minima (in the spirit of the reduced gradient method [32]) may also be useful in this effort. Another promising research direction involves the construction of data-informed kernels (e.g. through "Neural-net-induced Gaussian Processes", NNGP [33]) for more realistic and unbiased predictions. Alternatively, it is interesting to consider transformations of the input space using manifold learning techniques and the so-called Mahalanobis distance [34, 35], which has been demonstrated to successfully match different (yet conjugate) models [36, 37].

What we believe is a most promising direction for the use of these techniques is the reconciliation of different-granularity multiscale models: having, say, an atomistic "high-fidelity" simulation enhanced by a continuum "low-fidelity" approximate closure. Thus, "heterogeneous data" fusion becomes a version of multifidelity data fusion [8]. In this paper the fusion tools simply "filled in the gaps" in a single manifestation of the high-fidelity data. In a time-dependent context, "full space, full time" low-fidelity simulations can help complete, and thus accelerate, "small space, small time" high-fidelity simulations, in a form reminiscent of the patch-dynamics approach in equation-free computation [38], see also [7, 8]. Using a qualitatively correct (even though quantitatively inaccurate) low-fidelity model, as opposed to just the local Taylor series that play the role of low-fidelity modeling in patch dynamics, may very much improve the computational savings of such multiscale computation schemes.

V. ACKNOWLEDGMENTS
S. Lee, F. Dietrich and I. G. Kevrekidis gratefully acknowledge the partial support of NSF, NIH, and DARPA. G. E. Karniadakis gratefully acknowledges the financial support of DARPA (grant no. N66001-15-2-4055) as well as the MIT ESRDC grant.

[1] Tarca AL, Carey VJ, Chen XW, Romero R, Drăghici S. Machine learning and its applications to biology. PLoS Computational Biology. 2007;3(6):e116.
[2] Holzinger A, Jurisica I. Interactive Knowledge Discovery and Data Mining in Biomedical Informatics: State-of-the-Art and Future Challenges. vol. 8401. Springer; 2014.
[3] Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nature Reviews Genetics. 2015;16(6):321.
[4] Villoutreix P, Andén J, Lim B, Lu H, Kevrekidis IG, Singer A, et al. Synthesizing developmental trajectories. PLoS Computational Biology. 2017;13(9):e1005742.
[5] Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004;20(16):2626–2635.
[6] Willett P. Combination of similarity rankings using data fusion. Journal of Chemical Information and Modeling. 2013;53(1):1–10.
[7] Lee S, Kevrekidis IG, Karniadakis GE. A general CFD framework for fault-resilient simulations based on multi-resolution information fusion. Journal of Computational Physics. 2017;347:290–304.
[8] Lee S, Kevrekidis IG, Karniadakis GE. A resilient and efficient CFD framework: Statistical learning tools for multi-fidelity and heterogeneous information fusion. Journal of Computational Physics. 2017;344:516–533.
[9] Le Gratiet L, Garnier J. Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. International Journal for Uncertainty Quantification. 2014;4(5).
[10] Perdikaris P, Venturi D, Royset J, Karniadakis G. Multi-fidelity modelling via recursive co-kriging and Gaussian-Markov random fields. Proceedings of the Royal Society A. 2015;471:20150018.
[11] Perdikaris P, Venturi D, Karniadakis GE. Multifidelity information fusion algorithms for high-dimensional systems and massive data sets. SIAM Journal on Scientific Computing. 2016;38(4):B521–B538.
[12] Parussini L, Venturi D, Perdikaris P, Karniadakis G. Multi-fidelity Gaussian process regression for prediction of random fields. Journal of Computational Physics. 2017;336:36–50.
[13] Perdikaris P, Raissi M, Damianou A, Lawrence N, Karniadakis G. Nonlinear information fusion algorithms for data-efficient multi-fidelity modeling. Proceedings of the Royal Society A. 2017;473:20160751.
[14] Kennedy MC, O'Hagan A. Predicting the output from a complex computer code when fast approximations are available. Biometrika. 2000;87(1):1–13.
[15] Raissi M, Karniadakis G. Deep multi-fidelity Gaussian processes. arXiv preprint arXiv:1604.07484. 2016.
[16] Whitney H. Differentiable manifolds. Annals of Mathematics. 1936;37(3):645–680.
[17] Nash J. Analyticity of the solutions of implicit function problems with analytic data. Annals of Mathematics. 1966;84(3):345–355.
[18] Takens F. Detecting strange attractors in turbulence. In: Rand D, Young LS, editors. Dynamical Systems and Turbulence, Warwick 1980. Berlin, Heidelberg: Springer; 1981. p. 366–381.
[19] Kostelich EJ, Yorke JA. Noise reduction: Finding the simplest dynamical system consistent with the data. Physica D: Nonlinear Phenomena. 1990;41(2):183–196.
[20] Sauer T, Yorke JA, Casdagli M. Embedology. Journal of Statistical Physics. 1991;65(3):579–616.
[21] Kennel MB, Brown R, Abarbanel HD. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A. 1992;45(6):3403.
[22] Le Gratiet L. Multi-fidelity Gaussian process regression for computer experiments. Université Paris-Diderot - Paris VII; 2013.
[23] Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. MIT Press; 2006.
[24] Packard NH, Crutchfield JP, Farmer JD, Shaw RS. Geometry from a time series. Physical Review Letters. 1980;45:712–716.
[25] Dietrich F, Kooshkbaghi M, Bollt EM, Kevrekidis IG. Manifold Learning for Bifurcation Diagram Observations. arXiv preprint. 2018.
[26] GPy. GPy: A Gaussian process framework in Python; since 2012. http://github.com/SheffieldML/GPy.
[27] Schmidt AM, O'Hagan A. Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(3):743–758.
[28] Paciorek CJ, Schervish MJ. Nonstationary covariance functions for Gaussian process regression. In: Advances in Neural Information Processing Systems; 2004. p. 273–280.
[29] Hensman J, Lawrence ND. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370. 2014.
[30] Hodgkin AL, Huxley AF. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology. 1952;117(4):500–544.
[31] Siciliano R. The Hodgkin-Huxley model: its extensions, analysis and numerics. McGill University, Department of Mathematics and Statistics, Montreal, Canada; 2012.
[32] Quapp W. Reduced gradient methods and their relation to reaction paths. Journal of Theoretical and Computational Chemistry. 2003;02(03):385–417.
[33] Pang G, Yang L, Karniadakis GE. Neural-net-induced Gaussian process regression for function approximation and PDE solution. arXiv preprint arXiv:1806.11187. 2018.
[34] Singer A, Coifman RR. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis. 2008;25(2):226–239.
[35] Singer A, Erban R, Kevrekidis IG, Coifman RR. Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diffusion maps. Proceedings of the National Academy of Sciences. 2009;106:16090–16095.
[36] Dsilva CJ, Talmon R, Gear CW, Coifman RR, Kevrekidis IG. Data-driven reduction for a class of multiscale fast-slow stochastic dynamical systems. SIAM Journal on Applied Dynamical Systems. 2016;15(3):1327–1351.
[37] Kemeth FP, Haugland SW, Dietrich F, Bertalan T, Höhlein K, Li Q, et al. An Emergent Space for Distributed Data with Hidden Internal Order through Manifold Learning. IEEE Access. 2018;6:77402–77413.