Multi-fidelity regression using artificial neural networks: efficient approximation of parameter-dependent output quantities
Mengwu Guo a,∗, Andrea Manzoni b, Maurice Amendt c, Paolo Conti b, Jan S. Hesthaven c

a Department of Applied Mathematics, University of Twente
b MOX – Dipartimento di Matematica, Politecnico di Milano
c Institute of Mathematics, École Polytechnique Fédérale de Lausanne
Abstract
Highly accurate numerical or physical experiments are often very time-consuming or expensive to obtain. When time or budget restrictions prohibit the generation of additional data, the amount of available samples may be too limited to provide satisfactory model results. Multi-fidelity methods deal with such problems by incorporating information from other sources, which are ideally well-correlated with the high-fidelity data, but can be obtained at a lower cost. By leveraging correlations between different data sets, multi-fidelity methods often yield superior generalization when compared to models based solely on a small amount of high-fidelity data. In the current work, we present the use of artificial neural networks applied to multi-fidelity regression problems. By elaborating a few existing approaches, we propose new neural network architectures for multi-fidelity regression. The introduced models are compared against a traditional multi-fidelity regression scheme – co-kriging. A collection of artificial benchmarks are presented to measure the performance of the analyzed models. The results show that cross-validation in combination with Bayesian optimization consistently leads to neural network models that outperform the co-kriging scheme. Additionally, we show an application of multi-fidelity regression to an engineering problem. The propagation of a pressure wave into an acoustic horn with parametrized shape and frequency is considered, and the index of reflection intensity is approximated using the proposed multi-fidelity models. A finite element, full-order model and a reduced-order model built through the reduced basis method are adopted as the high- and low-fidelity models, respectively. It is shown that the multi-fidelity neural network returns outputs that achieve a comparable accuracy to those from the expensive, full-order model, using only very few full-order evaluations combined with a larger amount of inaccurate but cheap evaluations of a reduced order model.
Keywords:
Machine learning, artificial neural network, multi-fidelity regression, Gaussian process regression, reduced order modeling, parametrized PDE
1. Introduction
Artificial neural networks (ANNs) have arguably been one of the most active research topics in recent years. They have been successfully applied in a substantial number of research areas, including image recognition [21], translation [10], and fraud detection [15]. More recently, ANNs have also been widely used in the emerging area of machine learning in computational science and engineering, sometimes referred to as scientific machine learning [3]. The remarkable expressive power of neural networks (NNs) has made them stand out in the solution of forward and inverse problems governed by partial differential equations (PDEs) [36, 39, 41], reduced order modeling [13, 18, 23], data-driven discovery [37], multiscale analysis [38], and so on. This success can largely be explained by three major factors: computational power, flexibility, and access to large data sets.

∗ Corresponding author.
Email addresses: [email protected] (Mengwu Guo), [email protected] (Andrea Manzoni), [email protected] (Paolo Conti), [email protected] (Jan S. Hesthaven)
The great flexibility of NNs, as well as their multi-purpose nature, is a dominant factor explaining their overwhelming success, not only in research but also in real-world applications. Several application-dependent architectures have been developed, e.g., convolutional neural networks are often used in image-related tasks, whereas recurrent neural networks have found their success in speech recognition. Nevertheless, even for a fixed structure, a neural network is still able to adapt to different situations. This feature is mainly explained by the large number of parameters and hyperparameters that can be tuned to fit data in many situations.

The ability to handle large data sets without overwhelming computational costs has made NNs a good candidate for multi-fidelity (MF) regression. MF regression exploits correlations between different data sets to provide a regression model that generalizes better than a simple regression model that only takes a single data set into account. The most common setup for MF regression is a set of data sources with different fidelity levels. High-fidelity (HF) samples are usually rare due to their cost, be it computational or experimental. However, such data are accurate and represent the best available knowledge about the problem. In contrast, low-fidelity (LF) data are assumed to be easy and cheap to obtain, yet may lack accuracy. Often generated numerically, the LF data set ideally expresses the major trends of the problem and correlates well with the HF data. The goal of MF regression is to infer trends from LF data and use them to approximate the HF model, especially in regions where HF data are sparse.

A natural MF setting arose in the geostatistics community, which realized that "if core data at other locations are correlated, they should be included to improve the regression" [19]. In the context of soil porosity, precise measurements are combined with seismic data to better predict porosity over large areas. Another common MF situation arises when solving PDEs [20, 35]. High-order numerical schemes with a fine mesh give rise to accurate, HF data, whereas the use of a coarse mesh, a partially converged solution, or a linearized equation can lead to LF data. In the past decade, MF methods have found applications in many areas of scientific computing, including uncertainty quantification, inference, and optimization. We refer to [32] for a comprehensive review. A widely used MF technique is co-kriging [1, 31], which relies on vector-valued Gaussian processes for regression. Such a Gaussian process regression scheme presents two major benefits. Firstly, as a non-parametric regression tool, it is suitable for many different applications. Secondly, the method can be cast in the Bayesian framework, and the regression results naturally include an uncertainty estimate, which is usually desirable. Nevertheless, Gaussian process regression is not suitable for many applications as it suffers from several drawbacks, such as the curse of dimensionality. Consequently, new methods for MF regression are needed, ideally able to detect non-trivial, highly nonlinear correlations between data sets of different fidelity levels, and applicable to a general class of problems.

Based on these considerations, ANNs appear to be a promising candidate for solving MF problems. In the current work, we consider MF regression with ANNs.
Several approaches that successfully use NNs in a MF setting have been proposed in the literature so far. In [26], the authors tested a deep NN structure on different artificial MF benchmarks, extended the idea to the physics-informed NN scheme, and applied it to inverse problems governed by PDEs. Their MF model clearly outperforms a single-fidelity regression. A different NN architecture, incorporated into a Monte Carlo sampling algorithm to estimate uncertainties in a MF setting, was introduced in [27], and reduced computational cost was observed compared to traditional Monte Carlo sampling. In addition, different MF strategies for training NNs were discussed in [2], NNs were used to approximate the discrepancy between HF and LF physics-constrained NNs in [24], and deep neural networks (DNNs) were embedded into co-kriging in [34]. Moreover, a composition of Gaussian processes in a multi-layer network structure was used for MF modeling in [11], and a multi-fidelity Bayesian neural network scheme was developed in [25] and applied to physics-informed versions.

In this work, we propose different ANN architectures for the purpose of MF regression. Inspired by [26] and [27], we present two all-in-one models in which different fidelity levels are trained simultaneously, as well as two multilevel models that define separate NNs for the hierarchy of fidelity levels. In addition, we utilize a strategy based on cross-validation and Bayesian optimization to automatically select the best-performing NN hyperparameters. To the best of our knowledge, hyperparameter optimization (HPO) has not been investigated for MF neural networks. The goal of this work is to define a reliable strategy that consistently proposes a neural network which achieves high accuracy and shows good generalization properties. We will assess the performance of the proposed NNs on a set of manufactured benchmarks, all chosen carefully to test for the desirable properties. All models will be compared to single-fidelity regression schemes. Comparisons will also include co-kriging results to benchmark the proposed models against common practices in MF frameworks.

Additionally, we show an application of the proposed MF regression schemes to the evaluation of a quantity of interest defined as a functional of the solution to a parameter-dependent problem governed by PDEs. Such a task often occurs in the applied sciences and engineering, where multiple evaluations of PDE solutions, for different scenarios described in terms of physical or geometrical parameters, can be computationally demanding if relying on high-fidelity, full-order models (FOMs) such as detailed finite element approximations. To overcome this difficulty, low-fidelity, reduced-order models (ROMs) can be built through the reduced basis (RB) method. Despite featuring much lower-dimensional solution spaces than those of the FOMs, ROMs are able to capture the critical physical features of the FOMs. A reduced model seeks the solution on a low-dimensional manifold which is approximated by a linear trial subspace spanned by a set of global basis functions, built from a set of full-order snapshots. The ROM accuracy often comes at the price of a relatively large number of basis functions involved in the reduced-order approximation. On the other hand, an efficient assembly of the ROM during the online stage may only be possible provided that an expensive hyper-reduction is performed during the offline stage.
Therefore, a low dimensionality without expensive hyper-reduction can make a reduced model extremely efficient, but potentially inaccurate. Our goal, enabled by the MF neural network schemes in this work, is to provide accurate approximations of the output quantities by leveraging a relatively large number of output evaluations from very low-dimensional ROMs and a small number of evaluations of the FOM, so as to avoid the efficiency issues stemming from the ROM construction and evaluation without compromising the accuracy of the outcome. In particular, we apply the proposed approaches to a parametrized PDE problem, namely the propagation of a pressure wave into an acoustic horn with parametrized shape and frequency, described by the Helmholtz equation. We assess the impact of both the quality and the amount of the training data on the overall accuracy of the MF regression outcome.

Following this introduction, the concepts of ANNs and Gaussian process regression are briefly reviewed, and their underlying correlation is discussed in Section 2. Several NN structures for MF regression are introduced and discussed in Section 3, and their effectiveness is demonstrated by a series of benchmark test cases in Section 4. An application to a parametrized PDE problem is presented in Section 5, and conclusions are drawn in Section 6.
2. Artificial neural networks and Gaussian processes for regression
In this section, we consider an $L$-hidden-layer, fully-connected NN [40] with a hidden layer of width $M_l$ for the $l$-th layer and a nonlinear activation function $\phi$, $1 \leq l \leq L$. At the $j$-th neuron in the $l$-th layer of the NN, the pre- and post-activation values are denoted by $z_j^l$ and $x_j^l$, respectively, $1 \leq j \leq M_l$. Let $\mathbf{x} = \mathbf{x}^0 \in \mathbb{R}^{d_{\mathrm{in}}}$ denote the inputs of the network and $\mathbf{y} = \mathbf{z}^{L+1} \in \mathbb{R}^{d_{\mathrm{out}}}$ denote the outputs. Note that we have $M_0 = d_{\mathrm{in}}$ and $M_{L+1} = d_{\mathrm{out}}$. Weight and bias parameters between the $(l-1)$-th and $l$-th layers are represented by $W_{ij}^l$ and $b_i^l$, respectively, $1 \leq l \leq L+1$, $1 \leq i \leq M_l$, $1 \leq j \leq M_{l-1}$. Then one has

$$z_i^l(\mathbf{x}) = b_i^l + \sum_{j=1}^{M_{l-1}} W_{ij}^l \, x_j^{l-1}(\mathbf{x}), \qquad x_i^l(\mathbf{x}) = \phi\big(z_i^l(\mathbf{x})\big), \qquad 1 \leq i \leq M_l, \; 1 \leq l \leq L,$$

and

$$y_i(\mathbf{x}) = z_i^{L+1}(\mathbf{x}) = b_i^{L+1} + \sum_{j=1}^{M_L} W_{ij}^{L+1} \, x_j^L(\mathbf{x}), \qquad 1 \leq i \leq d_{\mathrm{out}}. \tag{1}$$

A multivariate function $\mathbf{y} = f(\mathbf{x})$ is approximated by a vector-valued network surrogate $f^{\mathrm{NN}}(\,\cdot\,; \mathbf{W}, \mathbf{b}) : \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}^{d_{\mathrm{out}}}$ to be trained on the input-output pairs $\{(\mathbf{x}^{(k)}, \mathbf{y}^{(k)})\}_{k=1}^N$. Here $\mathbf{W}$ and $\mathbf{b}$ are the vectors collecting all the weight and bias parameters, respectively, and $N$ is the number of data pairs. Such a training is often performed by minimizing a cost function:

$$(\mathbf{W}, \mathbf{b}) = \arg\min_{\mathbf{W}, \mathbf{b}} \left\{ \frac{1}{N} \sum_{k=1}^{N} \big\| \mathbf{y}^{(k)} - f^{\mathrm{NN}}(\mathbf{x}^{(k)}; \mathbf{W}, \mathbf{b}) \big\|_2^2 + \lambda \|\mathbf{W}\|_2^2 \right\}, \tag{2}$$

in which the first term is the mean square error (MSE) and the second term is a regularization term with $\lambda \geq 0$.

A Gaussian process (GP) is a collection of random variables, any finite number of which obeys a joint Gaussian distribution. In GP regression (GPR), the prior on the scalar-valued regression function $f : \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}$ is assumed to be a GP corrupted by an independent Gaussian noise term, i.e., for $(\mathbf{x}, \mathbf{x}') \in \mathbb{R}^{d_{\mathrm{in}}} \times \mathbb{R}^{d_{\mathrm{in}}}$,

$$f(\mathbf{x}) \sim \mathrm{GP}\big(0, \kappa(\mathbf{x}, \mathbf{x}')\big), \qquad y = f(\mathbf{x}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \chi^2), \tag{3}$$

where $\chi$ is the standard deviation of the Gaussian noise term $\epsilon$, and the semi-positive definite kernel function $\kappa$ gives the covariance of the prior GP.

Given $N$ pairs of input-output training data, a prior joint Gaussian is defined for the corresponding outputs:

$$\mathbf{y} \,|\, \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_y), \qquad \mathbf{K}_y = \mathrm{Cov}[\mathbf{y} \,|\, \mathbf{X}] = \kappa(\mathbf{X}, \mathbf{X}) + \chi^2 \mathbf{I}_N, \tag{4}$$

where $\mathbf{y} = \{y^{(1)}, y^{(2)}, \cdots, y^{(N)}\}^T$, $\mathbf{X} = [\mathbf{x}^{(1)} | \mathbf{x}^{(2)} | \cdots | \mathbf{x}^{(N)}]$, and $\mathbf{I}_N$ is the $N$-dimensional identity matrix, $N$ being the number of training samples.

From a regression model, the goal is to predict the noise-free output $f_*(\mathbf{s})$ for a new test input $\mathbf{s} \in \mathbb{R}^{d_{\mathrm{in}}}$. By the standard rules for conditioning Gaussians, the posterior predictive distribution conditioned on the training data is obtained as a new GP:

$$f_*(\mathbf{s}) \,|\, \mathbf{s}, \mathbf{X}, \mathbf{y} \sim \mathrm{GP}\big(m_*(\mathbf{s}), c_*(\mathbf{s}, \mathbf{s}')\big), \qquad m_*(\mathbf{s}) = \kappa(\mathbf{s}, \mathbf{X}) \mathbf{K}_y^{-1} \mathbf{y}, \qquad c_*(\mathbf{s}, \mathbf{s}') = \kappa(\mathbf{s}, \mathbf{s}') - \kappa(\mathbf{s}, \mathbf{X}) \mathbf{K}_y^{-1} \kappa(\mathbf{X}, \mathbf{s}'). \tag{5}$$
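To make these formulas concrete, the following minimal sketch implements the prior covariance (4) and the posterior predictions (5) in plain Python/NumPy. The squared-exponential kernel and its hyperparameters are illustrative assumptions; the paper does not prescribe a specific kernel here.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel kappa(A, B) between input sets of shape (n, d) and (m, d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gpr_predict(X, y, S, chi=1e-3):
    """Posterior mean and covariance of Eq. (5) at test inputs S, given training data (X, y)."""
    K_y = rbf_kernel(X, X) + chi**2 * np.eye(len(X))          # Eq. (4)
    mean = rbf_kernel(S, X) @ np.linalg.solve(K_y, y)         # m_*(s)
    cov = rbf_kernel(S, S) - rbf_kernel(S, X) @ np.linalg.solve(K_y, rbf_kernel(X, S))
    return mean, cov                                          # c_*(s, s')

# Usage: recover y = sin(2*pi*x) from 8 samples.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, (8, 1)), axis=0)
y = np.sin(2.0 * np.pi * X[:, 0])
mean, cov = gpr_predict(X, y, np.linspace(0.0, 1.0, 100)[:, None])
```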
Multi-fidelity GPR

GPR with training data from different fidelity levels is known as co-kriging [31] or vector-valued GPR [1]. In such a regression scheme, one can use a large amount of LF data and only a limited number of HF samples to train a model of reasonable accuracy. Since the LF evaluations are cheap, the cost of training data preparation can be reduced by controlling the number of HF evaluations. Assuming a linear correlation between the different fidelity levels, we can employ the linear model of coregionalization (LMC) [1] that expresses the prior of a hierarchy of $D$ solution fidelities as

$$f_i(\mathbf{x}) = \sum_{j=1}^{D} a_{i,j} \, u_j(\mathbf{x}), \qquad i = 1, 2, \cdots, D, \tag{6}$$

i.e., each level of solution $f_i$ is written as a linear combination of $D$ independent Gaussian processes $u_j \sim \mathrm{GP}(0, \kappa_j(\cdot,\cdot))$. In addition, the vector $\mathbf{a}_j$, $1 \leq j \leq D$, collects the weights of the corresponding GP component $u_j$, i.e., $\mathbf{a}_j = \{a_{1,j}, \cdots, a_{D,j}\}^T$. This formulation leads to a matrix-valued kernel for the MF GPR model as

$$\mathbf{K}(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{D} \mathbf{a}_j \mathbf{a}_j^T \, \kappa_j(\mathbf{x}, \mathbf{x}'). \tag{7}$$

In the two-level case, the well-known form of AR(1) co-kriging [31] is a special form of the linear model, defining the prior of a LF solution $f_L$ and a HF solution $f_H$:

$$f_H(\mathbf{x}) = \rho \, u_1(\mathbf{x}) + u_2(\mathbf{x}), \qquad f_L(\mathbf{x}) = u_1(\mathbf{x}), \tag{8}$$

in which $\mathbf{a}_1 = \{\rho, 1\}^T$ and $\mathbf{a}_2 = \{1, 0\}^T$. Conditioning on the training input-output pairs from both the LF and HF levels, denoted by $(\mathbf{X}_L, \mathbf{y}_L)$ and $(\mathbf{X}_H, \mathbf{y}_H)$, respectively, the predictive distribution for the HF can be expressed as a posterior GP, i.e., $f_H^*(\mathbf{s}) \,|\, \mathbf{s}, \mathbf{X}_H, \mathbf{y}_H, \mathbf{X}_L, \mathbf{y}_L \sim \mathrm{GP}$.
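As an illustration, the AR(1) structure (8) yields the joint prior covariance of the stacked data vector $[\mathbf{y}_H; \mathbf{y}_L]$ assembled in the sketch below, which is Eq. (7) with $\mathbf{a}_1 = \{\rho, 1\}^T$ and $\mathbf{a}_2 = \{1, 0\}^T$; the HF posterior then follows by applying the conditioning rules (5) to this joint Gaussian. The value of $\rho$ and the component kernels are placeholders that would in practice be fitted by maximizing the marginal likelihood.

```python
import numpy as np

def ar1_cokriging_cov(X_H, X_L, kappa1, kappa2, rho, chi=1e-6):
    """Joint prior covariance of [y_H; y_L] under Eq. (8):
    f_H = rho * u1 + u2 and f_L = u1, with u1 ~ GP(0, kappa1), u2 ~ GP(0, kappa2)."""
    K_HH = rho**2 * kappa1(X_H, X_H) + kappa2(X_H, X_H)   # Cov[f_H, f_H]
    K_HL = rho * kappa1(X_H, X_L)                         # Cov[f_H, f_L]
    K_LL = kappa1(X_L, X_L)                               # Cov[f_L, f_L]
    K = np.block([[K_HH, K_HL], [K_HL.T, K_LL]])
    return K + chi**2 * np.eye(len(X_H) + len(X_L))       # small jitter / noise term
```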
Neural networks as Gaussian processes

It can be shown that the prior of a neural network output can be seen as a set of Gaussian processes under the following probabilistic assumptions [22, 28] (the discussion here involves some modifications of the work in [22, 28]): (I) all the weight parameters $W_{ij}^l$ are independent and identically distributed (i.i.d.), as are all the bias parameters $b_i^l$, and the weight and bias parameter sets are independent of each other; (II) in the $l$-th layer, $1 \leq l \leq L+1$, $b_i^l \sim \mathcal{N}(0, \sigma_b^2)$, and the $W_{ij}^l$ are independently drawn from any distribution with zero mean and variance $\sigma_w^2 / M_{l-1}$; and (III) $M_l \to \infty$, $1 \leq l \leq L$. Here we show by induction that $\{z_i^l : 1 \leq i \leq M_l\}$ are i.i.d. zero-mean Gaussian processes and $\{x_j^l : 1 \leq j \leq M_l\}$ are i.i.d. for all $2 \leq l \leq L+1$.

We consider an arbitrary set of finite locations of the input $\mathbf{x}$, denoted by $\mathbf{X}$. Since both $\{b_i^1 : \forall i\}$ and $\{W_{ij}^1 : \forall i, j\}$ are i.i.d., one obtains that $\{z_i^1(\mathbf{X}) = b_i^1 + \sum_{j=1}^{d_{\mathrm{in}}} W_{ij}^1 x_j^0(\mathbf{X}) : \forall i\}$ are i.i.d. Thus $\{x_j^1(\mathbf{X}) = \phi(z_j^1(\mathbf{X})) : \forall j\}$ are i.i.d. Then it can be recovered that $\{z_i^2(\mathbf{X}) = b_i^2 + \sum_{j=1}^{M_1} W_{ij}^2 x_j^1(\mathbf{X}) : \forall i\}$ are i.i.d. For each $1 \leq i \leq M_2$, we apply the multivariate central limit theorem [12] to a sequence of i.i.d. random vectors $\{\sqrt{M_1}\, W_{i1}^2 x_1^1(\mathbf{X}), \sqrt{M_1}\, W_{i2}^2 x_2^1(\mathbf{X}), \cdots\}$, all with zero mean and covariance $\sigma_w^2 \mathrm{Cov}[x_\cdot^1(\mathbf{X})]$, and we obtain that $z_i^2(\mathbf{X}) \sim \mathcal{N}(\mathbf{0}, \sigma_b^2 \mathbf{I} + \sigma_w^2 \mathbb{E}[x_\cdot^1(\mathbf{X}) \otimes x_\cdot^1(\mathbf{X})])$ as $M_1 \to \infty$. Since the input locations $\mathbf{X}$ are arbitrary, each $z_i^2(\mathbf{x})$ is a Gaussian process, $z_i^2(\cdot) \sim \mathrm{GP}(0, K^2(\cdot,\cdot))$, with its kernel $K^2$ defined as $K^2(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(z_\cdot^1(\mathbf{x})) \phi(z_\cdot^1(\mathbf{x}'))]$. After the nonlinear activation in the 2nd layer, we have the i.i.d. $\{x_j^2 : \forall j\}$. Therefore the proposition holds true for $l = 2$.

Assume it holds true for $l$, $2 \leq l \leq L$. The proposition can be verified to be true for $l+1$, similarly to the case $l = 2$. The pre-activation in each intermediate layer follows a Gaussian process $z_\cdot^l \sim \mathrm{GP}(0, K^l(\cdot,\cdot))$, $2 \leq l \leq L+1$, and

$$K^l(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}_{z_\cdot^{l-1} \sim \mathrm{GP}(0, K^{l-1})}\big[\phi(z_\cdot^{l-1}(\mathbf{x})) \, \phi(z_\cdot^{l-1}(\mathbf{x}'))\big]. \tag{9}$$

The outputs $\mathbf{y}(\mathbf{x}) = \mathbf{z}^{L+1}(\mathbf{x})$ thus follow i.i.d. Gaussian process priors, and the kernel function can be formed from the recursive relation (9).
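This correspondence can be verified numerically. For a one-hidden-layer network with the distributions in (I)-(II), the sketch below compares a Monte Carlo estimate of the kernel $\sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(z^1(x))\phi(z^1(x'))]$ with the empirical covariance of a second-layer pre-activation over random wide networks; the widths, variances, and inputs are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_b, sigma_w = 0.5, 1.0
x, xp = 0.3, 0.8                 # two scalar inputs (d_in = 1, so M_0 = 1)
M1, n_nets = 2000, 2000          # hidden width, number of sampled networks

# Kernel K^2(x, x') = sigma_b^2 + sigma_w^2 * E[phi(z^1(x)) phi(z^1(x'))],
# where z^1(x) = b + w x with w ~ N(0, sigma_w^2) and b ~ N(0, sigma_b^2).
w = rng.normal(0.0, sigma_w, 10**6)
b = rng.normal(0.0, sigma_b, 10**6)
k_kernel = sigma_b**2 + sigma_w**2 * np.mean(np.tanh(w * x + b) * np.tanh(w * xp + b))

# Empirical E[z^2(x) z^2(x')] over random networks (zero mean, so this is the covariance).
prods = np.empty(n_nets)
for k in range(n_nets):
    W1 = rng.normal(0.0, sigma_w, M1);               b1 = rng.normal(0.0, sigma_b, M1)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(M1), M1); b2 = rng.normal(0.0, sigma_b)
    z2_x  = W2 @ np.tanh(W1 * x + b1) + b2
    z2_xp = W2 @ np.tanh(W1 * xp + b1) + b2
    prods[k] = z2_x * z2_xp
print(k_kernel, prods.mean())    # the two estimates should agree closely
```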
3. Artificial neural networks for multi-fidelity regression
Currently, there have been two successful approaches [26, 27] to the use of NNs in a multi-fidelity context. The neural networks for multi-fidelity regression (NNMFR) presented in these two studies show inherent differences at the architectural level of the networks. Whereas [26] uses a single NN to perform a simultaneous regression of both the HF and LF data, [27] splits them up and uses two distinct ANNs, one for each fidelity level. Note that in the present work, we restrict ourselves to bi-fidelity problems, i.e., we are interested in approximating the scalar HF function $f_{HF}(\mathbf{x})$ by incorporating information from the scalar LF function $f_{LF}(\mathbf{x})$. Consequently, there will generally be only a few available observations of the HF function, whereas the LF data are abundant. In that respect, it is useful to introduce the following notation:

- $N_{HF}$ and $N_{LF}$ denote the numbers of HF and LF samples, respectively.

- The training set for the HF function is given by $T_{HF} = \{(\mathbf{x}_{HF}^{(i)}, y_{HF}^{(i)}) : 1 \leq i \leq N_{HF}\}$, where $y_{HF}^{(i)} = f_{HF}(\mathbf{x}_{HF}^{(i)})$. By replacing the subscript HF in the above notations by LF, we obtain the LF counterparts.

- The NN approximations corresponding to the HF and LF functions are denoted by $f_{HF}^{NN}(\mathbf{x})$ and $f_{LF}^{NN}(\mathbf{x})$, respectively.

In the current section, we review existing NNMFR approaches and propose our own strategies by elaborating and adapting existing ideas. Different from the existing NNMFR strategies, our all-in-one models consider the LF outputs as latent variables of the HF surrogate or mimic the correlation between the HF and LF levels in co-kriging, while our multilevel models are formulated directly from the hierarchy of fidelity levels.

3.1. All-in-one NNMFR

As previously noted, a single NN, which performs two simultaneous regressions on the HF and LF data, is used in [26]. It essentially resembles the structure of an autoencoder, where the inputs are encoded towards the LF output $y_{LF}$, then decoded and encoded again for the HF output $y_{HF}$. The setup relies on the assumption that

$$f_{HF}(\mathbf{x}) = F(\mathbf{x}, y_{LF}) = \alpha F_l(\mathbf{x}, y_{LF}) + (1 - \alpha) F_{nl}(\mathbf{x}, y_{LF}), \qquad \alpha \in [0, 1], \tag{10}$$

i.e., the unknown mapping $F$ of the LF to the HF data can be decomposed into a linear and a nonlinear part, denoted by $F_l$ and $F_{nl}$, respectively. In (10), the hyperparameter $\alpha$ determines the strength of the linear correlation, with $\alpha = 1$ corresponding to a fully linear relation between the HF and LF outputs. The determination of the value of $\alpha$ was not specified in [26], nor are any values indicated in the reported results.

Based on these considerations, we propose the following two modified NN architectures, see (a) and (b) in Fig. 1:

1. "Intermediate" model: The first NN structure is similar to the one in [26], except that the same input layer is used for HF and LF data and we omit the autoencoder resemblance by adding additional nodes to the layer containing the LF output. Furthermore, we do not impose (10), as we seek to directly model the function $F(\mathbf{x}, y_{LF})$. We refer to this NN as the "Intermediate" model, referring to the location of the LF output. In the numerical experiments, the Intermediate network has 5 hidden layers; the LF output is situated in the 3rd hidden layer and the HF output in the last layer. The number of neurons in each of the first 3 layers is fixed to 64, and the widths of the layers between the LF and HF outputs are determined by the hyperparameter optimization.
Except for the output layer, we use the hyperbolic tangent activation function.

2. "GPmimic" model: The second NNMFR is based on the similarity between a wide NN and a GP. Based on this correspondence, and since Gaussian processes are commonly used for MF regression, we implement an architecture which seeks to mimic the action of a GP. Hereafter we refer to this architecture as "GPmimic". In contrast to NNs, the use of GPs becomes infeasible in higher dimensions. Hence, the GPmimic NN indirectly enables GP-like regression even in the case of large data sets and high-dimensional inputs. The NN architecture is depicted in Fig. 1. The neurons labelled by $u_1$ and $u_2$ are defined to be analogous to two independent Gaussian processes, especially when the previous layers are wide, and they play the same roles as in the vector-valued GPR (6). By considering no nonlinear activation in the output layer, the NN outputs of the GPmimic model are given as

$$y_{HF}(\mathbf{x}) = W_{11} u_1(\mathbf{x}) + W_{12} u_2(\mathbf{x}) + b_1, \qquad y_{LF}(\mathbf{x}) = W_{21} u_1(\mathbf{x}) + W_{22} u_2(\mathbf{x}) + b_2. \tag{11}$$

By inspecting (11), we notice that the outputs are recovered as an affine transformation of the two variables $u_1(\mathbf{x})$ and $u_2(\mathbf{x})$, and the parameters $W_{ij}$'s define the correlations between $y_{HF}$ and $y_{LF}$ as in the vector-valued GPR. In principle, this setting only allows one to model linear correlations between the HF and LF data.

Figure 1: Different ANNs proposed for MF regression. All-in-one models in the top row: (a) Intermediate, (b) GPmimic; multilevel models in the bottom row: (c) 2-step, (d) 3-step. The Intermediate model in (a) considers the LF outputs as intermediate latent variables of the surrogate for HF. The GPmimic model in (b) employs the LMC to mimic MF GPR. The 2-step and 3-step models in (c) and (d) define separate NNs for the hierarchy of fidelity levels.

Remember that the all-in-one architectures perform vector-valued learning of $\mathbf{f} = [f_{HF}(\mathbf{x}), f_{LF}(\mathbf{x})]^T$. There are thus two different error components, one for each fidelity level. More specifically, these components are given by the training errors $\mathrm{MSE}_{HF}$ and $\mathrm{MSE}_{LF}$ of the HF and LF models, defined as

$$\mathrm{MSE}_{HF} = \frac{1}{N_{HF}} \sum_{i=1}^{N_{HF}} \big| y_{HF}^{(i)} - f_{HF}^{NN}(\mathbf{x}_{HF}^{(i)}) \big|^2, \quad \text{and} \quad \mathrm{MSE}_{LF} = \frac{1}{N_{LF}} \sum_{i=1}^{N_{LF}} \big| y_{LF}^{(i)} - f_{LF}^{NN}(\mathbf{x}_{LF}^{(i)}) \big|^2. \tag{12}$$

Hence, training an all-in-one network involves a multiobjective minimization problem, which is generally not easy to solve. In [26], the problem is reduced to a single optimization problem by considering a loss function given by the addition of the MSEs related to the two fidelity sets and an $L^2$-regularization term. A more general approach in multiobjective optimization is to use a weighted sum instead, to account for objectives of different scales. Even though the data sets can be initially scaled to ensure the same order of magnitude, large discrepancies between $\mathrm{MSE}_{HF}$ and $\mathrm{MSE}_{LF}$ could develop during the learning process. Based on these considerations, the loss function in the Intermediate and GPmimic models is given as

$$\mathcal{L} = \alpha \, \mathrm{MSE}_{HF} + (1 - \alpha) \, \mathrm{MSE}_{LF} + \lambda \|\mathbf{W}\|_2^2, \tag{13}$$
where $\alpha \in [0, 1]$ (unrelated to the one in (10)) acts as a scaling factor between the two fidelity levels and $\lambda > 0$ weights the regularization term. Note that $\alpha = 0$ and $\alpha = 1$ correspond to the single-fidelity regressions for the LF and HF data, respectively, while the choice $\alpha = 0.5$ weights the two fidelity levels equally. Here, $\alpha$ will be tuned by hyperparameter optimization in the numerical examples: the $\alpha$ values minimize the testing errors of $k$-fold cross-validation and are determined by Bayesian optimization.

3.2. Multilevel NNMFR

The NNMFR introduced in Subsection 3.1 relies on a single NN to learn the multidimensional function $\mathbf{f} = [f_{HF}(\mathbf{x}), f_{LF}(\mathbf{x})]^T$. Another approach, leading to the multilevel NNMFR, is to use distinct NNs to model these functions. Such an approach has been employed in [27], and the presented method can be summarized as follows. A first NN $\mathcal{NN}_1$ learns a correlation function $y_{HF} = F(\mathbf{x}, y_{LF})$ based on the input data $\{(\mathbf{x}_{HF}^{(i)}, y_{LF}(\mathbf{x}_{HF}^{(i)})) : 1 \leq i \leq N_{HF}\}$ and the output data $\{y_{HF}^{(i)} : 1 \leq i \leq N_{HF}\}$. In this case, it is required that the HF input locations form a subset of the LF input locations, i.e., $\mathbf{X}_{HF} = \{\mathbf{x}_{HF}^{(i)} : 1 \leq i \leq N_{HF}\} \subseteq \mathbf{X}_{LF} = \{\mathbf{x}_{LF}^{(i)} : 1 \leq i \leq N_{LF}\}$. For each available LF sample $\mathbf{x}' \in \mathbf{X}_{LF} \setminus \mathbf{X}_{HF}$, an approximate HF data sample $y'_{HF}(\mathbf{x}')$ can be generated by the network $\mathcal{NN}_1$, and these generated data, together with the existing HF data at $\mathbf{X}_{HF}$, can be used as the new HF data set $\mathbf{Y}'_{HF} = \{y_{HF}(\mathbf{X}_{HF}), y'_{HF}(\mathbf{X}_{LF} \setminus \mathbf{X}_{HF})\}$. Finally, a second NN $\mathcal{NN}_2$ is trained to model the HF function $f_{HF}(\mathbf{x})$ based on the input-output pairs between $\mathbf{X}_{LF}$ and $\mathbf{Y}'_{HF}$.

Following this multilevel approach in [27], we propose two modified multi-step NNs for MF regression by adopting the following major changes:

1. The LF function $f_{LF}(\mathbf{x})$ is modeled by a first NN $\mathcal{NN}_{LF}$. This modification is relevant if the computational cost of additional LF data is not negligible, for instance when having to solve PDEs.

2. Now that we can rapidly generate new LF data using $\mathcal{NN}_{LF}$, modeling the direct mapping between $\mathbf{x}$ and $y_{HF} = f_{HF}(\mathbf{x})$ is unnecessary. Hence our second network $\mathcal{NN}_{HF}$ will be analogous to $\mathcal{NN}_1$ in [27].

These considerations lead to the following multilevel architectures, also see (c) and (d) in Fig. 1 and the sketch at the end of this section:

1. "2-step" model: A DNN $\mathcal{NN}_{LF}$ is trained on $T_{LF}$ to learn the LF function $f_{LF}(\mathbf{x})$. Using the NN $\mathcal{NN}_{LF}$, we can predict the values of the LF function at the training inputs $\mathbf{X}_{HF}$ of the HF data, denoted by $f_{LF}^{NN}(\mathbf{X}_{HF}) = \{f_{LF}^{NN}(\mathbf{x}_{HF}^{(i)}) : 1 \leq i \leq N_{HF}\}$. Then a second artificial network $\mathcal{NN}_{HF}$ approximates the HF function $y_{HF} = F(\mathbf{x}, y_{LF})$ based on the input data $(\mathbf{X}_{HF}, f_{LF}^{NN}(\mathbf{X}_{HF})) = \{(\mathbf{x}_{HF}^{(i)}, f_{LF}^{NN}(\mathbf{x}_{HF}^{(i)})) : 1 \leq i \leq N_{HF}\}$ and the available HF output data $\mathbf{Y}_{HF} = y_{HF}(\mathbf{X}_{HF}) = \{y_{HF}^{(i)} : 1 \leq i \leq N_{HF}\}$. $\mathcal{NN}_{HF}$ is a shallow NN consisting of a single hidden layer. In other words, the network $\mathcal{NN}_{HF}$ approximates the function $y_{HF} = F(\mathbf{x}, y_{LF})$ based on the HF and LF data at the same locations $\mathbf{X}_{HF}$. However, as the HF and LF data locations, i.e., $\mathbf{X}_{HF}$ and $\mathbf{X}_{LF}$, are generated independently, the observations of the LF function at the HF inputs $\mathbf{X}_{HF}$ have to be evaluated from $\mathcal{NN}_{LF}$ if not directly available at $\mathbf{X}_{HF}$.

2. "3-step" model: This model is a modification of the 2-step model by adding an additional level of fidelity, generated by a third network $\mathcal{NN}_{lin}$. $\mathcal{NN}_{lin}$ is equivalent to
the $\mathcal{NN}_{HF}$ of the 2-step model in the sense that it models the correlation function $y_{HF} = F(\mathbf{x}, y_{LF})$. However, no nonlinear activation function is used in $\mathcal{NN}_{lin}$. In other words,
$\mathcal{NN}_{lin}$ is responsible for capturing the linear correlations between the HF and LF data sets. Let $f_{lin}^{NN}(\mathbf{X}_{HF}, f_{LF}^{NN}(\mathbf{X}_{HF}))$ denote the set of outputs of $\mathcal{NN}_{lin}$ at the HF input locations $\mathbf{X}_{HF}$. This set will then serve as part of the inputs for the third and final network $\mathcal{NN}_{HF}$. $\mathcal{NN}_{HF}$ approximates the correlation function $y_{HF} = F'(\mathbf{x}, y_{LF}, y_{lin})$, $y_{lin}$ being the output of $\mathcal{NN}_{lin}$ as a linear approximation of $y_{HF}$, and the training of $\mathcal{NN}_{HF}$ is based on the input-output pairs between $(\mathbf{X}_{HF}, f_{LF}^{NN}(\mathbf{X}_{HF}), f_{lin}^{NN}(\mathbf{X}_{HF}, f_{LF}^{NN}(\mathbf{X}_{HF})))$ and $\mathbf{Y}_{HF}$. $\mathcal{NN}_{HF}$ is again a shallow NN consisting of a single hidden layer. Note that the 3-step model should only present an advantage over the 2-step model when there exist strong linear correlations between the HF and LF data.

It should be clear that the proposed architectures present a few differences. The key difference between the all-in-one and the multilevel strategies is at the structural level. All-in-one NNMFR aims to approximate the two-dimensional function $\mathbf{f} = [f_{HF}(\mathbf{x}), f_{LF}(\mathbf{x})]^T$ with a single NN, while multilevel NNMFR uses distinct networks to model the HF and LF functions separately. However, to incorporate information from the LF function, this is done sequentially by first modeling $f_{LF}(\mathbf{x})$ and then the correlation between the two fidelity levels.

We note that the all-in-one models include the additional hyperparameter $\alpha$. This hyperparameter can potentially be used to incorporate prior knowledge into the regression model. Suppose for instance that we know that the LF data provide a good representation of the problem. We can then enforce the NN to accurately model the LF data by setting the value of $\alpha$ accordingly. Furthermore, this can be taken into account in the Intermediate model by choosing the number of neurons in the layer containing the LF output, where a smaller layer width may indicate a stronger dependency on the LF data. In the multilevel architectures, the question of how much trust is put in the LF data is left to the model and determined during the training process.

An advantage of the multilevel approaches is that the LF function does not have to be approximated by a neural network. Instead, we can use other techniques which might be more accurate or less time-consuming. For instance, in the presence of discontinuities, an accurate approximation can be obtained by utilizing gradient boosting algorithms [14].

Finally, in the non-hierarchical cases where the available information sources present similar levels of fidelity, the GPmimic model is the only NNMFR that can be used without any adaptation. In fact, all other proposed structures rely on a hierarchy of the available data sets according to their fidelity levels.
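The sketch below illustrates the 2-step model. It uses scikit-learn multilayer perceptrons purely for concreteness: the paper does not tie the models to a specific library, and the layer sizes here are arbitrary rather than the HPO-selected values (scikit-learn's MLPRegressor also applies a small $L^2$ penalty by default, in the spirit of (2)).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_two_step(X_LF, y_LF, X_HF, y_HF):
    """2-step NNMFR: a deep NN for f_LF, then a shallow NN for y_HF = F(x, y_LF)."""
    # Step 1: train NN_LF on the abundant LF data.
    nn_lf = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation='tanh',
                         max_iter=10000, random_state=0).fit(X_LF, y_LF)
    # Step 2: train NN_HF on (x, f_LF^NN(x)) -> y_HF at the scarce HF locations;
    # the LF values at X_HF are supplied by the first network.
    X_aug = np.column_stack([X_HF, nn_lf.predict(X_HF)])
    nn_hf = MLPRegressor(hidden_layer_sizes=(16,), activation='tanh',
                         max_iter=10000, random_state=0).fit(X_aug, y_HF)
    def predict(X):
        return nn_hf.predict(np.column_stack([X, nn_lf.predict(X)]))
    return predict
```

For one-dimensional problems, X_LF and X_HF are (n, 1) arrays; a 3-step variant would insert, between the two steps, a linear network whose output is appended as a third input feature of the final network.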
4. Numerical results (I): benchmark test cases
In this section, we analyze the performance of the proposed ANN structures in MF regression problems. For that purpose, the models presented in Section 3 are tested on a variety of artificial benchmarks and compared against commonly used co-kriging methods.

In all benchmarks, the hyperparameters of the neural networks, for instance the parameter $\alpha$ that balances the MSE terms, are optimized by $k$-fold cross-validation and Bayesian optimization, whereas the hyperparameters of the GPs are obtained by maximizing the marginal likelihood [42]. For the Bayesian optimization, the Python package Hyperopt [6] is used for a tree-structured Parzen estimator (TPE) based approach [7], in which the priors on hyperparameters can be chosen from a range of distributions for both continuous and discrete random variables. Both the single-fidelity and multi-fidelity GPRs are implemented with the Python package GPy [16]. Furthermore, GPy automatically adds a noise term for each fidelity level to account for noisy data. As the artificial benchmarks give rise to noiseless observations, the standard deviation values of all the noise terms are fixed to a negligibly small value in these benchmarks.

The numerical study in this section compares the performance of the proposed NNs against co-kriging. In each case, the qualitative difference between single- and multi-fidelity regressions is noted briefly before presenting a comparative study between NNMFR and multi-fidelity GPR. Models are evaluated using the MSE on a test set covering the whole input domain. The results will reveal whether choosing a model based on the validation error is a good strategy. The comparison between NN regression and the best GPR will not only include accuracy results but also computational time. In addition, we evaluate the following $R^2$ score that allows for a cross-benchmark comparison:

$$R^2 = 1 - \frac{\sum_{i=1}^{N_{test}} \big( y_{HF}^{(i)} - f_{reg}^{(i)} \big)^2}{\sum_{i=1}^{N_{test}} \big( y_{HF}^{(i)} - \bar{y}_{HF} \big)^2}, \qquad \text{where} \quad \bar{y}_{HF} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} y_{HF}^{(i)}, \tag{14}$$

$N_{test}$ is the size of the test set, and $f_{reg}$ denotes the prediction given by a regression model.
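The cross-validation and Bayesian-optimization procedure described above can be sketched as follows with Hyperopt's TPE algorithm. The search space (the loss weight $\alpha$, a layer width, and the regularization weight $\lambda$) is a simplified stand-in for the full hyperparameter sets reported in Appendix A, and `build_and_train` stands for any of the model constructors of Section 3.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import KFold

def cv_error(params, X_HF, y_HF, build_and_train):
    """Mean k-fold validation MSE on the (scarce) HF data for one hyperparameter set."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_HF):
        model = build_and_train(params, X_HF[train_idx], y_HF[train_idx])
        errors.append(np.mean((model(X_HF[val_idx]) - y_HF[val_idx]) ** 2))
    return float(np.mean(errors))

space = {                                             # illustrative search space
    'alpha':  hp.uniform('alpha', 0.0, 1.0),          # loss weight in Eq. (13)
    'width':  hp.quniform('width', 8, 128, 8),        # width of a tunable layer
    'lambda': hp.loguniform('lambda', np.log(1e-8), np.log(1e-2)),
}
# best = fmin(fn=lambda p: cv_error(p, X_HF, y_HF, build_and_train),
#             space=space, algo=tpe.suggest, max_evals=100, trials=Trials())
```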
4.1. Benchmark case 1: Linear correlation

The first benchmark is a common test case for MF methods. HF and LF functions are defined over $\Omega = [0, 1]$ as

$$f_{HF}(x) = (6x - 2)^2 \sin(12x - 4), \qquad f_{LF}(x) = 0.5 f_{HF}(x) + 10(x - 0.5) + 5,$$

respectively. The LF function is obtained by a linear transformation of the HF function and the input variable $x$. The setup serves as a good initial test for MF models. HF and LF samples are given by 5 and 32 equally spaced values in $\Omega$, respectively. Since the number of HF data is very limited, the hyperparameters of the NNs are optimized using leave-one-out cross-validation (LOOCV). The values can be found in Table A.1. Fig. 2 shows the results of both the single-fidelity regression (SFR) and the MF regression (MFR). Both the NN and the GPR fail to approximate the function based solely on the HF data; in particular, both models fail to accurately predict the function over part of the domain. $\mathcal{NN}_{HF}$ of the 3-step model is redundant in this case, as there is no nonlinear trend to catch. Consequently, in this first and simple benchmark there is nothing to be gained by using ANNs, as they all perform worse than classic GPR and are also computationally more expensive.
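Under the definitions above, the benchmark data and the $R^2$ index of Eq. (14) can be produced with a few lines of Python; this is only a convenience sketch of the experimental setup.

```python
import numpy as np

def f_hf(x):
    """High-fidelity function."""
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def f_lf(x):
    """Low-fidelity function: a linear transformation of f_hf and x."""
    return 0.5 * f_hf(x) + 10.0 * (x - 0.5) + 5.0

def r2_score(y_true, y_pred):
    """Goodness-of-fit index R^2 of Eq. (14)."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

x_hf = np.linspace(0.0, 1.0, 5)     # 5 equally spaced HF samples
x_lf = np.linspace(0.0, 1.0, 32)    # 32 equally spaced LF samples
y_hf, y_lf = f_hf(x_hf), f_lf(x_lf)
```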
Table 1: Linear correlation: comparison of the MF regression models. Indicated times account for both HPO and final model predictions. [Columns: Model, Validation MSE, Test MSE, R², Elapsed time (s); rows: Co-kriging, Intermediate, GPmimic, 2-step, 3-step.]

4.2. Benchmark case 2: Discontinuous functions

This second benchmark is designed to analyze how well the proposed NN models can approximate discontinuities in a MF setting. HF and LF data are generated from the following functions:

$$f_{LF}(x) = \begin{cases} 0.5\,(6x-2)^2 \sin(12x-4) + 10(x-0.5) - 5, & 0 \leq x < 0.5, \\ 3 + 0.5\,(6x-2)^2 \sin(12x-4) + 10(x-0.5) - 5, & 0.5 < x \leq 1, \end{cases}$$

$$f_{HF}(x) = \begin{cases} 2 f_{LF}(x) - 20x + 20, & 0 \leq x < 0.5, \\ 4 + 2 f_{LF}(x) - 20x + 20, & 0.5 < x \leq 1. \end{cases}$$

Figure 2: Linear correlation: single- and MF regression results using GPs and different ANNs. There are 5 HF (circle) and 32 LF (cross) observations. The single-fidelity model based on the 5 HF data points is shown in green (dashed). Panels: (a) SFR with GP, MFR with co-kriging; (b) SFR with NN, MFR with 3-step; (c) SFR with NN, MFR with 2-step; (d) SFR with NN, MFR with GPmimic; (e) SFR with NN, MFR with Intermediate.
8 and 32 equally spaced points in $\Omega = [0, 1]$ are used as the HF and LF inputs, respectively. In addition, 10 equally spaced points in the interval $[0.45, 0.55]$ are added to the LF data set to allow for a better approximation of the discontinuity. In contrast to the first test case, the correlation between the two functions is only piecewise linear, and the piecewise definition results in discontinuities of distinct amplitudes, see Fig. 3(a).

Fig. 3 shows that the regression models based merely on the HF data are not able to approximate the discontinuity at $x = 0.5$, but the regression results are significantly improved by taking the LF data into account. With $R^2$ scores over 0.99, both the 2-step and 3-step multilevel NN models show a good match with the exact solution. However, the discontinuity is considerably smoothened when the other NN models are used, especially when employing the GPmimic model. While the co-kriging presents a discontinuity at $x = 0.5$, fluctuations are present to the right of it.

Table 2: Discontinuous function: comparison of the MF regression models. Indicated times account for HPO and final model predictions.
[Columns: Model, Validation MSE, Test MSE, R², Elapsed time (s); rows: Co-kriging, Intermediate, GPmimic, 2-step, 3-step.]

4.3. Benchmark case 3: Nonlinear correlation

The third one-dimensional benchmark tests the proposed NN models on a nonlinear correlation between the HF and LF levels. The data samples are obtained from the following functions:

$$f_{LF}(x) = \sin(8\pi x), \qquad f_{HF}(x) = \big(x - \sqrt{2}\,\big)\, f_{LF}^2(x).$$

We use 15 equally spaced HF data points and 42 LF data points over the interval $\Omega = [0, 1]$.

Table 3: Nonlinear correlation: comparison of the MF regression models. Indicated times account for HPO and final model predictions.
[Columns: Model, Validation MSE, Test MSE, R², Elapsed time (s); rows: Co-kriging, Intermediate, GPmimic, 2-step, 3-step.]

Both co-kriging and the GPmimic NN fail in this benchmark; with an $R^2$ score of 0.128, the GPmimic NN hardly performs better than the average function value. However, it is not surprising that both models fail in the current benchmark, as their mathematical structure is unable to detect nonlinear correlations between the two fidelity levels. Fig. 4 underlines that the difficulty of this test case lies in approximating the different local extrema of the HF function, since data points are not
Figure 3: Discontinuous function: single- and MF regression results using GPs and different ANNs. There are 8 HF (circle) and 42 LF (cross) observations. Panels: (a) HF and LF functions; (b) SFR with GP, MFR with co-kriging; (c) SFR with NN, MFR with 2-step; (d) MFR using different NNs (Intermediate, GPmimic, 3-step).

always available close to the local extrema. Nevertheless, both the multilevel NN models recover an $R^2$ score exceeding 0.99.

4.4. Benchmark case 4: A 20-dimensional function

It is well known that GPR suffers from the curse of dimensionality and cannot be effectively used when the number of data points is very large. For this high-dimensional benchmark, the HF and LF data are generated from
$$f_{HF}(\mathbf{x}) = (x_1 - 1)^2 + \sum_{i=2}^{20} \big(2x_i^2 - x_{i-1}\big)^2, \qquad f_{LF}(\mathbf{x}) = 0.8\, f_{HF}(\mathbf{x}) - \sum_{i=2}^{20} 0.4\, x_{i-1} x_i - 50,$$

with $\mathbf{x} = \{x_1, x_2, \cdots, x_{20}\} \in \Omega = [-3, 3]^{20}$. HF data are sampled from $f_{HF}$ at 5000 locations randomly chosen from a uniform distribution over $\Omega$, while the LF samples are evaluated at 30000 random input locations.
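Assuming the reconstructed definitions above, the training sets for this benchmark can be generated as follows; the random seed is arbitrary.

```python
import numpy as np

def f_hf_20d(X):
    """High-fidelity function; X has shape (n_samples, 20)."""
    return (X[:, 0] - 1.0) ** 2 + np.sum((2.0 * X[:, 1:] ** 2 - X[:, :-1]) ** 2, axis=1)

def f_lf_20d(X):
    """Low-fidelity companion of f_hf_20d."""
    return 0.8 * f_hf_20d(X) - np.sum(0.4 * X[:, :-1] * X[:, 1:], axis=1) - 50.0

rng = np.random.default_rng(0)
X_hf = rng.uniform(-3.0, 3.0, (5000, 20))     # 5000 random HF locations
X_lf = rng.uniform(-3.0, 3.0, (30000, 20))    # 30000 random LF locations
y_hf, y_lf = f_hf_20d(X_hf), f_lf_20d(X_lf)
```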
Figure 4: Nonlinear correlation: single- and MF regression results using GPs and different ANNs. There are 15 HF (circle) and 42 LF (cross) observations. Panels: (a) HF and LF functions, with a different range for the vertical axis than (b), (c) and (d); (b) SFR with GP, MFR with co-kriging; (c) SFR with NN, MFR with 3-step; (d) MFR using different NNs (GPmimic, Intermediate, 2-step).

In this case, co-kriging would require one to compute the inverse of a 35000 × 35000 covariance matrix, which is computationally prohibitive, so only the NN models are considered. Fig. 5 compares the exact values of $y_{HF}$ with the single- and multi-fidelity NN predictions at one million random input locations.
5. Numerical results (II): application to parametrized PDEs
In this section we apply our MF framework to a problem arising in acoustics, namely the propagation of a pressure wave $P(\mathbf{x}, t)$ into an acoustic horn with parametrized shape, addressed in [30]. In particular, we are interested in evaluating an input-output map involving the solution of a PDE, exploiting the accuracy of the finite element method to obtain HF solutions, while relying on a reduced basis (RB) method to compute LF, but fast and inexpensive, approximations. Different from the previous section, we consider different sources of data for each fidelity level.

We consider an acoustic device, illustrated in Fig. 6 (left), comprising a waveguide, with infinite extension to the left and a conical extremity on the right, namely the horn, and an internal propagating planar wave inside the waveguide: once the wave reaches the horn, a portion of its energy is converted into an outer-going wave. Under the assumptions of single-frequency and time-harmonic waves, the acoustic pressure can be expressed as $P(\mathbf{x}, t) = \mathcal{R}\big(p(\mathbf{x}) e^{i\omega t}\big)$, where the complex amplitude $p(\mathbf{x})$ satisfies the monochromatic, steady-state Helmholtz equation with mixed Neumann-Robin boundary conditions:

$$\begin{cases} \Delta p + k^2 p = 0 & \text{in } \Omega, \\ \left( ik + \dfrac{1}{2R} \right) p + \nabla p \cdot \mathbf{n} = 0 & \text{on } \Gamma_o, \\ ikp + \nabla p \cdot \mathbf{n} = 2ikA & \text{on } \Gamma_i, \\ \nabla p \cdot \mathbf{n} = 0 & \text{on } \Gamma_h \cup \Gamma_s = \Gamma_n, \end{cases} \tag{15}$$

where $k = \omega/c$ is the wave number, $\omega = 2\pi f$ the angular frequency, and $c = 340\ \mathrm{m\,s^{-1}}$ the speed of sound; $\mathbf{n}$ denotes the outward-directed unit normal on the boundary of $\Omega$. We restrict the computation to the domain $\Omega$ shown in Fig. 6, and impose on $\Gamma_i$ a propagating wave with amplitude $A = 1$ while absorbing the outer-going planar waves, an absorbing condition on the far-field boundary $\Gamma_o$, and homogeneous Neumann boundary conditions on the sound-hard walls of the device $\Gamma_h$ as well as on the symmetry boundary $\Gamma_s$. We take, for simplicity, the radius equal to $R = 1$ [4].

In addition to the frequency $f$ of the incoming wave, we parametrize, as in [30], the shape of the horn by means of radial basis functions, introducing as parameters $\boldsymbol{\mu}_g = (\mu_{g,1}, \ldots, \mu_{g,4})$ the vertical displacements of four control points located on the horn wall $\Gamma_h$, shown in Fig. 6, right. The admissible domain configurations are defined as the diffeomorphic images $\Omega(\boldsymbol{\mu}_g)$ of the reference shape $\Omega$ through a deformation mapping $T(\cdot; \boldsymbol{\mu}_g)$ obtained as linear combinations of the control point displacements. Hence, the acoustic problem depends on five parameters, i.e., we denote by $\boldsymbol{\mu} = (f, \boldsymbol{\mu}_g)$ the parameter vector and by $\mathcal{D} \subset \mathbb{R}^p$ the parameter space.

We focus our analysis on a specific output of interest, namely the index of reflection intensity (IRI) [4], which measures the transmission efficiency of the acoustic horn and is defined as the absolute value of the average reflected wave at the sound inlet $\Gamma_i$, i.e.,

$$f(\boldsymbol{\mu}) = J(p(\boldsymbol{\mu})) = \left| \frac{1}{|\Gamma_i|} \int_{\Gamma_i} p(\boldsymbol{\mu})\, d\Gamma - 1 \right|.$$

For simplicity, the device extends infinitely in the direction normal to the plane and its wall consists of sound-hard material. Therefore, for the frequencies in the range under consideration, we assume that all non-planar modes in the waveguide are negligible, which allows us to reduce the problem to two space dimensions.
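Given the complex-valued pressure trace on $\Gamma_i$ produced by either the FOM or the ROM, the IRI reduces to a one-dimensional boundary integral. The sketch below evaluates it with trapezoidal quadrature on the inlet nodes; the quadrature rule and the node/arclength representation are illustrative assumptions about the discrete data, not the quadrature used in the paper.

```python
import numpy as np

def index_of_reflection_intensity(p_inlet, s):
    """IRI = | (1/|Gamma_i|) * integral_{Gamma_i} p dGamma - 1 |.
    p_inlet: complex pressure values at the inlet boundary nodes (FOM or ROM trace);
    s: corresponding arclength coordinates along Gamma_i (monotonically increasing)."""
    length = s[-1] - s[0]                                               # |Gamma_i|
    integral = np.sum(0.5 * (p_inlet[1:] + p_inlet[:-1]) * np.diff(s))  # trapezoid rule
    return abs(integral / length - 1.0)   # subtract the unit incoming amplitude A = 1
```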
Figure 6: Acoustic horn problem. Left: the computational domain Ω and the boundaries. Right: the control points used in the RBF shape parametrization, whose vertical displacements are treated as parameters.

In particular,

$$f_{LF}^N(\boldsymbol{\mu}) = J(p_N(\boldsymbol{\mu})) \qquad \text{and} \qquad f_{HF}(\boldsymbol{\mu}) = J(p_h(\boldsymbol{\mu}))$$

define the MF setting for this test case. We denote by $p_N(\boldsymbol{\mu})$ and $p_h(\boldsymbol{\mu})$ the LF and the HF solutions of the parametrized PDE, respectively, the construction of which is reported in Appendix B. Moreover, we highlight the dependence of the LF model on the dimension $N$ of the ROM used to evaluate the PDE solution. In the following subsections, we consider two different scenarios, dealing with either 1 or 5 parameters.

The case of p = 1 parameter

In this first case, we compare different MF strategies by considering the output of interest as a function of the frequency $\mu = f$ only, letting it vary in $\mathcal{D} = [10, \ldots]$. We sample a set of frequency values $f$ and compute the corresponding HF solutions through the FOM. The FOM is built from $\mathbb{P}_1$ finite elements and, considering a mesh made of 8740 triangular elements, we have an HF model of dimension $n = 4567$. Then, we apply POD and extract $N = 44$ reduced basis functions by imposing a prescribed relative projection error tolerance. All computational details are summarized in Table 4. The computation of the HF and LF solutions, i.e. the FOM and ROM solutions, respectively, is carried out in Matlab, using the redbKIT library [29]. All hyperparameters obtained by HPO are reported in Appendix A.

We assess how the quality of the LF model and the amount of HF data impact the accuracy of the MF prediction. We recall that (i) we can improve the quality of the LF model by selecting a smaller or larger number of bases, and (ii) we can freely decide the parameter values for which we solve the HF model, keeping the sampling method fixed. A comparison among LF models obtained with different dimensions $N$ of the ROM is reported in Fig. 7. We choose to limit the number of basis functions in the POD-Galerkin ROM to between 5 and 22 (even though the maximum number of available basis functions is 44), to deal with a potentially inaccurate (or, at least, not accurate enough) LF model. The ability to obtain accurate predictions by combining a few HF data and several evaluations of a reliable, but not sufficiently accurate, ROM is indeed an attractive feature of the proposed framework, as this may prevent us from constructing ROMs with large dimensions and poor efficiency. Moreover, we consider the same range for the amount of HF data, while we keep fixed (and equal to 32) the number of LF data.

In the following, we test the Intermediate, 2-step, and 3-step network architectures; for each combination number of HF data - number of bases, we train the NNs, predict the output of interest, and compute the goodness-of-fit indices $R^2$ and MSE. The evaluated outputs with these network architectures are displayed in Fig. 8; the values of $R^2$ and MSE are reported instead in Figs. 9, 10 and 11, as functions of the number $N$ of basis functions and the amount of HF data considered. Overall, the neural architectures perform very well and provide good predictions, in terms of both MSE and $R^2$, provided a sufficient number of basis functions and HF data are used.

Table 4: Computational details in the case with p = 1 parameter: µ = f.

Number of parameters: 1 | Parameter domain D: [10, …]
Number of FE dofs n: 4567 | Number of ROM dofs N: 44
Number of HF data: from 5 to 22 | Number of bases: from 5 to 22
Number of LF data: 32 | Sampling method: LHS
Figure 7: HF model $f_{HF}(\mu)$ and the LF models $f_{LF}^N(\mu)$ for different numbers $N$ of basis functions (N = 5, 10, 15, 20, 25, 30, 35, 40). For too small values of N, the LF model prediction is rather poor, while it becomes almost indistinguishable from the HF model prediction as N grows.

The 2-step and 3-step models produce similar results and perform better than the Intermediate model, as displayed in Fig. 8. In particular, the multilevel models are robust and efficient even in the cases where the LF models are built from few bases, while the
Intermediate network has poor predictive accuracy without a sufficiently accurate LF model, even if a large amount of HF data is provided. In fact, with the multistep networks we can reach values of $R^2$ larger than 0.8 even when considering just N = 5 bases, whereas the Intermediate model does not provide large values of $R^2$ with less than 12 bases, see Figs. 9, 10 and 11. Regarding the computed outputs, we see how the larger-amplitude peaks at lower frequencies are well captured by the multilevel models, while the smaller-amplitude peaks at higher frequencies are harder to capture and are recovered with different accuracy by the two multilevel models. On the other hand, the Intermediate model provides a less accurate trend of the output, and only provides reliable results provided that both N and the amount of HF data are large enough.

As expected, the prediction improves as the number of HF data or the number of basis functions increases, even if these two features impact the accuracy in slightly different ways. Restricting, for the sake of simplicity, to the case of the 3-step model (see Fig. 11, top), we observe that, for each fixed number of reduced bases N, the goodness-of-fit indices improve as the number of HF data increases until a threshold limit is reached, which is determined solely by the number of bases and cannot be overcome by adding more HF data. Conversely (see Fig. 11, bottom), by increasing N, the trends of $R^2$ (resp. MSE), corresponding to the different numbers of HF data adopted, increase (resp. decrease) and also tend to reach values closer to each other. The amount of HF data thus becomes less and less important as the LF model improves. Hence, in this specific problem it is more efficient to have a good LF model even with a small amount of HF data, rather than lots of HF data but a poor LF model.

The case of p = 5 parameters

We finally apply our MF setting to the case where all five parameters $\boldsymbol{\mu} = (f, \boldsymbol{\mu}_g)$, namely the frequency and the four geometric parameters, vary, in order to consider shape variations in the horn geometry as well. The parameter space is $\mathcal{D} = [50, \ldots] \times \mathcal{D}_g$, where $\mathcal{D}_g \subset \mathbb{R}^4$ is a symmetric range for the four control point displacements.
Intermediate, 2-step, 3-step ) models considering different number of bases and HFtraining data.
We compute 200 FOM snapshots for 200 points sampled in the parameter domain $\mathcal{D}$ through a Latin hypercube sampling design. In this case, the reduced basis is made from N = 80 POD modes; see Table 5.

Figure 9: Case with p = 1 parameter, Intermediate architecture. MSE and $R^2$ for different amounts of HF data and LF dimension N.

As in the case of p = 1 parameter, we select an appropriate range for the number of bases and HF data to assess the efficiency of the MF approach in different scenarios. We consider a larger number of LF data (500 instead of 100) than in the case of p = 1 parameter; we then vary the number of HF data from 5 to 45, and the number of basis functions of the LF model from 5 to 40. We report the results only in the case of the 3-step model, for the sake of brevity and to focus on the role of the quality and quantity of the training data rather than on the choice of the NN architecture.

Table 5: Computational details in the case with p = 5 parameters: µ = (f, µ_g).

Number of parameters: 5 | Parameter domain D: [50, …] × D_g
Number of FE elements: 8740 | Number of FE snapshots: 200
Number of FE dofs n: 4567 | Number of ROM dofs N: 80
Number of HF data: from 5 to 45 | Number of bases: from 5 to 40
Number of LF data: 500 | Sampling method: LHS

Passing from 1 to 5 parameters does not worsen the prediction power of the NN architecture. Indeed, once again we obtain an accurate prediction from a small number of FOM data, by exploiting a large number of ROM solutions that can be computed very quickly and inexpensively. As in the previous case, the MF prediction improves both as the number of HF data and the number of bases increase, with the latter playing a more important role in this framework. Similar to the case p = 1, we display $R^2$ and the MSE obtained with the 3-step NN architecture as functions of the aforementioned factors (see Fig. 12).

Figure 10: Case with p = 1 parameter, 2-step architecture. MSE and $R^2$ for different amounts of HF data and LF dimension N.

Numerical results show that we can achieve very good results even in 5 dimensions by employing 500 LF data, and that increasing both the amount of HF data and the dimension of the LF model improves the prediction accuracy. When considering a poor LF model, the number of HF data is fundamental to obtain good predictions: for instance, with an LF model of dimension N = 15, $R^2$ improves substantially when the number of HF data increases from 10 to 45, whereas with N = 35 the same increase of HF data improves $R^2$ only marginally. For small LF dimensions (e.g., N = 5), $R^2$ (resp., the MSE) continues to increase (resp., decrease) as the number of HF data increases, while both indices flatten when considering larger values of N, for which we obtain an $R^2$ greater than 0.97 and a very small MSE, with both indices continuing to improve as the number of bases increases. In particular, it seems that increasing the number of HF data only helps provided the dimension N of the LF model is also enlarged. On the other hand, as N increases, the indices keep improving and the number of HF data becomes less and less relevant. Therefore, also in the case of p = 5 parameters we can conclude that improving the quality of the LF model is a more efficient strategy, as well as a computationally cheaper one, than increasing the number of HF data to reach a certain degree of accuracy.

Figure 11: Case with p = 1 parameter, 3-step architecture. MSE and $R^2$ for different amounts of HF data and LF dimension N.
6. Conclusions
In this work we discuss MF regression with ANNs, for which four different architectures are presented. The proposed NN schemes are benchmarked against co-kriging on a collection of test cases of increasing complexity. We also successfully predict output quantities associated with parametrized PDEs using the multi-fidelity NN models. Observations made from the different numerical examples show that MF models based on NNs can consistently outperform co-kriging schemes. In contrast to co-kriging, NNs are able to detect nonlinear correlations between fidelity levels more effectively and are capable of dealing with large data sets. In addition, the hyperparameter selection for the proposed multi-fidelity NNs is automatized based on cross-validation and Bayesian optimization, and the tuned NN models consistently yield very low prediction errors.

The proposed schemes have been tested on a series of manufactured benchmarks and applied to a parametrized PDE problem. In the latter, the goal is to evaluate an output functional of the PDE solution that features an oscillating input-output dependence, and the LF model is constructed through the reduced basis method while the HF model is given by a detailed finite element analysis. Numerical results show that the accuracy of the predictions through MF regression is mainly driven by the reliability of the reduced order model, rather than the amount of HF data fed into the NNs.

Figure 12: Case with p = 5 parameters, 3-step architecture. MSE and $R^2$ for different amounts of HF data and LF dimension N.

A promising direction for future work is to inspect whether changing the value of the balance parameter $\alpha$ during the training process can improve the model accuracy. As the ratio $\mathrm{MSE}_{HF} / \mathrm{MSE}_{LF}$ evolves during the training process, a corresponding adaptation of $\alpha$ could be a reasonable improvement of the current schemes.

References
[1] M. A. Álvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
[2] R. C. Aydin, F. A. Braeu, and C. J. Cyron. General multi-fidelity framework for training artificial neural networks with computational models. Frontiers in Materials, 6:61, 2019.
[3] N. Baker, F. Alexander, T. Bremer, A. Hagberg, Y. Kevrekidis, H. Najm, M. Parashar, A. Patra, J. Sethian, S. Wild, et al. Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence. Technical report, USDOE Office of Science (SC), Washington, DC (United States), 2019.
[4] E. Bängtsson, D. Noreland, and M. Berggren. Shape optimization of an acoustic horn. Comput. Methods Appl. Mech. Engrg., 192(11–12):1533–1571, 2003.
[5] P. Benner, S. Gugercin, and K. Willcox. A survey of projection-based model reduction methods for parametric dynamical systems. SIAM Review, 57(4):483–531, 2015.
[6] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, pages 115–123. JMLR.org, 2013.
[7] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
[8] K. T. Carlberg, R. Tuminaro, and P. Boggs. Preserving Lagrangian structure in nonlinear model reduction with application to structural dynamics. SIAM J. Sci. Comput., 37(2):B153–B184, 2015.
[9] S. Chaturantabut and D. C. Sorensen. Nonlinear model reduction via discrete empirical interpolation. SIAM J. Sci. Comput., 32(5):2737–2764, 2010.
[10] M. Chen, Y. Li, and R. Li. Research on neural machine translation model. Journal of Physics: Conference Series, 1237:052020, 2019.
[11] K. Cutajar, M. Pullin, A. Damianou, N. Lawrence, and J. González. Deep Gaussian processes for multi-fidelity modeling. arXiv:1903.07320, 2019.
[12] R. Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.
[13] S. Fresca, L. Dedè, and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. arXiv:2001.04001, 2020.
[14] J. Friedman. Greedy function approximation: A gradient boosting machine. Ann. Stat., 29(5):1189–1232, 2001.
[15] S. Ghosh and D. L. Reilly. Credit card fraud detection with a neural-network. In Proceedings of the 27th Hawaii International Conference on System Sciences, volume 3, pages 621–630, 1994.
[16] GPy. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.
[17] J. S. Hesthaven, G. Rozza, and B. Stamm. Certified Reduced Basis Methods for Parametrized Partial Differential Equations. Springer International Publishing, 2016.
[18] J. S. Hesthaven and S. Ubbiali. Non-intrusive reduced order modeling of nonlinear problems using neural networks. J. Comput. Phys., 363:55–78, 2018.
[19] A. G. Journel. Fundamentals of Geostatistics in Five Lessons, volume 8 of Short Courses in Geology. American Geophysical Union (AGU), 1989.
[20] M. Kast, M. Guo, and J. S. Hesthaven. A non-intrusive multifidelity method for the reduced order modeling of nonlinear problems. Comput. Methods Appl. Mech. Engrg., 364:112947, 2020.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[22] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv:1711.00165, 2017.
[23] K. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. J. Comput. Phys., 404:108973, 2020.
[24] D. Liu and Y. Wang. Multi-fidelity physics-constrained neural network and its application in materials modeling. Journal of Mechanical Design, 141(12), 2019.
[25] X. Meng, H. Babaee, and G. E. Karniadakis. Multi-fidelity Bayesian neural networks: Algorithms and applications. arXiv:2012.13294, 2020.
[26] X. Meng and G. E. Karniadakis. A composite neural network that learns from multi-fidelity data: Application to function approximation and inverse PDE problems. J. Comput. Phys., 401, 2019.
[27] M. Motamed. A multi-fidelity neural network surrogate sampling method for uncertainty quantification. Int. J. Uncert. Quantif., 10(4):315–332, 2020.
[28] R. M. Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
[29] F. Negri. redbKIT Version 2.2. http://redbkit.github.io/redbKIT/, 2016.
[30] F. Negri, A. Manzoni, and D. Amsallem. Efficient model reduction of parametrized systems by matrix discrete empirical interpolation. J. Comput. Phys., 303:431–454, 2015.
[31] A. O'Hagan and M. C. Kennedy. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1–13, 2000.
[32] B. Peherstorfer, K. Willcox, and M. Gunzburger. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Review, 60(3):550–591, 2018.
[33] A. Quarteroni, A. Manzoni, and F. Negri. Reduced Basis Methods for Partial Differential Equations: An Introduction. Springer International Publishing, 2016.
[34] M. Raissi and G. Karniadakis. Deep multi-fidelity Gaussian processes. arXiv:1604.07484, 2016.
[35] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Inferring solutions of differential equations using noisy multi-fidelity data. J. Comput. Phys., 335:736–746, 2017.
[36] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378:686–707, 2019.
[37] D. Ray and J. S. Hesthaven. An artificial neural network as a troubled-cell indicator. J. Comput. Phys., 367:166–191, 2018.
[38] F. Regazzoni, L. Dedè, and A. Quarteroni. Machine learning of multiscale active force generation models for the efficient simulation of cardiac electromechanics. Comput. Methods Appl. Mech. Engrg., 370:113268, 2020.
[39] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. J. Comput. Phys., 375:1339–1364, 2018.
[40] G. Strang. Linear Algebra and Learning from Data. Wellesley-Cambridge Press, 2019.
[41] E. Weinan, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat., 5(4):349–380, 2017.
[42] C. Williams and C. Rasmussen. Gaussian Processes for Machine Learning. The MIT Press, 2006.
Appendix A: hyperparameter summary
Tables A.1 and A.2 show the values of the NN hyperparameters resulting from the HPO in the different numerical examples. We recall that α is the parameter that balances the fidelity levels' contributions to the training error in all-in-one networks, λ is the L2-regularization parameter, and η is the learning rate. "Nodes" indicates the number of nodes in each layer whose size is left to be determined through the HPO. Although NN hyperparameters should generally not be analyzed separately, the following comments about the individual hyperparameters can be made; a sketch of the tuning loop is given after the list.
• In all test cases, the optimization procedure chooses α < 0.1, which corresponds to weighting the LF data one order of magnitude more than their HF counterpart.
• The learning rates η of models belonging to the same NNMFR class (all-in-one or multilevel) are generally of the same order of magnitude.
• In most cases, the AdaMax algorithm is the best-performing optimizer.
• In the benchmarks, Glorot uniform weight initialization is preferred, whereas standard uniform initialization performs best when predicting the IRI in the acoustic horn problem.
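As a concrete illustration of this procedure, below is a minimal sketch of a cross-validated HPO loop using the Tree-structured Parzen Estimator from the hyperopt library [6, 7]. The search-space bounds, the five-fold split, the data arrays x_lf, y_lf, x_hf, y_hf, and the helper build_and_train are illustrative assumptions rather than the exact setup used for the reported results.

```python
import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.model_selection import KFold

# Illustrative search space over the hyperparameters reported in Tables A.1-A.2
space = {
    "alpha": hp.uniform("alpha", 0.0, 0.5),                      # fidelity-balance weight
    "lam":   hp.loguniform("lam", np.log(1e-8), np.log(1e-2)),   # L2-regularization strength
    "eta":   hp.loguniform("eta", np.log(1e-4), np.log(1e-1)),   # learning rate
    "nodes": hp.quniform("nodes", 10, 120, 1),                   # width of the tunable layer
}

def cv_loss(params):
    """Mean 5-fold cross-validation MSE of the MF network on the HF data."""
    kf, errors = KFold(n_splits=5, shuffle=True, random_state=0), []
    for tr, va in kf.split(x_hf):
        # build_and_train is a hypothetical helper that trains an MF network
        # on all LF data plus the HF training fold, for given hyperparameters
        model = build_and_train(params, x_lf, y_lf, x_hf[tr], y_hf[tr])
        errors.append(np.mean((model.predict(x_hf[va]) - y_hf[va]) ** 2))
    return float(np.mean(errors))

# Bayesian optimization (TPE) over the search space
best = fmin(fn=cv_loss, space=space, algo=tpe.suggest, max_evals=100)
```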
Table A.1: Hyperparameter values of the NNMFR used for the different artificial test cases. Benchmark 1 involves linear correlation, benchmark 2 discontinuous functions, and benchmark 3 non-linear correlation.
Test | Model | Weight initializer | Depth × width | α | λ | Optimizer | η
1 | … | … | … × 59 | 4.59 × 10^-… | … × 10^-… | Adam | 6.35 × 10^-…
1 | GPmimic | Glorot uniform | 3 × 33 | 5.01 × 10^-… | … × 10^-… | AdaMax | 3.90 × 10^-…
1 | … | … | … × 34 | - | 1.01 × 10^-… | AdaMax | 5.17 × 10^-…
1 | … | … | … × 38 | - | 1.08 × 10^-… | Adam | 2.06 × 10^-…
2 | … | … | … × 30 | 9.22 × 10^-… | … × 10^-… | AdaMax | 3.41 × 10^-…
2 | GPmimic | Glorot uniform | 4 × 10 | 2.71 × 10^-… | … × 10^-… | AdaMax | 4.42 × 10^-…
2 | … | … | … × 60 | - | 1.01 × 10^-… | AdaMax | 3.65 × 10^-…
2 | … | … | … × 98 | - | 2.30 × 10^-… | AdaMax | 1.02 × 10^-…
3 | … | … | … × 39 | 4.02 × 10^-… | … × 10^-… | Adam | 1.55 × 10^-…
3 | GPmimic | Glorot normal | 4 × 22 | 9.74 × 10^-… | … × 10^-… | … | …
3 | … | … | … × 44 | - | 1.02 × 10^-… | AdaMax | 4.84 × 10^-…
3 | … | … | … × 62 | - | 2.35 × 10^-… | Adam | 1.25 × 10^-…

Table A.2: Hyperparameter values of the considered NN architectures regarding the problems in Section 5; here P indicates the number of parameters considered.
P | Model | Weight initializer | α | λ | Optimizer | Nodes | η | Elapsed time (s)
1 | Inter. | Uniform | 3.19 × 10^-… | … × 10^-… | Adam | 114 | 4.27 × 10^-… | …
… | … | … | … | … × 10^-… | AdaMax | 18 | 1.76 × 10^-… | …
… | … | … | … | … × 10^-… | Adam | 6 | 1.42 × 10^-… | …
… | … | … | … | … × 10^-… | Adam | 48 | 7.57 × 10^-… | …

Appendix B: HF and LF models for the acoustic horn problem
To derive the HF model related to the test case of Section 5, we employ the Galerkin finite element method. We write the weak formulation of problem (15): given $\mu \in \mathcal{D}$, find $p(\mu) \in V$ s.t.

$$a(p(\mu), u; \mu) = f(u; \mu) \quad \forall u \in V, \qquad (16)$$

where $V = H^1(\Omega(\mu_g)) = \{ q \in L^2(\Omega(\mu_g)) : \partial q / \partial x_j \in L^2(\Omega(\mu_g)),\ j \in \{1, 2\} \}$, and the bilinear form $a(\cdot, \cdot; \mu) : V \times V \to \mathbb{C}$ and the linear form $f(\cdot; \mu) : V \to \mathbb{C}$ are defined, respectively, by

$$a(p, u; \mu) = \int_{\Omega(\mu_g)} \left( \nabla p \cdot \nabla \bar{u} - k^2 p \bar{u} \right) d\Omega + ik \int_{\Gamma_o \cup \Gamma_i} p \bar{u} \, d\Gamma + \frac{1}{2R} \int_{\Gamma_o} p \bar{u} \, d\Gamma, \qquad (17)$$

$$f(u; \mu) = 2ikA \int_{\Gamma_i} \bar{u} \, d\Gamma. \qquad (18)$$

Next, we introduce a conforming triangulation $\mathcal{T}_h = \{\Delta_k\}_{k=1}^{n_e}$ of the domain $\Omega$ and seek a HF approximation $p_h(\mu) \in V_h$ as a globally continuous, piecewise linear function belonging to a finite-dimensional space $V_h \subset V$. In our case, $V_h$ is spanned by a set of $n$ piecewise polynomial nodal basis functions $\{\varphi_i\}_{i=1}^{n}$ on $\mathcal{T}_h$. The Galerkin finite element approximation of (16) thus results in the following $n$-dimensional linear system:

$$\mathbf{A}(\mu) \mathbf{p}_h(\mu) = \mathbf{f}(\mu), \qquad (19)$$

where $A_{ij}(\mu) = a(\varphi_j, \varphi_i; \mu)$ and $f_i(\mu) = f(\varphi_i; \mu)$ for $1 \le i, j \le n$, and $\mathbf{p}_h$ is the vector of coefficients $\{p_i\}_{i=1}^{n}$ such that the projection of $p$ onto $V_h$ is $p_h(x) = \sum_{i=1}^{n} p_i \varphi_i(x)$. This formula establishes a one-to-one correspondence between the finite element functions $p_h(\mu) \in V_h$ and their discrete counterparts $\mathbf{p}_h(\mu) \in \mathbb{R}^{N_h}$.

To derive the LF model, we employ the reduced basis (RB) method [17, 33], which is briefly recalled here. The RB method is a projection-based reduced order modeling technique for the repeated solution of parametrized PDEs, which allows one to dramatically reduce the dimension of the discrete problems arising from numerical approximation. The strategy adopted in RB methods consists in projecting the HF problem onto a subspace made of specially selected basis functions, built from a set of HF solutions (snapshots) corresponding to suitably chosen parameter values, e.g., through proper orthogonal decomposition (POD). A (Petrov-)Galerkin projection onto the RB space is then employed to generate the ROM.

Starting from the FOM (19), i.e., find $\mathbf{p}_h(\mu)$ such that $\mathbf{A}(\mu)\mathbf{p}_h(\mu) = \mathbf{f}(\mu)$, the idea of a projection-based ROM is to approximate $\mathbf{p}_h(\mu) \approx \mathbf{V} \mathbf{p}_N(\mu)$ as a linear combination of basis functions, for a vector of unknown reduced degrees of freedom $\mathbf{p}_N(\mu)$ of reduced dimension $N \ll n$. The latter is sought by imposing that

$$\mathbf{W}^T \left( \mathbf{A}(\mu) \mathbf{V} \mathbf{p}_N(\mu) - \mathbf{f}(\mu) \right) = \mathbf{0},$$

a condition which enforces the orthogonality of the residual to the subspace spanned by a suitable test basis $\mathbf{W} \in \mathbb{R}^{n \times N}$. The ROM reads: find $\mathbf{p}_N(\mu) \in \mathbb{R}^N$ such that

$$\mathbf{A}_N(\mu) \mathbf{p}_N(\mu) = \mathbf{f}_N(\mu), \qquad (20)$$

where $\mathbf{A}_N(\mu) = \mathbf{W}^T \mathbf{A}(\mu) \mathbf{V}$ and $\mathbf{f}_N(\mu) = \mathbf{W}^T \mathbf{f}(\mu)$. A Galerkin projection results if $\mathbf{W} = \mathbf{V}$. As for the HF approximation, the RB approximation reflects a one-to-one correspondence between the function $p_N^h(\mu) \in V_h$ and its finite-dimensional counterpart $\mathbf{V}\mathbf{p}_N(\mu) \in \mathbb{R}^{N_h}$.

Although the dimension of the RB problem (20) is very small compared to that of the FOM (19), the assembly of the former still depends, in general, on the dimension $n$ of the HF system for any $\mu \in \mathcal{D}$. A convenient situation arises when the HF arrays in (19) can be written as

$$\mathbf{A}(\mu) = \sum_{q=1}^{Q_A} \Theta_q^a(\mu) \mathbf{A}_q, \qquad \mathbf{f}(\mu) = \sum_{q=1}^{Q_f} \Theta_q^f(\mu) \mathbf{f}_q.$$
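As a complement to the projection step just described, a minimal dense-algebra sketch of building and solving the ROM (20) could read as follows; sparsity, complex-arithmetic details, and the construction of the bases V and W are deliberately omitted.

```python
import numpy as np

def solve_rom(A, f, V, W=None):
    """Build and solve the ROM (20): A_N p_N = f_N, with A_N = W^T A V, f_N = W^T f.

    A : (n, n) HF matrix, f : (n,) HF right-hand side,
    V : (n, N) trial basis, W : (n, N) test basis (W = V gives a Galerkin projection).
    """
    W = V if W is None else W
    A_N = W.T @ A @ V                 # (N, N) reduced operator
    f_N = W.T @ f                     # (N,) reduced right-hand side
    p_N = np.linalg.solve(A_N, f_N)   # reduced degrees of freedom
    return V @ p_N                    # lifted approximation of p_h(mu)
```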
By virtue of this property, which we refer to as affine parametric dependence of $\mathbf{A}(\mu)$ and $\mathbf{f}(\mu)$, the assembly of the system (20) during the online stage can be made efficient, since the arrays $\mathbf{W}^T \mathbf{A}_q \mathbf{V}$, $q = 1, \ldots, Q_A$, and $\mathbf{W}^T \mathbf{f}_q$, $q = 1, \ldots, Q_f$, can be pre-computed and stored during a possibly expensive offline stage.

Since in the Helmholtz problem with $p > 1$ parameters $\mathbf{A}(\mu)$ and $\mathbf{f}(\mu)$ are nonaffine functions of $\mu$, we employ hyper-reduction through the discrete empirical interpolation method (DEIM) [9] and its matrix version (MDEIM) [5, 8] to compute approximate affine decompositions as [30]

$$\mathbf{f}(\mu) \approx \mathbf{f}_m(\mu) = \sum_{k=1}^{M_f} \theta_k^f(\mu) \mathbf{f}_k, \qquad \mathbf{A}(\mu) \approx \mathbf{A}_m(\mu) = \sum_{k=1}^{M_A} \theta_k^a(\mu) \mathbf{A}_k,$$

where $\mathbf{f}_k$, $k = 1, \ldots, M_f$, and $\mathbf{A}_k$, $k = 1, \ldots, M_A$, are precomputable vectors and matrices, respectively, independent of $\mu$. In this way, we can approximate the ROM arrays as

$$\mathbf{f}_N(\mu) \approx \mathbf{f}_N^m(\mu) = \sum_{k=1}^{M_f} \theta_k^f(\mu) \mathbf{f}_N^k, \qquad \mathbf{A}_N(\mu) \approx \mathbf{A}_N^m(\mu) = \sum_{k=1}^{M_A} \theta_k^a(\mu) \mathbf{A}_N^k,$$

where $\mathbf{f}_N^k = \mathbf{W}^T \mathbf{f}_k \in \mathbb{R}^N$, $k = 1, \ldots, M_f$, and $\mathbf{A}_N^k = \mathbf{W}^T \mathbf{A}_k \mathbf{V} \in \mathbb{R}^{N \times N}$, $k = 1, \ldots, M_A$. Taking advantage of hyper-reduction, we recover the following hyper-reduced order model: find $\mathbf{p}_N^m(\mu) \in \mathbb{R}^N$ s.t.

$$\mathbf{A}_N^m(\mu) \mathbf{p}_N^m(\mu) = \mathbf{f}_N^m(\mu). \qquad (21)$$

Due to its small dimension, the system (21) can be solved very quickly and inexpensively, allowing us to generate many instances of the output of interest, which can be used as LF training data. Finally, we denote by $p_N^m(\mu) \in V_h$ the finite element function corresponding to the vector $\mathbf{V}\mathbf{p}_N^m(\mu) \in \mathbb{R}^{N_h}$.
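Finally, the offline/online split enabled by the (M)DEIM approximation can be sketched as follows; theta_a and theta_f stand for the interpolation coefficients $\theta_k^a(\mu)$ and $\theta_k^f(\mu)$, whose computation via DEIM is omitted here.

```python
import numpy as np

def reduce_affine_terms(A_terms, f_terms, V, W):
    """Offline stage: project each mu-independent term once,
    A_N^k = W^T A_k V and f_N^k = W^T f_k, and store the small arrays."""
    return [W.T @ Ak @ V for Ak in A_terms], [W.T @ fk for fk in f_terms]

def solve_hyper_rom(theta_a, theta_f, AN_terms, fN_terms):
    """Online stage: assemble and solve the hyper-reduced model (21);
    the cost scales with N and the number of affine terms, not with n."""
    A_N = sum(t * Ak for t, Ak in zip(theta_a, AN_terms))
    f_N = sum(t * fk for t, fk in zip(theta_f, fN_terms))
    return np.linalg.solve(A_N, f_N)   # p_N^m(mu) in R^N
```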