Multi-fidelity regression using artificial neural networks: efficient approximation of parameter-dependent output quantities
Mengwu Guo a,∗, Andrea Manzoni b, Maurice Amendt c, Paolo Conti b, Jan S. Hesthaven c

a Department of Applied Mathematics, University of Twente
b MOX – Dipartimento di Matematica, Politecnico di Milano
c Institute of Mathematics, École Polytechnique Fédérale de Lausanne
Abstract
Highly accurate numerical or physical experiments are often very time-consuming or expensive to obtain. When time or budget restrictions prohibit the generation of additional data, the amount of available samples may be too limited to provide satisfactory model results. Multi-fidelity methods deal with such problems by incorporating information from other sources, which are ideally well-correlated with the high-fidelity data, but can be obtained at a lower cost. By leveraging correlations between different data sets, multi-fidelity methods often yield superior generalization when compared to models based solely on a small amount of high-fidelity data. In the current work, we present the use of artificial neural networks applied to multi-fidelity regression problems. By elaborating a few existing approaches, we propose new neural network architectures for multi-fidelity regression. The introduced models are compared against a traditional multi-fidelity regression scheme – co-kriging. A collection of artificial benchmarks are presented to measure the performance of the analyzed models. The results show that cross-validation in combination with Bayesian optimization consistently leads to neural network models that outperform the co-kriging scheme. Additionally, we show an application of multi-fidelity regression to an engineering problem. The propagation of a pressure wave into an acoustic horn with parametrized shape and frequency is considered, and the index of reflection intensity is approximated using the proposed multi-fidelity models. A finite element, full-order model and a reduced-order model built through the reduced basis method are adopted as the high- and low-fidelity models, respectively. It is shown that the multi-fidelity neural network returns outputs that achieve a comparable accuracy to those from the expensive, full-order model, using only very few full-order evaluations combined with a larger amount of inaccurate but cheap evaluations of a reduced order model.
Keywords:
Machine learning, artificial neural network, multi-fidelity regression, Gaussian process regression, reduced order modeling, parametrized PDE
1. Introduction
Artificial neural networks (ANNs) have arguably been one of the most active research topics in recent years. They have been successfully applied in a substantial number of research areas, including image recognition [21], translation [10], and fraud detection [15]. More recently, ANNs have also been widely used in the emerging area of machine learning in computational science and engineering, sometimes referred to as scientific machine learning [3]. The remarkable expressive power of neural networks (NNs) has made them stand out in the solution of forward and inverse problems governed by partial differential equations (PDEs) [36, 39, 41], reduced order modeling [13, 18, 23], data-driven discovery [37], multiscale analysis [38], and so on. This success can largely be explained by three major factors: computational power, flexibility, and access to large data sets.

∗ Corresponding author.
Email addresses: [email protected] (Mengwu Guo), [email protected] (Andrea Manzoni), [email protected] (Paolo Conti), [email protected] (Jan S. Hesthaven)
The great flexibility of NNs, as well as their multi-purpose nature, is a dominant factor explaining their overwhelming success, not only in research but also in real-world applications. Several application-dependent architectures have been developed, e.g., convolutional neural networks are often used in image-related tasks, whereas recurrent neural networks have found their success in speech recognition. Nevertheless, even for a fixed structure, a neural network is still able to adapt to different situations. This feature is mainly explained by the large number of parameters and hyperparameters that can be tuned to fit data in many situations.

The ability to handle large data sets without overwhelming computational costs has made NNs a good candidate for multi-fidelity (MF) regression. MF regression exploits correlations between different data sets to provide a regression model that generalizes better than a simple regression model that only takes a single data set into account. The most common setup for MF regression is a set of data sources with different fidelity levels. High-fidelity (HF) samples are usually rare due to their cost, be it computational or experimental. However, such data are accurate and represent the best available knowledge about the problem. In contrast, low-fidelity (LF) data are assumed to be easy and cheap to obtain, yet may lack accuracy. Often generated numerically, the LF data set ideally expresses the major trends of the problem and correlates well with the HF data. The goal of MF regression is to infer trends from LF data and use them to approximate the HF model, especially in regions where HF data are sparse.

A natural MF setting arose in the geostatistics community, which realized that "if core data at other locations are correlated, they should be included to improve the regression" [19]. In the context of soil porosity, precise measurements are combined with seismic data to better predict porosity over large areas. Another common MF situation arises when solving PDEs [20, 35]. High-order numerical schemes with a fine mesh give rise to accurate, HF data, whereas the use of a coarse mesh, a partially converged solution, or a linearized equation can lead to LF data. In the past decade, MF methods have found applications in many areas of scientific computing, including uncertainty quantification, inference, and optimization. We refer to [32] for a comprehensive review. A widely used MF technique is co-kriging [1, 31], which relies on vector-valued Gaussian processes for regression. Such a Gaussian process regression scheme presents two major benefits. Firstly, as a non-parametric regression tool, it is suitable for many different applications. Secondly, the method can be cast in the Bayesian framework, and the regression results naturally include an uncertainty estimate, which is usually desirable. Nevertheless, Gaussian process regression is not suitable for many applications as it suffers from several drawbacks, such as the curse of dimensionality. Consequently, new methods for MF regression are needed, ideally able to detect non-trivial, highly nonlinear correlations between data sets of different fidelity levels, and applicable to a general class of problems.

Based on these considerations, ANNs appear to be a promising candidate for solving MF problems. In the current work, we consider MF regression with ANNs.
Several approaches that successfully use NNs in a MF setting have been proposed in the literature so far. In [26], the authors tested a deep NN structure on different artificial MF benchmarks, extended the idea to the physics-informed NN scheme, and applied it to inverse problems governed by PDEs. Their MF model clearly outperforms a single-fidelity regression. A different NN architecture, incorporated into a Monte Carlo sampling algorithm to estimate uncertainties in a MF setting, was introduced in [27], and reduced computational cost was observed compared to traditional Monte Carlo sampling. In addition, different MF strategies for training NNs were discussed in [2], NNs were used to approximate the discrepancy between HF and LF physics-constrained NNs in [24], and deep neural networks (DNNs) were embedded into co-kriging in [34]. Moreover, a composition of Gaussian processes in a multi-layer network structure was used for MF modeling in [11], and a multi-fidelity Bayesian neural network scheme was developed in [25] and applied to physics-informed versions.

In this work, we propose different ANN architectures for the purpose of MF regression. Inspired by [26] and [27], we present two all-in-one models in which different fidelity levels are trained simultaneously, as well as two multilevel models that define separate NNs for the hierarchy of fidelity levels. In addition, we utilize a strategy based on cross-validation and Bayesian optimization to automatically select the best-performing NN hyperparameters. To the best of our knowledge, hyperparameter optimization (HPO) has not been investigated for MF neural networks. The goal of this work is to define a reliable strategy that consistently proposes a neural network which achieves high accuracy and shows good generalization properties. We will assess the performance of the proposed NNs on a set of manufactured benchmarks, all chosen carefully to test for the desirable properties. All models will be compared to single-fidelity regression schemes. Comparisons will also include co-kriging results to benchmark the proposed models against common practices in MF frameworks.

Additionally, we show an application of the proposed MF regression schemes to the evaluation of a quantity of interest defined as a functional of the solution to a parameter-dependent problem governed by PDEs. Such a task often occurs in the applied sciences and engineering, where multiple evaluations of PDE solutions, for different scenarios described in terms of physical or geometrical parameters, can be computationally demanding if relying on high-fidelity, full-order models (FOMs) such as detailed finite element approximations. To overcome this difficulty, low-fidelity, reduced-order models (ROMs) can be built through the reduced basis (RB) method. Despite featuring much lower-dimensional solution spaces than those of the FOMs, ROMs are able to capture the critical physical features of the FOMs. A reduced model seeks the solution on a low-dimensional manifold which is approximated by a linear trial subspace spanned by a set of global basis functions, built from a set of full-order snapshots. The ROM accuracy often comes at the price of a relatively large number of basis functions involved in the reduced-order approximation. On the other hand, an efficient assembly of the ROM during the online stage may only be possible provided that an expensive hyper-reduction is performed during the offline stage.
Therefore, a low dimensionality without expensive hyper-reduction can make a reduced model extremely efficient, but potentially inaccurate. Our goal, enabled by the MF neural network schemes in this work, is to provide accurate approximations of the output quantities by leveraging a relatively large number of output evaluations from very low-dimensional ROMs and a small number of evaluations of the FOM, so as to avoid the efficiency issues stemming from the ROM construction and evaluation without compromising the accuracy of the outcome. In particular, we apply the proposed approaches to a parametrized PDE problem, namely the propagation of a pressure wave into an acoustic horn with parametrized shape and frequency, described by the Helmholtz equation. We assess the impact of both the quality and the amount of the training data on the overall accuracy of the MF regression outcome.

Following this introduction, the concepts of ANNs and Gaussian process regression are briefly reviewed, and their underlying correlation is discussed in Section 2. Several NN structures for MF regression are introduced and discussed in Section 3, and their effectiveness is demonstrated by a series of benchmark test cases in Section 4. An application to a parametrized PDE problem is presented in Section 5, and conclusions are drawn in Section 6.
2. Artificial neural networks and Gaussian processes for regression
In this section, we consider an $L$-hidden-layer, fully-connected NN [40] with a hidden layer of width $M_l$ for the $l$-th layer and a nonlinear activation function $\phi$, $1 \leq l \leq L$. At the $j$-th neuron in the $l$-th layer of the NN, the pre- and post-activation values are denoted by $z_j^l$ and $x_j^l$, respectively, $1 \leq j \leq M_l$. Let $\mathbf{x} = \mathbf{x}^0 \in \mathbb{R}^{d_{\mathrm{in}}}$ denote the inputs of the network and $\mathbf{y} = \mathbf{z}^{L+1} \in \mathbb{R}^{d_{\mathrm{out}}}$ denote the outputs. Note that we have $M_0 = d_{\mathrm{in}}$ and $M_{L+1} = d_{\mathrm{out}}$. Weight and bias parameters between the $(l-1)$-th and $l$-th layers are represented by $W_{ij}^l$ and $b_i^l$, respectively, $1 \leq l \leq L+1$, $1 \leq i \leq M_l$, $1 \leq j \leq M_{l-1}$. Then one has

$$z_i^l(\mathbf{x}) = b_i^l + \sum_{j=1}^{M_{l-1}} W_{ij}^l \, x_j^{l-1}(\mathbf{x}), \qquad x_i^l(\mathbf{x}) = \phi\big(z_i^l(\mathbf{x})\big), \qquad 1 \leq i \leq M_l, \; 1 \leq l \leq L,$$

and

$$y_i(\mathbf{x}) = z_i^{L+1}(\mathbf{x}) = b_i^{L+1} + \sum_{j=1}^{M_L} W_{ij}^{L+1} \, x_j^L(\mathbf{x}), \qquad 1 \leq i \leq d_{\mathrm{out}}. \tag{1}$$

A multivariate function $\mathbf{y} = f(\mathbf{x})$ is approximated by a vector-valued network surrogate $f^{\mathrm{NN}}(\,\cdot\,; \mathbf{W}, \mathbf{b}) : \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}^{d_{\mathrm{out}}}$ to be trained on the input-output pairs $\{(\mathbf{x}^{(k)}, \mathbf{y}^{(k)})\}_{k=1}^N$. Here $\mathbf{W}$ and $\mathbf{b}$ are the vectors collecting all the weight and bias parameters, respectively, and $N$ is the number of data pairs. Such a training is often performed by minimizing a cost function:

$$(\mathbf{W}, \mathbf{b}) = \arg\min_{\mathbf{W}, \mathbf{b}} \left\{ \frac{1}{N} \sum_{k=1}^{N} \big\| \mathbf{y}^{(k)} - f^{\mathrm{NN}}(\mathbf{x}^{(k)}; \mathbf{W}, \mathbf{b}) \big\|_2^2 + \lambda \|\mathbf{W}\|_2^2 \right\}, \tag{2}$$

in which the first term is the mean square error (MSE) and the second term is a regularization term with $\lambda \geq 0$.

A Gaussian process (GP) is a collection of random variables, any finite number of which obeys a joint Gaussian distribution. In GP regression (GPR), the prior on the scalar-valued regression function $f : \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}$ is assumed to be a GP corrupted by an independent Gaussian noise term, i.e., for $(\mathbf{x}, \mathbf{x}') \in \mathbb{R}^{d_{\mathrm{in}}} \times \mathbb{R}^{d_{\mathrm{in}}}$,

$$f(\mathbf{x}) \sim \mathrm{GP}\big(0, \kappa(\mathbf{x}, \mathbf{x}')\big), \qquad y = f(\mathbf{x}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \chi^2), \tag{3}$$

where $\chi$ is the standard deviation of the Gaussian noise term $\epsilon$, and the semi-positive definite kernel function $\kappa$ gives the covariance of the prior GP.

Given $N$ pairs of input-output training data, a prior joint Gaussian is defined for the corresponding outputs:

$$\mathbf{y} \,|\, \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_y), \qquad \mathbf{K}_y = \mathrm{Cov}[\mathbf{y} \,|\, \mathbf{X}] = \kappa(\mathbf{X}, \mathbf{X}) + \chi^2 \mathbf{I}_N, \tag{4}$$

where $\mathbf{y} = \{y^{(1)}, y^{(2)}, \cdots, y^{(N)}\}^T$, $\mathbf{X} = [\mathbf{x}^{(1)} | \mathbf{x}^{(2)} | \cdots | \mathbf{x}^{(N)}]$, and $\mathbf{I}_N$ is the $N$-dimensional identity matrix, $N$ being the number of training samples.

From a regression model, the goal is to predict the noise-free output $f_*(\mathbf{s})$ for a new test input $\mathbf{s} \in \mathbb{R}^{d_{\mathrm{in}}}$. By the standard rules for conditioning Gaussians, the posterior predictive distribution conditioned on the training data is obtained as a new GP:

$$f_*(\mathbf{s}) \,|\, \mathbf{s}, \mathbf{X}, \mathbf{y} \sim \mathrm{GP}\big(m_*(\mathbf{s}), c_*(\mathbf{s}, \mathbf{s}')\big), \qquad m_*(\mathbf{s}) = \kappa(\mathbf{s}, \mathbf{X}) \mathbf{K}_y^{-1} \mathbf{y}, \qquad c_*(\mathbf{s}, \mathbf{s}') = \kappa(\mathbf{s}, \mathbf{s}') - \kappa(\mathbf{s}, \mathbf{X}) \mathbf{K}_y^{-1} \kappa(\mathbf{X}, \mathbf{s}'). \tag{5}$$
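To make these formulas concrete, the following minimal sketch implements the prior covariance (4) and the posterior predictions (5) in plain Python/NumPy. The squared-exponential kernel and its hyperparameters are illustrative assumptions; the paper does not prescribe a specific kernel here.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel kappa(A, B) between input sets of shape (n, d) and (m, d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gpr_predict(X, y, S, chi=1e-3):
    """Posterior mean and covariance of Eq. (5) at test inputs S, given training data (X, y)."""
    K_y = rbf_kernel(X, X) + chi**2 * np.eye(len(X))          # Eq. (4)
    mean = rbf_kernel(S, X) @ np.linalg.solve(K_y, y)         # m_*(s)
    cov = rbf_kernel(S, S) - rbf_kernel(S, X) @ np.linalg.solve(K_y, rbf_kernel(X, S))
    return mean, cov                                          # c_*(s, s')

# Usage: recover y = sin(2*pi*x) from 8 samples.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, (8, 1)), axis=0)
y = np.sin(2.0 * np.pi * X[:, 0])
mean, cov = gpr_predict(X, y, np.linspace(0.0, 1.0, 100)[:, None])
```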
Multi-fidelity GPR

GPR with training data from different fidelity levels is known as co-kriging [31] or vector-valued GPR [1]. In such a regression scheme, one can use a large amount of LF data and only a limited number of HF samples to train a model of reasonable accuracy. Since the LF evaluations are cheap, the cost of training data preparation can be reduced by controlling the number of HF evaluations. Assuming a linear correlation between the different fidelity levels, we can employ the linear model of coregionalization (LMC) [1] that expresses the prior of a hierarchy of $D$ solution fidelities as

$$f_i(\mathbf{x}) = \sum_{j=1}^{D} a_{i,j} \, u_j(\mathbf{x}), \qquad i = 1, 2, \cdots, D, \tag{6}$$

i.e., each level of solution $f_i$ is written as a linear combination of $D$ independent Gaussian processes $u_j \sim \mathrm{GP}(0, \kappa_j(\cdot,\cdot))$. In addition, the vector $\mathbf{a}_j$, $1 \leq j \leq D$, collects the weights of the corresponding GP component $u_j$, i.e., $\mathbf{a}_j = \{a_{1,j}, \cdots, a_{D,j}\}^T$. This formulation leads to a matrix-valued kernel for the MF GPR model as

$$\mathbf{K}(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{D} \mathbf{a}_j \mathbf{a}_j^T \, \kappa_j(\mathbf{x}, \mathbf{x}'). \tag{7}$$

In the two-level case, the well-known form of AR(1) co-kriging [31] is a special form of the linear model, defining the prior of a LF solution $f_L$ and a HF solution $f_H$:

$$f_H(\mathbf{x}) = \rho \, u_1(\mathbf{x}) + u_2(\mathbf{x}), \qquad f_L(\mathbf{x}) = u_1(\mathbf{x}), \tag{8}$$

in which $\mathbf{a}_1 = \{\rho, 1\}^T$ and $\mathbf{a}_2 = \{1, 0\}^T$. Conditioning on the training input-output pairs from both the LF and HF levels, denoted by $(\mathbf{X}_L, \mathbf{y}_L)$ and $(\mathbf{X}_H, \mathbf{y}_H)$, respectively, the predictive distribution for the HF can be expressed as a posterior GP, i.e., $f_H^*(\mathbf{s}) \,|\, \mathbf{s}, \mathbf{X}_H, \mathbf{y}_H, \mathbf{X}_L, \mathbf{y}_L \sim \mathrm{GP}$.
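As an illustration, the AR(1) structure (8) yields the joint prior covariance of the stacked data vector $[\mathbf{y}_H; \mathbf{y}_L]$ assembled in the sketch below, which is Eq. (7) with $\mathbf{a}_1 = \{\rho, 1\}^T$ and $\mathbf{a}_2 = \{1, 0\}^T$; the HF posterior then follows by applying the conditioning rules (5) to this joint Gaussian. The value of $\rho$ and the component kernels are placeholders that would in practice be fitted by maximizing the marginal likelihood.

```python
import numpy as np

def ar1_cokriging_cov(X_H, X_L, kappa1, kappa2, rho, chi=1e-6):
    """Joint prior covariance of [y_H; y_L] under Eq. (8):
    f_H = rho * u1 + u2 and f_L = u1, with u1 ~ GP(0, kappa1), u2 ~ GP(0, kappa2)."""
    K_HH = rho**2 * kappa1(X_H, X_H) + kappa2(X_H, X_H)   # Cov[f_H, f_H]
    K_HL = rho * kappa1(X_H, X_L)                         # Cov[f_H, f_L]
    K_LL = kappa1(X_L, X_L)                               # Cov[f_L, f_L]
    K = np.block([[K_HH, K_HL], [K_HL.T, K_LL]])
    return K + chi**2 * np.eye(len(X_H) + len(X_L))       # small jitter / noise term
```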
Neural networks as Gaussian processes

It can be shown that the prior of a neural network output can be seen as a set of Gaussian processes under the following probabilistic assumptions [22, 28] (the discussion here involves some modifications of the work in [22, 28]): (I) all the weight parameters $W_{ij}^l$ are independent and identically distributed (i.i.d.), as are all the bias parameters $b_i^l$, and the weight and bias parameter sets are independent of each other; (II) in the $l$-th layer, $1 \leq l \leq L+1$, $b_i^l \sim \mathcal{N}(0, \sigma_b^2)$, and the $W_{ij}^l$ are independently drawn from any distribution with zero mean and variance $\sigma_w^2 / M_{l-1}$; and (III) $M_l \to \infty$, $1 \leq l \leq L$. Here we show by induction that $\{z_i^l : 1 \leq i \leq M_l\}$ are i.i.d. zero-mean Gaussian processes and $\{x_j^l : 1 \leq j \leq M_l\}$ are i.i.d. for all $2 \leq l \leq L+1$.

We consider an arbitrary set of finite locations of the input $\mathbf{x}$, denoted by $\mathbf{X}$. Since both $\{b_i^1 : \forall i\}$ and $\{W_{ij}^1 : \forall i, j\}$ are i.i.d., one obtains that $\{z_i^1(\mathbf{X}) = b_i^1 + \sum_{j=1}^{d_{\mathrm{in}}} W_{ij}^1 x_j^0(\mathbf{X}) : \forall i\}$ are i.i.d. Thus $\{x_j^1(\mathbf{X}) = \phi(z_j^1(\mathbf{X})) : \forall j\}$ are i.i.d. Then it can be recovered that $\{z_i^2(\mathbf{X}) = b_i^2 + \sum_{j=1}^{M_1} W_{ij}^2 x_j^1(\mathbf{X}) : \forall i\}$ are i.i.d. For each $1 \leq i \leq M_2$, we apply the multivariate central limit theorem [12] to a sequence of i.i.d. random vectors $\{\sqrt{M_1}\, W_{i1}^2 x_1^1(\mathbf{X}), \sqrt{M_1}\, W_{i2}^2 x_2^1(\mathbf{X}), \cdots\}$, all with zero mean and covariance $\sigma_w^2 \mathrm{Cov}[x_\cdot^1(\mathbf{X})]$, and we obtain that $z_i^2(\mathbf{X}) \sim \mathcal{N}(\mathbf{0}, \sigma_b^2 \mathbf{I} + \sigma_w^2 \mathbb{E}[x_\cdot^1(\mathbf{X}) \otimes x_\cdot^1(\mathbf{X})])$ as $M_1 \to \infty$. Since the input locations $\mathbf{X}$ are arbitrary, each $z_i^2(\mathbf{x})$ is a Gaussian process, $z_i^2(\cdot) \sim \mathrm{GP}(0, K^2(\cdot,\cdot))$, with its kernel $K^2$ defined as $K^2(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(z_\cdot^1(\mathbf{x})) \phi(z_\cdot^1(\mathbf{x}'))]$. After the nonlinear activation in the 2nd layer, we have the i.i.d. $\{x_j^2 : \forall j\}$. Therefore the proposition holds true for $l = 2$.

Assume it holds true for $l$, $2 \leq l \leq L$. The proposition can be verified to be true for $l+1$, similarly to the case $l = 2$. The pre-activation in each intermediate layer follows a Gaussian process $z_\cdot^l \sim \mathrm{GP}(0, K^l(\cdot,\cdot))$, $2 \leq l \leq L+1$, and

$$K^l(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}_{z_\cdot^{l-1} \sim \mathrm{GP}(0, K^{l-1})}\big[\phi(z_\cdot^{l-1}(\mathbf{x})) \, \phi(z_\cdot^{l-1}(\mathbf{x}'))\big]. \tag{9}$$

The outputs $\mathbf{y}(\mathbf{x}) = \mathbf{z}^{L+1}(\mathbf{x})$ thus follow i.i.d. Gaussian process priors, and the kernel function can be formed from the recursive relation (9).
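This correspondence can be verified numerically. For a one-hidden-layer network with the distributions in (I)-(II), the sketch below compares a Monte Carlo estimate of the kernel $\sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(z^1(x))\phi(z^1(x'))]$ with the empirical covariance of a second-layer pre-activation over random wide networks; the widths, variances, and inputs are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_b, sigma_w = 0.5, 1.0
x, xp = 0.3, 0.8                 # two scalar inputs (d_in = 1, so M_0 = 1)
M1, n_nets = 2000, 2000          # hidden width, number of sampled networks

# Kernel K^2(x, x') = sigma_b^2 + sigma_w^2 * E[phi(z^1(x)) phi(z^1(x'))],
# where z^1(x) = b + w x with w ~ N(0, sigma_w^2) and b ~ N(0, sigma_b^2).
w = rng.normal(0.0, sigma_w, 10**6)
b = rng.normal(0.0, sigma_b, 10**6)
k_kernel = sigma_b**2 + sigma_w**2 * np.mean(np.tanh(w * x + b) * np.tanh(w * xp + b))

# Empirical E[z^2(x) z^2(x')] over random networks (zero mean, so this is the covariance).
prods = np.empty(n_nets)
for k in range(n_nets):
    W1 = rng.normal(0.0, sigma_w, M1);               b1 = rng.normal(0.0, sigma_b, M1)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(M1), M1); b2 = rng.normal(0.0, sigma_b)
    z2_x  = W2 @ np.tanh(W1 * x + b1) + b2
    z2_xp = W2 @ np.tanh(W1 * xp + b1) + b2
    prods[k] = z2_x * z2_xp
print(k_kernel, prods.mean())    # the two estimates should agree closely
```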
3. Artificial neural networks for multi-fidelity regression
Currently, there have been two successful approaches [26, 27] to the use of NNs in a multi-fidelity context. The neural networks for multi-fidelity regression (NNMFR) presented in these two studies show inherent differences at the architectural level of the networks. Whereas [26] uses a single NN to perform a simultaneous regression of both the HF and LF data, [27] splits them up and uses two distinct ANNs, one for each fidelity level. Note that in the present work, we restrict ourselves to bi-fidelity problems, i.e., we are interested in approximating the scalar HF function $f_{HF}(\mathbf{x})$ by incorporating information from the scalar LF function $f_{LF}(\mathbf{x})$. Consequently, there will generally be only a few available observations of the HF function, whereas the LF data are abundant. In that respect, it is useful to introduce the following notation:

- $N_{HF}$ and $N_{LF}$ denote the numbers of HF and LF samples, respectively.

- The training set for the HF function is given by $T_{HF} = \{(\mathbf{x}_{HF}^{(i)}, y_{HF}^{(i)}) : 1 \leq i \leq N_{HF}\}$, where $y_{HF}^{(i)} = f_{HF}(\mathbf{x}_{HF}^{(i)})$. By replacing the subscript HF in the above notations by LF, we obtain the LF counterparts.

- The NN approximations corresponding to the HF and LF functions are denoted by $f_{HF}^{NN}(\mathbf{x})$ and $f_{LF}^{NN}(\mathbf{x})$, respectively.

In the current section, we review existing NNMFR approaches and propose our own strategies by elaborating and adapting existing ideas. Different from the existing NNMFR strategies, our all-in-one models consider the LF outputs as latent variables of the HF surrogate or mimic the correlation between the HF and LF levels in co-kriging, while our multilevel models are formulated directly from the hierarchy of fidelity levels.

3.1. All-in-one NNMFR

As previously noted, a single NN, which performs two simultaneous regressions on the HF and LF data, is used in [26]. It essentially resembles the structure of an autoencoder, where the inputs are encoded towards the LF output $y_{LF}$, then decoded and encoded again for the HF output $y_{HF}$. The setup relies on the assumption that

$$f_{HF}(\mathbf{x}) = F(\mathbf{x}, y_{LF}) = \alpha F_l(\mathbf{x}, y_{LF}) + (1 - \alpha) F_{nl}(\mathbf{x}, y_{LF}), \qquad \alpha \in [0, 1], \tag{10}$$

i.e., the unknown mapping $F$ of the LF to the HF data can be decomposed into a linear and a nonlinear part, denoted by $F_l$ and $F_{nl}$, respectively. In (10), the hyperparameter $\alpha$ determines the strength of the linear correlation, with $\alpha = 1$ corresponding to a fully linear relation between the HF and LF outputs. The determination of the value of $\alpha$ was not specified in [26], nor are any values indicated in the reported results.

Based on these considerations, we propose the following two modified NN architectures, see (a) and (b) in Fig. 1:

1. "Intermediate" model: The first NN structure is similar to the one in [26], except that the same input layer is used for HF and LF data and we omit the autoencoder resemblance by adding additional nodes to the layer containing the LF output. Furthermore, we do not impose (10), as we seek to directly model the function $F(\mathbf{x}, y_{LF})$. We refer to this NN as the "Intermediate" model, referring to the location of the LF output. In the numerical experiments, the Intermediate network has 5 hidden layers; the LF output is situated in the 3rd hidden layer and the HF output in the last layer. The number of neurons in each of the first 3 layers is fixed to 64, and the widths of the layers between the LF and HF outputs are determined by the hyperparameter optimization.
Except for the output layer, we use the hyperbolic tangent activation function.

2. "GPmimic" model: The second NNMFR is based on the similarity between a wide NN and a GP. Based on this correspondence, and since Gaussian processes are commonly used for MF regression, we implement an architecture which seeks to mimic the action of a GP. Hereafter we refer to this architecture as "GPmimic". In contrast to NNs, the use of GPs becomes infeasible in higher dimensions. Hence, the GPmimic NN indirectly enables GP-like regression even in the case of large data sets and high-dimensional inputs. The NN architecture is depicted in Fig. 1. The neurons labelled by $u_1$ and $u_2$ are defined to be analogous to two independent Gaussian processes, especially when the previous layers are wide, and they play the same roles as in the vector-valued GPR (6). By considering no nonlinear activation in the output layer, the NN outputs of the GPmimic model are given as

$$y_{HF}(\mathbf{x}) = W_{11} u_1(\mathbf{x}) + W_{12} u_2(\mathbf{x}) + b_1, \qquad y_{LF}(\mathbf{x}) = W_{21} u_1(\mathbf{x}) + W_{22} u_2(\mathbf{x}) + b_2. \tag{11}$$

By inspecting (11), we notice that the outputs are recovered as an affine transformation of the two variables $u_1(\mathbf{x})$ and $u_2(\mathbf{x})$, and the parameters $W_{ij}$'s define the correlations between $y_{HF}$ and $y_{LF}$ as in the vector-valued GPR. In principle, this setting only allows one to model linear correlations between the HF and LF data.

Figure 1: Different ANNs proposed for MF regression. All-in-one models in the top row: (a) Intermediate, (b) GPmimic; multilevel models in the bottom row: (c) 2-step, (d) 3-step. The Intermediate model in (a) considers the LF outputs as intermediate latent variables of the surrogate for HF. The GPmimic model in (b) employs the LMC to mimic MF GPR. The 2-step and 3-step models in (c) and (d) define separate NNs for the hierarchy of fidelity levels.

Remember that the all-in-one architectures perform vector-valued learning of $\mathbf{f} = [f_{HF}(\mathbf{x}), f_{LF}(\mathbf{x})]^T$. There are thus two different error components, one for each fidelity level. More specifically, these components are given by the training errors $\mathrm{MSE}_{HF}$ and $\mathrm{MSE}_{LF}$ of the HF and LF models, defined as

$$\mathrm{MSE}_{HF} = \frac{1}{N_{HF}} \sum_{i=1}^{N_{HF}} \big| y_{HF}^{(i)} - f_{HF}^{NN}(\mathbf{x}_{HF}^{(i)}) \big|^2, \quad \text{and} \quad \mathrm{MSE}_{LF} = \frac{1}{N_{LF}} \sum_{i=1}^{N_{LF}} \big| y_{LF}^{(i)} - f_{LF}^{NN}(\mathbf{x}_{LF}^{(i)}) \big|^2. \tag{12}$$

Hence, training an all-in-one network involves a multiobjective minimization problem, which is generally not easy to solve. In [26], the problem is reduced to a single optimization problem by considering a loss function given by the addition of the MSEs related to the two fidelity sets and an $L^2$-regularization term. A more general approach in multiobjective optimization is to use a weighted sum instead, to account for objectives of different scales. Even though the data sets can be initially scaled to ensure the same order of magnitude, large discrepancies between $\mathrm{MSE}_{HF}$ and $\mathrm{MSE}_{LF}$ could develop during the learning process. Based on these considerations, the loss function in the Intermediate and GPmimic models is given as

$$\mathcal{L} = \alpha \, \mathrm{MSE}_{HF} + (1 - \alpha) \, \mathrm{MSE}_{LF} + \lambda \|\mathbf{W}\|_2^2, \tag{13}$$
where $\alpha \in [0, 1]$ (unrelated to the one in (10)) acts as a scaling factor between the two fidelity levels and $\lambda > 0$ weights the regularization term. Note that $\alpha = 0$ and $\alpha = 1$ correspond to the single-fidelity regressions for the LF and HF data, respectively, while the choice $\alpha = 0.5$ weights the two fidelity levels equally. Here, $\alpha$ will be tuned by hyperparameter optimization in the numerical examples: the $\alpha$ values minimize the testing errors of $k$-fold cross-validation and are determined by Bayesian optimization.

3.2. Multilevel NNMFR

The NNMFR introduced in Subsection 3.1 relies on a single NN to learn the multidimensional function $\mathbf{f} = [f_{HF}(\mathbf{x}), f_{LF}(\mathbf{x})]^T$. Another approach, leading to the multilevel NNMFR, is to use distinct NNs to model these functions. Such an approach has been employed in [27], and the presented method can be summarized as follows. A first NN $\mathcal{NN}_1$ learns a correlation function $y_{HF} = F(\mathbf{x}, y_{LF})$ based on the input data $\{(\mathbf{x}_{HF}^{(i)}, y_{LF}(\mathbf{x}_{HF}^{(i)})) : 1 \leq i \leq N_{HF}\}$ and the output data $\{y_{HF}^{(i)} : 1 \leq i \leq N_{HF}\}$. In this case, it is required that the HF input locations form a subset of the LF input locations, i.e., $\mathbf{X}_{HF} = \{\mathbf{x}_{HF}^{(i)} : 1 \leq i \leq N_{HF}\} \subseteq \mathbf{X}_{LF} = \{\mathbf{x}_{LF}^{(i)} : 1 \leq i \leq N_{LF}\}$. For each available LF sample $\mathbf{x}' \in \mathbf{X}_{LF} \setminus \mathbf{X}_{HF}$, an approximate HF data sample $y'_{HF}(\mathbf{x}')$ can be generated by the network $\mathcal{NN}_1$, and these generated data, together with the existing HF data at $\mathbf{X}_{HF}$, can be used as the new HF data set $\mathbf{Y}'_{HF} = \{y_{HF}(\mathbf{X}_{HF}), y'_{HF}(\mathbf{X}_{LF} \setminus \mathbf{X}_{HF})\}$. Finally, a second NN $\mathcal{NN}_2$ is trained to model the HF function $f_{HF}(\mathbf{x})$ based on the input-output pairs between $\mathbf{X}_{LF}$ and $\mathbf{Y}'_{HF}$.

Following this multilevel approach in [27], we propose two modified multi-step NNs for MF regression by adopting the following major changes:

1. The LF function $f_{LF}(\mathbf{x})$ is modeled by a first NN $\mathcal{NN}_{LF}$. This modification is relevant if the computational cost of additional LF data is not negligible, for instance when having to solve PDEs.

2. Now that we can rapidly generate new LF data using $\mathcal{NN}_{LF}$, modeling the direct mapping between $\mathbf{x}$ and $y_{HF} = f_{HF}(\mathbf{x})$ is unnecessary. Hence our second network $\mathcal{NN}_{HF}$ will be analogous to $\mathcal{NN}_1$ in [27].

These considerations lead to the following multilevel architectures, also see (c) and (d) in Fig. 1 and the sketch at the end of this section:

1. "2-step" model: A DNN $\mathcal{NN}_{LF}$ is trained on $T_{LF}$ to learn the LF function $f_{LF}(\mathbf{x})$. Using the NN $\mathcal{NN}_{LF}$, we can predict the values of the LF function at the training inputs $\mathbf{X}_{HF}$ of the HF data, denoted by $f_{LF}^{NN}(\mathbf{X}_{HF}) = \{f_{LF}^{NN}(\mathbf{x}_{HF}^{(i)}) : 1 \leq i \leq N_{HF}\}$. Then a second artificial network $\mathcal{NN}_{HF}$ approximates the HF function $y_{HF} = F(\mathbf{x}, y_{LF})$ based on the input data $(\mathbf{X}_{HF}, f_{LF}^{NN}(\mathbf{X}_{HF})) = \{(\mathbf{x}_{HF}^{(i)}, f_{LF}^{NN}(\mathbf{x}_{HF}^{(i)})) : 1 \leq i \leq N_{HF}\}$ and the available HF output data $\mathbf{Y}_{HF} = y_{HF}(\mathbf{X}_{HF}) = \{y_{HF}^{(i)} : 1 \leq i \leq N_{HF}\}$. $\mathcal{NN}_{HF}$ is a shallow NN consisting of a single hidden layer. In other words, the network $\mathcal{NN}_{HF}$ approximates the function $y_{HF} = F(\mathbf{x}, y_{LF})$ based on the HF and LF data at the same locations $\mathbf{X}_{HF}$. However, as the HF and LF data locations, i.e., $\mathbf{X}_{HF}$ and $\mathbf{X}_{LF}$, are generated independently, the observations of the LF function at the HF inputs $\mathbf{X}_{HF}$ have to be evaluated from $\mathcal{NN}_{LF}$ if not directly available at $\mathbf{X}_{HF}$.

2. "3-step" model: This model is a modification of the 2-step model by adding an additional level of fidelity, generated by a third network $\mathcal{NN}_{lin}$. $\mathcal{NN}_{lin}$ is equivalent to
the $\mathcal{NN}_{HF}$ of the 2-step model in the sense that it models the correlation function $y_{HF} = F(\mathbf{x}, y_{LF})$. However, no nonlinear activation function is used in $\mathcal{NN}_{lin}$. In other words,
$\mathcal{NN}_{lin}$ is responsible for capturing the linear correlations between the HF and LF data sets. Let $f_{lin}^{NN}(\mathbf{X}_{HF}, f_{LF}^{NN}(\mathbf{X}_{HF}))$ denote the set of outputs of $\mathcal{NN}_{lin}$ at the HF input locations $\mathbf{X}_{HF}$. This set will then serve as part of the inputs for the third and final network $\mathcal{NN}_{HF}$. $\mathcal{NN}_{HF}$ approximates the correlation function $y_{HF} = F'(\mathbf{x}, y_{LF}, y_{lin})$, $y_{lin}$ being the output of $\mathcal{NN}_{lin}$ as a linear approximation of $y_{HF}$, and the training of $\mathcal{NN}_{HF}$ is based on the input-output pairs between $(\mathbf{X}_{HF}, f_{LF}^{NN}(\mathbf{X}_{HF}), f_{lin}^{NN}(\mathbf{X}_{HF}, f_{LF}^{NN}(\mathbf{X}_{HF})))$ and $\mathbf{Y}_{HF}$. $\mathcal{NN}_{HF}$ is again a shallow NN consisting of a single hidden layer. Note that the 3-step model should only present an advantage over the 2-step model when there exist strong linear correlations between the HF and LF data.

It should be clear that the proposed architectures present a few differences. The key difference between the all-in-one and the multilevel strategies is at the structural level. All-in-one NNMFR aims to approximate the two-dimensional function $\mathbf{f} = [f_{HF}(\mathbf{x}), f_{LF}(\mathbf{x})]^T$ with a single NN, while multilevel NNMFR uses distinct networks to model the HF and LF functions separately. However, to incorporate information from the LF function, this is done sequentially by first modeling $f_{LF}(\mathbf{x})$ and then the correlation between the two fidelity levels.

We note that the all-in-one models include the additional hyperparameter $\alpha$. This hyperparameter can potentially be used to incorporate prior knowledge into the regression model. Suppose for instance that we know that the LF data provide a good representation of the problem. We can then enforce the NN to accurately model the LF data by setting the value of $\alpha$ accordingly. Furthermore, this can be taken into account in the Intermediate model by choosing the number of neurons in the layer containing the LF output, where a smaller layer width may indicate a stronger dependency on the LF data. In the multilevel architectures, the question of how much trust is put in the LF data is left to the model and determined during the training process.

An advantage of the multilevel approaches is that the LF function does not have to be approximated by a neural network. Instead, we can use other techniques which might be more accurate or less time-consuming. For instance, in the presence of discontinuities, an accurate approximation can be obtained by utilizing gradient boosting algorithms [14].

Finally, in the non-hierarchical cases where the available information sources present similar levels of fidelity, the GPmimic model is the only NNMFR that can be used without any adaptation. In fact, all other proposed structures rely on a hierarchy of the available data sets according to their fidelity levels.
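The sketch below illustrates the 2-step model. It uses scikit-learn multilayer perceptrons purely for concreteness: the paper does not tie the models to a specific library, and the layer sizes here are arbitrary rather than the HPO-selected values (scikit-learn's MLPRegressor also applies a small $L^2$ penalty by default, in the spirit of (2)).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_two_step(X_LF, y_LF, X_HF, y_HF):
    """2-step NNMFR: a deep NN for f_LF, then a shallow NN for y_HF = F(x, y_LF)."""
    # Step 1: train NN_LF on the abundant LF data.
    nn_lf = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation='tanh',
                         max_iter=10000, random_state=0).fit(X_LF, y_LF)
    # Step 2: train NN_HF on (x, f_LF^NN(x)) -> y_HF at the scarce HF locations;
    # the LF values at X_HF are supplied by the first network.
    X_aug = np.column_stack([X_HF, nn_lf.predict(X_HF)])
    nn_hf = MLPRegressor(hidden_layer_sizes=(16,), activation='tanh',
                         max_iter=10000, random_state=0).fit(X_aug, y_HF)
    def predict(X):
        return nn_hf.predict(np.column_stack([X, nn_lf.predict(X)]))
    return predict
```

For one-dimensional problems, X_LF and X_HF are (n, 1) arrays; a 3-step variant would insert, between the two steps, a linear network whose output is appended as a third input feature of the final network.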
4. Numerical results (I): benchmark test cases
In this section, we analyze the performance of the proposed ANN structures in MF regression problems. For that purpose, the models presented in Section 3 are tested on a variety of artificial benchmarks and compared against commonly used co-kriging methods.

In all benchmarks, the hyperparameters of the neural networks, for instance the parameter $\alpha$ that balances the MSE terms, are optimized by $k$-fold cross-validation and Bayesian optimization, whereas the hyperparameters of the GPs are obtained by maximizing the marginal likelihood [42]. For the Bayesian optimization, the Python package Hyperopt [6] is used for a tree-structured Parzen estimator (TPE) based approach [7], in which the priors on hyperparameters can be chosen from a range of distributions for both continuous and discrete random variables. Both the single-fidelity and multi-fidelity GPRs are implemented with the Python package GPy [16]. Furthermore, GPy automatically adds a noise term for each fidelity level to account for noisy data. As the artificial benchmarks give rise to noiseless observations, the standard deviation values of all the noise terms are fixed to a negligibly small value in these benchmarks.

The numerical study in this section compares the performance of the proposed NNs against co-kriging. In each case, the qualitative difference between single- and multi-fidelity regressions is noted briefly before presenting a comparative study between NNMFR and multi-fidelity GPR. Models are evaluated using the MSE on a test set covering the whole input domain. The results will reveal whether choosing a model based on the validation error is a good strategy. The comparison between NN regression and the best GPR will not only include accuracy results but also computational time. In addition, we evaluate the following $R^2$ score that allows for a cross-benchmark comparison:

$$R^2 = 1 - \frac{\sum_{i=1}^{N_{test}} \big( y_{HF}^{(i)} - f_{reg}^{(i)} \big)^2}{\sum_{i=1}^{N_{test}} \big( y_{HF}^{(i)} - \bar{y}_{HF} \big)^2}, \qquad \text{where} \quad \bar{y}_{HF} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} y_{HF}^{(i)}, \tag{14}$$

$N_{test}$ is the size of the test set, and $f_{reg}$ denotes the prediction given by a regression model.
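The cross-validation and Bayesian-optimization procedure described above can be sketched as follows with Hyperopt's TPE algorithm. The search space (the loss weight $\alpha$, a layer width, and the regularization weight $\lambda$) is a simplified stand-in for the full hyperparameter sets reported in Appendix A, and `build_and_train` stands for any of the model constructors of Section 3.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import KFold

def cv_error(params, X_HF, y_HF, build_and_train):
    """Mean k-fold validation MSE on the (scarce) HF data for one hyperparameter set."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_HF):
        model = build_and_train(params, X_HF[train_idx], y_HF[train_idx])
        errors.append(np.mean((model(X_HF[val_idx]) - y_HF[val_idx]) ** 2))
    return float(np.mean(errors))

space = {                                             # illustrative search space
    'alpha':  hp.uniform('alpha', 0.0, 1.0),          # loss weight in Eq. (13)
    'width':  hp.quniform('width', 8, 128, 8),        # width of a tunable layer
    'lambda': hp.loguniform('lambda', np.log(1e-8), np.log(1e-2)),
}
# best = fmin(fn=lambda p: cv_error(p, X_HF, y_HF, build_and_train),
#             space=space, algo=tpe.suggest, max_evals=100, trials=Trials())
```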
4.1. Benchmark case 1: Linear correlation

The first benchmark is a common test case for MF methods. HF and LF functions are defined over $\Omega = [0, 1]$ as

$$f_{HF}(x) = (6x - 2)^2 \sin(12x - 4), \qquad f_{LF}(x) = 0.5 f_{HF}(x) + 10(x - 0.5) + 5,$$

respectively. The LF function is obtained by a linear transformation of the HF function and the input variable $x$. The setup serves as a good initial test for MF models. HF and LF samples are given by 5 and 32 equally spaced values in $\Omega$, respectively. Since the number of HF data is very limited, the hyperparameters of the NNs are optimized using leave-one-out cross-validation (LOOCV). The values can be found in Table A.1. Fig. 2 shows the results of both the single-fidelity regression (SFR) and the MF regression (MFR). Both the NN and the GPR fail to approximate the function based solely on the HF data; in particular, both models fail to accurately predict the function over part of the domain. $\mathcal{NN}_{HF}$ of the 3-step model is redundant in this case, as there is no nonlinear trend to catch. Consequently, in this first and simple benchmark there is nothing to be gained by using ANNs, as they all perform worse than classic GPR and are also computationally more expensive.
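Under the definitions above, the benchmark data and the $R^2$ index of Eq. (14) can be produced with a few lines of Python; this is only a convenience sketch of the experimental setup.

```python
import numpy as np

def f_hf(x):
    """High-fidelity function."""
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def f_lf(x):
    """Low-fidelity function: a linear transformation of f_hf and x."""
    return 0.5 * f_hf(x) + 10.0 * (x - 0.5) + 5.0

def r2_score(y_true, y_pred):
    """Goodness-of-fit index R^2 of Eq. (14)."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

x_hf = np.linspace(0.0, 1.0, 5)     # 5 equally spaced HF samples
x_lf = np.linspace(0.0, 1.0, 32)    # 32 equally spaced LF samples
y_hf, y_lf = f_hf(x_hf), f_lf(x_lf)
```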
Table 1: Linear correlation: comparison of the MF regression models. Indicated times account for both HPO and final model predictions. [Columns: Model, Validation MSE, Test MSE, R², Elapsed time (s); rows: Co-kriging, Intermediate, GPmimic, 2-step, 3-step.]

4.2. Benchmark case 2: Discontinuous functions

This second benchmark is designed to analyze how well the proposed NN models can approximate discontinuities in a MF setting. HF and LF data are generated from the following functions:

$$f_{LF}(x) = \begin{cases} 0.5\,(6x-2)^2 \sin(12x-4) + 10(x-0.5) - 5, & 0 \leq x < 0.5, \\ 3 + 0.5\,(6x-2)^2 \sin(12x-4) + 10(x-0.5) - 5, & 0.5 < x \leq 1, \end{cases}$$

$$f_{HF}(x) = \begin{cases} 2 f_{LF}(x) - 20x + 20, & 0 \leq x < 0.5, \\ 4 + 2 f_{LF}(x) - 20x + 20, & 0.5 < x \leq 1. \end{cases}$$

Figure 2: Linear correlation: single- and MF regression results using GPs and different ANNs. There are 5 HF (circle) and 32 LF (cross) observations. The single-fidelity model based on the 5 HF data points is shown in green (dashed). Panels: (a) SFR with GP, MFR with co-kriging; (b) SFR with NN, MFR with 3-step; (c) SFR with NN, MFR with 2-step; (d) SFR with NN, MFR with GPmimic; (e) SFR with NN, MFR with Intermediate.
8 and 32 equally spaced points in $\Omega = [0, 1]$ are used as the HF and LF inputs, respectively. In addition, 10 equally spaced points in the interval $[0.45, 0.55]$ are added to the LF data set to allow for a better approximation of the discontinuity. In contrast to the first test case, the correlation between the two functions is only piecewise linear, and the piecewise definition results in discontinuities of distinct amplitudes, see Fig. 3(a).

Fig. 3 shows that the regression models based merely on the HF data are not able to approximate the discontinuity at $x = 0.5$, but the regression results are significantly improved by taking the LF data into account. With $R^2$ scores over 0.99, both the 2-step and 3-step multilevel NN models show a good match with the exact solution. However, the discontinuity is considerably smoothened when the other NN models are used, especially when employing the GPmimic model. While the co-kriging presents a discontinuity at $x = 0.5$, fluctuations are present to the right of it.

Table 2: Discontinuous function: comparison of the MF regression models. Indicated times account for HPO and final model predictions.
[Columns: Model, Validation MSE, Test MSE, R², Elapsed time (s); rows: Co-kriging, Intermediate, GPmimic, 2-step, 3-step.]

4.3. Benchmark case 3: Nonlinear correlation

The third one-dimensional benchmark tests the proposed NN models on a nonlinear correlation between the HF and LF levels. The data samples are obtained from the following functions:

$$f_{LF}(x) = \sin(8\pi x), \qquad f_{HF}(x) = \big(x - \sqrt{2}\,\big)\, f_{LF}^2(x).$$

We use 15 equally spaced HF data points and 42 LF data points over the interval $\Omega = [0, 1]$.

Table 3: Nonlinear correlation: comparison of the MF regression models. Indicated times account for HPO and final model predictions.
[Columns: Model, Validation MSE, Test MSE, R², Elapsed time (s); rows: Co-kriging, Intermediate, GPmimic, 2-step, 3-step.]

Both co-kriging and the GPmimic NN fail in this benchmark; with an $R^2$ score of 0.128, the GPmimic NN hardly performs better than the average function value. However, it is not surprising that both models fail in the current benchmark, as their mathematical structure is unable to detect nonlinear correlations between the two fidelity levels. Fig. 4 underlines that the difficulty of this test case lies in approximating the different local extrema of the HF function, since data points are not
Figure 3: Discontinuous function: single- and MF regression results using GPs and different ANNs. There are 8 HF (circle) and 42 LF (cross) observations. Panels: (a) HF and LF functions; (b) SFR with GP, MFR with co-kriging; (c) SFR with NN, MFR with 2-step; (d) MFR using different NNs (Intermediate, GPmimic, 3-step).

always available close to the local extrema. Nevertheless, both the multilevel NN models recover an $R^2$ score exceeding 0.99.

4.4. Benchmark case 4: A 20-dimensional function

It is well known that GPR suffers from the curse of dimensionality and cannot be effectively used when the number of data points is very large. For this high-dimensional benchmark, the HF and LF data are generated from
$$f_{HF}(\mathbf{x}) = (x_1 - 1)^2 + \sum_{i=2}^{20} \big(2x_i^2 - x_{i-1}\big)^2, \qquad f_{LF}(\mathbf{x}) = 0.8\, f_{HF}(\mathbf{x}) - \sum_{i=2}^{20} 0.4\, x_{i-1} x_i - 50,$$

with $\mathbf{x} = \{x_1, x_2, \cdots, x_{20}\} \in \Omega = [-3, 3]^{20}$. HF data are sampled from $f_{HF}$ at 5000 locations randomly chosen from a uniform distribution over $\Omega$, while the LF samples are evaluated at 30000 random input locations.
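Assuming the reconstructed definitions above, the training sets for this benchmark can be generated as follows; the random seed is arbitrary.

```python
import numpy as np

def f_hf_20d(X):
    """High-fidelity function; X has shape (n_samples, 20)."""
    return (X[:, 0] - 1.0) ** 2 + np.sum((2.0 * X[:, 1:] ** 2 - X[:, :-1]) ** 2, axis=1)

def f_lf_20d(X):
    """Low-fidelity companion of f_hf_20d."""
    return 0.8 * f_hf_20d(X) - np.sum(0.4 * X[:, :-1] * X[:, 1:], axis=1) - 50.0

rng = np.random.default_rng(0)
X_hf = rng.uniform(-3.0, 3.0, (5000, 20))     # 5000 random HF locations
X_lf = rng.uniform(-3.0, 3.0, (30000, 20))    # 30000 random LF locations
y_hf, y_lf = f_hf_20d(X_hf), f_lf_20d(X_lf)
```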
Figure 4: Nonlinear correlation: single- and MF regression results using GPs and different ANNs. There are 15 HF (circle) and 42 LF (cross) observations. Panels: (a) HF and LF functions, with a different range for the vertical axis than (b), (c) and (d); (b) SFR with GP, MFR with co-kriging; (c) SFR with NN, MFR with 3-step; (d) MFR using different NNs (GPmimic, Intermediate, 2-step).

In this case, co-kriging would require one to compute the inverse of a 35000 × 35000 covariance matrix, which is computationally prohibitive, so only the NN models are considered. Fig. 5 compares the exact values of $y_{HF}$ with the single- and multi-fidelity NN predictions at one million random input locations.
5. Numerical results (II): application to parametrized PDEs
In this section we apply our MF framework to a problem arising in acoustics, namely the propagation of a pressure wave $P(\mathbf{x}, t)$ into an acoustic horn with parametrized shape, addressed in [30]. In particular, we are interested in evaluating an input-output map involving the solution of a PDE, exploiting the accuracy of the finite element method to obtain HF solutions, while relying on a reduced basis (RB) method to compute LF, but fast and inexpensive, approximations. Different from the previous section, we consider different sources of data for each fidelity level.

We consider an acoustic device, illustrated in Fig. 6 (left), comprising a waveguide, with infinite extension to the left and a conical extremity on the right, namely the horn, and an internal propagating planar wave inside the waveguide: once the wave reaches the horn, a portion of its energy is converted into an outer-going wave. Under the assumptions of single-frequency and time-harmonic waves, the acoustic pressure can be expressed as $P(\mathbf{x}, t) = \mathcal{R}\big(p(\mathbf{x}) e^{i\omega t}\big)$, where the complex amplitude $p(\mathbf{x})$ satisfies the monochromatic, steady-state Helmholtz equation with mixed Neumann-Robin boundary conditions:

$$\begin{cases} \Delta p + k^2 p = 0 & \text{in } \Omega, \\ \left( ik + \dfrac{1}{2R} \right) p + \nabla p \cdot \mathbf{n} = 0 & \text{on } \Gamma_o, \\ ikp + \nabla p \cdot \mathbf{n} = 2ikA & \text{on } \Gamma_i, \\ \nabla p \cdot \mathbf{n} = 0 & \text{on } \Gamma_h \cup \Gamma_s = \Gamma_n, \end{cases} \tag{15}$$

where $k = \omega/c$ is the wave number, $\omega = 2\pi f$ the angular frequency, and $c = 340\ \mathrm{m\,s^{-1}}$ the speed of sound; $\mathbf{n}$ denotes the outward-directed unit normal on the boundary of $\Omega$. We restrict the computation to the domain $\Omega$ shown in Fig. 6, and impose on $\Gamma_i$ a propagating wave with amplitude $A = 1$ while absorbing the outer-going planar waves, an absorbing condition on the far-field boundary $\Gamma_o$, and homogeneous Neumann boundary conditions on the sound-hard walls of the device $\Gamma_h$ as well as on the symmetry boundary $\Gamma_s$. We take, for simplicity, the radius equal to $R = 1$ [4].

In addition to the frequency $f$ of the incoming wave, we parametrize, as in [30], the shape of the horn by means of radial basis functions, introducing as parameters $\boldsymbol{\mu}_g = (\mu_{g,1}, \ldots, \mu_{g,4})$ the vertical displacements of four control points located on the horn wall $\Gamma_h$, shown in Fig. 6, right. The admissible domain configurations are defined as the diffeomorphic images $\Omega(\boldsymbol{\mu}_g)$ of the reference shape $\Omega$ through a deformation mapping $T(\cdot; \boldsymbol{\mu}_g)$ obtained as linear combinations of the control point displacements. Hence, the acoustic problem depends on five parameters, i.e., we denote by $\boldsymbol{\mu} = (f, \boldsymbol{\mu}_g)$ the parameter vector and by $\mathcal{D} \subset \mathbb{R}^p$ the parameter space.

We focus our analysis on a specific output of interest, namely the index of reflection intensity (IRI) [4], which measures the transmission efficiency of the acoustic horn and is defined as the absolute value of the average reflected wave at the sound inlet $\Gamma_i$, i.e.,

$$f(\boldsymbol{\mu}) = J(p(\boldsymbol{\mu})) = \left| \frac{1}{|\Gamma_i|} \int_{\Gamma_i} p(\boldsymbol{\mu})\, d\Gamma - 1 \right|.$$

For simplicity, the device extends infinitely in the direction normal to the plane and its wall consists of sound-hard material. Therefore, for the frequencies in the range under consideration, we assume that all non-planar modes in the waveguide are negligible, which allows us to reduce the problem to two space dimensions.
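Given the complex-valued pressure trace on $\Gamma_i$ produced by either the FOM or the ROM, the IRI reduces to a one-dimensional boundary integral. The sketch below evaluates it with trapezoidal quadrature on the inlet nodes; the quadrature rule and the node/arclength representation are illustrative assumptions about the discrete data, not the quadrature used in the paper.

```python
import numpy as np

def index_of_reflection_intensity(p_inlet, s):
    """IRI = | (1/|Gamma_i|) * integral_{Gamma_i} p dGamma - 1 |.
    p_inlet: complex pressure values at the inlet boundary nodes (FOM or ROM trace);
    s: corresponding arclength coordinates along Gamma_i (monotonically increasing)."""
    length = s[-1] - s[0]                                               # |Gamma_i|
    integral = np.sum(0.5 * (p_inlet[1:] + p_inlet[:-1]) * np.diff(s))  # trapezoid rule
    return abs(integral / length - 1.0)   # subtract the unit incoming amplitude A = 1
```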
Figure 6: Acoustic horn problem. Left: the computational domain Ω and the boundaries. Right: the control points used in the RBF shape parametrization, whose vertical displacements are treated as parameters.

In particular,

$$f_{LF}^N(\boldsymbol{\mu}) = J(p_N(\boldsymbol{\mu})) \qquad \text{and} \qquad f_{HF}(\boldsymbol{\mu}) = J(p_h(\boldsymbol{\mu}))$$

define the MF setting for this test case. We denote by $p_N(\boldsymbol{\mu})$ and $p_h(\boldsymbol{\mu})$ the LF and the HF solutions of the parametrized PDE, respectively, the construction of which is reported in Appendix B. Moreover, we highlight the dependence of the LF model on the dimension $N$ of the ROM used to evaluate the PDE solution. In the following subsections, we consider two different scenarios, dealing with either 1 or 5 parameters.

The case of p = 1 parameter

In this first case, we compare different MF strategies by considering the output of interest as a function of the frequency $\mu = f$ only, letting it vary in $\mathcal{D} = [10, \ldots]$. We sample a set of frequency values $f$ and compute the corresponding HF solutions through the FOM. The FOM is built from $\mathbb{P}_1$ finite elements and, considering a mesh made of 8740 triangular elements, we have an HF model of dimension $n = 4567$. Then, we apply POD and extract $N = 44$ reduced basis functions by imposing a prescribed relative projection error tolerance. All computational details are summarized in Table 4. The computation of the HF and LF solutions, i.e. the FOM and ROM solutions, respectively, is carried out in Matlab, using the redbKIT library [29]. All hyperparameters obtained by HPO are reported in Appendix A.

We assess how the quality of the LF model and the amount of HF data impact the accuracy of the MF prediction. We recall that (i) we can improve the quality of the LF model by selecting a smaller or larger number of bases, and (ii) we can freely decide the parameter values for which we solve the HF model, keeping the sampling method fixed. A comparison among LF models obtained with different dimensions $N$ of the ROM is reported in Fig. 7. We choose to limit the number of basis functions in the POD-Galerkin ROM to between 5 and 22 (even though the maximum number of available basis functions is 44), to deal with a potentially inaccurate (or, at least, not accurate enough) LF model. The ability to obtain accurate predictions by combining a few HF data and several evaluations of a reliable, but not sufficiently accurate, ROM is indeed an attractive feature of the proposed framework, as this may prevent us from constructing ROMs with large dimensions and poor efficiency. Moreover, we consider the same range for the amount of HF data, while we keep fixed (and equal to 32) the number of LF data.

In the following, we test the Intermediate, 2-step, and 3-step network architectures; for each combination number of HF data - number of bases, we train the NNs, predict the output of interest, and compute the goodness-of-fit indices $R^2$ and MSE. The evaluated outputs with these network architectures are displayed in Fig. 8; the values of $R^2$ and MSE are reported instead in Figs. 9, 10 and 11, as functions of the number $N$ of basis functions and the amount of HF data considered. Overall, the neural architectures perform very well and provide good predictions, in terms of both MSE and $R^2$, provided a sufficient number of basis functions and HF data are used.

Table 4: Computational details in the case with p = 1 parameter: µ = f.

Number of parameters: 1 | Parameter domain D: [10, …]
Number of FE dofs n: 4567 | Number of ROM dofs N: 44
Number of HF data: from 5 to 22 | Number of bases: from 5 to 22
Number of LF data: 32 | Sampling method: LHS
Figure 7: HF model $f_{HF}(\mu)$ and the LF models $f_{LF}^N(\mu)$ for different numbers $N$ of basis functions (N = 5, 10, 15, 20, 25, 30, 35, 40). For too small values of N, the LF model prediction is rather poor, while it becomes almost indistinguishable from the HF model prediction as N grows.

The 2-step and 3-step models produce similar results and perform better than the Intermediate model, as displayed in Fig. 8. In particular, the multilevel models are robust and efficient even in the cases where the LF models are built from few bases, while the
Intermediate network has poor predictive accuracy without a sufficiently accurate LF model, even if a large amount of HF data is provided. In fact, with the multistep networks we can reach values of $R^2$ larger than 0.8 even when considering just N = 5 bases, whereas the Intermediate model does not provide large values of $R^2$ with less than 12 bases, see Figs. 9, 10 and 11. Regarding the computed outputs, we see how the larger-amplitude peaks at lower frequencies are well captured by the multilevel models, while the smaller-amplitude peaks at higher frequencies are harder to capture and are recovered with different accuracy by the two multilevel models. On the other hand, the Intermediate model provides a less accurate trend of the output, and only provides reliable results provided that both N and the amount of HF data are large enough.

As expected, the prediction improves as the number of HF data or the number of basis functions increases, even if these two features impact the accuracy in slightly different ways. Restricting, for the sake of simplicity, to the case of the 3-step model (see Fig. 11, top), we observe that, for each fixed number of reduced bases N, the goodness-of-fit indices improve as the number of HF data increases until a threshold limit is reached, which is determined solely by the number of bases and cannot be overcome by adding more HF data. Conversely (see Fig. 11, bottom), by increasing N, the trends of $R^2$ (resp. MSE), corresponding to the different numbers of HF data adopted, increase (resp. decrease) and also tend to reach values closer to each other. The amount of HF data thus becomes less and less important as the LF model improves. Hence, in this specific problem it is more efficient to have a good LF model even with a small amount of HF data, rather than lots of HF data but a poor LF model.

The case of p = 5 parameters

We finally apply our MF setting to the case where all five parameters $\boldsymbol{\mu} = (f, \boldsymbol{\mu}_g)$, namely the frequency and the four geometric parameters, vary, in order to consider shape variations in the horn geometry as well. The parameter space is $\mathcal{D} = [50, \ldots] \times \mathcal{D}_g$, where $\mathcal{D}_g \subset \mathbb{R}^4$ is a symmetric range for the four control point displacements.
Intermediate, 2-step, 3-step ) models considering different number of bases and HFtraining data.
We compute 200 FOM snapshots for 200 points sampled in the parameter domain $\mathcal{D}$ through a Latin hypercube sampling design. In this case, the reduced basis is made from N = 80 POD modes; see Table 5.

Figure 9: Case with p = 1 parameter, Intermediate architecture. MSE and $R^2$ for different amounts of HF data and LF dimension N.

As in the case of p = 1 parameter, we select an appropriate range for the number of bases and HF data to assess the efficiency of the MF approach in different scenarios. We consider a larger number of LF data (500 instead of 100) than in the case of p = 1 parameter; we then vary the number of HF data from 5 to 45, and the number of basis functions of the LF model from 5 to 40. We report the results only in the case of the 3-step model, for the sake of brevity and to focus on the role of the quality and quantity of the training data rather than on the choice of the NN architecture.

Table 5: Computational details in the case with p = 5 parameters: µ = (f, µ_g).

Number of parameters: 5 | Parameter domain D: [50, …] × D_g
Number of FE elements: 8740 | Number of FE snapshots: 200
Number of FE dofs n: 4567 | Number of ROM dofs N: 80
Number of HF data: from 5 to 45 | Number of bases: from 5 to 40
Number of LF data: 500 | Sampling method: LHS

Passing from 1 to 5 parameters does not worsen the prediction power of the NN architecture. Indeed, once again we obtain an accurate prediction from a small number of FOM data, by exploiting a large number of ROM solutions that can be computed very quickly and inexpensively. As in the previous case, the MF prediction improves both as the number of HF data and the number of bases increase, with the latter playing a more important role in this framework. Similar to the case p = 1, we display $R^2$ and the MSE obtained with the 3-step NN architecture as functions of the aforementioned factors (see Fig. 12).

Figure 10: Case with p = 1 parameter, 2-step architecture. MSE and $R^2$ for different amounts of HF data and LF dimension N.

Numerical results show that we can achieve very good results even in 5 dimensions by employing 500 LF data, and that increasing both the amount of HF data and the dimension of the LF model improves the prediction accuracy. When considering a poor LF model, the number of HF data is fundamental to obtain good predictions: for instance, with an LF model of dimension N = 15, $R^2$ improves substantially when the number of HF data increases from 10 to 45, whereas with N = 35 the same increase of HF data improves $R^2$ only marginally. For small LF dimensions (e.g., N = 5), $R^2$ (resp., the MSE) continues to increase (resp., decrease) as the number of HF data increases, while both indices flatten when considering larger values of N, for which we obtain an $R^2$ greater than 0.97 and a very small MSE, with both indices continuing to improve as the number of bases increases. In particular, it seems that increasing the number of HF data only helps provided the dimension N of the LF model is also enlarged. On the other hand, as N increases, the indices keep improving and the number of HF data becomes less and less relevant. Therefore, also in the case of p = 5 parameters we can conclude that improving the quality of the LF model is a more efficient strategy, as well as a computationally cheaper one, than increasing the number of HF data to reach a certain degree of accuracy.

Figure 11: Case with p = 1 parameter, 3-step architecture. MSE and $R^2$ for different amounts of HF data and LF dimension N.
6. Conclusions
In this work we discuss MF regression with ANNs, for which four different architectures are presented. The proposed NN schemes are benchmarked against co-kriging on a collection of test cases of increasing complexity. We also successfully predict output quantities associated with parametrized PDEs using the multi-fidelity NN models. Observations made from the different numerical examples show that MF models based on NNs can consistently outperform co-kriging schemes. In contrast to co-kriging, NNs are able to detect nonlinear correlations between fidelity levels more effectively and are capable of dealing with large data sets. In addition, the hyperparameter selection for the proposed multi-fidelity NNs is automatized based on cross-validation and Bayesian optimization, and the tuned NN models consistently yield very low prediction errors.

The proposed schemes have been tested on a series of manufactured benchmarks and applied to a parametrized PDE problem. In the latter, the goal is to evaluate an output functional of the PDE solution that features an oscillating input-output dependence, and the LF model is constructed through the reduced basis method while the HF model is given by a detailed finite element analysis. Numerical results show that the accuracy of the predictions through MF regression is mainly driven by the reliability of the reduced order model, rather than the amount of HF data fed into the NNs.

Figure 12: Case with p = 5 parameters, 3-step architecture. MSE and $R^2$ for different amounts of HF data and LF dimension N.

A promising direction for future work is to inspect whether changing the value of the balance parameter $\alpha$ during the training process can improve the model accuracy. As the ratio $\mathrm{MSE}_{HF} / \mathrm{MSE}_{LF}$ evolves during the training process, a corresponding adaptation of $\alpha$ could be a reasonable improvement of the current schemes.

References
[1] M. A. Álvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
[2] R. C. Aydin, F. A. Braeu, and C. J. Cyron. General multi-fidelity framework for training artificial neural networks with computational models. Frontiers in Materials, 6:61, 2019.
[3] N. Baker, F. Alexander, T. Bremer, A. Hagberg, Y. Kevrekidis, H. Najm, M. Parashar, A. Patra, J. Sethian, S. Wild, et al. Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence. Technical report, USDOE Office of Science (SC), Washington, DC (United States), 2019.
[4] E. Bängtsson, D. Noreland, and M. Berggren. Shape optimization of an acoustic horn. Comput. Methods Appl. Mech. Engrg., 192(11–12):1533–1571, 2003.
[5] P. Benner, S. Gugercin, and K. Willcox. A survey of projection-based model reduction methods for parametric dynamical systems. SIAM Review, 57(4):483–531, 2015.
[6] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, pages 115–123. JMLR.org, 2013.
[7] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
[8] K. T. Carlberg, R. Tuminaro, and P. Boggs. Preserving Lagrangian structure in nonlinear model reduction with application to structural dynamics. SIAM J. Sci. Comput., 37(2):B153–B184, 2015.
[9] S. Chaturantabut and D. C. Sorensen. Nonlinear model reduction via discrete empirical interpolation. SIAM J. Sci. Comput., 32(5):2737–2764, 2010.
[10] M. Chen, Y. Li, and R. Li. Research on neural machine translation model. Journal of Physics: Conference Series, 1237:052020, 2019.
[11] K. Cutajar, M. Pullin, A. Damianou, N. Lawrence, and J. González. Deep Gaussian processes for multi-fidelity modeling. arXiv:1903.07320, 2019.
[12] R. Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.
[13] S. Fresca, L. Dedè, and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. arXiv:2001.04001, 2020.
[14] J. Friedman. Greedy function approximation: A gradient boosting machine. Ann. Stat., 29(5):1189–1232, 2001.
[15] S. Ghosh and D. L. Reilly. Credit card fraud detection with a neural-network. In Proceedings of the 27th Hawaii International Conference on System Sciences, volume 3, pages 621–630, 1994.
[16] GPy. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.
[17] J. S. Hesthaven, G. Rozza, and B. Stamm. Certified Reduced Basis Methods for Parametrized Partial Differential Equations. Springer International Publishing, 2016.
[18] J. S. Hesthaven and S. Ubbiali. Non-intrusive reduced order modeling of nonlinear problems using neural networks. J. Comput. Phys., 363:55–78, 2018.
[19] A. G. Journel. Fundamentals of Geostatistics in Five Lessons, volume 8 of Short Courses in Geology. American Geophysical Union (AGU), 1989.
[20] M. Kast, M. Guo, and J. S. Hesthaven. A non-intrusive multifidelity method for the reduced order modeling of nonlinear problems. Comput. Methods Appl. Mech. Engrg., 364:112947, 2020.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[22] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv:1711.00165, 2017.
[23] K. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. J. Comput. Phys., 404:108973, 2020.
[24] D. Liu and Y. Wang. Multi-fidelity physics-constrained neural network and its application in materials modeling. Journal of Mechanical Design, 141(12), 2019.
[25] X. Meng, H. Babaee, and G. E. Karniadakis. Multi-fidelity Bayesian neural networks: Algorithms and applications. arXiv:2012.13294, 2020.
[26] X. Meng and G. E. Karniadakis. A composite neural network that learns from multi-fidelity data: Application to function approximation and inverse PDE problems. J. Comput. Phys., 401, 2019.
[27] M. Motamed. A multi-fidelity neural network surrogate sampling method for uncertainty quantification. Int. J. Uncert. Quantif., 10(4):315–332, 2020.
[28] R. M. Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
[29] F. Negri. redbKIT Version 2.2. http://redbkit.github.io/redbKIT/, 2016.
[30] F. Negri, A. Manzoni, and D. Amsallem. Efficient model reduction of parametrized systems by matrix discrete empirical interpolation. J. Comput. Phys., 303:431–454, 2015.
[31] A. O'Hagan and M. C. Kennedy. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1–13, 2000.
[32] B. Peherstorfer, K. Willcox, and M. Gunzburger. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Review, 60(3):550–591, 2018.
[33] A. Quarteroni, A. Manzoni, and F. Negri. Reduced Basis Methods for Partial Differential Equations: An Introduction. Springer International Publishing, 2016.
[34] M. Raissi and G. Karniadakis. Deep multi-fidelity Gaussian processes. arXiv:1604.07484, 2016.
[35] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Inferring solutions of differential equations using noisy multi-fidelity data. J. Comput. Phys., 335:736–746, 2017.
[36] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378:686–707, 2019.
[37] D. Ray and J. S. Hesthaven. An artificial neural network as a troubled-cell indicator. J. Comput. Phys., 367:166–191, 2018.
[38] F. Regazzoni, L. Dedè, and A. Quarteroni. Machine learning of multiscale active force generation models for the efficient simulation of cardiac electromechanics. Comput. Methods Appl. Mech. Engrg., 370:113268, 2020.
[39] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. J. Comput. Phys., 375:1339–1364, 2018.
[40] G. Strang. Linear Algebra and Learning from Data. Wellesley-Cambridge Press, 2019.
[41] E. Weinan, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat., 5(4):349–380, 2017.
[42] C. Williams and C. Rasmussen. Gaussian Processes for Machine Learning. The MIT Press, 2006.
Appendix A: hyperparameter summary
Tables A.1 and A.2 show the values of the NN hyperparameters resulting from the HPO in the different numerical examples. We recall that α is the parameter that balances the fidelity levels' contributions to the training error in all-in-one networks, λ is the L2-regularization parameter, and η is the learning rate. "Nodes" indicates the number of nodes in each layer whose size is left to be determined through the HPO. Although NN hyperparameters should generally not be analyzed separately, the following comments about the individual hyperparameters can be made; a sketch of the tuning loop is given after the list.
• In all test cases, the optimization procedure chooses α < 0.1, which corresponds to weighting the LF data one order of magnitude more than their HF counterpart.
• The learning rates η of models belonging to the same NNMFR class (all-in-one or multilevel) are generally of the same order of magnitude.
• In most cases, the AdaMax algorithm is the best-performing optimizer.
• In the benchmarks, Glorot uniform weight initialization is preferred, whereas standard uniform initialization performs best when predicting the IRI in the acoustic horn problem.
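As a concrete illustration of this procedure, below is a minimal sketch of a cross-validated HPO loop using the Tree-structured Parzen Estimator from the hyperopt library [6, 7]. The search-space bounds, the five-fold split, the data arrays x_lf, y_lf, x_hf, y_hf, and the helper build_and_train are illustrative assumptions rather than the exact setup used for the reported results.

```python
import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.model_selection import KFold

# Illustrative search space over the hyperparameters reported in Tables A.1-A.2
space = {
    "alpha": hp.uniform("alpha", 0.0, 0.5),                      # fidelity-balance weight
    "lam":   hp.loguniform("lam", np.log(1e-8), np.log(1e-2)),   # L2-regularization strength
    "eta":   hp.loguniform("eta", np.log(1e-4), np.log(1e-1)),   # learning rate
    "nodes": hp.quniform("nodes", 10, 120, 1),                   # width of the tunable layer
}

def cv_loss(params):
    """Mean 5-fold cross-validation MSE of the MF network on the HF data."""
    kf, errors = KFold(n_splits=5, shuffle=True, random_state=0), []
    for tr, va in kf.split(x_hf):
        # build_and_train is a hypothetical helper that trains an MF network
        # on all LF data plus the HF training fold, for given hyperparameters
        model = build_and_train(params, x_lf, y_lf, x_hf[tr], y_hf[tr])
        errors.append(np.mean((model.predict(x_hf[va]) - y_hf[va]) ** 2))
    return float(np.mean(errors))

# Bayesian optimization (TPE) over the search space
best = fmin(fn=cv_loss, space=space, algo=tpe.suggest, max_evals=100)
```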
Table A.1: Hyperparameter values of the NNMFR used for the different artificial test cases. Benchmark 1 involves linear correlation, benchmark 2 discontinuous functions, and benchmark 3 non-linear correlation.
Test | Model | Weight initializer | Depth × width | α | λ | Optimizer | η
1 | … | … | … × 59 | 4.59 × 10^-… | … × 10^-… | Adam | 6.35 × 10^-…
1 | GPmimic | Glorot uniform | 3 × 33 | 5.01 × 10^-… | … × 10^-… | AdaMax | 3.90 × 10^-…
1 | … | … | … × 34 | - | 1.01 × 10^-… | AdaMax | 5.17 × 10^-…
1 | … | … | … × 38 | - | 1.08 × 10^-… | Adam | 2.06 × 10^-…
2 | … | … | … × 30 | 9.22 × 10^-… | … × 10^-… | AdaMax | 3.41 × 10^-…
2 | GPmimic | Glorot uniform | 4 × 10 | 2.71 × 10^-… | … × 10^-… | AdaMax | 4.42 × 10^-…
2 | … | … | … × 60 | - | 1.01 × 10^-… | AdaMax | 3.65 × 10^-…
2 | … | … | … × 98 | - | 2.30 × 10^-… | AdaMax | 1.02 × 10^-…
3 | … | … | … × 39 | 4.02 × 10^-… | … × 10^-… | Adam | 1.55 × 10^-…
3 | GPmimic | Glorot normal | 4 × 22 | 9.74 × 10^-… | … × 10^-… | … | …
3 | … | … | … × 44 | - | 1.02 × 10^-… | AdaMax | 4.84 × 10^-…
3 | … | … | … × 62 | - | 2.35 × 10^-… | Adam | 1.25 × 10^-…

Table A.2: Hyperparameter values of the considered NN architectures regarding the problems in Section 5; here P indicates the number of parameters considered.
P | Model | Weight initializer | α | λ | Optimizer | Nodes | η | Elapsed time (s)
1 | Inter. | Uniform | 3.19 × 10^-… | … × 10^-… | Adam | 114 | 4.27 × 10^-… | …
… | … | … | … | … × 10^-… | AdaMax | 18 | 1.76 × 10^-… | …
… | … | … | … | … × 10^-… | Adam | 6 | 1.42 × 10^-… | …
… | … | … | … | … × 10^-… | Adam | 48 | 7.57 × 10^-… | …

Appendix B: HF and LF models for the acoustic horn problem
To derive the HF model related to the test case of Section 5, we employ the Galerkin finite element method. We write the weak formulation of problem (15): given $\mu \in \mathcal{D}$, find $p(\mu) \in V$ s.t.

$$a(p(\mu), u; \mu) = f(u; \mu) \quad \forall u \in V, \qquad (16)$$

where $V = H^1(\Omega(\mu_g)) = \{ q \in L^2(\Omega(\mu_g)) : \partial q / \partial x_j \in L^2(\Omega(\mu_g)),\ j \in \{1, 2\} \}$, and the bilinear form $a(\cdot, \cdot; \mu) : V \times V \to \mathbb{C}$ and the linear form $f(\cdot; \mu) : V \to \mathbb{C}$ are defined, respectively, by

$$a(p, u; \mu) = \int_{\Omega(\mu_g)} \left( \nabla p \cdot \nabla \bar{u} - k^2 p \bar{u} \right) d\Omega + ik \int_{\Gamma_o \cup \Gamma_i} p \bar{u} \, d\Gamma + \frac{1}{2R} \int_{\Gamma_o} p \bar{u} \, d\Gamma, \qquad (17)$$

$$f(u; \mu) = 2ikA \int_{\Gamma_i} \bar{u} \, d\Gamma. \qquad (18)$$

Next, we introduce a conforming triangulation $\mathcal{T}_h = \{\Delta_k\}_{k=1}^{n_e}$ of the domain $\Omega$ and seek a HF approximation $p_h(\mu) \in V_h$ as a globally continuous, piecewise linear function belonging to a finite-dimensional space $V_h \subset V$. In our case, $V_h$ is spanned by a set of $n$ piecewise polynomial nodal basis functions $\{\varphi_i\}_{i=1}^{n}$ on $\mathcal{T}_h$. The Galerkin finite element approximation of (16) thus results in the following $n$-dimensional linear system:

$$\mathbf{A}(\mu) \mathbf{p}_h(\mu) = \mathbf{f}(\mu), \qquad (19)$$

where $A_{ij}(\mu) = a(\varphi_j, \varphi_i; \mu)$ and $f_i(\mu) = f(\varphi_i; \mu)$ for $1 \le i, j \le n$, and $\mathbf{p}_h$ is the vector of coefficients $\{p_i\}_{i=1}^{n}$ such that the projection of $p$ onto $V_h$ is $p_h(x) = \sum_{i=1}^{n} p_i \varphi_i(x)$. This formula establishes a one-to-one correspondence between the finite element functions $p_h(\mu) \in V_h$ and their discrete counterparts $\mathbf{p}_h(\mu) \in \mathbb{R}^{N_h}$.

To derive the LF model, we employ the reduced basis (RB) method [17, 33], which is briefly recalled here. The RB method is a projection-based reduced order modeling technique for the repeated solution of parametrized PDEs, which allows one to dramatically reduce the dimension of the discrete problems arising from numerical approximation. The strategy adopted in RB methods consists in projecting the HF problem onto a subspace made of specially selected basis functions, built from a set of HF solutions (snapshots) corresponding to suitably chosen parameter values, e.g., through proper orthogonal decomposition (POD). A (Petrov-)Galerkin projection onto the RB space is then employed to generate the ROM.

Starting from the FOM (19), i.e., find $\mathbf{p}_h(\mu)$ such that $\mathbf{A}(\mu)\mathbf{p}_h(\mu) = \mathbf{f}(\mu)$, the idea of a projection-based ROM is to approximate $\mathbf{p}_h(\mu) \approx \mathbf{V} \mathbf{p}_N(\mu)$ as a linear combination of basis functions, for a vector of unknown reduced degrees of freedom $\mathbf{p}_N(\mu)$ of reduced dimension $N \ll n$. The latter is sought by imposing that

$$\mathbf{W}^T \left( \mathbf{A}(\mu) \mathbf{V} \mathbf{p}_N(\mu) - \mathbf{f}(\mu) \right) = \mathbf{0},$$

a condition which enforces the orthogonality of the residual to the subspace spanned by a suitable test basis $\mathbf{W} \in \mathbb{R}^{n \times N}$. The ROM reads: find $\mathbf{p}_N(\mu) \in \mathbb{R}^N$ such that

$$\mathbf{A}_N(\mu) \mathbf{p}_N(\mu) = \mathbf{f}_N(\mu), \qquad (20)$$

where $\mathbf{A}_N(\mu) = \mathbf{W}^T \mathbf{A}(\mu) \mathbf{V}$ and $\mathbf{f}_N(\mu) = \mathbf{W}^T \mathbf{f}(\mu)$. A Galerkin projection results if $\mathbf{W} = \mathbf{V}$. As for the HF approximation, the RB approximation reflects a one-to-one correspondence between the function $p_N^h(\mu) \in V_h$ and its finite-dimensional counterpart $\mathbf{V}\mathbf{p}_N(\mu) \in \mathbb{R}^{N_h}$.

Although the dimension of the RB problem (20) is very small compared to that of the FOM (19), the assembly of the former still depends, in general, on the dimension $n$ of the HF system for any $\mu \in \mathcal{D}$. A convenient situation arises when the HF arrays in (19) can be written as

$$\mathbf{A}(\mu) = \sum_{q=1}^{Q_A} \Theta_q^a(\mu) \mathbf{A}_q, \qquad \mathbf{f}(\mu) = \sum_{q=1}^{Q_f} \Theta_q^f(\mu) \mathbf{f}_q.$$
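As a complement to the projection step just described, a minimal dense-algebra sketch of building and solving the ROM (20) could read as follows; sparsity, complex-arithmetic details, and the construction of the bases V and W are deliberately omitted.

```python
import numpy as np

def solve_rom(A, f, V, W=None):
    """Build and solve the ROM (20): A_N p_N = f_N, with A_N = W^T A V, f_N = W^T f.

    A : (n, n) HF matrix, f : (n,) HF right-hand side,
    V : (n, N) trial basis, W : (n, N) test basis (W = V gives a Galerkin projection).
    """
    W = V if W is None else W
    A_N = W.T @ A @ V                 # (N, N) reduced operator
    f_N = W.T @ f                     # (N,) reduced right-hand side
    p_N = np.linalg.solve(A_N, f_N)   # reduced degrees of freedom
    return V @ p_N                    # lifted approximation of p_h(mu)
```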
By virtue of this property, which we refer to as affine parametric dependence of $\mathbf{A}(\mu)$ and $\mathbf{f}(\mu)$, the assembly of the system (20) during the online stage can be made efficient, since the arrays $\mathbf{W}^T \mathbf{A}_q \mathbf{V}$, $q = 1, \ldots, Q_A$, and $\mathbf{W}^T \mathbf{f}_q$, $q = 1, \ldots, Q_f$, can be pre-computed and stored during a possibly expensive offline stage.

Since in the Helmholtz problem with $p > 1$ parameters $\mathbf{A}(\mu)$ and $\mathbf{f}(\mu)$ are nonaffine functions of $\mu$, we employ hyper-reduction through the discrete empirical interpolation method (DEIM) [9] and its matrix version (MDEIM) [5, 8] to compute approximate affine decompositions as [30]

$$\mathbf{f}(\mu) \approx \mathbf{f}_m(\mu) = \sum_{k=1}^{M_f} \theta_k^f(\mu) \mathbf{f}_k, \qquad \mathbf{A}(\mu) \approx \mathbf{A}_m(\mu) = \sum_{k=1}^{M_A} \theta_k^a(\mu) \mathbf{A}_k,$$

where $\mathbf{f}_k$, $k = 1, \ldots, M_f$, and $\mathbf{A}_k$, $k = 1, \ldots, M_A$, are precomputable vectors and matrices, respectively, independent of $\mu$. In this way, we can approximate the ROM arrays as

$$\mathbf{f}_N(\mu) \approx \mathbf{f}_N^m(\mu) = \sum_{k=1}^{M_f} \theta_k^f(\mu) \mathbf{f}_N^k, \qquad \mathbf{A}_N(\mu) \approx \mathbf{A}_N^m(\mu) = \sum_{k=1}^{M_A} \theta_k^a(\mu) \mathbf{A}_N^k,$$

where $\mathbf{f}_N^k = \mathbf{W}^T \mathbf{f}_k \in \mathbb{R}^N$, $k = 1, \ldots, M_f$, and $\mathbf{A}_N^k = \mathbf{W}^T \mathbf{A}_k \mathbf{V} \in \mathbb{R}^{N \times N}$, $k = 1, \ldots, M_A$. Taking advantage of hyper-reduction, we recover the following hyper-reduced order model: find $\mathbf{p}_N^m(\mu) \in \mathbb{R}^N$ s.t.

$$\mathbf{A}_N^m(\mu) \mathbf{p}_N^m(\mu) = \mathbf{f}_N^m(\mu). \qquad (21)$$

Due to its small dimension, the system (21) can be solved very quickly and inexpensively, allowing us to generate many instances of the output of interest, which can be used as LF training data. Finally, we denote by $p_N^m(\mu) \in V_h$ the finite element function corresponding to the vector $\mathbf{V}\mathbf{p}_N^m(\mu) \in \mathbb{R}^{N_h}$.
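Finally, the offline/online split enabled by the (M)DEIM approximation can be sketched as follows; theta_a and theta_f stand for the interpolation coefficients $\theta_k^a(\mu)$ and $\theta_k^f(\mu)$, whose computation via DEIM is omitted here.

```python
import numpy as np

def reduce_affine_terms(A_terms, f_terms, V, W):
    """Offline stage: project each mu-independent term once,
    A_N^k = W^T A_k V and f_N^k = W^T f_k, and store the small arrays."""
    return [W.T @ Ak @ V for Ak in A_terms], [W.T @ fk for fk in f_terms]

def solve_hyper_rom(theta_a, theta_f, AN_terms, fN_terms):
    """Online stage: assemble and solve the hyper-reduced model (21);
    the cost scales with N and the number of affine terms, not with n."""
    A_N = sum(t * Ak for t, Ak in zip(theta_a, AN_terms))
    f_N = sum(t * fk for t, fk in zip(theta_f, fN_terms))
    return np.linalg.solve(A_N, f_N)   # p_N^m(mu) in R^N
```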