A fast and scalable computational framework for goal-oriented linear Bayesian optimal experimental design: Application to optimal sensor placement
Keyi Wu, Peng Chen, and Omar Ghattas
Department of Mathematics, University of Texas at Austin, Austin, USA
Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, USA

This research was partially funded by the National Science Foundation, Division of Mathematical Sciences under award DMS-2012453; the Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Mathematical Multifaceted Integrated Capability Centers (MMICCS) program under award DE-SC0019303; and the Simons Foundation under award 560651.
Abstract
Optimal experimental design (OED) is a principled framework for maximizing the information gained from limited data in inverse problems. Unfortunately, conventional methods for OED are prohibitive when applied to expensive models with high-dimensional parameters, as we target here. We develop a fast and scalable computational framework for goal-oriented OED of large-scale Bayesian linear inverse problems that finds sensor locations to maximize the expected information gain (EIG) for a predicted quantity of interest. By employing low-rank approximations of appropriate operators, an offline-online decomposition, and a new swapping greedy algorithm, we are able to maximize the EIG at a cost, measured in model solutions, that is independent of the problem dimensions. We demonstrate the efficiency, accuracy, and both data- and parameter-dimension independence of the proposed algorithm for a contaminant transport inverse problem with an infinite-dimensional parameter field.
1 Introduction

Optimizing the acquisition of data—e.g., what, where, and when to measure, and what experiments to run—in order to maximize the information gained from the data is a fundamental and ubiquitous problem across the natural and social sciences, engineering, medicine, and technology. Three important examples are optimal observing system design for ocean climate data [37], optimal sensor placement for early warning of tsunami waves [25], and optimal experimental design to accelerate MRI imaging [9]. Bayesian optimal experimental design (BOED)—including formulations as active learning, Bayesian optimization, and sensor placement—provides a probabilistic framework to maximize the expected information gain (EIG) or mutual information (MI) for uncertain parameters or related quantities of interest [15]. However, evaluating the EIG remains prohibitive for large-scale, complex models, due to the need to compute double integrals with respect to both the parameter and data distributions. Recently,
advances in efficiently evaluating the EIG and optimizing the design have been achieved using methods based on Laplace-approximation-based EIG estimation [36], myopic posterior sampling for adaptive goal-oriented BOED [34], EIG estimation by variational inference for BOED [27], BOED for implicit models by neural EIG estimation [35], and sequential BOED with variable cost structure [45].

Interest has intensified in extending BOED to experiments on, or observations of, complex physical systems, since these can be very expensive (e.g., satellite trajectories, subsurface wells, ocean-bottom acoustic sensors). Such physical systems are typically modeled by partial differential equations (PDEs), which are expensive to solve and often involve infinite-dimensional parameter fields and large numbers of design variables. This presents fundamental challenges to conventional BOED methods, which require prohibitively large numbers of (PDE) model solves. Several classes of methods have been developed to tackle these computational challenges by exploiting (1) sparsity, via polynomial chaos approximation of parameter-to-observation maps [29, 30, 31]; (2) intrinsic low dimensionality, via low-rank approximation of (prior-preconditioned and data-informed) operators [2, 1, 3, 41, 23, 8]; and (3) decomposability, via an offline (model-constrained EIG approximation)–online (design optimization) decomposition [43].

Here, we focus on goal-oriented optimal experimental design (GOOED) for large-scale Bayesian inverse problems, in the context of optimal sensor placement. That is, we seek sensor locations that maximize the information gained from the sensors, not about the model parameters, but (of greater practical interest) about a posterior model-predictive goal. In particular, we consider linear parameter-to-observable (PtO) maps governed by expensive models (e.g., PDEs) with high-dimensional uncertain parameters (infinite-dimensional before discretization). In [8], a gradient-based optimization method is developed to solve the linear GOOED problem for the optimal sensor locations. However, each of the possibly very large number of iterations requires many model evaluations, which makes the algorithm prohibitive when each model evaluation (e.g., a PDE solve) is very expensive.

Contributions. We propose a fast and scalable computational framework for high-dimensional Bayesian GOOED problems governed by large-scale, expensive-to-solve models. To overcome the curse of dimensionality with respect to both parameter and data dimensions, we exploit the intrinsic low dimensionality of the PtO map, and propose and analyze low-rank approximations of appropriate data- and parameter-informed operators and of the EIG. These properties are revealed by Jacobians and Hessians of the PtO map, as has been done for model reduction for sampling and deep learning [10, 17, 6, 38], Bayesian inference [13, 14, 16, 21, 22, 18], optimization under uncertainty [4, 20, 19], and BOED [2, 3, 23, 41, 8, 43].
We use a randomized algorithm for the low-rank approximation, which requires only a small and dimension-independent number of large-scale model evaluations. Moreover, we develop an efficient offline-online decomposition scheme to evaluate the goal-oriented EIG, in which the model-constrained low-rank approximations are performed just once in the offline stage, while the design optimization in the online stage is free of model evaluations. Furthermore, for the design optimization we introduce a new swapping greedy algorithm that first constructs an initial set of sensors using leverage scores, and then swaps the chosen sensors with other candidates until certain convergence criteria are met. Finally, we demonstrate the efficiency, accuracy, and dimension independence (with respect to both data and parameters) of the proposed algorithm for a contaminant transport inverse problem with an infinite-dimensional parameter field.

We present background on BOED in Section 2, propose our computational framework for GOOED in Section 3, and report results of numerical experiments in Section 4.
2 Background
2.1 Bayesian linear inverse problems

We consider a general linear model
$y = Fm + \epsilon$,   (1)
where $y \in \mathbb{R}^{d_y}$ is a $d_y$-dimensional observational data vector corrupted by additive Gaussian noise $\epsilon \sim \mathcal{N}(0, \Gamma_n)$ with zero mean and covariance $\Gamma_n \in \mathbb{R}^{d_y \times d_y}$, $m \in \mathbb{R}^{d_m}$ is a $d_m$-dimensional uncertain parameter vector, and $F: \mathbb{R}^{d_m} \to \mathbb{R}^{d_y}$ is a linear PtO map. As a specific case, $m$ is a discretization (e.g., by the finite element method) of an infinite-dimensional parameter field in a model described by PDEs, while $F$ is implicitly given by solving the PDE model. In this case, the parameter dimension is typically very high in practical applications.

We assume a Gaussian prior $m \sim \mathcal{N}(m_{\rm pr}, \Gamma_{\rm pr})$ with mean $m_{\rm pr}$ and covariance $\Gamma_{\rm pr}$ for the parameter $m$, with density
$\pi_{\rm pr}(m) \propto \exp\left(-\tfrac{1}{2}\|m - m_{\rm pr}\|^2_{\Gamma_{\rm pr}^{-1}}\right)$,   (2)
where $\|m - m_{\rm pr}\|^2_{\Gamma_{\rm pr}^{-1}} := (m - m_{\rm pr})^T \Gamma_{\rm pr}^{-1}(m - m_{\rm pr})$. Then, by Bayes' rule, the posterior density of $m$ satisfies
$\pi_{\rm post}(m \mid y) \propto \pi_{\rm like}(y \mid m)\,\pi_{\rm pr}(m)$.   (3)
Here $\pi_{\rm like}(y \mid m)$ is the likelihood function, which under Gaussian noise $\epsilon \sim \mathcal{N}(0, \Gamma_n)$ satisfies
$\pi_{\rm like}(y \mid m) \propto \exp(-\Phi(m, y))$   (4)
with the potential
$\Phi(m, y) := \tfrac{1}{2}\|Fm - y\|^2_{\Gamma_n^{-1}}$.   (5)
Under the assumption of Gaussian prior and Gaussian noise, the posterior of $m$ is also Gaussian, $\mathcal{N}(m_{\rm post}, \Gamma_{\rm post})$, with mean $m_{\rm post} = \Gamma_{\rm post}(F^*\Gamma_n^{-1}y + \Gamma_{\rm pr}^{-1}m_{\rm pr})$ (which coincides with the MAP point) and covariance $\Gamma_{\rm post} = (H_m + \Gamma_{\rm pr}^{-1})^{-1}$, where
$H_m = F^*\Gamma_n^{-1}F$   (6)
is the (data-misfit) Hessian of the potential $\Phi(m, y)$, and $F^*$ is the adjoint of $F$, obtained, e.g., by solving the adjoint PDE model.

2.2 Optimal experimental design

2.2.1 Expected information gain

The expected information gain (EIG) is defined as the expected (with respect to the data) Kullback–Leibler (KL) divergence between the posterior and the prior distributions,
$\Psi := \mathbb{E}_y\left[D_{\rm KL}(\pi_{\rm post}(\cdot \mid y)\,\|\,\pi_{\rm pr})\right]$,   (7)
where the KL divergence is defined as
$D_{\rm KL}(\pi_{\rm post}\,\|\,\pi_{\rm pr}) := \int \ln\left(\frac{d\pi_{\rm post}}{d\pi_{\rm pr}}\right)d\pi_{\rm post}$.   (8)
For a Bayesian linear inverse problem as formulated in Section 2.1, the EIG $\Psi$ admits the closed form [1]
$\Psi = \tfrac{1}{2}\log\det\left(I_m + \widetilde{H}_m\right)$,   (9)
where $I_m$ is the identity matrix of size $d_m \times d_m$, and $\widetilde{H}_m := \Gamma_{\rm pr}^{1/2}H_m\Gamma_{\rm pr}^{1/2}$ is the prior-preconditioned Hessian, which includes both data and prior information.

2.2.2 Optimal sensor placement

We consider an optimal sensor placement problem. Assume we have a collection of $d$ candidate sensors $\{s_i\}_{i=1}^d$. Due to a limited budget or physical constraints, we can choose only a much smaller number $r < d$ of sensors at which data are collected. The OED problem seeks the best sensor combination from the candidates. We use a Boolean design matrix $W \in \mathcal{W} \subset \mathbb{R}^{r \times d}$ to represent the sensor placement, such that $W_{ij} = 1$ if the $i$-th sensor is placed at the $j$-th candidate location, i.e.,
$W_{ij} \in \{0, 1\}, \quad \sum_{j=1}^d W_{ij} = 1, \quad \sum_{i=1}^r W_{ij} \in \{0, 1\}$.   (10)
We assume that the observational noise for the $d$ candidate sensors is uncorrelated, with covariance
$\Gamma_n^d = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$.   (11)
As a result, for any design $W$, the covariance of the observation noise $\epsilon$ is $\Gamma_n(W) = W\Gamma_n^dW^T$, and we have
$\Gamma_n^{-1}(W) = W(\Gamma_n^d)^{-1}W^T$.   (12)
Denoting by $F_d$ the PtO map using all $d$ candidate sensors, we have the design-specific PtO map
$F(W) = WF_d$,   (13)
with adjoint $F^*(W) = F_d^*W^T$. We can now state the OED problem: find an optimal design $W^\star \in \mathcal{W}$ such that
$W^\star = \arg\max_{W \in \mathcal{W}} \Psi(W)$.   (14)
The classical OED problem seeks a design that maximizes the information gain for the parameter vector $m$.
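As a concrete illustration of these formulas, the following sketch (ours, not from the paper's code; the PtO map and prior are random dense stand-ins) assembles the Gaussian posterior of (3) and checks the EIG closed form (9) against the equivalent expression $\tfrac12(\log\det\Gamma_{\rm pr} - \log\det\Gamma_{\rm post})$:

```python
# Linear-Gaussian posterior and EIG closed form (6)-(9) on a small synthetic problem.
import numpy as np

rng = np.random.default_rng(0)
d_m, d_y = 50, 20
F = rng.standard_normal((d_y, d_m)) / np.sqrt(d_m)   # stand-in for the PtO map
Gamma_n = 0.01 * np.eye(d_y)                         # noise covariance
x = np.linspace(0.0, 1.0, d_m)                       # a smooth SPD prior covariance
Gamma_pr = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.2**2) + 1e-2 * np.eye(d_m)

# Data-misfit Hessian (6) and prior-preconditioned Hessian in (9)
H_m = F.T @ np.linalg.solve(Gamma_n, F)
w, V = np.linalg.eigh(Gamma_pr)
sqrt_pr = (V * np.sqrt(w)) @ V.T                     # symmetric square root of Gamma_pr
H_tilde = sqrt_pr @ H_m @ sqrt_pr

# EIG closed form (9), checked against the posterior/prior determinant ratio
EIG = 0.5 * np.linalg.slogdet(np.eye(d_m) + H_tilde)[1]
Gamma_post = np.linalg.inv(H_m + np.linalg.inv(Gamma_pr))
EIG_check = 0.5 * (np.linalg.slogdet(Gamma_pr)[1] - np.linalg.slogdet(Gamma_post)[1])
assert np.isclose(EIG, EIG_check)
print(f"EIG = {EIG:.4f}")
```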
3 Goal-oriented optimal experimental design

In this work, we consider a goal-oriented optimal experimental design (GOOED) problem that maximizes the information gain for a predicted quantity of interest (QoI) $\rho \in \mathbb{R}^{d_\rho}$, which is assumed to be a linear function of the parameter $m$,
$\rho = Pm$,   (15)
where $P: \mathbb{R}^{d_m} \to \mathbb{R}^{d_\rho}$ is a linear map that typically involves model evaluation (e.g., solving PDEs). Due to linearity, the prior distribution of $\rho$ is Gaussian, $\mathcal{N}(\rho_{\rm pr}, \Sigma_{\rm pr})$, with mean $\rho_{\rm pr} = Pm_{\rm pr}$ and covariance $\Sigma_{\rm pr} = P\Gamma_{\rm pr}P^*$, where $P^*$ is the adjoint of $P$. Moreover, the posterior distribution of $\rho$ is also Gaussian, $\mathcal{N}(\rho_{\rm post}, \Sigma_{\rm post})$, with mean $\rho_{\rm post} = Pm_{\rm post}$ and covariance $\Sigma_{\rm post} = P\Gamma_{\rm post}P^*$.

3.1 Expected information gain for GOOED

To construct an expression for the EIG for GOOED, we first introduce Proposition 3.1 [42], which relates the observational data $y$ and the QoI $\rho$.
Proposition 3.1. Model (1) and QoI (15) lead to
$y = FP^\dagger\rho + \eta$,   (16)
where $P^\dagger := \Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}$, and $\eta \sim \mathcal{N}(0, \Gamma_\eta)$ with
$\Gamma_\eta := \Gamma_n + F(\Gamma_{\rm pr} - \Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}P\Gamma_{\rm pr})F^*$,   (17)
or equivalently $\Gamma_\eta = \mathrm{Cov}[\epsilon] + \mathrm{Cov}[F(I_m - P^\dagger P)m]$, with Cov denoting covariance. Moreover, $\rho$ and $\eta$ are independent.

Thus, the EIG for $\rho$ can be obtained analogously to (9),
$\Psi_\rho(W) = \tfrac{1}{2}\log\det\left(I_\rho + \widetilde{H}^\rho_m(W)\right)$,   (18)
where $I_\rho$ is the identity matrix of size $d_\rho \times d_\rho$, and $\widetilde{H}^\rho_m(W) = \Sigma_{\rm pr}^{1/2}H^\rho_m(W)\Sigma_{\rm pr}^{1/2}$, with $H^\rho_m(W)$ given by
$H^\rho_m(W) = (F(W)P^\dagger)^*\,\Gamma_\eta^{-1}(W)\,F(W)P^\dagger$.   (19)

3.2 Offline-online decomposition

The EIG $\Psi_\rho(W)$ depends on $W$ through $F(W)$ given in (13), which involves expensive model evaluations (e.g., PDE solutions). Since $\Psi_\rho(W)$ must be evaluated repeatedly in the course of maximizing the EIG, these repeated model evaluations would be prohibitive. To circumvent this problem, we propose an offline-online decomposition scheme, in which the model-constrained computation of quantities that are independent of $W$ is performed offline a single time, while the online design optimization is free of any model evaluations. The key result permitting this decomposition is given in the following theorem, whose proof is given in Appendix A.
Theorem 3.2. For each design $W \in \mathcal{W}$, the goal-oriented EIG $\Psi_\rho(W)$ given in (18) can be computed as
$\Psi_\rho(W) = \tfrac{1}{2}\log\det\left(I_r + L^TWH^\rho_dW^TL\right)$,   (20)
where $I_r$ is the identity matrix of size $r \times r$, $H^\rho_d$ is given by
$H^\rho_d := F_d\Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}P\Gamma_{\rm pr}F_d^*$,   (21)
and $L$ is given by the Cholesky factorization $\Gamma_\eta^{-1} = LL^T$.

Note that $\Gamma_\eta$ defined in (17) can be equivalently written as
$\Gamma_\eta(W) = W\left(\Gamma_n^d + H_d - H^\rho_d\right)W^T$,   (22)
where $H_d := F_d\Gamma_{\rm pr}F_d^*$. Hence the evaluation of $\Psi_\rho(W)$ can be decomposed as follows: (1) construct the model-constrained matrices $H_d$ and $H^\rho_d$ offline, just once; and (2) for each $W$ in the online optimization process, assemble the small ($r \times r$) matrix $\Gamma_\eta(W)$ by (22), compute the Cholesky factorization $\Gamma_\eta^{-1} = LL^T$, and assemble $\Psi_\rho(W)$ by (20), all of which are free of expensive model evaluations.
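The following is a small numerical sanity check of Theorem 3.2 (ours; all operators are random dense stand-ins, whereas in the paper $F_d$ is available only through PDE solves): the direct evaluation of (18) and the offline-online evaluation via (20) and (22) agree to machine precision.

```python
import numpy as np
rng = np.random.default_rng(0)

d_m, d, d_rho, r = 60, 15, 2, 4
F_d = rng.standard_normal((d, d_m)) / np.sqrt(d_m)    # PtO map, all d candidate sensors
P = rng.standard_normal((d_rho, d_m)) / np.sqrt(d_m)  # parameter-to-QoI map
B = rng.standard_normal((d_m, d_m))
Gamma_pr = B @ B.T / d_m + np.eye(d_m)                # SPD prior covariance
Gamma_n_d = 0.01 * np.eye(d)                          # noise covariance, all candidates
W = np.zeros((r, d)); W[np.arange(r), [0, 3, 7, 12]] = 1.0  # a Boolean design (10)

Sigma_pr = P @ Gamma_pr @ P.T
P_dag = Gamma_pr @ P.T @ np.linalg.inv(Sigma_pr)      # P-dagger from Proposition 3.1

# Direct evaluation of (18)-(19)
F_W = W @ F_d
Gamma_eta = (W @ Gamma_n_d @ W.T
             + F_W @ (Gamma_pr - Gamma_pr @ P.T @ np.linalg.solve(Sigma_pr, P @ Gamma_pr)) @ F_W.T)
G = F_W @ P_dag
H_rho_m = G.T @ np.linalg.solve(Gamma_eta, G)
w, V = np.linalg.eigh(Sigma_pr)
S_half = (V * np.sqrt(w)) @ V.T                       # symmetric square root of Sigma_pr
Psi_direct = 0.5 * np.linalg.slogdet(np.eye(d_rho) + S_half @ H_rho_m @ S_half)[1]

# Offline stage: d x d matrices built once (21)-(22)
H_d = F_d @ Gamma_pr @ F_d.T
H_rho_d = F_d @ Gamma_pr @ P.T @ np.linalg.solve(Sigma_pr, P @ Gamma_pr @ F_d.T)

# Online stage: small r x r algebra only, no model evaluations (Theorem 3.2)
Gamma_eta_W = W @ (Gamma_n_d + H_d - H_rho_d) @ W.T
L = np.linalg.cholesky(np.linalg.inv(Gamma_eta_W))
Psi_fast = 0.5 * np.linalg.slogdet(np.eye(r) + L.T @ W @ H_rho_d @ W.T @ L)[1]

assert np.isclose(Psi_direct, Psi_fast)
```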
Note that $H_d \in \mathbb{R}^{d \times d}$ and $H^\rho_d \in \mathbb{R}^{d \times d}$ are large matrices when the number of candidate sensors is large, $d \gg 1$. Moreover, their construction involves expensive model evaluations when the parameters are high-dimensional, $d_m \gg 1$,
e.g., when PDE solves are required. Therefore, it is not computationally practical to directly compute and store these matrices. Fortunately, the intrinsic ill-posedness of high-dimensional Bayesian inverse problems—the data inform only a low-dimensional subspace of the parameter space, e.g., [32, 13, 39, 26, 7]—suggests that these matrices are likely low rank or exhibit rapid spectral decay. We exploit this property and construct low-rank approximations of $H^\rho_d$ and $H_d - H^\rho_d$ in the next section.

3.3 Low-rank approximation

Let $\Delta H_d := H_d - H^\rho_d$, where $H^\rho_d$ and $H_d$ are given in (21) and (22) and integrate data, parameter, and QoI information. Noting that $H^\rho_d$ and $\Delta H_d$ are both symmetric, we compute their low-rank approximations for given tolerances $\epsilon_\zeta, \epsilon_\lambda > 0$,
$\hat H^\rho_d = U_kZ_kU_k^T \quad \text{and} \quad \Delta\hat H_d = V_l\Lambda_lV_l^T$,   (23)
where $(U_k, Z_k)$ are the $k$ dominant eigenpairs of $H^\rho_d$ with $Z_k = \mathrm{diag}(\zeta_1, \dots, \zeta_k)$ such that
$\zeta_1 \ge \zeta_2 \ge \cdots \ge \zeta_k \ge \epsilon_\zeta \ge \zeta_{k+1} \ge \cdots \ge \zeta_d$,   (24)
and $(V_l, \Lambda_l)$ are the $l$ dominant eigenpairs of $\Delta H_d$ with $\Lambda_l = \mathrm{diag}(\lambda_1, \dots, \lambda_l)$ such that
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_l \ge \epsilon_\lambda \ge \lambda_{l+1} \ge \cdots \ge \lambda_d$.   (25)
For the low-rank approximation we employ a randomized SVD algorithm [28], which requires only $O(k)$ and $O(l)$ model evaluations, respectively; in practice, $k, l \ll d$. More details on the algorithm, applied to the example problem of Section 4, can be found in Appendix C.

With $\hat\Gamma_\eta(W) := W(\Gamma_n^d + \Delta\hat H_d)W^T$ as an approximation of $\Gamma_\eta(W)$ in (17), we compute the Cholesky factorization $\hat\Gamma_\eta^{-1} = \hat L\hat L^T$. We can then define the approximate EIG
$\hat\Psi_\rho(W) := \tfrac{1}{2}\log\det\left(I_r + \hat L^TW\hat H^\rho_dW^T\hat L\right)$.   (26)
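The following sketch (ours) illustrates (26) on a synthetic problem with prescribed, rapidly decaying spectra: it forms the approximate EIG from the $k$ and $l$ dominant eigenpairs and compares the resulting error with the truncation bound (27) stated in Theorem 3.3 below.

```python
import numpy as np
rng = np.random.default_rng(0)

d, r, sigma2 = 40, 5, 1e-2
Q1, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q2, _ = np.linalg.qr(rng.standard_normal((d, d)))
zeta = 10.0 ** (-np.arange(d, dtype=float))       # eigenvalues of H_d^rho (stand-in)
lam = 5.0 * 10.0 ** (-np.arange(d, dtype=float))  # eigenvalues of Delta H_d (stand-in)
H_rho_d = (Q1 * zeta) @ Q1.T
Delta_H = (Q2 * lam) @ Q2.T
Gamma_n_d = sigma2 * np.eye(d)
W = np.eye(d)[::2][:r]                            # a Boolean design: sensors 0,2,4,6,8

def truncate(Q, vals, m):                         # rank-m eigen-truncation as in (23)
    return (Q[:, :m] * vals[:m]) @ Q[:, :m].T

def psi(H_rho, Delta):                            # EIG formula (20) / (26)
    Geta = W @ (Gamma_n_d + Delta) @ W.T
    L = np.linalg.cholesky(np.linalg.inv(Geta))
    return 0.5 * np.linalg.slogdet(np.eye(r) + L.T @ W @ H_rho @ W.T @ L)[1]

k, l = 8, 5
err = abs(psi(H_rho_d, Delta_H) - psi(truncate(Q1, zeta, k), truncate(Q2, lam, l)))
bound = 0.5 * np.sum(np.log1p(zeta[k:] / sigma2)) \
      + 0.5 * np.sum(np.log1p(lam[l:k] * zeta[0] / sigma2**2))
print(f"error {err:.2e} <= bound {bound:.2e}")    # the error should sit below the bound
```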
The following theorem quantifies the approximation error; its proof can be found in Appendix B.

Theorem 3.3. For any design $W \in \mathcal{W}$, the error in approximating the goal-oriented EIG $\Psi_\rho(W)$ in (20) by $\hat\Psi_\rho(W)$ in (26) is bounded by
$|\Psi_\rho(W) - \hat\Psi_\rho(W)| \le \frac{1}{2}\sum_{i=k+1}^d \log(1 + \zeta_i/\sigma_{\min}^2) + \frac{1}{2}\sum_{i=l+1}^k \log(1 + \lambda_i\zeta_1/\sigma_{\min}^4)$,   (27)
where $\sigma_{\min}^2 := \min(\sigma_1^2, \dots, \sigma_d^2)$, as defined in (11).

We remark that with rapid decay of the eigenvalues $(\zeta_k)_{k \ge 1}$ of $\hat H^\rho_d$ and $(\lambda_l)_{l \ge 1}$ of $\Delta\hat H_d$, the error bound in (27) becomes very small. Moreover, the decay rates are often independent of the (candidate) data dimension $d$ and the parameter dimension $d_m$, as demonstrated in Section 4.3. This means that an arbitrarily accurate EIG approximation can be constructed with a small number, $O(k + l)$, of model solves.

3.4 Swapping greedy optimization

Once the low-rank approximations of $H^\rho_d$ and $\Delta H_d$ are constructed per (23), we obtain a fast method for evaluating the approximate EIG (26), with certified approximation error given by Theorem 3.3. We emphasize that this fast computation does not involve expensive model evaluations (e.g., large-scale PDE solves) for any given design $W$. We now turn to the (combinatorial) optimization problem of finding the optimal design matrix,
$W^\star = \arg\max_{W \in \mathcal{W}} \hat\Psi_\rho(W)$.   (28)
We introduce a swapping greedy algorithm to solve this problem that requires only evaluations of $\hat\Psi_\rho(W)$.

In contrast to classical greedy algorithms that sequentially find the optimal sensors one by one (or batch by batch) [11, 33], we extend a swapping greedy algorithm developed for BOED in [43] to the GOOED problem. Given a current sensor set, it swaps sensors with the remaining candidates to maximize the approximate EIG $\hat\Psi_\rho(W)$ until convergence. To initialize the chosen sensor set, we take advantage of the low-rank approximation $\hat H^\rho_d$ in (23), which contains information from the data (through $F_d$), the parameter (through $\Gamma_{\rm pr}$), and the QoI (through $P$), as can be seen from (21). In particular, the most informative sensors are revealed by the rows of $U_k$ with the largest norms, i.e., the leverage scores of $H^\rho_d$ [12]. More specifically, given a budget of $r$ sensors selected from $d$ candidate locations, we initialize the chosen set $S_r = \{s_1, \dots, s_r\}$ such that $s_i$, $i = 1, \dots, r$, is the row index corresponding to the $i$-th largest row norm of $U_k$, i.e.,
$s_i = \arg\max_{s \in S\setminus S_{i-1}} \|U_k(s, :)\|, \quad i = 1, \dots, r$,   (29)
where $U_k(s, :)$ is the $s$-th row of $U_k$, $\|\cdot\|$ is the Euclidean norm, $S$ is the full candidate set, and $S_{i-1} = \{s_1, \dots, s_{i-1}\}$ for $i = 2, 3, \dots$, with $S_0 = \emptyset$. Then, at each step $t = 1, \dots, r$ of a loop, we swap the sensor $s_t$ in the currently chosen set $S_{t-1}$ with one from the candidate set such that the approximate EIG $\hat\Psi_\rho(W)$ evaluated as in (26) is maximized, i.e., we choose $s^*$ such that
$s^* = \arg\max_{s \in \{s_t\}\cup(S\setminus S_{t-1})} \hat\Psi_\rho(W_s)$,   (30)
where $W_s$ is the design matrix corresponding to the sensor choice $S_{t-1}\setminus\{s_t\}\cup\{s\}$. We repeat the loop until a convergence criterion is met, e.g., the chosen set does not change, or the change in the approximate EIG is smaller than a given tolerance. We summarize the swapping greedy method in Algorithm 1.
Algorithm 1: A swapping greedy algorithm for GOOED

Input: the low-rank approximations (23); the set S = {1, ..., d} of d candidate sensors; a budget of r sensors to be placed.
Output: the optimal sensor set S* with r sensors.

1. Initialize S* = {s_1, ..., s_r} ⊂ S according to (29). Set S_0 = ∅.
2. while S* ≠ S_0 do
3.     S_0 ← S*.
4.     for t = 1, ..., r do
5.         Choose s* according to (30).
6.         Update S_t ← (S_{t-1} \ {s_t}) ∪ {s*}.
7.     end for
8.     Update S* ← S_r.
9. end while
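A compact sketch of Algorithm 1 follows (ours, slightly simplified; the function `psi_hat` and the toy matrices are hypothetical stand-ins for the offline quantities in (23)):

```python
import numpy as np

def swapping_greedy(psi_hat, U_k, d, r, max_loops=20):
    # Initialization (29): the r rows of U_k with largest norms (leverage scores)
    S = list(np.argsort(-np.linalg.norm(U_k, axis=1))[:r])
    for _ in range(max_loops):
        swapped = False
        for t in range(r):
            # Swap step (30): best replacement for S[t] among unused candidates
            candidates = [s for s in range(d) if s not in S] + [S[t]]
            best = max(candidates, key=lambda s: psi_hat(S[:t] + [s] + S[t + 1:]))
            if best != S[t]:
                S[t], swapped = best, True
        if not swapped:              # convergence: a full loop with no swaps
            break
    return S

# Toy usage with random PSD stand-ins for the offline matrices:
rng = np.random.default_rng(0)
d, r, k = 30, 4, 5
A = rng.standard_normal((d, k)); H_rho_d = A @ A.T         # rank-k, PSD
B = rng.standard_normal((d, d)); Delta_H = B @ B.T / d     # PSD
Gamma_n_d = 0.01 * np.eye(d)
U_k = np.linalg.eigh(H_rho_d)[1][:, ::-1][:, :k]           # dominant eigenvectors

def psi_hat(S):
    # For Boolean W, W M W^T is just the submatrix of M at the selected indices
    idx = np.ix_(S, S)
    L = np.linalg.cholesky(np.linalg.inv(Gamma_n_d[idx] + Delta_H[idx]))
    return 0.5 * np.linalg.slogdet(np.eye(len(S)) + L.T @ H_rho_d[idx] @ L)[1]

print(swapping_greedy(psi_hat, U_k, d, r))
```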
4 Numerical experiments

In this section, we present results of numerical experiments for GOOED governed by a linear dynamical PDE model with an infinite-dimensional parameter field and varying numbers of candidate sensors. This problem features the key challenges of (1) expensive model evaluations and (2) high-dimensional parameters and data. The code is available at https://github.com/cpempire/GOOED.

4.1 Contaminant transport problem

We consider sensor placement for Bayesian inversion of a contaminant source, with the goal of maximizing the information gain for the contaminant concentration on some building surfaces. The transport of the contaminant is modeled by the time-dependent advection-diffusion equation with homogeneous Neumann boundary condition,
$u_t - k\Delta u + v\cdot\nabla u = 0 \text{ in } D\times(0,T), \quad u(\cdot, 0) = m \text{ in } D, \quad k\nabla u\cdot n = 0 \text{ on } \partial D\times(0,T)$,   (31)
where $k = 0.001$ is the diffusion coefficient and $T > 0$. The domain $D \subset \mathbb{R}^2$ is open and bounded, with boundary $\partial D$, as depicted in Figure 1. The initial condition $m$ is an infinite-dimensional random parameter field in $D$, which is to be inferred. The velocity field $v \in \mathbb{R}^2$ is obtained as the solution of the steady-state Navier–Stokes equations with Dirichlet boundary condition,
$-\frac{1}{\mathrm{Re}}\Delta v + \nabla q + v\cdot\nabla v = 0 \text{ in } D, \quad \nabla\cdot v = 0 \text{ in } D, \quad v = g \text{ on } \partial D$,   (32)
where $q$ is the pressure field and the Reynolds number is Re = 50. The Dirichlet boundary data $g \in \mathbb{R}^2$ are prescribed as $g = (0, 1)$ on the left wall of the domain, $g = (0, -1)$ on the right wall, and $g = (0, 0)$ elsewhere.

Figure 1: The domain $D = [0,1]^2$ with two rectangular buildings. Left: velocity field $v$. Right: true parameter field $m_{\rm true}$.

We consider a Gaussian prior for the parameter, $m \sim \mathcal{N}(m_{\rm pr}, \mathcal{C}_{\rm pr})$, with mean $m_{\rm pr}$ and covariance operator $\mathcal{C}_{\rm pr} = \mathcal{A}^{-2}$, where the elliptic operator $\mathcal{A} = -\gamma\Delta + \delta I$ (with Laplacian $\Delta$ and identity $I$) is equipped with the Robin boundary condition $\gamma\nabla m\cdot n + \beta m$ on $\partial D$. Here $\gamma, \delta > 0$ control the correlation length and variance of $m$, and $\beta$ mitigates boundary artifacts [24]. In our numerical test, we set a constant prior mean $m_{\rm pr}$, with $\gamma = 1$ and $\delta = 8$. We synthesize a "true" initial condition $m_{\rm true} = \min\left(0.5,\ \exp(-\|x - x_0\|^2)\right)$, a bump centered at a point $x_0$, as the contaminant source (Figure 1). To solve the PDE model, we use an implicit Euler method for temporal discretization with $N_t$ time steps, and a finite element method for spatial discretization, resulting in a $d_m$-dimensional discrete parameter $m \sim \mathcal{N}(m_{\rm pr}, \Gamma_{\rm pr})$, with $m_{\rm pr}, \Gamma_{\rm pr}$ denoting the finite element discretizations of $m_{\rm pr}$ and $\mathcal{C}_{\rm pr}$, respectively.
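To illustrate the structure of the forward model, the following is a minimal finite-difference sketch (ours) of (31) with implicit Euler time stepping; the rotational velocity field, the grid, and the initial bump are synthetic stand-ins, whereas the paper uses finite elements and the Navier–Stokes velocity of (32) on the domain of Figure 1.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, kappa, dt, nt = 50, 1e-3, 0.02, 40       # grid per side, diffusion, time step, steps
h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(x, x, indexing="ij")
vx, vy = Y - 0.5, 0.5 - X                   # rigid rotation about the center, div-free

I = sp.identity(n, format="csr")
D1 = sp.diags([-1.0, 1.0], [-1, 1], shape=(n, n)).tolil()   # centered 1st derivative
D1[0, :2] = [-2.0, 2.0]; D1[-1, -2:] = [-2.0, 2.0]          # one-sided at the walls
D1 = D1.tocsr() / (2.0 * h)
D2 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)).tolil()
D2[0, :2] = [-2.0, 2.0]; D2[-1, -2:] = [2.0, -2.0]          # Neumann via ghost points
D2 = D2.tocsr() / h**2

Lap = sp.kron(D2, I) + sp.kron(I, D2)       # 2D Laplacian
Dx, Dy = sp.kron(D1, I), sp.kron(I, D1)     # 2D first derivatives
Adv = sp.diags(vx.ravel()) @ Dx + sp.diags(vy.ravel()) @ Dy
A = sp.identity(n * n) + dt * (Adv - kappa * Lap)   # (I + dt (v.grad - k Lap))

m = np.exp(-100.0 * ((X - 0.35)**2 + (Y - 0.7)**2))  # a synthetic bump as u(.,0) = m
u = m.ravel()
solve = spla.factorized(A.tocsc())          # factor once, reuse every time step
for _ in range(nt):
    u = solve(u)                            # implicit Euler: (I + dt L) u^{j+1} = u^j
```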
Figure 2: Contaminant concentration at time $T = 0.8$, obtained as the solution of (31) for the initial condition shown in Figure 1. The QoI maps ($P_1$, $P_2$, $P_3$) average the solution within lines along the left, the right, and both buildings. Candidate sensor locations are shown as circles (9 left and 75 right).

The solution of the PDE for $d_m = 2023$ and $N_t = 40$ is shown in Figure 2 at the observation time $T = 0.8$. The $d$ candidate sensor locations, at which we observe the contaminant concentration $u$, are also shown in Figure 2. The linear map $F$ is defined by the predicted data, i.e., the concentrations at the selected sensors. Finally, we take the QoI as the averaged contaminant concentration at time $t_{\rm pred}$ within a distance $\delta = 0.02$ from the boundaries of either the left, the right, or both buildings, with corresponding QoI maps denoted by $P_1$, $P_2$, $P_3$ (see Figure 2).

4.2 Optimal sensor placement

We first consider the case of a small number of candidate sensors, for which we can use exhaustive search to find the optimal sensor combination and compare it with the sensors chosen by the standard and swapping greedy algorithms. Specifically, we use a $3\times 3$ grid of $d = 9$ candidate locations $\{s_i\}_{i=1}^9$, shown in Figure 2 (left), with the goal of choosing $r = 2, 3, \dots, 8$ sensors for prediction time $t_{\rm pred} = 1.0$. We compute the matrices $H^\rho_d$ and $\Delta H_d$ (of size $9\times 9$).
Figure 3: Approximate EIG $\hat\Psi_\rho$ at the $r$ sensors chosen by the standard and swapping greedy algorithms, together with the distribution of $\hat\Psi_\rho$ over all possible combinations of the 9 candidate sensors. The three plots correspond to the QoI maps $P_1$, $P_2$, and $P_3$.
4, only swapping greedy finds the optimal design. Moreover, an increasein r leads to diminishing returns, as the gain in information about the QoI from additional sensorssaturates. We see that ∼ r optimal sensors, r = 5 , , , , , , , ,
60, from among the 75 candidates. Results are shown in Figure 4.
Figure 4: Approximate EIG $\hat\Psi_\rho$ for $r$ out of 75 sensors, found by the standard and swapping greedy algorithms, compared with the distribution of $\hat\Psi_\rho$ for 200 randomly chosen sets of $r$ sensors. The three plots correspond to the QoI maps $P_1$, $P_2$, and $P_3$.

We see that both greedy algorithms find designs with larger EIG than all random choices. Moreover, for small $r$, the swapping greedy algorithm finds better designs than the standard greedy algorithm; for large $r$, both greedy algorithms find designs with similar EIG. In fact, multiple designs with similar EIG become more likely as $r$ grows.

To demonstrate the reduction of computational cost achieved by the offline-online decomposition, we report in Table 1 the number of swaps performed by the swapping greedy algorithm (Algorithm 1) for 75 candidate sensors and different target numbers of sensors. We observe that the number of swapping loops at convergence is mostly 3, that most of the swaps take place in the first loop, followed by a smaller number of swaps in the second loop resulting in slight sensor adjustments, and that there are no swaps in the last loop, which we require as a convergence criterion. As a result of the offline-online decomposition (Theorem 3.2), which relieves the (thousands of) EIG evaluations of expensive PDE solves once the low-rank approximation (23) is built, we achieve over 1000X speedup. This is because the PDE solves overwhelmingly dominate the overall cost, and because the offline stage is computed at a cost comparable to a single direct EIG evaluation by (18).

Table 1: Number of swaps performed by Algorithm 1 for $r$ selected sensors out of 75 candidates, for the goal $P_3$.
r        5    10   15   20   25   30   40   50   60
swaps    41   73   124  164  190  194  235  199  119
Figure 5 illustrates the effect of the goal on the optimally placed sensors. Specifically, for the parameter-to-QoI maps $P_1$, $P_2$, $P_3$, which quantify the average contaminant concentration at time $t_{\rm pred} = 1$ around the left, the right, and both buildings, the goal-oriented OED finds the sensors depicted in the first row of Figure 5. For $P_3$ at prediction times $t_{\rm pred} = 1, 2, 4, 8$, we see in the bottom row of Figure 5 that the optimal sensors are no longer placed in the immediate vicinity of the buildings, but instead are increasingly dispersed to better detect the now more diffused field. Finally, the ability of GOOED to reduce the posterior variance of the initial condition field is depicted in Figure 6 for the different goals $P_1$, $P_2$, $P_3$. Compared to a random design (lower right), the three optimal designs lead to lower variance in the regions surrounding the goals of interest.

Figure 5: Sensor locations chosen by the swapping greedy algorithm for 10 out of 75 candidates, for the parameter-to-QoI maps $P_1$, $P_2$, $P_3$ at time $t_{\rm pred} = 1$ (top row), and for $P_3$ at times $t_{\rm pred} = 2, 4, 8$ (bottom row).

Figure 6: Pointwise posterior variance of the parameter at the optimal designs for goals $P_1$, $P_2$, $P_3$, compared to a random design, for 10 sensors. The darker regions represent lower variance.

4.3 Eigenvalue decay and dimension independence

Here we demonstrate the fast decay of the eigenvalues of $H^\rho_d$ and $\Delta H_d$ with respect to the parameter and data dimensions, as exploited by the algorithms of Section 3.3. For $H^\rho_d$ defined in (21), we have $\mathrm{rank}(H^\rho_d) \le \min(d_\rho, d)$ with QoI dimension $d_\rho$ and data dimension $d$. In practice, the QoI is often an averaged quantity with small $d_\rho$, so the rank of $H^\rho_d$ is also small; in our tests, $\mathrm{rank}(H^\rho_d) = d_\rho = 1$. For $\Delta H_d = H_d - H^\rho_d$ with $H_d = F_d\Gamma_{\rm pr}F_d^*$, the spectrum of $\Delta H_d$ depends on that of $H_d$, which typically exhibits fast decay due to the ill-posedness of the inverse problem. As observed in the left plot of Figure 7, the eigenvalues of $\Delta H_d$ decay very rapidly and independently of the parameter dimension, which implies that the required number of PDE solves is small and independent of the parameter dimension while achieving the same absolute accuracy of the approximate EIG by Theorem 3.3. The right plot of Figure 7 likewise shows rapid decay of the eigenvalues, with diminishing returns as the number of candidate sensors increases, suggesting that the number of PDE solves is asymptotically independent of the data dimension for the same relative accuracy of the approximate EIG. These plots suggest that $O(100)$ PDE solves are required to accurately capture the information gained about the parameter field and QoI from the data, regardless of the parameter or sensor dimensions, when using the randomized SVD (Algorithm 2).

Figure 7: Decay of the eigenvalues of $\Delta H_d$ with increasing parameter dimension, $d_m = 248, 398, 897, 1525, 2685, 5895$ (left), and data (candidate sensor) dimension, $d = 7, 13, 32, 58, 124, 232, 506, 907, 1566$ (right).
5 Conclusions

We have developed a fast and scalable computational framework for goal-oriented linear Bayesian optimal experimental design governed by expensive models. Repeated fast evaluation of an (arbitrarily accurate) approximate EIG, free of model evaluations, is made possible by an offline-online decomposition and by low-rank approximation of certain operators informed by the parameter, the data, and the predictive goals of interest. Scalability, as measured by parameter- and data-dimension independence of the number of model evaluations, is achieved by exploiting the GOOED problem's intrinsic low dimensionality, manifested in the rapid spectral decay of several critical operators. To justify the low-rank approximation of these operators in computing the EIG, we proved an upper bound on the approximation error in terms of the operators' truncated eigenvalues. Moreover, we proposed a new swapping greedy algorithm that is demonstrated to be more effective than the standard greedy algorithm in our experiments. Numerical experiments on optimal sensor placement for Bayesian inference of the initial condition of an advection-diffusion PDE demonstrated over 1000X speedups (measured in PDE solves). Future work includes extension to nonlinear Bayesian GOOED problems with nonlinear parameter-to-observable and parameter-to-QoI maps.

References

[1] Alen Alexanderian, Philip J. Gloor, and Omar Ghattas. On Bayesian A- and D-optimal experimental designs in infinite dimensions. Bayesian Analysis, 11(3):671–695, 2016.
[2] Alen Alexanderian, Noemi Petra, Georg Stadler, and Omar Ghattas. A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized ℓ0-sparsification. SIAM Journal on Scientific Computing, 36(5):A2122–A2148, 2014.
[3] Alen Alexanderian, Noemi Petra, Georg Stadler, and Omar Ghattas. A fast and scalable method for A-optimal design of experiments for infinite-dimensional Bayesian nonlinear inverse problems. SIAM Journal on Scientific Computing, 38(1):A243–A272, 2016.
[4] Alen Alexanderian, Noemi Petra, Georg Stadler, and Omar Ghattas. Mean-variance risk-averse optimal control of systems governed by PDEs with random parameter fields using quadratic approximations. SIAM/ASA Journal on Uncertainty Quantification, 5(1):1166–1192, 2017.
[5] Alen Alexanderian and Arvind K. Saibaba. Efficient D-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 40(5):A2956–A2985, 2018.
[6] Nick Alger, Peng Chen, and Omar Ghattas. Tensor train construction from tensor actions, with application to compression of large high order derivative tensors. SIAM Journal on Scientific Computing, 42(5):A3516–A3539, 2020.
[7] Ilona Ambartsumyan, Wajih Boukaram, Tan Bui-Thanh, Omar Ghattas, David Keyes, Georg Stadler, George Turkiyyah, and Stefano Zampini. Hierarchical matrix approximations of Hessians arising in inverse problems governed by PDEs. SIAM Journal on Scientific Computing, 42(5):A3397–A3426, 2020.
[8] Ahmed Attia, Alen Alexanderian, and Arvind K. Saibaba. Goal-oriented optimal design of experiments for large-scale Bayesian linear inverse problems. Inverse Problems, 34(9):095009, 2018.
[9] Tim Bakker, Herke van Hoof, and Max Welling. Experimental design for MRI by greedy policy search. Advances in Neural Information Processing Systems, 33, 2020.
[10] O. Bashir, K. Willcox, O. Ghattas, B. van Bloemen Waanders, and J. Hill. Hessian-based model reduction for large-scale systems with initial condition inputs. International Journal for Numerical Methods in Engineering, 73:844–868, 2008.
[11] Andrew An Bian, Joachim M. Buhmann, Andreas Krause, and Sebastian Tschiatschek. Guarantees for greedy maximization of non-submodular functions with applications. In Proceedings of the 34th International Conference on Machine Learning, pages 498–507. JMLR.org, 2017.
[12] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. CoRR, abs/0812.4293, 2008.
[13] Tan Bui-Thanh, Carsten Burstedde, Omar Ghattas, James Martin, Georg Stadler, and Lucas C. Wilcox. Extreme-scale UQ for Bayesian inverse problems governed by PDEs. In SC12: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
[14] Tan Bui-Thanh, Omar Ghattas, James Martin, and Georg Stadler. A computational framework for infinite-dimensional Bayesian inverse problems Part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, 35(6):A2494–A2523, 2013.
[15] Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, 10(3):273–304, 1995.
[16] P. Chen, U. Villa, and O. Ghattas. Hessian-based adaptive sparse quadrature for infinite-dimensional Bayesian inverse problems. Computer Methods in Applied Mechanics and Engineering, 327:147–172, 2017.
[17] Peng Chen and Omar Ghattas. Hessian-based sampling for high-dimensional model reduction. International Journal for Uncertainty Quantification, 9(2), 2019.
[18] Peng Chen and Omar Ghattas. Projected Stein variational gradient descent. In Advances in Neural Information Processing Systems, 2020.
[19] Peng Chen, Michael R. Haberman, and Omar Ghattas. Optimal design of acoustic metamaterial cloaks under uncertainty. Journal of Computational Physics, page 110114, 2021.
[20] Peng Chen, Umberto Villa, and Omar Ghattas. Taylor approximation and variance reduction for PDE-constrained optimal control under uncertainty. Journal of Computational Physics, 385:163–186, 2019.
[21] Peng Chen, Keyi Wu, Joshua Chen, Thomas O'Leary-Roseberry, and Omar Ghattas. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. Advances in Neural Information Processing Systems, 2019.
[22] Peng Chen, Keyi Wu, and Omar Ghattas. Bayesian inference of heterogeneous epidemic models: Application to COVID-19 spread accounting for long-term care facilities. arXiv preprint arXiv:2011.01058, 2020.
[23] Benjamin Crestel, Alen Alexanderian, Georg Stadler, and Omar Ghattas. A-optimal encoding weights for nonlinear inverse problems, with application to the Helmholtz inverse problem. Inverse Problems, 33(7):074008, 2017.
[24] Yair Daon and Georg Stadler. Mitigating the influence of boundary conditions on covariance operators derived from elliptic PDEs. Inverse Problems and Imaging, 12(5):1083–1102, 2018.
[25] Angelie R. Ferrolino, Jose Ernie C. Lope, and Renier G. Mendoza. Optimal location of sensors for early detection of tsunami waves. In International Conference on Computational Science, pages 562–575. Springer, 2020.
[26] Pearl H. Flath, Lucas C. Wilcox, Volkan Akçelik, Judy Hill, Bart van Bloemen Waanders, and Omar Ghattas. Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations. SIAM Journal on Scientific Computing, 33(1):407–432, 2011.
[27] Adam Foster, Martin Jankowiak, Elias Bingham, Paul Horsfall, Yee Whye Teh, Thomas Rainforth, and Noah Goodman. Variational Bayesian optimal experimental design. In Advances in Neural Information Processing Systems, volume 32, pages 14036–14047. Curran Associates, Inc., 2019.
[28] Nathan Halko, Per Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[29] Xun Huan and Youssef M. Marzouk. Simulation-based optimal Bayesian experimental design for nonlinear systems. Journal of Computational Physics, 232(1):288–317, 2013.
[30] Xun Huan and Youssef M. Marzouk. Gradient-based stochastic optimization methods in Bayesian experimental design. International Journal for Uncertainty Quantification, 4(6):479–510, 2014.
[31] Xun Huan and Youssef M. Marzouk. Sequential Bayesian optimal experimental design via approximate dynamic programming. arXiv preprint arXiv:1604.08320, 2016.
[32] Tobin Isaac, Noemi Petra, Georg Stadler, and Omar Ghattas. Scalable and efficient algorithms for the propagation of uncertainty from data through inference to prediction for large-scale problems, with application to flow of the Antarctic ice sheet. Journal of Computational Physics, 296:348–368, September 2015.
[33] Jayanth Jagalur-Mohan and Youssef Marzouk. Batch greedy maximization of non-submodular functions: Guarantees and applications to experimental design. arXiv preprint arXiv:2006.04554, 2020.
[34] Kirthevasan Kandasamy, Willie Neiswanger, Reed Zhang, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. Myopic posterior sampling for adaptive goal oriented design of experiments. In International Conference on Machine Learning, pages 3222–3232. PMLR, 2019.
[35] Steven Kleinegesse and Michael U. Gutmann. Bayesian experimental design for implicit models by mutual information neural estimation. In International Conference on Machine Learning, pages 5316–5326. PMLR, 2020.
[36] Quan Long, Marco Scavino, Raúl Tempone, and Suojin Wang. Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Computer Methods in Applied Mechanics and Engineering, 259:24–39, 2013.
[37] Nora Loose and Patrick Heimbach. Leveraging uncertainty quantification to design ocean climate observing systems. Journal of Advances in Modeling Earth Systems, pages 1–29, January 2021.
[38] T. O'Leary-Roseberry, U. Villa, P. Chen, and O. Ghattas. Derivative-informed projected neural networks for high-dimensional parametric maps governed by PDEs. https://arxiv.org/abs/2011.15110, 2020.
[39] Noemi Petra, James Martin, Georg Stadler, and Omar Ghattas. A computational framework for infinite-dimensional Bayesian inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM Journal on Scientific Computing, 36(4):A1525–A1555, 2014.
[40] Constantine Pozrikidis. An Introduction to Grids, Graphs, and Networks. Oxford University Press, 2014.
[41] Arvind K. Saibaba, Alen Alexanderian, and Ilse C. F. Ipsen. Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395, 2017.
[42] A. Spantini, T. Cui, K. Willcox, L. Tenorio, and Y. M. Marzouk. Goal-oriented optimal approximations of Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 39(5):S167–S196, 2017.
[43] Keyi Wu, Peng Chen, and Omar Ghattas. A fast and scalable computational framework for large-scale and high-dimensional Bayesian optimal experimental design. arXiv preprint arXiv:2010.15196, 2020.
[44] Man-Chung Yue. A matrix generalization of the Hardy–Littlewood–Pólya rearrangement inequality and its applications. arXiv preprint arXiv:2006.08144, 2020.
[45] Sue Zheng, David Hayden, Jason Pacheco, and John W. Fisher III. Sequential Bayesian experimental design with variable cost structure. Advances in Neural Information Processing Systems, 33, 2020.

Appendix A: Proof of Theorem 3.2
We begin with the Weinstein–Aronszajn identity, Proposition 1, which is proven in [40].
Proposition 1. Let $A$ and $B$ be matrices of size $m\times n$ and $n\times m$, respectively. Then
$\det(I_{n\times n} + BA) = \det(I_{m\times m} + AB)$.   (33)
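A quick numerical check of (33) on random rectangular matrices (ours, purely illustrative):

```python
import numpy as np
rng = np.random.default_rng(1)
m, n = 4, 7
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))
# det(I_n + BA) = det(I_m + AB), Proposition 1
assert np.isclose(np.linalg.det(np.eye(n) + B @ A),
                  np.linalg.det(np.eye(m) + A @ B))
```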
Proof of Theorem 3.2. Consider the design problem defined in Section 2.2.2. For each design with design matrix $W$, we have
$F(W) = WF_d, \quad \Gamma_n(W) = W\Gamma_n^dW^T$.   (34)
We can then reformulate $\Psi_\rho$ from the definition in (18) as
$\Psi_\rho(W) = \tfrac12\log\det(I + \widetilde H^\rho_m) = \tfrac12\log\det\left(I + \Sigma_{\rm pr}^{1/2}(WF_dP^\dagger)^*\Gamma_\eta^{-1}(W)(WF_dP^\dagger)\Sigma_{\rm pr}^{1/2}\right)$,   (35)
where
$\Gamma_\eta(W) = \Gamma_n(W) + F(W)(\Gamma_{\rm pr} - \Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}P\Gamma_{\rm pr})F^*(W) = W\Gamma_n^dW^T + WF_d(\Gamma_{\rm pr} - \Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}P\Gamma_{\rm pr})F_d^*W^T = W\big(\Gamma_n^d + \underbrace{F_d\Gamma_{\rm pr}F_d^*}_{:=H_d} - \underbrace{F_d\Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}P\Gamma_{\rm pr}F_d^*}_{:=H^\rho_d}\big)W^T = W\big(\Gamma_n^d + \underbrace{H_d - H^\rho_d}_{:=\Delta H_d}\big)W^T$.   (36)
It follows that
$\Psi_\rho(W) = \tfrac12\log\det\left(I + \Sigma_{\rm pr}^{1/2}(WF_dP^\dagger)^*LL^T(WF_dP^\dagger)\Sigma_{\rm pr}^{1/2}\right) = \tfrac12\log\det\left(I + L^T(WF_dP^\dagger)\Sigma_{\rm pr}(WF_dP^\dagger)^*L\right) = \tfrac12\log\det\left(I + L^TWF_d\Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}\Sigma_{\rm pr}\Sigma_{\rm pr}^{-1}P\Gamma_{\rm pr}F_d^*W^TL\right) = \tfrac12\log\det\left(I + L^TWH^\rho_dW^TL\right)$,   (37)
where we use the Cholesky factorization $\Gamma_\eta^{-1} = LL^T$ in the first equality, Proposition 1 in the second, the definition of $P^\dagger$ from (16) in the third, and the definition of $H^\rho_d$ from (21) in the last. ∎

Appendix B: Proof of Theorem 3.3

We first collect necessary properties that are proven in [43] (Proposition 2), [5] (Proposition 3), and [44] (Proposition 4).
Proposition 2. Let $A$ and $B$ be matrices of size $m\times n$ and $n\times m$, respectively. Then $AB$ and $BA$ have the same non-zero eigenvalues.

Proposition 3. Let $A, B \in \mathbb{C}^{n\times n}$ be Hermitian positive semidefinite with $A \succeq B$ (i.e., $A - B$ is Hermitian positive semidefinite). Then
$0 \le \log\det(I + A) - \log\det(I + B) \le \log\det(I + A - B)$.   (38)

Proposition 4. Let $f: \mathbb{R}_+ \to \mathbb{R}$ be a continuous function that is differentiable on $\mathbb{R}_+$ (with $x \ge 0$ for $x \in \mathbb{R}_+$), such that $x \mapsto xf'(x)$ is monotonically increasing on $\mathbb{R}_+$. Then for any matrices $A, B \in \mathbb{R}^{n\times m}$, it holds that
$\sum_{i=1}^n f(\upsilon_i(AB^T)) \le \sum_{i=1}^n f(\upsilon_i(A)\upsilon_i(B))$,   (39)
where $\upsilon_i(\cdot)$ denotes the singular values of a matrix, sorted in non-increasing order.

Lemma 5. Let $A \in \mathbb{R}^{n\times m}$ and $B \in \mathbb{R}^{m\times m}$, with $A^TA$ and $B$ Hermitian positive semidefinite. Then
$\log\det(I + ABA^T) \le \sum_{i=1}^m \log(1 + \upsilon_i(A^TA)\upsilon_i(B))$.   (40)

Proof. Since $\log\det(I + ABA^T) = \sum_{i=1}^n \log(1 + \upsilon_i(ABA^T)) = \sum_{i=1}^n \log(1 + \upsilon_i(AB^{1/2})^2)$, applying Proposition 4 with $f(x) = \log(1 + x^2)$, which satisfies its assumptions, gives
$\sum_{i=1}^n \log(1 + \upsilon_i(AB^{1/2})^2) \le \sum_{i=1}^n \log(1 + \upsilon_i(A)^2\upsilon_i(B^{1/2})^2) = \sum_{i=1}^m \log(1 + \upsilon_i(A^TA)\upsilon_i(B))$.   (41)
∎

We are now in a position to prove Theorem 3.3.
Proof. Denote the eigenvalue decompositions of $H^\rho_d$ and $\Delta H_d$ as
$H^\rho_d = U_kZ_kU_k^T + U_\perp Z_\perp U_\perp^T, \quad \Delta H_d = V_l\Lambda_lV_l^T + V_\perp\Lambda_\perp V_\perp^T$,   (42)
where $(U_k, Z_k)$, $(V_l, \Lambda_l)$ are the dominant eigenpairs and $(U_\perp, Z_\perp)$, $(V_\perp, \Lambda_\perp)$ the remaining ones. By the triangle inequality,
$|\Psi_\rho(W) - \hat\Psi_\rho(W)| \le \underbrace{\left|\tfrac12\log\det(I_r + L^TWH^\rho_dW^TL) - \tfrac12\log\det(I_r + L^TW\hat H^\rho_dW^TL)\right|}_{(a)} + \underbrace{\left|\tfrac12\log\det(I_r + L^TW\hat H^\rho_dW^TL) - \tfrac12\log\det(I_r + \hat L^TW\hat H^\rho_dW^T\hat L)\right|}_{(b)}$.   (43)
We first bound (a). By Proposition 3, noting that $H^\rho_d - \hat H^\rho_d = U_\perp Z_\perp U_\perp^T$ is Hermitian positive semidefinite, we have
$(a) \le \tfrac12\log\det\left(I_r + L^TW(H^\rho_d - \hat H^\rho_d)W^TL\right) = \tfrac12\log\det\left(I_r + L^TWU_\perp Z_\perp U_\perp^TW^TL\right)$.   (44)
Applying Proposition 1,
$(a) \le \tfrac12\log\det\left(I_{d-k} + Z_\perp^{1/2}U_\perp^TW^TLL^TWU_\perp Z_\perp^{1/2}\right) = \tfrac12\log\det\left(I_{d-k} + Z_\perp^{1/2}U_\perp^TW^T\big(W(\Gamma_n^d + \Delta H_d)W^T\big)^{-1}WU_\perp Z_\perp^{1/2}\right)$.   (45)
Applying Lemma 5 with $A = Z_\perp^{1/2}U_\perp^TW^T$ and $B = \big(W(\Gamma_n^d + \Delta H_d)W^T\big)^{-1}$,
$(a) \le \tfrac12\sum_i \log\left(1 + \upsilon_i(WU_\perp Z_\perp U_\perp^TW^T)\,\upsilon_i\big((W(\Gamma_n^d + \Delta H_d)W^T)^{-1}\big)\right)$.   (46)
By Proposition 3.1, $\Delta H_d = \mathrm{Cov}[F_d(I - P^\dagger P)m]$ is a covariance matrix and hence positive semidefinite, so the smallest eigenvalue of $W(\Gamma_n^d + \Delta H_d)W^T$ is no smaller than the smallest eigenvalue of $\Gamma_n^d$, i.e., $\upsilon_i\big((W(\Gamma_n^d + \Delta H_d)W^T)^{-1}\big) \le 1/\sigma_{\min}^2$. Moreover, $\upsilon_i(WU_\perp Z_\perp U_\perp^TW^T) \le \upsilon_i(U_\perp Z_\perp U_\perp^T) = \zeta_{k+i}$. Thus
$(a) \le \frac12\sum_{i=k+1}^d \log(1 + \zeta_i/\sigma_{\min}^2)$.   (47)
We then turn to the second part (b). With Propositions 1 and 3,
$(b) = \left|\tfrac12\log\det\big(I_k + Z_k^{1/2}U_k^TW^T\hat L\hat L^TWU_kZ_k^{1/2}\big) - \tfrac12\log\det\big(I_k + Z_k^{1/2}U_k^TW^TLL^TWU_kZ_k^{1/2}\big)\right| \le \tfrac12\log\det\Big(I_k + Z_k^{1/2}U_k^TW^T\underbrace{(\hat L\hat L^T - LL^T)}_{(c)}WU_kZ_k^{1/2}\Big)$,   (48)
where $(c) = \big(W(\Gamma_n^d + \Delta\hat H_d)W^T\big)^{-1} - \big(W(\Gamma_n^d + \Delta H_d)W^T\big)^{-1}$. Using the identity $(A + B)^{-1} = A^{-1} - A^{-1}B(A + B)^{-1}$ with $A = W(\Gamma_n^d + \Delta\hat H_d)W^T$ and $B = W(\Delta H_d - \Delta\hat H_d)W^T = WV_\perp\Lambda_\perp V_\perp^TW^T$, we obtain
$(c) = \big(W(\Gamma_n^d + \Delta\hat H_d)W^T\big)^{-1}\,WV_\perp\Lambda_\perp V_\perp^TW^T\,\big(W(\Gamma_n^d + \Delta H_d)W^T\big)^{-1}$.   (49)
Hence, by Proposition 1,
$(b) \le \tfrac12\log\det\left(I_{d-l} + \Lambda_\perp^{1/2}V_\perp^TW^T\big(W(\Gamma_n^d + \Delta\hat H_d)W^T\big)^{-1}WU_kZ_kU_k^TW^T\big(W(\Gamma_n^d + \Delta H_d)W^T\big)^{-1}WV_\perp\Lambda_\perp^{1/2}\right)$.   (50)
Applying Lemma 5,
$(b) \le \tfrac12\sum_i \log\left(1 + \upsilon_i(WV_\perp\Lambda_\perp V_\perp^TW^T)\,\upsilon_i\big((W(\Gamma_n^d + \Delta\hat H_d)W^T)^{-1}WU_kZ_kU_k^TW^T(W(\Gamma_n^d + \Delta H_d)W^T)^{-1}\big)\right) \le \frac12\sum_{i=l+1}^k \log(1 + \lambda_i\zeta_1/\sigma_{\min}^4)$,   (51)
where in the last inequality we used, for $i \le k$,
$\upsilon_i\big((W(\Gamma_n^d + \Delta\hat H_d)W^T)^{-1}WU_kZ_kU_k^TW^T(W(\Gamma_n^d + \Delta H_d)W^T)^{-1}\big) \le \upsilon_1\big((W(\Gamma_n^d + \Delta\hat H_d)W^T)^{-1}\big)\,\upsilon_1(WU_kZ_kU_k^TW^T)\,\upsilon_1\big((W(\Gamma_n^d + \Delta H_d)W^T)^{-1}\big) \le \zeta_1/\sigma_{\min}^4$,   (52)
while the terms vanish for $i > k$ since $Z_k$ has rank at most $k$. Combining (47) and (51),
$|\Psi_\rho(W) - \hat\Psi_\rho(W)| \le (a) + (b) \le \frac12\sum_{i=k+1}^d \log(1 + \zeta_i/\sigma_{\min}^2) + \frac12\sum_{i=l+1}^k \log(1 + \lambda_i\zeta_1/\sigma_{\min}^4)$.   (53)
∎

Appendix C: Low-rank approximation
To compute the low-rank approximations of $\Delta H_d$ and $H^\rho_d$ described in Section 3.3, we present the randomized SVD algorithm applied to these two quantities. Recall the explicit forms
$H^\rho_d = F_d\Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}P\Gamma_{\rm pr}F_d^*, \quad \Delta H_d = F_d\Gamma_{\rm pr}F_d^* - F_d\Gamma_{\rm pr}P^*\Sigma_{\rm pr}^{-1}P\Gamma_{\rm pr}F_d^*$.   (54)
Algorithm 2: Randomized SVD to compute the $k$ dominant eigenpairs of a symmetric matrix $H$

1. Generate an i.i.d. Gaussian matrix $\Omega \in \mathbb{R}^{d\times(k+p)}$ with a small oversampling parameter $p$ (e.g., $p = 10$).
2. Compute $Y = H\Omega$.
3. Compute the QR factorization $Y = QR$ with $Q^TQ = I$.
4. Compute $B = Q^THQ$.
5. Solve the eigenvalue problem $B = Z\Sigma Z^T$.
6. Form $U_k = QZ[:, 1{:}k]$ and $\Sigma_k = \Sigma[1{:}k, 1{:}k]$.

We see that this is a matrix-free eigensolver: steps 2 and 4 require the action of $\Delta H_d$ on $O(2(l + p))$ vectors and of $H^\rho_d$ on $O(2(k + p))$ vectors. In terms of total actions, this requires $2(2l + k + p)$ applications of the forward operator $F$, $2(l + k + p)$ of its adjoint $F^*$, and $2(k + l + p)$ applications of the prediction operator $P$ and of its adjoint $P^*$.
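A sketch of Algorithm 2 follows (ours), written matrix-free in the style of [28]; `matvec` stands in for the operator actions, which in the paper require forward and adjoint PDE solves.

```python
import numpy as np

def randomized_eig(matvec, d, k, p=10, rng=np.random.default_rng(0)):
    Omega = rng.standard_normal((d, k + p))   # step 1: Gaussian test matrix
    Y = matvec(Omega)                         # step 2: sample the range of H
    Q, _ = np.linalg.qr(Y)                    # step 3: orthonormal range basis
    B = Q.T @ matvec(Q)                       # step 4: small projected matrix
    lam, Z = np.linalg.eigh(B)                # step 5: dense eigensolve
    lam, Z = lam[::-1], Z[:, ::-1]            # descending order
    return lam[:k], Q @ Z[:, :k]              # step 6: k dominant eigenpairs

# Usage on a synthetic symmetric matrix with rapidly decaying spectrum:
rng = np.random.default_rng(1)
d, k = 500, 20
Qd, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = (Qd * 2.0 ** (-np.arange(d, dtype=float))) @ Qd.T
lam, U = randomized_eig(lambda X: H @ X, d, k)
```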
For the contaminant problem of Section 4.1, the concentration field $u(x, t)$ is governed by
$u_t - k\Delta u + v\cdot\nabla u = 0 \text{ in } D\times(0,T), \quad u(\cdot, 0) = m \text{ in } D, \quad k\nabla u\cdot n = 0 \text{ on } \partial D\times(0,T)$,   (55)
and we can form the parameter-to-observable map $Fm$ as the discretized value of $\mathcal{B}u(m)$, where $\mathcal{B}$ is the pointwise observation operator. The adjoint problem is a terminal value problem, which can be solved backwards in time,
$-p_t - \nabla\cdot(pv) - k\Delta p = \mathcal{B}^*y \text{ in } D\times(0,T), \quad p(\cdot, T) = 0 \text{ in } D, \quad (pv + k\nabla p)\cdot n = 0 \text{ on } \partial D\times(0,T)$.   (56)
We can then define the adjoint of the parameter-to-observable map, $F^*y$, as the discretized value of $p(x, 0)$ for any $y$.