Unscented Kalman Inversion
Daniel Z. Huang, Tapio Schneider, Andrew M. Stuart
California Institute of Technology, Pasadena, CA
Abstract
A useful approach to solve inverse problems is to pair the parameter-to-data map with a stochastic dynamical system for the parameter, and then employ techniques from filtering to estimate the parameter given the data. Three classical approaches to filtering of nonlinear systems are the extended, ensemble and unscented Kalman filters. The extended Kalman filter (which we refer to as ExKI in the context of inverse problems) is impractical when the forward map is not readily differentiable and given as a black box, and also for high dimensional parameter spaces because of the need to propagate large covariance matrices. Ensemble Kalman inversion (EKI) has emerged as a useful tool which overcomes both of these issues: it is derivative-free and works with a low-rank covariance approximation formed from the ensemble. In this paper, we demonstrate that unscented Kalman methods also provide an effective tool for derivative-free inversion in the setting of black-box forward models, introducing unscented Kalman inversion (UKI). Theoretical analysis is provided for linear inverse problems, and a smoothing property of the data misfit, under the unscented transform used to define the UKI as a modification of the ExKI, is explained. We provide numerical experiments, including proof-of-concept linear examples and various applications: learning of permeability parameters in subsurface flow; learning the damage field from structure deformation; learning the Navier-Stokes initial condition from solution data at positive times; and learning subgrid-scale parameters in a general circulation model (GCM) from time-averaged statistics. The theory and experiments show that the UKI outperforms the EKI on parameter learning problems with moderate numbers of parameters and outperforms the ExKI on problems where the forward model is not readily differentiable, or where the derivative is very sensitive. In particular, UKI-based methods are of value for parameter estimation problems in which the number of parameters is moderate but the forward model is expensive and provided as a black box which is impractical to differentiate.
Keywords:
Inverse Problem, Optimization, Sampling, Filtering, Extended Kalman Methods, Ensemble Kalman Methods, Unscented Kalman Methods
1. Introduction
This paper is concerned with inversion of the map G : R^{N_θ} → R^{N_y}, in the presence of noise. We assume we are given y ∈ R^{N_y} and wish to recover θ ∈ R^{N_θ} from the relation

    y = G(θ) + η.    (1)

The distribution of η, the observational noise, is assumed known, but not its value; to be concrete we will assume that it is drawn from a Gaussian with distribution N(0, Σ_η). Central to both the optimization and probabilistic approaches to inversion is the regularized objective function Φ_R(θ) defined by

    Φ_R(θ) := Φ(θ) + (1/2)‖Σ_0^{-1/2}(θ − r_0)‖²,    (2a)
    Φ(θ) := (1/2)‖Σ_η^{-1/2}(y − G(θ))‖²,    (2b)

where Σ_η normalizes the model-data misfit Φ by means of the known error statistics of the noise, and r_0 and Σ_0 encode prior mean and covariance information about θ. We assume that Σ_η is strictly positive-definite, denoted Σ_η ≻ 0, and similarly that Σ_0 ≻ 0.

The focus of this paper is mainly on derivative-free inversion by means of iterative techniques aimed at solving the optimization problem defined by minimization of Φ_R, or variants of this problem [1]. However, even in the optimization setting, the methods introduced in this paper are closely related to iterative methods applied in Bayesian (probabilistic) inversion, which we first describe. In the Bayesian approach to the inverse problem [2, 3] the posterior distribution is given by

    µ(dθ) = (1/Z) exp(−Φ(θ)) µ_0(dθ),    (3)

where µ_0 = N(r_0, Σ_0) is the prior and µ the posterior. A commonly adopted iterative approach to solve the problem of sampling from µ is sequential Monte Carlo (SMC) [4], in which the measures µ_n defined by

    µ_{n+1}(dθ) = (1/Z_n) exp(−hΦ(θ)) P_n µ_n(dθ)    (4)

are successively approximated by ensembles. In SMC, P_n is chosen to be a µ_n-invariant Markov kernel so that P_n µ_n = µ_n. Note, then, that if Nh = 1 it follows that µ_N = µ. Thus the successive ensembles approximate µ after N steps. Application of this methodology to solve inverse problems may be found in [5]. On the other hand, if h = 1 is fixed and the measures µ_n are studied in the limit of large n, they will tend to concentrate on minimizers of Φ, restricted to the support of µ_0, providing an approach to solving an unregularized inverse problem; this follows from the identity

    µ_n(dθ) = (1 / ∏_{ℓ=0}^{n−1} Z_ℓ) exp(−nΦ(θ)) µ_0(dθ).    (5)

We will build on the latter, optimization, approach to the problem. However, we note that, other than restriction of µ_n to the support of µ_0, regularization is lost in this approach since it focuses on minimizing Φ(·) and not Φ_R. To address this issue we will choose P_j to be the Markov kernel associated with a first-order autoregressive (AR1) process. The resulting dynamic on measures may be understood by considering the stochastic dynamical system

    evolution:      θ_{n+1} = r + α(θ_n − r) + ω_{n+1},   ω_{n+1} ∼ N(0, Σ_ω),    (6a)
    observation:    y_{n+1} = G(θ_{n+1}) + ν_{n+1},       ν_{n+1} ∼ N(0, Σ_ν).    (6b)

We assume that the regularization parameter α ∈ (0, 1], that Σ_ω ≻
0, and the artificial observation error covariance Σ_ν ≻ 0; r is an arbitrary vector. (We will also write A ⪯ B when B − A is positive semi-definite, and A ≺ B when B − A is positive definite.) If Y_n = {y_ℓ}_{ℓ=1}^n, then µ_n(dθ), the conditional distribution of θ_n | Y_n, is determined by the iteration (4) with h = 1 and P_n the Markov kernel associated with (6a), provided the choice y_n ≡ y is made in the conditional measure. Note that µ_n is not invariant with respect to P_n with this AR1 choice; thus the method differs from SMC in this regard, and the choice of P_n is made here in order to regularize the iterative optimization approach to inversion encapsulated in (5).

The method we introduce and study in this paper arises from application of ideas from Kalman filtering to the problem of approximating the distribution of θ_n | Y_n. The Kalman filter itself applies to the case of linear G [6, 7]. When G is nonlinear the methods can be generalized by use of the extended Kalman filter (ExKF) [8], which is based on linearization and application of Kalman methodology. However this method suffers from two drawbacks which hamper its application in many large-scale applications: (a) it requires a derivative of the forward map G(·); and (b) the approach scales poorly to high dimensional parameter spaces where N_θ ≫ 1, because of the need to sequentially update covariances in R^{N_θ × N_θ}. Thus, despite an early realization that Kalman-based methods could be useful for large-scale filtering problems arising in the geosciences [9], the methods did not become practical in this context until the work of Evensen [10]. This revolutionary paper introduced the ensemble Kalman filter (EnKF), the essence of which is to avoid the linearization of the dynamics and sequential updating of the covariance, and instead use a low-rank approximation of the covariance found by maintaining an ensemble of estimates for θ_n | Y_n at every step n. These ensemble Kalman methods have been widely adopted in the geosciences, not only because they are effective for high dimensional parameter spaces, but also because they are derivative-free, requiring only G as a black box. Their use in the solution of inverse problems via iterative methods was pioneered in subsurface inversion [11, 12], where the perspective of fixing h ≪ 1 and iterating for n = N = 1/h steps was used, so that µ_N is viewed as an approximation of the posterior, provided µ_0 is chosen as the prior. These papers thus view the ensemble methodology as a way of sampling from the posterior and have elements in common with SMC; this idea is also implicit in the paper [13], which is focussed on data assimilation, and addresses solution of a Bayesian inverse problem each time new data is received.

In [14] the Kalman methodology for inversion was revisited from the optimization perspective, based on fixing h = 1 and iterating in n, leading to an algorithm we will refer to as ensemble Kalman inversion (EKI). The paper [15] introduced a novel approach to regularizing the iterative method, by drawing analogy with the Levenberg-Marquardt method [16]; see also [17].
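Since the regularized objective (2) recurs throughout the paper, a minimal sketch of evaluating it may help fix ideas. The sketch is illustrative only: the linear forward map `A`, the data `y`, and all covariances below are made-up toy values, not quantities from the paper.

```python
import numpy as np

def misfit(theta, y, G, Sigma_eta):
    """Data misfit Phi(theta) = 0.5 * || Sigma_eta^{-1/2} (y - G(theta)) ||^2, eq. (2b)."""
    res = y - G(theta)
    return 0.5 * res @ np.linalg.solve(Sigma_eta, res)

def regularized_misfit(theta, y, G, Sigma_eta, r0, Sigma0):
    """Phi_R(theta) = Phi(theta) + 0.5 * || Sigma0^{-1/2} (theta - r0) ||^2, eq. (2a)."""
    d = theta - r0
    return misfit(theta, y, G, Sigma_eta) + 0.5 * d @ np.linalg.solve(Sigma0, d)

# Toy over-determined linear example G(theta) = A @ theta (values made up).
A = np.array([[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]])
G = lambda th: A @ th
y = np.array([1.0, 0.0, 2.0])
Sigma_eta = 0.1 * np.eye(3)
r0, Sigma0 = np.zeros(2), np.eye(2)

phi_r = regularized_misfit(np.array([0.3, 0.7]), y, G, Sigma_eta, r0, Sigma0)
```

At θ = r_0 the regularization term vanishes and Φ_R reduces to the misfit Φ alone.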
Subsequent variants on the iterative optimization approach demonstrate how to introduce Tikhonov regularization into the EKI algorithm [18], and the paper [19] shows that adding noise to the iteration can lead to approximate Bayesian inversion, a method we will refer to as ensemble Kalman sampling (EKS) and which is further analyzed in [20, 21]. The EKS provides a different approach to the problem of Bayesian inversion from the ones pioneered in [11, 12] since it does not require starting with draws from the prior µ_0, but instead relies on ergodicity and iteration to large n; the methods in [11, 12] must be started with draws from the prior µ_0 and iterated for precisely n = 1/h steps, and are hence more rigid in their requirements. Since the ensemble methods do not, in general, accurately approximate the true posterior distribution [22, 23] outside Gaussian scenarios, the derivative-free optimization perspective is arguably a more natural avenue within which to analyze ensemble inversion. However, recent work demonstrates how a derivative-free multiscale stochastic sampling method can usefully take the output of EKS as a preconditioner for a method which provably approximates the true posterior distribution [24]; in that context the EKS is central to making the method efficient.

Within the control theory literature, and parallel to the development of the ensemble Kalman filter, the unscented Kalman filter (UKF) was introduced [25, 26]. Like the ensemble Kalman methods, this method also sidesteps the need to sequentially update the derivative of the forward model as part of the covariance update; but, in the primary difference from ensemble Kalman methods, particles (sigma points) are chosen deterministically, and a quadrature rule is applied within a Gaussian approximation of the filter.
The goal of this paper is to establish a framework for the development of unscented Kalman methods for inverse problems, based on (6): we formalize, and demonstrate the power of, unscented Kalman inversion (UKI) techniques. We also formalize extended Kalman inversion (ExKI) as a general purpose methodology for parameter learning and describe ExKI, UKI and EKI as different approximations of a general Gaussian methodology for the filtering problem defined by (6). We make the following contributions to the study of the UKI methodology:

• we establish the UKI algorithm as a general purpose derivative-free approach to solving the inverse problem (1), based on the novel stochastic dynamical system formulation (6);

• we derive the UKI methodology by introducing a useful conceptual algorithm, based on Gaussian approximation for the filtering distribution defined by (6) and, in the same framework, relate the UKI to the EKI and the ExKI;

• by studying linear problems we demonstrate that the methodology induces a form of Tikhonov regularization and we prove exponential convergence of the algorithm to the minimizer of the Tikhonov-regularized problem, in the linear case;

• we show that the method performs like a generalized Levenberg-Marquardt algorithm for the ExKI and we demonstrate that the UKI may be viewed as an approximation of the generalized Levenberg-Marquardt algorithm applied to a smoothed data-misfit;

• we show that UKI outperforms EKI for certain inverse problems with moderate-dimensional parameters, and for the UKI all tests converge within O(10) iterations with no empirical variance inflation or early stopping needed;

• the UKI is tested on a wide range of problems, including linear test problems, inversion for spatial fields in a variety of continuum mechanics applications and the learning of parameters in chaotic dynamical systems, using time-averaged data;

• we introduce the unscented Kalman sampler (UKS), an unscented Kalman method for generating approximate samples from the Bayesian posterior distribution (3).

Taken together, the theoretical framework we develop and the numerical results we present show that the UKI is a competitive methodology for solving inverse problems and parameter estimation problems defined by an expensive black-box forward model; indeed the UKI is shown to outperform the EKI in settings where the number of parameters N_θ is of moderate size and the black-box is not readily differentiable, so that ExKI methods are not applicable.

We conclude this introductory section with a deeper literature review relating to the contributions we make in this paper, in Subsection 1.3. Then, in Section 2 we derive the algorithm from a Gaussian approximation of the filtering distribution associated with (6), and relate it to the ExKI and EKI approaches to the problem. In Section 3 we study the methodology for linear problems, obtaining insight into the regularization conferred by (6a); study the relationship of the methodology to other gradient-based optimization techniques; and derive continuous time limits in the nonlinear setting. Section 4 describes variants on the basic UKI algorithm that we have found useful in practice, including the UKS, and in Section 5 we present numerical results demonstrating the performance of the UK approaches.

1.3. Literature Review

Inverse and parameter estimation problems are ubiquitous in engineering and scientific applications. Applications that motivate this work include global climate model calibration [27, 28, 29], material constitutive relation calibration [30, 31, 32], seismic inversion in geophysics [33, 34], and medical tomography [35, 36]. These problems are generally highly nonlinear, may feature multiple scales and may include chaotic and turbulent phenomena. Moreover, the observational data is often noisy and the inverse problem may be ill-posed.
We note, also, that a number of inverse problems of interest may involve a moderate number of unknown parameters N_θ, yet may involve solution of a very expensive forward model G depending on those parameters; furthermore G may not be differentiable with respect to the parameters, or may be complex to differentiate as it is given as a black box.

In the nonlinear setting of state estimation, there are three primary Kalman filters [37, 38, 39]: the extended Kalman filter (ExKF), the unscented Kalman filter (UKF), and the ensemble Kalman filter (EnKF). The use of Kalman-based methodology as a non-intrusive iterative method for parameter estimation originates in the papers [40, 41], which were based on the ExKF, hence requiring the derivative dG, and its adjoint, to propagate covariances; the use of derivative-free ensemble methods was then developed systematically in the papers [11, 12], leading to the EKI [14]. Derivative-free ensemble inversion and parameter estimation is particularly suitable for complex multiphysics problems requiring coupling of different solvers, such as fluid-structure interaction [42, 43, 44, 45] and general circulation models [46], and for methods containing discontinuities such as the immersed/embedded boundary method [47, 48, 49, 50] and adaptive mesh refinement [51, 52]. Furthermore, derivative-free ensemble inversion and parameter estimation has been demonstrated to be effective in the context of forward models defined by chaotic dynamical systems [53], where adjoint-based methods fail to deliver meaningful sensitivities [54, 55]. These wide-ranging potential applications form motivation for developing other derivative-free Kalman-based inversion and parameter estimation techniques, and in particular the unscented Kalman methods developed here.

There is already some work in which unscented Kalman methods are used for parameter inversion.
Extended, ensemble and unscented Kalman inversions have been applied to train neural networks [40, 41, 26, 56] and EKI has been applied in the oil industry [57, 11, 12]. Dual and joint Kalman filters [58, 26] have been designed to simultaneously estimate the unknown states and the parameters [58, 59, 26, 60, 61] from noisy sequential observations. However, whilst the EKI has been systematically developed and analyzed as a general purpose methodology for the solution of inverse and parameter estimation problems, the same is not the case for UKI.

Continuous time limits and gradient flow structure of the EKI have been introduced and studied in [13, 62, 63, 64, 65]. This work led to the development of variants on the EKI, such as the Tikhonov-EKI (TEKI) [18] and the EKS [19]. We will develop the study of continuous time limits for the UKI, and variants including an unscented Kalman sampler (UKS), in this paper. There are interesting links to the Levenberg-Marquardt algorithm [66, 16], as introduced in [15] and developed further in [67, 68, 17]. We will further refine this idea, which provides insights into understanding and improving UKI.

Finally, we mention that there are other derivative-free optimization techniques which are based on interacting particle systems, but are not Kalman-based. Rather, these methods are based on consensus-forming mean-field models, and their particle approximations, leading to consensus-based optimization [69] and consensus-based sampling [70]. The paper [24] also provides an alternative derivative-free approach to optimization and sampling for inverse problems, using ideas from multiscale dynamical systems.

2. The Algorithm

Recall that the basic approach to inverse problems that we adopt in this paper is to pair the parameter-to-data relationship encoded in (1) with a stochastic dynamical system for the parameter, resulting in (6). We then employ techniques from filtering to approximate the distribution µ_n of θ_n | Y_n.
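The stochastic dynamics (6) that the filtering machinery operates on can be simulated directly. The following sketch takes only the structure of (6a)-(6b) from the text; the toy nonlinear map `G` and the values for `r`, `alpha` and the covariances are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(theta_n, G, r, alpha, Sigma_omega, Sigma_nu):
    """One step of the stochastic dynamical system (6):
    evolution:   theta_{n+1} = r + alpha * (theta_n - r) + omega_{n+1}
    observation: y_{n+1}     = G(theta_{n+1}) + nu_{n+1}
    """
    omega = rng.multivariate_normal(np.zeros(len(theta_n)), Sigma_omega)
    theta_next = r + alpha * (theta_n - r) + omega
    nu = rng.multivariate_normal(np.zeros(Sigma_nu.shape[0]), Sigma_nu)
    y_next = G(theta_next) + nu
    return theta_next, y_next

# Illustrative nonlinear forward map; all concrete values are made up.
G = lambda th: np.array([th[0] ** 2 + th[1], np.sin(th[1])])
r = np.zeros(2)
alpha = 0.9
Sigma_omega = 0.01 * np.eye(2)
Sigma_nu = 0.1 * np.eye(2)

theta, y = step(np.array([1.0, -0.5]), G, r, alpha, Sigma_omega, Sigma_nu)
```

With α ∈ (0, 1) the AR1 evolution (6a) pulls the parameter back toward r, which is exactly the regularizing mechanism discussed above.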
A useful way to think of updating µ_n is through the prediction and analysis steps [71, 72]: µ_n ↦ µ̂_{n+1}, and then µ̂_{n+1} ↦ µ_{n+1}, where µ̂_{n+1} is the distribution of θ_{n+1} | Y_n. In Subsection 2.1 we first introduce a Gaussian approximation of the analysis step, leading to an algorithm which maps the space of Gaussian measures into itself at each step of the iteration; it is not implementable in general, but it is a useful conceptual algorithm. Subsection 2.2 shows how this algorithm can be made practical, for low to moderate dimension N_θ and assuming that dG is available, by means of the ExKF, a form of linearization of the conceptual algorithm; we refer to this as ExKI. In Subsection 2.3 we show how the UKI algorithm, the main focus of this paper, may be derived by applying a quadrature rule to evaluate certain integrals appearing in the conceptual Gaussian approximation. Subsection 2.4 connects the conceptual algorithm with the EKI, an approach in which ensemble approximations of the integrals are used.

This conceptual algorithm maps Gaussians into Gaussians. We refer to it henceforth as the Gaussian Approximation Algorithm (GAA). Assume that µ_n ≈ N(m_n, C_n). Note that, under (6a), µ̂_{n+1} is also Gaussian. The algorithm proceeds by introducing the distribution of θ_{n+1}, y_{n+1} | Y_n, projecting this onto a Gaussian by computing its mean and covariance, and then conditioning this Gaussian to obtain a Gaussian approximation N(m_{n+1}, C_{n+1}) to µ_{n+1}.

In the analysis step, we assume that the joint distribution of {θ_{n+1}, y_{n+1}} | Y_n can be approximated by a Gaussian distribution

    N( (m̂_{n+1}, ŷ_{n+1}), [ Ĉ_{n+1}, Ĉ^{θp}_{n+1} ; (Ĉ^{θp}_{n+1})^T, Ĉ^{pp}_{n+1} ] ).    (7)

Use of (6a) shows that

    m̂_{n+1} = E[θ_{n+1} | Y_n] = r + α(m_n − r),
    Ĉ_{n+1} = Cov[θ_{n+1} | Y_n] = α² C_n + Σ_ω.    (8)

Then, with E denoting expectation with respect to θ_{n+1} | Y_n ∼ N(m̂_{n+1}, Ĉ_{n+1}),

    ŷ_{n+1} = E[G(θ_{n+1}) | Y_n],
    Ĉ^{θp}_{n+1} = Cov[θ_{n+1}, G(θ_{n+1}) | Y_n],
    Ĉ^{pp}_{n+1} = Cov[G(θ_{n+1}) | Y_n] + Σ_ν.    (9)

Conditioning the Gaussian in (7) to find θ_{n+1} | {Y_n, y_{n+1}} = θ_{n+1} | Y_{n+1} gives the following expressions for the mean m_{n+1} and covariance C_{n+1} of the approximation to µ_{n+1}:

    m_{n+1} = m̂_{n+1} + Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{-1} (y_{n+1} − ŷ_{n+1}),
    C_{n+1} = Ĉ_{n+1} − Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{-1} (Ĉ^{θp}_{n+1})^T.    (10)

(The choice of indices θ and p here, and in what follows, is made to align with the notation in [14], where the data y comprised noisy observation of a variable denoted by p.) The inversion algorithm is obtained by assuming that the observations {y_n} are identical to y and iterating in n. With this assumption, we may write the algorithm as

    (m_{n+1}, C_{n+1}) = F(m_n, C_n; G, r, Σ_ω),    (11)

noting that the mapping is dependent on G and on the mean and covariance of the assumed autoregressive dynamics for {θ_n}.

In the setting where G is linear, the Gaussian ansatz used in the derivation of the conceptual algorithm is exact, the integrals appearing in (9) have closed form, and the algorithm reduces to the Kalman filter applied to (6), with a particular assumption on the data stream {y_n}. In Subsection 3.1 we will show that the mean of this iteration converges to a minimizer of Φ_R given by (2), in which the prior mean of the regularization is r_0 = r and the prior covariance of the regularization Σ_0 is defined by solution of a linear equation depending on the choices of α, Σ_ω, and Σ_ν.

In the nonlinear setting, to make an implementable algorithm from the GAA encapsulated in equations (8) to (10), it is necessary to approximate the integrals appearing in (9).
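The analysis step (10) is simply conditioning of the joint Gaussian (7) on the observation. A minimal sketch, together with a sanity check for a linear map, for which the Gaussian ansatz is exact (all concrete values below are illustrative, not from the paper):

```python
import numpy as np

def gaussian_condition(m_hat, C_hat, y_hat, C_thp, C_pp, y):
    """Condition the joint Gaussian (7) on an observed y, giving (10):
       m_new = m_hat + C_thp C_pp^{-1} (y - y_hat)
       C_new = C_hat - C_thp C_pp^{-1} C_thp^T
    """
    K = C_thp @ np.linalg.inv(C_pp)  # Kalman-gain-like factor
    return m_hat + K @ (y - y_hat), C_hat - K @ C_thp.T

# Sanity check with a linear map G(theta) = A @ theta, for which (9) has
# closed form: y_hat = A m_hat, C_thp = C_hat A^T, C_pp = A C_hat A^T + Sigma_nu.
A = np.array([[1.0, 0.0], [1.0, 1.0]])
m_hat, C_hat = np.zeros(2), np.eye(2)
Sigma_nu = 0.5 * np.eye(2)
y_hat = A @ m_hat
C_thp = C_hat @ A.T
C_pp = A @ C_hat @ A.T + Sigma_nu
m_new, C_new = gaussian_condition(m_hat, C_hat, y_hat, C_thp, C_pp, np.array([1.0, 2.0]))
```

Conditioning contracts the covariance (C_new ⪯ Ĉ) and leaves the mean unchanged when the observation equals its prediction.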
When extended, unscented and ensemble Kalman filters are applied, respectively, to make such approximation, we obtain the ExKI, UKI and EKI algorithms. The extended, unscented and ensemble approaches to this are detailed in the following three subsections. Underlying all of them is the following property of the GAA, encapsulated in Proposition 1.

We recall the idea of affine invariance, introduced for MCMC methods in [73], motivated by the attribution of the empirical success of the Nelder-Mead algorithm [74] for optimization to a similar property; further development of the method in the context of sampling algorithms may be found in [75, 20]. In words, an iteration is affine invariant if an invertible linear transformation of the variable being iterated makes no difference to the algorithm, and hence to the convergence properties of the algorithm; this has the desirable consequence that coordinates in which the problem is well-conditioned may be used to understand convergence of algorithms for ill-conditioned problems. Consider the invertible mapping from x ∈ R^{N_θ} to x* ∈ R^{N_θ} defined by x* = Ax + b. Then define G*(θ) = G(A^{-1}(θ − b)), r* = Ar + b and Σ_ω* = A Σ_ω A^T.

Proposition 1.
Define, for all n ∈ Z⁺, m_n* = A m_n + b and C_n* = A C_n A^T. Then

    (m*_{n+1}, C*_{n+1}) = F(m_n*, C_n*; G*, r*, Σ_ω*).    (12)

Proof.
The proof is in Appendix A.

This establishes the property of affine invariance, noting that only G, r, Σ_ω need to be transformed, as the affine map applies only on the signal space for {θ_n} and not the observation space for {y_n}. (Dependencies on other matrices entering the specification of (6) are not included in this notation as they remain unchanged in the discussion of affine invariance.)

Consider the algorithm defined by equations (8) to (10). The ExKI algorithm follows from invoking the approximations

    G(θ_{n+1}) ≈ G(m̂_{n+1}) + dG(m̂_{n+1})(θ_{n+1} − m̂_{n+1})    (13)
in the analysis updates for the mean and covariance, respectively. In particular, both the mean and the covariances in (9) can be evaluated in closed form with the approximation (13). The approximations are valid if the fluctuations around the mean state are small, say of O(ε) ≪ 1, and all the covariances are O(ε²). This results in the following algorithm:

• Prediction step:

    m̂_{n+1} = r + α(m_n − r),   Ĉ_{n+1} = α² C_n + Σ_ω.    (14)

• Analysis step:

    ŷ_{n+1} = G(m̂_{n+1}),
    Ĉ^{θp}_{n+1} = Ĉ_{n+1} dG(m̂_{n+1})^T,
    Ĉ^{pp}_{n+1} = dG(m̂_{n+1}) Ĉ_{n+1} dG(m̂_{n+1})^T + Σ_ν,
    m_{n+1} = m̂_{n+1} + Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{-1} (y − ŷ_{n+1}),
    C_{n+1} = Ĉ_{n+1} − Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{-1} (Ĉ^{θp}_{n+1})^T.    (15)

The UKI approximates the conceptual Gaussian algorithm as does the ExKI, but it approximates the integrals appearing in equations (9) by means of deterministic quadrature rules which are exact when evaluating means and covariances of variables defined as linear transformations of the random variable in question. This is the idea of the unscented transform [25, 26], which we now define.
Definition 1 (Modified Unscented Transform). Let θ ∼ N(m, C) be a Gaussian random variable in R^{N_θ}. The 2N_θ + 1 symmetric sigma points are chosen deterministically:

    θ^0 = m,
    θ^j = m + c_j [√C]_j   (1 ≤ j ≤ N_θ),
    θ^{j+N_θ} = m − c_j [√C]_j   (1 ≤ j ≤ N_θ),    (16)

where [√C]_j is the j-th column of the Cholesky factor of C. The quadrature rule approximates the mean and covariance of the transformed variables G^i(θ) as follows:

    E[G^i(θ)] ≈ G^i(θ^0),
    Cov[G^1(θ), G^2(θ)] ≈ Σ_{j=1}^{2N_θ} W_j^c (G^1(θ^j) − E G^1(θ))(G^2(θ^j) − E G^2(θ))^T.    (17)

Here the constant weights are

    c_1 = c_2 = ··· = c_{N_θ} = √(N_θ + λ),
    λ = a²(N_θ + κ) − N_θ,
    W_0^c = λ/(N_θ + λ) + (1 − a² + b),
    W_j^c = 1/(2(N_θ + λ))   (j = 1, ···, 2N_θ).

The original unscented transform leads to 2nd-order accuracy [76] with respect to small covariance C: the approximation introduces errors in estimating the mean and covariance at the fourth and higher orders in the Taylor series. The modification we employ here replaces the original 2nd-order approximation of E[G^i(θ)] with its 1st-order counterpart. We do this to avoid negative weights; it also has ramifications for the optimization process, which we discuss in Remark 8. Finally, we mention that our modified unscented transform retains the property of exactness for mean and covariance under arbitrary linear transformations G^1 and G^2.

In this paper, the hyper-parameters are chosen to be

    κ = 0,   a = min{√(4/(N_θ + κ)), 1},   b = 2.

Consider the algorithm defined by equations (8) to (10). By utilizing the aforementioned quadrature rule, we obtain the following UKI algorithm:

• Prediction step:

    m̂_{n+1} = r + α(m_n − r),   Ĉ_{n+1} = α² C_n + Σ_ω.    (18)

• Generate sigma points:

    θ̂^0_{n+1} = m̂_{n+1},
    θ̂^j_{n+1} = m̂_{n+1} + c_j [√Ĉ_{n+1}]_j   (1 ≤ j ≤ N_θ),
    θ̂^{j+N_θ}_{n+1} = m̂_{n+1} − c_j [√Ĉ_{n+1}]_j   (1 ≤ j ≤ N_θ).
    (19)

• Analysis step:

    ŷ^j_{n+1} = G(θ̂^j_{n+1}),   ŷ_{n+1} = ŷ^0_{n+1},
    Ĉ^{θp}_{n+1} = Σ_{j=1}^{2N_θ} W_j^c (θ̂^j_{n+1} − m̂_{n+1})(ŷ^j_{n+1} − ŷ_{n+1})^T,
    Ĉ^{pp}_{n+1} = Σ_{j=1}^{2N_θ} W_j^c (ŷ^j_{n+1} − ŷ_{n+1})(ŷ^j_{n+1} − ŷ_{n+1})^T + Σ_ν,
    m_{n+1} = m̂_{n+1} + Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{-1} (y − ŷ_{n+1}),
    C_{n+1} = Ĉ_{n+1} − Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{-1} (Ĉ^{θp}_{n+1})^T.    (20)

(We note that the papers [76, 26, 39] suggest using a small positive value of a; we find in the numerical examples considered in this paper that our proposed choice of a outperforms the analogous choice with a small positive constant in place of the cap 1, which builds in the idea of using a small positive value of a.)

Consider the conceptual Gaussian approximation algorithm defined by equations (8) to (10). The EKI approach to making this implementable is to work with an ensemble of parameter estimates and approximate the covariances Ĉ^{θp}_{n+1} and Ĉ^{pp}_{n+1} empirically:

• Prediction step:

    θ̂^j_{n+1} = r + α(θ^j_n − r) + ω^j_{n+1},   m̂_{n+1} = (1/J) Σ_{j=1}^J θ̂^j_{n+1}.    (21)

• Analysis step:

    ŷ^j_{n+1} = G(θ̂^j_{n+1}),   ŷ_{n+1} = (1/J) Σ_{j=1}^J ŷ^j_{n+1},
    Ĉ^{θp}_{n+1} = (1/(J−1)) Σ_{j=1}^J (θ̂^j_{n+1} − m̂_{n+1})(ŷ^j_{n+1} − ŷ_{n+1})^T,
    Ĉ^{pp}_{n+1} = (1/(J−1)) Σ_{j=1}^J (ŷ^j_{n+1} − ŷ_{n+1})(ŷ^j_{n+1} − ŷ_{n+1})^T + Σ_ν,
    θ^j_{n+1} = θ̂^j_{n+1} + Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{-1} (y − ŷ^j_{n+1} − ν^j_{n+1}),
    m_{n+1} = (1/J) Σ_{j=1}^J θ^j_{n+1}.    (22)

Here the superscript j = 1, ···, J is the ensemble particle index, and ω^j_{n+1} ∼ N(0, Σ_ω) and ν^j_{n+1} ∼ N(0, Σ_ν) are independent and identically distributed random variables.
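Combining the prediction step (18), the sigma points (19) and the analysis step (20), one full UKI iteration can be sketched as below. The hyper-parameter formulas follow Definition 1 as reconstructed here; the linear check at the end uses made-up matrices and exploits the fact that the quadrature is exact for linear G, so one UKI step must match the closed-form Kalman update.

```python
import numpy as np

def uki_step(m, C, y, G, r, alpha, Sigma_omega, Sigma_nu, kappa=0.0, b=2.0):
    """One UKI iteration: prediction (18), sigma points (19) via the
    modified unscented transform of Definition 1, analysis (20)."""
    N = len(m)
    a = min(np.sqrt(4.0 / (N + kappa)), 1.0)  # hyper-parameter choice from the text
    lam = a**2 * (N + kappa) - N
    c = np.sqrt(N + lam)
    W = 1.0 / (2.0 * (N + lam))               # W_j^c for j = 1, ..., 2N

    m_hat = r + alpha * (m - r)               # prediction (18)
    C_hat = alpha**2 * C + Sigma_omega

    L = np.linalg.cholesky(C_hat)             # sigma points (19)
    thetas = [m_hat] + [m_hat + c * L[:, j] for j in range(N)] \
                     + [m_hat - c * L[:, j] for j in range(N)]

    ys = [G(th) for th in thetas]             # analysis (20): 1st-order mean,
    y_hat = ys[0]                             # quadrature covariances
    C_thp = sum(W * np.outer(thetas[j] - m_hat, ys[j] - y_hat)
                for j in range(1, 2 * N + 1))
    C_pp = sum(W * np.outer(ys[j] - y_hat, ys[j] - y_hat)
               for j in range(1, 2 * N + 1)) + Sigma_nu
    K = C_thp @ np.linalg.inv(C_pp)
    return m_hat + K @ (y - y_hat), C_hat - K @ C_thp.T

# Linear sanity check with made-up values: G(theta) = A @ theta.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
m, C = np.array([0.5, -0.5]), 0.5 * np.eye(2)
r, alpha = np.zeros(2), 0.9
Sigma_omega, Sigma_nu = 0.1 * np.eye(2), 0.2 * np.eye(2)
y = np.array([1.0, 2.0])
m_new, C_new = uki_step(m, C, y, lambda t: A @ t, r, alpha, Sigma_omega, Sigma_nu)

# Closed-form Kalman update (24)-(25) for comparison.
m_hat = r + alpha * (m - r)
C_hat = alpha**2 * C + Sigma_omega
K = C_hat @ A.T @ np.linalg.inv(A @ C_hat @ A.T + Sigma_nu)
m_ref = m_hat + K @ (y - A @ m_hat)
C_ref = C_hat - K @ A @ C_hat
```

Note that the only interaction with G is through evaluations at the 2N_θ + 1 sigma points, which is what makes the scheme derivative-free and black-box friendly.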
3. Theoretical Insights
Recall that we view the GAA as an underlying conceptual algorithm which gives insight into the ExKI, UKI and EKI algorithms. The ExKI is itself an approximation of the GAA, found by linearizing G around the predictive mean, and the UKI and EKI algorithms are approximations of the resulting ExKI. Thus study of the GAA and ExKI gives insights into the UKI and EKI algorithms. This section is devoted to such studies. In Subsection 3.1 we consider behaviour of the GAA in the linear setting, where it is identical to the ExKI. In Subsection 3.2 we show that the ExKI may be viewed as a generalization of the Levenberg-Marquardt method for optimization. Subsection 3.3 exhibits an averaging property induced by the unscented approximation, indicating how this may help in solving problems with rough energy landscapes. And in Subsection 3.4 we study a continuous time limit of the GAA, which may itself be approximated to obtain continuous time limits of the ExKI, UKI and EKI algorithms; this provides insight into the discrete algorithms as implemented in practice.

In the linear setting the stochastic dynamical system for state {θ_n} and observations {y_n} is given by

    evolution:      θ_{n+1} = r + α(θ_n − r) + ω_{n+1},   ω_{n+1} ∼ N(0, Σ_ω),    (23a)
    observation:    y_{n+1} = G θ_{n+1} + ν_{n+1},        ν_{n+1} ∼ N(0, Σ_ν).    (23b)

Thanks to the linearity, equations (9) reduce to

    ŷ_{n+1} = G m̂_{n+1},   Ĉ^{θp}_{n+1} = Ĉ_{n+1} G^T,   Ĉ^{pp}_{n+1} = G Ĉ_{n+1} G^T + Σ_ν.

The update equations (10) become

    m̂_{n+1} = r + α(m_n − r),   Ĉ_{n+1} = α² C_n + Σ_ω,    (24)

and

    m_{n+1} = m̂_{n+1} + Ĉ_{n+1} G^T (G Ĉ_{n+1} G^T + Σ_ν)^{-1} (y − G m̂_{n+1}),    (25a)
    C_{n+1} = Ĉ_{n+1} − Ĉ_{n+1} G^T (G Ĉ_{n+1} G^T + Σ_ν)^{-1} G Ĉ_{n+1}.    (25b)

We have the following theorem about the convergence of the UKI in the setting of the linear forward model:

Theorem 1.
Assume that Σ_ω ≻ 0. Consider the iteration (24), (25) mapping (m_n, C_n) into (m_{n+1}, C_{n+1}). Assume further that α = 1 and Range(G^T) = R^{N_θ}, or that α ∈ (0, 1). Then the steady state equation of equation (25b),

    C_∞^{-1} = G^T Σ_ν^{-1} G + (α² C_∞ + Σ_ω)^{-1},    (26)

has a unique solution C_∞ ≻ 0. The pair (m_n, C_n) converges exponentially fast to the limit (m_∞, C_∞). Furthermore, the limiting mean m_∞ is the minimizer of the Tikhonov-regularized least squares functional Φ_R given by

    Φ_R(θ) := (1/2)‖Σ_ν^{-1/2}(y − Gθ)‖² + ((1 − α)/2)‖Ĉ_∞^{-1/2}(θ − r)‖²,    (27)

where

    Ĉ_∞ = α² C_∞ + Σ_ω.    (28)

Proof.
The proof is in Appendix A.
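Theorem 1 can be checked numerically on a small made-up linear problem: iterating (24)-(25) with α ∈ (0, 1), the covariance should settle at the fixed point (26), and the limiting mean should satisfy the first-order condition of the Tikhonov functional (27), namely G^T Σ_ν^{-1}(G m − y) + (1 − α) Ĉ_∞^{-1}(m − r) = 0.

```python
import numpy as np

# Small linear inverse problem; all concrete values are illustrative.
Gm = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])  # forward operator G
Sigma_nu = 0.5 * np.eye(3)
Sigma_omega = 0.1 * np.eye(2)
alpha = 0.8
r = np.zeros(2)
y = np.array([1.0, -1.0, 0.5])

m, C = np.ones(2), np.eye(2)
for _ in range(500):
    m_hat = r + alpha * (m - r)                  # prediction (24)
    C_hat = alpha**2 * C + Sigma_omega
    S = Gm @ C_hat @ Gm.T + Sigma_nu             # analysis (25)
    K = C_hat @ Gm.T @ np.linalg.inv(S)
    m = m_hat + K @ (y - Gm @ m_hat)
    C = C_hat - K @ Gm @ C_hat

C_hat_inf = alpha**2 * C + Sigma_omega           # hat-C_infinity, eq. (28)
# Steady-state relation (26): C^{-1} = G^T Sigma_nu^{-1} G + hat-C_infinity^{-1}.
lhs = np.linalg.inv(C)
rhs = Gm.T @ np.linalg.inv(Sigma_nu) @ Gm + np.linalg.inv(C_hat_inf)
# Gradient of the Tikhonov functional (27) at the limit mean.
grad = Gm.T @ np.linalg.inv(Sigma_nu) @ (Gm @ m - y) \
       + (1 - alpha) * np.linalg.inv(C_hat_inf) @ (m - r)
```

Both residuals should vanish to numerical precision, illustrating the exponential convergence asserted by the theorem.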
Remark 1.
The theorem suggests the importance of choosing α ∈ (0, 1) unless the forward operator has empty null-space. In the case α ∈ (0, 1) the preceding theorem demonstrates the regularization which underlies the proposed iterative method.

Remark 2.
The free parameters r, Σ_ν, Σ_ω play the role of the prior mean r_0 and covariance Σ_0 appearing in (2). However it is important to realize that, despite the clear parallels with Tikhonov regularization [1], there is an important difference: the matrix Ĉ_∞ defining the implied prior covariance in the regularization term depends on the forward model. This may be seen by noting that it is defined by (28) in terms of the steady state covariance C_∞ satisfying (26). To get some insight into the implications of this, we consider the over-determined linear system in which G^T Σ_η^{−1} G is invertible and we may define

C_* = (G^T Σ_η^{−1} G)^{−1}.   (29)

If we choose r = r_0, the desired prior mean, and the artificial evolution and observation error covariances

Σ_ν = 2 Σ_η,           (30a)
Σ_ω = (2 − α²) C_*,    (30b)

then a straightforward calculation with (26), (28) shows that C_∞ = C_* and Ĉ_∞ = 2 C_*. From (27) it follows that

Φ_R(θ) = ¼ ‖Σ_η^{−1/2}(y − Gθ)‖² + (1 − α)/4 ‖Σ_η^{−1/2} G (θ − r_0)‖².   (31)

This calculation clearly demonstrates the dependence of the second (regularization) term on the forward model, and that choosing α ∈ (0, 1) allows different weights on the regularization term. Indeed, by allowing a multiplicative factor different from 2 in (30a), the regularization term can be given arbitrarily large weight relative to the data misfit.

Remark 3.
The prior regularization is defined through the steady state of the iterative process, incontrast to SMC where the prior is specified as the initial condition.
Remark 4.
Let α = 1 and consider the setting where the forward operator G has a nontrivial null space, so that the problem is under-determined. Then the steady state precision matrix C_∞^{−1} of equation (26) is singular. Moreover, {C_n} diverges to +∞ with the linear bound

C_n ⪯ C_0 + n Σ_ω,

but {m_n} still converges to a stationary point of Φ(θ), which for α = 1 has no regularizing term; this phenomenon is illustrated in the numerical experiments presented in Subsection 5.3.

Remark 5.
When α ∈ (0, 1), the exponential convergence rates of the mean and covariance are independent of the condition number of G^T Σ_ν^{−1} G.

Remark 6.
When α ∈ (0, 1), Ĉ_∞ is bounded: Σ_ω ⪯ Ĉ_∞ ⪯ Σ_ω/(1 − α²), since 0 ⪯ C_∞ ⪯ α² C_∞ + Σ_ω.

Remark 7.
When Σ_ω = 0, lim_{n→∞} C_n = 0.

3.2. ExKI: Levenberg–Marquardt Connection

In the nonlinear setting, our numerical results will demonstrate the implicit regularization and linear (sometimes superlinear) convergence of ExKI and UKI. These desirable features can be understood by analogy with the Levenberg–Marquardt Algorithm (LMA). We focus this discussion on the particular case α = 1, as we find that, for over-determined problems, this choice often produces the best results.

Consider the non-regularized nonlinear least-squares objective function Φ, defined in (2b). The key step in the LMA is to solve the minimization problem for (2b) by a preconditioned gradient descent procedure which maps θ_n to θ_n + δθ_n, where δθ_n solves

(dG(θ_n)^T Σ_ν^{−1} dG(θ_n) + λ_n I) δθ_n = dG(θ_n)^T Σ_ν^{−1} (y − G(θ_n)).   (32)

Here I is the identity matrix on R^{N_θ} and λ_n is the (non-negative) damping factor, often chosen adaptively. Because of the damping matrix λ_n I, the LMA is found to be more robust than the Gauss–Newton Algorithm and exhibits linear (or even superlinear) convergence in practice. The use of LMA for inverse problems is discussed in [16].

The ExKI procedure solves the optimization problem for (2b) by a different preconditioned gradient descent procedure, defined by the update

(dG(θ_n)^T Σ_ν^{−1} dG(θ_n) + (C_n + Σ_ω)^{−1}) δθ_n = dG(θ_n)^T Σ_ν^{−1} (y − G(θ_n)).   (33)

This may be viewed as a generalization of the LMA in which the adaptive damping term is now a matrix, C_n + Σ_ω, and the adaptation is automated through the covariance updates; furthermore this matrix is lower bounded (in the sense of quadratic forms) by Σ_ω, regardless of the adaptation through the covariance, ensuring some damping of the Gauss-Newton approximate Hessian. We may expect that the UKI and EKI, which approximate the linearization dG in the ExKI, benefit from this generalized LMA.
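A minimal sketch of an update of the form (33), assuming α = 1, a toy two-parameter forward map, and a finite-difference Jacobian standing in for the exact dG; all names and numerical values are illustrative, not the paper's.

```python
import numpy as np

def G(theta):
    # Toy nonlinear forward map (illustrative only).
    return np.array([theta[0]**2 + theta[1], np.sin(theta[1])])

def jacobian_fd(f, theta, eps=1e-6):
    # Finite-difference Jacobian dG(theta), built column by column.
    f0 = f(theta)
    J = np.zeros((f0.size, theta.size))
    for j in range(theta.size):
        tp = theta.copy()
        tp[j] += eps
        J[:, j] = (f(tp) - f0) / eps
    return J

def exki_style_step(theta, C, Sigma_nu, Sigma_omega, y):
    """One preconditioned step of the form (33), with alpha = 1:
    (dG^T Sigma_nu^{-1} dG + (C + Sigma_omega)^{-1}) dtheta = dG^T Sigma_nu^{-1} (y - G(theta)).
    The matrix C + Sigma_omega plays the role of the adaptive LMA damping."""
    J = jacobian_fd(G, theta)
    H = J.T @ np.linalg.inv(Sigma_nu) @ J + np.linalg.inv(C + Sigma_omega)
    g = J.T @ np.linalg.inv(Sigma_nu) @ (y - G(theta))
    return theta + np.linalg.solve(H, g)

y = G(np.array([1.0, 0.5]))                  # noise-free synthetic data
theta = np.array([0.5, 0.0])
C = np.eye(2)                                # held fixed here for simplicity
Sigma_nu, Sigma_omega = 0.01 * np.eye(2), 0.1 * np.eye(2)
for _ in range(50):
    theta = exki_style_step(theta, C, Sigma_nu, Sigma_omega, y)
print(np.linalg.norm(y - G(theta)))          # data misfit shrinks over the iterations
```

Note that any fixed point of this damped iteration with invertible Jacobian is a stationary point of the data misfit, which is the sense in which (33) generalizes the LMA step (32).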
Connections between the LMA and EKI were first systematically explored in [15] and more recently in [68].

3.3. UKI: Averaging Property

Here we explain that the unscented transform may be viewed as smoothing the energy landscape of UKI, in comparison with ExKI; this helps to explain the improved behaviour of UKI over ExKI on rough landscapes, such as those we will show in what follows when performing parameter estimation for chaotic differential equations. Compared with the ExKI, the UKI further smooths the gradient and the landscape of the nonlinear map G(θ). We first introduce a useful averaging property.

Lemma 1.
Let θ denote a Gaussian random vector, θ ∼ N(m, C), on R^{N_θ}. For any nonlinear function G: R^{N_θ} → R^{N_y}, we define the associated averaged function F_G: R^{N_θ} × R^{N_θ×N_θ}_⪰ → R^{N_y} and averaged gradient function F_{dG}: R^{N_θ} × R^{N_θ×N_θ}_⪰ → R^{N_y×N_θ} as follows:

F_G(m, C) := E[G(θ)],   F_{dG}(m, C) := Cov[G(θ), θ] · C^{−1}.   (34)

Then we have ∂F_G(m, C)/∂m = F_{dG}(m, C).

Proof. The proof is in Appendix A.

Note that in the linear case
F_G(m, C) = Gm and F_{dG}(m, C) = G; the averaged derivative is exact. This averaging procedure is useful to understand the conceptual GAA precisely because (34) may be used to express Cov[G(θ), θ], which appears in the conceptual GAA, in terms of the averaged derivative F_{dG}(m, C). In order to use this idea in the context of the UKI it is useful to understand related averaging operations when the modified unscented transform is employed to approximate Gaussian expectations. To this end we define, using (18)–(20),

F^u_{G,n} := ŷ_n,   F^u_{dG,n} := (Ĉ^{θp}_n)^T Ĉ_n^{−1},   (35)

noting that F^u_{G,n} and F^u_{dG,n} then correspond to approximations of (34) at step n of the algorithm, using the modified unscented transform from Definition 1.

Proposition 2.
The UKI algorithm (18)–(20) may be written in the following form (in what follows, the suffix ⪰ denotes positive semi-definiteness and ∂/∂m denotes the gradient with respect to m):

• Prediction step:

m̂_{n+1} = r + α(m_n − r),   Ĉ_{n+1} = α² C_n + Σ_ω.   (36)

• Analysis step:

ŷ_{n+1} = F^u_{G,n+1},
Ĉ^{θp}_{n+1} = Ĉ_{n+1} (F^u_{dG,n+1})^T,
Ĉ^{pp}_{n+1} = F^u_{dG,n+1} Ĉ_{n+1} (F^u_{dG,n+1})^T + Σ_ν + Σ̃_{ν,n+1},
m_{n+1} = m̂_{n+1} + Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{−1} (y − ŷ_{n+1}),
C_{n+1} = Ĉ_{n+1} − Ĉ^{θp}_{n+1} (Ĉ^{pp}_{n+1})^{−1} (Ĉ^{θp}_{n+1})^T.   (37)

Here Σ̃_{ν,n+1} ⪰ 0. Furthermore, ‖Σ̃_{ν,n+1}‖ = O(‖Ĉ_{n+1}‖²) and Σ̃_{ν,n+1} = 0 when G is linear.

Proof. The proof is in Appendix A.
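The averaging operations (34)–(35) can be sanity-checked numerically. The sketch below uses the standard symmetric sigma points as a stand-in for the modified unscented transform of Definition 1 (which is not reproduced in this excerpt) and verifies that, for small covariance, the averaged gradient F_{dG} is close to the pointwise gradient, as Lemma 1 suggests; the map G and all parameter values are illustrative.

```python
import numpy as np

def G(theta):
    # Scalar-valued nonlinear map used for illustration.
    return np.array([np.sin(theta[0]) + theta[1]**2])

def sigma_points(m, C, kappa=1.0):
    # Standard symmetric sigma points for N(m, C), a stand-in for Definition 1.
    N = m.size
    L = np.linalg.cholesky((N + kappa) * C)
    pts = [m] + [m + L[:, i] for i in range(N)] + [m - L[:, i] for i in range(N)]
    w = np.r_[kappa / (N + kappa), np.full(2 * N, 0.5 / (N + kappa))]
    return np.array(pts), w

m, C = np.array([0.3, -0.2]), 0.01 * np.eye(2)
pts, w = sigma_points(m, C)
Gp = np.array([G(p) for p in pts])

F_G = w @ Gp                                          # unscented estimate of E[G(theta)]
cov_Gtheta = (w[:, None] * (Gp - F_G)).T @ (pts - m)  # estimate of Cov[G(theta), theta]
F_dG = cov_Gtheta @ np.linalg.inv(C)                  # averaged gradient, as in (34)-(35)

# Lemma 1: the averaged gradient equals d/dm of the averaged function; for small C
# both are close to the pointwise gradient [cos(m1), 2*m2].
print(F_dG, np.array([np.cos(m[0]), 2 * m[1]]))
```

Because the sigma-point expectation averages G over a neighbourhood of m, the resulting gradient estimate is a smoothed version of dG; this is the mechanism invoked later for rough misfit landscapes.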
Remark 8.
Comparison of the original UKI algorithm (18)–(20) with its rewritten form (36)–(37) demonstrates that, in the regime where the covariance is small, or the forward model is linear, the UKI algorithm behaves like the ExKI algorithm (14)–(15), but with the nonlinear function G and its associated gradient dG having been averaged according to unscented approximations of the averaging operations defined in Lemma 1. From the preceding subsection it follows that the UKI is also related to a modified LMA applied to an averaged objective function. Note that, by using the unscented approximation of the averaging procedure defined in Lemma 1, we essentially remove the averaging of G and retain it only on dG. Averaging of the gradient dG alone will be demonstrated to have an important positive effect on parameter estimation for chaotic dynamical systems in Subsections 5.8, 5.9 and 5.10.

3.4. Continuous Time Limit

To derive a continuous time limit we set α = 1 − α_h h, Σ_ω ↦ h Σ_ω and Σ_ν ↦ h^{−1} Σ_ν. The algorithm defined by Equations (8) to (10) then has the form of a first order accurate (in h) approximation of the dynamical system

ṁ = −α_h (m − r) + C^{θp} Σ_ν^{−1} (y − E G(θ)),           (38a)
Ċ = −2 α_h C + Σ_ω − C^{θp} Σ_ν^{−1} (C^{θp})^T,           (38b)

where θ ∼ N(m, C), the expectation E is with respect to this distribution, and

C^{θp} = E((θ − m) ⊗ (G(θ) − E G(θ))).

This continuous time dynamical system may be used as the basis for practical algorithms by discretizing in time, for example using forward Euler with an adaptive time-step as in [56], and applying the same ideas used in the ExKI, UKI or EKI to approximate the expectations.

The steady states m_∞, C_∞ of the differential equations (38) are implicitly defined in a somewhat complicated fashion. However, any such steady state always has non-singular covariance, as we now state and prove.

Lemma 2.
For any steady state (m_∞, C_∞) of equation (38), the steady covariance C_∞ is non-singular.

Proof. The proof is in Appendix A.

4. Variants on the Basic Algorithm
Kalman inversion requires solving forward problems at every iteration. Failure of the forward problem to deliver physically meaningful solutions can lead to failure of the inverse problem. Adding constraints to the parameters (for example, dissipation is non-negative) significantly improves the robustness of Kalman inversion. Within the EKI there is a natural way to impose constraints, using the fact that each iteration of the algorithm may be interpreted as solving a set of coupled quadratic optimization problems, with coupling arising from empirical covariances. These optimization problems are readily appended with convex constraints, such as box (inequality) constraints [77]; see also [15, 18]. The UKI does not have this optimization interpretation, and so we adopt a different approach to enforcing box constraints.

In this paper there are occasions where we impose element-wise box constraints of the form 0 ≤ θ or θ_min ≤ θ ≤ θ_max. These are enforced by a change of variables, writing θ = ϕ(θ̃) where, for example, respectively, ϕ(θ̃) = |θ̃| or ϕ(θ̃) = θ_min + (θ_max − θ_min) s(|θ̃|), with s a monotone map of [0, ∞) into [0, 1]. The inverse problem is then reformulated as

y = G(ϕ(θ̃)) + η,

and the UKI methods and variants are employed with G ↦ G ∘ ϕ.

Consider the following stochastic dynamical system, in which W is a standard unit Brownian motion in R^{N_θ}:

θ̇ = C^{θp} Σ_η^{−1} (y − G(θ)) − C Σ_0^{−1} (θ − r_0) + √(2C) Ẇ.   (39)

Here

C^{θp} = E((θ − m) ⊗ (G(θ) − E G(θ)))   (40)

and all expectations are computed under the law of θ, with respect to which the mean and covariance are denoted m and C respectively. This Itô-McKean diffusion process is approximated by an interacting particle system, and the law of θ approximated using the resulting empirical Gaussian approximation, leading to the EKS [19]. Here we invoke a different Gaussian approximation to approximate the law of θ, based on the unscented transform.
First consider the following evolution equations for the mean and covariance of the Gaussian approximation to the law of θ:

ṁ = C^{θp} Σ_η^{−1} (y − E G(θ)) − C Σ_0^{−1} (m − r_0),        (41a)
Ċ = −2 C^{θp} Σ_η^{−1} (C^{θp})^T − 2 C Σ_0^{−1} C + 2C.        (41b)

The matrix C^{θp} is again given by (40), but now the expectation E is with respect to the distribution N(m, C). The UKS is defined by approximating the expectations in this system by use of an unscented transform.

In the case where G is linear and the solution is initialized at a Gaussian, the system (41) is self-consistent in that N(m, C) is the distribution of the Itô-McKean diffusion (39) governing θ; furthermore, the analysis in [19] shows that the system then converges to the posterior distribution (3), and the analysis in [78] shows that, when initialized at a non-Gaussian, the Gaussian dynamics is an attractor. It is thus natural to use numerical simulations of (41) to generate approximate samples from the posterior distribution. Illustrative examples are presented in Appendix B.
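Returning to the box constraints introduced at the start of this section, the reformulation G ↦ G ∘ ϕ can be sketched in a few lines; the forward map and numbers below are illustrative, not taken from the paper.

```python
import numpy as np

def phi(theta_tilde):
    # Componentwise |.| enforces the nonnegativity constraint 0 <= theta.
    return np.abs(theta_tilde)

def G(theta):
    # Toy linear forward map assumed here for illustration.
    return np.array([theta[0] + 2.0 * theta[1]])

def G_constrained(theta_tilde):
    # The inverse problem is reformulated as y = G(phi(theta_tilde)) + eta, so any
    # Kalman inversion runs on the unconstrained internal variable theta_tilde.
    return G(phi(theta_tilde))

theta_tilde = np.array([-0.7, 1.3])   # unconstrained internal variable
print(phi(theta_tilde))               # mapped parameters satisfy the constraint
print(G_constrained(theta_tilde))
```

The inversion algorithm never sees the constraint: it updates θ̃ freely, and only the composition with ϕ guarantees physically meaningful forward evaluations.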
5. Numerical Results
In this section we present numerical results for Kalman-based inversion using the proposedstochastic dynamical system equation (6).
We make choices of Σ_ω, Σ_ν and r guided by the discussion in Remark 2. However, for general nonlinear problems C_* is not explicitly defined. Thus we modify the prescription given in (30) and instead choose r = r_0, the desired prior mean, and

Σ_ν = 2 Σ_η,            (42a)
Σ_ω = (2 − α²) γ I,     (42b)

for some γ > 0. When the observational noise is absent or negligible, for over-determined problems, we take α = 1. To avoid overfitting in the presence of noise, for under-determined problems, we choose α ∈ (0, 1); in this paper we have simply used the values 0.0, 0.5 and 0.9. We initialize the algorithms with m_0 = r_0 and C_0 = γ I. Specific choices of r_0 and γ will differ between examples and will be spelled out in each example.

For all applications, we focus mainly on the UKI; some comparisons between the UKI and EKI are also presented, and computational difficulties inherent in the rough misfit landscape experienced by ExKI for chaotic dynamical systems are demonstrably overcome by deploying the UKI. The applications cover a wide range of problems. They include three categories:

1. Noiseless linear problems, where over-determined, under-determined and well-determined systems are considered.
   • Linear 2-parameter model problem: this problem serves as a proof-of-concept example, which demonstrates the convergence of the mean and the covariance matrix discussed in Subsection 3.1.
   • Hilbert matrix problem: this problem demonstrates the ability of the UKI to solve ill-conditioned inverse problems. It also serves as an example where the EKI suffers from divergence as it is iterated, but the UKI behaves well.

2. Noisy field recovery problems, in which we add 0%, 1% and 5% Gaussian random noise to the observation, as follows:

   y_obs = y_ref + ε ⊙ ξ,   ξ ∼ N(0, I),   (43)

   where y_ref = G(θ_ref), ε = 0% y_ref, 1% y_ref and 5% y_ref, and ⊙ denotes element-wise multiplication.
It is important to distinguish between the added Gaussian random noise appearing in the data and the observation error model η ∼ N(0, Σ_η) used in the development of the inversion algorithm; in essence we assume imperfect knowledge of the noise model. Comparison of UKI and EKI is presented. EKI is shown to suffer from finite ensemble size effects, and in some cases diverges; in contrast, UKI behaves well. This category of inversion for fields also serves to demonstrate the value of the Tikhonov regularization parameter α ∈ (0,
1) in the prevention of overfitting. We consider three examples, now listed.
   • Darcy flow problem: finding permeability parameters in subsurface flow from measurements of pressure (or piezometric head).
   • Damage detection problem: determining the damage field in an elastic body from displacement observations on the surface of the structure.
   • Navier-Stokes problem: we study a two dimensional incompressible fluid, using the vorticity-streamfunction formulation, and recover the initial vorticity from noisy observations of the vorticity field at later times.

3. Chaotic problems, in which the parameters are learned from time-averaged statistics. For these problems, we demonstrate that choosing α = 1 is satisfactory, relying on the implicit regularization inherent in the approximate LMA interpretation of ExKI and UKI, as discussed in Subsection 3.2. The three examples considered are now listed.
   • Lorenz63 model problem: we present a discussion of why adjoint based methods, including ExKI, fail; we then demonstrate that the UKI succeeds. We attribute the success of the UKI to the averaging effect induced by the unscented transform and discussed in Subsection 3.3.
   • Multiscale Lorenz96 problem: we study a scale-separated setting, in which the closure for the fast dynamics is learned from time-averaged statistics.
   • Idealized general circulation model problem: this is a 3D Navier-Stokes problem with a hydrostatic assumption and a simple parameterized subgrid-scale model; we learn the parameters of the subgrid-scale model from time-averaged data. This problem demonstrates the potential of applying the UKI to large scale chaotic inverse problems.

The code is accessible online: https://github.com/Zhengyu-Huang/InverseProblems.jl
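The perturbed observations (43) can be generated as below; the helper name and example data are illustrative.

```python
import numpy as np

def add_observation_noise(y_ref, level, rng):
    """Perturb data as in (43): y_obs = y_ref + eps ⊙ xi with xi ~ N(0, I),
    where each entry of eps is `level` times the corresponding entry of y_ref."""
    xi = rng.standard_normal(y_ref.shape)
    return y_ref + level * y_ref * xi

rng = np.random.default_rng(0)
y_ref = np.array([2.0, -1.0, 4.0])
y_obs = add_observation_noise(y_ref, 0.05, rng)   # the 5% noise scenario
print(y_obs)   # each entry perturbed by roughly 5% of its own magnitude
```

Note the deliberate mismatch discussed above: the data are perturbed multiplicatively entry by entry, while the inversion algorithm assumes the simpler additive model η ∼ N(0, Σ_η).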
Consider the 2-parameter linear inverse problem y = Gθ + η. We explore the following three scenarios:

• non-singular (well-determined) system (NS): y ∈ R², with G ∈ R^{2×2} invertible and θ_ref the unique solution of Gθ = y;
• over-determined system (OD): y ∈ R³, with G ∈ R^{3×2} of full column rank;
• under-determined system (UD): y ∈ R¹, with G = [1 2] and θ_ref = [1 1]^T + c [2 −1]^T, c ∈ R.

(See section 7.1 of [79] for an example with a similar set-up; see also the discussion around equation (55) in [80], where the additive Gaussian noise used in the data is carefully constructed to scale relative to the truth underlying it.)

We define θ_ref = argmin_θ ‖Σ_η^{−1/2}(y − Gθ)‖², and note that, in the UD case, θ_ref comprises a one-parameter family of possible solutions. We also note that y = Gθ_ref for NS, and y = Gθ_ref(c†) for UD, with c† = 0; but for OD, y ≠ Gθ_ref.

We choose r_0 = 0 and hence initialize the UKI at θ_0 ∼ N(0, γI). We choose the observation error covariance Σ_η proportional to the identity although, note, there is no noise in the data we use; we thus set α = 1. The convergence of the parameter vector m_n is depicted in Fig. 1. In all scenarios, the estimated parameter vectors converge to the desired values exponentially fast. For UD, θ_ref is not unique and the converged mean vector m_∞ depends on the initial conditions for the algorithm. For this specific case, {m_n} converges to θ_ref = [0.6 1.2]^T, and the corresponding c is −0.
2; this is not equal to the true c† = 0, as the data contains no information about it. The convergence of the covariance matrices {C_n} to C_∞ is depicted in Fig. 2, with NS and OD on the left and UD on the right. In the cases NS and OD, the estimated covariance matrices converge to the desired values (the steady state of equation (26)). In the case UD, the covariance matrices {C_n} diverge to +∞ (see Remark 4); nonetheless, this divergence of the covariance matrix does not affect the convergence of the parameter vector. Discussion concerning the use of UKI with α ∈ (0,
1) for the under-determined system is presented in Appendix C, illustrating the benefit of regularization in this setting.
We define the Hilbert matrix G ∈ R^{N_θ×N_θ} by its entries

G_{i,j} = 1/(i + j − 1).

The condition number of G grows as O((1 + √2)^{4N_θ}/√N_θ) [81]. We consider the inverse problem y = Gθ + η. We use a true solution θ_ref = [1, 1, …, 1]^T and data y = Gθ_ref. The conditioning of G makes determination of θ from y difficult: traditional linear solvers fail for such a problem. We consider two scenarios: N_θ = 10 and N_θ = 100. Both UKI and EKI are applied, and we set r_0 = 0, so that θ_0 ∼ N(0, γI). In the inversion algorithm, we take the observation error covariance Σ_η proportional to the identity; the data itself contains no noise, and we again set α = 1.

Since θ_0 ∼ N(m_0, γI) and, for the under-determined 2-parameter example, m_n − m_0 ∈ Range(G^T), we deduce that c = ([2 −1] m_∞ − 1)/5. G \ y in Julia leads to an L² error of 4250.
142 for N θ = 100.
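The ill-conditioning driving this example is easy to reproduce; the following sketch builds the Hilbert matrix and shows that already for N_θ = 10 a direct solve of Gθ = y recovers the vector of ones only approximately (NumPy here, where the paper uses Julia's backslash solver).

```python
import numpy as np

def hilbert(n):
    # Hilbert matrix G_{ij} = 1 / (i + j - 1) with 1-based indices.
    i, j = np.ogrid[1:n + 1, 1:n + 1]
    return 1.0 / (i + j - 1)

theta_ref = np.ones(10)
G = hilbert(10)
y = G @ theta_ref

# Already at N_theta = 10 the matrix is numerically close to singular
# (its 2-norm condition number is roughly 1.6e13), so a direct solve
# recovers theta_ref with a visible error.
print(np.linalg.cond(G))
err = np.linalg.norm(np.linalg.solve(G, y) - theta_ref)
print(err)
```

For N_θ = 100 the matrix is singular to working precision and a direct solve is meaningless, which is the regime where the implicit regularization of the UKI matters.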
Figure 1: L² error ‖m_n − θ_ref‖ of the linear 2-parameter model problem. NS: non-singular system, OD: over-determined system, UD: under-determined system.

For the EKI, the sample sizes are set to J = 2N_θ + 1 and J = 100N_θ + 1. The convergence of the parameter vector m_n is depicted in Fig. 3. The UKI converges, but the convergence rate depends on the condition number of G, slowing as it grows. The EKI converges to a certain accuracy as fast as the UKI, and then diverges. This demonstrates the importance of early stopping of the EKI, as shown in [14], or the use of adaptive regularization as in [15, 17]. Furthermore, it demonstrates that the UKI does not suffer from these issues and automatically handles the ill-posedness of the inverse problem.

Consider the Darcy flow equation on the two-dimensional spatial domain D = [0, 1]², which describes the pressure field p(x) in a porous medium defined by a positive permeability field a(x, θ):

−∇·(a(x, θ)∇p(x)) = f(x),   x ∈ D,
p(x) = 0,                    x ∈ ∂D.

For simplicity, Dirichlet boundary conditions on the pressure are applied on ∂D. The fluid source field f is piecewise constant in x₂, taking three increasing values on horizontal strips of the domain. We place a prior on the permeability field a(x, θ) by assuming that log a(x, θ) is a centred Gaussian with covariance

C_0 = (−Δ + τ² I)^{−d};

here −Δ denotes the Laplacian on D subject to homogeneous Neumann boundary conditions on the space of spatial-mean zero functions, and τ > 0, d > 0 (τ = 3 and d = 2 in the present study). See [18, 82, 19, 83] for
Figure 2: Frobenius norm ‖C_n − C_∞‖_F (left) for the non-singular (NS) and over-determined (OD) systems, and ‖C_n‖_F (right) for the under-determined (UD) system of the linear 2-parameter model problem.

Figure 3: L² error ‖m_n − θ_ref‖ of the Hilbert inverse problem with N_θ = 10 (left) and N_θ = 100 (right), for the UKI and for the EKI with J = 2N_θ + 1 and J = 100N_θ + 1.

examples. The parameter θ represents the countable set of coefficients in the Karhunen-Loève (KL) expansion of the Gaussian random field:

log a(x, θ) = Σ_{l∈K} θ_{(l)} √λ_l ψ_l(x),   (44)

where K = Z_{≥0} × Z_{≥0} \ {(0, 0)}, and the eigenpairs are of the form

ψ_l(x) = √2 cos(π l₁ x₁)                 if l₂ = 0,
ψ_l(x) = √2 cos(π l₂ x₂)                 if l₁ = 0,
ψ_l(x) = 2 cos(π l₁ x₁) cos(π l₂ x₂)     otherwise,

λ_l = (π² |l|² + τ²)^{−d},

and θ_{(l)} ∼ N(0,
1) i.i.d. The KL expansion equation (44) can be rewritten as a sum over Z⁺ rather than a lattice:

log a(x, θ) = Σ_{k∈Z⁺} θ_{(k)} √λ_k ψ_k(x),   (45)

where the eigenvalues λ_k are in descending order. In practice, we truncate this sum to N_θ terms, based on the largest N_θ eigenvalues, and hence θ ∈ R^{N_θ}. The forward problem is solved by a finite difference method on a 80 ×
80 grid.
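A sketch of sampling log a(x, θ) from the truncated KL expansion (45), under the eigenpairs stated above with τ = 3 and d = 2; the mode enumeration range and the cell-centred grid are assumptions of this illustration, not the paper's code.

```python
import numpy as np

def kl_log_permeability(theta, N_grid=80, tau=3.0, d=2.0):
    """Truncated KL expansion (45) of log a(x, theta) on an N_grid x N_grid mesh.
    Modes l in Z_{>=0}^2 \ {(0,0)} are sorted by descending eigenvalue
    lambda_l = (pi^2 |l|^2 + tau^2)^(-d); a sketch, not the paper's solver."""
    x = (np.arange(N_grid) + 0.5) / N_grid          # cell-centred grid on [0, 1]
    X1, X2 = np.meshgrid(x, x, indexing="ij")

    # Enumerate lattice modes and sort by eigenvalue (range 16 is an assumption,
    # large enough to cover the N_theta largest eigenvalues used here).
    modes = [(l1, l2) for l1 in range(16) for l2 in range(16) if (l1, l2) != (0, 0)]
    lams = [(np.pi**2 * (l1**2 + l2**2) + tau**2) ** (-d) for l1, l2 in modes]
    order = np.argsort(lams)[::-1][: theta.size]

    log_a = np.zeros((N_grid, N_grid))
    for coef, k in zip(theta, order):
        l1, l2 = modes[k]
        if l2 == 0:
            psi = np.sqrt(2.0) * np.cos(np.pi * l1 * X1)
        elif l1 == 0:
            psi = np.sqrt(2.0) * np.cos(np.pi * l2 * X2)
        else:
            psi = 2.0 * np.cos(np.pi * l1 * X1) * np.cos(np.pi * l2 * X2)
        log_a += coef * np.sqrt(lams[k]) * psi
    return log_a

rng = np.random.default_rng(1)
theta = rng.standard_normal(32)          # N_theta = 32 KL coefficients
field = kl_log_permeability(theta)
print(field.shape)
```

Each cosine mode has zero spatial mean, so the sampled log-permeability field is mean-zero, consistent with the prior being defined on spatial-mean zero functions.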
Figure 4: The pressure field of the Darcy flow problem and the 49 equidistant pointwise measurements (black dots).
For the inverse problem, the observation y_ref consists of pointwise measurements of the pressure value p(x) at 49 equidistant points in the domain (see Fig. 4). We generate a truth random field log a_ref(x) with θ ∼ N(0, I) in R^256 (i.e. we use the first 256 KL modes) to construct the observation y_ref; different levels of noise are added to produce the data y_obs, as explained in (43). Using this data, we consider two incomplete parameterization scenarios: solving for the first 32 KL modes (N_θ = 32) and for the first 8 KL modes (N_θ = 8). EKI and UKI are both applied. We take r_0 = 0 and γ = 1, so that θ_0 ∼ N(0, I). The observation error satisfies η ∼ N(0, I). For the EKI, the ensemble size is set to J = 100, which is larger than the number of σ-points used in UKI (2N_θ + 1).

For the N_θ = 32 case, the convergence of the log-permeability fields log a(x, m_n) and the optimization errors (2) at each iteration for different noise levels are depicted in Fig. 5; the top row shows the relative L² errors in the estimate of log a and the bottom row shows the optimization errors (data-misfit); left to right corresponds to different noise levels in the data. Without explicit regularization (α = 1.0), overfitting is observed for noisy data; regularization (α = 0.
5) relieves the overfitting. The estimated log-permeability fields log a(x, m_n) at the 50th iteration and the truth random field are depicted in Fig. 6. Both UKI and EKI deliver similar results, and these estimated log-permeability fields capture the main features of the truth random field.

For the N_θ = 8 case, the convergence of the log-permeability fields log a(x, m_n) and the optimization errors at each iteration for different noise levels are depicted in Fig. 7. Even without explicit regularization (α = 1.0), no obvious overfitting is observed. The estimated log-permeability fields log a(x, m_n) at the 50th iteration for different noise levels, obtained by the UKI, and the truth random field are depicted in Fig. 8. Compared with the N_θ = 32 case, all Kalman inversions with N_θ = 8 perform better for the 5% noise scenario. This indicates the possibility of regularizing the inverse problem by reducing the parameter dimensionality.

Finally, we observe the smoothness, as a function of the iteration number, of the UKI in comparison to the EKI. This may be seen in all the experiments undertaken in the Darcy flow example.
Figure 5: Relative error ‖log a(x, m_n) − log a_ref(x)‖₂/‖log a_ref(x)‖₂ (top) and the optimization error ½‖Σ_η^{−1/2}(y_obs − ŷ_n)‖² (bottom) of the Darcy problem (N_θ = 32) with different noise levels: noiseless (left), 1% error (middle), and 5% error (right).

Consider a thin linear elastic arch-like plate, which is fixed on the bottom edges. Tension traction is applied on the top edge as a distributed load p. The equilibrium equations for the stress σ take the form

∇·σ + b = 0   in Ω,
u = ū          on Γ_u,
σ·n = t̄        on Γ_t.

Here u is the displacement vector, b = 0 is the body force vector, Ω ⊂ R² is the bounded domain occupied by the plate (see Fig. 9), and Γ_u and Γ_t are the displacement boundary and traction boundary respectively; thus Γ_u ∪ Γ_t = ∂Ω. The strain tensor is

ε_mn = ½(∂u_n/∂x_m + ∂u_m/∂x_n).   (46)

The linear constitutive relation between strain and stress is written as

σ_ij = C_ijmn(E, ν) ε_mn,   (47)

where C_ijmn are the constitutive tensor components, which depend on the Young's modulus E and Poisson's ratio ν; throughout this study, we fix ν and estimate E. The damage is assumed to be isotropic elasticity-based damage, with

E(x, θ) = (1 − ω(x, θ)) E_0.

Throughout this study, we fix E_0 = 1000; ω(x, θ) is the scalar-valued damage variable, which varies between zero (no damage) and one (complete damage). The truth damage field (see Fig. 9-left) is

ω_ref(x) = a₁ e^{−(x−x₁)ᵀΣ₁^{−1}(x−x₁)} + a₂ e^{−(x−x₂)ᵀΣ₂^{−1}(x−x₂)} + a₃ e^{−(x−x₃)ᵀΣ₃^{−1}(x−x₃)},

with amplitudes a₁, a₂, a₃ ∈ (0, 1), centers x₁, x₂, x₃, and

Σ₁ = diag(200, 200),   Σ₂ = diag(800, 400),   Σ₃ = diag(100, 400),

and may be seen to exhibit three flaws. Noise is added to the observations on the boundary as in (43). The forward equation is solved by the finite element method with 384 quadratic quadrilateral elements (1649 nodes) using the NNFEM library [30, 31].

For the inverse problem, the damage field is parameterized as

ω(θ(x)) = 0.9 (1 − e^{−θ(x)})/(1 + e^{−θ(x)}) ∈ (−0.9, 0.9),

where the θ(x) field is discretized and represented on a mesh of 24 quadratic quadrilateral elements (N_θ = 125). It is worth mentioning that increasing the parameter dimensionality by refining the parameter mesh exacerbates the ill-posedness and, therefore, deteriorates the performance of both Kalman inversions. The observations are x₁ and x₂ displacements measured at 46 (N_y = 92) locations on the surface
Figure 7: Relative error ‖log a(x, m_n) − log a_ref(x)‖₂/‖log a_ref(x)‖₂ (top) and the optimization error ½‖Σ_η^{−1/2}(y_obs − ŷ_n)‖² (bottom) of the Darcy problem (N_θ = 8) with different noise levels: noiseless (left), 1% error (middle), and 5% error (right).

Figure 8: Log-permeability fields log a(x, m_n) with N_θ = 8 obtained by the UKI, and the truth (right), for different noise levels: noiseless α = 1 (left), 1% noise α = 1 (middle-left), 5% noise α = 1 (middle-right).

boundaries (see Fig. 9-right). We set r_0 = 0 and γ = 1, and UKI and EKI are both applied, initialized with θ_0 ∼ N(0, I). The observation error model used in the algorithm is η ∼ N(0, Σ_η) with Σ_η proportional to the identity. For this problem the prior information ω(θ = 0) = 0 corresponds to an undamaged plate, and is expected to be reasonable for most of the domain; therefore the choice α = 0.5 is appropriate. For the EKI, the ensemble size is set to J = 500, which is larger than the number of σ-points used in UKI (2N_θ + 1).

The convergence of the damage field ω(θ(x, m_n)) and the optimization errors at each iteration are depicted in Fig. 10; the organization of the information is the same as in the Darcy flow example. In the noiseless scenario, the EKI exhibits divergence without regularization (α = 1.0), while the UKI behaves well. For noisy scenarios, the effect of overfitting is significant. At the 1% noise level, setting α = 0.5 relieves the overfitting; at the 5% noise level, even stronger regularization (α = 0) helps. We will see the same phenomenon in Subsection 5.7.
Figure 9: The damaged Young's modulus (left) and the displacement magnitude field (right) with 46 measurement locations on the surface of the boundaries (black dots).

Results with α = 0 are reported for the 5% noise scenario. The estimated damaged Young's modulus fields E(x, θ) and the truth are depicted in Fig. 11. Both Kalman inversion methods perform comparably, and the three flaw areas are captured; however, at the 5% noise level, noticeable bias is visible in the flaws to the left and right of the domain. As in the Darcy flow case, the convergence histories of the UKI are smoother than those of the EKI.
Figure 10: Relative error ‖ω(θ(x, m_n)) − ω_ref‖₂/‖ω_ref‖₂ (top) and the optimization error ½‖Σ_η^{−1/2}(y_obs − ŷ_n)‖² (bottom) of the damage detection problem with different noise levels: noiseless (left), 1% error (middle), and 5% error (right).

Figure 11: Damaged Young's modulus fields (1 − ω(x, m_n))E_0 obtained by UKI, EKI, and the truth (left to right) at different noise levels: noiseless α = 1, 1% noise α = 0.
5, 5% noise α = 0 .
5, and 5% noise α = 0 (top to bottom).

We consider the 2D Navier-Stokes equation on the periodic domain D = [0, 2π] × [0, 2π]:

∂v/∂t + (v·∇)v + ∇p − νΔv = 0,   ∇·v = 0,   (2π)^{−2} ∫_D v = v_b;

here v and p denote the velocity vector and the pressure, ν = 0.
01 denotes the dynamic viscosity, and v_b = (2π, π) denotes the non-zero mean background velocity. The forward problem is rewritten in the vorticity-streamfunction (ω–ψ) formulation:

∂ω/∂t + (v·∇)ω − νΔω = 0,
ω = −Δψ,   (2π)^{−2} ∫_D ψ = 0,
v = (∂ψ/∂x₂, −∂ψ/∂x₁) + v_b,

and solved by the pseudo-spectral method [84] on a 128 ×
128 grid. To eliminate aliasing error, theOrszag 2/3-Rule [85] is applied and, therefore there are 85 Fourier modes (padding with zeros).Time-integration is performed using the Crank–Nicolson method with ∆ T = 2 . × − .26e study the problem of recovering the initial vorticity field from measurements at positivetimes. We parameterize it as ω ( x, θ ), defined by parameters θ ∈ R N θ , and modeled a priori as aGaussian field with covariance operator C = ∆ − , subject to periodic boundary conditions, on thespace of spatial-mean zero functions. The KL expansion of the initial vorticity field is given by ω ( x, θ ) = (cid:88) l ∈ K θ c ( l ) (cid:112) λ l ψ cl + θ s ( l ) (cid:112) λ l ψ sl , (48)where K = { ( k x , k y ) | k x + k y > k x + k y = 0 and k x > } , and the eigenpairs are of the form ψ cl ( x ) = cos( l · x ) √ π ψ sl ( x ) = sin( l · x ) √ π λ l = 1 | l | , and θ c ( l ) , θ s ( l ) ∼ N (0 , π ) i.i.d. The KL expansion equation (48) can be rewritten as a sum over Z rather than a lattice: ω ( x, θ ) = (cid:88) k ∈ Z θ ( k ) (cid:112) λ k ψ k ( x ) , (49)where the eigenvalues λ k are in descending order.For the inverse problem, we recover the initial condition, specifically the initial vorticity field ofthe Navier-Stokes equation, given pointwise observations y ref of the vorticity field at 16 equidistantpoints ( N y = 32) at T = 0 .
25 and T = 0 . y obs are defined asin (43). The initial vorticity field ω ,ref is generated with all 85 Fourier modes, and the first N θ = 100 KL modes of equation (49) are recovered. We take r = 0 and γ = 10. Both UKI andEKI are applied with θ ∼ N (0 , I ) and the observation error assumed for inversion purposes is η ∼ N (0 , I ). For the EKI, the ensemble size is set to be J = 201, which equals the number of σ − points in UKI (2 N θ + 1). Y Y Figure 12: The vorticity fields of the Navier-Stokes problem and the 16 equidistant pointwise measurements (blackdots) at two observation times ( T = 0 .
25 and T = 0 . The convergence of the initial vorticity field ω ( x, m n ) and the optimization errors for differentnoise levels at each iteration are depicted in Fig. 13; the organization of the figure is the same as inthe Darcy case. In all scenarios, the UKI outperforms EKI. Moreover, without regularization ( α =1 . ω ( x, m n ) at the 50th iteration fordifferent noise levels obtained by the Kalman inversions and the truth random field are depicted inFig. 14. Both Kalman inversions capture main features of the truth random initial field, but notthe detailed small features, due to the irreversibility of the diffusion process ( ν = 0 . .30.40.50.60.70.80.91.0 R e l a t i v e L n o r m e rr o r O p t i m i z a t i o n e rr o r UKI ( = 1.0)EKI ( = 1.0)UKI ( = 0.9)EKI ( = 0.9) 0.30.40.50.60.70.80.91.0 R e l a t i v e L n o r m e rr o r O p t i m i z a t i o n e rr o r UKI ( = 1.0)EKI ( = 1.0)UKI ( = 0.9)EKI ( = 0.9) 0.30.40.50.60.70.80.91.0 R e l a t i v e L n o r m e rr o r O p t i m i z a t i o n e rr o r UKI ( = 1.0)EKI ( = 1.0)UKI ( = 0.9)EKI ( = 0.9)
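As a concrete illustration of the Karhunen-Loève parameterization (48) used above, the following sketch draws a mean-zero periodic field with eigenvalues λ_l = 1/|l|⁴; the grid size, truncation radius, and overall normalization are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

# one representative wavenumber from each {l, -l} pair, as in the set K
K_MODES = [(kx, ky) for kx in range(-3, 4) for ky in range(-3, 4)
           if (kx + ky > 0) or (kx + ky == 0 and kx > 0)]

def kl_vorticity(theta, n=64, L=2 * np.pi):
    """Truncated KL expansion: omega(x) = sum_l sqrt(lambda_l) *
    (theta_c(l) cos(l.x) + theta_s(l) sin(l.x)), with lambda_l = 1/|l|^4."""
    xs = np.linspace(0.0, L, n, endpoint=False)
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    omega = np.zeros((n, n))
    for i, (kx, ky) in enumerate(K_MODES):
        lam = (kx ** 2 + ky ** 2) ** -2      # lambda_l = 1/|l|^4
        phase = kx * X + ky * Y
        omega += np.sqrt(lam) * (theta[2 * i] * np.cos(phase)
                                 + theta[2 * i + 1] * np.sin(phase))
    return omega

rng = np.random.default_rng(0)
field = kl_vorticity(rng.standard_normal(2 * len(K_MODES)))
```

Because the zero mode is excluded from K, the sampled field has zero spatial mean, consistent with the mean-zero constraint on the vorticity perturbation.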
Figure 13: Relative error ‖ω₀(x, m_n) − ω_{0,ref}‖/‖ω_{0,ref}‖ (top) and the optimization error ½‖Σ_η^{-1/2}(y_obs − ŷ_n)‖² (bottom) of the Navier-Stokes problem at different noise levels: noiseless (left), 1% error (middle), and 5% error (right).

Consider the Lorenz63 system, a simplified mathematical model for atmospheric convection [86]:

dx₁/dt = σ(x₂ − x₁),
dx₂/dt = x₁(r − x₃) − x₂,
dx₃/dt = x₁x₂ − βx₃;

the system is parameterized by σ, r, β ∈ R⁺. The observation consists of time averages of various moments over time windows of size T = 20, with an initial spin-up period T₀ = 30 to eliminate the influence of the initial condition; if f: R³ → R computes a moment, then we define

f̄ = (1/T) ∫_{T₀}^{T₀+T} f(x(t)) dt.

The truth observation is computed with parameters (σ, r, β) = (10, 28, 8/3) over a time window of size T = 200, also with an initial spin-up period T₀ = 30. Since the data is generated on an interval 10 times as long as the observation window, we may appeal to the central limit theorem [87] to argue that the observation error caused by time-averaging over only 20 time units is approximately Gaussian. We split the observation time series into 10 windows of size T = 20 and compute the covariance of the observation error η following [53]. We set r₀ = 5 and γ₀ = 1. The UKI is initialized with θ₀ ∼ N(r₀, I), and α is set to 1.

We start with the following one-parameter inverse problem, with fixed σ = 10 and β = 8/3:

y = G(r) + η  with  y = x̄₃.  (50)

The landscape of G and the sensitivity of G(·) with respect to the input, for observations derived from chaotic problems such as equation (50), are widely studied [54, 55]. For our specific set-up, we demonstrate this in Fig. 15.

Figure 14: Initial vorticity fields ω₀(x, m_n) recovered by UKI, EKI, and the truth (left to right) at different noise levels: noiseless α = 1, 1% noise α = 0.9, 5% noise α = 0.9.

The function G is characterized by a sudden change at r ≈ 22, and the landscape is highly oscillatory for r > 22; furthermore, the sensitivity dG(r)/dr computed with the discrete adjoint method blows up:

|dG(r)/dr| ∝ O(e^{λT}),

with the value of the exponent λ consistent with the first global Lyapunov exponent [54, 88]. This illustrates the challenges inherent in parameter estimation and sensitivity analyses for chaotic systems. In particular, the ExKI method suffers from the large derivatives of G.

The UKI is applied, and the estimated r and the associated 3-σ confidence intervals at each iteration are depicted in Fig. 16. The confidence intervals give an indication of the evolving covariance C_n. The estimate of r at the 20th iteration is approximately Gaussian, with mean close to the true value r = 28 and small variance. We also compute the smoothed landscape F_G and its associated gradient F_{dG}, with the standard deviation σ_r fixed at its converged value; this gives an indication of the energy landscape as perceived by the UKI. In particular we have:

F_G(r) = ∫ G(x) (1/(√(2π) σ_r)) exp(−(x − r)²/(2σ_r²)) dx,

F_{dG}(r) = [∫ (x − r)(G(x) − G(r)) (1/(√(2π) σ_r)) exp(−(x − r)²/(2σ_r²)) dx] / [∫ (x − r)² (1/(√(2π) σ_r)) exp(−(x − r)²/(2σ_r²)) dx].

These functions are depicted in Fig. 17, which should be compared with Fig. 15. We see that F_G is smooth (except at the transition point), and F_{dG} does not suffer from blow-up in the way dG/dr does; furthermore, F_{dG} represents the true gradient dG/dr ≈ 0.96 well. This explains why adjoint/gradient-based methods, including the ExKI, fail, but the UKI succeeds for this chaotic inverse problem.

Figure 15: Landscape (left) and sensitivity (right) of G in the 1-parameter Lorenz63 inverse problem equation (50).

Figure 16: Convergence of the 1-parameter Lorenz63 inverse problem with UKI (α = 1.0).

Next, we consider a three-parameter inverse problem, using the ideas in Subsection 4.1. Let θ = (θ⁽¹⁾, θ⁽²⁾, θ⁽³⁾) and let (σ, r, β) = (|θ⁽¹⁾|, |θ⁽²⁾|, |θ⁽³⁾|). The map G(θ) is found by computing time averages of all three components of x, as described above, for given input parameter θ. The use of the modulus helps ensure solution trajectories which do not blow up. We have

y = G(θ) + η  with  y = (x̄₁, x̄₂, x̄₃, (x₁x₂)‾, (x₂x₃)‾, (x₃x₁)‾).  (51)

All other aspects of the setup are the same as in the aforementioned one-parameter inverse problem. The estimated parameters and associated 3-σ confidence intervals for each component at each iteration are depicted in Fig. 18.

Figure 17: Landscape (left) and sensitivity (right) of F_G in the 1-parameter Lorenz63 inverse problem equation (50), smoothed as viewed by the UKI.

The estimates of the parameters at the 20th iteration are approximately Gaussian, with means close to the true values (σ, r, β) = (10, 28, 8/3) and a small covariance. For both scenarios, the UKI converges efficiently, thanks to the linear (or superlinear) convergence rate of the LMA and the averaging property. Although the covariance of the iteration does not represent the Bayesian posterior uncertainty, it does indicate the sensitivities inherent in the estimation problem, and in particular that estimation of β involves the largest sensitivities.

Figure 18: Convergence of the 3-parameter Lorenz63 inverse problem with UKI (α = 1.0).

5.9. Multiscale Lorenz96 Problem

Consider the multiscale Lorenz96 system, a simplified mathematical model for the midlatitude atmosphere [89], with K slow variables X⁽ᵏ⁾, each coupled with J fast variables Y⁽ʲ,ᵏ⁾:

dX⁽ᵏ⁾/dt = −X⁽ᵏ⁻¹⁾(X⁽ᵏ⁻²⁾ − X⁽ᵏ⁺¹⁾) − X⁽ᵏ⁾ + F − (hc/b) Σ_{j=1}^{J} Y⁽ʲ,ᵏ⁾,
dY⁽ʲ,ᵏ⁾/dt = −cb Y⁽ʲ⁺¹,ᵏ⁾(Y⁽ʲ⁺²,ᵏ⁾ − Y⁽ʲ⁻¹,ᵏ⁾) − c Y⁽ʲ,ᵏ⁾ + (hc/b) X⁽ᵏ⁾.  (52)

To close the system, it is appended with the cyclic boundary conditions X⁽ᵏ⁺ᴷ⁾ = X⁽ᵏ⁾, Y⁽ʲ,ᵏ⁺ᴷ⁾ = Y⁽ʲ,ᵏ⁾, and Y⁽ʲ⁺ᴶ,ᵏ⁾ = Y⁽ʲ,ᵏ⁺¹⁾. The time-scale separation is parameterized by the coefficient c, and the large scales are subjected to external forcing F. We choose here as parameters K = 8, J = 32, F = 20, c = b = 10, and h = 1, as in [90, 91, 92, 93]. As time-integrator, we use the 4th-order Runge-Kutta method with ∆T = 5 × 10⁻³.

Our goal is to learn the closure model ψ(X) of the fast dynamics for a reduced model of the form

dX⁽ᵏ⁾/dt = −X⁽ᵏ⁻¹⁾(X⁽ᵏ⁻²⁾ − X⁽ᵏ⁺¹⁾) − X⁽ᵏ⁾ + F + ψ(X⁽ᵏ⁾).

The closure model ψ: D ⊂ R → R is parameterized by the finite element method with cubic Hermite polynomials. The domain is set to be D = [−20, 20] and decomposed into 5 elements; therefore, N_θ = 12.

For the inverse problem, the observations consist of the time averages of the first and second moments of X⁽¹⁾, X⁽²⁾, X⁽³⁾, and X⁽⁴⁾ over a time window of size T = 1000; therefore, N_y = 14. The truth observation y_ref is computed with the multiscale chaotic system equation (52) with a random initial condition X⁽ᵏ⁾ ∼ N(0, 1) and Y⁽ʲ,ᵏ⁾ ∼ N(0, 0.01). Gaussian random noise of 1%, 2%, and 5% is added to the observation following equation (43). We set r₀ = 0 and γ₀ = 1; the UKI is thus initialized with θ₀ ∼ N(0, I). The observation error is set to be η ∼ N(0, diag{0.01 y_obs ⊙ y_obs}), and we take α = 1, since the system is over-determined. Moreover, these simulations start with another random initialization of X⁽ᵏ⁾ and Y⁽ʲ,ᵏ⁾. Over the domain D, the predicted probability density functions match well with the reference obtained from a full multiscale simulation (Fig. 20). It is worth mentioning that this problem is not sensitive with respect to the added Gaussian random noise.

Figure 19: Closure terms ψ(X) for the multiscale Lorenz96 system obtained from the truth (grey dots) and polynomial data-fitting by Wilks [91] and Arnold [92], compared with what is learned using the UKI approach (α = 1) at different noise levels.

Finally, we consider an idealized general circulation model. The model is based on the 3D Navier-Stokes equations, making the hydrostatic and shallow-atmosphere approximations common in atmospheric modeling. Specifically, we test the UKI on the well-known Held-Suarez test case [94], in which a detailed radiative transfer model is replaced by Newtonian relaxation of temperatures toward a prescribed "radiative equilibrium" T_eq(φ, p) that varies with latitude φ and pressure p. Specifically, the thermodynamic equation for temperature T,

∂T/∂t + ··· = Q

(the dots denoting advective and pressure work terms), contains a diabatic heat source

Q = −k_T(φ, p, p_s)(T − T_eq(φ, p)),

with relaxation coefficient (inverse relaxation time)

k_T = k_a + (k_s − k_a) max(0, (σ − σ_b)/(1 − σ_b)) cos⁴φ.
Here, σ = p/p_s, which is pressure p normalized by surface pressure p_s, is the vertical coordinate of the model, and

T_eq = max{200 K, [315 K − ∆T_y sin²φ − ∆θ_z log(p/p₀) cos²φ](p/p₀)^κ}

is the equilibrium temperature profile (p₀ = 10⁵ Pa is a reference surface pressure and κ = 2/7). The default parameters are

k_a = (40 day)⁻¹,  k_s = (4 day)⁻¹,  ∆T_y = 60 K,  ∆θ_z = 10 K.

For the numerical simulations, we use the spectral transform method in the horizontal, with T42 spectral resolution (triangular truncation at wavenumber 42, with 64 × 128 points on the latitude-longitude transform grid); we use 20 vertical levels equally spaced in σ. With the default parameters, the model produces an Earth-like zonal-mean circulation, albeit without moisture or precipitation. A single jet is generated, with a maximum strength of roughly 30 m s⁻¹ near 45° latitude (Fig. 21).

Our inverse problem is constructed to learn the parameters in the Newtonian relaxation term Q: (k_a, k_s, ∆T_y, ∆θ_z). We do so in the presence of the following constraints:

0 day⁻¹ < k_a < 1 day⁻¹,  k_a < k_s < 2 day⁻¹,  0 < ∆T_y,  0 < ∆θ_z.

Figure 20: Empirical probability density functions of the slow variables X⁽ᵏ⁾ obtained from the full multiscale Lorenz96 system (Truth), the initial closure model (Prior), and the closure models learned by the UKI (α = 1) at different noise levels.

Conceptually, the setting is identical to that for the Lorenz63 example. We use the same overline notation to denote averaging, which here, in addition to the time average in the Lorenz model, also includes a zonal average over longitude (because the model is statistically symmetric under rotations around the planet's spin axis). To incorporate the imposition of the constraints, the inverse problem is formulated as follows (see Subsection 4.1 for details):

y = G(θ) + η  with  G(θ) = T̄(φ, σ),  (53)

with the parameter transformation

(k_a, k_s, ∆T_y, ∆θ_z) = (1/(1 + |θ⁽¹⁾|), 1/(1 + |θ⁽¹⁾|) + 1/(1 + |θ⁽²⁾|), |θ⁽³⁾|, |θ⁽⁴⁾|).  (54)

The observation mapping takes the unknown θ to the 200-day zonal mean of the temperature as a function of latitude (φ) and height (σ), after an initial spin-up of 200 days. The truth observation is the 1000-day zonal mean of the temperature (see Fig. 21-a), after an initial spin-up of 200 days to eliminate the influence of the initial condition. Because the truth observations come from an average 5 times as long as the observation window used for parameter learning, the chaotic internal variability of the model introduces noise in the observations. As in the Lorenz63 setting, the central limit theorem may be invoked to model the observation error from internal variability.

To perform the inversion, we set r₀ = [2 day, 2 day, 20 K, 20 K]^T and γ₀ = 1. Thus the UKI is initialized with θ₀ ∼ N(r₀, I). Within the algorithm, we assume that the observation error satisfies η ∼ N(0, I K²). Because the problem is over-determined, we set α = 1. The estimated parameters and associated 3-σ confidence intervals for each component at each iteration are depicted in Fig. 22. The estimates of the model parameters at the 20th iteration are approximately Gaussian, with means close to the true values (k_a, k_s, ∆T_y, ∆θ_z) = ((40 day)⁻¹, (4 day)⁻¹, 60 K, 10 K) and with small covariance. The reported covariances of the iteration do not reflect the true Bayesian posterior uncertainty, but they do capture the relative sensitivities in the problem. The UKI converges to the true parameters in fewer than 10 iterations with 9 σ-points, demonstrating the potential of applying the UKI to large-scale inverse problems.

(a) T  (b) T_eq  (c) U  (d) V

Figure 21: Zonal mean profile of temperature (a), radiative equilibrium temperature (b), zonal wind velocity (c), and meridional wind velocity (d), all from a 1000-day average. The horizontal coordinate is latitude and the vertical coordinate is the nondimensional σ coordinate of the model.
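The constrained parameterization (54) can be sketched as follows; the helper name and the example input θ are illustrative:

```python
import numpy as np

def constrained_params(theta):
    """Map unconstrained theta in R^4 to (k_a, k_s, dT_y, dtheta_z), as
    in (54): 0 < k_a < 1 and k_a < k_s by construction (units of 1/day),
    and the moduli keep dT_y and dtheta_z non-negative."""
    t1, t2, t3, t4 = np.abs(theta)
    k_a = 1.0 / (1.0 + t1)
    k_s = k_a + 1.0 / (1.0 + t2)
    return k_a, k_s, t3, t4

# e.g. |theta_1| = 39 reproduces the Held-Suarez value k_a = (40 day)^-1
k_a, k_s, dTy, dthz = constrained_params(np.array([39.0, -3.0, 60.0, 10.0]))
```

The shift k_s = k_a + 1/(1 + |θ⁽²⁾|) is what enforces the ordering constraint k_a < k_s for every unconstrained θ.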
6. Conclusion
The unscented Kalman inversion is attractive for at least four main reasons: (i) it is black-box and derivative-free; (ii) it is robust for chaotic inverse problems and noisy observations; (iii) it provides sensitivity information; (iv) the method is embarrassingly parallel. It is well adapted to parameter estimation problems for large complex models given as a black box. Our numerical results demonstrate its theoretical properties and its applicability; in particular, it is demonstrated to outperform the EKI on large-scale problems in which the number of unknown parameters is small. Because the methodology constitutes a novel approach to parameter estimation, there are many avenues for future research, including applications of the method, methodological improvements and extensions, and theoretical analysis.
Acknowledgments.
This work was supported by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program and by the National Science Foundation (NSF, award AGS-1835860). A.M.S. was also supported by the Office of Naval Research (award N00014-17-1-2079).

Figure 22: Convergence of the idealized general circulation model inverse problem with UKI (α = 1.0). The panels show k_a (day⁻¹), k_s (day⁻¹), ∆T_y (K), and ∆θ_z (K).

Appendix A. Proof of Theorems
Proof of Proposition 1.
An affine transformation is an invertible mapping from R^{N_θ} to R^{N_θ} of the form *x = Ax + b. We apply the affine transformation

*m_n = A m_n + b,  *C_n = A C_n A^T,  *r₀ = A r₀ + b,  *Σ_ω = A Σ_ω A^T,

keep the data y_{n+1} and Σ_ν unchanged, and define *G(θ) = G(A⁻¹(θ − b)). We prove that

*m_{n+1} = A m_{n+1} + b,  *C_{n+1} = A C_{n+1} A^T.  (A.1)

Equation (8) leads to

*m̂_{n+1} = *r₀ + α(*m_n − *r₀) = A m̂_{n+1} + b,  *Ĉ_{n+1} = α² *C_n + *Σ_ω = A Ĉ_{n+1} A^T.  (A.2)

Therefore, the distribution *θ_{n+1} | Y_n ∼ N(*m̂_{n+1}, *Ĉ_{n+1}) is the same as that of A θ_{n+1} + b | Y_n, and equation (9) becomes

*ŷ_{n+1} = ŷ_{n+1},  *Ĉ^{θp}_{n+1} = A Ĉ^{θp}_{n+1},  *Ĉ^{pp}_{n+1} = Ĉ^{pp}_{n+1}.  (A.3)

Finally, equation (10) leads to

*m_{n+1} = *m̂_{n+1} + *Ĉ^{θp}_{n+1}(*Ĉ^{pp}_{n+1})⁻¹(y_{n+1} − *ŷ_{n+1}) = A m_{n+1} + b,
*C_{n+1} = *Ĉ_{n+1} − *Ĉ^{θp}_{n+1}(*Ĉ^{pp}_{n+1})⁻¹(*Ĉ^{θp}_{n+1})^T = A C_{n+1} A^T.  (A.4)

Proof of Theorem 1.
In this proof we let B denote the Banach space of matrices in R^{N_θ×N_θ} equipped with the operator norm induced by the Euclidean norm on R^{N_θ}. Furthermore, we let L denote the Banach space of bounded linear operators from B into itself, equipped with the standard induced operator norm. For simplicity we consider the case r₀ = 0; a change of origin may be used to extend to the case r₀ ≠ 0. We first prove that the precision operators converge: lim_{n→∞} C_n⁻¹ = C_∞⁻¹; we then study the behaviour of the mean {m_n}. For both the precision and the mean we first study α ∈ (0, 1) and then α = 1. In what follows it is useful to note [72, Theorem 4.1] that the mean and covariance update equations (25) can be rewritten as

C_{n+1}⁻¹ = G^T Σ_ν⁻¹ G + (α² C_n + Σ_ω)⁻¹,
C_{n+1}⁻¹ m_{n+1} = G^T Σ_ν⁻¹ y + (α² C_n + Σ_ω)⁻¹ α m_n;  (A.5)

furthermore, the iteration for the covariance remains in the cone of positive semi-definite matrices [72, Theorem 4.1]. Since Σ_ω ≻ 0, the sequence {C_n⁻¹} is bounded:

G^T Σ_ν⁻¹ G ⪯ C_n⁻¹ ⪯ G^T Σ_ν⁻¹ G + Σ_ω⁻¹,  ∀n ∈ Z⁺.  (A.6)

Introducing C̃_n⁻¹ := √Σ_ω C_n⁻¹ √Σ_ω, we may rewrite the covariance update equation (A.5) in the form

C̃_{n+1}⁻¹ = √Σ_ω G^T Σ_ν⁻¹ G √Σ_ω + (α² C̃_n + I)⁻¹.  (A.7)

We define the mapping

f(X; α) = √Σ_ω G^T Σ_ν⁻¹ G √Σ_ω + (α² X⁻¹ + I)⁻¹;  (A.8)

then C̃_{n+1}⁻¹ = f(C̃_n⁻¹; α). This iteration is well-defined for C̃_n in B satisfying (A.6), and hence for the iteration (A.5).

We first consider α ∈ (0, 1). Then

C̃_{n+1} ⪯ α² C̃_n + I ⪯ ((1 − α^{2n+2})/(1 − α²)) I + α^{2n+2} C̃₀ ⪯ (1/(1 − α²)) I + α^{2n+2} C̃₀,  (A.9)

and hence, for any 0 < ε < 1 − α² and n sufficiently large,

C̃_{n+1}⁻¹ ⪰ (1 − α² − ε) I.  (A.10)

Let M ⊂ B denote the set of matrices B ∈ B satisfying B ⪰ (1 − α² − ε)I. Then M is absorbing and forward invariant under f. Thus, to show the existence of a globally exponentially attracting steady state, it suffices to show that f(·; α) is a contraction on M. The derivative of f(·; α): M → M is the element Df(X; α) ∈ L defined by its action on ∆X ∈ B as follows:

Df(X; α)∆X = α²(X + α²I)⁻¹ ∆X (X + α²I)⁻¹.  (A.11)

Thus

‖Df(X; α)∆X‖ = α²‖(X + α²I)⁻¹ ∆X (X + α²I)⁻¹‖ ≤ (α²/(1 − ε)²)‖∆X‖.

Therefore, since ε may be chosen so that α < 1 − ε, we have sup_{X∈M} ‖Df(X; α)‖_L < 1, and f is a contraction map on M. This establishes the exponential convergence of {C̃_n⁻¹}. Finally, the sequence {C_n⁻¹} converges exponentially fast to C_∞⁻¹, the non-singular fixed point of equation (A.5); equation (A.6) indicates that C_∞⁻¹ is indeed non-singular.

When α = 1, define the mapping f₁(X) = f(X; 1), so that C̃_{n+1}⁻¹ = f₁(C̃_n⁻¹). The derivative Df₁(X) ∈ L is defined by its action on ∆X ∈ B as follows:

Df₁(X)∆X = (I + X)⁻¹ ∆X (I + X)⁻¹.  (A.12)

Thus, using the lower bound from (A.6) and Range(G^T) = R^{N_θ},

‖Df₁(X)∆X‖ ≤ ‖(I + X)⁻¹‖²‖∆X‖ ≤ ‖(I + √Σ_ω G^T Σ_ν⁻¹ G √Σ_ω)⁻¹‖²‖∆X‖ ≤ (1 + ε₁)⁻²‖∆X‖,  (A.13)

where ε₁ > 0. Therefore, f₁ is a contraction map on the whole of B and the sequence {C̃_n⁻¹} converges. This completes the proof of exponential convergence of {C_n⁻¹} to a limit; the sequence {C_n⁻¹} converges to C_∞⁻¹, the fixed point of equation (A.5), viewed as a mapping on precision matrices. That C_∞ ≻ 0 follows as before.

We now show that {m_n} converges exponentially fast to m_∞. Using (A.5), the update equation (25) for m_n can be rewritten as

m_{n+1} = α(I − C_{n+1} G^T Σ_ν⁻¹ G) m_n + C_{n+1} G^T Σ_ν⁻¹ y.  (A.14)

Thus convergence to m_∞ satisfying

m_∞ = α(I − C_∞ G^T Σ_ν⁻¹ G) m_∞ + C_∞ G^T Σ_ν⁻¹ y  (A.15)

is determined by the spectral radius of α(I − C_{n+1} G^T Σ_ν⁻¹ G). If α ∈ (0, 1), then ρ(αI − α C_{n+1} G^T Σ_ν⁻¹ G) ≤ α < 1, and {m_n} converges exponentially fast to m_∞. If α = 1, then we use the fact that B := G^T Σ_ν⁻¹ G is symmetric and that B ≻ 0. From this it follows that I − C_{n+1}B has the same spectrum as I − √B C_{n+1} √B. Using the upper bound on C_{n+1}⁻¹ appearing in (A.6), we deduce that

ρ(I − √B C_{n+1} √B) ≤ 1 − ρ(√B (B + Σ_ω⁻¹)⁻¹ √B) = 1 − ε₂,

for some ε₂ ∈ (0, 1). Since the spectral radius of I − C_{n+1}B is less than one, there is a norm on R^{N_θ} in which the operator norm of I − C_{n+1}B is less than one, and exponential convergence follows.

Letting n → ∞, equation (A.15) can be rewritten as

0 = C_∞(G^T Σ_ν⁻¹(y − G m_∞) + (1 − α)(G^T Σ_ν⁻¹ G − C_∞⁻¹) m_∞) = C_∞(G^T Σ_ν⁻¹(y − G m_∞) − (1 − α) Ĉ_∞⁻¹ m_∞),

and hence m_∞ is the minimizer of equation (27).

Proof of Lemma 1.
∂F_G(m, C)/∂m = ∂E[G(θ)]/∂m
= ∫ G(θ) (1/√((2π)^{N_θ}|C|)) exp(−½‖C^{-1/2}(θ − m)‖²)(C⁻¹(θ − m))^T dθ
= ∫ G(θ)(θ − m)^T (1/√((2π)^{N_θ}|C|)) exp(−½‖C^{-1/2}(θ − m)‖²) dθ · C⁻¹
= ∫ (G(θ) − E[G(θ)])(θ − m)^T (1/√((2π)^{N_θ}|C|)) exp(−½‖C^{-1/2}(θ − m)‖²) dθ · C⁻¹
= F_{dG}(m, C).  (A.16)

Proof of Proposition 2.
From equation (35) we have

ŷ_{n+1} = F^u_{G,n+1},  Ĉ^{θp}_{n+1} = Ĉ_{n+1}(F^u_{dG,n+1})^T.  (A.17)

In what follows we use the modified unscented transform of Definition 1, and specifically its use to derive (19) and (20). First note that m̂_{n+1} = θ̂⁰_{n+1}, ŷ_{n+1} = G(θ̂⁰_{n+1}) = ŷ⁰_{n+1}, and w := W₁ᶜ = W₂ᶜ = ··· = W_{2N_θ}ᶜ. Now define the matrices

Y₁ = [ŷ¹_{n+1} − ŷ_{n+1}, ŷ²_{n+1} − ŷ_{n+1}, ···, ŷ^{N_θ}_{n+1} − ŷ_{n+1}],
Y₂ = [ŷ^{N_θ+1}_{n+1} − ŷ_{n+1}, ŷ^{N_θ+2}_{n+1} − ŷ_{n+1}, ···, ŷ^{2N_θ}_{n+1} − ŷ_{n+1}],
Θ = [θ̂¹_{n+1} − m̂_{n+1}, θ̂²_{n+1} − m̂_{n+1}, ···, θ̂^{N_θ}_{n+1} − m̂_{n+1}].

Then we have

Ĉ^{θp}_{n+1} = Σ_{j=0}^{2N_θ} Wⱼᶜ(θ̂ʲ_{n+1} − m̂_{n+1})(ŷʲ_{n+1} − ŷ_{n+1})^T = wΘ(Y₁^T − Y₂^T),  (A.18a)
Ĉ^{pp}_{n+1} = Σ_{j=0}^{2N_θ} Wⱼᶜ(ŷʲ_{n+1} − ŷ_{n+1})(ŷʲ_{n+1} − ŷ_{n+1})^T + Σ_ν = w(Y₁Y₁^T + Y₂Y₂^T) + Σ_ν,  (A.18b)
Ĉ_{n+1} = Σ_{j=0}^{2N_θ} Wⱼᶜ(θ̂ʲ_{n+1} − m̂_{n+1})(θ̂ʲ_{n+1} − m̂_{n+1})^T = 2wΘΘ^T.  (A.18c)

Equation (A.18c) follows from the definition of the sigma points (19). Since Ĉ_{n+1} ⪰ Σ_ω ≻ 0, the matrix Θ ∈ R^{N_θ×N_θ} is non-singular. Thus we have

F^u_{dG,n+1} Ĉ_{n+1}(F^u_{dG,n+1})^T = (Ĉ^{θp}_{n+1})^T Ĉ_{n+1}⁻¹ Ĉ_{n+1} Ĉ_{n+1}⁻¹ Ĉ^{θp}_{n+1} = (Ĉ^{θp}_{n+1})^T Ĉ_{n+1}⁻¹ Ĉ^{θp}_{n+1}
= w²(Y₁ − Y₂)Θ^T(2wΘΘ^T)⁻¹Θ(Y₁^T − Y₂^T) = (w/2)(Y₁Y₁^T + Y₂Y₂^T − Y₁Y₂^T − Y₂Y₁^T).  (A.19)

Using equation (A.19) in equation (A.18b) yields

Ĉ^{pp}_{n+1} = F^u_{dG,n+1} Ĉ_{n+1}(F^u_{dG,n+1})^T + Σ_ν + Σ̃_{ν,n+1},  (A.20)

where Σ̃_{ν,n+1} := (w/2)(Y₁ + Y₂)(Y₁ + Y₂)^T. We note that Σ̃_{ν,n+1} is positive semi-definite. Furthermore, the i-th column of Y₁ + Y₂ satisfies

ŷⁱ_{n+1} + ŷ^{i+N_θ}_{n+1} − 2ŷ_{n+1} = G(m̂_{n+1} + c[√Ĉ_{n+1}]ᵢ) + G(m̂_{n+1} − c[√Ĉ_{n+1}]ᵢ) − 2G(m̂_{n+1}) ≈ c² d²G(m̂_{n+1})/dθ² : [√Ĉ_{n+1}]ᵢ ⊗ [√Ĉ_{n+1}]ᵢ.  (A.21)

Hence Σ̃_{ν,n+1} = 0 when G is linear; otherwise ‖Σ̃_{ν,n+1}‖ = O(‖Ĉ_{n+1}‖²), a second-order term for small covariance Ĉ_{n+1}.

Proof of Lemma 2.
If the steady state C of equation (38b) is singular, then there exists v ∈ R^{N_θ} such that v^T C v = 0. We have

(v^T C^{θp} u)² = (E[v^T(θ − m) ⊗ (G(θ) − G(m)) u])² ≤ E[v^T(θ − m) ⊗ (θ − m) v] · E[u^T(G(θ) − G(m)) ⊗ (G(θ) − G(m)) u] = 0

for any u ∈ R^{N_y}, by the Cauchy-Schwarz inequality and v^T C v = 0. This implies that v^T C^{θp} = 0; substituting v into the steady-state form of equation (38b) then yields v^T Σ_ω v = 0, which contradicts the assumption that Σ_ω ≻ 0.

Appendix B. Illustrative Examples for UKS
The primary focus of the paper is on using the UKI for optimization purposes. However, the basic ingredients of the method, and the dynamical system (41) in particular, can also be used to perform approximate posterior sampling from the measure µ given by (3). In the case where µ is Gaussian, the posterior is exactly captured by the steady state of these equations; when the posterior is not Gaussian, only an approximation is obtained. To illustrate the UKS, we consider, in Subsection Appendix B.1, application to three linear inverse problems from Subsection 5.3, for which the posterior is Gaussian if the prior is Gaussian; we then give a simple example of application to a non-Gaussian posterior in Subsection Appendix B.2.

The UKS equations (41) can be discretized by the following semi-implicit scheme:

m_{n+1} − m_n = h(C_n^{θp} Σ_η⁻¹(y − E[G(θ)]) − C_n Σ₀⁻¹(m_{n+1} − r₀)),
C_{n+1} − C_n = h(−2 C_n^{θp} Σ_η⁻¹(C_n^{θp})^T − 2 C_n Σ₀⁻¹ C_n + 2 C_{n+1}),  (B.1)

with a fixed time-step h. The integrals defining C^{θp} and E[G(θ)] are approximated explicitly by the modified unscented transform (see Definition 1) using the Gaussian N(m_n, C_n). Integration could also be performed using an adaptive time-step, as in [53]; however, more work is needed to develop efficient methods stemming from the UKS as formulated here.

Appendix B.1. Linear 2-parameter Model Problem

The linear 2-parameter model problems discussed in Section 5.3 are used, with prior r₀ = 0 and Σ₀ = I. Therefore, the posterior distribution is µ ∼ N(m_ref, C_ref), where

m_ref = (Σ₀⁻¹ + G^T Σ_η⁻¹ G)⁻¹(G^T Σ_η⁻¹ y + Σ₀⁻¹ r₀)  and  C_ref = (Σ₀⁻¹ + G^T Σ_η⁻¹ G)⁻¹.  (B.2)

The UKS is initialized with θ₀ ∼ N(r₀, Σ₀). The convergence of the UKS, in terms of the posterior mean and covariance errors for t ∈ [0, 10], is reported in Fig. B.23. Both mean and covariance converge to the posterior mean and covariance. However, even with the semi-implicit scheme, the maximum time step that allows for a stable simulation is h = 5 × 10⁻³.
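For a linear forward map the Gaussian integrals in (B.1) are available in closed form (E[G(θ)] = Gm and C^{θp} = CG^T), so the semi-implicit scheme can be sketched directly; the matrices, data, step size, and iteration count below are illustrative assumptions, not the settings used in the experiments above:

```python
import numpy as np

# illustrative linear-Gaussian setup; the posterior is N(m_ref, C_ref) as in (B.2)
G = np.diag([1.0, 2.0])
Sig_eta, Sig0 = np.eye(2), np.eye(2)
r0, y = np.zeros(2), np.array([1.0, 2.0])

B = G.T @ np.linalg.inv(Sig_eta) @ G
C_ref = np.linalg.inv(np.linalg.inv(Sig0) + B)
m_ref = C_ref @ (G.T @ np.linalg.inv(Sig_eta) @ y + np.linalg.inv(Sig0) @ r0)

# semi-implicit scheme (B.1), with the exact linear-Gaussian integrals
h, m, C = 5e-3, r0.copy(), Sig0.copy()
I2, S0inv, Seinv = np.eye(2), np.linalg.inv(Sig0), np.linalg.inv(Sig_eta)
for _ in range(5000):
    Ctp = C @ G.T                              # C^{theta p} = C G^T
    m = np.linalg.solve(I2 + h * C @ S0inv,    # implicit treatment of m_{n+1}
                        m + h * (Ctp @ Seinv @ (y - G @ m) + C @ S0inv @ r0))
    C = (C - 2 * h * (Ctp @ Seinv @ Ctp.T + C @ S0inv @ C)) / (1 - 2 * h)

print(np.allclose(m, m_ref, atol=1e-6), np.allclose(C, C_ref, atol=1e-6))
```

For this linear setup the fixed point of the discretization coincides with the Gaussian posterior (B.2), so the iterates converge to m_ref and C_ref without discretization bias.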
Figure B.23: L² error ‖m_n − m_ref‖ (left) and Frobenius norm error ‖C_n − C_ref‖_F (right) obtained by the UKS for the non-singular (NS), over-determined (OD), and under-determined (UD) variants of the linear 2-parameter model problem.

Appendix B.2. Nonlinear 2-Parameter Model Problem
The following Bayesian logistic regression problem is considered:

y = 1/(1 + exp(θ⁽¹⁾ + θ⁽²⁾x)) + η.

Here N_θ = 2 and N_y = 1, and hence this is an under-determined problem. The prior distribution N(r₀, Σ₀) satisfies r₀ = [1, 1]^T and Σ₀ = I. The observation data y_ref = 0.08 is generated at a single value of x, with Gaussian observation error η and θ_ref = [2, 2]^T.

The UKS is initialized with θ₀ ∼ N(r₀, Σ₀). The posterior distributions obtained by the UKS at t = 10 with a time step h = 5 × 10⁻³, and by a Markov chain Monte Carlo (MCMC) method run for a long chain after a burn-in period, are presented in Fig. B.24. The estimated posterior distributions are in reasonably good agreement, although, of course, not as accurate as in the linear setting of the previous subsection, because a Gaussian approximation is being made to a non-Gaussian distribution. In particular, the posterior means estimated by the UKS and by MCMC are close (first components 1.41 and 1.62, respectively), and both methods find a negative cross-covariance between the two parameters (−0.235 for the UKS, −0.254 for MCMC).

Figure B.24: Contour plot: posterior distribution obtained by the UKS at t = 10; blue dots: reference posterior distribution obtained by MCMC for the nonlinear 2-parameter model problem. The x-axis is θ⁽¹⁾ and the y-axis is θ⁽²⁾.

Appendix C. Discussion Concerning Regularization (α ∈ (0, 1))

Recall the under-determined linear system from Subsection 5.3. In the computations performed there, the parameter α = 1 and the least-squares solution of the inverse problem comprises a one-parameter family of solutions. Nonetheless {m_n} converges, but the limit depends on the initial condition. Here we revisit this problem with α ∈ (0, 1), in which case {m_n} converges to a uniquely defined limit, which minimizes a regularized least-squares problem. Thus, with regularization (α ∈ (0, 1)), the limit is independent of the initialization. We set r₀ = 0 and fix γ₀ < 1. We then employ three different initializations of the UKI:

N(0, γ₀²I),  N([1, 1]^T, γ₀²I),  and  N([1, −1]^T, γ₀²I).

Thus, in contrast to the experiments in Subsection 5.3, m₀ ≠ r₀ except in the first case. We employ three different values of α.
As predicted by Theorem 1, the UKI converges exponentially fast to a point which does not depend on the initial condition, but does depend on the choice of α; the three values of α lead to three different limits m_∞. The convergence of the parameter vector {m_n} is reported in Fig. C.25. For this test case, smaller α leads to better convergence, and setting m₀ = r₀ also leads to improved convergence.

Figure C.25: L² error ‖m_n − θ_ref‖ of the under-determined (UD) linear 2-parameter model problem for three values of α and initializations m₀ = [0, 0]^T, [1, 1]^T, and [1, −1]^T.

References

[1] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer.