Adaptive covariance inflation in the ensemble Kalman filter by Gaussian scale mixtures
Patrick N. Raanes*, Marc Bocquet, and Alberto Carrassi

Nansen Environmental and Remote Sensing Center, Thormøhlensgate 47, Bergen, N-5006, Norway
CEREA, Joint laboratory of École des Ponts ParisTech and EDF R&D, Université Paris-Est, Champs-sur-Marne, France

January 18, 2019
Abstract
This paper studies multiplicative inflation: the complementary scaling of the state covariance in the ensemble Kalman filter (EnKF). Firstly, error sources in the EnKF are catalogued and discussed in relation to inflation; nonlinearity is given particular attention as a source of sampling error. In response, the “finite-size” refinement known as the EnKF-N is re-derived via a Gaussian scale mixture, again demonstrating how it yields adaptive inflation. Existing methods for adaptive inflation estimation are reviewed, and several insights are gained from a comparative analysis. One such adaptive inflation method is selected to complement the EnKF-N to make a hybrid that is suitable for contexts where model error is present and imperfectly parameterized. Benchmarks are obtained from experiments with the two-scale Lorenz model and its slow-scale truncation. The proposed hybrid EnKF-N method of adaptive inflation is found to yield systematic accuracy improvements in comparison with the existing methods, albeit to a moderate degree.

Consider the problem of estimating the state x_k ∈ R^M given the observation y_k ∈ R^P, as generated by:

    x_k = M(x_{k−1}) + ξ_k,   ξ_k ∼ N(0, Q_k),   (1a)
    y_k = H x_k + υ_k,        υ_k ∼ N(0, R_k),   (1b)

for sequentially increasing time index k, where the Gaussian noise processes, ξ_k and υ_k, are independent in time and from each other. More specifically, the Bayesian filtering problem consists of computing and representing p(x_k | y_{1:k}), namely the probability density function (pdf) of the current state, x_k, given the current and past observations, y_{1:k} = {y_l}_{l=1}^k. In data assimilation (DA) for the geosciences, the state size, M, and possibly the observation size, P, may be large, and the dynamical operator, M, may be nonlinear (observation operators that are nonlinear are implicitly included by state augmentation [Evensen, 2003]). These difficulties necessitate approximate solution methods such as the ensemble Kalman filter (EnKF), which is simple and efficient [Evensen, 2009b].

The EnKF computes an ensemble of N realizations, or “members”, to represent p(x_k | y_{1:k}) as a (supposed) sample thereof. It consists of a forecast-analysis “cycle” for each sequential time window of the DA problem. The forecast step simulates the dynamical forecast (1a) for each individual member. This paper is focused on the analysis step. Since the analysis only concerns a fixed time, k, this subscript is henceforth dropped, as is the explicit conditioning on y_{1:k−1}. Thus, the prior at time k is written p(x), and the analysis (posterior) at time k becomes p(x | y) ∝ p(y | x) p(x), per Bayes’ rule.

Denote {x_n}_{n=1}^N the forecasted ensemble representing p(x), and define the prior sample mean and covariance:

    ¯x = (1/N) Σ_{n=1}^N x_n,   (2a)
    ¯B = (1/(N−1)) Σ_{n=1}^N (x_n − ¯x)(x_n − ¯x)^T.   (2b)

The EnKF analysis update can be derived by assuming that ¯x and ¯B exactly equal the true moments of p(x), labelled b and B, and carefully dealing with rank issues [§6.2 of Raanes, 2016]. The posterior then arises as in the Kalman filter, described by the analysis moments ¯x^a and ¯P^a, or a (deterministic, “square-root”) ensemble transformation to match these.

Multiplicative inflation is an auxiliary technique to adjust (typically increase) the ensemble spread and thereby covariance, initially studied by Pham et al. [1998]; Anderson and Anderson [1999]; Hamill et al. [2001].
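As a concrete aside, the sample moments (2) translate directly into code. The following is a minimal sketch (not from the paper; plain NumPy, with the ensemble stored as the columns of a matrix, anticipating the matrix E introduced in section 3):

```python
import numpy as np

def sample_moments(E):
    """Prior sample mean (2a) and covariance (2b) of an M-by-N ensemble E."""
    N = E.shape[1]
    x_bar = E.mean(axis=1)                 # eq. (2a)
    X = E - x_bar[:, None]                 # anomalies (deviations from mean)
    B_bar = X @ X.T / (N - 1)              # eq. (2b)
    return x_bar, B_bar

# Example with M = 3 state variables and N = 40 members:
rng = np.random.default_rng(0)
x_bar, B_bar = sample_moments(rng.standard_normal((3, 40)))
```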
Here, the specific variant studied is that of multiplying the prior state covariance matrix, ¯B, by the inflation factor, α > 1:

    ¯B ← α ¯B.   (3)

The need for inflation may arise from intrinsic deficiencies of the EnKF: errors due to non-Gaussianity or the finite size of the ensemble. The technique of localization should be applied as the primary remedy, but inflation is still generally necessary and beneficial [Asch et al., 2016, figure 6.6]. Inflation may also be necessary as a heuristic but pragmatic treatment for extrinsic deficiencies, i.e. model and observational errors, meaning any misspecification of equations (1a) and (1b). Again, however, it is advisable to exploit any prior knowledge of errors (bias, covariance, subspace, etc.) with more advanced treatments before employing multiplicative inflation. Examples include additive noise [Whitaker and Hamill, 2012], relaxation [Kotsuki et al., 2017], and square-root transformations [Raanes et al., 2015; Sommer and Janjić, 2017].

It is difficult to formulate directives for the tuning configurations of the EnKF with any generality. Concerning α, it may be that the accuracy of the EnKF is improved either by well-tuned inflation (α > 1) or deflation (α < 1), whereas an ill-tuned value merely adds to the other errors. Similarly, inflating is typically required in conditions of extrinsic error such as model error [Li et al., 2009].

Further specificity and quantitative guidelines are difficult to deduce. Therefore, the inflation parameter typically requires application-specific, off-line tuning for a fixed value, sometimes at significant expense. As an alternative strategy, adaptive inflation aims to estimate the inflation factor on-line. This also naturally promotes the use of time-varying values.

The EnKF-N [Bocquet et al., 2015, hereafter Boc15] is a refinement of the analysis step of the EnKF that explicitly accounts for sampling error in ¯x and ¯B, meaning their discrepancy from the true moments b and B, which are seen as uncertain, hierarchical “hyperparameters”. The derivation proceeds from the rejection of the assumption that ¯x and ¯B are exact [Bocquet, 2011, hereafter Boc11]. Moreover, when using a non-informative hyperprior for b and B, the EnKF-N has been shown to yield a “dual” form which can be straightforwardly identified as a scheme for adaptive inflation [Bocquet and Sakov, 2012]. Its implementation only requires minor add-ons to the (square-root) EnKF, with negligible computational cost. In the idealistic context where model error is absent or accurately parameterized by the noise process, as detailed by section 2, the EnKF-N nullifies the need for inflation tuning, making it opportune for synthetic experiments. However, (i) wider adoption of the EnKF-N has been limited by some technically challenging aspects of its derivation. Moreover, (ii) the idealism of the above context means that the EnKF-N would still be reliant on ad-hoc inflation tuning in real-world, operational use.

This paper addresses both of the above issues of the EnKF-N. Firstly, by re-deriving it with a focus on inflation, section 3 further elucidates its workings. Then, section 4 reviews and analyses the literature on adaptive inflation estimation. In contrast to the EnKF-N, these adaptive inflation methods have hyperpriors that are time-dependent (as opposed to being “reset” at each analysis time), making them suitable for realistic contexts where model error is present and imperfectly parameterized. Then, section 5 uses one such method to complement the EnKF-N and create a new, hybrid method. Lastly, section 6 presents benchmark experiments of the various adaptive inflation methods. Expressions and properties of the standard parametric pdfs in use in this paper, N, t, χ^{+2}, χ^{−2}, W^{+1}, W^{−1}, can be found in appendix A.

Model-error adaptive inflation is considered from section 4 and onward. By contrast, this section is focused on the effects of sampling error, as well as its causes, especially nonlinearity. Section 3 will show how sampling error is partially remedied by the EnKF-N.

Consider the univariate (scalar) filtering problem where the likelihood p(y | x) = N(0 | x, 2) and dynamical model M_Lin(x) = √2 x repeat identically for each time index, and the initial prior is p(x) = N(x | 0, 2). The posterior is then invariably p(x | y) = N(x | 0, 1), and the forecast restores the prior, N(x | 0, 2). A nonlinear counterpart is obtained by replacing M_Lin with M_NonLin(x), detailed in appendix B.2. This model has been designed to preserve Gaussianity despite being nonlinear: if p(x | y) = N(x | 0, 1) then M_NonLin(x) has the distribution N(0, 2).

Figure 1 shows time series of the EnKF statistics for both models, with N = 40 members drawn randomly from p(x). In the linear case, the resulting sampling errors are quickly attenuated, and the ensemble statistics converge to the exact ones. By contrast, in the nonlinear case, the jitteriness (sampling error) is chronic. This demonstrates that sampling error may arise purely due to nonlinearity, i.e. without actual stochasticity. Furthermore, note that the true distributions are perfectly Gaussian, and therefore the EnKF would compute the exact solution if N were infinite. Thus, even though nonlinearity typically yields non-Gaussianity, this is not always the case. Hence, the issue of sampling error, even if caused by nonlinear models, can be analysed and addressed separately from the issue of non-Gaussianity.

Figure 1: Time series of statistics (¯x^a, ¯P^a, and ¯B, plotted against the DA cycle, i.e. time, index) from the EnKF applied to the univariate DA problem with M_Lin (smooth lines) and M_NonLin (jittery lines).

An instructive scenario (not shown) of the nonlinear experiment is that in which the initial ensemble has a mean of 0 and a variance of 2, exactly. Despite the “perfect” initialization, sampling errors will still be generated, as predicted by appendix B.1. However, this error is not immediately as big as if the ensemble were actually randomly sampled from N(0, 2), in which case ¯B ∼ χ^{+2}(2, N−1), so that E[(¯B − 2)²] = 8/(N−1). Rather, it takes a few iterations of M_NonLin for the ensemble to saturate at a noise level of 8/(N−1).

This subsection is summarized in Table 1, whose rows correspond to paragraphs, as numbered (§).

§1. An important property of the EnKF is that it is a consistent estimator in the linear-Gaussian case [Le Gland et al., 2009; Mandel et al., 2011]: at each time k, the EnKF statistics ¯x and ¯B converge (in probability, as N → ∞) to the true moments, b and B. Clearly, in this context, inflating or deflating will degrade the ensemble estimates.

§2. Stochastic forms of the EnKF employ pseudo-random “observation perturbations” for the analysis update step. Similarly, the forecast step may simulate additive or more advanced stochastic parameterizations of the forecast noise. With N < ∞, this introduces sampling error.
Table 1: Summary of section 2.2 regarding filtering contexts and the consequent need for inflation. Background assumptions are idealistic: M, H, Q, R perfectly known, and p(x) and p(y | x) always Gaussian. The star (*) means “in either case”.

    §  | Ensemble size (N) | Treatment of noises (Q, R) | Model (M) | Should inflate?
    1  | ∞                 | *                          | *         | No
    2  | (M, ∞)            | Stochastic                 | *         | Yes
    3  | (M, ∞)            | *                          | Nonlin.   | Yes
    4  | (M, ∞)            | Deterministic              | Linear    | No
    5  | [2, M]            | *                          | *         | Yes

One cause of the typical need for α > 1 is the bias

    E[tr(¯P^a)] < tr(P^a),   (4)

where the expectation is taken over the prior ensemble (or equivalently the covariance, ¯B), and

    ¯P^a = (¯B⁻¹ + H^T R⁻¹ H)⁻¹,   (5)
    P^a = (B⁻¹ + H^T R⁻¹ H)⁻¹.   (6)

In other words, even though E[¯B] = B, the nonlinearity (concavity) of ¯P^a, as a function of ¯B, causes a bias. A related but distinct bias applies for the Kalman gain matrix, ¯K = ¯B H^T (H ¯B H^T + R)⁻¹. Note, though, that the sampling error originates in the prior; therefore, the prior covariance is the root cause, and targeting (inflating) ¯B, rather than ¯P^a and ¯K, is more principled.

There is a misconception that this bias leads to ensemble “collapse”, meaning that ¯P^a → 0 and ¯B → 0 as k → ∞. But no matter how acute the single-cycle bias is, its accumulation will saturate, because it is counteracted by reductions in ¯K.

The term “inbreeding” is sometimes used to refer to the bias (4). However, inbreeding also encompasses two other issues, namely the introduction of non-Gaussianity and of dependency between ensemble members. These are caused by the cross-member interaction that takes place through the EnKF update [Houtekamer and Mitchell, 1998]. It is not quite clear how these effects will impact the need for inflation in later cycles.

Analytical, quantitative results on the bias (4) have been obtained for the general, multivariate case by Furrer and Bengtsson [2007]; Sacher and Bartello [2008]. However, the degree of the approximation is not entirely clear, the assumption of the ensemble being truly stochastic is unreliable, and the related correctional methods were only moderately successful. An alternative approach is that of §15.3 of Evensen [2009a], where the bias is empirically estimated by using a companion ensemble of white noise. However, as discussed below equation (13), a significant drawback of the inflation methods targeting this bias is that they do not establish a feedback mechanism through the cycling of DA. Moreover, as shown by the theory of the EnKF-N in section 3, even in a single cycle, the observations, y, contain information that can improve estimates of prior hyperparameters “before” utilising y to update the state vector, x, thereby reducing sampling error and biases.

§3. Deterministic, square-root update forms of the EnKF (which may also be formulated for the forecast noise [Raanes et al., 2015]) do not introduce sampling error in the mean and covariance. Yet, with N < ∞, sampling errors will arise due to model nonlinearities. This was illustrated in the experiments of section 2.1, and predicted by appendix B.1. As in §2, sampling error will instigate the need for inflation. Indeed, the bias (4) is slightly visible in the nonlinear experiment of Figure 1, where the covariances, ¯B and ¯P^a, are on average lower (long-run averages: 1.95 and 0.98) than the true values.

Filter “divergence” is the situation where the actual error is far larger than expected from ¯B. It cannot occur in the linear context, except by extrinsic errors [Fitzgerald, 1971]. It may, however, arise in nonlinear, chaotic contexts because, heuristically, (i) smaller covariances are prone to deficient (relative) growth by the forecast, creating an instability that (ii) might not be adequately controlled by the analyses. Further, the deficiency in growth typically depends on the starting deficiency of the covariance, a form of positive feedback that makes the cycle even more “vicious”. The alarming prospect of divergence, especially in light of the bias (4), favours “erring on the side of caution”, i.e. using α > 1.

§4. With a deterministic EnKF and a linear model, no sampling error is generated, and the EnKF-estimated moments converge (in time) to the true ones, provided Q = 0 [Bocquet et al., 2017]. Hence, convergence (in time k) holds for any N ≥ M, with a rate independent of N. Interestingly, a similar analysis reveals that the choice of normalization factor for the covariance estimator (usually N−1, or N) does not impact the asymptotic EnKF-estimated moments (in the linear context): they always converge to the true moments as k → ∞. This means that the success of the EnKF does not so much rely on some statistical, single-cycle optimality or unbiasedness (in ¯B, ¯P^a, or ¯K), but rather on the above insensitivity to the choice of normalization factor.

§5. Decreasing the ensemble size, N, increases the sampling error, the bias (4), and the need for inflation. Worse, if N ≤ M, then the ensemble is said to be rank-deficient; this is a separate issue from sampling error, with the grave consequence that the truth, x, will not lie entirely within the ensemble subspace (cf. section 3.5). By operating marginally, “localization” [Anderson, 2003; Sakov and Bertino, 2011] can mitigate the rank deficiency. Localization also diminishes off-diagonal sampling errors (“spurious correlations”), thus decreasing the need for inflation. On the other hand, by eliminating prior correlations, localization effects an overly uncertain prior, yielding too strong a reduction of the ensemble spread.¹

Another consequence of rank deficiency is the possibility of the Bayesian uncertainty (i.e. potential error) outside of the ensemble subspace “mixing in”, and adding to, the ensemble subspace uncertainty. If Q = 0 and the context is linear, this interaction is small and transitory. It then does not seem beneficial to (inflate in order to) have the ensemble spread match the total (as opposed to the subspace) uncertainty. By contrast, if Q > 0 [Grudzien et al., 2018], or in the nonlinear context [Palatella and Trevisan, 2015], the interaction will occur, favouring the use of α > 1. In their section 4, Boc15 showed that (scalar/homogeneous) inflation is well-suited to combat this type of error; this applies for both multiplicative and additive treatments.

Assuming Q = 0, the long-run (k → ∞) rank of the true state covariance, B, is the number of non-decaying modes (non-negative Lyapunov exponents) of the dynamics, i.e. the rank of the “unstable subspace”, 0 ≤ n < M. This correspondence also holds approximately in the nonlinear context, and means that the rank deficiency of the ensemble may be much less severe than M − N + 1 [Bocquet and Carrassi, 2017]. If this is the case, a duplicate of Table 1 applies, with M replaced by n.

Filter divergence will (almost surely) occur if N ≤ n, if localization is not used. In contrast with §3, inflation is then futile, because the divergence is caused by rank deficiency, regardless of the degree of nonlinearity of the growth. It could be speculated that nonlinearity will sequentially “rotate” the ensemble around in the unstable subspace, and hence effectively encompass it. However, twin experiments with the 40-dimensional Lorenz model, such as the data point N = n = 14 of Figure 6.6 of [Asch et al., 2016], do not give credence to this hypothesis.

¹ Formally, quantify the reduction via |I_P − H¯K| = |R| / |H¯BH^T + R|, the determinant of the reduction in the variance. Localization decreases the magnitude of the off-diagonals of H¯BH^T + R, provided the eigen-structures of the two terms are not too dissimilar. Thus, localization increases the denominator, hence reducing |I_P − H¯K| and the posterior variance.

Re-deriving the dual EnKF-N via a Gaussian scale mixture

This section gives a new derivation of the dual EnKF-N. Subsection 3.1 outlines the main ideas. The details are filled in by the subsequent subsections.

Suppose the Bayesian forecast prior for the “truth” is Gaussian, with mean b and covariance B; formally, p(x | y_{1:k−1}) = N(x | b, B), where the conditioning on past observations has been made explicit again. Furthermore, assume that the sample {x_n}_{n=1}^N is an “ensemble”, meaning that its members are independent and statistically indistinguishable from the truth [Wilks, 2011], having been drawn from the very same distribution. In short,

    x and x_n ∼ N(b, B), iid.   (7)

The assumption (7) is convenient, but may be too idealistic in case of severe inbreeding, non-Gaussianity, and model error. Conversely, it may be too agnostic in case the ensemble is not fully random, as discussed in section 2.1. For convenience, assemble the ensemble into the matrix E = [x_1, . . ., x_n, . . ., x_N].

Even in the linear-Gaussian context, computational constraints induce the use of an ensemble to carry the information on the state, and thus the approximation

    p(x | y_{1:k−1}) ≈ p(x | E),   (8)

meaning the reduction of the information of y_{1:k−1} to that represented by the forecast ensemble, E. Thus, while in principle (with infinite computational resources) the “true moments”, b and B, are known, this is not so when employing the EnKF. Here, all that is known about b and B comes from E.

The appropriate response is to consider all of the possibilities; indeed, since by the above assumptions p(x, b, B | E) = N(x | b, B) p(b, B | E), marginalization yields:

    p(x | E) = ∫_B ∫_{R^M} N(x | b, B) p(b, B | E) db dB,   (9)

where B is the set of M × M (symmetric) positive-definite matrices.²

² B is the Euclidean space R^{M(M+1)/2}, corresponding to the M(M+1)/2 degrees of freedom of a symmetric matrix B, restricted to positive-definite matrices (the conic subset wherein B > 0).
Equation (9) says that the “effective prior”, p(x | E), is a (continuous) mixture: the average of the “candidate priors”, N(x | b, B), as weighted by the “mixing distribution”, p(b, B | E). Since the distribution of the state, x, depends on the abstract parameters b and B that are themselves unknown, these are called hyperparameters and this layered structure is called hierarchical.

The standard EnKF may be recovered from the mixture (9) by assuming that the ensemble size is infinite (N = ∞), in which case the sample mean and covariance, ¯x and ¯B of equation (2), are exact, implying a mixing distribution of Dirac delta functions: δ(b − ¯x) δ(B − ¯B), and hence the effective prior: N(x | ¯x, ¯B).

The EnKF-N does not make this approximation, but instead acknowledges that N is finite (whence the “finite-size” moniker). The mixing distribution is obtained with Gaussian sampling theory and a non-informative hyperprior, p(b, B). For now, N > M is assumed, in which case ¯B⁻¹ exists [almost surely, per theorem 3.1.4 of Muirhead, 1982].

The connection to inflation comes from noting, as will be proven later, that equation (9) reduces to:

    p(x | E) = ∫_{α>0} N(x | ¯x, α ¯B) p(α | E) dα,   (10)

which is a mixture of candidate Gaussians over a scalar, scale parameter, only. The mixture (10) is illustrated by the orange objects in Figure 2. The candidate (prior) Gaussians are distinguished solely by the scaling, α, of the covariance, ¯B. Only a finite selection of the continuous family of candidate priors is plotted, the selection being representative of the mixing distribution, p(α | E). Interestingly, as detailed later, this yields an effective prior, p(x | E), which is not Gaussian, but rather a (Student’s) t distribution.

The effective posterior, p(x | E, y) ∝ p(y | x) p(x | E), is given by Bayes’ rule, i.e. pointwise multiplication. But the likelihood,

    p(y | x) = N(y | Hx, R),   (11)

per equation (1b), is Gaussian. The posterior is then neither Gaussian nor t, and does not simplify parametrically. This poses a computational challenge in high-dimensional problems, and the question of how the posterior (or an ensemble thereof) is to be computed in practice. Progress can be made by noting that the averaging over the prior moments can be “delayed” until after application of Bayes’ rule, i.e.

    p(x | E, y) ∝ ∫ N(y | Hx, R) N(x | ¯x, α ¯B) p(α | E) dα,   (12)

where the product of the first two factors under the integral is p(x, y | α, E). Thus, the effective posterior can also be seen as the average of the (Gaussian) candidate posteriors, p(x | α, y, E), each of which is given by the Kalman filter formulae for a given α, and computable essentially simultaneously for all α.

The by-product of Bayes’ rule is the “evidence”, p(y | α, E). In this context, it is not a constant, but instead constitutes the likelihood of the mixing parameter, α. To reflect this, the candidate posterior curves in Figure 2 have not been normalized to integrate to 1, but instead to p(y | α, E) · c.

Figure 2: Illustration of the EnKF-N as a scale mixture of Gaussians, as described in section 3.1. (Plotted against x, with pdf values on the ordinate: the prior ensemble, the likelihood, the candidate priors, the candidate posteriors, the effective prior, and the effective posterior.)

The constant c has been inserted and set such that the particular candidate posterior whose mode coincides with that of the effective posterior also shares its height. This makes it visible that no candidate posterior is fully coincident with the effective posterior. Nevertheless, it seems a reasonable approximation. But this candidate posterior corresponds to a candidate prior, which merely amounts to choosing a particular prior inflation, α⋆. The approximation can thus be written:

    p(x | E, y) ≈ p(x | α⋆, E, y),   (13)

meaning that the integral over the hyperparameter, α, for the effective posterior (12), is replaced by using a particular value, α⋆, which is chosen after taking into account y. This approximation is a form of “empirical Bayes”, known as such because the effective prior is approximated in a way that depends on the observations, y. This may appear to over-use the observations, y, but it is merely an artefact of the approximation. Indeed, decomposing the integrand in equation (12) as p(x | α, y, E) p(y | α, E) makes it apparent that α depends on y.

A posterior ensemble corresponding to the approximate posterior (13) may be computed using standard EnKF formulae, except with ¯B replaced by the selected value, α⋆ ¯B. Provided that the choice among the approximating Gaussian posteriors is judicious, it stands to reason that the resulting ensemble yields an improved analysis compared to that of the standard EnKF. After all, the standard EnKF chooses its covariance estimate (B = ¯B) before taking into account y. By contrast, the EnKF-N lets y inform this choice (B = α⋆ ¯B). For the same reason, even though the EnKF-N does not target any particular unbiasedness, improvement could be achieved compared to the methods targeting “single-cycle unbiasedness”, described below equation (4).

However, the main asset of the EnKF-N is that its secondary dependence on y implicitly establishes a negative feedback loop via the sequential cycling of DA: if the covariance estimate was too small at time k, this will likely be detected and adjusted for at k+1. Moreover, this feedback is “theoretically tuned”: parameters that may be tuned exist (cf. section 3.7), but none strictly require it.

As will be shown, the inflation prior is centred on 1, conferring important advantages to the EnKF-N. However, this anchoring to 1 also reflects the main drawback of the EnKF-N: the hyperprior is static, so that no explicit accumulation of past information takes place for the inflation factor, which otherwise could have been used to account for model error. Redressing this is the subject of section 4 and onwards.

This subsection and the next further describe equation (9) for the effective prior, p(x | E). They are largely sourced from textbooks on Gaussian sampling theory and inference, under the heading of “predictive posterior”: the probability of another draw, x, from the same distribution as the sample, E [e.g., §3.2 of Gelman et al., 2004]. The presentation is didactic, giving meaning to intermediate stages. A concise version is provided by Boc11.

The mixing distribution in equation (9) is given by:

    p(b, B | E) ∝ p(E | b, B) p(b, B),   (14)

where p(b, B) is a hyperprior to be specified. Here, as in Boc11, the Jeffreys priors are independently assigned to the hyperparameters:

    p(b, B) = p(b) p(B) ∝ 1 · |B|^{−(M+1)/2}.   (15)

This is a prior designed to be as non-informative (agnostic) as possible.
It may be derived by positing invariance in location and scale [e.g., §12.4 of Jaynes, 2003]. Boc15 also showed the utility of using a highly informative hyperprior, suitable in contexts with little nonlinearity. Examples were also given for encoding information such as climatology or conditional statistics, resulting in a form of localization.

By the Gaussian ensemble assumption (7),

    p(E | b, B) = Π_n N(x_n | b, B) ∝ |B|^{−N/2} exp(−Σ_n ||x_n − b||²_B / 2),   (16)

where ||x||²_B = x^T B⁻¹ x = tr(xx^T B⁻¹). Now, writing x_n − b = (¯x − b) + (x_n − ¯x), it can be shown that

    Σ_n ||x_n − b||²_B = N ||¯x − b||²_B + tr((N−1) ¯B B⁻¹).   (17)

Combining equations (15) to (17) for the mixing distribution (14), the resulting factors may be identified as:

    p(b, B | E) = N(b | ¯x, B/N) · W^{−1}(B | ¯B, N−1),   (18)

where the first factor is p(b | B, E), the second is p(B | E), and W^{−1} is the inverse-Wishart distribution (cf. Table 2 of appendix A).

Writing the integrand of equation (9) as p(x, b, B | E) = p(b | x, B, E) p(x | B, E) p(B | E), the integral over b becomes trivial, leaving just the latter two factors:

    p(x | E) = ∫ p(x | B, E) p(B | E) dB,   (19)

of which p(B | E) was obtained in equation (18). Meanwhile, recalling p(x | b, B) and p(b | B, E) from equations (7) and (18) respectively, it may be shown by completing the square in b that

    p(x | b, B) p(b | B, E) = N(b | (N¯x + x)/(N+1), B/(N+1)) · N(x | ¯x, ε_N B),   (20)

where ε_N = 1 + 1/N. The two factors on the right-hand side follow by identification as p(b | x, B, E) and p(x | B, E), the latter providing the first factor in equation (19). Thus,

    p(x | E) = ∫ N(x | ¯x, ε_N B) W^{−1}(B | ¯B, N−1) dB.   (21)
It should be appreciated that equation (21) would be unchanged if b = ¯x had been assumed from the start, except for the slight adjustment of ε_N and the reduction from N to N−1 in the W^{−1}. By contrast, as shown in the following, the uncertainty in B has significantly more interesting consequences.

This section derives the scale mixture equation (10). While conventional, the assumption “¯B ∝ B” is ill-suited for inflation targeting sampling error, as it yields an inflation prior with an overpowering confidence, to the detriment of the likelihood [Raanes, 2016, §C.4]. This assumption is therefore not made. But then merely defining the inflation parameter becomes challenging. Clearly, it must be some scalar summary statistic on the “ratio” of B versus ¯B; possibilities include using the determinant, trace, or matrix norms. However, the subsequent assignment “B = ˜α ¯B” would represent an artificial approximation. By contrast, the following definition and developments make no approximations. Consider a fixed x, and define the (squared) inflation:

    ˜α = ||x − ¯x||²_{¯B} / ||x − ¯x||²_B.   (22)

Now, given the ensemble, E, the sample moments ¯x and ¯B are known (fixed), while p(B | E) = W^{−1}(B | ¯B, N−1). Hence B⁻¹ ∼ W^{+1}(¯B⁻¹, N−1), so that (cf. Property 6) 1/˜α ∼ χ^{+2}(1, N−1), i.e.

    p(˜α | E) = χ^{−2}(˜α | 1, N−1),   (23)

meaning that ˜α is inverse-chi-square (cf. Table 2), with location parameter 1 and certainty N−1.

The distribution p(˜α | E) could also have been derived by marginalizing p(B | E) over C ∈ C_˜α, where C denotes (any parameterization of) the degrees of freedom in B not fixed by ˜α, i.e. C_˜α = {C ∈ R^{M(M+1)/2−1} ; B(˜α, C) ∈ B, ||x − ¯x||²_B = ||x − ¯x||²_{˜α¯B}}. Formally,

    ∫_{C_˜α} p(B | E) J dC = p(˜α | E),   (24)

with J denoting the Jacobian determinant of (˜α, C) ↦ B. Inserting the pdfs from equations (18) and (23):

    ∫_{C_˜α} W^{−1}(B | ¯B, N−1) J dC = χ^{−2}(˜α | 1, N−1).   (25)

Now, the covariance mixture (21) can be rearranged as:

    p(x | E) ∝ ∫_B exp(−||x − ¯x||²_{ε_N B} / 2) W^{−1}(B | (N−1)¯B/N, N) dB.

The same change of variables then yields:

    p(x | E) ∝ ∫_{˜α>0} exp(−||x − ¯x||²_{ε_N ˜α ¯B} / 2) ∫_{C_˜α} W^{−1}(B | (N−1)¯B/N, N) J dC d˜α.   (26)

The inner integral can be substituted by comparing it to equation (25), yielding:

    p(x | E) ∝ ∫ exp(−||x − ¯x||²_{ε_N ˜α ¯B} / 2) χ^{−2}(˜α | (N−1)/N, N) d˜α
             ∝ ∫ N(||x − ¯x||_{¯B} | 0, ε_N ˜α) χ^{−2}(˜α | 1, N−1) d˜α.   (27)
In conclusion, the covariance mixture of equation (21) reduces to a scale mixture. An alternative, direct proof, using tricks from complex analysis instead of Property 6, was given in a preprint version of this paper.

Note that the scale mixture (27) has been written using the notational trick where N acts as a univariate function. Also, since ˜α is defined via x, the integrand of equation (27) cannot be read as “p(x | E, ˜α) p(˜α | E)”. By contrast, the mixture (10) is obtained by undoing the trick, and defining α = ε_N ˜α.

Let 𝟙 be the vector of ones of length N, and I_N the N × N identity matrix. Then the sample moments, given in equation (2), may be conveniently expressed as:

    ¯x = E𝟙/N,   ¯B = (1/(N−1)) XX^T,   (28)

where X = [x_1 − ¯x, . . ., x_n − ¯x, . . ., x_N − ¯x] = EΠ⊥_𝟙 is the ensemble “anomalies”, with Π⊥_𝟙 = (I_N − 𝟙𝟙^T/N) the orthogonal projector onto range(𝟙)⊥, the orthogonal complement space to range(𝟙).

So far it has been assumed that N > M so that ¯B is invertible (almost surely) and that the ensemble spans the entire state space. This is unrealistic for geoscientific DA, where N rarely exceeds 100, while M may exceed 10⁶. More reasonably, it is henceforth assumed that the support of the forecast pdf is confined to the ensemble subspace, i.e. the affine space {x ∈ R^M : [x − ¯x] ∈ range(X)}. This assumption is actually conventional, as it is implied by the standard EnKF’s assumption that b = ¯x and B = ¯B along with Gaussianity. The assumption means that the ensemble has sufficient rank. Thus, one may expect tolerable accuracy of the filter, even without localization [Bocquet and Carrassi, 2017]. It is preferable to work with variables that embody the restriction of the assumption [Hunt et al., 2007]; therefore, with w ∈ R^N, the following change of variables is done:

    x(w) = ¯x + Xw.   (29)

In terms of the new variable, the likelihood (11) may be succinctly written as:

    p(y | w) = N(¯δ | Yw, R),   (30)

with ¯δ = y − HE𝟙/N the average innovation, and Y = HEΠ⊥_𝟙 = HX the corresponding observation anomalies.

For the effective prior (27), note that the ensemble members expressed in the coordinate system of w are merely the coordinate vectors (x_n = ¯x + Xe_n, with e_n being the n-th column of I_N). Hence, in this coordinate system, the sample mean is 𝟙/N, replaceable by zero since X𝟙 = 0, and the sample covariance matrix is (N−1)⁻¹ I_N. Substituting these for ¯x and ¯B in equation (27) is a shortcut to obtain the effective prior for w; with α = ε_N ˜α,

    p(w | E) ∝ ∫_{α>0} α^{−g/2} N(||w||_{(N−1)⁻¹ I_N} | 0, α) χ^{−2}(α | ε_N, N−1) dα,   (31)
where the presence of α^{−g/2} is explained in the following.

Denote g the dimensionality of the nullspace of X. Due to Π⊥_𝟙 it holds that g = max(1, N − M), almost surely. Thus, typically g = 1, and the parameterization in w has one direction of redundancy, warranting careful attention. The issue is analogous to expressing 1 random variable as the sum of 2, or indeed expressing N − g random variables as a linear combination of N. The principle is that regardless of how the probability space is augmented with the redundant degrees of freedom, once these are marginalized out, one should be left with the original distribution. Boc15 showed that the adjustment by α^{−g/2} in equation (31) is then required.

Denote f the integrand of the scale mixture (31). Expanding the parametric pdfs yields:

    f(α, z) = α^{−(N+g)/2 − 1} e^{−z/(2α)},   (32)

where all of the dependency in w is contained in

    z(w) = (N−1)(ε_N + ||w||²).   (33)

Defining F(z(w)) ∝ p(w | E) for the effective prior, equation (31) may be restated as:

    F(z) = ∫ f(α, z) dα.   (34)

It can be seen that the change of variables u = α/z factors z out of the integral, yielding

    F(z) = z^{−(N+g)/2} F(1),   (35)

or, reverting to the full w notation,

    p(w | E) ∝ (ε_N + ||w||²)^{−(N+g)/2},   (36)

which is a t distribution (cf. appendix A), also called a Cauchy distribution when g = 1. The t distribution is elliptical, like the Gaussian [Muirhead, 1982, §1.5], but has thick tails, making it suited for robust inference [Geweke, 1993; Fernandez and Steel, 1999; Roth et al., 2017].

Unlike Boc11, here the t distribution form (36) of the effective prior will not be used directly. Instead, the effective prior (F) will again be expressed as a Gaussian (G) with inflation. To that end, note that, for general functions F and G with image[F] ⊆ image[G], there will always exist a function ζ(z) such that F(z) = G(ζ(z), z). To find a suitable G, consider applying the mean-value theorem to the integral (34) for a fixed z, denoting ζ⁻¹ the particular point for α. This will not work because the integrational interval, [0, ∞), is of infinite length. In place of the length, therefore, substitute c ζ⁻¹ to form: G(ζ, z) = c ζ⁻¹ f(ζ⁻¹, z), where the constant c ensures that the height (and hence image) of G(·, z) is sufficient. Inserting f from equation (32) yields

    G(ζ, z) = c ζ^{(N+g)/2} e^{−zζ/2}.   (37)

This G works well; indeed, equating equation (37) to (35) immediately yields the associated function ζ(z) = c/z.

In summary, the effective prior may be expressed by G (37) which, similarly to the integrand, f, is Gaussian in w. For later optimization purposes, c = N + g is set, and the logarithm is taken:

    p(w | E) = c exp(−L/2),   (38a)
    L(ζ, w) = (ε_N + ||w||²) ζ − (N + g) log ζ,   (38b)
    ζ(w) = (N + g)/(ε_N + ||w||²).   (38c)

The value of c was chosen so as to yield the property that ∂L/∂ζ(ζ(w), w) = 0 for any w, as can be directly verified. Conversely, this means that ζ may be treated as a free variable to be optimized for, because equation (38c) is satisfied wherever ∂L/∂ζ = 0. This tactic becomes useful in the following section.

The above form of the effective prior may be derived (as in a preprint version of this paper) as a “saddlepoint approximation”. Here, however, there is no approximation. Its exactitude is a remarkable feature known to arise in a few cases [Azevedo-Filho and Shachter, 1994; Goutis and Casella, 1999].
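As a quick numerical check of this exactness (not from the paper; a small NumPy/SciPy script with arbitrarily chosen N and g), one can verify by quadrature that the integral (34) indeed obeys the power law (35):

```python
import numpy as np
from scipy.integrate import quad

N, g = 20, 1      # illustrative ensemble size and nullspace dimension

def f(alpha, z):
    """Integrand of equation (32)."""
    return alpha ** (-(N + g) / 2 - 1) * np.exp(-z / (2 * alpha))

def F(z):
    """Effective prior (34), up to a constant, by quadrature."""
    return quad(f, 0, np.inf, args=(z,))[0]

for z in [0.5, 1.0, 2.0, 5.0]:
    print(F(z) / F(1.0), z ** (-(N + g) / 2))   # the two columns agree
```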
Define J = −2 log p(w | E, y), plus a constant, where the posterior is given by Bayes’ rule with the likelihood (30) and the effective prior (38). The log posterior reads:

    J(ζ, w) = L(ζ, w) − 2 log N(¯δ | Yw, R) + c   (39a)
            = ε_N ζ − (N + g) log ζ + ζ ||w||² + ||¯δ − Yw||²_R.   (39b)

Completing the square in w yields:

    J(ζ, w) = ||w − ¯w^a(ζ)||²_{¯P^a_w(ζ)} + D(ζ),   (40)

where the quadratic form is specified by the usual EnKF subspace analysis formulae:

    ¯P^a_w(ζ) = (ζ I_N + Y^T R⁻¹ Y)⁻¹,   (41a)
    ¯w^a(ζ) = ¯P^a_w(ζ) Y^T R⁻¹ ¯δ,   (41b)

and D should be recognized as the “dual”, as in Bocquet and Sakov [2012]:

    D(ζ) = ε_N ζ − (N + g) log ζ + ||¯δ||²_{R + YY^T/ζ}.   (42)

Now, ζ depends on w, and so the maximization of p(w | E, y) is not as obvious as equation (40) suggests. Fortunately, to find a critical point, it suffices to satisfy

    ∂J/∂w = 0,   (43a)
    ∂J/∂ζ = 0.   (43b)

This is because the criteria (43a) and (43b) imply dJ/dw = ∂J/∂w + (∂J/∂ζ)(dζ/dw) = 0, where ζ(w) is given by equation (38c), which is enforced since ∂L/∂ζ = 0, as follows from equations (39a) and (43b).

Now, the first criterion (43a) is trivially satisfied by setting w = ¯w^a(ζ) for a given ζ, as seen from equation (40). But it can also be seen that J = D along the constraint w = ¯w^a(ζ), and so

    dD/dζ = dJ/dζ = ∂J/∂ζ + (∂J/∂w)(dw/dζ) = ∂J/∂ζ.   (44)

Hence, finding ζ such that dD/dζ = 0 will satisfy the second criterion (43b).

In conclusion, w = ¯w^a(ζ⋆) is a critical point of the effective posterior if and only if ζ⋆ is a local minimizer of D(ζ). Since both of the terms D(ζ) and ||w − ¯w^a||²_{¯P^a_w} of equation (40) are here individually minimized, this critical point must be a minimum, as was originally shown using Lagrangian duality theory by Bocquet and Sakov [2012]. Hence the N-dimensional, non-Gaussian mode-finding problem for p(w | E, y) may be exchanged for the scalar optimization problem in ζ.

The optimization of D(ζ) requires iterating, but each evaluation of (42) and its derivative is computationally negligible, given that the singular value decomposition (SVD),

    U diag(¯σ_1, . . ., ¯σ_P) V^T = [(N−1) R]^{−1/2} Y,   (45)

has been obtained beforehand, as is typical to compute equation (41). Multiple minima are a rarity; in such cases ζ⋆ will depend on the optimizer and initial guess, here Newton’s method and ζ_0 = N − 1.

For the posterior ensemble, a Gaussian approximation to the effective posterior is chosen. In addition to its simplicity, twin experiments [Boc11; Boc12; Boc15] have provided solid support to using that of ζ⋆:

    p(w | E, y) ≈ N(w | ¯w^a(ζ⋆), ¯P^a_w(ζ⋆)).   (46)

Notably, this approximation matches the mode and Hessian of the exact p(w | E, y). The corresponding ensemble is constructed as:

    E^a = [¯x + X ¯w^a(ζ⋆)] 𝟙^T + √(N−1) X T,   (47)

where T = (¯P^a_w(ζ⋆))^{1/2} is readily computed using the same SVD (45) as before, and may be appended by a mean-preserving orthogonal matrix [Sakov and Oke, 2008]. Note that this is “just” the symmetric square-root EnKF, i.e. the ensemble transform Kalman filter (ETKF) of Bishop et al. [2001]; Hunt et al. [2004], except for a prior (squared) inflation of

    α⋆ = (N−1)/ζ⋆.   (48)

As discussed by Boc15, the choice of candidate posterior is not unassailable, and certain modifications of the dual function can be argued on the grounds of modifying this choice.
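To make the dual concrete, here is a minimal numerical sketch (not the paper’s implementation; toy dimensions, and a bounded scalar minimizer in place of the Newton iterations described above) of minimizing D(ζ) of equation (42) and forming the inflation (48):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
N, P, g = 10, 4, 1                      # illustrative sizes
eps_N = 1 + 1 / N
Y = rng.standard_normal((P, N))
Y -= Y.mean(axis=1, keepdims=True)      # observation anomalies, Y = HX
R = np.eye(P)                           # observation error covariance
dbar = rng.standard_normal(P)           # mean innovation

def D(zeta):
    """The dual cost function, equation (42)."""
    C = R + Y @ Y.T / zeta
    return (eps_N * zeta - (N + g) * np.log(zeta)
            + dbar @ np.linalg.solve(C, dbar))

zeta_star = minimize_scalar(D, bounds=(1e-6, 1e6), method="bounded").x
alpha_star = (N - 1) / zeta_star        # prior (squared) inflation, eq. (48)
```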
Indeed, if the influence of thelikelihood, quantified by¯ σ = tr( H ¯BH T R − ) /P , (49)9nd computed using the SVD (45), is small: ¯ σ → χ − to ahigher value than N −
1, yielding a Dirac-delta in thelimit. Finally, since ζ is not actually constant in w , theHessian, ( ¯P a w ( ζ ⋆ )) − , should be corrected by subtracting e ¯ w a ¯ w a T , where e = ζ ⋆ N + g , [Boc15]. Since this is buta rank-one update, a corrected transformation can becomputed without significant expense as T ( I N + γ vv T ),where v = T T ¯ w a , γ = e/ ( τ / + τ ), and τ = 1 − e v T v ,as may be deduced from equation (B2) of Bocquet [2016].However, this is typically a very minor correction, whilethe Gaussian posterior remains but an approximation.None of these adjustments were deemed necessary to usein the numerical experiments of this paper. The filtering problem as formulated with equations (1a)and (1b) uses additive noise processes ξ k and υ k torepresent model and observation errors. “Primary” filters[Dee et al. , 1985] assume the noise/error covariances, Q and R , are perfectly known. In reality, these systemstatistics are rarely well known; this deteriorates the stateestimates and can cause filter divergence. Self-diagnosingand correcting filters which estimate and modify thegiven noise statistics are called “adaptive”. AdaptiveKalman filter literature include Mehra [1972]; Mohamedand Schwarz [1999]. Adaptive particle filter literatureinclude Storvik [2002]; Özkan et al. [2013].This section is concerned with adaptive filtering forDA, focusing on the EnKF and the on-line estimation ofthe model error, i.e. Q . Among the literature using full[Miyoshi et al. , 2013; Nakabayashi and Ueno, 2017; Pulido et al. , 2018] and/or diagonal [Ueno and Nakamura, 2016;Dreano et al. , 2017] covariance parameterizations, somesuccess has been noted. However, the scope of this paperis restricted to the estimation of a single multiplicativeinflation factor, β , for ¯B . This assigns a structure to thecovariance and reduces the dimensionality and complexityof the problem, thus regularizing it. The tradeoff is a bias,but this drawback is largely offset by spatialization.Since this section presents adaptive inflation for modelerror, the inflation factor is here labelled β . This contrastsit to α of the EnKF- N , which targets sampling error.Formally, the unknown, β , is defined by the modellingassumption that the ensemble is now drawn with a “On-line” means that the estimation is included in the DAcycling loop, so as to be updated each time new data y is received.Some off-line approaches not further reviewed include Dee andDa Silva [1999]; Ueno et al. [2010]; Mitchell and Carrassi [2015]. The methods may also possess some skill in dealing with errorsdue to non-Gaussianity, biases, and even misspecification of R . covariance that is β times too small, x n ∼ N ( b , B /β ) , (50)as compared to the distribution (7) of the truth.Moreover, in order to neglect sampling errors, theensemble is assumed infinite ( N = ∞ ). Again, thisis commonly tacitly assumed in the adaptive EnKFliterature. The reason, as discussed below equation (9), isthat it yields a prior that is (the limit of the t distributionwhich is) Gaussian. In summary, p ( x | β, E ) = N (cid:0) x (cid:12)(cid:12) ¯ x , β ¯B (cid:1) . (51)The N = ∞ assumption is rolled back in the biasstudy of appendix D.2. More pragmatically, section 5makes a hybrid of the EnKF- N inflation with a method ofadaptive inflation for model error. 
The following review and analyses serve to choose a method with which to make this hybrid. The review is split between methods that may be termed “marginal”, working with β separately from x, and “joint”, working with (β, x) simultaneously. The joint approach, including “variational” and “hierarchical” methods, is theoretically appealing. However, on closer inspection, including numerical testing, it was found to be less advantageous. Therefore, the marginal methods take centre stage, while the review of the joint methods has been placed in appendix C.

Recall the average innovation, ¯δ = y − H¯x. For brevity, the explicit conditioning on the ensemble, E, is henceforth dropped. Writing ¯δ = (y − Hx) + H(x − b) + H(b − ¯x), it can be seen that

    E[¯δ ¯δ^T | B] = R + ε_N H B H^T.   (52)

A departure from non-ensemble works [e.g., Daley, 1992] is the adjustment ε_N = 1 + 1/N, resulting from (necessarily) using the prior ensemble mean, ¯x, rather than the true prior mean, b, to define the innovation, ¯δ. However, since N = ∞ is here assumed, ε_N = 1.

Substitute the expectation E[¯δ ¯δ^T | B] by the observed value ¯δ ¯δ^T in equation (52), and B by β ¯B, per equation (51). Then,

    ¯δ ¯δ^T ≈ β H ¯B H^T + R =: ¯C(β),   (53)

which suggests matching (some univariate summary of) ¯δ ¯δ^T by adjusting the inflation, β. For example, Li et al. [2009] use:

    ˆβ_I = (||¯δ||² − tr(R)) / tr(H ¯B H^T),   (54)

which follows directly from the trace of equation (53). Alternatively, the estimator of Wang and Bishop [2003] can be obtained by transforming equation (53) by R⁻¹ before taking the trace, thus allowing for heterogeneous observations and different units. This yields:

    ˆβ_R = (||¯δ||²_R / P − 1) / ¯σ²,   (55)

where tr(¯δ ¯δ^T R⁻¹) = ||¯δ||²_R has been used and ¯σ² is defined in equation (49).

Miyoshi [2011] proposes a variant using the Schur product ◦ R⁻¹. However, the regular algebra behind ˆβ_R is preferable, as it decorrelates the diagonals of ¯δ ¯δ^T and hence diminishes the variance of their trace. More variants are studied in appendix D.1. However, appendix D.2 shows them to have more bias than ˆβ_R because of further exposure to the uncertainty in ¯B, which is present when the N = ∞ assumption is not made.

As an alternative to the trace-based estimators, consider some results of likelihood maximization. Recalling equations (11) and (51), it can be shown that the likelihood for β is:

    p(y | β) = N(y | H¯x, C) = N(¯δ | 0, C),   (56)

where C = ¯C(β) of equation (53).
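As an aside before continuing with the likelihood view, the trace-based estimators translate into a few lines of code. This is a minimal sketch (not from the paper; the inputs are synthetic and illustrative), which also shows how noisy a single-cycle estimate is:

```python
import numpy as np

rng = np.random.default_rng(42)
P = 10                                       # observation size
beta_true = 2.0                              # covariance deficiency factor
HBH = np.eye(P)                              # H Bbar H^T (illustrative)
R = 0.5 * np.eye(P)
C = beta_true * HBH + R                      # innovation covariance (53)
dbar = np.linalg.cholesky(C) @ rng.standard_normal(P)

beta_I = (dbar @ dbar - np.trace(R)) / np.trace(HBH)           # eq. (54)
sigma2 = np.trace(HBH @ np.linalg.inv(R)) / P                  # eq. (49)
beta_R = (dbar @ np.linalg.solve(R, dbar) / P - 1) / sigma2    # eq. (55)
print(beta_I, beta_R)                        # both scatter around beta_true
```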
By this Gaussianity of equation (56), ¯δ ¯δ^T can be seen as the maximum of the likelihood for C, but only in the case of univariate observations (P = 1). In the multivariate case, the maximization is only defined restricted to the direction of ¯δ ¯δ^T.

Less artificially, the maximization can be rendered well-posed by restricting C to the scaling: C(θ) = θ C_0 for some C_0, in which case the maximum likelihood (ML) estimate of θ is ||¯δ||²_{C_0}/P, as noted by Dee [1995]. Unfortunately, with the parameterization C(β) = ¯C(β) of equation (56), the ML estimate, ˆβ_ML, is not analytically available, and will require iterations [e.g., Mitchell and Houtekamer, 2000; Zheng, 2009; Liang et al., 2012].

Also note that equation (53) may be derived in the variational framework, where it is seen as a consistency criterion on the cost function [Desroziers and Ivanov, 2001; Chapnik et al., 2006; Ménard, 2016]. These references are also known for deriving further diagnostics, employing distances labelled “observation-analysis” and “analysis-background”, which enable the simultaneous estimation of the observation error covariance. Li et al. [2009] make use of this in an ensemble framework, as do Ying and Zhang [2015]; Kotsuki et al. [2017], who make use of the scheme in a relaxation variety.

The accumulation and memorization of past information has gradually become more sophisticated. Wang and Bishop [2003] take the geometric average of ˆβ_R over time, while Mitchell and Houtekamer [2000] use the median of instantaneous maximum likelihood estimates. Lacking the fuller Bayesian setting, neither yields consistency (convergence to the true value) in time, because a temporal average of some point estimate ˆβ_k based on p(y_k | β) is generally not the value to which β | y_{1:K} converges as K → ∞. Nevertheless, this mismatch is not likely to be severe, and the simplicity of the approach is an advantage.

The approach of temporally averaging likelihood estimates has since been replaced by the more rigorous approach of filtering. It should be noted that this “secondary” filter is valid because the innovations are supposedly independent in time [Mehra, 1972]. Li et al. [2009] and Miyoshi [2011] assign a Gaussian prior p(β) = N(β | β^f, V^f), where the mean β^f is a persistence or relaxation of the previous analysis mean, and where the variance V^f is a tuning parameter. The likelihood is also assumed Gaussian: p(y | β) = N(ˆβ | β, ˆV), where ˆβ = ˆβ_I in Li et al. [2009] and ˆβ = ˆβ_R in Miyoshi [2011]. In the former, ˆV is a tuning parameter, while the latter suggests using the variance of ˆβ_R of equation (55):

    ˆV = [ (tr(H B H^T R⁻¹)/P + 1) / (¯σ² √(P/2)) ]² ≈ [ (β^f ¯σ² + 1) / (¯σ² √(P/2)) ]²,   (57)

where the approximation comes from using B ≈ β^f ¯B, which is consistent with the operative assumption that N = ∞. Importantly, equation (57) has the logical consequence that the observations are given no weight when they carry no information, i.e. when ¯σ² → 0 or P → 0. The method is spatialized by associating each local analysis domain with its own inflation parameter. Localization tapering promotes smoothness of the inflation field.

Also working in the framework of a square-root EnKF, Brankart et al. [2010] use diffusive forecasts for the inflation distribution. They do not sequentially approximate the posteriors, instead expressing them explicitly in terms of all of the past innovations, progressively diffused by a “forgetting exponent”, φ < 1:

    p(β | y_{1:K}) ∝ (p(β))^{φ^K} Π_{k=1}^K N(¯δ_k | 0, φ^{k−K} ¯C_k(β)).   (58)

The chosen estimate, which maximizes p(β | y_{1:K}), is found iteratively. The requisite evaluations of the cost function are not prohibitive because the square-root formulation means that diagonalizing matrix decompositions have already been computed.

Anderson [2007], working in the framework of the EAKF, explicitly considers forecasting the (parameters of the) inflation distribution, but only tests persistence forecasts, i.e. p(β_k | β_{k−1}) = δ(β_k − β_{k−1}). The posteriors (and hence the forecast priors) are approximated by Gaussians: for a given time,

    p(β | y_i) ≈ N(β | ˆβ_MAP, V^a),   (59)

which are computed serially for each component y_i of the observation y. The univariate-observation likelihood allows the maximum a posteriori (MAP) estimate ˆβ_MAP to be found exactly and without iterations, via a cubic polynomial; V^a is fitted by requiring that equation (59) be exact for another particular value of β, as normalized by its value at ˆβ_MAP. Anderson [2009] spatializes the method, associating each state variable with its own inflation parameter. Here, correlation coefficients complicate the likelihood, so that ˆβ_MAP must be found via a Taylor expansion of the likelihood and a resulting quadratic polynomial.
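To summarize the secondary-filter approach in code, here is a minimal sketch (not from the paper; the numbers are illustrative assumptions) of the Gaussian inflation filter of Li et al. [2009] and Miyoshi [2011], i.e. a scalar Kalman update of β, with the likelihood variance taken from equation (57):

```python
def inflation_update(beta_f, V_f, beta_hat, V_hat):
    """Combine prior N(beta_f, V_f) with likelihood N(beta_hat | beta, V_hat)."""
    K = V_f / (V_f + V_hat)                  # scalar Kalman gain
    return beta_f + K * (beta_hat - beta_f), (1 - K) * V_f

def V_hat_eq57(beta_f, sigma2, P):
    """Equation (57), using the approximation B ~ beta_f * Bbar."""
    return ((beta_f * sigma2 + 1) / (sigma2 * (P / 2) ** 0.5)) ** 2

beta_a, V_a = inflation_update(beta_f=1.0, V_f=0.05, beta_hat=1.3,
                               V_hat=V_hat_eq57(1.0, sigma2=1.0, P=40))
```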
Anderson [2009] also notes that a Gaussian prior does not constrain the inflation estimates to positive values. Thus, nonsensical negative estimates will occasionally occur for small innovations. Imposing some lower cap (bound), e.g., ˆβ_MAP > 0 or ˆβ_MAP > 1, is done as a quick-fix. As Stroud et al. [2018] note, such capping may cause a bias. However, this bias should be avoided by employing the un-capped values in the averaging.

Instead of ad-hoc mechanisms, using a better tailored prior will pre-empt the problem. Brankart et al. [2010] use an exponential pdf for the initial prior, but acknowledge that it is not appropriate for small values. Indeed, a better choice is the inverse-chi-square distribution, χ^{−2}, also called inverse-Gamma; it is the conjugate prior to the variance parameter of a Gaussian sample, and was shown in section 3 to be intimately linked to inflation. It has been employed in adaptive filtering by e.g., Mehra [1972]; Storvik [2002], and in EnKF contexts by Stroud and Bengtsson [2007] and Gharamti [2018], who used it to enhance the scheme of Anderson [2009]. Similarly, the following section formulates an inflation filter based on ˆβ_R using χ^{−2} distributions.

This subsection improves the formulation of the estimator ˆβ_R; in particular, it abstains from assuming Gaussianity for the distributions for β. Recall the definition (49) of ¯σ² and make the approximation H ¯B H^T R⁻¹ ≈ ¯σ² I_P. This proportionality renders the likelihood symmetric and hence reducible; indeed, the likelihood (56) can now be written:

    p(y | β) ∝∼ N(R^{−1/2} ¯δ | 0, (1 + ¯σ²β) I_P)
             ∝ (1 + ¯σ²β)^{−P/2} e^{−||¯δ||²_R / (2(1 + ¯σ²β))}
             ∝ χ^{+2}(||¯δ||²_R / P | (1 + ¯σ²β), P).   (60)

Remarkably, the value of β that maximizes this approximate likelihood is ˆβ_R of equation (55). The likelihood (60) may be further approximated by fitting the following shape to it:

    p(y | β) ≈ χ^{+2}(ˆβ_R | β, ˆν).   (61)

Irrespective of the certainty parameter ˆν, the mode (in β) is then the same as for equation (60), namely ˆβ_R. Also fitting the curvature at the mode to that of equation (60) yields ˆν = P [¯σ² ˆβ_R / (1 + ¯σ² ˆβ_R)]². Remarkably, this yields a variance (cf. Table 2) equal to ˆV of equation (57), except with ˆβ_R in place of β^f.

The benefit of making the second approximation (61) in addition to the first (60) is the resulting conjugacy with a χ^{−2} prior. Indeed, suppose the forecast prior for the inflation parameter is:

    p(β) = χ^{−2}(β | β^f, ν^f),   (62)

for some (β^f, ν^f). The posterior is then:

    p(β | y) = χ^{−2}(β | β^a, ν^a),   (63)

where

    ν^a = ν^f + ˆν,   (64a)
    β^a = (ν^f β^f + ˆν ˆβ_R) / ν^a.   (64b)

This weighted-average update for the parameters is the same as that of the Kalman filter, except with slightly different meaning to the parameters. As such, it constitutes a natural and original derivation of the inflation filter of Miyoshi [2011].

Rather than approximating the likelihood as χ^{+2}, an improved approximation could be obtained by reverting to the likelihood p(y | β) of equation (60) and directly fitting χ^{−2}(β | β^a, ν^a) to the posterior p(y | β) χ^{−2}(β | β^f, ν^f) by matching their modes and local curvatures. As described for equation (59), fitting posterior criteria is the approach taken by Anderson [2007]. With the χ^{−2} prior, though, an immediate benefit is that of precluding negative and nonsensical values for β^a. Moreover, it can be shown that the posterior mode satisfies a cubic equation.
However,the curvature is more complicated; alternatives includesetting ν a = ν f + ν , where ν = ˆ ν or where ν issuch such that a likelihood χ +2 ( ˆ β R | β, ν ) would yield theaforementioned cubic-root mode.The numerical approach offers another set of options:locating the mode by optimization and computing thecurvature by finite differences. The latter could also beexchanged for the ratio of two points, as in Anderson[2007]. If such avenues are pursued, it is then notnecessary to make the approximation of reducing thelikelihood (56) to (60); indeed, it is feasible to find the12equired statistics with the full likelihood, provided thepre-computed SVD (45). However, this is not necessarilyadvisable, because it yields substantial bias due to theuncertainty in ¯B , as discussed in appendix D.2.An alternative approximation is to fit ( β a , ν a ) viathe mean and variance of the posterior, as computedby quadrature. This requires careful implementation toavoid a drift due to truncation errors, which otherwiseaccumulate exponentially through the DA cycles. It isalso important to judiciously define the extent of the grid.Another option is to abandon the parametric approachaltogether, and instead represent the posterior on a grid[e.g., Stroud et al. , 2018]. In the case of a single, global,inflation parameter, as in this paper, this approach isaffordable for any purposeful precision. Lastly, a Monte-Carlo representation can also be employed [e.g., Frei andKünsch, 2012].For all variants, a choice must be made as to whichpoint value of β to use to inflate the ensemble. Ratherthan using the parameter β a directly, one can use themean or the mode (cf. Table 2). Typically, though, ν a is so large that this does not matter much. Similarly,although ˆ β R has a slight bias (appendix D.2), it errs onthe side of caution. For this reason or other, de-biasing didnot generally yield gains in testing by twin experiments.Therefore, and for simplicity, de-biasing is not furtheremployed. The following is a simple and pragmatic modellingof the forgetting of past information, achieved by“relaxing” towards some background (and initial) prior, χ − ( β | β , p ( β ) = χ − ( β | β f , ν f ) , (65)where ν f = e − ∆ /L ν a , (66a) β f = e − ∆ /L ( β a − β ) + β , (66b)with ∆ as the time between analyses. The time scale, L ≥
The forecast (66a) and (66b) was engineered, not derived from dynamics. While the issue is largely academic, this lack of formality has been perceived as a difficulty [Anderson, 2007, 2009; Sarkka and Hartikainen, 2013; Nakabayashi and Ueno, 2017]. The following construes a few possible resolutions. One possibility is diffusion. This would yield similar results to (66a) and (66b), though how the χ⁻² shape may be maintained is not clear. However, multiplicative, positive noise seems preferable to avoid negative values. Computationally, if the parametric distribution is not preserved under the forecast, then gridded, Monte-Carlo, or kernelized inverse-transform approaches may be employed. Instead of diffusion, Brankart et al. [2010] suggested exponentiating p(β) in the forecast step (cf. equation (58)), also called “annealing” [Stordal and Elsheikh, 2015]. This maintains the χ⁻² shape, but is difficult to motivate physically. Alternatively, the desired effect of “forgetting” is perhaps most naturally modelled with an autoregressive model with limited correlation length. This can be rendered Markovian (to fit with filtering theory) using state augmentation [Durbin and Koopman, 2012, §3.4].

It may also be argued that the search for physical dynamics is misguided. After all, the inflation, β, is already a hyperparameter. So why should it be more natural to forecast β rather than the hyper-hyperparameters, β^{a,f}_k and ν^{a,f}_k, as in equations (66a) and (66b)?

Several of the improvements of section 4.3 were tested with twin experiments. The results (not shown) indicate that most schemes, including the original one of section 4.2, perform surprisingly well (in terms of filter accuracy) provided they have reasonable settings of their tunable parameters. The most plausible explanation is that the inflation filters are consistent estimators: as the DA cycles build up, they all eventually converge towards a near-optimal value of inflation. In view of this parity, it seems logical to opt for the simplest scheme, namely that of equations (63) and (64).

Similarly, instead of equations (66a) and (66b), the numerical experiments use the fixed value ν^f = 10³, corresponding to a variance of approximately 2(β^f)²/10³, according to Table 2. Moreover, instead of fitting the likelihood's certainty, ν̂, it was simply set to 1. Keeping ν^f fixed is suboptimal in the spin-up phase of the twin experiments, but this part of the experiment is not included in the time-averaged statistics. Keeping ν^f fixed also foregoes the interesting possibility of actually having the certainty decrease through an update, something that will occur with the posterior-fitting or non-parametric approaches. Nevertheless, this simplification was made (i) to facilitate reproduction; (ii) because, in the twin experiments of this paper, the models and observational networks are homogeneous, and therefore using localized and variable ν̂ and ν^f is not crucial; (iii) because it was found that using the fitted ν̂ sometimes yielded worse filter accuracy than keeping it fixed; and (iv) to provide fair comparisons between all of the methods by equalizing the sophistication of their forecast step.

In addition to the adaptive inflation, Miyoshi [2011] also uses a small fixed inflation, which is not employed here. Otherwise, the scheme tested is essentially that of Miyoshi [2011], or equivalently that of Li et al. [2009], the differences having proven largely irrelevant and “cosmetic”, as discussed above. It is therefore labelled “ETKF adaptive”.
Another scheme that was tested, labelled “EAKF adaptive”, is that of equation (59). Instead of fitting V^a, however, the value of V^f was fixed, and set (tuned) to 0.01. As described by Anderson [2007], the inflation actually applied to the ensemble is a damped version of β^MAP. The above two schemes are the established standard in the literature, and have featured in many operational studies, although mainly in their spatialized formulations.

The proliferation of tunable “hyper-hyperparameters” (the hyperparameters of the inflation filter) is due to the hierarchical nature of adaptive filters, and may make the adaptive approach appear counterproductive. But, considering the breadth of error sources targeted, it should be recognized that such methods will necessarily be ad hoc, and that the existence of tuning parameters is to be expected. One should not be dismayed, however, because the performances of the adaptive filters are largely insensitive to the hyper-hyperparameter settings. Intuitively, more abstract parameters (further up the hierarchy) should have less impact, as is illustrated by the results of Roberts and Rosenthal [2001]. Indeed, the given values for the hyper-hyperparameters (i) were only tuned for a single experimental context, (ii) seem reasonable, and (iii) yield satisfactory filter accuracy almost universally across the experiments. Point (iii) corroborates previous findings [e.g., Anderson, 2007; Miyoshi, 2011], and suggests that these values may be used in vivo.

Section 3 showed that the EnKF-N may yield a form of adaptive inflation, α⋆. However, the EnKF-N is built on equation (7), with the assumption that the truth is statistically indistinguishable from the ensemble. Therefore, it only targets sampling error, and the prior (23) always has the location parameter 1. The EnKF-N is therefore not robust in the context of model error, analysed in section 4.

If tuned inflation is used in concert with the EnKF-N, then the tuned value could be seen as a measure of model error disentangled from sampling error [Bocquet et al., 2013]. Here, however, the aim is to hybridize the EnKF-N with an adaptive inflation scheme that estimates an inflation factor, β, targeted at model error. Notably, such a scheme has a prior that is time-dependent and, generally, does not have location parameter 1. The scheme used is the “adaptive ETKF” specified in section 4.5.

Again, the explicit conditioning on the ensemble, E, is here dropped for brevity. Consider

  p(x, β | y) ∝ p(y | x) p(x | β) p(β).   (67)

In contrast with section 4 and equation (51), N will here not be assumed infinite. Re-deriving, therefore, the EnKF-N prior (10), but with equation (50) in place of (7), reveals that p(x | β) is a scale mixture over α, but now with B̄ also scaled by β, i.e. p(x | α, β) = N(x | x̄, αβB̄). Further, sampling error is assumed independent from model error: p(α | β) = p(α). Hence,

  p(x, β | y) ∝ p(y | x) [ ∫ p(x | α, β) p(α) dα ] p(β).   (68)
Moving the likelihood inside, as in the EnKF-N (12), yields a mixture over p(y, x | α, β), which can be re-factorized to obtain:

  p(x, β | y) ∝ [ ∫ p(x | y, α, β) p(y | α, β) p(α) dα ] p(β).   (69)

As in the EnKF-N (13), the mixture is approximated by empirical Bayes:

  p(x, β | y) ∝∼ p(x | y, α⋆, β) [ ∫ p(y | α, β) p(α) dα ] p(β),   (70)

where the first factor approximates p(x | y, β) and the integral constitutes p(y | β). This means that p(x | y, α⋆, β) is given by equation (46), with the change of variables (29), except that the prior covariance is now also scaled by β, which also impacts the selection of α⋆ through the dual cost function.

The remaining integral of equation (70) is again approximated using a particular value of α:

  ∫ p(y | α, β) p(α) dα ≈ p(y | α^f, β).   (71)

In this instance, however, the optimizing value of α is not conditioned on y, and is therefore denoted α^f. In practice, α^f = 1 is used for simplicity. Thus, p(y | α^f, β) becomes the same as in equation (61), yielding the posterior of equation (63).

A final approximation is made to decouple the joint posterior: replacing β in the conditional distribution, p(x | y, α⋆, β), by β⋆, some point estimate from p(β | y, α^f). This may again be seen as empirical Bayes, except that β is not a latent variable. Other reasons for only using a single inflation value for the ensemble are discussed in appendix C.2. The particular point used is the mean:

  β⋆ = ν^a / (ν^a − 2) β^a.   (72)

Thus, equation (70) becomes:

  p(x, β | y) ≈ p(x | y, α⋆, β⋆) p(y | α^f, β) p(β)
             ∝ p(x | y, α⋆, β⋆) p(β | y, α^f).   (73)

The algorithm of the analysis update of the hybrid EnKF-N can be stated as follows.
1. Update the general-purpose inflation, β, according to equations (64a) and (64b).
2. Conditional on β⋆ (72), find the EnKF-N inflation using the dual (42): α⋆ = argmin_α D( (N − 1)/(β⋆α) ).
3. Update the ensemble by the ETKF with a prior inflation of α⋆β⋆. In other words, implement equation (47) with (N − 1)/(α⋆β⋆) in place of ζ⋆.
Note that the second step is the only essential difference to the ETKF adaptive inflation scheme of Miyoshi [2011].

The forecast is greatly facilitated by the decoupling of the posterior (73), which means that the ensemble and the general-purpose inflation β are independent. As discussed in section 4.4, it is pragmatic to separate the forecast of the ensemble from the forecast of β, which should be carried out as in equation (66). Instead, however, for the reasons described in section 4.5, here ν^f = 10³, and β^f is set to the previous β^a. The EnKF-N inflation, α, is not forecasted, as its prior is static.

The standard methods of section 4.5 and the hybrid of section 5 are tested with twin experiments: a synthetic truth and observations thereof are simulated and subsequently estimated by the DA methods. Contrary to the meaning of “twin”, however, the setup is deliberately one of model error: the model provided to the DA system is different from the one actually generating the truth.

The main system used is the two-scale/layer Lorenz model [Lorenz, 1996, 2005]. It constitutes a surrogate system for synoptic weather, exhibiting similar characteristics, and enables studying the impact of unresolved scales on filter accuracy. The autonomous dynamics are given by ψ±_i(u) = u_{i∓1} (u_{i±1} − u_{i∓2}) − u_i, where the indices apply periodically. Then,

  dx_i/dt = ψ⁺_i(x) + F − (hc/b) Σ_{j=1}^{10} z_{j+10(i−1)},   (74)
  dz_j/dt = (c/b) ψ⁻_j(bz) + (hc/b) x_{1+(j−1)//10},   (75)

for i = 1, …, 36 and j = 1, …, 360, where // denotes integer division.
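For concreteness, the following is a sketch of the tendencies (74) and (75) in Python, assuming the dimensions (36 slow and 360 fast variables) and the default constants stated just below; variable names are illustrative, and this is not the DAPPER implementation.

```python
import numpy as np

nx, nz_per = 36, 10            # 36 x-variables, 10 z-variables per x
c, b, h, F = 10.0, 10.0, 1.0, 10.0

def psi(u, sign):
    """Advection term psi^{+/-}_i(u) = u_{i∓1}(u_{i±1} − u_{i∓2}) − u_i,
    with periodic indexing via np.roll."""
    s = 1 if sign > 0 else -1
    return np.roll(u, s) * (np.roll(u, -s) - np.roll(u, 2 * s)) - u

def dxdt(x, z):
    """Slow tendency, equation (74)."""
    coupling = z.reshape(nx, nz_per).sum(axis=1)  # sum_j z_{j+10(i-1)}
    return psi(x, +1) + F - (h * c / b) * coupling

def dzdt(x, z):
    """Fast tendency, equation (75)."""
    drive = np.repeat(x, nz_per)                  # x_{1+(j-1)//10}
    return (c / b) * psi(b * z, -1) + (h * c / b) * drive
```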
Unless otherwise stated, the constants are set as in Lorenz [1996]: time-scale ratio c = 10, space-scale ratio b = 10, coupling h = 1, and forcing F = 10. The resulting dynamics, illustrated in Figure 3, are chaotic and have a leading Lyapunov exponent of 1.3775 [Mitchell and Carrassi, 2015].

Figure 3: Illustration of the dynamics of the two-scale Lorenz model by 20 consecutive snapshots of the state profile. The colour gradation, from light to dark, represents the time sequence, with a total span of 0.3. The large-scale x field, with a mean value of 2.4, can be seen moving left. The small-scale z field, with a mean value of 0.1, is moving right; its variations are faster, but of smaller amplitude. The abscissa denotes the indices of (upper) x and (lower) z.

The truth is composed of both fields: [x, z]. By contrast, the model provided to the DA methods is a truncated version of the full one:

  dx_i/dt = ψ⁺_i(x) + F − [A + B x_i],  i = 1, …, 36.   (76)

The term in brackets parameterizes (compensates for) the missing coupling to the small scale. Ahead of each experiment, the constants A and B are determined by linear regression to the unresolved tendencies, as described by Wilks [2005]. The error parameterization removes the linear bias of the truncated model; this has been done because inflation is not well suited to deal with systematic errors. Higher-order parameterizations were tested and found to yield little improvement in the filter accuracies, while a 0-order parameterization yielded too much model error, overly dominating the dynamical growth of the filter error.
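As a sketch of this regression (under the assumption that trajectories of the full model are available, sampled at the same times), A and B may be obtained as follows; the constants are those of the previous sketch, and the names are illustrative.

```python
import numpy as np

# Constants of the two-scale model, as in the previous sketch.
c, b, h, nx, nz_per = 10.0, 10.0, 1.0, 36, 10

def fit_parameterization(x_traj, z_traj):
    """Least-squares fit of A and B in (76), regressing the coupling
    term missed by the truncated model onto x (cf. Wilks [2005]).
    `x_traj` has shape (T, 36) and `z_traj` has shape (T, 360)."""
    # The tendency unresolved by (76): the coupling term of (74).
    g = (h * c / b) * z_traj.reshape(len(z_traj), nx, nz_per).sum(-1)
    B_, A_ = np.polyfit(x_traj.ravel(), g.ravel(), 1)  # g ≈ A + B x
    return A_, B_
```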
Another source of model error is that the full model is integrated with a shorter time step than the truncated model, whose time step of 0.05 serves to lower computational costs. However, this source of error was found to be negligible compared to the truncation itself. The time between observations is Δ = 0.15.
Direct observations are taken of the full x field with error covariance R = I_M. There is no model noise: Q = 0. The filters are assessed by their accuracy, as measured by the root-mean squared error:

  RMSE = √( ‖x − x̄‖² / M ),   (77)

which is recorded immediately following each analysis. This instantaneous RMSE is then averaged in time (3300 cycles following a spin-up of 40 cycles), and over 32 repetitions of each experiment. A table of RMSE averages is compiled for a range of experiment settings and plotted as curves for each method. The figures also present (as thin, dashed lines) the root-mean variances (RMS spread) of each method. All of the experiment benchmark results can be reproduced using the Python scripts hosted online at https://github.com/nansencenter/DAPPER/tree/paper_AdInf.

Figure 4a shows benchmarks obtained with the two-scale system as a function of the forcing, F. The increasing RMSE averages of all filters reflect the fact (not shown) that the system variability and chaoticity both increase with F. The same applies for decreasing c in Figure 4b, where F is fixed at 10. Note that all of the adaptive filters are largely coincident at F = 10 and c = 10, with RMSE scores almost as low as fixed, tuned inflation. This is because the hyper-hyperparameters for each method, described in section 4.5, were tuned at this point (and this point only). By contrast, the fixed inflation of the “ETKF tuned” filter is determined for each experiment setting by selecting for the lowest RMSE among 40 inflation values between 0.98 and 3, most of which are close to 1.
It is not surprising that tuning an adaptive filter will make it about as accurate as any other. The objective, however, is to avoid tuning. In that regard, what is surprising is how well all of the adaptive filters perform overall. Indeed, except for the fairly extreme contexts of large F or c < 4, the differences in RMSE are small, in the sense that the adaptive filters are all superior to “ETKF excessive”: a fixed-inflation filter with a suboptimal inflation factor that adds a small fixed amount to the tuned value.

Experiments were also performed with a single-scale configuration, where the truth itself is generated by the truncated dynamics:

  dx_i/dt = ψ⁺_i(x) + F,  i = 1, …, 36,   (78)

and there is no z field. As in Anderson [2007], the model error consists in using a different value of F for the truth than for the DA system. The setup is otherwise repeated from above. As shown in Figure 4c, the benchmarks plotted as a function of F are V-shaped, with the lowest scores obtained in the absence of error (F = 8). The adaptive filters score very similar RMSE averages, which are generally significantly in excess of the RMS spread scores. The mismatch can be explained by the well-known bias-variance decomposition of RMSE, and the fact that the model error contains significant bias. The presence of bias is also a likely cause of the closeness of the RMSE averages of the adaptive methods, because inflation is not well suited to treat bias.

Tests were also run with the 3-variable Lorenz-63 system [Lorenz, 1963], where the model error consists in adding independent white noise to the truth. The setup is the same as above, except that R = 2I. Figure 4d shows the corresponding benchmarks. These are obtained with a small ensemble (N = 3); using a larger ensemble, the relative advantage of the hybrid disappears.

The hybrid EnKF-N obtains slightly superior accuracy relative to the adaptive ETKF and EAKF for nearly all experiments. This is as expected from theory: separate, dedicated treatment of sampling and model errors yields improved accuracy. The practical advantage of the hybrid is illustrated in the time series of Figure 5. Notably, the inflation of the hybrid EnKF-N has much more volatility (a shorter time scale). This is made possible by the static prior, which “anchors” the inflation to 1. By contrast, similar volatility in the adaptive ETKF and EAKF would require much more lenient settings of ν^f (or V^f), which would yield excessive longer-term volatility (i.e. variance). On the other hand, the volatility also means that larger spikes in the inflation will occur. Since inflation is not a physical or especially gentle way of increasing spread, this could potentially cause trouble. For example, the pure EnKF-N with the full model and large F values blows up due to stiffness, as illustrated by the grey curves in Figure 4a. This could have been prevented by increasing (doubling) its certainty parameter, something that would also slightly improve its accuracy across all of the experiment settings.

Different hyper-hyperparameter values for the adaptive EAKF and ETKF will penalize their RMSE scores for some settings, and reward them for others. This sensitivity was observed to be much reduced for the hybrid, which is in line with the principal objective: to avoid tuning across a multitude of contexts. A secondary objective is to obtain improved accuracy compared to fixed, tuned inflation, as has been previously observed for the pure EnKF-N in the perfect-model context [Boc12]. Figure 4 shows that this is sometimes achieved, but by a very small margin.

Further experiments (not shown) were carried out, using different ensemble sizes and other types of model errors. The trends were similar to the benchmarks already shown, but typically with less relative difference between the filters.
Figure 4: Accuracy benchmarks of the adaptive inflation filters, plotted as functions of various control variables: (a) from the two-scale system, using N = 20, as a function of the forcing F (both for truth and DA); (b) from the two-scale system, using N = 20, as a function of the speed-scale ratio c (both for truth and DA); (c) from the single-scale system, using N = 20, as a function of the forcing F used for the truth (the DA assumes F = 8); (d) from the Lorenz-63 system, using N = 3, as a function of the magnitude (Q_ii/Δ) of the noise on the truth. Each panel compares ETKF excessive, ETKF tuned, EAKF adaptive, ETKF adaptive, and the hybrid EnKF-N. Also included in the plots is the RMS spread, plotted as thin, dashed lines. For perspective, two baselines are provided, in grey. These are obtained with the full (i.e. perfect) model using the pure EnKF-N: one with the same ensemble size as the adaptive filters (marker: □), and one with N = 80 (marker: +). Among the adaptive methods, the proposed hybrid EnKF-N (blue) scores the lowest RMSE averages nearly systematically across all contexts, albeit by a moderate margin.

Figure 5: Illustration of the statistics from the twin experiments by a segment of the typical time series (panels: ETKF tuned, EAKF adaptive, ETKF adaptive, hybrid EnKF-N; lines: inflation, RMS error, RMS spread). Generated with the two-scale Lorenz model, with F = 16. In the panel of the hybrid EnKF-N, the second inflation line (yellow) indicates the value of β^a, and averages close to 1.

This paper has developed an adaptive inflation scheme as a hybrid of (α) the finite-size ensemble Kalman filter inflation [the EnKF-N of Boc11; Boc12; Boc15] and (β) the inflation estimation conventionally associated with the ensemble transform Kalman filter (ETKF). In so doing, it has provided several novel theoretical insights on the EnKF and adaptive inflation estimation.

The first part of the paper is focused on idealistic contexts, with sampling error being the main concern. Using two univariate toy experiments, section 2.1 illustrated the generation of sampling error by nonlinearity, as predicted by appendix B. Section 2.2 then discussed the circumstances for inflation, cataloguing them according to linearity, stochasticity, and ensemble size. The discussion revealed why sampling errors are attenuated in the linear context, why the choice of normalization factor (e.g., N − 1) is not crucial, and also touched on topics such as ensemble collapse and filter divergence. Next, section 3.1 gave a bird's-eye view of the EnKF-N, showing how (e.g., empirical Bayes) and why (e.g., feedback) it works. The following sections filled in the details; in particular, section 3.4 showed how the effective prior reduces to a Gaussian scale mixture, again demonstrating the relationship between sampling error and inflation. The mixture parameter, α, is shown to be χ⁻². Section 3.6 derived a saddlepoint form to retain the inflation-explicit expressions all the way up to the posterior. Without recourse to Lagrangian duality theory, section 3.7 then finalized the re-derivation of the (dual) EnKF-N by showing how the mode of the posterior may be found by optimizing for the inflation factor, α.

In contrast to the above, section 4 is focused on model error, neglecting sampling error. A formal and unifying survey of the existing adaptive inflation literature is presented; particular attention is given to the schemes conventionally associated with the EAKF and the ETKF (β̂_R). The ETKF scheme is given a new and natural derivation in section 4.2, again yielding the χ⁻² distribution. Several potential improvements, some novel, were discussed in section 4.3, but generally found to be less rewarding than hoped for. Appendix D gives some new results on biases, including for the maximum likelihood estimator. The survey is supplemented by appendix C on joint schemes, including a suggestion for inflation estimation by variational Bayes. Section 4.4 commented on the forecasting of hyperparameters such as the inflation parameter, β.

Combining the above, section 5 developed a hybrid between the EnKF-N and the adaptive ETKF inflation scheme. The hybrid employs two inflation factors, α and β, separately targeting sampling and model error, respectively. The EnKF-N component (α) adds negligible computational cost and no further tuning parameters to the ETKF and its adaptive inflation (β), yet increases the inflation volatility and thus its ability to adapt. The experiments of section 6 showed that the hybrid generally yields similar filter accuracy to fixed inflation, even in bias-dominated contexts, but without the costly need for tuning. It also yields improved filter accuracy in comparison with the standard, pre-existing adaptive inflation schemes of the ETKF and the EAKF.

Unless the ensemble size was small and the context strongly nonlinear, however, the gains were found to be relatively modest, as was the difference between the existing methods. This is somewhat surprising in view of the essential importance of inflation in many configurations of the EnKF. Part of the explanation may be that, as a hyperparameter, the accuracy of the inflation estimates is not as important as that of the (primary) state variables and that, instead, the main importance of the inflation scheme consists in its capacity to avoid divergence occurrences, which is a matter of a more boolean character. Another cause is that the inflation estimates converge and become nearly constant within a relatively short span of time, and that these asymptotic estimates are sufficiently accurate for all of the methods.

While the experimental results clearly demonstrated the improvements of the hybrid adaptive inflation scheme, extrapolating these findings to other, larger applications is non-trivial. Spatialization of the inflation parameter will likely be necessary; it may be implemented without considerable complexity, as in Miyoshi [2011]. Still, the relative modesty of the above experimental results does not promise great, general gains.
On the other hand, it suggests the conclusion, aided by the rigour and scope of this study, that further sophistication of single-factor adaptive inflation estimation schemes is unlikely to yield significant further improvements.

A Standard distributions
Table 2 specifies the distributions in use in this paper. The following properties are useful.

Table 2: Parametric probability distributions. As elsewhere in the paper, b, x ∈ ℝ^M; B and S are symmetric positive-definite M×M matrices; s, β > 0; and it is assumed that ν > M. The constants are c_N = (2π)^{−M/2}, c_t = Γ((ν+M)/2) / [(πν)^{M/2} Γ(ν/2)], c_W = ν^{νM/2} / [2^{νM/2} Γ_M(ν/2)], and c_χ = c_W with M = 1. The (not listed) variance of element (i, j) of B with the Wishart distribution is (s²_ij + s_ii s_jj)/ν, where s_ij is element (i, j) of S. The variances of the inverse-Wishart distribution are asymptotically, for ν → ∞, the same.

  Gauss./Normal:      N(x | b, B) = c_N |B|^{−1/2} exp(−½ ‖x − b‖²_B);  mean b;  mode b;  covariance B.
  t-distribution:     t(x | ν; b, B) = c_t |B|^{−1/2} (1 + ‖x − b‖²_B/ν)^{−(ν+M)/2};  mean b;  mode b;  covariance ν/(ν−2) B.
  Wishart:            W⁺¹(B | S, ν) = c_W |S|^{−ν/2} |B|^{(ν−M−1)/2} e^{−tr(νBS⁻¹)/2};  mean S;  mode (ν−M−1)/ν S;  variance per caption.
  Inverse-Wishart:    W⁻¹(B | S, ν) = c_W |S|^{ν/2} |B|^{−(ν+M+1)/2} e^{−tr(νSB⁻¹)/2};  mean ν/(ν−M−1) S;  mode ν/(ν+M+1) S;  variance per caption.
  Chi-square:         χ⁺²(β | s, ν) = c_χ s^{−ν/2} β^{ν/2−1} e^{−νβ/(2s)};  mean s;  mode (ν−2)/ν s;  variance 2s²/ν.
  Inverse-chi-square: χ⁻²(β | s, ν) = c_χ s^{ν/2} β^{−ν/2−1} e^{−νs/(2β)};  mean ν/(ν−2) s;  mode ν/(ν+2) s;  variance 2ν²s² / [(ν−2)²(ν−4)].

Property 1. The (“scaled”) chi-square distributions are equivalent to the Gamma distributions:

  χ±²(β | s, ν) = Gamma±( β | ν/2, ν s^{∓1}/2 ),   (79)

where the switch sign ± has been used to represent both the regular and inverse distributions. The χ² parameterization has been preferred for the parameter interpretations offered by Property 2, and the notational simplicity of Properties 3 and 4.

Property 2. Asymptotic normality. If β ∼ χ±²(s, ν), then the distribution of √ν (β − s) converges to N(0, 2s²) as ν → ∞. Since it describes the sum of squared Gaussians, the asymptotic result for χ⁺² is a consequence of the central limit theorem. The result for χ⁻² can be shown through the pointwise convergence of the pdf of √ν (β − s), normalized by its value at 0. Note that the same limit would have applied if β ∼ N(s, 2s²/ν). This shows that s plays the role of a location parameter in χ±²(s, ν), while 2s²/ν plays the role of variance, and explains why “certainty” is preferred to “degree of freedom” for ν in this paper.

Property 3. In the univariate case (M = 1),

  W±¹(β | s, ν) = χ±²(β | s, ν).   (80)

Property 4. Reciprocity. With t = 1/β,

  p(β) = χ⁻²(β | s, ν)  iff  p(t) = χ⁺²(t | 1/s, ν).   (81)

Property 5. Reciprocity. With T = B⁻¹,

  p(B) = W⁻¹(B | S, ν)  iff  p(T) = W⁺¹(T | S⁻¹, ν),   (82)

as follows by the change of variables and the Jacobian |T|^{−(M+1)} [Muirhead, 1982, §2.1].

Property 6. Let u ≠ 0 be any M-dimensional vector, or an (almost never zero) random vector. If T ∼ W⁺¹(S, ν) is independent of u, then

  (uᵀ T u) / (uᵀ S u) ∼ χ⁺²(1, ν).   (83)

Moreover, this statistic is also independent of u. Proof: theorem 3.2.8 of Muirhead [1982].
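These correspondences are easy to verify numerically; the following sketch (using scipy, with arbitrary illustrative parameter values) checks Properties 1 and 4.

```python
import numpy as np
from scipy import stats

s, nu = 1.5, 7.0
beta = np.linspace(0.1, 5, 50)

# Property 1: chi^{+2}(beta | s, nu) == Gamma(shape=nu/2, rate=nu/(2s)).
pdf_chi2 = stats.gamma.pdf(beta, a=nu/2, scale=2*s/nu)

# Property 1: chi^{-2}(beta | s, nu) == inv-Gamma(shape=nu/2, rate=nu*s/2).
pdf_inv = stats.invgamma.pdf(beta, a=nu/2, scale=nu*s/2)

# Property 4: if beta ~ chi^{-2}(s, nu) then t = 1/beta ~ chi^{+2}(1/s, nu).
# Transforming back requires the Jacobian |dt/dbeta| = 1/beta**2.
pdf_recip = stats.gamma.pdf(1/beta, a=nu/2, scale=2/(s*nu)) / beta**2
assert np.allclose(pdf_inv, pdf_recip)
```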
B Nonlinearity and sampling error
This discussion complements that of section 2.1.
B.1 Why does nonlinearity generate sampling error?
First, consider what is meant by “sampling error”. A sample does not per se have a sampling error; it is by definition random, i.e. subject to variation. By contrast, estimators, or rather their realized estimates, have sampling error: the difference between the estimate and its expected value. By extension, any statistic (any function of the sample) may be said to have sampling error; for simplicity, however, the discussion below is limited to the non-central sample moments, i.e. μ̂_m({x_n}) = N⁻¹ Σ_{n=1}^{N} x_n^m, in the univariate case. If the sample is drawn from the same distribution as x, then μ̂_m is an unbiased estimate of the m-th moment of x, i.e. μ_m = E[x^m], and the sampling error is the difference:

  Error_m = μ̂_m − μ_m.   (84)

It is a well-known property of the Kalman filter that the covariance does not depend on the mean. In the forecast step, this is due to the fact that their evolutions are entirely decoupled. Indeed, in the case of linear dynamics (d = 1), the m-th forecast moment is given by μ^f_m = M^m μ_m, where M is the (scalar) linear model: x^f = Mx. For nonlinear forecast dynamics M, however, the moments will be coupled through M. For example, if (locally to the support of p(x)) the model M can be represented by a polynomial of degree d, then the m-th moment of the random variable M(x) is a linear combination of moments of x of order 1 through md:

  μ^f_m = Σ_{i=1}^{md} C_{m,i} μ_i.   (85)

Thus, for d > 1, the evolutions of the moments are coupled [Lewis et al., 2006, §29]. A similar analysis reveals that the same coupling takes place for the sample moments, μ̂_m. Therefore the sampling errors are also coupled:

  Error^f_m = Σ_{i=1}^{md} C_{m,i} Error_i.   (86)

But an N-sized ensemble can only match N moments, e.g., Error_i = 0 for i = 1, …, N, and so there will always be some sampling error present for i > N. Then, by the mixing of equation (86), this error will cascade into the lower-order forecast errors. In summary, nonlinearity causes sampling error in (e.g.) the mean and covariance by pulling in the inevitable (with finite N) sampling error from higher-order moments.

The above analysis is concerned with the generation of sampling error. Another reason for sampling error in the context of nonlinearity is that chaos prevents its elimination, by limiting the effect of far-past observations (as opposed to the linear case illustrated in Figure 1, where the initial sampling error is quickly attenuated). It is not immediately clear whether this is a separate cause or, rather, a different perspective on the same phenomenon.

B.2 A nonlinear model preserving N

The nonlinear model of section 2.1 was designed using “inverse transform sampling”. It is specified by M_NonLin(x) = F⁻¹_N(F_χ(x²)), where F_χ is the cumulative distribution function (CDF) for χ⁺²(1, 1), and F⁻¹_N is the inverse CDF for N(0, 1). Note that M_NonLin: (i) is V-shaped, with a singularity at 0; (ii) is closely approximated by an affine combination of |x| and 0.23 log(x²); (iii) applied to a density symmetric about 0, it may be visualized as folding it up in the middle before smearing it back out again; (iv) may be generalized to higher dimensionality by expressing x in polar coordinates, since p(‖x‖²) = χ⁺²(‖x‖² | M, M) if p(x) = N(x | 0, I_M).

Presumably, any nonlinear model that preserves Gaussianity must include a singularity. This renders the example using M_NonLin to generate sampling error without non-Gaussianity somewhat artificial, but does not jeopardize the utility of considering the two issues separately.
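A sketch of this construction (using the standard, unscaled distributions of scipy, which coincide with the above for these parameters):

```python
import numpy as np
from scipy import stats

def M_nonlin(x):
    """Gaussianity-preserving nonlinear map of appendix B.2:
    x ~ N(0,1) -> x**2 ~ chi-square(1) -> CDF -> uniform(0,1)
    -> inverse normal CDF -> N(0,1). Illustrative sketch."""
    u = stats.chi2.cdf(x**2, df=1)  # F_chi(x^2), uniform on (0,1)
    return stats.norm.ppf(u)        # F_N^{-1}(u), again N(0,1)

# Quick check: a large Gaussian sample keeps its first two moments.
x = np.random.randn(100_000)
y = M_nonlin(x)
print(y.mean(), y.var())            # approximately 0 and 1
```

Note that the singularity at 0 appears because F_χ(0) = 0, whose normal quantile is −∞, consistent with item (i) above.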
C Joint state-covariance estimation
As mentioned immediately above section 4.1, the joint approach approximates p(x, β | y) simultaneously in x and β. Thus, their analyses impact each other. This is particularly relevant for the state-inflation (or state-covariance) estimation problem, because of the non-Gaussianity of p(x, β | y).

C.1 Variational methods
A common approximate solution to this non-Gaussianity is to use variational methods for parametric fitting. As detailed below, this leads to iterative schemes where the update to the hyperparameter, β, is computed in terms of the updated ensemble (for x), and vice versa. Some of the literature below is concerned with estimating R (jointly with x), but is still pertinent by the proximity of the problem to that of estimating B.

Sarkka and Nummenmaa [2009] introduce the variational Bayes (VB) method in the framework of the Kalman filter. Nakabayashi and Ueno [2017] extend it to the EnKF. The VB method imposes an approximate posterior with two factor distributions, q(x | y) q(R | y), fitted by minimizing the Kullback-Leibler divergence from the correct posterior, p(x, R | y). This yields the condition that each factor be the (geometric) marginal of p(x, R | y) with respect to the other, i.e. two coupled pdf equations. Independence is assumed for the prior: p(x, R) = N(x | x̄, B̄) W⁻¹(R | R^f, ν^f). It is then shown that the distributions of the approximate posterior are again N and W⁻¹, with parameters computable by fixed-point iteration. Iteration i + 1 consists of the EnKF equations using the i-th estimate of R^a to compute the analysis mean, x̄^a, and covariance, P̄^a, whose updated values are then used in the update R^a = {ν^f R^f + R̂}/ν^a, where ν^a = ν^f + 1 and

  R̂ = (y − H x̄^a)(y − H x̄^a)ᵀ + H P̄^a Hᵀ.   (87)

Ueno and Nakamura [2016], also estimating R in the framework of the EnKF, use the expectation-maximization (EM) method to maximize the marginal posterior p(R | y), with a W⁻¹ prior. The expectation is over x, given y and the current estimate of R. The empirical distribution is assumed for the prior: p(x) ≈ N⁻¹ Σ_n δ(x − x_n), yielding iterations involving weighted statistics. Our investigation indicates that if, instead, the standard EnKF assumption had been used, p(x) ≈ N(x | x̄, B̄), then the method would have yielded iterations as in equation (87). This is unsurprising in view of the close connection between EM and VB.

An original result is obtained by applying the VB method to estimate the inflation parameter. Consider the prior p(x, β) = N(x | x̄, βB̄) χ⁻²(β | β^f, ν^f); note that x and β are not assumed independent. It can then be shown that the resulting VB scheme consists of using the i-th iterate of β^a to compute the i + 1 iterates of x̄^a and P̄^a which, in turn, are used to compute the i + 1 iterate β^a = {ν^f β^f + β̂}/ν^a, where ν^a = ν^f + M and

  β̂ = ‖x̄^a − x̄^f‖²_B̄ + tr( P̄^a B̄⁻¹ ),   (88)

which should be computed in the ensemble subspace if N ≤ M. Equation (88) may be interpreted using the trigonometric relations of Desroziers et al. [2005]. Numerical experiments indicate that the bias of this VB method (with a flat prior) is significantly higher than for β̂_R, typically also with a larger variance. It seems likely that the reason is similar to that of β̂_ML, analysed in appendix D.2.
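A minimal sketch of this fixed-point scheme for small, dense systems follows (illustrative names; a real system would require the ensemble-subspace computations mentioned above).

```python
import numpy as np

def vb_inflation_update(xf_mean, B, y, H, R, beta_f, nu_f, n_iter=10):
    """Fixed-point iteration of the VB inflation scheme around
    equation (88); an illustrative sketch, not the paper's code."""
    M = len(xf_mean)
    nu_a = nu_f + M
    beta_a = beta_f
    Binv = np.linalg.inv(B)
    for _ in range(n_iter):
        # Kalman analysis with prior covariance beta_a * B.
        Bb = beta_a * B
        K = Bb @ H.T @ np.linalg.inv(H @ Bb @ H.T + R)
        xa_mean = xf_mean + K @ (y - H @ xf_mean)
        Pa = (np.eye(M) - K @ H) @ Bb
        # Inflation update (88), using the new analysis moments.
        beta_hat = (xa_mean - xf_mean) @ Binv @ (xa_mean - xf_mean) \
                   + np.trace(Pa @ Binv)
        beta_a = (nu_f * beta_f + beta_hat) / nu_a
    return xa_mean, Pa, beta_a
```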
C.2 Hierarchical methods

Instead of the variational approach, a more principled approach is to represent the marginal p(β | y) with a Monte-Carlo sample. This yields a hyper-ensemble of distributions p(x | β, y) and their ensembles. The label “hierarchical” is sometimes reserved for adaptive filters treating the hyperparameter in this more Bayesian (i.e. full-pdf) manner. However, the number of realizations quickly becomes exorbitant. On the other hand, if only a single sample is drawn from p(x | β, y) for each β (i.e. not using an ensemble of ensembles), then the approach seems especially prone to spurious correlations. Another criticism is that it might cause discrete jumps between localization domains.

Myrseth et al. [2010] provided one example of a hierarchical EnKF, but without conditioning the hyperparameter on y, without accounting for model error, and with a forecast of (the hyperprior of) B that decays towards 0. Tsyrulnikov and Rakitko [2015] present an ambitious hierarchical EnKF, explicitly considering both model and sampling error. However, the filter is only tested in experiments with a novel, univariate model.

An intriguing approach is that of Stroud and Bengtsson [2007]. They peg the scaling of R and Q together, so as to estimate only a single parameter, β. This is difficult to justify, but simplifies the problem significantly, because then B̄ also scales with β (provided linear dynamics), so that p(x, β) = N(x | x̄, βB̄) χ⁻²(β | …) is conjugate to p(y | x, β), and the full, joint posterior is available in closed form, with trivial parametric updates, hence avoiding Monte-Carlo. However, without pegging Q to R, the conjugacy is lost.

The EnKF-N can also be said to be hierarchical, because of its careful treatment of the full marginal, p(α | y), before the variational approximation (13).

D More on the marginal inflation estimators
This section complements section 4.1.
D.1 Other trace-based estimators
It is common to form chi-square diagnostics by measuring δ̄ by its Mahalanobis norm [Ménard et al., 2000; Wu et al., 2013; Haussaire, 2017]. This provides the motivation to use C̄(1)⁻¹ to transform equation (53), yielding:

  β̂_C̄ = [ ‖δ̄‖²_C̄(1) − tr( R C̄(1)⁻¹ ) ] / tr( H B̄ Hᵀ C̄(1)⁻¹ ).   (89)

Alternatively, equation (53) can be transformed by (H B̄ Hᵀ)⁻¹, yielding (δ̄δ̄ᵀ − R)(H B̄ Hᵀ)⁻¹ = β I_P. For the purpose of inflation, this seems like the best option, because then the trace consists of terms with the same expected magnitude, yielding the lowest aggregate variance. The estimator becomes:

  β̂_HB̄Hᵀ = (1/P) { ‖δ̄‖²_HB̄Hᵀ − tr( R (H B̄ Hᵀ)⁻¹ ) }.   (90)

If H B̄ Hᵀ is rank-deficient, then β̂_HB̄Hᵀ must be defined using the pseudo-inverse. The estimate would then not be impacted by components of δ̄ outside of the ensemble subspace. Lastly, note that none of the trace-based estimators are computationally costly, because they can all be computed via the SVD (45).
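A sketch of these estimators for small, dense systems follows, assuming C̄(1) = R + H B̄ Hᵀ and that σ̄ is the trace average of (49); in practice they would all be computed via the SVD (45).

```python
import numpy as np

def beta_hat_R(delta, R, HBHt):
    """Equation (55): moment matching in the R-norm (sketch)."""
    P = len(delta)
    sigma_bar = np.trace(np.linalg.solve(R, HBHt)) / P
    return (delta @ np.linalg.solve(R, delta) / P - 1.0) / sigma_bar

def beta_hat_Cbar(delta, R, HBHt):
    """Equation (89), weighting by C(1)^{-1}, with the assumption
    C(1) = R + H B H^T."""
    C1_inv = np.linalg.inv(R + HBHt)
    num = delta @ C1_inv @ delta - np.trace(R @ C1_inv)
    return num / np.trace(HBHt @ C1_inv)

def beta_hat_HBHt(delta, R, HBHt):
    """Equation (90), weighting by (H B H^T)^{-1}; the pseudo-inverse
    covers the rank-deficient case."""
    P = len(delta)
    S_inv = np.linalg.pinv(HBHt)
    return (delta @ S_inv @ delta - np.trace(R @ S_inv)) / P
```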
D.2 Single-cycle bias

This subsection considers the properties of the estimators within a single analysis, based only on the inflation estimators' sampling distributions. Here, the assumption of section 4 that N = ∞ is undone, so that B̄ is also random (before conditioning on the ensemble), in addition to x and y, and subject to sampling errors. As will be shown, this yields biases in the inflation estimators.

Now, using variables defined via the SVD (45), it can be shown that each of the following estimators of β satisfies the condition:

  0 = Σ_{i=1}^{P} γ_i ( 1 + β σ̄_i − d_i² ),   (91)

where d_i is the i-th component of the transformed innovation, Uᵀ R^{−1/2} δ̄, and

  γ_i = 1                      for β̂_R,
      = (1 + σ̄_i)⁻¹           for β̂_C̄,
      = (σ̄_i)⁻¹               for β̂_HB̄Hᵀ,
      = σ̄_i / (1 + β̂ σ̄_i)²   for β̂_ML.   (92)

Equation (91) may be solved explicitly for β̂, except in the case of β̂_ML. Nevertheless, equations (91) and (92) may be used to provide insight on the bias of β̂_ML, a subject of study since Mitchell and Houtekamer [2000]. Indeed, note that while the full matrix, B̄, is an unbiased estimator of B, the spectrum of B̄ is a biased estimate of the spectrum of B [van der Vaart, 1961; Takemura, 1984]. Thus, the spectrum of H B̄ Hᵀ R⁻¹, namely {σ̄_i}, is also biased. Hence, generally, functions of the spectrum will be biased. An important exception is that E(Σ_i σ̄_i) = tr(H B Hᵀ R⁻¹), meaning that the expectation of equation (91) holds for β̂_R. Note that β̂_R is still biased, as it requires inverting Σ_i σ̄_i. Nevertheless, considering the expressions (92) for γ_i, it seems logical that the bias of {σ̄_i} will significantly carry over into the more complicated estimators, such as β̂_ML. Numerical experiments confirm this, and show that the bias of β̂_ML is worse than it is for β̂_R, but less than for β̂_C̄ and β̂_HB̄Hᵀ.

Why is the bias of β̂_R of equation (55) the least? Loosely speaking, because the trace of B̄ is taken before dividing. Moreover, in case R and H B Hᵀ have the same structure, i.e. R^{−1/2} H B Hᵀ R^{−T/2} = σ² I_P, the bias of β̂_R can be obtained analytically, as follows. Due to the modified assumption (50), equation (18) now yields B̄ ∼ W⁺¹(B/β, N − 1), whereby the diagonal elements of R^{−1/2} H B̄ Hᵀ R^{−T/2} are iid, with distribution χ⁺²(σ²/β, N − 1). Hence, with ν = P(N − 1),

  σ̄ = (1/P) tr( R^{−1/2} H B̄ Hᵀ R^{−T/2} ) ∼ χ⁺²( σ²/β, ν ).   (93)

Then, according to Property 4 and Table 2, E[1/σ̄] = ν/(ν − 2) · β/σ². Meanwhile, E[δ̄δ̄ᵀ] = R + (1 + 1/(βN)) H B Hᵀ, so that E[‖δ̄‖²_R/P − 1] = (1 + 1/(βN)) σ². Now, the sample mean and variance of Gaussian samples are independent. Hence, the numerator and denominator of β̂_R of equation (55) are independent, so that

  E[β̂_R] = E[‖δ̄‖²_R/P − 1] · E[1/σ̄] = (1 + 1/(βN)) · ν/(ν − 2) · β,   (94)

which is close to β for large N and P.
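The bias formula (94) is readily checked by simulation. The following is an illustrative sketch in the white case R = I_P and H B Hᵀ = σ² I_P, drawing the numerator and denominator of β̂_R independently, per the assumptions above; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N, beta, sigma2 = 10, 15, 1.5, 2.0
nu = P * (N - 1)
est = np.empty(10_000)
for k in range(len(est)):
    # sigma_bar ~ chi^{+2}(sigma2/beta, nu): trace average of a
    # (normalized) Wishart sample with mean sigma2/beta * I_P.
    A = rng.standard_normal((P, N - 1)) * np.sqrt(sigma2 / (beta * (N - 1)))
    sigma_bar = np.sum(A**2) / P
    # Innovation with covariance R + (1 + 1/(beta N)) * H B H^T.
    d = rng.standard_normal(P) * np.sqrt(1 + (1 + 1/(beta*N)) * sigma2)
    est[k] = (d @ d / P - 1) / sigma_bar       # beta_hat_R, eq. (55)

print(est.mean())                              # Monte-Carlo estimate
print((1 + 1/(beta*N)) * nu / (nu - 2) * beta) # formula (94): close
```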
Acknowledgements

The authors thank the two anonymous reviewers for their constructive comments; A. Farchi for technical help and discussions; L. Bertino for fostering the collaboration and for his patience; C. Grudzien for insight on the behaviour of the Lorenz system; F. Counillon for reading and opinion; and A. Karspeck and M. Gharamti for stimulating questions.

Author P. N. Raanes has been funded by the EmblAUS project of the Nordic countries' funding agency NordForsk, and by DIGIRES, a project sponsored by PETROMAKS2 of the Research Council of Norway and industry partners. CEREA is thanked for providing office space and community for visiting researcher P. N. Raanes. CEREA is a member of the Institut Pierre-Simon Laplace (IPSL). Author A. Carrassi has been partly funded by the project REDDA of the Norwegian Research Council.
References
Anderson JL. 2003. A local least squares framework forensemble filtering.
Monthly Weather Review (4):634–642.Anderson JL. 2007. An adaptive covariance inflationerror correction algorithm for ensemble filters.
TellusA (2): 210–224.Anderson JL. 2009. Spatially and temporally varyingadaptive covariance inflation for ensemble filters. TellusA (1): 72–83.Anderson JL, Anderson SL. 1999. A Monte Carloimplementation of the nonlinear filtering problem toproduce ensemble assimilations and forecasts. MonthlyWeather Review (12): 2741–2758. Asch M, Bocquet M, Nodet M. 2016.
Data assimilation:methods, algorithms, and applications . Fundamentalsof Algorithms, SIAM: Philadelphia, PA, doi:10.1137/1.9781611974546.Azevedo-Filho A, Shachter RD. 1994. Laplace’s methodapproximations for probabilistic inferencein beliefnetworks with continuous variables. In:
Proceedingsof the Tenth international conference on Uncertaintyin artificial intelligence . Morgan Kaufmann PublishersInc., pp. 28–36.Berry T, Harlim J. 2014. Linear theory for filtering non-linear multiscale systems with model error.
Proceedingsof the Royal Society A: Mathematical, Physical andEngineering Science (2167): 20140 168.Bishop CH, Etherton BJ, Majumdar SJ. 2001. Adaptivesampling with the ensemble transform Kalman filter.Part I: Theoretical aspects.
Monthly Weather Review (3): 420–436.Bocquet M. 2011. Ensemble Kalman filtering withoutthe intrinsic need for inflation.
Nonlinear Processes inGeophysics (5): 735–750.Bocquet M. 2016. Localization and the iterative ensembleKalman smoother. Quarterly Journal of the RoyalMeteorological Society (695): 1075–1089.Bocquet M, Carrassi A. 2017. Four-dimensional ensemblevariational data assimilation and the unstable subspace.
Tellus A: Dynamic Meteorology and Oceanography (1): 1304504.
Bocquet M, Gurumoorthy KS, Apte A, Carrassi A, Grudzien C, Jones CKRT. 2017. Degenerate Kalman filter error covariances and their convergence onto the unstable subspace. SIAM/ASA Journal on Uncertainty Quantification (1): 304–333.
Bocquet M, Raanes PN, Hannart A. 2015. Expanding the validity of the ensemble Kalman filter without the intrinsic need for inflation. Nonlinear Processes in Geophysics (6): 645–662.
Bocquet M, Sakov P. 2012. Combining inflation-free and iterative ensemble Kalman filters for strongly nonlinear systems. Nonlinear Processes in Geophysics (3): 383–399.
Bocquet M, Sakov P. 2013. Joint state and parameter estimation with an iterative ensemble Kalman smoother. Nonlinear Processes in Geophysics (5): 803–818.
Brankart JM, Cosme E, Testut CE, Brasseur P, Verron J. 2010. Efficient adaptive error parameterizations for square root or ensemble Kalman filters: application to the control of ocean mesoscale signals. Monthly Weather Review (3): 932–950.
Chapnik B, Desroziers G, Rabier F, Talagrand O. 2006. Diagnosis and tuning of observational error in a quasi-operational data assimilation setting.
Quarterly Journalof the Royal Meteorological Society (615): 543–565.Daley R. 1992. Estimating model-error covariances forapplication to atmospheric data assimilation.
MonthlyWeather Review (8): 1735–1746.Dee D, Cohn S, Dalcher A, Ghil M. 1985. An efficientalgorithm for estimating noise covariances in distributedsystems.
IEEE transactions on automatic control (11): 1057–1065.Dee DP. 1995. On-line estimation of error covarianceparameters for atmospheric data assimilation. MonthlyWeather Review (4): 1128–1145.Dee DP, Da Silva AM. 1999. Maximum-likelihoodestimation of forecast and observation error covarianceparameters. Part I: Methodology.
Monthly WeatherReview (8): 1822–1834.Desroziers G, Berre L, Chapnik B, Poli P. 2005. Diagnosisof observation, background and analysis-error statisticsin observation space.
Quarterly Journal of the RoyalMeteorological Society (613): 3385–3396.Desroziers G, Ivanov S. 2001. Diagnosis and adaptivetuning of observation-error parameters in a variationalassimilation.
Quarterly Journal of the Royal Meteoro-logical Society (574): 1433–1452.Dreano D, Tandeo P, Pulido M, Ait-El-Fquih B,Chonavel T, Hoteit I. 2017. Estimating model-error covariances in nonlinear state-space modelsusing Kalman smoothing and the expectation–maximization algorithm.
Quarterly Journal of the RoyalMeteorological Society (705): 1877–1885.Durbin J, Koopman SJ. 2012.
Time series analysis bystate space methods , vol. 38. OUP Oxford.Evensen G. 2003. The ensemble Kalman filter: Theoreticalformulation and practical implementation.
OceanDynamics (4): 343–367.Evensen G. 2009a. Data assimilation . Springer, 2 edn.Evensen G. 2009b. The ensemble Kalman filter forcombined state and parameter estimation.
ControlSystems, IEEE (3): 83–104.Fernandez C, Steel MFJ. 1999. Multivariate Student-tregression models: Pitfalls and inference. Biometrika (1): 153–167.Fitzgerald R. 1971. Divergence of the Kalman filter. Automatic Control, IEEE Transactions on (6): 736–747.Frei M, Künsch HR. 2012. Sequential state andobservation noise covariance estimation using combinedensemble Kalman and particle filters. Monthly WeatherReview (5): 1476–1495. Furrer R, Bengtsson T. 2007. Estimation of high-dimensional prior and posterior covariance matrices inKalman filter variants.
Journal of Multivariate Analysis (2): 227–255.Gelman A, Carlin JB, Stern HS, Rubin DB. 2004. Bayesian data analysis . Texts in Statistical ScienceSeries, Chapman & Hall/CRC, Boca Raton, FL, secondedn.Geweke J. 1993. Bayesian treatment of the independentstudent-t linear model.
Journal of applied econometrics (S1).Gharamti ME. 2018. Enhanced adaptive inflationalgorithm for ensemble filters. Monthly Weather Review (2): 623–640.Goutis C, Casella G. 1999. Explaining the saddlepointapproximation.
The American Statistician (3): 216–224.Grudzien C, Carrassi A, Bocquet M. 2018. Chaoticdynamics and the role of covariance inflation for reducedrank Kalman filters with model error. NonlinearProcesses in Geophysics : 1–25.Hamill TM, Whitaker JS, Snyder C. 2001. Distance-dependent filtering of background error covarianceestimates in an ensemble Kalman filter.
MonthlyWeather Review (11): 2776–2790.Haussaire JM. 2017. Méthodes variationnelles d’ensembleitératives pour l’assimilation de données non-linéaire:Application au transport et la chimie atmosphérique.PhD thesis, Université Paris-Est.Houtekamer PL, Mitchell HL. 1998. Data assimilationusing an ensemble Kalman filter technique.
MonthlyWeather Review (3): 796–811.Hunt BR, Kalnay E, Kostelich EJ, Ott E, Patil DJ,Sauer T, Szunyogh I, Yorke JA, Zimin AV. 2004. Four-dimensional ensemble Kalman filtering.
Tellus A (4):273–277.Hunt BR, Kostelich EJ, Szunyogh I. 2007. Efficientdata assimilation for spatiotemporal chaos: A localensemble transform Kalman filter. Physica D: NonlinearPhenomena (1): 112–126.Jaynes ET. 2003.
Probability theory: the logic of science .Cambridge university press.Kotsuki S, Ota Y, Miyoshi T. 2017. Adaptive covariancerelaxation methods for ensemble data assimilation:Experiments in the real atmosphere.
Quarterly Journal of the Royal Meteorological Society.
Le Gland F, Monbet V, Tran VD. 2009. Large sample asymptotics for the ensemble Kalman filter. Research Report RR-7014, INRIA.
Lewis JM, Lakshmivarahan S, Dhall S. 2006.
Dynamicdata assimilation: A least squares approach , vol. 13.Cambridge University Press.Li H, Kalnay E, Miyoshi T. 2009. Simultaneous estimationof covariance inflation and observation errors within anensemble Kalman filter.
Quarterly Journal of the RoyalMeteorological Society (639): 523–533.Liang X, Zheng X, Zhang S, Wu G, Dai Y, Li Y. 2012.Maximum likelihood estimation of inflation factorson error covariance matrices for ensemble Kalmanfilter assimilation.
Quarterly Journal of the RoyalMeteorological Society (662): 263–273.Lorenz EN. 1963. Deterministic nonperiodic flow.
Journalof the Atmospheric Sciences (2): 130–141.Lorenz EN. 1996. Predictability: A problem partly solved.In: Proc. ECMWF Seminar on Predictability , vol. 1.Reading, UK, pp. 1–18.Lorenz EN. 2005. Designing chaotic models.
Journal ofthe Atmospheric Sciences (5): 1574–1587.Mandel J, Cobb L, Beezley JD. 2011. On theconvergence of the ensemble Kalman filter. Applicationsof Mathematics (6): 533–541.Mehra R. 1972. Approaches to adaptive filtering. IEEETransactions on automatic control (5): 693–698.Ménard R. 2016. Error covariance estimation methodsbased on analysis residuals: theoretical foundationand convergence properties derived from simplifiedobservation networks. Quarterly Journal of the RoyalMeteorological Society (694): 257–273.Ménard R, Cohn SE, Chang LP, Lyster PM. 2000. Assim-ilation of stratospheric chemical tracer observationsusing a Kalman filter. Part I: Formulation.
MonthlyWeather Review : 2654–2671.Mitchell HL, Houtekamer PL. 2000. An adaptive ensembleKalman filter.
Monthly Weather Review (2): 416–433.Mitchell L, Carrassi A. 2015. Accounting for modelerror due to unresolved scales within ensemble Kalmanfiltering.
Quarterly Journal of the Royal MeteorologicalSociety (689): 1417–1428.Miyoshi T. 2011. The Gaussian approach to adaptivecovariance inflation and its implementation with thelocal ensemble transform Kalman filter.
MonthlyWeather Review (5): 1519–1535.Miyoshi T, Kalnay E, Li H. 2013. Estimating and includ-ing observation-error correlations in data assimilation.
Inverse Problems in Science and Engineering (3): 387–398.
Mohamed AH, Schwarz KP. 1999. Adaptive Kalman filtering for INS/GPS. Journal of Geodesy (4): 193–203.
Muirhead RJ. 1982. Aspects of Multivariate Statistical Theory. John Wiley & Sons, Inc., New York. Wiley Series in Probability and Mathematical Statistics.
Myrseth I, Omre H. 2010. Hierarchical ensemble Kalman filter. SPE Journal (02): 569–580.
Nakabayashi A, Ueno G. 2017. An extension of the ensemble Kalman filter for estimating the observation error covariance matrix based on the variational Bayes's method. Monthly Weather Review (1): 199–213.
Automatica (6): 1566–1575.Palatella L, Trevisan A. 2015. Interaction of Lyapunovvectors in the formulation of the nonlinear extension ofthe Kalman filter. Physical Review E (4): 042 905.Pham DT, Verron J, Roubaud MC. 1998. A singularevolutive extended Kalman filter for data assimilation inoceanography. Journal of Marine systems (3): 323–340.Pulido M, Tandeo P, Bocquet M, Carrassi A, Lucini M.2018. Stochastic parameterization identification usingensemble Kalman filtering combined with maximumlikelihood methods. Tellus A: Dynamic Meteorology andOceanography (1): 1442 099.Raanes PN. 2016. Improvements to ensemble methods fordata assimilation in the geosciences. PhD thesis, Uni-versity of Oxford. https://ora.ox.ac.uk/objects/uuid:9f9961f0-6906-4147-a8a9-ca9f2d0e4a12 .Raanes PN, Carrassi A, Bertino L. 2015. Extending thesquare root method to account for model noise inthe ensemble Kalman filter. Monthly Weather Review (10): 3857–3873.Roberts GO, Rosenthal JS. 2001. Infinite hierarchies andprior distributions.
Bernoulli : 453–471.Roth M, Ardeshiri T, Özkan E, Gustafsson F. 2017.Robust Bayesian Filtering and Smoothing UsingStudent’s t Distribution.
ArXiv e-prints .Sacher W, Bartello P. 2008. Sampling errors in ensembleKalman filtering. Part I: Theory.
Monthly WeatherReview (8): 3035–3049.Sakov P, Bertino L. 2011. Relation between two commonlocalisation methods for the EnKF.
Computational Geosciences (2): 225–237.
Sakov P, Oke PR. 2008. Implications of the form of the ensemble transformation in the ensemble square root filters. Monthly Weather Review (3): 1042–1053.
Sarkka S, Hartikainen J. 2013. Non-linear noise adaptive Kalman filtering via variational Bayes. In:
MachineLearning for Signal Processing (MLSP), 2013 IEEEInternational Workshop on . IEEE, pp. 1–6.Sarkka S, Nummenmaa A. 2009. Recursive noiseadaptive Kalman filtering by variational Bayesianapproximations.
IEEE Transactions on AutomaticControl (3): 596–600.Snyder C. 2012. Introduction to the Kalman filter , ch. 3.Advanced Data Assimilation for Geosciences: LectureNotes of the Les Houches School of Physics: SpecialIssue, Oxford University Press, pp. 75–120.Sommer M, Janjić T. 2017. A flexible additive inflationscheme for treating model error in ensemble Kalmanfilters.
Quarterly Journal of the Royal MeteorologicalSociety .Stordal AS, Elsheikh AH. 2015. Iterative ensemblesmoothers in the annealed importance samplingframework.
Advances in Water Resources : 231–239.Storvik G. 2002. Particle filters for state-space modelswith the presence of unknown static parameters. IEEETransactions on signal Processing (2): 281–289.Stroud JR, Bengtsson T. 2007. Sequential state andvariance estimation within the ensemble Kalman filter. Monthly Weather Review (9): 3194–3208.Stroud JR, Katzfuss M, Wikle CK. 2018. A Bayesianadaptive ensemble Kalman filter for sequential state andparameter estimation.
Monthly Weather Review (1):373–386.Takemura A. 1984. An orthogonally invariant minimaxestimator of the covariance matrix of a multivariatenormal population.
Tsukuba journal of mathematics (2): 367–376.Tsyrulnikov M, Rakitko A. 2015. Hierarchical Bayesensemble Kalman filtering. Physica D: NonlinearPhenomena .Ueno G, Higuchi T, Kagimoto T, Hirose N. 2010.Maximum likelihood estimation of error covariances inensemble-based filters and its application to a coupledatmosphere–ocean model.
Quarterly Journal of theRoyal Meteorological Society (650): 1316–1343.Ueno G, Nakamura N. 2016. Bayesian estimation of theobservation-error covariance matrix in ensemble-basedfilters.
Quarterly Journal of the Royal MeteorologicalSociety (698): 2055–2080.van der Vaart HR. 1961. On certain characteristics ofthe distribution of the latent roots of a symmetricrandom matrix under general conditions.
The Annalsof Mathematical Statistics : 864–873. van Leeuwen PJ. 1999. Comment on “Data assimilationusing an ensemble Kalman filter technique”.
MonthlyWeather Review (6): 1374–1377.Wang X, Bishop CH. 2003. A comparison of breedingand ensemble transform Kalman filter ensemble forecastschemes.
Journal of the Atmospheric Sciences (9):1140–1158.Whitaker JS, Hamill TM. 2012. Evaluating methods toaccount for system errors in ensemble data assimilation. Monthly Weather Review (9): 3078–3089.Wilks DS. 2005. Effects of stochastic parametrizations inthe Lorenz’96 system.
Quarterly Journal of the RoyalMeteorological Society (606): 389–407.Wilks DS. 2011.
Statistical methods in the atmosphericsciences , vol. 100. Academic Press.Wu L, Bocquet M, Chevallier F, Lauvaux T, DavisK. 2013. Hyperparameter estimation for uncertaintyquantification in mesoscale carbon dioxide inversions.
Tellus B: Chemical and Physical Meteorology (1):20 894, doi:10.3402/tellusb.v65i0.20894.Ying Y, Zhang F. 2015. An adaptive covariance relaxationmethod for ensemble data assimilation. QuarterlyJournal of the Royal Meteorological Society (692):2898–2906.Zheng X. 2009. An adaptive estimation of forecasterror covariance parameters for Kalman filtering dataassimilation.
Advances in Atmospheric Sciences.