Adaptive covariance inflation in the ensemble Kalman filter by Gaussian scale mixtures
Patrick N. Raanes*, Marc Bocquet, and Alberto Carrassi

Nansen Environmental and Remote Sensing Center, Thormøhlensgate 47, Bergen, N-5006, Norway
CEREA, Joint laboratory of École des Ponts ParisTech and EDF R&D, Université Paris-Est, Champs-sur-Marne, France

January 18, 2019
Abstract
This paper studies multiplicative inflation: the complementary scaling of the state covariance in the ensemble Kalman filter (EnKF). Firstly, error sources in the EnKF are catalogued and discussed in relation to inflation; nonlinearity is given particular attention as a source of sampling error. In response, the “finite-size” refinement known as the EnKF-N is re-derived via a Gaussian scale mixture, again demonstrating how it yields adaptive inflation. Existing methods for adaptive inflation estimation are reviewed, and several insights are gained from a comparative analysis. One such adaptive inflation method is selected to complement the EnKF-N to make a hybrid that is suitable for contexts where model error is present and imperfectly parameterized. Benchmarks are obtained from experiments with the two-scale Lorenz model and its slow-scale truncation. The proposed hybrid EnKF-N method of adaptive inflation is found to yield systematic accuracy improvements in comparison with the existing methods, albeit to a moderate degree.

Consider the problem of estimating the state x_k ∈ R^M given the observation y_k ∈ R^P, as generated by:

    x_k = M(x_{k−1}) + ξ_k,   ξ_k ∼ N(0, Q_k),   (1a)
    y_k = H x_k + υ_k,        υ_k ∼ N(0, R_k),   (1b)

for sequentially increasing time index k, where the Gaussian noise processes, ξ_k and υ_k, are independent in time and from each other. More specifically, the Bayesian filtering problem consists of computing and representing p(x_k | y_{1:k}), namely the probability density function (pdf) of the current state, x_k, given the current and past observations, y_{1:k} = {y_l}_{l=1}^k. In data assimilation (DA) for the geosciences, the state size, M, and possibly the observation size, P, may be large, and the dynamical operator, M, may be nonlinear (observation operators that are nonlinear are implicitly included by state augmentation [Evensen, 2003]). These difficulties necessitate approximate solution methods such as the ensemble Kalman filter (EnKF), which is simple and efficient [Evensen, 2009b].

The EnKF computes an ensemble of N realizations, or “members”, to represent p(x_k | y_{1:k}) as a (supposed) sample thereof. It consists of a forecast-analysis “cycle” for each sequential time window of the DA problem. The forecast step simulates the dynamical forecast (1a) for each individual member. This paper is focused on the analysis step. Since the analysis only concerns a fixed time, k, this subscript is henceforth dropped, as is the explicit conditioning on y_{1:k−1}. Thus, the prior at time k is written p(x), and the analysis (posterior) at time k becomes p(x | y) ∝ p(y | x) p(x), per Bayes’ rule.

Denote {x_n}_{n=1}^N the forecasted ensemble representing p(x), and define the prior sample mean and covariance:

    ¯x = (1/N) Σ_{n=1}^N x_n,   (2a)
    ¯B = (1/(N−1)) Σ_{n=1}^N (x_n − ¯x)(x_n − ¯x)^T.   (2b)

The EnKF analysis update can be derived by assuming that ¯x and ¯B exactly equal the true moments of p(x), labelled b and B, and carefully dealing with rank issues [§6.2 of Raanes, 2016]. The posterior then arises as in the Kalman filter, described by the analysis moments ¯x^a and ¯P^a, or a (deterministic, “square-root”) ensemble transformation to match these.

Multiplicative inflation is an auxiliary technique to adjust (typically increase) the ensemble spread and thereby covariance, initially studied by Pham et al. [1998]; Anderson and Anderson [1999]; Hamill et al. [2001].
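As a concrete aside, the sample moments (2) translate directly into code. The following is a minimal sketch (not from the paper; plain NumPy, with the ensemble stored as the columns of a matrix, anticipating the matrix E introduced in section 3):

```python
import numpy as np

def sample_moments(E):
    """Prior sample mean (2a) and covariance (2b) of an M-by-N ensemble E."""
    N = E.shape[1]
    x_bar = E.mean(axis=1)                 # eq. (2a)
    X = E - x_bar[:, None]                 # anomalies (deviations from mean)
    B_bar = X @ X.T / (N - 1)              # eq. (2b)
    return x_bar, B_bar

# Example with M = 3 state variables and N = 40 members:
rng = np.random.default_rng(0)
x_bar, B_bar = sample_moments(rng.standard_normal((3, 40)))
```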
Here, the specific variant studied is that of multiplying the prior state covariance matrix, ¯B, by the inflation factor, α > 1:

    ¯B ← α ¯B.   (3)

The need for inflation may arise from intrinsic deficiencies of the EnKF: errors due to non-Gaussianity or the finite size of the ensemble. The technique of localization should be applied as the primary remedy, but inflation is still generally necessary and beneficial [Asch et al., 2016, figure 6.6]. Inflation may also be necessary as a heuristic but pragmatic treatment for extrinsic deficiencies, i.e. model and observational errors, meaning any misspecification of equations (1a) and (1b). Again, however, it is advisable to exploit any prior knowledge of errors (bias, covariance, subspace, etc.) with more advanced treatments before employing multiplicative inflation. Examples include additive noise [Whitaker and Hamill, 2012], relaxation [Kotsuki et al., 2017], and square-root transformations [Raanes et al., 2015; Sommer and Janjić, 2017].

It is difficult to formulate directives for the tuning configurations of the EnKF with any generality. Concerning α, it may be that the accuracy of the EnKF is improved either by well-tuned inflation (α > 1) or deflation (α < 1), whereas an ill-tuned value merely adds to the other errors. Similarly, inflating is typically required in conditions of extrinsic error such as model error [Li et al., 2009].

Further specificity and quantitative guidelines are difficult to deduce. Therefore, the inflation parameter typically requires application-specific, off-line tuning for a fixed value, sometimes at significant expense. As an alternative strategy, adaptive inflation aims to estimate the inflation factor on-line. This also naturally promotes the use of time-varying values.

The EnKF-N [Bocquet et al., 2015, hereafter Boc15] is a refinement of the analysis step of the EnKF that explicitly accounts for sampling error in ¯x and ¯B, meaning their discrepancy from the true moments b and B, which are seen as uncertain, hierarchical “hyperparameters”. The derivation proceeds from the rejection of the assumption that ¯x and ¯B are exact [Bocquet, 2011, hereafter Boc11]. Moreover, when using a non-informative hyperprior for b and B, the EnKF-N has been shown to yield a “dual” form which can be straightforwardly identified as a scheme for adaptive inflation [Bocquet and Sakov, 2012]. Its implementation only requires minor add-ons to the (square-root) EnKF, with negligible computational cost. In the idealistic context where model error is absent or accurately parameterized by the noise process, as detailed by section 2, the EnKF-N nullifies the need for inflation tuning, making it opportune for synthetic experiments. However, (i) wider adoption of the EnKF-N has been limited by some technically challenging aspects of its derivation. Moreover, (ii) the idealism of the above context means that the EnKF-N would still be reliant on ad-hoc inflation tuning in real-world, operational use.

This paper addresses both of the above issues of the EnKF-N. Firstly, by re-deriving it with a focus on inflation, section 3 further elucidates its workings. Then, section 4 reviews and analyses the literature on adaptive inflation estimation. In contrast to the EnKF-N, these adaptive inflation methods have hyperpriors that are time-dependent (as opposed to being “reset” at each analysis time), making them suitable for realistic contexts where model error is present and imperfectly parameterized. Then, section 5 uses one such method to complement the EnKF-N and create a new, hybrid method. Lastly, section 6 presents benchmark experiments of the various adaptive inflation methods. Expressions and properties of the standard parametric pdfs in use in this paper, N, t, χ^{+2}, χ^{−2}, W^{+1}, W^{−1}, can be found in appendix A.

Model-error adaptive inflation is considered from section 4 and onward. By contrast, this section is focused on the effects of sampling error, as well as its causes, especially nonlinearity. Section 3 will show how sampling error is partially remedied by the EnKF-N.

Consider the univariate (scalar) filtering problem where the likelihood p(y | x) = N(0 | x, 2) and dynamical model M_Lin(x) = √2 x repeat identically for each time index, and the initial prior is p(x) = N(x | 0, 2). The posterior is then invariably p(x | y) = N(x | 0, 1), and the forecast restores the prior, N(x | 0, 2). A nonlinear counterpart is obtained by replacing M_Lin with M_NonLin(x), detailed in appendix B.2. This model has been designed to preserve Gaussianity despite being nonlinear: if p(x | y) = N(x | 0, 1) then M_NonLin(x) has the distribution N(0, 2).

Figure 1 shows time series of the EnKF statistics for both models, with N = 40 members drawn randomly from p(x). In the linear case, the resulting sampling errors are quickly attenuated, and the ensemble statistics converge to the exact ones. By contrast, in the nonlinear case, the jitteriness (sampling error) is chronic. This demonstrates that sampling error may arise purely due to nonlinearity, i.e. without actual stochasticity. Furthermore, note that the true distributions are perfectly Gaussian, and therefore the EnKF would compute the exact solution if N were infinite. Thus, even though nonlinearity typically yields non-Gaussianity, this is not always the case. Hence, the issue of sampling error, even if caused by nonlinear models, can be analysed and addressed separately from the issue of non-Gaussianity.

Figure 1: Time series of statistics (¯x^a, ¯P^a, and ¯B, plotted against the DA cycle, i.e. time, index) from the EnKF applied to the univariate DA problem with M_Lin (smooth lines) and M_NonLin (jittery lines).

An instructive scenario (not shown) of the nonlinear experiment is that in which the initial ensemble has a mean of 0 and a variance of 2, exactly. Despite the “perfect” initialization, sampling errors will still be generated, as predicted by appendix B.1. However, this error is not immediately as big as if the ensemble were actually randomly sampled from N(0, 2), in which case ¯B ∼ χ^{+2}(2, N−1), so that E[(¯B − 2)²] = 8/(N−1). Rather, it takes a few iterations of M_NonLin for the ensemble to saturate at a noise level of 8/(N−1).

This subsection is summarized in Table 1, whose rows correspond to paragraphs, as numbered (§).

§1. An important property of the EnKF is that it is a consistent estimator in the linear-Gaussian case [Le Gland et al., 2009; Mandel et al., 2011]: at each time k, the EnKF statistics ¯x and ¯B converge (in probability, as N → ∞) to the true moments, b and B. Clearly, in this context, inflating or deflating will degrade the ensemble estimates.

§2. Stochastic forms of the EnKF employ pseudo-random “observation perturbations” for the analysis update step. Similarly, the forecast step may simulate additive or more advanced stochastic parameterizations of the forecast noise. With N < ∞, this introduces sampling error.
Table 1: Summary of section 2.2 regarding filtering contexts and the consequent need for inflation. Background assumptions are idealistic: M, H, Q, R perfectly known, and p(x) and p(y | x) always Gaussian. The star (*) means “in either case”.

    §  | Ensemble size (N) | Treatment of noises (Q, R) | Model (M) | Should inflate?
    1  | ∞                 | *                          | *         | No
    2  | (M, ∞)            | Stochastic                 | *         | Yes
    3  | (M, ∞)            | *                          | Nonlin.   | Yes
    4  | (M, ∞)            | Deterministic              | Linear    | No
    5  | [2, M]            | *                          | *         | Yes

One cause of the typical need for α > 1 is the bias

    E[tr(¯P^a)] < tr(P^a),   (4)

where the expectation is taken over the prior ensemble (or equivalently the covariance, ¯B), and

    ¯P^a = (¯B⁻¹ + H^T R⁻¹ H)⁻¹,   (5)
    P^a = (B⁻¹ + H^T R⁻¹ H)⁻¹.   (6)

In other words, even though E[¯B] = B, the nonlinearity (concavity) of ¯P^a, as a function of ¯B, causes a bias. A related but distinct bias applies for the Kalman gain matrix, ¯K = ¯B H^T (H ¯B H^T + R)⁻¹. Note, though, that the sampling error originates in the prior; therefore, the prior covariance is the root cause, and targeting (inflating) ¯B, rather than ¯P^a and ¯K, is more principled.

There is a misconception that this bias leads to ensemble “collapse”, meaning that ¯P^a → 0 and ¯B → 0 as k → ∞. But no matter how acute the single-cycle bias is, its accumulation will saturate, because it is counteracted by reductions in ¯K.

The term “inbreeding” is sometimes used to refer to the bias (4). However, inbreeding also encompasses two other issues, namely the introduction of non-Gaussianity and of dependency between ensemble members. These are caused by the cross-member interaction that takes place through the EnKF update [Houtekamer and Mitchell, 1998]. It is not quite clear how these effects will impact the need for inflation in later cycles.

Analytical, quantitative results on the bias (4) have been obtained for the general, multivariate case by Furrer and Bengtsson [2007]; Sacher and Bartello [2008]. However, the degree of the approximation is not entirely clear, the assumption of the ensemble being truly stochastic is unreliable, and the related correctional methods were only moderately successful. An alternative approach is that of §15.3 of Evensen [2009a], where the bias is empirically estimated by using a companion ensemble of white noise. However, as discussed below equation (13), a significant drawback of the inflation methods targeting this bias is that they do not establish a feedback mechanism through the cycling of DA. Moreover, as shown by the theory of the EnKF-N in section 3, even in a single cycle, the observations, y, contain information that can improve estimates of prior hyperparameters “before” utilising y to update the state vector, x, thereby reducing sampling error and biases.

§3. Deterministic, square-root update forms of the EnKF (which may also be formulated for the forecast noise [Raanes et al., 2015]) do not introduce sampling error in the mean and covariance. Yet, with N < ∞, sampling errors will arise due to model nonlinearities. This was illustrated in the experiments of section 2.1, and predicted by appendix B.1. As in §2, sampling error will instigate the need for inflation. Indeed, the bias (4) is slightly visible in the nonlinear experiment of Figure 1, where the covariances, ¯B and ¯P^a, are on average lower (long-run averages: 1.95 and 0.98) than the true values.

Filter “divergence” is the situation where the actual error is far larger than expected from ¯B. It cannot occur in the linear context, except by extrinsic errors [Fitzgerald, 1971]. It may, however, arise in nonlinear, chaotic contexts because, heuristically, (i) smaller covariances are prone to deficient (relative) growth by the forecast, creating an instability that (ii) might not be adequately controlled by the analyses. Further, the deficiency in growth typically depends on the starting deficiency of the covariance, a form of positive feedback that makes the cycle even more “vicious”. The alarming prospect of divergence, especially in light of the bias (4), favours “erring on the side of caution”, i.e. using α > 1.

§4. With a deterministic EnKF and a linear model, no sampling error is generated, and the EnKF-estimated moments converge (in time) to the true ones, provided Q = 0 [Bocquet et al., 2017]. Hence, convergence (in time k) holds for any N ≥ M, with a rate independent of N. Interestingly, a similar analysis reveals that the choice of normalization factor for the covariance estimator (usually N−1, or N) does not impact the asymptotic EnKF-estimated moments (in the linear context): they always converge to the true moments as k → ∞. This means that the success of the EnKF does not so much rely on some statistical, single-cycle optimality or unbiasedness (in ¯B, ¯P^a, or ¯K), but rather on the above insensitivity to the choice of normalization factor.

§5. Decreasing the ensemble size, N, increases the sampling error, the bias (4), and the need for inflation. Worse, if N ≤ M, then the ensemble is said to be rank-deficient; this is a separate issue from sampling error, with the grave consequence that the truth, x, will not lie entirely within the ensemble subspace (cf. section 3.5). By operating marginally, “localization” [Anderson, 2003; Sakov and Bertino, 2011] can mitigate the rank deficiency. Localization also diminishes off-diagonal sampling errors (“spurious correlations”), thus decreasing the need for inflation. On the other hand, by eliminating prior correlations, localization effects an overly uncertain prior, yielding too strong a reduction of the ensemble spread.¹

Another consequence of rank deficiency is the possibility of the Bayesian uncertainty (i.e. potential error) outside of the ensemble subspace “mixing in”, and adding to, the ensemble subspace uncertainty. If Q = 0 and the context is linear, this interaction is small and transitory. It then does not seem beneficial to (inflate in order to) have the ensemble spread match the total (as opposed to the subspace) uncertainty. By contrast, if Q > 0 [Grudzien et al., 2018], or in the nonlinear context [Palatella and Trevisan, 2015], the interaction will occur, favouring the use of α > 1. In their section 4, Boc15 showed that (scalar/homogeneous) inflation is well-suited to combat this type of error; this applies for both multiplicative and additive treatments.

Assuming Q = 0, the long-run (k → ∞) rank of the true state covariance, B, is the number of non-decaying modes (non-negative Lyapunov exponents) of the dynamics, i.e. the rank of the “unstable subspace”, 0 ≤ n < M. This correspondence also holds approximately in the nonlinear context, and means that the rank deficiency of the ensemble may be much less severe than M − N + 1 [Bocquet and Carrassi, 2017]. If this is the case, a duplicate of Table 1 applies, with M replaced by n.

Filter divergence will (almost surely) occur if N ≤ n, if localization is not used. In contrast with §3, inflation is then futile, because the divergence is caused by rank deficiency, regardless of the degree of nonlinearity of the growth. It could be speculated that nonlinearity will sequentially “rotate” the ensemble around in the unstable subspace, and hence effectively encompass it. However, twin experiments with the 40-dimensional Lorenz model, such as the data point N = n = 14 of Figure 6.6 of [Asch et al., 2016], do not give credence to this hypothesis.

¹ Formally, quantify the reduction via |I_P − H¯K| = |R| / |H¯BH^T + R|, the determinant of the reduction in the variance. Localization decreases the magnitude of the off-diagonals of H¯BH^T + R, provided the eigen-structures of the two terms are not too dissimilar. Thus, localization increases the denominator, hence reducing |I_P − H¯K| and the posterior variance.

Re-deriving the dual EnKF-N via a Gaussian scale mixture

This section gives a new derivation of the dual EnKF-N. Subsection 3.1 outlines the main ideas. The details are filled in by the subsequent subsections.

Suppose the Bayesian forecast prior for the “truth” is Gaussian, with mean b and covariance B; formally, p(x | y_{1:k−1}) = N(x | b, B), where the conditioning on past observations has been made explicit again. Furthermore, assume that the sample {x_n}_{n=1}^N is an “ensemble”, meaning that its members are independent and statistically indistinguishable from the truth [Wilks, 2011], having been drawn from the very same distribution. In short,

    x and x_n ∼ N(b, B), iid.   (7)

The assumption (7) is convenient, but may be too idealistic in case of severe inbreeding, non-Gaussianity, and model error. Conversely, it may be too agnostic in case the ensemble is not fully random, as discussed in section 2.1. For convenience, assemble the ensemble into the matrix E = [x_1, . . ., x_n, . . ., x_N].

Even in the linear-Gaussian context, computational constraints induce the use of an ensemble to carry the information on the state, and thus the approximation

    p(x | y_{1:k−1}) ≈ p(x | E),   (8)

meaning the reduction of the information of y_{1:k−1} to that represented by the forecast ensemble, E. Thus, while in principle (with infinite computational resources) the “true moments”, b and B, are known, this is not so when employing the EnKF. Here, all that is known about b and B comes from E.

The appropriate response is to consider all of the possibilities; indeed, since by the above assumptions p(x, b, B | E) = N(x | b, B) p(b, B | E), marginalization yields:

    p(x | E) = ∫_B ∫_{R^M} N(x | b, B) p(b, B | E) db dB,   (9)

where B is the set of M × M (symmetric) positive-definite matrices.²

² B is the Euclidean space R^{M(M+1)/2}, corresponding to the M(M+1)/2 degrees of freedom of a symmetric matrix B, restricted to positive-definite matrices (the conic subset wherein B > 0).
Equation (9) says that the “effective prior”, p(x | E), is a (continuous) mixture: the average of the “candidate priors”, N(x | b, B), as weighted by the “mixing distribution”, p(b, B | E). Since the distribution of the state, x, depends on the abstract parameters b and B that are themselves unknown, these are called hyperparameters and this layered structure is called hierarchical.

The standard EnKF may be recovered from the mixture (9) by assuming that the ensemble size is infinite (N = ∞), in which case the sample mean and covariance, ¯x and ¯B of equation (2), are exact, implying a mixing distribution of Dirac delta functions: δ(b − ¯x) δ(B − ¯B), and hence the effective prior: N(x | ¯x, ¯B).

The EnKF-N does not make this approximation, but instead acknowledges that N is finite (whence the “finite-size” moniker). The mixing distribution is obtained with Gaussian sampling theory and a non-informative hyperprior, p(b, B). For now, N > M is assumed, in which case ¯B⁻¹ exists [almost surely, per theorem 3.1.4 of Muirhead, 1982].

The connection to inflation comes from noting, as will be proven later, that equation (9) reduces to:

    p(x | E) = ∫_{α>0} N(x | ¯x, α ¯B) p(α | E) dα,   (10)

which is a mixture of candidate Gaussians over a scalar, scale parameter, only. The mixture (10) is illustrated by the orange objects in Figure 2. The candidate (prior) Gaussians are distinguished solely by the scaling, α, of the covariance, ¯B. Only a finite selection of the continuous family of candidate priors is plotted, the selection being representative of the mixing distribution, p(α | E). Interestingly, as detailed later, this yields an effective prior, p(x | E), which is not Gaussian, but rather a (Student’s) t distribution.

The effective posterior, p(x | E, y) ∝ p(y | x) p(x | E), is given by Bayes’ rule, i.e. pointwise multiplication. But the likelihood,

    p(y | x) = N(y | Hx, R),   (11)

per equation (1b), is Gaussian. The posterior is then neither Gaussian nor t, and does not simplify parametrically. This poses a computational challenge in high-dimensional problems, and the question of how the posterior (or an ensemble thereof) is to be computed in practice. Progress can be made by noting that the averaging over the prior moments can be “delayed” until after application of Bayes’ rule, i.e.

    p(x | E, y) ∝ ∫ N(y | Hx, R) N(x | ¯x, α ¯B) p(α | E) dα,   (12)

where the product of the first two factors under the integral is p(x, y | α, E). Thus, the effective posterior can also be seen as the average of the (Gaussian) candidate posteriors, p(x | α, y, E), each of which is given by the Kalman filter formulae for a given α, and computable essentially simultaneously for all α.

The by-product of Bayes’ rule is the “evidence”, p(y | α, E). In this context, it is not a constant, but instead constitutes the likelihood of the mixing parameter, α. To reflect this, the candidate posterior curves in Figure 2 have not been normalized to integrate to 1, but instead to p(y | α, E) · c.

Figure 2: Illustration of the EnKF-N as a scale mixture of Gaussians, as described in section 3.1. (Plotted against x, with pdf values on the ordinate: the prior ensemble, the likelihood, the candidate priors, the candidate posteriors, the effective prior, and the effective posterior.)

The constant c has been inserted and set such that the particular candidate posterior whose mode coincides with that of the effective posterior also shares its height. This makes it visible that no candidate posterior is fully coincident with the effective posterior. Nevertheless, it seems a reasonable approximation. But this candidate posterior corresponds to a candidate prior, which merely amounts to choosing a particular prior inflation, α⋆. The approximation can thus be written:

    p(x | E, y) ≈ p(x | α⋆, E, y),   (13)

meaning that the integral over the hyperparameter, α, for the effective posterior (12), is replaced by using a particular value, α⋆, which is chosen after taking into account y. This approximation is a form of “empirical Bayes”, known as such because the effective prior is approximated in a way that depends on the observations, y. This may appear to over-use the observations, y, but it is merely an artefact of the approximation. Indeed, decomposing the integrand in equation (12) as p(x | α, y, E) p(y | α, E) makes it apparent that α depends on y.

A posterior ensemble corresponding to the approximate posterior (13) may be computed using standard EnKF formulae, except with ¯B replaced by the selected value, α⋆ ¯B. Provided that the choice among the approximating Gaussian posteriors is judicious, it stands to reason that the resulting ensemble yields an improved analysis compared to that of the standard EnKF. After all, the standard EnKF chooses its covariance estimate (B = ¯B) before taking into account y. By contrast, the EnKF-N lets y inform this choice (B = α⋆ ¯B). For the same reason, even though the EnKF-N does not target any particular unbiasedness, improvement could be achieved compared to the methods targeting “single-cycle unbiasedness”, described below equation (4).

However, the main asset of the EnKF-N is that its secondary dependence on y implicitly establishes a negative feedback loop via the sequential cycling of DA: if the covariance estimate was too small at time k, this will likely be detected and adjusted for at k+1. Moreover, this feedback is “theoretically tuned”: parameters that may be tuned exist (cf. section 3.7), but none strictly require it.

As will be shown, the inflation prior is centred on 1, conferring important advantages to the EnKF-N. However, this anchoring to 1 also reflects the main drawback of the EnKF-N: the hyperprior is static, so that no explicit accumulation of past information takes place for the inflation factor, which otherwise could have been used to account for model error. Redressing this is the subject of section 4 and onwards.

This subsection and the next further describe equation (9) for the effective prior, p(x | E). They are largely sourced from textbooks on Gaussian sampling theory and inference, under the heading of “predictive posterior”: the probability of another draw, x, from the same distribution as the sample, E [e.g., §3.2 of Gelman et al., 2004]. The presentation is didactic, giving meaning to intermediate stages. A concise version is provided by Boc11.

The mixing distribution in equation (9) is given by:

    p(b, B | E) ∝ p(E | b, B) p(b, B),   (14)

where p(b, B) is a hyperprior to be specified. Here, as in Boc11, the Jeffreys priors are independently assigned to the hyperparameters:

    p(b, B) = p(b) p(B) ∝ 1 · |B|^{−(M+1)/2}.   (15)

This is a prior designed to be as non-informative (agnostic) as possible.
It may be derived by positing invariance in location and scale [e.g., §12.4 of Jaynes, 2003]. Boc15 also showed the utility of using a highly informative hyperprior, suitable in contexts with little nonlinearity. Examples were also given for encoding information such as climatology or conditional statistics, resulting in a form of localization.

By the Gaussian ensemble assumption (7),

    p(E | b, B) = Π_n N(x_n | b, B) ∝ |B|^{−N/2} exp(−Σ_n ||x_n − b||²_B / 2),   (16)

where ||x||²_B = x^T B⁻¹ x = tr(xx^T B⁻¹). Now, writing x_n − b = (¯x − b) + (x_n − ¯x), it can be shown that

    Σ_n ||x_n − b||²_B = N ||¯x − b||²_B + tr((N−1) ¯B B⁻¹).   (17)

Combining equations (15) to (17) for the mixing distribution (14), the resulting factors may be identified as:

    p(b, B | E) = N(b | ¯x, B/N) · W^{−1}(B | ¯B, N−1),   (18)

where the first factor is p(b | B, E), the second is p(B | E), and W^{−1} is the inverse-Wishart distribution (cf. Table 2 of appendix A).

Writing the integrand of equation (9) as p(x, b, B | E) = p(b | x, B, E) p(x | B, E) p(B | E), the integral over b becomes trivial, leaving just the latter two factors:

    p(x | E) = ∫ p(x | B, E) p(B | E) dB,   (19)

of which p(B | E) was obtained in equation (18). Meanwhile, recalling p(x | b, B) and p(b | B, E) from equations (7) and (18) respectively, it may be shown by completing the square in b that

    p(x | b, B) p(b | B, E) = N(b | (N¯x + x)/(N+1), B/(N+1)) · N(x | ¯x, ε_N B),   (20)

where ε_N = 1 + 1/N. The two factors on the right-hand side follow by identification as p(b | x, B, E) and p(x | B, E), the latter providing the first factor in equation (19). Thus,

    p(x | E) = ∫ N(x | ¯x, ε_N B) W^{−1}(B | ¯B, N−1) dB.   (21)
It should be appreciated that equation (21) would be unchanged if b = ¯x had been assumed from the start, except for the slight adjustment of ε_N and the reduction from N to N−1 in the W^{−1}. By contrast, as shown in the following, the uncertainty in B has significantly more interesting consequences.

This section derives the scale mixture equation (10). While conventional, the assumption “¯B ∝ B” is ill-suited for inflation targeting sampling error, as it yields an inflation prior with an overpowering confidence, to the detriment of the likelihood [Raanes, 2016, §C.4]. This assumption is therefore not made. But then merely defining the inflation parameter becomes challenging. Clearly, it must be some scalar summary statistic on the “ratio” of B versus ¯B; possibilities include using the determinant, trace, or matrix norms. However, the subsequent assignment “B = ˜α ¯B” would represent an artificial approximation. By contrast, the following definition and developments make no approximations. Consider a fixed x, and define the (squared) inflation:

    ˜α = ||x − ¯x||²_{¯B} / ||x − ¯x||²_B.   (22)

Now, given the ensemble, E, the sample moments ¯x and ¯B are known (fixed), while p(B | E) = W^{−1}(B | ¯B, N−1). Hence B⁻¹ ∼ W^{+1}(¯B⁻¹, N−1), so that (cf. Property 6) 1/˜α ∼ χ^{+2}(1, N−1), i.e.

    p(˜α | E) = χ^{−2}(˜α | 1, N−1),   (23)

meaning that ˜α is inverse-chi-square (cf. Table 2), with location parameter 1 and certainty N−1.

The distribution p(˜α | E) could also have been derived by marginalizing p(B | E) over C ∈ C_˜α, where C denotes (any parameterization of) the degrees of freedom in B not fixed by ˜α, i.e. C_˜α = {C ∈ R^{M(M+1)/2−1} ; B(˜α, C) ∈ B, ||x − ¯x||²_B = ||x − ¯x||²_{˜α¯B}}. Formally,

    ∫_{C_˜α} p(B | E) J dC = p(˜α | E),   (24)

with J denoting the Jacobian determinant of (˜α, C) ↦ B. Inserting the pdfs from equations (18) and (23):

    ∫_{C_˜α} W^{−1}(B | ¯B, N−1) J dC = χ^{−2}(˜α | 1, N−1).   (25)

Now, the covariance mixture (21) can be rearranged as:

    p(x | E) ∝ ∫_B exp(−||x − ¯x||²_{ε_N B} / 2) W^{−1}(B | (N−1)¯B/N, N) dB.

The same change of variables then yields:

    p(x | E) ∝ ∫_{˜α>0} exp(−||x − ¯x||²_{ε_N ˜α ¯B} / 2) ∫_{C_˜α} W^{−1}(B | (N−1)¯B/N, N) J dC d˜α.   (26)

The inner integral can be substituted by comparing it to equation (25), yielding:

    p(x | E) ∝ ∫ exp(−||x − ¯x||²_{ε_N ˜α ¯B} / 2) χ^{−2}(˜α | (N−1)/N, N) d˜α
             ∝ ∫ N(||x − ¯x||_{¯B} | 0, ε_N ˜α) χ^{−2}(˜α | 1, N−1) d˜α.   (27)
In conclusion, the covariance mixture of equation (21) reduces to a scale mixture. An alternative, direct proof, using tricks from complex analysis instead of Property 6, was given in a preprint version of this paper.

Note that the scale mixture (27) has been written using the notational trick where N acts as a univariate function. Also, since ˜α is defined via x, the integrand of equation (27) cannot be read as “p(x | E, ˜α) p(˜α | E)”. By contrast, the mixture (10) is obtained by undoing the trick, and defining α = ε_N ˜α.

Let 𝟙 be the vector of ones of length N, and I_N the N × N identity matrix. Then the sample moments, given in equation (2), may be conveniently expressed as:

    ¯x = E𝟙/N,   ¯B = (1/(N−1)) XX^T,   (28)

where X = [x_1 − ¯x, . . ., x_n − ¯x, . . ., x_N − ¯x] = EΠ⊥_𝟙 is the ensemble “anomalies”, with Π⊥_𝟙 = (I_N − 𝟙𝟙^T/N) the orthogonal projector onto range(𝟙)⊥, the orthogonal complement space to range(𝟙).

So far it has been assumed that N > M so that ¯B is invertible (almost surely) and that the ensemble spans the entire state space. This is unrealistic for geoscientific DA, where N rarely exceeds 100, while M may exceed 10⁶. More reasonably, it is henceforth assumed that the support of the forecast pdf is confined to the ensemble subspace, i.e. the affine space {x ∈ R^M : [x − ¯x] ∈ range(X)}. This assumption is actually conventional, as it is implied by the standard EnKF’s assumption that b = ¯x and B = ¯B along with Gaussianity. The assumption means that the ensemble has sufficient rank. Thus, one may expect tolerable accuracy of the filter, even without localization [Bocquet and Carrassi, 2017]. It is preferable to work with variables that embody the restriction of the assumption [Hunt et al., 2007]; therefore, with w ∈ R^N, the following change of variables is done:

    x(w) = ¯x + Xw.   (29)

In terms of the new variable, the likelihood (11) may be succinctly written as:

    p(y | w) = N(¯δ | Yw, R),   (30)

with ¯δ = y − HE𝟙/N the average innovation, and Y = HEΠ⊥_𝟙 = HX the corresponding observation anomalies.

For the effective prior (27), note that the ensemble members expressed in the coordinate system of w are merely the coordinate vectors (x_n = ¯x + Xe_n, with e_n being the n-th column of I_N). Hence, in this coordinate system, the sample mean is 𝟙/N, replaceable by zero since X𝟙 = 0, and the sample covariance matrix is (N−1)⁻¹ I_N. Substituting these for ¯x and ¯B in equation (27) is a shortcut to obtain the effective prior for w; with α = ε_N ˜α,

    p(w | E) ∝ ∫_{α>0} α^{−g/2} N(||w||_{(N−1)⁻¹ I_N} | 0, α) χ^{−2}(α | ε_N, N−1) dα,   (31)
where the presence of α^{−g/2} is explained in the following.

Denote g the dimensionality of the nullspace of X. Due to Π⊥_𝟙 it holds that g = max(1, N − M), almost surely. Thus, typically g = 1, and the parameterization in w has one direction of redundancy, warranting careful attention. The issue is analogous to expressing 1 random variable as the sum of 2, or indeed expressing N − g random variables as a linear combination of N. The principle is that regardless of how the probability space is augmented with the redundant degrees of freedom, once these are marginalized out, one should be left with the original distribution. Boc15 showed that the adjustment by α^{−g/2} in equation (31) is then required.

Denote f the integrand of the scale mixture (31). Expanding the parametric pdfs yields:

    f(α, z) = α^{−(N+g)/2 − 1} e^{−z/(2α)},   (32)

where all of the dependency in w is contained in

    z(w) = (N−1)(ε_N + ||w||²).   (33)

Defining F(z(w)) ∝ p(w | E) for the effective prior, equation (31) may be restated as:

    F(z) = ∫ f(α, z) dα.   (34)

It can be seen that the change of variables u = α/z factors z out of the integral, yielding

    F(z) = z^{−(N+g)/2} F(1),   (35)

or, reverting to the full w notation,

    p(w | E) ∝ (ε_N + ||w||²)^{−(N+g)/2},   (36)

which is a t distribution (cf. appendix A), also called a Cauchy distribution when g = 1. The t distribution is elliptical, like the Gaussian [Muirhead, 1982, §1.5], but has thick tails, making it suited for robust inference [Geweke, 1993; Fernandez and Steel, 1999; Roth et al., 2017].

Unlike Boc11, here the t distribution form (36) of the effective prior will not be used directly. Instead, the effective prior (F) will again be expressed as a Gaussian (G) with inflation. To that end, note that, for general functions F and G with image[F] ⊆ image[G], there will always exist a function ζ(z) such that F(z) = G(ζ(z), z). To find a suitable G, consider applying the mean-value theorem to the integral (34) for a fixed z, denoting ζ⁻¹ the particular point for α. This will not work because the integrational interval, [0, ∞), is of infinite length. In place of the length, therefore, substitute c ζ⁻¹ to form: G(ζ, z) = c ζ⁻¹ f(ζ⁻¹, z), where the constant c ensures that the height (and hence image) of G(·, z) is sufficient. Inserting f from equation (32) yields

    G(ζ, z) = c ζ^{(N+g)/2} e^{−zζ/2}.   (37)

This G works well; indeed, equating equation (37) to (35) immediately yields the associated function ζ(z) = c/z.

In summary, the effective prior may be expressed by G (37) which, similarly to the integrand, f, is Gaussian in w. For later optimization purposes, c = N + g is set, and the logarithm is taken:

    p(w | E) = c exp(−L/2),   (38a)
    L(ζ, w) = (ε_N + ||w||²) ζ − (N + g) log ζ,   (38b)
    ζ(w) = (N + g)/(ε_N + ||w||²).   (38c)

The value of c was chosen so as to yield the property that ∂L/∂ζ(ζ(w), w) = 0 for any w, as can be directly verified. Conversely, this means that ζ may be treated as a free variable to be optimized for, because equation (38c) is satisfied wherever ∂L/∂ζ = 0. This tactic becomes useful in the following section.

The above form of the effective prior may be derived (as in a preprint version of this paper) as a “saddlepoint approximation”. Here, however, there is no approximation. Its exactitude is a remarkable feature known to arise in a few cases [Azevedo-Filho and Shachter, 1994; Goutis and Casella, 1999].
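As a quick numerical check of this exactness (not from the paper; a small NumPy/SciPy script with arbitrarily chosen N and g), one can verify by quadrature that the integral (34) indeed obeys the power law (35):

```python
import numpy as np
from scipy.integrate import quad

N, g = 20, 1      # illustrative ensemble size and nullspace dimension

def f(alpha, z):
    """Integrand of equation (32)."""
    return alpha ** (-(N + g) / 2 - 1) * np.exp(-z / (2 * alpha))

def F(z):
    """Effective prior (34), up to a constant, by quadrature."""
    return quad(f, 0, np.inf, args=(z,))[0]

for z in [0.5, 1.0, 2.0, 5.0]:
    print(F(z) / F(1.0), z ** (-(N + g) / 2))   # the two columns agree
```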
Define J = −2 log p(w | E, y), plus a constant, where the posterior is given by Bayes’ rule with the likelihood (30) and the effective prior (38). The log posterior reads:

    J(ζ, w) = L(ζ, w) − 2 log N(¯δ | Yw, R) + c   (39a)
            = ε_N ζ − (N + g) log ζ + ζ ||w||² + ||¯δ − Yw||²_R.   (39b)

Completing the square in w yields:

    J(ζ, w) = ||w − ¯w^a(ζ)||²_{¯P^a_w(ζ)} + D(ζ),   (40)

where the quadratic form is specified by the usual EnKF subspace analysis formulae:

    ¯P^a_w(ζ) = (ζ I_N + Y^T R⁻¹ Y)⁻¹,   (41a)
    ¯w^a(ζ) = ¯P^a_w(ζ) Y^T R⁻¹ ¯δ,   (41b)

and D should be recognized as the “dual”, as in Bocquet and Sakov [2012]:

    D(ζ) = ε_N ζ − (N + g) log ζ + ||¯δ||²_{R + YY^T/ζ}.   (42)

Now, ζ depends on w, and so the maximization of p(w | E, y) is not as obvious as equation (40) suggests. Fortunately, to find a critical point, it suffices to satisfy

    ∂J/∂w = 0,   (43a)
    ∂J/∂ζ = 0.   (43b)

This is because the criteria (43a) and (43b) imply dJ/dw = ∂J/∂w + (∂J/∂ζ)(dζ/dw) = 0, where ζ(w) is given by equation (38c), which is enforced since ∂L/∂ζ = 0, as follows from equations (39a) and (43b).

Now, the first criterion (43a) is trivially satisfied by setting w = ¯w^a(ζ) for a given ζ, as seen from equation (40). But it can also be seen that J = D along the constraint w = ¯w^a(ζ), and so

    dD/dζ = dJ/dζ = ∂J/∂ζ + (∂J/∂w)(dw/dζ) = ∂J/∂ζ.   (44)

Hence, finding ζ such that dD/dζ = 0 will satisfy the second criterion (43b).

In conclusion, w = ¯w^a(ζ⋆) is a critical point of the effective posterior if and only if ζ⋆ is a local minimizer of D(ζ). Since both of the terms D(ζ) and ||w − ¯w^a||²_{¯P^a_w} of equation (40) are here individually minimized, this critical point must be a minimum, as was originally shown using Lagrangian duality theory by Bocquet and Sakov [2012]. Hence the N-dimensional, non-Gaussian mode-finding problem for p(w | E, y) may be exchanged for the scalar optimization problem in ζ.

The optimization of D(ζ) requires iterating, but each evaluation of (42) and its derivative is computationally negligible, given that the singular value decomposition (SVD),

    U diag(¯σ_1, . . ., ¯σ_P) V^T = [(N−1) R]^{−1/2} Y,   (45)

has been obtained beforehand, as is typical to compute equation (41). Multiple minima are a rarity; in such cases ζ⋆ will depend on the optimizer and initial guess, here Newton’s method and ζ_0 = N − 1.

For the posterior ensemble, a Gaussian approximation to the effective posterior is chosen. In addition to its simplicity, twin experiments [Boc11; Boc12; Boc15] have provided solid support to using that of ζ⋆:

    p(w | E, y) ≈ N(w | ¯w^a(ζ⋆), ¯P^a_w(ζ⋆)).   (46)

Notably, this approximation matches the mode and Hessian of the exact p(w | E, y). The corresponding ensemble is constructed as:

    E^a = [¯x + X ¯w^a(ζ⋆)] 𝟙^T + √(N−1) X T,   (47)

where T = (¯P^a_w(ζ⋆))^{1/2} is readily computed using the same SVD (45) as before, and may be appended by a mean-preserving orthogonal matrix [Sakov and Oke, 2008]. Note that this is “just” the symmetric square-root EnKF, i.e. the ensemble transform Kalman filter (ETKF) of Bishop et al. [2001]; Hunt et al. [2004], except for a prior (squared) inflation of

    α⋆ = (N−1)/ζ⋆.   (48)

As discussed by Boc15, the choice of candidate posterior is not unassailable, and certain modifications of the dual function can be argued on the grounds of modifying this choice.
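To make the dual concrete, here is a minimal numerical sketch (not the paper’s implementation; toy dimensions, and a bounded scalar minimizer in place of the Newton iterations described above) of minimizing D(ζ) of equation (42) and forming the inflation (48):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
N, P, g = 10, 4, 1                      # illustrative sizes
eps_N = 1 + 1 / N
Y = rng.standard_normal((P, N))
Y -= Y.mean(axis=1, keepdims=True)      # observation anomalies, Y = HX
R = np.eye(P)                           # observation error covariance
dbar = rng.standard_normal(P)           # mean innovation

def D(zeta):
    """The dual cost function, equation (42)."""
    C = R + Y @ Y.T / zeta
    return (eps_N * zeta - (N + g) * np.log(zeta)
            + dbar @ np.linalg.solve(C, dbar))

zeta_star = minimize_scalar(D, bounds=(1e-6, 1e6), method="bounded").x
alpha_star = (N - 1) / zeta_star        # prior (squared) inflation, eq. (48)
```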
Indeed, if the influence of thelikelihood, quantified by¯ σ = tr( H ¯BH T R − ) /P , (49)9nd computed using the SVD (45), is small: ¯ σ → χ − to ahigher value than N −
1, yielding a Dirac-delta in thelimit. Finally, since ζ is not actually constant in w , theHessian, ( ¯P a w ( ζ ⋆ )) − , should be corrected by subtracting e ¯ w a ¯ w a T , where e = ζ ⋆ N + g , [Boc15]. Since this is buta rank-one update, a corrected transformation can becomputed without significant expense as T ( I N + γ vv T ),where v = T T ¯ w a , γ = e/ ( τ / + τ ), and τ = 1 − e v T v ,as may be deduced from equation (B2) of Bocquet [2016].However, this is typically a very minor correction, whilethe Gaussian posterior remains but an approximation.None of these adjustments were deemed necessary to usein the numerical experiments of this paper. The filtering problem as formulated with equations (1a)and (1b) uses additive noise processes ξ k and υ k torepresent model and observation errors. “Primary” filters[Dee et al. , 1985] assume the noise/error covariances, Q and R , are perfectly known. In reality, these systemstatistics are rarely well known; this deteriorates the stateestimates and can cause filter divergence. Self-diagnosingand correcting filters which estimate and modify thegiven noise statistics are called “adaptive”. AdaptiveKalman filter literature include Mehra [1972]; Mohamedand Schwarz [1999]. Adaptive particle filter literatureinclude Storvik [2002]; Özkan et al. [2013].This section is concerned with adaptive filtering forDA, focusing on the EnKF and the on-line estimation ofthe model error, i.e. Q . Among the literature using full[Miyoshi et al. , 2013; Nakabayashi and Ueno, 2017; Pulido et al. , 2018] and/or diagonal [Ueno and Nakamura, 2016;Dreano et al. , 2017] covariance parameterizations, somesuccess has been noted. However, the scope of this paperis restricted to the estimation of a single multiplicativeinflation factor, β , for ¯B . This assigns a structure to thecovariance and reduces the dimensionality and complexityof the problem, thus regularizing it. The tradeoff is a bias,but this drawback is largely offset by spatialization.Since this section presents adaptive inflation for modelerror, the inflation factor is here labelled β . This contrastsit to α of the EnKF- N , which targets sampling error.Formally, the unknown, β , is defined by the modellingassumption that the ensemble is now drawn with a “On-line” means that the estimation is included in the DAcycling loop, so as to be updated each time new data y is received.Some off-line approaches not further reviewed include Dee andDa Silva [1999]; Ueno et al. [2010]; Mitchell and Carrassi [2015]. The methods may also possess some skill in dealing with errorsdue to non-Gaussianity, biases, and even misspecification of R . covariance that is β times too small, x n ∼ N ( b , B /β ) , (50)as compared to the distribution (7) of the truth.Moreover, in order to neglect sampling errors, theensemble is assumed infinite ( N = ∞ ). Again, thisis commonly tacitly assumed in the adaptive EnKFliterature. The reason, as discussed below equation (9), isthat it yields a prior that is (the limit of the t distributionwhich is) Gaussian. In summary, p ( x | β, E ) = N (cid:0) x (cid:12)(cid:12) ¯ x , β ¯B (cid:1) . (51)The N = ∞ assumption is rolled back in the biasstudy of appendix D.2. More pragmatically, section 5makes a hybrid of the EnKF- N inflation with a method ofadaptive inflation for model error. 
The following review and analyses serve to choose a method with which to make this hybrid. The review is split between methods that may be termed “marginal”, working with β separately from x, and “joint”, working with (β, x) simultaneously. The joint approach, including “variational” and “hierarchical” methods, is theoretically appealing. However, on closer inspection, including numerical testing, it was found to be less advantageous. Therefore, the marginal methods take centre stage, while the review of the joint methods has been placed in appendix C.

Recall the average innovation, ¯δ = y − H¯x. For brevity, the explicit conditioning on the ensemble, E, is henceforth dropped. Writing ¯δ = (y − Hx) + H(x − b) + H(b − ¯x), it can be seen that

    E[¯δ ¯δ^T | B] = R + ε_N H B H^T.   (52)

A departure from non-ensemble works [e.g., Daley, 1992] is the adjustment ε_N = 1 + 1/N, resulting from (necessarily) using the prior ensemble mean, ¯x, rather than the true prior mean, b, to define the innovation, ¯δ. However, since N = ∞ is here assumed, ε_N = 1.

Substitute the expectation E[¯δ ¯δ^T | B] by the observed value ¯δ ¯δ^T in equation (52), and B by β ¯B, per equation (51). Then,

    ¯δ ¯δ^T ≈ β H ¯B H^T + R =: ¯C(β),   (53)

which suggests matching (some univariate summary of) ¯δ ¯δ^T by adjusting the inflation, β. For example, Li et al. [2009] use:

    ˆβ_I = (||¯δ||² − tr(R)) / tr(H ¯B H^T),   (54)

which follows directly from the trace of equation (53). Alternatively, the estimator of Wang and Bishop [2003] can be obtained by transforming equation (53) by R⁻¹ before taking the trace, thus allowing for heterogeneous observations and different units. This yields:

    ˆβ_R = (||¯δ||²_R / P − 1) / ¯σ²,   (55)

where tr(¯δ ¯δ^T R⁻¹) = ||¯δ||²_R has been used and ¯σ² is defined in equation (49).

Miyoshi [2011] proposes a variant using the Schur product ◦ R⁻¹. However, the regular algebra behind ˆβ_R is preferable, as it decorrelates the diagonals of ¯δ ¯δ^T and hence diminishes the variance of their trace. More variants are studied in appendix D.1. However, appendix D.2 shows them to have more bias than ˆβ_R because of further exposure to the uncertainty in ¯B, which is present when the N = ∞ assumption is not made.

As an alternative to the trace-based estimators, consider some results of likelihood maximization. Recalling equations (11) and (51), it can be shown that the likelihood for β is:

    p(y | β) = N(y | H¯x, C) = N(¯δ | 0, C),   (56)

where C = ¯C(β) of equation (53).
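As an aside before continuing with the likelihood view, the trace-based estimators translate into a few lines of code. This is a minimal sketch (not from the paper; the inputs are synthetic and illustrative), which also shows how noisy a single-cycle estimate is:

```python
import numpy as np

rng = np.random.default_rng(42)
P = 10                                       # observation size
beta_true = 2.0                              # covariance deficiency factor
HBH = np.eye(P)                              # H Bbar H^T (illustrative)
R = 0.5 * np.eye(P)
C = beta_true * HBH + R                      # innovation covariance (53)
dbar = np.linalg.cholesky(C) @ rng.standard_normal(P)

beta_I = (dbar @ dbar - np.trace(R)) / np.trace(HBH)           # eq. (54)
sigma2 = np.trace(HBH @ np.linalg.inv(R)) / P                  # eq. (49)
beta_R = (dbar @ np.linalg.solve(R, dbar) / P - 1) / sigma2    # eq. (55)
print(beta_I, beta_R)                        # both scatter around beta_true
```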
By this Gaussianity of equation (56), ¯δ ¯δ^T can be seen as the maximum of the likelihood for C, but only in the case of univariate observations (P = 1). In the multivariate case, the maximization is only defined restricted to the direction of ¯δ ¯δ^T.

Less artificially, the maximization can be rendered well-posed by restricting C to the scaling: C(θ) = θ C_0 for some C_0, in which case the maximum likelihood (ML) estimate of θ is ||¯δ||²_{C_0}/P, as noted by Dee [1995]. Unfortunately, with the parameterization C(β) = ¯C(β) of equation (56), the ML estimate, ˆβ_ML, is not analytically available, and will require iterations [e.g., Mitchell and Houtekamer, 2000; Zheng, 2009; Liang et al., 2012].

Also note that equation (53) may be derived in the variational framework, where it is seen as a consistency criterion on the cost function [Desroziers and Ivanov, 2001; Chapnik et al., 2006; Ménard, 2016]. These references are also known for deriving further diagnostics, employing distances labelled “observation-analysis” and “analysis-background”, which enable the simultaneous estimation of the observation error covariance. Li et al. [2009] make use of this in an ensemble framework, as do Ying and Zhang [2015]; Kotsuki et al. [2017], who make use of the scheme in a relaxation variety.

The accumulation and memorization of past information has gradually become more sophisticated. Wang and Bishop [2003] take the geometric average of ˆβ_R over time, while Mitchell and Houtekamer [2000] use the median of instantaneous maximum likelihood estimates. Lacking the fuller Bayesian setting, neither yields consistency (convergence to the true value) in time, because a temporal average of some point estimate ˆβ_k based on p(y_k | β) is generally not the value to which β | y_{1:K} converges as K → ∞. Nevertheless, this mismatch is not likely to be severe, and the simplicity of the approach is an advantage.

The approach of temporally averaging likelihood estimates has since been replaced by the more rigorous approach of filtering. It should be noted that this “secondary” filter is valid because the innovations are supposedly independent in time [Mehra, 1972]. Li et al. [2009] and Miyoshi [2011] assign a Gaussian prior p(β) = N(β | β^f, V^f), where the mean β^f is a persistence or relaxation of the previous analysis mean, and where the variance V^f is a tuning parameter. The likelihood is also assumed Gaussian: p(y | β) = N(ˆβ | β, ˆV), where ˆβ = ˆβ_I in Li et al. [2009] and ˆβ = ˆβ_R in Miyoshi [2011]. In the former, ˆV is a tuning parameter, while the latter suggests using the variance of ˆβ_R of equation (55):

    ˆV = [ (tr(H B H^T R⁻¹)/P + 1) / (¯σ² √(P/2)) ]² ≈ [ (β^f ¯σ² + 1) / (¯σ² √(P/2)) ]²,   (57)

where the approximation comes from using B ≈ β^f ¯B, which is consistent with the operative assumption that N = ∞. Importantly, equation (57) has the logical consequence that the observations are given no weight when they carry no information, i.e. when ¯σ² → 0 or P → 0. The method is spatialized by associating each local analysis domain with its own inflation parameter. Localization tapering promotes smoothness of the inflation field.

Also working in the framework of a square-root EnKF, Brankart et al. [2010] use diffusive forecasts for the inflation distribution. They do not sequentially approximate the posteriors, instead expressing them explicitly in terms of all of the past innovations, progressively diffused by a “forgetting exponent”, φ < 1:

    p(β | y_{1:K}) ∝ (p(β))^{φ^K} Π_{k=1}^K N(¯δ_k | 0, φ^{k−K} ¯C_k(β)).   (58)

The chosen estimate, which maximizes p(β | y_{1:K}), is found iteratively. The requisite evaluations of the cost function are not prohibitive because the square-root formulation means that diagonalizing matrix decompositions have already been computed.

Anderson [2007], working in the framework of the EAKF, explicitly considers forecasting the (parameters of the) inflation distribution, but only tests persistence forecasts, i.e. p(β_k | β_{k−1}) = δ(β_k − β_{k−1}). The posteriors (and hence the forecast priors) are approximated by Gaussians: for a given time,

    p(β | y_i) ≈ N(β | ˆβ_MAP, V^a),   (59)

which are computed serially for each component y_i of the observation y. The univariate-observation likelihood allows the maximum a posteriori (MAP) estimate ˆβ_MAP to be found exactly and without iterations, via a cubic polynomial; V^a is fitted by requiring that equation (59) be exact for another particular value of β, as normalized by its value at ˆβ_MAP. Anderson [2009] spatializes the method, associating each state variable with its own inflation parameter. Here, correlation coefficients complicate the likelihood, so that ˆβ_MAP must be found via a Taylor expansion of the likelihood and a resulting quadratic polynomial.
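To summarize the secondary-filter approach in code, here is a minimal sketch (not from the paper; the numbers are illustrative assumptions) of the Gaussian inflation filter of Li et al. [2009] and Miyoshi [2011], i.e. a scalar Kalman update of β, with the likelihood variance taken from equation (57):

```python
def inflation_update(beta_f, V_f, beta_hat, V_hat):
    """Combine prior N(beta_f, V_f) with likelihood N(beta_hat | beta, V_hat)."""
    K = V_f / (V_f + V_hat)                  # scalar Kalman gain
    return beta_f + K * (beta_hat - beta_f), (1 - K) * V_f

def V_hat_eq57(beta_f, sigma2, P):
    """Equation (57), using the approximation B ~ beta_f * Bbar."""
    return ((beta_f * sigma2 + 1) / (sigma2 * (P / 2) ** 0.5)) ** 2

beta_a, V_a = inflation_update(beta_f=1.0, V_f=0.05, beta_hat=1.3,
                               V_hat=V_hat_eq57(1.0, sigma2=1.0, P=40))
```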
Anderson [2009] also notes that a Gaussian prior does not constrain the inflation estimates to positive values. Thus, nonsensical negative estimates will occasionally occur for small innovations. Imposing some lower cap (bound), e.g., ˆβ_MAP > 0 or ˆβ_MAP > 1, is done as a quick-fix. As Stroud et al. [2018] note, such capping may cause a bias. However, this bias should be avoided by employing the un-capped values in the averaging.

Instead of ad-hoc mechanisms, using a better tailored prior will pre-empt the problem. Brankart et al. [2010] use an exponential pdf for the initial prior, but acknowledge that it is not appropriate for small values. Indeed, a better choice is the inverse-chi-square distribution, χ^{−2}, also called inverse-Gamma; it is the conjugate prior to the variance parameter of a Gaussian sample, and was shown in section 3 to be intimately linked to inflation. It has been employed in adaptive filtering by e.g., Mehra [1972]; Storvik [2002], and in EnKF contexts by Stroud and Bengtsson [2007] and Gharamti [2018], who used it to enhance the scheme of Anderson [2009]. Similarly, the following section formulates an inflation filter based on ˆβ_R using χ^{−2} distributions.

This subsection improves the formulation of the estimator ˆβ_R; in particular, it abstains from assuming Gaussianity for the distributions for β. Recall the definition (49) of ¯σ² and make the approximation H ¯B H^T R⁻¹ ≈ ¯σ² I_P. This proportionality renders the likelihood symmetric and hence reducible; indeed, the likelihood (56) can now be written:

    p(y | β) ∝∼ N(R^{−1/2} ¯δ | 0, (1 + ¯σ²β) I_P)
             ∝ (1 + ¯σ²β)^{−P/2} e^{−||¯δ||²_R / (2(1 + ¯σ²β))}
             ∝ χ^{+2}(||¯δ||²_R / P | (1 + ¯σ²β), P).   (60)

Remarkably, the value of β that maximizes this approximate likelihood is ˆβ_R of equation (55). The likelihood (60) may be further approximated by fitting the following shape to it:

    p(y | β) ≈ χ^{+2}(ˆβ_R | β, ˆν).   (61)

Irrespective of the certainty parameter ˆν, the mode (in β) is then the same as for equation (60), namely ˆβ_R. Also fitting the curvature at the mode to that of equation (60) yields ˆν = P [¯σ² ˆβ_R / (1 + ¯σ² ˆβ_R)]². Remarkably, this yields a variance (cf. Table 2) equal to ˆV of equation (57), except with ˆβ_R in place of β^f.

The benefit of making the second approximation (61) in addition to the first (60) is the resulting conjugacy with a χ^{−2} prior. Indeed, suppose the forecast prior for the inflation parameter is:

    p(β) = χ^{−2}(β | β^f, ν^f),   (62)

for some (β^f, ν^f). The posterior is then:

    p(β | y) = χ^{−2}(β | β^a, ν^a),   (63)

where

    ν^a = ν^f + ˆν,   (64a)
    β^a = (ν^f β^f + ˆν ˆβ_R) / ν^a.   (64b)

This weighted-average update for the parameters is the same as that of the Kalman filter, except with slightly different meaning to the parameters. As such, it constitutes a natural and original derivation of the inflation filter of Miyoshi [2011].

Rather than approximating the likelihood as χ^{+2}, an improved approximation could be obtained by reverting to the likelihood p(y | β) of equation (60) and directly fitting χ^{−2}(β | β^a, ν^a) to the posterior p(y | β) χ^{−2}(β | β^f, ν^f) by matching their modes and local curvatures. As described for equation (59), fitting posterior criteria is the approach taken by Anderson [2007]. With the χ^{−2} prior, though, an immediate benefit is that of precluding negative and nonsensical values for β^a. Moreover, it can be shown that the posterior mode satisfies a cubic equation.
However,the curvature is more complicated; alternatives includesetting ν a = ν f + ν , where ν = ˆ ν or where ν issuch such that a likelihood χ +2 ( ˆ β R | β, ν ) would yield theaforementioned cubic-root mode.The numerical approach offers another set of options:locating the mode by optimization and computing thecurvature by finite differences. The latter could also beexchanged for the ratio of two points, as in Anderson[2007]. If such avenues are pursued, it is then notnecessary to make the approximation of reducing thelikelihood (56) to (60); indeed, it is feasible to find the12equired statistics with the full likelihood, provided thepre-computed SVD (45). However, this is not necessarilyadvisable, because it yields substantial bias due to theuncertainty in ¯B , as discussed in appendix D.2.An alternative approximation is to fit ( β a , ν a ) viathe mean and variance of the posterior, as computedby quadrature. This requires careful implementation toavoid a drift due to truncation errors, which otherwiseaccumulate exponentially through the DA cycles. It isalso important to judiciously define the extent of the grid.Another option is to abandon the parametric approachaltogether, and instead represent the posterior on a grid[e.g., Stroud et al. , 2018]. In the case of a single, global,inflation parameter, as in this paper, this approach isaffordable for any purposeful precision. Lastly, a Monte-Carlo representation can also be employed [e.g., Frei andKünsch, 2012].For all variants, a choice must be made as to whichpoint value of β to use to inflate the ensemble. Ratherthan using the parameter β a directly, one can use themean or the mode (cf. Table 2). Typically, though, ν a is so large that this does not matter much. Similarly,although ˆ β R has a slight bias (appendix D.2), it errs onthe side of caution. For this reason or other, de-biasing didnot generally yield gains in testing by twin experiments.Therefore, and for simplicity, de-biasing is not furtheremployed. The following is a simple and pragmatic modellingof the forgetting of past information, achieved by“relaxing” towards some background (and initial) prior, χ − ( β | β , p ( β ) = χ − ( β | β f , ν f ) , (65)where ν f = e − ∆ /L ν a , (66a) β f = e − ∆ /L ( β a − β ) + β , (66b)with ∆ as the time between analyses. The time scale, L ≥
The forecast (66a) and (66b) was engineered, not derived from dynamics. While the issue is largely academic, this lack of formality has been perceived as a difficulty [Anderson, 2007, 2009; Sarkka and Hartikainen, 2013; Nakabayashi and Ueno, 2017]. The following construes a few possible resolutions. One possibility is diffusion. This would yield similar results to (66a) and (66b), though how the χ⁻² shape may be maintained is not clear. However, multiplicative, positive noise seems preferable to avoid negative values. Computationally, if the parametric distribution is not preserved under the forecast, then gridded, Monte-Carlo, or kernelized inverse-transform approaches may be employed. Instead of diffusion, Brankart et al. [2010] suggested exponentiating p(β) in the forecast step (cf. equation (58)), also called “annealing” [Stordal and Elsheikh, 2015]. This maintains the χ⁻² shape, but is difficult to motivate physically. Alternatively, the desired effect of “forgetting” is perhaps most naturally modelled with an autoregressive model with limited correlation length. This can be rendered Markovian (to fit with filtering theory) using state augmentation [Durbin and Koopman, 2012, §3.4].

It may also be argued that the search for physical dynamics is misguided. After all, the inflation, β, is already a hyperparameter. So why should it be more natural to forecast β rather than the hyper-hyperparameters, β^{a,f}_k and ν^{a,f}_k, as in equations (66a) and (66b)?

Several of the improvements of section 4.3 were tested with twin experiments. The results (not shown) indicate that most schemes, including the original one of section 4.2, perform surprisingly well (in terms of filter accuracy) provided they have reasonable settings of their tunable parameters. The most plausible explanation is that the inflation filters are consistent estimators: as the DA cycles build up, they all eventually converge towards a near-optimal value of inflation. In view of this parity, it seems logical to opt for the simplest scheme, namely that of equations (63) and (64).

Similarly, instead of equations (66a) and (66b), the numerical experiments use the fixed value ν^f = 10³, corresponding to a variance of approximately 2(β^f)²/10³, according to Table 2. Moreover, instead of fitting the likelihood's certainty, ν̂, it was simply set to 1. Keeping ν^f fixed is suboptimal in the spin-up phase of the twin experiments, but this part of the experiment is not included in the time-averaged statistics. Keeping ν^f fixed also foregoes the interesting possibility of actually having the certainty decrease through an update, something that will occur with the posterior-fitting or non-parametric approaches. Nevertheless, this simplification was made (i) to facilitate reproduction; (ii) because, in the twin experiments of this paper, the models and observational networks are homogeneous, and therefore using localized and variable ν̂ and ν^f is not crucial; (iii) because it was found that using the fitted ν̂ sometimes yielded worse filter accuracy than keeping it fixed; and (iv) to provide fair comparisons between all of the methods by equalizing the sophistication of their forecast step.

In addition to the adaptive inflation, Miyoshi [2011] also uses a small fixed inflation, which is not employed here. Otherwise, the scheme tested is essentially that of Miyoshi [2011], or equivalently that of Li et al. [2009], the differences having proven largely irrelevant and “cosmetic”, as discussed above. It is therefore labelled “ETKF adaptive”.
Another scheme that was tested, labelled “EAKF adaptive”, is that of equation (59). Instead of fitting V^a, however, the value of V^f was fixed, and set (tuned) to 0.01. As described by Anderson [2007], the inflation actually applied to the ensemble is a damped version of β^MAP. The above two schemes are the established standard in the literature, and have featured in many operational studies, although mainly in their spatialized formulations.

The proliferation of tunable “hyper-hyperparameters” (the hyperparameters of the inflation filter) is due to the hierarchical nature of adaptive filters, and may make the adaptive approach appear counterproductive. But, considering the breadth of error sources targeted, it should be recognized that such methods will necessarily be ad hoc, and that the existence of tuning parameters is to be expected. One should not be dismayed, however, because the performances of the adaptive filters are largely insensitive to the hyper-hyperparameter settings. Intuitively, more abstract parameters (further up the hierarchy) should have less impact, as is illustrated by the results of Roberts and Rosenthal [2001]. Indeed, the given values for the hyper-hyperparameters (i) were only tuned for a single experimental context, (ii) seem reasonable, and (iii) yield satisfactory filter accuracy almost universally across the experiments. Point (iii) corroborates previous findings [e.g., Anderson, 2007; Miyoshi, 2011], and suggests that these values may be used in vivo.

Section 3 showed that the EnKF-N may yield a form of adaptive inflation, α⋆. However, the EnKF-N is built on equation (7), with the assumption that the truth is statistically indistinguishable from the ensemble. Therefore, it only targets sampling error, and the prior (23) always has the location parameter 1. The EnKF-N is therefore not robust in the context of model error, analysed in section 4.

If tuned inflation is used in concert with the EnKF-N, then the tuned value could be seen as a measure of model error disentangled from sampling error [Bocquet et al., 2013]. Here, however, the aim is to hybridize the EnKF-N with an adaptive inflation scheme that estimates an inflation factor, β, targeted at model error. Notably, such a scheme has a prior that is time-dependent and, generally, does not have location parameter 1. The scheme used is the “adaptive ETKF” specified in section 4.5.

Again, the explicit conditioning on the ensemble, E, is here dropped for brevity. Consider

  p(x, β | y) ∝ p(y | x) p(x | β) p(β).   (67)

In contrast with section 4 and equation (51), N will here not be assumed infinite. Re-deriving, therefore, the EnKF-N prior (10), but with equation (50) in place of (7), reveals that p(x | β) is a scale mixture over α, but now with B̄ also scaled by β, i.e. p(x | α, β) = N(x | x̄, αβB̄). Further, sampling error is assumed independent from model error: p(α | β) = p(α). Hence,

  p(x, β | y) ∝ p(y | x) [ ∫ p(x | α, β) p(α) dα ] p(β).   (68)
Moving the likelihood inside, as in the EnKF-N (12), yields a mixture over p(y, x | α, β), which can be re-factorized to obtain:

  p(x, β | y) ∝ [ ∫ p(x | y, α, β) p(y | α, β) p(α) dα ] p(β).   (69)

As in the EnKF-N (13), the mixture is approximated by empirical Bayes:

  p(x, β | y) ∝∼ p(x | y, α⋆, β) [ ∫ p(y | α, β) p(α) dα ] p(β),   (70)

where the first factor approximates p(x | y, β) and the integral constitutes p(y | β). This means that p(x | y, α⋆, β) is given by equation (46), with the change of variables (29), except that the prior covariance is now also scaled by β, which also impacts the selection of α⋆ through the dual cost function.

The remaining integral of equation (70) is again approximated using a particular value of α:

  ∫ p(y | α, β) p(α) dα ≈ p(y | α^f, β).   (71)

In this instance, however, the optimizing value of α is not conditioned on y, and is therefore denoted α^f. In practice, α^f = 1 is used for simplicity. Thus, p(y | α^f, β) becomes the same as in equation (61), yielding the posterior of equation (63).

A final approximation is made to decouple the joint posterior: replacing β in the conditional distribution, p(x | y, α⋆, β), by β⋆, some point estimate from p(β | y, α^f). This may again be seen as empirical Bayes, except that β is not a latent variable. Other reasons for only using a single inflation value for the ensemble are discussed in appendix C.2. The particular point used is the mean:

  β⋆ = ν^a / (ν^a − 2) β^a.   (72)

Thus, equation (70) becomes:

  p(x, β | y) ≈ p(x | y, α⋆, β⋆) p(y | α^f, β) p(β)
             ∝ p(x | y, α⋆, β⋆) p(β | y, α^f).   (73)

The algorithm of the analysis update of the hybrid EnKF-N can be stated as follows.
1. Update the general-purpose inflation, β, according to equations (64a) and (64b).
2. Conditional on β⋆ (72), find the EnKF-N inflation using the dual (42): α⋆ = argmin_α D( (N − 1)/(β⋆α) ).
3. Update the ensemble by the ETKF with a prior inflation of α⋆β⋆. In other words, implement equation (47) with (N − 1)/(α⋆β⋆) in place of ζ⋆.
Note that the second step is the only essential difference to the ETKF adaptive inflation scheme of Miyoshi [2011].

The forecast is greatly facilitated by the decoupling of the posterior (73), which means that the ensemble and the general-purpose inflation β are independent. As discussed in section 4.4, it is pragmatic to separate the forecast of the ensemble from the forecast of β, which should be carried out as in equation (66). Instead, however, for the reasons described in section 4.5, here ν^f = 10³, and β^f is set to the previous β^a. The EnKF-N inflation, α, is not forecasted, as its prior is static.

The standard methods of section 4.5 and the hybrid of section 5 are tested with twin experiments: a synthetic truth and observations thereof are simulated and subsequently estimated by the DA methods. Contrary to the meaning of “twin”, however, the setup is deliberately one of model error: the model provided to the DA system is different from the one actually generating the truth.

The main system used is the two-scale/layer Lorenz model [Lorenz, 1996, 2005]. It constitutes a surrogate system for synoptic weather, exhibiting similar characteristics, and enables studying the impact of unresolved scales on filter accuracy. The autonomous dynamics are given by ψ±_i(u) = u_{i∓1} (u_{i±1} − u_{i∓2}) − u_i, where the indices apply periodically. Then,

  dx_i/dt = ψ⁺_i(x) + F − (hc/b) Σ_{j=1}^{10} z_{j+10(i−1)},   (74)
  dz_j/dt = (c/b) ψ⁻_j(bz) + (hc/b) x_{1+(j−1)//10},   (75)

for i = 1, …, 36 and j = 1, …, 360, where // denotes integer division.
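For concreteness, the following is a sketch of the tendencies (74) and (75) in Python, assuming the dimensions (36 slow and 360 fast variables) and the default constants stated just below; variable names are illustrative, and this is not the DAPPER implementation.

```python
import numpy as np

nx, nz_per = 36, 10            # 36 x-variables, 10 z-variables per x
c, b, h, F = 10.0, 10.0, 1.0, 10.0

def psi(u, sign):
    """Advection term psi^{+/-}_i(u) = u_{i∓1}(u_{i±1} − u_{i∓2}) − u_i,
    with periodic indexing via np.roll."""
    s = 1 if sign > 0 else -1
    return np.roll(u, s) * (np.roll(u, -s) - np.roll(u, 2 * s)) - u

def dxdt(x, z):
    """Slow tendency, equation (74)."""
    coupling = z.reshape(nx, nz_per).sum(axis=1)  # sum_j z_{j+10(i-1)}
    return psi(x, +1) + F - (h * c / b) * coupling

def dzdt(x, z):
    """Fast tendency, equation (75)."""
    drive = np.repeat(x, nz_per)                  # x_{1+(j-1)//10}
    return (c / b) * psi(b * z, -1) + (h * c / b) * drive
```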
Unless otherwise stated, the constants are set as in Lorenz [1996]: time-scale ratio c = 10, space-scale ratio b = 10, coupling h = 1, and forcing F = 10. The resulting dynamics, illustrated in Figure 3, are chaotic and have a leading Lyapunov exponent of 1.3775 [Mitchell and Carrassi, 2015].

Figure 3: Illustration of the dynamics of the two-scale Lorenz model by 20 consecutive snapshots of the state profile. The colour gradation, from light to dark, represents the time sequence, with a total span of 0.3. The large-scale x field, with a mean value of 2.4, can be seen moving left. The small-scale z field, with a mean value of 0.1, is moving right; its variations are faster, but of smaller amplitude. The abscissa denotes the indices of (upper) x and (lower) z.

The truth is composed of both fields: [x, z]. By contrast, the model provided to the DA methods is a truncated version of the full one:

  dx_i/dt = ψ⁺_i(x) + F − [A + B x_i],  i = 1, …, 36.   (76)

The term in brackets parameterizes (compensates for) the missing coupling to the small scale. Ahead of each experiment, the constants A and B are determined by linear regression to the unresolved tendencies, as described by Wilks [2005]. The error parameterization removes the linear bias of the truncated model; this has been done because inflation is not well suited to deal with systematic errors. Higher-order parameterizations were tested and found to yield little improvement in the filter accuracies, while a 0-order parameterization yielded too much model error, overly dominating the dynamical growth of the filter error.
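As a sketch of this regression (under the assumption that trajectories of the full model are available, sampled at the same times), A and B may be obtained as follows; the constants are those of the previous sketch, and the names are illustrative.

```python
import numpy as np

# Constants of the two-scale model, as in the previous sketch.
c, b, h, nx, nz_per = 10.0, 10.0, 1.0, 36, 10

def fit_parameterization(x_traj, z_traj):
    """Least-squares fit of A and B in (76), regressing the coupling
    term missed by the truncated model onto x (cf. Wilks [2005]).
    `x_traj` has shape (T, 36) and `z_traj` has shape (T, 360)."""
    # The tendency unresolved by (76): the coupling term of (74).
    g = (h * c / b) * z_traj.reshape(len(z_traj), nx, nz_per).sum(-1)
    B_, A_ = np.polyfit(x_traj.ravel(), g.ravel(), 1)  # g ≈ A + B x
    return A_, B_
```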
Another source of model error is that the full model is integrated with a shorter time step than the truncated model, whose time step of 0.05 serves to lower computational costs. However, this source of error was found to be negligible compared to the truncation itself. The time between observations is Δ = 0.15.
Direct observations are taken of the full x field with error covariance R = I_M. There is no model noise: Q = 0. The filters are assessed by their accuracy, as measured by the root-mean squared error:

  RMSE = √( ‖x − x̄‖² / M ),   (77)

which is recorded immediately following each analysis. This instantaneous RMSE is then averaged in time (3300 cycles following a spin-up of 40 cycles), and over 32 repetitions of each experiment. A table of RMSE averages is compiled for a range of experiment settings and plotted as curves for each method. The figures also present (as thin, dashed lines) the root-mean variances (RMS spread) of each method. All of the experiment benchmark results can be reproduced using the Python scripts hosted online at https://github.com/nansencenter/DAPPER/tree/paper_AdInf.

Figure 4a shows benchmarks obtained with the two-scale system as a function of the forcing, F. The increasing RMSE averages of all filters reflect the fact (not shown) that the system variability and chaoticity both increase with F. The same applies for decreasing c in Figure 4b, where F is fixed at 10. Note that all of the adaptive filters are largely coincident at F = 10 and c = 10, with RMSE scores almost as low as fixed, tuned inflation. This is because the hyper-hyperparameters for each method, described in section 4.5, were tuned at this point (and this point only). By contrast, the fixed inflation of the “ETKF tuned” filter is determined for each experiment setting by selecting for the lowest RMSE among 40 inflation values between 0.98 and 3, most of which are close to 1.
It is not surprising that tuning an adaptive filter will make it about as accurate as any other. The objective, however, is to avoid tuning. In that regard, what is surprising is how well all of the adaptive filters perform overall. Indeed, except for the fairly extreme contexts of large F or c < 4, the differences in RMSE are small, in the sense that the adaptive filters are all superior to “ETKF excessive”: a fixed-inflation filter with a suboptimal inflation factor that adds a small fixed amount to the tuned value.

Experiments were also performed with a single-scale configuration, where the truth itself is generated by the truncated dynamics:

  dx_i/dt = ψ⁺_i(x) + F,  i = 1, …, 36,   (78)

and there is no z field. As in Anderson [2007], the model error consists in using a different value of F for the truth than for the DA system. The setup is otherwise repeated from above. As shown in Figure 4c, the benchmarks plotted as a function of F are V-shaped, with the lowest scores obtained in the absence of error (F = 8). The adaptive filters score very similar RMSE averages, which are generally significantly in excess of the RMS spread scores. The mismatch can be explained by the well-known bias-variance decomposition of RMSE, and the fact that the model error contains significant bias. The presence of bias is also a likely cause of the closeness of the RMSE averages of the adaptive methods, because inflation is not well suited to treat bias.

Tests were also run with the 3-variable Lorenz-63 system [Lorenz, 1963], where the model error consists in adding independent white noise to the truth. The setup is the same as above, except that R = 2I. Figure 4d shows the corresponding benchmarks. These are obtained with a small ensemble (N = 3); using a larger ensemble, the relative advantage of the hybrid disappears.

The hybrid EnKF-N obtains slightly superior accuracy relative to the adaptive ETKF and EAKF for nearly all experiments. This is as expected from theory: separate, dedicated treatment of sampling and model errors yields improved accuracy. The practical advantage of the hybrid is illustrated in the time series of Figure 5. Notably, the inflation of the hybrid EnKF-N has much more volatility (a shorter time scale). This is made possible by the static prior, which “anchors” the inflation to 1. By contrast, similar volatility in the adaptive ETKF and EAKF would require much more lenient settings of ν^f (or V^f), which would yield excessive longer-term volatility (i.e. variance). On the other hand, the volatility also means that larger spikes in the inflation will occur. Since inflation is not a physical or especially gentle way of increasing spread, this could potentially cause trouble. For example, the pure EnKF-N with the full model and large F values blows up due to stiffness, as illustrated by the grey curves in Figure 4a. This could have been prevented by increasing (doubling) its certainty parameter, something that would also slightly improve its accuracy across all of the experiment settings.

Different hyper-hyperparameter values for the adaptive EAKF and ETKF will penalize their RMSE scores for some settings, and reward them for others. This sensitivity was observed to be much reduced for the hybrid, which is in line with the principal objective: to avoid tuning across a multitude of contexts. A secondary objective is to obtain improved accuracy compared to fixed, tuned inflation, as has been previously observed for the pure EnKF-N in the perfect-model context [Boc12]. Figure 4 shows that this is sometimes achieved, but by a very small margin.

Further experiments (not shown) were carried out, using different ensemble sizes and other types of model errors. The trends were similar to the benchmarks already shown, but typically with less relative difference between the filters.
Figure 4: Accuracy benchmarks of the adaptive inflation filters, plotted as functions of various control variables: (a) from the two-scale system, using N = 20, as a function of the forcing F (both for truth and DA); (b) from the two-scale system, using N = 20, as a function of the speed-scale ratio c (both for truth and DA); (c) from the single-scale system, using N = 20, as a function of the forcing F used for the truth (the DA assumes F = 8); (d) from the Lorenz-63 system, using N = 3, as a function of the magnitude (Q_ii/Δ) of the noise on the truth. Each panel compares ETKF excessive, ETKF tuned, EAKF adaptive, ETKF adaptive, and the hybrid EnKF-N. Also included in the plots is the RMS spread, plotted as thin, dashed lines. For perspective, two baselines are provided, in grey. These are obtained with the full (i.e. perfect) model using the pure EnKF-N: one with the same ensemble size as the adaptive filters (marker: □), and one with N = 80 (marker: +). Among the adaptive methods, the proposed hybrid EnKF-N (blue) scores the lowest RMSE averages nearly systematically across all contexts, albeit by a moderate margin.

Figure 5: Illustration of the statistics from the twin experiments by a segment of the typical time series (panels: ETKF tuned, EAKF adaptive, ETKF adaptive, hybrid EnKF-N; lines: inflation, RMS error, RMS spread). Generated with the two-scale Lorenz model, with F = 16. In the panel of the hybrid EnKF-N, the second inflation line (yellow) indicates the value of β^a, and averages close to 1.

This paper has developed an adaptive inflation scheme as a hybrid of (α) the finite-size ensemble Kalman filter inflation [the EnKF-N of Boc11; Boc12; Boc15] and (β) the inflation estimation conventionally associated with the ensemble transform Kalman filter (ETKF). In so doing, it has provided several novel theoretical insights on the EnKF and adaptive inflation estimation.

The first part of the paper is focused on idealistic contexts, with sampling error being the main concern. Using two univariate toy experiments, section 2.1 illustrated the generation of sampling error by nonlinearity, as predicted by appendix B. Section 2.2 then discussed the circumstances for inflation, cataloguing them according to linearity, stochasticity, and ensemble size. The discussion revealed why sampling errors are attenuated in the linear context, why the choice of normalization factor (e.g., N − 1) is not crucial, and also touched on topics such as ensemble collapse and filter divergence. Next, section 3.1 gave a bird's-eye view of the EnKF-N, showing how (e.g., empirical Bayes) and why (e.g., feedback) it works. The following sections filled in the details; in particular, section 3.4 showed how the effective prior reduces to a Gaussian scale mixture, again demonstrating the relationship between sampling error and inflation. The mixture parameter, α, is shown to be χ⁻². Section 3.6 derived a saddlepoint form to retain the inflation-explicit expressions all the way up to the posterior. Without recourse to Lagrangian duality theory, section 3.7 then finalized the re-derivation of the (dual) EnKF-N by showing how the mode of the posterior may be found by optimizing for the inflation factor, α.

In contrast to the above, section 4 is focused on model error, neglecting sampling error. A formal and unifying survey of the existing adaptive inflation literature is presented; particular attention is given to the schemes conventionally associated with the EAKF and the ETKF (β̂_R). The ETKF scheme is given a new and natural derivation in section 4.2, again yielding the χ⁻² distribution. Several potential improvements, some novel, were discussed in section 4.3, but generally found to be less rewarding than hoped for. Appendix D gives some new results on biases, including for the maximum likelihood estimator. The survey is supplemented by appendix C on joint schemes, including a suggestion for inflation estimation by variational Bayes. Section 4.4 commented on the forecasting of hyperparameters such as the inflation parameter, β.

Combining the above, section 5 developed a hybrid between the EnKF-N and the adaptive ETKF inflation scheme. The hybrid employs two inflation factors, α and β, separately targeting sampling and model error, respectively. The EnKF-N component (α) adds negligible computational cost and no further tuning parameters to the ETKF and its adaptive inflation (β), yet increases the inflation volatility and thus its ability to adapt. The experiments of section 6 showed that the hybrid generally yields similar filter accuracy to fixed inflation, even in bias-dominated contexts, but without the costly need for tuning. It also yields improved filter accuracy in comparison with the standard, pre-existing adaptive inflation schemes of the ETKF and the EAKF.

Unless the ensemble size was small and the context strongly nonlinear, however, the gains were found to be relatively modest, as was the difference between the existing methods. This is somewhat surprising in view of the essential importance of inflation in many configurations of the EnKF. Part of the explanation may be that, as a hyperparameter, the accuracy of the inflation estimates is not as important as that of the (primary) state variables and that, instead, the main importance of the inflation scheme consists in its capacity to avoid divergence occurrences, which is a matter of a more boolean character. Another cause is that the inflation estimates converge and become nearly constant within a relatively short span of time, and that these asymptotic estimates are sufficiently accurate for all of the methods.

While the experimental results clearly demonstrated the improvements of the hybrid adaptive inflation scheme, extrapolating these findings to other, larger applications is non-trivial. Spatialization of the inflation parameter will likely be necessary; it may be implemented without considerable complexity, as in Miyoshi [2011]. Still, the relative modesty of the above experimental results does not promise great, general gains.
On the other hand, it suggests the conclusion, aided by the rigour and scope of this study, that further sophistication of single-factor adaptive inflation estimation schemes is unlikely to yield significant further improvements.

A Standard distributions
Table 2 specifies the distributions in use in this paper. The following properties are useful.

Table 2: Parametric probability distributions. As elsewhere in the paper, b, x ∈ ℝ^M; B and S are symmetric positive-definite M×M matrices; s, β > 0; and it is assumed that ν > M. The constants are c_N = (2π)^{−M/2}, c_t = Γ((ν+M)/2) / [(πν)^{M/2} Γ(ν/2)], c_W = ν^{νM/2} / [2^{νM/2} Γ_M(ν/2)], and c_χ = c_W with M = 1. The (not listed) variance of element (i, j) of B with the Wishart distribution is (s²_ij + s_ii s_jj)/ν, where s_ij is element (i, j) of S. The variances of the inverse-Wishart distribution are asymptotically, for ν → ∞, the same.

  Gauss./Normal:      N(x | b, B) = c_N |B|^{−1/2} exp(−½ ‖x − b‖²_B);  mean b;  mode b;  covariance B.
  t-distribution:     t(x | ν; b, B) = c_t |B|^{−1/2} (1 + ‖x − b‖²_B/ν)^{−(ν+M)/2};  mean b;  mode b;  covariance ν/(ν−2) B.
  Wishart:            W⁺¹(B | S, ν) = c_W |S|^{−ν/2} |B|^{(ν−M−1)/2} e^{−tr(νBS⁻¹)/2};  mean S;  mode (ν−M−1)/ν S;  variance per caption.
  Inverse-Wishart:    W⁻¹(B | S, ν) = c_W |S|^{ν/2} |B|^{−(ν+M+1)/2} e^{−tr(νSB⁻¹)/2};  mean ν/(ν−M−1) S;  mode ν/(ν+M+1) S;  variance per caption.
  Chi-square:         χ⁺²(β | s, ν) = c_χ s^{−ν/2} β^{ν/2−1} e^{−νβ/(2s)};  mean s;  mode (ν−2)/ν s;  variance 2s²/ν.
  Inverse-chi-square: χ⁻²(β | s, ν) = c_χ s^{ν/2} β^{−ν/2−1} e^{−νs/(2β)};  mean ν/(ν−2) s;  mode ν/(ν+2) s;  variance 2ν²s² / [(ν−2)²(ν−4)].

Property 1. The (“scaled”) chi-square distributions are equivalent to the Gamma distributions:

  χ±²(β | s, ν) = Gamma±( β | ν/2, ν s^{∓1}/2 ),   (79)

where the switch sign ± has been used to represent both the regular and inverse distributions. The χ² parameterization has been preferred for the parameter interpretations offered by Property 2, and the notational simplicity of Properties 3 and 4.

Property 2. Asymptotic normality. If β ∼ χ±²(s, ν), then the distribution of √ν (β − s) converges to N(0, 2s²) as ν → ∞. Since it describes the sum of squared Gaussians, the asymptotic result for χ⁺² is a consequence of the central limit theorem. The result for χ⁻² can be shown through the pointwise convergence of the pdf of √ν (β − s), normalized by its value at 0. Note that the same limit would have applied if β ∼ N(s, 2s²/ν). This shows that s plays the role of a location parameter in χ±²(s, ν), while 2s²/ν plays the role of variance, and explains why “certainty” is preferred to “degree of freedom” for ν in this paper.

Property 3. In the univariate case (M = 1),

  W±¹(β | s, ν) = χ±²(β | s, ν).   (80)

Property 4. Reciprocity. With t = 1/β,

  p(β) = χ⁻²(β | s, ν)  iff  p(t) = χ⁺²(t | 1/s, ν).   (81)

Property 5. Reciprocity. With T = B⁻¹,

  p(B) = W⁻¹(B | S, ν)  iff  p(T) = W⁺¹(T | S⁻¹, ν),   (82)

as follows by the change of variables and the Jacobian |T|^{−(M+1)} [Muirhead, 1982, §2.1].

Property 6. Let u ≠ 0 be any M-dimensional vector, or an (almost never zero) random vector. If T ∼ W⁺¹(S, ν) is independent of u, then

  (uᵀ T u) / (uᵀ S u) ∼ χ⁺²(1, ν).   (83)

Moreover, this statistic is also independent of u. Proof: theorem 3.2.8 of Muirhead [1982].
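These correspondences are easy to verify numerically; the following sketch (using scipy, with arbitrary illustrative parameter values) checks Properties 1 and 4.

```python
import numpy as np
from scipy import stats

s, nu = 1.5, 7.0
beta = np.linspace(0.1, 5, 50)

# Property 1: chi^{+2}(beta | s, nu) == Gamma(shape=nu/2, rate=nu/(2s)).
pdf_chi2 = stats.gamma.pdf(beta, a=nu/2, scale=2*s/nu)

# Property 1: chi^{-2}(beta | s, nu) == inv-Gamma(shape=nu/2, rate=nu*s/2).
pdf_inv = stats.invgamma.pdf(beta, a=nu/2, scale=nu*s/2)

# Property 4: if beta ~ chi^{-2}(s, nu) then t = 1/beta ~ chi^{+2}(1/s, nu).
# Transforming back requires the Jacobian |dt/dbeta| = 1/beta**2.
pdf_recip = stats.gamma.pdf(1/beta, a=nu/2, scale=2/(s*nu)) / beta**2
assert np.allclose(pdf_inv, pdf_recip)
```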
B Nonlinearity and sampling error
This discussion complements that of section 2.1.
B.1 Why does nonlinearity generate sampling error?
First, consider what is meant by “sampling error”. A sample does not per se have a sampling error; it is by definition random, i.e. subject to variation. By contrast, estimators, or rather their realized estimates, have sampling error: the difference between the estimate and its expected value. By extension, any statistic (any function of the sample) may be said to have sampling error; for simplicity, however, the discussion below is limited to the non-central sample moments, i.e. μ̂_m({x_n}) = N⁻¹ Σ_{n=1}^{N} x_n^m, in the univariate case. If the sample is drawn from the same distribution as x, then μ̂_m is an unbiased estimate of the m-th moment of x, i.e. μ_m = E[x^m], and the sampling error is the difference:

  Error_m = μ̂_m − μ_m.   (84)

It is a well-known property of the Kalman filter that the covariance does not depend on the mean. In the forecast step, this is due to the fact that their evolutions are entirely decoupled. Indeed, in the case of linear dynamics (d = 1), the m-th forecast moment is given by μ^f_m = M^m μ_m, where M is the (scalar) linear model: x^f = Mx. For nonlinear forecast dynamics M, however, the moments will be coupled through M. For example, if (locally to the support of p(x)) the model M can be represented by a polynomial of degree d, then the m-th moment of the random variable M(x) is a linear combination of moments of x of order 1 through md:

  μ^f_m = Σ_{i=1}^{md} C_{m,i} μ_i.   (85)

Thus, for d > 1, the evolutions of the moments are coupled [Lewis et al., 2006, §29]. A similar analysis reveals that the same coupling takes place for the sample moments, μ̂_m. Therefore the sampling errors are also coupled:

  Error^f_m = Σ_{i=1}^{md} C_{m,i} Error_i.   (86)

But an N-sized ensemble can only match N moments, e.g., Error_i = 0 for i = 1, …, N, and so there will always be some sampling error present for i > N. Then, by the mixing of equation (86), this error will cascade into the lower-order forecast errors. In summary, nonlinearity causes sampling error in (e.g.) the mean and covariance by pulling in the inevitable (with finite N) sampling error from higher-order moments.

The above analysis is concerned with the generation of sampling error. Another reason for sampling error in the context of nonlinearity is that chaos prevents its elimination, by limiting the effect of far-past observations (as opposed to the linear case illustrated in Figure 1, where the initial sampling error is quickly attenuated). It is not immediately clear whether this is a separate cause or, rather, a different perspective on the same phenomenon.

B.2 A nonlinear model preserving N

The nonlinear model of section 2.1 was designed using “inverse transform sampling”. It is specified by M_NonLin(x) = F⁻¹_N(F_χ(x²)), where F_χ is the cumulative distribution function (CDF) for χ⁺²(1, 1), and F⁻¹_N is the inverse CDF for N(0, 1). Note that M_NonLin: (i) is V-shaped, with a singularity at 0; (ii) is closely approximated by an affine combination of |x| and 0.23 log(x²); (iii) applied to a density symmetric about 0, it may be visualized as folding it up in the middle before smearing it back out again; (iv) may be generalized to higher dimensionality by expressing x in polar coordinates, since p(‖x‖²) = χ⁺²(‖x‖² | M, M) if p(x) = N(x | 0, I_M).

Presumably, any nonlinear model that preserves Gaussianity must include a singularity. This renders the example using M_NonLin to generate sampling error without non-Gaussianity somewhat artificial, but does not jeopardize the utility of considering the two issues separately.
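A sketch of this construction (using the standard, unscaled distributions of scipy, which coincide with the above for these parameters):

```python
import numpy as np
from scipy import stats

def M_nonlin(x):
    """Gaussianity-preserving nonlinear map of appendix B.2:
    x ~ N(0,1) -> x**2 ~ chi-square(1) -> CDF -> uniform(0,1)
    -> inverse normal CDF -> N(0,1). Illustrative sketch."""
    u = stats.chi2.cdf(x**2, df=1)  # F_chi(x^2), uniform on (0,1)
    return stats.norm.ppf(u)        # F_N^{-1}(u), again N(0,1)

# Quick check: a large Gaussian sample keeps its first two moments.
x = np.random.randn(100_000)
y = M_nonlin(x)
print(y.mean(), y.var())            # approximately 0 and 1
```

Note that the singularity at 0 appears because F_χ(0) = 0, whose normal quantile is −∞, consistent with item (i) above.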
C Joint state-covariance estimation
As mentioned immediately above section 4.1, the joint approach approximates p(x, β | y) simultaneously in x and β. Thus, their analyses impact each other. This is particularly relevant for the state-inflation (or state-covariance) estimation problem, because of the non-Gaussianity of p(x, β | y).

C.1 Variational methods
A common approximate solution to this non-Gaussianity is to use variational methods for parametric fitting. As detailed below, this leads to iterative schemes where the update to the hyperparameter, β, is computed in terms of the updated ensemble (for x), and vice versa. Some of the literature below is concerned with estimating R (jointly with x), but is still pertinent by the proximity of the problem to that of estimating B.

Sarkka and Nummenmaa [2009] introduce the variational Bayes (VB) method in the framework of the Kalman filter. Nakabayashi and Ueno [2017] extend it to the EnKF. The VB method imposes an approximate posterior with two factor distributions, q(x | y) q(R | y), fitted by minimizing the Kullback-Leibler divergence from the correct posterior, p(x, R | y). This yields the condition that each factor be the (geometric) marginal of p(x, R | y) with respect to the other, i.e. two coupled pdf equations. Independence is assumed for the prior: p(x, R) = N(x | x̄, B̄) W⁻¹(R | R^f, ν^f). It is then shown that the distributions of the approximate posterior are again N and W⁻¹, with parameters computable by fixed-point iteration. Iteration i + 1 consists of the EnKF equations using the i-th estimate of R^a to compute the analysis mean, x̄^a, and covariance, P̄^a, whose updated values are then used in the update R^a = {ν^f R^f + R̂}/ν^a, where ν^a = ν^f + 1 and

  R̂ = (y − H x̄^a)(y − H x̄^a)ᵀ + H P̄^a Hᵀ.   (87)

Ueno and Nakamura [2016], also estimating R in the framework of the EnKF, use the expectation-maximization (EM) method to maximize the marginal posterior p(R | y), with a W⁻¹ prior. The expectation is over x, given y and the current estimate of R. The empirical distribution is assumed for the prior: p(x) ≈ N⁻¹ Σ_n δ(x − x_n), yielding iterations involving weighted statistics. Our investigation indicates that if, instead, the standard EnKF assumption had been used, p(x) ≈ N(x | x̄, B̄), then the method would have yielded iterations as in equation (87). This is unsurprising in view of the close connection between EM and VB.

An original result is obtained by applying the VB method to estimate the inflation parameter. Consider the prior p(x, β) = N(x | x̄, βB̄) χ⁻²(β | β^f, ν^f); note that x and β are not assumed independent. It can then be shown that the resulting VB scheme consists of using the i-th iterate of β^a to compute the i + 1 iterates of x̄^a and P̄^a which, in turn, are used to compute the i + 1 iterate β^a = {ν^f β^f + β̂}/ν^a, where ν^a = ν^f + M and

  β̂ = ‖x̄^a − x̄^f‖²_B̄ + tr( P̄^a B̄⁻¹ ),   (88)

which should be computed in the ensemble subspace if N ≤ M. Equation (88) may be interpreted using the trigonometric relations of Desroziers et al. [2005]. Numerical experiments indicate that the bias of this VB method (with a flat prior) is significantly higher than for β̂_R, typically also with a larger variance. It seems likely that the reason is similar to that of β̂_ML, analysed in appendix D.2.
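A minimal sketch of this fixed-point scheme for small, dense systems follows (illustrative names; a real system would require the ensemble-subspace computations mentioned above).

```python
import numpy as np

def vb_inflation_update(xf_mean, B, y, H, R, beta_f, nu_f, n_iter=10):
    """Fixed-point iteration of the VB inflation scheme around
    equation (88); an illustrative sketch, not the paper's code."""
    M = len(xf_mean)
    nu_a = nu_f + M
    beta_a = beta_f
    Binv = np.linalg.inv(B)
    for _ in range(n_iter):
        # Kalman analysis with prior covariance beta_a * B.
        Bb = beta_a * B
        K = Bb @ H.T @ np.linalg.inv(H @ Bb @ H.T + R)
        xa_mean = xf_mean + K @ (y - H @ xf_mean)
        Pa = (np.eye(M) - K @ H) @ Bb
        # Inflation update (88), using the new analysis moments.
        beta_hat = (xa_mean - xf_mean) @ Binv @ (xa_mean - xf_mean) \
                   + np.trace(Pa @ Binv)
        beta_a = (nu_f * beta_f + beta_hat) / nu_a
    return xa_mean, Pa, beta_a
```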
C.2 Hierarchical methods

Instead of the variational approach, a more principled approach is to represent the marginal p(β | y) with a Monte-Carlo sample. This yields a hyper-ensemble of distributions p(x | β, y) and their ensembles. The label “hierarchical” is sometimes reserved for adaptive filters treating the hyperparameter in this more Bayesian (i.e. full-pdf) manner. However, the number of realizations quickly becomes exorbitant. On the other hand, if only a single sample is drawn from p(x | β, y) for each β (i.e. not using an ensemble of ensembles), then the approach seems especially prone to spurious correlations. Another criticism is that it might cause discrete jumps between localization domains.

Myrseth et al. [2010] provided one example of a hierarchical EnKF, but without conditioning the hyperparameter on y, without accounting for model error, and with a forecast of (the hyperprior of) B that decays towards 0. Tsyrulnikov and Rakitko [2015] present an ambitious hierarchical EnKF, explicitly considering both model and sampling error. However, the filter is only tested in experiments with a novel, univariate model.

An intriguing approach is that of Stroud and Bengtsson [2007]. They peg the scaling of R and Q together, so as to estimate only a single parameter, β. This is difficult to justify, but simplifies the problem significantly, because then B̄ also scales with β (provided linear dynamics), so that p(x, β) = N(x | x̄, βB̄) χ⁻²(β | …) is conjugate to p(y | x, β), and the full, joint posterior is available in closed form, with trivial parametric updates, hence avoiding Monte-Carlo. However, without pegging Q to R, the conjugacy is lost.

The EnKF-N can also be said to be hierarchical, because of its careful treatment of the full marginal, p(α | y), before the variational approximation (13).

D More on the marginal inflation estimators
This section complements section 4.1.
D.1 Other trace-based estimators
It is common to form chi-square diagnostics by measuring δ̄ by its Mahalanobis norm [Ménard et al., 2000; Wu et al., 2013; Haussaire, 2017]. This provides the motivation to use C̄(1)⁻¹ to transform equation (53), yielding:

  β̂_C̄ = [ ‖δ̄‖²_C̄(1) − tr( R C̄(1)⁻¹ ) ] / tr( H B̄ Hᵀ C̄(1)⁻¹ ).   (89)

Alternatively, equation (53) can be transformed by (H B̄ Hᵀ)⁻¹, yielding (δ̄δ̄ᵀ − R)(H B̄ Hᵀ)⁻¹ = β I_P. For the purpose of inflation, this seems like the best option, because then the trace consists of terms with the same expected magnitude, yielding the lowest aggregate variance. The estimator becomes:

  β̂_HB̄Hᵀ = (1/P) { ‖δ̄‖²_HB̄Hᵀ − tr( R (H B̄ Hᵀ)⁻¹ ) }.   (90)

If H B̄ Hᵀ is rank-deficient, then β̂_HB̄Hᵀ must be defined using the pseudo-inverse. The estimate would then not be impacted by components of δ̄ outside of the ensemble subspace. Lastly, note that none of the trace-based estimators are computationally costly, because they can all be computed via the SVD (45).
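A sketch of these estimators for small, dense systems follows, assuming C̄(1) = R + H B̄ Hᵀ and that σ̄ is the trace average of (49); in practice they would all be computed via the SVD (45).

```python
import numpy as np

def beta_hat_R(delta, R, HBHt):
    """Equation (55): moment matching in the R-norm (sketch)."""
    P = len(delta)
    sigma_bar = np.trace(np.linalg.solve(R, HBHt)) / P
    return (delta @ np.linalg.solve(R, delta) / P - 1.0) / sigma_bar

def beta_hat_Cbar(delta, R, HBHt):
    """Equation (89), weighting by C(1)^{-1}, with the assumption
    C(1) = R + H B H^T."""
    C1_inv = np.linalg.inv(R + HBHt)
    num = delta @ C1_inv @ delta - np.trace(R @ C1_inv)
    return num / np.trace(HBHt @ C1_inv)

def beta_hat_HBHt(delta, R, HBHt):
    """Equation (90), weighting by (H B H^T)^{-1}; the pseudo-inverse
    covers the rank-deficient case."""
    P = len(delta)
    S_inv = np.linalg.pinv(HBHt)
    return (delta @ S_inv @ delta - np.trace(R @ S_inv)) / P
```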
D.2 Single-cycle bias

This subsection considers the properties of the estimators within a single analysis, based only on the inflation estimators' sampling distributions. Here, the assumption of section 4 that N = ∞ is undone, so that B̄ is also random (before conditioning on the ensemble), in addition to x and y, and subject to sampling errors. As will be shown, this yields biases in the inflation estimators.

Now, using variables defined via the SVD (45), it can be shown that each of the following estimators of β satisfies the condition:

  0 = Σ_{i=1}^{P} γ_i ( 1 + β σ̄_i − d_i² ),   (91)

where d_i is the i-th component of the transformed innovation, Uᵀ R^{−1/2} δ̄, and

  γ_i = 1                      for β̂_R,
      = (1 + σ̄_i)⁻¹           for β̂_C̄,
      = (σ̄_i)⁻¹               for β̂_HB̄Hᵀ,
      = σ̄_i / (1 + β̂ σ̄_i)²   for β̂_ML.   (92)

Equation (91) may be solved explicitly for β̂, except in the case of β̂_ML. Nevertheless, equations (91) and (92) may be used to provide insight on the bias of β̂_ML, a subject of study since Mitchell and Houtekamer [2000]. Indeed, note that while the full matrix, B̄, is an unbiased estimator of B, the spectrum of B̄ is a biased estimate of the spectrum of B [van der Vaart, 1961; Takemura, 1984]. Thus, the spectrum of H B̄ Hᵀ R⁻¹, namely {σ̄_i}, is also biased. Hence, generally, functions of the spectrum will be biased. An important exception is that E(Σ_i σ̄_i) = tr(H B Hᵀ R⁻¹), meaning that the expectation of equation (91) holds for β̂_R. Note that β̂_R is still biased, as it requires inverting Σ_i σ̄_i. Nevertheless, considering the expressions (92) for γ_i, it seems logical that the bias of {σ̄_i} will significantly carry over into the more complicated estimators, such as β̂_ML. Numerical experiments confirm this, and show that the bias of β̂_ML is worse than it is for β̂_R, but less than for β̂_C̄ and β̂_HB̄Hᵀ.

Why is the bias of β̂_R of equation (55) the least? Loosely speaking, because the trace of B̄ is taken before dividing. Moreover, in case R and H B Hᵀ have the same structure, i.e. R^{−1/2} H B Hᵀ R^{−T/2} = σ² I_P, the bias of β̂_R can be obtained analytically, as follows. Due to the modified assumption (50), equation (18) now yields B̄ ∼ W⁺¹(B/β, N − 1), whereby the diagonal elements of R^{−1/2} H B̄ Hᵀ R^{−T/2} are iid, with distribution χ⁺²(σ²/β, N − 1). Hence, with ν = P(N − 1),

  σ̄ = (1/P) tr( R^{−1/2} H B̄ Hᵀ R^{−T/2} ) ∼ χ⁺²( σ²/β, ν ).   (93)

Then, according to Property 4 and Table 2, E[1/σ̄] = ν/(ν − 2) · β/σ². Meanwhile, E[δ̄δ̄ᵀ] = R + (1 + 1/(βN)) H B Hᵀ, so that E[‖δ̄‖²_R/P − 1] = (1 + 1/(βN)) σ². Now, the sample mean and variance of Gaussian samples are independent. Hence, the numerator and denominator of β̂_R of equation (55) are independent, so that

  E[β̂_R] = E[‖δ̄‖²_R/P − 1] · E[1/σ̄] = (1 + 1/(βN)) · ν/(ν − 2) · β,   (94)

which is close to β for large N and P.
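The bias formula (94) is readily checked by simulation. The following is an illustrative sketch in the white case R = I_P and H B Hᵀ = σ² I_P, drawing the numerator and denominator of β̂_R independently, per the assumptions above; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N, beta, sigma2 = 10, 15, 1.5, 2.0
nu = P * (N - 1)
est = np.empty(10_000)
for k in range(len(est)):
    # sigma_bar ~ chi^{+2}(sigma2/beta, nu): trace average of a
    # (normalized) Wishart sample with mean sigma2/beta * I_P.
    A = rng.standard_normal((P, N - 1)) * np.sqrt(sigma2 / (beta * (N - 1)))
    sigma_bar = np.sum(A**2) / P
    # Innovation with covariance R + (1 + 1/(beta N)) * H B H^T.
    d = rng.standard_normal(P) * np.sqrt(1 + (1 + 1/(beta*N)) * sigma2)
    est[k] = (d @ d / P - 1) / sigma_bar       # beta_hat_R, eq. (55)

print(est.mean())                              # Monte-Carlo estimate
print((1 + 1/(beta*N)) * nu / (nu - 2) * beta) # formula (94): close
```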
Acknowledgements

The authors thank the two anonymous reviewers for their constructive comments; A. Farchi for technical help and discussions; L. Bertino for fostering the collaboration and for his patience; C. Grudzien for insight on the behaviour of the Lorenz system; F. Counillon for reading and opinion; and A. Karspeck and M. Gharamti for stimulating questions.

Author P. N. Raanes has been funded by the EmblAUS project of the Nordic countries' funding agency NordForsk, and by DIGIRES, a project sponsored by PETROMAKS2 of the Research Council of Norway and industry partners. CEREA is thanked for providing office space and community for visiting researcher P. N. Raanes. CEREA is a member of the Institut Pierre-Simon Laplace (IPSL). Author A. Carrassi has been partly funded by the project REDDA of the Norwegian Research Council.
References
Anderson JL. 2003. A local least squares framework forensemble filtering.
Monthly Weather Review (4):634–642.Anderson JL. 2007. An adaptive covariance inflationerror correction algorithm for ensemble filters.
TellusA (2): 210–224.Anderson JL. 2009. Spatially and temporally varyingadaptive covariance inflation for ensemble filters. TellusA (1): 72–83.Anderson JL, Anderson SL. 1999. A Monte Carloimplementation of the nonlinear filtering problem toproduce ensemble assimilations and forecasts. MonthlyWeather Review (12): 2741–2758. Asch M, Bocquet M, Nodet M. 2016.
Data assimilation:methods, algorithms, and applications . Fundamentalsof Algorithms, SIAM: Philadelphia, PA, doi:10.1137/1.9781611974546.Azevedo-Filho A, Shachter RD. 1994. Laplace’s methodapproximations for probabilistic inferencein beliefnetworks with continuous variables. In:
Proceedingsof the Tenth international conference on Uncertaintyin artificial intelligence . Morgan Kaufmann PublishersInc., pp. 28–36.Berry T, Harlim J. 2014. Linear theory for filtering non-linear multiscale systems with model error.
Proceedingsof the Royal Society A: Mathematical, Physical andEngineering Science (2167): 20140 168.Bishop CH, Etherton BJ, Majumdar SJ. 2001. Adaptivesampling with the ensemble transform Kalman filter.Part I: Theoretical aspects.
Monthly Weather Review (3): 420–436.Bocquet M. 2011. Ensemble Kalman filtering withoutthe intrinsic need for inflation.
Nonlinear Processes inGeophysics (5): 735–750.Bocquet M. 2016. Localization and the iterative ensembleKalman smoother. Quarterly Journal of the RoyalMeteorological Society (695): 1075–1089.Bocquet M, Carrassi A. 2017. Four-dimensional ensemblevariational data assimilation and the unstable subspace.
Tellus A: Dynamic Meteorology and Oceanography (1): 1304504.
Bocquet M, Gurumoorthy KS, Apte A, Carrassi A, Grudzien C, Jones CKRT. 2017. Degenerate Kalman filter error covariances and their convergence onto the unstable subspace. SIAM/ASA Journal on Uncertainty Quantification (1): 304–333.
Bocquet M, Raanes PN, Hannart A. 2015. Expanding the validity of the ensemble Kalman filter without the intrinsic need for inflation. Nonlinear Processes in Geophysics (6): 645–662.
Bocquet M, Sakov P. 2012. Combining inflation-free and iterative ensemble Kalman filters for strongly nonlinear systems. Nonlinear Processes in Geophysics (3): 383–399.
Bocquet M, Sakov P. 2013. Joint state and parameter estimation with an iterative ensemble Kalman smoother. Nonlinear Processes in Geophysics (5): 803–818.
Brankart JM, Cosme E, Testut CE, Brasseur P, Verron J. 2010. Efficient adaptive error parameterizations for square root or ensemble Kalman filters: application to the control of ocean mesoscale signals. Monthly Weather Review (3): 932–950.
Chapnik B, Desroziers G, Rabier F, Talagrand O. 2006. Diagnosis and tuning of observational error in a quasi-operational data assimilation setting.
Quarterly Journalof the Royal Meteorological Society (615): 543–565.Daley R. 1992. Estimating model-error covariances forapplication to atmospheric data assimilation.
MonthlyWeather Review (8): 1735–1746.Dee D, Cohn S, Dalcher A, Ghil M. 1985. An efficientalgorithm for estimating noise covariances in distributedsystems.
IEEE transactions on automatic control (11): 1057–1065.Dee DP. 1995. On-line estimation of error covarianceparameters for atmospheric data assimilation. MonthlyWeather Review (4): 1128–1145.Dee DP, Da Silva AM. 1999. Maximum-likelihoodestimation of forecast and observation error covarianceparameters. Part I: Methodology.
Monthly WeatherReview (8): 1822–1834.Desroziers G, Berre L, Chapnik B, Poli P. 2005. Diagnosisof observation, background and analysis-error statisticsin observation space.
Quarterly Journal of the RoyalMeteorological Society (613): 3385–3396.Desroziers G, Ivanov S. 2001. Diagnosis and adaptivetuning of observation-error parameters in a variationalassimilation.
Quarterly Journal of the Royal Meteoro-logical Society (574): 1433–1452.Dreano D, Tandeo P, Pulido M, Ait-El-Fquih B,Chonavel T, Hoteit I. 2017. Estimating model-error covariances in nonlinear state-space modelsusing Kalman smoothing and the expectation–maximization algorithm.
Quarterly Journal of the RoyalMeteorological Society (705): 1877–1885.Durbin J, Koopman SJ. 2012.
Time series analysis bystate space methods , vol. 38. OUP Oxford.Evensen G. 2003. The ensemble Kalman filter: Theoreticalformulation and practical implementation.
OceanDynamics (4): 343–367.Evensen G. 2009a. Data assimilation . Springer, 2 edn.Evensen G. 2009b. The ensemble Kalman filter forcombined state and parameter estimation.
ControlSystems, IEEE (3): 83–104.Fernandez C, Steel MFJ. 1999. Multivariate Student-tregression models: Pitfalls and inference. Biometrika (1): 153–167.Fitzgerald R. 1971. Divergence of the Kalman filter. Automatic Control, IEEE Transactions on (6): 736–747.Frei M, Künsch HR. 2012. Sequential state andobservation noise covariance estimation using combinedensemble Kalman and particle filters. Monthly WeatherReview (5): 1476–1495. Furrer R, Bengtsson T. 2007. Estimation of high-dimensional prior and posterior covariance matrices inKalman filter variants.
Journal of Multivariate Analysis (2): 227–255.Gelman A, Carlin JB, Stern HS, Rubin DB. 2004. Bayesian data analysis . Texts in Statistical ScienceSeries, Chapman & Hall/CRC, Boca Raton, FL, secondedn.Geweke J. 1993. Bayesian treatment of the independentstudent-t linear model.
Journal of applied econometrics (S1).Gharamti ME. 2018. Enhanced adaptive inflationalgorithm for ensemble filters. Monthly Weather Review (2): 623–640.Goutis C, Casella G. 1999. Explaining the saddlepointapproximation.
The American Statistician (3): 216–224.Grudzien C, Carrassi A, Bocquet M. 2018. Chaoticdynamics and the role of covariance inflation for reducedrank Kalman filters with model error. NonlinearProcesses in Geophysics : 1–25.Hamill TM, Whitaker JS, Snyder C. 2001. Distance-dependent filtering of background error covarianceestimates in an ensemble Kalman filter.
MonthlyWeather Review (11): 2776–2790.Haussaire JM. 2017. Méthodes variationnelles d’ensembleitératives pour l’assimilation de données non-linéaire:Application au transport et la chimie atmosphérique.PhD thesis, Université Paris-Est.Houtekamer PL, Mitchell HL. 1998. Data assimilationusing an ensemble Kalman filter technique.
MonthlyWeather Review (3): 796–811.Hunt BR, Kalnay E, Kostelich EJ, Ott E, Patil DJ,Sauer T, Szunyogh I, Yorke JA, Zimin AV. 2004. Four-dimensional ensemble Kalman filtering.
Tellus A (4):273–277.Hunt BR, Kostelich EJ, Szunyogh I. 2007. Efficientdata assimilation for spatiotemporal chaos: A localensemble transform Kalman filter. Physica D: NonlinearPhenomena (1): 112–126.Jaynes ET. 2003.
Probability theory: the logic of science .Cambridge university press.Kotsuki S, Ota Y, Miyoshi T. 2017. Adaptive covariancerelaxation methods for ensemble data assimilation:Experiments in the real atmosphere.
Quarterly Journal of the Royal Meteorological Society.
Le Gland F, Monbet V, Tran VD. 2009. Large sample asymptotics for the ensemble Kalman filter. Research Report RR-7014, INRIA.
Lewis JM, Lakshmivarahan S, Dhall S. 2006.
Dynamicdata assimilation: A least squares approach , vol. 13.Cambridge University Press.Li H, Kalnay E, Miyoshi T. 2009. Simultaneous estimationof covariance inflation and observation errors within anensemble Kalman filter.
Quarterly Journal of the RoyalMeteorological Society (639): 523–533.Liang X, Zheng X, Zhang S, Wu G, Dai Y, Li Y. 2012.Maximum likelihood estimation of inflation factorson error covariance matrices for ensemble Kalmanfilter assimilation.
Quarterly Journal of the RoyalMeteorological Society (662): 263–273.Lorenz EN. 1963. Deterministic nonperiodic flow.
Journalof the Atmospheric Sciences (2): 130–141.Lorenz EN. 1996. Predictability: A problem partly solved.In: Proc. ECMWF Seminar on Predictability , vol. 1.Reading, UK, pp. 1–18.Lorenz EN. 2005. Designing chaotic models.
Journal ofthe Atmospheric Sciences (5): 1574–1587.Mandel J, Cobb L, Beezley JD. 2011. On theconvergence of the ensemble Kalman filter. Applicationsof Mathematics (6): 533–541.Mehra R. 1972. Approaches to adaptive filtering. IEEETransactions on automatic control (5): 693–698.Ménard R. 2016. Error covariance estimation methodsbased on analysis residuals: theoretical foundationand convergence properties derived from simplifiedobservation networks. Quarterly Journal of the RoyalMeteorological Society (694): 257–273.Ménard R, Cohn SE, Chang LP, Lyster PM. 2000. Assim-ilation of stratospheric chemical tracer observationsusing a Kalman filter. Part I: Formulation.
MonthlyWeather Review : 2654–2671.Mitchell HL, Houtekamer PL. 2000. An adaptive ensembleKalman filter.
Monthly Weather Review (2): 416–433.Mitchell L, Carrassi A. 2015. Accounting for modelerror due to unresolved scales within ensemble Kalmanfiltering.
Quarterly Journal of the Royal MeteorologicalSociety (689): 1417–1428.Miyoshi T. 2011. The Gaussian approach to adaptivecovariance inflation and its implementation with thelocal ensemble transform Kalman filter.
MonthlyWeather Review (5): 1519–1535.Miyoshi T, Kalnay E, Li H. 2013. Estimating and includ-ing observation-error correlations in data assimilation.
Inverse Problems in Science and Engineering (3): 387–398.
Mohamed AH, Schwarz KP. 1999. Adaptive Kalman filtering for INS/GPS. Journal of Geodesy (4): 193–203.
Muirhead RJ. 1982. Aspects of Multivariate Statistical Theory. John Wiley & Sons, Inc., New York. Wiley Series in Probability and Mathematical Statistics.
Myrseth I, Omre H. 2010. Hierarchical ensemble Kalman filter. SPE Journal (02): 569–580.
Nakabayashi A, Ueno G. 2017. An extension of the ensemble Kalman filter for estimating the observation error covariance matrix based on the variational Bayes's method. Monthly Weather Review (1): 199–213.
Automatica (6): 1566–1575.Palatella L, Trevisan A. 2015. Interaction of Lyapunovvectors in the formulation of the nonlinear extension ofthe Kalman filter. Physical Review E (4): 042 905.Pham DT, Verron J, Roubaud MC. 1998. A singularevolutive extended Kalman filter for data assimilation inoceanography. Journal of Marine systems (3): 323–340.Pulido M, Tandeo P, Bocquet M, Carrassi A, Lucini M.2018. Stochastic parameterization identification usingensemble Kalman filtering combined with maximumlikelihood methods. Tellus A: Dynamic Meteorology andOceanography (1): 1442 099.Raanes PN. 2016. Improvements to ensemble methods fordata assimilation in the geosciences. PhD thesis, Uni-versity of Oxford. https://ora.ox.ac.uk/objects/uuid:9f9961f0-6906-4147-a8a9-ca9f2d0e4a12 .Raanes PN, Carrassi A, Bertino L. 2015. Extending thesquare root method to account for model noise inthe ensemble Kalman filter. Monthly Weather Review (10): 3857–3873.Roberts GO, Rosenthal JS. 2001. Infinite hierarchies andprior distributions.
Bernoulli : 453–471.Roth M, Ardeshiri T, Özkan E, Gustafsson F. 2017.Robust Bayesian Filtering and Smoothing UsingStudent’s t Distribution.
ArXiv e-prints .Sacher W, Bartello P. 2008. Sampling errors in ensembleKalman filtering. Part I: Theory.
Monthly WeatherReview (8): 3035–3049.Sakov P, Bertino L. 2011. Relation between two commonlocalisation methods for the EnKF.
Computational Geosciences (2): 225–237.
Sakov P, Oke PR. 2008. Implications of the form of the ensemble transformation in the ensemble square root filters. Monthly Weather Review (3): 1042–1053.
Sarkka S, Hartikainen J. 2013. Non-linear noise adaptive Kalman filtering via variational Bayes. In:
MachineLearning for Signal Processing (MLSP), 2013 IEEEInternational Workshop on . IEEE, pp. 1–6.Sarkka S, Nummenmaa A. 2009. Recursive noiseadaptive Kalman filtering by variational Bayesianapproximations.
IEEE Transactions on AutomaticControl (3): 596–600.Snyder C. 2012. Introduction to the Kalman filter , ch. 3.Advanced Data Assimilation for Geosciences: LectureNotes of the Les Houches School of Physics: SpecialIssue, Oxford University Press, pp. 75–120.Sommer M, Janjić T. 2017. A flexible additive inflationscheme for treating model error in ensemble Kalmanfilters.
Quarterly Journal of the Royal MeteorologicalSociety .Stordal AS, Elsheikh AH. 2015. Iterative ensemblesmoothers in the annealed importance samplingframework.
Advances in Water Resources : 231–239.Storvik G. 2002. Particle filters for state-space modelswith the presence of unknown static parameters. IEEETransactions on signal Processing (2): 281–289.Stroud JR, Bengtsson T. 2007. Sequential state andvariance estimation within the ensemble Kalman filter. Monthly Weather Review (9): 3194–3208.Stroud JR, Katzfuss M, Wikle CK. 2018. A Bayesianadaptive ensemble Kalman filter for sequential state andparameter estimation.
Monthly Weather Review (1):373–386.Takemura A. 1984. An orthogonally invariant minimaxestimator of the covariance matrix of a multivariatenormal population.
Tsukuba journal of mathematics (2): 367–376.Tsyrulnikov M, Rakitko A. 2015. Hierarchical Bayesensemble Kalman filtering. Physica D: NonlinearPhenomena .Ueno G, Higuchi T, Kagimoto T, Hirose N. 2010.Maximum likelihood estimation of error covariances inensemble-based filters and its application to a coupledatmosphere–ocean model.
Quarterly Journal of theRoyal Meteorological Society (650): 1316–1343.Ueno G, Nakamura N. 2016. Bayesian estimation of theobservation-error covariance matrix in ensemble-basedfilters.
Quarterly Journal of the Royal MeteorologicalSociety (698): 2055–2080.van der Vaart HR. 1961. On certain characteristics ofthe distribution of the latent roots of a symmetricrandom matrix under general conditions.
The Annalsof Mathematical Statistics : 864–873. van Leeuwen PJ. 1999. Comment on “Data assimilationusing an ensemble Kalman filter technique”.
MonthlyWeather Review (6): 1374–1377.Wang X, Bishop CH. 2003. A comparison of breedingand ensemble transform Kalman filter ensemble forecastschemes.
Journal of the Atmospheric Sciences (9):1140–1158.Whitaker JS, Hamill TM. 2012. Evaluating methods toaccount for system errors in ensemble data assimilation. Monthly Weather Review (9): 3078–3089.Wilks DS. 2005. Effects of stochastic parametrizations inthe Lorenz’96 system.
Quarterly Journal of the RoyalMeteorological Society (606): 389–407.Wilks DS. 2011.
Statistical methods in the atmosphericsciences , vol. 100. Academic Press.Wu L, Bocquet M, Chevallier F, Lauvaux T, DavisK. 2013. Hyperparameter estimation for uncertaintyquantification in mesoscale carbon dioxide inversions.
Tellus B: Chemical and Physical Meteorology (1):20 894, doi:10.3402/tellusb.v65i0.20894.Ying Y, Zhang F. 2015. An adaptive covariance relaxationmethod for ensemble data assimilation. QuarterlyJournal of the Royal Meteorological Society (692):2898–2906.Zheng X. 2009. An adaptive estimation of forecasterror covariance parameters for Kalman filtering dataassimilation.
Advances in Atmospheric Sciences.