What limits the number of observations that can be effectively assimilated by EnKF?∗

Daisuke Hotta†‡§ and Yoichiro Ota‡

June 2, 2020
Abstract
The ability of ensemble Kalman filter (EnKF) algorithms to extract information from observations is analyzed with the aid of the concept of the degrees of freedom for signal (DFS). A simple mathematical argument shows that the DFS for EnKF is bounded from above by the ensemble size, which entails that assimilating many more observations than the ensemble size automatically leads to DFS underestimation. Since DFS is the trace of the posterior error covariance mapped onto the normalized observation space, underestimated DFS implies overconfidence (underdispersion) in the analysis spread, which, in a cycled context, requires covariance inflation to be applied. The theory is then extended to cases where covariance localization schemes (either B-localization or R-localization) are applied, to show how they alleviate the DFS underestimation issue. These findings from the mathematical argument are demonstrated with a simple one-dimensional covariance model. Finally, the DFS concept is used to form speculative arguments about how to interpret several puzzling features of LETKF previously reported in the literature, such as why using fewer observations can lead to better performance, when optimal localization scales tend to occur, and why covariance inflation methods based on the relaxation-to-prior-information approach are particularly successful when observations are inhomogeneously distributed. A presumably first application of DFS diagnostics to a quasi-operational global EnKF system is presented in the Appendix.

∗ Submitted to QJRMS
† Meteorological Research Institute, Japan Meteorological Agency, Tsukuba, Japan
‡ Numerical Prediction Division, Japan Meteorological Agency, Tokyo, Japan
§ European Centre for Medium-range Weather Forecasts, Reading, UK

1 Introduction

The number of observations that are available to operational numerical weather prediction (NWP) systems has increased dramatically over the last several decades. This increasing trend, largely driven by advances in remote-sensing technology, is envisaged to continue in the near future thanks to incoming meteorological Big Data such as measurements from phased-array weather radars (Miyoshi et al., 2016) and satellite-based hyper-spectral soundings. A challenge in data assimilation (DA) development that is becoming increasingly relevant today is thus to effectively exploit the ever-increasing volume of observations that are becoming denser and more frequent. More specifically, we need to be able to extract as much information as possible from such observations. This is not an easy task, even for advanced DA algorithms such as the ensemble Kalman filter (EnKF), partly because these algorithms are built upon the assumption that the number of observations to be assimilated is orders of magnitude smaller than the degrees of freedom of the state space, an assumption that was perfectly legitimate when these algorithms were devised several decades ago but that may need to be revisited given the current explosive increase in the volume of observations.

In this context, it is important to quantify the amount of information that a DA system can extract from observations. Several criteria for quantifying the amount of information have been proposed in the past, and degrees of freedom for signal (DFS), or information content, is an example of such measures.
The theory of DFS was developed in the statistics literature for general inverse problems (Wahba et al., 1995; Rodgers, 2000) and has been adapted to address many problems that arise in NWP (e.g., Purser and Huang, 1993; Rabier et al., 2002; Bocquet et al., 2011). For linear Gaussian least-square DA schemes, DFS is defined as the trace of the "influence matrix" $S = (HK)^T$, where $H$ and $K$ denote, respectively, the Jacobian of the observation operator and the Kalman gain matrix (see Section 2 for details). For DA algorithms suitable for systems with large state-vector dimensions, like 4DVar or EnKF, the Kalman gain $K$ is usually not explicitly constructed, so evaluation of DFS is not straightforward. Several ways to approximately compute DFS for variational DA systems at a feasible computational cost have been proposed (Cardinali, 2004; Fisher, 2003; Chapnik et al., 2006; Lupu et al., 2011), and DFS has now become a standard diagnostic for assessing the impact of different instruments or observing platforms in 4DVar-based operational DA systems.

In the case of EnKF, Liu et al. (2009) showed that DFS can be easily computed as long as the analysis perturbations projected onto observation space are available. However, in contrast to variational DA systems, to the authors' best knowledge, DFS has not yet been applied to operational EnKF-based DA systems.

As we report in the Appendix, we have applied DFS diagnostics to the quasi-operational version of the Japan Meteorological Agency (JMA)'s global EnKF-based DA system to examine how much information our analysis extracts from each type of observation. The diagnostics revealed, intriguingly, that whilst our EnKF system extracts a reasonable amount of information from relatively sparse observations such as conventional ground-based observations (SYNOP) and radiosondes (TEMP), showing per-observation DFS comparable to that from 4DVar, it retrieves far less information from dense observations such as satellite radiances (hyper-spectral soundings from IASI and AIRS in particular), with DFS an order of magnitude smaller than in 4DVar. The theoretical arguments and their demonstrations using a simple covariance model presented in this manuscript grew out of our attempt to understand the reason behind this problem. As we will show later, an argument based on DFS allows us to clearly illustrate the relevance of the ensemble size (or the effective rank of the background error covariance matrix) in effectively extracting observational information in scenarios where a large volume of observations is available.

The purpose of this paper is to analyze the properties of the Ensemble Transform Kalman Filter (ETKF; Bishop et al., 2001; Wang et al., 2004) and its local variant (LETKF; Hunt et al., 2007) in light of the concept of DFS. We will show, without using anything beyond elementary linear algebra, that DFS can be used to quantitatively describe the well-known (but vaguely defined) "rank deficiency" issue that an EnKF system suffers when the ensemble size is not sufficient.
The theory developed here not only has direct relevance to the important question of how many ensemble members we need to effectively assimilate a given set of observations, but also bears some interesting implications for covariance localization and inflation, the two crucial components of EnKF without which the algorithm does not work but which are often subject to manual tuning due to our lack of understanding.

The rest of the paper is structured as follows: Section 2 reviews the theory of DFS and shows how it is related to the singular values of the "observability matrix" $R^{-1/2}HB^{1/2}$ (i.e., the square root of the background error covariance matrix measured in observation space, normalized by the inverse square root of the observation error covariance matrix). Section 3 then applies this theory to ETKF and proves that the DFS that ETKF can attain is limited from above by the ensemble size, which entails that the analysis becomes suboptimal whenever the ensemble size is smaller than the DFS that would be attained by an optimal analysis. Section 3 also discusses the implications of this DFS underestimation for covariance localization and inflation. Section 4 illustrates the findings given in the preceding section through a series of idealized experiments using a simple one-dimensional covariance model. Section 5 provides several speculative discussions as to how the logic developed in this paper can be used to interpret some puzzling (or counter-intuitive) results previously reported in the literature, followed by conclusions in Section 6.

2 Degrees of Freedom for Signal (DFS)
2.1 Definition and interpretation

DFS is a measure of how much information an analysis has retrieved from observations (e.g., Rodgers, 2000). For a linear Gaussian DA scheme, DFS is defined as the trace of the influence matrix $K^TH^T$:

$$\mathrm{DFS} = \mathrm{tr}\, K^TH^T = \mathrm{tr}\, HK \qquad (1)$$

where $\mathrm{tr}$ denotes the trace of a matrix, $H$ denotes the linear observation operator and $K$ denotes the Kalman gain matrix. Since the analysis projected onto the observation space is

$$y^a = Hx^a = Hx^b + HK\left(y^o - Hx^b\right) \qquad (2)$$

where $x^b$ denotes the background mean, DFS can also be expressed as

$$\mathrm{DFS} = \mathrm{tr}\left(\frac{\partial y^a}{\partial y^o}\right) = \sum_{i=1}^{p} \frac{\partial y^a_i}{\partial y^o_i} \qquad (3)$$

where $p$ denotes the number of assimilated observations. Equation 3 allows us to interpret DFS as the sensitivity of the analysis with respect to the observations. Notably, the total DFS can be partitioned into contributions from each observation (indexed by $i$), and the contribution from each observation takes the form of a "self-sensitivity."

The nature of DFS is better understood if we work in the singular space of the observability matrix $R^{-1/2}HB^{1/2}$ (Johnson et al., 2005), where $R$ and $B$ denote, respectively, the observation and background error covariance matrices. In the canonical Kalman filter (KF), the analysis error covariance matrix $A$ and the Kalman gain $K$ are related by

$$K = AH^TR^{-1} \qquad (4)$$

so that

$$H'AH'^T := R^{-1/2}HAH^TR^{-1/2} = R^{-1/2}HKR^{1/2} \qquad (5)$$

where we have introduced the normalized observation operator $H' := R^{-1/2}H$ to avoid cluttered notation. Recalling that the Kalman gain $K$ can also be expressed as

$$K = BH^T\left(HBH^T + R\right)^{-1}, \qquad (6)$$

Equation 5 can be expressed as

$$H'AH'^T = \left(H'BH'^T\right)\left\{\left(H'BH'^T\right) + I_p\right\}^{-1} \qquad (7)$$

where $I_p$ denotes the identity matrix of size $p \times p$. Now, introducing the singular value decomposition of the observability matrix $H'B^{1/2}$:

$$H'B^{1/2} = U\Sigma^bV^T, \qquad (8)$$

the background error covariance matrix projected onto the normalized observation space, $H'BH'^T$, is eigendecomposed as

$$H'BH'^T = U\Lambda^bU^T, \qquad (9)$$

which, plugged into Equation 7, yields the eigendecomposed expression of $H'AH'^T$ as

$$H'AH'^T = U\Lambda^aU^T \qquad (10)$$

where the diagonal matrix $\Lambda^a$ is defined later (see Equation 12). Denoting the $i$-th diagonal element of the matrix $\Sigma^b$ by $\sigma^b_i$, which are the singular values of the observability matrix, and assuming that they are sorted from the largest to the smallest, the $i$-th diagonal element of the diagonal matrix $\Lambda^b = \Sigma^b\Sigma^{bT}$ is

$$\lambda^b_i = \begin{cases} (\sigma^b_i)^2 & (i \le r) \\ 0 & (i > r) \end{cases} \qquad (11)$$

where $r$ is the rank of the observability matrix $H'B^{1/2}$. The eigenvalues $\lambda^a_i$ of $H'AH'^T$ are then related to the eigenvalues $\lambda^b_i$ of $H'BH'^T$ by

$$\lambda^a_i = \frac{\lambda^b_i}{\lambda^b_i + 1}. \qquad (12)$$

Equations 9–12 are very helpful in understanding why DFS is called as such. From Equation 9 we see that, in the space spanned by the columns of $U$, the state can vary statistically independently in each direction, and the background error has variance $\lambda^b_i$ in each direction. Each direction corresponds to one "degree of freedom" since it can vary independently of the others. Equation 10 means that, in this space, the error variance $\lambda^b_i$ in the $i$-th direction is reduced by a factor of $\lambda^a_i/\lambda^b_i$ by assimilating observations. From Equation 12, this factor can be expressed as $\lambda^a_i/\lambda^b_i = 1/(\lambda^b_i + 1)$.
Now, if $\lambda^b_i$ is much greater than one ($\lambda^b_i \gg 1$), the error variance in the $i$-th direction is reduced by a large fraction ($\lambda^a_i/\lambda^b_i \ll 1$), meaning that the uncertainty in this direction is well constrained by the observations. Such a direction can be considered as representing one "degree of freedom for signal." Conversely, if $\lambda^b_i$ is close to zero ($\lambda^b_i \ll 1$), the error variance in that direction is hardly reduced ($\lambda^a_i/\lambda^b_i \approx 1$), meaning that this direction is virtually not observed at all. Such a direction can be considered as representing one "degree of freedom for noise" (or equivalently, zero "degrees of freedom for signal"). The eigenvalue $\lambda^a_i$ of $H'AH'^T$ has the interesting property of approaching one if $\lambda^b_i$ is large ($\lambda^a_i \to 1$ as $\lambda^b_i \to \infty$; i.e., the $i$-th direction represents a degree of freedom for signal) and approaching zero if $\lambda^b_i$ is close to zero ($\lambda^a_i \to 0$ as $\lambda^b_i \to 0$; i.e., the $i$-th direction represents a degree of freedom for noise). It is then sensible to define the total "degrees of freedom for signal" (DFS) as the sum of all the $\lambda^a_i$'s, and this in fact agrees with the exact definition of DFS. To see this, recall that the trace of a product of matrices is invariant under cyclic permutation, and use Equations 1, 5 and 10 to derive

$$\mathrm{DFS} = \mathrm{tr}\, HK = \mathrm{tr}\, R^{-1/2}HKR^{1/2} = \mathrm{tr}\, H'AH'^T = \mathrm{tr}\, U\Lambda^aU^T = \mathrm{tr}\, \Lambda^a = \sum_{i=1}^{r} \lambda^a_i, \qquad (13)$$

where, in the penultimate equality, we have used the fact that $U^TU$ is a unit matrix. A more detailed discussion along this line can be found in Fisher (2003).

2.2 Upper bounds of DFS

Equation 13 leads to obvious upper bounds of DFS. Recalling that the $\lambda^b_i = (\sigma^b_i)^2$ are all non-negative, it follows immediately from Equation 12 that

$$0 \le \lambda^a_i < 1, \qquad (14)$$

from which it follows that

$$\mathrm{DFS} = \sum_{i=1}^{r} \lambda^a_i < r = \mathrm{rank}\, R^{-1/2}HB^{1/2} = \min\{\mathrm{rank}\, R, \mathrm{rank}\, H, \mathrm{rank}\, B\} \qquad (15)$$
$$= \min\{\text{number of observations},\ \text{dimension of state space}\}. \qquad (16)$$

These upper bounds may seem self-evident as long as we deal with an optimal DA scheme where both the observation and background error covariance matrices, $R$ and $B$, are perfectly prescribed (i.e., identical to the true ones) and all the matrix operations are performed exactly. However, as we will show in the next section, these simple upper bounds become relevant and can provide meaningful insight when we analyze practical algorithms which compromise optimality for affordable computational complexity.
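Since Equations 1–16 involve only standard matrix operations, they are easy to verify numerically. The following minimal Python/NumPy sketch (our own illustration; the random test problem and all variable names are assumptions, not taken from the paper) computes DFS both directly as $\mathrm{tr}\,HK$ and from the singular values of the observability matrix, and checks the upper bound of Equation 16:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 30                       # state dimension, number of observations

# Synthetic symmetric positive-definite B, diagonal R, random linear H.
G = rng.standard_normal((n, n))
B = G @ G.T / n + 0.5 * np.eye(n)
R = np.diag(rng.uniform(0.5, 2.0, size=p))
H = rng.standard_normal((p, n))

# Kalman gain (Equation 6) and DFS = tr(HK) (Equation 1).
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
dfs_direct = np.trace(H @ K)

# Same number from the singular values of the observability matrix
# R^{-1/2} H B^{1/2} (Equations 8-13): DFS = sum of lam_b / (1 + lam_b).
R_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(R)))
B_sqrt = np.linalg.cholesky(B)
sigma_b = np.linalg.svd(R_inv_sqrt @ H @ B_sqrt, compute_uv=False)
lam_b = sigma_b**2
dfs_svd = np.sum(lam_b / (1.0 + lam_b))

print(f"DFS = tr(HK)          : {dfs_direct:.6f}")
print(f"DFS from singular vals: {dfs_svd:.6f}")      # agrees with the above
print(f"upper bound min(p, n) : {min(p, n)}")        # Equation 16
```

The two numbers agree to rounding error, and each contribution $\lambda^b_i/(1+\lambda^b_i)$ is strictly below one, so the total always stays below $\min\{p, n\}$, as Equations 14–16 assert.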
3 DFS for ETKF

In this section, we apply the theory of DFS just outlined above to ETKF (Bishop et al., 2001; Wang et al., 2004), a deterministic variant of EnKF, and show that the DFS attained with this scheme can never exceed the ensemble size. The same discussion also applies to each local analysis of LETKF (Hunt et al., 2007), a local variant of ETKF. This seemingly simple fact, the authors believe, has many important implications, as we discuss later. In this paper we focus on ETKF and its variants, but the results should be valid for other implementations of square-root filters, such as the Ensemble Adjustment Kalman Filter (EAKF; Anderson, 2001) and the serial ensemble square-root filter (EnSRF; Whitaker and Hamill, 2002), since these algorithms, when performed without localization, have been shown to result in the same posterior mean and the same error space spanned by the posterior perturbations (Tippett et al., 2003).

3.1 Upper bound of DFS for ETKF

We consider the ETKF algorithm, or a local assimilation step of LETKF, and for now ignore covariance localization. In ETKF or LETKF, as with any EnKF algorithm, the background error covariance matrix $B$ is approximated using $K$ members of ensemble forecasts, $x^b_i,\ i = 1, \cdots, K$, as

$$B^{\mathrm{ens}} = \frac{1}{K-1}X^bX^{bT} \qquad (17)$$

where $K$ is the ensemble size and $X^b$ is the matrix of background perturbations defined as $X^b = [\delta x^b_1, \delta x^b_2, \cdots, \delta x^b_K]$, where $\delta x^b_i = x^b_i - \bar{x}^b,\ i = 1, \cdots, K$ are the $N$-dimensional vectors of background perturbations, $N$ is the dimension of the state space, and $\bar{x}^b = (1/K)\sum_{i=1}^{K}x^b_i$ is the ensemble mean of the background state.

The discussion given in Section 2 remains valid for ETKF and for each local analysis of LETKF (in fact, these algorithms rely on Equation 8 in computing $A$ and $K$), so that Equation 15 implies that the DFS for these algorithms (denoted $\mathrm{DFS}^{\mathrm{ens}}$ hereafter) is bounded from above by $K - 1$:

$$\mathrm{DFS}^{\mathrm{ens}} < K - 1 \qquad (18)$$

since $\mathrm{rank}\, B^{\mathrm{ens}} = \mathrm{rank}\, X^b \le K - 1$. If we define the optimal DFS (denoted $\mathrm{DFS}^{\mathrm{opt}}$ hereafter) as the DFS that would be attained if the analysis were performed optimally with the canonical KF, the DFS for ETKF (or a local analysis of LETKF) is unavoidably underestimated whenever $\mathrm{DFS}^{\mathrm{opt}} > K - 1$.

In a practical NWP setup, the underestimation of DFS is quite likely; for example, in the global LETKF of JMA (see the Appendix), the ensemble size $K$ is only 50, while the number of observations that are assimilated locally is typically $O(10^3)$ or even greater, so that $\mathrm{DFS}^{\mathrm{opt}}$ locally should be $O(10^2)$ or more, which is much larger than $K$. The implications of this DFS underestimation are discussed in more detail below.

The underestimation of DFS (i.e., $\mathrm{DFS}^{\mathrm{ens}} < \mathrm{DFS}^{\mathrm{opt}}$) means, by definition, that the analysis is not fully extracting information from observations. More specifically, since the analysis increment projected onto the observation space is $HKd$, where $d$ is the O-B departure $y^o - Hx^b$, an underestimated $\mathrm{DFS}^{\mathrm{ens}} = \mathrm{tr}\, HK^{\mathrm{ens}}$ (where $K^{\mathrm{ens}}$ denotes the Kalman gain used in EnKF) suggests that the analysis increment (or the correction of the background by the observations) is likely smaller than it should be under optimality.

A more important consequence of DFS underestimation is that the analysis becomes overconfident (or equivalently, the ensemble becomes underdispersive). This is because DFS coincides with (the square of) the analysis spread measured in the normalized observation space, since $\mathrm{DFS} = \mathrm{tr}\, H'AH'^T\ (= \mathrm{tr}\, R^{-1/2}HAH^TR^{-1/2})$ (see Equation 13). Consider a situation where we assimilate $p$ independent observations, each of which is as accurate as its counterpart from the background, in which case the DFS that should be attained under optimality should be of comparable order of magnitude to the number of assimilated observations: $\mathrm{DFS}^{\mathrm{opt}} = c \times p$ with some positive $c \sim O(0.5)$ (see the footnote in Section 3.1 for a rationale behind this rough estimate).¹ Note here that the DFS attained under optimality is also the posterior error variance (measured in the normalized observation space) that the optimal analysis attains:

$$\mathrm{DFS}^{\mathrm{opt}} = \mathrm{E}\left\|H'\left(x^{a,\mathrm{opt}} - x^{\mathrm{true}}\right)\right\|^2 \qquad (19)$$

where $\mathrm{E}(\cdot)$ denotes statistical expectation, $x^{a,\mathrm{opt}}$ denotes the posterior state obtained by an optimal analysis, and $x^{\mathrm{true}}$ denotes the true state. Now, if we assimilate such observations with an ETKF whose ensemble size $K$ is orders of magnitude smaller than the number of observations $p$ (i.e., $K \ll c \times p = \mathrm{DFS}^{\mathrm{opt}}$), it follows from Equation 18 that

$$\mathrm{tr}\, H'A^{\mathrm{ens}}H'^T = \mathrm{DFS}^{\mathrm{ens}} \ll \mathrm{DFS}^{\mathrm{opt}} = \mathrm{E}\left\|H'\left(x^{a,\mathrm{opt}} - x^{\mathrm{true}}\right)\right\|^2 < \mathrm{E}\left\|H'\left(x^{a,\mathrm{ens}} - x^{\mathrm{true}}\right)\right\|^2 \qquad (20)$$

where $x^{a,\mathrm{ens}}$ denotes the ensemble mean of the posterior state, and the rightmost inequality follows from the assumption that an ETKF with a limited ensemble size should result in an analysis inferior to that of the optimal KF. Equation 20 means that the squared posterior spread, which is the estimated posterior error variance assumed by the ETKF, is much smaller than the true error variance of the posterior mean (or equivalently, the posterior ensemble is underdispersive).

¹ Here we assume that the observations are of comparable accuracy to the background. In such a case, the observations and the background are roughly equally informative, so it should be legitimate to assume, from Equation 3, that the per-observation DFS is, on average, not too different from one half, which entails that $\mathrm{DFS}^{\mathrm{opt}}$ should be of the same order of magnitude as half the number of observations.

The overconfidence of the analysis (or the underdispersion of the posterior ensemble) can accumulate over cycles and can eventually lead to filter divergence. Various covariance inflation techniques have been proposed and are employed to counteract it. The discussion given in the paragraph above suggests that very strong covariance inflation is required if the ensemble size $K$ is much smaller than $\mathrm{DFS}^{\mathrm{opt}}$, but covariance inflation that is too strong is undesirable because it would ruin the EnKF's ability to represent the "errors of the day," which is one of the most appealing aspects of EnKF algorithms.

3.2 Implications for the required ensemble size

We can also infer that a small ensemble size $K$ can be tolerated if $\mathrm{DFS}^{\mathrm{opt}}$ is small. Such a situation can happen, for example, when

• observations are much less accurate than the background (so that all singular values of the true observability matrix $H'B^{\mathrm{true}1/2}$ become much smaller than one), or

• the singular spectrum of the true observability matrix is dominated by a small number of large values, which occurs if the dynamical system has a small number of growing modes so that the error space (locally) has low unstable dimensionality.

Conversely, given an ensemble size $K$, we can avoid the underdispersion by limiting the number of observations to assimilate to $K$ times some factor of $O(1)$ by, say, some form of thinning, or by choosing a tighter domain localization (in the case of LETKF) for areas that abound with observations. Such measures come at the price of discarding some pieces of information from observations, but have nevertheless been shown by a number of previous studies to be very effective in practice, as we discuss in Section 5.1.

3.3 Impact of covariance localization on DFS

In the previous subsections, we deliberately deferred discussing the impact of covariance localization on DFS; this subsection is devoted to exploring this issue.

In the context of EnKF, covariance localization has traditionally been explained as serving two different (but related) functions: one is to suppress spurious correlations that appear in covariance matrices because of sampling noise, and the other is to mitigate the so-called "rank issue" (or "rank deficiency issue"), roughly defined as any issue that arises from the ensemble-derived background covariance $B^{\mathrm{ens}}$ not being full-rank (e.g., Houtekamer and Zhang, 2016). Arguments based on the DFS concept help us to quantitatively assess the latter (i.e., how localization mitigates the rank issue).

Covariance localization schemes generally applied in EnKF algorithms can be classified into two types, depending on whether they operate on the observation error covariance matrix $R$ or on the background error covariance matrix $B$ (Greybush et al., 2011). The former (called "R-localization" hereafter) is typically used with LETKF.
The latter (called "B-localization" hereafter) can be further split into "observation-space B-localization," which operates on $BH^T$ and $HBH^T$ in computing the gain $K$, and "model-space B-localization," which operates directly on $B$. In this paper we focus on the difference between R-localization and model-space B-localization.

R-localization implements localization by inflating the error variance for observations that are far from the analyzed grid point. This can be realized by replacing $R$ with $R^{\mathrm{loc}} = L_gRL_g$ in each local analysis, where $L_g$ is a $p \times p$ diagonal matrix whose diagonal entries are the values of some increasing function (typically the inverse square root of the Gaussian function) of the distance between the analyzed grid point and the location of the observation. Simply replacing $R$ with $R^{\mathrm{loc}}$ has no effect on the upper bound on $\mathrm{DFS}^{\mathrm{ens}}$ given in Equation 18, suggesting that R-localization, while effective in suppressing spurious correlations, does not help in mitigating the rank issue. In Section 4, we show that R-localization may even decrease DFS, which is also demonstrated in Huang et al. (2019). While R-localization (artificial inflation of the error variance for distant observations) in itself does not help mitigate the rank issue, the domain localization (c.f. Section 4.3.2) that is inherent in R-localization can mitigate the issue. We discuss this further in Section 4.3.2.

Model-space B-localization implements localization by tapering the $B$ matrix through a Schur product (element-wise multiplication) with a localization matrix $\rho$ whose $(i,j)$-element is the value of some decreasing function of the distance between the locations of the $i$-th and $j$-th elements of the state vector. Implementing this type of localization in an EnKF is not straightforward, since the $B$ matrix is not explicitly constructed in EnKF, but the ensemble modulation approach, proposed in Bishop and Hodyss (2009b), allows us to perform model-space B-localization without explicitly constructing $B$ in model space. In this approach, the ensemble covariance localized with $\rho$ is expressed as the sample covariance of a larger ensemble:

$$\rho \circ B^{\mathrm{ens}} = \frac{1}{K-1}\rho \circ \left(X^bX^{bT}\right) = \frac{1}{M-1}ZZ^T \qquad (21)$$

where $M = KL$, $L$ is the rank of $\rho$, and the $N \times M$ "modulated ensemble" $Z$ is defined, using the square-root matrix $L = [l_1, \cdots, l_L]$ of $\rho$ that satisfies $LL^T = \rho$, as

$$Z = \sqrt{\frac{M-1}{K-1}}\left[\left(l_1 \circ \delta x^b_1,\ l_2 \circ \delta x^b_1,\ \cdots,\ l_L \circ \delta x^b_1\right),\ \cdots,\ \left(l_1 \circ \delta x^b_K,\ l_2 \circ \delta x^b_K,\ \cdots,\ l_L \circ \delta x^b_K\right)\right]. \qquad (22)$$

From Equation 21 we can see that performing a regular EnKF algorithm using $Z$ as the ensemble of background perturbations in place of $X^b$ achieves model-space B-localization. We remark that choosing $L$ as the exact square root of $\rho$ results in $L$ being the rank of $\rho$, which is unaffordably large, so in practice $L$ is approximated by retaining only the dominant eigenmodes of $\rho$. For ETKF, this model-space B-localization through ensemble modulation has difficulty in updating the perturbations, since it produces $M = KL$ posterior members given $K$ prior members, but we need a $K$-member ensemble to initialize the next cycle. A method that resolves this difficulty has recently been devised independently by Bocquet (2016) and Bishop et al. (2017).

With model-space B-localization through ensemble modulation, the upper bound on DFS given in Equation 18 increases $L$-fold, from $K - 1$ to $KL - 1$, suggesting that the DFS underestimation (and hence the underdispersion of the posterior ensemble) can be potentially mitigated.
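The modulation recipe of Equations 21–22 is mechanical enough to express in a few lines of NumPy. The sketch below (our own toy setup; the Gaussian $\rho$, its rank-$L$ truncation, and all sizes are assumptions) builds the modulated ensemble $Z$ and confirms that its sample covariance reproduces the Schur-localized ensemble covariance while having a much higher rank:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 60, 10                        # state dimension, ensemble size

# Toy prior perturbation matrix X^b (Equation 17).
ens = rng.standard_normal((N, K))
X = ens - ens.mean(axis=1, keepdims=True)
B_ens = X @ X.T / (K - 1)

# Gaussian localization matrix rho and a rank-L approximate square root
# built from its leading eigenmodes, so that Lmat @ Lmat.T ~ rho.
dist = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])
rho = np.exp(-0.5 * (dist / 10.0) ** 2)
w, V = np.linalg.eigh(rho)
L = 15                               # retained eigenmodes (truncation)
Lmat = V[:, -L:] * np.sqrt(w[-L:])   # columns l_1, ..., l_L

# Modulated ensemble Z (Equation 22): M = K*L scaled Schur products.
M = K * L
Z = np.sqrt((M - 1) / (K - 1)) * np.column_stack(
    [Lmat[:, j] * X[:, i] for i in range(K) for j in range(L)])

# Equation 21: the Schur-localized covariance (with the truncated rho)
# equals the sample covariance of the modulated ensemble.
lhs = (Lmat @ Lmat.T) * B_ens
rhs = Z @ Z.T / (M - 1)
print("max |lhs - rhs| :", np.abs(lhs - rhs).max())    # ~1e-15
print("rank of X^b     :", np.linalg.matrix_rank(X))   # <= K-1
print("rank of Z       :", np.linalg.matrix_rank(Z))   # much larger
```

Running a plain ETKF with $Z$ in place of $X^b$ therefore raises the DFS bound of Equation 18 from $K-1$ to $KL-1$, which is the mechanism examined experimentally in Section 4.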
In Section 4, we demonstrate, with a simple covariance model, that model-space B-localization does indeed significantly increase $\mathrm{DFS}^{\mathrm{ens}}$.

It is worth mentioning that several covariance inflation methods act to increase $\mathrm{DFS}^{\mathrm{ens}}$, but their impact is limited, since the upper bound given in Equation 18 still applies even after the introduction of such methods. With multiplicative inflation (Pham et al., 1998; Anderson and Anderson, 1999), each prior perturbation is inflated by a common factor $a > 1$, resulting in each $\lambda^b_i$ being uniformly inflated by the factor $a^2 > 1$; recalling that $\lambda^a_i$ as a function of $\lambda^b_i$ is monotonically increasing (see Equation 12), this means that each $\lambda^a_i\ (i = 1, \cdots, r)$, and hence $\mathrm{DFS}^{\mathrm{ens}}$ (their sum over all $i$'s), are also increased. Similarly, with additive inflation (Mitchell and Houtekamer, 2000), a random independent draw from a certain predefined error distribution is added to each prior perturbation (each column of the matrix $X^b$) before performing the analysis, leading to $B^{\mathrm{ens}}$ in Equation 17 being replaced by $B^{\mathrm{ens}}$ plus some symmetric matrix $Q^{\mathrm{ens}}$. Since this $Q^{\mathrm{ens}}$ is a positive-semidefinite matrix of rank at most $K - 1$, all $\lambda^b_i\ (i = 1, \cdots, r \le K - 1)$ receive some positive increment, resulting in $\mathrm{DFS}^{\mathrm{ens}}$ increasing accordingly. However, despite being increased by these inflation methods, $\mathrm{DFS}^{\mathrm{ens}}$ is still subject to the upper bound (Equation 18), since the rank of $B^{\mathrm{ens}}$ is not increased by these operations.

The inflation methods that operate on the posterior perturbations, such as RTPP and RTPS (see Section 5.3), are typically applied after computing the gain matrix $K$. As such, while $\mathrm{DFS}^{\mathrm{ens}}$ could be increased by these methods, the underestimation of the analysis increment $Kd$ cannot be mitigated.

The most important messages from the examination of DFS given above are that, if the ensemble size $K$ is insufficient relative to $\mathrm{DFS}^{\mathrm{opt}}$ (the true information content of the observations), the posterior ensemble from EnKF algorithms like ETKF or LETKF will automatically be underdispersive, which hinders effective exploitation of observational information, and that this limitation can be mitigated by model-space B-localization, which applies Schur-product tapering to $B$. We remark that Furrer and Bengtsson (2007) obtained a similar result for the stochastic EnKF with perturbed observations: they showed, for special cases where $HH^T = I_p$ holds, that the concavity of $HK$ as a function of the background error covariance (in the observation space) $HBH^T$ implies a negatively biased expectation of $\mathrm{tr}\, HK$ via Jensen's inequality, which, with our notation, can be summarized as
$$\mathrm{E}\left[\mathrm{DFS}^{\mathrm{ens}}\right] \le \mathrm{DFS}^{\mathrm{opt}}. \qquad (23)$$

Whitaker and Hamill (2002) also remarked on this property for a univariate case by experimentation. Compared to their findings, our proposition in Equation 18 is stronger in being valid deterministically (not only in an expected sense) and in giving a more explicit upper bound.

Using the interpretation of DFS given in the last paragraph of Section 2.1, $\mathrm{DFS}^{\mathrm{ens}}$ [...] 50 members. This contrast is consistent with our expectation from the theory that a small ensemble size $K$ should be tolerated if $\mathrm{DFS}^{\mathrm{opt}}$ is small (c.f. Section 3.2).

The DFS diagnostics shown above are performed in the normalized observation space, but similar tendencies can also be observed in the model space. Figure 3 shows the trace of the analysis ensemble covariance, $\mathrm{tr}\, A^{\mathrm{ens}}$, and the mean squared error (MSE) of the analysis mean, $\|x^{a,\mathrm{ens}} - x^{\mathrm{true}}\|^2$, averaged over 1,000 trials, as a function of the ensemble size $K$. As a reference, their counterparts in an optimal canonical KF are also plotted with a dotted line. In both the "high DFS" and "low DFS" scenarios, both $\mathrm{tr}\, A^{\mathrm{ens}}$ and the analysis mean MSE converge to their optimal values as $K$ becomes large, but the former is consistently smaller than the latter. Recalling that $\mathrm{tr}\, A^{\mathrm{ens}}$ is the estimate of the analysis mean MSE that is assumed by the assimilation algorithm, $\mathrm{tr}\, A^{\mathrm{ens}}$ being smaller than the analysis MSE means that the analysis is overconfident. We can observe that, in both cases, the level of overconfidence diminishes as the ensemble size $K$ gets larger. Comparing the two scenarios, we can also observe that the overconfidence is much stronger in the "high DFS" scenario than in the "low DFS" scenario. These results corroborate our deduction from the theory that the ensemble size required to alleviate overconfidence in the analysis should increase in proportion to $\mathrm{DFS}^{\mathrm{opt}}$.

As discussed in Section 3.2, underestimation of DFS is suggestive of underestimation of the analysis increment. A plot (not shown) comparing the $l_2$-norm of the analysis increment from ETKF, $\|K^{\mathrm{ens}}d\|$, as a function of the ensemble size $K$, with that of an optimal analysis, $\|K^{\mathrm{opt}}d\|$, exhibits a converging curve similar to the one shown in Figure 2, with the former consistently underestimating the latter, again corroborating the expectation from the theory.

A plot similar to Figure 2 but with a fixed ensemble size ($K = 40$) and a varying number of observations (Figure 4) is also illuminating. In the "high DFS" scenario, $\mathrm{DFS}^{\mathrm{ens}}$ is close to $\mathrm{DFS}^{\mathrm{opt}}$ when the number of observations $p$ is small, but as $p$ increases beyond the ensemble size $K = 40$, the former begins to underestimate the latter. In contrast, in the "low DFS" scenario, $\mathrm{DFS}^{\mathrm{ens}}$ stays close to $\mathrm{DFS}^{\mathrm{opt}}$ even for very large values of $p$. This is because $\mathrm{DFS}^{\mathrm{opt}}$ is well below the ensemble size ($K = 40$) even for the fully observed case, so that $\mathrm{DFS}^{\mathrm{ens}}$ being bounded by $K$ does not pose much limitation.
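The qualitative behaviour of Figures 2–4 can be reproduced with a few lines of NumPy. Because the exact covariance model of Section 4 is not reproduced here (its specification falls in an elided portion of the text), the sketch below assumes a stand-in: Gaussian-shaped correlations with length scale $d$ on a periodic 1-D grid, fully observed with unit observation error; the two values of $d$ mimic the "high DFS" and "low DFS" scenarios, and all parameter values are our own:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100                              # grid points on a periodic 1-D domain
i = np.arange(N)
sep = np.minimum(np.abs(i[:, None] - i[None, :]),
                 N - np.abs(i[:, None] - i[None, :]))

H = np.eye(N)                        # fully observed
R = np.eye(N)                        # unit observation error

def dfs(B):
    """DFS = tr(HK) for the gain built from covariance B (Equations 1, 6)."""
    return np.trace(H @ B @ H.T @ np.linalg.inv(H @ B @ H.T + R))

for d, label in ((2.0, "high-DFS (short correlations)"),
                 (20.0, "low-DFS (long correlations)")):
    B_true = np.exp(-0.5 * (sep / d) ** 2)
    # Symmetric square root of B_true for drawing ensembles; eigenvalues
    # are clipped at zero to guard against tiny negative round-off.
    w, V = np.linalg.eigh(B_true)
    sqrtB = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    print(f"--- {label}: DFS_opt = {dfs(B_true):.2f}")
    for K_ens in (10, 40, 160):
        X = sqrtB @ rng.standard_normal((N, K_ens))  # ensemble ~ N(0, B_true)
        X -= X.mean(axis=1, keepdims=True)
        B_ens = X @ X.T / (K_ens - 1)                # Equation 17
        print(f"    K = {K_ens:3d}: DFS_ens = {dfs(B_ens):6.2f} "
              f"(bound K-1 = {K_ens - 1})")
```

In the short-correlation case, $\mathrm{DFS}^{\mathrm{ens}}$ presses against its $K-1$ ceiling for small ensembles and converges toward $\mathrm{DFS}^{\mathrm{opt}}$ from below as $K$ grows; in the long-correlation case, $\mathrm{DFS}^{\mathrm{opt}}$ is small and the bound never binds.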
4.3 Impact of localization

In Section 3.3 we expounded on how the DFS underestimation that occurs when the ensemble size $K$ is much smaller than $\mathrm{DFS}^{\mathrm{opt}}$ can be alleviated by localization. In this subsection we experimentally illustrate how the two localization methods differ in this respect.

4.3.1 Model-space B-localization

Recalling that DFS is the sum of all the eigenvalues $\lambda^a_i$ of the matrix $H'AH'^T$, it is illuminating to examine how localization changes the eigenspectrum of this matrix. The eigenvalues, sorted from the largest to the smallest, of the matrix $H'A^{\mathrm{opt}}H'^T$, computed using the true background covariance matrix with the canonical KF, are plotted in Figure 5 with the thick solid line. The true posterior eigenvalues decrease smoothly as the mode number gets higher and almost (but not exactly) vanish at the higher modes. Their counterparts computed with the raw ensemble covariance $B^{\mathrm{ens}}$ constructed from 40 members (the thick dotted line) abruptly become zero at the 40th mode. This is an indication of DFS underestimation, since the areas below these curves correspond to DFS.

Model-space B-localization allows us to avoid this abrupt truncation of the eigenspectrum. The posterior eigenvalues, computed by ETKF using the modulated background ensemble (see Section 3.3), are plotted with the thin solid line in Figure 5. Here, we manually tuned the localization scale to achieve the minimal analysis mean MSE (giving $d_{\mathrm{cut\text{-}off}} = 20\sqrt{10/3} \approx 37$), and the localization matrix $\rho$ is approximated by retaining the leading $L = 20$ eigenmodes, which recovers 93.4% of the trace of the original matrix $\rho$. We can observe that a well-tuned B-localization can almost perfectly recover the true posterior eigenspectrum, which means that DFS underestimation can be avoided.

The change in eigenspectrum caused by the use of B-localization can be better understood by noting the following (suggested by Dr. T. Tsuyuki; private communication): as we saw in Section 3.3, the rank of $B^{\mathrm{ens}}$, or equivalently the number of non-zero eigenvalues $\lambda^b_i$ of $H'B^{\mathrm{ens}}H'^T$, increases from $K - 1$ to $KL - 1$ by applying B-localization. Now, recalling that model-space and observation-space B-localization are equivalent in this particular case, where $H$ only picks up state variables at selected grid points, the matrix $H'(\rho \circ B^{\mathrm{ens}})H'^T$ can be expressed as $\rho^{\mathrm{obs}} \circ (H'B^{\mathrm{ens}}H'^T)$ by choosing an appropriate $p \times p$ correlation matrix $\rho^{\mathrm{obs}}$. Since all the diagonal entries of $\rho^{\mathrm{obs}}$ are one, it follows that

$$\sum_{i=1}^{K-1} \lambda^{b,\mathrm{raw}}_i = \mathrm{tr}\, H'B^{\mathrm{ens}}H'^T = \mathrm{tr}\, \rho^{\mathrm{obs}} \circ \left(H'B^{\mathrm{ens}}H'^T\right) = \sum_{i=1}^{KL-1} \lambda^{b,\mathrm{loc}}_i,$$

where $\lambda^{b,\mathrm{raw}}_i$ and $\lambda^{b,\mathrm{loc}}_i$ denote, respectively, the eigenvalues of the prior error covariance matrix measured in the normalized observation space before and after the application of B-localization. This means that the sum of all the eigenvalues of the prior error covariance is invariant under the application of B-localization, while the number of its non-zero elements increases $L$-fold, which implies that B-localization reduces the larger eigenvalues while increasing the smaller eigenvalues, including those that are null, leading to a flattening of the prior eigenspectrum. Consequently, the posterior eigenspectrum is also flattened by B-localization, since the posterior eigenvalue $\lambda^a_i$ as a function of the prior eigenvalue $\lambda^b_i$ is monotonically increasing (see Equation 12). This explains how the posterior eigenspectrum is deformed by B-localization from the gray thick dotted line to the thin black solid line shown in Figure 5.
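This trace-preserving flattening is easy to see numerically. In the sketch below (our own; the Gaussian $\rho^{\mathrm{obs}}$ and all sizes are assumptions), a rank-deficient prior covariance in the normalized observation space is Schur-multiplied by a correlation matrix: the trace of the prior spectrum is unchanged while its rank, and hence the DFS implied by Equation 12, increases:

```python
import numpy as np

rng = np.random.default_rng(3)
p, K = 60, 10                        # observations, ensemble size

# Low-rank prior covariance in the normalized observation space,
# H' B_ens H'^T, built from a small ensemble (rank <= K-1).
Y = rng.standard_normal((p, K))
Y -= Y.mean(axis=1, keepdims=True)
C_raw = Y @ Y.T / (K - 1)

# Gaussian correlation matrix rho_obs and Schur-product localization.
dist = np.abs(np.arange(p)[:, None] - np.arange(p)[None, :])
rho_obs = np.exp(-0.5 * (dist / 8.0) ** 2)
C_loc = rho_obs * C_raw

for name, C in (("raw", C_raw), ("localized", C_loc)):
    lam_b = np.linalg.eigvalsh(C)[::-1]          # prior spectrum, descending
    lam_a = lam_b / (1.0 + lam_b)                # posterior (Equation 12)
    print(f"{name:9s}: trace = {lam_b.sum():7.3f}  "
          f"rank = {np.sum(lam_b > 1e-10):2d}  DFS = {lam_a.sum():6.3f}")
```

Because the diagonal of $\rho^{\mathrm{obs}}$ is one, the prior trace is identical in both rows, but the localized spectrum is flatter, and the concavity of $\lambda/(1+\lambda)$ turns that flattening into a larger DFS.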
4.3.2 R-localization and domain localization

R-localization circumvents the DFS underestimation issue in a different manner. As we saw in Section 3.3, with R-localization $\mathrm{DFS}^{\mathrm{ens}}$ is inevitably smaller than the ensemble size, so DFS underestimation is unavoidable as long as $\mathrm{DFS}^{\mathrm{opt}}$ is greater than the ensemble size.

Instead of solving the full data assimilation problem using the full $B$ matrix, localized EnKF algorithms like LETKF divide the domain into smaller pieces, exploiting the localized structure of the background error covariance (i.e., the block diagonality of the $B$ matrix), and solve smaller data assimilation problems using the diagonal submatrices of $B$ (Evensen, 2004; Ott and Coauthors, 2004). The resultant smaller data assimilation problems are solved independently for each local domain. In our simple covariance model with the parameter $d = 20$ (Figure 1b), for example, the $B$ matrix assumes a localized structure, with correlations almost vanishing beyond a certain separation, so that, when performing the analysis for a particular grid point, limiting the domain to the neighboring grid points within some radius (called the cut-off distance hereafter) greater than this separation and then performing data assimilation, neglecting all background fields and observations outside this local domain, should yield an analysis close to what would be obtained with a global analysis using the full background field and observations. This is exactly how data assimilation is performed with LETKF. In this way the number of locally assimilated observations can be reduced (Hunt et al., 2007), resulting in $\mathrm{DFS}^{\mathrm{opt}}$ for the localized problem being smaller than the ensemble size if the cut-off distance is chosen small enough, thus allowing us to circumvent the DFS underestimation issue.

In the case we are considering (with parameter $d = 20$ and ensemble size $K = 40$), the optimally tuned cut-off distance for R-localization was $d_{\mathrm{cut\text{-}off}} = 14\sqrt{10/3} \approx 26$, in which case the number of locally assimilated observations is restricted to only $\approx (2 \times 26 + 1)/3 \approx 17$. The gray thick solid line plotted in Figure 6 shows the eigenvalues of the posterior error covariance matrix projected onto the normalized observation space, $H'_{\mathrm{loc}}A^{\mathrm{opt}}_{\mathrm{loc}}H'^T_{\mathrm{loc}}$, computed by the optimal canonical KF restricted to this localized domain (here the matrix $H'_{\mathrm{loc}}$ denotes a normalized observation operator similar to $H'$ but which selects only the observations whose distances from the analyzed grid point are below the cut-off distance). Here we arbitrarily show the result for the localized problem centered around the first grid point, but the results are insensitive to the choice of the analyzed grid point because of the translational symmetry of our covariance model. Note that the eigenvalues are zero beyond the 18th mode, since the rank of the matrix $H'_{\mathrm{loc}}A^{\mathrm{opt}}_{\mathrm{loc}}H'^T_{\mathrm{loc}}$ is only 17 (the number of locally assimilated observations). Their ensemble equivalents, computed with the raw ensemble covariance but with domain localization (gray thick dotted line), are close to the optimal ones, unlike in the case of the full domain (compare with Figure 5). When R-localization is applied, the posterior eigenvalues become smaller than when only domain localization is applied (black thin dotted line), meaning that R-localization actually leads to a smaller DFS than without it. This is consistent with the finding of Huang et al. (2019), who also provided a mathematical explanation.

By working with the localized domain, the DFS underestimation is mostly mitigated, and so is the overconfidence in the analysis spread. Panels (a) and (b) of Figure 8 show, respectively, a plot similar to Figure 3a but for an optimally tuned LETKF with R-localization (together with its inherent domain localization), and the same plot but with domain localization only.
Comparing these with Figure 3a, it is evident that the LETKF analyses suffer much less from the overconfidence issue than the raw ETKF without any localization (Figure 3a), and that an analysis MSE close to that of the optimal KF is achieved with much smaller ensemble sizes, proving the effectiveness of domain localization. Interestingly, for this particular simple problem, the benefit of R-localization appears to be mostly attributable to the use of domain localization, as we can infer by comparing Figures 8a and 8b, which show very similar performance. Note that we should not discredit the advantage of R-localization over domain localization simply on the basis of these results. R-localization is known to have the advantage of ensuring a spatially smoother (and thus better balanced) analysis than simple domain localization does. This advantage is particularly important in a cycled context (Greybush et al., 2011), but is totally dismissed in our problem setup.

In LETKF with R-localization or domain localization, the Kalman gain for the global analysis in the full domain (denoted $K^{\mathrm{LETKF}}$ hereafter), and the analysis error covariance implied by $K^{\mathrm{LETKF}}$, are not explicitly available in algebraic matrix form, but it is possible to compute them numerically row-wise if we note that each row (say the $i$-th row) of the global Kalman gain $K^{\mathrm{LETKF}}$ is computed in the local analysis centered around the $i$-th grid point as the row of the local Kalman gain that corresponds to the center of this local analysis. The row from the local Kalman gain is shorter than the corresponding row of the global Kalman gain, so the components of the latter that correspond to observations outside the truncated local domain have to be padded with zeros. Once $K^{\mathrm{LETKF}}$ is thus computed, a counterpart of Figure 5 can be produced, albeit with several caveats to be kept in mind.

In the global analysis of LETKF, the optimality of the Kalman gain (in the sense of Equation 6 being strict), while valid for each local domain, does not hold exactly at the global level, which means that Equation 4, and thus Equation 5, are not valid. Accordingly, the equivalence of $H'AH'^T$ and $R^{-1/2}HKR^{1/2}$ does not hold for the global analysis, making it difficult to interpret the DFS (defined as the trace of $HK$) as the posterior error variance represented by the assimilation scheme. Another difficulty that follows from Equation 6 not being exact is that $R^{-1/2}HK^{\mathrm{LETKF}}R^{1/2}$ is not assured to be symmetric, so its eigenvalues are not necessarily real. An easy remedy is to focus on its symmetric component

$$H'K'_{\mathrm{LETKF,S}} := \frac{1}{2}\left\{\left(R^{-1/2}HK^{\mathrm{LETKF}}R^{1/2}\right) + \left(R^{-1/2}HK^{\mathrm{LETKF}}R^{1/2}\right)^T\right\} \qquad (27)$$

and investigate its eigenspectrum (suggested by Dr. T. Tsuyuki; private communication). This remedy may seem ad hoc but is justified by the following two properties. Firstly, it conserves the trace and thus respects the identity

$$\mathrm{tr}\, H'K'_{\mathrm{LETKF,S}} = \mathrm{tr}\, R^{-1/2}HK^{\mathrm{LETKF}}R^{1/2} = \mathrm{tr}\, HK^{\mathrm{LETKF}} = \mathrm{DFS}^{\mathrm{ens}}. \qquad (28)$$

Secondly, by focusing on the symmetric component, the requirement that the eigenvalues $\lambda^a_i$ of $H'K'_{\mathrm{LETKF,S}}$ should lie between 0 and 1 can be interpreted intuitively as requiring that the analysis be an interpolation between the first guess and the observations.
To see why the second point holds, recall that the observation-space inner product (scaled by $R^{-1}$) between the innovation vector and the analysis increment can be expressed as

$$\left(y^o - y^b\right)^TR^{-1}\left(y^a - y^b\right) = d^TR^{-1}\left(y^a - y^b\right) = d^TR^{-1}HK^{\mathrm{LETKF}}d = d'^TR^{-1/2}HK^{\mathrm{LETKF}}R^{1/2}d'$$
$$= \frac{1}{2}\left\{\left(d'^TR^{-1/2}HK^{\mathrm{LETKF}}R^{1/2}d'\right) + \left(d'^TR^{-1/2}HK^{\mathrm{LETKF}}R^{1/2}d'\right)^T\right\} = d'^TH'K'_{\mathrm{LETKF,S}}d' \qquad (29)$$

where $y^b := Hx^b$ is the first guess in the observation space and $d' := R^{-1/2}d$ is the normalized innovation vector. Since the eigenvalues $\lambda^a_i$ are the factors by which the vector $d'$ is magnified or shrunk in the corresponding eigendirections when multiplied from the left by $H'K'_{\mathrm{LETKF,S}}$, if $0 < \lambda^a_i < 1$ holds for all $i$, then, taking $y^b$ as the origin in the normalized observation space, in any direction $y^a$ never goes beyond $y^o$, nor does it lie on the opposite side of $y^o$ with respect to the origin $y^b$; stated differently, $y^a$ is an interpolation between $y^b$ and $y^o$.

The eigenspectra of $H'K'_{\mathrm{LETKF,S}}$, computed with the optimally tuned R-localization or with domain localization only, are plotted in Figure 7. Similarly to B-localization (see Figure 5), R-localization and domain localization successfully restore the true eigenspectrum (gray thick solid line), attesting to the effectiveness of R-localization and domain localization in the context of the global analysis. Interestingly again, the two curves (for R-localization and domain localization) exhibit very similar patterns, suggesting that much of the benefit of R-localization can be achieved by domain localization alone. Since the underestimation of $\mathrm{tr}\, HK$ (the area below the curves) is alleviated, we can expect that R-localization or domain localization should alleviate the underestimation of the analysis increment ($HKd$ or $Kd$). We confirmed that this is indeed the case in our experiment: the ratio of the $l_2$-norm of the analysis increment computed by ETKF (averaged over 1,000 trials) to the same quantity computed by the optimal canonical KF increased from 0.7 to 0.9. We remark that, since the identity between $H'K'_{\mathrm{LETKF,S}}$ and $H'AH'^T$ does not necessarily hold for the global analysis of LETKF, the mitigation of the DFS underestimation (the sum of all $\lambda^a_i$) cannot immediately be interpreted as a mitigation of the posterior underdispersion (although from Figure 8 it does seem quite likely to be mitigated).
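The diagnostic of Equations 27–29 applies to any gain that does not satisfy Equation 6 exactly. As a stand-in for a row-wise-assembled LETKF gain (which we do not reconstruct here), the sketch below (our own; the Schur-tapered gain, covariance model, and all parameters are assumptions) tapers an optimal gain, symmetrizes the normalized influence matrix, and checks the trace identity of Equation 28:

```python
import numpy as np

n = 60
i = np.arange(n)
B = np.exp(-0.5 * ((i[:, None] - i[None, :]) / 10.0) ** 2)  # toy prior cov.
H = np.eye(n)[::3]                    # observe every third grid point
p = H.shape[0]
R = 0.5 * np.eye(p)

# Stand-in for a localized (row-wise assembled) gain: the optimal gain
# tapered by a Schur product, which, like LETKF's global gain, no longer
# satisfies Equation 6 exactly.
K_opt = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
obs_pos = 3 * np.arange(p)
taper = np.exp(-0.5 * ((i[:, None] - obs_pos[None, :]) / 15.0) ** 2)
K_loc = taper * K_opt

# Equation 27: symmetric part of the normalized influence matrix.
R_sqrt = np.diag(np.sqrt(np.diag(R)))
R_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(R)))
S = R_inv_sqrt @ H @ K_loc @ R_sqrt
S_sym = 0.5 * (S + S.T)

# Equation 28: symmetrization conserves the trace, i.e. DFS_ens.
print(f"tr(HK)    = {np.trace(H @ K_loc):.6f}")
print(f"tr(S_sym) = {np.trace(S_sym):.6f}")
lam = np.linalg.eigvalsh(S_sym)       # real by construction
print(f"eigenvalues of S_sym lie in [{lam.min():.3f}, {lam.max():.3f}]")
```

The printed eigenvalue range can be inspected against the interpolation requirement $0 < \lambda^a_i < 1$; note that the symmetrization guarantees only that the eigenvalues are real, not that they lie in that interval for an arbitrary gain.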
5 Discussion

In the literature on atmospheric data assimilation, especially studies on operational implementations of LETKF, several interesting findings have been reported, and some of them appear counter-intuitive. This section is devoted to discussing how the DFS-based arguments developed in the preceding sections could help interpret such seemingly puzzling findings. The discussions given here are admittedly highly speculative, but the authors nonetheless wish to present them in the hope of stimulating further discussion in the community.

5.1 Using less is better

Among the many findings (or caveats) on operational or real-world implementations of LETKF, perhaps the most intriguing is the fact that assimilating fewer observations can lead to a more accurate analysis or forecast. For example, Hamrud et al. (2015) developed a quasi-operational implementation of both LETKF and the serial ensemble square-root filter (EnSRF; Whitaker and Hamill, 2002) coupled with the European Centre for Medium-range Weather Forecasts (ECMWF)'s operational forecast model (the Integrated Forecasting System; IFS) and found that, for LETKF, a very large forecast improvement is achieved by severely limiting the number of assimilated observations in the local analysis step for each analyzed grid point. They reported that severely reducing the average number of locally assimilated observations yields the best results, and that the forecast performance is not very sensitive to the exact choice of the strength of the reduction. Curiously, they found this method (which they call "implicit covariance localization") to be useful only for LETKF (with R-localization) and not for the serial EnSRF (with observation-space B-localization).

The LETKF implementation for convective-scale data assimilation developed by Schraff and Coauthors (2016) takes a similar approach where, for each local analysis, the horizontal localization length scale is adjusted so that the number of locally assimilated observations becomes constant (roughly double the ensemble size) at every analyzed grid point. Guo-Yuan Lien et al. (2017; private communication) applied a similar method to the assimilation of phased-array weather radar (PAWR) data with their LETKF (Lien et al., 2017) and confirmed that limiting the number of locally assimilated observations to a few times the ensemble size leads to significantly better forecasts than assimilating all the data or applying a traditional thinning method before assimilating them.

The fact that using fewer observations (i.e., discarding many of the available observations) leads to better forecast performance is, naively, not easily justifiable, but the theory of DFS applied to EnKF (see Section 3) allows a clear interpretation: $\mathrm{DFS}^{\mathrm{ens}}$ is locally at most $K - 1$ (where $K$ is the ensemble size), so that locally assimilating many more than $K$ observations results in an overconfident analysis spread (requiring unreasonably large inflation) and a smaller-than-optimal analysis increment. This interpretation is not necessarily new, and we note that Schraff and Coauthors (2016) present a similar heuristic argument as a rationale for their approach.

One may argue that the benefit of assimilating fewer observations should be attributable to the observations being sparser and thus less affected by error correlations between different observations. However, such an interpretation does not seem to apply here because, in all three studies mentioned above, the observations that are spatially closest to the analyzed grid point are selected, so that the issue of correlated observation errors, if any, was not addressed by limiting the number of locally assimilated observations.
5.2 Optimal localization scales

The argument above suggests convenient guidance on how to tune the localization scale in the context of LETKF with R-localization: the localization scale should be as large as possible, to keep the localized problem close to the original global problem, but subject to the condition that it is small enough that the number of locally assimilated observations does not exceed several times the ensemble size (to ensure that locally $\mathrm{DFS}^{\mathrm{opt}} < K$ holds).

This rule of thumb is again not new; the insightful review paper by Tsyrulnikov (2010) developed a similar heuristic argument, based in part on Lorenc (2003), to reach a similar conclusion. Quoting from section 4.3 of Tsyrulnikov (2010), he reasoned that the optimal localization scale occurs when the local analysis domain is small enough that the "ensemble size (is) commensurable with the number of observed degrees of freedom within an effective box." In his discussion, the "observed degrees of freedom" was not given a precise definition; we assert that DFS gives it a more precise definition. Tsyrulnikov (2010) further conducted a meta-analysis of the published literature and found that the optimal localization scales that were experimentally determined in earlier studies do match this rule of thumb.

Covariance localization has traditionally been regarded as a means to alleviate the rank deficiency issue of the ensemble background error covariance matrix $B^{\mathrm{ens}}$ (e.g., Houtekamer and Zhang, 2016) or as a means to damp the sampling noise of $B^{\mathrm{ens}}$ (e.g., Ménétrier et al., 2015). From this perspective, it appears that determining the optimal localization length scale is a problem intrinsic to the ensemble size $K$ and the structure of the true $B$, independent of $H$ or $R$ (i.e., of how the observations are distributed or how accurate they are). Interestingly, however, contrary to this traditional view, there is ample evidence in the literature suggesting that the distribution and/or accuracy of the observations are key in determining the optimal localization scales in the context of LETKF with R-localization and domain localization. The theory based on DFS presented in this paper may help to reconcile this apparent contradiction.

5.3 Covariance inflation

Along with localization, covariance inflation is an indispensable component of practical EnKF implementations that is often performed in an ad hoc manner despite its importance. The concept of DFS is useful in explaining or justifying some of the inflation methods that have been found effective in the previous literature.

Wang et al. (2007) experimentally observed that the regular ETKF (as formulated in Bishop et al., 2001; Wang et al., 2004) systematically underestimates the posterior error variance if $K \ll r$, where $K$ is the ensemble size and $r$ is the rank of the true background error covariance $H'B^{\mathrm{true}}H'^T$ projected onto the normalized observation space. Motivated by this observation, and guided by a series of insightful educated guesses (see their Appendix A), Wang et al. (2007) introduced an "improved ETKF" formulation which inflates the eigenvalues $\lambda^b_i$ of the sample prior error covariance projected onto the normalized observation space before computing the ensemble transform matrix. The inflation factor they derived has a rather complex expression, but its key property is that it approaches $r/K$ (the rank of $H'B^{\mathrm{true}}H'^T$ divided by the ensemble size) under some assumptions.
The chain of logic behind their derivation is somewhat complicated, but the DFS argument given in Section 3.2 of this manuscript allows an intuitive (albeit heuristic) interpretation: when $K \ll r$, DFS (and thus the posterior error variance measured in the normalized observation space) should be underestimated ($\mathrm{DFS}^{\mathrm{ens}} \ll \mathrm{DFS}^{\mathrm{opt}}$), since $\mathrm{DFS}^{\mathrm{ens}} < K - 1 \ll r$ and $r/\mathrm{DFS}^{\mathrm{opt}} \sim O(1)$ if the observations are accurate enough. We can recover the correct posterior variance by inflating $\mathrm{DFS}^{\mathrm{ens}}$ by a factor $\mathrm{DFS}^{\mathrm{opt}}/\mathrm{DFS}^{\mathrm{ens}}$, and a simple way to do this is to inflate all the prior eigenvalues $\lambda^b_i$ by the same factor $\mathrm{DFS}^{\mathrm{opt}}/\mathrm{DFS}^{\mathrm{ens}}$. This factor is difficult to estimate, since $\mathrm{DFS}^{\mathrm{opt}}$ is usually unknown (in fact Wang et al. (2007) derived quite an elaborate expression to estimate this factor), but provided that $K \ll r$, it should be reasonable to assume that it is roughly proportional to $r/K$.

As we mentioned in Section 3.3, Bishop et al. (2017) proposed the Gain-form ETKF (GETKF), which enables model-space B-localization in the ETKF framework. They observed that GETKF tends (though not always) to underestimate the posterior error variance in comparison to METKF (the ETKF that updates all the modulated ensemble members) and proposed to apply the "inherent GETKF inflation," which inflates each of the GETKF's posterior perturbations by a constant factor $a$ that restores the model-space posterior variance of METKF. With experiments using a one-dimensional toy system repeated with various localization length scales, they found that:

• (a) the inherent inflation factor $a$ increases monotonically with the localization scale, and

• (b) interestingly, the analysis becomes most accurate when the localization length scale is such that it neutralizes the inherent inflation factor (i.e., $a \approx 1$ holds).

Point (a) above is easier to understand if we note, from the DFS theory, that the larger the localization scale is, the more observations are assimilated, leading to severer DFS (and thus posterior variance) underestimation and requiring stronger covariance inflation. Bishop et al. (2017) acknowledge the potential significance of point (b), suggesting that it could be exploited to adaptively optimize the localization scale if it is a generally applicable property. Bishop et al. (2017) were deliberate in stating that the validity of this hypothesis is left to future assessment, but the implication discussed in Section 3.2, together with the similar evidence from the literature summarized in Tsyrulnikov (2010), makes it all the more likely.

Finally, among the many inflation methods, the family of "relaxation to prior" approaches has been found to be particularly effective. This family of inflation methods modifies (inflates) the posterior ensemble after the analysis update has been made, either by relaxing the posterior perturbations themselves to the prior perturbations (Relaxation To Prior Perturbations, RTPP; Zhang et al., 2004) or by relaxing the posterior spread to the prior spread (Relaxation To Prior Spread, RTPS; Whitaker and Hamill, 2012). While there can be many reasons why these relaxation approaches are effective, one important characteristic appears to be their ability to apply stronger inflation when and where the assimilated observations are distributed more densely, as pointed out by Whitaker and Hamill (2012). The DFS underestimation (or analysis overconfidence) that occurs when assimilating many more observations than the ensemble size lends itself to justifying the success of the relaxation approaches.
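To make the last point concrete, here is a minimal NumPy sketch of the RTPS relaxation of Whitaker and Hamill (2012) (the formula is theirs; the toy ensembles and all numbers are our own assumptions). Because the relaxation is proportional to how much the spread was contracted at each grid point, the inflation is automatically strongest where the observations have constrained the ensemble the most, i.e., where they are densest:

```python
import numpy as np

def rtps(Xa_pert, Xb_pert, alpha=0.9):
    """Relaxation To Prior Spread (Whitaker and Hamill, 2012).

    Rescales the posterior perturbations at each grid point so that the
    posterior spread is relaxed toward the prior spread:
        sigma_new = (1 - alpha) * sigma_a + alpha * sigma_b.
    """
    sigma_b = Xb_pert.std(axis=1, ddof=1, keepdims=True)
    sigma_a = Xa_pert.std(axis=1, ddof=1, keepdims=True)
    return Xa_pert * (1.0 + alpha * (sigma_b - sigma_a) / sigma_a)

# Toy example: the analysis halves the spread at the first three of five
# grid points (standing in for a densely observed region).
rng = np.random.default_rng(5)
Xb = rng.standard_normal((5, 20))
Xb -= Xb.mean(axis=1, keepdims=True)
Xa = Xb.copy()
Xa[:3] *= 0.5
Xa_infl = rtps(Xa, Xb, alpha=0.9)

print("prior spread    :", Xb.std(axis=1, ddof=1).round(2))
print("posterior spread:", Xa.std(axis=1, ddof=1).round(2))
print("after RTPS      :", Xa_infl.std(axis=1, ddof=1).round(2))
```

The printout shows the spread being pushed back toward the prior only at the "densely observed" points, which is exactly the spatially adaptive behaviour discussed above.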
6 Conclusions

Aiming at understanding how EnKF can effectively extract information from observations, in this paper we adapted the theory of DFS to EnKF algorithms, with particular focus on the ETKF framework. Simple mathematical arguments based on elementary linear algebra revealed that, with EnKF algorithms, DFS is bounded from above by the ensemble size, which means that DFS is always underestimated when assimilating many more observations than the available ensemble size. This problem has long been recognized by the community and is referred to as the "rank deficiency issue," but it appears to have been rather vaguely defined. The DFS argument allows us to describe this issue in a more quantitative manner.

The fact that DFS is underestimated when assimilating many more observations than the ensemble size has several important implications for the effectiveness of practical EnKF implementations, notably:

• Strong covariance inflation is necessary when assimilating many more observations than the ensemble size. This follows from the fact that DFS coincides with the analysis (posterior) variance measured in the normalized observation space, so that, if DFS is underestimated, the analysis spread becomes underdispersive (or equivalently, the analysis becomes overconfident).

• DFS underestimation (or analysis overconfidence) can be avoided by imposing stronger localization when/where observations are denser. This comes at the expense of discarding some of the information from observations, but the merit of alleviating the overconfidence in the analysis can outweigh this disadvantage.

These findings are not new, and similar arguments have been made repeatedly in the literature (e.g., Lorenc, 2003; Tsyrulnikov, 2010; Furrer and Bengtsson, 2007), but the authors believe that the DFS-based argument presented here helps us to understand the issue more clearly and more quantitatively.

The concept of DFS also allows us to understand how different localization schemes help to circumvent the DFS underestimation or analysis overconfidence. The implications of the DFS theory in this context have been explored and showcased using idealized experiments with a one-dimensional toy problem.

The examination of the role of localization based on the DFS concept highlights an advantage of B-localization over R-localization (namely, being less susceptible to the problem of DFS underestimation or analysis overconfidence). In previous studies, the benefit of using model-space B-localization, through the modulated-ensemble approach in particular, has been emphasized in connection with its ability to correctly account for non-local observations like satellite radiances, for which the position of the observation is ill-defined and thus R-localization cannot be clearly formulated (Bishop et al., 2017; Lei et al., 2018). The discussion given in this paper suggests a merit of B-localization beyond its intended advantage of correctly accounting for the non-locality of observations.

Finally, several speculative arguments have been presented, based on the DFS theory, about how some of the interesting results on (L)ETKF reported in the previous literature can be explained or justified in light of the DFS concept. Intriguing questions, such as why using fewer observations can be better, which localization length scales tend to be optimal, and why covariance inflation schemes based on "relaxation to prior" approaches are successful, all become easier to interpret by noting when the DFS underestimation issue occurs.
We remark that the theoretical arguments and the discussions of the results from the idealized experiments presented in this paper only bring out issues that (L)ETKF algorithms are subject to even when strong simplifying assumptions are satisfied: the model and observation operator are both linear, the background and observation errors obey Gaussian distributions, the observation errors are uncorrelated, and B and R are perfectly known. Practical applications such as real-world NWP bear many other complicating factors, so we need to be careful not to extrapolate too much from the simple theoretical arguments presented here. The authors nevertheless believe that the findings shown here provide some useful insights.

Most DA methodologies studied in the atmospheric science or geophysics literature have assumed, implicitly or explicitly, that the system dimension n is by far greater than the number of observations p or the ensemble size K (i.e., p ≪ n and K ≪ n).

ACKNOWLEDGEMENTS

This paper grew out of the first author's presentations at the 6th International Symposium on Data Assimilation (ISDA), the 7th and 8th EnKF Data Assimilation Workshops, and the RIKEN International Symposium on Data Assimilation (RISDA2017). The authors are grateful to the organizers and thank many of the participants for their feedback, notably (in no particular order) Prof. Craig Bishop (University of Melbourne), Prof. Eugenia Kalnay (University of Maryland), Dr. Takemasa Miyoshi (RIKEN), Dr. Guo-Yuan Lien (CWB), Dr. Jeff Whitaker (NOAA/ESRL), Prof. Shunji Kotsuki (Chiba University), Prof. Kosuke Ito (University of the Ryukyus) and Dr. Le Duc (JAMSTEC). The authors also wish to thank Dr. Tadashi Tsuyuki (MRI) for reviewing the pre-submission version of the manuscript and for providing numerous insightful suggestions. This work was supported in part by JSPS Grant-in-Aid for Scientific Research (KAKENHI) JP17H00852 and JP17H07352, the MEXT FLAGSHIP2020 Project within priority study 4 (Advancement of meteorological and global environmental predictions utilizing observational Big Data), and the MEXT "Program for Promoting Researches on the Supercomputer Fugaku" (Large Ensemble Atmospheric and Environmental Prediction for Disaster Prevention and Mitigation).

CONFLICT OF INTEREST

The authors declare no conflict of interest.

References

Anderson, J. (2001) An ensemble adjustment Kalman filter for data assimilation. Monthly Weather Review, 2884–2903.

Anderson, J. and Anderson, S. (1999) A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. Monthly Weather Review, 2741–2758.

Bishop, C. H., Etherton, B. J. and Majumdar, S. J. (2001) Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects. Monthly Weather Review, 420–436.

Bishop, C. H. and Hodyss, D. (2009a) Ensemble covariances adaptively localized with ECO-RAP. Part 1: Tests on simple error models. Tellus, 84–96.

— (2009b) Ensemble covariances adaptively localized with ECO-RAP. Part 2: A strategy for the atmosphere. Tellus, 97–111.

Bishop, C. H., Whitaker, J. S. and Lei, L. (2017) Gain form of the Ensemble Transform Kalman Filter and its relevance to satellite data assimilation with model space ensemble covariance localization. Monthly Weather Review, 4574–4592.

Bocquet, M. (2016) Localization and the iterative ensemble Kalman smoother. Quarterly Journal of the Royal Meteorological Society, 1075–1089.

Bocquet, M., Wu, L. and Chevallier, F. (2011) Bayesian design of control space for optimal assimilation of observations. I: Consistent multiscale formalism. Quarterly Journal of the Royal Meteorological Society, 1340–1356.
Cardinali, C. (2004) Monitoring the observation impact on the short-range forecast. Quarterly Journal of the Royal Meteorological Society, 239–250.

— (2013) Observation influence diagnostic of a data assimilation system. In Data Assimilation for Atmospheric, Oceanic and Hydrologic Applications (eds. S. K. Park and L. Xu), vol. 2, chap. 4, 89–110. Berlin and Heidelberg, Germany: Springer-Verlag.

Chapnik, B., Desroziers, G., Rabier, F. and Talagrand, O. (2006) Diagnosis and tuning of observational error in a quasi-operational data assimilation setting. Quarterly Journal of the Royal Meteorological Society, 543–565.

Evensen, G. (2004) Sampling strategies and square root analysis schemes for the EnKF. Ocean Dynamics, 539–560.

Fisher, M. (2003) Estimation of entropy reduction and degrees of freedom for signal for large variational analysis systems. ECMWF Technical Memoranda, 18 pp.

Furrer, R. and Bengtsson, T. (2007) Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. Journal of Multivariate Analysis, 227–255.

Gaspari, G. and Cohn, S. E. (1999) Construction of correlation functions in two and three dimensions. Quarterly Journal of the Royal Meteorological Society, 723–757.

Greybush, S. J., Kalnay, E., Miyoshi, T., Ide, K. and Hunt, B. R. (2011) Balance and ensemble Kalman filter localization techniques. Monthly Weather Review, 511–522.

Hamrud, M., Bonavita, M. and Isaksen, L. (2015) EnKF and hybrid gain ensemble data assimilation. Part I: EnKF implementation. Monthly Weather Review, 4847–4864.

Houtekamer, P. L. and Zhang, F. (2016) Review of the Ensemble Kalman Filter for atmospheric data assimilation. Monthly Weather Review, 4489–4532.

Huang, B., Wang, X. and Bishop, C. H. (2019) The High-rank Ensemble Transform Kalman Filter. Monthly Weather Review, 3025–3043.

Hunt, B., Kostelich, E. and Szunyogh, I. (2007) Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D, 112–126.

JMA (2013) Outline of the operational numerical weather prediction at the Japan Meteorological Agency (March 2013), Appendix to WMO Technical Progress Report on the Global Data-processing and Forecasting System (GDPFS) and Numerical Weather Prediction (NWP) Research. 188 pp.

— (2019) Outline of the operational numerical weather prediction at the Japan Meteorological Agency (March 2019), Appendix to WMO Technical Progress Report on the Global Data-processing and Forecasting System (GDPFS) and Numerical Weather Prediction (NWP) Research. 229 pp.

Johnson, C., Hoskins, B. J. and Nichols, N. K. (2005) A singular vector perspective of 4D-Var: Filtering and interpolation. Quarterly Journal of the Royal Meteorological Society, 1–19.

Lei, L., Whitaker, J. S. and Bishop, C. (2018) Improving assimilation of radiance observations by implementing model space localization in an ensemble Kalman filter. Journal of Advances in Modeling Earth Systems, 3221–3232.

Lien, G.-Y., Miyoshi, T., Nishizawa, S., Yoshida, R., Yashiro, H., Adachi, S. A., Yamaura, T. and Tomita, H. (2017) The near-real-time SCALE-LETKF system: A case of the September 2015 Kanto-Tohoku heavy rainfall. Scientific Online Letters on the Atmosphere, 1–6.

Liu, J., Kalnay, E., Miyoshi, T. and Cardinali, C. (2009) Analysis sensitivity calculation in an ensemble Kalman filter. Quarterly Journal of the Royal Meteorological Society, 1842–1851.
Lorenc, A. C. (2003) The potential of the ensemble Kalman filter for NWP – a comparison with 4D-Var. Quarterly Journal of the Royal Meteorological Society, 3183–3203.

Lupu, C., Gauthier, P. and Laroche, S. (2011) Evaluation of the impact of observations on analyses in 3D- and 4D-Var based on information content. Monthly Weather Review, 726–737.

Ménétrier, B., Montmerle, T., Michel, Y. and Berre, L. (2015) Linear filtering of sample covariances for ensemble-based data assimilation. Part I: Optimality criteria and applications to variance filtering and covariance localization. Monthly Weather Review, 1622–1643.

Mitchell, H. L. and Houtekamer, P. (2000) An adaptive ensemble Kalman filter. Monthly Weather Review, 416–433.

Miyoshi, T., Kunii, M., Ruiz, J., Lien, G.-Y., Satoh, S., Ushio, T., Bessho, K., Seko, H., Tomita, H. and Ishikawa, Y. (2016) "Big Data Assimilation" revolutionizing severe weather prediction. Bulletin of the American Meteorological Society, 1347–1354.

Ott, E. and Coauthors (2004) A local ensemble Kalman filter for atmospheric data assimilation. Tellus, 415–428.

Pham, D. T., Verron, J. and Roubaud, M. C. (1998) A singular evolutive extended Kalman filter for data assimilation in oceanography. Journal of Marine Systems, 323–340.

Purser, R. J. and Huang, H. L. (1993) Estimating effective data density in a satellite retrieval or an objective analysis. Journal of Applied Meteorology, 1092–1107.

Rabier, F., Fourrié, N., Chafai, D. and Prunet, P. (2002) Channel selection methods for Infrared Atmospheric Sounding Interferometer radiances. Quarterly Journal of the Royal Meteorological Society, 1011–1027.

Rodgers, C. (2000) Inverse Methods for Atmospheric Sounding: Theory and Practice. World Scientific Publishing.

Schraff, C. and Coauthors (2016) Kilometre-scale ensemble data assimilation for the COSMO model (KENDA). Quarterly Journal of the Royal Meteorological Society, 1453–1472.

Tippett, M., Anderson, J., Bishop, C., Hamill, T. and Whitaker, J. (2003) Ensemble square root filters. Monthly Weather Review, 1485–1490.

Tsyrulnikov, M. (2010) Is the Local Ensemble Transform Kalman Filter suitable for operational data assimilation? COSMO Newsletter, 22–36.

Wahba, G., Johnson, D., Gao, F. and Gong, J. (1995) Adaptive tuning of numerical weather prediction models: Randomized GCV in three- and four-dimensional data assimilation. Monthly Weather Review, 3358–3369.

Wang, X., Bishop, C. and Julier, S. (2004) Which is better, an ensemble of positive–negative pairs or a centered spherical simplex ensemble? Monthly Weather Review, 1590–1605.

Wang, X., Hamill, T. M., Whitaker, J. S. and Bishop, C. H. (2007) A comparison of hybrid ensemble transform Kalman filter–optimum interpolation and ensemble square root filter analysis schemes. Monthly Weather Review, 1055–1076.

Whitaker, J. S. and Hamill, T. M. (2002) Ensemble data assimilation without perturbed observations. Monthly Weather Review, 1913–1924.

— (2012) Evaluating methods to account for system errors in ensemble data assimilation. Monthly Weather Review, 3078–3089.

Yamaguchi, H., Hotta, D., Kanehama, T., Ochi, K., Ota, Y., Sekiguchi, R., Shimpo, A. and Yoshida, T. (2018) Introduction to JMA's new Global Ensemble Prediction System. CAS/JSC WGNE Research Activities on Atmospheric and Oceanic Modelling, 6.13–6.14.

Zhang, F., Snyder, C. and Sun, J. (2004) Impacts of initial estimate and observation availability on convective-scale data assimilation with an ensemble Kalman filter. Monthly Weather Review, 1238–1253.
Appendix: DFS diagnostics applied to a quasi-operational global LETKF

Introduction

The Degrees of Freedom for Signal (DFS), or analysis sensitivity to observations, first introduced to NWP by Cardinali (2004) and Fisher (2003), is a convenient measure of how much information content a particular data assimilation (DA) system can extract from different types of observations. Diagnosing this quantity is useful for identifying issues or limitations of a data assimilation system, of the observation error specification, or of the observations themselves. In this Appendix, we briefly show results from DFS diagnostics applied to a quasi-operational version of the global LETKF system developed at the Japan Meteorological Agency (JMA).

DFS calculation

Following Liu et al. (2009), DFS is calculated from the analysis perturbations mapped onto the observation space, Y^a = H X^a, using

DFS_ens = tr(HK) = tr(H A_ens H^T R^{-1}) = tr[ (R^{-1/2} Y^a)^T (R^{-1/2} Y^a) ] / (K − 1).   (A1)

In our system, the prescribed observation error covariance matrix R is chosen to be diagonal, so the DFS as calculated above can be divided into contributions from each observation (which are just the sample analysis variances corresponding to each observation, normalized by the corresponding observation error variances). Then, for each subset of observations grouped by instrument or observation type, we can define a "DFS per observation" (also referred to as "self sensitivity"; see Cardinali (2004) for details). Monitoring the per-obs DFS thus defined for different observation types is a very useful diagnostic; a computational sketch of this partitioning is given below.
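As a minimal computational sketch of equation (A1) and of this per-type partitioning (our own illustration: the synthetic perturbations and group labels are hypothetical stand-ins, not JMA data), with the diagonal R stored as a vector of error variances:

    import numpy as np

    def dfs_per_obs(Ya, r_var):
        # Per-observation DFS contributions from Eq. (A1).
        #   Ya:    (p, K) analysis perturbations in observation space, H X^a
        #   r_var: (p,)   diagonal of the observation error covariance R
        # With diagonal R, tr[(R^{-1/2} Ya)^T (R^{-1/2} Ya)]/(K-1) splits into
        # one term per observation: the sample analysis variance at that
        # observation divided by its prescribed error variance.
        K = Ya.shape[1]
        return (Ya ** 2).sum(axis=1) / ((K - 1) * r_var)

    # toy inputs, used only to exercise the bookkeeping
    rng = np.random.default_rng(1)
    p, K = 1000, 50
    Ya = 0.3 * rng.standard_normal((p, K))
    Ya -= Ya.mean(axis=1, keepdims=True)      # perturbations have zero mean
    r_var = np.full(p, 1.0)
    obs_type = rng.choice(["RADIOSONDE", "AMSU-A", "IASI"], size=p)

    contrib = dfs_per_obs(Ya, r_var)
    total = contrib.sum()
    print(f"total DFS = {total:.1f}, mean per-obs DFS = {total / p:.4f}")
    for t in np.unique(obs_type):
        sel = obs_type == t
        print(f"{t:>10s}: share of total = {contrib[sel].sum() / total:5.1%}, "
              f"per-obs DFS = {contrib[sel].mean():.4f}")

The mean per-obs DFS printed here corresponds to tr(HK)/p, the quantity referred to below as the Observation Influence.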
Experimental setup

The DA system analyzed here is a pre-operational version of JMA's global LETKF that is operated to produce the initial perturbations used in the global ensemble prediction system (Yamaguchi et al., 2018). It is a 50-member LETKF system, each member of which is a lower-resolution run (TL319 in the horizontal, corresponding to a grid spacing of roughly 55 km, with 100 vertical levels reaching from the surface up to 0.01 hPa) of JMA's Global Spectral Model (GSM). At the end of each analysis update, the analysis mean is recentered to the deterministic higher-resolution analysis produced by 4DVar. A detailed description of this LETKF system is given in section 3.3.3.1 of JMA (2019). In this experiment, all the types of observations that are used operationally by 4DVar are assimilated by the LETKF.

The assimilated observations are grouped into the following categories: SYNOP (surface pressure measurements from ground-based stations), SHIP (surface pressure measurements over the seas reported from vessels or moored buoys), BUOY (as in SHIP but from drifting buoys), RADIOSONDE (upper-level sounding observations of pressure, temperature, winds and humidity reported from radiosondes), PILOT (upper-level wind observations from rawin or pilot balloons), AIRCRAFT (aircraft observations reported via the AIREP or AMDAR programmes), TYBOGUS (typhoon bogus data), PROFILER (wind profiles measured by ground-based radars), GNSS-DELAY (zenith total delay observations from ground-based GNSS receivers), GNSS-RO (GNSS radio occultation observed by low-earth-orbit satellites), AMVGEO (upper-level winds inferred as atmospheric motion vectors from geostationary satellite imagery), AMVLEO (as in AMVGEO but from low-earth-orbit satellites), AMSU-A (microwave radiance soundings from AMSU-A sensors), AIRS (hyper-spectral infrared soundings from the AIRS sensor), MHS (microwave humidity soundings from MHS sensors), IASI (hyper-spectral infrared soundings from the IASI sensor), SCATWIND (ocean surface wind vectors inferred from ASCAT scatterometers), TMI (microwave imagery from the TMI sensor onboard the TRMM satellite), AMSR2 (microwave imagery from the AMSR2 sensor onboard the GCOM-W satellite), SSMIS (microwave imagery from SSMIS sensors) and CSR (clear-sky radiance imagery from water-vapour-sensitive channels of geostationary satellites). Details on these observations are documented in Section 2.2 of JMA (2013) or JMA (2019).

The results shown here are based on statistics taken over the 5-day period from 06 UTC 10 July 2013 to 00 UTC 15 July 2013. Just five days may not be long enough for fully reliable statistics, but we have confirmed that the DFS statistics are not too different for periods from different seasons; to keep this appendix concise, we focus only on this particular 5-day period.

Overview of the results

To gain insight into which observation types are most informative in terms of information content, we first examine the relative (fractional) contributions from each observation type to the total DFS, defined as the sum of DFS over the observations within each group divided by the total DFS (the sum of DFS over all observations); these are plotted in Figure A1. The largest contributions to the DFS come from AMSU-A and GNSS-RO observations, which together account for more than half of the total DFS, followed by CSR, RADIOSONDE and AIRCRAFT. We highlight that the contributions from hyper-spectral soundings (AIRS and IASI) are relatively small despite the fact that they dominate in terms of data volume (the number of observations); this is in stark contrast to recent DFS results from variational analyses at ECMWF (Cardinali, 2013) and Météo France, for instance, where a large contribution from hyper-spectral sounders (notably IASI) is reported.

The mean per-obs DFS, defined as tr(HK)/p where p is the total number of assimilated observations, is a measure of how much information the analysis extracts from observations on average. In some literature (e.g., Cardinali, 2004; Lupu et al., 2011), this is called the Observation Influence (OI). The mean per-obs DFS for our LETKF system was only 0.0157 (1.57%), which is, to the authors' surprise, far smaller than the ∼11% reported for the ECMWF operational 4DVar (Cardinali, 2013); similarly, Lupu et al. (2011) reported that the mean per-obs DFS in the Canadian operational 4DVar was about 10%.
In an idealized Observing System Simulation Experiment (OSSE) using an intermediate global atmospheric model in which rather sparse in-situ observations are assimilated by an LETKF (Liu et al., 2009), the mean per-obs DFS was about 15%. As we show in the next paragraph, the smallness of the mean per-obs DFS can be attributed to the small per-obs DFS for hyper-spectral sounders (AIRS and IASI), which constitute more than 70% of the total observation count.

Figure A2 plots the per-obs DFS (or self sensitivity) calculated for different types of observations, using samples taken from (a) the entire globe, (b) the Tropics, (c) the Northern Hemisphere extratropics (NH), and (d) the Southern Hemisphere extratropics (SH). In all of the regions, conventional (non-radiance) observations tend to have higher per-obs DFS than satellite radiance observations (except for CSR, which are assigned relatively small error variances and are relatively sparse compared to other radiance data because of the cloud-freeness constraint and the strong horizontal thinning that picks only one observation in each 200 km × 200 km box). Notably, isolated observations like BUOY and SHIP exhibit large per-obs DFS. It can also be observed that the per-obs DFS for conventional observations (such as SYNOP, SHIP, BUOY and RADIOSONDE) are larger in the SH and the Tropics than in the NH. These features are consistent with Cardinali (2004), who showed that isolated observations tend to show larger DFS.

What is striking in Figure A2, in comparison to similar diagnostics from variational DA systems (e.g., Cardinali (2013) and the real-time monitoring at Météo France), is that the per-obs DFS for hyper-spectral sounders (AIRS and IASI) are very small. We discuss this point in the next subsection.

Discussion

The formula for computing DFS applicable to any EnKF algorithm was devised by Liu et al. (2009) but appears not to have been applied to an operational EnKF implementation. Here, DFS diagnostics are applied, perhaps for the first time, to a quasi-operational global LETKF implementation that assimilates all types of observations that are operationally assimilated by 4DVar.

The striking feature of the DFS diagnosed for the LETKF, in comparison to those reported for variational DA systems in the literature, is that the mean per-obs DFS is very small. This feature is largely attributable to the smallness of the per-obs DFS for AIRS and IASI, which constitute more than 70% of the total data count. The important question, then, is to understand why DFS is so small for these types of observations. From the classical theory of DFS (Cardinali, 2004; Fisher, 2003), which assumes that the Kalman gain K is accurately computed from B and R, for DFS to be small the observation error variance has to be large in comparison to its counterpart from the background.
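To see this classical relation in its simplest form (a textbook scalar illustration stated here for intuition, not a formula taken from this Appendix), consider a single state variable observed directly, with background error variance σ_b² and observation error variance σ_o²; under an exact Kalman gain the self sensitivity reduces to

DFS = HK = σ_b² / (σ_b² + σ_o²),

which approaches 1 for accurate observations (σ_o² ≪ σ_b²) and is small only when σ_o² ≫ σ_b².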
In our LETKF, the observation error covariance R is identical to what is used in the operational 4DVar, while the background error variance (as inferred from the background ensemble spread) tends to be smaller than its counterpart prescribed in 4DVar. This discrepancy in the magnitude of B is perhaps one factor contributing to the discrepancies in DFS between the LETKF and 4DVar in general, but it alone cannot explain why the per-obs DFS in the LETKF are reasonably large for sparse observations like BUOY and SHIP yet disproportionately small for AIRS and IASI. It appears, then, that to understand and explain the discrepancy in DFS for AIRS and IASI between the LETKF and 4DVar, we need to examine how the Kalman gain K is computed in the two algorithms. This question motivated the study presented in the main part of this manuscript.

Figures

Figure 1: Structure of the true background error covariance matrix B for parameters (a) d = 5 and (b) d = 20. Each curve represents a column of B, starting from the 15th column with a stride of 30. Larger values of d correspond to narrower peaks.

Figure 2: DFS attained by ETKF (or each local analysis of LETKF), DFS_ens, plotted as a function of the ensemble size K, for (a) the "high DFS" and (b) the "low DFS" scenarios. As a reference, the DFS that would be attained by an optimal canonical KF, DFS_opt, is plotted in each panel as a horizontal dashed line.

Figure 3: Trace of the analysis error covariance tr A_ens (gray filled circles) and the analysis MSE (black filled squares) for the ETKF analysis (or each local analysis of LETKF), plotted as a function of the ensemble size K, for (a) the "high DFS" and (b) the "low DFS" scenarios. As a reference, their counterparts obtained by an optimal canonical KF are plotted in each panel as a horizontal dashed line.

Figure 4: The DFS attained by ETKF with ensemble size K = 40 (DFS_ens, gray dashed line) and the DFS attained by the optimal canonical KF (DFS_opt, black solid line), plotted as a function of the number of observations, for (a) the "high DFS" (σ_o = 1) and (b) the "low DFS" (σ_o = 5) scenarios. The vertical line in each panel marks the ensemble size K = 40.

Figure 5: Eigenvalues of the matrices H′ A_opt H′^T (gray thick solid line) and H′ A_ens H′^T (gray thick dotted line), and their counterpart for ETKF with model-space B-localization (black thin solid line) computed using the modulated ensemble. The lines show the mean over 1,000 trials; the sampling variability, defined here as the upper and lower 5 percentiles, is represented by the shading.

Figure 6: As in Figure 5, but for a data assimilation problem on a smaller domain localized with a cut-off distance of 26. Shown are the eigenvalues of the matrices H′_loc A_opt H′_loc^T (gray thick solid line) and H′_loc A_ens H′_loc^T (gray thick dotted line), and their counterpart for LETKF with R-localization (black thin dotted line). As in Figure 5, the shading represents the 5 to 95 percentile ranges.

Figure 7: As in Figure 5, but for the global analysis of LETKF with optimally tuned R-localization (black thin dotted line) and LETKF with domain localization only (black thin dashed line). The latter uses the same cut-off distance d_cut-off = 26 as the former. For comparison, the posterior eigenspectra of the optimal KF (gray thick solid line) and of the raw ETKF without localization (gray thick dotted line) are plotted again. As in Figure 5, the shading represents the 5 to 95 percentile ranges.
Figure 8: As in Figure 3a (σ_o = 1, d = 20, DFS_opt ≈ 39.88), but for LETKF with (a) R-localization together with domain localization, and (b) domain localization only. For each ensemble size K, the localization parameter d_cut-off is manually tuned to yield the smallest analysis MSE with respect to the truth.

Figure A1: Relative contributions to the total DFS from different types of observations.

Figure A2: Per-obs DFS (self sensitivity) for different observation types, computed from samples over (a) the entire globe, (b) the Tropics, (c) the NH extratropics and (d) the SH extratropics.