A Bayesian approach to regional decadal predictability: Sparse parameter estimation in high-dimensional linear inverse models of high-latitude sea surface temperature variability
Dallas Foster∗
Oregon State University, Corvallis, Oregon 97331, USA; Computational Physics and Methods (CCS-2), Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Darin Comeau
Computational Physics and Methods (CCS-2), Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Nathan M. Urban
Computational Physics and Methods (CCS-2), Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
ABSTRACT

Stochastic reduced models are an important tool in climate systems whose many spatial and temporal scales cannot be fully discretized or whose underlying physics may not be fully accounted for. One form of reduced model, the linear inverse model (LIM), has been widely used for regional climate predictability studies, typically focusing on tropical or mid-latitude regions. However, most LIM fitting techniques rely on point estimation techniques deriving from fluctuation-dissipation theory. In this methodological study we explore the use of Bayesian inference techniques for LIM parameter estimation of sea surface temperature (SST), to quantify the skillful decadal predictability of Bayesian LIM models at high latitudes. We show that Bayesian methods, when compared to traditional point estimation methods for LIM-type models, provide better calibrated probabilistic skill, while simultaneously providing better point estimates due to the regularization effect of the prior distribution in high-dimensional problems. We compare the effect of several priors, as well as maximum likelihood estimates, on (1) estimating parameter values in a perfect model experiment and (2) producing calibrated 1-year SST anomaly forecast distributions using a pre-industrial control run of the Community Earth System Model (CESM). Finally, we employ a host of probabilistic skill metrics to determine the extent to which a LIM can forecast SST anomalies at high latitudes. We find that the choice of prior distribution has an appreciable impact on estimation outcomes, and priors that emphasize physically relevant properties enhance the model's ability to capture variability of SST anomalies.
1. Introduction
Inter-annual variability in sea surface temperatures (SSTs) is an important indicator of the variability of the global climate system and has long been studied in relation to other atmospheric variables (Davis and Davis 1976; McKinnon and Deser 2018; Révelard et al. 2018). Forecasting variations in SSTs themselves has been an area of open research: there is interest not only in the time scale of predictability (Davis and Davis 1976; Branstator and Teng 2010; Guemas et al. 2012; Stock et al. 2015b), but also in regional phenomena such as the El Niño-Southern Oscillation (ENSO) and the Pacific Decadal Oscillation (PDO) (Thomas et al. 2018; DiNezio et al. 2017; Penland 1996; Penland et al. 1993; Hawkins and Sutton 2009; Wittenberg et al. 2014), which can have global impacts through atmospheric teleconnections. Stochastic reduced models, like the linear inverse model (LIM), have been widely used for

∗ Corresponding author: Dallas Foster, [email protected]
[This work has been accepted to the Journal of Climate. The AMS does not guarantee that the copy provided here is an accurate copy of the final published work.]

regional climate predictability studies. A LIM is a linear stochastic differential equation model for the evolution of a climate field, such as SSTs. Often, as is done in this paper, the empirical orthogonal functions (EOFs) of said field are modeled (see Penland (1989) for a discussion of the use of EOFs in LIMs). The most prominent application of LIMs is to sub-annual prediction of SSTs for ENSO forecasting (Penland et al. 1993; Penland and Sardeshmukh 1995), but there are also applications to decadal timescales and mid-latitude regions (Alexander et al. 2008; Newman 2007, 2013; Dias et al. 2018; Delsole et al. 2013), extensions to systems with nonlinear and memory effects (Kravtsov et al. 2009; Kondrashov et al. 2015; Chen et al. 2016), as well as a variety of other contexts (Hawkins and Sutton 2009; Penland and Matrosova 2001; Wu et al. 2018; Martinez-Villalobos et al. 2018; Alexander et al. 2008; Barnston et al. 1999).

The high latitudes have long been the subject of numerous climate predictability studies, with Arctic sea ice in particular garnering considerable attention. While practical needs drive predictions of Arctic sea ice on subseasonal to seasonal timescales, many studies have also pushed the forecast horizons to interannual to decadal timescales (Guemas et al. (2016) and references therein). In general, interest in decadal climate predictions has been increasing over the last several years (Meehl et al. 2014), including focus on SSTs (Hawkins et al. 2011) and high latitudes, particularly in the Southern Ocean (Latif et al. 2017; Zhang et al. 2017), where interannual to decadal scale variability plays a strong role in oceanic forcing on the Antarctic ice sheet, particularly the vulnerable West Antarctic Ice Sheet (Jenkins et al.
2016).

The main focus of this paper is the implementation, calibration, and evaluation of the LIM framework as applied to the forecasting of high-latitude SST anomalies using modern Bayesian statistical strategies and probabilistic scoring. We make use of the 1800 year fully-coupled pre-industrial control run of the Community Earth System Model version 1 (CESM1) Large Ensemble project (Kay et al. 2015), which was chosen to ensure that there was enough data to perform validation of the methodology and to avoid complicating factors of non-stationarity, data sparsity, or bias. We adopt a Bayesian perspective because of the regularization effect that the use of prior distributions has in high-dimensional estimation problems (Hastie et al. 2017; Vogel 2002) and the ability to produce better probabilistic calibration when compared to point estimation (Samaniego 2010).

The parameters of a LIM, i.e. the drift and diffusion matrices, must be inferred from data. The inference of the drift term is traditionally, e.g. in Penland (1989), achieved by using fluctuation-dissipation theory (FDT) relating the matrices and the data covariances. Estimates of the drift and incremental noise covariance matrices obtained in this way are also equivalent to the Maximum Likelihood (ML) estimates, as shown in Penland (1989) and presented in section 3. To simplify the estimation problem, the dynamics of the system are assumed linear and Markovian so that the eigenfunctions – or so-called principal oscillation patterns (POPs) – of the propagating matrices can be analyzed for spatial patterns (von Storch and Bruns 1988; Hasselmann 1988). This technique produces point estimates for the parameters, which allows scientists to infer the physical characteristics of the EOFs from the system parameters (i.e. the spectrum of the drift matrix) and to forecast climate statistics using a reduced model.
ML and frequentist approaches can be used to obtain sample statistics like confidence intervals and skewness and kurtosis estimates, as in Martinez-Villalobos et al. (2018), which provides insight into parameter uncertainty, but we argue that the Bayesian perspective provides a complete framework to specify prior information and propagate parameter uncertainty through the forecasting model.

To facilitate the main focus of this paper, the secondary objective is to consider the LIM problem from a Bayesian perspective and explore implementation strategies. Bayesian parameter estimation diverges from prevailing frequentist statistical estimation by defining the parameters as random variables with probability distributions and by the subsequent role of the diverse prior probabilities given to the estimands. For a brief review of Bayesian techniques in parameter estimation see Gelman et al. (2013). Notably, filtering approaches, as, for example, in Hansen and Penland (2007) and Zhao et al. (2019), offer convenient approaches to state and parameter estimation at the cost of the assumption of Gaussian or particle filter distributions. The filtering approach can be beneficial for time-dependent parameters, but may also have difficulty with filter divergence or with exploring the entire posterior probability space (Särkkä 2010). Because the LIM parameters are not time dependent, we have chosen in this paper an MCMC-variant approach to estimate the posterior distribution of the parameters.

In our application there are two primary motivations for a Bayesian approach to parameter estimation. First, we argue that the likelihood distribution induced by a LIM is relatively weak, so that the posterior distribution is highly sensitive to a prior distribution, and this sensitivity leads to improvements in parameter recovery and forecast skill. In other words, the problem is high-dimensional and the amount of information per parameter is relatively low.
The introduction of prior densities in the Bayesian framework is a way to add additional information to, or regularize, the parameter distributions. It is known that, in parameter estimation, the Bayesian estimates will outperform the frequentist estimates as long as the Bayesian prior does not have both a large error in mean and a relatively large weight (Samaniego 2010). Second, we argue that propagating parameter uncertainties through the model can produce better calibrated forecast distributions. In the traditional ML approach, only the point estimates are typically used to produce forecast distributions, although cross-validation methods provide another way to regularize parameter estimates and produce an ensemble of forecasts. The Bayesian approach provides a framework and means to characterize the distribution of parameter uncertainties and their effect on the model and forecast distribution. By fully characterizing this uncertainty, we believe that the Bayesian methods detailed in this paper are better suited to reproduce observation distributions when compared to a traditional point-estimate approach.

The third thrust of this paper is a focus on the performance of the Bayesian perspective (in relation to MLE/FDT) as a function of the specific prior implemented. Using nontrivial priors in time series analysis has a long history in economics, where Bayesian methods are used to estimate coefficients (often) of Vector Auto-Regressive (VAR) models (Litterman 1986; Giannone et al. 2015; Partridge and Rickman 1998; Doan et al. 1983). There have been various attempts to introduce Bayesian learning of parameters in stochastic differential equations, but these usually require some assumptions or approximations of the underlying likelihood distribution (Eraker 2001; Beskos et al. 2006; Karimi and McAuley 2016; Batz et al. 2018; Tian et al. 2014; Stock et al. 2015a; Møller et al. 2011).
Introduction of nontrivial prior beliefs about the parameters can complicate the process of describing the necessary parameters, since the densities resulting from stochastic differential equations do not generally have closed-form expressions. Consideration of complex prior distributions in parametric models is underdeveloped, but the effect of prior densities in non-parametric modeling has been studied in van Zanten (2013).

Statistical estimation of stochastic models has a long history starting with ML and least squares estimates as a simplification; see Le Breton (1977) and Delsole and Yang (2010). In application to climate statistics, reviews can be found in DelSole et al. (1999) and DelSole et al. (2003) and comparisons in Mason et al. (2002); Nummelin et al. (2018). Following Penland (1989), the use of FDT approximations for the LIM to derive closed-form parameter estimation formulas makes this approach the method of choice in geoscience applications (Penland 1989; Penland and Sardeshmukh 1995; Nummelin et al. 2018; Barnston et al. 1999; Colman and Davey 2003), particularly over various autoregressive models (DelSole et al. 2003; Penland et al. 1993). Barnston et al. (2012) compare LIMs and other physically and statistically based models in the IRI ENSO prediction plume. In this paper, we use a statistical framework to show that the same estimates can be achieved by maximizing a natural parameter likelihood function and that the introduction of prior distributions in a Bayesian sense is a natural extension.

The regression interpretation of LIMs gives us reason to believe why regularization is necessary. While the Gauss-Markov Theorem ensures that the MLEs for LIMs have the smallest variance of all unbiased estimates, it is known in general that these estimates can be improved by various methods of regularization, like ridge regression, LASSO, and cross-validation, that reduce estimator variance (Hastie et al. 2017).
The need for regularization may be the result of the presence of confounding variables or model error and the lack of data relative to the number of parameters. In fact, the use of a Bayesian prior can be seen as a type of regularization that aims to address these limitations.

In what follows we briefly discuss the theoretical underpinnings and results of LIMs, MLE for these models, and Bayesian methods for parameter estimation. The aim of this paper is then to provide insight into the use of a Bayesian framework with nontrivial prior distributions and how they can be used to generate well-calibrated SST anomaly forecasts. To this point, in the results section, we present two experiments to compare the use of ML and Bayesian estimation methods. First, in a perfect model methodology, we compare convergence of parameter estimates. Second, we apply these methods to a LIM of SST anomalies and analyze the forecast performance of each using a variety of traditional and probabilistic skill metrics. We find that the use of Bayesian methods with informative priors is significantly more skillful in forecasts and can better estimate the underlying probability distribution of the data.
2. Linear Inverse Modeling
Consider the linear stochastic differential equation,

$$d\mathbf{x}_t = \mathbf{B}\mathbf{x}_t\,dt + d\mathbf{W}_t, \qquad (1)$$

where $x \in \mathbb{R}^m$, $B \in \mathbb{R}^{m \times m}$, and $W$ is an $m$-dimensional Wiener process with zero mean and covariance given by $Q \in \mathbb{R}^{m \times m}$. We denote the covariance of the process $x$: $\langle x x^T \rangle = \Lambda$. The values of $B$ and $Q$ are not known a priori, and the fitting of a LIM requires the estimation of these $m^2 + m(m+1)/2$ total parameters, since $Q$ is symmetric.

The solution to (1) at a time $t + \tau$ is given by

$$\mathbf{x}_{t+\tau} = \mathbf{G}(\tau)\mathbf{x}_t + \xi_t, \qquad \xi_t \sim \mathcal{N}(0, \mathbf{\Sigma}(\tau)), \qquad (2)$$

where

$$\mathbf{G}(\tau) = e^{\mathbf{B}\tau} \qquad (3)$$

is referred to as the propagating matrix of lead time $\tau$ (in the literature it is also called the Green's function for this differential equation), the incremental noise process satisfies the relation

$$\mathbf{\Sigma}(\tau) = \mathbf{\Lambda} - \mathbf{G}(\tau)\,\mathbf{\Lambda}\,\mathbf{G}(\tau)^T, \qquad (4)$$

and the fluctuation-dissipation relation gives

$$\mathbf{B}\mathbf{\Lambda} + \mathbf{\Lambda}\mathbf{B}^T = -\mathbf{Q} \qquad (5)$$

(see Penland (1989) for detailed calculations). Both $Q$ and $\Sigma$ must be symmetric positive definite, while $B$ must have eigenvalues with negative real part in order for the solution (2) to be stable. See Penland and Sardeshmukh (1995) for a lengthier discussion on tests for the applicability of the assumptions of LIMs and the resulting approximations.
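The relations above can be checked numerically. The following sketch is our own illustration, not the paper's code: the matrices `B` and `Q` are arbitrary stable/SPD choices. It simulates (1) with Euler-Maruyama and verifies that the empirical covariance approximately satisfies the fluctuation-dissipation relation (5).

```python
import numpy as np

# Simulate dx_t = B x_t dt + dW_t (Euler-Maruyama), then check that the
# empirical covariance Lam = <x x^T> approximately satisfies the
# fluctuation-dissipation relation  B Lam + Lam B^T = -Q.
rng = np.random.default_rng(0)
B = np.array([[-1.0, 0.3], [0.0, -0.5]])   # eigenvalues have negative real part
Q = np.array([[0.2, 0.05], [0.05, 0.1]])   # symmetric positive definite
Lq = np.linalg.cholesky(Q)                 # noise increments dW ~ N(0, Q dt)

dt, n_steps = 1e-2, 200_000
x = np.zeros(2)
samples = np.empty((n_steps, 2))
for n in range(n_steps):
    x = x + B @ x * dt + Lq @ rng.standard_normal(2) * np.sqrt(dt)
    samples[n] = x

Lam = np.cov(samples[n_steps // 10:].T)    # drop transient, estimate <x x^T>
residual = B @ Lam + Lam @ B.T + Q         # should be near zero by (5)
print(np.abs(residual).max())
```

The residual is nonzero only because of Monte Carlo sampling error and the O(dt) discretization bias of Euler-Maruyama; it shrinks as the simulated path lengthens and dt decreases.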
3. Maximum Likelihood Estimation
Now consider observations of the random process $y_n = x(t_n)$ for $n = 1, \ldots, T$, abbreviated as $y_{1:T}$ when considering a range of observations. We assume that the observation process is exact, but the introduction of observational noise can be accommodated with only minor adjustments to the methods detailed below. The conditional probability density function, typically called the conditional marginal likelihood, for these observations conditioned on the matrices $G$ and $\Sigma$ defined above is Gaussian and given by

$$p\left(\mathbf{y}_{1:T} \mid \mathbf{G}, \mathbf{\Sigma}\right) = \prod_{n=1}^{T-\tau} \mathcal{N}\left(\mathbf{y}_{n+\tau} - \mathbf{G}(\tau)\mathbf{y}_n,\; \mathbf{\Sigma}(\tau)\right). \qquad (6)$$
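To make the evaluation of (6) concrete, the sketch below (our illustration; the data and parameter values are placeholders, not the paper's) computes the conditional log-likelihood of a simulated series under the AR(1) form (2).

```python
import numpy as np
from scipy.stats import multivariate_normal

def lim_log_likelihood(y, G, Sigma, tau=1):
    """Log of (6): sum over n of log N(y_{n+tau} - G(tau) y_n; 0, Sigma(tau)).
    y: (T, m) array of observations; G, Sigma: (m, m) matrices."""
    resid = y[tau:] - y[:-tau] @ G.T   # y_{n+tau} - G y_n for all n at once
    return multivariate_normal(mean=np.zeros(y.shape[1]), cov=Sigma).logpdf(resid).sum()

rng = np.random.default_rng(1)
G_true = np.array([[0.8, 0.1], [0.0, 0.6]])
Sigma_true = 0.1 * np.eye(2)
y = np.zeros((500, 2))
for n in range(499):                   # simulate the discrete form (2)
    y[n + 1] = G_true @ y[n] + rng.multivariate_normal(np.zeros(2), Sigma_true)

# The likelihood should prefer the generating G over a heavily perturbed G.
print(lim_log_likelihood(y, G_true, Sigma_true) >
      lim_log_likelihood(y, G_true + 0.3, Sigma_true))
```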
Taking the derivative of the marginal likelihood with respect to $G(\tau)$ and $\Sigma(\tau)$, we can derive formulas for the so-called maximum likelihood (ML) estimates

$$\hat{\mathbf{G}}(\tau) = \left(\sum_{n=1}^{T-\tau} \mathbf{y}_{n+\tau}\mathbf{y}_n^T\right)\left(\sum_{n=1}^{T-\tau} \mathbf{y}_n\mathbf{y}_n^T\right)^{-1}, \qquad (7)$$

$$\hat{\mathbf{\Sigma}}(\tau) = \frac{1}{T-\tau-1}\sum_{n=1}^{T-\tau} \left(\mathbf{y}_{n+\tau} - e^{\mathbf{B}\tau}\mathbf{y}_n\right)\left(\mathbf{y}_{n+\tau} - e^{\mathbf{B}\tau}\mathbf{y}_n\right)^T = \langle \mathbf{y}_{\tau:T}\mathbf{y}_{\tau:T}^T \rangle - \hat{\mathbf{G}}(\tau)\langle \mathbf{y}_{1:T-\tau}\mathbf{y}_{1:T-\tau}^T \rangle\, \hat{\mathbf{G}}(\tau)^T. \qquad (8)$$

From these matrices, one calculates the matrices $\hat{B}$ using (3) and $\hat{Q}$ using (5) by approximating $\Lambda$ with ensemble averages. While $\hat{Q}$ is formally obtained from the fluctuation-dissipation relation, we will also refer to this as the ML estimate of $Q$ since it is obtained from using the ML estimate of $B$. Note that (8) is approximately equal to (4) under the assumption of stationarity.

As a numerical example, consider the one-dimensional case. In Fig. 1 a stochastic process $x_t$ was generated using chosen values of $B$ and $Q$, and sampled to form observations $y_k$. We then compute $p(y_{1:T} \mid G, \Sigma)$, where we simply use $\tau = 1$. For the scalar case, trace plots of $p(y_{1:T} \mid B, Q)$ can be obtained from $p(y_{1:T} \mid G, \Sigma)$ since $\Sigma$ and $Q$ have the simple relationship $\Sigma = (e^{2B\tau} - 1)Q/(2B)$ from solving (4) and (5). The trace plots for $Q$ and $B$ are displayed in the first and third image respectively, adjoined by their Bayesian estimates as described further in section 4a. The locations of the maxima of these trace plots give the ML estimates $\hat{B}$ and $\hat{Q}$, whose experimental relative errors were found to be approximately 146% and 6% respectively, while the relative error for $\hat{G}$ was less than 1%. Since computing $\hat{B} = \log(\hat{G})/\tau$ increases the variance of the marginal likelihood and amplifies the relative error, it is conceivable from this experiment alone that reasonable forecast results can be achieved with $\hat{G}$ while the numerical value of $\hat{B}$ differs significantly from the 'true' value $B$. Additionally, observe that the marginal likelihood places significant probability on values of $B$ which are positive, something which we know leads to unstable dynamics and is physically impossible. Regularization of the marginal likelihood can help constrain estimates away from undesirable regions; the Bayesian approach can be seen as a particular justification of this regularization.
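A minimal implementation of the ML formulas (7)-(8), applied to an illustrative synthetic series (the function and variable names are ours, and the matrix logarithm inverts (3) under the assumption that it returns the correct real branch):

```python
import numpy as np
from scipy.linalg import logm

def ml_estimates(y, tau=1):
    """ML estimates (7)-(8) for G(tau), Sigma(tau) from a (T, m) series y."""
    Y1, Y0 = y[tau:], y[:-tau]
    # (7): G_hat = (sum y_{n+tau} y_n^T)(sum y_n y_n^T)^{-1}
    G_hat = (Y1.T @ Y0) @ np.linalg.inv(Y0.T @ Y0)
    # (8): sample covariance of the one-step residuals
    resid = Y1 - Y0 @ G_hat.T
    Sigma_hat = resid.T @ resid / (len(resid) - 1)
    # B_hat from (3) via the matrix logarithm
    B_hat = logm(G_hat).real / tau
    return G_hat, Sigma_hat, B_hat

rng = np.random.default_rng(2)
G_true = np.array([[0.7, 0.2], [0.1, 0.5]])
y = np.zeros((5000, 2))
for n in range(4999):
    y[n + 1] = G_true @ y[n] + 0.3 * rng.standard_normal(2)

G_hat, Sigma_hat, B_hat = ml_estimates(y)
print(np.abs(G_hat - G_true).max())   # small for this much data
```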
4. Bayesian Methods for Parameter Estimation
Determining the most likely values and distributions for the LIM parameters is inherently a high-dimensional problem with hundreds, if not thousands, of parameters needing estimation. Because the ratio of the number of parameters to data points may be close to or larger than unity, regularization is needed to constrain possible parameter values and prevent over-fitting. In the statistical sense, regularization provides a way to balance bias and variance in the estimates. A Bayesian probabilistic framework provides a mechanism to formally regularize the maximum likelihood distribution by incorporating prior beliefs about the uncertainty of the parameters, which is equivalent to Tikhonov regularization (Vogel 2002), also known as ridge regression.

Bayesian philosophy assumes that the parameters $\theta \in \mathbb{R}^d$ are random variables with joint prior probability distribution $p(\theta)$. This prior distribution represents any a priori knowledge we have about the nature and structure of the parameters. A state space model with model states $x(t_n)$, observations $y_n$, and unknown parameters may be written as

$$\theta \sim p(\theta), \quad \mathbf{x}(t_0) \sim p(\mathbf{x}(t_0), \theta), \quad \mathbf{x}(t_n) \sim p(\mathbf{x}(t_n) \mid \mathbf{x}(t_{n-1}), \theta), \quad \mathbf{y}_n \sim p(\mathbf{y}_n \mid \mathbf{x}(t_n), \theta), \qquad (9)$$

where we label the initial uncertainty $p(x(t_0), \theta)$, the transition probability density $p(x(t_n) \mid x(t_{n-1}), \theta)$, and the observation uncertainty $p(y_n \mid x(t_n), \theta)$. To simplify the above relations, it is assumed that the initial condition and observation uncertainty are independent of the parameters $\theta$.
The full posterior distribution, with these assumptions, can be written using Bayes' rule:

$$p(\mathbf{x}(t_{0:T}), \theta \mid \mathbf{y}_{1:T}) = \frac{p(\mathbf{x}(t_0))\, p(\theta)}{p(\mathbf{y}_{1:T})} \prod_{n=1}^{T} p(\mathbf{y}_n \mid \mathbf{x}(t_n))\, p(\mathbf{x}(t_n) \mid \mathbf{x}(t_{n-1}), \theta). \qquad (10)$$

Note that (10) gives the joint posterior distribution for both the states $x(t_{0:T})$ and the parameters $\theta$, and estimating it solves not only the parameter estimation problem but also the filtering/smoothing problem. The distribution of only the parameters requires marginalizing out the state variables,

$$p(\theta \mid \mathbf{y}_{1:T}) = \int p(\mathbf{x}(t_{0:T}), \theta \mid \mathbf{y}_{1:T})\, d\mathbf{x}(t_{0:T}). \qquad (11)$$

Computation of this integral can be difficult, but possible using Monte Carlo methods; some possibilities are presented in the appendix. We make the standard approximation that the marginal distribution is given by

$$p(\theta \mid \mathbf{y}_{1:T}) \propto p(\mathbf{y}_{1:T} \mid \theta)\, p(\theta). \qquad (12)$$

Taking the maximum of (12) yields the Maximum A Posteriori (MAP) point estimate, $\hat{\theta}_{\mathrm{MAP}}$,

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max \left[ p(\theta \mid \mathbf{y}_{1:T}) \right]. \qquad (13)$$

Point estimates, like the MAP, are only simple statistics of the underlying posterior distribution, and fail to signal the overall variability of likely parameters, but it will be useful to use $\hat{\theta}_{\mathrm{MAP}}$ when comparing results of estimation from ML and Bayesian techniques.

Fig. 1. A stochastic process with known $B$ and $Q$ was generated by discretizing (1) using Euler-Maruyama with time step $dt$. 1000 observations from that series were sampled with $\tau = dt$ to form $y_{1:T}$. From left to right: (1) trace plot of the marginal likelihood $p(y_{1:T} \mid Q)$, (2) marginal likelihood $p(y_{1:T} \mid Q)$ along with a normal prior $p(Q)$ and the resulting posterior distribution $p(Q \mid y_{1:T})$, (3) trace plot of the marginal likelihood $p(y_{1:T} \mid B)$, and finally (4) the marginal likelihood $p(y_{1:T} \mid B)$ along with a normal prior $p(B)$ and the resulting posterior distribution $p(B \mid y_{1:T})$. Maximum Likelihood and Maximum A Posteriori estimates are shown for comparison between frequentist and Bayesian point estimates, while standard deviations of the distributions are shown to compare uncertainties in the parameters from the Bayesian perspective.

From (12), it will be sufficient to consider only the prior distribution of the parameters and the conditional marginal likelihood distribution (6) to form an approximate marginal posterior distribution. Having already defined the marginal likelihood, the remaining factor in estimating the posterior distribution is the choice of prior $p(\theta)$. We discuss some common choices and their properties.

a. Prior Distributions

The prior distribution incorporates our beliefs about the nature of the parameters and their structure. These beliefs are not perfect and may be subject to their own uncertainty. The role of a prior distribution is to weakly prefer certain parameter values when no information is present. Upon inspection, the posterior distribution (10) can be seen as a competition between the likelihood and the prior, where the information from the prior becomes diluted given data. One can also view a prior as a penalty term whose role is to weakly enforce constraints on the parameters, especially when considering a Gaussian prior and likelihood, where the prior can be viewed as a type of Tikhonov regularization.

There are various labels that have been applied in order to classify priors, see Gelman et al. (2013), but we will not rely on these terms.
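The Tikhonov/ridge connection can be made concrete: with a Gaussian likelihood (here simplified to $\Sigma = \sigma^2 I$) and independent $\mathcal{N}(0, s^2)$ priors on the entries of $G$, the MAP estimate has the closed ridge-regression form $\hat{G}_{\mathrm{MAP}} = (Y_1^T Y_0)(Y_0^T Y_0 + \lambda I)^{-1}$ with $\lambda = \sigma^2/s^2$. The sketch below is our illustration of this identity under those simplifying assumptions, with arbitrary data; it is not the paper's estimator.

```python
import numpy as np

def g_map_ridge(y, sigma2, s2, tau=1):
    """MAP estimate of G under Sigma = sigma2*I and N(0, s2) entry priors:
    the ridge-regression form with regularization lam = sigma2 / s2."""
    Y1, Y0 = y[tau:], y[:-tau]
    lam = sigma2 / s2
    return (Y1.T @ Y0) @ np.linalg.inv(Y0.T @ Y0 + lam * np.eye(y.shape[1]))

rng = np.random.default_rng(3)
y = rng.standard_normal((50, 5))             # short, arbitrary series: regularization matters
G_ml = g_map_ridge(y, sigma2=1.0, s2=1e12)   # lam -> 0 recovers the ML estimate (7)
G_map = g_map_ridge(y, sigma2=1.0, s2=0.1)   # informative prior shrinks G toward 0
print(np.linalg.norm(G_map) < np.linalg.norm(G_ml))
```

Larger $\lambda$ (a tighter prior) shrinks every entry of $\hat{G}$ toward zero, which is exactly the "penalty term" reading of the prior described above.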
We distinguish and delineate different prior densities based on whether they inform scale or sign information and whether the constraint informs certain structures. We will also have cause to highlight the number of so-called 'hyper-parameters' (additional parameters needed in order to fully specify the prior distributions). The presence of a large number of hyper-parameters can increase the flexibility of a prior distribution while also increasing the storage and computational costs of producing random samples from the posterior distribution.

We discuss several options of priors, of varying type, for both $B$ and $Q$ that we will perform experiments with. A simple prior would be to assign a normal distribution uniformly to all parameters,

$$\mathbf{B}_{ij} \sim \mathcal{N}(\mu, 1), \qquad \mathbf{Q}_{ij} \sim \mathcal{N}(0, 1),$$

where $\mathcal{N}(\mu, 1)$ is the normal distribution with mean $\mu$ and variance 1. Choosing the normal distribution as a prior does not convey sign information, and so it does not enforce $Q$ to be positive definite, nor does it enforce the eigenvalues of $B$ to be negative. Instead, this choice of prior is a simple and computationally efficient way to weakly enforce the center of mass of the parameters via location and scale information, i.e. it forces the values of $B$ to be somewhat close to $\mu$ and the values of $Q$ to be small. Variations to inform sign could be the half-normal, $\mathcal{N}^+(\mu, \sigma)$, half-Cauchy, $\mathrm{Cauchy}^+(0, \gamma)$, or beta, $\beta(\alpha, \beta)$, distributions.

Consider again the one-dimensional example problem of estimating $B$ and
$Q$ from 1000 sampled data $y$, but now from a Bayesian approach. In Fig. 1, a half-normal prior on $Q$ and a normal prior on $B$ are used in conjunction with the marginal likelihood to produce the Bayesian posteriors in the second and fourth images respectively. In this example, the truth and Maximum A Posteriori values are calculated. For $B$, the error in the MAP estimate of roughly 60% is an improvement over the MLE, while for $Q$, the MAP estimate is marginally worse than the MLE.

Using more sophisticated combinations of distributions, we can leverage knowledge about the structures of the matrix parameters in the LIM. Decompose $Q$ as

$$\mathbf{Q} = \sigma L L^T \sigma, \qquad (14)$$

where $\sigma$ is a diagonal matrix with $\sigma_{ii} = \sqrt{Q_{ii}}$ and $L$, a lower triangular matrix, is the Cholesky factor of the remaining correlation matrix. We impose simple half-normal or half-Cauchy prior distributions on the diagonal elements of $\sigma$. For the correlation factor $L$, a popular choice in the literature is to place an LKJ prior on $LL^T$ (Lewandowski et al. 2009), whose distribution is uniform over the space of correlation matrices.

While mathematical constraints on the spectrum of $B$ pose a possible source of prior information, placing priors directly on an eigenvalue decomposition of $B$ is difficult because it would require specification of probabilities in the complex plane and would increase the number of parameters needing to be estimated. Furthermore, the eigenvalue decomposition is sensitive to random perturbations in the input matrix, to the order of the condition number of the matrix times the perturbation (Stoer and Bulirsch 2002) – enough to possibly misidentify elements of the spectrum of $B$. Instead, we use distributions that indirectly produce these qualities by inducing sparsity (only few non-zero elements) and a (nearly) diagonal matrix. By inducing sparsity, we are hypothesizing that EOFs are weakly coupled and that this coupling decays with the explained variance of the EOF. We focus on two types of distributions that have been used to induce this kind of sparsity to varying degrees, the Minnesota (Litterman) and (Finnish) horseshoe priors (see Appendix A). The extent to which sparsity is induced is data-driven and determined by hyper-parameter values, meaning that we do not impose a strict prior belief of sparse interaction between EOFs.

The Minnesota-Litterman prior (Litterman 1986; Doan et al. 1983) expresses the belief that (2) should be approximately an uncoupled random walk through each dimension. This prior places unit normal priors on the diagonal of $G(\tau)$ and zero-mean normal priors on the off-diagonal components. The variance of these off-diagonal components is then proportional to the variance of the underlying dynamics. This mechanism has the effect of pushing an off-diagonal element to zero if there is not much interaction between the corresponding state variables. There are a total of two hyper-parameters that are used to control the overall level of sparsity in the matrix, making this a computationally attractive prior.

Similar to the Minnesota-Litterman prior, the horseshoe prior (Carvalho et al. 2009) was developed to allow for greater control of matrix sparsity. With the horseshoe prior, all entries of $B$ are given a centered normal distribution whose variance is controlled by a global and a local sparsity hyper-parameter. The global hyper-parameter sets a certain variance level for all elements of the matrix, while each entry has its own local hyper-parameter. Therefore, using a horseshoe prior for $B$, say, requires the learning and sampling of $1 + m^2$ total hyper-parameters. The horseshoe formulation, while increasing the storage cost and producing a more topologically complicated posterior distribution to sample from, can, in principle, produce a more nuanced sparsity pattern than the Minnesota-Litterman prior. In practice, a regularized, or Finnish, horseshoe prior is used.

Once an adequate prior has been chosen, all that remains is to sample the posterior distribution (11) or the approximation (12). There are several methods to achieve this, but in this paper we use a variant of the Markov Chain Monte Carlo (MCMC) method (Brooks 2011), Hamiltonian Monte Carlo (HMC) (Neal 2012); see Appendix B for details of the theory and implementation. This method is chosen, in particular, because of the need for an algorithm scalable to large parameter spaces. HMC utilizes gradient information to produce samples from the target distribution more quickly than traditional MCMC methods, reducing the storage and computational costs required.
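As a stand-in for the HMC sampler used in the paper (whose details are in the appendix), a minimal random-walk Metropolis sampler for the scalar example of section 3 illustrates posterior sampling with a stability-enforcing prior. All values here (true parameters, prior, proposal width) are illustrative choices of ours, not the paper's configuration.

```python
import numpy as np

# Scalar OU model of section 3:  y_{n+1} = e^B y_n + xi,
# xi ~ N(0, (e^{2B} - 1) Q / (2B)),  with Q fixed and prior B ~ N(-1, 1),
# truncated to B < 0 (the stability constraint of section 2).
rng = np.random.default_rng(4)
B_true, Q, T = -0.5, 0.2, 1000
g, s2 = np.exp(B_true), (np.exp(2 * B_true) - 1) * Q / (2 * B_true)
y = np.zeros(T)
for n in range(T - 1):
    y[n + 1] = g * y[n] + np.sqrt(s2) * rng.standard_normal()

def log_post(B):
    if B >= 0:                      # prior support: stable dynamics only
        return -np.inf
    gB, s2B = np.exp(B), (np.exp(2 * B) - 1) * Q / (2 * B)
    resid = y[1:] - gB * y[:-1]
    loglik = -0.5 * np.sum(resid**2) / s2B - 0.5 * (T - 1) * np.log(2 * np.pi * s2B)
    return loglik - 0.5 * (B + 1.0) ** 2   # N(-1, 1) prior on B

chain, B = [], -1.0
for _ in range(20000):              # random-walk Metropolis
    prop = B + 0.1 * rng.standard_normal()
    if np.log(rng.uniform()) < log_post(prop) - log_post(B):
        B = prop
    chain.append(B)
post = np.array(chain[5000:])       # discard burn-in
print(post.mean(), post.std())
```

Unlike HMC, this sampler uses no gradient information and scales poorly to the hundreds of parameters in the full LIM problem, which is why the paper opts for HMC; the mechanics of targeting $p(\theta \mid y_{1:T}) \propto p(y_{1:T} \mid \theta)\,p(\theta)$ are the same.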
5. Parameter Estimation Results
We aim to present a test application to demonstrate that the advantages of enhanced estimation and forecast calibration gained from the Bayesian approach are significant. We compare a suite of estimation techniques, including the ML and various MAP estimators corresponding to combinations of the various aforementioned priors. Furthermore, we test forecast ability using probabilistic metrics to understand the ability of each method to capture the proper forecast distribution.

We consider the task of forecasting yearly SST anomalies. We use SST data from the fully-coupled 1800 year pre-industrial control run from the CESM1 Large Ensemble project (Kay et al. 2015), which are averaged to annual values to produce a time series of 1800 data points for each surface ocean location on a roughly 1° global grid. The control run is used here in order to test the impact of data availability on parameter estimation, not necessarily to comment on the associated dynamics from forecasting, and to avoid the effects of uncertainty in atmospheric forcing. This study is intended to showcase the potential benefits of a Bayesian approach by directly comparing it with the classical ML methodology in a fair manner. Further research is necessary to come to a comprehensive conclusion about the benefits of the Bayesian approach on realistic data sets. An Empirical Orthogonal Function (EOF) decomposition is performed to reduce the more than 86,000 spatial points to 10 global EOF spatial patterns that capture approximately 55% of the explained variance. Let $\{x_t\}$ represent the time series describing the dynamics of the 10 spatial patterns over the 1800 year time frame. A roughly 900 year fraction of this is taken as training data, 500 years are reserved as test data, and the remaining portion is saved for validation purposes. The matrices $B$, $Q$ require $10 \times 10 = 100$ elements each to be estimated; i.e. the ratio of data points to parameters is 10:1. Increasing the number of EOF spatial patterns reduces this ratio while only obtaining marginal increases in explained variance. We focus our attention on the predictability of the high-latitude regions, and the evidence we display will be concerned with each method's performance in these regions. We believe that using global EOFs may capture possible teleconnections between regions, but recognize that strictly higher-latitude EOFs may capture regional variability using fewer modes.

Here we aim to compare the estimation capabilities of the ML method and the Bayesian approach with a suite of prior distributions. This study performs two experiments: first, a perfect model experiment in order to determine the accuracy of ML and MAP estimates of parameters, and, second, a forecasting experiment in order to test the calibration of the various parameter densities.

a. Perfect Model Methodology
In this perfect model scenario, we test the ability of each estimation technique to accurately estimate the mean parameter values given data. Using the truncated EOFs, we first estimate MAP values for the parameters B and Q using HMC (see Appendix B). Using these MAP values, we discretize (1) using the standard Euler-Maruyama numerical method and simulate a path with a sub-annual integration time step ∆t. To sub-sample from this path, we take observations at intervals τ = k∆t = 1 year to construct annual time series of varying lengths N, from which we re-estimate B and Q. The benefit of using data-inspired parameter values, as opposed to specifying values arbitrarily (e.g. diagonal matrices), is to not pre-impose prior beliefs on the test data or true parameter values and potentially incur bias in the test results. Because of the randomness in the construction of the sample paths and Markov chains, the use of HMC estimated parameters as truth should not bias the results towards this method.

For each of the prior distributions that we have discussed, we measure the relative errors (as a percent) of point estimates, arg max p(θ | y_T), from the 'true' parameter values; results are shown in Fig. 2. There are three main conclusions from this experiment. First, as should be no surprise, the relative errors decrease, although somewhat unevenly, as more data become available. The lack of clear convergence, a result of the 'limited' data, should serve as a precaution to those relying on the Central Limit Theorem for asymptotic guarantees. Second, the Bayesian methodology can produce estimates 4-6× more accurate than ML alone. The choice of a Minnesota prior for B and either an LKJ or Horseshoe prior for Q produced nearly identical results, featuring slightly less than 100% relative error in the large data scenario. Finally, imposing structure in a concise manner is a critical component when considering a prior distribution. There are obviously some prior combinations that perform just as poorly as MLE, such as the normal priors, but priors with many hyper-parameters (like using the Horseshoe prior for B) can also be underwhelming.

The errors in this experiment may seem unreasonably high to an observer given the relative success of applying LIMs to similar applications. We address this concern in two ways. First, the signal in the yearly averaged CESM control run data is likely much weaker than those used in other papers, such as those from regional or monthly data sets that include atmospheric forcing. The stronger the mean of the signal compared to the variance, the better the parameter estimation results will be. There are also some artificial or ad hoc ways to reduce noise in the data to achieve the same ends, such as smoothing, that we do not perform in this analysis. It should also be noted that, while statistical bootstrapping techniques are a common way to improve estimation by inflating the sample size, we did not find discernible advantages from such re-sampling and choose not to report the details in this paper. Second, as we will see in the next experiment, an accurate ML or MAP estimate is not necessary for producing reasonable forecasts. While this experiment measures the error in the mean of the posterior distribution, it does not comment on the overall calibration of the probability distributions or how their uncertainties are propagated through to state forecasts. Therefore, it is not necessarily contradictory in practice for there to be both large relative errors in parameter recovery while forecast errors are relatively low.

b. CESM1 Data Experiments
When using a LIM, there is traditionally only one source of uncertainty when computing sample paths of the SDE: the noise process with covariance Q. In the Bayesian framework there is an additional source of uncertainty: the distribution of the parameters B and Q themselves. Therefore, we hypothesize that forecasts computed with the MLE alone are too conservative and mis-estimate the variance and resulting distribution of the underlying physical process. The goal of this section is to measure the calibration, i.e. how well the method reproduces the true forecast probability distribution, of various sample posterior distributions p(x_t | B, Q, y_{t−τ}).

To perform each experiment we calculate a 100-member ensemble of forecasts with lead time τ using parameters drawn from their respective posterior distribution (in the ML case this is taken to be only the point estimate). The posterior distributions are estimated from the training set of 900 years. Forecasts are initialized from the test data set, independent of the training data set. For this test, the data sets are smoothed with a 3 year backward-looking moving average. For the Bayesian estimates, values for B and Q are drawn anew for each ensemble member for each one-step prediction. From this ensemble, we calculate an array of statistics and metrics to judge the accuracy and calibration induced by each choice of prior distribution. Not only will we compute the correlation coefficient, an oft-used statistic in time-series forecasting, but we will also calculate the empirical distribution function for the posterior distribution of sample paths and compare it to the distribution of the original data using probabilistic metrics, in order to have a holistic view of the ability of each method to capture the underlying uncertainty.

Fig. 2. Relative errors (%, measured in a matrix norm ‖·‖) for B and Q using a variety of prior densities and estimation techniques from the perfect model experiment. Data were generated from an Euler-Maruyama discretization (sub-annual ∆t) of a linear SDE with prescribed drift and diffusion coefficients B and Q. From this series, N points were uniformly observed for the estimation data set. Relative errors of the ML or MAP parameter estimates are taken with respect to the true values. We use the notation PriorB_PriorQ, with 'ML', 'N', 'MINN', 'LKJ', 'HORSE' referring to Maximum Likelihood, Normal, Minnesota, LKJ, and Horseshoe priors respectively.

Table 1 displays the correlation coefficient between the median sample path and test data. In practice a correlation coefficient greater than 0.6 represents a heuristic for skillful forecasts (Collins 2002). All fully Bayesian forecasting methods exceed this threshold, and hybrid methods with normal and horseshoe priors for B also produce skillful forecasts that rival the fully Bayesian forecasts. Only using an MLE for B produces poor forecast skill. These two facts together suggest that under-specification of p(B), not p(Q), is one of the major causes of poor forecast performance.

                     Priors for Q
    Priors for B   ML    N       LKJ     HORSE
    ML             –     –       –
    N              –     0.852   0.852   0.851
    HORSE          –     0.852   0.850   0.851
    MINN           –     0.851   0.852

Table 1. Correlation coefficient between the CESM-LE data and the median path from p(x_t | y_t, B, Q) using a variety of prior densities and estimation techniques. Ensembles of predictions were generated from an Euler-Maruyama discretization (sub-annual dt between data points) of a linear SDE with drift and diffusion coefficients B and Q drawn from the associated posterior distribution for each member of the ensemble. From the ensemble the median path was calculated and compared to the original data set. Max Likelihood means that Bayesian estimation was not performed for that variable and only the Maximum Likelihood value was used.

The spatial results for ML and Bayesian estimation using Minnesota and LKJ priors can be seen in Fig. 3 and Fig. 4 for 1 and 2 year lead times, respectively. For forecasts with 1 year lead time, the ML estimates are generally skillful in the Arctic Ocean with exceptions in the Nares strait and the Beaufort and Kara seas. In the Southern Ocean, the ML estimates are only skillful off of East Antarctica and in pockets of the Ross and Weddell seas. The Bayesian forecasts have exceptional skill in both polar regions. For forecasts with 2 year lead time, the overall skill of either technique is markedly lower. The ML estimates are not skillful in any region. The Bayesian forecasts are still considered skillful in many areas of both polar regions. In particular we see high correlation skill off of West Antarctica in the Amundsen-Bellingshausen seas, whose climate is known to be strongly influenced by tropical Pacific SSTs (Steig et al. 2012; Ding et al. 2011; Lachlan-Cope and Connolley 2006). The correlation metric, then, is very sensitive to proper accounting of the uncertainty in B, but this is not a comprehensive view of the calibration of the forecasts. A view of the correlation coefficient of various methods as a function of lead time is given in
Fig. 5. Here, we plot both the global mean (dashed) and global max (solid) correlation coefficients of the ML forecasts, Bayesian forecasts with Minnesota and LKJ priors, and Vector AutoRegressive (VAR) forecasts with 2 and 3 terms. These VAR models are fit with an optimization-based maximum likelihood approach, and are included as a means of comparing LIMs against other typical time-series forecasting techniques. In general, the ML LIM forecasts are much less skillful than either the Bayesian LIM or VAR models, with the latter models having roughly identical skill. In particular, the low global correlation coefficients and sharp decline of skill at small lead times from the ML forecasts are extreme in comparison to the other methods. Notice that while the ML forecasts attain skillful maximum correlation coefficients, their comparatively low globally averaged correlation coefficients indicate that there is much greater variance of region-dependent skill when compared to the other methods, especially for short lead times. All methods, however, produce forecasts that are not skillful on average at lead times greater than 3 years.

Fig. 3. Correlation coefficient between mean forecasts with 1 year lead time and test data; higher correlation coefficients are more predictive. (Top) ML estimates were used for the parameter values of B and Q. (Bottom) Minnesota and LKJ priors were used for p(B) and p(Q) respectively.

Fig. 4. Correlation coefficient between mean forecasts with 2 year lead time and test data; higher correlation coefficients are more predictive. (Top) ML estimates were used for the parameter values of B and Q. (Bottom) Minnesota and LKJ priors were used for p(B) and p(Q) respectively.

To begin to measure the calibration of probabilistic forecasts, we use the ensemble forecast members to construct conditional cumulative distribution functions, from which quantiles can be calculated and compared to the empirical distribution function for the observations,

F̂(t) = (1/(T − τ)) Σ_{n=1+τ}^{T} 𝟙[(y_n − y_{n−τ}) < t].   (15)

To be clear, under the assumption of stationarity, this average is performed temporally, but for clarity in presentation we also integrate spatially to consider a 1-dimensional distribution function. Fig. 6 plots the observed and forecast distribution functions against each other, where the discrepancy between forecasted and observed frequencies is plotted. A perfect method would have 1:1 agreement between the forecasts and the observations and have zero residual forecast probability, as denoted in the plot by the dashed (red) line. The distribution of the ML forecasts is generally below this line, especially for more frequent events, indicating that the method underestimates the uncertainty in the data. The forecasts generated with a Minnesota prior for B and a Horseshoe prior for Q have generally good agreement with the observations, but slightly overestimate the uncertainty of unlikely events. The VAR models, again fitted by optimization-based maximum likelihood for a wider comparison of methods, suffer more strongly from over- and under-estimation than the Bayesian methods. All of the Bayesian methods outperform the ML techniques in this metric, showing again that regularizing with reasonable prior densities and having access to the entire probability distribution produce more skillful forecasts.

Fig. 7 presents the relative L-norm error between the distribution of forecasts at a lead time of 1 year and that of the observations, what we call the "Calibration Error", as a function of space. Here a lower Calibration Error represents a smaller percent discrepancy between the forecast probability distribution and the empirical observational distribution. Generally, the ML forecasts are 1.5 to 2 times less calibrated than the Bayesian forecasts.
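The calibration comparison above can be sketched as follows: we compare the empirical CDF of the observations, as in (15), with the CDF implied by an ensemble of forecasts, and summarize the discrepancy as a single relative error. The data here are synthetic stand-ins, not the CESM anomalies, and the norm used is an illustrative choice.

```python
import numpy as np

def empirical_cdf(samples, t_grid):
    """Empirical distribution function evaluated on a grid, cf. Eq. (15)."""
    samples = np.asarray(samples)
    return np.array([(samples < t).mean() for t in t_grid])

rng = np.random.default_rng(0)
obs = rng.standard_normal(500)                     # stand-in observed anomalies
forecast = 0.8 * rng.standard_normal((100, 500))   # stand-in ensemble: under-dispersed
t_grid = np.linspace(-3, 3, 61)

f_obs = empirical_cdf(obs, t_grid)
f_fc = empirical_cdf(forecast.ravel(), t_grid)

# Residual forecast probability (cf. Fig. 6): forecast minus observed frequency.
residual = f_fc - f_obs
# A scalar calibration error (cf. Fig. 7): relative discrepancy of the two CDFs.
calib_err = np.linalg.norm(residual) / np.linalg.norm(f_obs)
print(round(calib_err, 3))
```

An under-dispersed ensemble, like the ML point-estimate forecasts discussed above, produces a systematically nonzero residual curve and hence a larger calibration error than a well-calibrated ensemble.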
6. Conclusion
In this paper we advocate for a Bayesian approach to LIMs as a way to 1) address the ill-conditioned nature of high-dimensional parameter estimation by providing a formal way to regularize the parameter distributions and 2) produce well-calibrated predictive uncertainties. This strategy has been tested and compared to traditional LIM modeling at recovering LIM parameter values and at forecasting high-latitude SST anomalies. The evidence presented in this paper suggests that a Bayesian approach to statistical forecasting may produce significant recovery and calibration benefits, especially in terms of regional predictability, when compared to simple point estimation.

Fig. 5. Global mean (dashed) and maximum (solid) correlation coefficients of forecasts using a variety of methods (ML_ML, VAR(2,0), VAR(3,0), MINN_LKJ) as a function of the lead (lag) time τ.

Bayesian approaches to parameter estimation for stochastic differential equations are rare in the literature for a number of reasons: there is a lack of closed-form solutions to key probability densities, prohibitive computational costs make estimation difficult, and the decent historic performance of various levels of simplifying approximation has not spurred the need for the Bayesian approach. The increasing amount of commonly available computational resources, the development of efficient sampling algorithms, and an evolving view of statistical estimation have set a fertile field for the adoption of the Bayesian framework. This paper advocates for the intelligent use of a Bayesian framework with nontrivial prior distributions to formulate the LIM problem and provides an accounting of how well these approaches reconstruct and forecast the appropriate probability distributions.

Maximum Likelihood and least squares approaches to parameter estimation in LIMs have asymptotic convergence guarantees, but may not be optimal when the available data are sparse, the underlying signal is not strong, there are confounding variables present, or when confronted with model error. As demonstrated by the simple example in Fig. 1 and
Fig. 2, researchers should be cautious about using such estimators in the LIM framework. We found that, even with a relatively large data set, relative errors often exceeded 100%. Furthermore, our results also suggest that it would be hazardous to conclude that forecast skill is a reflection of parameter reconstruction (Table 1).

The main hypothesis of this paper is that using ML point estimates can result in overly conservative forecasts, especially when compared to results from a Bayesian framework. Using both traditional and probabilistic forecast metrics, we found that predictions from a Bayesian approach are more accurate than their ML competitors. Comparing the empirical distribution functions from forecasts and the metrics in
Table 1, Fig. 6, and
Fig. 7 affirms that the Bayesian approach better captures the probability distribution of the dynamics and makes clear that the benefits from using a Bayesian framework in forecasting can be significant.

There are several possible avenues to continue and extend the analysis in this paper. Immediately, this analysis should be extended to observational data sets so that comparisons can be made with other papers, e.g. Penland and Sardeshmukh (1995). It is still an open question whether a Bayesian framework can be used to reliably improve the performance of LIMs in forecasting decadal predictability of SSTs (Hawkins and Sutton 2009; Branstator and Teng 2010; Guemas et al. 2012), and this area can be pursued with either a global or regional scope. Second, further exploration of relevant prior distributions should be pursued with the goal of producing physically meaningful prior densities. In particular, we can explore the use of global climate models in the Bayesian framework. There are several extensions to LIMs where one could feasibly implement a Bayesian approach. First, LIM frameworks that incorporate state-dependent noise, as in Sardeshmukh et al. (2015) and Martinez-Villalobos et al. (2019), can also be formulated in a Bayesian way given an appropriate approximate likelihood distribution.
Fig. 6. Comparison of the empirical distribution function of observed data against ensembles of forecasts using different prior probabilities (HORSE_LKJ, MINN_LKJ) as well as ML-estimated VAR(2,0) and VAR(3,0) methods. The horizontal axis represents the SST anomalies that occur at an observed frequency based on the data, and the vertical axis represents the difference between the forecasted and observed frequency of those values. The red dashed line represents an ideal 1:1 correspondence between the forecast and observation distributions; positive/negative residuals represent overestimation/underestimation of events of that observed frequency.
Finally, one could depart from the parametric form that a linear stochastic differential equation imposes and attempt a Bayesian framework with more complicated, non-parametric, physically constrained statistical models, such as in Majda and Harlim (2013) or Wikle and Holan (2011), or machine learning techniques such as the Bayesian networks in McDermott and Wikle (2019a) and McDermott and Wikle (2019b).
Acknowledgments.
This research was supported in part by the Regional and Global Model Analysis (RGMA) component of the Earth and Environmental System Modeling (EESM) program of the U.S. Department of Energy's Office of Science, as a contribution to the HiLAT-RASM project. DF, NMU, and DC received support for this project from the U.S. Department of Energy, Biological and Environmental Research (BER) division (Regional and Global Modeling and Applications program); NMU received additional support under the BER Early Career Research program.

APPENDIX A
Prior Matrix Distributions

a. LKJ
The so-called LKJ prior (Lewandowski et al. 2009) (named after its authors: Lewandowski, Kurowicka, and Joe) is the result of work that extended methods designed to sample random correlation matrices in a computationally efficient manner. In implementation, there is a single hyper-parameter to the distribution, η, and the distribution can be defined as

LKJ(Σ | η) ∝ det(Σ)^(η−1).   (A1)

From the definition, if η = 1, the LKJ distribution is uniform over all correlation matrices. For η > 1, the distribution places more mass near the identity matrix, resulting in samples close to the unit matrix. In other words, η controls the expected amount of correlation, with η > 1 favoring weaker correlations and η < 1 favoring stronger ones.

b. Minnesota-Litterman

The Minnesota-Litterman prior (Litterman 1986; Doan et al. 1983) has two hyper-parameters, λ and θ, that allow for controlling the degree of sparsity in the matrix and differentiating between diagonal and off-diagonal terms. The prior distributions are given by

G_ii ∼ N(1, λ),
G_ij ∼ N(0, λθ Σ_ii / Σ_jj),  i ≠ j,
λ ∼ Cauchy⁺(0, 1),  θ ∼ U(0, 1),   (A2)

where U(a, b) is the uniform distribution on [a, b]. This formulation allows for the distinct treatment of diagonal and off-diagonal parameters. λ acts as a global covariance hyper-parameter, controlling the overall uncertainty in the elements of the matrix. If λ = 0, then the prior would collapse to a distribution around the identity matrix. For λ ≠ 0, the off-diagonal variance λθ Σ_ii / Σ_jj shrinks λ by a multiple, θ, of the relative size of the respective variances. This combination is a natural choice since it controls the sparsity of the matrix according to the relative uncertainty and size of the dynamics of the elements of x_t.

c. (Finnish) Horseshoe

Similar to the Minnesota-Litterman prior, the horseshoe prior (Carvalho et al. 2009) was originally derived to allow for specification of sparsity and has hyper-parameters λ and τ. Here λ ∈ R^{m×m} has m² elements λ_ij, one for each entry of B. The distributions are
B_ij ∼ N(0, τ²λ²_ij),
λ_ij ∼ Cauchy⁺(0, 1),   (A3)

where τ > 0; τ and the λ_ij are interpreted as global and local sparsity controls. The value of τ should represent our overall belief about the size of the elements of B. Each λ_ij is then used to tune the size of the individual element B_ij, and is given a half-Cauchy prior in order to allow for a large spread of values. If a value of τ is not known a priori, then one can use a non-informative prior like a half-Cauchy instead. Of concern with this scheme is the introduction of m² + 1 hyper-parameters in addition to the m² elements of B. While the large number of hyper-parameters can help control the size of individual elements B_ij, it increases the computational complexity of estimation and may require relatively more data to obtain reliable estimates than simpler priors. Furthermore, it is occasionally possible during estimation for τλ_ij to become exceedingly large. The Finnish horseshoe introduces a transformation, λ̄_ij, of λ_ij and an additional hyper-parameter c, such that

λ̄²_ij = c²λ²_ij / (c² + τ²λ²_ij).   (A4)

If τλ_ij ≪ c, then λ̄_ij ≈ λ_ij and the regularized horseshoe behaves similarly to the original horseshoe prior. On the other hand, if τλ_ij ≫ c, then λ̄_ij ≪ λ_ij and λ̄_ij acts as a regularization. This transformation penalizes products τλ_ij that approach or exceed the predetermined threshold c. Like τ, c can be estimated by considering our understanding of the size of the parameters, or can be given some non-informative prior and estimated along with the other parameters. Unlike the Minnesota prior, this prior can be applied to both B and Q. Of course, when estimating Q, we follow the transformation (14) and need only estimate L.

APPENDIX B

Bayesian Computational Techniques
To make full use of the Bayesian framework, instead of relying on maximal parameter estimates, we must be able to draw samples of the parameters from their respective probability distributions. If the marginal likelihood function p(y_T | θ) is normal and the prior distribution p(θ) is chosen conjugate to the likelihood, then the posterior distribution can be calculated analytically. There are numerous books written on this technique; see Gelman et al. (2013). In general, however, numerical techniques are needed to sample from an arbitrary posterior distribution. In particular, Markov Chain Monte Carlo (MCMC) algorithms are standard practice for generating samples from the posterior. We will review MCMC in order to motivate Hamiltonian Monte Carlo (HMC), which is the algorithm we use for the studies in this paper. This appendix is only meant as a brief description of the types of methods employed in this paper; for more detailed and comprehensive discussions see Brooks (2011).

1) Markov Chain Monte Carlo

A Markov chain is a sequence of random variables {ε₀, ε₁, . . .} where

p(ε₀, ε₁, . . ., ε_N) = p(ε₀) ∏_{k=1}^{N} p(ε_k | ε_{k−1}),   (B1)

i.e. a member of the sequence, given the previous value, is conditionally independent of all earlier elements in the sequence. The goal of MCMC is to generate a Markov chain such that elements of the chain come from the posterior distribution. The elements of the Markov chain are then the requested samples. The basic MCMC algorithm consists of specifying an initial distribution p(ε₀), proposing a next element using a transition distribution p(ε_k | ε_{k−1}), and a method for accepting or rejecting the proposed value. In the LIM case, the initial distribution could be specified by the ML distribution, since one has analytical representations for the mean and standard deviation.
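The propose/accept loop just described can be sketched as follows; this is an illustrative random-walk Metropolis sampler for a generic log-posterior, not the sampler used for our LIM experiments, and the step size and target density are placeholder choices.

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, n_samples, step=0.5, seed=0):
    """Illustrative random-walk Metropolis sampler.

    log_post: log posterior density (up to an additive constant).
    theta0:   initial state of the chain.
    step:     standard deviation of the Gaussian proposal.
    """
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    chain = np.empty((n_samples, theta.size))
    lp = log_post(theta)
    for k in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        chain[k] = theta
    return chain

# Example: sample a standard normal "posterior".
chain = random_walk_metropolis(lambda t: -0.5 * np.sum(t**2), [0.0], 20000)
print(chain.mean(), chain.std())  # roughly 0 and 1
```

In high dimensions the acceptance rate of such random-walk proposals collapses unless the step size is shrunk drastically, which is the scalability problem that motivates the gradient-based transitions of HMC below.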
For this method to generate a proper Markov chain that has p(θ | y_T) as its equilibrium distribution, the transition density and acceptance criterion must meet certain theoretical conditions. If these conditions are met, then p(ε_k) is stationary, with p(ε_k) = p(ε_k′) for all k, k′. This stationary distribution is then guaranteed to be equal to p(θ | y_T). In practice, ensuring that generated Markov chains mix well and come from the required distribution is non-trivial. We will not discuss these theoretical concerns, but assume such conditions have been met and refer the reader to the literature. In order to avoid some of these technical difficulties, we rely on the Hamiltonian Monte Carlo algorithm.

2) Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) uses the dynamics of an associated Hamiltonian dynamical system to construct a transition density p(ε_k | ε_{k−1}) that is theoretically guaranteed to produce the required posterior distribution. These dynamics are defined by the pair of differential equations

dp/dt = −∂H/∂q,   dq/dt = ∂H/∂p,   (B2)

where (p, q) are coordinates of the state system (representing momentum and position). H is the Hamiltonian, defined as

H(p, q) = K(p) + U(q),   (B3)

where K is called the "kinetic energy" and U is the "potential energy". HMC utilizes this framework by introducing a conjugate momentum variable ρ ∼ N(0, M) and seeking to draw from the joint density p(ρ, θ) = p(ρ | θ) p(θ | y_T). This joint density defines the Hamiltonian

H(ρ, θ) = −log p(ρ, θ) = −log p(ρ | θ) − log p(θ | y_T) = K(ρ, θ) + U(θ).   (B4)

We assume that the momentum is independent of the target density, so p(ρ | θ) = p(ρ) and K(ρ, θ) = K(ρ) = ½ ρᵀM⁻¹ρ. To generate transitions for the Markov chain we evolve the system (B2), starting from an initial value for θ and a draw ρ from p(ρ). In particular, we solve

dθ/dt = ∂K/∂ρ,   dρ/dt = −∂U/∂θ.   (B5)

Solving (B5) in general requires numerical integration in discrete time. While there are numerous numerical integrators for Hamiltonian systems, the most common in practice is the Leapfrog, or Störmer-Verlet, integrator with step size ∆t:

ρ_{n+1/2} = ρ_n − (∆t/2) ∂U/∂θ (θ_n),
θ_{n+1} = θ_n + ∆t M⁻¹ ρ_{n+1/2},
ρ_{n+1} = ρ_{n+1/2} − (∆t/2) ∂U/∂θ (θ_{n+1}).   (B6)

The probability that these values are accepted is given by min(1, exp(H(ρ_n, θ_n) − H(ρ_{n+1}, θ_{n+1}))). If the generated values are not accepted, then θ_{n+1} = θ_n, and the algorithm continues for a total of L time steps. It is proven that this algorithm produces samples that converge to the target distribution under certain conditions. Often this convergence is quicker, in terms of samples needed, than standard MCMC algorithms. As a trade-off for speed, HMC requires the calculation of derivatives at each time step. Furthermore, the time step ∆t may need to be taken excessively small in order to ensure stability of the scheme, meaning that more time steps are required to adequately explore the state space. Modifications to the original HMC algorithm, the no-U-turn sampler in particular, help automatically adjust the step size and number of steps. We leave the details of these algorithms to Neal (2012).
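A minimal sketch of one HMC transition using the leapfrog update (B6), with an identity mass matrix M; the target density, step size, and trajectory length below are illustrative stand-ins, not the LIM posterior or the tuning used in this paper.

```python
import numpy as np

def hmc_step(theta, grad_u, u, dt=0.1, n_leapfrog=20, rng=None):
    """One HMC transition. u(theta) = -log p(theta | y_T); grad_u is its gradient.

    Uses an identity mass matrix M = I for simplicity."""
    rng = rng or np.random.default_rng()
    rho = rng.standard_normal(theta.shape)       # draw momentum rho ~ N(0, I)
    h0 = u(theta) + 0.5 * rho @ rho              # initial Hamiltonian, Eq. (B4)
    th, r = theta.copy(), rho.copy()
    for _ in range(n_leapfrog):                  # leapfrog integration, Eq. (B6)
        r = r - 0.5 * dt * grad_u(th)            # half step in momentum
        th = th + dt * r                         # full step in position
        r = r - 0.5 * dt * grad_u(th)            # half step in momentum
    h1 = u(th) + 0.5 * r @ r
    # Metropolis correction: accept with probability min(1, exp(h0 - h1)).
    if np.log(rng.uniform()) < h0 - h1:
        return th
    return theta

# Example: sample a 2-d standard normal, U(theta) = 0.5 ||theta||^2.
rng = np.random.default_rng(1)
u = lambda t: 0.5 * t @ t
grad_u = lambda t: t
theta = np.zeros(2)
draws = []
for _ in range(2000):
    theta = hmc_step(theta, grad_u, u, rng=rng)
    draws.append(theta)
draws = np.array(draws)
print(draws.mean(axis=0), draws.std(axis=0))  # roughly [0, 0] and [1, 1]
```

Because the leapfrog integrator nearly conserves H, the acceptance probability stays close to one even for trajectories that move far from the current state, which is the source of HMC's efficiency relative to random-walk proposals.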
For an out-of-the-box implementation of this algorithm, see the probabilistic programming language Stan (Carpenter et al. 2017).

References
Alexander, M. A., and Coauthors, 2008: Forecasting Pacific SSTs: Linear Inverse Model Predictions of the PDO. Journal of Climate, 21 (2), 385–402, doi:10.1175/2007JCLI1849.1.

Barnston, A. G., Y. He, and M. H. Glantz, 1999: Predictive Skill of Statistical and Dynamical Climate Models in SST Forecasts during the 1997–98 El Niño Episode and the 1998 La Niña Onset. Bulletin of the American Meteorological Society, 80 (2), 217–243, doi:10.1175/1520-0477(1999)080<0217:PSOSAD>2.0.CO;2.

Barnston, A. G., M. K. Tippett, M. L. L'Heureux, S. Li, and D. G. Dewitt, 2012: Skill of real-time seasonal ENSO model predictions during 2002–11: Is our capability increasing? Bulletin of the American Meteorological Society, 93 (5), 631–651, doi:10.1175/BAMS-D-11-00111.1.

Batz, P., A. Ruttor, and M. Opper, 2018: Approximate Bayes learning of stochastic differential equations. doi:10.1103/PhysRevE.98.022109, URL https://arxiv.org/pdf/1702.05390.pdf.

Beskos, A., O. Papaspiliopoulos, G. O. Roberts, and P. Fearnhead, 2006: Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes.
Journal of the Royal Statistical Society, Series B: Statistical Methodology,
68 (3)
Journal of Climate ,
23 (23) , 6292–6311,doi:10.1175/2010JCLI3678.1, URL http://journals.ametsoc.org/doi/abs/10.1175/2010JCLI3678.1.Brooks, S., 2011:
Handbook of Markov Chain Monte Carlo.

Carpenter, B., and Coauthors, 2017: Stan: A probabilistic programming language.
Journal of Statistical Software ,
76 (1)
Journal of Machine Learning Research, W&CP 5 (AISTATS), 73–80, URL http://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf.

Chen, C., M. A. Cane, N. Henderson, D. E. Lee, D. Chapman, D. Kondrashov, and M. D. Chekroun, 2016: Diversity, nonlinearity, seasonality, and memory effect in ENSO simulation and prediction using empirical model reduction. Journal of Climate,
29 (5), 1809–1830, doi:10.1175/JCLI-D-15-0372.1.

Collins, M., 2002: Climate predictability on interannual to decadal time scales: The initial value problem.
Climate Dynamics ,
19 (8), 671–692, doi:10.1007/s00382-002-0254-8.

Colman, A. W., and M. K. Davey, 2003: Statistical prediction of global sea-surface temperature anomalies.
International Journalof Climatology ,
23 (14).
Davis, R. E., 1976: Predictability of Sea Surface Temperature and Sea Level Pressure Anomalies over the North Pacific Ocean. Journal of Physical Oceanography, 6, 249–266, doi:10.1175/1520-0485(1976)006<0249:POSSTA>2.0.CO;2.

DelSole, T., and P. Chang, 2003: Predictable Component Analysis, Canonical Correlation Analysis, and Autoregressive Models. Journal of the Atmospheric Sciences, 60 (2), 409–416, doi:10.1175/1520-0469(2003)060<0409:PCACCA>2.0.CO;2.

DelSole, T., and A. Y. Hou, 1999: Empirical Stochastic Models for the Dominant Climate Statistics of a General Circulation Model. Journal of the Atmospheric Sciences, 56 (19), 3436–3456, doi:10.1175/1520-0469(1999)056<3436:ESMFTD>2.0.CO;2.

DelSole, T., L. Jia, and M. K. Tippett, 2013: Decadal prediction of observed and simulated sea surface temperatures. Geophysical Research Letters, 40 (11), 2773–2778, doi:10.1002/grl.50185.

DelSole, T., and X. Yang, 2010: State and parameter estimation in stochastic dynamical models.
Physica D: Nonlinear Phenomena ,
239 (18)
Climate Dynamics, 1–19, doi:10.1007/s00382-018-4323-z, URL http://link.springer.com/10.1007/s00382-018-4323-z.
DiNezio, P. N., C. Deser, Y. Okumura, and A. Karspeck, 2017: Predictability of 2-year La Niña events in a coupled general circulation model. Climate Dynamics, 49 (11-12), 4237–4261, doi:10.1007/s00382-017-3575-3, URL http://link.springer.com/10.1007/s00382-017-3575-3.
Ding, Q., E. J. Steig, D. S. Battisti, and M. Küttel, 2011: Winter warming in West Antarctica caused by central tropical Pacific warming. Nature Geoscience.
Journal of Business and Economic Statistics, 19, 177–191, URL http://eraker.marginalq.com/eraker01JBES.pdf.
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, 2013: Bayesian Data Analysis, third edition. Chapman & Hall/CRC Texts in Statistical Science, Taylor & Francis, 646 pp., URL https://books.google.com/books?id=ZXL6AQAAQBAJ.
Giannone, D., M. Lenza, and G. E. Primiceri, 2015: Prior selection for vector autoregressions. Review of Economics & Statistics, 97 (2), 436–451, URL http://faculty.wcas.northwestern.edu/~gep575/Draft_GLP_V24.pdf.
Guemas, V., F. J. Doblas-Reyes, F. Lienert, Y. Soufflet, and H. Du, 2012: Identifying the causes of the poor decadal climate prediction skill over the North Pacific. Journal of Geophysical Research Atmospheres, 117 (20), doi:10.1029/2012JD018004, URL http://doi.wiley.com/10.1029/2012JD018004.
Guemas, V., and Coauthors, 2016: A review on Arctic sea-ice predictability and prediction on seasonal to decadal time-scales. Quarterly Journal of the Royal Meteorological Society, 142 (695), 546–561, doi:10.1002/qj.2401, URL http://doi.wiley.com/10.1002/qj.2401.
Hansen, J. A., and C. Penland, 2007: On stochastic parameter estimation using data assimilation. Physica D: Nonlinear Phenomena, 230 (1-2), 88–98, doi:10.1016/j.physd.2006.11.006.
Hasselmann, K., 1988: PIPs and POPs: The reduction of complex dynamical systems using principal interaction and oscillation patterns. Journal of Geophysical Research, 93 (D9), 11 015, doi:10.1029/JD093iD09p11015, URL http://doi.wiley.com/10.1029/JD093iD09p11015.
Hastie, T., R. Tibshirani, and J. Friedman, 2017: The Elements of Statistical Learning: Data Mining, Inference, and Prediction (12th printing). 2nd ed., Springer-Verlag New York, New York, 745 pp., doi:10.1007/978-0-387-84858-7, URL http://link.springer.com/10.1007/978-0-387-84858-7.
Hawkins, E., J. Robson, R. Sutton, D. Smith, and N. Keenlyside, 2011: Evaluating the potential for statistical decadal predictions of sea surface temperatures with a perfect model approach. Climate Dynamics, 37 (11-12), 2495–2509, doi:10.1007/s00382-011-1023-3, URL http://link.springer.com/10.1007/s00382-011-1023-3.
Hawkins, E., and R. Sutton, 2009: Decadal predictability of the Atlantic Ocean in a coupled GCM: Forecast skill and optimal perturbations using linear inverse modeling. Journal of Climate, 22 (14), 3960–3978, doi:10.1175/2009JCLI2720.1, URL http://journals.ametsoc.org/doi/abs/10.1175/2009JCLI2720.1.
Jenkins, A., P. Dutrieux, S. Jacobs, E. J. Steig, G. H. Gudmundsson, J. Smith, and K. J. Heywood, 2016: Decadal ocean forcing and Antarctic ice sheet response: Lessons from the Amundsen Sea. Oceanography, 29 (4), 106–117, doi:10.5670/oceanog.2016.103, URL https://tos.org/oceanography/article/decadal-ocean-forcing-and-antarctic-ice-sheet-response-lessons-from-the-amu.
Karimi, H., and K. B. McAuley, 2016: Bayesian estimation in stochastic differential equation models via Laplace approximation. IFAC-PapersOnLine, 49 (7), 1109–1114, doi:10.1016/j.ifacol.2016.07.351, URL http://folk.ntnu.no/skoge/prost/proceedings/dycops-cab-2016/proceedings/media/papers/0007.pdf.
Kay, J. E., and Coauthors, 2015: The Community Earth System Model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society, 96 (8), 1333–1349, doi:10.1175/BAMS-D-13-00255.1, URL http://journals.ametsoc.
Kondrashov, D., M. D. Chekroun, and M. Ghil, 2015: Data-driven non-Markovian closure models. Physica D: Nonlinear Phenomena, 33–55, doi:10.1016/j.physd.2014.12.005.
Kravtsov, S., D. Kondrashov, and M. Ghil, 2009: Empirical model reduction and the modelling hierarchy in climate dynamics and the geosciences. Stochastic Physics and Climate Modelling.
Lachlan-Cope, T., and W. Connolley, 2006: Teleconnections between the tropical Pacific and the Amundsen-Bellinghausens Sea: Role of the El Niño/Southern Oscillation. Journal of Geophysical Research Atmospheres, 111 (23), doi:10.1029/2005JD006386, URL http://doi.wiley.com/10.1029/2005JD006386.
Latif, M., T. Martin, A. Reintges, and W. Park, 2017: Southern Ocean decadal variability and predictability. Springer International Publishing, 163–173, doi:10.1007/s40641-017-0068-8, URL http://link.springer.com/10.1007/s40641-017-0068-8.
Le Breton, A., 1977: Parameter estimation in a linear stochastic differential equation. Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, Springer Netherlands, Dordrecht, 353–366, doi:10.1007/978-94-010-9910-3_36, URL http://link.springer.com/10.1007/978-94-010-9910-3_36.
Lewandowski, D., D. Kurowicka, and H. Joe, 2009: Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100 (9).
Journal of Business and Economic Statistics, 25–38, doi:10.1080/07350015.1986.10509491, URL https://digidownload.libero.it/rocco.mosconi/Litterman1986.pdf.
Majda, A. J., and J. Harlim, 2013: Physics constrained nonlinear regression models for time series. Nonlinearity, 26 (1).
Geophysical Research Letters, 46 (16), 9909–9919, doi:10.1029/2019GL082922, URL https://onlinelibrary.wiley.com/doi/abs/10.1029/2019GL082922.
Martinez-Villalobos, C., D. J. Vimont, C. Penland, M. Newman, and J. D. Neelin, 2018: Calculating state-dependent noise in a linear inverse model framework. Journal of the Atmospheric Sciences, 75 (2), 479–496, doi:10.1175/JAS-D-17-0235.1, URL http://journals.ametsoc.org/doi/10.1175/JAS-D-17-0235.1.
Mason, S. J., and G. M. Mimmack, 2002: Comparison of some statistical methods of probabilistic forecasting of ENSO. Journal of Climate, 15 (1), 8–29, doi:10.1175/1520-0442(2002)015<0008:COSSMO>2.0.CO;2, URL http://journals.ametsoc.org/doi/abs/10.1175/1520-0442%282002%29015%3C0008%3ACOSSMO%3E2.0.CO%3B2.
McDermott, P. L., and C. K. Wikle, 2019a: Bayesian recurrent neural network models for forecasting and quantifying uncertainty in spatial-temporal data. Entropy, 21 (2), doi:10.3390/e21020184, URL http://arxiv.org/abs/1711.00636.
McDermott, P. L., and C. K. Wikle, 2019b: Deep echo state networks with uncertainty quantification for spatio-temporal forecasting. Environmetrics, 30 (3), doi:10.1002/env.2553, URL http://arxiv.org/abs/1806.10728.
McKinnon, K. A., and C. Deser, 2018: Internal variability and regional climate trends in an observational large ensemble. Journal of Climate, 31 (17), 6783–6802, doi:10.1175/JCLI-D-17-0901.1, URL http://journals.ametsoc.org/doi/10.1175/JCLI-D-17-0901.1.
Meehl, G. A., and Coauthors, 2014: Decadal climate prediction: An update from the trenches. Bulletin of the American Meteorological Society, 95 (2), 243–267, doi:10.1175/BAMS-D-12-00241.1, URL http://journals.ametsoc.org/doi/abs/10.1175/BAMS-D-12-00241.1.
Møller, J. K., H. Madsen, and J. Carstensen, 2011: Parameter estimation in a simple stochastic differential equation for phytoplankton modelling. Ecological Modelling, 1793–1799, doi:10.1016/j.ecolmodel.2011.03.025, URL http://henrikmadsen.org/wp-content/uploads/2014/05/Journal_article_-_2011_-_Parameter_estimation_in_a_simple_stochastic_differential_equation_for_phytoplankton_modelling.pdf.
Neal, R. M., 2012: MCMC using Hamiltonian dynamics. doi:10.1201/b10905-6, URL http://arxiv.org/abs/1206.1901.
Newman, M., 2007: Interannual to decadal predictability of tropical and North Pacific sea surface temperatures. Journal of Climate, 20 (11), 2333–2356, doi:10.1175/JCLI4165.1, URL http://journals.ametsoc.org/doi/abs/10.1175/JCLI4165.1.
Newman, M., 2013: An empirical benchmark for decadal forecasts of global surface temperature anomalies. Journal of Climate, 26 (14), 5260–5269, doi:10.1175/JCLI-D-12-00590.1, URL http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00590.1.
Nummelin, A., S. Jeffress, and T. Haine, 2018: Statistical inversion of surface ocean kinematics from sea surface temperature observations. Journal of Atmospheric and Oceanic Technology, 35 (10), 1913–1933, doi:10.1175/JTECH-D-18-0057.1, URL http://journals.ametsoc.org/doi/10.1175/JTECH-D-18-0057.1.
Partridge, M. D., and D. S. Rickman, 1998: Generalizing the Bayesian vector autoregression approach for regional interindustry employment forecasting. Journal of Business & Economic Statistics, 16 (1).
Monthly Weather Review, 117 (10), 2165–2185, doi:10.1175/1520-0493(1989)117<2165:RFAFUP>2.0.CO;2, URL http://journals.ametsoc.org/doi/abs/10.1175/1520-0493%281989%29117%3C2165%3ARFAFUP%3E2.0.CO%3B2.
Penland, C., 1996: A stochastic model of IndoPacific sea surface temperature anomalies. Physica D: Nonlinear Phenomena, 534–558, doi:10.1016/0167-2789(96)00124-8.
Penland, C., and T. Magorian, 1993: Prediction of Niño 3 sea surface temperatures using linear inverse modeling. Journal of Climate, 6, 1067–1076, doi:10.1175/1520-0442(1993)006<1067:PONSST>2.0.CO;2, URL http://journals.ametsoc.org/doi/abs/10.1175/1520-0442%281993%29006%3C1067%3APONSST%3E2.0.CO%3B2.
Penland, C., and L. Matrosova, 2001: Expected and actual errors of linear inverse model forecasts. Monthly Weather Review.
Penland, C., and P. D. Sardeshmukh, 1995: The optimal growth of tropical sea surface temperature anomalies. Journal of Climate, 8, 1999–2024, doi:10.1175/1520-0442(1995)008<1999:TOGOTS>2.0.CO;2, URL http://journals.ametsoc.org/doi/abs/10.1175/1520-0442%281995%29008%3C1999%3ATOGOTS%3E2.0.CO%3B2.
Révelard, A., C. Frankignoul, and Y. O. Kwon, 2018: A multivariate estimate of the cold season atmospheric response to North Pacific SST variability. Journal of Climate, 31 (7), 2771–2796, doi:10.1175/JCLI-D-17-0061.1, URL http://journals.ametsoc.org/doi/10.1175/JCLI-D-17-0061.1.
Samaniego, F. J., 2010: A Comparison of the Bayesian and Frequentist Approaches to Estimation. 1st ed., Springer-Verlag New York, New York, 225 pp., doi:10.1007/978-1-4419-5941-6.
Sardeshmukh, P. D., G. P. Compo, and C. Penland, 2015: Need for caution in interpreting extreme weather statistics. Journal of Climate, 28 (23), 9166–9187, doi:10.1175/JCLI-D-15-0020.1, URL http://journals.ametsoc.org/doi/10.1175/JCLI-D-15-0020.1.
Särkkä, S., 2010: Bayesian Filtering and Smoothing. Cambridge University Press, Cambridge, 232 pp., doi:10.1017/CBO9781139344203, URL http://ebooks.cambridge.org/ref/id/CBO9781139344203.
Steig, E. J., Q. Ding, D. S. Battisti, and A. Jenkins, 2012: Tropical forcing of Circumpolar Deep Water inflow and outlet glacier thinning in the Amundsen Sea Embayment, West Antarctica. Annals of Glaciology, 53 (60).
Differential Equations, 137 (3), 1–12, doi:10.1016/j.pocean.2015.06.007, URL https://core.ac.uk/download/pdf/11559876.pdf.
Stock, C. A., and Coauthors, 2015b: Seasonal sea surface temperature anomaly prediction for coastal ecosystems. Progress in Oceanography, 219–236, doi:10.1016/j.pocean.2015.06.007, URL http://dx.doi.org/10.1016/j.pocean.2015.06.007.
Stoer, J., and R. Bulirsch, 2002: Eigenvalue problems. Introduction to Numerical Analysis, Springer New York, New York, NY, 364–464, doi:10.1007/978-0-387-21738-3_6, URL http://link.springer.com/10.1007/978-0-387-21738-3_6.
Thomas, E. E., D. J. Vimont, M. Newman, C. Penland, and C. Martínez-Villalobos, 2018: The role of stochastic forcing in generating ENSO diversity. Journal of Climate, 31 (22), 9125–9150, doi:10.1175/JCLI-D-17-0582.1, URL http://journals.ametsoc.org/doi/10.1175/JCLI-D-17-0582.1.
Tian, T., Y. Zhou, Y. Wu, and X. Ge, 2014: Estimation of parameters in mean-reverting stochastic systems. Mathematical Problems in Engineering.
Mathematical Biosciences, 243 (2), 215–222, doi:10.1016/j.mbs.2013.03.008, URL https://arxiv.org/pdf/1209.6433.pdf.
Vogel, C. R., 2002: Computational Methods for Inverse Problems. Society for Industrial and Applied Mathematics, doi:10.1137/1.9780898717570, URL https://epubs.siam.org/doi/abs/10.1137/1.9780898717570.
von Storch, H., and T. Bruns, 1988: Principal oscillation pattern analysis of the 30- to 60-day oscillation in general circulation model equatorial troposphere. Journal of Geophysical Research, 93 (D9), 11 022–11 036, doi:10.1029/JD093iD09p11022, URL http://doi.wiley.com/10.1029/JD093iD09p11022.
Wikle, C. K., and S. H. Holan, 2011: Polynomial nonlinear spatio-temporal integro-difference equation models. Journal of Time Series Analysis, 32 (4), 339–350, doi:10.1111/j.1467-9892.2011.00729.x, URL http://doi.wiley.com/10.1111/j.1467-9892.2011.00729.x.
Wittenberg, A. T., A. Rosati, T. L. Delworth, G. A. Vecchi, and F. Zeng, 2014: ENSO modulation: Is it decadally predictable? Journal of Climate, 27 (7), 2667–2681, doi:10.1175/JCLI-D-13-00577.1, URL http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-13-00577.1.
Wu, S., M. Notaro, S. Vavrus, E. Mortensen, R. Montgomery, J. de Piérola, and P. Block, 2018: Efficacy of tendency and linear inverse models to predict southern Peru's rainy season precipitation. International Journal of Climatology, 38 (5), 2590–2604, doi:10.1002/joc.5442, URL http://doi.wiley.com/10.1002/joc.5442.
Zhang, L., T. L. Delworth, and L. Jia, 2017: Diagnosis of decadal predictability of Southern Ocean sea surface temperature in the GFDL CM2.1 model. Journal of Climate, 30 (16), 6309–6328, doi:10.1175/JCLI-D-16-0537.1, URL http://journals.ametsoc.org/doi/10.1175/JCLI-D-16-0537.1.
Zhao, Y., Z. Liu, F. Zheng, and Y. Jin, 2019: Parameter optimization for real-world ENSO forecast in an intermediate coupled model. Monthly Weather Review, 147 (5).