Learning forecasts of rare stratospheric transitions from short simulations
Justin Finkel, Robert J. Webber, Dorian S. Abbot, Edwin P. Gerber, Jonathan Weare
Generated using the official AMS LaTeX template v5.0 two-column layout. This work has been submitted for publication. Copyright in this work may be transferred without further notice, and this version may no longer be accessible.
Justin Finkel ∗
Committee on Computational and Applied Mathematics, University of Chicago
Robert J. Webber
Courant Institute of Mathematical Sciences, New York University
Dorian S. Abbot
Department of the Geophysical Sciences, University of Chicago
Edwin P. Gerber
Courant Institute of Mathematical Sciences, New York University
Jonathan Weare
Courant Institute of Mathematical Sciences, New York University
ABSTRACT

Nonlinear atmospheric dynamics produce rare events that are hard to predict and attribute due to many interacting degrees of freedom. Sudden stratospheric warming (SSW) events are a model example. Approximately once every other year, the winter polar vortex in the boreal stratosphere rapidly breaks down, inducing a shift in midlatitude surface weather patterns persisting for up to 2-3 months. In principle, lengthy numerical simulations can be used to predict and understand these rare transitions. For complex models, however, the cost of the direct numerical simulation approach is often prohibitive. We describe an alternative approach which only requires relatively short-duration computer simulations of the system. The methodology is illustrated by applying it to a prototype model of an SSW event developed by Holton and Mass (1976) and driven with stochastic forcing. While highly idealized, the model captures the essential nonlinear dynamics of SSWs and exhibits the key forecasting challenge: the dramatic separation in timescales between the dynamics of a single event and the return time between successive events. We compute optimal forecasts of sudden warming events and quantify the limits of predictability. Statistical analysis relates these optimal forecasts to a small number of interpretable physical variables. Remarkably, we are able to estimate these quantities using a data set of simulations much shorter than the timescale of the warming event. This methodology is designed to take full advantage of the high-dimensional data from models and observations, and can be employed to find detailed predictors of many complex rare events arising in climate dynamics.
1. Introduction
As computing power increases and weather models grow more intricate and capable of generating a vast wealth of realistic data, the once-distant goal of extreme weather event prediction is starting to become plausible (Vitart and Robertson 2018). To take full advantage of the increased computing power, we must develop new approaches to efficiently manage and parse the data we generate (or observe) to derive physically interpretable, actionable insights. Extreme weather events are worthy targets for simulation owing to their destructive potential to life and property. Rare events have attracted significant simulation efforts recently, including hurricanes (Zhang and Sippel 2009; Webber et al. 2019; Plotkin et al. 2019), heat waves (Ragone et al. 2018), rogue waves (Dematteis et al. 2018), and space weather events, e.g., coronal mass ejections (Ngwira et al. 2013). These are very difficult to characterize and predict, being exceptionally rare and pathological outliers in the spectrum of weather events.

Large ensemble simulations are the most detailed source of data to assess the frequency, intensity, and correlates of extreme weather events (e.g., Schaller et al. 2018). A single simulation must span decades to incorporate the possible impacts of climate change and decadal-scale variability, and the full state space of a climate model may be billions of dimensions large, depending on grid resolution. Unfortunately, the data-richness of a long simulation comes at the cost of sample size: even the largest ensembles are limited to tens or hundreds of members as a matter of computational necessity. Under stationary background parameters and some ergodicity properties, a single simulation will eventually sample state space thoroughly and provide all relevant statistics.

∗ Corresponding author: Justin Finkel, jfi[email protected]
In practice, however, simulations are often not run long enough to reach steady state, and furthermore one may wish to change parameters over time. A much larger number of independent ensemble members would then be needed to quantify the effects of initial conditions, changing climatology, feedbacks, and unresolved high-frequency variability with statistical confidence (Sillmann et al. 2017; Webber et al. 2019).

While the last decade has seen exciting progress in the development of targeted rare event simulation in geophysical contexts (Hoffman et al. 2006; Weare 2009; Bouchet et al. 2011, 2014; Vanden-Eijnden and Weare 2013; Chen et al. 2014; Yasuda et al. 2017; Farazmand and Sapsis 2017; Dematteis et al. 2018; Mohamad and Sapsis 2018; Dematteis et al. 2019; Webber et al. 2019; Bouchet et al. 2019a,b; Plotkin et al. 2019; Simonnet et al. 2020; Ragone and Bouchet 2020; Sapsis 2021), predicting long time-scale behavior of complex dynamical systems remains a difficult task. A traditional approach to addressing this issue is through dimensional reduction techniques, which seek to replace an expensive, high-fidelity model with a lower-dimensional and less costly model. Physics-based reduced-order models have a long and very successful history in atmospheric science, especially as prototypes of chaos and multistability (Lorenz 1963; Charney and DeVore 1979; Legras and Ghil 1985; Crommelin 2003; Timmermann et al. 2003; Ruzmaikin et al. 2003). Observationally, regime behavior has been diagnosed by projecting the empirical steady-state distributions onto leading EOFs (e.g., Crommelin 2003). More recently, significant attention has been paid to data-based dimensional reduction techniques that use data generated by the high-fidelity model to specify a more quantitatively accurate reduced-order model (e.g., Giannakis et al. 2018; Berry et al. 2015; Sabeerali et al. 2017; Majda and Qi 2018; Wan et al. 2018; Bolton and Zanna 2019; Chattopadhyay et al. 2020; Chen and Majda 2020).
However the reduced-order model is derived, it can subsequently be thoroughly interrogated by direct computer simulation.

We advance an alternative computational approach to predicting and understanding rare events without sacrificing model fidelity. Like data-informed reduced-order modeling, our method relies on data generated by a high-fidelity model. However, unlike dimensional reduction techniques, our approach focuses on computing specific quantities of interest rather than on capturing all aspects of a very complicated, high-dimensional dynamical system. In particular, we will compute estimators of statistically optimal forecasts using a data set of many short forward simulations. To accomplish this we represent these forecasts as solutions to Feynman-Kac equations. In the continuous-time limit, these become partial differential equations (PDEs) with a number of independent variables equal to the dimension of the model state space. It is therefore hopeless to solve the equations using any standard spatial discretization. As we demonstrate nonetheless, the equations can be solved with remarkable accuracy via an expansion in a basis of functions informed by the data set. Importantly, our approach to solving these equations is independent of the model used to generate the data, avoiding unrealistic simplifications or structural assumptions.

As typical examples of the forecasts computable within our framework, we focus on the probability that a warming event occurs before a return to a “typical” state, as well as the expected time that it takes for that event to occur. Both quantities depend on the initial condition, and are therefore functions over all of state space. We will follow the convention in computational statistical mechanics and refer to these as the committor function and mean first passage time (MFPT), respectively. The committor has been computed previously for low-dimensional atmospheric models in Tantet et al. (2015); Lucente et al. (2019); Finkel et al. (2020).
Forecasts like these quantify the risk associated with an event given the current state of the system. They also encode important information regarding the rare event itself.

Even putting aside the difficulty of computing the committor and MFPT, they still must be “decoded”; knowledge of these functions does not automatically reveal insights into the fundamental causes or precursors of a rare event. Nor are they easily applied to observations of a limited collection of variables. They are, after all, complicated functions of a high-dimensional model state space. In Section 5 we will demonstrate a detailed statistical analysis of our computed committor function aimed at identifying a relatively small subset of the original variables capable of describing the committor (in the sense defined below).

We illustrate our approach on the highly simplified Holton-Mass model (Holton and Mass 1976; Christiansen 2000) with stochastic velocity perturbations in the spirit of Birner and Williams (2008). The Holton-Mass model is well understood dynamically in light of decades of analysis and experiments, yet complex enough to present the essential computational difficulties of probabilistic forecasting and test our methods for addressing them. Despite the challenges posed by its 75-dimensional state space, our computational framework can indeed accurately characterize extreme events with unprecedented detail using only a data set of short model simulations. In the future, the same methodology could be applied to query the properties of more complex models where less theoretical understanding is available.

Section 2 reviews the dynamical model, and section 3 describes a general class of methods which we then apply to our problem specifically in section 4. We present the results in section 5, including a discussion of optimal forecasting and physical insights gleaned from our approach. We then lay out future prospects and conclude in section 6.
2. Holton-Mass model
Holton and Mass (1976) devised a simple model of the stratosphere aimed at reproducing observed intra-seasonal oscillations of the polar vortex, which they termed “stratospheric vacillation cycles.” Earlier SSW models, originating with that of Matsuno (1971), proposed upward-propagating planetary waves as the major source of disturbance to the vortex. This was a significant step outside the bounds of the nonacceleration theorem of Charney and Drazin (1961), which stated that the vortex would be robust to disturbances under a variety of conditions. While Matsuno (1971) used impulsive forcing from the troposphere as the source of planetary waves, Holton and Mass (1976) suggested that stationary tropospheric forcing, if large enough, could lead to an oscillatory response, purely through dynamics internal to the stratosphere.

Radiative cooling through the stratosphere and wave perturbations at the tropopause are the two competing forces that drive the vortex in the Holton-Mass model. Altitude-dependent cooling relaxes the zonal wind toward a strong vortex in thermal wind balance with a radiative equilibrium temperature field. Gradients in potential vorticity along the vortex, however, can allow the propagation of Rossby waves. When conditions are just right, a Rossby wave emerges from the tropopause and rapidly propagates upward, sweeping heat poleward and stalling the vortex by depositing a burst of negative momentum. The vortex is destroyed and begins anew the rebuilding process.

Yoden (1987a) found that for a certain range of parameter settings, these two effects balance each other to create two distinct stable regimes: a strong vortex with zonal wind close to the radiative equilibrium profile, and a weak vortex with a possibly oscillatory wind profile. We focus our study on this bistable setting as a prototypical model of atmospheric regime behavior. The transition from strong to weak vortex state captures the essential dynamics of an SSW.
The methodology presented here, using only observed short trajectories, can be applied equally to any of these models as well as observational data, which the reader should keep in mind as we present the specifics of the present application.

The Holton-Mass model takes the linearized quasi-geostrophic potential vorticity (QGPV) equation for a perturbation streamfunction ψ′(x, y, z, t) on top of a zonal mean flow u(y, z, t), and projects these two fields onto a single zonal wavenumber k = 2/(a cos 60°) and a single meridional wavenumber ℓ = 3/a, where a is the Earth’s radius. The resulting ansatz is

u(y, z, t) = U(z, t) sin(ℓy)
ψ′(x, y, z, t) = Re{Ψ(z, t) e^{ikx}} e^{z/2H} sin(ℓy)   (1)

which is fully determined by the reduced state space U(z, t) and Ψ(z, t), the latter being complex. Inserting this into the linearized QGPV equations yields the coupled PDE system

[−(G(k² + ℓ²) + 1/4) + ∂²/∂z²] ∂Ψ/∂t = [(α/4 − α_z/2 − iGkβ) − α_z ∂/∂z − α ∂²/∂z²] Ψ + ikε {[(k²G + 1/4) − ∂/∂z + ∂²/∂z²] U} Ψ − ikε (∂²Ψ/∂z²) U   (2)

(−Gℓ² − ∂/∂z + ∂²/∂z²) ∂U/∂t = [(α_z − α) U_Rz − α U_Rzz] − [(α_z − α) ∂/∂z + α ∂²/∂z²] U + εkℓ² e^z Im{Ψ ∂Ψ*/∂z}   (3)

where we have nondimensionalized the equations with the parameter G = H²N²/(f²L²) in order to create a homogeneously shaped dataset more suited to our analysis. Appendix A contains a more detailed derivation. Boundary conditions are prescribed at the bottom of the stratosphere, which in this model corresponds to z = 0, and at the top of the stratosphere, z = z_top:

Ψ(0, t) = gh/f,   Ψ(z_top, t) = 0,
U(0, t) = U_R(0),   ∂_z U(z_top, t) = ∂_z U_R(z_top).   (4)

The vortex-stabilizing influence is represented by α(z), the altitude-dependent cooling coefficient, and the linear relaxation profile U_R(z) = U_R(0) + γz, which forces the vortex toward radiative equilibrium. Here γ = O(1) is the vertical wind shear in m/s/km. The competing force of wave perturbation is encoded through the lower boundary condition Ψ(0, t) = gh/f.

Detailed bifurcation analysis of the model by both Yoden (1987a) and Christiansen (2000) in (γ, h) space revealed the bifurcations that lead to bistability, vacillations, and ultimately quasiperiodicity and chaos. Here we will focus on an intermediate parameter setting, γ = 1.5 m/s/km and h = 38.5 m, for which the model is bistable: a strong vortex with U closely following U_R and an almost barotropic streamfunction, as well as a weak vortex with U dipping close to zero at an intermediate altitude and a disturbed streamfunction with strong westward phase tilt. The two stable equilibria, which we call a and b, are represented in the first row of Figure 1 by their z-dependent zonal wind and streamfunction profiles.

The two equilibria can be interpreted as two different winter climatologies, one with a strong vortex and one with a weak vortex susceptible to vacillation cycles. To explore transitions between these two states, we follow Birner and Williams (2008) and modify the Holton-Mass equations with small additive noise in the U variable to mimic momentum perturbations by smaller-scale Rossby waves, gravity waves, and other unresolved sources.

Fig. 1. Illustration of the two stable states of the Holton-Mass model and transitions between them. (a) Zonal wind profiles of the radiatively maintained strong vortex (the fixed point a, blue), which increases linearly with altitude, and the weak vortex (the fixed point b, red), which dips close to zero in the mid-stratosphere. (b) Streamfunction contours are overlaid for the two equilibria a and b, the weak vortex exhibiting strong westward phase tilt with altitude. (c) Timeseries of U(30 km) from a long stochastic simulation, including several noise-induced transitions from A to B (green) and from B to A (orange). Although both states a and b are equilibria in this parameter regime (h = 38.5 m), the stochastic perturbations uncover the vacillation cycles that would appear beyond the Hopf bifurcation if h were increased. (d) A parametric curve of the same trajectory segment as in (c) with the same color highlights for transition paths, but in the space (|Ψ|, U) at 30 km. The two equilibria are indicated with horizontal blue and red lines.

While the details of the additive noise are ad hoc, this approach can be more rigorously justified through the Mori-Zwanzig formalism (Zwanzig 2001). Because many hidden degrees of freedom are being projected onto the low-dimensional space of the Holton-Mass model, the dynamics on small observable subspaces can be considered stochastic. This is the perspective taken in stochastic parameterization of turbulence and other high-dimensional chaotic systems (Hasselmann 1976; DelSole and Farrell 1995; Franzke and Majda 2006; Majda et al. 2001; Gottwald et al. 2016). More sophisticated parameterizations would surely influence the results (Hu et al. 2019), but would not present a fundamental problem for our purely data-driven numerical method.

We follow Holton and Mass (1976) and discretize the equations using a finite-difference method in z, with 27 vertical levels (including boundaries). After constraining the boundaries, there are d = 3 × (27 − 2) = 75 degrees of freedom in the model. Christiansen (2000) investigated higher resolution and found negligible differences. The full discretized state is represented by a long vector

X(t) = [Re Ψ(Δz, t), ..., Re Ψ(z_top − Δz, t), Im Ψ(Δz, t), ..., Im Ψ(z_top − Δz, t), U(Δz, t), ..., U(z_top − Δz, t)] ∈ R^d = R^75.   (5)

Boundary terms are not included because they are constrained by the boundary conditions. The deterministic system can be written dX(t)/dt = m(X(t)) for a vector field m : R^d → R^d specified by discretizing (2) and (3). Under deterministic dynamics, X(t) → a or X(t) → b as t → ∞, depending on initial conditions. The addition of white noise changes the system into an Itô diffusion

dX(t) = m(X(t)) dt + σ(X(t)) dW(t)   (6)

where σ : R^d → R^{d×d} imparts a correlation structure to the vector W(t) ∈ R^d of independent standard white noise processes. We design σ to be a low-rank, constant matrix that adds spatially smooth stirring to only the zonal wind U (not the streamfunction Ψ) and respects boundary conditions at the bottom and top of the stratosphere. We simulate the model using the Euler-Maruyama method: in a timestep δt, after a deterministic forward Euler step we add the stochastic perturbation to zonal wind on large vertical scales,

δU(z) = σ_U Σ_{k=0}^{2} η_k sin[(k + 1/2) π z/z_top] √(δt)   (7)

where η_k (k = 0, 1, 2) are independent unit normal samples. We set the magnitude of σ by

σ_U = √(E[(δU)²]/δt) ≈ 1 m s⁻¹ day^{−1/2}.   (8)

σ_U technically has units of (L/T)/T^{1/2}, where the square root of time comes from the quadratic variation of the Wiener process. It is best interpreted in terms of the daily root-mean-square velocity perturbation of 1 m s⁻¹ visible in the timeseries of U(30 km) in panel (c) of Figure 1. We display the zonal wind U at 30 km following Christiansen (2000), because this is about where zonal wind strength is minimized in the weak vortex.
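The time-stepping scheme just described can be sketched in a few lines. This is a minimal illustration, not the paper's code: the drift m() is a hypothetical stand-in (linear relaxation toward a shear profile), and z_top = 70 km is an assumed domain height. Only the noise construction follows Eq. (7): three sine modes that vanish at z = 0 and have zero slope at z_top, so the stirring respects the boundary conditions on U.

```python
import numpy as np

# Sketch of the Euler-Maruyama step with spatially smooth, low-rank
# noise as in Eq. (7). The drift m() below is a hypothetical stand-in;
# the actual drift comes from discretizing Eqs. (2)-(3).

z_top = 70.0                                # assumed model top (km)
nz = 25                                     # interior vertical levels
z = np.linspace(0.0, z_top, nz + 2)[1:-1]   # interior grid points
U_R = 10.0 + 1.5 * z                        # stand-in relaxation profile

def m(U):
    """Placeholder drift: relax U toward U_R on a 5-day timescale."""
    return -(U - U_R) / 5.0

def euler_maruyama_step(U, dt, sigma_U, rng):
    """One step of Eq. (6): deterministic Euler step plus the smooth
    stochastic forcing of Eq. (7), applied to U only."""
    U_new = U + m(U) * dt
    eta = rng.standard_normal(3)  # three independent unit normals
    # sin((k + 1/2) pi z / z_top) vanishes at z = 0 and has zero slope
    # at z_top, respecting the boundary conditions on U
    dU = sum(eta[k] * np.sin((k + 0.5) * np.pi * z / z_top)
             for k in range(3))
    return U_new + sigma_U * dU * np.sqrt(dt)

rng = np.random.default_rng(0)
U = U_R + 5.0                     # start displaced from equilibrium
for _ in range(100):
    U = euler_maruyama_step(U, dt=0.1, sigma_U=1.0, rng=rng)
```

With sigma_U set to zero the step reduces to plain forward Euler, which is a convenient sanity check on the drift.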
While the two regimes are clearly associated with the two fixed points, they are better characterized by extended regions of state space with strong and weak vortices. We thus define the two metastable subsets of R^d

A = {X : U(30 km)(X) ≥ U(30 km)(a)}
B = {X : U(30 km)(X) ≤ U(30 km)(b)}

This straightforward definition roughly follows the convention of Charlton and Polvani (2007), which defines an SSW as a reversal of zonal winds at 10 hPa. In the Holton-Mass model, with a scale height of 7 km, 10 hPa corresponds to an altitude of z ≈ 32 km, but we have adjusted the specific altitude here to 30 km, where the zonal wind reduction is most drastic. There is lively debate around the definition of SSW events (e.g., Butler et al. 2015), with different thresholds leading to different statistics. The details of our analysis are affected by the definition, but the results are qualitatively similar over a wide range. Our method is equally applicable to any definition, and so to illustrate we choose one that enjoys broad acceptance. Incidentally, the analysis tools we present may be helpful in distinguishing predictability properties between different definitions.

The green highlights in Figure 1 (c) begin precisely when the system exits the A region bound for B, and end when the system enters B. The orange highlights start when the system leaves B bound for A, and end when A is reached. Note that A → B transitions are much shorter in duration than B → A transitions. Figure 1 (d) shows the same paths, but viewed in the space (|Ψ|, U) at 30 km. The A → B and B → A transitions are again highlighted in green and orange respectively, showing geometrical differences between the two directions. We will refer to the A → B transition as an SSW event, even though it is more accurately a transition between climatologies according to the Holton-Mass interpretation. The B → A transition is a vortex restoration event.
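The green highlights of Fig. 1(c) can be extracted from a timeseries of U(30 km) with a simple two-threshold scan. The helper below is a hypothetical sketch of that bookkeeping for the A → B direction (the B → A direction is symmetric); the thresholds are arguments, not the paper's specific values.

```python
def transition_segments(u, a_thr, b_thr):
    """Return (start, end) index pairs for A -> B transitions in the
    timeseries u, where A = {u >= a_thr} (strong vortex) and
    B = {u <= b_thr} (weak vortex). Each segment begins when the system
    last exits A before reaching B, and ends on first entry into B.
    Hypothetical helper, not code from the paper."""
    segments = []
    last_set = None    # metastable set visited most recently
    exit_idx = None    # most recent index inside A
    for i, ui in enumerate(u):
        if ui >= a_thr:
            last_set, exit_idx = "A", i
        elif ui <= b_thr:
            if last_set == "A":
                segments.append((exit_idx, i))
            last_set = "B"
    return segments
```

Intermediate values of u leave the bookkeeping untouched, so excursions that leave A but return without reaching B are correctly discarded.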
Our focus in this paper is on predicting transition events and monitoring their progress in a principled way. In the next section we explain the formalism for doing so.
3. Theory and computation

a. Definitions
We will introduce the quantities of interest by way of several simple, important examples. Suppose the stratosphere is observed in an initial state X(0) = x that is neither in A nor B, so U(b)(30 km) < U(x)(30 km) < U(a)(30 km) and the vortex is somewhat weakened, but not completely broken down. We call this intermediate zone D = (A ∪ B)^c (the complement of the two metastable sets). Because A and B are attractive, the system will soon find its way to one or the other at the first-exit time from D, denoted

τ_{D^c} = min{t ≥ 0 : X(t) ∈ D^c}.   (9)

Because of stochastic forcing (which in practice arises from unobserved variables), the first-exit time is a random variable, formally called a “stopping time” (Oksendal 2003; Durrett 2013), meaning measurable from the history of X(t). The first-exit location X(τ_{D^c}) is itself a random variable which importantly determines how the system exits D: either X(τ_{D^c}) ∈ A, meaning the vortex restores to radiative equilibrium, or X(τ_{D^c}) ∈ B, meaning the vortex breaks down into vacillation cycles. A fundamental goal of probabilistic forecasting is to determine the probabilities of these two events, which naturally leads to the definition of the (forward) committor function

q⁺(x) = P_x{X(τ_{D^c}) ∈ B} for x ∈ D = (A ∪ B)^c,   q⁺(x) = 0 for x ∈ A,   q⁺(x) = 1 for x ∈ B,   (10)

where the subscript x indicates that the probability is conditional on a fixed initial condition X(0) = x, i.e., P_x{·} = P{·|X(0) = x}. (The superscript “+” distinguishes the forward committor from the backward committor, an analogous quantity for the time-reversed process which we do not use in this paper.) Throughout, we will use capital X(t) to denote a stochastic process, and lower-case x to represent a specific point in state space, typically an initial condition, i.e., X(0) = x. Both are d = 75-dimensional. The committor interpolates between these limits across D: if the vortex starts out very weak, x is close to set B and the system will probably land in B next, making q⁺(x) ≈ 1. If it starts out strong and close to A, it will most likely restore to A next, making q⁺(x) ≈ 0. The committor is clearly a function of the initial condition x, but assuming the process is Markovian, it does not depend on the history of the system that led it to x in the first place.

Another important forecasting quantity is the lead time to the event of interest. While the forward committor reveals the probability of experiencing vortex breakdown before returning to a strong vortex, it does not say how long either event will take in absolute terms. Furthermore, even if the vortex is restored first, how long will it be until the next SSW does occur? The time until the next SSW event is denoted τ_B, again a random variable, whose distribution depends on the initial condition x. We call E_x[τ_B] the mean first passage time (MFPT) to B. Conversely, we may ask how long a vortex disturbance will persist before normal conditions return; the answer (on average) is E_x[τ_A], the mean first passage time to A. Dissecting the expectations further, we may condition τ_B on the event that an SSW is coming before the strong vortex returns, leading to the conditional first passage time E_x[τ_B | τ_B < τ_A], which in some sense quantifies the suddenness of SSW.

All of these quantities can, in principle, be estimated by collecting averages over very long simulations. For example, to estimate the committor at a given x, one can shoot N trajectories starting from x and count the numbers N_A and N_B hitting A and B first. Then N_B/N will be an estimate for the committor at x. The mean first passage time can be estimated using these same sampled trajectories. But this direct method can be prohibitively expensive, especially if applied to many points all over state space. By definition, transitions between A and B are infrequent. Therefore, if starting from x far from B, then a huge number of sampled trajectories (N) will be required to observe even a small number ending in B (N_B).
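To make the cost of this direct “shooting” estimator concrete, here is a minimal sketch for a one-dimensional double-well diffusion dX = (X − X³)dt + σdW, a stand-in for Eq. (6), with A = (−∞, −0.8] and B = [0.8, ∞) containing the two wells at x = ∓1. All parameter choices are illustrative. Even in one dimension, thousands of trajectories are needed per initial condition; in the 75-dimensional Holton-Mass model the same expense would recur at every point of interest in state space.

```python
import numpy as np

def committor_shooting(x0, n_traj=4000, sigma=0.7, dt=0.01,
                       a=-0.8, b=0.8, max_steps=100_000, seed=0):
    """Brute-force estimate of q+(x0) = P{hit B before A} for the
    double-well SDE dX = (X - X^3) dt + sigma dW (illustrative)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_traj, float(x0))
    alive = np.ones(n_traj, dtype=bool)    # not yet absorbed in A or B
    hit_b = np.zeros(n_traj, dtype=bool)
    for _ in range(max_steps):
        if not alive.any():
            break
        xa = x[alive]
        x[alive] = xa + (xa - xa**3) * dt \
            + sigma * np.sqrt(dt) * rng.standard_normal(xa.size)
        hit_b |= alive & (x >= b)          # trajectories reaching B first
        alive &= (x > a) & (x < b)         # the rest keep evolving in D
    return hit_b.mean()                    # N_B / N
```

By symmetry of the potential, q⁺(0) should be near 1/2, and the estimate should increase monotonically as x0 approaches B.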
Likewise, transition path statistics such as return times can, in principle, be computed from an extremely long model run, but in most cases of interest this direct simulation approach will not be feasible.

In Subsection 3(b) we will write these forecasts in a single general form, and describe a computational approach to compute them using only a data set of short forward model integrations. The method is called the Dynamical Galerkin Approximation (DGA), introduced in Thiede et al. (2019). It takes advantage of the Feynman-Kac formula (Oksendal 2003), recasting conditional expectations as PDE problems over state space. These equations are local and thus approximable by short trajectories. We perform these calculations on the Holton-Mass model and present the results in Section 4. In Section 5(a) we describe a statistical analysis to aid interpretation of the estimated committor.

b. Dynamical Galerkin Approximation

In this section we describe the methodology, which involves some technical results from stochastic processes and measure theory. The forecast functions described above—committors and MFPTs—can all be written as conditional expectations of the form

F(x) = E_x[ G(X(τ)) e^{−∫₀^τ V(X(s)) ds} + ∫₀^τ H(X(s)) e^{−∫₀^s V(X(r)) dr} ds ]   (11)

where again the subscript x denotes conditioning on X(0) = x; G, H and V are arbitrary known functions over R^d; and τ is a stopping time, specifically a first-exit time like Equation (9) but possibly with D replaced by another set. To see that the forward committor takes on this form, set G(x) = 1_B(x) (one on set B and zero everywhere else), V = H = 0, and τ = τ_{D^c}. Then F(x) = E_x[1_B(X(τ))] = P_x{X(τ_{D^c}) ∈ B} = q⁺(x). For the mean first passage time to B, set τ = τ_B, G = V = 0, and H = 1. Then F(x) = E_x[∫₀^{τ_B} dt] = E_x[τ_B]. For the conditional first passage time to B, set G = 1_B, τ = τ_{D^c}, V = −λ (a constant), and H = 0. Then the expectation can be computed in two steps, by computing and differentiating a moment-generating function:

F(x; λ) = E_x[1_B(X(τ_{D^c})) e^{λτ_{D^c}}]
(1/q⁺(x)) ∂F/∂λ(x; 0) = E_x[τ_{D^c} 1_B(X(τ_{D^c}))] / E_x[1_B(X(τ_{D^c}))] = E_x[τ_B | τ_B < τ_A]   (12)

Note that the event τ_B < τ_A is equivalent to X(τ_{D^c}) ∈ B, where again D^c = A ∪ B. We approximate ∂F/∂λ with a centered finite difference, after computing F(x; λ) for several λ in the neighborhood of zero. In principle we could compute higher moments in the same way and get a more detailed understanding of the conditional passage time distribution. Alternatively we could estimate F(x; λ) for a large range of λ and recover the distribution with a Laplace transform.

More generally, the function G is chosen by the user to quantify risk at the terminal time τ; in the case of the forward committor, that risk is binary, with an SSW representing a positive risk and a radiative vortex no risk at all. The function H is chosen to quantify the risk accumulated up until time τ, which might be simply an event’s duration, but other integrated risks may be of more interest for the application. For example, one could express the total thermal energy absorbed by the polar vortex by setting H = v′T′, or the momentum lost by the vortex by setting H(x) = U(a) − U(x), or a vertically integrated version. Using the moment generating function in (12), one can compute not only means but higher moments of such integrals by expressing the risk with V.

Let us now describe how to numerically compute F(x) of the form (11) with short trajectories, starting with the special case of the forward committor and then generalizing. Consider starting a random trajectory at x = X(0) ∈ D = (A ∪ B)^c and evolving it for a short time Δt.
Its probability of reaching B first, q⁺(x), is simply the probability that it reaches B first starting from X(Δt) instead, averaged over all possible X(Δt) (ignoring momentarily the small probability that A ∪ B is reached before time Δt). That is, q⁺(x) ≈ E_x[q⁺(X(Δt))] =: T^{Δt} q⁺(x). The operator T^{Δt} is known as the (stochastic) transition operator, which maps a function on state space to the expectation of that function at a future time. We could furthermore divide by Δt and take the limit Δt → 0, eliminating the event τ_{D^c} < Δt, and obtain the Kolmogorov backward PDE (e.g., Oksendal 2003; Weinan et al. 2019). Instead, to represent our numerical method more directly, we implement a purely finite-time approach from Strahan et al. (2020): artificially halt the dynamics upon first arrival in D^c = A ∪ B and modify the equation to

E_x[q⁺(X(Δt ∧ τ_{D^c}))] − q⁺(x) = (T^{Δt}_{D^c} − 1) q⁺(x) = 0   (13)

where Δt ∧ τ_{D^c} := min(Δt, τ_{D^c}) and T^{Δt}_{D^c} is a “stopped” transition operator. This equation holds for x ∈ D, and comes with the boundary condition q⁺(x) = 1_B(x) for x ∈ A ∪ B. Applying similar logic to the mean first passage time to B, let x ∈ B^c, denote m_B(x) = E_x[τ_B], and observe that τ_B decreases by Δt ∧ τ_B during the short timespan. So for all x ∈ B^c,

E_x[m_B(X(Δt ∧ τ_B))] − m_B(x) = (T^{Δt}_B − 1) m_B(x) = −E_x[Δt ∧ τ_B]   (14)

with the boundary condition m_B(x) = 0 for x ∈ B. Now in the general case, let D stand in for the relevant region of state space and let G, H and V be arbitrary. The corresponding operator equation is

(T^{Δt}_{D^c} − 1) F(x) − E_x[∫₀^{Δt ∧ τ_{D^c}} V(X(t)) F(X(t)) dt] = −E_x[∫₀^{Δt ∧ τ_{D^c}} H(X(t)) dt]   (15)

for x ∈ D, with boundary condition F(x) = G(x) for x ∈ D^c. This linear equation comes from Dynkin’s formula, an integrated version of the Feynman-Kac formula; see Oksendal (2003); Karatzas and Shreve (1998); Weinan et al. (2019) for theoretical background. The remarkable aspect of this formula is that while F is an expectation over paths going all the way to the boundary D^c (a strong or weak vortex), it obeys a local equation with expectations over short trajectories of length Δt. By collecting many short-trajectory samples, we can compute statistical properties of the event without ever actually observing one happen in simulation. Note that (15) reduces to (13) with V = H = 0, and to (14) with V = 0, H = 1.
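The moment-generating-function recipe of Equation (12) can be checked by brute force on a one-dimensional double-well diffusion dX = (X − X³)dt + σdW, an illustrative stand-in for Eq. (6) with A = (−∞, −0.8] and B = [0.8, ∞). The sketch below (all parameters are assumptions, not the paper's) estimates F(x; λ) = E_x[1_B e^{λτ}], differentiates it with a centered finite difference in λ, divides by the committor, and compares against the direct conditional average E_x[τ_B | τ_B < τ_A].

```python
import numpy as np

def sample_exit(x0, n_traj=4000, sigma=0.7, dt=0.01, a=-0.8, b=0.8,
                max_steps=100_000, seed=0):
    """First-exit times tau and indicators of hitting B first for the
    double-well SDE dX = (X - X^3) dt + sigma dW (illustrative)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_traj, float(x0))
    tau = np.zeros(n_traj)
    alive = np.ones(n_traj, dtype=bool)
    hit_b = np.zeros(n_traj, dtype=bool)
    for step in range(1, max_steps + 1):
        if not alive.any():
            break
        xa = x[alive]
        x[alive] = xa + (xa - xa**3) * dt \
            + sigma * np.sqrt(dt) * rng.standard_normal(xa.size)
        tau[alive] = step * dt             # tau freezes at absorption
        hit_b |= alive & (x >= b)
        alive &= (x > a) & (x < b)
    return tau, hit_b

tau, in_b = sample_exit(0.3)
q_plus = in_b.mean()                       # committor estimate q+(0.3)

# F(x; lam) = E_x[1_B exp(lam * tau)]; a centered finite difference in
# lam, divided by q+, recovers E_x[tau_B | tau_B < tau_A] as in Eq. (12)
F = lambda lam: np.mean(in_b * np.exp(lam * tau))
eps = 1e-3
mgf_estimate = (F(eps) - F(-eps)) / (2 * eps) / q_plus
direct_estimate = tau[in_b].mean()         # conditional mean, for comparison
```

Because both estimates are computed from the same samples, they differ only by the O(ε²) finite-difference error, which is negligible here.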
Successful approaches will involve a representation of the solution, F, suitable for the high-dimensional setting, i.e. representations of the type commonly employed for machine learning tasks. The DGA method, in particular, consists of expanding the unknown function F in a “data-informed” basis (to be specified later). The expectations in Equation (15) are estimated by launching short trajectories from all over state space. Finally, a finite system of equations is solved for the unknown coefficients in the basis expansion of F, in effect stitching together information from all trajectories at once.

We can express the essential idea using the example of Equation (13) for q⁺(x), while Appendix B contains a more general version. We first homogenize the boundary conditions with a guess function q̂⁺(x) that obeys the boundary conditions q̂⁺|_A = 0, q̂⁺|_B = 1, and let r(x) = q⁺(x) − q̂⁺(x), so that r obeys homogeneous Dirichlet conditions and satisfies

(T^{Δt}_{D^c} − 1) r(x) = −(T^{Δt}_{D^c} − 1) q̂⁺(x)   (16)

We next expand r in a finite-dimensional basis of functions {φ₁, ..., φ_M} with unknown coefficients c_j: r(x) = Σ_{j=1}^{M} c_j(r) φ_j(x). Each φ_j obeys the homogeneous boundary conditions. Finally, we take the inner product of both sides with φ_i, with respect to some measure μ, to produce a system of M linear equations

Σ_{j=1}^{M} ⟨φ_i, (T^{Δt}_{D^c} − 1) φ_j⟩_μ c_j(r) = −⟨φ_i, (T^{Δt}_{D^c} − 1) q̂⁺⟩_μ,   i = 1, ..., M   (17)

These inner products are intractable integrals over high-dimensional state space, but can be approximated using Monte Carlo integration. If X is an R^d-valued random variable distributed according to μ, and we have access to random samples {X₁, ..., X_N}, the law of large numbers gives

lim_{N→∞} (1/N) Σ_{n=1}^{N} f(X_n) = ∫_{R^d} f(x) μ(dx)   (18)

This is where the short-trajectory data enters the picture. We generate a dataset of length-Δt trajectories {X_n(t) : 0 ≤ t ≤ Δt, n = 1, ..., N}. These short trajectories might enter A or B before time Δt, and to account for this we also store the stopping times Δt ∧ τ_{n,D^c}. The X_n(0)’s are sampled from an arbitrary measure μ, called the sampling measure, which is determined by the sampling procedure for initial points. For example, if points are selected randomly from a long trajectory, μ ≈ π (the steady-state probability density) by ergodicity. However, we may choose μ so that many samples appear in regions of particular interest, such as transition regions far away from A and B and to which π assigns very little probability. Once the dataset is generated, we use μ as the reference measure for the inner products in (17), allowing us to approximate them with Monte Carlo integration.
For example,

$$\big\langle \phi_i, (\mathcal{T}^{\Delta t}_{D^c} - I)\, \phi_j \big\rangle_\mu \approx \frac{1}{N} \sum_{n=1}^N \phi_i(\mathbf{X}_n(0)) \left[ \phi_j\big(\mathbf{X}_n(\Delta t \wedge \tau_{n,D^c})\big) - \phi_j(\mathbf{X}_n(0)) \right]. \qquad (19)$$

We can similarly estimate any expectation of the form (15) using different basis functions adapted to the specific region of interest. The formulation above works for any class of basis functions that becomes increasingly expressive as the library grows, capable of estimating any function of interest. However, with a finite truncation, choosing the basis functions is a crucial ingredient of DGA, greatly impacting the efficiency and accuracy of the results. In our current study, we restrict to the simplest kind of basis, which consists of indicator functions $\phi_i(\mathbf{x}) = \mathbf{1}_{S_i}(\mathbf{x})$, where $\{S_1, \ldots, S_M\}$ is a disjoint partition of state space. In practice we will construct these sets by clustering data. This basis set construction is borrowed from common practice in the computational statistical mechanics community for building a Markov State Model (MSM) (Frank and Fischer 2008; Pande et al. 2010; Bowman et al. 2013; Chodera and Noé 2014). MSMs are a dimensionality reduction technique that has also been used in conjunction with analysis of metastable transitions, primarily in protein folding dynamics (Noé et al. 2009), and were recently used to study ocean circulation in Miron et al. (2021). DGA can be viewed as an extension of MSMs, though, rather than producing any reduced-complexity model, the explicit goal in DGA is the estimation of specific functions as in Equation (11). Appendix B spells out DGA in considerably greater detail, which may be more helpful to view after seeing the forthcoming results.
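To make Equations (17)-(19) concrete with the indicator basis, the following minimal sketch (our own construction on a toy one-dimensional diffusion, not the Holton-Mass model) assembles the stopped-transition matrix from short-trajectory pairs and solves the resulting Markov-state-model committor equation on the clusters. The bin layout, parameters, and variable names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diffusion dX = sigma dW on (0, 1); A = {x <= 0}, B = {x >= 1}.
# The exact committor is q+(x) = x, so the DGA output can be checked directly.
N, n_steps, sigma, h = 20_000, 25, 0.25, 0.01     # short trajectories: dt = 0.25
x0 = rng.uniform(0.0, 1.0, N)                     # sampling measure mu on D
x = x0.copy()
alive = np.ones(N, dtype=bool)
for _ in range(n_steps):                          # halt on first entry into A u B
    x = np.where(alive, x + sigma * np.sqrt(h) * rng.standard_normal(N), x)
    alive &= (x > 0.0) & (x < 1.0)
# x now holds the stopped states X_n(dt ^ tau_{n,D^c})

# Indicator basis: M uniform bins on (0, 1), plus two "boundary" clusters for A, B
M = 20
edges = np.linspace(0.0, 1.0, M + 1)

def cluster(y):
    idx = np.clip(np.digitize(y, edges) - 1, 0, M - 1)
    idx = np.where(y <= 0.0, M, idx)              # cluster M   <-> set A
    return np.where(y >= 1.0, M + 1, idx)         # cluster M+1 <-> set B

i0, i1 = cluster(x0), cluster(x)
C = np.zeros((M + 2, M + 2))
np.add.at(C, (i0, i1), 1.0)                       # stopped-transition counts, Eq. (19)
C[M, M] = C[M + 1, M + 1] = 1.0                   # boundary clusters are absorbing
P = C / C.sum(axis=1, keepdims=True)

# Galerkin system (17) with indicators: (P - I) q = 0 in D, with q|A = 0, q|B = 1
q = np.zeros(M + 2)
q[M + 1] = 1.0
q[:M] = np.linalg.solve(P[:M, :M] - np.eye(M), -P[:M, M + 1])
centers = 0.5 * (edges[:-1] + edges[1:])          # q[:M] should approximate centers
```

Even though most 0.25-time-unit trajectories never reach a boundary, stitching their stopped transitions together recovers the committor $q^+(x) \approx x$ on every bin.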
4. Methods
In this section we explain our specific application of the Dynamical Galerkin Approximation (DGA) method (Thiede et al. 2019; Strahan et al. 2020) to the Holton-Mass model and validate our results empirically from simulation.

a. Data generation
There are many possible ways to choose starting points for the short trajectories. Whatever procedure we use will induce a sampling measure $\mu$ on state space. $\mu(\mathbf{x})$ is a probability density that specifies the expected number of starting points per unit volume in the region of state space near $\mathbf{x}$. This is a natural reference measure for the Monte Carlo inner products described in Subsection 3(b). Because $\mu$ has minimal requirements, the user is afforded great flexibility in sampling the data. How to efficiently generate maximally informative data is an active and nontrivial research question, but a few heuristics are obvious. In a metastable system, setting $\mu = \pi$ would be a poor choice, because the data would be strongly concentrated in the immediate neighborhoods of $A$ and $B$, whereas the regions of primary interest are the transition regions somewhere in between $A$ and $B$. Different physical observables, such as the Eliassen-Palm flux, may be important prior candidates for their predictive power, and we might like to seed data samples uniformly over a certain range of such a variable. On the other hand, the samples should fall within a physically realistic region of state space, not just any point in $\mathbb{R}^{75}$. To see why, recall that the last 25 entries of the state vector $\mathbf{X}$ represent the velocity field at 25 discretized vertical levels spanning the depth of the model stratosphere. Because velocity is a continuous function of altitude, adjacent entries should be close together, which is not at all guaranteed for a randomly chosen 75-dimensional vector. Because a goal of this article is to demonstrate interpretable results of rare event analysis on a climate model, we choose an easy, probably suboptimal sampling strategy. We defer optimization to later work, perhaps for a more expensive model that demands it.
We define our sampling distribution as the equilibrium distribution, re-weighted to be uniform over the two-dimensional space $(U(30\,\mathrm{km}), |\Psi|(30\,\mathrm{km}))$ within the bounds realized by the control simulation. Without direct access to the equilibrium distribution, we approximate it by running a very long trajectory of 500,000 days, producing many transitions like those shown in Figure 1, with an Euler-Maruyama timestep of 0.005 days (for comparison, a single transition event takes on the order of 100 days). We acknowledge this is cheating on our claim to only use short trajectories; however, we use the long simulation only to seed the initial conditions for those short trajectories, as well as to empirically validate the results of DGA later on. This way we can emphasize the power of DGA itself, which will motivate more efficient upstream data generation methods. Alternatives exist for sampling state space thoroughly without a long simulation, for example trajectory-splitting (e.g., L'Ecuyer et al. 2007). One could initialize trajectories in one of the metastable sets, say in $A$ directly on the fixed point $\mathbf{a}$, and integrate the trajectories for a short time to explore the surrounding region. These new data points can be used as initial conditions for the next round of simulation, at each stage exploring a wider region of state space until the bulk of the attractor is covered. This initialization procedure may require a long total simulation time, but is parallelizable. We will explore and optimize such methods in future work with more sophisticated models, where efficient initialization is more critical. For now we settle for initial data points from a long simulation. After downsampling the long simulation, we re-weight the samples to be approximately uniform in $(U(30\,\mathrm{km}), |\Psi(30\,\mathrm{km})|)$. Specifically, we compute a discrete histogram over the two-dimensional space and weight each sample by the inverse of its density on that histogram.
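The inverse-histogram re-weighting step can be sketched as follows. The two-regime synthetic data and all names are our own stand-ins for the $(U(30\,\mathrm{km}), |\Psi|(30\,\mathrm{km}))$ samples from the control run.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for (U(30 km), |Psi|(30 km)) pairs from a long control run,
# concentrated near two "regimes" like the bimodal climatology.
long_run = np.concatenate([
    rng.normal([0.2, 0.2], 0.05, size=(9000, 2)),
    rng.normal([0.8, 0.8], 0.05, size=(1000, 2)),
])

def inverse_histogram_weights(samples, bins=20):
    """Weight each sample by the inverse of its 2D histogram density."""
    H, xe, ye = np.histogram2d(samples[:, 0], samples[:, 1], bins=bins)
    ix = np.clip(np.digitize(samples[:, 0], xe) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(samples[:, 1], ye) - 1, 0, bins - 1)
    w = 1.0 / H[ix, iy]                  # every sample's bin count is >= 1
    return w / w.sum()

w = inverse_histogram_weights(long_run)
# Draw initial conditions for short trajectories, approximately uniform
# over the occupied region of the 2D observable space.
idx = rng.choice(len(long_run), size=2000, replace=True, p=w)
seeds = long_run[idx]
```

The re-weighted draw places comparable mass on both regimes even though the raw run spends 90% of its time in one of them.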
We collect $N$ snapshots $\{\mathbf{X}_n(0)\}$ directly from the long simulation, and then launch independent (hence completely parallelizable) short 10-day trajectories from each, to obtain the short trajectory database $\{\mathbf{X}_n(t) : 0 \le t \le \Delta t,\ n = 1, \ldots, N\}$. Afterward we identify the first-entry times to $D^c$ for each trajectory, denoted $\tau_{n,D^c}$. This strategy is straightforward and guarantees that $\mu$ gives substantial probability to candidate transition regions and that only physically reasonable points are sampled.

b. Computation and validation

The partition $\{S_1, \ldots, S_M\}$ used to build the basis function library $\{\mathbf{1}_{S_j}(\mathbf{x})\}_{j=1}^M$ should be chosen with a number of considerations in mind. The partition elements should be small enough to accurately represent the functions they are used to approximate, but large enough to contain sufficient data to robustly estimate transition probabilities. We form these sets by a hierarchical modification of $K$-means clustering on $\{\mathbf{X}_n(0)\}_{n=1}^N$. $K$-means is a robust method that can incorporate new samples by simply identifying the closest centroid, and is commonly used in molecular dynamics (Pande et al. 2010). However, straightforward application of $K$-means, as implemented in the scikit-learn software (Pedregosa et al. 2011), can produce a very imbalanced cluster size distribution, even with empty clusters. This leads to unwanted singularities in the constructed Markov matrix. To avoid this problem we cluster hierarchically, starting with a coarse clustering of all points and iteratively refining the larger clusters, at every stage enforcing a minimum cluster size, until we have the desired number of clusters ($M$). After clustering on the initial points $\{\mathbf{X}_n(0)\}$, the other points $\{\mathbf{X}_n(t),\ 0 < t \le \Delta t\}$ are placed into clusters using an address tree produced by the $K$-means cluster hierarchy.
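A minimal version of the hierarchical clustering idea (bisecting the largest cluster while enforcing a minimum cluster size) might look like the sketch below. This is a simplified stand-in, not the paper's exact address-tree implementation; the function name and parameters are our own.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(points, n_clusters, min_size, seed=0):
    """Bisecting K-means that refuses splits creating clusters below min_size.

    Repeatedly bisect the most populous cluster with 2-means until n_clusters
    is reached or no admissible split remains, so every cluster keeps at least
    min_size points and the resulting Markov matrix has no empty (singular) rows.
    """
    labels = np.zeros(len(points), dtype=int)
    n = 1
    while n < n_clusters:
        sizes = np.bincount(labels, minlength=n)
        for big in np.argsort(sizes)[::-1]:       # try largest clusters first
            idx = np.flatnonzero(labels == big)
            if len(idx) < 2 * min_size:
                continue
            km = KMeans(n_clusters=2, n_init=5, random_state=seed).fit(points[idx])
            if np.bincount(km.labels_, minlength=2).min() < min_size:
                continue                          # reject an imbalanced split
            labels[idx[km.labels_ == 1]] = n      # one child keeps label `big`
            n += 1
            break
        else:
            break                                 # no cluster can be split further
    return labels

rng = np.random.default_rng(5)
pts = rng.uniform(size=(2000, 2))                 # stand-in for snapshots X_n(0)
labels = hierarchical_kmeans(pts, n_clusters=12, min_size=50)
```

By construction, the minimum-size invariant is maintained through every accepted split, so no cluster is ever left too sparse to estimate transition probabilities.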
To guarantee that $D$ and $D^c$ each consist exactly of a union of partition subsets, we cluster points in $D$ and $D^c$ separately, with a number of clusters proportional to the number of points therein. (We remind the reader that the domain and boundary depend on which quantity of interest is being computed. For the forward and backward committor, $D^c$ consists of $A$ and $B$, which are defined a priori by thresholds of $U(30\,\mathrm{km})$.) The total number of clusters is fixed to $M$. To assign a new point $\mathbf{z}$, we first identify whether $\mathbf{z} \in D$ or $D^c$, and assign it to a cluster accordingly. Figure 2 demonstrates the accuracy of the calculated forward committor and mean first passage time to $B$ by taking advantage of the long trajectory from which we sampled the short trajectories. We divide the interval $(0,1)$ into 20 bins, and identify for each interval $(\zeta_i, \zeta_{i+1})$ which data points $\{\mathbf{X}_n(0) : \zeta_i < q^+(\mathbf{X}_n(0)) < \zeta_{i+1}\}$ were en route to the vacillating regime at the instant they were selected from the long simulation. If the committor is computed accurately, the proportion of data points headed to $B$ should fall in the interval $(\zeta_i, \zeta_{i+1})$. For example, about 20-25% of data points $\mathbf{X}_n(0)$ whose committor is calculated to be within $(0.20, 0.25)$ should be headed to set $B$. Analogously, we expect rain 20% of the time the National Weather Service forecasts a 20% chance of rain. This is a very coarse measure of accuracy, and only a necessary condition, but the strong empirical match shown in the scatter plots of Figure 2 gives us confidence in our numerical results. The mean first passage time calculation is evaluated similarly: for all data points $\mathbf{X}_n(0)$ with the estimated $m_B(\mathbf{X}_n(0))$ in a certain range $(t_1, t_2)$, we average the true first-passage time observed from the long trajectory. The match is quite good up until very long lag times, where DGA underestimates the long tail. The accuracy of committors and passage times improves as the dataset grows and clusters are refined.
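The reliability check of Figure 2 amounts to binning predicted committor values and comparing each bin's empirical fraction of $B$-bound points against the bin center. Below is a sketch, tested on synthetic forecasts that are calibrated by construction; all names are our own.

```python
import numpy as np

def reliability_curve(q_pred, hit_B, n_bins=20):
    """Compare predicted committor values against empirical outcomes.

    q_pred : predicted probabilities q+(X_n(0)) for each snapshot
    hit_B  : whether the long trajectory actually reached B next from there
    Returns bin centers and the empirical fraction of B-bound points per bin.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(q_pred, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    frac = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            frac[b] = hit_B[mask].mean()
    return centers, frac

# Synthetic check: outcomes drawn with the stated probabilities should line up
# along the diagonal, like the scatter plots of Figure 2.
rng = np.random.default_rng(2)
q = rng.uniform(0.0, 1.0, 50_000)
outcome = rng.uniform(0.0, 1.0, 50_000) < q
centers, frac = reliability_curve(q, outcome)
```

A miscalibrated forecast would show systematic departures of `frac` from `centers`; this is the necessary-condition check described in the text.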
More sophisticated basis sets and sampling methods may significantly improve the convergence rate. The committor and first passage time relate to the weather forecasting problem of predicting the next rare event given the current initial condition. However, they can also characterize the polar vortex climatology, meaning its average behavior over very long time periods as pertains to $A$ and $B$. To wit, how much does the system "prefer" to be in a weak or strong state, as measured by the fraction of time it spends in either? This can be quantified by the steady state distribution (also called the invariant or stationary measure) $\pi(\mathbf{x})$, the probability distribution function produced by binning data points from a very long simulation.

Fig. 2. Accuracy of the committor and mean first passage time calculations verified with long trajectory data. The DGA calculations assign approximate committor and mean first passage time values $q^+(\mathbf{X}_n(0))$ and $m_B(\mathbf{X}_n(0))$ to each data point. Because each snapshot $\mathbf{X}_n$ was collected from a long trajectory, its destination and the time to get there are known and can provide an empirical validation of the committor and lead time. For 20 equal partitions $(\zeta_i, \zeta_{i+1})$ of the interval $(0,1)$, we assemble all trajectory starts $\mathbf{X}_n(0)$ with $q^+(\mathbf{X}_n(0)) \in (\zeta_i, \zeta_{i+1})$ and count the fraction heading toward $B$. These are the empirical committors for the interval $(\zeta_i, \zeta_{i+1})$, and are plotted on the vertical axis against $(\zeta_i + \zeta_{i+1})/2$. Similarly, we bin the space of calculated first passage times and for each bin average the empirical first passage time to $B$. Both quantities line up well between DGA computations and empirical values, with the exception of the longest passage times, which are underestimated.

Figure 3 illustrates that the metastable Holton-Mass model has a starkly bimodal distribution, with the system tending to spend a long time in state $A$ or $B$ before occasionally switching quickly to the other state. We have estimated $\pi$ here using a variation on the DGA recipe described above. The details of the calculation can be found in Appendix B. We have projected $\pi$ onto the two-dimensional subspace $(|\Psi(30\,\mathrm{km})|, U(30\,\mathrm{km}))$ on a log scale, along with a one-dimensional projection onto the latter coordinate $U(30\,\mathrm{km})$ on a linear scale.

Fig. 3. Steady-state distribution. The density $\pi(\mathbf{x})$ is projected onto the two-dimensional space $(|\Psi|, U)$ at 30 km, on a log scale. The density is peaked in the neighborhoods of the two fixed points. On the right is a projection of $\pi$ onto the single variable $U(30\,\mathrm{km})$, on a linear scale, confirming strong bimodality.

Table 1. Fractions of time spent (i) inside $A$, (ii) outside $A$ but destined for $A$, (iii) inside $B$, (iv) outside $B$ but destined for $B$. Each fraction is computed directly from DGA and empirically verified from the long simulation.

The preferences for $A$ and $B$ can be quantitatively compared by the fraction of time spent inside each set, as well as the fraction of time spent between the two sets but destined for either one. These ergodic averages can be found by averaging the forward committor over different regions of state space. For example, the fraction of time spent inside $A$ is $\int_A \pi(d\mathbf{x}) = \int_{\mathbb{R}^d} \mathbf{1}_A(\mathbf{x})\, \pi(d\mathbf{x}) = \langle \mathbf{1}_A \rangle_\pi$. Similarly, the fraction of time spent inside $B$ is $\langle \mathbf{1}_B \rangle_\pi$; the fraction spent outside $A$ and $B$ but destined for $B$ is $\langle \mathbf{1}_{(A \cup B)^c}\, q^+ \rangle_\pi$; and the fraction spent outside $A$ and $B$ but destined for $A$ is $\langle \mathbf{1}_{(A \cup B)^c}\, (1 - q^+) \rangle_\pi$. Table 1 displays these fractions calculated from DGA and empirically from the long trajectory. The time spent either in $A$ or destined for $A$ (the first two rows) is about equal to the time spent in or destined for $B$ (last two rows). However, more time is spent destined for $B$ than strictly inside $B$, as vacillation cycles often increase the zonal wind above 1.75 m/s before it dips back down. Furthermore, Figure 3 shows a higher and narrower peak in the $A$ regime. We interpret that a strong vortex is much less variable than a weak vortex, which is consistent with the vacillation cycles that characterize the latter.
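The four ergodic fractions of Table 1 are $\pi$-weighted averages of indicator functions and the committor, which can be computed in a few lines. The toy one-dimensional example and names below are our own illustration.

```python
import numpy as np

def occupation_fractions(in_A, in_B, q_plus, weights=None):
    """Ergodic fractions as in Table 1: time inside A, outside-but-destined-for-A,
    inside B, outside-but-destined-for-B, each a pi-weighted average of indicator
    functions and the forward committor."""
    w = np.ones(len(q_plus)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    outside = ~(in_A | in_B)
    return {
        "inside_A": float(np.sum(w * in_A)),
        "to_A": float(np.sum(w * outside * (1.0 - q_plus))),
        "inside_B": float(np.sum(w * in_B)),
        "to_B": float(np.sum(w * outside * q_plus)),
    }

rng = np.random.default_rng(6)
x = rng.uniform(-1.0, 1.0, 10_000)               # toy 1-d "state" from a long run
in_A, in_B = x < -0.5, x > 0.5
q_plus = np.clip(x + 0.5, 0.0, 1.0)              # illustrative committor profile
fracs = occupation_fractions(in_A, in_B, q_plus)
```

Because the committor splits the "outside" probability exactly between the two destinations, the four fractions always sum to one.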
5. Results and Discussion
Our analysis can be roughly divided into two parts. First, from a forecasting perspective, we demonstrate that the committor is more robust than naïve proxies from the model as a leading indicator of an oncoming SSW. We also find a low-rank representation of the committor in terms of the system's basic observables using a sparsity-promoting LASSO regression (Tibshirani 1996). Second, we quantitatively relate the risk of an oncoming event with the lead time to the event, an important consideration in extreme weather prediction.

a. The committor as an early warning
Operational forecasting requires continuous updating of probabilities from incoming observations, which provide only partial information on the state of the atmosphere. The choice of which observables to monitor is constrained by measurement capabilities, but is also informed by prediction efficacy; we desire warning signs that are highly correlated with the event and occur as early as possible to give some buffer time to brace for impacts. Figure 4 visually demonstrates the advantage of considering the committor as a forecasting metric compared to two other observables: zonal wind $U$ and meridional eddy heat flux $v'T'$, both measured at the same altitude of 30 km. We have extracted a typical complete SSW event ($A \to B$ transition path) from the long simulation and plotted a timeseries of the observables on a common time axis. The time $t = 0$ marks the first entry into $B$, with zonal wind at 30 km dropping below the threshold of 1.75 m/s. The committor timeseries (Figure 4a) is estimated by nearest-neighbor interpolation from the dataset $\{\mathbf{X}_n(0)\}_{n=1}^N$. The committor timeseries first exceeds the threshold of 0.5 about 26 days before the event. Contrast this with the zonal wind $U(t)$, which is plateauing, or very gradually decreasing, around 40 m/s when the threshold $q^+ = 0.5$ is crossed, and so a reading of $U$ directly would not give a strong warning sign until late in the progress of the transition. One could write the committor as an approximate function of $U(30\,\mathrm{km})$, which is plotted in Figure 5(a) as explained below, and would find that the $U$-level corresponding to $q^+ = 0.5$ is about 37 m/s. Unfortunately, $U(30\,\mathrm{km})$ does not drop below 37 m/s until the SSW is 12 days away, providing much less lead time than if the full committor were known. The considerable gap in prediction date is shown by a blue strip. Meanwhile, the heat flux over time plotted in panel (c) suffers the same deficiency as a predictor, having its own analogous committor-matched threshold level. The $v'T'$ level hardly budges while critical preconditions are falling into place, and only after the die is already cast in favor of an SSW does the heat flux rise sharply.

Fig. 4. The committor vs. other observables as a forecasting tool. A representative simulated SSW event from the long simulation is plotted over time, starting 65 days in advance of the official event when $U(30\,\mathrm{km})$ first drops below 1.75 m/s, which is marked by a vertical solid line. Panel (a) shows the committor over time following the trajectory, panel (b) shows the zonal wind $U(30\,\mathrm{km})$, and panel (c) shows the eddy heat flux $v'T'(30\,\mathrm{km})$. Horizontal dashed lines mark the natural forecasting threshold of $q^+ = 0.5$ and the levels of the other observables corresponding to $q^+ = 0.5$, including 37 m/s in panel (b). The sharp increase in $q^+$ as it crosses the threshold provides a clear and early warning sign of the oncoming SSW, about 26 days in advance. $U$ and $v'T'$ are moving slowly at that time, and don't clear their respective thresholds for the last time until the event is much closer at hand. The gap in lead time is marked by blue strips.

A monitoring system based on heat flux alone would be very ill-informed about the risk of an impending SSW event. While heat flux is a dynamically consequential quantity for describing the evolution of an SSW, this does not directly translate into good predictive properties. These prediction gaps are typical: over many simulated transitions, the average delays between $q^+(\mathbf{X}(t))$ clearing 0.5 for the last time and the other observables clearing their thresholds for the last time are 9.1 days for $U(30\,\mathrm{km})$ and 9.8 days for $v'T'(30\,\mathrm{km})$. We use the caveat "directly" because the possibility remains that the vertical scales in Figures 4(b-c) unfairly downplay the predictive power of zonal wind and heat flux. Perhaps they could be very robust predictors, if examined on the right scale and with appropriate (possibly nonlinear) transformations. Calculating such a transformation on theoretical grounds alone would be a daunting task, especially in light of stochastic perturbations. But even if this were possible, at best this calculation would approximate nothing other than the forward committor itself. Furthermore, we incorporate all state variables at once into the committor calculation, which is at least as flexible as considering heat flux or zonal wind alone. Nonetheless, for the sake of dynamical transparency and practical observational constraints, it would be helpful to have a parsimonious representation of the committor in terms of a small number of state variables, if possible. We pursue this prospect in the following subsection.
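The lead-time comparison (last threshold crossing of each observable before the event) can be computed along a transition path as in the sketch below; the synthetic ramp series and all names are our own, chosen only to exercise the bookkeeping.

```python
import numpy as np

def last_crossing_lead(t, series, threshold, t_event, above=True):
    """Lead time between the last threshold crossing of `series` and the event.

    Returns t_event - t_last, where t_last is the final time before t_event at
    which the series is still on the 'safe' side of the threshold (above it if
    above=True, as for zonal wind; below it if above=False, as for q+)."""
    before = t < t_event
    safe = series > threshold if above else series < threshold
    ok = np.flatnonzero(before & safe)
    if len(ok) == 0:
        return t_event - t[0]
    return t_event - t[ok[-1]]

t = np.arange(0.0, 100.0)
q_series = t / 100.0                       # committor ramping up toward the event
u_series = 50.0 - 0.2 * t                  # zonal wind decaying toward its threshold
lead_q = last_crossing_lead(t, q_series, 0.5, t_event=100.0, above=False)
lead_u = last_crossing_lead(t, u_series, 37.0, t_event=100.0, above=True)
```

Averaging `lead_q - lead_u` over many transition paths gives the prediction-gap statistics quoted in the text.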
b. Sparse representation of the committor

The committor's superiority as a probabilistic forecast is not surprising, because it is built into the definition. The committor combines information from every degree of freedom in just the right way to give the probability of next hitting $B$ rather than $A$. However, these degrees of freedom may not all be "observable" in a practical sense, given the sparsity and resolution limits of weather sensors. It is therefore important to ask: what is the best possible estimate of the committor given an observed subset of state variables? A related question arises in the design of observational systems: which variables should be measured to optimally estimate the committor, under cost and engineering constraints? In this section we will propose a systematic method to address these questions in the context of the Holton-Mass model. Consider a single-variable observable like $U(30\,\mathrm{km})$. If constrained to observe only $U(30\,\mathrm{km})$ and forced to approximate $q^+(\mathbf{x})$ as a function of this one variable, we would average $q^+(\mathbf{x})$ across the remaining 74 model dimensions, weighted by the invariant measure. We would assess the quality of this observable by the variance across those projected-out dimensions: a large projected variance would imply strong dependence on unobserved variables. Figure 5 applies this projection to the committor (first row) and mean first passage time to $B$ (second row), using three different single-variable observables: $U$ (first column) and $v'T'$ (second column), both at 30 km, and the LASSO regression (third column). The solid curves show the projected means, and the dotted curves indicate the one-standard-deviation envelope. $q^+(U(30\,\mathrm{km}))$ is a smooth, mostly monotonic curve with a consistently small projection error never exceeding $\sim 0.2$, which occurs near $q^+ = 0.5$. Compare this to $q^+(v'T'(30\,\mathrm{km}))$, which is essentially discontinuous as heat flux increases from zero, and which has a large standard deviation, approaching 0.5, over much of the range of $v'T'$. The projected mean first passage times tell a similar story, being strongly negatively correlated with the committor. Weaker zonal wind generally signals less lead time before entering state $B$, but while heat flux stays small, an observer is in the dark about how soon, as well as how certain, a transition is.

Let us briefly formalize the projection idea before exploring other variables. We want to approximate a function $F: \mathbb{R}^d \to \mathbb{R}$, such as the committor or mean first passage time, as a function of some reduced coordinates $\boldsymbol{\theta}: \mathbb{R}^d \to \mathbb{R}^k$, called "collective variables" (CVs) in the chemistry literature. That is, we wish to find $f: \mathbb{R}^k \to \mathbb{R}$ such that $F(\mathbf{x}) \approx f(\boldsymbol{\theta}(\mathbf{x}))$. For instance, $\boldsymbol{\theta}(\mathbf{x}) = (\theta_1(\mathbf{x}), \theta_2(\mathbf{x}))$ where $\theta_1(\mathbf{x})$ is the mean zonal wind at 30 km and $\theta_2(\mathbf{x})$ is the perturbation streamfunction magnitude $|\Psi|$ at 30 km. Typically the projected dimension $k \ll d$, for instance $k = 1$ or 2. The function $f$ is chosen by minimizing some function-space metric between $f \circ \boldsymbol{\theta}$ and $F$. The simplest choice would be the mean-squared error, so the projection problem is to minimize over functions $f: \mathbb{R}^k \to \mathbb{R}$ the penalty

$$S[f; \boldsymbol{\theta}] := \| f \circ \boldsymbol{\theta} - F \|^2_{L^2(\pi)} = \int_{\mathbb{R}^d} \left[ f(\boldsymbol{\theta}(\mathbf{x})) - F(\mathbf{x}) \right]^2 \pi(d\mathbf{x}). \qquad (20)$$

The optimal $f$ for this purpose is the conditional expectation $f^*(\mathbf{y}) = \mathbb{E}_{\mathbf{X} \sim \pi}[F(\mathbf{X}) \mid \boldsymbol{\theta}(\mathbf{X}) = \mathbf{y}] \propto \int F(\mathbf{x})\, \delta(\boldsymbol{\theta}(\mathbf{x}) - \mathbf{y})\, \pi(d\mathbf{x})$. We derive a discretized version of this formula in Appendix B, and this is how we display all the low-dimensional projections. We call the square root of $S[f; \boldsymbol{\theta}]$ the projected standard deviation, or projection error, which determines the dotted envelope in Figure 5. A much harder problem than optimizing over $f$ given $\boldsymbol{\theta}$ is the problem of optimizing over sets of coordinates $\boldsymbol{\theta}$.
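A discretized version of this conditional expectation, and of the projection-error envelope plotted in Figure 5, can be computed by binning the collective variable. The sketch below is our own minimal implementation (a histogram estimator rather than the Appendix B formulation), with illustrative names throughout.

```python
import numpy as np

def project_onto_cv(theta, F, weights, n_bins=30):
    """Discretized conditional expectation E_pi[F | theta] and its std.

    theta   : (N,) values of the collective variable theta(x) at each sample
    F       : (N,) values of the function being projected (e.g. the committor)
    weights : (N,) pi-weights of the samples
    Returns bin centers, the conditional mean, and the conditional standard
    deviation (the 'projection error' envelope of Figure 5)."""
    edges = np.linspace(theta.min(), theta.max(), n_bins + 1)
    idx = np.clip(np.digitize(theta, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mean = np.full(n_bins, np.nan)
    std = np.full(n_bins, np.nan)
    for b in range(n_bins):
        m = idx == b
        if m.any():
            w = weights[m] / weights[m].sum()
            mu = np.sum(w * F[m])
            mean[b] = mu
            std[b] = np.sqrt(np.sum(w * (F[m] - mu) ** 2))
    return centers, mean, std

# Sanity check on a perfect CV: if F is a function of theta alone, the
# conditional mean tracks theta and the projection error is nearly zero.
rng = np.random.default_rng(7)
theta = rng.uniform(0.0, 1.0, 50_000)
F = theta.copy()
centers, mean, std = project_onto_cv(theta, F, np.ones(50_000))
```

A poor CV would instead leave `std` large, exactly the diagnostic used to rank observables in Figures 5 and 6.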
CVs can be arbitrarily complex nonlinear functions of the basic state variables $\mathbf{x}$. Modern machine learning algorithms such as artificial neural networks are designed exactly for that purpose: to represent functions nonparametrically from observed input-output pairs. However, we wish to maintain some interpretability in the committor representation. For this reason, in searching for optimal projections, we begin with more constrained and physics-informed feature spaces before allowing for more complex relationships. We focus on observables coming from the Eliassen-Palm (EP) relation, which relates wave activity, PV fluxes and gradients, and heating source terms in a conservation equation. From Yoden (1987b), the EP relation for the Holton-Mass model takes the form

$$\partial_t \left( \frac{\overline{q'^2}}{2} \right) + (\partial_y \overline{q})\, \rho_s^{-1} \nabla \cdot \boldsymbol{F} = -\frac{f_0^2}{N^2}\, \rho_s^{-1}\, \overline{q'\, \partial_z(\alpha \rho_s \partial_z \psi')} \qquad (21)$$

where

$$\boldsymbol{F} = \left(-\rho_s \overline{u'v'}\right) \mathbf{j} + \left(\rho_s \overline{v' \partial_z \psi'}\right) \mathbf{k}.$$

Fig. 5. One-dimensional projections of the forward committor and mean first passage time to $B$, computed with DGA. These functions depend on all $d = 75$ degrees of freedom in the model, but we have averaged across $d - 1 = 74$ dimensions to visualize the committor (first row) and mean first passage time to $B$ (second row) as rough functions of three single degrees of freedom: $U(30\,\mathrm{km})$ (first column), $v'T'(30\,\mathrm{km})$ (second column), and the LASSO-regressed committor (third column). The forward committor measures proximity to $B$ in probability, while mean passage time to $B$ measures proximity in time, hence the negative correlation between the two quantities. The general trends reveal fairly obvious relationships: stronger wind is associated with tendency towards the strong-vortex state $A$, and larger poleward eddy heat flux is associated with tendency toward the weak-vortex state $B$. In addition, curves like this assess the quality of single-variable observables as proxies for an oncoming transition event. The committor and passage time vary smoothly and (mostly) monotonically with $U$, but discontinuously with $v'T'$: the heat flux burst that accompanies an SSW gives no advance warning for the event, while a small negative change in $U$ indicates incrementally higher transition probability and shorter lead time.

In the highly idealized Holton-Mass model, the EP flux divergence has two alternative expressions: as the eddy PV flux, $\rho_s^{-1} \nabla \cdot \boldsymbol{F} = \overline{v'q'}$, or equivalently in terms of the eddy heat flux $v'T'$. If there were no dissipation ($\alpha = 0$) and the background zonal state were time-independent ($\partial_t \overline{q} = 0$), Equation (21) would express local conservation of wave activity $\mathcal{A} = \rho_s \overline{q'^2}/(2\, \partial_y \overline{q})$. Neither of these is true in the stochastic Holton-Mass model, so we use the quantities in Equation (21) as diagnostics: enstrophy $\overline{q'^2}$, PV gradient $\partial_y \overline{q}$, PV flux $\overline{v'q'}$, and heat flux $\overline{v'T'}$. Each field is a function of $(y, z)$ and takes on very different profiles in $A$ and $B$, as found by Yoden (1987b).
A transition from $A$ to $B$, where the vortex weakens dramatically, must entail a reduction in $\partial_y \overline{q}$ and a burst in positive $v'T'$ and negative $\overline{v'q'}$ as a Rossby wave propagates from the tropopause vertically up through the stratosphere. This is the general physical narrative of a sudden warming event, and these same fields might be expected to be useful observables to track for qualitative understanding and prediction, along with the basic state variables $U$ and $|\Psi|$. One option is to take vertical averages of any of these fields, but there may be particularly salient altitude levels that clarify the role of vertical interactions. The first three rows of Figure 6 display, for three of these fields ($U$, $|\Psi|$ and $v'T'$) and for a range of altitude levels, the mean and standard deviation of the committor projected onto that field at that altitude. Each altitude has a different range of the CV; for example, because $U$ has a Dirichlet condition at the bottom and a Neumann condition at the top, the lower levels have a much smaller range of variability than the high levels. We also plot the integrated variance, or $L^2$ projection error, at each level in the right-hand column. A low projected committor variance over $U$ at altitude $z$ means that the committor is mostly determined by the single observable $U(z)$, while a high projected variance indicates significant dependence of $q^+$ on variables other than $U(z)$. In order to compare different altitudes and fields as directly as possible, the $L^2$ projection error at each altitude is an average over discrete bins of the observable, not a proper integral. In selecting good CVs, we generally look for a simple, hopefully monotonic, and sensitive relationship with the committor. Of all the candidate fields, $U$ and $\partial_y \overline{q}$ stand out the most in this respect, being clearly negatively correlated with the forward committor at all altitudes.
The associated projection error tends to be greatest in the region $q^+ \approx 0.5$, as observed before, but interestingly there is a small altitude band around 20-25 km where its magnitude is minimized. This suggests an optimal altitude for monitoring the committor through zonal wind, giving the most reliable estimate possible for a single state variable. In contrast, the projection of $q^+$ onto $|\Psi|$ displays a large variance across all altitudes. The eddy heat flux is also rather unhelpful as an early warning sign, despite its central role in SSW evolution, which is consistent with Figure 5. For example, the large, positive spikes in heat flux across all altitudes generally occur only after the committor has already exceeded $\approx 0.5$, and the relationship of $v'T'$ with the committor is not smooth, leaving the region $q^+ < 0.5$ with a large $L^2$ projection error despite the normalization mentioned above. Furthermore, restricting to one variable at a time is limiting. Accordingly, in a second, more automated approach to identify salient variables, we perform a sparsity-promoting LASSO regression for the forward committor (Tibshirani 1996; Pedregosa et al. 2011), using as input features all state variables $U$, $\mathrm{Re}\,\Psi$, $\mathrm{Im}\,\Psi$ and their vertical derivatives. We leave out eddy fluxes, which seem to have poor prediction properties. The advantage of a sparsity-promoting regression is to isolate a small number of observables that can decently approximate the committor in linear combination. Considering that regions close to $A$ and $B$ have low committor uncertainty, we regress only on data points with $q^+$ in an intermediate range, and of those only a subset weighted by the reactive probability density $q^+ q^- \pi$, since we wish to isolate the dominant transition pathways. To keep predicted committors between zero and one, we regress on the logit-transformed committor $\ln(q^+/(1-q^+))$. First we do this at each altitude separately, and in Figure 7(a) we plot the coefficients of each component as a function of altitude. Each component is salient for some altitude range. In general, $U$ and $U_z$ dominate as predictors at low altitudes, while $\Psi$ and $\Psi_z$ dominate at high altitudes.
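The sparse-regression step can be sketched as follows: fit a LASSO to the logit-transformed committor and read off the support. The synthetic data below, in which only three of 75 "state variables" actually drive the committor, and the regularization strength are our own illustrative choices, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
# Synthetic stand-in: 75 'state variables', of which only a few determine the
# (logit of the) committor, mimicking the sparse structure sought in Figure 7.
N, d = 5000, 75
X = rng.standard_normal((N, d))
logit_q = 1.5 * X[:, 10] - 1.0 * X[:, 30] + 0.5 * X[:, 55]
q = 1.0 / (1.0 + np.exp(-logit_q))

# Keep only uncertain points, echoing the paper's restriction to intermediate q+
mask = (q > 0.2) & (q < 0.8)
y = np.log(q[mask] / (1.0 - q[mask]))      # logit transform keeps predictions in (0,1)
model = Lasso(alpha=0.01, max_iter=5000).fit(X[mask], y)
support = np.flatnonzero(np.abs(model.coef_) > 1e-3)
```

The L1 penalty zeros out the irrelevant coordinates, so `support` recovers the few genuinely predictive variables, analogous to the sparse coefficient structure of Figure 7(c).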
The overall prediction quality, as measured by $R^2$ and plotted in Figure 7(b), is greatest around 21.5 km, consistent with our qualitative observations of Figure 6. Note that not all single-altitude slices are sufficient for approximating the committor, even with LASSO regression; in the altitude band above 50 km, the LASSO predictor is not monotonic and has a large projected variance. The specific altitude can matter a great deal. But by using all altitudes at once, the committor approximation may be improved further. We thus repeat the LASSO with all altitudes simultaneously and find the sparse coefficient structure shown in Figure 7(c), with a few variables contributing the most: $U$(21.5 km), $U$(29.6 km), $\mathrm{Re}\,\Psi_z$(13.5 km), and $\mathrm{Im}\,\Psi$(21.5 km). The results of LASSO regression are also displayed in the bottom row of Figure 4, the right column of Figure 5, and the bottom row of Figure 6 for direct comparison with the other candidate fields. With multiple lines of evidence indicating 21.5 km as an altitude with high predictive value for the forward committor, we can make a strong recommendation for targeting observations there. This conclusion applies only to the Holton-Mass model under these parameters, but the methodology explained above can be applied similarly to models of arbitrary complexity.

c. Relationship to lead time

A skillful forecast is only useful if it comes early and leaves some buffer time before impact. Having identified the committor as optimally skillful among all observables, we can now assess the limits of early prediction by relating certainty levels and lead times. Such a relationship would answer two dual questions: during the transition to an SSW winter phase, (1) how far in advance will we be aware of it with some prescribed confidence, say 80%? (2) given some prescribed lead time, say 42 days, how aware or in the dark could we be of it? These questions clearly involve some kind of first-passage time, like the curves in the bottom row of Figure 5.
The same quantity has been calculated previously in other simplified models, e.g., Birner and Williams (2008) and Esler and Mester (2019). But E[𝜏_B] has an obvious shortcoming. From Figure 5, we see that when 𝑞⁺ ≈ 0.5, E[𝜏_B] averages over two very different kinds of paths: roughly half proceeding directly to 𝐵, and the other half returning to 𝐴 and lingering there before eventually crossing into 𝐵. The conditional passage time E[𝜏_B | 𝜏_B < 𝜏_A] is designed to remove the contribution of the latter half and measure the mean time of paths going directly to 𝐵; it can be computed by DGA using a Laplace transform as described in Subsection 3(b). Figure 8 shows all three quantities—the forward committor, mean passage time to 𝐵, and conditional passage time to 𝐵—this time projected on a two-dimensional observable space (Im Ψ(21.5 km), 𝑈(21.5 km)) identified as salient by sparse regression. Physically, these levels operate as a valve regulating wave propagation into the stratosphere. The committor has a clear negative relationship with both conditional and unconditional first passage time: as the risk of imminent SSW grows, the time until impact shrinks. Figure 9 shows this relationship more quantitatively, for both the 𝐴 → 𝐵 process (panel (a)) and the 𝐵 → 𝐴 process (panel (b)). The relationship is roughly quantified by a least-squares regression, weighted by the change of measure, between the SSW probability 𝑞⁺ (resp. the restoration probability 1 − 𝑞⁺) and the conditional lead time to the SSW event E[𝜏_B | 𝜏_B < 𝜏_A] (resp. the conditional lead time to vortex restoration, E[𝜏_A | 𝜏_A < 𝜏_B]). While the relationships are nonlinear and the spreads significant, the linear fits offer two meaningful numerical insights. The vertical intercept says how long the next excursion to a given state will take when the system starts trapped in the other state. The negative slope says how fast the remaining time shrinks as the risk grows. The vertical intercepts of 79 days and 107 days offer further evidence that the vortex breaks down faster than it restores.

These metrics can inform preparation for extreme weather. For example, a threatened community might decide in advance on an "alarm threshold" of, say, 50%, meaning they plan to prepare for an SSW event only once it is 50% certain to occur. According to the linear fit in panel (a), they must be ready to do so in ∼48 days' time. The nonlinear deviations are, however, significant. The spread around the linear fit increases suddenly towards the lower-right corner of the plot, meaning that the uncertainty in timing, viewed as a function of the committor, increases as the SSW certainty increases. Lead time must therefore depend strongly on more than just the forward committor, and must be estimated by taking more details of the current state into account. We emphasize that the choice of 𝐴, 𝐵, and alarm thresholds is more of a community and policy decision than a scientific one. The strength of our approach is that it provides a flexible numerical framework to quantify and optimize the consequences of those decisions.

Fig. 6. Projection of the forward committor onto a large collection of one-dimensional CVs, along with the associated standard deviation, or projection error, of the committor along the remaining 74 model dimensions. Consider the first two panels. The left-hand panel shows, for each discretized altitude 𝑧, a heatmap of the committor as 𝑈(𝑧) ranges from its minimum to its maximum realized strength at that altitude. At the bottom is an additional heatmap of the committor as a function of ⟨𝑈⟩_𝑧, the vertical average. These are conditional expectations, with the corresponding conditional standard deviations displayed in the right-hand panels. The following rows display analogous plots for wave magnitude, eddy PV flux, background PV gradient, eddy heat flux, and the LASSO predictor specified in Figure 7. The bottom line on the last plot is not a vertical average, but the result of regression on all altitudes at once.

Fig. 7. Results of LASSO regression of the forward committor with 𝑈, Re Ψ, Im Ψ as input features. Panel (a) shows the coefficients when 𝑞⁺ is regressed as a function of only the variables at a given altitude, and panel (b) shows the corresponding correlation score; 21.5 km appears the most predictive.

Fig. 8. Two-dimensional projections of the committor and mean first passage times. We have projected three quantities onto the observable subspace of zonal wind and imaginary part of streamfunction at 21.5 km: (a) forward committor 𝑞⁺(x) = P_x{𝜏_B < 𝜏_A}, (b) first passage time to 𝐵, E[𝜏_B], and (c) conditional mean first passage time to 𝐵, E[𝜏_B | 𝜏_B < 𝜏_A]. The condition 𝜏_B < 𝜏_A decreases the passage time by an order of magnitude, because it excludes the possibility of getting trapped in 𝐴 first. Figure 9 quantifies the relationship between the committor and conditional passage time, and its forecasting implications.

Fig. 9. Relationship between committor and mean first passage time. Panel (a) shows the relationship between 𝑞⁺ (probability of next hitting 𝐵) and E[𝜏_B | 𝜏_B < 𝜏_A], the time until hitting 𝐵 conditional on avoiding 𝐴. These quantities correspond to panels (a) and (c) of Figure 8. Panel (b) shows the same relationship but in the 𝐵 → 𝐴 direction. In both cases, we performed a least-squares regression weighted by the change of measure. A +0.1 increase in the probability 𝑞⁺ of next hitting 𝐵 comes with a 6.3-day decrease in the expected time to get there, whereas a +0.1 increase in 1 − 𝑞⁺ comes with a 9.8-day reduction in the time to reach 𝐴. Meanwhile, the vertical intercepts indicate the mean time of a full transition from 𝐴 → 𝐵 (79 days) and 𝐵 → 𝐴 (170 days).
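As a toy illustration of reading an expected lead time off such a weighted linear fit, the synthetic data below are generated to mimic the quoted fit for the 𝐴 → 𝐵 direction (intercept 79 days; a 6.3-day decrease per +0.1 of 𝑞⁺), with random weights standing in for the change of measure. The numbers are illustrative, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic (committor, conditional lead time) pairs: lead ~ 79 - 63 * q plus
# scatter, so each +0.1 of SSW probability removes ~6.3 days of lead time.
q = rng.uniform(0.05, 0.95, size=400)
lead = 79.0 - 63.0 * q + rng.normal(scale=8.0, size=400)
w = rng.uniform(0.2, 1.0, size=400)        # stand-in for the change of measure

slope, intercept = np.polyfit(q, lead, 1, w=w)
alarm = 0.5                                # community-chosen alarm threshold
print(round(intercept + slope * alarm, 1)) # expected lead time at 50% certainty
```

Reading the fit at the 50% threshold recovers a lead time near 48 days, matching the back-of-envelope arithmetic in the text; the large scatter around the line is exactly the timing uncertainty discussed above.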
6. Conclusion
Forecasting rare events is, by the very nature of rare events, an extremely difficult computational task. Given the dangers posed by climate change, it is also one of science's most pressing challenges. We suggest a computational framework that uses relatively short model simulations to make predictions on much longer time scales. Our numerical results point to its promise for forecasting. Within the context of a stochastically forced Holton-Mass model with 75 degrees of freedom, we have computed fundamental quantities of the SSW transition process, including committor probabilities and expected lead times, for both the vortex destruction and vortex restoration processes. The system is irreversible, making these two directions statistically very distinct from each other. By systematically evaluating many model variables for their utility in predicting the fate of the vortex, we have identified some salient physical descriptions of early warning signs. We have furthermore quantified the relationship between probability and lead time for a given rare event, a potentially useful paradigm for assessing predictability and preparing for extreme weather. Our results suggest that the slow evolution of vortex preconditioning is an important source of predictability. In particular, the zonal wind and streamfunction at 20 km seem to be optimal among a large class of dynamically motivated observables. The committor and mean first passage time have obvious utility for forecasting, but they are also ingredients in a larger framework called transition path theory (TPT) for describing rare steady-state transition events. In principle, interrogating the ensemble of transition paths requires direct simulation of the system long enough to observe many transition events. However, using TPT, quantities computable by our framework can be combined to yield key statistics describing the ensemble of transition paths connecting regions in state space (Finkel et al. 2020; Metzner et al.
2006, 2009; Vanden-Eijnden and E 2010; E and Vanden-Eijnden 2006). In a following paper we will apply the same short-trajectory forecasting approach together with TPT to compute transition path statistics such as return times, and to extract insight about the physical mechanisms of the transition process. Our numerical pipeline is promising and robust, but leaves plenty of room for improvement. Our sampling method, while advantageous for validating results, wastes a great deal of data. Targeted sampling from the transition region has the potential to achieve the same precision for the quantities of interest with much less data. Also, moving beyond a basis expansion of the forecast functions, in upcoming work we will explore more flexible representations using kernel methods and neural networks. The solution of high-dimensional PDEs is an active research area that is making innovative use of neural network approximations, particularly in the fields of computational chemistry and quantum mechanics (e.g., Carleo and Troyer 2017; Han et al. 2018; Khoo et al. 2018; Li et al. 2020; Mardt et al. 2018; Li et al. 2019; Lorpaiboon et al. 2020). Similar approaches may hold great potential for understanding predictability in atmospheric science.
Acknowledgments.
J.F. is supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-SC0019323. R.J.W. is supported by New York University's Dean's Dissertation Fellowship and by the Research Training Group in Modeling and Simulation funded by the National Science Foundation via grant RTG/DMS-1646339. E.P.G. acknowledges support from the U.S. National Science Foundation through grant AGS-1852727. This work was partially supported by the NASA Astrobiology Program, grant No. 80NSSC18K0829, and benefited from participation in the NASA Nexus for Exoplanet Systems Science research coordination network. J.W. acknowledges support from the Advanced Scientific Computing Research Program within the DOE Office of Science through award DE-SC0020427. The computations in the paper were done on the high-performance computing clusters at New York University and the Research Computing Center at the University of Chicago. We thank John Strahan, Aaron Dinner, and Chatipat Lorpaiboon for helpful methodological advice. Mary Silber, Noboru Nakamura, and Richard Kleeman offered invaluable scientific insight. J.F. benefitted from many helpful discussions with Anya Katsevich.
Data availability statement.
The stochastic Holton-Mass model and analysis techniques are fully described in the text and references, and can be integrated quickly at low computational cost. J.F. is happy to share code and data upon request.

APPENDIX A
Holton-Mass Model
The Holton-Mass model describes a wave–mean flow interaction, which we review here following Christiansen (2000). The "wave" part is governed by the zonally averaged, linearized quasigeostrophic (QG) potential vorticity (PV) equation on a 𝛽-plane channel from 30°N to 90°N. The 𝑧 coordinate is log-pressure: 𝑧 = −𝐻 ln(𝑝/𝑝_s), where 𝐻 = 7 km is a scale height and 𝑝_s is a reference pressure. A standard density profile 𝜌_𝑠(𝑧) = 𝜌_0 𝑒^{−𝑧/𝐻} is used. Schematically,

∂𝑞′/∂𝑡 + ū ∂𝑞′/∂𝑥 + (∂q̄/∂𝑦) 𝑣′ = sources − sinks, (A1)

and with Newtonian cooling as the only source/sink,

(∂/∂𝑡 + ū ∂/∂𝑥) 𝑞′ + (∂q̄/∂𝑦) ∂𝜓′/∂𝑥 = −(𝑓₀²/𝑁²)(1/𝜌_𝑠) ∂/∂𝑧 (𝛼 𝜌_𝑠 ∂𝜓′/∂𝑧), (A2)

where

𝑞 ≡ 𝑓 + 𝛽𝑦 + ∇²𝜓 + (𝑓₀²/𝑁²)(1/𝜌_𝑠) ∂/∂𝑧 (𝜌_𝑠 ∂𝜓/∂𝑧) (A3)

and

(𝑢, 𝑣) = (−∂𝜓/∂𝑦, ∂𝜓/∂𝑥). (A4)

The right-hand side is a parameterized Newtonian cooling, with a height-dependent cooling coefficient

𝛼(𝑧) = [1.5 + tanh((𝑧/1000 − 25)/7)] × 10⁻⁶ s⁻¹, (A5)

with 𝑧 measured in meters. Coordinates are set so that 𝑧_bot = 0, and 𝑧_top = 70,000 m is the thickness of the stratospheric domain. The tropospheric wave forcing is encoded through a nonzero Dirichlet condition on 𝜓′ at the lower boundary (to be specified shortly), while 𝜓′ vanishes at the upper boundary. Here (‾) and ()′ represent the zonal mean and the departure from it, and 𝑁² is the buoyancy frequency. Taking zonal means and combining the definitions of 𝑢 and 𝑞, the mean meridional PV gradient and the PV perturbation are given by

𝛽_𝑒 ≡ ∂q̄/∂𝑦 = 𝛽 + ∂²/∂𝑦² (∂𝜓̄/∂𝑦) + (𝑓₀²/𝑁²)(1/𝜌_𝑠) ∂/∂𝑧 (𝜌_𝑠 ∂/∂𝑧 ∂𝜓̄/∂𝑦) = 𝛽 − ∂²ū/∂𝑦² − (𝑓₀²/𝑁²)(1/𝜌_𝑠) ∂/∂𝑧 (𝜌_𝑠 ∂ū/∂𝑧), (A6)

𝑞′ = ∇²𝜓′ + (𝑓₀²/𝑁²)(1/𝜌_𝑠) ∂/∂𝑧 (𝜌_𝑠 ∂𝜓′/∂𝑧). (A7)

It remains to specify the mean flow upon which the PV waves propagate. This is given by the zonally averaged primitive equations on the same domain, written initially in terms of the geopotential Φ = 𝑓₀𝜓:

∂ū/∂𝑡 − 𝑓 v̄ = 0, (A8)
𝑓 ū = −∂Φ̄/∂𝑦, (A9)
∂/∂𝑡 (∂Φ̄/∂𝑧) + 𝑁² w̄ = −𝛼 (∂Φ̄/∂𝑧 − 𝑅𝑇*/𝐻) − ∂/∂𝑦 (𝑣′ ∂Φ′/∂𝑧)‾, (A10)
∂v̄/∂𝑦 + (1/𝜌_𝑠) ∂/∂𝑧 (𝜌_𝑠 w̄) = 0. (A11)

Here ∂Φ̄/∂𝑧 − 𝑅𝑇*/𝐻 is a departure from radiative equilibrium or, equivalently, from thermal wind balance with the radiative-equilibrium temperature field; it is multiplied by the Newtonian cooling coefficient described above. The last term, −∂_𝑦(𝑣′∂_𝑧Φ′)‾, is the meridional eddy heat flux convergence, which carries the wave disturbances that destabilize the vortex. Eliminating v̄, w̄, and Φ̄ in favor of ū, Holton and Mass formed the single "prediction equation"

∂/∂𝑡 [ (𝑓₀²/𝑁²)(1/𝜌_𝑠) ∂/∂𝑧 (𝜌_𝑠 ∂ū/∂𝑧) + ∂²ū/∂𝑦² ] = −(𝑓₀²/𝑁²)(1/𝜌_𝑠) ∂/∂𝑧 [ 𝛼𝜌_𝑠 (∂ū/∂𝑧 − ∂𝑢_𝑅/∂𝑧) ] + ∂²/∂𝑦² [ (𝑓₀²/𝑁²)(1/𝜌_𝑠) ∂/∂𝑧 (𝜌_𝑠 𝑣′ ∂𝜓′/∂𝑧‾) ]. (A12)

The radiative temperature distribution now appears on the right implicitly through the thermal wind relation with a constant-shear zonal wind field 𝑢_𝑅: ∂_𝑧 𝑢_𝑅 = −(𝑅/(𝑓₀𝐻)) ∂_𝑦 𝑇*.
ū obeys a Dirichlet condition ū = 𝑢_𝑅 at the lower boundary, and a Neumann condition ∂_𝑧ū = ∂_𝑧𝑢_𝑅 at the upper boundary. In a dramatic simplification, Holton and Mass (1976) projected the time-dependent fields ū(𝑦, 𝑧, 𝑡) and 𝜓′(𝑥, 𝑦, 𝑧, 𝑡) onto a single zonal wavenumber 𝑘 = 2/(𝑎 cos 60°) and a single meridional wavenumber ℓ = 3/𝑎, where 𝑎 is the Earth's radius:

ū(𝑦, 𝑧, 𝑡) = 𝑈(𝑧, 𝑡) sin(ℓ𝑦), (A13)
𝜓′(𝑥, 𝑦, 𝑧, 𝑡) = Re{Ψ(𝑧, 𝑡) 𝑒^{𝑖𝑘𝑥}} 𝑒^{𝑧/2𝐻} sin(ℓ𝑦). (A14)

Substituting this ansatz yields two 𝑧- and 𝑡-dependent PDEs: one for 𝑈(𝑧, 𝑡) and one for Ψ(𝑧, 𝑡) (which is a complex function, so equivalent to two real PDEs):

(∂_𝑡 + 𝑖𝑘𝜀𝑈) [ −(𝑘² + ℓ²) + (𝑓₀²/𝑁²)(∂_𝑧² − 1/(4𝐻²)) ] Ψ + 𝑖𝑘 𝛽′_𝑒 Ψ + (𝑓₀²/𝑁²)(∂_𝑧 − 1/(2𝐻)) [ 𝛼 (∂_𝑧 + 1/(2𝐻)) ] Ψ = 0, (A15)

∂_𝑡 [ (𝑓₀²/𝑁²)(𝑈_𝑧𝑧 − 𝑈_𝑧/𝐻) − ℓ²𝑈 ] = −(𝑓₀²/𝑁²) 𝑒^{𝑧/𝐻} ∂_𝑧 [ 𝛼 𝑒^{−𝑧/𝐻} ∂_𝑧 (𝑈 − 𝑈_𝑅) ] + (𝑓₀²/𝑁²) 𝜀𝑘ℓ² 𝑒^{𝑧/𝐻} Im{Ψ Ψ*_𝑧𝑧}, (A16)

where 𝛽′_𝑒 := 𝛽 + 𝜀ℓ²𝑈 − (𝑓₀²/𝑁²) 𝜀 (𝑈_𝑧𝑧 − 𝑈_𝑧/𝐻) and 𝜀 = 8/(3𝜋) is the coefficient arising from the meridional projection, accompanied by the boundary conditions

Ψ(0, 𝑡) = 𝑔ℎ/𝑓₀, 𝑈(0, 𝑡) = 𝑈_𝑅(0), Ψ(𝑧_top, 𝑡) = 0, ∂_𝑧𝑈(𝑧_top, 𝑡) = ∂_𝑧𝑈_𝑅(𝑧_top),

where ℎ is the amplitude of the tropospheric wave forcing. For our data analysis purposes, it is convenient to work in nondimensionalized coordinates, so that 𝑈 and Ψ have similar magnitudes. We therefore introduce a horizontal length scale 𝐿, the vertical scale 𝐻, and a time scale 𝑇, and set

𝑡 = 𝑇 𝑡̃, (𝑘, ℓ) = (𝑘̃, ℓ̃)/𝐿, 𝑧 = 𝐻 𝑧̃, 𝑈 = (𝐿/𝑇) 𝑈̃, Ψ = (𝐿²/𝑇) Ψ̃, 𝛼 = 𝛼̃/𝑇, 𝛽 = 𝛽̃/(𝐿𝑇).

Dropping tildes, the nondimensional equations become

[ ∂²/∂𝑧² − (G(𝑘² + ℓ²) + ¼) ] ∂Ψ/∂𝑡 = [ (𝛼/4 − 𝛼_𝑧/2 − 𝑖G𝑘𝛽) − 𝛼_𝑧 ∂/∂𝑧 − 𝛼 ∂²/∂𝑧² ] Ψ + 𝑖𝑘𝜀 { [ (G𝑘² + ¼) − ∂/∂𝑧 + ∂²/∂𝑧² ] 𝑈 } Ψ − 𝑖𝑘𝜀 𝑈 ∂²Ψ/∂𝑧², (A17)

( Gℓ² + ∂/∂𝑧 − ∂²/∂𝑧² ) ∂𝑈/∂𝑡 = −[ (𝛼_𝑧 − 𝛼) 𝑈_{𝑅𝑧} + 𝛼 𝑈_{𝑅𝑧𝑧} ] + [ (𝛼_𝑧 − 𝛼) ∂/∂𝑧 + 𝛼 ∂²/∂𝑧² ] 𝑈 − 𝜀𝑘ℓ² 𝑒^{𝑧} Im{ Ψ ∂²Ψ*/∂𝑧² }, (A18)

with the nondimensional parameter G = 𝐻²𝑁²/(𝑓₀²𝐿²).

APPENDIX B

Feynman-Kac formulae and dynamical Galerkin approximation
In this section we spell out the DGA procedure in more detail than in the main text, explaining the variants that deliver the more intricate conditional expectations. The theoretical background can be found in, e.g., Karatzas and Shreve (1998), Oksendal (2003), and Weinan et al. (2019). Let X(𝑡) be a time-homogeneous stochastic process with continuous sample paths in R^𝑑. Associated to this process is the infinitesimal generator, L, which acts on observable functions by evolving their expectations forward in time:

L𝑓(x) = lim_{ℎ→0} ( E_x[𝑓(X(ℎ))] − 𝑓(x) ) / ℎ = lim_{ℎ→0} ( T^ℎ𝑓(x) − 𝑓(x) ) / ℎ, (B1)

where E_x[·] := E[· | X(0) = x]. It can be shown that under the above assumptions on X, the Itô chain rule gives

𝑑𝑓(X(𝑡)) = L𝑓(X(𝑡)) 𝑑𝑡 + 𝑑𝑀(𝑡), (B2)

where 𝑀(𝑡) is a martingale. More concretely, in this paper X(𝑡) is an Itô diffusion obeying the stochastic differential equation

X(𝑡) = X(0) + ∫₀^𝑡 𝑏(X(𝑠)) 𝑑𝑠 + ∫₀^𝑡 𝜎(X(𝑠)) 𝑑W(𝑠), (B3)

with infinitesimal generator and martingale terms

L𝑓(x) = Σ_{𝑖=1}^𝑑 𝑏_𝑖(x) ∂𝑓(x)/∂x_𝑖 + ½ Σ_{𝑖=1}^𝑑 Σ_{𝑗=1}^𝑑 [𝜎(x)𝜎(x)ᵀ]_{𝑖𝑗} ∂²𝑓(x)/∂x_𝑖∂x_𝑗, (B4)

𝑑𝑀(𝑡) = Σ_{𝑖,𝑗} (∂𝑓(x)/∂x_𝑖) 𝜎_{𝑖𝑗}(x) 𝑑W_𝑗(𝑡). (B5)

Key quantities in TPT can be obtained by solving operator equations involving the transfer operator T^ℎ, which we now lay out in detail, roughly following Weinan et al. (2019).

a. Feynman-Kac formula

The Feynman-Kac formula represents certain expectations over stochastic trajectories as solutions to PDE problems over state space involving the infinitesimal generator L. Let 𝐷 be a domain in R^𝑑 (for example, (𝐴 ∪ 𝐵)^c) and 𝜏_{𝐷^c} be the first exit time from this domain starting at time zero. This is a random variable which depends on the starting condition x ∈ 𝐷. Let 𝐺: ∂𝐷 → R be a boundary condition, 𝐻: 𝐷 → R a source term, and 𝑉: 𝐷 → R a so-called potential term.
We seek a PDE for the following conditional expectation:

𝐹(x) = E_x[ 𝐺(X(𝜏_{𝐷^c})) 𝑒^{−∫₀^{𝜏_{𝐷^c}} 𝑉(X(𝑠)) 𝑑𝑠} + ∫₀^{𝜏_{𝐷^c}} 𝐻(X(𝑠)) 𝑒^{−∫₀^𝑠 𝑉(X(𝑟)) 𝑑𝑟} 𝑑𝑠 ], (B6)

where E_x[·] = E[· | X(0) = x]. To derive the PDE, consider the following stochastic process:

𝑍(𝑡) = 𝐹(X(𝑡)) 𝑌(𝑡) + ∫₀^𝑡 𝐻(X(𝑠)) 𝑌(𝑠) 𝑑𝑠, (B7)

where 𝑌(𝑡) := exp(−∫₀^𝑡 𝑉(X(𝑠)) 𝑑𝑠). Itô's lemma gives us that 𝑑𝑌(𝑡) = −𝑉(X(𝑡)) 𝑌(𝑡) 𝑑𝑡. Hence, applying the product rule to 𝑍(𝑡),

𝑑𝑍(𝑡) = 𝑑𝐹(X(𝑡)) 𝑌(𝑡) + 𝐹(X(𝑡)) 𝑑𝑌(𝑡) + 𝐻(X(𝑡)) 𝑌(𝑡) 𝑑𝑡 (B8)
= L𝐹(X(𝑡)) 𝑌(𝑡) 𝑑𝑡 + 𝑑𝑀(𝑡) 𝑌(𝑡) − 𝐹(X(𝑡)) 𝑉(X(𝑡)) 𝑌(𝑡) 𝑑𝑡 + 𝐻(X(𝑡)) 𝑌(𝑡) 𝑑𝑡 (B9)
= [L𝐹 − 𝑉𝐹 + 𝐻](X(𝑡)) 𝑌(𝑡) 𝑑𝑡 + 𝑌(𝑡) 𝑑𝑀(𝑡), (B10)

where in (B8) we have left out the quadratic cross-variation of 𝐹(X(𝑡)) and 𝑌(𝑡) because 𝑌 has finite variation. If the bracketed term satisfies (L − 𝑉(x))𝐹(x) + 𝐻(x) = 0 for all x ∈ 𝐷, then 𝑍(𝑡) is a martingale and it follows that

𝑍(0) = E_x[𝑍(𝑡)], (B11)
𝐹(x) = E_x[ 𝐹(X(𝑡)) 𝑒^{−∫₀^𝑡 𝑉(X(𝑠)) 𝑑𝑠} + ∫₀^𝑡 𝐻(X(𝑠)) 𝑒^{−∫₀^𝑠 𝑉(X(𝑟)) 𝑑𝑟} 𝑑𝑠 ]. (B12)

Finally, we are allowed to substitute a stopping time for 𝑡. By choosing 𝑡 = 𝜏_{𝐷^c}, the first exit time from 𝐷, the 𝐹(X(𝑡)) inside the brackets becomes its boundary value 𝐺(X(𝜏_{𝐷^c})). Thus 𝐹(x) as defined in (B6) also solves the PDE boundary value problem

−(L − 𝑉(x)) 𝐹(x) = 𝐻(x), x ∈ 𝐷; 𝐹(x) = 𝐺(x), x ∈ ∂𝐷. (B13)

Conditional expectations of the form (B6) deliver a variety of quantities relevant to forecasting. For example, if 𝐹(x) is the committor 𝑞⁺(x), we set 𝐷 = (𝐴 ∪ 𝐵)^c, 𝐺 = 1_𝐵, and 𝐻 = 𝑉 = 0. If 𝐹(x) is the mean first passage time to 𝐵, we set 𝐷 = 𝐵^c, 𝐺 = 𝑉 = 0, and 𝐻 = 1. A more complex boundary value problem combines avoiding one set and hitting another. For example, Section 3(a) of the main paper is concerned with the expected time to 𝐵 conditional on avoiding 𝐴 on the way there, or symbolically

E_x[𝜏_𝐵 | 𝜏_𝐵 < 𝜏_𝐴] = E_x[𝜏_𝐵 1{𝜏_𝐵 < 𝜏_𝐴}] / P_x{𝜏_𝐵 < 𝜏_𝐴}. (B14)

The denominator is simply 𝑞⁺(x), which has already been computed, but the numerator is not directly expressible as a conditional expectation of the form in Equation (B6). However, we can get there with a moment generating function. Defining 𝐺 = 1_𝐵, 𝑉 = −𝜆 (a constant), and 𝐻 = 0,

𝐹(x; 𝜆) = E_x[ 1_𝐵(X(𝜏_{𝐷^c})) exp(𝜆𝜏_{𝐷^c}) ], (B15)
∂𝐹/∂𝜆 (x; 0) = E_x[ 𝜏_{𝐷^c} 1_𝐵(X(𝜏_{𝐷^c})) ]. (B16)

In practice we compute 𝐹(x; 𝜆) for two small values of 𝜆 near zero and take a finite difference. Higher-order moments of the passage time distribution can be found the same way.

To implement DGA and solve for 𝐹 with short trajectories, we run the dynamics forward for a short time Δ𝑡, and halt them if they reach 𝐷^c in that time. Dynkin's formula (Oksendal 2003) states how 𝐹(X(𝑡)) changes on average:

E_x[𝐹(X(Δ𝑡 ∧ 𝜏_{𝐷^c}))] = 𝐹(x) + E_x[ ∫₀^{Δ𝑡 ∧ 𝜏_{𝐷^c}} L𝐹(X(𝑡)) 𝑑𝑡 ], (B17)

so that, substituting L𝐹 = 𝑉𝐹 − 𝐻 from (B13),

E_x[𝐹(X(Δ𝑡 ∧ 𝜏_{𝐷^c}))] − 𝐹(x) = E_x[ ∫₀^{Δ𝑡 ∧ 𝜏_{𝐷^c}} (𝑉𝐹 − 𝐻)(X(𝑡)) 𝑑𝑡 ], (B18)

(T^{Δ𝑡}_{𝐷^c} − 1) 𝐹(x) = I^{Δ𝑡}_{𝐷^c}[𝑉𝐹 − 𝐻](x), (B19)

where we have introduced the stopped transfer operator T^{Δ𝑡}_{𝐷^c} and the integral operator I^{Δ𝑡}_{𝐷^c} for succinct notation. This operator equation is linear, leading naturally to a Galerkin expansion. For the purpose of having homogeneous basis functions, we start by homogenizing the boundary conditions with a guess function 𝐹̂(x) that obeys 𝐹̂|_{∂𝐷} = 𝐺, so 𝐹 = 𝐹̂ + 𝑣 where 𝑣|_{∂𝐷} = 0. Inserting 𝐹 = 𝐹̂ + 𝑣 into Equation (B19) and moving all the 𝑣's to the left, we get a linear equation for 𝑣:

(T^{Δ𝑡}_{𝐷^c} − 1)(𝐹̂ + 𝑣)(x) = I^{Δ𝑡}_{𝐷^c}[𝑉(𝐹̂ + 𝑣) − 𝐻](x), (B20)
(T^{Δ𝑡}_{𝐷^c} − 1) 𝑣(x) − I^{Δ𝑡}_{𝐷^c}[𝑉𝑣](x) = −(T^{Δ𝑡}_{𝐷^c} − 1) 𝐹̂(x) + I^{Δ𝑡}_{𝐷^c}[𝑉𝐹̂ − 𝐻](x). (B21)

We expand 𝑣 in a homogeneous basis, 𝑣(x) = Σ_{𝑗=1}^𝑀 𝑐_𝑗(𝑣) 𝜙_𝑗(x), and take an inner product of both sides with 𝜙_𝑖, with respect to an arbitrary measure 𝜆:

Σ_{𝑗=1}^𝑀 𝑐_𝑗(𝑣) ⟨ 𝜙_𝑖, (T^{Δ𝑡}_{𝐷^c} − 1) 𝜙_𝑗 − I^{Δ𝑡}_{𝐷^c}[𝑉𝜙_𝑗] ⟩_𝜆 = −⟨ 𝜙_𝑖, (T^{Δ𝑡}_{𝐷^c} − 1) 𝐹̂ + I^{Δ𝑡}_{𝐷^c}[𝐻 − 𝑉𝐹̂] ⟩_𝜆. (B22)

The only remaining task is to estimate these inner products with Monte Carlo, using a dataset {X_𝑛(𝑡): 0 ≤ 𝑡 ≤ Δ𝑡}_{𝑛=1}^𝑁 of short trajectories and a basis set {𝜙_𝑗(x)}_{𝑗=1}^𝑀 (possibly constructed from the data itself). If X_𝑛(0) is distributed according to a fixed sampling measure 𝜇, we use 𝜆 = 𝜇 above and estimate each inner product as an unweighted sum over trajectories. To wit,

⟨𝜙, 𝜓⟩_𝜇 = ∫_{R^𝑑} 𝜙(x) 𝜓(x) 𝜇(𝑑x) ≈ (1/𝑁) Σ_{𝑛=1}^𝑁 𝜙(X_𝑛(0)) 𝜓(X_𝑛(0)), (B23–B24)

⟨𝜙, T^{Δ𝑡}_{𝐷^c} 𝜓⟩_𝜇 = ∫_{R^𝑑} 𝜙(x) E_x[𝜓(X(Δ𝑡 ∧ 𝜏_{𝐷^c}))] 𝜇(𝑑x) ≈ (1/𝑁) Σ_{𝑛=1}^𝑁 𝜙(X_𝑛(0)) 𝜓(X_𝑛(Δ𝑡 ∧ 𝜏_{𝑛,𝐷^c})), (B25–B26)

⟨𝜙, I^{Δ𝑡}_{𝐷^c} 𝜓⟩_𝜇 = ∫_{R^𝑑} 𝜙(x) E_x[ ∫₀^{Δ𝑡 ∧ 𝜏_{𝐷^c}} 𝜓(X(𝑡)) 𝑑𝑡 ] 𝜇(𝑑x) ≈ (1/𝑁) Σ_{𝑛=1}^𝑁 𝜙(X_𝑛(0)) ∫₀^{Δ𝑡 ∧ 𝜏_{𝑛,𝐷^c}} 𝜓(X_𝑛(𝑡)) 𝑑𝑡. (B27–B28)

We use a lag time of Δ𝑡 = 10 days and sample the short trajectories with a timestep of 5 days, approximating the integrals with the trapezoid rule. Standard linear algebra routines can be used to solve for the coefficients 𝑐_𝑗, giving a Galerkin approximation for 𝑣 and therefore 𝐹. We used the Numpy interface to LAPACK's LU decomposition solver.

b. Change of measure

We now specify how to compute the change of measure from 𝜇 (the sampling distribution) to 𝜋 (the steady-state distribution), using an adjoint version of the Feynman-Kac formula. Each basis function 𝜙_𝑖 has an expectation at time zero with respect to the invariant measure: E_{X(0)∼𝜋}[𝜙_𝑖(X(0))] = ∫ 𝜙_𝑖(x) 𝜋(𝑑x). Evolving the dynamics from 0 to Δ𝑡 induces another expectation: E_{X(0)∼𝜋}[𝜙_𝑖(X(Δ𝑡))] = ∫ T^{Δ𝑡}𝜙_𝑖(x) 𝜋(𝑑x). That 𝜋 is invariant means that these two integrals are equal. Furthermore, with a change of measure they can be rewritten with respect to the sampling measure 𝜇 instead of 𝜋, so

∫ (T^{Δ𝑡} − 1) 𝜙_𝑖(x) (𝑑𝜋/𝑑𝜇)(x) 𝜇(𝑑x) = 0.

The change of measure (𝑑𝜋/𝑑𝜇)(x), which we abbreviate 𝑤(x), is yet another unknown function, which we expand in the basis as 𝑤(x) = Σ_𝑗 𝑐_𝑗(𝑤) 𝜙_𝑗(x). Putting this into the integral and using Monte Carlo, we cast the coefficients 𝑐_𝑗(𝑤) as the solution to a null-eigenvector problem:

0 = ∫ (T^{Δ𝑡} − 1) 𝜙_𝑖(x) Σ_{𝑗=1}^𝑀 𝑐_𝑗(𝑤) 𝜙_𝑗(x) 𝜇(𝑑x) (B29)
≈ Σ_{𝑗=1}^𝑀 𝑐_𝑗(𝑤) (1/𝑁) Σ_{𝑛=1}^𝑁 [𝜙_𝑖(X_𝑛(Δ𝑡)) − 𝜙_𝑖(X_𝑛(0))] 𝜙_𝑗(X_𝑛(0)). (B30)

c. Visualization details

Here we describe how to view a scalar field 𝐹(x), such as the forward committor, as an approximate function of a lower-dimensional space of collective variables (CVs), 𝜽: R^𝑑 → R^𝑘. For example, in Figure 8 of the main text, we visualize the committor and mean first passage times in the CV space 𝜽(x) = (Im Ψ(21.5 km)(x), 𝑈(21.5 km)(x)). Formally, given a function 𝐹: R^𝑑 → R, we seek some 𝑓: R^𝑘 → R such that 𝐹(x) ≈ 𝑓(𝜽(x)). To do this we have to select, and then minimize, some function-space metric between 𝑓 ∘ 𝜽 and 𝐹. The simplest choice is the mean-squared error, so the projection problem is to minimize over functions 𝑓: R^𝑘 → R the penalty

𝑆[𝑓; 𝜽] := ‖𝑓 ∘ 𝜽 − 𝐹‖²_{𝐿²(𝜋)} (B31)
= ∫_{R^𝑑} [𝑓(𝜽(x)) − 𝐹(x)]² 𝜋(𝑑x) (B32)
≈ Σ_{𝑛=1}^𝑁 [𝑓(𝜽(X_𝑛(0))) − 𝐹(X_𝑛(0))]² 𝑤(X_𝑛(0)). (B33)

The optimal 𝑓 for this purpose is the conditional expectation 𝑓(y) = E[𝐹(X) | 𝜽(X) = y], where the expectation is with respect to the invariant measure. We can derive this intuitive formula by optimizing over 𝑓 in a discrete subset of function space given by indicator functions on disjoint sets. We choose {𝑇_1, ..., 𝑇_𝐿} as a partition of CV space, like a two-dimensional square grid. The projection is written

𝑓(𝜽(x)) = Σ_{ℓ=1}^𝐿 𝑎_ℓ(𝐹) 1_{𝑇_ℓ}(𝜽(x)), (B34)

and we can simply optimize over the 𝑎_ℓ's.
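Stepping back to the short-trajectory Galerkin solve of (B19)–(B22): below is a minimal sketch for the committor (𝑉 = 𝐻 = 0) of a toy double-well diffusion, not the Holton-Mass model. With an indicator-function basis on bins of 𝐷 = (−1, 1), the Galerkin system reduces to a Markov-state-model-like linear solve; the guess function 𝐹̂ = 1_𝐵 handles the boundary condition, and all numerical choices (bin count, lag time, noise level) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Committor by DGA for dX = X(1 - X^2) dt + 0.7 dW, A = {x <= -1}, B = {x >= 1}.
# Basis: indicators of M bins covering D = (-1, 1); guess function Fhat = 1_B.
M, N, dt, n_steps = 20, 20000, 1e-3, 200          # lag time = n_steps * dt = 0.2
edges = np.linspace(-1.0, 1.0, M + 1)

def bin_index(x):
    # -1 outside D, where every (homogeneous) basis function vanishes
    return int(np.searchsorted(edges, x)) - 1 if -1.0 < x < 1.0 else -1

x0 = rng.uniform(-0.999, 0.999, size=N)            # sampling measure mu on D
x = x0.copy()
alive = np.ones(N, dtype=bool)
for _ in range(n_steps):                           # short trajectories, stopped on A u B
    x[alive] += x[alive] * (1.0 - x[alive] ** 2) * dt \
        + 0.7 * np.sqrt(dt) * rng.normal(size=alive.sum())
    alive &= (x > -1.0) & (x < 1.0)

i0 = np.array([bin_index(v) for v in x0])          # basis index at time 0
i1 = np.array([bin_index(v) for v in x])           # basis index at the stopped time
dF = (x >= 1.0).astype(float)                      # Fhat(stop) - Fhat(0); Fhat(0) = 0 in D

# Monte Carlo inner products: A c = b with A_ij ~ <phi_i, (T - 1) phi_j>_mu
# and b_i ~ -<phi_i, (T - 1) Fhat>_mu, as in (B22) with V = H = 0.
A = np.zeros((M, M)); b = np.zeros(M)
for n in range(N):
    if i1[n] >= 0:
        A[i0[n], i1[n]] += 1.0
    A[i0[n], i0[n]] -= 1.0
    b[i0[n]] -= dF[n]
q_bins = np.linalg.solve(A, b)                     # committor estimate in each bin
print(np.round(q_bins[[1, M // 2, M - 2]], 2))     # small near A, middling, large near B
```

No trajectory here is anywhere near as long as a full transition; the linear solve stitches the short-trajectory statistics together, which is the essential point of DGA.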
(We could also optimize in infinite-dimensional function space, but the data-based approach leads more directly to useful formulas.) In the following, X_𝑛 = X_𝑛(0):

𝑆[𝑓; 𝜽] = Σ_{𝑛=1}^𝑁 [ Σ_{ℓ=1}^𝐿 𝑎_ℓ(𝑓) 1_{𝑇_ℓ}(𝜽(X_𝑛)) − 𝐹(X_𝑛) ]² 𝑤(X_𝑛), (B35)

0 = ∂𝑆[𝑓; 𝜽]/∂𝑎_𝑚 = 2 Σ_{𝑛=1}^𝑁 [ Σ_{ℓ=1}^𝐿 𝑎_ℓ(𝑓) 1_{𝑇_ℓ}(𝜽(X_𝑛)) − 𝐹(X_𝑛) ] 1_{𝑇_𝑚}(𝜽(X_𝑛)) 𝑤(X_𝑛)
= 2 Σ_{𝑛=1}^𝑁 [ 𝑎_𝑚(𝑓) − 𝐹(X_𝑛) ] 1_{𝑇_𝑚}(𝜽(X_𝑛)) 𝑤(X_𝑛),

∴ 𝑎_𝑚(𝑓) = Σ_{𝑛=1}^𝑁 𝐹(X_𝑛) 1_{𝑇_𝑚}(𝜽(X_𝑛)) 𝑤(X_𝑛) / Σ_{𝑛=1}^𝑁 1_{𝑇_𝑚}(𝜽(X_𝑛)) 𝑤(X_𝑛) (B36)
≈ E_{X∼𝜋}[𝐹(X) 1_{𝑇_𝑚}(𝜽(X))] / E_{X∼𝜋}[1_{𝑇_𝑚}(𝜽(X))] = E_{X∼𝜋}[ 𝐹(X) | 𝜽(X) ∈ 𝑇_𝑚 ].

The left side is a consistent (as 𝑁 → ∞) but biased estimator for the right-hand side. This is simply a restatement of the variance-minimizing property of conditional expectation (Durrett 2013). This all assumes a fixed CV space, but in practice we are in search of a good CV space, and we now have a metric by which to evaluate a given choice. If 𝑓* is the optimal 𝑓 for the given 𝜽, we have

min_𝑓 𝑆[𝑓; 𝜽] = Σ_{𝑛=1}^𝑁 [ Σ_{ℓ=1}^𝐿 𝑎_ℓ(𝑓*) 1_{𝑇_ℓ}(𝜽(X_𝑛)) − 𝐹(X_𝑛) ]² 𝑤(X_𝑛)
= Σ_{ℓ=1}^𝐿 Σ_{𝑛=1}^𝑁 ( 𝑎_ℓ(𝑓*) − 𝐹(X_𝑛) )² 1_{𝑇_ℓ}(𝜽(X_𝑛)) 𝑤(X_𝑛), (B37)

where the last line uses the disjointness of the partition. By splitting the sum over grid cells in CV space, we can now visualize both 𝑓* and the quantity it minimizes, min_𝑓 𝑆[𝑓; 𝜽], which is critical for selecting the best variables.
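A minimal numerical sketch of this binned conditional expectation and its projection error follows (synthetic two-dimensional data; the field 𝐹, the CV, and the bin edges are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

# Project a scalar field F onto a 1-D CV theta by weighted bin averages (the
# discrete conditional expectation), and accumulate the per-bin L2 projection
# error used to rank candidate CVs.
N, L = 50000, 25
X = rng.normal(size=(N, 2))                 # toy 2-D states
w = np.ones(N)                              # change of measure (uniform here)
F = np.tanh(X[:, 0]) + 0.1 * X[:, 1]        # scalar field, mostly a function of x0

theta = X[:, 0]                             # candidate collective variable
bins = np.digitize(theta, np.linspace(-2.5, 2.5, L - 1))

f_star = np.zeros(L); err = 0.0
for m in range(L):
    sel = bins == m
    if sel.any():
        f_star[m] = np.average(F[sel], weights=w[sel])   # conditional expectation
        err += np.sum((F[sel] - f_star[m]) ** 2 * w[sel])
print(round(err / w.sum(), 3))              # small residual: theta = x0 explains F well
```

Repeating the loop with a poor CV (say, theta = X[:, 1]) leaves a much larger residual, which is exactly the criterion the text uses to compare candidate observables.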
The altitude-dependent projections of the committor in Section 5(b) show this sum as the 𝐿² projection error, with one modification: we divide by the number 𝐿 of grid boxes in CV space in order to normalize between differing dimensions of different observables. There is one exception to this display algorithm: the invariant measure 𝜋 is best visualized in CV space as 𝜋(𝜽⁻¹(𝑇_ℓ)), which is estimated as

𝜋(𝜽⁻¹(𝑇_ℓ)) = ∫_{R^𝑑} 1_{𝑇_ℓ}(𝜽(x)) 𝜋(𝑑x) ≈ Σ_{𝑛=1}^𝑁 1_{𝑇_ℓ}(𝜽(X_𝑛)) 𝑤(X_𝑛). (B38)

References
Berry, T., D. Giannakis, and J. Harlim, 2015: Nonparametric forecastingof low-dimensional dynamical systems.
Phys. Rev. E , , 032 915,doi:10.1103/PhysRevE.91.032915.Birner, T., and P. D. Williams, 2008: Sudden stratospheric warmingsas noise-induced transitions. Journal of the Atmospheric Sciences ,
65 (10) , 3337–3343, doi:10.1175/2008JAS2770.1.Bolton, T., and L. Zanna, 2019: Applications of deep learn-ing to ocean data inference and subgrid parameterization.
Jour-nal of Advances in Modeling Earth Systems ,
11 (1) , 376–399,doi:https://doi.org/10.1029/2018MS001472, URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2018MS001472, https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2018MS001472.Bouchet, F., J. Laurie, and O. Zaboronski, 2011: Control andinstanton trajectories for random transitions in turbulent flows.
Journal of Physics: Conference Series ,
318 (2) , 022 041, doi:10.1088/1742-6596/318/2/022041, URL https://doi.org/10.1088%2F1742-6596%2F318%2F2%2F022041.Bouchet, F., J. Laurie, and O. Zaboronski, 2014: Langevin dynamics,large deviations and instantons for the quasi-geostrophic model andtwo-dimensional euler equations.
Journal of Statistical Physics , ,1066–1092, doi:10.1007/s10955-014-1052-5, URL https://doi.org/10.1007/s10955-014-1052-5.Bouchet, F., J. Rolland, and E. Simonnet, 2019a: Rare event algorithmlinks transitions in turbulent flows with activated nucleations. Physi-cal Review Letters ,
122 (7) , 074 502, doi:10.1103/PhysRevLett.122.074502.Bouchet, F., J. Rolland, and J. Wouters, 2019b: Rare event samplingmethods.
Chaos: An Interdisciplinary Journal of Nonlinear Science ,
29 (8) , 080 402, doi:10.1063/1.5120509.Bowman, G. R., V. S. Pande, and F. Noé, 2013:
An introduction toMarkov state models and their application to long timescale molec-ular simulation , Vol. 797. Springer Science & Business Media.Butler, A. H., D. J. Seidel, S. C. Hardiman, N. Butchart, T. Birner, andA. Match, 2015: Defining sudden stratospheric warmings.
Bulletinof the American Meteorological Society ,
96 (11) , 1913–1928, doi:10.1175/BAMS-D-13-00173.1.Carleo, G., and M. Troyer, 2017: Solving the quantum many-body prob-lem with artificial neural networks.
Science ,
355 (6325) , 602–606,doi:10.1126/science.aag2302, URL https://science.sciencemag.org/content/355/6325/602, https://science.sciencemag.org/content/355/6325/602.full.pdf. Charlton, A. J., and L. M. Polvani, 2007: A new look at stratosphericsudden warmings. part i: Climatology and modeling benchmarks.
Journal of Climate ,
20 (3) , 449–469, doi:10.1175/JCLI3996.1.Charney, J. G., and J. G. DeVore, 1979: Multiple FlowEquilibria in the Atmosphere and Blocking.
Journalof the Atmospheric Sciences ,
36 (7) , 1205–1216, doi:10.1175/1520-0469(1979)036<1205:MFEITA>2.0.CO;2, URLhttps://doi.org/10.1175/1520-0469(1979)036<1205:MFEITA>2.0.CO;2, https://journals.ametsoc.org/jas/article-pdf/36/7/1205/3420739/1520-0469(1979)036_1205_mfeita_2_0_co_2.pdf.Charney, J. G., and P. G. Drazin, 1961: Propagation of planetary-scaledisturbances from the lower into the upper atmosphere.
Journal ofGeophysical Research (1896-1977) ,
66 (1) , 83–109, doi:10.1029/JZ066i001p00083, URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/JZ066i001p00083, https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/JZ066i001p00083.Chattopadhyay, A., E. Nabizadeh, and P. Hassanzadeh, 2020: Ana-log forecasting of extreme-causing weather patterns using deeplearning.
Journal of Advances in Modeling Earth Systems ,
12 (2) ,e2019MS001 958, doi:https://doi.org/10.1029/2019MS001958,URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2019MS001958, e2019MS001958 10.1029/2019MS001958, https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2019MS001958.Chen, N., D. Giannakis, R. Herbei, and A. J. Majda, 2014: Anmcmc algorithm for parameter estimation in signals with hiddenintermittent instability.
SIAM/ASA Journal on Uncertainty Quan-tification , , 647–669, doi:10.1137/130944977, URL https://doi.org/10.1137/130944977, https://doi.org/10.1137/130944977.Chen, N., and A. J. Majda, 2017: Beating the curse of dimensionwith accurate statistics for the fokker–planck equation in complexturbulent systems. Proceedings of the National Academy of Sci-ences ,
114 (49)
Chen, N., and A. J. Majda, 2020: Predicting observed and hidden extreme events in complex nonlinear dynamical systems with partial observations and short training time series. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30 (3), 033101, doi:10.1063/1.5122199.
Chodera, J. D., and F. Noé, 2014: Markov state models of biomolecular conformational dynamics. Current Opinion in Structural Biology, 25, 135–144, doi:10.1016/j.sbi.2014.04.002.
Christiansen, B., 2000: Chaos, quasiperiodicity, and interannual variability: Studies of a stratospheric vacillation model. Journal of the Atmospheric Sciences, 57 (18), 3161–3173, doi:10.1175/1520-0469(2000)057<3161:CQAIVS>2.0.CO;2.
Crommelin, D. T., 2003: Regime transitions and heteroclinic connections in a barotropic atmosphere. Journal of the Atmospheric Sciences, 60 (2), 229–246, doi:10.1175/1520-0469(2003)060<0229:RTAHCI>2.0.CO;2.
DelSole, T., and B. F. Farrell, 1995: A stochastically excited linear system as a model for quasigeostrophic turbulence: Analytic results for one- and two-layer fluids. Journal of the Atmospheric Sciences, 52 (14), 2531–2547, doi:10.1175/1520-0469(1995)052<2531:ASELSA>2.0.CO;2.
Dematteis, G., T. Grafke, M. Onorato, and E. Vanden-Eijnden, 2019: Experimental evidence of hydrodynamic instantons: The universal route to rogue waves. Physical Review X, 9, 041057, doi:10.1103/PhysRevX.9.041057.
Dematteis, G., T. Grafke, and E. Vanden-Eijnden, 2018: Rogue waves and large deviations in deep sea. Proceedings of the National Academy of Sciences, 115 (5), 855–860, doi:10.1073/pnas.1710670115.
Durrett, R., 2013: Probability: Theory and Examples. Cambridge University Press.
E, W., and E. Vanden-Eijnden, 2006: Towards a theory of transition paths. Journal of Statistical Physics, 123 (3), 503, doi:10.1007/s10955-005-9003-9.
Esler, J. G., and M. Mester, 2019: Noise-induced vortex-splitting stratospheric sudden warmings. Quarterly Journal of the Royal Meteorological Society, 145 (719), 476–494, doi:10.1002/qj.3443.
Farazmand, M., and T. P. Sapsis, 2017: A variational approach to probing extreme events in turbulent dynamical systems. Science Advances, 3 (9), e1701533, doi:10.1126/sciadv.1701533.
Finkel, J., D. S. Abbot, and J. Weare, 2020: Path properties of atmospheric transitions: Illustration with a low-order sudden stratospheric warming model. Journal of the Atmospheric Sciences, 77 (7), 2327–2347, doi:10.1175/JAS-D-19-0278.1.
Frank, N., and S. Fischer, 2008: Transition networks for modeling the kinetics of conformational change in macromolecules. Current Opinion in Structural Biology, 18, 154–163, doi:10.1016/j.sbi.2008.01.008.
Franzke, C., and A. J. Majda, 2006: Low-order stochastic mode reduction for a prototype atmospheric GCM. Journal of the Atmospheric Sciences, 63 (2), 457–479, doi:10.1175/JAS3633.1.
Giannakis, D., A. Kolchinskaya, D. Krasnov, and J. Schumacher, 2018: Koopman analysis of the long-term evolution in a turbulent convection cell. Journal of Fluid Mechanics, 847, 735–767, doi:10.1017/jfm.2018.297.
Gottwald, G. A., D. T. Crommelin, and C. L. E. Franzke, 2016: Stochastic climate theory. arXiv:1612.07474.
Han, J., A. Jentzen, and W. E, 2018: Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115 (34).
Hasselmann, K., 1976: Stochastic climate models. Part I: Theory. Tellus, 28 (6), 473–485, doi:10.3402/tellusa.v28i6.11316.
Hoffman, R. N., J. M. Henderson, S. M. Leidner, C. Grassotti, and T. Nehrkorn, 2006: The response of damaging winds of a simulated tropical cyclone to finite-amplitude perturbations of different variables. Journal of the Atmospheric Sciences, 63 (7), 1924–1937, doi:10.1175/JAS3720.1.
Holton, J. R., and C. Mass, 1976: Stratospheric vacillation cycles. Journal of the Atmospheric Sciences, 33 (11), 2218–2225, doi:10.1175/1520-0469(1976)033<2218:SVC>2.0.CO;2.
Hu, G., T. Bódai, and V. Lucarini, 2019: Effects of stochastic parametrization on extreme value statistics. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29 (8), 083102, doi:10.1063/1.5095756.
Karatzas, I., and S. E. Shreve, 1998: Brownian Motion and Stochastic Calculus. Springer.
Khoo, Y., J. Lu, and L. Ying, 2018: Solving for high-dimensional committor functions using artificial neural networks. Research in the Mathematical Sciences, doi:10.1007/s40687-018-0160-2.
L'Ecuyer, P., V. Demers, and B. Tuffin, 2007: Rare events, splitting, and quasi-Monte Carlo. ACM Transactions on Modeling and Computer Simulation (TOMACS), 17 (2), 9–es.
Legras, B., and M. Ghil, 1985: Persistent anomalies, blocking and variations in atmospheric predictability. Journal of the Atmospheric Sciences, 42 (5), 433–471, doi:10.1175/1520-0469(1985)042<0433:PABAVI>2.0.CO;2.
Li, H., Y. Khoo, Y. Ren, and L. Ying, 2020: Solving for high dimensional committor functions using neural network with online approximation to derivatives. arXiv:2012.06727.
Li, Q., B. Lin, and W. Ren, 2019: Computing committor functions for the study of rare events using deep learning. The Journal of Chemical Physics, 151 (5), 054112, doi:10.1063/1.5110439.
Lorenz, E. N., 1963: Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20 (2), 130–141, doi:10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
Lorpaiboon, C., E. H. Thiede, R. J. Webber, J. Weare, and A. R. Dinner, 2020: Integrated variational approach to conformational dynamics: A robust strategy for identifying eigenfunctions of dynamical operators. The Journal of Physical Chemistry B, 124 (42), 9354–9364, doi:10.1021/acs.jpcb.0c06477.
Lucente, D., S. Duffner, C. Herbert, J. Rolland, and F. Bouchet, 2019: Machine learning of committor functions for predicting high impact climate events. Climate Informatics, Paris, France, URL https://hal.archives-ouvertes.fr/hal-02322370.
Majda, A. J., and D. Qi, 2018: Strategies for reduced-order models for predicting the statistical responses and uncertainty quantification in complex turbulent dynamical systems. SIAM Review, 60 (3), 491–549, doi:10.1137/16M1104664.
Majda, A. J., I. Timofeyev, and E. Vanden-Eijnden, 2001: A mathematical framework for stochastic climate models. Communications on Pure and Applied Mathematics, 54 (8), 891–974, doi:10.1002/cpa.1014.
Mardt, A., L. Pasquali, H. Wu, and F. Noé, 2018: VAMPnets for deep learning of molecular kinetics. Nature Communications, 9, 5, doi:10.1038/s41467-017-02388-1.
Matsuno, T., 1971: A dynamical model of the stratospheric sudden warming. Journal of the Atmospheric Sciences, 28 (8), 1479–1494, doi:10.1175/1520-0469(1971)028<1479:ADMOTS>2.0.CO;2.
Metzner, P., C. Schütte, and E. Vanden-Eijnden, 2006: Illustration of transition path theory on a collection of simple examples. The Journal of Chemical Physics, 125 (8), 1–2, doi:10.1063/1.2335447.
Metzner, P., C. Schütte, and E. Vanden-Eijnden, 2009: Transition path theory for Markov jump processes. Multiscale Modeling and Simulation, 7 (3), 1192–1219, doi:10.1137/070699500.
Miron, P., F. Beron-Vera, L. Helfmann, and P. Koltai, 2021: Transition paths of marine debris and the stability of the garbage patches. Chaos: An Interdisciplinary Journal of Nonlinear Science, accepted for publication.
Mohamad, M. A., and T. P. Sapsis, 2018: Sequential sampling strategy for extreme event statistics in nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 115 (44).
Space Weather, 11 (12), 671–679, doi:10.1002/2013SW000990.
Noé, F., C. Schütte, E. Vanden-Eijnden, L. Reich, and T. R. Weikl, 2009: Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proceedings of the National Academy of Sciences, 106 (45).
Øksendal, B., 2003: Stochastic Differential Equations: An Introduction with Applications. Springer.
Pande, V. S., K. Beauchamp, and G. R. Bowman, 2010: Everything you wanted to know about Markov state models but were afraid to ask. Methods, 52 (1), 99–105, doi:10.1016/j.ymeth.2010.06.002.
Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Plotkin, D. A., R. J. Webber, M. E. O'Neill, J. Weare, and D. S. Abbot, 2019: Maximizing simulated tropical cyclone intensity with action minimization. Journal of Advances in Modeling Earth Systems, 11 (4), 863–891, doi:10.1029/2018MS001419.
Ragone, F., and F. Bouchet, 2020: Computation of extreme values of time averaged observables in climate models with large deviation techniques. Journal of Statistical Physics, 179 (5), 1637–1665, doi:10.1007/s10955-019-02429-7.
Ragone, F., J. Wouters, and F. Bouchet, 2018: Computation of extreme heat waves in climate models using a large deviation algorithm. Proceedings of the National Academy of Sciences, 115 (1).
Journal of Climate, 1593–1600, doi:10.1175/2007JCLI2119.1.
Sabeerali, C. T., R. S. Ajayamohan, D. Giannakis, and A. J. Majda, 2017: Extraction and prediction of indices for monsoon intraseasonal oscillations: An approach based on nonlinear Laplacian spectral analysis. Climate Dynamics, 49 (9), 3031–3050, doi:10.1007/s00382-016-3491-y.
Sapsis, T. P., 2021: Statistics of extreme events in fluid flows and waves. Annual Review of Fluid Mechanics, 53 (1), 85–111, doi:10.1146/annurev-fluid-030420-032810.
Schaller, N., J. Sillmann, J. Anstey, E. M. Fischer, C. M. Grams, and S. Russo, 2018: Influence of blocking on northern European and western Russian heatwaves in large climate model ensembles. Environmental Research Letters, 13 (5), 054015, doi:10.1088/1748-9326/aaba55.
Sillmann, J., and Coauthors, 2017: Understanding, modeling and predicting weather and climate extremes: Challenges and opportunities. Weather and Climate Extremes, 18, 65–74, doi:10.1016/j.wace.2017.10.003.
Simonnet, E., J. Rolland, and F. Bouchet, 2020: Multistability and rare spontaneous transitions between climate and jet configurations in a barotropic model of the Jovian mid-latitude troposphere. arXiv:2009.09913.
Strahan, J., A. Antoszewski, C. Lorpaiboon, B. P. Vani, J. Weare, and A. R. Dinner, 2020: Long-timescale predictions from short-trajectory data: A benchmark analysis of the Trp-cage miniprotein. arXiv:2009.04034.
Tantet, A., F. R. van der Burgt, and H. A. Dijkstra, 2015: An early warning indicator for atmospheric blocking events using transfer operators. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25 (3), 036406, doi:10.1063/1.4908174.
Thiede, E., D. Giannakis, A. R. Dinner, and J. Weare, 2019: Approximation of dynamical quantities using trajectory data. arXiv:1810.01841 [physics.data-an], 1–24.
Tibshirani, R., 1996: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58 (1), 267–288, doi:10.1111/j.2517-6161.1996.tb02080.x.
Timmermann, A., F.-F. Jin, and J. Abshagen, 2003: A nonlinear theory for El Niño bursting. Journal of the Atmospheric Sciences, 60 (1), 152–165, doi:10.1175/1520-0469(2003)060<0152:ANTFEN>2.0.CO;2.
E, W., and E. Vanden-Eijnden, 2010: Transition-path theory and path-finding algorithms for the study of rare events. Annual Review of Physical Chemistry, 61 (1), 391–420, doi:10.1146/annurev.physchem.040808.090412.
Vanden-Eijnden, E., and J. Weare, 2013: Data assimilation in the low noise regime with application to the Kuroshio. Monthly Weather Review, 141 (6), 1822–1841, doi:10.1175/MWR-D-12-00060.1.
Vitart, F., and A. W. Robertson, 2018: The sub-seasonal to seasonal prediction project (S2S) and the prediction of extreme events. npj Climate and Atmospheric Science, doi:10.1038/s41612-018-0013-0.
Wan, Z. Y., P. Vlachas, P. Koumoutsakos, and T. Sapsis, 2018: Data-assisted reduced-order modeling of extreme events in complex dynamical systems. PLOS ONE, 13 (5), 1–22, doi:10.1371/journal.pone.0197704.
Weare, J., 2009: Particle filtering with path sampling and an application to a bimodal ocean current model. Journal of Computational Physics, 228 (12), 4312–4331, doi:10.1016/j.jcp.2009.02.033.
Webber, R. J., D. A. Plotkin, M. E. O'Neill, D. S. Abbot, and J. Weare, 2019: Practical rare event sampling for extreme mesoscale weather. Chaos, 29 (5), 053109, doi:10.1063/1.5081461.
Weinan, E., T. Li, and E. Vanden-Eijnden, 2019: Applied Stochastic Analysis, Vol. 199. American Mathematical Society.
Yasuda, Y., F. Bouchet, and A. Venaille, 2017: A new interpretation of vortex-split sudden stratospheric warmings in terms of equilibrium statistical mechanics. Journal of the Atmospheric Sciences, 74 (12), 3915–3936, doi:10.1175/JAS-D-17-0045.1.
Yoden, S., 1987a: Bifurcation properties of a stratospheric vacillation model. Journal of the Atmospheric Sciences, 44 (13), 1723–1733, doi:10.1175/1520-0469(1987)044<1723:BPOASV>2.0.CO;2.
Yoden, S., 1987b: Dynamical aspects of stratospheric vacillations in a highly truncated model. Journal of the Atmospheric Sciences, 44 (24), 3683–3695, doi:10.1175/1520-0469(1987)044<3683:DAOSVI>2.0.CO;2.
Zhang, F., and J. A. Sippel, 2009: Effects of moist convection on hurricane predictability. Journal of the Atmospheric Sciences, 66 (7), 1944–1961, doi:10.1175/2009JAS2824.1.
Zwanzig, R., 2001: Nonequilibrium Statistical Mechanics. Oxford University Press.