Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models
Jonathan A. Weyn∗, Dale R. Durran, Rich Caruana, Nathaniel Cresswell-Clay
Manuscript submitted to Journal of Advances in Modeling Earth Systems
Department of Atmospheric Sciences, University of Washington
Microsoft Research, Redmond, Washington
Key Points:
• An ensemble forecast system is developed using convolutional neural networks (CNNs) to generate data-driven global forecasts.
• Only 3 minutes are required to compute a large 320-member ensemble of skillful 6-week sub-seasonal predictions.
• Shorter lead time forecasts also show skill, including a single deterministic 4-day forecast for Hurricane Irma.

∗ Current affiliation: Microsoft, Redmond, WA, USA.
Corresponding author: J. Weyn, [email protected]
Abstract
We present an ensemble prediction system using a Deep Learning Weather Prediction (DLWP) model that recursively predicts key atmospheric variables with six-hour time resolution. This model uses convolutional neural networks (CNNs) on a cubed-sphere grid to produce global forecasts. The approach is computationally efficient, requiring just three minutes on a single GPU to produce a 320-member set of six-week forecasts at 1.4° resolution. Ensemble spread is primarily produced by randomizing the CNN training process to create a set of 32 DLWP models with slightly different learned weights. Although our DLWP model does not forecast precipitation, it does forecast total column water vapor, and it gives a reasonable 4.5-day deterministic forecast of Hurricane Irma. In addition to simulating mid-latitude weather systems, it spontaneously generates tropical cyclones in a one-year free-running simulation. Averaged globally and over a two-year test set, the ensemble mean RMSE retains skill relative to climatology beyond two weeks, with anomaly correlation coefficients remaining above 0.6 through six days. Our primary application is to subseasonal-to-seasonal (S2S) forecasting at lead times from two to six weeks. Current forecast systems have low skill in predicting one- or two-week-average weather patterns at S2S time scales. The continuous ranked probability score (CRPS) and the ranked probability skill score (RPSS) show that the DLWP ensemble is only modestly inferior in performance to the European Centre for Medium-Range Weather Forecasts (ECMWF) S2S ensemble over land at lead times of 4 and 5–6 weeks. At shorter lead times, the ECMWF ensemble performs better than DLWP.

Plain Language Summary
We develop a machine learning weather prediction system, trained on past weather data, thereby providing an alternative approach to the current widely used computer models that predict the weather based on approximate mathematical representations of physical laws. Our approach is much more computationally efficient, allowing us to make a very large number of similar forecasts, known as an ensemble, for the same weather event. The ensemble better defines the range of possible future weather conditions than a single forecast. Our ensembles show skill at both short forecast lead times and for 4–6-week (sub-seasonal) prediction. The sub-seasonal predictions are for one- or two-week averaged weather patterns. Although our forecasts are better than benchmark guesses, such as a forecast consisting of the average weather conditions on a given calendar date, the quantitative forecast skill remains low for sub-seasonal prediction. Nevertheless, by several quantitative measures our forecasts at 4–6 weeks score nearly as well as the current state-of-the-art forecasts from major operational centers. The ease with which our machine learning approach can efficiently generate very large ensemble forecasts holds promise for future developments to improve the skill of sub-seasonal prediction.
Weather forecasting relies heavily on data assimilation to estimate the current state of the atmosphere and on numerical weather prediction (NWP) to approximate its subsequent evolution. The skill of such deterministic weather forecasts is typically limited to about two weeks by the chaotic growth of small initial errors and inaccuracies in our approximate models of the atmosphere. On much longer, multi-month time scales, the coupling of the atmosphere with slowly evolving ocean-land forcing allows skillful seasonal forecasts of monthly or seasonally averaged conditions. Between these two extremes, the production of skillful one- or two-week averaged forecasts at lead times ranging roughly between two weeks and two months (the so-called subseasonal-to-seasonal or S2S time frame) has proved particularly challenging; yet there are many societal sectors that would greatly benefit from improved S2S forecasts (White et al., 2017). Several major operational centers have developed NWP-based ensemble systems focused on improving S2S forecasting (Vitart et al., 2017).

In 1992, the European Centre for Medium-Range Weather Forecasts (ECMWF) and the National Centers for Environmental Prediction (NCEP) began issuing ensemble weather forecasts. They were soon followed by the other major weather prediction centers across the world. Ensemble forecasts strive to provide a set of equally likely forecast realizations spanning the range of possible future atmospheric states. Early evidence for the economic value of probabilistic forecasts derived from the ECMWF ensemble relative to a single deterministic forecast was provided by Richardson (2000). Ensemble forecasts are now recognized as essential to represent the probabilistic nature of weather forecasting and to break through the intrinsic limits to predictability of the atmosphere (Palmer, 2018).

Ensembles are particularly appropriate as one looks beyond lead times where deterministic forecasts lose all skill relative to climatology. On S2S lead times, ensemble-mean and ensemble-based probabilistic forecasts have shown modest skill relative to climatology (Vitart, 2004; Weigel et al., 2008; Vitart, 2014; Monhart et al., 2018). Computational resources do, however, impose a significant limitation on efforts to create S2S forecasts with NWP ensembles. As of 2016, 11 forecast centers were contributing S2S forecasts to the S2S database (Vitart et al., 2017), and the ensembles from three of these centers consisted of just four members. The largest S2S ensemble, with 51 members providing forecasts out to 46 days, is generated by ECMWF's Integrated Forecast System (IFS) using a significant fraction of the time available on one of the world's most powerful computer systems (Bauer et al., 2015).
Yet there is increasing evidence that the number of ensemble members for S2S forecasts should be higher, perhaps in the 100–200 range (Buizza, 2019). Large ensemble sizes are also helpful for assessing the likelihood of events in the tails of probabilistic forecast distributions (Leutbecher, 2018), and such extreme events are often the most impactful.

Machine learning provides one potential avenue to develop S2S forecast systems with significantly lower computational costs. Recognizing that there are other successful machine-learning approaches to S2S forecasting (Hwang et al., 2019), here our focus will be on the development of a data-driven deep-learning weather prediction (DLWP) model that can be iteratively stepped forward, like traditional NWP models, to simulate atmospheric states at arbitrarily long lead times. In one of the first attempts to use ML to create such a model, Dueben and Bauer (2018) trained neural networks (NNs) on several years of reanalysis data to predict 500-hPa geopotential height (Z500) on the globe at 6° resolution, demonstrating the ability to produce ML weather forecasts that have at least modest forecast skill. Using advanced convolutional neural networks (CNNs), Scher and Messori (2019) trained algorithms on simulations from a simplified GCM that significantly outperformed baseline metrics and effectively captured the simplified-GCM dynamics at spherical harmonic resolutions of T21 and T42 (roughly 5.6° and 2.8°). Training only on historical data, Weyn et al. (2019) used CNNs to generate forecasts for northern mid-latitude Z500 and 300–700-hPa thickness (τ300–700) on a 2.5° latitude-longitude grid that showed skill relative to climatology and persistence through five days.

Recently we extended our DLWP model to the full globe using a volume-conservative mapping to project global data from latitude-longitude grids onto a cubed sphere and improved the CNN architecture operating on the cube faces (Weyn et al., 2020, hereafter WDC20).
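The iterative time stepping that such data-driven models share with NWP can be sketched in a few lines. Here `toy_step` is a stand-in for a trained network (the real DLWP CNN maps two 6-hourly input states to the two subsequent states); the function and its initial states are purely illustrative, not the paper's model.

```python
def toy_step(x_prev, x_curr):
    """Stand-in for the trained CNN: given states at t-6h and t,
    return states at t+6h and t+12h (here, simple averaging)."""
    x_next1 = [0.5 * (a + b) for a, b in zip(x_prev, x_curr)]   # t + 6 h
    x_next2 = [0.5 * (a + b) for a, b in zip(x_curr, x_next1)]  # t + 12 h
    return x_next1, x_next2

def rollout(x_prev, x_curr, n_steps):
    """Iterate the two-step model to an arbitrarily long lead time,
    feeding each pair of outputs back in as the next pair of inputs."""
    states = []
    for _ in range(n_steps):
        x1, x2 = toy_step(x_prev, x_curr)
        states.extend([x1, x2])
        x_prev, x_curr = x1, x2
    return states

# 84 model calls x 12 h per call = a 1008-h (6-week) forecast
forecast = rollout([1.0, 2.0], [1.5, 1.0], 84)
```

The key property this illustrates is that a single learned 12-hour mapping, applied recursively, yields forecasts at any lead time without additional training.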
In addition to Z500 and τ300–700, our improved model forecasts two additional surface fields, 1000-hPa height (Z1000) and 2-meter temperature (T2m), and uses three externally specified 2D fields: a land-sea mask, topographic height, and top-of-the-atmosphere insolation. This new 1.9° resolution model showed skill relative to climatology and persistence through seven days. Moreover, it could be stepped forward repeatedly from a single initialization for at least one year, and while doing so, captured the seasonal cycle with reasonable accuracy.

In the following we further improve our DLWP model by adding two more 2D prognostic fields and increasing the spatial resolution to 1.4°. Large 320-member ensembles generated using the improved model are used to provide S2S forecasts through a six-week lead time. These forecasts are verified against ERA5 data and compared to operational ECMWF S2S products.

The remainder of this paper is organized as follows. Section 2 describes the improvements to our previous DLWP model, while section 3 discusses the incorporation of that model in our ensemble forecast system. The behavior of that ensemble is assessed at short deterministic lead times in section 4, and at longer S2S lead times in section 5. Section 6 contains the conclusions.

The basic model is very similar to that described in detail in WDC20, in which four forecast fields, geopotential height at 1000 hPa (Z1000) and at 500 hPa (Z500), 300–700-hPa thickness (τ300–700), and 2-meter temperature (T2m), are mapped to a cubed sphere. Three known fields are also provided: top-of-atmosphere radiation, topographic height, and a land-sea mask. Convolutional neural networks (CNNs) were trained using the same 3 × 3 convolutional kernels on input fields at times t − 6 hr and t to forecast fields at t + 6 and t + 12 hr.
The model is trained to minimize the mean squared error (MSE) over two steps, or equivalently over a 24-hr period with 6-hr temporal resolution.

Here we extend the WDC20 model by adding two more forecast fields: temperature at 850 hPa (T850), which is strongly modulated by large-scale weather patterns while exhibiting less sensitivity to diurnal heating and surface-layer processes than T2m, and total column water vapor (TCWV). TCWV is the vertically integrated total gas-phase water above each grid cell; its inclusion is a step toward characterizing tropical convective systems including tropical cyclones and the Madden-Julian oscillation (MJO). The MJO is believed to be an important source of forecast skill on S2S time scales and has been characterized as a "moisture mode" (Adames & Kim, 2016).

The resolution was also modestly increased from 48 × 48 to 64 × 64 grid cells on each face of the cube, yielding an effective resolution of approximately 1.4° in latitude and longitude at the equator. ERA5 data at a gridded resolution of 1° in latitude and longitude were remapped with the Tempest-Remap package (Ullrich & Taylor, 2015; Ullrich et al., 2016) for training, validation and testing. Somewhat fortuitously, the WDC20 convolutional neural network architecture continued to perform quite well despite the changes to the model input and target data, although we were able to improve the model by doubling the number of filters used in each convolutional layer. This increased the number of filters in the first layer from 32 to 64. Our improved DLWP CNN architecture is tabulated in Table 1. The increases in the number of filters in each layer and number of forecast fields increased the total number of trainable parameters relative to that in WDC20 by a factor of about 4, to 2.7 million. Nevertheless, as a result of additional code optimization, the model still trains in 6–8 days.
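The quoted factor-of-about-4 growth in trainable parameters follows directly from the parameter count of a 2-D convolution; a quick check (the layer sizes below are illustrative single layers, not the full Table 1 architecture):

```python
def conv2d_params(in_ch, out_ch, k=3):
    """Trainable parameters of a k x k 2-D convolution with bias terms:
    k*k weights per (input, output) channel pair, plus one bias per filter."""
    return k * k * in_ch * out_ch + out_ch

# Doubling both the input and output channel counts of a layer
# roughly quadruples its parameter count.
small = conv2d_params(32, 32)  # 9,248 parameters
large = conv2d_params(64, 64)  # 36,928 parameters
ratio = large / small          # close to 4
```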
Table 1. CNN architecture for DLWP as a sequence of operations on layers. The parameter v represents the number of input fields, t represents the number of input time steps, and c represents the number of auxiliary prescribed inputs (here top-of-atmosphere radiation at t times, land-sea mask and topographic height). The layer names (except for the suffix "CubeSphere") correspond to the names in the Keras library. "Concatenate" appends the state in parentheses, numbered earlier, to the output of the previous layer.

Layer | Filters | Filter size | Output shape^a | Trainable params^b
input | | | (6, 64, 64, vt + c) |
Conv2D–CubeSphere | 64 | 3 × 3 | (6, 64, 64, 64) | …
… | … | … | … | …
Conv2D–CubeSphere | vt | … | (6, 64, 64, vt) | 1,560

^a Output shape is (face, y, x, channels). ^b Number of learned parameters for t = 2, v = 6, c = 4. Total is 2,676,376.

The basic ensemble design follows the typical practice used in operational NWP forecasting by including ensemble members with both perturbed initial conditions (ICs) and variations in the model's representation of the atmosphere, the latter being incorporated in NWP ensembles either through the use of several different "physics" parameterization packages, through a suite of different parameter values with a fixed set of packages, or through the incorporation of stochastic physics. The perturbed initial conditions and our approach to varying the model representation of the atmosphere are discussed below.
The ERA5 dataset includes 10 perturbed ensemble members generated by ensemble data assimilation with 4DVAR (Isaksen et al., 2010) to help with uncertainty estimation, and we use these as a convenient set of perturbed ICs for construction of our DLWP ensemble. Unfortunately, this set of ICs is non-optimal, because, unlike the operational ECMWF ensemble (Palmer, 2018), singular vectors were not used to select the most rapidly growing initial perturbations. Moreover, the ERA5 ensemble itself is moderately under-dispersive.¹

Figure 1. RMSE in T2m (K) as a function of forecast lead time for DLWP ensembles (solid lines) and corresponding ensemble spread (dashed): IC (blue), IC×2, SP, and grand ensembles, along with persistence and climatology benchmarks.

Figure 1 shows the globally averaged RMSE of the ensemble mean T2m plotted as a function of forecast lead time for four DLWP ensemble strategies. Also plotted are RMSE reference curves for climatology and persistence. The solid blue curve shows the RMSE of a DLWP forecast generated using the 10 perturbed members of the ERA5 dataset to create a 10-member IC ensemble. Since the ERA5 IC perturbations do not project strongly onto the most rapidly growing modes, the ensemble spread (computed as the square root of the average over all forecasts of the ensemble variance) actually decreases over the first 36 h of forecast lead time (dashed blue line in Fig. 1). In a conventional NWP model, such a reduction in ensemble spread, which occurs primarily on small spatial scales, can arise from a combination of numerical dissipation and the dispersion of inertia-gravity waves, and an analogous behavior is present in the DLWP model.

Another serious problem with the IC ensemble is that the spread is much smaller than the RMSE of the ensemble mean. In an ideal ensemble, the joint distribution of the ensemble members would be unchanged if the verifying observations were substituted for any one of the individual ensemble members, and under that assumption the spread and RMSE curves should coincide (Fortin et al., 2014). In an effort to better match the ensemble spread to the RMSE, a 10-member "IC×2" ensemble was created, in which the difference in the IC perturbations from the control ERA5 data was doubled. The spread for the IC×2 ensemble increases relative to that of the IC ensemble, but remains well below the RMSE of the ensemble mean (Fig. 1).

¹ https://confluence.ecmwf.int/display/CKB/ERA5%3A+uncertainty+estimation
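The spread and RMSE definitions used above (spread as the square root of the case-averaged ensemble variance, compared against the RMSE of the ensemble mean, following Fortin et al., 2014) can be sketched as follows; the synthetic Gaussian data are illustrative only.

```python
import math
import random

def ensemble_mean_rmse(members, truth):
    """RMSE of the ensemble mean against the verifying values."""
    n = len(truth)
    mean = [sum(m[i] for m in members) / len(members) for i in range(n)]
    return math.sqrt(sum((mu - t) ** 2 for mu, t in zip(mean, truth)) / n)

def ensemble_spread(members):
    """Square root of the average (over points/cases) ensemble variance."""
    k, n = len(members), len(members[0])
    mean = [sum(m[i] for m in members) / k for i in range(n)]
    var = [sum((m[i] - mean[i]) ** 2 for m in members) / (k - 1)
           for i in range(n)]
    return math.sqrt(sum(var) / n)

# For a statistically consistent ensemble (truth exchangeable with the
# members), spread and ensemble-mean RMSE nearly coincide.
random.seed(0)
k, n = 20, 5000
members = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]
truth = [random.gauss(0, 1) for _ in range(n)]
ratio = ensemble_mean_rmse(members, truth) / ensemble_spread(members)
```

An under-dispersive ensemble like the IC ensemble discussed above would give a ratio well above one.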
The learnable weights for convolutions in the CNN are initialized as small random values, and we can exploit this randomness by repeatedly retraining with different initial seeds to produce a family of DLWP models with slightly different final weights. The models in this family are capable of making approximately equally skillful forecasts, but with enough statistical independence to produce a good ensemble. Scher and Messori (2020) pursued a similar strategy, training and re-training models to both emulate a simple GCM and to forecast the real atmosphere with data from ERA5.

As part of its holistic estimation of the next atmospheric state, our CNN-based DLWP model effectively captures physical processes that are parameterized in operational NWP models. For example, as noted in WDC20 (see also Fig. 5c,d), our DLWP model does an excellent job of forecasting 2-m temperatures, including their diurnal cycle, without using any explicit parameterization of boundary layer processes, and without including most of the meteorological fields that would be used in such parameterizations. We therefore refer to the ensembles in which the CNN filter coefficients are randomly perturbed during training as "stochastically perturbed" (SP) ensembles, although the nature of the induced stochastic variations differs from that in operational NWP.

Rather than completely retrain each member of our SP ensemble from new random seeds, we gained efficiency by using intermediate results produced during the training process. Our DLWP model is trained using the adaptive learning scheme Adam (Kingma & Ba, 2014). Figure 2 shows the learning curve for our loss function (mean squared error, see WDC20) as a function of the training epoch number for a training cycle representative of that for one of our DLWP ensemble members.
As expected, the error on the training set decreases smoothly, while the error on the validation set oscillates much more, strongly suggesting the learned weights in the model undergo nontrivial changes over each training epoch. The variations induced by these changes in the weights turn out to be sufficient to provide many useful members in an SP ensemble, and we exploited these variations as follows.

After at least 100 training epochs, we selected multiple potential ensemble members from a single training cycle by checkpointing every 10 epochs and saving the model's weights at each checkpoint. Using these checkpointed weights, we tested each resulting model's utility for S2S forecasts by evaluating its average global T2m anomaly correlation coefficient (ACC) score at 4-week lead time from twice-a-week forecasts over the full four-year validation set. The 4-week T2m ACC scores for the checkpointed models were often as good as (or occasionally even better than) those for the final trained model², and there was nearly as much ensemble spread between the various checkpoints of one training cycle as there was between models generated by separate training cycles.

Our SP ensemble used a total of 32 slightly different DLWP models generated by eight training cycles with four members drawn from each cycle. The members selected from each cycle had the best 4-week T2m ACC scores, although three of the 32 members selected in this fashion required further refinement because in a few individual forecasts they produced fields with unrealistic structures even though their numerical values remained bounded within reasonable limits. Those three members were further trained for two more epochs using a re-initialized stochastic gradient descent (SGD) optimizer, which consistently fixed the issue with nonphysical solutions. The SGD cycles also lowered the training and validation loss slightly, as exemplified over the last 15 epochs of the learning curve in Fig. 2.

² Although the final trained model had the lowest validation-set loss during training, the training loss function is based solely on the RMSE over the first 24 hours of forecast lead time; it does not depend on the performance of a 4-week forecast.
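A minimal sketch of the checkpoint-screening step described above: compute a global ACC for each checkpointed model and keep the best few per training cycle. The ACC formula is the standard anomaly correlation; the checkpoint names and scores below are hypothetical.

```python
import math

def acc(forecast, verification, climatology):
    """Anomaly correlation coefficient: correlation of forecast and
    verifying anomalies, both taken relative to the same climatology."""
    fa = [f - c for f, c in zip(forecast, climatology)]
    va = [v - c for v, c in zip(verification, climatology)]
    num = sum(f * v for f, v in zip(fa, va))
    den = math.sqrt(sum(f * f for f in fa) * sum(v * v for v in va))
    return num / den

def best_checkpoints(scores, n=4):
    """Keep the n checkpoints with the highest mean 4-week ACC."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical 4-week T2m ACC scores for checkpoints of one training cycle
scores = {"epoch110": 0.21, "epoch120": 0.25, "epoch130": 0.19,
          "epoch140": 0.27, "epoch150": 0.23}
selected = best_checkpoints(scores)
```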
Figure 2. Example of the CNN learning curve for a representative DLWP model. The loss, which is the mean squared error over all target outputs of the model, evaluated on the training (validation) set is shown as a function of training epoch number in blue (orange). The optimizer was switched from Adam to a standard SGD optimizer after 200 epochs, producing an abrupt small decrease in the loss function on both data sets and more uniform values on the validation set.
The RMSE and spread of the SP ensemble are plotted in Fig. 1. It clearly outperforms the IC and IC×2 ensembles.
When considering the effectiveness of ensemble forecasts, it is useful to compare the ensemble mean to a single control member. For example, in comparison to the other ensemble members, the ECMWF control forecast is run at higher horizontal (9 km) and vertical (137 levels) resolution, and without perturbations to the initial conditions. This control forecast might nominally be expected to perform better than a typical ensemble member that must be run at lower resolution because of computational constraints. Our control forecast is trained to better minimize the loss function using an Adam optimizer and learning rates that decrease as the weights approach their optimum values. In particular, once the validation-set loss does not decrease for 20 epochs, the learning rate of the optimizer is reduced by a factor of 5. This continues (down to a prescribed minimum learning rate) until a criterion of no reduction in validation loss after 50 epochs is met. The result is a model with weights that better minimize the loss function and produces good forecasts, although in contrast to the ECMWF high-resolution control, the model used for our control forecast does not have significant advantages relative to the other ensemble members. The control forecast is initialized with the control ERA5 reanalysis data.

Unlike global climate models, NWP models are designed to make accurate predictions over relatively short forecast lead times without worrying about certain physical constraints, such as global radiative balance, that would be necessary for long-term climate simulations. As a consequence, NWP models are typically subject to systematic drift in long-term (including sub-seasonal) forecasts.
For example, previous versions of the ECMWF S2S ensemble have been shown to develop pronounced spatially dependent patterns of mean model drift on time scales of 1–4 weeks (Vitart, 2004; Weigel et al., 2008). To compensate for this model bias, all of the major weather prediction centers produce reforecasts, or hindcasts, using their S2S ensemble prediction systems (Vitart et al., 2017), and use these reforecasts to calibrate the operational forecast products. As an example, Vitart (2004) computed the bias for a given calendar date by performing hindcasts using the full ensemble initialized on the same calendar date in each of 12 previous years, and removed this bias from the forecasts in a post-processing step.

Using a strategy similar to that in Vitart (2004), we bias-corrected each field by first computing ensemble reforecasts twice weekly for each DLWP SP ensemble member (and the control) on the same set of calendar dates spanning the years 1991–2015, which includes the training set and most of the validation set, but does not bleed into the 2017–2018 test set. Then, for all reforecast dates, each member's spatially varying bias is calculated as the average of its bias on that date over the 25-year period. The bias for a specific forecast date in the test set is taken as the average reforecast bias over all available calendar dates spanning the 28-day interval centered on that forecast date, and this bias is subtracted from the forecast produced by each of the corresponding ensemble members.

Figure 3 shows spatial patterns of the annual-average model bias in 2-m temperature for the 32-member DLWP SP ensemble at forecast lead times up to 6 weeks. Warm biases are present over the northern hemisphere land masses, along with a cold bias over Antarctica. There are also warm biases in subtropical regions commonly dominated by marine stratocumulus clouds off the Pacific coasts of North and South America.
These biases gradually amplify as the forecast lead time increases, although the globally averaged spatial-mean bias (noted in each panel of Fig. 3) decreases at longer lead times. The tendency of increasing local biases to better cancel in the global mean at longer lead times is interesting and perhaps surprising because the model is only trained to minimize T2m errors over the first 24 hours of the forecast; no global energy-balance constraints are imposed.

Figure 3. Bias in 2-m temperature (ensemble mean − observation, K) averaged over 25 years of re-forecasts from all members of the DLWP SP ensemble: one-week averages for forecast lead times of a) 2 weeks (days 8–14; mean 0.094 K), b) 3 weeks (days 15–21; mean 0.069 K), c) 4 weeks (days 22–28; mean 0.055 K), and d) a 2-week average for the week 5–6 forecast (days 29–42; mean 0.049 K).

Bias correction has a positive impact on the control forecast and on the IC, SP, and grand ensembles. Although the RMSE and spread of the SP and grand ensembles are almost identical over the first 14 days, at longer lead times, and particularly after bias correction, the grand ensemble is clearly superior to the SP ensemble (not shown). The performance of the grand ensemble will, therefore, be our focus throughout the remainder of this paper.

In addition to the persistence and climatology benchmarks, which serve as baselines that must be exceeded by any skillful forecast, we will also compare our results against the state-of-the-art ECMWF 50-member S2S ensemble and a higher-resolution ECMWF control simulation (Vitart et al., 2017). Errors are computed with respect to ERA5 data that is downloaded at 1° resolution, transformed onto our cubed-sphere grid, and then transformed back to a 1.5° × 1.5° latitude-longitude grid for comparison with the archived ECMWF S2S forecasts of Z500 and T2m. Bias correction was also performed on the ECMWF S2S control and ensemble forecasts on the 1.5° × 1.5° grid.
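The reforecast-based bias correction described above can be sketched as follows, using scalar values per calendar date for clarity (real fields would be spatial maps); the date keys and numbers are illustrative.

```python
from datetime import date, timedelta

def reforecast_bias(reforecasts, analyses):
    """Mean bias (forecast minus verifying analysis) for each reforecast
    calendar date, averaged over all available years (1991-2015 in the
    paper). Both inputs map (month, day) -> list of values, one per year."""
    bias = {}
    for cal_date, fcsts in reforecasts.items():
        errors = [f - a for f, a in zip(fcsts, analyses[cal_date])]
        bias[cal_date] = sum(errors) / len(errors)
    return bias

def bias_correct(forecast_value, forecast_date, bias):
    """Subtract the mean reforecast bias averaged over all reforecast
    dates falling in the 28-day interval centered on the forecast date."""
    window = [forecast_date + timedelta(days=d) for d in range(-14, 14)]
    nearby = [bias[(d.month, d.day)] for d in window
              if (d.month, d.day) in bias]
    return forecast_value - sum(nearby) / len(nearby)

refc = {(1, 2): [1.2, 1.4], (1, 5): [0.8, 1.0]}   # two reforecast years
anal = {(1, 2): [1.0, 1.0], (1, 5): [1.0, 1.0]}
b = reforecast_bias(refc, anal)                   # approx. {(1,2): 0.3, (1,5): -0.1}
corrected = bias_correct(10.0, date(2017, 1, 3), b)
```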
Table 2. Comparison of key attributes of our DLWP ensemble and those of the state-of-the-art ECMWF ensemble for extended-range forecasting.

Attribute | DLWP | ECMWF
Atmospheric fields | 6 2-D variables | 9 prognostic 3-D variables; 91 vertical levels
Horizontal resolution | 150 km | 18 km (36 km after day 15)
Atmospheric physics | 3 prescribed inputs | Many physical parameterizations
Coupled models | None | Ocean, wave, and sea ice models
Initial condition perturbations | 10 (ERA5 uncertainty) | 50 (SVD/4DVAR)
Model perturbations | "Stochastic" CNN weights | Stochastic physics
Ensemble members | 320 (+control) | 50 (+control)
The following summarizes the construction of the DLWP grand ensemble.

1. Eight distinct training cycles of the DLWP CNN were produced with different random seeds as a first step in generating 32 stochastically perturbed (SP) models.
2. Four checkpoints during each of the eight training cycles were selected based on T2m ACC skill as individual SP ensemble models.
3. The model associated with each checkpoint was run on the validation data set to produce 416 four-week forecasts, which were then manually inspected for forecast quality. Any model that displayed irregularities was further trained with an SGD optimizer. The collection of models given by these checkpoints formed the 32-member SP ensemble of models. (The SGD optimizer was used in 3 of the 32 SP ensemble members.)
4. Each of the 32 SP models was run with each of the 10 initial conditions (ICs) given by the perturbed reanalyses in the ERA5 product to yield the 320-member grand ensemble.
5. A single control DLWP model was trained slightly differently, by periodically reducing the Adam optimizer learning rate.
6. The mean model bias for re-forecasts in the period 1991–2015 was computed for the control and each SP model. That bias was removed from all the 2017–2018 test-set forecasts.

Finally, a tabular comparison between our DLWP ensemble and the current state-of-the-art ECMWF ensemble is provided in Table 2. In most regards, the ECMWF ensemble is superior, with higher resolution, coupled ocean and wave models, complete physics, more atmospheric variables, and better initial condition perturbations. However, our DLWP ensemble consists of far more ensemble members, with 320 plus control, compared to only 50 plus control for the ECMWF ensemble.
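The grand-ensemble construction in step 4 above is simply the Cartesian product of models and initial conditions; a sketch with hypothetical member names:

```python
from itertools import product

# 8 training cycles x 4 checkpoints = 32 stochastically perturbed (SP) models
sp_models = [f"cycle{c}_ckpt{k}" for c in range(8) for k in range(4)]
# 10 perturbed ERA5 analyses supply the initial conditions (ICs)
initial_conditions = [f"era5_member{m}" for m in range(10)]

# Every SP model is run from every IC: 32 x 10 = 320 grand-ensemble members
grand_ensemble = list(product(sp_models, initial_conditions))
```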
Before discussing the performance of the DLWP ensemble on S2S time scales we briefly assess the qualitative skill of the model in a short deterministic forecast. We then assess the quantitative skill of the DLWP control and grand ensemble in two-week forecasts relative to the state-of-the-art 50-member ECMWF S2S ensemble and the standard benchmarks of climatology and persistence.

Figure 4 compares a 4.5-day global forecast of Z500 and Z1000 with the verifying analysis for 12:00 UTC, 11 September 2017, when Hurricane Irma was located in the southeastern US.

Figure 4. Fields of Z500 (color contours, spanning 500 to 580 dm) and Z1000 (black contours at 100 m intervals with negative values dashed) for (a) a 4.5-day forecast and (b) the verification at 12:00 UTC, 11 September 2017. The blue curve is the 540-dm contour for Z500.

Irma is farther north and weaker in the DLWP forecast, but still reasonably well represented for a 4.5-day forecast of such a small-scale disturbance. Other features, such as the 500-hPa cutoff low west of California and the pair of short waves in the Gulf of Alaska and west of Hudson Bay, are also reasonably represented. On the other hand, the 500-hPa cutoff low over Nova Scotia is much weaker and farther west than in the verification, and the DLWP forecast does not develop the associated surface cyclone. The closed surface low in the DLWP forecast over the Dominican Republic is the model's forecast for Hurricane Jose, which is stronger and farther south than in the verification. While neither perfect nor the equal of a state-of-the-art operational NWP forecast, on balance the DLWP forecast is arguably still impressive given that it is computed at 1.4° resolution using just 6 prognostic variables, each defined on a single spherical shell.

Turning to the quantitative verification of the first two weeks of forecast lead time, our cases are chosen to match the available forecast initialization times from the operational S2S forecast runs at ECMWF. The DLWP model is therefore tested on 208 forecasts initialized twice weekly starting at 00 UTC 2 January 2017, followed by the 5th, 9th and 12th of January, and so on, through the end of 2018.

RMSE scores for Z500 are compared in Fig. 5a. The DLWP grand ensemble remains superior to climatology through 14 days, though unsurprisingly, its error exceeds that of the ECMWF S2S ensemble. Both the DLWP and ECMWF control forecasts perform worse than their respective ensembles, and the control-to-ensemble improvement is qualitatively similar for both systems.
At lead times beyond 9 days, the DLWP ensemble performs better than the ECMWF control. Qualitatively similar behavior is apparent in the Z500 anomaly correlation coefficient (ACC) scores shown in Fig. 5b, although the ACC score for the ECMWF ensemble remains superior to that of the DLWP ensemble until almost day 13. The persistence forecast is grossly inferior.
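The ACC used throughout this comparison is not written out in the text; the sketch below assumes the standard definition, the latitude-weighted correlation of forecast and verification anomalies about climatology (the function and its interface are ours, not the paper's):

```python
import numpy as np

def acc(forecast, verif, clim, lats):
    """Latitude-weighted anomaly correlation coefficient.

    forecast, verif, clim: (nlat, nlon) fields on a latitude-longitude
    grid; lats: latitudes in degrees for each row. Anomalies are taken
    relative to climatology before correlating, and grid cells are
    weighted by cos(latitude), matching the area weighting described
    in the text.
    """
    w = np.cos(np.deg2rad(np.asarray(lats)))[:, None]
    fa = np.asarray(forecast) - clim  # forecast anomaly
    va = np.asarray(verif) - clim     # verifying anomaly
    num = np.sum(w * fa * va)
    den = np.sqrt(np.sum(w * fa**2) * np.sum(w * va**2))
    return num / den
```

A forecast identical to the verification scores ACC = 1, and a forecast whose anomalies are the negative of the verifying anomalies scores ACC = -1.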
Figure 5.
Forecast error as a function of time for the DLWP control member (green) and grand ensemble mean (blue), the ECMWF S2S control (dot-dashed red) and ensemble mean (dot-dashed orange), along with persistence (pink) and climatology (gray) benchmarks for twice-weekly forecasts during 2017–2018. Panels show Z500: (a) RMSE (m) and (b) ACC; daily-averaged T2m: (c) RMSE (K) and (d) ACC; T850 at 00 UTC: (e) RMSE (K) and (f) ACC. The error is area-weighted in latitude and globally averaged.
Turning now to the daily-averaged surface temperature field, which has a far more complex structure than Z500, Fig. 5c,d shows that the performance of the DLWP models relative to the ECMWF system remains similar to that for Z500. The ECMWF S2S ensemble gives the best results, with its RMSE, together with that of the DLWP grand ensemble, remaining below the climatological benchmark through 14 days. Note that the ECMWF product uses slightly different initial conditions, which explains the substantial error in the ECMWF RMSE at early lead times. The ACC of the DLWP ensemble again becomes superior to that of the ECMWF control at long lead times (after 11 days).

Scores for T850 at 00 UTC are plotted in Fig. 5e,f. This is a more difficult field to forecast in the sense that the persistence RMSE exceeds the climatological benchmark at shorter lead times than for Z500 or T2m, and the ACC score for persistence drops below 0.4 in just two days. Nevertheless, the RMSE for the DLWP and ECMWF ensembles again remains below climatology for 14 days, with the ECMWF ensemble performing best. The errors in the DLWP ensemble also drop below those in the ECMWF control at earlier lead times than for the other fields: 7 days for RMSE and 10 days for ACC.

Globally and temporally averaged ACC scores for T2m and T850 at forecast lead times of 3–4 and 5–6 weeks are compared in Fig. 6 for the DLWP and ECMWF models, along with the persistence benchmark, for the 2017–2018 test set. Persistence forecasts are computed by averaging the observed anomalies from the 14 days prior to the forecast initialization date. At both lead times, scores are higher for T2m than for T850, reflecting greater memory in the system for near-surface temperatures than for those aloft. One source of this memory is sea surface temperatures, which are closely tied to T2m over the oceans.
Unlike the ECMWF S2S model, our DLWP model does not currently include coupling with the ocean, and as a consequence, our T2m forecast over the ocean is essentially a proxy forecast for SST. This may be one reason why the DLWP control is the only model in Fig. 6a,b that performs worse than persistence. Nevertheless, for T2m forecasts at both lead times, the DLWP ensemble is superior to both persistence and the ECMWF control. Unsurprisingly, the ECMWF ensemble gives the best results at both lead times, with an averaged ACC of roughly 0.5 for T2m at 3–4 weeks. Turning to T850, the performance of the DLWP ensemble relative to the other forecasts is better than for T2m. The DLWP ensemble is again superior to the ECMWF control at both lead times, and more impressively, at weeks 5–6 it is in a statistical tie with the full ECMWF S2S ensemble in the sense that the 95% confidence intervals for the forecasts overlap (black bars in Fig. 6d).

ACC scores for the best and worst individual forecasts for each period are shown by the black dots in Fig. 6. The ECMWF ensemble has the "best" worst forecasts, or the least susceptibility to bust forecasts, except for 5–6-week T850, for which its ACC of −0.22 is worse than the −0.18 value for the DLWP ensemble. At weeks 5–6, the best individual forecasts for both the DLWP and ECMWF ensembles have ACC scores of about 0.8 for T2m and about 0.6 for T850. The globally averaged ACC for the 3–4-week T2m forecasts exceeds 0.5 roughly 25% of the time for the DLWP grand ensemble and 50% of the time for the ECMWF ensemble. (Only the daily-averaged T2m field was available on the ECMWF archive; as shown in WDC20, the DLWP model does capture diurnal temperature variations.)

Figure 6.
Anomaly correlation coefficient of 2-week-averaged forecasts from the DLWP control (green) and grand ensemble mean (blue), the ECMWF S2S control (red) and ensemble mean (orange), and persistence (pink) for forecasts made twice weekly in 2017–2018. Panels show T2m for (a) weeks 3–4, (b) weeks 5–6, and T850 for (c) weeks 3–4, (d) weeks 5–6. Scores are area-weighted in latitude and globally averaged. Black lines on each bar represent the 95% confidence interval computed using bootstrapping with 10,000 iterations. The black dots show the lowest and highest scores among the 208 forecasts.

Figure 7.
Predicted 2-week average anomalies in T2m relative to climatology for the period 3–4 weeks after 27 September 2018. Forecasts are from the DLWP (a) control and (b) grand ensemble, the ECMWF S2S (c) control and (d) ensemble, and (e) persistence; (f) is the verification. The 'rmse' and 'acc' numbers are global averages; the rmse is the root-mean-squared error of the anomalies.

An example of a good (though not the best) 3–4-week T2m anomaly forecast for both ensembles is shown in Fig. 7. These forecasts were initialized on 27 September 2018. The intensity and spatial variability in the DLWP control forecast are grossly similar to, though modestly weaker and smoother than, those in the ECMWF control, demonstrating that the DLWP forecast is not simply approaching a smooth climatology at long forecast lead times. Similarly, the forecast from the 320-member DLWP grand ensemble exhibits much of the intensity and spatial variability in the 50-member ECMWF ensemble forecast.

Anomalies that both verify (Fig. 7f) and are common to the DLWP (Fig. 7b) and ECMWF S2S (Fig. 7d) ensembles include cold in Greenland and warmth in eastern Australia, eastern Siberia, and over the adjacent Arctic Ocean. (Although the horizontal resolution of the ECMWF S2S ensemble members is 31 km beyond lead times of 15 days, the data are archived at coarser resolution and all forecasts are displayed on a 1.5° × 1.5° latitude-longitude grid.) One place where the ECMWF
S2S ensemble clearly outperforms the DLWP ensemble is over North America, where it better captures the observed large and intense cold anomaly. Another important superiority of the ECMWF ensemble lies in its forecast of a developing El Niño over the equatorial Pacific, although the absence of the El Niño in the DLWP ensemble forecast is not particularly surprising because it does not include any oceanic data.
Having just examined the ACC scores of the ensemble mean, we now investigate the performance of ensemble-produced probabilistic forecasts using two metrics: the continuous ranked probability score (CRPS) and the ranked probability skill score (RPSS).
Denoting the ensemble probability distribution function of a forecast for some variable x as ρ(x), the cumulative distribution function (CDF) associated with ρ is

P(x) = \int_{-\infty}^{x} \rho(y)\, dy.    (1)

If the verifying value occurs at x_a, a CDF for the observation may be defined as

P_a(x) = H(x - x_a),    (2)

where

H(s) = \begin{cases} 0, & s < 0, \\ 1, & s \ge 0, \end{cases}    (3)

and the CRPS is

\mathrm{CRPS}(P, x_a) = \int_{-\infty}^{\infty} \left[ P(x) - P_a(x) \right]^2 dx.    (4)

The CRPS penalizes both overly narrow (confident) forecast distributions that verify incorrectly and overly broad (uncertain) forecast distributions, regardless of the accuracy of the ensemble mean. The CRPS has several desirable properties. First, it is a proper statistical score (Gneiting & Raftery, 2007), meaning that the CRPS is optimized by a forecast that predicts the correct probability distribution of the predicted variable. Second, in contrast to categorical scores, such as the RPSS (see Section 5.2.2), it accounts for information across all possible values of x. Finally, the CRPS reduces to the mean absolute error for a single deterministic forecast, allowing the performance of ensemble and deterministic forecasts to be easily compared.

Figure 8 compares the CRPS in T850 for forecasts from the DLWP and ECMWF S2S ensembles and control forecasts, along with persistence and climatology benchmarks, averaged over one week for lead times of 2, 3 and 4 weeks, along with a two-week average for a lead time of 5–6 weeks. As with our earlier results for the T850 RMSE and ACC of the ensemble mean (Fig. 5), the ECMWF ensemble clearly gives better week-2 CRPS scores than the DLWP ensemble. At week 3, when deterministic forecast skill from initial conditions has largely eroded, the ECMWF ensemble continues to give a slightly better global mean result. But by week 4 and weeks 5–6, the DLWP ensemble has caught up and is essentially tied with ECMWF (Fig. 8a).
At lead times of three weeks or longer, the next-best global-mean forecasts are given by climatology, which outperforms the ECMWF and DLWP control forecasts, which are in turn better than persistence. Note that all CRPS scores except persistence improve significantly from week 4 to weeks 5–6 due to the longer two-week averaging window.

Focusing on the tropics, from 20°S to 20°N, the CRPS for all models is substantially improved, with numerical values roughly 0.4 times the corresponding global results
Figure 8.
Continuous ranked probability score (CRPS; lower is better) in T850 from the DLWP control (green circles), DLWP grand ensemble (blue pentagons), ECMWF S2S control (red downward-pointing triangles), ECMWF ensemble (orange upward-pointing triangles), persistence (pink crosses), and climatology (gray squares), as a function of averaged forecast lead time. Panels show: (a) global average, annual mean; (b) average over the tropics (20°S–20°N), annual mean; (c) average over the northern hemisphere extra-tropics (30°N–90°N), mean of forecasts initialized in DJF; (d) average over the NH extra-tropics, JJA. Error bars correspond to the 95% confidence interval determined by bootstrapping with 10,000 samples.
(Fig. 8b). The ECMWF and DLWP ensembles still perform the best, and the relative performance of the various forecast systems is similar to the globally averaged results, although the ECMWF ensemble scores are slightly improved relative to the DLWP ensemble. CRPS scores are worse in the northern hemisphere extra-tropics, 30°N–90°N (Fig. 8c,d), and there is a pronounced seasonality in performance. The scores are worse in boreal winter, and the performance of the DLWP ensemble relative to the ECMWF ensemble is also worse. On the other hand, in boreal summer the CRPS values improve, and the DLWP ensemble ties ECMWF in weeks 3, 4 and 5–6. This seasonal difference in extra-tropical performance suggests that the DLWP model performs worse relative to ECMWF when synoptic-scale dynamics exert more influence on the weather.
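For a finite ensemble, the integral in (4), with P taken as the ensemble's empirical CDF, has a well-known closed form (Gneiting & Raftery, 2007) that a minimal implementation can use directly; the function below is our sketch, not code from the paper:

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of eq. (4) for an ensemble's empirical CDF.

    Uses the identity CRPS = E|X - obs| - (1/2) E|X - X'|, where X and
    X' are independent draws from the ensemble; this equals the
    integral of [P(x) - P_a(x)]^2 when P is the empirical CDF.
    """
    x = np.asarray(members, dtype=float)
    term1 = np.abs(x - obs).mean()
    # mean over all member pairs, including the zero i == j terms
    term2 = np.abs(x[:, None] - x[None, :]).mean()
    return term1 - 0.5 * term2

# For a single deterministic forecast the CRPS reduces to the
# absolute error, as noted in the text:
assert crps_ensemble([3.0], 5.0) == 2.0
```

For a two-member ensemble {0, 2} verifying at 1, both the closed form and a direct integration of (4) give 0.5.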
The other metric we use to evaluate the ensemble forecasts is the ranked probability skill score (RPSS). To compute the RPSS, K forecast categories are first defined. Then let y_i be the probabilistic forecast of the event occurring in category i; let c_i be the climatological probability of the event falling in category i, and bin the verification such that o_i = 1 if the event was observed to be in category i and o_i = 0 otherwise. The kth components of the cumulative forecast, climatological, and observational distributions Y_k, C_k, and O_k are evaluated for each of the K categories as

Y_k = \sum_{i=1}^{k} y_i, \qquad C_k = \sum_{i=1}^{k} c_i, \qquad O_k = \sum_{i=1}^{k} o_i.

Ranked probability scores for the forecast (RPS) and climatology (RPS_C) are computed as

\mathrm{RPS} = \sum_{k=1}^{K} (Y_k - O_k)^2,    (5)

\mathrm{RPS}_C = \sum_{k=1}^{K} (C_k - O_k)^2,    (6)

and finally, using angled brackets to denote the average over all forecast-observation pairs, the RPSS is defined as

\mathrm{RPSS} = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS}_C \rangle}.    (7)

The RPS is zero for a perfect forecast and increases positively otherwise; therefore, the RPSS for a perfect forecast is one and decreases otherwise. Normalizing ⟨RPS⟩ by ⟨RPS_C⟩ in (7) sets to zero the threshold value below which there is no skill relative to climatology. Like the CRPS, the RPSS is a proper score, but it does depend on the definition of categories. The RPSS is sensitive to the size of the ensemble, having a negative bias for small ensemble sizes (Weigel et al., 2007). Although both the grand-ensemble size of 320 and the ensemble size of 50 for ECMWF are large enough to mitigate such bias, in lieu of (7) we will use the debiased formulation of the RPSS (Weigel et al., 2007), which for ensemble size M is

\mathrm{RPSS} = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS}_C \rangle + D/M},    (8)

where, for K equally likely categories,

D = \frac{K^2 - 1}{6K}.    (9)

We compute the RPSS for T2m and T850, binning into three climatologically equally likely terciles of below-, near-, and above-normal relative to the baseline period of 1981–2010. Tercile bounds for T2m were determined from daily-averaged values (using the times 0, 6, 12, and 18 UTC) for each date in the 30-year record. These daily tercile bounds are then averaged over each one- or two-week verification period. For T850, only the instantaneous 0 UTC values were available from the ECMWF S2S database, therefore a
Figure 9.
One- or two-week averaged ranked probability skill score (RPSS; higher is better) for T850 at the indicated forecast lead times. DLWP grand ensemble (blue circles) and ECMWF S2S ensemble (orange triangles) averaged over (a) the globe, annual mean; (b) the tropics (20°S–20°N), annual mean; (c) the NH extra-tropics (30°N–90°N), mean of forecasts initialized in DJF; (d) the NH extra-tropics, mean over JJA; (e) and (f), as in (a) and (b) except spatially averaged only over land. Error bars correspond to the 95% confidence interval determined by bootstrapping with 10,000 samples.

separate climatology is computed from only 0 UTC values to evaluate the ECMWF model (the DLWP forecasts are daily-averaged and evaluated using terciles computed from the daily averages). Because each of the forecast categories is, by design, equally likely, the climatological forecast is simply a 33% likelihood of occurrence in each of the categories. Note that because (5) and (6) use cumulative distributions, events verifying in the near-normal category have a lower expected random-chance forecast error than events verifying in either the below- or above-normal categories.

Spatially and temporally averaged RPSS scores in T850 from the DLWP and ECMWF S2S ensembles are shown in Fig. 9. The globally averaged RPSS (Fig. 9a) for the DLWP ensemble is well above the zero threshold for random chance at all lead times. Comparing Figs. 8 and 9, and recalling that, in contrast to the RPSS, lower CRPS scores are better, both metrics show similar variations in ensemble skill with forecast lead time in all regions and over all time windows. The superiority of the ECMWF ensemble is, nevertheless, apparent at all lead times.
Table 3.
RPSS scores for T850 for three S2S lead times, debiased for ensemble size, averaged globally for twice-weekly forecasts in years 2017–2018 from the ECMWF and DLWP ensembles. The column ECMWF 2011 gives the same, except for reforecasts covering the years 1995–2001 using the ECMWF ensemble as formulated in 2011.

Days     ECMWF          DLWP           ECMWF 2011
12–18    0.… ± 0.017    0.… ± 0.016    0.… ± 0.…
19–25    0.… ± 0.016    0.… ± 0.016    0.… ± 0.…
26–32    0.… ± 0.015    0.… ± 0.015    0.… ± 0.…

These scores use a slightly different region (…°N) and slightly different weekly periods of 12–18, 19–25 and 26–32 days. In Table 3, the performance of the DLWP and current ECMWF S2S ensembles is compared with data taken from Fig. 12a of Vitart (2014), which gives the average performance of a five-member ensemble of reforecasts using the 2011 ECMWF forecast system in twice-weekly forecasts over the years 1995–2001. Because sub-seasonal forecast skill can vary from year to year depending on slowly evolving large-scale circulation patterns, caution must be used when comparing their results averaged over 1995–2001 to our calculations for 2017–2018. Nevertheless, within the limitations of such a comparison, it is worth noting that the DLWP grand ensemble performs better than the 2011 ECMWF forecast system at lead times of 19–25 and 26–32 days.

The global pattern of RPSS scores, averaged over all of the forecasts in the 2017–2018 test set, is shown by the maps of RPSS scores for both ensembles at weeks 3, 4, and 5–6 in Fig. 10. As expected from the plot of global average scores in Fig. 9a, the ECMWF ensemble is superior to the DLWP ensemble, with the two becoming more alike and showing more skill in the forecasts averaged over the longer two-week period, weeks 5–6. Particularly at weeks 5–6 (Fig. 10e,f), the distributions of the low- and high-skill regions in the DLWP and ECMWF ensembles are quite similar. Both ensembles perform almost the same over land, and both do very well over the Southern Ocean. The DLWP ensemble shows skill in the tropical oceans, but the ECMWF ensemble does much better in that region, and it largely avoids the loss of skill suffered by the DLWP ensemble over the adjacent subtropical waters. As mentioned previously, our DLWP model likely suffers from the absence of information about SSTs.

The seasonal variation in RPSS at 5–6 weeks is shown in Fig. 11.
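A minimal sketch of the debiased RPSS of (5)–(9), assuming the setup described above of K equally likely climatological categories (the function and its interface are ours, not the paper's):

```python
import numpy as np

def rpss_debiased(forecast_probs, obs_cats, ens_size):
    """Debiased RPSS of eqs. (5)-(9) for equally likely categories.

    forecast_probs: (N, K) forecast probabilities per category
    obs_cats:       (N,) index of the verifying category
    ens_size:       ensemble size M used in the debiasing term D/M
    """
    y = np.asarray(forecast_probs, dtype=float)
    n, K = y.shape
    o = np.zeros_like(y)
    o[np.arange(n), obs_cats] = 1.0
    c = np.full(K, 1.0 / K)                  # climatological probabilities
    Yk, Ok, Ck = y.cumsum(1), o.cumsum(1), c.cumsum()
    rps = ((Yk - Ok) ** 2).sum(1).mean()     # <RPS>,   eq. (5)
    rps_c = ((Ck - Ok) ** 2).sum(1).mean()   # <RPS_C>, eq. (6)
    d = (K ** 2 - 1) / (6 * K)               # eq. (9)
    return 1.0 - rps / (rps_c + d / ens_size)  # eq. (8)

# A perfect categorical forecast gives RPSS = 1 regardless of M:
assert rpss_debiased([[0.0, 0.0, 1.0]], [2], 320) == 1.0
```

Note that under (8) a climatological forecast scores slightly above zero, since the debiasing term inflates the denominator; this is the intended compensation for the finite-ensemble penalty described by Weigel et al. (2007).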
The performance of the DLWP and ECMWF ensembles is most similar in MAM, with global RPSS averages of 0.180 and 0.191, respectively, and analogous spatial patterns of high and low skill. As in the annual mean (Fig. 10e,f), the skill is relatively high over the tropical and southern oceans. The seasonal averages show more pronounced localized regions of high and no skill (RPSS < 0) than the annual mean.
Figure 10.
Annual average over the 2017–2018 testing period of T850 RPSS scores. The left (right) column shows the DLWP (ECMWF) ensemble at forecast lead times of (a), (b) 3 weeks; (c), (d) 4 weeks; and (e), (f) 5–6 weeks. The weighted global mean is noted in each panel.
Figure 11.
Maps of seasonally averaged RPSS scores in T850 during the testing period of 2017–2018. The left (right) column shows the DLWP (ECMWF S2S) ensemble at 5–6 weeks forecast lead time for (a), (b) DJF; (c), (d) MAM; (e), (f) JJA; and (g), (h) SON. The weighted global mean is noted in each panel.
Figure 12.
Annual average RPSS skill maps for T2m at weeks 5–6. Without bias correction: (a) DLWP ensemble, (b) ECMWF ensemble; with bias correction: (c) DLWP ensemble, (d) ECMWF ensemble. The weighted global mean is noted in each panel.

Finally, we consider surface temperature anomalies, which, as might be expected given the pronounced model drift evident in Fig. 3, are significantly improved by bias correction. Maps of RPSS for both the DLWP and ECMWF ensemble forecasts for weeks 5–6 are shown in Fig. 12. Bias correction significantly improves the RPSS over land for both ensembles, with much of that improvement in the ECMWF model coming from regions with topography, where removing the bias helps correct for differences in the way the orography is represented at different grid resolutions. Bias removal also makes large improvements over the oceans in the DLWP ensemble, perhaps partly compensating for the lack of SST data. The regions of highest skill in the bias-corrected forecasts from both ensembles are mostly over the oceans, particularly in the tropics. The global mean RPSS values for both bias-corrected ensembles, 0.155 for DLWP and 0.287 for ECMWF, are positive, indicating modest skill relative to climatology. The results for the bias-corrected ECMWF ensemble (Fig. 12d) are generally consistent with the distribution of RPSS scores from an earlier version of the ECMWF ensemble in Weigel et al. (2008), which showed skill predominantly over tropical oceans after week 2, despite accounting for bias correction in the forecasts. Improvements in the ECMWF model since Weigel et al. (2008) have, nevertheless, led to generally higher RPSS values over the oceans and much better performance over the western Pacific and Indian Oceans, as would be apparent in a comparison of these results with their Fig. 4, although such a comparison must be qualified by noting that the current forecasts and those in Weigel et al. (2008) verify in different years.

The tendency of the 5–6-week bias-corrected T2m ECMWF ensemble forecast to perform similarly to the DLWP forecast over land, and better over the oceans, is again apparent in the seasonal results for SON shown in Fig. 13. Note in particular the negative RPSS score over the equatorial eastern Pacific Ocean, which arises because the DLWP ensemble fails to correctly capture the onset of a weak El Niño event in 2018. In contrast, the ECMWF ensemble, with coupling to an ocean model, exhibits high skill throughout the same region.

Figure 13.
As in Fig. 12, but only for bias-corrected forecasts initialized in SON.
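The text states only that reforecasts were used to remove the model's average drift; a minimal sketch of such a lead-time-dependent bias correction (the interface is ours, and operational use would presumably stratify the drift by season or calendar date) is:

```python
import numpy as np

def bias_correct(forecasts, reforecasts, reforecast_verif):
    """Subtract the mean lead-time-dependent drift from new forecasts.

    forecasts:        (n_fc, n_lead, ...) fields to correct
    reforecasts:      (n_rfc, n_lead, ...) historical model forecasts
    reforecast_verif: (n_rfc, n_lead, ...) the verifying analyses
    """
    # Mean model drift at each lead time (and grid point), estimated
    # from the reforecast archive:
    drift = (np.asarray(reforecasts) - np.asarray(reforecast_verif)).mean(axis=0)
    return np.asarray(forecasts) - drift
```

If the reforecasts run 2 K too warm everywhere at some lead time, every new forecast is cooled by 2 K at that lead time; more sophisticated calibrations of the full ensemble distribution are possible with the same archive.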
As a first step toward developing a deep-learning-based ensemble system for S2S forecasting, we refined our previous data-driven global model (Weyn et al., 2020) by improving the resolution at the equator to approximately 1.4° by 1.4° and by adding two more physical fields, the temperature at 850 hPa and total column water vapor. These refinements allowed the model to both spontaneously generate tropical cyclones and produce a reasonable, though far from state-of-the-art, four-day deterministic forecast of Hurricane Irma. Despite the higher resolution and the expansion from four to six spherical shells of prognostic variables, the model remains very computationally efficient. A one-week forecast, stepped forward with a 12-hour time step (and 6-hour resolution), can be performed in approximately 1/10th of a second on an Nvidia Tesla V100 graphics processing unit (GPU).

We exploited this efficiency to generate large ensemble forecasts. Only about 3 minutes are required to produce a 320-member six-week ensemble forecast. Those 320 ensemble members were generated by running 32 different DLWP models, trained to slightly different convolutional filter coefficients, on each of 10 initial conditions. The initial conditions were non-optimal; rather than including information from singular vectors, they were simply drawn from the ERA5 archive. The strategy of training DLWP models with slightly different filter coefficients was, on the other hand, very effective, adding significant skill to the ensemble mean and greatly increasing the ensemble spread (Figs. 1 and 5). The ensemble spread introduced by the 32 similar DLWP models is functionally analogous to that of conventional NWP ensemble members with stochastically perturbed physical parameterization tendencies or stochastic kinetic energy backscatter. Our DLWP model requires 6–8 days of computation to train on a single Tesla V100 GPU.
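The grand-ensemble construction just described, every one of the 32 models run recursively from every one of the 10 initial conditions, amounts to a simple double loop. The callables below are toy stand-ins for the trained CNNs, and the interface is our assumption (the paper's models actually predict two 6-h steps at a time):

```python
import numpy as np

def grand_ensemble(models, initial_states, n_steps):
    """Run every model from every initial condition, stepping each
    state forward recursively, and stack the trajectories into one
    (n_models * n_ics)-member ensemble."""
    members = []
    for model in models:                 # e.g., 32 trained CNNs
        for state in initial_states:     # e.g., 10 ERA5 analyses
            traj = [state]
            for _ in range(n_steps):     # e.g., 84 twelve-hour steps for 6 weeks
                traj.append(model(traj[-1]))
            members.append(traj)
    return np.asarray(members)           # (members, n_steps + 1, ...)

# Toy usage with damped-persistence "models" in place of CNNs:
toy_models = [lambda s, a=a: a * s for a in (0.90, 0.95)]
ics = [np.ones(4), 2.0 * np.ones(4)]
ens = grand_ensemble(toy_models, ics, n_steps=3)
assert ens.shape == (4, 4, 4)
```

Because the members are independent, the double loop parallelizes trivially across GPUs, which is what makes the three-minute, 320-member figure plausible.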
We were able to economize by training only eight of our 32 "physics-ensemble" members from scratch with different initial seeds and filling out the ensemble with models using filter coefficients from different checkpoints saved during the eight training iterations.

As was also the case for the ECMWF S2S forecast, the DLWP ensemble-mean forecast was a significant improvement over that from a single control member. In particular, the average RMSE over the 2017–2018 test set of DLWP ensemble forecasts for Z500, T850, and 2-m temperature remained below climatology for at least 14 days, while the anomaly correlation coefficients remained above 0.6 for 7–8 days. Not surprisingly, the
ECMWF S2S ensemble did perform better, particularly at earlier lead times, and it also gave ACC scores exceeding the 0.6 threshold out to 10 days. At longer lead times, week-3–4 or week-5–6 averages, the ACC scores of the ECMWF and DLWP ensemble means were positive and better than persistence, but still relatively low. The 2017–2018 averaged scores ranged from roughly 0.25 to 0.5, with the ECMWF ensemble performing better in all cases except for T850 at weeks 5–6, for which both ensembles were in a statistical tie with an ACC of approximately 0.25.

We examined two probabilistic measures of ensemble skill, the CRPS and RPSS. The DLWP and ECMWF S2S ensembles produce essentially the same week-4 and weeks-5–6 CRPS scores. At shorter lead times the ECMWF ensemble is superior, performing marginally better at week 3 and distinctly better than the DLWP ensemble at week 2. Both the DLWP and ECMWF ensembles clearly outperformed climatology and persistence. Examining seasonal and regional contrasts showed that in the northern hemisphere extra-tropics the DLWP ensemble performed best, on a par with the ECMWF ensemble, in the summer season, while performing worst in winter.

Like the CRPS, the spatially and temporally averaged RPSS scores showed modest skill relative to climatology at all lead times. The ECMWF ensemble RPSS scores exceeded those of the DLWP ensemble by larger margins than in the CRPS metric, except in summer in the northern extra-tropics, when both ensembles again achieved similar scores. In both the globally-averaged and tropics-only-averaged RPSS, the differential by which the ECMWF score exceeds that of DLWP is smaller over land than over the full globe. Global maps comparing the ECMWF and DLWP RPSS scores show generally similar regions of higher and lower skill, except that the ECMWF ensemble performs better over the tropical oceans.
At weeks 5–6, the spatial distributions of regions of skill and no-skill in the RPSS metric over land are surprisingly similar between the ECMWF and DLWP ensembles (Fig. 11). One likely reason the DLWP model performs more poorly over the tropical oceans is its lack of SST data, as suggested by its failure in the eastern equatorial Pacific during the onset of a weak El Niño event in 2018.

Although our current data-driven DLWP model is worse than operational NWP models for the deterministic prediction of synoptic-scale weather patterns, its capability to learn physics-based phenomena, including the complex evolution of near-surface temperatures and long-term patterns in the convection-dominated tropics, is remarkable. As such, DLWP may prove a valuable tool for supplementing NWP-based S2S forecasts where they are weakest: in the tropics and in the spring and summer months.

There are many avenues for further development of our elementary DLWP ensemble system. One obvious shortcoming is that our DLWP model does not yet forecast precipitation. This might be addressed by adding precipitation to the current set of six prognostic 2D fields that are recursively stepped forward by the model, although instead of including the precipitation at previous times in the CNN architecture, it could alternatively be diagnosed from the other fields after each step (Larraondo et al., 2019).

The DLWP model's computational efficiency can be used for more than simply producing timely operational forecasts; it also enables researchers to make unprecedented use of large numbers of reforecasts for past weather events. We computed 25 years of reforecasts, with 104 forecasts per year for 33 ensemble members, in a matter of hours on a single GPU.
We only used these reforecasts to correct the average drift in the DLWP model, but one could also use them to calibrate ensemble probability distributions, analyze model errors, or investigate the sources of predictability captured by the model. Historically, adjoint models (e.g., Doyle et al., 2014), which are tangent-linear, differentiable approximations to full non-linear dynamical NWP models, have been used to examine how model errors depend on initial-condition uncertainties (for example, whether errors in moisture over the US have a strong influence on the location or intensity of a cyclone over Europe). Adjoint models are difficult to create for complex operational NWP models. Yet, because a CNN is fully differentiable, it is easy to produce the corresponding adjoint model, enabling studies of error growth and atmospheric predictability. Recent work on the interpretation of deep neural networks may provide some valuable tools for this form of analysis (Toms et al., 2020; Ebert-Uphoff & Hilburn, 2020).
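The adjoint idea can be illustrated with a toy linear "model" x_{t+1} = A x_t standing in for the CNN: the gradient of a scalar forecast response with respect to the initial state is obtained by sweeping backward with the transpose of A, just as backpropagation does automatically for a real, fully differentiable CNN. Everything below is an illustrative stand-in, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
A = 0.3 * rng.standard_normal((5, 5))   # toy linear dynamics (stand-in for a CNN)
g = rng.standard_normal(5)              # response weights: J(x0) = g . x_n
n_steps = 4

# Adjoint (reverse) sweep: dJ/dx0 = (A^T)^n g
sens = g.copy()
for _ in range(n_steps):
    sens = A.T @ sens

# Verify against a forward directional derivative of the forecast:
def forecast_response(x):
    for _ in range(n_steps):
        x = A @ x                       # recursive forward steps
    return g @ x

x0 = rng.standard_normal(5)
eps = 1e-6
directional = (forecast_response(x0 + eps * sens) - forecast_response(x0)) / eps
assert np.isclose(directional, sens @ sens, rtol=1e-4)
```

For a CNN, the same sensitivity field comes from one reverse-mode pass through the unrolled forecast, which is what makes the differentiability of DLWP-style models attractive for predictability studies.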
Acknowledgments
J. A. Weyn and D. R. Durran’s contributions to this research were funded by Grant N00014-17-1-2660 from the Office of Naval Research (ONR). D. R. Durran was also supportedby Grant N00014-20-1-2387 from ONR. J. A. Weyn was also supported by a NationalDefense Science and Engineering Graduate (NDSEG) fellowship from the Departmentof Defense (DoD). Computational resources were provided by Microsoft Azure via a grantfrom Microsoft’s AI for Earth program. Stan Posey (Nvidia) also generously donated twoV100 GPUs to the University of Washington which were used for this work.
Data Availability Statement
The ERA5 reanalysis data are available via the Copernicus Climate Data Store (DOI:10.24381/cds.bd0915c6 and DOI: 10.24381/cds.adbb2d47). The ECMWF S2S forecastsare available at https://apps.ecmwf.int/datasets/data/s2s. The T42 and T63 IFS fore-casts are available from Rasp et al. (2020b).
References
Adames, ´A. F., & Kim, D. (2016, March). The MJO as a dispersive, convectivelycoupled moisture wave: Theory and observations.
Journal of the AtmosphericSciences , (3), 913–941. doi: 10.1175/JAS-D-15-0170.1Bauer, P., Thorpe, A., & Brunet, G. (2015). The quiet revolution of numericalweather prediction. Nature , (7567). doi: 10.1038/nature14956Buizza, R. (2019). Introduction to the special issue on “25 years of ensemble fore-casting”. Quarterly Journal of the Royal Meteorological Society , (S1), 1–11.doi: 10.1002/qj.3370Doyle, J. D., Amerault, C., Reynolds, C. A., & Reinecke, P. A. (2014). Initialcondition sensitivity and predictability of a severe extratropical cycloneusing a moist adjoint. Monthly Weather Review , (1), 320–342. doi:10.1175/MWR-D-13-00201.1Dueben, P. D., & Bauer, P. (2018). Challenges and design choices for global weatherand climate models based on machine learning. Geoscientific Model Develop-ment , (10), 3999–4009. doi: 10.5194/gmd-11-3999-2018Ebert-Uphoff, I., & Hilburn, K. A. (2020, may). Evaluation, Tuning and Interpre-tation of Neural Networks for Meteorological Applications. ArXiv . Retrievedfrom http://arxiv.org/abs/2005.03126
Fortin, V., Abaza, M., Anctil, F., & Turcotte, R. (2014, Aug). Why should ensemble spread match the RMSE of the ensemble mean? J. Hydrometeorol., (4), 1708–1713. doi: 10.1175/JHM-D-14-0008.1

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, (477), 359–378. doi: 10.1198/016214506000001437

Hersbach, H. (2000). Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, (5), 559–570. doi: 10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2

Hwang, J., Orenstein, P., Cohen, J., Pfeifer, K., & Mackey, L. (2019). Improving subseasonal forecasting in the western U.S. with machine learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2325–2335). Retrieved from https://arxiv.org/abs/1809.07394 doi: 10.1145/3292500.3330674

Isaksen, L., Bonavita, M., Buizza, R., Fisher, M., Haseler, J., Leutbecher, M., & Raynaud, L. (2010). Ensemble of data assimilations at ECMWF. ECMWF Tech. Memo., (December), 1–41.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. ArXiv. Retrieved from http://arxiv.org/abs/1412.6980
Larraondo, P. R., Renzullo, L. J., Inza, I., & Lozano, J. A. (2019). A data-driven approach to precipitation parameterizations using convolutional encoder-decoder neural networks. ArXiv. Retrieved from http://arxiv.org/abs/1903.10274
Leutbecher, M. (2018, Sep). Ensemble size: How suboptimal is less than infinity? Quarterly Journal of the Royal Meteorological Society, (S1), 107–128. doi: 10.1002/qj.3387

Monhart, S., Spirig, C., Bhend, J., Bogner, K., Schär, C., & Liniger, M. A. (2018, Aug). Skill of subseasonal forecasts in Europe: Effect of bias correction and downscaling using surface observations. Journal of Geophysical Research: Atmospheres. Retrieved from http://doi.wiley.com/10.1029/2017JD027923 doi: 10.1029/2017JD027923

Palmer, T. (2018). The ECMWF ensemble prediction system: Looking back (more than) 25 years and projecting forward 25 years. Quarterly Journal of the Royal Meteorological Society. Retrieved from http://doi.wiley.com/10.1002/qj.3383 doi: 10.1002/qj.3383

Richardson, D. S. (2000). Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, (563), 649–667. doi: 10.1002/qj.49712656313

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Lecture Notes in Computer Science (Vol. 9351, pp. 234–241). Springer, Cham. doi: 10.1007/978-3-319-24574-4_28

Scher, S., & Messori, G. (2019). Weather and climate forecasting with neural networks: Using GCMs with different complexity as study-ground. Geoscientific Model Development Discussions, (March), 1–15. doi: 10.5194/gmd-2019-53

Scher, S., & Messori, G. (2020, Feb). Ensemble neural network forecasts with singular value decomposition. ArXiv. Retrieved from http://arxiv.org/abs/2002.05398

Toms, B. A., Barnes, E. A., & Ebert-Uphoff, I. (2020). Physically interpretable neural networks for the geosciences: Applications to Earth system variability. Journal of Advances in Modeling Earth Systems, (9), e2019MS002002. doi: 10.1029/2019MS002002

Ullrich, P. A., Devendran, D., & Johansen, H. (2016). Arbitrary-order conservative and consistent remapping and a theory of linear maps: Part II. Monthly Weather Review, (4), 1529–1549. doi: 10.1175/MWR-D-15-0301.1

Ullrich, P. A., & Taylor, M. A. (2015). Arbitrary-order conservative and consistent remapping and a theory of linear maps: Part I. Monthly Weather Review, (6), 2419–2440. doi: 10.1175/MWR-D-14-00343.1

Vitart, F. (2004). Monthly forecasting at ECMWF. Monthly Weather Review, (12), 2761–2779. doi: 10.1175/MWR2826.1

Vitart, F. (2014). Evolution of ECMWF sub-seasonal forecast skill scores. Quarterly Journal of the Royal Meteorological Society, (683), 1889–1899. doi: 10.1002/qj.2256

Vitart, F., Ardilouze, C., Bonet, A., Brookshaw, A., Chen, M., Codorean, C., . . . Zhang, L. (2017). The Subseasonal to Seasonal (S2S) Prediction Project database. Bulletin of the American Meteorological Society, (1), 163–173. doi: 10.1175/BAMS-D-16-0017.1

Weigel, A. P., Baggenstos, D., Liniger, M. A., Vitart, F., & Appenzeller, C. (2008, Dec). Probabilistic verification of monthly temperature forecasts. Monthly Weather Review, (12), 5162–5182. doi: 10.1175/2008MWR2551.1

Weigel, A. P., Liniger, M. A., & Appenzeller, C. (2007, Jan). The discrete Brier and ranked probability skill scores. Monthly Weather Review, (1), 118–124. doi: 10.1175/MWR3280.1

Weyn, J. A., Durran, D. R., & Caruana, R. (2019). Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential height from historical weather data. Journal of Advances in Modeling Earth Systems, (8), 2680–2693. doi: 10.1029/2019MS001705

Weyn, J. A., Durran, D. R., & Caruana, R. (2020). Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere. Journal of Advances in Modeling Earth Systems, (9), e2020MS002109. doi: 10.1029/2020MS002109

White, C. J., Carlsen, H., Robertson, A. W., Klein, R. J. T., Lazo, J. K., Kumar, A., . . . Zebiak, S. E. (2017). Potential applications of subseasonal-to-seasonal (S2S) predictions. Meteorological Applications, (3), 315–325. doi: 10.1002/met.1654