Data-driven medium-range weather prediction with a Resnet pretrained on climate simulations: A new model for WeatherBench
Purely data-driven medium-range weather forecasting achieves comparable skill to physical models at similar resolution
Stephan Rasp
Department of Informatics
Technical University of Munich
Munich, Germany
[email protected]
Nils Thuerey
Department of Informatics
Technical University of Munich
Munich, Germany
[email protected]
Abstract
Numerical weather prediction has traditionally been based on physical models of the atmosphere. Recently, however, the rise of deep learning has created increased interest in purely data-driven medium-range weather forecasting, with first studies exploring the feasibility of such an approach. Here, we train a significantly larger model than in previous studies to predict geopotential, temperature and precipitation up to 5 days ahead and achieve comparable skill to a physical model run at similar horizontal resolution. Crucially, we pretrain our models on historical climate model output before fine-tuning them on the reanalysis data. We also analyze how the neural network creates its predictions and find that, with some exceptions, it is compatible with physical reasoning. Our results indicate that, given enough training data, data-driven models can compete with physical models. At the same time, there is likely not enough data to scale this approach to the resolutions of current operational models.
Introduction
Numerical weather prediction (NWP) is based on physical models of the atmosphere, and of the ocean for longer forecast times, in which the governing equations are discretized and sub-grid processes are parameterized. Continued refinement of these models, along with increasing computing power and better observations to create initial conditions, has led to steady increases in forecast skill over the last four decades. The improvements in the model components and the tuning of free parameters are, in a large majority of cases, guided by scientific expertise rather than statistical methods. In the current operational weather forecasting chain, the only component that includes a learning algorithm is post-processing, the correction of statistical errors in NWP output. Most commonly, post-processing is done using simple linear techniques (model output statistics; MOS), but in recent years more modern machine learning (ML) techniques, such as random forests and neural networks, have been explored.

With the apparent successes of deep learning in modeling high-dimensional data in other domains such as computer vision and natural language processing, a natural question to ask is whether numerical weather models can also be learned purely from data. This question sparked some debate after initial studies showed the general feasibility of such an approach for medium-range weather forecasting. In particular, some researchers were sceptical whether the complex physics described by systems of partial differential equations could be encoded in a neural network. In this paper, we aim to answer this fundamental question and explore the potential of purely data-driven weather forecasting as an alternative to physical modeling.
Specifically, we focus on global, medium-range weather forecasting up to 5 days forecast time. From a societal point of view, this forecast range is particularly important in order to prepare for extreme weather events such as heavy rainfall, tropical cyclones, cold spells or heat waves. Scientifically, producing a good medium-range forecast requires modeling the large-scale dynamics of the atmosphere, especially the initiation and evolution of tropical and extra-tropical cyclones. This is in contrast to very short-term prediction of e.g. precipitation, called "nowcasting", which does not necessarily require knowledge about atmospheric dynamics and can be done by extrapolating observations into the future.

We tackle the challenge posed in the WeatherBench benchmark, namely predicting 500 hPa geopotential (Z500), 850 hPa temperature (T850), 2-meter temperature (T2M) and 6-hourly accumulated precipitation (PR) up to 5 days ahead. The first two variables represent upper-level atmospheric variables that describe the large-scale flow of the atmosphere, while the latter two represent impact variables. The data used are 40 years of ERA5 reanalyses at 5.625° resolution (32 × 64 grid points in latitude/longitude, approximately 625 km resolution at the equator). This data is much coarser than current operational weather models (∼10 km) or even climate models (∼100 km). However, since we use a large deep neural network with significantly more variables and levels than previous studies, using higher-resolution data is technically very challenging. As we will see below, even 5.625° resolution data allow us to draw meaningful conclusions about the potential of data-driven weather forecasting if it were scaled to higher resolutions. In addition to the ERA data, we use 150 years of climate model data from the Climate Model Inter-comparison Project (CMIP) for pretraining (see Methods).

There are three fundamental techniques for creating data-driven forecasts: direct, continuous and iterative. For direct forecasts, a separate model is trained directly for each desired forecast time. In continuous models, time is an additional input and a single model is trained to predict all forecast lead times (as in MetNet). Finally, iterative forecasts are created by training a direct model for a short forecast time (e.g. 6 h) and then running the model several times using its own output from the previous iteration. Here, we train direct and continuous models using a fully-convolutional Resnet that takes the prognostic variables at seven vertical levels, as well as some surface and constant fields, at the current time t as well as t − 6 h and t − 12 h as input. Z500, T850 and T2M are predicted with a separate set of networks than PR (see Methods for details on the model). Iterative models, if successfully trained, have some nice properties, such as long-term stability. However, the computational cost and memory requirements of training such models over several time steps are challenging for a network of the size used here.

Finally, it is important to have meaningful baselines to judge the skill of the data-driven techniques. As a gold standard, we use the operational Integrated Forecasting System (IFS) model from the European Center for Medium-range Weather Forecasting (ECMWF), which currently has a horizontal resolution of around 10 km.
Further, we compare our forecasts to IFS forecasts run at lower resolution, T42 (approximately 2.8° or 310 km at the equator) and T63 (approximately 1.9° or 210 km) (see Methods for details). These lower-resolution versions should provide a fair comparison to our data-driven forecasts in terms of resolution and computational expense.

Results
Forecast skill
For the upper-level fields Z500 and T850, the direct and continuous models achieve comparable skill to the T63 forecast across metrics, with better relative scores for short forecast times (Figs. 1 and S1 and Tables 1 and S1). Pretraining with climate model data helps in achieving good skill, with increasing impact for longer forecast times, as the gap between the ERA-only and pretrained networks shows. This is because overfitting, as measured by the difference between training and testing scores, tends to be worse for longer lead times (Fig. 3a). As the atmosphere becomes more chaotic for longer forecast horizons, similar initial conditions can lead to a wider range of outcomes. In the face of such uncertainty, a model that is trained to minimize the mean squared error will tend to predict the mean of the distribution of possible outcomes. Our hypothesis is that for a wider distribution (longer forecast time) more training data is required to estimate the mean. In other words, if, because of the intrinsic unpredictability of the atmosphere, a broader range of outcomes is physically plausible, then fitting to the individual outcomes encountered in the training data will lead to more overfitting than it would for shorter forecast times, where the range of plausible forecasts is narrower.

Table 1: RMSE for 3 and 5 days forecast time. All forecasts evaluated at 5.625° resolution. Best physical and data-driven methods are highlighted.

Latitude-weighted RMSE (3 days / 5 days)

Model                     Z500 [m² s⁻²]   T850 [K]       T2M [K]        PR [mm]
Persistence               936 / 1033      4.23 / 4.56    3.00 / 3.27    3.23 / 3.24
Climatology               1075            5.51           6.07           2.36
Weekly climatology        816             3.50           3.19
IFS T42                   489 / 743       3.09 / 3.83    3.21 / 3.69
IFS T63                   268 / 463       1.85 / 2.52    2.04 / 2.44
Operational IFS           – / 334         1.36 / –       – / –          – / 2.35
Direct (CMIP only)        323 / 561       2.09 / 2.82    1.90 / 2.32    2.30 / 2.39
Direct (pretrained)       – / 523         – / 2.52       – / 2.03       2.16 / –
Continuous (ERA only)     331 / 545       1.87 / 2.57    1.60 / 2.06    2.22 / 2.32
Continuous (CMIP only)    330 / 548       2.12 / 2.75    2.24 / 2.59    2.29 / 2.38
Continuous (pretrained)   284 / –
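The latitude-weighted RMSE values reported in Table 1 (defined formally in the Methods) can be computed in a few lines. Below is an illustrative NumPy sketch, not the WeatherBench implementation; the array layout (n_forecasts, n_lat, n_lon) is an assumption for the example:

```python
import numpy as np

def lat_weighted_rmse(f, t, lats_deg):
    """Latitude-weighted RMSE.

    f, t: forecast and truth arrays of shape (n_forecasts, n_lat, n_lon).
    lats_deg: latitude of each grid row in degrees, shape (n_lat,).
    """
    w = np.cos(np.deg2rad(lats_deg))
    L = w / w.mean()                          # weights normalized to mean 1
    sq_err = L[None, :, None] * (f - t) ** 2  # latitude-weighted squared error
    # RMSE per forecast (mean over the grid, then sqrt), averaged over forecasts
    return np.sqrt(sq_err.mean(axis=(1, 2))).mean()
```

As a sanity check, a forecast that is everywhere offset from the truth by a constant c yields an RMSE of exactly c, independent of the weights.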
Sensitivity to resolution and network size
Next, we conducted sensitivity tests to probe the scaling of forecast skill with data resolution and network size. To assess the impact of resolution, we trained 3-day direct networks using 11.25° and 22.5° data but an otherwise identical training procedure (Fig. 3b). The skill drops with coarser resolution. This trend is present regardless of whether the evaluation was done at 5.625° or 22.5° resolution, with higher/lower resolution data interpolated to the evaluation resolution. This tendency makes sense since a higher data resolution provides better information to the network. One caveat of this sensitivity test is that we left the model architecture the same for these experiments, which means that the number of parameters relative to the size of the input/output vectors increases with coarser resolutions.

Figure 3: a) Generalization error (testing minus training RMSE for Z500). b) RMSE of Z500 for networks trained with different resolution data. Bars show the RMSE computed at 5.625° resolution; for this, the predictions from the lower-resolution networks were upscaled. Dots show the RMSE evaluated at 22.5°, for which all predictions were downscaled. c) RMSE of Z500 for different network architectures. The y-axis has the same units (Z500 RMSE in m² s⁻²) for all three panels.

To compare different network sizes, we reduced the number of channels in each convolution from 128 to 64, 32 and 16 (Fig. 3c). The number of parameters decreases approximately by a factor of 4 for each reduction. The testing skill increases with increasing network size, but the trend flattens off and overfitting increases. This suggests that, while further improvements are certainly possible, there likely is a ceiling in skill for a given amount of training data. Note that the regularization parameters (weight decay and dropout) are the same across all network sizes. Another way to change the network size would be to change the number of layers.
These experiments led to qualitatively similar results. Recent findings in deep learning suggest that further increasing network size can lead to lower testing losses despite increased overfitting. It would be interesting to see whether similar trends hold for this dataset.

Interpretability
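The saliency analysis used in this section reduces to computing the gradient G = ∂p/∂X of a scalar network output with respect to the input. As a minimal, self-contained illustration, here is a finite-difference version for a toy linear "network" in NumPy; the actual analysis differentiates the trained Resnet by backpropagation, and all names here are illustrative:

```python
import numpy as np

# Toy stand-in for the trained network: a linear map y = W @ x, so the exact
# gradient of the scalar output p = y[i] with respect to the input x is row i of W.
rng = np.random.default_rng(0)
n_in, n_out = 6, 3
W = rng.normal(size=(n_out, n_in))

def model(x):
    return W @ x

def saliency(x, i, eps=1e-6):
    """Central finite-difference estimate of G = dp/dx for p = model(x)[i]."""
    g = np.zeros_like(x)
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        g[j] = (model(x + dx)[i] - model(x - dx)[i]) / (2 * eps)
    return g

x = rng.normal(size=n_in)
G = saliency(x, i=1)
# For the linear toy model, the estimate matches the exact gradient, W[1].
```

Averaging |G| over many samples, as done below, then gives the mean region of influence.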
The data-driven weather models predict weather with reasonable skill. One interesting question to ask is whether they do this for the "right reasons". To find out, we test which variables and which geographical regions are important for the network to make a prediction. We do this by computing saliency maps. That is, for each sample we choose a point in space and a specific variable p, e.g. T850 over London. We then compute the gradient G of this scalar p with respect to the entire input array X ∈ R^{samples × lat × lon × variables}: G = ∂p/∂X, which has the same shape as X. We do this analysis for two climatologically different locations: London, which is in the mid-latitudes and therefore influenced by eastwards-propagating Rossby waves, and Barbados, located in the sub-tropical trade wind zones. This is done for different lead times using the pretrained direct networks.

It is important to highlight that the saliency method does not evaluate which inputs were most important for the prediction but rather which changes in the input would most affect the output. For a discussion of the differences, see ref. For the purposes of this paper, the saliency method is appropriate since it allows us to evaluate the effect of small input perturbations, which is closely related to the body of work on adjoint sensitivity.

First, we investigate the region of influence by computing the mean absolute gradient of T850 over all samples, |G| = (1/N_samples) Σ_i |G_i|, and then taking the mean over all input variables (Fig. 4a). Because we compute the gradients for the normalized inputs, the different variables and levels should be comparable in scale and the gradients are dimensionless. It is important to highlight that the saliency analysis is primarily of a qualitative nature. The resulting maps show that the network tends to look at physically reasonable geographical regions. For London, the region of influence extends towards the west with increasing forecast time.
This is in line with our physical understanding of eastwards-traveling Rossby waves being a key factor for weather in the mid-latitudes. Further, we can look at the mean gradient ¯G = (1/N_samples) Σ_i G_i of a specific input variable, in this case Z500, for 3 days forecast time (Fig. 4b). Here we see a positive-negative pattern across the Atlantic. Physically, one could interpret this as the signature of Rossby phase shifts influencing the temperature over London several days ahead. Over Barbados the region of influence looks very different and more circular, which is in accordance with calmer meteorological conditions in the subtropics without a persistent preferred wind direction. Note that |G| is a mean over all seasons, so that seasonally prevailing regimes do not show up.

We can also take the horizontal mean of |G| to obtain the mean influence of each normalized input variable (Supp. Fig. S2). Geopotential and temperature show the largest gradients on average. Specifically, changes in the geopotential at 250 hPa appear to have a large effect. This is reasonable since 250 hPa is close to the tropopause, and changes in the tropopause height are known to be influential for medium-range weather evolution. Further, the gradient analysis shows that T2M is important for Barbados, which reflects the importance of the ocean temperatures. Comparing the influence of the inputs at the current time step t, t − 6 h and t − 12 h, the current time step is much more important than the earlier time steps. This confirms our empirical finding that adding these previous time steps only improved the scores marginally (not shown).

Figure 4: Saliency plots: a) The region of influence |G| (see text for explanation) of T850 over London and Barbados with respect to all input variables (averaged). b) Mean gradient over time ¯G of T850 with respect to Z500 over London.
c) Sample gradient G of T850 with respect to the 250 hPa geopotential for 8 January 2017 12:00 UTC.

So far, all results are in agreement with physical reasoning, and similar results could be expected to come out of adjoint sensitivity studies with physical models. However, looking at G for individual samples, it is evident that this is not always the case. Fig. 4c shows the gradient of T850 over London with respect to the 250 hPa geopotential for a 3-day forecast. Significant gradients stretch across the Atlantic and North America all the way to Hawaii. This extent of information propagation within 3 days is rather unphysical. Studies using physical models typically estimate that it takes perturbations 5–6 days to cross the Atlantic. These results suggest that while the network, on average, learns physically plausible connections from data, it appears to make unphysical connections for some samples. This makes sense since, in our setup, the network purely learns correlations between input and output images, and there is nothing stopping it from learning "unphysical" correlations. If, for example, a certain pattern over eastern North America, which likely has an influence on European weather 3 days later, also co-occurs in the training data with some pattern over the eastern Pacific, the network will pick up that connection between Pacific and European weather even if it is not a causal relationship. In a way, such "unphysical" relations are also a sign of overfitting.

Discussion
To accurately assess how the data-driven models stack up against physical models, it is important to highlight some aspects of the comparison conducted in this study. First, the physical models (operational IFS and T63) are initialized from slightly different initial conditions, leading to a non-zero error at t = 0. In addition, the coarse-resolution models T42 and T63 suffer from errors due to the conversion to spherical coordinates at coarse resolutions. Since error growth is initially exponential, this initial condition difference primarily affects short forecast times up to two days. A likely more important consideration is that the T42 and T63 models were not tuned for this resolution. This is in contrast to the operational IFS model, which is carefully tuned over many years. This means that tuning the lower-resolution IFS models would almost certainly lead to increased skill; however, it is hard to estimate how much. On the other hand, our models are trained at significantly coarser resolutions, and further hyper-parameter/architecture tuning would likely result in better scores. Another limitation is that statistical errors of the physical models were not removed by post-processing and that the evaluation was done on a very coarse grid. This is likely not so important for the upper-level variables Z500 and T850 but very important for surface variables like T2M and PR. In data-driven forecasts the post-processing is implicitly performed. While not a perfect one-to-one comparison, we believe it is fair to say that our data-driven models achieve comparable skill to physical models at similar horizontal resolutions.

More generally, our findings suggest that, given enough data, there is no fundamental reason why purely data-driven forecasts could not be as good as state-of-the-art physical models. Our scaling analysis indicates that going to higher resolutions and larger networks leads to better scores.
It is an interesting question whether the resolution scaling continues for higher resolutions than those considered here. However, the increased overfitting for larger networks already suggests that large amounts of data are required to train competitive data-driven models. One can also assume that larger models are needed at higher resolutions to maintain a reasonable receptive field. Here, we used climate model simulations to combat overfitting. Current CMIP models, however, are run at around 100 km resolution and therefore cannot be used for forecasts at higher resolutions. There are several atmosphere-only climate simulations run at resolutions comparable to the ERA5 resolution of 25 km. It can be assumed that using all this available data at the highest possible resolution for training would greatly increase the forecast skill of data-driven methods. However, for the resolutions of current operational NWP models (10 km) it is unlikely that there is sufficient data to challenge these models (see ref. for a theoretical argument).

As an aside, even if data-driven models matched physical models at forecasting, creating an initial condition currently requires data-assimilation systems that are based on physical models. Moreover, the findings regarding the relative potential skill of data-driven forecasting versus physical modeling are specific to the problem of forecasting global weather in the medium range. The applicability of data-driven forecasts has to be assessed for every application separately, based on the availability of data and the potential to improve upon physical approaches.

Methods
Data
The data handling follows the WeatherBench paper. The data are freely available; instructions for downloading can be found at https://github.com/pangeo-data/WeatherBench. The ERA5 data was regridded bilinearly to 5.625° resolution using the xesmf Python package. Data is available from 1979 to 2018, with the last two years reserved for testing/evaluation.

For the climate model pretraining, we downloaded a historical simulation from the CMIP6 archive. Specifically, we picked the MPI-ESM-HR model since it was one of the only models for which the data was saved at a vertical resolution matching the ERA5 data. The temporal resolution is six hours. The regridded climate model data are also available in the WeatherBench data repository.

Verification
In this study we use two skill metrics. The first is the latitude-weighted root mean squared error (RMSE), defined as

\mathrm{RMSE} = \frac{1}{N_{\mathrm{forecasts}}} \sum_{i}^{N_{\mathrm{forecasts}}} \sqrt{\frac{1}{N_{\mathrm{lat}} N_{\mathrm{lon}}} \sum_{j}^{N_{\mathrm{lat}}} \sum_{k}^{N_{\mathrm{lon}}} L(j)\,(f_{i,j,k} - t_{i,j,k})^{2}}    (1)

where f is the model forecast and t is the ERA5 truth. L(j) is the latitude weighting factor at the j-th latitude index:

L(j) = \frac{\cos(\mathrm{lat}(j))}{\frac{1}{N_{\mathrm{lat}}} \sum_{j}^{N_{\mathrm{lat}}} \cos(\mathrm{lat}(j))}    (2)

The second is the anomaly correlation coefficient (ACC; see Section 7.6.4 of ref. ). The ACC compares the correlation of anomalies with respect to the respective climatology, thereby performing an implicit bias correction. For all scores, the grid points are weighted by their area. We generally follow the WeatherBench methodology.

Baselines
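In addition to the physical baselines, Table 1 reports persistence and climatology references. These trivial baselines can be sketched in a few lines; this is an illustrative NumPy sketch with fields stacked as (time, lat, lon), not the WeatherBench implementation:

```python
import numpy as np

def persistence(initial_field):
    # Persistence: the forecast at any lead time equals the initial condition.
    return initial_field

def climatology(train_fields):
    # Climatology: forecast the mean over the whole training period.
    return train_fields.mean(axis=0)
```

A weekly climatology would instead average separately over each calendar week of the training period, which removes most of the seasonal-cycle error.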
WeatherBench contains three physically-based baselines: the operational Integrated Forecasting System (IFS) of the European Center for Medium-range Weather Forecasting (ECMWF), the current state of the art in NWP, which runs at 10 km horizontal resolution with 137 vertical levels; and the same model run at two lower resolutions, T42 (approximately 2.8° or 310 km at the equator) with 62 vertical levels and T63 (approximately 1.9° or 210 km at the equator) with 137 vertical levels. Computationally, a single ten-day forecast with the operational IFS model takes roughly one hour of real time on a cluster with 11,664 cores. The T42 and T63 models take 270 s and 503 s, respectively, on a single XC40 node with 36 cores. For details on the initialization of each model, refer to the WeatherBench paper.

As an additional reference in Table 1, we include the work by Weyn et al., who trained a neural network to predict Z500 and T850. Their model is iterative, i.e. it consists of a sequence of 6 h forecasts. During training they also trained their neural network over two time steps (12 h) to ensure stability for longer integrations. Further, they mapped the latitude-longitude data to a cube-sphere grid with roughly 1.9° resolution to minimize distortion during the convolution operations. They trained only on ERA data.

Data-driven models
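One distinctive detail of the convolutional networks in this section is that padding is periodic in longitude (the globe wraps around) but zero in latitude (the poles do not). This padding scheme can be sketched for a single (lat, lon) field and a k × k kernel; a NumPy sketch for illustration, whereas the actual models apply the equivalent operation inside each TensorFlow convolution:

```python
import numpy as np

def pad_periodic_lon(field, k=3):
    """Pad a (lat, lon) field for a k x k convolution:
    wrap-around (periodic) padding in longitude, zero padding in latitude."""
    p = k // 2
    # periodic in longitude: prepend the last p columns, append the first p
    field = np.concatenate([field[:, -p:], field, field[:, :p]], axis=1)
    # zeros in latitude: there is no periodic continuation across the poles
    return np.pad(field, ((p, p), (0, 0)))

field = np.arange(12.0).reshape(3, 4)   # toy 3 x 4 (lat x lon) grid
padded = pad_periodic_lon(field)
```

After padding, an ordinary "valid" convolution returns an output of the original grid size, with longitude treated as circular.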
All models in this study use the same architecture (except in the network-size scaling experiments). The basic structure is a fully convolutional Resnet with 19 residual blocks. Each residual block consists of two convolutional blocks, defined as [2D convolution -> LeakyReLU -> Batch normalization -> Dropout], after which the inputs to the residual block are added to the current signal. The 2D convolutions inside the residual blocks have 128 channels with a kernel size of 3. All convolutions are periodic in longitude with zero padding in the latitude direction. For the first layer, a simple convolutional block with 128 channels and a kernel size of 7 is used to increase the field of view.

The inputs are geopotential, temperature, zonal and meridional wind and specific humidity at seven vertical levels (50, 250, 500, 600, 700, 850 and 925 hPa), 2-meter temperature, 6-hourly accumulated precipitation and the top-of-atmosphere incoming solar radiation, all at the current time step t, at t − 6 h and at t − 12 h, and, finally, three constant fields: the land-sea mask, orography and the latitude at each grid point. All fields were normalized by subtracting the mean and dividing by the standard deviation, with the exception of precipitation, for which the mean was not subtracted in order to keep the lower bound at zero. Additionally, we log-transform the precipitation to make the distribution less skewed (P̃R = ln(ε + PR) − ln(ε)) with a small ε > 0. Subtracting the log of ε ensures that zero values remain zero. This transformation turns out to be crucial to prevent the network from simply predicting zeros. All variables, levels and time steps were stacked to create an input signal with 114 channels. For the continuous forecast, we additionally add a 32 × 64 field which contains the forecast time in hours divided by 100. During training, a random forecast time from 6 to 120 hours is drawn for each sample. For the output layer, a simple 2D convolution is used, with the number of output channels being either three for the networks that predict Z500, T850 and T2M, or one for the PR networks. Zero padding is used in the latitude direction. LeakyReLU is used as the activation function, and weight decay is applied to all layers. Dropout of 0.1 is only used for the ERA-only networks; for the CMIP-only and pretrained networks the validation score was better without any dropout.

The loss function is the latitude-weighted mean squared error, with the latitudes weighted proportionally to the area of the grid boxes, ∝ cos φ. The Adam optimizer is used with a batch size of 32. The learning rate was decreased twice by a factor of 5 when the validation loss had not decreased for two epochs. Early stopping on the validation loss was used to terminate training with a patience of 5 epochs. The training period for ERA was 1979 to 2015; validation was done with a single year (2016). For fine-tuning the CMIP networks on ERA data, a lower initial learning rate was chosen. For the direct approach we trained models for 6 h, 1 d, 3 d and 5 d forecast time. We used Tensorflow >= 2.0. Training a single model takes around one day on a GTX 2080 GPU.

Data and code availability

All data are available at https://github.com/pangeo-data/WeatherBench. All code is available at https://github.com/raspstephan/WeatherBench.

Acknowledgements
We would like to thank Sebastian Scher and David Greenberg for their valuable comments on thepaper as well as George Craig for discussing the saliency analysis. Stephan Rasp acknowledgesfunding from the German Research Foundation (DFG) under grant no. 426852073.
Author contributions
S.R. conceived the study, trained the models and analyzed results. All authors discussed results andwrote the manuscript.
References

[1] Brian Ancell and Gregory J. Hakim. Comparing Adjoint- and Ensemble-Sensitivity Analysis with Applications to Observation Targeting. Monthly Weather Review, 135:4117–4134, 2007. doi: 10.1175/2007MWR1904.1.

[2] Peter Bauer, Alan Thorpe, and Gilbert Brunet. The quiet revolution of numerical weather prediction. Nature, 525(7567):47–55, 2015. doi: 10.1038/nature14956.

[3] Peter D. Dueben and Peter Bauer. Challenges and design choices for global weather and climate models based on machine learning. Geosci. Model Dev., 2018. doi: 10.5194/gmd-2018-148.

[4] Imme Ebert-Uphoff and Kyle A. Hilburn. Evaluation, Tuning and Interpretation of Neural Networks for Meteorological Applications. 2020. URL http://arxiv.org/abs/2005.03126.

[5] Veronika Eyring, Sandrine Bony, Gerald A. Meehl, Catherine A. Senior, Bjorn Stevens, Ronald J. Stouffer, and Karl E. Taylor. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geoscientific Model Development, 9(5):1937–1958, 2016. doi: 10.5194/gmd-9-1937-2016.

[6] Peter Grönquist, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, and Torsten Hoefler. Deep Learning for Post-Processing Ensemble Weather Forecasts. 2020. URL http://arxiv.org/abs/2005.08748.

[7] Reindert J. Haarsma, Malcolm J. Roberts, Pier Luigi Vidale, Catherine A. Senior, Alessio Bellucci, Qing Bao, Ping Chang, Susanna Corti, Neven S. Fučkar, Virginie Guemas, Jost von Hardenberg, Wilco Hazeleger, Chihiro Kodama, Torben Koenigk, L. Ruby Leung, Jian Lu, Jing-Jia Luo, Jiafu Mao, Matthew S. Mizielinski, Ryo Mizuta, Paulo Nobre, Masaki Satoh, Enrico Scoccimarro, Tido Semmler, Justin Small, and Jin-Song von Storch. High Resolution Model Intercomparison Project (HighResMIP v1.0) for CMIP6. Geoscientific Model Development, 9(11):4185–4208, 2016. doi: 10.5194/gmd-9-4185-2016.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. 2015. URL http://arxiv.org/abs/1512.03385.

[9] Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, Adrian Simmons, Cornel Soci, Saleh Abdalla, Xavier Abellan, Gianpaolo Balsamo, Peter Bechtold, Gionata Biavati, Jean Bidlot, Massimo Bonavita, Giovanna Chiara, Per Dahlgren, Dick Dee, Michail Diamantakis, Rossana Dragani, Johannes Flemming, Richard Forbes, Manuel Fuentes, Alan Geer, Leo Haimberger, Sean Healy, Robin J. Hogan, Elías Hólm, Marta Janisková, Sarah Keeley, Patrick Laloyaux, Philippe Lopez, Cristina Lupu, Gabor Radnoti, Patricia Rosnay, Iryna Rozum, Freja Vamborg, Sebastien Villaume, and Jean-Noël Thépaut. The ERA5 Global Reanalysis. Quarterly Journal of the Royal Meteorological Society, 2020. doi: 10.1002/qj.3803.

[10] Tim D. Hewson and Fatima M. Pillosu. A new low-cost technique improves weather forecasts across the world. 2020. URL http://arxiv.org/abs/2003.14397.

[11] B. J. Hoskins, M. E. McIntyre, and A. W. Robertson. On the use and significance of isentropic potential vorticity maps. Quarterly Journal of the Royal Meteorological Society, 111(470):877–946, 1985. doi: 10.1002/qj.49711147002.

[12] Frédéric Hourdin, Thorsten Mauritsen, Andrew Gettelman, Jean-Christophe Golaz, Venkatramani Balaji, Qingyun Duan, Doris Folini, Duoying Ji, Daniel Klocke, Yun Qian, Florian Rauser, Catherine Rio, Lorenzo Tomassini, Masahiro Watanabe, and Daniel Williamson. The Art and Science of Climate Model Tuning. Bulletin of the American Meteorological Society, 98(3):589–602, 2017. doi: 10.1175/BAMS-D-15-00135.1.

[13] Eugenia Kalnay. Atmospheric Modeling, Data Assimilation, and Predictability. 2003. ISBN 9780521791793.

[14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv, 1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

[15] Amy McGovern, Kimberly L. Elmore, David John Gagne II, Sue Ellen Haupt, Christopher D. Karstens, Ryan Lagerquist, Travis Smith, and John K. Williams. Using Artificial Intelligence to Improve Real-Time Decision-Making for High-Impact Weather. Bulletin of the American Meteorological Society, 98(10):2073–2090, 2017. doi: 10.1175/BAMS-D-16-0123.1.

[16] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep Double Descent: Where Bigger Models and More Data Hurt. 2019. URL http://arxiv.org/abs/1912.02292.

[17] Tim Palmer. A Vision for Numerical Weather Prediction in 2030. 2020. URL http://arxiv.org/abs/2007.04830.

[18] Stephan Rasp and Sebastian Lerch. Neural Networks for Postprocessing Ensemble Weather Forecasts. Monthly Weather Review, 146(11):3885–3900, 2018. doi: 10.1175/MWR-D-18-0187.1.

[19] Stephan Rasp, Peter D. Dueben, Sebastian Scher, Jonathan A. Weyn, Soukayna Mouatadid, and Nils Thuerey. WeatherBench: A benchmark dataset for data-driven weather forecasting. 2020. URL http://arxiv.org/abs/2002.00469.

[20] Mark J. Rodwell, Linus Magnusson, Peter Bauer, Peter Bechtold, Massimo Bonavita, Carla Cardinali, Michail Diamantakis, Paul Earnshaw, Antonio Garcia-Mendez, Lars Isaksen, Erland Källén, Daniel Klocke, Philippe Lopez, Tony McNally, Anders Persson, Fernando Prates, and Nils Wedi. Characteristics of Occasional Poor Medium-Range Weather Forecasts for Europe. Bulletin of the American Meteorological Society, 94(9):1393–1405, 2013. doi: 10.1175/BAMS-D-12-00099.1.

[21] S. Scher. Toward Data-Driven Weather and Climate Forecasting: Approximating a Simple General Circulation Model With Deep Learning. Geophysical Research Letters, 45(22), 2018. doi: 10.1029/2018GL080704.

[22] Sebastian Scher and Gabriele Messori. Generalization properties of neural networks trained on Lorenz systems. Nonlinear Processes in Geophysics Discussions, pages 1–19, 2019. doi: 10.5194/npg-2019-23.

[23] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. 2013. URL http://arxiv.org/abs/1312.6034.

[24] Casper Kaae Sønderby, Lasse Espeholt, Jonathan Heek, Mostafa Dehghani, Avital Oliver, Tim Salimans, Shreya Agrawal, Jason Hickey, and Nal Kalchbrenner. MetNet: A Neural Weather Model for Precipitation Forecasting. 2020. URL http://arxiv.org/abs/2003.12140.

[25] Maxime Taillardat, Olivier Mestre, Michaël Zamo, and Philippe Naveau. Calibrated Ensemble Forecasts using Quantile Regression Forests and Ensemble Model Output Statistics.
Monthly Weather Review , page 160301131220006, 3 2016. ISSN 0027-0644. doi:10.1175/MWR-D-15-0260.1. URL http://journals.ametsoc.org/doi/abs/10.1175/MWR-D-15-0260.1?af=R .[26] Jonathan A. Weyn, Dale R. Durran, and Rich Caruana. Can machines learn to predict weather?Using deep learning to predict gridded 500-hPa geopotential height from historical weatherdata.
Journal of Advances in Modeling Earth Systems , page 2019MS001705, 7 2019. ISSN1942-2466. doi: 10.1029/2019MS001705. URL https://onlinelibrary.wiley.com/doi/abs/10.1029/2019MS001705 .[27] Jonathan A Weyn, Dale Richard Durran, and Rich Caruana. Improving data-driven globalweather prediction using deep convolutional neural networks on a cubed sphere. 2020. doi:10.1002/ESSOAR.10502543.1.[28] Daniel S Wilks.
Statistical Methods in the Atmospheric Sciences . Elsevier, 2006. ISBN0127519661. URL http://cds.cern.ch/record/992087 .[29] WMO. WMO Lead Centre for Deterministic Forecast Verification, 2020. URL https://apps.ecmwf.int/wmolcdnv/ .[30] Fuqing Zhang, Naifang Bei, Richard Rotunno, Chris Snyder, and Craig C Epifanio. Mesoscalepredictability of moist baroclinic waves: Convection-permitting experiments and multistageerror growth dynamics.
Journal of the Atmospheric Sciences , 64(10):3579–3594, 2007.[31] Jiawei Zhuang. xESMF: v0.2.1, 10 2019. URL https://xesmf.readthedocs.io/ .11 upplement
Figure S1: Anomaly correlation coefficient (ACC) for a) Z500, b) T850, c) T2M and d) PR, evaluated against ERA5 data.

Table S1: ACC at 3 and 5 days forecast time. All forecasts are evaluated at 5.625° resolution. Values are latitude-weighted ACC (3 days / 5 days); ACC is dimensionless. Entries missing in the source are shown as "–".

Model                     Z500           T850           T2M            PR
Persistence               0.62 / 0.53    0.69 / 0.65    0.88 / 0.85    0.06 / 0.06
Climatology               0              0              0              0
Weekly climatology        0.65           0.77           0.85           0.16
IFS T42                   0.90 / 0.78    0.86 / 0.78    0.87 / 0.83    –
IFS T63                   0.97 / 0.91    0.94 / 0.90    0.94 / 0.92    –
Operational IFS           – / –          – / –          – / –          – / –
Direct (ERA only)         0.96 / 0.85    0.94 / 0.86    0.97 / 0.92    – / 0.24
Direct (CMIP only)        0.95 / 0.85    0.93 / 0.86    0.95 / 0.92    0.32 / 0.20
Direct (pretrained)       – / 0.87       – / 0.89       – / 0.94       0.45 / –
Continuous (ERA only)     0.95 / 0.86    0.94 / 0.88    0.96 / 0.94    0.41 / 0.29
Continuous (CMIP only)    0.95 / 0.86    0.93 / 0.87    0.93 / 0.91    0.41 / 0.29
Continuous (pretrained)   0.96 / –       – / –          – / –          – / 0.28

Figure S2: Horizontally averaged saliency ||G||.
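The latitude-weighted ACC reported in Table S1 correlates forecast and observed anomalies (deviations from climatology) with cos(latitude) area weights, so that coarse polar grid cells do not dominate the score. A minimal sketch of this metric is below; the function and variable names are illustrative and not taken from the paper's code, and it assumes single-field arrays of shape (lat, lon) with a climatology on the same grid.

```python
import numpy as np

def lat_weighted_acc(forecast, observation, climatology, lats):
    """Latitude-weighted anomaly correlation coefficient (illustrative sketch).

    forecast, observation, climatology: arrays of shape (n_lat, n_lon)
    lats: 1D array of latitudes in degrees, length n_lat
    """
    w = np.cos(np.deg2rad(lats))      # area weight per latitude row
    w = w / w.mean()                  # normalize weights to mean 1
    w = w[:, None]                    # broadcast across longitude
    fa = forecast - climatology       # forecast anomaly
    oa = observation - climatology    # observed anomaly
    num = np.sum(w * fa * oa)
    den = np.sqrt(np.sum(w * fa**2) * np.sum(w * oa**2))
    return num / den
```

A perfect forecast yields ACC = 1, a climatological forecast is undefined (zero anomaly), and an anti-correlated forecast yields ACC = −1, which is why climatology scores 0 in Table S1 while skillful models approach 1.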