RainBench: Towards Global Precipitation Forecasting from Satellite Imagery
Christian Schroeder de Witt*, Catherine Tong*, Valentina Zantedeschi, Daniele De Martini, Freddie Kalaitzis, Matthew Chantry, Duncan Watson-Parris, Piotr Biliński
University of Oxford (*: equal contribution); Inria, Lille - Nord Europe research centre; University College London, Centre for Artificial Intelligence; University of Warsaw
Abstract
Extreme precipitation events, such as violent rainfall and hailstorms, routinely ravage economies and livelihoods around the developing world. Climate change further aggravates this issue (Gupta et al. 2020). Data-driven deep learning approaches could widen the access to accurate multi-day forecasts, to mitigate against such events. However, there is currently no benchmark dataset dedicated to the study of global precipitation forecasts. In this paper, we introduce RainBench, a new multi-modal benchmark dataset for data-driven precipitation forecasting. It includes simulated satellite data, a selection of relevant meteorological data from the ERA5 reanalysis product, and IMERG precipitation data. We also release PyRain, a library to process large precipitation datasets efficiently. We present an extensive analysis of our novel dataset and establish baseline results for two benchmark medium-range precipitation forecasting tasks. Finally, we discuss existing data-driven weather forecasting methodologies and suggest future research avenues.
Introduction
Extreme precipitation events, such as violent rain and hailstorms, can devastate crop fields and disrupt harvests (Vogel et al. 2019; Li et al. 2019). These events can be locally forecasted with sophisticated numerical weather models that rely on extensive ground and satellite observations. However, such approaches require access to compute and data resources that developing countries in need - particularly in South America and West Africa - cannot afford (Le Coz and van de Giesen 2020; Gubler et al. 2020). The lack of advance planning for precipitation events impedes socioeconomic development and ultimately affects the livelihoods of millions around the world. Given the increase in global precipitation and extreme precipitation events driven by climate change (Gupta et al. 2020), the need for accurate precipitation forecasts is ever more pressing.

Data-driven machine learning approaches circumvent the dependence on traditional resource-intensive numerical models, which typically take several hours to run (Sønderby et al. 2020), incurring a significant time lag. In contrast, deep learning models deployed on dedicated high-throughput hardware can produce inferences in a matter of seconds. However, while there have been attempts at forecasting precipitation with neural networks, they have mostly been fragmented across different local regions, which hinders a systematic comparison of their performance. In this work, we introduce
RainBench, a multi-modal dataset to support data-driven forecasting of global precipitation from satellite imagery. We curate three types of datasets: simulated satellite data (SimSat), numerical reanalysis data (ERA5), and global precipitation estimates (IMERG). The use of satellite images to forecast precipitation globally would circumvent the need to collect ground station data, and hence they are key to our vision for widening the access to multi-day precipitation forecasts. Reanalysis data provide estimates of the complete atmospheric state, and IMERG provides rigorous estimates of global precipitation. Access to these data opens up opportunities to develop more timely and potentially physics-informed forecast models, which so far could not have been studied systematically.

Most related to our work, Rasp et al. (2020) have developed WeatherBench, a benchmark environment for global data-driven medium-range weather forecasting. This dataset forms an excellent first step in weather forecasting. However, some important features of WeatherBench limit its use for end-to-end precipitation forecasts. WeatherBench does not include any observational raw data (e.g. satellite data) and only contains ERA5 reanalysis data, which have limited resolution of extreme precipitation events. Further, WeatherBench does not include a fast dataloading pipeline to train ML models, which we found to be a significant bottleneck in our model development and testing process. This gap prompted us to also release PyRain, a data processing and experimentation framework with fast and configurable multi-modal dataloaders.

To summarise our contributions: (a) We introduce the multi-modal RainBench dataset, which supports data-driven investigations for global precipitation forecasting from satellite imagery; (b) we release PyRain, which allows researchers to run Deep Learning (DL) experiments on RainBench efficiently, reducing time and hardware costs and thus lowering the barrier to entry into this field; (c) we introduce two benchmark precipitation forecasting tasks on RainBench and their baseline results, and present experiments studying class-balancing schemes. Finally, we discuss the challenges in the field and outline several fruitful avenues for future research.

Related Work
Weather forecasting systems have not fundamentally changed since they were first operationalised nearly 50 years ago. Current state-of-the-art operational weather forecasting systems rely on numerical models that forward the physical atmospheric state in time based on a system of physical equations and parameterised subgrid processes (Bauer, Thorpe, and Brunet 2015). While global simulations typically run at grid sizes of about 10 km, regional models can reach grid sizes on the order of 1 km (Franch et al. 2020). Even in the latter case, skilled forecast lengths are usually limited to a maximum of around ten days, with a conjectured hard limit of about two weeks (Zhang et al. 2019). Nowcasting, i.e. high-resolution weather forecasting only a few hours in advance, is currently limited by the several hours that numerical forecasting models take to run (Sønderby et al. 2020).

Given the huge amounts of data currently available from both numerical models and observations, new opportunities exist to train data-driven models to produce these forecasts. The current boom in Machine Learning (ML) has inspired several other groups to approach the problem of weather forecasting. Early work by Xingjian et al. investigated using convolutional recurrent neural networks for precipitation nowcasting. More recently, Sønderby et al. from Google proposed a "(weather) model free" approach, MetNet, which seeks to forecast precipitation in the continental USA using geostationary satellite images and radar measurements as inputs. This approach performs well up to 7-8 hours, but inevitably runs into a forecast horizon limit as information from global or surrounding geographic areas is not incorporated into the system. This time window has value, though it would not enable substantial disaster preparedness.

The prediction of extreme precipitation (and other extreme weather events) has a long history with traditional forecasting systems (Lalaurette 2003). More recent developments in ensemble weather forecasting systems surround the introduction of novel forecasting indices (Zsótér 2006, EFI) and post-processing (Grönquist et al. 2020). There have also been other deep-learning based precipitation forecasting models motivated by the monsoon prediction problem; for example, Saha, Mitra, and Nanjundiah (2017) and Saha et al. (2020) use a stacked autoencoder to identify climatic predictors and an ensemble regression tree model, while Praveen et al. (2020) use kriging and multi-layer perceptrons to predict monsoon rainfall from ERA5 data.

WeatherBench (Rasp et al. 2020) is a benchmark dataset for data-driven global weather forecasting, derived from data in the ERA5 archive. Its release has prompted a number of follow-up works to employ deep learning techniques for weather forecasting, although the variables considered have only been restricted to the forecasts of relatively static variables, such as 500 hPa geopotential and 850 hPa temperature (Weyn, Durran, and Caruana 2019, 2020; Rasp and Thuerey 2020; Bihlo 2020; Arcomano et al. 2020). Unlike RainBench, which incorporates the element of observational input data from (simulated) satellites, WeatherBench's data comes solely from the ERA5 reanalysis archive, and thus provides no route to producing an end-to-end forecasting system.
RainBench
In this section, we introduce RainBench, which consists of data derived from three publicly-available sources: (1) European Centre for Medium-Range Weather Forecasts (ECMWF) simulated satellite data (SimSat), (2) the ERA5 reanalysis product, and (3) Integrated Multi-satellitE Retrievals (IMERG) global precipitation estimates.
SimSat
We use simulated satellite data in place of real satellite imagery to minimise data processing requirements and to simplify the prediction task. SimSat data are model-simulated satellite data generated from ECMWF's high-resolution weather-forecasting model using the RTTOV radiative transfer model (Saunders et al. 2018). SimSat emulates three spectral channels from the Meteosat-10 SEVIRI satellite (Aminou 2002). SimSat provides information about global cloud cover and moisture features and has a native spatial resolution of about 0.1° – i.e. about 10 km – at three-hourly intervals. The product is available from April 2016 to present (with a lag time of 24 h). Using simulated satellite data provides an intermediate step to using real satellite observations, as the images are a global nadir view of Earth, avoiding issues of instrument error and large numbers of missing values. Here we aggregate the data to 0.25° – about 30 km – to be consistent with the ERA5 dataset.
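Regridding of this kind (e.g. mapping a finer field onto a coarser regular lat-lon grid) can be sketched in a few lines of NumPy. The function below is an illustrative bilinear interpolator, not the implementation used to build RainBench; it assumes ascending regular coordinates and ignores longitude wrap-around at the dateline, which a production regridder must handle.

```python
import numpy as np

def fractional_index(src, dst):
    """Fractional position of each destination coordinate in the source grid."""
    idx = np.interp(dst, src, np.arange(len(src)))
    lo = np.floor(idx).astype(int)
    hi = np.clip(lo + 1, 0, len(src) - 1)
    return lo, hi, idx - lo

def regrid_bilinear(field, src_lat, src_lon, dst_lat, dst_lon):
    """Bilinearly interpolate a 2-D (lat, lon) field onto another regular grid.

    Illustrative sketch only: assumes ascending coordinates and no
    longitude wrap-around at the dateline.
    """
    la_lo, la_hi, la_w = fractional_index(src_lat, dst_lat)
    lo_lo, lo_hi, lo_w = fractional_index(src_lon, dst_lon)

    # Interpolate along latitude first, then along longitude.
    rows = field[la_lo, :] * (1 - la_w)[:, None] + field[la_hi, :] * la_w[:, None]
    return rows[:, lo_lo] * (1 - lo_w)[None, :] + rows[:, lo_hi] * lo_w[None, :]
```

Bilinear interpolation reproduces any field that is affine in latitude and longitude exactly, which makes the sketch easy to sanity-check.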
ERA5
We use ERA5 as it is an accurate and commonly used reanalysis product familiar to the climate science community (Rasp et al. 2020). ERA5 reanalysis data provide hourly estimates of a variety of atmospheric, land and oceanic variables, such as specific humidity, temperature and geopotential height at different pressure levels (Hersbach et al. 2020). Estimates cover the full globe at a spatial resolution of 0.25° and are available from 1979 to present, with a lag time of five days.

IMERG
IMERG is a global half-hourly precipitation estimation product provided by NASA (Huffman et al. 2019). We use the Final Run product, which primarily uses satellite data from multiple polar-orbiting and geo-stationary satellites. This estimate is then corrected using data from reanalysis products (MERRA2, ERA5) and rain-gauge data. IMERG is produced at a spatial resolution of 0.1° – about 10 km – and is available from June 2000 to present, with a lag time of about three to four months.

To facilitate efficient experimentation, all data are converted from their original resolutions to coarser common resolutions (1.40625° and 5.625°) using bilinear interpolation.

RainBench provides precipitation values from two sources, ERA5 and IMERG, as both are widely used and considered to be high-quality precipitation datasets. The ERA5 precipitation is accumulated precipitation over the last hour and is calculated as an averaged quantity over a grid-box. We aggregated IMERG precipitation into hourly accumulated precipitation; it should be considered a point estimate of the precipitation.

Figure 1 shows the distribution of precipitation for the years 2000-2017 with both ERA5 and IMERG. IMERG is generally regarded as a more trustworthy dataset for precipitation due to the direct inclusion of precipitation observations in the data assimilation process and the higher spatial resolution used to produce the dataset, which also results in the visible differences in data distributions. IMERG has significantly larger rainfall tails than ERA5, and these tails rapidly vanish with decreasing dataset resolution. The underestimation of extreme precipitation events in ERA5 is clearly visible.
Figure 1: Precipitation histogram from 2000-2017 with ERA5 and IMERG at different resolutions (ERA5 tp at 1.40625° and 5.625°; IMERG at 0.25°, 1.40625° and 5.625°). Vertical lines delineate convection rainfall types: slight (under 2 mm/h), moderate (2-10 mm/h), heavy (10-50 mm/h), and violent (over 50 mm/h) (MetOffice 2012).

PyRain
To support efficient data-handling and experimentation on RainBench, we release PyRain, an out-of-the-box experimentation framework.

PyRain introduces an efficient dataloading pipeline for complex sample access patterns that scales to the terabytes of spatial timeseries data typically encountered in the climate and weather domain. Previously identified as a decisive bottleneck by the Pangeo community, PyRain overcomes existing dataloading performance limitations through an efficient use of NumPy memmap arrays in conjunction with optimised software-side access patterns.

In contrast to storage formats requiring read system calls, including HDF5, Zarr or xarray, memory-mapped files use the mmap system call to map physical disk space directly to virtual process memory, enabling the use of lazy OS demand paging and circumventing the kernel buffer. While less beneficial for chunked or sequential reads and spatial slicing, memmaps can efficiently handle the fragmented random access inherent to the randomised sliding-window access patterns along the primary axis as required in model training.

In Table 1, we compare PyRain's memmap data reading capacity against a NetCDF+Dask (Rocklin 2015) dataloader. We find empirically that PyRain's memmap dataloader offers significant speedups over other solutions, saturating even SSD I/O with few process workers when used with PyTorch's (Paszke et al. 2019) inbuilt dataloader.

Table 1: Number of data samples loaded per second using PyRain versus a conventional NetCDF framework. Typical configurations assumed, measured on an NVIDIA DGX-1 server.

             NetCDF   PyRain   Speedup
16 workers       40     2410    60.3 ×
64 workers       70     1930    27.6 ×

Note that explicitly storing each training sample is not only slow and inflexible for research settings, but it also requires twenty to fifty times more storage and as a result comes at a higher cost than constructing samples on-the-fly. Thus, other options such as writing samples in TFRecord format (Weyn, Durran, and Caruana 2019; Abadi et al. 2016) would only be sensible for highly distributed training in production settings.

PyRain's dataloader is easily configurable and supports both complex multimodal item compositions, as well as periodic (Sønderby et al. 2020) and sequential (Weyn, Durran, and Caruana 2020) train-test set partitionings. Apart from its data-loading pipeline, PyRain also supplies flexible raw-data conversion tools, a convenient interface for data-analysis tasks, various data-normalisation methods and a number of ready-built training settings based on PyTorch Lightning. While being optimised for use with RainBench, PyRain is also compatible with WeatherBench.

Footnotes: https://github.com/frontierdevelopmentlab/pyrain ; https://pangeo.io/index.html (2021); https://docs.python.org/3/library/mmap.html (2021); https://portal.hdfgroup.org/display/HDF5/HDF5 (2021); https://zarr.readthedocs.io/en/stable/ (2021); http://xarray.pydata.org/en/stable/ (2021)

Evaluation Tasks
We define two benchmark tasks on RainBench for precipitation forecasting, with the ground-truth precipitation values taken from either ERA5 or IMERG.

For each benchmark task, we consider three different input data settings: SimSat, reanalysis data (ERA5), or both. From the ERA5 dataset, we select a subset of variables as input to the forecast model based on our data analysis results; the inputs are geopotential (z), temperature (t), humidity (q), cloud liquid water content (clwc) and cloud ice water content (ciwc), each sampled at the 300 hPa, 500 hPa and 850 hPa pressure levels; to these we add the surface pressure and the 2-meter temperature (t2m), as well as static variables that describe the location and surface of the Earth, i.e. latitude, longitude, land-sea mask, orography and soil type. From the SimSat dataset, the inputs are cloud-brightness temperature (clbt) taken at three wavelengths. We normalize each variable with its global mean and standard deviation.

Since data from each source are available at different times, we use the data subset from April 2016 onwards to train all models for the benchmark tasks, unless specified otherwise. We use data from 2018 and 2019 as validation and test sets respectively. To make sure no overlap exists between training and evaluation data, the first evaluated date is 6 January 2019 while the last training date is 31 December 2017.

We perform experiments with a neural network based on Convolutional LSTMs, which have been shown to be effective for regional precipitation nowcasting (Xingjian et al. 2015). We structure our forecasting task based on MetNet's configuration (Sønderby et al. 2020), where a single model is trained conditioned on time and is capable of forecasting at different lead times. The network's input is composed of a time series {x_t}, where each x_t is the set of standardized features at time t, sampled at regular intervals ∆t from t = −T to t = 0; the output is a precipitation forecast y at lead time t = τ ≤ τ_L. In addition to the aforementioned atmospheric features, static features (e.g. latitude) along with three time-dependent features (hour, day, month) are repeated per timestep. The input vector is then concatenated with a lead-time one-hot vector x_τ. In our experiments, we adopt T = 12 h, ∆t = 3 h and forecasts at 24-hour intervals up to τ_L = 120 h. We note that we do not include precipitation as an input temporal feature. An overview of our setup is shown in Figure 2.

Figure 2: Model setup for the benchmark forecasting tasks.

We approach the tasks as a regression problem. Following Rasp et al. (2020), we use the mean latitude-weighted Root-Mean-Square Error (RMSE) as loss and evaluation metric. We compare the results to persistence and climatology baselines. For persistence, precipitation values at t = 0 are used as prediction at t = τ. We compute climatology and weekly climatology baselines from the full training dataset (since 1979 for ERA5 and since 2000 for IMERG), where local climatologies are computed as a single mean over all times and per week respectively (Rasp et al. 2020).

Footnote: https://pytorch-lightning.readthedocs.io/en/latest/ (2021)

Results
In this section, we first present our data analysis of RainBench. We then describe the models' performance on the benchmark precipitation forecasting tasks, which highlights the difficulty in forecasting precipitation values on IMERG. Finally, we present an experiment on same-timestep precipitation estimation to investigate class balancing issues.
Data Analysis
To analyse the dependencies between all RainBench variables, we calculate pairwise Spearman's rank correlation indices over the latitude band from −60° to 60° and the date range from April 2016 to December 2019 (see Figure 3). In contrast to Pearson's correlation coefficient, Spearman's correlation coefficient is significant if there is a, potentially non-linear, monotonic relationship between variables, while Pearson's considers only linear correlations. This allows us to capture relationships between variables such as that between temperature and absolute latitude.

Figure 3: Spearman's correlation of RainBench variables from April 2016 to December 2019 in the latitude band [−60°, 60°] at pressure levels 300 hPa (about 10 km) (upper triangle) and 850 hPa (about 1.5 km) (lower triangle). Legend: lon: longitude, lat: latitude, lsm: land-sea mask, oro: orography (topographic relief of mountains), lst: soil type, z: geopotential height, t: temperature, q: specific humidity, sp: surface pressure, clwc: cloud liquid water content, ciwc: cloud ice water content, t2m: temperature at 2 m, clbt:i: i-th SimSat channel, tp: ERA5 total precipitation, imerg: IMERG precipitation. All correlations in this plot are statistically significant.

Comparing correlations at the 300 hPa (about 10 km) and 850 hPa (about 1.5 km) pressure levels, we can see that they are almost identical, save for a few exceptions: specific humidity, q, and geopotential height, z, correlate strongly at 300 hPa but not at 850 hPa; cloud ice water content, ciwc, generally correlates more strongly at higher altitude (and cloud liquid water content, clwc, vice versa). A careful examination of the underlying physical dependencies results in the realisation that all of these asymmetries stem mostly from latitudinal correlations or effects related to cloud formation, e.g. ice and liquid form in clouds at different temperatures/altitudes.

As we are particularly interested in variables that have predictive skill on precipitation, we note that all SimSat spectral channels moderately anti-correlate with both ERA5 and IMERG precipitation estimates. Interestingly, SimSat signals correlate much more strongly with specific humidity and cloud ice water content at higher altitude, which might be a consequence of spectral penetration depth. The ERA5 state variables that correlate the most with either precipitation estimate are specific humidity and temperature. Cloud ice water content correlates moderately strongly with precipitation estimates at high altitude, but not at all at lower altitudes (where ice water content tends to be much lower).

Table 2: Precipitation forecasts evaluated with latitude-weighted RMSE (mm). All rows except where otherwise stated show models trained with data from 2016 onwards.

(a) Predicting precipitation from ERA5
Inputs                  1-day    3-day    5-day
Persistence             0.6249   0.6460   0.6492
Climatology             0.4492 (1979-2017)
Climatology (weekly)           (1979-2017)
SimSat                  0.4610   0.4678   0.4691
ERA                     0.4562   0.4655   0.4677
SimSat + ERA
ERA (1979-2017)         0.4485   0.4670   0.4699

(b) Predicting precipitation from IMERG

Inputs                  1-day    3-day    5-day
Persistence             1.1321   1.1497   1.1518
Climatology             0.7696 (2000-2017)
Climatology (weekly)           (2000-2017)
SimSat                  0.8166   0.8201   0.8198
ERA                     0.8182   0.8224   0.8215
SimSat + ERA
ERA (2000-2017)         0.8085   0.8194   0.8214

Further, a number of time-varying ERA5 state variables correlate more strongly with IMERG precipitation than with ERA5 precipitation, as do SimSat signals. Conversely, a number of constant variables, such as land-sea mask, orography and soil type, are significantly anti-correlated with ERA5 precipitation, but not at all correlated with IMERG. Overall, we find that all variables that are significantly correlated or anti-correlated with both ERA5 tp and IMERG are also correlated or anti-correlated with SimSat clbt:0-2, suggesting that precipitation prediction from simulated satellite data alone may be feasible.
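The distinction between Spearman's and Pearson's coefficients used in this analysis is easy to demonstrate: Spearman's is simply Pearson's correlation applied to rank vectors, so any monotonic, even strongly non-linear, relationship attains a perfect score. A minimal NumPy sketch (ignoring tied values, which the standard definition handles via averaged ranks):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Simplified sketch that ignores ties (real data would need average
    ranks, as in scipy.stats.spearmanr).
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

x = np.linspace(-2.0, 2.0, 101)
y = x ** 3                        # monotonic but strongly non-linear

# Pearson underrates the monotonic relationship; Spearman scores it as perfect.
pearson = np.corrcoef(x, y)[0, 1]
```

This is why Spearman's coefficient can capture relationships such as that between temperature and absolute latitude, which are monotonic but far from linear.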
Precipitation Forecasting
Table 2 compares the neural model forecasts in different data settings when predicting precipitation from ERA5 and IMERG. Using the ERA5 precipitation as target, Table 2a shows that training from SimSat alone gives the worst results across the data settings. This confirms the difficulty of precipitation forecasting from satellite data alone, which does not contain as much information about the atmospheric state as sophisticated reanalysis data such as ERA5. Importantly, the complementary benefit of utilizing data from both sources is already visible despite our simple concatenation setup, as training from both SimSat and ERA5 achieves the best results across all lead times (when holding the number of training instances constant).

Figure 4 shows example forecasts from one random input sequence across the different data settings for predicting ERA5 precipitation. We observe that the forecasts can capture the general precipitation distribution across the globe, but there are varying degrees of blurriness in the outputs. As we shall discuss later in the paper, considering probabilistic forecasts would be a promising solution to blurriness, which might have arisen as the mean predicted outcome.

We also see the importance of using a large training dataset, since extending the considered training instances to the full ERA5 dataset outperforms the baselines further in the 1-day forecasting regime (shown in the last rows).

Table 2b shows the forecast results when predicting IMERG precipitation. As before, the neural model's forecasting skill based on both SimSat and ERA input outperforms the other input settings. The higher observed RMSEs suggest that this is a considerably more difficult task, which we believe to be closely tied to IMERG featuring more extreme precipitation events (Figure 1). In the next section, we investigate this issue further by considering a same-timestep precipitation estimation task.

A key limitation of our current experimental setup is that it requires all of the ERA5, IMERG and SimSat channels to be available at each time step, limiting the range of our training data to April 2016 and onward. Nevertheless, our neural models significantly outperform persistence baselines. The fact that local climatology trained over longer time periods significantly outperforms our network model baselines suggests developing alternative modelling setups that can make use of the full available datasets from each source.
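The latitude-weighted RMSE used for these evaluations down-weights errors near the poles, where grid cells cover less physical area. A NumPy sketch following the WeatherBench definition (Rasp et al. 2020):

```python
import numpy as np

def lat_weighted_rmse(pred, target, lats_deg):
    """Latitude-weighted RMSE as in WeatherBench (Rasp et al. 2020).

    pred, target: arrays of shape (lat, lon); lats_deg: grid latitudes.
    Grid cells near the poles cover less area, so squared errors there
    are down-weighted by cos(latitude), normalised to mean 1.
    """
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()                       # normalise weights to mean 1
    sq_err = (pred - target) ** 2 * w[:, None]
    return np.sqrt(sq_err.mean())
```

Because the weights are normalised to mean 1, a uniform error of 1 mm everywhere yields an RMSE of exactly 1 mm, while the same error confined to polar rows counts for less than it would at the equator.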
Same-Timestep Precipitation Estimation
We now describe a set of experiments for same-timestep precipitation estimation on IMERG. This analysis is done independently from the precipitation forecasting benchmark tasks, in order to provide an in-depth understanding of the challenges in modelling extreme precipitation events.

We use a gradient boosting decision tree learning algorithm (Ke et al. 2017, LightGBM) to estimate same-timestep IMERG precipitation directly from ERA5 and SimSat. Our training set consists of several million randomly sampled grid points/pixels within the time interval April 2016 to December 2017. We compare the (not latitude-adjusted) RMSE for two pixel sampling variants: A) unbalanced sampling, meaning grid points are chosen randomly from the raw data distribution, and B) balanced sampling, in which we bin IMERG precipitation into the four classes defined in Figure 1 and sample grid points such that we end up with an equal number of pixels per bin.

In Table 3, we find that taking a balanced sampling approach reduces the per-class validation RMSE of moderate, heavy and violent precipitation. This balanced sampling approach also has detrimental effects on the mean forecasting performance but not on the macro-mean performance, as the 'slight' class dominates the dataset and is misclassified more often. However, balancing the training set does result in a lower macro RMSE.

Figure 4: ERA5 precipitation forecasts on one random sample.

Table 3: Comparing RMSE results with and without a class-balanced training dataset. The modelling task is same-timestep estimation of IMERG precipitation.

                 L       M       H       V       Mean    Macro
Unbalanced ERA
Balanced ERA

Designing an appropriate class-balanced sampling scheme may play a crucial role toward improving predictions of extreme precipitation events. It is not quite clear how a per-pixel sampling scheme may be translated into a global output context approach such as in MetNet (Sønderby et al. 2020), where each individual pixel's input distribution should be kept balanced, while training as many pixels per input data sample as possible for efficiency. A possible way of navigating this challenge would be to sample greedily, i.e. based on the currently most imbalanced pixel, and to combine this with learning rate adjustments for other pixels trained on the same frame based on how imbalanced these pixels are at that timestep.
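Variant B above can be sketched with NumPy: bin each pixel's precipitation into the four Met Office convection classes from Figure 1, then draw the same number of pixels from each bin. The bin edges follow Figure 1; the sample count and the with-replacement fallback for sparse classes are illustrative choices.

```python
import numpy as np

# Met Office convection rainfall classes from Figure 1, in mm/h:
# slight (<2), moderate (2-10), heavy (10-50), violent (>50).
BIN_EDGES = np.array([2.0, 10.0, 50.0])

def balanced_indices(precip, n_per_class, rng):
    """Sample an equal number of pixels from each precipitation class.

    Sketch of variant B (balanced sampling); `precip` is a flat array of
    hourly precipitation values. Classes with fewer than `n_per_class`
    members are sampled with replacement.
    """
    classes = np.digitize(precip, BIN_EDGES)          # class index 0..3
    chosen = []
    for c in range(len(BIN_EDGES) + 1):
        members = np.flatnonzero(classes == c)
        replace = len(members) < n_per_class
        chosen.append(rng.choice(members, size=n_per_class, replace=replace))
    return np.concatenate(chosen)
```

The resulting index set feeds a heavily rebalanced training batch: rare violent-precipitation pixels appear as often as the dominant slight-precipitation ones.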
Discussion
We outline the key challenges in global precipitation forecasting and our proposed solutions, and discuss promising research avenues that can build on our work.
Challenges
From our experiments, we identified a number of challengesinherent to data-driven extreme precipitation forecasting.
Class imbalance
Extreme precipitation events, by their nature, occur rarely (see Figure 1). In the context of supervised learning, this manifests as a class imbalance problem, in which a model might rarely predict extreme values. Designing an appropriate class sampling strategy (e.g. inverse frequency sampling) can mitigate this imbalance, as shown in our same-timestep prediction experiments. Further, we believe that a mixture of pixelwise weighting and balanced sampling could be a potential solution.
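As a complement to resampling, the inverse-frequency idea can also be applied as per-sample loss weights rather than as a sampling distribution. A small NumPy sketch (the normalisation to mean 1 is an illustrative choice, not prescribed by the text):

```python
import numpy as np

def inverse_frequency_weights(class_ids, n_classes):
    """Per-sample loss weights proportional to inverse class frequency.

    Rare classes receive proportionally larger weights; the weights are
    normalised so they average to 1 over the dataset, leaving the overall
    loss scale unchanged.
    """
    counts = np.bincount(class_ids, minlength=n_classes).astype(float)
    w_class = 1.0 / np.maximum(counts, 1.0)   # rare classes weigh more
    w = w_class[class_ids]
    return w / w.mean()
```

Each training pixel's squared error would then be multiplied by its weight, so a violent-precipitation pixel contributes as much to the gradient as the many slight-precipitation pixels combined.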
Probabilistic forecasts.
The current machine learning setup produces deterministic predictions, which may lead to an averaging of possible futures into a single blurry prediction. This limitation may be overcome with probabilistic modelling, which may take different forms. For instance, Sønderby et al. made use of a cross-entropy loss over a categorical distribution to handle probabilistic forecasts. Stochastic video prediction techniques (Babaeizadeh et al. 2018) and conditional generative adversarial learning (Mirza and Osindero 2014) have also been shown to produce realistic predictions in other fields. Other relevant techniques that predict distribution parameters are Variational Auto-Encoders (Kingma and Welling 2014) and normalizing flows (Rezende and Mohamed 2016).
Data normalisation.
Feature scaling is a common data-processing step for training machine learning models and is well understood to be advantageous (Bhanja and Das 2019). Our current approach normalizes each variable using its global mean and standard deviation; this disregards any local spatial differences, which are important for modelling local weather patterns (Weyn, Durran, and Caruana 2019). Previous work suggested that patch-wise normalisation may be appropriate (Grönquist et al. 2020, Local Area-wise Standardization (LAS)). We suggest studying a refinement of LAS, which adjusts the kernel size with latitude such that the spatial normalisation context remains constant (Latitude-Adjusted LAS), as well as per-channel image-size normalisation.
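A toy one-dimensional illustration of the latitude-adjustment idea: standardize each pixel against a longitude window whose width scales with 1/cos(latitude), so that the window spans a roughly constant physical distance at every latitude. The kernel shape and base width here are illustrative; both LAS (Grönquist et al. 2020) and the proposed refinement operate on 2-D patches.

```python
import numpy as np

def lat_adjusted_las(field, lats_deg, base_width=8):
    """Sketch of Latitude-Adjusted Local Area-wise Standardization (1-D).

    For each latitude row, standardize every pixel by the mean and std of
    a periodic longitude window whose width grows as 1/cos(lat), keeping
    the physical extent of the normalisation context roughly constant.
    """
    out = np.empty_like(field, dtype=float)
    n_lon = field.shape[1]
    for i, lat in enumerate(lats_deg):
        # Widen the window toward the poles; cap below the full circle.
        width = int(min(n_lon - 1, base_width / max(np.cos(np.deg2rad(lat)), 1e-2)))
        width = max(width, 2)
        for j in range(n_lon):
            # Periodic longitude window centred on pixel j.
            idx = np.arange(j - width // 2, j + width // 2 + 1) % n_lon
            window = field[i, idx]
            out[i, j] = (field[i, j] - window.mean()) / (window.std() + 1e-8)
    return out
```

A constant field normalises to zero everywhere, and near the poles the statistics are pooled over many more longitude points than at the equator.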
Data topology.
Lastly, the spherical input and output data topology of global forecasting contexts poses interesting questions for neural network architecture. While a multitude of approaches to handle spherical input topologies have been suggested (see Llorens Jover (2020) for an overview), it seems yet unclear which approach works best. Our dataset might constitute a valuable benchmark for such research.
Future research avenues
Apart from overcoming the challenges outlined above, we have identified a variety of opportunities for further research.
Physics-informed multi-task learning.
Apart from using reanalysis data for model training, we do not currently exploit the fact that many aspects of weather forecasting are well understood from a physical perspective. One way of informing model training of physical constraints would be to train precipitation forecasting concurrently with the prediction of physical state variables, including temperature and specific humidity, in a multi-task setting, e.g. by using separate decoder heads for different variables (similarly to Caruana (1997)). This approach promises to combine the advantages of data-driven learning with low-level feature regularisation through a physics-informed inductive bias. Multi-task learning can also be regarded as a form of data augmentation (Shorten and Khoshgoftaar 2019), promising to further increase forecasting performance using real or simulated satellite data without requiring access to reanalysis data at inference time.
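In a framework-agnostic sketch (pure NumPy, forward pass only; the toy shapes and the auxiliary loss weight are illustrative, not values from the paper), such a multi-task setup amounts to a shared trunk with separate decoder heads whose losses are summed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: 16 input channels, 32 shared features.
W_enc = rng.normal(size=(32, 16)) * 0.1      # shared encoder weights
W_precip = rng.normal(size=(1, 32)) * 0.1    # precipitation head
W_aux = rng.normal(size=(2, 32)) * 0.1       # auxiliary head: e.g. t, q

def forward(x):
    """Shared trunk followed by two separate decoder heads."""
    h = np.maximum(W_enc @ x, 0.0)           # shared ReLU representation
    return W_precip @ h, W_aux @ h

def multitask_loss(x, y_precip, y_aux, aux_weight=0.5):
    """MSE on the main task plus down-weighted MSE on physical variables."""
    p, a = forward(x)
    return np.mean((p - y_precip) ** 2) + aux_weight * np.mean((a - y_aux) ** 2)
```

Gradients of the auxiliary term flow through the shared trunk, which is exactly how the physics-informed regularisation of low-level features would arise in a trained model.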
Increasing spatial resolution.
Data at higher spatial resolution tends to capture heavy and extreme precipitation events better, but poses a number of challenges. Large sample batch sizes may lead to network activation storage that exceeds GPU global memory capacity, even for distributed training. Apart from exploring TPU or NVLink-based solutions, another way would be to switch to mixed-precision or half-precision training, or to employ techniques that trade off memory for compute, such as gradient checkpointing (Pinckaers, van Ginneken, and Litjens 2019). PyRain's dataloader efficiently maximises total disk throughput, which may itself become a bottleneck at very high resolutions. Storing all or part of the training data memmaps on one or several high-speed local SSDs may increase disk throughput a few-fold. Apart from memory and disk throughput, there is also a lack of suitably highly resolved historical climate data for pre-training (Rasp et al. 2020). One possible way of overcoming this would be to integrate high-resolution local forecasting model or sensor data into the training process (Franch et al. 2020). Another exciting approach, spearheaded in computational fluid dynamics (Jabarullah Khan and Elsheikh 2019), is to employ a multi-fidelity approach, where hierarchical variance-reduction techniques enable training to be performed on lower-resolution data as often as possible, thus minimising the need for training on high-resolution data.
Reducing IMERG Early Run lag time.
While the final IMERG product becomes available at a time lag of ca. 3-4 months, a preliminary Early Run product based on raw satellite data becomes available within a few hours. We postulate that this lag could be further reduced if, instead of high-dimensional observational data, forecasting agencies exchanged locally processed low-dimensional embeddings derived from local encoder networks. These embeddings could then be fed into a late-fusion network architecture similar to that of Rudner et al. (2019, Multi3Net).
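The proposed exchange could be sketched as follows. This is a hypothetical illustration: each agency runs a local encoder over its own observations, and only the small embedding vectors are transmitted and fused centrally:

```python
# Hypothetical sketch of embedding exchange with late fusion: raw
# observations stay local, only low-dimensional embeddings are shared.
import torch
import torch.nn as nn

EMBED_DIM = 32  # size of the embedding exchanged between agencies


class LocalEncoder(nn.Module):
    """Runs locally at each agency; compresses raw observations."""
    def __init__(self, in_features, embed_dim=EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):
        return self.net(x)


class LateFusionHead(nn.Module):
    """Concatenates the exchanged embeddings and produces a forecast."""
    def __init__(self, n_agencies, embed_dim=EMBED_DIM, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agencies * embed_dim, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, embeddings):
        return self.net(torch.cat(embeddings, dim=-1))
```

Since only `EMBED_DIM` floats per sample cross agency boundaries, bandwidth and latency requirements are far lower than for exchanging the raw observational fields.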
Multi-time-step loss function.
Numerical forecasting systems forward the physical state in time in an iterative fashion, where the output of the previous step is fed as input to the next step. As the update rules are identical for each step, it in principle suffices for a neural network to learn a single such update step and apply it multiple times during inference, depending on the prediction lead time, thus reducing the number of trainable weights and potentially increasing generalisation performance. To avoid the instability issues inherent to iterative approaches (Rasp et al. 2020), model rollouts can be trained end-to-end (McGibbon and Bretherton 2019; Brenowitz and Bretherton 2018). Weyn, Durran, and Caruana (2020) pioneer this approach but limit themselves to just two time steps. To overcome device memory constraints in such a setting and to scale to a large number of rollout time steps, iteration layers could be chosen to be reversible (Gomez et al. 2017), such that activations can be recomputed on the fly during backpropagation and do not need to be stored in device memory.
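An end-to-end multi-time-step rollout loss of this kind could be sketched as follows (a minimal illustration; the residual step network and the uniform loss weighting are placeholders, not our trained model):

```python
# Minimal sketch of an iterated rollout trained end-to-end: one shared
# step network is applied repeatedly, and the loss is accumulated
# against the target state at every intermediate lead time.
import torch
import torch.nn as nn


class StepNet(nn.Module):
    """A single learned update step, shared across all lead times."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, state):
        return state + self.net(state)  # residual update of the state


def rollout_loss(step, state, targets):
    """Iterate the shared step network; the output of step t is the
    input of step t+1, and each step is supervised."""
    loss = 0.0
    for target in targets:      # targets: future states, one per step
        state = step(state)
        loss = loss + nn.functional.mse_loss(state, target)
    return loss / len(targets)
```

Because the loss is backpropagated through the entire rollout, early steps are trained to produce states that remain useful inputs for later steps, which is precisely what mitigates the iterative-instability issue.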
Conclusion
We presented RainBench, a novel benchmark suite for data-driven extreme precipitation forecasting, and PyRain, an associated rapid-experimentation framework with a fast dataloader. Both RainBench and PyRain are open source and well documented. We furthermore presented neural baselines for multi-day precipitation forecasting from both reanalysis and simulated satellite data. Despite our simple approach, we find that our neural baselines beat climatology and persistence baselines for forecasts up to several days ahead. In addition, we used a gradient boosting decision tree algorithm to study the impact of precipitation class balancing on regression in a precipitation estimation setting, and presented various forms of data exploration, including a correlation study.

In the near future, we will augment RainBench with real satellite data. We also plan to include historical climate data for pre-training. Concurrently, we will explore various directions for future research, as discussed above. In particular, we believe increasing the spatial resolution of our input data is crucial to closing the gap to operational forecasting models. Ultimately, we hope that our benchmark and framework will lower the barrier of entry for the global research community, so that our work contributes to rapid progress in data-driven weather prediction and to the democratisation of access to adequate weather forecasts and, ultimately, helps protect and improve livelihoods in a warming world.

Acknowledgements

This research was conducted at the Frontier Development Lab (FDL), Europe. The authors gratefully acknowledge support from the European Space Agency ESRIN Phi Lab, Trillium Technologies, NVIDIA Corporation, Google Cloud, and SCAN. The authors are thankful to Peter Dueben, Stephan Rasp, Julien Brajard and Bertrand Le Saux for useful suggestions.
References
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mane, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viegas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467 [cs].
Aminou, D. 2002. MSG's SEVIRI instrument. ESA Bulletin (0376-4265) (111): 15-17.
Arcomano, T.; Szunyogh, I.; Pathak, J.; Wikner, A.; Hunt, B. R.; and Ott, E. 2020. A Machine Learning-Based Global Atmospheric Forecast Model. Geophysical Research Letters.
Bauer, P.; Thorpe, A.; and Brunet, G. 2015. The quiet revolution of numerical weather prediction. Nature 525: 47-55.
Bihlo, A. 2020. A generative adversarial network approach to (ensemble) weather prediction. arXiv preprint arXiv:2006.07718.
Brenowitz, N. D.; and Bretherton, C. S. 2018. Prognostic Validation of a Neural Network Unified Physics Parameterization. Geophysical Research Letters.
Caruana, R. 1997. Multitask Learning. Machine Learning 28(1): 41-75.
Gomez, A. N.; Ren, M.; Urtasun, R.; and Grosse, R. B. 2017. The Reversible Residual Network: Backpropagation Without Storing Activations. arXiv:1707.04585 [cs].
Grönquist, P.; Yao, C.; Ben-Nun, T.; Dryden, N.; Dueben, P.; Li, S.; and Hoefler, T. 2020. Deep Learning for Post-Processing Ensemble Weather Forecasts.
Gubler, S.; Sedlmeier, K.; Bhend, J.; Avalos, G.; Coelho, C.; Escajadillo, Y.; Jacques-Coper, M.; Martinez, R.; Schwierz, C.; de Skansi, M.; et al. 2020. Assessment of ECMWF SEAS5 seasonal forecast performance over South America. Weather and Forecasting.
Jabarullah Khan, N. K.; and Elsheikh, A. H. 2019. A Machine Learning Based Hybrid Multi-Fidelity Multi-Level Monte Carlo Method for Uncertainty Quantification. Frontiers in Environmental Science 7. ISSN 2296-665X.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; and Liu, T.-Y. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 3146-3154.
Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat].
Lalaurette, F. 2003. Early detection of abnormal weather conditions using a probabilistic extreme forecast index. Quarterly Journal of the Royal Meteorological Society.
Mirza, M.; and Osindero, S. 2014. Conditional Generative Adversarial Nets. arXiv:1411.1784 [cs, stat].
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 8024-8035. Curran Associates, Inc.
Pinckaers, H.; van Ginneken, B.; and Litjens, G. 2019. Streaming convolutional neural networks for end-to-end learning with multi-megapixel images. arXiv:1911.04432 [cs].
Praveen, B.; Talukdar, S.; Shahfahad; Mahato, S.; Mondal, J.; Sharma, P.; Islam, A. R. M. T.; and Rahman, A. 2020. Analyzing trend and forecasting of rainfall changes in India using non-parametrical and machine learning approaches. Scientific Reports.
Rasp, S.; Dueben, P. D.; Scher, S.; Weyn, J. A.; Mouatadid, S.; and Thuerey, N. 2020. WeatherBench: A benchmark dataset for data-driven weather forecasting. arXiv:2002.00469 [physics, stat].
Rasp, S.; and Thuerey, N. 2020. Purely data-driven medium-range weather forecasting achieves comparable skill to physical models at similar resolution.
Rezende, D. J.; and Mohamed, S. 2016. Variational Inference with Normalizing Flows. arXiv:1505.05770 [cs, stat].
Rocklin, M. 2015. Dask: Parallel Computation with Blocked algorithms and Task Scheduling. In Huff, K.; and Bergstra, J., eds., Proceedings of the 14th Python in Science Conference, 130-136.
Rudner, T. G. J.; Rußwurm, M.; Fil, J.; Pelich, R.; Bischke, B.; Kopačková, V.; and Biliński, P. 2019. Multi3Net: Segmenting Flooded Buildings via Fusion of Multiresolution, Multisensor, and Multitemporal Satellite Imagery. In Proceedings of the AAAI Conference on Artificial Intelligence.
Shorten, C.; and Khoshgoftaar, T. M. 2019. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data 6(1): 60.
Sønderby, C. K.; Espeholt, L.; Heek, J.; Dehghani, M.; Oliver, A.; Salimans, T.; Agrawal, S.; Hickey, J.; and Kalchbrenner, N. 2020. MetNet: A Neural Weather Model for Precipitation Forecasting. arXiv preprint arXiv:2003.12140.
Vogel, E.; Donat, M. G.; Alexander, L. V.; Meinshausen, M.; Ray, D. K.; Karoly, D.; Meinshausen, N.; and Frieler, K. 2019. The effects of climate extremes on global agricultural yields. Environmental Research Letters.
Weyn, J. A.; Durran, D. R.; and Caruana, R. 2020. Improving Data-Driven Global Weather Prediction Using Deep Convolutional Neural Networks on a Cubed Sphere. Journal of Advances in Modeling Earth Systems. arXiv:2003.11927 [physics, stat].
Xingjian, S.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 802-810.
Zhang, F.; Sun, Y. Q.; Magnusson, L.; Buizza, R.; Lin, S.-J.; Chen, J.-H.; and Emanuel, K. 2019. What Is the Predictability Limit of Midlatitude Weather? Journal of the Atmospheric Sciences.

Appendix
Pixel-wise precipitation class histograms (IMERG)
We include pixel-wise precipitation class histograms derived from IMERG at native resolution (0.1°), with max-pooling used for downscaling in order to preserve pixel-wise extremes.
Figure 5: Global distribution of slight rain events (% of total events).

Figure 6: Global distribution of moderate rain events (% of total events).

Figure 7: Global distribution of heavy rain events (% of total events).

Figure 8: Global distribution of violent rain events (% of total events).