TRU-NET: A Deep Learning Approach to High Resolution Prediction of Rainfall
Rilwan Adewoyin, Peter Dueben, Peter Watson, Yulan He, Ritabrata Dutta
Abstract
Climate models (CM) are used to evaluate the impact of climate change on the risk of floods and strong precipitation events. However, these numerical simulators have difficulties representing precipitation events accurately, mainly due to limited spatial resolution when simulating multi-scale dynamics in the atmosphere. To improve the prediction of high resolution precipitation we apply a Deep Learning (DL) approach using an input of CM simulations of the model fields (weather variables) that are more predictable than local precipitation. To this end, we present TRU-NET (Temporal Recurrent U-Net), an encoder-decoder model featuring a novel 2D cross attention mechanism between contiguous convolutional-recurrent layers to effectively model multi-scale spatio-temporal weather processes. We use a conditional-continuous loss function to capture the zero-skewed patterns of rainfall. Experiments show that our model consistently attains lower RMSE and MAE scores than a DL model prevalent in short term precipitation prediction, and improves upon the rainfall predictions of a state-of-the-art dynamical weather model. Moreover, by evaluating the performance of our model under various training and testing data formulation strategies, we show that there is enough data for our deep learning approach to output robust, high-quality results across seasons and varying regions.
Keywords
Climate Modelling · Precipitation Downscaling · Attention mechanism · Rainfall forecasting · Recurrent Neural Networks
Rilwan A. Adewoyin: Department of Computer Science and Engineering, Southern University of Science and Technology, China; Department of Computer Science, University of Warwick, UK
Peter Dueben: Earth System Modelling Section, The European Centre for Medium-Range Weather Forecasts
Peter Watson: School of Geographical Sciences, University of Bristol, UK
Yulan He: Department of Computer Science, University of Warwick, UK
Ritabrata Dutta: Department of Statistics, University of Warwick, UK
1 Introduction

Across the globe, society is becoming increasingly prone to extreme precipitation events due to climate change. The United Nations stated that flooding was the most dominant weather-related disaster over the 20 years to 2015, affecting 2.3 billion people and accounting for $1.89 trillion in reported economic losses (Wallemacq and Herden, 2015). With the increase in the monetary and societal risk posed by flooding (Shukla et al., 2019), CM predictions for extreme precipitation events are an important resource in guiding the decisions of policy-makers.

State-of-the-art regional climate models (RCM) typically run at horizontal resolutions of ∼ . The main aim of this paper is to create a model that can produce high resolution predictions for daily rainfall across the UK, by using low resolution CM simulations of model fields (weather variables) as input. It should be noted that the input simulations do not include precipitation, but include other weather variables which, unlike precipitation, are well simulated. When trained, our model can then be used on the output of any CM simulation for the future. This allows us to produce computationally cheap long-term forecasts of high resolution precipitation into the future. This will help to diagnose changes in precipitation events due to climate change.

To do this, we use the ERA5 reanalysis dataset (Hersbach et al., 2020) as an analogue for CM output. Reanalysis data is based on a weather forecast model – running at a similar resolution to a CM – to which observations are constantly assimilated to yield the best estimate of the weather state.
We take these historical weather state estimates and use them to predict the high resolution precipitation observations made available by the E-obs dataset (Cornes et al., 2019).

More concretely, our input data is formed as a timeseries of length t days (t ∈ [1, T]) containing 6 key model fields (air temperature, specific humidity, longitudinal and latitudinal components of wind velocity at 850 hPa, geopotential height at 500 hPa and total column water vapour in the entire vertical column), each defined on a coarse grid representing the UK at approximately 65km spatial resolution. By stacking together the six model fields, we obtain a tensor representing the UK weather state at 6-hour intervals. Our model will therefore take as input a sequence of daily model field observations X_t, from which it will output a prediction for the true daily total precipitation (mm), Y_t, defined on a (100, ·) grid over the UK with approximately 8.5km spatial resolution.

Interpreting X_t as a spatio-temporal sequence of low resolution 3D images and Y_t as a spatio-temporal sequence of high resolution 2D images, our task can be interpreted as a combination of Sequence Transduction and Image Super-Resolution. DeepSD (Vandal et al., 2017) utilised a popular Image Super-Resolution model to downscale 2D precipitation images up to a factor of 4x. Vandal et al. (2018) extended DeepSD by incorporating an optimization scheme utilising a conditional-continuous (CC) loss function to improve the modelling of zero-skewed precipitation events. In the related sequence transduction task of precipitation nowcasting, Shi et al. (2015) introduced the Convolutional Long Short-Term Memory (ConvLSTM) cell to simultaneously model the behaviour of weather dynamics in space and time, using convolutions to incorporate the surrounding flow-fields and a recurrent structure to model the temporal dependencies of weather.
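As a concrete illustration of the input assembly described above, the snippet below stacks six model fields sampled at 6-hourly intervals into a single day's input tensor. The grid size (20 × 21) and the short field names are hypothetical placeholders chosen for illustration, not the paper's exact values.

```python
import numpy as np

# Assumed coarse-grid size over the UK (illustrative only).
H, W = 20, 21
N_FIELDS = 6           # temperature, humidity, u/v wind, geopotential, TCWV
STEPS_PER_DAY = 4      # 6-hourly intervals

rng = np.random.default_rng(0)
# One day's worth of raw fields: each is a (STEPS_PER_DAY, H, W) array.
fields = {name: rng.normal(size=(STEPS_PER_DAY, H, W))
          for name in ["t850", "q850", "u850", "v850", "z500", "tcwv"]}

# Stack the six fields along a trailing channel axis:
# X_t has shape (4, H, W, 6), one slice per 6-hour step.
X_t = np.stack([fields[k] for k in fields], axis=-1)
assert X_t.shape == (STEPS_PER_DAY, H, W, N_FIELDS)
```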
Extending upon this, the encoder-decoder ConvLSTM (Shi et al., 2015) is also able to represent weather dynamics defined on multiple spatial scales due to the use of successive convolution based layers, with each layer modelling larger scale dynamics.

We extend the encoder-decoder ConvLSTM by adding the ability to represent weather dynamics defined on multiple scales in time with our proposed TRU-NET model, a Convolutional Gated Recurrent Unit (ConvGRU) based encoder-decoder that includes the following three features:

1. A novel Fused Temporal Cross Attention (FTCA) mechanism that improves upon existing methods (Jauhar et al., 2018, Zhao et al., 2019, Liu et al., 2019b) for modelling multiple temporal scales of weather dynamics using a stacked recurrent structure.
2. An encoder-decoder structure adapting the U-NET (Ronneberger et al., 2015) structure, by contracting or expanding the temporal dimensions as opposed to the spatial dimensions.
3. A conditional continuous (Husak et al., 2007, Vandal et al., 2018) loss training scheme to improve the prediction of extreme precipitation events, by decoupling the modelling of low intensity and high intensity precipitation events.

We present TRU-NET and the conditional continuous loss in Section 2 and discuss the model training details in Section 3. In Section 4, we perform experiments to compare TRU-NET to baselines and investigate TRU-NET's performance on out-of-sample forecasting tasks. Finally, we perform an ablation study to compare our proposed FTCA against alternative existing methods. The results of these experiments show that:

– Our novel model, TRU-NET, achieves a lower RMSE and MAE than both a state-of-the-art dynamical weather forecasting model's coarse precipitation prediction and a hierarchical ConvGRU model.
– The quality of TRU-NET's predictions is stable when evaluated on out-of-sample weather predictions formed by time periods outside of the training set.
– Our proposed FTCA outperforms existing methods for decreasing the temporal scale modelled by successive recurrent layers in a stacked recurrent structure.
2 TRU-NET

Our TRU-NET model, visualised in Figure 1, maps the 6-hourly low resolution model fields, X_t, to representations capturing variability on 6-hourly, daily and weekly time scales, and then uses these representations to output a prediction, Ŷ_t, for daily total rainfall.
Fig. 1: TRU-NET architecture. This figure depicts the Conditional Continuous variant, which outputs values for rain level and rain probability for 28 consecutive days. The Sequence Length of 3D tensors between layers contracts/expands through the encoder/decoder. This relates to an increasing/decreasing of the temporal scales modelled. The horizontal direction from left to right indicates time.

As a first step within TRU-NET, we map the input data from the coarse grid onto the fine grid using bi-linear upsampling (note that an increase in resolution is called upsampling in the machine learning literature but downscaling in the meteorology literature).

The encoder contains a stack of 3 bi-directional ConvGRU layers. Within the encoder, these 3 layers map the input into coarser spatial/temporal scales, from six-hourly/8.5km, to daily/34km, and to weekly/136km. To achieve this reduction in the temporal scales modelled by contiguous encoder layers, we propose a novel Fused Temporal Cross Attention (FTCA) mechanism, as shown in Figure 2. These scales are aligned to the timescales associated with extreme rainfall events in the UK (Burton, 2011).

The decoder maps the latent representation captured at the weekly scale back to the daily scale before feeding it to an output layer for daily rain prediction.

Due to memory constraints we do not input the full model fields at once. In space, we extract stencils of grid-points for the input to predict precipitation over a stencil of grid-points in the centre of the input stencil. TRU-NET processes 28 days' worth of information at a time, generating an output of total daily precipitation for all of the 28 days for each application of TRU-NET:

Ŷ_t, ..., Ŷ_{t−J} = f(X_t, X_{t−1}, ..., X_{t−J})    (1)

with J = 28. This naturally generates a lack of information on the past for the first timesteps and a lack of information on the future for the last timesteps. However, this could be avoided in future studies by a stream of input data that only makes predictions for the time-steps in the centre of the time-series.

In the following, we describe each of the main components of TRU-NET in more detail.

2.1 Encoder

The encoder of our TRU-NET model, as shown in Figure 1, has L = 3 ConvGRU layers, where the l-th layer decreases the sequence length by a factor m_l. This results in the number of units in each ConvGRU based layer decreasing in the manner 112 → 28 → 4, corresponding to six-hourly, daily and weekly temporal resolutions.

The conventional ConvGRU is a recurrent neural network designed to model spatial-temporal information. In a conventional ConvGRU layer, each unit i shares its trainable weight matrices {W_k, U_k, b_k : k ∈ [z, r, Ã]} with other units in the layer, and collectively they are described as having tied weights. Each unit i takes two inputs, namely the previous state A_{i−1} and the input in the current timestep B̂_i, and outputs a state A_i, as detailed below. Here, z_i is the update gate, r_i is the reset gate, Ã is the cell state, and • and ∗ denote the Hadamard product and convolution, respectively.
z_i = σ(B̂_i ∗ W_z + A_{i−1} ∗ U_z + b_z)
r_i = σ(B̂_i ∗ W_r + A_{i−1} ∗ U_r + b_r)
Ã_i = tanh(W_Ã ∗ B̂_i + r_i • (U_Ã ∗ A_{i−1}) + b_Ã)
A_i = z_i • A_{i−1} + (1 − z_i) • Ã_i    (2)

When mapping an input from one time scale to another, e.g. generating the daily time scale tensor for day t from a sequence of 4 corresponding six-hourly time scale tensors, a simple approach is to average the 4 six-hourly tensors. However, such a simple aggregation strategy ignores the influence of the daily time scale tensor from the previous day t − 1. We instead propose Fused Temporal Cross Attention (FTCA) as a better aggregation strategy, based on the cross attention mechanism.

In the final two ConvGRU layers of the encoder, FTCA is fused into the ConvGRU in order to aggregate the inputs from the previous layer to generate a representation for the current layer. The ConvGRU with FTCA is illustrated in Figure 2 and explained in the following subsection.

2.2 Convolutional Gated Recurrent Unit with Fused Temporal Cross Attention (ConvGRU w/ FTCA)

In the conventional ConvGRU, the i-th unit of the l-th layer, denoted as D_{l,i}, takes two inputs, the previous state A_{i−1} and the input in the current time step B̂_i. In our setup here, however, we stack ConvGRU layers with different temporal scales.

Fig. 2: ConvGRU with Fused Temporal Cross Attention (FTCA). (a) illustrates a ConvGRU with FTCA layer and (b) illustrates an individual unit within the layer. The grey box in (b) shows our adaptation of the generic ConvGRU through the addition of an FTCA operation that outputs B̂_i.

As such, the input in the current time step to D_{l,i} is no longer a single tensor, but instead an ordered sequence of tensors, B_i ≡ B_{1:T_b}, as shown in Figure 2(a), where the input B_{l,i} consists of T_b time-aligned outputs from the (l−1)-th ConvGRU layer, i.e., B_{l,i} ≡ {A_{l−1,j} : j = 1, ..., T_b}. For example, if the l-th layer has the daily time resolution, then the (l−1)-th layer has the six-hourly time resolution, T_b = 4 and B_{l,i} ≡ {A_{l−1,1}, A_{l−1,2}, A_{l−1,3}, A_{l−1,4}}.

Given B_i, we propose a Fused Temporal Cross Attention (FTCA) mechanism to calculate a weighted average B̂_i. Here, we use A_{i−1} to derive a query tensor and B_i to derive both a key tensor and a value tensor. The query tensor is compared with the key tensor to generate weights which are used to aggregate the elements of the value tensor to obtain the final aggregated representation B̂_i. Afterwards, the ConvGRU operations in Equation 2 are resumed. The FTCA related operations for unit i decompose into the following three steps:

– Downscaling representations: On A_{i−1} and B_i, we first perform a 3D average pooling (3DAP) with a pool size of M × M × 1 and transform them to matrices A_i^{PF} and B_i^{PF} of dimensions (1, d_a) and (T_b, d_b) respectively, via matrix-reshaping, where d_a = h_a × w_a × c_a × M^{−2} and d_b = h_b × w_b × c_b × M^{−2}. (The use of 3D average pooling is motivated by the high spatial correlation within a given feature map, due to the spatially correlated nature of weather, and reduces the computational expense of the matrix multiplication.)

A_i^{PF} = Reshape(3DAP(A_i)),  B_i^{PF} = Reshape(3DAP(B_i))    (3)

– Similarity calculation using relative attention score (RAS):
We transform A_i^{PF} and B_i^{PF} to Q = A_i^{PF} ◦ W_Q and K = B_i^{PF} ◦ W_K through matrix multiplication, ◦, with two trainable weight matrices, W_Q of dimensions (d_a, d_o) and W_K of dimensions (d_b, d_o). We then compute a matrix of weights S of dimensions (1, T_b), corresponding to the T_b vectors in K, as follows:

S = softmax( Q ◦ (K + a^K)^T / √d_o )    (4)

Note here we use the relative attention score (RAS) function (Shaw et al., 2018) to compute the similarity in Equation 4. Generally, to calculate the similarity scores between Q and each vector K_b, the inner product function is used (Vaswani et al., 2017). RAS extends this inner product scoring function by considering the relative position of each vector K_b to the others. In our case, this position relates to the temporal position of K_b relative to other members of K. To facilitate this, we also learn vectors a^K_b which encode the relative position of each K_b.

– Informative representation:
Finally, the new informative representation B̂ is learnt using two trainable convolution weight matrices with c_f filters, W_{V1} and W_{V2}, and a set of trainable vectors a^V_i ∈ a^V, encoding the relative position of each vector V_i ∈ V, as follows:

V = B ∗ W_{V1}
B̂ = (S ◦ (V + a^V)) ∗ W_{V2}    (5)

We also use Multi-Head Attention (MHA), which allows the attention mechanism to encode multiple patterns of information by using H heads, {W_Q^h, W_K^h, W_V^h}, and performing H parallel cross-attention calculations. The different values of {W_Q^h, W_K^h, W_V^h} across the heads capture different patterns/relationships in the data, whereas using a single head would lead to less diverse or informative patterns being captured.

Q^h = A^{PF} ◦ W_Q^h,  Q = Concat(Q^1, ..., Q^H)
K^h = B^{PF} ◦ W_K^h,  K = Concat(K^1, ..., K^H)
V^h = B ∗ W_V^h,   V = Concat(V^1, ..., V^H)    (6)

2.3 Decoder

The decoder is composed of one Dual State ConvGRU (dsConvGRU) layer and an output layer which outputs predictions for the rain level Y_{t:t+28} for 28 consecutive days. If the conditional-continuous framework is in use, a second output layer outputs the corresponding predictions for the probability of rainfall r_{t:t+28}, as illustrated in Figure 1.

dsConvGRU: As illustrated in Figure 1, the inputs to the dsConvGRU layer come from the 2nd and the 3rd encoder layers, while the output of the dsConvGRU layer is a sequence of 28 tensors which form a latent representation for the 28 days of the target daily precipitation Y_{t:t+28}. As the dsConvGRU layer contains 28 units, we must expand the 3rd encoder layer's output from sequence length 4 to sequence length 28. To do this, we repeat every element in the sequence of length 4, 7 times, as in (Tai et al., 2015). As such,
each unit in the dsConvGRU layer receives an input from the temporally aligned unit in the 3rd encoder layer.

Extending Equation 2, the dsConvGRU augments the conventional ConvGRU by replacing the input B̂_i with two separate inputs B̂_{i,(1)} and B̂_{i,(2)}, each possessing the same dimensions as B̂_i. Thus the i-th unit of the dsConvGRU layer takes three inputs, A_{i−1}, B̂_{i,(1)} and B̂_{i,(2)}, and outputs a state A_i. Referring to Equation 2, we calculate two sets of values, {z_{i,(j)}, r_{i,(j)}, Ã_{i,(j)}, A_{i,(j)}} for j ∈ [1, 2], corresponding to the use of B̂_{(1)} or B̂_{(2)} in place of B̂. Finally, A_i is calculated as the average of A_{i,(1)} and A_{i,(2)}.

Output Layer:
As we need to output two sequences of values, rainfall probabilities r̂_{t:t+28} and rainfall values Ŷ_{t:t+28}, for the conditional-continuous framework which will be discussed in Section 2.4, our model contains a separate output layer stacked over the dual-state ConvGRU layer for each output. Each output layer contains two 2D convolution layers, with 32 and 1 filters respectively and a kernel shape of (3,3).

2.4 Conditional Continuous (CC) Augmentation

To reflect the zero-skewed nature of rainfall data, due to many days without rainfall, a conditional continuous (CC) distribution (Husak et al., 2007, Stern and Coe, 1984) is often used to model precipitation. These distributions can be interpreted as the composition of a discrete component and a continuous distribution, to jointly model the occurrence and intensity of rainfall:

δ(Y_t) = { 1, Y_t = 0;  0, Y_t ≠ 0 }    (7)

p(Y_t; γ) = (1 − r_t) δ(Y_t) + r_t · g(Y_t) · (1 − δ(Y_t))    (8)

where δ is the Dirac function such that ∫ δ(x) dx = 1, r_t is the probability of rain on the t-th day and g(·) is a Gaussian distribution with unit variance and the predicted rainfall Ŷ_t as mean. Therefore (1 − r_t) δ(Y_t) models the no-rain events, while r_t · g(Y_t) · (1 − δ(Y_t)) handles the rain events.

This conditional-continuous distribution requires our model to output a prediction (r̂_t) for the probability of rain occurring, as well as a prediction (Ŷ_t) for the level of rainfall conditional on day t being a rainy day. To facilitate the requirement of two outputs, Ŷ_t and r̂_t, we augment the decoder to contain a second identical output layer. In this case, the TRU-NET model has a branch-like structure, with r_t and Y_t the respective outputs of each branch.

During training, we sample one set of [Ŷ_t, r̂_t] per prediction and use the following loss function.
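As a hedged sketch (not the authors' implementation), the conditional-continuous training objective and the Monte Carlo averaged prediction of Equation 10 can be written per grid point as follows; the sign convention here returns a quantity to minimise, and the 0.5 rain-probability threshold is the one used in Equation 10.

```python
import numpy as np

def cc_loss(y_true, y_pred, r_pred, eps=1e-7):
    """Conditional-continuous loss sketch: binary cross entropy on rain
    occurrence plus squared error on the conditional rain intensity."""
    rained = (y_true > 0).astype(float)
    bce = -(rained * np.log(r_pred + eps)
            + (1.0 - rained) * np.log(1.0 - r_pred + eps))
    mse = (y_true - y_pred) ** 2
    return float(np.mean(bce + mse))

def mcma_predict(r_samples, y_samples, threshold=0.5):
    """Average I MC-dropout samples into one prediction (Equation 10):
    samples whose rain probability is below the threshold contribute zero."""
    return ((r_samples > threshold) * y_samples).mean(axis=0)

# Three dropout samples for a single grid point:
r = np.array([0.9, 0.2, 0.7])
y = np.array([4.0, 6.0, 2.0])
print(mcma_predict(r, y))  # (4.0 + 0 + 2.0) / 3 = 2.0
```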
This can be observed as a combination of the binary cross entropy on predictions for whether or not it rained (the first term) and a squared error term on the predicted conditional rainfall intensity (the second term):

L(Y_t, [Ŷ_t, r̂_t]) = (1/T) [ Σ_{t=1}^{T} ( 1_{y_t>0} · log(r̂_t) + 1_{y_t=0} · log(1 − r̂_t) ) − Σ_{t=1}^{T} ‖Y_t − Ŷ_t‖² ]    (9)

2.5 Monte Carlo Model Averaging (MCMA)

When training with dropout, each of the n weights in the neural network has a probability p of being masked. As such, there are 2^n possible models, defined by the combinations of weights that can be masked. When sampling predictions from the model, it is infeasible to sample from each of the 2^n variations. Instead, we can form a sample of predictions from a random selection of the 2^n possible models, and calculate the average of the sample. More formally, MCMA is the process of using dropout during training and testing. During training, dropout is performed with a fixed probability p of masking weights. During testing we draw samples from our model for each prediction, using different dropout masks on the model's weights, where each dropout mask uses the same masking probability, p, as was used during training. We then calculate the mean of these samples to arrive at a model-averaged prediction. Experiments in (Srivastava et al., 2014, §7.5) show this method is effective for sampling from neural networks trained with dropout.

During inference, we use the MCMA framework to produce I samples [r̂_t^i, Ŷ_t^i], i ∈ [1, I], for each observed rainfall Y_t. For each observation, we calculate a final prediction Ŷ_t for Y_t:

Ŷ_t = (1/I) Σ_{i=1}^{I} 1_{r̂_t^i > 0.5} · Ŷ_t^i    (10)

3 Experimental Setup

This section describes the data used for performance evaluation, baseline models used for comparison, and model hyper-parameter setup.
Integrated Forecast System (IFS):
The IFS is a numerical weather prediction system which solves the physical equations of atmospheric motion. IFS is used for operational weather predictions at the European Centre for Medium-Range Weather Forecasts (ECMWF). It is also used to generate the ERA5 reanalysis data which is used as input data for TRU-NET. While the input fields are a product of the data assimilation process of ERA5, there are also data for precipitation predictions available which are diagnosed from short-term forecast simulations with IFS which use ERA5 as initial conditions. Two forecast simulations are started each day, at 6 am and 6 pm. We extract the precipitation fields for the first 12 hours of each simulation to reproduce daily precipitation; this is presently the optimal way to derive meaningful precipitation predictions from a dynamical model that is consistent with the large-scale fields in the ERA5 reanalysis data. The ERA5 and precipitation data are available on a grid with 31 km resolution. However, our target is to use model fields from climate models as input, which are typically run at coarser resolution. We therefore map the ERA5 data onto the grid that is used in the HadGEM3 climate model (Murphy et al., 2018).
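The regridding step can be illustrated with plain bilinear interpolation on a regular grid. This helper is a minimal sketch under that assumption; the actual tool used to map ERA5 onto the HadGEM3 grid is not specified in the text.

```python
import numpy as np

def bilinear_regrid(field, out_h, out_w):
    """Bilinearly interpolate a 2D field onto an (out_h, out_w) grid.

    Illustrative only: assumes a regular source grid and interpolates
    between the four surrounding cells of each target coordinate.
    """
    in_h, in_w = field.shape
    # Fractional source coordinates for each target cell.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.clip(ys.astype(int), 0, in_h - 2)
    x0 = np.clip(xs.astype(int), 0, in_w - 2)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    f00 = field[np.ix_(y0, x0)]
    f01 = field[np.ix_(y0, x0 + 1)]
    f10 = field[np.ix_(y0 + 1, x0)]
    f11 = field[np.ix_(y0 + 1, x0 + 1)]
    return ((1 - wy) * (1 - wx) * f00 + (1 - wy) * wx * f01
            + wy * (1 - wx) * f10 + wy * wx * f11)

coarse = np.arange(12.0).reshape(3, 4)
fine = bilinear_regrid(coarse, 6, 8)
assert fine.shape == (6, 8)
# Corner values are preserved exactly by bilinear interpolation.
assert fine[0, 0] == coarse[0, 0] and fine[-1, -1] == coarse[-1, -1]
```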
Hierarchical Convolutional GRU (HCGRU):
The general structure of the HCGRU, illustrated in Figure 9 in the Appendix, has been used successfully in precipitation nowcasting (Shi et al., 2015, 2017), wherein it outperformed an Optical Flow algorithm (Woo and Wong, 2017) produced by the Hong Kong Observatory. Our implementation contains 4 ConvGRU layers and an output layer, matching the number of layers in our TRU-NET model. Prior to the first layer, we reduce the input sequence from length 112 to 28, by concatenating blocks of 4 sequential elements. Each of the 4 ConvGRU layers contains 28 recurrent units, with each recurrent unit in each layer containing convolutional operations with 80 filters and a kernel shape of (4, ·). Skip connections exist over the final 3 ConvGRU layers and a final skip connection exists from the output of the first ConvGRU layer to the input of the output layer. The output layer follows the same formulation as in TRU-NET, with two 2D convolution layers.

3.2 Data

Our input data comprises the following 6 model fields: air temperature, specific humidity, longitudinal and latitudinal components of wind velocity at 850 hPa, geopotential height at 500 hPa and total column water vapour in the entire vertical column, each defined on a coarse grid representing the UK at approximately 65km spatial resolution, chosen to match that used in the UK Climate Projections datasets (Murphy et al., 2018).

For training, we use the stencils surrounding sixteen locations to form our training and validation sets, namely, Cardiff, London, Glasgow (G'gow), Birmingham (B'ham), Lancaster (Lanc.), Manchester (Manc.), Liverpool (L'pool), Bradford (B'ford), Edinburgh (Edin), Leeds, Dublin, Truro, Newry, Norwich, Plymouth (P'mth) and Bangor. These locations were chosen as they are important population centres that sample a wide breadth of locations across the UK. Further, collectively these locations possess varied meteorological profiles, depicted in Figure 3.
For example, the percentage of days with rainfall >10mm (R10) ranges from 2.4% to 11.9%, and the average rainfall conditional on an R10 event ranges from 13.8 mm to 16.5 mm.

During testing, we either test on the whole UK, region by region, or test on a single location such as a city. For single location testing, we extract the nearest grid point to the centre of the given location.
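Nearest-grid-point extraction for single-location testing can be sketched as below; the coordinate arrays are toy values for illustration, not the actual UK grid, and a flat squared-distance metric is assumed (adequate for picking the nearest cell at UK scales).

```python
import numpy as np

def nearest_grid_point(lats, lons, target_lat, target_lon):
    """Return the (i, j) index of the grid cell closest to a location.

    lats, lons: 2D arrays giving the latitude/longitude of each grid cell.
    """
    d2 = (lats - target_lat) ** 2 + (lons - target_lon) ** 2
    return np.unravel_index(np.argmin(d2), d2.shape)

# Toy 3x3 grid with hypothetical coordinates.
lats, lons = np.meshgrid([50.0, 52.0, 54.0], [-4.0, -2.0, 0.0], indexing="ij")
idx = nearest_grid_point(lats, lons, 51.5, -0.1)
assert idx == (1, 2)  # nearest cell: lat 52.0, lon 0.0
```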
Fig. 3: Precipitation profiles of the regions in the training and validation set between the years 1979 and 2019. (a) Geographical representation of locations used in training sets. (b) Average Daily Rainfall. (c) Percentage of Days with Rainfall larger than 10mm. (d) Average Daily Rainfall for R10 events.

3.3 Hyperparameter Settings

For TRU-NET and HCGRU, the dropout rate used for the output layer and Fused Temporal Cross Attention is 0.2. For the remaining weights in the ConvGRU-based layers, dropout rates of 0.2 and 0.3 were used. During training we used global norm gradient clipping with the
Rectified Adam optimizer (Liu et al., 2019a), featuring gradient warm-up. Parameters for
Rectified Adam were selected as follows: β₁ = 0.9, β₂ = 0.99, ϵ = 5e−, a maximum learning rate of e−, a minimum learning rate of e− and total warmup steps of 20, with 13 steps of increase. We trained all models in Python, using TensorFlow, and executed our experiments on a server with 4 NVIDIA GTX 1080 GPUs. We also utilize mixed precision training. The models were trained for a maximum of 300 epochs, with early stopping.

We use the following metrics to evaluate the performance of each model: Root Mean Squared Error (RMSE), RMSE for days of observed rainfall over 10 mm (R10 RMSE) and Mean Absolute Error (MAE). We present these metrics for each season, where the seasons have been defined as Spring (March, April, May), Summer (June, July, August), Autumn (September, October, November) and Winter (December, January, February). The training set spans the years 1979 till 2008, the validation set spans the years 2009 till 2013 and the test set spans the time period 2014 till August 2019.
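The three evaluation metrics can be computed as below; this is a minimal sketch, with the 10 mm threshold for R10 following the definition above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r10_rmse(y_true, y_pred, threshold=10.0):
    """RMSE restricted to days whose observed rainfall exceeds 10 mm."""
    mask = y_true > threshold
    return rmse(y_true[mask], y_pred[mask])

obs = np.array([0.0, 2.0, 12.0, 25.0])   # observed daily rainfall (mm)
pred = np.array([0.5, 1.0, 10.0, 20.0])  # predicted daily rainfall (mm)
print(rmse(obs, pred))                # 2.75
print(mae(obs, pred))                 # 2.125
print(round(r10_rmse(obs, pred), 3))  # 3.808
```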
(a) All Seasons
Model name   RMSE    R10 RMSE   MAE
IFS          3.627   9.001      1.976
HCGRU        3.268   8.792      1.762
HCGRU+CC     3.266   –          –

(b) Winter
Model name   RMSE    R10 RMSE   MAE
IFS          3.950   9.114      2.233
HCGRU        3.740   8.879      2.135
HCGRU+CC     3.731   –          –

(c) Spring
Model name   RMSE    R10 RMSE   MAE
IFS          3.135   8.455      1.692
HCGRU        2.707   7.922      1.439
HCGRU+CC     2.710   7.832      1.419
T-NET        2.549   7.817      1.487
T-NET+CC     –       –          –

(d) Summer
Model name   RMSE    R10 RMSE   MAE
IFS          3.663   9.021      2.018
HCGRU        3.210   9.056      1.718
HCGRU+CC     3.193   –          –

(e) Autumn
Model name   RMSE    R10 RMSE   MAE
IFS          3.765   9.222      1.987
HCGRU        3.381   9.073      1.783
HCGRU+CC     3.398   8.923      1.773
T-NET        –       –          –
Table 1: Comparing TRU-NET with Baselines. Here, we present the results for models tested on the whole UK between the dates of 2014 and August 2019. The TRU-NET CC (T-NET+CC) model achieves the best RMSE and MAE scores across all seasons.

In Table 1(a) we observe that the TRU-NET CC model generally outperforms the alternative models in terms of RMSE and MAE. Further, the CC variants of TRU-NET and HCGRU achieve a better R10 RMSE than their non conditional-continuous counterparts.
In the previous sub-section, we presented seasonal performance metrics for each model tested on the whole country. Here we focus on the predictive errors for 5 specific cities across the range of precipitation profiles displayed in Figure 3. The cities chosen can be divided into two groups: those with lower rainfall characteristics (London, Birmingham and Manchester) and those with high rainfall characteristics (Cardiff and Glasgow). These locations have been chosen in order to discern whether the quality of predictions over a region is related to the region's precipitation profile.

The following tables present the predictive scores of the TRU-NET CC model trained on data from 16 locations over the time span covering 1979 till 2013. The results are presented in Table 2, where we provide the performance of the IFS model as the second number in each cell. We observe that both TRU-NET CC and IFS generally achieve lower RMSE scores during the Spring and Summer months with less rainfall. By observing the Mean Error (ME) we notice our model generally under-predicts rainfall for cities with high average rainfall (Glasgow and Cardiff) and over-predicts rainfall for cities with low average rainfall (London, Birmingham, Manchester).
Figure 4 illustrates TRU-NET CC's and IFS' predicted rainfall values, for the whole UK, plotted against the true observed rainfall over the period 2014-2019. When comparing TRU-NET's predictions to IFS predictions, we notice a significant number of cases wherein both TRU-NET and IFS predict rainfall higher than 0mm for days where the observed rainfall is 0mm. However, as can be seen from the vertical blue cloud of points to the left of each sub-figure, TRU-NET's log-transformed predictions for non-rainy days spread up to 2.75, while IFS performs worse, spreading up to 3.4.

For observed rainfall events between 10 and 19 mm/day we notice that both TRU-NET and IFS slightly under-predict the observed rainfall by a similar amount. However, TRU-NET's predictions have less variance than the IFS predictions, which routinely fall significantly below or above the y=x line. This is highlighted by the large vertical spread of IFS predictions in Figure 4(b).

For observed rainfall events above 20mm/day, we notice that TRU-NET under-predicts rainfall events more than IFS. We believe that the rarity of rainfall >20mm events in the training set has negatively impacted TRU-NET's ability to learn these relationships, whereas IFS benefits from directly encoding the underlying physical equations.
Here, we check the spatial structure of the predictions via cross-correlation plots for TRU-NET CC Normal predictions on the central point within pairs of cities. We use Leeds as our base location and compute pairwise cross-correlations with the following six locations: Bradford (13 km), Manchester (57 km), Liverpool (104 km), Edinburgh (261 km), London (273 km) and Cardiff (280 km), where each bracketed number is the distance of that location from Leeds. The cross-correlations with comparison cities are ordered by increasing distance from Leeds. Linear de-trending was used.

(a) Birmingham
      RMSE         R10 RMSE     ME
WNT   —            —            —
SPR   —            —            —
SUM   —/2.884      3.368/6.805  -0.095/0.334
AUT   1.781/2.661  —/6.676      0.318/0.125
All   1.933/2.621  5.215/6.649  0.263/0.222

(b) Cardiff
      RMSE         R10 RMSE     ME
WNT   2.292/3.486  4.737/6.564  0.155/—
SPR   —/3.313      —/6.644      0.301/0.727
SUM   2.026/3.735  4.940/7.769  -0.377/0.345
AUT   2.241/3.701  4.517/8.408  -0.185/0.143
All   2.215/3.551  4.809/7.396  -0.008/0.315

(c) Glasgow
      RMSE         R10 RMSE     ME
WNT   3.752/5.079  7.454/9.479  -1.100/-1.025
SPR   —/3.023      5.733/7.172  -0.312/-0.008
SUM   2.455/3.758  —/7.413      -0.241/0.266
AUT   3.022/4.128  6.523/7.812  -0.460/-0.561
All   3.132/4.059  1.783/2.387  -0.531/-0.337

(d) London
      RMSE         R10 RMSE     ME
WNT   2.159/2.611  —/7.768      0.244/-0.031
SPR   —/2.411      3.866/6.829  0.309/0.447
SUM   2.367/3.156  6.824/9.787  -0.040/0.487
AUT   1.940/2.771  4.461/9.105  0.083/-0.081
All   2.221/2.735  7.850/8.500  0.158/0.210

(e) Manchester
      RMSE         R10 RMSE     ME
WNT   2.372/3.212  4.252/7.223  0.378/0.271
SPR   2.048/3.342  —/7.959      —/0.585
SUM   —/3.795      5.186/9.244  -0.179/0.618
AUT   2.253/3.428  5.008/7.783  -0.221/-0.093
All   2.374/3.428  6.517/7.994  0.013/0.353

Table 2: Seasonally dis-aggregated performance metrics for TRU-NET CC trained on data between 1979 and 2013 and tested on the region around each of 5 cities between 2014 and August 2019. The first/second number in each cell is the associated performance of TRU-NET/IFS. The left hand column contains the following abbreviations for seasons: Winter=WNT, Spring=SPR, Summer=SUM, Autumn=AUT.

Fig. 4: Distribution of Predictions: These figures illustrate the distribution of predicted rainfall against observed rainfall for (a) the TRU-NET CC and (b) the IFS models from Section 4.1.1. The dashed red line shows the mean and standard deviation of predictions in 3 mm/day intervals of observed rainfall. The purple line indicates the boundary for rainfall events of at least 10 mm/day. For illustrative purposes, we sub-sample every 25th pair of predicted and observed values. The log transform used is log(y + 1).

Figures 5 and 6 illustrate the cross-correlations between TRU-NET CC's predictions for the central points of pairs of cities. Figure 5 shows the cross-correlation function up to a lag of 28 days. As expected, we notice a strong correlation up to approximately 5 days. For all sets of figures, the relationships exhibited by TRU-NET CC's predictions (blue line) are approximately mirrored by the observed values (orange line), confirming that our model is producing sensible predictions. In Figure 6, as expected, we observe that the lag-0 cross-correlation between the predicted daily rainfall for two cities decreases as the cities become increasingly distant from each other.

4.2 Investigation of TRU-NET's Limitations

The high temporal correlation in weather data reduces the effective sample size and poses the risk that any neural network trained on N consecutive years will only learn a limited set of weather patterns. The reliability of the DL model's extrapolation to out-of-sample predictions (new weather patterns) is more doubtful because DL models do not aim to learn the underlying physical equations, unlike numerical weather algorithms. The three experiments introduced below evaluate the robustness of TRU-NET's out-of-sample predictive ability.

Fig. 5:
Cross Correlation across Predictions:
These figures illustrate the Cross-Correlation function (XCF) between rainfall predictions for Leeds and the other cities. Here, we present the XCF up to lag 28. The orange line provides the same statistics, except computed with the true observed rainfall values. The red line marks the 5% significance threshold, above which we can assume there is significant correlation.
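A cross-correlation function with linear de-trending, as used for Figures 5 and 6, can be sketched as follows. This is an illustrative computation: the function names are ours, and the white-noise approximation used for the 5% significance level is an assumption rather than the paper's exact procedure:

```python
import numpy as np

def xcf(a, b, max_lag=28):
    """Cross-correlation between two daily rainfall series for lags
    0..max_lag (series b lagged behind series a), after removing a
    least-squares linear trend from each series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t = np.arange(len(a))
    # Linear de-trending: subtract the fitted straight line from each series.
    a = a - np.polyval(np.polyfit(t, a, 1), t)
    b = b - np.polyval(np.polyfit(t, b, 1), t)
    out = []
    for lag in range(max_lag + 1):
        n = len(a) - lag
        out.append(np.corrcoef(a[:n], b[lag:])[0, 1])
    return np.array(out)

def significance_threshold(n, z=1.96):
    """Approximate 5% significance level for a sample correlation,
    under a white-noise null (an assumed approximation)."""
    return z / np.sqrt(n)
```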
Fig. 6: Lag 0 Cross Correlation between Leeds and other cities: This figure shows the Cross-Correlation function (XCF) at lag 0 for the daily rain predictions between Leeds and the 6 cities on the x-axis. The cities are ordered by increasing distance from Leeds. We use the rain prediction for the central point within the stencil representing each city. We provide a comparison to the XCF of the true observed rainfall values.

Here, we fix the test set to span the years 2014 to August 2019 and vary the number of years, starting from 1979, used to train our TRU-NET CC model. We measure the training set size both in years and in unique training datums. As our model operates on temporal patches extracted from the coarse grid, the number of unique datums in a training set is proportional to the product of the number of years we choose to train on and the number of locations included in the training set. In Figure 7, we observe a downward trend in RMSE and MAE as the number of years and unique training datums increases. The fact that the RMSE reaches its lowest value for the largest dataset indicates that enlarging our dataset by using more locations could achieve further improvements in our model's predictive ability.

Fig. 7: Varied Time Span: Here, we vary the size of TRU-NET CC's training set and observe the corresponding predictive performance on a test set spanning 2014 to August 2019. Sub-figures (a), (b) and (c) show the predictive performance evaluated by RMSE, R10 RMSE and MAE respectively. We observe that the RMSE and MAE scores improve as the training set size increases.

Fig. 8: Forecasting Range: Here, we inspect the annually aggregated predictive performance for a TRU-NET CC model trained on data spanning 1979 to 1997. We notice no clear trend in the R10 RMSE predictive performance as the test year moves further forward in time from the training set. However, the RMSE and MAE show a steady decline between 1998 and 2005, after which the predictive scores stay steady.
Here, we evaluate the change in quality of predictions at increasingly larger temporaldistances from the time covered by the training set. We train a TRU-NET CCmodel using data between 1979 and 1997 and then calculate annual RMSE, R10RMSE and MAE metrics for each calendar year of predictions between 1998 and2018. In the results, illustrated in Figure 8, the R10 RMSE shows no clear upwardor downward trend throughout the whole test period, while the RMSE and MAEdecline until 2005, after which the score remains steady. This indicates that ourmodel’s predictive ability is robust to at least 21 years of weather pattern changesdue to climate change and natural decadal variability.
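The annual aggregation behind Figure 8 amounts to grouping squared errors by calendar year. A minimal sketch, with hypothetical variable and function names:

```python
import numpy as np

def annual_rmse(years, y_true, y_pred):
    """RMSE of daily rainfall predictions, aggregated per calendar year.

    `years` holds the calendar year of each prediction; returns a dict
    mapping year -> RMSE over that year's predictions."""
    years = np.asarray(years)
    err2 = (np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)) ** 2
    return {int(yr): float(np.sqrt(err2[years == yr].mean()))
            for yr in np.unique(years)}
```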
To judge the extent to which TRU-NET's predictive ability is dependent on the time period it is trained on, we divide the 40 year dataset into 4 sub-datasets of 10 years each. The first sub-dataset (DS1) corresponds to the years 1979-1988, the second (DS2) to the years 1989-1998, the third (DS3) to the years 1999-2008 and the fourth (DS4) to the years 2009-2018. We set up a K-fold cross-validation based experiment by training a separate TRU-NET CC model on each of DS1, DS2, DS3 and DS4, creating four models M1, M2, M3 and M4.
Appendix Table 5 shows the results from testing each model on the out-of-sample datasets. For each evaluation metric we perform a Tukey HSD test, which accepts or rejects the claim that the means of two groups of values differ significantly from each other. In our case, the models (M1-4) serve as the treatment groups and each model's predictive scores form a group of observations. The Tukey HSD test then compares each pair of models for a significant difference between the means of their reported predictive scores.
The Tukey HSD results for each evaluation metric and all pairs of models are presented in Table 3(a,b,c). The first two columns indicate the models under comparison, and the third the mean difference in their predictive scores. The rightmost column states whether or not there is a significant difference (sig. diff) between the performance of the corresponding pair of models. We observe that the predictive performance between each pair of models is not significantly different. This implies our TRU-NET CC model is fairly invariant to the period of data it is trained on.

4.3 Ablation: Fused Temporal Cross Attention

In this section, we investigate the efficacy of our FTCA relative to other methods for achieving the multi-scale hierarchical structure in the TRU-NET Encoder.
More concretely, we replace the fused temporal cross attention with concatenation, the last-element method (Jauhar et al., 2018; Zhao et al., 2019) and temporal self-attention (Liu et al., 2019b). We also examine the effect of changing the number of heads in the FTCA. Table 4 shows that our model achieves a lower RMSE than the other methods of achieving the multi-scale hierarchical structure. Furthermore, we notice strong performance relative to the self-attention variant, which has the same model size. This highlights the importance of using information from higher spatio-temporal scales to guide the aggregation of information from lower spatio-temporal scales in our TRU-NET model.
(a) RMSE
Model 1  Model 2  Mean Diff.  p-adj   Sig. Diff.
M1       M2       0.004       0.9     False
M1       M3       0.0067      0.9     False
M1       M4       0.0436      0.5243  False
M2       M3       0.0027      0.9     False
M2       M4       0.0396      0.5907  False
M3       M4       0.037       0.6349  False

(b) R10 RMSE
Model 1  Model 2  Mean Diff.  p-adj   Sig. Diff.
M1       M2       -0.0259     0.9     False
M1       M3       0.0607      0.9     False
M1       M4       0.1254      0.5946  False
M2       M3       0.0866      0.7964  False
M2       M4       0.1513      0.4604  False
M3       M4       0.0647      0.9     False

(c) MAE
Model 1  Model 2  Mean Diff.  p-adj   Sig. Diff.
M1       M2       0.029       0.6119  False
M1       M3       0.0285      0.6246  False
M1       M4       0.0187      0.8387  False
M2       M3       -0.0006     0.9     False
M2       M4       -0.0104     0.9     False
M3       M4       -0.0098     0.9     False
Table 3:
Varied Time Period - Tukey HSD test:
We train four TRU-NET CC models (M1-4) on 4 training sets, labelled (DS1-4), which cover mutually exclusive 10-year time spans. Each model is then tested on all time spans except that of its training set. We evaluate the predictions using three evaluation metrics (RMSE, R10 RMSE, MAE) and then, for each evaluation metric, perform a Tukey HSD test on the results. The final column of each table confirms that there is no statistically significant difference (sig. diff) between the mean performance of the corresponding two models. This implies that the performance of our TRU-NET CC model is invariant to the time period it is trained on, provided all the time periods have the same length.
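Pairwise comparisons of this kind can be reproduced with SciPy's off-the-shelf Tukey HSD implementation (`scipy.stats.tukey_hsd`, available in SciPy 1.8+). The sketch below uses each model's out-of-sample RMSE scores from Appendix Table 5 as its group of observations; the resulting adjusted p-values may differ slightly from Table 3 depending on implementation details, but values above 0.05 correspond to "False" in the Sig. Diff. column:

```python
import numpy as np
from scipy.stats import tukey_hsd

# Out-of-sample RMSE scores per model, taken from Appendix Table 5
# (each model is tested on the three sub-datasets it was not trained on).
m1 = [3.365, 3.393, 3.324]
m2 = [3.395, 3.385, 3.314]
m3 = [3.410, 3.376, 3.315]
m4 = [3.420, 3.392, 3.400]

# Tukey HSD compares every pair of groups; res.pvalue[i, j] is the
# adjusted p-value for the pair (model i+1, model j+1).
res = tukey_hsd(m1, m2, m3, m4)
print(np.round(res.pvalue, 3))
```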
TRU-NET CC            RMSE   R10 RMSE  MAE
T Cross-Attn 8 heads  —      —         —
T Cross-Attn 4 heads  3.100  8.985     1.656
T Cross-Attn 1 head   3.100  8.893     1.656
Concatenation         3.147  9.146     1.661
Last Element          3.098  8.861     1.641
T Self-Attn           3.124  9.085     1.646
Table 4:
Ablation Study:
Here, we evaluate the predictive performance of alternative methods for achieving the multi-scale hierarchical structure in the TRU-NET CC Encoder. We evaluate these TRU-NET CC variants using a training set consisting of the years 1979 to 2013. The test set is composed of data from 2014 to August 2019 for the whole UK. We observe that our proposed Temporal Cross Attention (T Cross-Attn) with 8 heads outperforms the other methods.
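For intuition, the core of a multi-head temporal cross-attention step (queries from the coarser temporal scale attending over keys and values from the finer scale) can be sketched in NumPy. This is a deliberate simplification with random projections standing in for learned weights, not the released TRU-NET implementation:

```python
import numpy as np

def multi_head_cross_attention(q_coarse, kv_fine, n_heads=8, seed=0):
    """q_coarse: (Tq, d) features from the coarser temporal scale (queries).
       kv_fine:  (Tk, d) features from the finer temporal scale (keys/values).
       Returns a (Tq, d) aggregation of the fine-scale features."""
    d = q_coarse.shape[-1]
    assert d % n_heads == 0, "model width must divide evenly across heads"
    dh = d // n_heads
    rng = np.random.default_rng(seed)
    # Random projections stand in for the learned Wq, Wk, Wv matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    # Split into heads: (T, d) -> (n_heads, T, dh).
    Q = (q_coarse @ Wq).reshape(-1, n_heads, dh).transpose(1, 0, 2)
    K = (kv_fine @ Wk).reshape(-1, n_heads, dh).transpose(1, 0, 2)
    V = (kv_fine @ Wv).reshape(-1, n_heads, dh).transpose(1, 0, 2)

    # Scaled dot-product attention: each coarse step attends over fine steps.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)          # (h, Tq, Tk)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)        # softmax over Tk
    out = weights @ V                                         # (h, Tq, dh)

    # Merge heads back: (h, Tq, dh) -> (Tq, d).
    return out.transpose(1, 0, 2).reshape(-1, d)
```

The cross-attention direction is the point of the ablation: the coarse-scale queries guide how fine-scale information is aggregated, rather than the fine scale attending only to itself as in the self-attention variant.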
5 Conclusion

In this work we present TRU-NET, featuring a novel Fused Temporal Cross Attention mechanism to improve the modelling of processes defined on multiple spatio-temporal scales. We utilise a conditional-continuous loss function to obtain predictions for zero-skewed rainfall events.
For the prediction of local precipitation for all seasons over the whole UK, our model achieves a 10% lower RMSE than a hierarchical ConvGRU model and a 15% lower RMSE than a dynamical weather forecasting model (IFS) initialised 0-24h before each precipitation prediction. After further analysis, we observe that TRU-NET attains lower RMSE scores than IFS when predicting rainfall events up to and including 20 mm/day, which comprise the majority of rainfall events. However, beyond this point TRU-NET under-predicts rainfall events to a higher degree than IFS.
We address concerns regarding the suitability of DL approaches to precipitation prediction (Rasp et al., 2020, 2018), given the limited amount of training data. We show that the amount of data currently available is sufficient for a DL approach to produce quality predictions.
The current work used deterministic models and readily available reanalysis data as an analogue for climate model output. Future work could utilise probabilistic neural network methods, such as Monte Carlo Dropout (Gal and Ghahramani, 2015) or the Horseshoe Prior (Ghosh et al., 2018), as well as data from climate simulations, to simulate the risks of severe weather under varying climate scenarios. Further, methods combining Extreme Value Theory and machine learning (Ding et al., 2019) could be used to improve TRU-NET's ability to predict rainfall events over 20 mm/day.
The code used to train and evaluate our models can be downloaded from https://github.com/Akanni96/TRUNET.
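The idea behind a conditional-continuous loss can be sketched as a rain-occurrence classification term plus a regression term applied only on days where rain was observed. This is a simplified stand-in to illustrate the structure, not the exact published formulation:

```python
import numpy as np

def conditional_continuous_loss(p_rain, y_pred, y_true):
    """Illustrative conditional-continuous loss (a simplification):
    cross-entropy on rain/no-rain occurrence, plus squared error on the
    rainfall amount conditioned on rain having occurred."""
    p_rain = np.asarray(p_rain, dtype=float)  # predicted probability of rain
    y_pred = np.asarray(y_pred, dtype=float)  # predicted rainfall amount
    y_true = np.asarray(y_true, dtype=float)  # observed rainfall amount
    rain = (y_true > 0).astype(float)
    eps = 1e-7  # numerical guard for the logarithms
    bce = -np.mean(rain * np.log(p_rain + eps)
                   + (1.0 - rain) * np.log(1.0 - p_rain + eps))
    wet = rain == 1.0
    mse = np.mean((y_pred[wet] - y_true[wet]) ** 2) if wet.any() else 0.0
    return bce + mse
```

Conditioning the regression term on rain occurrence is what lets the loss handle the zero-skewed rainfall distribution: dry days contribute only to the occurrence term, so the amount regression is not dominated by zeros.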
Acknowledgements
The project was funded by the Alan Turing Institute under the Climate Action Pilot Projects Call. We also acknowledge Dr. Sherman Lo's help in processing the datasets.
References
Burton C (2011) How weather patterns have contributed to extreme precipitation in the United Kingdom, and links to past flood events
Cornes R, van der Schrier G, van den Besselaar E, Jones P (2019) An ensemble version of the E-OBS temperature and precipitation datasets: version 21.0e
Ding D, Zhang M, Pan X, Yang M, He X (2019) Modeling extreme events in time series prediction. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, USA, KDD '19, pp 1114–1122, DOI 10.1145/3292500.3330896
Gal Y, Ghahramani Z (2015) Dropout as a bayesian approximation: Representing model uncertainty in deep learning
Ghosh S, Yao J, Doshi-Velez F (2018) Structured variational learning of bayesian neural networks with horseshoe priors
Hersbach H, Bell B, Berrisford P, Hirahara S, Horányi A, Muñoz-Sabater J (2020) The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society 146(730):1999–2049, DOI 10.1002/qj.3803, https://rmets.onlinelibrary.wiley.com/doi/pdf/10.1002/qj.3803
Husak GJ, Michaelsen J, Funk C (2007) Use of the gamma distribution to represent monthly rainfall in Africa for drought monitoring applications. International Journal of Climatology 27(7):935–944, DOI 10.1002/joc.1441, https://rmets.onlinelibrary.wiley.com/doi/pdf/10.1002/joc.1441
IPCC (2007) Fourth Assessment Report: Climate Change 2007: The AR4 Synthesis Report. Geneva: IPCC
Jauhar SK, Gamon M, Pantel P (2018) Neural task representations as weak supervision for model agnostic cross-lingual transfer
Liu L, Jiang H, He P, Chen W, Liu X, Gao J, Han J (2019a) On the variance of the adaptive learning rate and beyond
Liu YT, Li YJ, Yang FE, Chen SF, Wang YCF (2019b) Learning hierarchical self-attention for video summarization. 2019 IEEE International Conference on Image Processing (ICIP), pp 3377–3381
May W (2004) Simulation of the variability and extremes of daily rainfall during the Indian summer monsoon for present and future times in a global time-slice experiment. Climate Dynamics 22(2):183–204, DOI 10.1007/s00382-003-0373-x
Murphy J, Harris G, Sexton D, Kendon E, Bett P, Clark R (2018) UKCP18 land projections: Science report. Tech. rep., Met Office
Rasp S, Pritchard MS, Gentine P (2018) Deep learning to represent subgrid processes in climate models. Proceedings of the National Academy of Sciences 115(39):9684–9689, DOI 10.1073/pnas.1810286115
Rasp S, Dueben PD, Scher S, Weyn JA, Mouatadid S, Thuerey N (2020) WeatherBench: A benchmark dataset for data-driven weather forecasting
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation
Shaw P, Uszkoreit J, Vaswani A (2018) Self-attention with relative position representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp 464–468, DOI 10.18653/v1/N18-2074
Shi X, Chen Z, Wang H (2015) Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems 28, Curran Associates, Inc., pp 802–810
Shi X, Gao Z, Lausen L, Wang H, Yeung DY, Wong WK, Woo WC (2017) Deep learning for precipitation nowcasting: A benchmark and a new model. In: Advances in Neural Information Processing Systems, pp 5617–5627
Shukla P, Skea J, Buendia EC, Masson-Delmotte V, Pörtner H (2019) IPCC special report on climate change, desertification, land degradation, sustainable land management, food security, and greenhouse gas fluxes in terrestrial ecosystems
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(56):1929–1958
Stern R, Coe R (1984) A model fitting analysis of daily rainfall data. Journal of the Royal Statistical Society Series A (General) 147(1):1–34
Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, pp 1556–1566, DOI 10.3115/v1/P15-1150
Vandal T, Kodra E, Ganguly S, Michaelis A, Nemani R, Ganguly AR (2017) DeepSD: Generating high resolution climate change projections through single image super-resolution. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, USA, KDD '17, pp 1663–1672, DOI 10.1145/3097983.3098004
Vandal T, Kodra E, Dy J, Ganguly S, Nemani R, Ganguly AR (2018) Quantifying uncertainty in discrete-continuous and skewed data with bayesian deep learning. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, DOI 10.1145/3219819.3219996
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Lu, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp 5998–6008
Wallemacq P, Herden C (2015) The Human Cost of Weather Related Disasters. CRED, UNISDR
Woo WC, Wong WK (2017) Operational application of optical flow techniques to radar-based rainfall nowcasting. Atmosphere 8(3), DOI 10.3390/atmos8030048
Zhao Y, Shen Y, Yao J (2019) Recurrent neural network for text classification with hierarchical multiscale dense connections. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, pp 5450–5456, DOI 10.24963/ijcai.2019/757
A Hierarchical Convolutional Gated Recurrent Unit (HCGRU) model
Fig. 9: Illustration of the conditional-continuous variant of the HCGRU model used as a baseline in experiments.
B Varied Time Experiment
(a) RMSE
     DS1    DS2    DS3    DS4
M1   nan    3.365  3.393  3.324
M2   3.395  nan    3.385  3.314
M3   3.410  3.376  nan    3.315
M4   3.420  3.392  3.400  nan

(b) R10 RMSE
     DS1    DS2    DS3    DS4
M1   nan    9.319  9.399  9.139
M2   9.305  nan    9.389  9.085
M3   9.410  9.418  nan    9.210
M4   9.384  9.390  9.458  nan

(c) MAE
     DS1    DS2    DS3    DS4
M1   nan    1.738  1.794  1.788
M2   1.812  nan    1.800  1.796
M3   1.835  1.766  nan    1.805
M4   1.821  1.757  1.799  nan
Table 5: Predictive scores for the Varied Time Period experiment. Each model M1-4 (rows) is trained on one 10-year sub-dataset and tested on the three out-of-sample sub-datasets DS1-4 (columns); 'nan' marks each model's own training period. Sub-tables (a), (b) and (c) report RMSE, R10 RMSE and MAE respectively.