Machine learning emulation of gravity wave drag in numerical weather forecasting
Matthew Chantry, Sam Hatfield, Peter Dueben, Inna Polichtchouk, Tim Palmer
Manuscript submitted to
Journal of Advances in Modeling Earth Systems
Atmospheric, Oceanic and Planetary Physics, University of Oxford, Oxford, United Kingdom
European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom
Key Points:
• Physical parameterisation schemes can be emulated with neural networks.
• These emulators produce accurate & stable forecasts over long timescales.
• Neural networks can reduce the cost of increased complexity parameterised physics.
Corresponding author: Matthew Chantry, [email protected]
Abstract
We assess the value of machine learning as an accelerator for the parameterisation schemes of operational weather forecasting systems, specifically the parameterisation of non-orographic gravity wave drag. Emulators of this scheme can be trained that produce stable and accurate results up to seasonal forecasting timescales. Generally, more complex networks produce more accurate emulators. By training on an increased complexity version of the existing parameterisation scheme we build emulators that produce more accurate forecasts than the existing parameterisation scheme. Using the current operational CPU hardware our emulators have a similar computational cost to the existing scheme, but are heavily limited by data movement. On GPU hardware our emulators perform ten times faster than the existing scheme on a CPU.
Introduction

Numerical weather prediction has a proud history of saving lives and protecting property in societies particularly vulnerable to extremes of weather. As these extremes become more extreme still under the influence of climate change, it is important that numerical weather prediction systems improve further. One way to enhance the skill of numerical weather prediction is to increase model resolution. Not only will higher resolution directly enhance the ability of models to simulate small-scale extreme events, higher resolution will also enable the information in observations to be better assimilated at initial time, and will reduce the dependence of models on inaccurate parameterised processes, and in this way will reduce systematic errors (Palmer, 2020). However, increasing model resolution is computationally expensive: a doubling of resolution can increase computing costs by up to a factor of 16. Therefore we must find ways of improving the computational efficiency of our models, so that valuable computing resources can be targeted where they will be most effective.

Many physical processes that are involved in the forecasting of weather or climate occur at spatial scales smaller than the numerical grid of the models. Therefore, while the physics might be understood, it is necessary to derive closure schemes which capture the effect of these physical processes on the grid scales. Examples of physical processes are radiation, convection and gravity wave drag, which will be our focus here. Inherently, by the nature of being a closure scheme, these physical parameterisation schemes are uncertain and imperfect. Together these schemes typically account for a significant portion of both the computational burden and the number of lines of code in the code base. In the European Centre for Medium-Range Weather Forecasts' (ECMWF) Integrated Forecasting System (IFS) model they contribute to about a third of the overall computational cost of running the model. As more Earth System complexity is introduced into numerical weather prediction models, e.g. aerosols and atmospheric chemistry, these numbers will only increase.

The recent boom in both hardware and software developments around machine learning has caused many fields to examine the possible boons that machine learning can bring. In the field of weather and climate forecasting researchers are examining the applicability of machine learning techniques to a spectrum of problems (Chantry et al., 2021), covering algorithmic changes from the systemic to the incremental. At one end, researchers are investigating whether machine learning can replace the whole forecasting system, either by learning from observational data (Sønderby et al., 2020) or atmospheric reanalysis (Rasp et al., 2020). Early results are promising in the area of nowcasting (Sønderby et al., 2020), but still lag behind classical modelling for short and medium range forecasting (Weyn et al., 2019). For seasonal forecasting machine learning techniques again show promising results, e.g. forecasting El Niño sea-surface temperatures (Dijkstra et al., 2019). At the other end of this spectrum lie approaches where machine learning aims to learn the behaviour of existing kernels of a
weather forecasting model, with the aim of building machine learning emulators that can accelerate the kernel (Krasnopolsky, 1997; Chevallier et al., 1998; O'Gorman & Dwyer, 2018; Brenowitz & Bretherton, 2018). Successful acceleration of these kernels could allow reductions in run-time or re-investments of the computational savings into increased complexity kernels or increases in spatial resolution. It is this approach that we shall investigate here. The appeal of this approach is the ability to leverage existing physics knowledge in a small self-contained unit. This application of machine learning to weather and climate forecasting is closely related to the use of reduced numerical precision to accelerate weather forecasting (Váňa et al., 2017; Hatfield et al., 2019), whereby a slightly less accurate version of a kernel can be used undetected beneath the uncertainty and inaccuracy of the system. On current GPUs neural networks can easily be run at reduced precision for both training and inference, leveraging increased linear algebra performance (NVIDIA, 2017). While existing atmospheric algorithms need to be reformulated to fit within the compact dynamic range of half-precision (Hatfield et al., 2019), neural networks naturally fit within this range due to data normalisation.

Several groups have already attacked the wider problem of parameterisation emulation. In the 1990s both Krasnopolsky (1997) and Chevallier et al. (1998) used neural networks to emulate and accelerate radiation schemes. Their approaches were broadly successful but did not lead to widespread adoption within operational weather or climate models. More recently, O'Gorman and Dwyer (2018) used random forests to emulate a convection scheme. Efforts have also been made to accelerate increased complexity (dubbed super) parameterisation schemes (Rasp et al., 2018; Gentine et al., 2018). Brenowitz and Bretherton (2018) built a neural network to capture all parameterised physics within an aqua-planet scenario. Veerman and Pincus (2021) emulate a radiation scheme and assess the computational cost relative to the existing scheme.

Gravity wave drag is a parameterised physical process that has not yet been considered for emulation. It is typically parameterised as two processes, orographic and non-orographic, and it is the latter that we shall focus on in this study. Non-orographic gravity waves can be generated by dynamics such as fronts and convection and occur on a vast range of scales (Gardner et al., 1989; Ern et al., 2004), meaning that while some gravity waves are resolved by current forecast resolutions, others need to be parameterised. In the IFS model this is achieved using the scheme from Warner and McIntyre (2001), and in particular the variant described in Orr et al. (2010) (henceforth referred to as NOGWD). Like most current physical parameterisation schemes, it operates on a vertical column of fluid and can be run in parallel across all the columns within a given grid. In this scheme a fixed amount of momentum is launched upwards from a given height at a range of angles in the horizontal direction and over a range of wavelengths (equivalently wavespeeds). At each model level the stability of these waves within the atmospheric profile is calculated, and depending on this stability momentum is deposited at these layers.
Accurate parameterisation of non-orographic gravity wave drag is important for maintaining an accurate zonal-mean wind and temperature distribution (Garcia & Boville, 1994) and for capturing the phase and amplitude of the Quasi-Biennial Oscillation (QBO) (Dunkerton, 1997). We choose to emulate NOGWD as it is a medium complexity scheme that has particularly important effects on seasonal timescales. The effects can be seen both in the stratosphere and troposphere (Polichtchouk, Shepherd, Hogan, & Bechtold, 2018; Polichtchouk, Shepherd, & Byrne, 2018). Recent work has shown the challenges in using offline trained neural networks in free-running simulations (Brenowitz et al., 2020) and we will assess whether these problems also exist in our scenario. By sitting in the middle of the complexity spectrum, NOGWD has enough complexity to challenge machine learning methods but it is nontrivial that an emulator will be faster than the current scheme. If we can demonstrate efficiency gains in this parameterisation scheme, we would postulate that many of the parameterisation schemes can be efficiently emulated using neural networks.
Figure 1.
Standard deviation of the u and v wind tendencies from the HC NOGWD scheme (using the first 30 days of the dataset), plotted on regular (left) and semi-logarithmic (right) axes as a function of model level, plotted only on the model levels where the scheme is active. Velocity tendencies are largest at the top of the atmosphere, with a second peak around the 30th model level.

To our knowledge, no work has explored parameterisation emulation in an operational forecasting model, with as many vertical levels and the full complexity of physics (e.g. many studies have considered an aqua-planet setup).

Next, in section 2, we discuss the data used for the machine learning, followed by our machine learning methodology (section 3). In section 4 we show that NOGWD can indeed be emulated using neural networks, testing both the offline and online performance. In section 5 we address several important questions in taking parameterisation emulators into operational prediction systems. We present new results and discussions on the key issues of conservation properties, computational cost, alternate network structures, generalisability and emulating other parameterisation schemes.

Data

The specific goal of this project is to create emulators of the non-orographic gravity wave drag parameterisation found in the IFS model. To generate the data we modify cycle 45R1 of IFS (ECMWF, n.d.) to output all time-dependent variables either inputted to or outputted from the scheme in question, saved on the native cubic-octahedral grid. We use the horizontal grid TCo399. In the vertical dimension we use 91 model levels spanning from the surface to the model top. As outlined above, momentum is launched at a set number of angles in the horizontal, and for a range of wavespeeds/wavelengths. Typically four launch angles are used and the wavespeed spectrum is discretised into 20 elements. When generating datasets, we increase these values to 16 and 100 respectively, which we refer to from hereon as the HC (high complexity) scheme. These changes increase the cost of the existing scheme by a factor of 20, but provide a higher resolution discretisation of the equations governing NOGWD. Scinocca (2003) suggests that 35 elements is sufficient but differences can still be observed between 35 and 1000 elements. In testing both variants of the NOGWD scheme in forecasts we found that those using the HC variant produced closer forecasts to the analysed state of the atmosphere. In preliminary testing we found that our networks produce a lower mean-squared error against the testing dataset when both training and testing data came from the HC scheme, compared with when the standard complexity scheme provides the testing and training data. This is perhaps because the HC scheme has smoother vertical tendency profiles. In our online testing later we will compare both the standard formulation (with 4 angles and 20 spectral elements) and the HC variant of the existing scheme.

Here, IFS is run for 30-day integrations, starting every 30 days, covering a three year period. This was chosen to give a full seasonal cycle for each of training, testing and validation data. Data is saved every 20 hours, in this way sampling the daily cycle. Without further reductions, each 30-day period produces many millions of input/output columns for training. We sub-sample every 10th grid column, cycling the starting index for different 30-day periods.
In this fashion, every grid-point features in the training dataset. Our training dataset comprises 12 30-day periods, covering 2015. Our testing and validation datasets comprise three 30-day periods spread across 2016 and 2017 respectively (starting 2016-02-25, 2016-06-24, 2016-12-21 and 2017-02-19, 2017-07-19, 2017-11-16 respectively). Our results were not sensitive to the precise period selected nor to increasing the testing or validation dataset sizes. Our training dataset is significantly larger than is typically used in emulation training (e.g. Brenowitz et al. (2020)). We tested reducing the data volume by training only on the first of each month, but training for 10 times the number of epochs. Training with this data gave a significant degradation in our testing error.

To mirror the existing parameterisation scheme, we include only the variables used to predict the velocity tendencies in the existing model. These are the horizontal velocity (u and v) and temperature profiles on the vertical model levels. In addition there are three descriptors of the model levels at each grid-point: the pressure, half-level pressure and geopotential. The outputs of the existing scheme are horizontal velocity tendencies, showing the impact of unresolved gravity wave drag on the velocity field. A temperature tendency can then be derived from the velocity fields and NOGWD tendencies using a simple formula, $T_t = -(u\,u_t + v\,v_t)/c_p$, where the subscript $t$ denotes the NOGWD tendency and $c_p$ is the specific heat capacity of dry air at constant pressure. In total, this produces a dataset with 6 ×
91 = 546 inputs and 2 ×
91 = 182 outputs.

Reductions can be made to both input and output vector sizes by applying simple domain knowledge. Firstly, examination of the existing scheme shows that the scheme is only active in the upper 63 layers of the atmosphere and only uses information in the top 63 layers of the input profiles. Secondly, pressure and half-level pressure can each be described on model levels as time-independent functions of surface pressure. Theoretically, geopotential can be reconstructed using the surface geopotential, pressure, temperature and humidity, the last of which is not inputted to the scheme. In practice, we find our models produce accurate results using only surface pressure and surface geopotential. Combining these ideas we produce input and output vectors of size 3 × 63 + 2 = 191 and 2 ×
63 = 126 respectively. Even after data reduction the size of the training data totals in excess of 30 GB, larger than the memory of our available GPUs. Therefore we re-write the data into the TFRecord format (Abadi et al., 2015) to allow easy and efficient streaming of data from disk to GPU. Data is publicly available in 30-day chunks in the HDF5 file format for portability (https://storage.ecmwf.europeanweather.cloud/NOGWD_L91_PUBLIC/README.html).

Data normalisation is a typical step in any machine learning workflow. Here we normalise both the input and output data from our dataset. We examine two normalisation approaches. The first, here dubbed "MEANSD", calculates elemental means and standard deviations for each feature (i.e. a variable at a given model level) and normalises both inputs and outputs by these values. The second, dubbed "TMEANVMAX", normalises temperature (plus pressure and geopotential) with the above method, but for each of the velocity inputs and tendency outputs the entire column is divided by the largest standard deviation from the column and no mean is subtracted. With both methods our loss function will be the L2 loss (mean squared error). With the first normalisation, our models will seek to optimise each output feature equally, irrespective of the typical magnitude of the velocity tendency. With the second approach the model will be encouraged to learn features in proportion to their absolute size. The latter method might seem a more natural choice given that the outputs will contribute to the next velocity field values. However, if we examine the structure of the velocity profiles in figure 1 we see that the largest tendencies occur at the very top of the model atmosphere. The top of the atmosphere is weakly constrained by data assimilation and velocity fields in the top ten layers of the 91-level IFS model are strongly damped with a sponge layer to prevent wave reflection from the rigid upper boundary. Therefore, allowing the model to focus on these layers might not be constructive for an accurate forecast when coupled back to the full atmospheric model. We will therefore try both approaches and later test both in coupled simulations.
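To make the two options concrete, the sketch below shows one way the normalisations could be implemented in NumPy; the array layout, the helper names and the choice to show TMEANVMAX only for the output blocks are illustrative assumptions rather than the exact code used for this work.

```python
import numpy as np

# y: outputs of shape (n_samples, 2 * n_lev), ordered [u-tendency levels | v-tendency levels].
# The same MEANSD helper can be applied to inputs as well. Shapes are assumptions.

def meansd_norm(a, eps=1e-12):
    """MEANSD: per-feature standardisation of every column of the data matrix."""
    mean = a.mean(axis=0)
    std = a.std(axis=0) + eps
    return (a - mean) / std, (mean, std)

def tmeanvmax_norm_outputs(y, n_lev=63, eps=1e-12):
    """TMEANVMAX (output side): divide each tendency profile by the largest
    per-level standard deviation in that variable's column; no mean is subtracted,
    so the relative magnitude between levels is preserved."""
    y_norm = np.empty_like(y)
    scales = []
    for i in range(2):                          # u-tendency block, v-tendency block
        block = y[:, i * n_lev:(i + 1) * n_lev]
        scale = block.std(axis=0).max() + eps   # one scale for the whole column
        y_norm[:, i * n_lev:(i + 1) * n_lev] = block / scale
        scales.append(scale)
    return y_norm, scales
```

With MEANSD every feature enters the L2 loss with comparable weight, whereas TMEANVMAX keeps the relative size of tendencies between levels within a column.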
Machine learning methodology

In this work we focus on neural networks (NN) as our tools for emulation. Currently both neural networks and random forests are popular methodologies for parameterisation scheme emulation or creation. Random forests can conserve physical quantities (e.g. energy conservation in O'Gorman & Dwyer, 2018), but recent work has shown how to also achieve this with neural networks (Beucler et al., 2019). Neural networks promise greater theoretical performance as random forests are limited by memory performance. However, random forest methods have recently been shown in a cloud parameterisation scheme to be more stable in long-term simulations coupled to atmospheric models (Brenowitz et al., 2020). With this in mind we will carefully assess whether our neural network models run stably when coupled to the IFS model for seasonal timescales to further investigate this issue. We use Tensorflow (Abadi et al., 2015) to build and train our networks.

Predominantly we will focus on fully-connected neural networks in our search space. These networks are the most general purpose, with no explicit encoding of the vertical structure of our dataset, which would occur with a convolutional-based network. However, we will present results for more sophisticated network architectures in section 5. Within the fully-connected neural network framework we wish to explore many of the remaining hyper-parameters. We constrain our networks to have constant width (for each layer) but explore the space of layer width (between 1 and 1000 neurons) and number of hidden layers (between 1 and 10 layers). For the activation function we test ReLU, tanh, leaky ReLU and Swish (Ramachandran et al., 2017) activation functions, and consistently find that tanh and Swish produce the lowest training losses. After preliminary exploration we settle on using the Adam optimiser and a batch size of 256. We train for 50 epochs, check-pointing after each epoch to select the best model on the testing dataset.

Given our desire to accelerate an existing parameterisation scheme, the speed of any machine learning emulator built is as important as its accuracy. Therefore we present our offline results as a function of the degrees of freedom (DOF), the number of trainable parameters in each network. For fully-connected neural networks this is a good proxy for the number of floating point operations (FLOPs) required to calculate a single inference step on a column of data. The degrees of freedom in a fully-connected neural network are dominated by the neuron weights, each of which is used once in a fused multiply-add operation as part of a matrix-vector multiplication. FLOPs are only one way of measuring computational cost and are the appropriate measure when a calculation is operation-bound rather than memory-bound. However, as most parts of weather and climate models are not able to operate anywhere close to the peak performance of the hardware, and as the ability to leverage peak performance will depend on the access pattern and data requirements of kernels, the measure of FLOPs may be misleading when estimating computing time. We will later return to this issue during online testing.
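As a concrete (but hypothetical) illustration of the networks in this search space, the Keras sketch below builds a constant-width fully-connected emulator and reports its number of trainable parameters, which is the DOF measure used here; the particular width, depth and input/output sizes are placeholders rather than the configurations selected by our search.

```python
import tensorflow as tf

def build_emulator(n_in=191, n_out=126, width=128, depth=5, activation="tanh"):
    """Constant-width fully-connected emulator; DOF = trainable parameter count."""
    inputs = tf.keras.Input(shape=(n_in,))
    x = inputs
    for _ in range(depth):
        x = tf.keras.layers.Dense(width, activation=activation)(x)
    outputs = tf.keras.layers.Dense(n_out)(x)          # linear output layer
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
    return model

model = build_emulator()
# Proxy for the per-column cost: each weight is used once in a fused multiply-add.
print("DOF:", model.count_params())
```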
Figure 2. Offline root-mean-squared error of the best performing network for a range of prescribed DOFs. Networks are trained and validated on data normalised using: (a) MEANSD, (b) TMEANVMAX. To contextualise the offline results we plot two baselines. First, the error of the current scheme relative to the HC scheme (which was used to generate the training data). Second, the error when persisting the HC scheme tendencies for the following timestep. A vertical red line denotes the approximate FLOP cost of the current scheme.
Offline learning
In figure 2 we plot the performance of our best models for a given number of DOF on the validation dataset for the two normalisation schemes. To give context to the error and cost of these schemes we plot the error on the validation dataset of using either the existing scheme or persisting the HC scheme tendencies for the following timestep (in the operational version of the IFS, persistence is used for the NOGWD scheme to reduce costs). Relative to the HC scheme, we are easily able to construct NNs that produce lower error than the existing scheme. Beating persistence is a more challenging task that is only achieved for our most expensive MEANSD normalisation networks. For the largest TMEANVMAX networks our error no longer decreases and indeed shows a slight increase for the largest networks that we train. Potentially the addition of regularisation (e.g. on the weights) could help with training large networks. This offline testing phase helps identify the best network architectures for a given number of DOF. For both normalisation methods there is no clear pattern relating the depth,
width and the DOF. Typically our optimal networks consist of 3 to 6 layers, but validation losses vary little between 3 and 10 layers. This suggests that the precise network architecture is unimportant for these fully-connected fixed-width networks in our problem.
Coupling neural networks within IFS
Small offline errors are not a necessary or sufficient condition to find good performance when using the networks online in simulations with the IFS. Theoretically, seemingly small biases could accumulate as large errors and trigger instabilities even if the mean errors are small. To understand the impact of our schemes on the wider forecast we couple our neural networks back into the IFS code, replacing the existing NOGWD scheme. This requires interfacing neural networks, typically written in Python, with Fortran, and is a problem faced by many machine learning researchers in this field. In our work we solve this problem by writing a Fortran module to load the saved weights and reproduce the neural network architecture using matrix-matrix multiplication algorithms found in the BLAS library (Blackford et al., 2002). This works well for simple network architectures but would be more difficult to realise for the complex network structures that can be built in libraries such as Tensorflow. Recently Ott et al. (2020) have produced an open-source package which accomplishes a similar task. For a given DOF choice we use our best network with the tanh activation function, as this has comparable performance to the Swish function and this activation function already exists within the Fortran standard.
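The inference performed by that Fortran module reduces to a short sequence of GEMM calls; the NumPy sketch below mirrors the same computation for a batch of columns (the weight-storage layout shown is an assumption, not the exact file format used by our module).

```python
import numpy as np

def forward(x, weights, biases):
    """Inference for a fully-connected tanh network over a batch of columns.

    x:       (n_columns, n_inputs) array of normalised inputs.
    weights: list of (n_in, n_out) matrices; biases: list of (n_out,) vectors.
    Each layer is one matrix-matrix multiplication, which is what the Fortran
    implementation delegates to BLAS (GEMM)."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ w + b)                 # hidden layers use tanh (a Fortran intrinsic)
    return a @ weights[-1] + biases[-1]        # linear output layer
```

Because each hidden layer is a single matrix-matrix product over the whole batch of columns, the coupled scheme can call optimised BLAS routines directly rather than looping over individual columns.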
Medium range forecasting
To demonstrate the performance of our networks in online mode, i.e. coupled to the IFS, we complete a 120-hour forecast at TCo399 resolution using a variety of model configurations. Our truth here is chosen to be a simulation using the high-complexity variant of the existing NOGWD scheme. We choose the HC NOGWD scheme as truth instead of reanalysis because the aim is to emulate the HC scheme with a NN and not the atmospheric reanalysis. Against this we plot the forecast error using the existing NOGWD scheme, using no activated NOGWD scheme at all and using six neural network configurations covering both normalisation methods and a range of network sizes. For each simulation we calculate the horizontally-averaged RMSE relative to a forecast using the HC scheme and plot the error as a function of pressure after 120 hours in figure 3. For the TMEANVMAX normalisation we find similar performance for each network, irrespective of network complexity, with each of the schemes producing a comparable forecast to the existing variant of the scheme. For the MEANSD normalisation, the higher complexity networks produce the closest forecast to the truth, marginally more accurate than the existing version of the scheme. However, the small MEANSD network produces a notably degraded forecast. This suggests that if one is trying to produce the cheapest possible forecast, then the TMEANVMAX normalisation allows a network to cheaply learn the key features. However, to produce the best possible forecast there is value in a more equal weighting of the features, as given by the MEANSD normalisation.

Beyond the above study we carry out 10-day forecasts at the TCo399 resolution starting each day within July and December 2019. In figure 4 we plot the results averaged across the December start dates; comparable results are found for the July start dates. We measure the difference in RMSE for two forecast models, each assessed against the ECMWF operational analysis. Specifically we compare forecasts using our 100k TMEANVMAX network against forecasts using the existing NOGWD scheme. Blue denotes a reduction in forecast error for our NN compared with the existing scheme. Predominately blue is observed, indicating a net improvement in forecast quality using our network. This improvement is possible because our NNs emulate the high complexity NOGWD scheme, which we found to also improve forecast skill (relative to the existing NOGWD scheme). We find our NN forecasts to be neutral compared with forecasts using the HC scheme. Comparable results were also seen for a 100k MEANSD network (not plotted).
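For reference, the error measure plotted in figure 3 amounts to a root-mean-square difference from the HC truth run over the horizontal dimensions at each vertical level; a minimal sketch is shown below, where the cosine-latitude area weighting is our assumption rather than a detail stated in the text.

```python
import numpy as np

def horizontal_rmse(forecast, truth, lat):
    """RMSE over the horizontal dimensions for each vertical level.

    forecast, truth: arrays of shape (n_lev, n_lat, n_lon).
    lat: latitudes in degrees, used to build cos-latitude area weights
    (an assumed choice of weighting)."""
    w = np.cos(np.deg2rad(lat))[None, :, None]          # broadcast over levels and longitudes
    sq_err = (forecast - truth) ** 2
    return np.sqrt((sq_err * w).sum(axis=(1, 2)) / (w.sum() * forecast.shape[2]))
```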
Journal of Advances in Modeling Earth Systems
Figure 3.
Horizontally-averaged forecast errors in simulations after 120 hours at TCo399 resolution. Truth is taken to be a simulation of IFS using the HC version of the NOGWD scheme. Against this we measure the existing NOGWD scheme, no NOGWD scheme and six variants of our NNs covering a range of complexities for both normalisation methods. Except for the 30,000 (30k) DOF MEANSD NN, each of the NNs has comparable errors to the original scheme, with several of the MEANSD schemes outperforming the original.
Figure 4.
Zonal-average RMSE differences between forecasts using our 100k TMEANVMAX NN and those using the existing NOGWD scheme, each measured against the ECMWF analysis for (a) horizontal velocities, (b) temperature. Blue indicates improvement over the existing scheme, with red indicating degradation. Crosses indicate significantly different values. Predominately blue is observed, indicating a net improvement in forecast quality using our networks.
Year-long simulations
In order to fully measure the accuracy and stability of our neural networks we now assess their performance in long-range forecasting. When the NOGWD scheme was first introduced, it was found to have a positive impact on the ability of the IFS model to capture the quasi-biennial oscillation (QBO) when compared with the previous Rayleigh friction model (Orr et al., 2010). Crucially, the Rayleigh friction model fails to capture the descent phase and produces an overly regular oscillation period. Additionally, the latitude-pressure distribution of zonal winds and temperatures in the middle atmosphere was improved by the introduction of the NOGWD scheme (Orr et al., 2010). For computational cost reasons we carry out these simulations at TL159 horizontal resolution, but do not retrain for this different resolution. To examine the behaviour of our networks on long timescales we carry out six year-long simulations starting on the 1st of November in the years 2009 to 2014. We forecast this period with the high-complexity and normal versions of the existing NOGWD scheme, the Rayleigh friction scheme and four of the configurations described in the previous section, omitting the cheapest networks. In figure 5 we plot the time-pressure behaviour of the equatorial zonal wind, calculated as the horizontal average between latitudes -5 and 5. Discontinuities indicate where a new simulation is started and exist for each experiment. Figure 6 shows differences in jet structure from the HC truth run. Each of the neural networks tested significantly outperforms the old Rayleigh friction scheme and shows clear descent of the jet altitude. We find that the MEANSD normalised networks do a better job at capturing the strong descents observed in the two variants of the original scheme in 2013 and 2015. Interestingly, we find only a small improvement for the 1 million (1m) DOF models over the 100,000 (100k) DOF models. For the 100k MEANSD normalisation we observe a small amount of noise at the top of the atmosphere, but this appears to have no noticeable impact on the jet structure.

From the same experiments, in figures 7 and 8 we plot the June-July-August (JJA) zonal-mean zonal wind and temperature distribution, averaged over all six start dates. This period was chosen to capture winter in the southern hemisphere; comparable results were seen for the December-January-February period for the northern-hemisphere winter (not plotted). For the zonal winds we see some improvement in the jet structure for the more expensive neural networks. For the temperature field we see a slightly better performance for the networks normalised using the TMEANVMAX scheme, in contrast to the results for the QBO where the MEANSD schemes produce better jet structure.

In these yearly integrations we find some evidence that neural networks trained with the MEANSD normalisation approach outperform the TMEANVMAX method. Reexamining figure 1 we see a local maximum in tendency activity around the 35th model level. In the MEANSD approach, each level has an equal contribution to the loss function, encouraging our networks to capture these variations. In the TMEANVMAX approach these mid-level tendencies are an order of magnitude smaller than those at the top of the atmosphere and so have little contribution to our loss function. As such the network struggles to reproduce the descent as faithfully.
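The QBO diagnostic plotted in figure 5 is simply the zonal-mean zonal wind averaged over the near-equatorial band; a small sketch of that reduction (with an assumed array layout) is:

```python
import numpy as np

def equatorial_zonal_mean_u(u, lat, band=5.0):
    """Zonal-mean zonal wind averaged between -band and +band degrees latitude.

    u: zonal wind of shape (n_time, n_lev, n_lat, n_lon); lat: latitudes in degrees.
    Returns an array of shape (n_time, n_lev) suitable for a time-pressure plot."""
    mask = np.abs(lat) <= band
    u_zonal = u.mean(axis=-1)                  # zonal mean over longitude
    return u_zonal[..., mask].mean(axis=-1)    # meridional average over the equatorial band
```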
Figure 5.
Average zonal-mean zonal jet between latitudes -5 and 5 for six consecutive year-long integrations at TL159 resolution. Plots show the ability of the IFS model to capture the QBO when coupled to different NOGWD schemes: NOGWD, the current scheme; NOGWD HC, the increased complexity variant; four NN schemes covering different complexities and normalisation approaches; Rayleigh friction, the precursor scheme to the current NOGWD scheme. Both variations of the current scheme produce similar dynamics, which are faithfully reproduced by the MEANSD neural networks. The TMEANVMAX networks outperform Rayleigh friction but fail to adequately reproduce the descent phase of the QBO.
Figure 6.
Difference in the zonal jets between the simulations plotted in figure 5. NOGWD HC is treated as the reference scheme and hence has no error. Both 100,000 (100k) and 1 million (1m) DOF MEANSD schemes have comparable errors to the existing NOGWD scheme. Errors for the TMEANVMAX scheme are larger.
Figure 7.
June-July-August (JJA) average zonal jet as a function of latitude and pressure level for variants of the NOGWD scheme. Plots on the left show the jet, with differences to the NOGWD HC scheme depicted on the right. Here the more expensive schemes slightly outperform their cheaper counterparts.
Figure 8.
JJA average temperature profiles. Left plots show the temperature profiles and right plots show the deviations from the NOGWD HC result. Each scheme produces a very similar temperature profile to our reference simulation. The MEANSD models have a small warm bias over the Arctic.
Conservation properties
The existing NOGWD scheme conserves momentum. A fixed amount of momentum flux is launched upwards within a column, with momentum deposited at each layer depending on wave stability. All remaining momentum that reaches the upper-most layer is deposited there to ensure conservation. In our first iterations of neural networks we utilised the same approach and trained our networks to predict all other output layers and used momentum conservation to deduce the upper-most tendency. This approach is similar to the work of Beucler et al. (2019). In our formulation, our training loss does not explicitly include the upper-most layer, whereas Beucler et al. (2019) use conservation properties to deduce the upper-most layer which is then included in the loss calculation. However, during our long-range forecasting experiments we found that this momentum conservation approach occasionally destabilised our simulations. Due to the very strong difference in mass per volume of air, a correction of momentum at the top of the model that is caused by a relatively small change in momentum close to the ground can have a significant detrimental effect on the local momentum budget at the model top, which can trigger model instabilities. For our final networks we abandon exact momentum conservation and instead directly predict the top layer tendencies along with all other layers using our networks. With this approach we are able to produce stable and accurate parameterisation schemes. Further work could be undertaken to follow Beucler et al. (2019) and use conservation of momentum before the loss is calculated, to test if this stabilises the forecasts.
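A sketch of the constrained formulation we initially tried is given below: the network predicts all but the upper-most tendency and the top value is recovered from a mass-weighted momentum budget. The layer masses `dp_over_g` and the idea of treating the column total as a known input are illustrative simplifications, not the exact budget used by NOGWD.

```python
import numpy as np

def complete_top_tendency(partial_tendency, dp_over_g, column_total):
    """Recover the upper-most tendency from a momentum budget.

    partial_tendency: network prediction for levels 2..N (top level excluded),
                      shape (n_columns, n_lev - 1).
    dp_over_g:        layer mass per unit area for all n_lev levels (top first),
                      shape (n_lev,).
    column_total:     prescribed mass-weighted column-integrated tendency,
                      shape (n_columns,); in NOGWD this is fixed by the launched
                      momentum flux, and treating it as a known input is our assumption."""
    deposited = (partial_tendency * dp_over_g[1:]).sum(axis=1)
    top = (column_total - deposited) / dp_over_g[0]     # whatever remains is deposited at the top
    return np.concatenate([top[:, None], partial_tendency], axis=1)
```

Because the top layer carries very little mass, small errors elsewhere in the column translate into large corrections at the model top, which is the destabilisation mechanism described above.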
Performance analysis
Given our motivation for building these networks, i.e. improving the overall performance of the IFS model, an assessment of the performance is a crucial albeit complex step. As previously highlighted, the cost of the current NOGWD scheme for a single column corresponds to a fixed number of FLOPs (marked by the vertical red line in figure 2). We test our performance in 10-day forecasts at TCo399 resolution, using 66 processors each using 6 threads. The current NOGWD scheme is run at the IFS standard of double precision, whereas the NNs are run at the precision at which they were trained, single precision. In this setup 1.6 million columns are evaluated per thread during the simulation. In our test IFS setup around 300 seconds of the total simulation time are spent on the NOGWD calculations. By comparison, when using one of our 100k DOF models in the equivalent setup the NOGWD scheme takes 360 seconds per thread, slightly slower than the current version. For a 30k DOF model, this is reduced to 250 seconds, slightly cheaper than the current scheme but an insignificant performance increase. To better understand the performance bottlenecks of both schemes we also run each scheme decoupled from the IFS model. Simulating the same number of columns (1.6 million) in decoupled testing takes 18 seconds for the existing NOGWD scheme, 20 seconds for a 100k DOF model or 10 seconds for a 30k DOF model. This shows that both the original and neural network schemes are heavily limited by data movement, so approaches that can overcome this problem can have dramatic performance increases. One such vision is a heterogeneous hardware cluster where some of the parameterisation schemes are calculated on GPUs with the output being sent back to the CPU for the remaining simulation. We have shown here that this could be achieved by training neural network emulators, meaning that existing code does not have to be converted to CUDA or other GPU-appropriate languages. To understand the possible benefits of this architecture we test our neural networks on an NVIDIA V100 GPU. Even when including the time for transferring the data to and from the GPU, a 100k DOF NN takes 4 seconds to simulate 1.6 million columns. Even this time is dominated by overhead costs, as a 1m DOF NN also takes 4 seconds. Therefore, while the CPU performance when coupled to the IFS is not currently faster than the existing scheme, there is scope for dramatic performance gains on heterogeneous hardware, particularly if more parameterisation schemes can be accurately emulated by neural networks.
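The GPU timings quoted above come from pushing a large batch of columns through the trained network on the accelerator; a minimal way to reproduce that style of measurement is sketched below, where the stand-in model, batch size and timing granularity are our own choices rather than the benchmark harness used for this work.

```python
import time
import numpy as np
import tensorflow as tf

# Stand-in emulator with roughly 1e5 trainable parameters (illustrative sizes only).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(180, activation="tanh", input_shape=(191,)),
    tf.keras.layers.Dense(180, activation="tanh"),
    tf.keras.layers.Dense(180, activation="tanh"),
    tf.keras.layers.Dense(126),
])

columns = np.random.rand(1_600_000, 191).astype(np.float32)   # stand-in for 1.6 million columns

start = time.perf_counter()
tendencies = model.predict(columns, batch_size=65536, verbose=0)  # timing includes host-device transfers
print(f"{time.perf_counter() - start:.1f} s for {len(columns)} columns")
```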
Figure 9.
Schematic of our hybrid convolution deep network. Top: our hybrid block. Bottom: an example model layout consisting of four blocks. The input to the hybrid networks is the profiles of winds and temperature. The aux input refers to the surface pressure and geopotential, which are injected into each vertical layer of each block. Arrows indicate layers, e.g. a dense (fully-connected) layer taking the inputs and aux inputs. Stack indicates a repeating of an object to enable concatenation with data that has vertical structure (e.g. wind profiles).
Reduced numerical precision
Recent developments of GPU and Tensor Processing Unit (TPU) hardware have given users access to low numerical precision floating point numbers with significantly improved performance for neural networks. Outside of machine learning, elements of both the dynamics (Hatfield et al., 2019) and parameterised physics (Saffin et al., 2020) can be calculated at half precision with no impact on forecast quality. To test the applicability of reduced precision networks for our emulation we utilise Tensorflow's mixed-precision capability, which stores variables at 32 bits (single precision) but uses 16 bits (half precision) for intermediate calculations. Through both our training and offline testing phases we find no notable degradation in the accuracy of our networks. Currently the Fortran standard does not include half-precision calculations so online testing cannot currently be easily carried out. Recent work in emulation of convection parameterisation used emulated half precision with good network accuracy (Yuval et al., 2020). It is future work to couple our emulators to the IFS model in such a way that GPUs and reduced numerical precision can be leveraged.
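In Tensorflow the mixed-precision configuration described above is set through a global policy before the model is built; the sketch below shows the standard pattern (the explicit loss-scaling optimiser is our addition, following the usual recommendation for float16 training, and the layer sizes are placeholders).

```python
import tensorflow as tf

# Keep variables in float32 but run layer computations in float16.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="tanh", input_shape=(191,)),
    tf.keras.layers.Dense(128, activation="tanh"),
    # Output layer forced back to float32 so the loss is accumulated at full precision.
    tf.keras.layers.Dense(126, dtype="float32"),
])

# Loss scaling guards against float16 gradient underflow during training.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer, loss="mse")
```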
Alternate network structures
In this study we considered fully-connected fixed-width neural networks, one of the simplest network designs available. In this architecture each node in the first layer has access to all input features, with no knowledge of the vertical structure of the atmosphere. Given that many calculations in physics are local in space, e.g. the calculation of a vertical derivative, this suggests that improved performance could be found by a more appropriate choice of network architecture.

One choice is using convolutional layers, a fixed local (in the vertical direction) operator where the same weights are used for each vertical layer in the atmosphere. Given that our model levels are unequally spaced in both distance from the Earth (from O(m) to O(km) distance per layer) and pressure values, this appears to be a poor choice for our application. When testing with fully convolutional networks we find that offline errors are dramatically higher than for our fully-connected networks, irrespective of the number of filters or convolutional layers used.

An alternate choice is to use locally-connected layers. These have the same stencil shape as convolutional layers, but the weights are trained individually for each vertical layer. It is therefore possible to encode a finite-difference vertical derivative on our vertical grid within this architecture. Despite this, our testing finds that models comprising entirely locally-connected layers still fail to match the performance of our fully-connected networks, with testing errors more than 2 times larger than our best 100k DOF fully-connected networks.

Examining the algorithm of the existing NOGWD scheme we can understand why purely local approaches struggle to achieve good predictive performance. In the existing scheme, while most of the calculations are local in vertical space, there are several points where non-local calculations take place, for example when velocities relative to the launch level wind velocities are calculated. With a purely local method, information can only propagate through the network based upon the filter width at each layer, so with a width of 5, a minimum of 12 layers would be required for information to be able to propagate from the lowest to highest vertical layer.

With the above in mind we design a hybrid approach, utilising both fully-connected and locally-connected layers, a schematic of which is plotted in figure 9. Each block of this network comprises a small number of dense connections, which can combine both the inputs to that layer and the auxiliary inputs, which in this context are surface pressure and surface geopotential. The outputs of these dense layers are then stacked vertically, alongside the auxiliary inputs. All of these features are then passed through a set of locally-connected filters. This design seeks to predominantly leverage the local nature of the physical parameterisation scheme while also enabling information to propagate throughout the domain within a single block. In figure 10 we re-plot the results for the fully-connected networks using MEANSD normalisation and add the results of our hybrid networks. To explore the hyper-parameters associated with our hybrid networks (e.g. the number of blocks or number of dense outputs) we use the HyperOpt tool (Bergstra et al., 2013). For all DOF values our hybrid networks significantly outperform their fully-connected equivalents, producing comparable results to a fully-connected network with three times the DOF.
There is certainly room for further improvements in searching this and other network architecture spaces. Due to the increased network architecture complexity we have not yet implemented these networks within our Fortran approach. This network architecture could prove to be a useful approach across the spectrum of physical parameterisation schemes.
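A rough Keras sketch of a single hybrid block, in the spirit of figure 9, is given below using `tf.keras.layers.LocallyConnected1D` (available in Tensorflow 2.x Keras); the filter width, channel counts and the way the auxiliary surface fields are injected are illustrative choices, not the exact configuration found by the hyper-parameter search.

```python
import tensorflow as tf
from tensorflow.keras import layers

def hybrid_block(profile, aux, n_lev=63, n_dense=4, filters=8, kernel=5):
    """One hybrid block: a few dense (column-wide) features, stacked onto every
    vertical level, followed by locally-connected filters with untied weights.

    profile: tensor of shape (batch, n_lev, channels); aux: tensor of shape (batch, n_aux)."""
    # Dense part: a global view of the column plus the auxiliary surface fields.
    flat = layers.Concatenate()([layers.Flatten()(profile), aux])
    dense = layers.Dense(n_dense, activation="tanh")(flat)
    # Stack the dense outputs and aux inputs onto every vertical level.
    stacked = layers.Concatenate()([
        profile,
        layers.RepeatVector(n_lev)(dense),
        layers.RepeatVector(n_lev)(aux),
    ])
    # Locally-connected filters: convolution-like stencil, but weights differ per level.
    # With the default 'valid' padding the output is shortened by (kernel - 1) levels.
    return layers.LocallyConnected1D(filters, kernel, activation="tanh")(stacked)
```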
Figure 10.
Offline results on the validation dataset using the MEANSD normalisation method. In addition to the results presented in figure 2 we plot our best performing hybrid networks as outlined in figure 9. Consistently the hybrid network outperforms the fully-connected networks, with approximately a three-times reduction in the required DOF for an equivalent validation error.
Figure 11.
Average zonal-mean zonal jet between latitudes -5 and 5 for six consecutive year-long integrations at TL159 resolution. Experiments match the equivalent names from figure 5 except that the numerical sponge at the top of the atmosphere has been removed. Wind speeds in the stratosphere are significantly increased, yet the QBO phase and amplitude are captured by the NN.
Generalisability to model resolution and version changes
If a body of work is undertaken to produce emulators of physical parameterisation schemes within an operational modelling system it is important to understand possible sensitivities to changes in either the model resolution or model version. If any changes of these hyperparameters necessitated the learning of new neural networks, and hence the generation of new training datasets, then this could quickly become a very large infrastructure project to maintain. Vertical resolution is an intrinsic part of each parameterisation scheme, and so changes to vertical resolution will require the generation of new schemes. However, for horizontal resolution, in the context of the NOGWD scheme, we found no sensitivity to changing the spatial resolution. We tested this both offline and online, e.g. our results for long-range forecasting are simulated at a lower spatial resolution than the training dataset. For model cycle upgrades, we tested performance across three consecutive cycles of IFS (45r1, 46r1 & 47r1) and found no noticeable degradation in performance. In figure 11 we show the evolution of the equatorial zonal-mean zonal jet with the sponge layer removed. Despite no training on a dataset with the sponge removed, our MEANSD NN produces a stable atmospheric structure. This is despite a significant increase in stratospheric wind speeds (note the change in contour levels compared to figure 5). The phase and amplitude of the QBO are comparable to our simulation with the HC NOGWD scheme with no sponge layer. This is more evidence that our neural networks show reasonable robustness to model changes. However, we cannot rule out degradation in scenarios where very significant model changes are made.
Orographic gravity wave drag
Armed with the successes of our emulation of NOGWD, we carried out an equivalent task to learn the orographic gravity wave drag (OGWD) scheme within IFS. We used the same methodology for data generation, the same normalisation methods and the same network architecture choices. Orographic gravity wave drag shares the same inputs as its non-orographic counterpart, with the addition of four parameters describing the unresolved orography. The existing OGWD scheme is active only in locations with significant unresolved orography and when the atmospheric profile is susceptible to the waves generated by that orography. The result of this is a scheme that is inactive for most of the globe and is often inactive even in locations with significant unresolved orography.

Unfortunately, our attempts to learn this OGWD scheme with fully-connected neural networks were unsuccessful. Increasing the degrees of freedom did not reduce the testing error, unlike our results for the NOGWD scheme. Our neural networks struggled to derive this nuanced, yet computationally cheap, interaction between the orography and atmospheric profile. The most common result of training was networks that predicted zero velocity tendency for almost all input columns. In online testing our model performed similarly to a forecast without an OGWD scheme. In attempting to overcome these problems we tested many approaches. Learning a single grid-point, i.e. generating a network which could predict the tendencies for only a single point in space, was possible but is not a scalable solution. The existing OGWD scheme is only utilised when sub-grid orographic variability exceeds a threshold; we followed this example and trained only on locations where this threshold was met. Additionally, we tested methods to change the balance of training data to exclude a proportion of our training data where the scheme was inactive, but this often resulted in a dramatic increase in false positives (OGWD activity where the current scheme is inactive).

From this experience we conclude that training emulators of physical parameterisation schemes is a non-trivial task, not guaranteed to succeed even with large
amounts of available data. Our outlook is that most parameterisation schemes share more in common with the NOGWD scheme and therefore will be possible to learn. However, this remains to be tested.
Conclusions

In this work we successfully emulate the non-orographic gravity wave drag scheme from the operational IFS forecasting model. Despite training offline, we are able to produce an emulator which can run stably when coupled to the IFS for seasonal timescales. The most exacting test of our emulators, reproducing the descent of the QBO, was successfully achieved by our neural networks. Broadly, our networks have similar computational cost to the existing scheme when coupled to the IFS on a CPU-based architecture. However, both the existing scheme and our neural networks are heavily limited by data movement when run within the IFS. The major advantage of our networks is the ease of portability to heterogeneous architectures, where neural network emulators of parameterisation schemes could be offloaded to GPUs. This approach will have the greatest benefit if many parameterisation schemes can be emulated with neural networks. However, as discussed above for the orographic gravity wave drag, other schemes might prove more challenging to emulate as accurately as the NOGWD scheme. Beyond forecasting, there is other value in building parameterisation emulators. The 4D-Var data assimilation approach uses a tangent-linear and adjoint of the forecasting model, constructed of tangent-linear and adjoint versions of each kernel. This is a challenging process, typically derived by hand. However, with an accurate neural network emulator the tangent-linear and adjoint versions can be trivially derived due to the simple nature of a neural network's algorithm. In our sister paper, Hatfield et al. (2021), we test how these tangent-linear and adjoint versions perform within a data-assimilation framework.
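To illustrate why the tangent-linear and adjoint of an emulator come essentially for free, the sketch below obtains both as Jacobian-vector and vector-Jacobian products through automatic differentiation; the helper names are ours, and any trained Keras emulator could stand in for `model`.

```python
import tensorflow as tf

def tangent_linear(model, x, dx):
    """Tangent-linear model: Jacobian-vector product J(x) @ dx via forward-mode AD."""
    with tf.autodiff.ForwardAccumulator(primals=x, tangents=dx) as acc:
        y = model(x)
    return acc.jvp(y)

def adjoint(model, x, dy):
    """Adjoint model: vector-Jacobian product J(x)^T @ dy via reverse-mode AD."""
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)
    return tape.gradient(y, x, output_gradients=dy)
```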
Acknowledgments
M. Chantry and T. Palmer were supported by a grant from the Office of Naval Research Global. ECMWF provided computing resources, including the European Weather Cloud, which is an ECMWF and EUMETSAT project. Thanks to Ioan Hadade for his technical support. P. Dueben gratefully acknowledges funding from the Royal Society for his University Research Fellowship as well as the MAELSTROM and AI4Copernicus EU projects (grant agreements No 955513 and 101016798). S. Hatfield and P. Dueben have also received funding from the ESIWACE EU project (grant agreement No 823988). T. Palmer also received funding from an ERC Advanced Grant, ITHACA 741112.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. (Software available from tensorflow.org)
Bergstra, J., Yamins, D., & Cox, D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning (pp. 115–123).
Beucler, T., Pritchard, M., Rasp, S., Ott, J., Baldi, P., & Gentine, P. (2019). Enforcing analytic constraints in neural-networks emulating physical systems. arXiv preprint arXiv:1909.00912.
Blackford, L. S., Petitet, A., Pozo, R., Remington, K., Whaley, R. C., Demmel, J., ... others (2002). An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software, (2), 135–151.
Brenowitz, N. D., & Bretherton, C. S. (2018). Prognostic validation of a neural network unified physics parameterization. Geophysical Research Letters, (12), 6289–6298.
Brenowitz, N. D., Henn, B., McGibbon, J., Clark, S. K., Kwa, A., Perkins, W. A., ... Bretherton, C. S. (2020). Machine learning climate model dynamics: Offline versus online performance. arXiv preprint arXiv:2011.03081.
Chantry, M., Christensen, H., Düben, P., & Palmer, T. (2021). Opportunities and challenges for machine learning in weather and climate modelling: hard, medium and soft AI. Philosophical Transactions A.
Chevallier, F., Chéruy, F., Scott, N., & Chédin, A. (1998). A neural network approach for a fast and accurate computation of a longwave radiative budget. Journal of Applied Meteorology, (11), 1385–1397.
Dijkstra, H., Petersik, P., Hernandez-Garcia, E., & Lopez, C. (2019). The application of machine learning techniques to improve El Niño prediction skill. Frontiers in Physics, 153.
Dunkerton, T. J. (1997). The role of gravity waves in the quasi-biennial oscillation. Journal of Geophysical Research: Atmospheres, (D22), 26053–26076.
ECMWF. (n.d.). IFS documentation, CY45R1.
Ern, M., Preusse, P., Alexander, M. J., & Warner, C. D. (2004). Absolute values of gravity wave momentum flux derived from satellite data. Journal of Geophysical Research: Atmospheres, (D20).
Garcia, R. R., & Boville, B. A. (1994). "Downward control" of the mean meridional circulation and temperature distribution of the polar winter stratosphere. Journal of the Atmospheric Sciences, (15), 2238–2245.
Gardner, C. S., Miller, M. S., & Liu, C.-H. (1989). Rayleigh lidar observations of gravity wave activity in the upper stratosphere at Urbana, Illinois. Journal of the Atmospheric Sciences, (12), 1838–1854.
Gentine, P., Pritchard, M., Rasp, S., Reinaudi, G., & Yacalis, G. (2018). Could machine learning break the convection parameterization deadlock? Geophysical Research Letters, (11), 5742–5751.
Hatfield, S., Chantry, M., Düben, P., & Palmer, T. (2019). Accelerating high-resolution weather models with deep-learning hardware. In Proceedings of the Platform for Advanced Scientific Computing Conference (pp. 1–11).
Hatfield, S., Düben, P., Lopez, P., Geer, A., Chantry, M., & Palmer, T. (2021). Neural networks as the building blocks for tangent-linear and adjoint models. arXiv preprint.
Krasnopolsky, V. (1997). A neural network forward model for direct assimilation of SSM/I brightness temperatures into atmospheric models. Research Activities in Atmospheric and Oceanic Modeling.
NVIDIA. (2017). NVIDIA Tesla V100 GPU Architecture (Tech. Rep.).
O'Gorman, P. A., & Dwyer, J. G. (2018). Using machine learning to parameterize moist convection: Potential for modeling of climate, climate change, and extreme events. Journal of Advances in Modeling Earth Systems, (10), 2548–2563.
Orr, A., Bechtold, P., Scinocca, J., Ern, M., & Janiskova, M. (2010). Improved middle atmosphere climate and forecasts in the ECMWF model through a nonorographic gravity wave drag parameterization. Journal of Climate, (22), 5905–5926.
Ott, J., Pritchard, M., Best, N., Linstead, E., Curcic, M., & Baldi, P. (2020). A Fortran-Keras deep learning bridge for scientific computing. arXiv preprint arXiv:2004.10652.
Palmer, T. (2020). A vision for numerical weather prediction in 2030. arXiv preprint arXiv:2007.04830.
Polichtchouk, I., Shepherd, T., Hogan, R., & Bechtold, P. (2018). Sensitivity of the Brewer–Dobson circulation and polar vortex variability to parameterized nonorographic gravity wave drag in a high-resolution atmospheric model. Journal of the Atmospheric Sciences, (5), 1525–1543.
Polichtchouk, I., Shepherd, T. G., & Byrne, N. J. (2018). Impact of parametrized nonorographic gravity wave drag on stratosphere-troposphere coupling in the northern and southern hemispheres. Geophysical Research Letters, (16), 8612–8618.
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941.
Rasp, S., Düben, P. D., Scher, S., Weyn, J. A., Mouatadid, S., & Thuerey, N. (2020). WeatherBench: A benchmark dataset for data-driven weather forecasting. arXiv preprint arXiv:2002.00469.
Rasp, S., Pritchard, M. S., & Gentine, P. (2018). Deep learning to represent subgrid processes in climate models. Proceedings of the National Academy of Sciences, (39), 9684–9689.
Saffin, L., Hatfield, S., Düben, P., & Palmer, T. (2020). Reduced-precision parametrization: lessons from an intermediate-complexity atmospheric model. Quarterly Journal of the Royal Meteorological Society, (729), 1590–1607.
Scinocca, J. F. (2003). An accurate spectral nonorographic gravity wave drag parameterization for general circulation models. Journal of the Atmospheric Sciences, (4), 667–682.
Sønderby, C. K., Espeholt, L., Heek, J., Dehghani, M., Oliver, A., Salimans, T., ... Kalchbrenner, N. (2020). MetNet: A neural weather model for precipitation forecasting. arXiv preprint arXiv:2003.12140.
Váňa, F., Düben, P., Lang, S., Palmer, T., Leutbecher, M., Salmond, D., & Carver, G. (2017). Single precision in weather forecasting models: An evaluation with the IFS. Monthly Weather Review, (2), 495–502.
Veerman, M., & Pincus, R. (2021). Predicting atmospheric optical properties for radiative transfer computations using neural networks. Philosophical Transactions A.
Warner, C., & McIntyre, M. (2001). An ultrasimple spectral parameterization for nonorographic gravity waves.