Spherical convolution and other forms of informed machine learning for deep neural network based weather forecasts
Sebastian Scher∗ and Gabriele Messori
Department of Meteorology and Bolin Centre for Climate Research, Stockholm University, Stockholm, Sweden
Department of Earth Sciences, Uppsala University, Uppsala, Sweden
Abstract
Recently, there has been a surge of research on data-driven weather forecasting systems, especially applications based on convolutional neural networks (CNNs). These are usually trained on atmospheric data represented as regular latitude-longitude grids, neglecting the curvature of the Earth. We assess the benefit of replacing the convolution operations with a spherical convolution operation, which takes into account the geometry of the underlying data, including correct representations near the poles. Additionally, we assess the effect of including the information that the two hemispheres of the Earth have "flipped" properties - for example cyclones circulating in opposite directions - into the structure of the network. Both approaches are examples of informed machine learning. The methods are tested on the Weatherbench dataset, at a high resolution of ~1.4°, which is higher than in previous studies on CNNs for weather forecasting. We find that including hemisphere-specific information improves forecast skill globally. Using spherical convolution leads to an additional improvement in forecast skill, especially close to the poles in the first days of the forecast. Combining the two methods gives the highest forecast skill, with roughly equal contributions from each. The spherical convolution is implemented flexibly and scales well to high-resolution datasets, but is still significantly more expensive than a standard convolution operation. Finally, we analyze cases with high forecast error. These occur mainly in winter, and are relatively consistent across different training realizations of the networks, pointing to connections with intrinsic atmospheric predictability.

Plain Language Summary
Weather forecasting is traditionally done with complex computer models based on physical understanding. Recently, however, there has been rising interest in using machine-learning methods instead. Especially techniques from the area of image and video recognition have been tested for this end. When using these techniques for weather forecasting, atmospheric fields are often treated as rectangular images. This is, however, inappropriate for global fields, since the Earth is a globe, and representing global data as rectangular images leads to strong distortions close to the poles. Here we test a technique that circumvents this problem, and show that it increases forecast performance. Additionally, we design our machine-learning method in such a way that it "knows" a priori that the weather on the Earth's two hemispheres is similar but "mirrored".

∗ corresponding author ([email protected])

Introduction

Weather forecasting has for decades been dominated by numerical models built on physical principles, the so-called Numerical Weather Prediction (NWP) models. These models have seen a constant increase in skill over time [Bauer et al., 2015]. Recently, however, there has been a surge of interest in data-driven weather forecasting in the medium range (~2-14 days ahead), especially using neural networks [Scher, 2018, Scher and Messori, 2019b, Dueben and Bauer, 2018, Weyn et al., 2019, 2020, Faranda et al., 2020, Scher and Messori, 2020]. A historic overview of paradigms in weather prediction is given in [Balaji, 2020]. Many of the recently proposed data-driven approaches use convolutional neural networks (CNNs) [Scher, 2018, Scher and Messori, 2019b, Weyn et al., 2019] or a local network that is shared across the domain [Dueben and Bauer, 2018]. What these methods have in common is that they use global data on a regular lat-lon grid. This, however, leads to distortions, especially close to the poles.
However, a standard convolution or shared local architecture does not take this into account, since it uses a filter whose size is a fixed number of gridpoints (e.g. 3 × 3), independent of the position on the globe. Here, we test a spherical convolution operation, originally proposed for omnidirectional (360°) images [Coors et al., 2018]. Additionally, we test two different approaches of including information on the hemispheres into our networks. All these approaches can be seen as variants of "informed machine learning" [von Rueden et al., 2020], in which prior knowledge is included into the machine-learning pipeline. In our case, the prior knowledge is that the Earth is spherical (in contrast to the regular grid data that we provide), and that the dynamics of the two hemispheres are - to some extent - "flipped" relative to each other. This information is directly encoded into the structure of the neural network.

The aim of this paper is not to create the best data-driven weather forecasts, but rather to assess the effect of two possible adaptations of earlier proposed methods, and to disentangle their individual contributions to forecast skill. These methods are tested on reanalysis data from the Weatherbench dataset [Rasp et al., 2020]. This is a dataset specifically designed for benchmarking machine-learning based weather forecasts. We use the data at a resolution of up to 1.4°, which is higher than in previous studies. We assess both medium-range forecast skill and long-term stability of forecasts. Finally, we include an analysis of the events with the highest forecast errors. These are important from an end-user point of view, and in NWP they have elicited significant attention. The occurrence of unusually bad forecasts ("forecast busts") in NWP models is connected with certain weather situations [Rodwell et al., 2013, Lillo and Parsons, 2017], and more generally, different weather situations have different predictability [Ferranti et al., 2015]. We analyze whether the worst forecasts of our data-driven forecast systems are randomly distributed, or, as in NWP, are associated with recurrent weather situations.
Data

We use data from Weatherbench [Rasp et al., 2020], a benchmark dataset for data-driven weather forecasting. The subset we use consists of ERA5 reanalysis data, regridded to a regular lat-lon grid with two different resolutions: 2.8125° (hereafter called "low-resolution" or "lres") and 1.40625° (hereafter called "high-resolution" or "hres"). The following input variables are used: temperature at 500 and 850 hPa, and geopotential at 500 and 850 hPa. As evaluation variables, we use geopotential height at 500 hPa ("z500") and temperature at 850 hPa ("t850"). We use the period 1979-2016 for training and validation, and 2017-2018 for evaluation (as proposed in Weatherbench). The forecasts of the network trained on the lres data are evaluated only on the lres grid. The forecasts made with the hres architecture are evaluated both on the hres grid and, after bilinear regridding, also on the lres grid.
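As an illustration, bilinear regridding between two regular lat-lon grids can be sketched with numpy as follows. This is a minimal sketch, not the actual regridding tool used for Weatherbench; among other simplifications it ignores the longitude wrap at the dateline:

```python
import numpy as np

def bilinear_regrid(field, out_shape):
    """Bilinearly regrid a (lat, lon) field to out_shape (o_lat, o_lon)."""
    n_lat, n_lon = field.shape
    o_lat, o_lon = out_shape
    # fractional source-grid coordinates of the target gridpoints
    yi = np.linspace(0, n_lat - 1, o_lat)
    xi = np.linspace(0, n_lon - 1, o_lon)  # note: ignores longitude wrap
    y0 = np.floor(yi).astype(int)
    x0 = np.floor(xi).astype(int)
    y1 = np.minimum(y0 + 1, n_lat - 1)
    x1 = np.minimum(x0 + 1, n_lon - 1)
    wy = (yi - y0)[:, None]  # interpolation weights in latitude
    wx = (xi - x0)[None, :]  # interpolation weights in longitude
    return ((1 - wy) * (1 - wx) * field[np.ix_(y0, x0)]
            + (1 - wy) * wx * field[np.ix_(y0, x1)]
            + wy * (1 - wx) * field[np.ix_(y1, x0)]
            + wy * wx * field[np.ix_(y1, x1)])
```

Bilinear interpolation reproduces constant and linear fields exactly, which makes it a cheap but reasonable choice for moving forecasts between the hres and lres grids for evaluation.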
Spherical convolution

In normal convolution, for each gridpoint, a fixed number of gridpoints in the vicinity are sampled (for example a 3 × 3 neighborhood), and a weighted sum with the filter kernel is computed. In spherical convolution, the kernel is instead defined in physical (lat-lon) distances, and the sampled locations are adjusted to the position on the globe.

Figure 1: a) Sketch of the principle of spherical convolution, shown for a standard 3 × 3 kernel and for a dense ("fractional") kernel. b) Network architecture: a U-net with custom convolution layers (input 128x256x12: z500, z850, t500, t850 at 2 timesteps, plus sin/cos of day-of-year and hour-of-day), average pooling, upsampling and concatenation (skip) connections, and a final linear convolution layer (output 128x256x8) whose output is returned and fed back as input for the second step.

Since Coors et al. [2018] have not provided details on their technical implementation, and since their code is not publicly available, we have designed our own implementation of spherical convolution. In this section we use the word "tensor" as it is used in computational packages such as tensorflow, thus interchangeably with "array". Therefore, not everything referred to as a tensor here is necessarily a tensor in the strict mathematical sense. We have implemented the spherical convolution in the following steps (the channel dimension of the neural network is omitted here for simplification):

1. We start with a (fixed) filter kernel K of length n, consisting of n pairs of lat-lon distances ∆p_i = (∆y_i, ∆x_i), corresponding to gridpoints at the equator.
A 3 × 3 kernel, for example, corresponds to the n = 9 pairs (−1, −1), (−1, 0), (−1, 1), (0, −1), (0, 0), (0, 1), (1, −1), (1, 0), (1, 1), in units of the equatorial grid spacing.

2. For each of the N × M input gridpoints p = (x, y) in the regular grid, we compute n pairs of (potentially non-integer) coordinates p_i = (y_i, x_i), corresponding to the n points in the kernel K, transformed for the current position of p on the globe with the following equations:

y_i = y + ∆y_i    (1)
x_i = x + ∆x_i / cos(φ)    (2)

with latitude φ of the central point. The transformed coordinates are in regular lat-lon coordinates. The transformed coordinates for each gridpoint and kernel point are combined in a coordinate tensor Â of shape N × M × n.

3. The input data x is flattened to x_flat with shape L = N · M, and the coordinate tensor Â is flattened to a tensor of shape L × n, with the coordinates transformed to flattened coordinates.

4. A sparse interpolation tensor L̂ of size (L · n) × L is created, and filled at the target coordinates with interpolation weights in such a way that multiplying the flattened input data x_flat with the interpolation tensor results in the expanded input data x_exp = L̂ x_flat, reshaped to L × n. L̂ is implemented as a sparse tensorflow tensor. This implementation allows the use also on very large grids (large L), as only the non-zero components are kept in memory.

5. On x_exp, a standard 1-d convolution with kernel size n (as implemented in major neural network libraries such as tensorflow) can now be applied, resulting in x_out with size L, which is then unflattened to shape N × M.

Steps 1 and 2, and the computation of the interpolation tensor, need to be done only once (when setting up the network). L̂ is stored in memory for all subsequent operations. At gridpoints close to the poles, kernel points can "pass" through the pole. For these points, not only the longitude but also the latitude is adjusted. For example, on a 1x1° grid, a kernel point that without this adjustment would correspond to the impossible point 90.5N 0E is set to 89.5N 180E.
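The steps above can be condensed into a minimal, dense sketch. This is an illustrative reimplementation, not the authors' tensorflow code: it uses nearest-neighbour rounding instead of interpolation for the non-integer coordinates, and loops over gridpoints instead of building the sparse interpolation tensor:

```python
import numpy as np

def spherical_conv(field, weights, dlat):
    """One spherical-convolution pass over a (n_lat, n_lon) field.

    The 3x3 kernel offsets are defined in gridpoints at the equator; the
    longitude offset is stretched by 1/cos(lat) at other latitudes (cf.
    eq. 2), and kernel points that pass over a pole are mirrored in
    latitude and shifted by 180 degrees in longitude.
    """
    n_lat, n_lon = field.shape
    # cell-centre latitudes, from north to south
    lats = 90.0 - dlat / 2.0 - dlat * np.arange(n_lat)
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = np.zeros_like(field, dtype=float)
    for i in range(n_lat):
        stretch = 1.0 / np.cos(np.radians(lats[i]))
        for j in range(n_lon):
            acc = 0.0
            for w, (dy, dx) in zip(weights, offsets):
                y = i + dy
                x = j + dx * stretch
                if y < 0:                        # passed over the North Pole:
                    y, x = -y - 1, x + n_lon / 2.0   # mirror lat, shift lon 180
                elif y >= n_lat:                 # passed over the South Pole
                    y, x = 2 * n_lat - 1 - y, x + n_lon / 2.0
                acc += w * field[y, int(round(x)) % n_lon]
            out[i, j] = acc
    return out
```

In the actual network this gather step is precomputed once as the sparse tensor L̂, so only a sparse matrix product and a 1-d convolution remain in each forward pass.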
With this, the "polar problem" of regular grids is eliminated.

Network architecture

We use a neural network architecture based on that proposed in Weyn et al. [2020], namely a U-net architecture. Weyn et al. however do not use data on a regular grid, but on a cubed sphere, consisting of several regular grids. In each of their convolution layers in the U-net, a standard convolution is made separately for each region, without sharing weights between the regions. We use the same architecture, but with each of their special convolution layers replaced by a standard convolution, our spherical convolution and/or hemisphere-wise convolution (see below). The network structure is shown in fig. 1 b). Our networks are implemented in tensorflow [Martín Abadi et al., 2015] using the dataset API with tensorflow record files, resulting in a scalable implementation that should also carry over to datasets with higher resolution than used here.

In addition to the input variables from ERA5 discussed in the data section, day of the year ("doy") and hour of the day ("hod") are used as additional input variables. Since these are "circular" variables, each of them is converted to two variables:

doy_sin = sin(2π · doy / 365)    (3)
doy_cos = cos(2π · doy / 365)    (4)
hod_sin = sin(2π · hod / 24)    (5)
hod_cos = cos(2π · hod / 24)    (6)

These 4 scalars are extended to the grid resolution of the data and added as additional channels. The input of the networks is comprised of two timesteps (2 variables at 2 pressure levels each), but the additional 4 variables are provided only once, resulting in 8+4=12 input channels. The output of the networks is 2 timesteps of the input variables, without the additional variables (thus 8 channels). One forecast step is made of two consecutive passes through the network, feeding the output back to the input, resulting in a 24-hour forecast. For details see Weyn et al. [2020]. For consecutive forecasts (longer than 24 hours), hod is not updated, since each forecast step is 24 hours.
We also choose not to update doy, since the forecast length of 10 days is very short compared to seasonal variations. Only for the long-term stability experiment in section 3.4 is doy updated with each consecutive forecast step.

Our base architecture without spherical convolution uses normal convolution. Along the longitude direction the convolution is "wrapped" around, so there is no artificial boundary. At the poles the grid is padded with zeros to make the output of the convolution operation the same size as the input. This is the same approach for dealing with the boundaries as in Weyn et al. [2019]. The kernel size of the convolutions is 3 × 3.

The spherical convolution architecture is the same as the base architecture, except that each convolution operation is replaced by a spherical convolution operation. Since the convolution deals both with the poles and with the longitude wrap, no padding is applied. In the standard spherical convolution architecture ("sphereconv") we use a 3 × 3 kernel; in a variant ("sphereconv_densekernel"), the kernel is sampled more densely in the latitude direction.

We use two related approaches for incorporating the fact that there are 2 hemispheres into the networks. In the first, we use separate (independent) convolution operations (with separate weights) for each hemisphere. The data is split at the equator. For the architecture without spherical convolution, the first row of the other hemisphere is added as padding for the boundary of the convolution. When using spherical convolution together with hemispheric convolution this is not necessary, as this is included in the interpolation for the spherical convolution. Then, on each hemisphere, a convolution operation is performed. This will be referred to as "hemconv" and "sphereconv_hemconv". In the second approach, the same convolution operation is used for both hemispheres, with the filter "flipped" along the lat dimension for the second hemisphere. This will be referred to as "hemconv_shared" and "sphereconv_hemconv_shared".
This approach is a variant of the inclusion of "invariances" into the neural network, in the terminology of von Rueden et al. [2020].
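The shared-flipped-weights idea can be illustrated with a minimal numpy sketch. This is a hypothetical stand-alone version, not the network code: it handles a single channel and uses zero padding at the hemisphere boundary rather than the cross-hemisphere padding described above:

```python
import numpy as np

def conv2d_wrap_lon(x, k):
    """3x3 cross-correlation with longitude wrap and zero padding in latitude."""
    n_lat, n_lon = x.shape
    xp = np.zeros((n_lat + 2, n_lon + 2))
    xp[1:-1, 1:-1] = x
    xp[1:-1, 0] = x[:, -1]   # wrap the longitude direction around
    xp[1:-1, -1] = x[:, 0]
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * xp[dy:dy + n_lat, dx:dx + n_lon]
    return out

def hemconv_shared(field, kernel):
    """Apply the same kernel to both hemispheres, flipped along the
    latitude dimension for the Southern Hemisphere."""
    n = field.shape[0] // 2
    north = conv2d_wrap_lon(field[:n], kernel)
    south = conv2d_wrap_lon(field[n:], kernel[::-1, :])
    return np.concatenate([north, south], axis=0)
```

The built-in invariance can be checked directly: if the Southern Hemisphere input is the mirror image of the Northern Hemisphere input, the output is mirror-symmetric as well, for any kernel.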
Training

Each network is trained 5 times with different random seeds to account for the randomness in the training. Each training realization is evaluated separately, and throughout the paper the average of the errors and skill scores is shown. The same architecture is used both for high- and low-resolution data; only the input size is adjusted according to the resolution. Since the architecture is a pure convolution architecture, the number of parameters (weights) is independent of the input size, and thus the same number of parameters is used for both hres and lres (336,040 for architectures with the same weights for both hemispheres, and 671,816 for the architectures with independent weights for each hemisphere). We train the networks first for 4 epochs (10 epochs for hres), and then for an additional 50 (20 for hres) epochs with early stopping (stopping after no increase in skill on the 10% of the training data left out for validation). The reason for fewer epochs for the hres data was solely to reduce computational time. The data from Weatherbench is converted to the tensorflow-record format.
Evaluation

We use both the Root Mean Square Error (RMSE) and the Anomaly Correlation Coefficient (ACC), which are also the two measures used in Weatherbench. RMSE is defined as
RMSE = sqrt( mean( (fc − truth)² ) )    (7)

with the mean representing a latitude-weighted area and time mean, and ACC as

ACC = corr( fc − clim, truth − clim )    (8)

with the correlation computed with latitude weights, and clim the time mean over all forecasts. For details of the calculations see Rasp et al. [2020].

Results

We start by looking at global average RMSE and ACC, shown in fig. 2-3 and table 1. The upper panels of the figures show absolute values, whereas the lower panels show the difference to the base architecture. The base architecture has the next-to-lowest skill at all lead times, with only sphereconv_densekernel performing worse (sphereconv_densekernel is omitted from the difference panels in order to better visualize the difference between the other architectures). Sphereconv does consistently better than the base architecture. This holds for all lead times, both resolutions and both for RMSE and ACC, except for day 1 on the lres data, where sphereconv is slightly worse than base. The improvement is however relatively small. Changing the kernel of the spherical convolution to the densekernel (higher kernel resolution in the latitude direction) degrades forecast skill to far below even the base architecture. Due to this negative result we decided not to test the (computationally expensive) dense kernel method on the high-resolution data.

The hemconv architecture is better than the base architecture, and mostly also better than the sphereconv architecture on the hres data, but on the lres data it is slightly worse than sphereconv at most lead times. The hemisphere convolution architecture with shared (flipped) weights (hemconv_shared) outperforms the hemisphere convolution architecture with independent weights. The same holds when combining spherical convolution with hemisphere-wise convolution. Both hemisphere methods lead to an additional improvement on top of the spherical convolution, but sharing the flipped weights leads to the best result.
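A minimal numpy sketch of the two latitude-weighted scores defined above (this follows our reading of the Weatherbench conventions; the exact evaluation code is described in Rasp et al. [2020]):

```python
import numpy as np

def lat_weights(lats_deg):
    """cos(lat) area weights, normalized to mean 1."""
    w = np.cos(np.radians(lats_deg))
    return w / w.mean()

def weighted_rmse(fc, truth, lats_deg):
    """Latitude-weighted RMSE over (time, lat, lon) arrays (eq. 7)."""
    w = lat_weights(lats_deg)[None, :, None]
    return np.sqrt((w * (fc - truth) ** 2).mean())

def weighted_acc(fc, truth, clim, lats_deg):
    """Latitude-weighted anomaly correlation coefficient (eq. 8)."""
    w = lat_weights(lats_deg)[None, :, None]
    fa, ta = fc - clim, truth - clim        # anomalies w.r.t. climatology
    num = (w * fa * ta).sum()
    den = np.sqrt((w * fa ** 2).sum() * (w * ta ** 2).sum())
    return num / den
```

A perfect forecast gives RMSE 0 and ACC 1; the cos(lat) weighting prevents the many small polar grid cells from dominating the global score.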
The architecture with spherical convolution and hemispheric convolution with shared weights (sphereconv_hemconv_shared) is the best of all architectures on lres for all lead times, and on hres up to day 4, with roughly equal contributions from the spherical convolution and from the splitting into hemispheres. Evaluating the hres forecasts on the lres grid instead of on the hres grid has only a small influence, with slightly lower errors than when evaluating on the hres data, but does not change the basic results (table 1).

We now turn to the spatial distribution of RMSE for z500 (fig. 4). Panel a) shows the error of the base networks at different lead times for the hres data, with increasing lead time from upper left to lower right. The error pattern follows the typical error patterns of medium-range NWP forecasts, with lowest predictability in the storm-track regions (e.g. Scher and Messori [2019a]). As expected, the error grows with increasing lead time, with no dramatic changes in the spatial patterns. More interesting is the difference between the sphereconv and the base architecture (panel b) and the difference between sphereconv_hemconv_shared and hemconv_shared (panel d). Up to forecast day 4, the spherical convolution clearly improves the forecasts around both poles. In the Northern Hemisphere, however, the sphereconv forecasts in the storm-track regions are slightly worse than the base architecture, and for longer lead times also the forecasts around the North Pole are worse with the spherical convolution. Results are similar for t850 (fig. S1), except that for t850 hres the longer forecasts around the North Pole have the same error both in the base architecture and in the sphereconv architecture.
                                        z500 day 3/5 [m²/s²]   t850 day 3/5 [K]
hres base                               575 / 863              2.88 / 3.92
hres sphereconv                         558 / 836              2.82 / 3.84
hres hemconv                            549 / 825              2.80 / 3.86
hres hemconv_shared                     516 / 781              2.71 / 3.74
hres sphereconv_hemconv                 525 / 818              2.76 / 3.87
hres sphereconv_hemconv_shared          496 / 800              2.68 / 3.79
hres base regrid                        572 / 859              2.8  / 3.85
hres sphereconv regrid                  555 / 832              2.74 / 3.78
hres hemconv regrid                     545 / 820              2.73 / 3.8
hres hemconv_shared regrid              512 / 777              2.63 / 3.68
hres sphereconv_hemconv regrid          522 / 814              2.68 / 3.81
hres sphereconv_hemconv_shared regrid   492 / 796              2.61 / 3.73
lres base                               539 / 834              2.8  / 3.87
lres sphereconv                         527 / 803              2.75 / 3.8
lres hemconv                            526 / 816              2.77 / 3.86
lres hemconv_shared                     501 / 795              2.66 / 3.75
lres sphereconv_hemconv                 502 / 782              2.74 / 3.82
lres sphereconv_hemconv_shared          454 / 734              2.61 / 3.69
sphereconv_densekernel                  687 / 1043             3.5  / 5.05
IFS T42                                 489 / 743              3.09 / 3.83
IFS T63                                 268 / 463              1.85 / 2.52
Operational IFS                         154 / 334              1.36 / 2.03

Table 1: Baseline scores (RMSE) on the Weatherbench dataset, including NWP model scores as baselines.

Finally, panel c) compares hemconv_shared with the base architecture. Here the improvements in the first days of the forecasts are more spatially uniform. Interestingly, from day 5 onward there is a deterioration around both poles in the hemconv_shared architecture compared to the base architecture.

Long-term stability

We next assess the long-term stability of forecasts for lead times beyond 10 days. For this, we start a forecast from a date in January 2017 and perform iterative forecasts for a whole year. The doy input is updated every day to account for the seasonality. This is repeated with several January dates from 2017. The result for one starting date and one training realization is shown in fig. 5. The figure shows z500 of the 1-year forecast made by the hemconv_shared architecture (panel a), by the sphereconv_hemconv_shared architecture (panel b), and the corresponding analysis from ERA5 (panel c). As can be seen, the yearly cycle of the forecast is highly unrealistic.
This is the same for other starting dates (not shown). Additionally, some training realizations produce even more unrealistic long-term forecasts (not shown). This might indicate that doy as a boundary condition is not sufficient to reproduce a good yearly cycle, and/or that the networks introduce substantial non-physical errors that accumulate over time.

Figure 2: Forecast skill (RMSE and ACC) for all hres (1.4°) architectures for geopotential at 500 hPa and temperature at 850 hPa. a,b: absolute values; c,d: difference to the base architecture.

Figure 3: Forecast skill (RMSE and ACC) for all lres (2.8°) architectures for geopotential at 500 hPa and temperature at 850 hPa. a,b: absolute values; c,d: difference to the base architecture. In c,d sphereconv_densekernel is omitted in order to better visualize the differences between the other methods.

Figure 4: a) RMSE of geopotential at 500 hPa [m²/s²] of the base architecture. b) difference in RMSE between sphereconv and base. c) difference between hemconv_shared and base. d) difference between sphereconv_hemconv_shared and hemconv_shared.

Forecast busts

We now look at the forecasts within the upper 5% of RMSE (forecast "busts") for the Northern Hemisphere (NH) at a lead time of 3 days for the sphereconv_hemconv_shared architecture (the architecture with the highest forecast skill in the first days of the forecast). For each of the 5 training realizations, the percentile is computed individually. When comparing the initialization dates of the worst forecasts, ~20% are exactly the same dates for all members, ~36% occur in at least 4 of the 5 members, and ~50% in at least 3 of the 5 members (fig. 6 a). This is much higher than expected by chance if the events were randomly distributed. Events with high error are more common in winter than in summer (fig. 6 b), in line with the performance of operational NWP models. Finally, fig.
6 c) shows a composite of the z500 anomaly for all initial times at which at least one of the sphereconv_hemconv hres networks had an error above the
95th percentile in the NH. The anomaly is computed with respect to the mean over 2016-2017 (the evaluation period), and separately for each month. There are positive anomalies east of Greenland and in the middle of the northern Pacific, and (less pronounced) negative anomalies over northern Canada, the eastern coast of Asia and over central Europe. These results are similar for other architectures (fig. S2-S4), and for other lead times (fig. S5-S8). This indicates that the skill of the network forecasts depends on the atmospheric configuration, just as in NWP forecasts (e.g. Ferranti et al. [2015]).
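The member-overlap statistic used above can be computed along the following lines (a hypothetical sketch; the exact normalization of the fractions in the actual analysis is an assumption here):

```python
import numpy as np

def bust_overlap(errors, q=95):
    """errors: array of shape (members, dates) with per-forecast RMSE.

    Marks, per member, the dates above that member's q-th error
    percentile, and returns for each n the fraction of bust dates that
    are busts in at least n members (normalized by the number of dates
    that are a bust for at least one member).
    """
    n_mem = errors.shape[0]
    thresh = np.percentile(errors, q, axis=1, keepdims=True)
    busts = errors > thresh                  # boolean (members, dates)
    counts = busts.sum(axis=0)               # members calling each date a bust
    any_bust = (counts >= 1).sum()
    return {n: (counts >= n).sum() / any_bust for n in range(1, n_mem + 1)}
```

For five members with identical errors, every bust date is shared by all five, so the fraction for n = 5 is 1; for independently scattered busts it would be near 0.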
Computational cost

Replacing standard convolution with spherical convolution introduces a significant amount of additional computation. While the use of sparse tensorflow tensors for the interpolation tensor L̂ avoids large memory requirements, the computation time of the sphereconv network for the hres data compared to the base architecture is roughly a factor 6 higher on a CPU with 2 cores (14 s vs 2.4 s), and a factor of 20 higher on an NVIDIA Tesla V100 GPU (1.0 s vs 50 ms). Using hemisphere-wise convolution does not introduce any significant performance overhead (~4% with shared weights, none at all with separate weights).

Figure 5: Evolution of a single long-term forecast for hemconv_shared (a) and sphereconv_hemconv_shared (b), and the verifying reality (c). All panels show 500 hPa geopotential.

Figure 6: a) fraction of dates with extreme forecast error (>95th percentile) that occur in at least n_common out of 5 members for sphereconv_hemconv_shared. b) yearly cycle of dates with at least one member with extreme forecast error. All forecasts at day 3 with hres data. c) composite mean of z500 for all days where at least 1 member has a bust, shown as anomaly with respect to the deseasonalized time mean of 2016-2018.

Discussion and conclusions

In this paper we have tested two approaches to improve data-driven weather forecasts with CNNs. Firstly, we have tested replacing standard convolution operations with exact spherical convolution. Secondly, we tested integrating basic meteorological knowledge into the structure of the networks, namely that the dynamics of one hemisphere are "flipped" with respect to the other hemisphere. This we have hardcoded into our networks by flipping the weights of the network. In a variant of this method, we have also used independent weights for each hemisphere. These methods (and combinations of them) were tested on the ERA5 data from the Weatherbench dataset [Rasp et al., 2020]. We used a neural network architecture previously proposed by Weyn et al. [2020], and adapted it to our convolution methods.
We found that both the spherical convolution and the hemisphere information improve the forecasts, but in different ways. Spherical convolution mainly leads to improvements close to the poles, and less so in other regions. The hemisphere-specific information instead leads to relatively uniform improvements in the mid-latitudes, where the largest forecast errors appear. For the first couple of days, combining spherical convolution with hemisphere-wise convolution with shared flipped weights leads to the best forecasts. Compared to the base architecture, spherical convolution and hemisphere-wise convolution contribute roughly equally to the forecast improvement.

Finally, we have found that the initial conditions causing the largest forecast errors are relatively consistent across different training realizations of the same network. This indicates, as one would expect, that errors of neural network weather forecasts are not completely random, but are controlled at least partly by the intrinsic predictability of the atmospheric state (and by the error in the analysis product used).

Our approach innovates over previous studies in the field in several respects. Weyn et al. [2020] have split up the world into a couple of regions, with each region being represented by a local grid. On these local grids they used standard convolution operations. This method still leads to distortions, as even a subregion of the Earth's surface cannot be represented on a local regular grid with complete accuracy. In addition, this method also needs padding, which at the edges is ambiguous. Weyn et al. [2020] have only tested a configuration where the weights are not shared between the different regions (except between the two polar regions, where they use the same but "flipped" weights for the second pole). Therefore, it is not possible to disentangle the effect of local convolutions and the effect of dealing with the spherical nature of the Earth. Finally, Weyn et al.
[2020] have used different input variables (for example, they have also included radiation), which might explain their higher skill and long-term stability compared to the results here. From a practical point of view, there are also differences. The method of Weyn et al. needs data preprocessing (regridding), but can then use standard neural network operations. In our approach, the standard data can be used, but the spherical convolution introduces computational overhead in every pass through the network.

The fact that the additional relative runtime needed for the networks with spherical convolution is higher on a GPU than on a CPU could indicate that the implementation using sparse tensorflow tensors is not optimized for GPUs. For small input sizes, an alternative would be to use standard tensorflow tensors (still filled sparsely, but represented as a full tensor (array) in memory), but for full-resolution ERA5 data (0.25° resolution, 3600x1801 gridpoints on a regular lat-lon grid) this would not be feasible on current computers due to memory limitations, since the interpolation tensor would then have a size of 6483600x6483600.

The aim of our study was not to provide the best possible neural network based weather forecasts, but to assess the effect of two specific changes to a neural network architecture. Still, possible improvements to the methods presented here could be:

• splitting up the convolution into smaller regions (similar to Weyn et al. [2020])
• including locally connected layers (where each gridpoint in a layer is a combination of the inputs from a certain kernel, e.g. 3 × 3, but with weights not shared between gridpoints)
• adding more prior information into the structure of the network, for example the vertical structure of the atmosphere.

Author contributions
S.S. designed and implemented the methods of the paper and drafted the manuscript. Both authors discussed and interpreted the results and improved the manuscript.
References
V. Balaji. Climbing down Charney's ladder: Machine Learning and the post-Dennard era of computational climate science. arXiv:2005.11862 [nlin, physics:physics], May 2020.

Peter Bauer, Alan Thorpe, and Gilbert Brunet. The quiet revolution of numerical weather prediction. Nature, 525(7567):47–55, September 2015. ISSN 1476-4687. doi: 10.1038/nature14956.

Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, Lecture Notes in Computer Science, pages 525–541, Cham, 2018. Springer International Publishing. ISBN 978-3-030-01240-3.

Peter D. Dueben and Peter Bauer. Challenges and design choices for global weather and climate models based on machine learning. Geoscientific Model Development, 11(10):3999–4009, October 2018. ISSN 1991-959X. doi: 10.5194/gmd-11-3999-2018.

Davide Faranda, M. Vrac, P. Yiou, F. M. E. Pons, A. Hamid, G. Carella, C. G. Ngoungue Langue, S. Thao, and V. Gautard. Boosting performance in machine learning of geophysical flows via scale separation. June 2020.

Laura Ferranti, Susanna Corti, and Martin Janousek. Flow-dependent verification of the ECMWF ensemble over the Euro-Atlantic sector. Quarterly Journal of the Royal Meteorological Society, 141(688):916–924, April 2015. ISSN 1477-870X. doi: 10.1002/qj.2411.

Samuel P. Lillo and David B. Parsons. Investigating the dynamics of error growth in ECMWF medium-range forecast busts. Quarterly Journal of the Royal Meteorological Society, 143(704):1211–1226, 2017. ISSN 1477-870X. doi: 10.1002/qj.2938.

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015.

Stephan Rasp, Peter D. Dueben, Sebastian Scher, Jonathan A. Weyn, Soukayna Mouatadid, and Nils Thuerey. WeatherBench: A benchmark dataset for data-driven weather forecasting. arXiv:2002.00469 [physics, stat], June 2020.

Mark J. Rodwell, Linus Magnusson, Peter Bauer, Peter Bechtold, Massimo Bonavita, Carla Cardinali, Michail Diamantakis, Paul Earnshaw, Antonio Garcia-Mendez, Lars Isaksen, Erland Källén, Daniel Klocke, Philippe Lopez, Tony McNally, Anders Persson, Fernando Prates, and Nils Wedi. Characteristics of Occasional Poor Medium-Range Weather Forecasts for Europe. Bulletin of the American Meteorological Society, 94(9):1393–1405, September 2013. ISSN 0003-0007. doi: 10.1175/BAMS-D-12-00099.1.

S. Scher. Toward Data-Driven Weather and Climate Forecasting: Approximating a Simple General Circulation Model With Deep Learning. Geophysical Research Letters, 0(0), November 2018. ISSN 0094-8276. doi: 10.1029/2018GL080704.

S. Scher and G. Messori. How Global Warming Changes the Difficulty of Synoptic Weather Forecasting. Geophysical Research Letters, 46(5):2931–2939, 2019a. ISSN 1944-8007. doi: 10.1029/2018GL081856.

Sebastian Scher and Gabriele Messori. Weather and climate forecasting with neural networks: Using GCMs with different complexity as study-ground. Geoscientific Model Development Discussions, pages 1–15, March 2019b. ISSN 1991-959X. doi: 10.5194/gmd-2019-53.

Sebastian Scher and Gabriele Messori. Ensemble neural network forecasts with singular value decomposition. arXiv:2002.05398 [physics], February 2020.

Laura von Rueden, Sebastian Mayer, Katharina Beckh, Bogdan Georgiev, Sven Giesselbach, Raoul Heese, Birgit Kirsch, Julius Pfrommer, Annika Pick, Rajkumar Ramamurthy, Michal Walczak, Jochen Garcke, Christian Bauckhage, and Jannis Schuecker. Informed Machine Learning – A Taxonomy and Survey of Integrating Knowledge into Learning Systems. arXiv:1903.12394 [cs, stat], February 2020.

Jonathan A. Weyn, Dale R. Durran, and Rich Caruana. Can Machines Learn to Predict Weather? Using Deep Learning to Predict Gridded 500-hPa Geopotential Height From Historical Weather Data.