RESEARCH ARTICLE

A Comparative Study of Convolutional Neural Network Models for Wind Field Downscaling

Kevin Höhlein | Michael Kern | Timothy Hewson | Rüdiger Westermann

TUM Department of Informatics, Technical University of Munich, Garching, Germany
European Centre for Medium-Range Weather Forecasts, Reading, UK
Correspondence
Kevin Höhlein, TUM Department of Informatics, Technical University of Munich, Garching, DE-85748, Germany
Email: [email protected]
Funding information
Deutsche Forschungsgemeinschaft, CRC/Transregio 165, Waves to Weather
Abstract

We analyze the applicability of convolutional neural network (CNN) architectures for downscaling of short-range forecasts of near-surface winds on extended spatial domains. Short-range wind field forecasts (at the 100 m level) from ECMWF ERA5 reanalysis initial conditions at 31 km horizontal resolution are downscaled to mimic HRES (deterministic) short-range forecasts at 9 km resolution. We evaluate the downscaling quality of four exemplary model architectures and compare these against a multi-linear regression model. We conduct a qualitative and quantitative comparison of model predictions and examine whether the predictive skill of CNNs can be enhanced by incorporating additional atmospheric variables, such as geopotential height and forecast surface roughness, or static high-resolution fields, like land-sea mask and topography. We further propose DeepRU, a novel U-Net-based CNN architecture, which is able to infer situation-dependent wind structures that cannot be reconstructed by other models. Inferring a target 9 km resolution wind field from the low-resolution input fields over the Alpine area takes less than 10 milliseconds on our GPU target architecture, which compares favorably to an overhead in simulation time of minutes or hours between low- and high-resolution forecast simulations.
Keywords — statistical downscaling, wind field simulation, deep learning, convolutional neural network (CNN)

1 | INTRODUCTION AND CONTRIBUTION
Accurate prediction of near-surface wind fields is a topic of central interest in various fields of science and industry. Severe memory and performance costs of numerical weather simulations, however, limit the availability of fine-scale (high-resolution) predictions, especially when forecast data is required for extended spatial domains. While running global reanalyses and forecasts with a spatial resolution of around 30 km is computationally affordable (e.g., Hersbach et al., 2020), these models are unable to accurately reproduce wind climatology in regions with complex orography, such as mountain ranges. Since wind speed and direction are determined by localized interactions between air flow and surface topography, with sometimes the added complication of thermal forcing, accurate numerical simulation requires information on significantly finer length scales, particularly in regions that are topographically complex. For instance, (sub-grid-scale) topographic features such as steep slopes, valleys, mountain ridges or cliffs may induce wind shear, turbulence, acceleration and deceleration patterns that cannot be resolved by global models that lack information on these factors. Moreover, meteorologically relevant factors such as vertical stability, snow cover, or the presence of nearby lakes, river beds, or sea can strongly influence local wind conditions (e.g., McQueen et al., 1995; Holtslag et al., 2013). In these regions, finer-resolution regional numerical models, with grid spacings of order kilometers or less, need to be applied in order to obtain reliable low-level winds (e.g., Salvador et al., 1999; Mass et al., 2002).

FIGURE 1 Wind field on December 05, 2018 at 12:00 UTC. Left: Low-resolution simulation based on ERA5 reanalysis data. Middle: High-resolution simulation based on HRES. Right: Prediction from the low-resolution field by our proposed convolutional neural network DeepRU. Streamlines are color-coded with wind magnitude. (a): Coastal region enclosing the French Riviera and Corsica. (b): Highly varying winds over part of the Swiss Alps.

One approach to circumvent costly high-resolution simulations over extended spatial scales is known as downscaling, i.e. inferring information on physical quantities at local scale from readily available low-resolution simulation data using suitable refinement processes. Downscaling is a long-standing topic of interest in many scientific disciplines, and in particular in meteorological research there exists a large variety of methods to downscale physical parameters. Such methods can be broadly
classified into dynamical and empirical-statistical approaches (e.g., Hewitson and Crane, 1996; Rummukainen, 1997; Wilby and Wigley, 1997).

In dynamical downscaling (e.g., Rummukainen, 2010; Radić and Clarke, 2011; Xue et al., 2014; Kotlarski et al., 2014; Räisänen et al., 2004), high-resolution numerical models are used over limited sub-domains of the area of interest, and numerical model outputs on coarser scales provide boundary conditions for the simulations on the finer scale. While the restricted size of the model domain leads to a significant reduction of computational costs compared to global-domain simulations, dynamical downscaling still remains computationally demanding and time-consuming.

Statistical downscaling, on the other hand, aims to avoid simulation at the finer scales by using a coarse-scale simulation (referred to as predictor data) to infer predictions at fine scale (referred to as predictand data). Correlations between the quantities at fine and coarse scales are learned by training statistical models on a set of known predictor-predictand data pairs.

Over time, a large number of empirical-statistical downscaling approaches have been developed, which apply statistical regression methods for downscaling purposes, like (generalized) multi-linear regression methods (e.g., Chandler, 2005) or quantile mapping approaches (e.g., Wood et al., 2004). With recent developments in data-driven machine learning and computer science, however, more powerful modeling techniques have become available, which may have the potential to outperform previous methods in terms of both accuracy and efficiency. Only a few studies have examined the use of non-linear regression methods or more recent non-classical machine learning techniques (e.g., Eccel et al., 2007; Gaitan et al., 2014; Vandal et al., 2019). Specifically, the extent to which non-linear machine-learning approaches can provide additional value over classical methods is a question that has not yet been answered conclusively.

Deep-learning methods are among the most prominent examples of state-of-the-art machine-learning techniques (e.g., LeCun et al., 2015; Goodfellow et al., 2016). In particular, convolutional neural networks (CNNs) have found manifold application in complex image processing and understanding tasks (e.g., Guo et al., 2016; Yang et al., 2019). One of these is single-image super-resolution, i.e. the generation of high-resolution images from low-resolution images (e.g., Yang et al., 2019), which, formally, can be thought of as a very similar task to downscaling of climate variables.

CNNs rely on expressing regression models that operate on an extended spatial domain as a set of localized linear models (localized filter kernels), which are applied repeatedly at varying spatial positions across the domain through convolution operations. The restriction of the model parameterization to local filter kernels effectively limits the number of trainable parameters, and thus reduces the tendency of the model to overfit spurious patterns in the data, while increasing model efficiency. While also applicable to irregular graph-based data structures (Kipf and Welling, 2016), e.g., data defined on irregular grids, CNNs work most effectively with regular-gridded data in multi-dimensional array representations, facilitating efficient parallel computation of optimization tasks on GPU-based compute hardware.
Computational efficiency through parallelization is one of the major selling points of CNNs and should be considered as an important aspect during model design and data preparation. Furthermore, more complex mappings can be learned by stacking multiple layers of convolution operations (increasing the depth of the models) and applying these successively, to generate more abstract feature representations. Similar to standard artificial neural networks (ANNs), applying non-linear activation functions between successive convolution layers enables the model to learn non-linear mappings. Beyond purely sequential feature processing, more elaborate model design patterns, like skip connections between pairs of convolution layers (Srivastava et al., 2015), residual learning (e.g., He et al., 2016), or changes in the spatial resolution of internal feature representations (e.g., Ronneberger et al., 2015), can be leveraged to improve model performance.

CNNs are, thus, particularly well-suited for learning tasks involving spatially distributed data, which are often encountered in meteorology. Though CNN-based model architectures are increasingly adopted also in earth-system sciences (e.g., Shen, 2018; Reichstein et al., 2019; Vannitsem et al., 2020), their usage for downscaling applications has rarely been discussed (e.g., Vandal et al., 2018; Baño-Medina et al., 2019). In particular, earlier studies focused on simple CNN architectures, which do not make use of recent model design patterns, and thus do not exploit the full potential of state-of-the-art CNN architectures.

1.1 | Contribution
In this work, we perform a study of fully-convolutional neural network architectures for statistical downscaling of near-surface wind vector fields. The results are compared to those obtained by a multi-linear regression model, both w.r.t. quality and performance. We train models to predict the most likely outcome of a high-resolution simulation of near-surface winds 100 m above ground, based on low-resolution short-range wind-field forecasts as primary predictors. The data are defined on irregular octahedral and triangular reduced Gaussian grids with 9 km and 31 km horizontal resolution, respectively. To enable efficient processing of the data with CNNs, and to avoid destroying local detail via interpolation, the data are mapped to regular grids through suitable padding. We view this work as an initial 'proof of concept' step, to pave the way to using finer resolutions for both predictor and predictand. If the predictand scale could reach 1 or 2 km, we would envisage a much greater range of practical applications emerging.

We compare the capabilities of different existing models, which reflect varying degrees of model complexity and elaboration. Starting with a multi-linear regression model and a light-weight linear convolutional model, we continue the comparison with non-linear convolutional models of increasing complexity. By incorporating beneficial design patterns identified beforehand, in combination with adaptions in architectural design and training methodology, we propose DeepRU – a U-Net-based CNN model that improves on the reconstruction quality of existing architectures.

For all models, we analyze whether incorporating additional climate variables and high-resolution topography, like surface altitude and land-sea mask, improves the network's inference capabilities. We further train the models on sub-regions of the domain, to avoid learning relationships between low- and high-resolution winds purely based on geographic location, i.e., to avoid overfitting to a particular domain. The reconstruction quality of all downscaling models is compared to the high-resolution simulations of real-world weather situations for a topographically complex region in central and southern Europe for the period between March 2016 and September 2019 (Figure 1). Our key finding is that thought-out architecture design and appropriate model tuning enable network-based downscaling methods to efficiently generate high-resolution wind fields in which local and global scale structures are reproduced with high fidelity.

To further analyze the usability of network-based downscaling, the relationships between model complexity, network performance, and computational requirements such as memory consumption and prediction time are evaluated. We show how the model depth as well as the used design patterns, i.e. residual connections across successive convolution layers and U-shaped encoder-decoder architectures, are leveraged to balance between model complexity and prediction quality.

2 | RELATED WORK

2.1 | Empirical-Statistical Downscaling
In describing downscaling options available at the time, Wilby and Wigley (1997) distinguish between regression methods, weather typing approaches and stochastic weather generators. Regression-based methods build upon the construction of parametric models, which are trained in an optimization procedure to establish a transfer function between low-resolution predictor variables and high-resolution predictands. Weather typing approaches, in contrast, rely on finding a suitable match between a set of predictor values and predictor value sets contained in the training data, in order to select the most appropriate weather pattern analogue (e.g., Zorita and von Storch, 1999). Stochastic weather generators provide a probabilistic approach and are trained to replicate spatio-temporal sample statistics, as implied by the training data (e.g., Wilks, 2010, 2012).

A comprehensive review and comparison of empirical-statistical models for downscaling climate variables has been conducted by Maraun et al. (2015), Gutiérrez et al. (2019) and Maraun et al. (2019), who showed that many of the approaches perform generally well, but leave space for improvement. For instance, realistic replication of spatial variability in the high-
resolution predictand variables remains a major challenge for many of the models (Maraun et al., 2019).

Specifically addressing the problem of wind-field downscaling and forecasting, Pryor (2005) and Michelangeli et al. (2009) proposed distribution-based approaches for wind-field inference, and Huang et al. (2015) proposed a physical-statistical hybrid method for downscaling.

The question of which methods provide additional value over classical approaches has only been addressed by a number of smaller model-comparison studies – with varying results. While Eccel et al. (2007), Mao and Monahan (2018) and Vandal et al. (2019) found hardly any or no advantage in applying non-classical machine-learning methods, Gaitan et al. (2014) showed non-classical methods outperforming classical ones, with artificial neural networks being a particular method example. More recently, Buzzi et al. (2019) used neural networks for nowcasting wind in the Swiss Alps, and achieved very skillful models. These apparently contradictory findings raise the question of when, and under what conditions, deep learning methods can be profitably employed for downscaling.

Within meteorology, only a small number of studies have dealt with using CNNs for downscaling applications. For example, Vandal et al. (2018) proposed "DeepSD", a simple convolutional neural network for downscaling precipitation over extended spatial domains, and more recently, Baño-Medina et al. (2019) studied the performance of a set of convolutional neural networks for downscaling temperature and precipitation over Europe. Pan et al. (2019) proposed a similar architecture, again with a focus on precipitation.

While the influence of model complexity has been examined by Baño-Medina et al. (2019) in terms of model depth, i.e. the number of convolution layers, the models in use did not exploit recent design patterns, like skip or residual connections (e.g., Srivastava et al., 2015; He et al., 2016) or the fully-convolutional U-Net-like architecture (Ronneberger et al., 2015), which enable network models to achieve state-of-the-art results in computer vision tasks.

2.2 | Single-Image Super-Resolution
Computer vision, being the origin of a large number of technological developments in machine learning, provides a problem setting which is closely related to downscaling in meteorology and climatology – single-image super-resolution. There, the goal is to identify mappings which allow for increasing the resolution of single low-resolution input images, while maintaining visual quality and avoiding pixel artifacts and blurriness. Within this context, the use of deep learning has led to remarkable improvements compared to standard statistical models (e.g., Yang et al., 2019). Especially CNNs were found to be particularly successful (e.g., Dong et al., 2014, 2016; Sajjadi et al., 2017).

3 | TRAINING DATA
For model training and evaluation, we use short-range weather forecast data, which include near-surface wind field simulations at different scales. The data are taken from the ECMWF (European Centre for Medium-Range Weather Forecasts) MARS (Meteorological Archival and Retrieval System) archive (Maass, 2019), and cover a spatial domain in central and southern Europe.

3.1 | Domain Description
The training domain is restricted to 40°–50°N and 0°–20°E (see Figure 2 (a)), and comprises sub-regions with varying orographic properties. Specifically, the domain contains the high mountains of the Alps, some smaller mountain ranges in central Europe, flat areas in France, parts of the Mediterranean Sea, and southwest-facing coastal regions of the Adriatic, to confront the employed models with challenging scenarios where winds are highly influenced by the topography. Especially in the Dinaric Alps, situated in the eastern part of the domain, topographically forced gap flows are known to be an important phenomenon (e.g., Lee et al., 2005; Belušić et al., 2013). Significant differences between the low- and high-resolution numerical simulation results are most commonly observed in and around mountain ranges and coastlines, leading to the question of whether downscaling techniques can learn these differences and accurately predict the high-resolution fields from the low-resolution versions.
FIGURE 2 (a) Map of the surface topography in Europe representing the data domain. (b) Low-resolution (N320) and high-resolution octahedral Gaussian simulation grid (O1280) used by ERA5 and HRES, respectively. Over our domain the high-resolution grid comprises about 3 times more grid points in longitude and about 4 times more in latitude.

3.2 | Low- and High-Resolution Simulations
As "low-resolution" input to our models, we use data derived from the ERA5 reanalysis product suite (Hersbach et al., 2020). ERA5 is the fifth in the series of ECMWF global reanalyses, and provides estimates of the 3-dimensional global atmospheric state (climate) over time, based on a four-dimensional variational (4D-Var) data assimilation of past observations into a recent version of the operational ECMWF numerical forecast model. Output is provided on a reduced Gaussian grid with a horizontal resolution of 31 km (about 0.28°). In this study we use hourly forecast fields, from data times of 06:00 and 18:00 UTC, at time steps of T+1, 2, ..., 12 h. We use these short-range forecasts instead of the true reanalysis fields to avoid systematic small jumps in low-level winds seen in the latter at 09:00 and 21:00 UTC (documented in Hersbach et al., 2020).

The higher-resolution target dataset was provided by operational short-range forecasts from ECMWF's HRES (High RESolution) model, also at hourly intervals, initialized twice per day. HRES is a component of the ECMWF Integrated Forecast System (IFS) that can provide relatively accurate forecast products into the medium range (ECMWF, 2017). HRES is the highest-resolution model available at ECMWF (∼9 km horizontal resolution). We use time steps of T+7, 8, 9, ..., 18 h from the 00:00 UTC and 12:00 UTC runs. These were chosen as a compromise between being long enough to reduce any contamination from model spin-up, and short enough to retain forecast accuracy. The different spatial resolutions of ERA5 and HRES are illustrated in Figure 2 (b).

Products for HRES on the O1280 grid were first introduced operationally in March 2016, and so are only available from that point onwards. Therefore, we restrict our analysis to time periods between March 2016 and October 2019.
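For readers who want to reproduce the data basis, the sketch below shows how ERA5 100 m wind components over the study domain could be pulled from the Climate Data Store with the cdsapi client. This is an illustrative alternative access path: the study itself retrieved short-range forecast steps from the MARS archive, which uses a different request syntax, and the dataset and variable names here follow the public CDS catalogue rather than the authors' scripts.

```python
import cdsapi  # pip install cdsapi; requires a (free) CDS API key

# Minimal sketch: fetch ERA5 100 m wind components over the study domain.
client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": ["100m_u_component_of_wind", "100m_v_component_of_wind"],
        "year": "2018",
        "month": "12",
        "day": "05",
        "time": "12:00",
        "area": [50, 0, 40, 20],  # N, W, S, E: the 40-50N, 0-20E study domain
        "format": "grib",
    },
    "era5_100m_wind.grib",
)
```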
3.3 | Predictor and Predictand Variables
Both the low-resolution predictors and the high-resolution predictands provide two wind variables, which contain spatio-temporal information on the horizontal wind components 100 m above ground. The wind variables are denoted by U (zonal wind) and V (meridional wind). At the same locations (i.e. grid points), land surface elevation (altitude, ALT) and a binary land-sea mask (LSM) are available in low- and high-resolution variants. These are used as static predictors.

From the low-resolution dataset, supplementary predictor variables are obtained and used as dynamical, i.e., time-varying, predictors. The additional variables were manually selected according to the following considerations:

• Boundary layer height (BLH) is a model diagnostic that describes the vertical extent of the lowest layer of the atmosphere within which interactions take place between the Earth's surface and the atmosphere (Stull, 2017). Its value typically ranges between about 0.3 and 3 km, and it is essentially a metric for low-level stability, with larger values implying deeper layers of instability-driven mixing. Earlier studies (e.g., Holtslag et al., 2013) found that boundary-layer effects can have a significant impact on model performance in numerical temperature and wind predictions. Therefore, BLH may encode information that affects the matching between the low- and high-resolution variants. Also, BLH can provide the model with information about diurnal cycles. For these various reasons there was clear potential for this standard model output variable to be a useful predictor.

• Forecast surface roughness (FSR) denotes the surface roughness as represented in the forecast, and thereby provides information on the frictional retardation of the near-surface airflow. Contributory factors are vegetation types and land cover like soil or snow. The only dynamic component in the ECMWF modelling architecture is snow cover; other aspects are fixed year-round. We expected a small but direct impact from the snow cover.

• Geopotential height at 500 hPa (Z500) designates the elevation of the 500 hPa pressure level above mean sea level, and typically has values around 5,500 m. At this height, the pressure gradients and Coriolis force are typically in balance and winds are roughly parallel to Z500-isolines (see e.g. geostrophic winds in Wallace and Hobbs, 2006). Fields of Z500 very commonly serve forecasters as a proxy for the general atmospheric flow structure and indeed the synoptic pattern. So on the one hand one might expect a link with near-surface winds, but on the other the level is so far from the surface that it is unlikely to be a good predictor of local winds. This variable was partly included as a test of the veracity of our results. Even though on physical grounds we did not, overall, expect strong predictive skill from this variable, our results indicate an apparent influence on the inferred fields.

3.4 | Data Padding
The training data obtained from MARS are defined on irregular grids where the number of grid nodes per latitude decreases with increasing latitude. As CNNs require the input data as multi-dimensional data arrays, the data need to be resampled on a regular grid structure. Since resampling using interpolation can smooth out and even remove relevant structures, the initial data are instead copied into rectangular 2D grids and padded appropriately. Therefore, the maximum number of longitudes, attained at the latitude nearest to the equator, is computed, and new points are padded to the remaining latitudes for each grid (cf. Figure 3). This approach preserves the spatial adjacency of grid nodes for a large proportion of the nodes, which is important to facilitate proper learning of spatial correlations. The true distance between grid nodes in world space is, however, ignored in the training process. The padded points are marked in a binary mask, which is passed to the objective function during network training to distinguish between valid and padded values in the loss computation.

The padding scheme is chosen based on the fact that convolutional neural networks do not only take into account neighborhood relations but also relative changes of neighboring values. Zero-padding, which may cause steep gradients between neighboring values, is thus deemed unsuitable, and replaced by replication padding using the values of the boundary grid points of the valid domain.

FIGURE 3 Example of padding and masking used to resample the initial (low-resolution) data from an irregular Gaussian grid to a Cartesian grid. Blue cells indicate the data points of the gridded wind field. The interior of the data domain is shown in light-blue, boundary points are drawn in dark-blue, and their values are represented by numbers. A regular grid is achieved by padding new data points to the grid (light-red cells) while replicating the corresponding boundary values.

The initial low- and high-resolution data, with respectively 1918 and 20416 grid points on irregular grids, are mapped to regular grids of size 36 × 60 and 144 × 180 in latitude and longitude directions. This results in an increase in the number of grid points by a factor of 12 between low-resolution and high-resolution grids, which reflects the actual difference in resolution between ERA5 and HRES simulations (see Figure 2).
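The following NumPy sketch illustrates the padding and masking procedure. The replication of boundary values and the binary validity mask follow the description above, while the even split of padded points between both ends of each row is our own assumption.

```python
import numpy as np

def pad_reduced_gaussian(rows):
    """Map a reduced Gaussian field (one array per latitude, unequal lengths)
    to a regular 2D array via replication padding, plus a validity mask.

    How the padded points are distributed along each row (here: split evenly
    between both row ends) is an illustrative assumption; the replication of
    boundary values and the binary mask follow the procedure in the text.
    """
    n_lon = max(len(r) for r in rows)          # longitudes at the widest latitude
    field = np.empty((len(rows), n_lon))
    mask = np.zeros((len(rows), n_lon), dtype=bool)
    for i, row in enumerate(rows):
        pad = n_lon - len(row)
        left, right = pad // 2, pad - pad // 2
        # replicate the boundary values of the valid domain (no zero-padding)
        field[i] = np.pad(np.asarray(row, float), (left, right), mode="edge")
        mask[i, left:n_lon - right] = True     # True marks valid (unpadded) nodes
    return field, mask

# Example: three latitude rows with decreasing numbers of grid points
field, mask = pad_reduced_gaussian([[1, 2, 3, 4, 5], [1, 2, 3, 4], [1, 2, 3]])
```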
3.5 | Data Scaling

Before training, the padded data are standardized by subtracting the sample mean and dividing by the sample standard deviation. Standardization has proven useful in machine learning for improving the stability and convergence time of non-linear optimization methods (e.g., Ioffe and Szegedy, 2015; Srivastava et al., 2014). For time-dependent predictors, sample mean and standard deviation were computed node-wise from the snapshot statistics of the respective training datasets. Node-wise scaling is preferred over global domain scaling as spatial inhomogeneities are reduced, which we found to improve the downscaling results in our experiments. For static predictors, mean and standard deviation were computed from domain statistics. For sample standard deviations, we considered the unbiased ensemble estimate. Validation data are transformed accordingly before processing.

Standardization is also performed for the predictand variables. We found this useful due to strong differences in average wind speeds between coast or sea sites and mountain ranges. Further details are discussed in Sect. 5.
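A minimal sketch of the node-wise standardization, assuming fields stored as (time, lat, lon) arrays; statistics are computed from the training snapshots only and use the unbiased standard deviation estimate, and the inverse transformation recovers true-scale values for evaluation:

```python
import numpy as np

def standardize_nodewise(train, valid):
    """Node-wise standardization of time-varying fields.

    train, valid: arrays of shape (time, lat, lon). Statistics are computed
    per grid node from the training snapshots only (unbiased std, ddof=1)
    and reused to transform the validation data.
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)
    return (train - mean) / std, (valid - mean) / std, (mean, std)

def destandardize(pred, stats):
    """Invert the linear transformation to recover true-scale predictions."""
    mean, std = stats
    return pred * std + mean
```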
4 | NETWORK ARCHITECTURES

All of the models we use and compare in this work are constructed as parametric mappings of the form

    y = f(x | β),    (1)

wherein y represents the array of high-resolution predictands, x denotes the array of predictor variables, and β summarizes the model-specific parameters to be optimized during training. We use in particular CNNs, which repeatedly apply convolution kernels of fixed size to gridded input data at varying spatial positions to capture different types of features.
FIGURE 4 Schematic of all downscaling models used in this paper.

For the downscaling CNNs in our study, we consider input predictor arrays of shape c_X^(LR) × s_lat × s_lon for low-resolution predictors x^(LR), and of shape c_X^(HR) × 4s_lat × 3s_lon for high-resolution predictors x^(HR). Therein, c_X^(LR) and c_X^(HR) indicate the number of low- and high-resolution predictor variables per grid node, and s_lat and s_lon denote the number of grid nodes of the low-resolution array grid in latitude and longitude directions, as specified in Sect. 3. Note here that the values of s_lat and s_lon may equal the maximum values s_lat = 36 and s_lon = 60, corresponding to running the model on the full domain inputs, but may also be set to smaller values, as the convolution operations can adapt to varying input sizes. Choosing smaller values of s_lat and s_lon corresponds to running the models on limited sub-domains, which we use for data augmentation, as discussed in Sect. 5. Predictands y are assumed to be of shape c_Y × 4s_lat × 3s_lon, with c_Y indicating the number of predictand variables. While c_Y = 2 is fixed for all our models, corresponding to the high-resolution wind components U and V, c_X^(LR) and c_X^(HR) vary depending on the predictors supplied to the models, as detailed in Sect. 6.2. In particular, some of the models are provided with low-resolution predictors exclusively, whereas other model configurations are informed additionally with high-resolution topography predictors.

The (rectangular) filter kernels are parametrized per convolution layer as arrays of shape c_in × c_out × k_lat × k_lon, with c_in and c_out denoting the numbers of input and output features of the layer, and k_lat and k_lon the spatial extent of the kernel filters in latitude and longitude. Due to the size of the kernel, the number of elements in convolution output arrays differs from that of the input arrays. To compensate for this, suitable replication padding between successive convolution layers is employed to keep the spatial shape of feature arrays constant throughout the series of convolutions.

In the following, the details of the different model architectures used in our evaluation are described. A schematic summary of all models is provided in Figure 4.
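As an illustration of the shape-preserving convolutions described above, the following PyTorch sketch wraps a convolution with replication padding so that the s_lat × s_lon feature shape is maintained; the kernel size is an arbitrary example:

```python
import torch
from torch import nn

class PaddedConv(nn.Module):
    """Convolution that keeps the spatial shape s_lat x s_lon constant by
    replication padding, as used between successive convolution layers."""

    def __init__(self, c_in, c_out, kernel=(3, 3)):
        super().__init__()
        k_lat, k_lon = kernel
        # pad (left, right, top, bottom) so the output matches the input grid;
        # replication avoids artificial gradients at the domain boundary
        self.pad = nn.ReplicationPad2d((k_lon // 2, k_lon // 2, k_lat // 2, k_lat // 2))
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=kernel)

    def forward(self, x):          # x: (batch, c_in, s_lat, s_lon)
        return self.conv(self.pad(x))

y = PaddedConv(4, 16)(torch.randn(1, 4, 36, 60))   # -> (1, 16, 36, 60)
```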
4.1 | Linear Convolutional Network Model: LinearCNN

For our comparison, we implemented a linear convolutional model, which we call LinearCNN. In contrast to usual practice, we intentionally omit non-linear activation functions from the architecture to force the model to rely on local linear relationships between predictors and predictands.

Formally, the architecture is designed to mimic the effect of a linear model, which takes a small section of the low-resolution predictor array and outputs an estimate for the patch of the high-resolution wind field array
that corresponds to the center pixel of the low-resolution section. Since, however, the data padding may cause distortions between neighboring pixels in latitude-longitude coordinates, the neighborhood correspondence between low-resolution and high-resolution array grids may be imperfect. To account for this, we find it beneficial to average the final estimate over multiple sample estimates, which are obtained from a small surrounding of the central low-resolution pixel. LinearCNN, therefore, is designed to produce a high-resolution estimate corresponding to the area that is covered by the interior pixels of the low-resolution section. In this way, the averaging is realized automatically when applying the model to larger domains in convolution mode.

The model architecture consists of two branches for processing low-resolution and high-resolution inputs separately. The low-resolution branch is comprised of a single convolution layer and a successive transpose convolution (e.g., Dumoulin and Visin, 2016) to increase the resolution of the input features. Transpose convolutions can be understood as linear operations, which learn to parameterize the gradient of a standard convolution. Whereas standard convolutions possess the ability to reduce the spatial resolution of feature arrays by skipping pixels between successive applications of the kernel, which is known as striding (e.g., Dumoulin and Visin, 2016), transpose convolutions can achieve an increase in resolution by parameterizing the kernel of a strided convolution. For our experiments, we select strides of (4, 3) for the transpose convolution, reflecting the magnification ratio between low-resolution and high-resolution grids. Depending on the number of input variables c_X^(LR), the first convolution layer transforms the input predictors into a set of latent features, which are then processed further by the transpose convolution. The number of features is selected to admit a full-rank linear mapping between low-resolution predictors and high-resolution predictand estimates, i.e., each of the single pixel estimates may depend on all of the covered predictor values, independently of other estimates.

On the high-resolution branch, the predictors are fed into a single standard convolution layer, whose outputs are directly added to those of the low-resolution transpose convolution. Empirically, we found that models with larger kernel sizes did not improve the performance.
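A compact PyTorch sketch of the LinearCNN idea follows. The strides (4, 3) of the transpose convolution reflect the stated magnification ratio; the kernel sizes and the latent channel count are illustrative assumptions, not the authors' exact settings:

```python
import torch
from torch import nn

class LinearCNN(nn.Module):
    """Sketch of the LinearCNN idea: purely linear LR and HR branches whose
    outputs are added. No non-linear activations are used anywhere."""

    def __init__(self, c_lr, c_hr, c_latent=24):
        super().__init__()
        self.lr_conv = nn.Conv2d(c_lr, c_latent, kernel_size=(3, 3), padding=(1, 1),
                                 padding_mode="replicate")
        # kernel = stride = (4, 3) upsamples by exactly 4x in lat, 3x in lon
        self.up = nn.ConvTranspose2d(c_latent, 2, kernel_size=(4, 3), stride=(4, 3))
        self.hr_conv = nn.Conv2d(c_hr, 2, kernel_size=(3, 3), padding=(1, 1),
                                 padding_mode="replicate")

    def forward(self, x_lr, x_hr):
        return self.up(self.lr_conv(x_lr)) + self.hr_conv(x_hr)

model = LinearCNN(c_lr=4, c_hr=2)
out = model(torch.randn(1, 4, 36, 60), torch.randn(1, 2, 144, 180))  # -> (1, 2, 144, 180)
```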
FIGURE 5 Input blocks used in FSRCNN, EnhanceNet, DeepRU (left) and DeepSD (right).

4.2 | Simple Non-Linear CNN: DeepSD

DeepSD is a simple non-linear CNN architecture, which has been proposed by Vandal et al. (2018) for downscaling climate change projections over extended spatial domains. The design of DeepSD builds upon the Super-Resolution CNN (SRCNN) by Dong et al. (2014) – one of the first CNN-based architectures for single-image super-resolution.
FIGURE 6 Super-resolution block (left) and residual block (right) for FSRCNN, EnhanceNet and DeepRU.
SRCNN is comprised of three convolution layers with rectified-linear activation functions in between, which are used to post-process the result of a bicubic interpolation of the low-resolution image data. Although Vandal et al. (2018) propose to compose DeepSD of several instances of stacked SRCNNs for better predictions, we found that for the magnification ratio of 3x in longitude and 4x in latitude, a single stage of SRCNN already attains results on a par with those achieved by stacked SRCNN instances.

In the implementation of DeepSD we follow the design proposed by Dong et al. (2014) and Vandal et al. (2018). The first layer uses a large kernel size of (9, 9) to transform the input predictor fields into an abstract feature space representation with 64 features, followed by a non-linear activation. The second layer applies a pixel-wise dimensionality reduction with a convolution of kernel size (1, 1) and 32 output features, and a second non-linear activation. The final layer applies a convolution with kernel size (5, 5) to transform the features to the target resolution.

Vandal et al. (2018) further proposed to inform the model with high-resolution orography to learn the influence of the topography on the inferred climate variables. Hence, we include the high-resolution static orography predictors during training of all our DeepSD models. To match low-resolution and high-resolution predictors, the low-resolution predictors are first interpolated to high resolution using a bicubic interpolation, and then concatenated to the high-resolution predictors to create a combined input array. A schematic of the HR-input block is shown in Figure 5.
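The following sketch shows a single SRCNN stage with the 9-1-5 kernel stack and the HR-input concatenation described above; treat the exact channel wiring as an illustration rather than the authors' reference implementation:

```python
import torch
from torch import nn
import torch.nn.functional as F

class DeepSDStage(nn.Module):
    """Single SRCNN stage as used in the DeepSD variant: bicubic upsampling of
    the LR predictors, concatenation with HR static predictors, then the
    9-1-5 convolution stack with 64/32 features (Dong et al., 2014)."""

    def __init__(self, c_lr, c_hr):
        super().__init__()
        c_in = c_lr + c_hr
        self.conv1 = nn.Conv2d(c_in, 64, kernel_size=9, padding=4, padding_mode="replicate")
        self.conv2 = nn.Conv2d(64, 32, kernel_size=1)
        self.conv3 = nn.Conv2d(32, 2, kernel_size=5, padding=2, padding_mode="replicate")

    def forward(self, x_lr, x_hr):
        # interpolate LR predictors to the HR grid and build the combined input
        x_up = F.interpolate(x_lr, size=x_hr.shape[-2:], mode="bicubic", align_corners=False)
        x = torch.cat([x_up, x_hr], dim=1)
        return self.conv3(F.relu(self.conv2(F.relu(self.conv1(x)))))

out = DeepSDStage(4, 2)(torch.randn(1, 4, 36, 60), torch.randn(1, 2, 144, 180))
```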
4.3 | Fast Non-Linear CNN: FSRCNN

Beyond previously proposed downscaling models, we also took inspiration from ongoing work in computer vision on image super-resolution. With FSRCNN (Fast Super-Resolution CNN), proposed by Dong et al. (2016), we include a direct successor of SRCNN in our comparison.

SRCNN has limitations in computational speed as it operates on a high-resolution interpolant of the original low-resolution image. This leads to an increased amount of floating point operations and requires larger convolution kernel sizes, with a large number of trainable parameters, to capture spatial features at high resolution. FSRCNN circumvents these problems by applying 7 convolution layers to the low-resolution inputs directly, and upsampling features to the final target resolution only at the very end. FSRCNN replaces convolution layers with large kernels, i.e. (9, 9) or (5, 5) in SRCNN, with a sequence of convolutions using smaller kernel sizes of (3, 3) and (1, 1). The smaller-sized convolutions speed up the computation time by a factor similar to the magnification ratio in each dimension and are, thus, beneficial in terms of inference speed. Dong et al. (2016) also proposed an hourglass-shaped network architecture, where the highest number of feature channels is used for the outermost layers, while the channel size of the inner layers is reduced. This design pattern is supposed to avoid costly computations while maintaining prediction quality.

In our experiments, we slightly adapt the architecture of FSRCNN and split the model into three parts: an input processing stage for primary feature extraction, a feature processing stage, and a super-resolution stage for successively increasing the resolution until the target resolution is reached.

The design of the input stage varies depending on the predictors in use. When employing low-resolution predictors exclusively, a single convolution layer with kernel size (5, 5) is used to transform the inputs into a set of 56 spatial feature fields, which coincides with the original design by Dong et al. (2016). For model configurations that employ both low-resolution and high-resolution predictors, a combined feature representation in the low-resolution spatial domain is created by applying the input block as depicted in Figure 5. We apply two independent convolution chains to low- and high-resolution predictors separately, and restrict the number of feature channels for both chains to c^(LR) = c^(HR) = 28. While on the low-resolution branch a single convolution is used for feature extraction, the high-resolution branch consists of a sequence of strided convolutions with kernel sizes as indicated in Figure 5. This reduces the resolution of the features successively to the low-resolution scale. The resulting features are concatenated with the previously computed low-resolution features and supplied to the feature processing stage.

The feature-processing stage again reflects the original design choices of Dong et al. (2016). In an hourglass-like architecture, a convolution with a (1, 1)-kernel is applied to reduce the number of features from 56 channels to 12, which is then followed by a sequence of four convolution layers with kernel size (3, 3), 12 output feature channels, batch normalization and non-linear activation. The last convolution layer of the processing stage uses a (1, 1)-kernel to return to the 56 feature channels.

In the original FSRCNN, the resulting features are used as input for a single transpose convolution with a kernel size of (9, 9) for upsampling.
In our experiments, however, we found that this very large transpose convolution can lead to slow training progress, and can even prevent training from converging. Furthermore, Odena et al. (2016) have shown that transpose convolutions can introduce checkerboard-like artifacts in the final prediction. To circumvent these problems, the extracted features are fed into a super-resolution block, as sketched in Figure 6, after the final batch normalization and non-linear activation layer of the feature extraction stage. Hence, we avoid transpose convolutions in our work and, instead, use bilinear upsampling first and apply a conventional convolution afterwards to obtain an upsampled result (e.g., Dong et al., 2016). In addition, we replace a single upsampling convolution with scaling factor (4, 3) by a sequence of 3 upsampling blocks with smaller scaling factors of (2, 1), (1, 3), and (2, 1) to obtain the final image in target resolution. The upsampling blocks are comprised of bilinear interpolation, a convolution layer, batch normalization, and a non-linear activation function. Finally, upsampling is followed by an additional convolution layer with batch normalization and non-linear activation, and a single output convolution without any activation function. Being a non-linear model, all but the very last convolution layers in FSRCNN are followed by non-linear activations, which are realized as parametric rectified linear units (PReLU), as proposed by Dong et al. (2016).

Note that in the original FSRCNN architecture, batch normalization was not used. In our experiments, however, we found it beneficial to regularize the feature representations through batch normalization, since the increased depth of our FSRCNN variant may lead to instabilities in training due to, e.g., internal covariate shifts (Ioffe and Szegedy, 2015). By applying batch normalization after each convolution, we could successfully stabilize the training process.
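A sketch of one such upsampling block, and of the factorization of the 4x/3x magnification into three stages, could look as follows; the kernel size inside the blocks is an assumption:

```python
import torch
from torch import nn

def upsampling_block(c, scale, kernel=(3, 3)):
    """One upsampling block as described above: bilinear interpolation by a
    small per-axis factor, followed by convolution, batch normalization and a
    PReLU activation."""
    pad = (kernel[0] // 2, kernel[1] // 2)
    return nn.Sequential(
        nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
        nn.Conv2d(c, c, kernel_size=kernel, padding=pad, padding_mode="replicate"),
        nn.BatchNorm2d(c),
        nn.PReLU(),
    )

# Factorizing the total 4x (lat) / 3x (lon) magnification into three stages:
super_res = nn.Sequential(
    upsampling_block(56, scale=(2, 1)),   # 36x60  -> 72x60
    upsampling_block(56, scale=(1, 3)),   # 72x60  -> 72x180
    upsampling_block(56, scale=(2, 1)),   # 72x180 -> 144x180
)
out = super_res(torch.randn(1, 56, 36, 60))   # -> (1, 56, 144, 180)
```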
4.4 | Deep Non-Linear CNN: EnhanceNet

Prior work in deep learning (e.g., Timofte et al., 2017, and references therein) has shown that increasing network depth can help improve prediction quality, and has led to network architectures which outperform shallow networks. However, deep networks can easily introduce instabilities in the optimization process, which is typically based on backpropagation of gradients. Specifically,
training may become inefficient due to vanishing gradients (Glorot and Bengio, 2010), which originate from the accumulation of small parameter gradients in the chain-rule-based estimation of model parameter updates. The sequential algorithm for gradient estimation causes an exponential decay of parameter updates in early layers of the network, and prevents the parameters from changing significantly during training. While batch normalization may help to stabilize network training, vanishing gradients remain an intrinsic problem of deep neural network architectures.

An effective way to address this problem is the integration of so-called short-cut connections. The purpose of these connections is to pass output features of earlier layers directly to a later stage in the network, effectively skipping parameter dependencies of intermediate model parts and circumventing the accumulation of small gradients. Two prominent examples are skip connections, as used by Srivastava et al. (2015) and Ronneberger et al. (2015), as well as residual connections, proposed by He et al. (2016). With skip connections, the output of a previous layer is concatenated with the result of an intermediate layer. An example is given in Figure 7, which is discussed in more detail in Sect. 4.5. Residual connections are similar to skip connections, but instead of being concatenated, the features before and after intermediate processing are added. This enables the model to learn mappings that are close to identity more directly.

As a deep CNN architecture with residual connections we selected EnhanceNet (Sajjadi et al., 2017), which was originally proposed for image super-resolution. EnhanceNet is comprised of an input stage for raw feature extraction, followed by a stack of 20 convolution layers for feature processing, and a super-sampling stage (see Figure 6), similar to that of FSRCNN. Residual learning is incorporated into the architecture in two variants. On the one hand, convolutions for feature processing are subdivided into 10 blocks of 2 layers each, where each block is wrapped by a residual connection. A schematic representation of one of these residual blocks is shown in Figure 6. On the other hand, bicubic interpolation is used to interpolate the low-resolution wind-field inputs to target resolution, yielding a baseline estimate for the high-resolution field, which is added to the model output.

For reasons of efficiency, the convolution layers of EnhanceNet use a kernel size of (3, 3). In our experiments, the number of feature channels is set to 64, which is equivalent to the parameters chosen in the original paper by Sajjadi et al. (2017). The non-linear activation functions for EnhanceNet are realized through rectified linear units. Similar to LinearCNN and FSRCNN, we consider network variants with varying settings of low-resolution dynamical predictors, as well as with and without high-resolution topography. Depending on the predictor configuration, either a single convolution layer or the input block depicted in Figure 5 is used for primary feature extraction. Since the main focus of our study is on pixel-wise accuracy of the downscaling results, we refrain from using the perceptual and adversarial losses that are typically used in super-resolution imaging tasks (Sajjadi et al., 2017) and instead use pixel-wise losses as discussed in Sect. 5.3.
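The residual block pattern can be written in a few lines; the sketch below follows the two-convolutions-per-block layout with 64 channels stated above:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """EnhanceNet-style residual block: two 3x3 convolutions wrapped by a
    residual connection, so the block learns a correction to the identity
    (He et al., 2016). Channel count 64 follows the EnhanceNet setting."""

    def __init__(self, c=64):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, kernel_size=3, padding=1, padding_mode="replicate")
        self.conv2 = nn.Conv2d(c, c, kernel_size=3, padding=1, padding_mode="replicate")
        self.relu = nn.ReLU()

    def forward(self, x):
        # features before and after intermediate processing are added,
        # which keeps gradients flowing directly to earlier layers
        return x + self.conv2(self.relu(self.conv1(x)))

out = ResidualBlock()(torch.randn(1, 64, 36, 60))   # spatial shape preserved
```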
4.5 | DeepRU

Network architectures for super-resolution image generation have been optimized for natural images, which possess properties that are different from those of meteorological simulation results. For instance, natural images typically depict coherent objects, like cars or animals, with well-defined shapes and boundaries. In contrast, meteorological data contains different meteorological variables which vary smoothly yet less coherently across the domain. Therefore, we expect that more skillful models can be obtained by tailoring model architectures explicitly to meteorological data.

For the present application, we argue that near-surface wind systems result from a complex interplay between the large-scale weather situation, i.e., continental-scale pressure distribution, and boundary-layer processes at finer horizontal scales. The correct treatment of physical processes at varying scales, therefore, appears as an important aspect in downscaling wind fields on extended spatial domains. This motivates the use of a model architecture that is not restricted to a single resolution scale for feature extraction, but uses different resolution stages to understand the data on multiple scales.

To account for this, we propose to use a U-Net architecture (Ronneberger et al., 2015) with residual connections (He et al., 2016).
FIGURE 7 Schematic of the DeepRU architecture.
This option gave the most accurate downscaling results among a variety of alternatives we tried. The high-resolution features are then fed into the adapted U-Net architecture. We use strided convolutions to downsample the features during encoding, and bilinear interpolation with a successive convolution layer to increase the resolution again during decoding.

At each resolution stage, we apply batch normalization and leaky-ReLU activation before passing features to a residual block, as depicted in Figure 6. The residual blocks, originally proposed by He et al. (2016), have been slightly modified for the downscaling task. We find that extending the original residual block by another convolution layer before the addition operation leads to an increase in flexibility of the residuals, which translates into a better overall model performance.

We implemented skip connections so that a new combined input can be formed by concatenating the features from the encoding stage to the corresponding super-sampled features in the decoding stage. The combined input is then processed by a single convolution layer with batch normalization and leaky-ReLU activation to further reduce the number of feature channels. The reduced features are finally passed to an additional residual block. After the last residual block at the target resolution in the decoding stage, a convolution layer is added to output a set of features, which are added to a bicubic interpolant of the low-resolution winds, resulting in the final wind field prediction.
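To make the encoder-decoder mechanics concrete, the following sketch shows one DeepRU-style resolution stage: strided-convolution downsampling, bilinear upsampling with a successive convolution, and a skip connection fused by a channel-reducing convolution. The placement of normalization and activations is simplified relative to Figure 7:

```python
import torch
from torch import nn

class DownStage(nn.Module):
    """Encoder stage: strided convolution halves the latitude resolution."""
    def __init__(self, c, stride):
        super().__init__()
        self.conv = nn.Conv2d(c, c, kernel_size=3, stride=stride, padding=1)
        self.post = nn.Sequential(nn.BatchNorm2d(c), nn.LeakyReLU())

    def forward(self, x):
        return self.post(self.conv(x))

class UpStage(nn.Module):
    """Decoder stage: bilinear upsampling + convolution, then fusion of
    concatenated skip features by a channel-reducing convolution."""
    def __init__(self, c, scale):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.fuse = nn.Sequential(nn.Conv2d(2 * c, c, kernel_size=3, padding=1),
                                  nn.BatchNorm2d(c), nn.LeakyReLU())

    def forward(self, x, skip):
        x = self.conv(self.up(x))
        return self.fuse(torch.cat([x, skip], dim=1))

enc = DownStage(32, stride=(2, 1))
dec = UpStage(32, scale=(2, 1))
x = torch.randn(1, 32, 144, 180)
y = dec(enc(x), x)    # encoder output upsampled and fused with skip features
```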
4.6 | Localized Multi-Linear Regression Model: LinearEnsemble

To enable a comparison of the CNN models with more classical approaches, we also consider a model that is based on standard multi-linear regression instead of successive convolutions. Due to their simplicity and interpretability, multi-linear regression models are frequently used in downscaling and post-processing tasks (e.g., Eccel et al., 2007; Fowler et al., 2007; Gaitan et al., 2014). For multi-linear regression models, Eq. (1) can be rewritten in simplified form as

    y = W x + b,    (2)

wherein W is a (c_Y d^(HR) × c_X d^(LR))-shaped matrix of weight parameters capturing linear relationships between flattened predictor vectors x ∈ ℝ^(c_X d^(LR)) and flattened predictand vectors y ∈ ℝ^(c_Y d^(HR)), and b ∈ ℝ^(c_Y d^(HR)) is a vector of offset parameters. Again, c_X and c_Y denote the number of predictor and predictand variables per grid node, and d^(LR) and d^(HR) are the numbers of nodes in the low-resolution and high-resolution domain. Due to the strong increase in the number of trainable parameters with O(d^(LR) d^(HR)) for increasing domain size, typical applications of multi-linear downscaling models have focused on local station data or small spatial domains with limited numbers of grid nodes.

For our comparison, we limit the number of trainable parameters to O(k · d^(HR)), for some user-defined constant k ≤ d^(LR). An ensemble of multi-linear regression models is trained, where each model uses the k nearest nodes from the low-resolution input to predict the wind components U and V at a single grid node of the high-resolution domain. This corresponds to an induced sparsity pattern on W, which allows at most k · c_X · c_Y · d^(HR) entries of W to be non-zero.

In contrast to the CNNs, we train only two different variants of the model ensemble. In a first step, we use only the low-resolution wind components U and V to inform the model, resulting in a channel number of c_X = 2. In a second step, we also add the complementary low-resolution dynamic predictors BLH, FSR, and Z500, resulting in a total of c_X = 5 predictor channels. Static predictors are not included in the training process, as the resulting contributions in Eq. (2) would be indifferent among samples, and can thus be incorporated into the offset vector b without loss of information. The k nearest low-resolution grid nodes are determined based on the standard L2 distance (in latitude-longitude space) to the target node. We empirically determined that neighborhood sizes beyond k = 16 did not improve the results significantly in our application.
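A NumPy sketch of the LinearEnsemble fit: one small least-squares regression per high-resolution node over its k = 16 nearest low-resolution nodes. The array layout and helper names are our own:

```python
import numpy as np

def fit_linear_ensemble(X_lr, Y_hr, neighbors):
    """Sketch of the LinearEnsemble: one multi-linear regression per
    high-resolution node, using its k nearest low-resolution nodes.

    X_lr: (n_samples, d_lr, c_x) flattened LR predictors
    Y_hr: (n_samples, d_hr, 2)   HR wind components U, V
    neighbors: (d_hr, k) indices of the k nearest LR nodes per HR node
               (precomputed from L2 distances in lat-lon space)
    Returns per-node weight matrices and offsets.
    """
    n, d_hr = X_lr.shape[0], Y_hr.shape[1]
    weights, offsets = [], []
    for j in range(d_hr):
        # design matrix: predictors at the k nearest LR nodes, plus a bias column
        A = X_lr[:, neighbors[j]].reshape(n, -1)
        A = np.hstack([A, np.ones((n, 1))])
        # least-squares fit of both wind components at HR node j
        coef, *_ = np.linalg.lstsq(A, Y_hr[:, j], rcond=None)
        weights.append(coef[:-1])
        offsets.append(coef[-1])
    return weights, offsets
```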
5 | TRAINING METHODOLOGY

The time range of about three years that is covered by our data is comparatively short when set in relation to time scales commonly used to define "climatology". Moreover, temporal correlations between successive samples limit the number of independent examples of weather situations across the domain. This raises the need for efficient data splitting using cross-validation, and for employing suitable methods to increase the number of training samples. In the following, we shed light on the training methodology and loss functions used in our experiments, and provide details on the optimization process.

5.1 | Cross Validation
For all models, including LinearEnsemble, we employ cross-validation with three cycles of model training and validation. In each cycle, we exclude a subset of the data from training. As the data exhibit both short-term temporal correlations on time scales of up to a few days and variations due to seasonality, we decided to pick full consecutive years of data for validation. This minimizes information overlap between training and validation data due to systematic correlations at the beginning and end of validation intervals. Furthermore, it reduces impacts of seasonality on results by averaging model performance over the full seasonal cycle. The excluded validation epochs are chosen pair-wise disjoint and cover the time ranges from June 2016 to May 2017, June 2017 to May 2018, and June 2018 to May 2019, respectively. Each model was trained three times, with varying random initializations of the regression parameters, in each validation cycle. After convergence, the model with the smallest average validation loss was selected for further evaluation. The performance of the overall model architecture was then assessed by combining the results of the best models of each of the three validation cycles.

5.2 | Patch Training
To further increase the diversity and variance of training samples, we perform CNN training on sub-patches of the full domain. This procedure limits the dimensionality of the model inputs, thus forcing models to base their predictions on local information, and reducing the chance of over-fitting to statistical artefacts in the data. Specifically, fitting of potentially non-physical long-distance correlations is efficiently avoided.

From another perspective, patch training is advantageous due to an improved usage of static predictor information in comparison to full-domain training. Static predictors remain invariant when training on the full domain and can, thus, be ignored by the network or be leveraged to establish a network operation mode of local pattern matching, instead of regression. In such a
mode, models might learn to associate the invariant topography with preselected local patterns, learned by heart, instead of using the provided dynamic information to regress on.

Confirming our expectations, we found that patch-trained models yield lower training and validation losses compared to models trained on the entire domain. Experiments show that intermediate patch sizes yield the best training results. For very small patch sizes, we observe a decrease in prediction quality, which may be attributed to a loss of context information due to insufficient data supply. These findings may also be related to the concept of the minimum skillful scale of the underlying low-resolution simulation (Benestad et al., 2008), i.e., the smallest spatial domain size for which the low-resolution data provide a sufficient amount of information for the downscaling model to generate skillful predictions.

In our experiments, low-resolution data was processed in patches of fixed size and matched with the corresponding high-resolution patches covering the same area. This was found to yield the most accurate full-grid predictions when applied to validation samples. The sub-patches for training were selected randomly for each predictor-predictand data pair and each training step, so that the induced randomness further decreases the chance of over-fitting to the training input. Note, however, that patching was applied exclusively during training of the models. For validation and evaluation of model performance, predictions were computed based on the full domain.
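Random patch sampling with matched LR/HR crops can be implemented as below; the patch dimensions p_lat and p_lon are hyper-parameters, and the exact values used in the study are not restated here:

```python
import numpy as np

def sample_patch(x_lr, y_hr, p_lat, p_lon, rng=np.random):
    """Randomly crop a matching LR/HR patch pair for patch training.

    x_lr: (c_lr, 36, 60), y_hr: (c_hr, 144, 180). The HR crop covers the same
    geographic area as the LR crop via the fixed 4x/3x magnification.
    """
    i = rng.randint(0, x_lr.shape[-2] - p_lat + 1)
    j = rng.randint(0, x_lr.shape[-1] - p_lon + 1)
    lr = x_lr[..., i:i + p_lat, j:j + p_lon]
    hr = y_hr[..., 4 * i:4 * (i + p_lat), 3 * j:3 * (j + p_lon)]
    return lr, hr
```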
5.3 | Loss Functions

For measuring error magnitude between predictions and high-resolution targets, we consider different deviation measures, which put weight on distinct aspects of reconstruction accuracy. For optimization purposes, we consider spatially averaged deviation scores, whereas for further evaluation, we consider both average and local deviations.

Given that t_i and y_i represent the target and predicted wind vectors at node i, with 1 ≤ i ≤ d^(HR) indexing the nodes of the high-resolution grid, we consider, in the first place, the mean square error (MSE) with

    MSE({t_i}, {y_i}) = ⟨ |t_i − y_i|² ⟩_D .

Here, {t_i} and {y_i} denote the sets of predictand and prediction vectors throughout the domain, | · | indicates the standard L2 vector norm, and ⟨ · ⟩_D an average over the spatial domain. The main advantage of MSE is its invariance with respect to rotations of local vector directions, i.e. predictand-prediction pairs which differ only by node-wise rotations of wind directions are assigned an identical deviation score.

However, a potential drawback of MSE is that local deviation scores scale quadratically with wind magnitude (the significance of this would ultimately depend on the application). In particular, small-angle deviations in areas of large wind speeds may contribute largely to the overall deviation score, whereas some strong directional deviations, such as opposite wind directions in areas of low wind speed, are hardly taken into account. This problem becomes particularly serious in certain scenarios, where slow but strongly variable winds over mountainous areas are accompanied by increased wind speeds over the sea.

A solution to weaken the square dependence effect is to linearize MSE, resulting in the mean absolute error (MAE). Unfortunately, even MAE does not fully overcome the scaling issue and inherits the problems of MSE. Considering angular deviations instead, for instance as measured by cosine dissimilarity, does not provide an alternative either, since angle-based deviation measures do not provide the model with information on differences in wind speed magnitude. A potential alternative would be to use a weighted average of the above-mentioned deviation metrics. However, we refrained from using such metrics, as this would require an optimization of additional ad-hoc hyper-parameters.

An effective solution is to use the standard MSE and reduce spatial inhomogeneity through node-wise standardization of the target predictands. The models then learn to mimic a reduced representation of the non-standard predictands, which can be converted back to true scale through an easily invertible linear transformation. As stated in Sect. 3, sample mean and standard deviation are computed from the respective training data set. For validation and evaluation purposes, we convert back to real-scale target predictands and predictions.
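A masked variant of the MSE, which excludes the padded nodes from Sect. 3 via the binary mask, might look like this in PyTorch:

```python
import torch

def masked_mse(pred, target, mask):
    """MSE over valid (unpadded) grid nodes only.

    pred, target: (batch, 2, s_lat, s_lon) wind components U, V
    mask: (s_lat, s_lon) binary mask, True at valid nodes; padded nodes
    are excluded from the loss computation.
    """
    sq_err = ((pred - target) ** 2).sum(dim=1)      # |t_i - y_i|^2 per node
    return (sq_err * mask).sum() / (mask.sum() * pred.shape[0])
```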
| Implementation and Optimization

All models have been realized and evaluated in PyTorch (Paszke et al., 2019). Optimization is performed using the ADAM optimizer (Kingma et al., 2014) with a fixed initial learning rate, which is reduced by a factor of 0.1 whenever the validation loss in terms of MSE does not decay by more than a small relative fraction over a period of 5 training epochs. The process is continued until a prescribed minimum learning rate is reached. To guarantee proper convergence of the models, we train for 150 epochs in each of the three runs per cross-validation cycle, without early stopping. Saturation of training and validation losses was usually reached after 50-60 epochs, and both training and validation losses showed only minor variations beyond that point. In particular, we did not observe tendencies of additional over-fitting once the models converged.
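This schedule maps naturally onto PyTorch's ReduceLROnPlateau. In the sketch below, the concrete learning rates and thresholds are illustrative placeholders (the schedule structure, the factor of 0.1, the 5-epoch patience, and the 150-epoch budget follow the text), and a single convolution layer with dummy data stands in for the actual models and data:

```python
import torch

model = torch.nn.Conv2d(4, 2, kernel_size=3, padding=1)   # stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial LR assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5,          # 5-epoch patience
    threshold=1e-4, threshold_mode="rel", min_lr=1e-6)      # thresholds assumed

for epoch in range(150):              # fixed budget, no early stopping
    x = torch.randn(8, 4, 32, 32)     # dummy batch; real training data here
    y = torch.randn(8, 2, 32, 32)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())       # in practice: the validation MSE
```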
| Regularization

During training, we employ weight decay (Kingma and Welling, 2013). Additionally, the non-linear convolutional models use batch normalization (Ioffe and Szegedy, 2015) after each convolution operation, which we find to significantly accelerate training convergence. For DeepRU, we apply two-dimensional dropout regularization (Srivastava et al., 2014) with a dropout rate of 0.1 after each residual block; i.e., succeeding each residual block, a randomly selected fraction of 0.1 of the respective output feature channels is set to zero. Although earlier studies reported performance issues when using batch normalization and dropout regularization in combination (see e.g., Li et al., 2019), we did not encounter any such negative effects.
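A minimal sketch of this regularization setup (our own illustration; the exact layer layout of DeepRU's residual blocks is not reproduced here, and the weight-decay rate is a placeholder):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Convolution + batch normalization stages with a residual connection,
    followed by channel-wise (2D) dropout at rate 0.1."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Dropout2d zeroes whole feature maps, i.e. a random fraction of
        # 0.1 of the output channels is set to zero during training.
        self.dropout = nn.Dropout2d(p=0.1)

    def forward(self, x):
        return self.dropout(torch.relu(x + self.body(x)))

# Weight decay is conveniently applied through the optimizer, e.g.:
# torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# (both rates here are illustrative).
```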
| EVALUATION

To compare the different model architectures with respect to downscaling performance, we consider sample-wise deviations between target predictands and model predictions, and investigate the extent to which the predictions depend on particular predictors. To shed light on the importance of the choice of predictors, the CNN models are trained with four different predictor configurations: low-resolution wind fields and orography only, supplementary high-resolution orography predictors, additional low-resolution dynamic predictors, or the full set of parameters. The predictor settings are detailed in Table 1 and indicated with letters (A) through (D).

TABLE 1  Predictor configurations for model trainings with varying combinations of low-resolution (LR) and high-resolution (HR) predictors. c_X^(LR) and c_X^(HR) denote the total number of low-resolution and high-resolution predictor fields supplied to the models.

Config   c_X^(LR)   c_X^(HR)   LR wind (U, V)   LR dynamic (Z500, BLH, FSR)   LR static (LSM, ALT)   HR static (LSM, ALT)
(A)         4          0            yes                     –                        yes                      –
(B)         4          2            yes                     –                        yes                     yes
(C)         7          0            yes                    yes                       yes                      –
(D)         7          2            yes                    yes                       yes                     yes
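For illustration, the four configurations of Table 1 can be encoded as simple channel lists (the variable names below are shorthand for the table abbreviations, not identifiers from our code base):

```python
# Illustrative mapping of the predictor configurations in Table 1.
LR_WIND   = ["U", "V"]
LR_DYN    = ["Z500", "BLH", "FSR"]
LR_STATIC = ["LSM_LR", "ALT_LR"]
HR_STATIC = ["LSM_HR", "ALT_HR"]

CONFIGS = {
    "A": LR_WIND + LR_STATIC,                       # 4 LR, 0 HR channels
    "B": LR_WIND + LR_STATIC + HR_STATIC,           # 4 LR, 2 HR channels
    "C": LR_WIND + LR_DYN + LR_STATIC,              # 7 LR, 0 HR channels
    "D": LR_WIND + LR_DYN + LR_STATIC + HR_STATIC,  # 7 LR, 2 HR channels
}
```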
Exceptions to this strategy arise for DeepSD and LinearEnsemble. In the case of DeepSD, we refrain from suppressing the use of high-resolution static predictors in order to stay close to the original implementation, which included high-resolution orography predictors by design. Therefore, for DeepSD, we only consider configurations (B) and (D).
For LinearEnsemble, we exclude static predictors at both low and high resolution, since by design the model does not benefit from static predictors (see Sect. 4.6); we therefore consider only configurations (A) and (C).

| Run-time Performance and Memory Requirements
A general overview of the model performances with respect to the number of trainable parameters, memory consumption, and computational time for yearly or daily predictions is provided in Table 2. The time measurements were conducted on an NVIDIA TITAN RTX GPU with 24 GB of video memory.

At training time, data for all models except LinearEnsemble was processed in batches of 30 to 200 samples, depending on model complexity and memory requirements. During training, a significant share of the memory consumption is caused by optimization computations, which are significantly more complex for deeper model architectures. The measured training time spans the full training period until convergence of the respective model, including prediction time as well as time for loss computation and optimization. In the reference trainings, we considered all dynamic and static predictors at low and high resolution.

LinearEnsemble is exceptional here, as memory limitations arise from the need for rapidly accessible storage of the training data rather than from optimization computations. As the nearest-neighbor positions vary irregularly with spatial position, data selection for LinearEnsemble cannot be realized through efficient array-slicing operations, as is the case for CNNs. Nearest-neighbor indexing has to be performed for all linear models separately and was found to be too slow to be conducted at training time. As a result, data for LinearEnsemble had to be preselected and stored with high redundancy during training. For the full ensemble of linear models with nearest neighbors, the 3-year dataset, including all low-resolution dynamic predictors, required an amount of memory far exceeding the roughly 32 GiB of RAM typically available on a local machine. Hence, the data was outsourced to a separate HDF5 file and streamed from the hard drive during training, which results, by a large margin, in the longest training time among all trained models. The training times for the remaining models scaled with model complexity, the longest being for the most complex model, DeepRU.

In contrast to the above, and for reasons of fair comparison, the computational time for model prediction (PR) is computed using a batch size of 220 for all networks; timings for loss computation and optimization are not included in these measurements. To compute the total time for model predictions, we make use of Python's timing facilities to measure the plain time required by the model to perform downscaling on all input hours for one year, in our case 8760 hours. As timings are often distorted by hardware communication and process management, we conducted 3 measurement runs for all models and averaged the results to obtain the final total prediction time. The time for single-hour predictions is given by the ratio between the total computational time and the total number of inputs. In our study, the measured time increased with model complexity, with the highest computational costs for DeepRU.

Regarding the number of trainable parameters, the deeper non-linear models EnhanceNet and DeepRU comprise a significantly higher number of convolutional layers than the remaining models and thus require more memory to store the trained parameters. Consequently, the overall memory consumption scales with model complexity (see the MEM column in Table 2).
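A sketch of this measurement protocol (our own illustration; time.perf_counter stands in for the timing facility mentioned above, and the CUDA synchronization should be omitted when running on a CPU):

```python
import time
import torch

@torch.no_grad()
def timed_prediction(model, inputs, batch_size=220, runs=3):
    """Average wall-clock time for downscaling all inputs (e.g. 8760
    hourly fields), averaged over several measurement runs."""
    totals = []
    for _ in range(runs):
        start = time.perf_counter()
        for k in range(0, inputs.shape[0], batch_size):
            model(inputs[k:k + batch_size])
        torch.cuda.synchronize()          # wait for queued GPU work
        totals.append(time.perf_counter() - start)
    total = sum(totals) / runs
    return total, total / inputs.shape[0]  # total [s], per time step [s]
```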
Despite the higher memory consumption of the non-linear models, in particular DeepRU, we found that they achieved the overall best results in our experiments, as discussed further in the following sections.

| Quantitative Analysis
The statistics of spatially averaged MSE on the validation data are illustrated in Figure 8, confirming that both model architecture and predictor selection have a considerable effect on model performance. The weakest model is LinearCNN, showing the largest overall errors and profiting the least from supplementary predictor information.
ÖHLEIN ET AL . M S E UV, Oro(LR) UV, Oro(LR, HR)0481216 M S E UV, Dyn, Oro(LR) UV, Dyn, Oro(LR, HR)LinearCNNDeepSDFSRCNNLinearCNNDeepSDFSRCNN EnhanceNetDeepRULinearEnsembleEnhanceNetDeepRULinearEnsemble
FIGURE 8  Comparison of validation losses for model variants with varying combinations of input predictors: wind components (UV), orography variables ALT and LSM at low and high resolution (Oro, LR/HR), and supplementary dynamic predictors BLH, FSR and Z500 (Dyn). Circles indicate the maximum deviation observed on the validation set; black triangles signal maximum reconstruction errors beyond the scale of the plot.
TABLE 2  Run-time performance statistics for LinearCNN, DeepSD, FSRCNN, EnhanceNet, DeepRU, and LinearEnsemble. For each model, the columns give the total number of trainable parameters (TP) in thousands, the memory consumption to store the model (MEM) in MiB, the duration of an entire training procedure for a cross-validation run with 8760 hourly data (TR) in hours, the prediction time for all 8760 inputs (PR) in seconds, and the prediction time for one single time step (TS) in milliseconds.
Model           TP [k]    MEM [MiB]   TR [h]   PR [s]   TS [ms]
LinearCNN         31.7        0.1       0.7      5.4      0.6
DeepSD            50.6        0.2       0.9      5.8      0.7
FSRCNN           165.3        0.6       1.9      8.0      0.9
EnhanceNet       942.6        3.6       4.0     15.4      1.8
DeepRU         37113.9      142.0      13.5     82.5      9.4
LinearEnsemble  3307.4       12.6      25.8     11.8      1.4

In particular, the use of high-resolution static predictors, which proved to be useful for all the non-linear models, appears to have no effect on the performance of LinearCNN. The model appears unsuited to extract useful correlations between low-resolution predictors and high-resolution wind fields. The reason for this is the restrictive parametrization scheme, which is unsuitable for capturing random offsets and distortions between low- and high-resolution field variables caused by the data padding procedure (see Sect. 3.4). As the same linear model is shared across the entire domain, LinearCNN is forced to yield a most likely estimate, which, however, is found to be inaccurate for most of the grid nodes and poor in spatial detail.

In contrast, LinearEnsemble takes advantage of its local parameterization and achieves considerably better results, comparable with or better than those of the non-linear models DeepSD and FSRCNN. The gain in performance, however, comes at the expense of a stronger tendency of the model to overfit on the training data. Especially for model variants with a large number of predictors, either due to the use of additional dynamic predictors or a larger environment size k, one observes severe over-fitting. This is also visible in Figure 8, as the maximum reconstruction error of LinearEnsemble models with the full predictor set (UV, Dyn, Oro(LR) and UV, Dyn, Oro(LR, HR)) exceeds the maximum error of even LinearCNN. L²-regularization did not improve generalization performance, but increased the reconstruction error on both training and test data. For the non-linear models, in contrast, over-fitting could be minimized through weight decay during optimization – which has a similar effect as L²-regularization – and dropout regularization.

In agreement with earlier studies by Dong et al. (2016), FSRCNN achieves smaller downscaling errors than DeepSD. The quality of the downscaled wind fields, however, is slightly below that of the LinearEnsemble model for all predictor variants under consideration.

Nevertheless, prediction quality can be further improved by considering more complex models. EnhanceNet, which differs from FSRCNN by an increased number of convolution layers and the use of residual connections in combination with bicubic downscaling as an additive baseline estimate, is the first model to surpass the performance of LinearEnsemble. Notably, EnhanceNet achieves slightly worse results than LinearEnsemble when omitting the high-resolution orography predictors, but catches up after adding them. The same is true for DeepRU, which achieves a further reduction of MSE.

Comparing DeepRU and LinearEnsemble directly, we find that DeepRU not only reduces the MSE, but can also take advantage of additional predictors more effectively. Whereas LinearEnsemble responds with an increased tendency to overfit, DeepRU achieves a reduction in deviation score when supplied with high-resolution static and low-resolution dynamic predictors. Specifically, model configuration (D) of DeepRU is the most accurate model in our comparison, achieving the lowest average MSE.
FIGURE 9  Mean magnitude difference (top row) and mean cosine deviations (bottom row) between the target high-resolution forecast and the low-resolution forecast simulation (left), the prediction of LinearEnsemble (middle), and DeepRU (right). The average is taken over all 3 validation years.

| Spatial distribution of prediction errors
To examine the spatial distribution of reconstruction errors, we consider additional angular and magnitude-specific deviation measures, which we average over the sample distribution instead of the spatial domain. Specifically, we consider the cosine dissimilarity (CosDis),

$$\mathrm{CosDis}\big(\vec{t}_i, \vec{y}_i\big) = \frac{1}{2}\Big(1 - \big\langle \cos\big(\vec{t}_i, \vec{y}_i\big) \big\rangle_X \Big),$$

for angular deviations between target predictands and predictions. Systematic deviations in wind speed magnitude are measured in terms of the magnitude difference (MD),

$$\mathrm{MD}\big(\vec{t}_i, \vec{y}_i\big) = \big\langle\, |\vec{t}_i| - |\vec{y}_i| \,\big\rangle_X,$$

which provides a measure of how much the respective models over- or underestimate wind speed magnitudes. In both measures, $\langle\cdot\rangle_X$ indicates the sample average over the validation sets of the 3 cross-validation cycles, respectively.

Figure 9 shows the spatial distribution of magnitude difference and cosine dissimilarity for low-resolution forecasts interpolated bilinearly to the high-resolution grid, as well as for the outputs of the best-performing DeepRU and LinearEnsemble models, relative to the high-resolution forecasts. Regarding the low-resolution simulation, velocities in specific regions near the coasts are not well captured and are mainly underestimated, with notable magnitude shifts. Angular deviations are more pronounced in mountainous areas. Typical values of cosine dissimilarity range between 0.25 and 0.30, which corresponds to average deviation angles of more than 60°. In the northern part of the Mediterranean Sea, the magnitude difference plot for the low-resolution simulation suggests checkerboard-like artifacts, which, however, are most likely due to a mismatch in spatial resolution and grid structure between the low-resolution and high-resolution grids, as well as the use of bilinear interpolation for visualization purposes.

In contrast to the low-resolution simulation, LinearEnsemble tends, on average, to underestimate wind magnitudes at all local grid nodes. We expect that this is mainly caused by an underestimation of extreme winds by LinearEnsemble, which is a common problem of multilinear models optimized for minimizing MSE losses (e.g., Bishop, 2006). As expected, cosine deviations for LinearEnsemble are much lower than for the low-resolution simulations. However, in areas close to the mountains, LinearEnsemble fails to properly predict both extreme shifts in magnitude and direction, for example due to ridge lines.

DeepRU shows overall better performance, with the lowest cosine and magnitude differences. Prediction errors exhibit a spatially similar pattern to LinearEnsemble, but with generally smaller amplitudes. Furthermore, DeepRU outperforms LinearEnsemble in capturing local variance in wind speed magnitude and direction. As a result, magnitude differences appear less uniform, with over- and underestimation in flat areas and near the boundaries, the latter being caused by imperfect information due to convolution padding. In the Mediterranean Sea, magnitude errors show large-scale wave-like patterns, which, especially north of Corsica and east of Sardinia, resemble ringing artifacts due to the Gibbs phenomenon (Gibbs, 1898). In turn, this relates to the models' spectral representation of topography; issues arise in regions adjacent to where steep slopes meet flat land or sea. In fact, the provided topographic height fields contain very similar patterns; the altitude values over the sea appear unphysical.
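Both deviation maps in Figure 9 can be computed with a few tensor operations. The following sketch (our own illustration, assuming sample-major tensors of shape (samples, 2, lat, lon)) evaluates the CosDis and MD fields defined at the beginning of this subsection:

```python
import torch

def error_maps(targets, preds, eps=1e-8):
    """Per-node CosDis and MD maps, averaged over the sample dimension."""
    t_mag = (targets ** 2).sum(dim=1).sqrt()        # |t_i| per node
    y_mag = (preds ** 2).sum(dim=1).sqrt()          # |y_i| per node
    cos = (targets * preds).sum(dim=1) / (t_mag * y_mag + eps)
    cosdis = 0.5 * (1.0 - cos).mean(dim=0)          # <.>_X as sample mean
    md = (t_mag - y_mag).mean(dim=0)                # positive: underestimation
    return cosdis, md
```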
| Analysis of Feature Importance
For the model configuration that was trained on the full set of predictors (D), we also investigate the importance of particular predictors according to the method proposed by Breiman (2001). For this, we perturb the model inputs from the validation data set by randomly shuffling single predictors, and then measure the change in prediction error caused by the perturbation.

Let $X = \{x_1, \dots, x_t, \dots, x_T\}$ be the (plain) validation dataset for the respective model run, with data samples $x_t = \big(x_t^{(1)} \dots x_t^{(p)} \dots x_t^{(c_X)}\big)$ containing the predictor variables, $x_t^{(p)} \in \mathbb{R}^{s_{\mathrm{lon}} \times s_{\mathrm{lat}}}$, for $1 \le p \le c_X = c_X^{(\mathrm{LR})} + c_X^{(\mathrm{HR})}$. Then, for every predictor $p$, we generate a random permutation $\Pi$ of the sample index set $\{1, \dots, t, \dots, T\}$, so that the feature-$p$-perturbed dataset $\tilde{X}^{(p)}$ contains samples of the form

$$\tilde{x}_t = \Big(x_t^{(1)} \dots \Phi\big(x_{\Pi(t)}^{(p)}\big) \dots x_t^{(c_X)}\Big).$$

Here, $\Phi(\cdot)$ denotes an additional shuffling operation in the spatial domain, which decomposes the predictor data into equally-sized square sub-patches, rearranges the patches randomly, and concatenates them again. In our experiments, we fix a single square patch size; results for different patch sizes are comparable, though. From the perturbed and non-perturbed predictions $\tilde{y}_t^{(p)}$ and $y_t$, the relative change in prediction error is computed as

$$\rho_t^{(p)} = \frac{\big\langle \mathrm{MSE}\big(\tilde{y}_t^{(p)}, y_t^*\big) \big\rangle_{\Pi, \Phi}}{\mathrm{MSE}\big(y_t, y_t^*\big)},$$

wherein $y_t^*$ denotes the ground-truth predictand, and $\langle\cdot\rangle_{\Pi, \Phi}$ denotes an average over 10 realizations of $\Pi$ and $\Phi$. Large values of the change ratio $\rho_t^{(p)}$ indicate a stronger impact of predictor $p$ on downscaling accuracy, and thus a higher importance of the predictor.
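The perturbation scheme can be sketched as follows (our own illustration; the patch size of 8 is a placeholder, and the grid dimensions are assumed to be divisible by it for brevity):

```python
import torch
import torch.nn.functional as F

def perturb_predictor(x, p, patch=8):
    """Return a copy of x with predictor channel p shuffled across samples
    (permutation Pi) and spatially in square patches (operation Phi).
    x: (samples, channels, H, W), H and W divisible by `patch`."""
    x_pert = x.clone()
    ch = x[torch.randperm(x.shape[0]), p]               # sample shuffle (Pi)
    s, h, w = ch.shape
    tiles = ch.reshape(s, h // patch, patch, w // patch, patch)
    tiles = tiles.permute(0, 1, 3, 2, 4).reshape(s, -1, patch, patch)
    tiles = tiles[:, torch.randperm(tiles.shape[1])]    # patch shuffle (Phi)
    tiles = tiles.reshape(s, h // patch, w // patch, patch, patch)
    x_pert[:, p] = tiles.permute(0, 1, 3, 2, 4).reshape(s, h, w)
    return x_pert

@torch.no_grad()
def importance_ratio(model, x, y_true, p, repeats=10):
    """Relative change in MSE, averaged over `repeats` realizations."""
    base = F.mse_loss(model(x), y_true)
    pert = torch.stack([F.mse_loss(model(perturb_predictor(x, p)), y_true)
                        for _ in range(repeats)])
    return (pert.mean() / base).item()
```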
FIGURE 10  Relative change in MSE (sample-wise) for different models when provided with perturbed predictor data. Circles indicate maximum values.

Figure 10 illustrates the sample statistics of $\rho_t^{(p)}$ for the full set of predictors and all downscaling models. In good agreement with expectations, perturbations in the predictor wind components U and V have the largest effects on model performance for all architectures in our comparison, indicating that the models indeed mainly use the information on wind speed and direction for downscaling. The effect of perturbations in the wind components strengthens with increasing model complexity. Reasons for this may lie in the non-linear structure of the more complex models, which could increase the sensitivity of the predictions to perturbations. Also, as shown in Figure 8, more complex models achieve smaller deviation scores when informed with unperturbed data. A similar increase in prediction error in terms of absolute deviation score therefore yields a larger change ratio for more complex models. This implies that the change ratios $\rho_t^{(p)}$ should be interpreted in a model-specific context.

Assessing the relative importance of the remaining predictors, we find that the least information is extracted from FSR, as perturbations in this predictor hardly affect any of the models. As FSR is provided on the same coarse grid resolution as the predictor winds, all the information it provides could already be encapsulated in the winds themselves, so that most models learn to ignore the redundant information. Interestingly, LinearEnsemble is the only model that fits correlations between FSR and high-resolution winds, which may be related to the overfitting problem of the model. Perturbations in BLH also have only a slight impact on prediction performance. This was quite a surprising result, given that this quantity varies considerably over time, and given that wind speeds at 100 m can be closely related to it, especially when BLH values are small.

Z500 is leveraged mainly by the less complex models LinearCNN and DeepSD. Z500 provides information on large-scale weather patterns, and there is a known relationship between its gradients and 500 hPa geostrophic winds, which seems to be recognized most prominently by DeepSD. Nevertheless, direct links between Z500 and 100-m winds tend to be relatively weak, which explains its minor impact on the performance of the other models.

| Analysis of Reconstructed Flow Patterns
The quantitative analysis provides high-level, abstract information on the overall downscaling performance of the models, yet it does not convey detailed information on the ability of the models to reproduce the complex flow patterns that we see in the high-resolution simulation. To investigate this aspect in more detail, we select two example cases, which exhibit strong discrepancies between ERA5 and HRES forecasts, and compare the prediction skills of two different models for these examples. For reasons of conciseness, we limit the comparison to outputs of the best-performing non-linear model, DeepRU, and the localized linear model, LinearEnsemble.
FIGURE 11  Wind fields over Europe, as obtained from low-resolution and high-resolution short-range forecast simulations and from model predictions for October 17, 2017, 09:00 UTC. The top figures show the flow field for (a) the low-resolution and (b) the high-resolution simulation and highlight differences between both predictions; (c) depicts the predictions of the localized linear model, LinearEnsemble, whilst (d) represents the wind flow predicted by DeepRU. These LIC images show the motion of particle flow produced from the wind field products; the LIC field is colored according to the local wind magnitude in m s⁻¹. Regions with strong differences between the predictions are marked by rectangles A, B, and C. For this case, both the MSE and the cosine dissimilarity of DeepRU are lower than those of LinearEnsemble.

FIGURE 12  Example flow patterns on October 17, 2017, 09:00 UTC, as obtained from low-resolution and high-resolution short-range forecast simulations and from predictions of LinearEnsemble and DeepRU, visualized as LIC plots. The location of each region within the data domain is marked on a map on the left. (a) illustrates the flow field in a region between Italy and Croatia over the Adriatic Sea, (b) depicts the flow over the Austrian Alps with low-speed winds and large directional variations, and (c) shows the wind flow in areas near central France.

To visualize wind vector fields, we use Line Integral Convolution (LIC), introduced by Cabral and Leedom (1993). To generate a LIC visualization, a randomly sampled white-noise intensity image of user-defined resolution is convolved with a 1D smoothing kernel along streamlines of the vector field. Thus, while LIC generates high correlation between the intensities along a streamline, different streamlines are distinguished by low intensity correlation between them. In addition, color mapping is used to encode additional parameters, such as the local vector field magnitude. In contrast to alternative visualizations, such as vector glyphs or streamline plots, LIC provides a global and dense view of the vector field and avoids occlusion artifacts due to improper glyph size, as well as sparse sub-domains due to improper streamline seeding. A disadvantage of LIC is that it is ambiguous which of two opposite flow directions is represented.
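For completeness, a deliberately naive sketch of the LIC procedure described above (our own illustration, using unit-speed Euler streamline integration and a box kernel; production implementations use higher-order integration and sub-pixel sampling):

```python
import numpy as np

def lic(u, v, length=20, rng=None):
    """Convolve a white-noise image with a box kernel along streamlines
    of the vector field (u, v); u, v: 2D arrays of equal shape."""
    rng = rng or np.random.default_rng(0)
    h, w = u.shape
    noise = rng.random((h, w))
    out = np.zeros_like(noise)
    for i in range(h):
        for j in range(w):
            acc, n = 0.0, 0
            for direction in (1.0, -1.0):       # integrate both ways
                y, x = float(i), float(j)
                for _ in range(length):
                    yi, xi = int(y), int(x)
                    if not (0 <= yi < h and 0 <= xi < w):
                        break
                    acc += noise[yi, xi]        # box-kernel accumulation
                    n += 1
                    speed = np.hypot(u[yi, xi], v[yi, xi]) + 1e-9
                    x += direction * u[yi, xi] / speed   # unit-speed step
                    y += direction * v[yi, xi] / speed
            out[i, j] = acc / max(n, 1)  # (seed pixel counted in both passes)
    return out
```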
The first example is given for lead time October 17, 2017 at 09:00 UTC. This case represents a rather anticyclonic scenario with generally low wind speeds, as indicated by the surface chart in Figure 13 (a). Figures 11 (a) and (b) show LIC visualizations of the underlying wind vector fields, obtained from the low- and high-resolution forecast simulations. Color coding reflects total wind speed magnitude. Differences in flow patterns indicate that, especially in mountainous regions like the Alps, the Apennines (Italy) and the Dinaric Alps (Croatia), the low-resolution simulation fails to properly capture the local variability in wind direction and magnitude that is present in the HRES simulation.

The results of LinearEnsemble and DeepRU are shown in Figures 11 (c) and (d), respectively. We highlight the most important visual differences between both predictions with rectangles; specific cases are labelled with the letters A–C. Detailed views of the streamlines for all highlighted cases are shown in Figures 12 (a), (b), and (c), respectively. Quantitative differences to the HRES simulation are measured in terms of wind direction through the local cosine dissimilarity, and in terms of wind speed through the local absolute relative error (ARE),

$$\mathrm{ARE}\big(\vec{t}_i, \vec{y}_i\big) = \frac{\big|\,|\vec{t}_i| - |\vec{y}_i|\,\big|}{|\vec{t}_i|},$$

as well as the local $L^2$ deviation, which combines both aspects. Results for the outputs of the low-resolution simulation and the model predictions are depicted in Figure 14.
FIGURE 13  Synoptic charts showing mean sea level pressure (hPa) for 12:00 UTC October 17, 2017 (a) and 00:00 UTC March 19, 2017 (b). Images were obtained from Metcheck (2020).

Based on the quantitative evaluation of all models in Sect. 6.2, it can be conjectured that both LinearEnsemble and DeepRU reconstruct meaningful downscaling results, with DeepRU leading to overall better prediction quality in scenarios of high inhomogeneity. As seen, e.g., in cases A (Adriatic Sea) and B (Austrian Alps) in Figure 11, LinearEnsemble tends not to reconstruct the flow features when there is a pronounced mismatch in flow patterns between the low-resolution and high-resolution forecast simulations. DeepRU, in contrast, uses both local and global information about the orography, and presumably additional parameters, and is able to better replicate the HRES wind fields. Especially over the Adriatic Sea (A), the winds are mainly northwesterly, tangential to the coast, and higher magnitudes are more pronounced. LinearEnsemble relies solely on local information in the low-resolution fields and is not able to reconstruct the ground truth faithfully.

In areas of complex surface topography, such as near the Austrian Alps (B), variations in wind speed and direction are usually more pronounced, as wind fields are strongly influenced by surface interactions. Here, both models learn a reasonable mapping and handle these cases quite well. According to the cosine dissimilarity (Figure 14 (a)), DeepRU performs slightly better than LinearEnsemble in terms of direction predictions. Also, DeepRU is better able to replicate extreme transitions in magnitude occurring on small spatial scales, which results in smaller relative and $L^2$ errors (see Figure 14 (b) and (c)).

A scenario with generally stronger and rather laminar flow, which exhibits some large differences in wind speed magnitude, is given in C, where fine-scale mountains slow down winds in eastern France. Since fluctuations in wind direction are small in this area, both models exhibit small overall errors in wind direction. Nonetheless, LinearEnsemble is not really able to account for orography-mediated flow adjustments on small spatial scales, whilst DeepRU can more precisely predict deviations from laminar flow. This is also clearly demonstrated by the absolute relative errors in Figure 14 (b).

The second example is for March 19, 2017, 01:00 UTC. Figure 15 depicts LIC plots of the wind fields for the simulations and predictions, similar to Figure 11. As illustrated in Figure 13 (b), the weather pattern over our domain is mainly dominated by an Alpine lee cyclone, situated between Corsica and northwest Italy. Comparing low-resolution and high-resolution forecast simulations, major parts of the flow are rather laminar with high wind speeds. Contrary to the low-resolution simulation, HRES exhibits sharper changes in magnitude over mountain ridges and mountain edges, and higher distortions in wind directions over the sea. Two particular cases with differences between the forecast simulations and the model predictions are highlighted in Figure 15 and are labelled A and B.
FIGURE 14  Visualization of the spatial deviations of the low-resolution simulation, LinearEnsemble, and DeepRU wind predictions from the output of the high-resolution simulation shown in Fig. 11. The deviations are (a) cosine dissimilarity, (b) absolute relative error, and (c) $L^2$ norm.

FIGURE 15  Wind fields over Europe, as obtained from low-resolution and high-resolution short-range forecast simulations and from model predictions for March 19, 2017, 01:00 UTC, similar to Fig. 11. Color coding indicates the local wind magnitude.
In case A, the outputs of both the low-resolution simulation and LinearEnsemble suggest a rather circular vortex pattern with moderate wind speeds over the Ligurian Sea, between the French Riviera and Corsica. The high-resolution simulation, in contrast, displays a distorted, more elongated flow pattern. Here, DeepRU elongates the flow around the vortex towards northern Italy and additionally enhances the southerly wind near the western coast of Corsica, which, in summary, better mirrors the predictions of HRES.

Case B highlights the wind field above northern Italy, where the flow is more inhomogeneous, since regions of high wind speeds are interleaved with topographically triggered vortex structures. Here, LinearEnsemble is less able than DeepRU to predict the sharp magnitude changes seen in HRES along the mountain ridge of the Apennines and near the three marked lakes.

| APPLICATIONS IN FORECASTING
As this was ostensibly a proof-of-concept study, it was not intended that the CNN architectures computed here would be used directly for forecasting. Indeed, the spatial resolutions of our predictor and predictand datasets are not competitive relative to current operational configurations: in Europe, for example, operations nowadays use global models with grid spacings of order 10 km, and limited-area models with grid spacings of a few kilometers or less.
Society requires not only predictions of mean wind speeds, but also forecasts of gusts, particularly extreme gusts, because of the dangers posed to life and infrastructure. Gusts have not been directly explored in this study. One might be able to convert mean speeds into reasonable gust forecasts using empirically defined gust-to-mean relationships (see e.g. Ashcroft, 1994), developed for different land surface types, although caution is needed for cyclone-related gusts, which tend to be the major wind-related hazard in the vicinity of storm tracks (e.g., in northern and western Europe). Low-level stability, and destabilization mechanisms, as outlined in Hewson and Neu (2015), are of paramount importance for determining the strengths of phenomena such as the cold jet, warm jet and sting jet (see also Browning, 2004). In that context, it is curious that the BLH parameter used in our study, which relates directly to stability, did not add much predictive value for the CNNs. Our use of a region that is relatively remote from storm tracks may explain this. A schematic example of such a gust-to-mean conversion is sketched at the end of this section.

It is important to re-iterate that airflow, and thus winds, can be very scale-dependent. On meter scales, speeds around city buildings vary dramatically, whilst on a lake the behavior of a yacht can be influenced by clumps of bushes nearby. Indeed, scale dependence is more acute for wind than for other parameters, such as rainfall and temperature. Thus, increases in model resolution bring with them more and more application areas for forecasts, particularly for regions that are topographically and/or physically complex. In turn, this gives the method outlined in this paper, and variants of it, lasting utility as numerical weather prediction models continue to evolve.
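Returning to the gust conversion mentioned above, a schematic example; the surface classes and gust-to-mean ratios below are hypothetical placeholders, not values taken from Ashcroft (1994):

```python
# Hypothetical gust-factor lookup per surface type; real applications
# would fit these ratios to observations for each land surface class.
GUST_FACTOR = {"open_sea": 1.3, "open_land": 1.5, "forest": 1.9, "urban": 2.1}

def estimate_gust(mean_speed, surface_type):
    """Crude gust estimate from a downscaled mean wind speed [m/s]."""
    return GUST_FACTOR[surface_type] * mean_speed
```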
| CONCLUSION

In this study, we have analyzed convolutional neural networks for the downscaling of wind fields on extended spatial domains. By going from a simple linear CNN to deeper and more elaborate non-linear models, we have investigated how network complexity affects downscaling performance. We have further compared the performance of the different CNNs to that of an ensemble of localized linear regression models.

Our study has shown that the prediction accuracy of the linear ensemble model is higher than what can be achieved with shallow non-linear CNN architectures. Especially for simplistic non-linear models with only a few convolution layers, it seems that the non-linearity even hinders performance. We attribute this to the distortion of the wind field by the non-linear activations on its way through the network, preventing the model from benefiting from simple mapping schemes, such as interpolation kernels. The use of overly simplistic and shallow non-linear models may be one reason why earlier studies found no additional value in applying CNN-based machine-learning methods (e.g., Eccel et al., 2007).

Deeper and more complex network models, on the other hand, are able to discover skillful mappings by exploiting non-linear correlations for modelling the relationship between low- and high-resolution fields. Specifically, we found that all non-linear models in our study take advantage of additional high-resolution static predictor data, such as information on local orography. In comparison, the use of the 3 pre-defined low-resolution dynamic predictors gave only minor improvements.

Among the non-linear CNNs, we identified EnhanceNet, previously proposed for single-image super-resolution, as a deep CNN that was able to beat the baseline linear ensemble model. With DeepRU, we proposed a novel deep residual U-Net architecture, which outperformed both the linear model and EnhanceNet in terms of reconstruction accuracy. The major advantage of DeepRU lies in its ability to process features at different spatial scales. This is particularly useful for the downscaling of wind fields, where local wind systems have to be consistent with large-scale flow patterns. Although we still observe some deviations between model predictions and native high-resolution forecast simulations, we are confident that convolutional neural networks can provide promising downscaling results and add more value to downscaling than linear models, at a reasonable computational cost.

We conclude that deep CNN approaches are particularly effective for downscaling with high magnification ratios on large spatial domains. In this setting, the use of classical models becomes computationally inefficient, and linear link functions
between predictor variables and predictands become insufficient to account for non-trivial variability in the local flow, e.g., due to pronounced flow distortion around obstacles. We found that deep CNNs are better suited for replicating this variance, especially in mountainous areas or over the sea near coasts, and expect that the same holds true at finer spatial scales.

An interesting question for future research is how to achieve a more accurate treatment of the domain geometry in CNN-based downscaling. While data padding was found to be well suited for reshaping irregular grids on domains of up to a few thousand kilometers of horizontal extent, increasing the domain size even further may lead to distortion artifacts due to the disregard of the spherical geometry of the Earth's surface. The same is true for interpolation-based resampling methods, where the horizontal spacing of the sampling points varies with latitude, limiting data resolution close to the equator and enforcing data redundancy closer to the poles. The use of more appropriate convolutional model architectures, like spherical CNNs for unstructured grids (e.g., Jiang et al., 2019), or geometric deep-learning approaches in general (e.g., Bronstein et al., 2017), may help to overcome such limitations, thus increasing the physical plausibility and data efficiency of the models.

From the exciting perspective of real-time application, one would ideally want to step down in scale and apply the results of this proof-of-concept study in a finer-resolution setting. We envisage that operational real-time forecast runs – single deterministic and/or ensemble – could be downscaled in real time to 1-2 km, over any pre-selected domains, for customer applications. This could be activated on a central cloud-type platform, or locally by customers to meet their own needs. Given the small number of low-resolution predictors, data transfer requirements for the second option would be minimal, compared to, say, the task of transferring four-dimensional (full-atmosphere) fields for many variables.

At such very high target resolutions, the correct treatment of ambiguity in the data becomes increasingly important, since the same coarse-scale flow pattern may correspond to multiple fine-scale realizations. Similar to stochastic weather generators, generative CNN models like variational auto-encoders (Kingma and Welling, 2013) or convolutional generative adversarial networks (e.g., Goodfellow et al., 2014; Radford et al., 2015) may provide promising approaches for building flexible models for ensemble-based probabilistic downscaling. Moreover, if the low-resolution feed were based on ensemble data itself, one could then generate a super-ensemble (i.e., an ensemble of ensembles) to provide the final smooth-format probabilistic output for users.

REFERENCES
J. Ashcroft. The relationship between the gust ratio, terrain roughness, gust duration and the hourly mean wind speed. Journal of Wind Engineering and Industrial Aerodynamics, 53(3), 1994.
Jorge Baño-Medina, Rodrigo Manzanas, and José Manuel Gutiérrez. Configuration and intercomparison of deep learning neural models for statistical downscaling. Geoscientific Model Development Discussions, 13(4):2109–2124, 2019.
Danijel Belušić, Mario Hrastinski, Željko Večenaj, and Branko Grisogono. Wind regimes associated with a mountain gap at the northeastern Adriatic coast. Journal of Applied Meteorology and Climatology, 52(9):2089–2105, 2013.
Rasmus E. Benestad, Inger Hanssen-Bauer, and Deliang Chen. Empirical-Statistical Downscaling. World Scientific, 2008.
Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, 1st edition, 2006.
Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
K. A. Browning. The sting at the end of the tail: Damaging winds associated with extratropical cyclones. Quarterly Journal of the Royal Meteorological Society, 130(597):375–399, 2004.
Matteo Buzzi, Matteo Guidicelli, and Mark A. Liniger. Nowcasting wind using machine learning from the stations to the grid. 2019.
Brian Cabral and Leith Casey Leedom. Imaging vector fields using line integral convolution. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '93), pages 263–270. ACM Press, 1993.
Richard E. Chandler. On the use of generalized linear models for interpreting climate variability. Environmetrics, 16(7):699–715, 2005.
C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision – ECCV 2014, pages 184–199. Springer International Publishing, 2014.
Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In Computer Vision – ECCV 2016, pages 391–407. Springer International Publishing, 2016.
Peter D. Dueben, Nils Wedi, Sami Saarinen, and Christian Zeman. Global simulations of the atmosphere at 1.45 km grid-spacing with the Integrated Forecasting System. Journal of the Meteorological Society of Japan, Ser. II, 2020.
Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
E. Eccel, L. Ghielmi, P. Granitto, R. Barbiero, F. Grazzini, and D. Cesari. Prediction of minimum temperatures in an alpine region by linear and non-linear post-processing of meteorological models. Nonlinear Processes in Geophysics, 14(3):211–222, 2007.
ECMWF. IFS documentation CY46r1, Part VII: ECMWF wave model. 2017.
H. J. Fowler, S. Blenkinsop, and C. Tebaldi. Linking climate change modelling to impacts studies: recent advances in downscaling techniques for hydrological modelling. International Journal of Climatology, 27(12):1547–1578, 2007.
C. F. Gaitan, W. W. Hsieh, A. J. Cannon, and P. Gachon. Evaluation of linear and non-linear downscaling methods in terms of daily variability and climate indices: Surface temperature in southern Ontario and Quebec, Canada. Atmosphere-Ocean, 52(3):211–221, 2014.
J. Willard Gibbs. Fourier's series. Nature, 59(1522):200, 1898.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, 2014.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Yanming Guo, Yu Liu, Ard Oerlemans, Songyang Lao, Song Wu, and Michael S. Lew. Deep learning for visual understanding: A review. Neurocomputing, 187:27–48, 2016.
J. M. Gutiérrez, D. Maraun, M. Widmann, R. Huth, E. Hertig, R. Benestad, et al. An intercomparison of a large ensemble of statistical downscaling methods over Europe: Results from the VALUE perfect predictor cross-validation experiment. International Journal of Climatology, 39(9):3750–3785, 2019.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, et al. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 2020.
B. C. Hewitson and R. G. Crane. Climate downscaling: techniques and application. Climate Research, 7(2):85–95, 1996.
Tim Hewson. Use and verification of ECMWF products in Member and Co-operating States (2018). ECMWF Technical Memorandum, 2019.
Tim D. Hewson and Urs Neu. Cyclones, windstorms and the IMILAST project. Tellus A: Dynamic Meteorology and Oceanography, 2015.
A. A. M. Holtslag, G. Svensson, P. Baas, S. Basu, B. Beare, A. C. M. Beljaars, et al. Stable atmospheric boundary layers and diurnal cycles: Challenges for weather and climate models. Bulletin of the American Meteorological Society, 94(11):1691–1706, 2013.
X. Hu, M. A. Naiel, A. Wong, M. Lamm, and P. Fieguth. RUNet: A robust UNet architecture for image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
Hsin Yuan Huang, Scott B. Capps, Shao Ching Huang, and Alex Hall. Downscaling near-surface wind over complex terrain using a physically-based statistical modeling approach. Climate Dynamics, 44:529–542, 2015.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Chiyu "Max" Jiang, Jingwei Huang, Karthik Kashinath, Prabhat, Philip Marcus, and Matthias Niessner. Spherical CNNs on unstructured grids. 2019.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, pages 3581–3589. Curran Associates, 2014.
Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
S. Kotlarski, K. Keuler, O. B. Christensen, A. Colette, M. Déqué, A. Gobiet, et al. Regional climate modeling on European scales: A joint standard evaluation of the EURO-CORDEX RCM ensemble. Geoscientific Model Development, 7(4):1297–1333, 2014.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
Craig M. Lee, Farid Askari, Jeff Book, Sandro Carniel, Benoit Cushman-Roisin, Clive Dorman, et al. Northern Adriatic response to a wintertime bora wind event. Eos, Transactions American Geophysical Union, 86(16):157–165, 2005.
Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Zhengyang Lu and Ying Chen. Single image super resolution based on a modified U-Net with mixed gradient loss. arXiv preprint arXiv:1911.09428, 2019.
Carsten Maass. MARS user documentation. https://confluence.ecmwf.int/display/UDOC/MARS+user+documentation, June 2019. [accessed 19-May-2020].
Yiwen Mao and Adam Monahan. Linear and nonlinear regression prediction of surface wind components. Climate Dynamics, 51:3291–3309, 2018.
Douglas Maraun, Martin Widmann, José M. Gutiérrez, Sven Kotlarski, Richard E. Chandler, Elke Hertig, Joanna Wibig, Radan Huth, and Renate A. I. Wilcke. VALUE: A framework to validate downscaling approaches for climate change studies. Earth's Future, 3(1):1–14, 2015.
Douglas Maraun, Martin Widmann, and José M. Gutiérrez. Statistical downscaling skill under present climate conditions: A synthesis of the VALUE perfect predictor experiment. International Journal of Climatology, 39(9):3692–3703, 2019.
Clifford F. Mass, David Ovens, Ken Westrick, and Brian A. Colle. Does increasing horizontal resolution produce more skillful forecasts? Bulletin of the American Meteorological Society, 83(3):407–430, 2002.
Jeffery T. McQueen, Roland R. Draxler, and Glenn D. Rolph. Influence of grid size and terrain resolution on wind field predictions from an operational mesoscale model. Journal of Applied Meteorology, 34(10):2166–2181, 1995.
Metcheck. GFS chart archive. June 2020. [accessed 17-June-2020].
P.-A. Michelangeli, M. Vrac, and H. Loukos. Probabilistic downscaling approaches: Application to wind cumulative distribution functions. Geophysical Research Letters, 36(11), 2009.
Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
Baoxiang Pan, Kuolin Hsu, Amir AghaKouchak, and Soroosh Sorooshian. Improving precipitation estimation using convolutional neural network. Water Resources Research, 55(3):2301–2321, 2019.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8026–8037. Curran Associates, 2019.
S. C. Pryor. Empirical downscaling of wind speed probability distributions. Journal of Geophysical Research, 110(D19), 2005.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
Valentina Radić and Garry K. C. Clarke. Evaluation of IPCC models' performance in simulating late-twentieth-century climatologies and weather patterns over North America. Journal of Climate, 24(20):5257–5274, 2011.
J. Räisänen, U. Hansson, A. Ullerstig, R. Döscher, L. P. Graham, C. Jones, H. E. M. Meier, P. Samuelsson, and U. Willén. European climate in the late twenty-first century: regional simulations with two driving global models and two forcing scenarios. Climate Dynamics, 22:13–31, 2004.
Markus Reichstein, Gustau Camps-Valls, Bjorn Stevens, Martin Jung, Joachim Denzler, Nuno Carvalhais, and Prabhat. Deep learning and process understanding for data-driven Earth system science. Nature, 566(7743):195–204, 2019.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer International Publishing, 2015.
M. Rummukainen. Methods for Statistical Downscaling of GCM Simulations. SMHI, 80th edition, 1997.
Markku Rummukainen. State-of-the-art with regional climate models. Wiley Interdisciplinary Reviews: Climate Change, 1(1):82–96, 2010.
Mehdi S. M. Sajjadi, Bernhard Schölkopf, and Michael Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In The IEEE International Conference on Computer Vision (ICCV), 2017.
Rosa Salvador, Josep Calbó, and Millán M. Millán. Horizontal grid size selection and its influence on mesoscale model simulations. Journal of Applied Meteorology, 38(9):1311–1329, 1999.
Chaopeng Shen. A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resources Research, 54(11):8558–8593, 2018.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
Roland Stull. Practical Meteorology: An Algebra-based Survey of Atmospheric Science. BC Campus, Vancouver, 2017.
Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. NTIRE 2017 challenge on single image super-resolution: Methods and results. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
Thomas Vandal, Evan Kodra, Sangram Ganguly, Andrew Michaelis, Ramakrishna Nemani, and Auroop R. Ganguly. Generating high resolution climate change projections through single image super-resolution: An abridged version. International Joint Conferences on Artificial Intelligence Organization, 2018.
Thomas Vandal, Evan Kodra, and Auroop R. Ganguly. Intercomparison of machine learning methods for statistical downscaling: the case of daily and extreme precipitation. Theoretical and Applied Climatology, 137:557–570, 2019.
Stéphane Vannitsem, John Bjørnar Bremnes, Jonathan Demaeyer, Gavin R. Evans, Jonathan Flowerdew, Stephan Hemri, et al. Statistical postprocessing for weather forecasts: review, challenges and avenues in a big data world. arXiv preprint arXiv:2004.06582, 2020.
John M. Wallace and Peter V. Hobbs. Atmospheric Science: An Introductory Survey, volume 92. Elsevier, 2006.
R. L. Wilby and T. M. L. Wigley. Downscaling general circulation model output: a review of methods and limitations. Progress in Physical Geography: Earth and Environment, 21(4):530–548, 1997.
Daniel S. Wilks. Use of stochastic weather generators for precipitation downscaling. Wiley Interdisciplinary Reviews: Climate Change, 1(6):898–907, 2010.
Daniel S. Wilks. Stochastic weather generators for climate-change downscaling, part II: multivariable and spatially coherent multisite downscaling. Wiley Interdisciplinary Reviews: Climate Change, 3(3):267–278, 2012.
A. W. Wood, L. R. Leung, V. Sridhar, and D. P. Lettenmaier. Hydrologic implications of dynamical and statistical approaches to downscaling climate model outputs. Climatic Change, 62:189–216, 2004.
Yongkang Xue, Zavisa Janjic, Jimy Dudhia, Ratko Vasic, and Fernando De Sales. A review on regional dynamical downscaling in intraseasonal to seasonal simulation/prediction and major factors that affect downscaling ability. Atmospheric Research, 147-148:68–85, 2014.
W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue, and Q. Liao. Deep learning for single image super-resolution: A brief review. IEEE Transactions on Multimedia, 21(12):3106–3121, 2019.
Eduardo Zorita and Hans von Storch. The analog method as a simple statistical downscaling technique: Comparison with more complicated methods.