[PDF] Forecasting High-Frequency Spatio-Temporal Wind Power with Dimensionally Reduced Echo State Networks

Abstract

Fast and accurate hourly forecasts of wind speed and power are crucial in quantifying and planning the energy budget in the electric grid. Modeling wind at a high resolution brings forth considerable challenges given its turbulent and highly nonlinear dynamics. In developing countries where wind farms over a large domain are currently under construction or consideration, this is even more challenging given the necessity of modeling wind over space as well. In this work, we propose a machine learning approach to model the nonlinear hourly wind dynamics in Saudi Arabia with a domain-specific choice of knots to reduce the spatial dimensionality. Our results show that for locations highlighted as wind abundant by a previous work, our approach results in a 11% improvement in the two-hours-ahead forecasted power against operational standards in the wind energy sector, yielding a saving of nearly one million US dollars over a year under current market prices in Saudi Arabia.

Full PDF

FForecasting High-Frequency Spatio-TemporalWind Power with Dimensionally ReducedEcho State Networks

Huang Huang , Stefano Castruccio and Marc G. Genton February 3, 2021

Abstract

Fast and accurate hourly forecasts of wind speed and power are crucial in quan-tifying and planning the energy budget in the electric grid. Modeling wind at a highresolution brings forth considerable challenges given its turbulent and highly non-linear dynamics. In developing countries where wind farms over a large domain arecurrently under construction or consideration, this is even more challenging given thenecessity of modeling wind over space as well. In this work, we propose a machinelearning approach to model the nonlinear hourly wind dynamics in Saudi Arabiawith a domain-speciﬁc choice of knots to reduce the spatial dimensionality. Ourresults show that for locations highlighted as wind abundant by a previous work,our approach results in a 11% improvement in the two-hours-ahead forecasted poweragainst operational standards in the wind energy sector, yielding a saving of nearlyone million US dollars over a year under current market prices in Saudi Arabia.

Keywords: echo state network, Gaussian random ﬁeld, machine learning, reservoir comput-ing, space-time model Statistics Program, King Abdullah University of Science and Technology, Thuwal, 23955-6900, SaudiArabia. [email protected]. [email protected]. Department of Applied and Computational Mathematics and Statistics, University of Notre Dame,Notre Dame, IN, 46556, USA. [email protected]. a r X i v : . [ s t a t . A P ] F e b Introduction

The wide consensus of the increasingly negative eﬀects of a warming climate caused bygreenhouse gas emissions (IPCC, 2014) has prompted the global community to seek alter-native, carbon-free energy sources. While heterogeneous across countries, the increase inthe usage of renewable resources worldwide has been steady over the past decades, and theaftermath of the current COVID-19 is foreseen to further accelerate this trend (Organisa-tion for Economic Co-operation and Development, 2020).Saudi Arabia has the second-largest oil reserve in the world, and its internal energyportfolio is almost exclusively focused on fossil fuels. Worldwide, the country currentlyranks sixth in oil consumption and has one of the highest per capita energy consump-tions (British Petroleum, 2020). In an eﬀort to align the country’s targets with the ParisAgreement (Kinley, 2017) and reduce its carbon footprint, the Saudi Arabian governmentrecently outlined “Vision 2030”, a blueprint plan to promote an economically viable tran-sition to renewable energies to reduce its dependence on oil. While solar energy has thegreatest potential given its geographical position, the plan aims to also install 16 GW ofwind capacity by 2030. Wind investment is strategic for the diversiﬁcation and increasedpenetration of renewable energies given its high abundance, especially during night hourswhen solar power is not available. For a country with little to no installed capacity, thedevelopment of “Vision 2030” requires a preliminary understanding on where the turbinesshould be built and what factors are more inﬂuential in determining their sites. Very recentwork provided some initial evidence. Giani et al. (2020) investigated various types of in-stallation and maintenance costs and identiﬁed the optimal wind farm build-out under thepresent wind regimes; Zhang et al. (2020) studied the resilience of the plan under a chang-ing climate over the next thirty years; and Chen et al. (2021) investigated the probability2f disruption of wind operations due to extreme wind. One fundamental question must beaddressed to provide a pathway for its practical implementation: What is the beneﬁt of areliable forecasting system, and what would be its economic advantage against standardoperational forecasting systems? Such information would be instrumental to policymakersin their ﬁnal decision regarding the optimal construction sites.Modeling hourly wind ﬁelds (the resolution of interest in the energy market) for a coun-try of one-fourth the size of the United States, with nonlinear dynamics implied by diversegeography comprising two coastal areas, mountain ranges, and sandy and rocky deserts, isa considerable challenge. Statistical approaches such as the standard Autoregressive Inte-grated Moving Average (ARIMA) are eﬀective in capturing a short-range (approximately)linear dependence; however, more ﬂexible alternatives are necessary for nonlinear dynam-ics, and simple modiﬁcations such as fractional integration (Taqqu et al., 1995) have beenshown to be largely insuﬃcient in capturing a more structured dependence in time.Machine learning approaches to time series, such as artiﬁcial neural networks, havebeen more successful in capturing nonlinear dynamics, owing to their ﬂexibility and thepossibility of deﬁning a recursive structure (

Recurrent Neural Networks , RNNs) suitablefor temporally resolved data. However, the inference for RNN is very challenging giventhe high dimensionality of the parameter space and the impossibility to directly applythe iterative gradient computations of traditional neural networks, which would lead tonumerical instabilities (Doya, 1992). To achieve more stable and computationally aﬀordableinference, Jaeger (2001) proposed a new approach to RNN, i.e., the

Echo State Networks (ESN), which is predicated on the use of sparse, random matrices instead of dense unknownones (Lukoˇseviˇcius and Jaeger, 2009). Maass et al. (2002) independently developed a similarapproach with the name

Liquid State Machines but with more sophisticated topologicalconstraints motivated by neuroscience. This paradigm is now commonly referred to as

Reservoir Computing when inputs of the neural networks are mapped to high-dimensional3idden reservoir states through ﬁxed nonlinear dynamics. These hidden reservoir statesconsider sequential linkages and thus allow for a nonlinear transformation of the inputhistory. Despite the use of reservoir states being ideally suited as a baseline for statisticalmodeling, the machine learning community has mostly focused on this class of methods asa means to ease the computational burden.Recently, a series of seminal works (McDermott and Wikle, 2017, 2019a,b) on the ESNframework was proposed in the statistical literature to forecast Paciﬁc sea surface temper-ature and the United States soil moisture. The Empirical Orthogonal Functions (EOF)method, or equivalently, the Principal Component Analysis (PCA), is used to reduce thespatial dimensionality. The leading orthogonal basis functions, which are the leading eigen-vectors of the spatial covariance matrix of the space-time data, capture the main modesof spatial variability. Therefore, they reduce the dimensionality and preserve the primaryspatial variability structure when using the leading orthogonal basis functions with a muchsmaller number than the original number of locations to represent the entire spatial ﬁeld.In this work, however, we propose a new ESN framework that describes the dynamicsof a spatial ﬁeld though a collection of knots, sampled by both a grid and ﬁxed pointsover areas of complex patterns such as rugged terrains (see Figure 2). The entire ﬁeld isthen reconstructed with a spatial interpolation approach, thereby allowing for a full spatio-temporal forecast without a computationally prohibitive ESN. Comparison with the ESNbased on EOF, ARIMA, and persistence forecasting (i.e., assuming wind does not changefrom the last observed value) approaches shows that the prediction under our proposedESN framework is more accurate. Furthermore, we oﬀer a comprehensive simulation studyinvestigating how and to what extent the ESN can capture the data dynamics and yieldpredictive improvements against other methods when the underlying simulated process hasa controlled level of nonlinearity.The manuscript proceeds as follows: Section 2 presents the wind data; Section 3 in-4roduces the temporal model, the ESN, and the spatial model; Section 4 discusses theinferential and forecasting approach; Section 5 compares the forecasting from our model toseveral operational alternatives in the wind energy market; Section 6 presents the compu-tation of the wind power forecast and discusses and quantiﬁes the economic advantages ofour method; and Section 7 provides the conclusions of the study and a discussion.Implementation of our proposed ESN for the wind speed forecasts in Saudi Arabia canbe found in https://github.com/hhuang90/KSA-wind-forecast . While developed countries can perform forecasting with publicly available, continuouslyupdated high-resolution simulations routinely used for trading energy options (e.g., theHigh-Resolution Rapid Refresh (Benjamin et al., 2016) in North America), developingcountries cannot rely on such data products.To investigate the wind speed and subsequent wind power potentials in Saudi Ara-bia, Giani et al. (2020) performed hourly high-resolution simulations of the Weather Re-search and Forecasting model (WRF, Skamarock et al. (2008)) driven by the operationalhigh-resolution European Centre for Medium-Range Weather Forecast model. The WRFwas simulated from 2013 to 2016 because the site measurements of the wind speed frommeteorological masts during this period were available for validation. Four simulationswith diﬀerent model setups were performed, and for this study we choose the one closerto the observations, which relies on the Mellor–Yamada–Janji´c planetary boundary layerscheme (Mellor and Yamada, 1982; Janji´c, 2001; G´omez-Navarro et al., 2015) and a six-kilometer resolution. We consider the hourly near-surface wind speed data over SaudiArabia. Figure 1(A) shows the average hourly near-surface wind speed from 2013 to 2016,with generally higher values in the west due to mountain ranges. While abundant in wind,5hese regions are generally unsuitable for wind farming due to the high installation costsimplied by the rough terrain. The area in northwest Saudi Arabia near the Gulf of Aqabais the construction site of NEOM City (Farag, 2019), and data indicate its suitability forwind energy harvesting, as also discussed in Giani et al. (2020). We select two locationswith a high or low average wind to display the temporal variability in the hourly windspeed data set. Figures 1(B) and 1(C) show the mean and the standard deviation of thehourly wind speed at each hour of the day in each year from 2013 to 2016. We observe thatthe wind speed interannual variability is low in the Arabian Peninsula, as also highlightedby Chen et al. (2018). Therefore, our methods based on this data set can be extended toother years in the future. Our used simulated data set from the WRF was well validatedby the observations at ten sites within the King Abdullah City for Atomic and RenewableEnergy monitoring network through various assessment metrics in Giani et al. (2020).Figure 1: (A) Average wind speed in Saudi Arabia from 2013 to 2016. (B, C) Meanand standard deviation of the hourly wind speed at each hour of the day in each year attwo selected locations (triangle and circle in panel (A)). Each of the four curves shows theunderlying quantity for one year from 2013 to 2016. Unit: m/s.6

Models

Throughout this work, we denote by Z t ( s ) the near-surface wind speed at location s andtime t , where t takes values from { , . . . , T } indicating the number of hours after 00:00Jan. 1, 2013 (Feb. 29 in the leap year of 2016 is removed for simplicity), and s is anylocation in Saudi Arabia. For practical purposes, we do not consider islands in the RedSea (we exclude Tiran Island, Sanaﬁr Island in the Red Sea Project Lagoon, and FarasanIsland) and have a total of n = 53 ,

333 observations.

We observe the periodic inter-daily and multi-daily patterns in the wind speed data, whichare explainable by both the daily land/ocean heat ﬂuxes and mesoscale patterns of recurringwinds in the Arabian Peninsula; see Figures 1(B) and 1(C) for the daily patterns. For eachlocation s , we assume that (cid:112) Z t ( s ) depends on harmonics: (cid:112) Z t ( s ) = β ( s ) + F (cid:88) i =1 (cid:26) β i, ( s ) cos (cid:16) πtT i (cid:17) + β i, ( s ) sin (cid:16) πtT i (cid:17)(cid:27) + γ ( s ) Y t ( s ) , (1)where F is the number of harmonics, β ( s ) is the intercept, β i, ( s ) , β i, ( s ), i = 1 , . . . , F ,are the coeﬃcients for each harmonic with period T i , and γ ( s ) is the location-wise scalingparameter such that the residual Y t ( s ) has unit variance at each location s . According tothe discrete Fourier transform diagnostics illustrated in Figure S1 in the SupplementaryMaterial, there are F = 5 signiﬁcant harmonics at periods of one year, half a year, one day,twelve hours, and eight hours. The choice of the square root is dictated by the need totransform the hourly wind speed data, which are generally right skewed due to occasionalwind gusts, to normality. The choice of this particular transformation is justiﬁed by a well-established literature on wind modeling (e.g., Gneiting (2002)), as well as our diagnosticsin Figure S2 in the Supplementary Material.7he parameters β ( s ) and β i, ( s ) , β i, ( s ) , i = 1 , . . . , F , are estimated by linear re-gression with the harmonics independently across sites using the data from 2013 to 2015(training data), which is a task that can be easily performed in parallel. The estimated pa-rameters are then used to estimate Y t ( s ) ( γ ( s ) is calculated such that Y t ( s ) is standardized)in Equation (1), which is a zero-mean spatio-temporal residual process with a structurespeciﬁed in the subsequent sections. The hourly, high-spatial resolution residual wind Y t ( s ) is bound to show a highly nonlinearbehavior that traditional time series approaches, such as the ARIMA or more general linearspace-time models fail to capture. The inability of traditional methods to capture the non-linearity of spatio-temporal dynamics (McDermott and Wikle, 2017) necessarily translatesinto sub-optimal forecasts, as will be illustrated in Section 5. We propose herein an ESNmodel with nonlinear dynamics generated by a state-space with neural network dynamics.We will introduce the model with the standard statistical terminology, but we also put inparenthesis the equivalent machine learning terms and use them henceforth.We denote the data ( output ) by y t := (cid:0) Y t ( s ) , . . . , Y t ( s n ) (cid:1) (cid:62) , which is the vector of thewind speed residual at time t and all locations. The lagged wind speed residuals up to m time steps before and the intercept are used as explanatory variables ( input ) at time t (i.e., x t := (1 , y (cid:62) t − , . . . , y (cid:62) t − m ) (cid:62) ), although this can be generalized to incorporate additionalcovariates. The quadratic interactions had been shown to be crucial to characterizing thenonlinear dynamical evolution in many physical processes (Wikle and Hooten, 2010; Gladishand Wikle, 2014). For this work, and consistently with McDermott and Wikle (2017), wepropose to model the relationship between the n h -dimensional latent variables h t ( hidden tates or reservoir states ) and y t using a quadratic ESN: y t = V h t + V h t + (cid:15) t , (2)where V and V are weight matrices whose entries are unknown, and (cid:15) t is the error attime t following a zero-mean multivariate normal distribution. As (cid:15) t is the error vector fordiﬀerent locations, there should be a particular structure for its covariance. However, thepoint estimate of V and V by least squares, as will be shown later in Section 4.2.1, areindependent from the form of the covariance matrix of (cid:15) t . The uncertainty of the estimated y t relies on the covariance of (cid:15) t , but we will show that the ensemble technique for the ESNis used to capture the variation of the forecasted y t . Therefore, we do not specify anycovariance model for (cid:15) t here. The dynamics of the vector of reservoir states h t is expressedas h t = φ f (cid:18) δ | λ W | W h t − + U x t (cid:19) + (1 − φ ) h t − , (3)where f is the element-wise activation function (here we choose the commonly used hyper-bolic tangent function, but other choices are possible such as the rectiﬁed linear unit), λ W is the maximal absolute eigenvalue of W , δ is a tuning parameter to scale W , and φ is theleaking rate to control the speed of the dynamics. In other words, the temporal evolutionof h t is controlled by a nonlinear evaluation of a linear combination of h t − and x t , and amemory term that allows controlling the long-range dependence of the data.The n h × n h -dimensional weight matrix W = [ w ij ] controls the interaction amongreservoir states with time evolution, and the n h × ( mn + 1)-dimensional weight matrix U = [ u ij ] determines the mapping from the inputs to the reservoir states. The key diﬀerencebetween the ESN and the standard RNN lies in the randomness of W and U with a set levelof sparsity instead of full unknown matrices to be estimated. Therefore, the interaction ofeach pair of reservoir states, as well as each pair of inputs and reservoir states, is assumedto exist with a given probability. The sparsity in both matrices prevents overﬁtting and9ramatically reduces the model dimensionality. Speciﬁcally, the entries w ij of W and u ij of U are generated as follows: w ij ∼ γ wij Unif( − a w , a w ) , γ wij ∼ Bern( π w ) ,u ij ∼ γ uij Unif( − a u , a u ) , γ uij ∼ Bern( π u ) , (4)where a w and a u control the magnitude of nonzero entries in W and U , respectively, and π w and π u determine the expected sparsity level. The proposed ESN model could, in principle, be used directly to produce forecasts of y t ; however, this would imply the use of input comprising a spatially referenced vectorof n = 53 ,

333 points at each time t . Consequently, it requires a very high-dimensionalvector of reservoir states h t to successfully capture the dynamics and subsequently veryhigh-dimensional matrices V and V in Equation (2) and W and U in Equation (3).Therefore, this brute-force approach not accounting for the spatial nature of the problemwould be computationally intractable.One commonly used approach to reduce the size of spatially referenced data in geo-sciences is using the EOF (or equivalently PCA) method. This approach would allow usto capture the main modes of spatial variability; for example, Gladish and Wikle (2014)used the ﬁrst ten EOFs to represent the sea surface temperature on a region in the PaciﬁcOcean comprising 2,229 locations. Alternatively, we can use a stochastic approach wherewe employ a Gaussian random ﬁeld to model the spatial data. After ﬁtting the Gaussianrandom ﬁeld with all the available spatial data, we consider a set of knots and use them asa low-rank representation of the entire space-time data. We will show later in Section 5.1that our stochastic spatial model performs better than the EOF approach.More speciﬁcally, we use a zero-mean Gaussian random ﬁeld to model Y t ( s ) at each time10 , i.e., Y t ( s ) ∼ GP (cid:0) , C t ( · , · ) (cid:1) , where C t ( · , · ) is the Mat´ern covariance function for Y t ( s ),such that for any pair of locations s i and s j , i, j ∈ { , . . . , n = 53 , } , the covariance isrepresented as C t ( s i , s j ) = cov (cid:0) Y t ( s i ) , Y t ( s j ) (cid:1) = σ t − ν t Γ( ν t ) (cid:18) √ ν t (cid:107) s i − s j (cid:107) ρ t (cid:19) ν t K ν t (cid:18) √ ν t (cid:107) s i − s j (cid:107) ρ t (cid:19) + τ t { s i = s j } . (5)Here, K ν ( · ) is the modiﬁed Bessel function of the second kind, {} is the indicator func-tion, and the time-varying parameters σ t , ρ t , ν t , and τ t are called the partial sill, range,smoothness, and nugget parameters, respectively (Stein, 1999). We ﬁt the random ﬁeldwith Y t ( s ) in 2015 at each time t separately. Figure S3 in the Supplementary Materialshows the boxplots of the four parameters at each hour of the day across diﬀerent days,where we see daily periodicity. Therefore, we assume that the estimated parameters areconstants at each hour of the day (i.e., σ t = σ t , ρ t = ρ t , ν t = ν t , and τ t = τ t , if | t − t | = 24 k, k ∈ N ) to reduce the model complexity. We use the mean across all diﬀer-ent days as the estimator. Figure S4 in the Supplementary Material depicts the estimatesof σ t , ρ t , ν t , and τ t at each hour of the day.With the ﬁtted spatial Gaussian random ﬁeld, we place a reasonably large set of knotsto present the entire spatial domain. The space-time dynamics are only modeled on theknots by the ESN. Once forecasts at each time have been provided on the knots, spatialinterpolation, or kriging, is used to make spatial predictions.Our chosen knots consist of two sets. The 2,756 chosen knots in the ﬁrst set are datalocations closest to a coarse, regular grid with a 0 . ◦ resolution across the entire domain.The second set comprises locations in areas with a high wind power potential. We calculatethe average wind speed from 2013 to 2015 and focus on the locations with an average windspeed greater than 6 m/s, which is approximately the 98% quantile. A set of 417 knotsare then selected, such that no pairs have longitude or latitude diﬀerence less than 0 . ◦ .11igure 2 shows a total number of n ∗ = 3 ,

173 resulting chosen knots, and we denote them by { s ∗ , . . . , s ∗ n ∗ } . The output y t is then replaced by y ∗ t := (cid:0) Y t ( s ∗ ) , . . . , Y t ( s ∗ n ∗ ) (cid:1) (cid:62) . The input x t is substituted accordingly with x ∗ t := (cid:0) , y ∗ t − (cid:62) , . . . , y ∗ t − m (cid:62) (cid:1) (cid:62) , and the reservoir states h t inEquation (3) are updated by plugging in x ∗ t as h t = φ f (cid:16) δ | λ W | W h t − + U x ∗ t (cid:17) + (1 − φ ) h t − . Figure 2: Overview of the 3,173 chosen knots: 2,756 knots closest to a regular grid with a0 . ◦ resolution are shown in light gray, and 417 knots with an average wind speed from2013 to 2015 that is greater than 6 m/s are shown in red. This section discusses both the inference and the forecasting approach. Forecasting isintroduced ﬁrst because the inference for all, but the output matrices, V and V inEquation (2), relies on cross-validating predictions.12 .1 Forecasting Let the training period be t ∈ { , . . . , T } and the forecasting period be t ∈ { T +1 , . . . , T max } .Let us ﬁrst assume that the parameters are known. In this work, not only are we inter-ested in a one time-step forecast (corresponding to one hour ahead) but also in two andthree hours in the future because this information is necessary to the majority of wind en-ergy markets (Hering and Genton, 2010). At time t , given the knot data, y ∗ , . . . , y ∗ t and toforecast y ∗ t +2 , the vector y ∗ t +1 is necessary but unavailable; hence the forecast ˆ y ∗ t +1 is substi-tuted. Section S1 in the Supplementary Material discusses a formalization of this approachto forecast with multiple time-steps ahead. We produce an ensemble of 100 replications fordiﬀerent realizations of W and U in Equation (3) throughout this work to account for theuncertainty.At time t , once the forecast ˆ y ∗ t + h = (cid:0) ˆ Y t + h ( s ∗ ) , . . . , ˆ Y t + h ( s ∗ n ∗ ) (cid:1) (cid:62) , h = 1 , ,

3, at the knotsis obtained, the wind speed residuals at all locations ˆ y t + h = (cid:0) ˆ Y t + h ( s ) , . . . , ˆ Y t + h ( s n ) (cid:1) (cid:62) canbe obtained via spatial interpolation, i.e., ˆ y t + h = K f K − ˆ y ∗ t + h , where K f = [ k fij ] is an n × n ∗ -dimensional covariance matrix with entries k fij = cov (cid:0) Y t + h ( s i ) , Y t + h ( s ∗ j ) (cid:1) and K = [ k ij ] isan n ∗ × n ∗ -dimensional covariance matrix with entry k ij = cov (cid:0) Y t + h ( s ∗ i ) , Y t + h ( s ∗ j ) (cid:1) . All parameters, except V and V , are estimated by cross-validation, as will be explainedlater. The estimation of V and V can be formulated as a least square regression. Thelarge size of the output vector would potentially result in a high dimension of the reservoirstate space needed. Therefore, it is necessary to penalize the coeﬃcient estimation to13educe its variance and prevent overﬁtting. If we denote Y =  y ∗ (cid:62) ... y ∗ T (cid:62)  , H =  h (cid:62) , h (cid:62) ... h (cid:62) T , h T (cid:62)  , B =  V (cid:62) V (cid:62)  , E =  (cid:15) (cid:62) ... (cid:15) (cid:62) T  , then, we can express Equation (2) in the matrix linear regression form Y = HB + E . Weuse ridge regression or Tikhonov regularization to estimate B such thatˆ B = arg min B ( (cid:107) Y − HB (cid:107) + λ (cid:107) B (cid:107) ) , where (cid:107) · (cid:107) F is the Frobenius norm and λ is the ridge penalty. The ridge estimator ˆ B hasthe following closed-form: ˆ B = ( H (cid:62) H + λ I ) − H (cid:62) Y . (6)Once ˆ B is obtained, the forecast one time-step ahead (and iteratively for any future timestep) can be made on the knots as follows:ˆ y ∗ t = ˆ B (cid:62)  h t h t  , for t ∈ { T + 1 , . . . , T max } . There are several parameters that need to be tuned: the dimension of the reservoir states n h ,the number of lags m in x t , the leaking rate φ and the scaling parameter δ in Equation (3),the magnitude and sparsity parameters a w , π w , a u , and π u in Equation (4), and the ridgepenalty λ in Equation (6).The cross-validation method is used to determine the optimal values of these parametersin our problem. We consider y ∗ t from 2013 to 2014, for a total of 17,520 hourly observations,14s the training data to yield the ridge estimator ˆ B in Equation (6). Subsequently, y ∗ t in2015, for a total of 8,760 hourly observations, is used as the validation data to assess theforecast performance. More speciﬁcally, at any time t in 2015, y ∗ , . . . , y ∗ t are available; weforecast y ∗ t +1 and compare it with the true validation data; and we then calculate the MeanSquared Error (MSE) from all knots and time points in the 100 ensemble forecasts (seeSection S2 in the Supplementary Material for more details). We search over an extensivelylarge grid (Table S1) to obtain the optimal values of the parameters. The optimal valuesminimizing the MSE, along with their interpretation, are listed in Table 1.Table 1: Estimated parameters in the ESN model via cross-validation, with the trainingdata set of y ∗ t in 2013 and 2014 and the validation in 2015.Parameter explanation Notation EstimateNumber of reservoir states n h m φ W δ λ W a w W a u U π w U π u n ∗ = 3 ,

173 dimensions. The estimatedridge penalty helps avoid overﬁtting using the large reservoir space. The obtained optimalleaking rate is one (i.e., the reservoir states are fully updated by new activations at each15ime). Since we removed the low-frequency periodicities to the original wind speed (seeEquation (1)), the optimal leaking rate of one suggests a short-term dynamic in the residualﬁeld Y t ( s ). In our ESN, we use the lagged wind speed residuals as the input, and weobserve that the optimal model only requires one lagged observation as the input at eachtime. The scaling parameter, δ , determines the spectral radius of the matrix W afterscaling and inﬂuences the fading speed of the inputs in the hidden neurons. In practice, δ < δ indicatesthat even though only one lagged observation is directly needed at each time, the hiddenneurons preserve a long memory of the inputs (lagged wind speed residuals) by employinga moderately large spectral radius of W , 0.9. The values of a w , π w , a u , and π u are allcomparatively small, indicating that a sparse collection of weak connections among reservoirstates and between the reservoir states and inputs is selected. Once the parameters have been estimated from the training step from 2013 to 2015, weproceed to forecast y t in 2016 and compare it with some of the most common strategiesfor short term wind forecasts.We focus herein on persistence forecasting and the ARIMA model. The persistenceforecasting method uses the last observed values as estimates for the near future (i.e., ˆ y t +3 =ˆ y t +2 = ˆ y t +1 = y t ). The persistence forecasting method is very simple, but frequently usedin practice because of its good performance for very short-term forecasting problems. Inthe ARIMA approach, at each knot s ∗ i , i = 1 , . . . , n ∗ = 3 , Y t ( s ∗ i ) is assumed to have the following model: (cid:32) − p (cid:88) k =1 ψ k L k (cid:33) (1 − L ) d Y t ( s ∗ i ) = c + (cid:32) q (cid:88) k =1 θ k L k (cid:33) ε t ( s ∗ i ) , where p , d , and q are respectively the order of the autoregressive, diﬀerencing, and moving-average terms, L is the lag operator (e.g., L Y t ( s ∗ i ) = Y t − ( s ∗ i )), ε t ( s ∗ i ) is the error term, and c , ψ , . . . , ψ p and θ , . . . , θ q are unknown parameters. The optimal orders in each ARIMAmodel are optimized through a cross-validation using the data from 2013 to 2015. EachARIMA model with the optimal parameters then makes forecasts at each correspondingknot up to three time-steps ahead for each time in 2016 using all the observed data beforethat time.Table 2 summarizes the MSE results for y ∗ t at all the n ∗ = 3 ,

173 knots and time pointsin 2016. Our forecasting results by the ESN outperform the ARIMA and the persistenceforecasting, and the outperformance becomes more signiﬁcant when the lead time increasesfrom one hour ahead to three hours ahead.Table 2: MSE of forecasts of y ∗ t at all the 3,173 knots and time points in 2016 using theESN, ARIMA, and persistence forecasting methods.Forecast ESN ARIMA PersistenceOne hour ahead 0.235 0.292 0.326Two hours ahead 0.394 0.533 0.657Three hours ahead 0.508 0.689 0.920Figure S5 in the Supplementary Material shows the MSE diﬀerence between the ESNand the persistence forecasts, where we observe that the ESN outperforms the persistenceapproach at the vast majority of locations. Only ﬁve locations show smaller MSEs of thepersistence forecasts, all being one hour ahead forecasts.17fter obtaining the forecasted Y t ( s ) at the knots, we can forecast Y t ( s ) everywhereelse at the n = 53 ,

333 original locations by spatial interpolation from the inferred spatialGaussian random ﬁeld described in Section 2. In contrast, we can also use the persistenceforecast Y t ( s ) directly at all the 53,333 locations. Thus, these persistence forecasts allowthe use of more information than our ESN, which is deﬁned only at the knots. Figure 3shows the diﬀerence between the MSE by using our ESN and the spatial modeling approach,denoted as S-ESN, and the persistence forecasting methods for up to three hours ahead.We observe smaller errors in our forecasts than the persistence forecasts at the majorityof locations from Figure 3, where the outperformance in the average MSE across locationsis depicted in Table 3. However, for one hour ahead forecast, there exist some areaswhere the persistence method leads to better forecasts, especially over the mountain range.Recall that only at ﬁve knots, the persistence method has smaller MSEs than the ESN,indicating the good forecast performance of the ESN when the data access is only boundto the knots. However, when it extends to the entire ﬁeld, the spatial interpolation isless capable of capturing the spatial structure in complex topographical areas, such asmountains. Nevertheless, in spite of a few areas for one hour ahead forecasts, the overallpattern is well modeled by S-ESN with smaller MSE at the majority of locations.Table 3: MSE of forecasts of y t at all the 53,333 locations and time points in 2016 usingthe S-ESN, EOF-ESN, and persistence forecasting method.Forecast S-ESN EOF-ESN PersistenceOne hour ahead 0.277 0.364 0.335Two hours ahead 0.424 0.570 0.670Three hours ahead 0.537 0.705 0.936In addition, as mentioned in Section 3.3, the EOF approach is also commonly used18igure 3: MSE of forecasts of y t at each of the 53,333 locations in 2016 using the S-ESNminus MSE using the persistence forecasting method.to reduce dimensions (McDermott and Wikle, 2017). We investigate this approach for aforecast performance comparison and denote it as EOF-ESN. We use y t at all n = 53 , y t (originally 53,333) to3,173 using its ﬁrst 3,173 expansion coeﬃcients (principal component scores). We apply theESN to the 3,173 expansion coeﬃcients and make forecasts up to three time steps aheadfor each time in 2016. Lastly, the full forecasted map at the n = 53 ,

333 knots is recoveredusing the ﬁrst 3,173 EOFs and the forecasted expansion coeﬃcients. In EOF-ESN, theparameters in the deployed ESN are optimized through a cross-validation on the 3,173expansion coeﬃcients from 2013 to 2015.Table 3 also summarizes the MSE results for y t at all n = 53 ,

333 knots and timepoints in 2016 using the EOF-ESN. We observe the forecast improvement by our approachincorporating spatial modeling compared to the commonly used deterministic dimension-reduction method of the EOF. Note again that in Table 3, the persistence forecastingmethod directly uses the last observed values at all 53,333 locations, whereas S-ESN andEOF-ESN only model the data in the reduced space. For the very short-term forecast19cheme with one hour ahead, the EOF-ESN performs worse than the persistence forecastingmethod. Starting from two hours ahead, the EOF-ESN yields better forecasts than thepersistence method.In short, we stress the remarkable result of our proposed S-ESN approach, which is able,even for very short-term forecasts, to outperform persistence overall despite only relyingon wind data at the knots, about 6% of the 53,333 locations.

To better understand the ESN’s ability in capturing nonlinear dynamics, we perform asimulation study from a model with a controlled level of nonlinearity in the temporaldynamics. We consider one of the most popular models in numerical weather prediction,the Lorenz 96 model (Lorenz, 1996), and we modify it by adding a factor η to control thenonlinearity in the dynamics as follows: dy i ( t ) dt = η (cid:0) y i +1 ( t ) − y i − ( t ) (cid:1) y i − ( t ) − y i ( t ) + F, i = 1 , , . . . , N, (7)where y − ( t ) = y N − ( t ) , y ( t ) = y N ( t ) , y N +1 ( t ) = y ( t ), N is the number of sites, and F isthe external forcing.We choose N = 5 and F = 8 and simulate data on t ∈ [ − . , t = 0 .

1. The initial values are drawn from an independent standardnormal distribution (i.e., y i ( − . i.i.d. ∼ N (0 , , i = 1 , . . . , N = 5). The realizationsat the ﬁrst 2,000 time points are treated as burn-in and discarded. To account for themeasurement error, we add white noise to the generated realizations (i.e., ˜ y i ( t ) = y i ( t )+ (cid:15) i ( t )where (cid:15) i ( t ) i.i.d. ∼ N (0 , , i = 1 , . . . , N = 5 , t = 0 . , . , . . . , y i ( t ) , i = 1 , . . . , N = 5 , t = 0 . , . , . . . , . , . , . . . , .

4, for η to control diﬀerent levels of nonlinearitylevels ( η is 1 in the original Lorenz 96 model). Nonlinearity is more signiﬁcant when η is20arger. For each η , we generate 50 independent sets of realizations. We apply the ESN,Vector Autoregression (VAR), ARIMA, and persistence methods to the simulated data.Since we have a small number of sites ( N = 5), it is feasible to apply the VAR for the5-variate time series. The VAR model is described as follows:  y ( t )... y N ( t )  =  c ... c N  + A  y ( t − y N ( t −  + · · · + A p  y ( t − p )... y N ( t − p )  +  ε ( t )... ε N ( t )  , where p is the order of the vector autoregression term, ε ( t ) , . . . , ε N ( t ) are error terms,and c , . . . , c N and the matrices A , . . . , A p are unknown parameters. It is noteworthythat the ESN and VAR model the realizations at the ﬁve sites altogether and make a 5-variate forecast at each time point, whereas the ARIMA models the time series at each siteindependently and performs univariate forecasting. All the parameters in these models areselected through a cross-validation using the MSE for the one time-step forecast. In thecross-validation, the realizations at t ∈ { . , . . . , } are used for the training, while thoseat t ∈ { . , . . . , } are used for the validation.Once the optimal parameter values are obtained, we forecast up to three time-stepsahead for t ∈ { . , . . . , } using each method. Figure 4 illustrates the MSE results.Short-term forecasting is more diﬃcult when the nonlinearity of the dynamic becomesmore signiﬁcant as η increases. It is no surprise that forecasts with more time-steps aheadare less accurate. However, in all cases, the ESN leads to the most accurate forecasts.When η is small, the outperformance of the ESN is not obvious. When η departs fromzero, the ESN is generally more capable of capturing the nonlinear dynamics compared toother linear methods. 21igure 4: MSE of the ESN, VAR, ARIMA, and persistence models for the forecasts upto three time-steps ahead for t ∈ { . , . . . , } for the simulated data from the modiﬁedLorenz 96 model in Equation (7). The solid lines show the mean MSE in the 50 experiments,whereas the bands indicate the mean ± the standard deviation. We now analyze the forecasts using the proposed model to the wind speed data in Section 2to achieve the improved forecast of wind power. From an operational perspective, theelectricity produced by wind power cannot be stored and must be consumed once it entersthe power grid. An accurate prediction of the wind power ahead helps to plan the rightamount of additional electricity to be dispatched from other sources, such as thermal power.Penalties or ﬁnes are applied in utility markets in case of unfulﬁlled commitment to providean agreed amount of power. The typical forecast time window for scheduling electricitytransmission, allocating resources, and dispatching generated power is two to four hours(Gneiting et al., 2006; Hering and Genton, 2010). Here we focus on forecasting wind twohours ahead in 2016. We study our wind power forecasts based on the optimal wind farmbuild-out in Saudi Arabia discovered in Giani et al. (2020) and assess how much powercould be saved with better forecasts if all the identiﬁed wind farms were installed and in22peration.The two hours ahead forecasted Y t ( s ) is then transformed into the near-surface windspeed Z t ( s ) according to Equation (1). Since the wind speed is diﬀerent at diﬀerent heightsat a particular location, we need to convert the wind speed on the near-surface (10 me-ters above the ground) to the wind turbine hub heights. The wind power law, which isalso known as the traditional logarithmic proﬁle law, is commonly used to perform thisconversion: Z ( h ) t ( s ) = Z t ( s )( h/ α ( s ,t ) , where h is the hub height, and α ( s , t ) is a location- and time-speciﬁc factor. Traditionally,the factor α ( s , t ) is ﬁxed as a constant for all locations and time, and the value 1/7 isusually used over open land surfaces (Pryor and Barthelmie, 2011) and speciﬁcally overSaudi Arabia (Rehman et al., 2007; Tagle et al., 2019). However, Crippa et al. (2021) foundthat if the location- and time-speciﬁc factor α ( s , t ) is independently estimated with dailyharmonics at each location using the WRF wind speed data from diﬀerent proﬁle heights,the recovered wind speed at the hub height is 36% more accurate. We use their results toconvert the wind speed vertically. It is noteworthy that this vertical extrapolation modelby the wind power law would lead to some errors. Because currently we only investigatethe point estimate of the wind power forecasts, the point estimates of the factor α ( s , t )suﬃces to provide the conversion results. If a full probabilistic forecast is studied in thefuture, the source of uncertainty from the wind power law also needs to be considered.There are only two types of turbine models among the 75 identiﬁed optimal windfarms (Giani et al., 2020): Nordex N131/3300 and GE Energy 2.75-100, with hub heights84 and 75 meters above the ground, respectively. We obtain the wind speed for each windfarm at the turbine hub heights. We then use the power curve provided by the wind turbinemanufacturer to convert the wind speed to produced electric power (Figure S6). Each curvehas three critical wind speeds that divide the power curve into four zones. When the wind23peed is less than the cut-in speed, the turbine motor does not rotate. When the windspeed is greater than the cut-in but less than the rated speed, the turbine motor producesmore power with faster wind. Once the wind speed reaches the rated speed but not greaterthan the cut-out speed, the turbine keeps producing the maximal power output. If the windspeed goes beyond the cut-out speed, the turbine motor stops rotating to avoid damageand produces no power. Therefore, the assessment of the wind power forecasts is morecomplicated than that of the wind speed itself because better wind speed forecasts do notnecessarily yield better wind power forecasts. For example, no improvement of wind powerforecasts is gained when the forecasted and true wind speed are both in the ﬁrst, third, orfourth zone.Following the same procedure, we also transform the forecasted near-surface wind speedby the persistence strategy and the true wind speed to the produced power. Subsequently,we can compare the error in the wind power forecasts by our approach and the persistencestrategy. Note that the economic cost may be asymmetric for under- and over-forecastsof wind power. Under-forecast occurrence means that the power grid has ordered moreelectricity than needed, which results in a power surplus. Then, the power generationmust be reduced, and it is usually more expensive than the opposite case. However, thisasymmetry is case-speciﬁc and can vary in diﬀerent countries. For simplicity, we do notdistinguish the direction of the wind power forecast errors in this work. We then calculatethe absolute diﬀerence between the S-ESN wind power forecasts and the true power foreach hour and similarly between the persistence wind power forecasts and the truth. Thewhole-year diﬀerence in wind energy at all wind turbines combined is 2 . × kW · h, asobtained using the S-ESN, and 3 . × kW · h, as obtained by the persistence forecasts;thus, we obtain approximately 11% of improvement. The bid price in the electricity marketvaries based on many factors (demand and supply, providers, etc.). Modeling the imbalanceprice is generally diﬃcult and out of the scope of our work; see Zhu and Genton (2012)24nd Pinson (2013) for some discussions on the operational management in an electricitymarket. To gain some insights, using a medium bid price of 0.025 US dollars per kW · hnowadays in Saudi Arabia as a reference, the forecasts from our approach could yield asaving of nearly one million US dollars over a year. In this work, we have introduced a novel ESN framework for spatio-temporal data. Theapproach is shown to have superior predictive performances with the operational standardsin the wind energy market, contributes to the spatio-temporal model literature, and haspractical implications for the wind energy sector in Saudi Arabia.The proposed model uses the high dimensionality of neural networks to capture thenonlinear dynamics of hourly wind but provides two solutions to ease the computationalburden of the model. The paradigm of reservoir computing allows achieving sparsity inthe RNN matrices, thereby dramatically reducing the parameter space; the use of spatial(problem-informed) knots allows reducing the spatial dimensionality while preserving alarger amount of information than previous approaches. Although our method shows highpredictive power, it represents only one of the many possibilities available to bridge thegap between machine learning methods and spatio-temporal statistics. The use of alter-native spatial approaches, including the stochastic partial diﬀerential equations (Lindgrenet al., 2011), ﬁxed rank kriging (Cressie and Johannesson, 2008) and predictive processes(Banerjee et al., 2008) among many, and convolutional neural networks represents a vastlyunexplored area to capture nonlinear spatio-temporal processes.This work presents a fast and eﬃcient approach to hourly wind forecasting over largeareas. A more detailed quantiﬁcation of the economic gains from an improved forecastis reliant upon a market model for trading and bidding energy. The energy sector of25audi Arabia is currently under full governmental control; hence, the integration of ourforecasting approach would require considerable speculation on the market model that willbe employed in the future. Therefore, we chose not to pursue this route but rather optedfor a simpler comparison against operational standards, such as persistence, under thecurrent market prices in Saudi Arabia. Some of our current studies focus on engaging withlocal leaders in the energy sectors to detail the range of possible future market models thatcould be integrated with our approach. In more immediate terms, this work will provideadditional information on where to build turbines aside from the wind abundance andconstruction and operation costs in Giani et al. (2020). Finally, while this work is reliantupon (validated) high-resolution WRF simulations, it represents a template to assess windpredictions over a large spatial area for developing countries in the Arab Gulf, and morebroadly worldwide, with an emerging wind energy portfolio.

Acknowledgments

This publication is based upon work supported by the King Abdullah University of Scienceand Technology (KAUST) Oﬃce of Sponsored Research (OSR) under Award No: OSR-2018-CRG7-3742.

References

Banerjee, S., A. E. Gelfand, A. O. Finley, and H. Sang (2008). Gaussian predictive processmodels for large spatial data sets.

Journal of the Royal Statistical Society: Series B(Statistical Methodology) 70 (4), 825–848.Benjamin, S. G., S. S. Weygandt, J. M. Brown, M. Hu, C. R. Alexander, T. G. Smirnova,J. B. Olson, E. P. James, D. C. Dowell, G. A. Grell, H. Lin, S. E. Peckham, T. L. Smith,26. R. Moninger, J. S. Kenyon, and G. S. Manikin (2016). A North American hourly as-similation and model forecast cycle: The rapid refresh.

Monthly Weather Review 144

Extremes . inpress.Chen, W., S. Castruccio, M. G. Genton, and P. Crippa (2018). Current and future es-timates of wind energy potential over Saudi Arabia.

Journal of Geophysical Research:Atmospheres 123 (12), 6443–6459.Cressie, N. and G. Johannesson (2008). Fixed rank kriging for very large spatial datasets.

Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (1),209–226.Crippa, P., M. Alifa, D. Bolster, M. G. Genton, and S. Castruccio (2021). A spatio-temporal model for vertical extrapolation of wind speed and wind energy assessment. under review .Doya, K. (1992). Bifurcations in the learning of recurrent neural networks. In

Proceedings ofIEEE International Symposium on Circuits and Systems 1992 , Volume 6, pp. 2777–2780.Farag, A. A. (2019). The story of NEOM city: Opportunities and challenges. In

New Citiesand Community Extensions in Egypt and the Middle East , pp. 35–49. Springer.Giani, P., F. Tagle, M. G. Genton, S. Castruccio, and P. Crippa (2020). Closing thegap between wind energy targets and implementation for emerging countries.

Applied nergy 269 , 115085.Gladish, D. W. and C. K. Wikle (2014). Physically motivated scale interaction parameter-ization in reduced rank quadratic nonlinear dynamic spatio-temporal models. Environ-metrics 25 (4), 230–244.Gneiting, T. (2002). Nonseparable, stationary covariance functions for space–time data.

Journal of the American Statistical Association 97 (458), 590–600.Gneiting, T., K. Larson, K. Westrick, M. G. Genton, and E. Aldrich (2006). Calibratedprobabilistic forecasting at the stateline wind energy center: The regime-switching space–time method.

Journal of the American Statistical Association 101 (475), 968–979.G´omez-Navarro, J. J., C. C. Raible, and S. Dierer (2015). Sensitivity of the WRF model toPBL parametrisations and nesting techniques: Evaluation of wind storms over complexterrain.

Geoscientiﬁc Model Development (GMD) 8 (10), 3349–3363.Hering, A. S. and M. G. Genton (2010). Powering up with space-time wind forecasting.

Journal of the American Statistical Association 105 (489), 92–104.IPCC (2014). Part a: Global and sectoral aspects. In

AR5 Climate Change 2014: Impacts,Adaptation, and Vulnerability . Cambridge University Press.Jaeger, H. (2001). The “echo state” approach to analysing and training recurrent neuralnetworks-with an erratum note.

Bonn, Germany: German National Research Center forInformation Technology GMD Technical Report 148 (34), 13.Janji´c, Z. I. (2001). Nonsingular implementation of the Mellor-Yamada level 2.5 scheme inthe NCEP Meso model. Oﬃce Note 437, National Centers for Environmental Prediction.Kinley, R. (2017). Climate change after Paris: From turning point to transformation.

Climate Policy 17 (1), 9–15.Lindgren, F., H. Rue, and J. Lindstr¨om (2011). An explicit link between gaussian ﬁelds28nd gaussian markov random ﬁelds: The stochastic partial diﬀerential equation approach.

Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (4), 423–498.Lorenz, E. N. (1996). Predictability: A problem partly solved. In

Proceedings Seminar onPredictability , Volume 1, pp. 1–18.Lukoˇseviˇcius, M. and H. Jaeger (2009). Reservoir computing approaches to recurrent neuralnetwork training.

Computer Science Review 3 (3), 127–149.Maass, W., T. Natschl¨ager, and H. Markram (2002). Real-time computing without sta-ble states: A new framework for neural computation based on perturbations.

NeuralComputation 14 (11), 2531–2560.McDermott, P. L. and C. K. Wikle (2017). An ensemble quadratic echo state network fornon-linear spatio-temporal forecasting.

Stat 6 (1), 315–330.McDermott, P. L. and C. K. Wikle (2019a). Bayesian recurrent neural network models forforecasting and quantifying uncertainty in spatial-temporal data.

Entropy 21 (2), 184.McDermott, P. L. and C. K. Wikle (2019b). Deep echo state networks with uncertaintyquantiﬁcation for spatio-temporal forecasting.

Environmetrics 30 (3), e2553.Mellor, G. L. and T. Yamada (1982). Development of a turbulence closure model forgeophysical ﬂuid problems.

Reviews of Geophysics 20 tatistical Science 28 (4), 564–585.Pryor, S. and R. Barthelmie (2011). Assessing climate change impacts on the near-termstability of the wind energy resource over the United States.

Proceedings of the NationalAcademy of Sciences 108 (20), 8167–8171.Rehman, S., I. El-Amin, F. Ahmad, S. Shaahid, A. Al-Shehri, and J. Bakhashwain (2007).Wind power resource assessment for Rafha, Saudi Arabia.

Renewable and SustainableEnergy Reviews 11 (5), 937–950.Skamarock, W. C., J. B. Klemp, J. Dudhia, D. O. Gill, D. M. Barker, M. G. Duda, X.-Y.Huang, W. Wang, and J. G. Powers (2008). A description of the Advanced ResearchWRF version 3. NCAR Technical Note TN-475+STR, National Center for AtmosphericResearch.Stein, M. (1999).

Interpolation of Spatial Data: Some Theory for Kriging . Springer, NY.Tagle, F., S. Castruccio, P. Crippa, and M. G. Genton (2019). A non-Gaussian spatio-temporal model for daily wind speeds based on a multi-variate skew-t distribution.

Jour-nal of Time Series Analysis 40 (3), 312–326.Taqqu, M. S., V. Teverovsky, and W. Willinger (1995). Estimators for long-range depen-dence.

Fractals 3 (4), 785–798.Wikle, C. K. and M. B. Hooten (2010). A general science-based framework for dynamicalspatio-temporal models.

TEST 19 (3), 417–451.Zhang, J., P. Crippa, M. G. Genton, and S. Castruccio (2020). Assessing the reliabilityof wind power operations under a changing climate with a non-Gaussian distributionaladjustment. arXiv:2011.01263.Zhu, X. and M. G. Genton (2012). Short-term wind speed forecasting for power systemoperations.

International Statistical Review 80 (1), 2–23.30 upplementary Material

S1 Descriptions of More Than One Time-step Fore-cast

We write h t ( x ∗ t ) more explicitly as a function of x ∗ t in the forecasted ˆ y ∗ t ,ˆ y ∗ t = ˆ B (cid:62)  h t ( x ∗ t ) h t ( x ∗ t )  . At time t , the forecasted ˆ y ∗ t +1 uses x ∗ t +1 = (cid:0) , y ∗ t (cid:62) , . . . , y ∗ t − m +1 (cid:62) (cid:1) (cid:62) ; the forecasted ˆ y ∗ t +2 uses x ∗ t +2 =  (cid:0) , ˆ y ∗ t +1 (cid:62) , y ∗ t (cid:62) . . . , y ∗ t − m +2 (cid:62) (cid:1) (cid:62) , m > (cid:0) , ˆ y ∗ t +1 (cid:62) (cid:1) (cid:62) , m = 1 ;the forecasted ˆ y ∗ t +3 uses x ∗ t +3 =  (cid:0) , ˆ y ∗ t +2 (cid:62) , ˆ y ∗ t +1 (cid:62) , y ∗ t (cid:62) . . . , y ∗ t − m +3 (cid:62) (cid:1) (cid:62) , m > (cid:0) , ˆ y ∗ t +2 (cid:62) , ˆ y ∗ t +1 (cid:62) (cid:1) (cid:62) , m = 2 (cid:0) , ˆ y ∗ t +2 (cid:62) (cid:1) (cid:62) , m = 1 . S2 Discarded Forecasts at the Beginning and Endingof the Forecasting Period

Let the training period be { , . . . , T } and the forecasting period be { T + 1 , . . . , T max } . Wewrite ˆ y ∗ t | as the forecast one hour ahead when at time t −

1, ˆ y ∗ t | as the forecast two hoursahead when at time t −

2, and ˆ y ∗ t | as the forecast three hours ahead when at time t − • ˆ y ∗ t | : t ∈ { T + 1 , . . . , T max } , 1 ˆ y ∗ t | : t ∈ { T + 2 , . . . , T max } , • ˆ y ∗ t | : t ∈ { T + 3 , . . . , T max } .For example, we do not consider ˆ y ∗ T +1 | because the needed foreacasted ˆ y ∗ T | is not avail-able. At time T max −

1, we only forecast for one time-step, and the forecasted ˆ y ∗ T max +1 | and ˆ y ∗ T max +2 | are not required because we do not have data y ∗ T max +1 and y ∗ T max +2 for theassessment. 2 Figure S1: Discrete Fourier transform of the near-surface wind speed. Highlighted are thesigniﬁcant frequencies 4, 8, 1460, 2929, and 4380 Hz, corresponding to periods of one year,half a year, one day, twelve hours, and eight hours, respectively.3igure S2: Scaling parameter γ ( s ) at each location (left). Histogram of Y t ( s ) samples overa regular space-time grid (middle) and the samples’ Q-Q plot for a normal distribution(right).Figure S3: Boxplots of the estimated σ t , ρ t , ν t , and τ t in Equation (5) at each hour of theday across all diﬀerent days in 2015. 4igure S4: Mean of the estimated σ t , ρ t , ν t , and τ t in Equation (5) at each hour of theday across all diﬀerent days in 2015.Figure S5: MSE of forecasts of y ∗ t at each of the 3,173 knots in 2016 using the ESN minusMSE using the persistence forecasting method.5igure S6: Power curve provided by manufacturer for transforming wind speed to producedpower. 6 Table S1: Search grid for the parameters in the cross-validation.Parameters Grid n h { , , . . . , } m { , , . . . , } φ { . , . , . . . , } δ { . , . , . , . . . , } λ { . , . , . , . , . , . , . } a w { . , . , . , . , . } a u { . , . , . , . , . } π w { . , . , . , . , . } π u { . , . , . , . , . }}