Wind Field Reconstruction with Adaptive Random Fourier Features
JONAS KIESSLING, EMANUEL STRÖM, AND RAÚL TEMPONE
Abstract.
We investigate the use of spatial interpolation methods for reconstructing the horizontal near-surface wind field given a sparse set of measurements. In particular, random Fourier features is compared to a set of benchmark methods including Kriging and inverse distance weighting. Random Fourier features is a linear model $\beta(\mathbf{x}) = \sum_{k=1}^{K} \beta_k e^{i\omega_k\cdot\mathbf{x}}$ approximating the velocity field, with frequencies $\omega_k$ randomly sampled and amplitudes $\beta_k$ trained to minimize a loss function. Inspired by [1], we include a physically motivated divergence penalty term $|\nabla\cdot\beta(\mathbf{x})|^2$, as well as a penalty on the Sobolev norm. We derive a bound on the generalization error and a sampling density that minimizes the bound. Following [2], we devise an adaptive Metropolis-Hastings algorithm for sampling the frequencies from the optimal distribution. In our experiments, our random Fourier features model outperforms the benchmark models.

Keywords:
Random Fourier features, neural networks, Metropolis algorithm, discrete Fourier, physical loss function, spatial interpolation, machine learning, wind field reconstruction, flow field estimation.
Date: January 2021.
Contents
1. Introduction
2. Problem formulation
2.1. Data
2.2. Spatial interpolation models
2.3. Quality of fit
3. Models
3.1. Fourier models
3.2. Benchmarking models
4. Results
5. Discussion
6. Conclusion
7. Acknowledgements
8. Appendix
References
1. Introduction
An integral part of wind farm planning and weather forecasting is access to high quality wind data. The available data is often sparse, and interpolation techniques are employed in order to increase spatial resolution. Depending on the application, interpolation can be done with respect to time aggregates or for each measurement. In wind farm site planning, for example, a point of interest is the expected energy output over time, which can be estimated as a function of the wind speed. A common approach is to approximate the distribution of the wind speed over time with some parametric model, and then apply spatial interpolation to the parameters. The work [3] lists approximately 200 papers written between 1940 and 2008 which focus on parametric models for time series of wind speed measurements. Due to high spatial variability, interpolating wind over shorter time intervals is considered a harder problem. However, there are reasons why this can be useful, such as predicting the propagation of forest fires and pollutants, modelling the movement of flying animals and insects [4], or setting up initial conditions for weather simulations.

Interpolating both wind speed and direction from a sparse set of wind velocity measurements is commonly referred to as wind field reconstruction. In modern approaches to field reconstruction such as [5, 6, 7], high resolution numerical simulations are used to train machine learning models to astonishing accuracy. However, this type of high resolution training data is sometimes not accessible, or might take considerable time and effort to simulate. In such cases, simpler interpolation models can arguably be a valid alternative. The definition of an interpolation model varies depending on the context.
The definition used in our work draws from [8], and is synonymous with regression (see Section 2.2). A number of studies have been made comparing different interpolation models in various applications such as temperatures [9], snow depth [10], mineral concentrations [8] and wind speed [11, 3, 12]. Traditional spatial interpolation models include nearest neighbours, inverse distance weighting, Kriging, linear regression, polynomial spline methods and various combinations of them [13]. In recent years, machine learning methods such as random forests and neural networks have proven worthy adversaries of traditional interpolation models, see for example [9] and [8]. These types of models trade interpretability for power and flexibility, and allow for the inclusion of features such as terrain elevation, slope, concavity and roughness [9] without much additional work.

In this paper, we compare methods for near-surface wind field reconstruction. A machine learning method known as random Fourier features is compared to a selection of popular and successful interpolation techniques. The model fits a Fourier series $\beta(\mathbf{x}) = \sum_k \beta_k e^{i\omega_k\cdot\mathbf{x}}$ to the data. Instead of traditional greedy optimisation methods such as stochastic gradient descent, our random Fourier features model explores the frequency domain using an adaptive Metropolis algorithm inspired by [2]. In each step, the Fourier coefficients $\beta_k$ are optimised with respect to a loss function. The work [1] proposes a technique for interpolation of incompressible flow by incorporating zero divergence as a constraint in the optimisation. We propose an alternative approach where the incompressibility takes the form of a regularisation term. In Section 2, a mathematical formulation of the wind field reconstruction problem is given. The data is also presented, along with the error estimates used to evaluate the models. Section 3 contains a short introduction to the interpolation models.
The results are presented in Section 4 and finally discussed in Section 5.

2. Problem formulation
Let $\Omega \subset \mathbb{R}^2$ denote a geographic region. For the purpose of this report, $\Omega$ is the set of points contained within the Swedish borders. We define the horizontal wind field $\mathbf{u}: \Omega \times [0, \infty) \to \mathbb{R}^2$ that maps every point in space $\mathbf{x} = (x, y) \in \Omega$ and time $t > 0$ to a velocity $\mathbf{u}(\mathbf{x}, t) = (u(\mathbf{x}, t), v(\mathbf{x}, t)) \in \mathbb{R}^2$. Typically, airflow is assumed to satisfy the incompressible Navier-Stokes differential equations. Given an initial state at the time $t = 0$ and a set of boundary conditions, it is possible to simulate and forecast the wind using these equations. In this report, the focus is shifted from forecasting to interpolation. In practice, this is a hard problem because wind measurements are only accessible in a sparse set of points in $\Omega$. We call this problem wind field reconstruction. The above notation, as well as the notation presented in Sections 2.1 and 2.2, will be used extensively throughout our work.

2.1. Data.
The data was obtained from the Swedish Meteorological and Hydrological Institute (SMHI). It contains a set of hourly wind velocity observations during the entirety of 2018, from $N = 171$ weather stations scattered across Sweden as shown in Figure 1. Each measurement is collected 10 meters above ground. The positions of the stations are given by the altitude as well as the latitude and longitude. Each measurement is a 10-minute average and consists of two components: the wind speed, which is measured in meters per second, and the angle of the horizontal component of the wind vector, measured counter-clockwise relative to north. Both measurements are rounded to varying degrees of precision depending on the station [14]. The stations are not active at all times. In fact, there are occasional hours with as few as one station reporting measurements. The velocities are highly correlated over time, as demonstrated in Figure 3. The data was processed before training. First, the latitude-longitude pairs were transformed to Cartesian coordinates $\mathbf{x} = (x, y)$, where $x$ is the eastward-measured distance and $y$ is the northward-measured distance, as shown in Figure 1. This was done using the SWEREF 99 TM map projection. Secondly, the wind measurements were transformed from polar to Cartesian coordinates $\mathbf{u} = (u, v)$, where $u$ corresponds to the velocity component along the $x$-axis and $v$ corresponds to the velocity component along the $y$-axis. Lastly, the wind measurements from September 2018 were removed because of unusually high wind speeds.

Figure 1. The SWEREF 99 TM orthographic projection of Sweden. The weather stations are marked in blue and scaled according to their altitudes $z$. The coordinate system is drawn in the bottom left.

2.2. Spatial interpolation models.
For the purpose of this report, the definition of interpolation models is restricted to approximations of functions $\mathbf{u}: \Omega \to \mathbb{R}^2$. The range is two dimensional since we are modelling horizontal wind velocities, and the region $\Omega$ is Sweden. In general, interpolation models are also used for other properties like temperature, pressure, population density, etc. Nevertheless, we define a spatial interpolation model as a map $f$ from a set of velocity measurements $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{u}_n) : \mathbf{x}_n \in \Omega,\ n = 1, 2, \ldots, N\}$ to a vector field $f_{\mathcal{D}}: \Omega \to \mathbb{R}^2$. The process of evaluating a model $f$ on $\mathcal{D}$ is called training. Usually, the training is done by minimising a loss function. The above notation will be used throughout our report. Note that there is an important distinction between $f$ and $f_{\mathcal{D}}$. The function $f$ represents a model, which is trained on data $\mathcal{D}$ and produces a vector field $f_{\mathcal{D}}$ which approximates $\mathbf{u}$. Any time a symbol is indexed by the letter $\mathcal{D}$, or variations of it such as $\mathcal{D}_t$, it symbolises an interpolation model trained on that particular data set. A parametrised set $\{f_\theta : \theta \in \Theta\}$ of interpolation models $f_\theta$ is called an interpolation family, and the parameters $\theta$ are called hyper parameters. For example, the $K$-nearest neighbours models can be viewed as an interpolation family with hyper parameter $K$. The process of finding the best model $f_\theta$ in an interpolation family with respect to some quality of fit is called hyper optimisation. In the next section, we present a quality of fit used for hyper optimisation and comparisons between different interpolation models.

2.3. Quality of fit.
Given an interpolation model $f$ and observations $\{\mathcal{D}_t\}_{t\in T}$ at different times $t$ in a time span $T$, the quality of fit $Q(f)$ is defined as the expected square loss with respect to the distribution $\rho(\mathrm{d}\mathbf{x}, \mathrm{d}t)$ of the data in space and time $\Omega \times T$:

$$Q(f) = \mathbb{E}\big[\|f_{\mathcal{D}_t}(\mathbf{x}) - \mathbf{u}(\mathbf{x}, t)\|^2\big] = \int_T \int_\Omega \|f_{\mathcal{D}_t}(\mathbf{x}) - \mathbf{u}(\mathbf{x}, t)\|^2\, \rho(\mathrm{d}\mathbf{x}, \mathrm{d}t).$$

We only have access to a limited sparse set of data, so calculating the exact value of $Q$ is not possible. Even if the number of measurements were sufficient, model training is often cumbersome. Instead, we use a Monte Carlo sampling average in time and a cross-validation scheme in space. Let $\{t_k\}_{k\in\mathcal{K}}$ be a set of independent time samples from $T$, indexed by a set $\mathcal{K}$, and take the measurements $\mathcal{D}_k = \{(\mathbf{x}_n, \mathbf{u}_{kn}) : n = 1, 2, \ldots, N_k\}$ to be the set of observations made at the time $t_k$. As stated in Section 2.1, the weather stations are not always active, so the number of measurements $N_k$ at the time $t_k$ varies. Define a random partition of the weather stations into $M$ disjoint sets of equal size $\mathcal{D}_k^m$, $m = 1, 2, \ldots, M$, and denote $\mathcal{D}_k^{-m} = \mathcal{D}_k \setminus \mathcal{D}_k^m$. The conditional expectations $Q_t(f) = \mathbb{E}[\|f_{\mathcal{D}_t} - \mathbf{u}(\mathbf{x}, t)\|^2 \mid t]$ of the quality of fit given the times $t_k$ are approximated using cross-validation:

$$Q_{t_k}(f) \approx \widetilde{Q}_k(f) := \frac{1}{|\mathcal{D}_k|} \sum_{m=1}^{M} \sum_{(\mathbf{x}', \mathbf{u}') \in \mathcal{D}_k^m} \|f_{\mathcal{D}_k^{-m}}(\mathbf{x}') - \mathbf{u}'\|^2.$$

We found that 5-fold cross-validation ($M = 5$) struck a good balance between computation time and accuracy. Averaging $\widetilde{Q}_k(f)$ over the samples in $\mathcal{K}$ yields a Monte Carlo sample estimate of $Q(f)$:

$$Q(f) \approx \widetilde{Q}(f) := \frac{1}{|\mathcal{K}|} \sum_{k\in\mathcal{K}} \widetilde{Q}_k(f),$$

also called the unexplained variance.
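As an illustration, the cross-validation estimate $\widetilde{Q}_k$ at a single time instant can be sketched as follows. The `fit` interface and all names here are hypothetical; the paper does not prescribe an implementation:

```python
import numpy as np

def cv_quality_of_fit(fit, X, U, M=5, rng=None):
    """M-fold cross-validation estimate of Q_t at one time instant.

    fit(X_train, U_train) must return a callable model x -> predicted velocities
    (a hypothetical interface). X: (N, 2) station coordinates, U: (N, 2) velocities.
    """
    rng = np.random.default_rng(rng)
    N = len(X)
    folds = np.array_split(rng.permutation(N), M)  # random partition of the stations
    sq_err = 0.0
    for m in range(M):
        test = folds[m]
        train = np.setdiff1d(np.arange(N), test)
        model = fit(X[train], U[train])            # train on D_k^{-m}
        pred = model(X[test])
        sq_err += np.sum(np.linalg.norm(pred - U[test], axis=1) ** 2)
    return sq_err / N                              # \tilde{Q}_k in the notation above
```

Averaging this quantity over the sampled times $t_k$ then gives the Monte Carlo estimate $\widetilde{Q}(f)$.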
An alternative measure $E$ is obtained by normalising $Q(f)$ by the expected square wind speed $\mathbb{E}_{t,\mathbf{x}}[\|\mathbf{u}(\mathbf{x}, t)\|^2]$, denoted $Q(0)$:

(1) $E(f) = \dfrac{Q(f)}{Q(0)} \approx \dfrac{\widetilde{Q}(f)}{\widetilde{Q}(0)} =: \widetilde{E}(f).$

This measurement is referred to as the fraction of unexplained variance. If the wind field $\mathbf{u}$ has zero mean, the number $1 - E(f)$ is referred to as $R^2$, or the coefficient of determination. We estimate the variance of $\widetilde{Q}$ as

(2) $\operatorname{Var}(\widetilde{Q}) = \dfrac{\operatorname{Var}(Q_t)}{|\mathcal{K}|}, \qquad \operatorname{Var}(Q_t) \approx \dfrac{1}{|\mathcal{K}|} \sum_{k\in\mathcal{K}} \big(\widetilde{Q}_k - \widetilde{Q}\big)^2.$

Provided that the data samples are independent, that the cross-validation estimates $\widetilde{Q}_k$ of $Q_{t_k}$ are exact, and that the sample size $|\mathcal{K}|$ is sufficiently large, the central limit theorem guarantees that the quality of fit follows a normal distribution. As a direct consequence, two standard deviations of $\widetilde{Q}$ constitute an approximate 5th percentile confidence bound for $Q$. Note that the confidence bound for $Q$ is not sufficient when comparing models, since the errors can be highly correlated. Given two models $f, g$, we instead define the difference $\Delta Q(f, g) := Q(f) - Q(g)$ and estimate it as $\Delta\widetilde{Q}(f, g) := \widetilde{Q}(f) - \widetilde{Q}(g)$. The variance of $\Delta\widetilde{Q}(f, g)$ is estimated similarly to (2):

(3) $\operatorname{Var}(\Delta\widetilde{Q}(f, g)) = \dfrac{\operatorname{Var}(\Delta Q_t(f, g))}{|\mathcal{K}|}, \qquad \operatorname{Var}(\Delta Q_t(f, g)) \approx \dfrac{1}{|\mathcal{K}|} \sum_{k\in\mathcal{K}} \big(\Delta\widetilde{Q}_k(f, g) - \Delta\widetilde{Q}(f, g)\big)^2.$

Equations (2) and (3) were used in Section 4 to determine the sample size $|\mathcal{K}|$ required for the 5th percentile confidence intervals to determine $Q(f)$ and $\Delta Q(f, g)$ with a precision of 1% of the mean square wind speed. This was done with the 2018 wind data set for all the tested models $f$, with $g$ as the random Fourier features model. Due to high autocorrelation and seasonality in the wind, as demonstrated in Figure 3, these confidence intervals do not necessarily generalise to time intervals $T$ larger than 2018.
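The paired comparison in (3) can be sketched as follows, assuming the per-time estimates $\widetilde{Q}_k(f)$ and $\widetilde{Q}_k(g)$ have already been computed over the same sampled times (function and variable names are illustrative):

```python
import numpy as np

def paired_model_comparison(qf, qg):
    """Mean difference and two-standard-deviation bound for Delta Q = Q(f) - Q(g).

    qf, qg: per-time cross-validation estimates \tilde{Q}_k(f), \tilde{Q}_k(g)
    over the same sampled times t_k. Pairing the estimates lets the highly
    correlated errors of the two models cancel, as noted in the text.
    """
    d = np.asarray(qf) - np.asarray(qg)   # Delta Q_k(f, g) for each time sample
    mean = d.mean()                       # Delta Q estimate
    var = d.var() / len(d)                # Var(Delta Q_t)/|K|, cf. (3)
    return mean, 2.0 * np.sqrt(var)       # estimate and ~5th percentile margin
```

If the returned margin is smaller than the absolute mean difference, the two models are distinguishable at the stated confidence level.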
The quality of fit was used for both hyperparameter optimisation and validation of the models, but with different samples $\mathcal{K}$ to avoid overfitting.

3. Models
3.1. Fourier models.
In this section we introduce two Fourier series based models. At the end of the section, we arrive at the random Fourier features model, which is the main focus of our report. In Section 3.2, we lay out some of the conventionally used interpolation models which are used for benchmarking.
3.1.1. Fourier series.
The Fourier series based model takes the form

$$\beta(\mathbf{x}) = \sum_{k=1}^{K} \beta_k e^{i\omega_k\cdot\mathbf{x}}, \qquad \omega_k \in \mathbb{R}^2,\ \beta_k \in \mathbb{C}^2,\ k = 1, 2, \ldots, K,$$

where $K$ is the number of terms and $\omega_k\cdot\mathbf{x}$ denotes the scalar product between $\omega_k$ and $\mathbf{x}$. The Fourier series generalises to arbitrary dimensions of the input $\mathbf{x}$, but in this report we chose $\mathbf{x} = (x, y)$ to be just the horizontal coordinates, hence why $\omega_k$ is two-dimensional. In the Fourier series based model, $\boldsymbol{\omega} = (\omega_1, \omega_2, \ldots, \omega_K)$ is held fixed, and the parameters $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_K)$ are estimated by optimising with respect to the expectation of a loss function $\ell: \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}_+$:

$$\min_{\boldsymbol{\beta}} \mathbb{E}_{\tilde\rho}\big[\ell(\beta(\mathbf{x}), \mathbf{u})\big],$$

where the expectation is taken over the joint density $\tilde\rho$ of the data $\mathbf{x}, \mathbf{u}$. For our application, $\tilde\rho$ is unknown. Therefore, the expected loss is replaced by a Monte Carlo sample estimate of the loss, called the empirical loss. Given a data set $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{u}_n)\}_{n=1}^{N}$, the empirical loss is expressed as

$$\frac{1}{N} \sum_{n=1}^{N} \ell(\beta(\mathbf{x}_n), \mathbf{u}_n) \approx \mathbb{E}_{\tilde\rho}\big[\ell(\beta(\mathbf{x}), \mathbf{u})\big].$$

For this work, we chose the following loss function:

(4) $\ell(\beta(\mathbf{x}), \mathbf{u}) = \|\beta(\mathbf{x}) - \mathbf{u}\|^2 + \lambda \|\beta\|_{S(2,\gamma)}^2 + \eta \|\nabla\cdot\beta\|_{L^2}^2.$

The expression $\|\beta\|_{S(2,\gamma)}^2$ denotes the second order squared Sobolev norm of $\beta$. By orthogonality of the Fourier features, the Sobolev norm can be simplified to

$$\|\beta\|_{S(2,\gamma)}^2 = \sum_{k=1}^{K} \big(\gamma^2 \|\omega_k\|^4 + \gamma \|\omega_k\|^2 + 1\big) \|\beta_k\|^2.$$

The hyper parameter $\gamma$ was added to allow for more flexibility in the penalty, and is equivalent to rescaling the input variable before applying the Sobolev norm. The expression $\|\nabla\cdot\beta\|_{L^2}^2$ is the squared $L^2$-norm of the divergence of $\beta$, which can be rewritten as

$$\|\nabla\cdot\beta\|_{L^2}^2 = \sum_{k=1}^{K} |\omega_k \cdot \beta_k|^2.$$

The intended effect of the Sobolev norm is to dampen high frequencies, and the divergence penalty is supposed to simulate incompressible flow.
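For concreteness, the two simplified penalty terms can be evaluated directly from the amplitudes and frequencies. This is a sketch; in particular, the exact powers of $\gamma$ in the Sobolev weight are an assumption here (reconstructed from the rescaling interpretation), and the function name is ours:

```python
import numpy as np

def penalty_terms(beta, omega, gamma):
    """Sobolev and divergence penalties for beta(x) = sum_k beta_k e^{i omega_k.x}.

    beta: (K, 2) complex amplitudes, omega: (K, 2) real frequencies.
    The gamma powers in the Sobolev weight are an assumed reconstruction.
    """
    w2 = np.sum(omega ** 2, axis=1)                # ||omega_k||^2
    b2 = np.sum(np.abs(beta) ** 2, axis=1)         # ||beta_k||^2
    sobolev = np.sum((gamma ** 2 * w2 ** 2 + gamma * w2 + 1.0) * b2)
    # divergence penalty: sum_k |omega_k . beta_k|^2
    divergence = np.sum(np.abs(np.sum(omega * beta, axis=1)) ** 2)
    return sobolev, divergence
```

A divergence-free single mode has $\beta_k$ orthogonal to $\omega_k$, in which case the second term vanishes while the Sobolev term still penalises large $\|\omega_k\|$.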
The empirical loss using the loss function defined in (4) is

(5) $\dfrac{1}{N} \displaystyle\sum_{n=1}^{N} \|\beta(\mathbf{x}_n) - \mathbf{u}_n\|^2 + \lambda \|\beta\|_{S(2,\gamma)}^2 + \eta \|\nabla\cdot\beta\|_{L^2}^2.$

Here, $\lambda$ and $\eta$ are hyper parameters. In order to determine a suitable choice for $\boldsymbol{\omega}$, assume the standard regression setting $\mathbf{u} = f(\mathbf{x}) + \epsilon$, where $\epsilon$ is zero mean and independent of $\mathbf{x}$. Since the spatial region $\Omega$ is bounded, $f$ can be extended periodically over $\mathbb{R}^2$. That is, the relation $f(x + m\tau_x, y + n\tau_y) = f(x, y)$ is imposed on $f$ for all $\mathbf{x} = (x, y) \in \mathbb{R}^2$, whole numbers $m, n$ and some two-dimensional period $\boldsymbol{\tau} = (\tau_x, \tau_y)$. This means that $f$ can be written as a Fourier series. Thus, $f$ can in theory be approximated arbitrarily well by $\beta$ as the number of terms $K$ tends to infinity. However, since there is only a limited amount of data and $\beta$ contains a limited number of terms, the choice of $\boldsymbol{\omega}$ determines how well $\beta$ can approximate $f$. In the Fourier series based model, we settle on choosing $\boldsymbol{\omega}$ as a square grid with side length $2M + 1$, centered at the origin in the frequency domain:

$$\boldsymbol{\omega} = \left\{\left(\frac{2\pi m}{\tau_x}, \frac{2\pi n}{\tau_y}\right) : -M \le m \le M,\ -M \le n \le M\right\}.$$

As such, the Fourier series based model is a spatial interpolation family with the hyper parameters $\lambda, \eta, M$ and $\boldsymbol{\tau}$. We went with $M = 10$, and $\lambda, \eta, \boldsymbol{\tau}$ were chosen as explained in the upcoming section.

Remark 1.
Minimising the expression in (5) amounts to solving a system of linear equations with respect to the elements of $\boldsymbol{\beta}$. First define the $N \times K$ matrix $S$ such that $S_{n,k} = e^{i\omega_k\cdot\mathbf{x}_n}$ for $k = 1, \ldots, K$ and $n = 1, \ldots, N$. Secondly, define the $N \times 2$ matrix $U_{n,i} = u_{n,i}$ for $i = 1, 2$ and $n = 1, 2, \ldots, N$. Lastly, define three diagonal $K \times K$ matrices $\Lambda$, $D_1$ and $D_2$ such that $\Lambda_{kk} = \lambda\big(\gamma^2\|\omega_k\|^4 + \gamma\|\omega_k\|^2 + 1\big)$ and $(D_i)_{kk} = \sqrt{\eta}\,\omega_{k,i}$ for $i = 1, 2$ and $k = 1, 2, \ldots, K$. Denoting the component vectors of $\boldsymbol{\beta}$ and $U$ by $\boldsymbol{\beta}_i$ and $U_i$ ($i = 1, 2$) respectively, the empirical loss takes the form

$$(D_1\boldsymbol{\beta}_1 + D_2\boldsymbol{\beta}_2)^\dagger(D_1\boldsymbol{\beta}_1 + D_2\boldsymbol{\beta}_2) + \sum_{i=1,2} \left[\frac{1}{N}(S\boldsymbol{\beta}_i - U_i)^\dagger(S\boldsymbol{\beta}_i - U_i) + \boldsymbol{\beta}_i^\dagger \Lambda \boldsymbol{\beta}_i\right],$$

where $A^\dagger$ denotes the Hermitian conjugate of the matrix $A$. Differentiating with respect to the real and imaginary parts of $\boldsymbol{\beta}$, setting to zero and adding the equations together results in the system

$$\left(\tfrac{1}{N} S^\dagger S + \Lambda + D_1^\dagger D_1\right)\boldsymbol{\beta}_1 + D_1^\dagger D_2\, \boldsymbol{\beta}_2 = \tfrac{1}{N} S^\dagger U_1,$$
$$\left(\tfrac{1}{N} S^\dagger S + \Lambda + D_2^\dagger D_2\right)\boldsymbol{\beta}_2 + D_2^\dagger D_1\, \boldsymbol{\beta}_1 = \tfrac{1}{N} S^\dagger U_2.$$

This is a linear system of equations which can be solved, for example, by singular value decomposition. The most noticeable difference from Tikhonov regression is the interaction between $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ due to the divergence penalty.

3.1.2. Random Fourier features.
The random Fourier features model and the Fourier series based approach start off similarly:

$$\beta(\mathbf{x}) = \sum_{k=1}^{K} \beta_k e^{i\omega_k\cdot\mathbf{x}}, \qquad \omega_k \in \mathbb{R}^2,\ \beta_k \in \mathbb{C}^2,\ k = 1, 2, \ldots, K.$$

The difference is that instead of just optimising with respect to $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_K)$, random Fourier features aims at solving the harder problem

$$\min_{\boldsymbol{\beta}, \boldsymbol{\omega}} \mathbb{E}\big[\ell(\beta(\mathbf{x}), \mathbf{u})\big].$$

That is, we also want to optimise with respect to $\boldsymbol{\omega}$. In this way, random Fourier features is similar to a fully connected neural network with one hidden layer and the activation function $x \mapsto e^{ix}$. The optimisation is traditionally done using some greedy method such as stochastic gradient descent, but in random Fourier features the expected loss is approximated by viewing $\boldsymbol{\omega}$ as a random variable:

$$\min_{\boldsymbol{\beta}, \boldsymbol{\omega}} \mathbb{E}\big[\ell(\beta(\mathbf{x}), \mathbf{u})\big] \le \mathbb{E}\left[\min_{\boldsymbol{\beta}} \mathbb{E}\big[\ell(\beta(\mathbf{x}), \mathbf{u}) \mid \boldsymbol{\omega}\big]\right].$$

We use the loss function defined in (4). Furthermore, we assume the same standard regression setting $\mathbf{u} = f(\mathbf{x}) + \epsilon$ as presented in the previous section, where $\epsilon \in \mathbb{R}^2$ is zero mean and independent of $\mathbf{x}$, and $f$ is a periodic function. Lastly, the elements of $\boldsymbol{\omega}$ are assumed to be independent and identically distributed according to the density $\rho$. In the Appendix we derive an upper bound

$$\mathbb{E}\left[\min_{\boldsymbol{\beta}} \mathbb{E}\big[\ell(\beta(\mathbf{x}), \mathbf{u}) \mid \boldsymbol{\omega}\big]\right] \le \frac{\lambda C(2\pi)^2}{K} \sqrt{\mathbb{E}\left[\frac{\|\hat f(\omega)\|^2}{\rho(\omega)^2}\right]} - \frac{1}{K}\,\mathbb{E}\big[\|f(\mathbf{x})\|^2\big] + \mathbb{E}\big[\|\epsilon\|^2\big],$$

using the above assumptions, as well as assuming that the distribution $\rho$ is discrete and has bounded moments up to a degree determined by the order of the derivatives used in the regularisation. Furthermore, we show that this upper bound is minimised by choosing $\rho(\omega) \propto \|\hat f(\omega)\|$, where $\hat f(\omega)$ are the Fourier coefficients of the Fourier series expansion of $f$.
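A one-dimensional toy illustration of this optimal density may help: on a discrete frequency grid, $\rho \propto \|\hat f\|$ concentrates the sampling on the frequencies that actually carry energy. The signal, grid size and all constants below are made up for the example (the paper's setting is two-dimensional):

```python
import numpy as np

# Toy 1D periodic signal with energy at frequencies +-3 (strong) and +-7 (weak).
rng = np.random.default_rng(0)
n = 64
x = np.linspace(0.0, 1.0, n, endpoint=False)
f = 2.0 * np.sin(2 * np.pi * 3 * x) + 0.5 * np.sin(2 * np.pi * 7 * x)

f_hat = np.fft.fft(f) / n                   # discrete Fourier coefficients
rho = np.abs(f_hat) / np.abs(f_hat).sum()   # optimal density: rho ∝ |f_hat|
freqs = np.fft.fftfreq(n, d=1.0 / n)        # integer frequency labels

# Frequencies with large |f_hat| (here +-3) are drawn most often.
samples = rng.choice(freqs, size=2000, p=rho)
```

In the wind field setting $\|\hat f\|$ is unknown, which is exactly why the adaptive Metropolis sampler of the next paragraph is needed.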
Since the function $\|\hat f(\omega)\|$ is not known a priori, this distribution has to be approximated somehow. The authors of [2] present an adaptive Metropolis algorithm for sampling from $\rho$ in a related setting. The main differences are that their $f$ is $L^1$ integrable on the entirety of $\mathbb{R}^d$ and has a one-dimensional range. Drawing from the work in [2], we devise an adaptive Metropolis algorithm for sampling the frequencies $\omega$ from $\|\hat f\|$, as seen in Algorithm 1. The random Fourier features algorithm can be parametrised as $\{\beta_\theta : \theta \in \Theta\}$, where the hyper parameter $\theta$ consists of the number of Fourier frequencies $K$, the number of steps $B$, the periodicity $\boldsymbol{\tau}$, the regularisation parameters $\lambda$ and $\eta$, as well as the parameters $\sigma$ and $\gamma$ in the Metropolis sampler. Similarly to the work [1], $\boldsymbol{\tau}$ was chosen between two and three times the size of the region of interest, $\boldsymbol{\tau} = (4\cdot 10^{6}, 4\cdot 10^{6})$ in meters. The number of frequencies was fixed to $K = 400$ and the number of steps was fixed to $B = 500$. Given a $d$-dimensional feature space, [2] shows that the optimal choice for the exponent $\gamma$ as $K, N \to \infty$ with a fixed computational work is $3d - 2$. Furthermore, the classical result from [15] shows that the optimal standard deviation of the proposal kernel for a general random walk Metropolis-Hastings algorithm is $\sigma \approx 2.4/\sqrt{d}$.
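The sampling scheme of Algorithm 1 below, combined with the amplitude fit of Remark 1, can be sketched as follows. This is a simplified illustration, not the paper's implementation: the divergence coupling between the two velocity components is dropped (only the Sobolev ridge is kept), and all defaults and names are invented for the example:

```python
import numpy as np

def fit_amplitudes(X, U, omega, lam, gamma):
    """Least-squares amplitudes for fixed frequencies. Sobolev penalty only;
    the divergence coupling of Remark 1 is omitted in this sketch."""
    N = len(X)
    S = np.exp(1j * X @ omega.T)                        # S[n, k] = e^{i omega_k . x_n}
    w2 = np.sum(omega ** 2, axis=1)
    Lam = lam * (gamma ** 2 * w2 ** 2 + gamma * w2 + 1.0)
    A = S.conj().T @ S / N + np.diag(Lam)               # normal equations with ridge
    return np.linalg.solve(A, S.conj().T @ U / N)       # (K, 2) complex amplitudes

def adaptive_rff(X, U, K=50, B=100, sigma=1.0, gamma_exp=4.0, lam=1e-2, seed=0):
    """Sketch of Algorithm 1: integer random-walk Metropolis on the frequencies,
    accepting per frequency with probability min(1, (||beta'_k||/||beta_k||)^gamma)."""
    rng = np.random.default_rng(seed)
    omega = np.zeros((K, 2))                            # start at zero frequencies
    beta = fit_amplitudes(X, U, omega, lam, 1.0)
    for _ in range(B):
        prop = omega + np.rint(sigma * rng.standard_normal((K, 2)))
        beta_p = fit_amplitudes(X, U, prop, lam, 1.0)
        ratio = (np.linalg.norm(beta_p, axis=1)
                 / np.maximum(np.linalg.norm(beta, axis=1), 1e-300)) ** gamma_exp
        accept = rng.uniform(size=K) < ratio
        omega[accept] = prop[accept]
        beta = fit_amplitudes(X, U, omega, lam, 1.0)    # refit after the moves
    return lambda Xq: np.real(np.exp(1j * Xq @ omega.T) @ beta)
```

Frequencies whose refitted amplitudes grow are kept, which drives the empirical frequency distribution towards $\rho \propto \|\hat f\|$.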
Algorithm 1: Discrete random Fourier features with Metropolis sampling

input: Rescaled data $\{\mathbf{x}_n, \mathbf{u}_n\}_{n=1}^{N} \subset [0, 1]^2 \times \mathbb{R}^2$
output: Random features $\mathbf{x} \mapsto \sum_{k=1}^{K} \beta_k e^{i\omega_k\cdot\mathbf{x}}$

$K \leftarrow$ choose the number of frequencies;
$B \leftarrow$ choose the number of steps;
$\sigma \leftarrow$ choose a standard deviation for the proposal kernel;
$\gamma \leftarrow$ choose the exponent for the transition probability;
$\lambda \leftarrow$ choose the Sobolev regularisation parameter;
$\eta \leftarrow$ choose the divergence regularisation parameter;
$\boldsymbol{\omega} \leftarrow$ the zero vector;
$\boldsymbol{\beta} \leftarrow$ minimiser of the empirical loss (5) given $\boldsymbol{\omega}$;
for $b \leftarrow 1, \ldots, B$ do
    $r_N \leftarrow$ standard normal random vector;
    $r \leftarrow$ round the elements of $\sigma r_N$ to the nearest integer;
    $\boldsymbol{\omega}' \leftarrow \boldsymbol{\omega} + r$;
    $\boldsymbol{\beta}' \leftarrow$ minimiser of the empirical loss (5) given $\boldsymbol{\omega}'$;
    for $k \leftarrow 1, \ldots, K$ do
        $\alpha \leftarrow$ sample from the uniform distribution on $[0, 1]$;
        if $\|\beta_k'\|^\gamma / \|\beta_k\|^\gamma > \alpha$ then
            $\omega_k \leftarrow \omega_k'$;
            $\beta_k \leftarrow \beta_k'$;
        end
    end
end
$\boldsymbol{\beta} \leftarrow$ minimiser of the empirical loss (5) given $\boldsymbol{\omega}$;
return $\mathbf{x} \mapsto \sum_{k=1}^{K} \beta_k e^{i\omega_k\cdot\mathbf{x}}$

Since the theoretical results listed above hinge on limit cases, they are not guaranteed to hold for the sparse wind measurements. Therefore, a partial grid search minimisation of the unexplained variance $E(\beta_\theta)$ from (1) was carried out in a neighbourhood of the theoretically optimal values for $\gamma$ and $\sigma$, keeping $\boldsymbol{\tau}$, $\lambda$ and $\eta$ fixed. Given estimates of the optimal $\gamma$ and $\sigma$, a similar grid search was then carried out on the parameters $\lambda$ and $\eta$. The number of steps was adjusted to ensure convergence of the Markov chain, while also keeping the computation time to a minimum. Note that the random Fourier features algorithm is not guaranteed to find the optimal frequencies. There is ongoing research in this area, and an iterative greedy method which finds the optimal frequencies is currently in the works [16].

3.2. Benchmarking models.
The following interpolation models were used for benchmarking: nearest neighbours, inverse distance weighting (IDW for short), Kriging, random forest, neural networks, and the Fourier series based model introduced in the previous section. This section serves as a short introduction to each of the methods, as well as motivation as to why they are relevant. We use the same notation $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{u}_n) \in \Omega \times \mathbb{R}^2 : n = 1, 2, \ldots, N\}$ for the measurements as in Section 2.2.

3.2.1. Inverse distance weighting.
Inverse distance weighting is an interpolation family $\{f^p : p \ge 0\}$ of methods which are evaluated at a point $\mathbf{x}$ by taking a weighted average of the velocity measurements in $\mathcal{D}$. Specifically, the weights $\alpha(\mathbf{x}_n, \mathbf{x})$ for a model $f^p$ from the IDW family are proportional to $1/d(\mathbf{x}_n, \mathbf{x})^p$, where $d: \Omega \times \Omega \to [0, \infty)$ is a distance on $\Omega$. That is,

$$f^p_{\mathcal{D}}(\mathbf{x}) = \frac{\sum_{n=1}^{N} \alpha(\mathbf{x}_n, \mathbf{x})\, \mathbf{u}_n}{\sum_{n=1}^{N} \alpha(\mathbf{x}_n, \mathbf{x})}, \qquad \alpha(\mathbf{x}_n, \mathbf{x}) = \frac{1}{d(\mathbf{x}_n, \mathbf{x})^p}.$$

The singularities at $\mathbf{x} = \mathbf{x}_n$, $n = 1, 2, \ldots, N$, are removed by setting $f^p_{\mathcal{D}}(\mathbf{x}_n) = \mathbf{u}_n$. The hyper parameter $p$ adjusts the amount of influence each data point has over its immediate surroundings. Letting $p = 0$ will result in all points weighing equally everywhere, i.e. taking the arithmetic mean of the data. Letting $p \to \infty$ will result in the nearest neighbour method. Usually, $p$ is chosen somewhere in between. A common shortcoming of IDW is that the interpolated values are bounded by the maximum and minimum values of the data, and therefore IDW fails to predict unobserved extreme points. The main benefits are interpretability and relatively short training time. In this report, two values of $p$ were tested, namely $p = 2$ and $p = \infty$ (i.e. nearest neighbours). Furthermore, we used the horizontal distance between points, ignoring altitudes.

3.2.2. Kriging.
Kriging is a statistical approach to spatial interpolation. The true velocity $\mathbf{u}$ is assumed to satisfy the equality

$$\mathbf{u}(\mathbf{x}) = \mu(\mathbf{x}) + \epsilon_{\mathbf{x}},$$

where $\mu: \Omega \to \mathbb{R}^2$ is a deterministic function and $\epsilon_{\mathbf{x}}$ is a zero mean stochastic process over $\mathbf{x}$ with a specific covariance structure given by the covariogram $C$:

$$\operatorname{Cov}(\epsilon_{\mathbf{x}}, \epsilon_{\mathbf{x}'}) = \mathbb{E}[\epsilon_{\mathbf{x}} \epsilon_{\mathbf{x}'}] = C(|\mathbf{x} - \mathbf{x}'|).$$

Kriging works by first using a deterministic model to estimate the mean $\mu$ using the data, resulting in a trained model $\mu_{\mathcal{D}}$. For each $(\mathbf{x}_n, \mathbf{u}_n) \in \mathcal{D}$, the residuals $\epsilon_{\mathbf{x}_n}$ are estimated as $\epsilon_{\mathbf{x}_n} \approx \epsilon_n = \mathbf{u}_n - \mu_{\mathcal{D}}(\mathbf{x}_n)$. Next, the covariogram is fitted using the residuals $\epsilon_n$. The last part of the fit is, given an input $\mathbf{x}$, to find a linear combination $\sum_n \omega_n \epsilon_n = \omega^T \epsilon$ such that the weights $\omega$ minimise the expected square error $\mathbb{E}[(\epsilon_{\mathbf{x}} - \sum_n \omega_n \epsilon_{\mathbf{x}_n})^2]$. This is equivalent to solving the linear system of equations

$$\sum_{m=1}^{N} C(|\mathbf{x}_n - \mathbf{x}_m|)\, \omega_m = C(|\mathbf{x}_n - \mathbf{x}|), \qquad n = 1, 2, \ldots, N.$$

Thus, defining the $N \times N$ matrix $C_{mn} = C(|\mathbf{x}_m - \mathbf{x}_n|)$ and the $N \times 1$ vector $c_n(\mathbf{x}) = C(|\mathbf{x}_n - \mathbf{x}|)$ for $n, m = 1, 2, \ldots, N$, the final estimation $f_{\mathcal{D}}$ of $\mathbf{u}$ takes the form

$$f_{\mathcal{D}}(\mathbf{x}) = \mu_{\mathcal{D}}(\mathbf{x}) + \epsilon^T C^{-1} c(\mathbf{x}).$$

By also modelling the so-called variogram $\mathbb{E}[(\epsilon_{\mathbf{x}} - \epsilon_{\mathbf{x}'})^2]$, it is possible to calculate the variance of the residuals $\epsilon_{\mathbf{x}}$ and thereby obtain an estimate of the uncertainty at each point $\mathbf{x}$. Furthermore, Kriging can be combined with virtually any other unbiased deterministic interpolation model by interpreting it as the mean $\mu(\mathbf{x})$. However, when the statistical assumptions do not hold, Kriging can perform poorly.
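The residual-weighting step above can be sketched as a toy simple-Kriging-style helper. The `mu`/`cov` interface is an assumption for the example; the actual experiments use pykrige's Universal Kriging:

```python
import numpy as np

def kriging_predict(X, U, mu, cov, Xq):
    """Deterministic mean plus covariogram-weighted residual interpolation.

    X: (N, 2) station coordinates, U: (N, 2) velocities, Xq: (Q, 2) query points.
    mu(Z) -> (len(Z), 2) deterministic mean; cov(h) is the covariogram C(|x - x'|).
    (Hypothetical interface; a sketch of the equations above only.)
    """
    eps = U - mu(X)                                            # residuals at stations
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    C = cov(d)                                                 # C[m, n] = C(|x_m - x_n|)
    dq = np.linalg.norm(Xq[:, None, :] - X[None, :, :], axis=-1)
    c = cov(dq)                                                # c[q, n] = C(|x_n - x_q|)
    w = np.linalg.solve(C, c.T)                                # weights per query point
    return mu(Xq) + w.T @ eps                                  # mu(x) + eps^T C^{-1} c(x)
```

With a positive definite covariogram such as $C(h) = e^{-h}$, the predictor interpolates: querying at a station returns the measured value there.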
Machine learning methods such as random forests have been successful in beating Kriging for various spatial interpolation tasks, see for example [9, 8]. The authors in [8] argue that even though Kriging might be redundant in terms of accuracy, it remains a valuable tool for understanding data, exactly because of its statistical properties and interpretability. We used a version of Kriging called Universal Kriging, which differs from ordinary Kriging in that a linear regression approximation of the mean $\mu(\mathbf{x})$ and the residuals $\epsilon_{\mathbf{x}}$ are fitted to the data simultaneously, resulting in a joint system of equations for all the weights. The Python package pykrige was used to implement Universal Kriging with a linear variogram.

3.2.3. Random forest.
Random forests are constructed by averaging an ensemble of regression trees. Each tree is trained on a random sample of the data $\mathcal{D}$, and each split in the tree is chosen by randomly selecting one feature out of the input features [17]. Random forests have been used successfully in spatial interpolation problems, for example to predict temperatures on and around Kilimanjaro, Tanzania [9], as well as mineral concentrations [8]. The main drawback of random forests is lack of interpretability. In this report we used a random forest with 200 trees, with mean square loss for splitting and unlimited tree depth. The forest was implemented in scikit-learn. Furthermore, the random forest was trained on a polynomial feature map $\varphi$ of the horizontal coordinates $\mathbf{x} = (x, y)$ and the altitude $z$. Specifically, $\varphi$ maps the coordinates to all polynomial features $x^{p_1} y^{p_2} z^{p_3}$ with a total order $p_1 + p_2 + p_3 \le 3$. That is, the trained model $f_{\mathcal{D}}$ is a composition $h_{\mathcal{D}} \circ \varphi$ of the feature map $\varphi$ and the random forest $h_{\mathcal{D}}$.

3.2.4. Feedforward neural network.
Feedforward neural networks have been used extensively in different areas of applied mathematics. In this report, only a specific family of feedforward neural networks was considered. Namely, the networks are characterised by 3 fully connected hidden layers, a constant number of $n$ nodes in each layer and the ReLU activation function. The input layer consists of the three spatial coordinates $x$, $y$ and $z$. The weights were optimised using the Adam algorithm [18] with respect to the $L^2$-regularised loss. The network was implemented in TensorFlow, and hyper parameter optimisation of the number of nodes and the regularisation parameter was done with a grid search on the unexplained variance.

3.2.5.
Weighted linear combination of models.
Given a set of $n$ interpolation models $\{f^1, f^2, \ldots, f^n\}$ and a data set $\mathcal{D}_t$, we can improve on the individual models by forming a weighted average

$$f_{\mathcal{D}_t} = \alpha_1 f^1_{\mathcal{D}_t} + \alpha_2 f^2_{\mathcal{D}_t} + \cdots + \alpha_n f^n_{\mathcal{D}_t},$$

where $\alpha_1, \alpha_2, \ldots, \alpha_n$ are the weights. The weights are regarded as hyper parameters. If the number of models $n$ is not too high, there is little risk of overfitting, and the hyper parameters can simply be fitted directly to minimise the quality of fit. In Section 4, we use this method to combine the random forest and random Fourier features models.

4. Results
We begin the results section by establishing a suitable choice for the number of time samples |K|, as discussed in Section 2.3. We are looking to satisfy two main conditions. First, Q̃ needs to be approximately normally distributed. As seen from Figure 2, the central limit theorem seems to hold for estimates of Ẽ with sample sizes |K| > 50. Using normality of Q̃ and Ẽ means that two standard deviations correspond to the 5th percentile confidence intervals. Secondly, |K| needs to be sufficiently large for the error to be reasonably small. We went with |K| = 500, corresponding to Var(Ẽ) ≈ …, as mentioned in Section 2.3. The quality of fit measurements reported in Table 1 were all obtained using this sample size, and the reported uncertainty corresponds to two standard deviations, estimated according to (2). As the table shows, the confidence bounds vary slightly depending on the model. The same sample size was also used for the differences ΔQ̃ between the quality of fit of the benchmarking models and the random Fourier features model listed in Table 2, as well as the hyperparameter grid searches shown in Figures 5 and 6. The optimal hyperparameters for the random Fourier features model were 0.01 for the Sobolev regularisation constant λ, 0.001 for the divergence penalty η, and 2.25 for the step size σ in the proposal kernel of the adaptive Metropolis algorithm (see Algorithm 1), with an optimal exponent γ close to 1. Additionally, running the Metropolis algorithm for B = 500 steps struck a good balance between convergence and computation time. We omit the hyperparameter optimisation results for the benchmarking models, which were all performed using the same type of grid search methods.
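Once the frequencies ω_k are fixed, training the amplitudes of the random Fourier features model reduces to a regularised least-squares problem. The following is a hypothetical sketch, not the paper's code: the data are synthetic, the frequency set is a fixed stand-in rather than Metropolis-sampled, and the diagonal penalty P is a crude stand-in for the Sobolev term weighted by λ = 0.01.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_rff(X, U, omegas, lam):
    """Fit amplitudes beta_k of sum_k beta_k * exp(i * omega_k . x) to the
    targets U by penalised least squares. X: (N,2), U: (N,), omegas: (K,2)."""
    A = np.exp(1j * X @ omegas.T)                     # (N, K) feature matrix
    # Crude Sobolev-like penalty: weight grows with |omega|^2 (an assumption).
    P = np.diag(lam * (1.0 + np.sum(omegas**2, axis=1)))
    return np.linalg.solve(A.conj().T @ A + P, A.conj().T @ U)

def predict(X, omegas, beta):
    return np.real(np.exp(1j * X @ omegas.T) @ beta)

# Synthetic scalar "wind component" on [0, 2*pi]^2 with measurement noise.
X = rng.uniform(0.0, 2*np.pi, size=(200, 2))
U = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)

# Small fixed set of integer frequencies (stand-in for sampled ones).
omegas = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0],
                   [0.0, -1.0], [0.0, 0.0], [2.0, 0.0]])
beta = fit_rff(X, U, omegas, lam=0.01)
err = np.mean((predict(X, omegas, beta) - U)**2)   # close to the noise level
```

Because the target sin(x₁) is exactly representable by the frequencies (±1, 0), the residual error ends up near the noise floor; the penalty only slightly shrinks the amplitudes.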
Figure 2.
A histogram of 5000 estimations of Q̃(f) = (1/|K|) Σ_{k∈K} Q̃_k(f) with |K| = 50, bootstrapped from a total of 500 samples of Q̃_k. The model f is the Fourier series, and the orange line is a fitted normal distribution.
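The bootstrap behind Figure 2 is straightforward to reproduce in outline. A sketch under illustrative assumptions (the 500 per-time samples Q̃_k are synthetic stand-ins here, drawn from a skewed distribution):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for the 500 observed per-time quality-of-fit samples Q_k.
q_samples = rng.gamma(shape=2.0, scale=1.0, size=500)

# Resample |K| = 50 values with replacement, 5000 times, and average each
# draw; by the central limit theorem the histogram of these means is
# approximately normal, which is what Figure 2 checks.
means = np.array([
    rng.choice(q_samples, size=50, replace=True).mean()
    for _ in range(5000)
])

# Two standard deviations of the bootstrap means give the kind of margin
# used for the confidence intervals reported in the text.
margin = 2 * means.std()
```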
Figure 3.
Autocorrelation of the east-west component u of the wind u for the available weather stations, up to a lag of 300 hours. Each red line represents the autocorrelation of one weather station.
Figure 4.
Autocorrelation of the north-south component v of the wind u for the available weather stations, up to a lag of 300 hours. Each red line represents the autocorrelation of one weather station.
Figure 5.
Fraction of unexplained variance Ẽ in the random Fourier features method, as a function of the Sobolev penalty λ and divergence penalty η explained in (4), with the remaining hyperparameters τ, σ and γ held fixed.
Figure 6.
Fraction of unexplained variance Ẽ as a function of the exponent γ and step size σ in the adaptive Metropolis algorithm, with the remaining hyperparameters τ, η and λ held fixed.

Figure 7.
Ridge line plot showing histograms of the conditional fraction of unexplained variance Ẽ_k(f) = Q̃_k(f)/Q̃_k(0), k ∈ K, with |K| = 500 independently sampled times, for different interpolation models f. The fraction of unexplained variance Ẽ(f) = (1/|K|) Σ_{k∈K} Ẽ_k(f) is indicated as a black vertical line, and the uncertainty margin for Ẽ(f) is indicated with dashed black lines.
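Figures 8-10 below visualise the frequency sampling. The Metropolis step that generates such trajectories can be sketched as follows. This is a hedged illustration, not Algorithm 1 itself: in the actual method the unnormalised target is derived from the fitted amplitudes, whereas here a stand-in target ρ(ω) ∝ exp(−|ω|) is used, together with the step size σ = 2.25 reported in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(target, omega0, sigma, n_steps):
    """Random-walk Metropolis: propose omega' = omega + sigma * N(0, I),
    accept with probability min(1, target(omega') / target(omega))."""
    omega = np.asarray(omega0, dtype=float)
    chain = [omega.copy()]
    for _ in range(n_steps):
        proposal = omega + sigma * rng.standard_normal(omega.shape)
        if rng.random() < target(proposal) / target(omega):
            omega = proposal
        chain.append(omega.copy())
    return np.array(chain)

# Stand-in unnormalised target density over frequencies (an assumption).
rho = lambda w: np.exp(-np.linalg.norm(w))

chain = metropolis(rho, [0.0, 0.0], sigma=2.25, n_steps=500)
print(chain.shape)   # (501, 2): the initial state plus B = 500 steps
```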
Figure 8.
Approximation of the optimal sampling density ρ*(ω) for the random Fourier features algorithm outlined in Algorithm 1, for measurement data from January 1st, 2018 at 6 am. The distribution was estimated by collecting all the frequencies from 1000 steps of the adaptive Metropolis algorithm into a histogram.

Figure 9.
Heat map showing the magnitudes ||β|| of the Fourier terms βe^{iω^T x} as a function of the frequencies ω for the Fourier series based model described in Section 3.1.1, for measurement data from January 1st, 2018 at 6 am. The support is a square grid with width 41, centered at the origin.

Figure 10.
Plot showing the trajectory of one frequency in red, sampled using the random Fourier features algorithm outlined in Algorithm 1. The algorithm was run on measurement data from January 1st, 2018 at 6 am. The remaining 400 trajectories are shown as black semi-transparent line plots. Some of the trajectories extend beyond the axis limits. A coordinate (m, n) in the left plot represents an imaginary term e^{iπ(m x₁/τ + n x₂/τ)} in the Fourier series.

Interpolation model            Ẽ [1]       Q̃ [m² s⁻²]
Nearest neighbours             0.628 ± …   …
…                              …           …
Random Fourier features (FF)   …           …

Table 1. Quality of fit measurements Q̃ and Ẽ with 5th percentile confidence intervals for a number of different models. The dimension is indicated in the top row, in square brackets.

Interpolation model            ΔẼ [1]      ΔQ̃ [m² s⁻²]
Nearest neighbours             0.258 ± …   …
…                              …           …

Table 2. Difference in quality of fit ΔQ̃ and ΔẼ with 5th percentile confidence intervals for the benchmark models relative to the random Fourier features model (FF). The dimension is indicated in the top row, in square brackets. A positive number means that the given model is worse than random Fourier features.

5. Discussion
Table 1 shows that the model consisting of an average between the random forest and random Fourier features models performed the best out of the tested models. Table 2, which lists the differences in quality of fit between the random Fourier features model and the remaining models, indicates that this ordering is statistically significant, since none of the confidence intervals overlap with zero. Specifically, the transition from the fixed frequencies in the Fourier series based model to the randomly sampled frequencies of the random Fourier features results in a significant improvement. The numerical experiments found in [2] indicate that the random Fourier features model wins out when more data is introduced. This is because the high frequency details can be captured by exploring remote parts of the frequency domain, whereas a fixed grid of frequencies centered on the origin cannot. The success of the random forest-random Fourier features average hints that there might be more potential for improving the results using similar types of mixture models. Nevertheless, it is clear that the random Fourier features model is competitive in comparison to the benchmark models. A common argument for favouring statistical models such as Kriging over machine learning methods is easy access to precise error analysis, given that the prerequisite assumptions discussed in Section 3.2.2 hold. Whether or not this is worth trading in exchange for higher accuracy depends on the situation. Furthermore, the random Fourier features model has the upside of being easy to manipulate once it has been trained. It can be efficiently differentiated and integrated for analysis of physical quantities such as divergence, vorticity and energy.

A fraction of unexplained variance of 0.36 is not immediately recognised as a good result. For example, the works [7, 5, 6] achieve an unexplained variance in the order of 0.001 modelling 2D fluid flow around a cylinder as well as mean sea temperatures. The main difference is that our data is extremely sparse, that there are no high resolution grids to train the models on, and that wind data in particular is known for high variance over time and space. The authors of [7] point out [7, p. 18] that their model generalises poorly to unseen data if the flow is non-stationary. It stands to reason that for their models to achieve the same efficiency on the sparse wind data used in this report, higher resolution in-sample measurements or simulations are required for training. That being said, we know from Figure 3 that the wind measurements are highly correlated over time, with clear seasonality. The models used in [7, 5, 6] are trained on multiple time samples whereas the spatial interpolation models used in our work only use the time aspect for hyperparameter optimisation. Therefore, extending the spatial interpolation models to take time into account could improve the results.
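The weighted average referred to above (Section 3.2.5) is simple to implement. A minimal sketch, assuming hypothetical stand-in models `model_rf` and `model_ff` (not the paper's code):

```python
import numpy as np

def combine(models, weights):
    """Return a model that is the weighted average of `models`."""
    weights = np.asarray(weights, dtype=float)

    def combined(x):
        preds = np.stack([m(x) for m in models])   # (n_models, ...)
        return np.tensordot(weights, preds, axes=1)
    return combined

# Hypothetical stand-ins for two fitted interpolation models.
model_rf = lambda x: np.sin(x)
model_ff = lambda x: np.sin(x) + 0.1

f = combine([model_rf, model_ff], [0.5, 0.5])
print(f(np.array([0.0])))   # midway between the two model predictions
```

In the paper the weights are hyperparameters fitted to the quality of fit; here they are fixed at 0.5 each purely for illustration.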
Another way of improving the results is to simply increase the quality or number of measurements by incorporating weather stations from nearby regions such as the Baltic sea, Finland, Norway and Denmark. The random Fourier features model is constructed to efficiently find high frequency details in the target function, which makes it suitable when increasing the resolution of the data. Additionally, using more features such as terrain convexity, terrain slope, terrain roughness, altitude, air pressure, temperature and humidity might allow for more accurate predictions (see [9]). We found that the wind speed was highly correlated with altitude as well as closeness to coastal areas, and adding altitudes to the feature space of the random forest resulted in a significant increase in accuracy.

As seen in Figure 5, the divergence penalty does not have much influence over the results. Although a compelling idea, it makes sense that penalising the divergence on a 2D slice of the wind field would not improve the results, since wind flow is not necessarily parallel with the ground. We suggest two alternative approaches that could produce significant improvements. Firstly, by incorporating altitude as a third feature in the random Fourier features model we go from a 2D to a 3D setting in which zero divergence is a better approximation of reality. An alternative approach is to model the wind flow in two layers, and to couple the two layers by penalising a weighted sum of the divergence. More advanced methods can possibly incorporate no-slip or slipping boundary conditions on the ground surface, like the methods explored in [1].
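A side note on the divergence penalty discussed above: for a Fourier model, differentiation acts as multiplication by iω, so the divergence is available in closed form. A small sketch under these assumptions (not the authors' implementation):

```python
import numpy as np

# For beta(x) = sum_k beta_k * exp(i * omega_k . x) with beta_k in C^2,
# the divergence is div beta(x) = sum_k i * (omega_k . beta_k) * exp(i * omega_k . x).

def divergence(x, omegas, betas):
    """x: (2,) point, omegas: (K,2) frequencies, betas: (K,2) amplitudes."""
    phases = np.exp(1j * omegas @ x)                     # (K,)
    return np.sum(1j * np.sum(omegas * betas, axis=1) * phases)

# A single divergence-free mode: omega . beta = 0 implies div beta == 0.
omega = np.array([[1.0, 2.0]])
beta = np.array([[2.0, -1.0]])          # orthogonal to omega
print(abs(divergence(np.array([0.4, 0.7]), omega, beta)))   # → 0.0
```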
6. Conclusion
In this report, we explored the potential for wind field reconstruction with sparse data using interpolation models. In particular, we investigated the random Fourier features model and compared it to popular statistical interpolation models such as Kriging, as well as modern machine learning methods such as random forests and neural networks. Drawing from the work [2], we derived an optimal density for the Fourier frequencies and devised an adaptive Metropolis algorithm for sampling from this density. We showed that random Fourier features is competitive with respect to a time-space average of the square error and provided some ways to improve the results, such as including terrain specific features.
7. Acknowledgements
The authors of this report would like to thank Prof. Anders Szepessy at KTH for his support and feedback. Furthermore, the work of Dmitry Kabanov, Luis Espath and Andreas Enblom in data processing and programming was integral in realising the project. Jonas Kiessling and Raúl Tempone were partially supported by the KAUST Office of Sponsored Research (OSR) under Award numbers URF/1/2281-01-01 and URF/1/2584-01-01 in the KAUST Competitive Research Grants Program Round 8, and the Alexander von Humboldt Foundation, through the Alexander von Humboldt Professorship award. Emanuel Ström was supported by the KAUST Visiting Student Research Program (VSRP).
8. Appendix
In this proposition, we derive an upper bound for the minimum of the expected loss E[ℓ(β(x), u)] for the Fourier features model. We assume the standard regression setting where u = f(x) + ε and ε ∈ ℝ² is independent of x, with E[||ε||²] = σ². Let f̂(ω) denote the Fourier coefficients of f and suppose for simplicity that f is defined on the domain X = [0, 2π] × [0, 2π]. That is, f can be expressed as the Fourier series

 f(x) = (1/(2π)²) Σ_{ω∈ℤ²} f̂(ω) e^{iω·x}.

For the random Fourier features model, we choose

 β(x) = Σ_{k=1}^{K} β_k e^{iω_k·x},

where β = (β₁, β₂, …, β_K) are the complex-valued two-dimensional coefficients and ω = (ω₁, ω₂, …, ω_K) are independent and identically distributed according to a discrete distribution ρ : ℤ² → [0, ∞). The loss function is defined as

 ℓ(β(x), u) = ||β(x) − u||² + λ||Lβ||²,

where

 ||Lβ||² = ∫_X ||Lβ(x)||² dx,

and L = Σ_{m=1}^{M} c_m ∂₁^{α_{m,1}} ∂₂^{α_{m,2}} is a linear differential operator with derivatives of at most order d (i.e. α_{m,1} + α_{m,2} ≤ d), for example Lβ = ∇·β = (∂₁ + ∂₂)β. The distribution ρ is assumed to lie in a family P of discrete distributions:

 P := { ρ : ℤ² → (0, ∞) | ρ(ω) > 0 and E[|ω_i|^m] < C₀ for 0 ≤ m ≤ 4d, i = 1, 2 },

where C₀ is a positive real number. Thus, ρ ∈ P is strictly positive and has uniformly bounded moments Σ_{ω∈ℤ²} |ω_i|^m ρ(ω) of degree up to 4d. Lastly, we make the assumption that

(6) ||f̂(ω)|| / Σ_{ω′∈ℤ²} ||f̂(ω′)|| ∈ P,

which means that f(x) has to be a member of the Sobolev space W^{d,2}(X).

Proposition 2.
In the above setting, the following holds:

(a) The minimum of E[ℓ(β(x), u)] with respect to the coefficients β can be bounded:

 E[ min_{β∈ℂ^{2×K}} E[ℓ(β(x), u) | ω] ] ≤ ((1 + λC(2π)⁴)/((2π)⁴ K)) √( E[ ||f̂(ω)||⁴ / ρ(ω)⁴ ] ) + σ² − (1/K) E[||f(x)||²],

where C > 0 is a constant that depends only on the operator L and the moment bound in the definition of P; it is given explicitly in the proof.

(b) The bound in (a) is minimised over P by the density

 ρ*(ω) = ||f̂(ω)|| / Σ_{ω′∈ℤ²} ||f̂(ω′)||,  ω ∈ ℤ².

Proof.
We divide the problem into part (a) and part (b) of the proposition.

(a) Let ω := ω₁ and β := β₁ for the sake of brevity, and define

 β_k = f̂(ω_k) / ((2π)² K ρ(ω_k)),  k = 1, 2, …, K.

Using this definition, β(x) is unbiased given x. To show this, we use the iid property of ω:

 E[ Σ_{k=1}^{K} β_k e^{iω_k·x} | x ] = E[ K β₁ e^{iω₁·x} | x ] = Σ_{ω∈ℤ²} K f̂(ω)/((2π)² K ρ(ω)) e^{iω·x} ρ(ω) = (1/(2π)²) Σ_{ω∈ℤ²} f̂(ω) e^{iω·x} = f(x).

Additionally, the expected square error can be simplified using independence of the residuals ε and the data x:

 E[ ||β(x) − u||² | x ] = E[ ||β(x) − (f(x) + ε)||² | x ] = E[ ||(β(x) − f(x)) − ε||² | x ] = E[ ||β(x) − f(x)||² | x ] + σ².
This shows that the expected square error is exactly the variance of β(x), plus the mean square of the noise. The variance can be simplified further since the frequencies ω_k are assumed independent:

 E[ || Σ_{k=1}^{K} β_k e^{iω_k·x} − f(x) ||² ] = (1/K) E[ ||K β e^{iω·x} − f(x)||² ]
  = K E[||β||²] − (1/K) E[||f(x)||²]
  = (1/((2π)⁴ K)) E[ ||f̂(ω)||² / ρ(ω)² ] − (1/K) E[||f(x)||²]
  ≤ (1/((2π)⁴ K)) √( E[ ||f̂(ω)||⁴ / ρ(ω)⁴ ] ) − (1/K) E[||f(x)||²].

The last step is due to the Jensen inequality. Now, turn to the penalty term containing the linear operator L. Applying L to β(x) is equivalent to multiplying each term β_{k,j} e^{iω_k·x}, j = 1, 2, by

 ℓ_j(ω_k) = Σ_{m=1}^{M} c_m (iω_{k,1})^{α_{m,1}} (iω_{k,2})^{α_{m,2}},

a multivariate polynomial in ω_k of degree d. Define r(ω) = |ℓ₁(ω)|² + |ℓ₂(ω)|². Note that r has degree 2d. It follows that

 ||Lβ||² = ∫_X ||Lβ(x)||² dx = ∫_X | Σ_{k=1}^{K} (ℓ₁(ω_k)β_{k,1} + ℓ₂(ω_k)β_{k,2}) e^{iω_k·x} |² dx
  ≤ Σ_{k=1}^{K} |ℓ₁(ω_k)β_{k,1} + ℓ₂(ω_k)β_{k,2}|² ∫_X dx
  = Σ_{k=1}^{K} (2π)² |ℓ₁(ω_k)β_{k,1} + ℓ₂(ω_k)β_{k,2}|²
  ≤ Σ_{k=1}^{K} r(ω_k) ||2πβ_k||²,

where we used the triangle inequality combined with the size (2π)² of the region X = [0, 2π] × [0, 2π].
Taking the expectation, using independence of the elements of ω as well as the Cauchy-Schwarz inequality, we get:

 E[||Lβ||²] ≤ E[ Σ_{k=1}^{K} r(ω_k) ||2πβ_k||² ] = Σ_{k=1}^{K} E[ r(ω_k) ||2πβ_k||² ] = K E[ r(ω) ||2πβ||² ]
  = K E[ r(ω) || 2π f̂(ω) / ((2π)² K ρ(ω)) ||² ]
  = (1/((2π)² K)) E[ r(ω) ||f̂(ω)||² / ρ(ω)² ]
  ≤ (1/((2π)² K)) √( E[r(ω)²] E[ ||f̂(ω)||⁴ / ρ(ω)⁴ ] ).

The multivariate polynomial r(ω)² =: Σ_{m=1}^{M′} r_m ω₁^{α_{m,1}} ω₂^{α_{m,2}} is of order 4d, i.e. α_{m,1} + α_{m,2} ≤ 4d. For each term m, define γ_m := α_{m,1}/(α_{m,1} + α_{m,2}) ∈ (0, 1). By assumption, ρ lies in P, and therefore the expectation E[r(ω)²] can be uniformly bounded as follows:

 E[r(ω)²] = Σ_{m=1}^{M′} r_m E[ ω₁^{α_{m,1}} ω₂^{α_{m,2}} ]
  ≤ Σ_{m=1}^{M′} |r_m| E[ |ω₁|^{(α_{m,1}+α_{m,2})γ_m} |ω₂|^{(α_{m,1}+α_{m,2})(1−γ_m)} ]
  ≤ Σ_{m=1}^{M′} |r_m| ( E[|ω₁|^{α_{m,1}+α_{m,2}}] )^{γ_m} ( E[|ω₂|^{α_{m,1}+α_{m,2}}] )^{1−γ_m}
  ≤ Σ_{m=1}^{M′} |r_m| C₀^{γ_m} C₀^{1−γ_m} = C₀ Σ_{m=1}^{M′} |r_m| =: C² (2π)⁴,

which defines the constant C in the proposition. In the above calculations, we used Hölder's inequality, E[XY] ≤ (E[|X|^p])^{1/p} (E[|Y|^q])^{1/q} for any 1/p + 1/q = 1, termwise with p = 1/γ_m. Collecting all of our results, the expected loss is bounded by

 E[ min_β E[ ||β(x) − u||² + λ||Lβ||² | ω ] ] ≤ ((1 + λC(2π)⁴)/((2π)⁴ K)) √( E[ ||f̂(ω)||⁴ / ρ(ω)⁴ ] ) − (1/K) E[||f(x)||²] + σ².

(b) To derive an optimal choice for ρ, we seek to minimise the expression inside the root.
We can rephrase this problem in terms of a non-normalised function p such that ρ(ω) = p(ω)/Σ_{ω′∈ℤ²} p(ω′):

 minimise ( Σ_{ω∈ℤ²} ||f̂(ω)||⁴ / p(ω)³ ) · ( Σ_{ω∈ℤ²} p(ω) )³.

First, define a real-valued function H(ε), where ε is a real number close to zero:

 H(ε) = ( Σ_{ω∈ℤ²} ||f̂(ω)||⁴ / (p(ω) + εδ(ω))³ ) · ( Σ_{ω∈ℤ²} (p(ω) + εδ(ω)) )³,

where δ is a small arbitrary variation of p. Next, seek a solution p to H′(0) = 0:

 H′(0) = −3 ( Σ_{ω∈ℤ²} (||f̂(ω)||⁴ / p(ω)⁴) δ(ω) ) · ( Σ_{ω∈ℤ²} p(ω) )³ + 3 ( Σ_{ω∈ℤ²} ||f̂(ω)||⁴ / p(ω)³ ) ( Σ_{ω∈ℤ²} p(ω) )² ( Σ_{ω∈ℤ²} δ(ω) ) = 0,

with c₂ := ( Σ_{ω∈ℤ²} p(ω) )³ and c₁ := ( Σ_{ω∈ℤ²} ||f̂(ω)||⁴ / p(ω)³ ) ( Σ_{ω∈ℤ²} p(ω) )².
Defining the constants c₁, c₂ as above, the equation can be rewritten as

 Σ_{ω∈ℤ²} ( ||f̂(ω)||⁴ / p(ω)⁴ − c₁/c₂ ) δ(ω) = 0.

Since δ(ω) is arbitrary here, the expression inside the sum must be zero. Thus,

 ||f̂(ω)||⁴ / p(ω)⁴ − c₁/c₂ = 0 ⟺ p(ω) = (c₂/c₁)^{1/4} ||f̂(ω)||.

Hence, the optimal ρ is

 ρ*(ω) = p(ω) / Σ_{ω′∈ℤ²} p(ω′) = (c₂/c₁)^{1/4} ||f̂(ω)|| / Σ_{ω′∈ℤ²} (c₂/c₁)^{1/4} ||f̂(ω′)|| = ||f̂(ω)|| / Σ_{ω′∈ℤ²} ||f̂(ω′)||.

Lastly, some straightforward calculations show that the optimisation with respect to ρ is a convex optimisation problem (that is, P is a convex set and E[ ||f̂(ω)||⁴ / ρ(ω)⁴ ] is a convex function with respect to ρ). But then, the above derived local minimum must also be a global minimum. QED

References

[1] R. F. Tempone, "Approximation and interpolation of divergence free flows," Master's thesis, Universidad de la República (Uruguay), Facultad de Ingeniería, 1999.
[2] A. Kammonen, J. Kiessling, P. Plecháč, M. Sandberg, and A. Szepessy, "Adaptive random Fourier features with Metropolis sampling,"
Foundations of Data Science, vol. 2, no. 3, p. 309, 2020.
[3] J. A. Carta, P. Ramírez, and S. Velázquez, "A review of wind speed probability distributions used in wind energy analysis: Case studies in the Canary Islands," Renewable and Sustainable Energy Reviews, 2008.
[4] W. Luo, M. C. Taylor, and S. R. Parker, "A comparison of spatial interpolation methods to estimate continuous wind speed surfaces using irregularly distributed data from England and Wales," International Journal of Climatology, vol. 28, no. 7, pp. 947–959, 2008.
[5] J. L. Callaham, K. Maeda, and S. L. Brunton, "Robust flow reconstruction from limited measurements via sparse representation," Physical Review Fluids, vol. 4, Oct. 2019.
[6] X. Jin, S. Laima, W.-L. Chen, and H. Li, "Time-resolved reconstruction of flow field around a circular cylinder by recurrent neural networks based on non-time-resolved particle image velocimetry measurements," Experiments in Fluids, vol. 61, Apr. 2020.
[7] N. Erichson, L. Mathelin, Z. Yao, S. Brunton, M. Mahoney, and J. Kutz, "Shallow neural networks for fluid flow reconstruction with limited sensors," Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 476, p. 20200097, June 2020.
[8] T. Hengl, M. Nussbaum, M. N. Wright, G. B. M. Heuvelink, and B. Gräler, "Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables," 2018.
[9] T. Appelhans, E. Mwangomo, D. R. Hardy, A. Hemp, and T. Nauss, "Evaluating machine learning approaches for the interpolation of monthly air temperature at Mt. Kilimanjaro, Tanzania," Spatial Statistics, 2015.
[10] J. Erxleben, K. Elder, and R. Davis, "Comparison of spatial interpolation methods for estimating snow distribution in the Colorado Rocky Mountains," Hydrological Processes, 2002.
[11] M. Cellura, G. Cirrincione, A. Marvuglia, and A. Miraoui, "Wind speed spatial estimation for energy planning in Sicily: Introduction and statistical analysis," Renewable Energy, 2008.
[12] C. Jung and D. Schindler, "Statistical modeling of near-surface wind speed: A case study from Baden-Wuerttemberg (southwest Germany)," Austin Journal of Earth Science, 2015.
[13] J. Li, A Review of Spatial Interpolation Methods for Environmental Scientists, Jan. 2008.
[14] "Ladda ner meteorologiska observationer, vindhastighet och vindriktning" [Download meteorological observations, wind speed and wind direction]. Accessed: 2020-07-30.
[15] G. O. Roberts and J. S. Rosenthal, "Optimal scaling for various Metropolis-Hastings algorithms," Statistical Science, vol. 16, no. 4, pp. 351–367, Nov. 2001.
[16] L. Espath, J. Kiessling, and D. Kabanov, "Sparse divergence-free Fourier approximations based on discrete L² projections." Unpublished manuscript.
[17] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5–32, Oct. 2001.
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.