Joint Variable Selection of both Fixed and Random Effects for Gaussian Process-based Spatially Varying Coefficient Models
Jakob A. Dambon∗, Fabio Sigrist†, and Reinhard Furrer‡

Department of Mathematics, University of Zurich
Institute of Financial Services Zug, Lucerne University of Applied Sciences and Arts
Department of Computational Science, University of Zurich
February 12, 2021
Abstract
Spatially varying coefficient (SVC) models are a type of regression model for spatial data where covariate effects vary over space. If there are several covariates, a natural question is which covariates have a spatially varying effect and which do not. We present a new variable selection approach for Gaussian process-based SVC models. It relies on penalized maximum likelihood estimation (PMLE) and allows variable selection both with respect to fixed effects and Gaussian process random effects. We validate our approach both in a simulation study and on a real-world data set. Our novel approach shows good selection performance in the simulation study. In the real data application, our proposed PMLE yields sparser SVC models and achieves a smaller information criterion than classical MLE. In a cross-validation applied on the real data, we show that sparser PML estimated SVC models are on par with ML estimated SVC models with respect to predictive performance.
Keywords: adaptive LASSO, Bayesian optimization, coordinate descent algorithm, model-based optimization, penalized maximum likelihood estimation, spatial statistics

∗ Corresponding author. Email: [email protected], Address: Department of Mathematics, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland.
† Email: [email protected], Address: Institute of Financial Services Zug, Campus Zug-Rotkreuz, Suurstoffi 1, 6343 Rotkreuz, Switzerland.
‡ Email: [email protected], Address: Department of Mathematics, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland.

Introduction
Improved and inexpensive measuring devices have led to a substantial increase in spatial point data sets, where each observation is associated with (geographic) information on the observation location, say, a pair of latitude and longitude coordinates. Usually, these data sets not only consist of a large number of observations, but also include several covariates. Spatially varying coefficient (SVC) models are a type of regression model which offers a high degree of flexibility as well as interpretability for spatial data. In contrast to classical geostatistical models (see Cressie, 2011, for an overview), they allow a covariate's marginal effect, i.e., the corresponding coefficient, to vary over space. In Dambon et al. (2021a), a maximum likelihood estimation (MLE) of Gaussian process (GP)-based SVC models is presented. In particular, the presented methodology scales well not only in the number of observations but also in the number of covariates. When modeling such data with SVC models, the large number of potential coefficients leads to the question:
Which of the given covariates do indeed have a non-zero spatially varyingcoefficient?
This is similar to variable selection for classical fixed effects variables but, in addition, we also need to select random Gaussian process coefficients. While the literature on variable selection in general is extensive, it is very limited for SVC models. Variable selection for SVC models is usually done locally or globally over space. For instance, Smith and Fahrmeir (2007) introduce a local selection procedure for SVC models on lattice data. Reich et al. (2010) proposed a Bayesian method that selects SVCs globally, i.e., the coefficient enters the model equation either for all locations or for none. Boehm Vock et al. (2015) present a simultaneous local and global variable selection. There also exist non-model-based approaches for SVC selection, e.g., geographically weighted regression (GWR, Fotheringham et al., 2002). Li and Lam (2018) present a local SVC selection approach based on the elastic net, while Lu et al. (2014) use an information criterion-based global SVC selection for GWR.

Concerning frequentist Gaussian process-based SVC models, the existing variable selection methods, which we list in the following, are exclusively for the fixed effects parts. Huang and Chen (2007) proposed a selection method between smoothing splines with deterministic spatial structure and kriging, which is GP-based. The method relies on generalized degrees of freedom (Ye, 1998) to assess the models. Another variable selection for the fixed effects was proposed by Wang and Zhu (2009). Using penalized least squares, variable selection under a variety of penalty functions is conducted. In particular, the penalty function smoothly clipped absolute deviation (SCAD), suggested by Fan and Li (2001) and used in their simulation study, performs best for variable selection in spatial models. However, the error term of the spatial models is defined as strong mixing, and in all of the numerical examples the sample size is very limited. Chu et al. (2011) provide theoretical results for variable selection in a GP-based geostatistical model using a penalized maximum likelihood. Since MLE is computationally expensive, Chu et al. (2011) also give results for variable selection under covariance tapering (Furrer et al., 2006). Their penalized maximum likelihood estimation (PMLE) procedure – with and without covariance tapering – is a one-step sparse estimation and therefore differs from Fan and Li (2001). The SCAD is the penalty function of choice in Chu et al. (2011), too.

In this article, we present a novel selection methodology for SVC models. The rest of the article is structured as follows. We introduce GP-based SVC models in Section 2. In Section 3, we propose our implementation of the selection problem. Numeric results of both a simulation study and a real data set application are given in Sections 4 and 5, respectively. In Section 6, we conclude.
Let n be the number of observations, let p be the number of covariates x^(j) ∈ R^n, j = 1, ..., p, for a fixed effect, and let q be the number of covariates w^(k) ∈ R^n, k = 1, ..., q, with random coefficients, i.e., spatially varying coefficients. The fixed and random effects covariates do not necessarily need to be identical. Each observation i is associated with an observation location s_i in a domain D ⊂ R^d, d ≥ 1. Further, let y ∈ R^n be the vector of observed responses and ε ∼ N_n(0_n, τ² I_{n×n}) the error term (also called nugget). Without loss of generality, we assume the random coefficients to have zero mean and define a GP-based SVC model as

    y_i = Σ_{j=1}^p μ_j x_i^(j) + Σ_{k=1}^q η_k(s_i) w_i^(k) + ε_i.    (1)

Here, the kth SVC is defined by a zero-mean GP η_k(·) ∼ GP(0, c(·, ·; θ_k)) with covariance function c and covariance parameters θ_k := (ρ_k, σ_k²), i.e., each SVC is parameterized by a range parameter ρ_k and a variance σ_k². We assume additional parameters like the smoothness of the covariance function c to be known and that we can write

    c(s_l, s_m; θ_k) = σ_k² r(||s_l − s_m||_A / ρ_k),    (2)

where r : [0, ∞) → [0, 1] is a correlation function and ||·||_A is an anisotropic geometric norm defined by a positive-definite matrix A ∈ R^{d×d} (Wackernagel, 1995; Schmidt and Guttorp, 2020). Note that (2) covers most commonly used covariance functions, such as the Matérn class or the generalized Wendland class. For instance, the Matérn covariance function is of the form (2) with correlation function

    r(u) = (2^{1−ν} / Γ(ν)) (√(2ν) u)^ν K_ν(√(2ν) u),    (3)

where ν > 0 is the smoothness parameter, u = ||s_l − s_m||_A / ρ is a scaled anisotropic distance, and K_ν is the modified Bessel function of the second kind and order ν. For ν = 1/2, equation (3) reduces to the exponential correlation function r(u) = exp(−u), and the corresponding covariance function is therefore given by c(s_l, s_m; ρ, σ²) = σ² exp(−||s_l − s_m||_A / ρ). Finally, we assume mutual prior independence between the SVCs η_k(·) as well as the nugget ε.

For the observed data, the GPs η_k(·) reduce to finite-dimensional normal distributions with covariance matrices Σ_k defined as (Σ_k)_{l,m} := c(s_l, s_m; θ_k). Thus, we have η_k(s) ∼ N_n(0_n, Σ_k) and, under the assumption of mutual prior independence, the joint effect is given by η(s) ∼ N_{nq}(0_{nq}, Σ_η) with Σ_η := diag(Σ_1, ..., Σ_q) ∈ R^{nq×nq}. We denote two data matrices as X and W, where X ∈ R^{n×p} is such that the jth column is equal to x^(j), and W := (diag(w^(1)), ..., diag(w^(q))) ∈ R^{n×nq}. Let μ := (μ_1, ..., μ_p)^⊤ and θ := (ρ_1, σ_1², ..., ρ_q, σ_q², τ²)^⊤ be the unknown vectors of fixed effects and covariance parameters. Then the GP-based SVC model is given by

    y = Xμ + Wη(s) + ε    (4)

and is parametrized by

    ω := (μ^⊤, θ^⊤)^⊤ ∈ Ω := R^p × (R_{>0} × R_{≥0})^q × R_{>0}.    (5)

The response variable Y follows a multivariate normal distribution Y ∼ N_n(Xμ, Σ_Y(θ)) with covariance matrix

    Σ_Y(θ) := Σ_{k=1}^q (w^(k) w^(k)⊤) ⊙ Σ_k + τ² I_{n×n}    (6)

and has the following log-likelihood:

    ℓ(ω) = −(1/2) ( n log(2π) + log det Σ_Y(θ) + (y − Xμ)^⊤ Σ_Y(θ)^{−1} (y − Xμ) ).    (7)

Dambon et al. (2021a) provide a computationally efficient MLE approach to estimate ω by maximizing (7), which we denote as ω̂^(MLE) := argmax_{ω ∈ Ω} ℓ(ω).

We introduce a penalized likelihood that will induce global variable selection for the GP-based SVC model (4). Related to this, we say that for μ_j ≠ 0 there is an effect of x^(j), and for η_k(s) ≠ 0_n an effect of w^(k), on the response.
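For concreteness, the construction of the covariance matrix (6) and the evaluation of the log-likelihood (7) can be sketched as follows. This is an illustrative Python/NumPy translation under an exponential correlation function with an isotropic Euclidean norm; the paper's actual implementation is the R package varycoef, and all function names here are our own:

```python
import numpy as np

def exp_corr(locs, rho):
    """Exponential correlation r(u) = exp(-u) on range-scaled distances, cf. (2)."""
    d = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    return np.exp(-d / rho)

def cov_y(locs, W_cols, rho, sigma2, tau2):
    """Covariance matrix Sigma_Y(theta) of (6):
    sum_k (w^(k) w^(k)T) * Sigma_k + tau^2 I, with * the Hadamard product."""
    n = locs.shape[0]
    S = tau2 * np.eye(n)
    for w, r, s2 in zip(W_cols, rho, sigma2):
        S += np.outer(w, w) * (s2 * exp_corr(locs, r))
    return S

def log_lik(y, X, mu, locs, W_cols, rho, sigma2, tau2):
    """Gaussian log-likelihood (7), evaluated via a Cholesky factorization."""
    S = cov_y(locs, W_cols, rho, sigma2, tau2)
    L = np.linalg.cholesky(S)
    resid = y - X @ mu
    z = np.linalg.solve(L, resid)                 # solves L z = y - X mu
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log det via Cholesky
    n = y.shape[0]
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + z @ z)
```

The Cholesky route avoids forming Σ_Y(θ)^{−1} explicitly and is the standard way to evaluate (7) stably.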
Note that variable selection for the random coefficients corresponds to choosing between σ_k² > 0 and σ_k² = 0. For the special case that x^(j) = w^(j) for some j, there are 3 possible ways for each covariate to enter a model (Reich et al., 2010):

1. The jth covariate is associated with a non-zero mean SVC, i.e., there exist a non-zero fixed effect and a random coefficient.
2. There only exists a fixed effect μ_j ≠ 0 for the jth covariate, but η_j(s) is identical to 0.
3. The jth covariate enters the model solely through the zero-mean SVC, i.e., μ_j = 0 and η_j(s) is not identical to 0.

The parameters within ω that we penalize are therefore μ_j and σ_k². Given some penalty function p(|·|), we define p_λ(|·|) := λ p(|·|), where λ acts as a shrinkage parameter. We augment the likelihood function (7) with penalties for the mean and variance parameters. In general, each of the parameters μ_j and σ_k² has its corresponding shrinkage parameter λ_j > 0 and λ_{p+k} > 0, respectively, yielding the penalized log-likelihood:

    pℓ(ω) = ℓ(ω) − n Σ_{j=1}^p p_{λ_j}(|μ_j|) − n Σ_{k=1}^q p_{λ_{p+k}}(|σ_k²|).    (8)

For a given set of shrinkage parameters λ_j, 1 ≤ j ≤ p + q, we maximize the penalized likelihood function to obtain a penalized maximum likelihood estimate:

    ω̂^(PMLE) := argmax_{ω ∈ Ω} pℓ(ω).    (9)

Note the similarity to Bondell et al. (2010) and Ibrahim et al. (2011), who present a joint variable selection by individually penalizing the fixed and random effects in linear mixed effects models. Müller et al. (2013) give an overview of such selection and shrinkage methods.

Finally, one usually assumes the penalty function to be singular at the origin and non-concave on (0, ∞) to ensure that ω̂^(PMLE) has favorable properties such as sparsity, continuity, and unbiasedness (Fan and Li, 2001). From now on, we will consider the L1 penalty function, i.e., p(|·|) = |·| (Tibshirani, 1996).

In this section, we show how to find the maximizer of the penalized log-likelihood and how we choose the tuning parameters λ_j.

In the following, we show how a maximizer of the penalized log-likelihood (8) can be found for given λ_j. This is equivalent to minimizing the negative penalized log-likelihood. We propose a (block) coordinate descent (CD) which cyclically optimizes over the mean parameters μ and the covariance parameters θ, i.e.,

    μ^(t+1) := argmin_{μ ∈ R^p} [ −pℓ(μ | θ^(t)) ],    (10)
    θ^(t+1) := argmin_{θ ∈ Θ} [ −pℓ(θ | μ^(t+1)) ],    (11)

for t ≥ 0. The initial values are given by the ML estimate θ^(0) := θ̂^(MLE). Algorithm 1 summarizes this block coordinate descent approach. In the following, we show how the component-wise minimizations (10) and (11) are realized.

Data: shrinkage parameters λ_j, convergence threshold δ, ML estimate θ̂^(MLE), objective function pℓ(·)
Result: ω̂^(PMLE)
Initialize t ← 0, θ^(0) ← θ̂^(MLE);
repeat
    μ^(t+1) ← argmin_{μ ∈ R^p} [ −pℓ(μ | θ^(t)) ];
    θ^(t+1) ← argmin_{θ ∈ Θ} [ −pℓ(θ | μ^(t+1)) ];
    t ← t + 1;
until ||θ^(t) − θ^(t−1)|| / ||θ^(t−1)|| < δ;
ω̂^(PMLE) ← (θ^(t)⊤, μ^(t)⊤)⊤;

Algorithm 1: General coordinate descent algorithm for penalized likelihood.
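The loop of Algorithm 1 can be sketched as a small, generic Python skeleton in which the two block minimizers are passed in as functions; they stand in for the LASSO and quasi-Newton steps described below, and all names here are our own:

```python
import numpy as np

def coordinate_descent(argmin_mu, argmin_theta, theta_mle, delta=1e-6, t_max=100):
    """Block coordinate descent of Algorithm 1: alternate the two block
    minimizers, starting from the ML covariance estimate, until the relative
    change in theta falls below delta (theta_mle must be non-zero)."""
    theta = np.asarray(theta_mle, dtype=float)
    mu = None
    for _ in range(t_max):
        mu = argmin_mu(theta)            # line 3: mean update given theta
        theta_new = argmin_theta(mu)     # line 4: covariance update given mu
        rel_change = np.linalg.norm(theta_new - theta) / np.linalg.norm(theta)
        theta = theta_new
        if rel_change < delta:
            break
    return mu, theta
```

With exact block minimizers of a strictly convex objective, this alternation converges to the joint minimizer; in the paper's setting the blocks are solved by an adaptive LASSO and a quasi-Newton method, respectively.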
We fix the covariance parameters θ^(t) for some t ≥ 0. The optimization step for the mean parameters (line 3, Algorithm 1) simplifies to

    μ^(t+1) = argmin_{μ ∈ R^p} [ −ℓ(μ | θ^(t)) + n Σ_{j=1}^p λ_j |μ_j| ]
            = argmin_{μ ∈ R^p} [ (1/2) (y − Xμ)^⊤ Σ_Y(θ^(t))^{−1} (y − Xμ) + n Σ_{j=1}^p λ_j |μ_j| ].

We note that the first term is a generalized least squares objective for a linear model with Y ∼ N_n(Xμ, Σ_Y(θ^(t))). Using the Cholesky factor L of Σ_Y(θ^(t)) and the simple variable transformations ỹ := L^{−1} y and X̃ := L^{−1} X, the objective function simplifies to

    μ^(t+1) = argmin_{μ ∈ R^p} [ (1/(2n)) ||ỹ − X̃μ||² + Σ_{j=1}^p λ_j |μ_j| ].

We note that this objective function coincides with a LASSO for a classical linear regression model (Tibshirani, 1996) with individual shrinkage parameters per coefficient.
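The whitened mean update can be carried out with any LASSO solver that supports per-coefficient penalties; the paper uses glmnet in R. As a self-contained illustration, a minimal sketch with our own cyclic soft-thresholding solver (not the paper's implementation):

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator S(z, g) = sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def whitened_lasso(y, X, Sigma_Y, lam, n_iter=500):
    """Mean update (10): whiten with the Cholesky factor of Sigma_Y, then solve
    min_mu (1/(2n)) ||y_t - X_t mu||^2 + sum_j lam_j |mu_j|
    by cyclic coordinate-wise soft-thresholding. lam may be a scalar or a
    length-p vector of individual shrinkage parameters."""
    n, p = X.shape
    L = np.linalg.cholesky(Sigma_Y)
    y_t = np.linalg.solve(L, y)                   # whitened response
    X_t = np.linalg.solve(L, X)                   # whitened design
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (p,))
    mu = np.zeros(p)
    col_sq = (X_t ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y_t - X_t @ mu + X_t[:, j] * mu[j]   # partial residual
            z = X_t[:, j] @ r_j / n
            mu[j] = soft_threshold(z, lam[j]) / (col_sq[j] / n)
    return mu
```

With all λ_j = 0 this reduces to the generalized least squares estimate; large λ_j shrink the corresponding coefficients exactly to zero, which is what drives the fixed effects selection.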
We optimize over the covariance parameters θ with fixed mean parameters μ^(t+1) (c.f. Algorithm 1, line 4). We rearrange this optimization problem writing θ^(t+1) = argmin_{θ ∈ Θ} f(θ) with the following objective function:

    f(θ) := −ℓ( (μ^(t+1)⊤, θ^⊤)^⊤ ) + n Σ_{k=1}^q λ_{p+k} |σ_k²|.    (12)

We use numeric optimization to obtain θ^(t+1). In particular, we use a quasi-Newton method (Byrd et al., 1995). Note that for a κ ∈ {1, ..., q} with σ_κ² = 0, the corresponding range ρ_κ is not identifiable and might induce non-convergence issues. The following proposition ensures that in the case described above the optimization is well behaved.

Proposition 1. Let r be a correlation function for a GP-based SVC model as given above. Assume that the derivative of r exists and is bounded, i.e., |r′| < C for some constant C ≥ 0. Let B_κ := {θ ∈ Θ : σ_κ² = 0} for κ ∈ {1, ..., q}. Then the objective function f defined in (12) fulfills

    (∂/∂ρ_κ) f(b_κ) = 0, for all b_κ ∈ B_κ and κ ∈ {1, ..., q}.

Proof. The proof is given in Appendix A.1.

The additional assumption |r′| < C is quite weak and fulfilled by most covariance functions used for modeling GPs. The partial derivative with respect to ρ_κ is 0, and therefore the numeric approximation of the gradient in the ρ_κ direction yields values very close or identical to 0. Therefore, in the case of σ_κ² = 0, the numeric optimization will stop making any adjustments along the ρ_κ direction.

We use the adaptive LASSO (ALASSO, Zou, 2006) for the parameters λ_j, given by

    λ_j := λ_μ / |μ̂_j|,    λ_{p+k} := λ_θ / |σ̂_k²|,    (13)

where we use the ML estimates μ̂_j^(MLE) and σ̂_k²^(MLE) to weight the shrinkage parameters (λ_μ, λ_θ) ∈ Λ := (R_{>0})². These two parameters account for the differences between the mean and variance parameters. This leaves us with the task of finding a sensible choice of shrinkage parameters λ := (λ_μ, λ_θ) ∈ Λ. We choose these parameters by minimizing an information criterion (IC) which combines the goodness of fit (GoF) and model complexity (MC), i.e., IC = GoF + MC. An alternative approach that is computationally more expensive is to use cross-validation. To emphasize the dependency on λ, we introduce the shorthand notation ω̂_λ for the PML estimate of ω for a given λ. The corresponding mean parameters and variances are given by μ̂_λ and σ̂²_λ, respectively. We use a BIC-type information criterion which is given by

    BIC(λ) = −2 ℓ(ω̂_λ) + log(n) ( ||μ̂_λ||_0 + ||σ̂²_λ||_0 ),    (14)

where ||·||_0 is the count of non-zero entries.
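The adaptive weights (13) and a BIC-type criterion in the spirit of (14) (written here with the conventional −2ℓ goodness-of-fit term) are straightforward to compute from an initial ML fit. A small illustrative helper in Python; the function names are our own:

```python
import numpy as np

def alasso_weights(lam_mu, lam_theta, mu_mle, sigma2_mle):
    """Adaptive LASSO shrinkage parameters (13):
    lambda_j = lam_mu / |mu_j^(MLE)| and lambda_{p+k} = lam_theta / |sigma_k^2(MLE)|."""
    lam_j = lam_mu / np.abs(mu_mle)
    lam_pk = lam_theta / np.abs(sigma2_mle)
    return lam_j, lam_pk

def bic(loglik, mu_hat, sigma2_hat, n):
    """BIC-type criterion (14): -2 * loglik plus log(n) times the number of
    non-zero mean estimates and non-zero SVC variance estimates."""
    df = np.count_nonzero(mu_hat) + np.count_nonzero(sigma2_hat)
    return -2.0 * loglik + np.log(n) * df
```

Note that the adaptive weights blow up for ML estimates close to zero, which is exactly the mechanism that makes it cheap to shrink such parameters to exact zeros in the penalized fit.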
That is, the MC is captured via the number of non-zero fixed effects and non-constant random coefficients. Of course, there exist various ICs like, for instance, the corrected Akaike IC (cAIC, see Müller et al., 2013, for an overview). In our empirical experience, the BIC performs best with respect to variable selection, which is in line with Schelldorfer et al. (2011).

Our goal is to find the minimizer λ̂_BIC := argmin_{λ ∈ Λ} BIC(λ). A single evaluation of BIC(λ) is computationally expensive, which is why we want to keep the number of evaluations as low as possible. To this end, we use model-based optimization (MBO, also known as Bayesian optimization, Jones, 2001; Koch et al., 2012; Horn and Bischl, 2016). For our objective function BIC(λ), we initialize the optimization by drawing a Latin hypercube sample of size n_init from Λ, which we denote by {λ^(1), ..., λ^(n_init)} ⊂ Λ. Then we compute ξ^(i) = BIC(λ^(i)) to obtain the tuples (ξ^(i), λ^(i)). Now we can fit a surrogate model to the tuples. In our case, we assume a kriging model defined as a Gaussian process with a Matérn covariance function of smoothness ν = 3/2, c.f. equation (3). That is, given the observed tuples, we have Ξ(λ) ∼ N(μ̂(λ), ŝ²(λ)) as our surrogate model. An infill criterion is used to find the next, most promising λ^(n_init+1). We use the expected improvement (EI) infill criterion, which is derived from the current surrogate model Ξ(λ). The EI at λ is defined as

    EI(λ) = E_Ξ( max{ξ_min − Ξ(λ), 0} ),    (15)

where ξ_min denotes the current BIC minimum. In the case of a GP-based surrogate model, the EI (15) can be expressed analytically. A separate optimization of (15) returns the most promising parameter λ^(n_init+1) according to the infill criterion, for which the BIC is then calculated. The tuple (ξ^(n_init+1), λ^(n_init+1)) is added to the existing tuples and the surrogate model is updated. This procedure is repeated n_iter times. Finally, the minimizer of the BIC from the set {λ^(i) : i = 1, ..., n_init + n_iter} is returned. For more details regarding MBO, refer to Bischl et al. (2017).

The described PMLE approach is implemented in the R package varycoef (Dambon et al., 2021b). It contains the discussed MBO, which is implemented using the R package mlrMBO (Bischl et al., 2017), as well as a grid search to find the best shrinkage parameters. Further, the varycoef package implements our proposed optimization of the penalized likelihood using a coordinate descent. The optimization over the mean parameters is executed using a classical adaptive LASSO with the R package glmnet (Friedman et al., 2010). For the optimization over the covariance parameters, we use the quasi-Newton method "L-BFGS-B" (Byrd et al., 1995), which is available in a parallel version in the R package optimParallel (Gerber and Furrer, 2019).

Simulation

We simulate N = 100 data sets consisting of n = 15² = 225 samples with observation locations s_i, i = 1, ..., n, from a 15 × 15 perturbed grid, c.f. Appendix A.2 and Furrer et al. (2016); Dambon et al. (2021a). As in the simulation setup suggested by Tibshirani (1996), which has been assumed (e.g., Fan and Li, 2001) or adapted (e.g., Li and Liang, 2008; Ibrahim et al., 2011) in other simulation studies for variable selection, we use p = q = 8 covariates which are sampled from a multivariate zero-mean normal distribution such that the covariance between x^(j) and x^(k) is γ^|j−k| with γ = 0.5.
For each simulation run, we sample the covariates x^(j), which are each associated with both a fixed and a random effect. Let X and W be the corresponding data matrices as defined above. Concerning the fixed effects parameters, we assume a sparse vector

    μ = (3, μ_2, 0, 0, μ_5, 0, μ_7, 0)^⊤, with μ_2, μ_5, μ_7 ≠ 0,    (16)

in order to obtain a balanced design in the sense that for each possible combination of zero and non-zero parameters of both the fixed and random effects there are two cases. Let η_k(s) ∼ N_n(0_n, Σ_k) for k = 1, ..., q be the GP-based SVCs defined by an exponential covariance function with corresponding parameters provided in Table 1. For four of the eight covariates, the respective variance σ_k² is zero and therefore the respective true SVCs η_k(s) are constant zero. Hence, for example, the true model contains two covariates with a non-zero mean SVC (k = 1, 2) and two with a zero-mean SVC (k = 3, 4). Together with the error term ε ∼ N_n(0_n, τ² I_{n×n}) with nugget variance τ², this yields the response y.

We compare methodologies for estimating the SVC model, c.f. (4), or the oracle SVC model

    y = X (μ_1, μ_2, 0, 0, μ_5, 0, μ_7, 0)^⊤ + Σ_{k: σ_k² > 0} η_k(s) ⊙ x^(k) + ε.    (17)

The latter is called the oracle model, as it assumes the true data generating covariates to be known and excludes all other parameters from the respective estimation. As a reference, we use two classical MLE approaches without any shrinkage or variable selection to estimate (4) and (17), labeled MLE and Oracle, respectively. The third methodology is our novel approach, which estimates (4), and we denote it by PMLE. For the coordinate descent algorithm, we set a relative convergence threshold δ and a maximum of T_max = 20 iterations. Note that this upper limit on the number of iterations was never attained in our simulations. Further, the shrinkage parameters λ = (λ_μ, λ_θ)^⊤ were restricted to a bounded search space, and λ̂_BIC is estimated by MBO using n_init = 10 initial evaluations and n_iter = 10 iteration steps. The infill criterion is the previously defined expected improvement given in equation (15). The shrinkage parameter ranges as well as the number of initial evaluations and iteration steps were chosen such that we obtain reasonable results under a feasible time and computational resource constraint. The methods PMLE, MLE, and Oracle are implemented using varycoef (Dambon et al., 2021b). Further details are given in Appendix A.3.
Table 1: True GP parametrization for the simulation study, i.e., the variance σ_k² and range ρ_k of each SVC k = 1, ..., 8. The ranges of zero-variance GPs are not identifiable, which we denote by "–". To ease readability, parameters identical to 0 are given by a sole zero, i.e., not by "0.000".

First, we check the convergence properties of the CD over all simulations. This is depicted in Figure 1. Facet (a) shows a histogram of the number of evaluated iterations T, where the median over all iterations was 4. This shows that the CD algorithm converges quickly. Second, the estimated shrinkage parameters are depicted in Figure 1(b). We observe that there exist two regimes with respect to λ_θ. The borders of the shrinkage parameter space used in the MBO are depicted in dark green. There are some estimates very close to the border. Upon closer inspection of the actual parameter estimates (see below), we could not find any substantial difference between parameter estimates obtained by λ̂_BIC close to the boundary compared to all other λ̂_BIC.

Next, we turn to the actual parameter estimates. They are depicted in Figure 2 and grouped by variances, ranges, and means. The true values are indicated by red lines. Further, the number of zero estimates is given, if there are any. As one can see, the
PMLE returns sparse estimates in the fixed effects as well as in the variances. We can clearly observe that for all methods, the estimation of the mean parameters is much more precise compared to the covariance parameters. However, the MLE method is not able to return mean parameters identical to zero. For the variances, we observe that both the MLE and the PMLE are capable of obtaining sparse estimates. Here, the additional penalization of the variances manifests itself in a higher number of zero estimates. On average, we observe an increase of 25 correctly zero estimated variance parameters. The penalty has a minor downside, as there are some zero estimates where the true variance is unequal to zero. In that case, the nugget compensates for unattributed variation by the covariates, which manifests itself in large variance estimates, c.f. the outliers of the box plot for the estimated nugget variance in Figure 2.

Figure 1: The coordinate descent with (a) the number of iteration steps T and (b) the MBO estimated λ̂ under the BIC.

Finally, we summarize the results of the simulation study in Table 2. For each simulation and methodology m, we compute the relative model error (RME), which is defined as

    RME(m) = ||y − ŷ^(m)|| / ||y − ȳ 1_n||.

The median relative model error (MRME), i.e., the median over all RME, is provided in Table 2 and is smallest for MLE. This comes at no surprise given the high degree of flexibility of an SVC model. It shows that some kind of variable selection is desirable in order to counter over-fitting. The MRME for PMLE is similar to the one of Oracle. It is not identical, as the parameter estimates cannot fully mimic the behavior of the oracle. This is summarized in the second part of Table 2, where the numbers of estimated zero parameters are given both within the mean effects μ and the random effects θ. The average number of correctly identified zero parameters (C) and incorrectly identified zero parameters (IC) over all simulations is provided. PMLE introduces sparse estimates for the fixed effects, while there are no IC in the fixed effects. Further, PMLE substantially increases the C for the random effects. The downside is a slight increase in the IC. It is also worth mentioning that both MLE and PMLE do not incorrectly estimate any mean parameter as zero.

Figure 2:
Parameter estimates as box plots and, if any, number of zero estimates. Simulation setup: N = 100 simulation runs, n = 225 observations on a perturbed 15 × 15 grid. Red lines are the true values, c.f. (16) and Table 1.
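The relative model error used above can be computed directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def relative_model_error(y, y_hat):
    """RME: norm of the prediction residuals relative to the residuals of the
    constant mean-only predictor y_bar."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    baseline = y - y.mean()
    return np.linalg.norm(y - y_hat) / np.linalg.norm(baseline)
```

By construction, a perfect fit yields 0 and the constant mean-only predictor yields 1, so values below 1 indicate an improvement over the trivial model.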
Table 2: Median relative model error (MRME) and average number of correctly (C) and incorrectly (IC) estimated zero parameters, divided into fixed effects and random effects, i.e., the SVCs.

Method m    MRME    Fixed Effects: C, IC    Random Effects: C, IC
(rows: PMLE, MLE, Oracle)
In this section, we summarize the results of the simulation study. Our first focus is the selection of covariance parameters, where the overall performance is not on par with the one of the mean parameters. However, this comes at no surprise, as covariance parameters are known to be more difficult to estimate than mean parameters. Further, we can observe a familiar behavior from variable selection in classical linear models. Increasing the sparsity of the model, i.e., having none or only a few SVCs, will increase the variance of the nugget effect.

Overall, our newly suggested PMLE method correctly identifies covariates with no effect in over 90% of the fixed and 85% of the random effects. This is notable considering that the only drawback is that less than 5% of covariates were incorrectly estimated to have no effect.
Application

As an application, we consider the Dublin Voter data set. It consists of the voter turnout in the 2002 General Elections (GE) and 8 other demographic covariates for n = 322 electoral divisions (ED) in the Greater Dublin area (Ireland), see Figure 3. All nine variables are given in percentages and we provide an overview in Table 3. The data set was first studied by Kavanagh (2004) and is available in the R package GWmodel. The article by Gollini et al. (2015) showcases further analysis using the GWR framework. In particular, a GWR selection based on a corrected AIC (Hurvich et al., 1998) is conducted. The goal of this section is to do variable selection using the PMLE. Further, we compare our results to a linear model-based LASSO and a GWR-based selection.

As one can see in Table 3, the summary statistics of the variables vary a lot, which can be an issue in numeric optimization. Therefore, without loss of generality and interpretability, we standardize the data by subtracting the empirical mean and scaling by the empirical standard deviation. We annotate the standardized variables by the prefix "Z.". Further, the observation locations represent the electoral divisions as depicted in Figure 3 and are provided as Easting and Northing in meters. We transform them to kilometers to ensure computational stability while staying interpretable with respect to the range parameter.

Figure 3: Voter turnout in the 2002 General Elections per ED. The black dots represent the observation locations per ED used in our analysis. They are provided in the data set, too.

Table 3: Description and summary statistics of the Dublin Voter data. The descriptions are taken from the package GWmodel. The summary statistics include the minimum, mean, standard deviation (SD), and maximum. Note that due to a naming error the response of interest is called GenEl2004, although it actually refers to the 2002 GE.

Variable    Description: percentage of population in each ED...    Min.  Mean  SD  Max. [all in %]
DiffAdd ...who are one-year migrants 1.90 9.86 6.13 34.74
LARent ...who are local authority renters 0.00 15.17 24.69 100.00
SC1 ...who are social class one (high) 0.22 8.05 6.14 25.63
Unempl ...who are unemployed 1.26 7.56 5.28 31.14
LowEduc ...who are with little formal education 0.00 0.49 0.73 9.24
Age18_24 ...who are in age group 18–24 0.14 13.43 6.06 56.55
Age25_44 ...who are in age group 25–44 17.63 31.65 6.74 56.40
Age45_64 ...who are in age group 45–64 7.12 20.87 5.12 34.01
GenEl2004 ...who voted in 2002 GE 27.98 55.61 8.71 72.91
We use two models in our comparison. First, a simple linear regression using the adaptive LASSO for variable selection is defined as

    y_i = Z.GenEl2004_i = μ_1 Z.DiffAdd_i + μ_2 Z.LARent_i + μ_3 Z.SC1_i + μ_4 Z.Unempl_i + μ_5 Z.LowEduc_i + μ_6 Z.Age18_24_i + μ_7 Z.Age25_44_i + μ_8 Z.Age45_64_i + ε_i.

Since we use the standardized covariate Z.GenEl2004 as our response, we do not expect an intercept in the model. The coefficients are denoted by μ_j in accordance with our previous notation, i.e., these are only mean effects, and ALASSO denotes the adaptive LASSO used to estimate the mean effects. It is implemented using the R package glmnet (Friedman et al., 2010), where the corresponding adaptive weights are given by an ordinary least squares estimate and the shrinkage parameter is estimated by cross-validation. The second model is a full SVC model including an intercept:

    y_i = Z.GenEl2004_i = β_0(s_i) + β_1(s_i) Z.DiffAdd_i + β_2(s_i) Z.LARent_i + β_3(s_i) Z.SC1_i + β_4(s_i) Z.Unempl_i + β_5(s_i) Z.LowEduc_i + β_6(s_i) Z.Age18_24_i + β_7(s_i) Z.Age25_44_i + β_8(s_i) Z.Age45_64_i + ε_i.

The SVCs have yet to be specified according to the method being used to estimate the model, but generally, they do contain mean effects, i.e., the SVCs are not centered around zero. As for the intercept, we expect a zero mean effect. However, there might be some spatially local structures that can be captured using an SVC.

In the case of the GP-based SVC model as given in (1), the β_j(s_i) can simply be partitioned into the mean effect μ_j and the random effect η_j(s_i). For the GWR, β_j(s_i) is defined as the generalized least squares estimate where observations are geographically weighted. As a kernel to weight the observations based on their distances, we use an exponential function. The kernel relies on an adaptive bandwidth that is selected using a corrected AIC (Hurvich et al., 1998).

We use MLE as well as PMLE to estimate the full SVC model, again labeled MLE and
PMLE, respectively, and implemented using varycoef. For further reference, we also report the results of a GWR-based variable selection labeled GWR and using the R package GWmodel, as well as an adaptive LASSO labeled ALASSO and using the R package glmnet.

In this section, we present the results for all model-based approaches, i.e., ALASSO, MLE, and PMLE, as not all comparison measures of GWR are defined. In terms of variable selection, we provide the estimated shrinkage parameters, the number of non-zero estimates of the means and the variances, the log-likelihood $\ell$, and the BIC in Table 4. As one can see, ALASSO gives a very sparse model with a relatively small log-likelihood. As expected, MLE does not provide any sparse fixed effects but selects only 6 of the potentially 9 random effects. With PMLE we obtain a sparser model. Overall, the smallest BIC is achieved by PMLE, followed by MLE and ALASSO.

Table 4: Overview of estimated shrinkage parameters (Shrink. Par.), model complexity (MC), goodness of fit (GoF), and combined information criterion (IC) of all model-based approaches.
Rows: methods $m$ = ALASSO, MLE, PMLE. Columns: shrinkage parameters (Shrink. Par.) $\hat{\lambda}_\mu$ and $\hat{\lambda}_\theta$; model complexity (MC) $\|\hat{\mu}\|_0$ and $\|\hat{\sigma}\|_0$; goodness of fit (GoF) $\ell$; information criterion (IC) BIC.

PMLE results in zero estimates for the intercept and
Z.LowEduc. These fixed effects are only a subset of what ALASSO estimated to be zero.

In the covariance effects, we observe an interesting behavior. We note that MLE gives zero estimates for one third of the SVC variances even without any penalization. This coincides with previous results from the simulation study. As expected, the non-zero variance estimates of PMLE are a subset of the ones of MLE. Further details on the CD algorithm for PMLE are given in Appendix A.4.

Table 5: Parameter estimates of model-based approaches. We use "–" for not available values. To ease readability, estimates identical to 0 are given by a sole zero, i.e., not by "0.000". Range estimates where the corresponding variance was estimated to be equal to zero are given in italic font, as they cannot be interpreted.

Rows: variables $j, k$ = Intercept, Z.DiffAdd, Z.LARent, Z.SC1, Z.Unempl, Z.LowEduc, Z.Age18_24, Z.Age25_44, Z.Age45_64. Columns: mean $\hat{\mu}_j$ (ALASSO, MLE, PMLE), range $\hat{\rho}_k$ (MLE, PMLE), and variance $\hat{\sigma}_k^2$ (MLE, PMLE).

Overall, the PMLE method results in fewer zero fixed effects compared to the adaptive LASSO. There is only one variable which does not enter the model at all, namely Z.LowEduc, and only one other that does have a zero mean effect, namely the intercept. However, the linear model does not take the spatial structures into account. Considering the spatial structure apparently triggers the inclusion of additional fixed effects. The model gains complexity, but we have to consider that this might be necessary in order to account for the spatial structures within the data. Therefore, SVC model selection remains a trade-off between model complexity and goodness of fit, which we will address in the following section using cross-validation.
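The trade-off between model complexity and goodness of fit that the BIC formalizes can be illustrated in a few lines. This is an illustrative Python sketch with hypothetical numbers, not the authors' R implementation; counting the non-zero mean effects and SVC variances as model complexity follows the columns of Table 4.

```python
import math

def bic(loglik, n_obs, n_nonzero):
    """BIC = -2 * log-likelihood + log(n) * model complexity, where
    complexity counts the non-zero mean effects and SVC variances."""
    return -2.0 * loglik + math.log(n_obs) * n_nonzero

# A sparser model can win on BIC even with a slightly smaller
# log-likelihood -- hypothetical numbers for illustration:
full   = bic(loglik=-150.0, n_obs=300, n_nonzero=18)
sparse = bic(loglik=-155.0, n_obs=300, n_nonzero=10)
assert sparse < full  # the complexity penalty outweighs the fit difference
```

The same comparison underlies Table 4: PMLE trades a small loss in $\ell$ for a markedly smaller complexity term.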
In the previous section, we examined the goodness of fit combined with model complexity as well as parameter estimation. We now turn to predictive performance. Here, we expect that, due to the high degree of flexibility of full SVC models, methods without any kind of variable selection could result in over-fitting on the training data in connection with biased spatial extrapolation on the prediction data set.

To examine this, we conduct a classical 10-fold cross-validation. That is, we randomly divide the Dublin voter data set into ten folds of size 32 or 33, i.e., each observation $i$ falls into exactly one of the disjoint sets $S_1, \ldots, S_{10}$. For each $\iota = 1, \ldots, 10$, the variable selection and model fitting of all four methods $m \in \{$ALASSO, GWR, MLE, PMLE$\}$ is conducted on the union of the other nine sets (shorthand notation $S_{-\iota}$), with one set left out for validation ($S_\iota$). We report the out-of-sample root mean squared error (RMSE) on $S_\iota$ for each method $m$, denoted RMSE$_\iota(m)$. This procedure is repeated for each $\iota$. The results are visualized in Figure 4 and Table 6.

Figure 4: Bubble plot visualizing the results of the cross-validation. The location of the bubbles corresponds to the number of fixed and random effects selected by the model. Note that the locations are slightly jittered to ensure overlapping methods are still visible. The size of the bubbles corresponds to the out-of-sample RMSE. The facets show individual results per fold.

First, we observe that the size of the bubbles, i.e., the RMSE, varies more between different folds than between different methods. To analyze the differences in predictive performance, we provide the mean and standard deviation per method in Table 6. While the predictive performances of the estimated SVC models are similar, ALASSO falls behind.

Second, we address the number of selected fixed and random effects. Upon closer inspection of Figure 4, we see notable variation in the number of selected fixed and random effects over different folds for all methods but ALASSO. We attribute this to the relatively small sample size. As already observed in Section 5.3, the number of selected mean effects is higher for the selection methods PMLE and GWR on full SVC models than for the adaptive LASSO on a linear model. However, the inclusion of more fixed effects and accounting for the spatial structure using SVCs clearly surpasses the predictive performance of a simple linear model.

Table 6: Mean and standard deviation (SD) of RMSE$_m$ per method. Rows: PMLE, MLE, GWR, ALASSO.
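The 10-fold scheme described above can be sketched generically. This is an illustrative Python sketch rather than the paper's R code; `fit` and `predict` are placeholders standing in for any of the four methods.

```python
import math
import random

def kfold_rmse(X, y, fit, predict, k=10, seed=1):
    """Classical k-fold CV: for each fold S_iota, fit on the union of the
    other k-1 folds and report the out-of-sample RMSE on S_iota."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # random fold assignment
    folds = [idx[i::k] for i in range(k)]  # folds of size n//k or n//k + 1
    rmses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errs = [(y[i] - predict(model, X[i])) ** 2 for i in held_out]
        rmses.append(math.sqrt(sum(errs) / len(errs)))
    return rmses

# Sanity check with a trivial "fit the training mean" model:
X = [[0.0]] * 20
y = [3.0] * 20
out = kfold_rmse(X, y,
                 fit=lambda Xs, ys: sum(ys) / len(ys),
                 predict=lambda m, x: m)
assert len(out) == 10 and all(r == 0.0 for r in out)
```

Reporting the per-fold values, as in Figure 4, rather than only their mean makes the fold-to-fold variability visible.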
Conclusion

We present a novel methodology to efficiently select covariates in Gaussian process-based SVC models. We propose to perform variable selection for both fixed effects and spatially varying coefficients by respectively penalizing the fixed regression coefficients and the variances of the spatial random coefficients using $L_1$ penalties. Further, we show how optimization of the resulting objective function can be done using a coordinate descent algorithm. The latter can be done efficiently by relying on existing efficient methodology for the fixed effects part and by parallelizing the optimization procedure.

In a simulation study, we show that our method is capable of spatial variance component as well as fixed effect selection. In a real data application, we find that our novel method results in the best in-sample fit with the lowest BIC among the considered approaches and gives high predictive accuracy when doing cross-validation. While achieving a similar predictive performance as the unpenalized MLE, the PML estimated models are sparser and have lower model complexity. For examining the influence of possibly new covariates in SVC models, we recommend our newly proposed PMLE over a classical MLE for variable selection.

Our novel variable selection methodology has been implemented specifically with SVC models in mind. However, with some adaptations, one can apply it to generalized linear models that fit our underlying model assumptions. Further future work on our PMLE can include individual weighting of fixed and random effects via the BIC. One could add weights to the specific model complexity penalization as, usually, the addition of another fixed effect to an SVC model is not equal to adding another random effect in terms of, say, computational work load or model complexity.

Declarations
Acknowledgment
JD sincerely thanks Lucas Kook for the valuable exchange and help with respect to the MBO implementation.

Funding

JD and FS gratefully acknowledge the support of the Swiss Agency for Innovation (Innosuisse project number 28408.1 PFES-ES). RF gratefully acknowledges the support of the Swiss National Science Foundation (SNSF grant 175529).
Conflicts of Interest
The authors declare no competing interests.
Availability of Data and Material
The data from the simulation is available online at https://git.math.uzh.ch/jdambo/open-access-svc-selection-paper. The data used in the application is given in the R package GWmodel (Gollini et al., 2015).
Code Availability
The code for the simulation and application is available online at https://git.math.uzh.ch/jdambo/open-access-svc-selection-paper. Our newly proposed method is implemented in the R package varycoef (Dambon et al., 2021b).

Bibliography
Bischl, B., J. Richter, J. Bossek, D. Horn, J. Thomas, and M. Lang (2017). mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions. http://arxiv.org/abs/1703.03373.

Boehm Vock, L. F., B. J. Reich, M. Fuentes, and F. Dominici (2015). Spatial Variable Selection Methods for Investigating Acute Health Effects of Fine Particulate Matter Components. Biometrics 71(1), 167–177.

Bondell, H. D., A. Krishna, and S. K. Ghosh (2010). Joint Variable Selection for Fixed and Random Effects in Linear Mixed-Effects Models. Biometrics 66(4), 1069–1077.

Byrd, R. H., P. Lu, J. Nocedal, and C. Zhu (1995). A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific Computing 16(5), 1190–1208.

Chu, T., J. Zhu, and H. Wang (2011). Penalized Maximum Likelihood Estimation and Variable Selection in Geostatistics. Ann. Statist. 39(5), 2607–2625.

Cressie, N. (2011). Statistics for Spatio-Temporal Data. Wiley Series in Probability and Statistics. Hoboken, N.J.: Wiley.

Dambon, J. A., F. Sigrist, and R. Furrer (2021a). Maximum Likelihood Estimation of Spatially Varying Coefficient Models for Large Data with an Application to Real Estate Price Prediction. Spatial Statistics 41, 100470.

Dambon, J. A., F. Sigrist, and R. Furrer (2021b). varycoef: Modeling Spatially Varying Coefficients. R package version 0.3.0, https://cran.r-project.org/package=varycoef.

Fan, J. and R. Li (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association 96(456), 1348–1360.

Fotheringham, A. S., C. Brunsdon, and M. Charlton (2002). Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Chichester: Wiley.

Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1), 1–22.

Furrer, R., F. Bachoc, and J. Du (2016). Asymptotic Properties of Multivariate Tapering for Estimation and Prediction. Journal of Multivariate Analysis 149, 177–191.

Furrer, R., M. G. Genton, and D. W. Nychka (2006). Covariance Tapering for Interpolation of Large Spatial Datasets. Journal of Computational and Graphical Statistics 15(3), 502–523.

Gerber, F. and R. Furrer (2019). optimParallel: An R Package Providing a Parallel Version of the L-BFGS-B Optimization Method. The R Journal 11(1), 352–358.

Gollini, I., B. Lu, M. Charlton, C. Brunsdon, and P. Harris (2015). GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically Weighted Models. Journal of Statistical Software 63(17), 1–50.

Horn, D. and B. Bischl (2016). Multi-Objective Parameter Configuration of Machine Learning Algorithms using Model-Based Optimization. In , pp. 1–8.

Huang, H.-C. and C.-S. Chen (2007). Optimal Geostatistical Model Selection. Journal of the American Statistical Association 102(479), 1009–1024.

Hurvich, C. M., J. S. Simonoff, and C.-L. Tsai (1998). Smoothing Parameter Selection in Nonparametric Regression using an Improved Akaike Information Criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60(2), 271–293.

Ibrahim, J. G., H. Zhu, R. I. Garcia, and R. Guo (2011). Fixed and Random Effects Selection in Mixed Effects Models. Biometrics 67(2), 495–503.

Jones, D. R. (2001). A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization 21, 345–383.

Kavanagh, A. (2004). Turnout or Turned Off? Electoral Participation in Dublin in the 21st Century. Journal of Irish Urban Studies 3(2), 1–22.

Koch, P., B. Bischl, O. Flasch, T. Bartz-Beielstein, C. Weihs, and W. Konen (2012). Tuning and Evolution of Support Vector Kernels. Evolutionary Intelligence 5(3), 153–170.

Li, K. and N. S. N. Lam (2018). Geographically Weighted Elastic Net: A Variable-Selection and Modeling Method under the Spatially Nonstationary Condition. Annals of the American Association of Geographers 108(6), 1582–1600.

Li, R. and H. Liang (2008). Variable Selection in Semiparametric Regression Modeling. Ann. Statist. 36(1), 261–286.

Lu, B., M. Charlton, P. Harris, and A. S. Fotheringham (2014). Geographically Weighted Regression with a Non-Euclidean Distance Metric: A Case Study Using Hedonic House Price Data. International Journal of Geographical Information Science 28(4), 660–681.

Müller, S., J. L. Scealy, and A. H. Welsh (2013). Model Selection in Linear Mixed Models. Statist. Sci. 28(2), 135–167.

Reich, B. J., M. Fuentes, A. H. Herring, and K. R. Evenson (2010). Bayesian Variable Selection for Multivariate Spatially Varying Coefficient Regression. Biometrics 66(3), 772–782.

Schelldorfer, J., P. Bühlmann, and S. van de Geer (2011). Estimation for High-Dimensional Linear Mixed-Effects Models Using $\ell_1$-Penalization. Scandinavian Journal of Statistics 38(2), 197–214.

Schmidt, A. M. and P. Guttorp (2020). Flexible Spatial Covariance Functions. Spatial Statistics 37, 100416. Frontiers in Spatial and Spatio-temporal Research.

Smith, M. and L. Fahrmeir (2007). Spatial Bayesian Variable Selection With Application to Functional Magnetic Resonance Imaging. Journal of the American Statistical Association 102(478), 417–431.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288.

Wackernagel, H. (1995). Anisotropy, pp. 46–49. Berlin, Heidelberg: Springer Berlin Heidelberg.

Wang, H. and J. Zhu (2009). Variable Selection in Spatial Regression via Penalized Least Squares. Canadian Journal of Statistics 37(4), 607–624.

Ye, J. (1998). On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 93(441), 120–131.

Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101(476), 1418–1429.
Appendix to: Joint Variable Selection of both Fixed and Random Effects for Gaussian Process-based Spatially Varying Coefficient Models
A.1 Proofs
Proof of Proposition 1.
Let $U_A \in \mathbb{R}^{n \times n}$ be the distance matrix under some anisotropic geometric norm defined by a positive-definite matrix $A \in \mathbb{R}^{d \times d}$, i.e., $(U_A)_{lm} = u_{lm} := \|s_l - s_m\|_A$. Let $r \colon \mathbb{R}_{\geq 0} \to [0, 1]$ be a correlation function, where by $r(U_A)$ we denote the component-wise evaluation, i.e., $(r(U_A))_{lm} = r(u_{lm})$. We have
$$
\frac{\partial}{\partial \rho_\kappa} \Sigma_Y(\theta)
= \frac{\partial}{\partial \rho_\kappa} \left( \sum_{k=1}^{q} \left( w^{(k)} {w^{(k)}}^\top \right) \odot \Sigma_k + \tau^2 I_{n \times n} \right)
= \left( w^{(\kappa)} {w^{(\kappa)}}^\top \right) \odot \frac{\partial}{\partial \rho_\kappa} \Sigma_\kappa
= \left( w^{(\kappa)} {w^{(\kappa)}}^\top \right) \odot \sigma_\kappa^2\, r'\!\left( \frac{U_A}{\rho_\kappa} \right) \odot \left( - \frac{U_A}{\rho_\kappa^2} \right).
$$
With $|r'| < C$, we have $\frac{\partial}{\partial \rho_\kappa} \Sigma_Y(b_\kappa) = 0_{n \times n}$, and recall that $\Sigma_Y(b_\kappa)$ is well-defined and invertible. With the identities $(\log \det M)' = \operatorname{tr}(M^{-1} M')$ and $(M^{-1})' = -M^{-1} M' M^{-1}$ for some quadratic matrix $M$, we obtain
$$
\frac{\partial}{\partial \rho_\kappa} f(\theta)
= \operatorname{tr}\left( \Sigma_Y(\theta)^{-1} \frac{\partial}{\partial \rho_\kappa} \Sigma_Y(\theta) \right)
+ (y - X \mu^{(t+1)})^\top \left( - \Sigma_Y(\theta)^{-1} \left( \frac{\partial}{\partial \rho_\kappa} \Sigma_Y(\theta) \right) \Sigma_Y(\theta)^{-1} \right) (y - X \mu^{(t+1)}),
$$
and therefore $\frac{\partial}{\partial \rho_\kappa} f(b_\kappa) = 0$.

A.2 Perturbed Grid
A perturbed grid is used to sample the observation locations. It consists of $15 \times 15$ sampling domains arranged as a regular grid. Each sampling domain is a square surrounded by a thin margin. We sample an observation location uniformly from each square in the sampling domains. The true effect of the SVC is then sampled from a GP at the corresponding observation locations. An example is provided in Figure 5. This is repeated for all $N = 100$ simulation runs.

A.3 Numeric Optimization
We provide further details on the numeric optimization. In all ML and PML methods, we use the R package optimParallel (Gerber and Furrer, 2019). In the simulation study, recall that we are on an $m \times m$ perturbed grid, where we have $m = 15$ observations along one side. Here, we set $\sigma_k^2 \geq 0$ and $\rho_k \geq (3m)^{-1}$ as lower bounds, along with a small positive lower bound for $\tau^2$. The lower bounds in the case of the ranges are motivated by the effective range of an exponential GP. It is $3\rho$, where $\rho$ is the corresponding range of the GP. Since the smallest distances between neighbors on a perturbed grid are $1/m$ on average, it is not feasible to model GPs with ranges smaller than a third of that. This prevents individual SVCs from taking on the role of a nugget effect.

Figure 5: Perturbed grid of size $15 \times 15$ within the unit square (black border). The grey squares indicate the sampling domains. The colored points are the observation locations with the corresponding value of the SVC, i.e., $\eta(s)$.

A.4 Coordinate Descent Iterations
We briefly discuss the coordinate descent in PMLE from Sections 5.2 and 5.3. For the two covariates Z.DiffAdd and Z.SC1, we give the covariance parameters of the respective SVCs at the individual steps $t$ of the coordinate descent in Figure 6. We have chosen these two covariates as both their variances have the largest absolute difference between the initial and final values. Additionally, one of the variances shrunk to 0 while the other did not. In total, we have $T = 5$ iterations of the CD algorithm, with the individual covariance parameters $(\rho_k^{(t)}, \sigma_k^{2(t)})$ for $t = 0, \ldots, T$ and $k = 2, 4$. Note that for both covariates the covariance parameters of the last 3 iteration steps are (almost) identical. The lower bound of the variance is indicated by the dark green line. Figure 6 clearly visualizes the effect of adding the penalties on the variance parameters, as both optimization trajectories first evolve along the variance axis before the range is adjusted further on. For $\rho_4^{(t)}$, $t \geq 3$, we see that no further adjustments are made, beside the fact that $\rho_4$ is unidentifiable under $\sigma_4^2 = 0$.

Figure 6: Covariance parameter trajectories of two SVCs for the covariates Z.DiffAdd and Z.SC1 in the coordinate descent.
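The shrinkage behavior visible in Figure 6, where variances move first and one hits exactly zero, is characteristic of $L_1$-type penalties, whose one-dimensional updates are soft-thresholding steps. The following is a minimal illustrative Python sketch under simplifying assumptions: a single scalar variance update with an adaptive weight and a box constraint, not the actual varycoef implementation.

```python
def soft_threshold(z, gam):
    """S(z, gam) = sign(z) * max(|z| - gam, 0): the closed-form minimizer
    of 0.5 * (b - z)**2 + gam * |b|, the building block of L1 updates."""
    if z > gam:
        return z - gam
    if z < -gam:
        return z + gam
    return 0.0

def penalized_variance_update(z, lam, weight, lower_bound=0.0):
    """One hypothetical coordinate update for a variance parameter: shrink
    the unpenalized update z by an adaptive penalty lam * weight, then
    project onto the feasible region [lower_bound, inf)."""
    return max(soft_threshold(z, lam * weight), lower_bound)

# A weak variance signal is set exactly to zero, a strong one only shrunk:
assert penalized_variance_update(0.05, lam=0.1, weight=1.0) == 0.0
assert abs(penalized_variance_update(0.80, lam=0.1, weight=1.0) - 0.7) < 1e-12
```

In the actual CD algorithm, the update additionally involves the likelihood and the range parameter; the sketch only isolates why trajectories can reach zero exactly rather than asymptotically.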