Matrix-free Penalized Spline Smoothing with Multiple Covariates
Julian Wagner∗   Göran Kauermann†   Ralf Münnich‡

January 18, 2021

∗ Trier University / RTG ALOP, Germany
† LMU Munich, Germany
‡ Trier University, Germany
Abstract
The paper motivates high dimensional smoothing with penalized splines and its efficient numerical calculation. If smoothing is carried out over three or more covariates, the classical tensor product spline bases explode in their dimension, bringing the estimation to its numerical limits. A recent approach by Siebenborn and Wagner (2019) circumvents storage-expensive implementations by proposing matrix-free calculations, which allows smoothing over several covariates. We extend their approach here by linking penalized smoothing and its Bayesian formulation as a mixed model, which provides a matrix-free calculation of the smoothing parameter and avoids computationally intensive cross validation. Further, we show how to extend the ideas towards generalized regression models. The extended approach is applied to remote sensing satellite data in combination with spatial smoothing.
Keywords: tensor product splines, penalized spline smoothing, remote sensing data, curse of dimension, matrix-free algorithms
1 Introduction

Penalized spline smoothing traces back to O'Sullivan (1986) but was made popular by Eilers and Marx (1996). The general idea is to replace a smooth unknown function by a high dimensional B-spline basis, where a penalty is imposed on the spline coefficients to guarantee smoothness. The penalty itself is steered by a penalty parameter which controls the amount of penalization. Wand (2003) exhibited the connection between penalized spline smoothing and mixed models and showed how to use mixed model software to estimate the penalty (or regularization) parameter using maximum likelihood theory (see also Green and Silverman, 1993, Brumback and Rice, 1998 or Verbyla et al., 1999 for earlier references in this line). The general idea is to comprehend the penalty as a normal prior distribution imposed on the spline coefficients. The penalty parameter then becomes the (reciprocal of the) prior variance of the random spline coefficients. This link has opened an avenue of flexible smoothing techniques, as proposed in Ruppert et al. (2003). Penalized spline smoothing achieved general recognition and the method was extended and applied in many areas, as nicely demonstrated in the survey article by Ruppert et al. (2009). Kauermann et al. (2009) provide a theoretical justification for estimating the smoothing parameter based on mixed model technology. Further details are found in the comprehensive monograph of Wood (2017).

In times of Big Data we obtain larger and larger databases, which allows for more complex modelling. Though the curse of dimensionality remains (see e.g. Hastie and Tibshirani, 1990), we are put in the position of fitting high dimensional models to massive databases. In particular, this allows us to include interaction terms in the model; putting it differently, we can replace (generalized) additive models of the form

$$Y = \beta_0 + s_1(x_1) + \ldots + s_P(x_P) + \varepsilon \qquad (1.1)$$

by interaction models of the type

$$Y = \beta_0 + s(x_1, \ldots, x_P) + \varepsilon. \qquad (1.2)$$

In (1.1) the functions $s_p(\cdot)$ are univariate smooth functions, normed in some way to achieve identifiability, while $s(\cdot)$ in (1.2) is a smooth function with a $P$-dimensional argument vector. Using the idea of penalized spline smoothing to fit multivariate smooth functions as in model (1.2) is carried out by using a tensor product of univariate B-spline bases and an appropriately chosen penalty matrix. This leads to a different view of the curse of dimensionality, since the resulting tensor product spline basis increases exponentially in $P$. For instance, using 40 univariate B-splines in each dimension leads to a 64,000 dimensional basis matrix for $P = 3$ dimensions. Wood et al. (2017) propose a discretization scheme as introduced in Lang et al. (2014) to estimate such models. Li and Wood (2019) extend the results by exploiting the structure of the model matrices using a blockwise Cholesky decomposition. If the covariates $x_1, \ldots, x_P$ live on a regular lattice, one can rewrite the entire model and make use of array operations, where the numerical effort in fact then grows only linearly in $P$. This has been proposed in Currie et al. (2006), see also Eilers et al. (2006). A different idea to circumvent massive dimensional tensor product B-spline bases has been proposed by Zenger (1991) as so-called sparse grids, see also Bungartz and Griebel (2004) or Kauermann et al. (2013).
The idea is to reduce the tensor product spline dimension by taking the hierarchical construction principle of B-splines into account. However, these simplifications work only in the case of regular lattice data and not in general, while we tackle the general case here.

A novel strategy to deal with massive dimensional tensor product spline matrices has been proposed in Siebenborn and Wagner (2019), see also Wagner (2019) for extensive technical details. The principal idea is to never construct and store the tensor product spline basis (which in many cases is numerically not even feasible) but to exploit the structure of the tensor product and calculate the necessary quantities using univariate B-spline bases only. The strategy is referred to as matrix-free calculation, since it only involves the calculation and storage of univariate basis matrices, but not of the massive dimensional multivariate basis matrices. We extend these results here in two directions. First, we make use of the link between penalized splines and mixed models, which allows us to estimate the smoothing parameter as the variance of the prior imposed on the spline coefficients. The resulting estimation formulae are simple in matrix notation (see e.g. Wood et al., 2017), but for high dimensional tensor product splines their calculation becomes numerically infeasible. We show how to calculate the quantities with alternative algorithms using a matrix-free strategy. This allows for a (RE)ML based, data driven estimation of the smoothing (or regularization) parameter without ever building the high dimensional and numerically demanding (or infeasible) design matrix. The second extension to the results in Siebenborn and Wagner (2019) is that we show how to extend the ideas towards generalized regression models, where a weight matrix needs to be considered.

We apply the routine to estimate organic carbon using the LUCAS (Land Use and Coverage Area frame Survey) data set, which contains topsoil survey data including the main satellite spectral bands at the sampling points as well as spatial information (see e.g. Toth et al., 2013 and Orgiazzi et al., 2017). The results show that spline regression models are able to provide highly improved estimates using large scale multi-dimensional data.

In Section 2, we review tensor product based multivariate penalized spline smoothing. In Section 3, we show how fitting can be pursued without storage of the massive dimensional multivariate spline basis using matrix-free calculation. Section 4 extends the results towards generalized additive regression, i.e. to the case of multiple smooth functions and non-normal response. In Section 5, we apply the method to the LUCAS data, estimating organic carbon in a multi-dimensional setting. Finally, Section 6 concludes the manuscript.

2 Multivariate Penalized Spline Smoothing

We start the presentation with simple smoothing of a single, though multivariate, smooth function. Let us, therefore, assume the data

$$\{(x_i, y_i) \in \mathbb{R}^P \times \mathbb{R} : i = 1, \ldots, n\}, \qquad (2.1)$$

where the $y_i \in \mathbb{R}$ are observations of a continuous response variable and the $x_i \in \Omega \subset \mathbb{R}^P$ represent the corresponding values of a continuous multivariate covariate.
The set $\Omega$ is an arbitrary, but typically rectangular, subset of $\mathbb{R}^P$ containing the covariates. We assume the model

$$y_i = s(x_i) + \varepsilon_i, \quad \varepsilon_i \overset{\text{ind}}{\sim} N(0, \sigma_\varepsilon^2), \quad i = 1, \ldots, n, \qquad (2.2)$$

where $s(\cdot): \Omega \to \mathbb{R}$ is a smooth but otherwise unspecified function. The estimation of $s(\cdot)$ will be carried out with penalized spline smoothing, initially proposed by Eilers and Marx (1996), see also Ruppert et al. (2003) and Fahrmeir et al. (2013). The main idea is to model the function $s(\cdot)$ as a spline function with a rich basis, such that the spline function is flexible enough to capture very complex and highly nonlinear data structures, and, in order to prevent overfitting of the data, to impose an additional regularization or penalty term.

2.1 Tensor Product Splines

In the case of a single covariate, i.e. $P = 1$, let $\Omega := [a, b]$ be a compact interval partitioned by the $m + 2$ knots $\mathcal{K} := \{a = \kappa_0 < \ldots < \kappa_{m+1} = b\}$. Let $C^q(\Omega)$ denote the space of $q$-times continuously differentiable functions on $\Omega$ and let $\mathcal{P}_q(\Omega)$ denote the space of polynomials of degree $q$. We call the function space

$$S_q(\mathcal{K}) := \left\{ s \in C^{q-1}(\Omega) : s|_{[\kappa_{j-1}, \kappa_j]} \in \mathcal{P}_q([\kappa_{j-1}, \kappa_j]), \; j = 1, \ldots, m+1 \right\}$$

the space of splines of degree $q \in \mathbb{N}$ with knots $\mathcal{K}$. It is a finite dimensional linear space of dimension

$$J := \dim(S_q(\mathcal{K})) = m + q + 1,$$

and with $\{\phi_{j,q} : j = 1, \ldots, J\}$ we denote a basis of $S_q(\mathcal{K})$. For numerical applications the B-spline basis (cf. de Boor, 1978) is frequently used, whereas the truncated power series basis (cf. Schumaker, 1981) is often used for theoretical investigations.

To extend the concept of splines to multiple covariates, i.e. $P \geq 2$, we implement a tensor product approach.
Let therefore $S_{q_p}(\mathcal{K}_p)$ denote the spline space for the $p$-th covariate, $p = 1, \ldots, P$, and let $\{\phi^p_{j_p,q_p} : j_p = 1, \ldots, J_p := m_p + q_p + 1\}$ denote the related basis. The tensor product basis functions decompose as products of the form

$$\phi_{j,q} : \Omega := \Omega_1 \times \ldots \times \Omega_P \subseteq \mathbb{R}^P \to \mathbb{R}, \quad \phi_{j,q}(x) = \prod_{p=1}^P \phi^p_{j_p,q_p}(x_p), \qquad (2.3)$$

where $j := (j_1, \ldots, j_P)$ and $q := (q_1, \ldots, q_P)$ denote multi-indices and $x = (x_1, \ldots, x_P)^T \in \Omega$ denotes an arbitrary $P$-dimensional vector. The space of tensor product splines is then spanned by these tensor product basis functions, i.e.

$$S_q(\mathcal{K}) := \operatorname{span}\{\phi_{j,q} : 1 \leq j \leq J := (J_1, \ldots, J_P)\}.$$

By definition, this is a finite dimensional linear space of dimension

$$K := \dim(S_q(\mathcal{K})) = \prod_{p=1}^P \dim(S_{q_p}(\mathcal{K}_p)) = \prod_{p=1}^P J_p.$$

Note that we use the same symbols in the univariate and the multivariate context, with the difference that for $P > 1$ multi-indices are involved.
Every $s \in S_q(\mathcal{K})$ has a unique representation in terms of its basis functions, and for computational reasons we identify the set of multi-indices $\{j \in \mathbb{N}^P : 1 \leq j \leq J\}$ in descending lexicographical order with $\{1, \ldots, K\}$, so that

$$s = \sum_{1 \leq j \leq J} \alpha_j \phi_{j,q} = \sum_{k=1}^K \alpha_k \phi_{k,q}$$

with unique spline coefficients $\alpha_k \in \mathbb{R}$.

2.2 Penalized Least Squares Estimation

In order to fit a spline function to the given observations (2.1), we apply a simple least squares criterion

$$\min_{s \in S_q(\mathcal{K})} \sum_{i=1}^n (s(x_i) - y_i)^2 \quad \Longleftrightarrow \quad \min_{\alpha \in \mathbb{R}^K} \|\Phi\alpha - y\|^2, \qquad (2.4)$$

where $\Phi \in \mathbb{R}^{n \times K}$ denotes the matrix of spline basis functions evaluated at the covariates, i.e. element-wise

$$\Phi[i,k] = \phi_{k,q}(x_i),$$

$y \in \mathbb{R}^n$ denotes the vector of response values, and $\alpha \in \mathbb{R}^K$ denotes the unknown vector of spline coefficients. To prevent overfitting of the resulting least squares spline, a regularization term $R : S_q(\mathcal{K}) \to \mathbb{R}_+$ is included to avoid wiggly behavior of the spline function. This leads to the regularized least squares problem

$$\min_{s \in S_q(\mathcal{K})} \sum_{i=1}^n (s(x_i) - y_i)^2 + \lambda R(s), \qquad (2.5)$$

where the regularization parameter $\lambda > 0$ steers the influence of the regularization term. If $P = 1$ and if the B-spline basis is based on equally spaced knots, Eilers and Marx (1996) suggest to base the penalty on higher-order differences of the coefficients of adjacent B-splines, i.e.

$$\sum_{j=r+1}^J \left(\Delta^r \alpha_j\right)^2 = \alpha^T (\Delta_r)^T \Delta_r \alpha = \|\Delta_r \alpha\|^2,$$

where $\Delta^r(\cdot)$ denotes the $r$-th order backwards difference operator and $\Delta_r \in \mathbb{R}^{(J-r) \times J}$ denotes the related difference matrix. According to Fahrmeir et al. (2013), this difference penalty is extended to tensor product spline functions by

$$R_{\mathrm{diff}}(s) := \sum_{p=1}^P \alpha^T \left( I_{J_1} \otimes \ldots \otimes I_{J_{p-1}} \otimes \big(\Delta^p_{r_p}\big)^T \Delta^p_{r_p} \otimes I_{J_{p+1}} \otimes \ldots \otimes I_{J_P} \right) \alpha, \qquad (2.6)$$

where $\Delta^p_{r_p} \in \mathbb{R}^{(J_p - r_p) \times J_p}$ denotes the $r_p$-th order difference matrix for the $p$-th covariate, $I$ denotes the identity matrix of the respective dimension, and $\otimes$ denotes the Kronecker product. A more general regularization term for univariate splines, applicable for arbitrary knots and basis functions, is due to O'Sullivan (1986) and is extended to multivariate splines by Eubank (1988), see also Green and Silverman (1993). This so-called curvature penalty is given as the integrated square of the sum of all partial derivatives of total order two, that is

$$R_{\mathrm{curv}}(s) := \int_\Omega \sum_{p_1=1}^P \sum_{p_2=1}^P \left( \frac{\partial^2}{\partial x_{p_1} \partial x_{p_2}} s(x) \right)^2 \mathrm{d}x.$$

Note that, according to Siebenborn and Wagner (2019), it holds that

$$R_{\mathrm{curv}}(s) = \sum_{|r|=2} \frac{2}{r!} \, \alpha^T \Psi_r \alpha, \qquad (2.7)$$

where $r \in \mathbb{N}^P$ denotes a multi-index and $\Psi_r \in \mathbb{R}^{K \times K}$ is element-wise given as

$$\Psi_r[k, \ell] = \int_\Omega \partial^r \phi_{k,q}(x) \, \partial^r \phi_{\ell,q}(x) \, \mathrm{d}x = \langle \partial^r \phi_{k,q}, \partial^r \phi_{\ell,q} \rangle_{L^2(\Omega)}.$$

In matrix notation, both regularization terms lead to the penalized least squares problem

$$\min_{\alpha \in \mathbb{R}^K} \|\Phi\alpha - y\|^2 + \lambda \, \alpha^T \Lambda \alpha \qquad (2.8)$$

with $\Lambda \in \mathbb{R}^{K \times K}$ being an adequate symmetric and positive semidefinite penalty matrix representing the regularization term. A solution of (2.8) is given by a solution of the linear system

$$\left(\Phi^T \Phi + \lambda \Lambda\right) \alpha \overset{!}{=} \Phi^T y \qquad (2.9)$$

with symmetric and positive semidefinite coefficient matrix $\Phi^T \Phi + \lambda \Lambda$. In the following, we assume that this coefficient matrix is even positive definite, which holds true under very mild conditions on the covariates $x_i$, $i = 1, \ldots, n$, and is generally fulfilled in practice (cf. Wagner, 2019). This especially yields that the solution of (2.8) uniquely exists and is analytically given as

$$\hat\alpha := \left(\Phi^T \Phi + \lambda \Lambda\right)^{-1} \Phi^T y. \qquad (2.10)$$
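To make (2.8)–(2.10) concrete, the following minimal Python sketch fits a univariate P-spline ($P = 1$) with a cubic B-spline basis on equally spaced knots and a second-order difference penalty by solving the normal equations (2.9) directly. The simulated data, the number of knots, and the fixed value of $\lambda$ are illustrative choices of ours; the dense solve is only feasible here because $K$ is small.

```python
import numpy as np
from scipy.interpolate import BSpline

# Simulated data: noisy sine curve on [0, 1]
rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)

# Cubic B-spline basis (q = 3) on m + 2 equally spaced knots, cf. Section 2.1
q, m = 3, 20
knots = np.concatenate(([0.0] * q, np.linspace(0.0, 1.0, m + 2), [1.0] * q))
J = m + q + 1                                         # dim S_q(K) = m + q + 1
Phi = BSpline.design_matrix(x, knots, q).toarray()    # n x J basis matrix

# Second-order difference penalty, the P = 1 case of (2.6)
D = np.diff(np.eye(J), n=2, axis=0)                   # (J-2) x J difference matrix
Lam = D.T @ D

# Penalized least squares solution (2.10) for a fixed regularization parameter
lam = 1.0
alpha_hat = np.linalg.solve(Phi.T @ Phi + lam * Lam, Phi.T @ y)
fitted = Phi @ alpha_hat
```

For $P \geq 2$, $\Phi$ would be the row-wise Kronecker product of such univariate design matrices; Section 3 shows how to avoid ever forming it.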
Though formula (2.10) looks simple, for $P$ in the order of three or higher we obtain a numerically infeasible problem due to the exponential growth of the problem dimension $K$ in the number of covariates $P$, i.e. $K = \mathcal{O}(2^P)$. For example, for $P = 3$ already the storage of the penalty matrix $\Lambda$ requires approximately two gigabytes (GB) of random access memory (RAM), even when using a sparse column format. Since not only $\Lambda$ has to be stored, and since the matrices further need to be manipulated, i.e. the linear system (2.9) has to be solved, this clearly exceeds the internal memory of common computer systems. More detail is given in Section 3.

2.3 Selecting the Regularization Parameter

Selecting the regularization parameter is an important task since this parameter regulates the influence of the regularization term and therefore the amount of smoothness of the resulting P-spline. We follow the mixed model approach of Ruppert et al. (2003), see also Wood et al. (2017). To do so, we assume the prior

$$\alpha \sim N(0, \sigma_\alpha^2 \Lambda^-), \qquad (2.11)$$

where $\Lambda^-$ denotes the generalized inverse of $\Lambda$. Given $\alpha$, we assume normality, so that

$$y \mid \alpha \sim N(\Phi\alpha, \sigma_\varepsilon^2 I). \qquad (2.12)$$

Applying (2.11) to (2.12) leads to the observation model

$$y \sim N\left(0, \, \sigma_\varepsilon^2 (I + \lambda^{-1} \Phi \Lambda^- \Phi^T)\right), \quad \text{where } \lambda = \frac{\sigma_\varepsilon^2}{\sigma_\alpha^2}, \qquad (2.13)$$

with variance estimates

$$\hat\sigma_\varepsilon^2 = \frac{\|\Phi\alpha - y\|^2}{n} \quad \text{and} \quad \hat\sigma_\alpha^2 = \frac{\alpha^T \Lambda \alpha}{\operatorname{trace}\left((\Phi^T\Phi + \lambda\Lambda)^{-1} \Phi^T\Phi\right)}.$$

Since these estimates depend on $\lambda$ and vice versa, an analytical solution can not be achieved. However, following Wand (2003) or Kauermann (2005), we can estimate the variances iteratively as follows. Let $\hat\alpha^{(t)}$ be the penalized least squares estimate of $\alpha$ with $\lambda$ set to $\hat\lambda^{(t)}$; then

$$(\hat\sigma_\varepsilon^2)^{(t)} = \frac{\|\Phi\hat\alpha^{(t)} - y\|^2}{n}, \qquad (\hat\sigma_\alpha^2)^{(t)} = \frac{(\hat\alpha^{(t)})^T \Lambda \hat\alpha^{(t)}}{\operatorname{trace}\left((\Phi^T\Phi + \hat\lambda^{(t)}\Lambda)^{-1} \Phi^T\Phi\right)}, \qquad (2.14)$$

which leads to a fixed-point iteration according to Algorithm 1.
Algorithm 1: Fixed-point iteration for $\alpha$ and $\lambda$
Input: $\hat\lambda^{(0)} > 0$
  for $t = 0, 1, 2, \ldots$ do
    $\hat\alpha^{(t)} \leftarrow \big(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\big)^{-1} \Phi^T y$
    $(\hat\sigma_\varepsilon^2)^{(t)} \leftarrow \|\Phi\hat\alpha^{(t)} - y\|^2 / n$
    $(\hat\sigma_\alpha^2)^{(t)} \leftarrow (\hat\alpha^{(t)})^T \Lambda \hat\alpha^{(t)} \,/\, \operatorname{trace}\big((\Phi^T\Phi + \hat\lambda^{(t)}\Lambda)^{-1}\Phi^T\Phi\big)$
    $\hat\lambda^{(t+1)} \leftarrow (\hat\sigma_\varepsilon^2)^{(t)} / (\hat\sigma_\alpha^2)^{(t)}$
    if $|\hat\lambda^{(t+1)} - \hat\lambda^{(t)}| \leq \text{tol}$ then stop
  end
  $\hat\alpha^{(t+1)} \leftarrow \big(\Phi^T\Phi + \hat\lambda^{(t+1)}\Lambda\big)^{-1} \Phi^T y$
  return $\hat\alpha^{(t+1)}, \hat\lambda^{(t+1)}$
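For moderate $K$, Algorithm 1 can be transcribed directly with dense linear algebra. A sketch reusing Phi, Lam, and y from the previous snippet (the tolerance and iteration cap are illustrative choices):

```python
def fixed_point_lambda(Phi, Lam, y, lam0=1.0, tol=1e-6, max_iter=50):
    """Algorithm 1 (dense version): alternate penalized fits and variance updates."""
    n = Phi.shape[0]
    PtP, Pty = Phi.T @ Phi, Phi.T @ y
    lam = lam0
    for _ in range(max_iter):
        C = PtP + lam * Lam
        alpha = np.linalg.solve(C, Pty)                # penalized LS estimate
        sig2_eps = np.sum((Phi @ alpha - y) ** 2) / n  # error variance, cf. (2.14)
        df = np.trace(np.linalg.solve(C, PtP))         # trace((Phi'Phi + lam*Lam)^-1 Phi'Phi)
        sig2_alpha = (alpha @ Lam @ alpha) / df        # prior variance, cf. (2.14)
        lam_new = sig2_eps / sig2_alpha
        if abs(lam_new - lam) <= tol:
            break
        lam = lam_new
    alpha = np.linalg.solve(PtP + lam_new * Lam, Pty)  # final fit with converged lambda
    return alpha, lam_new

alpha_hat, lam_hat = fixed_point_lambda(Phi, Lam, y)
```

The dense trace computation is exactly what becomes infeasible for large $K$; Section 3.3 replaces it with a stochastic estimate.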
3 Matrix-free Computation

The main task within the P-spline method is (repetitively) solving a linear system of the form

$$\left(\Phi^T\Phi + \lambda\Lambda\right)\alpha \overset{!}{=} \Phi^T y \qquad (3.1)$$

for the unknown spline coefficients $\alpha \in \mathbb{R}^K$. Since the coefficient matrix is symmetric and positive definite, a variety of solution methods, such as the conjugate gradient (CG) method of Hestenes and Stiefel (1952), can be applied. The difficulty is hidden in the problem dimension

$$K = \prod_{p=1}^P J_p = \mathcal{O}(2^P),$$

which depends exponentially on the number of utilized covariates $P$. This fact, known as the curse of dimensionality (cf. Bellman, 1957), leads to a tremendous growth of the memory required to store the coefficient matrix as $P$ increases. Even for a moderate number of covariates $P \geq 3$, this exceeds the memory capacities of common computer systems. In the following, we present a matrix-free algorithm to solve the linear system (3.1), as well as to estimate the related regularization parameter (2.13), that requires only a negligible amount of storage space.
3.1 Matrix-free Matrix-Vector Products

The basic idea of matrix-free methods for solving a linear system of equations is to not store the coefficient matrix explicitly, but to access the matrix only by evaluating matrix-vector products. Many iterative methods, such as the CG method, allow for such a matrix-free implementation, and we now focus on computing matrix-vector products with the coefficient matrix

$$\Phi^T\Phi + \lambda\Lambda$$

without assembling and storing the matrices $\Phi$ and $\Lambda$.

We first focus on the spline basis matrix $\Phi \in \mathbb{R}^{n \times K}$. By definition of the tensor product spline space, its basis functions are given as point-wise products of the univariate basis functions (2.3), and therefore we conclude for the $i$-th column of $\Phi^T$ that

$$\Phi^T[\cdot, i] = \begin{pmatrix} \phi_{1,q}(x_i) \\ \vdots \\ \phi_{K,q}(x_i) \end{pmatrix} = \bigotimes_{p=1}^P \begin{pmatrix} \phi^p_{1,q_p}(x_i^p) \\ \vdots \\ \phi^p_{J_p,q_p}(x_i^p) \end{pmatrix}.$$

Defining $\Phi_p \in \mathbb{R}^{n \times J_p}$ as the matrix of the univariate spline basis functions evaluated at the respective covariates, i.e.

$$\Phi_p[i, j_p] := \phi^p_{j_p,q_p}(x_i^p), \quad p = 1, \ldots, P,$$

it follows that

$$\Phi^T[\cdot, i] = \bigotimes_{p=1}^P \Phi_p^T[\cdot, i] = \bigotimes_{p=1}^P \Phi_p[i, \cdot]^T. \qquad (3.2)$$

Note that therefore

$$\Phi^T = \left[ \Big(\bigotimes_{p=1}^P \Phi_p[1, \cdot]\Big)^T, \ldots, \Big(\bigotimes_{p=1}^P \Phi_p[n, \cdot]\Big)^T \right] \in \mathbb{R}^{K \times n},$$

where $\Phi_p[i, \cdot]$ denotes the $i$-th row of $\Phi_p$. This can be expressed in more compact form as

$$\Phi^T = \bigodot_{p=1}^P \Phi_p^T,$$

where $\bigodot$ denotes the Khatri-Rao product, defined as the column-wise Kronecker product for matrices with the same number of columns. Using (3.2), it holds for arbitrary $y \in \mathbb{R}^n$ that

$$\Phi^T y = \sum_{i=1}^n y[i] \, v_i, \qquad (3.3)$$

where

$$v_i := \bigotimes_{p=1}^P \Phi_p[i, \cdot]^T \in \mathbb{R}^K \qquad (3.4)$$

denotes the $i$-th column of $\Phi^T$ and $y[i]$ the $i$-th element of the vector $y$. Exploiting this structure, Algorithm 2 computes the matrix-vector product $\Phi^T y$ by only accessing and storing the small univariate basis matrices $\Phi_p \in \mathbb{R}^{n \times J_p}$, $p = 1, \ldots, P$.
Algorithm 2: Matrix-vector product with $\Phi^T$
Input: $\Phi_1, \ldots, \Phi_P$, $y$
Output: $\Phi^T y$
  $w \leftarrow 0$
  for $i = 1, \ldots, n$ do
    $v \leftarrow \Phi_1[i,\cdot] \otimes \ldots \otimes \Phi_P[i,\cdot]$
    $w \leftarrow w + y[i]\,v$
  end
  return $w$

In analogy to (3.3), it holds for arbitrary $\alpha \in \mathbb{R}^K$ that

$$\Phi\alpha = \left(v_1^T\alpha, \ldots, v_n^T\alpha\right)^T,$$

such that Algorithm 3 computes the matrix-vector product $\Phi\alpha$, again by only accessing and storing the small factors $\Phi_1, \ldots, \Phi_P$.
Algorithm 3: Matrix-vector product with $\Phi$
Input: $\Phi_1, \ldots, \Phi_P$, $\alpha$
Output: $\Phi\alpha$
  for $i = 1, \ldots, n$ do
    $v \leftarrow \Phi_1[i,\cdot] \otimes \ldots \otimes \Phi_P[i,\cdot]$
    $w[i] \leftarrow v^T\alpha$
  end
  return $w$
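A direct Python transcription of Algorithms 2 and 3 may look as follows (a sketch; the univariate bases are assumed to be given as a list of NumPy arrays $\Phi_1, \ldots, \Phi_P$, and only one length-$K$ vector is held in memory at any time):

```python
import numpy as np
from functools import reduce

def matvec_PhiT(Phis, y):
    """Algorithm 2: compute Phi^T y from the univariate basis matrices only."""
    K = int(np.prod([Phi_p.shape[1] for Phi_p in Phis]))
    w = np.zeros(K)
    for i in range(len(y)):
        # v_i = Phi_1[i,:] kron ... kron Phi_P[i,:], the i-th column of Phi^T, cf. (3.4)
        v = reduce(np.kron, [Phi_p[i, :] for Phi_p in Phis])
        w += y[i] * v
    return w

def matvec_Phi(Phis, alpha):
    """Algorithm 3: compute Phi alpha from the univariate basis matrices only."""
    n = Phis[0].shape[0]
    w = np.empty(n)
    for i in range(n):
        v = reduce(np.kron, [Phi_p[i, :] for Phi_p in Phis])
        w[i] = v @ alpha
    return w
```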
Since for $P > 1$ we have

$$J_p \ll K = \prod_{p=1}^P J_p,$$

the storage costs of the univariate basis matrices $\Phi_p$ are negligible, and Algorithms 2 and 3 allow the computation of matrix-vector products with $\Phi$, $\Phi^T$, and $\Phi^T\Phi$ without explicitly forming these infeasibly large matrices.

Regarding the penalty matrix $\Lambda$, the underlying regularization term is crucial. For the difference penalty (2.6), we note that

$$\Lambda_{\mathrm{diff}} = \sum_{p=1}^P \left( I_{J_1} \otimes \ldots \otimes I_{J_{p-1}} \otimes \big(\Delta^p_{r_p}\big)^T \Delta^p_{r_p} \otimes I_{J_{p+1}} \otimes \ldots \otimes I_{J_P} \right) = \sum_{p=1}^P \left( I_{L_p} \otimes \Theta_p \otimes I_{R_p} \right), \qquad (3.5)$$

where

$$L_p := \prod_{t=1}^{p-1} J_t, \quad R_p := \prod_{t=p+1}^P J_t, \quad \Theta_p := \big(\Delta^p_{r_p}\big)^T \Delta^p_{r_p} \in \mathbb{R}^{J_p \times J_p}.$$

A matrix of the form $I_L \otimes A \otimes I_R$ with $L, R \in \mathbb{N}$ and quadratic $A \in \mathbb{R}^{J \times J}$ is called a normal factor, such that $\Lambda_{\mathrm{diff}}$ is given as a sum of normal factors. Modifying the idea of Benoit et al. (2001), multiplication with a normal factor is achieved as follows. For arbitrary $\alpha \in \mathbb{R}^{LJR}$ and $B := A \otimes I_R \in \mathbb{R}^{JR \times JR}$ it holds that

$$(I_L \otimes A \otimes I_R)\alpha = (I_L \otimes B)\alpha.$$

The matrix $I_L \otimes B$ is a block-diagonal matrix consisting of $L$ blocks, each containing the matrix $B$. According to $B$, we decompose the vector $\alpha$ into $L$ chunks $\alpha^{(1)}, \ldots, \alpha^{(L)}$, each of length $JR$, such that

$$(I_L \otimes A \otimes I_R)\alpha = \begin{pmatrix} B\alpha^{(1)} \\ \vdots \\ B\alpha^{(L)} \end{pmatrix}.$$

In order to compute $B\alpha^{(l)}$ for $l = 1, \ldots, L$, note that

$$B = A \otimes I_R = \begin{pmatrix} a_{1,1} I_R & \ldots & a_{1,J} I_R \\ \vdots & \ddots & \vdots \\ a_{J,1} I_R & \ldots & a_{J,J} I_R \end{pmatrix},$$

such that each row of $B$ consists of the elements of a row of $A$ placed at distance $R$ apart, and zeros for the rest. Computing the $t$-th element of $B\alpha^{(l)}$ therefore boils down to the repeated extraction of components of $\alpha^{(l)}$ at distance $R$ apart, starting with element number $t$, forming a vector $z^{\mathrm{in}}_{l,t} \in \mathbb{R}^J$, and then multiplying the $t$-th row of $A$ with $z^{\mathrm{in}}_{l,t}$. The multiplication

$$z^{\mathrm{out}}_{l,t} := A z^{\mathrm{in}}_{l,t} \in \mathbb{R}^J$$

therefore provides several elements of the matrix-vector product $(I_L \otimes A \otimes I_R)\alpha$, with positions at distance $R$ apart. Finally, looping over all $L$ chunks and all $R$ blocks yields Algorithm 4, which computes the matrix-vector product of a normal factor $I_L \otimes A \otimes I_R$ with an arbitrary vector $\alpha \in \mathbb{R}^{LJR}$ by only storing the matrix $A \in \mathbb{R}^{J \times J}$. Applying Algorithm 4 to each addend in (3.5) finally allows to compute matrix-vector products with $\Lambda_{\mathrm{diff}}$ by only forming and storing the small matrices $\Theta_p \in \mathbb{R}^{J_p \times J_p}$, $p = 1, \ldots, P$.

Algorithm 4: Matrix-vector product with a normal factor
Input: $L, R \in \mathbb{N}$, $A \in \mathbb{R}^{J \times J}$, $\alpha \in \mathbb{R}^{LJR}$
Output: $(I_L \otimes A \otimes I_R)\alpha \in \mathbb{R}^{LJR}$
  base ← 0
  for $l = 1, \ldots, L$ do                          // loop over all L chunks
    for $r = 1, \ldots, R$ do                        // loop over all R blocks
      index ← base + $r$
      for $j = 1, \ldots, J$ do                      // form the vector z_in
        $z^{\mathrm{in}}[j]$ ← $\alpha$[index]
        index ← index + $R$
      end
      $z^{\mathrm{out}}$ ← $A z^{\mathrm{in}}$       // compute z_out
      index ← base + $r$
      for $j = 1, \ldots, J$ do                      // store results at the right position
        $\alpha$[index] ← $z^{\mathrm{out}}[j]$
        index ← index + $R$
      end
    end
    base ← base + $RJ$                              // jump to the next chunk
  end
  return $\alpha$
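In NumPy, the index bookkeeping of Algorithm 4 can equivalently be expressed through a reshape: viewing $\alpha$ as an $L \times J \times R$ array, applying $A$ along the middle axis is exactly the normal-factor product. A sketch of this variant (the function names are ours), together with the resulting product for the difference penalty (3.5):

```python
import numpy as np

def normal_factor_matvec(L, A, R, alpha):
    """Compute (I_L kron A kron I_R) @ alpha without forming the Kronecker product."""
    J = A.shape[0]
    a3 = alpha.reshape(L, J, R)             # axes: chunk l, coefficient j, block r
    out = np.einsum('jk,lkr->ljr', A, a3)   # apply A along the middle axis
    return out.reshape(L * J * R)

def lambda_diff_matvec(Thetas, alpha):
    """Matrix-vector product with Lambda_diff, a sum of normal factors, cf. (3.5)."""
    Js = [Theta_p.shape[0] for Theta_p in Thetas]
    w = np.zeros(int(np.prod(Js)))
    for p, Theta_p in enumerate(Thetas):
        L_p = int(np.prod(Js[:p]))          # L_p = J_1 * ... * J_{p-1}
        R_p = int(np.prod(Js[p + 1:]))      # R_p = J_{p+1} * ... * J_P
        w += normal_factor_matvec(L_p, Theta_p, R_p, alpha)
    return w
```

The reshape variant trades the explicit index arithmetic of Algorithm 4 for BLAS-backed tensor contractions; the storage requirement remains one length-$K$ vector plus the small factors $\Theta_p$.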
If the curvature penalty (2.7) is used instead of the difference penalty (2.6), i.e.

$$\Lambda_{\mathrm{curv}} = \sum_{|r|=2} \frac{2}{r!} \, \Psi_r,$$

Wagner (2019) showed that

$$\Psi_r = \bigotimes_{p=1}^P \Psi^p_{r_p} = \prod_{p=1}^P \left( I_{L_p} \otimes \Psi^p_{r_p} \otimes I_{R_p} \right),$$

where $\Psi^p_{r_p}$ is the one-dimensional counterpart of $\Psi_r$ in the $p$-th direction, i.e.

$$\Psi^p_{r_p} \in \mathbb{R}^{J_p \times J_p}, \quad \Psi^p_{r_p}[j_p, \ell_p] = \left\langle \partial^{r_p} \phi^p_{j_p,q_p}, \, \partial^{r_p} \phi^p_{\ell_p,q_p} \right\rangle_{L^2(\Omega_p)}, \quad p = 1, \ldots, P.$$

Therefore, a matrix-vector product with $\Lambda_{\mathrm{curv}}$ can also be computed by means of Algorithm 4, by looping backwards over all $P$ normal factors. That is, setting $w_P := \alpha$, we compute

$$w_{p-1} := \left( I_{L_p} \otimes \Psi^p_{r_p} \otimes I_{R_p} \right) w_p, \quad p = P, \ldots, 1,$$

and obtain $w_0 = \Psi_r\alpha$ as the desired matrix-vector product, again by only storing the small matrices $\Psi^p_{r_p} \in \mathbb{R}^{J_p \times J_p}$, $p = 1, \ldots, P$.

3.2 Matrix-free CG Method

We now turn back to the main problem of solving the linear system (3.1) for the spline coefficients $\alpha \in \mathbb{R}^K$, where the coefficient matrix

$$\Phi^T\Phi + \lambda\Lambda \in \mathbb{R}^{K \times K}$$

does not fit into the working memory of customary computational systems. Since the coefficient matrix is symmetric and positive definite, the conjugate gradient method is an appropriate solution method. A major advantage of the CG method is that the coefficient matrix does not have to be known explicitly, but only implicitly through its action on vectors, i.e. only matrix-vector products with the coefficient matrix are required. The CG method can therefore be implemented in a matrix-free manner, in the sense that the coefficient matrix is not required to exist. Using the algorithms implemented in Section 3.1, matrix-vector products with all components $\Phi^T$, $\Phi$, and $\Lambda$ of the coefficient matrix, and therefore with the coefficient matrix itself, can be computed at negligible storage costs. The resulting matrix-free version of the CG method to solve the linear system (3.1) is given in Algorithm 5. Note that, for a fixed $P$, Algorithm 5 merely requires the storage of the small factor matrices occurring within the Kronecker and Khatri-Rao product matrices. Therefore, the storage demand depends only linearly on $P$, i.e. is $\mathcal{O}(P)$, which is a significant improvement compared to the $\mathcal{O}(2^P)$ required for the naive implementation with full matrices.

Algorithm 5: Matrix-free CG method for $\alpha$
Output: $\big(\Phi^T\Phi + \lambda\Lambda\big)^{-1}\Phi^T y$
  $\alpha \leftarrow 0$
  $p \leftarrow r \leftarrow \Phi^T y$    // matrix-vector product by Algorithm 2
  while $\|r\| > \text{tol}$ do
    $v \leftarrow \big(\Phi^T\Phi + \lambda\Lambda\big)p$    // matrix-vector products by Algorithms 2, 3, and 4
    $w \leftarrow \|r\|^2 / (p^T v)$
    $\alpha \leftarrow \alpha + w\,p$
    $\tilde r \leftarrow r$
    $r \leftarrow r - w\,v$
    $p \leftarrow r + \big(\|r\|^2 / \|\tilde r\|^2\big)\, p$
  end
  return $\alpha$

It is well known that the CG method finds the exact solution in at most $K$ iterations. However, since $K$ might be very large, the CG method is in general used as an iterative method in practice. The rate of convergence of the CG method strongly depends on the condition number of the coefficient matrix, which is assumed to be large due to the construction of the linear system (3.1) via a normal equation. Further, for an increasing number $P$ of covariates, the condition number deteriorates, since the number $n$ of observations does not increase to the same extent. In order to significantly reduce the runtime of Algorithm 5, preconditioning techniques are widely used. However, since the CG method is implemented matrix-free, the preconditioner has to be matrix-free itself, such that traditional methods are ruled out. According to Siebenborn and Wagner (2019), a multigrid preconditioner can be utilized provided that the curvature penalty with equally spaced knots and a B-spline basis is used. The presented matrix-free CG method, on the contrary, is applicable for arbitrary choices of knots, penalties, and basis functions. In this case, we can at least implement a simple diagonal preconditioner, that is, we choose the diagonal of the coefficient matrix as preconditioner. To compute this diagonal, again, the coefficient matrix is not allowed to exist. Since the diagonal of a Kronecker matrix is simply the Kronecker product of the diagonals of its factors, it remains to compute the diagonal of $\Phi^T\Phi$. Let $e_k$ denote the $k$-th unit vector, such that the $k$-th diagonal element of $\Phi^T\Phi$ reads

$$\Phi^T\Phi[k,k] = e_k^T \Phi^T\Phi e_k = \|\Phi e_k\|^2 = \left\| \begin{pmatrix} v_1^T e_k \\ \vdots \\ v_n^T e_k \end{pmatrix} \right\|^2 = \left\| \begin{pmatrix} v_1[k] \\ \vdots \\ v_n[k] \end{pmatrix} \right\|^2 = \sum_{i=1}^n v_i[k]^2,$$

where $v_i \in \mathbb{R}^K$ is defined as in (3.4). In summary, the diagonal of the coefficient matrix $\Phi^T\Phi + \lambda\Lambda$ can be computed without explicitly forming the matrix.
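The matrix-free CG method is also directly available through library routines: SciPy's cg accepts a LinearOperator whose matvec callback composes the building blocks above, so the $K \times K$ coefficient matrix never exists. A sketch under the same assumptions, using the hypothetical helpers matvec_Phi, matvec_PhiT, and lambda_diff_matvec from the previous snippets:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_penalized(Phis, Thetas, lam, rhs):
    """Solve (Phi^T Phi + lam * Lambda) x = rhs by matrix-free CG, cf. Algorithm 5."""
    K = rhs.shape[0]

    def matvec(v):
        # Phi^T (Phi v) via Algorithms 3 and 2, Lambda v via Algorithm 4
        return matvec_PhiT(Phis, matvec_Phi(Phis, v)) + lam * lambda_diff_matvec(Thetas, v)

    x, info = cg(LinearOperator((K, K), matvec=matvec), rhs, atol=1e-8)
    assert info == 0, "CG did not converge"
    return x

# Penalized spline coefficients for a fixed lambda, cf. (2.10):
# alpha_hat = solve_penalized(Phis, Thetas, lam, matvec_PhiT(Phis, y))
```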
3.3 Matrix-free Estimation of the Regularization Parameter

As shown in Section 2.3, the regularization parameter $\lambda$ can be estimated by means of the fixed-point iteration presented in Algorithm 1. There, in each iteration step, the linear system

$$\left(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\right)\alpha \overset{!}{=} \Phi^T y$$

has to be solved, which can now be achieved by Algorithm 5. It remains to compute the trace of the matrix

$$B_t := \left(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\right)^{-1}\Phi^T\Phi \in \mathbb{R}^{K \times K},$$

which is clearly not trivial, since $\big(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\big)^{-1}$ can not be computed. Since $B_t$ is symmetric and positive definite, we follow the suggestions of Avron and Toledo (2011), who proposed several algorithms to estimate the trace of an implicit, symmetric, and positive semidefinite matrix. The basic approach, which is due to Hutchinson (1989), is to estimate the trace of $B_t$ as

$$\operatorname{trace}(B_t) \approx \frac{1}{M} \sum_{m=1}^M z_m^T B_t z_m,$$

where the $z_m \in \mathbb{R}^K$ are $M$ independent random vectors whose entries are i.i.d. Rademacher distributed, that is $\operatorname{prob}(z_m[k] = \pm 1) = 1/2$.
The advantage of this approach is that the matrix $B_t$ does not have to be explicitly known; only the $M$ products $z_m^T (B_t z_m)$ need to be computed efficiently. For computational reasons, we use the reformulation

$$\operatorname{trace}(B_t) = \operatorname{trace}\left(\big(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\big)^{-1}\Phi^T\Phi\right) = K - \operatorname{trace}\left(\big(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\big)^{-1}\hat\lambda^{(t)}\Lambda\right).$$

Given the random vectors $z_m$, $m = 1, \ldots, M$, we first compute the vectors

$$\tilde z_m^{(t)} := \hat\lambda^{(t)}\Lambda z_m$$
by means of Algorithm 4, and then the vectors

$$\bar z_m^{(t)} := \left(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\right)^{-1}\tilde z_m^{(t)} \qquad (3.6)$$

by Algorithm 5, i.e. as solutions of linear systems. The desired estimate of the trace then finally reads

$$\operatorname{trace}\left(\big(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\big)^{-1}\Phi^T\Phi\right) \approx K - \frac{1}{M}\sum_{m=1}^M z_m^T \bar z_m^{(t)}. \qquad (3.7)$$
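A sketch of the estimator (3.7), building on the hypothetical helpers solve_penalized and lambda_diff_matvec introduced above (the default M = 20 is an illustrative choice; each probe vector costs one CG solve):

```python
import numpy as np

def hutchinson_df(Phis, Thetas, lam, M=20, rng=None):
    """Estimate trace((Phi'Phi + lam*Lambda)^-1 Phi'Phi) via the reformulation (3.7)."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = int(np.prod([Phi_p.shape[1] for Phi_p in Phis]))
    acc = 0.0
    for _ in range(M):
        z = rng.choice([-1.0, 1.0], size=K)                  # Rademacher probe vector
        z_tilde = lam * lambda_diff_matvec(Thetas, z)        # Algorithm 4
        z_bar = solve_penalized(Phis, Thetas, lam, z_tilde)  # Algorithm 5, cf. (3.6)
        acc += z @ z_bar
    return K - acc / M                                       # estimated effective trace
```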
Summarizing the results, Algorithm 6 allows for a simultaneous matrix-free estimation of the regularization parameter $\lambda$ and the spline coefficients $\alpha$.

Algorithm 6: Matrix-free fixed-point iteration for $\alpha$ and $\lambda$
Input: $z_1, \ldots, z_M$, $\hat\lambda^{(0)} > 0$
  for $t = 0, 1, 2, \ldots$ do
    $\hat\alpha^{(t)} \leftarrow \big(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\big)^{-1}\Phi^T y$    // solve linear system by Algorithm 5
    $(\hat\sigma_\varepsilon^2)^{(t)} \leftarrow \|\Phi\hat\alpha^{(t)} - y\|^2 / n$    // matrix-vector product by Algorithm 3
    for $m = 1, \ldots, M$ do
      $\tilde z_m^{(t)} \leftarrow \hat\lambda^{(t)}\Lambda z_m$    // matrix-vector product by Algorithm 4
      $\bar z_m^{(t)} \leftarrow \big(\Phi^T\Phi + \hat\lambda^{(t)}\Lambda\big)^{-1}\tilde z_m^{(t)}$    // solve linear system by Algorithm 5
    end
    $v^{(t)} \leftarrow K - \frac{1}{M}\sum_{m=1}^M z_m^T \bar z_m^{(t)}$
    $(\hat\sigma_\alpha^2)^{(t)} \leftarrow (\hat\alpha^{(t)})^T\Lambda\hat\alpha^{(t)} / v^{(t)}$    // matrix-vector product by Algorithm 4
    $\hat\lambda^{(t+1)} \leftarrow (\hat\sigma_\varepsilon^2)^{(t)} / (\hat\sigma_\alpha^2)^{(t)}$
    if $|\hat\lambda^{(t+1)} - \hat\lambda^{(t)}| \leq \text{tol}$ then stop
  end
  $\hat\alpha^{(t+1)} \leftarrow \big(\Phi^T\Phi + \hat\lambda^{(t+1)}\Lambda\big)^{-1}\Phi^T y$    // solve linear system by Algorithm 5
  return $\hat\alpha^{(t+1)}, \hat\lambda^{(t+1)}$

Certainly, the accuracy of the trace estimate (3.7) depends on the number $M$ of utilized random vectors as well as on the particular sample. For the application we have in mind, however, numerical tests show that already a moderate number of random vectors yields satisfying estimates. This is shown in Figure 1, where the trace estimates for a simple test problem with $P = 2$ covariates and fixed regularization parameter $\lambda$ are graphed versus the number of utilized random vectors. In this particular example, the true trace (red line) is estimated with sufficient precision already for a moderate number $M$ of random vectors.

4 Generalized Additive Regression

The above results are given for normal response models. Following the usual generalization from regression to generalized regression, we now extend the above results by allowing for an exponential family distribution as well as multiple additive smooth functions.
4.1 Multiple Smooth Functions

We discuss first how the above formulas extend to the case of multiple smooth functions. That is, we replace (2.2) by

$$y_i = \sum_{j=1}^I s_{(j)}(x_{(j),i}) + \varepsilon_i,$$

where $x_{(j)}$ is a vector of continuous covariates and $s_{(j)}(\cdot)$ a smooth function, which is estimated by penalized splines. To do so, we replace each $s_{(j)}(\cdot)$ by a (tensor product) basis matrix $\Phi_{(j)} \in \mathbb{R}^{n \times K_{(j)}}$, constructed as above, with corresponding spline coefficients $\alpha_{(j)} \in \mathbb{R}^{K_{(j)}}$. This leads, again, to the massive dimensional spline basis matrix

$$\Phi = \left[\Phi_{(1)}, \ldots, \Phi_{(I)}\right] \in \mathbb{R}^{n \times K}, \quad K := \sum_{j=1}^I K_{(j)}.$$

However, since

$$\Phi\alpha = \left[\Phi_{(1)}, \ldots, \Phi_{(I)}\right] \begin{pmatrix} \alpha_{(1)} \\ \vdots \\ \alpha_{(I)} \end{pmatrix} = \sum_{j=1}^I \Phi_{(j)}\alpha_{(j)} \quad \text{and} \quad \Phi^T y = \begin{pmatrix} \Phi_{(1)}^T \\ \vdots \\ \Phi_{(I)}^T \end{pmatrix} y = \begin{pmatrix} \Phi_{(1)}^T y \\ \vdots \\ \Phi_{(I)}^T y \end{pmatrix},$$

Algorithms 2 and 3 can still be used to compute matrix-vector products with $\Phi$, $\Phi^T$, and $\Phi^T\Phi$. For the related penalty matrix

$$\Lambda(\lambda) := \operatorname{blockdiag}(\lambda_{(j)}\Lambda_{(j)})$$

the matrix-vector products

$$\Lambda(\lambda)\alpha = \begin{pmatrix} \lambda_{(1)}\Lambda_{(1)}\alpha_{(1)} \\ \vdots \\ \lambda_{(I)}\Lambda_{(I)}\alpha_{(I)} \end{pmatrix}$$

can also still be computed by Algorithm 4. Thus, the matrix-free CG method of Algorithm 5 is straightforwardly extended to the case of an additive model.

In order to estimate the regularization parameters, we assume normality for the $\alpha_{(j)}$ as in (2.11) and mutual independence of $\alpha_{(j)}$ and $\alpha_{(l)}$ for $j \neq l$, so that

$$\alpha_{(j)} \sim N(0, \sigma^2_{(j)}\Lambda^-_{(j)}).$$

We can now estimate the prior variances $\sigma^2_{(j)}$ by extending (2.13) as follows. Let

$$D^-_{(j)} = \operatorname{blockdiag}(0, \ldots, 0, \Lambda^-_{(j)}, 0, \ldots, 0),$$

that is, $D^-_{(j)}$ is a block diagonal matrix with zero entries except for the $j$-th diagonal block. This leads to

$$\hat\sigma^2_{(j)} = \frac{\alpha_{(j)}^T\Lambda_{(j)}\alpha_{(j)}}{\operatorname{trace}\left((I + \Phi\Lambda(\lambda)^-\Phi^T)^{-1}\,\Phi\,\lambda_{(j)}D^-_{(j)}\,\Phi^T\right)}.$$

The above formula can be approximated through

$$\hat\sigma^2_{(j)} \approx \frac{\alpha_{(j)}^T\Lambda_{(j)}\alpha_{(j)}}{\operatorname{trace}\left((\Phi_{(j)}^T\Phi_{(j)} + \lambda_{(j)}\Lambda_{(j)})^{-1}\Phi_{(j)}^T\Phi_{(j)}\right)},$$

which results if we penalize only the $j$-th component. As in the case of a single smooth function, the variances $\hat\sigma^2_{(j)}$ can be estimated by means of formula (3.7) by only taking the $j$-th component into account. Finally, since all required components can be computed matrix-free, Algorithm 6 can directly be extended to the additive model with only slight modifications.

4.2 Generalized Regression Models

The results of the previous sections are readily extended towards generalized regression models. This leads to iteratively weighted fitting, that is, applying the algorithms of the previous sections with weights included in an iterative manner. Following the standard derivations of generalized additive models (see e.g. Hastie and Tibshirani, 1990 or Wood, 2017), we assume the exponential family distribution model

$$y \mid x \sim \exp\{y\theta - \kappa(\theta)\}\, c(y),$$

where

$$\theta = h(s(x)) \qquad (4.1)$$

with $h(\cdot)$ an invertible link or response function. The terms $\theta$ and $s(x)$ are linked through the expected value

$$E(y \mid x) = \frac{\partial\kappa(\theta)}{\partial\theta} = \mu(\theta) = \mu(h(s(x))).$$

The minimization (2.4) is then replaced by

$$\min_{s \in S_q(\mathcal{K})} -\sum_{i=1}^n \big(y_i\theta_i - \kappa(\theta_i)\big) + \lambda R(s).$$

This modifies (2.8) to the iterative version

$$\alpha^{(t+1)} = \alpha^{(t)} + I(\alpha^{(t)}, \lambda)^{-1} s(\alpha^{(t)}, \lambda), \qquad (4.2)$$

where

$$I(\alpha^{(t)}, \lambda) = \Phi^T W_2^{(t)}\Phi + \lambda\Lambda$$

is the (penalized) Fisher matrix and

$$s(\alpha^{(t)}, \lambda) = \Phi^T W_1^{(t)}\left(y - h(\Phi\alpha^{(t)})\right) - \lambda\Lambda\alpha^{(t)}$$

is the penalized score.
The weight matrices are constructed from

$$W_1^{(t)} = \operatorname{diag}\big(h'(\Phi\alpha^{(t)})\big), \qquad W_2^{(t)} = \operatorname{diag}\left(h'(\Phi\alpha^{(t)})\, \frac{\partial^2\kappa(\theta^{(t)})}{\partial\theta\,\partial\theta^T}\, h'(\Phi\alpha^{(t)})^T\right), \qquad (4.3)$$

where $h'(\cdot)$ is the first order derivative of $h(\cdot)$ and $\partial^2\kappa(\theta)/(\partial\theta\,\partial\theta^T)$ results as the variance contribution in the exponential family distribution. In the case of a canonical link, that is $\theta = \Phi\alpha$, one gets $h(\cdot) = \operatorname{id}(\cdot)$, so that $W_1$ equals the identity matrix and $W_2$ is the diagonal matrix of variances.

In order to compute the update step

$$I(\alpha^{(t)}, \lambda)^{-1} s(\alpha^{(t)}, \lambda),$$

we have to slightly modify Algorithm 5, since we now need to compute matrix-vector products with $I(\alpha^{(t)}, \lambda)$. Note that $W_2^{(t)} \in \mathbb{R}^{n \times n}$ is a diagonal matrix and can therefore be stored as a vector $w_2^{(t)} \in \mathbb{R}^n$. The matrix-vector product is then given as

$$I(\alpha^{(t)}, \lambda)\,\alpha^{(t)} = \Phi^T W_2^{(t)}\Phi\alpha^{(t)} + \lambda\Lambda\alpha^{(t)} = \Phi^T\left(w_2^{(t)} * \big(\Phi\alpha^{(t)}\big)\right) + \lambda\Lambda\alpha^{(t)},$$

where $*$ denotes element-wise vector-vector multiplication. The product can then still be computed by means of the methods presented in Algorithms 2, 3, and 4. Since the same modification with $w_1^{(t)} \in \mathbb{R}^n$ instead of $W_1^{(t)} \in \mathbb{R}^{n \times n}$ is possible to compute the right-hand side vector $s(\alpha^{(t)}, \lambda)$, the matrix-free CG method can be applied to solve the linear system in each iteration.

Note that, from a computational point of view, the major difference to the conventional spline regression model is that the P-spline model is fitted by solving one linear system, whereas the generalization requires solving a linear system (of the same size) per iteration.

In order to estimate the regularization parameter, one can proceed in analogy to Section 2.3 and obtain

$$\hat\lambda^{(t)} = \frac{(\hat\sigma_\varepsilon^2)^{(t)}}{(\hat\sigma_\alpha^2)^{(t)}},$$

where in the generalized model it is

$$(\hat\sigma_\varepsilon^2)^{(t)} = \frac{\|h'(\Phi\hat\alpha^{(t)}) - y\|^2}{n},$$

$$(\hat\sigma_\alpha^2)^{(t)} = \frac{(\hat\alpha^{(t)})^T\Lambda\hat\alpha^{(t)}}{\operatorname{trace}\left((\Phi^T W_2^{(t)}\Phi + \hat\lambda^{(t)}\Lambda)^{-1}\Phi^T W_2^{(t)}\Phi\right)} = \frac{(\hat\alpha^{(t)})^T\Lambda\hat\alpha^{(t)}}{K - \operatorname{trace}\left((\Phi^T W_2^{(t)}\Phi + \hat\lambda^{(t)}\Lambda)^{-1}\hat\lambda^{(t)}\Lambda\right)}.$$

Since matrix-vector products with $W_2^{(t)}$ can be performed as element-wise vector-vector products, we can estimate the trace of

$$(\Phi^T W_2^{(t)}\Phi + \hat\lambda^{(t)}\Lambda)^{-1}\hat\lambda^{(t)}\Lambda$$

in analogy to (3.7) by replacing (3.6) with

$$\bar z_m^{(t)} := \left(\Phi^T W_2^{(t)}\Phi + \hat\lambda^{(t)}\Lambda\right)^{-1}\tilde z_m^{(t)}.$$

The extension of the generalized response model to an additive model is then straightforward, in analogy to Section 4.1.
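The weighted Fisher matrix of the generalized model needs no storage either; only the length-$n$ diagonal $w_2^{(t)}$ enters, as an element-wise product. A sketch under the same assumptions as before:

```python
def fisher_matvec(Phis, Thetas, lam, w2, alpha):
    """Matrix-free product (Phi^T W2 Phi + lam * Lambda) alpha, W2 stored as vector w2."""
    eta = matvec_Phi(Phis, alpha)        # Phi alpha via Algorithm 3
    weighted = w2 * eta                  # element-wise product replaces the diagonal W2
    return matvec_PhiT(Phis, weighted) + lam * lambda_diff_matvec(Thetas, alpha)
```

Passing this matvec to the CG solver yields the Fisher scoring step (4.2) without ever forming $\Phi^T W_2^{(t)} \Phi$.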
5 Application to the LUCAS Data

The presence of organic carbon is used to provide information on soil erosion (cf. Fang et al., 2018, p. 4). To estimate organic carbon, in general, regression and calibration methods are applied. Applying non-linear regression methods to large data sets, however, requires efficient algorithms to provide appropriate estimates and predictions. The use of non-parametric regression models like P-splines has so far suffered from high computational effort. In the following application, we show that multidimensional penalized spline regression can be applied successfully to estimate organic carbon using soil and satellite data.

In our application, we estimate organic carbon, denoted by $Y$, using auxiliary information from the LUCAS data set. The LUCAS topsoil survey data provide information on the main properties and multispectral reflectance data of topsoil in 23 EU member states based on a grid of observations. The database comprises 19,967 geo-referenced samples. The data set as well as its sampling design and methodology are described in Toth et al. (2013). In addition to the soil properties, the data set comprises diffuse high-resolution reflectance spectra for all samples from 0.400 to 2.500 µm with 0.5 nm spectral resolution. For this study, these spectra were resampled according to the spectral characteristics of ESA's optical high-resolution Sentinel-2 satellite mission (bands centred at 492.4 nm, 559.8 nm, 664.6 nm, 704.1 nm, 740.5 nm, 782.8 nm, 832.8 nm, 864.7 nm, 1613.7 nm, and 2202.4 nm) and used as input variables for the regression model to assess the potential of Sentinel-2 for organic carbon monitoring.

As covariate information $X := [X_1, X_2, X_3]$, we use the first three satellite spectral bands of the visible light, since they are known to be appropriate indicators for organic carbon. Adding further spectral bands or using principal components from the band data did not improve the results. Additionally, we make use of spatial information $U = [U_1, U_2]$ given by longitude and latitude, respectively. A RESET test for a linear model on the utilized variables, using resettest(·) with standard settings from R's lmtest package (cf. Zeileis and Hothorn, 2002), suggests rejecting the null hypothesis of linearity.

As models, we propose the spline models introduced in this paper, using organic carbon as the dependent variable and either spatial information $U$, satellite spectral information $X$, or both in an additive P-spline model on the sampled locations. In extension to Wagner et al. (2017), where negative predictions had to be avoided using non-negativity shape constraints, we suggest to induce positivity by adding an exponential link, i.e. we model $Y$ as normally distributed with

$$E(Y \mid x) = \exp(s(x)),$$

according to Section 4. This leads to the weight matrices

$$W_1^{(t)} = \operatorname{diag}(\exp(s(x))) \quad \text{and} \quad W_2^{(t)} = \operatorname{diag}(\exp(2\,s(x)))$$

as substitutes for (4.3). Additionally, we include the linear model as well as the generalized linear model using both $X$ and $U$, leading to eight different models.

Table 1 provides information on the utilized model, the corresponding residual sum of squares (RSS), the Akaike information criterion (AIC), the runtime in seconds for fitting the model with fixed regularization parameter (run single), the runtime in seconds for fitting the model with iteratively updated regularization parameter (run total), as well as the occurrence of negative predictions (neg).
| Nr. | model | RSS | AIC | run single | run total | neg |
|-----|-------|-----|-----|------------|-----------|-----|
| 1 | y = x + u | | | < 1 | | Y |
| 2 | y = exp(x + u) | 8757.9102 | 383028.70 | – | < 1 | N |
| 3 | y = s(u) | 341.6895 | 221550.30 | 7.43 | 58.83 | Y |
| 4 | y = s(x) | 86.1090 | 169971.40 | 26.50 | 593.60 | Y |
| 5 | y = s(x) + s(u) | 67.2516 | 161330.70 | 28.15 | 567.42 | Y |
| 6 | y = exp(s(u)) | 345.1930 | 221712.90 | 27.45 | 118.55 | N |
| 7 | y = exp(s(x)) | 89.7715 | 170874.40 | 908.15 | 1604.26 | N |
| 8 | y = exp(s(x) + s(u)) | 61.1155 | 156456.90 | 1071.43 | 1948.86 | N |

Table 1: Comparison of the utilized models.

The results show that the linear models provide unacceptable residual sums of squares, which is in line with the rejection of the test on linearity. Further, the non-exponential models suffer from negative predictions, which were easily removed using the exponential form. In general, the pure spatial information alone does not provide high model accuracy. However, the combination of spatial and spectral information yields in all cases the best model evaluations. This finally suggests that model 8 outperforms all other models considerably. The only drawback is the higher computation time, which is induced by applying the exponential form. These results can also be drawn from the residual plots, which are depicted in Figure 2.

Figure 2: Plots of fitted values versus residuals for models 1 (left), 5 (middle) and 8 (right).

The selected residual plots all use spatial and spectral information. Models 1 and 5 show significant patterns which indicate a violation of the model assumptions. Only in the last model, model 8, is the normality assumption reasonably met. However, some outliers may still leave some space for further fine-tuning, for example by integrating further soil information or by using adequate transformations.

Finally, maps are presented for models 5 and 8. In the left graph, the negative predictions can be seen especially in Eastern Europe. Model 8 provides sensible predictions over Europe.

Figure 3: Maps of model predictions from models 5 (left) and 8 (right) for organic carbon on the LUCAS sample.

6 Conclusion

Penalized spline smoothing is a very powerful and widely used method in non-parametric regression. In three or more dimensions, however, the method reaches computational limitations due to the curse of dimensionality. More specifically, storing and manipulating the resulting massive dimensional design matrices rapidly exceeds the working memory of customary computer systems.

A recent approach by Siebenborn and Wagner (2019) circumvents storage-expensive implementations by proposing matrix-free calculations, which are briefly presented in this paper. We extended their results by making use of the link between penalized splines and mixed models, which allows to estimate the smoothing parameter, and showed how this estimation is performed in a matrix-free manner. This especially avoids the need for computationally expensive cross-validation methods for determining the regularization parameter. Further, we extended their results towards generalized regression models, where a weight matrix needs to be considered, and showed how these computations can also be performed matrix-free.

An application to estimating organic carbon has shown the straightforward and efficient application of the methods, including generalized additive models with an exponential link.
Acknowledgment
This work has been supported by the German Research Foundation (DFG) within the research training group ALOP (GRK 2126). The third author was supported by the RIFOSS project, which is funded by the German Federal Statistical Office.
References
Avron, H. and Toledo, S. (2011). Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2):Article 8.

Bellman, R. (1957). Dynamic Programming. Princeton University Press.

Benoit, A., Plateau, B., and Stewart, W. J. (2001). Memory efficient iterative methods for stochastic automata networks. Technical Report 4259, INRIA.

Brumback, B. A. and Rice, J. A. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves. Journal of the American Statistical Association, 93(443):961–976.

Bungartz, H.-J. and Griebel, M. (2004). Sparse grids. Acta Numerica, 13:147–269.

Currie, I. D., Durban, M., and Eilers, P. H. (2006). Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(2):259–280.

de Boor, C. (1978). A Practical Guide to Splines. Springer.

Eilers, P. H., Currie, I. D., and Durbán, M. (2006). Fast and compact smoothing on large multidimensional grids. Computational Statistics & Data Analysis, 50(1):61–76.

Eilers, P. H. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11:89–121.

Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker Inc.

Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. D. (2013). Regression: Models, Methods and Applications. Springer-Verlag, Berlin Heidelberg.

Fang, Q., Hong, H., Zhao, L., Kukolich, S., Yin, K., and Wang, C. (2018). Visible and near-infrared reflectance spectroscopy for investigating soil mineralogy: a review. Journal of Spectroscopy, 2018.

Green, P. J. and Silverman, B. W. (1993). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall.

Hastie, T. and Tibshirani, R. (1990). Exploring the nature of covariate effects in the proportional hazards model. Biometrics, pages 1005–1016.

Hestenes, M. R. and Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):406–436.

Hutchinson, M. F. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics, Simulation and Computation, 18:1059–1076.

Kauermann, G. (2005). A note on smoothing parameter selection for penalized spline smoothing. Journal of Statistical Planning and Inference, 127(1-2):53–69.

Kauermann, G., Krivobokova, T., and Fahrmeir, L. (2009). Some asymptotic results on generalized penalized spline smoothing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):487–503.

Kauermann, G., Schellhase, C., and Ruppert, D. (2013). Flexible copula density estimation with penalized hierarchical B-splines. Scandinavian Journal of Statistics, 40(4):685–705.

Lang, P., Paluszny, A., and Zimmerman, R. (2014). Permeability tensor of three-dimensional fractured porous rock and a comparison to trace map predictions. Journal of Geophysical Research: Solid Earth, 119(8):6288–6307.

Li, Z. and Wood, S. N. (2019). Faster model matrix crossproducts for large generalized linear models with discretized covariates. Statistics and Computing.

Orgiazzi, A., Ballabio, C., Panagos, P., Jones, A., and Fernández-Ugalde, O. (2017). LUCAS Soil, the largest expandable soil dataset for Europe: a review. European Journal of Soil Science, 69(1):140–153.

O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems. Statistical Science, 1(4):502–527.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2009). Semiparametric regression during 2003–2007. Electronic Journal of Statistics, 3:1193.

Schumaker, L. L. (1981). Spline Functions: Basic Theory. Wiley.

Siebenborn, M. and Wagner, J. (2019). A multigrid preconditioner for tensor product spline smoothing. arXiv preprint arXiv:1901.00654.

Toth, G., Jones, A., and Montanarella, L. (2013). The LUCAS topsoil database and derived information on the regional variability of cropland topsoil properties in the European Union. Environmental Monitoring and Assessment, 185(9):7409–7425.

Verbyla, A. P., Cullis, B. R., Kenward, M. G., and Welham, S. J. (1999). The analysis of designed experiments and longitudinal data by using smoothing splines. Journal of the Royal Statistical Society: Series C (Applied Statistics), 48(3):269–311.

Wagner, J. (2019). Optimization Methods and Large-Scale Algorithms in Small Area Estimation. PhD thesis, Trier University.

Wagner, J., Münnich, R., Hill, J., Stoffels, J., and Udelhoven, T. (2017). Non-parametric small area models using shape-constrained penalized B-splines. Journal of the Royal Statistical Society: Series A, 180(4):1089–1109.

Wand, M. P. (2003). Smoothing and mixed models. Computational Statistics, 18(2):223–249.

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. CRC Press.

Wood, S. N., Li, Z., Shaddick, G., and Augustin, N. H. (2017). Generalized additive models for gigadata: modeling the UK black smoke network daily data. Journal of the American Statistical Association, 112(519):1199–1210.

Zeileis, A. and Hothorn, T. (2002). Diagnostic checking in regression relationships. R News, 2(3):7–10.
Zenger, C. (1991). Sparse grids. In Hackbusch, W., editor, Parallel Algorithms for Partial Differential Equations, volume 31 of Notes on Numerical Fluid Mechanics, pages 241–251. Vieweg.