Adaptive shrinkage of smooth functional effects towards a predefined functional subspace
Paul Wiemann and Thomas Kneib, Göttingen University. January 15, 2021
In this paper, we propose a new horseshoe-type prior hierarchy for adaptively shrinking spline-based functional effects towards a predefined vector space of parametric functions. Instead of shrinking each spline coefficient towards zero, we use an adapted horseshoe prior to control the deviation from the predefined vector space. For this purpose, the modified horseshoe prior is set up with one scale parameter per spline and not one per coefficient. The presented prior allows for a large number of basis functions to capture all kinds of functional effects while preventing the estimated functional effect from a highly oscillating overfit. We achieve this by integrating a smoothing penalty similar to the random walk prior commonly applied in Bayesian P-spline priors. In a simulation study, we demonstrate the properties of the new prior specification and compare it to other approaches from the literature. Furthermore, we showcase the applicability of the proposed method by estimating the energy consumption in Germany over the course of a day. For inference, we rely on Markov chain Monte Carlo simulations combining Gibbs sampling for the spline coefficients with slice sampling for all scale parameters in the model.
In this paper, we consider the non-parametric regression problem
$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$
with independent and identically distributed error terms $\varepsilon_i \sim \mathrm{N}(0, \sigma^2)$, a smooth, nonlinear function $f$ and continuous covariate $x_i$. A common approach for solving the nonparametric regression problem is to transform it into a semi-parametric estimation problem by assuming that $f$ can be approximated via a linear combination of basis functions evaluated for the considered covariate (Ruppert et al., 2003). As an alternative to a regression spline, one might consider a parametric assumption on the dependence on the covariate, i.e., a linear or quadratic dependence, or a periodic dependence with a trigonometric polynomial. To assess the correctness of the assumptions, goodness-of-fit measures like the widely applicable information criterion (WAIC, Watanabe, 2010), residual plots or posterior predictive checks can be employed (Gelman et al., 2014).

Similar to Shin et al. (2020), we present a methodology based on a shrinkage prior that represents a trade-off between the solution of a pre-defined parametric function and the solution of a flexible regression spline. Consider a given, predefined subspace, e.g., the space of all linear or quadratic effects. The prior we propose addresses the following three main objectives:

• For a weak signal or for a signal that can be represented well by a function from the pre-specified subspace, the estimated functional effect should be shrunken towards the considered subspace.

• When the signal is strong and cannot be adequately represented by a function from the subspace, the shrinkage should be as small as possible.

• The prior should prevent the estimated function from oscillating too strongly and thus possibly overfitting the data. This is a known problem for unrestricted regression splines when many basis functions are used (Fahrmeir et al., 2013).
However, we want the prior to be able to handle a high number of basis functions so that the space spanned by the basis functions can represent quickly changing functions.

The concept of shrinkage has been applied successfully and widely in many scientific fields since the foundational work by James and Stein (1961). We mention here ridge regression (Hoerl and Kennard, 1970; Marquardt and Snee, 1975), the LASSO penalty (Tibshirani, 1996) and its Bayesian counterpart, the Bayesian LASSO (Park and Casella, 2008), as examples for shrinkage of the regression coefficients in a linear model. More recently, the concept of global-local shrinkage (Polson and Scott, 2011) has emerged. Compared to fixed-point shrinkage, which shrinks every regression coefficient towards 0, global-local shrinkage shrinks the vector of all regression coefficients towards a sparse solution, which makes these priors particularly useful for high-dimensional but sparse estimation problems. Global-local priors are usually equipped with global hyper-parameters that control the sparseness of the estimated vector. The horseshoe prior (Carvalho et al., 2010) is a prominent member of this class of shrinkage priors and is especially known for its adaptive properties. Due to the horseshoe-shaped prior on the shrinkage coefficient, it is capable of leaving strong signals almost untouched while applying strong shrinkage to weak signals. Polson and Scott (2011) draw a connection between global-local shrinkage and Bayesian model averaging that Carvalho et al. (2010) also highlight for their horseshoe prior.

Hence, the prior developed in this paper differs from the above-mentioned priors, as we are not aiming at shrinking a sparse vector or individual regression coefficients towards zero. Instead, we aim at shrinking the function estimated by a regression spline towards a predefined parametric subspace, while simultaneously controlling its curvature.
This distinguishes our approach from the goals usually pursued with the regularization of regression splines, which essentially focus on limiting the curvature of the solution (see Lang and Brezger, 2004).

Recently, the work by Shin et al. (2020) has come close to, and indeed inspired, the method developed in this paper. However, their approach does not take the curvature of the estimated effect into account and, consequently, as we show in Section 2, the estimated function easily becomes very wiggly or might not be capable of capturing essential patterns in the data. Moreover, the method of effect selection introduced by Scheipl et al. (2012) and Klein et al. (2019) is not directly comparable with our approach. Firstly, instead of shrinkage, a spike-and-slab-based procedure is employed (Mitchell and Beauchamp, 1988; Ishwaran and Rao, 2005). Secondly, the null effect arises from the kernel of the improper precision matrix in the prior on the spline coefficients and is therefore not an arbitrary parametric function of the covariate.

The remainder of this paper is organized as follows. In Section 2, we introduce the prior, discuss its extension to additive regression models (Section 2.3), and investigate its shrinkage properties in Section 2.4. We describe the Bayesian inference scheme in Section 3 before demonstrating the validity of our approach with simulations in Section 4. Here, we investigate the prior in a simple univariate case as well as in an additive model. We conclude the empirical part of this paper in Section 5 with an application to energy consumption in Germany before we end with the discussion in Section 6.

In this section, we introduce the prior specification for estimating a smooth functional effect combined with shrinkage towards a predefined functional subspace. Consider the non-parametric regression problem from Equation (1).
As stated in the introduction, a common approach for the estimation of $f$ is to transform the problem into a semi-parametric estimation problem by assuming that $f$ can be approximated via a linear combination of $k$ pre-specified basis functions denoted by $B_j$. More precisely, we assume
$$f(x) = \sum_{j=1}^{k} B_j(x)\,\beta_j = B(x)'\beta$$
with $B(x) = (B_1(x), \ldots, B_k(x))'$ and $\beta = (\beta_1, \ldots, \beta_k)'$. Thus, a finite number of parameters needs to be estimated, which usually have no valid interpretation by themselves. In this work, we employ B-splines of third order with equally spaced knots (Fahrmeir et al., 2013).

The number of basis functions $k$ is highly important for the functional space spanned by the basis functions. A small number of basis functions may not be flexible enough to capture the pattern observed in the data. Using many basis functions, however, may result in a highly oscillating function that overfits the data. In the literature, different approaches can be found to overcome this problem. One is to choose the number of basis functions based on the number of available observations (see Ruppert et al., 2003, for details).

Let $Z$ denote the $n \times k$ matrix of basis functions evaluated at the observed covariates $x_1, \ldots, x_n$. When limiting the number of basis functions, Zellner's $g$-prior is a common choice for the prior on $\beta$ (Zellner, 1986),
$$\beta \mid g \sim \mathrm{N}\big(0,\, g\,(Z'Z)^{-1}\big),$$
where the hyper-parameter $g > 0$ adjusts the weight of the prior in relation to the observations. Zellner's $g$-prior has the advantage of taking the (observed) correlation structure of the covariates into account. Under this prior, the posterior mean of predictions given $g$ can be expressed as
$$\mathrm{E}(\hat{y} \mid g, y) = \left(1 - \frac{1}{1+g}\right) Z (Z'Z)^{-1} Z' y,$$
which shows that $g$ can be interpreted as a shrinkage parameter with shrinkage towards the null effect. In this approach, the regularization of the spline is achieved by limiting the number of basis functions depending on the number of observations. Thus, the functional pattern that can be represented with the spline depends on the number of observations, which is not a desirable property.

To circumvent this issue, a different approach is based on regularization of the curvature of the estimated function or, to put it differently, shrinkage towards a function with less curvature. Here, a moderately large number of basis functions is used in conjunction with a prior which ensures that adjacent regression coefficients are not too different, and thus produces a certain smoothness of the functional estimate. The Bayesian P-spline approach by Lang and Brezger (2004) achieves this by placing a Gaussian random walk prior on the spline coefficients. In the case of a first-order random walk, $\beta_j = \beta_{j-1} + u_j$ for $1 < j \le k$ with $u_j \sim \mathrm{N}(0, \tau^2)$, the variance of the increments acts as a smoothness parameter and the prior shrinks towards a constant functional effect. The second-order random walk, $\beta_j = 2\beta_{j-1} - \beta_{j-2} + u_j$ for $2 < j \le k$ with $u_j \sim \mathrm{N}(0, \tau^2)$, acts as a linear extrapolation with a Gaussian error term. Again, $\tau^2$ can be interpreted as a smoothing parameter and the prior shrinks towards a linear function. The overall level is not penalized by the Bayesian P-spline prior since it places a flat prior on $\beta_1$; for the second-order random walk the flat prior extends to $\beta_2$ and thus the slope is not regularized.
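The shrinkage role of $g$ is easy to check numerically: the fitted values under Zellner's $g$-prior are exactly the unpenalized spline fit scaled by $1 - 1/(1+g)$. The following sketch is an illustration, not the authors' code; the helper `bspline_design`, the grid, $g$, and the toy data are arbitrary choices of ours.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, n_basis, degree=3):
    # Open knot vector: equally spaced inner knots, boundary knots repeated.
    inner = np.linspace(x.min(), x.max(), n_basis - degree + 1)
    knots = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
    # Evaluating with an identity coefficient matrix yields the (n, k) design matrix.
    return BSpline(knots, np.eye(n_basis), degree)(x)

rng = np.random.default_rng(1)
n, k, g = 80, 10, 4.0
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

Z = bspline_design(x, k)
# Hat matrix of the unpenalized regression spline.
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)
fit_unpenalized = P_Z @ y

shrink = 1 - 1 / (1 + g)                 # shrinkage factor towards the null effect
fit_gprior = shrink * fit_unpenalized    # posterior mean under the g-prior
```

For $g = 4$ the whole spline fit is uniformly damped by the factor $0.8$, which is precisely the "shrinkage towards the null effect" described above.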
With the appropriate rank-deficient penalty matrix $K$, this prior can equivalently be expressed as
$$\beta \sim \mathrm{N}_{\mathrm{prec}}\big(0,\, \tau^{-2} K\big)$$
where $\mathrm{N}_{\mathrm{prec}}$ is the degenerate normal distribution with precision matrix $\tau^{-2} K$ (see Lang and Brezger (2004) and Rue and Held (2005) for details).

For the construction of our smooth subspace shrinkage, we use a combination of both approaches: we use the smoothing properties of the Bayesian P-spline and combine it with an extended version of Zellner's $g$-prior, very similar to the functional horseshoe prior (Shin et al., 2020), to shrink towards a predefined functional subspace. In comparison to the Bayesian P-spline, we do not only want to shrink towards less curvature, but additionally towards a parametric functional subspace of the covariate. To give an example, this could be a quadratic effect, as well as a periodic effect. Nevertheless, the key feature of the Bayesian P-spline should be maintained, namely preventing the functional estimate from exhibiting highly oscillating behavior and thus overfitting the data when a high number of basis functions is used. The prior, however, should be able to adapt to quickly changing functions when the data suggests this, so we waive the option of using only a small number of basis functions.

We achieve our goal by using the following prior hierarchy. Let $S$ be a matrix whose columns span the space $\mathcal{N}$ towards which we want to shrink. To give examples, use $S = (1, x)$ for the null space comprised of the linear effect and $S = (1, x, x^2)$ when quadratic effects should comprise the null space. Hereby, operations on the covariate vector are carried out element-wise. Based on $S$, we define the $n \times n$ matrices $P_0 = S(S'S)^{-1}S'$ and $P_1 = I - P_0$ where $I$ denotes the identity matrix of appropriate size.
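The two projection matrices and their basic properties (idempotence, symmetry, orthogonality of their images) can be verified numerically. The sketch below is purely illustrative, using an arbitrary covariate grid and the quadratic null space $S = (1, x, x^2)$.

```python
import numpy as np

n = 50
x = np.linspace(-1, 1, n)
S = np.column_stack([np.ones(n), x, x**2])   # basis of the null space N

P0 = S @ np.linalg.solve(S.T @ S, S.T)       # projection onto N
P1 = np.eye(n) - P0                          # projection onto the orthogonal complement

# Both are idempotent, symmetric projections.
assert np.allclose(P0 @ P0, P0) and np.allclose(P1 @ P1, P1)
assert np.allclose(P0, P0.T) and np.allclose(P1, P1.T)

# P1 annihilates every function inside N ...
f_in_N = 2 - 0.5 * x + 3 * x**2
assert np.allclose(P1 @ f_in_N, 0)

# ... while a cubic term keeps a non-zero component outside N.
deviation = np.linalg.norm(P1 @ x**3)
```

Here `deviation` is exactly the kind of quantity the prior introduced below penalizes: the part of an effect that cannot be represented within the null space.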
$P_0$ and $P_1$ are both projection matrices with their images being orthogonal complements of each other. Additionally, the union of their images is equal to the space spanned by the columns of $Z$. By construction, $\mathcal{N}$ is the image of $P_0$ and the kernel of $P_1$. We can now decompose the image of the element-wise application of the function, $f(x) = (f(x_1), \ldots, f(x_n))'$, into the part within and the part not within the null space, i.e.,
$$f(x) = (P_0 + I - P_0) f(x) = \underbrace{P_0 f(x)}_{\in \mathcal{N}} + \underbrace{P_1 f(x)}_{\in \bar{\mathcal{N}}}$$
where $\bar{\mathcal{N}}$ denotes the orthogonal complement of $\mathcal{N}$. Therefore, $f_{\bar{\mathcal{N}}}(x) = P_1 f(x)$ is the part of the functional effect that is not within the null space and consequently should be controlled by the shrinkage prior. To achieve this, we propose to place the improper prior
$$\beta \mid \lambda, \tau^2, \sigma^2 \sim \mathrm{N}_{\mathrm{prec}}(0, Q) \quad \text{with} \quad Q = \sigma^{-2} \lambda^{-2}\, Z' P_1 Z + \tau^{-2} K$$
on $\beta$, where $K$ denotes the appropriate second-order random walk penalty matrix. Therefore,
$$\beta' Q \beta = \frac{f_{\bar{\mathcal{N}}}(x)' f_{\bar{\mathcal{N}}}(x)}{\lambda^2 \sigma^2} + \frac{\beta' K \beta}{\tau^2};$$
i.e., the argument to the exponential function in the degenerate normal density of the prior on $\beta$ measures the sum of the quadratic deviations from the functional null space (weighted with $\lambda^{-2}\sigma^{-2}$) in the first term and the sum of the squared second-order differences of the spline coefficients (weighted with $\tau^{-2}$) in the second term. We deal with the priors on the scale parameters later in Equation (5).

The posterior of $\beta$ conditioned on $\sigma^2$, $\lambda$ and $\tau^2$ is then normally distributed with mean $\beta^\star$ and precision matrix $Q^\star$ given as follows:
$$Q^\star = \sigma^{-2} Z'Z + Q, \qquad \beta^\star = \sigma^{-2} (Q^\star)^{-1} Z' y. \tag{2}$$

When defining
$$\kappa = \frac{1}{1 + \lambda^2} \tag{3}$$
as in the horseshoe prior and considering the expected functional effect conditioned on no penalization of the second-order derivative,
$$\lim_{\tau^2 \to \infty} \mathrm{E}(Z\beta \mid y, \lambda, \tau^2) = \big((1 - \kappa) P_Z + \kappa P_0\big)\, y \tag{4}$$
where $P_Z = Z(Z'Z)^{-1}Z'$, the role of $\kappa$ as the shrinkage factor becomes apparent, as the expectation is a weighted average of the parametric least squares solution (i.e., $P_0 y$) and the unpenalized spline estimate (i.e., $P_Z y$). We reach the extreme of no shrinkage, and thus recover the spline solution, for $\kappa \to 0$. Full shrinkage is obtained for $\kappa \to 1$, which equals the parametric least squares solution.

We now discuss the necessity of the second term in the precision matrix of the prior on $\beta$. From Equation (4), it is evident that the spline solution passes on its highly oscillating behavior to the weighted average if there is no, or only little, shrinkage. No or slight shrinkage is especially desirable in the situation in which the defined null space does not match the pattern in the data, and thus the model is supposed to adapt to the data. But even in the case of strong but not complete shrinkage, the highly oscillating behavior is included in the estimated function, albeit with smaller amplitude.

We illustrate this with an example. Consider the solid line in Figure 1. Without any penalization of the second-order derivative (i.e., $\tau^2 \to \infty$), the expected value of the prediction tends to overfit the data. Only in the case of the correctly specified null space do we see a fitting and non-oscillating functional estimate (second row, fourth column). In addition, the figure demonstrates that integrating the second-order random walk prior on $\beta$ helps to prevent this.
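Equation (4) can be verified numerically: with $\tau^2 \to \infty$, the posterior mean of $Z\beta$ equals the $\kappa$-weighted average of the parametric projection $P_0 y$ and the unpenalized spline fit $P_Z y$. The sketch below is illustrative only, with arbitrary toy data, $\sigma^2 = 1$, a linear null space, and a hypothetical `bspline_design` helper built from SciPy.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, n_basis, degree=3):
    # Cubic B-spline design matrix with equally spaced knots (open knot vector).
    inner = np.linspace(x.min(), x.max(), n_basis - degree + 1)
    knots = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
    return BSpline(knots, np.eye(n_basis), degree)(x)

rng = np.random.default_rng(2)
n, k, lam = 60, 12, 0.7
x = np.linspace(0.0, 1.0, n)
y = np.cos(3 * x) + rng.normal(0, 0.2, n)

Z = bspline_design(x, k)
S = np.column_stack([np.ones(n), x])        # linear null space, contained in span(Z)
P0 = S @ np.linalg.solve(S.T @ S, S.T)
P1 = np.eye(n) - P0
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Posterior mean of Z beta with sigma^2 = 1 and tau^2 -> infinity:
#   Q* = Z'Z + lam^-2 Z'P1 Z,   E(Z beta | y) = Z (Q*)^-1 Z'y
Q_star = Z.T @ Z + lam**-2 * (Z.T @ P1 @ Z)
fit = Z @ np.linalg.solve(Q_star, Z.T @ y)

# Right-hand side of Equation (4).
kappa = 1 / (1 + lam**2)
weighted = (1 - kappa) * (P_Z @ y) + kappa * (P0 @ y)
```

Both sides agree because cubic B-splines reproduce constants and linear functions exactly, so the null space spanned by $S$ lies inside the column space of $Z$, which is the setting assumed for Equation (4).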
Even when no shrinkage is present (i.e., $\kappa \to 0$), the posterior mean exhibits less curvature and fits the underlying pattern better.

Figure 1: The plots demonstrate the effect of the shrinkage coefficient $\kappa$ and the smoothing variance $\tau^2$ on the posterior mean as given in Equation (2) for a data example (scatterplot) generated from a quadratic function. The lines represent the posterior mean. Without any smoothing (i.e., $\tau^2 \to \infty$), the spline is highly oscillating and overfits the data when no, or only slight, shrinkage is present. In this example, $\sigma^2 = 1$ and the spline is based on 20 basis functions.

We place half-Cauchy and inverse gamma priors on the scale parameters in the prior on $\beta$:
$$\lambda \sim \mathrm{C}^+(0, \tilde{\xi}), \quad \xi \sim \mathrm{C}^+(0, \xi_0), \quad \tau \sim \mathrm{C}^+(0, \nu) \quad \text{and} \quad \sigma^2 \sim \mathrm{IG}(a_0, b_0), \tag{5}$$
where $\mathrm{C}^+(a, b)$ denotes the Cauchy distribution with location parameter $a$ and scale parameter $b$ truncated to the positive real numbers, and $\mathrm{IG}(c, d)$ denotes the inverse gamma distribution with shape parameter $c$ and scale parameter $d$.

A probability statement (Equation (6)) is used to determine the parameter $\nu > 0$ of the prior on the variance of the random walk. Over the domain of the covariates, the second-order derivative should be, in absolute terms, smaller than the prespecified threshold $c$ with probability $1 - \alpha$, i.e.,
$$\lim_{\lambda \to \infty} \Pr\left( \max_{x \in \mathcal{D}} \left| f''(x) \right| < c \,\middle|\, \lambda \right) = 1 - \alpha. \tag{6}$$
Furthermore, $\xi_0, a_0, b_0 > 0$ must be specified by the user and $\tilde{\xi}$ is a function of $\xi$. The prior on $\lambda$ is an adaptation of the hierarchy considered in the horseshoe prior (Carvalho et al., 2010). We scale $\xi$ such that the marginal variance of the spline coefficients depends neither on the design matrix of the spline $Z$ nor on the defined null space and thus the projection matrix $P_1$. Following the idea of Sørbye and Rue (2014), we define
$$\tilde{\xi} = \xi\, \sigma_{\mathrm{ref}}^{-1}(Z' P_1 Z)$$
where $\sigma_{\mathrm{ref}}(\cdot)$ represents the average marginal standard deviation induced by the precision matrix. For that, we calculate the standard deviation as the geometric mean
$$\sigma_{\mathrm{ref}}(Q) = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \log\big(\Sigma^{\star}_{ii}\big) \right)$$
with $\Sigma^{\star}$ as the generalized inverse of $Q$.

To have more control over the shrinkage properties, especially the strength of a signal, $\xi$ may be fixed. This might be particularly well suited in situations in which only one functional effect should be controlled by the proposed prior and, therefore, no global-local shrinkage is required. The extension to multiple additive effects is described in Section 2.3 and an assessment of the shrinkage properties is made in Section 2.4.

2.3 Extension to multiple additive functional effects

We describe briefly how the proposed prior hierarchy can be extended to cover multiple functional effects in an additive regression design (Hastie and Tibshirani, 1986). Suppose $L$ smooth effects should be modeled such that the response for $i = 1, \ldots, n$ observations is given by
$$y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \ldots + f_L(x_{iL}) + \varepsilon_i$$
where the distributional assumption from above applies to $\varepsilon_i$. In addition to the assumption that the smooth functions can be represented by a spline, we assume for identification purposes that the smooth functions are centered; i.e., the sum over the coefficients of each spline equals 0. This assumption is justified because it causes no general restriction that cannot be nullified by centering $y$ or by the inclusion of an intercept, i.e., $\beta_0$.

As above, we approximate the smooth effects with linear combinations of basis functions, resulting in the design matrices $Z_1, \ldots, Z_L$ with the corresponding vectors of coefficients $\beta_1, \ldots, \beta_L$. For $l = 1, \ldots, L$, the prior hierarchy on the coefficient vectors is given by
$$\beta_l \mid \lambda_l, \tau_l^2, \sigma^2 \sim \mathrm{N}_{\mathrm{prec}}(0, Q_l) \quad \text{with} \quad Q_l = \sigma^{-2} \lambda_l^{-2}\, Z_l' P_1^{(l)} Z_l + \tau_l^{-2} K_l \ \text{and the linear restriction } \mathbf{1}'\beta_l = 0,$$
$$\lambda_l \mid \xi \sim \mathrm{C}^+(0, \tilde{\xi}_l) \quad \text{with} \quad \tilde{\xi}_l = \xi\, \sigma_{\mathrm{ref}}^{-1}\big(Z_l' P_1^{(l)} Z_l\big),$$
$$\xi \sim \mathrm{C}^+(0, \xi_0),$$
$$\tau_l \mid \nu_l \sim \mathrm{C}^+(0, \nu_l) \quad \text{with } \nu_l \text{ such that}$$
$$\lim_{\lambda_l \to \infty} \Pr\left( \max_{x \in \mathcal{D}_l} \left| f_l''(x) \right| < c_l \,\middle|\, \lambda_l \right) = 1 - \alpha_l$$
where

• the prior on $\beta_l$ now includes the linear restriction $\mathbf{1}'\beta_l = 0$ for identifiability,

• $K_l$ denotes the appropriate second-order random walk penalty matrix,

• $P_1^{(l)}$ denotes the projection matrix with kernel equal to the chosen null space for the $l$-th effect,

• and $\alpha_l$ and $c_l$ are chosen as above to control the wiggliness of $f_l$.

In addition, we place the same prior as before on $\sigma^2$ and use a non-informative prior for the intercept $\beta_0$. In this prior hierarchy, and similar to the horseshoe prior, the prior distributions on $\lambda_l$ are connected via $\xi$, which acts as a global shrinkage parameter while $\lambda_l$ controls the local shrinkage.

In this section, we study the shrinkage properties of the proposed prior. The literature suggests (Scheipl et al., 2012; Klein et al., 2019) examining the marginal distribution of the spline coefficients, i.e., $p(\beta_l)$ for the $l$-th functional effect. For simplicity, we make the same assumption as Carvalho et al. (2010) regarding the variance of the error term, namely $\sigma^2 = 1$. The error variance can easily be estimated from the data and acts in the proposed prior hierarchy just as a scaling component. In addition, we assume that the prior on the second-order random walk, specifically on $\tau_l$, has been chosen such that it does not affect the shrinkage properties. We consider this assumption to be valid, as we believe that the user selects the limit for the second derivative to prevent overfitting, but not to affect the shrinkage properties. Therefore, we consider in the following the limit $\tau_l^2 \to \infty$. For the sake of compactness, we refrain from stating the conditioning on $\tau_l$ and $\sigma^2$ for the rest of this section.

The proposed prior features local-global shrinkage properties with the local shrinkage parameter $\lambda_l$ and the global shrinkage parameter $\xi$.
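The adaptivity discussed in this section can be made tangible through the distribution that a half-Cauchy prior on $\lambda$ implies for the shrinkage factor $\kappa = 1/(1+\lambda^2)$ from Equation (3): most mass piles up near 0 (no shrinkage, strong signals survive) and near 1 (full shrinkage towards the null space), the characteristic horseshoe shape. A small Monte Carlo sketch, illustrative only, with unit scale and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)

# lambda ~ C+(0, 1): half-Cauchy draws as |standard Cauchy|.
lam = np.abs(rng.standard_cauchy(100_000))
kappa = 1 / (1 + lam**2)          # shrinkage factor, Equation (3)

# Horseshoe shape: mass concentrates at both endpoints of [0, 1].
near_0 = np.mean(kappa < 0.1)     # little shrinkage
near_1 = np.mean(kappa > 0.9)     # strong shrinkage
middle = np.mean((kappa > 0.45) & (kappa < 0.55))
```

For the unit-scale case, $\kappa$ follows a $\mathrm{Beta}(1/2, 1/2)$ distribution, so roughly 20% of the mass falls into each of the outer intervals while the center is sparsely populated; this is the mechanism behind leaving strong signals untouched while shrinking weak ones.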
Since i) the main local shrinkage properties are unaffected by the global shrinkage parameter's value and ii) the effect on the marginal distribution of $\beta_l$ without a fixed global shrinkage parameter depends on the specific predictor structure, we condition in the following on $\xi$ and thus $\tilde{\xi}_l$. Note that the results derived in this section do not depend on a specific value of $\xi$. For better readability, we drop the index $l$ in the following. The implied marginal distribution of $\beta \mid \tilde{\xi}$ can now be derived as
$$p(\beta \mid \tilde{\xi}) = \int_0^\infty p(\beta \mid \lambda)\, p(\lambda \mid \tilde{\xi})\, d\lambda = (2\pi)^{-\mathrm{rk}(F)/2} (|F|^*)^{1/2} \int_0^\infty \lambda^{-\mathrm{rk}(F)} \exp\left( -\frac{1}{2\lambda^2}\, \beta' F \beta \right) \frac{2}{\pi \tilde{\xi}} \left( 1 + \frac{\lambda^2}{\tilde{\xi}^2} \right)^{-1} d\lambda = \underbrace{\frac{2}{\pi \tilde{\xi}}\, (2\pi)^{-\mathrm{rk}(F)/2} (|F|^*)^{1/2}}_{=:\, c} \int_0^\infty \frac{\exp\left( -\frac{1}{2\lambda^2}\, \beta' F \beta \right)}{\lambda^{\mathrm{rk}(F)} \left( 1 + \frac{\lambda^2}{\tilde{\xi}^2} \right)}\, d\lambda,$$
where $F = Z' P_1 Z$ and $|F|^*$ denotes the generalized determinant. No analytical solution exists for this integral, though. We define the constant scalar in front of the integration symbol as $c$.

In the assessment of shrinkage properties, we are mainly interested in the tail behavior and the behavior at the origin with respect to the distance of the functional effect from the null space $\mathcal{N}$. To assess this, we define this distance as $d = \|P_1 Z \beta\|$. The implied marginal density of $d$ is then given as
$$p(d \mid \tilde{\xi}) = c \int_0^\infty \frac{\exp\left( -\frac{d^2}{2\lambda^2} \right)}{\lambda^{\mathrm{rk}(F)} \left( 1 + \frac{\lambda^2}{\tilde{\xi}^2} \right)}\, d\lambda,$$
with no closed-form solution available, so that we approximate it numerically for the visual impression shown in Figure 2.

We investigate more closely whether the spike at the origin is infinite. An infinite spike at zero exhibits particularly advantageous shrinkage properties, as it implies a strong penalization and thus strong shrinkage towards the origin for small effects (Klein et al., 2019). Translated to our hierarchy, this implies that effects close to the null space receive strong shrinkage towards the null space.

Figure 2: Plots of the marginal density of the distance of the spline coefficients to the null space conditioned on $\tilde{\xi}$. The left plot is based on $\mathrm{rk}(F) = 10$ and the right plot on $\mathrm{rk}(F) = 20$.

Using the last equation, we obtain
$$p(d = 0 \mid \tilde{\xi}) = c \int_0^\infty \frac{1}{\lambda^{\mathrm{rk}(F)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)}\, d\lambda = c \int_0^1 \underbrace{\frac{1}{\lambda^{\mathrm{rk}(F)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)}}_{=:\, w(\lambda)}\, d\lambda + \underbrace{c \int_1^\infty \frac{1}{\lambda^{\mathrm{rk}(F)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)}\, d\lambda}_{\ge 0}$$
for which we can find a lower bound for the integrand $w(\lambda)$ as
$$w(\lambda) = \frac{1}{\lambda^{\mathrm{rk}(F)} + \frac{\lambda^{\mathrm{rk}(F)+2}}{\tilde{\xi}^2}} = \frac{\tilde{\xi}^2}{\tilde{\xi}^2 \lambda^{\mathrm{rk}(F)} + \lambda^{\mathrm{rk}(F)+2}} \overset{0 < \lambda < 1}{\ge} \frac{\tilde{\xi}^2}{\tilde{\xi}^2 \lambda^{\mathrm{rk}(F)} + \lambda^{\mathrm{rk}(F)}}$$
and thus
$$\int_0^1 w(\lambda)\, d\lambda \ge \int_0^1 \frac{\tilde{\xi}^2}{(\tilde{\xi}^2 + 1)\, \lambda^{\mathrm{rk}(F)}}\, d\lambda \overset{\mathrm{rk}(F) > 1}{=} \left[ \frac{\tilde{\xi}^2\, \lambda^{1 - \mathrm{rk}(F)}}{(1 - \mathrm{rk}(F))(\tilde{\xi}^2 + 1)} \right]_0^1 = \infty.$$
The marginal indeed has an infinite spike at the origin with respect to $d$, and consequently the marginal $p(\beta)$ has an infinite spike if $Z\beta$ is in the null space.

Besides studying the marginal prior at the origin, Scheipl et al. (2012) suggest exploring the tail behavior by focusing on the first derivative of the log marginal, i.e., the score function, and its properties in the limit. Klein et al. (2019) state that the prior has heavy tails if the score function is redescending; namely, its value approaches zero in the infinite limit of its argument. Heavy tails in the marginal prior are desirable since they imply Bayesian robustness of the estimates.

Figure 3: Plots of the marginal score function of the distance of the spline coefficients to the null space conditioned on $\tilde{\xi}$. The left plot is based on $\mathrm{rk}(F) = 10$ and the right plot on $\mathrm{rk}(F) = 20$.

Unfortunately, the limiting behavior of the score function
$$\frac{\partial}{\partial d} \log\big(p(d \mid \tilde{\xi})\big) = \frac{\frac{\partial}{\partial d}\, p(d \mid \tilde{\xi})}{p(d \mid \tilde{\xi})} = \frac{-d\, c \int_0^\infty \exp\left(-\frac{d^2}{2\lambda^2}\right) \lambda^{-(\mathrm{rk}(F)+2)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)^{-1} d\lambda}{c \int_0^\infty \exp\left(-\frac{d^2}{2\lambda^2}\right) \lambda^{-\mathrm{rk}(F)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)^{-1} d\lambda} \le 0$$
is not analytically accessible, so that we must rely on a numerical approximation. Figure 3 displays plots of the score function. From the plots we conclude that the prior has a redescending score function and thus heavy tails. These findings translate to the functional space, which consequently means that functional estimates in the tails of the prior are robust.

We base our Bayesian estimation procedure on Markov chain Monte Carlo (MCMC) simulations and employ a within-Gibbs updating scheme. Within each MCMC iteration, we initially update the intercept $\beta_0$ with a Gibbs update and then update the spline coefficients of each smooth function with a draw from the full conditionals. Suppose the $l$-th component of the predictor should be updated.
Let $\tilde{y}_{-l}$ denote the working observations, that is, $\tilde{y}_{-l} = y - \eta_{-l}$, where $\eta_{-l}$ denotes the predictor with the effect of the component under consideration removed. With a non-informative prior on the intercept, the full conditional distribution of the intercept is then given by $\beta_0 \mid \cdot \sim \mathrm{N}\big(\bar{\tilde{y}}_{-0},\, \sigma^2/n\big)$ where $\bar{\tilde{y}}_{-0}$ is the mean of the working observations. For the spline coefficients of the $l$-th smooth effect, the full conditional follows from Equation (2) and is given by
$$\beta_l \mid \cdot \sim \mathrm{N}\big(\beta_l^\star,\, (Q_l^\star)^{-1}\big) \quad \text{where} \quad Q_l^\star = \sigma^{-2} Z_l' Z_l + Q_l \quad \text{and} \quad \beta_l^\star = \sigma^{-2} (Q_l^\star)^{-1} Z_l'\, \tilde{y}_{-l}.$$
To ensure identifiability, we sample under the linear constraint $\mathbf{1}'\beta_l = 0$ as described in Rue and Held (2005, Algorithm 2.6). Finally, the scale parameters of the model are updated using slice updates (Neal, 2003). Hereby, we choose to first update $\sigma^2$, then the scale parameters of each spline, namely $\lambda_l$ and $\tau_l$, and lastly the global shrinkage parameter $\xi$. All updates of the scale parameters are performed in log-space. As proposed by Makalic and Schmidt (2016), the global shrinkage parameter could also be updated using the double inverse gamma representation. However, we could not observe improvements in the MCMC estimation. Due to that, we stick to the slice sampler for the sake of consistency.

We assess the validity of our approach in three simulation studies. In the first two studies, we employ a simple additive model featuring one covariate. We explore the behavior of the presented prior considering different null spaces and its reaction to different signal-to-noise ratios (SNRs). In the third simulation study, we analyze the proposed prior within the additive model and compare the results to the popular Bayesian P-splines approach. Notably, the work in Shin et al. (2020) partly has close resemblance to the method developed in this paper.
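A single full-conditional draw of $\beta_l$ under the sum-to-zero constraint can be sketched as follows. The correction step is the standard "conditioning by kriging" of Rue and Held (2005, Algorithm 2.6): draw from the unconstrained Gaussian full conditional, then correct the sample so that $\mathbf{1}'\beta_l = 0$ holds exactly. The code is an illustrative sketch with arbitrary stand-in data; the ridge-like `Q_prior` is a hypothetical placeholder for the paper's prior precision $Q_l$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 40, 8, 1.0

Z = rng.normal(size=(n, k))        # stand-in design matrix of one smooth effect
y_work = rng.normal(size=n)        # working observations  y - eta_{-l}
Q_prior = 0.5 * np.eye(k)          # hypothetical stand-in for the prior precision Q_l

# Unconstrained full conditional: N(beta_star, (Q_star)^-1).
Q_star = Z.T @ Z / sigma2 + Q_prior
L = np.linalg.cholesky(Q_star)     # Q_star = L L'
beta_star = np.linalg.solve(Q_star, Z.T @ y_work / sigma2)

# Draw beta ~ N(beta_star, Q_star^-1) via the Cholesky factor of the precision:
# x = mu + L^{-T} z has covariance (L L')^{-1}.
z = rng.normal(size=k)
beta = beta_star + np.linalg.solve(L.T, z)

# Conditioning by kriging: project the draw onto the constraint 1' beta = 0.
A = np.ones((1, k))                          # constraint matrix
V = np.linalg.solve(Q_star, A.T)             # Q_star^{-1} A'
beta_c = beta - V @ np.linalg.solve(A @ V, A @ beta)
```

The corrected draw `beta_c` satisfies the constraint exactly while remaining a sample from the Gaussian full conditional restricted to the constrained subspace.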
Nonetheless, we refrain from an empirical comparison to this approach, since the functional horseshoe prior lacks a smoothing component, which is not in line with our goal of producing flexible yet non-wiggly functional estimates.
In the first two scenarios, we consider a simple regression setting with only one smooth effect and 100 observations. A quadratic effect is featured in scenario I, while a more complex function is used in scenario II. In particular, the smooth functions are defined as
$$f(x) = 1 + 1.x \quad \text{(scenario I)} \qquad \text{and} \qquad f(x) = 1 + 10\sin(x) + x + 0.x \quad \text{(scenario II)}.$$
The covariate values $x_i$ are equally spaced within the interval $[-\pi, \pi]$. In both scenarios, we add an independently and normally distributed error term with zero mean and variance $\sigma^2$.

We fit models comprising the proposed prior with null spaces spanned by $[1, x]$ and $[1, x, x^2]$ in scenario I, and add in scenario II the null spaces spanned by $[1, \sin(x)]$ and $[1, x, x^2, \sin(x)]$. The operations on the covariates are defined element-wise, and the last null space is referred to as complex in the following. To feature two different SNRs, we choose two values for the standard deviation $\sigma$ of the error term. Each scenario is replicated 100 times.

For the estimation, we employ a cubic spline with 20 inner knots, and we determine the parameter $\nu$ in the prior on $\tau$ based on $\alpha = 0.$ and the cutoff $c$, where $c$ is determined as follows. Let $c_p$ be the maximum absolute value of the second-order derivative of the parametric solution within the null space. Then the cutoff is set to $c = 10 \max(c_p, .)$. Furthermore, $\xi$ is set to 1. Inference is based on 10,000 MCMC iterations, of which the first half is considered warm-up.

The main results are summarized in Figure 4 and Figure 5 for scenario I and scenario II, respectively. In scenario I, we can see that the prior adapts to the data-generating function regardless of the specified null space when the signal is strong enough. With more noise, the prior still shrinks to the parametric function when the correct null space is specified. Concerning the linear null space, the prior performs worse and does not show consistent behavior when more noise is present.

Figure 4: Plots concerning simulation scenario I. The left plot shows the data of one iteration, the data-generating function (red line), the average of the posterior means using a parametric model (blue line), and the average of the posterior means together with 90% quantiles when employing the proposed model. The right plot shows histograms of the posterior mean of the shrinkage parameter $\kappa$ (see Equation (3)). The columns within the plots distinguish between different types of null spaces, which are also used for the parametric estimates. Each row within the plots displays a different SNR scenario.

Now focusing on scenario II, we can deduce that the prior is mostly able to decide between signal and no signal, as most values are either close to 0 or 1. Furthermore, we can observe that in the high-noise scenario the prior forces the estimate to be very close to the parametric solution and thus to be in the null space. This is especially true in both SNR scenarios for the complex null space, which includes the data-generating function. In the low-noise scenario, the estimate is mainly able to capture the true function, with some difficulties with the quadratic null space.

Figure 5:
Plots analogous to Figure 4, but concerning simulation scenario II. The second row of plots comprises the additional null spaces.
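The cutoff rule used in scenarios I and II, c = 10 max(c_p, ·), can be sketched numerically. This is a minimal illustration, not the authors' implementation: the least-squares fit, the evaluation grid, and the `floor` argument (a hypothetical placeholder for the lower bound whose value is not reproduced here) are our assumptions.

```python
import numpy as np

def cutoff(x, y, basis, floor=0.5, h=1e-4):
    """Heuristic cutoff c = 10 * max(c_p, floor), where c_p is the largest
    absolute second-order derivative of the parametric (null-space) fit.
    `floor` is a hypothetical stand-in for the lower bound in the paper."""
    design = lambda t: np.column_stack([b(t) for b in basis])
    beta, *_ = np.linalg.lstsq(design(x), y, rcond=None)  # least-squares fit
    f = lambda t: design(t) @ beta
    grid = np.linspace(x.min() + h, x.max() - h, 401)
    # central finite differences approximate the second derivative
    second = (f(grid + h) - 2.0 * f(grid) + f(grid - h)) / h**2
    c_p = np.max(np.abs(second))
    return 10.0 * max(c_p, floor)
```

For a quadratic null space [1, x, x²] fitted to data generated from 1 + 2x², the second derivative of the fit is constant (here 4), so the rule yields c = 40.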
In the final simulation study, scenario III, we use the proposed prior within an additive predictor that includes the effects of multiple covariates. Its performance is compared to Bayesian P-splines (Lang and Brezger, 2004). To do so, we generate data for n = 100 observations according to

y_i = f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + f_4(x_{i4}) + ε_i,  i = 1, . . . , n,

with the covariate information x_ij independently and uniformly distributed on the interval [−1, 1]. Furthermore, ε_i is independently normally distributed with expectation zero and variance σ², and f_1, . . . , f_4 denote four smooth functions, in particular

f_1(x) = x,  f_2(x) = 2x² − 2/3,  f_3(x) = sin(xπ),  f_4(x) = 2 exp(x) / (exp(1) − exp(−1)) − 1,

where the integral from −1 to 1 is equal to 0 for all functions, and they are scaled such that the Lebesgue measure of the image of [−1, 1] under the function is equal to 2; that is, sup({f(x) | x ∈ [−1, 1]}) − inf({f(x) | x ∈ [−1, 1]}) = 2 for a continuous function f on [−1, 1].
We fit two models to the data. In the first model, each smooth function is represented by Bayesian P-splines with the default inverse gamma prior (a = 0., b = 0.) on the variance parameter of the spline coefficients. In the second model, the smooth functions are estimated employing the introduced prior. In each model, we use the same number of inner knots (i.e., 20) and the same knot positions. For the shrinkage prior, we specify the correct null space for the first three functions, i.e., [1, x], [1, x, x²], and [1, sin(xπ), cos(xπ)]. As before, the operations on the vector are interpreted element-wise. For the fourth smooth effect, the null space comprises, in contrast to the true effect, only constant effects. We set ν_l = 0. for l = 1, . . . , 4 and ξ = 1. The scenario is replicated 100 times, and we use 15,000 MCMC iterations in the estimation procedure, of which the first half is considered warm-up.
The plots in Figure 6 summarize the main findings.
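A minimal sketch of this data-generating process, with the seed and the noise level σ chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# The four centred and scaled smooth functions of scenario III
f1 = lambda x: x
f2 = lambda x: 2.0 * x**2 - 2.0 / 3.0
f3 = lambda x: np.sin(x * np.pi)
f4 = lambda x: 2.0 * np.exp(x) / (np.exp(1) - np.exp(-1)) - 1.0

def simulate(n=100, sigma=1.0):
    """Covariates uniform on [-1, 1]; additive signal plus Gaussian noise."""
    X = rng.uniform(-1.0, 1.0, size=(n, 4))
    signal = f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) + f4(X[:, 3])
    return X, signal + rng.normal(0.0, sigma, size=n)
```

Each f_j integrates to zero over [−1, 1] and satisfies sup − inf = 2, which can be checked numerically against the conditions stated above.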
All plots are based on the estimated posterior mean. Considering subplot a), we make the following observations. First, the root mean square error (RMSE) with respect to the observed values is slightly smaller when using Bayesian P-splines. Having said this, however, the RMSE with respect to the true values is smaller when employing our approach. These observations indicate that the Bayesian P-spline slightly overfits. The proposed prior seems to use the information supplied with the null space but is still able to diverge from it when it does not fit. This interpretation is supported by the results obtained from the first two scenarios.
Breaking down the individual smooth effects, the histograms in Figure 6 b) show that the estimated shrinkage weight κ is large for the splines with a correctly specified null space and small in the other case. Thus, the prior is able to detect a misspecified null space and restrains the shrinkage.
This is partly reflected in the RMSE (defined in terms of the integral from −1 to 1 over the squared differences to the true function) of the individual smooth effects. For the fourth smooth effect with the misspecified null space, the differences between the shrinkage prior and the P-spline solution are minor with respect to the RMSE. For the first and third smooth effects, the RMSE is smallest and, in particular, considerably smaller than for the P-spline solution. We find, however, almost no difference between both methods for the second smooth effect, even though the shrinkage weight is close to one in most replications. Together with the information provided in Figure 6 d), we deduce that the estimated function is within the null space, meaning it can be represented by a quadratic polynomial, but misrepresents the underlying true function. The RMSE in this sub-figure shows the distance from the estimated function to the closest function from the null space. Again, the RMSE is defined in terms of an integral.
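The integral-based RMSE used for panels c) and d) can be approximated with a simple quadrature rule. This is a sketch under our own convention: normalizing the integral by the interval length is an assumption, as the text does not spell out the exact definition.

```python
import numpy as np

def rmse_integral(f_hat, f_ref, a=-1.0, b=1.0, m=2001):
    """Square root of the average squared difference between two functions
    on [a, b], with the integral approximated by the trapezoidal rule."""
    grid = np.linspace(a, b, m)
    d2 = (f_hat(grid) - f_ref(grid)) ** 2
    weights = np.full(m, 1.0)
    weights[0] = weights[-1] = 0.5            # trapezoidal end-point weights
    integral = (b - a) / (m - 1) * np.sum(weights * d2)
    return np.sqrt(integral / (b - a))
```

For example, comparing f̂(x) = x with the zero function gives sqrt(∫ x² dx / 2) = sqrt(1/3) ≈ 0.577 on [−1, 1].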
Thus, the plots show that the functions estimated by the shrinkage approach are very close to a parametric solution from the null space for the first three functions. For the misspecified null space, the estimated function is on average not close to a constant effect. To conclude, the proposed shrinkage prior is suited to take the additional information into account but can diverge from it when it is clearly misspecified. This can help to avoid overfitting compared to the Bayesian P-spline. However, an even better distinction

Figure 6:
Plots concerning simulation scenario III. The red color always refers to the Bayesian P-spline model and the green color to the proposed approach. In b), c), and d) the numbers 1 to 4 indicate the functional effect. a) Boxplots of the RMSE of the mean posterior predictions with respect to the observations and to the true data generating function. b) Histograms of the posterior mean of the shrinkage coefficients. c) Boxplots of the RMSE of each estimated functional effect with respect to the true function. d) Boxplots of the RMSE of the estimated effects to the nearest parametric function. The RMSE in c) and d) is defined in terms of the integral from −1 to 1 instead of the mean over the observed values as in a).

between noise and signal is desirable, since this would lead to a more consistent estimation of extreme values of the shrinkage parameter κ, i.e., values close to 0 or 1. Values of this parameter far away from the extremes can be observed in scenario I and scenario II for some null spaces.

Predicting energy consumption is an important aspect of keeping the energy grid stable. Grid operators need to ensure that the amount of energy produced matches the amount of energy consumed. Failing to manage this successfully may cause small disturbances, such as fluctuations in the AC grid frequency, but also severe consequences such as blackouts. To guarantee grid stability, the network operators have various options at their disposal. Large and efficient power plants should provide the base load, and more flexible power plants can be used to absorb rapid fluctuations in consumption. Usually, energy produced by the flexible power plants is more expensive, but large power plants, such as nuclear power plants, need more time to adapt their output. Therefore, predicting the energy consumption that can be covered by the efficient power plants proves quite helpful.
Ideally, the long adaptation time of these power plants should be considered during estimation while still being able to flexibly detect quick changes in the base load. The presented approach of smooth subspace shrinkage fulfills this requirement. The subspace could be comprised of a trigonometric polynomial of low order that represents patterns that can be handled by efficient power plants. However, a purely parametric estimation is not suitable, since a rapid increase in consumption, for example in the morning, must remain detectable.
In the following, we present an application to data from the German energy market. The data used is freely available from SMARD. Suppose the quarter-hourly energy consumption in Germany over one day needs to be estimated. The most naive approach might be a linear model based on a trigonometric polynomial of order Ω, i.e.,

E(y | x) = β_0 + Σ_{ω=1}^{Ω} [ β_ω cos(ωπx/12) + β̃_ω sin(ωπx/12) ],

where x ∈ [0, 24) denotes the hour of the day. Another approach is the use of regression splines, as they can flexibly adapt to the data. The latter may have the disadvantage that the spline solution diverges from the parametric solution even though the parametric solution is preferred based on theoretical considerations and the spline solution fits the data only negligibly better.
We employ this example to showcase the applicability of the introduced prior hierarchy. For that, we choose eight weekdays (only Mondays and Tuesdays) and eight weekend days from November 2018 and preprocess the data by rescaling and subtracting the daily mean. We specify the parameters in our prior such that it shrinks towards the polynomial from above with Ω = 4. As explained above, we choose this order as we assume that efficient power plants can handle this pattern.
Since we employ only one covariate, we refrain from using the two-level hierarchy for the scale parameter of the spline coefficients and thus fix ξ to . .
Furthermore, the prior for τ is determined by setting the cutoff c to two times the largest absolute second-order derivative of the parametric solution and α = 0. . The MCMC sampling runs for 12,000 iterations, of which the first 2,000 are considered warm-up.
A plot of the data together with the estimated function is displayed in Figure 7. We observe that the proposed prior adapts to the data by shrinking the estimated effect to the parametric function for the weekend days and leaves it basically untouched for the other days. This is reflected in the shrinkage coefficient κ, which is almost one for the weekend days, indicating full shrinkage. Weekend days can be modeled with the chosen trigonometric polynomial, while more flexibility is needed for weekdays. Thus, our prior seems to trade off well between both situations.

SMARD - Strommarktdaten der Bundesnetzagentur für Elektrizität, Gas, Telekommunikation, Post und Eisenbahnen (electricity market data of the German Federal Network Agency). The data can be downloaded at https://smard.de.

Figure 7:
Plots of the total energy consumption in Germany for eight weekdays and eight weekend days in November 2018 (gray lines). Estimated outcome using the parametric and the introduced approach (sshs).
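The trigonometric baseline above can be sketched as an ordinary least-squares fit. The quarter-hourly grid and the toy signal below are illustrative stand-ins for the actual SMARD data.

```python
import numpy as np

def trig_design(x, order=4, period=24.0):
    """Design matrix of a trigonometric polynomial of the given order:
    an intercept plus cosine/sine pairs with frequencies 2*pi*omega/period."""
    cols = [np.ones_like(x)]
    for omega in range(1, order + 1):
        cols.append(np.cos(2.0 * np.pi * omega * x / period))
        cols.append(np.sin(2.0 * np.pi * omega * x / period))
    return np.column_stack(cols)

# Quarter-hourly grid over one day (96 observations) and a toy signal
hours = np.arange(0.0, 24.0, 0.25)
y = np.cos(2.0 * np.pi * hours / 24.0)
beta, *_ = np.linalg.lstsq(trig_design(hours), y, rcond=None)
```

With Ω = 4 the design has 9 columns; by the discrete orthogonality of the harmonics over a full period, the fit recovers the first cosine coefficient exactly and sets all others to zero.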
In this paper, we presented a new prior hierarchy for smooth functional effects. The novelty of the developed method is the ability to shrink towards a flexible user-defined null space while still enforcing smoothness for the estimate. We showed that the prior is adaptive by shrinking small effects but leaving strong signals mainly untouched.
In simulation studies and an application, we demonstrated the performance of the shrinkage prior. The empirical studies show that the prior works well in general, but under certain conditions a better separation between shrinkage and non-shrinkage is desirable. To improve this, one could try to replace the half-Cauchy distribution in the prior on λ. A suitable candidate is the generalized centered half-Cauchy distribution recently proposed by Shin et al. (2020). Using it implies, conditional on τ → ∞, a beta prior on the shrinkage coefficient κ. Consequently, the prior can be specified such that almost all probability weight is put on the extremes. However, the proposed distribution does not include a scale parameter and, consequently, no global-local shrinkage is available. The distribution would need to be enhanced to keep this feature. The results of Shin et al. (2020) are promising, and abandoning the global shrinkage parameter may be preferable.
We would also like to explore the option of specifying a joint prior on (λ, τ). This might enable us to employ regularization of the curvature only if the shrinkage towards the subspace is weak. For strong shrinkage, smoothness is already guaranteed by the selected subspace.
Another focus of research could be the adaptation of the proposed prior to more response distributions. In a first step, the prior could be adapted for the exponential family and then further extended to Bayesian distributional regression (Klein et al., 2015). In this model class, all distributional parameters are related to the available covariate information via additive predictors.
Functional effects within the predictors could benefit from the proposed subspace shrinkage.
The proposed methodology could also be adopted in a penalized likelihood approach, yielding a new type of shrinkage complementing LASSO-type methods.

References

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals.
Biometrika, 97(2):465–480.
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression. Springer, Berlin, Heidelberg.
Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. (2014). Bayesian Data Analysis. Texts in Statistical Science. CRC Press, Boca Raton, third edition.
Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3):297–310.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 361–379.
Klein, N., Carlan, M., Kneib, T., Lang, S., and Wagner, H. (2019). Bayesian effect selection in structured additive distributional regression models. arXiv:1902.10446 [stat].
Klein, N., Kneib, T., Lang, S., and Sohn, A. (2015). Bayesian structured additive distributional regression with an application to regional income inequality in Germany.
The Annals of Applied Statistics, 9(2):1024–1052.
Lang, S. and Brezger, A. (2004). Bayesian P-splines.
Journal of Computational and Graphical Statistics, 13(1):183–212.
Makalic, E. and Schmidt, D. F. (2016). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182.
Marquardt, D. W. and Snee, R. D. (1975). Ridge regression in practice. The American Statistician, 29(1):3–20.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032.
Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3):705–767.
Park, T. and Casella, G. (2008). The Bayesian Lasso.
Journal of the American Statistical Association, 103(482):681–686.
Polson, N. G. and Scott, J. G. (2011). Shrink globally, act locally: sparse Bayesian regularization and prediction. In Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M., editors,
Bayesian Statistics, volume 9, pages 501–538. Oxford University Press.
Rue, H. and Held, L. (2005). Gaussian Markov random fields: theory and applications. Number 104 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Number 12 in Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, New York.
Scheipl, F., Fahrmeir, L., and Kneib, T. (2012). Spike-and-slab priors for function selection in structured additive regression models.
Journal of the American Statistical Association, 107(500):1518–1532.
Shin, M., Bhattacharya, A., and Johnson, V. E. (2020). Functional horseshoe priors for subspace shrinkage.
Journal of the American Statistical Association, 115(532):1784–1797.
Sørbye, S. H. and Rue, H. (2014). Scaling intrinsic Gaussian Markov random field priors in spatial modelling. Spatial Statistics, 8:39–51.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory.
Journal of Machine Learning Research, 11(116):3571–3594.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Goel, P. K. and Zellner, A., editors,