Adaptive shrinkage of smooth functional effects towards a predefined functional subspace
Paul Wiemann and Thomas Kneib, Göttingen University. January 15, 2021
In this paper, we propose a new horseshoe-type prior hierarchy for adaptively shrinking spline-based functional effects towards a predefined vector space of parametric functions. Instead of shrinking each spline coefficient towards zero, we use an adapted horseshoe prior to control the deviation from the predefined vector space. For this purpose, the modified horseshoe prior is set up with one scale parameter per spline and not one per coefficient. The presented prior allows for a large number of basis functions to capture all kinds of functional effects while preventing the estimated functional effect from a highly oscillating overfit. We achieve this by integrating a smoothing penalty similar to the random walk prior commonly applied in Bayesian P-spline priors. In a simulation study, we demonstrate the properties of the new prior specification and compare it to other approaches from the literature. Furthermore, we showcase the applicability of the proposed method by estimating the energy consumption in Germany over the course of a day. For inference, we rely on Markov chain Monte Carlo simulations combining Gibbs sampling for the spline coefficients with slice sampling for all scale parameters in the model.
In this paper, we consider the non-parametric regression problem
$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$
with independent and identically distributed error terms $\varepsilon_i \sim \mathrm{N}(0, \sigma^2)$, a smooth, nonlinear function $f$ and continuous covariate $x_i$. A common approach for solving the nonparametric regression problem is to transform it into a semi-parametric estimation problem by assuming that $f$ can be approximated via a linear combination of basis functions evaluated for the considered covariate (Ruppert et al., 2003). As an alternative to a regression spline, one might consider a parametric assumption on the dependence on the covariate, i.e., a linear or quadratic dependence, or a periodic dependence with a trigonometric polynomial. To assess the correctness of the assumptions, goodness-of-fit measures like the widely applicable information criterion (WAIC, Watanabe, 2010), residual plots or posterior predictive checks can be employed (Gelman et al., 2014).

Similar to Shin et al. (2020), we present a methodology based on a shrinkage prior that represents a trade-off between the solution of a pre-defined parametric function and the solution of a flexible regression spline. Consider a given, predefined subspace, e.g., the space of all linear or quadratic effects. The prior we propose addresses the following three main objectives:

• For a weak signal or for a signal that can be represented well by a function from the pre-specified subspace, the estimated functional effect should be shrunken towards the considered subspace.

• When the signal is strong and cannot be adequately represented by a function from the subspace, the shrinkage should be as small as possible.

• The prior should prevent the estimated function from oscillating too strongly and thus possibly overfitting the data. This is a known problem for unrestricted regression splines when many basis functions are used (Fahrmeir et al., 2013).
However, we want the prior to be able to handle a high number of basis functions so that the space spanned by the basis functions can represent quickly changing functions.

The concept of shrinkage has been applied successfully and widely in many scientific fields since the foundational work by James and Stein (1961). We mention here ridge regression (Hoerl and Kennard, 1970; Marquardt and Snee, 1975), the LASSO penalty (Tibshirani, 1996) and its Bayesian counterpart, the Bayesian LASSO (Park and Casella, 2008), as examples for shrinkage of the regression coefficients in a linear model. More recently, the concept of global-local shrinkage (Polson and Scott, 2011) has emerged. Compared to fixed-point shrinkage, which shrinks every regression coefficient towards 0, global-local shrinkage shrinks the vector of all regression coefficients towards a sparse solution, which makes these priors particularly useful for high-dimensional but sparse estimation problems. Global-local priors are usually equipped with global hyper-parameters that control the sparseness of the estimated vector. The horseshoe prior (Carvalho et al., 2010) is a prominent member of this class of shrinkage priors and is especially known for its adaptive properties. Due to the horseshoe-shaped prior on the shrinkage coefficient, it is capable of leaving strong signals almost untouched while applying strong shrinkage to weak signals. Polson and Scott (2011) draw a connection between global-local shrinkage and Bayesian model averaging that Carvalho et al. (2010) also highlight for their horseshoe prior.

Hence, the prior developed in this paper differs from the above-mentioned priors, as we are not aiming at shrinking a sparse vector or individual regression coefficients towards zero. Instead, we aim at shrinking the function estimated by a regression spline towards a predefined parametric subspace, while simultaneously controlling its curvature.
This distinguishes our approach from the goals usually pursued with the regularization of regression splines, which essentially focus on limiting the curvature of the solution (see Lang and Brezger, 2004).

Recently, the work by Shin et al. (2020) has come close to, and indeed inspired, the method developed in this paper. However, their approach does not take the curvature of the estimated effect into account and, consequently, as we show in Section 2, the estimated function easily becomes very wiggly or might not be capable of capturing essential patterns in the data. Moreover, the method of effect selection introduced by Scheipl et al. (2012) and Klein et al. (2019) is not directly comparable with our approach. Firstly, instead of shrinkage, a spike-and-slab-based procedure is employed (Mitchell and Beauchamp, 1988; Ishwaran and Rao, 2005). Secondly, the null effect arises from the kernel of the improper precision matrix in the prior on the spline coefficients and is therefore not an arbitrary parametric function of the covariate.

The remainder of this paper is organized as follows. In Section 2, we introduce the prior, discuss its extension to additive regression models (Section 2.3), and investigate its shrinkage properties in Section 2.4. We describe the Bayesian inference scheme in Section 3 before demonstrating the validity of our approach with simulations in Section 4. Here, we investigate the prior in a simple univariate case as well as in an additive model. We conclude the empirical part of this paper in Section 5 with an application to energy consumption in Germany before we end with the discussion in Section 6.

In this section, we introduce the prior specification for estimating a smooth functional effect combined with shrinkage towards a predefined functional subspace. Consider the non-parametric regression problem from Equation (1).
As stated in the introduction, a common approach for the estimation of $f$ is to transform the problem into a semi-parametric estimation problem by assuming that $f$ can be approximated via a linear combination of $k$ pre-specified basis functions denoted by $B_j$. More precisely, we assume
$$f(x) = \sum_{j=1}^{k} B_j(x)\,\beta_j = B(x)'\beta$$
with $B(x) = (B_1(x), \ldots, B_k(x))'$ and $\beta = (\beta_1, \ldots, \beta_k)'$. Thus, a finite number of parameters needs to be estimated, which usually have no valid interpretation by themselves. In this work, we employ B-splines of third order with equally spaced knots (Fahrmeir et al., 2013).

The number of basis functions $k$ is highly important for the functional space spanned by the basis functions. A small number of basis functions may not be flexible enough to capture the pattern observed in the data. Using many basis functions, however, may result in a highly oscillating function that overfits the data. In the literature, different approaches can be found to overcome this problem. One is to choose the number of basis functions based on the number of available observations (see Ruppert et al., 2003, for details).

Let $Z$ denote the $n \times k$ matrix of basis functions evaluated at the observed covariates $x_1, \ldots, x_n$. When limiting the number of basis functions, Zellner's $g$-prior is a common choice for the prior on $\beta$ (Zellner, 1986),
$$\beta \mid g \sim \mathrm{N}\big(0,\, g\,(Z'Z)^{-1}\big),$$
where the hyper-parameter $g > 0$ adjusts the weight of the prior in relation to the observations. Zellner's $g$-prior has the advantage of taking the (observed) correlation structure of the covariates into account. Under this prior, the posterior mean of predictions given $g$ can be expressed as
$$\mathrm{E}(\hat{y} \mid g, y) = \left(1 - \frac{1}{1+g}\right) Z (Z'Z)^{-1} Z' y,$$
which shows that $g$ can be interpreted as a shrinkage parameter with shrinkage towards the null effect. In this approach, the regularization of the spline is achieved by limiting the number of basis functions depending on the number of observations. Thus, the functional pattern that can be represented with the spline depends on the number of observations, which is not a desirable property.

To circumvent this issue, a different approach is based on regularization of the curvature of the estimated function or, to put it differently, shrinkage towards a function with less curvature. Here, a moderately large number of basis functions is used in conjunction with a prior which ensures that adjacent regression coefficients are not too different, and thus produces a certain smoothness of the functional estimate. The Bayesian P-spline approach by Lang and Brezger (2004) achieves this by placing a Gaussian random walk prior on the spline coefficients. In the case of a first-order random walk, $\beta_j = \beta_{j-1} + u_j$ for $1 < j \le k$ with $u_j \sim \mathrm{N}(0, \tau^2)$, the variance of the increments acts as a smoothness parameter and the prior shrinks towards a constant functional effect. The second-order random walk, $\beta_j = 2\beta_{j-1} - \beta_{j-2} + u_j$ for $2 < j \le k$ with $u_j \sim \mathrm{N}(0, \tau^2)$, acts as a linear extrapolation with a Gaussian error term. Again, $\tau^2$ can be interpreted as a smoothing parameter and the prior shrinks towards a linear function. The overall level is not penalized by the Bayesian P-spline prior since it places a flat prior on $\beta_1$; for the second-order random walk the flat prior extends to $\beta_2$ and thus the slope is not regularized.
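The shrinkage role of $g$ is easy to check numerically: the fitted values under Zellner's $g$-prior are exactly the unpenalized spline fit scaled by $1 - 1/(1+g)$. The following sketch is an illustration, not the authors' code; the helper `bspline_design`, the grid, $g$, and the toy data are arbitrary choices of ours.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, n_basis, degree=3):
    # Open knot vector: equally spaced inner knots, boundary knots repeated.
    inner = np.linspace(x.min(), x.max(), n_basis - degree + 1)
    knots = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
    # Evaluating with an identity coefficient matrix yields the (n, k) design matrix.
    return BSpline(knots, np.eye(n_basis), degree)(x)

rng = np.random.default_rng(1)
n, k, g = 80, 10, 4.0
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

Z = bspline_design(x, k)
# Hat matrix of the unpenalized regression spline.
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)
fit_unpenalized = P_Z @ y

shrink = 1 - 1 / (1 + g)                 # shrinkage factor towards the null effect
fit_gprior = shrink * fit_unpenalized    # posterior mean under the g-prior
```

For $g = 4$ the whole spline fit is uniformly damped by the factor $0.8$, which is precisely the "shrinkage towards the null effect" described above.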
With the appropriate rank-deficient penalty matrix $K$, this prior can equivalently be expressed as
$$\beta \sim \mathrm{N}_{\mathrm{prec}}\big(0,\, \tau^{-2} K\big)$$
where $\mathrm{N}_{\mathrm{prec}}$ is the degenerate normal distribution with precision matrix $\tau^{-2} K$ (see Lang and Brezger (2004) and Rue and Held (2005) for details).

For the construction of our smooth subspace shrinkage, we use a combination of both approaches: we use the smoothing properties of the Bayesian P-spline and combine it with an extended version of Zellner's $g$-prior, very similar to the functional horseshoe prior (Shin et al., 2020), to shrink towards a predefined functional subspace. In comparison to the Bayesian P-spline, we do not only want to shrink towards less curvature, but additionally towards a parametric functional subspace of the covariate. To give an example, this could be a quadratic effect, as well as a periodic effect. Nevertheless, the key feature of the Bayesian P-spline should be maintained, namely preventing the functional estimate from exhibiting highly oscillating behavior and thus overfitting the data when a high number of basis functions is used. The prior, however, should be able to adapt to quickly changing functions when the data suggests this, so we waive the option of using only a small number of basis functions.

We achieve our goal by using the following prior hierarchy. Let $S$ be a matrix whose columns span the space $\mathcal{N}$ towards which we want to shrink. To give examples, use $S = (1, x)$ for the null space comprised of the linear effect and $S = (1, x, x^2)$ when quadratic effects should comprise the null space. Hereby, operations on the covariate vector are carried out element-wise. Based on $S$, we define the $n \times n$ matrices $P_0 = S(S'S)^{-1}S'$ and $P_1 = I - P_0$ where $I$ denotes the identity matrix of appropriate size.
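The two projection matrices and their basic properties (idempotence, symmetry, orthogonality of their images) can be verified numerically. The sketch below is purely illustrative, using an arbitrary covariate grid and the quadratic null space $S = (1, x, x^2)$.

```python
import numpy as np

n = 50
x = np.linspace(-1, 1, n)
S = np.column_stack([np.ones(n), x, x**2])   # basis of the null space N

P0 = S @ np.linalg.solve(S.T @ S, S.T)       # projection onto N
P1 = np.eye(n) - P0                          # projection onto the orthogonal complement

# Both are idempotent, symmetric projections.
assert np.allclose(P0 @ P0, P0) and np.allclose(P1 @ P1, P1)
assert np.allclose(P0, P0.T) and np.allclose(P1, P1.T)

# P1 annihilates every function inside N ...
f_in_N = 2 - 0.5 * x + 3 * x**2
assert np.allclose(P1 @ f_in_N, 0)

# ... while a cubic term keeps a non-zero component outside N.
deviation = np.linalg.norm(P1 @ x**3)
```

Here `deviation` is exactly the kind of quantity the prior introduced below penalizes: the part of an effect that cannot be represented within the null space.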
$P_0$ and $P_1$ are both projection matrices with their images being orthogonal complements of each other. Additionally, the union of their images is equal to the space spanned by the columns of $Z$. By construction, $\mathcal{N}$ is the image of $P_0$ and the kernel of $P_1$. We can now decompose the image of the element-wise application of the function, $f(x) = (f(x_1), \ldots, f(x_n))'$, into the part within and the part not within the null space, i.e.,
$$f(x) = (P_0 + I - P_0) f(x) = \underbrace{P_0 f(x)}_{\in \mathcal{N}} + \underbrace{P_1 f(x)}_{\in \bar{\mathcal{N}}}$$
where $\bar{\mathcal{N}}$ denotes the orthogonal complement of $\mathcal{N}$. Therefore, $f_{\bar{\mathcal{N}}}(x) = P_1 f(x)$ is the part of the functional effect that is not within the null space and consequently should be controlled by the shrinkage prior. To achieve this, we propose to place the improper prior
$$\beta \mid \lambda, \tau^2, \sigma^2 \sim \mathrm{N}_{\mathrm{prec}}(0, Q) \quad \text{with} \quad Q = \sigma^{-2} \lambda^{-2}\, Z' P_1 Z + \tau^{-2} K$$
on $\beta$, where $K$ denotes the appropriate second-order random walk penalty matrix. Therefore,
$$\beta' Q \beta = \frac{f_{\bar{\mathcal{N}}}(x)' f_{\bar{\mathcal{N}}}(x)}{\lambda^2 \sigma^2} + \frac{\beta' K \beta}{\tau^2};$$
i.e., the argument to the exponential function in the degenerate normal density of the prior on $\beta$ measures the sum of the quadratic deviations from the functional null space (weighted with $\lambda^{-2}\sigma^{-2}$) in the first term and the sum of the squared second-order differences of the spline coefficients (weighted with $\tau^{-2}$) in the second term. We deal with the priors on the scale parameters later in Equation (5).

The posterior of $\beta$ conditioned on $\sigma^2$, $\lambda$ and $\tau^2$ is then normally distributed with mean $\beta^\star$ and precision matrix $Q^\star$ given as follows:
$$Q^\star = \sigma^{-2} Z'Z + Q, \qquad \beta^\star = \sigma^{-2} (Q^\star)^{-1} Z' y. \tag{2}$$

When defining
$$\kappa = \frac{1}{1 + \lambda^2} \tag{3}$$
as in the horseshoe prior and considering the expected functional effect conditioned on no penalization of the second-order derivative,
$$\lim_{\tau^2 \to \infty} \mathrm{E}(Z\beta \mid y, \lambda, \tau^2) = \big((1 - \kappa) P_Z + \kappa P_0\big)\, y \tag{4}$$
where $P_Z = Z(Z'Z)^{-1}Z'$, the role of $\kappa$ as the shrinkage factor becomes apparent, as the expectation is a weighted average of the parametric least squares solution (i.e., $P_0 y$) and the unpenalized spline estimate (i.e., $P_Z y$). We reach the extreme of no shrinkage, and thus recover the spline solution, for $\kappa \to 0$. Full shrinkage is obtained for $\kappa \to 1$, which equals the parametric least squares solution.

We now discuss the necessity of the second term in the precision matrix of the prior on $\beta$. From Equation (4), it is evident that the spline solution passes on its highly oscillating behavior to the weighted average if there is no, or only little, shrinkage. No or slight shrinkage is especially desirable in the situation in which the defined null space does not match the pattern in the data, and thus the model is supposed to adapt to the data. But even in the case of strong but not complete shrinkage, the highly oscillating behavior is included in the estimated function, albeit with smaller amplitude.

We illustrate this with an example. Consider the solid line in Figure 1. Without any penalization of the second-order derivative (i.e., $\tau^2 \to \infty$), the expected value of the prediction tends to overfit the data. Only in the case of the correctly specified null space do we see a fitting and non-oscillating functional estimate (second row, fourth column). In addition, the figure demonstrates that integrating the second-order random walk prior on $\beta$ helps to prevent this.
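Equation (4) can be verified numerically: with $\tau^2 \to \infty$, the posterior mean of $Z\beta$ equals the $\kappa$-weighted average of the parametric projection $P_0 y$ and the unpenalized spline fit $P_Z y$. The sketch below is illustrative only, with arbitrary toy data, $\sigma^2 = 1$, a linear null space, and a hypothetical `bspline_design` helper built from SciPy.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, n_basis, degree=3):
    # Cubic B-spline design matrix with equally spaced knots (open knot vector).
    inner = np.linspace(x.min(), x.max(), n_basis - degree + 1)
    knots = np.r_[[inner[0]] * degree, inner, [inner[-1]] * degree]
    return BSpline(knots, np.eye(n_basis), degree)(x)

rng = np.random.default_rng(2)
n, k, lam = 60, 12, 0.7
x = np.linspace(0.0, 1.0, n)
y = np.cos(3 * x) + rng.normal(0, 0.2, n)

Z = bspline_design(x, k)
S = np.column_stack([np.ones(n), x])        # linear null space, contained in span(Z)
P0 = S @ np.linalg.solve(S.T @ S, S.T)
P1 = np.eye(n) - P0
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Posterior mean of Z beta with sigma^2 = 1 and tau^2 -> infinity:
#   Q* = Z'Z + lam^-2 Z'P1 Z,   E(Z beta | y) = Z (Q*)^-1 Z'y
Q_star = Z.T @ Z + lam**-2 * (Z.T @ P1 @ Z)
fit = Z @ np.linalg.solve(Q_star, Z.T @ y)

# Right-hand side of Equation (4).
kappa = 1 / (1 + lam**2)
weighted = (1 - kappa) * (P_Z @ y) + kappa * (P0 @ y)
```

Both sides agree because cubic B-splines reproduce constants and linear functions exactly, so the null space spanned by $S$ lies inside the column space of $Z$, which is the setting assumed for Equation (4).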
Even when no shrinkage is present (i.e., $\kappa \to 0$), the posterior mean exhibits less curvature and fits the underlying pattern better.

Figure 1: The plots demonstrate the effect of the shrinkage coefficient $\kappa$ and the smoothing variance $\tau^2$ on the posterior mean as given in Equation (2) for a data example (scatterplot) generated from a quadratic function. The lines represent the posterior mean. Without any smoothing (i.e., $\tau^2 \to \infty$), the spline is highly oscillating and overfits the data when no, or only slight, shrinkage is present. In this example, $\sigma^2 = 1$ and the spline is based on 20 basis functions.

We place half-Cauchy and inverse gamma priors on the scale parameters in the prior on $\beta$:
$$\lambda \sim \mathrm{C}^+(0, \tilde{\xi}), \quad \xi \sim \mathrm{C}^+(0, \xi_0), \quad \tau \sim \mathrm{C}^+(0, \nu) \quad \text{and} \quad \sigma^2 \sim \mathrm{IG}(a_0, b_0), \tag{5}$$
where $\mathrm{C}^+(a, b)$ denotes the Cauchy distribution with location parameter $a$ and scale parameter $b$ truncated to the positive real numbers, and $\mathrm{IG}(c, d)$ denotes the inverse gamma distribution with shape parameter $c$ and scale parameter $d$.

A probability statement (Equation (6)) is used to determine the parameter $\nu > 0$ of the prior on the variance of the random walk. Over the domain of the covariates, the second-order derivative should be, in absolute terms, smaller than the prespecified threshold $c$ with probability $1 - \alpha$, i.e.,
$$\lim_{\lambda \to \infty} \Pr\left( \max_{x \in \mathcal{D}} \left| f''(x) \right| < c \,\middle|\, \lambda \right) = 1 - \alpha. \tag{6}$$
Furthermore, $\xi_0, a_0, b_0 > 0$ must be specified by the user and $\tilde{\xi}$ is a function of $\xi$. The prior on $\lambda$ is an adaptation of the hierarchy considered in the horseshoe prior (Carvalho et al., 2010). We scale $\xi$ such that the marginal variance of the spline coefficients depends neither on the design matrix of the spline $Z$ nor on the defined null space and thus the projection matrix $P_1$. Following the idea of Sørbye and Rue (2014), we define
$$\tilde{\xi} = \xi\, \sigma_{\mathrm{ref}}^{-1}(Z' P_1 Z)$$
where $\sigma_{\mathrm{ref}}(\cdot)$ represents the average marginal standard deviation induced by the precision matrix. For that, we calculate the standard deviation as the geometric mean
$$\sigma_{\mathrm{ref}}(Q) = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \log\big(\Sigma^{\star}_{ii}\big) \right)$$
with $\Sigma^{\star}$ as the generalized inverse of $Q$.

To have more control over the shrinkage properties, especially the strength of a signal, $\xi$ may be fixed. This might be particularly well suited in situations in which only one functional effect should be controlled by the proposed prior and, therefore, no global-local shrinkage is required. The extension to multiple additive effects is described in Section 2.3 and an assessment of the shrinkage properties is made in Section 2.4.

2.3 Extension to multiple additive functional effects

We describe briefly how the proposed prior hierarchy can be extended to cover multiple functional effects in an additive regression design (Hastie and Tibshirani, 1986). Suppose $L$ smooth effects should be modeled such that the response for $i = 1, \ldots, n$ observations is given by
$$y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \ldots + f_L(x_{iL}) + \varepsilon_i$$
where the distributional assumption from above applies to $\varepsilon_i$. In addition to the assumption that the smooth functions can be represented by a spline, we assume for identification purposes that the smooth functions are centered; i.e., the sum over the coefficients of each spline equals 0. This assumption is justified because it causes no general restriction that cannot be nullified by centering $y$ or by the inclusion of an intercept, i.e., $\beta_0$.

As above, we approximate the smooth effects with linear combinations of basis functions, resulting in the design matrices $Z_1, \ldots, Z_L$ with the corresponding vectors of coefficients $\beta_1, \ldots, \beta_L$. For $l = 1, \ldots, L$, the prior hierarchy on the coefficient vectors is given by
$$\beta_l \mid \lambda_l, \tau_l^2, \sigma^2 \sim \mathrm{N}_{\mathrm{prec}}(0, Q_l) \quad \text{with} \quad Q_l = \sigma^{-2} \lambda_l^{-2}\, Z_l' P_1^{(l)} Z_l + \tau_l^{-2} K_l \ \text{and the linear restriction } \mathbf{1}'\beta_l = 0,$$
$$\lambda_l \mid \xi \sim \mathrm{C}^+(0, \tilde{\xi}_l) \quad \text{with} \quad \tilde{\xi}_l = \xi\, \sigma_{\mathrm{ref}}^{-1}\big(Z_l' P_1^{(l)} Z_l\big),$$
$$\xi \sim \mathrm{C}^+(0, \xi_0),$$
$$\tau_l \mid \nu_l \sim \mathrm{C}^+(0, \nu_l) \quad \text{with } \nu_l \text{ such that}$$
$$\lim_{\lambda_l \to \infty} \Pr\left( \max_{x \in \mathcal{D}_l} \left| f_l''(x) \right| < c_l \,\middle|\, \lambda_l \right) = 1 - \alpha_l$$
where

• the prior on $\beta_l$ now includes the linear restriction $\mathbf{1}'\beta_l = 0$ for identifiability,

• $K_l$ denotes the appropriate second-order random walk penalty matrix,

• $P_1^{(l)}$ denotes the projection matrix with kernel equal to the chosen null space for the $l$-th effect,

• and $\alpha_l$ and $c_l$ are chosen as above to control the wiggliness of $f_l$.

In addition, we place the same prior as before on $\sigma^2$ and use a non-informative prior for the intercept $\beta_0$. In this prior hierarchy, and similar to the horseshoe prior, the prior distributions on $\lambda_l$ are connected via $\xi$, which acts as a global shrinkage parameter while $\lambda_l$ controls the local shrinkage.

In this section, we study the shrinkage properties of the proposed prior. The literature suggests (Scheipl et al., 2012; Klein et al., 2019) examining the marginal distribution of the spline coefficients, i.e., $p(\beta_l)$ for the $l$-th functional effect. For simplicity, we make the same assumption as Carvalho et al. (2010) regarding the variance of the error term, namely $\sigma^2 = 1$. The error variance can easily be estimated from the data and acts in the proposed prior hierarchy just as a scaling component. In addition, we assume that the prior on the second-order random walk, specifically on $\tau_l$, has been chosen such that it does not affect the shrinkage properties. We consider this assumption to be valid, as we believe that the user selects the limit for the second derivative to prevent overfitting, but not to affect the shrinkage properties. Therefore, we consider in the following the limit $\tau_l^2 \to \infty$. For the sake of compactness, we refrain from stating the conditioning on $\tau_l$ and $\sigma^2$ for the rest of this section.

The proposed prior features local-global shrinkage properties with the local shrinkage parameter $\lambda_l$ and the global shrinkage parameter $\xi$.
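The adaptivity discussed in this section can be made tangible through the distribution that a half-Cauchy prior on $\lambda$ implies for the shrinkage factor $\kappa = 1/(1+\lambda^2)$ from Equation (3): most mass piles up near 0 (no shrinkage, strong signals survive) and near 1 (full shrinkage towards the null space), the characteristic horseshoe shape. A small Monte Carlo sketch, illustrative only, with unit scale and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)

# lambda ~ C+(0, 1): half-Cauchy draws as |standard Cauchy|.
lam = np.abs(rng.standard_cauchy(100_000))
kappa = 1 / (1 + lam**2)          # shrinkage factor, Equation (3)

# Horseshoe shape: mass concentrates at both endpoints of [0, 1].
near_0 = np.mean(kappa < 0.1)     # little shrinkage
near_1 = np.mean(kappa > 0.9)     # strong shrinkage
middle = np.mean((kappa > 0.45) & (kappa < 0.55))
```

For the unit-scale case, $\kappa$ follows a $\mathrm{Beta}(1/2, 1/2)$ distribution, so roughly 20% of the mass falls into each of the outer intervals while the center is sparsely populated; this is the mechanism behind leaving strong signals untouched while shrinking weak ones.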
Since i) the main local shrinkage properties are unaffected by the global shrinkage parameter's value and ii) the effect on the marginal distribution of $\beta_l$ without a fixed global shrinkage parameter depends on the specific predictor structure, we condition in the following on $\xi$ and thus $\tilde{\xi}_l$. Note that the results derived in this section do not depend on a specific value of $\xi$. For better readability, we drop the index $l$ in the following. The implied marginal distribution of $\beta \mid \tilde{\xi}$ can now be derived as
$$p(\beta \mid \tilde{\xi}) = \int_0^\infty p(\beta \mid \lambda)\, p(\lambda \mid \tilde{\xi})\, d\lambda = (2\pi)^{-\mathrm{rk}(F)/2} (|F|^*)^{1/2} \int_0^\infty \lambda^{-\mathrm{rk}(F)} \exp\left( -\frac{1}{2\lambda^2}\, \beta' F \beta \right) \frac{2}{\pi \tilde{\xi}} \left( 1 + \frac{\lambda^2}{\tilde{\xi}^2} \right)^{-1} d\lambda = \underbrace{\frac{2}{\pi \tilde{\xi}}\, (2\pi)^{-\mathrm{rk}(F)/2} (|F|^*)^{1/2}}_{=:\, c} \int_0^\infty \frac{\exp\left( -\frac{1}{2\lambda^2}\, \beta' F \beta \right)}{\lambda^{\mathrm{rk}(F)} \left( 1 + \frac{\lambda^2}{\tilde{\xi}^2} \right)}\, d\lambda,$$
where $F = Z' P_1 Z$ and $|F|^*$ denotes the generalized determinant. No analytical solution exists for this integral, though. We define the constant scalar in front of the integration symbol as $c$.

In the assessment of shrinkage properties, we are mainly interested in the tail behavior and the behavior at the origin with respect to the distance of the functional effect from the null space $\mathcal{N}$. To assess this, we define this distance as $d = \|P_1 Z \beta\|$. The implied marginal density of $d$ is then given as
$$p(d \mid \tilde{\xi}) = c \int_0^\infty \frac{\exp\left( -\frac{d^2}{2\lambda^2} \right)}{\lambda^{\mathrm{rk}(F)} \left( 1 + \frac{\lambda^2}{\tilde{\xi}^2} \right)}\, d\lambda,$$
with no closed-form solution available, so that we approximate it numerically for the visual impression shown in Figure 2.

We investigate more closely whether the spike at the origin is infinite. An infinite spike at zero exhibits particularly advantageous shrinkage properties, as it implies a strong penalization and thus strong shrinkage towards the origin for small effects (Klein et al., 2019). Translated to our hierarchy, this implies that effects close to the null space receive strong shrinkage towards the null space.

Figure 2: Plots of the marginal density of the distance of the spline coefficients to the null space conditioned on $\tilde{\xi}$. The left plot is based on $\mathrm{rk}(F) = 10$ and the right plot on $\mathrm{rk}(F) = 20$.

Using the last equation, we obtain
$$p(d = 0 \mid \tilde{\xi}) = c \int_0^\infty \frac{1}{\lambda^{\mathrm{rk}(F)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)}\, d\lambda = c \int_0^1 \underbrace{\frac{1}{\lambda^{\mathrm{rk}(F)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)}}_{=:\, w(\lambda)}\, d\lambda + \underbrace{c \int_1^\infty \frac{1}{\lambda^{\mathrm{rk}(F)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)}\, d\lambda}_{\ge 0}$$
for which we can find a lower bound for the integrand $w(\lambda)$ as
$$w(\lambda) = \frac{1}{\lambda^{\mathrm{rk}(F)} + \frac{\lambda^{\mathrm{rk}(F)+2}}{\tilde{\xi}^2}} = \frac{\tilde{\xi}^2}{\tilde{\xi}^2 \lambda^{\mathrm{rk}(F)} + \lambda^{\mathrm{rk}(F)+2}} \overset{0 < \lambda < 1}{\ge} \frac{\tilde{\xi}^2}{\tilde{\xi}^2 \lambda^{\mathrm{rk}(F)} + \lambda^{\mathrm{rk}(F)}}$$
and thus
$$\int_0^1 w(\lambda)\, d\lambda \ge \int_0^1 \frac{\tilde{\xi}^2}{(\tilde{\xi}^2 + 1)\, \lambda^{\mathrm{rk}(F)}}\, d\lambda \overset{\mathrm{rk}(F) > 1}{=} \left[ \frac{\tilde{\xi}^2\, \lambda^{1 - \mathrm{rk}(F)}}{(1 - \mathrm{rk}(F))(\tilde{\xi}^2 + 1)} \right]_0^1 = \infty.$$
The marginal indeed has an infinite spike at the origin with respect to $d$, and consequently the marginal $p(\beta)$ has an infinite spike if $Z\beta$ is in the null space.

Besides studying the marginal prior at the origin, Scheipl et al. (2012) suggest exploring the tail behavior by focusing on the first derivative of the log marginal, i.e., the score function, and its properties in the limit. Klein et al. (2019) state that the prior has heavy tails if the score function is redescending; namely, its value approaches zero in the infinite limit of its argument. Heavy tails in the marginal prior are desirable since they imply Bayesian robustness of the estimates.

Figure 3: Plots of the marginal score function of the distance of the spline coefficients to the null space conditioned on $\tilde{\xi}$. The left plot is based on $\mathrm{rk}(F) = 10$ and the right plot on $\mathrm{rk}(F) = 20$.

Unfortunately, the limiting behavior of the score function
$$\frac{\partial}{\partial d} \log\big(p(d \mid \tilde{\xi})\big) = \frac{\frac{\partial}{\partial d}\, p(d \mid \tilde{\xi})}{p(d \mid \tilde{\xi})} = \frac{-d\, c \int_0^\infty \exp\left(-\frac{d^2}{2\lambda^2}\right) \lambda^{-(\mathrm{rk}(F)+2)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)^{-1} d\lambda}{c \int_0^\infty \exp\left(-\frac{d^2}{2\lambda^2}\right) \lambda^{-\mathrm{rk}(F)} \left(1 + \frac{\lambda^2}{\tilde{\xi}^2}\right)^{-1} d\lambda} \le 0$$
is not analytically accessible, so that we must rely on a numerical approximation. Figure 3 displays plots of the score function. From the plots we conclude that the prior has a redescending score function and thus heavy tails. These findings translate to the functional space, which consequently means that functional estimates in the tails of the prior are robust.

We base our Bayesian estimation procedure on Markov chain Monte Carlo (MCMC) simulations and employ a within-Gibbs updating scheme. Within each MCMC iteration, we initially update the intercept $\beta_0$ with a Gibbs update and then update the spline coefficients of each smooth function with a draw from the full conditionals. Suppose the $l$-th component of the predictor should be updated.
Let $\tilde{y}_{-l}$ denote the working observations, that is, $\tilde{y}_{-l} = y - \eta_{-l}$, where $\eta_{-l}$ denotes the predictor with the effect of the component under consideration removed. With a non-informative prior on the intercept, the full conditional distribution of the intercept is then given by $\beta_0 \mid \cdot \sim \mathrm{N}\big(\bar{\tilde{y}}_{-0},\, \sigma^2/n\big)$ where $\bar{\tilde{y}}_{-0}$ is the mean of the working observations. For the spline coefficients of the $l$-th smooth effect, the full conditional follows from Equation (2) and is given by
$$\beta_l \mid \cdot \sim \mathrm{N}\big(\beta_l^\star,\, (Q_l^\star)^{-1}\big) \quad \text{where} \quad Q_l^\star = \sigma^{-2} Z_l' Z_l + Q_l \quad \text{and} \quad \beta_l^\star = \sigma^{-2} (Q_l^\star)^{-1} Z_l'\, \tilde{y}_{-l}.$$
To ensure identifiability, we sample under the linear constraint $\mathbf{1}'\beta_l = 0$ as described in Rue and Held (2005, Algorithm 2.6). Finally, the scale parameters of the model are updated using slice updates (Neal, 2003). Hereby, we choose to first update $\sigma^2$, then the scale parameters of each spline, namely $\lambda_l$ and $\tau_l$, and lastly the global shrinkage parameter $\xi$. All updates of the scale parameters are performed in log-space. As proposed by Makalic and Schmidt (2016), the global shrinkage parameter could also be updated using the double inverse gamma representation. However, we could not observe improvements in the MCMC estimation. Due to that, we stick to the slice sampler for the sake of consistency.

We assess the validity of our approach in three simulation studies. In the first two studies, we employ a simple additive model featuring one covariate. We explore the behavior of the presented prior considering different null spaces and its reaction to different signal-to-noise ratios (SNRs). In the third simulation study, we analyze the proposed prior within the additive model and compare the results to the popular Bayesian P-splines approach. Notably, the work in Shin et al. (2020) partly has close resemblance to the method developed in this paper.
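A single full-conditional draw of $\beta_l$ under the sum-to-zero constraint can be sketched as follows. The correction step is the standard "conditioning by kriging" of Rue and Held (2005, Algorithm 2.6): draw from the unconstrained Gaussian full conditional, then correct the sample so that $\mathbf{1}'\beta_l = 0$ holds exactly. The code is an illustrative sketch with arbitrary stand-in data; the ridge-like `Q_prior` is a hypothetical placeholder for the paper's prior precision $Q_l$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 40, 8, 1.0

Z = rng.normal(size=(n, k))        # stand-in design matrix of one smooth effect
y_work = rng.normal(size=n)        # working observations  y - eta_{-l}
Q_prior = 0.5 * np.eye(k)          # hypothetical stand-in for the prior precision Q_l

# Unconstrained full conditional: N(beta_star, (Q_star)^-1).
Q_star = Z.T @ Z / sigma2 + Q_prior
L = np.linalg.cholesky(Q_star)     # Q_star = L L'
beta_star = np.linalg.solve(Q_star, Z.T @ y_work / sigma2)

# Draw beta ~ N(beta_star, Q_star^-1) via the Cholesky factor of the precision:
# x = mu + L^{-T} z has covariance (L L')^{-1}.
z = rng.normal(size=k)
beta = beta_star + np.linalg.solve(L.T, z)

# Conditioning by kriging: project the draw onto the constraint 1' beta = 0.
A = np.ones((1, k))                          # constraint matrix
V = np.linalg.solve(Q_star, A.T)             # Q_star^{-1} A'
beta_c = beta - V @ np.linalg.solve(A @ V, A @ beta)
```

The corrected draw `beta_c` satisfies the constraint exactly while remaining a sample from the Gaussian full conditional restricted to the constrained subspace.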
Nonetheless, we refrain from an empirical comparison to this approach, since the functional horseshoe prior lacks a smoothing component, which is not in line with our goal of producing flexible yet non-wiggly functional estimates.
In the first two scenarios, we consider a simple regression setting with only one smooth effect and 100 observations. A quadratic effect is featured in scenario I, while a more complex function is used in scenario II. In particular, the smooth functions are defined as
$$f(x) = 1 + 1.x \quad \text{(scenario I)} \qquad \text{and} \qquad f(x) = 1 + 10\sin(x) + x + 0.x \quad \text{(scenario II)}.$$
The covariate values $x_i$ are equally spaced within the interval $[-\pi, \pi]$. In both scenarios, we add an independently and normally distributed error term with zero mean and variance $\sigma^2$.

We fit models comprising the proposed prior with null spaces spanned by $[1, x]$ and $[1, x, x^2]$ in scenario I, and add in scenario II the null spaces spanned by $[1, \sin(x)]$ and $[1, x, x^2, \sin(x)]$. The operations on the covariates are defined element-wise, and the last null space is referred to as complex in the following. To feature two different SNRs, we choose two values for the standard deviation $\sigma$ of the error term. Each scenario is replicated 100 times.

For the estimation, we employ a cubic spline with 20 inner knots, and we determine the parameter $\nu$ in the prior on $\tau$ based on $\alpha = 0.$ and the cutoff $c$, where $c$ is determined as follows. Let $c_p$ be the maximum absolute value of the second-order derivative of the parametric solution within the null space. Then the cutoff is set to $c = 10 \max(c_p, .)$. Furthermore, $\xi$ is set to 1. Inference is based on 10,000 MCMC iterations, of which the first half is considered warm-up.

The main results are summarized in Figure 4 and Figure 5 for scenario I and scenario II, respectively. In scenario I, we can see that the prior adapts to the data-generating function regardless of the specified null space when the signal is strong enough. With more noise, the prior still shrinks to the parametric function when the correct null space is specified. Concerning the linear null space, the prior performs worse and does not show consistent behavior when more noise is present.

Figure 4: Plots concerning simulation scenario I. The left plot shows the data of one iteration, the data-generating function (red line), the average of the posterior means using a parametric model (blue line), and the average of the posterior means together with 90% quantiles when employing the proposed model. The right plot shows histograms of the posterior mean of the shrinkage parameter $\kappa$ (see Equation (3)). The columns within the plots distinguish between different types of null spaces, which are also used for the parametric estimates. Each row within the plots displays a different SNR scenario.

Now focusing on scenario II, we can deduce that the prior is mostly able to decide between signal and no signal, as most values are either close to 0 or 1. Furthermore, we can observe that in the high-noise scenario the prior forces the estimate to be very close to the parametric solution and thus to be in the null space. This is especially true in both SNR scenarios for the complex null space, which includes the data-generating function. In the low-noise scenario, the estimate is mainly able to capture the true function, with some difficulties with the quadratic null space.

Figure 5:
Plots analogous to Figure 4, but concerning simulation scenario II. The second row of plots comprises the additional null spaces.
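The cutoff rule used in scenarios I and II, c = 10 max(c_p, ·), can be sketched numerically. This is a minimal illustration, not the authors' implementation: the least-squares fit, the evaluation grid, and the `floor` argument (a hypothetical placeholder for the lower bound whose value is not reproduced here) are our assumptions.

```python
import numpy as np

def cutoff(x, y, basis, floor=0.5, h=1e-4):
    """Heuristic cutoff c = 10 * max(c_p, floor), where c_p is the largest
    absolute second-order derivative of the parametric (null-space) fit.
    `floor` is a hypothetical stand-in for the lower bound in the paper."""
    design = lambda t: np.column_stack([b(t) for b in basis])
    beta, *_ = np.linalg.lstsq(design(x), y, rcond=None)  # least-squares fit
    f = lambda t: design(t) @ beta
    grid = np.linspace(x.min() + h, x.max() - h, 401)
    # central finite differences approximate the second derivative
    second = (f(grid + h) - 2.0 * f(grid) + f(grid - h)) / h**2
    c_p = np.max(np.abs(second))
    return 10.0 * max(c_p, floor)
```

For a quadratic null space [1, x, x²] fitted to data generated from 1 + 2x², the second derivative of the fit is constant (here 4), so the rule yields c = 40.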
In the final simulation study, scenario III, we use the proposed prior within an additive predictor that includes the effects of multiple covariates. Its performance is compared to Bayesian P-splines (Lang and Brezger, 2004). To do so, we generate data for n = 100 observations according to

y_i = f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + f_4(x_{i4}) + ε_i,  i = 1, . . . , n,

with the covariate information x_ij independently and uniformly distributed on the interval [−1, 1]. Furthermore, ε_i is independently normally distributed with expectation zero and variance σ², and f_1, . . . , f_4 denote four smooth functions, in particular

f_1(x) = x,  f_2(x) = 2x² − 2/3,  f_3(x) = sin(xπ),  f_4(x) = 2 exp(x) / (exp(1) − exp(−1)) − 1,

where the integral from −1 to 1 is equal to 0 for all functions, and they are scaled such that the Lebesgue measure of the image of [−1, 1] under the function is equal to 2; that is, sup({f(x) | x ∈ [−1, 1]}) − inf({f(x) | x ∈ [−1, 1]}) = 2 for a continuous function f on [−1, 1].
We fit two models to the data. In the first model, each smooth function is represented by Bayesian P-splines with the default inverse gamma prior (a = 0., b = 0.) on the variance parameter of the spline coefficients. In the second model, the smooth functions are estimated employing the introduced prior. In each model, we use the same number of inner knots (i.e., 20) and the same knot positions. For the shrinkage prior, we specify the correct null space for the first three functions, i.e., [1, x], [1, x, x²], and [1, sin(xπ), cos(xπ)]. As before, the operations on the vector are interpreted element-wise. For the fourth smooth effect, the null space comprises, in contrast to the true effect, only constant effects. We set ν_l = 0. for l = 1, . . . , 4 and ξ = 1. The scenario is replicated 100 times, and we use 15,000 MCMC iterations in the estimation procedure, of which the first half is considered warm-up.
The plots in Figure 6 summarize the main findings.
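A minimal sketch of this data-generating process, with the seed and the noise level σ chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# The four centred and scaled smooth functions of scenario III
f1 = lambda x: x
f2 = lambda x: 2.0 * x**2 - 2.0 / 3.0
f3 = lambda x: np.sin(x * np.pi)
f4 = lambda x: 2.0 * np.exp(x) / (np.exp(1) - np.exp(-1)) - 1.0

def simulate(n=100, sigma=1.0):
    """Covariates uniform on [-1, 1]; additive signal plus Gaussian noise."""
    X = rng.uniform(-1.0, 1.0, size=(n, 4))
    signal = f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) + f4(X[:, 3])
    return X, signal + rng.normal(0.0, sigma, size=n)
```

Each f_j integrates to zero over [−1, 1] and satisfies sup − inf = 2, which can be checked numerically against the conditions stated above.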
All plots are based on the estimated posterior mean. Considering subplot a), we make the following observations. First, the root mean square error (RMSE) with respect to the observed values is slightly smaller when using Bayesian P-splines. Having said this, however, the RMSE with respect to the true values is smaller when employing our approach. These observations indicate that the Bayesian P-spline slightly overfits. The proposed prior seems to use the information supplied with the null space but is still able to diverge from it when it does not fit. This interpretation is supported by the results obtained from the first two scenarios.
Breaking down the individual smooth effects, the histograms in Figure 6 b) show that the estimated shrinkage weight κ is large for the splines with a correctly specified null space and small in the other case. Thus, the prior is able to detect a misspecified null space and restrains the shrinkage.
This is partly reflected in the RMSE (defined in terms of the integral from −1 to 1 over the squared differences to the true function) of the individual smooth effects. For the fourth smooth effect with the misspecified null space, the differences between the shrinkage prior and the P-spline solution are minor with respect to the RMSE. For the first and third smooth effects, the RMSE is smallest and, in particular, considerably smaller than for the P-spline solution. We find, however, almost no difference between both methods for the second smooth effect, even though the shrinkage weight is close to one in most replications. Together with the information provided in Figure 6 d), we deduce that the estimated function is within the null space, meaning it can be represented by a quadratic polynomial, but misrepresents the underlying true function. The RMSE in this sub-figure shows the distance from the estimated function to the closest function from the null space. Again, the RMSE is defined in terms of an integral.
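The integral-based RMSE used for panels c) and d) can be approximated with a simple quadrature rule. This is a sketch under our own convention: normalizing the integral by the interval length is an assumption, as the text does not spell out the exact definition.

```python
import numpy as np

def rmse_integral(f_hat, f_ref, a=-1.0, b=1.0, m=2001):
    """Square root of the average squared difference between two functions
    on [a, b], with the integral approximated by the trapezoidal rule."""
    grid = np.linspace(a, b, m)
    d2 = (f_hat(grid) - f_ref(grid)) ** 2
    weights = np.full(m, 1.0)
    weights[0] = weights[-1] = 0.5            # trapezoidal end-point weights
    integral = (b - a) / (m - 1) * np.sum(weights * d2)
    return np.sqrt(integral / (b - a))
```

For example, comparing f̂(x) = x with the zero function gives sqrt(∫ x² dx / 2) = sqrt(1/3) ≈ 0.577 on [−1, 1].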
Thus, the plots show that the functions estimated by the shrinkage approach are very close to a parametric solution from the null space for the first three functions. For the misspecified null space, the estimated function is on average not close to a constant effect. To conclude, the proposed shrinkage prior is suited to take the additional information into account but can diverge from it when it is clearly misspecified. This can help to avoid overfitting compared to the Bayesian P-spline. However, an even better distinction

Figure 6:
Plots concerning simulation scenario III. The red color always refers to the Bayesian P-spline model and the green color to the proposed approach. In b), c), and d) the numbers 1 to 4 indicate the functional effect. a) Boxplots of the RMSE of the mean posterior predictions with respect to the observations and to the true data generating function. b) Histograms of the posterior mean of the shrinkage coefficients. c) Boxplots of the RMSE of each estimated functional effect with respect to the true function. d) Boxplots of the RMSE of the estimated effects to the nearest parametric function. The RMSE in c) and d) is defined in terms of the integral from −1 to 1 instead of the mean over the observed values as in a).

between noise and signal is desirable, since this would lead to a more consistent estimation of extreme values of the shrinkage parameter κ, i.e., values close to 0 or 1. Values of this parameter far away from the extremes can be observed in scenario I and scenario II for some null spaces.

Predicting energy consumption is an important aspect of keeping the energy grid stable. Grid operators need to ensure that the amount of energy produced matches the amount of energy consumed. Failing to manage this successfully may cause small disturbances, such as fluctuations in the AC grid frequency, but also severe consequences such as blackouts. To guarantee grid stability, the network operators have various options at their disposal. Large and efficient power plants should provide the base load, and more flexible power plants can be used to absorb rapid fluctuations in consumption. Usually, energy produced by the flexible power plants is more expensive, but large power plants, such as nuclear power plants, need more time to adapt their output. Therefore, predicting the energy consumption that can be covered by the efficient power plants proves quite helpful.
Ideally, the long adaptation time of these power plants should be considered during estimation while still being able to flexibly detect quick changes in the base load. The presented approach of smooth subspace shrinkage fulfills this requirement. The subspace could be comprised of a trigonometric polynomial of low order that represents patterns that can be handled by efficient power plants. However, a purely parametric estimation is not suitable, since a rapid increase in consumption, for example in the morning, must remain detectable.
In the following, we present an application to data from the German energy market. The data used is freely available from SMARD. Suppose the quarter-hourly energy consumption in Germany over one day needs to be estimated. The most naive approach might be a linear model based on a trigonometric polynomial of order Ω, i.e.,

E(y | x) = β_0 + Σ_{ω=1}^{Ω} [ β_ω cos(ωπx/12) + β̃_ω sin(ωπx/12) ],

where x ∈ [0, 24) denotes the hour of the day. Another approach is the use of regression splines, as they can flexibly adapt to the data. The latter may have the disadvantage that the spline solution diverges from the parametric solution even though the parametric solution is preferred based on theoretical considerations and the spline solution fits the data only negligibly better.
We employ this example to showcase the applicability of the introduced prior hierarchy. For that, we choose eight weekdays (only Mondays and Tuesdays) and eight weekend days from November 2018 and preprocess the data by rescaling and subtracting the daily mean. We specify the parameters in our prior such that it shrinks towards the polynomial from above with Ω = 4. As explained above, we choose this order as we assume that efficient power plants can handle this pattern.
Since we employ only one covariate, we refrain from using the two-level hierarchy for the scale parameter of the spline coefficients and thus fix ξ to . .
Furthermore, the prior for τ is determined by setting the cutoff c to two times the largest absolute second-order derivative of the parametric solution and α = 0. . The MCMC sampling runs for 12,000 iterations, of which the first 2,000 are considered warm-up.
A plot of the data together with the estimated function is displayed in Figure 7. We observe that the proposed prior adapts to the data by shrinking the estimated effect to the parametric function for the weekend days and leaves it basically untouched for the other days. This is reflected in the shrinkage coefficient κ, which is almost one for the weekend days, indicating full shrinkage. Weekend days can be modeled with the chosen trigonometric polynomial, while more flexibility is needed for weekdays. Thus, our prior seems to trade off well between both situations.

SMARD - Strommarktdaten der Bundesnetzagentur für Elektrizität, Gas, Telekommunikation, Post und Eisenbahnen (electricity market data of the German Federal Network Agency). The data can be downloaded at https://smard.de.

Figure 7:
Plots of the total energy consumption in Germany for eight weekdays and eight weekend days in November 2018 (gray lines). Estimated outcome using the parametric and the introduced approach (sshs).
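The trigonometric baseline above can be sketched as an ordinary least-squares fit. The quarter-hourly grid and the toy signal below are illustrative stand-ins for the actual SMARD data.

```python
import numpy as np

def trig_design(x, order=4, period=24.0):
    """Design matrix of a trigonometric polynomial of the given order:
    an intercept plus cosine/sine pairs with frequencies 2*pi*omega/period."""
    cols = [np.ones_like(x)]
    for omega in range(1, order + 1):
        cols.append(np.cos(2.0 * np.pi * omega * x / period))
        cols.append(np.sin(2.0 * np.pi * omega * x / period))
    return np.column_stack(cols)

# Quarter-hourly grid over one day (96 observations) and a toy signal
hours = np.arange(0.0, 24.0, 0.25)
y = np.cos(2.0 * np.pi * hours / 24.0)
beta, *_ = np.linalg.lstsq(trig_design(hours), y, rcond=None)
```

With Ω = 4 the design has 9 columns; by the discrete orthogonality of the harmonics over a full period, the fit recovers the first cosine coefficient exactly and sets all others to zero.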
In this paper, we presented a new prior hierarchy for smooth functional effects. The novelty of the developed method is the ability to shrink towards a flexible user-defined null space while still enforcing smoothness for the estimate. We showed that the prior is adaptive by shrinking small effects but leaving strong signals mainly untouched.
In simulation studies and an application, we demonstrated the performance of the shrinkage prior. The empirical studies show that the prior works well in general, but under certain conditions a better separation between shrinkage and non-shrinkage is desirable. To improve this, one could try to replace the half-Cauchy distribution in the prior on λ. A suitable candidate is the generalized centered half-Cauchy distribution recently proposed by Shin et al. (2020). Using it implies, conditional on τ → ∞, a beta prior on the shrinkage coefficient κ. Consequently, the prior can be specified such that almost all probability weight is put on the extremes. However, the proposed distribution does not include a scale parameter and, consequently, no global-local shrinkage is available. The distribution would need to be enhanced to keep this feature. The results of Shin et al. (2020) are promising, and abandoning the global shrinkage parameter may be preferable.
We would also like to explore the option of specifying a joint prior on (λ, τ). This might enable us to employ regularization of the curvature only if the shrinkage towards the subspace is weak. For strong shrinkage, smoothness is already guaranteed by the selected subspace.
Another focus of research could be the adaptation of the proposed prior to more response distributions. In a first step, the prior could be adapted for the exponential family and then further extended to Bayesian distributional regression (Klein et al., 2015). In this model class, all distributional parameters are related to the available covariate information via additive predictors.
Functional effects within the predictors could benefit from the proposed subspace shrinkage.
The proposed methodology could also be adopted in a penalized likelihood approach, yielding a new type of shrinkage complementing LASSO-type methods.

References

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals.
Biometrika, 97(2):465–480.
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression. Springer, Berlin, Heidelberg.
Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. (2014). Bayesian Data Analysis. Texts in Statistical Science. CRC Press, Boca Raton, third edition.
Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3):297–310.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 361–379.
Klein, N., Carlan, M., Kneib, T., Lang, S., and Wagner, H. (2019). Bayesian effect selection in structured additive distributional regression models. arXiv:1902.10446 [stat].
Klein, N., Kneib, T., Lang, S., and Sohn, A. (2015). Bayesian structured additive distributional regression with an application to regional income inequality in Germany.
The Annals of Applied Statistics, 9(2):1024–1052.
Lang, S. and Brezger, A. (2004). Bayesian P-splines.
Journal of Computational and Graphical Statistics, 13(1):183–212.
Makalic, E. and Schmidt, D. F. (2016). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182.
Marquardt, D. W. and Snee, R. D. (1975). Ridge regression in practice. The American Statistician, 29(1):3–20.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032.
Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3):705–767.
Park, T. and Casella, G. (2008). The Bayesian Lasso.
Journal of the American Statistical Association, 103(482):681–686.
Polson, N. G. and Scott, J. G. (2011). Shrink globally, act locally: sparse Bayesian regularization and prediction. In Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M., editors,
Bayesian Statistics, volume 9, pages 501–538. Oxford University Press.
Rue, H. and Held, L. (2005). Gaussian Markov random fields: theory and applications. Number 104 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Number 12 in Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, New York.
Scheipl, F., Fahrmeir, L., and Kneib, T. (2012). Spike-and-slab priors for function selection in structured additive regression models.
Journal of the American Statistical Association, 107(500):1518–1532.
Shin, M., Bhattacharya, A., and Johnson, V. E. (2020). Functional horseshoe priors for subspace shrinkage.
Journal of the American Statistical Association, 115(532):1784–1797.
Sørbye, S. H. and Rue, H. (2014). Scaling intrinsic Gaussian Markov random field priors in spatial modelling. Spatial Statistics, 8:39–51.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory.
Journal of Machine Learning Research, 11(116):3571–3594.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Goel, P. K. and Zellner, A., editors,