Variational Full Bayes Lasso: Knots Selection in Regression Splines
Larissa Alves, Ronaldo Dias, Helio S. Migon∗

March 1, 2021
Abstract
We develop a fully automatic Bayesian Lasso via variational inference. This is a scalable procedure for approximating the posterior distribution. Special attention is given to knot selection in regression splines. In order to carry through our proposal, a fully automatic variational Bayesian Lasso, a Jeffreys prior is proposed for the hyperparameters and a decision-theoretical approach is introduced to decide whether a knot is selected or not. Extensive simulation studies were developed to assess the effectiveness of the proposed algorithms. The performance of the algorithms was also tested on some real data sets, including data from the Covid-19 world pandemic. Again, the algorithms showed very good performance in capturing the data structure.
Introduction

In the recent literature one finds many alternative proposals for modeling and estimating a smooth function. In this article we focus on variants of smoothing splines, called penalized regression splines (Montoya et al. [2014], Eilers and Marx [1996]). This is an attractive approach for modeling the nonlinear smoothing effect of covariates. This work discusses the selection of knots given a fixed maximum number of knots. A roughness penalty is introduced to control the selection of knots and consequently to balance the two conflicting goals, goodness of fit and smoothness. Our approach is a full Bayesian Lasso with variational inference. It is related to the work of Osborne et al. [1998], where an efficient algorithm to compute the classical Lasso estimator was presented. Our contribution, therefore, includes the application of mean field variational inference (Blei et al. [2017], Ormerod and Wand [2010]) to the complete Bayesian Lasso penalty (Park and Casella [2008] and Mallick and Yi [2014]). Choosing the ideal number of knots and their positions is a difficult problem. We propose a two-step procedure related to the work of Ruppert [2002]. For regularization and model selection, the proposed procedure starts with a fixed maximum number of knots and then uses a full Bayesian Lasso, which combines characteristics of shrinkage and variable selection, to obtain the most significant knots to recover the unknown smooth function. The number of knots is chosen based on an approximation of the predictive distribution over a grid of knot values. The original formulation of the Bayesian Lasso is based on a hierarchical representation of the Laplace distribution, as a scale mixture of normals based on the exponential (Park and Casella [2008]) and, more recently, as a mixture of a uniform with an exponential (Mallick and Yi [2014]).

∗ The authors, (LA) ENCE, (RD) IMECC-UNICAMP and (HSM) IM-UFRJ, contributed equally to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript.
Corresponding author. E-mail addresses: [email protected]
In order to set the notation to be used later, this section presents a brief summary of Bayesian regression models and variational inference techniques, and also establishes the framework for our proposal to select knots in regression spline models, to be developed in Section 4. We summarize the conjugate Bayesian analysis of a linear model. In addition, we present an introduction to variational inference and the hierarchical representation of the Laplace distribution. For more details see Drugowitsch [2019], Denison et al. [1998], Berry et al. [2002], Goepp et al. [2018] and Lang and Brezger [2004].
Following the notation of Migon et al. [2015], let the linear model be $y \mid \beta, \phi \sim N(X\beta, \phi^{-1} I_n)$, where $y$ is an $n$-vector of observed quantities, $X$ is a known $n \times p$ matrix, $\beta$ is a $p$-vector of parameters and $\phi$ is the precision associated with each one of the independent observations. The conjugate prior, a Normal-Gamma, is defined as $\beta \mid \phi \sim N(m_0, \phi^{-1} C_0^{-1})$ and $\phi \sim Ga(a_0, b_0)$, where $m_0$ and $(\phi C_0)^{-1}$ are, respectively, the prior mean and covariance matrix and $a_0, b_0$ are the parameters of the precision prior distribution. The posterior distribution is $\beta \mid \phi, y, X \sim N(m_1, \phi^{-1} C_1^{-1})$ and $\phi \mid y, X \sim Ga(a_1, b_1)$, with
$$m_1 = C_1^{-1}(C_0 m_0 + X^T y), \qquad C_1 = C_0 + X^T X,$$
$$a_1 = a_0 + \tfrac{n}{2}, \qquad b_1 = b_0 + \tfrac{1}{2}\big[(y - X m_1)^T y + (m_0 - m_1)^T C_0 m_0\big].$$
A very useful extension of the above regression model is the Bayesian hierarchical regression model, which will be extensively used later. It was proposed in the seminal paper of Lindley and Smith [1972] and a dynamic version was introduced in Gamerman and Migon [1993].
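As an illustration of the conjugate update above, a minimal numerical sketch is given below; the function and variable names are ours (not the paper's), and the prior values in the toy example are arbitrary.

```python
import numpy as np

def normal_gamma_posterior(y, X, m0, C0, a0, b0):
    """Conjugate Normal-Gamma update for y | beta, phi ~ N(X beta, phi^{-1} I_n)."""
    n = len(y)
    C1 = C0 + X.T @ X                              # posterior precision factor of beta (times phi)
    m1 = np.linalg.solve(C1, C0 @ m0 + X.T @ y)    # posterior mean of beta
    a1 = a0 + n / 2.0
    b1 = b0 + 0.5 * ((y - X @ m1) @ y + (m0 - m1) @ (C0 @ m0))
    return m1, C1, a1, b1

# toy example
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
m1, C1, a1, b1 = normal_gamma_posterior(y, X, np.zeros(p), np.eye(p), 0.01, 0.01)
print(m1, a1 / b1)   # posterior mean of beta and posterior mean of the precision phi
```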
It is well known that Bayesian inference regarding unknown quantities is entirely based on their probabilistic description. Therefore, variational inference (VI), a method to deal with the approximation of probability densities, is very useful for Bayesian inference. In fact, these techniques can be traced back to the field of machine learning (Jordan et al. [1999]). Loosely speaking, they basically exchange sampling, as in MCMC procedures, for optimization. By choosing a flexible family of approximate densities, an attempt is made to find a member of this family which minimizes some optimality criterion, for example the Kullback-Leibler divergence (KL). Variational inference is useful for quickly comparing alternative models and also for dealing with large data sets. Blei et al. [2017] pointed out that the accuracy of variational inference has not yet been thoroughly studied and many open questions are still to be answered.

The basic ideas about variational inference can be easily followed in Blei et al. [2017] and in Ormerod and Wand [2010]; many examples are presented in the book by Bishop [2006]. Let $y$ be a vector of $n$ independent identically distributed observations and $z$ a vector including the latent variables as well as the parameters. The log marginal data distribution, also known as the evidence, is denoted by $\log p(y)$. Evidence integrals are often unavailable in closed form, may require exponential time to be evaluated, and make inference for such models difficult. To avoid calculating the evidence integral, one seeks a lower bound, known as the ELBO (Evidence Lower Bound) and denoted here by $\mathcal{L}(q)$. It is easy to verify that
$$\log p(y) = \mathcal{L}(q) + KL(q \| p),$$
where $\mathcal{L}(q) = \int q(z) \log \frac{p(y, z)}{q(z)}\, dz$ and $KL(q \| p) = -\int q(z) \log \frac{p(z \mid y)}{q(z)}\, dz$, since $p(y) = \frac{p(y, z)/q(z)}{p(z \mid y)/q(z)}$. It is clear that $\max_q \mathcal{L}(q)$ is equivalent to $\min_q KL(q \| p)$, and also that $KL(q \| p) \ge 0$ with equality if and only if $p(z \mid y) = q(z)$. In general it is difficult to obtain this posterior distribution; therefore, the approach is to choose a family of tractable densities. Let us assume the factorization
$$q(z) = \prod_{l=1}^{m} q_l(z_l),$$
where the $z_l$ form a partition of $z$ into $m$ disjoint groups. It is worth pointing out that there is no restriction on the functional forms of the variational densities $q_l(z_l)$.

The central idea is to maximize each factor (block of $z$'s) of $q(z)$ in turn. We keep $q_l$, $l \neq h$, fixed and maximize $\mathcal{L}(q)$. Note that
$$\mathcal{L}(q) = \int \prod_{l=1}^{m} q_l(z_l)\Big[\log p(y, z) - \sum_{l} \log q_l(z_l)\Big] dz = \int q_h(z_h) \Big[\int \log p(y, z) \prod_{l \neq h} q_l(z_l)\, dz_l\Big] dz_h - \int q_h(z_h) \log q_h(z_h)\, dz_h + \text{const} = \int q_h(z_h) \log \tilde{p}(y, z_h)\, dz_h - \int q_h(z_h) \log q_h(z_h)\, dz_h + \text{const}, \qquad (1)$$
where $\log \tilde{p}(y, z_h) = E_{l \neq h}[\log p(y, z)] + \text{const}$. The bound $\mathcal{L}(q)$ will be presented for the specific case of the Lasso in Subsection 3.3; note that it depends on the variational parameters. It is worth emphasizing that the problem of approximating the posterior distribution of the parameters of interest was replaced by a maximization problem. The algorithm to solve this optimization problem was presented by Bishop [2006] and is known as CAVI (coordinate ascent variational inference).
The CAVI optimizes one factor of the mean field variational density at a time. Since (1) equals $-KL(\cdot\|\cdot)$ up to a constant, maximizing it is equivalent to minimizing the KL divergence. Therefore, the optimal solution is
$$\log q_h^{*}(z_h) = E_{l \neq h}[\log p(y, z)] + \text{const}.$$
As one can see, $q_h^{*}(z_h)$ depends on the full conditional distributions, as usually denoted in the MCMC literature (Casella and George [1992]). Therefore, there is a natural link with Gibbs sampling, but the proposed approach leads to tractable solutions involving only local operations.

It is well known that the original Lasso formulation (Tibshirani [1996]) is related to the Laplace distribution, which can be represented as a hierarchical mixture of distributions. This representation is relevant for hierarchical modeling, which in turn is important for the implementation of Gibbs sampling. One of these representations is a scale mixture of a normal with an exponential distribution (West [1987]) and the other is a mixture of a uniform with a gamma distribution (Mallick and Yi [2014]). Specifically, following Andrews [1974], it is easy to verify that the hierarchical representation
$$\beta \mid \tau \sim N(0, \tau) \quad \text{and} \quad \tau \mid \lambda \sim \text{Exp}\Big(\frac{\lambda^2}{2}\Big)$$
leads, by marginalizing over $\tau$, to the standard Laplace distribution, whose density is
$$\frac{\lambda}{2} \exp(-\lambda |\beta|) = \int_0^{\infty} \Big[\frac{\tau^{-1/2}}{\sqrt{2\pi}} \exp\Big(-\frac{\beta^2}{2\tau}\Big)\Big] \Big[\frac{\lambda^2}{2} \exp\Big(-\frac{\lambda^2 \tau}{2}\Big)\Big] d\tau. \qquad (2)$$
The above hierarchical representation of the Laplace distribution is important to introduce the full Bayesian Lasso. The penalty term in the classical Lasso can be interpreted as independent Laplace prior distributions over the regression parameters; moreover, the posterior mode can be seen as the Lasso estimate.

Following the hierarchical representation of the Laplace distribution in Subsection 2.3, Park and Casella [2008] give a Bayesian formulation of the Lasso regression model. The hierarchical model is defined as
$$y \mid X, \beta, \phi \sim N(X\beta, \phi^{-1} I_n), \qquad \beta \mid \phi, \tau \sim N(0, \phi^{-1} D_\tau), \qquad \tau_j \mid \lambda \sim \text{Exp}(\lambda), \; j = 1, \dots, p,$$
where $D_\tau = \text{diag}(\tau_1, \dots, \tau_p)$ and the $\tau_j \mid \lambda$ are conditionally independent for all $j$. The model can be completed with hyperpriors $\phi \sim Ga(a_0, b_0)$ and $\lambda \sim Ga(g_0, h_0)$. In Subsection 3.1 we propose independent Jeffreys priors for $\phi$ and $\lambda$ to automate the Lasso, which amounts to letting $a_0$, $b_0$, $g_0$ and $h_0$ tend to zero.

Let $\theta = (\beta, \phi, \tau, \lambda)$ be the vector of the parameters and the latent variables of the model. The posterior distribution is proportional to the model distribution times the prior distribution of the latent components and parameters:
$$p(\theta \mid y, X) \propto p(y \mid X, \beta, \phi)\, p(\beta \mid \phi, \tau)\, p(\tau \mid \lambda)\, p(\phi)\, p(\lambda).$$
This joint posterior is, in general, intractable. An almost obvious numerical approach, since the breakthrough paper of Gelfand and Smith [1990], is to use stochastic simulation.
In order to develop an automatic Bayesian Lasso procedure it is worth introducing noninformative priors for the hyperparameters involved. Following Fonseca et al. [2019] and exploring the conditional independence structure of the Lasso model, the Fisher information decomposition for the Lasso follows as
$$I_y(\lambda) = I_\tau(\lambda) - E_y[I_{\beta, \tau}(\lambda \mid y)], \qquad (3)$$
where $I_{\beta,\tau}(\lambda \mid y)$ is the information obtained from the full conditional distribution $p(\beta, \tau \mid y, \lambda)$. We also use the conditional independence described by the graph representing the Bayesian Lasso model. We develop, in turn, each component of expression (3). The quantity $I_\tau(\lambda)$ is based on the independent marginal distributions of the $\tau_j$, leading directly to $I_\tau(\lambda) = p/\lambda^2$. In order to obtain $I_{\beta,\tau}(\lambda \mid y)$, we take advantage of the known full conditional distribution of $(\beta, \tau \mid \lambda, y)$ (see (4)). Since $(\beta \mid \tau, \lambda, y)$ does not depend on $\lambda$, it follows that $E_y[I_{\beta,\tau}(\lambda \mid y)]$ is also proportional to $\lambda^{-2}$. Substituting in (3), $I_y(\lambda) \propto \lambda^{-2}$ and so the prior for $\lambda$ is $p(\lambda) \propto \lambda^{-1}$. This result is similar to the one reported in Fonseca et al. [2019] using the uniform-gamma mixture. It is well known that the Jeffreys prior for $\phi$ is proportional to $\phi^{-1}$.

Considering the model and the prior distributions already specified, the posterior distribution has an unknown form. Therefore, we can use MCMC to obtain a sample from the posterior distribution through the complete conditional distributions (Gibbs sampler). The complete conditionals are as follows:
$$(\beta \mid y, \theta_{-\beta}) \sim N\big((X^T X + D_\tau^{-1})^{-1} X^T y,\; \phi^{-1}(X^T X + D_\tau^{-1})^{-1}\big),$$
$$(\tau_j \mid y, \theta_{-\tau_j}) \sim GIG\big(\tfrac{1}{2},\; 2\lambda,\; \phi\beta_j^2\big),$$
$$(\phi \mid y, \theta_{-\phi}) \sim Ga\Big(\frac{n + p}{2} + a_0,\; b_0 + \frac{1}{2}\big[(y - X\beta)^T(y - X\beta) + \beta^T D_\tau^{-1}\beta\big]\Big),$$
$$(\lambda \mid y, \theta_{-\lambda}) \sim Ga\Big(g_0 + p,\; h_0 + \sum_{j=1}^{p} \tau_j\Big), \qquad (4)$$
where $\theta_{-\cdot}$ stands for the entire vector $\theta$ without the indicated parameter, and GIG denotes the generalized inverse Gaussian distribution (see the appendix).
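A compact sketch of this Gibbs sampler, following the full conditionals in (4), is given below. It is our own illustrative implementation rather than the authors' code: the function name and the vague default hyperparameters are assumptions, and the GIG draws use scipy's generalized inverse Gaussian generator with the standard reparametrization.

```python
import numpy as np
from scipy import stats

def gibbs_bayes_lasso(y, X, n_iter=15000, a0=1e-2, b0=1e-2, g0=1e-2, h0=1e-2, seed=0):
    """Gibbs sampler for the Bayesian Lasso using the full conditionals in (4)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, phi, lam, tau = np.zeros(p), 1.0, 1.0, np.ones(p)
    draws = {"beta": [], "phi": [], "lambda": []}
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        # beta | rest ~ N((X'X + D_tau^{-1})^{-1} X'y, phi^{-1}(X'X + D_tau^{-1})^{-1})
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / tau))
        beta = rng.multivariate_normal(A_inv @ Xty, A_inv / phi)
        # tau_j | rest ~ GIG(1/2, 2*lambda, phi*beta_j^2)
        chi = np.maximum(phi * beta**2, 1e-10)              # guard against exact zeros
        tau = stats.geninvgauss.rvs(0.5, np.sqrt(2.0 * lam * chi),
                                    scale=np.sqrt(chi / (2.0 * lam)), random_state=rng)
        # phi | rest ~ Ga((n+p)/2 + a0, b0 + 0.5[(y - Xb)'(y - Xb) + b' D_tau^{-1} b])
        resid = y - X @ beta
        rate = b0 + 0.5 * (resid @ resid + np.sum(beta**2 / tau))
        phi = rng.gamma((n + p) / 2.0 + a0, 1.0 / rate)
        # lambda | rest ~ Ga(g0 + p, h0 + sum_j tau_j)
        lam = rng.gamma(g0 + p, 1.0 / (h0 + tau.sum()))
        draws["beta"].append(beta.copy()); draws["phi"].append(phi); draws["lambda"].append(lam)
    return {k: np.array(v) for k, v in draws.items()}
```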
3.3 The variational approximation applied to Lasso

In order to obtain a scalable inference procedure, we introduce an alternative methodology. To make the notation consistent, the vector including the latent variables and parameters, denoted by $z$ in Subsection 2.2, is represented in this section by the vector $\theta$. Let the independent Jeffreys priors be $p(\phi) \propto \phi^{-1}$ and $p(\lambda) \propto \lambda^{-1}$. The joint distribution of the observations, latent components and parameters can easily be read off Figure 1, which summarizes the model.
Figure 1: Directed acyclic graph of the Bayesian Lasso model.

It is worth recalling the expression of the mean field posterior approximation for the latent components and parameters:
$$\log q(\theta) = \log q(\beta, \phi) + \log q(\tau \mid \lambda) + \log q(\lambda).$$
Following Blei et al. [2017], the optimal $q_l(\theta_l)$ is proportional to the exponential of the expected log of the complete conditional distribution calculated in (4),
$$q_l^{*}(\theta_l) \propto \exp\{E_{-l}[\log p(\theta_l \mid \theta_{-l}, y)]\}, \quad l = 1, 2, 3.$$
In the first step, the variational posterior for $\beta$ and $\phi$ that maximizes the variational bound $\mathcal{L}(q)$, while holding $q(\tau \mid \lambda)$ and $q(\lambda)$ fixed, is given by
$$\log q^{*}(\beta, \phi) = \log p(y \mid \beta, \phi) + E_\tau[\log p(\beta, \phi \mid \tau)] + \text{const} = \log\big[N(\beta \mid m_\beta, \phi^{-1} C_\beta) \times Ga(\phi \mid a_\phi, b_\phi)\big].$$
It is easy to see that this is a normal-gamma distribution with parameters
$$C_\beta^{-1} = E_\tau(D_\tau^{-1}) + X^T X, \quad m_\beta = C_\beta X^T y, \quad a_\phi = a_0 + n/2, \quad b_\phi = b_0 + \tfrac{1}{2}\big(y^T y - m_\beta^T C_\beta^{-1} m_\beta\big).$$
Next, the variational distribution of $\tau$ that maximizes the variational bound $\mathcal{L}(q)$, while holding $q(\lambda)$ fixed, is given by
$$\log q^{*}(\tau_j) = E_\lambda[\log p(\tau_j \mid \lambda)] + E_{\beta,\phi}[\log p(\beta_j, \phi \mid \tau_j)] + \text{const} = \log GIG(\tau_j \mid c_\tau, d_\tau, f_{\tau_j}),$$
with GIG being the generalized inverse Gaussian distribution, where
$$c_\tau = \tfrac{1}{2}, \qquad d_\tau = 2 E_\lambda[\lambda], \qquad f_{\tau_j} = E_{\beta,\phi}[\phi \beta_j^2].$$
Therefore, $\log q^{*}(\tau) = \log \prod_{j=1}^{p} GIG(\tau_j \mid c_\tau, d_\tau, f_{\tau_j})$. Finally, we identify the variational distribution of $\lambda$:
$$\log q^{*}(\lambda) = \log p(\lambda) + E_\tau[\log p(\tau \mid \lambda)] + \text{const} = \log Ga(\lambda \mid g_\lambda, h_\lambda),$$
which is a gamma distribution with parameters $g_\lambda = g_0 + p$ and $h_\lambda = h_0 + \sum_{j=1}^{p} E_\tau(\tau_j)$.

The expected values involved in the definition of the above variational distributions are computed as follows (see the appendix for details and Jørgensen [1982]):
$$E_\tau(\tau_j) = \frac{\sqrt{f_{\tau_j}}\, \kappa_{c_\tau+1}(\sqrt{d_\tau f_{\tau_j}})}{\sqrt{d_\tau}\, \kappa_{c_\tau}(\sqrt{d_\tau f_{\tau_j}})}, \qquad (5)$$
$$Var_\tau(\tau_j) = \frac{f_{\tau_j}}{d_\tau}\left[\frac{\kappa_{c_\tau+2}(\sqrt{d_\tau f_{\tau_j}})}{\kappa_{c_\tau}(\sqrt{d_\tau f_{\tau_j}})} - \left(\frac{\kappa_{c_\tau+1}(\sqrt{d_\tau f_{\tau_j}})}{\kappa_{c_\tau}(\sqrt{d_\tau f_{\tau_j}})}\right)^2\right], \qquad (6)$$
$$E_\tau[D_\tau^{-1}] = \text{diag}\big(E_\tau(\tau_1^{-1}), \dots, E_\tau(\tau_p^{-1})\big), \quad E_\tau(\tau_j^{-1}) = \frac{\sqrt{d_\tau}\, \kappa_{c_\tau+1}(\sqrt{d_\tau f_{\tau_j}})}{\sqrt{f_{\tau_j}}\, \kappa_{c_\tau}(\sqrt{d_\tau f_{\tau_j}})} - \frac{2 c_\tau}{f_{\tau_j}},$$
$$E_{\beta,\phi}[\phi \beta_j^2] = m_{\beta_j}^2\, a_\phi / b_\phi + (C_\beta)_{jj}, \qquad E_\lambda(\lambda) = g_\lambda / h_\lambda,$$
where $\kappa_p(\cdot)$ is the modified Bessel function of the second kind.

The evidence lower bound (ELBO) for this model consists of
$$\mathcal{L}(q) = E_{\beta,\phi}[\log p(y \mid X, \beta, \phi)] + E_{\beta,\phi,\tau}[\log p(\beta, \phi \mid \tau)] + E_{\tau,\lambda}[\log p(\tau \mid \lambda)] + E_\lambda[\log p(\lambda)] - E_{\beta,\phi}[\log q(\beta, \phi)] - E_{\tau,\lambda}[\log q(\tau \mid \lambda)] - E_\lambda[\log q(\lambda)].$$
Each of the above terms is evaluated as a function of the variational parameters, as follows:
$$E_{\beta,\phi}[\log p(y \mid X, \beta, \phi)] = \frac{n}{2}\big(\psi(a_\phi) - \log b_\phi - \log 2\pi\big) - \frac{1}{2}\Big[\frac{a_\phi}{b_\phi}(y - X m_\beta)^T(y - X m_\beta) + tr(X^T X C_\beta)\Big],$$
$$E_{\beta,\phi,\tau}[\log p(\beta, \phi \mid \tau)] = \frac{p}{2}\big(\psi(a_\phi) - \log b_\phi - \log 2\pi\big) + (a_0 - 1)\big(\psi(a_\phi) - \log b_\phi\big) - b_0 \frac{a_\phi}{b_\phi} - \frac{1}{2}\sum_{j=1}^{p} E_\tau(\log \tau_j) - \frac{1}{2}\sum_{j=1}^{p}\Big[m_{\beta_j}^2 \frac{a_\phi}{b_\phi} + (C_\beta)_{jj}\Big] E_\tau\Big(\frac{1}{\tau_j}\Big),$$
$$E_{\tau,\lambda}[\log p(\tau \mid \lambda)] = p\big(\psi(g_\lambda) - \log h_\lambda\big) - \frac{g_\lambda}{h_\lambda}\sum_{j=1}^{p} E_\tau(\tau_j),$$
$$E_\lambda[\log p(\lambda)] = g_0 \log h_0 - \log \Gamma(g_0) + (g_0 - 1)\big(\psi(g_\lambda) - \log h_\lambda\big) - h_0 \frac{g_\lambda}{h_\lambda},$$
$$E_{\beta,\phi}[\log q(\beta, \phi)] = \frac{p}{2}\big(\psi(a_\phi) - \log b_\phi - \log 2\pi\big) - \frac{1}{2}\log |C_\beta| + a_\phi \log b_\phi - \log \Gamma(a_\phi) + (a_\phi - 1)\big(\psi(a_\phi) - \log b_\phi\big) - a_\phi,$$
$$E_{\tau,\lambda}[\log q(\tau \mid \lambda)] = \sum_{j=1}^{p}\Big[\frac{c_\tau}{2}\log\frac{d_\tau}{f_{\tau_j}} - \log 2 - \log \kappa_{c_\tau}(\sqrt{d_\tau f_{\tau_j}}) + (c_\tau - 1) E_\tau(\log \tau_j) - \frac{1}{2}\Big(d_\tau E_\tau(\tau_j) + f_{\tau_j} E_\tau\Big(\frac{1}{\tau_j}\Big)\Big)\Big],$$
$$E_\lambda[\log q(\lambda)] = g_\lambda \log h_\lambda - \log \Gamma(g_\lambda) + (g_\lambda - 1)\big(\psi(g_\lambda) - \log h_\lambda\big) - g_\lambda.$$
A second-order Taylor expansion of $\log \tau_j$ around $E(\tau_j)$ is used to approximate its expected value,
$$E(\log \tau_j) \approx \log E(\tau_j) - \frac{Var(\tau_j)}{2 E(\tau_j)^2},$$
where the mean and the variance of $\tau_j$ are given in equations (5) and (6). Note that the variational bound depends on the quantities $m_\beta$, $C_\beta$, $b_\phi$, $d_\tau$, $f_{\tau_j}$ and $h_\lambda$. The algorithm updates these quantities in each iteration; the ELBO is maximized and $\mathcal{L}(q)$ reaches a plateau when those quantities stabilize. The algorithm consists of the following steps.
Algorithm 1 Variational Inference
Step 1. Initialize the variational hyperparameters: $m_\beta$, $C_\beta$, $a_\phi$, $b_\phi$, $g_\lambda$, $h_\lambda$, $c_\tau$, $d_\tau$, $f_{\tau_j}$.
Step 2. while the ELBO does not reach convergence do
            for $l = 1, 2, 3$ do
                compute $q_l^{*}(\theta_l) \propto \exp\{E_{-l}[\log p(\theta_l \mid \theta_{-l}, y)]\}$
Step 3.         update the variational hyperparameters based on the expected values
            end for
            calculate the ELBO
        end while
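A self-contained Python sketch of Algorithm 1 is given below. It follows the updates derived above; the ELBO is computed only up to additive constants and is used solely to monitor convergence. The function name and the default hyperparameter values (small, mimicking the independent Jeffreys priors) are our own illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.special import kve, gammaln, digamma

def vb_lasso(y, X, a0=1e-2, b0=1e-2, g0=1e-2, h0=1e-2, max_iter=500, tol=1e-8):
    """CAVI for the Bayesian Lasso (Algorithm 1); returns variational parameters and ELBO."""
    n, p = X.shape
    XtX, Xty, yty = X.T @ X, X.T @ y, y @ y
    E_inv_tau, E_lam = np.ones(p), 1.0            # initial guesses for E[1/tau_j], E[lambda]
    c_tau, elbo_old = 0.5, -np.inf
    for _ in range(max_iter):
        # q(beta, phi): Normal-Gamma
        C_inv = XtX + np.diag(E_inv_tau)
        C_beta = np.linalg.inv(C_inv)
        m_beta = C_beta @ Xty
        a_phi = a0 + n / 2.0
        b_phi = b0 + 0.5 * (yty - m_beta @ C_inv @ m_beta)
        E_phi = a_phi / b_phi
        # q(tau_j): GIG(c_tau, d_tau, f_j)
        d_tau = 2.0 * E_lam
        f_tau = m_beta**2 * E_phi + np.diag(C_beta)         # E[phi beta_j^2]
        s = np.sqrt(d_tau * f_tau)
        r1 = kve(c_tau + 1.0, s) / kve(c_tau, s)            # scaled Bessel ratios for stability
        r2 = kve(c_tau + 2.0, s) / kve(c_tau, s)
        E_tau = np.sqrt(f_tau / d_tau) * r1                 # equation (5)
        E_inv_tau = np.sqrt(d_tau / f_tau) * r1 - 2.0 * c_tau / f_tau
        V_tau = (f_tau / d_tau) * (r2 - r1**2)              # equation (6)
        E_log_tau = np.log(E_tau) - V_tau / (2.0 * E_tau**2)  # second-order Taylor approximation
        # q(lambda): Gamma
        g_lam, h_lam = g0 + p, h0 + E_tau.sum()
        E_lam = g_lam / h_lam
        # ELBO, up to additive constants
        Elogphi = digamma(a_phi) - np.log(b_phi)
        Eloglam = digamma(g_lam) - np.log(h_lam)
        lik = 0.5 * n * Elogphi - 0.5 * (E_phi * np.sum((y - X @ m_beta)**2)
                                         + np.trace(XtX @ C_beta))
        lp_bphi = (0.5 * p * Elogphi + (a0 - 1.0) * Elogphi - b0 * E_phi
                   - 0.5 * E_log_tau.sum() - 0.5 * np.sum(f_tau * E_inv_tau))
        lp_tau = p * Eloglam - E_lam * E_tau.sum()
        lp_lam = (g0 - 1.0) * Eloglam - h0 * E_lam
        lq_bphi = (0.5 * p * Elogphi - 0.5 * np.linalg.slogdet(C_beta)[1]
                   + a_phi * np.log(b_phi) - gammaln(a_phi) + (a_phi - 1.0) * Elogphi - a_phi)
        lq_tau = np.sum(0.5 * c_tau * np.log(d_tau / f_tau) - (np.log(kve(c_tau, s)) - s)
                        + (c_tau - 1.0) * E_log_tau - 0.5 * (d_tau * E_tau + f_tau * E_inv_tau))
        lq_lam = g_lam * np.log(h_lam) - gammaln(g_lam) + (g_lam - 1.0) * Eloglam - g_lam
        elbo = lik + lp_bphi + lp_tau + lp_lam - lq_bphi - lq_tau - lq_lam
        if abs(elbo - elbo_old) < tol * (1.0 + abs(elbo)):
            break
        elbo_old = elbo
    return {"m_beta": m_beta, "C_beta": C_beta, "a_phi": a_phi, "b_phi": b_phi,
            "g_lam": g_lam, "h_lam": h_lam, "elbo": elbo}
```

In each pass the three blocks $q(\beta, \phi)$, $q(\tau)$ and $q(\lambda)$ are updated in turn, mirroring Steps 2 and 3, and the constant-free bound is used to decide when to stop.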
Once the algorithm has converged, the variational approximation can be used to construct the predictive distribution. Let $y_o$ and $y_p$ be the observed and the predicted vectors, respectively, and let $q(\beta, \phi)$ be the corresponding variational component. Then, after some algebraic calculations, we obtain a Student's t-distribution (St):
$$p(y_p \mid y_o, X_p) = \int\!\!\int p(y_p \mid \beta, \phi)\, p(\beta, \phi \mid y_o)\, d\beta\, d\phi \approx \int\!\!\int p(y_p \mid \beta, \phi)\, q(\beta, \phi)\, d\beta\, d\phi = St\Big(y_p \,\Big|\, X_p^T m_\beta,\; (1 + X_p^T C_\beta X_p)\frac{b_\phi}{a_\phi},\; 2 a_\phi\Big),$$
where $q(\beta, \phi)$ is the variational approximation of the posterior distribution, a normal-gamma distribution.
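For a single covariate vector $x_p$, the approximate predictive density can be evaluated directly with scipy. This is a small illustrative helper (ours, not the paper's), assuming the Student-t parametrization by location, scale and degrees of freedom used above and the output of the VB sketch.

```python
import numpy as np
from scipy import stats

def predictive_density(x_p, m_beta, C_beta, a_phi, b_phi):
    """Variational predictive p(y_p | y_o, x_p): Student-t with 2*a_phi degrees of freedom."""
    loc = float(x_p @ m_beta)
    scale2 = (1.0 + float(x_p @ C_beta @ x_p)) * b_phi / a_phi
    return stats.t(df=2.0 * a_phi, loc=loc, scale=np.sqrt(scale2))

# example (assumes a vb_lasso fit `fit` and a new design point x_new):
# dist = predictive_density(x_new, fit["m_beta"], fit["C_beta"], fit["a_phi"], fit["b_phi"])
# dist.interval(0.95)   # 95% predictive interval
```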
We will discuss, from a Bayesian point of view, three alternative procedures for selecting knots (variables) in penalized regression splines (linear regression). In this work we propose a new decision criterion based on the Bayes factor, fully described below.

In general, the selection of predictors/knots in a penalized regression/penalized regression spline model can be seen as a decision problem. Consider the general case where it is necessary to decide between the models $M_0: \theta \in \Theta_0$ and $M_1: \theta \in \Theta_1$ based on some observations $D$. An optimal decision is based on the posterior probabilities $p(M_0 \mid D)$ and $p(M_1 \mid D)$ and also on the costs of the wrong decisions. Denote by $a$ the cost associated with choosing model $M_1$ when, in fact, the true model is $M_0$, and let $b$ be the cost of choosing model $M_0$ when the true model is $M_1$. Therefore, if $b\, p(M_1 \mid D) > a\, p(M_0 \mid D)$, then $M_1$ should be chosen as the most plausible model for $\theta$. By Bayes' theorem, the posterior odds are given by the product of the prior odds and the Bayes factor, $BF(M_1, M_0) = p(D \mid M_1)/p(D \mid M_0)$, where $p(D \mid M_i) = \int_{\Theta_i} p(D \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta$, $i = 0, 1$. Hereafter we assume that the prior odds are equal to one.

In particular, we consider two alternatives: $M_0: \beta_j = 0$ and $M_1: \beta_j = \delta$, where $\delta \neq 0$ is a constant to be defined. Under $M_0$ let us assume that $\beta_j \mid D \sim N(0, s_j^2)$, and under $M_1$ we have $\beta_j \mid D \sim N(\delta, s_j^2)$, where $s_j^2 = var(\beta_j \mid D)$. Hence, working with the standardized coefficient, it is straightforward to obtain $\log BF(M_1, M_0) = \log\big(\exp\{-(\beta_j - \delta)^2/2\}/\exp\{-\beta_j^2/2\}\big) = \beta_j \delta - \delta^2/2$. Assuming that at least moderate evidence (a Bayes factor equal to 3) is required, and that $\beta_j$ is considered significantly distant from $M_0$ if and only if its standardized estimate is greater than or equal to the third quartile of the standard normal distribution, $\beta_{0.75} = 0.6745$, a quadratic equation must be solved, whose positive root is $\delta \approx 2.3$. Thus, our proposal to select knots in spline regression is described in the following algorithm.

Algorithm 2 Bayes Factor decision criterion
Step 1. Take $\hat{\beta}_j^{\star} = m_j / s_j$, a standardized point estimate of $\beta_j$, where $m_j = E[\beta_j \mid D]$ and $s_j^2 = var(\beta_j \mid D)$.
Step 2. Compute the Bayes factor at $\hat{\beta}_j^{\star}$: $BF(M_1, M_0) = \exp\{-(\hat{\beta}_j^{\star} - \delta)^2/2\} \big/ \exp\{-(\hat{\beta}_j^{\star})^2/2\}$, with $\delta \approx 2.3$.
Step 3. Compute $\pi^{\star} = BF(M_1, M_0) / (1 + BF(M_1, M_0))$.
Step 4. Decide based on the Bayes factor evidence. Choose two positive numbers $a$ and $b$ (here $a = 1$ and $b = 3$, corresponding to a Bayes factor threshold equal to 3).
    if $\pi^{\star} < a/(a+b)$ then $M_1$ is rejected and the $j$-th predictor is excluded from the model
    else the $j$-th predictor is not excluded from the model
    end if
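A direct transcription of this criterion, applied to the variational marginals of the coefficients, might look as follows. This is an illustrative sketch: the plug-in posterior standard deviation, the value of $\delta$ and the threshold follow the reconstruction above, and the function name is ours.

```python
import numpy as np

def bf_select(m_beta, C_beta, a_phi, b_phi, delta=2.3, a=1.0, b=3.0):
    """Bayes-factor criterion (Algorithm 2): boolean mask of coefficients to keep."""
    s = np.sqrt(np.diag(C_beta) * b_phi / a_phi)     # plug-in posterior sd of beta_j
    z = m_beta / s                                   # standardized point estimates
    log_bf10 = z * delta - 0.5 * delta**2            # log BF(M1, M0) at the estimate
    pi_star = 1.0 / (1.0 + np.exp(-log_bf10))        # BF10 / (1 + BF10)
    return pi_star >= a / (a + b)                    # keep the j-th coefficient if True

# usage (assumes the vb_lasso fit from the earlier sketch):
# keep = bf_select(fit["m_beta"], fit["C_beta"], fit["a_phi"], fit["b_phi"])
```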
One alternative procedure, due to Li and Lin [2010], is based on a credible interval: if the credible interval of a given coefficient contains zero, the explanatory variable associated with it should be removed from the model. A second criterion, named scaled neighborhood (Li and Lin [2010]), evaluates the posterior probability of $[-s_j, s_j]$, where $s_j^2 = var(\beta_j \mid y)$, and excludes a predictor if this probability exceeds a certain threshold; Li and Lin [2010], for instance, suggest $1/2$ as this limit.

We start with the standard setup for nonparametric regression models. Suppose we have a collection of observations $(y_i, x_i)$, $i = 1, \dots, n$, such that
$$y_i = f(x_i) + \epsilon_i, \qquad (7)$$
where $f(x_i) = E[y_i \mid x_i]$ are the values of an unknown smooth function $f$ that takes values on the interval $[a, b] \subset \mathbb{R}$, and the $\epsilon_i$ form a sequence of uncorrelated random variables with mean zero and unknown precision $\phi$. A possible approach to estimate $f$ is to assume that the regression curve can be well approximated by a spline function; see details in Dias [1999]. That is, given a sequence of knots $\kappa = (\kappa_1, \dots, \kappa_K)$ with $\kappa_1 < \kappa_2 < \dots < \kappa_{K-1} < \kappa_K$, a spline regression model can be written as
$$f(x, \beta) = \beta_0 + \sum_{j=1}^{p} \beta_j x^j + \sum_{k=1}^{K} \beta_{p+k}(x - \kappa_k)_+^p, \qquad (8)$$
where $p$ is the degree of the polynomial spline and $\beta$ is the vector of coefficients, of dimension $K + p + 1$. The functions $(u)_+^p$ are the well-known truncated power basis, $(u)_+^p = \max(0, u)^p$. Note that, for a fixed $K$, the vector of knots $\kappa$ and the set of basis functions $\{1, x, x^2, \dots, x^p, (x - \kappa_1)_+^p, \dots, (x - \kappa_K)_+^p\}$, an estimate of $f$, say $\hat{f}$, can be obtained by estimating the vector $\beta$, so that
$$\hat{f} = f(x, \hat{\beta}) = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x^j + \sum_{k=1}^{K} \hat{\beta}_{p+k}(x - \kappa_k)_+^p.$$
It is well known (Dias [1998], Dias and Gamerman [2002], Dias and Garcia [2007], Kooperberg and Stone [1991]) that when $K$ increases the bias gets smaller, causing over-fitting, while the variance increases substantially. On the other hand, if $K$ goes to zero the variance may be drastically reduced, causing under-fitting, while the bias increases considerably. Thus, $K$ acts as the smoothing parameter in the regression spline fit, balancing the trade-off between over-fitting and under-fitting. A good procedure should not only provide an ideal number of knots (or basis functions) but also quantify the uncertainty of adding or removing them. Figure 2 shows the effect of different values of $K$ ($K = 3$, $6$ and $60$) on a spline regression fit.
Figure 2: Large values of K cause over-fitting.

There are other basis functions that can represent a regression function, such as B-splines, wavelets, radial basis functions, etc. For all of them it is still necessary to balance under-fitting and over-fitting; even in the case of smoothing splines, the regularization parameter needs to be obtained. Specifically, in this work we deal with the following optimization problem: find $\hat{\beta}(\lambda)$, the minimizer of
$$\sum_{i=1}^{n}\big(y_i - f(x_i, \beta)\big)^2 + \lambda \sum_{k=1}^{K} |\beta_{p+k}|, \qquad (9)$$
where $\lambda$ is the smoothing parameter. For large values of $\lambda$ the solution of this optimization problem tends to the polynomial regression fit, that is, maximal smoothing. Note that the penalty term involves only the coefficients associated with the knot sequence $\kappa_1 < \kappa_2 < \dots < \kappa_K$. Consequently, selecting knots is equivalent to selecting the coefficients that contribute most to the fit. Under the Bayesian point of view, this work presents a novel and scalable procedure for selecting knots.

Following the idea of the Lasso procedure for variable selection, under the Bayesian point of view, the selection of knots in spline regression models can be made by assuming an independent Laplace prior distribution for the coefficients associated with the knots. We write $\beta = (\beta^{(1)T}, \beta^{(2)T})^T$, where $\beta^{(1)} = (\beta_0, \beta_1, \dots, \beta_p)^T$ contains the polynomial coefficients, with dimension $p + 1$, and $\beta^{(2)} = (\beta_{p+1}, \dots, \beta_{p+K})^T$, of dimension $K$, contains the penalized coefficients. The hierarchical structure presented in Subsection 2.3 is maintained, and in this way we complete the model defined by equations (7) and (8):
$$\beta^{(1)} \sim N(m_0, C_0), \quad \beta^{(2)} \mid \phi, \tau \sim N(0, \phi^{-1} D_\tau), \quad \tau_j \mid \lambda \sim \text{Exp}(\lambda),\; j = 1, \dots, K, \quad \phi \sim Ga(a_0, b_0), \quad \lambda \sim Ga(g_0, h_0). \qquad (10)$$
The Bayesian inference procedure in this case must be carried out with caution, since the vector of coefficients contains the coefficients of the polynomial, which are not penalized, and the coefficients of the basis functions, to which we assign the Lasso prior for knot selection. The design matrix is then partitioned as $X = (X_1, X_2)$, with $X_1$ of dimension $n \times (p+1)$ and $X_2$ of dimension $n \times K$. The $i$-th row of the matrix $X$ is given by
$$X_i = \{\underbrace{1, x_i, x_i^2, \dots, x_i^p}_{X_{1i}},\; \underbrace{(x_i - \kappa_1)_+^p, \dots, (x_i - \kappa_K)_+^p}_{X_{2i}}\},$$
and $X\beta = X_1\beta^{(1)} + X_2\beta^{(2)}$. In this context, the prior distribution is placed on the vector $\theta = (\beta^{(1)}, \beta^{(2)}, \phi, \tau, \lambda)$ and the posterior distribution is given by
$$p(\theta \mid y) \propto p(y \mid X, \beta, \phi)\, p(\beta^{(1)})\, p(\beta^{(2)} \mid \phi, \tau)\, p(\tau \mid \lambda)\, p(\phi)\, p(\lambda).$$
Slight adaptations need to be made in the variational inference method; the main one refers to the fact that we now have four partitions of the parameter vector, giving rise to the following variational densities:
$$\log q(\theta) = \log q(\beta^{(2)}, \phi) + \log q(\tau) + \log q(\lambda) + \log q(\beta^{(1)}),$$
where $\log q(\beta^{(1)}) = \log N(\beta^{(1)} \mid m_{\beta^{(1)}}, C_{\beta^{(1)}})$. The calculations for this and the other variational densities are similar to those developed for the regression model in Subsection 3.3 and can be found in the appendix.

An alternative approach to determine the maximum number of knots $K$ would be to treat it as an unknown parameter and estimate it. In this work, however, $K$ is not a parameter of direct interest, since the selection of knots is carried out through the Lasso scheme.
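To make the partitioned design matrix concrete, the following sketch builds $X = (X_1, X_2)$ from the truncated power basis in (8). The knot placement at quantiles mirrors the choice used later in the simulations; the helper name and defaults are ours.

```python
import numpy as np

def spline_design(x, K, degree=3, knots=None):
    """Truncated power basis design matrix X = [X1 | X2] for model (8)."""
    x = np.asarray(x, dtype=float)
    if knots is None:
        # interior knots at equally spaced quantiles of x
        knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
    X1 = np.vander(x, degree + 1, increasing=True)                # 1, x, ..., x^p
    X2 = np.maximum(x[:, None] - knots[None, :], 0.0) ** degree   # (x - kappa_k)_+^p
    return np.hstack([X1, X2]), knots

# example: cubic spline basis with K = 10 interior knots
# X, knots = spline_design(x, K=10, degree=3)
```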
Even so, in order to obtain a good fit it is important to properly define the number of knots and their positions. Section 5 presents some exercises involving knot selection and shows the importance of a correct specification of the value of $K$. Thus, an algorithm is proposed for the automatic choice of the number of knots $K$, based both on the VB algorithm for estimating the Lasso model and on the criterion for selecting variables (knots).

The algorithm starts by proposing a grid of possible values of $K$, the maximum number of knots. Naturally, the choice of grid is an issue to be discussed. Note that it is not necessary to propose a grid that covers all the natural numbers, since the computational time could be excessively high; moreover, given a maximum number of knots, the most significant knots will be selected. For instance, suppose the grid starts with a maximum number of knots $K = 20$; by using the Lasso and the selection criterion, it is possible that 6 among these 20 knots are identified as the most significant ones. On the other hand, simulations show that starting with a very large maximum number of knots may cause problems in the selection, since the penalty acts more severely when the number of knots is extremely large for the size of the data set (see the numerical simulations in Section 5). Therefore, our proposal is to use a grid that increases 10 units at a time, with the knots placed at the quantiles of, or evenly spaced over, the explanatory variable domain. This spacing allows us to position knots in different locations before and after selecting the significant knots. In the simulated exercises presented in Section 5, the grid starts with $K = 10$ knots.

In summary, the ELBO is taken as the stopping criterion and the objective is to maximize it. The algorithm proceeds as follows: for the fixed grid of values, start with the lowest value; fit the model via VB together with the selection criterion and compute the ELBO; move to the next grid value and repeat the procedure. As long as the ELBO increases along the grid, the algorithm continues. The detailed algorithm is given below.

Algorithm 3
Maximum Number of Knots
Step 1. $j \leftarrow 1$. Initialize $K_j = 10$.
Step 2. Fit the model via the VB algorithm.
Step 3. Compute the ELBO.
Step 4. Apply the BF criterion to select the most significant knots.
Step 5. Set $K_{j+1} = K_j + 10$ and repeat steps 2, 3 and 4.
    if ELBO($K_{j+1}$) $\ge$ ELBO($K_j$) then set $K_j \leftarrow K_{j+1}$ and repeat steps 2 to 5
    end if
    if ELBO($K_{j+1}$) $<$ ELBO($K_j$) then stop and deliver the ELBO and the most significant knots
    end if
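Combining the pieces sketched so far (the design-matrix builder, the VB fit and the BF criterion), Algorithm 3 reduces to a short loop over the grid of K values. The sketch below reuses those earlier illustrative helpers; note that reusing the plain vb_lasso routine also penalizes the polynomial block, unlike the partitioned model of this section, which we gloss over here by forcing the polynomial coefficients to be kept at the selection step.

```python
import numpy as np

def choose_max_knots(x, y, degree=3, k_start=10, k_step=10, k_max=60):
    """Algorithm 3: increase K by 10 while the ELBO keeps improving."""
    best, K = None, k_start
    while K <= k_max:
        X, knots = spline_design(x, K, degree)             # sketch after model (10)
        fit = vb_lasso(y, X)                               # VB sketch of Algorithm 1
        keep = bf_select(fit["m_beta"], fit["C_beta"],
                         fit["a_phi"], fit["b_phi"])       # Algorithm 2
        keep[: degree + 1] = True                          # never drop the polynomial part
        cand = {"K": K, "elbo": fit["elbo"], "knots": knots[keep[degree + 1:]]}
        if best is not None and cand["elbo"] < best["elbo"]:
            break                                          # ELBO decreased: stop
        best = cand
        K += k_step
    return best

# best = choose_max_knots(x, y)
# best["K"], best["knots"]    # chosen maximum K and the selected significant knots
```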
This section proposes five exercises with artificial data: the first three are based on the Lasso for linear regression models and the last two focus on the use of the Lasso to select knots in spline regression. The inference procedure assumes, for all exercises, vague gamma prior distributions for $\phi$ and $\lambda$ with small hyperparameter values, and a diffuse normal prior for $\beta^{(1)}$ in the case of regression splines. For the MCMC, 15,000 iterations were necessary to achieve convergence; the first 5,000 iterations were discarded as burn-in and one draw was kept for every ten to remove autocorrelation, ending with a sample of size 1,000. These quantities were obtained using the criterion of Lewis and Raftery [1997], which provides the number of iterations needed to guarantee convergence of the Gibbs sampler. The VB algorithm is repeated until the changes in $m_\beta$, $C_\beta$, $b_\phi$, $d_\tau$, $f_{\tau_j}$ and $h_\lambda$ between two consecutive iterations are negligible. When applied, the classical procedure was implemented using the glmnet package of the R software, which in turn applies 5-fold cross-validation to estimate the penalty parameter $\lambda$.

The goal of the exercises applied to the linear regression models is twofold: first, to compare the estimation methods VB and MCMC (eventually also using the classical Lasso in the comparison); second, to compare the CI, SN and BF selection criteria. Moreover, different sparsity scenarios, variations in the sample size, different correlations between the explanatory variables and different values of the precision of the model are considered. Specifically, exercise 1 aims to estimate the Lasso hyperparameters via VB and MCMC; only one data set is simulated, for which the true values of all parameters and hyperparameters of the model are known. VB presents results similar to MCMC with a computational time 14 times shorter. Exercise 2 is based on a simulation study with 100 replicates and a less sparse structure; variations in the sample size and in the correlation among the explanatory variables are considered. Again, VB and MCMC present similar results, superior to the classical Lasso. When the CI, SN and BF selection criteria are compared, BF gives the best results, with high exclusion proportions for coefficients that are zero and low exclusion proportions for coefficients that are different from zero. Exercise 3 considers scenarios with 100 replicates and greater sparsity than exercise 2, including cases where $n < p$ and different values for the model precision. The results are similar to those obtained in exercise 2.
The purpose of exercise 1 is to compare the MCMC and VB methods with respect to curve fitting and computational time. For this study we considered $n = 100$, $p = 10$, and each column of the matrix $X$ was generated from a $N(0, I_n)$ distribution. The precision $\phi$ and the penalty parameter $\lambda = 5$ were fixed, with $\tau_j \mid \lambda \sim \text{Exp}(\lambda)$ for all $j$; the regression coefficients and observations were then generated from the Lasso regression model.

Table 1 presents a posterior summary of the model parameters: the mean and standard deviation of the approximate posterior obtained with VB, the posterior mean and standard deviation via MCMC, and the true values of the parameters. Note that the point estimates obtained by VB are close to those obtained by MCMC and, for both methods, the results are close to the true values with small standard deviations. The same conclusion can be drawn from Figure 3, which exhibits a graphical comparison between MCMC and VB: the histograms represent samples from the posterior distribution obtained via MCMC, the red curves the approximate posterior densities obtained by VB, and the green dots the true values of the parameters. The curves approximated by VB are close to the histograms and both are centered on the actual values. The remaining parameters $\tau_j$ show similar results.

Table 1: Posterior summary of the model parameters (true value, posterior mean and standard deviation under VB, and posterior mean and standard deviation under MCMC).

Since MCMC and VB present similar results, it is worth pointing out the main difference between these estimation methods, which is computational time. For exercise 1, the computational time of VB was 0.72 seconds while that of MCMC was 10.15 seconds. In the following exercises these computational times become even more discrepant, as we will be dealing with simulations with replicates.
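A data-generating sketch in the spirit of exercise 1 is shown below; the specific constants (precision, seed) are illustrative, since the exact value of $\phi$ used in the paper is not recoverable from the source.

```python
import numpy as np

def simulate_lasso_data(n=100, p=10, lam=5.0, phi=0.5, seed=1):
    """Draws (y, X, beta) from the hierarchical Lasso regression model."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))                     # each column ~ N(0, I_n)
    tau = rng.exponential(scale=1.0 / lam, size=p)  # tau_j | lambda ~ Exp(lambda)
    beta = rng.normal(scale=np.sqrt(tau / phi))     # beta | phi, tau ~ N(0, phi^{-1} D_tau)
    y = X @ beta + rng.normal(scale=np.sqrt(1.0 / phi), size=n)
    return y, X, beta
```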
In this exercise a simulation study was developed based on 100 replicates, with $p = 8$ and $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$. The design matrix is generated from a multivariate normal distribution with zero mean, variance 1 and two different correlation structures between $x_i$ and $x_j$: $0$ and $0.5^{|i-j|}$, for all $i$ and $j$. A fixed precision $\phi$ is used, and three nested scenarios vary the sample size with $\{n_T, n_V\} = \{20, 10\}$, $\{100, 50\}$ and $\{200, 100\}$, where $n_T$ and $n_V$ denote the sizes of the training and validation sets, respectively. Therefore, we have a total of six different scenarios. Note that the explanatory variables are standardized to have mean 0 and variance 1. Table 2 summarizes the scenarios of exercise 2.

Table 2: Simulation 2 with 100 replicates, $p = 8$ explanatory variables and coefficient vector $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$.

Simulation | $n_T$ | $n_V$ | $cov(X_i, X_j)$
S2.1 | 20  | 10  | 0
S2.2 | 100 | 50  | 0
S2.3 | 200 | 100 | 0
S2.4 | 20  | 10  | $0.5^{|i-j|}$
S2.5 | 100 | 50  | $0.5^{|i-j|}$
S2.6 | 200 | 100 | $0.5^{|i-j|}$

The comparison of the MCMC and VB methods is our main objective in this simulation; however, the frequentist Lasso is also considered, through the glmnet package of the R software, with 5-fold cross-validation used to select the parameter $\lambda$. In addition, the different variable selection criteria described above are compared: credible interval (CI), scaled neighborhood (SN) and Bayes factor (BF).

Figure 3: Comparison of MCMC (histogram) versus VB (solid line). The dot marks the actual value of the parameter used to generate the data.

In order to compare the predictive power of the Lasso under the different estimation techniques (MCMC, VB and frequentist Lasso), the mean absolute error (MAE) was calculated for each replicate of the validation set using the following expression:
$$MAE = \frac{1}{n_V}\sum_{i=1}^{n_V} |y_i^P - y_i^V|, \qquad (11)$$
where the $y_i^P$ are the predicted values in the validation set, obtained from the fitted model after the selection of the coefficients, the $y_i^V$ are the observed values in the validation set and $n_V$ is the size of the validation set. Note that MCMC generates a sample from the predictive distribution at each iteration of the method; then $y_i^P$ is obtained from
$$p(y_i^P \mid y) = \frac{1}{AM}\sum_{j=1}^{AM} p(y_i^P \mid \theta^{(j)}),$$
where $AM$ is the number of MCMC iterations and $\theta$ is the vector of coefficients. Figure 4 shows the box-plots of the mean absolute errors for each of the six proposed scenarios. As the sample size increases, we observe a smaller difference between the three estimation methods. When the sample is small, similar results are obtained for MCMC and VB; these have the smallest median MAE and the lowest dispersion when compared to the frequentist Lasso. Next, we detail the performance of the selection criteria for each $\beta_j$.
Figure 4: Mean absolute error (MAE), computed using (11), for the 6 scenarios of simulation 2. Estimation methods: MCMC, VB and frequentist Lasso.

Table 3 shows the frequency with which the predictor $x_j$, $j = 1, \dots, 8$, was excluded in the 100 replicates, considering the three variable selection methods and all six scenarios of simulation 2. We present the proportions only for VB because, so far, its results are similar to those of MCMC. Note that for this simulation exercise the BF presents the best results in all scenarios, with a greater proportion of exclusion when the actual values of $\beta_j$ are zero and a small proportion when the $\beta_j$'s are different from zero. In addition, as the sample size increases, the three criteria tend to correctly identify the coefficients that are zero and the coefficients that are different from zero. From exercises 1 and 2, one may notice that the approximations of VB are as good as the results obtained by MCMC; nevertheless, the gain in computational time provided by VB is far superior. In addition, we saw that BF is a variable selection criterion that presents superior results when compared with CI and SN. In the following subsection we show the performance of the VB estimation method and the BF selection criterion in a more complex numerical experiment with greater sparsity.

Table 3: Comparison of the three variable selection criteria under VB for the six scenarios (frequency of exclusion of the predictor $x_j$, $j = 1, \dots, 8$), with $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$.

Simulation | Method  | $\beta_1$ | $\beta_2$ | $\beta_3$ | $\beta_4$ | $\beta_5$ | $\beta_6$ | $\beta_7$ | $\beta_8$
S2.1 | VB + CI | 0.01 | 0.09 | 0.64 | 0.57 | 0.13 | 0.66 | 0.65 | 0.62
S2.1 | VB + SN | 0.02 | 0.15 | 0.73 | 0.71 | 0.20 | 0.77 | 0.75 | 0.71
S2.1 | VB + BF | 0.00 | 0.09 | 0.88 | 0.70 | 0.13 | 0.82 | 0.75 | 0.92
S2.2 | VB + CI | 0.00 | 0.00 | 0.47 | 0.51 | 0.00 | 0.57 | 0.67 | 0.56
S2.2 | VB + SN | 0.00 | 0.00 | 0.70 | 0.62 | 0.00 | 0.71 | 0.78 | 0.72
S2.2 | VB + BF | 0.00 | 0.00 | 0.73 | 0.78 | 0.00 | 0.73 | 0.84 | 0.72
S2.3 | VB + CI | 0.00 | 0.00 | 0.50 | 0.47 | 0.00 | 0.67 | 0.60 | 0.58
S2.3 | VB + SN | 0.00 | 0.00 | 0.71 | 0.62 | 0.00 | 0.71 | 0.73 | 0.76
S2.3 | VB + BF | 0.00 | 0.00 | 0.72 | 0.73 | 0.00 | 0.77 | 0.80 | 0.77
S2.4 | VB + CI | 0.02 | 0.09 | 0.53 | 0.53 | 0.19 | 0.71 | 0.60 | 0.64
S2.4 | VB + SN | 0.06 | 0.11 | 0.70 | 0.62 | 0.24 | 0.75 | 0.73 | 0.74
S2.4 | VB + BF | 0.02 | 0.09 | 0.65 | 0.80 | 0.18 | 0.85 | 0.81 | 0.88
S2.5 | VB + CI | 0.00 | 0.00 | 0.57 | 0.49 | 0.00 | 0.60 | 0.51 | 0.57
S2.5 | VB + SN | 0.00 | 0.00 | 0.73 | 0.68 | 0.00 | 0.75 | 0.70 | 0.75
S2.5 | VB + BF | 0.00 | 0.00 | 0.78 | 0.77 | 0.00 | 0.76 | 0.80 | 0.82
S2.6 | VB + CI | 0.00 | 0.00 | 0.55 | 0.47 | 0.00 | 0.60 | 0.46 | 0.57
S2.6 | VB + SN | 0.00 | 0.00 | 0.70 | 0.66 | 0.00 | 0.77 | 0.64 | 0.75
S2.6 | VB + BF | 0.00 | 0.00 | 0.77 | 0.75 | 0.00 | 0.79 | 0.72 | 0.77
In this exercise we consider a situation with greater sparsity, given by $p = 40$ and a coefficient vector $\beta$ formed by four blocks of dimension 10, two blocks with all entries equal to 0 and two blocks with all entries equal to 3. The design matrix $X$ is generated from a multivariate normal distribution with mean zero, variance 1 and correlation between the columns $x_i$ and $x_j$ equal to 0.5 for all $i \neq j$. We analyze four different scenarios by varying the sample size and the precision parameter $\phi$: the sample sizes are $\{n_T, n_V\} = \{20, 10\}$ and $\{200, 100\}$, where $n_T$ and $n_V$ are the sizes of the training and validation sets, respectively, and two values of the precision $\phi$ are considered. For each scenario we consider 100 replicates. Table 4 summarizes all the scenarios of this simulation exercise. It is worth mentioning that in scenarios S3.1 and S3.3 we have $n < p$.

Table 4: Scenarios in simulation 3 (the two sample-size settings are combined with two values of the precision $\phi$).

Simulation | $n_T$ | $n_V$
S3.1 | 20  | 10
S3.2 | 200 | 100
S3.3 | 20  | 10
S3.4 | 200 | 100

Similarly to exercise 2, the MAE was calculated for each replicate as a predictive measure. Figure 5 shows the box-plots for each scenario for MCMC, VB and Lasso. One may see that MCMC and VB present similar results, superior to the Lasso, when the sample size is small; as the sample size increases, the results of the three approaches become similar.
Figure 5: Mean absolute error (MAE), computed using (11), for the 4 scenarios in exercise 3, comparing the estimation methods MCMC, VB and Lasso.

Figure 6 shows the proportions of exclusions (gray bars) and selections (black bars) for each of the 40 coefficients over the 100 replicates, comparing the estimation methods VB and Lasso; MCMC is omitted because its results are similar to those of VB. In the Bayesian case the selection criterion used in all scenarios was the BF. It is expected that the black bars will be large when the true coefficients are different from zero and that the gray bars will be large when the true coefficients are equal to zero. The proportions of errors are represented by the black bars when the coefficients are zero (type I error) and by the gray bars when the coefficients are different from zero (type II error). It can be seen that for $n < p$ neither VB nor Lasso has a good selection and exclusion performance, with a slight advantage for VB. On the other hand, as the sample size increases, VB presents good results, better than those of the Lasso. Also note that when $n > p$ both VB and Lasso have essentially the same type II error; however, for all coefficients, the type I error is considerably smaller for VB than for the Lasso.
Since VB presents results similar to those of MCMC, but with considerably less computational time, only VB is used in the two exercises applied to the spline regression models. The goal is to define the maximum number of knots from a grid of values and, in turn, to select the most significant knots and their positions. Exercises 4 and 5 consist of simulation studies of the penalized spline regression model defined in equations (7), (8) and (10); what differs between the two studies is the number of bumps in the smooth function $f$. Exercise 4 considers an example with one bump, while exercise 5 analyzes two bumps.

In both exercises we have 100 replicates with $n = 100$, a fixed error variance, and $x_i$ taking equally spaced values in the interval $[0, 1]$. In addition, cubic splines ($p = 3$) are used, the interior knots of the truncated power basis are positioned at the quantiles of the variable $x_i$, and the maximum number of knots varies over the grid $K = 10, 20, 30, 40, 50$. The ELBO is used to indicate the maximum number of knots and, for both exercises, it points to $K = 30$ knots. The BF and CI selection criteria indicate that around 8 of these 30 initial knots are most frequently selected as the most significant ones.

In exercise 4 we use the smooth function $f$ called "bump", given by $f(x) = x + 2\exp\{-(16(x - 0.5))^2\}$. Figures 7, 8 and 9 show some results for $K = 10$, $30$ and $50$ knots, respectively. The first row of plots shows the data generated for one of the replicates (dots), the true curve (solid line) and the average of the fits over the 100 replicates for the three selection criteria (dashed lines); the second row shows the proportion of excluded knots, also for the three selection criteria CI, SN and BF. Observe that the positions of the knots most often selected as significant are on the rise and fall of the bump. It should also be noted that, as $K$ increases, the three selection criteria tend to be more rigorous in the penalty, hence excluding more knots. This occurs most severely for the SN criterion, which presents a worse average fit when $K = 50$. The results of the CI and BF criteria are similar in all cases. The results for maximum numbers of knots equal to 20 and 40 have been omitted, as they are similar to those for $K = 10$ and $K = 50$, respectively.

Figure 10 exhibits the 100 fitted models, one for each replicate, considering $K = 10$, $30$ and $50$ and the BF selection criterion. One can see that the variability increases as the value of $K$ increases; the same occurs with the other selection criteria. Figure 11 shows the frequency of the number of knots selected in the 100 replicates for each selection criterion (CI, SN and BF) and each maximum number of knots ($K = 10$, $30$ and $50$). Note that the CI and BF criteria, when $K = 10$, most frequently select between 5 and 6 knots; when $K = 30$, 7 knots are selected most frequently. When the maximum number of knots is $K = 50$, the frequency of the number of selected knots shows a bimodal behavior, and one can see that the penalty is more severe as $K$ grows. The ELBO, as a model comparison measure, can be used to propose the maximum number of knots; it is worth recalling that the larger the ELBO, the more that model is preferable.
Table 5, which presents the average ELBO for each value of $K$, shows that the model with $K = 30$ knots, which most frequently selects 7 of these knots as significant, is preferable.

Table 5: Average ELBO.

Criterion | $K = 10$ | $K = 20$ | $K = 30$ | $K = 40$ | $K = 50$
BF | -32.24 | -16.10 | -2.55  | -19.59  | -22.23
CI | -32.24 | -16.37 | -2.55  | -20.91  | -46.14
SN | -37.15 | -39.14 | -69.16 | -244.80 | -390.80
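For completeness, a short sketch of how data for this exercise can be generated is given below. The bump function is written in the reconstructed form used above (its exact constants are partially garbled in the source, so they should be read as assumptions), and the noise level is illustrative.

```python
import numpy as np

def bump(x):
    """One-bump test function used in exercise 4 (reconstructed form)."""
    return x + 2.0 * np.exp(-(16.0 * (x - 0.5))**2)

def simulate_bump_data(n=100, sd=0.3, seed=7):
    x = np.linspace(0.0, 1.0, n)                 # equally spaced design points in [0, 1]
    y = bump(x) + np.random.default_rng(seed).normal(scale=sd, size=n)
    return x, y

# x, y = simulate_bump_data()
# best = choose_max_knots(x, y)                  # Algorithm 3 sketch from Section 4
```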
Figure 6: Proportion of selected (black) and excluded (gray) coefficients for the 4 scenarios in exercise 3, with the estimation methods VB (left column) and Lasso (right column).

Figure 7: Proportion of excluded knots and the average of the fits (dashed line), K = 10.

Figure 8: Proportion of excluded knots and the average of the fits (dashed line), K = 30.

Figure 9: Proportion of excluded knots and the average of the fits (dashed line), K = 50.

Figure 10: Fits for each of the 100 replicates, for different numbers of knots, according to the BF selection criterion.
Figure 11: Frequency of the number of selected knots in the 100 replicates.

Exercise 5: Double structure / two bumps
Similarly to exercise 4, exercise 5 considers a curve with two bumps. The curve $f$ is a mixture of two normal densities, each with variance 0.01 and centered at two different points of $[0, 1]$, where $N(x \mid a, b)$ denotes a normal density with mean $a$ and variance $b$. In this exercise we assume $n = 300$, and the other conditions of exercise 4 are kept: 100 replicates, the same error precision $\phi$, $x_i$ equally spaced in $[0, 1]$, $p = 3$, knots placed at the quantiles of $x_i$ and maximum number of knots $K = 10, 20, 30, 40, 50$. We omit the graphics for $K = 20$ and $K = 40$, as they are similar.

The results obtained for the case of one bump are similar to those obtained here. Once again, Figures 12, 13 and 14 show the average of the fits over the 100 replicates for the three selection criteria (dashed lines, first row of plots) and the proportion of excluded knots (second row of plots) for different values of $K$. In the three figures one may notice that the knots most often selected as significant are positioned on the ups and downs of the bumps. For the case of two bumps, the higher the value of $K$, the more severe the exclusion of knots. The SN criterion does not show good results as the maximum number of knots increases, while the performances of the CI and BF criteria are similar in all cases.

Figure 12: Proportion of excluded knots and the average of the fits (dashed line), K = 10.

Figure 15 shows the fits of the 100 replicates according to the BF criterion for different numbers of knots.

Figure 13: Proportion of excluded knots and the average of the fits (dashed line), K = 30.

It can be seen that there is less variability than in the case of one bump, possibly due to the increase of the sample size to $n = 300$. Nonetheless, it is possible to see that, when the maximum number of knots increases, the variability between the fitted models also increases. Our analysis shows that the SN criterion does not provide a good fit for $K = 50$, and in Figure 16 we see that this criterion tends to underestimate the number of significant knots as $K$ increases. We analyze in more detail the frequency of the number of knots selected in the 100 replicates using the CI and BF criteria; both criteria present similar results. In Figure 16 we observe that, when the maximum number of knots is 10, both the CI and the BF criteria most frequently indicate that 7 out of the 10 knots are significant; when $K = 30$ or $K = 50$, the CI and BF criteria most frequently indicate 8 knots as significant.

In order to define the maximum number of knots, the average ELBO was calculated for each value of $K$ on the fixed grid. For the BF criterion the average ELBO is -74.36 when $K = 10$; this value increases (-68.67 when $K = 20$) until it reaches its maximum of -67.15 when $K = 30$, and the average ELBO for $K = 40$ is -70.44. Thus, for the BF criterion, $K = 30$ is taken as the maximum number of knots; the same happens with the CI criterion. This result coincides with the one obtained in exercise 4.

Figure 14: Proportion of excluded knots and the average of the fitted models, K = 50.

Figure 15: Fits of the 100 replicates for different numbers of knots.
Figure 16: Frequency of selected knots in the 100 replicates.
Applications to real data
In this section, two applications with real data are analyzed. The first is a data set commonly used in the nonparametric regression literature, and the second addresses a current issue related to the Covid-19 world pandemic. In both cases, the penalized spline regression model with polynomials of degrees 2 and 3 is fitted to the data, and the maximum number of knots varies among 10, 20 and 30. It is noteworthy that in the first example the knots are positioned equally spaced. The Lasso, through variational inference (VB), is used in both applications together with the Bayes factor (BF) criterion for the selection of the most significant knots, and the ELBO is the measure used for comparing models.
The first data set contains the income and age of 205 Canadians (Ullah and Zinde-Walsh [1985]). These data have been widely used in applications of nonparametric regression models; see, for example, Ruppert [2002]. A logarithmic transformation was applied to income, as can be seen in the data represented by black dots in the two plots of Figure 17. Table 6 shows the ELBO computed for each fitted model, varying the degree of the polynomial ($p$) and the maximum number of knots ($K$). For both $p = 2$ and $p = 3$, the ELBO achieves its maximum value at $K = 10$; comparing these two quantities, the largest ELBO occurs for $p = 3$ and $K = 10$. The plot on the left of Figure 17 shows the fit of the penalized spline regression model (solid black line) and its credible interval (gray shaded area). Note that the credible interval covers a large part of the observed points. At the bottom of the plot, the symbol "x" marks the 10 positioned knots and, in black, the only knot considered significant among the 10, according to the BF criterion. For comparison, the plot on the right of Figure 17 shows the fit obtained with the R function smooth.spline (solid red line) and the fit of the proposed model (solid black line); in this application, the results of the two fitted models are similar.

Table 6: ELBO for the age and income data.

      | $K = 10$ | $K = 20$ | $K = 30$
$p = 2$ | -221.49 | -225.45 | -237.22
$p = 3$ | -220.77 | -227.03 | -223.93

In order to observe the trend of the daily cases of Covid-19 in the USA and in Brazil, the penalized spline regression model was fitted to the data on the logarithmic scale. The United States data range from March 1, 2020 to November 30, 2020, while the data from Brazil range from March 10, 2020 to November 30, 2020. Figure 18 shows the number of daily cases of Covid-19 in the USA (left) and in Brazil (right) on the original scale; the black dots in Figures 19 and 20 show the same data sets on the logarithmic scale. In this example we consider models with splines of degrees 2 and 3, and the maximum number of knots varies in steps of 10. To find the maximum number of knots we use the ELBO, starting the grid with $K = 10$ knots; after obtaining the optimal value of $K$, the BF criterion is applied to select the most significant knots and their positions. The inference procedure was performed from the Bayesian point of view through variational inference, and the prior distributions remain the same as in the studies with artificial data.
Figure 17: Left: the fit of the spline regression model (black solid line) with $p = 3$ and one significant knot among $K = 10$ knots, for the log-income and age data (black dots), with the credible interval (shaded area). Right: the fit of the proposed model (solid black line) and a fit using the R function smooth.spline (solid red line) for the log-income and age data (dots).

Figure 18: Daily cases of Covid-19 in the US (left) and in Brazil (right).

Table 7 exhibits the ELBO for different values of $p$ and $K$ for the data from the USA and from Brazil, with the largest values for $p = 2$ and $p = 3$ marked in bold. In the North American case, the ELBO is maximum when $p = 3$ and $K = 20$; it is worth mentioning that, among the 20 knots, 9 were significant according to the BF criterion. Considering the Brazilian data, the largest ELBO occurs when $p = 2$ and $K = 10$, and only 6 of these 10 knots are significant according to the BF. The following results are presented only for the models with the largest ELBO.

Table 7: ELBO for the US and Brazil Covid-19 data, for polynomial degrees $p = 2, 3$ and maximum numbers of knots $K = 10, 20, 30$.

The plot on the left of Figure 19 shows the fit (solid green line) of the penalized spline regression model with a polynomial of degree 3 and 9 significant knots (out of a total of 20) for the US Covid-19 data. The shaded area represents the credible interval and contains most of the observed data (black dots). Note that the significant knots (black asterisks) are located so as to cover the bumps of the curve; the "x" symbols in red are the excluded knots, which were not considered in the fit. The plot on the right compares the fit of the proposed model (solid green line) with that of the R function smooth.spline (solid red line); note that the second fit captures, in addition to the signal, the noise contained in the data.

Figure 19: Left: the fit of the penalized spline regression model (solid green line) with $p = 3$ and 9 significant knots (black asterisks) out of $K = 20$ knots (red x), with the credible interval (shaded area), for the logarithm of the number of daily cases of Covid-19 in the USA (black dots). Right: the fit of the proposed model (solid green line) and the fit of a model using the R function smooth.spline (solid red line) for the same data (dots).

Figure 20 shows the fit (solid green line) of the proposed model with $p = 2$ and 6 significant knots (black asterisks at the bottom of the plot) for the Brazilian data; the excluded knots are represented by red "x" symbols. The credible interval (gray shaded area) covers much of the observed data points. The plot on the right compares the fit of the proposed model with the R function smooth.spline (solid red line). As in the case of the U.S. data, we observe that the smooth.spline fit does not smooth the data as much as the proposed model does and follows the random noise of the series more closely.
Figure 20: Left: the fit of the penalized regression spline model (solid green line) with p = 2 and 6 significant knots (black asterisks) out of K = 10 knots (red "x"), with the credible interval (shaded area), for the logarithm of the number of daily cases of Covid-19 in Brazil (black dots). Right: the curve fit of the proposed model (solid green line) and the fit using the R function "smooth.spline" (solid red line) for the logarithm of the number of daily cases of Covid-19 in Brazil (dots).

This article proposes a new scalable procedure for selecting the number of knots in regression splines: a fully automatic Bayesian Lasso through variational inference. Simulation studies have shown the effectiveness of this procedure in modeling different types of data sets. In addition, the numerical exercises show that this approach is much faster than the traditional one based on MCMC-type algorithms. In the real data sets, the procedure was able to capture the existing trend, thus providing a better understanding of the data dynamics.
Acknowledgments.
This paper was partially supported by Fapesp Grants (RD) 2018/04654, (RD and HSM) 2019/10800-0, (RD) 2019/00787-7.

Appendix 1: The variational distributions for the Lasso

The variational posterior for $\beta$ and $\phi$, while holding $q(\tau\mid\lambda)$ and $q(\lambda)$ fixed, is given by
\[
\begin{aligned}
\log q^{*}(\beta,\phi) &= \log p(y\mid\beta,\phi) + E_{\tau}[\log p(\beta,\phi\mid\tau)] + \text{const}\\
&= \log\left[(2\pi)^{-n/2}\,|\phi^{-1}I_n|^{-1/2}\exp\left\{-\tfrac{1}{2}(y-X\beta)^{T}(\phi I_n)(y-X\beta)\right\}\right]\\
&\quad + E_{\tau}\left\{\log\left[(2\pi)^{-p/2}\,|\phi^{-1}D_{\tau}|^{-1/2}\exp\left\{-\tfrac{\phi}{2}\,\beta^{T}D_{\tau}^{-1}\beta\right\}\right]\right\}
   + E_{\tau}\left\{\log\left[\phi^{a-1}\exp\{-\phi b\}\right]\right\} + \text{const}\\
&= \left(\tfrac{n}{2}+\tfrac{p}{2}+a-1\right)\log\phi
   - \tfrac{\phi}{2}\left\{\beta^{T}\left[E_{\tau}(D_{\tau}^{-1})+X^{T}X\right]\beta + y^{T}y - 2\,y^{T}X\beta + 2b\right\} + \text{const}\\
&= \log\; N(\beta\mid m_{\beta},\,\phi^{-1}C_{\beta})\times Ga(\phi\mid a_{\phi},b_{\phi}).
\end{aligned}
\]
It is easy to see that this is a normal-gamma distribution with parameters
\[
C_{\beta}^{-1}=E_{\tau}(D_{\tau}^{-1})+X^{T}X,\qquad m_{\beta}=C_{\beta}X^{T}y,\qquad
a_{\phi}=a+\tfrac{n}{2},\qquad b_{\phi}=b+\tfrac{1}{2}\left(y^{T}y-m_{\beta}^{T}C_{\beta}^{-1}m_{\beta}\right).
\]

The variational distribution of $\tau$, while holding $q(\lambda)$ fixed, is given by
\[
\begin{aligned}
\log q^{*}(\tau_{j}) &= E_{\lambda}[\log p(\tau_{j}\mid\lambda)] + E_{\beta,\phi}[\log p(\beta_{j},\phi\mid\tau_{j})] + \text{const}\\
&= E_{\lambda}\left\{\log\left[\exp\{-\lambda\tau_{j}\}\right]\right\}
   + E_{\beta,\phi}\left\{\log\left[(\phi^{-1}\tau_{j})^{-1/2}\exp\left\{-\tfrac{\phi}{2\tau_{j}}\beta_{j}^{2}\right\}\right]\right\} + \text{const}\\
&= -\tfrac{1}{2}\log\tau_{j} - \tfrac{1}{2}\left(2E_{\lambda}[\lambda]\,\tau_{j} + \tfrac{1}{\tau_{j}}E_{\beta,\phi}[\phi\beta_{j}^{2}]\right) + \text{const}
 = \log\; GIG(\tau_{j}\mid c_{\tau}, d_{\tau}, f_{\tau_{j}}),
\end{aligned}
\]
with $GIG$ being the generalized inverse Gaussian distribution, where
\[
c_{\tau}=\tfrac{1}{2},\qquad d_{\tau}=2E_{\lambda}[\lambda],\qquad f_{\tau_{j}}=E_{\beta,\phi}[\phi\beta_{j}^{2}].
\]
Therefore, $\log q^{*}(\tau)=\log\prod_{j=1}^{p} GIG(\tau_{j}\mid c_{\tau}, d_{\tau}, f_{\tau_{j}})$.

The variational distribution of $\lambda$ is
\[
\begin{aligned}
\log q^{*}(\lambda) &= \log p(\lambda) + E_{\tau}[\log p(\tau\mid\lambda)] + \text{const}\\
&= \log\left[\lambda^{g-1}\exp\{-h\lambda\}\right] + E_{\tau}\left[\log\prod_{j=1}^{p}\lambda\exp\{-\tau_{j}\lambda\}\right] + \text{const}\\
&= (g+p-1)\log\lambda - \lambda\left[h+\sum_{j=1}^{p}E_{\tau}(\tau_{j})\right] + \text{const}
 = \log\; Ga(\lambda\mid g_{\lambda}, h_{\lambda}),
\end{aligned}
\]
which is a gamma distribution with parameters
\[
g_{\lambda}=g+p,\qquad h_{\lambda}=h+\sum_{j=1}^{p}E_{\tau}(\tau_{j}).
\]

The expected values can be computed as follows. It is worth pointing out that if $X\sim GIG(p,a,b)$, then its density is
\[
f(x\mid p,a,b)=\frac{(a/b)^{p/2}}{2\,\kappa_{p}(\sqrt{ab})}\,x^{p-1}\exp\{-(ax+b/x)/2\},\qquad x>0,
\]
where $\kappa_{p}(\cdot)$ is a modified Bessel function of the second kind, with
\[
E[X]=\sqrt{\tfrac{b}{a}}\,\frac{\kappa_{p+1}(\sqrt{ab})}{\kappa_{p}(\sqrt{ab})},\qquad
Var[X]=\tfrac{b}{a}\left[\frac{\kappa_{p+2}(\sqrt{ab})}{\kappa_{p}(\sqrt{ab})}-\left(\frac{\kappa_{p+1}(\sqrt{ab})}{\kappa_{p}(\sqrt{ab})}\right)^{2}\right],\qquad
E[X^{-1}]=\sqrt{\tfrac{a}{b}}\,\frac{\kappa_{p+1}(\sqrt{ab})}{\kappa_{p}(\sqrt{ab})}-\frac{2p}{b}.
\]
Therefore,
\[
E_{\tau}[D_{\tau}^{-1}]=\mathrm{diag}\left(E_{\tau}(\tau_{1}^{-1}),\ldots,E_{\tau}(\tau_{p}^{-1})\right),\qquad
E_{\tau}(\tau_{j}^{-1})=\frac{\sqrt{d_{\tau}}\,\kappa_{c_{\tau}+1}\!\left(\sqrt{d_{\tau}f_{\tau_{j}}}\right)}{\sqrt{f_{\tau_{j}}}\,\kappa_{c_{\tau}}\!\left(\sqrt{d_{\tau}f_{\tau_{j}}}\right)}-\frac{2c_{\tau}}{f_{\tau_{j}}},
\]
\[
E_{\tau}(\tau_{j})=\frac{\sqrt{f_{\tau_{j}}}\,\kappa_{c_{\tau}+1}\!\left(\sqrt{d_{\tau}f_{\tau_{j}}}\right)}{\sqrt{d_{\tau}}\,\kappa_{c_{\tau}}\!\left(\sqrt{d_{\tau}f_{\tau_{j}}}\right)},\qquad
E_{\lambda}(\lambda)=\frac{g_{\lambda}}{h_{\lambda}}.
\]
For the calculation of $E_{\beta,\phi}[\phi\beta_{j}^{2}]$, let $x\mid y\sim N(\mu_{x},\,y^{-1}\sigma_{x}^{2})$ and $y\sim Ga(a,b)$, so that
\[
E[X^{2}Y]=E[E(X^{2}Y\mid Y)]=E[Y\,E(X^{2}\mid Y)]=E\!\left[Y\left(E(X\mid Y)^{2}+Var(X\mid Y)\right)\right]=\mu_{x}^{2}E(Y)+\sigma_{x}^{2}=\frac{a\mu_{x}^{2}}{b}+\sigma_{x}^{2}.
\]
Thus,
\[
E_{\beta,\phi}[\phi\beta_{j}^{2}]=m_{\beta_{j}}^{2}\,a_{\phi}/b_{\phi}+(C_{\beta})_{jj}.
\]
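To make the coordinate-ascent scheme implied by these factors concrete, the sketch below iterates the three updates above in Python/NumPy/SciPy, using kv from scipy.special for the Bessel-function ratios. It is a minimal sketch, not the authors' implementation: the gamma hyperparameter values a, b, g, h and the fixed iteration count are assumptions (the paper itself advocates a Jeffreys-type prior for the hyperparameters), and the function name vb_bayes_lasso is hypothetical.

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def vb_bayes_lasso(X, y, a=0.01, b=0.01, g=0.01, h=0.01, n_iter=200):
    """Mean-field CAVI for the Bayesian Lasso factors derived in Appendix 1.

    q(beta, phi) is normal-gamma, q(tau_j) is GIG(1/2, d_tau, f_tau_j),
    q(lambda) is gamma.  Hyperparameter values and the stopping rule
    (a fixed number of sweeps, no ELBO monitoring) are assumptions.
    """
    n, p = X.shape
    XtX, Xty, yty = X.T @ X, X.T @ y, y @ y

    E_inv_tau = np.ones(p)   # E[tau_j^{-1}]
    E_lambda = 1.0           # E[lambda]

    for _ in range(n_iter):
        # q(beta, phi): normal-gamma
        C_inv = np.diag(E_inv_tau) + XtX
        C = np.linalg.inv(C_inv)
        m = C @ Xty
        a_phi = a + n / 2.0
        b_phi = b + 0.5 * (yty - m @ C_inv @ m)

        # q(tau_j): generalized inverse Gaussian with c_tau = 1/2
        f = m**2 * (a_phi / b_phi) + np.diag(C)      # E[phi * beta_j^2]
        d = 2.0 * E_lambda
        s = np.sqrt(d * f)
        ratio = kv(1.5, s) / kv(0.5, s)              # kappa_{c+1} / kappa_c
        E_inv_tau = np.sqrt(d / f) * ratio - 1.0 / f # minus 2*c_tau/f with c_tau = 1/2
        E_tau = np.sqrt(f / d) * ratio

        # q(lambda): gamma
        g_lam = g + p
        h_lam = h + E_tau.sum()
        E_lambda = g_lam / h_lam

    return {"m_beta": m, "C_beta": C, "a_phi": a_phi, "b_phi": b_phi,
            "E_tau": E_tau, "E_lambda": E_lambda}
```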
The variational posterior for $\beta^{(2)}$ and $\phi$, while holding $q(\tau\mid\lambda)$, $q(\lambda)$ and $q(\beta^{(1)})$ fixed, is given by
\[
\begin{aligned}
\log q^{*}(\beta^{(2)},\phi) &= E_{\beta^{(1)}}[\log p(y\mid\beta,\phi)] + E_{\tau}[\log p(\beta^{(2)},\phi\mid\tau)] + \text{const}\\
&= E_{\beta^{(1)}}\left\{\log\left[(2\pi)^{-n/2}\,|\phi^{-1}I_n|^{-1/2}
   \exp\left\{-\tfrac{1}{2}\left(y-X_{1}\beta^{(1)}-X_{2}\beta^{(2)}\right)^{T}(\phi I_n)\left(y-X_{1}\beta^{(1)}-X_{2}\beta^{(2)}\right)\right\}\right]\right\}\\
&\quad + E_{\tau}\left\{\log\left[(2\pi)^{-K/2}\,|\phi^{-1}D_{\tau}|^{-1/2}\exp\left\{-\tfrac{\phi}{2}\,\beta^{(2)T}D_{\tau}^{-1}\beta^{(2)}\right\}\right]\right\}
   + \log\left[\phi^{a-1}\exp\{-\phi b\}\right] + \text{const}\\
&= \left(\tfrac{n}{2}+\tfrac{K}{2}+a-1\right)\log\phi
   - \tfrac{\phi}{2}\left\{\beta^{(2)T}\left[E_{\tau}(D_{\tau}^{-1})+X_{2}^{T}X_{2}\right]\beta^{(2)}
   - 2\,\beta^{(2)T}\left[X_{2}^{T}y - X_{2}^{T}X_{1}E_{\beta^{(1)}}(\beta^{(1)})\right]\right\}\\
&\quad - \phi\left\{b+\tfrac{1}{2}\left[y^{T}y - 2\,E_{\beta^{(1)}}(\beta^{(1)T})X_{1}^{T}y + E_{\beta^{(1)}}(\beta^{(1)T}X_{1}^{T}X_{1}\beta^{(1)})\right]\right\} + \text{const}\\
&= \log\; N(\beta^{(2)}\mid m_{\beta^{(2)}},\,\phi^{-1}C_{\beta^{(2)}})\times Ga(\phi\mid a_{\phi},b_{\phi}).
\end{aligned}
\]
It is easy to see that this is a normal-gamma distribution with parameters
\[
C_{\beta^{(2)}}^{-1}=E_{\tau}(D_{\tau}^{-1})+X_{2}^{T}X_{2},\qquad
m_{\beta^{(2)}}=C_{\beta^{(2)}}\left[X_{2}^{T}y - X_{2}^{T}X_{1}E_{\beta^{(1)}}(\beta^{(1)})\right],\qquad a_{\phi}=a+\tfrac{n}{2},
\]
\[
b_{\phi}=b+\tfrac{1}{2}\left[y^{T}y - 2\,E_{\beta^{(1)}}(\beta^{(1)T})X_{1}^{T}y + E_{\beta^{(1)}}(\beta^{(1)T}X_{1}^{T}X_{1}\beta^{(1)}) - m_{\beta^{(2)}}^{T}C_{\beta^{(2)}}^{-1}m_{\beta^{(2)}}\right].
\]
The variational distributions of $\tau$ (holding $q(\lambda)$ fixed) and of $\lambda$ take exactly the same form as in Appendix 1, now with $\beta^{(2)}$ in place of $\beta$ and its dimension $K$ in place of $p$; likewise, the GIG moments and the expectation $E_{\beta,\phi}[\phi\beta_{j}^{2}]=m_{\beta_{j}}^{2}\,a_{\phi}/b_{\phi}+(C_{\beta^{(2)}})_{jj}$ are computed with the same formulas.
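As a companion to the sketch after Appendix 1, the fragment below shows, in the same Python/NumPy style, how the mean and precision of q(β^(2)) change when the polynomial block is held fixed: the data enter only through the offset X_1 E[β^(1)]. The function name update_beta2 is hypothetical, and the update of b_φ, which additionally needs the second moment of β^(1), is omitted.

```python
import numpy as np

def update_beta2(X1, X2, y, E_beta1, E_inv_tau):
    """One block update of q(beta^(2), phi), holding q(beta^(1)) fixed.

    Minimal sketch of the mean/precision part of the Appendix 2 update:
    only the knot block X2 is penalized, and the polynomial block enters
    through the residual offset X1 @ E[beta^(1)].
    """
    C_inv = np.diag(E_inv_tau) + X2.T @ X2           # C_{beta^(2)}^{-1}
    C = np.linalg.inv(C_inv)
    m = C @ (X2.T @ y - X2.T @ (X1 @ E_beta1))       # m_{beta^(2)}
    return m, C
```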
References
D. F. Andrews. A robust method for multiple linear regression. Technometrics, 16(4):523-531, 1974.

S. M. Berry, R. J. Carroll, and D. Ruppert. Bayesian smoothing and regression splines for measurement error problems. Journal of the American Statistical Association, 97(457):160-169, 2002.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: a review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017. https://doi.org/10.1080/01621459.2017.1285773

G. Casella and E. I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):167-174, 1992.

D. G. T. Denison, B. K. Mallick, and A. F. M. Smith. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society B, 60:363-377, 1998.

R. Dias. Density estimation via hybrid splines. Journal of Statistical Computation and Simulation, 60(4):277-294, 1998.

R. Dias. Sequential adaptive nonparametric regression via H-splines. Communications in Statistics: Computations and Simulations, 28:501-515, 1999.

R. Dias and D. Gamerman. A Bayesian approach to hybrid splines nonparametric regression. Journal of Statistical Computation and Simulation, 72(4):285-297, 2002.

R. Dias and N. L. Garcia. Consistent estimator for basis selection based on a proxy of the Kullback-Leibler distance. Journal of Econometrics, 141(1):167-178, 2007.

J. Drugowitsch. VBLinLogit: Variational Bayesian linear and logistic regression. Journal of Open Source Software, 4(38):1359, 2019. https://doi.org/10.21105/joss.01359

P. H. C. Eilers and B. D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11(2):89-121, 1996. With comments and a rejoinder by the authors.

T. C. O. Fonseca, H. S. Migon, and H. Mirandola. Reference Bayesian analysis for hierarchical models, 2019.

D. Gamerman and H. S. Migon. Dynamic hierarchical models. Journal of the Royal Statistical Society, Series B (Methodological), 55(3):629-642, 1993.

A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398-409, 1990.

V. Goepp, O. Bouaziz, and G. Nuel. Spline regression with automatic knot selection. Working paper or preprint, Aug. 2018. https://hal.archives-ouvertes.fr/hal-01853459

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999. https://doi.org/10.1023/A:1007665907178

B. Jørgensen. Statistical Properties of the Generalized Inverse Gaussian Distribution, volume 9 of Lecture Notes in Statistics. Springer-Verlag, New York-Berlin, 1982. ISBN 0-387-90665-7.

C. Kooperberg and C. J. Stone. A study of logspline density estimation. Computational Statistics and Data Analysis, 12:327-347, 1991.

S. Lang and A. Brezger. Bayesian P-splines. Journal of Computational and Graphical Statistics, 13(1):183-212, 2004. https://doi.org/10.1198/1061860043010

S. M. Lewis and A. E. Raftery. Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical Association, 92(438):648-655, 1997. https://doi.org/10.2307/2965712

Q. Li and N. Lin. The Bayesian elastic net. Bayesian Analysis, 5(1):151-170, 2010. https://doi.org/10.1214/10-BA506

D. V. Lindley and A. F. M. Smith. Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B, 34:1-41, 1972. http://links.jstor.org/sici?sici=0035-9246(1972)34:1<1:BEFTLM>2.0.CO;2-

H. Mallick and N. Yi. A new Bayesian lasso. Statistics and Its Interface, 7(4):571-582, 2014. https://doi.org/10.4310/SII.2014.v7.n4.a12

H. S. Migon, D. Gamerman, and F. Louzada. Statistical Inference: An Integrated Approach. Chapman & Hall/CRC Texts in Statistical Science Series. CRC Press, Boca Raton, FL, second edition, 2015. ISBN 978-1-4398-7880-4.

E. Montoya, N. Ulloa, and V. Miller. A simulation study comparing knot selection methods with equally spaced knots in a penalized regression spline. International Journal of Statistics and Probability, 3(3):96-110, 2014.

J. T. Ormerod and M. P. Wand. Explaining variational approximations. The American Statistician, 64(2):140-153, 2010. https://doi.org/10.1198/tast.2010.09058

M. R. Osborne, B. Presnell, and B. A. Turlach. Knot selection for regression splines via the lasso. Computing Science and Statistics, pages 44-49, 1998.

T. Park and G. Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681-686, June 2008.

D. Ruppert. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics, 11(4):735-757, 2002. https://doi.org/10.1198/106186002853

S. Spiriti, R. Eubank, P. W. Smith, and D. Young. Knot selection for least-squares and penalized splines. Journal of Statistical Computation and Simulation, 83(6):1020-1036, 2013. https://doi.org/10.1080/00949655.2011.647317

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

A. Ullah and V. Zinde-Walsh. Estimation and testing in a regression model with spherically symmetric errors. Economics Letters, 17(1-2):127-132, 1985. https://doi.org/10.1016/0165-1765(85)90142-9

M. West. On scale mixtures of normal distributions. Biometrika, 74:646-648, 1987.