Bayesian Quantile Regression for Partially Linear Additive Models
Yuao Hu, Kaifeng Zhao and Heng Lian
Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
Abstract
In this article, we develop a semiparametric Bayesian estimation and model selection approach for partially linear additive models in conditional quantile regression. The asymmetric Laplace distribution provides a mechanism for Bayesian inference in quantile regression models based on the check loss. The advantage of the new method is that nonlinear, linear and zero function components can be separated automatically and simultaneously during model fitting, without pre-specification or parameter tuning. This is achieved by spike-and-slab priors using two sets of indicator variables. For posterior inference, we design an effective partially collapsed Gibbs sampler. Simulation studies are used to illustrate our algorithm, and the proposed approach is further illustrated by applications to two real data sets.

Keywords: Additive models; Markov chain Monte Carlo; Quantile regression; Variable selection.

1 Introduction

Partially linear additive models (PLAMs) generalize multiple linear regression models. They can also be regarded as a special case of generalized additive regression models (Hastie and Tibshirani 1990). Containing both linear and nonlinear additive components, PLAMs are more flexible than stringent linear models and more parsimonious than general nonparametric regression models, and they circumvent the difficulty known as the "curse of dimensionality". PLAMs have been widely applied in practice because of these advantages. For example, Liang et al. (2008) applied PLAMs to study the relationship between environmental chemical exposures and semen quality, and Panagiotelis and Smith (2008) applied the models to intra-day electricity load analysis.

In this article, we propose a Bayesian quantile regression approach for partially linear additive models. At a given quantile level $\tau \in (0,1)$, the model is

$$y_i = \mu_\tau + \sum_{j=1}^p f_{\tau,j}(x_{ij}) + \epsilon_{\tau,i}, \quad i = 1, \ldots, n, \qquad (1)$$

where $(y_i, x_i)$ are independent and identically distributed pairs. Here $y_i$ is the response, $x_i = (x_{i1}, \ldots, x_{ip})^T$ is the $p$-dimensional predictor, $\mu_\tau$ is the intercept, $\epsilon_{\tau,i}$, $i = 1, \ldots, n$, are random errors with $\tau$th quantile equal to 0, and $f_{\tau,j}$ is a univariate component function which might be nonlinear, linear or zero. Quantile regression (Koenker and Bassett (1978)) has been demonstrated to be valuable by a rapidly expanding literature in economics, social sciences, and biomedical studies (Buchinsky (1994), Abrevaya (2001), Cade and Noon (2003)). It provides more robust analyses and more complete descriptions of the data structure than traditional mean regression. Quantile regression for additive models has previously been considered in the literature. In the frequentist context, De Gooijer and Zerom (2003), Horowitz and Lee (2005), and Yu and Lu (2004) all developed methodologies for nonparametric estimation in additive quantile regression. In the Bayesian context, Yue and Rue (2011) proposed a Bayesian quantile regression approach for additive mixed models.

Some works focus on the selection of significant components using penalization approaches from the frequentist perspective, such as Ravikumar et al. (2009), Meier et al. (2009) and Huang et al. (2010). Besides, a number of recent papers are concerned with the selection of function components using Bayesian inference (Panagiotelis and Smith (2008), Shively et al. (1999), and Yau et al. (2003)). They express each function as a linear combination of basis terms and assign priors on the coefficients of the basis functions.
They all introduce indicator variables to enable variable selection. However, none of the works mentioned above considers linear component identification, and they consider only least squares regression, so pre-specification of the linear components is usually required. Recently, several works perform variable selection, parametric component identification, and parameter estimation all at the same time (Zhang et al. (2011), Lian et al. (2012)), from a frequentist perspective and for mean regression.

The quantile regression approach we propose in this article has the ability to separate function components into those with nonlinear effects, those with linear effects, and those irrelevant to the response, in the context of quantile regression. In a Bayesian context, this separation of components is a soft decision based on the posterior probabilities of the components being selected as nonlinear, linear, or zero. We establish a hierarchical Bayesian model by adopting the asymmetric Laplace distribution for the errors (Yu and Moyeed (2001)). Then, extending the Bayesian variable selection approach, we introduce two sets of indicator variables in the spike-and-slab priors, which make the separation of components possible. Scheipl et al. (2012) conducted a similar study that can separate components into nonlinear, linear and zero ones, for generalized additive models.

The remainder of the paper proceeds as follows. In Section 2, we describe our hierarchical Bayesian model for quantile regression based on the additive model structure. We also discuss our prior choices and introduce the posterior sampling algorithm, focusing on an efficient partially collapsed sampler; the details of the algorithm are given in the Appendix. In Section 3, we present numerical illustrations including simulation studies and two real data examples. In Section 4, we conclude the paper with a discussion.

2 The hierarchical Bayesian model

Quantile regression is typically achieved by solving a minimization problem based on the check loss function. With model (1), the specific problem is estimating $\mu_\tau$ and the functions $f_{\tau,j}$, $j = 1, \ldots, p$, by minimizing the following objective function,

$$L(y, x) = \sum_{i=1}^n \rho_\tau\Big(y_i - \mu_\tau - \sum_{j=1}^p f_{\tau,j}(x_{ij})\Big), \qquad (2)$$

where $y = (y_1, \ldots, y_n)^T$, $x = (x_1, \ldots, x_n)^T$, and $\rho_\tau(u) = u(\tau - I(u \le 0))$ is the check function. In a Bayesian setup, we assume $\epsilon_{\tau,i}$, $i = 1, \ldots, n$, are i.i.d. random variables from an asymmetric Laplace distribution with density

$$p(\epsilon_{\tau,i}) = \frac{\tau(1-\tau)}{\delta}\exp\Big\{-\frac{1}{\delta}\rho_\tau(\epsilon_{\tau,i})\Big\},$$

where $\delta$ is the scale parameter. Then the conditional distribution of $y$ is of the form

$$p(y|x) = \frac{\tau^n(1-\tau)^n}{\delta^n}\exp\Big\{-\frac{1}{\delta}\sum_{i=1}^n \rho_\tau\Big(y_i - \mu_\tau - \sum_{j=1}^p f_{\tau,j}(x_{ij})\Big)\Big\}. \qquad (3)$$

Hence, maximizing the likelihood (3) is equivalent to minimizing (2), giving (2) a likelihood-based interpretation. By introducing the location-scale mixture representation of the asymmetric Laplace distribution (Kozumi and Kobayashi (2011)), (3) can be equivalently written as

$$y_i = \mu_\tau + \sum_{j=1}^p f_{\tau,j}(x_{ij}) + k_1 e_i + \sqrt{k_2 \delta e_i}\, z_i, \qquad (4)$$

where $e_i \sim \mathrm{Exp}(1/\delta)$ follows an exponential distribution with mean $\delta$, $z_i$ follows the standard normal distribution and is independent of $e_i$, $k_1 = \frac{1-2\tau}{\tau(1-\tau)}$, and $k_2 = \frac{2}{\tau(1-\tau)}$. For ease of notation, we will omit $\tau$ in the expressions in the following.
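Since the location-scale mixture (4) underlies the entire sampler, it is easy to check numerically. The following sketch (base R; `tau` and `delta` are arbitrary illustrative values, not values taken from the paper) draws errors from the mixture and verifies that their $\tau$th empirical quantile is near zero, as the asymmetric Laplace construction requires.

```r
## Check loss rho_tau(u) = u * (tau - I(u <= 0))
rho <- function(u, tau) u * (tau - (u <= 0))

## Asymmetric Laplace errors via the location-scale mixture (4):
## eps = k1 * e + sqrt(k2 * delta * e) * z, e ~ Exp(mean = delta), z ~ N(0, 1)
set.seed(1)
tau <- 0.3; delta <- 1
k1 <- (1 - 2 * tau) / (tau * (1 - tau))
k2 <- 2 / (tau * (1 - tau))
n <- 1e5
e <- rexp(n, rate = 1 / delta)   # exponential with mean delta
z <- rnorm(n)
eps <- k1 * e + sqrt(k2 * delta * e) * z

## The tau-th empirical quantile of eps should be approximately zero
quantile(eps, tau)
```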
We assume the distribution of $x_j$, $j = 1, \ldots, p$, is supported on $[0,1]$ and also impose the condition $Ef_j(x_j) = 0$ for identifiability. To model each unknown function $f_j$ flexibly, we use truncated power splines to approximate the functions in this article. Let $t_0 = 0 < t_1 < \cdots < t_k < t_{k+1} = 1$ partition
$[0,1]$ into subintervals $[t_i, t_{i+1})$, $i = 0, \ldots, k$, with $k$ internal knots. Here we focus on equally spaced knots, although more complicated data-driven choices can be considered. For a given degree $q$, $B_0(x), \ldots, B_K(x)$ denote the $K + 1 = q + k$ truncated power spline basis functions $x, x^2, \ldots, x^q, (x - t_1)^q I(x > t_1), \ldots, (x - t_k)^q I(x > t_k)$. Because of the centering constraint $Ef_j(x_j) = 0$, we use the centered basis $\{B_{jk}(x) = B_k(x) - \sum_{i=1}^n B_k(x_{ij})/n,\ k = 0, \ldots, K\}$, with $K = q + k - 1$.
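The basis construction above is straightforward to code. Below is a minimal sketch (not the authors' implementation; the function names are hypothetical) of the truncated power basis with equally spaced interior knots on $[0,1]$ and empirical centering over the observed design points.

```r
## Truncated power spline basis of degree q with k equally spaced interior
## knots on [0, 1]: x, x^2, ..., x^q, (x - t_1)_+^q, ..., (x - t_k)_+^q
tp_basis <- function(x, q = 3, k = 5) {
  knots <- seq(0, 1, length.out = k + 2)[2:(k + 1)]   # interior knots t_1, ..., t_k
  poly <- outer(x, 1:q, `^`)                          # x, x^2, ..., x^q
  trunc <- outer(x, knots, function(x, t) pmax(x - t, 0)^q)
  cbind(poly, trunc)                                  # n x (q + k) matrix
}

## Centered basis: subtract column means over the observed x_{ij}
## (the sample analogue of the constraint E f_j(x_j) = 0)
centered_basis <- function(x, q = 3, k = 5) {
  B <- tp_basis(x, q, k)
  sweep(B, 2, colMeans(B))
}

x <- runif(100)
Bc <- centered_basis(x)   # column 1 is the (centered) linear basis B_{j0};
dim(Bc)                   # the remaining q + k - 1 columns are nonlinear
```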
We separate the linear basis (a single linear function) from the nonlinear bases to enable the identification of linear and nonlinear components. Thus, with the spline approximation, we have

$$y_i = \mu + \sum_{j=1}^p \alpha_j B_{j0}(x_{ij}) + \sum_{j=1}^p \sum_{k=1}^K \beta_{jk} B_{jk}(x_{ij}) + k_1 e_i + \sqrt{k_2\delta e_i}\, z_i, \quad i = 1, \ldots, n.$$

We view $e_i$, $i = 1, \ldots, n$, as latent variables and denote $e = (e_1, \ldots, e_n)^T$. Defining $E = k_2\delta\,\mathrm{diag}(e_1, \ldots, e_n)$, $B = (B_1, \ldots, B_p)$ with $B_j = (B_{j0}(x_{1j}), \ldots, B_{j0}(x_{nj}))^T$, $\alpha = (\alpha_1, \ldots, \alpha_p)^T$, $\beta_j = (\beta_{j1}, \ldots, \beta_{jK})^T$ and

$$\mathbf{B}_j = \begin{pmatrix} B_{j1}(x_{1j}) & B_{j2}(x_{1j}) & \cdots & B_{jK}(x_{1j}) \\ B_{j1}(x_{2j}) & B_{j2}(x_{2j}) & \cdots & B_{jK}(x_{2j}) \\ \vdots & \vdots & \ddots & \vdots \\ B_{j1}(x_{nj}) & B_{j2}(x_{nj}) & \cdots & B_{jK}(x_{nj}) \end{pmatrix},$$

the likelihood of $y$ can be expressed as

$$p(y|\alpha, \{\beta_j\}, e, \delta, x, \mu) \propto \exp\Big\{-\frac{1}{2}(y - f - k_1 e)^T E^{-1}(y - f - k_1 e)\Big\}\,(\det E)^{-1/2}, \qquad (5)$$

where $f = \mu 1_n + B\alpha + \sum_{j=1}^p \mathbf{B}_j\beta_j$, and $1_n$ is a vector of dimension $n$ with all components 1.
We choose spike-and-slab priors for $\alpha_j$ and $\beta_j$, following George and McCulloch (1993), Panagiotelis and Smith (2008), and many others, to enable variable selection and linear component selection. We introduce indicator variables $\gamma^{(\nu)} = (\gamma_1^{(\nu)}, \ldots, \gamma_p^{(\nu)})^T$ such that $\nu_j = 0$ if and only if $\gamma_j^{(\nu)} = 0$, for $\nu = \alpha, \beta$. In other words, $f_j$ is regarded as a nonlinear function, a linear function or a zero function when $(\gamma_j^{(\beta)} = 1)$, $(\gamma_j^{(\alpha)} = 1, \gamma_j^{(\beta)} = 0)$, or $(\gamma_j^{(\alpha)} = 0, \gamma_j^{(\beta)} = 0)$, respectively.

For $\alpha$, we choose an independent Gaussian prior on each component, also called a ridge prior (Goldstein and Smith (1974)),

$$p(\alpha|\gamma^{(\alpha)}, \sigma^2) \sim N(0_p, \Sigma_\alpha), \quad \Sigma_\alpha = \mathrm{diag}(\gamma_1^{(\alpha)}\sigma_1^2, \gamma_2^{(\alpha)}\sigma_2^2, \ldots, \gamma_p^{(\alpha)}\sigma_p^2),$$

where $0_p$ is a zero vector of dimension $p$ and $\sigma^2 = (\sigma_1^2, \ldots, \sigma_p^2)$. For $\beta_j$, we take the conjugate prior usually adopted in a Bayesian context, $p(\beta_j) \propto \exp\{-\beta_j^T\Omega_j\beta_j/(2\tau_j)\}$ when $\gamma_j^{(\beta)} = 1$, as in Smith and Kohn (1996) and Chib and Jeliazkov (2006). Here the $(k, k')$ entry of $\Omega_j$ is $\int B''_{jk}(x)B''_{jk'}(x)\,dx$ ($B''_{jk}$ is the second derivative of $B_{jk}$). Since $\beta_j = 0_K$ when $\gamma_j^{(\beta)} = 0$, the prior of $\beta_j$ can be written as

$$p(\beta_j|\gamma_j^{(\beta)}, \tau_j) \sim N(0_K, \gamma_j^{(\beta)}\tau_j\Omega_j^{-1}).$$
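Because centering shifts each basis function only by a constant, the second derivatives, and hence $\Omega_j$, are unaffected by centering, and the entries can be approximated by simple quadrature. A sketch under the same assumed knot layout as above (an illustration, not the authors' code):

```r
## Omega_j for the nonlinear part of a degree-q truncated power basis:
## entry (k, k') is the integral over [0, 1] of B''_k(x) * B''_{k'}(x)
omega_matrix <- function(q = 3, k = 5, ngrid = 2000) {
  knots <- seq(0, 1, length.out = k + 2)[2:(k + 1)]
  x <- seq(0, 1, length.out = ngrid)
  ## second derivatives of the nonlinear bases: x^2, ..., x^q, (x - t)_+^q
  d2_poly <- sapply(2:q, function(m) m * (m - 1) * x^(m - 2))
  d2_trunc <- sapply(knots, function(t) q * (q - 1) * pmax(x - t, 0)^(q - 2))
  D2 <- cbind(d2_poly, d2_trunc)      # ngrid x K matrix, K = q + k - 1
  crossprod(D2) * (x[2] - x[1])       # Riemann approximation of the integrals
}
Omega <- omega_matrix()
```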
We use the same prior as in Cripps et al. (2005) on $\gamma^{(\nu)}$, $\nu = \alpha, \beta$,

$$p(\gamma^{(\nu)}) = \frac{1}{(p+1)\binom{p}{q_{\gamma^{(\nu)}}}},$$

where $q_{\gamma^{(\nu)}}$ is the number of nonzero $\nu_j$. Under this prior, equal weights are placed on $\gamma^{(\nu)}$ with different numbers of nonzero $\nu_j$. Alternatively, we can use $p(\gamma^{(\nu)}) = \pi^{q_{\gamma^{(\nu)}}}(1-\pi)^{p - q_{\gamma^{(\nu)}}}$ with $\pi$ a constant; when $\pi = 1/2$ this is the uniform prior on $\gamma^{(\nu)}$. We find the results are not sensitive to this choice of prior.

To summarize, the Bayesian hierarchical formulation is given by

$$y \mid \alpha, \{\beta_j\}, e, \delta, x, \mu \sim N(f + k_1 e, E),$$
$$\alpha \mid \gamma^{(\alpha)}, \sigma^2 \sim N(0_p, \Sigma_\alpha), \quad \gamma^{(\alpha)} \sim p(\gamma^{(\alpha)}),$$
$$\beta_j \mid \gamma_j^{(\beta)}, \tau_j \sim N(0_K, \gamma_j^{(\beta)}\tau_j\Omega_j^{-1}), \quad \gamma^{(\beta)} \sim p(\gamma^{(\beta)}),$$
$$e_i \overset{i.i.d.}{\sim} \mathrm{Exp}(1/\delta), \quad \delta \sim p(\delta), \quad \sigma_j^2 \sim p(\sigma_j^2), \quad \tau_j \sim p(\tau_j), \quad \mu \sim p(\mu),$$

where $p(\delta)$, $p(\sigma_j^2)$ and $p(\tau_j)$ are the hyperpriors of $\delta$, $\sigma_j^2$ and $\tau_j$. They are set to $IG(a_1, a_2)$, where $IG$ denotes the inverse Gamma distribution, and $a_1$ and $a_2$ are set to 0.5 in all our numerical experiments as an uninformative choice. Sensitivity analysis reveals that our results are not sensitive to these choices. Finally, we use an uninformative prior on $\mu$, $p(\mu) \propto 1$. We integrate out $\nu_j$, $\nu = \alpha, \beta$, in some of the sampling steps, resulting in a partially collapsed sampler (van Dyk and Park (2008)). The reader is referred to the Appendix for details.

3 Numerical illustrations

We demonstrate the performance of our proposed quantile regression approach (denoted by BQPLAM) in terms of its estimation accuracy and model selection accuracy. The MCMC algorithm is implemented in R and available upon request. For comparison, we consider the following four additional methods.
Method 1: The hierarchical Bayesian model in Section 2 can easily be revised to deal with the partially linear additive mean regression model. Consider the model

$$y_i = \mu + \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i, \quad i = 1, \ldots, n,$$

where $\epsilon_i$, $i = 1, \ldots, n$, are i.i.d. normally distributed with mean zero and variance $\delta$. The Bayesian hierarchical formulation is then

$$y \mid \alpha, \{\beta_j\}, \delta, x, \mu \sim N(f, \delta I_{n\times n}),$$
$$\alpha \mid \gamma^{(\alpha)}, \sigma^2 \sim N(0_p, \Sigma_\alpha), \quad \gamma^{(\alpha)} \sim p(\gamma^{(\alpha)}),$$
$$\beta_j \mid \gamma_j^{(\beta)}, \tau_j \sim N(0_K, \gamma_j^{(\beta)}\tau_j\Omega_j^{-1}), \quad \gamma^{(\beta)} \sim p(\gamma^{(\beta)}),$$
$$\delta \sim p(\delta), \quad \tau_j \sim p(\tau_j), \quad \mu \sim p(\mu),$$

where $I_{n\times n}$ is the $n \times n$ identity matrix. We denote this mean regression method by BPLAM. The other three methods are quantile regression approaches.
Method 2: We consider a Bayesian additive model that combines the nonlinear and linear terms together; it performs variable selection but not linear component identification. Specifically, the Bayesian additive regression is based on model (5) with

$$f = \mu 1_n + \sum_{j=1}^p \mathbf{B}_j\beta_j,$$

where $\beta_j = (\beta_{j0}, \beta_{j1}, \ldots, \beta_{jK})^T$ and $\mathbf{B}_j$ is the $n \times (K+1)$ matrix with $(i, k)$ entry $B_{jk}(x_{ij})$, $k = 0, \ldots, K$. The hierarchical structure is

$$y \mid \{\beta_j\}, e, \delta, x, \mu \sim N(f + k_1 e, E),$$
$$\beta_j \mid \gamma_j^{(\beta)}, \tau_j \sim N(0_{K+1}, \gamma_j^{(\beta)}\tau_j\Omega_j^{-1}), \quad \gamma^{(\beta)} \sim p(\gamma^{(\beta)}),$$
$$e_i \overset{i.i.d.}{\sim} \mathrm{Exp}(1/\delta), \quad \delta \sim p(\delta), \quad \tau_j \sim p(\tau_j), \quad \mu \sim p(\mu).$$

This method is denoted by BQAM_V.
Method 3: We consider Bayesian quantile linear regression based on model (5) with $f = \mu 1_n + B\alpha$, in which all components are assumed to be linear. The Bayesian hierarchical structure is

$$y \mid \alpha, e, \delta, x, \mu \sim N(f + k_1 e, E),$$
$$\alpha \mid \gamma^{(\alpha)}, \sigma^2 \sim N(0_p, \Sigma_\alpha), \quad \gamma^{(\alpha)} \sim p(\gamma^{(\alpha)}),$$
$$e_i \overset{i.i.d.}{\sim} \mathrm{Exp}(1/\delta), \quad \delta \sim p(\delta), \quad \sigma_j^2 \sim p(\sigma_j^2), \quad \mu \sim p(\mu).$$

The indicator variables $\gamma^{(\alpha)}$ enable component selection. This method is denoted by BQLM_V.

Method 4: The third quantile regression method for comparison is also based on model (5), except that we fix all indicator variables $\gamma_j^{(\alpha)}$ and $\gamma_j^{(\beta)}$, $j = 1, \ldots, p$, at 1. This method can only estimate component functions but cannot select variables. We denote it by BQAM_NV.

3.1 Simulation studies

We generate $n = 100$ observations $(x_i, y_i)$, $i = 1, \ldots, n$, from the heteroscedastic additive model

$$y_i = \sum_{j=1}^p f_j(x_{ij}) + (0.5\,x_{i1})\,\epsilon_i, \qquad (6)$$

with $f_1(x) = \sin(2\pi x)/(2 - \sin(2\pi x))$, $f_2(x) = 5x(1 - x)$, $f_3(x) = 2x$, $f_4(x) = x$, $f_5(x) = -x$ and $p = 10$. Thus the first two components are nonlinear, followed by three linear components and five zero components. The covariates $x_{ij}$ are generated from the standard normal distribution with correlations given by $\mathrm{Cov}(x_{ij}, x_{ij'}) = (1/2)^{|j - j'|}$, and then transformed to be marginally uniform on $[0,1]$ by applying the cdf of the standard normal distribution.
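The covariate design is a Gaussian copula construction; a minimal sketch of generating $X$ and the nonzero component functions of (6) follows (the heteroscedastic noise term is then added as in (6)).

```r
## Correlated covariates with Cov(x_ij, x_ij') = 0.5^|j - j'| on the normal
## scale, transformed to be marginally Uniform[0, 1] via the standard
## normal cdf (a Gaussian copula construction)
set.seed(2)
n <- 100; p <- 10
Sigma <- 0.5^abs(outer(1:p, 1:p, `-`))
L <- chol(Sigma)                         # crossprod(L) = Sigma
Z <- matrix(rnorm(n * p), n, p) %*% L    # rows ~ N(0, Sigma)
X <- pnorm(Z)                            # marginally Uniform[0, 1]

## First five component functions of model (6); the rest are zero
f1 <- function(x) sin(2 * pi * x) / (2 - sin(2 * pi * x))
f2 <- function(x) 5 * x * (1 - x)
f3 <- function(x) 2 * x
f4 <- function(x) x
f5 <- function(x) -x
signal <- f1(X[, 1]) + f2(X[, 2]) + f3(X[, 3]) + f4(X[, 4]) + f5(X[, 5])
```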
We consider two distributions for $\epsilon_i$: a normal distribution with mean 0 and standard deviation 0.5, and a Student's t distribution with scale parameter 1/2. The models are fitted at the quantile levels $\tau \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. For each scenario, 100 data sets are generated and fitted. For each replicate, the MCMC algorithm is run for 20000 iterations with a burn-in of 10000.
The performance is evaluated by the integrated squared error (ISE) for each component function, approximated over $T = 1000$ equally spaced points $(t_1, \ldots, t_T)$ on $[0,1]$ by

$$\widehat{ISE} = \frac{1}{T}\sum_{i=1}^T \big(\hat f_j(t_i) - f_j(t_i)\big)^2, \qquad (7)$$

where $f_j(t_i)$ is the true value of $f_j$ at $t_i$, and $\hat f_j(t_i)$ is the posterior mean of $f_j$ at $t_i$ based on the 10000 iterations after burn-in. Tables 1 and 2 summarize the average and standard deviation of $\sqrt{\widehat{ISE}}$ over 100 replicates for the first six components and for the regression function $f = \sum_{j=1}^p f_j$. From the results, we see that BQPLAM is clearly more efficient than BQAM_V and BQAM_NV for the parametric components, owing to the separate linear basis and its associated indicator variables. BQLM_V performs poorly, as expected, since it cannot capture nonlinear effects of the components. BQPLAM and BQAM_V outperform BQAM_NV for the zero components (we present results only for $f_6$ among the five zero components). The results show that the two sets of indicators help reduce estimation error besides selecting components. The results of mean regression are similar to those of median regression when the errors follow a normal distribution; however, the advantage of median regression is more obvious when the errors follow a Student's t distribution.
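Equation (7) is a plain grid average; a minimal sketch, where `f_hat` stands for a posterior-mean estimate of one component (a hypothetical placeholder):

```r
## Grid approximation of the integrated squared error (7)
ise_hat <- function(f_hat, f_true, Tgrid = 1000) {
  t <- seq(0, 1, length.out = Tgrid)
  mean((f_hat(t) - f_true(t))^2)
}
## example with a hypothetical posterior-mean estimate `fhat` of f_2:
## sqrt(ise_hat(fhat, function(x) 5 * x * (1 - x)))
```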
Table 1: Summary of $\sqrt{\widehat{ISE}}$ for each univariate function and for $f = \sum_{j=1}^p f_j$ over 100 replicates, when the distribution of the error is normal and $p = 10$. Standard errors based on the simulations are shown as subscripts.

Table 2: Summary of $\sqrt{\widehat{ISE}}$ for each univariate function and for $f = \sum_{j=1}^p f_j$ over 100 replicates, when the distribution of the error is Student's t and $p = 10$.

To measure the prediction accuracy of our method, we generate $n' = 100{,}000$ independent test samples from the same generating model (6) as the training samples and present the test errors in Tables 3 and 4. For mean and median regression, we consider two error measures, the root mean squared error (RMSE) and the absolute deviation error (AD). For quantile regression at the quantiles $\{0.1, 0.3, 0.7, 0.9\}$, the prediction error is the average check loss (ACL). More specifically,

$$RMSE = \sqrt{\frac{1}{n'}\sum_{i=1}^{n'}(\hat y_i - y_i)^2}, \quad AD = \frac{1}{n'}\sum_{i=1}^{n'}|\hat y_i - y_i|, \quad ACL = \frac{1}{n'}\sum_{i=1}^{n'}\rho_\tau(\hat y_i - y_i),$$

where $y_i$, $i = 1, \ldots, n'$, are the responses of the testing data and $\hat y_i$, $i = 1, \ldots, n'$, are the predicted values, estimated with $\hat y_i = \hat\mu + \sum_{j=1}^p \hat f_j(x_{ij})$, where $\hat\mu$ and $\hat f_j(x_{ij})$ are posterior means based on the 10000 sampled values after burn-in. Our method BQPLAM results in smaller test errors and outperforms the other three quantile regression methods at all quantile levels. Median regression performs similarly to mean regression when the errors are Gaussian, while it outperforms mean regression when the errors are Student's t, as expected.
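The three error measures translate directly into code; a short sketch:

```r
## Test-set error measures: RMSE, AD and average check loss (ACL)
rmse <- function(yhat, y) sqrt(mean((yhat - y)^2))
ad   <- function(yhat, y) mean(abs(yhat - y))
acl  <- function(yhat, y, tau) { u <- yhat - y; mean(u * (tau - (u <= 0))) }
```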
Table 3: Summary of testing errors for mean and median estimators over 100 replicates, when $p = 10$. "Normal" and "Student's t" indicate the distribution of $\epsilon_i$.

Table 4: Summary of testing errors for quantile estimators at different quantile levels over 100 replicates, when $p = 10$. "Normal" and "Student's t" indicate the distribution of $\epsilon_i$.

The results of zero and linear component selection are based on the estimated posterior distributions of $\gamma^{(\alpha)}$ and $\gamma^{(\beta)}$. To be specific, we compute the three posterior probabilities $p(\gamma_j^{(\beta)} = 1 \mid y, x)$, $p(\gamma_j^{(\alpha)} = 1, \gamma_j^{(\beta)} = 0 \mid y, x)$ and $p(\gamma_j^{(\alpha)} = 0, \gamma_j^{(\beta)} = 0 \mid y, x)$. A component is identified as nonlinear, linear or zero according to which of the three probabilities is largest (this rule is used only for illustration; Bayesians usually do not take this hard-decision step). In Tables 5-7 we report, for BQPLAM and BPLAM, the number of nonzero components selected, the number of selected components that are nonzero in the true model, the number of linear components selected, and the number of selected linear components that are linear in the true model. For BQAM_V and BQLM_V, which can only select nonzero components, we report the number of variables selected and the number of selected variables that are truly nonzero. Figures 1 and 2 display the estimates of the three posterior probabilities for each component, based on one randomly selected replicate among the 100. The black, dark grey, and light grey areas represent the percentage of times the component is selected as nonlinear, linear and zero, respectively. It is seen from the figures that BQPLAM detects linear and nonlinear components quite accurately.

Finally, we increase the dimension in the example (by adding zero components) to demonstrate the performance of our proposed method in higher dimensions. We consider $p = 50$ and $p = 100$. To save space and time, we consider only BPLAM and BQPLAM at $\tau = 0.5$. The testing errors and variable selection results are presented in Tables 8 and 9, and the estimates of the three posterior probabilities for each component (for one data set) are displayed in Figures 3 and 4. The results show that our proposed method still performs well.
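The three-probability classification rule described above can be computed directly from the MCMC draws of the indicators; a sketch, where `gamma_a` and `gamma_b` are hypothetical $S \times p$ matrices of post-burn-in draws of $\gamma^{(\alpha)}$ and $\gamma^{(\beta)}$:

```r
## Hard classification of each component from the MCMC indicator draws
classify_components <- function(gamma_a, gamma_b) {
  p_nonlinear <- colMeans(gamma_b == 1)                 # P(gamma_b = 1 | y, x)
  p_linear    <- colMeans(gamma_a == 1 & gamma_b == 0)  # P(gamma_a = 1, gamma_b = 0 | y, x)
  p_zero      <- colMeans(gamma_a == 0 & gamma_b == 0)  # P(gamma_a = 0, gamma_b = 0 | y, x)
  probs <- rbind(nonlinear = p_nonlinear, linear = p_linear, zero = p_zero)
  list(probs = probs,
       label = rownames(probs)[apply(probs, 2, which.max)])
}
```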
Table 5: Summary of the component selection results for mean and median estimators over 100 replicates, when $p = 10$. "Normal" and "Student's t" indicate the distribution of $\epsilon_i$.

Table 6: Summary of the component selection results for quantile estimators at different quantile levels over 100 replicates, when the distribution of errors is Gaussian and $p = 10$.

Table 7: Summary of the component selection results for quantile estimators at different quantile levels over 100 replicates, when the distribution of errors is Student's t and $p = 10$.

Figure 1: Component selection results for one randomly selected replicate, when the distribution of errors is Gaussian and $p = 10$. The black, dark grey, and light grey areas represent the posterior probabilities of the component being nonlinear, linear and zero, respectively.

Figure 2: Component selection results for one randomly selected replicate, when the distribution of errors is Student's t and $p = 10$.

Table 8: Summary of testing errors for mean and median estimators over 100 replicates, when $p = 50$ and 100. "Normal" and "Student's t" indicate the distribution of $\epsilon_i$.

Table 9: Summary of the component selection results for mean and median estimators over 100 replicates, when $p = 50$ and 100. "Normal" and "Student's t" indicate the distribution of $\epsilon_i$.

Figure 3: Component selection results for one randomly selected replicate with $p = 50$. The upper panels are results when the error distribution is Gaussian, and the lower panels when it is Student's t.

Figure 4: Component selection results for one randomly selected replicate with $p = 100$. The upper panels are results when the error distribution is Gaussian, and the lower panels when it is Student's t.

3.2 Real data examples

Now we illustrate the methodology with two data sets. In the following examples, we fix the $\gamma^{(\beta)}$ associated with discrete dummy variables at zero, so that a dummy variable can only be modeled as linear or zero based on the posterior estimate of $\gamma^{(\alpha)}$. All covariates are standardized by a linear transformation to lie in
$[0,1]$ before analysis.
We apply our method to GDP growth data. This data set includes data for 74 countries over two 20-year time intervals, 1960-1979 and 1980-1999, with sample size $n = 147$. It was used by Tan (2010) to uncover the interplay between geography, institutions, and fractionalization in economic development by employing a regression tree analysis. Here we use our method to detect the determinants of economic growth. The dependent variable is the difference between the logarithmic values of real per capita GDP for the start and end years of each 20-year time interval. We consider four categories of independent variables, 13 in total. The first covariate is a variable measuring the expropriation risk of government across the years 1984-1997, denoted by EXPROP8497. Tan (2010) also considers another variable, ICRG8497, to measure the quality of institutions; since the correlation between the two variables is very high (over 0.8), we drop the second one and keep only EXPROP8497. The second category of independent variables includes the proportion of a country's land area that experiences more than 5 frost days per month in winter (FROST5), the percentage of a country's land area classified as a tropical eco-zone (ZTROPICS), and the percentage of a country's land area within 100 km of an ice-free coast (LCR100KM); this category is used as a proxy for climate. The third category measures the degree of ethnic fractionalization and includes ELF60, ETHNIC, LANG and RELIG. ELF60 measures the probability in 1960 that two randomly selected people from a given country will not belong to the same ethno-linguistic subgroup; ETHNIC combines racial and linguistic characteristics; LANG is based on data for shares of languages spoken as 'mother tongues'; and RELIG describes differences in religion. The last category of independent variables includes several familiar neoclassical determinants: log net depreciation rate (MNGD), log investment share (MINV), log schooling (MSCH15), a dummy variable for the period 1960-1979 (DUM6079), and log initial per capita income (MGDP0).

In Figure 5, we show the variable selection results of BPLAM and BQPLAM. For mean regression, we identify 6 nonzero components, all of which are linear. For quantile regression, we identify 4, 3, 4, 5 and 6 nonzero components, of which 1, 0, 0, 1 and 2 components are nonlinear, at the five quantile levels respectively. The barplots show that some covariates have effects only at lower or upper quantiles; for example, MSCH15 and ELF60 have effects only at upper quantiles, and ETHNIC affects only lower quantiles. In Figure 6, we show the fitted curves of the estimated regression functions (posterior means based on the 10000 iterations after burn-in) as one covariate varies while the others are fixed at 0.5. For convergence diagnosis, we present trace plots of $\alpha$ (only for the covariates detected as nonzero components), $\delta$, and $\mu$ for the median regression in Figure 7 (we plot one sampled point in every ten to reduce the image size).
Figure 5: Component selection results for the GDP growth data.

Figure 6: The fitted regression functions for the GDP growth data when one covariate varies and the others are fixed at 0.5, at quantile levels $\tau = 0.1, 0.3, 0.5, 0.7, 0.9$. The red solid lines are the fitted curves of mean regression.
Figure 7: Trace plots of selected $\alpha$, $\delta$, and $\mu$ of median regression for the GDP growth data.

To summarize, besides the neoclassical determinants, which are fundamental determinants used in the production function (Mankiw et al. 1992), both institutions and ethnic fractionalization are found to affect economic development, while geography is found to play no role. These results are consistent with those of Tan (2010) and lead to the same conclusion: more attention should be paid to the impact of the quality of institutions and the degree of ethnic fractionalization on economic development.

The other data set we use is a housing price data set. It is used to investigate the relationship between house prices and several features, including physical attributes of the residential houses, characteristics of the surrounding rural district, and environmental nuisances. This application concerns transaction prices ($n = 2053$) for residential houses sold during 1996 and 1997 in Brittany. We consider four physical characteristics of a house and its adjacent lot: age of the house (AGE), state of repair (REPAIR, a dummy), number of rooms (ROOMS), and lot size (LOT). We use another four indicators of the characteristics of the surrounding rural district: the population of the district (POP), average taxable family income (AVINC), proportion of vacant houses (VACANT), and a dummy variable indicating whether or not the surrounding district is located in the department of Ille-et-Vilaine (COUNTY). The environmental nuisances are expressed by two indicators: the amount of nitrogen emissions from livestock farming per hectare of arable land in the rural district where the house is located (NITRO), and the proportion of permanent grassland converted into cultivated grassland (TMEAD). This data set was used by Bontemps et al. (2008) to investigate the effects of agricultural pollution on house prices. Bontemps et al. (2008) pre-specify all the explanatory variables (AGE, REPAIR, ROOMS, LOT, COUNTY, VACANT, POP and AVINC) except the two environmental indicators (TMEAD and NITRO) as linear components in their semiparametric models, and incorporate the two environmental indicators in a nonparametric way.

We display the barplots in Figure 8 and the fitted regression functions in Figure 9. The results show that all components are identified as nonzero, 4 of them nonlinear, for mean regression. For quantile regression at the five quantile levels, we identify 9, 10, 10, 10 and 9 nonzero components, and 1, 1, 1, 0 and 0 nonlinear components, respectively. Examining the influence of the physical characteristics of houses on prices at different levels, we find that a larger number of rooms, a bigger lot size and the fact that a house has been repaired contribute to an increase in the price of a house, while older age has a negative impact on the price. Similarly, the characteristics of the surrounding districts show effects on house prices that conform to our expectations. For example, the price of houses located in the districts of the most urbanized county of Brittany (Ille-et-Vilaine) is higher, and the price of residential houses located in districts with lower housing vacancy rates is higher. For the two environmental indicators, we reveal their negative effects on house prices. The effects of the covariates on the response at different quantile levels show some degree of heterogeneity.
For example, the effect of AGE is linear at quantile 0.9 but nonlinear at lower quantiles, and the negative effect of NITRO is not obvious at quantile 0.1 but more obvious at upper quantiles.

Figure 8: Component selection results for the housing price data.

Figure 9: The fitted regression functions for the housing price data when one covariate varies and the others are fixed at 0.5, at quantile levels $\tau = 0.1, 0.3, 0.5, 0.7, 0.9$. The red solid lines are the fitted curves of mean regression.
4 Discussions
In this article, we have proposed a Bayesian quantile regression method for partially linear additive models, which explicitly models components with linear and nonlinear effects. As detailed in the Appendix, we designed an efficient MCMC algorithm for posterior inference. With simulation studies, we illustrated the empirical performance of the proposed quantile regression approach and demonstrated the efficacy of the two sets of indicator variables. The method can automatically determine the type of each component effect. The performance of the proposed approach in our simulations is quite encouraging, even when $p$ is large.

Note that the error distributions we use in the simulation examples are very different from the asymmetric Laplace distribution. Our simulation results show that our methods perform well even when the errors are not generated from the assumed mechanism. Theoretically, it has recently been demonstrated that Bayesian quantile regression based on the asymmetric Laplace distribution is consistent even when the error distribution is misspecified (Sriram et al. (2013)). Li et al. (2010) and Hu et al. (2013) have also discussed this problem and showed that Bayesian methods are not sensitive to this assumption.

Finally, the asymmetric Laplace distribution is only one among several possible approaches to Bayesian quantile regression. A drawback of the current approach is that regression functions at different quantile levels are estimated separately, which can cause the quantile curves at different levels to cross each other. Other approaches may be able to address this issue, but we choose the asymmetric Laplace distribution for its simplicity and computational convenience.
Appendix: MCMC algorithm details
The joint distribution of all the variables is

$$p(\alpha, \{\beta_j\}, e, \mu, \delta \mid y, x) \propto \exp\Big\{-\frac{1}{2}\Big(y - \mu 1_n - B\alpha - \sum_{j=1}^p \mathbf{B}_j\beta_j - k_1 e\Big)^T E^{-1}\Big(y - \mu 1_n - B\alpha - \sum_{j=1}^p \mathbf{B}_j\beta_j - k_1 e\Big)\Big\}(\det E)^{-1/2} \times p(\alpha) \times \prod_{j=1}^p p(\beta_j) \times p(\delta) \times \prod_{i=1}^n p(e_i) \times p(\mu),$$

where $p(\beta_j)$, $p(\alpha)$, $p(e_i)$, $p(\delta)$ and $p(\mu)$ are the prior distributions of $\beta_j$, $\alpha$, $e_i$, $\delta$, and $\mu$, respectively.

We use a Metropolis-within-Gibbs algorithm to sample from the posterior distribution. We integrate out $\alpha_j$ in step 3 and $\beta_j$ in step 5 to improve the mixing of the Markov chain. The conditional distribution of each variable is as follows ($\sim$ denotes all variables except the one to be sampled):

1. Sample $p(\alpha_j \mid \sim) = p(\alpha_j \mid y^*, e, \delta, \sigma_j^2, \gamma_j^{(\alpha)})$, $j = 1, \ldots, p$, from the conditional distribution of $\alpha_j$,

$$p(\alpha_j \mid y^*, e, \delta, \sigma_j^2, \gamma_j^{(\alpha)} = 1) \sim N(\mu_j, \xi_j), \qquad \alpha_j = 0 \text{ when } \gamma_j^{(\alpha)} = 0,$$

where $y^* = y - \mu 1_n - \sum_{i \neq j}\alpha_i B_i - \sum_{i=1}^p \mathbf{B}_i\beta_i - k_1 e$, $\xi_j = (B_j^T E^{-1} B_j + 1/\sigma_j^2)^{-1}$ and $\mu_j = \xi_j B_j^T E^{-1} y^*$.

2. Sample $p(\mu \mid \sim) = p(\mu \mid y^*, e, \delta)$ from the conditional distribution of $\mu$,

$$p(\mu \mid y^*, e, \delta) \sim N(\mu_0, \xi_0),$$

where $\xi_0 = k_2\delta\big(\sum_{i=1}^n e_i^{-1}\big)^{-1}$, $\mu_0 = \xi_0\, 1_n^T E^{-1} y^*$, and $y^* = y - \sum_{i=1}^p \alpha_i B_i - \sum_{i=1}^p \mathbf{B}_i\beta_i - k_1 e$.
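Since $E$ is diagonal, the step-1 update involves only vector operations; a sketch (a hypothetical helper, assuming $\gamma_j^{(\alpha)} = 1$):

```r
## One Gibbs draw of alpha_j (step 1).
## Bj:    n-vector of linear basis values B_{j0}(x_{ij})
## Einv:  n-vector of 1 / (k2 * delta * e_i), the diagonal of E^{-1}
## ystar: current partial residual; sigma2_j: prior variance of alpha_j
draw_alpha_j <- function(Bj, Einv, ystar, sigma2_j) {
  xi_j <- 1 / (sum(Bj^2 * Einv) + 1 / sigma2_j)   # posterior variance
  mu_j <- xi_j * sum(Bj * Einv * ystar)           # posterior mean
  rnorm(1, mu_j, sqrt(xi_j))
}
```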
3. Sample $p(\gamma_j^{(\alpha)} \mid \sim) = p(\gamma_j^{(\alpha)} \mid y^*, e, \delta, \sigma_j^2)$, $j = 1, \ldots, p$, from its conditional posterior after integrating over $\alpha_j$,

$$p(\gamma_j^{(\alpha)} = 1 \mid y^*, e, \delta, \sigma_j^2) = \frac{1}{1 + h}, \qquad h = h_1\,\frac{p(\gamma_j^{(\alpha)} = 0 \mid \gamma_{i \neq j}^{(\alpha)})}{p(\gamma_j^{(\alpha)} = 1 \mid \gamma_{i \neq j}^{(\alpha)})},$$

$$h_1 = \exp\Big\{-\frac{1}{2}\big(B_j^T E^{-1} y^*\big)^2\Big(B_j^T E^{-1} B_j + \frac{1}{\sigma_j^2}\Big)^{-1}\Big\} \times \big(\sigma_j^2 B_j^T E^{-1} B_j + 1\big)^{1/2},$$

where $p(\gamma_j^{(\alpha)} = 0 \mid \gamma_{i \neq j}^{(\alpha)}) = (p - q_{\gamma_0^{(\alpha)}})/(p + 1)$ and $p(\gamma_j^{(\alpha)} = 1 \mid \gamma_{i \neq j}^{(\alpha)}) = (1 + q_{\gamma_0^{(\alpha)}})/(p + 1)$, with $\gamma_0^{(\alpha)} = (\gamma_1^{(\alpha)}, \ldots, \gamma_{j-1}^{(\alpha)}, 0, \gamma_{j+1}^{(\alpha)}, \ldots, \gamma_p^{(\alpha)})^T$ and $q_{\gamma_0^{(\alpha)}}$ its number of nonzero entries, and $y^* = y - \mu 1_n - \sum_{i \neq j}\alpha_i B_i - \sum_{i=1}^p \mathbf{B}_i\beta_i - k_1 e$.

4. Sample $p(\beta_j \mid \sim) = p(\beta_j \mid y^*, e, \delta, \tau_j, \gamma_j^{(\beta)})$, $j = 1, \ldots, p$, from

$$p(\beta_j \mid y^*, e, \delta, \tau_j, \gamma_j^{(\beta)} = 1) \sim N(\mu_j, \Sigma_j), \qquad \beta_j = 0_K \text{ when } \gamma_j^{(\beta)} = 0,$$

where $y^* = y - \mu 1_n - \sum_{i=1}^p \alpha_i B_i - \sum_{i \neq j}\mathbf{B}_i\beta_i - k_1 e$, $\Sigma_j = \big(\mathbf{B}_j^T E^{-1}\mathbf{B}_j + \frac{1}{\tau_j}\Omega_j\big)^{-1}$ and $\mu_j = \Sigma_j\mathbf{B}_j^T E^{-1} y^*$.
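The collapsed step-3 update reduces to two scalar quantities, $B_j^T E^{-1} B_j$ and $B_j^T E^{-1} y^*$; a sketch, with `q0` the number of nonzero entries among the other indicators:

```r
## Collapsed update of gamma_j^(alpha) (step 3): alpha_j integrated out
prob_gamma_alpha_1 <- function(Bj, Einv, ystar, sigma2_j, q0, p) {
  b <- sum(Bj^2 * Einv)                 # B_j' E^{-1} B_j
  s <- sum(Bj * Einv * ystar)           # B_j' E^{-1} y*
  h1 <- exp(-0.5 * s^2 / (b + 1 / sigma2_j)) * sqrt(sigma2_j * b + 1)
  prior_odds_0 <- (p - q0) / (1 + q0)   # p(gamma = 0 | rest) / p(gamma = 1 | rest)
  1 / (1 + h1 * prior_odds_0)           # p(gamma_j^(alpha) = 1 | ...)
}
```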
5. Sample $p(\gamma_j^{(\beta)} \mid \sim) = p(\gamma_j^{(\beta)} \mid y^*, e, \delta, \tau_j)$, $j = 1, \ldots, p$, from its conditional posterior after integrating over $\beta_j$,

$$p(\gamma_j^{(\beta)} = 1 \mid y^*, e, \delta, \tau_j) = \frac{1}{1 + h}, \qquad h = h_1\,\frac{p(\gamma_j^{(\beta)} = 0 \mid \gamma_{i \neq j}^{(\beta)})}{p(\gamma_j^{(\beta)} = 1 \mid \gamma_{i \neq j}^{(\beta)})},$$

$$h_1 = \exp\Big\{-\frac{1}{2}\,y^{*T} E^{-1}\mathbf{B}_j\Big(\mathbf{B}_j^T E^{-1}\mathbf{B}_j + \frac{1}{\tau_j}\Omega_j\Big)^{-1}\mathbf{B}_j^T E^{-1} y^*\Big\} \times \det\Big[\frac{1}{\tau_j}\Omega_j\Big]^{-1/2} \times \det\Big[\mathbf{B}_j^T E^{-1}\mathbf{B}_j + \frac{1}{\tau_j}\Omega_j\Big]^{1/2},$$

where $y^* = y - \mu 1_n - \sum_{i=1}^p \alpha_i B_i - \sum_{i \neq j}\mathbf{B}_i\beta_i - k_1 e$, $p(\gamma_j^{(\beta)} = 0 \mid \gamma_{i \neq j}^{(\beta)}) = (p - q_{\gamma_0^{(\beta)}})/(p + 1)$ and $p(\gamma_j^{(\beta)} = 1 \mid \gamma_{i \neq j}^{(\beta)}) = (1 + q_{\gamma_0^{(\beta)}})/(p + 1)$, with $\gamma_0^{(\beta)} = (\gamma_1^{(\beta)}, \ldots, \gamma_{j-1}^{(\beta)}, 0, \gamma_{j+1}^{(\beta)}, \ldots, \gamma_p^{(\beta)})^T$.

6. Sample $\delta$,

$$\delta \sim IG(a_1 + 3n/2,\ \nu), \qquad \nu = a_2 + \sum_{i=1}^n\Big[\frac{1}{2k_2 e_i}\Big(y_i - \mu - \sum_{j=1}^p \alpha_j B_{j0}(x_{ij}) - \sum_{j=1}^p\sum_{k=1}^K \beta_{jk}B_{jk}(x_{ij}) - k_1 e_i\Big)^2 + e_i\Big].$$
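The step-6 draw can use the standard fact that the reciprocal of a Gamma(shape, rate) variate is inverse gamma; a sketch, where `resid` denotes $y_i$ minus all fitted terms except $k_1 e_i$:

```r
## One draw of delta (step 6): if g ~ Gamma(shape, rate), then 1/g ~ IG(shape, rate)
draw_delta <- function(resid, e, k1, k2, a1 = 0.5, a2 = 0.5) {
  n <- length(resid)
  nu <- a2 + sum((resid - k1 * e)^2 / (2 * k2 * e) + e)
  1 / rgamma(1, shape = a1 + 3 * n / 2, rate = nu)
}
```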
7. Sample $\sigma_j^2$, $j = 1, \ldots, p$, and $\tau_j$, $j = 1, \ldots, p$, from their conditional posterior distributions if $\gamma_j^{(\alpha)} = 1$ and $\gamma_j^{(\beta)} = 1$, respectively,

$$\sigma_j^2 \sim IG(a_1 + 1/2,\ a_2 + \alpha_j^2/2), \qquad \tau_j \sim IG(a_1 + K/2,\ a_2 + \beta_j^T\Omega_j\beta_j/2).$$

Otherwise they are generated from their priors.

8. The full conditional distribution of $e_i$, $i = 1, \ldots, n$, is a generalized inverse Gaussian (GIG) distribution,

$$p(e_i \mid \delta, \nu_i) \sim GIG\Big(\frac{1}{2},\ \sqrt{\frac{(y_i - \nu_i)^2}{k_2\delta}},\ \sqrt{\frac{k_1^2}{k_2\delta} + \frac{2}{\delta}}\Big), \qquad \nu_i = \mu + \sum_{j=1}^p \alpha_j B_{j0}(x_{ij}) + \sum_{j=1}^p\sum_{k=1}^K \beta_{jk}B_{jk}(x_{ij}),$$

where the probability density function of $GIG(\rho, m, n)$ is
$$f(x \mid \rho, m, n) = \frac{(n/m)^\rho}{2K_\rho(mn)}\,x^{\rho - 1}\exp\Big\{-\frac{1}{2}\big(m^2 x^{-1} + n^2 x\big)\Big\}, \quad x > 0,\ -\infty < \rho < \infty,\ m \geq 0,\ n \geq 0,$$

and $K_\rho$ is the modified Bessel function of the third kind (Barndorff-Nielsen and Shephard (2001)).
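Step 8 can be implemented with an off-the-shelf GIG sampler; the sketch below assumes the GIGrvg package, whose $(\lambda, \chi, \psi)$ parametrization has density proportional to $x^{\lambda - 1}\exp\{-(\chi/x + \psi x)/2\}$, so the $(\rho, m, n)$ parametrization above maps to $(\rho, m^2, n^2)$.

```r
## One draw of each latent e_i (step 8), mapping the paper's GIG(rho, m, n)
## to GIGrvg's (lambda, chi, psi) = (rho, m^2, n^2).
## y: responses; nu: n-vector of fitted values nu_i
draw_e <- function(y, nu, delta, k1, k2) {
  chi <- (y - nu)^2 / (k2 * delta)        # m^2, one value per observation
  psi <- k1^2 / (k2 * delta) + 2 / delta  # n^2, common to all observations
  vapply(chi, function(ci) GIGrvg::rgig(1, lambda = 0.5, chi = ci, psi = psi),
         numeric(1))
}
```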