Model detection and variable selection for mode varying coefficient model
Xuejun Ma∗, Yue Du†, Jingli Wang‡

Abstract
The varying coefficient model is often used in statistical modeling since it is more flexible than the parametric model. However, model detection and variable selection for the varying coefficient model are poorly understood in mode regression; existing methods for these problems are typically based on mean regression or quantile regression. In this paper, we propose a novel method to solve these problems for the mode varying coefficient model based on the B-spline approximation and the SCAD penalty. Moreover, we present a new algorithm to estimate the parameters of interest, and discuss the selection of the tuning parameters and the bandwidth. We also establish the asymptotic properties of the estimated coefficients under some regularity conditions. Finally, we illustrate the proposed method with simulation studies and an empirical example.
Keywords: B-spline, SCAD penalty, mode regression, model detection, variable selection
MSC2010 subject classifications: Primary 62J07; secondary 62G08.
1 Introduction

Suppose that Y_i is a response variable and (U_i, X_i) is the associated covariate, i = 1, …, n. The varying coefficient model (VCM) [6] takes the form

Y_i = X_i^⊤ α(U_i) + ε_i,  (1.1)

where X_i = (X_i^{(0)}, …, X_i^{(p)})^⊤ is the (p+1)-dimensional design vector with first element X_i^{(0)} = 1, α(U) = (α_0(U), …, α_p(U))^⊤, and each α_j(U) is an unknown smooth function, j = 0, …, p. U_i is a univariate index variable; without loss of generality, we assume U ranges over the unit interval [0, 1]. ε_i is the random error. The VCM is attractive because the coefficients α_j(U) depend on U, which not only reduces the modeling bias but also avoids the "curse of dimensionality". Under various assumptions, a large number of varying coefficient models have been developed, such as the additive model [5], the partially linear model [4], and the single-index function coefficient regression model [16].

In statistical modeling, variable selection is a significant issue. Penalized techniques conduct variable selection by shrinking the coefficients of redundant variables to zero, and various powerful penalization methods are available, such as the LASSO [14], the SCAD [3], the adaptive LASSO [24], and the MCP [21]. Undoubtedly, variable selection is crucial.

∗School of Mathematical Sciences, Soochow University, 1 Shizi Street, Suzhou, 215006, China. [email protected]
†The corresponding author. School of Mathematical Sciences, Soochow University, 1 Shizi Street, Suzhou, 215006, China. [email protected]
‡School of Statistics and Data Sciences, Nankai University, Tianjin, China. [email protected]
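To fix ideas, data from a model of the form (1.1) can be generated as in the following minimal sketch (the sample size, the coefficient functions α_j, and the error scale are illustrative choices of ours, not settings used in this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2  # sample size and number of non-intercept covariates (illustrative)

U = rng.uniform(0.0, 1.0, size=n)             # index variable on [0, 1]
X = np.column_stack([np.ones(n),              # X_i^(0) = 1 (intercept column)
                     rng.normal(size=(n, p))])

# illustrative smooth coefficient functions alpha_j(U)
alpha = np.column_stack([np.sin(2 * np.pi * U),  # varying intercept alpha_0(U)
                         2.0 - U,                # varying coefficient alpha_1(U)
                         np.full(n, 1.5)])       # constant coefficient alpha_2(U)

eps = rng.normal(scale=0.5, size=n)           # random error
Y = np.sum(X * alpha, axis=1) + eps           # Y_i = X_i^T alpha(U_i) + eps_i
```

In such data, the goal of model detection is to recognize that α_2 is constant while α_0 and α_1 genuinely vary with U.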
2 Methodology

2.1 Model detection and estimation

We use the B-spline method to approximate the unknown smooth functions α_j(U), j = 0, 1, …, p. The B-spline basis functions of degree d can be written as B̃(U) = (B_1(U), …, B_q(U))^⊤, where q = k_n + d + 1 and k_n is the number of interior knots. There exists a transformation matrix G [12] such that GB̃(U) = (1, B̄(U)^⊤)^⊤ =: B(U). Then α_j(U) can be approximated by

α_j(U) ≈ B(U)^⊤ γ_j = γ_{j1} + B̄^⊤(U) γ_{j*},  j = 0, 1, …, p,  (2.1)

where γ_j = (γ_{j1}, γ_{j*}^⊤)^⊤ and γ_{j*} = (γ_{j2}, …, γ_{jq})^⊤. Obviously, if ||γ_{j*}||_{L_2} = (Σ_{k=2}^q γ_{jk}²)^{1/2} = 0, the variable X^{(j)} has only a constant effect. Then, based on mode regression, we can estimate γ by maximizing

Q(γ) = Σ_{i=1}^n K_h(Y_i − Π_i^⊤ γ),  (2.2)

where Π_i = (X_i^{(0)} B(U_i)^⊤, …, X_i^{(p)} B(U_i)^⊤)^⊤, γ = (γ_0^⊤, …, γ_p^⊤)^⊤, K_h(·) = h^{−1} K(·/h), K(·) is a kernel function, and h is the bandwidth. We take K(·) to be the standard normal density in this paper. Hence, the estimation of the VCM is transformed into a linear regression analysis.

We now discuss the penalized mode VCM. The estimator of γ can be obtained by maximizing the penalized objective

Q(γ) − n Σ_{j=1}^p p_{λ_{1j}}(||γ_{j*}||_{L_2}) − n Σ_{j=1}^p p_{λ_{2j}}(|γ_{j1}|) I(||γ_{j*}||_{L_2} = 0),  (2.3)

which is equivalent to minimizing

l(γ) = −Q(γ) + n Σ_{j=1}^p p_{λ_{1j}}(||γ_{j*}||_{L_2}) + n Σ_{j=1}^p p_{λ_{2j}}(|γ_{j1}|) I(||γ_{j*}||_{L_2} = 0),  (2.4)

where λ_{1j}, λ_{2j} > 0 are tuning parameters and p_λ(·) is the SCAD penalty function

p_{λ,a}(θ) = λθ I(0 ≤ θ ≤ λ) + [(aλθ − 0.5(θ² + λ²))/(a − 1)] I(λ < θ ≤ aλ) + [λ²(a² − 1)/(2(a − 1))] I(θ > aλ).  (2.5)

Fan and Li [3] suggested a = 3.7. The first penalty term involves the L_2 norm of the varying part of each coefficient. Tang et al. [13] applied the L_1 norm in mean and quantile VCMs; however, the L_1 norm does not treat γ_{j*} as a group, so it is more inclined to classify constant coefficients as varying coefficients. The second term is designed to select the zero constant effects.
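For concreteness, the kernel objective (2.2) and the SCAD penalty (2.5) can be written down directly; the sketch below is ours (the function names are hypothetical labels, and a = 3.7 follows Fan and Li [3]):

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_{lam,a}(theta) of Fan and Li [3], applied elementwise to |theta|."""
    t = np.abs(np.asarray(theta, dtype=float))
    flat = lam * t                                               # |theta| <= lam
    mid = (a * lam * t - 0.5 * (t ** 2 + lam ** 2)) / (a - 1.0)  # lam < |theta| <= a*lam
    tail = lam ** 2 * (a ** 2 - 1.0) / (2.0 * (a - 1.0))         # |theta| > a*lam
    return np.where(t <= lam, flat, np.where(t <= a * lam, mid, tail))

def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_{lam,a}(theta), used later in the local quadratic approximation."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def mode_objective(resid, h):
    """Q in (2.2): sum of K_h(residuals), with K the standard normal density."""
    r = np.asarray(resid, dtype=float) / h
    return np.sum(np.exp(-0.5 * r ** 2) / (np.sqrt(2.0 * np.pi) * h))
```

Note that the penalty is continuous at θ = λ and θ = aλ, and its derivative vanishes beyond aλ, which is what makes large coefficients essentially unpenalized.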
If the variable X^{(j)} has a constant coefficient, all components of γ_{j*} are shrunk exactly to zero. On the other hand, if the variable X^{(j)} has no effect at all, both γ_{j*} and γ_{j1} are shrunk to zero. In practice, the indicator function makes it very difficult to minimize the objective (2.4) directly; therefore, we implement a two-step iterative procedure to estimate the coefficients.

Step 1:
First, we obtain γ̂^{VC} by minimizing the penalized function

l(γ^{VC}) = −Q(γ^{VC}) + n Σ_{j=1}^p p_{λ_{1j}}(||γ^{VC}_{j*}||_{L_2}).  (2.6)

We penalize each varying coefficient by the L_2 norm of γ^{VC}_{j*}. When the j-th covariate has a constant effect, γ^{VC}_{j*} is shrunk to zero, which yields an automatic separation of the varying and constant effects.

Step 2:
After Step 1, the model becomes a partially linear varying coefficient model. Let γ^{CZ}_j = γ^{VC}_j if the variable X^{(j)} has a varying effect, and γ^{CZ}_j = (γ^{VC}_{j1}, 0^⊤_{q−1})^⊤ if the variable X^{(j)} has a constant effect. We obtain the estimator γ̂^{CZ} by minimizing

l(γ^{CZ}) = −Q(γ^{CZ}) + n Σ_{j=1}^p p_{λ_{2j}}(|γ^{CZ}_{j1}|) I(||γ̂^{VC}_{j*}||_{L_2} = 0).  (2.7)

Here, we apply the penalty λ_{2j} to each constant coefficient, aiming to exclude the irrelevant variables by assigning the SCAD penalty only to the terms that have been determined to be constant.

Step 3:
Iterate Steps 1 and 2 until convergence, and denote the resulting estimator of γ by γ̂. Then α_j(U) can be estimated by α̂_j(U) = B(U)^⊤ γ̂_j when the variable X^{(j)} has a varying effect, and by α̂_j(U) = γ̂_{j1} when the variable X^{(j)} has only a constant effect.

2.2 Algorithm

In this subsection, we extend the method proposed in [13] and the Modal Expectation-Maximization (MEM) algorithm [18] to minimize Eq. (2.4). Because directly minimizing Eq. (2.4) is difficult, we first locally approximate the penalty function p_λ(·) by the quadratic function proposed by Fan and Li [3]:

p_λ(|ω|) ≈ p_λ(|ω_0|) + (1/2) [p'_λ(|ω_0|)/|ω_0|] (ω² − ω_0²),  for ω ≈ ω_0.  (2.8)

Then, for a given initial value γ^{(0)}, we have

p_{λ_{1j}}(||γ_{j*}||_{L_2}) ≈ p_{λ_{1j}}(||γ^{(0)}_{j*}||_{L_2}) + (1/2) [p'_{λ_{1j}}(||γ^{(0)}_{j*}||_{L_2}) / ||γ^{(0)}_{j*}||_{L_2}] (||γ_{j*}||²_{L_2} − ||γ^{(0)}_{j*}||²_{L_2}),  (2.9)

p_{λ_{2j}}(|γ_{j1}|) ≈ p_{λ_{2j}}(|γ^{(0)}_{j1}|) + (1/2) [p'_{λ_{2j}}(|γ^{(0)}_{j1}|) / |γ^{(0)}_{j1}|] (γ²_{j1} − (γ^{(0)}_{j1})²).  (2.10)

In this paper, γ^{(0)} is chosen as the unpenalized least squares estimator

γ^{(0)} = (Π^⊤Π)^{−1} Π^⊤ Y,  (2.11)

where Π = (Π_1, …, Π_n)^⊤ and Y = (Y_1, …, Y_n)^⊤. The algorithm consists of two main steps, each containing an EM iteration, and yields the sparse estimators as follows.

(1) Step 1:
Computation of γ̂^{VC(t+1)}. After the t-th iteration we have γ̂^{CZ(t)}. Starting from γ̂^{VC(t+1,0)} = γ̂^{CZ(t)}:

• E-step:

π(i | γ̂^{VC(t+1,m)}) = K_h(Y_i − Π_i^⊤ γ̂^{VC(t+1,m)}) / Σ_{i'=1}^n K_h(Y_{i'} − Π_{i'}^⊤ γ̂^{VC(t+1,m)}),  i = 1, …, n.  (2.12)

• M-step:

γ̂^{VC(t+1,m+1)} = arg min_γ { Σ_{i=1}^n [−π(i | γ̂^{VC(t+1,m)}) log K_h(Y_i − Π_i^⊤ γ)] + n γ^⊤ Σ_{λ_1}(γ̂^{VC(t+1,m)}) γ }
= (Π^⊤ W Π + n Σ_{λ_1}(γ̂^{VC(t+1,m)}))^{−1} Π^⊤ W Y,  (2.13)

where

Σ_{λ_1}(γ̂^{VC(t+1,m)}) = diag{0^⊤_q, Σ_{λ_{11}}, …, Σ_{λ_{1p}}}  (2.14)

with Σ_{λ_{1j}} = [p'_{λ_{1j}}(||γ̂^{VC(t+1,m)}_{j*}||_{L_2}) / ||γ̂^{VC(t+1,m)}_{j*}||_{L_2}] (0, 1^⊤_{q−1})^⊤, j = 1, …, p, and W is an n × n diagonal matrix with diagonal elements π(i | γ̂^{VC(t+1,m)}).

• Iterate the E-step and M-step until convergence is reached, which yields γ̂^{VC(t+1)}; once ||γ̂^{VC(t+1)}_{j*}||_{L_2} < ε_0 for a small threshold ε_0, we set γ̂^{VC(t+1)}_{j*} = 0. With this rule the algorithm performs well in our simulations.

(2) Step 2:
Computation of γ̂^{CZ(t+1)}. After Step 1, we set, for j = 1, …, p,

γ̂^{CZ(t+1,0)}_j = γ̂^{VC(t+1)}_j if γ̂^{VC(t+1)}_{j*} ≠ 0, and γ̂^{CZ(t+1,0)}_j = (γ̂^{VC(t+1)}_{j1}, 0^⊤_{q−1})^⊤ if γ̂^{VC(t+1)}_{j*} = 0.  (2.15)

• E-step:

π(i | γ̂^{CZ(t+1,m)}) = K_h(Y_i − Π_i^⊤ γ̂^{CZ(t+1,m)}) / Σ_{i'=1}^n K_h(Y_{i'} − Π_{i'}^⊤ γ̂^{CZ(t+1,m)}),  i = 1, …, n.  (2.16)

• M-step:

γ̂^{CZ(t+1,m+1)} = arg min_γ { Σ_{i=1}^n [−π(i | γ̂^{CZ(t+1,m)}) log K_h(Y_i − Π_i^⊤ γ)] + n γ^⊤ Σ_{λ_2}(γ̂^{CZ(t+1,m)}) γ }
= (Π^⊤ W Π + n Σ_{λ_2}(γ̂^{CZ(t+1,m)}))^{−1} Π^⊤ W Y,  (2.17)

where

Σ_{λ_2}(γ̂^{CZ(t+1,m)}) = diag{0^⊤_q, Σ_{λ_{21}}, …, Σ_{λ_{2p}}}  (2.18)

with Σ_{λ_{2j}} = [p'_{λ_{2j}}(|γ̂^{CZ(t+1,m)}_{j1}|) / |γ̂^{CZ(t+1,m)}_{j1}|] (1, 0^⊤_{q−1})^⊤, j = 1, …, p, and W is an n × n diagonal matrix with diagonal elements π(i | γ̂^{CZ(t+1,m)}).

• Similar to
Step 1, we iterate the EM procedure until convergence and obtain γ̂^{CZ(t+1)}; we also set γ̂^{CZ(t+1)}_{j1} = 0 whenever |γ̂^{CZ(t+1)}_{j1}| < ε_0.

(3) Step 3: Iterate Steps 1 and 2 until convergence.
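One EM cycle of the above algorithm can be sketched as follows (a simplified illustration of the E-step (2.12)/(2.16) and the ridge-type M-step (2.13)/(2.17); the function name, the convergence settings, and the argument Sigma_lam, which stands for the penalty matrix Σ_λ of (2.14) or (2.18), are our own choices):

```python
import numpy as np

def k_h(r, h):
    """Gaussian kernel density K_h(r) = K(r/h)/h, with K the standard normal density."""
    return np.exp(-0.5 * (r / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def mem_step(Pi, Y, gamma0, Sigma_lam, h, max_iter=100, tol=1e-8):
    """Alternate the E-step (mode weights) and the weighted ridge M-step until convergence."""
    n = len(Y)
    gamma = np.asarray(gamma0, dtype=float)
    for _ in range(max_iter):
        w = k_h(Y - Pi @ gamma, h)
        w = w / w.sum()                                   # E-step: pi(i | gamma)
        W = np.diag(w)
        gamma_new = np.linalg.solve(Pi.T @ W @ Pi + n * Sigma_lam,
                                    Pi.T @ W @ Y)         # M-step, cf. (2.13)/(2.17)
        if np.max(np.abs(gamma_new - gamma)) < tol:
            return gamma_new
        gamma = gamma_new
    return gamma
```

With Sigma_lam set to a zero matrix, this reduces to an iteratively reweighted least squares scheme for unpenalized mode regression.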
2.3 Selection of tuning parameters

To implement the above method, we must determine the tuning parameters λ_{1j} and λ_{2j}, the degree d of the B-spline, the number of interior knots k_n, and the bandwidth h. In our numerical studies, we use cubic splines, i.e., d = 3. The values of h, k_n, λ_{1j}, and λ_{2j} are chosen in each iteration of the two-step procedure; after the t-th iteration, the selections proceed as follows.

• Selection of the bandwidth. We provide a bandwidth selection method for the practical use of the mode regression estimator. We first estimate F(x, u, h) = E(K''_h(ε) | X = x, U = u) and G(x, u, h) = E({K'_h(ε)}² | X = x, U = u) by [17]

F̂(h) = (1/n) Σ_{i=1}^n K''_h(ε̂_i),  (2.19)
Ĝ(h) = (1/n) Σ_{i=1}^n {K'_h(ε̂_i)}²,  (2.20)

where ε̂_i = Y_i − Π_i^⊤ γ̂^{CZ(t)} and σ̂² = (1/n) Σ_{i=1}^n ε̂_i². Then we obtain ĥ_opt by minimizing

r̂(h) = Ĝ(h) F̂(h)^{−2} σ̂^{−2}.  (2.21)

A grid search may be used to estimate h; following the advice of Yao et al. [17], we take h = 0.5σ̂ × 1.02^j, j = 0, 1, …, l, for some fixed l such as l = 50 or 100.

• Selection of the interior knots. We use a SIC-type criterion and choose k_n to minimize

SIC(k) = −log Q(γ̂^{CZ(t)}) + (log n / n)(v_m(k + d + 1) + c_m),  (2.22)

where v_m denotes the number of covariates with varying effects and c_m the number of covariates with constant effects.

• Selection of the tuning parameters. We also use SIC-type criteria and choose λ_1 and λ_2 to minimize

SIC(λ_1) = −log Q(γ̂^{VC(t)}_{λ_1}) + (log n / n) edf_1,  (2.23)
SIC(λ_2) = −log Q(γ̂^{CZ(t)}_{λ_2}) + (log n / n) edf_2,  (2.24)

where γ̂^{VC(t)}_{λ_1} and γ̂^{CZ(t)}_{λ_2} are the minimizers of Eq. (2.6) and Eq. (2.7) with

λ_{1j} = λ_1 / ||γ̂^{VC(t)}_{j*}||_{L_2},  j = 1, …, p,  (2.25)

λ_{2j} defined analogously, and edf_1, edf_2 defined as the total numbers of varying and nonzero constant coefficients, respectively [15].

2.4 Asymptotic properties

Let {α_j(·), j = 0, …, v} be the varying coefficients, {α_j(·), j = v+1, …, s} be the nonzero constant coefficients, and α_j(·) = 0 for j = s+1, …, p. We assume the following regularity conditions.

(C1) The varying coefficient functions α_j(·), j = 1, …, v, are t times continuously differentiable on [0, 1], with t > 2.

(C2) The density function f_U(·) is continuous and bounded away from zero and infinity on [0, 1].

(C3) {X_i, i = 1, …, n} are uniformly bounded in probability, and the positive definite matrix E(XX^⊤ | U = u) has bounded eigenvalues.

(C4) The tuning parameters satisfy n^{t/(2t+1)} min_j{λ_{1j}, λ_{2j}} → ∞ and max_j{λ_{1j}, λ_{2j}} → 0 as n → ∞, and

lim inf_{n→∞} lim inf_{||γ_{j*}||_{L_2} → 0+} p'_{λ_{1j}}(||γ_{j*}||_{L_2}) / λ_{1j} > 0,  j = v+1, …, p,  (2.26)
lim inf_{n→∞} lim inf_{γ_{j1} → 0+} p'_{λ_{2j}}(|γ_{j1}|) / λ_{2j} > 0,  j = s+1, …, p.  (2.27)

Theorem 1.
Suppose conditions (C1)-(C4) hold and k_n ∼ n^{1/(2t+1)}. Then, with probability approaching 1, the estimators α̂_j(·), j = v+1, …, s, are nonzero constants, and α̂_j(·) = 0 for j = s+1, …, p.

Theorem 2.
Suppose conditions (C1)-(C3) hold and k_n ∼ n^{1/(2t+1)}. Then

||α̂_j(·) − α_j(·)||_{L_2} = O_p(n^{−t/(2t+1)} + a_n),  j = 0, …, v,  (2.28)

where a_n is defined in the Appendix.

3 Numerical studies

3.1 Simulation

We generate random samples from the following two models:
Model 1:

Y_i = 15 + 20 sin(2πU_i) + (2 − U_i − π/ )X_{i1} + (6 − U_i)X_{i2} + 0. X_{i3} + 2X_{i4} + δ_1 ε_i,  (3.1)

where U_i ∼ Uniform(0, 1), X_{ij} comes from N(0, 1) for j = 1, …, p, and the error scale δ_1 is a function of X_i.

Model 2:

Y_i = 2 exp(1 − U_i) + (1. (πU_i))X_{i1} + (0. U_i(1 − U_i)(U_i − 0. ))X_{i2} + (2 − (πU_i))X_{i3} + 2X_{i4} + 0. X_{i5} − 0. X_{i6} + δ_2 ε_i,  (3.2)

where U_i ∼ Uniform(0, 1), (X_{i1}, …, X_{ip})^⊤ is generated from a multivariate normal distribution with Cov(X_{ik}, X_{ij}) = ρ^{|k−j|} for any 1 ≤ j, k ≤ p, and the error scale δ_2 is a function of X_i.

We consider the following four error cases:

• Case 1: ε_i ∼ N(0, 1).
• Case 2: ε_i ∼ t(3).
• Case 3: ε_i follows a Laplace distribution centered at zero.
• Case 4: ε_i follows the two-component normal mixture 0.5N(−1, ·) + 0.5N(1, ·), for which E(ε_i) = 0 and mode(ε_i) = 1.

The sample size is n = 500, and p varies from 10 to 30. We repeat each setting 500 times and compare the proposed method (VCEM) with the basis expansion method (BSE_{L1}) based on mean regression [13]. Because the method using the L_1 norm tends to set constant coefficients to varying coefficients, we also replace the L_1 norm with the L_2 norm in BSE_{L1} and denote the resulting method by BSE_{L2}. To evaluate the three methods, we consider the following criteria:

• SV: the average number of true varying coefficients (excluding the intercept) correctly selected as varying coefficients.
• SC: the average number of true constant coefficients correctly selected as constant coefficients.
• SZ: the average number of redundant variables correctly selected as redundant variables.

Table 1: Simulation results for Model 1 (only the oracle row is recoverable)

p    Method   SV     SC     SZ
10   Oracle   2.000  2.000  6.000

From Tables 1 and 2, we draw the following conclusions:

(1) For the selection of varying coefficients, the SV of VCEM is comparable to that of BSE_{L1} and BSE_{L2}.
They perform similarly in the selection of redundant variables.

(2) For the selection of constant coefficients, VCEM is superior to BSE_{L1} and BSE_{L2}, since the SC of VCEM is larger.

(3) Even in the normal error case, VCEM performs no worse than BSE_{L1} and BSE_{L2}. Moreover, when the error distribution is asymmetric or heavy-tailed, VCEM is better, since mode regression puts more weight on the "most likely" data around the true value, which leads to a robust and efficient estimator.

(4) BSE_{L2} is better than BSE_{L1}, because the L_2 norm treats the basis coefficients of each function as a group.

Table 2: Simulation results for Model 2 (only the oracle row is recoverable)

p    Method   SV     SC     SZ
10   Oracle   3.000  3.000  4.000

3.2 Real data analysis

In this section, to illustrate the usefulness of the proposed procedure, we apply it to the Boston housing data. This data set concerns the median value of owner-occupied homes; it contains 506 observations on 14 variables and is available in the R package "mlbench". We take MEDV (median value of owner-occupied homes in $1000s) as the response variable. The covariates are: CRIM (per capita crime rate by town), ZN (proportion of residential land zoned for lots over 25,000 sq. ft.), INDUS (proportion of non-retail business acres per town), CHAS (Charles River dummy variable; 1 if the tract bounds the river, 0 otherwise), NOX (nitrogen oxides concentration, parts per 10 million), RM (average number of rooms per dwelling), AGE (proportion of owner-occupied units built prior to 1940), DIS (weighted mean of distances to five Boston employment centres), RAD (index of accessibility to radial highways), TAX (full-value property-tax rate per $10,000), PTRATIO (pupil-teacher ratio by town), and BLACK (1000(Bk − 0.63)², where Bk is the proportion of blacks by town). LSTAT (percentage of lower-status population) serves as the index variable. The response variable and covariates are standardized.
The index variable LSTAT is transformed into the interval [0, 1]. To compare the three methods, we compute the mean squared error

MSE = (1/506) Σ_{i=1}^{506} (y_i − ŷ_i)².  (3.3)

From Table 3, the results of the three methods are comparable. We find that AGE is an irrelevant variable, while CRIM, INDUS, and RAD have varying effects. PTRATIO has a negative effect according to BSE_{L2} and VCEM, RM has a positive effect according to BSE_{L1} and VCEM, and DIS has a negative effect according to BSE_{L1} and BSE_{L2}. Moreover, the MSE of VCEM is smaller than those of the other methods.

Table 3: Results for the Boston housing data ("V" denotes a varying effect, "0" no effect)

Variable   BSE_{L1}  BSE_{L2}  VCEM
CRIM       V         V         V
ZN         0         V         V
INDUS      V         V         V
CHAS       V         V         0
NOX        V         0         -0.159
RM         0.290     V         0.388
AGE        0         0         0
DIS        -0.088    -0.023    V
RAD        V         V         V
TAX        0         0         V
PTRATIO    V         -0.038    -0.192
BLACK      V         0         V
MSE        0.202     0.202     0.198
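The preprocessing and the criterion (3.3) amount to the following sketch (the min-max mapping of the index variable onto [0, 1] is our assumption; the paper only states that LSTAT is transformed to [0, 1]):

```python
import numpy as np

def standardize(v):
    """Center and scale a variable, as done for the Boston response and covariates."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def to_unit_interval(u):
    """Map the index variable (LSTAT) onto [0, 1]; a min-max transformation is assumed."""
    u = np.asarray(u, dtype=float)
    return (u - u.min()) / (u.max() - u.min())

def mse(y, y_hat):
    """Mean squared error criterion (3.3)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))
```

The fitted values ŷ_i would come from the selected model of each method; they are not reproduced here.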
4 Conclusion

In this paper, we have studied model detection for varying coefficient models, together with the selection of variables with nonzero varying effects and of variables with nonzero constant effects, based on mode regression. We not only propose the VCEM algorithm but also establish its asymptotic properties under mild regularity conditions. Numerical simulations show that the proposed method is effective. In future work, we would like to extend the proposed method to generalized varying coefficient models.
References

[1] Cai, Z., Fan, J. & Li, R. (2000). Efficient estimation and inferences for varying-coefficient models. Journal of the American Statistical Association 95(451): 888-902.
[2] Fan, J. & Huang, T. (2005). Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 11(6): 1031-1057.
[3] Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456): 1348-1360.
[4] Härdle, W., Liang, H. & Gao, J. (2000). Partially Linear Models. Heidelberg: Physica-Verlag.
[5] Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. London: Chapman & Hall/CRC.
[6] Hastie, T. & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B 55(4): 757-796.
[7] Huang, J., Wu, C. & Zhou, L. (2002). Varying-coefficient models and basis function approximation for the analysis of repeated measurements. Biometrika 89(1): 111-128.
[8] Hu, T. & Xia, Y. (2012). Adaptive semi-varying coefficient model selection. Statistica Sinica 22(2): 575-599.
[9] Lee, M. (1989). Mode regression. Journal of Econometrics 42(3): 337-349.
[10] Li, Q. & Racine, J. (2010). Smooth varying-coefficient estimation and inference for qualitative and quantitative data. Econometric Theory 26(6): 1607-1637.
[11] Ma, X. & Zhang, J. (2016). A new variable selection approach for varying coefficient models. Metrika 79(1): 59-72.
[12] Schumaker, L. (1981). Spline Functions: Basic Theory. New York: Wiley.
[13] Tang, Y., Wang, H., Zhu, Z. & Song, X. (2012). A unified variable selection approach for varying coefficient models. Statistica Sinica 22(2): 601-628.
[14] Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B 58(1): 267-288.
[15] Wang, H. & Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association 104(486): 747-757.
[16] Xia, Y. & Li, W. (1999). On the estimation and testing of functional-coefficient linear models. Statistica Sinica 9(3): 735-757.
[17] Yao, W., Lindsay, B. & Li, R. (2012). Local modal regression. Journal of Nonparametric Statistics 24(3): 647-663.
[18] Yao, W. & Li, L. (2013). A new regression model: modal linear regression. Scandinavian Journal of Statistics 41(3): 656-671.
[19] Yuan, M. & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68(1): 49-67.
[20] Zhao, W., Zhang, R., Liu, J. & Lv, Y. (2014). Robust and efficient variable selection for semiparametric partially linear varying coefficient model based on modal regression. Annals of the Institute of Statistical Mathematics 66(1): 165-191.
[21] Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2): 894-942.
[22] Zhang, R., Zhao, W. & Liu, J. (2013). Robust estimation and variable selection for semiparametric partially linear varying coefficient model based on modal regression. Journal of Nonparametric Statistics 25(2): 523-544.
[23] Zhu, H., Li, R. & Kong, L. (2003). Multivariate varying coefficient model for functional responses. Biometrics 59(2): 263-273.
[24] Zou, H. (2006). The adaptive LASSO and its oracle properties. Journal of the American Statistical Association 101(476): 1418-1429.
Appendix
Lemma A.1. Suppose conditions (C1)-(C4) hold. Let γ^best denote the best approximation to γ, and let ε_1, a_1, a_2 be positive constants such that

(1) ||γ^best_{j*}||_{L_2} > ε_1 for j = 1, …, v; γ^best_j = (γ_{j1}, 0^⊤_{q−1})^⊤ for j = v+1, …, s; and γ^best_j = 0 for j = s+1, …, p;
(2) sup_U |α_j(U) − B(U)^⊤ γ^best_j| ≤ a_1 k_n^{−t}, j = 1, …, v;
(3) sup_U |Π^⊤ γ^best − X^⊤ α(U)| ≤ a_2 k_n^{−t}.

Lemma A.2. Let δ = O(n^{−t/(2t+1)}) and define γ = γ^best + δv. Given ρ > 0, there exists a large constant C such that [20]

P{ sup_{||v|| = C} l(γ) > l(γ^best) } ≥ 1 − ρ.

Proof of Theorem 2.
Similar to the proof of Theorem 1 in [20], let a_n = max_j |p'_{λ_{1j}}(||γ_{j*}||_{L_2})| and b_n = max_j |p''_{λ_{1j}}(||γ_{j*}||_{L_2})|. As b_n → 0, we have ||γ̂ − γ^best|| = O_p(n^{−t/(2t+1)} + a_n). Therefore,

||α̂_j(U) − α_j(U)||²_{L_2} = ∫ (α̂_j(U) − α_j(U))² dU
= ∫ (B(U)^⊤γ̂_j − B(U)^⊤γ^best_j + B(U)^⊤γ^best_j − α_j(U))² dU
≤ 2 ∫ (B(U)^⊤γ̂_j − B(U)^⊤γ^best_j)² dU + 2 ∫ (B(U)^⊤γ^best_j − α_j(U))² dU
= 2 (γ̂_j − γ^best_j)^⊤ [∫ B(U)B(U)^⊤ dU] (γ̂_j − γ^best_j) + 2 ∫ (B(U)^⊤γ^best_j − α_j(U))² dU.

Because ||∫ B(U)B(U)^⊤ dU|| = O(1), we have

(γ̂_j − γ^best_j)^⊤ [∫ B(U)B(U)^⊤ dU] (γ̂_j − γ^best_j) = O_p((n^{−t/(2t+1)} + a_n)²),

and, by Lemma A.1,

∫ (B(U)^⊤γ^best_j − α_j(U))² dU = O_p(n^{−2t/(2t+1)}).

Consequently, ||α̂_j(·) − α_j(·)||_{L_2} = O_p(n^{−t/(2t+1)} + a_n), which completes the proof.

Proof of Theorem 1. By the properties of the SCAD penalty and max_j{λ_{1j}, λ_{2j}} → 0 as n → ∞, we have a_n = 0 for n large enough; then, by Theorem 2, ||γ̂ − γ^best|| = O_p(n^{−t/(2t+1)}).

First, if γ_{j*} = 0, then α_j(U) is clearly a constant. If γ_{j*} ≠ 0, we have

∂l(γ)/∂γ_{j*} = Σ_{i=1}^n K'_h(Y_i − Π_i^⊤γ) X_i^{(j)} B̄(U_i) + n [p'_{λ_{1j}}(||γ_{j*}||_{L_2}) / ||γ_{j*}||_{L_2}] γ_{j*}
= Σ_{i=1}^n { K'_h(ε_i + D_{ni}) X_i^{(j)} B̄(U_i) − K''_h(ε_i + D_{ni}) X_i^{(j)} B̄(U_i) [Π_i^⊤(γ − γ^best)] + (1/2) K'''_h(η_i) X_i^{(j)} B̄(U_i) [Π_i^⊤(γ − γ^best)]² } + n [p'_{λ_{1j}}(||γ_{j*}||_{L_2}) / ||γ_{j*}||_{L_2}] γ_{j*}
= nλ_{1j} { O_p(λ_{1j}^{−1} n^{−t/(2t+1)}) + λ_{1j}^{−1} [p'_{λ_{1j}}(||γ_{j*}||_{L_2}) / ||γ_{j*}||_{L_2}] γ_{j*} },

where η_i lies between Y_i − Π_i^⊤γ and ε_i + D_{ni}, ε_i = Y_i − X_i^⊤α(U_i), and D_{ni} = X_i^⊤α(U_i) − Π_i^⊤γ^best. Since sup_U ||B̄(U)|| = O(1) and, by condition (C4), n^{t/(2t+1)} min_j{λ_{1j}, λ_{2j}} → ∞ and lim inf_{n→∞} lim inf_{||γ_{j*}||_{L_2}→0+} p'_{λ_{1j}}(||γ_{j*}||_{L_2})/λ_{1j} > 0 for j = v+1, …, p, the sign of the derivative is completely determined by the penalty term. Hence, together with Lemma A.2, l(γ) attains its minimum at γ̂^{VC}_{j*} = 0, and α̂_j(U) ≈ γ̂^{VC}_{j1} + B̄^⊤(U)γ̂^{VC}_{j*} = γ̂^{VC}_{j1}; that is, α̂_j(U), j = v+1, …, p, are constants.

Second, since α̂_j, j = v+1, …, p, have been shown to be constant, it suffices to prove γ̂^{CZ}_{j1} = 0 in order to obtain α̂_j = 0 for j = s+1, …, p. We have

∂l(γ)/∂γ_{j1} = Σ_{i=1}^n K'_h(Y_i − Π_i^⊤γ) X_i^{(j)} + n p'_{λ_{2j}}(|γ_{j1}|) sgn(γ_{j1})
= Σ_{i=1}^n { K'_h(ε_i + D_{ni}) X_i^{(j)} − K''_h(ε_i + D_{ni}) X_i^{(j)} [Π_i^⊤(γ − γ^best)] + (1/2) K'''_h(η_i) X_i^{(j)} [Π_i^⊤(γ − γ^best)]² } + n p'_{λ_{2j}}(|γ_{j1}|) sgn(γ_{j1})
= nλ_{2j} { O_p(λ_{2j}^{−1} n^{−t/(2t+1)}) + λ_{2j}^{−1} p'_{λ_{2j}}(|γ_{j1}|) sgn(γ_{j1}) },

where η_i, ε_i, and D_{ni} are as above. By condition (C4), n^{t/(2t+1)} min_j{λ_{1j}, λ_{2j}} → ∞ and lim inf_{n→∞} lim inf_{γ_{j1}→0+} p'_{λ_{2j}}(|γ_{j1}|)/λ_{2j} > 0 for j = s+1, …, p, so

∂l(γ)/∂γ_{j1} < 0 when −δ < γ̂^{CZ}_{j1} < 0,  and  ∂l(γ)/∂γ_{j1} > 0 when 0 < γ̂^{CZ}_{j1} < δ,

where δ = O(n^{−t/(2t+1)}). Hence l(γ) attains its minimum at γ̂^{CZ}_{j1} = 0, i.e., α̂_j = 0 for j = s+1, …, p. This completes the proof.