Model detection and variable selection for mode varying coefficient model
Xuejun Ma∗, Yue Du†, Jingli Wang‡

Abstract
The varying coefficient model is often used in statistical modeling since it is more flexible than the parametric model. However, model detection and variable selection for the varying coefficient model are poorly understood in mode regression; existing methods for these problems are typically based on mean regression or quantile regression. In this paper, we propose a novel method to solve these problems for the mode varying coefficient model based on the B-spline approximation and the SCAD penalty. Moreover, we present a new algorithm to estimate the parameters of interest, and discuss the selection of the tuning parameters and the bandwidth. We also establish the asymptotic properties of the estimated coefficients under some regularity conditions. Finally, we illustrate the proposed method with simulation studies and an empirical example.
Keywords: B-spline, SCAD penalty, mode regression, model detection, variable selection
MSC2010 subject classifications: Primary 62J07; secondary 62G08.
1 Introduction

Suppose that Y_i is a response variable and (U_i, X_i) is the associated covariate, i = 1, …, n. The varying coefficient model (VCM) [6] takes the form

Y_i = X_i^⊤ α(U_i) + ε_i,  (1.1)

where X_i = (X_i^{(0)}, …, X_i^{(p)})^⊤ is the (p+1)-dimensional design vector with first element X_i^{(0)} = 1, α(U) = (α_0(U), …, α_p(U))^⊤, and each α_j(U) is an unknown smooth function, j = 0, …, p. U_i is a univariate index variable; without loss of generality, we assume U ranges over the unit interval [0, 1]. ε_i is the random error. The VCM is attractive because the coefficients α_j(U) depend on U, which not only reduces the modeling bias but also avoids the "curse of dimensionality". Under various assumptions, a large number of varying coefficient models have been developed, such as the additive model [5], the partially linear model [4], and the single-index function coefficient regression model [16].

In statistical modeling, variable selection is a significant issue. Penalized techniques conduct variable selection by shrinking the coefficients of redundant variables to zero, and various powerful penalization methods are available, such as the LASSO [14], the SCAD [3], the adaptive LASSO [24], and the MCP [21]. Undoubtedly, variable selection is crucial.

∗School of Mathematical Sciences, Soochow University, 1 Shizi Street, Suzhou, 215006, China. [email protected]
†The corresponding author. School of Mathematical Sciences, Soochow University, 1 Shizi Street, Suzhou, 215006, China. [email protected]
‡School of Statistics and Data Sciences, Nankai University, Tianjin, China. [email protected]
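To fix ideas, data from a model of the form (1.1) can be generated as in the following minimal sketch (the sample size, the coefficient functions α_j, and the error scale are illustrative choices of ours, not settings used in this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2  # sample size and number of non-intercept covariates (illustrative)

U = rng.uniform(0.0, 1.0, size=n)             # index variable on [0, 1]
X = np.column_stack([np.ones(n),              # X_i^(0) = 1 (intercept column)
                     rng.normal(size=(n, p))])

# illustrative smooth coefficient functions alpha_j(U)
alpha = np.column_stack([np.sin(2 * np.pi * U),  # varying intercept alpha_0(U)
                         2.0 - U,                # varying coefficient alpha_1(U)
                         np.full(n, 1.5)])       # constant coefficient alpha_2(U)

eps = rng.normal(scale=0.5, size=n)           # random error
Y = np.sum(X * alpha, axis=1) + eps           # Y_i = X_i^T alpha(U_i) + eps_i
```

In such data, the goal of model detection is to recognize that α_2 is constant while α_0 and α_1 genuinely vary with U.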
2 Methodology

2.1 Model detection and estimation

We use the B-spline method to approximate the unknown smooth functions α_j(U), j = 0, 1, …, p. The B-spline basis functions of degree d can be written as B̃(U) = (B_1(U), …, B_q(U))^⊤, where q = k_n + d + 1 and k_n is the number of interior knots. There exists a transformation matrix G [12] such that GB̃(U) = (1, B̄(U)^⊤)^⊤ =: B(U). Then α_j(U) can be approximated by

α_j(U) ≈ B(U)^⊤ γ_j = γ_{j1} + B̄^⊤(U) γ_{j*},  j = 0, 1, …, p,  (2.1)

where γ_j = (γ_{j1}, γ_{j*}^⊤)^⊤ and γ_{j*} = (γ_{j2}, …, γ_{jq})^⊤. Obviously, if ||γ_{j*}||_{L_2} = (Σ_{k=2}^q γ_{jk}²)^{1/2} = 0, the variable X^{(j)} has only a constant effect. Then, based on mode regression, we can estimate γ by maximizing

Q(γ) = Σ_{i=1}^n K_h(Y_i − Π_i^⊤ γ),  (2.2)

where Π_i = (X_i^{(0)} B(U_i)^⊤, …, X_i^{(p)} B(U_i)^⊤)^⊤, γ = (γ_0^⊤, …, γ_p^⊤)^⊤, K_h(·) = h^{−1} K(·/h), K(·) is a kernel function, and h is the bandwidth. We take K(·) to be the standard normal density in this paper. Hence, the estimation of the VCM is transformed into a linear regression analysis.

We now discuss the penalized mode VCM. The estimator of γ can be obtained by maximizing the penalized objective

Q(γ) − n Σ_{j=1}^p p_{λ_{1j}}(||γ_{j*}||_{L_2}) − n Σ_{j=1}^p p_{λ_{2j}}(|γ_{j1}|) I(||γ_{j*}||_{L_2} = 0),  (2.3)

which is equivalent to minimizing

l(γ) = −Q(γ) + n Σ_{j=1}^p p_{λ_{1j}}(||γ_{j*}||_{L_2}) + n Σ_{j=1}^p p_{λ_{2j}}(|γ_{j1}|) I(||γ_{j*}||_{L_2} = 0),  (2.4)

where λ_{1j}, λ_{2j} > 0 are tuning parameters and p_λ(·) is the SCAD penalty function

p_{λ,a}(θ) = λθ I(0 ≤ θ ≤ λ) + [(aλθ − 0.5(θ² + λ²))/(a − 1)] I(λ < θ ≤ aλ) + [λ²(a² − 1)/(2(a − 1))] I(θ > aλ).  (2.5)

Fan and Li [3] suggested a = 3.7. The first penalty term involves the L_2 norm of the varying part of each coefficient. Tang et al. [13] applied the L_1 norm in mean and quantile VCMs; however, the L_1 norm does not treat γ_{j*} as a group, so it is more inclined to classify constant coefficients as varying coefficients. The second term is designed to select the zero constant effects.
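For concreteness, the kernel objective (2.2) and the SCAD penalty (2.5) can be written down directly; the sketch below is ours (the function names are hypothetical labels, and a = 3.7 follows Fan and Li [3]):

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_{lam,a}(theta) of Fan and Li [3], applied elementwise to |theta|."""
    t = np.abs(np.asarray(theta, dtype=float))
    flat = lam * t                                               # |theta| <= lam
    mid = (a * lam * t - 0.5 * (t ** 2 + lam ** 2)) / (a - 1.0)  # lam < |theta| <= a*lam
    tail = lam ** 2 * (a ** 2 - 1.0) / (2.0 * (a - 1.0))         # |theta| > a*lam
    return np.where(t <= lam, flat, np.where(t <= a * lam, mid, tail))

def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_{lam,a}(theta), used later in the local quadratic approximation."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def mode_objective(resid, h):
    """Q in (2.2): sum of K_h(residuals), with K the standard normal density."""
    r = np.asarray(resid, dtype=float) / h
    return np.sum(np.exp(-0.5 * r ** 2) / (np.sqrt(2.0 * np.pi) * h))
```

Note that the penalty is continuous at θ = λ and θ = aλ, and its derivative vanishes beyond aλ, which is what makes large coefficients essentially unpenalized.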
If the variable X^{(j)} has a constant coefficient, all components of γ_{j*} are shrunk exactly to zero. On the other hand, if the variable X^{(j)} has no effect at all, both γ_{j*} and γ_{j1} are shrunk to zero. In practice, the indicator function makes it very difficult to minimize the objective (2.4) directly; therefore, we implement a two-step iterative procedure to estimate the coefficients.

Step 1:
First, we obtain γ̂^{VC} by minimizing the penalized function

l(γ^{VC}) = −Q(γ^{VC}) + n Σ_{j=1}^p p_{λ_{1j}}(||γ^{VC}_{j*}||_{L_2}).  (2.6)

We penalize each varying coefficient by the L_2 norm of γ^{VC}_{j*}. When the j-th covariate has a constant effect, γ^{VC}_{j*} is shrunk to zero, which yields an automatic separation of the varying and constant effects.

Step 2:
After Step 1, the model becomes a partially linear varying coefficient model. Let γ^{CZ}_j = γ^{VC}_j if the variable X^{(j)} has a varying effect, and γ^{CZ}_j = (γ^{VC}_{j1}, 0^⊤_{q−1})^⊤ if the variable X^{(j)} has a constant effect. We obtain the estimator γ̂^{CZ} by minimizing

l(γ^{CZ}) = −Q(γ^{CZ}) + n Σ_{j=1}^p p_{λ_{2j}}(|γ^{CZ}_{j1}|) I(||γ̂^{VC}_{j*}||_{L_2} = 0).  (2.7)

Here, we apply the penalty λ_{2j} to each constant coefficient, aiming to exclude the irrelevant variables by assigning the SCAD penalty only to the terms that have been determined to be constant.

Step 3:
Iterate Steps 1 and 2 until convergence, and denote the resulting estimator of γ by γ̂. Then α_j(U) can be estimated by α̂_j(U) = B(U)^⊤ γ̂_j when the variable X^{(j)} has a varying effect, and by α̂_j(U) = γ̂_{j1} when the variable X^{(j)} has only a constant effect.

2.2 Algorithm

In this subsection, we extend the method proposed in [13] and the Modal Expectation-Maximization (MEM) algorithm [18] to minimize Eq. (2.4). Because directly minimizing Eq. (2.4) is difficult, we first locally approximate the penalty function p_λ(·) by the quadratic function proposed by Fan and Li [3]:

p_λ(|ω|) ≈ p_λ(|ω_0|) + (1/2) [p'_λ(|ω_0|)/|ω_0|] (ω² − ω_0²),  for ω ≈ ω_0.  (2.8)

Then, for a given initial value γ^{(0)}, we have

p_{λ_{1j}}(||γ_{j*}||_{L_2}) ≈ p_{λ_{1j}}(||γ^{(0)}_{j*}||_{L_2}) + (1/2) [p'_{λ_{1j}}(||γ^{(0)}_{j*}||_{L_2}) / ||γ^{(0)}_{j*}||_{L_2}] (||γ_{j*}||²_{L_2} − ||γ^{(0)}_{j*}||²_{L_2}),  (2.9)

p_{λ_{2j}}(|γ_{j1}|) ≈ p_{λ_{2j}}(|γ^{(0)}_{j1}|) + (1/2) [p'_{λ_{2j}}(|γ^{(0)}_{j1}|) / |γ^{(0)}_{j1}|] (γ²_{j1} − (γ^{(0)}_{j1})²).  (2.10)

In this paper, γ^{(0)} is chosen as the unpenalized least squares estimator

γ^{(0)} = (Π^⊤Π)^{−1} Π^⊤ Y,  (2.11)

where Π = (Π_1, …, Π_n)^⊤ and Y = (Y_1, …, Y_n)^⊤. The algorithm consists of two main steps, each containing an EM iteration, and yields the sparse estimators as follows.

(1) Step 1:
Computation of γ̂^{VC(t+1)}. After the t-th iteration we have γ̂^{CZ(t)}. Starting from γ̂^{VC(t+1,0)} = γ̂^{CZ(t)}:

• E-step:

π(i | γ̂^{VC(t+1,m)}) = K_h(Y_i − Π_i^⊤ γ̂^{VC(t+1,m)}) / Σ_{i'=1}^n K_h(Y_{i'} − Π_{i'}^⊤ γ̂^{VC(t+1,m)}),  i = 1, …, n.  (2.12)

• M-step:

γ̂^{VC(t+1,m+1)} = arg min_γ { Σ_{i=1}^n [−π(i | γ̂^{VC(t+1,m)}) log K_h(Y_i − Π_i^⊤ γ)] + n γ^⊤ Σ_{λ_1}(γ̂^{VC(t+1,m)}) γ }
= (Π^⊤ W Π + n Σ_{λ_1}(γ̂^{VC(t+1,m)}))^{−1} Π^⊤ W Y,  (2.13)

where

Σ_{λ_1}(γ̂^{VC(t+1,m)}) = diag{0^⊤_q, Σ_{λ_{11}}, …, Σ_{λ_{1p}}}  (2.14)

with Σ_{λ_{1j}} = [p'_{λ_{1j}}(||γ̂^{VC(t+1,m)}_{j*}||_{L_2}) / ||γ̂^{VC(t+1,m)}_{j*}||_{L_2}] (0, 1^⊤_{q−1})^⊤, j = 1, …, p, and W is an n × n diagonal matrix with diagonal elements π(i | γ̂^{VC(t+1,m)}).

• Iterate the E-step and M-step until convergence is reached, which yields γ̂^{VC(t+1)}; once ||γ̂^{VC(t+1)}_{j*}||_{L_2} < ε_0 for a small threshold ε_0, we set γ̂^{VC(t+1)}_{j*} = 0. With this rule the algorithm performs well in our simulations.

(2) Step 2:
Computation of γ̂^{CZ(t+1)}. After Step 1, we set, for j = 1, …, p,

γ̂^{CZ(t+1,0)}_j = γ̂^{VC(t+1)}_j if γ̂^{VC(t+1)}_{j*} ≠ 0, and γ̂^{CZ(t+1,0)}_j = (γ̂^{VC(t+1)}_{j1}, 0^⊤_{q−1})^⊤ if γ̂^{VC(t+1)}_{j*} = 0.  (2.15)

• E-step:

π(i | γ̂^{CZ(t+1,m)}) = K_h(Y_i − Π_i^⊤ γ̂^{CZ(t+1,m)}) / Σ_{i'=1}^n K_h(Y_{i'} − Π_{i'}^⊤ γ̂^{CZ(t+1,m)}),  i = 1, …, n.  (2.16)

• M-step:

γ̂^{CZ(t+1,m+1)} = arg min_γ { Σ_{i=1}^n [−π(i | γ̂^{CZ(t+1,m)}) log K_h(Y_i − Π_i^⊤ γ)] + n γ^⊤ Σ_{λ_2}(γ̂^{CZ(t+1,m)}) γ }
= (Π^⊤ W Π + n Σ_{λ_2}(γ̂^{CZ(t+1,m)}))^{−1} Π^⊤ W Y,  (2.17)

where

Σ_{λ_2}(γ̂^{CZ(t+1,m)}) = diag{0^⊤_q, Σ_{λ_{21}}, …, Σ_{λ_{2p}}}  (2.18)

with Σ_{λ_{2j}} = [p'_{λ_{2j}}(|γ̂^{CZ(t+1,m)}_{j1}|) / |γ̂^{CZ(t+1,m)}_{j1}|] (1, 0^⊤_{q−1})^⊤, j = 1, …, p, and W is an n × n diagonal matrix with diagonal elements π(i | γ̂^{CZ(t+1,m)}).

• Similar to
Step 1, we iterate the EM procedure until convergence and obtain γ̂^{CZ(t+1)}; we also set γ̂^{CZ(t+1)}_{j1} = 0 whenever |γ̂^{CZ(t+1)}_{j1}| < ε_0.

(3) Step 3: Iterate Steps 1 and 2 until convergence.
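One EM cycle of the above algorithm can be sketched as follows (a simplified illustration of the E-step (2.12)/(2.16) and the ridge-type M-step (2.13)/(2.17); the function name, the convergence settings, and the argument Sigma_lam, which stands for the penalty matrix Σ_λ of (2.14) or (2.18), are our own choices):

```python
import numpy as np

def k_h(r, h):
    """Gaussian kernel density K_h(r) = K(r/h)/h, with K the standard normal density."""
    return np.exp(-0.5 * (r / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def mem_step(Pi, Y, gamma0, Sigma_lam, h, max_iter=100, tol=1e-8):
    """Alternate the E-step (mode weights) and the weighted ridge M-step until convergence."""
    n = len(Y)
    gamma = np.asarray(gamma0, dtype=float)
    for _ in range(max_iter):
        w = k_h(Y - Pi @ gamma, h)
        w = w / w.sum()                                   # E-step: pi(i | gamma)
        W = np.diag(w)
        gamma_new = np.linalg.solve(Pi.T @ W @ Pi + n * Sigma_lam,
                                    Pi.T @ W @ Y)         # M-step, cf. (2.13)/(2.17)
        if np.max(np.abs(gamma_new - gamma)) < tol:
            return gamma_new
        gamma = gamma_new
    return gamma
```

With Sigma_lam set to a zero matrix, this reduces to an iteratively reweighted least squares scheme for unpenalized mode regression.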
2.3 Selection of tuning parameters

To implement the above method, we must determine the tuning parameters λ_{1j} and λ_{2j}, the degree d of the B-spline, the number of interior knots k_n, and the bandwidth h. In our numerical studies, we use cubic splines, i.e., d = 3. The values of h, k_n, λ_{1j}, and λ_{2j} are chosen in each iteration of the two-step procedure; after the t-th iteration, the selections proceed as follows.

• Selection of the bandwidth. We provide a bandwidth selection method for the practical use of the mode regression estimator. We first estimate F(x, u, h) = E(K''_h(ε) | X = x, U = u) and G(x, u, h) = E({K'_h(ε)}² | X = x, U = u) by [17]

F̂(h) = (1/n) Σ_{i=1}^n K''_h(ε̂_i),  (2.19)
Ĝ(h) = (1/n) Σ_{i=1}^n {K'_h(ε̂_i)}²,  (2.20)

where ε̂_i = Y_i − Π_i^⊤ γ̂^{CZ(t)} and σ̂² = (1/n) Σ_{i=1}^n ε̂_i². Then we obtain ĥ_opt by minimizing

r̂(h) = Ĝ(h) F̂(h)^{−2} σ̂^{−2}.  (2.21)

A grid search may be used to estimate h; following the advice of Yao et al. [17], we take h = 0.5σ̂ × 1.02^j, j = 0, 1, …, l, for some fixed l such as l = 50 or 100.

• Selection of the interior knots. We use a SIC-type criterion and choose k_n to minimize

SIC(k) = −log Q(γ̂^{CZ(t)}) + (log n / n)(v_m(k + d + 1) + c_m),  (2.22)

where v_m denotes the number of covariates with varying effects and c_m the number of covariates with constant effects.

• Selection of the tuning parameters. We also use SIC-type criteria and choose λ_1 and λ_2 to minimize

SIC(λ_1) = −log Q(γ̂^{VC(t)}_{λ_1}) + (log n / n) edf_1,  (2.23)
SIC(λ_2) = −log Q(γ̂^{CZ(t)}_{λ_2}) + (log n / n) edf_2,  (2.24)

where γ̂^{VC(t)}_{λ_1} and γ̂^{CZ(t)}_{λ_2} are the minimizers of Eq. (2.6) and Eq. (2.7) with

λ_{1j} = λ_1 / ||γ̂^{VC(t)}_{j*}||_{L_2},  j = 1, …, p,  (2.25)

λ_{2j} defined analogously, and edf_1, edf_2 defined as the total numbers of varying and nonzero constant coefficients, respectively [15].

2.4 Asymptotic properties

Let {α_j(·), j = 0, …, v} be the varying coefficients, {α_j(·), j = v+1, …, s} be the nonzero constant coefficients, and α_j(·) = 0 for j = s+1, …, p. We assume the following regularity conditions.

(C1) The varying coefficient functions α_j(·), j = 1, …, v, are t times continuously differentiable on [0, 1], with t > 2.

(C2) The density function f_U(·) is continuous and bounded away from zero and infinity on [0, 1].

(C3) {X_i, i = 1, …, n} are uniformly bounded in probability, and the positive definite matrix E(XX^⊤ | U = u) has bounded eigenvalues.

(C4) The tuning parameters satisfy n^{t/(2t+1)} min_j{λ_{1j}, λ_{2j}} → ∞ and max_j{λ_{1j}, λ_{2j}} → 0 as n → ∞, and

lim inf_{n→∞} lim inf_{||γ_{j*}||_{L_2} → 0+} p'_{λ_{1j}}(||γ_{j*}||_{L_2}) / λ_{1j} > 0,  j = v+1, …, p,  (2.26)
lim inf_{n→∞} lim inf_{γ_{j1} → 0+} p'_{λ_{2j}}(|γ_{j1}|) / λ_{2j} > 0,  j = s+1, …, p.  (2.27)

Theorem 1.
Suppose conditions (C1)-(C4) hold and k_n ∼ n^{1/(2t+1)}. Then, with probability approaching 1, the estimators α̂_j(·), j = v+1, …, s, are nonzero constants, and α̂_j(·) = 0 for j = s+1, …, p.

Theorem 2.
Suppose conditions (C1)-(C3) hold and k_n ∼ n^{1/(2t+1)}. Then

||α̂_j(·) − α_j(·)||_{L_2} = O_p(n^{−t/(2t+1)} + a_n),  j = 0, …, v,  (2.28)

where a_n is defined in the Appendix.

3 Numerical studies

3.1 Simulation

We generate random samples from the following two models:
Model 1:

Y_i = 15 + 20 sin(2πU_i) + (2 − U_i − π/ )X_{i1} + (6 − U_i)X_{i2} + 0. X_{i3} + 2X_{i4} + δ_1 ε_i,  (3.1)

where U_i ∼ Uniform(0, 1), X_{ij} comes from N(0, 1) for j = 1, …, p, and the error scale δ_1 is a function of X_i.

Model 2:

Y_i = 2 exp(1 − U_i) + (1. (πU_i))X_{i1} + (0. U_i(1 − U_i)(U_i − 0. ))X_{i2} + (2 − (πU_i))X_{i3} + 2X_{i4} + 0. X_{i5} − 0. X_{i6} + δ_2 ε_i,  (3.2)

where U_i ∼ Uniform(0, 1), (X_{i1}, …, X_{ip})^⊤ is generated from a multivariate normal distribution with Cov(X_{ik}, X_{ij}) = ρ^{|k−j|} for any 1 ≤ j, k ≤ p, and the error scale δ_2 is a function of X_i.

We consider the following four error cases:

• Case 1: ε_i ∼ N(0, 1).
• Case 2: ε_i ∼ t(3).
• Case 3: ε_i follows a Laplace distribution centered at zero.
• Case 4: ε_i follows the two-component normal mixture 0.5N(−1, ·) + 0.5N(1, ·), for which E(ε_i) = 0 and mode(ε_i) = 1.

The sample size is n = 500, and p varies from 10 to 30. We repeat each setting 500 times and compare the proposed method (VCEM) with the basis expansion method (BSE_{L1}) based on mean regression [13]. Because the method using the L_1 norm tends to set constant coefficients to varying coefficients, we also replace the L_1 norm with the L_2 norm in BSE_{L1} and denote the resulting method by BSE_{L2}. To evaluate the three methods, we consider the following criteria:

• SV: the average number of true varying coefficients (excluding the intercept) correctly selected as varying coefficients.
• SC: the average number of true constant coefficients correctly selected as constant coefficients.
• SZ: the average number of redundant variables correctly selected as redundant variables.

Table 1: Simulation results for Model 1 (only the oracle row is recoverable)

p    Method   SV     SC     SZ
10   Oracle   2.000  2.000  6.000

From Tables 1 and 2, we draw the following conclusions:

(1) For the selection of varying coefficients, the SV of VCEM is comparable to that of BSE_{L1} and BSE_{L2}.
They perform similarly in the selection of redundant variables.

(2) For the selection of constant coefficients, VCEM is superior to BSE_{L1} and BSE_{L2}, since the SC of VCEM is larger.

(3) Even in the normal error case, VCEM performs no worse than BSE_{L1} and BSE_{L2}. Moreover, when the error distribution is asymmetric or heavy-tailed, VCEM is better, since mode regression puts more weight on the "most likely" data around the true value, which leads to a robust and efficient estimator.

(4) BSE_{L2} is better than BSE_{L1}, because the L_2 norm treats the basis coefficients of each function as a group.

Table 2: Simulation results for Model 2 (only the oracle row is recoverable)

p    Method   SV     SC     SZ
10   Oracle   3.000  3.000  4.000

3.2 Real data analysis

In this section, to illustrate the usefulness of the proposed procedure, we apply it to the Boston housing data. This data set concerns the median value of owner-occupied homes; it contains 506 observations on 14 variables and is available in the R package "mlbench". We take MEDV (median value of owner-occupied homes in $1000s) as the response variable. The covariates are: CRIM (per capita crime rate by town), ZN (proportion of residential land zoned for lots over 25,000 sq. ft.), INDUS (proportion of non-retail business acres per town), CHAS (Charles River dummy variable; 1 if the tract bounds the river, 0 otherwise), NOX (nitrogen oxides concentration, parts per 10 million), RM (average number of rooms per dwelling), AGE (proportion of owner-occupied units built prior to 1940), DIS (weighted mean of distances to five Boston employment centres), RAD (index of accessibility to radial highways), TAX (full-value property-tax rate per $10,000), PTRATIO (pupil-teacher ratio by town), and BLACK (1000(Bk − 0.63)², where Bk is the proportion of blacks by town). LSTAT (percentage of lower-status population) serves as the index variable. The response variable and covariates are standardized.
The index variable LSTAT is transformed into the interval [0, 1]. To compare the three methods, we compute the mean squared error

MSE = (1/506) Σ_{i=1}^{506} (y_i − ŷ_i)².  (3.3)

From Table 3, the results of the three methods are comparable. We find that AGE is an irrelevant variable, while CRIM, INDUS, and RAD have varying effects. PTRATIO has a negative effect according to BSE_{L2} and VCEM, RM has a positive effect according to BSE_{L1} and VCEM, and DIS has a negative effect according to BSE_{L1} and BSE_{L2}. Moreover, the MSE of VCEM is smaller than those of the other methods.

Table 3: Results for the Boston housing data ("V" denotes a varying effect, "0" no effect)

Variable   BSE_{L1}  BSE_{L2}  VCEM
CRIM       V         V         V
ZN         0         V         V
INDUS      V         V         V
CHAS       V         V         0
NOX        V         0         -0.159
RM         0.290     V         0.388
AGE        0         0         0
DIS        -0.088    -0.023    V
RAD        V         V         V
TAX        0         0         V
PTRATIO    V         -0.038    -0.192
BLACK      V         0         V
MSE        0.202     0.202     0.198
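The preprocessing and the criterion (3.3) amount to the following sketch (the min-max mapping of the index variable onto [0, 1] is our assumption; the paper only states that LSTAT is transformed to [0, 1]):

```python
import numpy as np

def standardize(v):
    """Center and scale a variable, as done for the Boston response and covariates."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def to_unit_interval(u):
    """Map the index variable (LSTAT) onto [0, 1]; a min-max transformation is assumed."""
    u = np.asarray(u, dtype=float)
    return (u - u.min()) / (u.max() - u.min())

def mse(y, y_hat):
    """Mean squared error criterion (3.3)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))
```

The fitted values ŷ_i would come from the selected model of each method; they are not reproduced here.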
4 Conclusion

In this paper, we have studied model detection for varying coefficient models, together with the selection of variables with nonzero varying effects and of variables with nonzero constant effects, based on mode regression. We not only propose the VCEM algorithm but also establish its asymptotic properties under mild regularity conditions. Numerical simulations show that the proposed method is effective. In future work, we would like to extend the proposed method to generalized varying coefficient models.
References

[1] Cai, Z., Fan, J. & Li, R. (2000). Efficient estimation and inferences for varying-coefficient models. Journal of the American Statistical Association 95(451): 888-902.
[2] Fan, J. & Huang, T. (2005). Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 11(6): 1031-1057.
[3] Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456): 1348-1360.
[4] Härdle, W., Liang, H. & Gao, J. (2000). Partially Linear Models. Heidelberg: Physica-Verlag.
[5] Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. London: Chapman & Hall/CRC.
[6] Hastie, T. & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B 55(4): 757-796.
[7] Huang, J., Wu, C. & Zhou, L. (2002). Varying-coefficient models and basis function approximation for the analysis of repeated measurements. Biometrika 89(1): 111-128.
[8] Hu, T. & Xia, Y. (2012). Adaptive semi-varying coefficient model selection. Statistica Sinica 22(2): 575-599.
[9] Lee, M. (1989). Mode regression. Journal of Econometrics 42(3): 337-349.
[10] Li, Q. & Racine, J. (2010). Smooth varying-coefficient estimation and inference for qualitative and quantitative data. Econometric Theory 26(6): 1607-1637.
[11] Ma, X. & Zhang, J. (2016). A new variable selection approach for varying coefficient models. Metrika 79(1): 59-72.
[12] Schumaker, L. (1981). Spline Functions: Basic Theory. New York: Wiley.
[13] Tang, Y., Wang, H., Zhu, Z. & Song, X. (2012). A unified variable selection approach for varying coefficient models. Statistica Sinica 22(2): 601-628.
[14] Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B 58(1): 267-288.
[15] Wang, H. & Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association 104(486): 747-757.
[16] Xia, Y. & Li, W. (1999). On the estimation and testing of functional-coefficient linear models. Statistica Sinica 9(3): 735-757.
[17] Yao, W., Lindsay, B. & Li, R. (2012). Local modal regression. Journal of Nonparametric Statistics 24(3): 647-663.
[18] Yao, W. & Li, L. (2013). A new regression model: modal linear regression. Scandinavian Journal of Statistics 41(3): 656-671.
[19] Yuan, M. & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68(1): 49-67.
[20] Zhao, W., Zhang, R., Liu, J. & Lv, Y. (2014). Robust and efficient variable selection for semiparametric partially linear varying coefficient model based on modal regression. Annals of the Institute of Statistical Mathematics 66(1): 165-191.
[21] Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2): 894-942.
[22] Zhang, R., Zhao, W. & Liu, J. (2013). Robust estimation and variable selection for semiparametric partially linear varying coefficient model based on modal regression. Journal of Nonparametric Statistics 25(2): 523-544.
[23] Zhu, H., Li, R. & Kong, L. (2003). Multivariate varying coefficient model for functional responses. Biometrics 59(2): 263-273.
[24] Zou, H. (2006). The adaptive LASSO and its oracle properties. Journal of the American Statistical Association 101(476): 1418-1429.
Appendix
Lemma A.1. Suppose conditions (C1)-(C4) hold. Let γ^best denote the best approximation to γ, and let ε_1, a_1, a_2 be positive constants such that

(1) ||γ^best_{j*}||_{L_2} > ε_1 for j = 1, …, v; γ^best_j = (γ_{j1}, 0^⊤_{q−1})^⊤ for j = v+1, …, s; and γ^best_j = 0 for j = s+1, …, p;
(2) sup_U |α_j(U) − B(U)^⊤ γ^best_j| ≤ a_1 k_n^{−t}, j = 1, …, v;
(3) sup_U |Π^⊤ γ^best − X^⊤ α(U)| ≤ a_2 k_n^{−t}.

Lemma A.2. Let δ = O(n^{−t/(2t+1)}) and define γ = γ^best + δv. Given ρ > 0, there exists a large constant C such that [20]

P{ sup_{||v|| = C} l(γ) > l(γ^best) } ≥ 1 − ρ.

Proof of Theorem 2.
Similar to the proof of Theorem 1 in [20], let a_n = max_j |p'_{λ_{1j}}(||γ_{j*}||_{L_2})| and b_n = max_j |p''_{λ_{1j}}(||γ_{j*}||_{L_2})|. As b_n → 0, we have ||γ̂ − γ^best|| = O_p(n^{−t/(2t+1)} + a_n). Therefore,

||α̂_j(U) − α_j(U)||²_{L_2} = ∫ (α̂_j(U) − α_j(U))² dU
= ∫ (B(U)^⊤γ̂_j − B(U)^⊤γ^best_j + B(U)^⊤γ^best_j − α_j(U))² dU
≤ 2 ∫ (B(U)^⊤γ̂_j − B(U)^⊤γ^best_j)² dU + 2 ∫ (B(U)^⊤γ^best_j − α_j(U))² dU
= 2 (γ̂_j − γ^best_j)^⊤ [∫ B(U)B(U)^⊤ dU] (γ̂_j − γ^best_j) + 2 ∫ (B(U)^⊤γ^best_j − α_j(U))² dU.

Because ||∫ B(U)B(U)^⊤ dU|| = O(1), we have

(γ̂_j − γ^best_j)^⊤ [∫ B(U)B(U)^⊤ dU] (γ̂_j − γ^best_j) = O_p((n^{−t/(2t+1)} + a_n)²),

and, by Lemma A.1,

∫ (B(U)^⊤γ^best_j − α_j(U))² dU = O_p(n^{−2t/(2t+1)}).

Consequently, ||α̂_j(·) − α_j(·)||_{L_2} = O_p(n^{−t/(2t+1)} + a_n), which completes the proof.

Proof of Theorem 1. By the properties of the SCAD penalty and max_j{λ_{1j}, λ_{2j}} → 0 as n → ∞, we have a_n = 0 for n large enough; then, by Theorem 2, ||γ̂ − γ^best|| = O_p(n^{−t/(2t+1)}).

First, if γ_{j*} = 0, then α_j(U) is clearly a constant. If γ_{j*} ≠ 0, we have

∂l(γ)/∂γ_{j*} = Σ_{i=1}^n K'_h(Y_i − Π_i^⊤γ) X_i^{(j)} B̄(U_i) + n [p'_{λ_{1j}}(||γ_{j*}||_{L_2}) / ||γ_{j*}||_{L_2}] γ_{j*}
= Σ_{i=1}^n { K'_h(ε_i + D_{ni}) X_i^{(j)} B̄(U_i) − K''_h(ε_i + D_{ni}) X_i^{(j)} B̄(U_i) [Π_i^⊤(γ − γ^best)] + (1/2) K'''_h(η_i) X_i^{(j)} B̄(U_i) [Π_i^⊤(γ − γ^best)]² } + n [p'_{λ_{1j}}(||γ_{j*}||_{L_2}) / ||γ_{j*}||_{L_2}] γ_{j*}
= nλ_{1j} { O_p(λ_{1j}^{−1} n^{−t/(2t+1)}) + λ_{1j}^{−1} [p'_{λ_{1j}}(||γ_{j*}||_{L_2}) / ||γ_{j*}||_{L_2}] γ_{j*} },

where η_i lies between Y_i − Π_i^⊤γ and ε_i + D_{ni}, ε_i = Y_i − X_i^⊤α(U_i), and D_{ni} = X_i^⊤α(U_i) − Π_i^⊤γ^best. Since sup_U ||B̄(U)|| = O(1) and, by condition (C4), n^{t/(2t+1)} min_j{λ_{1j}, λ_{2j}} → ∞ and lim inf_{n→∞} lim inf_{||γ_{j*}||_{L_2}→0+} p'_{λ_{1j}}(||γ_{j*}||_{L_2})/λ_{1j} > 0 for j = v+1, …, p, the sign of the derivative is completely determined by the penalty term. Hence, together with Lemma A.2, l(γ) attains its minimum at γ̂^{VC}_{j*} = 0, and α̂_j(U) ≈ γ̂^{VC}_{j1} + B̄^⊤(U)γ̂^{VC}_{j*} = γ̂^{VC}_{j1}; that is, α̂_j(U), j = v+1, …, p, are constants.

Second, since α̂_j, j = v+1, …, p, have been shown to be constant, it suffices to prove γ̂^{CZ}_{j1} = 0 in order to obtain α̂_j = 0 for j = s+1, …, p. We have

∂l(γ)/∂γ_{j1} = Σ_{i=1}^n K'_h(Y_i − Π_i^⊤γ) X_i^{(j)} + n p'_{λ_{2j}}(|γ_{j1}|) sgn(γ_{j1})
= Σ_{i=1}^n { K'_h(ε_i + D_{ni}) X_i^{(j)} − K''_h(ε_i + D_{ni}) X_i^{(j)} [Π_i^⊤(γ − γ^best)] + (1/2) K'''_h(η_i) X_i^{(j)} [Π_i^⊤(γ − γ^best)]² } + n p'_{λ_{2j}}(|γ_{j1}|) sgn(γ_{j1})
= nλ_{2j} { O_p(λ_{2j}^{−1} n^{−t/(2t+1)}) + λ_{2j}^{−1} p'_{λ_{2j}}(|γ_{j1}|) sgn(γ_{j1}) },

where η_i, ε_i, and D_{ni} are as above. By condition (C4), n^{t/(2t+1)} min_j{λ_{1j}, λ_{2j}} → ∞ and lim inf_{n→∞} lim inf_{γ_{j1}→0+} p'_{λ_{2j}}(|γ_{j1}|)/λ_{2j} > 0 for j = s+1, …, p, so

∂l(γ)/∂γ_{j1} < 0 when −δ < γ̂^{CZ}_{j1} < 0,  and  ∂l(γ)/∂γ_{j1} > 0 when 0 < γ̂^{CZ}_{j1} < δ,

where δ = O(n^{−t/(2t+1)}). Hence l(γ) attains its minimum at γ̂^{CZ}_{j1} = 0, i.e., α̂_j = 0 for j = s+1, …, p. This completes the proof.