Endogenous Treatment Effect Estimation with some Invalid and Irrelevant Instruments
Qingliang Fan ∗ Yaqian Wu † June 29, 2020
Abstract
Instrumental variables (IV) regression is a popular method for the estimation of endogenous treatment effects. Conventional IV methods require that all the instruments are relevant and valid. However, this is impractical, especially in high-dimensional models where we consider a large set of candidate IVs. In this paper, we propose an IV estimator robust to the existence of both invalid and irrelevant instruments (called R2IVE) for the estimation of endogenous treatment effects. This paper extends the scope of sisVIVE (Kang et al., 2016) by considering a true high-dimensional IV model and a nonparametric reduced form equation. It is shown that our procedure can select the relevant and valid instruments consistently and that the proposed R2IVE is root-n consistent and asymptotically normal. Monte Carlo simulations demonstrate that the R2IVE performs favorably compared to the existing high-dimensional IV estimators (such as NAIVE (Fan and Zhong, 2018) and sisVIVE (Kang et al., 2016)) when invalid instruments exist. In the empirical study, we revisit the classic question of trade and growth (Frankel and Romer, 1999).
Keywords:
Endogenous treatment effect, instrumental variable selection, high-dimensionality, economic growth and trade.
∗ Department of Economics, The Chinese University of Hong Kong. E-mail: [email protected].
† School of Economics, Xiamen University. E-mail: [email protected].
arXiv preprint [econ.EM]

Introduction
The instrumental variables (IV) method is useful for the identification of endogenous treatment effects in empirical studies. The internal validity of IV analysis requires that instruments are associated with the treatment (A1: Relevance Condition), have no direct pathway to the outcome (A2: Exclusion Condition), and are not related to unmeasured variables that affect the treatment and the outcome (A3: Exogeneity Condition). Finding instruments that satisfy assumptions (A1)-(A3) is often quite a challenging task for empirical researchers. The inclusion of many redundant instruments (violation of A1) leads to poor finite sample properties of estimators. Two-stage least squares (2SLS) tends to have large biases when many weak instruments are used (Bekker, 1994). An instrument is deemed invalid if it has a direct effect on the outcome (violation of A2) or an indirect association with the outcome through unobserved confounders (violation of A3). Using an invalid instrument will lead to inconsistency of the 2SLS and LIML estimators (Kolesar et al., 2015). This motivated the present study to develop a method that selects relevant and valid instruments from a large set of candidate instruments.
There is a strand of literature on instrument selection under the exclusion and exogeneity conditions. In a seminal work on IV selection, Donald and Newey (2001) consider a procedure that minimizes the higher-order asymptotic MSE, which relies on a priori knowledge of the ordering of instruments. Kuersteiner and Okui (2010) use model averaging to construct optimal instruments. Okui (2011) considers ridge regression for estimating the first-stage regression. In the high-dimensional setting, Gautier and Tsybakov (2011) propose an estimation method related to the Dantzig selector and the square-root Lasso when the structural equation in an instrumental variables model is itself very high-dimensional. Belloni et al.
(2012) introduce the Post-Lasso method, which extends IV estimation to high dimensions with heteroscedastic and non-Gaussian random disturbances. Caner and Fan (2015) use the adaptive Lasso estimator to eliminate irrelevant instruments. Lin et al. (2015) propose a two-stage regularization framework for identifying and estimating important covariates' effects while selecting and estimating optimal instruments, where the dimensions of covariates and instruments can both be much larger than the sample size. Fan and Zhong (2018) study a nonparametric approach with a general nonlinear reduced form equation to achieve a better approximation of the optimal instruments with the adaptive group Lasso.
In the invalid IV setting, Liao (2013) proposes an adaptive GMM shrinkage estimator to select valid moments consistently. Also in the GMM framework, Cheng and Liao (2015) use an information-based adaptive GMM shrinkage estimation to select both the relevant and valid moments consistently, and Caner et al. (2018) develop the adaptive Elastic-Net generalized method of moments (GMM) estimator in large dimensional models with potentially (locally) invalid moment conditions. The aforementioned papers need prior knowledge of a subset of instruments that is known to be valid and contains sufficient information for identification and estimation of the causal effects. In contrast, Kang et al. (2016) propose a Lasso-type procedure to identify and select the set of invalid instruments without any prior knowledge about which instruments are potentially valid or invalid. Windmeijer et al. (2018) suggest a consistent median estimator using the adaptive Lasso, which attains the oracle properties. Guo et al.
(2018a) propose a Two-Stage Hard Thresholding (TSHT) with voting procedure that selects relevant and valid instruments and produces valid confidence intervals for the causal effect, which also works in the high-dimensional setting.
In this paper, we develop a method that extends the scope of sisVIVE (Kang et al., 2016) by considering a true high-dimensional instrumental variable setting. Without knowing which instrument is irrelevant or invalid, we consider a large set of candidate instruments IV = IV_1 ∪ IV_2 ∪ IV_3 ∪ IV_4, where IV_1 denotes the set of relevant and valid instruments, IV_2 the set of relevant and invalid instruments, IV_3 the set of irrelevant and valid instruments, and IV_4 the set of irrelevant and invalid instruments. We propose a three-step procedure for the estimation of endogenous treatment effects. First, we consider a nonparametric additive reduced form model and estimate it by the adaptive group Lasso, which selects the relevant instruments (denoted by A*_R) consistently and estimates the optimal instrument D*. This data-driven approach can usually adopt the linear form automatically for a truly linear reduced form model. In the second step, we replace the original treatment variable by its estimated value and select the valid instruments (denoted by A*_V) consistently by the adaptive Elastic-Net. In the final step, we take the selected invalid instruments (denoted by A*_I = (A*_V)^c) as covariates and run a least squares regression to obtain the treatment effect estimator. It is shown that our procedure can select the relevant and invalid instruments consistently. The estimator has the desired theoretical properties, such as consistency and asymptotic normality. Therefore, it is called the Robust IV Estimator to both the Invalid and Irrelevant instruments (R2IVE), where the number 2 in R2IVE denotes both types of bad instruments.
Monte Carlo simulations demonstrate that our estimator performs better than the existing IV estimators (such as 2SLS, NAIVE (Fan and Zhong, 2018) and sisVIVE (Kang et al., 2016)) for endogenous treatment effects. In the empirical study, we revisit the classic question of trade and growth.
In the common sense of IV selection, we need to select the strong and valid IVs. From the perspective of variable selection using Lasso-type procedures, as we will demonstrate in our model part, the invalid instruments have non-zero coefficients in the structural equation and they should be selected. When we say we can select the invalid IVs, this is not to be confused with the ultimate goal of selecting the IVs.
Throughout the paper, for an n by L matrix X, we denote the (i, j)-th element of X as X_ij, the i-th row as X_i., and the j-th column as X_.j. X^T is the transpose of X. M_X = I_n − P_X, where P_X = X(X^T X)^{-1} X^T is the projection matrix onto the column space of X, and I_n is the n-dimensional identity matrix. Let ι_s denote a 1 × s vector of ones. The l_p-norm is denoted by ‖·‖_p, and the l_0-norm, ‖·‖_0, denotes the number of non-zero components of a vector. ‖·‖ is the l_2-norm, following conventions. For any set A, we denote by A^c its complement, and |A| is the cardinality of A.
For i = 1, . . . , n, let Y_i(0, 0) ∈ R be the potential outcome without any treatment and instruments, and Y_i(d, z) ∈ R be the potential outcome if individual i were to have treatment d and instrument values z. D_i(z) ∈ R is the endogenous treatment if individual i were to have instruments z; d, d' ∈ R are two possible values of the treatment, and z, z' ∈ R^L are two possible values of the instruments. We consider a true high-dimensional instrumental variable setting such that the dimensionality L is allowed to be potentially larger than the sample size n.
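As a quick illustration of the projection notation above, here is a minimal numpy sketch (ours, not part of the paper) of P_X and M_X and their basic properties:

```python
import numpy as np

def projection(X):
    """P_X = X (X^T X)^{-1} X^T: orthogonal projection onto the column space of X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def annihilator(X):
    """M_X = I_n - P_X: projection onto the orthogonal complement of col(X)."""
    return np.eye(X.shape[0]) - projection(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
M = annihilator(X)
assert np.allclose(M @ X, 0.0, atol=1e-8)   # M_X annihilates the columns of X
assert np.allclose(M, M.T)                  # symmetric
assert np.allclose(M @ M, M)                # idempotent
```

These three properties (M_X X = 0, symmetry, idempotence) are all the paper uses of M_X.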
Without loss of generality, we consider a univariate endogenous treatment variable. Suppose we have the following potential outcomes model,
Y_i(d', z') − Y_i(d, z) = (z' − z)^T α*_1 + (d' − d) β*   (2.1)
E( Y_i(0, 0) | Z_i. ) = Z_i.^T α*_2   (2.2)
where β* ∈ R is the treatment parameter of interest, α*_1 ∈ R^L represents the direct effect of z on Y, and α*_2 ∈ R^L represents the presence of unmeasured confounders that affect both the instruments and the outcomes. An instrument Z_j therefore does not satisfy the exclusion condition if α_1j ≠ 0 and does not satisfy the exogeneity condition if α_2j ≠ 0.
For each individual, only one possible realization of Y_i(d, z) and D_i(z) is observed, denoted by Y_i and D_i, respectively, with the observed instrument values Z_i.. We study the endogenous treatment effect through a random sample {Y_i; D_i; Z_i.^T}_{i=1}^n. Let α* = α*_1 + α*_2, and ε_i = Y_i(0, 0) − E[ Y_i(0, 0) | Z_i. ]. Combining (2.1) and (2.2), the baseline model is given by
Y_i = Z_i.^T α* + D_i β* + ε_i,  E( ε_i | Z_i. ) = 0   (2.3)
where D_i is an endogenous treatment variable, that is, E( ε_i | D_i ) ≠ 0. Note that the model can include exogenous variables X_i ∈ R^p, in which case we can replace the variables Y_i, D_i and Z_i. with the residuals after regressing them on X (e.g., replace Y by M_X Y) (Zivot and Wang, 1998). For simplicity, we assume that Y, D, and the columns of Z are centered, which can be obtained from a residual transformation with X containing only the constant term. Definition 1:
Instrument j ∈ {1, . . . , L} is valid if α*_j = 0, which means instrument j satisfies both conditions (A2) and (A3), and it is invalid if α*_j ≠ 0. Let A*_V denote the set of valid instruments and A*_I = (A*_V)^c denote the set of invalid instruments.
We also extend the sisVIVE (Kang et al., 2016) by considering a nonparametric additive reduced form model with a large number of possible instruments,
D_i = Σ_{j=1}^L f_j(Z_ij) + ξ_i,  E( ξ_i | Z_i. ) = 0   (2.4)
where f_j(·) is the j-th unknown smooth univariate function and the ξ_i's are i.i.d. random errors with mean 0 and finite variance. For model identification, we assume that all functions f_j(·) are centered, that is, E[ f_j(Z_j) ] = 0, 1 ≤ j ≤ L, where Z_j denotes the j-th instrument. As it is more flexible and more generally applicable than the ordinary linear model, the nonparametric additive reduced form model (2.4) can achieve a better approximation to the optimal instrument D*_i = E( D_i | Z_i. ) (Amemiya, 1974; Newey, 1990). The resulting estimator based on (2.4) is expected to be more efficient compared to the linear IV estimator, which will be confirmed both theoretically and numerically in later sections. Definition 2:
Instrument j ∈ {1, . . . , L} is a non-redundant IV satisfying (A1) if f_j(Z_j) ≠ 0. Let A*_R denote the set of these instruments, which are able to approximate the conditional expectation of the endogenous variable.
Our study focuses on a known endogenous treatment effect model. Guo et al. (2018b) study the testing of the endogeneity problem and propose a new test that has better power than the Durbin-Wu-Hausman (DWH) test in high dimensions.
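To make models (2.3) and (2.4) concrete, the following sketch (ours, with a linear reduced form for brevity) generates data containing both invalid and irrelevant instruments; all parameter values are hypothetical, not the paper's simulation design:

```python
import numpy as np

# Illustrative data-generating sketch for the baseline model (2.3) with a
# linear special case of the reduced form (2.4). Endogeneity is induced by
# correlating the structural error eps with the reduced-form error xi.
rng = np.random.default_rng(1)
n, L = 500, 10
beta_star = 0.5                      # treatment effect (hypothetical value)
alpha_star = np.zeros(L)
alpha_star[:2] = 1.0                 # first two instruments invalid: alpha*_j != 0
gamma_star = np.full(L, 0.8)         # all instruments relevant in this toy setup

Z = rng.standard_normal((n, L))
errs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
eps, xi = errs[:, 0], errs[:, 1]
D = Z @ gamma_star + xi              # reduced form
Y = Z @ alpha_star + D * beta_star + eps   # structural equation (2.3)

# OLS of Y on D alone is inconsistent: it picks up both Z@alpha* and cov(D, eps).
beta_ols = (D @ Y) / (D @ D)
```

With these hypothetical parameters the OLS slope is biased upward, which is exactly the problem the construction in the next section addresses.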
In this section, we illustrate the identification and estimation of models (2.3) and (2.4). Without knowing which instrument is irrelevant or invalid, we consider a large set of candidate instruments, IV = IV_1 ∪ IV_2 ∪ IV_3 ∪ IV_4, as introduced in Section 1. Here, IV_1 is the set of ideal instruments satisfying conditions (A1)-(A3). IV_2 could contribute to the construction of the optimal instruments and the correct specification of model (2.3). IV_3 should be excluded from both models; otherwise, the reduced form equation (2.4) will deliver poor finite sample properties of estimators and the structural model (2.3) will be overfitted. IV_4 is the set of instruments for which none of the conditions (A1)-(A3) is satisfied. These instruments should be excluded from model (2.4) but included in model (2.3). The model (2.3) will not be correctly specified if we delete these instruments mistakenly, that is, if we take invalid instruments as valid. Our goal is therefore to select the relevant instruments (denoted by A*_R = IV_1 ∪ IV_2) consistently for model (2.4) and the invalid instruments (denoted by A*_I = IV_2 ∪ IV_4) consistently for model (2.3). Specifically, we first estimate equation (2.4) by the adaptive group Lasso, and then substitute the estimated optimal instruments into (2.3) to select the valid instruments using the adaptive Elastic-Net.
Many researchers have studied additive nonparametric models (Linton et al., 1970). The nonparametric study of the reduced form equation is often troubled by the curse of dimensionality (Newey, 1990), which has been the focus of a substantial body of recent literature on high-dimensional problems. Huang et al. (2010) proposed a variable selection procedure in nonparametric additive models using the adaptive group Lasso based on a spline approximation to the nonparametric components. Fan and Zhong (2018) extend it to IV estimation, which selects the strong instruments consistently and adopts the linear form automatically for a truly linear reduced form model.
Here, we estimate the optimal instruments and select the relevant instruments consistently following Fan and Zhong (2018).
Let S_n be the space of polynomial splines of degree h ≥ 1 and {φ_k, k = 1, . . . , m_n} be the normalized B-spline basis functions for S_n, where m_n is the sum of the polynomial degree h and the number of knots. Let ψ_k(Z_ij) = φ_k(Z_ij) − n^{-1} Σ_{i=1}^n φ_k(Z_ij) be the centered B-spline basis functions for the j-th instrument. Thus, each f_nj ∈ S_n can be represented by a linear combination of normalized B-spline series
f_nj(Z_ij) = Σ_{k=1}^{m_n} γ_jk ψ_k(Z_ij),  1 ≤ j ≤ L   (2.5)
Under suitable smoothness conditions, the function f_j(Z_ij) in (2.4) can be well approximated by the function f_nj(Z_ij) in S_n by carefully choosing the coefficients {γ_j1, . . . , γ_jm_n} (Stone, 1985). Then the model (2.4) can be rewritten as
D_i ≈ Σ_{j=1}^L Σ_{k=1}^{m_n} γ_jk ψ_k(Z_ij) + ξ_i   (2.6)
Denote by D = (D_1, . . . , D_n)^T the n × 1 response vector, by γ_j = (γ_j1, γ_j2, . . . , γ_jm_n)^T the m_n × 1 coefficient vector for the j-th instrument in (2.6), and by γ = (γ_1^T, . . . , γ_L^T)^T the stacked m_nL × 1 coefficient vector. Let U_ij = (ψ_1(Z_ij), . . . , ψ_{m_n}(Z_ij))^T be the m_n × 1 basis vector, U_j = (U_1j, . . . , U_nj)^T be the n × m_n design matrix for the j-th instrument, and U = (U_1, . . . , U_L) be the corresponding n × m_nL design matrix. To select the significant instruments and estimate the component functions simultaneously, we consider the following penalized objective function with an adaptive group Lasso penalty
γ̂_n = arg min_γ ‖D − Uγ‖_2^2 + λ_n2 Σ_{j=1}^L ω_j ‖γ_j‖_2   (2.7)
where the weights are defined by
ω_j = ‖γ̃_j‖_2^{-1} if ‖γ̃_j‖_2 > 0;  ω_j = ∞ if ‖γ̃_j‖_2 = 0   (2.8)
and γ̃_n = (γ̃_1^T, . . . , γ̃_L^T)^T is obtained from the group Lasso
γ̃_n = arg min_γ ‖D − Uγ‖_2^2 + λ_n1 Σ_{j=1}^L ‖γ_j‖_2   (2.9)
Denote Â_R = { j : ‖γ̂_j‖_2 > 0 }; the adaptive group Lasso estimators of the f_j in (2.4) are
f̂_nj(Z_ij) = Σ_{k=1}^{m_n} γ̂_jk ψ_k(Z_ij),  j ∈ Â_R
Therefore the endogenous variable D_i can be estimated by
D̂_i = Σ_{j ∈ Â_R} Σ_{k=1}^{m_n} γ̂_jk ψ_k(Z_ij)   (2.10)
Denote D̂ = (D̂_1, . . . , D̂_n)^T; then D̂ is the optimal instrument, similar to Belloni et al. (2012) and Fan and Zhong (2018), when |A*_R| is greater than 1, as shown in the following Theorem 3.1. Note that when the nonparametric additive reduced form model (2.4) is indeed a linear model, this data-adaptive approach can usually select m_n = 1 to degenerate to a linear version of (2.4). The EBIC (Chen and Chen, 2008) and BIC (Wang et al., 2007) can be used to choose the tuning parameters λ_n1, λ_n2, and m_n adaptively in practice for the high-dimensional and low-dimensional cases, respectively.
Next, we illustrate how to select the valid instruments and estimate the true endogenous treatment effect β*. Taking the conditional expectation of both sides of (2.3) given the instrumental variables Z_i., we have
E( Y_i | Z_i. ) = D*_i β* + Z_i.^T α*   (2.11)
where D*_i = E( D_i | Z_i. ). Denoting ν_i = Y_i − E( Y_i | Z_i. ), it is straightforward to show that E( ν_i ) = E[ E( ν_i | Z_i. ) ] = 0 and cov( D*_i, ν_i ) = E[ D*_i E( ν_i | Z_i. ) ] = 0. Adding ν_i to both sides of (2.11), we have
Y_i = D*_i β* + Z_i.^T α* + ν_i   (2.12)
Thus, D*_i is an exogenous variable in (2.12). It is worth noting that the coefficient of the optimal instrument D*_i in the model (2.12) remains the same β* as in the structural equation (2.3).
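The weighting scheme in (2.8) is simple to state in code. A small sketch (with hypothetical numbers) of how the group-Lasso fit induces the adaptive weights:

```python
import numpy as np

# Sketch of the adaptive group-Lasso weights in (2.8): omega_j is the inverse
# l2-norm of the pilot group-Lasso estimate for group j; an infinite weight
# removes the group from the adaptive step (2.7) entirely.
def adaptive_group_weights(gamma_tilde, group_sizes):
    weights, start = [], 0
    for m in group_sizes:                 # m = m_n basis functions per instrument
        norm = np.linalg.norm(gamma_tilde[start:start + m])
        weights.append(1.0 / norm if norm > 0 else np.inf)
        start += m
    return np.array(weights)

# Two instruments with m_n = 2 basis functions each; the second group was
# zeroed out by the pilot group Lasso and is therefore dropped.
w = adaptive_group_weights(np.array([0.5, 0.5, 0.0, 0.0]), [2, 2])
```

Here the first group receives weight 1/‖(0.5, 0.5)‖_2 = √2 and the second receives an infinite weight.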
If D*_i were known, the model (2.12) could be estimated easily by linear estimation methods. In practice, we replace D*_i by its estimate D̂_i to obtain the final IV estimator for β*. Substituting D̂_i from (2.10) into (2.12), we have
Y = D̂ β* + Z α* + ν̂   (2.13)
Here, we need at least one relevant and valid instrument so that equation (2.13) is not troubled by the problem of potential collinearity. We consider a two-step procedure to estimate equation (2.13). Notice that, unlike β*, we are essentially concerned with which elements in α* are not equal to 0 (invalid IVs) and, to a lesser extent, with the true value of α*.
This is the assumption needed for the inference of the endogenous treatment effect in our model. There is a rich literature on inference using many weak IVs or invalid IVs. In recent studies, Hansen and Kozbur (2014) consider IV estimation with many weak IVs in high-dimensional models where consistent model selection in the first stage may not be possible. Bi et al. (2020) discuss inferring treatment effects after testing instrument strength in linear models. Berkowitz et al. (2012) show how valid inferences can be made when an instrumental variable does not perfectly satisfy the orthogonality condition.
(S1) We first remove the effect of D̂ from (2.13). Multiplying both sides of (2.13) by M_D̂, we have
M_D̂ Y = M_D̂ Z α* + M_D̂ ν̂   (2.14)
Denote Ỹ = M_D̂ Y, Z̃ = M_D̂ Z and ν̃ = M_D̂ ν̂. Equation (2.14) can be written as Ỹ = Z̃ α* + ν̃, which is a standard high-dimensional linear model.
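Step (S1) is a one-line projection in practice. A numpy sketch with simulated placeholders for D̂, Y and Z (the variable names are ours):

```python
import numpy as np

# Sketch of (2.14): premultiplying by M_Dhat removes the estimated optimal
# instrument, leaving the standard high-dimensional model Y~ = Z~ alpha* + nu~.
rng = np.random.default_rng(3)
n, L = 100, 5
D_hat = rng.standard_normal((n, 1))   # placeholder for the first-stage fit
Z = rng.standard_normal((n, L))
Y = rng.standard_normal(n)

M = np.eye(n) - D_hat @ np.linalg.solve(D_hat.T @ D_hat, D_hat.T)
Y_tilde, Z_tilde = M @ Y, M @ Z
```

By construction D̂^T Ỹ = 0 and D̂^T Z̃ = 0, so the subsequent penalized regression of Ỹ on Z̃ cannot load on D̂.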
The popular linear high-dimensional methods include the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), Elastic-Net (Zou and Hastie, 2005), group Lasso (Yuan and Lin, 2006), adaptive Lasso (Zou, 2006), Dantzig selector (Candes and Tao, 2007) and adaptive Elastic-Net (Zou and Zhang, 2009), among others. Here, we use the adaptive Elastic-Net proposed by Zou and Zhang (2009) to estimate α* consistently.
(a) We first compute the Elastic-Net estimator α̃ (Zou and Hastie, 2005) by (2.15), and then construct the adaptive weights by (2.16):
α̃ = (1 + λ_2/n) { arg min_α ‖Ỹ − Z̃α‖_2^2 + λ_2 ‖α‖_2^2 + λ_1 ‖α‖_1 }   (2.15)
ω̂_j = |α̃_j|^{-τ} if |α̃_j| > 0;  ω̂_j = ∞ if α̃_j = 0   (2.16)
where we take τ = ⌈2η/(1 − η)⌉ following Zou and Zhang (2009), with the constant 0 ≤ η < 1 (see Assumption 2 (A3) in the appendix).
(b) Then we solve the following optimization problem to get the adaptive Elastic-Net estimates
α̂ = (1 + λ_2/n) { arg min_α ‖Ỹ − Z̃α‖_2^2 + λ_2 ‖α‖_2^2 + λ_1* Σ_j ω̂_j |α_j| }   (2.17)
Denote Â_I = { Z_j, for j : |α̂_j| > 0 } as the empirical set of invalid IVs. Note that the l_1 regularization parameters λ_1 and λ_1* are allowed to differ, while the l_2 regularization parameters are the same. Here, we first use a proper range of values with a (relatively small) grid for λ_2. Then, for each λ_2, the EBIC and BIC are used to choose λ_1 and λ_1*. This method produces the entire solution path of the adaptive Elastic-Net.
(S2) In step 2, we take the selected invalid instruments as covariates in (2.13) and run a least squares regression. The resulting IV estimator of β* takes the form
β̂ = ( D̂^T M_{Â_I} D̂ )^{-1} D̂^T M_{Â_I} Y   (2.18)
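The adaptive penalized step above can be mimicked with any weighted l_1 solver by rescaling columns. The sketch below swaps in a plain adaptive Lasso solved by proximal gradient (ISTA) purely for illustration. It is not the adaptive Elastic-Net or the gcdnet implementation the paper uses, and all data and tuning values are hypothetical; the point is the column-rescaling trick that turns a weighted penalty into an unweighted one:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam, iters=3000):
    """Proximal gradient (ISTA) for 0.5*||y - Xb||_2^2 + lam*||b||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest eigenvalue of X'X
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft(b + step * (X.T @ (y - X @ b)), step * lam)
    return b

def adaptive_lasso(X, y, lam, w):
    """Weighted penalty lam * sum_j w_j |b_j| via column rescaling."""
    keep = np.isfinite(w)                    # infinite weight == drop the column
    Xs = X[:, keep] / w[keep]                # scale column j by 1/w_j
    c = lasso_ista(Xs, y, lam)               # plain Lasso in the rescaled problem
    b = np.zeros(X.shape[1])
    b[keep] = c / w[keep]                    # undo the rescaling
    return b

# Hypothetical example: 5 candidate coefficients, two truly non-zero.
rng = np.random.default_rng(4)
n = 200
X = rng.standard_normal((n, 5))
b_true = np.array([1.5, 0.0, 0.0, 0.8, 0.0])
y = X @ b_true + 0.1 * rng.standard_normal(n)

b_pilot, *_ = np.linalg.lstsq(X, y, rcond=None)   # pilot fit for the weights
w = 1.0 / np.abs(b_pilot)                         # analogue of (2.16) with tau = 1
b_hat = adaptive_lasso(X, y, lam=20.0, w=w)
support = np.flatnonzero(b_hat)
```

The adaptive weights heavily penalize coefficients whose pilot estimates are near zero, which is what drives the selection consistency results cited above.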
In summary, we present the following Algorithm 1 for the estimator.
Algorithm 1 Robust IV Estimator to both the Invalid and Irrelevant instruments (R2IVE)
Step 1.
Obtain the penalized estimator γ̂_n in (2.7) and estimate the conditional expectation of the endogenous treatment, D̂_i, in (2.10), where the weights are defined by (2.8) and (2.9) and the BIC or EBIC is applied to choose the tuning parameters λ_n1, λ_n2 and m_n. Step 2.
Take Ỹ = M_D̂ Y, Z̃ = M_D̂ Z and ν̃ = M_D̂ ν̂. Obtain the penalized estimator α̂ and the invalid instrument set Â_I in (2.17), where the weights are defined by (2.15) and (2.16). For a proper range of values for λ_2, the BIC or EBIC is used to choose λ_1 and λ_1* for each λ_2. Step 3.
Take the selected invalid instruments as covariates and run a least squares regression for (2.13). The resulting IV estimator of β* takes the form β̂ = ( D̂^T M_{Â_I} D̂ )^{-1} D̂^T M_{Â_I} Y.
It is important to consider the model selection in the asymptotic theory of the estimator, as discussed in Chernozhukov et al. (2018). We assume the following regularity conditions for the theoretical study.
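Step 3 of Algorithm 1 (equation (2.18)) is ordinary least squares after projecting out the selected invalid instruments. A sketch with simulated inputs (variable names and parameter values are ours, and the first-stage fit is idealized by using the true D*):

```python
import numpy as np

# Sketch of (2.18): beta_hat = (Dhat' M_{A_I} Dhat)^{-1} Dhat' M_{A_I} Y.
# By the Frisch-Waugh-Lovell theorem this equals the Dhat coefficient from an
# OLS regression of Y on (Dhat, Z_{A_I}).
def r2ive_beta(D_hat, Y, Z_inv):
    n = len(Y)
    if Z_inv.size:
        M = np.eye(n) - Z_inv @ np.linalg.solve(Z_inv.T @ Z_inv, Z_inv.T)
    else:
        M = np.eye(n)                 # no invalid instruments selected
    MD = M @ D_hat
    return float((MD @ Y) / (MD @ D_hat))

# Hypothetical check: invalid set known, D_hat taken as the true D* = Z gamma.
rng = np.random.default_rng(5)
n, L, beta_star = 400, 6, 0.5
Z = rng.standard_normal((n, L))
gamma = np.full(L, 0.8)
alpha = np.zeros(L)
alpha[:2] = 1.0                       # instruments 0 and 1 are invalid
D_star = Z @ gamma
Y = D_star * beta_star + Z @ alpha + rng.standard_normal(n)
beta_hat = r2ive_beta(D_star, Y, Z[:, :2])
```

Because M_{A_I} annihilates exactly the columns carrying the direct effects Zα*, the estimator recovers β* in this idealized setup.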
Assumption 1. (C1) E( D_i^2 ) < ∞ for i = 1, . . . , n.
(C2) The true value β* is bounded, |β*| ≤ C. The number of relevant instruments s_1 = |A*_R| is fixed and s_1 is a positive integer. The number of invalid instruments s_2 = |A*_I| is less than L/2. The number of relevant and valid instruments satisfies |IV_1| ≥ 1.
(C3) The distribution of ξ_i has subexponential tails, E[ exp( C|ξ_i| ) ] < ∞ for a finite positive constant C. E( ε_i^2 ) is bounded away from zero and infinity. E[ |ν_i|^{2+δ} ] < ∞ for some δ > 0.
Here we only give the assumptions used in the proof of Theorem 3.1. The assumptions of Lemmas 3.1 and 3.2 are given in the appendix. Condition (C1) imposes a mild restriction on the finite second moment of the endogenous treatment variable D_i. Condition (C2) restricts the boundedness of the true treatment effect and the numbers of relevant and invalid instruments. The number of invalid instruments needs to be less than half the number of instruments, which is the identification condition in Kang et al. (2016). Condition (C3) clarifies the conditions on the error terms. The distribution of the random errors ξ_i should not be too heavy-tailed; this is satisfied when the ξ_i are uniformly bounded or normally distributed. Lemma 3.1.
Using the group Lasso estimator γ̃ with λ_n1 ≍ O( √(n log(m_n L)) ) and m_n ≍ O( n^{1/(2s+1)} ) to construct the weights for the adaptive group Lasso estimator (here s is a positive constant; see Assumption 2 (A2) in the appendix), and supposing Assumptions 1 and 2 hold and λ_n2 ≍ O( √n ), then
P( Â_R = A_R ) → 1, as n → ∞   (3.1)
Σ_{j ∈ A_R} ‖γ̂_nj − γ_j‖_2^2 = O_p( n^{-(2s-1)/(2s+1)} )   (3.2)
Σ_{j ∈ A_R} ‖f̂_nj − f_j‖_2^2 = O_p( n^{-2s/(2s+1)} )   (3.3)
‖D* − D̂‖ = o_p(1)   (3.4)
This lemma shows the selection and estimation consistency of the adaptive group Lasso for the high-dimensional nonparametric additive reduced form model, which essentially follows the results of Theorem 3 in Huang et al. (2010). We give the proof of equation (3.4) in the appendix. Lemma 3.2.
Under Assumptions 1 and 2, the adaptive Elastic-Net achieves model selection consistency, that is, P( Â_I = A_I ) → 1 as n → ∞.
This lemma shows the selection consistency of the adaptive Elastic-Net for the high-dimensional linear structural model. That is, the true set of invalid instruments can be identified with probability tending to 1. The proof of this lemma is in the appendix.
Theorem 3.1. Suppose Assumptions 1 and 2 hold; then the R2IVE estimator in (2.18) is root-n consistent and asymptotically normal. That is,
σ_n^{-1} √n ( β̂ − β* ) → N(0, 1)   (3.5)
where the asymptotic variance σ_n^2 is given in the following two cases.
(i) In the case that the structural error is heteroscedastic, there exists a constant k such that E( ν_i^2 | Z_i. ) ≤ k holds almost surely for 1 ≤ i ≤ n, and
σ_n^2 = [ E( D*^T M_{A_I} D* ) ]^{-1} E( D*^T M_{A_I} D* ν_i^2 ) [ E( D*^T M_{A_I} D* ) ]^{-1}
(ii) In the case that the structural error is homoscedastic, that is, E( ν_i^2 | Z_i. ) = σ_ν^2 almost surely for all 1 ≤ i ≤ n, (3.5) holds with σ_n^2 = σ_ν^2 [ E( D*^T M_{A_I} D* ) ]^{-1}.

Simulation
We conduct various simulation studies to evaluate the estimation performance of different methods. We consider a structural model with one endogenous variable,
Y_i = D_i β* + Z_i.^T α* + ε_i
where β* = 0. Z_i. = (Z_i1, Z_i2, . . . , Z_iL)^T is generated from a multivariate normal distribution N(0, Σ), with Σ = (ρ_{j1 j2})_{L×L} and ρ_{j1 j2} = 0.5^{|j1 − j2|} for j1, j2 = 1, . . . , L and each i = 1, . . . , n. We set α = (0_q, ι_{s2}, 0_{L−q−s2})^T, where ι_{s2} is a 1 × s_2 vector of ones, which means the first q and the last L − q − s_2 instruments are valid, and s_2 is the number of invalid instruments. The endogenous variable is generated from either of the following two reduced form models,
Model 1: D_i = Z_i. γ* + ξ_i
Model 2: D_i = 2 Z_i1 + 0.5 Z_i2 + 1.5 Z_i3 + 3 sin(π Z_i4) + ξ_i
For Model 1, we consider a "cut-off at s_1" design, that is, γ* = (2, 0.5, 1.5, 1, 0_{L−s1})^T, where 0_{L−s1} is a 1 × (L − s_1) vector of zeros and s_1 is the number of strong instruments. We fill in the values of the non-zero elements in γ* by replicating the non-zero elements of (2, 0.5, 1.5, 1) until its length is s_1; e.g., if s_1 = 6, the non-zero elements of γ* are (2, 0.5, 1.5, 1, 2, 0.5). The instruments Z_i1, . . . , Z_i4 enter in the form given above in one simulation setting. We generate the error terms in both the structural model and the reduced form models by (ε_i, ξ_i)^T i.i.d. ∼ N(0, Σ_e), where Σ_e is a 2 × 2 covariance matrix with a non-zero off-diagonal element that induces endogeneity.
In the simulations, we vary (i) the sample size n, (ii) the number of strong instruments s_1, (iii) the number of invalid instruments s_2 and (iv) the distribution among the four instrument sets IV_1 − IV_4, which is affected by the value of q. The model settings are summarized in Table 1. We run each simulation R = 1000 times and compute the average of the estimation bias (denoted by "Bias"), R^{-1} Σ_{r=1}^R (β̂_r − β*), with its empirical standard deviation, and the estimated mean squared error (denoted by "MSE"), R^{-1} Σ_{r=1}^R (β̂_r − β*)^2, where β̂_r denotes an estimator of β* in the r-th experiment.

Table 1: The Model Setting
            n             L    s_1  s_2  q   |IV_1| |IV_2| |IV_3| |IV_4|
Linear      200           100  10   0    10  10     0      90     0
            200           100  10   10   7   7      3      83     7
            200           100  10   30   7   7      3      63     27
            200           100  4    30   2   2      2      68     28
            200           100  20   30   14  14     6      56     24
            200/500/1000  100  20   20   14  14     6      66     14
Nonlinear   500/200       100  4    0    4   4      0      96     0
            500           100  4    20   2   2      2      78     18
            500           100  12   20   9   9      3      71     17
Specifically, we reportaverage number (”mean”) of instruments selected as strong (for NAIVE) or invalid (for sisVIVE ,sisVIVE.post, and R2IVE) together with the minimum, median and maximum numbers of invalidIV selection, and the proportion of the instruments selected as strong or invalid to all strong orinvalid instruments (”freq”), respectively. Notice all the model settings in Table 1 are the ”StrongerValid” cases when the valid instruments are stronger than the invalid instruments. The readerscould refer to Kang et al. (2016) for more discussion on this setting.All simulation studies are conducted using the statistical software R. In particular, the R package naivereg (Fan and Zhong, 2018) is used to estimate the optimal instruments and the R package sisVIVE (Kang et al., 2016) is used to get the sisVIVE estimator. We use the R package gcdnet (Yang and Zou, 2012) to select invalid instruments with (adaptive) Elastic-Net penalty. The EBIC(Chen and Chen, 2008) and BIC (Wang et al., 2007) methods are employed to find the optimaltuning parameters for the high-dimensional and low-dimensional cases, respectively. In this setting, we consider the linear reduced form. We first fix the sample size n , IV dimension L , the number of relevant instruments s and the value of q but change the number of invalidinstruments s to check the influence of the number of invalid instruments. Then we fix n , L and13 but change the value of s and q to check the influence of the strength of instruments. Finally,we fix L , s and s but change the sample size.Table 2: Fix n = 200 , L = 100 , s = 10, change the number of invalid instruments s Bias std dev MSE mean median max min freq s = 0 OLS 0.0163 0.0102 0.00042SLS 0.0079 0.0102 0.0002Oracle 2SLS 0.0003 0.0103 0.0001 NAIVE sisVIVE
R2IVE 0.0005 0.0103 0.0001 0.37 0 3 0 - s = 10 OLS 0.5310 0.0983 0.29272SLS 0.5282 0.0988 0.2899Oracle 2SLS 0.0008 0.0138 0.0002 NAIVE sisVIVE
R2IVE 0.0019 0.0137 0.0002 10.53 10 14 10 1 s = 30 OLS 0.2686 0.0937 0.08062SLS 0.2629 0.0942 0.0778Oracle 2SLS 0.0002 0.0145 0.0002 NAIVE sisVIVE
R2IVE -0.0005 0.015 0.0003 33.63 33 51 30 1
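The "Bias", "std dev" and "MSE" columns reported in the tables are straightforward Monte Carlo summaries. A sketch (with made-up replication values) of how they are computed:

```python
import numpy as np

# Sketch of the reported summaries: average bias R^{-1} sum_r (beta_hat_r - beta*),
# its empirical standard deviation, and MSE R^{-1} sum_r (beta_hat_r - beta*)^2.
def mc_summaries(beta_hats, beta_star):
    errs = np.asarray(beta_hats, dtype=float) - beta_star
    return {"Bias": errs.mean(),
            "std dev": errs.std(ddof=1),
            "MSE": (errs ** 2).mean()}

# Hypothetical replication values, not from the paper's experiments.
out = mc_summaries([0.48, 0.52, 0.50, 0.55], beta_star=0.50)
```

For these four made-up replications the average bias is 0.0125 and the MSE is 0.000825.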
NOTE: This table summarizes the averages of the estimated bias with the standard deviations and MSE, the average number of instruments selected as strong (for NAIVE) or invalid (for sisVIVE, sisVIVE.post and R2IVE) together with the median, minimum and maximum numbers, and the proportion of the instruments selected as strong or invalid to all strong or invalid instruments for the respective estimators. Notice that sisVIVE and sisVIVE.post share the same invalid IV selection results, and R2IVE shares the same strong IV selection performance as NAIVE.
In the first case, we fix n = 200 and L = 100. The number of relevant instruments is s_1 = 10. We use the values s_2 = 0, 10,
30 to check the influence of the number of invalid instruments onestimation results. For s = 0, we have |IV | = |IV | = 0. Set |IV | = 10 and |IV | = 90. Forother nonzero values of s , we set q = 7, which means |IV | = 7, |IV | = 3, |IV | = 100 − s − |IV | = s −
3. The specific numbers of each type of instrument variables are summarized inTable 1. Table 2 reports the estimation bias with the standard deviations and MSE, as well as themodel selection results. Figure 1 shows the box plots of bias for these estimators. In Figure 1, wedo not include the results of OLS estimators since they are always biased and have the largest MSE,which will enlarge the scale of the figure and make the differences between other IV estimators lessdistinguishable. When there is no invalid instruments, the sisVIVE is outperformed by the NAIVEdue to the effect of many irrelevant instruments. When the invalid instruments exist, the 2SLS14igure 1: Fix n = 200 , L = 100 , s = 10, change the number of invalid instruments s and NAIVE estimators perform similarly and are both severely biased due to the effects of invalidinstruments. The sisVIVE estimators have smaller bias and MSE compared to 2SLS and NAIVEestimators. However, they are still substantially biased although the invalid instruments are alwaysselected as invalid by sisVIVE and the sisVIVE.post does not help reducing bias. Furthermore, sisVIVE always select too many invalid instruments, which reduce the efficiency of estimators. OurR2IVE performs best (among the non-Oracle estimators) with the smallest bias and MSE. It is veryclose to oracle 2SLS estimator in linear reduced form models and is shown to be robust to bothirrelevant and invalid instruments.Figure 2: Fix n = 200 , L = 100 , s = 30, change the number of irrelevant instruments s n = 200 , L = 100 , s = 30, change the number of irrelevant instruments s Bias std dev MSE mean median max min freq s = 4 OLS 0.5575 0.1670 0.33912SLS 0.5490 0.1699 0.3317Oracle 2SLS -0.0014 0.0336 0.0011NAIVE 0.5318 0.1764 0.3149 5.47 5 8 4 1 sisVIVE R2IVE 0.0133 0.0995 0.0165 37.17 36 70 30 1 s = 10 OLS 0.2686 0.0937 0.08062SLS 0.2629 0.0942 0.0778Oracle 2SLS 0.0002 0.0145 0.0002 NAIVE sisVIVE
R2IVE -0.0005 0.0150 0.0003 33.63 33 51 30 1 s = 20 OLS 0.2441 0.0640 0.06362SLS 0.2413 0.0642 0.0622Oracle 2SLS 0.0008 0.0094 0.0001NAIVE 0.2379 0.0653 0.0608 21.70 21 30 20 1 sisVIVE R2IVE 0.0011 0.0095 0.0001 32.06 32 38 29 0.998
Please see the table notes in Table 2.

In the second case, we fix n = 200 and L = 100 with 30 invalid instruments, and vary the number of relevant instruments over 4, 10 and 20, with q = 2, 7 and 14, respectively; the corresponding numbers of strong valid, strong invalid, weak valid and weak invalid instruments are (2, 2, 68, 28), (7, 3, 63, 27) and (14, 6, 56, 24). The results are shown in Table 3 and Figure 2. We observe that the sisVIVE estimators depend on the number of relevant instruments, with diminishing bias and MSE as the number of relevant instruments increases. With only 4 relevant instruments, the sisVIVE is even outperformed by NAIVE, which takes all instruments as valid, and sisVIVE.post does not reduce the bias of sisVIVE. This shows the importance of selecting strong IVs. Our R2IVE performs best and estimates the causal effect precisely; it also improves as the number of relevant instruments increases.

In the last case, we fix L = 100, with 20 relevant instruments (and q = 14) and 20 invalid instruments, while changing the sample size. The results are shown in Table 4 and Figure 3. Increasing the sample size does not improve the performance of 2SLS and NAIVE, as they are always biased due to the endogeneity of some IVs. The sisVIVE estimators have diminishing bias and MSE as the sample size increases. Our R2IVE estimator always performs best, with the smallest bias and MSE; compared to sisVIVE, it has much better finite sample performance and is very close to the oracle 2SLS in linear models. Note that sisVIVE.post reduces some of the bias of sisVIVE when n is 1000 but remains biased.

Table 4: Fix L = 100, 20 relevant and 20 invalid instruments; vary the sample size n

                Bias     std dev  MSE     mean   median  max  min  freq
n = 200
  OLS           0.2450   0.0508   0.0626
  2SLS          0.2422   0.0509   0.0613
  Oracle 2SLS   0.0007   0.0092   0.0001
  NAIVE
  sisVIVE
  R2IVE         0.0011   0.0091   0.0001  21.21  21      27   20   1
n = 500
  OLS           0.2418   0.0322   0.0595
  2SLS          0.2373   0.0324   0.0573
  Oracle 2SLS   0.0002   0.0056   0.0000
  NAIVE         0.2346   0.0328   0.0561  25.79  26      31   21   0.999
  sisVIVE
  R2IVE         0.0005   0.0056   0.0000  20.56  20      24   20   1
n = 1000
  OLS           0.2424   0.0227   0.0593
  2SLS          0.2373   0.0228   0.0568
  Oracle 2SLS   -0.0002  0.0039   0.0000
  NAIVE         0.2352   0.0231   0.0559  25.72  26      31   21   1
  sisVIVE
  R2IVE         -0.0001  0.0039   0.0000  20.40  20      24   20   1

Please see the table notes in Table 2.

Figure 3: Fix L = 100, 20 relevant and 20 invalid instruments; vary the sample size n

In this subsection, we consider the nonlinear reduced form. The results are summarized in Table 5 and Figure 4. The influence of the invalid instruments is checked by comparing the top left and bottom left panels of Figure 4. Similar to the linear case, the 2SLS and NAIVE estimators become biased due to the invalid instruments. When there are no invalid instruments, the sisVIVE is outperformed by NAIVE, since it accounts for neither the irrelevant instruments nor the nonlinear reduced form equation. When invalid instruments exist, our R2IVE always performs best. Unlike the linear case, the NAIVE always outperforms the 2SLS by capturing the nonlinear structure. We check the influence of the strength of instruments by comparing the bottom left and bottom right panels of Figure 4: both sisVIVE and R2IVE improve as the number of relevant instruments increases, and our estimator always outperforms sisVIVE and sisVIVE.post. We check the influence of the sample size by comparing the top right and bottom left panels of Figure 4: both sisVIVE and R2IVE improve as the sample size increases in the "stronger invalid" settings.
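The gain from modeling a nonlinear reduced form can be illustrated with a toy example (Python, not the paper's R code or DGP): when the first stage is D = sin(2z) + noise, regressing D on a polynomial basis of z yields a much better instrument fit than a linear first stage, while the second stage remains an ordinary IV regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10000
z = rng.uniform(-2, 2, size=n)
u = rng.normal(size=n)
d = np.sin(2 * z) + u + rng.normal(size=n)   # nonlinear reduced form
y = 1.0 * d + u + rng.normal(size=n)         # true beta = 1; d endogenous via u

def first_stage(X, d):
    """Least-squares first stage: returns in-sample R^2 and fitted values."""
    coef, *_ = np.linalg.lstsq(X, d, rcond=None)
    fitted = X @ coef
    return 1 - (d - fitted).var() / d.var(), fitted

X_lin = np.column_stack([np.ones(n), z])             # linear first stage
X_poly = np.column_stack([z**k for k in range(6)])   # degree-5 polynomial basis
r2_lin, dhat_lin = first_stage(X_lin, d)
r2_poly, dhat_poly = first_stage(X_poly, d)

# Second stage: IV coefficient using the nonlinear first-stage fit as instrument.
beta_poly = float(dhat_poly @ y / (dhat_poly @ d))

assert r2_poly > r2_lin          # the basis captures sin(2z) far better
assert abs(beta_poly - 1.0) < 0.1
```

The same logic underlies using a nonparametric additive first stage: a stronger first-stage fit translates into a more efficient second-stage estimator.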
Table 5: Nonlinear reduced form equation

                Bias     std dev  MSE     mean   median  max  min  freq
n = 500, p = 100, 4 strong IVs, 0 invalid IVs
  OLS           0.0256   0.0079   0.0007
  2SLS          0.0111   0.0115   0.0003
  Oracle 2SLS   0.0011   0.0136   0.0002
  NAIVE         0.0035   0.0081   0.0001  5.66   6       9    4    1
  sisVIVE
  sisVIVE.post  0.0112   0.0116   0.0003
  R2IVE         0.0025   0.0140   0.0001  0.40   0       5    0    -
n = 200, p = 100, 4 strong IVs, 20 invalid IVs
  OLS           0.1999   0.0952   0.0500
  2SLS          0.2771   0.1166   0.0989
  Oracle 2SLS   0.0013   0.0402   0.0015
  NAIVE         0.1895   0.0987   0.0471  5.70   6       8    3    0.901
  sisVIVE
  sisVIVE.post  0.0345   0.0207   0.0018
  R2IVE         0.0052   0.0305   0.0005  23.82  23      48   20   1
n = 500, p = 100, 4 strong IVs, 20 invalid IVs
  OLS           0.1972   0.0597   0.0428
  2SLS          0.3715   0.0875   0.1550
  Oracle 2SLS   0.0009   0.0237   0.0006
  NAIVE         0.1799   0.0608   0.0364  5.67   6       9    4    1
  sisVIVE
  sisVIVE.post  0.0287   0.0184   0.0014
  R2IVE         0.0037   0.0169   0.0002  22.05  22      31   20   1
n = 500, p = 100, 12 strong IVs, 20 invalid IVs
  OLS           0.2373   0.0288   0.0572
  2SLS          0.5223   0.0475   0.2770
  Oracle 2SLS   -0.0008  0.0248   0.0006
  NAIVE         0.2346   0.0290   0.0560  17.72  18      22   12   1
  sisVIVE
  sisVIVE.post  0.0219   0.0124   0.0007
  R2IVE         0.0020   0.0084   0.0000  20.28  20      24   20   1

NOTE: In the last block of this table, we replicate the functional forms of the original strong instruments for additional instruments, so that the number of strong IVs is 12. For other table notes please refer to Table 2.

Figure 4: Nonlinear reduced form setting

In this section, we illustrate the usefulness of our estimator by revisiting the classic question of trade and growth. The effect of trade on growth is an important research topic in both theoretical and empirical economics, with a strong bearing on trade policies. One important issue in the empirical study of trade and growth is the endogeneity of the trade variable, which arises from unobserved common driving forces behind both trade and growth. Frankel and Romer (1999, FR99 henceforth) circumvented the endogeneity problem using an instrumental variable constructed from a gravity model of trade (Anderson, 1979). Using cross-sectional data on 150 countries and economies from the mid-1980s, they showed that trade activity correlates positively with the growth rate. In FR99, the linear structural equation includes the log of GDP per worker (the outcome variable), the share of international trade to GDP (the explanatory variable of interest) and two exogenous variables representing the size of a country: population and land area. They also consider a linear bilateral trade reduced form equation, in which the instruments include the distance between two countries, dummy variables for landlocked countries and for a common border between two countries, the interaction terms, and the two exogenous variables mentioned above. The instrumental variable (called the proxy for trade in FR99) is the sum of the predicted bilateral trade shares for country i. Fan and Zhong (2018) extend the study of FR99 by considering more potential instruments and a nonlinear reduced form. Besides the instrument used in
FR99, they also include total water area, coastline, arable land as a percentage of total land, land boundaries, forest area as a percentage of land area, the number of official and other commonly used languages in a country, and the interaction terms of the constructed trade proxy with these variables (15 instruments in total). The selected instruments include the proxy for trade (the original instrument in FR99), land area, total population, and the interaction term of the proxy for trade and the number of languages. The NAIVE method provides stronger results regarding the effect of trade on growth than FR99.

In this study, we are concerned about invalid IVs, that is, instruments that might affect growth directly. The inclusion of invalid instruments leads to inconsistency of the estimator of β. Following
FR99 and NAIVE, we use cross-sectional data from 158 countries and economies, updated to the year 2017, to investigate the contemporary effect of trade on growth. We consider a linear structural equation

$$ \ln Y_i = c + \beta D_i + \alpha' Z_i + \delta' S_i + \varepsilon_i, \quad (5.1) $$

where $Y_i$ is GDP per worker in country $i$, $D_i$ is the share of international trade to GDP, $S_i$ collects the size variables of a country, population and land area (as in FR99), and $Z_i$ collects the instruments. Besides the instruments used in Fan and Zhong (2018), we also include an instrumental variable related to air pollution: the density of PM2.5. Kukla-Gryz (2009) found that international trade and per capita income lead to an increase in air pollution in developing countries. To reduce the negative environmental impact of international trade, the state will gradually adopt new policies with more environmentally friendly standards, raising the costs of production, which means air pollution could in turn affect international trade. On the other hand, there is empirical evidence for an environmental Kuznets curve between economic growth and environmental pollution; Ali and Puppim de Oliveira (2018) conclude that the impact of pollution abatement on economic growth could turn into a win-win policy option. Hence, there is reason to believe that PM2.5 may affect trade, but may also affect economic growth through mechanisms other than trade. $\varepsilon_i$ is the unobserved random disturbance in the growth equation.

The reduced form model we consider is

$$ D_i = \mu + \sum_j f_j(z_{ij}) + \xi_i, \quad (5.2) $$

where $f_j(\cdot)$ is the $j$th unknown smooth univariate function, $z_{ij}$ is the $i$th observed value of the $j$th instrument, and $\xi_i$ is an unobserved random disturbance that is likely correlated with $\varepsilon_i$.

Note that we can replace the variables $Y_i$, $D_i$ and $Z_i$ with the residuals after regressing them on $S_i$ (e.g., replace $Y$ by $\tilde Y = (I - P_S) Y$).
Equations (5.1) and (5.2) then become

$$ \ln \tilde Y_i = c + \beta \tilde D_i + \alpha' \tilde Z_i + \varepsilon_i, \qquad \tilde D_i = \mu + \sum_j f_j(\tilde z_{ij}) + \xi_i. \quad (5.3) $$

The summary statistics of the main data are presented in Table 6. Figure 5 is the scatter plot of the actual and constructed shares of international trade; their correlation coefficient is 0.36.

Table 6: Summary statistics

                          mean     std.dev    median   minimum  maximum  sample size
Ln Income Per Capita      10.18    1.1        10.42    7.46     12.03    158
Real Trade Share          0.87     0.52       0.76     0.2      4.13     158
Constructed Trade Share   0.09     0.05       0.08     0.02     0.3      158
Ln Population             1.38     1.8        1.48     -3.04    6.67     158
Ln Area (Land)            11.73    2.26       12.02    5.7      16.61    158
Area (Water)              25378    100818.4   2365     0        891163   158
Coastline                 4268.6   17451.71   523      0        202080   158
Land Boundaries           2837.8   3407.8     1899.5   0        22147    158
% Forest                  29.89    22.38      30.62    0        98.26    158
% Arable Land             40.95    21.55      42.06    0.56     82.56    158
PM2.5                     25.05    19.43      22       5.9      100      158
Languages                 1.87     2.13       1        1        16       158

NOTE: Income per capita is measured in dollars. Population is measured in millions. Land area and water area are measured in square kilometers. Coastline and land boundaries are measured in kilometers. PM2.5 is measured in micrograms per cubic meter. Source: FR99, Penn World Table (PWT 9.1), the World Bank, and State of Global Air.
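The partialling-out step above is the Frisch-Waugh-Lovell device: residualizing the outcome and treatment on the exogenous size controls leaves the coefficient of interest unchanged. A small numpy check (illustrative variable names, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 2
S = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # controls incl. constant
d = rng.normal(size=n) + S[:, 1]                            # treatment, correlated with S
y = 2.0 * d + S @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Full regression of y on (d, S): coefficient on d.
X = np.column_stack([d, S])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Partialled-out regression: residualize y and d on S, then regress.
P_S = S @ np.linalg.inv(S.T @ S) @ S.T
M_S = np.eye(n) - P_S
beta_fwl = float((M_S @ d) @ (M_S @ y) / ((M_S @ d) @ (M_S @ d)))

assert np.isclose(beta_full, beta_fwl)  # FWL: the two coefficients coincide
```

This is why working with the tilde variables in (5.3) is without loss of generality for estimating β.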
To investigate the influence of invalid instruments on the estimation of β, we first conduct the estimation using the same instruments as Fan and Zhong (2018) and compare our estimate with FR99 and NAIVE; we then add the instrumental variable PM2.5. The results are summarized in Tables 7 and 8. The first column is the OLS estimator; the second is the 2SLS estimator using the same instruments as FR99; the third column is the NAIVE estimator; and the last column is our estimator.

Figure 5: Scatter plot of the real and constructed trade shares

When the variable PM2.5 is not included, the relevant instruments selected by the adaptive group Lasso with EBIC are the proxy for trade and the interaction term of the proxy for trade with forest area as a percentage of land area. The fitted functions of the selected instruments are plotted in Figure 6, which suggests that both selected instruments are likely to have a nonlinear relationship with the real trade share. In Table 7, the OLS estimator has severe bias and is inconsistent because of the endogeneity issue. The t statistic for NAIVE on trade is 3.85, compared with 2.99 for FR99. As expected, each element of α is estimated to be zero, and our estimator coincides with NAIVE.

Table 7: Estimation results for the trade and income data (PM2.5 not included in the IV set)

                OLS        2SLS       NAIVE      R2IVE
constant        5.68E-08   4.93E-08   5.35E-08   5.35E-08
                (0.08)     (0.09)     (0.08)     (0.08)
trade share     0.88***    1.43**     1.11***    1.11***
                (0.18)     (0.48)     (0.29)     (0.29)

Compared with FR99, the causal effect of trade on growth in the 2010s is found to be smaller in magnitude but even more significant.
In this paper, we develop an IV estimator robust to both invalid and irrelevant instruments (R2IVE) for the estimation of endogenous treatment effects, which extends the sisVIVE (Kang et al., 2016) by considering a truly high-dimensional instrumental variable setting and a general nonlinear reduced form equation. The proposed R2IVE is shown to be root-n consistent and asymptotically normal. Monte Carlo simulations demonstrate that the R2IVE performs better than existing contemporary IV estimators (such as NAIVE and sisVIVE) in many cases. The empirical study revisits the classic question of trade and growth. It demonstrates that the R2IVE can be applied to estimate an endogenous treatment effect with a large set of instruments, without knowing which instruments are relevant or valid or whether the reduced form is linear or nonlinear.

References

Ali, S., Puppim de Oliveira, J., 2018. Pollution and economic development: An empirical research review. Environmental Research Letters 13.
Amemiya, T., 1974. The nonlinear two-stage least squares estimator. Journal of Econometrics 2, 105-110.
Anderson, J., 1979. A theoretical foundation for the gravity equation. American Economic Review 69, 106-116.
Bekker, P. A., 1994. Alternative approximations to the distributions of instrumental variable estimators. Econometrica 62 (3), 657-681.
Belloni, A., Chen, D., Chernozhukov, V., Hansen, C., 2012. Sparse models and methods for optimal instruments with an application to eminent domain. SSRN Electronic Journal.
Berkowitz, D., Caner, M., Ying, F., 2012. The validity of instruments revisited. Journal of Econometrics 166, 255-266.
Bi, N., Kang, H., Taylor, J., 2020. Inferring treatment effects after testing instrument strength in linear models. Working paper.
Candes, E., Tao, T., 2007. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics 35, 2313-2351.
Caner, M., Fan, Q., 2015. Hybrid generalized empirical likelihood estimators: Instrument selection with adaptive lasso. Journal of Econometrics 187 (1), 256-274.
Caner, M., Han, X., Lee, Y., 2018. Adaptive elastic net GMM estimation with many invalid moment conditions: Simultaneous model and moment selection. Journal of Business & Economic Statistics 36 (1), 24-46.
Chen, J., Chen, Z., 2008. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95 (3), 759-771.
Cheng, X., Liao, Z., 2015. Select the valid and relevant moments: An information-based LASSO for GMM with many moments. Journal of Econometrics 186 (2), 443-464.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., Robins, J., 2018. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21, C1-C68.
Donald, S. G., Newey, W. K., 2001. Choosing the number of instruments. Econometrica 69 (5), 1161-1191.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (456), 1348-1360.
Fan, Q., Zhong, W., 2018. Nonparametric additive instrumental variable estimator: A group shrinkage estimation perspective. Journal of Business & Economic Statistics, 1-40.
Frankel, J., Romer, D., 1999. Does trade cause growth? American Economic Review 89, 379-399.
Gautier, E., Tsybakov, A., 2011. High-dimensional instrumental variables regression and confidence sets.
Guo, Z., Kang, H., Cai, T., Small, D., 2018a. Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal Statistical Society: Series B 80, 793-815.
Guo, Z., Kang, H., Cai, T., Small, D., 2018b. Testing endogeneity with high dimensional covariates. Journal of Econometrics 207, 175-187.
Hansen, C., Kozbur, D., 2014. Instrumental variables estimation with many weak instruments using regularized JIVE. Journal of Econometrics 182, 290-308.
Huang, J., Horowitz, J., Wei, F., 2010. Variable selection in nonparametric additive models. Annals of Statistics 38, 2282-2313.
Kang, H., Zhang, A., Cai, T. T., Small, D. S., 2016. Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American Statistical Association 111 (513), 132-144.
Kolesar, M., Chetty, R., Friedman, J., Glaeser, E., Imbens, G., 2015. Identification and inference with many invalid instruments. Journal of Business & Economic Statistics 33, 474-484.
Kuersteiner, G., Okui, R., 2010. Constructing optimal instruments by first-stage prediction averaging. Econometrica 78, 697-718.
Kukla-Gryz, A., 2009. Economic growth, international trade and air pollution: A decomposition analysis. Ecological Economics 68, 1329-1339.
Liao, Z., 2013. Adaptive GMM shrinkage estimation with consistent moment selection. Econometric Theory 29 (5), 857-904.
Lin, W., Feng, R., Li, H., 2015. Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics. Journal of the American Statistical Association 110 (509), 270-288.
Linton, O., Chen, R., Härdle, W., 1997. An analysis of transformations for additive nonparametric regression. Journal of the American Statistical Association 92.
Newey, W., 1990. Efficient instrumental variable estimation of nonlinear models. Econometrica 58, 809-837.
Okui, R., 2011. Instrumental variable estimation in the presence of many moment conditions. Journal of Econometrics 165, 70-86.
Stone, C., 1985. Additive regression and other nonparametric models. The Annals of Statistics 13.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58 (1), 267-288.
Wang, H., Li, R., Tsai, C.-L., 2007. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94, 553-568.
Windmeijer, F., Farbmacher, H., Davies, N., Smith, G., 2018. On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association, 1-32.
Yang, Y., Zou, H., 2012. An efficient algorithm for computing the HHSVM and its generalizations. Journal of Computational & Graphical Statistics 22 (2), 396-415.
Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B 68, 49-67.
Zivot, E., Wang, J., 1998. Inference on structural parameters in instrumental variables regression with weak instruments. Econometrica 66, 1389-1404.
Zou, H., 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 (476), 1418-1429.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B 67, 301-320.
Zou, H., Zhang, H., 2009. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics 37, 1733-1751.
First, we state the standard regularity conditions for nonparametric estimation (Huang et al., 2010) and the adaptive Elastic-Net (Zou and Zhang, 2009).

Assumption 2.

(A1) The support of each instrument $Z_j$ is $[a, b]$, where $a$ and $b$ are finite real numbers, and $\lim_{n \to \infty} \max_{i = 1, \ldots, n} \sum_{j=1}^{L} Z_{ij}^2 / n = 0$. The density function $g_j$ of $Z_j$ in (2.4) satisfies $0 < C_1 < g_j(z) < C_2 < \infty$ on $[a, b]$ for $j = 1, \ldots, L$. We use $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ to denote the minimum and maximum eigenvalues of a positive definite matrix $M$, respectively. We assume $b_1 \le \lambda_{\min}(\frac{1}{n} Z^T Z) \le \lambda_{\max}(\frac{1}{n} Z^T Z) \le B_1$, where $b_1$ and $B_1$ are two positive constants.

(A2) Let $\mathcal{F}$ be the class of functions $f$ such that the $k$th derivative $f^{(k)}$ exists and satisfies a Lipschitz condition of order $r \in (0, 1]$. That is,
$$ \mathcal{F} = \big\{ f(\cdot): \big| f^{(k)}(t_1) - f^{(k)}(t_2) \big| \le C |t_1 - t_2|^r \text{ for } t_1, t_2 \in [a, b] \text{ and a constant } C > 0 \big\}, $$
where $k$ is a nonnegative integer and $r$ is such that $d = k + r > 0.5$. Suppose $f_j \in \mathcal{F}$ for $j = 1, \ldots, L$ in (2.4), and there exists $C_f > 0$ such that $\min_{j \in \mathcal{A}_R} \| f_j \| \ge C_f$, where $\| f_j \| = ( \int_a^b f_j^2(x) \, dx )^{1/2}$.

(A3) $\lim_{n \to \infty} \log(L) / \log(n) = \eta$ for some $0 \le \eta < 1$. The tuning parameters satisfy $\lim_{n \to \infty} \lambda / n = 0$, $\lim_{n \to \infty} \lambda / \sqrt{n} = 0$, $\lim_{n \to \infty} \lambda^* / \sqrt{n} = 0$, and $\lim_{n \to \infty} \lambda^* \sqrt{n} \, n^{(1 - \eta)(1 + \tau) - 1} = \infty$. In addition, $\lim_{n \to \infty} \lambda \sqrt{n} \, ( \sum_{j \in \mathcal{A}_I} \alpha_j^{*2} )^{1/2} = 0$ and $\lim_{n \to \infty} \min\big( \frac{n \lambda}{\sqrt{p}}, \big( \frac{\sqrt{n}}{\sqrt{p} \lambda^*} \big)^{\tau} \big) \big( \min_{j \in \mathcal{A}_I} | \alpha_j^* | \big) \to \infty$, where $\tau = \lceil 2\eta / (1 - \eta) \rceil + 1$.

7.1 Proof of Lemma 3.1

Equations (3.1)-(3.3) essentially follow from the results of Theorem 3 in Huang et al. (2010); we only give the proof of (3.4).
$$
\begin{aligned}
\| D^* - \hat D \|
&= \Big\| \sum_{j \in \mathcal{A}_R} f_j - \sum_{j \in \hat{\mathcal{A}}_R} \hat f_{nj} \Big\|
= \Big\| \sum_{j \in \hat{\mathcal{A}}_R \cap \mathcal{A}_R} ( \hat f_{nj} - f_j ) - \sum_{j \in \hat{\mathcal{A}}_R \cap \mathcal{A}_R^c} \hat f_{nj} + \sum_{j \in \hat{\mathcal{A}}_R^c \cap \mathcal{A}_R} f_j \Big\| \\
&\le \Big\| \sum_{j \in \hat{\mathcal{A}}_R \cap \mathcal{A}_R} ( \hat f_{nj} - f_j ) \Big\| + \Big\| \sum_{j \in \hat{\mathcal{A}}_R \cap \mathcal{A}_R^c} \hat f_{nj} \Big\| + \Big\| \sum_{j \in \hat{\mathcal{A}}_R^c \cap \mathcal{A}_R} f_j \Big\| \\
&\le \sqrt{|\mathcal{A}_R|} \Big( \sum_{j \in \mathcal{A}_R} \| \hat f_{nj} - f_j \|^2 \Big)^{1/2} + \Big\| \sum_{j \in \hat{\mathcal{A}}_R \cap \mathcal{A}_R^c} \hat f_{nj} \Big\| + \Big\| \sum_{j \in \hat{\mathcal{A}}_R^c \cap \mathcal{A}_R} f_j \Big\|
= o_p(1), \quad (7.1)
\end{aligned}
$$
where the first term is $o_p(1)$ by equation (3.3) and the last two terms converge to zero by the selection consistency of $\hat{\mathcal{A}}_R$. $\Box$

We first show that the following two equations hold; the results of Lemma 3.2 then follow from Theorem 3.3 of Zou and Zhang (2009):
$$ \| M_{\hat D} Y - M_{D^*} Y \| / \sqrt{n} = o_p(1), \quad (7.2) $$
$$ \| M_{\hat D} Z - M_{D^*} Z \| / \sqrt{n} = o_p(1). \quad (7.3) $$
Before proving (7.2) and (7.3), we first establish
$$ \| P_{D^*} - P_{\hat D} \| = o_p(1). \quad (7.4) $$

$$
\begin{aligned}
\| P_{D^*} - P_{\hat D} \|
&= \big\| D^* ( D^{*\prime} D^* )^{-1} D^{*\prime} - \hat D ( \hat D' \hat D )^{-1} \hat D' \big\| \\
&= \big\| D^* ( D^{*\prime} D^* )^{-1} D^{*\prime} - ( \hat D - D^* + D^* ) ( \hat D' \hat D )^{-1} ( \hat D - D^* + D^* )' \big\| \\
&= \big\| D^* \big[ ( D^{*\prime} D^* )^{-1} - ( \hat D' \hat D )^{-1} \big] D^{*\prime} - ( \hat D - D^* ) ( \hat D' \hat D )^{-1} ( \hat D - D^* )' - 2 ( \hat D - D^* ) ( \hat D' \hat D )^{-1} D^{*\prime} \big\| \\
&\le \big| ( D^{*\prime} D^* )^{-1} - ( \hat D' \hat D )^{-1} \big| \, \| D^* \|^2 + \big| ( \hat D' \hat D )^{-1} \big| \, \| \hat D - D^* \|^2 + 2 \big| ( \hat D' \hat D )^{-1} \big| \, \| \hat D - D^* \| \, \| D^* \| \\
&= \frac{ \big| \| \hat D \|^2 - \| D^* \|^2 \big| }{ \| \hat D \|^2 } + \frac{ \| \hat D - D^* \|^2 }{ \| \hat D \|^2 } + \frac{ 2 \| \hat D - D^* \| \, \| D^* \| }{ \| \hat D \|^2 }
=: S_1 + S_2 + S_3, \quad (7.5)
\end{aligned}
$$
where $D^{*\prime} D^*$ and $\hat D' \hat D$ are real numbers and the inequality is derived from the triangle inequality.

$$
S_1 = \frac{ \big| \| \hat D - D^* + D^* \|^2 - \| D^* \|^2 \big| }{ \| \hat D \|^2 }
\le \frac{ \| \hat D - D^* \|^2 + 2 \big| ( \hat D - D^* )' D^* \big| }{ \| \hat D \|^2 }
\le \frac{ \| \hat D - D^* \|^2 / n }{ \| \hat D \|^2 / n } + \frac{ 2 \big( \| \hat D - D^* \| / \sqrt{n} \big) \big( \| D^* \| / \sqrt{n} \big) }{ \| \hat D \|^2 / n }
= o_p(1) O_p(1) + o_p(1) O_p(1) = o_p(1), \quad (7.6)
$$
$$
S_2 = \frac{ \| \hat D - D^* \|^2 / n }{ \| \hat D \|^2 / n } = o_p(1) O_p(1) = o_p(1), \quad (7.7)
$$
$$
S_3 = \frac{ 2 \big( \| \hat D - D^* \| / \sqrt{n} \big) \big( \| D^* \| / \sqrt{n} \big) }{ \| \hat D \|^2 / n } = o_p(1) O_p(1) = o_p(1). \quad (7.8)
$$
Substituting (7.6)-(7.8) into (7.5), we have $\| P_{D^*} - P_{\hat D} \| = o_p(1)$. Having proved (7.4), we now turn to the proof of (7.2).
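The claim that $\| P_{D^*} - P_{\hat D} \|$ vanishes as $\hat D$ approaches $D^*$ can be illustrated numerically: for one-dimensional projections onto vectors $d$ and a perturbed copy, the distance between the projection matrices shrinks with the perturbation. A minimal numpy sketch (not from the paper; the vectors and scales are illustrative):

```python
import numpy as np

def proj(d):
    """Projection matrix onto the span of the column vector d."""
    d = d.reshape(-1, 1)
    return d @ d.T / float(d.T @ d)

rng = np.random.default_rng(0)
d_star = rng.normal(size=50)

# As the estimated vector approaches d_star, ||P_{d*} - P_{dhat}|| -> 0.
gaps = []
for eps in [1.0, 0.1, 0.01]:
    d_hat = d_star + eps * rng.normal(size=50)
    gaps.append(np.linalg.norm(proj(d_star) - proj(d_hat), 2))

assert gaps[0] > gaps[1] > gaps[2]  # the operator-norm gap shrinks with eps
```

This is exactly the continuity of the projection operator that (7.4) formalizes for the estimated first stage.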
We present the matrix form of (2.12):
$$ Y = D^* \beta^* + Z \alpha^* + \nu. \quad (7.9) $$
Substituting (7.9) into (7.2), we have
$$
\begin{aligned}
\| M_{\hat D} Y - M_{D^*} Y \| / \sqrt{n}
&= \| ( P_{D^*} - P_{\hat D} ) Y \| / \sqrt{n}
= \| ( P_{D^*} - P_{\hat D} ) ( D^* \beta^* + Z \alpha^* + \nu ) \| / \sqrt{n} \\
&\le \| ( P_{D^*} - P_{\hat D} ) D^* \beta^* \| / \sqrt{n} + \| ( P_{D^*} - P_{\hat D} ) Z \alpha^* \| / \sqrt{n} + \| ( P_{D^*} - P_{\hat D} ) \nu \| / \sqrt{n} \\
&\le \| P_{D^*} - P_{\hat D} \| \frac{ \| D^* \| }{ \sqrt{n} } | \beta^* | + \| P_{D^*} - P_{\hat D} \| \frac{ \| Z \alpha^* \| }{ \sqrt{n} } + \| P_{D^*} - P_{\hat D} \| \frac{ \| \nu \| }{ \sqrt{n} } \\
&= o_p(1) O_p(1) + o_p(1) O_p(1) + o_p(1) O_p(1) = o_p(1). \quad (7.10)
\end{aligned}
$$
Similarly,
$$ \| M_{\hat D} Z - M_{D^*} Z \| / \sqrt{n} = \| ( P_{D^*} - P_{\hat D} ) Z \| / \sqrt{n} \le \| P_{D^*} - P_{\hat D} \| \, \| Z \| / \sqrt{n} = o_p(1) O_p(1) = o_p(1). \quad (7.11) \; \Box $$

Substituting (7.9) into (2.18), we have
$$
\begin{aligned}
\hat \beta
&= \big( \hat D' M_{\hat{\mathcal{A}}_I} \hat D \big)^{-1} \hat D' M_{\hat{\mathcal{A}}_I} Y
= \big( \hat D' M_{\hat{\mathcal{A}}_I} \hat D \big)^{-1} \hat D' M_{\hat{\mathcal{A}}_I} ( D^* \beta^* + Z \alpha^* + \nu ) \\
&= \big( \hat D' M_{\hat{\mathcal{A}}_I} \hat D \big)^{-1} \hat D' M_{\hat{\mathcal{A}}_I} \big[ ( \hat D + D^* - \hat D ) \beta^* + Z \alpha^* + \nu \big] \\
&= \beta^* + \big( \hat D' M_{\hat{\mathcal{A}}_I} \hat D \big)^{-1} \hat D' M_{\hat{\mathcal{A}}_I} \big[ ( D^* - \hat D ) \beta^* + Z \alpha^* + \nu \big], \quad (7.12)
\end{aligned}
$$
which yields
$$
\sqrt{n} ( \hat \beta - \beta^* ) = \Big( \frac{ \hat D' M_{\hat{\mathcal{A}}_I} \hat D }{ n } \Big)^{-1} \frac{ \hat D' M_{\hat{\mathcal{A}}_I} \big[ ( D^* - \hat D ) \beta^* + Z \alpha^* + \nu \big] }{ \sqrt{n} } =: T_1^{-1} T_2. \quad (7.13)
$$
We want to show
$$
\sqrt{n} ( \hat \beta - \beta^* ) = \Big( \frac{ D^{*\prime} M_{\mathcal{A}_I} D^* }{ n } + o_p(1) \Big)^{-1} \Big( \frac{ D^{*\prime} M_{\mathcal{A}_I} \nu }{ \sqrt{n} } + o_p(1) \Big). \quad (7.14)
$$
Before we show (7.14), we first prove the following conclusion:
$$ \| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \| = o_p(1). \quad (7.15) $$
$$ \| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \| \le \| \hat D - D^* \| + \| P_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \|, \quad (7.16) $$
where the second term is
$$ \| P_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \| \le \| P_{\hat{\mathcal{A}}_I} \| \, \| \hat D - D^* \| = \| \hat D - D^* \| = o_p(1), \quad (7.17) $$
where the first inequality follows from the Cauchy-Schwarz inequality, the second step holds because the largest eigenvalue of a projection matrix is 1, and the last step follows from equation (3.4). Substituting (7.17) into (7.16) and combining with (3.4), we have $\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \| = o_p(1)$.

Having proved (7.15), we now return to (7.14). We first focus on the term $T_1$:
$$
\begin{aligned}
T_1 &= \frac{ \hat D' M_{\hat{\mathcal{A}}_I} \hat D }{ n }
= \frac{ ( \hat D - D^* + D^* )' M_{\hat{\mathcal{A}}_I} ( \hat D - D^* + D^* ) }{ n } \\
&= \frac{ D^{*\prime} M_{\mathcal{A}_I} D^* }{ n } + \frac{ ( \hat D - D^* )' M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) }{ n } + 2 \, \frac{ D^{*\prime} M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) }{ n } + \frac{ D^{*\prime} \big( M_{\hat{\mathcal{A}}_I} - M_{\mathcal{A}_I} \big) D^* }{ n } \\
&=: \frac{ D^{*\prime} M_{\mathcal{A}_I} D^* }{ n } + T_{11} + 2 T_{12} + T_{13}. \quad (7.18)
\end{aligned}
$$
For the term $T_{11}$, we have
$$ | T_{11} | = \Big| \frac{ ( \hat D - D^* )' M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) }{ n } \Big| \le \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) / \sqrt{n} \big\| \, \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) / \sqrt{n} \big\| = o_p(1), \quad (7.19) $$
where the inequality holds by the Cauchy-Schwarz inequality and the last step is derived from (7.15). Similarly, for the term $T_{12}$,
$$
| T_{12} | = \Big| \frac{ D^{*\prime} M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) }{ n } \Big|
\le \big\| D^* / \sqrt{n} \big\| \, \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) / \sqrt{n} \big\|
\le \big\| D / \sqrt{n} \big\| \, \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) / \sqrt{n} \big\|
= \sqrt{ \sum_{i=1}^n D_i^2 / n } \; o_p(1)
= \big( [ E ( D_i^2 ) ]^{1/2} + o_p(1) \big) o_p(1) = o_p(1). \quad (7.20)
$$
Next, we deal with the term $T_{13}$:
$$ T_{13} = \frac{ D^{*\prime} \big( M_{\hat{\mathcal{A}}_I} - M_{\mathcal{A}_I} \big) D^* }{ n } = o_p(1), \quad (7.21) $$
which holds since $P ( \hat{\mathcal{A}}_I = \mathcal{A}_I ) \to 1$. Substituting (7.19), (7.20) and (7.21) into (7.18), we have
$$ T_1 = \frac{ D^{*\prime} M_{\mathcal{A}_I} D^* }{ n } + o_p(1). \quad (7.22) $$

Now we work on the term $T_2$:
$$
\begin{aligned}
T_2 &= \frac{ \hat D' M_{\hat{\mathcal{A}}_I} \big[ ( D^* - \hat D ) \beta^* + Z \alpha^* + \nu \big] }{ \sqrt{n} }
= \frac{ ( \hat D - D^* + D^* )' M_{\hat{\mathcal{A}}_I} \big[ ( D^* - \hat D ) \beta^* + Z \alpha^* + \nu \big] }{ \sqrt{n} } \\
&= \frac{ D^{*\prime} M_{\mathcal{A}_I} \nu }{ \sqrt{n} } - \frac{ ( \hat D - D^* )' M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \beta^* }{ \sqrt{n} } + \frac{ ( \hat D - D^* )' M_{\hat{\mathcal{A}}_I} Z \alpha^* }{ \sqrt{n} } + \frac{ ( \hat D - D^* )' M_{\hat{\mathcal{A}}_I} \nu }{ \sqrt{n} } \\
&\quad - \frac{ D^{*\prime} M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \beta^* }{ \sqrt{n} } + \frac{ D^{*\prime} M_{\hat{\mathcal{A}}_I} Z \alpha^* }{ \sqrt{n} } + \frac{ D^{*\prime} \big( M_{\hat{\mathcal{A}}_I} - M_{\mathcal{A}_I} \big) \nu }{ \sqrt{n} } \\
&=: \frac{ D^{*\prime} M_{\mathcal{A}_I} \nu }{ \sqrt{n} } - T_{21} + T_{22} + T_{23} - T_{24} + T_{25} + T_{26}. \quad (7.23)
\end{aligned}
$$
Similar to (7.19), we have
$$ | T_{21} | = \Big| \frac{ ( \hat D - D^* )' M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \beta^* }{ \sqrt{n} } \Big| \le \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \big\| \, \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \big\| \, | \beta^* | / \sqrt{n} = o_p(1). \quad (7.24) $$
For the second term of $T_2$, we have
$$
\begin{aligned}
| T_{22} | &= \Big| \frac{ ( \hat D - D^* )' M_{\hat{\mathcal{A}}_I} Z \alpha^* }{ \sqrt{n} } \Big|
\le \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \big\| \, \big\| M_{\hat{\mathcal{A}}_I} Z \alpha^* / \sqrt{n} \big\| \\
&= \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \big\| \, \big\| \big( M_{\hat{\mathcal{A}}_I} - M_{\mathcal{A}_I} + M_{\mathcal{A}_I} \big) Z \alpha^* / \sqrt{n} \big\| \\
&\le \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \big\| \, \big\| \big( M_{\hat{\mathcal{A}}_I} - M_{\mathcal{A}_I} \big) Z \alpha^* / \sqrt{n} \big\| + \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \big\| \, \big\| M_{\mathcal{A}_I} Z \alpha^* / \sqrt{n} \big\| \\
&= o_p(1) + 0 = o_p(1), \quad (7.25)
\end{aligned}
$$
where the first term in the last inequality is $o_p(1)$ by equation (7.15) and the consistency of variable selection, and the second term equals 0 since $M_{\mathcal{A}_I} Z \alpha^* = 0$.

Then we deal with the term $T_{23}$:
$$ | T_{23} | = \Big| \frac{ ( \hat D - D^* )' M_{\hat{\mathcal{A}}_I} \nu }{ \sqrt{n} } \Big| \le \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \big\| \, \big\| \nu / \sqrt{n} \big\| = o_p(1). \quad (7.26) $$
Similar to (7.20), we have
$$ | T_{24} | = \Big| \frac{ D^{*\prime} M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \beta^* }{ \sqrt{n} } \Big| \le \Big\| \frac{ D^* }{ \sqrt{n} } \Big\| \, \big\| M_{\hat{\mathcal{A}}_I} ( \hat D - D^* ) \big\| \, | \beta^* | = o_p(1). \quad (7.27) $$
Next, we deal with the term $T_{25}$:
$$
| T_{25} | = \Big| \frac{ D^{*\prime} M_{\hat{\mathcal{A}}_I} Z \alpha^* }{ \sqrt{n} } \Big|
= \Big| \frac{ D^{*\prime} \big( M_{\hat{\mathcal{A}}_I} - M_{\mathcal{A}_I} + M_{\mathcal{A}_I} \big) Z \alpha^* }{ \sqrt{n} } \Big|
\le \Big| \frac{ D^{*\prime} \big( M_{\hat{\mathcal{A}}_I} - M_{\mathcal{A}_I} \big) Z \alpha^* }{ \sqrt{n} } \Big| + \Big| \frac{ D^{*\prime} M_{\mathcal{A}_I} Z \alpha^* }{ \sqrt{n} } \Big|
\le \Big| \frac{ D^{*\prime} \big( M_{\hat{\mathcal{A}}_I} - M_{\mathcal{A}_I} \big) Z \alpha^* }{ \sqrt{n} } \Big| + \Big\| \frac{ D^* }{ \sqrt{n} }
$$
(cid:13)(cid:13)(cid:13)(cid:13) (cid:107)M A I Z α ∗ (cid:107) = o p (1) + 0 = o p (1) (7.28)where for the terms in the last inequality, the first term is o p (1) following the variables selectionconsistency in Lemma 3.2, the second term is equal to 0.Similar to (7.21), we have T = D ∗ (cid:48) (cid:16) M (cid:98) A I − M A I (cid:17) ν √ n = o p (1) (7.29)Combining (7.23)-(7.29), we have T = D ∗ (cid:48) M A I ν √ n + o p (1) (7.30)Combining (7.30) with (7.22) yields (7.14). √ n (cid:16) (cid:98) β − β ∗ (cid:17) = (cid:32) D ∗ (cid:48) M A I D ∗ n + o p (1) (cid:33) − (cid:32) D ∗ (cid:48) M A I ν √ n + o p (1) (cid:33) Because D ∗ (cid:48) M A I D ∗ /n = E (cid:16) D ∗ (cid:48) M A I D ∗ (cid:17) + o p (1) by the Weak Law of Large Numbers, notethat (cid:80) ni =1 D ∗ i M A I ( i,j ) ν j are independent and identically distributed with mean zero and variance σ n = (cid:104) E (cid:16) D ∗ (cid:48) M A I D ∗ (cid:17)(cid:105) − E (cid:16) D ∗ (cid:48) M A I D ∗ ν i (cid:17) (cid:104) E (cid:16) D ∗ (cid:48) M A I D ∗ (cid:17)(cid:105) − the Central Limit Theoremimply that σ − n √ n (cid:16) (cid:98) β − β ∗ (cid:17) → N (0 ,
1) (7.31)Under homoscedastic case, we have σ n = (cid:16) E (cid:16) D ∗ (cid:48) M A I D ∗ (cid:17)(cid:17) − σ ν with Var( ν i ) = σ ν . (cid:3)(cid:3)
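The limiting result (7.31) can be illustrated with a small Monte Carlo sketch. The design below is our own toy example, not the paper's simulation setup: it uses an oracle that already knows which instrument is invalid (so the R2IVE selection step, which the theorem shows is consistent, is skipped), and checks that the root-$n$ scaled estimation error of $\hat{\beta}=(\hat{D}'M_{A_{I}}\hat{D})^{-1}\hat{D}'M_{A_{I}}y$ is centered at zero with a stable spread.

```python
import numpy as np

rng = np.random.default_rng(1)

def oracle_iv_estimate(n):
    # Toy DGP (illustrative only): z1 relevant and valid, z2 invalid
    # (direct effect on y), z3 irrelevant; beta* = 1 is the target effect.
    z1, z2, z3 = rng.normal(size=(3, n))
    u = rng.normal(size=n)
    nu = 0.5 * u + rng.normal(size=n)          # endogeneity: Cov(u, nu) != 0
    D = z1 + 0.5 * z2 + u                      # first stage
    y = 1.0 * D + 1.0 * z2 + nu                # z2 violates exclusion

    Z = np.column_stack([z1, z2, z3])
    D_hat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]           # first-stage fit
    A = z2[:, None]                                            # oracle invalid set A_I
    MD = D_hat - A @ np.linalg.lstsq(A, D_hat, rcond=None)[0]  # M_{A_I} D_hat
    My = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]          # M_{A_I} y
    return (MD @ My) / (MD @ MD)               # (D'M D)^{-1} D'M y

n, reps = 2000, 500
errs = np.sqrt(n) * (np.array([oracle_iv_estimate(n) for _ in range(reps)]) - 1.0)
print(round(abs(errs.mean()), 2))   # close to 0: no asymptotic bias
print(round(errs.std(), 2))         # stable spread, consistent with root-n normality
```

Projecting out the known invalid column removes its direct effect exactly, which is the role of $M_{A_{I}}Z\alpha^{*}=0$ in (7.25) and (7.28); running the same code without the `M`-step would leave $\hat{\beta}$ visibly biased.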