Valid Instrumental Variables Selection Methods using Auxiliary Variable and Constructing Efficient Estimator
Shunichiro Orihara
Graduate School of Data Science, Yokohama City University, Kanagawa, Japan
Abstract
In observational studies, we are usually interested in estimating causal effects between treatments and outcomes. When some covariates are not observed, an unbiased estimator usually cannot be obtained. In this paper, we focus on instrumental variable (IV) methods. By using IVs, an unbiased estimator for causal effects can be obtained even if there exist some unmeasured covariates. Constructing a linear combination of IVs solves weak IV problems; however, there is a risk of estimating biased causal effects when some invalid IVs are included. In this paper, we use Negative Control Outcomes (NCOs) as auxiliary variables to select valid IVs. By using NCOs, there is no need to specify either the set of valid IVs or the set of invalid ones in advance; this point differs from previous methods. We prove that the estimated causal effect has the same asymptotic variance as the estimator based on the Generalized Method of Moments, which attains the semiparametric efficiency bound. We also confirm the properties of our method and of previous methods through simulations.
Keywords: Causal inference, Instrumental variable, Variable selection, Semiparametric efficiency, Shrinkage method

1 Introduction
In observational studies, we are usually interested in estimating causal effects between treatments and outcomes; however, there is a serious risk of estimating biased causal effects unless covariates (or confounders; hereafter, "covariates") are appropriately adjusted. When all covariates are observed, they can be adjusted for and an unbiased estimator of causal effects can be obtained; this is the situation of "no unmeasured confounders" (or "strong ignorability"; c.f. Hernán and Robins, 2020 and Rosenbaum and Rubin, 1983). No unmeasured confounding is one sufficient assumption for obtaining an unbiased estimator of causal effects. When some covariates are not observed, however, an unbiased estimator usually cannot be obtained. Therefore, another sufficient assumption needs to be used. In this paper, we focus on instrumental variable (IV) methods.

Regarding IV methods, many theoretical results have been derived and many applications exist in econometrics (Sargan, 1958, White, 1982, Andrews, 1990, Newey, 1990, Imbens, 2002, Hong et al., 2003, Liao, 2013, Kolesár et al., 2013, and DiTraglia, 2016). In biometrics and related fields, especially in Mendelian Randomization (MR), theoretical results and applications have appeared in recent years (Pierce et al., 2011, Baiocchi et al., 2014, Kang et al., 2016, Burgess et al., 2017, Guo et al., 2018, Windmeijer et al., 2019, and Sanderson et al., 2020). IVs need to satisfy three conditions: 1) they are related to the treatment; 2) they have no direct effect on the outcome (exclusion restriction); and 3) they are uncorrelated with the unmeasured covariates. Variables satisfying these three conditions are called "valid IVs", and IVs that are not valid are called "invalid IVs" when the validity is emphasized. By using IVs, an unbiased estimator for causal effects can be obtained even if there exist some unmeasured covariates (c.f. 
Hayashi, 2000 and Baiocchi et al., 2014). IV methods are useful; however, they sometimes suffer from the "weak IV problem", especially in MR. The weak IV problem occurs when the correlations between the treatment and the IVs are weak. In MR, constructing an allele score that uses many DNA alleles as IVs is one of the solutions (Pierce et al., 2011 and Burgess et al., 2017). Note that an allele score is a weighted linear combination of IVs, and the weights can be estimated through a linear estimating equation (we simply call the weights "parameters"); this is explained later.

Constructing a linear combination of IVs solves weak IV problems; however, there is a risk of estimating biased causal effects when some invalid IVs are included. Not only in MR but also in other fields, there is much interest in estimating unbiased causal effects when some IVs are invalid. Condition 1) can be assessed simply from the data, whereas schemes such as sensitivity analysis are applied to check conditions 2) and 3) (Baiocchi et al., 2014). From here, we introduce important results to date on deviations from condition 2) and condition 3), respectively. Regarding condition 2), there are some important results. Kang et al., 2016 derived an estimator that has consistency and asymptotic normality when more than half of the candidate IVs are valid (i.e., the set of candidate IVs includes both valid and invalid IVs) and the relationships between outcome and treatment and between treatment and valid IVs are linear. Moreover, valid IVs can be selected by shrinkage methods. Their result has led to continuing discussion and further sophisticated results (Guo et al., 2018 and Windmeijer et al., 2019). Kolesár et al., 2013 also derived a similar estimator under different assumptions. Regarding condition 3), there are also important results. 
As a very important and famous conclusion, Sargan, 1958 proposed a statistical test for judging whether there are invalid IVs among the candidates under the overidentifying restriction, i.e., when there are more candidate IVs than treatments. However, the test cannot tell which candidates are invalid. DiTraglia, 2016 proposed an information criterion for selecting IVs based on the Focused Information Criterion (Claeskens and Hjort, 2003). Since this criterion selects variables from the candidate IVs so that the mean squared error of the causal-effect estimator becomes small, it does not select exactly the valid IVs. Liao, 2013 proposed a shrinkage method not only to estimate parameters but also to select them. However, these methods need a very strong assumption: we have to know exactly which candidates are valid IVs. Andrews, 1999 also proposed an information criterion for selecting valid IVs without this strong assumption. The criterion has features similar to ordinary model selection criteria obtained by choosing a penalty term; however, linearity of the parameters of interest is necessary.

As described in the previous paragraph, there are some results, but no universal conclusions, related to violations of condition 3), since all of these results need some assumptions. This is because we need to clarify the relationship between the IVs and the "unobserved" covariates; Liao, 2013 and DiTraglia, 2016 assume that the set of valid IVs is correctly specified. Unfortunately, this situation is limited. Also, we think these methods are less useful, since unbiased causal effects can already be estimated by using the known valid IVs. In this paper, we apply another assumption: we use Negative Control Outcomes (NCOs; Tchetgen Tchetgen, 2014, Miao and Tchetgen Tchetgen, 2018, Sanderson et al., 2020 and Shi et al., 2020) as auxiliary variables. An NCO is a variable that is related to the unobserved covariates but is not related to the IVs or the treatment directly; it is related only through the unobserved covariates. 
By using NCOs, there is no need to specify either the set of valid IVs or the set of invalid ones in advance. Therefore, the proposed method is more useful than the methods of Liao, 2013 and DiTraglia, 2016 when such auxiliary variables are available. Also, we prove that, when solving the linear estimating equations explained in the previous paragraph while selecting IVs and estimating their weights, the estimated causal effect has the same asymptotic variance as the estimator based on the Generalized Method of Moments (GMM; see Imbens, 2000), which attains the semiparametric efficiency bound. Additionally, we show that ordinary shrinkage methods (e.g., the Adaptive LASSO (Zou, 2006)) can be applied to our proposed method and that the resulting estimator has the oracle property.

The remainder of the paper proceeds as follows. In section 2, we discuss the situation where all candidate IVs are valid. We show that the parameters of a linear combination of IVs can be estimated through a linear estimating equation and that our proposed method has the same asymptotic variance as GMM. In section 3, we explain a method for selecting the valid IVs without prior information and show that the method has useful properties. Also, by using the Adaptive LASSO, we show that the method has the oracle property. In section 4, we confirm the properties of our method and of previous methods through simulations. Important regularity conditions, all proofs, and all figures are given in the appendix.
2 Estimation when all candidate IVs are valid

First, we discuss the situation where all candidate IVs are valid, so that we need not select valid IVs. Let n be the sample size. Let T_i ∈ 𝒯 ⊂ ℝ, X_i ∈ ℝ^p, U_i ∈ ℝ, Z_i ∈ ℝ^K, and Y_i ∈ ℝ denote the treatment, a vector of covariates, an unobserved covariate, a vector of IVs, and an observed outcome, respectively, where the random variables satisfy appropriate moment conditions. Note that 𝒯 = {0, 1} corresponds to the binary treatment situation. We assume that i = 1, 2, ..., n are i.i.d. samples. Throughout this paper, the following linear model is assumed:

\[
y_i = t_i \beta_t + x_i^\top \beta_x + u_i, \tag{2.1}
\]

where E[U_i] = 0, Var(U_i) = σ² < ∞, and β = (β_t, β_x^⊤)^⊤. Also, we assume that U_i ⊥⊥ (Z_i, X_i) and U_i ⊥̸⊥ T_i. When β is estimated by OLS there may be some bias; therefore, the following estimating equation is used (c.f. Hayashi, 2000, Burgess et al., 2017):

\[
\sum_{i=1}^{n} \begin{pmatrix} h(z_i) \\ x_i \end{pmatrix} \left( y_i - \left( t_i \beta_t + x_i^\top \beta_x \right) \right) = 0_{p+1}, \tag{2.2}
\]

where h(·) is a measurable function of dimension K or less. If the model (2.1) is correct, then the estimating equation (2.2) satisfies

\[
\mathrm{E}\left[ \begin{pmatrix} h(Z) \\ X \end{pmatrix} \left( Y - \left( T\beta_t + X^\top \beta_x \right) \right) \right]
= \mathrm{E}\left[ \begin{pmatrix} h(Z) \\ X \end{pmatrix} \right] \mathrm{E}\left[ Y - \left( T\beta_t + X^\top \beta_x \right) \right] = 0_{p+1},
\]

and the resulting IV estimator obeys

\[
\sqrt{n}\left( \hat{\beta}^{\mathrm{IV}} - \beta \right) \xrightarrow{L} N\left( 0_{p+1}, \; \Gamma^{-1} \Sigma \left( \Gamma^\top \right)^{-1} \right), \tag{2.3}
\]

where, writing H = h(Z),

\[
\Sigma = \mathrm{E}\left[ \begin{pmatrix} HU \\ XU \end{pmatrix}^{\otimes 2} \right]
= \sigma^2 \begin{pmatrix} \mathrm{E}[H^{\otimes 2}] & \mathrm{E}[HX^\top] \\ \mathrm{E}[XH^\top] & \mathrm{E}[XX^\top] \end{pmatrix}, \qquad
\Gamma = \begin{pmatrix} \mathrm{E}[HT] & \mathrm{E}[HX^\top] \\ \mathrm{E}[TX] & \mathrm{E}[XX^\top] \end{pmatrix},
\]

and Γ is a non-singular matrix. Note that estimators of (2.1) are written as β̂^(·) = (β̂_t^(·), (β̂_x^(·))^⊤)^⊤; e.g., β̂^IV denotes the ordinary IV estimator.

From here, we consider how to select the function h(·) so that the asymptotic variance of (2.3) is minimized. 
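To make the estimating equation (2.2) concrete, here is a minimal numerical sketch of the moment-based IV estimator; the function name and the toy data-generating values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def iv_estimator(y, t, X, h_z):
    """Solve the empirical moment condition (2.2):
    sum_i g_i (y_i - (t_i*beta_t + x_i' beta_x)) = 0,
    where g_i stacks the scalar instrument h(z_i) and the covariates x_i."""
    G = np.column_stack([h_z, X])             # instruments, shape (n, p+1)
    W = np.column_stack([t, X])               # regressors, treatment first
    return np.linalg.solve(G.T @ W, G.T @ y)  # beta[0] = beta_t, beta[1:] = beta_x

# toy data: U confounds T and Y, Z is a valid instrument for T
rng = np.random.default_rng(0)
n = 200_000
U = rng.normal(size=n)
Z = rng.normal(size=n)
X = rng.normal(size=(n, 2))
T = 0.8 * Z + U + rng.normal(size=n)
Y = 0.5 * T + X @ np.array([1.0, -1.0]) + U   # true beta_t = 0.5
beta = iv_estimator(Y, T, X, h_z=Z)
```

Because U enters both T and Y, OLS of Y on (T, X) would be biased, while the moment condition with the valid instrument recovers β_t consistently.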
As described in the Introduction, an allele score, i.e., a linear combination of IVs, is well studied in MR (Burgess et al., 2017):

\[
h(z_i) = \gamma^\top z_i = \sum_{k=1}^{K} \gamma_k z_{ik}, \tag{2.4}
\]

where γ ∈ ℝ^K. Note that h(z_i) is itself an IV. In Burgess et al., 2017 and related papers, the parameters γ are estimated by cross-validation or Jackknife methods. In fact, γ can be obtained as the solution of a linear estimating equation under the model (2.1). Furthermore, the estimator γ̂ can be chosen so that the asymptotic variance of (2.3) becomes minimal; that is, a "best" IV estimator can be constructed.

Proposition 1.
Assume that Γ > O. When C.1 and C.2 hold, the solution γ̂ of the following estimating equation gives the minimal asymptotic variance related to β̂_t^IV:

\[
\left( \mathrm{E}[Z^{\otimes 2}] - \mathrm{E}[ZX^\top]\,\mathrm{E}[XX^\top]^{-1}\mathrm{E}[XZ^\top] \right)\gamma
- \left( \mathrm{E}[ZT] - \mathrm{E}[ZX^\top]\,\mathrm{E}[XX^\top]^{-1}\mathrm{E}[TX] \right) = 0_K. \tag{2.5}
\]

Then, the asymptotic variance becomes

\[
\left( \Gamma^{-1} \Sigma \left( \Gamma^\top \right)^{-1} \right)_{(1,1)} = \sigma^2 \Omega_{\mathrm{opt}}^{-1}, \tag{2.6}
\]

where

\[
\Omega_{\mathrm{opt}} = \left( \mathrm{E}[ZT] - \mathrm{E}[ZX^\top]\mathrm{E}[XX^\top]^{-1}\mathrm{E}[TX] \right)^\top
\left( \mathrm{E}[Z^{\otimes 2}] - \mathrm{E}[ZX^\top]\mathrm{E}[XX^\top]^{-1}\mathrm{E}[XZ^\top] \right)^{-1}
\left( \mathrm{E}[ZT] - \mathrm{E}[ZX^\top]\mathrm{E}[XX^\top]^{-1}\mathrm{E}[TX] \right).
\]

Note 1.
The solution γ̂ of (2.5) is the OLS estimator of the coefficients on Z when T is regarded as the response variable and (Z, X) as the explanatory variables. If the relationship between T and (Z, X) is linear, then γ̂ is a consistent estimator of the coefficients on Z. 

Note 2.
The result of Proposition 1 has a different kind of optimality from Newey, 1990. Newey, 1990 has to identify the true model of T, whereas our method does not; our method has no concern about misspecification of that model. Rather, our method can be regarded as an extension of the two-stage IV method (White, 1982).

From the result of Proposition 1, we can estimate the weights of an allele score so that the asymptotic variance related to β̂_t^IV becomes minimal. As a well-known fact, the GMM estimator attains the semiparametric efficiency bound; the asymptotic variance of (2.3) is minimized when the function h(·) is chosen as h(Z_i) = (Z_{i1}, ..., Z_{iK})^⊤. Under this situation, the following asymptotic normality holds:

\[
\sqrt{n}\left( \hat{\beta}^{\mathrm{GMM}} - \beta \right) \xrightarrow{L} N\left( 0, \; \left( \Gamma \Sigma^{-1} \Gamma^\top \right)^{-1} \right), \tag{2.7}
\]

where Σ and Γ Σ^{-1} Γ^⊤ are non-singular matrices. From the next theorem, we can derive the important conclusion that β̂_t^IV also attains the semiparametric efficiency bound.
Theorem 1.
When C.1 and C.2 hold, the asymptotic variance related to β̂_t^GMM is the same as (2.6):

\[
\left( \left( \Gamma \Sigma^{-1} \Gamma^\top \right)^{-1} \right)_{(1,1)} = \sigma^2 \Omega_{\mathrm{opt}}^{-1}.
\]

As mentioned previously, the solution γ̂ of (2.5) gives the minimal asymptotic variance of β̂_t^IV; this is one point of difference from GMM. Using this feature, we can propose the valid IV selection method in the next section.
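As a numerical illustration of Proposition 1 and Note 1, the weights solving the empirical analogue of (2.5) coincide with the OLS coefficients on Z when T is regressed on (Z, X) (a Frisch-Waugh partialling argument); this sketch and its data-generating values are illustrative assumptions, not the paper's code.

```python
import numpy as np

def score_weights(T, Z, X):
    """Empirical analogue of (2.5):
    (E[ZZ'] - E[ZX'] E[XX']^{-1} E[XZ']) gamma = E[ZT] - E[ZX'] E[XX']^{-1} E[XT]."""
    n = len(T)
    Szz, Szx, Sxx = Z.T @ Z / n, Z.T @ X / n, X.T @ X / n
    Szt, Sxt = Z.T @ T / n, X.T @ T / n
    A = Szz - Szx @ np.linalg.solve(Sxx, Szx.T)
    b = Szt - Szx @ np.linalg.solve(Sxx, Sxt)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
n = 100_000
Z = rng.normal(size=(n, 3))
X = rng.normal(size=(n, 2))
T = Z @ np.array([0.6, 0.3, 0.1]) + X @ np.array([0.5, -0.2]) + rng.normal(size=n)
gamma = score_weights(T, Z, X)

# Note 1: the same weights are the Z-coefficients of the OLS fit of T on (Z, X)
coef, *_ = np.linalg.lstsq(np.column_stack([Z, X]), T, rcond=None)
```

The two computations agree exactly in finite samples, which is why the weights can be obtained from a single linear estimating equation rather than by cross-validation.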
3 Valid IV selection using auxiliary variables

In the previous section, we discussed the situation where all Z are valid IVs, but with a large number of IVs there is a risk of estimating biased causal effects by including some invalid IVs. In this section, we discuss the situation where some of the candidate IVs (Z_{ℓ+1}, ..., Z_K) do not satisfy Condition 3) of the Introduction; in other words, (Z_{ℓ+1}, ..., Z_K) are invalid IVs.

Invalid IVs do not satisfy Condition 3); mathematically, Z_j ⊥̸⊥ U, j ∈ {ℓ+1, ..., K} in this paper. Since U is an unobserved covariate, the nature of the variable itself cannot be directly examined from the data. This is because we need to clarify the relationship between the IVs and the "unobserved" covariates; Liao, 2013 and DiTraglia, 2016 assume that the set of valid IVs is correctly specified. Unfortunately, this situation is limited. Also, we think these methods are less useful, since we can already estimate unbiased causal effects by using the known valid IVs. In this paper, we apply another assumption: we use Negative Control Outcomes (NCOs; Tchetgen Tchetgen, 2014, Miao and Tchetgen Tchetgen, 2018, Sanderson et al., 2020 and Shi et al., 2020) as auxiliary variables M ∈ ℝ. An NCO is a variable that is related to the unobserved covariates but is not related to the IVs or the treatment directly; it is related only through the unobserved covariates. By using an NCO, we need not specify either the set of valid IVs or the set of invalid ones in advance. Therefore, our method is more useful than the methods of Liao, 2013 and DiTraglia, 2016 when such auxiliary variables are available. Although the auxiliary variable is taken to be one-dimensional in this paper, this limitation is not essential, and the theory can be extended to two or more dimensions. The NCO has been used in epidemiological studies as a way to check for the effects of unobserved covariates. 
Here, we present a case study of the relationship between water quality and diarrhea in children, as presented in Miao and Tchetgen Tchetgen, 2018, and review a real-life example of an NCO.

Example
Khush et al. (2013) studied the association between water quality and child diarrhea in rural Southern India. Escherichia coli in contaminated water can increase the risk of diarrhea, but is unlikely to cause respiratory symptoms such as constant cough, congestion, etc. Khush et al. observed a slightly higher diarrhea prevalence at higher concentrations of Escherichia coli; however, repeated analysis shows a similar increase in risk of respiratory symptoms, which suggests that at least part of the association between Escherichia coli and diarrhea is a result of confounding. (Miao and Tchetgen Tchetgen, 2018, p.7)
First, we introduce mathematical assumptions on the unobserved covariates, the NCOs, and the candidate IVs, which are maintained throughout the rest of this paper.
Assumption 1.
The unobserved covariate U ∈ ℝ, the Negative Control Outcome (NCO) M ∈ ℝ, and the candidate IVs Z = (Z_1, ..., Z_ℓ, Z_{ℓ+1}, ..., Z_K)^⊤ ∈ ℝ^K satisfy the following assumptions:

1. E[M] = 0;
2. M = M(U), i.e., all fluctuation of the NCO is explained by the unobserved covariates;
3. |E[Z_k M]| > w > 0 if k ∈ {ℓ+1, ..., K}.

Assumption 1. is not essential, but simplifies the discussion below. Assumption 2. is an assumption for detecting the valid IVs from the candidates. Under Assumptions 1. and 2., when k ∈ {1, ..., ℓ},

\[
\mathrm{E}[Z_k M] = \mathrm{E}\left[ M(U)\,\mathrm{E}[Z_k \mid U] \right] = \mathrm{E}[M(U)]\,\mathrm{E}[Z_k] = 0.
\]

Assumption 3. expresses the relationship between the invalid IVs and the NCO. By using Assumptions 1. to 3., it is possible to identify valid and invalid IVs through the NCO. 
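Assumptions 1. to 3. suggest a simple screening rule: estimate ŵ_k = (1/n) Σ_i Z_ik M_i and flag candidates whose value exceeds the threshold w in absolute value. A minimal sketch under assumed toy values (not the paper's) follows.

```python
import numpy as np

def nco_screen(Z, M, w):
    """Judge candidate IVs by the empirical covariance with the NCO:
    under Assumptions 1-3, E[Z_k M] = 0 for valid IVs while |E[Z_k M]| > w
    for invalid ones, so thresholding the sample mean separates them."""
    w_hat = Z.T @ M / len(M)     # w_hat[k] = (1/n) sum_i Z_ik M_i
    return np.abs(w_hat) <= w    # True = judged valid

rng = np.random.default_rng(2)
n = 50_000
U = rng.normal(size=n)
Z_valid = rng.normal(size=(n, 2))            # independent of U
Z_invalid = rng.normal(size=n) + 0.8 * U     # correlated with U
Z = np.column_stack([Z_valid, Z_invalid])
M = 0.6 * U                                  # NCO: a function of U only, E[M] = 0
valid = nco_screen(Z, M, w=0.2)              # expected: first two kept, last dropped
```

Here E[Z_3 M] = 0.8 × 0.6 = 0.48 > w, while the sample covariances for the valid candidates concentrate around 0, so the rule separates the two groups cleanly at this sample size.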
Note 3.
Assume M = Uα. This model satisfies Assumptions 1. and 2. Regarding Assumption 3.,

\[
|\mathrm{E}[Z_k M]| = |\mathrm{E}[Z_k U]|\,|\alpha| > w.
\]

Therefore, Assumption 3. can be reduced to an assumption about the magnitude of the covariance between the unobserved covariates and the invalid IVs.
Note 4.
In this paper, we make three assumptions about the NCO; in particular, Assumptions 2. and 3. may seem strong. For Assumption 2., the same assumption is made in Miao and Tchetgen Tchetgen, 2018 (the auxiliary variable and the IVs are conditionally independent given the unobserved covariates (M ⊥⊥ Z | U)), and an NCO is, for our purposes, related only to the unobserved covariates; Assumption 2. is thus considered natural. Assumption 3. is necessary because the general formulation M(U) in Assumption 2. allows a general system of functions. As confirmed in Note 3, if a linear model can be assumed between M and U, then an assumption on the covariance between M and U turns out to be sufficient. 

Note 5.
When there is no NCO but at least one valid IV among the candidates can be identified, that valid IV can be used as the auxiliary variable. In this situation, the following discussions and proofs also hold (see 
Appendix C).

From here, we update the method introduced in the previous section and propose a new method, different from existing ones, that estimates causal effects while selecting valid IVs. To simplify the following discussion, we assume E[XX^⊤] = I_p, but this assumption is not essential. In the proposed method, the weights γ can be estimated as the solution of the estimating equation (2.5):

\[
\left( \mathrm{E}[Z^{\otimes 2}] - \mathrm{E}[ZX^\top]\mathrm{E}[XZ^\top] \right)\gamma - \left( \mathrm{E}[ZT] - \mathrm{E}[ZX^\top]\mathrm{E}[TX] \right) = 0_K.
\]

When both valid and invalid IVs exist among the candidates, the weights of the valid IVs, γ_val = (γ_1, ..., γ_ℓ)^⊤, should be estimated from the estimating equation (2.5), while the remaining weights, γ_inv = (γ_{ℓ+1}, ..., γ_K)^⊤, should be estimated as 0, or converge to 0. As an empirical version of (2.5), the following estimating equation can be considered:

\[
\frac{1}{n}\sum_{i=1}^{n} \underbrace{\begin{pmatrix}
\left( Z_{i1}^2 - \sum_{j=1}^{p} \overline{Z_1X_j}^{\,2} \right)\hat{I}_{\tau 1} + \kappa \bar{\hat{I}}_{\tau 1} & \cdots & \left( Z_{i1}Z_{iK} - \sum_{j=1}^{p} \overline{Z_1X_j}\,\overline{Z_KX_j} \right)\hat{I}_{\tau 1}\hat{I}_{\tau K} \\
\vdots & \ddots & \vdots \\
 & & \left( Z_{iK}^2 - \sum_{j=1}^{p} \overline{Z_KX_j}^{\,2} \right)\hat{I}_{\tau K} + \kappa \bar{\hat{I}}_{\tau K}
\end{pmatrix}}_{=:\,A_i}\gamma
= \frac{1}{n}\sum_{i=1}^{n} \underbrace{\begin{pmatrix}
\left( Z_{i1}T_i - \sum_{j=1}^{p} \overline{Z_1X_j}\,\overline{TX_j} \right)\hat{I}_{\tau 1} + \kappa_n \bar{\hat{I}}_{\tau 1} \\
\vdots \\
\left( Z_{iK}T_i - \sum_{j=1}^{p} \overline{Z_KX_j}\,\overline{TX_j} \right)\hat{I}_{\tau K} + \kappa_n \bar{\hat{I}}_{\tau K}
\end{pmatrix}}_{=:\,b_i}, \tag{3.1}
\]

where

\[
\overline{Z_kX_j} = \frac{1}{n}\sum_{i=1}^{n} Z_{ik}X_{ij}, \qquad \overline{TX_j} = \frac{1}{n}\sum_{i=1}^{n} T_iX_{ij},
\]

\[
M_\kappa = \kappa \times \mathrm{diag}\{ \bar{\hat{I}}_{\tau 1}, \ldots, \bar{\hat{I}}_{\tau K} \}, \quad \kappa \in \mathbb{R}, \qquad
m_\kappa = \kappa_n \times ( \bar{\hat{I}}_{\tau 1}, \ldots, \bar{\hat{I}}_{\tau K} )^\top, \qquad \kappa_n = o\!\left(\frac{1}{\sqrt{n}}\right), \tag{3.2}
\]

\[
\hat{I}_{\tau k} = \Phi_\tau(\hat{w}_k + w)\,\Phi_\tau(w - \hat{w}_k), \qquad \bar{\hat{I}}_{\tau k} = 1 - \hat{I}_{\tau k}, \tag{3.3}
\]

\[
I_{\tau k} = \Phi_\tau(w_k + w)\,\Phi_\tau(w - w_k), \qquad \bar{I}_{\tau k} = 1 - I_{\tau k}, \tag{3.4}
\]

\[
\hat{w}_k = \frac{1}{n}\sum_{i=1}^{n} Z_{ki} M_i, \qquad w_k = \mathrm{E}[Z_k M], \tag{3.5}
\]

and Φ_τ(·) is the CDF of N(0, τ²). (3.3) and (3.4) are the estimator and the true value of the smooth weight function (c.f. Yang and Ding, 2017, Fig. 1), respectively.

From here, we examine the formulas (3.1)-(3.5). First, assume that Z_k is valid. Then E[Z_k M] = 0, and

\[
I_{\tau k} = \Phi_\tau(0 + w)\,\Phi_\tau(w - 0) \to 1 \quad (\tau \searrow 0).
\]

Therefore, it is expected that

\[
\hat{I}_{\tau k} = \Phi_\tau(\hat{w}_k + w)\,\Phi_\tau(w - \hat{w}_k) \xrightarrow{P} 1 \quad (n \to \infty,\; \tau \searrow 0).
\]

Conversely, assume that Z_{k'} is invalid. Then |E[Z_{k'} M]| > w, and

\[
I_{\tau k'} = \Phi_\tau(w_{k'} + w)\,\Phi_\tau(w - w_{k'}) \to 0 \quad (\tau \searrow 0),
\]

so it is expected that

\[
\hat{I}_{\tau k'} = \Phi_\tau(\hat{w}_{k'} + w)\,\Phi_\tau(w - \hat{w}_{k'}) \xrightarrow{P} 0 \quad (n \to \infty,\; \tau \searrow 0).
\]

In summary, for (3.1), it is expected that the weights γ_val of the valid IVs and γ_inv of the invalid IVs are asymptotically consistent with the solutions of the following estimating equations, respectively:

\[
\frac{1}{n}\sum_{i=1}^{n} \underbrace{\begin{pmatrix}
Z_{i1}^2 - \sum_{j=1}^{p} \overline{Z_1X_j}^{\,2} & \cdots & Z_{i1}Z_{i\ell} - \sum_{j=1}^{p} \overline{Z_1X_j}\,\overline{Z_\ell X_j} \\
\vdots & \ddots & \vdots \\
 & & Z_{i\ell}^2 - \sum_{j=1}^{p} \overline{Z_\ell X_j}^{\,2}
\end{pmatrix}}_{=:\,\check{A}_i}\gamma_{\mathrm{val}}
= \frac{1}{n}\sum_{i=1}^{n} \underbrace{\begin{pmatrix}
Z_{i1}T_i - \sum_{j=1}^{p} \overline{Z_1X_j}\,\overline{TX_j} \\
\vdots \\
Z_{i\ell}T_i - \sum_{j=1}^{p} \overline{Z_\ell X_j}\,\overline{TX_j}
\end{pmatrix}}_{=:\,\check{b}_i}, \tag{3.6}
\]

\[
\kappa \times \mathrm{diag}(1, \ldots, 1)\,\gamma_{\mathrm{inv}} = 0_{K-\ell}.
\]

To prove these expectations, we confirm the properties of γ̂ through the following two steps:

Step 1) derive a random variable asymptotically equivalent to (3.1);

Step 2) using the random variable derived in Step 1), confirm the mathematical properties of

\[
\hat{\gamma} = \left( \sum_{i=1}^{n} A_i \right)^{-1} \sum_{i=1}^{n} b_i.
\]

Note 6.
The reason the ordinary indicator function (which takes the value 1 in some range and 0 elsewhere) is not used in (3.3) and (3.4) is that the indicator function renders the ordinary asymptotic theory useless. When assuming τ ≡ τ_n ↘ 0, the following lemma can be obtained.
Lemma 1.
Consider the following equation:

\[
\frac{1}{n}\sum_{i=1}^{n}(b_i - A_i\gamma)
- \frac{1}{n}\sum_{i=1}^{n}\left\{
\begin{pmatrix} \check{b}_i \\ \kappa_n \mathbf{1}_{K-\ell} \end{pmatrix}
- \begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O_{\ell\times(K-\ell)}^\top & \kappa I_{K-\ell} \end{pmatrix}\gamma
\right\}
= \frac{1}{n}\sum_{i=1}^{n}\underbrace{\left\{ b_i - \begin{pmatrix} \check{b}_i \\ \kappa_n \mathbf{1}_{K-\ell} \end{pmatrix}\right\}}_{=:\,c_{1i}}
- \frac{1}{n}\sum_{i=1}^{n}\underbrace{\left\{ A_i - \begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O_{\ell\times(K-\ell)}^\top & \kappa I_{K-\ell} \end{pmatrix}\right\}\gamma}_{=:\,c_{2i}(\gamma)}. \tag{3.7}
\]

A sufficient condition for

\[
\frac{1}{n}\sum_{i=1}^{n} c_{1i,k} = o_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad
\frac{1}{n}\sum_{i=1}^{n} c_{2i,k}(\gamma) = o_p\!\left(\frac{1}{\sqrt{n}}\right) \tag{3.8}
\]

to hold for all γ and each k is

\[
\left(\frac{|w-\hat{w}_k|}{\tau_n}\right)^{-1}\exp\left\{-\frac{(w-\hat{w}_k)^2}{2\tau_n^2}\right\} = o_p\!\left(\frac{1}{\sqrt{n}}\right). \tag{3.9}
\]

Lemma 1. gives a sufficient order for (3.8), but (3.9) is hard to interpret. Define

\[
b_n = \sqrt{n}\left(\frac{|w-\hat{w}_k|}{\tau_n}\right)^{-1}\exp\left\{-\frac{(w-\hat{w}_k)^2}{2\tau_n^2}\right\}.
\]

• When |w − ŵ_k|/τ_n = O_p(n), b_n = O_p(√n/n) × O_p(exp(−n²/2)) = o_p(1).
• When |w − ŵ_k|/τ_n = O_p(√(log n)), b_n = O_p(√n/√(log n)) × O_p(n^{−1/2}) = o_p(1).
• When |w − ŵ_k|/τ_n = O_p(√(log log n)), b_n = O_p(√n/√(log log n)) × O_p((log n)^{−1/2}) = O_p(√n/(√(log log n) √(log n))), which diverges.

From the above, one limiting order satisfying (3.9) is |w − ŵ_k|/τ_n = O_p(√(log n)). Then, since ŵ_k is a sample mean, a lower limit on the order of τ_n is τ_n = O_p(1/√n).

From Lemma 1., the root-n consistency of γ̂ follows.

Proposition 2.
When C.1 holds, under the conditions of Lemma 1.,

\[
\hat{\gamma} = \left(\sum_{i=1}^{n} A_i\right)^{-1}\sum_{i=1}^{n} b_i
\;\xrightarrow{P}\;
\begin{pmatrix} \mathrm{E}[\check{A}]^{-1}\mathrm{E}[\check{b}] \\ 0_{K-\ell} \end{pmatrix}
= \begin{pmatrix} \gamma_{\mathrm{val}} \\ \gamma_{\mathrm{inv}} \end{pmatrix}. \tag{3.10}
\]

Also, if |E[(b̌ − Ǎγ_val)^{⊗2}]| < ∞, then for each k,

\[
\hat{\gamma}_{\mathrm{val},k} - \gamma_{\mathrm{val},k} = O_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad
\hat{\gamma}_{\mathrm{inv},k} = o_p\!\left(\frac{1}{\sqrt{n}}\right).
\]

Note that γ_val is the solution of (2.5) and γ_inv = 0_{K−ℓ}. In other words, the weights of the invalid IVs are not used when estimating β̂_t^IV. By using γ̂, an estimating equation for β can be constructed:

\[
\sum_{i=1}^{n} \begin{pmatrix} \hat{\gamma}^\top z_i \\ x_i \end{pmatrix}\left( y_i - \left( t_i\beta_t + x_i^\top\beta_x \right) \right) = 0_{p+1}. \tag{3.11}
\]

Regarding β, the following property holds:

Theorem 2.
When C.1 and C.2 hold, β̂ has the following asymptotic property:

\[
\sqrt{n}\left( \hat{\beta}^{\mathrm{IV}} - \beta \right) \xrightarrow{L} N\left( 0_{p+1}, \; \Gamma^{-1} \Sigma \left( \Gamma^\top \right)^{-1} \right), \tag{3.12}
\]

where γ = ((γ_val)^⊤, (γ_inv)^⊤)^⊤ ≡ ((γ_val)^⊤, 0_{K−ℓ}^⊤)^⊤ and

\[
\Sigma = \mathrm{E}\left[ \begin{pmatrix} \gamma^\top Z U \\ XU \end{pmatrix}^{\otimes 2} \right], \qquad
\Gamma = \begin{pmatrix} \mathrm{E}[\gamma^\top Z T] & \mathrm{E}[\gamma^\top Z X^\top] \\ \mathrm{E}[TX] & I_p \end{pmatrix}.
\]

Theorem 2. shows that β̂_t^IV attains the semiparametric efficiency bound even when some invalid IVs are present; therefore, the conclusions derived in the previous section continue to hold. One of the important points in 
Theorem 2. is that estimating the weights does not affect the asymptotic variance of the IV estimator when the proposed method is applied. This is because the variability of the valid IVs is independent of the unobserved covariates, and the variability of the invalid IVs vanishes by selecting κ_n = o_p(1/√n).

Unfortunately, the result of Theorem 2. holds only in large-sample situations. In the next subsection, we consider a shrinkage method, the Adaptive LASSO, to select the valid IVs in small-sample situations.
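The smooth weight function (3.3), used in place of a hard indicator (see Note 6), is easy to sketch numerically; the values of w and τ below are illustrative assumptions.

```python
import math

def smooth_weight(w_hat_k, w, tau):
    """Smooth indicator I_{tau,k} = Phi_tau(w_hat_k + w) * Phi_tau(w - w_hat_k),
    where Phi_tau is the CDF of N(0, tau^2); as tau -> 0 it tends to 1 when
    |w_hat_k| < w (valid-looking IV) and to 0 when |w_hat_k| > w (invalid)."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / (tau * math.sqrt(2.0))))
    return Phi(w_hat_k + w) * Phi(w - w_hat_k)

# with threshold w = 0.2 and a small tau, the weight is nearly binary
i_valid = smooth_weight(0.0, 0.2, tau=0.01)    # |w_hat| < w -> close to 1
i_invalid = smooth_weight(0.5, 0.2, tau=0.01)  # |w_hat| > w -> close to 0
```

Unlike a hard indicator, this weight is differentiable in ŵ_k, which is what keeps the ordinary asymptotic theory applicable.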
In this subsection, we consider selecting IVs and estimating causal effects by applying the Adaptive LASSO. (3.1) can be considered as the estimating equation of the following risk function for γ:

\[
\psi(\gamma) := \left(\frac{1}{n}\sum_{i=1}^{n} (b_i - A_i\gamma)\right)^\top \left(\frac{1}{n}\sum_{i=1}^{n} A_i\right)^{-1} \left(\frac{1}{n}\sum_{i=1}^{n} (b_i - A_i\gamma)\right). \tag{3.13}
\]

Indeed, the partial derivative of (3.13) with respect to γ becomes

\[
\frac{\partial}{\partial\gamma}\psi(\gamma)
= -2\left(\frac{1}{n}\sum_{i=1}^{n} A_i\right)^\top \left(\frac{1}{n}\sum_{i=1}^{n} A_i\right)^{-1} \left(\frac{1}{n}\sum_{i=1}^{n} (b_i - A_i\gamma)\right)
= -2\left(\frac{1}{n}\sum_{i=1}^{n} (b_i - A_i\gamma)\right),
\]

since each A_i is symmetric. Next, an Adaptive LASSO-type shrinkage penalty is added to the risk function (3.13), and γ̂* is estimated as follows:

\[
\hat{\gamma}^{*} = \arg\min_{\gamma}\; n\psi(\gamma) + \lambda_n \sum_{k=1}^{K} \hat{\xi}_k |\gamma_k|, \tag{3.14}
\]

where λ_n Σ_{k=1}^K ξ̂_k |γ_k| is a penalty term and ξ̂_k = 1/|γ̂_k|^δ, δ > 0. Also, λ_n satisfies λ_n/√n → 0 and λ_n n^{(δ−1)/2} → ∞. 

Theorem 3.
Let 𝒜 = {1, ..., ℓ} and 𝒜*_n = {j : γ̂*_j ≠ 0}. The following results are obtained:

• lim_{n→∞} Pr(𝒜*_n = 𝒜) = 1;
• √n (γ̂*_val − γ_val) →_L N( 0_ℓ, E[Ǎ]^{-1} ).

By applying γ̂* to (3.11), the estimator of the causal effect is obtained while selecting valid IVs in small-sample situations.

In this paper, we propose the following procedure to estimate unbiased causal effects when auxiliary variables can be obtained:

1. By applying a shrinkage method such as the Adaptive LASSO to (3.1), γ̂ can be estimated while selecting valid IVs; thus, the weight estimators for a linear combination of IVs are prepared for estimating unbiased causal effects.
2. By applying γ̂ to (3.11), an unbiased estimator of the causal effect is obtained. The estimator has the same asymptotic variance as GMM; it attains the semiparametric efficiency bound.

4 Simulations

In this section, we confirm the properties of our proposed method and compare it with the methods put forward by Liao, 2013 and DiTraglia, 2016. We show that 1) our proposed method has the same asymptotic variance as GMM; 2) pre-specifying valid IVs is not necessary for our proposed method when NCOs are available; and 3) we confirm how the parameter estimates fluctuate when the tuning parameters κ, κ_n, τ_n, w, λ_n, and δ are varied. The simulation setting is as follows (similar to Liao, 2013 and Miao and Tchetgen Tchetgen, 2018):

Candidates of IVs, treatment, and unmeasured covariate:

(T_i, Z_{1i}, Z_{2i}, U_i, Z*_{3i})^⊤ ~ i.i.d. N(0, Σ), Z_{3i} = Z*_{3i} + 0. × U_i,

where Σ is a covariance matrix whose treatment-IV entries are σ_{tz1} and σ_{tz2}.
1. Strong IVs: (σ_{tz1}, σ_{tz2}) = (0. , 0. )
2. Weak IVs: (σ_{tz1}, σ_{tz2}) = (0. , 0. )

Outcome:

Y_{ti} = 0. + 0. t + U_i

Negative Control Outcome:

M_i = 1 + α_{mu} U_i

The strength of the NCO is varied as follows:
1. Strong NCO: α_{mu} = 0.6
2. Weak NCO: α_{mu} = 0.

Z_3 is the invalid IV among the candidates; we would like to use Z_1 and Z_2. The number of iterations for the following simulations is 1,000.

First, we confirm that our proposed method has the same asymptotic variance as GMM. In this subsection, the valid IVs Z_1 and Z_2 are used when estimating causal effects. To estimate γ, (3.1) is solved even though the valid IVs are already determined. Using the estimated γ̂, the estimating equation (3.11) is constructed. Note that we describe only the Strong IVs and Strong NCO situation in this paper. The estimating equation of GMM is constructed as in Section 2 (c.f. Hayashi, 2000, Imbens, 2000). Summaries of each estimator of the causal effect are given in Table 1.

Table 1: Summary of estimators for causal effects

Method     Small sample n = 200            Large sample n = 500
           Mean(SD)   Median(Range)        Mean(SD)   Median(Range)
Proposed
GMM

The tuning parameters are selected as follows:

• (κ, κ_n, τ_n, w) = (10, 0. , 0. , 0. ) for n = 200
• (κ, κ_n, τ_n, w) = (10, 0. , 0. , 0. ) for n = 500

κ_n and τ_n need to be determined in accordance with the sample size n. Therefore, the difference between the proposed estimator and GMM becomes larger when n is small.

Next, we compare our proposed method with the previous methods of Liao, 2013 and DiTraglia, 2016. We consider two scenarios:

1. We know some valid IVs (i.e., Z_1) and the NCO (i.e., M). Under this situation, not only our proposed method but also the methods of Liao, 2013 and DiTraglia, 2016 work well.
2. We know some valid IVs, but do not know the NCO. Under this situation, the methods of Liao, 2013 and DiTraglia, 2016 work well, whereas our proposed method also works by using a valid IV as the auxiliary variable (see also Note 5).

Regarding the method of Liao, 2013, we use an Adaptive LASSO-type penalty, and the tuning parameters are selected as λ_n = 0. and ω = 1. Regarding our proposed method, the tuning parameters are selected as follows:

• (κ, κ_n, τ_n, w, λ_n, δ) = (10, 0. , 0. , 0. , 0. , ) for n = 200
• (κ, κ_n, τ_n, w, λ_n, δ) = (10, 0. , 0. , 0. , 0. , ) for n = 500

Regarding the methods of Liao, 2013 and DiTraglia, 2016, Z_1 is used for the first step of estimation, and the IVs used for the second step are selected from the candidates (Z_2, Z_3). Also, the NCO can be used; therefore, all methods can produce valid estimates of causal effects. Summaries of each estimator of the causal effect are given in Table 2.

Table 2: Summary of estimators for causal effects by situations
Situation               Method           Small sample n = 200                  Large sample n = 500
                                         Mean(SD)  Median(Range)  RMSE         Mean(SD)  Median(Range)  RMSE
Strong IV & Strong NCO  Proposed
                        Liao, 2013
                        DiTraglia, 2016
Strong IV & Weak NCO    Proposed
Weak IV & Strong NCO    Proposed
                        Liao, 2013
                        DiTraglia, 2016
Weak IV & Weak NCO      Proposed
Under the Strong IV situation, our proposed method and Liao, 2013 produce valid estimates. Although DiTraglia, 2016 has a large variance, the median of its estimates is valid to some extent. Under the Weak IV situation, all methods have larger variance than under the Strong IV situation. In particular, the variance of our proposed method is relatively larger than that of Liao, 2013 in small samples, whereas this is reversed in large samples. Regarding our proposed method, the strength of the NCO has little effect on the parameter estimation.
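For intuition, the procedure compared above can be sketched end to end: screen the candidates with the NCO, build a score from the retained IVs, and use it as the instrument. Every numeric setting here is an illustrative assumption chosen for the sketch (it does not reproduce the paper's design, and there are no covariates X in this toy model).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
U = rng.normal(size=n)
Z1, Z2 = rng.normal(size=n), rng.normal(size=n)
Z3 = rng.normal(size=n) + 0.5 * U                 # invalid: correlated with U
T = 0.5 * Z1 + 0.5 * Z2 + 0.3 * Z3 + U + rng.normal(size=n)
Y = 0.5 * T + U                                   # true causal effect 0.5
M = 0.6 * U                                       # NCO (centred version)
Z = np.column_stack([Z1, Z2, Z3])

# 1) screen candidates with the NCO
valid = np.abs(Z.T @ M / n) <= 0.2                # keeps Z1, Z2; drops Z3
Zv = Z[:, valid]

# 2) score weights: OLS of T on the retained IVs (no X in this toy model)
gamma = np.linalg.solve(Zv.T @ Zv, Zv.T @ T)
score = Zv @ gamma

# 3) IV estimate of the causal effect with the score as the instrument
beta_t = (score @ Y) / (score @ T)

# contrast: using the invalid Z3 directly as the instrument biases the estimate
beta_naive = (Z3 @ Y) / (Z3 @ T)
```

With the invalid candidate screened out, the score-based estimate concentrates near the true effect, while instrumenting with Z3 alone is badly biased because Z3 carries the confounder U.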
Regarding the methods of Liao, 2013 and DiTraglia, 2016, Z is used for the first step of the estimation, and the IVs used for the second step are selected from the candidates (Z, Z). On the other hand, the NCO cannot be used; we consider that Z is used as an alternative auxiliary variable instead (see Appendix C). Summaries of each estimator for causal effects are in Table 3. Note that the summaries of the methods of Liao, 2013 and DiTraglia, 2016 are the same as in Table 2.

Table 3: Summary of estimators for causal effects by situations
Situation | Method
Strong IV | Proposed
Weak IV | Proposed
Columns: Mean(SD), Median(Range), RMSE for the small sample (n = 200) and the large sample (n = 500).
Under the Strong IV situation, our proposed method has approximately the same variance as in Table 2, whereas the variance is larger under the Weak IV situation. Therefore, when the NCO can be used, we should use it to estimate causal effects.
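The role of the NCO in the selection step can be illustrated in the same spirit. The screening statistic (see Appendix C) is the empirical mean of the product of each candidate IV and the NCO, which converges to E[Z_k U]; IVs whose statistic is near zero are treated as valid. The data-generating process below is hypothetical (our own choice of coefficients), and the invalid IV violates condition 3 by being correlated with U.

```python
import numpy as np

# Sketch of the NCO-based screening statistic w_hat_k = (1/n) sum_i Z_ik M_i.
# Z1 is valid; Z2 is invalid through correlation with the unmeasured
# covariate U (condition 3). The NCO M is driven by U but receives no
# effect from the treatment or the IVs.
rng = np.random.default_rng(1)
n = 5000
U = rng.normal(size=n)
Z1 = rng.normal(size=n)            # valid IV: independent of U
Z2 = 0.7 * U + rng.normal(size=n)  # invalid IV: correlated with U
M = U + rng.normal(size=n)         # negative control outcome

w1 = np.mean(Z1 * M)  # converges to E[Z1 U] = 0   -> Z1 kept as valid
w2 = np.mean(Z2 * M)  # converges to E[Z2 U] = 0.7 -> Z2 flagged as invalid
print(w1, w2)
```

In the paper's procedure this screening is done smoothly through the indicator weights rather than by a hard cutoff, but the limiting behavior of the statistic is the same.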
Finally, we confirm how the parameter estimator fluctuates when the tuning parameters κ, κ_n, τ_n, w, λ_n, and δ are varied. Note that we only describe the Strong IV and Strong NCO situation in this paper. To confirm the fluctuation of the parameter estimator, we vary one parameter value while fixing the other parameters.

1. κ, κ_n: Regarding these parameters, we only vary κ_n, since the proportion κ_n/κ is more important (see the proof of Proposition 2 also). Note that the other parameter values are (κ, τ_n, w, λ_n, δ) = (10, ·, ·, ·, ·). Summaries of each estimator for causal effects are in Table 4.

Table 4: Summary of the fluctuation of the parameter estimator for causal effects regarding κ_n
Parameter value | Mean(SD) | Median(Range) (small sample, n = 200)
κ_n = 0.001 | 0.809(0.134) | 0.815(0.40-1.16)

From Table 4, the selection of κ_n has little effect on the parameter estimation.

2. τ_n: Note that the other parameter values are (κ, κ_n, w, λ_n, δ) = (10, ·, ·, ·, ·). Summaries of each estimator for causal effects are in Table 5.

Table 5: Summary of the fluctuation of the parameter estimator for causal effects regarding τ_n
Parameter value | Mean(SD) | Median(Range) (small sample, n = 200)
τ_n = 0.001 | 0.806(0.125) | 0.813(0.28-1.16)

3. w: Note that the other parameter values are (κ, κ_n, τ_n, λ_n, δ) = (10, ·, ·, ·, ·). Summaries of each estimator for causal effects are in Table 6.

Table 6: Summary of the fluctuation of the parameter estimator for causal effects regarding w
Parameter value | Mean(SD) | Median(Range) (small sample, n = 200)
w = 0.001 | 0.819(0.130) | 0.825(0.40-2.31)

From Table 6, the selection of w and τ_n has some effect on the parameter estimation. Theoretically, we should select |w| and τ_n as small as possible, whereas the range of the parameter estimates becomes wide in the simulation result. Therefore, we have to select w and τ_n carefully, but their selection has less effect than that of λ_n; we confirm this later.

4. δ: Note that the other parameter values are (κ, κ_n, τ_n, w, λ_n) = (10, ·, ·, ·, ·). Summaries of each estimator for causal effects are in Table 7.

Table 7: Summary of the fluctuation of the parameter estimator for causal effects regarding δ
Parameter value | Mean(SD) | Median(Range) (small sample, n = 200)
δ = 1 | 0.810(0.134) | 0.816(0.40-1.16)

From Table 7, the selection of δ has some effect on the parameter estimation. We should select the parameter as δ ≤ 1, but the selection of δ has less effect than that of λ_n; we confirm this later.

5. λ_n: Note that the other parameter values are (κ, κ_n, τ_n, w, δ) = (10, ·, ·, ·, ·). Summaries of each estimator for causal effects are in Table 8.

Table 8: Summary of the fluctuation of the parameter estimator for causal effects regarding λ_n
Parameter value | Mean(SD) | Median(Range) (small sample, n = 200)
λ_n = 1 | 0.310(32.324) | 0.883(-941.87-165.96)
λ_n = 10 | 0.791(10.260) | 0.863(-221.31-37.30)

From Table 8, it is found that the selection of λ_n has serious effects on the parameter estimation. One important point is that the fluctuation of the parameter estimator does not increase in accordance with the increase of λ_n. Therefore, we have to select λ_n carefully; we confirm the fluctuation of the parameter estimator in more detail. From Table 9, it is found that we should select the parameter as λ_n ≲ 0.· for both n = 200 and n = 500. From the above, we need only a small penalty to estimate the parameter while selecting IVs.

In this paper, we proposed a new IV estimator which uses a linear combination of IVs. When some invalid IVs exist, our proposed method can select the valid IVs by using

Table 9: Summary of the fluctuation of the parameter estimator for causal effects in more detail regarding λ_n
Parameter value | Small sample n = 200: Mean(SD), Median(Range) | Large sample n = 500: Mean(SD), Median(Range)
λ_n = 0.01 | 0.812(0.136), 0.819(0.40-1.17) | 0.813(0.096), 0.809(0.53-1.18)
λ_n = 1 | 0.310(32.324), 0.883(-941.87-165.96) | -1.330(79.667), 0.866(-2508.86-43.54)
λ_n = 2 | 1.321(7.172), 0.878(-81.15-117.02) | 2.537(38.721), 0.867(-35.39-1221.80)
λ_n = 5 | 1.381(4.051), 0.889(-29.16-74.59) | 1.526(8.585), 0.871(-87.41-231.36)
λ_n = 10 | 0.791(10.260), 0.863(-221.31-37.30) | 1.112(4.593), 0.870(-80.12-30.57)

a negative control outcome or some auxiliary variable. Whether or not the valid IVs are selected, we showed that our proposed estimator has the same efficiency as the generalized method of moments estimator. Also, by applying an Adaptive LASSO-type shrinkage method, we constructed an estimating procedure that estimates causal effects and selects valid IVs at once. We confirmed the performance of our proposed method and some previous methods through simulations. Regardless of the strength of the negative control outcome, our proposed method works well; the strength of the IVs is more important. Not only in the large-sample but also in the small-sample situation, the performance of our proposed method is superior to that of the previous methods in many situations. Also, our proposed method works by using valid IVs when a negative control outcome cannot be used. On the other hand, the selection of the tuning parameters has little effect on the parameter estimation, except for λ, the strength of the penalty; we need to select λ ≈ 0.·.

References

[1] Andrews, D. W. (1999). Consistent moment selection procedures for generalized method of moments estimation. Econometrica, (3), 543-563.
[2] Baiocchi, M., Cheng, J., and Small, D. S. (2014). Instrumental variable methods for causal inference. Statistics in Medicine, (13), 2297-2340.
[3] Burgess, S., Small, D. S., and Thompson, S. G. (2017). A review of instrumental variable estimators for Mendelian randomization. Statistical Methods in Medical Research, (5), 2333-2355.
[4] Claeskens, G., and Hjort, N. L. (2003). The focused information criterion.
Journal of the American Statistical Association, (464), 900-916.
[5] Cui, Y., Pu, H., Shi, X., Miao, W., and Tchetgen Tchetgen, E. (2020). Semiparametric proximal causal inference. arXiv preprint arXiv:2011.08411.
[6] Darolles, S., Fan, Y., Florens, J. P., and Renault, E. (2011). Nonparametric instrumental regression. Econometrica, (5), 1541-1565.
[7] Deaner, B. (2019). Nonparametric Instrumental Variables Estimation Under Misspecification. arXiv preprint arXiv:1901.01241.
[8] DiTraglia, F. J. (2016). Using invalid instruments on purpose: Focused moment selection and averaging for GMM. Journal of Econometrics, (2), 187-208.
[9] Gordon, R. D. (1941). Values of Mills' ratio of area to bounding ordinate and of the normal probability integral for large values of the argument. The Annals of Mathematical Statistics, (3), 364-366.
[10] Guo, Z., Kang, H., Tony Cai, T., and Small, D. S. (2018). Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), (4), 793-815.
[11] Harville, D. A. (2006). Matrix Algebra From a Statistician's Perspective. Springer Science and Business Media.
[12] Hayashi, F. (2000). Econometrics. Princeton University Press.
[13] Hernán, M. A., and Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC.
[14] Hong, H., Preston, B., and Shum, M. (2003). Generalized empirical likelihood-based model selection criteria for moment condition models. Econometric Theory, 923-943.
[15] Imbens, G. W. (2002). Generalized method of moments and empirical likelihood. Journal of Business & Economic Statistics, (4), 493-506.
[16] Kang, H., Zhang, A., Cai, T. T., and Small, D. S. (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American Statistical Association, (513), 132-144.
[17] Kianian, B., Kim, J. I., Fine, J. P., and Peng, L. (2019). Causal Proportional Hazards Estimation with a Binary Instrumental Variable. arXiv preprint arXiv:1901.11050.
[18] Khush, R. S., et al. (2013). H2S as an indicator of water supply vulnerability and health risk in low-resource settings: a prospective cohort study. The American Journal of Tropical Medicine and Hygiene, (2), 251-259.
[19] Knight, K., and Fu, W. (2000). Asymptotics for lasso-type estimators. Annals of Statistics, 1356-1378.
[20] Kolesár, M., Chetty, R., Friedman, J., Glaeser, E., and Imbens, G. W. (2015). Identification and inference with many invalid instruments. Journal of Business & Economic Statistics, (4), 474-484.
[21] Liao, Z. (2013). Adaptive GMM shrinkage estimation with consistent moment selection. Econometric Theory, 857-904.
[22] Liu, X., Wang, L., and Liang, H. (2011). Estimation and variable selection for semiparametric additive partial linear models. Statistica Sinica, (3), 1225.
[23] Martínez-Camblor, P., Mackenzie, T., Staiger, D. O., Goodney, P. P., and O'Malley, A. J. (2019). Adjusting for bias introduced by instrumental variable estimation in the Cox proportional hazards model. Biostatistics, (1), 80-96.
[24] Miao, W., and Tchetgen Tchetgen, E. (2018). A Confounding Bridge Approach for Double Negative Control Inference on Causal Effects (Supplement and Sample Codes are included). arXiv preprint arXiv:1808.04945.
[25] Newey, W. K. (1990). Efficient instrumental variables estimation of nonlinear models. Econometrica: Journal of the Econometric Society, 809-837.
[26] Okui, R., Small, D. S., Tan, Z., and Robins, J. M. (2012). Doubly robust instrumental variable regression. Statistica Sinica, 173-205.
[27] Ogburn, E. L., Rotnitzky, A., and Robins, J. M. (2015). Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology), (2), 373.
[28] Pierce, B. L., Ahsan, H., and VanderWeele, T. J. (2011). Power and instrument strength requirements for Mendelian randomization studies using multiple genetic variants. International Journal of Epidemiology, (3), 740-752.
[29] Rosenbaum, P. R., and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, (1), 41-55.
[30] Sanderson, E., Richardson, T., Hemani, G., and Smith, G. D. (2020). The use of negative control outcomes in Mendelian Randomisation to detect potential population stratification or selection bias. BioRxiv.
[31] Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica: Journal of the Econometric Society, 393-415.
[32] Shi, X., Miao, W., and Tchetgen, E. T. (2020). A Selective Review of Negative Control Methods in Epidemiology. Current Epidemiology Reports, 1-13.
[33] Su, L., and Zhang, Y. (2012). Variable Selection in Nonparametric and Semiparametric Regression Models. Handbook in Applied Nonparametric and Semi-Nonparametric Econometrics and Statistics.
[34] Tchetgen Tchetgen, E. (2014). The control outcome calibration approach for causal inference with unobserved confounding. American Journal of Epidemiology, (5), 633-640.
[35] Tchetgen Tchetgen, E. J., Walter, S., Vansteelandt, S., Martinussen, T., and Glymour, M. (2015). Instrumental variable estimation in a survival context. Epidemiology, (3), 402-410.
[36] Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press.
[37] White, H. (1982). Instrumental variables regression with independent observations. Econometrica: Journal of the Econometric Society, 483-499.
[38] Windmeijer, F., Farbmacher, H., Davies, N., and Davey Smith, G. (2019). On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association, (527), 1339-1350.
[39] Yang, S., and Ding, P. (2018). Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika, (2), 487-493.
[40] Ying, A., Xu, R., and Murphy, J. (2019). Two-stage residual inclusion for survival data and competing risks - An instrumental variable approach with application to SEER-Medicare linked data. Statistics in Medicine, (10), 1775-1801.
[41] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, (476), 1418-1429.

A Important regularity conditions
Throughout this paper, the following two regularity conditions are important:
C.1 $E\left[Z^{\otimes 2}\right] - E\left[ZX^\top\right]E\left[XX^\top\right]^{-1}E\left[XZ^\top\right] > O$

C.2 $E[ZT] - E\left[ZX^\top\right]E\left[XX^\top\right]^{-1}E[TX] \neq \mathbf{0}$

C.1 means a valid relationship between $Z$ and $X$; it concerns only the relationship among the variables. Whereas, C.2 is regarded as a kind of IV condition 1). If
$$E[ZT] - E\left[ZX^\top\right]E\left[XX^\top\right]^{-1}E[TX] = \mathbf{0},$$
i.e., $\gamma = \mathbf{0}$, a linear combination of IVs does not work. Therefore, an IV estimator cannot be constructed.

B Proofs
B.1 Proof of Proposition 1
We use the following lemma, which is a well-known conclusion (c.f. Harville, 2006), when showing
Proposition 1.

Lemma 2. Let $A$ be an $m \times m$ non-singular matrix and $d$ an $m$-dimensional vector. For all $x \in \mathbb{R}^m \setminus \{\mathbf{0}\}$,
$$x^\top A x \left(d^\top x\right)^{-2} \geq \left(d^\top A^{-1} d\right)^{-1},$$
where the equality holds when $x \propto A^{-1} d$.

Proof of Proposition 1. At first, we calculate $\Gamma^{-1}$:
$$\Gamma^{-1} = \begin{pmatrix} \left(E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]\right)^{-1} & -\dfrac{E[HX^\top]}{E[HT]}\left(E[XX^\top] - \dfrac{E[TX]E[HX^\top]}{E[HT]}\right)^{-1} \\ * & * \end{pmatrix}.$$
By using the Sherman-Morrison formula (c.f. Harville, 2006),
$$\left(E[XX^\top] - \frac{E[TX]E[HX^\top]}{E[HT]}\right)^{-1} = E[XX^\top]^{-1} + \frac{E[XX^\top]^{-1}E[TX]E[HX^\top]E[XX^\top]^{-1}/E[HT]}{1 - E[HX^\top]E[XX^\top]^{-1}E[TX]/E[HT]}$$
$$= \frac{\left(E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]\right)E[XX^\top]^{-1} + E[XX^\top]^{-1}E[TX]E[HX^\top]E[XX^\top]^{-1}}{E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]}.$$
Therefore,
$$\frac{E[HX^\top]}{E[HT]}\left(E[XX^\top] - \frac{E[TX]E[HX^\top]}{E[HT]}\right)^{-1} = \frac{E[HX^\top]E[XX^\top]^{-1}}{E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]}.$$
We continue to calculate the variance components of $\hat{\beta}_{IVt}$:
$$\Gamma^{-1}\Sigma = \sigma^2\begin{pmatrix} \dfrac{E[H^2] - E[HX^\top]E[XX^\top]^{-1}E[HX]}{E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]} & O^\top \\ * & * \end{pmatrix},$$
$$\Gamma^{-1}\Sigma\left(\Gamma^\top\right)^{-1} = \sigma^2\begin{pmatrix} \dfrac{E[H^2] - E[HX^\top]E[XX^\top]^{-1}E[HX]}{\left(E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]\right)^2} & * \\ * & * \end{pmatrix}. \tag{B.1}$$
The (1,1) component of (B.1) is
$$\frac{E[H^2] - E[HX^\top]E[XX^\top]^{-1}E[HX]}{\left(E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]\right)^2} = \frac{\gamma^\top\left(E[Z^{\otimes 2}] - E[ZX^\top]E[XX^\top]^{-1}E[XZ^\top]\right)\gamma}{\left(\gamma^\top\left(E[ZT] - E[ZX^\top]E[XX^\top]^{-1}E[TX]\right)\right)^2}. \tag{B.2}$$
By using Lemma 2, the minimum value of (B.2) can be derived. Therefore, when
$$\gamma \propto \left(E[Z^{\otimes 2}] - E[ZX^\top]E[XX^\top]^{-1}E[XZ^\top]\right)^{-1}\left(E[ZT] - E[ZX^\top]E[XX^\top]^{-1}E[TX]\right), \tag{B.3}$$
the asymptotic variance related to $\hat{\beta}_{IVt}$ becomes minimum. Also,
$$\left(\Gamma^{-1}\Sigma\left(\Gamma^\top\right)^{-1}\right)_{(1,1)} = \sigma^2\,\Omega_{opt}^{-1},$$
$$\Omega_{opt} = \left(E[ZT] - E[ZX^\top]E[XX^\top]^{-1}E[TX]\right)^\top\left(E[Z^{\otimes 2}] - E[ZX^\top]E[XX^\top]^{-1}E[XZ^\top]\right)^{-1}\left(E[ZT] - E[ZX^\top]E[XX^\top]^{-1}E[TX]\right).$$

B.2 Proof of Theorem 1
To simplify the description, we use the following notations:
$$\Sigma = \sigma^2\begin{pmatrix} E[H^{\otimes 2}] & E[HX^\top] \\ E[XH^\top] & E[XX^\top] \end{pmatrix} = \begin{pmatrix} A & B^\top \\ B & C \end{pmatrix}, \qquad \Gamma = \begin{pmatrix} E[HT] & E[HX^\top] \\ E[TX] & E[XX^\top] \end{pmatrix} = \begin{pmatrix} F & B^\top \\ G & C \end{pmatrix}.$$
At first, we calculate the inverse matrix of $\Sigma$:
$$\Sigma^{-1} = \begin{pmatrix} \left(A - B^\top C^{-1}B\right)^{-1} & -A^{-1}B^\top\left(C - BA^{-1}B^\top\right)^{-1} \\ -\left(C - BA^{-1}B^\top\right)^{-1}BA^{-1} & \left(C - BA^{-1}B^\top\right)^{-1} \end{pmatrix} = \begin{pmatrix} D & -A^{-1}B^\top E \\ -EBA^{-1} & E \end{pmatrix},$$
$$\Gamma^\top\Sigma^{-1} = \begin{pmatrix} F^\top D - G^\top EBA^{-1} & -F^\top A^{-1}B^\top E + G^\top E \\ BD - CEBA^{-1} & -BA^{-1}B^\top E + CE \end{pmatrix}. \tag{B.4}$$
About the (2,2) component of (B.4), by using $-BA^{-1}B^\top E + CE = \left(C - BA^{-1}B^\top\right)\left(C - BA^{-1}B^\top\right)^{-1} = I_p$,
$$\Gamma^\top\Sigma^{-1}\Gamma = \begin{pmatrix} F^\top DF - G^\top EBA^{-1}F + G^\top EG - F^\top A^{-1}B^\top EG & F^\top DB^\top - G^\top EBA^{-1}B^\top + G^\top EC - F^\top A^{-1}B^\top EC \\ G + BDF - CEBA^{-1}F & BDB^\top - CEBA^{-1}B^\top + C \end{pmatrix}. \tag{B.5}$$
From here, we calculate each component of (B.5). At first, about (2,1):
$$BDF - CEBA^{-1}F = \left(B - CEBA^{-1}\left(A - B^\top C^{-1}B\right)\right)DF = \left(B - CEB + CEBA^{-1}B^\top C^{-1}B\right)DF$$
$$= \left(B - CE\left(C - BA^{-1}B^\top\right)C^{-1}B\right)DF = \left(B - C\left(C - BA^{-1}B^\top\right)^{-1}\left(C - BA^{-1}B^\top\right)C^{-1}B\right)DF = O.$$
Next, about (1,2):
$$-G^\top EBA^{-1}B^\top + G^\top EC = G^\top E\left(C - BA^{-1}B^\top\right) = G^\top\left(C - BA^{-1}B^\top\right)^{-1}\left(C - BA^{-1}B^\top\right) = G^\top.$$
At last, about (2,2): by the same argument, $BDB^\top - CEBA^{-1}B^\top = O$. Therefore,
$$\Gamma^\top\Sigma^{-1}\Gamma = \begin{pmatrix} F^\top DF - G^\top EBA^{-1}F + G^\top EG & G^\top \\ G & C \end{pmatrix}, \tag{B.6}$$
and hence
$$\left(\Gamma^\top\Sigma^{-1}\Gamma\right)^{-1}_{(1,1)} = \left(F^\top DF - G^\top EBA^{-1}F + G^\top EG - G^\top C^{-1}G\right)^{-1}. \tag{B.7}$$
About the second term of (B.7),
$$G^\top EBA^{-1}F = G^\top\left(C^{-1} + C^{-1}BDB^\top C^{-1}\right)BA^{-1}F = G^\top C^{-1}BA^{-1}F + G^\top C^{-1}BDB^\top C^{-1}BA^{-1}F$$
$$= G^\top C^{-1}BD\left(\left(A - B^\top C^{-1}B\right)A^{-1} + B^\top C^{-1}BA^{-1}\right)F = G^\top C^{-1}BDF.$$
Next, about the third and fourth terms of (B.7),
$$G^\top EG - G^\top C^{-1}G = G^\top C^{-1}G + G^\top C^{-1}BDB^\top C^{-1}G - G^\top C^{-1}G = G^\top C^{-1}BDB^\top C^{-1}G.$$
Therefore, (B.7) becomes
$$\left(\Gamma^\top\Sigma^{-1}\Gamma\right)^{-1}_{(1,1)} = \left(F^\top DF - G^\top C^{-1}BDF + G^\top C^{-1}BDB^\top C^{-1}G\right)^{-1} = \left(\left(F - B^\top C^{-1}G\right)^\top D\left(F - B^\top C^{-1}G\right)\right)^{-1}.$$
Returning to the original symbols,
$$\left(\Gamma^\top\Sigma^{-1}\Gamma\right)^{-1}_{(1,1)} = \sigma^2\left\{\left(E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]\right)^\top\left(E[H^{\otimes 2}] - E[HX^\top]E[XX^\top]^{-1}E[XH^\top]\right)^{-1}\left(E[HT] - E[HX^\top]E[XX^\top]^{-1}E[TX]\right)\right\}^{-1}.$$
Therefore, the asymptotic variance related to $\hat{\beta}_{IVt}$ coincides with that of $\hat{\beta}_{GMMt}$.
B.3 Proof of Lemma 1
We proceed with the proof by dividing (3.7) into two parts: $c_{1i}$ and $c_{2i}(\gamma)$. At first, regarding $c_{1i}$:

• When $k \in \{1, \dots, \ell\}$,
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\left(Z_{ik}T_i - \sum_{j=1}^{p} Z_kX_j\,TX_j\right)I^{\tau}_{k} + \kappa_n\bar{I}^{\tau}_{k} - \left(Z_{ik}T_i - \sum_{j=1}^{p} Z_kX_j\,TX_j\right)\right\} = \frac{1}{n}\sum_{i=1}^{n}\left\{\kappa_n - \left(Z_{ik}T_i - \sum_{j=1}^{p} Z_kX_j\,TX_j\right)\right\}\bar{I}^{\tau}_{k}. \tag{B.8}$$

• When $k \in \{\ell+1, \dots, K\}$,
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\left(Z_{ik}T_i - \sum_{j=1}^{p} Z_kX_j\,TX_j\right)I^{\tau}_{k} + \kappa_n\bar{I}^{\tau}_{k} - \kappa_n\right\} = \frac{1}{n}\sum_{i=1}^{n}\left\{\left(Z_{ik}T_i - \sum_{j=1}^{p} Z_kX_j\,TX_j\right) - \kappa_n\right\}I^{\tau}_{k}. \tag{B.9}$$

About (B.8) and (B.9), since the first term ($\frac{1}{n}\sum_{i=1}^{n}\{\cdot\}$) converges to a constant in probability, we need to confirm the asymptotic property of $I^{\tau}_{k}$. Next, regarding $c_{2i}(\gamma)$:

• When $k \in \{1,\dots,\ell\}$ and $k' \in \{1,\dots,\ell\}$: about the diagonal components of the matrix,
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\left(Z_{ik}^{2} - \sum_{j=1}^{p} Z_kX_j^{2}\right)I^{\tau}_{k} + \kappa\bar{I}^{\tau}_{k} - \left(Z_{ik}^{2} - \sum_{j=1}^{p} Z_kX_j^{2}\right)\right\} = \frac{1}{n}\sum_{i=1}^{n}\left\{\kappa - \left(Z_{ik}^{2} - \sum_{j=1}^{p} Z_kX_j^{2}\right)\right\}\bar{I}^{\tau}_{k}; \tag{B.10}$$
about the non-diagonal components,
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\left(Z_{ik}Z_{ik'} - \sum_{j=1}^{p} Z_kX_j\,Z_{k'}X_j\right)I^{\tau}_{k}I^{\tau}_{k'} - \left(Z_{ik}Z_{ik'} - \sum_{j=1}^{p} Z_kX_j\,Z_{k'}X_j\right)\right\} = \frac{1}{n}\sum_{i=1}^{n}\left\{Z_{ik}Z_{ik'} - \sum_{j=1}^{p} Z_kX_j\,Z_{k'}X_j\right\}\left(I^{\tau}_{k}I^{\tau}_{k'} - 1\right). \tag{B.11}$$

• When $k \in \{\ell+1,\dots,K\}$ and $k' \in \{\ell+1,\dots,K\}$: about the diagonal components,
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\left(Z_{ik}T_i - \sum_{j=1}^{p} Z_kX_j\,TX_j\right)I^{\tau}_{k} + \kappa_n\bar{I}^{\tau}_{k} - \kappa_n\right\} = \frac{1}{n}\sum_{i=1}^{n}\left\{\left(Z_{ik}T_i - \sum_{j=1}^{p} Z_kX_j\,TX_j\right) - \kappa_n\right\}I^{\tau}_{k}. \tag{B.12}$$

• When $k \in \{1,\dots,K\}, k' \in \{\ell+1,\dots,K\}$, or $k \in \{\ell+1,\dots,K\}, k' \in \{1,\dots,K\}$: about the non-diagonal components,
$$\frac{1}{n}\sum_{i=1}^{n}\left(Z_{ik}Z_{ik'} - \sum_{j=1}^{p} Z_kX_j\,Z_{k'}X_j\right)I^{\tau}_{k}I^{\tau}_{k'}. \tag{B.13}$$

About (B.10)-(B.13), since the first term converges to a constant in probability, we need to confirm the asymptotic property of $I^{\tau}_{k}$.

From here, we confirm when $I^{\tau}_{k} = o_p(1/\sqrt{n})$ or $\bar{I}^{\tau}_{k} = o_p(1/\sqrt{n})$ holds. Note that under this situation, the product $I^{\tau}_{k}I^{\tau}_{k'}$ becomes $o_p(1/n)$. Regarding $I^{\tau}_{k}$,
$$I^{\tau}_{k} = \Phi_{\tau}(\hat{w}_k + w)\,\Phi_{\tau}(w - \hat{w}_k) = \frac{1}{\sqrt{2\pi}\tau}\int_{-\infty}^{\hat{w}_k + w}\exp\left\{-\frac{\omega^2}{2\tau^2}\right\}d\omega \cdot \frac{1}{\sqrt{2\pi}\tau}\int_{-\infty}^{w - \hat{w}_k}\exp\left\{-\frac{\omega^2}{2\tau^2}\right\}d\omega.$$
Transforming the variables as $\omega' = \omega/\tau$,
$$I^{\tau}_{k} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{\hat{w}_k + w}{\tau}}\exp\left\{-\frac{(\omega')^2}{2}\right\}d\omega' \cdot \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{w - \hat{w}_k}{\tau}}\exp\left\{-\frac{(\omega')^2}{2}\right\}d\omega' = \int_{-\infty}^{\frac{\hat{w}_k + w}{\tau}}d\Phi\int_{-\infty}^{\frac{w - \hat{w}_k}{\tau}}d\Phi,$$
where $\Phi$ is the distribution function of $N(0,1)$. Then,
$$0 < 1 - I^{\tau}_{k} = 1 - \int_{-\infty}^{\frac{\hat{w}_k + w}{\tau}}d\Phi\int_{-\infty}^{\frac{w - \hat{w}_k}{\tau}}d\Phi < \begin{cases} 1 - \left(\int_{-\infty}^{\frac{w - \hat{w}_k}{\tau}}d\Phi\right)^2 & (\hat{w}_k \geq 0) \\ 1 - \left(\int_{-\infty}^{\frac{\hat{w}_k + w}{\tau}}d\Phi\right)^2 & (\hat{w}_k < 0) \end{cases} = 1 - \left(\int_{-\infty}^{\frac{w - |\hat{w}_k|}{\tau}}d\Phi\right)^2, \tag{B.14}$$
and
$$0 < I^{\tau}_{k} = \int_{-\infty}^{\frac{\hat{w}_k + w}{\tau}}d\Phi\int_{-\infty}^{\frac{w - \hat{w}_k}{\tau}}d\Phi < \int_{-\infty}^{\frac{w - |\hat{w}_k|}{\tau}}d\Phi \tag{B.15}$$
are satisfied.

• When $k \in \{1,\dots,\ell\}$: in this situation, $\hat{w}_k \xrightarrow{P} 0$. At first, when $w \geq |\hat{w}_k|$, by using (B.14),
$$0 < 1 - I^{\tau_n}_{k} = 1 - \left(1 - \int_{\frac{w - |\hat{w}_k|}{\tau_n}}^{\infty}d\Phi\right)^2 = 2\int_{\frac{w - |\hat{w}_k|}{\tau_n}}^{\infty}d\Phi - \left(\int_{\frac{w - |\hat{w}_k|}{\tau_n}}^{\infty}d\Phi\right)^2 < 2\int_{\frac{w - |\hat{w}_k|}{\tau_n}}^{\infty}d\Phi. \tag{B.16}$$
Regarding (B.16), using an evaluation of the tail of the normal distribution (c.f. Gordon, 1941),
$$0 < 1 - I^{\tau_n}_{k} < 2\int_{\frac{w - |\hat{w}_k|}{\tau_n}}^{\infty}d\Phi < \frac{2}{\sqrt{2\pi}}\left(\frac{w - |\hat{w}_k|}{\tau_n}\right)^{-1}\exp\left\{-\frac{1}{2}\frac{\left(w - |\hat{w}_k|\right)^2}{\tau_n^2}\right\} < 2\left(\frac{|w - \hat{w}_k|}{\tau_n}\right)^{-1}\exp\left\{-\frac{1}{2}\frac{\left(w - \hat{w}_k\right)^2}{\tau_n^2}\right\}.$$
When
$$2\left(\frac{|w - \hat{w}_k|}{\tau_n}\right)^{-1}\exp\left\{-\frac{1}{2}\frac{\left(w - \hat{w}_k\right)^2}{\tau_n^2}\right\} = o_p\left(\frac{1}{\sqrt{n}}\right), \tag{B.17}$$
$\bar{I}^{\tau}_{k} = o_p(1/\sqrt{n})$ is satisfied, and $I^{\tau}_{k} = 1 + o_p(1/\sqrt{n})$ is also satisfied. Next, when $w < |\hat{w}_k|$, we can ignore the situation in view of the convergence in probability, since $\hat{w}_k \xrightarrow{P} 0$.

• When $k \in \{\ell+1,\dots,K\}$: in this situation, $\hat{w}_k \xrightarrow{P} w_k$. At first, when $w \geq |\hat{w}_k|$, we can ignore the situation in view of the convergence in probability, since $|\hat{w}_k| \xrightarrow{P} |w_k| > w$. Next, when $w < |\hat{w}_k|$, by using (B.15) and the evaluation of the tail of the normal distribution,
$$0 < I^{\tau_n}_{k} < \int_{-\infty}^{\frac{w - |\hat{w}_k|}{\tau_n}}d\Phi = \int_{-\frac{w - |\hat{w}_k|}{\tau_n}}^{\infty}d\Phi < \left(\frac{|w - \hat{w}_k|}{\tau_n}\right)^{-1}\exp\left\{-\frac{1}{2}\frac{\left(w - \hat{w}_k\right)^2}{\tau_n^2}\right\}.$$
Therefore, when (B.17) holds, $I^{\tau}_{k} = o_p(1/\sqrt{n})$ holds, and $\bar{I}^{\tau}_{k} = 1 + o_p(1/\sqrt{n})$ also holds. From the above, when (B.17) is satisfied, (3.7) is also satisfied.

B.4 Proof of Proposition 2
From the result of Lemma 1, for all $\gamma$,
$$\frac{1}{n}\sum_{i=1}^{n}\left(b_i - A_i\gamma\right) - \frac{1}{n}\sum_{i=1}^{n}\left\{\begin{pmatrix} \check{b}_i \\ \kappa_n\mathbf{1}_{K-\ell} \end{pmatrix} - \begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O^\top_{\ell\times(K-\ell)} & \kappa I_{K-\ell} \end{pmatrix}\gamma\right\} = o_p\left(\frac{1}{\sqrt{n}}\right).$$
Therefore,
$$\mathbf{0}_K = \frac{1}{n}\sum_{i=1}^{n}\left(b_i - A_i\hat{\gamma}\right) = \frac{1}{n}\sum_{i=1}^{n}\left\{\begin{pmatrix} \check{b}_i \\ \kappa_n\mathbf{1}_{K-\ell} \end{pmatrix} - \begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O^\top_{\ell\times(K-\ell)} & \kappa I_{K-\ell} \end{pmatrix}\hat{\gamma}\right\} + o_p\left(\frac{1}{\sqrt{n}}\right), \tag{B.18}$$
$$\hat{\gamma} = \left(\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O^\top_{\ell\times(K-\ell)} & \kappa I_{K-\ell} \end{pmatrix}\right)^{-1}\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \check{b}_i \\ \kappa_n\mathbf{1}_{K-\ell} \end{pmatrix} + o_p\left(\frac{1}{\sqrt{n}}\right) \xrightarrow{P} \begin{pmatrix} E\left[\check{A}\right]^{-1}E\left[\check{b}\right] \\ \mathbf{0}_{K-\ell} \end{pmatrix}.$$
From the above, the first component of Proposition 2 can be proved. Next, regarding (B.18), conducting the Taylor expansion around $\gamma$,
$$\mathbf{0}_K = \frac{1}{n}\sum_{i=1}^{n}\left\{\begin{pmatrix} \check{b}_i \\ \kappa_n\mathbf{1}_{K-\ell} \end{pmatrix} - \begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O^\top_{\ell\times(K-\ell)} & \kappa I_{K-\ell} \end{pmatrix}\gamma\right\} - \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O^\top_{\ell\times(K-\ell)} & \kappa I_{K-\ell} \end{pmatrix}\left(\hat{\gamma} - \gamma\right) + o_p\left(\frac{1}{\sqrt{n}}\right).$$
Therefore,
$$\sqrt{n}\left(\hat{\gamma} - \gamma\right) = \left(\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O^\top_{\ell\times(K-\ell)} & \kappa I_{K-\ell} \end{pmatrix}\right)^{-1}\frac{\sqrt{n}}{n}\sum_{i=1}^{n}\left\{\begin{pmatrix} \check{b}_i \\ \kappa_n\mathbf{1}_{K-\ell} \end{pmatrix} - \begin{pmatrix} \check{A}_i & O_{\ell\times(K-\ell)} \\ O^\top_{\ell\times(K-\ell)} & \kappa I_{K-\ell} \end{pmatrix}\gamma\right\} + o_p(1)$$
$$= \begin{pmatrix} \left(\frac{1}{n}\sum_{i=1}^{n}\check{A}_i\right)^{-1}\frac{\sqrt{n}}{n}\sum_{i=1}^{n}\left(\check{b}_i - \check{A}_i\gamma_{val}\right) \\ \sqrt{n}\,\frac{\kappa_n}{\kappa}\mathbf{1}_{K-\ell} \end{pmatrix} + o_p(1).$$
If $\left|E\left[\left(\check{b} - \check{A}\gamma_{val}\right)^{\otimes 2}\right]\right| < \infty$, by using the ordinary asymptotic theories for M-estimators (c.f. Van der Vaart, 2000), the second component of Proposition 2 can be proved.

B.5 Proof of Theorem 2
Regarding (3.11), conducting the Taylor expansion around $\beta$ and $\gamma$,
$$\mathbf{0}_{p+1} = \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \hat{\gamma}^\top z_i \\ x_i \end{pmatrix}\left(y_i - (t_i, x_i^\top)\hat{\beta}\right)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \gamma^\top z_i \\ x_i \end{pmatrix}\left(y_i - (t_i, x_i^\top)\beta\right) - \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \gamma^\top z_i t_i & \gamma^\top z_i x_i^\top \\ x_i t_i & x_i x_i^\top \end{pmatrix}\left(\hat{\beta} - \beta\right) + \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \left(y_i - (t_i, x_i^\top)\beta\right)z_i^\top \\ O_{p\times K} \end{pmatrix}\left(\hat{\gamma} - \gamma\right) + o_p\left(\frac{1}{\sqrt{n}}\right).$$
Therefore,
$$\sqrt{n}\left(\hat{\beta} - \beta\right) = \left(\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \gamma^\top z_i t_i & \gamma^\top z_i x_i^\top \\ x_i t_i & x_i x_i^\top \end{pmatrix}\right)^{-1}\left\{\frac{\sqrt{n}}{n}\sum_{i=1}^{n}\begin{pmatrix} \gamma^\top z_i u_i \\ x_i u_i \end{pmatrix} + \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} u_i z_i^\top \\ O_{p\times K} \end{pmatrix}\sqrt{n}\left(\hat{\gamma} - \gamma\right)\right\} + o_p\left(\frac{1}{\sqrt{n}}\right). \tag{B.19}$$
The second term of $\{\cdot\}$ in (B.19) becomes
$$\frac{1}{n}\sum_{i=1}^{n}u_i z_i \xrightarrow{P} E[UZ] = \begin{pmatrix} \mathbf{0}_\ell \\ E[UZ_{\ell+1}] \\ \vdots \\ E[UZ_K] \end{pmatrix},$$
and by the result of Proposition 2, we can show that
$$\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} u_i z_i^\top \\ O_{p\times K} \end{pmatrix}\sqrt{n}\left(\hat{\gamma} - \gamma\right) = o_p(1).$$
Theorem 2 can be proved.
B.6 Proof of Theorem 3
The details of the proof are referred to Zou, 2006 and Knight and Fu, 2000. From here, we confirm that the proof follows the same flow as Zou, 2006. At first, we consider $\gamma + \frac{v}{\sqrt{n}}$ for $v \in \mathbb{R}^K$. Then,
$$\Psi_n(v) = n\left(\frac{1}{n}\sum_{i=1}^{n}\left(b_i - A_i\left(\gamma + \frac{v}{\sqrt{n}}\right)\right)\right)^\top\left(\frac{1}{n}\sum_{i=1}^{n}A_i\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\left(b_i - A_i\left(\gamma + \frac{v}{\sqrt{n}}\right)\right)\right) + \lambda_n\sum_{k=1}^{K}\hat{\xi}_k\left|\gamma_k + \frac{v_k}{\sqrt{n}}\right|.$$
We also define $V^{(n)}(v) := \Psi_n(v) - \Psi_n(\mathbf{0})$. Then,
$$V^{(n)}(v) = -2\left(\sum_{i=1}^{n}A_i\frac{v}{\sqrt{n}}\right)^\top\left(\frac{1}{n}\sum_{i=1}^{n}A_i\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\left(b_i - A_i\gamma\right)\right) + \left(\sum_{i=1}^{n}A_i\frac{v}{\sqrt{n}}\right)^\top\left(\frac{1}{n}\sum_{i=1}^{n}A_i\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}A_i\frac{v}{\sqrt{n}}\right) + \lambda_n\sum_{k=1}^{K}\hat{\xi}_k\left(\left|\gamma_k + \frac{v_k}{\sqrt{n}}\right| - \left|\gamma_k\right|\right)$$
$$= v^\top\left(\frac{1}{n}\sum_{i=1}^{n}A_i\right)v - 2v^\top\frac{\sqrt{n}}{n}\sum_{i=1}^{n}\left(b_i - A_i\gamma\right) + \lambda_n\sum_{k=1}^{K}\hat{\xi}_k\left(\left|\gamma_k + \frac{v_k}{\sqrt{n}}\right| - \left|\gamma_k\right|\right).$$
The rest of the proof is completed in the same flow as in Zou, 2006.
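The shrinkage step analyzed above can be illustrated in its simplest closed form: with an orthonormal design, the adaptive-LASSO update reduces to coordinate-wise soft-thresholding with data-driven weights ξ̂_k = 1/|γ̃_k|^δ, so coefficients whose preliminary estimates are close to zero are set exactly to zero, while large ones are almost unpenalized (the oracle behavior of Zou, 2006). The numbers below are hypothetical and only sketch the mechanism, not this paper's estimating equations.

```python
import numpy as np

def adaptive_lasso_threshold(gamma_init, lam, delta=1.0):
    """Coordinate-wise adaptive-LASSO soft-thresholding (orthonormal design).

    Small preliminary estimates get large weights 1/|gamma_init|^delta and
    are shrunk exactly to zero; large estimates are barely penalized.
    """
    weights = 1.0 / np.abs(gamma_init) ** delta
    return np.sign(gamma_init) * np.maximum(np.abs(gamma_init) - lam * weights, 0.0)

# Hypothetical preliminary estimates: two "valid" coordinates, two noise.
gamma_init = np.array([0.9, -0.8, 0.05, -0.03])
gamma_hat = adaptive_lasso_threshold(gamma_init, lam=0.01)
print(gamma_hat)
```

This also makes visible why λ_n must be kept small in the simulations above: the weighted penalty already separates the near-zero coordinates, so a large λ_n only distorts the retained coefficients.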
C Using valid IVs as auxiliary variables
Assume that we know that at least one IV $Z$ is valid. Therefore, a solution of the following estimating equation becomes the true causal effect when (2.1) is the true outcome model:
$$\sum_{i=1}^{n}\begin{pmatrix} z_i \\ x_i \end{pmatrix}\left(y_i - (t_i, x_i^\top)\beta\right) = \mathbf{0}_{p+1}.$$
Also, the residuals can be estimated as follows:
$$\varepsilon_i\left(\hat{\beta}\right) = y_i - (t_i, x_i^\top)\hat{\beta}.$$
When constructing $\hat{w}_k$, we substitute $\varepsilon_i\left(\hat{\beta}\right)$ for the NCO $M_i$:
$$\hat{w}_k = \frac{1}{n}\sum_{i=1}^{n}Z_{ki}\,\varepsilon_i\left(\hat{\beta}\right), \qquad \hat{w}_k \xrightarrow{P} E\left[Z_k U\right].$$
And, we can easily confirm the discussions and proofs of