Mostly Harmless Machine Learning: Learning Optimal Instruments in Linear IV Models
Jiafeng Chen
Harvard Business School, Boston, MA. [email protected]
Daniel L. Chen
Toulouse School of Economics, Toulouse, France. [email protected]
Greg Lewis
Microsoft Research, Cambridge, MA. [email protected]
Abstract
We provide some simple theoretical results that justify incorporating machine learning in a standard linear instrumental variable setting, prevalent in empirical research in economics. Machine learning techniques, combined with sample-splitting, extract nonlinear variation in the instrument that may dramatically improve estimation precision and robustness by boosting instrument strength. The analysis is straightforward in the absence of covariates. The presence of linearly included exogenous covariates complicates identification, as the researcher would like to prevent nonlinearities in the covariates from providing the identifying variation. Our procedure can be effectively adapted to account for this complication, based on an argument by Chamberlain (1992). Our method preserves standard intuitions and interpretations of linear instrumental variable methods and provides a simple, user-friendly upgrade to the applied economics toolbox. We illustrate our method with an example in law and criminal justice, examining the causal effect of appellate court reversals on district court sentencing decisions.
Instrumental variable (IV) designs are a popular method in empirical economics. Over 30% of all NBER working papers and top journal publications considered by Currie et al. (2020) include some discussion of instrumental variables. The vast majority of IV designs used in practice are linear IV estimated via two-stage least squares (TSLS), a familiar technique in standard graduate introductions to econometrics and causal inference (e.g. Angrist and Pischke, 2008). Standard TSLS, however, leaves on the table some variation provided by the instruments that may improve precision of estimates, as TSLS only exploits variation that is linearly related to the endogenous regressors. In the event that the instrument has a low linear correlation with the endogenous variable, but nevertheless predicts the endogenous variable well through a nonlinear transformation, we should expect TSLS to perform poorly in terms of estimation precision and inference robustness. In particular, in some cases, TSLS would provide spuriously precise but biased estimates (due to weak instruments, see Andrews et al., 2019). Such nonlinear settings become increasingly plausible when exogenous variation includes high dimensional data or alternative data, such as text, images, or other complex attributes. We show that off-the-shelf machine learning techniques provide a general-purpose toolbox for leveraging such complex variation, improving instrument strength and estimate quality.

Of course, replacing the first stage linear regression with more flexible specifications does not come without cost in terms of identifying assumptions. The validity of TSLS hinges only upon the exclusion restriction that the instrument is linearly uncorrelated with unobserved disturbances in the response variable. Relaxing the linearity requires that endogenous residuals are mean zero conditional on the exogenous instruments, which is stronger. However, it is rare that a researcher has a compelling reason to believe the weaker non-correlation assumption, but rejects the slightly stronger
mean-independence assumption. In fact, by not exploiting the nonlinearities, TSLS may accidentally make a strong instrument weak, and deliver spuriously precise inference: Dieterle and Snell (2016) and references therein find that several applied microeconomics papers have conclusions that are sensitive to the specification (linear vs. quadratic) of the first stage.

A more serious identification concern to leveraging machine learning in the first stage comes from the parametric functional form in the second stage. When there are exogenous covariates that are included in the parametric structural specification, nonlinear transformations of these covariates could in principle be valid instruments, and provide variation that pins down the parameter of interest. For example, in the standard IV setup of $Y = D\tau + X\beta + U$ where $X$ is an exogenous covariate, imposing $E[U \mid X] = 0$ would formally result in $X^2$, $X^3$, etc. being valid "excluded" instruments. However, given that the researcher's stated source of identification comes from excluded instruments, such "identifying variation" provided by covariates is more of an artifact of parametric specification than any serious information from the data that relates to the researcher's scientific inquiry.

One principled response to the above concern is to make the second stage structural specification likewise nonparametric or semiparametric, thereby including an infinite dimensional parameter to estimate, making the empirical design a nonparametric instrumental variable (NPIV) design. (The NPIV estimation problem is an instance of an ill-posed inverse problem, meaning that very different parameter values could lead to similar values in the optimization objective.) Significant theoretical and computational progress has been made in this regard (Newey and Powell, 2003; Ai and Chen, 2003, 2007; Horowitz and Lee, 2007; Severini and Tripathi, 2012; Ai and Chen, 2012; Hartford et al., 2017; Dikkala et al., 2020; Chen et al., 2020a,b, and references therein). However, regrettably, NPIV has received relatively little attention in applied work in economics, potentially due to theoretical complications regarding identification, difficulty in interpretation and troubleshooting, and computational scalability. Moreover, in some cases parametric restrictions on structural functions come from theoretical considerations or techniques like log-linearization, where estimated parameters have intuitive theoretical interpretation and policy relevance; in these cases the researcher may have compelling reasons to stick with parametric specifications.

In the spirit of being user-friendly to practitioners, this paper considers estimation and inference in an instrumental variable model where the second stage structural relationship is linear, while allowing for as much nonlinearity in the instrumental variable as possible, without creating unintended and spurious identifying variation from included covariates. We document some simple results via elementary techniques, which provide intuition and justification for using machine learning methods in instrumental variable designs. We show that with sample-splitting, under weak consistency conditions, a simple estimator that uses the predicted values of endogenous and included regressors as technical instruments is consistent, asymptotically normal, and semiparametrically efficient; the constructed instrumental variable also readily provides weak instrument diagnostics and robust procedures, such as the Anderson–Rubin procedure. Moreover, standard diagnostics like out-of-sample prediction quality are directly related to the quality of estimates.
In the presence of exogenous covariates that are parametrically included in the second-stage structural function, adapting machine learning techniques requires caution to avoid spurious identification from functional forms of the included covariates. To that end, we formulate and analyze the problem as a sequential moment restriction, and develop estimators that utilize machine learning for extracting nonlinear variation from, and only from, the instruments.

We conclude the introduction by relating our work to the technical and applied literatures. The core techniques that allow for the construction of our estimators follow from Chamberlain (1987, 1992). The ideas in our proofs are also familiar in the double machine learning (Chernozhukov et al., 2018; Belloni et al., 2012) and semiparametrics literatures (e.g. Liu et al., 2020); our arguments, however, follow from elementary techniques that are familiar to graduate students and are self-contained. Our proposed estimator is, of course, similar to the split-sample IV or jackknife IV estimators in Angrist et al. (1999), but we do not restrict ourselves to linear settings or linear smoothers. Using machine learning in instrumental variable settings is considered by Hansen and Kozbur (2014) (for ridge), Belloni et al. (2012) (for lasso), and Bai and Ng (2010) (for boosting), among others; our work can be viewed as providing a simple, unified analysis for practitioners, much in the spirit of Chernozhukov et al. (2018). To the best of our knowledge, we are the first to formally explore practical complications of making the first stage nonlinear in a context with exogenous covariates. Finally, we view our work as a counterpoint to the recent work by Angrist and Frandsen (2019), which
is more pessimistic about combining machine learning with instrumental variables—a point we explore in detail in Section 2.3.
We consider the standard cross-sectional setup where $(Y_i, D_i, X_i, W_i)_{i=1}^N \overset{\text{i.i.d.}}{\sim} P$ are sampled from some infinite population. $Y_i$ is some outcome variable, $D_i$ is a set of endogenous treatment variables, $X_i$ is a set of exogenous controls (which includes a constant), and $W_i$ is a set of instrumental variables. The researcher is willing to argue that $W_i$ is exogenously or quasi-experimentally assigned. Moreover, the researcher believes that $W_i$ provides a source of variation that "identifies" an effect of $D_i$ on $Y_i$. We denote the endogenous variables and covariates as $T_i = [D_i, X_i]$ and the excluded instrument and covariates as the technical instruments $Z_i = [W_i, X_i]$.

A typical specification in empirical economics is the linear IV setup:
$$Y_i = D_i^\top \tau + X_i^\top \beta + U_i \equiv T_i^\top \theta + U_i, \qquad D_i = W_i \Pi + X_i \Gamma + V_i \equiv Z_i \Lambda + V_i,$$
$$(U_i, V_i) \perp (X_i, Z_i), \qquad E[U_i] = E[V_i] = 0 \qquad\qquad \text{(Linear IV)}$$
$$\iff\; E[Z_i (Y_i - T_i^\top \theta)] = 0,$$
where $A \perp B$ means $\mathrm{Cov}(A, B) = 0$.

We argue that, in many settings, the researcher is willing to assume more than that $U, V$ are uncorrelated with $X, Z$.
Common introductions of instrumental variables (Angrist and Pischke, 2008; Angrist and Krueger, 2001) stress that instruments induce variation in $D$ and are otherwise unrelated to $U$, and that a common source of instruments is natural experiments. We argue that these narratives imply a stronger form of exogeneity than TSLS requires—the researcher is perhaps willing to assume mean independence $E[U_i \mid W_i] = 0$ beyond $\mathrm{Cov}(U_i, W_i) = 0$. After all, a symmetric mean-zero random variable $S$ is uncorrelated with $S^2$, but one would hardly be comfortable justifying $S^2$ as unrelated to $S$. We demonstrate that strengthening the exogeneity assumption to mean-independence allows the researcher to extract more identifying variation from instruments, but doing so requires more flexible machinery for dealing with the first stage.

Let us first consider the case in which $X_i$ contains only a constant, and we have homoskedastic errors, i.e. $E[U_i^2 \mid Z_i] = \sigma^2$. Our mean-independence restrictions give rise to a conditional moment restriction, $E[Y_i - T_i^\top\theta \mid W_i] = 0$. By definition of conditional expectations, the conditional moment restriction encodes an infinite set of unconditional moment restrictions in the form of orthogonalities from the prediction errors:
$$\text{For all square integrable } \upsilon:\quad E[\upsilon(W_i)(Y_i - T_i^\top\theta)] = 0.$$
Chamberlain (1987) finds that all relevant statistical information in a conditional moment restriction is in fact contained in a single unconditional moment restriction indexed by an optimal instrument $\Upsilon$, and the unconditional moment restriction with the optimal instrument delivers semiparametrically efficient estimation and inference. (That is, for a conditional moment restriction of the form $E[m(Y, T, \theta) \mid W] = 0$, it is sufficient to consider the unconditional moment restriction $E[\Upsilon(W)\, m(Y, T, \theta)] = 0$, where the optimal instrument takes the form $\Upsilon(W) \propto E[m(Y, T, \theta)^2 \mid W]^{-1} E[\partial m/\partial\theta \mid W]$. Intuitively, the optimal instrument takes a signal-to-noise-ratio form: larger values of $E[\partial m(Y, T, \theta)/\partial\theta \mid W]$ indicate that the moment condition is sensitive to $\theta$ at $W$ and hence represent a notion of signal, while $E[m^2 \mid W]$ represents some notion of noise.) In our case, $\Upsilon(W_i) = E[T_i \mid W_i] = [E[D_i \mid W_i], 1]^\top$. It is then natural to estimate $\Upsilon$ with $\hat\Upsilon$ and form a plug-in estimator for $\theta$:
$$\hat\theta = \Bigg(\frac{1}{N}\sum_{i=1}^N \hat\Upsilon(W_i)\, T_i^\top\Bigg)^{-1} \Bigg(\frac{1}{N}\sum_{i=1}^N \hat\Upsilon(W_i)\, Y_i\Bigg).$$
Note that estimating $\Upsilon$ amounts to predicting $D_i$ with $W_i$, which is well-suited to machine learning techniques. One might worry that the preliminary estimation of $\hat\Upsilon$ complicates asymptotic analysis of $\hat\theta$. Under a simple sample-splitting scheme, however, we state a high-level condition for consistency, normality, and efficiency of $\hat\theta$. Though it simplifies the proof and weakens regularity conditions, sample-splitting does reduce the size of data used to estimate the optimal instrument $\Upsilon$, but such problems can be effectively mitigated by $k$-fold sample-splitting: 20-fold sample-splitting, for instance, limits the loss of data to 5% at the cost of 20 computations that can be effectively parallelized. (With $k$-fold sample-splitting, $S_{-j}$ is the union of all samples other than the $j$-th one.) Such concerns notwithstanding, we focus our exposition on two-fold sample-splitting. Specifically, assume $N = 2n$ for simplicity and let $S_1, S_2 \subset [N]$ be the two subsamples, each with size $n$. For $j \in \{1, 2\}$, let $\hat\Upsilon_j$ be an estimated prediction function predicting $D_i$ with $W_i$, estimated with data from the other sample, $S_{-j}$.
$\hat\Upsilon_j$ may be a neural network or a random forest trained via empirical risk minimization, or a penalized linear regression such as the elastic net or the lasso. The estimated instrument $\hat\Upsilon(W_i)$ is then formed by evaluating $\hat\Upsilon_j$ at $W_i$ for $i \in S_j$. Crucially, the sample-splitting ensures that, conditional on $S_{-j}$, $\hat\Upsilon_j$ may be viewed as nonrandom, and $(Y_i, D_i, \hat\Upsilon(W_i))$ may be viewed as independently distributed over $i$. We term the resulting estimator the machine learning split-sample (MLSS) estimator.

Theorem 2.1 shows that the MLSS estimator is consistent, asymptotically normal, and semiparametrically efficient when the first stage estimation of $E[D \mid W]$ is consistent in $L^2$ norm; the MLSS estimator remains consistent and asymptotically normal when the $L^2$ consistency condition fails, but a set of weaker conditions hold, which govern the limiting behavior of sample means involving $\hat\Upsilon(W_i)$. The $L^2$ consistency condition is not strong—in particular, it is weaker than the $L^2$ consistency at $o(N^{-1/4})$-rate condition commonly used in the double machine learning literature (Chernozhukov et al., 2018), where such conditions are considered mild. (The nuisance parameter $E[D \mid W]$ in this setting enjoys the higher-order orthogonality property described in Mackey et al. (2018). In particular, it is infinite-order orthogonal, thereby requiring no rate condition to work. Intuitively, estimation error in $\Upsilon(\cdot)$ has no effect on the moment condition $E[\Upsilon \cdot (Y - T^\top\theta)] = 0$ holding, and this feature of the problem makes the estimation robust to estimation of $\Upsilon$.)
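To fix ideas, the following is a minimal sketch of the two-fold MLSS procedure in the no-covariate case. The function name, the scikit-learn random forest, and the fold construction are our own illustrative choices under the assumptions above, not the authors' code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mlss_iv(Y, D, W, seed=0):
    """Two-fold machine learning split-sample (MLSS) IV estimator (sketch).

    Y: (N,) outcome; D: (N,) scalar endogenous regressor; W: (N, p) instruments.
    Returns the coefficient on D (a constant is included in the second stage).
    """
    N = len(Y)
    idx = np.random.default_rng(seed).permutation(N)
    folds = [idx[: N // 2], idx[N // 2:]]

    upsilon_hat = np.empty(N)
    for j, S_j in enumerate(folds):
        S_mj = folds[1 - j]
        # First stage: learn E[D | W] on the *other* fold, predict on this fold.
        forest = RandomForestRegressor(n_estimators=500, random_state=seed)
        forest.fit(W[S_mj], D[S_mj])
        upsilon_hat[S_j] = forest.predict(W[S_j])

    # Second stage: use (Upsilon_hat, 1) as instruments for (D, 1),
    # i.e. theta_hat = (sum Upsilon_i T_i')^{-1} (sum Upsilon_i Y_i).
    T = np.column_stack([D, np.ones(N)])
    Z = np.column_stack([upsilon_hat, np.ones(N)])
    theta_hat = np.linalg.solve(Z.T @ T, Z.T @ Y)
    return theta_hat[0]
```

Note that the constructed prediction is used as an instrument for $D$ rather than substituted for it, so the usual just-identified IV formula applies directly.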
Theorem 2.1. Let $\hat\theta_{\mathrm{MLSS}}$ be the machine learning split-sample estimator described above. Under Condition 1 in the appendix, which governs the concentration of various terms, $\hat\theta_{\mathrm{MLSS}}$ is consistent and asymptotically normal for some $G, \Omega$ defined by Condition 1:
$$(\sigma^2 G \Omega G^\top)^{-1/2}\, \sqrt{N}\,(\hat\theta_{\mathrm{MLSS}} - \theta) \rightsquigarrow N(0, I).$$
Moreover, if each entry of $E[D \mid W]$ is square integrable and the machine learning estimator $\hat\Upsilon$ is consistent in $L^2$, i.e.
$$E\big[\|\hat\Upsilon_j(W_i) - \Upsilon(W_i)\|^2 \,\big|\, S_{-j}\big] \xrightarrow{p} 0,$$
then parts 1 and 3 in Condition 1 automatically hold with $G^{-1} = E[\Upsilon(W_i) T_i^\top]$ and $\Omega = E[\Upsilon(W_i)\Upsilon^\top(W_i)]$, and the asymptotic variance $\sigma^2 G \Omega G^\top$ achieves the semiparametric efficiency bound.

We provide some intuition on the proof for Theorem 2.1, where we assume $\dim D = 1$ and $X = \emptyset$ for simplicity, so that $\Upsilon$ outputs a scalar. The estimation error is of the form $\hat G \cdot N^{-1/2}\sum_{i=1}^N \hat\Upsilon(W_i) U_i$, where $\hat G = \big(N^{-1}\sum_{i=1}^N \hat\Upsilon(W_i) T_i^\top\big)^{-1}$. Within each sample, conditioned on the other sample, the terms $\hat\Upsilon(W_i) U_i$ are i.i.d., and we should expect a central limit theorem to hold. Furthermore, suppose the first-stage estimation is $L^2$ consistent, i.e. $E[(\hat\Upsilon(W_i) - \Upsilon(W_i))^2] = o_p(1)$. Note that, under the heuristic implications of the central limit theorem, the distance between the feasible estimator and the oracle estimator with known $\Upsilon(W_i)$ is
$$\frac{1}{\sqrt N}\sum_{i=1}^N \big[\hat\Upsilon(W_i) - \Upsilon(W_i)\big] U_i \;\approx\; N\Bigg(0,\; \sigma^2\,\frac{1}{N}\sum_{i=1}^N \big[\hat\Upsilon(W_i) - \Upsilon(W_i)\big]^2\Bigg).$$
The variance term is approximately the $L^2$ distance between $\hat\Upsilon$ and $\Upsilon$, and thus vanishes by assumption. Thus the asymptotic distribution of the MLSS estimator is the same as the oracle IV estimator when we know the efficient instrument $\Upsilon(W_i)$, and MLSS is semiparametrically efficient.

2.2 Covariates and heteroskedasticity

The presence of covariates $X_i$ complicates the analysis considerably. Under the researcher's model, both $W_i$ and $X_i$ are considered exogenous, and thus we may assume $E[U_i \mid Z_i] = 0$ and use it as a conditional moment restriction, under which the efficient instrument is $\mathrm{Var}(U_i \mid Z_i)^{-1} E[T_i \mid Z_i]$ and our analysis from the previous section continues to apply mutatis mutandis. However, if the researcher maintains a linear specification $Y_i = T_i^\top\theta + U_i$, estimating $\theta$ based on the conditional moment restriction $E[U_i \mid Z_i] = 0$ may inadvertently "identify" $\theta$ through nonlinear behavior in $X_i$ rather than the variation in $W_i$. Such a specification may allow the researcher to precisely estimate $\theta$ even when the instrument $W_i$ is completely irrelevant, when, say, higher-order polynomial terms in the scalar $X_i$, such as $X_i^2$ and $X_i^3$, are strongly correlated with $D_i$, perhaps due to misspecification of the linear moment condition.
There may well be compelling reasons why these nonlinear terms in $X_i$ allow for identification of $\tau$ under an economic or causal model; however, they are likely not the researcher's stated source of identification, and allowing their influence to leak into the estimation procedure undermines the credibility of the statistical exercise.

One idea to resolve such a conundrum is to make the structural function nonparametric as well, and convert the model to a nonparametric instrumental variable regression (Newey and Powell, 2003; Ai and Chen, 2003) (see Appendix B for discussion). Another idea, which we undertake in this paper, is to weaken the moment condition and rule nonlinearities in $X_i$ as inadmissible for inference. The moment condition $E[U_i \mid Z_i] = 0$ is equivalent to
$$\text{For all (square integrable) } \upsilon,\quad E[\upsilon(W_i, X_i)(Y_i - T_i^\top\theta)] = 0,$$
which is too strong since it allows nonlinear transforms of $X$ to be valid instruments. A natural idea is to restrict the class of allowable instruments $\upsilon(W_i, X_i)$ to those that are partially linear in $X_i$, $\upsilon(W_i, X_i) = h(W_i) + X_i^\top\ell$, thereby deliberately discarding information from nonlinear transformations of $X$. Doing so yields the following:
$$\text{For all (square integrable) } \upsilon,\quad E[\upsilon(W_i)(Y_i - T_i^\top\theta)] = E[X_i(Y_i - T_i^\top\theta)] = 0,$$
which is equivalent to the sequential moment restriction
$$E[X_i(Y_i - T_i^\top\theta)] = E[Y_i - T_i^\top\theta \mid W_i] = 0. \qquad (1)$$
We see that (1) is a natural interpolation between the usual unconditional moment condition, $E[(X_i, W_i)^\top \cdot U_i] = 0$, and the conditional moment restriction that may be spurious, $E[U_i \mid X_i, W_i] = 0$, by only allowing nonlinear information in $W_i$ to be used for estimation and inference.

Due to conditioning on different random variables, efficient estimation in sequential moment restrictions is not straightforward. Sequential moment restrictions are analyzed in Chamberlain (1992), and we now review his argument, which motivates the construction of our estimator. Chamberlain (1992)'s characterization of efficient estimation proceeds from the conditional moment restriction with the finest conditioning, and sequentially orthogonalizes coarser moment restrictions against the finer ones, so as to extract the additional, independent information that each moment restriction provides. For each orthogonalized moment condition, the efficient instrument is readily available, and the moment conditions are converted to unconditional ones.

In our setting, let $\Sigma(W_i) = \mathrm{Var}(U_i \mid W_i)$ and let $\Gamma(W) = E[X_i U_i^2 \mid W]\,\Sigma(W)^{-1}$ be the projection of $X_i U_i$ on $U_i$—note that $\Gamma(W)$ takes a familiar $\mathrm{Cov}(\cdot)\,\mathrm{Var}(\cdot)^{-1}$ form. We obtain the orthogonalized moment conditions by subtracting the projection $\Gamma(W_i) U_i$ from the moment condition $X_i U_i$:
$$E[U_i \mid W_i] = 0, \qquad E[X_i U_i - \Gamma(W_i) U_i] = 0.$$
By construction, the two moments are orthogonal when conditioned on $W_i$:
$$E\big[U_i \cdot (X_i U_i - \Gamma(W_i) U_i) \mid W_i\big] = E[X_i U_i^2 \mid W_i] - \Gamma(W_i)\, E[U_i^2 \mid W_i] = 0.$$
The orthogonality means that the moment conditions now provide independent information for $\theta$; therefore, we may simply consider optimal instruments for each moment condition and form a combined moment restriction. The optimal instrument for the conditional moment restriction $E[U_i \mid W_i] = 0$ is $\Sigma(W_i)^{-1} E[T_i \mid W_i]$, and that for the unconditional moment restriction is the constant scaling matrix
$$C := E\big[T_i (X_i - \Gamma(W_i))^\top\big]\; E\big[U_i^2 (X_i - \Gamma(W_i))(X_i - \Gamma(W_i))^\top\big]^{-1}.$$
Combining the two orthogonalized moment conditions with their optimal instruments yields the unconditional moment restriction
$$E[\Upsilon_i (Y_i - T_i^\top\theta)] = 0, \qquad \Upsilon_i := \Sigma(W_i)^{-1} E[T_i \mid W_i] + C\,(X_i - \Gamma(W_i)),$$
whose solution takes the familiar form $\theta = E[\Upsilon_i T_i^\top]^{-1} E[\Upsilon_i Y_i]$. This characterization then motivates a split-sample plug-in procedure as before, though the nuisance parameters are more complex—$C$, $\Sigma$, $\Gamma$, and $E[T_i \mid W_i]$ all require estimation. (In the case where we assume homoskedasticity or choose to forgo efficiency, we only need to estimate $E[T_i \mid W_i]$.) We discuss the details in Appendix A. Analogously, we may state high-level conditions for consistency and normality of $\hat\theta_{\mathrm{MLSS}}$ in Theorem 2.2. Additionally, Lemma D.1 presents some sufficient conditions for Theorem 2.2 as well.
Theorem 2.2.
Let $G = E[\Upsilon_i T_i^\top]^{-1}$ and $\Omega = E[U_i^2 \Upsilon_i \Upsilon_i^\top]$. Let $\hat\theta_{\mathrm{MLSS}}$ be defined as in (2) in Appendix A. Under Condition 2, which governs the $L^2$-consistency of $\hat\Upsilon$ and the concentration of various terms, the estimator is consistent and asymptotically normal:
$$\sqrt{N}\,(\hat\theta_{\mathrm{MLSS}} - \theta) \rightsquigarrow N(0,\, G \Omega G^\top).$$

We now provide some discussion in light of Theorems 2.1 and 2.2.
Estimation of standard errors.
The variance-covariance matrix of $\hat\theta_{\mathrm{MLSS}}$ can be readily estimated by plugging in $\hat\Upsilon$ and $\hat\theta_{\mathrm{MLSS}}$. In principle, under strong identification, inference may be conducted via the bootstrap, but machine learning methods tend to be somewhat computationally expensive and the bootstrap may not be an attractive option for large scale problems.
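A minimal sketch of the plug-in variance calculation for the homoskedastic, no-covariate case, matching the asymptotic variance in Theorem 2.1; the names continue the earlier snippet and are our own illustration (it assumes the full coefficient vector, on $D$ and the constant, is available).

```python
import numpy as np

def mlss_plugin_se(Y, D, upsilon_hat, theta_hat):
    """Plug-in standard errors for the MLSS coefficients (sketch).

    Approximates Avar = sigma^2 * G Omega G' with
    G = (E[Upsilon T'])^{-1} and Omega = E[Upsilon Upsilon'].
    """
    N = len(Y)
    T = np.column_stack([D, np.ones(N)])            # endogenous regressor + constant
    Z = np.column_stack([upsilon_hat, np.ones(N)])  # estimated instrument + constant
    U_hat = Y - T @ theta_hat                       # second-stage residuals

    G = np.linalg.inv(Z.T @ T / N)
    Omega = (Z.T @ Z) / N
    sigma2 = np.mean(U_hat ** 2)
    avar = sigma2 * G @ Omega @ G.T
    return np.sqrt(np.diag(avar / N))               # SEs for (coef on D, constant)
```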
“Forbidden regression.” Nonlinearities in the first stage are often discouraged due to a “forbidden regression,” where the researcher regresses $Y$ on $\hat D$ estimated via nonlinear methods, motivated by a heuristic explanation for TSLS. As Angrist and Krueger (2001) point out, this regression is incorrect, and correct inference follows from using $\hat D$ as an instrument for $D$, as we do, rather than replacing $D$ with $\hat D$—in the TSLS setting, the two estimates are numerically equivalent, but not in general.
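A toy contrast of the two procedures, written against hypothetical inputs $Y$, $D$, and a nonlinear first-stage fit $\hat D$; only the second function corresponds to the estimator analyzed in this paper.

```python
import numpy as np

def ols_on_fitted(Y, D_hat):
    """'Forbidden regression': replace D by its nonlinear fit and run OLS of Y on D_hat."""
    X = np.column_stack([D_hat, np.ones(len(Y))])
    return np.linalg.lstsq(X, Y, rcond=None)[0][0]

def iv_with_fitted(Y, D, D_hat):
    """Correct use: D_hat serves as an instrument for D in a just-identified IV fit."""
    T = np.column_stack([D, np.ones(len(Y))])
    Z = np.column_stack([D_hat, np.ones(len(Y))])
    return np.linalg.solve(Z.T @ T, Z.T @ Y)[0]
```

With a linear first stage fit on the same sample the two coincide numerically; with a nonlinear or cross-fitted first stage they generally do not.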
Connection between first-stage fitting and estimate quality. Additionally, we connect the quality of the first-stage fitting to the quality of the final estimation in a heuristic argument here. Consider the homoskedastic, no-covariate case where $D$ is a scalar and, in a slight abuse of notation, let $\Upsilon = E[D \mid W]$ be the optimal instrument that excludes the constant. IV estimators of $\tau$, in this case, broadly take the form of $\mathrm{Cov}_n(Q, Y)/\mathrm{Cov}_n(Q, D)$ for some constructed instrument $Q = [Q_1, \ldots, Q_N]^\top \in \mathbb{R}^N$, where $\mathrm{Cov}_n$ denotes the sample covariance: the just-identified linear IV estimator corresponds to $Q = W$ and the (infeasible) efficient estimator corresponds to $Q = \Upsilon$. The estimation error then takes the form $\mathrm{Cov}_n(Q, U)/\mathrm{Cov}_n(Q, D)$. The central limit theorem implies that $\mathrm{Cov}_n(Q, U) = O_p\big(N^{-1/2}\sqrt{\mathrm{Var}_n(Q)\,\sigma^2}\big)$. Thus the estimation error is of order
$$\frac{\mathrm{Cov}_n(Q, U)}{\mathrm{Cov}_n(Q, D)} = O_p\Bigg(\frac{\sqrt{\mathrm{Var}_n(Q)}\,\sigma}{\sqrt{N}\,\mathrm{Cov}_n(Q, D)}\Bigg) = O_p\Bigg(\frac{\sigma}{\sqrt{N}\sqrt{\mathrm{Var}_n(D)}}\cdot\frac{1}{R}\Bigg), \qquad R = \mathrm{Corr}_n(Q, D).$$
(This heuristic, of course, should not be taken literally, since it assumes that $E[\mathrm{Cov}_n(Q, U)] = 0$, which can fail if $Q$ is estimated from the data and suffers from overfitting.) The estimation error is thus decreasing in the quality of prediction in the first stage, as measured by $R^2$, which reflects the intuition that a first-stage prediction with better quality should deliver IV estimates that are more precise. The out-of-sample $R^2$, which can be readily computed from a split-sample procedure, therefore offers a useful indicator for quality of estimation. In particular, if one is comfortable with the strengthened identification assumptions, there is little reason not to use the model that achieves the best out-of-sample prediction performance on the split sample. In many settings, this best-performing model would be linear regression, but in others it may not be, and using more complex tools may deliver benefits.

Moreover, much of the discussion on using machine learning for instrumental variables analysis has focused on selecting instruments (Belloni et al., 2012; Angrist and Frandsen, 2019) assuming some level of sparsity, motivated by statistical difficulties encountered when the number of instruments is high. In light of the heuristic above, a more precise framing is perhaps combining instruments to maximize first-stage prediction quality rather than selecting among them.
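A sketch of the corresponding held-out diagnostic: compare out-of-sample first-stage $R^2$ across candidate learners before committing to one. The candidate models and the scikit-learn API usage here are our own illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def first_stage_holdout_r2(D, W, models=None, holdout_frac=0.5, seed=0):
    """Out-of-sample R^2 of candidate first-stage learners for E[D | W].

    A negative value means the learner predicts worse than a constant.
    """
    if models is None:
        models = {"linear": LinearRegression(),
                  "random forest": RandomForestRegressor(n_estimators=500, random_state=seed)}
    N = len(D)
    idx = np.random.default_rng(seed).permutation(N)
    n_train = int((1 - holdout_frac) * N)
    train, test = idx[:n_train], idx[n_train:]
    return {name: r2_score(D[test], m.fit(W[train], D[train]).predict(W[test]))
            for name, m in models.items()}
```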
Weak IV robust inference. A major practical motivation for our work, following Bai and Ng (2010), is to use machine learning to rescue otherwise weak instruments due to a lack of linear correlation; nonetheless, the instrument may be irredeemably weak, and providing weak-instrument robust inference is important in practice. Identification-robust inference in this nonlinear framework is formally considered in Antoine and Lavergne (2019)—we provide a simple construction here which is valid but not necessarily optimal. The independence induced by sample-splitting readily allows for weak-instrument robust inference, a point that Staiger and Stock (1994) briefly touch upon and that is used in recent work such as Mikusheva and Sun (2020). On each subsample $S_j$, conditional on the subsample $S_{-j}$, the triplets $(\hat\Upsilon_i, T_i, Y_i)$ are independently distributed. Thus, conditioning on $S_{-j}$, weak-instrument robust procedures, such as the Anderson–Rubin test (Anderson et al., 1949) and the Kleibergen (2002) Lagrange multiplier test, continue to have correct coverage. The tests over each subsample $S_j$ can be combined via a simple Bonferroni correction. (More precisely, for each $\tilde\theta$ value on a grid, we test $H_0: \theta = \tilde\theta$ by Bonferroni-correcting the Anderson–Rubin or Kleibergen LM tests over the subsamples $S_j$. A confidence interval is formed by collecting the $\tilde\theta$'s that are not rejected.) Moreover, since the weak IV robust procedures take instruments, covariates, endogenous variables, and response variables as input, robust inference does not require possibly expensive re-fitting of the machine learning procedures. Similarly, weak identification robust inference applies out-of-the-box in the more complex setting with covariates below as well.
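One way to implement that grid-and-Bonferroni construction for the just-identified scalar case with the estimated instrument is sketched below; the robust-variance form of the AR statistic and the function names are our own illustrative choices, not a prescription from the paper.

```python
import numpy as np
from scipy import stats

def ar_stat(Y, D, ups, tau):
    """Anderson-Rubin statistic for H0: tau = tau0, just-identified scalar case.

    Regress Y - D*tau on (1, ups) and test that the coefficient on ups is zero;
    under H0 the Wald form of this statistic is asymptotically chi^2(1).
    """
    resid = Y - D * tau
    Z = np.column_stack([np.ones(len(Y)), ups])
    beta, *_ = np.linalg.lstsq(Z, resid, rcond=None)
    e = resid - Z @ beta
    bread = np.linalg.inv(Z.T @ Z)
    meat = Z.T @ (Z * (e ** 2)[:, None])      # heteroskedasticity-robust middle term
    V = bread @ meat @ bread
    return beta[1] ** 2 / V[1, 1]

def split_sample_ar_ci(Y, D, ups, folds, grid, alpha=0.05):
    """Bonferroni-combined AR confidence set: keep tau if no fold rejects at alpha/#folds."""
    crit = stats.chi2.ppf(1 - alpha / len(folds), df=1)
    return [tau for tau in grid
            if all(ar_stat(Y[S], D[S], ups[S], tau) <= crit for S in folds)]
```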
(When) is machine learning useful? We conclude this section by discussing our work relative to Angrist and Frandsen (2019), who note that using lasso and random forest methods in the first stage does not seem to provide large performance benefits in practice, on a simulation design based on the data of Angrist and Krueger (1991). We note that, per our discussion above on the connection between first-stage fitting and estimate quality, a good heuristic summary for the estimation precision is the $R^2$ between the fitted instrument and the true optimal instrument—$E[D \mid W]$ in the homoskedastic case. It is quite possible that in some settings, the conditional expectation $E[D \mid W]$ is estimated well with linear regression, and lasso or random forests do not provide large benefits in terms of out-of-sample prediction quality. Since Angrist and Krueger (1991)'s instruments are quarter-of-birth interactions and are hence binary, it is in fact likely that predicting $D$ with linear regression performs well relative to nonlinear or complex methods in that setting. Whether or not machine learning methods work well relative to linear methods is something that the researcher may verify in practice, via evaluating performance on a hold-out set, which is standard machine learning practice but is not yet widely adopted in empirical economics. Indeed, we observe that in both real (Section 3) and Monte Carlo (Appendix C) settings where the out-of-sample prediction quality of more complex machine learning methods outperforms linear regression, MLSS estimators perform better than TSLS.

We consider an empirical application in the criminal justice setting of Ash et al. (2019), where we consider the causal effect of appellate court decisions at the U.S. circuit court level on lengths of criminal sentences at the U.S. district courts under jurisdiction of the circuit court. Ash et al. (2019) exploit the fact that appellate judges are randomly assigned, and use the characteristics of appellate judges—including age, party affiliation, education, and career backgrounds—as instrumental variables. In criminal justice cases, plaintiffs rarely appeal, as it would involve trying the defendant twice for the same offense—generally not permitted in the United States; therefore, an appellate court reversal would typically be in favor of defendants, and we may posit a causal channel in which such reversals affect sentencing; for instance, the district court may be more lenient as a result of a reversal, as would be naturally predicted if the reversal sets up a precedent in favor of the defendant. To connect the setup with our notation, the outcome variable $Y$ is the change in sentencing length before and after an appellate decision, measured in months, where larger values of $Y$ indicate that sentences become longer after the appellate court decision. The endogenous treatment variable $D$ is whether or not an appellate decision reverses a district court ruling. The instrument $W$ is the characteristics of the randomly assigned circuit judge presiding over the appellate case in question, and the covariates $X$ contain textual features from the circuit case, represented by Doc2Vec embeddings (Le and Mikolov, 2014). We present our results in Table 1, comparing an MLSS estimator to a linear IV estimator, with or without sample-splitting.
The first three columns of Table 1 display the point estimates, standard errors, and Wald (1.96-s.e.) confidence intervals. The point estimates of the MLSS estimators fall in the range -1.7 to -0.6, suggesting a small effect of appellate court reversals—typically in favor of an appealing defendant—on district court leniency. The Wald confidence intervals are in the range of -3 to 0.8, which fail to reject the null hypothesis of zero effect. These results do not vary wildly across sample splits and are robust to the inclusion of covariates. In the fourth column, we display our preferred inference method, the identification-robust Anderson–Rubin confidence interval. For MLSS estimators, the AR interval mostly agrees with the Wald interval, suggesting that the estimated instrument $\hat\Upsilon$ is in fact strong, despite the fact that judge characteristics are a relatively weak predictor of appellate court decisions.

Linear IV-based methods, however, suffer considerably from weak identification. Standard errors are significantly larger for linear IV estimates, and the Wald confidence intervals are dramatically different from the Anderson–Rubin confidence intervals. Indeed, the Wald intervals are often significantly shorter than the AR intervals, which can sometimes even be $(-\infty, \infty)$, suggesting that the instruments are uninformative of the treatment. In this case, Wald confidence intervals for linear IV estimators lead to overall imprecise inference and are sometimes quite misleading—Wald intervals for the non-split-sample linear IV estimator are quite spuriously narrow, masking identification issues.

Table 1: Estimates and confidence intervals of the treatment effect of appeals decision reversal on criminal sentence length (months)

Split-sample ID | Estimate | SE | Wald | Anderson–Rubin | R² | Estimator
0 | -1.71 | 0.70 | [-3.08, -0.35] | [-3.04, -0.23] | 0.0199 | MLSS (Random Forest)
0 | 5.35 | 5.17 | [-4.78, 15.47] | (-∞, ∞) | -0.0137 | Split-sample linear IV
1 | -0.65 | 0.61 | [-1.85, 0.55] | [-2.34, 0.80] | 0.0261 | MLSS (Random Forest)
1 | 1.14 | 2.74 | [-4.24, 6.52] | [-11.04, 11.58] | -0.0078 | Split-sample linear IV
2 | -0.91 | 0.67 | [-2.22, 0.40] | [-3.06, 0.98] | 0.0250 | MLSS (Random Forest)
2 | 2.31 | 4.10 | [-5.73, 10.35] | (-∞, ∞) | -0.0114 | Split-sample linear IV
3 | -1.26 | 0.65 | [-2.54, 0.02] | [-2.64, -0.09] | 0.0250 | MLSS (Random Forest)
3 | 5.74 | 5.17 | [-4.39, 15.87] | (-∞, -33.49] ∪ [-8.54, ∞) | -0.0132 | Split-sample linear IV
4 | -1.06 | 0.65 | [-2.34, 0.22] | [-2.96, 0.76] | 0.0263 | MLSS (Random Forest)
4 | -3.24 | 5.51 | [-14.05, 7.56] | [-10.88, 79.99] | -0.1174 | Split-sample linear IV
0 | -1.76 | 0.82 | [-3.37, -0.16] | [-3.04, -0.33] |  | MLSS with covariates
1 | -0.68 | 0.75 | [-2.14, 0.78] | [-2.48, 0.89] |  | MLSS with covariates
2 | -0.93 | 0.77 | [-2.44, 0.59] | [-2.95, 1.05] |  | MLSS with covariates
3 | -1.31 | 0.77 | [-2.82, 0.21] | [-2.71, 0.01] |  | MLSS with covariates
4 | -1.09 | 0.76 | [-2.57, 0.39] | [-3.12, 0.99] |  | MLSS with covariates
Full-sample | 0.83 | 1.06 | [-1.25, 2.91] | [-1.00, 14.06] | — | Linear IV

Notes: Parameter estimates and confidence intervals over five random sample splits shown. For each split sample, we further split the training sample to form a validation sample 10% of its size for tuning hyperparameters. Hyperparameters of the random forest are chosen to minimize validation error on this subsample. All Anderson–Rubin intervals for split-sample designs are applied to the estimated instrument, and thus are just-identified. Empirically, the instrument does not seem to predict the covariates very well, and so we make the convenient assumption that $E[X \mid Z]$ is constant in implementing the estimator with covariates; the MLSS estimator with covariates under identity weighting is then simply the TSLS estimator with the MLSS-estimated instrument $\hat\Upsilon$ and the covariates.

In this paper, we provide a simple and user-friendly analysis of incorporating machine learning techniques into instrumental variable analysis in a manner that is familiar to applied researchers.
In particular, we document via elementary techniques that a split-sample IV estimator with machine learning methods as the first stage inherits classical asymptotic and optimality properties of the usual instrumental regression, requiring only weak conditions governing the consistency of the first-stage prediction. In the presence of covariates, we also formalize moment conditions for instrumental variable estimation that draw nonlinear identifying variation from, and only from, the excluded instruments.

References
Ai, C. and Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6), 1795–1843.

Ai, C. and Chen, X. (2007). Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables. Journal of Econometrics, 141(1), 5–43.

Ai, C. and Chen, X. (2012). The semiparametric efficiency bound for models of sequential moment restrictions containing unknown functions. Journal of Econometrics, 170(2), 442–457.

Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. The Annals of Mathematical Statistics, 20(1), 46–63.

Andrews, I., Stock, J. H. and Sun, L. (2019). Weak instruments in IV regression: Theory and practice. Annual Review of Economics, 11, 727–753.

Angrist, J. and Frandsen, B. (2019). Machine labor. Tech. rep., National Bureau of Economic Research.

Angrist, J. D., Imbens, G. W. and Krueger, A. B. (1999). Jackknife instrumental variables estimation. Journal of Applied Econometrics, 14(1), 57–67.

Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics, 106(4), 979–1014.

Angrist, J. D. and Krueger, A. B. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4), 69–85.

Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Antoine, B. and Lavergne, P. (2019). Identification-robust nonparametric inference in a linear IV model.

Ash, E., Chen, D., Zhang, X., Huang, Z. and Wang, R. (2019). Deep IV in law: Analysis of appellate impacts on sentencing using high-dimensional instrumental variables. In Advances in Neural Information Processing Systems (Causal ML Workshop).

Bai, J. and Ng, S. (2010). Instrumental variable estimation in a data rich environment. Econometric Theory, 26, 1577–1606.

Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6), 2369–2429.

Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3), 305–334.

Chamberlain, G. (1992). Comment: Sequential moment restrictions in panel data. Journal of Business & Economic Statistics, 10(1), 20–26.

Chen, J., Chen, X., Liao, Y. and Tamer, E. (2020a). Deep Learning Inference on Semi-Parametric Models with Weakly Dependent Data. Tech. rep.

Chen, J., Chen, X. and Tamer, E. (2020b). Efficient Estimation of Expectation Functionals of Nonparametric Conditional Moments via Neural Networks. Tech. rep.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

Currie, J., Kleven, H. and Zwiers, E. (2020). Technology and big data are changing economics: Mining text to track methods. AEA Papers and Proceedings, 110, 42–48.

Dieterle, S. G. and Snell, A. (2016). A simple diagnostic to investigate instrument validity and heterogeneous effects when using a single instrument. Labour Economics, 42, 76–86.

Dikkala, N., Lewis, G., Mackey, L. and Syrgkanis, V. (2020). Minimax estimation of conditional moment models. arXiv preprint arXiv:2006.07201.

Escanciano, J. C. and Li, W. (2020). Optimal linear instrumental variables approximations. Journal of Econometrics.

Hansen, C. and Kozbur, D. (2014). Instrumental variables estimation with many weak instruments using regularized JIVE. Journal of Econometrics, 182(2), 290–308.

Hartford, J., Lewis, G., Leyton-Brown, K. and Taddy, M. (2017). Deep IV: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pp. 1414–1423.

Horowitz, J. L. and Lee, S. (2007). Nonparametric instrumental variables estimation of a quantile regression model. Econometrica, 75(4), 1191–1208.

Kleibergen, F. (2002). Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica, 70(5), 1781–1803.

Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196.

Liu, R., Shang, Z. and Cheng, G. (2020). On deep instrumental variables estimate.

Mackey, L., Syrgkanis, V. and Zadik, I. (2018). Orthogonal machine learning: Power and limitations. In International Conference on Machine Learning, PMLR, pp. 3375–3383.

Mikusheva, A. and Sun, L. (2020). Inference with many weak instruments. arXiv preprint arXiv:2004.12445.

Newey, W. K. and Powell, J. L. (2003). Instrumental variable estimation of nonparametric models. Econometrica, 71(5), 1565–1578.

Severini, T. A. and Tripathi, G. (2012). Efficiency bounds for estimating linear functionals of nonparametric regression models with endogenous regressors. Journal of Econometrics, 170(2), 491–498.

Staiger, D. and Stock, J. H. (1994). Instrumental variables regression with weak instruments. Tech. rep., National Bureau of Economic Research.

A Definition of the MLSS estimator in the presence of covariates
As before, we split the sample randomly in half and let $S_1, S_2$ denote the two halves. For efficiency purposes, we would need to estimate the heteroskedasticity $\Sigma(W_i) = \mathrm{Var}(U_i \mid W_i)$, which is unnecessary if we are not concerned with efficiency or we are willing to assume homoskedasticity. We refer to estimating $\Sigma(W_i)$ as efficiency weighting, and we refer to simply plugging in $\Sigma(W_i) = 1$ as identity weighting. Since the heteroskedasticity $\Sigma(W_i)$ depends on $\theta$, we need preliminary estimates $\hat\theta$, which we may obtain via TSLS or identity weighting. Let $\hat\theta_j$ be a preliminary, consistent estimate for $\theta$ using data only from $S_{-j}$. Let $\hat f_j$ be an estimate for $E[D_i \mid W_i]$ trained using data on $S_{-j}$ and, similarly, let $\hat g_j$ be an estimate for $E[R_i \mid W_i]$, trained on $i \in S_{-j}$, where
$$R_i = \begin{cases} X_i & \text{under identity weighting} \\ X_i (Y_i - T_i^\top \hat\theta_j)^2 & \text{under efficient weighting.} \end{cases}$$
Under efficiency weighting, let $\hat\Sigma_j(W_i)$ be an estimate of the heteroskedasticity, approximating $E[(Y_i - T_i^\top\hat\theta_j)^2 \mid W_i]$ and trained on $i \in S_{-j}$, and let $\hat\Sigma_j(W_i) = 1$ under identity weighting. Lastly, let $\hat U_i = Y_i - T_i^\top\hat\theta_j$ and let
$$\hat C_j = \Bigg(\frac{1}{n}\sum_{i \in S_j} T_i \bigg(X_i - \frac{\hat g_j(W_i)}{\hat\Sigma_j(W_i)}\bigg)^{\!\top}\Bigg)\Bigg(\frac{1}{n}\sum_{i \in S_j} \bigg(X_i - \frac{\hat g_j(W_i)}{\hat\Sigma_j(W_i)}\bigg)\bigg(X_i - \frac{\hat g_j(W_i)}{\hat\Sigma_j(W_i)}\bigg)^{\!\top} \hat U_i^2\Bigg)^{-1}$$
be a plug-in estimate of $C$. Note that under homoskedasticity $E[U^2 \mid X, W] = \sigma^2$, we have the much simpler form
$$\hat C_j = \Bigg(\frac{1}{n}\sum_{i \in S_j} T_i (X_i - \hat g_j(W_i))^\top\Bigg)\Bigg(\frac{1}{n}\sum_{i \in S_j} (X_i - \hat g_j(W_i))(X_i - \hat g_j(W_i))^\top\Bigg)^{-1}, \qquad \hat g_j = \hat E[X_i \mid W_i].$$
Finally, for $i \in S_j$, we may form an estimate of the optimal instrument and the MLSS estimator:
$$\hat\Upsilon_i := \frac{\hat f_j(W_i)}{\hat\Sigma_j(W_i)} + \hat C_j\bigg(X_i - \frac{\hat g_j(W_i)}{\hat\Sigma_j(W_i)}\bigg), \qquad \hat\theta_{\mathrm{MLSS}} := \Bigg(\frac{1}{N}\sum_{i=1}^N \hat\Upsilon_i T_i^\top\Bigg)^{-1}\Bigg(\frac{1}{N}\sum_{i=1}^N \hat\Upsilon_i Y_i\Bigg). \qquad (2)$$
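As a concrete special case, under identity weighting and treating $\hat E[X \mid W]$ as constant (the simplification the paper adopts in the empirical application), the estimator reduces to a just-identified IV fit using the cross-fitted instrument alongside the covariates. A minimal sketch, with our own illustrative names and a random forest first stage:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mlss_with_covariates(Y, D, X, W, seed=0):
    """MLSS with linearly included covariates, identity weighting, E[X | W] treated
    as constant: instruments (Upsilon_hat, X, 1) for regressors (D, X, 1).
    Returns the full coefficient vector theta_hat (coef on D first)."""
    N = len(Y)
    idx = np.random.default_rng(seed).permutation(N)
    folds = [idx[: N // 2], idx[N // 2:]]

    ups = np.empty(N)
    for j, S_j in enumerate(folds):
        S_mj = folds[1 - j]
        model = RandomForestRegressor(n_estimators=500, random_state=seed)
        model.fit(W[S_mj], D[S_mj])           # learn E[D | W] on the other fold
        ups[S_j] = model.predict(W[S_j])

    T = np.column_stack([D, X, np.ones(N)])    # second-stage regressors
    Z = np.column_stack([ups, X, np.ones(N)])  # technical instruments
    return np.linalg.solve(Z.T @ T, Z.T @ Y)   # just-identified IV: (Z'T)^{-1} Z'Y
```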
B Discussion related to NPIV

A principled modeling approach is the NPIV model, which treats the unknown structural function $g$ as an infinite dimensional parameter and considers the model
$$E[Y - g(T) \mid Z] = 0. \qquad \text{(NPIV)}$$
The researcher may be interested in $g$ itself, or some functionals of $g$, such as the average derivative $\theta = E\big[\tfrac{\partial g}{\partial D}(T)\big]$ or the best linear approximation $\beta = E[TT^\top]^{-1}E[Tg(T)]$. One might wonder whether choosing a parametric functional form in place of $g(T)$ is without loss of generality. Linear regression of $Y$ on $T$, for instance, yields the best linear approximation to the structural function $E[Y \mid T]$, and thus has an attractive nonparametric interpretation; it may be tempting to ask whether an analogous property holds for IV regression. If an analogous property does hold, we may have license to be more blasé about linearity in the second stage.

Unfortunately, modeling $g$ as linear does not produce the best linear approximation, at least not with respect to the $L^2$-norm. Escanciano and Li (2020) show that the best linear approximation can be written as a particular IV regression estimand $\beta = E[h(Z)T^\top]^{-1}E[h(Z)Y]$, where $h$ has the property that $E[h(Z) \mid T] = T$. Note that with the efficient instrument in a homoskedastic, no-covariate linear IV context as we consider in Section 2.1, the optimal instrument is $\hat D(W) = E[D \mid W]$. A sufficient condition, under which the IV estimand based on the optimal instrument is equal to the best linear approximation $\beta$, is the somewhat strange condition that the projection onto $D$ of predicted $D$ is linear in $D$ itself: for some invertible $A$, $E[\hat D(W) \mid D] = AD$.
The condition would hold, for instance, in a setting where
$D, W$ are jointly Gaussian and all conditional expectations are linear, but it is difficult to think it holds in general. As such, linear IV would not recover the best linear approximation to the nonlinear structural function in general.

A simple calculation can nevertheless characterize the bias of the linear approach if we take the estimand to be the best linear approximation to the structural function. Suppose we form an instrumental variable estimator that converges to an estimand of the form $\gamma = E[f(Z)T^\top]^{-1}E[f(Z)Y]$. It is easy to see that
$$\gamma - \beta = \big\langle g - E^*[g \mid T],\; \mu - E^*[\mu \mid T]\big\rangle,$$
where $\langle A, B\rangle = E[AB]$, $\mu(T) = E[f(Z) \mid T]$, and $E^*[A \mid B]$ is the best linear projection of $A$ onto $B$. This means that the two estimands are identical if and only if at least one of $\mu(\cdot)$ or $g(\cdot)$ is linear, and all else equal the bias is smaller if $\mu$ or $g$ is more linear. Importantly, $\mu - E^*[\mu \mid T]$ are objects that we could empirically estimate since they are conditional means, and in practice the researcher may estimate $\mu - E^*[\mu \mid T]$, which delivers bounds on $\gamma - \beta$ through assumptions on the linearity of $g$.

C Monte Carlo example
We report in Table 2 the results of a Monte Carlo experiment corresponding to the simple case without covariates. We let $N = 500$, $W_i \sim N(0, I_p)$ for $p = 5$, and let $D = f(W) + V$, where $f(\cdot)$ is linear in $W$ and in element-wise squares and cubes of $W$. We let $Y = D\tau + U$ for $U = \rho V + \sqrt{1 - \rho^2}\, S$, where the endogeneity is controlled by $\rho$ and $S, V \sim N(0, 1)$ independently.

We consider the performance of the traditional TSLS estimator, an MLSS estimator (where nuisance estimation is performed via a two-layer ReLU feedforward network), and an oracle estimator (where the efficient instrument $\Upsilon(W) = E[D \mid W]$ is known) in two settings. In one setting ("quadratic only"), $f$ is symmetric about zero (only containing quadratic terms in $W$), and thus $W$ is a weak instrument from the linear IV perspective. In the other setting, $W$ is a strong instrument for linear IV as well.

Unsurprisingly, MLSS performs significantly better in terms of estimation quality compared to linear TSLS when the instrument is linearly irrelevant. In the other setting, when the prediction performance of MLSS and TSLS are similar, we see that the performance in terms of coverage and RMSE is similar as well. However, this only indicates the performance of a specific machine learning estimator with minimal tuning; the first stage $R^2$ of the oracle estimator indicates that neither the linear regression nor the prediction function chosen by the MLSS estimator is close to the true conditional expectation function, and we should expect better machine learning methods, in terms of out-of-sample prediction quality, to do better in terms of RMSE of the structural parameter.

We see that Wald intervals for all three estimators cover at approximately the nominal level. The oracle estimator seems to confirm that we should expect mild coverage distortions in finite samples, relative to the 95% nominal level, as the oracle estimator also undercovers by 2–3%.
Table 2: Monte Carlo experiment results

Quadratic only | Estimator | RMSE | Coverage | R²
True | MLSS | 0.0192 | 0.9150 | 0.7555
True | TSLS | 0.1422 | 0.9700 | -0.1625
True | Oracle | 0.0168 | 0.9250 | 0.8897
False | MLSS | 0.0041 | 0.9400 | 0.4777
False | TSLS | 0.0038 | 0.9400 | 0.4732
False | Oracle | 0.0030 | 0.9200 | 0.9959
Notes: Summary statistics across 200 experiments shown.
RMSE refers to the root mean-square error of the parameter of interest. Coverage refers to the coverage of a nominal 95% Wald interval. $R^2$ refers to the statistic
$$R^2 \equiv 1 - \frac{\text{Mean Squared Prediction Error}}{\mathrm{Var}(Y)}$$
calculated out-of-sample, which can be negative. Quadratic only refers to whether $E[D \mid W]$ has only quadratic terms in $W$—if so, TSLS would be weak, since the best linear approximation to $E[D \mid W]$ when $W$ is symmetric about zero is a constant function. This is a fortunate fact for the TSLS estimator in the quadratic-only scenario, where the coverage happens to be above nominal despite the nonidentification.
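For concreteness, a sketch of a data-generating process matching the description above. The exact coefficients of $f$, the value of $\rho$, and the value of $\tau$ used in the paper are not recoverable from this text, so the choices below are placeholders.

```python
import numpy as np

def simulate(N=500, p=5, tau=1.0, rho=0.5, quadratic_only=True, seed=0):
    """Simulate (Y, D, W) with D = f(W) + V, Y = D*tau + U, U = rho*V + sqrt(1-rho^2)*S.

    f is built from element-wise squares (and, unless quadratic_only, linear and
    cubic terms) of W; the coefficients, rho, and tau are placeholder values.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, p))
    V = rng.standard_normal(N)
    S = rng.standard_normal(N)
    f = (W ** 2).sum(axis=1)                           # quadratic terms
    if not quadratic_only:
        f = f + W.sum(axis=1) + (W ** 3).sum(axis=1)   # add linear and cubic terms
    D = f + V
    U = rho * V + np.sqrt(1 - rho ** 2) * S
    Y = D * tau + U
    return Y, D, W
```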
D Omitted Proofs and Regularity Conditions

Condition 1.
1. (Strong identification) For $j \in \{1, 2\}$, $\frac{1}{n}\sum_{i \in S_j} \hat\Upsilon_j(W_i)\, T_i^\top \xrightarrow{p} G^{-1}$ for some nonsingular $G^{-1}$.

2. (Lyapunov condition) $E|U_i|^{2+\epsilon} < \infty$ for some $\epsilon > 0$. Furthermore, none of the weights $\hat\Upsilon_j$ dominates the others:
$$\frac{\max_{i \in S_j} \|\hat\Upsilon_j(W_i)\|^2}{\sum_{i \in S_j} \|\hat\Upsilon_j(W_i)\|^2} \xrightarrow{p} 0.$$

3. $\frac{1}{n}\sum_{i \in S_j} \hat\Upsilon_j(W_i)\,\hat\Upsilon_j^\top(W_i) \xrightarrow{p} \Omega$ for some $\Omega$, and there exist $Z_1, Z_2$ such that $Z_j = \frac{1}{\sqrt n}\sum_{i \in S_j} \hat\Upsilon_j(W_i) U_i + o_p(1)$ and $\mathrm{Cov}(Z_1, Z_2) \to 0$.

Condition 2.

1. $\hat C_j \xrightarrow{p} C$ and certain empirical moments are bounded:
$$\frac{1}{n}\sum_{i \in S_j} \bigg\|X_i - \frac{\hat g_j(W_i)}{\hat\Sigma_j(W_i)}\bigg\|^2 = O_p(1), \qquad \frac{1}{n}\sum_{i \in S_j} U_i^2 = O_p(1), \qquad \frac{1}{n}\sum_{i \in S_j} \|T_i\|^2 = O_p(1).$$
2. Let $\tilde\Upsilon_i$ be $\hat\Upsilon_i$ but with $C$ replacing $\hat C_j$. The constructed instrument $\tilde\Upsilon_i$ is consistent in $L^2$ to its population counterpart: for all $j \in \{1, 2\}$,
$$E_{i \in S_j}\big[\|\tilde\Upsilon_i - \Upsilon_i\|^2 \,\big|\, S_{-j}\big] \xrightarrow{p} 0,$$
3. (Strong identification) $E\|\Upsilon_i T_i^\top\| < \infty$ and $E[\Upsilon_i T_i^\top]$ is not singular, and

4. (Lyapunov CLT) for some $\epsilon > 0$, $E[\|\Upsilon_i U_i\|^{2+\epsilon}] < \infty$.

Proof of Theorem 2.1.
Observe that, since $Y_i = T_i^\top\theta + U_i$,
$$\sqrt{N}\big(\hat\theta_{\mathrm{MLSS}} - \theta\big) = \Bigg(\frac{1}{N}\sum_{i=1}^N \hat\Upsilon_i T_i^\top\Bigg)^{-1} \frac{1}{\sqrt N}\sum_{i=1}^N \hat\Upsilon_i U_i.$$
By conditions 2 and 3, $\tilde Z_j := \frac{1}{\sqrt n}\sum_{i \in S_j} \hat\Upsilon_j(W_i) U_i \rightsquigarrow N(0, \sigma^2\Omega)$ jointly for $j = 1, 2$. The asymptotic covariance is zero by condition 3. Thus $\frac{1}{\sqrt N}\sum_{i=1}^N \hat\Upsilon(W_i) U_i = \frac{\tilde Z_1 + \tilde Z_2}{\sqrt 2} \rightsquigarrow N(0, \sigma^2\Omega)$. The first part of the claim then follows from condition 1.

The second part of the claim follows if we show
$$\frac{1}{N}\sum_{i=1}^N \hat\Upsilon_i T_i^\top = \frac{1}{N}\sum_{i=1}^N \Upsilon_i T_i^\top + o_p(1), \qquad \frac{1}{\sqrt N}\sum_{i=1}^N \hat\Upsilon_i U_i = \frac{1}{\sqrt N}\sum_{i=1}^N \Upsilon_i U_i + o_p(1).$$
This argument is carried out in the proof of Theorem 2.2, where the condition needed is exactly the $L^2$ consistency condition.

Proof of Theorem 2.2.
Observe that, since $Y_i = T_i^\top\theta + U_i$, $\sqrt{N}\big(\hat\theta_{\mathrm{MLSS}} - \theta\big) = \big(\frac{1}{N}\sum_{i=1}^N \hat\Upsilon_i T_i^\top\big)^{-1} \frac{1}{\sqrt N}\sum_{i=1}^N \hat\Upsilon_i U_i$. We proceed to analyze the right-hand side with elementary techniques. Note that if we show
$$\frac{1}{N}\sum_{i=1}^N \hat\Upsilon_i T_i^\top = \frac{1}{N}\sum_{i=1}^N \Upsilon_i T_i^\top + o_p(1), \qquad (3)$$
$$\frac{1}{\sqrt N}\sum_{i=1}^N \hat\Upsilon_i U_i = \frac{1}{\sqrt N}\sum_{i=1}^N \Upsilon_i U_i + o_p(1), \qquad (4)$$
then $\sqrt{N}(\hat\theta_{\mathrm{MLSS}} - \theta) = \big(\frac{1}{N}\sum_{i=1}^N \Upsilon_i T_i^\top\big)^{-1}\frac{1}{\sqrt N}\sum_{i=1}^N \Upsilon_i U_i + o_p(1)$ under strong identification (Condition 3). By a law of large numbers (Condition 3), $\big(\frac{1}{N}\sum_{i=1}^N \Upsilon_i T_i^\top\big)^{-1} \xrightarrow{p} G$. By a central limit theorem (Condition 4), $\frac{1}{\sqrt N}\sum_{i=1}^N \Upsilon_i U_i \rightsquigarrow N(0, \Omega)$, and thus we have the desired asymptotic normality $\sqrt{N}(\hat\theta_{\mathrm{MLSS}} - \theta) \rightsquigarrow N(0, G\Omega G^\top)$.

Note that it suffices to check (3) and (4) for each subsample, and by symmetry it suffices to check them for $S_1$:
$$\frac{1}{n}\sum_{i \in S_1} \hat\Upsilon_i T_i^\top = \frac{1}{n}\sum_{i \in S_1} \Upsilon_i T_i^\top + o_p(1), \qquad \frac{1}{\sqrt n}\sum_{i \in S_1} \hat\Upsilon_i U_i = \frac{1}{\sqrt n}\sum_{i \in S_1} \Upsilon_i U_i + o_p(1).$$
We first check that we may replace $\hat\Upsilon_i$ on the left-hand side with $\tilde\Upsilon_i$ by checking that
$$\bigg\|\frac{1}{n}(\hat C_j - C)\sum_{i \in S_j}\bigg(X_i - \frac{\hat g_j(W_i)}{\hat\Sigma_j(W_i)}\bigg) T_i^\top\bigg\|_F = o_p(1), \qquad \bigg\|\frac{1}{\sqrt n}(\hat C_j - C)\sum_{i \in S_j}\bigg(X_i - \frac{\hat g_j(W_i)}{\hat\Sigma_j(W_i)}\bigg) U_i\bigg\| = o_p(1),$$
which immediately follows from Condition 1.

Next, we relate both quantities to $E\big[\|\Upsilon_i - \tilde\Upsilon_i\|^2\big]$, where the expectation operator $E$ is taken conditional on $S_2$, and so $\hat f, \hat g$ may be viewed as fixed:
$$E\bigg\|\frac{1}{n}\sum_{i \in S_1}(\tilde\Upsilon_i - \Upsilon_i) T_i^\top\bigg\|_F \le E\bigg[\frac{1}{n}\sum_{i \in S_1}\big\|\tilde\Upsilon_i - \Upsilon_i\big\|\,\|T_i\|\bigg] \quad (\|AB\|_F \le \|A\|_F\|B\|_F)$$
$$\le \sqrt{E\big[\|\Upsilon_i - \tilde\Upsilon_i\|^2\big]\, E\big[\|T_i\|^2\big]} \quad \text{(Cauchy–Schwarz)} \;=\; O\Big(\sqrt{E\big[\|\Upsilon_i - \tilde\Upsilon_i\|^2\big]}\Big),$$
$$E\bigg\|\frac{1}{\sqrt n}\sum_{i \in S_1}(\tilde\Upsilon_i - \Upsilon_i) U_i\bigg\|^2 = \mathrm{tr}\Bigg\{\mathrm{Var}\Bigg(\frac{1}{\sqrt n}\sum_{i \in S_1}(\tilde\Upsilon_i - \Upsilon_i) U_i\Bigg)\Bigg\} = \mathrm{tr}\Big\{\mathrm{Var}\big((\tilde\Upsilon_i - \Upsilon_i) U_i\big)\Big\}$$
$$= \mathrm{tr}\Big\{E\big[(\tilde\Upsilon_i - \Upsilon_i)(\tilde\Upsilon_i - \Upsilon_i)^\top\big]\, E\big[U_i^2\big]\Big\} \quad \text{(Independence)} \;=\; O\Big(E\big[\|\tilde\Upsilon_i - \Upsilon_i\|^2\big]\Big),$$
where $\|A\|_F$ denotes $A$'s Frobenius norm and $\|v\|$ is the $\ell^2$-norm of $v$. $L^2$ consistency (Condition 2) then guarantees that both quantities are $o_p(1)$.

Lemma D.1.
The following is a sufficient condition for the $L^2$ consistency of $\tilde\Upsilon_i$ (Condition 2) and the $\hat C_j \xrightarrow{p} C$ condition (Condition 1):

1. $\|C\| < \infty$.

2. There exist $M, \eta > 0$ such that $f, g$ are bounded above and $\Sigma$ and $\hat\Sigma$ are bounded below almost surely: $\|f\|_\infty \le M$, $\|g\|_\infty \le M$, $\inf_w E[U^2 \mid W = w] > \eta$, and $\Pr(\hat\Sigma(W) \ge \eta) = 1$.

3. $L^2$-consistency of the individual components: $E_{-j}\|\hat f_j - f\|^2 = o_p(1)$, $E_{-j}\|\hat g_j - g\|^2 = o_p(1)$, and $E_{-j}|\hat\Sigma_j - \Sigma|^2 = o_p(1)$.

4. $E\|T_i\|^2 < \infty$, $E U_i^4 < \infty$, $E[U_i^2\|T_i\|^2] < \infty$.

Proof.
Checking Condition 2.
Note that
$$E\big[\|\Upsilon_i - \tilde\Upsilon_i\|^2\big] = E\big[\|P + Q\|^2\big] \le 2E\big[\|P\|^2\big] + 2E\big[\|Q\|^2\big],$$
where (precisely speaking, here $E[X_i]$ denotes $E[X_i \mid S_2]$ for $i \in S_1$, and similarly for $\mathrm{Var}$)
$$P = \frac{\hat f(W_i)}{\hat\Sigma(W_i)} - \frac{f(W_i)}{\Sigma(W_i)} = \frac{\hat f}{\hat\Sigma\,\Sigma}\big(\Sigma - \hat\Sigma\big) + \frac{1}{\Sigma}\big(\hat f - f\big), \qquad Q = C\bigg[\frac{\hat g(W_i)}{\hat\Sigma(W_i)} - \frac{g(W_i)}{\Sigma(W_i)}\bigg].$$
We may control $E\|P\|^2$:
$$E\big[\|P\|^2\big] \le 2E\bigg[\frac{\|\hat f\|^2}{\hat\Sigma^2\Sigma^2}\big(\Sigma - \hat\Sigma\big)^2\bigg] + 2E\bigg[\frac{1}{\Sigma^2}\big\|\hat f - f\big\|^2\bigg] \le 2M^2\eta^{-4}\, E\big[(\Sigma - \hat\Sigma)^2\big] + 2\eta^{-2}\, E\big\|\hat f - f\big\|^2 = o_p(1).$$
Similarly, $E[\|Q\|^2] = o_p(1)$.

Checking $\hat C_j \xrightarrow{p} C$. By the continuous mapping theorem, it suffices to check that both multiplicands in $\hat C_j$ converge in probability to the analogous multiplicands in $C$. Note that
$$E\bigg\|\frac{1}{n}\sum_{i \in S_j}\bigg(\frac{\hat g_j}{\hat\Sigma_j} - \frac{g}{\Sigma}\bigg) T_i^\top\bigg\|_F \le \sqrt{E\big[\|T_i\|^2\big]\, E\bigg\|\frac{\hat g_j}{\hat\Sigma_j} - \frac{g}{\Sigma}\bigg\|^2} = o_p(1) \quad \text{(Cauchy–Schwarz)}$$
and thus $\frac{1}{n}\sum_{i \in S_j} T_i\big(\tfrac{\hat g_j}{\hat\Sigma_j} - \tfrac{g}{\Sigma}\big)^\top = o_p(1)$. The second multiplicand is analogously bounded:
$$\bigg\|\frac{1}{n}\sum_{i \in S_j}\Big(\hat U_i^2 \hat A_i \hat A_i^\top - U_i^2 A_i A_i^\top\Big)\bigg\|_F \le \frac{1}{n}\sum_{i \in S_j}\big|\hat U_i^2 - U_i^2\big|\,\big\|\hat A_i \hat A_i^\top\big\|_F + \frac{1}{n}\sum_{i \in S_j} U_i^2\big(\|\hat A_i\| + \|A_i\|\big)\big\|A_i - \hat A_i\big\|,$$
where $A_i = X_i - g(W_i)/\Sigma(W_i)$ and $\hat A_i = X_i - \hat g(W_i)/\hat\Sigma(W_i)$. Via Cauchy–Schwarz and $U_i$ having four moments, taking $E$ of the second term shows that the second term is
$$O_p\Bigg(\sqrt{E\big\|\hat A_i - A_i\big\|^2}\Bigg) = O_p\Bigg(\sqrt{E\bigg\|\frac{\hat g}{\hat\Sigma} - \frac{g}{\Sigma}\bigg\|^2}\Bigg) = o_p(1).$$
The first term can be written as
$$\frac{1}{n}\sum_{i \in S_j}\big|\hat U_i^2 - U_i^2\big|\,\big\|\hat A_i \hat A_i^\top\big\|_F \le \big\|\hat\theta_j - \theta\big\| \cdot \frac{1}{n}\sum_{i \in S_j}\|T_i\|\,\big|2Y_i - T_i^\top(\theta + \hat\theta_j)\big|\,\big\|\hat A_i\hat A_i^\top\big\|_F$$
$$= o_p(1)\,\frac{1}{n}\sum_{i \in S_j}\|T_i\|\big(2|U_i| + \|T_i\|\, o_p(1)\big)\big(\|A_i A_i^\top\|_F + \|A_i\|\, o_p(1) + o_p(1)\big) = o_p(1)$$
by the last condition.