Inference with Many Weak Instruments
Anna Mikusheva and Liyang Sun

Department of Economics, M.I.T. Address: 77 Massachusetts Avenue, E52-526, Cambridge, MA 02139. Email: [email protected]. National Science Foundation support under grant number 1757199 is gratefully acknowledged. We are grateful to Josh Angrist, Kirill Evdokimov and Whitney Newey for advice, to Brigham Frandsen for sharing code for simulations, and to Ben Deaner and Sylvia Klosin for research assistance. Replication code is available at http://economics.mit.edu/grad/lsun20/.
Department of Economics, M.I.T. Address: 77 Massachusetts Avenue, E52-300, Cambridge, MA 02139. Email: [email protected].

Abstract
We develop a concept of weak identification in linear IV models in which the number of instruments can grow at the same rate as, or more slowly than, the sample size. We propose a jackknifed version of the classical weak identification-robust Anderson-Rubin (AR) test statistic. Large-sample inference based on the jackknifed AR is valid under heteroscedasticity and weak identification. The feasible version of this statistic uses a novel variance estimator. The test has uniformly correct size and good power properties. We also develop a pre-test for weak identification that is related to the size property of a Wald test based on the Jackknife Instrumental Variable Estimator (JIVE). This new pre-test is valid under heteroscedasticity and with many instruments.
Key words: instrumental variables, weak identification, dimensionality asymptotics.
JEL classification codes: C12, C36, C55.
Recent empirical applications of instrumental variables (IV) estimation often involve many instruments that together may or may not be strongly relevant. A prominent example is Angrist and Krueger (1991), the paper that started the weak IV literature; it uses 180 instruments formed by interacting dummies for the quarter of birth with state and year of birth. Other examples include papers that employ an empirical strategy known as "judge design" (Maestas et al., 2013; Sampat and Williams, 2015; Dobbie et al., 2018). Fueled by rich administrative data, these papers use the exogenous assignment of cases to judges as instruments for treatment. Since each judge can only process a certain number of cases out of the total court cases, the number of judges (the number of instruments) is usually proportional to the sample size. Another example is the famous Fama-MacBeth procedure in asset pricing.

We prove that even in a homoscedastic model with known covariance, an asymptotically consistent test does not exist if the ratio of the concentration parameter over the square root of the number of instruments stays bounded in large samples. Thus, a necessary condition for a consistent test to exist is that the concentration parameter grows faster than the square root of the number of instruments. Later, we show that this is also a sufficient condition by constructing a robust test that becomes consistent when this condition is satisfied.

We propose a new jackknifed version of the Anderson-Rubin (AR) test which is robust to both weak identification and heteroscedasticity in a model with many instruments. The new test uses an asymptotic approximation based on a Central Limit Theorem (CLT) for quadratic forms. The new AR test has the correct size regardless of identification strength and becomes consistent as soon as the concentration parameter grows faster than the square root of the number of instruments.

As an important technical contribution, we introduce a novel variance estimator for the quadratic form CLT. The target variance is a quadratic form of the individual (heteroscedastic) variances of errors. We apply cross-fitting (Newey and Robins, 2018; Kline et al., 2019) to produce unbiased proxies for the individual variances of errors. We adjust the quadratic form to remove the bias due to correlations between proxies. We prove the consistency of the new estimator under the null and local alternatives.

Finally, we propose a new pre-test for weak identification which is easy to use and is consistent with our definition of weak identification. An empirical researcher can use our pre-test to decide between employing our jackknife AR test, if the pre-test suggests that the identification is weak, or a Wald test based on the Jackknife Instrumental Variable Estimator (JIVE, Angrist et al., 1999), if the pre-test suggests that the identification is strong. We guarantee the size of this two-step procedure.
Chao et al. (2012) prove that JIVE is consistent in a heteroscedastic model when the concentration parameter grows faster than the square root of the number of instruments. Chao et al. (2012) also derive a consistent estimator of the JIVE standard error. The two-step procedure is appealing because when identification is strong, the JIVE-Wald is more efficient and easy to implement and report.

Our pre-test is in the spirit of Stock and Yogo (2005), but it differs from theirs in two important ways. Firstly, our pre-test allows for a general form of heteroscedasticity, while the pre-test proposed in Stock and Yogo (2005) works only under conditionally homoscedastic errors. Secondly, the Stock and Yogo (2005) pre-test is designed for a small number of instruments and is based on the Two-Stage Least Squares (TSLS) estimator. With many instruments TSLS is consistent only when the concentration parameter grows faster than the number of instruments, which makes the Stock and Yogo (2005) pre-test not very informative.

We apply our pre-test to Angrist and Krueger (1991) and find that their identification is strong. Consequently the JIVE confidence set is reliable (has coverage within a 5% tolerance level of the declared coverage). Our weak identification-robust jackknife AR confidence set is somewhat wider than the JIVE confidence set but is still informative.
Relation to the Literature. Our paper contributes to both the literature on weak IV and the literature on many instruments. The weak IV literature relates identification strength to the size of the concentration parameter and proposes robust tests that work only when there are a small number of instruments. Generalizations to many weak instruments either strongly restrict the number of instruments (Andrews and Stock, 2007) or work only under homoscedasticity (Anatolyev and Gospodinov, 2011).

The many instruments literature mostly establishes consistency conditions for particular estimators. For example, Chao and Swanson (2005) show that in a homoscedastic model limited information maximum likelihood (LIML) and bias-corrected TSLS (BTSLS) are consistent when the concentration parameter grows faster than the square root of the number of instruments. In a heteroscedastic model, consistency of LIML and BTSLS requires that the concentration parameter grows faster than the number of instruments. By contrast, JIVE remains consistent when the concentration parameter grows faster than the square root of the number of instruments (Chao et al. (2012)).

The remainder of this paper is organized as follows. In Section 2 we introduce our definition of weak identification in an environment with many instruments. In Section 3 we construct the jackknife AR test and establish its power properties. In Section 4 we present the pre-test and prove that it controls size. Section 5 reports our pre-test results for Angrist and Krueger (1991) and conducts a simulation exercise inspired by Angrist and Frandsen (2019), and Section 6 concludes. Some proofs and additional results may be found in the Supplementary Appendix.
2 Weak Identification with Many Instruments

We study the linear IV regression with a scalar outcome $Y_i$, a potentially endogenous scalar regressor $X_i$ and a $K \times 1$ vector of instrumental variables $Z_i$:
$$Y_i = \beta X_i + e_i, \qquad X_i = \Pi_i + v_i, \qquad (1)$$
for $i = 1, \dots, N$. We denote $\Pi_i = E[X_i \mid Z_i]$ and allow the instruments to affect the endogenous regressor in a non-linear way. All results in this paper hold conditionally on a realization of the instruments. Thus, we treat the instruments as fixed (non-random) and $\Pi_i$ as some constants. The mean-zero errors $(e_i, v_i)$ are independent across $i$ but not identically distributed and may be heteroscedastic. We assume without loss of generality that there are no controls included in our model as they may be partialled out.

Weak identification under small $K$ is studied extensively in the weak IV literature. For Gaussian homoscedastic errors $(e_i, v_i)$ and linear first stage ($\Pi_i = \pi' Z_i$), the strength of the instruments corresponds directly to the concentration parameter, $\frac{\pi' Z'Z \pi}{\sigma_v^2}$, where $\sigma_v^2 = Var(v_i)$. The concentration parameter equals the signal-to-noise ratio in the first-stage regression and is related to the bias of the TSLS estimator and the quality of the Gaussian approximation for the TSLS t-statistic. For the general case with homoscedastic errors, Staiger and Stock (1997) introduced weak instrument-asymptotics in which one considers a sequence of models such that the concentration parameter converges to a constant as $N \to \infty$. Under this asymptotic embedding, neither a consistent estimator of $\beta$ nor a consistent test of the null hypothesis that $\beta$ equals some scalar exists, and the test based on the TSLS t-statistic severely over-rejects.

The magnitude of the concentration parameter is not a good indicator of identification strength when the number of instruments is large. We model large $K$ by considering $K \to \infty$ as $N \to \infty$, with the only restriction that $K$ is at most a fraction of $N$. Under this many instrument-asymptotics, Theorem 1 below shows that the re-scaled concentration parameter $\frac{\pi' Z'Z \pi}{\sigma_v^2 \sqrt{K}}$ provides a characterization of weak identification in terms of the consistency of tests.
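For intuition, here is a small numerical sketch of this setup (our own illustration, not taken from the paper): it simulates model (1) with group-dummy instruments, a stylized judge design, and evaluates the re-scaled concentration parameter of Theorem 1. The design, coefficient scale, and seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 2000, 200                      # many instruments: K grows with N

# Group ("judge") dummies: K equally sized groups, so P_ii = K/N for all i.
groups = np.repeat(np.arange(K), N // K)
Z = np.zeros((N, K))
Z[np.arange(N), groups] = 1.0

pi = 0.1 * rng.standard_normal(K)     # first-stage coefficients (hypothetical scale)
sigma_v2 = 1.0                        # homoscedastic first-stage error variance

# Re-scaled concentration parameter from Theorem 1: pi'Z'Zpi / (sigma_v^2 sqrt(K)).
Zpi = Z @ pi
print("concentration / sqrt(K) =", Zpi @ Zpi / (sigma_v2 * np.sqrt(K)))
```

Identification is weak in our sense when this quantity stays bounded as $N$ and $K$ grow.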
Theorem 1. Assume we have a sample from model (1) with linear first stage $\Pi_i = \pi' Z_i$, where the errors $(e_i, v_i)$ are independently drawn from a Gaussian distribution $N(0, \Omega)$ with a known covariance $\Omega$. Assume that the $K \times K$ matrix $Z'Z$ has rank $K$ and $K \to \infty$ as $N \to \infty$. For any sample of size $N$ let $\Psi_N$ be the class of all tests of size $\alpha$ for testing the hypothesis $H_0: \beta = \beta_0$, that is, any $\psi \in \Psi_N$ is a measurable function from $\{(Y_i, X_i, Z_i), i = 1, \dots, N\}$ to the interval $[0, 1]$ such that $E_{\beta_0, \pi} \psi \le \alpha$ for any value of $\pi \in \mathbb{R}^K$. Then for any $\beta^* \ne \beta_0$ we have
$$\limsup_{N \to \infty} \; \max_{\psi \in \Psi_N} \; \min_{\pi:\, \frac{\pi' Z'Z \pi}{\sigma_v^2 \sqrt{K}} \le C} E_{\beta^*, \pi} \, \psi < 1.$$

The setting considered in Theorem 1 is quite favorable: the first stage is linear, errors are Gaussian and homoscedastic with known covariance matrix. So the only unknown parameters are $\beta$ and $\pi$. Theorem 1 states that even in this favorable setting there exists no test that consistently differentiates any $\beta^*$ from $\beta_0$ if the ratio $\frac{\pi' Z'Z \pi}{\sigma_v^2 \sqrt{K}}$ is bounded. Indeed, for any test $\psi$ we can find its guaranteed power $E_{\beta^*, \pi} \psi$ by minimizing over the alternatives $(\beta^*, \pi)$ with bounded $\frac{\pi' Z'Z \pi}{\sigma_v^2 \sqrt{K}}$. We show that even the test that achieves the maximum guaranteed power has guaranteed power strictly less than one asymptotically. Later we show that in a more general heteroscedastic model we can construct a robust test that becomes consistent when $\frac{\Pi'\Pi}{\sqrt{K}} \to \infty$.

Theorem 1 can also be used to characterize weak identification in terms of consistent estimation since it implies there exists no consistent estimator for $\beta$ when $\frac{\pi' Z'Z \pi}{\sqrt{K}}$ is bounded. Our result complements the literature on estimation with many instruments. Chao and Swanson (2005) show that with homoscedastic errors, when $K$ grows proportionally to the sample size the TSLS estimator is consistent only if $\frac{\pi' Z'Z \pi}{K} \to \infty$, while LIML and BTSLS estimators are consistent when $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$. However, under heteroscedasticity, even when $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$, LIML and BTSLS become inconsistent, but JIVE is still consistent, according to Chao et al. (2012).

The proof of Theorem 1 builds on several classical papers. Following the approach of Andrews et al. (2006), we first reduce the class of tests to those based on a sufficient statistic. Among these tests, the minimal power is achieved by a test invariant to rotations of the instruments. This observation allows us to further reduce our attention to invariant tests, which depend on the data only through its maximal invariant under rotations. Then we derive a limit experiment for $K \to \infty$ similar to that derived in Andrews and Stock (2007). In this limit experiment the minimax power is less than one. Finally we use the argument of Müller (2011) to bound the desired asymptotic minimax power using the minimax power obtained in the limit experiment.

3 The Jackknife AR Test

The goal of this section is to introduce a test robust to weak identification in the heteroscedastic IV model when the number of instruments, $K$, is large.

The existing weak IV literature proposes several weak identification-robust tests of the null hypothesis $H_0: \beta = \beta_0$ when $K$ is small. These tests have correct size when the identification is weak and become consistent when the identification is strong. One example is the Anderson-Rubin (AR) test. Specifically, the IV model (1) implies that under a given null hypothesis $H_0: \beta = \beta_0$, the exogeneity assumption $E[Z'e(\beta_0)] = 0$ holds for the implied error $e(\beta_0) = Y - \beta_0 X$.
Then under mild assumptions, the scaled sample analog satisfies a $K$-dimensional Central Limit Theorem: $\frac{1}{\sqrt{N}} Z'e(\beta_0) \Rightarrow N(0, \Sigma)$. The AR statistic is defined as $\frac{1}{N} e(\beta_0)' Z \hat\Sigma^{-1} Z' e(\beta_0)$, where $\hat\Sigma$ is a consistent estimator of $Var\left(\frac{1}{\sqrt{N}} Z'e\right)$. The AR test rejects the null hypothesis when the AR statistic exceeds the $(1-\alpha)$ quantile of the $\chi^2_K$ distribution. The AR test has asymptotically correct size regardless of the value of the first stage coefficients $\Pi_i$ and is asymptotically consistent when an analog of the concentration parameter grows to infinity.

Generalizing the AR statistic to the large-$K$ setting is challenging for multiple reasons. Firstly, the covariance matrix $\Sigma$ has dimension $K \times K$. Its consistent estimation is problematic if not impossible under general heteroscedasticity. Secondly, the AR statistic under the null has an improperly centered limit distribution because $\chi^2_K$ has a very large mean. Thirdly, the $K$-dimensional Central Limit Theorem provides a poor approximation to the AR statistic when $K$ is large.

We propose an analog of the AR test that is heteroscedasticity-robust and weak identification-robust in the presence of a large number of instruments. Denote the projection matrix $P = Z(Z'Z)^{-1}Z'$. Our test rejects the null of $H_0: \beta = \beta_0$ when the jackknife AR statistic
$$AR(\beta_0) = \frac{1}{\sqrt{K}\sqrt{\hat\Phi}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} \, e_i(\beta_0) e_j(\beta_0) \qquad (2)$$
exceeds the $(1-\alpha)$ quantile of the standard normal distribution. We defer the discussion of the estimator of the variance $\hat\Phi$ to the next subsection.

To address the challenges with the existing AR statistic, the AR statistic we propose uses the default homoscedasticity-inspired weighting $(Z'Z)^{-1}$ in place of $\hat\Sigma^{-1}$. With the $(Z'Z)^{-1}$ weighting, the existing AR statistic has a quadratic form $e(\beta_0)' P e(\beta_0)$. However, this quadratic form is not centered at zero as it contains the term $\sum_{i=1}^N P_{ii} e_i^2$, and each summand has positive mean. We thus remove this term from the quadratic form. This re-centering can be referred to as leave-one-out or jackknife. In the context of consistent estimation under many instruments, this leave-one-out idea was introduced by Angrist et al. (1999) and fruitfully exploited in a number of papers including Hausman et al. (2012) and Chao et al. (2012). In order to create a test of correct size based on our AR statistic, we use a Central Limit Theorem for quadratic forms proved in Chao et al. (2012) that is restated below.
Assumption 1. $P$ is an $N \times N$ projection matrix of rank $K$, $K \to \infty$ as $N \to \infty$, and there exists a constant $\delta$ such that $P_{ii} \le \delta < 1$.

Lemma 1 (Chao et al. (2012)). Let Assumption 1 hold for matrix $P$. Assume the errors $\eta_i$ are independent, $E\eta_i = 0$, and there exists a constant $C$ such that $\max_i E\eta_i^4 < C$. Then
$$\frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} \eta_i \eta_j \Rightarrow N(0, 1), \qquad \text{where } \Phi = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \, Var(\eta_i) Var(\eta_j).$$

The assumption $P_{ii} \le \delta < 1$ implies that $\frac{K}{N} = \frac{1}{N} \sum_{i=1}^N P_{ii} \le \delta < 1$. This assumption is often referred to as a balanced design assumption. In the case of group-dummy instruments, $P_{ii}$ is equal to the ratio of the size of the group that observation $i$ belongs to over $N$. Assumption 1 can be checked for any specific design.

While Lemma 1 requires $K \to \infty$, the Gaussian approximation may work well for smaller $K$ as well. For example, if $K$ is fixed and errors are homoscedastic, then
$$\frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} \eta_i \eta_j \Rightarrow \frac{\chi^2_K - K}{\sqrt{2K}} \qquad \text{as } N \to \infty.$$
We prove this statement in the Supplementary Appendix S4. While the limit here is not Gaussian, it is very well approximated by a standard normal distribution even for relatively small $K$. The random variable $\frac{\chi^2_K - K}{\sqrt{2K}}$ exceeds the 95% quantile of the standard normal distribution at most 7% of the time for all $K$, and at most 6% of the time for larger $K$.

3.1 Variance estimation

In order to conduct asymptotically valid inference based on the normal approximation in Lemma 1, we need an estimator for the scale parameter $\Phi$ which is consistent under the null. One 'naive' estimator that achieves this is $\hat\Phi = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \, e_i^2(\beta_0) e_j^2(\beta_0)$, which uses the square of the implied error as an estimator for the $i$-th error variance. Under the null, when $e_i(\beta_0) = e_i$, this estimator $\hat\Phi$ is consistent under relatively mild conditions. However, using this $\hat\Phi$ in a test would result in poor power. To see this, note that under an alternative value of the parameter $\beta = \beta_0 + \Delta$, we can plug in the first stage and write the implied error $e_i(\beta_0) = Y_i - \beta_0 X_i$ as the sum of a non-trivial mean $\Delta \Pi_i$ and a mean-zero random term $\eta_i = e_i + \Delta v_i$:
$$e_i(\beta_0) = \Delta \Pi_i + \eta_i.$$
While squaring $e_i(\beta_0)$ makes it an unbiased estimator for $Var(e_i)$ under the null, it is biased under the alternative when $\Delta \ne 0$. The bias in $\hat\Phi$ grows at the same order as the fourth power of $\Delta$, which brings down the power of the test against distant alternatives.

In order to remove the bias in $e_i^2(\beta_0)$ under the alternatives, one may residualize the implied error before squaring. However, this introduces a bias under the null. Denote $M = I - P$ and let $M_i$ be the $i$-th row of $M$. Even under the null, the squared residualized error is biased: $E(M_i e)^2 \ne Var(e_i)$. This is because the squared residual contains not only the squared error $e_i^2$ but also the square of the regression estimation mistake. The latter can be large when the number of regressors $K$ is large.

This bias can be removed successfully using the cross-fit variance estimator suggested in Kline et al. (2019) and Newey and Robins (2018). Namely, they show that a product of the implied error and residual achieves both goals: it removes the linearly predictable part of the implied error and remains an unbiased estimator of the variance:
$$E\left[ e_i \frac{M_i e}{M_{ii}} \right] = Var(e_i).$$
Our challenge is that the scale parameter $\Phi$ defined in Lemma 1 is a quadratic form with a double summation. Residuals $M_i e(\beta_0)$ and $M_j e(\beta_0)$ are correlated since they contain the same estimation mistake.
One can show that
$$E\left[ e_i M_i e \; e_j M_j e \right] = \left( M_{ii} M_{jj} + M_{ij}^2 \right) Var(e_i) Var(e_j).$$
Our proposed estimator of the scale parameter $\Phi$ re-weights each term in the summation to remove the bias described above:
$$\hat\Phi = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2} \left[ e_i(\beta_0) M_i e(\beta_0) \right] \left[ e_j(\beta_0) M_j e(\beta_0) \right]. \qquad (3)$$
We establish the consistency of $\hat\Phi$ under the null and extend this result to local alternatives.
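Equations (2) and (3) are straightforward to compute directly. The following Python sketch (our illustration, with hypothetical function and variable names, not the authors' replication code) forms the feasible jackknife AR statistic; it materializes the full $N \times N$ projection matrix, so it is only meant for moderate sample sizes.

```python
import numpy as np

def jackknife_ar(Y, X, Z, beta0):
    """Jackknife AR statistic (2) with the cross-fit variance estimator (3).

    A sketch of the construction in the text: compare the returned value with
    the (1 - alpha) quantile of N(0, 1) to run the test.
    """
    N, K = Z.shape
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)       # projection on the instruments
    M = np.eye(N) - P                           # annihilator matrix
    e = Y - beta0 * X                           # implied errors e_i(beta0)

    P0 = P - np.diag(np.diag(P))                # P with zeroed diagonal (j != i terms)
    numerator = e @ P0 @ e                      # sum over i != j of P_ij e_i e_j

    Me = M @ e                                  # residualized implied errors
    cf = e * Me                                 # cross-fit products e_i * (M e)_i
    d = np.diag(M)
    W = P0**2 / (np.outer(d, d) + M**2)         # P_ij^2 / (M_ii M_jj + M_ij^2)
    np.fill_diagonal(W, 0.0)
    Phi_hat = (2.0 / K) * cf @ W @ cf           # estimator (3)

    return numerator / (np.sqrt(K) * np.sqrt(Phi_hat))
```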
Assumption 2. Errors $\epsilon_i$, $i = 1, \dots, N$, are independent with $E\epsilon_i = 0$ and $\max_i E\|\epsilon_i\|^4 < \infty$, and for some constants $c_*$ and $C^*$ that do not depend on $N$,
$$c_* \le \min_i \min_x \frac{x' Var(\epsilon_i) x}{x'x} \le \max_i \max_x \frac{x' Var(\epsilon_i) x}{x'x} \le C^*.$$
Theorem 2. Let Assumption 1 hold for matrix $P$ and Assumption 2 hold for errors $e_i$. Then for $\beta = \beta_0$ we have $\frac{\hat\Phi}{\Phi} \to_p 1$ as $N \to \infty$.

Theorem 2 combined with Lemma 1 implies that under the null $H_0: \beta = \beta_0$ our proposed AR statistic has an asymptotically standard normal distribution. Since no assumption about identification is made, the resulting AR test has asymptotically correct size regardless of the strength of identification.
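As an informal check of this size result, one can simulate the null rejection rate under heteroscedasticity and a weak first stage. The sketch below reuses the hypothetical `jackknife_ar` function from the previous block; the error design is arbitrary and chosen only to be heteroscedastic.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 1000, 100
groups = np.repeat(np.arange(K), N // K)        # group-dummy instruments
Z = np.zeros((N, K))
Z[np.arange(N), groups] = 1.0
pi = 0.05 * rng.standard_normal(K)              # weak first stage (hypothetical)

rej = 0
for _ in range(500):
    v = rng.standard_normal(N)
    e = 0.5 * v + (1 + Z[:, 0]) * rng.standard_normal(N)   # heteroscedastic errors
    X = Z @ pi + v
    Y = 0.0 * X + e                             # true beta = 0
    rej += jackknife_ar(Y, X, Z, beta0=0.0) > 1.645        # nominal 5% test
print("null rejection rate:", rej / 500)        # should be near 0.05
```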
Theorem 3. Let Assumption 1 hold for matrix $P$ and Assumption 2 hold for errors $\epsilon_i = (e_i, v_i)'$, and let $\Pi' M \Pi \le \frac{C}{K} \Pi'\Pi$. Then for $\beta = \beta_0 + \Delta$ such that $\Delta^2 \cdot \frac{\Pi'\Pi}{K} \to 0$, we have $\frac{\hat\Phi}{\Phi} \to_p 1$ as $N \to \infty$.

Theorem 3 establishes the consistency of the variance estimator when the null hypothesis does not hold. We use Theorem 3 to derive local power curves of the AR test discussed in the next section. The variance estimator (3) residualizes some implied errors $M_i e(\beta_0)$ to remove the non-trivial mean of $e(\beta_0)$ under the alternative. The residualization is complete if the first stage is linear, $\Pi_i = \pi' Z_i$. We do not impose such an assumption in Theorem 3. Instead we require that the approximation of $\Pi_i$ by a linear combination of instruments improves with the number of instruments, as measured by the $L_2$ norm of the approximation mistake, $\Pi' M \Pi$.

Let us introduce a jackknife measure of the information contained in the instruments:
$$\mu = \sum_{i=1}^N \sum_{j \ne i} P_{ij} \Pi_i \Pi_j.$$
Theorem 4. Let $P_\beta$ be a probability measure describing the distribution of $AR(\beta_0)$ defined in (2) and (3) under model (1) with parameter $\beta = \beta_0 + \Delta$. Assume that the sequence of first stage parameters $\Pi$ satisfies the following assumptions: $\Pi' M \Pi \le \frac{C}{K} \Pi'\Pi$ and $\frac{\Pi'\Pi}{K} \to 0$ as $N \to \infty$. If Assumption 1 holds and the errors $\epsilon_i = (e_i, v_i)'$ satisfy Assumption 2, then for any positive constant $c$ we have:
$$\lim_{N \to \infty} \sup_{|\Delta| \le c} \sup_z \left| P_\beta\{ AR(\beta_0) < z \} - F\left( z - \frac{\Delta^2 \mu}{\sqrt{K}\sqrt{\Phi}} \right) \right| = 0, \qquad (4)$$
where $F(\cdot)$ is the standard normal cdf. If the sequence of first stage parameters additionally satisfies the condition $\frac{\mu}{\sqrt{K}\sqrt{\Phi}} \to \infty$, then for any fixed $\Delta \ne 0$ the jackknife AR test is asymptotically consistent:
$$\lim_{N \to \infty} P_\beta\{ AR(\beta_0) > z_{1-\alpha} \} = 1.$$

Equation (4) of Theorem 4 characterizes the local power curves of the jackknife AR test. The power under the alternative $\beta = \beta_0 + \Delta$ is a function of the distance $\Delta$ between the alternative $\beta$ and the null $\beta_0$, the number of instruments $K$, a measure of identification strength $\mu$, and the degree of uncertainty $\sqrt{\Phi}$. Our jackknife AR statistic can be negative, unlike the AR statistic from the small-$K$ case, which is always non-negative. We reject the null when $AR(\beta_0)$ exceeds the $(1-\alpha)$ quantile of the standard normal distribution. Under the alternative $\beta = \beta_0 + \Delta$, the AR statistic has a positive drift, which gives rise to a two-sided test. The second statement of Theorem 4 shows that the AR test consistently distinguishes $\beta$ from $\beta_0$ as long as $\frac{\mu}{\sqrt{K}\sqrt{\Phi}} \to \infty$.

Our measure of identification strength, $\mu$, has a form similar to the numerator of the concentration parameter defined for the homoscedastic small-$K$ case. Though the two forms are similar, there is an important distinction between them. In our case the signal strength is measured by a jackknife form, while in the homoscedastic small-$K$ case it is measured by $\Pi' P \Pi = \sum_{i=1}^N \sum_{j=1}^N P_{ij} \Pi_i \Pi_j$. The instruments may affect the endogenous regressor in an arbitrarily non-linear way, and only the projection of $\Pi$ onto the linear space of the instruments is used by the linear IV regression. Thus the projection matrix appears naturally in our measure of identification strength. If the effect of instruments on the regressor ($\Pi$) is well approximated by the linear first stage ($\Pi' M \Pi \le \frac{C}{K} \Pi'\Pi$), then the strength of identification has the same order as $\Pi'\Pi$ in the sense that they grow to infinity or stay bounded simultaneously. Indeed, under Assumption 1 we have:
$$\left( 1 - \delta - \frac{C}{K} \right) \Pi'\Pi \le \mu = \Pi'\Pi - \Pi' M \Pi - \sum_{i=1}^N P_{ii} \Pi_i^2 \le \Pi'\Pi.$$
Theorem 4 implies that $\frac{\mu}{\sqrt{K}} \to \infty$ is a sufficient condition for the consistency of the jackknife AR test. When the first stage is well approximated by a linear combination of the instruments, this translates to a sufficient condition of $\frac{\Pi'\Pi}{\sqrt{K}} \to \infty$. This complements Theorem 1, which implies that $\frac{\Pi'\Pi}{\sqrt{K}} \to \infty$ is necessary for the consistency of any test. It is worth noticing that the condition $\frac{\Pi'\Pi}{K} \to 0$ imposed by Theorem 4 is quite weak, as it covers both weakly and strongly identified cases.

4 Pre-test for Weak Identification

In a prominent paper, Stock and Yogo (2005) introduced a pre-test for weak identification that has gained enormous popularity in applied work. In homoscedastic IV models with small $K$, the concentration parameter fully characterizes the worst bias of the TSLS as a fraction of the OLS bias and the worst rejection rate of the TSLS-Wald test.
Since the first stage F statistic measures the concentration parameter, Stock and Yogo (2005) suggest a set of cut-offs for the first stage F statistic, above which a researcher can guarantee with high (prespecified) probability that the bias of TSLS is not larger than 10% of the OLS bias, or that the TSLS-Wald statistic does not over-reject by more than 5%. The cut-offs depend on the goal (bias or size) and the number of instruments. However, these details seem to be mostly disregarded in empirical practice, as the most common guidance suggests a cut-off of 10, regardless of the goal or the number of instruments.

As with any procedure of such generality, it suffers from multiple drawbacks. First, the pre-test is valid only if the model is homoscedastic. Andrews (2018) shows that in models calibrated to commonly-used data sets with heteroscedasticity one may find cases with first stage F statistics exceeding 1000 that have large over-rejections of the TSLS-Wald test.

Second, the TSLS estimator is less robust to weak identification when $K$ is large. In a homoscedastic model when $K$ is growing proportionally to the sample size, the TSLS estimator is consistent only if $\frac{\pi' Z'Z \pi}{K} \to \infty$, while LIML and BTSLS estimators are consistent when $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$ (see Chao and Swanson (2005)). In this case, the pre-test becomes too conservative. Indeed, if $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$ but $\frac{\pi' Z'Z \pi}{K}$ stays bounded, then the pre-test most likely declares weak identification, as the expectation of the first stage F equals $\frac{\pi' Z'Z \pi}{K \sigma_v^2} + 1$, even though there exist consistent estimators and a reasonable Wald test can be constructed.

We propose a new pre-test for weak identification that allows us to form a two-step procedure: a researcher first assesses instrument strength based on our pre-test and then uses the JIVE-Wald test if the instruments appear strong and our jackknife AR test if they appear weak. We can guarantee the size of such a two-step procedure in a heteroscedastic IV model with large $K$. Our pre-test uses an empirical measure of $\frac{\mu}{\sqrt{K}}$, whose value characterizes weak identification as discussed in the previous sections:
$$\tilde F = \frac{1}{\sqrt{K}\sqrt{\hat\Upsilon}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} X_i X_j, \qquad (5)$$
where $\hat\Upsilon = \frac{2}{K} \sum_i \sum_{j \ne i} \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2} X_i M_i X \, X_j M_j X$ is an estimate of the variance $\Upsilon$ defined in (12). The JIVE-Wald test uses the JIV2 estimator introduced in Angrist et al. (1999):
$$\hat\beta_{JIVE} = \frac{\sum_{i=1}^N \sum_{j \ne i} P_{ij} Y_i X_j}{\sum_{i=1}^N \sum_{j \ne i} P_{ij} X_i X_j}.$$
Our choice of JIVE is based on two considerations. First, according to Hausman et al. (2012), in a heteroscedastic IV model, when $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$, LIML and BTSLS become inconsistent, but JIVE is consistent. Second, the JIVE estimator is a ratio of two quadratic forms similar to the jackknife AR statistic. We use the following estimator of the JIVE variance, which is a cross-fit version of the estimator derived in Chao et al. (2012):
$$\hat V = \frac{ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{\hat e_i M_i \hat e}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} \, M_i X \hat e_i \, M_j X \hat e_j }{ \left( \sum_{i=1}^N \sum_{j \ne i} P_{ij} X_i X_j \right)^2 },$$
where $\hat e_i = Y_i - X_i \hat\beta_{JIVE}$. The Wald statistic is defined as
$$Wald(\beta_0) = \frac{ \left( \hat\beta_{JIVE} - \beta_0 \right)^2 }{ \hat V }.$$
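The JIVE-Wald and pre-test quantities above can be sketched in the same style as the jackknife AR code. The helper below is again our own hypothetical illustration; the cut-off 4.14 mentioned in its docstring comes from Corollary 1 below.

```python
import numpy as np

def jive_wald_and_pretest(Y, X, Z, beta0):
    """JIVE estimator, its cross-fit variance, the Wald statistic, and pre-test (5).

    A sketch under the paper's notation; compare F_tilde with the cut-off 4.14
    to decide between the JIVE-Wald and the jackknife AR test.
    """
    N, K = Z.shape
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    M = np.eye(N) - P
    d = np.diag(M)
    P0 = P - np.diag(np.diag(P))                 # off-diagonal part of P
    Pt = P0**2 / (np.outer(d, d) + M**2)         # tilde P_ij weights
    np.fill_diagonal(Pt, 0.0)

    QXX = X @ P0 @ X                             # sum over i != j of P_ij X_i X_j
    beta_jive = (Y @ P0 @ X) / QXX               # JIV2 of Angrist et al. (1999)

    e_hat = Y - X * beta_jive
    MX, Me = M @ X, M @ e_hat
    num = ((P0 @ X)**2) @ (e_hat * Me / d) + (MX * e_hat) @ Pt @ (MX * e_hat)
    wald = (beta_jive - beta0)**2 / (num / QXX**2)

    # Pre-test statistic (5): same quadratic form in X, with variance estimate.
    Ups_hat = (2.0 / K) * (X * MX) @ Pt @ (X * MX)
    F_tilde = QXX / (np.sqrt(K) * np.sqrt(Ups_hat))
    return beta_jive, wald, F_tilde
```

A two-step procedure in the sense of Corollary 1 would then report the JIVE-Wald confidence set when `F_tilde` exceeds 4.14 and invert the jackknife AR test otherwise.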
Theorem 5. Let Assumption 1 hold for matrix $P$ and Assumption 2 hold for errors $\epsilon_i = (e_i, v_i)'$. Assume that $\Pi' M \Pi \le C \frac{\Pi'\Pi}{K}$ and $\frac{\Pi'\Pi}{K^{3/4}} \to 0$ as $N \to \infty$. Then for $\beta = \beta_0$,
$$\left( Wald(\beta_0), \; \tilde F \right) \Rightarrow \left( \frac{\xi^2}{1 - 2\varrho \frac{\xi}{\nu} + \frac{\xi^2}{\nu^2}}, \; \nu \right), \qquad (6)$$
where $\xi$ and $\nu$ are two normal random variables with means $0$ and $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$, unit variances and correlation coefficient $\varrho$ defined in equation (12).

Theorem 5 shows that the distribution of the JIVE-Wald statistic can be quite different from its conventional $\chi^2_1$ limit when $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$ is small. If $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$ is large, then most realizations of the random variable $\nu$ are large as well, and the limit of the JIVE-Wald is close to the distribution of $\xi^2$, which is $\chi^2_1$. This suggests that $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$ is a good measure of identification strength. We notice that the limit expression for the JIVE-Wald statistic is similar to the limit distribution derived by Stock and Yogo (2005, formula (2.22)) for TSLS-Wald in homoscedastic weak IV with small $K$.

Using Theorem 5 we can calculate the worst asymptotic rejection rate of the JIVE-Wald test as a function of $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}} = x$:
$$R^{\max}_\alpha(x) = \max_{\varrho \in [-1,1]} P_{x,\varrho} \left( \frac{\xi^2}{1 - 2\varrho \frac{\xi}{\nu} + \frac{\xi^2}{\nu^2}} \ge \chi^2_{1,1-\alpha} \right),$$
where $P_{x,\varrho}$ is the probability distribution of $(\xi, \nu)$ described in Theorem 5. For a typical test with nominal size $\alpha = 5\%$, we find that $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}} = x > 2.5$ implies $R^{\max}_{5\%}(x) \le 0.10$. Theorem 5 also allows us to construct a 5%-test for the null hypothesis that the unknown strength of identification parameter $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$ is higher than 2.5. This test is based on the statistic $\tilde F$ and accepts whenever $\tilde F > 4.14$. Using Bonferroni bounds we obtain the following statement:
Corollary 1. Let all assumptions of Theorem 5 hold. Then a two-step test for the null hypothesis $H_0: \beta = \beta_0$ that accepts the null if $\tilde F > 4.14$ and $Wald(\beta_0) < \chi^2_{1,0.95}$, or if $\tilde F \le 4.14$ and $AR(\beta_0) < z_{0.95}$, has an asymptotic size smaller than 15%.

The pre-test we propose is to compare $\tilde F$ with the cut-off of 4.14. If $\tilde F$ exceeds the cut-off, one may proceed using the JIVE test/confidence set; otherwise one is advised to employ the weak identification-robust jackknife AR test. The attraction of the two-step procedure is that confidence sets based on the JIVE-Wald test are relatively easy to construct and are well understood by practitioners. As we illustrate in simulations, the jackknife AR confidence sets tend to be wider than the JIVE-Wald confidence sets when identification is strong. Simulations also suggest that the Bonferroni bounds derived in Corollary 1 tend to be conservative, as the actual size of the two-step test does not exceed 7%.

5 Empirical Application and Simulations

Angrist and Krueger (1991) (AK91 in what follows) provided a motivating example for the weak identification literature, starting with the seminal work by Bound et al. (1995). Staiger and Stock (1997) suggested that the relatively low value of the first stage F statistic can be seen as a sign of potential weak instruments in the AK91 application. Hansen et al. (2008) argued that "many instruments" may be a more relevant description
of the identification issue encountered in AK91, as instruments are possibly not weak collectively. They suggested that estimators other than TSLS may restore the accuracy of standard inferences. We apply our proposed pre-test statistic $\tilde F$ to the original AK91 application to assess whether instruments are weak given that there are many of them.

The original AK91 application estimated the effect of schooling ($X_i$) on log weekly wage ($Y_i$) using quarter of birth as instruments in a sample of 329,509 men born 1930-39 from the 1980 census. There are multiple specifications in the original AK91 study. We focus on the specification with 180 instruments and also an extension of this specification using 1530 instruments. The 180 instruments include 30 quarter and year of birth interactions (QOB-YOB) and 150 quarter and state of birth interactions (QOB-POB). For the second specification with 1530 instruments, we also include full interactions among QOB-YOB-POB. Table 1 reports the first stage F statistic (FF), our proposed pre-test statistic $\tilde F$ introduced in (5), and confidence sets based on the JIVE-Wald and jackknife AR statistics. While the first stage F statistic is below 10 and the pre-test from Stock and Yogo (2005) would point toward weak identification for both specifications, the instruments turn out to be strong in both specifications based on our pre-test. As a result, the reported confidence sets based on a nominal 5% JIVE-Wald test are reliable, as the actual size is at most 15%. The confidence sets based on our jackknife AR statistic are wider, yet still informative. (With this sample size, we cannot vectorize calculations involving $P_{ij}$ for the jackknife AR and pre-test due to memory constraints. However, it is still relatively fast to execute the non-vectorized code, which takes around 20 minutes.)

Table 1: AK91 Pre-test Results

                    FF       $\tilde F$   JIVE-Wald         Jackknife AR
180 instruments     2.428    13.422       [0.066, 0.132]    [0.008, 0.201]
1530 instruments    1.27     6.173        [0.024, 0.121]    [-0.047, 0.202]

Notes: Results on pre-tests for weak identification and confidence sets for the IV specification underlying Table VII, Column (6) of Angrist and Krueger (1991) using the original data. FF is the first stage F statistic of Stock and Yogo (2005); $\tilde F$ is the statistic introduced in (5). The JIVE-Wald confidence set is described in Section 4. The jackknife AR confidence set is based on analytical test inversion.
Through Monte Carlo simulations we show that the jackknife AR and the pre-test we develop are robust to many weak instruments, unlike canonical IV estimators. To illustrate the practical importance of many weak instruments, we attempt to preserve the structure of AK91. Specifically, we adopt the simulation design of Angrist and Frandsen (2019). There is very little endogeneity in the original AK91, which makes it hard to study the biases of different estimators. Thus, we follow Angrist and Frandsen (2019) to introduce additional omitted variable bias into the simulated data. The simulated data has a nonlinear first stage and is heteroscedastic. We deviate from Angrist and Frandsen (2019) in two respects. First, we vary the sample size $N$ of the simulated data to be 1.5%, 1% and 0.5% of the original sample size. This is to vary the identification strength. We report the identification strength by the average $\tilde F$ across simulations. Simulations with sample size equal to 1.5% of the original sample size produce strong identification in our definition; 1% still produces strong identification but close to the weak identification region, while 0.5% produces weak identification. When we reduce the sample size we also need to exclude the instruments of the groups that are no longer populated. Second, both in data simulation and in estimation we do not include controls in order to isolate the implications of many instruments. The Appendix provides more details on our simulation design.

Table 2: AK91 Simulation Results: Bias of Different Estimators and Size of Non-robust Tests

N       K     Avg. $\tilde F$   OLS bias   2SLS bias   2SLS size   LIML bias   LIML size   JIVE bias   JIVE size
4,923   154   4.99              0.26       0.17        96.6%       -0.001      0.6%        -0.03       5%
3,209   135   3.35              0.26       0.19        95.7%       -0.05       2.7%        -0.06       5.2%
1,599   111   1.77              0.26       0.21        92.3%       -0.89       14.5%       1.22        3.6%

We evaluate the performance of common estimators and tests based on 1000 simulation draws. In Table 2, we report the bias and Wald test size of the OLS, 2SLS, LIML and JIVE estimators. For the Wald test based on the LIML estimator, we calculate the standard errors as in Hansen et al. (2008). They corrected the canonical standard error estimator to be robust to many instruments, but this test is not robust to heteroscedasticity, as LIML itself is inconsistent under heteroscedasticity. For the Wald test based on the JIVE estimator, we calculate the heteroscedasticity-robust standard errors as described in Section 4.

We find that due to many instruments 2SLS has large bias even under strong identification. While Hausman et al. (2012) show LIML is inconsistent under many instruments and heteroscedasticity, LIML is not too biased in our simulated data, as long as identification is not weak. We find that JIVE has low bias when identification is strong, but its bias increases when identification is weak. The Wald test based on either LIML or JIVE is not robust to many weak instruments, and we find substantial size distortion for LIML under weak identification. Surprisingly we do not find large size distortion for JIVE.

In Table 3 we report the rejection frequency of the robust test we developed in this paper based on the jackknife AR test statistic. We find that the jackknife AR controls size even under weak identification. Our proposed pre-test also controls size and is able to switch to the JIVE-Wald test when identification is strong. In contrast, the first stage F statistics of Stock and Yogo (2005) (FF) are very small even under strong identification, which makes them not very informative.

Table 3: AK91 Simulation Results: Size of Robust Tests

N       K     Avg. FF   Avg. $\tilde F$   jackknife AR   pre-test   two-step test
4,923   154   1.63      4.99              5.1%           70.5%      5.8%
3,209   135   1.44      3.35              5.6%           26.7%      6.6%
1,599   111   1.24      1.77              6.3%           4.5%       7.2%

Finally, in Table 4 we compare the length of confidence intervals formed by inverting various tests. In particular, when identification is strong, jackknife AR confidence sets are longer (less efficient) but are not unreasonably long compared to the Wald tests based on LIML and JIVE. In this case, a pre-test can improve efficiency by switching to the Wald test based on JIVE. As with the canonical AR test, the jackknife AR test can result in confidence intervals with infinite length. We report the probability of infinite length in the last column of Table 4, and note that such probability increases as identification gets weaker.

Table 4: AK91 Simulation Results: Length of Confidence Intervals
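Since Table 4 is built by test inversion, a sketch of how such intervals can be computed: the helper below inverts the jackknife AR test over a user-supplied grid, reusing the hypothetical `jackknife_ar` function from Section 3. The paper's notes mention analytical inversion, for which this grid search is a simple stand-in.

```python
import numpy as np
from scipy.stats import norm

def ar_confidence_set(Y, X, Z, grid, level=0.95):
    """Confidence set for beta by inverting the jackknife AR test over a grid.

    Collects all beta0 values that the one-sided AR test fails to reject;
    e.g. ar_confidence_set(Y, X, Z, np.linspace(-1.0, 1.0, 401)).
    """
    crit = norm.ppf(level)
    return np.array([b for b in grid if jackknife_ar(Y, X, Z, b) < crit])
```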
6 Conclusion

In this paper, we argue that weak identification in an environment with many instruments can be characterized as an analog of the concentration parameter staying bounded relative to the square root of the number of instruments in large samples. We introduce a jackknifed version of the AR test that is robust to our definition of weak identification and to heteroscedasticity. We also propose a pre-test for weak identification and correspondingly a two-step testing procedure in the spirit of Stock and Yogo (2005). Unlike the pre-test proposed by Stock and Yogo (2005), our two-step test controls size distortion even under heteroscedasticity and with many instruments. As an empirical example, our pre-test rejects weak identification in Angrist and Krueger (1991), where up to 1530 instruments are used.
References

Anatolyev, S., and Gospodinov, N. (2011). "Specification Testing in Models with Many Instruments." Econometric Theory 27, 427–441.

Andrews, D.W.K., Moreira, M.J., and Stock, J.H. (2006). "Optimal Two-Sided Invariant Similar Tests for Instrumental Variables Regression." Econometrica 74, 715–752.

Andrews, D.W.K., and Stock, J.H. (2007). "Testing with Many Weak Instruments." Journal of Econometrics 138, 24–46.

Andrews, I. (2018). "Valid Two-Step Identification-Robust Confidence Sets for GMM." The Review of Economics and Statistics 100, 337–348.

Angrist, J.D., and Frandsen, B. (2019). "Machine Labor." NBER Working Paper.

Angrist, J.D., Imbens, G.W., and Krueger, A.B. (1999). "Jackknife Instrumental Variables Estimation." Journal of Applied Econometrics 14, 57–67.

Angrist, J.D., and Krueger, A.B. (1991). "Does Compulsory School Attendance Affect Schooling and Earnings?" The Quarterly Journal of Economics 106, 979–1014.

Bound, J., Jaeger, D.A., and Baker, R.M. (1995). "Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak." Journal of the American Statistical Association 90:430, 443–450.

Chao, J.C., and Swanson, N.R. (2005). "Consistent Estimation with a Large Number of Weak Instruments." Econometrica 73, 1673–1692.

Chao, J.C., Swanson, N.R., Hausman, J.A., Newey, W.K., and Woutersen, T. (2012). "Asymptotic Distribution of JIVE in a Heteroskedastic IV Regression with Many Instruments." Econometric Theory 28, 42–86.

Dobbie, W., Goldin, J., and Yang, C.S. (2018). "The Effects of Pretrial Detention on Conviction, Future Crime, and Employment: Evidence from Randomly Assigned Judges." American Economic Review 108, 201–240.

Fama, E.F., and MacBeth, J.D. (1973). "Risk, Return, and Equilibrium: Empirical Tests." Journal of Political Economy 81, 607–636.

Hansen, C., Hausman, J., and Newey, W. (2008). "Estimation With Many Instrumental Variables." Journal of Business & Economic Statistics 26, 398–422.

Hausman, J.A., Newey, W.K., Woutersen, T., Chao, J.C., and Swanson, N.R. (2012). "Instrumental Variable Estimation with Heteroscedasticity and Many Instruments." Quantitative Economics 3, 211–255.

Kleibergen, F. (2002). "Pivotal Statistics for Testing Structural Parameters in Instrumental Variables Regression." Econometrica 70, 1781–1803.

Kline, P., Saggio, R., and Sølvsten, M. (2019). "Leave-Out Estimation of Variance Components." Working paper.

Maestas, N., Mullen, K.J., and Strand, A. (2013). "Does Disability Insurance Receipt Discourage Work? Using Examiner Assignment to Estimate Causal Effects of SSDI Receipt." American Economic Review 103, 1797–1829.

Müller, U.K. (2011). "Efficient Tests Under a Weak Convergence Assumption." Econometrica 79 (2): 395–435.

Newey, W. (2004). "Many Instrument Asymptotics."

Newey, W.K., and Robins, J.R. (2018). "Cross-Fitting and Fast Remainder Rates for Semiparametric Estimation."

Newey, W.K., and Windmeijer, F. (2009). "Generalized Method of Moments With Many Weak Moment Conditions." Econometrica 77, 687–719.

Sampat, B., and Williams, H.L. (2019). "How Do Patents Affect Follow-On Innovation? Evidence from the Human Genome." American Economic Review 109, 203–236.

Shanken, J. (1992). "On the Estimation of Beta-Pricing Models." The Review of Financial Studies 5, 1–33.

Staiger, D., and Stock, J.H. (1997). "Instrumental Variables Regression with Weak Instruments." Econometrica 65 (3): 557–586.

Stock, J.H., and Yogo, M. (2005). "Testing for Weak Instruments in Linear IV Regression." In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, pp. 80–108.
Appendix: Proofs

Let $C$ be a universal constant (that may be different in different lines but does not depend on $N$ or $K$). Denote $\sigma_i^2 = Var(e_i)$, $\varsigma_i^2 = Var(v_i)$, $\gamma_i = cov(e_i, v_i)$, and $\tilde P_{ij} = \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2}$.

Proof of Theorem 1.
Denote by $A$ an upper-triangular matrix such that $A \Omega A' = I_2$. The sufficient statistic in model (1) is $(\xi_1, \xi_2)$:
$$\begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} = (A \otimes I_K) \cdot \begin{pmatrix} (Z'Z)^{-1/2} Z' Y \\ (Z'Z)^{-1/2} Z' X \end{pmatrix} \sim N\left( \begin{pmatrix} \tilde\beta_1 \Pi \\ \tilde\beta_2 \Pi \end{pmatrix}, \; I_{2K} \right), \qquad (7)$$
where $\tilde\beta = (\tilde\beta_1, \tilde\beta_2)' = A(\beta, 1)'$ is a (known) linear one-to-one transformation of $\beta$. Denote the corresponding null and alternative as $\tilde\beta^0$ and $\tilde\beta^*$. We denote also $\Pi = \frac{(Z'Z)^{1/2} \pi}{\sigma_v}$, which is a one-to-one transformation of $\pi$. It is enough to restrict attention to the tests that depend on the data through sufficient statistics only. Indeed, for any test $\psi \in \Psi_N$ we may construct a test $\psi_S = E(\psi \mid \xi_1, \xi_2)$ which depends on the data only through the sufficient statistics. Due to the law of iterated expectations the size and the power of $\psi_S$ are the same as those of the initial $\psi$.

Let $\mathcal{U}$ be the group of rotations on $\mathbb{R}^K$, that is, $U \in \mathcal{U}$ are such that $U'U = I_K$. Notice that the model is invariant to the group $\mathcal{U}$, namely if $(\xi_1, \xi_2)$ satisfy model (7) with parameters $(\tilde\beta, \Pi)$ then $(U\xi_1, U\xi_2)$ satisfy model (7) with parameters $(\tilde\beta, U\Pi)$. Note that $\Pi'\Pi = (U\Pi)'(U\Pi)$. This implies that for any function $f$ we have $E_{(\tilde\beta, \Pi)} f(U\xi_1, U\xi_2) = E_{(\tilde\beta, U\Pi)} f(\xi_1, \xi_2)$.

We call a test $\psi = \psi(\xi_1, \xi_2)$ invariant to rotations iff for any $U \in \mathcal{U}$ we have $\psi(U\xi_1, U\xi_2) = \psi(\xi_1, \xi_2)$ for all realizations of $(\xi_1, \xi_2)$. The maximum in Theorem 1 is achieved at an invariant test. Indeed, take any test $\psi \in \Psi_N$ that has size $\alpha$, that is, $E_{(\tilde\beta^0, \Pi)} \psi(\xi_1, \xi_2) \le \alpha$ for all $\Pi$. Let us consider a new test $\psi^*(\xi_1, \xi_2) = \int_{U \in \mathcal{U}} \psi(U\xi_1, U\xi_2) dU$, where the integral is taken uniformly over the unit sphere in $\mathbb{R}^K$. By construction, $\psi^*$ is an invariant test, as for any $\tilde U \in \mathcal{U}$ we have $U\tilde U \in \mathcal{U}$ for all $U \in \mathcal{U}$, so that
$$\psi^*(\tilde U \xi_1, \tilde U \xi_2) = \int_{U \in \mathcal{U}} \psi(U \tilde U \xi_1, U \tilde U \xi_2) dU = \int_{U \in \mathcal{U}} \psi(U\xi_1, U\xi_2) dU.$$
Also,
$$E_{(\tilde\beta^0, \Pi)} \psi^*(\xi_1, \xi_2) = \int_{U \in \mathcal{U}} \left\{ E_{(\tilde\beta^0, \Pi)} \psi(U\xi_1, U\xi_2) \right\} dU = \int_{U \in \mathcal{U}} \left\{ E_{(\tilde\beta^0, U\Pi)} \psi(\xi_1, \xi_2) \right\} dU \le \alpha.$$
So, it has correct size. Now we check that the minimal power of $\psi^*$ achieved over alternatives $(\tilde\beta^*, \Pi)$ with $\Pi$ such that $\frac{\Pi'\Pi}{\sqrt{K}} = C$ is not smaller than that of $\psi$. Assume that the minimum of power for test $\psi$ is achieved at the alternative $\Pi^*$:
$$\min_{\frac{\Pi'\Pi}{\sqrt{K}} = C} E_{(\tilde\beta^*, \Pi)} \psi(\xi_1, \xi_2) = E_{(\tilde\beta^*, \Pi^*)} \psi(\xi_1, \xi_2).$$
Then, similarly to above:
$$\min_{\frac{\Pi'\Pi}{\sqrt{K}} = C} E_{(\tilde\beta^*, \Pi)} \psi^*(\xi_1, \xi_2) = \min_{\frac{\Pi'\Pi}{\sqrt{K}} = C} \int_{U \in \mathcal{U}} \left\{ E_{(\tilde\beta^*, U\Pi)} \psi(\xi_1, \xi_2) \right\} dU \ge \int_{U \in \mathcal{U}} \min_{\frac{\Pi'\Pi}{\sqrt{K}} = C} \left\{ E_{(\tilde\beta^*, U\Pi)} \psi(\xi_1, \xi_2) \right\} dU = E_{(\tilde\beta^*, \Pi^*)} \psi(\xi_1, \xi_2).$$
All invariant tests depend on the data only through the maximal invariant. Thus, we should only consider tests that depend on the data through the statistics $Q = (Q_{11}, Q_{12}, Q_{22}) = (\xi_1'\xi_1, \xi_1'\xi_2, \xi_2'\xi_2)$. If $\Pi'\Pi / \sqrt{K} \to C$ then $Q$ converges to the following distribution:
$$\begin{pmatrix} \frac{\xi_1'\xi_1 - K}{\sqrt{2K}} \\ \frac{\xi_1'\xi_2}{\sqrt{K}} \\ \frac{\xi_2'\xi_2 - K}{\sqrt{2K}} \end{pmatrix} \Rightarrow N\left( \begin{pmatrix} \tilde\beta_1^2 C / \sqrt{2} \\ \tilde\beta_1 \tilde\beta_2 C \\ \tilde\beta_2^2 C / \sqrt{2} \end{pmatrix}, \; I_3 \right) = \begin{pmatrix} Q_{11,\infty} \\ Q_{12,\infty} \\ Q_{22,\infty} \end{pmatrix} = Q_\infty. \qquad (8)$$
According to Theorem 1 of Müller (2011), the limit of the maximal power of tests in the experiment based on $Q$ is bounded above by the maximal power achieved in the limit experiment on $Q_\infty$ as defined on the right hand side of equation (8). Notice that the maximal achievable power $E_{\tilde\beta^*, C} \psi^*(Q_\infty)$ is strictly less than 1 for any fixed $\beta^*$ and fixed $C$. Indeed, the best achievable power in the limit experiment (8) is no more than the best achievable power in the experiment where $C$ is known. If $C$ is known, the optimal test follows from the Neyman-Pearson lemma, and its power is less than 1. $\square$

Proof of Theorem 2.
Assumptions 1 and 2 imply
$$1 \ge \frac{1}{K} \sum_i \sum_{j \ne i} P_{ij}^2 = \frac{1}{K} \sum_i \sum_j P_{ij}^2 - \frac{1}{K} \sum_i P_{ii}^2 \ge 1 - \frac{\delta}{K} \sum_i P_{ii} = 1 - \delta.$$
Thus, $2(1 - \delta)(c_*)^2 < \Phi < 2(C^*)^2$, and it is sufficient to prove that $\hat\Phi - \Phi \to_p 0$. The last statement holds due to Lemma 2 applied to $\xi_i = (e_i, e_i, e_i)'$. $\square$

Lemma 2.
Let Assumption 1 hold. Assume the errors $\xi_i = (\xi_i^{(1)}, \xi_i^{(2)}, \xi_i^{(3)})'$ are independent mean zero random vectors with $\max_i E\|\xi_i\|^4 < C$. Then as $N \to \infty$, we have:
$$\frac{1}{K} \sum_i \sum_{j \ne i} \left\{ \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2} \left[ \xi_i^{(1)} M_i \xi^{(2)} \right] \left[ \xi_j^{(1)} M_j \xi^{(3)} \right] - P_{ij}^2 \, E\left[ \xi_i^{(1)} \xi_i^{(2)} \right] E\left[ \xi_j^{(1)} \xi_j^{(3)} \right] \right\} \to_p 0.$$

Proof of Lemma 2.
Notice that
$$\frac{1}{K} \sum_i \sum_{j \ne i} P_{ij}^2 \, E\left[ \xi_i^{(1)} \xi_i^{(2)} \right] E\left[ \xi_j^{(1)} \xi_j^{(3)} \right] = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \, E\left[ \xi_i^{(1)} M_i \xi^{(2)} \, \xi_j^{(1)} M_j \xi^{(3)} \right].$$
Define $\xi_{ij} = \xi_i^{(1)} M_i \xi^{(2)} \, \xi_j^{(1)} M_j \xi^{(3)} - E\left[ \xi_i^{(1)} M_i \xi^{(2)} \, \xi_j^{(1)} M_j \xi^{(3)} \right]$; then we need to prove that $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \xi_{ij} \to_p 0$. Since $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \xi_{ij}$ has zero mean, it is sufficient to show that the variance of each term in expression (9) defined below converges to zero (here $\sum_I$ is a summation over distinct indexes $(i, i', j, j')$):
$$E\left( \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \xi_{ij} \right)^2 = \frac{1}{K^2} \sum_i \sum_{j \ne i} \tilde P_{ij}^2 \, E \xi_{ij}^2 + \frac{1}{K^2} \sum_i \sum_{j \ne i} \sum_{i' \ne \{i,j\}} \tilde P_{ij} \tilde P_{ii'} \, E \xi_{ij} \xi_{ii'} + \frac{1}{K^2} \sum_I \tilde P_{ij} \tilde P_{i'j'} \, E \xi_{ij} \xi_{i'j'}. \qquad (9)$$
First, we prove that $\max_{i,j} E \xi_{ij}^2 < C$. We expand $\xi_{ij} = A_{1,ij} + A_{2,ij} + A_{3,ij}$, where:
$$A_{1,ij} = M_{ii} M_{jj} \left( \xi_i^{(1)} \xi_i^{(2)} \xi_j^{(1)} \xi_j^{(3)} - E[\xi_i^{(1)} \xi_i^{(2)} \xi_j^{(1)} \xi_j^{(3)}] \right) + M_{ij}^2 \left( \xi_i^{(1)} \xi_i^{(3)} \xi_j^{(1)} \xi_j^{(2)} - E[\xi_i^{(1)} \xi_i^{(3)} \xi_j^{(1)} \xi_j^{(2)}] \right),$$
$$A_{2,ij} = \xi_i^{(1)} \xi_j^{(1)} \sum_{i' \ne \{i,j\}} \left( M_{ii} M_{ji'} \xi_i^{(2)} \xi_{i'}^{(3)} + M_{ii'} M_{ij} \xi_{i'}^{(2)} \xi_i^{(3)} + M_{jj} M_{ii'} \xi_{i'}^{(2)} \xi_j^{(3)} + M_{ji'} M_{ij} \xi_j^{(2)} \xi_{i'}^{(3)} \right),$$
$$A_{3,ij} = \xi_i^{(1)} \xi_j^{(1)} \sum_{i' \ne \{i,j\}} \sum_{j' \ne \{i,j\}} M_{ii'} M_{jj'} \xi_{i'}^{(2)} \xi_{j'}^{(3)}.$$
It is sufficient to show that $\max_{i,j} E A_{s,ij}^2$ is bounded for all $s = 1, 2, 3$. The moment condition implies $E A_{1,ij}^2 \le C (M_{ii}^2 M_{jj}^2 + M_{ij}^4) \le C$. Below we use that non-zero correlations between summands in $A_{s,ij}$ imply that some indexes must coincide. We also use Lemma S1.1 from the Supplementary Appendix:
$$E A_{2,ij}^2 \le C \sum_{i'} \left( M_{ii}^2 M_{ji'}^2 + M_{ii'}^2 M_{ij}^2 + M_{jj}^2 M_{ii'}^2 + M_{ji'}^2 M_{ij}^2 \right) \le C,$$
$$E A_{3,ij}^2 \le C \sum_{i' \ne \{i,j\}} \sum_{j' \ne \{i,j\}} \left( P_{ii'}^2 P_{jj'}^2 + |P_{ii'} P_{jj'} P_{ij'} P_{ji'}| \right) \le C.$$
Next notice that
$$\tilde P_{ij} = \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2} \le \frac{P_{ij}^2}{(1 - P_{ii})(1 - P_{jj})} \le \frac{P_{ij}^2}{(1 - \delta)^2}. \qquad (10)$$
Lemma B1 in Chao et al. (2012) gives that $\sum_i \sum_{j \ne i} P_{ij}^2 \le K$ and $\sum_i \sum_{j \ne i} \sum_{j' \ne i, j' \ne j} P_{ij}^2 P_{ij'}^2 \le K$. Thus, given the bound $\max_{i,j} E \xi_{ij}^2 < C$ and, by the Cauchy-Schwarz inequality, $\max_{i,j,k} |E \xi_{ij} \xi_{ik}| < C$, the first two terms in expression (9) converge to zero.

For the last term in (9), since $i, i', j, j'$ are all distinct, we have $E A_{1,ij} A_{s,i'j'} = 0$ for $s = 2, 3$, and $E A_{2,ij} A_{3,i'j'} = 0$. The non-zero terms in $E \xi_{ij} \xi_{i'j'}$ are
$$|E A_{2,ij} A_{2,i'j'}| \le C |(M_{ii} M_{jj'} + M_{ij} M_{ij'})(M_{i'i'} M_{jj'} + M_{i'j} M_{i'j'})| + C |(M_{jj} M_{ii'} + M_{ji'} M_{ij})(M_{j'j'} M_{ii'} + M_{j'i'} M_{ij'})|,$$
$$|E A_{3,ij} A_{3,i'j'}| \le C \left( P_{ii'}^2 P_{jj'}^2 + P_{ij'}^2 P_{i'j}^2 \right).$$
Given inequality (10), the symmetry of summation, and statements (a)-(e) proved in Lemma S1.2 in the Supplementary Appendix, we obtain that the last term in equation (9) converges to zero. $\square$
Proof of Theorem 3.
Denote $\lambda_i = M_i \Pi$. Then
$$\hat\Phi = \frac{2}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \left( \eta_i + \Delta \Pi_i \right) \left( M_i \eta + \Delta \lambda_i \right) \left( \eta_j + \Delta \Pi_j \right) \left( M_j \eta + \Delta \lambda_j \right).$$
Let us define $\hat\Phi_0 = \frac{2}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \, \eta_i M_i \eta \, \eta_j M_j \eta$. Assumption 2 guarantees that the variance of $\eta_i = e_i + \Delta \cdot v_i$ is uniformly bounded. Lemma 2 with $\xi_i = (\eta_i, \eta_i, \eta_i)'$ gives $|\hat\Phi_0 - \Phi| \to_p 0$ uniformly over bounded $\Delta$. Lemma 3 with $\xi_i = (\eta_i, \eta_i, \eta_i, \eta_i)'$ implies $\hat\Phi - \hat\Phi_0 \to_p 0$. $\square$

Lemma 3.
Let $\xi_i = (\xi_i^{(1)}, \xi_i^{(2)}, \xi_i^{(3)}, \xi_i^{(4)})'$ be independent mean zero $4 \times 1$ random vectors such that $E\|\xi_i\|^4 < C$. Let Assumption 1 hold. Assume that $\lambda'\lambda \le \frac{C}{K} \Pi'\Pi$ and $\Delta^2 \cdot \frac{\Pi'\Pi}{K} \to 0$ as $N \to \infty$. Then
$$\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \left( \xi_i^{(1)} + \Delta \Pi_i \right) \left( M_i \xi^{(2)} + \Delta \lambda_i \right) \left( \xi_j^{(3)} + \Delta \Pi_j \right) \left( M_j \xi^{(4)} + \Delta \lambda_j \right) - \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \, \xi_i^{(1)} M_i \xi^{(2)} \, \xi_j^{(3)} M_j \xi^{(4)} \to_p 0.$$

Proof of Lemma 3.
We write the main expression of interest as a polynomial of fourth power in $\Delta$: $\Delta A_1 + \Delta^2 A_2 + \Delta^3 A_3 + \Delta^4 A_4$, and prove that all terms are negligible, $\Delta^l A_l \to_p 0$, by showing that their means and variances converge to zero. Notice that for expressions with identical structure but different components of $\xi_i$, the proof of their negligibility is exactly the same. Thus for simplicity we abuse the notation and drop the superscripts on $\xi_i$ when we can consolidate these expressions. For example, we write the expression for one of the terms in $A_3$ as $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \lambda_j \xi_j$, which collects both $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \lambda_j \xi_j^{(1)}$ and $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \lambda_j \xi_j^{(3)}$. We also treat $\xi_i$ in all expressions below as a scalar.
$$A_4 = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \Pi_j \lambda_j;$$
$$A_3 = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \lambda_j \xi_j + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \Pi_j M_j \xi;$$
$$A_2 = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \lambda_i \lambda_j \xi_i \xi_j + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \lambda_i \xi_i \Pi_j M_j \xi + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \lambda_i \Pi_i \xi_j M_j \xi + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \Pi_j M_i \xi M_j \xi;$$
$$A_1 = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \lambda_i \xi_i M_j \xi \, \xi_j + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i M_i \xi \, \xi_j M_j \xi.$$
Term $A_4$ is deterministic. We use bound (10) and Lemma S1.3 (d):
$$\Delta^4 |A_4| \le C \Delta^4 \frac{\Pi'\Pi \, \lambda'\lambda}{K} \le C \Delta^4 \frac{(\Pi'\Pi)^2}{K^2} \to 0.$$
Term $A_3$ is mean zero. Using the inequality $Var(X + Y) \le 2 Var(X) + 2 Var(Y)$ we have:
$$\Delta^6 Var(A_3) \le \frac{C \Delta^6}{K^2} \left( \sum_j \left( \sum_i P_{ij}^2 |\Pi_i| |\lambda_i| \right)^2 \lambda_j^2 + \sum_k \left( \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \Pi_j M_{jk} \right)^2 \right) \le \frac{C \Delta^6}{K^2} \left( (\lambda'\lambda)^2 \Pi'\Pi + \sum_{i, i', j, j'} P_{ij}^2 |\Pi_i \lambda_i \Pi_j| P_{i'j'}^2 |\Pi_{i'} \lambda_{i'} \Pi_{j'}| \sum_k |M_{jk} M_{j'k}| \right) \le \frac{C \Delta^6}{K^2} \left( (\lambda'\lambda)^2 \Pi'\Pi + (\Pi'\Pi)^2 \lambda'\lambda \right) \le C \Delta^6 \frac{(\Pi'\Pi)^3}{K^3} \to 0.$$
For the first inequality, we apply Assumption 2 and bound (10). Then we use the Cauchy-Schwarz inequality for the first summand: $\left( \sum_i P_{ij}^2 |\Pi_i| |\lambda_i| \right)^2 \le \Pi'\Pi \, \lambda'\lambda$. For the second summand, we apply Lemma S1.1 (ii) and Lemma S1.3 (c). Finally, we apply Lemmas S2.1 and S2.2 to get $\Delta^2 A_2 \to_p 0$ and $\Delta A_1 \to_p 0$. $\square$

Proof of Theorem 4. The infeasible version of the AR statistic under $\beta = \beta_0 + \Delta$ is:
$$\frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} e_i(\beta_0) e_j(\beta_0) = \frac{\Delta^2}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} \Pi_i \Pi_j + \frac{2\Delta}{\sqrt{K}\sqrt{\Phi}} \sum_i \left( \sum_{j \ne i} P_{ij} \Pi_j \right) \eta_i + \frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} \eta_i \eta_j. \qquad (11)$$
The first term in (11) is deterministic and equals $\frac{\Delta^2 \mu}{\sqrt{K}\sqrt{\Phi}}$. The second term has mean zero and variance
$$\frac{4 \Delta^2}{K \Phi} \sum_i \left( \sum_{j \ne i} P_{ij} \Pi_j \right)^2 Var(\eta_i) \le \frac{C c^2}{K \Phi} \sum_i w_i^2 \le C \frac{\Pi'\Pi}{K} \to 0.$$
Here we used that the variance of $\eta_i$ is bounded by Assumption 2, denoted $w_i = \sum_{j \ne i} P_{ij} \Pi_j$, and the final bound is proven in Lemma S1.4. Thus, the second term converges to zero in probability uniformly over $|\Delta| \le c$. The third term in (11) is asymptotically standard normal due to Lemma 1. Finally, we notice that
$$AR(\beta_0) = \sqrt{\frac{\Phi}{\hat\Phi}} \cdot \frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} e_i(\beta_0) e_j(\beta_0),$$
and apply Theorem 3. This finishes the proof of statement (4).

Now consider the case when $\frac{\mu}{\sqrt{K}\sqrt{\Phi}} \to \infty$ and $\Delta \ne 0$ is fixed. Above we proved that
$$\frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} e_i(\beta_0) e_j(\beta_0) = \frac{\Delta^2 \mu}{\sqrt{K}\sqrt{\Phi}} + o_p(1) + O_p(1).$$
Finally, Theorem 3 implies that $\frac{\hat\Phi}{\Phi} \to_p 1$. As a result, we have $AR(\beta_0) \to_p \infty$ when $\frac{\mu}{\sqrt{K}\sqrt{\Phi}} \to \infty$ and $\Delta \ne 0$ is fixed. This leads to rejection probability converging to 1. $\square$

Proof of Theorem 5.
Denote
$$Q = (Q_{ee}, Q_{Xe}, Q_{XX})' = \frac{1}{\sqrt{K}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} \left( e_i e_j, \; X_i e_j, \; X_i X_j \right)'.$$
A CLT for quadratic forms implies
$$\Sigma^{-1/2} \left( Q_{ee}, \; Q_{Xe}, \; Q_{XX} - \frac{\mu}{\sqrt{K}} \right)' \Rightarrow N(0, I_3),$$
where $\Sigma$ is the asymptotic covariance matrix of $Q$, with some of its elements written below:
$$\Psi = \frac{1}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \gamma_i \gamma_j + \frac{1}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \sigma_i^2 \varsigma_j^2 + \frac{1}{K} \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} \Pi_j \right)^2 \sigma_i^2 = AVar(Q_{Xe}),$$
$$\Upsilon = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \varsigma_i^2 \varsigma_j^2 + \frac{4}{K} \sum_{i=1}^N \varsigma_i^2 \left( \sum_{j \ne i} P_{ij} \Pi_j \right)^2 = AVar(Q_{XX}), \qquad (12)$$
$$\tau = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \varsigma_i^2 \gamma_j + \frac{2}{K} \sum_{i=1}^N \gamma_i \left( \sum_{j \ne i} P_{ij} \Pi_j \right)^2 = ACov(Q_{Xe}, Q_{XX}), \qquad \varrho = \frac{\tau}{\sqrt{\Psi}\sqrt{\Upsilon}}.$$
Note that $\hat e_i = Y_i - X_i \hat\beta_{JIVE} = e_i - X_i (\hat\beta_{JIVE} - \beta_0)$ and $(\hat\beta_{JIVE} - \beta_0) = Q_{Xe} / Q_{XX}$. Thus,
$$Wald(\beta_0) = \frac{ Q_{Xe}^2 }{ \frac{1}{K} \left[ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{\hat e_i M_i \hat e}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X \hat e_i \, M_j X \hat e_j \right] },$$
where the denominator expands to
$$\sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{\hat e_i M_i \hat e}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X \hat e_i \, M_j X \hat e_j = \left[ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{e_i M_i e}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X e_i \, M_j X e_j \right] - \frac{Q_{Xe}}{Q_{XX}} \left[ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \left( \frac{e_i M_i X}{M_{ii}} + \frac{X_i M_i e}{M_{ii}} \right) + 2 \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X e_i \, M_j X X_j \right] + \frac{Q_{Xe}^2}{Q_{XX}^2} \left[ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{X_i M_i X}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X X_i \, M_j X X_j \right].$$
Applying Lemma S3.1 from the Supplementary Appendix to the expanded expression of the denominator, we show that the three bracketed groups of terms, after division by $K$, converge to $\Psi$, $2\tau$ and $\Upsilon$ respectively. Then
$$Wald(\beta_0) = \frac{ Q_{Xe}^2 }{ \Psi - 2 \frac{Q_{Xe}}{Q_{XX}} \tau + \frac{Q_{Xe}^2}{Q_{XX}^2} \Upsilon } (1 + o_p(1)) = \frac{ Q_{Xe}^2 / \Psi }{ 1 - 2 \frac{Q_{Xe}/\sqrt{\Psi}}{Q_{XX}/\sqrt{\Upsilon}} \varrho + \frac{Q_{Xe}^2}{Q_{XX}^2} \frac{\Upsilon}{\Psi} } (1 + o_p(1)).$$
Consistency of $\hat\Upsilon$ (applying the argument of Theorem 3 with $\xi_i = (v_i, v_i, v_i, v_i)'$ and $\Delta = 1$) gives $\tilde F = \frac{Q_{XX}}{\sqrt{\Upsilon}} (1 + o_p(1))$. Thus, the statement of Theorem 5 holds, where we denote by $\left( \xi, \; \nu - \frac{\mu}{\sqrt{K}\sqrt{\Upsilon}} \right)$ the Gaussian limit of $\left( \frac{Q_{Xe}}{\sqrt{\Psi}}, \; \frac{Q_{XX}}{\sqrt{\Upsilon}} - \frac{\mu}{\sqrt{K}\sqrt{\Upsilon}} \right)$. $\square$

Proof of Corollary 1.
Denote $x = \frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$. If $x > 2.5$ then due to Theorem 5:
$$P_x \left\{ \tilde F > 4.14 \text{ and } Wald(\beta_0) \ge \chi^2_{1,0.95} \right\} \le P_x \left\{ Wald(\beta_0) \ge \chi^2_{1,0.95} \right\} \le 0.10.$$
If $x \le 2.5$ then due to the asymptotic gaussianity of $\tilde F$:
$$P_x \left\{ \tilde F > 4.14 \text{ and } Wald(\beta_0) \ge \chi^2_{1,0.95} \right\} \le P_x \left\{ \tilde F > 4.14 \right\} \le 0.05.$$
Finally, for any $x > 0$:
$$P_x \{ H_0 \text{ is rejected} \} = P \left\{ \tilde F > 4.14 \text{ and } Wald(\beta_0) \ge \chi^2_{1,0.95} \right\} + P \left\{ \tilde F \le 4.14 \text{ and } AR(\beta_0) \ge z_{0.95} \right\} \le 0.10 + P \left\{ AR(\beta_0) \ge z_{0.95} \right\} \le 0.15. \qquad \square$$
Appendix: Simulation Design

To create many instruments, we interact QOB dummies with dummies for year of birth (YOB) and place (state) of birth (POB). Interacting three QOB dummies with ten YOB and 50 POB dummies generates 180 excluded instruments. The excluded instruments are
$$Z_i = \left( \left( 1\{Q_i = q, C_i = c\} \right)'_{q \in \{1,2,3\}, \, c \in \{1930, \dots, 1939\}}, \; \left( 1\{Q_i = q, P_i = p\} \right)'_{q \in \{1,2,3\}, \, p \in \{\text{states}\}} \right)',$$
where $Q_i$, $C_i$, $P_i$ are $i$'s QOB, YOB and POB respectively. Note that $Z_i$ are not group instruments in the strict sense as they are not mutually exclusive. We exclude instruments with too few observations (those with $\sum_{i=1}^N Z_{ij}$ below a small cut-off) to satisfy the balanced design assumption (Assumption 1).

To increase the amount of omitted variable bias, we follow Angrist and Frandsen (2019) by taking the LIML model as the ground truth, where the outcome variable is $Y_i$ (income), the endogenous variable $X_i$ (highest grade completed) is instrumented by $Z_i$, and the control variables are a full set of POB-by-YOB interactions. Specifically, starting with the full 1980 census sample, we compute the average $X_i$ in each QOB-YOB-POB cell, $\bar s(q, c, p)$. We then estimate LIML and retain $\hat y(c, p)$, the second-stage fitted value after subtracting $\hat\beta_{LIML} X_i$, where $\hat\beta_{LIML}$ is the LIML estimate of the returns to schooling. We also retain the variance of LIML residuals $\omega^2(Q_i, C_i, P_i)$ to mimic the heteroskedasticity. The simulation model we consider is then
$$\tilde y_i = \bar y + 0.1 \, \tilde s_i + \omega(Q_i, C_i, P_i)(\nu_i + \kappa_2 \epsilon_i), \qquad \tilde s_i \sim Poisson(\mu_i),$$
for independent standard normal $\nu_i$ and $\epsilon_i$. Here $\bar y = \frac{1}{N} \sum_i \hat y(C_i, P_i)$ and $\mu_i = \max\{1, \gamma_0 + \gamma_Z' Z_i + \kappa_1 \nu_i\}$, where $\gamma_0 + \gamma_Z' Z_i$ is the projection of $\bar s(Q_i, C_i, P_i)$ onto a constant and $Z_i$. We set $\kappa_1 = 1$ and $\kappa_2 = 0.1$.
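As a rough illustration of this design, the sketch below simulates outcomes and schooling given instrument and calibration inputs. All names are hypothetical, and the default constants mirror the (partially garbled) calibration above, so they should be treated as placeholders rather than the paper's exact values.

```python
import numpy as np

def simulate_ak91_design(Z, s_proj, y_bar, omega, kappa1=1.0, kappa2=0.1, rng=None):
    """Sketch of the Angrist-Frandsen-style DGP described above.

    Z:      N x K instrument matrix (QOB interactions)
    s_proj: gamma0 + gamma_Z'Z_i, projection of cell-mean schooling on Z
    y_bar:  average second-stage fitted value
    omega:  cell-level scale of LIML residuals (mimics heteroskedasticity)
    kappa1, kappa2: calibration constants (placeholders, see text)
    """
    rng = rng or np.random.default_rng(0)
    N = Z.shape[0]
    nu = rng.standard_normal(N)                 # common shock: omitted-variable bias
    eps = rng.standard_normal(N)
    mu = np.maximum(1.0, s_proj + kappa1 * nu)  # Poisson intensity for schooling
    s = rng.poisson(mu)                         # nonlinear first stage
    y = y_bar + 0.1 * s + omega * (nu + kappa2 * eps)
    return y, s
```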