Inference with Many Weak Instruments
Anna Mikusheva and Liyang Sun

Department of Economics, M.I.T. Address: 77 Massachusetts Avenue, E52-526, Cambridge, MA 02139. Email: [email protected]. National Science Foundation support under grant number 1757199 is gratefully acknowledged. We are grateful to Josh Angrist, Kirill Evdokimov and Whitney Newey for advice, to Brigham Frandsen for sharing code for simulations, and to Ben Deaner and Sylvia Klosin for research assistance. Replication code is available at http://economics.mit.edu/grad/lsun20/.
Department of Economics, M.I.T. Address: 77 Massachusetts Avenue, E52-300, Cambridge, MA 02139. Email: [email protected].

Abstract
We develop a concept of weak identification in linear IV models in which the number of instruments can grow at the same rate as, or more slowly than, the sample size. We propose a jackknifed version of the classical weak identification-robust Anderson-Rubin (AR) test statistic. Large-sample inference based on the jackknifed AR is valid under heteroscedasticity and weak identification. The feasible version of this statistic uses a novel variance estimator. The test has uniformly correct size and good power properties. We also develop a pre-test for weak identification that is related to the size property of a Wald test based on the Jackknife Instrumental Variable Estimator (JIVE). This new pre-test is valid under heteroscedasticity and with many instruments.
Key words: instrumental variables, weak identification, dimensionality asymptotics.
JEL classification codes: C12, C36, C55.
Recent empirical applications of instrumental variables (IV) estimation often involve many instruments that together may or may not be strongly relevant. A prominent example is Angrist and Krueger (1991), the paper that started the weak IV literature; it uses 180 instruments formed by interacting dummies for the quarter of birth with state and year of birth. Other examples include papers that employ an empirical strategy known as "judge design" (Maestas et al., 2013; Sampat and Williams, 2015; Dobbie et al., 2018). Fueled by rich administrative data, these papers use the exogenous assignment of cases to judges as instruments for treatment. Since each judge can only process a certain number of cases out of the total court cases, the number of judges (the number of instruments) is usually proportional to the sample size. Another example is the famous Fama-MacBeth procedure in asset pricing.

We prove that even in a homoscedastic model with known covariance, an asymptotically consistent test does not exist if the ratio of the concentration parameter over the square root of the number of instruments stays bounded in large samples. Thus, a necessary condition for a consistent test to exist is that the concentration parameter grows faster than the square root of the number of instruments. Later, we show that this is also a sufficient condition by constructing a robust test that becomes consistent when this condition is satisfied.

We propose a new jackknifed version of the Anderson-Rubin (AR) test which is robust to both weak identification and heteroscedasticity in a model with many instruments. The new test uses an asymptotic approximation based on a Central Limit Theorem (CLT) for quadratic forms. The new AR test has the correct size regardless of identification strength and becomes consistent as soon as the concentration parameter grows faster than the square root of the number of instruments.

As an important technical contribution, we introduce a novel variance estimator for the quadratic form CLT. The target variance is a quadratic form of the individual (heteroscedastic) variances of errors. We apply cross-fitting (Newey and Robins, 2018; Kline et al., 2019) to produce unbiased proxies for the individual variances of errors. We adjust the quadratic form to remove the bias due to correlations between proxies. We prove the consistency of the new estimator under the null and local alternatives.

Finally, we propose a new pre-test for weak identification which is easy to use and is consistent with our definition of weak identification. An empirical researcher can use our pre-test to decide between employing our jackknife AR test, if the pre-test suggests that the identification is weak, or a Wald test based on the Jackknife Instrumental Variable Estimator (JIVE, Angrist et al., 1999), if the pre-test suggests that the identification is strong. We guarantee the size of this two-step procedure.
Chao et al. (2012) prove that JIVE is consistent in a heteroscedastic model when the concentration parameter grows faster than the square root of the number of instruments. Chao et al. (2012) also derive a consistent estimator of the JIVE standard error. The two-step procedure is appealing because when identification is strong, the JIVE-Wald is more efficient and easy to implement and report.

Our pre-test is in the spirit of Stock and Yogo (2005), but it differs from theirs in two important ways. Firstly, our pre-test allows for a general form of heteroscedasticity, while the pre-test proposed in Stock and Yogo (2005) works only under conditionally homoscedastic errors. Secondly, the Stock and Yogo (2005) pre-test is designed for a small number of instruments and is based on the Two-Stage Least Squares (TSLS) estimator. With many instruments TSLS is consistent only when the concentration parameter grows faster than the number of instruments, which makes the Stock and Yogo (2005) pre-test not very informative.

We apply our pre-test to Angrist and Krueger (1991) and find that their identification is strong. Consequently the JIVE confidence set is reliable (has coverage within a 5% tolerance level of the declared coverage). Our weak identification-robust jackknife AR confidence set is somewhat wider than the JIVE confidence set but is still informative.
Relation to the Literature. Our paper contributes to both the literature on weak IV and the literature on many instruments. The weak IV literature relates identification strength to the size of the concentration parameter and proposes robust tests that work only when there are a small number of instruments. Generalizations to many weak instruments either strongly restrict the number of instruments (Andrews and Stock, 2007) or work only under homoscedasticity (Anatolyev and Gospodinov, 2011).

The many instruments literature mostly establishes consistency conditions for particular estimators. For example, Chao and Swanson (2005) show that in a homoscedastic model limited information maximum likelihood (LIML) and bias-corrected TSLS (BTSLS) are consistent when the concentration parameter grows faster than the square root of the number of instruments. In a heteroscedastic model, consistency of LIML and BTSLS requires that the concentration parameter grows faster than the number of instruments. By contrast, JIVE remains consistent when the concentration parameter grows faster than the square root of the number of instruments (Chao et al. (2012)).

The remainder of this paper is organized as follows. In Section 2 we introduce our definition of weak identification in an environment with many instruments. In Section 3 we construct the jackknife AR test and establish its power properties. In Section 4 we present the pre-test and prove that it controls size. Section 5 reports our pre-test results for Angrist and Krueger (1991) and conducts a simulation exercise inspired by Angrist and Frandsen (2019), and Section 6 concludes. Some proofs and additional results may be found in the Supplementary Appendix.
2 Weak Identification with Many Instruments

We study the linear IV regression with a scalar outcome $Y_i$, a potentially endogenous scalar regressor $X_i$ and a $K \times 1$ vector of instrumental variables $Z_i$:
$$Y_i = \beta X_i + e_i, \qquad X_i = \Pi_i + v_i, \qquad (1)$$
for $i = 1, \dots, N$. We denote $\Pi_i = E[X_i \mid Z_i]$ and allow the instruments to affect the endogenous regressor in a non-linear way. All results in this paper hold conditionally on a realization of the instruments. Thus, we treat the instruments as fixed (non-random) and $\Pi_i$ as some constants. The mean-zero errors $(e_i, v_i)$ are independent across $i$ but not identically distributed and may be heteroscedastic. We assume without loss of generality that there are no controls included in our model as they may be partialled out.

Weak identification under small $K$ is studied extensively in the weak IV literature. For Gaussian homoscedastic errors $(e_i, v_i)$ and linear first stage ($\Pi_i = \pi' Z_i$), the strength of the instruments corresponds directly to the concentration parameter, $\frac{\pi' Z'Z \pi}{\sigma_v^2}$, where $\sigma_v^2 = Var(v_i)$. The concentration parameter equals the signal-to-noise ratio in the first-stage regression and is related to the bias of the TSLS estimator and the quality of the Gaussian approximation for the TSLS t-statistic. For the general case with homoscedastic errors, Staiger and Stock (1997) introduced weak instrument-asymptotics in which one considers a sequence of models such that the concentration parameter converges to a constant as $N \to \infty$. Under this asymptotic embedding, neither a consistent estimator of $\beta$ nor a consistent test of the null hypothesis that $\beta$ equals some scalar exists, and the test based on the TSLS t-statistic severely over-rejects.

The magnitude of the concentration parameter is not a good indicator of identification strength when the number of instruments is large. We model large $K$ by considering $K \to \infty$ as $N \to \infty$, with the only restriction that $K$ is at most a fraction of $N$. Under this many instrument-asymptotics, Theorem 1 below shows that the re-scaled concentration parameter $\frac{\pi' Z'Z \pi}{\sigma_v^2 \sqrt{K}}$ provides a characterization of weak identification in terms of the consistency of tests.
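For intuition, here is a small numerical sketch of this setup (our own illustration, not taken from the paper): it simulates model (1) with group-dummy instruments, a stylized judge design, and evaluates the re-scaled concentration parameter of Theorem 1. The design, coefficient scale, and seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 2000, 200                      # many instruments: K grows with N

# Group ("judge") dummies: K equally sized groups, so P_ii = K/N for all i.
groups = np.repeat(np.arange(K), N // K)
Z = np.zeros((N, K))
Z[np.arange(N), groups] = 1.0

pi = 0.1 * rng.standard_normal(K)     # first-stage coefficients (hypothetical scale)
sigma_v2 = 1.0                        # homoscedastic first-stage error variance

# Re-scaled concentration parameter from Theorem 1: pi'Z'Zpi / (sigma_v^2 sqrt(K)).
Zpi = Z @ pi
print("concentration / sqrt(K) =", Zpi @ Zpi / (sigma_v2 * np.sqrt(K)))
```

Identification is weak in our sense when this quantity stays bounded as $N$ and $K$ grow.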
Theorem 1. Assume we have a sample from model (1) with linear first stage $\Pi_i = \pi' Z_i$, where the errors $(e_i, v_i)$ are independently drawn from a Gaussian distribution $N(0, \Omega)$ with a known covariance $\Omega$. Assume that the $K \times K$ matrix $Z'Z$ has rank $K$ and $K \to \infty$ as $N \to \infty$. For any sample of size $N$ let $\Psi_N$ be the class of all tests of size $\alpha$ for testing the hypothesis $H_0: \beta = \beta_0$, that is, any $\psi \in \Psi_N$ is a measurable function from $\{(Y_i, X_i, Z_i), i = 1, \dots, N\}$ to the interval $[0, 1]$ such that $E_{\beta_0, \pi} \psi \le \alpha$ for any value of $\pi \in \mathbb{R}^K$. Then for any $\beta^* \ne \beta_0$ we have
$$\limsup_{N \to \infty} \; \max_{\psi \in \Psi_N} \; \min_{\pi:\, \frac{\pi' Z'Z \pi}{\sigma_v^2 \sqrt{K}} \le C} E_{\beta^*, \pi} \, \psi < 1.$$

The setting considered in Theorem 1 is quite favorable: the first stage is linear, errors are Gaussian and homoscedastic with known covariance matrix. So the only unknown parameters are $\beta$ and $\pi$. Theorem 1 states that even in this favorable setting there exists no test that consistently differentiates any $\beta^*$ from $\beta_0$ if the ratio $\frac{\pi' Z'Z \pi}{\sigma_v^2 \sqrt{K}}$ is bounded. Indeed, for any test $\psi$ we can find its guaranteed power $E_{\beta^*, \pi} \psi$ by minimizing over the alternatives $(\beta^*, \pi)$ with bounded $\frac{\pi' Z'Z \pi}{\sigma_v^2 \sqrt{K}}$. We show that even the test that achieves the maximum guaranteed power has guaranteed power strictly less than one asymptotically. Later we show that in a more general heteroscedastic model we can construct a robust test that becomes consistent when $\frac{\Pi'\Pi}{\sqrt{K}} \to \infty$.

Theorem 1 can also be used to characterize weak identification in terms of consistent estimation since it implies there exists no consistent estimator for $\beta$ when $\frac{\pi' Z'Z \pi}{\sqrt{K}}$ is bounded. Our result complements the literature on estimation with many instruments. Chao and Swanson (2005) show that with homoscedastic errors, when $K$ grows proportionally to the sample size the TSLS estimator is consistent only if $\frac{\pi' Z'Z \pi}{K} \to \infty$, while LIML and BTSLS estimators are consistent when $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$. However, under heteroscedasticity, even when $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$, LIML and BTSLS become inconsistent, but JIVE is still consistent, according to Chao et al. (2012).

The proof of Theorem 1 builds on several classical papers. Following the approach of Andrews et al. (2006), we first reduce the class of tests to those based on a sufficient statistic. Among these tests, the minimal power is achieved by a test invariant to rotations of the instruments. This observation allows us to further reduce our attention to invariant tests, which depend on the data only through its maximal invariant under rotations. Then we derive a limit experiment for $K \to \infty$ similar to that derived in Andrews and Stock (2007). In this limit experiment the minimax power is less than one. Finally we use the argument of Müller (2011) to bound the desired asymptotic minimax power using the minimax power obtained in the limit experiment.

3 The Jackknife AR Test

The goal of this section is to introduce a test robust to weak identification in the heteroscedastic IV model when the number of instruments, $K$, is large.

The existing weak IV literature proposes several weak identification-robust tests of the null hypothesis $H_0: \beta = \beta_0$ when $K$ is small. These tests have correct size when the identification is weak and become consistent when the identification is strong. One example is the Anderson-Rubin (AR) test. Specifically, the IV model (1) implies that under a given null hypothesis $H_0: \beta = \beta_0$, the exogeneity assumption $E[Z'e(\beta_0)] = 0$ holds for the implied error $e(\beta_0) = Y - \beta_0 X$.
Then under mild assumptions, the scaled sample analog satisfies a $K$-dimensional Central Limit Theorem: $\frac{1}{\sqrt{N}} Z'e(\beta_0) \Rightarrow N(0, \Sigma)$. The AR statistic is defined as $\frac{1}{N} e(\beta_0)' Z \hat\Sigma^{-1} Z' e(\beta_0)$, where $\hat\Sigma$ is a consistent estimator of $Var\left(\frac{1}{\sqrt{N}} Z'e\right)$. The AR test rejects the null hypothesis when the AR statistic exceeds the $(1-\alpha)$ quantile of the $\chi^2_K$ distribution. The AR test has asymptotically correct size regardless of the value of the first stage coefficients $\Pi_i$ and is asymptotically consistent when an analog of the concentration parameter grows to infinity.

Generalizing the AR statistic to the large-$K$ setting is challenging for multiple reasons. Firstly, the covariance matrix $\Sigma$ has dimension $K \times K$. Its consistent estimation is problematic if not impossible under general heteroscedasticity. Secondly, the AR statistic under the null has an improperly centered limit distribution because $\chi^2_K$ has a very large mean. Thirdly, the $K$-dimensional Central Limit Theorem provides a poor approximation to the AR statistic when $K$ is large.

We propose an analog of the AR test that is heteroscedasticity-robust and weak identification-robust in the presence of a large number of instruments. Denote the projection matrix $P = Z(Z'Z)^{-1}Z'$. Our test rejects the null of $H_0: \beta = \beta_0$ when the jackknife AR statistic
$$AR(\beta_0) = \frac{1}{\sqrt{K}\sqrt{\hat\Phi}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} \, e_i(\beta_0) e_j(\beta_0) \qquad (2)$$
exceeds the $(1-\alpha)$ quantile of the standard normal distribution. We defer the discussion of the estimator of the variance $\hat\Phi$ to the next subsection.

To address the challenges with the existing AR statistic, the AR statistic we propose uses the default homoscedasticity-inspired weighting $(Z'Z)^{-1}$ in place of $\hat\Sigma^{-1}$. With the $(Z'Z)^{-1}$ weighting, the existing AR statistic has a quadratic form $e(\beta_0)' P e(\beta_0)$. However, this quadratic form is not centered at zero as it contains the term $\sum_{i=1}^N P_{ii} e_i^2$, and each summand has positive mean. We thus remove this term from the quadratic form. This re-centering can be referred to as leave-one-out or jackknife. In the context of consistent estimation under many instruments, this leave-one-out idea was introduced by Angrist et al. (1999) and fruitfully exploited in a number of papers including Hausman et al. (2012) and Chao et al. (2012). In order to create a test of correct size based on our AR statistic, we use a Central Limit Theorem for quadratic forms proved in Chao et al. (2012) that is restated below.
Assumption 1. $P$ is an $N \times N$ projection matrix of rank $K$, $K \to \infty$ as $N \to \infty$, and there exists a constant $\delta$ such that $P_{ii} \le \delta < 1$.

Lemma 1 (Chao et al. (2012)). Let Assumption 1 hold for matrix $P$. Assume the errors $\eta_i$ are independent, $E\eta_i = 0$, and there exists a constant $C$ such that $\max_i E\eta_i^4 < C$. Then
$$\frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} \eta_i \eta_j \Rightarrow N(0, 1), \qquad \text{where } \Phi = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \, Var(\eta_i) Var(\eta_j).$$

The assumption $P_{ii} \le \delta < 1$ implies that $\frac{K}{N} = \frac{1}{N} \sum_{i=1}^N P_{ii} \le \delta < 1$. This assumption is often referred to as a balanced design assumption. In the case of group-dummy instruments, $P_{ii}$ is equal to the ratio of the size of the group that observation $i$ belongs to over $N$. Assumption 1 can be checked for any specific design.

While Lemma 1 requires $K \to \infty$, the Gaussian approximation may work well for smaller $K$ as well. For example, if $K$ is fixed and errors are homoscedastic, then
$$\frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} \eta_i \eta_j \Rightarrow \frac{\chi^2_K - K}{\sqrt{2K}} \qquad \text{as } N \to \infty.$$
We prove this statement in the Supplementary Appendix S4. While the limit here is not Gaussian, it is very well approximated by a standard normal distribution even for relatively small $K$. The random variable $\frac{\chi^2_K - K}{\sqrt{2K}}$ exceeds the 95% quantile of the standard normal distribution at most 7% of the time for all $K$, and at most 6% of the time for larger $K$.

3.1 Variance estimation

In order to conduct asymptotically valid inference based on the normal approximation in Lemma 1, we need an estimator for the scale parameter $\Phi$ which is consistent under the null. One 'naive' estimator that achieves this is $\hat\Phi = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \, e_i^2(\beta_0) e_j^2(\beta_0)$, which uses the square of the implied error as an estimator for the $i$-th error variance. Under the null, when $e_i(\beta_0) = e_i$, this estimator $\hat\Phi$ is consistent under relatively mild conditions. However, using this $\hat\Phi$ in a test would result in poor power. To see this, note that under an alternative value of the parameter $\beta = \beta_0 + \Delta$, we can plug in the first stage and write the implied error $e_i(\beta_0) = Y_i - \beta_0 X_i$ as the sum of a non-trivial mean $\Delta \Pi_i$ and a mean-zero random term $\eta_i = e_i + \Delta v_i$:
$$e_i(\beta_0) = \Delta \Pi_i + \eta_i.$$
While squaring $e_i(\beta_0)$ makes it an unbiased estimator for $Var(e_i)$ under the null, it is biased under the alternative when $\Delta \ne 0$. The bias in $\hat\Phi$ grows at the same order as the fourth power of $\Delta$, which brings down the power of the test against distant alternatives.

In order to remove the bias in $e_i^2(\beta_0)$ under the alternatives, one may residualize the implied error before squaring. However, this introduces a bias under the null. Denote $M = I - P$ and let $M_i$ be the $i$-th row of $M$. Even under the null, the squared residualized error is biased: $E(M_i e)^2 \ne Var(e_i)$. This is because the squared residual contains not only the squared error $e_i^2$ but also the square of the regression estimation mistake. The latter can be large when the number of regressors $K$ is large.

This bias can be removed successfully using the cross-fit variance estimator suggested in Kline et al. (2019) and Newey and Robins (2018). Namely, they show that a product of the implied error and residual achieves both goals: it removes the linearly predictable part of the implied error and remains an unbiased estimator of the variance:
$$E\left[ e_i \frac{M_i e}{M_{ii}} \right] = Var(e_i).$$
Our challenge is that the scale parameter $\Phi$ defined in Lemma 1 is a quadratic form with a double summation. Residuals $M_i e(\beta_0)$ and $M_j e(\beta_0)$ are correlated since they contain the same estimation mistake.
One can show that
$$E\left[ e_i M_i e \; e_j M_j e \right] = \left( M_{ii} M_{jj} + M_{ij}^2 \right) Var(e_i) Var(e_j).$$
Our proposed estimator of the scale parameter $\Phi$ re-weights each term in the summation to remove the bias described above:
$$\hat\Phi = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2} \left[ e_i(\beta_0) M_i e(\beta_0) \right] \left[ e_j(\beta_0) M_j e(\beta_0) \right]. \qquad (3)$$
We establish the consistency of $\hat\Phi$ under the null and extend this result to local alternatives.
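Equations (2) and (3) are straightforward to compute directly. The following Python sketch (our illustration, with hypothetical function and variable names, not the authors' replication code) forms the feasible jackknife AR statistic; it materializes the full $N \times N$ projection matrix, so it is only meant for moderate sample sizes.

```python
import numpy as np

def jackknife_ar(Y, X, Z, beta0):
    """Jackknife AR statistic (2) with the cross-fit variance estimator (3).

    A sketch of the construction in the text: compare the returned value with
    the (1 - alpha) quantile of N(0, 1) to run the test.
    """
    N, K = Z.shape
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)       # projection on the instruments
    M = np.eye(N) - P                           # annihilator matrix
    e = Y - beta0 * X                           # implied errors e_i(beta0)

    P0 = P - np.diag(np.diag(P))                # P with zeroed diagonal (j != i terms)
    numerator = e @ P0 @ e                      # sum over i != j of P_ij e_i e_j

    Me = M @ e                                  # residualized implied errors
    cf = e * Me                                 # cross-fit products e_i * (M e)_i
    d = np.diag(M)
    W = P0**2 / (np.outer(d, d) + M**2)         # P_ij^2 / (M_ii M_jj + M_ij^2)
    np.fill_diagonal(W, 0.0)
    Phi_hat = (2.0 / K) * cf @ W @ cf           # estimator (3)

    return numerator / (np.sqrt(K) * np.sqrt(Phi_hat))
```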
Assumption 2. Errors $\epsilon_i$, $i = 1, \dots, N$, are independent with $E\epsilon_i = 0$ and $\max_i E\|\epsilon_i\|^4 < \infty$, and for some constants $c_*$ and $C^*$ that do not depend on $N$,
$$c_* \le \min_i \min_x \frac{x' Var(\epsilon_i) x}{x'x} \le \max_i \max_x \frac{x' Var(\epsilon_i) x}{x'x} \le C^*.$$
Theorem 2. Let Assumption 1 hold for matrix $P$ and Assumption 2 hold for errors $e_i$. Then for $\beta = \beta_0$ we have $\frac{\hat\Phi}{\Phi} \to_p 1$ as $N \to \infty$.

Theorem 2 combined with Lemma 1 implies that under the null $H_0: \beta = \beta_0$ our proposed AR statistic has an asymptotically standard normal distribution. Since no assumption about identification is made, the resulting AR test has asymptotically correct size regardless of the strength of identification.
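As an informal check of this size result, one can simulate the null rejection rate under heteroscedasticity and a weak first stage. The sketch below reuses the hypothetical `jackknife_ar` function from the previous block; the error design is arbitrary and chosen only to be heteroscedastic.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 1000, 100
groups = np.repeat(np.arange(K), N // K)        # group-dummy instruments
Z = np.zeros((N, K))
Z[np.arange(N), groups] = 1.0
pi = 0.05 * rng.standard_normal(K)              # weak first stage (hypothetical)

rej = 0
for _ in range(500):
    v = rng.standard_normal(N)
    e = 0.5 * v + (1 + Z[:, 0]) * rng.standard_normal(N)   # heteroscedastic errors
    X = Z @ pi + v
    Y = 0.0 * X + e                             # true beta = 0
    rej += jackknife_ar(Y, X, Z, beta0=0.0) > 1.645        # nominal 5% test
print("null rejection rate:", rej / 500)        # should be near 0.05
```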
Theorem 3. Let Assumption 1 hold for matrix $P$ and Assumption 2 hold for errors $\epsilon_i = (e_i, v_i)'$, and let $\Pi' M \Pi \le \frac{C}{K} \Pi'\Pi$. Then for $\beta = \beta_0 + \Delta$ such that $\Delta^2 \cdot \frac{\Pi'\Pi}{K} \to 0$, we have $\frac{\hat\Phi}{\Phi} \to_p 1$ as $N \to \infty$.

Theorem 3 establishes the consistency of the variance estimator when the null hypothesis does not hold. We use Theorem 3 to derive local power curves of the AR test discussed in the next section. The variance estimator (3) residualizes some implied errors $M_i e(\beta_0)$ to remove the non-trivial mean of $e(\beta_0)$ under the alternative. The residualization is complete if the first stage is linear, $\Pi_i = \pi' Z_i$. We do not impose such an assumption in Theorem 3. Instead we require that the approximation of $\Pi_i$ by a linear combination of instruments improves with the number of instruments, as measured by the $L_2$ norm of the approximation mistake, $\Pi' M \Pi$.

Let us introduce a jackknife measure of the information contained in the instruments:
$$\mu = \sum_{i=1}^N \sum_{j \ne i} P_{ij} \Pi_i \Pi_j.$$
Theorem 4. Let $P_\beta$ be a probability measure describing the distribution of $AR(\beta_0)$ defined in (2) and (3) under model (1) with parameter $\beta = \beta_0 + \Delta$. Assume that the sequence of first stage parameters $\Pi$ satisfies the following assumptions: $\Pi' M \Pi \le \frac{C}{K} \Pi'\Pi$ and $\frac{\Pi'\Pi}{K} \to 0$ as $N \to \infty$. If Assumption 1 holds and the errors $\epsilon_i = (e_i, v_i)'$ satisfy Assumption 2, then for any positive constant $c$ we have:
$$\lim_{N \to \infty} \sup_{|\Delta| \le c} \sup_z \left| P_\beta\{ AR(\beta_0) < z \} - F\left( z - \frac{\Delta^2 \mu}{\sqrt{K}\sqrt{\Phi}} \right) \right| = 0, \qquad (4)$$
where $F(\cdot)$ is the standard normal cdf. If the sequence of first stage parameters additionally satisfies the condition $\frac{\mu}{\sqrt{K}\sqrt{\Phi}} \to \infty$, then for any fixed $\Delta \ne 0$ the jackknife AR test is asymptotically consistent:
$$\lim_{N \to \infty} P_\beta\{ AR(\beta_0) > z_{1-\alpha} \} = 1.$$

Equation (4) of Theorem 4 characterizes the local power curves of the jackknife AR test. The power under the alternative $\beta = \beta_0 + \Delta$ is a function of the distance $\Delta$ between the alternative $\beta$ and the null $\beta_0$, the number of instruments $K$, a measure of identification strength $\mu$, and the degree of uncertainty $\sqrt{\Phi}$. Our jackknife AR statistic can be negative, unlike the AR statistic from the small-$K$ case, which is always non-negative. We reject the null when $AR(\beta_0)$ exceeds the $(1-\alpha)$ quantile of the standard normal distribution. Under the alternative $\beta = \beta_0 + \Delta$, the AR statistic has a positive drift, which gives rise to a two-sided test. The second statement of Theorem 4 shows that the AR test consistently distinguishes $\beta$ from $\beta_0$ as long as $\frac{\mu}{\sqrt{K}\sqrt{\Phi}} \to \infty$.

Our measure of identification strength, $\mu$, has a form similar to the numerator of the concentration parameter defined for the homoscedastic small-$K$ case. Though the two forms are similar, there is an important distinction between them. In our case the signal strength is measured by a jackknife form, while in the homoscedastic small-$K$ case it is measured by $\Pi' P \Pi = \sum_{i=1}^N \sum_{j=1}^N P_{ij} \Pi_i \Pi_j$. The instruments may affect the endogenous regressor in an arbitrarily non-linear way, and only the projection of $\Pi$ onto the linear space of the instruments is used by the linear IV regression. Thus the projection matrix appears naturally in our measure of identification strength. If the effect of instruments on the regressor ($\Pi$) is well approximated by the linear first stage ($\Pi' M \Pi \le \frac{C}{K} \Pi'\Pi$), then the strength of identification has the same order as $\Pi'\Pi$ in the sense that they grow to infinity or stay bounded simultaneously. Indeed, under Assumption 1 we have:
$$\left( 1 - \delta - \frac{C}{K} \right) \Pi'\Pi \le \mu = \Pi'\Pi - \Pi' M \Pi - \sum_{i=1}^N P_{ii} \Pi_i^2 \le \Pi'\Pi.$$
Theorem 4 implies that $\frac{\mu}{\sqrt{K}} \to \infty$ is a sufficient condition for the consistency of the jackknife AR test. When the first stage is well approximated by a linear combination of the instruments, this translates to a sufficient condition of $\frac{\Pi'\Pi}{\sqrt{K}} \to \infty$. This complements Theorem 1, which implies that $\frac{\Pi'\Pi}{\sqrt{K}} \to \infty$ is necessary for the consistency of any test. It is worth noticing that the condition $\frac{\Pi'\Pi}{K} \to 0$ imposed by Theorem 4 is quite weak, as it covers both weakly and strongly identified cases.

4 Pre-test for Weak Identification

In a prominent paper, Stock and Yogo (2005) introduced a pre-test for weak identification that has gained enormous popularity in applied work. In homoscedastic IV models with small $K$, the concentration parameter fully characterizes the worst bias of the TSLS as a fraction of the OLS bias and the worst rejection rate of the TSLS-Wald test.
Since the first stage F statistic measures the concentration parameter, Stock and Yogo (2005) suggest a set of cut-offs for the first stage F statistic, above which a researcher can guarantee with high (prespecified) probability that the bias of TSLS is not larger than 10% of the OLS bias, or that the TSLS-Wald statistic does not over-reject by more than 5%. The cut-offs depend on the goal (bias or size) and the number of instruments. However, these details seem to be mostly disregarded in empirical practice, as the most common guidance suggests a cut-off of 10, regardless of the goal or the number of instruments.

As with any procedure of such generality, it suffers from multiple drawbacks. First, the pre-test is valid only if the model is homoscedastic. Andrews (2018) shows that in models calibrated to commonly-used data sets with heteroscedasticity one may find cases with first stage F statistics exceeding 1000 that have large over-rejections of the TSLS-Wald test.

Second, the TSLS estimator is less robust to weak identification when $K$ is large. In a homoscedastic model when $K$ is growing proportionally to the sample size, the TSLS estimator is consistent only if $\frac{\pi' Z'Z \pi}{K} \to \infty$, while LIML and BTSLS estimators are consistent when $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$ (see Chao and Swanson (2005)). In this case, the pre-test becomes too conservative. Indeed, if $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$ but $\frac{\pi' Z'Z \pi}{K}$ stays bounded, then the pre-test most likely declares weak identification, as the expectation of the first stage F equals $\frac{\pi' Z'Z \pi}{K \sigma_v^2} + 1$, even though there exist consistent estimators and a reasonable Wald test can be constructed.

We propose a new pre-test for weak identification that allows us to form a two-step procedure: a researcher first assesses instrument strength based on our pre-test and then uses the JIVE-Wald test if the instruments appear strong and our jackknife AR test if they appear weak. We can guarantee the size of such a two-step procedure in a heteroscedastic IV model with large $K$. Our pre-test uses an empirical measure of $\frac{\mu}{\sqrt{K}}$, whose value characterizes weak identification as discussed in the previous sections:
$$\tilde F = \frac{1}{\sqrt{K}\sqrt{\hat\Upsilon}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} X_i X_j, \qquad (5)$$
where $\hat\Upsilon = \frac{2}{K} \sum_i \sum_{j \ne i} \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2} X_i M_i X \, X_j M_j X$ is an estimate of the variance $\Upsilon$ defined in (12). The JIVE-Wald test uses the JIV2 estimator introduced in Angrist et al. (1999):
$$\hat\beta_{JIVE} = \frac{\sum_{i=1}^N \sum_{j \ne i} P_{ij} Y_i X_j}{\sum_{i=1}^N \sum_{j \ne i} P_{ij} X_i X_j}.$$
Our choice of JIVE is based on two considerations. First, according to Hausman et al. (2012), in a heteroscedastic IV model, when $\frac{\pi' Z'Z \pi}{\sqrt{K}} \to \infty$, LIML and BTSLS become inconsistent, but JIVE is consistent. Second, the JIVE estimator is a ratio of two quadratic forms similar to the jackknife AR statistic. We use the following estimator of the JIVE variance, which is a cross-fit version of the estimator derived in Chao et al. (2012):
$$\hat V = \frac{ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{\hat e_i M_i \hat e}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} \, M_i X \hat e_i \, M_j X \hat e_j }{ \left( \sum_{i=1}^N \sum_{j \ne i} P_{ij} X_i X_j \right)^2 },$$
where $\hat e_i = Y_i - X_i \hat\beta_{JIVE}$. The Wald statistic is defined as
$$Wald(\beta_0) = \frac{ \left( \hat\beta_{JIVE} - \beta_0 \right)^2 }{ \hat V }.$$
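The JIVE-Wald and pre-test quantities above can be sketched in the same style as the jackknife AR code. The helper below is again our own hypothetical illustration; the cut-off 4.14 mentioned in its docstring comes from Corollary 1 below.

```python
import numpy as np

def jive_wald_and_pretest(Y, X, Z, beta0):
    """JIVE estimator, its cross-fit variance, the Wald statistic, and pre-test (5).

    A sketch under the paper's notation; compare F_tilde with the cut-off 4.14
    to decide between the JIVE-Wald and the jackknife AR test.
    """
    N, K = Z.shape
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    M = np.eye(N) - P
    d = np.diag(M)
    P0 = P - np.diag(np.diag(P))                 # off-diagonal part of P
    Pt = P0**2 / (np.outer(d, d) + M**2)         # tilde P_ij weights
    np.fill_diagonal(Pt, 0.0)

    QXX = X @ P0 @ X                             # sum over i != j of P_ij X_i X_j
    beta_jive = (Y @ P0 @ X) / QXX               # JIV2 of Angrist et al. (1999)

    e_hat = Y - X * beta_jive
    MX, Me = M @ X, M @ e_hat
    num = ((P0 @ X)**2) @ (e_hat * Me / d) + (MX * e_hat) @ Pt @ (MX * e_hat)
    wald = (beta_jive - beta0)**2 / (num / QXX**2)

    # Pre-test statistic (5): same quadratic form in X, with variance estimate.
    Ups_hat = (2.0 / K) * (X * MX) @ Pt @ (X * MX)
    F_tilde = QXX / (np.sqrt(K) * np.sqrt(Ups_hat))
    return beta_jive, wald, F_tilde
```

A two-step procedure in the sense of Corollary 1 would then report the JIVE-Wald confidence set when `F_tilde` exceeds 4.14 and invert the jackknife AR test otherwise.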
Theorem 5. Let Assumption 1 hold for matrix $P$ and Assumption 2 hold for errors $\epsilon_i = (e_i, v_i)'$. Assume that $\Pi' M \Pi \le C \frac{\Pi'\Pi}{K}$ and $\frac{\Pi'\Pi}{K^{3/4}} \to 0$ as $N \to \infty$. Then for $\beta = \beta_0$,
$$\left( Wald(\beta_0), \; \tilde F \right) \Rightarrow \left( \frac{\xi^2}{1 - 2\varrho \frac{\xi}{\nu} + \frac{\xi^2}{\nu^2}}, \; \nu \right), \qquad (6)$$
where $\xi$ and $\nu$ are two normal random variables with means $0$ and $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$, unit variances and correlation coefficient $\varrho$ defined in equation (12).

Theorem 5 shows that the distribution of the JIVE-Wald statistic can be quite different from its conventional $\chi^2_1$ limit when $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$ is small. If $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$ is large, then most realizations of the random variable $\nu$ are large as well, and the limit of the JIVE-Wald is close to the distribution of $\xi^2$, which is $\chi^2_1$. This suggests that $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$ is a good measure of identification strength. We notice that the limit expression for the JIVE-Wald statistic is similar to the limit distribution derived by Stock and Yogo (2005, formula (2.22)) for TSLS-Wald in homoscedastic weak IV with small $K$.

Using Theorem 5 we can calculate the worst asymptotic rejection rate of the JIVE-Wald test as a function of $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}} = x$:
$$R^{\max}_\alpha(x) = \max_{\varrho \in [-1,1]} P_{x,\varrho} \left( \frac{\xi^2}{1 - 2\varrho \frac{\xi}{\nu} + \frac{\xi^2}{\nu^2}} \ge \chi^2_{1,1-\alpha} \right),$$
where $P_{x,\varrho}$ is the probability distribution of $(\xi, \nu)$ described in Theorem 5. For a typical test with nominal size $\alpha = 5\%$, we find that $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}} = x > 2.5$ implies $R^{\max}_{5\%}(x) \le 0.10$. Theorem 5 also allows us to construct a 5%-test for the null hypothesis that the unknown strength of identification parameter $\frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$ is higher than 2.5. This test is based on the statistic $\tilde F$ and accepts whenever $\tilde F > 4.14$. Using Bonferroni bounds we obtain the following statement:
Corollary 1. Let all assumptions of Theorem 5 hold. Then a two-step test for the null hypothesis $H_0: \beta = \beta_0$ that accepts the null if $\tilde F > 4.14$ and $Wald(\beta_0) < \chi^2_{1,0.95}$, or if $\tilde F \le 4.14$ and $AR(\beta_0) < z_{0.95}$, has an asymptotic size smaller than 15%.

The pre-test we propose is to compare $\tilde F$ with the cut-off of 4.14. If $\tilde F$ exceeds the cut-off, one may proceed using the JIVE test/confidence set; otherwise one is advised to employ the weak identification-robust jackknife AR test. The attraction of the two-step procedure is that confidence sets based on the JIVE-Wald test are relatively easy to construct and are well understood by practitioners. As we illustrate in simulations, the jackknife AR confidence sets tend to be wider than the JIVE-Wald confidence sets when identification is strong. Simulations also suggest that the Bonferroni bounds derived in Corollary 1 tend to be conservative, as the actual size of the two-step test does not exceed 7%.

5 Empirical Application and Simulations

Angrist and Krueger (1991) (AK91 in what follows) provided a motivating example for the weak identification literature, starting with the seminal work by Bound et al. (1995). Staiger and Stock (1997) suggested that the relatively low value of the first stage F statistic can be seen as a sign of potential weak instruments in the AK91 application. Hansen et al. (2008) argued that "many instruments" may be a more relevant description
of the identification issue encountered in AK91, as instruments are possibly not weak collectively. They suggested that estimators other than TSLS may restore the accuracy of standard inferences. We apply our proposed pre-test statistic $\tilde F$ to the original AK91 application to assess whether instruments are weak given that there are many of them.

The original AK91 application estimated the effect of schooling ($X_i$) on log weekly wage ($Y_i$) using quarter of birth as instruments in a sample of 329,509 men born 1930-39 from the 1980 census. There are multiple specifications in the original AK91 study. We focus on the specification with 180 instruments and also an extension of this specification using 1530 instruments. The 180 instruments include 30 quarter and year of birth interactions (QOB-YOB) and 150 quarter and state of birth interactions (QOB-POB). For the second specification with 1530 instruments, we also include full interactions among QOB-YOB-POB. Table 1 reports the first stage F statistic (FF), our proposed pre-test statistic $\tilde F$ introduced in (5), and confidence sets based on the JIVE-Wald and jackknife AR statistics. While the first stage F statistic is below 10 and the pre-test from Stock and Yogo (2005) would point toward weak identification for both specifications, the instruments turn out to be strong in both specifications based on our pre-test. As a result, the reported confidence sets based on a nominal 5% JIVE-Wald test are reliable, as the actual size is at most 15%. The confidence sets based on our jackknife AR statistic are wider, yet still informative. (With this sample size, we cannot vectorize calculations involving $P_{ij}$ for the jackknife AR and pre-test due to memory constraints. However, it is still relatively fast to execute the non-vectorized code, which takes around 20 minutes.)

Table 1: AK91 Pre-test Results

                    FF       $\tilde F$   JIVE-Wald         Jackknife AR
180 instruments     2.428    13.422       [0.066, 0.132]    [0.008, 0.201]
1530 instruments    1.27     6.173        [0.024, 0.121]    [-0.047, 0.202]

Notes: Results on pre-tests for weak identification and confidence sets for the IV specification underlying Table VII, Column (6) of Angrist and Krueger (1991) using the original data. FF is the first stage F statistic of Stock and Yogo (2005); $\tilde F$ is the statistic introduced in (5). The JIVE-Wald confidence set is described in Section 4. The jackknife AR confidence set is based on analytical test inversion.
Through Monte Carlo simulations we show that the jackknife AR and the pre-test we develop are robust to many weak instruments, unlike canonical IV estimators. To illustrate the practical importance of many weak instruments, we attempt to preserve the structure of AK91. Specifically, we adopt the simulation design of Angrist and Frandsen (2019). There is very little endogeneity in the original AK91, which makes it hard to study the biases of different estimators. Thus, we follow Angrist and Frandsen (2019) to introduce additional omitted variable bias into the simulated data. The simulated data has a nonlinear first stage and is heteroscedastic. We deviate from Angrist and Frandsen (2019) in two respects. First, we vary the sample size $N$ of the simulated data to be 1.5%, 1% and 0.5% of the original sample size. This is to vary the identification strength. We report the identification strength by the average $\tilde F$ across simulations. Simulations with sample size equal to 1.5% of the original sample size produce strong identification in our definition; 1% still produces strong identification but close to the weak identification region, while 0.5% produces weak identification. When we reduce the sample size we also need to exclude the instruments of the groups that are no longer populated. Second, both in data simulation and in estimation we do not include controls in order to isolate the implications of many instruments. The Appendix provides more details on our simulation design.

Table 2: AK91 Simulation Results: Bias of Different Estimators and Size of Non-robust Tests

N       K     Avg. $\tilde F$   OLS bias   2SLS bias   2SLS size   LIML bias   LIML size   JIVE bias   JIVE size
4,923   154   4.99              0.26       0.17        96.6%       -0.001      0.6%        -0.03       5%
3,209   135   3.35              0.26       0.19        95.7%       -0.05       2.7%        -0.06       5.2%
1,599   111   1.77              0.26       0.21        92.3%       -0.89       14.5%       1.22        3.6%

We evaluate the performance of common estimators and tests based on 1000 simulation draws. In Table 2, we report the bias and Wald test size of the OLS, 2SLS, LIML and JIVE estimators. For the Wald test based on the LIML estimator, we calculate the standard errors as in Hansen et al. (2008). They corrected the canonical standard error estimator to be robust to many instruments, but this test is not robust to heteroscedasticity, as LIML itself is inconsistent under heteroscedasticity. For the Wald test based on the JIVE estimator, we calculate the heteroscedasticity-robust standard errors as described in Section 4.

We find that due to many instruments 2SLS has large bias even under strong identification. While Hausman et al. (2012) show LIML is inconsistent under many instruments and heteroscedasticity, LIML is not too biased in our simulated data, as long as identification is not weak. We find that JIVE has low bias when identification is strong, but its bias increases when identification is weak. The Wald test based on either LIML or JIVE is not robust to many weak instruments, and we find substantial size distortion for LIML under weak identification. Surprisingly we do not find large size distortion for JIVE.

In Table 3 we report the rejection frequency of the robust test we developed in this paper based on the jackknife AR test statistic. We find that the jackknife AR controls size even under weak identification. Our proposed pre-test also controls size and is able to switch to the JIVE-Wald test when identification is strong. In contrast, the first stage F statistics of Stock and Yogo (2005) (FF) are very small even under strong identification, which makes them not very informative.

Table 3: AK91 Simulation Results: Size of Robust Tests

N       K     Avg. FF   Avg. $\tilde F$   jackknife AR   pre-test   two-step test
4,923   154   1.63      4.99              5.1%           70.5%      5.8%
3,209   135   1.44      3.35              5.6%           26.7%      6.6%
1,599   111   1.24      1.77              6.3%           4.5%       7.2%

Finally, in Table 4 we compare the length of confidence intervals formed by inverting various tests. In particular, when identification is strong, jackknife AR confidence sets are longer (less efficient) but are not unreasonably long compared to the Wald tests based on LIML and JIVE. In this case, a pre-test can improve efficiency by switching to the Wald test based on JIVE. As with the canonical AR test, the jackknife AR test can result in confidence intervals with infinite length. We report the probability of infinite length in the last column of Table 4, and note that such probability increases as identification gets weaker.

Table 4: AK91 Simulation Results: Length of Confidence Intervals
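Since Table 4 is built by test inversion, a sketch of how such intervals can be computed: the helper below inverts the jackknife AR test over a user-supplied grid, reusing the hypothetical `jackknife_ar` function from Section 3. The paper's notes mention analytical inversion, for which this grid search is a simple stand-in.

```python
import numpy as np
from scipy.stats import norm

def ar_confidence_set(Y, X, Z, grid, level=0.95):
    """Confidence set for beta by inverting the jackknife AR test over a grid.

    Collects all beta0 values that the one-sided AR test fails to reject;
    e.g. ar_confidence_set(Y, X, Z, np.linspace(-1.0, 1.0, 401)).
    """
    crit = norm.ppf(level)
    return np.array([b for b in grid if jackknife_ar(Y, X, Z, b) < crit])
```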
6 Conclusion

In this paper, we argue that weak identification in an environment with many instruments can be characterized as an analog of the concentration parameter staying bounded relative to the square root of the number of instruments in large samples. We introduce a jackknifed version of the AR test that is robust to our definition of weak identification and to heteroscedasticity. We also propose a pre-test for weak identification and correspondingly a two-step testing procedure in the spirit of Stock and Yogo (2005). Unlike the pre-test proposed by Stock and Yogo (2005), our two-step test controls size distortion even under heteroscedasticity and with many instruments. As an empirical example, our pre-test rejects weak identification in Angrist and Krueger (1991), where up to 1530 instruments are used.
References

Anatolyev, S., and Gospodinov, N. (2011). "Specification Testing in Models with Many Instruments." Econometric Theory 27, 427–441.

Andrews, D.W.K., Moreira, M.J., and Stock, J.H. (2006). "Optimal Two-Sided Invariant Similar Tests for Instrumental Variables Regression." Econometrica 74, 715–752.

Andrews, D.W.K., and Stock, J.H. (2007). "Testing with Many Weak Instruments." Journal of Econometrics 138, 24–46.

Andrews, I. (2018). "Valid Two-Step Identification-Robust Confidence Sets for GMM." The Review of Economics and Statistics 100, 337–348.

Angrist, J.D., and Frandsen, B. (2019). "Machine Labor." NBER Working Paper.

Angrist, J.D., Imbens, G.W., and Krueger, A.B. (1999). "Jackknife Instrumental Variables Estimation." Journal of Applied Econometrics 14, 57–67.

Angrist, J.D., and Krueger, A.B. (1991). "Does Compulsory School Attendance Affect Schooling and Earnings?" The Quarterly Journal of Economics 106, 979–1014.

Bound, J., Jaeger, D.A., and Baker, R.M. (1995). "Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak." Journal of the American Statistical Association 90:430, 443–450.

Chao, J.C., and Swanson, N.R. (2005). "Consistent Estimation with a Large Number of Weak Instruments." Econometrica 73, 1673–1692.

Chao, J.C., Swanson, N.R., Hausman, J.A., Newey, W.K., and Woutersen, T. (2012). "Asymptotic Distribution of JIVE in a Heteroskedastic IV Regression with Many Instruments." Econometric Theory 28, 42–86.

Dobbie, W., Goldin, J., and Yang, C.S. (2018). "The Effects of Pretrial Detention on Conviction, Future Crime, and Employment: Evidence from Randomly Assigned Judges." American Economic Review 108, 201–240.

Fama, E.F., and MacBeth, J.D. (1973). "Risk, Return, and Equilibrium: Empirical Tests." Journal of Political Economy 81, 607–636.

Hansen, C., Hausman, J., and Newey, W. (2008). "Estimation With Many Instrumental Variables." Journal of Business & Economic Statistics 26, 398–422.

Hausman, J.A., Newey, W.K., Woutersen, T., Chao, J.C., and Swanson, N.R. (2012). "Instrumental Variable Estimation with Heteroscedasticity and Many Instruments." Quantitative Economics 3, 211–255.

Kleibergen, F. (2002). "Pivotal Statistics for Testing Structural Parameters in Instrumental Variables Regression." Econometrica 70, 1781–1803.

Kline, P., Saggio, R., and Sølvsten, M. (2019). "Leave-Out Estimation of Variance Components." Working paper.

Maestas, N., Mullen, K.J., and Strand, A. (2013). "Does Disability Insurance Receipt Discourage Work? Using Examiner Assignment to Estimate Causal Effects of SSDI Receipt." American Economic Review 103, 1797–1829.

Müller, U.K. (2011). "Efficient Tests Under a Weak Convergence Assumption." Econometrica 79 (2): 395–435.

Newey, W. (2004). "Many Instrument Asymptotics."

Newey, W.K., and Robins, J.R. (2018). "Cross-Fitting and Fast Remainder Rates for Semiparametric Estimation."

Newey, W.K., and Windmeijer, F. (2009). "Generalized Method of Moments With Many Weak Moment Conditions." Econometrica 77, 687–719.

Sampat, B., and Williams, H.L. (2019). "How Do Patents Affect Follow-On Innovation? Evidence from the Human Genome." American Economic Review 109, 203–236.

Shanken, J. (1992). "On the Estimation of Beta-Pricing Models." The Review of Financial Studies 5, 1–33.

Staiger, D., and Stock, J.H. (1997). "Instrumental Variables Regression with Weak Instruments." Econometrica 65 (3): 557–586.

Stock, J.H., and Yogo, M. (2005). "Testing for Weak Instruments in Linear IV Regression." In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, pp. 80–108.
Appendix: Proofs

Let $C$ be a universal constant (that may be different in different lines but does not depend on $N$ or $K$). Denote $\sigma_i^2 = Var(e_i)$, $\varsigma_i^2 = Var(v_i)$, $\gamma_i = cov(e_i, v_i)$, and $\tilde P_{ij} = \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2}$.

Proof of Theorem 1.
Denote by $A$ an upper-triangular matrix such that $A \Omega A' = I_2$. The sufficient statistic in model (1) is $(\xi_1, \xi_2)$:
$$\begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} = (A \otimes I_K) \cdot \begin{pmatrix} (Z'Z)^{-1/2} Z' Y \\ (Z'Z)^{-1/2} Z' X \end{pmatrix} \sim N\left( \begin{pmatrix} \tilde\beta_1 \Pi \\ \tilde\beta_2 \Pi \end{pmatrix}, \; I_{2K} \right), \qquad (7)$$
where $\tilde\beta = (\tilde\beta_1, \tilde\beta_2)' = A(\beta, 1)'$ is a (known) linear one-to-one transformation of $\beta$. Denote the corresponding null and alternative as $\tilde\beta^0$ and $\tilde\beta^*$. We denote also $\Pi = \frac{(Z'Z)^{1/2} \pi}{\sigma_v}$, which is a one-to-one transformation of $\pi$. It is enough to restrict attention to the tests that depend on the data through sufficient statistics only. Indeed, for any test $\psi \in \Psi_N$ we may construct a test $\psi_S = E(\psi \mid \xi_1, \xi_2)$ which depends on the data only through the sufficient statistics. Due to the law of iterated expectations the size and the power of $\psi_S$ are the same as those of the initial $\psi$.

Let $\mathcal{U}$ be the group of rotations on $\mathbb{R}^K$, that is, $U \in \mathcal{U}$ are such that $U'U = I_K$. Notice that the model is invariant to the group $\mathcal{U}$, namely if $(\xi_1, \xi_2)$ satisfy model (7) with parameters $(\tilde\beta, \Pi)$ then $(U\xi_1, U\xi_2)$ satisfy model (7) with parameters $(\tilde\beta, U\Pi)$. Note that $\Pi'\Pi = (U\Pi)'(U\Pi)$. This implies that for any function $f$ we have $E_{(\tilde\beta, \Pi)} f(U\xi_1, U\xi_2) = E_{(\tilde\beta, U\Pi)} f(\xi_1, \xi_2)$.

We call a test $\psi = \psi(\xi_1, \xi_2)$ invariant to rotations iff for any $U \in \mathcal{U}$ we have $\psi(U\xi_1, U\xi_2) = \psi(\xi_1, \xi_2)$ for all realizations of $(\xi_1, \xi_2)$. The maximum in Theorem 1 is achieved at an invariant test. Indeed, take any test $\psi \in \Psi_N$ that has size $\alpha$, that is, $E_{(\tilde\beta^0, \Pi)} \psi(\xi_1, \xi_2) \le \alpha$ for all $\Pi$. Let us consider a new test $\psi^*(\xi_1, \xi_2) = \int_{U \in \mathcal{U}} \psi(U\xi_1, U\xi_2) dU$, where the integral is taken uniformly over the unit sphere in $\mathbb{R}^K$. By construction, $\psi^*$ is an invariant test, as for any $\tilde U \in \mathcal{U}$ we have $U\tilde U \in \mathcal{U}$ for all $U \in \mathcal{U}$, so that
$$\psi^*(\tilde U \xi_1, \tilde U \xi_2) = \int_{U \in \mathcal{U}} \psi(U \tilde U \xi_1, U \tilde U \xi_2) dU = \int_{U \in \mathcal{U}} \psi(U\xi_1, U\xi_2) dU.$$
Also,
$$E_{(\tilde\beta^0, \Pi)} \psi^*(\xi_1, \xi_2) = \int_{U \in \mathcal{U}} \left\{ E_{(\tilde\beta^0, \Pi)} \psi(U\xi_1, U\xi_2) \right\} dU = \int_{U \in \mathcal{U}} \left\{ E_{(\tilde\beta^0, U\Pi)} \psi(\xi_1, \xi_2) \right\} dU \le \alpha.$$
So, it has correct size. Now we check that the minimal power of $\psi^*$ achieved over alternatives $(\tilde\beta^*, \Pi)$ with $\Pi$ such that $\frac{\Pi'\Pi}{\sqrt{K}} = C$ is not smaller than that of $\psi$. Assume that the minimum of power for test $\psi$ is achieved at the alternative $\Pi^*$:
$$\min_{\frac{\Pi'\Pi}{\sqrt{K}} = C} E_{(\tilde\beta^*, \Pi)} \psi(\xi_1, \xi_2) = E_{(\tilde\beta^*, \Pi^*)} \psi(\xi_1, \xi_2).$$
Then, similarly to above:
$$\min_{\frac{\Pi'\Pi}{\sqrt{K}} = C} E_{(\tilde\beta^*, \Pi)} \psi^*(\xi_1, \xi_2) = \min_{\frac{\Pi'\Pi}{\sqrt{K}} = C} \int_{U \in \mathcal{U}} \left\{ E_{(\tilde\beta^*, U\Pi)} \psi(\xi_1, \xi_2) \right\} dU \ge \int_{U \in \mathcal{U}} \min_{\frac{\Pi'\Pi}{\sqrt{K}} = C} \left\{ E_{(\tilde\beta^*, U\Pi)} \psi(\xi_1, \xi_2) \right\} dU = E_{(\tilde\beta^*, \Pi^*)} \psi(\xi_1, \xi_2).$$
All invariant tests depend on the data only through the maximal invariant. Thus, we should only consider tests that depend on the data through the statistics $Q = (Q_{11}, Q_{12}, Q_{22}) = (\xi_1'\xi_1, \xi_1'\xi_2, \xi_2'\xi_2)$. If $\Pi'\Pi / \sqrt{K} \to C$ then $Q$ converges to the following distribution:
$$\begin{pmatrix} \frac{\xi_1'\xi_1 - K}{\sqrt{2K}} \\ \frac{\xi_1'\xi_2}{\sqrt{K}} \\ \frac{\xi_2'\xi_2 - K}{\sqrt{2K}} \end{pmatrix} \Rightarrow N\left( \begin{pmatrix} \tilde\beta_1^2 C / \sqrt{2} \\ \tilde\beta_1 \tilde\beta_2 C \\ \tilde\beta_2^2 C / \sqrt{2} \end{pmatrix}, \; I_3 \right) = \begin{pmatrix} Q_{11,\infty} \\ Q_{12,\infty} \\ Q_{22,\infty} \end{pmatrix} = Q_\infty. \qquad (8)$$
According to Theorem 1 of Müller (2011), the limit of the maximal power of tests in the experiment based on $Q$ is bounded above by the maximal power achieved in the limit experiment on $Q_\infty$ as defined on the right hand side of equation (8). Notice that the maximal achievable power $E_{\tilde\beta^*, C} \psi^*(Q_\infty)$ is strictly less than 1 for any fixed $\beta^*$ and fixed $C$. Indeed, the best achievable power in the limit experiment (8) is no more than the best achievable power in the experiment where $C$ is known. If $C$ is known, the optimal test follows from the Neyman-Pearson lemma, and its power is less than 1. $\square$

Proof of Theorem 2.
Assumptions 1 and 2 imply
$$1 \ge \frac{1}{K} \sum_i \sum_{j \ne i} P_{ij}^2 = \frac{1}{K} \sum_i \sum_j P_{ij}^2 - \frac{1}{K} \sum_i P_{ii}^2 \ge 1 - \frac{\delta}{K} \sum_i P_{ii} = 1 - \delta.$$
Thus, $2(1 - \delta)(c_*)^2 < \Phi < 2(C^*)^2$, and it is sufficient to prove that $\hat\Phi - \Phi \to_p 0$. The last statement holds due to Lemma 2 applied to $\xi_i = (e_i, e_i, e_i)'$. $\square$

Lemma 2.
Let Assumption 1 hold. Assume the errors $\xi_i = (\xi_i^{(1)}, \xi_i^{(2)}, \xi_i^{(3)})'$ are independent mean zero random vectors with $\max_i E\|\xi_i\|^4 < C$. Then as $N \to \infty$, we have:
$$\frac{1}{K} \sum_i \sum_{j \ne i} \left\{ \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2} \left[ \xi_i^{(1)} M_i \xi^{(2)} \right] \left[ \xi_j^{(1)} M_j \xi^{(3)} \right] - P_{ij}^2 \, E\left[ \xi_i^{(1)} \xi_i^{(2)} \right] E\left[ \xi_j^{(1)} \xi_j^{(3)} \right] \right\} \to_p 0.$$

Proof of Lemma 2.
Notice that
$$\frac{1}{K} \sum_i \sum_{j \ne i} P_{ij}^2 \, E\left[ \xi_i^{(1)} \xi_i^{(2)} \right] E\left[ \xi_j^{(1)} \xi_j^{(3)} \right] = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \, E\left[ \xi_i^{(1)} M_i \xi^{(2)} \, \xi_j^{(1)} M_j \xi^{(3)} \right].$$
Define $\xi_{ij} = \xi_i^{(1)} M_i \xi^{(2)} \, \xi_j^{(1)} M_j \xi^{(3)} - E\left[ \xi_i^{(1)} M_i \xi^{(2)} \, \xi_j^{(1)} M_j \xi^{(3)} \right]$; then we need to prove that $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \xi_{ij} \to_p 0$. Since $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \xi_{ij}$ has zero mean, it is sufficient to show that the variance of each term in expression (9) defined below converges to zero (here $\sum_I$ is a summation over distinct indexes $(i, i', j, j')$):
$$E\left( \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \xi_{ij} \right)^2 = \frac{1}{K^2} \sum_i \sum_{j \ne i} \tilde P_{ij}^2 \, E \xi_{ij}^2 + \frac{1}{K^2} \sum_i \sum_{j \ne i} \sum_{i' \ne \{i,j\}} \tilde P_{ij} \tilde P_{ii'} \, E \xi_{ij} \xi_{ii'} + \frac{1}{K^2} \sum_I \tilde P_{ij} \tilde P_{i'j'} \, E \xi_{ij} \xi_{i'j'}. \qquad (9)$$
First, we prove that $\max_{i,j} E \xi_{ij}^2 < C$. We expand $\xi_{ij} = A_{1,ij} + A_{2,ij} + A_{3,ij}$, where:
$$A_{1,ij} = M_{ii} M_{jj} \left( \xi_i^{(1)} \xi_i^{(2)} \xi_j^{(1)} \xi_j^{(3)} - E[\xi_i^{(1)} \xi_i^{(2)} \xi_j^{(1)} \xi_j^{(3)}] \right) + M_{ij}^2 \left( \xi_i^{(1)} \xi_i^{(3)} \xi_j^{(1)} \xi_j^{(2)} - E[\xi_i^{(1)} \xi_i^{(3)} \xi_j^{(1)} \xi_j^{(2)}] \right),$$
$$A_{2,ij} = \xi_i^{(1)} \xi_j^{(1)} \sum_{i' \ne \{i,j\}} \left( M_{ii} M_{ji'} \xi_i^{(2)} \xi_{i'}^{(3)} + M_{ii'} M_{ij} \xi_{i'}^{(2)} \xi_i^{(3)} + M_{jj} M_{ii'} \xi_{i'}^{(2)} \xi_j^{(3)} + M_{ji'} M_{ij} \xi_j^{(2)} \xi_{i'}^{(3)} \right),$$
$$A_{3,ij} = \xi_i^{(1)} \xi_j^{(1)} \sum_{i' \ne \{i,j\}} \sum_{j' \ne \{i,j\}} M_{ii'} M_{jj'} \xi_{i'}^{(2)} \xi_{j'}^{(3)}.$$
It is sufficient to show that $\max_{i,j} E A_{s,ij}^2$ is bounded for all $s = 1, 2, 3$. The moment condition implies $E A_{1,ij}^2 \le C (M_{ii}^2 M_{jj}^2 + M_{ij}^4) \le C$. Below we use that non-zero correlations between summands in $A_{s,ij}$ imply that some indexes must coincide. We also use Lemma S1.1 from the Supplementary Appendix:
$$E A_{2,ij}^2 \le C \sum_{i'} \left( M_{ii}^2 M_{ji'}^2 + M_{ii'}^2 M_{ij}^2 + M_{jj}^2 M_{ii'}^2 + M_{ji'}^2 M_{ij}^2 \right) \le C,$$
$$E A_{3,ij}^2 \le C \sum_{i' \ne \{i,j\}} \sum_{j' \ne \{i,j\}} \left( P_{ii'}^2 P_{jj'}^2 + |P_{ii'} P_{jj'} P_{ij'} P_{ji'}| \right) \le C.$$
Next notice that
$$\tilde P_{ij} = \frac{P_{ij}^2}{M_{ii} M_{jj} + M_{ij}^2} \le \frac{P_{ij}^2}{(1 - P_{ii})(1 - P_{jj})} \le \frac{P_{ij}^2}{(1 - \delta)^2}. \qquad (10)$$
Lemma B1 in Chao et al. (2012) gives that $\sum_i \sum_{j \ne i} P_{ij}^2 \le K$ and $\sum_i \sum_{j \ne i} \sum_{j' \ne i, j' \ne j} P_{ij}^2 P_{ij'}^2 \le K$. Thus, given the bound $\max_{i,j} E \xi_{ij}^2 < C$ and, by the Cauchy-Schwarz inequality, $\max_{i,j,k} |E \xi_{ij} \xi_{ik}| < C$, the first two terms in expression (9) converge to zero.

For the last term in (9), since $i, i', j, j'$ are all distinct, we have $E A_{1,ij} A_{s,i'j'} = 0$ for $s = 2, 3$, and $E A_{2,ij} A_{3,i'j'} = 0$. The non-zero terms in $E \xi_{ij} \xi_{i'j'}$ are
$$|E A_{2,ij} A_{2,i'j'}| \le C |(M_{ii} M_{jj'} + M_{ij} M_{ij'})(M_{i'i'} M_{jj'} + M_{i'j} M_{i'j'})| + C |(M_{jj} M_{ii'} + M_{ji'} M_{ij})(M_{j'j'} M_{ii'} + M_{j'i'} M_{ij'})|,$$
$$|E A_{3,ij} A_{3,i'j'}| \le C \left( P_{ii'}^2 P_{jj'}^2 + P_{ij'}^2 P_{i'j}^2 \right).$$
Given inequality (10), the symmetry of summation, and statements (a)-(e) proved in Lemma S1.2 in the Supplementary Appendix, we obtain that the last term in equation (9) converges to zero. $\square$
Proof of Theorem 3.
Denote $\lambda_i = M_i \Pi$. Then
$$\hat\Phi = \frac{2}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \left( \eta_i + \Delta \Pi_i \right) \left( M_i \eta + \Delta \lambda_i \right) \left( \eta_j + \Delta \Pi_j \right) \left( M_j \eta + \Delta \lambda_j \right).$$
Let us define $\hat\Phi_0 = \frac{2}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \, \eta_i M_i \eta \, \eta_j M_j \eta$. Assumption 2 guarantees that the variance of $\eta_i = e_i + \Delta \cdot v_i$ is uniformly bounded. Lemma 2 with $\xi_i = (\eta_i, \eta_i, \eta_i)'$ gives $|\hat\Phi_0 - \Phi| \to_p 0$ uniformly over bounded $\Delta$. Lemma 3 with $\xi_i = (\eta_i, \eta_i, \eta_i, \eta_i)'$ implies $\hat\Phi - \hat\Phi_0 \to_p 0$. $\square$

Lemma 3.
Let $\xi_i = (\xi_i^{(1)}, \xi_i^{(2)}, \xi_i^{(3)}, \xi_i^{(4)})'$ be independent mean zero $4 \times 1$ random vectors such that $E\|\xi_i\|^4 < C$. Let Assumption 1 hold. Assume that $\lambda'\lambda \le \frac{C}{K} \Pi'\Pi$ and $\Delta^2 \cdot \frac{\Pi'\Pi}{K} \to 0$ as $N \to \infty$. Then
$$\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \left( \xi_i^{(1)} + \Delta \Pi_i \right) \left( M_i \xi^{(2)} + \Delta \lambda_i \right) \left( \xi_j^{(3)} + \Delta \Pi_j \right) \left( M_j \xi^{(4)} + \Delta \lambda_j \right) - \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \, \xi_i^{(1)} M_i \xi^{(2)} \, \xi_j^{(3)} M_j \xi^{(4)} \to_p 0.$$

Proof of Lemma 3.
We write the main expression of interest as a polynomial of fourth power in $\Delta$: $\Delta A_1 + \Delta^2 A_2 + \Delta^3 A_3 + \Delta^4 A_4$, and prove that all terms are negligible, $\Delta^l A_l \to_p 0$, by showing that their means and variances converge to zero. Notice that for expressions with identical structure but different components of $\xi_i$, the proof of their negligibility is exactly the same. Thus for simplicity we abuse the notation and drop the superscripts on $\xi_i$ when we can consolidate these expressions. For example, we write the expression for one of the terms in $A_3$ as $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \lambda_j \xi_j$, which collects both $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \lambda_j \xi_j^{(1)}$ and $\frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \lambda_j \xi_j^{(3)}$. We also treat $\xi_i$ in all expressions below as a scalar.
$$A_4 = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \Pi_j \lambda_j;$$
$$A_3 = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \lambda_j \xi_j + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \Pi_j M_j \xi;$$
$$A_2 = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \lambda_i \lambda_j \xi_i \xi_j + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \lambda_i \xi_i \Pi_j M_j \xi + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \lambda_i \Pi_i \xi_j M_j \xi + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \Pi_j M_i \xi M_j \xi;$$
$$A_1 = \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \lambda_i \xi_i M_j \xi \, \xi_j + \frac{1}{K} \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i M_i \xi \, \xi_j M_j \xi.$$
Term $A_4$ is deterministic. We use bound (10) and Lemma S1.3 (d):
$$\Delta^4 |A_4| \le C \Delta^4 \frac{\Pi'\Pi \, \lambda'\lambda}{K} \le C \Delta^4 \frac{(\Pi'\Pi)^2}{K^2} \to 0.$$
Term $A_3$ is mean zero. Using the inequality $Var(X + Y) \le 2 Var(X) + 2 Var(Y)$ we have:
$$\Delta^6 Var(A_3) \le \frac{C \Delta^6}{K^2} \left( \sum_j \left( \sum_i P_{ij}^2 |\Pi_i| |\lambda_i| \right)^2 \lambda_j^2 + \sum_k \left( \sum_i \sum_{j \ne i} \tilde P_{ij} \Pi_i \lambda_i \Pi_j M_{jk} \right)^2 \right) \le \frac{C \Delta^6}{K^2} \left( (\lambda'\lambda)^2 \Pi'\Pi + \sum_{i, i', j, j'} P_{ij}^2 |\Pi_i \lambda_i \Pi_j| P_{i'j'}^2 |\Pi_{i'} \lambda_{i'} \Pi_{j'}| \sum_k |M_{jk} M_{j'k}| \right) \le \frac{C \Delta^6}{K^2} \left( (\lambda'\lambda)^2 \Pi'\Pi + (\Pi'\Pi)^2 \lambda'\lambda \right) \le C \Delta^6 \frac{(\Pi'\Pi)^3}{K^3} \to 0.$$
For the first inequality, we apply Assumption 2 and bound (10). Then we use the Cauchy-Schwarz inequality for the first summand: $\left( \sum_i P_{ij}^2 |\Pi_i| |\lambda_i| \right)^2 \le \Pi'\Pi \, \lambda'\lambda$. For the second summand, we apply Lemma S1.1 (ii) and Lemma S1.3 (c). Finally, we apply Lemmas S2.1 and S2.2 to get $\Delta^2 A_2 \to_p 0$ and $\Delta A_1 \to_p 0$. $\square$

Proof of Theorem 4. The infeasible version of the AR statistic under $\beta = \beta_0 + \Delta$ is:
$$\frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} e_i(\beta_0) e_j(\beta_0) = \frac{\Delta^2}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} \Pi_i \Pi_j + \frac{2\Delta}{\sqrt{K}\sqrt{\Phi}} \sum_i \left( \sum_{j \ne i} P_{ij} \Pi_j \right) \eta_i + \frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} \eta_i \eta_j. \qquad (11)$$
The first term in (11) is deterministic and equals $\frac{\Delta^2 \mu}{\sqrt{K}\sqrt{\Phi}}$. The second term has mean zero and variance
$$\frac{4 \Delta^2}{K \Phi} \sum_i \left( \sum_{j \ne i} P_{ij} \Pi_j \right)^2 Var(\eta_i) \le \frac{C c^2}{K \Phi} \sum_i w_i^2 \le C \frac{\Pi'\Pi}{K} \to 0.$$
Here we used that the variance of $\eta_i$ is bounded by Assumption 2, denoted $w_i = \sum_{j \ne i} P_{ij} \Pi_j$, and the final bound is proven in Lemma S1.4. Thus, the second term converges to zero in probability uniformly over $|\Delta| \le c$. The third term in (11) is asymptotically standard normal due to Lemma 1. Finally, we notice that
$$AR(\beta_0) = \sqrt{\frac{\Phi}{\hat\Phi}} \cdot \frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} e_i(\beta_0) e_j(\beta_0),$$
and apply Theorem 3. This finishes the proof of statement (4).

Now consider the case when $\frac{\mu}{\sqrt{K}\sqrt{\Phi}} \to \infty$ and $\Delta \ne 0$ is fixed. Above we proved that
$$\frac{1}{\sqrt{K}\sqrt{\Phi}} \sum_i \sum_{j \ne i} P_{ij} e_i(\beta_0) e_j(\beta_0) = \frac{\Delta^2 \mu}{\sqrt{K}\sqrt{\Phi}} + o_p(1) + O_p(1).$$
Finally, Theorem 3 implies that $\frac{\hat\Phi}{\Phi} \to_p 1$. As a result, we have $AR(\beta_0) \to_p \infty$ when $\frac{\mu}{\sqrt{K}\sqrt{\Phi}} \to \infty$ and $\Delta \ne 0$ is fixed. This leads to rejection probability converging to 1. $\square$

Proof of Theorem 5.
Denote
$$Q = (Q_{ee}, Q_{Xe}, Q_{XX})' = \frac{1}{\sqrt{K}} \sum_{i=1}^N \sum_{j \ne i} P_{ij} \left( e_i e_j, \; X_i e_j, \; X_i X_j \right)'.$$
A CLT for quadratic forms implies
$$\Sigma^{-1/2} \left( Q_{ee}, \; Q_{Xe}, \; Q_{XX} - \frac{\mu}{\sqrt{K}} \right)' \Rightarrow N(0, I_3),$$
where $\Sigma$ is the asymptotic covariance matrix of $Q$, with some of its elements written below:
$$\Psi = \frac{1}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \gamma_i \gamma_j + \frac{1}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \sigma_i^2 \varsigma_j^2 + \frac{1}{K} \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} \Pi_j \right)^2 \sigma_i^2 = AVar(Q_{Xe}),$$
$$\Upsilon = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \varsigma_i^2 \varsigma_j^2 + \frac{4}{K} \sum_{i=1}^N \varsigma_i^2 \left( \sum_{j \ne i} P_{ij} \Pi_j \right)^2 = AVar(Q_{XX}), \qquad (12)$$
$$\tau = \frac{2}{K} \sum_{i=1}^N \sum_{j \ne i} P_{ij}^2 \varsigma_i^2 \gamma_j + \frac{2}{K} \sum_{i=1}^N \gamma_i \left( \sum_{j \ne i} P_{ij} \Pi_j \right)^2 = ACov(Q_{Xe}, Q_{XX}), \qquad \varrho = \frac{\tau}{\sqrt{\Psi}\sqrt{\Upsilon}}.$$
Note that $\hat e_i = Y_i - X_i \hat\beta_{JIVE} = e_i - X_i (\hat\beta_{JIVE} - \beta_0)$ and $(\hat\beta_{JIVE} - \beta_0) = Q_{Xe} / Q_{XX}$. Thus,
$$Wald(\beta_0) = \frac{ Q_{Xe}^2 }{ \frac{1}{K} \left[ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{\hat e_i M_i \hat e}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X \hat e_i \, M_j X \hat e_j \right] },$$
where the denominator expands to
$$\sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{\hat e_i M_i \hat e}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X \hat e_i \, M_j X \hat e_j = \left[ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{e_i M_i e}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X e_i \, M_j X e_j \right] - \frac{Q_{Xe}}{Q_{XX}} \left[ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \left( \frac{e_i M_i X}{M_{ii}} + \frac{X_i M_i e}{M_{ii}} \right) + 2 \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X e_i \, M_j X X_j \right] + \frac{Q_{Xe}^2}{Q_{XX}^2} \left[ \sum_{i=1}^N \left( \sum_{j \ne i} P_{ij} X_j \right)^2 \frac{X_i M_i X}{M_{ii}} + \sum_{i=1}^N \sum_{j \ne i} \tilde P_{ij} M_i X X_i \, M_j X X_j \right].$$
Applying Lemma S3.1 from the Supplementary Appendix to the expanded expression of the denominator, we show that the three bracketed groups of terms, after division by $K$, converge to $\Psi$, $2\tau$ and $\Upsilon$ respectively. Then
$$Wald(\beta_0) = \frac{ Q_{Xe}^2 }{ \Psi - 2 \frac{Q_{Xe}}{Q_{XX}} \tau + \frac{Q_{Xe}^2}{Q_{XX}^2} \Upsilon } (1 + o_p(1)) = \frac{ Q_{Xe}^2 / \Psi }{ 1 - 2 \frac{Q_{Xe}/\sqrt{\Psi}}{Q_{XX}/\sqrt{\Upsilon}} \varrho + \frac{Q_{Xe}^2}{Q_{XX}^2} \frac{\Upsilon}{\Psi} } (1 + o_p(1)).$$
Consistency of $\hat\Upsilon$ (applying the argument of Theorem 3 with $\xi_i = (v_i, v_i, v_i, v_i)'$ and $\Delta = 1$) gives $\tilde F = \frac{Q_{XX}}{\sqrt{\Upsilon}} (1 + o_p(1))$. Thus, the statement of Theorem 5 holds, where we denote by $\left( \xi, \; \nu - \frac{\mu}{\sqrt{K}\sqrt{\Upsilon}} \right)$ the Gaussian limit of $\left( \frac{Q_{Xe}}{\sqrt{\Psi}}, \; \frac{Q_{XX}}{\sqrt{\Upsilon}} - \frac{\mu}{\sqrt{K}\sqrt{\Upsilon}} \right)$. $\square$

Proof of Corollary 1.
Denote $x = \frac{\mu}{\sqrt{K}\sqrt{\Upsilon}}$. If $x > 2.5$ then due to Theorem 5:
$$P_x \left\{ \tilde F > 4.14 \text{ and } Wald(\beta_0) \ge \chi^2_{1,0.95} \right\} \le P_x \left\{ Wald(\beta_0) \ge \chi^2_{1,0.95} \right\} \le 0.10.$$
If $x \le 2.5$ then due to the asymptotic gaussianity of $\tilde F$:
$$P_x \left\{ \tilde F > 4.14 \text{ and } Wald(\beta_0) \ge \chi^2_{1,0.95} \right\} \le P_x \left\{ \tilde F > 4.14 \right\} \le 0.05.$$
Finally, for any $x > 0$:
$$P_x \{ H_0 \text{ is rejected} \} = P \left\{ \tilde F > 4.14 \text{ and } Wald(\beta_0) \ge \chi^2_{1,0.95} \right\} + P \left\{ \tilde F \le 4.14 \text{ and } AR(\beta_0) \ge z_{0.95} \right\} \le 0.10 + P \left\{ AR(\beta_0) \ge z_{0.95} \right\} \le 0.15. \qquad \square$$
Appendix: Simulation Design

To create many instruments, we interact QOB dummies with dummies for year of birth (YOB) and place (state) of birth (POB). Interacting three QOB dummies with ten YOB and 50 POB dummies generates 180 excluded instruments. The excluded instruments are
$$Z_i = \left( \left( 1\{Q_i = q, C_i = c\} \right)'_{q \in \{1,2,3\}, \, c \in \{1930, \dots, 1939\}}, \; \left( 1\{Q_i = q, P_i = p\} \right)'_{q \in \{1,2,3\}, \, p \in \{\text{states}\}} \right)',$$
where $Q_i$, $C_i$, $P_i$ are $i$'s QOB, YOB and POB respectively. Note that $Z_i$ are not group instruments in the strict sense as they are not mutually exclusive. We exclude instruments with too few observations (those with $\sum_{i=1}^N Z_{ij}$ below a small cut-off) to satisfy the balanced design assumption (Assumption 1).

To increase the amount of omitted variable bias, we follow Angrist and Frandsen (2019) by taking the LIML model as the ground truth, where the outcome variable is $Y_i$ (income), the endogenous variable $X_i$ (highest grade completed) is instrumented by $Z_i$, and the control variables are a full set of POB-by-YOB interactions. Specifically, starting with the full 1980 census sample, we compute the average $X_i$ in each QOB-YOB-POB cell, $\bar s(q, c, p)$. We then estimate LIML and retain $\hat y(c, p)$, the second-stage fitted value after subtracting $\hat\beta_{LIML} X_i$, where $\hat\beta_{LIML}$ is the LIML estimate of the returns to schooling. We also retain the variance of LIML residuals $\omega^2(Q_i, C_i, P_i)$ to mimic the heteroskedasticity. The simulation model we consider is then
$$\tilde y_i = \bar y + 0.1 \, \tilde s_i + \omega(Q_i, C_i, P_i)(\nu_i + \kappa_2 \epsilon_i), \qquad \tilde s_i \sim Poisson(\mu_i),$$
for independent standard normal $\nu_i$ and $\epsilon_i$. Here $\bar y = \frac{1}{N} \sum_i \hat y(C_i, P_i)$ and $\mu_i = \max\{1, \gamma_0 + \gamma_Z' Z_i + \kappa_1 \nu_i\}$, where $\gamma_0 + \gamma_Z' Z_i$ is the projection of $\bar s(Q_i, C_i, P_i)$ onto a constant and $Z_i$. We set $\kappa_1 = 1$ and $\kappa_2 = 0.1$.
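As a rough illustration of this design, the sketch below simulates outcomes and schooling given instrument and calibration inputs. All names are hypothetical, and the default constants mirror the (partially garbled) calibration above, so they should be treated as placeholders rather than the paper's exact values.

```python
import numpy as np

def simulate_ak91_design(Z, s_proj, y_bar, omega, kappa1=1.0, kappa2=0.1, rng=None):
    """Sketch of the Angrist-Frandsen-style DGP described above.

    Z:      N x K instrument matrix (QOB interactions)
    s_proj: gamma0 + gamma_Z'Z_i, projection of cell-mean schooling on Z
    y_bar:  average second-stage fitted value
    omega:  cell-level scale of LIML residuals (mimics heteroskedasticity)
    kappa1, kappa2: calibration constants (placeholders, see text)
    """
    rng = rng or np.random.default_rng(0)
    N = Z.shape[0]
    nu = rng.standard_normal(N)                 # common shock: omitted-variable bias
    eps = rng.standard_normal(N)
    mu = np.maximum(1.0, s_proj + kappa1 * nu)  # Poisson intensity for schooling
    s = rng.poisson(mu)                         # nonlinear first stage
    y = y_bar + 0.1 * s + omega * (nu + kappa2 * eps)
    return y, s
```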