Efficiency Loss of Asymptotically Efficient Tests in an Instrumental Variables Regression
Marcelo J. Moreira and Geert Ridder

September 1, 2020

Preliminary results were presented at seminars organized by BU, Brown, Caltech, Harvard-MIT, PUC-Rio, University of California (Berkeley, Davis, Irvine, Los Angeles, Santa Barbara, and Santa Cruz campuses), UCL, USC, and Yale, at the FGV Data Science workshop, and at conferences organized by CIREQ (in honor of Jean-Marie Dufour), Harvard University (in honor of Gary Chamberlain), Oxford University (New Approaches to the Identification of Macroeconomic Models), and the Tinbergen Institute (Inference Issues in Econometrics). We thank Marinho Bertanha, Michael Jansson, and Pierre Perron for helpful comments; Jack Porter for valuable insights; Leandro Gorno for discussions on impossibility designs and matrix decompositions; and Mahrad Sharifvaghefi for several suggestions and for excellent research assistance together with Pedro Melgaré. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

FGV EPGE, Praia de Botafogo, 190, 11th floor, Rio de Janeiro, RJ 22250-040, Brasil. Electronic correspondence: [email protected]

Department of Economics and USC Dornsife INET, University of Southern California, Kaprielian Hall, Los Angeles, USA, CA 90089. Electronic correspondence: [email protected]

Abstract

In an instrumental variable model, the score statistic can be stochastically bounded for any alternative in parts of the parameter space. These regions involve a constraint on the first-stage regression coefficients and the reduced-form covariance matrix. As a consequence, the Lagrange Multiplier (LM) test can have power close to size, despite being efficient under standard asymptotics. This loss of information limits the power of conditional tests which use only the Anderson-Rubin (AR) and the score statistic. In particular, the conditional quasi-likelihood ratio (CQLR) test also suffers severe losses because its power can be bounded for any alternative.

A necessary condition for drastic power loss to occur is that the Hermitian of the reduced-form covariance matrix has eigenvalues of opposite signs. These cases are denoted impossibility designs or impossibility DGPs (ID). This restriction cannot be satisfied with homoskedastic errors, but it can happen with heteroskedastic, autocorrelated, and/or clustered (HAC) errors. We show these situations can happen in practice, by applying our theory to the problem of inference on the intertemporal elasticity of substitution (IES) with weak instruments. Out of eleven countries studied by Yogo (2004) and Andrews (2016), the data in nine of them are consistent with impossibility designs at the 95% confidence level. For these countries, the noncentrality parameter of the score statistic can be very close to zero. Therefore, the power loss is sufficiently extensive to dissuade practitioners from blindly using LM-based tests with HAC errors.

Keywords:
Endogenous regressor, Instrumental variable, Score test, HAC errors
JEL classification:
C14, C36
Introduction
In an instrumental variable (IV) regression with potentially weak instruments, the practitioner currently has a choice of test statistics when equation errors are homoskedastic and uncorrelated. In the just-identified case, the Anderson-Rubin (AR) statistic (Anderson and Rubin, 1949) is unbiased, and has at least as much power as any other unbiased test; see Moreira (2001, 2009). In the over-identified case, Andrews, Stock, and Sun (2019) suggest using the conditional likelihood ratio (CLR) test of Moreira (2003). Among the many researchers who have contributed to inference on the structural parameters that is robust to weak instruments are Staiger and Stock (1997), Kleibergen (2002), Moreira (2002), Andrews, Moreira, and Stock (2006), and Mikusheva (2010). Stock, Wright, and Yogo (2002), Dufour (2003), and Andrews and Stock (2007) review weak-instrument robust inference.

Allowing for heteroskedastic, autocorrelated, and/or clustered (HAC) errors in IV regression is important, because ignoring them results in substantial biases in the inference. That omitted variables are potentially serially correlated in time-series data is obvious. Newey and West (1987) and Andrews (1991) propose non-parametric estimators of the variance matrix of the equation error. With cross-sectional and panel data, HAC errors are also common. With panel data, the usual assumption is that equation errors are correlated in the time-series, but not in the cross-sectional dimension. Under the random-effects assumption, the model can be transformed into one with homoskedastic and serially uncorrelated errors. If the random-effects assumption is too restrictive, then the robust variance matrix of the regression coefficients can be estimated, as Liang and Zeger (1986) and White (1980) do. This estimator makes assumptions neither on the error correlation within units nor on the conditional variance of the equation errors.
Robust standard errors are routinely used in empirical research; see Wooldridge (2001), Angrist and Pischke (2009), and Cameron, Gelbach, and Miller (2011).

Which test is best for IV regression with HAC errors is less obvious. In the just-identified case, the Anderson-Rubin (AR) test is still the best choice. For the over-identified case, a number of tests have been proposed by Stock and Wright (2000), Kleibergen (2005), Andrews (2016), Andrews and Mikusheva (2016), Moreira and Ridder (2017), and Moreira and Moreira (2019). Here we show that not all tests suggested for the HAC case are created equal. A preference for tests that depend on the data just through the AR, score, and rank statistics is informed by the observation that these statistics are equivalent to the maximal invariant in the homoskedastic and uncorrelated case. However, such tests do not use all relevant information in the data, and therefore can have low power if the errors are HAC.

Surprisingly, the score statistic can be stochastically bounded for any alternative in parts of the parameter space. These regions involve a restriction on the first-stage regression coefficients and the reduced-form covariance matrix. If the Hermitian of the reduced-form covariance matrix has eigenvalues of opposite signs, then there are regression coefficients which satisfy the restriction. This restriction cannot be satisfied with homoskedastic errors, but it can happen with HAC errors. We call such cases the impossibility designs or impossibility DGPs (ID). The one-sided score statistic has a mixed-normal distribution either in finite samples with normal errors and known variance or under weak-instrument (WIV) asymptotics. However, we show a simpler normal distribution can approximate the score statistic in a variety of asymptotic sequences. These sequences include cases in which the variance matrix is degenerate or the parameter of interest is far from the null.
These examples are important to demonstrate that the score statistic can discard information unnecessarily. The low power of the score/Lagrange multiplier (LM) test is not due to difficulties in distinguishing between the null and the alternative. The test of the structural parameter is not an impossible testing problem, as studied by Bertanha and Moreira (2020). Instead, the power loss of the LM test is so striking that the AR test can have size close to zero while its power is close to one (i.e., probability of making type I and type II errors close to zero). See Kraft (1955) on the connection between tests and the total variation (TV) distance between (convex hulls of) the null and alternatives.

The conditional quasi-likelihood ratio (CQLR) test and conditional linear combination (CLC) tests depend on the score statistic directly. As a result, their power can be constrained in the impossibility designs as well. The CQLR test also suffers severe losses because its power can be bounded for any alternative, and the test can even behave asymptotically like the LM test. The power of CLC tests can be bounded by that of a J-test. The associated J-statistic has a chi-square distribution with one degree of freedom less than AR, but with the same noncentrality parameter. This does not mean that we recommend the J-test or AR test in the over-identified model. The J-test has no power under strong-instrument (SIV) asymptotics, and other tests can perform better than AR in the over-identified case.

In a simulation study, we compare the power of several tests for some impossibility designs. Numerical simulations support the theory by showing that the LM and CQLR tests can have no power while the AR test can distinguish the null from the alternative. We also show that the power of the PI-CLC test of Andrews (2016) is bounded by the J-test for parts of the parameter space. This numerical evidence confirms that CLC tests cannot behave much better than AR in those cases.
The LM, CQLR, and PI-CLC tests are all asymptotically efficient under strong instruments with local alternatives (SIV-LA). Other asymptotically efficient tests are the CIL test (Moreira and Ridder (2017)), the CLR test (Moreira and Moreira (2019) and, more generally, Andrews and Mikusheva (2016)), and SU tests (Moreira and Moreira (2019)). We include the CIL test in the simulations, which provides numerical support that using further data information can help tests outperform the AR test. Furthermore, the power function is a very smooth function of the parameters of the HAC-IV model. Hence, the aforementioned power problems extend to neighborhoods of the impossibility designs, with power remaining near size while it is still trivial to make inference on the structural parameter. These problematic parameter regions are large in the sense that they have nonzero Lebesgue measure. Therefore the power loss is sufficiently extensive to dissuade practitioners from blindly using the LM and CQLR tests when errors are HAC.

We follow Andrews (2016) and consider the problem of inference on the intertemporal elasticity of substitution (IES) with weak instruments. Using a standard Newey-West estimator for the data of Yogo (2004), we find that nine out of the eleven countries have eigenvalues of opposite signs. Hence, there are nontrivial combinations of instruments' coefficients which satisfy the impossibility designs. Although it is not possible to verify whether the ID restriction holds with weak instruments, we can provide confidence sets for those coefficients. We obtain a confidence bound, which is the smallest coverage probability such that the intersection of the corresponding confidence region and the impossibility design restriction is not empty. If that confidence bound is less than 95%, then the intersection of the conventional 95% confidence set and the set that satisfies ID is not empty. The proximity of first-stage parameters to zero allows a lower LM bound.
For all nine countries whose Hermitian matrices have eigenvalues of opposite signs, the LM bound can be very close to zero.

Finally, the confidence bound indicates the size of the intersection of the confidence set and ID. Because the intersection is defined by polynomial equations and inequalities, the intersection is a semi-algebraic set. By the Tarski-Seidenberg theorem, a projection onto a lower-dimensional space is also semi-algebraic. There exist efficient algorithms to compute these projections, allowing us to visualize the ID sets of first-stage parameters which are inside the confidence region.

The paper is organized as follows. Section 2 introduces the model and the test statistics. Section 3 studies the limiting distribution of the LM statistic under a number of asymptotic sequences and introduces the impossibility design. Section 4 considers the CLC family of tests and, in particular, the CQLR test. Section 5 reports the result of a simulation study that shows the power loss of the LM statistic with ID. Section 6 considers the empirical prevalence of ID. Section 7 concludes. The appendix contains all proofs, and the supplement extensively reports power comparisons and confidence sets for the ID restriction with the IES data.

The Model and Test Statistics

We consider the instrumental variable (IV) regression model

\[ y_1 = y_2 \beta + u \tag{2.1} \]
\[ y_2 = Z\pi + v \]

with $y_1, y_2$ $n \times 1$ vectors of observations on the endogenous variables, and $Z$ an $n \times k$ matrix of non-random instrumental variables. The $n \times 1$ vectors $u, v$ are the zero-mean structural-equation and first-stage errors. The variance and covariance matrices of these errors are unrestricted, i.e., we allow for HAC errors. The errors have a normal distribution. Our objective is to test the null hypothesis $H_0: \beta = \beta_0$ against the alternative $H_1: \beta \neq \beta_0$, with $\pi$ a $k \times 1$ vector of nuisance parameters. The reduced form for $Y = [y_1 : y_2]$ is

\[ Y = Z\pi a' + V \tag{2.2} \]

with $a = (\beta, 1)'$ and $V = [v_1 : v_2]$, where $v_1 = u + v\beta$ and $v_2 = v$.

Let $P = [P_1 : P_2]$ be an $n \times n$ orthogonal matrix with $P_1 = Z(Z'Z)^{-1/2}$. By orthogonality of $P$, we have $P_2' Z = 0$.
We pre-multiply (2.2) by $P'$ and define $R = P_1' Y$. The distribution of $P_2' Y$ does not depend on $\beta$ or $\pi$. The induced model for $R$ is given by

\[ R = \mu a' + \tilde V \tag{2.3} \]

with $\mu = (Z'Z)^{1/2}\pi$ and $\tilde V = (Z'Z)^{-1/2} Z' V$. Commonly-used estimators and test statistics depend on $R = [R_1 : R_2]$ and variance estimators of $\tilde V = (Z'Z)^{-1/2} Z' V$. For example, the 2SLS estimator is given by

\[ \hat\beta = \frac{R_2' R_1}{R_2' R_2}. \tag{2.4} \]

Researchers often rely on consistency and normality of $\hat\beta$ to make inference on $\beta$. Both asymptotic properties are based on two sets of assumptions. The first assumption uses the following properties of normalized averages of the instruments and reduced-form errors.

Assumption NA. (a) $n^{-1} Z'Z \to_p D$ with $D$ a positive definite $k \times k$ matrix. (b) $n^{-1} V'V \to_p \Omega$ for some positive definite $2 \times 2$ matrix $\Omega$. (c) $\mathrm{vec}((Z'Z)^{-1/2} Z'V) \to_d N(0, \Sigma)$ for some positive definite $2k \times 2k$ matrix $\Sigma$.

Andrews, Moreira, and Stock (2004) note that Assumption NA holds under very general conditions for the DGP. The second assumption requires that the IVs' coefficients $\pi$ are bounded away from zero. This assumption is not innocuous. Nelson and Startz (1990) show this asymptotic framework does not provide a good approximation to the finite-sample distribution when the instruments $Z$ are weakly correlated with the explanatory variable $y_2$ (i.e., $\pi$ close to zero). To remedy this problem, Staiger and Stock (1997) propose an alternative asymptotic theory in which the IVs are weakly correlated with $y_2$. It turns out that this weak-instrument (WIV) asymptotic theory resembles the finite-sample distribution of $\hat\beta$ with normal errors and known variance.
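The transformation to $R$ and the 2SLS formula (2.4) can be illustrated numerically. The sketch below simulates a small design (all numbers hypothetical) and checks that $R_2'R_1 / R_2'R_2$ coincides with the textbook 2SLS estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, beta = 500, 3, 0.5

Z = rng.normal(size=(n, k))                 # hypothetical instruments
pi = np.array([1.0, 0.5, -0.3])             # hypothetical first-stage coefficients
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)            # endogeneity: u correlated with v
y2 = Z @ pi + v                             # first stage
y1 = y2 * beta + u                          # structural equation

# R = (Z'Z)^{-1/2} Z'Y with Y = [y1 : y2], as in (2.3)
ZZ = Z.T @ Z
w, Q = np.linalg.eigh(ZZ)
ZZ_inv_sqrt = Q @ (Q / np.sqrt(w)).T        # symmetric inverse square root
Y = np.column_stack([y1, y2])
R = ZZ_inv_sqrt @ (Z.T @ Y)

# 2SLS in terms of R, equation (2.4)
beta_2sls = (R[:, 1] @ R[:, 0]) / (R[:, 1] @ R[:, 1])

# the same number from the textbook formula y2'P_Z y1 / y2'P_Z y2
beta_check = (y2 @ (Z @ np.linalg.solve(ZZ, Z.T @ y1))) / (
    y2 @ (Z @ np.linalg.solve(ZZ, Z.T @ y2)))
```

With a strong first stage, as here, both expressions give the same number and it lies close to the true $\beta$.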
For this reason, we assume that the vector $\mathrm{vec}(\tilde V)$ has a normal distribution with known variance matrix $\Sigma$.

We transform $R$ into the pair of $k \times 1$ statistics $S$ and $T$, with $S$ being pivotal and independent of $T$:

\[ S = [(b_0' \otimes I_k)\Sigma(b_0 \otimes I_k)]^{-1/2}(b_0' \otimes I_k)\,\mathrm{vec}(R) \quad \text{and} \tag{2.5} \]
\[ T = [(a_0' \otimes I_k)\Sigma^{-1}(a_0 \otimes I_k)]^{-1/2}(a_0' \otimes I_k)\Sigma^{-1}\,\mathrm{vec}(R), \]

with $a_0 = (\beta_0, 1)'$ and $b_0 = (1, -\beta_0)'$. To represent their finite-sample distribution, it is convenient to define $\bar R = RB$, where $B$ is the non-singular $2 \times 2$ matrix

\[ B = \begin{pmatrix} 1 & 0 \\ -\beta_0 & 1 \end{pmatrix}. \tag{2.6} \]

The induced model for $\bar R$ is

\[ \mathrm{vec}(\bar R) \sim N(\mathrm{vec}(\mu a_\Delta'),\; \bar\Sigma), \tag{2.7} \]

where $a_\Delta = (\Delta, 1)'$, $\Delta = \beta - \beta_0$, and

\[ \bar\Sigma = (B' \otimes I_k)\Sigma(B \otimes I_k) = \begin{pmatrix} \bar\Sigma_{11} & \bar\Sigma_{12} \\ \bar\Sigma_{21} & \bar\Sigma_{22} \end{pmatrix}. \tag{2.8} \]

The statistics $S$ and $T$ have distribution

\[ S \sim N(\Delta\,\bar\Sigma_{11}^{-1/2}\mu,\; I_k) \quad \text{and} \quad T \sim N\big((\bar\Sigma^{22})^{1/2}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1})\mu,\; I_k\big), \tag{2.9} \]

where $\bar\Sigma^{22} = (\bar\Sigma_{22} - \bar\Sigma_{21}\bar\Sigma_{11}^{-1}\bar\Sigma_{12})^{-1}$.

The one-sided and two-sided score statistics of $H_0: \beta = \beta_0$ are, respectively, given by

\[ LM_1 = \frac{S'\,\bar\Sigma_{11}^{-1/2}(\bar\Sigma^{22})^{-1/2}T}{\big\| \bar\Sigma_{11}^{-1/2}(\bar\Sigma^{22})^{-1/2}T \big\|} \quad \text{and} \quad LM = \frac{\big(S'\,\bar\Sigma_{11}^{-1/2}(\bar\Sigma^{22})^{-1/2}T\big)^2}{T'(\bar\Sigma^{22})^{-1/2}\bar\Sigma_{11}^{-1}(\bar\Sigma^{22})^{-1/2}T}. \tag{2.10} \]

Other readily available statistics are the Anderson-Rubin (AR), quasi-likelihood ratio (QLR), and the class of linear combination (LC) statistics:

\[ AR = S'S \tag{2.11} \]
\[ QLR = \frac{AR - r(T) + \sqrt{(AR - r(T))^2 + 4\,LM \cdot r(T)}}{2} \tag{2.12} \]
\[ LC = w(T)\,AR + (1 - w(T))\,LM, \tag{2.13} \]

where $r(T) = T'T$ or $T'(\bar\Sigma^{22})^{-1/2}\bar\Sigma_{11}^{-1}(\bar\Sigma^{22})^{-1/2}T$, among other choices, and $0 \le w(T) \le 1$. Critical values are computed conditionally on $T$. In particular, we consider the conditional QLR and LC tests, hereinafter denoted CQLR and CLC tests.
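The statistics above can be computed directly. The sketch below simulates $\mathrm{vec}(R)$ under the null for an arbitrary positive definite $\Sigma$ (all values hypothetical, and the block subscripts follow the reconstruction used here, so this is a consistency check rather than a definitive implementation). Under $H_0$, $S \sim N(0, I_k)$ independently of $T$, so $AR$ behaves as a chi-square with $k$ degrees of freedom and $LM$ as a chi-square with one:

```python
import numpy as np

rng = np.random.default_rng(2)
k, beta0 = 3, 0.5                          # testing H0: beta = beta0
Ik = np.eye(k)
a0 = np.array([beta0, 1.0])
b0 = np.array([1.0, -beta0])

def inv_sqrt(M):
    w, Q = np.linalg.eigh(M)
    return Q @ (Q / np.sqrt(w)).T

# hypothetical positive definite 2k x 2k variance (HAC errors leave it unrestricted)
A = rng.normal(size=(2 * k, 2 * k))
Sigma = A @ A.T + 2 * k * np.eye(2 * k)
Sigma_inv = np.linalg.inv(Sigma)

# blocks of (B' x I) Sigma (B x I), cf. (2.8)
B = np.array([[1.0, 0.0], [-beta0, 1.0]])
Sb = np.kron(B.T, Ik) @ Sigma @ np.kron(B, Ik)
S11, S12 = Sb[:k, :k], Sb[:k, k:]
S21, S22 = Sb[k:, :k], Sb[k:, k:]
S22dot = np.linalg.inv(S22 - S21 @ np.linalg.inv(S11) @ S12)

MS = inv_sqrt(np.kron(b0[None, :], Ik) @ Sigma @ np.kron(b0[:, None], Ik))
MT = inv_sqrt(np.kron(a0[None, :], Ik) @ Sigma_inv @ np.kron(a0[:, None], Ik))
C = inv_sqrt(S11) @ inv_sqrt(S22dot)       # direction matrix in (2.10)

mu = rng.normal(size=k)                    # hypothetical first-stage strength
mean = np.kron(a0[:, None], Ik) @ mu       # vec(mu a0') under the null
L = np.linalg.cholesky(Sigma)
reps = 4000
vecR = mean[:, None] + L @ rng.normal(size=(2 * k, reps))

S = MS @ (np.kron(b0[None, :], Ik) @ vecR)
T = MT @ (np.kron(a0[None, :], Ik) @ (Sigma_inv @ vecR))
AR = (S * S).sum(axis=0)
U = C @ T
LM = ((S * U).sum(axis=0) / np.linalg.norm(U, axis=0)) ** 2
# Monte Carlo means of AR and LM should be close to k and 1, respectively
```

Since $b_0'a_0 = 0$, $S$ has mean zero and is independent of $T$ under the null regardless of the choice of direction matrix, which is why the chi-square-one behavior of $LM$ does not hinge on the exact normalization of $T$.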
Andrews (2016) shows that AR and CQLR are special cases of CLC tests and proposes a new test, the plug-in conditional linear combination (PI-CLC) test.

For the just-identified case, the Anderson-Rubin test is uniformly most powerful among tests that are either unbiased (Moreira (2001, 2009) for homoskedastic errors, and Moreira and Moreira (2019) for HAC errors) or invariant (Andrews, Moreira, and Stock (2006) for homoskedastic errors, and Moreira and Ridder (2017) for HAC errors). For the over-identified case, the AR test has good power, but it can be outperformed by other tests with strong identification. In Section 3, however, we show there are parameter values for which the power of the LM test is arbitrarily close to its size. In Section 4, we see that this limitation impairs the performance of the CQLR tests and limits the power of CLC tests as well.

Asymptotic Behavior of the LM Statistic

Problems with the LM test in the homoskedastic and uncorrelated case become even more severe with HAC errors. Although the LM test is efficient under standard asymptotics, its low power shows that the LM test ignores important information with weak instruments. As we will see, commonly-used asymptotic approximations are not always a reliable guide to finite-sample behavior.

Other authors have pointed out problems and solutions for the LM test under weak-IV assumptions. Moreira (2001) shows that the noncentrality parameter for LM can be zero for a particular alternative if errors are homoskedastic and uncorrelated. He proposes a switching test based on the AR and LM tests. Andrews (2016) also notes issues with the LM test in the GMM context with HAC errors, and recommends the use of conditional tests based on linear combinations between the AR and LM statistics. The theory derived below leads to more definitive conclusions regarding the low power of the LM statistic, with implications for other tests, including the CQLR test.

We analyze the properties of the one-sided score statistic (2.10) under a number of asymptotic approximations.
The normality of $S$ and $T$ implies that

\[ S = \Delta\,\bar\Sigma_{11}^{-1/2}\mu + U_S \quad \text{and} \quad T = (\bar\Sigma^{22})^{1/2}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1})\mu + U_T, \tag{3.14} \]

with $U_S$ and $U_T$ being independent random vectors with distribution $N(0, I_k)$. The $LM_1$ statistic is given by

\[ LM_1 = \frac{(\Delta\,\bar\Sigma_{11}^{-1/2}\mu + U_S)'\big(\bar\Sigma_{11}^{-1/2}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1})\mu + \bar\Sigma_{11}^{-1/2}(\bar\Sigma^{22})^{-1/2}U_T\big)}{\big\| \bar\Sigma_{11}^{-1/2}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1})\mu + \bar\Sigma_{11}^{-1/2}(\bar\Sigma^{22})^{-1/2}U_T \big\|}. \tag{3.15} \]

We first consider the standard strong instrument (SIV) asymptotic theory in which $\pi$ is a fixed non-zero vector with local alternatives (LA).

Assumption SIV-LA. (a) $\Delta_n = h_\Delta / n^{1/2}$ for some constant $h_\Delta$. (b) $\pi$ is a fixed non-zero $k$-vector for all $n \ge 1$.

Proposition 1
Under Assumptions SIV-LA and NA,

\[ LM_1 \to_d N\Big( h_\Delta \big(\pi' D^{1/2} \bar\Sigma_{11}^{-1} D^{1/2} \pi\big)^{1/2},\; 1 \Big). \]

Like the Wald and likelihood ratio tests, the LM test is asymptotically efficient. The $LM_1$ statistic has a noncentrality parameter depending both on the distance of the alternative from the null and on the instruments' strength. Consider instead that the difference between the alternative and the null is fixed at $\Delta$.
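Before turning to fixed alternatives, Proposition 1 translates directly into an asymptotic power curve for the one-sided LM test. A small sketch with hypothetical values standing in for $D^{1/2}\pi$ and the relevant variance block:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)
k = 3
nd = NormalDist()

p = rng.normal(size=k)                     # hypothetical stand-in for D^{1/2} pi
A = rng.normal(size=(k, k))
S11 = A @ A.T + k * np.eye(k)              # hypothetical variance block

# noncentrality per unit of h_Delta, as in Proposition 1
nc = float(np.sqrt(p @ np.linalg.solve(S11, p)))

# implied asymptotic power of the one-sided 5% LM test against h_Delta
c = nd.inv_cdf(0.95)
def power(h_delta):
    return 1 - nd.cdf(c - h_delta * nc)
```

Power equals size at $h_\Delta = 0$, rises monotonically with $h_\Delta$, and approaches one for distant alternatives, which is the efficient behavior that the later sections show can break down.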
Assumption SIV-FA. (a) $\Delta$ is fixed. (b) $\pi$ is a fixed non-zero $k$-vector for all $n \ge 1$.

If we were to extrapolate Proposition 1 by setting $h_\Delta = n^{1/2}\Delta$, we would have concluded that the LM test is consistent under fixed alternatives. However, it turns out the LM test may not be consistent at all. The following proposition finds that the asymptotic behavior of $LM_1$ depends on the parameter

\[ \zeta = \Delta\,\pi' D^{1/2}\bar\Sigma_{11}^{-1} D^{1/2}\pi - \Delta^2\,\pi' D^{1/2}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1} D^{1/2}\pi. \tag{3.16} \]

Proposition 2
Under Assumptions SIV-FA and NA,

\[ LM_1 \to_d N\bigg( 0,\; \frac{\pi' D^{1/2}\bar\Sigma_{11}^{-1}(\bar\Sigma^{22})^{-1}\bar\Sigma_{11}^{-1} D^{1/2}\pi}{\pi' D^{1/2}(I_k - \Delta\,\bar\Sigma_{11}^{-1}\bar\Sigma_{12})\bar\Sigma_{11}^{-1}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1}) D^{1/2}\pi} \bigg) \]

if $\zeta$ equals zero. If $\zeta$ is strictly positive (or negative), then $LM_1$ diverges to $+\infty$ (or $-\infty$).

The $LM_1$ statistic converges to a normal distribution with zero mean when $\Delta = 0$ or

\[ \Delta = \frac{\pi' D^{1/2}\bar\Sigma_{11}^{-1} D^{1/2}\pi}{\pi' D^{1/2}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1} D^{1/2}\pi}. \tag{3.17} \]

Hence, a one-sided LM test can have no power at one specific alternative. For other values of $\Delta$, $LM_1$ diverges. This happens because $\mu$ grows at rate $n^{1/2}$ and it appears with different exponents in the numerator and denominator of the noncentrality parameter of $LM_1$:

\[ \frac{\Delta\,\mu'\bar\Sigma_{11}^{-1}\mu - \Delta^2\,\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu}{\big(\mu'(I_k - \Delta\,\bar\Sigma_{11}^{-1}\bar\Sigma_{12})\bar\Sigma_{11}^{-1}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1})\mu\big)^{1/2}}. \tag{3.18} \]

The statistic drifts to $+\infty$ or $-\infty$ depending on whether $\zeta$ is positive or negative. Because the sign of $\zeta$ does not necessarily coincide with the sign of $\Delta$, a one-sided LM test could be inconsistent. This issue happens even in the homoskedastic case, as observed by Andrews, Moreira, and Stock (2006). Even so, the power would go to one for the two-sided LM test.

These findings are standard. However, the problems which arise because SIV-LA is not rich enough to embed the usual asymptotics for fixed alternatives (SIV-FA) are not fully appreciated. These preliminary results highlight the caveat that we may not always be able to embed one asymptotic framework into another one. In particular, the SIV asymptotics obscure the potentially bad power of LM in other settings.

The asymptotic theory of Staiger and Stock (1997) resembles the finite-sample distribution of $LM_1$ with normal errors. This approximation holds formally under Assumptions NA, defined earlier, and WIV-FA:

Assumption WIV-FA. (a) $\Delta$ is fixed. (b) $\pi = h_\pi / n^{1/2}$ for some non-stochastic $k$-vector $h_\pi$.

Assumption WIV-FA establishes that IVs are weak in the sense that $\pi$ is in a neighborhood of zero with fixed alternatives.
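The no-power alternative (3.17) is easy to check numerically: $\zeta$ in (3.16) is a quadratic in $\Delta$ with roots at zero and at (3.17). The values below are hypothetical, with the block subscripts as reconstructed here:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 3

p = rng.normal(size=k)                     # hypothetical stand-in for D^{1/2} pi
A = rng.normal(size=(k, k))
S11 = A @ A.T + k * np.eye(k)              # hypothetical variance block
S21 = rng.normal(size=(k, k))              # hypothetical off-diagonal block
S11inv = np.linalg.inv(S11)

lin = p @ S11inv @ p                       # coefficient of Delta in (3.16)
quad = p @ S11inv @ S21 @ S11inv @ p       # coefficient of Delta^2 in (3.16)

def zeta(delta):
    return delta * lin - delta ** 2 * quad

delta_star = lin / quad                    # the alternative (3.17), where zeta = 0
```

At `delta_star` the leading term of the one-sided score statistic vanishes, which is the alternative against which the test has no power in Proposition 2.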
Under WIV-FA, $\mu$ converges to

\[ \mu_\infty = D^{1/2} h_\pi. \tag{3.19} \]

Hereinafter, we omit the subscript from $\mu_\infty$ for simplicity.

The WIV-FA asymptotics provide a comprehensive description of the finite-sample behavior of the LM statistic. The difficulty is that $LM_1$ is a mixed normal random variable, with weights depending on $\Delta$ itself and $\mu$. This convolution may obscure potential problems with the LM test. We investigate cases in which the LM statistic is asymptotically normal and easy to analyze. This simplification shows LM may suffer severely bad power properties in situations where it is trivial to distinguish the null from the alternative (TV distance near one).

The first case is when the variance matrix is nearly singular. Andrews (1987) studies Wald tests with a singular variance matrix, and Andrews and Guggenberger (2019) propose GMM inference with a possibly singular variance matrix.

Footnotes: For the homoskedastic case, our condition $\zeta = 0$ implies $\Delta \cdot e_1'\Omega b_0 = b_0'\Omega b_0$. This yields the alternative $\beta_{AR}$ for which $T$ has zero mean. Andrews, Cheng, and Guggenberger (2020) use subsequences to establish uniform size properties of tests. For further details, see Section 2.6 of Lehmann (1999).
Assumption NS. (a) $\bar\Sigma_{11}^{-1/2}(\bar\Sigma^{22})^{-1/2} \to 0$. (b) $\bar\Sigma_{11}^{-1/2}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}(\bar\Sigma^{22})^{-1/2} \to 0$, so that the terms involving $U_T$ are asymptotically negligible.

This assumption is general enough that we do not even need to require convergence in distribution. We just have to show that the difference between the distribution of $LM_1$ and a normal approximation (which may, or may not, converge) is asymptotically negligible.

Theorem 1
Under Assumptions WIV-FA, NA, and NS,

\[ LM_1 \stackrel{d}{=} N\bigg( \frac{\Delta\,\mu'\bar\Sigma_{11}^{-1}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1})\mu}{\big(\mu'(I_k - \Delta\,\bar\Sigma_{11}^{-1}\bar\Sigma_{12})\bar\Sigma_{11}^{-1}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1})\mu\big)^{1/2}},\; 1 \bigg) + o_p(1) \]

for all $\mu$. Furthermore, this approximation is uniform for all $\mu$ such that $\|\bar\Sigma_{11}^{-1/2}\mu\| \ge \gamma$ for any arbitrary $\gamma > 0$.

Unlike before, we do not consider WIV and SIV asymptotics separately. If we were to follow this route, we would have to state rates, depending on the sample size, at which the variance would approach the near-singular case. The approximation above is uniform for IVs which are not arbitrarily weak, or pointwise regardless of the IVs' strength.

In the special case $\mu = 0$, the mean of the approximation to $LM_1$ is zero. To limit the power of the LM test for $\mu \ne 0$, we consider the following reasoning. Without restrictions, the numerator of the $LM_1$ statistic in (3.15) is quadratic in $\Delta$ and the denominator is linear in $\Delta$. The quadratic term in the numerator has coefficient $\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\mu$, and appears in the approximation given by Theorem 1. Interestingly, the quadratic term is irrelevant under SIV-LA but important in many other asymptotic settings. The reason for this discrepancy is that SIV-LA provides a poor approximation for $LM_1$ for large values of $h_\Delta$ (as explained above using SIV-FA) or small values of $\mu$ (WIV-FA asymptotics).

The numerator and denominator of $LM_1$ are of the same order in $\Delta$ if Assumption ID holds:

Assumption ID. $\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\mu = 0$.

If a DGP satisfies Assumption ID, we call it the impossibility design or impossibility DGP (ID). In particular, there is a bound on power for the LM test under regularity conditions. Let $q_\alpha(l)$ be the $1 - \alpha$ quantile of a chi-square distribution with $l$ degrees of freedom.

Corollary 1
Under Assumptions WIV-FA, NA, NS, and ID,

\[ LM_1 \stackrel{d}{=} N\bigg( \frac{\Delta\,\mu'\bar\Sigma_{11}^{-1}\mu}{\big(\mu'\bar\Sigma_{11}^{-1}\mu + \Delta^2\,\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu\big)^{1/2}},\; 1 \bigg) + o_p(1). \]

This approximation is uniform if $\|\bar\Sigma_{11}^{-1/2}\mu\| \ge \gamma$ for any arbitrary $\gamma > 0$. Furthermore, the following hold if $\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu > 0$:

(i) the absolute value of the mean $m$ of the normal approximation is bounded by

\[ \bar m = \frac{\mu'\bar\Sigma_{11}^{-1}\mu}{\big(\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu\big)^{1/2}}, \quad \text{and} \tag{3.20} \]

(ii) the power of the two-sided LM test is no larger than $1 - G(q_\alpha(1); \bar m^2)$, where $G(q; m^2)$ is the noncentral chi-square-one distribution function with noncentrality parameter $m^2$.

The mean of the LM bound in (3.20) can be written as

\[ \frac{\|\bar\Sigma_{11}^{-1/2}\mu\|}{\big(\omega'\bar\Sigma_{11}^{-1/2}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1/2}\omega\big)^{1/2}} \tag{3.21} \]

for the norm-one vector $\omega = \bar\Sigma_{11}^{-1/2}\mu / \|\bar\Sigma_{11}^{-1/2}\mu\|$. We can then choose $\gamma$ small enough to make this bound arbitrarily small. In practice, this can be done by replacing $\mu$ by $\eta\mu$ in the bound above with $\eta$ small. On the other hand, other readily available tests' power can be arbitrarily large in this design. Take, for example, the Anderson-Rubin test. Its power depends on the noncentrality parameter

\[ \Delta^2\,\mu'\bar\Sigma_{11}^{-1}\mu = \Delta^2\,\|\bar\Sigma_{11}^{-1/2}\mu\|^2 \ge \Delta^2\gamma^2. \tag{3.22} \]

Its power goes to one for arbitrarily large $\Delta$ if $\|\bar\Sigma_{11}^{-1/2}\mu\|$ is bounded away from zero. This observation will be important for the application on intertemporal elasticity of substitution (IES), later in the paper.

We can look at the LM statistic without using small-sigma asymptotics, but still deriving an upper bound on the mean of the distribution of LM. Here, we propose a large-$|\Delta|$ bound. We consider the following assumption for growing alternatives.

Assumption GA.
$|\Delta| \to +\infty$.

Andrews, Marmer, and Yu (2019) consider the behavior of IV tests for a fixed DGP and null hypotheses which are far away from the alternative. They show that the CLR test can have power lower than the two-sided power envelope when the alternative is fixed and the null drifts to $-\infty$ or $+\infty$. Here, we show that LM can suffer severe bad power properties in situations in which it is trivial to distinguish the null from the alternative (TV distance near one), when the null is fixed and the alternative drifts to $-\infty$ or $+\infty$.

Theorem 2
Under Assumptions WIV-FA, NA, GA, and ID,

\[ LM_1 \to_d N\bigg( \mathrm{sgn}(\Delta)\,\frac{\mu'\bar\Sigma_{11}^{-1}\mu}{\big(\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu\big)^{1/2}},\; \frac{\mu'\bar\Sigma_{11}^{-1}(\bar\Sigma^{22})^{-1}\bar\Sigma_{11}^{-1}\mu}{\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu} \bigg) \]

for all $\mu$ with $\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu > 0$. Furthermore, this approximation is uniform for all $\mu$ such that $\|\bar\Sigma_{11}^{-1/2}\mu\| \ge \gamma$ for any arbitrary $\gamma > 0$.

The variance in the asymptotic distribution is larger than one, which decreases the power even further compared to the normal approximation for the near-singular case. Hence, the bound (3.20) still holds.

Finally, Assumptions NS and GA do not embed each other. Close inspection of the proofs of Theorems 1 and 2 shows that each asymptotic theory uses a term that is not used in the other asymptotic framework. In the appendix, we consider a pointwise approximation that embeds either assumption. Here, instead, we consider a result related to SIV-FA in which we bias-correct $LM_1$ before finding its asymptotic distribution.

Theorem 3
Under Assumptions SIV-FA and NA,

\[ LM_1 - n^{1/2}\,\frac{\Delta\,\pi' D^{1/2}\bar\Sigma_{11}^{-1} D^{1/2}\pi - \Delta^2\,\pi' D^{1/2}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1} D^{1/2}\pi}{\big(\pi' D^{1/2}(I_k - \Delta\,\bar\Sigma_{11}^{-1}\bar\Sigma_{12})\bar\Sigma_{11}^{-1}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1}) D^{1/2}\pi\big)^{1/2}} + o_p\big(n^{1/2}\big) \]
\[ \to_d N\bigg( 0,\; \frac{\pi' D^{1/2}\bar\Sigma_{11}^{-1}(\bar\Sigma^{22})^{-1}\bar\Sigma_{11}^{-1} D^{1/2}\pi}{\pi' D^{1/2}(I_k - \Delta\,\bar\Sigma_{11}^{-1}\bar\Sigma_{12})\bar\Sigma_{11}^{-1}(I_k - \Delta\,\bar\Sigma_{21}\bar\Sigma_{11}^{-1}) D^{1/2}\pi} \bigg). \]

Furthermore, if a DGP satisfies ID, the bias-correction term as well as the asymptotic approximation for $LM_1$ simplify:

\[ LM_1 - n^{1/2}\,\frac{\Delta\,\pi' D^{1/2}\bar\Sigma_{11}^{-1} D^{1/2}\pi}{\big(\pi' D^{1/2}\bar\Sigma_{11}^{-1} D^{1/2}\pi + \Delta^2\,\pi' D^{1/2}\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1} D^{1/2}\pi\big)^{1/2}} + o_p\big(n^{1/2}\big) \]
\[ \to_d N\bigg( 0,\; \frac{\pi' D^{1/2}\bar\Sigma_{11}^{-1}(\bar\Sigma^{22})^{-1}\bar\Sigma_{11}^{-1} D^{1/2}\pi}{\pi' D^{1/2}\bar\Sigma_{11}^{-1} D^{1/2}\pi + \Delta^2\,\pi' D^{1/2}\bar\Sigma_{11}^{-1}\bar\Sigma_{12}\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1} D^{1/2}\pi} \bigg). \]

We can now obtain SIV-LA from SIV-FA under sequential asymptotics. Indeed, the limiting distribution of $LM_1$ given by SIV-LA coincides with the approximation given by Theorem 3 after taking $\Delta_n = h_\Delta / n^{1/2}$ sequentially (interestingly, whether ID holds or not has no importance for the asymptotic behavior under SIV-LA). This embedment happens because Theorem 3 keeps all asymptotically relevant terms (in the numerator and denominator) of $LM_1$ which are used in the SIV-LA asymptotics. This remark can be applied to all asymptotic frameworks. Each setting (WIV-FA, SIV-LA, SIV-FA, IV-NS, and IV-GA) keeps some terms and discards others, which are asymptotically negligible. If one asymptotic framework uses more terms than another, we naturally have embedment. To make this clear, it is enough to focus only on the terms present in the numerator of $LM_1$:

\[ \overbrace{\Delta\,\mu'\bar\Sigma_{11}^{-1}\mu}^{(1)} - \overbrace{\Delta^2\,\mu'\bar\Sigma_{11}^{-1}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu}^{(2)} + \overbrace{\Delta\,U_T'(\bar\Sigma^{22})^{-1/2}\bar\Sigma_{11}^{-1}\mu}^{(3)} \tag{3.23} \]
\[ + \underbrace{U_S'\bar\Sigma_{11}^{-1/2}\mu}_{(4)} - \underbrace{\Delta\,U_S'\bar\Sigma_{11}^{-1/2}\bar\Sigma_{21}\bar\Sigma_{11}^{-1}\mu}_{(5)} + \underbrace{U_S'\bar\Sigma_{11}^{-1/2}(\bar\Sigma^{22})^{-1/2}U_T}_{(6)}. \]
Theorem 1 for IV-NS uses the first, second, fourth, and fifth terms (for ID, the second term is zero) if $\mu \ne 0$ and only the sixth term if $\mu = 0$. Finally, Theorem 2 for IV-GA uses the first, third, and fifth terms. The following diagram displays each setting together with asymptotically relevant terms from $LM_1$. Each arrow represents one asymptotic approximation embedding another asymptotic setting.

[Figure 1: Asymptotic Settings]

Naturally, WIV-FA embeds the other settings as it uses all terms in the numerator and denominator of $LM_1$. The second more general theory involves the bias-corrected asymptotics based on SIV-FA. This framework only discards one term in the numerator and three terms in the denominator of $LM_1$. Yet, it is relevant in practice and simple enough that it yields a normal approximation to $LM_1$ instead of the more complex mixed normal distribution.

CLC and CQLR Tests

The CLC test can be expressed as a convolution of the AR and LM tests. Under ID, the LM test can be approximately ancillary. Hence, we expect the power loss that the LM test suffers to have a negative impact on the power of CLC tests. To set the stage, consider a simple example. Let $W_1 \sim N(\theta,$
$1)$ be a test statistic of $H_0: \theta = 0$ against the alternative $\theta > 0$. The test rejects if $W_1 > c$, with critical value $c = \Phi^{-1}(1 - \alpha)$ and $\alpha$ the size of the test. Let $W_2 = W_1 + W_0$ be another test statistic for $H_0$, with $W_0 \sim N(0, \tau^2)$ independent of $W_1$. Note that $W_2$ is the convolution of $W_1$ and an ancillary statistic. The critical value of this test is

\[ c^* = \Phi^{-1}(1 - \alpha)\sqrt{1 + \tau^2} = c\sqrt{1 + \tau^2}. \tag{4.24} \]

The power function of the $W_1$ test is $1 - \Phi(c - \theta)$, and that of the $W_2$ test is

\[ 1 - \Phi\Big(\frac{c^* - \theta}{\sqrt{1 + \tau^2}}\Big) = 1 - \Phi\Big(c - \frac{\theta}{\sqrt{1 + \tau^2}}\Big), \tag{4.25} \]

so that for $\theta > 0$ the power of the $W_2$ test is strictly smaller than the power of the $W_1$ test.

The test $W_1 > c$ is the Neyman-Pearson test of $H_0: \theta = 0$, and $W_2 = W_1 + W_0$ is a test statistic that is the convolution with an ancillary statistic $W_0 \sim G$ nondegenerate. The power of the $W_2$ test is lower for general convolutions with an ancillary statistic. By the optimality of the Neyman-Pearson test, for all $\theta > 0$,

\[ 1 - \int \Phi(c^* - \theta - w)\,dG(w) < 1 - \Phi(c - \theta), \tag{4.26} \]

where the critical value $c^*$ satisfies

\[ \int \Phi(c^* - w)\,dG(w) = 1 - \alpha. \tag{4.27} \]

Now, consider a further extension that is closer to our application.

Footnote: In the appendix, we present a proposition allowing IV-NS and/or IV-GA. This is related to SIV-FA, as they both use exactly the first five terms in the numerator and first three terms in the denominator.

Lemma 1
Let X ∼ N(θ, I_k) and W ∼ G be independent of X, with a distribution that does not depend on θ. The null is H₀: θ = 0 and the alternative is H₁: θ ≠ 0. For W₁ = X'X, the test rejects H₀ if W₁ > c₁, where c₁ is the 1 − α quantile of W₁ when θ = 0. The W₁ test is UMPI; that is, uniformly most powerful among the class of tests invariant to orthogonal transformations of X. In particular, take the statistic W₂ = g(W₁, W). The W₁ test has power no smaller than the W₂ test, which rejects H₀ if W₂ > c₂*, where c₂* is the 1 − α quantile of W₂ when θ = 0.

We apply this result to the class of CLC tests. The weights for AR and LM are allowed to be negative, so we consider the generalization

$$LC = w_{AR}(T)\cdot AR + w_{LM}(T)\cdot LM \tag{4.28}$$

for arbitrary weights w_{AR}(T) and w_{LM}(T). A member of this larger class is the J-test, which rejects the null when

$$J = S'S - LM > q_{\alpha}(k-1). \tag{4.29}$$

This test has no power under SIV-LA, but it has better power than AR in the impossibility designs with a small LM bound under Assumption NS.

[Footnote: The J statistic for homoskedastic errors is denoted by Q_{k−1} in the appendix of Moreira (2003).]

Theorem 4. Suppose Assumptions WIV-FA, NA, NS, and ID hold. If the LM bound given in (3.20) goes to zero, the power of all CLC tests is smaller than the power of the J-test at the same significance level. This more powerful test rejects the null for large values of a chi-square random variable with k − 1 degrees of freedom and noncentrality parameter lim_{j→∞} ∆_j² µ_j'Σ_{11,j}⁻¹µ_j = ξ.

There are several situations in which the LM bound goes to zero. For example, Σ₁₁ and µ are fixed and all singular values of Σ₁₂ go to infinity (in absolute value). In the next section, we consider this scenario to compare power for different tests. Another possibility is to allow a sequence of coefficients µ_j to shrink while keeping the variance matrix fixed. If µ satisfies ID, then so does the coefficient η·µ for any value of η.
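The power loss from convolving with an ancillary statistic, illustrated by the W₁-versus-W₂ example above, can be verified numerically. A minimal sketch using only the standard library; the values of α and τ below are illustrative choices, not values from the paper:

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    # Inverse CDF by bisection; accuracy is more than enough for this check.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha, tau = 0.05, 1.0           # illustrative choices
c = norm_ppf(1.0 - alpha)        # critical value for W1 ~ N(theta, 1)
s = math.sqrt(1.0 + tau ** 2)    # sd of W2 = W1 + W, with W ~ N(0, tau^2)
c_star = c * s                   # critical value for W2, cf. (4.24)

for theta in [0.5, 1.0, 2.0, 4.0]:
    power1 = 1.0 - norm_cdf(c - theta)        # power of the W1 test
    power2 = 1.0 - norm_cdf(c - theta / s)    # power of the W2 test, cf. (4.25)
    # Convolution with an ancillary statistic strictly reduces power for theta > 0.
    assert power2 < power1
```

Both tests have size α at θ = 0; the strict inequality appears as soon as θ > 0, matching (4.25).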
As long as there exists a coefficient µ that satisfies ID, we can find a sequence µ_j → 0 along which the LM bound shrinks. Given that the LM bound shrinks, the LM statistic is asymptotically ancillary. Given that the score then carries asymptotically no information, it is natural to expect that all CLC tests have power no larger than that of the Anderson-Rubin test. This conjecture is incorrect, as it ignores the dependence between the AR and LM statistics. Although CLC tests can perform better than the Anderson-Rubin test in these scenarios, they have power no larger than that of the J-test described above. Furthermore, the weight w(T) for CLC tests must be allowed to be negative for a CLC test to behave as well as the J-test. As a consequence, it is possible to outperform the CLC tests when either k > 2 for any weight w(T), or k = 2 if 0 ≤ w(T) ≤ 1.

The case |∆| → ∞ is less interesting. After all, the AR test is a special case of CLC tests and has power going to one for alternatives arbitrarily distant from the null. Therefore, we do not study CLC tests under Assumption GA here. Instead, we study their power under the richer asymptotic approximation given by Theorem 3.

Theorem 5
Consider the asymptotic approximation given by Theorem 3 under ID. If the LM bound given in (3.20) goes to zero, then the bias-correction term of LM goes to zero. Furthermore, the power of all CLC tests is smaller than the power of a chi-square test at the same significance level. This more powerful test rejects the null for large values of a chi-square random variable with k − 1 degrees of freedom and noncentrality parameter lim_{j→∞} ∆_j² µ_j'Σ_{11,j}⁻¹µ_j = ξ.

In Theorem 4, all CLC tests are dominated by the J-test, which is a function of the data only. The dominating statistic in Theorem 5 uses information on the instruments' coefficients. However, the fact that all CLC statistics cannot outperform a chi-square distribution with k − 1 degrees of freedom is informative even when k = 2. Theorems 4 and 5 also apply to the CQLR test, as Andrews (2016) shows that it is a special case of the CLC test. However, we can get a stronger result for CQLR. The next proposition determines or bounds the behavior of the CQLR test.

[Footnote: This finding agrees with the point optimal invariant similar (POIS) tests of Andrews, Moreira, and Stock (2006) being linear combinations of AR and LM whose weights may be negative.]

Proposition 3. For the QLR statistic defined in (2.12): (i) if r(T) → ∞, then QLR → LM; and (ii) if r(T) > AR, then QLR is bounded by

$$\frac{LM \cdot r(T)}{r(T) - AR} = \frac{LM}{1 - AR/r(T)}.$$

Part (i) shows that if r(T) diverges to infinity, the QLR statistic converges to the LM statistic. This finding is a simple generalization of Moreira (2003) for the homoskedastic case. It will be useful to establish the behavior of the CQLR test in some near-singular designs. For large-alternative designs, this approximation may not be very useful: it is pointwise in AR and LM, and AR grows with ∆. Instead, part (ii) establishes a bound for CQLR which is useful to determine designs in which the CQLR test behaves similarly to the LM test. The condition r(T) > AR is not innocuous for general r(T). However, r(T) ≥ AR does hold if r(T) = T'(Σ₂₂)^{-1/2}Σ₁₁⁻¹(Σ₂₂)^{-1/2}T. For r(T) = T'T, we can find designs in which r(T) > AR with probability approaching one as ∆ → +∞. This happens because AR/T'T converges to the ratio of their noncentrality parameters as ∆ drifts off to infinity. The AR test is consistent for arbitrarily distant alternatives. That the LM and CQLR tests are not consistent is an embarrassing feature; see Berger (1951) for an early discussion of uniformly consistent tests.

If ID holds, then the probability that the LM test detects large deviations |∆| from the null hypothesis is bounded away from 1. Worse, if µ and Σ are such that the LM bound is close to zero, then the power of the test against large deviations from the null is close to its size. We give examples of such designs in this section.

Let J_k be the k × k matrix with the anti-diagonal elements equal to one and the other components zero. We have J_k² = I_k. The k × k submatrices of Σ are

$$\Sigma_{11} = c_1 I_k, \quad \Sigma_{12} = c_2 J_k, \quad\text{and}\quad \Sigma_{22} = c_3 I_k. \tag{5.30}$$

The constants c₁, c₂, and c₃ are chosen so that the matrix Σ is positive definite. Each of the two distinct eigenvalues of Σ,

$$\varsigma_1 = \frac{c_1 + c_3 + \sqrt{(c_1 - c_3)^{2} + 4c_2^{2}}}{2} \quad\text{and}\quad \varsigma_2 = \frac{c_1 + c_3 - \sqrt{(c_1 - c_3)^{2} + 4c_2^{2}}}{2},$$

has multiplicity k. As long as c₁, c₃ ≥ 0 and c₁c₃ ≥ c₂², the matrix Σ is positive semi-definite.

Note that J_k e₁ = e_k, where e₁ = (1, 0, ..., 0)' and e_k = (0, ..., 0, 1)'. Therefore, if we set µ = λ^{1/2}e₁, with λ some positive constant, we find that

$$\mu'\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\mu = \frac{\lambda c_2}{c_1 c_3}\, e_1' J_k e_1 = \frac{\lambda c_2}{c_1 c_3}\, e_1' e_k = 0, \tag{5.32}$$

so that ID holds for this choice of µ and Σ. We also have

$$\left(c_1 I_k - \frac{c_2^{2}}{c_3}\, J_k J_k\right)^{-1} = \frac{c_3}{c_1 c_3 - c_2^{2}}\, I_k \tag{5.33}$$

for the inverse of the Schur complement of Σ₂₂ in Σ. Hereinafter, we consider the case c₁ = 1.

We consider two sets of simulations in which the LM bound (3.20) applies. In this case, the bound is given by

$$\frac{\mu'\Sigma_{11}^{-1}\mu}{\left(\mu'\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\mu\right)^{1/2}} = \frac{\lambda}{(\lambda c_2^{2}/c_3)^{1/2}} = \frac{(\lambda c_3)^{1/2}}{|c_2|}. \tag{5.34}$$

If either |c₂| is large relative to c₃^{1/2} or λ is small, the mean of LM is close to 0 under the alternative, and the test has power equal to size.

Our theory identifies regions of the parameter space where the score test has low power, even power close to size. We confirm this in simulation experiments when testing H₀: β = 0. It turns out we can nearly perfectly distinguish the null from alternatives far enough from the null. To show this, we choose a very small probability of a type I error, and we find regions of the alternative where the probability of a type II error is very small as well.

The normal errors are HAC with the variance matrix (2.8) given by (5.30). The goal in these simulations is twofold: (1) to provide evidence of severe problems with the LM and CQLR tests; and (2) to show that there are power gains from using the information in the statistic S beyond the AR and LM statistics. The AR, LM, CQLR, CLC, and J tests are all conditional tests which depend only on AR and LM. To show the severe power losses of both the LM and CQLR tests, it is enough to compare both tests with the AR test. To demonstrate the power bounds of the PI-CLC test, it is enough to report the J-test and the AR test again. The J-test has no power under SIV-LA asymptotics. The AR test does have power with strong IVs, but it is not efficient.
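The block design (5.30), its eigenvalues, and the ID condition (5.32) can be checked numerically. A minimal sketch using numpy; the block labels Σ₁₁, Σ₁₂, Σ₂₂ follow the text, while the constants c1, c2, c3 below are illustrative choices, not the values used in the simulations:

```python
import numpy as np

k = 4
c1, c2, c3 = 2.0, 1.0, 2.0   # illustrative constants with c1 * c3 > c2**2

# Anti-diagonal J_k (J_k @ J_k == I_k) and the block variance matrix (5.30).
J = np.fliplr(np.eye(k))
Sigma11, Sigma12, Sigma22 = c1 * np.eye(k), c2 * J, c3 * np.eye(k)
Sigma = np.block([[Sigma11, Sigma12], [Sigma12.T, Sigma22]])

# Sigma has exactly two distinct eigenvalues, each with multiplicity k.
eigs = np.linalg.eigvalsh(Sigma)
root = np.sqrt((c1 - c3) ** 2 + 4 * c2 ** 2)
hi, lo = (c1 + c3 + root) / 2, (c1 + c3 - root) / 2
assert np.allclose(np.sort(eigs), np.sort([lo] * k + [hi] * k))
assert lo > 0  # positive definite because c1 * c3 > c2**2

# With mu proportional to e_1, the ID condition (5.32) holds for k >= 2.
lam = 1.0
mu = np.sqrt(lam) * np.eye(k)[:, 0]
id_value = mu @ np.linalg.inv(Sigma11) @ Sigma12 @ np.linalg.inv(Sigma22) @ mu
assert abs(id_value) < 1e-12   # e_1' J_k e_1 = e_1' e_k = 0
```

The check confirms that e₁'J_k e₁ = 0 drives the impossibility restriction, independently of the particular constants.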
One may wonder whether there are available tests which are efficient under SIV-LA and can outperform the AR test in these designs. Available tests with correct size which use further data information are the CIL test considered by Moreira and Ridder (2017); the CLR test derived by Moreira and Moreira (2019) for HAC-IV and by Andrews and Mikusheva (2016) for GMM; and the SU tests proposed by Moreira and Moreira (2019).

The first design is related to the variance matrix being near singular. In the supplement, we present power comparisons for all combinations of significance level (α = 0.001 or larger), number of instruments k = 2 or 10, and instrument strength λ = µ'Σ₁₁⁻¹µ over a range of values. Here, we report results for α = 0.001, k = 2, and λ = 1000. We select the significance level to be so small because the total variation distance between the null and some alternatives is close to one. This means there exist tests which can easily separate the null from the alternative. However, we will see that the LM and CQLR tests can hardly distinguish the null from the alternatives in some impossibility designs. For CQLR, we consider two "rank" statistics, r(T) = T'T and r(T) = T'(Σ₂₂)^{-1/2}Σ₁₁⁻¹(Σ₂₂)^{-1/2}T; the associated tests are denominated CQLR1 and CQLR2. The small value of k equal to two makes it easy to show the discrepancy between the J and PI-CLC tests. We choose λ so large to highlight that SIV-LA asymptotics may not provide a good approximation in finite samples, even when the instruments are considered strong. However, the choice of instrument strength is of secondary importance in power comparisons if we adjust the range of alternatives.

We set c₃ = 1 + c₂², so that

$$\Sigma_{11}^{-1/2}(\Sigma_{22})^{-1/2} = c_3^{-1/2}\, I_k \quad\text{and}\quad \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}(\Sigma_{22})^{-1/2} = c_2\, c_3^{-3/2}\, J_k. \tag{5.35}$$

Assumption NS holds if c₂ → ∞. The mean of T is

$$E(T) = (\Sigma_{22})^{1/2}\,(I_k - \Delta\,\Sigma_{21}\Sigma_{11}^{-1})\,\mu = \lambda^{1/2}c_3^{1/2}\,(e_1 - \Delta\, c_2\, e_k), \tag{5.36}$$

with e_k being the k-th unit vector. The statistic T'T diverges to infinity because its noncentrality parameter equals λc₃ + λ∆²c₂²c₃, and this term clearly diverges as c₂ → ∞. The statistic T'(Σ₂₂)^{-1/2}Σ₁₁⁻¹(Σ₂₂)^{-1/2}T also diverges to infinity. As a result, both CQLR1 and CQLR2 behave asymptotically as the LM test.

As the approximations in Theorem 1 are more accurate if c₂ is large, we set c₂ = 100. The HAC variance matrix Σ then converges to a singular matrix. Singularity of the variance matrix implies that a linear transformation of the S and T statistics has variance zero. This should improve power beyond conditional tests which depend only on AR and LM.

Figure 2 reports power for different values of ∆.
The first graph shows that the power of the LM test and of both versions of the CQLR test is approximately equal to the size of the test. The second graph shows that the performance of the PI-CLC test lies between the AR test and the J-test. This numerical comparison matches our earlier theory, which shows that all CLC tests are dominated by the J-test under Assumption NS.

The AR test does not suffer the same lack of power in the designs where the LM test fails. However, the power of the AR test is known to deteriorate when the number of instruments increases. The CIL, CLR, and SU tests are not in the CLC class of tests: they depend on the statistic S not just through the AR and LM statistics. We report the CIL test only, as the CLR and SU tests can be numerically challenging to implement in some designs. The CIL test performs better than the LM and CQLR tests, and improves on the AR test, even when the number of instruments is as small as two.

The flat power curves of the LM and CQLR tests are surprising. These tests are often applied in nonlinear models with heteroskedastic and correlated errors, and their use is so common that citations number in the thousands. On the other hand, Moreira and Moreira (2019) report power comparisons showing that the CIL and SU tests perform better than the LM and CQLR tests. One could perhaps argue that we should not discard these tests yet, because there may be regions of the parameter space where they outperform other tests. As we argue next, the problems with the LM and CQLR tests are much more fundamental.

Kraft (1955) notes the connection between tests and the total variation (TV) distance between (convex hulls of) the null and alternatives. For example, take the problem of testing a Bernoulli against a continuous distribution. We contemplate the test which fails to reject the null if we observe the values 0 or 1, and rejects the null otherwise.
This trivial test has size equal to zero and power equal to one. The existence of a trivial test is due to the total variation metric being one. Like testing a Bernoulli against a continuous distribution, our testing problem is easy, and we should have no problem distinguishing the null from subsets of the alternative. Nevertheless, the score and CQLR tests have power close to size and cannot separate the null from these alternatives at all.

[Footnote: The CLR test requires a grid search to establish initial conditions for the optimization procedures. This creates difficulties in approximating the conditional quantile for some designs, as the model is not correctly specified when S ∼ N(0, I_k) and T has the distribution given by (2.9). The SU tests require linear programming to be implemented. Although there are readily available algorithms, this requires the calculation of a density ratio. If not dealt with properly, the computation can exceed the numerical accuracy of computer packages.]

[Figure 2: Power curves with ID and a near-singular variance matrix (k = 2, λ = 1000). Left panel: AR, LM, CQLR1, CQLR2. Right panel: AR, J, PI-CLC, CIL.]

The second design is related to the LM approximation for large values of ∆. We consider c₂ = 10 and set c₃ = (1 + c₂)². We choose this design so that the variance matrix is non-singular (the smallest eigenvalue is 1/6) and the CQLR test also performs poorly for large values of ∆ (see Proposition 3). We consider α = 0.001 or larger, k = 2 or 10, and λ ranging over a grid of values. The power curves again show the poor performance of the LM statistic.

In this section, we discuss the prevalence of the impossibility designs. These depend on the covariance matrix of the reduced-form and first-stage errors.
Regions of low power are present in many DGPs, not only the specific one presented in Section 5. The impossibility design occurs if the standardized first-stage coefficients µ satisfy µ'Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹µ = 0. Which properties of the 2k × 2k matrix Σ and of µ imply this equality, which ensures that the noncentrality parameter of the LM statistic is bounded for any value of ∆? The next proposition gives necessary and sufficient conditions for this to happen.

[Figure 3: Power curves with ID and large alternatives (k = 2, λ = 100). Left panel: AR, LM, CQLR1, CQLR2. Right panel: AR, J, PI-CLC, CIL.]
Proposition 4
For a k × k matrix A, define the Hermitian part of A as the symmetric matrix H = (A + A')/2. Then there exists x ≠ 0 such that x'Ax = 0 if and only if the convex hull of the spectrum of H contains zero (that is, some convex combination of the eigenvalues of H equals zero).

We can apply this proposition to Assumption ID by taking x = µ and A = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹. Proposition 4 shows why an impossibility DGP cannot occur if Σ is a Kronecker-product design (which includes the case in which the errors are uncorrelated and homoskedastic). When Σ = Ω ⊗ Φ for a positive definite 2 × 2 matrix Ω and a positive definite k × k matrix Φ, the matrix Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹ is proportional to Φ⁻¹ and is therefore either positive or negative definite.

This proposition also explains the findings in Section 5. There, the matrix Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹ is symmetric and proportional to the anti-diagonal matrix J_k. The trace of this matrix is 0 or 1 (depending on whether k is even or odd), and J_k has both 1 and −1 as eigenvalues. Therefore, there exist at least one positive and one negative eigenvalue, which implies that there exist coefficients µ such that the impossibility design holds.

If Σ₁₁ is singular, it is trivial to separate the null and alternative hypotheses. For example, the AR test's noncentrality parameter goes to infinity under the alternative if one of the eigenvalues of Σ₁₁ approaches zero, for most parameter choices µ. If Σ₁₁ is positive definite, we can apply Proposition 4 to x = Σ₂₂⁻¹µ and A = Σ₁₂. Hence, the impossibility design holds for all matrices Σ₁₂ such that Σ₁₂ + Σ₁₂' does not have all eigenvalues of the same sign. For a given matrix Σ₁₂, we can choose Σ₁₁ and Σ₂₂ so that the variance matrix Σ is positive semi-definite. Hence, there exists a vast range of designs in which the noncentrality parameter of the LM statistic is bounded.
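Proposition 4 suggests a simple numerical recipe for producing an impossibility direction: mix the eigenvectors of the Hermitian part H that correspond to its extreme eigenvalues so that the two quadratic contributions cancel. A sketch under our own naming, not the paper's code:

```python
import numpy as np

def zero_quadratic_direction(A, tol=1e-10):
    """Return x != 0 with x' A x = 0 when the Hermitian part H = (A + A')/2
    has zero in the convex hull of its spectrum (Proposition 4); else None."""
    H = (A + A.T) / 2.0
    w, V = np.linalg.eigh(H)          # eigenvalues in ascending order
    lo, hi = w[0], w[-1]
    if lo > tol or hi < -tol:         # spectrum strictly on one side of zero
        return None
    if abs(lo) <= tol:
        return V[:, 0]                # eigenvector of a (near-)zero eigenvalue
    if abs(hi) <= tol:
        return V[:, -1]
    # Weight the extreme eigenvectors so that (-lo)*hi + hi*lo = 0.
    return np.sqrt(-lo) * V[:, -1] + np.sqrt(hi) * V[:, 0]

# Example in the spirit of Section 5: A proportional to the anti-diagonal J_k,
# whose Hermitian part has eigenvalues of both signs.
k = 4
A = np.fliplr(np.eye(k))
x = zero_quadratic_direction(A)
assert x is not None and abs(x @ A @ x) < 1e-8

# A definite A (e.g., a Kronecker-product design) admits no such direction.
assert zero_quadratic_direction(np.eye(k)) is None
```

The orthogonality of the two eigenvectors is what makes the cross terms vanish, so x'Hx reduces to the cancelling pair of weighted eigenvalues.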
In Appendix A, we also discuss cases in which Σ is near singular and the impossibility design has the bound in (3.20) arbitrarily close to zero.

The impossibility designs in Section 5 are constructed so that the commonly used LM and CQLR tests have no power. When errors are uncorrelated and homoskedastic, there exists a minimax justification for using those tests. Andrews, Moreira, and Stock (2006) and Chamberlain (2007) implicitly use the Hunt-Stein theorem to justify the focus on tests that depend on the data only through S'S, (S'T)², and T'T. However, the aforementioned minimax result is not applicable to conditional (on the statistic T) tests that are linear combinations of the AR and LM statistics for HAC errors.

Do empirical estimates of Σ have eigenvalues of Σ₁₂ + Σ₁₂' of opposite signs? As an example, we take the estimation of the intertemporal elasticity of substitution (IES) in Yogo (2004). He considers four instruments and three different models. As Moreira and Moreira (2019) do, we focus on the model where the endogenous variable is the real stock return and the instruments are genuinely weak. Out of the eleven countries considered by Yogo (2004), nine have eigenvalues with opposite signs. The eigenvalues of the estimates of Σ₁₂ + Σ₁₂', using the popular Newey-West estimator of Σ (Newey and West (1987)), are reported in Table 1.

Table 1: Eigenvalues of Σ₁₂ + Σ₁₂' for the model and data of Yogo (2004)

Australia        0.0014   0.0008   0.0001   0.0003
Canada           0.0030   0.0009  -0.0001   0.0002
France          -0.0011   0.0001   0.0008   0.0013
Germany         -0.0019  -0.0003   0.0005   0.0002
Italy           -0.0022   0.0013   0.0006  -0.0005
Japan            0.0027  -0.0006   0.0010   0.0008
Netherlands      0.0006  -0.0005  -0.6000  -0.0005
Sweden           0.0008   0.0003  -0.0004   0.0001
Switzerland      0.0005   0.0001  -0.0001  -0.0003
United Kingdom  -0.0032   0.0016   0.0003   0.0001
United States    0.0009   0.0006   0.0003   0.0011
The necessary condition for the LM noncentrality parameter to be bounded (i.e., the impossibility design) is satisfied for most countries.

One can argue that the values of µ for which the LM noncentrality parameter is bounded, i.e., µ'Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹µ = 0, are very special. For a given matrix Σ, the set of µ for which µ'Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹µ = 0 has Lebesgue measure zero. This argument is questionable. Take, for example, the theory of limit experiments. If a family of models is locally asymptotically quadratic (LAQ), then it is locally asymptotically mixed normal (LAMN) except on a set with Lebesgue measure 0. Yet there is a vast literature analyzing those special models, of which the weak-IV model itself is a special case.

However, even if µ does not satisfy µ'Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹µ = 0 exactly, the problems with the LM and CQLR tests remain, because the information loss also occurs in a neighborhood of the impossibility DGP, as we show in the next theorem.

Theorem 6
In the HAC-IV model, (i) the power function ρ_φ(β, µ) of any test φ is analytic in (β, µ); and (ii) the power function ρ_φ(β, µ) is uniformly continuous over any compact set.

Theorem 6, part (i) shows that the power function is analytic for any test. Close inspection of the proof shows that the expectation of any statistic (assuming the expectation exists) is also analytic, so the result can be generalized to other (curved) exponential families. This is a stronger result than the differentiability of the expectation obtained by Hirano and Porter (2012, 2015) for such models. We apply part (ii) of Theorem 6 to compact sets containing values of µ such that µ'Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹µ = 0 and values of β large enough that the total variation distance between the null and that specific alternative is close to 1. Small changes in both parameters change the power function very little. Let us consider the CIL and LM tests. Following Kraft (1955), the power function over the alternative of the CIL test gives a lower bound on the total variation distance between the hypotheses. Therefore, for small changes in the parameters, the total variation distance remains close to one. Applying Theorem 6 to the LM test, the power of that test remains close to size for small changes in β.

In a simulation, we confirm that the LM and CQLR tests have power close to size even if µ'Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹µ is only close to 0. We use a variance matrix Σ_δ = Σ + Ψ_δ, with Σ a variance matrix of the impossibility design and Ψ_δ a small positive definite matrix. The 2k × 2k matrix Ψ_δ is obtained by drawing the columns of the 2k × 2k matrix X = [X₁ ... X₂ₖ] independently from N(0, δ²·I₂ₖ), so that vec(X) ∼ N(0, δ²·I₄ₖ²). Define P = X(X'X)^{-1/2}, which is by construction an orthogonal matrix. Let Λ_δ be the diagonal matrix of the (non-negative) eigenvalues of X'X. We take Ψ_δ = PΛ_δP', which is positive definite with probability one. For each repetition, we first draw Σ_δ and next draw the equation errors of the HAC model using Σ_δ as their variance matrix. The power plots are averaged over the draws of X. Figures 4 and 5 show power curves for the AR, LM, CQLR, and CIL tests for a small value of δ and k = 2.

[Figures 4 and 5: Power curves for the perturbed designs (k = 2; λ = 1000 and λ = 100). Left panels: AR, LM, CQLR1, CQLR2. Right panels: AR, J, PI-CLC, CIL.]
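One way to implement the perturbation Ψ_δ described above is the following sketch (the values of k and δ are illustrative; the construction follows the polar-decomposition recipe in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
k, delta = 2, 0.1   # illustrative values

# Draw the columns of X independently from N(0, delta^2 I_{2k}).
X = delta * rng.standard_normal((2 * k, 2 * k))

# P = X (X'X)^{-1/2} is orthogonal by construction.
w, V = np.linalg.eigh(X.T @ X)              # X'X is symmetric, PD a.s.
inv_sqrt = V @ np.diag(w ** -0.5) @ V.T     # (X'X)^{-1/2}
P = X @ inv_sqrt
assert np.allclose(P.T @ P, np.eye(2 * k))

# Psi_delta = P Lambda_delta P', with Lambda_delta the eigenvalues of X'X.
Psi = P @ np.diag(w) @ P.T
assert np.all(np.linalg.eigvalsh(Psi) > 0)  # positive definite w.p. 1
```

Adding Ψ_δ to an impossibility-design Σ then gives a nearby variance matrix Σ_δ that violates the ID restriction only slightly, which is what the simulation exploits.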
Could we test whether the impossibility design holds by estimating π itself? If the instruments are weak, we consider π = h_π/√n. Because the parameter h_π is not consistently estimable, we cannot be sure whether we have an impossibility DGP, even in large samples.

Even if the loss of power of the LM and CQLR tests is large in the neighborhood of an exact impossibility design, it could still be the case that the standardized first-stage estimates have a low probability of being close to an impossibility DGP. We have to check whether the confidence set for µ intersects the set of impossibility designs; i.e., we consider the intersection of the sets of µ defined by

$$(\hat\mu - \mu)'\,\Sigma_{11}^{-1}\,(\hat\mu - \mu) \le q_{\alpha}(k) \quad\text{and}\quad \mu'\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\mu = 0. \tag{6.37}$$

We check whether the intersection is empty by solving the constrained minimization problem

$$\min_{\mu}\ (\hat\mu - \mu)'\,\Sigma_{11}^{-1}\,(\hat\mu - \mu) \quad\text{s.t.}\quad \mu' H \mu = 0, \tag{6.38}$$

where the Hermitian matrix is

$$H = \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}. \tag{6.39}$$

We call the minimum of the objective function the confidence bound, because it is the smallest confidence-set cutoff such that the confidence region intersects the impossibility design. If the confidence bound is greater than q_α(k), then the intersection of the 1 − α confidence set and the impossibility DGP is empty. The DGP conforms to the impossibility design if and only if the eigenvalues of the Hermitian matrix have opposite signs.

If the convex hull of the spectrum of H does not contain zero, then trivially the only solution to (6.38) is µ̃ = 0.
Otherwise, the solution µ̃ to this problem satisfies

$$\left(\Sigma_{11}^{-1} + \kappa H\right)\tilde\mu = \Sigma_{11}^{-1}\hat\mu, \tag{6.40}$$

where κ is a Lagrange multiplier. The solution µ̃ resembles ridge estimation, with two important differences. In ridge regression, the first-order condition yields a unique solution once we determine the multiplier; furthermore, the constraint yields a decreasing function of the multiplier, which is easy to solve. (We refer the reader to Draper and Nostrand (1979) for further details.) In our setup, the matrix Σ₁₁⁻¹ + κH may not be invertible, so it is not always true that

$$\tilde\mu = \left(\Sigma_{11}^{-1} + \kappa H\right)^{-1}\Sigma_{11}^{-1}\hat\mu. \tag{6.41}$$

In the appendix, we give an example in which Σ₁₁⁻¹ + κH is singular and explain why this feature does not typically arise in applications. In the invertible case, the constraint in (6.38) yields the equation

$$\hat\mu'\,\Sigma_{11}^{-1/2}\left(I + \kappa \bar H\right)^{-1}\bar H\left(I + \kappa \bar H\right)^{-1}\Sigma_{11}^{-1/2}\,\hat\mu = 0, \tag{6.42}$$

where $\bar H = \Sigma_{11}^{1/2} H\, \Sigma_{11}^{1/2}$. Because H̄ can have eigenvalues of opposite signs, the left-hand side may not be monotonic in κ. In practice, we solve for all values of κ numerically; in the appendix, we provide more details on our search algorithm. Among the κ which satisfy (6.42), we find the value κ̂ which minimizes

$$(\hat\mu - \tilde\mu)'\,\Sigma_{11}^{-1}\,(\hat\mu - \tilde\mu) = \kappa^{2}\,\hat\mu'\,\Sigma_{11}^{-1/2}\bar H\left(I + \kappa \bar H\right)^{-2}\bar H\,\Sigma_{11}^{-1/2}\,\hat\mu. \tag{6.43}$$

In Table 2, we report, for all countries that satisfy the necessary condition for the impossibility DGP, the Lagrange multiplier (column 2), the minimum of the objective function (column 3), the objective function at µ = 0 (column 4), and the LM bound at the minimizer µ̃ (column 5). In the data, k = 4, so the cutoff for the 95% confidence region is 9.49. The column reporting the minimized objective function shows that, for all countries where the necessary condition for an impossibility DGP is satisfied, the intersection of the 95% confidence region and the impossibility DGP is non-empty.

Table 2: Smallest confidence bound

Country          κ̂      (µ̂−µ̃)'Σ₁₁⁻¹(µ̂−µ̃)   µ̂'Σ₁₁⁻¹µ̂   LM bound at µ̃
Australia         –            –               6.40           –
Canada          12.47         4.25             5.64          3.24
France           2.72         1.40             6.74          3.37
Germany         −6.54         4.35             7.68          5.97
Italy           −1.02         1.51             2.07          2.49
Japan            1.04         3.09            10.45          5.17
Netherlands     −2.45         1.36             5.67          2.69
Sweden          −0.44         1.83             5.63         12.35
Switzerland     −1.05         0.05             0.57          8.47
United Kingdom   1.62         0.61             3.85          4.00
United States     –            –               9.12           –

The restriction (µ̂ − µ̃)'Σ₁₁⁻¹(µ̂ − µ̃) ≤ 9.49 does not characterize the full set of µ in the confidence region that are consistent with the impossibility DGP. Obviously, for all countries except Australia and the United States, µ̃ satisfies the restriction of the impossibility DGP and is within the confidence region. The vector η·µ̃ for scalar η also satisfies the impossibility DGP restriction, and it is in the confidence region if

$$(\hat\mu - \eta\tilde\mu)'\,\Sigma_{11}^{-1}\,(\hat\mu - \eta\tilde\mu) \le 9.49. \tag{6.44}$$

To find the smallest η, we minimize η under the restriction (6.44). The column reporting µ̂'Σ₁₁⁻¹µ̂, which is the left-hand side of (6.44) at η = 0, shows that, except for Japan, η = 0 satisfies the restriction. This implies that the noncentrality parameter in the heading of column 4 can be 0 for an impossibility DGP inside the 95% confidence region for µ. For Japan, the restriction (6.44) is binding, so we minimize η with (6.44) as an equality constraint. We find

$$\eta = \frac{\tilde\mu'\Sigma_{11}^{-1}\hat\mu}{\tilde\mu'\Sigma_{11}^{-1}\tilde\mu} - \sqrt{\left(\frac{\tilde\mu'\Sigma_{11}^{-1}\hat\mu}{\tilde\mu'\Sigma_{11}^{-1}\tilde\mu}\right)^{2} - \frac{\hat\mu'\Sigma_{11}^{-1}\hat\mu - 9.49}{\tilde\mu'\Sigma_{11}^{-1}\tilde\mu}}. \tag{6.45}$$

For Japan, the solution is a small positive value of η.

Until now, we have not characterized the full set of µ that are consistent with an impossibility DGP and are within a confidence set. The set described in (6.37) is defined by a polynomial inequality and a polynomial equality. A set defined by polynomial (in)equalities in R^k is called semi-algebraic. The Tarski-Seidenberg theorem guarantees that projections of semi-algebraic sets onto lower dimensions are also semi-algebraic. The Cylindrical Algebraic Decomposition (CAD) algorithm finds these projections. Figure 6 presents all six possible two-dimensional projections of the set in R⁴ for France at the 95% confidence level. The graphs are close to being symmetric about zero. This is not surprising: if µ satisfies the impossibility restriction, then so does −µ. If µ̂ = 0, the symmetry holds exactly, and if µ̂ is close to zero, it holds approximately. In the supplement, we provide these projections for all nine countries at both the 95% and 99% confidence levels.

[Footnote: Instead, we could have tried to find the µ̃ in the confidence region which satisfies the impossibility design and yields the smallest LM bound for Japan. The semi-algebraic nature of this setup also means that the minimization problem (6.38) is a semi-algebraic/polynomial problem for which we can find a global minimum; see Lasserre (2015). However, we do not pursue this approach here, as the LM bound of 0.1219 is already very low.]

[Footnote: Throughout this paper, we take the approach that Σ is known. If we instead considered Σ to be unknown and had a set of potential estimates of Σ, the set of potentially problematic first-stage estimates would be larger.]

In the IV model with HAC errors, the LM statistic can have little information about the structural parameter for impossibility designs. The QLR statistic can reduce to the LM statistic. This information loss can be so extreme that the LM and CQLR tests can have power close to size, failing to distinguish the null from the alternative. In testing contexts in which it is [...]. However, our findings have more important and widely applicable implications. It is a commonly accepted research agenda to consider estimators, hypothesis tests, and confidence sets which are efficient under special assumptions and robust to broader settings. We highlight here the importance of considering relative efficiency in these more general settings as well. Otherwise, we are at risk of making inference which is correct (e.g., consistent estimators, tests with correct size, regions with correct coverage) but discarding efficiency unnecessarily.

[Footnote: This theory favors the adaptation of the CLR test by Andrews and Mikusheva (2016) for moment models, in contrast to other GMM versions of the original LM and CLR tests designed for the IV model with homoskedastic errors.]
eferences Anderson, T. W., and
H. Rubin (1949): “Estimation of the Parameters of a Single Equationin a Complete System of Stochastic Equations,”
Annals of Mathematical Statistics , 20, 46–63.
Andrews, D. W. K. (1987): “Consistency in Nonlinear Econometric Models: A GenericUniform Law of Large Numbers,”
Econometrica , 55, 1465–1471.(1991): “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Esti-mation,”
Econometrica , 59, 817–858.
Andrews, D. W. K., X. Cheng, and P. Guggenberger (2020): "Generic Results for Establishing the Asymptotic Size of Confidence Sets and Tests," Journal of Econometrics, in press.

Andrews, D. W. K., and P. Guggenberger (2019): "Identification- and Singularity-Robust Inference for Moment Condition Models," Quantitative Economics, 10, 1703–1746.

Andrews, D. W. K., V. Marmer, and Z. Yu (2019): "On optimal inference in the linear IV model," Quantitative Economics, 10, 457–485.

Andrews, D. W. K., M. J. Moreira, and J. H. Stock (2004): "Optimal Invariant Similar Tests for Instrumental Variables Regression," NBER Working Paper t0299.

Andrews, D. W. K., M. J. Moreira, and J. H. Stock (2006): "Optimal Two-Sided Invariant Similar Tests for Instrumental Variables Regression," Econometrica, 74, 715–752.
Andrews, D. W. K., and J. H. Stock (2007): "Inference with Weak Instruments," in Advances in Economics and Econometrics, Theory and Applications: Ninth World Congress of the Econometric Society, ed. by R. Blundell, W. K. Newey, and T. Persson, vol. 3, chap. 6. Cambridge University Press, Cambridge.
Andrews, I. (2016): "Conditional Linear Combination Tests for Weakly Identified Models," Econometrica, 84, 2155–2182.

Andrews, I., and A. Mikusheva (2016): "Conditional Inference with a Functional Nuisance Parameter," Econometrica, 84, 1571–1612.

Andrews, I., J. H. Stock, and L. Sun (2019): "Weak Instruments in Instrumental Variables Regression: Theory and Practice," Annual Review of Economics, 11, 727–753.

Angrist, J., and J.-S. Pischke (2009): Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press.

Berger, A. (1951): "On uniformly consistent tests," Annals of Mathematical Statistics, 22, 289–293.
Bertanha, M., and M. Moreira (2020): "Impossible Inference in Econometrics: Theory and Applications," Journal of Econometrics, forthcoming.

Cameron, A. C., J. B. Gelbach, and D. L. Miller (2011): "Robust Inference With Multiway Clustering," Journal of Business and Economic Statistics, 29, 238–249.
Chamberlain, G. (2007): "Decision Theory Applied To an Instrumental Variables Model," Econometrica, 75, 609–652.
Draper, N. R., and R. C. Van Nostrand (1979): "Ridge Regression and James-Stein Estimation: Review and Comments," Technometrics, 21, 451–466.
Dufour, J.-M. (2003): "Presidential Address: Identification, Weak Instruments, and Statistical Inference in Econometrics," Canadian Journal of Economics, 36, 767–808.
Hirano, K., and J. R. Porter (2012): "Impossibility Results for Nondifferentiable Functionals," Econometrica, 80, 1769–1790.

Hirano, K., and J. R. Porter (2015): "Location Properties of Point Estimators in Linear Instrumental Variables and Related Models," Econometric Reviews, 34, 720–733.
Kadane, J. B. (1971): "Comparison of k-Class Estimators When the Disturbances Are Small," Econometrica, 39, 723–737.
Kleibergen, F. (2002): "Pivotal Statistics for Testing Structural Parameters in Instrumental Variables Regression," Econometrica, 70, 1781–1803.

Kleibergen, F. (2005): "Testing Parameters in GMM Without Assuming That They Are Identified," Econometrica, 73, 1103–1123.
Kraft, C. (1955): "Some Conditions for Consistency and Uniform Consistency of Statistical Procedures," University of California Publications in Statistics, 2, 125–142.
Lasserre, J. B. (2015): An Introduction to Polynomial and Semi-Algebraic Optimization. New York: Cambridge University Press.
Lehmann, E. L. (1999): Elements of Large-Sample Theory. New York: Springer-Verlag.
Liang, K.-Y., and S. L. Zeger (1986): "Longitudinal data analysis using generalized linear models," Biometrika, 73, 13–22.
Mikusheva, A. (2010): "Robust Confidence Sets in the Presence of Weak Instruments," Journal of Econometrics, 157, 236–247.
Moreira, H., and M. J. Moreira (2019): "Optimal Two-Sided Tests for Instrumental Variables Regression with Heteroskedastic and Autocorrelated Errors," Journal of Econometrics, 213, 398–433.
Moreira, M. J. (2001): "Tests with Correct Size when Instruments Can Be Arbitrarily Weak," Center for Labor Economics Working Paper Series, 37, UC Berkeley.

Moreira, M. J. (2002): "Tests with Correct Size in the Simultaneous Equations Model," Ph.D. thesis, UC Berkeley.

Moreira, M. J. (2003): "A Conditional Likelihood Ratio Test for Structural Models," Econometrica, 71, 1027–1048.

Moreira, M. J. (2009): "Tests with Correct Size when Instruments Can Be Arbitrarily Weak," Journal of Econometrics, 152, 131–140.
Moreira, M. J., and G. Ridder (2017): "Optimal Invariant Tests in an Instrumental Variables Regression With Heteroskedastic and Autocorrelated Errors," arXiv:1705.00231.
Nelson, C. R., and R. Startz (1990): "Some Further Results on the Exact Small Sample Properties of the Instrumental Variable Estimator," Econometrica, 58, 967–976.
Newey, W. K., and K. D. West (1987): "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703–708.
Staiger, D., and J. H. Stock (1997): "Instrumental Variables Regression with Weak Instruments," Econometrica, 65, 557–586.
Stock, J. H., and J. Wright (2000): "GMM with Weak Identification," Econometrica, 68, 1055–1096.
Stock, J. H., J. Wright, and M. Yogo (2002): "A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments," Journal of Business and Economic Statistics, 20, 518–529.
White, H. (1980): "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48, 817–838.
Wooldridge, J. (2001): Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.
Yogo, M. (2004): "Estimating the Elasticity of Intertemporal Substitution When Instruments Are Weak," Review of Economics and Statistics, 86, 797–810.