Impossible Inference in Econometrics: Theory and Applications ∗

Marinho Bertanha † University of Notre Dame
Marcelo J. Moreira ‡ FGV EPGE
This version: November 19, 2018. First version: October 11, 2016.
This paper studies models in which hypothesis tests have trivial power, that is, power smaller than size. This testing impossibility, or impossibility type A, arises when any alternative is not distinguishable from the null. We also study settings where it is impossible to have almost surely bounded confidence sets for a parameter of interest. This second type of impossibility (type B) occurs under a condition weaker than the condition for type A impossibility: the parameter of interest must be nearly unidentified. Our theoretical framework connects many existing publications on impossible inference that rely on different notions of topologies to show models are not distinguishable or nearly unidentified. We also derive both types of impossibility using the weak topology induced by convergence in distribution. Impossibility in the weak topology is often easier to prove, it is applicable to many widely-used tests, and it is useful for robust hypothesis testing. We conclude by demonstrating impossible inference in multiple economic applications of models with discontinuity and time-series models.
Keywords: hypothesis tests, confidence intervals, weak identification, regression discontinuity
JEL Classification: C12, C14, C31

∗ We thank Tim Armstrong, Leandro Gorno, and anonymous referees for helpful comments and suggestions. Bertanha gratefully acknowledges support from CORE-UcLouvain, and Moreira acknowledges the research support of CNPq and FAPERJ. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
† ∼ mbertanh.
‡ Praia de Botafogo, 190 11th floor, Rio de Janeiro - RJ 22250-040, Brazil. Email: [email protected]. Website: epge.fgv.br/en/professor/marcelo-moreira.

Introduction
The goal of most empirical studies is to estimate parameters of a population statistical model using a random sample of data. The difference between estimates and population parameters is uncertain because sample data do not have all the information about the population. Statistical inference provides methods for quantifying this uncertainty. Typical approaches include hypothesis testing and confidence sets. In a hypothesis test, the researcher divides all possible population models into two sets of models. The null set includes the models which the researcher suspects to be false. The alternative set includes all other likely models. It is desirable to control the size of the test, that is, the error probability of rejecting the null set when the null set contains the true model. A powerful test has a small error probability of failing to reject the null set when the true model is outside the null set. Another approach is to use the data to build a confidence set for the unknown value of parameters of the true model. The researcher needs to control the error probability that the confidence set excludes the true value. Error probabilities must be controlled uniformly over the entire set of likely models. This paper studies necessary and sufficient conditions for the impossibility of controlling error probabilities of hypothesis tests and confidence sets.

Previous work demonstrates the impossibility of controlling error probabilities of tests and confidence sets in specific settings. There are essentially two types of impossibility found in the literature. The first type of impossibility says that any hypothesis test has power limited by size. That is, it is impossible to find a powerful test that controls size. We call this impossibility type A. The second type of impossibility states that any confidence set that is almost surely (a.s.) bounded has error probability arbitrarily close to one (i.e. zero confidence level).
In other words, it is impossible for finite bounds to contain the true value of parameters with high probability. We call this impossibility type B. Despite being related, both types of impossibility often appear disconnected in the existing literature.

The first contribution of this paper is to connect the literature on impossible inference and study the relationships between type A and type B impossibility. Figure 1 at the end of this introduction summarizes the literature along with novel relationships derived in this paper. To the best of our knowledge, impossibility type A dates back to the 1950s. In a classic paper, Bahadur and Savage (1956) show both types of impossibility in the population mean case. Any test for distinguishing zero mean from non-zero mean distributions has power limited by size; and any a.s. bounded confidence interval for the population mean has error probability equal to one. Bahadur and Savage (1956) employ the Total Variation (TV) metric; we refer to this notion of distance as the strong distance. They show that the null set of distributions with a certain mean is dense with respect to (wrt) the TV metric in the set of distributions with all possible means.

(Footnote: The impossibility typically arises due to the richness of models in the class of all likely models. Impossibility does not arise if we restrict the class to have only one model, which is the same as pointwise inference. Uniform inference over a larger class of models is important because the researcher typically does not know all aspects of the model at hand. For example, instruments could be weak, and if we incorrectly assume they are always strong, pointwise inference conclusions are quite misleading.)

In fact, impossibility type A is very much related to the density of the convex hull of the null set in the set of all likely models wrt the TV metric. Kraft (1955) targets the problem of testing any two sets of distributions and arrives at an important generalization of the theory of Bahadur and Savage (1956). Kraft's Theorem 5 gives a necessary and sufficient condition for the existence of a test whose minimum power is strictly greater than its size. Such tests exist if, and only if, the minimal TV distance between the convex hulls of the null and alternative sets is bounded away from zero. Kraft attributes the theorem to Le Cam, and an analogous version of his theorem appears in Theorem 2.1 of Ingster and Suslina (2003). Romano (2004) demonstrates that the null set being dense in the set of all likely models wrt the TV metric is a sufficient condition for impossibility A. We derive a corollary of Kraft's Theorem 5 that says that the convex hull of the null set being dense in the set of all likely models wrt the TV metric is a necessary and sufficient condition for impossibility type A. The null set being dense implies that the convex hull of the null set is dense. Our corollary connects the literature on impossibility type A wrt the TV metric.

A different branch of the econometrics literature focuses on impossibility type B of confidence sets for a given parameter of interest, e.g. mean or regression slope. In the population mean case, Bahadur and Savage (1956) arrive at impossibility type B by demonstrating the following fact. For any mean value m, the set of distributions with mean equal to m is dense in the set of all likely models wrt the TV metric. This is stronger than the sufficient condition for impossibility type B used by Gleser and Hwang (1987). Gleser and Hwang (1987) consider classes of models indexed by parameters in a Euclidean space. They obtain impossibility type B whenever there exists one distribution P∗ such that, for every value of the parameter of interest, P∗ is approximately equal to distributions with that value of the parameter of interest wrt the TV metric. As with impossibility type A, impossibility type B also holds if the condition of Gleser and Hwang (1987) holds over the convexified space of distributions, which is a weaker sufficient condition. Donoho (1988) also provides related results.

(Footnote: Gleser and Hwang (1987) restrict their analysis to distributions that have parametric density functions wrt the same sigma-finite measure. Two distributions are indistinguishable if their density functions are approximately the same pointwise in the data. In their setting, pointwise approximation in density functions is the same as approximation in the TV metric. However, pointwise approximation in density functions is still stronger than convergence in distribution. See Proposition 2.29 and Corollary 2.30, Van der Vaart (2000).)

Dufour (1997) uses a weaker notion of distance: there exists one distribution P∗ such that, for every value of the parameter of interest, there exists a sequence of distributions with that value of the parameter of interest that converges in distribution to P∗. The weaker notion of distance restricts the analysis to confidence sets whose boundary has zero probability under P∗. The Lévy-Prokhorov (LP) metric is known to metrize weak convergence. We refer to this notion of distance as the weak distance. We demonstrate the impossibility type B of Dufour also holds after convexifying the space of distributions.

We revisit impossibility type A when distributions are indistinguishable in the LP metric as opposed to the TV metric. We find that impossibility type A applies to all tests that are a.s. continuous under alternative distributions. A sufficient condition is that the convex hull of the null set is dense in the set of all likely models wrt the LP metric. On the one hand, the LP metric does not yield impossibility type A for every test function. On the other hand, the class of a.s. continuous tests includes the vast majority of tests used in empirical studies. Convergence in the TV metric always implies convergence in the LP metric. The converse is not true, except in more restricted settings.
For example, if convergence in distribution implies uniform convergence of probability density functions (PDF), then Scheffé's Theorem implies convergence in the TV metric (Corollary 2.30 of Van der Vaart (2000)).

The second contribution of this paper is to note that a weaker notion of distance, such as the LP metric, brings further insights into the problem of impossible inference. First, it is often easier to prove convergence of models in terms of the weak distance than it is in the strong distance. Application of arguments similar to Portmanteau's theorem immediately yields the LP version of impossible inference in an important class of models in economics that rely on discontinuities. Second, the use of the LP metric helps researchers look for tests with non-trivial power. If models are indistinguishable wrt the LP metric, but distinguishable wrt the TV metric, we show that a useful test must necessarily be a.s. discontinuous. Third, the LP metric can be a sensible choice of distance to study hypothesis tests that are robust to small model departures. For example, consider the null set of continuous distributions versus the alternative set of discrete distributions with finite support in the rational numbers. It is possible to approximate any such discrete distribution by a sequence of continuous distributions in the LP metric. Hence, it is impossible to powerfully test these sets with a.s. continuous tests. On the other hand, a positive TV distance between null and alternative leads to a perfect test that rejects the null if observations take rational values. Robustness leads us to ask whether observing rational numbers is indeed evidence against the null hypothesis, or simply a matter of rounding or measurement error. The same problem may arise in reduced-form or structural econometric models, even when the degree of misspecification is small.
Depending on the problem at hand, we may want to look for tests that separate the closure of each hypothesis wrt the LP metric.

The third contribution of this paper is to point out impossible inference in microeconometric models based on discontinuities and macroeconometric models of time series. Numerous microeconometric analyses identify parameters of interest by relying on natural discontinuities in the distribution of variables. This is the case of Regression Discontinuity Designs (RDD), an extremely popular identification strategy in economics. In RDD, the assignment of individuals into a program changes discontinuously at a cutoff point in a variable such as age or test score, as in Hahn, Todd, and Van der Klaauw (2001) and Imbens and Lemieux (2008). For example, Schmieder, von Wachter, and Bender (2012) study individuals whose duration of unemployment insurance jumps wrt age. Jacob and Lefgren (2004) analyze the effect of students' participation in summer school, which changes discontinuously wrt test scores. Assuming all other characteristics vary smoothly at the cutoff, the effect of the summer school on future performance is captured by a discontinuous change in average performance at the cutoff. A fundamental assumption for identification is that performance varies smoothly with test scores, after controlling for summer school. Models with continuous effects are well-approximated by models with discontinuous effects. Kamat (2018) uses the TV metric to show that the current practice of tests in RDD suffers from impossibility type A. We revisit his result using the LP metric, and we show that impossibility type B also holds in RDD.

A Monte Carlo experiment shows that the usual implementation of Wald tests in RDD, as suggested by Calonico, Cattaneo, and Titiunik (2014), may have size above the desired significance level, even under sensible model restrictions. We rely on data-generating processes that are consistent with the empirical example of Lee (2008).
Moreover, the simulations show that the Wald test has very little power, even after artificially controlling size. Slope restrictions on the conditional mean functions do not correct the finite sample failure of the typical Wald test.

In other applications, researchers assume a discontinuous change in unobserved characteristics of individuals at given points. This is the idea of bunching, widely exploited in economics. Bunching may occur because of a discontinuous change in incentives or a natural restriction on variables. For example, the distribution of reported income may display a non-zero probability at points where the income tax rates change, as in Saez (2010); or, the distribution of average smoking per day has a non-zero mass at zero smoking. We show that the problem of testing for existence of bunching in a scalar variable suffers from type A impossibility for a.s. continuous tests but not for discontinuous tests.

Caetano (2015) uses the conditional distribution of variables with bunching and proposes an exogeneity test without instrumental variables. The key insight is that bunching in the distribution of an outcome variable given a treatment variable constitutes evidence of endogeneity. For example, consider the problem of determining the effect of smoking on birth weight. A crucial assumption is that birth weight varies smoothly with smoking while controlling for all other factors. Under this assumption, bunching is equivalent to the observed average birth weight being discontinuous at zero smoking. The exogeneity test looks for such discontinuity as evidence of endogeneity. Our point is that models in which birth weight is highly sloped or even discontinuous based on smoking are indistinguishable from smooth models. Therefore, we find the exogeneity test has power limited by size.
The current implementation of tests for the size of discontinuity leads to bounded confidence sets, so it also fails to control size.

In addition to these applications with discontinuities, we verify the existence of impossible inference in a macroeconometrics example where data are continuously distributed. We first show that the choice of the weak versus the strong distance connects to the work of Peter J. Huber on robust statistics and leads us to look at the closure of the set of covariance-stationary time-series processes wrt the LP metric. This closure contains error-duration models and Compound Poisson models. Our theory implies that it is impossible to robustly distinguish these models from covariance-stationary models, even with discontinuous tests.

It is important to emphasize that our goal with these applications is not to say that valid inference is never possible. Rather, we point practitioners to the need of either restricting the class of models under consideration or the null hypothesis being tested. In the RDD case, impossibility vanishes if we restrict the variation of conditional mean functions on either side of the cutoff. Kamat (2018) demonstrates the asymptotic validity of Wald tests under uniform bounds on the derivatives of the conditional mean functions. Armstrong and Kolesár (2018) derive minimax optimal-length confidence intervals in the case of a convex class of conditional mean functions, which covers most smoothness or shape assumptions used in econometrics. Alternatively, instead of limiting the whole class of models, researchers may consider null hypotheses that restrict other aspects of the model, beyond simply the effect at the threshold. For example, the null of smooth models with zero effect, or the null of no treatment spillover, do not suffer from type A impossibility. We expand this discussion in Section 4.1 with empirical examples in RDD.

The rest of this paper is divided as follows.
Section 2 sets up a statistical framework for testing and building confidence sets. It presents necessary and sufficient conditions for impossible inference in general non-parametric settings. Section 3 connects the LP metric to robust hypothesis testing. Section 4 gives multiple economic applications where both types of impossibility arise. Section 5 presents a Monte Carlo simulation for an empirical application of RDD. Section 6 concludes. An appendix contains all formal proofs. Figure 1 (on the next page) summarizes the literature on impossible inference, along with implications of this paper.

The researcher has a sample of n observations Z = (Z_1, . . . , Z_n) that take values in Z, a subset of the Euclidean space R^{n×l}. The data Z follow a distribution P, and the set of all possible distributions considered by the researcher is P. Every probability distribution P ∈ P is defined on the same sample space Z with Borel sigma-algebra B. It is assumed that all distributions in P are absolutely continuous wrt the same sigma-finite measure. We are interested in testing the null hypothesis H_0 : P ∈ P_0 versus the alternative hypothesis H_1 : P ∈ P_1 for a partition P_0, P_1 of P. We characterize a hypothesis test by a function of the data φ : Z → [0, 1]. If φ takes on only the values 0 and 1, the test is said to be non-randomized, but said to be randomized otherwise. Given a sample Z, we reject the null H_0 if the function φ(Z) equals one, but we fail to reject H_0 if φ(Z) = 0. If the function φ(Z) is between 0 and 1, we reject the null with probability φ(Z) conditional on Z. The unconditional probability of rejecting the null hypothesis under distribution P ∈ P is denoted E_P[φ]. The size of the test φ is sup_{P∈P_0} E_P[φ]. The power of the test under distribution Q ∈ P_1 is given by E_Q[φ].
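These definitions of size and power can be illustrated with a small Monte Carlo sketch. The example below is ours, not from the paper; the normal model, sample size, and critical value 1.96 are assumptions chosen only for concreteness. It simulates the non-randomized test φ(Z) = I(|t| > 1.96) of the null that the population mean is zero:

```python
import math
import random

def t_test(sample, mu0=0.0, crit=1.96):
    """Non-randomized test phi: reject H0 (mean = mu0) if |t| > crit."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((z - mean) ** 2 for z in sample) / (n - 1)
    t = (mean - mu0) / math.sqrt(var / n)
    return 1 if abs(t) > crit else 0

def rejection_rate(mu, n=100, reps=2000, seed=0):
    """Monte Carlo estimate of E_P[phi] when Z_1, ..., Z_n are iid N(mu, 1)."""
    rng = random.Random(seed)
    rejections = sum(
        t_test([rng.gauss(mu, 1.0) for _ in range(n)]) for _ in range(reps)
    )
    return rejections / reps

size = rejection_rate(mu=0.0)   # E_P[phi] under a null model: close to 0.05
power = rejection_rate(mu=0.3)  # E_Q[phi] under one alternative: well above size
```

Pointwise, this test separates the two hypotheses; the impossibility results below concern the supremum of E_Q[φ] over rich classes of alternatives, not a single Q.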
We say a test φ has power limited by size when sup_{Q∈P_1} E_Q[φ] ≤ sup_{P∈P_0} E_P[φ].

(Footnote: Examples include Lebesgue measure for continuous distributions; counting measure for discrete distributions; and sum of Lebesgue and counting measures for mixed continuous-discrete distributions.)

Figure 1: Impossibility Literature Diagram

Notes: the diagram illustrates the relationships between the different versions of impossibility found in the literature. Arrows without labels or arrows with references in square brackets are relationships made explicit by this paper. Impossibility type A says that every test function φ has maximum power less than or equal to size. The set of all likely models is P, which is the union of models under the null P_0 and alternative hypothesis P_1. A test φ is said to be P_1-a.s. continuous if the set of discontinuity points of φ has zero probability under every Q ∈ P_1. The set co(P_0) denotes the convex hull of P_0. The TV metric is the Total Variation metric (Equation (2.2) in Section 2). The LP metric is the Lévy-Prokhorov metric (Equation (2.3) in Section 2). The set P_0 is dense in P wrt a metric d(·,·) if, for every Q ∈ P, there exists a sequence {P_k}_k ⊆ P_0 such that d(P_k, Q) → 0. Impossibility type B says that every confidence set (C.S.) is unbounded with positive probability for some distributions in P. The subset P(m) denotes all models P such that a parameter of interest µ(P) = m. A confidence set is a function C(·) of the data Z. The C.S. is said to be P∗-a.s. continuous if the boundary of the set {m ∈ C(Z)} has zero probability under P∗ for every value of m in the range of µ(·). The model P∗ is a limit point of P(m) wrt a metric d(·,·) if there exists a sequence {P_k}_k ⊆ P(m) such that d(P_k, P∗) → 0. If P_0 is dense in P, then co(P_0) is dense in P because P_0 ⊆ co(P_0). Similarly, if P∗ is a limit point of P(m), then P∗ is also a limit point of co(P(m)). Convergence in TV implies convergence in LP. The converse is not generally true. See Section 2 for sufficient conditions for the converse to hold.

Define co(P′) to be the convex hull of an arbitrary subset P′ ⊆ P. That is,

co(P′) = { P∗ : P∗ = Σ_{i=1}^N α_i P_i, for some N ∈ N, P_i ∈ P′ ∀i, α_i ∈ [0, 1] ∀i, Σ_{i=1}^N α_i = 1 }.   (2.1)

A small distance between models in P_0 and P_1 determines testing impossibility. There exist various notions of distance to measure the difference between two distributions P and Q. A common choice in the literature on testing impossibility is the Total Variation (TV) metric d_TV(P, Q):

d_TV(P, Q) = sup_{B∈B} |P(B) − Q(B)|.   (2.2)

Theorem 5 of Kraft (1955) says that there exists a test φ with minimum power strictly greater than size if, and only if, there exists ε > 0 such that d_TV(P, Q) ≥ ε for every P ∈ co(P_0) and Q ∈ co(P_1). We restate his theorem below for convenience.

Theorem 1. (Kraft (1955))
Fix ε > 0. The following statements are equivalent:
(a) ∃ φ : inf_{Q∈P_1} E_Q[φ] ≥ ε + sup_{P∈P_0} E_P[φ], and
(b) ∀ P ∈ co(P_0), ∀ Q ∈ co(P_1), d_TV(P, Q) ≥ ε.

An important implication of Theorem 1 for impossible inference is that it gives a necessary and sufficient condition in terms of the convex hull of the null set being dense in the set of all likely models wrt the TV metric. In other words, the convex hull co(P_0) is indistinguishable from (or dense in) the set of all likely models wrt the TV metric if, for any Q ∈ P, there exists a sequence {P_k}_{k=1}^∞ in co(P_0) such that d_TV(P_k, Q) → 0. We demonstrate this fact in the corollary below.
Corollary 1.
The following statements are equivalent:
(a) for every Q ∈ P, there exists a sequence {P_k}_k ⊆ co(P_0) such that d_TV(P_k, Q) → 0, and
(b) for every φ and Q ∈ P, E_Q[φ] ≤ sup_{P∈co(P_0)} E_P[φ].

The proof of this corollary, as well as all other proofs for the paper, is included in the appendix. The striking result of Kraft (1955) stated in Theorem 1 makes the type A impossibility found by Bahadur and Savage (1956) and Romano (2004) special cases of Corollary 1. In particular, Theorem 1 of Romano (2004) says that Corollary 1-(a) without convexification is a sufficient condition for Corollary 1-(b). Notably, Romano (2004) finds a positive result for testing population means. He demonstrates that the t-test uniformly controls size in large samples with a very weak uniform integrability type of condition, and that the t-test is also asymptotically minimax optimal.

Dufour (1997) uses the notion of distance associated with weak convergence to derive impossibility type B. We say a sequence {P_k}_{k=1}^∞ converges in distribution to Q if, for every B ∈ B such that Q(∂B) = 0, P_k(B) → Q(B). Here, ∂B is the boundary of a Borel set B, that is, the closure of B minus the interior of B. We denote convergence in distribution by P_k →d Q. Convergence in distribution is equivalent to convergence in the Lévy-Prokhorov (LP) metric (Dudley (1976), Theorem 8.3):

d_LP(P, Q) = inf{ε > 0 : P(A) ≤ Q(A^ε) + ε for all A ∈ B}   (2.3)

where A^ε = {x : ‖x − a‖ < ε for some a ∈ A}, and ‖·‖ is the Euclidean norm on R^{n×l}.

Convergence of P_k to Q in the TV metric implies convergence in distribution. The converse does not hold, in general. It is necessary to restrict the class of distributions in order for convergence in distribution to imply convergence in the TV metric. For example, suppose that P_k →d Q, that these distributions have common support [a, b] and PDFs f_{P_k}, f_Q. Assume further that f_{P_k} converges uniformly over [a, b].
Then, f_{P_k} converges uniformly to f_Q (Theorem 7.17 of Rudin (1976)). Convergence of PDFs implies convergence in the TV metric (Scheffé's Theorem, see Corollary 2.30 of Van der Vaart (2000)). For a counter-example where these conditions do not hold, consider the bunching example of Section 4.2. The null is the set of distributions with a continuously differentiable CDF. The alternative is the set of distributions with a mass point at x but continuously differentiable CDF otherwise. For any CDF F_Q in the alternative, there exists a sequence of CDFs F_{P_k} in the null that converges pointwise to F_Q, so that convergence in distribution holds. Convergence in TV does not hold because x has positive probability under Q but zero probability under P_k for every k. It must be the case that the PDFs f_{P_k} do not converge uniformly. In fact, F_Q has a jump discontinuity at x, and the derivative of F_{P_k} at x grows without limit as k → ∞.

(Footnote: For example, a standardized binomial variable converges in distribution to a standard normal as the number of trials goes to infinity and the probability of success is fixed. It does not converge in the TV metric because the distance between these two distributions is always equal to one. In fact, consider the event equal to the entire real line minus the support of the binomial distribution. This event has unit probability under the normal distribution, but zero probability under the binomial distribution.)
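The binomial example in the footnote can be checked numerically. The sketch below is our illustration (standard library only): it computes the Kolmogorov distance sup_x |F_n(x) − Φ(x)| between the standardized Binomial(n, 1/2) and the standard normal. Because the normal CDF is continuous, this distance shrinking to zero is equivalent to convergence in distribution, while d_TV remains equal to one for every n:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_distance(n, p=0.5):
    """sup_x |F_n(x) - Phi(x)| for the standardized Binomial(n, p).

    The supremum is attained at a jump of the binomial CDF, so it suffices
    to compare the normal CDF with the binomial CDF just before and just
    after each jump.
    """
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    cdf, d = 0.0, 0.0
    for k in range(n + 1):
        x = (k - mu) / sigma                    # standardized support point
        d = max(d, abs(norm_cdf(x) - cdf))      # just below the jump at x
        cdf += math.comb(n, k) * p**k * (1 - p)**(n - k)
        d = max(d, abs(cdf - norm_cdf(x)))      # just after the jump at x
    return d

# The weak distance shrinks as n grows ...
print(kolmogorov_distance(10), kolmogorov_distance(1000))
# ... while d_TV(Binomial_n, Normal) = 1 for every n: the complement of the
# n+1 support points has normal probability 1 and binomial probability 0.
```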
On the one hand, it is true that the zero TV distance provides a necessary and sufficient condition for testing impossibility. On the other hand, there are examples of models with non-zero TV distance where it seems sensible that no powerful test should exist. Section 3 below formalizes this idea, but we start with a simple example for now. Consider the null set of continuous distributions versus the alternative set of discrete distributions with finite support in the rational numbers. It is possible to approximate any such discrete distribution by a sequence of continuous distributions in the LP metric. We are led to think the data generated by a null model is observationally equivalent to data generated by an alternative model. This motivates us to revisit impossibility type A when distributions are indistinguishable in the LP metric.
Assumption 1.
For every Q ∈ P, there exists a sequence {P_k}_{k=1}^∞ in co(P_0) such that P_k →d Q. In other words, the convex hull co(P_0) is indistinguishable from (or dense in) the set of all likely models wrt the LP metric.

Assumption 1 is a sufficient condition for impossibility type A, as described in Theorem 2.
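The mechanism by which Assumption 1 delivers impossibility type A for a.s. continuous tests can be sketched as follows (an informal outline consistent with the Portmanteau argument mentioned in the introduction; the formal proof is in the appendix):

```latex
% Fix Q \in \mathcal{P}_1 and, by Assumption 1, take P_k \in co(\mathcal{P}_0)
% with P_k \overset{d}{\to} Q. If \phi is bounded and Q-a.s. continuous,
% a continuous mapping / Portmanteau argument gives
\lim_{k \to \infty} E_{P_k}[\phi(Z)] = E_Q[\phi(Z)].
% Linearity of P \mapsto E_P[\phi] over mixtures makes the supremum over the
% convex hull equal to the supremum over \mathcal{P}_0 itself, i.e. the size:
E_{P_k}[\phi(Z)] \;\le\; \sup_{P \in co(\mathcal{P}_0)} E_P[\phi(Z)]
               \;=\; \sup_{P \in \mathcal{P}_0} E_P[\phi(Z)].
% Taking limits: E_Q[\phi] \le \text{size}, for every Q \in \mathcal{P}_1.
```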
Theorem 2.
If Assumption 1 holds, then any hypothesis test φ(Z) that is a.s. continuous under any Q ∈ P_1 has power limited by size.

Remark 1.
As noted by Canay, Santos, and Shaikh (2013), the topology induced by the LP metric is not fine enough to guarantee convergence of integrals of any test function φ. Nevertheless, the class of tests that are a.s. continuous under any Q ∈ P_1 can be very large. For example, take a test that rejects the null when a test statistic is larger than a critical value: φ(Z) = I(ψ(Z) > c). This test is a.s. continuous if the function ψ is continuous and Q ∈ P_1 is absolutely continuous wrt the Lebesgue measure. Theorem 2 only requires a.s. continuity under the alternative P_1, and the null P_0 may still contain discrete distributions.

Remark 2.
We do not need to restrict Theorem 2 to the class of a.s. continuous tests for every case of P. For example, consider P to be a subset of the parametric exponential family of distributions with parameter θ of finite dimension. Then, for any test φ, the power function of φ is continuous in θ, and Theorem 2 applies under Assumption 1 (Theorem 2.7.1, Lehmann and Romano (2005)).

Remark 3.
In many instances, Assumption 1 holds in both directions. That is, P_0 is indistinguishable from P_1, and P_1 is indistinguishable from P_0 in the weak distance. For example, Bahadur and Savage (1956) find that any distribution with mean m is well-approximated by distributions with mean m′ ≠ m, and vice-versa. Section 4 finds the same bidirectionality for models with discontinuities. If Assumption 1 holds in both directions, switching the roles of P_0 and P_1 in Theorem 2 shows that power is equal to size.

It is useful to connect our LP version of testing impossibility with the impossibility of controlling error probability of confidence sets found by Gleser and Hwang (1987) and Dufour (1997). Define a real-valued function µ : P → R, for example, mean, variance, median, and so on. The set of distributions P is implicitly chosen such that µ is well-defined. We consider real-valued functions for simplicity, and results for µ with more general ranges are straightforward to obtain. The range of µ is µ(P). Suppose we are interested in a confidence set for µ(P) when the true model is P ∈ P. A confidence set takes the form of a function C(Z). For a model P ∈ P, the coverage probability of C(Z) is given by P[µ(P) ∈ C(Z)]. The confidence region C(Z) has confidence level 1 − α (i.e. error probability α) if C(Z) contains µ(P) with probability at least 1 − α:

inf_{P∈P} P[µ(P) ∈ C(Z)] = 1 − α.   (2.4)

For any value m ∈ µ(P), we define the subset P(m) by

P(m) = {P ∈ P : µ(P) = m}.   (2.5)

Impossibility type B says that confidence sets that are a.s. bounded under some distributions in P have zero confidence level. The next assumption gives a sufficient condition for impossibility type B in terms of the LP metric.

Assumption 2. There exists a distribution P∗ (not necessarily in P) such that for every m ∈ µ(P) there exists a sequence {P_k}_k in co(P(m)) such that P_k →d P∗.

If Assumption 1 holds with P_0 = P(m) for every m ∈ µ(P), then Assumption 2 holds. In fact, if P(m) is dense in P for every m, then Assumption 2 is satisfied with P∗ = Q for any Q ∈ P. Some models satisfy Assumption 1 with P_0 = P(m) for every m ∈ µ(P) and suffer from both types of impossibility. Examples of this case include the problem of testing the mean (Bahadur and Savage (1956)), or the problem of testing the size of the discontinuity in RDD (Section 4.1). Nevertheless, some other models satisfy Assumption 2 but not Assumption 1 for every m. These models suffer from impossibility type B. Examples include the problem of ratio of regression parameters (Gleser and Hwang (1987)), and the problem of weak instruments (Dufour (1997)).

The next theorem encapsulates the impossibility of controlling coverage probabilities found by Gleser and Hwang (1987) and Dufour (1997). It differs from Gleser and Hwang (1987) because Assumption 2 uses the LP distance. It differs slightly from Dufour (1997) because Assumption 2 is stated in terms of the convex hull of P(m) rather than simply P(m).

Theorem 3.
Suppose Assumption 2 holds with P*. Assume the confidence set C(Z) of Equation (2.9) has confidence level 1 − α, and P*(∂{m ∈ C(Z)}) = 0 for every m ∈ μ(P). Then,

∀ m ∈ μ(P) : P*[m ∈ C(Z)] ≥ 1 − α. (2.6)

For a set A ⊂ R, define U[A] = sup{c : c ∈ A}, L[A] = inf{c : c ∈ A}, and D[A] = U[A] − L[A]. Assume {U[C(Z)] ≥ x}, {L[C(Z)] ≤ −x}, and {D[C(Z)] ≥ x} are measurable events for every x ∈ [0, ∞]. If D[μ(P)] = ∞, then

P*[D[C(Z)] = ∞] ≥ 1 − α. (2.7)

In addition, if P*[∂{D[C(Z)] = ∞}] = 0, then

∀ ε > 0 : sup_{P ∈ B_ε(P*) ∩ P} P[D[C(Z)] = ∞] ≥ 1 − α, (2.8)

where B_ε(P*) = {P : d_LP(P, P*) < ε}.

Remark 4.
Part (2.8) above implies the following. If 1 − α > 0, then the confidence set C(Z) is unbounded with strictly positive probability for some P ∈ P. Alternatively, the contrapositive of part (2.8) says the following. Any confidence set that is a.s. bounded under distributions in P in a neighborhood of P* has confidence level 1 − α = 0.

Remark 5.
It is possible to obtain a slightly more general version of Theorem 3 using Assumption 2 stated in terms of the TV metric as opposed to the LP metric. In that case, Theorem 3 would be true for confidence sets that do not necessarily satisfy P*(∂{m ∈ C(Z)}) = 0 and P*[∂{D[C(Z)] = ∞}] = 0.

A common way of obtaining confidence sets is to invert hypothesis tests. The function C(Z) is constructed by inverting a test in the following manner. For a given m ∈ μ(P), define P_{0,m} = P(m) and P_{1,m} = P \ P(m), where A \ B denotes the remainder of set A after we remove the intersection of set B with set A. If φ_m(Z) is a test for P_{0,m} vs P_{1,m}, then

C(Z) = {m ∈ μ(P) : φ_m(Z) = 0}. (2.9)

For every m ∈ μ(P), the test φ_m(Z) has size α(m) = sup_{P ∈ P_{0,m}} E_P[φ_m(Z)]. The confidence level of C(Z) is equal to one minus the supremum of α(m) over m ∈ μ(P). The proof of this claim is found in Lemma 1 in the appendix. Theorem 3 along with Lemma 1 imply that tests that invert into a.s. bounded confidence sets fail to control size.

Corollary 2.
Suppose Assumption 2 holds, and μ(P) is unbounded. Let the confidence set C(Z) be constructed from tests φ_m(Z), as in Equation (2.9). Assume C(Z) has confidence level 1 − α and satisfies the assumptions of Theorem 3. If C(Z) is a.s. bounded under distributions in P in a neighborhood of P*, then α = 1. Consequently, for every ε > 0, there exists m_ε ∈ μ(P) such that sup_{P ∈ P_{0,m_ε}} E_P[φ_{m_ε}(Z)] > 1 − ε.

Remark 6.
Moreira (2003) provides numerical evidence that Wald tests can have large null rejection probabilities for the null of no causal effect (m = 0) in the simultaneous equations model. To show that Wald tests have null rejection probabilities arbitrarily close to one, the hypothesized value m for the null would need to change as well. He also suggests replacing the critical value by a critical value function of the data. This critical value function depends on the hypothesized value m. Our theory shows that this critical value function is unbounded if we change m freely.

This section presents further motivation for using the LP metric to study impossible inference. It relates the weak topology induced by the LP metric to the theory developed by Peter J. Huber, who is the most prominent researcher in the area of robust statistics. We refer the reader to Huber and Ronchetti (2009) for more details. We start this section with a discussion of robust statistical procedures. An example of impossible robust hypothesis testing in time-series models appears in Section 4.4.

Several statistical procedures are susceptible to small model departures. This perception has led researchers to propose alternative procedures that are less sensitive to the breakdown of usual assumptions. Huber studies different ways of defining a set of model departures P_ε. One possibility is to assume that the actual distribution of the data is a mixture of a distribution in P with a distribution from a more general set of models M. In other words, P may be contaminated with probability ε:

P_ε = {H ∈ M : ∃ F ∈ P and ∃ G ∈ M such that H = (1 − ε)F + εG}, (3.1)

where M is larger than the original P. Estimators or tests are said to be robust if they have minimax properties over the set of model departures P_ε. To highlight the importance of robust procedures, we briefly discuss two examples.

The first example of a robust procedure involves point-estimation.
The researcher has a sample of n iid observations Z_i ∈ R^l, i = 1, ..., n. The set of joint probability distributions P is indexed by a parameter θ and admits marginal densities p(Z_i; θ) wrt the same dominating measure (e.g. Lebesgue). The maximum likelihood estimator (MLE) then minimizes

∑_{i=1}^n − ln p(Z_i; θ).

This estimator θ̂ solves

∑_{i=1}^n − [∂p(Z_i; θ̂)/∂θ] / p(Z_i; θ̂) = 0.

Under the usual regularity conditions, θ̂ is consistent, asymptotically normal, and efficient within the class of regular estimators. A common choice for M is the set of distributions with symmetric, thick-tailed densities. It is well-known that optimal procedures derived under Gaussian distributions (sample drawn from P) break down if there is a probability ε of observing outliers (sample drawn from M). Huber (1964) suggests M-estimators. To give a specific example of a robust M-estimator, consider the regression model Y_i = X_i′θ + U_i, where we observe Z_i = (Y_i, X_i) but do not observe the zero-mean normal errors U_i. The MLE θ̂ minimizes

∑_{i=1}^n (Y_i − X_i′θ)²,

and satisfies

∑_{i=1}^n X_i (Y_i − X_i′θ̂) = 0.

More generally, an M-estimator θ̂ minimizes

∑_{i=1}^n ρ(Y_i − X_i′θ),

and satisfies

∑_{i=1}^n X_i ψ(Y_i − X_i′θ̂) = 0,

for choices of functions ρ and ψ. In the MLE case above, ρ(u) = u²/2 and ψ(u) = u. An M-estimator θ̂ is said to be asymptotically minimax optimal among a class of estimators if it minimizes the maximal asymptotic variance over distributions in P_ε. The M-estimator associated with the functions

ρ_k(u) = u²/2 if |u| ≤ k, ρ_k(u) = k|u| − k²/2 if |u| > k, and ψ_k(u) = max{−k, min(k, u)}, (3.2)

is known to be asymptotically minimax optimal for model contamination.
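To make (3.2) concrete, the following is a minimal sketch (not the paper's code) of a Huber M-estimator for a location parameter. The tuning constant k = 1.345 and the damped fixed-point solver are illustrative choices, as is the contaminated Gaussian sample.

```python
# Minimal sketch of a Huber M-estimator for a location parameter,
# using psi_k from (3.2). Tuning constant and solver are illustrative.
import random

def psi_k(u, k):
    # Huber's influence function: psi_k(u) = max(-k, min(k, u))
    return max(-k, min(k, u))

def huber_location(xs, k=1.345, tol=1e-10, max_iter=500):
    """Solve sum_i psi_k(x_i - theta) = 0 by damped fixed-point iteration."""
    theta = sorted(xs)[len(xs) // 2]  # start at (roughly) the median
    for _ in range(max_iter):
        step = sum(psi_k(x - theta, k) for x in xs) / len(xs)
        theta += step
        if abs(step) < tol:
            break
    return theta

random.seed(0)
# Clean Gaussian sample contaminated with a few large outliers
sample = [random.gauss(0.0, 1.0) for _ in range(200)] + [50.0] * 10
print(huber_location(sample))     # stays close to 0 despite the outliers
print(sum(sample) / len(sample))  # sample mean, dragged toward the outliers
```

Because ψ_k caps each observation's contribution at ±k, the ten outliers can pull the estimate by at most 10k/n, whereas they shift the sample mean by roughly 500/210 ≈ 2.4.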
The constant k depends on the size of the deviation ε in (3.1). As ε → 0, the truncation parameter k → ∞: as the model departure becomes small, the M-estimator approaches the MLE. As ε → 1, the parameter k → 0: as the contamination becomes arbitrarily large, the M-estimator approaches the least absolute deviation (LAD) estimator.

The second example of a robust procedure is in hypothesis testing. Consider the problem of testing a simple null P_0 against a simple alternative P_1. Assume both P_0 and P_1 have densities p_0 and p_1 wrt the Lebesgue measure. For a sample X = (X_1, ..., X_n), the likelihood ratio (LR) test rejects the null if and only if

∏_{i=1}^n p_1(X_i)/p_0(X_i) > c_α,

where c_α is the 1 − α quantile of the distribution of the left-hand side under the null. The Neyman-Pearson Lemma asserts that the LR test is optimal, as it maximizes power within the class of tests with correct size α. Similar to model departures in the point-estimation example above, we consider the possibility that the null and alternative hypotheses are misspecified. The ε-contaminated null and alternatives are

P_{i,ε} = {H ∈ M : ∃ F ∈ P_i and ∃ G ∈ M such that H = (1 − ε)F + εG}, (3.3)

for i = 0, 1. The new sets P_{0,ε} and P_{1,ε} allow for local departures toward arbitrary distributions in M. A minimax optimal hypothesis test maximizes the minimal power over P_{1,ε} within the class of tests with correct size over P_{0,ε}. Huber (1965) shows that the minimax test for these model departures rejects the null if and only if

∏_{i=1}^n π_k(p_1(X_i)/p_0(X_i)) > c_α,

where π_k(w) = max{k_0, min(k_1, w)}, for constants k = (k_0, k_1) that depend on the size of the departure ε. As ε → 0, the constant k_0 approaches zero, and k_1 diverges to infinity. Hence, as the departure decreases, the robust test approaches the usual LR test.

In the two examples of robust procedures given above, arbitrarily small model departures (ε → 0) do not affect the solution to the minimax problem. That is, as ε approaches zero, the robust estimator converges to the MLE, and the robust test converges to the LR test. These limiting solutions remain the same, even if we ignore model departures (ε = 0). These inference procedures target a parameter which is a functional of the underlying distribution P ∈ P_ε. These functionals vary smoothly wrt ε as ε →
0. Robustness is associated with smoothness of the functional, but such smoothness may not always occur in other settings.

Our work on impossible inference and the different metrics connects to Huber's work on robustness when we look at the following definition of model departure. For a metric space (M, d), the set of model departures is defined as an ε-neighborhood of P:

P_ε = {H ∈ M : ∃ F ∈ P s.t. d(F, H) ≤ ε}.

The set P_ε is closed. The set ∩_{ε>0} P_ε is also closed and coincides with the closure of P, denoted P̄. Hence, the set P̄ is the minimal set of the Huber-type model departures P_ε containing P. The minimal set of model departures crucially depends on a choice for the metric d. Aside from the Lévy-Prokhorov (LP) and the Total Variation (TV) metrics, there are many choices of metrics for spaces of probability measures: Kolmogorov, Hellinger, and Wasserstein, among others. Gibbs and Su (2002) provide a review. Which metric shall we choose?

The choice of the metric on the space of models M induces a topology V on that space. Parameters of interest are functionals μ : (M, V) → (R, U), where U is the topology on R. Robustness is about the continuity of the functional μ, which crucially depends on the choices of topologies V and U. For the real line, it seems reasonable to work with the smallest topology involving all open sets of the form (a, b). However, there are many choices of topologies for the set of measures M. As the set of continuous functionals μ grows, the topology becomes finer on the domain of μ. Let us consider a simple example to illustrate this point. Take two topological spaces, (R, V) and (R, U), and a function ψ(x) = x. Continuity of this simple function requires ψ⁻¹(U) ∈ V for every open set U ∈ U. Take U = (0, 1), so that ψ⁻¹(U) = (0, 1); if V = {∅, R}, then even this simple function is not continuous. It seems reasonable to require all linear functions to be continuous.
If the topology V is generated by all open sets of the form (a, b), then all linear functions are continuous. Of course, other non-linear functions may be continuous as well; e.g., ψ(x) = x². This example makes the point that continuity of a function depends on the topology V associated to the domain of the function. (In fact, to verify that P_ε is closed, take an arbitrary convergent sequence H_n → H such that H_n ∈ P_ε for all n. To show H ∈ P_ε, pick an arbitrary F ∈ P. It is true that d(H_n, F) ≤ ε for all n. Therefore, d(F, H) ≤ d(F, H_n) + d(H_n, H) ≤ ε + d(H_n, H). Taking the limit as n → ∞ gives d(F, H) ≤ ε.) If a function is continuous for a topology V, then it is also continuous for a finer topology. This goes back to the discussion of robustness as continuity of a functional μ. If we choose a fine topology, then many statistical procedures will be deemed robust, because many functionals will be continuous. If we choose a coarse topology, then fewer statistical procedures will be robust. However, if a statistical procedure is robust in the coarse topology, it is also robust in the fine topology. Among the commonly-used notions of distance in measure spaces, the notion of distance behind weak convergence or convergence in distribution induces the coarsest topology. The LP metric is a notion of distance that metrizes weak convergence (Dudley (1976), Theorem 8.3). To be conservative, if we were to choose one metric, we would choose one that metrizes weak convergence. After all, if a functional is continuous wrt the topology induced by the LP metric, it is also continuous wrt the stronger topologies induced by the Kolmogorov or TV metrics.

Another question is whether we should be stricter with robustness and look for an even weaker topology than the weak topology induced by the LP metric. In perfect analogy to the real line example, the weak topology is the coarsest topology that guarantees continuity for all functionals of the form

μ(P) = ∫ g dP, (3.4)

for g bounded and continuous.
It seems reasonable, after all, to require μ to be continuous when g is a bounded and continuous function. If we choose a weaker topology, then not even μ of this form will be continuous. In hypothesis testing, robustness over a minimal set of model departures motivates testing P̄_0 against P̄_1 instead of testing P_0 against P_1. Allowing for robustified hypotheses P̄_0 and P̄_1 potentially protects us against numerical approximation errors, misspecified models, measurement errors, and optimization frictions, among other deviations from the set of models we are testing. Robustness of inference procedures for μ that are as simple as (3.4) requires a topology no weaker than the topology induced by the LP metric. Therefore, we use the LP metric to define the closure of a set for robust hypothesis testing. We give two simple examples to strengthen the argument of why the LP metric may be a sensible choice.

The first example of robust hypothesis testing using the LP metric compares extremely simple discrete distributions under both null and alternative hypotheses. Take X to be a Bernoulli random variable and X_n = X + 1/(1 + n) for n ∈ N. Let P_X denote the distribution of X. The minimal TV distance between P_0 = {P_{X_n} for n ∈ N} and P_1 = {P_X} is equal to one. According to Theorem 5 of Kraft (1955), there exists a test for P_0 vs P_1 with non-trivial power. For example, define a test which rejects the null if we observe the values 0 or 1, but fails to reject the null otherwise. This test has size equal to zero and power equal to one. Should we take the values 0 and 1 as evidence against the null? Or should we think instead that the null could have led to those same values, for all practical purposes? In this example, we note that P_1 ⊂ P̄_0, where the closure is defined wrt the LP metric. Hence, the minimal TV distance between P̄_0 and P_1 is equal to zero. After we robustify the null set to P̄_0, it becomes impossible to find any test with power greater than size (Corollary 1).
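The Bernoulli example can be checked numerically. Taking X ~ Bernoulli(1/2) for concreteness (the success probability plays no role), the supports of X_n and X are disjoint for every n, so the TV distance is always one, while the deterministic shift 1/(1+n) bounds the LP distance and vanishes:

```python
# TV distance between the distributions of X_n = X + 1/(1+n) and of X.
# Supports are disjoint for every n, so d_TV = 1; the shift 1/(1+n)
# bounds the Levy-Prokhorov distance and goes to 0.
def tv_distance(p, q):
    # p, q: discrete distributions given as {atom: probability mass}
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

p_x = {0.0: 0.5, 1.0: 0.5}
for n in (1, 10, 1000):
    shift = 1.0 / (1 + n)
    p_xn = {shift: 0.5, 1.0 + shift: 0.5}
    print(n, tv_distance(p_xn, p_x), shift)  # TV stays 1; LP bound shrinks
```

This is exactly why the LP closure of the null absorbs P_X even though the null and alternative remain at maximal TV distance.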
However, if we define the closure of P_0 wrt the TV metric, say P̄_0^TV, the minimal TV distance between P̄_0^TV and P_1 is non-zero, which means it is still possible to powerfully distinguish these sets.

The second example of robust hypothesis testing using the LP metric uses the multinomial approximation to continuous distributions. Take P_0 as the collection of multinomial distributions, with each support being a finite subset of rational numbers. Let P_1 be the set of continuous distributions. The minimal TV distance between P_0 and P_1 is one, and it is possible to powerfully distinguish these sets. Robustness leads us to ask whether observing rational numbers is indeed evidence of the null hypothesis, or simply a matter of rounding or measurement error. The closure P̄_0 wrt the LP metric contains continuous distributions, and the minimal TV distance between P̄_0 and P_1 is zero. After we robustify the null set to P̄_0, it becomes impossible to powerfully test these hypotheses.

Both in the Bernoulli and multinomial examples, it becomes clear that the LP closure of the null set robustifies the testing procedure. The next step is to check the TV distance between the robustified null and alternative sets as a way to search for robust tests with non-trivial power. The use of the TV metric in the second step is justified by a corollary of Theorem 5 of Kraft (1955). Corollary 1 demonstrates that a necessary and sufficient condition for the existence of tests with non-trivial power is that the null set is not dense in the set of all distributions wrt the TV metric.

In this section, we apply our theory to multiple economic examples. The first three examples are of models with discontinuities: RDD, bunching in a scalar variable, and exogeneity tests based on bunching. In these settings, the proof of the LP version of impossible inference follows arguments similar to Portmanteau's Theorem. That is, the indicator functions are approximately the same as steep continuous functions in the weak distance.
The problem of testing for the existence of bunching in a scalar variable differs from the other applications with discontinuities because there exists a discontinuous powerful test. A fourth example concerns robust hypothesis testing in time-series models. (A formal note on the Bernoulli example of Section 3: F_n(x) = P(X_n ≤ x) = P(X ≤ x − 1/(1+n)), so F_n(x) → P(X < x) = F(x−); F(x−) ≠ F(x) ⇔ x ∈ {0, 1}, where {0, 1} are the only discontinuity points of F, so P_X is a limit point of P_0.)

The first example is the Regression Discontinuity Design (RDD), first formalized by Hahn, Todd, and Van der Klaauw (2001) (HTV01). RDD has had an enormous impact on applied research in various fields of economics. Applications of RDD started gaining popularity in economics in the 1990s. Influential papers include those of Black (1999), who studies the effect of the quality of school districts on house prices, where quality changes discontinuously across district boundaries; Angrist and Lavy (1999), who measure the effect of class sizes on academic performance, where size varies discontinuously with enrollment; and Lee (2008), who analyzes US House of Representatives elections and incumbency, where election victory is discontinuous in the share of votes. Recent theoretical contributions include the study of rate optimality of RDD estimators by Porter (2003) and the data-driven optimal bandwidth rules by Imbens and Kalyanaraman (2012) and Calonico, Cattaneo, and Titiunik (2014). RDD identifies causal effects local to a cutoff value; several authors develop conditions for extrapolating local effects farther away from the cutoff. These include estimation of derivatives of the treatment effect at the cutoff by Dong (2016) and Dong and Lewbel (2015); tests for homogeneity of treatment effects in fuzzy RDD by Bertanha and Imbens (2018); and estimation of average treatment effects in RDD with variation in cutoff values by Bertanha (2017). All these theoretical contributions rely on point identification and inference, and they are subject to both types of impossibility.
The current practice of testing and building confidence intervals relies on Wald test statistics (t(Z) − m)/s(Z), where t(Z) and s(Z) are a.s. continuous and bounded in the data. For a choice of critical value z, hypothesis tests φ(Z) = I{|(t(Z) − m)/s(Z)| > z} are a.s. continuous when the data is continuously distributed. Confidence intervals C(Z) = {m : t(Z) − s(Z)z ≤ m ≤ t(Z) + s(Z)z} have a.s. bounded length 2 s(Z) z.

The setup of RDD follows the potential outcome framework. For each individual i = 1, ..., n, define four primitive random variables D_i, X_i, Y_i(0), Y_i(1). These variables are independent and identically distributed across individuals. The variable D_i takes values in {0, 1} and indicates treatment status. The real-valued variables Y_i(0) and Y_i(1) denote the potential outcomes, respectively, if untreated and treated. Finally, the forcing variable X_i represents a real-valued characteristic of the individual that is not affected by the treatment. The forcing variable has a continuous PDF f(x) with interval support equal to X. The econometrician observes X_i, D_i, and only one of the two potential outcomes for each individual: Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0). For simplicity, we consider the sharp RDD case, but it is straightforward to generalize our results to the fuzzy case. In the sharp case, agents receive the treatment if, and only if, the forcing variable is greater than or equal to a fixed policy cutoff c in the interior of the support X. Hence, D_i = I{X_i ≥ c}, where I{·} denotes the indicator function.

We focus on average treatment effects. In RDD settings, identification of average effects is typically obtained only at the cutoff value after assuming continuity of average potential outcomes, conditional on the forcing variable. In other words, we assume that E[Y_i(0) | X_i = x] and E[Y_i(1) | X_i = x] are bounded continuous functions of x.
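To preview the approximation argument used later in this section, here is a minimal numerical sketch: an infinitely differentiable conditional mean can track the discontinuous mean m·I{x ≥ c} arbitrarily well away from the cutoff. The cutoff c = 0, jump m = 1, and the logistic smoother are all illustrative choices, not the paper's construction.

```python
# A smooth (C-infinity) logistic mean g_k approximating the jump m*I{x >= c}.
import math

c, m = 0.0, 1.0

def jump_mean(x):
    # discontinuous conditional mean: m * I{x >= c}
    return m if x >= c else 0.0

def smooth_mean(x, k):
    # numerically stable logistic m / (1 + exp(-k*(x - c)))
    t = k * (x - c)
    if t >= 0:
        return m / (1.0 + math.exp(-t))
    return m * math.exp(t) / (1.0 + math.exp(t))

# Uniform error off a delta-neighborhood of the cutoff vanishes as k grows
delta = 0.05
grid = [i / 1000.0 for i in range(-1000, 1001) if abs(i / 1000.0 - c) >= delta]
for k in (10, 100, 1000):
    print(k, max(abs(smooth_mean(x, k) - jump_mean(x)) for x in grid))
```

At the cutoff itself the two means always differ by m/2, which is why the approximation holds in the weak (LP) sense rather than uniformly.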
HTV01 show that this leads to identification of the parameter of interest:

m = E[Y_i(1) − Y_i(0) | X_i = c] = lim_{x↓c} E[Y_i | X_i = x] − lim_{x↑c} E[Y_i | X_i = x]. (4.1)

Let G denote the space of all functions g : X → R that are bounded, and that are infinitely many times continuously differentiable at every x ∈ X \ {c}. The notation X \ {c} represents the set with every point of X except for c. Continuity of functions in G suffices to show impossible inference in this section. Nevertheless, non-parametric estimators of the size of the discontinuity m typically assume that functions in G are continuously differentiable of first or second order. We impose that functions in G are continuously differentiable of infinite order, to demonstrate that both types of impossibility hold even in this more restricted class of functions. The size of the discontinuity m at the cutoff may take any value in R. Each individual pair of variables Z_i = (X_i, Y_i) is iid as P. The family of all possible models for P is denoted as

P = {P : (X_i, Y_i) ∼ P, ∃ g ∈ G s.t. E_P[Y_i | X_i = x] = g(x)}. (4.2)

The local average causal effect is the function of the distribution of the data P ∈ P given by (4.1), provided the identification assumptions of HTV01 hold. The parameter m of the size of the discontinuity is weakly identified in the set of possible true models P. Intuitively, any conditional mean function E[Y_i | X_i = x] that is continuous except for a jump discontinuity at x = c is well-approximated by a sequence of continuous conditional mean functions. The reasoning behind this approximation is similar to the proof of part of Portmanteau's theorem (Theorem 25.8, Billingsley (2008)). It is known that, if E[f(X_n)] → E[f(X)] for every bounded function f that is a.s.
continuous under the distribution of X, then X_n →_d X. The proof of Corollary 3 uses an infinitely continuously differentiable function f that is approximately equal to an indicator function.

Corollary 3.
Assumption 1 is satisfied for P_{0,m} for all m ∈ R, and Theorems 2 and 3 apply to the RDD case. Namely, (i) a.s. continuous tests φ_m(Z) for the value of the discontinuity m have power limited by size; and (ii) confidence sets for the value of the discontinuity m and with finite expected length have zero confidence level.

Remark 7.
Corollary 3 also applies to quantile treatment effects by simply changing the definition of the functional μ(P) to be the difference in side limits of the conditional τ-th quantile Q_τ(Y_i | X_i = x) at x = c. This contrasts with the problem of testing unconditional quantiles, which does not suffer from impossible inference. See Lehmann and D'Abrera (2006), Tibshirani and Wasserman (1988), and Coudin and Dufour (2009).

Remark 8.
In the fuzzy RDD case, the treatment effect is equal to the discontinuity in E[Y_i | X_i] at X_i = c divided by the discontinuity in E[D_i | X_i] at X_i = c. Corollary 3 applies to both of these conditional mean functions, and it leads to impossible inference in the fuzzy RDD case as well. Feir, Lemieux, and Marmer (2016) study weak identification in fuzzy RDD and propose a robust testing procedure. In contrast to Kamat (2018) and to this paper, their source of weak identification comes from an arbitrarily small discontinuity in E[D_i | X_i] at X_i = c.

The most common inference procedures currently in use in applied research with RDD rely on Wald tests that are a.s. continuous in the data and produce confidence intervals of finite expected length. See Imbens and Kalyanaraman (2012) and Calonico, Cattaneo, and Titiunik (2014) for the most commonly-used inference procedures. Corollary 3 implies that it is impossible to control size of these tests and coverage of these confidence intervals. Ours is not the first paper to show impossible inference in the RDD case. Kamat (2018) demonstrates the important fact that models with a discontinuity are similar to models without a discontinuity in the TV metric. He applies the testing impossibility of Romano (2004) and finds that tests have power limited by size. Using the graphical intuition of Figure 3, we provide a simpler proof of the same facts, using the weak distance instead of the TV metric. Moreover, we add that confidence intervals produced from Wald tests have zero confidence level. It is worth highlighting the statistics literature on the impossibility of adaptation gains for confidence intervals on linear functionals of non-parametric functions (Low (1997), Cai and Low (2004)). They take confidence intervals with correct coverage over a class of models P and derive a lower bound for the expected length of any confidence interval under a given model P ∈ P.
As the sample size increases, the rate at which these bounds shrink to zero does not depend on P. In other words, any confidence interval whose expected length at P ∈ P shrinks to zero faster than the lower bound must have incorrect coverage over P. Armstrong and Kolesár (2018) derive a lower bound for the expected length of any confidence set that has correct coverage over P. The lower bound increases to infinity as P becomes more general, which is our impossibility of type B.

On a positive note, the two types of impossibility vanish if we restrict the class of models P. The approximation used to prove Corollary 3 fails if we assume that functions in G have absolute slopes bounded by a finite constant C on either side of the cutoff. Kamat (2018) shows that Wald tests have correct size asymptotically if the first three derivatives of g(x), as well as conditional moments, are uniformly bounded across P. Armstrong and Kolesár (2018) derive minimax optimal-length confidence intervals for a convex function class G covering most smoothness or shape assumptions used in econometrics. In the RDD case, they consider functions g(x) such that the p-th-order Taylor approximation residual is bounded by Cx^p on either side of the cutoff. In summary, applied researchers should bear in mind that the validity of tests and confidence sets for the value of the discontinuity at the threshold relies heavily on restricting the variation of average outcomes wrt the forcing variable X. For example, consider the analysis of summer-school programs in Chicago by Jacob and Lefgren (2004). The forcing variable X is a standardized reading score determining eligibility for the program, and Y is a standardized test score in math or reading after the program. Looking at their Figures 6 and 7, it seems reasonable to assume that the slope of the conditional mean of Y given X is smaller than one.
In other words, an increase in today's reading score by 1 point increases tomorrow's average scores by less than one point.

Restricting the class of models P is not the only way to construct valid tests in RDD. Another way to approach the problem is to consider null sets P_0 different than those in Corollary 3, where the focus is on the jump discontinuity at the threshold. One example is the null hypothesis that an individual's outcome is solely affected by the treatment he receives and not by the effect of the treatment on neighboring individuals. In the summer-school application, the number of students attending classes in the summer is much smaller than during the school year. It is likely that students in the summer program interact much more with each other, which leads to spillover effects of the treatment. A researcher who desires to test for no spillovers specifies the null hypothesis of independence of Y_i and Y_j, conditional on X_i = X_j = x for any i ≠ j and x near the threshold. The alternative hypothesis that outcomes exhibit dependence across treated individuals cannot be approximated by models in the null.

Another example of a null hypothesis that is immune to testing impossibility is when absence of treatment effects is equivalent to a smooth conditional mean function. We may define the null hypothesis that g is Lipschitz continuous with some constant C, and the alternative hypothesis that g is any other function as in Equation (4.2). Settings like this arise when the treatment variable D is a function of the forcing variable X, and this function changes at a known cutoff. This is the case of unemployment benefits in Austria, studied by Card, Lee, Pei, and Weber (2015) (CLPW15).
For unemployed individuals who used to earn X less than a threshold c, the unemployment benefit grows with their earnings; otherwise, if they used to earn more than c, they simply receive a fixed benefit regardless of their earnings. CLPW15 find that the unemployment duration does not depend on past earnings for those whose benefit is fixed to the right of the cutoff (see their Figure 3). Moral hazard leads to unemployment duration that increases as benefits increase with income to the left of the cutoff. Therefore, the researcher may specify the null hypothesis of a smooth conditional mean to test for the lack of moral hazard. Rejections may occur because of a sudden change in slope or a jump discontinuity at the threshold, both of which are evidence of a change in behavior regarding job search. Note, however, that the null hypothesis of Lipschitz g is different than the null hypothesis in the so-called Regression Kink Design (RKD) studied by CLPW15. The RKD null states that the first derivative of g is continuous at the threshold, and such a null suffers from testing impossibility.

RKD has recently gained popularity in economics. In addition to CLPW15, see Dong (2016), Nielsen, Sørensen, and Taber (2010), and Simonsen, Skipper, and Skipper (2016). The setup is the same as in the RDD case, except that the causal effect of interest is the change in the slope of the conditional mean of outcomes at the threshold. Continuity of the first derivatives ∇_x E[Y_i(1) | X_i = x] and ∇_x E[Y_i(0) | X_i = x] at the threshold x = c guarantees identification of the average effect. The parameter of interest m = μ(P) is a function of the distribution of Z_i = (X_i, Y_i):

μ(P) = ∇_x E[Y_i(1) − Y_i(0) | X_i = x] = lim_{x↓c} ∇_x E[Y_i | X_i = x] − lim_{x↑c} ∇_x E[Y_i | X_i = x]. (4.3)

The family of all possible distributions of Z_i is defined in a slightly different way than in Equation (4.2):

P = {P : (X_i, Y_i) ∼ P, ∃ g ∈ G s.t. ∇_x E[Y_i | X_i = x] = g(x)}.
(4.4)

Weak identification of μ arises from the fact that any conditional mean function E[Y_i | X_i = x] with a discontinuous first derivative at x = c is well-approximated by a sequence of continuously differentiable conditional mean functions. Assumption 1 is easily verified using this insight.

Corollary 4.
Assumption 1 is satisfied for P_{0,m} for all m ∈ R, and Theorems 2 and 3 apply to RKD. Namely, (i) a.s. continuous tests φ_m(Z) for the value of the kink discontinuity m have power limited by size; and (ii) confidence sets for the value of the kink discontinuity m and with finite expected length have zero confidence level.

The proof of Corollary 4 follows that of Corollary 3. Simply use the new definitions of P and μ(P), and construct the sequence P_k with ∇_x E_{P_k}[Y_i | X_i = x] = g_k(x).

The second example applies Theorem 2 to the problem of testing for the existence of bunching in a scalar random variable. Bunching occurs when the distribution of X exhibits a non-zero probability at a known point x̄, but it is continuous in a neighborhood of x̄. Bunching in the distribution of a single variable is the object of interest in many empirical studies. For example, Saez (2010) and Kleven and Waseem (2013) rely on the existence of bunching in "reported income" at the boundary of tax brackets to identify the elasticity of reported income wrt tax rates; Goncalves and Mello (2018) use bunching in "charged speed in traffic tickets" to separate lenient from non-lenient police officers and identify racial discrimination; and a standard practice in RDD analyses is to check if the distribution of the forcing variable has bunching at the cutoff, which would count as evidence against the design.

Suppose X is a scalar random variable. In the absence of bunching, assume the CDF of X is continuously differentiable. Testing for bunching amounts to testing whether X has positive probability mass at x̄. Let P_0 be the set of distributions of X with a continuously differentiable CDF. The set P_1 is all mixed continuous-discrete distributions, with one mass point at x̄, but continuously differentiable CDF otherwise. Any distribution Q under the alternative is well-approximated in the LP metric by a sequence of distributions P_k under the null. Therefore, any a.s. continuous test has power limited by size.
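The approximation of a mass point by continuous distributions can be illustrated numerically. Take Q = 0.5·δ_{x̄} + 0.5·U(0,1) under the alternative, matched by continuous P_k that smear the atom over (x̄, x̄ + 1/k); the mixture weights and supports are illustrative choices. The Kolmogorov (and hence TV) distance never shrinks, while the Lévy condition holds with ε = 1/k, consistent with convergence in distribution:

```python
# Continuous P_k approximating a distribution Q with an atom at xbar.
xbar = 0.5

def F_Q(t):
    # CDF of Q = 0.5 * point mass at xbar + 0.5 * Uniform(0, 1)
    cont = min(max(t, 0.0), 1.0)
    return 0.5 * cont + (0.5 if t >= xbar else 0.0)

def F_Pk(t, k):
    # CDF of P_k = 0.5 * Uniform(xbar, xbar + 1/k) + 0.5 * Uniform(0, 1)
    cont = min(max(t, 0.0), 1.0)
    smear = min(max((t - xbar) * k, 0.0), 1.0)
    return 0.5 * cont + 0.5 * smear

grid = [i / 2000.0 for i in range(-200, 2601)]
for k in (10, 100, 1000):
    eps = 1.0 / k
    kolm = max(abs(F_Pk(t, k) - F_Q(t)) for t in grid)
    levy_ok = all(F_Pk(t - eps, k) - eps <= F_Q(t) <= F_Pk(t + eps, k) + eps
                  for t in grid)
    print(k, kolm, levy_ok)  # kolm stays at 0.5; the Levy bound eps holds
```

The sup-distance between the CDFs is pinned at half the atom's mass at t = x̄, yet an ε = 1/k horizontal-plus-vertical slack absorbs the smeared atom, which is the Lévy-Prokhorov notion of closeness.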
Corollary 5.
Assumption 1 is satisfied in the problem of testing for the existence of bunching. Hence, any test φ(Z) that is a.s. continuous under P_0 has power limited by size.

There is one interesting feature of this example that is not shared by the RDD and RKD examples of the previous section. In this example, it is not possible to find a sequence P_k under the null that approximates a Q ∈ P_1 in the TV metric. The event X = x always has zero probability under the null, but strictly positive probability under the alternative, so that d_TV(P, Q) > 0 for every P ∈ P_0 and Q ∈ P_1. (The assumption that the CDF is continuously differentiable is not necessary in this section. We impose this assumption because typical non-parametric density estimators assume a continuous density. The testing impossibility of this section occurs regardless of whether the CDF is assumed continuously differentiable or simply continuous.) Theorem 1 suggests that there exists a test whose maximum power is bigger than size, but our Theorem 2 says this test cannot be a.s. continuous under P_0.

The use of the LP metric, as opposed to the TV metric, leads us to search for tests that are discontinuous under P_0. For a sample with n iid observations X_i, the test φ(X_1, ..., X_n) = I{ (1/n) Σ_{i=1}^n I{X_i = x} > 0 } is discontinuous under P_0. This test has size equal to zero, and power equal to 1 − (1 − δ)^n, where δ = P[X_i = x] under the alternative.

The third example comes from Caetano (2015), who uses the idea of bunching in a conditional distribution of Y given X to construct an exogeneity test that does not require instrumental variables. It applies to regression models where the distribution of unobserved factors is assumed to be discontinuous wrt an explanatory variable. Of interest is the impact of a scalar explanatory variable X on an outcome variable Y, after controlling for covariates W.
For example, suppose we are interested in the effect of the average number of cigarettes smoked per day, X, on birth weight, Y, after controlling for mothers' observed characteristics W. Conditional on (X, W), the distribution of mothers' unobserved characteristics U is said to bunch at zero smoking if it changes drastically when we compare non-smoking mothers to mothers who smoke very little. If bunching occurs, then the variable X is endogenous because we cannot separate the effect of smoking on birth weight from the effect of unobserved characteristics on birth weight.

The population model that determines Y is written as Y = h(X, W) + U, where U summarizes unobserved confounding factors affecting Y. We are unable to infer bunching on U unless h is assumed continuous on (X, W). Bunching of U wrt X is evidence of local endogeneity of X at X = 0. Bunching at 0 implies discontinuity of E[U | X = 0, W] − E[U | X = x, W] as x ↓ 0. Continuity of h makes bunching equivalent to a discontinuity of E[Y | X = 0, W = w] − E[Y | X = x, W = w] as x ↓ 0 for some w. Caetano (2015) proposes testing

∀ w:  lim_{x↓0} E[Y | X = 0, W = w] − E[Y | X = x, W = w] = 0    (4.5)

as a means of testing for local exogeneity of X at X = 0. We argue that h may have a high slope on X, or even be discontinuous on X, which makes exogeneity untestable.

The observed data Z = (Z_1, ..., Z_n), Z_i = (X_i, W_i, Y_i), are iid with probability P. The support of (X_i, W_i) is denoted X × W. The distribution of Y conditional on (X, W) is assumed to be continuous. The distribution of X has non-zero probability at X = 0, but it is continuous otherwise. Assume there exists δ > 0 such that (0, δ) ⊂ X. Let G denote the space of all functions g : X × W → R that are bounded and infinitely many times continuously differentiable wrt x over {X \ {0}} × W. The size of the discontinuity at X = 0 may take any value in R. The family of all possible distributions is denoted

P = { P : Z_i ∼ P, ∃ g ∈ G s.t. E_P[Y_i | X_i = x, W_i = w] = g(x, w) }.    (4.6)

Under local exogeneity of X, the function τ_P(w) = E_P[Y_i | X_i = 0, W_i = w] − lim_{x↓0} E_P[Y_i | X_i = x, W_i = w] must be equal to 0 for every w ∈ W. In practice, it is convenient to conduct inference on an aggregate of τ_P(w) over w ∈ W instead of on the entire function τ_P(w). Examples of aggregation include the average of |τ_P(W)|, the square root of the average of τ_P(W)^2, or the supremum of |τ_P(w)| over w ∈ W. For the sake of brevity, we choose the second option. For a distribution P ∈ P, define µ(P) = [E_P(τ_P(W)^2)]^{1/2}. Local exogeneity corresponds to the test of µ(P) = 0 versus µ(P) ≠ 0.

The parameter µ(P) is weakly identified in the class of models P.
Just as in the RDD case, any conditional mean function E[Y_i | X_i = x, W_i = w] with a discontinuity at x = 0 is well-approximated by a sequence of continuous conditional mean functions. Assumption 1 is verified using the same argument as in the RDD case.

Corollary 6.
Assumption 1 is satisfied for P_{0,m} for every m ∈ R, and Theorems 2 and 3 apply to the case of the local exogeneity test. Namely, (i) a.s. continuous tests φ_m(Z) for the value of the aggregate discontinuity m have power limited by size; and (ii) confidence sets for the value of the aggregate discontinuity m with finite expected length have zero confidence level.

The inference procedures suggested by Caetano (2015) rely on non-parametric local polynomial estimation methods. As in the RDD case, these procedures yield tests that are a.s. continuous in the data and confidence intervals of finite expected length. Corollary 6 implies lack of size control and zero confidence level.
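The aggregation step can be made concrete with a toy simulation. Everything in the sketch below is a hypothetical stand-in: the binary W, the mass of X at zero, the jump sizes, and the crude one-sided local means used in place of the local polynomial estimators of Caetano (2015). It only illustrates how the pointwise discontinuities τ_P(w) are collapsed into the scalar aggregate µ(P).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical design: W binary; X has mass 0.3 at zero and is Uniform(0, 1) otherwise.
w = rng.integers(0, 2, n)
x = np.where(rng.random(n) < 0.3, 0.0, rng.random(n))

tau_true = np.array([0.5, -0.2])    # assumed jump tau_P(w) at X = 0, one value per w
u = rng.normal(scale=0.1, size=n)
y = 1.0 + x + 0.5 * w + tau_true[w] * (x == 0) + u

# Crude stand-in for local polynomial estimation of tau_P(w): compare the mean of Y
# at X = 0 with the mean of Y just to the right of zero, within each value of w.
h = 0.02
tau_hat = np.empty(2)
for wv in (0, 1):
    at_zero = y[(x == 0) & (w == wv)].mean()
    near_zero = y[(x > 0) & (x < h) & (w == wv)].mean()
    tau_hat[wv] = at_zero - near_zero

mu_hat = np.sqrt(np.mean(tau_hat ** 2))   # RMS aggregate mu(P) over W
print(tau_hat, mu_hat)                     # roughly [0.5, -0.2] and sqrt(0.145)
```

The RMS aggregate is zero exactly when τ_P(w) = 0 for (almost) every w, which is why testing µ(P) = 0 is equivalent to testing local exogeneity in this design.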
The fourth example illustrates robust hypothesis testing wrt the LP metric, and it is of practical relevance to macroeconomists. Macroeconometrics often uses linear time-series processes. This is motivated by the Wold Representation Theorem, which asserts that every covariance-stationary process x_t can be written as an MA process plus some deterministic term: x_t = B(L)ε_t, where L is the lag operator, B(l) = Σ_{i=0}^q b_i l^i, and ε_t is an uncorrelated error sequence. A caveat is that the order q may need to be too large for the representation to be useful in many applications. The features of MA processes with infinite lag order are well captured by ARMA models A(L)x_t = B(L)ε_t, with small orders for A(L) and B(L), where A(l) = Σ_{i=0}^p a_i l^i. The closure of the set of stationary ARMA(p, q) models with finite order (p, q) does not necessarily contain only stationary models. The simplest example arises when A(l) = 1 − al and B(l) = 1. The process is stationary when |a| < 1, but it is non-stationary when a = 1. This observation led to ARIMA models, which better capture the persistence in time series.

Starting in the 1990s, applied researchers began to realize that ARIMA models themselves have limitations. This led to the development of other stochastic processes, including error duration models, Markov switching models, threshold models, structural breaks, and fractionally integrated processes, among others. This is a vast literature and includes papers by Hamilton (1989), Parke (1999), and Bai and Perron (1998), just to name a few.

A number of authors point out that these different model extensions may not be too far from each other. For example, Perron (1989) shows that integrated processes with drift and stationary models with a broken trend can be easily confused; Parke (1999) points out that the error-duration model encapsulates fractionally integrated series; Granger and Hyung (1999) and Diebold and Inoue (2001) find that linear processes with breaks can be misinterpreted as long-memory models. In these papers, and in most of the related econometrics literature, the focus is on the autocovariance of the stochastic process.

Our discussion of robust hypothesis testing in Section 3 suggests looking at the closure of ARMA processes to distinguish these models from each other. For example, take the problem of testing the null that a process is covariance-stationary against the alternative that it is an error-duration model or a Compound Poisson model. The existence of a test with non-trivial power requires us to look at the TV distance between these sets of processes. However, the ability to approximate these processes in the TV distance is often based on quite stringent assumptions; see, for example, Barbour and Utev (1999) for the TV approximation of Compound Poisson processes.

The problem of searching for tests of covariance-stationarity versus error-duration or Compound Poisson models becomes much easier if we focus on the LP metric.
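The observational equivalence documented by Granger and Hyung (1999) and Diebold and Inoue (2001) is easy to reproduce. In the sketch below, a single level shift of arbitrary size is added to white noise; the break size, sample length, and lags are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20_000

def sample_acf(x, lags):
    """Sample autocorrelations of x at the given lags."""
    xc = x - x.mean()
    v = np.dot(xc, xc)
    return np.array([np.dot(xc[:-k], xc[k:]) / v for k in lags])

eps = rng.normal(size=T)                       # white noise: short memory
breaks = eps + 3.0 * (np.arange(T) >= T // 2)  # same noise plus one level shift

lags = [1, 10, 100, 500]
print(sample_acf(eps, lags))     # all negligible
print(sample_acf(breaks, lags))  # large and decaying very slowly
```

Judged by autocovariances alone, the series with a single break looks like a long-memory process, even though it is white noise apart from one shift in the mean.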
To solve this problem, we rely on Bickel and Bühlmann (1996), whose work has been largely ignored in the econometrics literature. They characterize the closure of AR and MA processes wrt the TV and the Mallows metric (also known as the Wasserstein metric). The TV metric is stronger than the Mallows metric, which in turn is stronger than the LP metric. Indeed, convergence under the Mallows metric implies weak convergence and convergence in second moments; see Bickel and Freedman (1981) and Bickel and Bühlmann (1996). As a result, the closure of stochastic processes wrt the LP metric is larger than the closure wrt the Mallows metric. It turns out that error-duration and Compound Poisson models are in the closure wrt the LP metric. In other words, the robustified null set wrt the LP metric contains the alternative set, and the minimal TV distance between these sets is zero. Hence, all tests for the robustified null have power no larger than size.

Given that the closure of ARMA processes of infinite order is quite rich, we may wonder which hypotheses are testable. Bahadur and Savage (1956) and Romano (2004) point out that it is hopeless to test population means, even in the iid case without further moment constraints. Could we try to test quantiles? Peskir (2000) and Shorack and Wellner (2009) provide sufficient conditions for uniform convergence of empirical processes under time dependence. A natural choice for quantile testing is the value at risk (VaR), which is commonly used in the finance literature. It would be interesting to establish the class of empirical processes for which hypotheses about the VaR are testable. We leave this example for future work.

In this section, we provide Monte Carlo simulations to illustrate the impossibility of testing within the context of RDD. We find that the Wald test fails to control size uniformly under the null hypothesis. We use a data-generating process (DGP) based on an empirical example.
Lack of size control occurs even for DGPs that are consistent with the data. Moreover, the simulations also show that the Wald test has very little power after artificially controlling size. For the sake of brevity, we focus on the RDD case, and we expect similar findings for the RKD and exogeneity-test cases.

Our DGP is based on the incumbency data of Lee (2008). Lee studies incumbency advantage in the US House of Representatives. Districts where a party's candidate barely wins an election are, on average, comparable to districts where that party's candidate barely loses the election. The forcing variable X is the margin of victory of the Democratic party in percentage of votes. The target parameter is the effect of the Democrats winning the election at time t (incumbency) on the probability of the Democrats winning the election at time t + 1. Lee's data have been used for simulation studies by several other econometricians, for example, by Imbens and Kalyanaraman (2012), Calonico, Cattaneo, and Titiunik (2014), and Armstrong and Kolesár (2018). We use the Monte Carlo DGP of Imbens and Kalyanaraman (2012) and Calonico, Cattaneo, and Titiunik (2014), described in Equation (5.1):

Y = 0.48 + 1.27X + 7.18X^2 + 20.21X^3 + 21.54X^4 + 7.33X^5 + U   if X ∈ (−0.99, 0)
Y = 0.52 + 0.84X − 3.00X^2 + 7.99X^3 − 9.01X^4 + 3.56X^5 + U    if X ∈ [0, 0.99]    (5.1)

where X is distributed as 2·Beta(2, 4) − 1, U is zero-mean Gaussian with standard deviation 0.1295, and X is independent of U. Figure 2 depicts the conditional mean function of Equation (5.1).

Figure 2: Conditional Mean Function Based on Lee's (2008) Data. Notes: conditional mean function of Equation (5.1). The forcing variable X is the margin of victory of the Democratic party in percentage of votes in time t. The outcome variable Y is equal to one if Democrats win in time t + 1, and equal to zero otherwise.

Our simulation study uses variations of Equation (5.1) that are governed by two parameters: τ ∈ R and M ∈ R_+:

Y = 0.48 + τΛ(4MX/τ) + 1.27X + 7.18X^2 + 20.21X^3 + 21.54X^4 + 7.33X^5 + U   if X ∈ (−0.99, 0)
Y = 0.48 + τΛ(4MX/τ) + 0.84X − 3.00X^2 + 7.99X^3 − 9.01X^4 + 3.56X^5 + U    if X ∈ [0, 0.99]    (5.2)

where Λ(·) is the logistic CDF. The conditional mean functions of Equations (5.1) and (5.2) are differentiable away from X = 0; the first has a discontinuity of size 0.04 at X = 0, while the second is continuous at X = 0. (The DGP in Equation (5.1) belongs to the class of functions that Armstrong and Kolesár (2018) study in their application to RDD; the set of functions F_{RDP,p}(C) on their page 658 contains Equation (5.1) with p = 2 and constant C = 7.) For τ = 0.04, Equation (5.2) approximates Equation (5.1) as M → ∞. The parameter M is the derivative of τΛ(4MX/τ) wrt X at X = 0. As the slope M grows large, the continuous conditional mean function of Equation (5.2) approximates a discontinuous function with discontinuity of size τ. Figure 3 illustrates this approximation, as well as the proof of Corollary 3 in Section 4.1.

For example, a model similar to Equation (5.2) with high values of M arises when districts manipulate the share of votes in order to win the election. Manipulation of the forcing variable has been extensively studied in the RDD literature; see, for example, McCrary (2008) and Gerard, Rokkanen, and Rothe (2016). Suppose the average causal effect of winning the election conditional on X is small for districts with a small margin of victory, but large otherwise. In the absence of manipulation, E[Y | X] is continuous and very smooth to the right of the cutoff. The party in districts with low X has incentives to manipulate the election, and the researcher observes the manipulated margin of victory X̃ instead of X. Assume the probability that manipulation occurs conditional on X̃ increases continuously but sharply to the right of the cutoff. In this case, the researcher observes a conditional mean function that is continuous at the cutoff but that increases sharply to the right of the cutoff. In practice, one may falsely reject the null of zero effect simply because of manipulation, and not because of an actual causal effect. We provide a concrete example for this DGP in Section A.6 in the appendix.

The parameter of interest is m, the size of the jump discontinuity at X = 0. The null hypothesis is m = 0, which is the set of models in Equation (5.2) with τ ∈ R and M ∈ R_+. The alternative hypothesis is m ≠ 0, which is the set of models with τ ≠ 0 and M = ∞. Section 4.1 shows that any model in the alternative is well-approximated in the LP metric by models under the null. The power of a.s.
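Both properties of the smoothing device, that the slope of the jump component at X = 0 equals M and that the component converges to a sharp jump of size τ as M grows, can be checked numerically. The sketch below isolates the component τΛ(4MX/τ) of Equation (5.2); the evaluation grid and the 0.05 window around the cutoff are arbitrary choices.

```python
import numpy as np

def logistic(z):
    """Logistic CDF, clipped to avoid overflow in exp for large slopes."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def jump_component(x, tau, M):
    """Smooth jump component tau * Lambda(4 M x / tau) from Equation (5.2)."""
    return tau * logistic(4.0 * M * x / tau)

tau = 0.04
x = np.linspace(-1.0, 1.0, 20001)
away = np.abs(x) > 0.05          # look away from a small window around the cutoff

for M in (0.5, 2.0, 8.0, 100.0):
    # Slope at x = 0: (d/dx) tau * Lambda(4Mx/tau) = 4M * Lambda'(0) = M.
    e = 1e-7
    slope = (jump_component(e, tau, M) - jump_component(-e, tau, M)) / (2 * e)
    # Max distance to the sharp jump tau * 1{x >= 0}, away from the cutoff.
    gap = np.max(np.abs(jump_component(x[away], tau, M) - tau * (x[away] >= 0)))
    print(M, slope, gap)  # slope ~= M; gap shrinks as M grows
```

Uniform convergence holds on any set bounded away from the cutoff, which is exactly the sense in which the continuous null models approximate the discontinuous alternative.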
continuous tests is less than or equal to size.

The Monte Carlo experiment simulates 10,000 draws of an iid sample with 500 observations. The range of (τ, M) values for Model 5.2 in the experiment is consistent with the magnitudes of Lee's DGP. The maximum slope magnitude of the conditional mean graph in Figure 2 is 1.97, and we set M ∈ {0, 2, 4, . . . , 10}. The value of m for Lee's DGP is 0.04, and we vary τ in {0.01, 0.02, . . . , 0.08}. We conduct a size and a power analysis. In the size analysis, we simulate rejection probabilities of the Wald test under each (τ, M)-model. The estimates of m and standard errors are obtained by the robust bias-corrected method of Calonico, Cattaneo, and Titiunik (2014) and implemented using the STATA package rdrobust. For each model (τ, M), the critical value of the test comes from the simulated distribution of the statistic under model (τ, M = 0).

Figure 3: Approximating a Discontinuous Conditional Mean Function (τ = 0.04): (a) M = 0.5; (b) M = 2; (c) M = 8. Notes: the discontinuous conditional mean function E[Y | X] (solid line) is approximated by a sequence of continuous conditional mean functions (dotted lines). The solid line is the E[Y | X] of Model 5.1, and the dotted lines are the E[Y | X] of Model 5.2 for τ = 0.04 and M ∈ {0.5, 2, 8}. The figure illustrates that Model 5.2 approximates the DGP based on Lee (2008) as the slope at X = 0 grows large.

Table 1: Rejection Probability Under the Null - Size 5%

 τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
 0.01   0.0500   0.0540   0.0557   0.0592   0.0585   0.0580
 0.02   0.0500   0.0665   0.0678   0.0649   0.0694   0.0685
 0.03   0.0500   0.0910   0.0942   0.0938   0.0941   0.1016
 0.04   0.0500   0.1005   0.1067   0.1071   0.1121   0.1139
 0.05   0.0500   0.1114   0.1264   0.1334   0.1464   0.1434
 0.06   0.0500   0.1292   0.1632   0.1680   0.1819   0.1819
 0.07   0.0500   0.1258   0.1617   0.1832   0.1906   0.2026
 0.08   0.0500   0.1320   0.1888   0.2142   0.2266   0.2427

Notes: the table displays the simulated rejection probability of the Wald test under various choices of (τ, M) for Model 5.2. Critical values of the test vary by row, but are constant across columns. For each (τ, M)-model, the critical value of the test comes from the simulated distribution of the statistic under model (τ, M = 0). The estimates of m and standard errors for the Wald test are obtained by the robust bias-corrected method of Calonico, Cattaneo, and Titiunik (2014) and implemented using the STATA package rdrobust.

The nominal size of the Wald tests in Table 1 is 5%, and the simulated rejection probabilities increase with τ and M. For the maximum slope of M = 2 observed from the model in Equation (5.1), the size of the test varies between 5.4% and 13.2%, depending on the choice of the model under the null. The true value of M is unknown, and a more conservative upper bound on the slope M = 10 distorts the size of the test up to 24%.

In the power analysis, we study rejection probabilities for models with M = ∞ and τ ∈ {0.01, 0.02, . . . , 0.08}. These models fall under the alternative because m = τ when M = ∞. For each (τ, ∞)-model, we would like the test to have correct size under the least favorable null model. Table 1 suggests that the least favorable model under the null is the one with the highest slope M. Figure 3 shows that null models can approximate any alternative (τ, ∞)-model arbitrarily well. If we restrict the slope at X = 0 to be at most M, the worst-case model under the null for the alternative (τ, ∞)-model is the (τ, M)-model. To evaluate the rejection probability under a (τ, ∞)-model, the critical value of the test comes from the simulated distribution of the statistic under a (τ, M)-model for various choices of (τ, M). That way, the test has correct size when m = 0 under all possibilities of least favorable (τ, M)-models.

Table 2: Rejection Probability Under the Alternative - Size 5%

 τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
 0.01   0.0610   0.0508   0.0504   0.0501   0.0504   0.0500
 0.02   0.0763   0.0527   0.0524   0.0505   0.0513   0.0501
 0.03   0.1020   0.0574   0.0532   0.0525   0.0526   0.0527
 0.04   0.1204   0.0646   0.0571   0.0556   0.0536   0.0524
 0.05   0.1583   0.0770   0.0618   0.0597   0.0573   0.0544
 0.06   0.2013   0.0899   0.0682   0.0638   0.0605   0.0569
 0.07   0.2192   0.1023   0.0732   0.0677   0.0642   0.0590
 0.08   0.2781   0.1179   0.0839   0.0707   0.0654   0.0631

Notes: the entries of the table display the simulated rejection probability of the Wald test under Model 5.2 with various τ and M = ∞, so that the size of the discontinuity is m = τ. Critical values of the test vary by row and column. For each (τ, M)-entry, the critical value comes from the simulated distribution of the statistic under a null (τ, M)-model. The estimates of m and standard errors for the Wald test are obtained by the robust bias-corrected method of Calonico, Cattaneo, and Titiunik (2014) and implemented using the STATA package rdrobust.

The power of the tests in Table 2 increases with the size of the discontinuity τ, but it decreases with the slope M of the least favorable model under the null. Intuitively, the higher M is, the harder it becomes to distinguish a (τ, M)-model from a (τ, ∞)-model. For the empirically relevant values of τ = 0.04 and M = 2, the power of the test is 6.5%, barely above its size. More conservative upper bounds on the slope of the model under the null essentially make power equal to size. Section A.9 in the appendix contains versions of these tables for nominal levels 1% and 10%, as well as the simulated critical values used.

Conclusion
When drawing inference on a parameter in econometric models, some authors provide conditions under which tests have trivial power (impossibility type A). Others examine when confidence regions have error probability equal to one (impossibility type B). The motivation behind these negative results is that the parameter of interest may be nearly unidentified across models. Impossible inference relies on models being indistinguishable wrt some notion of distance. Some authors distinguish models using the Total Variation (TV) metric, and others rely on the Lévy-Prokhorov (LP) metric, which is a weaker notion of distance. The ability to distinguish models in the TV metric is a necessary and sufficient condition for the existence of tests with non-trivial power. Impossible inference in terms of a weaker notion of distance is often easier to prove, it is applicable to the widely-used class of almost surely continuous tests, and it is useful for robust hypothesis testing.

Impossibility type A is stronger than type B. Dufour (1997) focuses on models in which tests based on bounded confidence regions fail to control size, but they could still have non-trivial power. Take the simultaneous equations model when instrumental variables may be arbitrarily weak. Moreira (2002, 2003) and Kleibergen (2005) propose tests that have correct size in models with type B impossibility. Furthermore, these tests have good power when identification is strong, being efficient under the usual asymptotics. Their power is not trivial precisely because not every model under the alternative is approximated by models under the null.

The choice of the LP versus the TV metric connects our work to the work of Peter J. Huber on robust statistics. It leads us to look at the closure of model departures under the LP metric. In particular, robust hypothesis testing requires a non-zero TV distance between the closures of the null and alternative sets under the LP metric.
For example, it is impossible to find a robust test that powerfully distinguishes covariance-stationary models from error-duration and Compound Poisson models, because the closure of the former contains the latter. This closure is quite rich, and we wonder what sort of hypotheses are testable. It is impossible to test the population mean, so one possibility may be quantiles such as the value at risk (VaR). Peskir (2000) and Shorack and Wellner (2009) provide sufficient conditions for convergence of empirical processes under dependence. It would be interesting future work to build on these conditions to establish the class of processes in which quantile testing is possible.

References
Angrist, J., and V. Lavy (1999): "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement," Quarterly Journal of Economics, 114(2), 533–575.

Armstrong, T. B., and M. Kolesár (2018): "Optimal Inference in a Class of Regression Models," Econometrica, 86(2), 655–683.

Bahadur, R. R., and L. J. Savage (1956): "The Nonexistence of Certain Statistical Procedures in Nonparametric Problems," Annals of Mathematical Statistics, 27(4), 1115–1122.

Bai, J., and P. Perron (1998): "Estimating and Testing Linear Models with Multiple Structural Changes," Econometrica, 66(1), 47–78.

Barbour, A., and S. Utev (1999): "Compound Poisson Approximation in Total Variation," Stochastic Processes and Their Applications, 82(1), 89–125.

Bertanha, M. (2017): "Regression Discontinuity Design with Many Thresholds," working paper, Department of Economics, University of Notre Dame.

Bertanha, M., and G. Imbens (2018): "External Validity in Fuzzy Regression Discontinuity Designs," Journal of Business and Economic Statistics, forthcoming.

Bickel, P. J., and P. Bühlmann (1996): "What is a Linear Process?," Proceedings of the National Academy of Sciences, 93(22), 12128–12131.

Bickel, P. J., and D. A. Freedman (1981): "Some Asymptotic Theory for the Bootstrap," Annals of Statistics, 9(6), 1196–1217.

Billingsley, P. (2008): Probability and Measure. John Wiley & Sons, New York.

Black, S. (1999): "Do Better Schools Matter? Parental Valuation of Elementary Education," Quarterly Journal of Economics, 114(2), 577–599.

Caetano, C. (2015): "A Test of Exogeneity without Instrumental Variables in Models with Bunching," Econometrica, 83(4), 1581–1600.

Cai, T. T., and M. G. Low (2004): "An Adaptation Theory for Nonparametric Confidence Intervals," The Annals of Statistics, 32(5), 1805–1840.

Calonico, S., M. D. Cattaneo, and R. Titiunik (2014): "Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs," Econometrica, 82(6), 2295–2326.

Canay, I. A., A. Santos, and A. M. Shaikh (2013): "On the Testability of Identification in Some Nonparametric Models with Endogeneity," Econometrica, 81(6), 2535–2559.

Card, D., D. S. Lee, Z. Pei, and A. Weber (2015): "Inference on Causal Effects in a Generalized Regression Kink Design," Econometrica, 83(6), 2453–2483.

Coudin, E., and J.-M. Dufour (2009): "Finite-sample Distribution-free Inference in Linear Median Regressions under Heteroscedasticity and Non-linear Dependence of Unknown Form," The Econometrics Journal, 12, S19–S49.

Diebold, F. X., and A. Inoue (2001): "Long Memory and Regime Switching," Journal of Econometrics, 105(1), 131–159.

Dong, Y. (2016): "Jump or Kink? Regression Probability Jump and Kink Design for Treatment Effect Evaluation," working paper, University of California, Irvine.

Dong, Y., and A. Lewbel (2015): "Identifying the Effect of Changing the Policy Threshold in Regression Discontinuity Models," Review of Economics and Statistics, 97(5), 1081–1092.

Donoho, D. L. (1988): "One-sided Inference About Functionals of a Density," Annals of Statistics, 16(4), 1390–1420.

Dudley, R. M. (1976): "Probabilities and Metrics," Lecture Notes Series No. 45, Matematisk Institut, Aarhus Universitet.

Dufour, J.-M. (1997): "Some Impossibility Theorems in Econometrics with Applications to Structural and Dynamic Models," Econometrica, 65(6), 1365–1387.

Feir, D., T. Lemieux, and V. Marmer (2016): "Weak Identification in Fuzzy Regression Discontinuity Designs," Journal of Business and Economic Statistics, 34(2), 185–196.

Gerard, F., M. Rokkanen, and C. Rothe (2016): "Bounds on Treatment Effects in Regression Discontinuity Designs with a Manipulated Running Variable," NBER Working Paper 22892.

Gibbs, A. L., and F. E. Su (2002): "On Choosing and Bounding Probability Metrics," International Statistical Review, 70(3), 419–435.

Gleser, L. J., and J. T. Hwang (1987): "The Nonexistence of 100(1-α)% Confidence Sets of Finite Expected Diameter in Errors-in-Variables and Related Models," Annals of Statistics, 15(4), 1351–1362.

Goncalves, F., and S. Mello (2018): "A Few Bad Apples?: Racial Bias in Policing," working paper, Crime Lab New York.

Granger, C. W., and N. Hyung (1999): "Occasional Structural Breaks and Long Memory," discussion paper 99-14, University of California, San Diego.

Hahn, J., P. Todd, and W. Van der Klaauw (2001): "Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design," Econometrica, 69(1), 201–209.

Hamilton, J. D. (1989): "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle," Econometrica, 57(2), 357–384.

Huber, P. J. (1964): "Robust Estimation of a Location Parameter," Annals of Mathematical Statistics, 35(1), 73–101.

Huber, P. J. (1965): "A Robust Version of the Probability Ratio Test," Annals of Mathematical Statistics, 36(6), 1753–1758.

Huber, P. J., and E. M. Ronchetti (2009): Robust Statistics. John Wiley & Sons, Inc., Hoboken, NJ.

Imbens, G., and K. Kalyanaraman (2012): "Optimal Bandwidth Choice for the Regression Discontinuity Estimator," Review of Economic Studies, 79(3), 933–959.

Imbens, G. W., and T. Lemieux (2008): "Regression Discontinuity Designs: A Guide to Practice," Journal of Econometrics, 142(2), 615–635.

Ingster, Y., and I. Suslina (2003): Nonparametric Goodness-of-Fit Testing Under Gaussian Models, vol. 169. Springer Science & Business Media, New York.

Jacob, B. A., and L. Lefgren (2004): "Remedial Education and Student Achievement: a Regression-Discontinuity Analysis," Review of Economics and Statistics, 86(1), 226–244.

Kamat, V. (2018): "On Nonparametric Inference in the Regression Discontinuity Design," Econometric Theory, 34(3), 694–703.

Kleibergen, F. (2005): "Testing Parameters in GMM Without Assuming That They Are Identified," Econometrica, 73(4), 1103–1123.

Kleven, H. J., and M. Waseem (2013): "Using Notches to Uncover Optimization Frictions and Structural Elasticities: Theory and Evidence from Pakistan," Quarterly Journal of Economics, 128(2), 669–723.

Kraft, C. (1955): "Some Conditions for Consistency and Uniform Consistency of Statistical Procedures," in University of California Publications in Statistics, ed. by J. Neyman, L. LeCam, and H. Scheffé, vol. 2(6), pp. 125–142. University of California Press, Berkeley and Los Angeles.

Lee, D. S. (2008): "Randomized Experiments from Non-Random Selection in US House Elections," Journal of Econometrics, 142(2), 675–697.

Lehmann, E., and H. D'Abrera (2006): Nonparametrics. Springer-Verlag, New York.

Lehmann, E., and J. Romano (2005): Testing Statistical Hypotheses. Springer-Verlag, New York.

Low, M. G. (1997): "On Nonparametric Confidence Intervals," Annals of Statistics, 25(6), 2547–2554.

McCrary, J. (2008): "Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test," Journal of Econometrics, 142(2), 698–714.

Moreira, M. J. (2002): "Tests with Correct Size in the Simultaneous Equations Model," Ph.D. thesis, University of California, Berkeley.

Moreira, M. J. (2003): "A Conditional Likelihood Ratio Test for Structural Models," Econometrica, 71(4), 1027–1048.

Nielsen, H. S., T. Sørensen, and C. Taber (2010): "Estimating the Effect of Student Aid on College Enrollment: Evidence from a Government Grant Policy Reform," American Economic Journal: Economic Policy, 2(2), 185–215.

Parke, W. R. (1999): "What is Fractional Integration?," Review of Economics and Statistics, 81(4), 632–638.

Perron, P. (1989): "The Great Crash, the Oil Price Shock, and the Unit Root Hypothesis," Econometrica, 57(6), 1361–1401.

Peskir, G. (2000): "From Uniform Laws of Large Numbers to Uniform Ergodic Theorems," Lecture Notes Series No. 66, Dept. Math. Univ. Aarhus.

Porter, J. (2003): "Estimation in the Regression Discontinuity Model," unpublished manuscript, University of Wisconsin, Madison.

Romano, J. P. (2004): "On Non-parametric Testing, the Uniform Behaviour of the t-test, and Related Problems," Scandinavian Journal of Statistics, 31(4), 567–584.

Rudin, W. (1976): Principles of Mathematical Analysis. McGraw-Hill, New York.

Saez, E. (2010): "Do Taxpayers Bunch at Kink Points?," American Economic Journal: Economic Policy, 2(3), 180–212.

Schmieder, J. F., T. von Wachter, and S. Bender (2012): "The Effects of Extended Unemployment Insurance Over the Business Cycle: Evidence from Regression Discontinuity Estimates over 20 Years," Quarterly Journal of Economics, 127(2), 701–752.

Shorack, G. R., and J. A. Wellner (2009): Empirical Processes with Applications to Statistics, vol. 59. Society for Industrial and Applied Mathematics (SIAM), Philadelphia.

Simonsen, M., L. Skipper, and N. Skipper (2016): "Price Sensitivity of Demand for Prescription Drugs: Exploiting a Regression Kink Design," Journal of Applied Econometrics, 31(2), 320–337.

Tibshirani, R., and L. A. Wasserman (1988): "Sensitive Parameters," Canadian Journal of Statistics, 6(2), 185–192.

Van der Vaart, A. W. (2000): Asymptotic Statistics. Cambridge University Press, Cambridge, UK.
Appendix
A.1 Proof of Corollary 1
We introduce some notation before embarking on the proof. The density of $P \in \mathcal{P}$ with respect to a $\sigma$-finite measure $\mu$ is $p$. The set of densities of all distributions in $\mathcal{P}$ is denoted $\mathbf{p}$. Similarly, the null and alternative sets of densities are $\mathbf{p}_0$ and $\mathbf{p}_1$, and their union equals $\mathbf{p}$. Define $co(\mathbf{p}')$ to be the convex hull of an arbitrary subset $\mathbf{p}' \subseteq \mathbf{p}$ in a similar fashion as in Equation (2.1). The Total Variation (TV) metric between two distributions $P, Q \in \mathcal{P}$ with densities $p, q \in \mathbf{p}$ is defined as
$$d_{TV}(p, q) = \frac{1}{2} \int |p - q| \, d\mu. \quad (A.1)$$

The proof of the equivalence of (a) and (b) is shown in three parts.

Part 1: $(a) \Leftrightarrow (a')$, where
$(a)$: $\forall q \in \mathbf{p}_1$, $\exists \{p_k\}_k \subseteq co(\mathbf{p}_0)$ such that $d_{TV}(p_k, q) \to 0$;
$(a')$: $\forall q \in \mathbf{p}_1$, $\exists \{p_k\}_k \subseteq co(\mathbf{p}_0)$ and $\{\varepsilon_k\}_k \downarrow 0$ such that $d_{TV}(p_k, q) < \varepsilon_k$ $\forall k$.

Part 1, proof, $(a) \Rightarrow (a')$: Fix $q$. For $\varepsilon_k = d_{TV}(p_k, q) \to 0$, there exists a monotone sub-sequence $\varepsilon_{k_j} = d_{TV}(p_{k_j}, q) \downarrow 0$. Create new sequences $\tilde{p}_j = p_{k_j}$ and $\tilde{\varepsilon}_j = \varepsilon_{k_j} + 1/j$, so that $\{\tilde{\varepsilon}_j\}_j \downarrow 0$ and $d_{TV}(\tilde{p}_j, q) < \tilde{\varepsilon}_j$.

Part 1, proof, $(a) \Leftarrow (a')$: straightforward.

Part 2: $(a') \Leftrightarrow (b')$, where
$(a')$: $\forall q \in \mathbf{p}_1$, $\exists \{p_k\}_k \subseteq co(\mathbf{p}_0)$ and $\{\varepsilon_k\}_k \downarrow 0$ such that $d_{TV}(p_k, q) < \varepsilon_k$ $\forall k$;
$(b')$: $\forall q \in \mathbf{p}_1$, $\exists \{\varepsilon_k\}_k \downarrow 0$ such that $\forall \phi$, $\int \phi q \, d\mu < \varepsilon_k + \sup_{p \in \mathbf{p}_0} \int \phi p \, d\mu$ $\forall k$.

Part 2, proof, $(a') \Rightarrow (b')$: Fix $q$; $(a')$ implies there exist sequences $\{p_k\}_k \subseteq co(\mathbf{p}_0)$ and $\{\varepsilon_k\}_k \downarrow 0$ such that $d_{TV}(p_k, q) < \varepsilon_k$ $\forall k$. Fix $k$. Use Theorem 1 with $\mathbf{p}_1 = \{q\}$. $(a')$ implies $\forall \phi$, $\int \phi q \, d\mu < \varepsilon_k + \sup_{p \in \mathbf{p}_0} \int \phi p \, d\mu$. This is true for every $k$ of a sequence $\varepsilon_k$ that converges to zero, given an arbitrary $q$.

Part 2, proof, $(a') \Leftarrow (b')$: Fix $q$, get $\varepsilon_k$. Fix $k$. Use Theorem 1 with $\mathbf{p}_1 = \{q\}$. $(b')$ implies there exists $p_k \in co(\mathbf{p}_0)$ such that $d_{TV}(p_k, q) < \varepsilon_k$. Repeat this for every $k$ to get a sequence $\{p_k\}_k \subseteq co(\mathbf{p}_0)$ such that $d_{TV}(p_k, q) < \varepsilon_k$ $\forall k$.

Part 3: $(b') \Leftrightarrow (b)$, where
$(b')$: $\forall q \in \mathbf{p}_1$, $\exists \{\varepsilon_k\}_k \downarrow 0$ such that $\forall \phi$, $\int \phi q \, d\mu < \varepsilon_k + \sup_{p \in \mathbf{p}_0} \int \phi p \, d\mu$ $\forall k$;
$(b)$: $\forall \phi$ and $q \in \mathbf{p}_1$, $\int \phi q \, d\mu \le \sup_{p \in \mathbf{p}_0} \int \phi p \, d\mu$.

Part 3, proof, $(b') \Rightarrow (b)$: Fix $q$, get $\varepsilon_k$. Fix $\phi$. It is true that $\int \phi q \, d\mu < \varepsilon_k + \sup_{p \in \mathbf{p}_0} \int \phi p \, d\mu$. Taking limits on both sides, $\int \phi q \, d\mu \le \sup_{p \in \mathbf{p}_0} \int \phi p \, d\mu$. This is true for every $q$ and every $\phi$.

Part 3, proof, $(b') \Leftarrow (b)$: Straightforward because, for arbitrary $\phi$, $q$, and $\{\varepsilon_k\}_k \downarrow 0$, $\int \phi q \, d\mu \le \sup_{p \in \mathbf{p}_0} \int \phi p \, d\mu$ implies that $\int \phi q \, d\mu < \varepsilon_k + \sup_{p \in \mathbf{p}_0} \int \phi p \, d\mu$. $\square$

A.2 Proof of Theorem 2
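As a numerical aside before the proof (the distributions and test function below are illustrative choices, not part of the paper's argument): the first step of the argument uses that convergence in distribution $P_k \xrightarrow{d} Q$ implies $E_{P_k}[\phi] \to E_Q[\phi]$ for any bounded $\phi$ that is a.s. continuous under $Q$. A minimal sketch with $P_k = N(1/k, (1+1/k)^2)$ and $Q = N(0,1)$:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# E_P[phi] approximated by a Riemann sum on a fine grid, for P = N(mu, sigma^2)
grid = np.linspace(-12.0, 12.0, 100_001)
dx = grid[1] - grid[0]

def expect(phi, mu, sigma):
    return float(np.sum(phi(grid) * normal_pdf(grid, mu, sigma)) * dx)

phi = lambda z: 1.0 / (1.0 + np.exp(-z))   # bounded, everywhere-continuous test function
e_q = expect(phi, 0.0, 1.0)                # E_Q[phi] with Q = N(0, 1)
# P_k = N(1/k, (1 + 1/k)^2) converges in distribution to Q as k grows,
# so E_{P_k}[phi] approaches E_Q[phi]
gaps = [abs(expect(phi, 1.0 / k, 1.0 + 1.0 / k) - e_q) for k in (1, 10, 100)]
```

The gap $|E_{P_k}[\phi] - E_Q[\phi]|$ shrinks monotonically along $k = 1, 10, 100$ in this example.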
The proof of Theorem 2 follows the same lines as the proof of Theorem 1 by Romano (2004), except for the fact that our Assumption 1 is stated in terms of the LP metric and in terms of the convex hull of $\mathcal{P}_0$.

Pick an arbitrary $Q \in \mathcal{P}_1$. There exists a sequence of distributions $\{P_k\}_{k=1}^{\infty} \subseteq co(\mathcal{P}_0)$ such that $P_k \xrightarrow{d} Q$. Convergence in distribution is equivalent to $E_{P_k}[g] \to E_Q[g]$ for every bounded real-valued function $g$ whose set of discontinuity points has probability zero under $Q$ (Theorem 25.8, Billingsley (2008)). In particular, this is true for $g = \phi$ for an arbitrary $\phi$ that is a.s. continuous under $Q$.

Take an arbitrary sequence $\varepsilon_n \to 0$, and pick a sub-sequence $\{P_{k_n}\}_n$ from the sequence $\{P_k\}_k$ such that
$$-\varepsilon_n \le E_Q \phi - E_{P_{k_n}} \phi \le \varepsilon_n. \quad (A.2)$$
Therefore,
$$E_Q \phi \le E_{P_{k_n}} \phi + \varepsilon_n \le \sup_{P \in co(\mathcal{P}_0)} E_P \phi + \varepsilon_n. \quad (A.3)$$
Given $\varepsilon_n \to 0$, it follows that, $\forall Q \in \mathcal{P}_1$,
$$E_Q \phi \le \sup_{P \in co(\mathcal{P}_0)} E_P \phi. \quad (A.4)$$
Consequently,
$$\sup_{Q \in \mathcal{P}_1} E_Q \phi \le \sup_{P \in co(\mathcal{P}_0)} E_P \phi. \quad (A.5)$$

It is clear that $\sup_{P \in co(\mathcal{P}_0)} E_P \phi \ge \sup_{P \in \mathcal{P}_0} E_P \phi$. It remains to show that these are equal. Assume $\sup_{P \in co(\mathcal{P}_0)} E_P \phi > \sup_{P \in \mathcal{P}_0} E_P \phi$. Select $\varepsilon > 0$ such that $\sup_{P \in co(\mathcal{P}_0)} E_P \phi - \varepsilon > \sup_{P \in \mathcal{P}_0} E_P \phi$. There exists $P_\varepsilon \in co(\mathcal{P}_0)$ such that
$$\sup_{P \in co(\mathcal{P}_0)} E_P \phi \ge E_{P_\varepsilon} \phi > \sup_{P \in co(\mathcal{P}_0)} E_P \phi - \varepsilon > \sup_{P \in \mathcal{P}_0} E_P \phi. \quad (A.6)$$
By definition, $P_\varepsilon = \sum_{i=1}^{N} \alpha_i P_i$ for $N \in \mathbb{N}$, $P_i \in \mathcal{P}_0$ $\forall i$, $\alpha_i \in [0, 1]$ $\forall i$, and $\sum_{i=1}^{N} \alpha_i = 1$. Then, $E_{P_\varepsilon} \phi = \sum_{i=1}^{N} \alpha_i E_{P_i} \phi \le \sup_{P \in \mathcal{P}_0} E_P \phi$, a contradiction. Therefore, $\sup_{P \in co(\mathcal{P}_0)} E_P \phi = \sup_{P \in \mathcal{P}_0} E_P \phi$, and
$$\sup_{Q \in \mathcal{P}_1} E_Q \phi \le \sup_{P \in \mathcal{P}_0} E_P \phi. \quad (A.7)$$
$\square$

A.3 Proof of Theorem 3
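The argument below reuses the convex-hull equality from the proof of Theorem 2: since $E_P \phi$ is linear in $P$, mixing distributions cannot raise the supremum. A quick numerical sanity check (the three expectation values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# If P = sum_i a_i P_i is a finite mixture, then E_P[phi] = sum_i a_i E_{P_i}[phi].
# Hence expectations over the convex hull stay inside [min_i, max_i] of the base set.
base_vals = np.array([0.2, 0.5, 0.9])             # illustrative E_{P_i}[phi], i = 1, 2, 3
weights = rng.dirichlet(np.ones(3), size=10_000)  # random convex combinations
hull_vals = weights @ base_vals                   # E_P[phi] over random mixtures
```

Every mixture expectation lies weakly below the largest base expectation, which is the step that closes both proofs.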
The proof is a combination of proofs by Dufour (1997) and Gleser and Hwang (1987).

Part (2.6): Fix $m \in \mu(\mathcal{P})$. Define $\phi_m = I\{m \notin C(Z)\}$, and note that $\sup_{P \in \mathcal{P}(m)} E_P \phi_m = \sup_{P \in co(\mathcal{P}(m))} E_P \phi_m$ (see proof of Theorem 2). It follows that $1 - \alpha \le \inf_{P \in \mathcal{P}(m)} P[m \in C(Z)] = \inf_{P \in co(\mathcal{P}(m))} P[m \in C(Z)]$. Therefore, $\forall P \in co(\mathcal{P}(m))$, $P[\mu(P) \in C(Z)] \ge 1 - \alpha$.

By Assumption 2, there exists $\{P_k\}$ in $co(\mathcal{P}(m))$ such that $P_k \xrightarrow{d} P^*$. Then,
$$1 - \alpha \le P_k[\mu(P_k) \in C(Z)] = P_k[m \in C(Z)] \to P^*[m \in C(Z)] \quad (A.8)$$
where the convergence follows by Portmanteau's theorem because $P^*(\partial\{m \in C(Z)\}) = 0$ (Theorem 29.1 of Billingsley (2008)). This proves (2.6).

Part (2.7): Pick a sequence $m_n \in \mu(\mathcal{P})$ such that $m_n$ is unbounded. Without loss of generality, assume $m_n \uparrow \infty$. We have that
$$1 - \alpha \le P^*[m_n \in C(Z)] \le P^*[m_n \le U[C(Z)]]. \quad (A.9)$$
Taking the limit as $n \to \infty$,
$$1 - \alpha \le P^*[U[C(Z)] = \infty] \quad (A.10)$$
$$\le P^*[U[C(Z)] - L[C(Z)] = \infty] = P^*[D[C(Z)] = \infty]. \quad (A.11)$$

Part (2.8): Assumption 2 gives a sequence $\{P_k\}_k$ in $co(\mathcal{P})$ that converges in distribution to $P^*$. By assumption, $P^*[\partial\{D[C(Z)] = \infty\}] = 0$, so Portmanteau's theorem gives $P_k[D[C(Z)] = \infty] \to P^*[D[C(Z)] = \infty] \ge 1 - \alpha$. There exists a sequence $\delta_k \downarrow 0$ such that $P_k[D[C(Z)] = \infty] \ge 1 - \alpha - \delta_k$.

Fix $\varepsilon > 0$. The set $B_\varepsilon(P^*) \cap co(\mathcal{P})$ contains infinitely many $P_k$'s from the sequence above. For these $P_k$'s,
$$1 - \alpha - \delta_k \le P_k[D[C(Z)] = \infty] \quad (A.12)$$
$$\le \sup_{P \in B_\varepsilon(P^*) \cap co(\mathcal{P})} P[D[C(Z)] = \infty] \quad (A.13)$$
$$= \sup_{P \in B_\varepsilon(P^*) \cap \mathcal{P}} P[D[C(Z)] = \infty] \quad (A.14)$$
where the last equality follows by the same argument seen in the proof of (2.6) above. Taking the limit as $k \to \infty$ gives (2.8). $\square$

A.4 Proof of Lemma 1
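Lemma 1 below formalizes the duality between test inversion and confidence-set coverage. As a sanity check in a toy Gaussian model (the z-test, 5% level, and DGP here are illustrative, not from the paper), inverting a level-0.05 test of $H_0: \text{mean} = m$ produces a set that covers the true mean about 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(1)
# Invert a two-sided z-test of H0: mean = m at level alpha = 0.05:
# C(Z) = {m : |Zbar - m| <= 1.96 / sqrt(n)}, so coverage should be 1 - alpha.
n, reps, mu = 100, 20_000, 0.3
zbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)   # sample means across replications
coverage = float(np.mean(np.abs(zbar - mu) <= 1.96 / np.sqrt(n)))
```

Here every test has the same level, so the infimum coverage in (A.15) is simply $1 - \alpha$.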
Lemma 1. Let $C(Z)$ be constructed as in Equation (2.9). Then,
$$\inf_{P \in \mathcal{P}} P[\mu(P) \in C(Z)] = 1 - \sup_{m \in \mu(\mathcal{P})} \alpha(m). \quad (A.15)$$

Proof of Lemma 1. Suppose
$$\sup_{m \in \mu(\mathcal{P})} \sup_{P \in \mathcal{P}_{0,m}} P(\phi_m(Z) = 1) = \alpha. \quad (A.16)$$
Now, pick $\varepsilon > 0$. Then, there exists $m_\varepsilon$ such that
$$\alpha - \varepsilon/2 \le \sup_{P \in \mathcal{P}_{0,m_\varepsilon}} P(\phi_{m_\varepsilon}(Z) = 1) \le \alpha. \quad (A.17)$$
There also exists $P_\varepsilon \in \mathcal{P}_{0,m_\varepsilon}$ such that
$$\alpha - \varepsilon \le P_\varepsilon(\phi_{m_\varepsilon}(Z) = 1) \le \alpha. \quad (A.18)$$
Rearranging the expression above, we obtain
$$1 - \alpha + \varepsilon \ge P_\varepsilon(\mu(P_\varepsilon) \in C(Z)) \ge 1 - \alpha. \quad (A.19)$$
Therefore, we find that
$$\inf_{P \in \mathcal{P}} P[\mu(P) \in C(Z)] = 1 - \alpha, \quad (A.20)$$
as we wanted to prove. $\square$

A.5 Proof of Corollary 3
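The proof below approximates a regression function with jump $m'$ by functions with jump exactly $m$, using logistic transitions as in (A.21). The two key properties of the construction — each $g_k$ keeps a jump of exactly $m$ at the cutoff, yet $g_k \to g$ pointwise away from it — can be checked numerically (the function $g$, cutoff $c$, and values $m$, $m'$ below are illustrative):

```python
import numpy as np

def lam(z):
    # logistic CDF Lambda(z), clipped to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-np.clip(z, -700.0, 700.0)))

c, m, m_prime = 0.0, 0.0, 1.0      # null jump m = 0; g has jump m' = 1 at the cutoff c

def g(x):
    # illustrative regression function with a jump of size m' at c
    return np.sin(x) + m_prime * (x >= c)

def g_k(x, k):
    # smoothed version: jump of exactly m at c for every k, converging to g pointwise
    return g(x) + (m_prime - m) * (lam(k * (x - c)) - (x >= c))

h = 1e-9                            # small step to measure the jump at the cutoff
jumps = [g_k(c + h, k) - g_k(c - h, k) for k in (1, 10, 100)]
pointwise_gap = abs(g_k(0.5, 1000.0) - g(0.5))   # g_k ~ g away from the cutoff
```

Each `jumps` entry is numerically $m = 0$, while `pointwise_gap` is essentially zero for large $k$, mirroring the argument below.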
Fix $m \in \mathbb{R}$. Pick an arbitrary $Q \in \mathcal{P}_{1,m}$, and let $m' = \mu(Q) \ne m$. Define $g(x) = E_Q[Y_i \mid X_i = x]$. Construct a sequence of functions $g_k : \mathbb{R} \to \mathbb{R}$, $k = 1, 2, \ldots$ as follows:
$$g_k(x) = g(x) + (m' - m)\left[\Lambda\big(k(x - c)\big) - I\{x \ge c\}\right] \quad (A.21)$$
where $\Lambda(\cdot)$ is the cumulative distribution function (CDF) of the logistic distribution.

The function $g_k$ is infinitely continuously differentiable on $\mathcal{X} \setminus \{c\}$, so $g_k \in \mathcal{G}$ $\forall k$, and $\lim_{x \downarrow c} g_k(x) - \lim_{x \uparrow c} g_k(x) = m$. Moreover, as $k \to \infty$, $g_k(x) \to g(x)$ for every $x \ne c$. Define $P_k$ to be the distribution of $(X_i, Y_i - g(X_i) + g_k(X_i))$ when $(X_i, Y_i) \sim Q$. It follows that $\mu(P_k) = m$ and $P_k \in \mathcal{P}_{0,m}$ $\forall k$.

It remains to show that $P_k \xrightarrow{d} Q$, or equivalently, to show that
$$(X_i, Y_i - g(X_i) + g_k(X_i)) \xrightarrow{d} (X_i, Y_i) \quad (A.22)$$
as $k \to \infty$ where $(X_i, Y_i) \sim Q$. Note that $(X_i, Y_i - g(X_i) + g_k(X_i)) = (X_i, Y_i) + (0, g_k(X_i) - g(X_i))$, so it suffices to show that $g_k(X_i) - g(X_i) \xrightarrow{p} 0$ as $k \to \infty$.

Define $A_k = \{c - k^{-1/2} < X_i < c + k^{-1/2}\}$, and let $A_k^c$ be the complement of $A_k$. Fix $\varepsilon > 0$:
$$Q[|g_k(X_i) - g(X_i)| > \varepsilon] \quad (A.23)$$
$$= Q[|g_k(X_i) - g(X_i)| > \varepsilon \mid A_k] \, Q[A_k] \quad (A.24)$$
$$+ \; Q[|g_k(X_i) - g(X_i)| > \varepsilon \mid A_k^c] \, Q[A_k^c]. \quad (A.25)$$
Part (A.24) vanishes as $k \to \infty$ by the continuity property of probability measures, because $A_k \downarrow \{c\}$ and $Q[\{c\}] = 0$ by assumption.

For part (A.25), note that $|g_k(x) - g(x)| \le |m' - m| \Lambda(-k^{1/2})$ for any $x \in A_k^c$ because $\Lambda(k(x - c))$ is strictly increasing in $x$ and symmetric around $x = c$, so $|g_k(x) - g(x)|$ attains its maximum at $x = c - k^{-1/2}$ and $x = c + k^{-1/2}$. Therefore,
$$(A.25) \le I\{|m' - m| \Lambda(-k^{1/2}) > \varepsilon\} \, Q[A_k^c] \to 0 \quad (A.26)$$
because $\Lambda(-k^{1/2}) \to 0$ as $k \to \infty$.

Therefore, Assumption 1 is satisfied for every $m \in \mathbb{R}$. Theorem 2 applies, and Corollary 2 applies with $\mu(\mathcal{P}) = \mathbb{R}$. $\square$

A.6 Example of RDD Model with Manipulation
In this section, we provide an example of a DGP with manipulation that gives rise to a model similar to Equation (5.2) in our Monte Carlo experiment. The potential outcome of losing an election is normalized to zero ($Y_i(0) = 0$), and the potential outcome of winning an election is $Y_i(1)$. Suppose the expected potential gain of winning an election is small for tight elections but large otherwise; that is, let $E[Y_i(1) - Y_i(0) \mid X_i = x] = E[Y_i(1) \mid X_i = x] = x^2$, where $X_i$ is the margin of victory of a given political party in district $i$. Assume the distribution of $X_i$ is Uniform$[-1, 1]$, and that districts make i.i.d. draws of $(Y_i(1), X_i, \varepsilon_i)$ from a given distribution, where $\varepsilon_i$ denotes district $i$'s potential to influence the election outcome in a world where manipulation is possible. In a world without manipulation, the researcher observes $(Y_i, D_i, X_i)$, where $D_i = I\{X_i \ge 0\}$ is the victory indicator, and $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0) = D_i Y_i(1)$ is the outcome. It follows that $E[Y_i \mid X_i = x] = x^2 I\{x \ge 0\}$. There is no discontinuity at the cutoff, the causal effect is zero, and the DGP is under the null hypothesis of zero effect at the cutoff (Figure 4(a)).

Figure 4: Example of RDD with Manipulation. (a) Conditional Mean without Manipulation; (b) Conditional Mean with Manipulation.
Notes: In figure (a) there is no manipulation, and the researcher observes a sample of $(Y_i, X_i)$. The solid line denotes the conditional mean of the observed outcome given the margin of victory, $E[Y_i \mid X_i]$. The conditional mean of the potential outcome in case of victory, $E[Y_i(1) \mid X_i]$, is the dotted line, and the potential outcome in case of loss is normalized to zero, $Y_i(0) = 0$. In figure (b) there is manipulation, and the researcher observes a sample of $(\tilde{Y}_i, \tilde{X}_i)$. The solid line is $E[\tilde{Y}_i \mid \tilde{X}_i]$ while the dotted line depicts $E[Y_i \mid X_i]$. Manipulation increases the slope of the conditional mean function at the cutoff.
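The no-manipulation DGP can be simulated directly. A minimal sketch (it takes $E[Y_i(1) \mid X_i = x] = x^2$, consistent with the $1/3$ term in the conditional-mean derivation later in this appendix; the noise scale, sample size, and window width are illustrative choices, not from the paper) confirms that the observed conditional mean has no jump at the cutoff:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
x = rng.uniform(-1.0, 1.0, n)             # margin of victory, Uniform[-1, 1]
y1 = x ** 2 + rng.normal(0.0, 0.1, n)     # Y_i(1) with E[Y_i(1) | X_i = x] = x^2 (illustrative noise)
d = (x >= 0.0).astype(float)              # victory indicator D_i = I{X_i >= 0}
y = d * y1                                # observed outcome; Y_i(0) normalized to zero

h = 0.02                                  # small illustrative window around the cutoff
left_mean = y[(-h < x) & (x < 0.0)].mean()
right_mean = y[(0.0 <= x) & (x < h)].mean()
```

The local means just left and right of the cutoff agree up to simulation noise: the causal effect at the cutoff is zero.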
In a world with manipulation, the given party in district $i$ decides to influence the election if the expected potential gain of doing so is positive. In other words, if the margin of victory without manipulation $X_i$ leads to the loss of the election, and the expected potential gain of winning the election $E[Y_i(1) - Y_i(0) \mid X_i]$ is strictly positive, then the party decides to manipulate. Thus, manipulation occurs if $X_i < 0$, and the margin of victory changes from $X_i$ to $\varepsilon_i > 0$. Although the party manipulates as little as possible to win the election, it does not have perfect control over its vote share. Assume $\varepsilon_i = 5\chi^2_3$, that is, five times a chi-square distribution with three degrees of freedom. The pdf $f_\varepsilon$ evaluated at zero equals zero, but it is highly sloped to the right of zero. Let $\tilde{X}_i$ be the manipulated margin of victory defined as $\tilde{X}_i = I\{X_i < 0\} \varepsilon_i + I\{X_i \ge 0\} X_i$. The researcher observes $(\tilde{Y}_i, \tilde{D}_i, \tilde{X}_i)$, where $\tilde{Y}_i = \tilde{D}_i Y_i(1) + (1 - \tilde{D}_i) Y_i(0) = \tilde{D}_i Y_i(1)$, and $\tilde{D}_i = I\{\tilde{X}_i \ge 0\}$ is the victory indicator. The conditional mean function under manipulation is given by
$$E[\tilde{Y}_i \mid \tilde{X}_i = x] = I\{x \ge 0\} \, E[Y_i(1) \mid \tilde{X}_i = x]$$
$$= I\{x \ge 0\} \, E\big[Y_i(1) \,\big|\, \{\varepsilon_i = x, X_i < 0\} \text{ or } \{X_i = x, X_i \ge 0\}\big]$$
$$= I\{x \ge 0\} \big\{\theta(x) E[Y_i(1) \mid \varepsilon_i = x, X_i < 0] + (1 - \theta(x)) E[Y_i(1) \mid X_i = x, X_i \ge 0]\big\}$$
$$= I\{x \ge 0\} \big\{\theta(x) E[Y_i(1) \mid X_i < 0] + (1 - \theta(x)) E[Y_i(1) \mid X_i = x]\big\}$$
$$= I\{x \ge 0\} \big\{\theta(x)(1/3) + (1 - \theta(x)) x^2\big\},$$
where the weight
$$\theta(x) = \frac{f_\varepsilon(x) P(X_i < 0)}{f_\varepsilon(x) P(X_i < 0) + f_X(x) I\{x \ge 0\}} = \frac{0.5 f_\varepsilon(x)}{0.5 f_\varepsilon(x) + 0.5 \, I\{x \ge 0\}}$$
is such that $\theta(x) \in [0, 1]$, $\theta(0) = 0$, $\theta(x)$ is continuous in $x$, and it is positively and highly sloped near $x = 0$. The conditional mean function $E[\tilde{Y}_i \mid \tilde{X}_i = x]$ increases sharply at the cutoff (Figure 4(b)) because districts with low $X_i$ and high causal effects manipulate their $X_i$ to $\tilde{X}_i = \varepsilon_i$ to the right of the cutoff. There is no discontinuity at the cutoff, and the DGP is still under the null hypothesis of zero effect. However, manipulation makes it harder to distinguish a zero effect from a positive effect at the cutoff.

A.7 Proof of Corollary 5
Fix $Q \in \mathcal{P}$ with CDF $F_Q(x)$. The CDF $F_Q(x)$ has a jump discontinuity of size $\delta > 0$ at $x = x_0$. Call $f_Q$ the derivative of $F_Q$ at $x \ne x_0$, which is a continuous function of $x$ for every $x \ne x_0$. The integral of $f_Q$ over $\mathbb{R}$ equals $1 - \delta$. The side limits of $f_Q$ at $x_0$, $f_Q(x_0^+)$ and $f_Q(x_0^-)$, may be different from each other.

Pick a sequence $\varepsilon_k \downarrow 0$. Construct a continuous "hat-shaped" function $g_k(x) : [x_0 - \varepsilon_k, x_0 + \varepsilon_k] \to \mathbb{R}$ such that: (i) $g_k(x_0 - \varepsilon_k) = f_Q(x_0 - \varepsilon_k)$; (ii) $g_k(x_0 + \varepsilon_k) = f_Q(x_0 + \varepsilon_k)$; (iii) $g_k(x)$ has constant and positive slope for $x \le x_0$, and constant and negative slope for $x \ge x_0$; (iv) $g_k(x) \ge f_Q(x)$; and (v) $\int (g_k(x) - f_Q(x)) \, dx = \delta$. It is always possible to construct such a function for a small enough $\varepsilon_k$.

Define $f_{P_k}(x) = f_Q(x) + I\{x_0 - \varepsilon_k \le x \le x_0 + \varepsilon_k\}(g_k(x) - f_Q(x))$. This is a continuous PDF, and let it define the distribution $P_k$. Then the CDF $F_{P_k}$ converges to $F_Q$ as $k \to \infty$ at every continuity point of $F_Q$, so that $P_k \xrightarrow{d} Q$. $\square$

A.8 Proof of Corollary 6
Fix $m \in \mathbb{R}$. Choose an arbitrary $Q \in \mathcal{P}_{1,m}$, and let $m' = \mu(Q) \ne m$. Define $g(x, w) = E_Q[Y_i \mid X_i = x, W_i = w]$, and $\tau_Q(w) = g(0, w) - \lim_{x \downarrow 0} g(x, w)$. Construct a sequence of functions $g_k : \mathcal{X} \times \mathcal{W} \to \mathbb{R}$, $k = 1, 2, \ldots$ as follows:
$$g_k(x, w) = g(x, w) + (\tau_Q(w) - m)\left[I\{x > 0\} - \Lambda(kx)\right] \quad (A.27)$$
where $\Lambda(\cdot)$ is the CDF of the logistic distribution.

The function $g_k$ is infinitely many times continuously differentiable with respect to $x$ on $\{\mathcal{X} \setminus \{0\}\} \times \mathcal{W}$, so $g_k \in \mathcal{G}$ $\forall k$. Also, $g_k(0, w) - \lim_{x \downarrow 0} g_k(x, w) = m$. Moreover, as $k \to \infty$, $g_k(x, w) \to g(x, w)$ pointwise. Define $P_k$ to be the distribution of $(X_i, W_i, Y_i - g(X_i, W_i) + g_k(X_i, W_i))$ when $(X_i, W_i, Y_i) \sim Q$. It follows that $\mu(P_k) = m$ and $P_k \in \mathcal{P}_{0,m}$ $\forall k$.

It remains to show that $P_k \xrightarrow{d} Q$, or equivalently, to show that
$$(X_i, W_i, Y_i - g(X_i, W_i) + g_k(X_i, W_i)) \xrightarrow{d} (X_i, W_i, Y_i) \quad (A.28)$$
as $k \to \infty$ where $(X_i, W_i, Y_i) \sim Q$. Note that $(X_i, W_i, Y_i - g(X_i, W_i) + g_k(X_i, W_i)) = (X_i, W_i, Y_i) + (0, 0, g_k(X_i, W_i) - g(X_i, W_i))$, so it suffices to show that $g_k(X_i, W_i) - g(X_i, W_i) \xrightarrow{p} 0$ as $k \to \infty$.

Define $A_k = \{0 < X_i < k^{-1/2}\}$, and let $A_k^c$ be the complement of $A_k$. Fix $\varepsilon > 0$:
$$Q[|g_k(X_i, W_i) - g(X_i, W_i)| > \varepsilon] \quad (A.29)$$
$$= Q[|g_k(X_i, W_i) - g(X_i, W_i)| > \varepsilon \mid A_k] \, Q[A_k] \quad (A.30)$$
$$+ \; Q[|g_k(X_i, W_i) - g(X_i, W_i)| > \varepsilon \mid A_k^c] \, Q[A_k^c]. \quad (A.31)$$
Part (A.30) vanishes as $k \to \infty$ by the continuity property of probability measures because $A_k \downarrow \emptyset$, where $\emptyset$ denotes the empty set and has zero probability.

For part (A.31), $|g_k(x, w) - g(x, w)| \le |\tau_Q(w) - m| \, |1 - \Lambda(k^{1/2})|$ for any $w$ and any $x \in A_k^c$ because $1 - \Lambda(kx)$ is strictly decreasing in $x$. For fixed $w$, $|g_k(x, w) - g(x, w)|$ attains its maximum at $x = k^{-1/2}$. Therefore,
$$(A.31) \le Q\{|\tau_Q(W_i) - m| \, |1 - \Lambda(k^{1/2})| > \varepsilon\} \, Q[A_k^c] \to 0 \quad (A.32)$$
because $\Lambda(k^{1/2}) \to 1$ as $k \to \infty$ and $|\tau_Q(W_i) - m|$ is bounded. $\square$

A.9 Simulations - RDD
This section contains additional results of the RDD simulation in the main text. The size and power analyses in the main text use the 5% nominal level; this section repeats the same analyses using the 1% and 10% nominal levels. It also reports the simulated critical values under the various choices of null $(\tau, M)$-models.

Table 3: Simulated Rejection Rates and Critical Values

Panel 1:
Rejection Rate under the Null Model $(\tau, M)$ Using Critical Values Simulated under Model $(\tau, 0)$

(a) Nominal Size 1%

τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
0.01   0.0100   0.0097   0.0108   0.0142   0.0108   0.0123
0.02   0.0100   0.0130   0.0147   0.0120   0.0139   0.0132
0.03   0.0100   0.0195   0.0192   0.0212   0.0210   0.0224
0.04   0.0100   0.0231   0.0257   0.0232   0.0243   0.0276
0.05   0.0100   0.0274   0.0323   0.0335   0.0373   0.0345
0.06   0.0100   0.0339   0.0467   0.0523   0.0591   0.0577
0.07   0.0100   0.0402   0.0546   0.0612   0.0702   0.0737
0.08   0.0100   0.0451   0.0674   0.0827   0.0917   0.0968

(b) Nominal Size 10%

τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
0.01   0.1000   0.1096   0.1111   0.1170   0.1159   0.1127
0.02   0.1000   0.1253   0.1309   0.1270   0.1337   0.1254
0.03   0.1000   0.1587   0.1618   0.1651   0.1629   0.1707
0.04   0.1000   0.1733   0.1803   0.1818   0.1827   0.1901
0.05   0.1000   0.1801   0.2049   0.2097   0.2244   0.2177
0.06   0.1000   0.2126   0.2561   0.2635   0.2810   0.2805
0.07   0.1000   0.2172   0.2596   0.2867   0.3022   0.3072
0.08   0.1000   0.2244   0.2957   0.3238   0.3427   0.3594

Panel 2:
Rejection Rate under the Alternative Model $(\tau, \infty)$ Using Critical Values Simulated under Model $(\tau, M)$

(a) Nominal Size 1%

τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
0.01   0.0113   0.0104   0.0099   0.0100   0.0100   0.0101
0.02   0.0159   0.0111   0.0105   0.0102   0.0101   0.0103
0.03   0.0223   0.0111   0.0112   0.0109   0.0111   0.0105
0.04   0.0286   0.0133   0.0114   0.0109   0.0114   0.0108
0.05   0.0390   0.0188   0.0137   0.0125   0.0114   0.0111
0.06   0.0668   0.0211   0.0149   0.0132   0.0124   0.0114
0.07   0.0869   0.0237   0.0150   0.0139   0.0128   0.0132
0.08   0.1165   0.0295   0.0184   0.0164   0.0132   0.0130

(b) Nominal Size 10%

τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
0.01   0.1150   0.1013   0.1004   0.1003   0.1001   0.1001
0.02   0.1411   0.1058   0.1026   0.1025   0.1013   0.1013
0.03   0.1695   0.1130   0.1060   0.1045   0.1031   0.1030
0.04   0.2003   0.1239   0.1129   0.1078   0.1064   0.1038
0.05   0.2389   0.1361   0.1182   0.1128   0.1116   0.1079
0.06   0.3093   0.1543   0.1319   0.1203   0.1154   0.1109
0.07   0.3335   0.1792   0.1438   0.1286   0.1192   0.1161
0.08   0.4073   0.2085   0.1567   0.1342   0.1269   0.1200

Panel 3:
Critical Values Simulated under Null Model $(\tau, M)$

(a) Nominal Size 1%

τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
0.01   3.0222   3.0074   3.0631   3.1741   3.0641   3.0927
0.02   3.0669   3.1787   3.2466   3.1459   3.1743   3.2177
0.03   3.0506   3.3975   3.4213   3.4238   3.3659   3.4641
0.04   3.0810   3.5389   3.5381   3.4846   3.4706   3.5916
0.05   3.0786   3.5443   3.6329   3.6819   3.7531   3.7461
0.06   2.9715   3.5573   3.7476   3.8374   3.9043   3.8987
0.07   2.9829   3.7526   3.9103   3.9917   3.9896   4.0030
0.08   2.9536   3.7593   4.0283   4.1033   4.2614   4.2798

(b) Nominal Size 10%

τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
0.01   1.8700   1.9312   1.9323   1.9577   1.9602   1.9437
0.02   1.8801   2.0155   2.0295   2.0221   2.0558   2.0240
0.03   1.8466   2.1393   2.1621   2.1594   2.1638   2.2066
0.04   1.8822   2.2410   2.2767   2.2879   2.2917   2.3270
0.05   1.9056   2.3180   2.4002   2.4403   2.4903   2.4722
0.06   1.8316   2.3812   2.5359   2.5594   2.6196   2.6234
0.07   1.8772   2.4403   2.5859   2.7054   2.7526   2.7885
0.08   1.8651   2.4662   2.7184   2.8357   2.8925   2.9366

(c) Nominal Size 5%

τ      M = 0    M = 2    M = 4    M = 6    M = 8    M = 10
0.01   2.2543   2.2888   2.3158   2.3511   2.3279   2.3525
0.02   2.2491   2.4041   2.4038   2.3951   2.4190   2.4311
0.03   2.1983   2.5607   2.5687   2.5776   2.5385   2.6150
0.04   2.2350   2.6669   2.6713   2.6819   2.7055   2.7515
0.05   2.2390   2.7010   2.8138   2.8575   2.9340   2.8821
0.06   2.2030   2.7688   2.9399   2.9920   3.0553   3.0453
0.07   2.2713   2.8591   3.0370   3.0936   3.1734   3.2180
0.08   2.2556   2.9023   3.1370   3.2685   3.3442   3.3817

Notes: Model $(\tau, M)$ refers to Equation (5.2) in the main text. The estimates for the Wald test are obtained by the robust bias-corrected method of Calonico, Cattaneo, and Titiunik (2014) and implemented using the STATA package 'rdrobust'.
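Finally, the weight $\theta(x)$ and the manipulated conditional mean derived in Appendix A.6 can be evaluated directly. A minimal sketch (it takes the $\varepsilon_i = 5\chi^2_3$ scaling and the Uniform$[-1,1]$ margin from the text; everything else follows the displayed formulas):

```python
import numpy as np

def f_eps(x):
    # density of eps = 5 * chi-square(3): f_eps(x) = f_chi2_3(x / 5) / 5 for x > 0
    u = np.asarray(x, dtype=float) / 5.0
    chi3 = np.sqrt(np.maximum(u, 0.0)) * np.exp(-u / 2.0) / np.sqrt(2.0 * np.pi)
    return np.where(u > 0.0, chi3 / 5.0, 0.0)

def theta(x):
    # weight on manipulating districts, for x in [0, 1]:
    # theta(x) = 0.5 f_eps(x) / (0.5 f_eps(x) + 0.5)
    num = 0.5 * f_eps(x)
    return num / (num + 0.5)

def cond_mean(x):
    # E[Y-tilde | X-tilde = x] = theta(x) * (1/3) + (1 - theta(x)) * x^2, x in [0, 1]
    return theta(x) / 3.0 + (1.0 - theta(x)) * np.asarray(x, dtype=float) ** 2
```

Since $\theta(0) = 0$, the manipulated conditional mean is continuous at the cutoff (zero effect), while the steep slope of $f_\varepsilon$ just to the right of zero is what makes $E[\tilde{Y}_i \mid \tilde{X}_i = x]$ rise sharply there relative to the no-manipulation mean $x^2$.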