Inference for Linear Conditional Moment Inequalities
Isaiah Andrews, Jonathan Roth, and Ariel Pakes

September 24, 2019
Abstract
We consider inference based on linear conditional moment inequalities, which arise in a wide variety of economic applications, including many structural models. We show that linear conditional structure greatly simplifies confidence set construction, allowing for computationally tractable projection inference in settings with nuisance parameters. Next, we derive least favorable critical values that avoid conservativeness due to projection. Finally, we introduce a conditional inference approach which ensures a strong form of insensitivity to slack moments, as well as a hybrid technique which combines the least favorable and conditional methods. Our conditional and hybrid approaches are new even in settings without nuisance parameters. We find good performance in simulations based on Wollmann (2018), especially for the hybrid approach.

Keywords: Moment Inequalities, Subvector Inference, Uniform Inference

JEL Codes: C12

∗ We thank Gary Chamberlain, Ivan Canay, Kirill Evdokimov, Jerry Hausman, Bulat Gafarov, Hiroaki Kaido, Adam McCloskey, Francesca Molinari, Whitney Newey, Ashesh Rambachan, Jesse Shapiro, Brit Sharoni, Xiaoxia Shi, Joerg Stoye, and participants at several seminars for helpful comments, and thank Thomas Wollmann for helpful discussion of his application. Andrews gratefully acknowledges financial support from the NSF under Grant 1654234. Roth gratefully acknowledges financial support from an NSF Graduate Research Fellowship under Grant DGE1144152. Andrews: [email protected]. Roth: [email protected]. Pakes: [email protected].

Introduction
Moment inequalities are an important tool in empirical economics, enabling researchers to use the most direct implications of utility or profit maximization for inference in both single-agent settings and games. Moment inequalities have also been used to weaken parametric, behavioral, measurement, and selection assumptions in a range of problems.

Inference based on moment inequalities raises a number of challenges. First, calculating tests and confidence sets can be computationally taxing in settings with more than a few nuisance parameters (for instance, coefficients on control variables). Second, a simple approach to inference in settings with nuisance parameters is to use projection, but this can yield imprecise results. Finally, it is often unclear ex-ante which of the many moments implied by an economic model will be informative, and inclusion of uninformative or slack moments yields wide confidence sets for some procedures.

This paper proposes new methods which address these three implementation challenges for an important class of moment inequalities, which we term linear conditional moment inequalities. These are conditional moment inequalities that (a) are linear in nuisance parameters and (b) have conditional variance (given the instruments) that does not depend on the nuisance parameters. Such inequalities arise naturally when the nuisance parameters enter the moments linearly and interact only with exogenous variables. This occurs, for example, in regression and instrumental variables settings with interval-valued outcomes and exogenous controls. Linear conditional structure also appears in a number of structural applications of moment inequalities in the literature.

For recent overviews of research involving moment inequalities, and partial identification more broadly, see Ho & Rosen (2017) and Molinari (2019).
For the behavioral and measurement assumptions underlying the use of moment inequalities in problems where agents are assumed to maximize utility or profit, see Pakes (2010) and Pakes et al. (2015). For examples of inequalities generated by first order conditions, see Dickstein & Morales (2018) on export decisions and Holmes (2011) on Walmart's location decisions. For examples of inequalities generated by Nash equilibrium conditions, see Ciliberto & Tamer (2009), Eizenberg (2014), or Wollmann (2018) on entry and exit decisions. For examples of the use of inequalities to weaken assumptions, see Haile & Tamer (2003) on auctions, Chetty (2012) on labor supply, and Kline & Tartari (2016) on a welfare reform experiment. For moment inequalities used to overcome measurement problems, see Manski & Tamer (2002) on interval-valued outcome variables and Ho & Pakes (2014) on errors in regressors in discrete choice models. For the use of inequalities to overcome selection problems, see Blundell et al. (2007) on changes in inequality and Kreider et al. (2012) on take-up of SNAP. There is also closely related work in other fields, for example on computation of bounds for competing risk models (e.g. Honore & Lleras-Muney (2006) on the war on cancer).

The second challenge discussed above stems from the fact that many existing techniques deliver joint confidence sets for all parameters entering the moment inequalities, which must then be projected to obtain confidence sets for lower-dimensional parameters of interest. For examples of projection in the theoretical and empirical literature, see Canay & Shaikh (2017). As discussed by Bugni et al. (2017) and Kaido et al. (2019a), however, projection can yield very conservative tests and confidence sets.
We show that in settings with linear conditional moment inequalities, it is straightforward to derive computationally tractable least favorable critical values that account for the presence of nuisance parameters, and so construct non-conservative confidence sets for the parameters of interest.

The final challenge discussed above, sensitivity to slack moments, arises from the fact that the distribution of moment inequality test statistics depends on the (unknown) degree to which the moments are slack. As discussed by D. Andrews & Soares (2010), the degree of slackness cannot be uniformly consistently estimated, so the least favorable approach calculates critical values under the worst-case assumption that all moments bind.

Note, however, that our asymptotic results (developed in the appendix) hold the number of parameters and moments fixed. Hence, our analysis does not address settings that are "high-dimensional" in the sense that the number of parameters or moments grows with the sample size. In cases where some nuisance parameters enter the moments nonlinearly, these techniques deliver confidence sets for the parameters of interest together with the nonlinear nuisance parameters.
Related Literature
Uniform inference on subsets of parameters based on linear moment inequalities was previously studied by Cho & Russell (2019) and Gafarov (2019). Flynn (2019) further allows for the possibility of a continuum of linear moments. Unlike our approach, these papers consider unconditional moment inequalities, but do not discuss the case where the parameters of interest may enter the moments nonlinearly. Hsieh et al. (2017) propose a conservative form of projection inference for settings which include linear unconditional moment inequalities. Kaido et al. (2019a) develop techniques for eliminating projection conservativeness, while Bugni et al. (2017) develop an alternative approach for inference on subsets of parameters, and Belloni et al. (2018) build on this approach to develop results for subset inference with high-dimensional unconditional moments. All three techniques are more widely applicable than those we develop, requiring neither linearity nor conditional moment inequalities. At the same time, all can be computationally intensive in settings with a large number of nuisance parameters. Chernozhukov et al. (2015) develop techniques for subset inference based on conditional moment inequalities, which unlike our approach do not require linearity. Romano & Shaikh (2008) discuss subvector inference based on subsampling. Chen et al. (2018) discuss confidence sets for the identified set for subvectors based on a quasi-posterior Monte Carlo approach.

Finally, there is a large literature on techniques which seek to reduce sensitivity to the inclusion of slack moments in settings without nuisance parameters, including D. Andrews & Soares (2010), D. Andrews & Barwick (2012), Romano et al. (2014a), and Cox & Shi (2019). Chernozhukov et al. (2015), Bugni et al. (2017), Belloni et al. (2018), and Kaido et al. (2019a) build on related ideas to reduce sensitivity to slack moments in models with nuisance parameters.
If applied in our setting, however, these techniques would eliminate the linear structure which simplifies computation. Even in settings without nuisance parameters our conditioning approach appears to be new, and a small set of simulations without nuisance parameters (described in Appendix F) finds our hybrid approach neither dominates nor is dominated by the test proposed by Romano et al. (2014a).

Preview of Paper
The next section introduces our linear conditional setting. Section 3 develops a conditional asymptotic approximation that motivates our analysis, and discusses the relationship between our approach and the literature on conditional moment inequalities. Section 4 introduces projection and least favorable tests, while Section 5 introduces conditional and hybrid tests. Section 6 discusses practical implementation details.

Kaido et al. (2019a) propose the use of a response surface technique to facilitate computation, and find that it yields substantial improvements. See Kaido et al. (2019a) and Gafarov (2019) for further evidence on computational performance.

Throughout the paper, we assume that we observe independent and identically distributed data D_i, i = 1, ..., n drawn from a distribution P. We are interested in parameters identified by k-dimensional conditional moment inequalities

E_P[g(D_i, β, δ) | Z_i] ≤ 0 almost surely, (1)

assumed to hold at the true parameter value, for g(D_i, β, δ) a known function of the data and parameters. Going forward we leave the "almost surely" implicit for brevity. We seek tests and confidence sets for β, while the p-dimensional vector δ is a nuisance parameter. Formally, we want to test the null that a given value β₀ belongs to the identified set, H̃₀ : β₀ ∈ B_I(P), where

B_I(P) = {β : there exists δ such that E_P[g(D_i, β, δ) | Z_i] ≤ 0}

is the set of all values β such that there exists δ which makes (1) hold.

We assume that the moment function g(D_i, β, δ) is of the form

g(D_i, β, δ) = g(D_i, β, 0) − X(Z_i, β)δ (2)

for some k × p matrix-valued function X(Z_i, β) of the instruments and the parameter of interest β. This imposes two key restrictions. First, (2) requires that the nuisance parameter δ enter the moments linearly. Since linear models are widely used in economics, this holds in a wide variety of applications.
Second, (2) requires that the derivative of the moments with respect to δ be non-random conditional on the instruments Z_i. Stated differently, we require that the moment inequalities (1) hold conditional on the Jacobian of the moments with respect to δ. This implies that
Var_P(g(D_i, β, δ) | Z_i) = Var_P(g(D_i, β, 0) | Z_i),

so the conditional variance of the moments does not depend on δ. This condition plays a crucial role in the asymptotic approximation developed in Section 3 below.

We call moment inequalities satisfying (1) and (2) linear conditional moment inequalities. They can be understood as a generalization of the linear model with exogenous regressors and outcome Y*_i,

Y*_i = X_i′δ + ε_i where E_P[ε_i | X_i] = 0, (3)

to the moment inequality setting. Specifically, for linear conditional moment inequalities we can define

(Y_i, X_i) = (g(D_i, β₀, 0), X(Z_i, β₀)) (4)

for β₀ again the null value of β. If β₀ ∈ B_I(P), then we can write

Y_i = X_iδ + ε_i where E_P[ε_i | Z_i] ≤ 0. (5)

Thus, the linear conditional moment inequality model resembles a generalization of the traditional linear regression model, where we (a) allow the possibility that there are instruments Z_i beyond the regressors X_i and (b) relax the conditional moment restriction on the errors ε_i to an inequality. We show below that the restriction to linear conditional moment inequalities yields important simplifications in the problem of testing H̃₀ : β₀ ∈ B_I(P). Before developing these results, we motivate our study of linear conditional moment inequalities by showing that moment inequalities of this form arise in a variety of economic examples.
Example 1
Linear conditional moment inequalities arise naturally from the linear regression model (3), and its instrumental variables generalization, when we only observe bounds on the outcome Y*_i. Consider the model

Y*_i = T_iβ + V_i′δ + ε_i, E_P[ε_i | Z_i] = 0,

where V_i is exogenous in the sense that it is a function of Z_i, while T_i may be endogenous. For instance, β may be a causal effect of interest, whereas V_i represents a set of control variables. This is a linear instrumental variables model where the error is mean-independent of the instrument.

As in e.g. Manski & Tamer (2002), suppose that rather than observing Y*_i, we instead observe bounds Y_i^L and Y_i^U where Y_i^L ≤ Y*_i ≤ Y_i^U with probability one. The linear model implies that

E[Y_i^L − T_iβ − V_i′δ | Z_i] ≤ 0 and E[T_iβ + V_i′δ − Y_i^U | Z_i] ≤ 0,

so we obtain conditional moment inequalities. To cast these inequalities into our framework, suppose we are interested in inference on β, and for any vector of non-negative functions of the instruments f(Z_i) let

Y_i(β) = (Y_i^L − T_iβ, T_iβ − Y_i^U)′ ⊗ f(Z_i), and X_i = (V_i′ ⊗ (1, −1)′) ⊗ f(Z_i),

for "⊗" the Kronecker product. This yields the moments E[Y_i(β) − X_iδ | Z_i] ≤ 0, as desired. △

Example 2
Katz (2007) studies the impact of travel time on supermarket choice. Katz assumes that agent utilities are additively separable in utility from the basket of goods bought (B_i), the travel time to the supermarket chosen (T_{i,s}), and the cost of the basket (π(B_i, s)). Normalizing the coefficient on cost to one, agent i's realized utility is assumed to be

U_i(B_i, s) = U_i(B_i) + C_s′δ − (β + ν_i)T_{i,s} − π(B_i, s),

where C_s are observed characteristics of the supermarket, T_{i,s} is the travel time for i going to s, and β + ν_i is its impact on utility, where ν_i has mean zero given supermarket characteristics and travel times.

Katz assumes travel times and store characteristics are known to the shopper. For s̃ a supermarket with T_{i,s̃} > T_{i,s} that also marketed B_i, he divides the difference U_i(B_i, s) − U_i(B_i, s̃) by T_{i,s} − T_{i,s̃} and notes that a combination of expected utility maximization and revealed preference implies that E[Y_i(β) − X_iδ | Z_i] ≤ 0, for

Y_i(β) ≡ −β − [π(B_i, s) − π(B_i, s̃)] / (T_{i,s} − T_{i,s̃}), X_i ≡ −(C_s − C_s̃) / (T_{i,s} − T_{i,s̃}),

and Z_i ≡ (T_{i,s}, T_{i,s̃}, C_s, C_s̃)′.

Our approach to this application relies on the conditional moment restriction E_P[ε_i | Z_i] = 0. As discussed by Ponomareva & Tamer (2011), this means that the identified set may be empty if the linear model is incorrect. For Z_i = (T_i, V_i′)′, Beresteanu & Molinari (2008) assume only that E[ε_iZ_i] = 0, and their approach yields inference on the (necessarily nonempty) set of best linear predictors. Bontemps et al. (2012) study identification and inference, including specification tests, for a class of linear models with unconditional moment restrictions.
By adding an analogous inequality which uses a store closer to the agent, Katz obtains both upper and lower bounds for β.

A similar approach can be used in any ordered choice problem, including those with interacting agents; see Pakes et al. (2015), who also provide a way to handle the boundaries of the choice set (as would occur in Katz's case if there were no closer supermarket for some observations). △

Example 3
Wollmann (2018) considers the bailout of GM and Chrysler's commercial truck divisions during the 2008 financial crisis and asks what would have happened had they instead been allowed to either fail or merge with another firm. This example is the basis for our simulations below.

Merger analysis focuses on price differences pre- and post-merger. Wollmann notes that some commercial truck production is modular (it is possible to connect different cab types to different trailers), so some products would likely have been repositioned after the change in the environment. To analyze product repositioning he requires estimates for the fixed costs of marketing a product. His estimated demand and cost systems enable him to estimate counterfactual profits from adding or deleting products. Assuming firms maximize expected profits, differences in the expected profits from adding or subtracting products imply bounds on fixed costs.

To illustrate, let J_{f,t} be the set of models that firm f marketed in year t and let J_{f,t}/j be that set excluding product j, while ∆π(J_{f,t}, J_{f,t}/j) is the difference in expected profits between marketing J_{f,t} and J_{f,t}/j. Denote the fixed cost to firm f of marketing product j at time t by X_{j,f,t}(β)δ, where the X's are product characteristics and β is a scalar which differentiates between marketing costs for products that were and were not marketed in the prior year. Then if Z_{f,t} represents a set of variables known to the firm when marketing decisions were made (which includes the variables used to form X_{j,f,t}(β)), the equilibrium condition ensures that

E[Y_{j,f,t} − X_{j,f,t}(β)δ | Z_{f,t}] ≥ 0 for all j,

where Y_{j,f,t} ≡ ∆π(J_{f,t}, J_{f,t}/j) · 1{j ∈ J_{f,t}, j ∈ J_{f,t−1}}, X_{j,f,t}(β) ≡ X_{f,j}(β) · 1{j ∈ J_{f,t}, j ∈ J_{f,t−1}}, and 1{A} is an indicator for the event A.
Additional inequalities can be added for marketing a product that was not marketed in the prior period, for withdrawing products, and for combining the withdrawal of one product with adding another. See Section 7 below for details. △

Other recent applications that use linear conditional moment inequalities include Ho & Pakes (2014), who study the effect of physician incentives on hospital referrals, and Morales et al. (2019), who develop and estimate an extended gravity model of trade flows. As the variety of examples illustrates, linear conditional moment inequalities arise in a range of economic contexts.
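To make the structure of these examples concrete, the interval-outcome construction from Example 1 can be sketched numerically. This is an illustration only: the data values, the choice of f(Z_i), and all variable names are hypothetical, and the Kronecker ordering follows the display in Example 1.

```python
import numpy as np

# Hypothetical single observation for Example 1: scalar treatment T,
# two exogenous controls V, and interval bounds [Y_L, Y_U] on Y*.
T, V = 1.5, np.array([1.0, -0.3])
Y_L, Y_U = 0.2, 2.7
beta0 = 0.8                  # null value of beta under test
f_Z = np.array([1.0, 0.5])   # non-negative functions of the instruments

# Y_i(beta) = (Y_L - T*beta, T*beta - Y_U)' kron f(Z_i): a 4-vector of moments
Y_i = np.kron(np.array([Y_L - T * beta0, T * beta0 - Y_U]), f_Z)

# X_i stacks V' (for the lower bound) and -V' (for the upper bound),
# each interacted with f(Z_i), so that the moments are Y_i(beta) - X_i @ delta
X_i = np.kron(np.kron(np.array([[1.0], [-1.0]]), V[None, :]), f_Z[:, None])

delta = np.array([0.1, 0.2])
moments = Y_i - X_i @ delta  # E[moments | Z] <= 0 under the model
print(moments)
```

Each row of X_i pairs with the corresponding entry of Y_i(β), so the first two moments come from the lower bound and the last two from the upper bound.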
In this section we derive a normal asymptotic approximation that motivates the procedures developed in the rest of the paper. For (Y_i, X_i) = (g(D_i, β₀, 0), X(Z_i, β₀)) as in (4), recall that we can write the moments evaluated at β₀ as g(D_i, β₀, δ) = Y_i − X_iδ. We consider procedures that test H̃₀ : β₀ ∈ B_I(P) based on the scaled sample average of the moments evaluated at β₀,

g_n(β₀, δ) = (1/√n) Σ_i g(D_i, β₀, δ) = Y_n − X_nδ,

for Y_n = (1/√n) Σ_i Y_i and X_n = (1/√n) Σ_i X_i. As in Bugni et al. (2017), we will form confidence sets for β by testing a grid of values β₀. Hence, for the moment we fix a null value β₀ and suppress dependence on β₀ in our notation, deferring further discussion of test inversion to Section 6 below.

Similar to Abadie et al. (2014), we consider asymptotic approximations that condition on the instruments {Z_i} = {Z_i}_{i=1}^∞. If we define μ_i = μ(Z_i) = E_P[Y_i | Z_i] as the conditional mean of Y_i given Z_i, and μ_n = (1/√n) Σ_i μ_i as the scaled sample average of μ_i, then under H̃₀ : β₀ ∈ B_I(P) there exists a value δ such that μ_n − X_nδ ≤ 0 (for almost every {Z_i}). Since μ_n and X_n are nonrandom once we condition on {Z_i}, to test β₀ ∈ B_I(P) we will test the implied hypothesis H₀ : μ_n ∈ M for

M = {μ_n : there exists δ such that μ_n − X_nδ ≤ 0}. (6)

Note that β₀ ∈ B_I(P) implies that μ_n ∈ M for almost every {Z_i}, so tests of H₀ : μ_n ∈ M with correct size also control size as tests of H̃₀ : β₀ ∈ B_I(P). Note further that H₀ : μ_n ∈ M holds trivially conditional on {Z_i} if the column span of X_n contains a strictly negative vector. Hence, going forward we assume that X_nδ has at least one non-negative element for all δ.

To derive asymptotic approximations useful for testing H₀, note that Y_n − μ_n has mean zero conditional on {Z_i} by construction. Thus, under mild conditions we can apply the central limit theorem conditional on {Z_i}.
Lemma 1 (Lindeberg-Feller) Suppose that as n → ∞, conditional on {Z_i} we have

(1/n) Σ_i E_P[Y_iY_i′ 1{(1/√n)‖Y_i‖ > ε} | Z_i] → 0 for all ε > 0,

(1/n) Σ_i Var_P(Y_i | Z_i) → Σ = E_P[Var_P(Y_i | Z_i)].

Then Y_n − μ_n →_d N(0, Σ).

The first condition of Lemma 1 requires that the average of Y_i given Z_i not be dominated by a small number of large observations, while the second requires that the average conditional variance converge.

Under these conditions, Lemma 1 suggests the normal approximation

Y_n − X_nδ | {Z_i} ≈_d N(μ_n − X_nδ, Σ), (7)

where we use ≈_d to denote approximate equality in distribution, and we have used that X_n is non-random conditional on {Z_i} to put it on the right hand side in (7). In the next three sections we assume this approximation holds exactly for known Σ and derive finite-sample results. We return to the issue of approximation error in Appendix D. There, we show that we can consistently estimate Σ, and that the finite-sample properties of our procedures in the normal model translate to uniform asymptotic properties over large classes of data generating processes.

Choice of Moments
Our asymptotic approximations focus on a fixed choice of moments g(D_i, β, δ), which we take as given. This is common in practice, including in all of the empirical papers using conditional moment inequalities that we discuss above, and is without loss of generality if the instruments Z_i have finite support. For Z_i continuously distributed, however, a single conditional moment inequality implies an uncountable family of possible moments. Specifically, given a moment function g̃(D_i, β, δ) that satisfies (1), for f(Z_i) non-negative, g(D_i, β, δ) = g̃(D_i, β, δ)f(Z_i) also satisfies (1). To obtain consistent tests (that is, tests that reject all values β ∉ B_I(P) with probability going to one as n → ∞), one may need to consider an infinite number of inequalities in large samples. Motivated by this fact, the literature on conditional moment inequalities, including D. Andrews & Shi (2013), Armstrong (2014b) and Chetverikov (2018), has primarily focused on consistent and efficient inference on (β, δ) jointly, based on checking (at least asymptotically) an infinite number of inequality restrictions. More recently, Chernozhukov et al. (2015) have developed results that can be used for subvector inference with conditional moment inequalities. Whether one can combine the results we develop here with results from the previous literature on conditional moment inequalities to obtain tests that are consistent in settings with continuously distributed Z_i is an interesting topic for future work.

In many empirical applications using conditional moment inequalities, inference is based on asymptotic approximations that do not condition on {Z_i}. This section explores the relationship between such unconditional asymptotic approximations and our conditional approach.
Lemma 2
Suppose that E_P[Y_iY_i′] and E_P[X_iX_i′] are both finite. Then for all δ,

Y_n − X_nδ − E_P[Y_n − X_nδ] →_d N(0, Ω(δ))

for Ω(δ) = Var_P(Y_i − X_iδ).

In particular, for a fixed, finite set of moments we may have μ_n ∈ M with probability approaching one even though β₀ ∉ B_I(P).

Lemma 2 suggests the unconditional normal approximation

Y_n − X_nδ ≈_d N(E_P[Y_n − X_nδ], Ω(δ)), (8)

where H̃₀ : β₀ ∈ B_I(P) implies that E_P[Y_n − X_nδ] ≤ 0 for some δ. Many commonly-used approaches to testing joint hypotheses on (β, δ), including D. Andrews & Soares (2010), D. Andrews & Barwick (2012), and Romano et al. (2014a), can be interpreted as applications of this approximation.

Both (7) and (8) imply that the moments g_n(δ) = Y_n − X_nδ are approximately normal, but the means and variances differ. Considering first the mean vectors, note that by the law of iterated expectations

E_P[μ_n − X_nδ] = E_P[Y_n − X_nδ].

Thus, the mean vectors in (7) and (8) coincide on average, but the mean vector in (7) is random from an unconditional perspective while that in (8) is fixed.

Turning next to the variance matrices, by the law of total variance
Ω(δ) = E_P[Var_P(Y_i − X_iδ | Z_i)] + Var_P(E_P[Y_i − X_iδ | Z_i]) = Σ + Var_P(μ_i − X_iδ).

Hence, we see that
Ω(δ) is always weakly larger than Σ in the usual matrix order, and will typically be strictly larger. Thus, using the conditional approximation (7) we obtain a smaller variance matrix. While the smaller variance matrix in the conditional approximation (7) will often lead to more powerful tests, one can show that this is not universally the case for the procedures we consider. Critically for our results, however, Σ does not depend on δ, whereas Var_P(μ_i − X_iδ) does.

The main text in Romano et al. (2014a) uses bootstrap critical values, but the appendix, Romano et al. (2014b), develops results for the normal model. Conditional variances were previously considered by e.g. Chetverikov (2018) for inference with conditional moment inequalities, and by Kaido et al. (2019b) and Barseghyan et al. (2019) for settings with a discrete instrument. We discuss estimation of Σ in Section 6 below. Though the diagonal terms in Σ are smaller than those in Ω(δ), and this will lead to larger values of the test statistics introduced below, their off-diagonal correlations also differ, which can generate larger critical values.

Least Favorable Tests
Recall that we are interested in testing the hypothesis H₀ : μ_n ∈ M under the linear normal model (7). The unknown parameter δ appears in the null hypothesis, and is a nuisance parameter that needs to be dealt with to allow testing. A common approach to handling nuisance parameters in moment inequality settings is the projection method (see Canay & Shaikh 2017 for examples). We begin by describing the projection method in our setting. We then explain why linear conditional structure allows us to eliminate the computational problems which can arise for the projection method. Finally, to avoid the conservativeness of the projection method, we derive (non-conservative) least favorable critical values.

The projection method tests the family of hypotheses

H₀(δ) : μ_n − X_nδ ≤ 0, δ ∈ R^p, (9)

and rejects H₀ : μ_n ∈ M if and only if we reject H₀(δ) for all δ. Provided our tests of H₀(δ) control size, the projection method test does as well, since one of the hypotheses tested corresponds to the true δ.

Note that under H₀(δ), Y_n − X_nδ is normally distributed with a weakly negative mean. Thus, testing H₀(δ) reduces to testing that the mean of a multivariate normal vector is less than or equal to zero. A number of tests have been proposed for this hypothesis, but here we focus on tests that reject for large values of the max statistic

S(Y_n − X_nδ, Σ) = max_j {(Y_{n,j} − X_{n,j}δ)/√Σ_{jj}},

where Y_{n,j} − X_{n,j}δ denotes the jth element of the vector Y_n − X_nδ and Σ_{jj} is the jth diagonal element of Σ, which we assume throughout is strictly positive for all j. This choice of test statistic will allow us to compute projection tests of the composite hypothesis H₀ : μ_n ∈ M via linear programming. That said, many of the results of this section (though not those in the following section) extend directly to other statistics S(·, ·) that are elementwise increasing in the first argument.
Desirable properties for tests based on this statistic are discussed by Armstrong (2014a).
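In code, the max statistic is a one-line computation. The numbers below are hypothetical and only illustrate the studentization by √Σ_{jj}:

```python
import numpy as np

def max_stat(m, Sigma):
    """S(m, Sigma): the largest studentized element of m = Y_n - X_n @ delta."""
    return np.max(m / np.sqrt(np.diag(Sigma)))

# Hypothetical values: the first moment has variance 4, the second variance 1
Sigma = np.array([[4.0, 1.0], [1.0, 1.0]])
m = np.array([2.0, -0.5])
print(max_stat(m, Sigma))  # max(2/2, -0.5/1) = 1.0
```

Note that only the diagonal of Σ enters the statistic itself; the off-diagonal terms matter for the critical values below.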
To test H₀(δ) based on S(Y_n − X_nδ, Σ), we need a critical value. As discussed in e.g. Rosen (2008) and D. Andrews & Guggenberger (2009), to ensure correct size we can compare S(Y_n − X_nδ, Σ) to the maximum of its 1 − α quantile over data generating processes consistent with H₀(δ). Formally, let c_α(γ, Σ) be the 1 − α quantile of S(ξ, Σ) for ξ ∼ N(γ, Σ). The least favorable critical value is then

c_{α,LFP}(Σ) = sup_{γ ≤ 0} c_α(γ, Σ) = c_α(0, Σ),

where the fact that the sup is achieved at γ = 0 follows from the fact that S is elementwise increasing in its first argument. We subscript by LFP to emphasize that this is the least favorable critical value for testing H₀(δ), which is in turn part of the projection test for H₀.

If we define the test of H₀(δ) to reject when S(Y_n − X_nδ, Σ) exceeds c_{α,LFP}(Σ),

φ_{LF}(δ) = 1{S(Y_n − X_nδ, Σ) > c_{α,LFP}(Σ)},

where we use φ = 1 and φ = 0 to denote rejection and non-rejection respectively, then it follows from the argument above that φ_{LF}(δ) has size α as a test of H₀(δ):

sup_{μ_n : μ_n − X_nδ ≤ 0} E_{μ_n}[φ_{LF}(δ)] = α.

The least favorable projection test of H₀ rejects if and only if φ_{LF}(δ) rejects for all δ,

φ_{LFP} = inf_δ φ_{LF}(δ) = 1{min_δ̃ S(Y_n − X_nδ̃, Σ) > c_{α,LFP}(Σ)}.

For any μ_n ∈ M we know that there exists δ(μ_n) such that μ_n − X_nδ(μ_n) ≤ 0, so

sup_{μ_n ∈ M} E_{μ_n}[φ_{LFP}] ≤ α.

As we now show, the fact that neither min_δ̃ S(Y_n − X_nδ̃, Σ) nor the critical value c_{α,LFP}(Σ) = c_α(0, Σ) depends on δ makes φ_{LFP} particularly easy to compute.
Lemma 3
We can write φ_{LFP} = 1{η̂ > c_{α,LFP}(Σ)} for η̂ the solution to

min_{η,δ} η subject to (Y_{n,j} − X_{n,j}δ)/√Σ_{jj} ≤ η ∀j. (10)

Thus, to calculate φ_{LFP} we need only solve a linear programming problem and calculate c_{α,LFP}(Σ). Hence, φ_{LFP} remains tractable even when the dimension of δ is large. The linear normal model (7) plays a key role in this result in two ways, first through linearity in δ and second, perhaps less obviously, through the fact that the covariance Σ (and thus the critical value c_{α,LFP}(Σ)) does not depend on δ.

If we instead considered projection tests based on the unconditional normal approximation (8), this corresponds to substituting Ω(δ) for Σ in our expressions for φ_{LF}(δ) and φ_{LFP}, and implies the unconditional projection method test

φ_{ULFP} = 1{min_δ (S(Y_n − X_nδ, Ω(δ)) − c_{α,LFP}(Ω(δ))) > 0}.

The dependence of Ω(δ) on δ means that evaluating this test requires nonlinear optimization. While this problem can be solved numerically when the dimension of δ is low, when the dimension is high this becomes computationally taxing.

Thus, we see that the linear conditional structure we assume allows us to easily calculate the least favorable projection method test φ_{LFP}. As discussed by Bugni et al. (2017) and Kaido et al. (2019a), however, projection method tests are typically conservative,

sup_{μ_n ∈ M} E_{μ_n}[φ_{LFP}] < α,

and can be severely so when the dimension of the nuisance parameter δ is large.

To see why the projection test φ_{LFP} is conservative, recall that its critical value is calculated as the 1 − α quantile of S(ξ, Σ) where ξ ∼ N(0, Σ). By contrast, η̂ is equal to min_δ S(Y_n − X_nδ, Σ). Hence, c_{α,LFP}(Σ) does not account for minimization over δ. In this section we use the structure of the normal linear model (7) to derive least favorable critical values that do account for this minimization over δ.

Other recent applications of linear programming in set-identified settings include Mogstad et al. (2018), Khan et al. (2019), Tebaldi et al. (2019), and Torgovitsky (2019). Kaido et al. (2019a) discuss a response surface approach to speed this optimization in a more general setting.

Specifically, define c_α(μ_n, X_n, Σ) as the 1 − α quantile of

min_{η,δ} η subject to (ξ_j − X_{n,j}δ)/√Σ_{jj} ≤ η ∀j (11)

when ξ ∼ N(μ_n, Σ). The (non-conservative) least favorable critical value is

c_{α,LF}(X_n, Σ) = sup_{μ_n ∈ M} c_α(μ_n, X_n, Σ).

Note that the least favorable projection critical value c_{α,LFP}(Σ) corresponds to setting δ = 0 in (11), rather than minimizing. Hence, by construction c_{α,LF}(X_n, Σ) ≤ c_{α,LFP}(Σ). If we define the least favorable test to reject when the max statistic exceeds c_{α,LF}(X_n, Σ),

φ_{LF} = 1{min_δ S(Y_n − X_nδ, Σ) > c_{α,LF}(X_n, Σ)} = 1{η̂ > c_{α,LF}(X_n, Σ)},

then provided η̂ is continuously distributed this test has size α,

sup_{μ_n ∈ M} E_{μ_n}[φ_{LF}] = α.

If instead the distribution of η̂ has point mass, the size is bounded above by α.

While describing the least favorable critical value c_{α,LF}(X_n, Σ) is conceptually straightforward, to derive it in practice we need to maximize the quantile c_α(μ_n, X_n, Σ) over the set of μ_n values consistent with the null. The linear structure of the problem implies that the maximum is attained at μ_n = 0.

Proposition 1 c_{α,LF}(X_n, Σ) = c_α(0, X_n, Σ).

This result follows immediately from the observations that (i) c_α(μ_n, X_n, Σ) is invariant to shifting μ_n by X_nδ̃, in the sense that for all δ̃,

c_α(μ_n, X_n, Σ) = c_α(μ_n + X_nδ̃, X_n, Σ),

(ii) that c_α(μ_n, X_n, Σ) is non-decreasing in μ_n, and (iii) that for every μ_n ∈ M there exists δ(μ_n) such that μ_n − X_nδ(μ_n) ≤ 0.
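To illustrate the projection test numerically (a sketch with hypothetical inputs and our own variable names): `scipy.optimize.linprog` solves the LP (10) for η̂, and simulation gives c_{α,LFP}(Σ) = c_α(0, Σ).

```python
import numpy as np
from scipy.optimize import linprog

def eta_hat(Y_n, X_n, Sigma):
    """Solve the LP (10): min over (eta, delta) of eta subject to
    (Y_nj - X_nj @ delta) / sd_j <= eta for every moment j."""
    k, p = X_n.shape
    sd = np.sqrt(np.diag(Sigma))
    # Decision vector v = (eta, delta); each constraint rewritten as
    # -sd_j * eta - X_nj @ delta <= -Y_nj.
    A_ub = np.hstack([-sd[:, None], -X_n])
    res = linprog(c=np.r_[1.0, np.zeros(p)], A_ub=A_ub, b_ub=-Y_n,
                  bounds=[(None, None)] * (1 + p))
    return res.fun

def cv_lfp(Sigma, alpha=0.05, sims=100_000, seed=0):
    """c_{alpha,LFP}(Sigma): 1 - alpha quantile of S(xi, Sigma), xi ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    xi = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=sims)
    return np.quantile(np.max(xi / np.sqrt(np.diag(Sigma)), axis=1), 1 - alpha)

# Hypothetical data: two moments, one nuisance parameter
Y_n, X_n, Sigma = np.array([1.0, 2.0]), np.array([[1.0], [-1.0]]), np.eye(2)
eta = eta_hat(Y_n, X_n, Sigma)   # both constraints bind: eta = 1.5
reject = eta > cv_lfp(Sigma)     # phi_LFP compares eta to c_{alpha,LFP}
print(eta, reject)
```

In this example η̂ = 1.5 falls below the simulated c_{α,LFP}(Σ) ≈ 1.95 for two independent standard normal moments, so the projection test does not reject.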
While the need to repeatedly solve the problem (11) means that this approach requires more computation than the projection method, it remains highly tractable and yields a non-conservative test.

While less conservative than the projection approach, least favorable critical values still assume that all the moments are binding, $\mu_n = 0$. In practice we may suspect that some of the moments are far from binding, and the data may be informative about this. Motivated by this fact, D. Andrews & Soares (2010), D. Andrews & Barwick (2012), Romano et al. (2014a), and related papers propose techniques that use information from the data to either select moments or shift the mean of the distribution from which the critical values are calculated. This allows them to construct tests with higher power in empirically relevant cases where many of the moments are slack.

In our setting one can test $H_0 : \mu_n \in M$ by first using one of the aforementioned approaches to test $H_0(\delta)$ as defined in (9) for all $\delta$ and then applying the projection method. This yields a conservative test, but Kaido et al. (2019a) show how to eliminate this conservativeness when considering projections based on D. Andrews & Soares (2010). Unfortunately, however, projection tests based on moment-selection procedures break the linear structure discussed in the last section. Implementing these approaches consequently requires solving a nonlinear, non-convex optimization problem.

To obtain procedures which both perform well when we have slack moments and preserve linearity, we introduce a novel conditional testing approach. When there is a unique, non-degenerate solution in the linear program (10), exactly $p + 1$ of the inequality constraints bind at the optimum. We propose tests which condition on the identity of these binding moments, and on a sufficient statistic for the slackness of the remaining moments.
These tests control size both conditional on the set of binding moments and unconditionally, and are highly computationally tractable. Moreover, these tests are insensitive to the presence of slack moments in the sense that as a subset of the moments grows arbitrarily slack, the conditional test approaches the test which drops the slack moments ex-ante. Conditional tests thus automatically incorporate a strong form of moment selection.

When the solution to (10) is non-unique or degenerate the set of binding moments is no longer uniquely defined, which would seem to pose a problem for the conditional test as described above. We show, however, that a reformulation of the conditional approach based on the dual linear program continues to apply in such settings. This approach is equivalent to conditioning on the set of binding moments in (10) when there is a unique, non-degenerate solution, but remains valid and easy to implement even when these conditions fail.

In what follows, we first introduce the test in a special case where there are no nuisance parameters $\delta$ before turning to our results for the general case with a unique, non-degenerate solution. Results for the formulation based on the dual linear program, which allow for non-unique or degenerate solutions, are discussed in Section 5.3 and formally developed in Appendix A.

To develop intuition for our conditional approach we first consider a model without nuisance parameters $\delta$. To further simplify, we assume that the variance is equal to the identity matrix, $\Sigma = I$. Our problem then reduces to that of testing $\mu_n \le 0$ based on $Y_n \sim N(\mu_n, I)$, which has been well-studied in the previous literature. In this setting, $\hat{\eta}$ is simply the max of the moments, $\hat{\eta} = S(Y_n, I) = \max_j \{Y_{n,j}\}$. With probability one there is a unique binding constraint in the linear program (10), corresponding to the largest moment. Once we condition on the identity of the largest moment, $\hat{j} = \arg\max_j Y_{n,j}$, the problem becomes one of inference based on a normal vector conditional on the max occurring at a particular location, $\hat{j} = j$.

Unfortunately, the distribution of $\hat{\eta} = Y_{n,\hat{j}}$ conditional on $\hat{j} = j$ still depends on the full vector $\mu_n$. This dependence comes from the fact that $\hat{j} = j$ if and only if $Y_{n,j} \ge \max_{\tilde{j} \ne j} Y_{n,\tilde{j}}$, where the distribution of the lower bound depends on $\{\mu_{n,\tilde{j}} : \tilde{j} \ne j\}$. To eliminate this dependence, we further condition on the value of the second largest moment. Once we condition on $\hat{j} = j$ and on the value of the second largest moment, say $\max_{\tilde{j} \ne j} Y_{n,\tilde{j}} = V^{lo}$, $\hat{\eta}$ follows a truncated normal distribution,

$\hat{\eta} \mid \{\hat{j} = j \text{ and } \max_{\tilde{j} \ne j} Y_{n,\tilde{j}} = V^{lo}\} \sim \xi \mid V^{lo} \le \xi$, for $\xi \sim N(\mu_{n,j}, 1)$.

Lemma A.1 of Lee et al. (2016) shows that this truncated normal distribution is increasing in $\mu_{n,j}$, so since $\mu_{n,j} \le 0$ under the null, the $1-\alpha$ quantile of the conditional distribution under $\mu_{n,j} = 0$ is a valid conditional critical value. We denote this conditional critical value by $c_{\alpha,C}(j, V^{lo}, I)$. The conditional test

$\phi_C = 1\left\{\hat{\eta} > c_{\alpha,C}\left(\hat{j}, \max_{\tilde{j} \ne \hat{j}} Y_{n,\tilde{j}}, I\right)\right\}$

has maximal rejection probability equal to $\alpha$ under the null, conditional on $\hat{j} = j$ and $\max_{\tilde{j} \ne j} Y_{n,\tilde{j}} = V^{lo}$. By the law of iterated expectations its unconditional rejection probability under the null is thus bounded above by $\alpha$ as well, and this bound is achieved at $\mu_n = 0$. Thus, $\phi_C$ is a size $\alpha$ test of $H_0 : \mu_n \le 0$.
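In this no-nuisance, $\Sigma = I$ case the conditional test reduces to a few lines. The sketch below is our own illustration; it conditions on the argmax and uses the second-largest moment as the lower truncation point, with the truncated normal quantile written in closed form.

```python
import numpy as np
from scipy.stats import norm

def conditional_test(Y, alpha=0.05):
    """Conditional test of H0: mu <= 0 given Y ~ N(mu, I).
    eta_hat = max_j Y_j; conditional on the argmax and on the second-largest
    moment V_lo, eta_hat ~ N(mu_j, 1) truncated to [V_lo, infinity)."""
    Y = np.sort(np.asarray(Y, dtype=float))
    eta_hat, v_lo = Y[-1], Y[-2]
    # 1 - alpha quantile of a standard normal truncated below at v_lo:
    # Phi(c) = alpha * Phi(v_lo) + (1 - alpha)
    c = norm.ppf(alpha * norm.cdf(v_lo) + (1 - alpha))
    return bool(eta_hat > c)
```

When the second-largest moment is far below zero the critical value is close to $\Phi^{-1}(1-\alpha) \approx 1.64$; when it is close to the max, the critical value exceeds the max itself and power suffers, exactly the two regimes discussed in this section.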
The simplicity of the present setting allows us to highlight some important features of the conditional test. When the second largest element of $\mu_n$, say $\max_{\tilde{j} \ne j} \mu_{n,\tilde{j}}$, is very negative while the largest element ($\mu_{n,j}$) is not, $\hat{j} = j$ with high probability. In this case, the lower truncation point is very small with high probability, so the truncated normal critical value $c_{\alpha,C}(\hat{j}, \max_{\tilde{j} \ne \hat{j}} Y_{n,\tilde{j}}, I)$ is close to the $1-\alpha$ standard normal critical value with high probability. Thus, when the largest element of $\mu_n$ is well separated from the remaining elements, the conditional test closely resembles the test which limits attention to the $j$th moment ex-ante, $\phi_j = 1\{Y_{n,j} > c_\alpha\}$ for $c_\alpha$ the $1-\alpha$ quantile of the standard normal distribution. The power of $\phi_j$ lies on the power envelope for tests of $H_0 : \mu_n \le 0$ when all the other elements of $\mu_n$ are negative (see Romano et al. 2014b). Thus, the conditional test has power approaching the power envelope when we take all moments but one to be slack. More broadly, Proposition 3 below shows that if we take a subset of elements of $\mu_n$ to $-\infty$, the conditional test converges to the conditional test which drops the corresponding moments ex-ante.

The only other test that we know of which shares this strong insensitivity property, while also controlling size in the finite sample normal model, is that of Cox & Shi (2019). (Specifically, the baseline test discussed in that paper, not the modification discussed in their Remark 3. Interestingly, this test is also based on conditioning, though in the present example their approach conditions on the identity of the non-negative moments, $\{j : Y_j > 0\}$, while we condition on the identity of the largest moment and the value of the second-largest moment.) In particular, while the tests of D. Andrews & Barwick (2012) and Romano et al. (2014a) are relatively insensitive to the presence of slack moments, they are both affected by the inclusion of even arbitrarily slack moments, through the size correction factor in D. Andrews & Barwick (2012) and the first-stage critical value in Romano et al. (2014a). While the test of Cox & Shi (2019) is strongly insensitive to slack moments, its power does not in general converge to the power envelope in the case where all moments but one are slack.

This example also highlights a less desirable feature of our conditional test. When the largest element of $\mu_n$ is not well-separated, $\mu_{n,j} \approx \max_{\tilde{j} \ne j} \mu_{n,\tilde{j}}$, the second-largest moment $\max_{\tilde{j} \ne \hat{j}} Y_{n,\tilde{j}}$ will often be nearly as large as the largest moment. Since the conditional critical value $c_{\alpha,C}(\hat{j}, \max_{\tilde{j} \ne \hat{j}} Y_{n,\tilde{j}}, I)$ is always strictly larger than $\max_{\tilde{j} \ne \hat{j}} Y_{n,\tilde{j}}$, this can lead to poor power for the conditional test. We illustrate this issue in simulation in Appendix F.

Hybrid Tests
To address power declines for the conditional test when the largest element of $\mu_n$ is not well-separated, we introduce what we call a hybrid test. This modifies the conditional test to reject whenever the max statistic $\hat{\eta}$ exceeds a level $\kappa \in (0, \alpha)$ least-favorable critical value, $c_{\kappa,LF}(I)$. If $\hat{\eta} \le c_{\kappa,LF}(I)$ we then consider a conditional test, where we (i) further condition on the event that $\hat{\eta} \le c_{\kappa,LF}(I)$ and (ii) modify the level of the conditional test to reflect the first step. By the arguments above, the distribution of $\hat{\eta}$ conditional on not rejecting in the first stage is again truncated normal, now truncated both from below and above,

$\hat{\eta} \mid \{\hat{j} = j, \max_{\tilde{j} \ne j} Y_{n,\tilde{j}} = V^{lo} \text{ and } \hat{\eta} \le c_{\kappa,LF}(I)\} \sim \xi \mid V^{lo} \le \xi \le c_{\kappa,LF}(I)$

for $\xi \sim N(\mu_{n,j}, 1)$. For $c_{\tilde{\alpha},H}(j, V^{lo}, I)$ the $1-\tilde{\alpha}$ quantile of this distribution,

$\inf_{\mu_n \le 0} Pr_{\mu_n}\left\{\hat{\eta} \le c_{\tilde{\alpha},H}(j, V^{lo}, I) \mid \hat{j} = j, \max_{\tilde{j} \ne j} Y_{n,\tilde{j}} = V^{lo}, \hat{\eta} \le c_{\kappa,LF}(I)\right\} = 1 - \tilde{\alpha}.$

To form hybrid tests, we set $\tilde{\alpha} = \frac{\alpha - \kappa}{1 - \kappa}$ to account for the first-step comparison to the least favorable critical value. Since $c_{\tilde{\alpha},H}(j, V^{lo}, I) \le c_{\kappa,LF}$ by definition, we can thus write the hybrid test as

$\phi_H = 1\left\{\hat{\eta} > c_{\frac{\alpha-\kappa}{1-\kappa},H}\left(\hat{j}, \max_{\tilde{j} \ne \hat{j}} Y_{n,\tilde{j}}, I\right)\right\}.$

The unconditional rejection probability of $\phi_H$ under the null is bounded above by $\alpha$, and this bound is attained at $\mu_n = 0$. By construction this test rejects whenever the level $\kappa$ least favorable test does, which improves power relative to the conditional test when the largest element of $\mu_n$ is not well-separated. While the hybrid test retains many of the properties of the conditional test, its dependence on the least-favorable critical value means that it is affected by the inclusion of even arbitrarily slack moments. Similar to the test of Romano et al. (2014a), however, the impact is small when $\kappa$ is close to zero.

To illustrate the performance of hybrid tests in the present simplified setting, Appendix F reports simulation results for cases with two, ten, and fifty moments. We also calculate results for the test proposed by Romano et al. (2014a) for comparison. We find that the hybrid approach improves power relative to the conditional test in the poorly-separated case, while still improving power relative to the least favorable test in the well-separated case. Neither the hybrid test nor the test of Romano et al. (2014a) dominates the other: the test of Romano et al. (2014a) has better performance in the poorly-separated case, while the hybrid test has slightly higher power when the largest moment is moderately well-separated. Unlike the test of Romano et al. (2014a), however, the hybrid and conditional tests easily extend to the case with nuisance parameters $\delta$. Simulation results based on Wollmann (2018), reported in Section 7, demonstrate that the power gains of the hybrid test are borne out in more realistic settings with nuisance parameters.
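The two-stage logic above can be sketched for the $\Sigma = I$ case as follows. This is our own illustration, not the authors' code; with $\Sigma = I$ the level-$\kappa$ LF critical value has the closed form $\Phi^{-1}((1-\kappa)^{1/k})$, since the max of $k$ independent standard normals has CDF $\Phi(x)^k$, and the default $\kappa = 0.005$ corresponds to $\alpha/10$ for $\alpha = 0.05$.

```python
import numpy as np
from scipy.stats import norm

def hybrid_test(Y, alpha=0.05, kappa=0.005):
    """Hybrid test of H0: mu <= 0 for Y ~ N(mu, I).
    Stage 1: reject if eta_hat exceeds the level-kappa LF critical value.
    Stage 2: conditional test at level (alpha - kappa)/(1 - kappa), with the
    upper truncation point set to the stage-1 critical value."""
    Y = np.sort(np.asarray(Y, dtype=float))
    eta_hat, v_lo = Y[-1], Y[-2]
    c_lf = norm.ppf((1 - kappa) ** (1 / len(Y)))   # LF cutoff when Sigma = I
    if eta_hat > c_lf:
        return True                                 # first-stage rejection
    a = (alpha - kappa) / (1 - kappa)
    zl, zu = norm.cdf(v_lo), norm.cdf(c_lf)
    c = norm.ppf((1 - a) * zu + a * zl)             # quantile on [v_lo, c_lf]
    return bool(eta_hat > c)
```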
We next discuss our conditional approach in the case with nuisance parameters $\delta$ and a covariance matrix $\Sigma$ which may not equal the identity. In this section we assume that the linear program (10) has a unique, non-degenerate solution with probability one, while Appendix A develops an alternative formulation for the conditioning approach, based on the dual linear program, that does not impose these conditions. The primal and dual approaches are numerically equivalent when the solution to (10) is unique and non-degenerate (as we expect will often be the case in applications), so we focus on the primal approach here for ease of exposition. (Degeneracy means that for $W_n$ as defined below, the rows of $W_n$ corresponding to binding constraints are linearly dependent. See Section 10.4 of Schrijver (1986).)
To define our conditional approach, note that we can rewrite (10) as

$\min_{\eta,\delta} \eta$ subject to $Y_n - W_n(\eta, \delta')' \le 0$ (12)

for $W_n$ the matrix with row $j$ equal to $W_{n,j} = (\sqrt{\Sigma_{jj}}, X_{n,j})$. Let $(\hat{\eta}, \hat{\delta})$ denote the optimal values in (12), which we assume for the moment are unique, and let $\hat{B} \subseteq \{1, ..., k\}$ collect the indices corresponding to the binding constraints at these optimal solutions, so $Y_{n,j} - W_{n,j}(\hat{\eta}, \hat{\delta}')' = 0$ if and only if $j \in \hat{B}$. Let $Y_{n,\hat{B}}$ and $W_{n,\hat{B}}$ collect the corresponding rows of $Y_n$ and $W_n$.

Lemma 4
If the solution to (12) is unique and non-degenerate, $|\hat{B}| = p + 1$, and $W_{n,\hat{B}}$ has full rank.

Since $Y_{n,\hat{B}} - W_{n,\hat{B}}(\hat{\eta}, \hat{\delta}')' = 0$ by the definition of $\hat{B}$, Lemma 4 implies that $(\hat{\eta}, \hat{\delta}')' = W_{n,\hat{B}}^{-1} Y_{n,\hat{B}}$. Thus, given a particular set of binding moments $\hat{B} = B$, we can write $\hat{\eta}$ as a linear function of $Y_n$,

$\hat{\eta} = \gamma_{n,B}' Y_n = e_1' W_{n,B}^{-1} Y_{n,B},$

for $e_1$ the first standard basis vector. We next consider under what conditions there exists a solution with moments $B$ binding.

Lemma 5
For $B \subseteq \{1, ..., k\}$ such that $W_{n,B}$ is a square, full-rank matrix, there exists a solution with the moments $B$ binding if and only if

$Y_n - W_n W_{n,B}^{-1} Y_{n,B} \le 0.$ (13)

Thus we see that there exists a solution with the moments $B$ binding if and only if the implied $(\hat{\eta}, \hat{\delta}')'$ make the constraints in (12) hold.

Our conditional test will condition on the existence of a solution with the moments $B$ binding and reject when $\hat{\eta}$ is large relative to the resulting conditional distribution under the null. The set of values $Y_n$ such that (13) holds is a polytope (a convex set with flat sides, also known as a polyhedron; see Schrijver 1986, pages 87-88), and as noted above we can write $\hat{\eta}$ as a linear function of $Y_n$ conditional on this event. Thus, we are interested in the distribution of a linear function of a normal vector conditional on that vector falling in a polytope. Lee et al. (2016) consider problems of this form, and we can use their results to derive conditional critical values. We first calculate the range of possible values for $\hat{\eta}$ conditional on $Y_n$ falling in this polytope. We then determine the distribution of $\hat{\eta}$ over this range conditional on a sufficient statistic for the part of $\mu_n$ not corresponding to $\hat{\eta}$. To this end we use the following result, which is an immediate consequence of Lemma 5.1 of Lee et al. (2016).

Lemma 6
Let $M_B$ be the selection matrix which selects rows corresponding to $B$. Suppose that $W_{n,B}$ is a square, full-rank matrix, and let $\gamma_{n,B}$ be the vector with $M_B \gamma_{n,B} = (W_{n,B}')^{-1} e_1$ and zeros elsewhere. Assume $\gamma_{n,B}' \Sigma \gamma_{n,B} > 0$. Let $\Lambda_{n,B} = I - W_n W_{n,B}^{-1} M_B$, and define

$\Delta_{n,B} = \frac{\Sigma \gamma_{n,B}}{\gamma_{n,B}' \Sigma \gamma_{n,B}}$, and $S_{n,B} = (I - \Delta_{n,B} \gamma_{n,B}') Y_n.$

Further define

$V^{lo}(S_{n,B}) = \max_{j : (\Lambda_{n,B} \Delta_{n,B})_j < 0} -\frac{(\Lambda_{n,B} S_{n,B})_j}{(\Lambda_{n,B} \Delta_{n,B})_j}$ (14)

$V^{up}(S_{n,B}) = \min_{j : (\Lambda_{n,B} \Delta_{n,B})_j > 0} -\frac{(\Lambda_{n,B} S_{n,B})_j}{(\Lambda_{n,B} \Delta_{n,B})_j}$ (15)

$V(S_{n,B}) = \min_{j : (\Lambda_{n,B} \Delta_{n,B})_j = 0} -(\Lambda_{n,B} S_{n,B})_j.$

The set of values $Y_n$ such that there exists a solution with the moments $B$ binding is

$\{Y_n : Y_n - W_n W_{n,B}^{-1} Y_{n,B} \le 0\} = \{Y_n : V^{lo}(S_{n,B}) \le \gamma_{n,B}' Y_n \le V^{up}(S_{n,B}), V(S_{n,B}) \ge 0\}.$

This result shows that there exists a solution with the moments $B$ binding if and only if $\gamma_{n,B}' Y_n$ lies between the data-dependent bounds $V^{lo}(S_{n,B})$ and $V^{up}(S_{n,B})$ and, in addition, $V(S_{n,B}) \ge 0$. When such a solution exists, however, our arguments above show that $\hat{\eta} = \gamma_{n,B}' Y_n$. Thus, whenever there exists a solution with the moments $B$ binding, $\hat{\eta}$ lies between $V^{lo}(S_{n,B})$ and $V^{up}(S_{n,B})$ by construction.

Lemma 6 assumes that $\gamma_{n,B}' \Sigma \gamma_{n,B} > 0$. This implies that $\hat{\eta}$ has a non-degenerate distribution conditional on the set of binding moments. While not necessary for our conditional testing approach, this simplifies a number of statements in what follows, so going forward we maintain a sufficient condition for $\gamma_{n,B}' \Sigma \gamma_{n,B} > 0$.

Assumption 1
For all $\gamma$ with $W_n'\gamma = e_1$ and $\gamma \ge 0$, $\gamma'\Sigma\gamma > 0$.

One can show that $\gamma_{n,B}$ as defined in Lemma 6 has $W_n'\gamma_{n,B} = e_1$ and $\gamma_{n,B} \ge 0$ for any set of binding moments $B$. A sufficient, but not necessary, condition for Assumption 1 is that the variance matrix $\Sigma$ is positive-definite.

Lemma 6 clarifies what it means to condition on the existence of a solution with the moments $B$ binding, and thus the inference problem we need to solve. We are interested in the behavior of $\hat{\eta} = \gamma_{n,B}' Y_n$ conditional on the set of binding moments, but as in the simplified example above this conditional distribution depends on the full mean vector $\mu_n$, rather than just on $\gamma_{n,B}' \mu_n$, due to the influence of the bounds $V^{lo}(S_{n,B})$ and $V^{up}(S_{n,B})$. Moreover, this conditional distribution is not in general monotonic in $\mu_n$, making it difficult to find least favorable values. To eliminate dependence on $\mu_n$ other than through $\gamma_{n,B}' \mu_n$, we thus follow Lee et al. (2016) and further condition on $S_{n,B}$, which is the minimal sufficient statistic for the part of $\mu_n$ other than $\gamma_{n,B}' \mu_n$. Note that $\gamma_{n,B}' Y_n$ and $S_{n,B}$ are jointly normal and uncorrelated by construction, and thus independent. Hence, $\hat{\eta}$ follows a truncated normal distribution conditional on $S_{n,B}$ and the set of binding moments.

Lemma 7
If the solution to (12) is unique and non-degenerate with probability one, the conditional distribution of $\hat{\eta}$ given $\hat{B} = B$ and $S_{n,B} = s$ is truncated normal,

$\hat{\eta} \mid \{\hat{B} = B \text{ and } S_{n,B} = s\} \sim \xi \mid \xi \in [V^{lo}(s), V^{up}(s)],$

for $\xi \sim N(\gamma_{n,B}'\mu_n, \gamma_{n,B}'\Sigma\gamma_{n,B})$, provided we consider a value $s$ such that $V(s) \ge 0$.

As in Section 5.1 above, this truncated distribution is increasing in the mean $\gamma_{n,B}'\mu_n$. Since $\gamma_{n,B} \ge 0$, $\gamma_{n,B}' X_n = 0$, and $\mu_n - X_n\delta \le 0$ under the null, the largest value of $\gamma_{n,B}'\mu_n$ possible under the null is zero. We define the conditional critical value $c_{\alpha,C}(\gamma, V^{lo}, V^{up}, \Sigma)$ to equal the $1-\alpha$ quantile of the truncated normal distribution $\xi \mid \xi \in [V^{lo}, V^{up}]$ for $\xi \sim N(0, \gamma'\Sigma\gamma)$. We can write this critical value as

$c_{\alpha,C}(\gamma, V^{lo}, V^{up}, \Sigma) = \sqrt{\gamma'\Sigma\gamma} \cdot \Phi^{-1}\left((1-\alpha)\zeta^{up} + \alpha\zeta^{lo}\right)$ (16)

for $\Phi^{-1}$ the inverse of the standard normal distribution function, and $(\zeta^{lo}, \zeta^{up}) = (\Phi(V^{lo}/\sqrt{\gamma'\Sigma\gamma}), \Phi(V^{up}/\sqrt{\gamma'\Sigma\gamma}))$.

(If the condition $\gamma_{n,B}'\Sigma\gamma_{n,B} > 0$ fails, we can define our conditional test to reject whenever $\gamma_{n,B}'\Sigma\gamma_{n,B} = 0$ and $\hat{\eta} > 0$, but this results in tests with size bounded above by $\alpha$, rather than exactly correct size. In particular, $S_{n,B}$ is minimal sufficient for $(I - \Delta_{n,B}\gamma_{n,B}')\mu_n$, and $\mu_n$ is a one-to-one transformation of $(\gamma_{n,B}'\mu_n, (I - \Delta_{n,B}\gamma_{n,B}')\mu_n)$, since $\mu_n = (I - \Delta_{n,B}\gamma_{n,B}')\mu_n + \Delta_{n,B}\gamma_{n,B}'\mu_n$. The properties of $\gamma_{n,B}$ used above follow from Lemma 10 and Proposition 5 in Appendix A, but can also be verified directly using the Kuhn-Tucker conditions for optimality of $(\hat{\eta}, \hat{\delta})$.)
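Equation (16) is straightforward to evaluate numerically; a minimal sketch (the function name is our own):

```python
import numpy as np
from scipy.stats import norm

def conditional_critical_value(gamma, Sigma, v_lo, v_up, alpha=0.05):
    """Equation (16): the 1 - alpha quantile of xi ~ N(0, gamma' Sigma gamma)
    truncated to [v_lo, v_up]."""
    sd = np.sqrt(gamma @ Sigma @ gamma)
    zeta_lo, zeta_up = norm.cdf(v_lo / sd), norm.cdf(v_up / sd)
    return sd * norm.ppf((1 - alpha) * zeta_up + alpha * zeta_lo)
```

With no truncation ($V^{lo} = -\infty$, $V^{up} = \infty$) this reduces to the usual one-sided normal critical value $\sqrt{\gamma'\Sigma\gamma}\,\Phi^{-1}(1-\alpha)$.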
Thus, conditional critical values are easy to compute in practice. Assuming the solution to (12) is unique and non-degenerate with probability one and Assumption 1 holds, the results above imply that the conditional test which compares $\hat{\eta}$ to the conditional critical value,

$\phi_C = 1\left\{\hat{\eta} > c_{\alpha,C}\left(\gamma_{n,\hat{B}}, V^{lo}(S_{n,\hat{B}}), V^{up}(S_{n,\hat{B}}), \Sigma\right)\right\},$ (17)

rejects with probability at most $\alpha$ conditional on $\hat{B} = B$ under the null, and thus has unconditional size $\alpha$ as well.

Proposition 2
If the solution to (12) is unique and non-degenerate with probability one and Assumption 1 holds, the conditional test $\phi_C$ has size $\alpha$ both conditional on $\hat{B}$,

$\sup_{\mu_n \in M} E_{\mu_n}[\phi_C \mid \hat{B} = B] = \alpha$ for all $B$ such that $Pr_{\mu_n}\{\hat{B} = B\} > 0$,

and unconditionally,

$\sup_{\mu_n \in M} E_{\mu_n}[\phi_C] = \alpha.$

In our discussion of conditional tests so far we have relied on the uniqueness and non-degeneracy of the solution to ensure both that the set of binding moments $\hat{B}$ is uniquely defined and that the matrix $W_{n,B}$ is invertible. While these assumptions allow us to obtain simple expressions for conditional tests, they are not essential. Even when the solution $(\hat{\eta}, \hat{\delta})$ is non-unique or degenerate, $\hat{\eta}$ is unique. Our conditioning approach for the normal model remains valid in such cases, but we need to work with the dual linear program to (12). This dual conditioning approach is numerically equivalent to that described above when the primal solution is unique and non-degenerate. Since formally developing the dual approach requires additional notation and adds little intuition relative to the results above, we defer this development to Appendix A. There we formally establish the numerical equivalence of the primal and dual approaches when the former is valid, as well as conditional and unconditional size control for our conditional tests based on the dual in the normal model, even when the primal solution may be non-unique or degenerate. To prove asymptotic validity of the conditional approach with non-normal data, our results in Appendix D require that the primal solution be non-degenerate with probability one asymptotically, though it may be non-unique. A sufficient condition for non-degeneracy is that $\Sigma$ has full rank, so this condition can be made to hold mechanically by adding a small amount of full-rank noise to $Y_n$.
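For a candidate binding set $B$, the quantities in Lemma 6 can be computed directly. The sketch below is our own code; it assumes $B$ is known and $W_{n,B}$ is square and full-rank, and recovers $\hat{\eta} = \gamma_{n,B}' Y_n$ along with the truncation bounds.

```python
import numpy as np

def truncation_bounds(Y, W, Sigma, B):
    """Compute gamma_{n,B}, V^lo(S_{n,B}), V^up(S_{n,B}) from Lemma 6, given
    the k x (p+1) constraint matrix W and a binding set B with |B| = p + 1."""
    k = W.shape[0]
    W_B = W[B, :]                                    # square by assumption
    e1 = np.zeros(len(B))
    e1[0] = 1.0
    gamma = np.zeros(k)
    gamma[B] = np.linalg.solve(W_B.T, e1)            # M_B gamma = (W_B')^{-1} e_1
    M_B = np.eye(k)[B, :]                            # selection matrix
    Lam = np.eye(k) - W @ np.linalg.solve(W_B, M_B)  # Lambda_{n,B}
    Delta = Sigma @ gamma / (gamma @ Sigma @ gamma)
    S = Y - Delta * (gamma @ Y)                      # (I - Delta gamma') Y
    LD, LS = Lam @ Delta, Lam @ S
    neg, pos = LD < -1e-10, LD > 1e-10               # formulas (14) and (15)
    v_lo = np.max(-LS[neg] / LD[neg]) if neg.any() else -np.inf
    v_up = np.min(-LS[pos] / LD[pos]) if pos.any() else np.inf
    return gamma, v_lo, v_up
```

In the no-nuisance case ($W$ a column of ones, $\Sigma = I$) with $B$ the index of the largest moment, $\gamma' Y_n$ recovers the max and $V^{lo}$ the second-largest moment, matching Section 5.1.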
It is often not obvious whether the solution to (12) will be unique and non-degenerate with probability one in a given setting. Fortunately, the results in Appendix A suggest a simple way to proceed in practice, based on the fact that the widely-used dual-simplex algorithm for solving the primal problem (12) automatically generates a vertex $\hat{\gamma}$ of the dual solution set as well. Proposition 5 in Appendix A shows that so long as $\hat{\gamma}$ has exactly $p + 1$ strictly positive entries, and the rows of $W_n$ corresponding to these positive entries have full rank, we can take $\hat{B}$ to collect the corresponding indices and apply the results developed above. If this condition fails, then we should use the more general expressions developed in Appendix A.

We motivated our study of conditional tests by a desire to reduce sensitivity to slack moments. To formally understand the behavior of conditional tests in cases where some of the moments are slack, we will consider a sequence of mean vectors $\mu_{n,m}$, indexed by $m$, such that a subset of the moments grow arbitrarily slack as $m \to \infty$ while the remaining moments are unchanged. This yields the following result, which generalizes the insensitivity to slack moments noted in Section 5.1 for the special case without nuisance parameters to our general setting.

Proposition 3
Consider a sequence of mean vectors $\mu_{n,m}$ where $\mu_{n,m,j} \equiv \mu_{n,j} \in \mathbb{R}$ for all $m$ if $j \in B$, while $\mu_{n,m,j} \to -\infty$ as $m \to \infty$ if $j \notin B$. Let us further suppose that there exists $\gamma_B \ge 0$ with $W_{n,B}'\gamma_B = e_1$. Under Assumption 1, for $Y_{n,m} \sim N(\mu_{n,m}, \Sigma)$, $\phi_{C,m}$ the conditional test based on $(Y_{n,m}, W_n, \Sigma)$, and $\phi_{C,m}^B$ the conditional test based on $(Y_{n,m,B}, W_{n,B}, \Sigma_B)$,

$\phi_{C,m} \to_p \phi_{C,m}^B$ as $m \to \infty$.

The restriction on $W_{n,B}$ ensures that the feasible set in the dual problem based on $(Y_{n,m,B}, W_{n,B}, \Sigma_B)$ is non-empty, and thus that the solution in the primal problem is finite (see Section 7.4 of Schrijver (1986)). When this condition fails, the optimal value $\hat{\eta}$ diverges to $-\infty$.

Proposition 3 shows that the conditional tests we consider are robust to the presence of slack moments in a very strong sense. In particular, when a subset of moments become arbitrarily slack, the conditional test converges in probability to the test which drops these moments ex-ante. As noted above, even in settings without nuisance parameters the only other test we are aware of with this property in the normal model is that of Cox & Shi (2019), and their approach does not address settings with nuisance parameters (other than through projection).
In Section 5.1 above, we noted that in the special case without nuisance parameters conditional tests can have poor power in settings where the lower bound used by the conditional test is large with high probability. The same issue arises more broadly, and as in the case without nuisance parameters we can obtain improved performance by considering hybrid tests.

For some $\kappa \in (0, \alpha)$ the hybrid test rejects whenever $\hat{\eta}$ exceeds the level $\kappa$ least-favorable critical value $c_{\kappa,LF}(X_n, \Sigma)$. When $\hat{\eta}$ is less than this critical value, the hybrid test compares $\hat{\eta}$ to a modification of the conditional critical value that also conditions on $\hat{\eta} \le c_{\kappa,LF}(X_n, \Sigma)$. This reduces $V^{up}(s)$ to

$V^{up,H}(s) = \min\{V^{up}(s), c_{\kappa,LF}(X_n, \Sigma)\}.$

The level $\alpha$ hybrid test rejects whenever $\hat{\eta}$ exceeds the level $\frac{\alpha-\kappa}{1-\kappa}$ conditional critical value based on the modified truncation points, where we define this quantile to equal $\infty$ if $V^{lo}$ exceeds $V^{up,H}$,

$\phi_H = 1\left\{\hat{\eta} > c_{\frac{\alpha-\kappa}{1-\kappa},C}\left(\hat{\gamma}, V^{lo}(S_{n,\hat{B}}), V^{up,H}(S_{n,\hat{B}}), \Sigma\right)\right\}.$

Since $c_{\frac{\alpha-\kappa}{1-\kappa},C}(\hat{\gamma}, V^{lo}(S_{n,\hat{B}}), V^{up,H}(S_{n,\hat{B}}), \Sigma) \le V^{up,H}(S_{n,\hat{B}}) \le c_{\kappa,LF}(X_n, \Sigma)$, this test always rejects when $\hat{\eta} > c_{\kappa,LF}(X_n, \Sigma)$, as claimed above. The hybrid test has size equal to $\frac{\alpha-\kappa}{1-\kappa}$ conditional on $\hat{\eta} \le c_{\kappa,LF}$ and the set of binding moments, and unconditional size equal to $\alpha$.

(The same conclusion as in Proposition 3 holds if there exists a sequence $\delta_m$ and a vector $\delta$ such that $\mu_{n,m,j} - X_{n,j}\delta_m = \mu_{n,j} - X_{n,j}\delta \in \mathbb{R}$ for all $m$ if $j \in B$, while $\mu_{n,m,j} - X_{n,j}\delta_m \to -\infty$ as $m \to \infty$ if $j \notin B$. Similar to Romano et al. (2014a), we consider $\kappa = \alpha/10$ in our simulations below. Either $c_{\kappa,LFP}(\Sigma)$ or $c_{\kappa,LF}(X_n, \Sigma)$ could be used for the first stage, the tradeoff being that $c_{\kappa,LF}(X_n, \Sigma)$ provides a smaller critical value but has a somewhat higher computational burden.)

Proposition 4
If the solution to (12) is unique and non-degenerate with probability one, and Assumption 1 holds, the hybrid test $\phi_H$ has size $\frac{\alpha-\kappa}{1-\kappa}$ conditional on $\hat{\eta} \le c_{\kappa,LF}(X_n, \Sigma)$ and $\hat{B} = B$,

$\sup_{\mu_n \in M} E_{\mu_n}\left[\phi_H \mid \hat{\eta} \le c_{\kappa,LF}(X_n, \Sigma), \hat{B} = B\right] = \frac{\alpha-\kappa}{1-\kappa}$

for all $B$ such that $Pr_{\mu_n}\{\hat{\eta} \le c_{\kappa,LF}(X_n, \Sigma), \hat{B} = B\} > 0$, and has unconditional size $\alpha$,

$\sup_{\mu_n \in M} E_{\mu_n}[\phi_H] = \alpha.$

Thus, we see that our hybrid approach yields a non-conservative level $\alpha$ test. Due to the inclusion of the least favorable critical value $c_{\kappa,LF}(X_n, \Sigma)$ this test no longer shares the strong insensitivity to slack moments established for the conditional test by Proposition 3. That said, as a set of moments becomes slack the power of the hybrid test is bounded below by the power of the size $\frac{\alpha-\kappa}{1-\kappa}$ conditional test that drops these moments ex-ante. Moreover, the Monte Carlo results in Section 7 show that the hybrid does noticeably better than both the conditional and least favorable tests in some cases with slack moments. Appendix A establishes size control for hybrid tests based on the dual approach even when the solution to (12) is non-unique or degenerate.

6 Implementation
This section provides guidance for researchers seeking to implement the methods described in this paper. As in our theoretical results above, we assume that the researcher has a moment function

$g(D_i, \beta, \delta) = Y_i(\beta) - X_i(\beta)\delta$ (18)

for $Y_i(\beta) \in \mathbb{R}^k$, $\delta \in \mathbb{R}^p$, and $X_i(\beta)$ a $k \times p$ matrix. We assume that at the true parameter values $E_P[Y_i(\beta) - X_i(\beta)\delta \mid Z_i] \le 0$, where $Z_i$ is a vector of instruments and $X_i(\beta)$ is non-random given $Z_i$. We suppose the researcher wishes to compute confidence sets for $\beta$. This is often done by discretizing the parameter space for $\beta$ as $\{\beta_1, \ldots, \beta_L\}$, and then testing pointwise whether each $\beta_l$ in the grid is contained in $B_I(P)$. The confidence set then collects the non-rejected points.

Sections 6.1 to 6.4 provide guidance on how to test whether a single value of $\beta$ is in the identified set, which can then be applied to all points in the grid. Sections 6.5 and 6.6 discuss implementation in extensions of this basic setting, such as when the researcher wishes to conduct inference on (functions of) linear parameters, or when there are non-linear nuisance parameters.

Alternative Procedures
While the linear conditional structure assumed in this paper is present in a variety of moment inequality settings, there are practically important cases where our results do not apply but alternatives are available. First, one may have unconditional moment inequalities that are nonetheless linear in the parameters, in which case one can use the approaches of Cho & Russell (2019) or Gafarov (2019). Alternatively, in settings with unconditional moment inequalities that may or may not be linear in the nuisance parameters $\delta$, or where we may be interested in a nonlinear function of the parameters, one can use the approaches of e.g. Bugni et al. (2017) and Kaido et al. (2019a). For more discussion of the comparison among these options, see Kaido et al. (2019a) and Gafarov (2019). Other alternatives include the procedures discussed by Romano & Shaikh (2008) and Chen et al. (2018).

Asymptotic validity for the procedures discussed above (and for the present paper; see Appendix D) is established under the assumption that the number of moments is fixed as the sample size tends to infinity. This assumption may yield unsatisfactory performance if the number of moments is large relative to the sample size. By contrast, the approach of Belloni et al. (2018) gives guarantees even in high-dimensional settings, while the approach of Flynn (2019) allows a continuum of moments. Finally, the results of Chernozhukov et al. (2015) apply in conditional moment settings where the moments may be nonlinear in the nuisance parameters, and the dimension of $g(D_i, \beta, \delta)$ may be large.

Estimating $\Sigma$

All of the tests for whether $\beta \in B_I(P)$ described in this paper require an estimate of the average conditional variance $\Sigma(\beta) = E_P[Var_P(Y_i(\beta) \mid Z_i)]$.
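The matching estimator of $\Sigma$ described in this subsection can be sketched as follows. This is our own minimal implementation, not the authors' code: it assumes the sample covariance of $Z_i$ is invertible and uses a simple $O(n^2)$ nearest-neighbor search.

```python
import numpy as np

def matching_variance(Y, Z):
    """Nearest-neighbor estimate of Sigma = E[Var(Y_i | Z_i)] in the spirit
    of Abadie et al. (2014): match each i to its Mahalanobis-nearest neighbor
    in Z and average outer products of the paired differences, divided by 2."""
    n = Y.shape[0]
    Sz = np.atleast_2d(np.cov(Z, rowvar=False))
    Sz_inv = np.linalg.inv(Sz)
    Sigma_hat = np.zeros((Y.shape[1], Y.shape[1]))
    for i in range(n):
        d = Z - Z[i]
        dist = np.einsum('nj,jk,nk->n', d, Sz_inv, d)  # Mahalanobis distances
        dist[i] = np.inf                               # exclude self-match
        m = int(np.argmin(dist))
        diff = Y[i] - Y[m]
        Sigma_hat += np.outer(diff, diff)
    return Sigma_hat / (2 * n)
```

As a sanity check, if $Y_i = \mu(Z_i) + \varepsilon_i$ with $Var(\varepsilon_i) = 1$ and a dense sample of $Z_i$, the estimate should be close to 1, since matched pairs share nearly the same conditional mean.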
It is important to note that $\Sigma(\beta)$ depends on the non-linear parameter $\beta$, and thus must be estimated at each grid point; for ease of exposition, however, we fix $\beta$ and drop the explicit dependence of $\Sigma$, $Y$, and $X$ on $\beta$ for the remainder of the section.

The average conditional variance $\Sigma$ can be estimated using the matching procedure proposed by Abadie et al. (2014). To do this, define $\Sigma_Z = \widehat{Var}(Z_i)$. For each $i$, find the nearest neighbor using the Mahalanobis distance in $Z_i$:

$\ell_Z(i) = \arg\min_{j \in \{1,...,n\}, j \ne i} (Z_i - Z_j)'\Sigma_Z^{-1}(Z_i - Z_j).$

The estimate of $\Sigma$ is then

$\hat{\Sigma} = \frac{1}{2n}\sum_{i=1}^n (Y_i - Y_{\ell_Z(i)})(Y_i - Y_{\ell_Z(i)})'.$

Proposition 10 in Appendix D proves that, under additional assumptions, $\hat{\Sigma}$ consistently estimates $\Sigma$. (The matching procedure assumes that $\widehat{Var}(Z_i)$ is invertible. In certain applications, such as in our Monte Carlo, elements of $Z_i$ may be linearly dependent by construction, leading $\widehat{Var}(Z_i)$ to be singular. In this case conditioning on a maximal linearly independent subset of $Z_i$ is equivalent to conditioning on the full vector, so one can drop dependent elements from $Z_i$ until $\widehat{Var}(Z_i)$ is invertible.)

We can test whether a particular value $\beta$ is in the identified set using the LF or LFP tests by solving the linear program (10) and rejecting if and only if the optimal value $\hat{\eta}$ exceeds a critical value.

To compute the least-favorable projection critical value via simulation, draw a $k \times S$ matrix $\Xi$ of independent standard normals. Let $\Xi^{max}$ denote the $S \times 1$ vector whose $s$th element is the maximum of the $s$th column of $\hat{\Sigma}^{1/2}\Xi$. Set $c_{\alpha,LFP}(\hat{\Sigma})$ to the $1-\alpha$ quantile of $\Xi^{max}$.

Similarly, to compute the least favorable critical value, again let $\Xi$ be a $k \times S$ matrix of independent standard normal draws. Denote by $\xi_s$ the $s$th column of $\hat{\Sigma}^{1/2}\Xi$. For each $s = 1, \ldots, S$, calculate

$\eta_s = \min_{\eta,\delta} \eta$ subject to $(\xi_{s,j} - X_{n,j}\delta)/\sqrt{\hat{\Sigma}_{jj}} \le \eta$ for all $j$.

Set $c_{\alpha,LF}(\hat{\Sigma})$ to the $1-\alpha$ quantile of $\{\eta_1, \ldots, \eta_S\}$.

To implement the conditional test in practice, we recommend taking the following steps:

1. Solve the primal LP (10) using the dual-simplex method, which generates as a byproduct multipliers $\hat{\gamma}$ corresponding to a vertex of the solution set in the dual problem (see Appendix A).

2. Check whether there are exactly $p + 1$ positive multipliers in $\hat{\gamma}$, and if so, whether the rows of the constraint matrix corresponding with the positive multipliers, $W_{n,B}$, are full-rank.

3. If the conditions checked in step 2 hold, compute $V^{lo}$ and $V^{up}$ using the analytical formulas in (14) and (15), replacing $\Sigma$ by $\hat{\Sigma}$. Otherwise, $V^{lo}$ and $V^{up}$ must be calculated using the definitions in (22) and (23) in Appendix A. This can be done using a bisection method, which we describe in Appendix H.

4. Compute the $1-\alpha$ quantile of the truncated standard normal distribution with truncation points $V^{lo}/\sqrt{\gamma'\hat{\Sigma}\gamma}$ and $V^{up}/\sqrt{\gamma'\hat{\Sigma}\gamma}$.
5. Reject the null if and only if $\hat\eta/\sqrt{\hat\gamma'\hat\Sigma\hat\gamma}$ exceeds this critical value.

(Footnote: Note that $\Xi$ need only be drawn once, and can be reused for many iterations of the LFP test, as well as for the LF test. Holding the simulation draws fixed as we vary $\beta$ is likely to produce confidence sets with smoother boundaries and may ease the computational burden.)

(Footnote: In our implementation, we do this via simulation, using the method of Botev (2017) to efficiently simulate truncated normal draws. The critical value can also be calculated by inverting a normal CDF, as in equation (16), but we found the former method less prone to numerical precision errors.)

To implement the hybrid test, for $\kappa \in (0, \alpha)$ (we use $\kappa = \alpha/10$ in our simulations):

1. Solve the primal LP (10) using the dual-simplex method, which generates as a byproduct multipliers $\hat\gamma$ corresponding to a vertex of the solution set to the dual problem.

2. Compare the resulting value $\hat\eta$ to $c_{\kappa,LF}(X_n, \hat\Sigma(\beta))$, calculated as described in Section 6.2. If $\hat\eta$ exceeds this critical value, reject; otherwise continue the procedure.

3. Follow steps 2 and 3 from the conditional approach to compute $V^{lo}$ and $V^{up}$.

4. Compute the $1 - \frac{\alpha-\kappa}{1-\kappa}$ quantile of the truncated standard normal distribution with lower truncation point $V^{lo}/\sqrt{\hat\gamma'\hat\Sigma\hat\gamma}$ and upper truncation point
$$V^{up,H}/\sqrt{\hat\gamma'\hat\Sigma\hat\gamma} = \min\big(V^{up},\, c_{\kappa,LF}(X_n, \hat\Sigma(\beta))\big)/\sqrt{\hat\gamma'\hat\Sigma\hat\gamma}.$$
Reject the null if and only if $\hat\eta/\sqrt{\hat\gamma'\hat\Sigma\hat\gamma}$ exceeds this critical value.

In some cases, we may have moments of the form
$$g(D_i, \beta_1, \beta_2, \delta) = Y_i(\beta_1, \beta_2) - X_i(\beta_1, \beta_2)\delta$$
and be interested in conducting inference only on $\beta_1$. In this case, we can conduct pointwise inference over a grid for $\beta = (\beta_1, \beta_2)$.
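The hybrid decision rule above can be sketched in a few lines. This is an illustrative sketch only (the function and argument names are ours, not the paper's): it takes $\hat\eta$, $\hat\gamma$, $\hat\Sigma$, the truncation points, and the level-$\kappa$ LF critical value as given, and uses $\kappa = \alpha/10$ as a default.

```python
import numpy as np
from scipy.stats import truncnorm

def hybrid_test(eta_hat, gamma_hat, Sigma_hat, v_lo, v_up, c_kappa_lf,
                alpha=0.05, kappa=0.005):
    """Sketch of the hybrid test; kappa = alpha/10 mirrors the simulations."""
    sigma = np.sqrt(gamma_hat @ Sigma_hat @ gamma_hat)
    # Stage 1: reject outright if eta_hat exceeds the level-kappa LF critical value
    if eta_hat > c_kappa_lf:
        return True
    # Stage 2: conditional test at the adjusted level (alpha - kappa)/(1 - kappa),
    # with the upper truncation point capped at the first-stage critical value
    lo = v_lo / sigma
    up = min(v_up, c_kappa_lf) / sigma
    level = 1 - (alpha - kappa) / (1 - kappa)
    crit = truncnorm.ppf(level, lo, up)  # quantile of N(0,1) truncated to [lo, up]
    return eta_hat / sigma > crit
```

Nothing here depends on how $\hat\gamma$, $V^{lo}$, and $V^{up}$ were obtained, so the same routine applies whether the closed-form or bisection-based truncation points are used.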
We then reject for a particular value of $\beta_1$ if and only if for all values of $\beta_2$ we reject the hypothesis that $(\beta_1, \beta_2)$ is in the identified set (that is, we apply the projection method to eliminate $\beta_2$, while applying the methods described above to the linear nuisance parameters $\delta$). Alternatively, one could use one of the methods discussed above which can directly address nonlinear parameters.

(Footnote: To apply the asymptotic uniformity results developed in Appendix D, here and for the hybrid test below we should reject if and only if $\hat\eta/\sqrt{\hat\gamma'\hat\Sigma\hat\gamma}$ exceeds the maximum of this critical value and $-C$, for $C$ a user-selected positive constant.)

In certain applications, we may have linear moments of the form $E_P[Y_i - X_i\delta \mid Z_i] \leq 0$, where $Y_i$ and $X_i$ do not explicitly depend on a non-linear parameter, and we may be interested in conducting inference on a linear combination of the parameters, $\beta = l'\delta$ (or $l(X_n)'\delta$). For instance, we might be interested in constructing confidence intervals for the coefficient on $X_j$, in which case we would set $l = e_j$, the vector with a 1 in the $j$th position and zeros elsewhere. If we did this once for every parameter we would obtain confidence intervals for each of the individual coefficients. Linear combinations of $\delta$ may be of interest in other settings as well – e.g., in Wollmann (2018) and our Monte Carlo, the average cost of marketing a new product is a linear combination of $\delta$.

We first note that we can recast this problem into the standard form (18) and then use any of the methods described above. To see this, let $B$ be a full-rank matrix with $l$ in the first row, so that $B\delta = (\beta, \tilde\delta')'$ for some $\tilde\delta$. If we let $M_-$ be the selection matrix that selects all but the first column of a matrix, we have
$$Y - X\delta = Y - X(B^{-1}B)\delta = \big(Y - XB^{-1}e_1\beta\big) - XB^{-1}M_-\tilde\delta \equiv \tilde Y(\beta) - \tilde X\tilde\delta.$$
Since
$Var_P(Y_i - X_i\delta \mid Z_i)$ does not depend on $\delta$, $\Sigma$ need only be estimated once, and confidence sets for $l'\delta$ using the LF and LFP methods can be obtained from a linear program (there is no need for point-wise grid test inversion). For example, to compute the upper bound of the confidence set for $\beta = l'\delta$ one can solve
$$\max_\delta \; l'\delta \;\text{ subject to }\; (Y_{n,j} - X_{n,j}\delta)/\sqrt{\hat\Sigma_{jj}} \leq c_\alpha \;\;\forall j, \quad\text{where } c_\alpha \in \{c_{\alpha,LF}, c_{\alpha,LFP}\}. \tag{19}$$

So far we have discussed the case without non-linear nuisance parameters, but this approach extends to the case where we are interested in $\beta_1 = l'\delta$ and $Y$ and $X$ depend on the non-linear nuisance parameter $\beta_2$. In this case, one can recast the problem as $m((\beta_1, \beta_2), \tilde\delta)$ and then follow the approach in Section 6.5 for non-linear nuisance parameters. Given our assumption that the conditional covariance matrix does not depend on the linear nuisance parameters, computational shortcuts are still available, and confidence intervals can be calculated by running a linear program analogous to (19) for each $\beta_2$ and taking the maximum of the resulting values as the final upper bound.

(Footnote: $M_-' = [0, I_{k-1}]$, where $0$ is the zero vector and $I_{k-1}$ is the $(k-1)$-dimensional identity matrix.)

Our simulations are calibrated to Wollmann (2018)'s study of the bailouts of the GM and Chrysler truck divisions. To estimate the effect of the bailouts while allowing product repositioning, Wollmann needs to know the fixed cost of marketing a product. He obtains bounds based on conditional moment inequalities.

We adopt the notation of Example 3 above, so $J_{f,i,t}$ is the set of products marketed by firm $f$ in market $i$ in period $t$, and $\Delta\pi(J_{f,i,t}, J'_{f,i,t})$ is the difference in expected profits from marketing $J_{f,i,t}$ rather than $J'_{f,i,t}$. $J_{f,i,t}\setminus j$ and $J_{f,i,t}\cup j$ are the sets obtained by deleting and adding product $j$ from the set $J_{f,i,t}$, respectively.
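Returning briefly to implementation, the linear program (19) above for the upper endpoint of the confidence set can be sketched as follows. This is an illustrative sketch (the function name is ours, not the paper's); the lower endpoint solves the analogous minimization.

```python
import numpy as np
from scipy.optimize import linprog

def ci_upper_bound(Y, X, Sigma_diag, l, c_alpha):
    """Upper endpoint of the confidence set for l'delta, per the program in (19):
    maximize l'delta subject to (Y_j - X_j'delta)/sqrt(Sigma_jj) <= c_alpha."""
    s = np.sqrt(Sigma_diag)
    res = linprog(-l,                         # linprog minimizes, so negate l
                  A_ub=-X,                    # -X_j'delta <= c_alpha*s_j - Y_j
                  b_ub=c_alpha * s - Y,
                  bounds=[(None, None)] * X.shape[1],
                  method="highs")
    return -res.fun if res.success else np.inf  # unbounded: no finite endpoint
```

Because a single LP replaces grid test inversion, this is where the linear conditional structure delivers its computational payoff.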
Following Wollmann (2018), the fixed cost to firm $f$ of marketing product $j$ at time $t$ is $\beta(\delta_{c,f} + \delta_g g_j)$ if the product was marketed last year ($j \in J_{f,i,t-1}$), and $\delta_{c,f} + \delta_g g_j$ otherwise. Here $\delta_{c,f}$ is a per-product cost which is constant across products but may differ across firms, while $g_j$ is the gross weight rating of product $j$.

If we begin with the case where fixed costs are constant across firms ($\delta_{c,f} = \delta_c$ for all $f$) and again let $1\{\cdot\}$ denote the indicator function, we obtain four conditional moment inequalities by adding and subtracting one product at a time from the set marketed. For instance, similar to Example 3 above, if firm $f$ markets product $j$ at both $t-1$ and $t$, then for
$$m_1(\theta)_{j,f,i,t} \equiv -\big[\Delta\pi(J_{f,i,t}, J_{f,i,t}\setminus j) - (\delta_c + \delta_g g_j)\beta\big] \times 1\{j \in J_{f,i,t},\, j \in J_{f,i,t-1}\},$$
we must have $E\big[m_1(\theta)_{j,f,i,t} \mid V_{f,i,t}\big] \leq 0$ for all variables $V_{f,i,t}$ in the firm's information set when time-$t$ production decisions were made, since otherwise the firm would have chosen not to market product $j$ in period $t$. Analogously, considering products that were marketed at time $t$ but not time $t-1$ yields moment function
$$m_2(\theta)_{j,f,i,t} \equiv -\big[\Delta\pi(J_{f,i,t}, J_{f,i,t}\setminus j) - \delta_c - \delta_g g_j\big] \times 1\{j \in J_{f,i,t},\, j \notin J_{f,i,t-1}\},$$
while considering products not marketed at time $t$ yields moment functions
$$m_3(\theta)_{j,f,i,t} \equiv -\big[\Delta\pi(J_{f,i,t}, J_{f,i,t}\cup j) + (\delta_c + \delta_g g_j)\beta\big] \times 1\{j \notin J_{f,i,t},\, j \in J_{f,i,t-1}\},$$
$$m_4(\theta)_{j,f,i,t} \equiv -\big[\Delta\pi(J_{f,i,t}, J_{f,i,t}\cup j) + \delta_c + \delta_g g_j\big] \times 1\{j \notin J_{f,i,t},\, j \notin J_{f,i,t-1}\}.$$
If the observed data result from a Nash equilibrium, then $E\big[m_l(\theta)_{j,f,i,t} \mid V_{f,i,t}\big] \leq 0$ for $l \in \{1, 2, 3, 4\}$ and all variables $V_{f,i,t}$ in the firm's information set at the time of the decision.

We obtain two further conditional moment inequalities by considering heavier and lighter models than the firm actually marketed.
To state them formally, define
$$J_-(j, f, i, t) \equiv \{j' \mid g_{j'} < g_j,\; j' \notin J_{f,i,t},\; j' \notin J_{f,i,t-1}\},$$
$$J_+(j, f, i, t) \equiv \{j' \mid g_{j'} > g_j,\; j' \notin J_{f,i,t},\; j' \notin J_{f,i,t-1}\},$$
and let
$$m_5(\theta)_{j,f,i,t} \equiv -\left(\frac{\sum_{j' \in J_-(j,f,i,t)} \big[\Delta\pi(J_{f,i,t}, (J_{f,i,t}\setminus j)\cup j') - \delta_g(g_j - g_{j'})\big]}{|J_-(j, f, i, t)|}\right) \times 1\{j \in J_{f,i,t},\, j \notin J_{f,i,t-1}\},$$
$$m_6(\theta)_{j,f,i,t} \equiv -\left(\frac{\sum_{j' \in J_+(j,f,i,t)} \big[\Delta\pi(J_{f,i,t}, (J_{f,i,t}\setminus j)\cup j') + \delta_g(g_j - g_{j'})\big]}{|J_+(j, f, i, t)|}\right) \times 1\{j \in J_{f,i,t},\, j \notin J_{f,i,t-1}\}.$$

We calibrate our simulation designs using estimates based on Wollmann's data (for details see Appendix G). In each simulation draw we generate data from a cross-section of 500 independent markets. This is substantially larger than the 27 observations used by Wollmann, but allows us to consider specifications with a widely varying number of moments. As in Wollmann, $f \in \{1, \ldots, F\}$, and there are nine firms, so $F = 9$. To generate data we model the expected and observed profits for firm $f$ from marketing product $j$ in market $i$ in period $t$, denoted by $\pi^*_{j,f,i,t}$ and $\pi_{j,f,i,t}$ respectively, as
$$\pi^*_{j,f,i,t} = \eta_{j,i,t} + \epsilon_{j,f,i,t}, \quad\text{and}\quad \pi_{j,f,i,t} = \pi^*_{j,f,i,t} + \nu_{j,i,t} + \nu_{j,f,i,t},$$
where the $\nu$ terms are mean-zero disturbances that arise from expectational and measurement error, and the $\eta$ and $\epsilon$ terms represent product-, market-, and firm-specific components of expected profits.

(Footnote: The data in Wollmann (2018) are a time-series, but his variance estimates assume no serial correlation, so we adopt a simulation design consistent with this.)

The moments used to estimate our model are averages (over markets $i$) of
$$\sum_{j=1}^{J} \big(m_l(\theta)_{j,f,i} \otimes \tilde Z_{j,f,i}\big)', \tag{20}$$
where we also average over all firms assumed to share the same fixed cost $\delta_{c,f}$. Since we consider a single cross-section of markets we suppress the time subscript.
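A minimal sketch of forming the interacted sample moments in (20) follows; the array shapes and function name are hypothetical illustrations, not the paper's code.

```python
import numpy as np

def stacked_moments(m, z):
    """Average over markets i and products j of the interaction m (x) z, as in
    (20).  m: (markets, J, n_cond_moments); z: (markets, J, n_instruments).
    Because the instrument functions are nonnegative, the interacted moments
    inherit the direction of the conditional inequalities."""
    inter = np.einsum("ijl,ijk->ijlk", m, z)    # outer product for each (i, j)
    return inter.mean(axis=(0, 1)).reshape(-1)  # flattened moment vector
```

The number of entries in the output, (conditional moments) × (instruments), is what drives the moment counts across specifications below.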
We present results both for the case where $\tilde Z_{j,f,i}$ includes only a constant and for the case where the last two moments are interacted with a constant but the first four moments are interacted with both a constant and the common profit-shifters $\eta$, $\tilde Z_{j,f,i} = (1, \eta^+_{j,i}, \eta^-_{j,i})$, for $q^+ = \max\{q, 0\}$ and $q^- = -\min\{q, 0\}$. In the model with a single constant term ($\delta_{c,f} = \delta_c$ for all $f$), this generates 6 and 14 moment inequalities, respectively. We also present results when the nine firms are divided into three groups, each with a separate constant term, and when each firm has a separate constant term. For each specification we consider the first four moments separately for the firm(s) associated with distinct parameters $\delta_{c,f}$, but average the last two moments across all firms, as they do not depend on the constant terms. This generates 14 and 38 moments for the three-group classification, and 38 and 110 moments when each firm has a separate constant term. To estimate the conditional variance $\Sigma$, in each specification we define the value of the instrument $Z_i$ in market $i$ as the Jacobian of (20) with respect to the linear parameters $(\delta_g, \{\delta_{c,f}\})$.

(Footnote: The terms $\eta_{j,i,t}$ and $\nu_{j,i,t}$ reflect product/market/time "shocks" that are known and unknown to the firms, respectively, when they make their decisions. Shocks of this sort are an important aspect of Wollmann's setting. Note that Wollmann also estimates (point-identified) demand and variable cost parameters in a first step, while for simplicity we treat the variable profits $\pi_{j,f,i,t}$ as known to the econometrician.)

(Footnote: When we assume $\delta_{c,f}$ is common across firms this is $\delta_c + \delta_g\mu_g$, where $\mu_g$ is the population average weight of trucks. When we allow the estimated $\delta_c$ parameters to vary across groups, we estimate $l'\delta$ for $l = (G^{-1}, \ldots, G^{-1}, \mu_g)'$, where $G$ denotes the number of groups and $\delta = (\delta_{c,1}, \ldots, \delta_{c,G}, \delta_g)'$. Note that since the simulation DGP holds the true value of $\delta_c$ constant across groups, the true value of the parameter is the same in all specifications.)

We consider inference on three parameters of interest: the cost of marketing the truck of mean weight when it was marketed in the prior year; the incremental cost
of changing the weight of a product, $\delta_g$; and the non-linear parameter $\beta$, where $1-\beta$ represents the proportional cost savings from marketing a product that was previously marketed relative to a new product. For the first two parameters, each of which can be written as a linear combination of the vector $\delta$, we hold $\beta$ fixed at its true value to allow us to examine performance in the linear case discussed in Section 6.6. As discussed in Section 6.5, if we instead treated $\beta$ as unknown we could form joint confidence sets for $\beta$ along with the linear combination of interest, and could form confidence sets for the linear parameter alone by projection. For inference on $\beta$ we treat the entire vector $\delta$ as a nuisance parameter. All results are based on 500 simulation runs.

We begin our discussion of the results with Figure 1, which shows rejection probabilities for the cost of the mean-weight truck. The vertical dashed lines denote the conservative estimates for the bounds of the identified set, and the four curves represent the probability that each of the four methods considered rejects a given null value of the parameter of interest. There is a clear ranking of the power of the LFP, LF, and hybrid procedures in Figure 1. In all specifications, the LF test has noticeably higher power than the LFP. The hybrid test has power comparable to or above the LF test in all specifications, with substantial differences emerging in cases with a larger number of moments and parameters. The performance of the conditional test is more nuanced.
When the number of moments per parameter is small, the conditional test performs very similarly to the hybrid, and is at least as good as the LF and LFP. When we increase the number of moments holding the number of parameters fixed, the conditional test again performs similarly to the hybrid for parameter values close to the identified set bounds, but can have power substantially below any of the other methods far away from the identified set (see for instance Panel (d) of Figure 1).

The power declines for the conditional test reflect that the set of binding moments is not well-separated in this example. In particular, one can show that this is the case in this simulation design. As noted in Section 5, the conditional test may perform poorly in such settings, and this prediction is borne out in this application. Our hybrid test eliminates these problems, as intended.

(Footnote: Note that all of our simulation results in this section hold the data generating process constant but vary the parameter values considered. Hence, the curves plotted should be interpreted as rejection probabilities for tests of different null hypotheses, or one minus the coverage probability for confidence sets.)

(Footnote: We cannot solve for the true identified set analytically, so we approximate it by the set satisfying the sample (unconditional) moment inequalities based on a simulation run with five million observations. To ensure that our estimate of the identified set is conservative, we follow Chernozhukov et al. (2007) and add a correction factor to the moments of $\log(n)/\sqrt{n} \approx 0.007$ when $n = 5{,}000{,}000$. Hence, our estimate of the identified set is conservative in these simulations due to both (a) the Chernozhukov et al. (2007) correction factor and (b) the use of unconditional rather than conditional moment inequalities.)

Figure 2 reports rejection probabilities for testing hypotheses on the nonlinear parameter $\beta$.
Unlike in our simulations for the linear parameters, when testing nonlinear parameters it is sometimes the case that no procedure has rejection probability going to one over the grid we consider, though this phenomenon disappears in all but the conditional power curves when we interact the conditional moments with the profit shifters $(\eta^+_{j,i}, \eta^-_{j,i})$. Regardless, we see that the LF test has higher power than the LFP, and that the power of the hybrid test is higher still. The conditional test performs reasonably well in cases with a small number of moments and parameters (e.g. in Panel (a)), but it has power well below any of the other tests considered at many parameter values in some cases with more moments and/or parameters.

Rejection probabilities for testing hypotheses on $\delta_g$ are similar to those for testing the cost of the average-weight truck, though with better performance for the conditional test, and so are reported in Appendix G to conserve space. One notable feature of these results is that the identified set for $\delta_g$ does not change across specifications, so unlike for our analysis of the other parameters, the specifications with more than six moments are adding moments and nuisance parameters without changing the identified set. The results in this case confirm that the hybrid approach appears less sensitive to the addition of parameters and slack moments than the LF or LFP.

Table 1 reports the size (formally, the maximal null rejection probability over the estimated identified set) for all the tests considered. As expected, all tests approximately control size, with the maximal null rejection probabilities for nominal 5% tests bounded above by 8%, and this bound is reached only in cases with 110 moments.

(Footnote: Our estimates for the identified set are conservative, so these rejection probabilities should, if anything, overestimate the true maximal rejection probability.)

(Footnote: Less frequently, we have multiple exact solutions, in which case we apply the dual approach.)
(Footnote: We also ran simulations defining the identified set without the conservative Chernozhukov et al. (2007) correction factor, and the only designs for which this resulted in a difference of maximal rejection probabilities of more than 0.01 were two of the runs with 110 moments, where the bounds with the correction implied probabilities of 0.07 and 0.08, compared to 0.02 and 0.01 without the correction.)

[Figure 1: Rejection probabilities for the cost of the mean-weight truck, for the LFP, LF, Conditional, and Hybrid tests, with vertical lines marking the estimated identified set bounds. Panels: (a) 2 parameters, 6 moments; (b) 2 parameters, 14 moments; (c) 4 parameters, 14 moments; (d) 4 parameters, 38 moments; (e) 10 parameters, 38 moments; (f) 10 parameters, 110 moments.]

[Figure 2: Rejection probabilities for the nonlinear parameter $\beta$, for the LFP, LF, Conditional, and Hybrid tests, with vertical lines marking the estimated identified set bounds. Panels: (a) 3 parameters, 6 moments; (b) 3 parameters, 14 moments; (c) 5 parameters, 14 moments; (d) 5 parameters, 38 moments; (e) 11 parameters, 38 moments; (f) 11 parameters, 110 moments.]
[Table 1: Median excess length of confidence sets and maximum size (maximal null rejection probability over the estimated identified set) for the LFP, LF, Conditional, and Hybrid tests, by specification (parameters × moments), for three parameters: (a) the cost of the mean-weight truck, (b) $\delta_g$, and (c) $\beta$.]

We stress, however, that at least for the simulation designs we consider, all four procedures remain highly tractable, and runtimes could be improved using parallelization.

(Footnote: If computation times are an issue for the hybrid, the LF first stage can be replaced with an LFP first stage, yielding a faster but somewhat less powerful test.)
[Table 2: Runtime in minutes to compute confidence sets for each procedure (LFP, LF, Conditional, Hybrid), by specification, for (a) the cost of the mean-weight truck, (b) $\delta_g$, and (c) $\beta$.]

This table shows runtimes to calculate confidence sets based on one simulated dataset for each specification, without parallelization, on a 2014 Macbook Pro with a 2.6 GHz Intel i5 Processor and 16GB of RAM. For the linear parameters (Panels a and b), the confidence sets for the LF and LFP are computed using linear programming, as described in Section 6.6, and we use a grid of 1,001 parameter values for the hybrid and conditional approaches. For the non-linear parameter $\beta$, all four procedures use a grid of length 100. See Appendix G for additional details on the simulation specification.

Conclusion
This paper considers the problem of inference based on linear conditional moment inequalities, which arise in a wide variety of economic applications. Using linear conditional structure, we develop inference procedures which remain both computationally tractable and powerful in the presence of nuisance parameters, including conditional and hybrid procedures which are insensitive to the presence of slack moments. We find good performance for our least favorable, conditional, and hybrid procedures under a variety of simulation designs based on Wollmann (2018), with especially good performance for the hybrid.
References
Abadie, A., Imbens, G. W. & Zheng, F. (2014), 'Inference for misspecified models with fixed regressors', Journal of the American Statistical Association (508), 1601–1614.

Andrews, D. W. & Barwick, P. J. (2012), 'Inference for parameters defined by moment inequalities: A recommended moment selection procedure', Econometrica (6), 2805–2826.

Andrews, D. W. & Guggenberger, P. (2009), 'Hybrid and size-corrected subsampling methods', Econometrica (3), 721–762.

Andrews, D. W. & Shi, X. (2013), 'Inference based on conditional moment inequalities', Econometrica (2), 609–666.

Andrews, D. W. & Soares, G. (2010), 'Inference for parameters defined by moment inequalities using generalized moment selection', Econometrica (1), 119–159.

Andrews, I., Kitagawa, T. & McCloskey, A. (2018), Inference on winners. Working Paper.

Armstrong, T. B. (2014a), A note on minimax testing and confidence intervals in moment inequality models. Working Paper.

Armstrong, T. B. (2014b), 'Weighted KS statistics for inference on conditional moment inequalities', Journal of Econometrics (2), 92–116.

Barseghyan, L., Coughlin, M., Molinari, F. & Teitelbaum, J. C. (2019), Heterogeneous choice sets and preferences. Working Paper.

Belloni, A., Bugni, F. & Chernozhukov, V. (2018), Subvector inference in PI models with many moment inequalities. Working Paper.

Beresteanu, A. & Molinari, F. (2008), 'Asymptotic properties for a class of partially identified models', Econometrica (4), 763–814.

Blundell, R., Gosling, A., Ichimura, H. & Meghir, C. (2007), 'Changes in the distribution of male and female wages accounting for employment composition', Econometrica, 323–363.

Bontemps, C., Magnac, T. & Maurin, E. (2012), 'Set identified linear models', Econometrica (3), 1129–1155.

Botev, Z. I. (2017), 'The normal law under linear restrictions: simulation and estimation via minimax tilting', Journal of the Royal Statistical Society: Series B (Statistical Methodology) (1), 125–148.

Bugni, F., Canay, I. & Shi, X. (2017), 'Inference for subvectors and other functions of partially identified parameters in moment inequality models', Quantitative Economics (1), 1–38.

Canay, I. & Shaikh, A. (2017), Practical and theoretical advances in inference for partially identified models, in B. Honore, A. Pakes, M. Piazessi & L. Samuelson, eds, 'Advances in Economics and Econometrics', Cambridge University Press.

Chen, X., Christensen, T. & Tamer, E. (2018), 'Monte Carlo confidence sets for identified sets', Econometrica (6), 1965–2018.

Chernozhukov, V., Hong, H. & Tamer, E. (2007), 'Estimation and confidence regions for parameter sets in econometric models', Econometrica (5), 1243–1284.

Chernozhukov, V., Newey, W. & Santos, A. (2015), Constrained conditional moment restriction models. Working Paper.

Chetty, R. (2012), 'Bounds on elasticities with optimization frictions: A synthesis of micro and macro evidence on labor supply', Econometrica (3), 969–1018.

Chetverikov, D. (2018), 'Adaptive test of conditional moment inequalities', Econometric Theory (1), 186–227.

Cho, J. & Russell, T. M. (2019), Simple inference on functionals of set-identified parameters defined by linear moments. Working Paper.

Ciliberto, F. & Tamer, E. (2009), 'Market structure and multiple equilibria in airline markets', Econometrica (6), 1791–1828.

Cox, G. & Shi, X. (2019), A simple uniformly valid test for inequalities. Working Paper.

Dickstein, M. & Morales, E. (2018), 'What do exporters know?', Quarterly Journal of Economics (4), 1753–1801.

Eizenberg, A. (2014), 'Upstream innovation and product variety in the U.S. home PC market', Review of Economic Studies (3), 1003–1045.

Flynn, Z. (2019), Inference based on continuous linear inequalities via semi-infinite programming. Working Paper.

Gafarov, B. (2019), Inference in high-dimensional set-identified affine models. Working Paper.

Haile, P. A. & Tamer, E. (2003), 'Inference with an incomplete model of English auctions', Journal of Political Economy (1), 1–51.

Ho, K. & Pakes, A. (2014), 'Hospital choices, hospital prices and financial incentives to physicians', American Economic Review (12), 3841–84.

Ho, K. & Rosen, A. (2017), Partial identification in applied research, in B. Honore, A. Pakes, M. Piazessi & L. Samuelson, eds, 'Advances in Economics and Econometrics', Cambridge University Press.

Holmes, T. (2011), 'The diffusion of Wal-Mart and economies of density', Econometrica (1), 253–302.

Honore, B. & Lleras-Muney, A. (2006), 'Bounds in competing risks models and the war on cancer', Econometrica (6), 1675–1698.

Hsieh, Y.-W., Shi, X. & Shum, M. (2017), Inference on estimators defined by mathematical programming. Working Paper.

Kaido, H., Molinari, F. & Stoye, J. (2019a), 'Confidence intervals for projections of partially identified parameters', Econometrica (4), 1397–1432.

Kaido, H., Molinari, F. & Stoye, J. (2019b), Online appendix to "Confidence intervals for projections of partially identified parameters". Supplementary Material.

Katz, M. (2007), Supermarkets and zoning laws. Ph.D. dissertation, Harvard University.

Khan, S., Ponomareva, M. & Tamer, E. (2019), Identification of dynamic panel binary response models. Working Paper.

Kline, P. & Tartari, M. (2016), 'Bounding the labor supply response to a randomized welfare experiment: A revealed preference approach', American Economic Review (4), 972–1014.

Kreider, B., Pepper, J., Gundersen, C. & Jolliffe, D. (2012), 'Identifying the effects of SNAP (food stamps) on child health outcomes when participation is endogenous and misreported', Journal of the American Statistical Association, 958–975.

Lee, J. D., Sun, D. L., Sun, Y. & Taylor, J. E. (2016), 'Exact post-selection inference, with application to the lasso', Annals of Statistics (3), 907–927.

Manski, C. F. & Tamer, E. (2002), 'Inference on regressions with interval data on a regressor or outcome', Econometrica (2), 519–546.

Mogstad, M., Santos, A. & Torgovitsky, A. (2018), 'Using instrumental variables for inference about policy relevant treatment parameters', Econometrica (5), 1589–1619.

Molinari, F. (2019), Econometrics with partial identification. Working Paper.

Morales, E., Sheu, G. & Zahler, A. (2019), 'Extended gravity', Review of Economic Studies, Forthcoming.

Pakes, A. (2010), 'Alternative models for moment inequalities', Econometrica (6), 1783–1822.

Pakes, A., Porter, J., Ho, K. & Ishii, J. (2015), 'Moment inequalities and their application', Econometrica (1), 315–334.

Ponomareva, M. & Tamer, E. (2011), 'Misspecification in moment inequality models: Back to moment equalities?', Econometrics Journal (2), 186–203.

Romano, J. P. & Shaikh, A. (2008), 'Inference for identifiable parameters in partially identified econometric models', Journal of Statistical Planning and Inference (9), 2786–2807.

Romano, J. P., Shaikh, A. & Wolf, M. (2014a), 'A practical two-step method for testing moment inequalities', Econometrica (5), 1979–2002.

Romano, J. P., Shaikh, A. & Wolf, M. (2014b), 'Supplement to "A practical two-step method for testing moment inequalities"', Econometrica.

Rosen, A. (2008), 'Confidence sets for partially identified parameters that satisfy a finite number of moment inequalities', Journal of Econometrics (1), 107–117.

Schrijver, A. (1986), Theory of Linear and Integer Programming, Wiley-Interscience.

Tebaldi, P., Torgovitsky, A. & Yang, H. (2019), Nonparametric estimates of demand in the California health insurance exchange. Working Paper.

Torgovitsky, A. (2019), 'Partial identification by extending subdistributions', Quantitative Economics (1), 105–144.

Wollmann, T. (2018), 'Trucks without bailouts: Equilibrium product characteristics for commercial vehicles', American Economic Review (6), 1364–1406.

Supplement to the paper
Inference for Linear Conditional
Moment Inequalities
Isaiah Andrews Jonathan Roth Ariel Pakes
September 24, 2019

This supplement contains proofs and additional results for the paper "Inference for Linear Conditional Moment Inequalities." Section A discusses results for an alternative formulation of the conditional approach based on the dual linear program, which allows the possibility of non-unique or degenerate solutions. Section B develops some additional results for the dual problem used in Section A. All proofs for the finite-sample normal model are collected in Section C. Section D states our asymptotic results, while proofs for these results are given in Section E. Section F provides simulation results for our tests in a simple example without nuisance parameters, while Section G provides additional details and results for the simulation designs discussed in Section 7 of the main text. Finally, Section H discusses a bisection algorithm for computing bounds used in the dual conditioning approach.
A Conditional Inference Based on the Dual
This section describes a conditioning approach based on a dual linear program which can be applied even in settings where the linear program (12) has a non-unique or degenerate solution, but which is equivalent to the primal conditioning approach described in the main text when the solution to (12) is unique and non-degenerate. To formally describe the dual approach, we first define the dual linear program.
Lemma 8
When $\hat\eta$ as defined in Lemma 3 is finite, it is equal to
$$\max_\gamma \; \gamma' Y_n \;\text{ subject to }\; \gamma \geq 0,\; W_n'\gamma = e_1, \tag{21}$$
for $W_n$ the matrix with row $j$ equal to $W_{n,j} = \big(\sqrt{\Sigma_{jj}},\, X_{n,j}\big)$ and $e_1 = (1, 0, \ldots, 0)'$ the first standard basis vector.

The set of solutions to the dual linear program is
$$\hat\Gamma = \{\gamma : \hat\eta = \gamma'Y_n,\; \gamma \geq 0,\; W_n'\gamma = e_1\}.$$
This set is defined by a collection of linear equalities and inequalities and so is a polytope. Our dual approach conditions on the set of vertices $\hat V$ of $\hat\Gamma$. Results in the next section show that this set of solution vertices has finite support, and that any pair of possible vertices $\gamma_1, \gamma_2$ arise together with probability either zero or one,
$$Pr_{\mu_n}\big\{\{\gamma_1, \gamma_2\} \subseteq \hat V\big\} \in \{0, 1\}.$$
Thus, conditioning on a given value for the set of vertices, $\hat V = V$, is equivalent to conditioning on $\gamma \in \hat V$ for any $\gamma \in V$, up to sets of measure zero. We thus consider inference conditional on $\gamma \in \hat V$. We further discuss the set of vertices $\hat V$ and its properties in the next section.

As before, the distribution of $\hat\eta$ conditional on $\gamma \in \hat V$ will in general depend on the full vector $\mu_n$, rather than just on $\gamma'Y_n$. To eliminate dependence on $\mu_n$ other than through $\gamma'Y_n$ we again condition on a sufficient statistic for the rest of the vector $\mu_n$,
$$S_{n,\gamma} = \Big(I - \frac{\Sigma\gamma\gamma'}{\gamma'\Sigma\gamma}\Big) Y_n,$$
which coincides with $S_{n,B}$ defined in the main text for $\gamma = \gamma_{n,B}$. We obtain the following conditional distribution for $\hat\eta$:

Lemma 9
The conditional distribution of $\hat\eta$ given $\gamma\in\hat V$ and $S_{n,\gamma} = s$ is truncated normal,
$$\hat\eta \mid \big\{S_{n,\gamma} = s,\ \gamma\in\hat V\big\} \sim \xi \mid \xi\in\big[\mathcal V^{lo}(s), \mathcal V^{up}(s)\big] \quad\text{for}\quad \xi\sim N(\gamma'\mu_n,\ \gamma'\Sigma\gamma),$$
$$\mathcal V^{lo}(s) = \min\Big\{c : c = \max_{\tilde\gamma}\tilde\gamma'\Big(s + \frac{\Sigma\gamma}{\gamma'\Sigma\gamma}c\Big)\ \text{subject to}\ \tilde\gamma\ge0,\ W_n'\tilde\gamma = e_1\Big\} \qquad(22)$$
and
$$\mathcal V^{up}(s) = \max\Big\{c : c = \max_{\tilde\gamma}\tilde\gamma'\Big(s + \frac{\Sigma\gamma}{\gamma'\Sigma\gamma}c\Big)\ \text{subject to}\ \tilde\gamma\ge0,\ W_n'\tilde\gamma = e_1\Big\}, \qquad(23)$$
provided $s$ is such that the set on the right hand side of (22) is nonempty.

When the solution $\gamma$ to the dual satisfies additional conditions, the truncation points $\mathcal V^{lo}$ and $\mathcal V^{up}$ in Lemma 9 are the same as those obtained in the primal problem.

Proposition 5
Suppose there exists $\gamma\in\hat V$ with exactly $p+1$ strictly positive entries. Let $B$ denote the set of rows for these entries, and suppose that $B$ corresponds to linearly independent rows of $W_n$. Then there exists a solution to the primal problem (12) with the moments $B$ binding, $\gamma = \gamma_{n,B}$ as defined in Lemma 6, and the definition of $\mathcal V^{lo}$ and $\mathcal V^{up}$ in equations (14) and (15) coincides with that in equations (22) and (23).

The conditions on $\gamma$ in this proposition are implied by existence of a unique, non-degenerate solution to the primal problem.

Lemma 10
If there is a unique, non-degenerate solution $(\hat\eta, \hat\delta')'$ to the primal problem (12), any solution $\hat\gamma\in\hat\Gamma$ to the dual problem satisfies the conditions of Proposition 5.

This result suggests a straightforward way to proceed in practice. The widely-used dual-simplex algorithm for solving the primal problem (12) automatically generates a vertex $\hat\gamma\in\hat V$ of the dual solution set as well. To determine how to calculate the truncation points $\mathcal V^{lo}$ and $\mathcal V^{up}$, we can thus simply check whether the conditions of Proposition 5 hold at this solution. If they do, we can calculate $\mathcal V^{lo}$ and $\mathcal V^{up}$ using the closed-form expressions given in Lemma 6; otherwise we can use (22) and (23). In the latter case, one can show that the set on the right hand side of (22) is convex, so we can quickly find lower and upper bounds using e.g. the bisection method (see Section H); see Section 6 in the main text for further discussion of implementation.

Going forward we consider the conditional test
$$\phi^C = 1\big\{\hat\eta > c_{\alpha,C}\big(\hat\gamma,\ \mathcal V^{lo}(S_{n,\hat\gamma}),\ \mathcal V^{up}(S_{n,\hat\gamma}),\ \Sigma\big)\big\}.$$
If the solution to (12) is unique and non-degenerate, this test coincides with (17).

Conditional and Unconditional Size Control
Now that we have formulated the conditional test in the general case, we can establish conditional and unconditional size control.
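As an illustration of the implementation path just described, the sketch below solves the dual program (21) with scipy's dual-simplex solver (which returns a vertex of the solution set), computes the truncation points in (22)–(23) by bisection, and evaluates the conditional cutoff as the $1-\alpha$ quantile of the truncated normal from Lemma 9 at mean zero. All function names are ours and the example data are made up; this is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import truncnorm

def solve_dual(Y, W):
    """Dual program (21): max_g g'Y s.t. g >= 0, W'g = e_1.
    The HiGHS dual-simplex solver returns a vertex gamma_hat."""
    k, p1 = W.shape
    e1 = np.zeros(p1)
    e1[0] = 1.0
    res = linprog(c=-Y, A_eq=W.T, b_eq=e1,
                  bounds=[(0, None)] * k, method="highs-ds")
    assert res.success
    return -res.fun, res.x

def truncation_points(s, gamma, Sigma, W, eta_hat, big=1e6, tol=1e-8):
    """Bisection for (22)-(23). The function
    g(c) = max{g~'(s + rho c) : g~ >= 0, W'g~ = e_1} - c
    is convex and nonnegative, its zero set is [Vlo, Vup],
    and g(eta_hat) = 0 by construction."""
    rho = Sigma @ gamma / (gamma @ Sigma @ gamma)

    def g(c):
        val, _ = solve_dual(s + rho * c, W)
        return val - c

    def edge(outside):
        inside = eta_hat                      # a point where g vanishes
        for _ in range(80):                   # shrink to boundary of {g <= tol}
            mid = 0.5 * (inside + outside)
            inside, outside = (mid, outside) if g(mid) <= tol else (inside, mid)
        return inside

    vlo = -np.inf if g(-big) <= tol else edge(-big)
    vup = np.inf if g(big) <= tol else edge(big)
    return vlo, vup

def conditional_test(Y, W, Sigma, alpha=0.05):
    """phi^C: reject when eta_hat exceeds the 1-alpha quantile of
    N(0, gamma' Sigma gamma) truncated to [Vlo, Vup] (Lemma 9, mu = 0)."""
    eta, gamma = solve_dual(Y, W)
    s = Y - Sigma @ gamma / (gamma @ Sigma @ gamma) * (gamma @ Y)
    vlo, vup = truncation_points(s, gamma, Sigma, W, eta)
    sd = np.sqrt(gamma @ Sigma @ gamma)
    crit = sd * truncnorm.ppf(1 - alpha, vlo / sd, vup / sd)
    return eta > crit, eta, crit
```

With $Y_n = (2,1)'$, $W_n = (1,1)'$ (no nuisance parameters), and $\Sigma = I$, the dual returns $\hat\eta = 2$ and $\hat\gamma = (1,0)'$, the bisection gives $\mathcal V^{lo} = 1$ and $\mathcal V^{up} = +\infty$, and the resulting cutoff is about 2.41, so $\phi^C$ does not reject at $\alpha = 0.05$.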
Proposition 6
Under Assumption 1, the conditional test $\phi^C$ has size $\alpha$ both conditional on $\gamma\in\hat V$,
$$\sup_{\mu_n\in\mathcal M}E_{\mu_n}\big[\phi^C\mid\gamma\in\hat V\big] = E_0\big[\phi^C\mid\gamma\in\hat V\big] = \alpha$$
for all $\gamma$ such that $Pr_{\mu_n}\{\gamma\in\hat V\} > 0$, and unconditionally,
$$\sup_{\mu_n\in\mathcal M}E_{\mu_n}[\phi^C] = E_0[\phi^C] = \alpha.$$

Size Control for Hybrid Tests
We can likewise show that the hybrid test based on the dual formulation controls size. As before, hybrid tests reject when $\hat\eta > c_{\kappa,LF}(X_n,\Sigma)$, and otherwise modify the upper bound to
$$\mathcal V^{up,H}(s) = \min\{\mathcal V^{up}(s),\ c_{\kappa,LF}(X_n,\Sigma)\},$$
yielding the test
$$\phi^H = 1\big\{\hat\eta > c_{\frac{\alpha-\kappa}{1-\kappa},C}\big(\hat\gamma,\ \mathcal V^{lo}(S_{n,\hat\gamma}),\ \mathcal V^{up,H}(S_{n,\hat\gamma}),\ \Sigma\big)\big\}.$$

Proposition 7
Under Assumption 1, the hybrid test $\phi^H$ has size $\frac{\alpha-\kappa}{1-\kappa}$ conditional on $\hat\eta\le c_{\kappa,LF}(X_n,\Sigma)$ and $\gamma\in\hat V$ for all $\gamma$ such that $Pr_{\mu_n}\{\hat\eta\le c_{\kappa,LF}(X_n,\Sigma),\ \gamma\in\hat V\} > 0$,
$$\sup_{\mu_n\in\mathcal M}E_{\mu_n}\big[\phi^H\mid\hat\eta\le c_{\kappa,LF}(X_n,\Sigma),\ \gamma\in\hat V\big] = E_0\big[\phi^H\mid\hat\eta\le c_{\kappa,LF}(X_n,\Sigma),\ \gamma\in\hat V\big] = \frac{\alpha-\kappa}{1-\kappa},$$
and has unconditional size $\alpha$,
$$\sup_{\mu_n\in\mathcal M}E_{\mu_n}[\phi^H] = E_0[\phi^H] = \alpha.$$

Properties of the Dual Solution Vertices $\hat V$

In this section we further discuss the set of solution vertices $\hat V$ used in the dual conditioning approach. As noted above, the set of solutions $\gamma$ to the dual problem is the polytope
$$\hat\Gamma = \{\gamma : \hat\eta = \gamma'Y_n,\ \gamma\ge0,\ W_n'\gamma = e_1\}.$$
Letting $\hat V = \hat V(Y_n, W_n)$ denote the set of vertices of $\hat\Gamma$, and
$$\hat C = \hat C(Y_n, W_n) = \{\gamma : \gamma'Y_n = 0,\ \gamma\ge0,\ W_n'\gamma = 0\}$$
the characteristic cone of $\hat\Gamma$, we can write $\hat\Gamma = CH(\hat V) + \hat C$ for $CH(A)$ the convex hull of a set $A$, where we use $B + D$ to denote the Minkowski sum of the sets $B$ and $D$ (see e.g. Chapter 8.2 of Schrijver (1986)). Let us further define the set of values $\gamma$ satisfying the constraints in (21) (often called the feasible set) as $F = \{\gamma : \gamma\ge0,\ W_n'\gamma = e_1\}$. The set $F$ is again a polytope. Let $V_F$ denote the vertices of $F$, often called the basic feasible solutions to the linear program (21). Any vertex of $\hat\Gamma$ must also be a vertex of $F$ (see e.g. Chapter 8.3 of Schrijver (1986)), so $\hat V\subseteq V_F$. We can view $\hat V$ as a random variable with support contained in the (finite) power set of $V_F$.

Lemma 10 and Proposition 5 above show that when the primal problem has a unique and non-degenerate solution, conditioning on the set of vertices $\hat V$ is equivalent to conditioning on the set of binding moments in the primal problem.
In more general cases, however, conditioning on $\hat V$ rather than the set of binding moments resolves a number of difficulties. Specifically, when there are multiple solutions to the primal problem, approaches that condition on the set of binding moments face the question of which set(s) of binding moments to use. By contrast, our results show that the presence of multiple solutions to the dual raises no difficulties when we condition on $\hat V$. As another alternative, rather than conditioning on $\hat V$, one might instead condition on the full solution set $\hat\Gamma$ or, equivalently, on $\hat C$ in addition to $\hat V$.
Such conditioning is unnecessary to obtain tractable tests, however, and would further reduce the variation in the data usable for inference. We thus do not pursue this possibility.

The problem of conditioning on $\hat V$ is greatly simplified by the fact that the support of $\hat V$ is finite and consists of pairwise disjoint sets.

Lemma 11
There is a finite collection of sets $\mathcal V = \{V_1, V_2, \dots, V_m\}$, with $V_j\subseteq V_F$ for all $j$, such that $Pr_{\mu_n}\{\hat V\in\mathcal V\} = 1$, $Pr_{\mu_n}\{\hat V = V_j\} > 0$ for all $j$, and $V_j\cap V_k = \emptyset$ for all $j\ne k$.

This result simplifies the problem of conditioning on $\hat V$, since for any $\gamma\in V_j\in\mathcal V$ the event $\gamma\in\hat V$ is equivalent to the event $\hat V = V_j$. Thus, in order to construct conditional tests it is enough to find a single vertex $\hat\gamma$ of $\hat V$, rather than fully characterizing $\hat V$. The widely used dual-simplex method for solving linear programs finds such a vertex.

C Proofs for Finite-Sample Normal Model
Proof of Lemma 1
Follows immediately from the Lindeberg-Feller central limit theorem (see e.g. Proposition 2.27 in Van der Vaart (2000)). □
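As an aside, the Lindeberg-Feller theorem invoked here is what accommodates observations with non-identical conditional distributions. A toy simulation (ours, purely illustrative, not part of the proof) of a standardized sum of independent, heteroskedastic, non-normal draws:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 5000
sd = 1.0 + (np.arange(n) % 3)        # heteroskedastic: sd cycles through 1, 2, 3
# independent, non-identical, non-normal draws with mean 0 and variance sd^2
draws = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(reps, n)) * sd
# standardized sum: approximately N(0, 1) by Lindeberg-Feller
T = draws.sum(axis=1) / np.sqrt((sd ** 2).sum())
```

The sample mean and standard deviation of `T` come out near 0 and 1, consistent with the normal approximation.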
Proof of Lemma 2
Immediate from the central limit theorem for iid data (see e.g. Proposition 2.17 in Van der Vaart (2000)). □
Proof of Lemma 3
By the definition of the maximum, $S(Y_n - X_n\delta, \Sigma)$ is equal to the smallest value $\eta$ satisfying
$$(Y_{n,j} - X_{n,j}\delta)/\sqrt{\Sigma_{jj}}\le\eta\quad\forall j.$$
The result of the lemma follows immediately. □
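The equivalence used here, between the profiled min-max statistic and the value of the linear program, can be checked numerically on made-up data (an illustrative sketch; the variable names are ours):

```python
import numpy as np
from scipy.optimize import linprog

Y = np.array([1.0, 2.0])              # Y_n
X = np.array([[1.0], [-1.0]])         # X_n (one nuisance parameter delta)
sig = np.array([1.0, 1.0])            # diag(Sigma)

# Direct evaluation of min_delta max_j (Y_j - X_j delta) / sqrt(Sigma_jj)
grid = np.linspace(-5, 5, 100001)
direct = min(max((Y - X @ [d]) / np.sqrt(sig)) for d in grid)

# LP form: min eta over (eta, delta) s.t. Y_j <= sqrt(Sigma_jj) eta + X_j delta
W = np.column_stack([np.sqrt(sig), X])
res = linprog(c=[1.0, 0.0], A_ub=-W, b_ub=-Y,
              bounds=[(None, None)] * 2, method="highs")
# both equal 1.5 here, attained at delta = -0.5
```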
Proof of Proposition 1
To prove this result, we note first that $\min_\delta S(Y_n - X_n\delta, \Sigma)$ is invariant to shifts of $Y_n$ by $X_n\tilde\delta$, in the sense that
$$\min_\delta S(Y_n - X_n\delta, \Sigma) = \min_\delta S\big(Y_n + X_n\tilde\delta - X_n\delta, \Sigma\big)\quad\text{for all }\tilde\delta.$$
From this, we see immediately that $c_\alpha(\mu_n, X_n, \Sigma)$ is also invariant, in the sense that
$$c_\alpha(\mu_n, X_n, \Sigma) = c_\alpha\big(\mu_n + X_n\tilde\delta, X_n, \Sigma\big)\quad\text{for all }\tilde\delta. \qquad(24)$$
Next, we note that $\min_\delta S(Y_n - X_n\delta, \Sigma)$ is elementwise nondecreasing in $Y_n$, and thus that $c_\alpha(\mu_n, X_n, \Sigma)$ is elementwise nondecreasing in $\mu_n$.

To complete the proof, we first argue that
$$\{c_\alpha(\mu_n, X_n, \Sigma) : \mu_n\in\mathcal M\} = \{c_\alpha(\mu_n, X_n, \Sigma) : \mu_n\le0\}, \qquad(25)$$
so the set of critical values for $\mu_n$ consistent with the null is equal to the set of critical values consistent with $\mu_n\le0$. To see that this is the case, consider any $\mu_n\in\mathcal M$, and note that by the definition of $\mathcal M$ there exists $\delta(\mu_n)$ such that $\mu_n - X_n\delta(\mu_n)\le0$. By (24) above, however, this means that $c_\alpha(\mu_n, X_n, \Sigma) = c_\alpha(\mu_n - X_n\delta(\mu_n), X_n, \Sigma)$. Since $\mu_n - X_n\delta(\mu_n)\le0$, and we can repeat this argument for all $\mu_n\in\mathcal M$, we see that $\{c_\alpha(\mu_n, X_n, \Sigma) : \mu_n\in\mathcal M\}\subseteq\{c_\alpha(\mu_n, X_n, \Sigma) : \mu_n\le0\}$. On the other hand, $\{\mu_n\le0\}\subseteq\mathcal M$, so (25) follows immediately. Finally, note that since we showed above that $c_\alpha(\mu_n, X_n, \Sigma)$ is elementwise nondecreasing in $\mu_n$,
$$\sup_{\mu_n\le0}c_\alpha(\mu_n, X_n, \Sigma) = c_\alpha(0, X_n, \Sigma),$$
which completes the proof. □

Proof of Lemma 4
This result follows from Lemma 10 below. In particular, note that by Lemma 10 any solution $\gamma$ to the dual linear program has exactly $p+1$ nonzero elements. By complementary slackness the corresponding constraints in the primal problem (12) must bind, and Lemma 10 implies that the corresponding rows of $W_n$ have full rank. Further, no additional constraints can bind, since this would imply degeneracy of the solution. □

Proof of Lemma 5
To prove this result, note that since (12) is a linear program, the Kuhn-Tucker conditions are necessary and sufficient for a solution. By arguments in the text, if there exists a solution with the moments $B$ binding, then we can write the optimal values as $(\hat\eta, \hat\delta')' = W_{n,B}^{-1}Y_{n,B}$, which sets $Y_{n,B} - W_{n,B}(\hat\eta, \hat\delta')' = 0$. If the remaining inequalities fail to hold when evaluated at $(\hat\eta, \hat\delta')'$, then $(\hat\eta, \hat\delta')'$ is infeasible and so not a solution. If, on the other hand, the remaining inequalities hold when evaluated at $(\hat\eta, \hat\delta')'$, then if we take the corresponding Kuhn-Tucker multipliers to be zero while setting the multipliers on the binding moments equal to $M_B\gamma_{n,B} = (W_{n,B}')^{-1}e_1$, one can verify that the Kuhn-Tucker conditions hold. □

Proof of Lemma 6
Follows immediately from Lemma 5 together with Lemma 5.1 of Lee et al. (2016). □
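For reference, Lemma 5.1 of Lee et al. (2016) is the polyhedral conditioning lemma: for $Y\sim N(\mu,\Sigma)$, a polyhedral event $\{AY\le b\}$, and a linear target $\eta = v'Y$, conditioning on the event and the orthogonalized statistic truncates $\eta$ to an interval. A generic sketch of those truncation formulas (our restatement of the general lemma; the paper's expressions (14)–(15) may be parameterized differently):

```python
import numpy as np

def lee_truncation(A, b, v, Sigma, Y):
    """Truncation limits for v'Y given {A Y <= b}, per the polyhedral
    lemma of Lee et al. (2016). Generic restatement, not (14)-(15)."""
    eta = v @ Y
    c = Sigma @ v / (v @ Sigma @ v)   # direction absorbing eta
    z = Y - c * eta                   # residual fixed by the conditioning
    rate = A @ c
    resid = b - A @ z
    # rows with rate == 0 impose only a sign condition on resid, omitted here
    vlo = np.max((resid / rate)[rate < 0], initial=-np.inf)
    vup = np.min((resid / rate)[rate > 0], initial=np.inf)
    return vlo, vup

# Example: condition on Y_1 >= Y_2 (A = [[-1, 1]], b = [0]) with target Y_1:
# given the residual, the event holds iff Y_1 >= Y_2, so Vlo = Y_2, Vup = +inf.
A, b, v = np.array([[-1.0, 1.0]]), np.array([0.0]), np.array([1.0, 0.0])
vlo, vup = lee_truncation(A, b, v, np.eye(2), np.array([2.0, 0.5]))
```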
Proof of Lemma 7
Follows immediately from Lemma 9 together with Lemma 10 and Proposition 5. □
Proof of Proposition 2
Follows immediately from Lemma 10, together with Propositions 5 and 6. □
Proof of Proposition 3
We prove this result for the dual conditioning approach introduced in Section A. That these results also hold for the primal conditioning approach discussed in Section 5.2 when the solution to the linear program (12) is unique and non-degenerate is immediate from Lemma 10 and Proposition 5. Our assumptions imply that the set of feasible vertices $V_F$ in the dual problem based on $(Y_{n,m}, W_n, \Sigma)$ is non-empty, and that the set of optimal vertices $\hat V$ is likewise non-empty. Since the primal is feasible by construction, we further know that the dual is bounded. We begin by showing that $\hat V$ converges to the set $\hat V_B$ of solution vertices in the dual problem based on $(Y_{n,m,B}, W_{n,B}, \Sigma_B)$. In particular, let
$$V^B_{F,B} = \{\gamma\in\mathbb R^k : \gamma_B\in V_{F,B},\ \gamma_j = 0\ \forall j\notin B\} = \{\gamma\in V_F : \gamma_j = 0\ \forall j\notin B\}$$
denote the set of vertices in $V_F$ corresponding to vertices $V_{F,B}$ of the feasible region in the problem restricted to the moments $B$, and let $\hat V^B_B\subseteq V^B_{F,B}$ be the analog for $\hat V_B$,
$$\hat V^B_B = \{\gamma\in\mathbb R^k : \gamma_B\in\hat V_B,\ \gamma_j = 0\ \forall j\notin B\}.$$
We will show that $Pr_{\mu_{n,m}}\{\hat V = \hat V^B_B\}\to1$.

To establish this result, recall that the dual problem (restricted to $\gamma\in V_F$) is $\max_{\gamma\in V_F}\gamma'Y_{n,m}$. For any $\gamma\in V_F$ with $\gamma_j\ne0$ for some $j\notin B$, $\gamma'Y_{n,m}\to_p-\infty$ as $m\to\infty$. Our assumption that there exists $\gamma_B\ge0$ with $W_{n,B}'\gamma_B = e_1$ implies that there exists at least one $\tilde\gamma\in V_F$ such that $\tilde\gamma_j = 0$ for all $j\notin B$. Thus, for any $\gamma\in V_F$ with $\gamma_j\ne0$ for some $j\notin B$, since $\tilde\gamma'Y_{n,m} = O_p(1)$ as $m\to\infty$,
$$Pr\{\tilde\gamma'Y_{n,m} > \gamma'Y_{n,m}\}\to1.$$
Thus, all $\gamma\in V_F$ with $\gamma_j > 0$ for some $j\notin B$ yield a value of the objective smaller than that for $\tilde\gamma$ with probability tending to one. This implies that $Pr\{\hat V\subseteq V^B_{F,B}\}\to1$. However, any $\gamma\in\hat V$ with $\gamma\in V^B_{F,B}$ lies in $\hat V^B_B$ as well. Thus, we see that $Pr\{\hat V = \hat V^B_B\}\to1$, as we wanted to show.

For $\hat\eta$ the optimal value of $\eta$ based on $(Y_{n,m}, W_n, \Sigma)$, and $\hat\eta_B$ the optimal value based on $(Y_{n,m,B}, W_{n,B}, \Sigma_B)$, we see that $\hat V = \hat V^B_B$ implies $\hat\eta = \hat\eta_B$. Thus, the argument above shows that $\hat\eta\to_p\hat\eta_B$ as $m\to\infty$.

We next argue that the critical values $c_{\alpha,C}(\hat\gamma, \mathcal V^{lo}(S_{n,m,\hat\gamma}), \mathcal V^{up}(S_{n,m,\hat\gamma}), \Sigma)$ based on $(Y_{n,m}, W_n, \Sigma)$ converge to the critical values $c_{\alpha,C}(\hat\gamma_B, \mathcal V^{lo}(S_{n,B,\hat\gamma_B}), \mathcal V^{up}(S_{n,B,\hat\gamma_B}), \Sigma_B)$ which limit attention to the moments $B$. To do so, we will show that $\mathcal V^{lo}(S_{n,m,\hat\gamma})\to_p\mathcal V^{lo}(S_{n,B,\hat\gamma_B})$, and likewise for $\mathcal V^{up}(S_{n,m,\hat\gamma})$. Recall, in particular, that
$$\mathcal V^{lo}(s) = \min\Big\{c : c = \max_{\tilde\gamma}\tilde\gamma'\Big(s + \frac{\Sigma\gamma}{\gamma'\Sigma\gamma}c\Big)\ \text{subject to}\ \tilde\gamma\ge0,\ W_n'\tilde\gamma = e_1\Big\}.$$
By the results above, we know that $\hat\gamma\in\hat V^B_B$ with probability approaching one.
Note that for $S_{n,m,\hat\gamma} = \big(I - \frac{\Sigma\hat\gamma\hat\gamma'}{\hat\gamma'\Sigma\hat\gamma}\big)Y_{n,m}$, the conditioning statistic based on $Y_{n,m}$, we have $\hat\gamma'Y_{n,m} = O_p(1)$, so $S_{n,m,\hat\gamma,j} = O_p(1)$ for all $j\in B$. By contrast, $S_{n,m,\hat\gamma,j}\to_p-\infty$ for all $j\notin B$.

Note, next, that by linearity of the problem we can restrict the optimization in the construction of $\mathcal V^{lo}$ to $\tilde\gamma\in V_F$, and so write
$$\mathcal V^{lo}(s) = \min\Big\{c : c = \max_{\tilde\gamma\in V_F}\tilde\gamma'\Big(s + \frac{\Sigma\gamma}{\gamma'\Sigma\gamma}c\Big)\Big\}.$$
Using the divergence of $S_{n,m,\hat\gamma}$, for any $\tilde\gamma\in V_F$ such that $\tilde\gamma_j > 0$ for some $j\notin B$ and any compact set $C$,
$$Pr\Big\{\tilde\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big) < \hat\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\ \forall c\in C\Big\}\to1.$$
From the finiteness of $V_F$, we thus see that for any compact set $C$,
$$Pr\Big\{\max_{\tilde\gamma\in V_F\setminus V^B_{F,B}}\tilde\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big) < \hat\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\ \forall c\in C\Big\}\to1. \qquad(26)$$
Since $\hat\gamma\in V_F$, this implies
$$Pr\Big\{\max_{\tilde\gamma\in V_F}\tilde\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big) = \max_{\tilde\gamma\in V^B_{F,B}}\tilde\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\ \forall c\in C\Big\}\to1.$$
Note that by the definition of $\hat\gamma$, $\hat\gamma'Y_{n,m} = \max_{\tilde\gamma\in V_F}\tilde\gamma'Y_{n,m}$. Since $\hat\gamma\in V_F$, for any $v$ we have $\hat\gamma'(Y_{n,m}+v)\le\max_{\tilde\gamma\in V_F}\tilde\gamma'(Y_{n,m}+v)$. Note further that from the definition of $S_{n,m,\hat\gamma}$, $c = \hat\gamma'\big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\big)$ for any $c$, and that $Y_{n,m} = S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}\hat\gamma'Y_{n,m}$. Setting $v = \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}(c - \hat\gamma'Y_{n,m})$, we then have
$$c = \hat\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\le\max_{\gamma\in V_F}\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\quad\forall c.$$
Note, further, that for all $c$,
$$\max_{\gamma\in V_F}\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\ge\max_{\gamma\in V^B_{F,B}}\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big),$$
since the left hand side optimizes over a larger set. The fact that $Pr\{\hat V\subseteq V^B_{F,B}\}\to1$ implies that with probability approaching one
$$c = \hat\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\le\max_{\gamma\in V^B_{F,B}}\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\quad\forall c,$$
and hence
$$\Big\{c : c = \max_{\gamma\in V_F}\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\Big\}\subseteq\Big\{c : c = \max_{\gamma\in V^B_{F,B}}\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)\Big\}.$$
Hence, if $\mathcal V^{lo}(S_{n,B,\hat\gamma_B})$ is finite, then with probability approaching one $\mathcal V^{lo}(S_{n,m,\hat\gamma})$ is finite as well.

Note that the distribution of $\mathcal V^{lo}(S_{n,B,\hat\gamma_B})$ does not depend on $m$. Further, the distribution of $\mathcal V^{lo}(S_{n,B,\hat\gamma_B})$ conditional on $\mathcal V^{lo}(S_{n,B,\hat\gamma_B})$ being finite is trivially tight. Hence, conditional on the event that $\mathcal V^{lo}(S_{n,B,\hat\gamma_B})$ is finite, our argument above for compact sets $C$ implies that
$$Pr\big\{\mathcal V^{lo}(S_{n,m,\hat\gamma}) = \mathcal V^{lo}(S_{n,B,\hat\gamma_B})\mid\mathcal V^{lo}(S_{n,B,\hat\gamma_B})\text{ finite}\big\}\to1.$$
On the other hand, when $\mathcal V^{lo}(S_{n,B,\hat\gamma_B})$ is infinite, we know that
$$c = \hat\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big) = \max_{\gamma\in V^B_{F,B}}\gamma'\Big(S_{n,m,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}c\Big)$$
for all $c$ sufficiently small. Hence, (26) implies that when $\mathcal V^{lo}(S_{n,B,\hat\gamma_B}) = -\infty$, $\mathcal V^{lo}(S_{n,m,\hat\gamma})\to_p-\infty$ as well. We can apply the same argument for $\mathcal V^{up}(S_{n,m,\hat\gamma})$. Note, however, that the conditional critical value is a continuous function of $\mathcal V^{lo}(S_{n,m,\hat\gamma})$ and $\mathcal V^{up}(S_{n,m,\hat\gamma})$, including at $\mathcal V^{lo}(S_{n,m,\hat\gamma}) = -\infty$ and $\mathcal V^{up}(S_{n,m,\hat\gamma}) = \infty$.
Thus, by the continuous mapping theorem, we see that $(\hat\eta, c_{\alpha,C}(\hat\gamma, \mathcal V^{lo}(S_{n,m,\hat\gamma}), \mathcal V^{up}(S_{n,m,\hat\gamma}), \Sigma))$ converges in distribution to its analog calculated based on the moments $B$ alone, $(\hat\eta_B, c_{\alpha,C}(\hat\gamma_B, \mathcal V^{lo}(S_{n,B,\hat\gamma_B}), \mathcal V^{up}(S_{n,B,\hat\gamma_B}), \Sigma_B))$. Assumption 1 implies that the variance of $\gamma'Y_n$ is strictly positive for all $\gamma_B\in V_{F,B}$. Hence, $\gamma_B'Y_{n,B}$ is continuously distributed and independent of $(\mathcal V^{lo}(S_{n,B,\hat\gamma_B}), \mathcal V^{up}(S_{n,B,\hat\gamma_B}))$, from which it follows that $\mathcal V^{lo}(S_{n,B,\gamma_B}) < \mathcal V^{up}(S_{n,B,\gamma_B})$ with probability one. Hence, since $V_{F,B}$ is finite,
$$\hat\eta_B - c_{\alpha,C}\big(\hat\gamma_B, \mathcal V^{lo}(S_{n,B,\hat\gamma_B}), \mathcal V^{up}(S_{n,B,\hat\gamma_B}), \Sigma_B\big)$$
is continuously distributed, and the result follows from the continuous mapping theorem. □
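The conclusion of Proposition 3 — that moments far from binding become irrelevant — is easy to see in a small numerical example (ours, purely illustrative): sending a moment outside $B$ toward $-\infty$ leaves $\hat\eta$ at the value computed from the moments $B$ alone.

```python
import numpy as np
from scipy.optimize import linprog

def eta_hat(Y, W):
    # Dual program (21): max_g g'Y s.t. g >= 0, W'g = e_1
    k, p1 = W.shape
    e1 = np.zeros(p1)
    e1[0] = 1.0
    res = linprog(c=-Y, A_eq=W.T, b_eq=e1,
                  bounds=[(0, None)] * k, method="highs")
    return -res.fun

W = np.array([[1.0], [1.0], [1.0]])   # three moments, no nuisance parameters
base = eta_hat(np.array([2.0, 1.0]), W[:2])      # moments B = {1, 2} only
slack = eta_hat(np.array([2.0, 1.0, -50.0]), W)  # third moment very slack
# base and slack coincide: the slack moment does not affect eta_hat
```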
Proof of Proposition 4
Follows immediately from Lemma 10 and Propositions 5 and 7. □
Proof of Lemma 8
This result follows from standard duality results for linear programming. Note, in particular, that the primal problem (10) is equivalent to
$$-\hat\eta = \max_\theta -e_1'\theta \quad\text{subject to}\quad Y_{n,j} - W_{n,j}\theta\le0\ \forall j,$$
for $\theta = (\eta, \delta')'$. The duality theorem for linear programming (see e.g. (24) in Chapter 7.4 of Schrijver (1986)) implies that if the optimum in this problem is finite, it is equal to the optimum of the dual problem
$$-\hat\eta = \min_\gamma -\gamma'Y_n \quad\text{subject to}\quad \gamma\ge0,\ -W_n'\gamma = -e_1.$$
The optimal value $\hat\eta$ in this problem is in turn equal to that in (21). □

Proof of Lemma 9
The result follows from the argument in Section 5.1 of Fithian et al. (2017), but we provide a separate proof for completeness.

The set of values $Y_n$ such that
$$Y_n'\gamma = \max_{\tilde\gamma}\tilde\gamma'Y_n \quad\text{subject to}\quad \tilde\gamma\ge0,\ W_n'\tilde\gamma = e_1 \qquad(27)$$
is convex. This follows from the fact that if (27) holds for both $Y_n$ and $Y_n^*$, then we know that both $Y_n'\gamma\ge Y_n'\tilde\gamma$ and $Y_n^{*\prime}\gamma\ge Y_n^{*\prime}\tilde\gamma$ for all $\tilde\gamma\ge0$ with $W_n'\tilde\gamma = e_1$, which implies that $(\alpha Y_n + (1-\alpha)Y_n^*)'\gamma\ge(\alpha Y_n + (1-\alpha)Y_n^*)'\tilde\gamma$ as well.

Thus, once we condition on $S_{n,\gamma}$, the set of values $\gamma'Y_n$ such that (27) holds is an interval. To derive the form of the endpoints, note that
$$\mathcal V^{lo}(s) = \min_{Y_n : S_{n,\gamma} = s}\Big\{Y_n'\gamma : Y_n'\gamma = \max_{\tilde\gamma}\tilde\gamma'Y_n\ \text{subject to}\ \tilde\gamma\ge0,\ W_n'\tilde\gamma = e_1\Big\}.$$
Given $S_{n,\gamma} = s$, this is equivalent to
$$\mathcal V^{lo}(s) = \min_{Y_n : S_{n,\gamma} = s}\Big\{Y_n'\gamma : Y_n'\gamma = \max_{\tilde\gamma}\tilde\gamma'\Big(s + \frac{\Sigma\gamma}{\gamma'\Sigma\gamma}Y_n'\gamma\Big)\ \text{subject to}\ \tilde\gamma\ge0,\ W_n'\tilde\gamma = e_1\Big\}.$$
Finally, this is equivalent to
$$\mathcal V^{lo}(s) = \min\Big\{c : c = \max_{\tilde\gamma}\tilde\gamma'\Big(s + \frac{\Sigma\gamma}{\gamma'\Sigma\gamma}c\Big)\ \text{subject to}\ \tilde\gamma\ge0,\ W_n'\tilde\gamma = e_1\Big\}$$
if the support of $Y_n'\gamma$ equals $\mathbb R$. The linear structure of the problem implies that this holds if and only if $\gamma\ne0$, which we know to be the case since $W_n'\gamma = e_1\ne0$. The expression for $\mathcal V^{up}$ follows by the same argument.

Independence of $\gamma'Y_n$ and $S_{n,\gamma}$ then implies that the conditional distribution of $Y_n'\gamma$ given $S_{n,\gamma}$ and (27) is truncated normal. □

Proof of Proposition 5
The Kuhn-Tucker conditions for optimality of $\gamma$ in the dual problem (which are necessary and sufficient since the problem is a linear program) are that there exist $(\hat\theta, \hat\lambda)$ such that
$$Y_n + \hat\lambda - W_n\hat\theta = 0,\qquad \hat\lambda\ge0,\qquad \hat\lambda_j\gamma_j = 0\ \forall j.$$
From the complementary slackness conditions $\hat\lambda_j\gamma_j = 0\ \forall j$, we see that $\hat\lambda_j = 0$ for all $j\in B$. Thus, for $M_B$ again the matrix which selects rows $B$, and $M_{B^c}$ which selects the remaining rows, so that $Y_{n,B} = M_BY_n$ and $Y_{n,B^c} = M_{B^c}Y_n$, we have $Y_{n,B} - W_{n,B}\hat\theta = 0$. Since the strictly positive elements of $\gamma$ correspond to linearly independent rows of $W_n$ by assumption, we know that $W_{n,B}$ has full rank. Thus, $\hat\theta = W_{n,B}^{-1}Y_{n,B}$. For such $\hat\theta$, however, there exists $\hat\lambda$ satisfying the conditions above if and only if $Y_{n,B^c} - W_{n,B^c}\hat\theta\le0$.

Note that any such $\hat\theta$ is a solution to the primal problem, with $\hat\theta = (\hat\eta, \hat\delta')'$. In particular, in the dual problem we know that $\gamma'Y_n = (M_B\gamma)'Y_{n,B}$ and $W_{n,B}'M_B\gamma = e_1$, so $M_B\gamma = (W_{n,B}')^{-1}e_1$ (where $\gamma_{n,B}$ is as defined in Lemma 6) and the optimal value in the dual problem is $e_1'W_{n,B}^{-1}Y_{n,B}$. If we consider the value implied by $\hat\theta$, we again obtain $\hat\eta = e_1'\hat\theta = e_1'W_{n,B}^{-1}Y_{n,B}$. By Lemma 8, the optimal objective value of the primal is equal to that of the dual, so $\hat\theta$ achieves the optimum for the primal, and we argued above that the primal constraints are satisfied at $\hat\theta$ when $\gamma$ solves the dual.
We have thus verified a solution to the primal with $B$ binding and $\gamma_{n,B} = \gamma$ whenever $\gamma\in\hat V$.

Finally, recall from the proof of Lemma 5 that if $W_{n,B}$ is invertible and there is a solution to the primal with $B$ binding, then the Kuhn-Tucker conditions hold with $M_B\gamma = (W_{n,B}')^{-1}e_1$ and the other entries of $\gamma$ equal to zero, so by the sufficiency of the Kuhn-Tucker conditions, $\gamma$ solves the dual whenever $B$ is binding in the primal. It follows that
$$\{Y_n\ \text{such that}\ B\ \text{is binding in the primal}\} = \{Y_n\ \text{such that}\ \gamma\in\hat V\}.$$
Observing that when $\gamma\in\hat V$, $S_{n,\gamma} = S_{n,B}$, it is then immediate from Lemmas 6 and 9 that the definition of $\mathcal V^{lo}$ and $\mathcal V^{up}$ in equations (14) and (15) coincides with that in equations (22) and (23). □

Proof of Lemma 10
Uniqueness and non-degeneracy of the solution $\hat\theta$ imply that $|B| = p+1$. To see that this is the case, note that if $|B| < p+1$ then there exists a nonzero vector $v$ such that $W_{n,B}v = 0$. If $e_1'v = 0$, then for $\alpha$ sufficiently small $\hat\theta + \alpha\cdot v$ is also a solution to the primal problem, contradicting our assumption of uniqueness. If instead $e_1'v\ne0$, then for sufficiently small $\alpha > 0$, $\hat\theta - \alpha\cdot\mathrm{sign}(e_1'v)v$ also satisfies the constraints of the primal problem and attains a smaller value of the objective, contradicting the optimality of $\hat\theta$. Likewise, if $|B| > p+1$, since $W_n$ has $p+1$ columns the rows of $W_{n,B}$ cannot be linearly independent, violating our assumption of non-degeneracy. Thus, our assumptions imply that $W_{n,B}$ must be a full-rank $(p+1)\times(p+1)$ matrix.

We next show that there must be $p+1$ strictly positive multipliers. Note that from the complementary slackness conditions, $\gamma_j = 0$ for $j\notin B$, so there can be at most $p+1$ strictly positive multipliers. Let $\hat\gamma$ be a solution to the dual problem (21). By (21) in Section 10.4 of Schrijver (1986), non-degeneracy of the primal problem implies that for $v$ in an open neighborhood of zero,
$$\min_\theta e_1'\theta\ \text{subject to}\ (Y_n + v) - W_n\theta\le0 \;=\; e_1'\hat\theta + \hat\gamma'v, \qquad(28)$$
so $\hat\gamma$ gives the marginal change in the objective for small changes in $Y_n$. Uniqueness of $\hat\theta$ implies that for $\hat\gamma_B$ the elements of $\hat\gamma$ corresponding to $B$, $\hat\gamma_B > 0$.

To see that this is the case, suppose not. Then there exists $\hat j\in B$ with $\hat\gamma_{\hat j} = 0$. In this case, note that for $e_{\hat j}$ the vector with a one in entry $\hat j$ and zeros everywhere else, $\hat\gamma'e_{\hat j} = 0$.
We know that for $\alpha$ sufficiently small, there continues to be a unique solution with only the constraints $B$ binding after we perturb $Y_n$ by $\alpha\cdot e_{\hat j}$, and thus that we can write
$$\hat\theta\big(\alpha\cdot e_{\hat j}\big) = \arg\min_\theta e_1'\theta\ \text{subject to}\ \big(Y_n + \alpha\cdot e_{\hat j}\big) - W_n\theta\le0 \;=\; W_{n,B}^{-1}\big(Y_{n,B} + \alpha\cdot M_Be_{\hat j}\big),$$
for $M_B$ the selection matrix that selects rows in $B$. Further, by (28) we know that $e_1'\hat\theta(\alpha\cdot e_{\hat j}) = e_1'\hat\theta(0) = \hat\eta$, so this perturbation does not affect the objective. Let us define $\tilde\theta(\alpha) = \hat\theta + \alpha\cdot W_{n,B}^{-1}M_Be_{\hat j}$. Note that $e_1'\tilde\theta(\alpha) = \hat\eta$, while $Y_n - W_n\tilde\theta(\alpha) = Y_n - W_n\hat\theta - \alpha W_nW_{n,B}^{-1}M_Be_{\hat j}$. However, for all $\alpha\ge0$,
$$M_B\big(Y_n - W_n\tilde\theta(\alpha)\big) = Y_{n,B} - W_{n,B}\hat\theta - \alpha W_{n,B}W_{n,B}^{-1}M_Be_{\hat j} = Y_{n,B} - W_{n,B}\hat\theta - \alpha M_Be_{\hat j}\le0.$$
Since the other rows of $Y_n - W_n\hat\theta$ are not binding, they remain nonbinding for $\alpha$ sufficiently small. Thus, there exists $\alpha^* > 0$ such that $Y_n - W_n\tilde\theta(\alpha^*)\le0$ and $e_1'\tilde\theta(\alpha^*) = \hat\eta$. There is thus another solution to the primal problem, which contradicts our assumption of uniqueness. □
Proof of Proposition 6
Monotonicity of the conditional distribution in $\gamma'\mu_n$ implies that the test has conditional size $\alpha$ given $\gamma\in\hat V$ and $S_{n,\gamma} = s$ for almost every $s$. For this section only, we make the dependence of $\mathcal V^{lo}$ and $\mathcal V^{up}$ on $\gamma$ explicit, writing $\mathcal V^{lo}(s,\gamma)$ and $\mathcal V^{up}(s,\gamma)$. Note that for all $V\in\mathcal V$, Lemma 11 implies
$$\mathcal V^{lo}(S_{n,\gamma_j},\gamma_j) = \mathcal V^{lo}(S_{n,\gamma_k},\gamma_k)\ \forall\gamma_j,\gamma_k\in V,\qquad \mathcal V^{up}(S_{n,\gamma_j},\gamma_j) = \mathcal V^{up}(S_{n,\gamma_k},\gamma_k)\ \forall\gamma_j,\gamma_k\in V,$$
so, since $\hat\gamma$ is selected from $\hat V$,
$$Pr_{\mu_n}\big\{c_{\alpha,C}\big(\gamma, \mathcal V^{lo}(S_{n,\gamma},\gamma), \mathcal V^{up}(S_{n,\gamma},\gamma), \Sigma\big) = c_{\alpha,C}\big(\hat\gamma, \mathcal V^{lo}(S_{n,\hat\gamma},\hat\gamma), \mathcal V^{up}(S_{n,\hat\gamma},\hat\gamma), \Sigma\big)\mid\gamma\in\hat V\big\} = 1.$$
Lemma 9, the monotonicity of the conditional distribution in $\gamma'\mu_n$, Assumption 1, and the fact (argued in the proof of Proposition 3) that $Pr_{\mu_n}\{\mathcal V^{lo}(S_{n,\gamma},\gamma) < \mathcal V^{up}(S_{n,\gamma},\gamma)\} = 1$ imply that for almost every $s$ in the support of $S_{n,\gamma}$, given $\gamma\in\hat V$ and $S_{n,\gamma} = s$,
$$\sup_{\mu_n\in\mathcal M}Pr_{\mu_n}\big\{\hat\eta > c_{\alpha,C}\big(\gamma, \mathcal V^{lo}(S_{n,\gamma},\gamma), \mathcal V^{up}(S_{n,\gamma},\gamma), \Sigma\big)\mid\gamma\in\hat V,\ S_{n,\gamma} = s\big\} = \alpha,$$
from which it follows that
$$\sup_{\mu_n\in\mathcal M}Pr_{\mu_n}\big\{\hat\eta > c_{\alpha,C}\big(\hat\gamma, \mathcal V^{lo}(S_{n,\hat\gamma},\hat\gamma), \mathcal V^{up}(S_{n,\hat\gamma},\hat\gamma), \Sigma\big)\mid\gamma\in\hat V,\ S_{n,\gamma} = s\big\} = \alpha,$$
and thus that
$$\sup_{\mu_n\in\mathcal M}E_{\mu_n}\big[\phi^C\mid\gamma\in\hat V,\ S_{n,\gamma} = s\big] = E_0\big[\phi^C\mid\gamma\in\hat V,\ S_{n,\gamma} = s\big] = \alpha.$$
For the first equality we have used the fact that the sup is achieved at $\mu_n = 0$, which again follows from monotonicity of the conditional distribution. The law of iterated expectations then immediately implies the first result in the proposition,
$$\sup_{\mu_n\in\mathcal M}E_{\mu_n}\big[\phi^C\mid\gamma\in\hat V\big] = E_0\big[\phi^C\mid\gamma\in\hat V\big] = \alpha.$$
To obtain the second part of the proposition, note that by Lemma 11 the events $\hat V = V_j$, $j\in\{1,\dots,m\}$, are disjoint, and their union occurs with probability one. Thus,
$$E_{\mu_n}[\phi^C] = \sum_{j=1}^m Pr_{\mu_n}\{\hat V = V_j\}E_{\mu_n}\big[\phi^C\mid\hat V = V_j\big].$$
By Lemma 11, however, $E_{\mu_n}[\phi^C\mid\hat V = V_j] = E_{\mu_n}[\phi^C\mid\tilde\gamma\in\hat V]$ for all $\tilde\gamma\in V_j$. Hence
$$\sup_{\mu_n\in\mathcal M}E_{\mu_n}[\phi^C] = E_0[\phi^C] = \alpha.\ \square$$

Proof of Proposition 7
Size control conditional on $\hat\eta\le c_{\kappa,LF}(X_n,\Sigma)$ and $\gamma\in\hat V$ holds by the same argument as the proof of Proposition 6, replacing $\mathcal V^{up}$ with $\mathcal V^{up,H}$ as in the text.

To prove unconditional size control, note that
$$E_{\mu_n}[\phi^H] = E_{\mu_n}\big[\phi^H\mid\hat\eta\le c_{\kappa,LF}(X_n,\Sigma)\big]Pr_{\mu_n}\{\hat\eta\le c_{\kappa,LF}\} + E_{\mu_n}\big[\phi^H\mid\hat\eta > c_{\kappa,LF}(X_n,\Sigma)\big]Pr_{\mu_n}\{\hat\eta > c_{\kappa,LF}\}.$$
From the first part of the proposition and the law of iterated expectations, we know that $E_{\mu_n}[\phi^H\mid\hat\eta\le c_{\kappa,LF}(X_n,\Sigma)]$ is bounded above by $\frac{\alpha-\kappa}{1-\kappa}$, while by the construction of the hybrid test we know that $E_{\mu_n}[\phi^H\mid\hat\eta > c_{\kappa,LF}(X_n,\Sigma)] = 1$. Thus, we see that under the null,
$$E_{\mu_n}[\phi^H]\le\frac{\alpha-\kappa}{1-\kappa}Pr_{\mu_n}\{\hat\eta\le c_{\kappa,LF}\} + 1 - Pr_{\mu_n}\{\hat\eta\le c_{\kappa,LF}\}.$$
This expression is decreasing in $Pr_{\mu_n}\{\hat\eta\le c_{\kappa,LF}\}$, so to obtain an upper bound we need to make $Pr_{\mu_n}\{\hat\eta\le c_{\kappa,LF}\}$ as small as possible. By Proposition 1 we know $Pr_{\mu_n}\{\hat\eta\le c_{\kappa,LF}\}\ge1-\kappa$ under the null, which yields
$$\sup_{\mu_n\in\mathcal M}E_{\mu_n}[\phi^H]\le\frac{\alpha-\kappa}{1-\kappa}(1-\kappa) + \kappa = \alpha.$$
Note, further, that both of the bounds we used above are tightest at $\mu_n = 0$, and both bind in this case provided $\hat\eta$ is continuously distributed. However, Assumption 1 implies that $\hat\eta$ is continuously distributed, so $E_0[\phi^H] = \alpha$. □

Proof of Lemma 11
Each element of $\hat V$ is also a vertex of the feasible set $F = \{\gamma : \gamma\ge0,\ W_n'\gamma = e_1\}$. By the finiteness of $V_F$, we thus see that $\hat V$ has support equal to a subset of the power set of $V_F$. Note, however, that if we consider two values $\gamma_1, \gamma_2\in V_F$, then since $Y_n$ is normally distributed,
$$Pr\{\gamma_1'Y_n = \gamma_2'Y_n\}\in\{0,1\}. \qquad(29)$$
Thus, a given set of optimal vertices $V$ in the dual problem (21) either always or never arises. From this, and the finiteness of the power set of $V_F$, it follows that there exists a finite set $\mathcal V = \{V_1, V_2, \dots, V_m\}$ such that $V_j\ne V_k$ for $j\ne k$, $Pr\{\hat V\in\mathcal V\} = 1$, and $Pr\{\hat V = V_j\} > 0$ for all $j$, which establishes the first part of the result.

To complete the proof, note that the restriction that each $V_j$ must arise with positive probability, together with (29), implies that $V_j\cap V_k = \emptyset$ for all $j\ne k$. To see that this is the case, suppose there exists an element $\gamma\in V_j\cap V_k$. The restrictions that $Pr\{\hat V = V_j\} > 0$ and $Pr\{\hat V = V_k\} > 0$, together with (29), imply that
$$Pr\big\{\gamma'Y_n = \gamma_j'Y_n = \gamma_k'Y_n\ \forall(\gamma_j,\gamma_k)\in V_j\times V_k\big\} = 1.$$
However, this is inconsistent with the restriction that $Pr\{\hat V = V_j\} > 0$ and $V_j\ne V_k$. Thus, we see that $V_j\cap V_k = \emptyset$. □

D Asymptotics
In Sections 4 and 5 of the main text, we derived finite-sample results in the normal model (7), which we motivated in Section 3 as an asymptotic approximation. In this section, we show that these finite-sample results translate to asymptotic validity of our proposed tests over a large class of data generating processes. In particular, we establish uniform asymptotic validity of least favorable and least favorable projection tests under minimal conditions. We likewise establish the uniform asymptotic validity of conditional and hybrid tests over classes of data generating processes implying different $\mu_n$ values, but these results impose more stringent conditions on $X_n$ and $\Sigma$. Specifically, our conditions for these results imply that the dual linear program (21) has a unique solution with probability tending to one, which in turn implies that the primal problem (10) has a non-degenerate solution with probability tending to one.

We conduct our analysis conditional on a sequence of values for the instruments, $\{Z_i\} = \{Z_i\}_{i=1}^\infty$, and assume that conditional on $\{Z_i\}_{i=1}^\infty$ the data are independent but potentially not identically distributed,
$$D_i\perp D_{i'}\mid\{Z_j\}_{j=1}^\infty \quad\text{for all } i\ne i'.$$
We further assume that for some common conditional distribution $P_{D|Z}$,
$$D_i\mid Z_i = z\sim P_{D|Z}(z),$$
where the conditional distribution belongs to a family $\mathcal P_{D|Z}$ of conditional distributions, $P_{D|Z}\in\mathcal P_{D|Z}$. We explore conditions on $\mathcal P_{D|Z}$ under which the procedures we suggest are uniformly asymptotically valid.

We first assume that the average conditional variance of $Y_i$ given $Z_i$ converges uniformly to some limit which may depend on $P_{D|Z}$, and that this limit is uniformly bounded over $\mathcal P_{D|Z}$.

Assumption 2
For some Σ(P_{D|Z}),

lim_{n→∞} sup_{P_{D|Z} ∈ 𝒫_{D|Z}} ‖ (1/n) ∑_{i=1}^n Var_{P_{D|Z}}(Y_i | Z_i) − Σ(P_{D|Z}) ‖ = 0.  (30)

Further, for all P_{D|Z} ∈ 𝒫_{D|Z}, Σ(P_{D|Z}) ∈ Λ = { Σ : 1/λ̄ ≤ min_j Σ_jj ≤ max_j Σ_jj ≤ λ̄ }, where λ̄ is a finite constant.

To justify this assumption, note that for an iid sample from P, if the conditional distribution of D_i | Z_i is P_{D|Z}, the strong law of large numbers implies that for almost every sequence {Z_i}_{i=1}^∞,

(1/n) ∑_{i=1}^n Var_{P_{D|Z}}(Y_i | Z_i) → E_P[ Var_{P_{D|Z}}(Y_i | Z_i) ],

so the convergence in (30) holds pointwise for each P_{D|Z} ∈ 𝒫_{D|Z}. The second part of the assumption then requires that the average conditional variance of each of the moments be bounded above and below, which is again a mild condition. We do not require the matrix Σ(P_{D|Z}) to have full rank, which is important since it allows us to accommodate moment equalities represented as pairs of moment inequalities. (Note, however, that uniqueness of the dual solution holds automatically if Σ has full rank, and can be ensured by adding full-rank, mean-zero noise to Y_n. Moreover, since our results are uniform in µ_n, they allow that the "population" version of (21), with Y_n = µ_n, may have a non-unique solution, as in one of our simulation specifications.)

We next suppose that we have a uniformly consistent estimator of the variance Σ(P_{D|Z}). We discuss primitive conditions for this assumption in Section D.3 below, but for the moment take the existence of a suitable estimator Σ̂ as given.

Assumption 3
We have an estimator Σ̂ for the average conditional variance Σ(P_{D|Z}) which is uniformly consistent, in the sense that for all ε > 0,

lim_{n→∞} sup_{P_{D|Z} ∈ 𝒫_{D|Z}} Pr_{P_{D|Z}}{ ‖Σ̂ − Σ(P_{D|Z})‖ > ε } = 0.

We further assume that the scaled sample average Y_n is uniformly asymptotically normal once recentered around µ_n. To state this assumption we use the fact that uniform convergence in distribution is equivalent to uniform convergence in the bounded Lipschitz metric (see Theorem 1.12.4 of Van der Vaart & Wellner (1996)).

Assumption 4
For BL the class of Lipschitz functions which are bounded in absolute value by one and have Lipschitz constant bounded by one, and ξ_{P_{D|Z}} ∼ N(0, Σ(P_{D|Z})),

lim_{n→∞} sup_{P_{D|Z} ∈ 𝒫_{D|Z}} sup_{f ∈ BL} | E_{P_{D|Z}}[f(Y_n − µ_n)] − E[f(ξ_{P_{D|Z}})] | = 0.

Under Assumption 2, Assumption 4 holds whenever the average conditional distribution of Y_i − µ_i given Z_i is uniformly integrable over P_{D|Z} ∈ 𝒫_{D|Z}.

Lemma 12
Under Assumption 2, if for all ε > 0,

lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫_{D|Z}} (1/n) ∑_i E_{P_{D|Z}}[ ‖Y_i − µ_i‖² · 1{‖Y_i − µ_i‖ > ε√n} | Z_i ] = 0,

then Assumption 4 holds.

D.1 Uniform Validity of Least Favorable Tests

Assumptions 2-4 imply the uniform asymptotic validity of feasible least favorable and least favorable projection tests which replace Σ by the estimator Σ̂ in all expressions. To formally state this result, it is helpful to define 𝒫⁰_{D|Z} as the class of conditional distributions consistent with our conditional moment restriction,

𝒫⁰_{D|Z} = { P_{D|Z} ∈ 𝒫_{D|Z} : ∃ δ s.t. E_{P_{D|Z}}[Y_i − X_i δ | Z_i] ≤ 0 for all i }.

Proposition 8
Under Assumptions 2-4, the least favorable projection test is uniformly asymptotically valid,

lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫⁰_{D|Z}} Pr_{P_{D|Z}}{ η̂ > c_{α,LF(δ)}(Σ̂) } ≤ α.

The least favorable test is likewise uniformly valid once the critical value is increased by an arbitrarily small amount. In particular, for any ε > 0,

lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫⁰_{D|Z}} Pr_{P_{D|Z}}{ η̂ > c_{α,LF}(X_n, Σ̂) + ε } ≤ α.

We adjust the critical value in the least favorable test by ε to accommodate the possibility that the distribution of η̂ may become degenerate asymptotically. D. Andrews & Shi (2013) termed this an infinitesimal uniformity factor. We next discuss assumptions which rule out such degeneracy, and so ensure asymptotic validity of least favorable tests with ε = 0.

Continuity of the Limit Distribution
We next consider assumptions which ensure a continuous limiting distribution for η̂. These assumptions restrict the behavior of X_n and Σ(P_{D|Z}) but, critically, impose no restrictions on µ_n, and so allow any combination of binding and non-binding moments.

We first assume that X_n, appropriately scaled, converges to some limit as n → ∞.

Assumption 5  X*_n = X_n/√n → X for a constant matrix X.

As with Assumption 2, if the data are drawn iid from some distribution P with E_P[X_i] finite, then the strong law of large numbers implies that this assumption holds for almost every {Z_i}_{i=1}^∞ if we take X = E_P[X_i].

Our next assumption concerns the vertices V_F(X, Σ) of the feasible region

F(X, Σ) = { γ : γ ≥ 0, W′γ = e }

in the dual problem, where as in (12) W_j = [ √Σ_jj, X_j ].

Assumption 6
For all P_{D|Z} ∈ 𝒫_{D|Z}, Σ(P_{D|Z}) ∈ S, where S ⊆ Λ is a compact set of matrices. Moreover, for some finite J,

V_F(X, Σ) = { γ_1(X, Σ), ..., γ_J(X, Σ) },

where each γ_j(X, Σ) is unique and continuous in both arguments on B(X) × S, for B(X) an open neighborhood of X.

This assumption requires that the vertices V_F(X, Σ) of the feasible region be continuous at the limiting pair (X, Σ). This will generally fail if the columns of W are multi-collinear, since in this case some of the constraints in W′γ = e are redundant, and the dimension of the feasible region F(X, Σ) changes discontinuously in (X, Σ). This assumption thus implies an asymptotic rank condition, requiring that the different elements of the nuisance parameter vector δ have distinguishable effects on the vector of moments, and can in this sense be understood as an identification condition on δ.

Our final condition restricts the relationship between the variance matrix Σ and the vertices V_F(X, Σ).

Assumption 7  For all Σ ∈ S and all γ_1, γ_2 ∈ V_F(X, Σ) with γ_1 ≠ γ_2,

1. 1/λ̄ ≤ γ_1′Σγ_1, and
2. (γ_1 − γ_2)′Σ(γ_1 − γ_2) ≥ 1/λ̄.

To interpret this assumption, recall that

η̂ = max_{γ ∈ V_F(X_n, Σ̂)} γ′Y_n,

where the asymptotic variance of Y_n is Σ. Thus, η̂ is a (data-dependent) linear combination of the elements of Y_n. The first part of Assumption 7, 1/λ̄ ≤ γ_1′Σγ_1, bounds the asymptotic variance of these linear combinations away from zero, and can be interpreted as an asymptotic analog of Assumption 1 in the main text. The second part of Assumption 7, (γ_1 − γ_2)′Σ(γ_1 − γ_2) ≥ 1/λ̄, ensures that γ_1′Y_n and γ_2′Y_n are not perfectly correlated asymptotically.

Both conditions hold automatically if we bound the minimal eigenvalue of Σ away from zero. As noted above, however, we do not wish to rule out moment equalities represented as pairs of inequalities, and so do not impose this condition. More broadly, this assumption implies the existence of a unique solution to the dual problem (21), and thus non-degeneracy of the primal solution, with probability going to one. While this does not require uniqueness in the primal problem (see Corollary 1 in Tijssen & Sierksma (1998)), it rules out the sort of exact primal degeneracy which Appendix A shows can be accommodated in the normal model.

It is worth contrasting Assumption 7 with conditions used elsewhere in the literature on subvector inference. Gafarov (2019), Cho & Russell (2019), and Flynn (2019) all impose versions of the linear independence constraint qualification, which requires that the Jacobian of the binding moments have full rank in a population problem. This rules out degenerate solutions. The linear programs studied in these papers differ from ours, in that they aim to minimize or maximize a parameter of interest subject to moment constraints in the population, while we aim to minimize η subject to constraints in the sample.
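To build intuition for the vertex set V_F(X, Σ) underlying Assumptions 6 and 7, the following is a minimal numerical sketch (our own illustration, not the paper's implementation; the function name and the brute-force enumeration over basic feasible solutions are assumptions, practical only for a small number of moments):

```python
import itertools
import numpy as np

def dual_vertices(X, Sigma, tol=1e-9):
    """Enumerate the vertices of F(X, Sigma) = {gamma >= 0 : W' gamma = e},
    where W_j = [sqrt(Sigma_jj), X_j] and e is the first unit vector,
    by checking every basic solution of the equality constraints."""
    k = X.shape[0]                                   # number of moments
    W = np.column_stack([np.sqrt(np.diag(Sigma)), X])  # k x (p + 1)
    m = W.shape[1]
    e = np.zeros(m)
    e[0] = 1.0
    verts = []
    for idx in itertools.combinations(range(k), m):
        B = W[list(idx), :].T                        # m x m basis of W'
        if abs(np.linalg.det(B)) < tol:
            continue                                 # not a valid basis
        g_basic = np.linalg.solve(B, e)
        if (g_basic < -tol).any():
            continue                                 # infeasible: gamma >= 0 fails
        g = np.zeros(k)
        g[list(idx)] = g_basic
        if not any(np.allclose(g, v, atol=1e-7) for v in verts):
            verts.append(g)
    return verts
```

With Σ = I and no nuisance parameters (p = 0), the vertices are the scaled standard basis vectors, so η̂ = max_{γ ∈ V_F} γ′Y_n reduces to the maximum studentized moment.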
Assumption 7 then rules out degenerate solutions to our primal problem in-sample. The distinction between the sample and population problems is important, however, since Assumption 7 imposes no restrictions on µ_n, and as we note above it can be made to hold mechanically by adding full-rank normal noise to the moments.

With these conditions, we obtain asymptotic validity of φ_LF with ε = 0.

Corollary 1
Under Assumptions 2-7, the least favorable test is uniformly valid without an increase in the critical value,

lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫⁰_{D|Z}} Pr_{P_{D|Z}}{ η̂ > c_{α,LF}(X_n, Σ̂) } ≤ α.

(See Kaido et al. (2019) on the role of constraint qualifications for inference in partially identified models.)

D.2 Uniform Validity of Conditional and Hybrid Tests

We next turn to the asymptotic properties of conditional and hybrid tests. Note that the feasible conditional test based on the estimated variance Σ̂ can be written as

φ_C = 1{ γ̂′Y_n > c_{α,C}(γ̂, V^lo(S_{n,γ̂}), V^up(S_{n,γ̂}), Σ̂) },

where

γ̂ ∈ argmax_{γ ∈ V_F(X_n, Σ̂)} γ′Y_n.

We make the following additional assumption, which ensures that the vertices of the feasible set V_F(X, Σ) are either zero or nonzero on a neighborhood of (X, Σ).

Assumption 8
For all Σ ∈ S and all γ(X, Σ) ∈ V_F(X, Σ), 1{γ_j(X, Σ) = 0} is constant on B(X) × B(Σ) for all j.

Recall that γ̂ can be interpreted as the vector of Lagrange multipliers in the primal problem (12). This condition requires that when we consider the set of potential Lagrange multipliers V_F(X, Σ), the elements do not switch from zero to nonzero at (X, Σ). Critically, since the realized multiplier γ̂ is also determined by Y_n, this still allows the distribution of the realized γ̂ to vary depending on µ_n, which remains unrestricted.

To prove our asymptotic results, we use a modified version of the conditional test which never rejects if η̂ < −C for C a large positive constant. We do this for technical reasons, since when µ_n diverges to −∞, both η̂ and our conditional critical values may likewise diverge, and size control for the unmodified test φ_C requires that we control the relative rates of divergence. At the same time, this modification is reasonable on substantive grounds, since when η̂ is very small it is clear from the data that the moments hold, and rejections of the null in this case reflect extreme realizations of the conditional critical values.

Proposition 9
Under Assumptions 2-8, the modified conditional test φ*_C = φ_C · 1{η̂ ≥ −C} is uniformly asymptotically valid,

lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫⁰_{D|Z}} Pr_{P_{D|Z}}{φ*_C = 1} ≤ α.

The feasible hybrid test can likewise be written as

φ_H = 1{ γ̂′Y_n > c_{(α−κ)/(1−κ),H}(γ̂, V^lo(S_{n,γ̂}), V^up(S_{n,γ̂}), Σ̂) }.

Once we modify the test to never reject if η̂ < −C, asymptotic validity follows under the same conditions.

Corollary 2
Under Assumptions 2-8, the modified hybrid test φ*_H = φ_H · 1{η̂ ≥ −C} is uniformly asymptotically valid,

lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫⁰_{D|Z}} Pr_{P_{D|Z}}{φ*_H = 1} ≤ α.

D.3 Asymptotic Variance Estimation
Our asymptotic results have thus far taken as given the existence of a uniformly consistent estimator Σ̂ for the conditional variance Σ(P_{D|Z}). Here, we establish the uniform consistency of a particular estimator under mild conditions.

Following Abadie et al. (2014), we consider the nearest-neighbor variance estimator

Σ̂ = (1/(2n)) ∑_{i=1}^n (Y_i − Y_{ℓ_Z(i)})(Y_i − Y_{ℓ_Z(i)})′,  (31)

where for Ξ_n a positive-definite matrix,

ℓ_Z(i) = argmin_{j ∈ {1,...,n}, j ≠ i} (Z_i − Z_j)′ Ξ_n (Z_i − Z_j)

selects the index for the observation j with Z_j as close as possible to Z_i in the distance defined by Ξ_n. One natural choice of Ξ_n is the inverse of the sample variance, Ξ_n = V̂ar(Z_i)^{-1}, provided the sample variance has full rank. For ease of exposition we assume that Z_i has at least one continuously distributed dimension, so that ℓ_Z(i) is unique for all i. If instead Z_i is entirely discrete, one can estimate Σ̂ using the average of the sample conditional variances.

The intuition for the estimator Σ̂ is straightforward. Provided the conditional mean and variance of Y_i given Z_i are continuous in Z_i, if Z_{ℓ_Z(i)} is close to Z_i it will have nearly the same mean and variance. Hence, the variance of Y_i − Y_{ℓ_Z(i)} will be approximately twice the variance of Y_i, and the approximation error will vanish as Z_{ℓ_Z(i)} approaches Z_i. If the support of Z_i is compact, then with a large enough sample we are guaranteed to have observations quite "close" to almost all of our observations, and Σ̂ will converge to the average conditional variance Σ(P_{D|Z}). The next assumption formalizes the conditions needed for this argument.
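The estimator (31) is simple to compute directly. The following is a minimal sketch (not the authors' code; the function name is ours, and the O(n²) pairwise-distance computation is chosen for clarity rather than efficiency):

```python
import numpy as np

def nn_variance(Y, Z, Xi=None):
    """Nearest-neighbor estimator of the average conditional variance of Y
    given Z, as in (31): (1/(2n)) sum_i (Y_i - Y_l(i))(Y_i - Y_l(i))',
    where l(i) is the nearest neighbor of Z_i in the metric defined by Xi."""
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    n = Y.shape[0]
    if Xi is None:
        # default metric: inverse sample variance of Z (assumes full rank)
        Xi = np.linalg.inv(np.atleast_2d(np.cov(Z, rowvar=False)))
    diff = Z[:, None, :] - Z[None, :, :]            # (n, n, d) pairwise differences
    dist = np.einsum('ijk,kl,ijl->ij', diff, Xi, diff)
    np.fill_diagonal(dist, np.inf)                  # exclude j = i
    ell = dist.argmin(axis=1)                       # l_Z(i): nearest neighbor of Z_i
    D = Y - Y[ell]
    return (D.T @ D) / (2.0 * n)
```

In a simulated design with Var(Y_i | Z_i) = σ² constant and a Lipschitz conditional mean, the output should be close to σ² for moderate n, matching the heuristic above.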
Assumption 9
For λ_max(A) the maximal eigenvalue of a matrix A, the following conditions hold:

1. {Z_i}_{i=1}^∞ ⊂ 𝒵^∞ for 𝒵 a compact set;
2. lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫_{D|Z}} (1/n) ∑_i E_{P_{D|Z}}[‖Y_i‖² | Z_i] is finite;
3. µ_{P_{D|Z}}(z) = E_{P_{D|Z}}[Y_i | Z_i = z] is Lipschitz in z with Lipschitz constant uniformly bounded over P_{D|Z} ∈ 𝒫_{D|Z}, and is uniformly bounded over P_{D|Z} ∈ 𝒫_{D|Z};
4. V_{P_{D|Z}}(z) = E_{P_{D|Z}}[Y_i Y_i′ | Z_i = z] is Lipschitz in z with Lipschitz constant uniformly bounded over P_{D|Z} ∈ 𝒫_{D|Z};
5. sup_{P_{D|Z} ∈ 𝒫_{D|Z}} sup_{z ∈ 𝒵} λ_max( Var_{P_{D|Z}}(Y_i | Z_i = z) ) is finite;
6. Ξ_n → Ξ for a positive-definite limit Ξ.

Assumption 9(1) is used only to establish that the average distance between Z_i and Z_{ℓ_Z(i)} converges to zero, (1/n) ∑_i ‖Z_i − Z_{ℓ_Z(i)}‖ → 0. Hence, one may instead assume this condition directly. Assumptions 9(2) and 9(5) restrict the variance and second moment of Y_i, and are satisfied under a wide range of data generating processes. Assumptions 9(3) and 9(4) impose Lipschitz continuity on the mean and second moment of Y_i, consistent with the heuristic argument given above. Finally, 9(6) requires only that Ξ_n converge to a positive-definite limit.

Proposition 10
Under Assumptions 2 and 9, for Σ̂ as defined in (31) and all ε > 0,

lim_{n→∞} sup_{P_{D|Z} ∈ 𝒫_{D|Z}} Pr_{P_{D|Z}}{ ‖Σ̂ − Σ(P_{D|Z})‖ > ε } = 0,

so Assumption 3 holds.

E Proofs for Asymptotic Results

This section collects the proofs for the asymptotic results stated in Section D, along with the statements and proofs of some auxiliary results. Section E.1 proves Proposition 8, Section E.2 proves Proposition 9, and Section E.3 proves Proposition 10.
Proof of Lemma 12
Towards contradiction, suppose the conclusion of the lemma fails. Then there exists a sequence of distributions and sample sizes (P_{D|Z,m}, n_m) and a constant ε > 0 such that

lim inf_{m→∞} sup_{f ∈ BL} | E_{P_{D|Z,m}}[f(Y_{n_m} − µ_{n_m})] − E[f(ξ_{P_{D|Z,m}})] | > ε.  (32)

Since the set Λ specified in Assumption 2 is compact, there exists a subsequence of distributions and sample sizes (P_{D|Z,l}, n_l) along which Σ(P_{D|Z,l}) → Σ for some Σ ∈ Λ. Under this subsequence, however, the Lindeberg-Feller Central Limit Theorem (see e.g. Proposition 2.27 in Van der Vaart (1998)), along with the assumptions of the lemma, implies that Y_{n_l} − µ_{n_l} →_d N(0, Σ), and thus that

lim_{l→∞} sup_{f ∈ BL} | E_{P_{D|Z,l}}[f(Y_{n_l} − µ_{n_l})] − E[f(ξ_{P_{D|Z,l}})] | = 0.

This contradicts (32), completing the proof. □
E.1 Proof of Validity for Least Favorable Tests
As a preliminary step, we show that for test statistics R(ξ, Σ) which are (a) constant outside compact sets of values ξ and (b) bounded Lipschitz in both arguments, the critical value function is likewise bounded Lipschitz. To prove this statement, we use the metric

d(Σ_1, Σ_2) = ‖Σ_1^{1/2} − Σ_2^{1/2}‖ + ‖Σ_1 − Σ_2‖

for ‖A‖ the Euclidean norm if A is a vector, and the operator norm if A is a matrix.

Lemma 13  Suppose R(ξ, Σ) is (a) constant in ξ when max_j{|ξ_j|/√Σ_jj} > C for some constant C and (b) bounded Lipschitz in both arguments for Σ ∈ Λ with Lipschitz constant K. Then for c_α(Σ) the 1 − α quantile of R(ξ, Σ) under ξ ∼ N(0, Σ), c_α(Σ) is bounded Lipschitz with a constant that depends only on C, K, and λ̄.

Proof of Lemma 13
That c_α(Σ) is bounded follows immediately from boundedness of R(ξ, Σ). Next, note that we can write R(ξ, Σ) = R(Σ^{1/2}ζ, Σ) for ζ ∼ N(0, I). Since R(Σ^{1/2}ζ, Σ) is constant for ζ outside a compact set 𝒞, it suffices to limit attention to (ζ, Σ) ∈ 𝒞 × Λ. Note further that for any pair Σ_1, Σ_2 ∈ Λ and any ζ ∈ 𝒞,

|R(Σ_1^{1/2}ζ, Σ_1) − R(Σ_2^{1/2}ζ, Σ_2)|
≤ K‖Σ_1 − Σ_2‖ + K‖Σ_1^{1/2} − Σ_2^{1/2}‖ + K‖(Σ_1^{1/2} − Σ_2^{1/2})ζ‖
≤ K‖Σ_1 − Σ_2‖ + K‖Σ_1^{1/2} − Σ_2^{1/2}‖ + K‖Σ_1^{1/2} − Σ_2^{1/2}‖‖ζ‖
≤ K‖Σ_1 − Σ_2‖(1 + ‖ζ‖) + K‖Σ_1^{1/2} − Σ_2^{1/2}‖ + K‖Σ_1^{1/2} − Σ_2^{1/2}‖‖ζ‖
≤ K(1 + ‖𝒞‖) d(Σ_1, Σ_2)

for ‖𝒞‖ = sup_{ζ ∈ 𝒞}‖ζ‖, where the second line follows from the definition of the operator norm, the third line adds a weakly positive term to the RHS, and the final line uses the definition of the metric and takes a supremum.

Thus, we see that

1 − α = Pr{ R(ξ_1, Σ_1) ≤ c_α(Σ_1) } ≤ Pr{ R(ξ_2, Σ_2) ≤ c_α(Σ_1) + K(1 + ‖𝒞‖)d(Σ_1, Σ_2) },

and hence that c_α(Σ_2) ≤ c_α(Σ_1) + K(1 + ‖𝒞‖)d(Σ_1, Σ_2). Repeating the argument in the other direction, we obtain that |c_α(Σ_1) − c_α(Σ_2)| ≤ K(1 + ‖𝒞‖)d(Σ_1, Σ_2), and hence that c_α(Σ) is Lipschitz in Σ, as we aimed to show. □

Lemma 13 applies only to test statistics that are (a) globally Lipschitz and (b) constant for ξ large.
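The quantile c_α(Σ) appearing in Lemma 13 has no closed form for a generic statistic R, but it can be approximated by simulation. The following sketch (our own illustration; the function name and the small diagonal ridge are assumptions, the latter keeping the Cholesky factor well defined when Σ is singular) draws ξ ∼ N(0, Σ) and takes the empirical 1 − α quantile of R(ξ, Σ):

```python
import numpy as np

def mc_critical_value(R, Sigma, alpha=0.05, draws=200_000, seed=0):
    """Approximate c_alpha(Sigma), the 1 - alpha quantile of R(xi, Sigma)
    under xi ~ N(0, Sigma), by Monte Carlo simulation."""
    rng = np.random.default_rng(seed)
    k = len(Sigma)
    L = np.linalg.cholesky(np.asarray(Sigma) + 1e-12 * np.eye(k))
    xi = rng.standard_normal((draws, k)) @ L.T      # draws from N(0, Sigma)
    vals = np.array([R(x, Sigma) for x in xi])
    return np.quantile(vals, 1 - alpha)
```

For example, with R(ξ, Σ) = max_j ξ_j/√Σ_jj and Σ = I_2, the simulated critical value approximates Φ^{-1}(√0.95) ≈ 1.955 at α = 0.05.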
Our next result builds on this lemma to establish asymptotic validity for tests based on a much wider range of statistics.

Assumption 10  For all constants C, R(ξ, Σ) is bounded Lipschitz in (ξ, Σ) for

{ (ξ, Σ) : Σ ∈ Λ, max_j{|ξ_j|/√Σ_jj} ≤ C }

with Lipschitz constant K(C).

Lemma 14
Under Assumptions 2-4, for any ε > 0 and any sequence of test statistics R_n satisfying Assumption 10 for a common K(C), and corresponding critical values c_{α,n}(Σ̂),

lim_{n→∞} sup_{P_{D|Z} ∈ 𝒫_{D|Z}} Pr_{P_{D|Z}}{ R_n(Y_n − µ_n, Σ̂) ≥ c_{α,n}(Σ̂) + ε } ≤ α.

Proof of Lemma 14
For constants (C_1, C_2) with 0 < C_1 < C_2, let us define ς(ξ, Σ) = max_j{|ξ_j|/√Σ_jj} and

ψ(R, ξ, Σ, C_1, C_2) = ( 1{ς(ξ, Σ) < C_1} + ((C_2 − ς(ξ, Σ))/(C_2 − C_1)) 1{C_1 ≤ ς(ξ, Σ) < C_2} ) R(ξ, Σ).

ψ(R, ξ, Σ, C_1, C_2) is equal to R(ξ, Σ) when ς(ξ, Σ) is small, and continuously censors to zero when ς(ξ, Σ) is large. Note that for any (C_1, C_2), the assumptions of the lemma and the fact that products of bounded Lipschitz functions are bounded Lipschitz imply that ψ(R, ξ, Σ, C_1, C_2) is bounded Lipschitz in (ξ, Σ) for ξ unrestricted and Σ ∈ Λ. By Lemma 13, if we define c_{α,n}(Σ, C_1, C_2) as the 1 − α quantile of ψ(R_n, ξ, Σ, C_1, C_2) under ξ ∼ N(0, Σ), we see that c_{α,n}(Σ, C_1, C_2), and thus the difference ψ(R_n, ξ, Σ, C_1, C_2) − c_{α,n}(Σ, C_1, C_2), is bounded Lipschitz as well.

Towards contradiction, suppose the conclusion of the lemma fails. Then there exists a sequence of distributions {P_{D|Z,m}} ⊂ 𝒫_{D|Z}, sample sizes n_m, and a constant ν > 0 such that

lim inf_{m→∞} Pr_{P_{D|Z,m}}{ R_{n_m}(Y_{n_m} − µ_{n_m}, Σ̂) > c_{α,n_m}(Σ̂) + ε } ≥ α + ν.

Let us choose C_1 > 0 such that sup_{Σ ∈ Λ} Pr_Σ{ ς(ξ, Σ) ≥ C_1 } < ν/4. Since R_{n_m}(ξ, Σ) and ψ(R_{n_m}, ξ, Σ, C_1, C_2) are equal when ς(ξ, Σ) ≤ C_1, we see that

c_{α+ν/2, n_m}(Σ, C_1, C_2) ≤ c_{α,n_m}(Σ) ≤ c_{α−ν/2, n_m}(Σ, C_1, C_2).

Assumptions 2-4 imply that

lim sup_{m→∞} Pr_{P_{D|Z,m}}{ ς(Y_{n_m} − µ_{n_m}, Σ̂) > C_1 } < ν/4.  (33)

To see that this is the case, note that since the set of matrices Λ is compact, for any sequence of distributions and sample sizes (P_{D|Z,s}, n_s) there exists a subsequence (P_{D|Z,s_t}, n_{s_t}) such that Σ(P_{D|Z,s_t}) → Σ for some Σ ∈ Λ. Under this subsequence, Y_{n_{s_t}} − µ_{n_{s_t}} →_d N(0, Σ), Σ̂ →_p Σ, and

lim sup_{t→∞} Pr_{P_{D|Z,s_t}}{ ς(Y_{n_{s_t}} − µ_{n_{s_t}}, Σ̂) > C_1 } < ν/4

by the continuous mapping theorem and the portmanteau lemma (see Lemma 2.2 of Van der Vaart (2000)). Since such a subsequence can be extracted from any sequence, the claim follows.

Since R_{n_m}(Y_{n_m} − µ_{n_m}, Σ̂) and ψ(R_{n_m}, Y_{n_m} − µ_{n_m}, Σ̂, C_1, C_2) are equal for ς(Y_{n_m} − µ_{n_m}, Σ̂) ≤ C_1, this implies that

lim sup_{m→∞} Pr_{P_{D|Z,m}}{ R_{n_m}(Y_{n_m} − µ_{n_m}, Σ̂) ≠ ψ(R_{n_m}, Y_{n_m} − µ_{n_m}, Σ̂, C_1, C_2) } < ν/4.

Thus,

lim sup_{m→∞} Pr_{P_{D|Z,m}}{ ψ(R_{n_m}, Y_{n_m} − µ_{n_m}, Σ̂, C_1, C_2) > c_{α,n_m}(Σ̂) + ε } ≥ α + (3/4)ν.

Since we have shown that c_{α,n_m}(Σ) ≥ c_{α+ν/2, n_m}(Σ, C_1, C_2), this implies that

lim sup_{m→∞} Pr_{P_{D|Z,m}}{ ψ(R_{n_m}, Y_{n_m} − µ_{n_m}, Σ̂, C_1, C_2) > c_{α+ν/2, n_m}(Σ̂, C_1, C_2) + ε } ≥ α + (3/4)ν.

Define

T_m = ψ(R_{n_m}, Y_{n_m} − µ_{n_m}, Σ̂, C_1, C_2) − c_{α+ν/2, n_m}(Σ̂, C_1, C_2)

and

T_{m,∞} = ψ(R_{n_m}, ξ, Σ, C_1, C_2) − c_{α+ν/2, n_m}(Σ, C_1, C_2),

for ξ ∼ N(0, Σ(P_{D|Z,m})). The difference between T_m and T_{m,∞} is that the former uses the finite-sample distribution of Y_{n_m} − µ_{n_m} and Σ̂ while the latter uses the asymptotic normal distribution for ξ and the exact value of Σ. Our arguments above show that, viewed as a function of (Y_n − µ_n, Σ̂), T_m is bounded Lipschitz. Since compositions of bounded Lipschitz functions are bounded Lipschitz, Assumptions 3 and 4 imply that

lim_{m→∞} sup_{f ∈ BL} | E[f(T_m)] − E[f(T_{m,∞})] | = 0.  (34)

Since T_m is a sequence of bounded variables, by Prohorov's theorem there exists a subsequence m_s and a random variable T such that T_{m_s} →_d T. By (34) and the portmanteau lemma (see Lemma 2.2 of Van der Vaart (2000)), however, we also have T_{m_s,∞} →_d T. From the portmanteau lemma, it follows that

α + (3/4)ν ≤ lim sup_{s→∞} Pr{T_{m_s} ≥ ε} ≤ Pr{T ≥ ε} ≤ Pr{T > 0} ≤ lim inf_{s→∞} Pr{T_{m_s,∞} > 0}.

However, Pr{T_{m_s,∞} > 0} ≤ α + ν/2 for all s by the definition of the quantile function. Thus, since ν > 0, we have arrived at a contradiction. □

Lemma 15
Provided inf_δ max_j{X_{n,j}δ} ≠ −∞, the statistic

min_δ S(ξ − X_n δ, Σ) = min_δ max_j{ (ξ_j − X_{n,j}δ)/√Σ_jj }

satisfies Assumption 10 with Lipschitz constants independent of X_n.

Proof of Lemma 15
Note, first, that for any fixed δ the statistic

S̃(ξ, X_n, Σ; δ) = max_j{ (ξ_j − X_{n,j}δ)/√Σ_jj }

is Lipschitz in (ξ, Σ) for Σ ∈ Λ and ξ such that max_j{|ξ_j|/√Σ_jj} ≤ C, with a Lipschitz constant that does not depend on δ or X_n. Since the minimum of a collection of functions with a common Lipschitz constant is Lipschitz with the same constant, this implies that S̃(ξ, X_n, Σ) = min_δ S̃(ξ, X_n, Σ; δ) is Lipschitz with the same constant.

To see that the statistic is bounded, observe that the assumption that inf_δ max_j{X_{n,j}δ} ≠ −∞ implies that

S̃(ξ, X_n, Σ) ≥ min_j{ ξ_j/√Σ_jj },

since otherwise the span of X_n must contain a strictly negative vector, and hence inf_δ max_j{X_{n,j}δ} = −∞. On the other hand, by construction

S̃(ξ, X_n, Σ) ≤ S̃(ξ, X_n, Σ; 0) = max_j{ ξ_j/√Σ_jj }.

Thus, we see that for max_j{|ξ_j|/√Σ_jj} ≤ C for any constant C, S̃(ξ, X_n, Σ) is bounded between −C and C. □

We next build on these preliminary results to prove uniform size control for the least favorable test.
Proof of Proposition 8  If X_n is such that inf_δ max_j{X_{n,j}δ} = −∞ then η̂ = −∞ with probability one, and our tests never reject. For the remainder of the proof we thus assume that inf_δ max_j{X_{n,j}δ} ≠ −∞.

For the least favorable projection test, note that this test rejects if and only if S(Y_n − X_n δ, X_n, Σ̂) > c_{α,LF(δ)}(Σ̂) for all δ. Note that under the null, there exists a value δ* such that µ_n − X_n δ* ≤ 0. Hence,

1{ S(Y_n − X_n δ*, X_n, Σ̂) > c_{α,LF(δ)}(Σ̂) } ≤ 1{ S(Y_n − µ_n, X_n, Σ̂) > c_{α,LF(δ)}(Σ̂) }.

Note, however, that S(Y_n − µ_n, X_n, Σ̂) is the (scaled) maximum of a finite number of normal random variables with nonzero variance, and is Lipschitz in (Y_n − µ_n, Σ) for Σ ∈ Λ and Y_n − µ_n bounded. Lemma 14 thus implies that for any ε > 0,

lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫⁰_{D|Z}} Pr_{P_{D|Z}}{ S(Y_n − µ_n, X_n, Σ̂) > c_{α,LF(δ)}(Σ̂) + ε } ≤ α.  (35)

Moreover, for ξ ∼ N(0, Σ), S(ξ, X_n, Σ) is continuously distributed with density bounded uniformly over Σ ∈ Λ (see e.g. Theorem 3 of Chernozhukov et al. (2015)). Thus, since (35) holds for all ε > 0, it also holds for ε = 0.

To establish size control for least favorable tests, we note that since the test statistic is monotonically increasing in Y_n, the fact that µ_n ≤ 0 under the null implies that

1{ η̂ > c_α(X_n, Σ̂) + ε } ≤ 1{ min_δ S(Y_n − µ_n − X_n δ, X_n, Σ̂) > c_α(X_n, Σ̂) + ε }.

Thus, if we can prove that the right hand side has asymptotic rejection probability less than or equal to α under the null, the left hand side must as well. Since Lemma 15 shows that min_δ S(ξ − X_n δ, X_n, Σ) satisfies the conditions of Lemma 14 with Lipschitz constants that do not depend on X_n, Lemma 14 immediately implies that

lim sup_{n→∞} sup_{P_{D|Z} ∈ 𝒫⁰_{D|Z}} Pr_{P_{D|Z}}{ min_δ S(Y_n − µ_n − X_n δ, X_n, Σ̂) ≥ c_{α,n}(Σ̂) + ε } ≤ α,

as we aimed to show. □

Proof of Corollary 1
As in the proof of Lemma 14, let us assume the result fails. Then there exists a sequence of distributions {P_{D|Z,m}} ⊂ 𝒫⁰_{D|Z}, sample sizes n_m, and a constant ν > 0 such that, for S̃ defined as in the proof of Lemma 15,

lim inf_{m→∞} Pr_{P_{D|Z,m}}{ S̃(Y_{n_m} − µ_{n_m}, X_{n_m}, Σ̂) > c_{α,LF}(X_{n_m}, Σ̂) } ≥ α + ν.

Let us choose C_1 > 0 such that sup_{Σ ∈ Λ} Pr_Σ{ ς(ξ, Σ) ≥ C_1 } < ν/8, where we again define ς(ξ, Σ) = max_j{|ξ_j|/√Σ_jj}. As argued in the proof of Lemma 14, this implies that (for ψ as defined in that proof)

lim sup_{m→∞} Pr_{P_{D|Z,m}}{ S̃(Y_{n_m} − µ_{n_m}, X_{n_m}, Σ̂) ≠ ψ(S̃, Y_{n_m} − µ_{n_m}, Σ̂, C_1, C_2) } < ν/8,

and

lim sup_{m→∞} Pr_{P_{D|Z,m}}{ ψ(S̃, Y_{n_m} − µ_{n_m}, Σ̂, C_1, C_2) > c_{α,LF}(X_{n_m}, Σ̂) } ≥ α + (3/4)ν,

and thus that

lim sup_{m→∞} Pr_{P_{D|Z,m}}{ ψ(S̃, Y_{n_m} − µ_{n_m}, Σ̂, C_1, C_2) > c_{α+ν/2, n_m}(Σ̂, C_1, C_2) } ≥ α + (3/4)ν,

for c_{α+ν/2, n_m}(Σ̂, C_1, C_2) the 1 − α − ν/2 quantile of ψ(S̃, ξ, Σ̂, C_1, C_2) under ξ ∼ N(0, Σ̂).

Since the set Λ is compact, we can extract a further subsequence n_s along which Σ(P_{D|Z,n_s}) → Σ. (We write (s, n_s) rather than (m_s, n_{m_s}) for readability.) We see, however, that along this subsequence the continuous mapping theorem implies

S̃(Y_{n_s} − µ_{n_s}, X_{n_s}, Σ̂) →_d max_{γ ∈ V_F(X, Σ)} γ′ξ,

and c_{α+ν/2, n_s}(Σ̂, C_1, C_2) →_p c_{α+ν/2}(Σ, C_1, C_2), where we have used the continuity of S̃(ξ, X, Σ) implied by Lemma 19 below, as well as the continuity of the critical value implied by Lemma 13.

The proof of Lemma 20 below then implies that

S̃(Y_{n_s} − µ_{n_s}, X_{n_s}, Σ̂) − c_{α+ν/2, n_s}(Σ̂, C_1, C_2)  (36)

converges in distribution to a continuous random variable. Note, however, that the total variation distance between (36) and

T_s = ψ(S̃, Y_{n_s} − µ_{n_s}, Σ̂, C_1, C_2) − c_{α+ν/2, n_s}(Σ̂, C_1, C_2)

is bounded above by ν/8 asymptotically by the argument following (33) in the proof of Lemma 14. If we define

T_{s,∞} = ψ(S̃, ξ, Σ, C_1, C_2) − c_{α+ν/2, n_s}(Σ, C_1, C_2),

then as in the proof of Lemma 14 we know that lim_{s→∞} sup_{f ∈ BL} |E[f(T_s)] − E[f(T_{s,∞})]| = 0.

As in the proof of Lemma 14, by Prohorov's theorem we know there exists a further subsequence s_t along which T_{s_t} →_d T for a random variable T. Moreover, we know that T_{s_t,∞} converges to the same limit, and thus by the portmanteau lemma

α + (3/4)ν ≤ lim sup_{t→∞} Pr{T_{s_t} ≥ 0} ≤ Pr{T ≥ 0}

and

Pr{T > 0} ≤ lim inf_{t→∞} Pr{T_{s_t,∞} > 0} ≤ α + ν/2

by the definition of the critical value. Thus, we see that Pr{T = 0} ≥ ν/4. However, we have argued that for large t, T_{s_t} is within total variation distance ν/8 of a sequence of random variables that converge in distribution to a continuous limit, which implies that Pr{T = 0} ≤ ν/8 < ν/4. Thus, we have reached a contradiction. □
E.2 Proof of Validity for Conditional and Hybrid Tests
We next turn to the proof of Proposition 9. Let us define

T(Y_n, X_n, Σ̂) = γ̂′Y_n − c_{α,C}(γ̂, V^lo(S_{n,γ̂}), V^up(S_{n,γ̂}), Σ̂)  (37)

for

γ̂ = argmax_{γ ∈ V_F(X_n, Σ̂)} γ′Y_n.

Note that η̂ exceeds the conditional critical value if and only if T(Y_n, X_n, Σ̂) is strictly positive. As in the last section, we begin by proving several auxiliary lemmas.
For all δ̃ ∈ R^p,

T(Y_n, X_n, Σ̂) = T(Y_n + X*_n δ̃, X*_n, Σ̂),

where again X*_n = X_n/√n.

Proof of Lemma 16
Recall that the feasible region F(X, Σ) is the set of points γ ≥ 0 such that √diag(Σ)′γ = 1 and X′γ = 0. It follows that F(X_n, Σ) = F(X*_n, Σ), and hence that the set of vertices V_F(X_n, Σ̂) = V_F(X*_n, Σ̂). From this we see immediately that T(Y_n, X_n, Σ̂) = T(Y_n, X*_n, Σ̂). Since γ′X*_n = 0, we also see that γ̂ calculated with Y_n is the same as γ̂ calculated with Y_n + X*_n δ̃, and γ̂′Y_n = γ̂′(Y_n + X*_n δ̃). Likewise, for all γ̂, γ ∈ V_F(X*_n, Σ̂),

γ′S_{n,γ̂} = γ′Y_n − (γ′Σ̂γ̂ / γ̂′Σ̂γ̂) γ̂′Y_n = γ′(Y_n + X*_n δ̃) − (γ′Σ̂γ̂ / γ̂′Σ̂γ̂) γ̂′(Y_n + X*_n δ̃).

Thus, γ′S_{n,γ̂} calculated with Y_n is equal to γ′S_{n,γ̂} calculated with Y_n + X*_n δ̃. From (22) and (23), it is thus clear that V^lo(s) and V^up(s) are the same when calculated with Y_n as with Y_n + X*_n δ̃. This suffices to establish the result. □
Lemma 17
Under Assumptions 6 and 7, for all $\mu^*$ with $\mu_j^* \in [-\infty, 0]$ for all $j$,
\[ \hat\gamma(\xi, X, \Sigma) = \arg\max_{\gamma \in V(F(X,\Sigma))} \gamma'(\xi + \mu^*) \]
is almost surely continuous at $(\xi, X, \Sigma)$ for $\xi \sim N(0,\Sigma)$ and $(X,\Sigma)$ non-stochastic, where we define $0 \cdot \infty = 0$.

Proof of Lemma 17
To prove this result, note first that Assumption 7 implies that for any pair $\gamma_1, \gamma_2 \in V(F(X,\Sigma))$, $(\gamma_1 - \gamma_2)'\xi$ has a non-degenerate normal distribution. By Assumption 6, the same also holds on a neighborhood of $(X,\Sigma)$. This implies, however, that on a neighborhood of $(X,\Sigma)$, $\hat\gamma(\xi,X,\Sigma)$ is unique with probability one. Almost-everywhere continuity of $\hat\gamma(\xi,X,\Sigma)$ then follows from Assumption 6. $\square$

Lemma 18
Under Assumptions 6 and 7, the conditional critical value
\[ c_{\alpha,C}\Big(\hat\gamma(\xi,X,\Sigma),\, V^{lo}\big(\tilde S_{n,\hat\gamma(\xi,X,\Sigma)}\big),\, V^{up}\big(\tilde S_{n,\hat\gamma(\xi,X,\Sigma)}\big),\, \Sigma\Big) \]
is almost surely continuous at $(\xi,X,\Sigma)$ when computed with
\[ \tilde S_{n,\hat\gamma(\xi,X,\Sigma)} = \left(I - \frac{\Sigma\,\hat\gamma(\xi,X,\Sigma)\,\hat\gamma(\xi,X,\Sigma)'}{\hat\gamma(\xi,X,\Sigma)'\,\Sigma\,\hat\gamma(\xi,X,\Sigma)}\right)(\xi + \mu^*), \]
when $\xi \sim N(0,\Sigma)$ and $\mu^*$ is as in Lemma 17.

Proof of Lemma 18
For brevity of notation we abbreviate $\hat\gamma(\xi,X,\Sigma)$ by $\hat\gamma$. To prove the result, recall that
\[ c_{\alpha,C}\big(\gamma, V^{lo}(S_{n,\gamma}), V^{up}(S_{n,\gamma}), \Sigma\big) = \sqrt{\gamma'\Sigma\gamma}\cdot\Phi^{-1}\left((1-\alpha)\,\Phi\!\left(\frac{V^{up}(S_{n,\gamma})}{\sqrt{\gamma'\Sigma\gamma}}\right) + \alpha\,\Phi\!\left(\frac{V^{lo}(S_{n,\gamma})}{\sqrt{\gamma'\Sigma\gamma}}\right)\right). \]
That $\sqrt{\hat\gamma'\Sigma\hat\gamma}$ and $1/\sqrt{\hat\gamma'\Sigma\hat\gamma}$ are almost everywhere continuous follows from Assumption 6 and Lemma 17.

Note, next, that provided $\gamma'\Sigma\gamma$ is nonzero,
\[ \Phi^{-1}\left((1-\alpha)\,\Phi\!\left(\frac{V^{up}}{\sqrt{\gamma'\Sigma\gamma}}\right) + \alpha\,\Phi\!\left(\frac{V^{lo}}{\sqrt{\gamma'\Sigma\gamma}}\right)\right) \]
is continuous in $(V^{lo}, V^{up})$ on $(\mathbb{R}\cup\{-\infty,\infty\})^2$. This is obvious when at least one of $(V^{lo}, V^{up})$ is finite. When $V^{lo}\to-\infty$ and $V^{up}\to\infty$, the expression converges to $\Phi^{-1}(1-\alpha) = \Phi^{-1}\big((1-\alpha)\Phi(\infty) + \alpha\Phi(-\infty)\big)$; when both $V^{lo}, V^{up}\to-\infty$, it converges to $-\infty = \Phi^{-1}\big((1-\alpha)\Phi(-\infty)+\alpha\Phi(-\infty)\big)$; and when both $V^{lo}, V^{up}\to\infty$, it converges to $\infty = \Phi^{-1}\big((1-\alpha)\Phi(\infty)+\alpha\Phi(\infty)\big)$.

To complete the argument, it suffices to show that $\big(V^{lo}(\tilde S_{n,\hat\gamma}), V^{up}(\tilde S_{n,\hat\gamma})\big)$ are continuous at almost every $(\xi,X,\Sigma)$. To see that this is the case, recall that $\hat\gamma$ is almost everywhere continuous by Lemma 17. Note, next, that for a given $\hat\gamma$,
\[ V^{lo}(\tilde S_{n,\hat\gamma}) = \min\left\{c : c = \max_{\gamma\in V(F(X,\Sigma))} \gamma'\left(\tilde S_{n,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}\,c\right)\right\} = \min\left\{c : 0 = \max_{\gamma\in V(F(X,\Sigma))} \hat a_\gamma + \hat b_\gamma c\right\} \]
for $\hat a_\gamma = \gamma'\tilde S_{n,\hat\gamma}$ and $\hat b_\gamma = \frac{\gamma'\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma} - 1$.
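The envelope characterization above translates directly into a computation. The following sketch, for a hypothetical two-vertex example with known $\Sigma = I_2$ (the vertex list is supplied by hand; none of these numbers come from the paper), computes $\hat a_\gamma$, $\hat b_\gamma$, the truncation points $V^{lo}$ and $V^{up}$, and the truncated normal critical value from the display above, using Python's standard-library `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$:

```python
from math import inf, sqrt
from statistics import NormalDist

Phi, Phi_inv = NormalDist().cdf, NormalDist().inv_cdf

def dot(a, b): return sum(x * z for x, z in zip(a, b))
def mat_vec(m, v): return [dot(row, v) for row in m]
def quad(a, m, b): return dot(a, mat_vec(m, b))

def truncation_points(vertices, y, sigma, g_hat):
    """V^lo, V^up from the upper-envelope characterization: the set of c
    with 0 = max_gamma (a_gamma + b_gamma * c), a and b as in the text."""
    denom = quad(g_hat, sigma, g_hat)            # gamma_hat' Sigma gamma_hat
    eta = dot(g_hat, y)                          # eta_hat = gamma_hat' Y
    sig_g = mat_vec(sigma, g_hat)
    s_tilde = [yi - si * eta / denom for yi, si in zip(y, sig_g)]
    vlo, vup = -inf, inf
    for g in vertices:
        a = dot(g, s_tilde)                      # a_gamma = gamma' S_tilde
        b = quad(g, sigma, g_hat) / denom - 1.0  # b_gamma
        if b < 0:                                # downward-sloping line: lower bound
            vlo = max(vlo, -a / b)
        elif b > 0:                              # upward-sloping line: upper bound
            vup = min(vup, -a / b)
    return vlo, vup

def conditional_cv(alpha, var, vlo, vup):
    # 1 - alpha quantile of N(0, var) truncated to [vlo, vup].
    s = sqrt(var)
    return s * Phi_inv((1 - alpha) * Phi(vup / s) + alpha * Phi(vlo / s))

# Hypothetical two-vertex example with Sigma = I_2 and Y = (2, 1).
sigma = [[1.0, 0.0], [0.0, 1.0]]
vertices = [(1.0, 0.0), (0.0, 1.0)]
y = [2.0, 1.0]
g_hat = max(vertices, key=lambda g: dot(g, y))   # maximizing vertex: e_1
vlo, vup = truncation_points(vertices, y, sigma, g_hat)
assert (vlo, vup) == (1.0, inf)                  # truncated below at second-largest Y
cv = conditional_cv(0.05, quad(g_hat, sigma, g_hat), vlo, vup)
reject = dot(g_hat, y) > cv
```

In this example the lower truncation point is the second-largest element of $Y$, as in the no-nuisance-parameter case discussed later, and the critical value is finite even though $V^{up} = \infty$.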
Note that $\hat a_{\hat\gamma} = \hat b_{\hat\gamma} = 0$, so $0 \le \max_{\gamma\in V(F(X,\Sigma))} \hat a_\gamma + \hat b_\gamma c$ for all $c$. Moreover, for $c = \hat\gamma' Y_n$ the max is attained at $\hat\gamma$ by construction. Hence, the set over which we are minimizing is non-empty.

Intuitively, if we plot $\hat a_\gamma + \hat b_\gamma c$ as a function of $c$, each $\gamma \in V(F(X,\Sigma))$ defines a line, and we are interested in the set of values $c$ such that zero lies on the upper envelope of this collection of lines. As this characterization suggests, to find the lower bound $V^{lo}$ it suffices to limit attention to $\gamma \in V(F(X,\Sigma))$ with $\hat b_\gamma \le 0$. For given $\hat\gamma$, $V^{lo}(\tilde S_{n,\hat\gamma})$ is thus equal to either $-\infty$ or the largest solution to
\[ c = \gamma'\left(\tilde S_{n,\hat\gamma} + \frac{\Sigma\hat\gamma}{\hat\gamma'\Sigma\hat\gamma}\,c\right) \]
for $\gamma$ in $V(F(X,\Sigma))$ with $\gamma'\Sigma\hat\gamma < \hat\gamma'\Sigma\hat\gamma$. Among $\gamma$ with $\gamma'\Sigma\hat\gamma \ne \hat\gamma'\Sigma\hat\gamma$, this largest solution is well-defined and continuous. Matters are more delicate for $\gamma$ with $\gamma'\Sigma\hat\gamma = \hat\gamma'\Sigma\hat\gamma$: in this case we may have discontinuities in $\Sigma$, but only if $\hat\gamma'\tilde S_{n,\hat\gamma} = \gamma'\tilde S_{n,\hat\gamma}$. However, $\hat\gamma'\tilde S_{n,\hat\gamma} = \gamma'\tilde S_{n,\hat\gamma}$ with positive probability if and only if $\gamma'\xi - \hat\gamma'\xi = 0$ with positive probability, which for $\gamma \ne \hat\gamma$ is ruled out by Assumption 7. Hence, we see that $V^{lo}(\tilde S_{n,\hat\gamma})$ is almost everywhere continuous in the limit problem, as desired. The analogous argument applies for $V^{up}(\tilde S_{n,\hat\gamma})$, so overall we obtain that the critical value function is almost everywhere continuous, as we wanted to show. $\square$

Lemma 19
Under Assumptions 6-8, for $\mu^*$ such that $\mu_j^* \in [-\infty, 0]$ for all $j$, $\max_{\gamma\in V(F(X,\Sigma))} \gamma'(\xi + \mu^*)$ is almost everywhere continuous at $(\xi, X, \Sigma)$ for $\xi \sim N(0,\Sigma)$ and $(X,\Sigma)$ constant.

Proof of Lemma 19
To see that this is the case, note, first, that almost-everywhere continuity of $\hat\gamma'\xi$ is immediate from Lemma 17. Thus, what remains is to show almost-everywhere continuity of $\hat\gamma'\mu^* = \sum_j \hat\gamma_j \mu_j^*$. For those elements $\mu_j^*$ that are finite, almost-everywhere continuity of $\hat\gamma_j\mu_j^*$ is again immediate from Lemma 17. To complete the proof we need only show that $\hat\gamma_j\mu_j^*$ is almost everywhere continuous when $\mu_j^* = -\infty$. However, this follows from Assumption 8, which ensures that for every $\gamma(X,\Sigma) \in V(F(X,\Sigma))$, $\gamma_j(X,\Sigma)\mu_j^*$ is constant on a neighborhood of $(X,\Sigma)$ when $\mu_j^* = -\infty$. $\square$

Lemma 20
Under Assumptions 6 and 7, for $\mu^*$ such that $\mu_j^* \in [-\infty, 0]$ for all $j$, if $\max_{\gamma\in V(F(X,\Sigma))} \gamma'\mu^*$ is finite then $T(\xi + \mu^*, X, \Sigma)$ as defined in (37) is finite with probability one and continuously distributed for $\xi \sim N(0,\Sigma)$ and $(X,\Sigma)$ constant.

Proof of Lemma 20
We first prove finiteness. In particular, note that since $\xi$ is finite with probability one and $V(F(X,\Sigma))$ is a finite set, finiteness of $\max_{\gamma\in V(F(X,\Sigma))} \gamma'\mu^*$ implies finiteness of $\hat\eta = \max_{\gamma\in V(F(X,\Sigma))} \gamma'(\xi + \mu^*)$. Recall from the proof of Lemma 18 that the conditional critical value is infinite only if $V^{lo}(s) = V^{up}(s) = \infty$ or $V^{lo}(s) = V^{up}(s) = -\infty$. Since $V^{lo}(\tilde S_{n,\hat\gamma}) \le \hat\eta \le V^{up}(\tilde S_{n,\hat\gamma})$, however, this implies that $V^{lo}(\tilde S_{n,\hat\gamma})$ is not equal to $\infty$ and $V^{up}(\tilde S_{n,\hat\gamma})$ is not equal to $-\infty$, and thus that $c_{\alpha,C}\big(\hat\gamma, V^{lo}(\tilde S_{n,\hat\gamma}), V^{up}(\tilde S_{n,\hat\gamma}), \Sigma\big)$ is finite. Hence, $T(\xi + \mu^*, X, \Sigma)$ is finite.

To complete the proof, note that for fixed $\gamma$, $\gamma' Y_n$ is continuously distributed and independent of $\tilde S_{n,\gamma}$, and thus of $\big(V^{lo}(\tilde S_{n,\gamma}), V^{up}(\tilde S_{n,\gamma})\big)$. In particular,
\[ \Pr\left\{\gamma' Y_n = V^{lo}(\tilde S_{n,\gamma})\right\} = 0. \]
Since $V(F(X,\Sigma))$ is finite, it follows that $\Pr\{\hat\eta = V^{lo}(\tilde S_{n,\hat\gamma})\} = 0$, and thus that $V^{lo}(\tilde S_{n,\hat\gamma}) < V^{up}(\tilde S_{n,\hat\gamma})$ with probability one. Recall that $\hat\eta$ lies between $V^{lo}(\tilde S_{n,\hat\gamma})$ and $V^{up}(\tilde S_{n,\hat\gamma})$ with probability one, and conditional on $\hat\gamma$ and $\tilde S_{n,\hat\gamma}$ follows a truncated normal distribution with untruncated variance $\hat\gamma'\Sigma\hat\gamma > 0$. Hence $T(\xi + \mu^*, X, \Sigma)$ is continuously distributed conditional on $\hat\gamma$ and $\tilde S_{n,\hat\gamma}$, for almost every $\hat\gamma$ and $\tilde S_{n,\hat\gamma}$. It follows that $T(\xi + \mu^*, X, \Sigma)$ is continuously distributed unconditionally as well. $\square$

Proof of Proposition 9 If $X_n$ is such that $\inf_\delta \max_j \{X_{n,j}\delta\} = -\infty$, then $\hat\eta = -\infty$ with probability one, and our tests never reject. For the remainder of the proof we thus assume that $\inf_\delta \max_j \{X_{n,j}\delta\} \neq -\infty$.

As in D. Andrews et al. (2019), note that uniform asymptotic size control is equivalent to asymptotic size control under all sequences of distributions $P_{D|Z,n} \in \mathcal{P}_{D|Z}$. Suppose that $\phi^*_C$ fails to control asymptotic size. Then there exist a sequence of distributions $P_{D|Z,n_m}$, a sequence of sample sizes $n_m$, and a value $\nu > 0$ such that
\[ \liminf_{m\to\infty} \Pr_{P_{D|Z,n_m}}\{\phi^*_C = 1\} > \alpha + \nu. \]
By the compactness of $\mathcal{S}$, for any such sequence there exists a subsequence $n_{m,1}$ along which $\Sigma(P_{D|Z,n_{m,1}}) \to \Sigma \in \mathcal{S}$. For each $n$, since $P_{D|Z,n} \in \mathcal{P}_{D|Z}$ we know there exists a $\delta_n$ such that $\mu_n - X^*_n\delta_n \le 0$. Thus, there exists a further subsequence $n_{m,2}$ along which $\mu_{n_{m,2},1} - X^*_{n_{m,2},1}\delta_{n_{m,2}} \to \mu^*_1$ for $\mu^*_1 \in [-\infty, 0]$, for $\mu_{n,1}$ the first component of $\mu_n$. Passing to further such subsequences, we see that there exists a subsequence $n_{m,k+1}$ such that $\Sigma(P_{D|Z,n_{m,k+1}}) \to \Sigma$ and $\mu_{n_{m,k+1}} - X^*_{n_{m,k+1}}\delta_{n_{m,k+1}} \to \mu^*$, where $\mu^*_j \in [-\infty, 0]$ for all $j$.
For simplicity of notation, for the remainder of the proof we assume that this property holds for the initial sequence $n_m$, so $\Sigma(P_{D|Z,n_m}) \to \Sigma$ and $\mu_{n_m} - X^*_{n_m}\delta_{n_m} \to \mu^*$.

Lemma 16 implies that
\[ T(Y_{n_m}, X_{n_m}, \hat\Sigma) = T(Y_{n_m} - X^*_{n_m}\delta_{n_m}, X^*_{n_m}, \hat\Sigma), \]
while Assumptions 2-5 imply that $(Y_{n_m} - X^*_{n_m}\delta_{n_m}, X^*_{n_m}, \hat\Sigma) \to_d (\xi + \mu^*, X, \Sigma)$ for $\xi \sim N(0,\Sigma)$. Together, Lemmas 18 and 19 imply that $T(\xi + \mu^*, X, \Sigma)$ is almost everywhere continuous with respect to the distribution of $(\xi + \mu^*, X, \Sigma)$, and thus, by the continuous mapping theorem, that $T(Y_{n_m}, X^*_{n_m}, \hat\Sigma) \to_d T(\xi + \mu^*, X, \Sigma)$.

If $\max_{\gamma\in V(F(X,\Sigma))} \gamma'\mu^* = -\infty$, then $\hat\eta \to -\infty$. Hence, since the modified conditional test never rejects for $\hat\eta < -C$, this implies that $\lim_{m\to\infty} \Pr\{\phi^*_C = 1\} = 0$, contradicting our assumption that size control fails. Thus, for the remainder of the argument we assume that $\max_{\gamma\in V(F(X,\Sigma))} \gamma'\mu^*$ is finite. (Recall that $\gamma \in V(F(X,\Sigma))$ implies that $\gamma \ge 0$, so we cannot have $\gamma'\mu^* = \infty$.) Under this assumption, Lemma 20 shows that $T(\xi + \mu^*, X, \Sigma)$ is continuously distributed. This implies that
\[ \lim_{m\to\infty} \Pr\left\{T(Y_{n_m}, X^*_{n_m}, \hat\Sigma) > 0\right\} = \Pr\{T(\xi + \mu^*, X, \Sigma) > 0\}, \]
and thus that $\Pr\{T(\xi + \mu^*, X, \Sigma) > 0\} \ge \alpha + \nu$. However, provided $\max_{\gamma\in V(F(X,\Sigma))} \gamma'\mu^*$ is finite, Proposition 6 shows that for $\mu^* \le 0$,
\[ \Pr\{T(\xi + \mu^*, X, \Sigma) > 0\} \le \alpha, \]
so we have reached a contradiction. $\square$

Proof of Corollary 2
Note that the hybrid test is of nearly the same form as the conditional test, except that it uses $V^{up,H}(S_{n,\hat\gamma}) = \min\big\{V^{up}(S_{n,\hat\gamma}),\, c_{\kappa,LF}(X_n, \hat\Sigma)\big\}$ instead of $V^{up}(S_{n,\hat\gamma})$, and considers a different quantile of the conditional distribution. Building on the proof of Proposition 9, to prove asymptotic validity of $\phi^*_H$ it thus suffices to show that $V^{up,H}(S_{n,\hat\gamma})$ is almost everywhere continuous when computed using the set of limit distributions considered in that proof. However, we have already shown that $V^{up}(S_{n,\hat\gamma})$ satisfies this property, so we need only show that $c_{\kappa,LF}(X,\Sigma)$ is continuous in $(X,\Sigma)$.

Recall, however, that $c_{\kappa,LF}(X,\Sigma)$ is the $1-\kappa$ quantile of $\max_{\gamma\in V(F(X,\Sigma))} \gamma'\xi$ for $\xi \sim N(0,\Sigma)$. Lemma 19 shows that under our assumptions this max is almost everywhere continuous in $(\xi,X,\Sigma)$, from which continuity of the $1-\kappa$ quantile follows immediately. To complete the argument, recall that the proof of Lemma 18 shows that $V^{up}(\tilde S_{n,\hat\gamma})$ is almost everywhere continuous in the limit problem, which together with the argument above shows that $V^{up,H}(\tilde S_{n,\hat\gamma})$ is almost everywhere continuous. Note that the hybrid test is unchanged if, rather than defining $c_{\frac{\alpha-\kappa}{1-\kappa},C}\big(\hat\gamma, V^{lo}(S_{n,\hat\gamma}), V^{up,H}(S_{n,\hat\gamma}), \Sigma\big)$ to be $-\infty$ when $V^{lo}(S_{n,\hat\gamma}) > V^{up,H}(S_{n,\hat\gamma})$, we instead define it to be $V^{up,H}(S_{n,\hat\gamma})$, since $V^{lo}(S_{n,\hat\gamma}) > V^{up,H}(S_{n,\hat\gamma})$ implies $\hat\eta > V^{up,H}(S_{n,\hat\gamma})$. With this modification, however, we see that $c_{\frac{\alpha-\kappa}{1-\kappa},C}\big(\hat\gamma, V^{lo}(S_{n,\hat\gamma}), V^{up,H}(S_{n,\hat\gamma}), \Sigma\big)$ is almost everywhere continuous in the limit problem by the same argument as in the proof of Lemma 18. Hence, $\hat\eta - c_{\frac{\alpha-\kappa}{1-\kappa},C}\big(\hat\gamma, V^{lo}(S_{n,\hat\gamma}), V^{up,H}(S_{n,\hat\gamma}), \Sigma\big)$
is almost everywhere continuous in the limit problem by the same arguments as in the proof of Proposition 9.

All that remains is to show that this quantity is continuously distributed. As argued in the proof of Lemma 20, however, if $\hat\eta$ is finite it is continuously distributed conditional on $\hat\gamma$ and $S_{n,\hat\gamma}$, for almost every $\hat\gamma$ and $S_{n,\hat\gamma}$. This implies that $\hat\eta$ is continuously distributed conditional on almost every realization of $c_{\frac{\alpha-\kappa}{1-\kappa},C}\big(\hat\gamma, V^{lo}(S_{n,\hat\gamma}), V^{up,H}(S_{n,\hat\gamma}), \Sigma\big)$, and so proves continuity. $\square$

E.3 Proof of Variance Consistency
We first prove two auxiliary lemmas, which we then use to prove Proposition 10.
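Proposition 10 concerns the matching-based estimator $\hat\Sigma = \frac{1}{2n}\sum_i (Y_i - Y_{\ell_Z(i)})(Y_i - Y_{\ell_Z(i)})'$, where $\ell_Z(i)$ indexes the observation whose $Z$ is nearest to $Z_i$. A minimal sketch for scalar $Y$ and $Z$ on a small hypothetical dataset (the matrix case would replace the square with an outer product; ties in the match are broken arbitrarily):

```python
def match_index(z, i):
    # Nearest neighbor of Z_i among the other observations.
    return min((j for j in range(len(z)) if j != i), key=lambda j: abs(z[j] - z[i]))

def sigma_hat(y, z):
    # (1/2n) * sum_i (Y_i - Y_{l(i)})^2 for scalar Y.
    n = len(y)
    return sum((y[i] - y[match_index(z, i)]) ** 2 for i in range(n)) / (2 * n)

# Hypothetical data: two tight clusters of Z, so matches stay within clusters.
z = [0.0, 0.1, 1.0, 1.1]
y = [1.0, 2.0, 5.0, 4.0]
print(sigma_hat(y, z))  # (1 + 1 + 1 + 1) / 8 = 0.5
```

Each squared within-match difference estimates twice the conditional variance at the matched $Z$ value, which is why the averaging carries the factor $1/(2n)$.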
Lemma 21
Under Assumption 9,
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_{\ell_Z(i)} Y_{\ell_Z(i)}' - V_{P_{D|Z}}(Z_i)\right) \to_p 0 \]
uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$.

Proof of Lemma 21
Note that we can write
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_{\ell_Z(i)} Y_{\ell_Z(i)}' - V_{P_{D|Z}}(Z_i)\right) = \frac{1}{n}\sum_{i=1}^n \left(Y_{\ell_Z(i)} Y_{\ell_Z(i)}' - V_{P_{D|Z}}(Z_{\ell_Z(i)})\right) + \frac{1}{n}\sum_{i=1}^n \left(V_{P_{D|Z}}(Z_{\ell_Z(i)}) - V_{P_{D|Z}}(Z_i)\right), \]
so to prove the result it suffices to show that both terms tend to zero. To show that the second term tends to zero, note that by the triangle inequality and Assumption 9(4),
\[ \left\|\frac{1}{n}\sum_{i=1}^n \left(V_{P_{D|Z}}(Z_{\ell_Z(i)}) - V_{P_{D|Z}}(Z_i)\right)\right\| \le \frac{1}{n}\sum_{i=1}^n \left\|V_{P_{D|Z}}(Z_{\ell_Z(i)}) - V_{P_{D|Z}}(Z_i)\right\| \le \frac{K}{n}\sum_{i=1}^n \left\|Z_i - Z_{\ell_Z(i)}\right\| \]
for $K$ the upper bound on the Lipschitz constant. Note, next, that since $\mathcal{Z}$ is compact by Assumption 9(1), the proof of Lemma 1 of Abadie & Imbens (2008) implies that
\[ \frac{1}{n}\sum_{i=1}^n \left\|Z_i - Z_{\ell_Z(i)}\right\| \to 0. \]
Thus, we immediately see that $\frac{1}{n}\sum_{i=1}^n \big(V_{P_{D|Z}}(Z_{\ell_Z(i)}) - V_{P_{D|Z}}(Z_i)\big) \to 0$ uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$.

We next show that
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_{\ell_Z(i)} Y_{\ell_Z(i)}' - V_{P_{D|Z}}(Z_{\ell_Z(i)})\right) \to_p 0. \]
To do so, note first that the number of observations that can be matched to a given $Z_i$, $\#\{j : \ell_Z(j) = i\}$, is bounded above by the so-called "kissing number," which is a finite function $K(\dim(Z_i))$ of the dimension of $Z$ (see Abadie et al. (2014)).
Since $Y_i$ is independent across $i$, this implies that for $(A)_{jk}$ the $(j,k)$ element of a matrix $A$,
\[ Var\left(\frac{1}{n}\sum_{i=1}^n \left(Y_{\ell_Z(i)} Y_{\ell_Z(i)}' - V_{P_{D|Z}}(Z_{\ell_Z(i)})\right)_{jk} \,\Bigg|\, \{Z_i\}_{i=1}^\infty\right) \le K(\dim(Z_i))^2\, Var\left(\frac{1}{n}\sum_{i=1}^n (Y_i Y_i')_{jk} \,\Bigg|\, \{Z_i\}_{i=1}^\infty\right) = \frac{K(\dim(Z_i))^2}{n^2}\sum_{i=1}^n Var\left((Y_i Y_i')_{jk} \mid Z_i\right). \]
By Assumption 9(2) and Chebyshev's inequality, however, this implies that
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_{\ell_Z(i)} Y_{\ell_Z(i)}' - V_{P_{D|Z}}(Z_{\ell_Z(i)})\right) \to_p 0, \]
uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$, which completes the proof. $\square$

Lemma 22
Under Assumption 9,
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)'\right) \to_p 0, \]
uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$.

Proof of Lemma 22
Note that we can write
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)'\right) = \frac{1}{n}\sum_{i=1}^n \left(Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})'\right) + \frac{1}{n}\sum_{i=1}^n \left(\mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)'\right). \]
We first show the initial term converges in probability to zero, and then do the same for the second term.

By independence, $E\big[Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})' \mid Z_i, Z_{\ell_Z(i)}\big] = 0$, while the variance of the $jk$th element is
\[ Var_{P_{D|Z}}\left(\left(Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})'\right)_{jk} \,\Big|\, Z_i, Z_{\ell_Z(i)}\right) = E_{P_{D|Z}}\left[\left(Y_{i,j}\, Y_{\ell_Z(i),k} - \mu_{P_{D|Z},j}(Z_i)\,\mu_{P_{D|Z},k}(Z_{\ell_Z(i)})\right)^2 \,\Big|\, Z_i, Z_{\ell_Z(i)}\right] \]
\[ = \mu_{P_{D|Z},j}(Z_i)^2\, Var_{P_{D|Z}}\big(Y_{\ell_Z(i),k} \mid Z_{\ell_Z(i)}\big) + Var_{P_{D|Z}}\big(Y_{i,j} \mid Z_i\big)\,\mu_{P_{D|Z},k}(Z_{\ell_Z(i)})^2 + Var_{P_{D|Z}}\big(Y_{i,j} \mid Z_i\big)\, Var_{P_{D|Z}}\big(Y_{\ell_Z(i),k} \mid Z_{\ell_Z(i)}\big). \]
Assumption 9(5) thus implies that for some constant $C$,
\[ Var_{P_{D|Z}}\left(\left(Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})'\right)_{jk} \,\Big|\, Z_i, Z_{\ell_Z(i)}\right) \le \left(\mu_{P_{D|Z},j}(Z_i)^2 + \mu_{P_{D|Z},k}(Z_{\ell_Z(i)})^2 + C\right) C, \]
which, together with Assumption 9(2) and the finiteness of the "kissing number" $K(\dim(Z_i))$ (see the proof of Lemma 21 above), implies that
\[ \limsup_{n\to\infty} \sup_{P_{D|Z}\in\mathcal{P}_{D|Z}} Var\left(\frac{1}{n}\sum_{i=1}^n \left(Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})'\right) \,\Bigg|\, \{Z_i\}_{i=1}^\infty\right) = 0, \]
and thus by Chebyshev's inequality that
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})'\right) \to_p 0, \]
uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$, as we wanted to show.

To complete the proof, we need only show that
\[ \frac{1}{n}\sum_{i=1}^n \left(\mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)'\right) \]
converges to zero uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$.
Note, however, that by the triangle inequality and Assumption 9(3),
\[ \left\|\frac{1}{n}\sum_{i=1}^n \left(\mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)'\right)\right\| \le \frac{1}{n}\sum_{i=1}^n \left\|\mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_{\ell_Z(i)})' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)'\right\| \]
\[ \le \frac{1}{n}\sum_{i=1}^n \left\|\mu_{P_{D|Z}}(Z_i)\right\| \cdot \left\|\mu_{P_{D|Z}}(Z_{\ell_Z(i)}) - \mu_{P_{D|Z}}(Z_i)\right\| \le \frac{K}{n}\sum_{i=1}^n \left\|\mu_{P_{D|Z}}(Z_i)\right\| \cdot \left\|Z_{\ell_Z(i)} - Z_i\right\| \le \frac{KC}{n}\sum_{i=1}^n \left\|Z_{\ell_Z(i)} - Z_i\right\| \tag{38} \]
for $K$ a Lipschitz constant and $C$ a constant. As above, since $\mathcal{Z}$ is compact by Assumption 9(1), the proof of Lemma 1 of Abadie & Imbens (2008) implies that $\frac{1}{n}\sum_{i=1}^n \|Z_i - Z_{\ell_Z(i)}\| \to 0$, and thus that (38) converges to zero uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$. $\square$

Proof of Proposition 10 Following the proof of Lemma A.3 in Abadie et al. (2014), note that
\[ \hat\Sigma = \frac{1}{2n}\sum_{i=1}^n \left(Y_i - Y_{\ell_Z(i)}\right)\left(Y_i - Y_{\ell_Z(i)}\right)' = \frac{1}{2n}\sum_{i=1}^n Y_i Y_i' + \frac{1}{2n}\sum_{i=1}^n Y_{\ell_Z(i)} Y_{\ell_Z(i)}' - \frac{1}{2n}\sum_{i=1}^n \left(Y_i Y_{\ell_Z(i)}' + Y_{\ell_Z(i)} Y_i'\right). \]
Assumption 9(2) together with Chebyshev's inequality implies that
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_i Y_i' - V_{P_{D|Z}}(Z_i)\right) \to_p 0 \]
uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$. Since
\[ Var(Y_i \mid Z_i) = V_{P_{D|Z}}(Z_i) - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)', \]
however, we see that
\[ \frac{1}{n}\sum_i Var_{P_{D|Z}}(Y_i \mid Z_i) = \frac{1}{n}\sum_i V_{P_{D|Z}}(Z_i) - \frac{1}{n}\sum_i \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)'. \]
Thus, to prove that
\[ \hat\Sigma - \frac{1}{n}\sum_i Var_{P_{D|Z}}(Y_i \mid Z_i) \to_p 0, \]
it suffices to prove that
\[ \frac{1}{n}\sum_{i=1}^n \left(Y_{\ell_Z(i)} Y_{\ell_Z(i)}' - V_{P_{D|Z}}(Z_i)\right) \to_p 0 \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^n \left(Y_i Y_{\ell_Z(i)}' - \mu_{P_{D|Z}}(Z_i)\,\mu_{P_{D|Z}}(Z_i)'\right) \to_p 0, \]
where the first statement follows from Lemma 21 and the second from Lemma 22. Since
\[ \frac{1}{n}\sum_i Var_{P_{D|Z}}(Y_i \mid Z_i) - \Sigma(P_{D|Z}) \to 0 \]
uniformly over $P_{D|Z} \in \mathcal{P}_{D|Z}$ by Assumption 2, however, the result follows by the triangle inequality. $\square$

F Performance Without Nuisance Parameters
This appendix discusses the simulated performance of the procedures we consider in the simplified setting discussed in Section 5.1 of the paper. In particular, we assume that there are no nuisance parameters (and thus no matrix $X_n$) and that $Y_n \sim N(\mu_n, I)$, and we want to test $H_0: \mu_n \le 0$. We simulate the power of the least favorable, conditional, and hybrid tests. A number of other tests have been studied in the setting without nuisance parameters, and for comparison we consider the test of Romano et al. (2014a) (henceforth RSW). RSW include a simulation comparison of their test to that of D. Andrews & Barwick (2012), while Cox & Shi (2019) compare their test to both RSW and D. Andrews & Barwick (2012).

As noted in Section 5.1 of the paper, the conditional test in this setting compares $\hat\eta = \max_j Y_{n,j}$ to a truncated normal critical value, truncated below at the second-largest element of $Y_n$. The hybrid test considers the same test statistic but compares it to a truncated normal critical value which adds an upper truncation point equal to the level-$\kappa$ least favorable critical value.

For our simulations, we consider either two, ten, or fifty moments, $k \in \{2, 10, 50\}$. When $k \in \{10, 50\}$ the parameter space is very large and we are unable to fully depict the power function. Instead, we focus on how the power varies in the first two elements of $\mu_n$, while the remaining elements are held at a fixed value. In particular, we consider $(\mu_{n,1}, \mu_{n,2}) \in [-10, 10]^2$, while for $j > 2$ we set $\mu_{n,j} = \mu^*$ for a fixed value $\mu^*$. Contours of the resulting power functions, based on 1000 simulations, are plotted in Figures 3-7. For visibility, we also include plots of the difference in power functions between the conditional and hybrid tests and the RSW test.

These simulations highlight a number of features discussed in the main text.
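A compressed version of this design can be sketched directly. The following uses hypothetical parameter values and a small Monte Carlo loop (far fewer draws than the 1000 used for the figures) to compare the least favorable and conditional tests when one moment is violated and the rest are very slack; with $\Sigma = I$, both critical values have closed forms via Python's standard-library `statistics.NormalDist`:

```python
import random
from statistics import NormalDist

Phi, Phi_inv = NormalDist().cdf, NormalDist().inv_cdf

def lf_rejects(y, alpha=0.05):
    # Least favorable test: compare max_j Y_j to the 1-alpha quantile of the
    # max of k iid N(0,1) draws, which is Phi^{-1}((1-alpha)^(1/k)).
    return max(y) > Phi_inv((1.0 - alpha) ** (1.0 / len(y)))

def conditional_rejects(y, alpha=0.05):
    # Conditional test: compare max_j Y_j to a normal critical value
    # truncated below at the second-largest element of Y
    # (upper truncation point +infinity, so Phi(V^up) = 1).
    srt = sorted(y)
    eta, vlo = srt[-1], srt[-2]
    return eta > Phi_inv((1.0 - alpha) + alpha * Phi(vlo))

def power(test, mu, reps=2000, seed=0):
    rng = random.Random(seed)
    rejections = sum(
        test([m + rng.gauss(0.0, 1.0) for m in mu]) for _ in range(reps)
    )
    return rejections / reps

# One violated moment and 49 very slack moments: conditioning recovers roughly
# the power of a one-sided z-test, while the least favorable critical value
# (about 3.08 for k = 50) costs substantial power.
mu = [2.5] + [-10.0] * 49
assert power(conditional_rejects, mu) > power(lf_rejects, mu) + 0.2
```

This matches the pattern in the figures: with many slack moments and one clearly violated moment, the conditional test dominates the least favorable test, while the ordering reverses when the two largest moments are close.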
Comparing the least favorable and conditional tests, we see that when the largest moment is substantially larger than the second largest, the conditional test has better power than does the least favorable test, particularly when the total number of moments is large. By contrast, when the two largest moments are approximately the same size, the conditional test has poor power relative to the least favorable test. The hybrid test substantially improves on the conditional test in this case, while largely retaining the good performance of the conditional test in cases with many slack moments.

Since this section considers a normal model with known variance, we consider a version of RSW based on the normal distribution, discussed in Supplement Section S.1.2 of that paper, rather than the bootstrap version they discuss in the main text.

[Figures 3-7 plot power contours over $(\mu_{n,1}, \mu_{n,2})$ for the least favorable, RSW, conditional, and hybrid tests, together with the Conditional$-$RSW and Hybrid$-$RSW power differences.]

Figure 3: Power of tests with $k = 2$.

Figure 4: Power of tests with $k = 10$, $\mu^* = 0$.

Figure 5: Power of tests with $k = 10$, $\mu^* = -$.

Figure 6: Power of tests with $k = 50$, $\mu^* = 0$.

Figure 7: Power of tests with $k = 50$, $\mu^* = -$.

G Simulation Appendix
G.1 The Simulated Model
G.1.1 Competition and Firm Decisions
We consider competition between $F$ firms, who in each period decide which set of products to offer. As in Wollmann, the products (indexed by $j$) differ in their gross weight rating $g_j$, which can take on $G$ possible values. The fixed cost of offering a product in the current period depends on whether it was offered in the previous period: if it was not previously marketed, the costs are $\theta_c + \theta_g g_j$. If the product was previously marketed, the fixed costs scale down by a multiplicative factor $\beta$, so the cost of entering a previously marketed product is $\beta(\theta_c + \theta_g g_j)$.

Firm $f$ estimates that marketing product $j$ in period $t$ will earn variable profits $\pi^*_{jft}$, and chooses to enter the product if and only if the expected profits exceed the fixed costs. Thus, if a firm offered product $j$ in period $t-1$, then the firm chooses to offer $j$ in period $t$ iff $\pi^*_{jft} - \beta\theta_c - \beta\theta_g g_j > 0$. If the firm did not offer product $j$ in period $t-1$, then it chooses to add product $j$ iff $\pi^*_{jft} - \theta_c - \theta_g g_j > 0$.

G.1.2 Distributional Assumptions
We set $\pi^*_{jft} = \eta_{jt} + \epsilon_{jft}$, the sum of a product-level shock that is common to all firms and a firm-product idiosyncratic shock. We assume that $\eta_{jt} \sim N(0, \sigma_\eta^2)$. If $j$ was not offered in the previous period, then $\epsilon_{jft} \sim N(\beta\mu_f + \beta\theta_g g_j, \sigma_\epsilon^2)$; if the product was offered previously, then $\epsilon_{jft} \sim N(\mu_f + \theta_g g_j, \sigma_\epsilon^2)$. Note that the mean profitability of marketing a product depends on a firm-specific mean, $\mu_f$, which allows us to match the firm-level market shares observed in Wollmann's data. We also construct the mean of the $\epsilon_{jft}$ term to depend on the product's weight and whether it was marketed in the previous period in a way that guarantees that all simulated products will be offered with the same probability in our simulations.

While firms make their decisions using $\pi^*_{jft}$, we assume that the econometrician observes only $\pi_{jft} = \pi^*_{jft} + \nu_{jt} + \nu_{jft}$. The $\nu$ terms represent measurement or expectational errors. We assume that $\nu_{jt}$ and $\nu_{jft}$ are independently drawn from a normal distribution with mean 0 and variance $\sigma_\nu^2$.

G.2 Calibration
G.2.1 Exogenous Parameter Values
We set $F = 9$ to match the number of firms in Wollmann's data, and $G = 22$ to match the number of unique values of GWR. We use $\theta_c = 129.$, $\theta_g = -.$, and $\beta = 0.$ to match the estimates in the November 2018 version of Wollmann (2018). We set the values of $g$ to be 22 evenly spaced points between 12,700 and 54,277 to match the lowest and highest GWR figures reported in Table II of Wollmann (2018), which gives the average GWR for different buyer types.

G.2.2 Simulating Data for Calibration
To calibrate the remaining parameters, we simulate data according to the process described above, and set the parameters to match moments of the simulated data to those in Wollmann's data.

In order to simulate the data for the calibration, we first fix standard normal draws that are used to construct the $\eta$, $\epsilon$, and $\nu$ shocks. These standard normal draws are then scaled by the desired variance parameters in each simulation. Letting $J_{ft}$ denote the set of products offered by firm $f$ in period $t$, the simulations begin in state 0 with $J_{f0} = \emptyset$ for all firms. We then simulate $J_{ft}$ and $\pi^*$ going forward using the dynamics described above. We discard the first 1,000 periods as burnout so as to obtain draws from the stationary distribution, and calibrate the model using 27,000 subsequent periods. (After discarding 1,000 draws, we obtain essentially identical results if we begin from the state where all products are in the market rather than all products out of the market. Note that Wollmann denotes by $1-\beta$ what we have been calling $\beta$.)

G.2.3 Calibrating the Remaining Parameters

The parameter values to calibrate are $\{\mu_f\}$, $\sigma_\eta$, $\sigma_\epsilon$, and $\sigma_\nu$.

Intuition for Calibration.
The intuition for the calibration is as follows. The firm-specific means $\mu_f$ affect the number of products each firm offers, and so we calibrate these to match the market shares and total number of products offered in Wollmann's data. The $\sigma_\epsilon$ and $\sigma_\eta$ terms affect how often firms add and remove products, and so we calibrate these to match the variability of the number of products offered over time in Wollmann's data. Lastly, we calibrate $\sigma_\nu$, which governs the variance of the expectational/measurement error. We do not have direct measures of the variability of firm profits in Wollmann's data, but if markups are relatively constant, then the variance in firm profits is one-to-one with the variance of quantity sold, and so we calibrate $\sigma_\nu$ to match the variability of quantities sold assuming markups are fixed at 35%.

Technical Details for Calibration.
The calibration proceeds as follows:

1) We first calibrate $(\sigma_\eta, \sigma_\epsilon)$ and the $\mu_f$ terms to match the market shares and variability of products offered in Wollmann. This calibration process involves an inner and outer loop, described below.

a) The inner loop for $\mu_f$. Given a guess for $(\sigma_\eta, \sigma_\epsilon)$, we calibrate $\mu_f$ to match the market share and average number of products in Wollmann's data. Market shares are taken from Table III in Wollmann (2018). Wollmann does not provide the mean number of products offered by year, only the min and max, so we approximate it by taking the midpoint between the two extremes, which gives 48 total products per year on average.

b) In the outer loop, we calibrate $(\sigma_\eta, \sigma_\epsilon)$ to match a measure of the variability of the number of products offered in Wollmann's data. In particular, Table I in Wollmann (2018) lists 9-year averages for the total number of products offered for three 9-year periods (he has 27 years of data). We run 1,000 simulations of 27 periods, and for each 27-year period we calculate the average number of products offered within each 9-year subinterval, just as Wollmann does. We then calibrate $\sigma_\eta$ so that the average variance in the number of products offered across three consecutive 9-year periods matches that in Wollmann's data.

The simulated variance comes very close to the target variance whenever $\sigma_\eta = \sigma_\epsilon$, regardless of scaling. We therefore choose $\sigma_\eta = \sigma_\epsilon = 30$ because this gives that the variance of $\pi^*$ is roughly half of the variance of $\pi$.

2) Lastly, we calibrate $\sigma_\nu$ to match a moment implied by the variability in quantity sold across time in Wollmann. In particular, if prices and markups are relatively constant, then the variance in quantities will be well-approximated by a constant times the variance in profits: $Var(\pi_{jft}) \approx \bar p^2 \bar m^2\, Var(Q_{jft})$, where $\bar p$ and $\bar m$ are the average prices and markups.
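The approximation in step 2) can be checked numerically: if the price and markup are exactly constant, profit is $\pi = Q \times \bar m \times \bar p$, so $Var(\pi) = \bar m^2 \bar p^2\, Var(Q)$ holds exactly. A quick check with hypothetical quantity draws (only $\bar p$ and $\bar m$ are taken from the text):

```python
from statistics import pvariance

# Hypothetical quantity draws; p_bar is the average price from the text
# ($66,722) and m_bar the assumed 35% markup.
q = [10.0, 12.0, 8.0, 14.0]
p_bar, m_bar = 66722.0, 0.35

# With constant price and markup, pi = Q * m_bar * p_bar, so
# Var(pi) = m_bar^2 * p_bar^2 * Var(Q).
profits = [qi * m_bar * p_bar for qi in q]
scale = (m_bar * p_bar) ** 2
assert abs(pvariance(profits) - scale * pvariance(q)) <= 1e-6 * scale * pvariance(q)
```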
For our calibration, we set p̄ to be the average price in Wollmann's data ($66,722), and set m̄ equal to 0.35. As with the number of products offered, Wollmann does not report annual quantities, but rather the averages for three 9-year periods. We thus use a procedure analogous to that described in step 1b) to match the variance of the 9-year averages of quantity sold.

G.2.4 Calibrated Parameters
Tables 3 and 4 show the calibrated values for the µ_f and variance parameters, respectively.

[Table 3: Calibrated µ_f Parameters, one value for each of Chrysler, Ford, Daimler, GM, Hino, International, Isuzu, Paccar, and Volvo; the numeric entries are not reproduced here.]

This is because if prices and costs are constant across firms, π_jft = Q_jft(p − c) = Q_jft × ((p − c)/p) × p = Q_jft × m × p. Thus,
Var(π_jft) = m² p² Var(Q_jft) when p and c are constant, and this holds approximately with averages if the variance in m and p is small relative to that in Q.

[Table 4: Calibrated variance parameters σ_η, σ_ε, and σ_ν; the numeric entries are not reproduced here.]

G.3 Details of simulations in Section 7
G.3.1 Drawing from Independent Markets
Wollmann’s original model involves observations of sequential periods from the samemarket. If we were to construct moments at the product-period level in this setting,then the sequential nature of the model would induce serial correlation in the realiza-tions of the moments. Although Σ can be estimated in this setting, accounting forserial correlation substantially complicates covariance estimation. Since covarianceestimation is not the focus of this paper, and Wollman (2018) performs inference as-suming no serial correlation, we instead focus on a modified DGP corresponding to across-section of independent markets, a common setting in the industrial organizationliterature. To do this, we sample from the stationary distribution of the calibratedDGP described above as follows. We draw a 51,000 period sequential chain, and dis-card the first 1,000 periods as burnout. For each simulated dataset, we then randomlysubsample 500 periods from this chain. G.3.2 Parameter Grids and Monte Carlo Draws
For all of our simulations, we conduct inference by discretizing the parameter space for the parameter of interest. For δ_g and the cost of the mean-weight truck, we use 1,001 gridpoints; for β, we use 100 gridpoints. The bounds for the grid depend on the specification, and are equal to the upper and lower bounds of the x-axis shown in the rejection probability figures (Figures 1 and 2 and Appendix Figure 6). To calculate the LFP critical values, we draw a fixed matrix Ξ of standard normal draws of size M × …, and we use these for all of our calculations. Since the LF procedure is more computationally intensive, we calculate it using a matrix of size M × ….

G.3.3 Handling of Numerical Precision Errors

In simulating the draws for the LF approach, in certain very rare cases we encountered computational issues in which the linear program for one of the draws did not converge. In these cases, we treat the draw as if it were infinity, which pushes the estimated critical value slightly higher, and makes our estimate of the rejection probability slightly conservative. However, in all specifications this happens in no more than 0.01% of cases (of approximately 50 million simulations), and is thus unlikely to have any substantial impact on our results.
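The conservative effect of recording non-converged draws as +∞ can be illustrated with a small sketch; `critical_value` is a hypothetical helper, not code from our implementation:

```python
import numpy as np

def critical_value(draws, alpha=0.05):
    """Level-(1 - alpha) critical value from simulated test-statistic draws.
    Draws whose linear program failed to converge are recorded as +inf;
    an infinite draw can only push the estimated quantile up, so the
    resulting rejection probabilities are (if anything) conservative."""
    return float(np.quantile(np.asarray(draws, dtype=float), 1 - alpha))

draws = np.concatenate([np.linspace(0.0, 10.0, 999), [np.inf]])  # one failed draw
cv_with_failure = critical_value(draws)
cv_without = critical_value(draws[:-1])
assert cv_with_failure >= cv_without
```

Because the failed draws are so rare (at most 0.01% of cases), the upward shift in the critical value is negligible in practice.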
G.3.4 Additional Simulation Results
This appendix reports additional simulation results to complement the results reported in Section 7 of the main text. In particular, Figure 8 reports rejection probabilities for tests of hypotheses on δ_g, while Tables 5-7 report the 5th and 95th percentiles of the excess length distribution for the confidence sets we study.

[Figure 8: Rejection probabilities for 5% tests of δ_g. Panels: (a) 2 Parameters, 6 Moments; (b) 2 Parameters, 14 Moments; (c) 4 Parameters, 14 Moments; (d) 4 Parameters, 38 Moments; (e) 10 Parameters, 38 Moments; (f) 10 Parameters, 110 Moments. Figure omitted.]
[Tables 5-7: summary statistics (5th percentile, mean, median, and 95th percentile) of the excess length distribution for confidence sets for δ_g and for β, under the LFP, LF, Conditional, and Hybrid procedures. Rows vary the number of parameters (2, 4, or 10 for δ_g; 3, 5, or 11 for β) and the number of moments (6, 14, 38, or 110); columns are Parameters, Moments, LFP, LF, Conditional, Hybrid. The numeric entries are not reproduced here.]
Note: For certain specifications and simulation draws, the rejection probability did not reach 1 at the edge of our grid for λ. In these cases, we truncate the excess length at the edge of the grid. A + denotes statistics that are affected by this truncation.

Bisection Algorithm for Computing V^lo and V^up

When the conditions in step 2 in Section 6.3 do not hold, V^lo and V^up must be calculated by finding the minimum and maximum of the set

C = { c : c = max_{γ̃} γ̃′ ( s + (Σγ / (γ′Σγ)) c ) subject to γ̃ ≥ 0, W_n′ γ̃ = e }.

Recall that the set C is convex, and its endpoints, if they are finite, can therefore be calculated via bisection. We thus recommend the following procedure for calculating V^up. Begin by specifying a large value M such that, if V^up > M, for practical purposes we can consider V^up = ∞. Then implement Algorithm 1 described in the box below. In our implementation, we set M = max(0, η̂ + 20·√(γ′Σγ)), which guarantees that M is at least 20 standard deviations above η̂.

Algorithm 1 Bisection Method for Calculating V^up

procedure ComputeVup
    if CheckIfInC(M) then
        V^up ← ∞
    else
        lb ← η̂
        ub ← M
        while ub − lb > TolV do
            mid ← (lb + ub)/2
            if CheckIfInC(mid) then
                lb ← mid
            else
                ub ← mid
        V^up ← (lb + ub)/2

where we define the functions:

function LPValue(c)
    return max_{γ̃} γ̃′ ( s + (Σγ / (γ′Σγ)) c ) subject to γ̃ ≥ 0, W_n′ γ̃ = e

function CheckIfInC(c)
    if |c − LPValue(c)| < TolLP then return True else return False
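The bisection in Algorithm 1 can be sketched in a few lines. This is an illustration in which the membership test is a toy interval oracle standing in for the linear-program check |c − LPValue(c)| < TolLP; the function and argument names (`compute_vup`, `in_C`, `tol_V`) are ours, not the paper's:

```python
def compute_vup(in_C, eta_hat, M, tol_V=1e-8):
    """Bisection for the upper endpoint V^up of the convex set C, following
    Algorithm 1. `in_C(c)` is a membership oracle for a convex set that
    contains eta_hat; in the paper it solves a linear program and checks
    |c - LPValue(c)| < TolLP."""
    if in_C(M):
        return float("inf")  # V^up exceeds M; treat as +infinity
    lb, ub = eta_hat, M      # invariant: lb is in C, ub is not
    while ub - lb > tol_V:
        mid = (lb + ub) / 2
        if in_C(mid):
            lb = mid
        else:
            ub = mid
    return (lb + ub) / 2

# Toy oracle: C = [0, 2], with eta_hat = 0.5 inside C.
vup = compute_vup(lambda c: 0.0 <= c <= 2.0, eta_hat=0.5, M=100.0)
```

Each iteration halves the bracket [lb, ub], so the loop needs only about log2((M − η̂)/TolV) membership checks, i.e., LP solves.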