Inference for Large-Scale Linear Systems with Known Coefficients
Zheng Fang, Andres Santos, Azeem M. Shaikh, Alexander Torgovitsky
Zheng Fang, Department of Economics, Texas A&M University
Andres Santos, Department of Economics, UCLA
Azeem M. Shaikh, Department of Economics, University of Chicago
Alexander Torgovitsky, Department of Economics, University of Chicago

September 21, 2020
Abstract
This paper considers the problem of testing whether there exists a non-negative solution to a possibly under-determined system of linear equations with known coefficients. This hypothesis testing problem arises naturally in a number of settings, including random coefficient, treatment effect, and discrete choice models, as well as a class of linear programming problems. As a first contribution, we obtain a novel geometric characterization of the null hypothesis in terms of identified parameters satisfying an infinite set of inequality restrictions. Using this characterization, we devise a test that requires solving only linear programs for its implementation, and thus remains computationally feasible in the high-dimensional applications that motivate our analysis. The asymptotic size of the proposed test is shown to equal at most the nominal level uniformly over a large class of distributions that permits the number of linear equations to grow with the sample size.
Keywords: linear programming, linear inequalities, moment inequalities, random coefficients, partial identification, exchangeable bootstrap, uniform inference.

∗ We thank Denis Chetverikov, Patrick Kline, and Adriana Lleras-Muney for helpful comments. Conroy Lau provided outstanding research assistance. The research of the third author was supported by NSF grant SES-1530661. The research of the fourth author was supported by NSF grant SES-1846832.

1 Introduction
Given an independent and identically distributed (i.i.d.) sample {Z_i}_{i=1}^n with Z distributed according to P ∈ P, this paper studies the hypothesis testing problem

H_0 : P ∈ P_0    against    H_1 : P ∈ P \ P_0,  (1)

where P is a "large" set of distributions satisfying conditions we describe below and

P_0 ≡ {P ∈ P : β(P) = Ax for some x ≥ 0}.

Here, "x ≥ 0" signifies that all coordinates of x ∈ R^d are non-negative, β(P) ∈ R^p denotes an unknown but estimable parameter, and the coefficients of the linear system are known in that A is a known p × d matrix.

As we discuss in detail in Section 2, the described hypothesis testing problem plays a central role in a surprisingly varied array of empirical settings. Tests of (1), for instance, are useful for obtaining asymptotically valid confidence regions for counterfactual broadband demand in the analysis of Nevo et al. (2016), and for conducting inference on the fraction of employers engaging in discrimination in the audit study of Kline and Walters (2019). Within the treatment effects literature, tests of (1) arise naturally when conducting inference on partially identified causal parameters, such as in the studies by Kline and Walters (2016) and Kamat (2019) of the Head Start program, or the analysis of unemployment state dependence by Torgovitsky (2019). The null hypothesis in (1) has also been shown by Kitamura and Stoye (2018) to play a central role in testing whether a cross-sectional sample is rationalizable by a random utility model; see Manski (2014), Deb et al. (2017), and Lazzati et al. (2018) for related examples. In addition, we show that for a class of linear programming problems the null hypothesis that the linear program is feasible may be mapped into (1) – an observation that enables us to conduct inference in the competing risks model of Honoré and Lleras-Muney (2006), the empirical study of the California Affordable Care Act marketplace by Tebaldi et al. (2019), and the dynamic discrete choice model of Honoré and Tamer (2006). See Remark 3.1 for details.

A common feature of the empirical studies that motivate our analysis is that the dimensions of x ∈ R^d and/or β(P) ∈ R^p are often quite high – e.g., in Nevo et al. (2016) the dimensions p and d are both in excess of 5000.
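For a fixed matrix A and a known (rather than estimated) vector β, the question underlying the null hypothesis is a standard linear-programming feasibility problem. The following minimal sketch, with toy placeholder values for A and beta that are not taken from any of the applications above and assuming scipy is available, illustrates the check:

```python
# Minimal sketch (toy values, not from the paper): test whether beta = A x
# admits a solution with x >= 0 by solving a feasibility linear program.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]])   # p = 2 equations, d = 3 unknowns
beta = np.array([4.0, 5.0])

# Feasibility LP: minimize 0'x subject to A x = beta and x >= 0.
res = linprog(c=np.zeros(A.shape[1]), A_eq=A, b_eq=beta,
              bounds=[(0, None)] * A.shape[1])
print(res.status == 0)  # status 0 means a non-negative solution exists
```

In practice β(P) must be estimated from data, which is precisely why the inference procedure developed in this paper is needed; the LP above only settles feasibility for a known right-hand side.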
We therefore focus on developing an inference procedure that remains computationally feasible in high-dimensional settings and asymptotically valid under favorable conditions on the relationship between the dimensions of A and the sample size n. To this end, we first obtain a novel geometric characterization of the null hypothesis that is the cornerstone of our approach to inference. Formally, we show that the null hypothesis in (1) is true if and only if β(P) belongs to the range of A and all angles between an estimable parameter and a known set in R^d are obtuse. This geometric result further provides, to the best of our knowledge, a new characterization of the feasibility of a linear program distinct from, but closely related to, Farkas' lemma that may be of independent interest.

Guided by our geometric characterization of the null hypothesis and our desire for computational and statistical reliability, we propose a test statistic that may be computed through linear programming. While the test statistic is not pivotal, we obtain a suitable critical value by relying on a bootstrap procedure that similarly only requires solving one linear program per bootstrap iteration. Besides delivering computational tractability, the linear programming structure present in our test enables us to establish the consistency of our asymptotic approximations under the requirement that p²/n tends to zero (up to logs). Leveraging the consistency of such approximations to establish the asymptotic validity of our test further requires us to verify an anti-concentration condition at a particular quantile (Chernozhukov et al., 2014). We show that the required anti-concentration property indeed holds for our test under a condition that relates the allowed rate of growth of p relative to n to the matrix A. This result enables us to derive a sufficient, but more stringent, condition on the rate of growth of p relative to n that delivers anti-concentration universally in A.
Furthermore, if, as in much of the related literature, p is fixed with the sample size, then our results imply that our test is asymptotically valid under weak regularity conditions on P.

Our paper is related to important work by Kitamura and Stoye (2018), who study (1) in the context of testing the validity of a random utility model. Their inference procedure, however, relies on conditions on A that can be violated in the broader set of applications that motivate us; see Section 2. In related work, Andrews et al. (2019) exploit a conditioning argument to develop methods for sub-vector inference in certain conditional moment inequality models. We show in Section 4.3.2 that we may use their insight in the same way to adapt our methodology to conduct inference for the same types of problems they consider. Our analysis is also conceptually related to work on sub-vector inference in models involving moment inequalities or shape restrictions; see, among others, Romano and Shaikh (2008), Bugni et al. (2017), Kaido et al. (2019), Gandhi et al. (2019), Chernozhukov et al. (2015), Zhu (2019), and Fang and Seo (2019). While these procedures are designed for general problems that do not possess the specific structure in (1), they are, as a result, less computationally tractable and/or rely on more demanding and high-level conditions than the ones we employ.

The remainder of the paper is organized as follows. By way of motivation, we first discuss in Section 2 applications in which the null hypothesis in (1) naturally arises. In Sections 3 and 4, we establish our geometric characterization of the null hypothesis and the asymptotic validity of our test. Our simulation studies are contained in Section 5. Proofs and a guide to computation are contained in the Appendix. An R package for implementing our test is available at https://github.com/conroylau/lpinfer.

In order to fix ideas, we next discuss a number of empirical settings in which the hypothesis testing problem described in (1) naturally arises.
Example 2.1. (Dynamic Programming).
Building on Fox et al. (2011), Nevo et al. (2016) estimate a model for residential broadband demand in which there are h ∈ {1, . . . , H} types of consumers that select among plans k ∈ {1, . . . , K}. Each plan is characterized by a fee F_k, speed s_k, usage allowance C̄_k, and overage price p_k. At day t, a consumer of type h with plan k has utility over usage c_t and numeraire y_t given by

u_h(c_t, y_t, v_t; k) = v_t (c_t^{1−ζ_h}/(1−ζ_h)) − c_t (κ_{1h} + κ_{2h} log(s_k)) + y_t,

where v_t is an i.i.d. shock following a truncated log-normal distribution with mean µ_h and variance σ_h². The dynamic problem faced by a type h consumer with plan k is then

max_{c_1,...,c_T} Σ_{t=1}^T E[u_h(c_t, y_t, v_t; k)]
s.t. F_k + p_k max{C_T − C̄_k, 0} + Y_T ≤ I,  C_T = Σ_{t=1}^T c_t,  Y_T = Σ_{t=1}^T y_t,  (2)

where total wealth I is assumed to be large enough not to restrict usage. From (2), it follows that the distribution of observed plan choice and daily usage, denoted by Z ∈ R^{T+1}, for a consumer of type h is characterized by θ_h ≡ (ζ_h, κ_{1h}, κ_{2h}, µ_h, σ_h). Therefore, for any function m of Z we obtain the moment restrictions

E_P[m(Z)] = Σ_{h=1}^H E_{θ_h}[m(Z)] x_h,

where E_P and E_{θ_h} denote expectations under the distribution P of Z and under θ_h respectively, while x_h is the unknown proportion of each type in the population. After specifying H = 16807 different types, Nevo et al. (2016) estimate x ≡ (x_1, . . . , x_H) by GMM while imposing the constraints that x be a probability measure. The authors then conduct inference on counterfactual demand, which for a known function a equals

Σ_{h=1}^H a(θ_h) x_h,
by employing the constrained GMM estimator for x and the block bootstrap. We note, however, that the results in Fang and Santos (2018) imply the bootstrap is inconsistent for this problem. In contrast, the results in the present paper enable us to conduct asymptotically valid inference on counterfactual demand. For instance, by setting

β(P) ≡ (E_P[m(Z)]′, 1, γ)′,   A ≡ [ E_{θ_1}[m(Z)] ··· E_{θ_H}[m(Z)] ; 1 ··· 1 ; a(θ_1) ··· a(θ_H) ],  (3)

we may obtain a confidence region for counterfactual demand through test inversion (in γ) of the null hypothesis in (1) – here, the final two constraints in (3) impose that probabilities add up to one and the hypothesized value for counterfactual demand. Other applications of the approach in Nevo et al. (2016) to inference in dynamic programs include Blundell et al. (2018) and Illanes and Padi (2019).

Example 2.2. (Treatment Effects).
Kline and Walters (2016) examine the Head Start Impact Study (HSIS), in which participants were randomly assigned an offer to attend a Head Start school. Each participant can attend a Head Start school (h), other schools (c), or receive home care (n). We let W ∈ {0, 1} denote whether an offer is made, D(w) ∈ {h, c, n} denote potential treatment status, and Y(d) denote test scores given treatment status d ∈ {h, c, n}. Under the assumption that a Head Start offer increases the utility of attending a Head Start school but leaves the utility of other programs unchanged, Kline and Walters (2016) partition participants into five groups that are determined by the values of (D(0), D(1)). We denote group membership by

C ∈ {nh, ch, nn, cc, hh},  (4)

where, e.g., C = nh corresponds to (D(0), D(1)) = (n, h). Employing this structure, Kline and Walters (2016) show the local average treatment effect (LATE) identified by HSIS suffices for estimating the benefit-cost ratio of a Head Start expansion. The impact of alternative policies, however, depends on partially identified parameters such as

LATE_nh ≡ E[Y(h) − Y(n) | C = nh].  (5)

To estimate such partially identified parameters, Kline and Walters (2016) rely on a parametric selection model that delivers identification. In contrast, the results in this paper enable us to construct confidence regions for parameters such as LATE_nh within the nonparametric framework of Imbens and Angrist (1994). To this end note that, for any function m, the arguments in Imbens and Rubin (1997) imply

E_P[m(Y)1{D = d} | W = 0] − E_P[m(Y)1{D = d} | W = 1]
   = E[m(Y(d))1{C = dh}]          if d ∈ {n, c}
   = E[m(Y(h))1{C ∈ {nh, ch}}]    if d = h,  (6)

while the null hypothesis that LATE_nh equals a hypothesized value γ is equivalent to

E[(Y(h) − Y(n))1{C = nh}] − γ P(C = nh) = 0.  (7)

Provided the support of test scores is finite, results (6) and (7) imply that the null hypothesis that there exists a distribution of (Y(n), Y(c), Y(h), C) satisfying (6) and LATE_nh = γ is a special case of (1). As in Example 2.1, we may also obtain an asymptotically valid confidence region for LATE_nh through test inversion (in γ). Other examples of (1) arising in the treatment effects literature include Balke and Pearl (1994, 1997), Lafférs (2019), Machado et al. (2019), Kamat (2019), and Bai et al. (2020).

Example 2.3. (Duration Models).
In studying the efficacy of President Nixon's war on cancer, Honoré and Lleras-Muney (2006) employ the competing risks model

(T*, I) = (min{S_1, S_2}, arg min{S_1, S_2})       if D = 0,
          (min{αS_1, βS_2}, arg min{αS_1, βS_2})   if D = 1,

where (S_1, S_2) are possibly dependent random variables representing duration until death due to cancer and cardiovascular disease, D is independent of (S_1, S_2) and denotes the implementation of the war on cancer, and (α, β) are unknown parameters. The observed variables are (T, I, D), where T = t_k if t_k ≤ T* < t_{k+1} for k = 1, . . . , M and t_{M+1} = ∞, reflecting that data sources often contain interval observations of duration. While (α, β) is partially identified, Honoré and Lleras-Muney (2006) show that there exist known finite sets S(α, β) and S_{k,i,d}(α, β) ⊆ S(α, β) such that (α, β) belongs to the identified set if and only if there is a distribution f(·, ·) on S(α, β) satisfying

Σ_{(s_1,s_2) ∈ S_{k,i,d}(α,β)} f(s_1, s_2) = P(T = t_k, I = i | D = d),
Σ_{(s_1,s_2) ∈ S(α,β)} f(s_1, s_2) = 1, and f(s_1, s_2) ≥ 0 for all (s_1, s_2) ∈ S(α, β),  (8)

where the first equality must hold for all 1 ≤ k ≤ M, i ∈ {1, 2}, and d ∈ {0, 1}. It follows from (8) that testing whether a particular (α, β) belongs to the identified set is a special case of (1). Through test inversion, the results in this paper therefore allow us to construct a confidence region for the identified set that satisfies the coverage requirement proposed by Imbens and Manski (2004). We note that, in a similar fashion, our results also apply to the dynamic discrete choice model of Honoré and Tamer (2006).

Example 2.4. (Discrete Choice). In their study of demand for health insurance in the California Affordable Care Act marketplace (Covered California), Tebaldi et al. (2019) model the observed plan choice Y by a consumer according to

Y ≡ arg max_{1 ≤ j ≤ J} V_j − p_j,

where J denotes the number of available plans, V = (V_1, . . . , V_J) is an unobserved vector of valuations, and p ≡ (p_1, . . . , p_J) denotes post-subsidy prices. Within the regulatory framework of Covered California, post-subsidy prices satisfy p = π(C) for some known function π and C a (discrete-valued) vector of individual characteristics that includes age and county of residence. By decomposing C into subvectors (W, S) and assuming V is independent of S conditional on W, Tebaldi et al. (2019) then obtain

P(Y = j | C = c) = ∫_{V_j(π(c))} f_{V|W}(v|w) dv

for f_{V|W} the density of V conditional on W and V_j(p) ≡ {v : v_j − p_j ≥ v_k − p_k for all k}. The authors further show there is a finite partition 𝒱 of the support of V satisfying

P(Y = j | C = c) = Σ_{V ∈ 𝒱 : V ⊆ V_j(π(c))} ∫_V f_{V|W}(v|w) dv  (9)

and such that the identified set for counterfactuals, such as the change in consumer surplus due to a change in subsidies, is characterized by functionals with the structure

Σ_{V ∈ 𝒱 : V ⊆ V*} a(V) ∫_V f_{V|W}(v|w) dv  (10)

for a known function a : 𝒱 → R and set V*. Arguing as in Example 2.1, it then follows from (9) and (10) that confidence regions for the desired counterfactuals may be obtained through test inversion of hypotheses as in (1). Similar arguments allow us to apply our results to related discrete choice models such as the dynamic potential outcomes framework employed by Torgovitsky (2019) to measure state dependence.

Example 2.5. (Revealed Preferences).
Building on McFadden and Richter (1990), Kitamura and Stoye (2018) develop a nonparametric specification test for random utility models by showing the null hypothesis has the structure in (1). We note, however, that the arguments showing the asymptotic validity of their test rely on a key restriction on the matrix A: namely, that (a_1 − a_2)′(a_1 − a_3) ≥ 0 for any columns (a_1, a_2, a_3) of A. While such a restriction on A is automatically satisfied in the random utility framework that motivates the analysis in Kitamura and Stoye (2018) and related work (Manski, 2014; Deb et al., 2017; Lazzati et al., 2018), we observe that it can fail in our previously discussed examples.

3 Geometry of the Null Hypothesis
In this section, we obtain a geometric characterization of the null hypothesis that guides the construction of our test in Section 4. To this end, we first introduce some additional notation that will prove useful throughout the rest of our analysis.

In what follows, we denote by R^k the Euclidean space of dimension k and reserve the use of p and d to denote the dimensions of the matrix A. For any two column vectors (v_1, . . . , v_k)′ ≡ v and (u_1, . . . , u_k)′ ≡ u in R^k, we denote their inner product by ⟨v, u⟩ ≡ Σ_{i=1}^k v_i u_i. The space R^k can be equipped with the norms ‖·‖_q given by

‖v‖_q ≡ {Σ_{i=1}^k |v_i|^q}^{1/q}

for any 1 ≤ q ≤ ∞, where as usual ‖·‖_∞ is understood to equal ‖v‖_∞ ≡ max_{1≤i≤k} |v_i|. In addition, for any k × k matrix M, the norm ‖·‖_q on R^k induces an operator norm

‖M‖_{o,q} ≡ sup_{‖v‖_q ≤ 1} ‖Mv‖_q

on M; e.g., ‖M‖_{o,2} is the largest singular value of M, and ‖M‖_{o,∞} is the maximum ‖·‖_1 norm of the rows of M. While the norms ‖·‖_1 and ‖·‖_∞ play a crucial role in our statistical analysis, our geometric analysis relies more heavily on the norm ‖·‖_2. In particular, for any closed convex set C ⊆ R^k, we rely on the properties of the ‖·‖_2-metric projection operator Π_C : R^k → C, which for any vector v ∈ R^k is defined pointwise as

Π_C(v) ≡ arg min_{c ∈ C} ‖v − c‖_2;

i.e., Π_C(v) denotes the unique closest (under ‖·‖_2) element in C to the vector v. Finally, it will also be helpful to view the p × d matrix A as a linear map A : R^d → R^p. The range R ⊆ R^p and null space N ⊆ R^d of A are defined as

R ≡ {b ∈ R^p : b = Ax for some x ∈ R^d}    N ≡ {x ∈ R^d : Ax = 0}.
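As a concrete aid for this notation, the sketch below (a toy matrix, assuming numpy is available) builds orthonormal bases of N and N⊥ from the singular value decomposition of A and verifies the orthogonal decomposition of R^d used throughout this section:

```python
# Illustrative sketch (toy matrix, not from the paper): build orthonormal bases
# of the null space N and its orthocomplement N_perp from the SVD of A, then
# verify the decomposition x = Pi_N(x) + Pi_{N_perp}(x).
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]])            # p x d matrix viewed as a map R^3 -> R^2
_, sv, Vt = np.linalg.svd(A)
rank = int(np.sum(sv > 1e-10))

basis_Nperp = Vt[:rank].T                  # columns span N_perp (row space of A)
basis_N = Vt[rank:].T                      # columns span N (null space of A)

P_N = basis_N @ basis_N.T                  # orthogonal projection onto N
P_Nperp = basis_Nperp @ basis_Nperp.T      # orthogonal projection onto N_perp

x = np.array([1.0, -2.0, 0.5])
assert np.allclose(P_N @ x + P_Nperp @ x, x)     # x = Pi_N(x) + Pi_{N_perp}(x)
assert abs((P_N @ x) @ (P_Nperp @ x)) < 1e-12    # the two pieces are orthogonal
assert np.allclose(A @ (P_N @ x), 0.0)           # Pi_N(x) indeed lies in N
print("decomposition verified")
```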
The null space N of A induces a decomposition of R^d through its orthocomplement

N⊥ ≡ {y ∈ R^d : ⟨y, x⟩ = 0 for all x ∈ N};

i.e., any vector x ∈ R^d can be written as x = Π_N(x) + Π_{N⊥}(x) with ⟨Π_N(x), Π_{N⊥}(x)⟩ = 0. For succinctness, we denote such a decomposition of R^d as R^d = N ⊕ N⊥.

Our first result is a well-known consequence of the decomposition R^d = N ⊕ N⊥, but we state it formally due to its importance in our derivations.

Figure 1: Illustration of when requirement (ii) in (11) is satisfied. Left panel: N and N⊥ are such that requirement (ii) holds regardless of x*(P). Right panel: N and N⊥ are such that requirement (ii) holds if and only if x*(P) ∈ R²_+.

Lemma 3.1.
For any β(P) ∈ R^p there exists a unique x*(P) ∈ N⊥ satisfying Π_R(β(P)) = A(x*(P)).

We note, in particular, that if P ∈ P_0, then β(P) must belong to the range of A and as a result Π_R(β(P)) = β(P). Thus, for P ∈ P_0, Lemma 3.1 implies that there exists a unique x*(P) ∈ N⊥ satisfying β(P) = A(x*(P)). While x*(P) is the unique solution in N⊥, there may nonetheless exist multiple solutions in R^d. In fact, Lemma 3.1 and the decomposition R^d = N ⊕ N⊥ imply that, provided β(P) ∈ R, we have

{x ∈ R^d : Ax = β(P)} = x*(P) + N.

These observations allow us to characterize the null hypothesis in terms of two properties:

(i) β(P) ∈ R    (ii) {x*(P) + N} ∩ R^d_+ ≠ ∅;  (11)

i.e., (i) ensures some solution to the equation Ax = β(P) exists, while (ii) ensures a positive solution x ∈ R^d_+ exists. Importantly, we note that these two conditions depend on P only through two identified objects: β(P) ∈ R^p and x*(P) ∈ R^d.

Figure 1 illustrates these concepts in the simplest informative setting of p = 1 and d = 2, in which case N and N⊥ are of dimension one and correspond to a rotation of the coordinate axes. Focusing on developing intuition for requirement (ii) in (11), we suppose that β(P) ∈ R so that A(x*(P)) = β(P). The left panel of Figure 1 displays a setting in which condition (ii) holds and an x ∈ R²_+ satisfying Ax = A(x*(P)) = β(P) may be found even though x*(P) ∉ R²_+. In fact, in the left panel of Figure 1, N and N⊥ are such that requirement (ii) in (11) holds regardless of the value of x*(P) (and hence regardless of P). In contrast, the right panel of Figure 1 displays a scenario in which N and N⊥ are such that whether requirement (ii) is satisfied or not depends on x*(P); e.g., (ii) holds for x*_1(P) and fails for x*_2(P). In fact, in the right panel of Figure 1, condition (ii) is satisfied if and only if x*(P) ∈ R²_+.

The preceding discussion highlights that whether condition (ii) in (11) is satisfied can depend delicately on the orientation of N and N⊥ in R^d and the position of x*(P) in N⊥. Our next result provides a tractable geometric characterization of this relationship.

Figure 2: Illustration of Theorem 3.1 with N = {x ∈ R³ : x = (λ, λ, 0)′ for some λ ∈ R} and N⊥ ∩ R³_− = {x ∈ R³ : x = (0, 0, λ)′ for some λ ≤ 0}. α denotes the angle between x*(P) and N⊥ ∩ R³_−. Left panel: requirement (ii) in (11) holds and α is obtuse. Right panel: requirement (ii) in (11) fails and α is acute.

Theorem 3.1.
For any β(P) ∈ R^p there exists an x ∈ R^d_+ satisfying Ax = β(P) if and only if β(P) ∈ R and ⟨s, x*(P)⟩ ≤ 0 for all s ∈ N⊥ ∩ R^d_−.

Theorem 3.1 establishes that the null hypothesis holds if and only if β(P) ∈ R and the angle between x*(P) and any vector s ∈ N⊥ ∩ R^d_− is obtuse. It is straightforward to verify this relation is indeed present in Figure 1. The content of Theorem 3.1, however, is better appreciated in R³. Figure 2 illustrates a setting in which N⊥ ∩ R³_− = {x ∈ R³ : x = (0, 0, λ)′ for some λ ≤ 0}. In this case, condition (ii) in (11) holds if and only if the third coordinate of x*(P) is (weakly) positive, which is equivalent to the angle between x*(P) and N⊥ ∩ R³_− being obtuse. The left panel of Figure 2 depicts a setting in which the angle is obtuse, and an x ∈ R³_+ satisfying Ax = A(x*(P)) may be found even though x*(P) ∉ R³_+. In contrast, in the right panel of Figure 2, the angle is acute and requirement (ii) in (11) fails to hold.

Remark 3.1. A finite-dimensional linear program can be written in the standard form

min_{x ∈ R^d} ⟨c, x⟩ s.t. Ax = β and x ≥ 0  (12)

for c ∈ R^d, β ∈ R^p, and a p × d matrix A; see, e.g., Luenberger and Ye (1984). Theorem 3.1 thus provides a characterization of the feasibility of a linear program that is distinct from, but closely related to, Farkas' Lemma and may be of independent interest. We further observe that (12) implies that our results enable us to conduct inference on the value of a linear program whose standard form is such that A and c (as in (12)) are known while β potentially depends on the distribution of the data. This connection was implicitly employed in our discussion of many of the examples in Section 2, where we mapped the original linear programming formulations employed by the papers cited therein into the hypothesis testing problem in (1).
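Theorem 3.1 can be verified numerically on small examples. The sketch below (a toy A and β, not from the paper, assuming numpy and scipy) computes x*(P) = A†β(P), checks the angle condition by maximizing ⟨s, x*(P)⟩ over s ∈ N⊥ ∩ R^d_− via a linear program (the cone is bounded through an ℓ1 constraint, which is without loss of generality by homogeneity), and compares the answer against a direct feasibility LP:

```python
# Numerical illustration of the characterization: Ax = beta has a solution
# x >= 0 iff beta is in range(A) and <s, x_star> <= 0 for every s in the cone
# N_perp intersected with the non-positive orthant. Toy data throughout.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]])
beta = np.array([4.0, 5.0])
d = A.shape[1]

x_star = np.linalg.pinv(A) @ beta            # the unique solution in N_perp
in_range = np.allclose(A @ x_star, beta)     # beta in range(A)?

# Maximize <x_star, s> over s in N_perp with s <= 0 and ||s||_1 <= 1; the
# maximum is <= 0 iff <s, x_star> <= 0 on the whole cone (by homogeneity).
_, sv, Vt = np.linalg.svd(A)
rank = int(np.sum(sv > 1e-10))
P_Nperp = Vt[:rank].T @ Vt[:rank]            # projection onto N_perp
res = linprog(c=-x_star,                     # minimize -<x_star, s>
              A_eq=np.eye(d) - P_Nperp, b_eq=np.zeros(d),  # s in N_perp
              A_ub=-np.ones((1, d)), b_ub=[1.0],           # ||s||_1 <= 1 given s <= 0
              bounds=[(None, 0.0)] * d)
angle_condition = -res.fun <= 1e-8           # max <x_star, s> <= 0?

# Cross-check against the direct feasibility LP for Ax = beta, x >= 0.
direct = linprog(np.zeros(d), A_eq=A, b_eq=beta, bounds=[(0, None)] * d)
print((in_range and angle_condition) == (direct.status == 0))
```

For this toy system both routes agree that a non-negative solution exists; perturbing beta so that it leaves the feasible cone makes both checks fail together.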
Theorem 3.1 provides us with the basis for constructing a variety of tests of the null hypothesis of interest. We next develop one such test, paying special attention to ensuring that it be computationally feasible in high-dimensional problems.
In what follows, we let A† denote the Moore-Penrose pseudoinverse of A, which is a d × p matrix implicitly defined for any b ∈ R^p through the optimization problem

A†b ≡ arg min_{x ∈ R^d} ‖x‖_2 s.t. x ∈ arg min_{x̃ ∈ R^d} ‖Ax̃ − b‖_2;

i.e., A†b is the minimum norm solution to minimizing ‖Ax − b‖_2 over x. Importantly, we note that A†b is well defined even if there is no solution to the equation Ax = b (b ∉ R) or the solution is not unique (d > p). For our purposes, it is also useful to note that A†b is the unique element in N⊥ satisfying A(A†b) = Π_R(b), and we may thus interpret A† as a linear map from R^p onto N⊥; see Luenberger (1969). Despite its implicit definition, there exist multiple fast algorithms for computing A†. In Appendix S.3, we also provide a numerically equivalent reformulation of our test that avoids computing A†.

In order to build our test statistic, we will assume that there is a suitable estimator β̂_n of β(P) that is constructed from an i.i.d. sample {Z_i}_{i=1}^n with Z_i ∈ Z distributed according to P ∈ P. Since β(P) ∈ R under the null hypothesis, Lemma 3.1 implies

x*(P) = A†β(P)  (13)

for any P ∈ P_0, which suggests a sample analogue estimator for x*(P). However, while in our leading applications d ≥ p, it is important to note that the existence of a solution to the equation Ax = β(P) locally overidentifies the model when d < p in the sense of Chen and Santos (2018). As a result, the sample analogue estimator for x*(P) based on (13) may not be efficient when d < p, and we therefore instead set

x̂*_n = A†Ĉ_nβ̂_n  (14)

as an estimator for x*(P). Here, Ĉ_n is a p × p matrix satisfying Ĉ_nβ(P) = β(P) whenever P ∈ P_0. For instance, the sample analogue estimator based on (13) corresponds to setting Ĉ_n = I_p for I_p the p × p identity matrix.
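The properties of A† invoked above are easy to confirm numerically. The following sketch (toy numbers, assuming numpy) checks that A†b solves Ax = Π_R(b), lies in N⊥, and has the smallest ℓ2 norm among all solutions:

```python
# Sketch (toy values, not from the paper): A_dag b is the unique element of
# N_perp with A(A_dag b) = Pi_R(b), and it is the minimum-norm solution.
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]])        # d = 3 > p = 2: solutions are not unique
b = np.array([4.0, 5.0])

A_dag = np.linalg.pinv(A)
x_mn = A_dag @ b                       # minimum-norm solution to Ax = b

# A A_dag projects onto range(A); here A has full row rank, so Pi_R(b) = b.
assert np.allclose(A @ x_mn, b)

# x_mn is orthogonal to every element of the null space N of A.
_, sv, Vt = np.linalg.svd(A)
null_basis = Vt[int(np.sum(sv > 1e-10)):]   # orthonormal basis of N
assert np.allclose(null_basis @ x_mn, 0.0)

# Among all solutions, x_mn has the smallest l2 norm: compare with another one.
other = x_mn + 0.7 * null_basis[0]     # also solves Ax = b
assert np.allclose(A @ other, b)
assert np.linalg.norm(x_mn) < np.linalg.norm(other)
print("minimum-norm checks passed")
```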
More generally, it is straightforward to show that the specification in (14) also accommodates a variety of minimum distance estimators, which may be preferable to employing Ĉ_n = I_p when p > d.

The estimators β̂_n and x̂*_n readily allow us to devise a test based on the characterization of the null hypothesis obtained in Theorem 3.1. First, note that since the range of A† equals N⊥, the condition ⟨s, x*(P)⟩ ≤ 0 for all s ∈ N⊥ ∩ R^d_− is equivalent to

⟨A†s, x*(P)⟩ ≤ 0 for all s ∈ R^p s.t. A†s ≤ 0 (in R^d).  (15)

Thus, with the goal of detecting a violation of condition (15), we introduce the statistic

sup_{s ∈ V̂^i_n} √n ⟨A†s, x̂*_n⟩ = sup_{s ∈ V̂^i_n} √n ⟨A†s, A†Ĉ_nβ̂_n⟩  (16)

where

V̂^i_n ≡ {s ∈ R^p : A†s ≤ 0 and ‖Ω̂^i_n(AA′)†s‖_1 ≤ 1}.  (17)

Here, Ω̂^i_n is a p × p symmetric matrix and the "i" superscript alludes to the relation to the "inequality" condition in Theorem 3.1 (i.e., (15)). The inclusion of a ‖·‖_1-norm constraint in V̂^i_n in (17) ensures the statistic in (16) is not infinite with positive probability. The introduction of the matrix Ω̂^i_n in (17) grants us an important degree of flexibility in the family of test statistics we examine. In particular, we note that choosing Ω̂^i_n suitably ensures that the statistic in (16) is scale invariant.

By Theorem 3.1, in addition to (15), any P ∈ P_0 must satisfy β(P) ∈ R. With the goal of detecting a violation of this second requirement, we introduce the statistic

sup_{s ∈ V̂^e_n} √n ⟨s, β̂_n − Ax̂*_n⟩ = sup_{s ∈ V̂^e_n} √n ⟨s, (I_p − AA†Ĉ_n)β̂_n⟩  (18)

where

V̂^e_n ≡ {s ∈ R^p : ‖Ω̂^e_ns‖_1 ≤ 1}.

Here, Ω̂^e_n is a p × p symmetric matrix and the "e" superscript alludes to the relation to the "equality" condition in Theorem 3.1 (i.e., β(P) ∈ R). In particular, note that if Ω̂^e_n = I_p, then by Hölder's inequality (18) equals √n‖β̂_n − Ax̂*_n‖_∞. As in (17), introducing Ω̂^e_n enables us to ensure that the statistic in (18) is scale invariant if so desired. We also observe that in applications in which d ≥ p and A is full rank, the requirement β(P) ∈ R is automatically satisfied due to R = R^p and, under Ĉ_n = I_p, (18) is identically zero.

As a test statistic T_n, we simply employ the maximum of (16) and (18); i.e., we set

T_n ≡ max{ sup_{s ∈ V̂^e_n} √n ⟨s, β̂_n − Ax̂*_n⟩ , sup_{s ∈ V̂^i_n} √n ⟨A†s, x̂*_n⟩ },  (19)

which we note can be computed through linear programming. We do not consider weighting the statistics (16) and (18) when taking the maximum in (19) because weighting them is numerically equivalent to scaling the matrices Ω̂^i_n and Ω̂^e_n. A variety of alternative test statistics can of course be motivated by Theorem 3.1, some of which may be preferable to T_n in certain applications. A couple of remarks are therefore in order as to why our concern for computational reliability in high-dimensional models has led us to employ T_n. First, we avoided directly studentizing the inequalities in (16) in order to avoid a non-convex optimization problem; instead, scale invariance can be ensured by choosing Ω̂^i_n suitably. Second, we avoided directly studentizing (β̂_n − Ax̂*_n) in (18) because the asymptotic variance matrix of (β̂_n − Ax̂*_n) is often rank deficient due to (I_p − AA†Ĉ_n) being a projection matrix. Third, an alternative norm, say ‖·‖_2, could be employed in the definitions of V̂^i_n and V̂^e_n in (17). At least in our experience, however, linear programs scale better than quadratic programs.
In addition, employing ‖·‖_1-norm constraints implies distributional approximations to T_n can be obtained using coupling arguments with respect to ‖·‖_∞, which are available under weaker conditions than coupling arguments with respect to, say, ‖·‖_2.

We next state a set of assumptions that will enable us to obtain a distributional approximation to T_n. Unless otherwise stated, all quantities are allowed to depend on n, though we leave such dependence implicit to avoid notational clutter.

Assumption 4.1.
For j ∈ {e, i}: (i) Ω̂^j_n is symmetric; (ii) there is a symmetric matrix Ω^j(P) satisfying ‖(Ω^j(P))†(Ω̂^j_n − Ω^j(P))‖_{o,∞} = O_P(a_n/√(log(1 + p))) uniformly in P ∈ P; (iii) range{Ω̂^j_n} = range{Ω^j(P)} with probability tending to one uniformly in P ∈ P.

Assumption 4.2. (i) {Z_i}_{i=1}^n are i.i.d. with Z_i ∈ Z distributed according to P ∈ P; (ii) x̂*_n = A†Ĉ_nβ̂_n for some p × p matrix Ĉ_n satisfying Ĉ_nβ(P) = β(P) for all P ∈ P_0; (iii) there are ψ^i(·, P) : Z → R^p and ψ^e(·, P) : Z → R^p satisfying uniformly in P ∈ P

‖(Ω^e(P))†{(I_p − AA†Ĉ_n)√n{β̂_n − β(P)} − (1/√n)Σ_{i=1}^n ψ^e(Z_i, P)}‖_∞ = O_P(a_n)
‖(Ω^i(P))†{AA†Ĉ_n√n{β̂_n − β(P)} − (1/√n)Σ_{i=1}^n ψ^i(Z_i, P)}‖_∞ = O_P(a_n).

Assumption 4.3. For Σ^j(P) ≡ E_P[ψ^j(Z, P)ψ^j(Z, P)′]: (i) E_P[ψ^j(Z, P)] = 0 for all P ∈ P and j ∈ {e, i}; (ii) the eigenvalues of (Ω^j(P))†Σ^j(P)(Ω^j(P))† are bounded in j ∈ {e, i}, n, and P ∈ P; (iii) Ψ(z, P) ≡ ‖(Ω^e(P))†ψ^e(z, P)‖_∞ ∨ ‖(Ω^i(P))†ψ^i(z, P)‖_∞ satisfies sup_{P ∈ P} ‖Ψ(·, P)‖_{P,2} ≤ M_{2,Ψ} < ∞ with M_{2,Ψ} ≥ 1.

Assumption 4.4.
For j ∈ {e, i}: (i) ψ^j(Z,P) ∈ range{Ω^j(P)} P-almost surely for all P ∈ P; (ii) (I_p − AA†Ĉ_n){β̂_n − β(P)} ∈ range{Σ^e(P)} and AA†Ĉ_n{β̂_n − β(P)} ∈ range{Σ^i(P)} with probability tending to one uniformly in P ∈ P.

Because AA†Ĉ_n is a projection matrix, the relevant asymptotic covariance matrices are often singular. In order to allow Ω̂^i_n and Ω̂^e_n to be sample standard deviation matrices, Assumption 4.1 therefore does not assume invertibility. Instead, Assumption 4.1(ii) requires a suitable form of consistency, whose rate we denote by a_n/√log(1+p). Typically a_n will be of order p/√n (up to logs). When Ω̂^e_n and Ω̂^i_n are sample standard deviation matrices, Assumption 4.1(iii) is easily verified due to the rank deficiency resulting from the presence of projection matrices. Alternatively, we note that if we employ invertible (e.g., diagonal) weights Ω̂^e_n and Ω̂^i_n, then Assumption 4.1(iii) is immediate. Assumptions 4.2(i)–(ii) formalize previously discussed conditions, while Assumption 4.2(iii) requires our estimators to be asymptotically linear with influence functions whose moments are disciplined by Assumption 4.3. Finally, Assumption 4.4(i), together with Assumption 4.1(iii), restricts the manner in which invertibility of Ω̂^e_n and Ω̂^i_n may fail; this condition is again easily verified if we employ invertible weights or sample standard deviation matrices. Similarly, Assumption 4.4(ii) ensures that the support of our estimators is contained in the support of their Gaussian approximations.

Before establishing our distributional approximation to T_n, we need to introduce a final piece of notation.
We define the population analogues of V̂^e_n and V̂^i_n as

    V^e(P) ≡ { s ∈ R^p : ‖Ω^e(P) s‖₁ ≤ 1 }
    V^i(P) ≡ { s ∈ R^p : A†s ≤ 0, ‖Ω^i(P)(AA′)†s‖₁ ≤ 1 },    (20)

and, for ψ^e(Z,P) and ψ^i(Z,P) the influence functions in Assumption 4.2(iii), we set ψ(Z,P) ≡ (ψ^e(Z,P)′, ψ^i(Z,P)′)′ and denote the corresponding asymptotic variance by

    Σ(P) ≡ E_P[ψ(Z,P) ψ(Z,P)′],    (21)

which we note has dimension 2p × 2p. For notational simplicity we also define the rate

    r_n ≡ M_{3,Ψ} (p² log⁷(1+p)/n)^{1/6} + a_n.    (22)

Our next theorem derives a distributional approximation for T_n that, under appropriate moment conditions, is valid uniformly in P ∈ P provided p² log⁷(p)/n = o(1).

Theorem 4.1. Let Assumptions 4.1, 4.2, 4.3, and 4.4 hold, and r_n = o(1). Then there is (G^e_n(P)′, G^i_n(P)′)′ ≡ G_n(P) ∼ N(0, Σ(P)) such that uniformly in P ∈ P we have

    T_n = max{ sup_{s∈V^e(P)} ⟨s, G^e_n(P)⟩ , sup_{s∈V^i(P)} ⟨A†s, A†G^i_n(P)⟩ + √n ⟨A†s, A†β(P)⟩ } + O_P(r_n).

In order to obtain a suitable critical value, we will assume the availability of "bootstrap" estimates (Ĝ^e_n′, Ĝ^i_n′)′ whose law conditional on the data {Z_i}_{i=1}^n provides a consistent estimate of the joint distribution of (G^e_n(P)′, G^i_n(P)′)′. Given such estimates, we may follow a number of approaches for obtaining critical values; see, e.g., Section 4.3.1. Below we focus on a specific procedure that has favorable power properties in our simulations.

Step 1.
First, we observe that the main challenge in employing Theorem 4.1 for inference is the presence of the nuisance function f(·,P): R^p → R given by

    f(s,P) ≡ √n ⟨A†s, A†β(P)⟩ = √n ⟨A†s, x⋆(P)⟩,    (23)

where the second equality follows from A†β(P) = x⋆(P) for all P ∈ P. While f(·,P) cannot be consistently estimated, we may nonetheless construct a suitable upper bound for it. To this end, it is useful to note that in applications some coordinates of β(P) may equal a known value for all P ∈ P; see, e.g., Examples 2.1–2.5. We therefore decompose β(P) = (β_u(P)′, β_k′)′, where β_k is a known constant for all P ∈ P, and similarly decompose any b ∈ R^p into subvectors of conformable dimensions b = (b_u′, b_k′)′. Employing these definitions, we introduce a restricted estimator β̂^r_n of β(P) by setting

    β̂^r_n ∈ argmin_{b=(b_u′,b_k′)′} sup_{s∈V̂^i_n} √n ⟨A†s, x̂⋆_n − A†b⟩  s.t.  b_k = β_k, Ax = b for some x ≥ 0,    (24)

which may be computed through linear programming; see Appendix S.3 for details. Since f(s,P) ≤ 0 for all s ∈ V̂^i_n and P ∈ P₀ by Theorem 3.1, it follows that under the null hypothesis λ_n f(s,P) ≥ f(s,P) for any λ_n ≤ 1 and s ∈ V̂^i_n. We thus set

    Û_n(s) ≡ λ_n √n ⟨A†s, A†β̂^r_n⟩,    (25)

which can be shown to be a suitable estimator of the upper bound λ_n f(s,P) provided λ_n ↓ 0 at an appropriate rate; we discuss concrete choices of λ_n in Section 5. As a result, the function Û_n provides us with an asymptotic upper bound for the nuisance function f(·,P) on the set V̂^i_n. In addition, the upper bound Û_n reflects the structure of the null hypothesis in that: (i) Û_n(s) ≤ 0 for all s ∈ V̂^i_n, and (ii) there is a b ∈ R^p satisfying Ax = b for some x ≥ 0 such that Û_n(s) = ⟨A†s, A†b⟩ for all s ∈ R^p; see also our discussion in Section 4.3.1.

Step 2.
Second, we note that the asymptotic approximation obtained in Theorem 4.1 is increasing (in a first-order stochastic dominance sense) in the value of the nuisance function under the pointwise partial order; e.g., if f(s,P) ≥ f(s,P′) for all s ∈ V^i(P), then the distribution of T_n under P first-order stochastically dominates the distribution of T_n under P′. Hence, given the upper bound Û_n defined in Step 1, the preceding discussion suggests that, for a nominal level α test, we may employ the quantile

    ĉ_n(1−α) ≡ inf{ u : P( max{ sup_{s∈V̂^e_n} ⟨s, Ĝ^e_n⟩ , sup_{s∈V̂^i_n} ⟨A†s, A†Ĝ^i_n⟩ + Û_n(s) } ≤ u | {Z_i}_{i=1}^n ) ≥ 1−α }

as a critical value for T_n. We note that computing ĉ_n(1−α) requires solving one linear program per bootstrap replication. Given the above definitions, we finally define our test φ_n ∈ {0, 1} to equal φ_n ≡ 1{T_n > ĉ_n(1−α)}; i.e., we reject the null hypothesis whenever T_n exceeds ĉ_n(1−α). In order to establish the asymptotic validity of this test, we impose an additional assumption that enables us to derive the asymptotic properties of the bootstrap estimates (Ĝ^e_n′, Ĝ^i_n′)′.

Assumption 4.5. (i) There are exchangeable {W_{i,n}}_{i=1}^n independent of {Z_i}_{i=1}^n with

    ‖(Ω^j(P))†{ Ĝ^j_n − (1/√n) Σ_{i=1}^n (W_{i,n} − W̄_n) ψ^j(Z_i, P) }‖∞ = O_P(a_n)

uniformly in P ∈ P for j ∈ {e, i}; (ii) for some a, b > 0, P(|W_{1,n} − E[W_{1,n}]| > t) ≤ exp{at − t^b} for all t ∈ R₊ and n; (iii) |n⁻¹ Σ_{i=1}^n (W_{i,n} − W̄_n)² − 1| = O_P(n^{−1/2}) and sup_n E[|W_{1,n}|³] < ∞; (iv) sup_{P∈P} ‖Ψ²(·,P)‖_{P,q} ≤ M²_{q,Ψ} < ∞ for some q ∈ (1, +∞]; (v) for j ∈ {e, i}, Ĝ^j_n ∈ range{Σ^j(P)} with probability tending to one uniformly in P ∈ P.
Assumption 4.5(i) accommodates a variety of resampling schemes, such as the nonparametric, Bayesian, score, or weighted bootstrap, by simply requiring that (Ĝ^i_n′, Ĝ^e_n′)′ be asymptotically equivalent to an exchangeable bootstrap estimate of the distribution of the (scaled) sample mean of ψ(·,P). In parallel to Assumption 4.2(iii), we note that Assumption 4.5(i) is a linearization assumption on our bootstrap estimates that is automatically satisfied whenever (Ĝ^i_n′, Ĝ^e_n′)′ is linear in the data. Assumptions 4.5(ii)–(iii) impose moment and scale restrictions on the exchangeable bootstrap weights {W_{i,n}}_{i=1}^n, and are satisfied by commonly used resampling schemes; e.g., the nonparametric and Bayesian bootstrap, and the score or weighted bootstrap under appropriate choices of weights. Assumption 4.5(iv) potentially strengthens the moment restrictions in Assumption 4.3(iii) (if q > 3/2) and is imposed to sharpen our estimates of the coupling rate for the bootstrap statistics. Finally, Assumption 4.5(v) is a bootstrap analogue of the previously imposed Assumption 4.4(ii).

The introduced assumptions suffice for establishing that the law of (Ĝ^i_n′, Ĝ^e_n′)′ conditional on the data is a suitable estimator of the distribution of (G^e_n(P)′, G^i_n(P)′)′ uniformly in P ∈ P. Formally, we establish that (Ĝ^e_n′, Ĝ^i_n′)′ can be coupled (under ‖·‖∞) to a copy of (G^e_n(P)′, G^i_n(P)′)′ that is independent of the data at a rate b_n, whose exact expression, depending on p, n, a_n, M_{3,Ψ}, and M_{q,Ψ}, is given in Lemma S.2.5 in the Supplemental Appendix. In particular, under appropriate moment restrictions, the bootstrap is consistent provided p²(log⁷(p) ∨ log(n))/n = o(1). To the best of our knowledge, the stated consistency of the exchangeable bootstrap as p grows with n is a novel result that might be of independent interest.

Before establishing that the asymptotic size of our proposed test is at most its nominal level α, we need to introduce a final piece of notation. First, we note that the asymptotic approximation obtained in Theorem 4.1 contains the optimal values of two linear programs. The solutions to these programs can be shown to belong to the sets

    E^e(P) ≡ { s ∈ R^p : s is an extreme point of Ω^e(P) V^e(P) }
    E^i(P) ≡ { s ∈ R^p : s is an extreme point of (AA′)† V^i(P) }

almost surely. We note that while both sets are finite, the cardinality of E^i(P) can grow exponentially in p, which causes coupling rates obtained via the high-dimensional central limit theorem to be suboptimal (Chernozhukov et al., 2017).
In addition, it will be helpful to denote the standard deviations induced by G^e_n(P) and G^i_n(P) by

    σ^e(s,P) ≡ { E_P[(⟨s, (Ω^e(P))† G^e_n(P)⟩)²] }^{1/2}
    σ^i(s,P) ≡ { E_P[(⟨Ω^i(P)s, (Ω^i(P))† G^i_n(P)⟩)²] }^{1/2}

and to denote their upper and (restricted) lower bounds over the set E^e(P) ∪ E^i(P) by

    σ̄(P) ≡ sup_{s∈E^e(P)} σ^e(s,P) ∨ sup_{s∈E^i(P)} σ^i(s,P)
    σ̲(P) ≡ inf_{s∈E^e(P): σ^e(s,P)>0} σ^e(s,P) ∧ inf_{s∈E^i(P): σ^i(s,P)>0} σ^i(s,P),

where we set σ̲(P) = +∞ if σ^j(s,P) = 0 for all s ∈ E^j(P), j ∈ {e, i}. In addition, for any random variable V ∈ R, we let med{V} denote its median, and for any P ∈ P define

    m(P) ≡ med{ max{ sup_{s∈V^e(P)} ⟨s, G^e_n(P)⟩ , sup_{s∈V^i(P)} ⟨A†s, A†G^i_n(P)⟩ } }.

Finally, for notational convenience, we introduce the sequence ξ_n ≡ r_n ∨ b_n ∨ λ_n√log(1+p). Our next result establishes the asymptotic validity of the proposed test.

Theorem 4.2. Let Assumptions 4.1, 4.2, 4.3, 4.4, and 4.5 hold, and let α ∈ (0, 1/2). If ξ_n satisfies ξ_n = o(1) and sup_{P∈P} (m(P) + σ̄(P))/σ̲(P) = o(ξ_n⁻¹), then it follows that

    limsup_{n→∞} sup_{P∈P₀} E_P[φ_n] ≤ α.

In order to leverage our asymptotic approximations to establish the asymptotic validity of our test, Theorem 4.2 imposes an additional rate condition that constrains how p can grow with n. This rate condition depends on the matrix A and the weighting matrices Ω^j(P) for j ∈ {e, i}. As we show in Remark 4.1 below, it is possible to obtain universal (in A) bounds on (m(P) + σ̄(P))/σ̲(P) when setting Ω^j(P) to be the standard deviation matrix of G^j_n(P) for j ∈ {e, i}. While such bounds provide sufficient conditions for the rate requirements in Theorem 4.2, we emphasize that they can be quite conservative for a specific choice of A.
Finally, we note that if, as in much of the literature, one considers the case in which p does not grow with n, then Remark 4.1 implies that Theorem 4.2 holds under Assumptions 4.1–4.5 and the requirement λ_n = o(1).

Remark 4.1.
Whenever Ω^j(P) is chosen to be the standard deviation matrix of G^j_n(P) for j ∈ {e, i}, it is possible to obtain universal (in A) bounds on σ̄(P), σ̲(P), and m(P). For instance, under such a choice of Ω^j(P), it is straightforward to show that

    max_{s∈E^i(P)} σ^i(s,P) ≤ sup_{s: ‖s‖₁≤1} { E_P[(⟨s, (Ω^i(P))† G^i_n(P)⟩)²] }^{1/2} ≤ 1

and, moreover, that σ̄(P) ≤ 1. Similarly, if σ^i(s,P) > 0 for all s ∈ E^i(P), then closely related arguments yield

    min_{s∈E^i(P): σ^i(s,P)>0} σ^i(s,P) ≥ inf_{s: ‖Ω^i(P)s‖₁=1} { E_P[(⟨Ω^i(P)s, (Ω^i(P))† G^i_n(P)⟩)²] }^{1/2} ≥ inf_{s: ‖s‖₁=1} ‖s‖₂ = 1/√p    (26)

and, moreover, that σ̲(P) ≥ 1/√p. Finally, by Hölder's and a maximal inequality it is possible to show that m(P) ≲ √log(p). We emphasize, however, that the universal (in A) bound in (26) can be quite conservative for a specific choice of A. For example, applying our procedure for testing whether a vector of means is non-positive corresponds to setting A = I_p, in which case σ̲(P) = 1.

We next discuss extensions to our results. For conciseness, we omit a formal analysis, but note that they follow by similar arguments to those employed in proving Theorem 4.2.

4.3.1 Two Stage Critical Value
In Theorem 4.2, we focused on a particular choice of critical value due to its favorable power properties in our simulations. It is important to note, however, that other approaches are also available. For instance, an alternative critical value may be obtained by proceeding in a manner that is similar in spirit to the approach pursued by Romano et al. (2014) and Bai et al. (2019) for testing whether a finite-dimensional vector of population means is nonnegative. Specifically, for some pre-specified γ ∈ (0, α), let

    ĉ^{(1)}_n(1−γ) ≡ inf{ u : P( sup_{s∈V̂^i_n} ⟨A†s, −A†Ĝ^i_n⟩ ≤ u | {Z_i}_{i=1}^n ) ≥ 1−γ },

and, in place of Û_n as introduced in (25), define the upper bound Ũ_n to be given by

    Ũ_n(s) ≡ min{ √n ⟨A†s, x̂⋆_n⟩ + ĉ^{(1)}_n(1−γ), 0 }.

The function Ũ_n: R^p → R may be interpreted as an upper confidence region for f(·,P) (as in (23)) with uniform (in P ∈ P) asymptotic coverage probability 1−γ. For a nominal level α test, we may then compare the test statistic T_n to the critical value

    ĉ^{(2)}_n(1−α+γ) ≡ inf{ u : P( max{ sup_{s∈V̂^e_n} ⟨s, Ĝ^e_n⟩ , sup_{s∈V̂^i_n} ⟨A†s, A†Ĝ^i_n⟩ + Ũ_n(s) } ≤ u | {Z_i}_{i=1}^n ) ≥ 1−α+γ }.

Here, the 1−α+γ quantile is employed instead of the 1−α quantile in order to account for the possibility that f(s,P) > Ũ_n(s) for some s ∈ V̂^i_n. The asymptotic validity of the resulting test can be established under the same conditions imposed in Theorem 4.2. An appealing feature of the described approach is that it does not require selecting a "bandwidth" λ_n. However, we find in simulations that the power of the resulting test is lower than that of the test φ_n. Intuitively, this is due to Ũ_n not satisfying Ũ_n(s) = ⟨A†s, A†b⟩ for some b ∈ R^p such that Ax = b with x ≥
0. As a result, the upper bound Ũ_n does not reflect the full structure of the null hypothesis.

While we have focused on i.i.d. settings for simplicity, we note that extensions to other asymptotic frameworks are conceptually straightforward. One interesting such extension is to the problem of sub-vector inference in a class of models defined by conditional moment inequalities. In particular, we may follow an insight due to Andrews et al. (2019), who note that in an empirically relevant class of models defined by conditional moment inequalities, the parameter of interest π is known to satisfy

    E_P[G(D,π) − M(W,π)δ | W] ≤ 0 for some δ ∈ R^{d_δ},    (27)

where G(D,π) ∈ R^p, M(W,π) is a p × d_δ matrix, and both are known functions of (D, W, π). Andrews et al. (2019) observe that the structure of these models is such that testing whether a specified value π satisfies (27) is facilitated by conditioning on {W_i}_{i=1}^n. As we next argue, their important insight carries over to our framework. For any δ ∈ R^{d_δ}, let δ⁺ ≡ δ ∨ 0 and δ⁻ ≡ −(δ ∧ 0), where ∨ and ∧ denote coordinate-wise maximums and minimums; e.g., δ⁺ and δ⁻ are the "positive" and "negative" parts of δ. We then observe that if π satisfies (27), then it follows that

    (1/n) Σ_{i=1}^n E_P[G(D,π)|W_i] = (1/n) Σ_{i=1}^n M(W_i,π)(δ⁺ − δ⁻) − ∆ for some ∆ ∈ R^p₊, δ ∈ R^{d_δ}.

Hence, by setting P to denote the distribution of {D_i}_{i=1}^n conditional on {W_i}_{i=1}^n, we may test the null hypothesis that π satisfies (27) by letting β(P) and A equal

    β(P) ≡ (1/n) Σ_{i=1}^n E_P[G(D,π)|W_i],  A ≡ [ (1/n) Σ_{i=1}^n M(W_i,π), −(1/n) Σ_{i=1}^n M(W_i,π), −I_p ]

and testing whether β(P) = Ax for some x ≥ 0. Note, in particular, that A does not depend on P due to the conditioning on {W_i}_{i=1}^n.
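The mapping above can be verified mechanically. The snippet below (our own check, with arbitrary made-up inputs) builds A ≡ [M̄, −M̄, −I_p] and confirms that stacking x = (δ⁺′, δ⁻′, ∆′)′ ≥ 0 reproduces β exactly.

```python
# Numerical check (ours, with made-up inputs) of the mapping from the
# conditional moment inequality model (27) into beta(P) = A x with x >= 0:
# stacking x = (delta_plus, delta_minus, Delta) recovers beta exactly.
import numpy as np

rng = np.random.default_rng(0)
p, d_delta = 5, 3
M_bar = rng.standard_normal((p, d_delta))    # (1/n) sum_i M(W_i, pi)
delta = rng.standard_normal(d_delta)         # unrestricted sign
Delta = rng.random(p)                        # nonnegative slack

beta = M_bar @ delta - Delta                 # the averaged moment condition
A = np.hstack([M_bar, -M_bar, -np.eye(p)])
x = np.concatenate([np.maximum(delta, 0.0), np.maximum(-delta, 0.0), Delta])

assert np.all(x >= 0)
assert np.allclose(A @ x, beta)
```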
By letting β̂_n equal

    β̂_n ≡ (1/n) Σ_{i=1}^n G(D_i, π),

our test remains largely the same as in Theorem 4.2, with the exception that the "bootstrap" estimates (Ĝ^e_n′, Ĝ^i_n′)′ must be suitably consistent for the law of

    ( ((Ω^e(P))†(I_p − AA†Ĉ_n)√n{β̂_n − β(P)})′, ((Ω^i(P))†AA†Ĉ_n√n{β̂_n − β(P)})′ )′

conditional on {W_i}_{i=1}^n (instead of unconditionally, as in Theorem 4.2).

Example 2.1 is one example of a class of mixture models considered by Fox et al. (2011). A simpler example with the same structure is a static, binary choice logit with random coefficients. In this model, a consumer makes a binary decision Y ∈ {0, 1} according to

    Y = 1{C₀ + C₁W − U ≥ 0},    (28)

where W is an observed random variable which we will think of as the price the consumer faces for buying a good (Y = 1), and (C₀, C₁, U) are latent random variables. The random coefficients, V ≡ (C₀, C₁), represent the consumer's overall preference for buying the good (C₀) as well as their price sensitivity (C₁). The unobservable U is assumed to follow a standard logistic distribution, independently of (V, W). A consumer of type v = (c₀, c₁) facing price w buys the good with probability

    P(Y = 1 | W = w, V = v) = 1/(1 + exp(−c₀ − c₁w)) ≡ ℓ(w, v).    (29)

Bajari et al. (2007) and Fox et al. (2011) approximate the distribution of V using a discrete distribution with known support points (v₁, ..., v_d) and unknown respective probabilities x ≡ (x₁, ..., x_d). They also assume that V is independent of W, which is a natural baseline case under which random coefficient models are often studied (Ichimura and Thompson, 1998; Gautier and Kitamura, 2013). Under this assumption, (29) can be aggregated into a conditional moment equality in terms of observables:

    P(Y = 1 | W = w) = Σ_{j=1}^d x_j ℓ(w, v_j).    (30)

A natural quantity of interest in this model is the elasticity of the purchase probability with respect to price. For a consumer of type v = (c₀, c₁) facing price w̄, this equals

    ε(v, w̄) ≡ ( ∂ℓ(w, v)/∂w |_{w=w̄} ) · w̄/ℓ(w̄, v) = c₁ w̄ (1 − ℓ(w̄, v)).

The cumulative distribution function (c.d.f.) of this elasticity, denoted F_ε(·|w̄), satisfies

    F_ε(t|w̄) ≡ P(ε(V, w̄) ≤ t) = Σ_{j=1}^d 1{ε(v_j, w̄) ≤ t} x_j ≡ a(t, w̄)′x,    (31)

where a(t, w̄) ≡ (a₁(t, w̄), ..., a_d(t, w̄))′ with a_j(t, w̄) ≡ 1{ε(v_j, w̄) ≤ t}. We take the c.d.f. F_ε(·|w̄) as our parameter of interest in the discussion ahead.

We consider data generated from a class of mixed logit models parameterized as follows. The support of W is set to be either {0, 1, 2, 3} or {0, 0.2, 0.4, ..., 3}, so that it has either 4 or 16 points, respectively. In either case its distribution is uniform over these points. The (known) support of V is generated by taking, for C₀, a Sobol sequence of length √d rescaled to lie in a nonnegative interval, and, for C₁, a Sobol sequence of length √d rescaled to lie in a nonpositive interval. The distribution of V is taken to be uniform over the product of the two marginal supports, so that it has d support points in total.

Fox et al. (2012) provide identification results for the distribution of random coefficients in multinomial mixed logit models. The binary mixed logit model discussed in the previous section is a special case of these models. However, their conditions require W to be continuously distributed. When W is discretely distributed, one might naturally expect to find that the distributions of V and thus of ε(V, w̄) are only partially identified. We explore this conjecture computationally.
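The logit objects above are straightforward to compute; the sketch below (ours, with hypothetical names) implements ℓ, the closed-form elasticity, and the vector a(t, w̄) from (31), checking the elasticity formula against a finite-difference derivative of the choice probability.

```python
# Sketch (ours) of the mixed logit objects in (29)-(31).  The closed-form
# elasticity eps(v, w) = c1 * w * (1 - l(w, v)) is checked against a
# finite-difference derivative of the choice probability.
import numpy as np

def choice_prob(w, v):
    c0, c1 = v
    return 1.0 / (1.0 + np.exp(-c0 - c1 * w))        # l(w, v) in (29)

def elasticity(v, w_bar):
    _, c1 = v
    return c1 * w_bar * (1.0 - choice_prob(w_bar, v))

def cdf_coefficients(t, w_bar, supports):
    # a(t, w_bar) in (31): indicator that a type's elasticity is <= t
    return np.array([float(elasticity(v, w_bar) <= t) for v in supports])

v, w_bar, h = (0.5, -1.2), 1.0, 1e-6
num_deriv = (choice_prob(w_bar + h, v) - choice_prob(w_bar - h, v)) / (2 * h)
assert abs(num_deriv * w_bar / choice_prob(w_bar, v) - elasticity(v, w_bar)) < 1e-6
```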
Letting supp(W) denote the support of W, we may then express the identified set for the distribution of V as

    X⋆(P) ≡ { x ∈ R^d₊ : Σ_{j=1}^d x_j = 1, Σ_{j=1}^d x_j ℓ(w, v_j) = P(Y = 1|W = w) for all w ∈ supp(W) }.

In addition, for any t ∈ R, we denote the identified set for F_ε(t|w̄) by A⋆(t, w̄|P), which simply equals the projection of X⋆(P) under the linear map introduced in (31):

    A⋆(t, w̄|P) ≡ { a(t, w̄)′x : x ∈ X⋆(P) }.

Since X⋆(P) is defined by a system of linear equalities and inequalities, and x ↦ a(t, w̄)′x is scalar-valued and linear, A⋆(t, w̄|P) is a closed interval (see, e.g., Mogstad et al., 2018, for a similar argument). The left endpoint of this interval is the solution to the linear program

    min_{x∈R^d₊} a(t, w̄)′x  s.t.  Σ_{j=1}^d x_j = 1 and Σ_{j=1}^d x_j ℓ(w, v_j) = P(Y = 1|W = w) for all w ∈ supp(W),    (32)

and the right endpoint is the solution to its maximization counterpart. Figure 3 depicts A⋆(t, w̄|P) as a function of t for w̄ = 1. The outer and inner bands depict the identified set when the support of W has four and sixteen points, respectively, while the solid line indicates the distribution under the actual data generating process. The identified sets are non-trivial and widen with the number of support points for the unobservable, V, as indexed by d. For d = 16, the bounds when W has sixteen support points are narrow, but numerically distinct from a point.
This is because the system of moment equations that defines X⋆(P), while known to be nonsingular in principle, is sufficiently close to singular to matter for floating point arithmetic. That is, there are many values of x that satisfy these equations up to machine precision.

Figure 3: Bounds on the distribution of price elasticity F_ε(t|1).
[Four panels: d = 16, d = 100, d = 400, d = 1600. Horizontal axis: elasticity (t), from −3 to 0; vertical axis: distribution function (truth and pointwise bounds).]
These plots are based on the data generating processes described in Section 5.2. The solid black line is the actual value of F_ε(t|1). The lighter band depicts A⋆(t, 1|P) when the support of W has sixteen points; the darker band is the same set when the support of W has only four points. The dotted vertical line marks the value t = −1.

As in Example 2.1, we may employ our results to test whether a hypothesized γ ∈ R belongs to the identified set for F_ε(t|w̄). Indexing the support of W to have p − 2 points, we set

    β(P) = ( P(Y = 1|W = w₁), ..., P(Y = 1|W = w_{p−2}), 1, γ )′

    A = [ ℓ(w₁, v₁)      ⋯  ℓ(w₁, v_d)
          ⋮                  ⋮
          ℓ(w_{p−2}, v₁) ⋯  ℓ(w_{p−2}, v_d)
          1              ⋯  1
          a₁(t, w̄)       ⋯  a_d(t, w̄) ].    (33)

We take β̂_n ≡ (β̂_{u,n}′, 1, γ)′ ∈ R^p, where β̂_{u,n} is the sample analogue of the first p − 2 coordinates of β(P). For designs with d ≥ p, we set x̂⋆_n = A†β̂_n. When d < p, we instead set x̂⋆_n to be the unique solution to the minimization problem

    min_{x∈R^d} (β̂_{u,n} − A_u x)′ Ξ̂_n⁻¹ (β̂_{u,n} − A_u x)  s.t.  Σ_{j=1}^d x_j = 1 and a(t, w̄)′x = γ,    (34)

where A_u corresponds to the first p − 2 rows of A and Ξ̂_n is the sample analogue estimator of the asymptotic variance matrix of β̂_{u,n}.
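The population bounds in (32) are themselves small linear programs. The sketch below (ours, with hypothetical names, using scipy rather than the authors' R package) computes the endpoints of A⋆(t, w̄|P) for a synthetic design in which the true mixing weights are known, so the truth must lie between the computed bounds.

```python
# Sketch (ours) of the linear program (32): bounds on the elasticity c.d.f.
# at t implied by the mixture equalities defining X*(P).  The synthetic
# design knows the true mixing weights, so the truth lies in the bounds.
import numpy as np
from scipy.optimize import linprog

def cdf_bounds(t, w_bar, supports, x_truth, w_grid):
    """supports: d x 2 array of types (c0, c1); x_truth generates the
    'observed' choice probabilities P(Y = 1 | W = w)."""
    c0, c1 = supports[:, 0], supports[:, 1]
    ell = 1.0 / (1.0 + np.exp(-(c0[None, :] + c1[None, :] * w_grid[:, None])))
    probs = ell @ x_truth                                # P(Y = 1 | W = w)
    eps = c1 * w_bar * (1.0 - 1.0 / (1.0 + np.exp(-(c0 + c1 * w_bar))))
    a = (eps <= t).astype(float)                         # a(t, w_bar) in (31)
    A_eq = np.vstack([np.ones(supports.shape[0]), ell])  # adding-up + mixture
    b_eq = np.concatenate([[1.0], probs])
    bnds = [(0, None)] * supports.shape[0]               # x >= 0
    lo = linprog(a, A_eq=A_eq, b_eq=b_eq, bounds=bnds).fun
    hi = -linprog(-a, A_eq=A_eq, b_eq=b_eq, bounds=bnds).fun
    return lo, hi
```

With few support points for W and many for V, the interval is typically nondegenerate, mirroring the partial identification displayed in Figure 3.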
As weighting matrices, we let Ω̂^e_n be the sample standard deviation matrix of β̂_n, and Ω̂^i_n be the sample standard deviation matrix of Ax̂⋆_n computed from 250 draws of the nonparametric bootstrap.

We explore two rules for selecting λ_n. To motivate them, we note that an important theoretical restriction on λ_n is that, uniformly in P ∈ P, it satisfy

    λ_n √n sup_{s∈V̂^i_n} ⟨A†s, A†A(x̂⋆_n − x⋆(P))⟩ = o_P(1);    (35)

see Lemma S.2.1. Employing our coupling √n A(x̂⋆_n − x⋆(P)) ≈ G^i_n(P), Hölder's and Markov's inequalities, and the bound E_P[‖(Ω^i_n(P))†G^i_n(P)‖∞] ≲ √log(e ∨ p), which holds because Ω^i_n(P) is the standard deviation matrix, suggests selecting λ_n to satisfy λ_n√log(e ∨ p) = o(1); here a ∨ b ≡ max{a, b}. For a concrete choice of λ_n, we rely on the law of the iterated logarithm and let λ^r_n = 1/√(log(e ∨ p) log(e ∨ log(e ∨ n))). As an alternative to λ^r_n, we employ the bootstrap to approximate the law of (35). In particular, for some δ_n ↓ 0, we set λ^b_n ≡ min{1, τ̂_n(1 − δ_n)⁻¹}, where τ̂_n(1 − δ_n) denotes the 1 − δ_n quantile of

    sup_{s∈V̂^i_n} ⟨A†s, A†Ĝ^i_n⟩    (36)

conditional on the data. For a concrete choice of δ_n, we let δ_n = 1/√(log(e ∨ log(e ∨ n))).

In Appendix S.3, we describe the computation of our test in more detail. In particular, we show how to reformulate the entire procedure into a series of linear programming problems. We also suggest reformulations that improve the stability of these linear programs while also making it unnecessary to compute A†. An R package for implementing our test is available at https://github.com/conroylau/lpinfer .

Figure 4: Null rejection probabilities for (nearly) point-identified designs.
[Panels: d = 4, p = 6; d = 4, p = 18; d = 16, p = 18; rows: n = 1000, 2000, 4000. Horizontal axis: nominal level; vertical axis: rejection probability; tests: BS Wald, BS Wald (RC), FSST, FSST (RoT).]
The dotted line is the 45 degree line. "FSST" refers to the test developed in this paper with λ^b_n, whereas "FSST (RoT)" uses the rule of thumb choice λ^r_n. "BS Wald" corresponds to a Wald test using bootstrap estimates of the standard errors. "BS Wald (RC)" is the same procedure but with standard errors based on bootstrapping with a re-centered GMM criterion. The null hypothesis is that F_ε(−1|1) is equal to its true value. In the case of d = 16, p = 18, which is set identified but with a very narrow identified set, we test the null hypothesis that F_ε(−1|1) is equal to the midpoint of the identified set.
We start by examining the null rejection probabilities of our testing procedure, setting γ to be the lower bound of the population identified set computed via (32) with t = −1 and w̄ = 1. In unreported simulations, we found that setting γ to be the upper bound of the identified set yielded similar results. We consider sample sizes of n = 1000, 2000, and 4000 for each of the data generating processes discussed in Section 5.2. All results are based on 5000 Monte Carlo replications and 250 nonparametric bootstrap draws.

Table 1: Null rejection probabilities for a nominal 0.05 test

                                          d
  n     p   Test             4     16    100   400   1600
  1000  6   BS Wald        .058     –      –     –      –
            BS Wald (RC)   .061     –      –     –      –
            FSST           .055   .014   .032  .021   .023
            FSST (RoT)     .052   .009   .013  .007   .008
        18  BS Wald        .061   .093     –     –      –
            BS Wald (RC)   .061   .099     –     –      –
            FSST           .046   .041   .017  .014   .013
            FSST (RoT)     .046   .042   .014  .011   .009
  2000  6   BS Wald        .064     –      –     –      –
            BS Wald (RC)   .069     –      –     –      –
            FSST           .053   .012   .040  .029   .031
            FSST (RoT)     .052   .004   .022  .012   .016
        18  BS Wald        .070   .074     –     –      –
            BS Wald (RC)   .078   .079     –     –      –
            FSST           .050   .044   .019  .014   .017
            FSST (RoT)     .050   .046   .013  .011   .012
  4000  6   BS Wald        .070     –      –     –      –
            BS Wald (RC)   .077     –      –     –      –
            FSST           .055   .019   .042  .031   .030
            FSST (RoT)     .053   .003   .027  .019   .017
        18  BS Wald        .081   .080     –     –      –
            BS Wald (RC)   .095   .087     –     –      –
            FSST           .052   .046   .026  .022   .024
            FSST (RoT)     .052   .047   .020  .015   .017

The test abbreviations are described in the notes for Figure 4. The null hypothesis is that F_ε(−1|1) is equal to the lower bound of the population identified set.
We first consider the designs in which p − 2 ≥ d, so that F_ε(−1|1) is (nearly) point identified. In this case, one might alternatively consider estimating the probability weights x satisfying the moment restrictions in (30) by constrained GMM, and then conducting inference on F_ε(−1|1) using a bootstrapped Wald test. For example, this is the approach that appears to have been taken by Nevo et al. (2016) in the related setting discussed in Example 2.1. However, the non-negativity constraints on x imply that the bootstrap will generally not be consistent in this case (Fang and Santos, 2018).

We demonstrate this point in Figure 4 with plots of the actual and nominal level for both our procedure (FSST) and for the bootstrapped Wald test based on constrained GMM. The latter exhibits significant size distortions. For example, the GMM test with nominal level 5% rejects nearly 10% of the time when d = 16, p = 18, and n = 1000, and a nominal level 10% test rejects 15% of the time when d = 4, p = 18, and n = 4000. Re-centering the GMM criterion before conducting this test (e.g., Hall and Horowitz, 1996) leads to even greater over-rejection. In contrast, FSST has nearly equal nominal and actual levels when using either λ^b_n or the rule-of-thumb (RoT) choice, λ^r_n.

In Table 1, we report empirical rejection rates for our procedure using all of the designs, including the partially identified ones. Our approach has null rejection probabilities approximately no greater than the nominal level across all different data generating processes and sample sizes, even with d as high as 1600. Figure 5 illustrates the impact that λ_n has on the power of the test. Both λ^b_n and λ^r_n provide considerable power gains over the conservative choice of λ_n = 0. Figure 6 shows how power increases for the choice λ_n = λ^b_n as the sample size increases from 2000 to 4000.

Figure 5: Power curves for FSST nominal 0.10 test, n = 2000.
[Panels: d = 16, p = 6; d = 100, p = 6; d = 400, p = 6; d = 1600, p = 6. Horizontal axis: null hypothesis (γ); vertical axis: rejection probability; curves for λ_n = 0, λ^b_n, and λ^r_n.]
The vertical dotted lines indicate the lower and upper bounds of the population identified set. The horizontal dotted line indicates the nominal level (0.10).

Figure 6: Power curves for FSST nominal 0.10 test, λ_n = λ^b_n.
[Panels: d = 16, p = 6; d = 100, p = 6; d = 400, p = 6; d = 1600, p = 6. Horizontal axis: null hypothesis (γ); vertical axis: rejection probability; curves by sample size.]
The vertical dotted lines indicate the lower and upper bounds of the population identified set. The horizontal dotted line indicates the nominal level (0.10).

References
Andrews, I., Roth, J. and Pakes, A. (2019). Inference for linear conditional moment inequalities. Tech. rep., National Bureau of Economic Research.
Bai, Y., Santos, A. and Shaikh, A. (2019). A practical method for testing many moment inequalities. University of Chicago, Becker Friedman Institute for Economics Working Paper.
Bai, Y., Shaikh, A. M. and Vytlacil, E. J. (2020). Partial identification of treatment effect rankings with instrumental variables. Working Paper, University of Chicago.
Bajari, P., Fox, J. T. and Ryan, S. P. (2007). Linear regression estimation of discrete choice models with nonparametric distributions of random coefficients. American Economic Review.
Balke, A. and Pearl, J. (1994). Counterfactual probabilities: Computational methods, bounds and applications. In Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 46–54.
Balke, A. and Pearl, J. (1997). Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association.
Blundell, W., Gowrisankaran, G. and Langer, A. (2018). Escalation of scrutiny: The gains from dynamic enforcement of environmental regulations. Tech. rep., National Bureau of Economic Research.
Bugni, F. A., Canay, I. A. and Shi, X. (2017). Inference for subvectors and other functions of partially identified parameters in moment inequality models. Quantitative Economics.
Chen, X. and Santos, A. (2018). Overidentification in regular models. Econometrica.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability.
Chernozhukov, V., Newey, W. K. and Santos, A. (2015). Constrained conditional moment restriction models. arXiv preprint arXiv:1509.06311.
Deb, R., Kitamura, Y., Quah, J. K.-H. and Stoye, J. (2017). Revealed price preference: Theory and stochastic testing.
Fang, Z. and Santos, A. (2018). Inference on directionally differentiable functions. The Review of Economic Studies.
Fang, Z. and Seo, J. (2019). A general framework for inference on shape restrictions. arXiv preprint arXiv:1910.07689.
Fox, J. T., il Kim, K., Ryan, S. P. and Bajari, P. (2012). The random coefficients logit model is identified. Journal of Econometrics.
Fox, J. T., Kim, K. I., Ryan, S. P. and Bajari, P. (2011). A simple estimator for the distribution of random coefficients. Quantitative Economics.
Gandhi, A., Lu, Z. and Shi, X. (2019). Estimating demand for differentiated products with zeroes in market share data. Working Paper, UW-Madison.
Gautier, E. and Kitamura, Y. (2013). Nonparametric estimation in random coefficients binary choice models. Econometrica.
Hall, P. and Horowitz, J. L. (1996). Bootstrap critical values for tests based on generalized-method-of-moments estimators. Econometrica.
Honoré, B. E. and Lleras-Muney, A. (2006). Bounds in competing risks models and the war on cancer. Econometrica.
Honoré, B. E. and Tamer, E. (2006). Bounds on parameters in panel dynamic discrete choice models. Econometrica.
Ichimura, H. and Thompson, T. (1998). Maximum likelihood estimation of a binary choice model with random coefficients of unknown distribution. Journal of Econometrics.
Illanes, G. and Padi, M. (2019). Competition, asymmetric information, and the annuity puzzle: Evidence from a government-run exchange in Chile. Center for Retirement Research at Boston College.
Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica.
Imbens, G. W. and Manski, C. F. (2004). Confidence intervals for partially identified parameters. Econometrica.
Imbens, G. W. and Rubin, D. B. (1997). Estimating outcome distributions for compliers in instrumental variables models. The Review of Economic Studies.
Kaido, H., Molinari, F. and Stoye, J. (2019). Confidence intervals for projections of partially identified parameters. Econometrica.
Kamat, V. (2019). Identification with latent choice sets.
Kitamura, Y. and Stoye, J. (2018). Nonparametric analysis of random utility models. Econometrica.
Kline, P. and Walters, C. (2019). Audits as evidence: Experiments, ensembles, and enforcement. arXiv preprint arXiv:1907.06622.
Kline, P. and Walters, C. R. (2016). Evaluating public programs with close substitutes: The case of Head Start. The Quarterly Journal of Economics.
Lafférs, L. (2019). Bounding average treatment effects using linear programming. Empirical Economics.
Lazzati, N., Quah, J. and Shirai, K. (2018). Nonparametric analysis of monotone choice. Available at SSRN 3301043.
Luenberger, D. G. (1969). Optimization by Vector Space Methods. Wiley, New York.
Luenberger, D. G. and Ye, Y. (1984). Linear and Nonlinear Programming, vol. 2. Springer.
Machado, C., Shaikh, A. M. and Vytlacil, E. J. (2019). Instrumental variables and the sign of the average treatment effect. Journal of Econometrics.
Manski, C. F. (2014). Identification of income–leisure preferences and evaluation of income tax policy. Quantitative Economics.
McFadden, D. and Richter, M. K. (1990). Stochastic rationality and revealed stochastic preference. In Preferences, Uncertainty, and Optimality, Essays in Honor of Leo Hurwicz. Westview Press: Boulder, CO.
Mogstad, M., Santos, A. and Torgovitsky, A. (2018). Using instrumental variables for inference about policy relevant treatment parameters. Econometrica.
Nevo, A., Turner, J. L. and Williams, J. W. (2016). Usage-based pricing and demand for residential broadband. Econometrica.
Romano, J. P. and Shaikh, A. M. (2008). Inference for identifiable parameters in partially identified econometric models. Journal of Statistical Planning and Inference – Special Issue in Honor of Ted Anderson.
Romano, J. P., Shaikh, A. M. and Wolf, M. (2014). A practical two-step method for testing moment inequalities. Econometrica.
Tebaldi, P., Torgovitsky, A. and Yang, H. (2019). Nonparametric estimates of demand in the California health insurance exchange. Tech. rep., National Bureau of Economic Research.
Torgovitsky, A. (2019). Nonparametric inference on state dependence in unemployment. Econometrica.
Zhu, Y. (2019). Inference in non-parametric/semi-parametric moment equality models with shape restrictions. Quantitative Economics (forthcoming).

Inference for Large-Scale Linear Systems with Known Coefficients: Supplemental Appendix

Zheng Fang (Department of Economics, Texas A&M University), Andres Santos (Department of Economics, UCLA), Azeem M. Shaikh (Department of Economics, University of Chicago), Alexander Torgovitsky (Department of Economics, University of Chicago)

September 21, 2020
This Supplemental Appendix to "Inference for Large-Scale Linear Systems with Known Coefficients" is organized as follows. Appendix S.1 contains the proofs of the theoretical results in Section 3. The proofs of Theorems 4.1 and 4.2, together with the required auxiliary results, are contained in Appendix S.2. Throughout, we employ the following notation:

a ≲ b : a ≤ Mb for some constant M that is universal in the proof.
C^⊥ : For any set C ⊆ R^k, C^⊥ ≡ {x ∈ R^k : ⟨x, y⟩ = 0 for all y ∈ C}.
N : The null space of the map A : R^d → R^p.
R : The range of the map A : R^d → R^p.
Π_C : For any closed convex C ⊆ R^k, Π_C y ≡ arg min_{x∈C} ‖y − x‖.
x*(P) : The unique element of N^⊥ solving Π_R(β(P)) = A(x*(P)).

S.1 Results for Section 3
Proof of Lemma 3.1:
First note that by definition of R there exists an x₀(P) ∈ R^d such that Π_R(β(P)) = A(x₀(P)). Hence, since R^d = N ⊕ N^⊥ by Theorem 3.4.1 in Luenberger (1969), x₀(P) = Π_N(x₀(P)) + Π_{N^⊥}(x₀(P)), and we set x*(P) = Π_{N^⊥}(x₀(P)). Since A(Π_N(x₀(P))) = 0 by definition of N, we then obtain that

Π_R(β(P)) = A(x₀(P)) = A(Π_{N^⊥}x₀(P) + Π_Nx₀(P)) = A(x*(P)).

To see that x*(P) is the unique element in N^⊥ satisfying Π_R(β(P)) = A(x*(P)), let x̃(P) ∈ N^⊥ be any element satisfying A(x̃(P)) = Π_R(β(P)) = A(x*(P)). Since A(x̃(P) − x*(P)) = 0, it then follows that x̃(P) − x*(P) ∈ N. However, we also have x̃(P) − x*(P) ∈ N^⊥ since x̃(P), x*(P) ∈ N^⊥ and N^⊥ is a vector subspace of R^d. Thus, we obtain x̃(P) − x*(P) ∈ N ∩ N^⊥, and since N ∩ N^⊥ = {0} we can conclude x̃(P) = x*(P), which establishes that x*(P) is indeed unique.

Proof of Theorem 3.1:
Fix any β(P) ∈ R^p and recall that Π_R(β(P)) denotes its projection onto R (under ‖·‖). Next note that by Farkas' Lemma (see, e.g., Corollary 5.85 in Aliprantis and Border (2006)) it follows that the statement

Π_R(β(P)) = Ax̃ for some x̃ ≥ 0 (S.1)

holds if and only if there does not exist a y ∈ R^p satisfying the following inequalities:

A′y ≤ 0 (in R^d) and ⟨y, Π_R(β(P))⟩ > 0. (S.2)

In particular, there being no y ∈ R^p satisfying (S.2) is equivalent to the statement

⟨y, Π_R(β(P))⟩ ≤ 0 for all y ∈ R^p such that A′y ≤ 0 (in R^d). (S.3)

Next note Lemma 3.1 implies that there is a unique x*(P) ∈ N^⊥ such that Π_R(β(P)) = Ax*(P). Therefore, ⟨y, Ax*(P)⟩ = ⟨A′y, x*(P)⟩ implies that (S.3) is equivalent to

⟨A′y, x*(P)⟩ ≤ 0 for all y ∈ R^p such that A′y ≤ 0 (in R^d). (S.4)

Moreover, since {A′y : y ∈ R^p and A′y ≤ 0} = range{A′} ∩ R^d_−, (S.4) is equivalent to

⟨s, x*(P)⟩ ≤ 0 for all s ∈ range{A′} ∩ R^d_−. (S.5)

Theorem 6.7.3 in Luenberger (1969) implies that range{A′} = N^⊥. Since range{A′} is closed, we have that condition (S.5) is satisfied if and only if the following holds:

⟨s, x*(P)⟩ ≤ 0 for all s ∈ N^⊥ ∩ R^d_−. (S.6)

In summary, we have shown that statement (S.1) is satisfied if and only if condition (S.6) holds. Since in addition β(P) ∈ R if and only if β(P) = Π_R(β(P)), the claim of the theorem follows.

S.2 Results for Section 4

Proof of Theorem 4.1:
First note that by Lemma S.2.4 there exists a Gaussian vector (G^e_n(P)′, G^i_n(P)′)′ ≡ G_n(P) ∈ R^{2p} with G_n(P) ∼ N(0, Σ(P)) satisfying

‖(Ω^e(P))†{(I_p − AA†Ĉ_n)√n{β̂_n − β(P)} − G^e_n(P)}‖∞ = O_P(r_n)
‖(Ω^i(P))†{AA†Ĉ_n√n{β̂_n − β(P)} − G^i_n(P)}‖∞ = O_P(r_n) (S.7)

uniformly in P ∈ P. Further note that Assumption 4.4(i) implies range{Σ^j(P)} ⊆ range{Ω^j(P)} for j ∈ {e, i} and P ∈ P. Therefore, Assumption 4.4(ii) yields

(I_p − AA†Ĉ_n)√n{β̂_n − β(P)} ∈ range{Ω^e(P)}
AA†Ĉ_n√n{β̂_n − β(P)} ∈ range{Ω^i(P)} (S.8)

with probability tending to one uniformly in P ∈ P. Next, note that AA†s = s whenever s ∈ R and Theorem 3.1 imply (I_p − AA†)β(P) = 0 for all P ∈ P₀. Therefore, x̂*_n = A†Ĉ_nβ̂_n and Ĉ_nβ(P) = β(P) for all P ∈ P by Assumption 4.2(ii) yield that

sup_{s∈V̂^e_n} √n⟨s, β̂_n − Ax̂*_n⟩ = sup_{s∈V̂^e_n} ⟨s, (I_p − AA†Ĉ_n)√nβ̂_n⟩ = sup_{s∈V̂^e_n} ⟨s, (I_p − AA†Ĉ_n)√n{β̂_n − β(P)}⟩ (S.9)

for all P ∈ P₀. Similarly, employing that A†AA† = A† (see, e.g., Proposition 6.11.1(5) in Luenberger (1969)) together with Ax̂*_n = AA†Ĉ_nβ̂_n and Ĉ_nβ(P) = β(P) for all P ∈ P by Assumption 4.2(ii) allows us to conclude that for all P ∈ P

sup_{s∈V̂^i_n} √n⟨A†s, x̂*_n⟩ = sup_{s∈V̂^i_n} √n⟨A†s, A†Ĉ_nβ̂_n⟩ = sup_{s∈V̂^i_n} ⟨A†s, A†AA†Ĉ_n√n{β̂_n − β(P)}⟩ + √n⟨A†s, A†β(P)⟩. (S.10)

Moreover, if P ∈ P₀, then √n⟨A†s, A†β(P)⟩ ≤ 0 for all s satisfying A†s ≤ 0 by Theorem 3.1, since A†s ∈ N^⊥ ∩ R^d_− whenever A†s ≤ 0 and x*(P) = A†β(P). Hence, r_n = o(1), results (S.7), (S.8), (S.9), and (S.10), and Theorem S.2.1 applied with Ŵ^e_n(P) = (I_p − AA†Ĉ_n)√n{β̂_n − β(P)}, Ŵ^i_n(P) = AA†Ĉ_n√n{β̂_n − β(P)}, f̂_n(s, P) = √n⟨A†s, A†β(P)⟩, Q = P, and ω_n = r_n, together with a_n + r_n = O(r_n), imply uniformly in P ∈ P₀ that

sup_{s∈V̂^e_n} √n⟨s, β̂_n − Ax̂*_n⟩ = sup_{s∈V^e(P)} ⟨s, G^e_n(P)⟩ + O_P(r_n) (S.11)
sup_{s∈V̂^i_n} √n⟨A†s, x̂*_n⟩ = sup_{s∈V^i(P)} ⟨A†s, A†G^i_n(P)⟩ + √n⟨A†s, A†β(P)⟩ + O_P(r_n), (S.12)

from which the claim of the theorem follows.

Proof of Theorem 4.2:
For notational simplicity we first set η ≡ − α and define M n ( s, P ) ≡ (cid:104) A † s, A † G i n ( P ) (cid:105) U n ( s, P ) ≡ √ n (cid:104) A † s, A † β ( P ) (cid:105) (S.13) A e n ( s, P ) ≡ (cid:104) s, (Ω e ( P )) † G e n ( P ) (cid:105) A i n ( s, P ) ≡ (cid:104) s, G i n ( P ) + √ nβ ( P ) (cid:105) . (S.14)We also set sequences (cid:96) n ↓ τ n ↑ r n ∨ b n ∨ λ n (cid:112) log(1 + p ) = o ( (cid:96) n ) andsup P ∈ P m( P ) + ¯ σ ( P ) z τ n σ ( P ) = o ( (cid:96) − n ) , (S.15)which is feasible by hypothesis. Also note that since η > .
5, there is (cid:15) > η − (cid:15) > . z η − (cid:15) the η − (cid:15) quantile of a standard normal random variable, let E n ( P ) ≡ { ˆ c n ( η ) ≥ ( σ ( P ) z η − (cid:15) ) / } (S.16) E n ( P ) ≡ { U n ( s, P ) ≤ ˆ U n ( s ) + (cid:96) n for all s ∈ ˆ V i n } . (S.17)Next note that 0 ∈ ˆ V e n and 0 ∈ ˆ V i n together yield that ˆ c n ( η ) ≥
0. Therefore, φ n = 1 implies T n >
0, which together with Lemma S.2.3 implies that the conclusionof the theorem is immediate on the set D ≡ { P ∈ P : σ j ( s, P ) = 0 for all s ∈E j ( P ) and all j ∈ { e , i }} . We therefore assume without loss of generality that for all P ∈ P , σ j ( s, P ) > s ∈ E j ( P ) and some j ∈ { e , i } . Next, we also observe thatsince φ n = 1 implies T n >
0, Lemma S.2.2 allows us to conclude thatlim sup n →∞ sup P ∈ P P ( φ n = 1) = lim sup n →∞ sup P ∈ P P ( T n > ˆ c n ( η ); E n ( P )) . (S.18)Further observe that, for j ∈ { e , i } , G j n ( P ) ∈ range { Σ j ( P ) } ⊆ range { Ω j ( P ) } almostsurely by Theorem 3.6.1 in Bogachev (1998) and Assumption 4.4(i). Hence, it fol-lows that Ω j ( P )(Ω j ( P )) † G j n ( P ) = G j n ( P ) almost surely for j ∈ { e , i } , which togetherwith H¨older’s inequality, Assumption 4.1(ii), the definitions of V e ( P ) and V i ( P ), and U n ( s, P ) ≤ s ∈ V i ( P ) and P ∈ P by Theorem 3.1 imply that almost surelysup s ∈V e ( P ) (cid:104) s, G e n ( P ) (cid:105) = sup s ∈V e ( P ) (cid:104) Ω e ( P ) s, (Ω e ( P )) † G e n ( P ) (cid:105) < ∞ sup s ∈V i ( P ) M n ( s, P ) + U n ( s, P ) = sup s ∈V i ( P ) (cid:104) Ω i ( P )( AA (cid:48) ) † s, (Ω i ( P )) † G i n ( P ) (cid:105) + U n ( s, P ) < ∞ . Thus, by Theorem 4.1 and Lemmas S.2.12, S.2.13 we obtain uniformly in P ∈ P that T n = max s ∈E e ( P ) A e n ( s, P ) ∨ max s ∈E i ( P ) A i n ( s, P ) + O P ( r n ) . (S.19)34or any τ ∈ (0 ,
1) and M n ( s, P ) as in (S.13), we next let c (1) n ( τ, P ) denote the τ th quantile c (1) n ( τ, P ) ≡ inf { u : P ( sup s ∈V i ( P ) M n ( s, P ) ≤ u ) ≥ τ } . (S.20)Employing c (1) n ( τ, P ) we further define a “truncated” subset E i ,τ ( P ) ⊆ E i ( P ) by E i ,τ ( P ) ≡ { s ∈ E i ( P ) : −(cid:104) s, √ nβ ( P ) (cid:105) ≤ c (1) n ( τ, P ) } . (S.21)Next note that 0 ∈ V i ( P ) satisfying M n (0 , P ) = 0 implies sup s ∈V i ( P ) M n ( s, P ) is nonneg-ative almost surely and therefore c (1) n ( τ, P ) ≥
0. Since in addition 0 ∈ E i ( P ) by LemmaS.2.13, it follows 0 ∈ E i ,τ ( P ) and therefore we obtain that P ( max s ∈E i ( P ) A i n ( s, P ) = max s ∈E i ,τ ( P ) A i n ( s, P )) ≥ P ( max s ∈E i ( P ) \E i ,τ ( P ) A i n ( s, P ) ≤ ≥ P ( sup s ∈V i ( P ) M n ( s, P ) ≤ c (1) n ( τ, P )) ≥ τ, where the second and final inequalities hold by definitions (S.13) and (S.20), and E i ( P ) ⊆ ( AA (cid:48) ) † V i ( P ). Next define the sets C n (j , P ) according to the relation C n (j , P ) ≡ (cid:40) E e ( P ) if j = e E i ,τ n ( P ) if j = i . Given these definitions, we then obtain from results (S.15), (S.18), and (S.19) thatlim sup n →∞ sup P ∈ P P ( φ n = 1) ≤ lim sup n →∞ sup P ∈ P P ( max s ∈E e ( P ) A e n ( s, P ) ∨ max s ∈E i ,τn ( P ) A i n ( s, P ) > ˆ c n ( η ) − (cid:96) n ; E n ( P ))= lim sup n →∞ sup P ∈ P P ( max j ∈{ e , i } max s ∈C n (j ,P ) A j n ( s, P ) > ˆ c n ( η ) − (cid:96) n ; E n ( P )) (S.22)due to τ n ↑ r n = o ( (cid:96) n ) by construction. Further define the set A n ( P ) to equal A n ( P ) ≡ { (j , s ) : j ∈ { e , i } , s ∈ C n (j , P ) , σ j ( s, P ) > } , and note that, for n sufficiently large, inf P ∈ P ( σ ( P ) z η − (cid:15) ) − (cid:96) n > E n ( P ) implies ˆ c n ( η ) − (cid:96) n >
0. Hence, since for all P ∈ P wehave E [ A e n ( s, P )] = 0 for all s ∈ E e ( P ) and E [ A i n ( s, P )] ≤ s ∈ E i ,τ n ( P ) dueto (cid:104) ( AA (cid:48) ) † s, β ( P ) (cid:105) ≤ s ∈ V i ( P ) by Theorem 3.1, we can conclude from result(S.22) that the claim of the theorem is immediate if A n ( P ) = ∅ . Therefore, assuming35ithout loss of generality that A n ( P ) (cid:54) = ∅ we obtain from the same observations thatlim sup n →∞ sup P ∈ P P ( φ n = 1) ≤ lim sup n →∞ sup P ∈ P P ( max (j ,s ) ∈A n ( P ) A j n ( s, P ) > ˆ c n ( η ) − (cid:96) n )= lim sup n →∞ sup P ∈ P P ( max (j ,s ) ∈A n ( P ) A j n ( s, P ) > ˆ c n ( η ) − (cid:96) n ; E n ( P )) , (S.23)where the final inequality holds for E n ( P ) as defined in (S.17) by Lemma S.2.1.For any P ∈ P , it follows that under E n ( P ), ˆ c n ( η ) is P -almost surely boundedfrom below by the conditional on { Z i } ni =1 η quantile of the random variablemax { sup s ∈ ˆ V e n (cid:104) s, ˆ G e n (cid:105) , sup s ∈ ˆ V i n (cid:104) A † s, A † ˆ G i n (cid:105) + U n ( s, P ) } − (cid:96) n . (S.24)Moreover, by Theorem S.2.5 there is a Gaussian vector ( G e (cid:63)n ( P ) (cid:48) , G i (cid:63)n ( P ) (cid:48) ) (cid:48) ≡ G (cid:63)n ( P )with G (cid:63)n ( P ) ∼ N (0 , Σ( P )), independent of { Z i } ni =1 , and satisfying (cid:107) (Ω e ( P )) † { ˆ G e n − G e (cid:63)n ( P ) }(cid:107) ∞ ∨ (cid:107) (Ω i ( P )) † { ˆ G i n − G i (cid:63)n ( P ) }(cid:107) ∞ = O P ( b n )uniformly in P ∈ P . 
Since r n = o (1) implies a n = o (1), we may then apply TheoremS.2.1 with ˆ W n = ˆ G n , W n ( P ) = G (cid:63)n ( P ), and ˆ f n ( s, P ) = U n ( s, P ) to conclude thatmax { sup s ∈ ˆ V e n (cid:104) s, ˆ G e n (cid:105) , sup s ∈ ˆ V i n (cid:104) A † s, A † ˆ G i n (cid:105) + U n ( s, P ) } = max { sup s ∈V e ( P ) (cid:104) s, G e (cid:63)n ( P ) (cid:105) , sup s ∈V i ( P ) (cid:104) A † s, A † G i (cid:63)n ( P ) (cid:105) + U n ( s, P ) } + O P ( b n )= max { max s ∈E e ( P ) (cid:104) s, (Ω e ( P )) † G e (cid:63)n ( P ) (cid:105) , max s ∈E i ( P ) (cid:104) s, G i (cid:63)n ( P ) + √ nβ ( P ) (cid:105)} + O P ( b n ) (S.25)uniformly in P ∈ P , and where the second equality follows by arguing as in (S.19).Therefore, defining c (2) n ( η, P ) to be the following η quantile c (2) n ( η, P ) ≡ inf { u : P ( max (j ,s ) ∈A n ( P ) A j n ( s, P ) ≤ u ) ≥ η } , we obtain from E n ( P ) implying that ˆ c n ( η ) is P -almost surely bounded from below bythe conditional on { Z i } ni =1 η quantile of (S.24) for any P ∈ P , results (S.23) and (S.25), G n ( P ) and G (cid:63)n ( P ) sharing the same distribution, G (cid:63)n ( P ) being independent of { Z i } ni =1 ,Lemma 11 in Chernozhukov et al. (2013), and b n = o ( (cid:96) n ) thatlim sup n →∞ sup P ∈ P P ( φ n = 1) ≤ lim sup n →∞ sup P ∈ P P ( max (j ,s ) ∈A n ( P ) A j n ( s, P ) > c (2) n ( η n , P ) − (cid:96) n )(S.26)for some sequence η n satisfying η n ↑ η . 36o conclude, for any (j , s ) ∈ A n ( P ) we define the random variable N ((j , s ) , P ) by N ((j , s ) , P ) ≡ A j n ( s, P ) − c (2) n ( η n , P ) σ j ( s, P ) + c (1) n ( τ n , P ) + 0 ∨ c (2) n ( η n , P ) σ ( P ) . Then note that E [ N ((j , s ) , P )] ≥ , s ) ∈ A n ( P ), by definition of E i ,τ n ( P ), c (1) n ( η n , P ) ≥
0, and σ j ( s, P ) ≥ σ ( P ) for all (j , s ) ∈ A n ( P ). Thus, since in additionVar { N ((j , s ) , P ) } = 1 for any (j , s ) ∈ A n ( P ) and A n ( P ) is finite due to E e ( P ) and E i ( P )being finite by Corollary 19.1.1 in Rockafellar (1970), Lemma S.2.11 implies P ( | max (j ,s ) ∈A n ( P ) A j n ( s, P ) − c (2) n ( η n , P ) | ≤ (cid:96) n ) ≤ P ( | max (j ,s ) ∈A n ( P ) N ((j , s ) , P ) − c (1) n ( τ n , P ) + 0 ∨ c (2) n ( η n , P ) σ ( P ) | ≤ (cid:96) n σ ( P ) ) ≤ (cid:96) n σ ( P ) max { med { max (j ,s ) ∈A n ( P ) N ((j , s ) , P ) } , } (S.27)for any P ∈ P . Next note the definition of N ((j , s ) , P ), Ω j ( P )(Ω j ( P )) † G j n ( P ) = G j n ( P )for j ∈ { e , i } , E e ( P ) ⊂ Ω e ( P ) V e ( P ), and E i ,τ n ( P ) ⊆ ( AA (cid:48) ) † V i ( P ) imply thatmed { max (j ,s ) ∈A n ( P ) N ((j , s ) , P ) }≤ σ ( P ) { med { sup s ∈V e ( P ) (cid:104) s, G e n ( P ) (cid:105) ∨ sup s ∈V i ( P ) (cid:104) A † s, A † G i n ( P ) (cid:105)} + c (1) n ( τ n , P ) + | c (2) n ( η n , P ) |} = m(P) σ ( P ) + c (1) n ( τ n , P ) + | c (2) n ( η n , P ) | σ ( P ) (S.28)for all P ∈ P and n . Furthermore, by Borell’s inequality (see, for example, the corollaryin pg. 82 of Davydov et al. (1998)) we also have the bound c (1) n ( τ n , P ) ≤ m( P ) + z τ n ¯ σ ( P ) (S.29)for all P ∈ P and n sufficiently large due to τ n ↑
1. Since P ∈ P implies (cid:104) s, β ( P ) (cid:105) ≤ s ∈ E i ,τ n ( P ) ⊂ ( AA (cid:48) ) † V i ( P ) by Theorem 3.1, Borell’s inequality yields c (2) n ( η n , P ) ≤ m( P ) + ¯ σ ( P ) z η n (S.30)for n sufficiently large by η n ↑ η > / P ). Also, η n > / n sufficiently large and 0 ≥ (cid:104) s, √ nβ ( P ) (cid:105) ≥ − c (1) n ( τ n , P ) for all s ∈ E i ,τ n ( P ) by (S.21) imply c (2) n ( η n , P ) ≥ med { max s ∈E e ( P ): σ e ( s,P ) > A e n ( s, P ) ∨ max s ∈E i ,τn ( P ): σ i ( s,P ) > (cid:104) s, G i n ( P ) (cid:105)} − c (1) n ( τ n , P ) ≥ − c (1) n ( τ n , P ) , (S.31)where in the last inequality we employed that E [ A e n ( s, P )] = 0 for all s ∈ E e ( P ) and37 [ (cid:104) s, G i n ( P ) (cid:105) ] = 0 for all s ∈ E i ( P ) imply med { A e n ( s, P ) } ≥ s ∈ E e ( P ) andmed {(cid:104) s, G i n ( P ) (cid:105)} ≥ s ∈ E i ( P ). Therefore, results (S.27), (S.28), (S.29), (S.30),(S.31), τ n ↑ z τ n ↑ ∞ , and (cid:96) n satisfying restriction (S.15) finally yield thatlim sup n →∞ sup P ∈ P P ( | max (j ,s ) ∈A n ( P ) A j n ( s, P ) − c (2) n ( η n , P ) | ≤ (cid:96) n ) (cid:46) lim sup n →∞ sup P ∈ P (cid:96) n (m( P ) + z τ n ¯ σ ( P )) σ ( P ) = 0 . (S.32)Thus, (S.26) and (S.32) together with the definition of c (2) n ( η n , P ) and η n ↑ η implylim sup n →∞ sup P ∈ P P ( φ n = 1) ≤ lim sup n →∞ sup P ∈ P P ( max (j ,s ) ∈A n ( P ) A j n ( s, P ) > c (2) n ( η n , P )) ≤ − η. Since η = 1 − α , the claim of the theorem therefore follows. Lemma S.2.1.
Let Assumptions 4.1, 4.2, 4.3, 4.4(i) hold, λ n ∈ [0 , , and r n = o (1) .Then, for any sequence (cid:96) n satisfying λ n (cid:112) log(1 + p ) = o ( (cid:96) n ) it follows that lim inf n →∞ inf P ∈ P P ( sup s ∈ ˆ V i n {√ n (cid:104) A † s, A † β ( P ) (cid:105) − ˆ U n ( s ) } ≤ (cid:96) n ) = 1 . Proof:
First note that Theorem 3.1 implies that (cid:104) A † s, A † β ( P ) (cid:105) ≤ s ∈ ˆ V i n and P ∈ P . Therefore, the definition of ˆ U n ( s ) and λ n ∈ [0 ,
1] allow us to conclude thatsup s ∈ ˆ V i n √ n (cid:104) A † s, A † β ( P ) (cid:105) − ˆ U n ( s ) ≤ sup s ∈ ˆ V i n λ n √ n (cid:104) A † s, A † { β ( P ) − ˆ β r n }(cid:105)≤ sup s ∈ ˆ V i n λ n √ n (cid:104) A † s, ˆ x (cid:63)n − A † ˆ β r n (cid:105) + sup s ∈ ˆ V i n λ n √ n (cid:104) A † s, A † β ( P ) − ˆ x (cid:63)n (cid:105) . (S.33)Moreover, the definition of ˆ β r n in (24), ˆ x (cid:63)n ≡ A † ˆ C n ˆ β n with ˆ C n β ( P ) = β ( P ) for any P ∈ P by Assumption 4.2(ii), β ( P ) ∈ R for any P ∈ P , and (S.33) yieldsup s ∈ ˆ V i n √ n (cid:104) A † s, A † β ( P ) (cid:105) − ˆ U n ( s ) ≤ sup s ∈ ˆ V i n λ n |(cid:104) A † s, √ n { ˆ x (cid:63)n − A † β ( P ) }(cid:105)| = sup s ∈ ˆ V i n λ n |(cid:104) A † s, A † AA † ˆ C n √ n { ˆ β n − β ( P ) }(cid:105)| . (S.34)By applying Theorem S.2.1 twice, once with ˆ W i n ( P ) = AA † ˆ C n √ n { ˆ β n − β ( P ) } andˆ W e n ( P ) = G e n ( P ), and once with ˆ W i n ( P ) = AA † ˆ C n √ n { β ( P ) − ˆ β n } and ˆ W e n ( P ) = − G e n ( P ), and in both cases setting ˆ f n ( s, P ) = 0 for all s ∈ R p , we obtain from Lemma38.2.4 and ( − G e n ( P ) (cid:48) , − G i n ( P ) (cid:48) ) (cid:48) ∼ N (0 , Σ( P )) that uniformly in P ∈ P we havesup s ∈ ˆ V i n (cid:104) A † s, A † AA † ˆ C n √ n { ˆ β n − β ( P ) }(cid:105) = sup s ∈V i ( P ) (cid:104) A † s, A † G i n ( P ) (cid:105) + O P ( r n )sup s ∈ ˆ V i n (cid:104) A † s, A † AA † ˆ C n √ n { β ( P ) − ˆ β n }(cid:105) = sup s ∈V i ( P ) (cid:104) A † s, A † ( − G i n ( P )) (cid:105) + O P ( r n ) . 
(S.35)Thus, since Ω i ( P )(Ω i ( P )) † G i n ( P ) = G i n ( P ) almost surely due to G i n ( P ) ∈ range { Σ i ( P ) } ⊆ range { Ω i ( P ) } almost surely by Theorem 3.6.1 in Bogachev (1998) and Assumption 4.4(i),we obtain from results (S.34), (S.35), and H¨older’s inequality thatsup s ∈ ˆ V i n λ n |(cid:104) A † s, A † AA † ˆ C n √ n { ˆ β n − β ( P ) }(cid:105)| = sup s ∈V i ( P ) λ n |(cid:104) A † s, A † G i n ( P ) (cid:105)| + O P ( λ n r n ) ≤ λ n (cid:107) (Ω i ( P )) † G i n ( P ) (cid:107) ∞ + O P ( λ n r n ) = O P ( λ n (cid:112) log(1 + p )) (S.36)uniformly in P ∈ P , and where the final equality follows from r n = o (1), Markov’sinequality, and sup P ∈ P E P [ (cid:107) (Ω i ( P )) † G i n ( P ) (cid:107) ∞ ] (cid:46) (cid:112) log(1 + p ) by Lemma S.2.8 andAssumption 4.3(ii). The claim of the Lemma then follows from results (S.34), (S.36),and λ n (cid:112) log(1 + p ) = o ( (cid:96) n ) by hypothesis. Lemma S.2.2.
Let Assumptions 4.1, 4.2(i)(ii), 4.3, 4.4, 4.5 hold, η ∈ (0 . , , <(cid:15) < η − . , and z η be the η quantile of N (0 , . If r n ∨ b n = o (1) and sup P ∈ P (m( P ) +¯ σ ( P )) /σ ( P ) = o ( r − n ∧ b − n ) , then for each P ∈ P there are { E n ( P ) } with lim inf n →∞ inf P ∈ P P ( { Z i } ni =1 ∈ E n ( P )) = 1 (S.37) and on E n ( P ) it holds that ˆ c n ( η ) ≥ ( σ ( P ) z η − (cid:15) ) / whenever T n > .Proof: First note that by Lemma S.2.5 there is a Gaussian vector ( G e (cid:63)n ( P ) (cid:48) , G i (cid:63)n ( P ) (cid:48) ) (cid:48) ≡ G (cid:63)n ( P ) ∼ N (0 , Σ( P )) that is independent of { Z i } ni =1 and satisfies (cid:107) (Ω e ( P )) † { ˆ G e n − G e (cid:63)n ( P ) }(cid:107) ∞ ∨ (cid:107) (Ω i ( P )) † { ˆ G i n − G i (cid:63)n ( P ) }(cid:107) ∞ = O P ( b n )uniformly in P ∈ P . Further define ˆ L n ∈ R and L (cid:63)n ( P ) ∈ R to be given byˆ L n ≡ max { sup s ∈ ˆ V e n (cid:104) s, ˆ G e n (cid:105) , sup s ∈ ˆ V i n (cid:104) A † s, A † ˆ G i n (cid:105) + ˆ U n ( s ) } (S.38) L (cid:63)n ( P ) ≡ max { sup s ∈V e ( P ) (cid:104) s, G e (cid:63)n ( P ) (cid:105) , sup s ∈V i ( P ) (cid:104) A † s, A † G i (cid:63)n ( P ) (cid:105) + ˆ U n ( s ) } , (S.39)and note that since (cid:104) A † s, A † ˆ β r n (cid:105) ≤ s ∈ R p such that A † s ≤ W n ( P ) = G (cid:63)n ( P ), ˆ W n = ˆ G n , and ˆ f n ( · , P ) = ˆ U n ( · ) that uniformly in P ∈ P sup s ∈ ˆ V e n (cid:104) s, ˆ G e (cid:63)n (cid:105) = sup s ∈V e ( P ) (cid:104) s, G e (cid:63)n ( P ) (cid:105) + O P ( b n ) (S.40)sup s ∈ ˆ V i n (cid:104) A † s, A † ˆ G i (cid:63)n (cid:105) + ˆ U n ( s ) = sup s ∈V i ( P ) (cid:104) A † s, A † G i (cid:63)n ( P ) (cid:105) + ˆ U n ( s ) + O P ( b n ) . (S.41)We establish the lemma by studying three separate cases.Case I: Suppose P ∈ P e0 ≡ { P ∈ P : σ e ( s, P ) > s ∈ E e ( P ) } . 
First set E n ( P ) ≡ { P ( | sup s ∈ ˆ V e n (cid:104) s, ˆ G e n (cid:105) − sup s ∈V e ( P ) (cid:104) s, G e (cid:63)n ( P ) (cid:105)| > ( σ ( P ) z η − (cid:15) ) / |{ Z i } ni =1 ) ≤ (cid:15) } and note that z η − (cid:15) > η − (cid:15) > .
5, and therefore result (S.40), Markov’sinequality, and b n × sup P ∈ P /σ ( P ) = o (1) by hypothesis, imply thatlim inf n →∞ inf P ∈ P e0 P ( { Z i } ni =1 ∈ E n ( P )) = 1 . Then note that whenever { Z i } ni =1 ∈ E n ( P ) the triangle inequality allows us to conclude P ( sup s ∈V e ( P ) (cid:104) s, G e (cid:63)n ( P ) (cid:105) ≤ ˆ c n ( η ) + σ ( P ) z η − (cid:15) |{ Z i } ni =1 ) ≥ P ( sup s ∈ ˆ V e n (cid:104) s, ˆ G e (cid:63)n (cid:105) ≤ ˆ c n ( η ) |{ Z i } ni =1 ) − (cid:15) ≥ P (ˆ L n ≤ ˆ c n ( η ) |{ Z i } ni =1 ) − (cid:15) ≥ η − (cid:15) (S.42)where the second inequality follows from (S.38), while the final inequality holds bydefinition of ˆ c n ( η ). Also note that G e (cid:63)n ( P ) ∼ N (0 , Σ e ( P )), Theorem 3.6.1 in Bogachev(1998), and Assumption 4.4(i) imply G e (cid:63)n ( P ) = Ω e ( P )(Ω e ( P )) † G e (cid:63)n ( P ) almost surely.Therefore, by symmetry of Ω e ( P ) we can conclude that almost surelysup s ∈V e ( P ) (cid:104) s, G e (cid:63)n ( P ) (cid:105) = sup s ∈V e ( P ) (cid:104) Ω e ( P ) s, (Ω e ( P )) † G e (cid:63)n ( P ) (cid:105) = max s ∈E e ( P ) (cid:104) s, (Ω e ( P )) † G e (cid:63)n ( P ) (cid:105) , where the second equality holds by Lemma S.2.12 and the supremum being finite byH¨older’s inequality. Hence, the (unconditional) distribution of sup s ∈V e ( P ) (cid:104) s, G e (cid:63)n ( P ) (cid:105) first order stochastically dominates the distribution N (0 , σ ( P )) whenever P ∈ P e0 bydefinition of σ ( P ). 
In particular, G^{e⋆}_n(P) being independent of {Z_i}_{i=1}^n and result (S.42) imply that whenever {Z_i}_{i=1}^n ∈ E_n(P) and P ∈ P^e_0 we must have

    ĉ_n(η) + σ₀(P)z_{η−2ε} ≥ σ₀(P)z_{η−ε},

which establishes the claim of the lemma for the subset P^e_0 ⊆ P₀.

Case II: Suppose P ∈ P^i_0 ≡ {P ∈ P₀ : σ_i(s,P) > 0 for some s ∈ E^i(P) and σ_e(s,P) = 0 for all s ∈ E^e(P)}, and define the event E_n(P) ≡ ⋂_{j=1}^4 E_{j,n}(P), where

    E_{1,n}(P) ≡ {V̂^i_n ⊆ V^i(P)}
    E_{2,n}(P) ≡ {AA†Ĉ_n√n{β̂_n − β(P)} ∈ range{Σ^i(P)}}
    E_{3,n}(P) ≡ {P(|L̂_n − L⋆_n(P)| > σ₀(P)z_{η−2ε}/2 | {Z_i}_{i=1}^n) ≤ ε}
    E_{4,n}(P) ≡ {T_n = sup_{s∈V̂^i_n} √n⟨A†s, x̂⋆_n⟩}.

Next note that Ω^i(P)(Ω̂^i_n)†Ω̂^i_n = Ω^i(P) with probability tending to one uniformly in P ∈ P by Assumption 4.1(iii), Lemma S.2.10, and symmetry of Ω̂^i_n and Ω^i(P). Since Ω̂^i_n(Ω̂^i_n)†Ω̂^i_n = Ω̂^i_n by Proposition 6.11.1(6) in Luenberger (1969), we obtain from the definition of V̂^i_n that with probability tending to one uniformly in P ∈ P

    sup_{s∈V̂^i_n} ‖Ω^i(P)(AA′)†s‖₁ ≤ 1 + sup_{s∈V̂^i_n} ‖(Ω̂^i_n − Ω^i(P))(Ω̂^i_n)†Ω̂^i_n(AA′)†s‖₁
        ≤ 1 + ‖(Ω̂^i_n − Ω^i(P))(Ω̂^i_n)†‖_{o,1} = 1 + ‖(Ω̂^i_n)†(Ω̂^i_n − Ω^i(P))‖_{o,∞},   (S.43)

where the final equality follows from Assumptions 4.1(i)(ii) and Theorem 6.5.1 in Luenberger (1969). Therefore, (S.43) and Lemma S.2.6 (for E_{1,n}(P)), Assumption 4.4(ii) (for E_{2,n}(P)), results (S.40) and (S.41) together with η − 2ε > 0.5, Markov's inequality, and b_n × sup_{P∈P₀} 1/σ₀(P) = o(1) (for E_{3,n}(P)), and Lemma S.2.3 (for E_{4,n}(P)) yield

    liminf_{n→∞} inf_{P∈P^i_0} P({Z_i}_{i=1}^n ∈ E_n(P)) = 1.

Next note that if {Z_i}_{i=1}^n ∈ E_n(P), then the event E_{1,n}(P) allows us to conclude

    T_n = sup_{s∈V̂^i_n} √n⟨A†s, x̂⋆_n⟩ ≤ sup_{s∈V^i(P)} √n⟨A†s, x̂⋆_n⟩.   (S.44)

Furthermore, since A†AA† = A† by Proposition 6.11.1(5) in Luenberger (1969), Assumption 4.2(ii), AA†β(P) = β(P) whenever P ∈ P₀ due to β(P) ∈ R by Theorem 3.1, symmetry of Ω^i(P), and AA†Ĉ_n√n{β̂_n − β(P)} ∈ range{Ω^i(P)} whenever {Z_i}_{i=1}^n ∈ E_n(P) due to range{Σ^i(P)} ⊆ range{Ω^i(P)} by Assumption 4.4(i) imply

    √n⟨A†s, x̂⋆_n⟩ = ⟨A†s, A†AA†Ĉ_n√n{β̂_n − β(P)}⟩ + √n⟨A†s, A†β(P)⟩
        = ⟨Ω^i(P)(AA′)†s, (Ω^i(P))†AA†Ĉ_n√n{β̂_n − β(P)}⟩ + √n⟨(AA′)†s, β(P)⟩   (S.45)

for any s ∈ V^i(P) whenever {Z_i}_{i=1}^n ∈ E_n(P). Further note that since ⟨A†s, A†β(P)⟩ ≤ 0 for all P ∈ P₀ and s ∈ V^i(P) by Theorem 3.1, Hölder's inequality implying (S.45) is bounded in s ∈ V^i(P) together with Lemmas S.2.12 and S.2.13 implies that

    sup_{s∈(AA′)†V^i(P)} {⟨Ω^i(P)s, (Ω^i(P))†AA†Ĉ_n√n{β̂_n − β(P)}⟩ + √n⟨s, β(P)⟩}
        = max_{s∈E^i(P)} {⟨Ω^i(P)s, (Ω^i(P))†AA†Ĉ_n√n{β̂_n − β(P)}⟩ + √n⟨s, β(P)⟩}.   (S.46)

Hence, results (S.44), (S.45), and (S.46) together establish that the set S^i(P) given by

    S^i(P) ≡ {s ∈ E^i(P) : ⟨Ω^i(P)s, (Ω^i(P))†AA†Ĉ_n√n{β̂_n − β(P)}⟩ + √n⟨s, β(P)⟩ > 0}

is such that S^i(P) ≠ ∅ whenever T_n > 0 and {Z_i}_{i=1}^n ∈ E_n(P). Moreover, since √n⟨s, β(P)⟩ ≤ 0 for all s ∈ S^i(P) due to S^i(P) ⊆ E^i(P) ⊂ (AA′)†V^i(P), P ∈ P₀, and Theorem 3.1, it follows that whenever S^i(P) ≠ ∅ we must have

    ⟨Ω^i(P)s, (Ω^i(P))†AA†Ĉ_n√n{β̂_n − β(P)}⟩ > 0   (S.47)

for all s ∈ S^i(P). Also note that if {Z_i}_{i=1}^n ∈ E_n(P) ⊆ E_{2,n}(P), then range{Σ^i(P)} equaling the support of G^i_n(P) by Theorem 3.6.1 in Bogachev (1998) implies that σ_i(s,P) > 0 for any s satisfying (S.47). Thus, we have so far shown that if P ∈ P^i_0, then

    S^i(P) ≠ ∅ and σ_i(s,P) > 0 for all s ∈ S^i(P)   (S.48)

whenever {Z_i}_{i=1}^n ∈ E_n(P) and T_n > 0. We next aim to show that in addition

    max_{s∈S^i(P)} ⟨s, AA†β̂^r_n⟩ = 0   (S.49)

whenever {Z_i}_{i=1}^n ∈ E_n(P) and T_n > 0. To this end, note Theorem 3.1 yields that

    0 ≥ sup_{s∈V^i(P)} ⟨A†s, A†β̂^r_n⟩ = sup_{s∈(AA′)†V^i(P)} ⟨s, AA†β̂^r_n⟩ = max_{s∈E^i(P)} ⟨s, AA†β̂^r_n⟩,   (S.50)

where the first equality follows from A†AA† = A† by Proposition 6.11.1(5) in Luenberger (1969) and the second equality from Lemmas S.2.12 and S.2.13. Furthermore, since AA†Ĉ_nβ(P) = β(P) due to Ĉ_nβ(P) = β(P) by Assumption 4.2(ii) and β(P) ∈ R, we obtain from symmetry of Ω^i(P) and AA†Ĉ_n√n{β̂_n − β(P)} ∈ range{Ω^i(P)} whenever {Z_i}_{i=1}^n ∈ E_n(P) due to range{Σ^i(P)} ⊆ range{Ω^i(P)} by Assumption 4.4(i), that

    max_{s∈E^i(P)\S^i(P)} √n⟨s, AA†Ĉ_nβ̂_n⟩ = max_{s∈E^i(P)\S^i(P)} {⟨s, AA†Ĉ_n√n{β̂_n − β(P)}⟩ + √n⟨s, β(P)⟩}
        = max_{s∈E^i(P)\S^i(P)} {⟨Ω^i(P)s, (Ω^i(P))†AA†Ĉ_n√n{β̂_n − β(P)}⟩ + √n⟨s, β(P)⟩} ≤ 0,   (S.51)

where the inequality follows by definition of S^i(P). Thus, if we suppose by way of contradiction that (S.49) fails to hold, then (S.50), (S.51), S^i(P) ⊆ E^i(P), and E^i(P) being finite imply there exists a γ⋆ ∈ (0,1) (depending on β̂_n and β̂^r_n) such that

    0 ≥ max_{s∈E^i(P)} ⟨s, AA†{(1−γ⋆)β̂^r_n + γ⋆Ĉ_nβ̂_n}⟩ = sup_{s∈(AA′)†V^i(P)} ⟨s, AA†{(1−γ⋆)β̂^r_n + γ⋆Ĉ_nβ̂_n}⟩
        = sup_{s∈V^i(P)} ⟨A†s, A†{(1−γ⋆)β̂^r_n + γ⋆AA†Ĉ_nβ̂_n}⟩,   (S.52)

where the equalities follow from Lemmas S.2.12 and S.2.13, and again employing A†AA† = A† by Proposition 6.11.1(5) in Luenberger (1969). However, by construction β̂^r_n ∈ R and AA†Ĉ_nβ̂_n ∈ R, and therefore result (S.52) and Theorem 3.1 imply that

    (1−γ⋆)β̂^r_n + γ⋆AA†Ĉ_nβ̂_n = Ax for some x ≥ 0.

Moreover, note if T_n > 0, then sup_{s∈V̂^i_n} ⟨A†s, x̂⋆_n⟩ > 0 and hence

    sup_{s∈V̂^i_n} ⟨A†s, x̂⋆_n − A†β̂^r_n⟩ > 0

due to ⟨A†s, A†β̂^r_n⟩ ≤ 0 for all s ∈ V̂^i_n by Theorem 3.1. Hence, if T_n > 0, then x̂⋆_n = A†Ĉ_nβ̂_n, A†AA† = A†, and γ⋆ ∈ (0,1) yield

    sup_{s∈V̂^i_n} ⟨A†s, x̂⋆_n − A†{(1−γ⋆)β̂^r_n + γ⋆AA†Ĉ_nβ̂_n}⟩ = (1−γ⋆) sup_{s∈V̂^i_n} ⟨A†s, x̂⋆_n − A†β̂^r_n⟩
        < sup_{s∈V̂^i_n} ⟨A†s, x̂⋆_n − A†β̂^r_n⟩,

which is impossible by definition of β̂^r_n. We thus obtain that if {Z_i}_{i=1}^n ∈ E_n(P) and T_n > 0, then result (S.49) must hold.

To conclude, note that results (S.48) and (S.49) imply there is a ŝ_n ∈ V^i(P) depending only on P and {Z_i}_{i=1}^n such that (AA′)†ŝ_n ∈ E^i(P), σ_i((AA′)†ŝ_n, P) > 0, and Û_n(ŝ_n) ≡ λ_n⟨A†ŝ_n, A†β̂^r_n⟩ = 0 whenever {Z_i}_{i=1}^n ∈ E_n(P) and T_n > 0. Therefore, the definitions of L̂_n, L⋆_n(P), and ĉ_n(η) together with {Z_i}_{i=1}^n ∈ E_n(P) ⊆ E_{3,n}(P) yield

    P(⟨A†ŝ_n, A†G^{i⋆}_n(P)⟩ ≤ ĉ_n(η) + σ₀(P)z_{η−2ε} | {Z_i}_{i=1}^n)
        ≥ P(L⋆_n(P) ≤ ĉ_n(η) + σ₀(P)z_{η−2ε} | {Z_i}_{i=1}^n)
        ≥ P(L̂_n ≤ ĉ_n(η) | {Z_i}_{i=1}^n) − ε ≥ η − ε   (S.53)

whenever P ∈ P^i_0, {Z_i}_{i=1}^n ∈ E_n(P), and T_n > 0. Furthermore, since G^{i⋆}_n(P) ∈ range{Ω^i(P)} by Assumption 4.4(i) and Theorem 3.6.1 in Bogachev (1998), we have

    ⟨A†ŝ_n, A†G^{i⋆}_n(P)⟩ = ⟨Ω^i(P)(AA′)†ŝ_n, (Ω^i(P))†G^{i⋆}_n(P)⟩   (S.54)

almost surely. Hence, G^{i⋆}_n(P) being independent of {Z_i}_{i=1}^n implies ⟨A†ŝ_n, A†G^{i⋆}_n(P)⟩ ∼ N(0, (σ_i((AA′)†ŝ_n, P))²) conditional on {Z_i}_{i=1}^n. Since (S.54) and σ_i((AA′)†ŝ_n, P) > 0 imply that the distribution of ⟨A†ŝ_n, A†G^{i⋆}_n(P)⟩ conditional on {Z_i}_{i=1}^n first order stochastically dominates a N(0, σ₀²(P)) random variable, (S.53) yields

    ĉ_n(η) + σ₀(P)z_{η−2ε} ≥ σ₀(P)z_{η−ε},

which establishes the claim of the lemma for the subset P^i_0.

Case III: For the final case, suppose P ∈ P^d_0 ≡ {P ∈ P₀ : σ_j(s,P) = 0 for all s ∈ E^j(P) and j ∈ {e,i}}. Then, by Lemma S.2.3 we may set E_n(P) ≡ {T_n = 0} and the claim of the lemma for the subset P^d_0 follows.

Theorem S.2.1.
Let Assumptions 4.1, 4.3(ii), and 4.4(i) hold with a_n = o(1), set Σ(P) ≡ E_P[ψ(Z,P)ψ(Z,P)′], and suppose (Ŵ^e_n(P)′, Ŵ^i_n(P)′)′ ≡ Ŵ_n(P) ∈ R^p satisfies

    ‖(Ω^e(P))†{Ŵ^e_n(P) − W^e_n(P)}‖_∞ ∨ ‖(Ω^i(P))†{Ŵ^i_n(P) − W^i_n(P)}‖_∞ = O_P(ω_n)   (S.55)

for some ω_n > 0, W_n(P) ≡ (W^e_n(P)′, W^i_n(P)′)′ ∼ N(0, Σ(P)), and Ŵ^e_n(P) ∈ range{Ω^e(P)} and Ŵ^i_n(P) ∈ range{Ω^i(P)} with probability tending to one uniformly in P ∈ P. Then, for any Q ⊆ P and possibly random function f̂_n(·,P): R^p → R satisfying

    γ f̂_n(s,P) ≤ f̂_n(γs,P) ≤ 0   (S.56)

for all s with A†s ≤ 0, γ ∈ [0,1], and P ∈ Q, it follows uniformly in P ∈ Q that

    sup_{s∈V̂^e_n} ⟨s, Ŵ^e_n(P)⟩ = sup_{s∈V^e(P)} ⟨s, W^e_n(P)⟩ + O_P(ω_n + a_n)
    sup_{s∈V̂^i_n} {⟨A†s, A†Ŵ^i_n(P)⟩ + f̂_n(s,P)} = sup_{s∈V^i(P)} {⟨A†s, A†W^i_n(P)⟩ + f̂_n(s,P)} + O_P(ω_n + a_n).

Proof:
We establish only the second claim of the theorem, noting that the first claim follows from slightly simpler but largely identical arguments. First note that since Ω^i(P)(Ω^i(P))†Ŵ^i_n(P) = Ŵ^i_n(P) whenever Ŵ^i_n(P) ∈ range{Ω^i(P)}, it follows that

    sup_{s∈V̂^i_n} {⟨A†s, A†Ŵ^i_n(P)⟩ + f̂_n(s,P)} = sup_{s∈V̂^i_n} {⟨A†s, A†Ω^i(P)(Ω^i(P))†Ŵ^i_n(P)⟩ + f̂_n(s,P)}   (S.57)

with probability tending to one uniformly in P ∈ P. Further note that Lemma S.2.10 and Assumption 4.1(iii) imply Ω̂^i_n(Ω̂^i_n)†Ω^i(P) = Ω^i(P) with probability tending to one uniformly in P ∈ P. Thus, since Ω̂^i_n and Ω^i(P) are symmetric by Assumptions 4.1(i)(ii), it follows that Ω^i(P) = Ω^i(P)(Ω̂^i_n)†Ω̂^i_n with probability tending to one uniformly in P ∈ P. Employing the triangle inequality, the definition of V̂^i_n, and Ω̂^i_n(Ω̂^i_n)†Ω̂^i_n = Ω̂^i_n by Proposition 6.11.1(6) in Luenberger (1969) we can conclude that

    sup_{s∈V̂^i_n} ‖Ω^i(P)(AA′)†s‖₁ ≤ 1 + sup_{s∈V̂^i_n} ‖(Ω̂^i_n − Ω^i(P))(AA′)†s‖₁
        = 1 + sup_{s∈V̂^i_n} ‖(Ω̂^i_n − Ω^i(P))(Ω̂^i_n)†Ω̂^i_n(AA′)†s‖₁ ≤ 1 + ‖(Ω̂^i_n − Ω^i(P))(Ω̂^i_n)†‖_{o,1}   (S.58)

with probability tending to one uniformly in P ∈ P. Further note that Theorem 6.5.1 in Luenberger (1969), symmetry of Ω̂^i_n and Ω^i(P), and Lemma S.2.6 imply

    ‖(Ω̂^i_n − Ω^i(P))(Ω̂^i_n)†‖_{o,1} = ‖(Ω̂^i_n)†(Ω̂^i_n − Ω^i(P))‖_{o,∞} = O_P(a_n/√(log(1+p)))   (S.59)

uniformly in P ∈ P. Next, note that since Ω^i(P)(A†)′A† = Ω^i(P)(AA′)† (see, e.g., Seber (2008) pg. 139), Hölder's inequality and results (S.55), (S.58), and (S.59) yield

    sup_{s∈V̂^i_n} |⟨A†s, A†Ω^i(P)(Ω^i(P))†(Ŵ^i_n(P) − W^i_n(P))⟩|
        ≤ (1 + O_P(a_n/√(log(1+p)))) ‖(Ω^i(P))†(Ŵ^i_n(P) − W^i_n(P))‖_∞ = O_P(ω_n)   (S.60)

uniformly in P ∈ P, where the final equality follows from a_n = o(1) by assumption. Therefore, combining results (S.57) and (S.60) we conclude that uniformly in P ∈ P

    sup_{s∈V̂^i_n} {⟨A†s, A†Ŵ^i_n(P)⟩ + f̂_n(s,P)} = sup_{s∈V̂^i_n} {⟨A†s, A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(s,P)} + O_P(ω_n).   (S.61)

We next aim to replace V̂^i_n with V^i(P) in (S.61). To this end, let ŝ_n ∈ V̂^i_n satisfy

    ⟨A†ŝ_n, A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(ŝ_n,P)
        = sup_{s∈V̂^i_n} {⟨A†s, A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(s,P)} + O(ω_n),   (S.62)

where note ŝ_n is random and (S.62) is meant to hold surely. Set s̄_n ≡ γ_nŝ_n with

    γ_n ≡ (‖Ω^i(P)(AA′)†ŝ_n‖₁ ∨ 1)^{−1} ∈ [0,1],   (S.63)

and note that since γ_n ≤ 1, result (S.58) and ŝ_n ∈ V̂^i_n allow us to conclude that

    0 ≤ 1 − γ_n ≤ 1 − (1 + ‖(Ω̂^i_n − Ω^i(P))(Ω̂^i_n)†‖_{o,1})^{−1}   (S.64)

with probability tending to one uniformly in P ∈ P. Hence, (S.59) and (S.64) yield

    0 ≤ 1 − γ_n ≤ O_P(a_n/√(log(1+p)))   (S.65)

uniformly in P ∈ P due to a_n = o(1). Next, we note A†ŝ_n ≤ 0 due to ŝ_n ∈ V̂^i_n, and therefore A†s̄_n = γ_nA†ŝ_n ≤ 0 due to γ_n ≥ 0. Since s̄_n = γ_nŝ_n and (S.63) further imply

    ‖Ω^i(P)(AA′)†s̄_n‖₁ = (‖Ω^i(P)(AA′)†ŝ_n‖₁ ∨ 1)^{−1} ‖Ω^i(P)(AA′)†ŝ_n‖₁ ≤ 1,   (S.66)

it follows that s̄_n ∈ V^i(P). Moreover, ŝ_n − s̄_n = (1−γ_n)ŝ_n, γ_nf̂_n(ŝ_n,P) ≤ f̂_n(γ_nŝ_n,P) and f̂_n(ŝ_n,P) ≤ 0 for all P ∈ Q by (S.56), and Hölder's inequality yield

    ⟨A†(ŝ_n − s̄_n), A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(ŝ_n,P) − f̂_n(s̄_n,P)
        ≤ (1−γ_n){⟨A†ŝ_n, A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(ŝ_n,P)}
        ≤ (1−γ_n){sup_{s∈V̂^i_n} ‖Ω^i(P)(AA′)†s‖₁} ‖(Ω^i(P))†W^i_n(P)‖_∞.

In particular, since sup_{P∈P} E_P[‖(Ω^i(P))†W^i_n(P)‖_∞] ≲ √(log(1+p)) by Lemma S.2.8 and Assumption 4.3(ii), Markov's inequality, results (S.58), (S.59), (S.62), and (S.65), and s̄_n ∈ V^i(P) allow us to conclude that

    sup_{s∈V̂^i_n} {⟨A†s, A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(s,P)}
        ≤ sup_{s∈V^i(P)} {⟨A†s, A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(s,P)} + O_P(ω_n + a_n)   (S.67)

uniformly in P ∈ Q. The reverse inequality to (S.67) can be established by very similar arguments, and therefore we can conclude that uniformly in P ∈ Q we have

    sup_{s∈V̂^i_n} {⟨A†s, A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(s,P)}
        = sup_{s∈V^i(P)} {⟨A†s, A†Ω^i(P)(Ω^i(P))†W^i_n(P)⟩ + f̂_n(s,P)} + O_P(ω_n + a_n).   (S.68)

Finally, note W^i_n(P) almost surely belongs to the range of Σ^i(P): R^p → R^p by Theorem 3.6.1 in Bogachev (1998). Hence, since Assumption 4.4(i) implies Ω^i(P)(Ω^i(P))†Σ^i(P) = Σ^i(P), it follows that W^i_n(P) = Ω^i(P)(Ω^i(P))†W^i_n(P) P-almost surely. The second claim of the theorem thus follows from (S.61) and (S.68).

Lemma S.2.3.
Let Assumptions 4.1, 4.2(ii), 4.4, and 4.5(v) hold and a_n = o(1), and for j ∈ {e,i} set D^j_0 ≡ {P ∈ P₀ : σ_j(s,P) = 0 for all s ∈ E^j(P)}. Then:

    liminf_{n→∞} inf_{P∈D^e_0} P( sup_{s∈V̂^e_n} |√n⟨s, β̂_n − Ax̂⋆_n⟩| = sup_{s∈V̂^e_n} |⟨s, Ĝ^e_n⟩| = 0 ) = 1
    liminf_{n→∞} inf_{P∈D^i_0} P( sup_{s∈V̂^i_n} |⟨A†s, A†Ĝ^i_n⟩| = sup_{s∈V̂^i_n} |√n⟨A†s, A†β(P) − x̂⋆_n⟩| = sup_{s∈V̂^i_n} √n⟨A†s, x̂⋆_n⟩ = 0 ) = 1.

Proof:
First note that Theorem 3.6.1 in Bogachev (1998) and Assumption 4.4(i) imply G^e_n(P) ∈ range{Σ^e(P)} ⊆ range{Ω^e(P)} almost surely. Hence, Ω^e(P)(Ω^e(P))†G^e_n(P) = G^e_n(P) almost surely and symmetry of Ω^e(P) imply for any P ∈ D^e_0 that

    sup_{s∈V^e(P)} |⟨s, G^e_n(P)⟩| = sup_{s∈V^e(P)} |⟨Ω^e(P)s, (Ω^e(P))†G^e_n(P)⟩| = max_{s∈E^e(P)} |⟨s, (Ω^e(P))†G^e_n(P)⟩| = 0,   (S.69)

where the second equality follows from Hölder's inequality implying the supremum is finite and Lemma S.2.12. Also note that Ω̂^e_n(Ω̂^e_n)†Ω^e(P) = Ω^e(P) with probability tending to one uniformly in P ∈ P by Assumption 4.1(iii) and Lemma S.2.10. Thus, by symmetry of Ω̂^e_n and Ω^e(P) we obtain that Ω^e(P) = Ω^e(P)(Ω̂^e_n)†Ω̂^e_n, which together with the triangle inequality, the definition of V̂^e_n, and Ω̂^e_n(Ω̂^e_n)†Ω̂^e_n = Ω̂^e_n by Proposition 6.11.1(6) in Luenberger (1969) imply with probability tending to one uniformly in P ∈ P that

    sup_{s∈V̂^e_n} ‖Ω^e(P)s‖₁ ≤ 1 + sup_{s∈V̂^e_n} ‖(Ω̂^e_n − Ω^e(P))s‖₁
        = 1 + sup_{s∈V̂^e_n} ‖(Ω̂^e_n − Ω^e(P))(Ω̂^e_n)†Ω̂^e_n s‖₁ ≤ 1 + ‖(Ω̂^e_n)†(Ω̂^e_n − Ω^e(P))‖_{o,∞},

where the final inequality follows from Theorem 6.5.1 in Luenberger (1969). Therefore, Lemma S.2.6 and a_n = o(1) imply that V̂^e_n ⊆ V^e(P) with probability tending to one uniformly in P ∈ P. We can thus conclude from 0 ∈ V̂^e_n, result (S.69), Assumption 4.4(ii), and the support of G^e_n(P) being equal to the range of Σ^e(P) by Theorem 3.6.1 in Bogachev (1998) that with probability tending to one uniformly in P ∈ D^e_0

    0 ≤ sup_{s∈V̂^e_n} |√n⟨s, β̂_n − Ax̂⋆_n⟩| ≤ sup_{s∈V^e(P)} |⟨s, (I_p − AA†Ĉ_n)√n{β̂_n − β(P)}⟩| = 0.   (S.70)

Moreover, identical arguments but relying on Assumption 4.5(v) instead of 4.4(i) yield

    0 ≤ sup_{s∈V̂^e_n} |⟨s, Ĝ^e_n⟩| ≤ sup_{s∈V^e(P)} |⟨s, Ĝ^e_n⟩| = 0   (S.71)

with probability tending to one uniformly in P ∈ D^e_0. The first claim of the lemma therefore follows from results (S.70) and (S.71).

For the second claim of the lemma, we note that identical arguments to those employed for the first claim readily establish that V̂^i_n ⊆ V^i(P) and

    sup_{s∈V̂^i_n} |⟨A†s, A†AA†Ĉ_n√n{β̂_n − β(P)}⟩| = sup_{s∈V̂^i_n} |⟨A†s, A†Ĝ^i_n⟩| = 0   (S.72)

with probability tending to one uniformly in P ∈ D^i_0. Further note that since A†AA† = A† by Proposition 6.11.1(5) in Luenberger (1969), it follows that A†AA†Ĉ_nβ̂_n = x̂⋆_n due to x̂⋆_n ≡ A†Ĉ_nβ̂_n by Assumption 4.2(ii), and therefore (S.72) yields

    sup_{s∈V̂^i_n} |√n⟨A†s, x̂⋆_n − A†β(P)⟩| = 0   (S.73)

with probability tending to one uniformly in P ∈ D^i_0. Therefore, since ⟨A†s, A†β(P)⟩ ≤ 0 for all P ∈ P₀ and s ∈ V^i(P) by Theorem 3.1, we obtain from 0 ∈ V̂^i_n and (S.73) that

    0 ≤ sup_{s∈V̂^i_n} √n⟨A†s, x̂⋆_n⟩ ≤ sup_{s∈V̂^i_n} |√n⟨A†s, x̂⋆_n − A†β(P)⟩| + sup_{s∈V̂^i_n} √n⟨A†s, A†β(P)⟩ = 0

with probability tending to one uniformly in P ∈ D^i_0.

Lemma S.2.4.
Set Σ(P) ≡ E_P[ψ(Z,P)ψ(Z,P)′] and r_n ≡ a_n + M_{3,Ψ}p^{1/3}(log(1+p))^{5/6}/n^{1/6}. If Assumptions 4.2(i)(iii), 4.3, and 4.4(i) hold, and r_n = o(1), then there exists a Gaussian (G^e_n(P)′, G^i_n(P)′)′ ≡ G_n(P) ∼ N(0, Σ(P)) satisfying uniformly in P ∈ P:

    ‖(Ω^e(P))†{(I_p − AA†Ĉ_n)√n{β̂_n − β(P)} − G^e_n(P)}‖_∞ = O_P(r_n)
    ‖(Ω^i(P))†{AA†Ĉ_n√n{β̂_n − β(P)} − G^i_n(P)}‖_∞ = O_P(r_n).

Proof:
We first set ψ̃(Z,P) ≡ (((Ω^e(P))†ψ^e(Z,P))′, ((Ω^i(P))†ψ^i(Z,P))′)′ ∈ R^p, define

    Σ̃(P) ≡ E_P[ψ̃(Z,P)ψ̃(Z,P)′],   (S.74)

and let S_n(P) ∈ R^p be normally distributed with mean zero and variance Σ̃(P)/n. Next observe that since ‖a‖² ≤ p‖a‖²_∞ for any a ∈ R^p we can conclude that

    E_P[‖S_n(P)‖²‖S_n(P)‖_∞] ≤ pE_P[‖S_n(P)‖³_∞] ≲ p(√(log(1+p))/√n)³,   (S.75)

where the second inequality follows from Lemma S.2.8 and Assumption 4.3(ii). Moreover, by similar arguments, Assumption 4.3(iii), and result (S.75) we can conclude

    n{ E_P[‖ψ̃(Z,P)/√n‖²‖ψ̃(Z,P)/√n‖_∞] + E_P[‖S_n(P)‖²‖S_n(P)‖_∞] }
        ≲ n{ (p/n^{3/2})E_P[Ψ³(Z,P)] + E_P[‖S_n(P)‖²‖S_n(P)‖_∞] } ≲ (p/√n){ M³_{3,Ψ} + (log(1+p))^{3/2} }.   (S.76)

Setting Z ∼ N(0, I_p), we then obtain by Assumptions 4.2(i), 4.3(i), Lemma 39 in Belloni et al. (2019), and (S.76) that for any δ > 0 there is a G̃_n(P) ∼ N(0, Σ̃(P)) such that

    P( ‖(1/√n)∑_{i=1}^n ψ̃(Z_i,P) − G̃_n(P)‖_∞ > 3δ ) ≲ min_{t≥0} { P(‖Z‖_∞ > t) + (t²/δ³)(p/√n){ M³_{3,Ψ} + (log(1+p))^{3/2} } }
        ≲ min_{t≥0} { exp{−t²/(C log(1+p))} + (t²/δ³)(pM³_{3,Ψ}(log(1+p))^{3/2}/√n) },   (S.77)

where the final inequality follows from Proposition A.2.1 in van der Vaart and Wellner (1996), E[‖Z‖_∞] ≲ √(log(1+p)) by Lemma S.2.8, and we employed that M_{3,Ψ} ≥ 1. Setting t = K√(log(1+p)) and δ³ = K³pM³_{3,Ψ}(log(1+p))^{5/2}/√n in (S.77) for any K > 0 then yields

    lim_{K↑∞} limsup_{n→∞} sup_{P∈P} P( ‖(1/√n)∑_{i=1}^n ψ̃(Z_i,P) − G̃_n(P)‖_∞ > K M_{3,Ψ}p^{1/3}(log(1+p))^{5/6}/n^{1/6} )
        ≲ lim_{K↑∞} { exp{−K²/C} + 1/K } = 0.   (S.78)

Since r_n ≡ M_{3,Ψ}p^{1/3}(log(1+p))^{5/6}/n^{1/6} + a_n, result (S.78), Assumption 4.2(iii), writing G̃_n(P) ≡ (G̃^e_n(P)′, G̃^i_n(P)′)′, and the triangle inequality imply that uniformly in P ∈ P

    ‖(Ω^e(P))†(I_p − AA†)√n{β̂_n − β(P)} − G̃^e_n(P)‖_∞ = O_P(r_n)
    ‖(Ω^i(P))†AA†√n{β̂_n − β(P)} − G̃^i_n(P)‖_∞ = O_P(r_n).   (S.79)

To conclude, note that for j ∈ {e,i}, G̃^j_n(P) ∼ N(0, (Ω^j(P))†Σ^j(P)(Ω^j(P))†) and therefore Theorem 3.6.1 in Bogachev (1998) implies that G̃^j_n(P) belongs to the range of the map (Ω^j(P))†Σ^j(P)(Ω^j(P))†: R^p → R^p almost surely. Thus, since for j ∈ {e,i} we have (Ω^j(P))†Ω^j(P)(Ω^j(P))† = (Ω^j(P))†, it follows that (Ω^j(P))†Ω^j(P)G̃^j_n(P) = G̃^j_n(P) almost surely. Hence, setting G^j_n(P) = Ω^j(P)G̃^j_n(P) for j ∈ {e,i} we can conclude

    ‖(Ω^e(P))†{(I_p − AA†)√n{β̂_n − β(P)} − G^e_n(P)}‖_∞ = O_P(r_n)
    ‖(Ω^i(P))†{AA†√n{β̂_n − β(P)} − G^i_n(P)}‖_∞ = O_P(r_n)   (S.80)

uniformly in P ∈ P by result (S.79). Since G̃_n(P) ∼ N(0, Σ̃(P)), and Assumption 4.4(i) implies Ω^j(P)(Ω^j(P))†ψ^j(Z,P) = ψ^j(Z,P) P-almost surely, we can conclude from the definition of Σ̃(P) in (S.74) and G_n(P) = ((Ω^e(P)G̃^e_n(P))′, (Ω^i(P)G̃^i_n(P))′)′ that G_n(P) ∼ N(0, Σ(P)), and thus the claim of the lemma follows.

Lemma S.2.5.
Let Assumptions 4.2(i), 4.3, 4.4(i), and 4.5(i)-(iv) hold, and define

    b_n ≡ √(p log(1+n)) M_{2,Ψ}/n^{1/4} + (p(log(1+p))^{5/2}M³_{3,Ψ}/√n)^{1/3} + (p(log(1+p))³n^{2/q}M²_{q,Ψ}/n)^{1/4} + a_n.

If b_n = o(1), then there is a Gaussian vector (G^{e⋆}_n(P)′, G^{i⋆}_n(P)′)′ ≡ G^⋆_n(P) ∼ N(0, Σ(P)) independent of {Z_i}_{i=1}^n and satisfying uniformly in P ∈ P:

    ‖(Ω^e(P))†{Ĝ^e_n − G^{e⋆}_n(P)}‖_∞ ∨ ‖(Ω^i(P))†{Ĝ^i_n − G^{i⋆}_n(P)}‖_∞ = O_P(b_n).

Proof:
For ease of exposition we divide the proof into multiple steps. In the arguments that follow, we let φ(Z,P) ≡ (φ^e(Z,P)′, φ^i(Z,P)′)′ ∈ R^p, where

    φ^e(Z,P) ≡ (Ω^e(P))†ψ^e(Z,P)    φ^i(Z,P) ≡ (Ω^i(P))†ψ^i(Z,P).   (S.81)

Step 1: (Distributional Representation). Let {U_i}_{i=1}^∞ be an i.i.d. sequence independent of {Z_i, W_{i,n}}_{i=1}^n with U_i uniformly distributed on (0,1]. We write (U_{(1),n}, ..., U_{(n),n}) for the order statistics of {U_i}_{i=1}^n and R_{i,n} for the rank of each U_i (i.e., U_i = U_{(R_{i,n}),n}). By Lemma 13.1(iv) in van der Vaart (1999), it then follows that the vector R_n ≡ (R_{1,n}, ..., R_{n,n}) is uniformly distributed on the set of all n! permutations of {1, ..., n} and hence by Assumption 4.5(i) we can conclude that

    ( (1/√n)∑_{i=1}^n (W_{i,n} − W̄_n)φ(Z_i,P), {Z_i}_{i=1}^n ) =_d ( (1/√n)∑_{i=1}^n (W_{R_{i,n}},n − W̄_n)φ(Z_i,P), {Z_i}_{i=1}^n ),

where =_d denotes equality in distribution and W̄_n ≡ ∑_{i=1}^n W_{i,n}/n.

Step 2: (Couple to i.i.d.). We next define τ_n: [0,1] → {W_{i,n} − W̄_n}_{i=1}^n to be given by

    τ_n(u) ≡ inf{ c : (1/n)∑_{i=1}^n 1{W_{i,n} − W̄_n ≤ c} ≥ u };

i.e., τ_n is the empirical quantile function of the sample {W_{i,n} − W̄_n}_{i=1}^n. In addition, set

    S_n(P) ≡ (1/√n)∑_{i=1}^n (W_{R_{i,n}},n − W̄_n)φ(Z_i,P)    L_n(P) ≡ (1/√n)∑_{i=1}^n (φ(Z_i,P) − φ̄_n(P))τ_n(U_i),

where φ̄_n(P) ≡ ∑_{i=1}^n φ(Z_i,P)/n. Letting S_{j,n}(P) and L_{j,n}(P) denote the jth coordinates of S_n(P) and L_n(P) respectively, we then observe that Theorem 3.1 in Hájek (1961) (see in particular equation (3.11) on page 512) allows us to conclude that

    E[(S_{j,n}(P) − L_{j,n}(P))² | {Z_i, W_{i,n}}_{i=1}^n]
        ≲ Var{L_{j,n}(P) | {Z_i, W_{i,n}}_{i=1}^n} × max_{1≤i≤n} |W_{i,n} − W̄_n| / ( ∑_{i=1}^n (W_{i,n} − W̄_n)² )^{1/2}.   (S.82)

In order to study the properties of L_n(P) it is convenient to define ξ_{i,n}(P) to equal

    ξ_{i,n}(P) ≡ (φ(Z_i,P) − φ̄_n(P))τ_n(U_i)/√n.   (S.83)

Then note that since {U_i}_{i=1}^n are i.i.d. uniform on (0,1] and independent of {Z_i, W_{i,n}}_{i=1}^n, and τ_n is the empirical quantile function of {W_{i,n} − W̄_n}_{i=1}^n, it follows that

    E[ξ_{i,n}(P) | {Z_i, W_{i,n}}_{i=1}^n] = (1/√n)(φ(Z_i,P) − φ̄_n(P))( (1/n)∑_{i=1}^n (W_{i,n} − W̄_n) ) = 0
    E[ξ_{i,n}(P)ξ_{i,n}(P)′ | {Z_i, W_{i,n}}_{i=1}^n] = (σ̂²_n/n)(φ(Z_i,P) − φ̄_n(P))(φ(Z_i,P) − φ̄_n(P))′,   (S.84)

where σ̂²_n ≡ ∑_{i=1}^n (W_{i,n} − W̄_n)²/n. Hence, since the variables {ξ_{i,n}(P)}_{i=1}^n are independent conditional on {Z_i, W_{i,n}}_{i=1}^n, it follows from L_n(P) = ∑_{i=1}^n ξ_{i,n}(P) that

    Var{L_{j,n}(P) | {Z_i, W_{i,n}}_{i=1}^n} = (σ̂²_n/n)∑_{i=1}^n (φ_j(Z_i,P) − φ̄_{j,n}(P))²,   (S.85)

where φ_j(Z_i,P) and φ̄_{j,n}(P) denote the jth coordinates of φ(Z_i,P) and φ̄_n(P) respectively. Thus, since for any random variable (V_1, ..., V_p) ≡ V ∈ R^p Jensen's inequality implies E[‖V‖_∞] ≤ √p max_{1≤j≤p}(E[V_j²])^{1/2}, results (S.82) and (S.85) yield

    E[‖S_n(P) − L_n(P)‖_∞ | {Z_i, W_{i,n}}_{i=1}^n]
        ≲ √p max_{1≤j≤p} ( (σ̂_n/n^{3/2})∑_{i=1}^n (φ_j(Z_i,P) − φ̄_{j,n}(P))² )^{1/2} ( max_{1≤i≤n} |W_{i,n} − W̄_n| )^{1/2}.   (S.86)

Next, we note that the definition of φ(z,P) implies that Ψ(z,P), as introduced in Assumption 4.3(iii), satisfies Ψ(z,P) = ‖φ(z,P)‖_∞. Hence, for M_{2,Ψ} as introduced in Assumption 4.3(iii), Markov and Jensen's inequalities imply for any C > 0

    sup_{P∈P} P( (1/n)∑_{i=1}^n Ψ²(Z_i,P) > CM²_{2,Ψ} ) ≤ (1/(CM²_{2,Ψ})) sup_{P∈P} E_P[ (1/n)∑_{i=1}^n Ψ²(Z_i,P) ]
        ≤ (1/(CM²_{2,Ψ})) sup_{P∈P} ‖Ψ(·,P)‖²_{P,2} ≤ 1/C.
Thus, using that Ψ(z,P) = ‖φ(z,P)‖_∞ we conclude uniformly in P ∈ P that

    max_{1≤j≤p} (1/n)∑_{i=1}^n (φ_j(Z_i,P) − φ̄_{j,n}(P))² ≤ (4/n)∑_{i=1}^n Ψ²(Z_i,P) = O_P(M²_{2,Ψ}).   (S.87)

Moreover, by the triangle inequality, Assumption 4.5(ii), Lemma 2.2.10 in van der Vaart and Wellner (1996), and E[|V|] ≤ ‖V‖_{ψ₁} for any random variable V and ‖·‖_{ψ₁} the Orlicz norm based on ψ₁(x) = eˣ − 1, we can conclude that

    E[max_{1≤i≤n} |W_{i,n} − W̄_n|] ≤ E[max_{1≤i≤n} |W_{i,n} − E[W_{1,n}]|] + E[|W̄_n − E[W_{1,n}]|] ≲ log(1+n) + E[|W_{1,n}|].   (S.88)

Thus, σ̂²_n →_P 1 by Assumption 4.5(iii), E[|W_{1,n}|] being uniformly bounded in n by Jensen's inequality and Assumption 4.5(iii), and results (S.86), (S.87), and (S.88) yield

    E[‖S_n(P) − L_n(P)‖_∞ | {Z_i, W_{i,n}}_{i=1}^n] = O_P( √(p log(1+n)) M_{2,Ψ}/n^{1/4} )

uniformly in P ∈ P. By Fubini's theorem and Markov's inequality we may therefore conclude that unconditionally (on {Z_i, W_{i,n}}_{i=1}^n) and uniformly in P ∈ P we have

    ‖S_n(P) − L_n(P)‖_∞ = O_P( √(p log(1+n)) M_{2,Ψ}/n^{1/4} ).

Step 3: (Couple to Gaussian). We next proceed by coupling L_n(P) to a (conditionally) Gaussian vector. To this end, recall the definition of ξ_{i,n}(P) in (S.83) and let

    Ḡ_{i,n}(P) ∼ N(0, Var{ξ_{i,n}(P) | {Z_i, W_{i,n}}_{i=1}^n})

and {Ḡ_{i,n}(P)}_{i=1}^n be mutually independent conditional on {Z_i, W_{i,n}}_{i=1}^n. Then note that ‖a‖² ≤ p‖a‖²_∞ for any a ∈ R^p, Lemma S.2.8, and result (S.84) imply

    ∑_{i=1}^n E[‖Ḡ_{i,n}(P)‖²‖Ḡ_{i,n}(P)‖_∞ | {Z_i, W_{i,n}}_{i=1}^n] ≤ p∑_{i=1}^n E[‖Ḡ_{i,n}(P)‖³_∞ | {Z_i, W_{i,n}}_{i=1}^n]
        ≲ p(log(1+p))^{3/2} (σ̂³_n/n^{3/2}) ∑_{i=1}^n ‖φ(Z_i,P) − φ̄_n(P)‖³_∞.   (S.89)

Similarly, employing the definition of ξ_{i,n}(P), {U_i}_{i=1}^n being independent of {Z_i, W_{i,n}}_{i=1}^n, and τ_n being the empirical quantile function of {W_{i,n} − W̄_n}_{i=1}^n, we obtain that

    ∑_{i=1}^n E[‖ξ_{i,n}(P)‖²‖ξ_{i,n}(P)‖_∞ | {Z_i, W_{i,n}}_{i=1}^n]
        ≤ (p/√n)( (1/n)∑_{i=1}^n ‖φ(Z_i,P) − φ̄_n(P)‖³_∞ )( (1/n)∑_{i=1}^n |W_{i,n} − W̄_n|³ ).   (S.90)

Therefore, results (S.89) and (S.90), Ψ(Z_i,P) = ‖φ(Z_i,P)‖_∞, and multiple applications of the triangle and Jensen's inequalities yield the upper bound

    ∑_{i=1}^n E[‖Ḡ_{i,n}(P)‖²‖Ḡ_{i,n}(P)‖_∞ + ‖ξ_{i,n}(P)‖²‖ξ_{i,n}(P)‖_∞ | {Z_i, W_{i,n}}_{i=1}^n]
        ≲ (p(log(1+p))^{3/2}/√n)( (1/n)∑_{i=1}^n |W_{i,n}|³ )( (1/n)∑_{i=1}^n {Ψ³(Z_i,P) + Ψ^{3/2}(Z_i,P)} ) ≡ B_n(P),   (S.91)

where the final equality is definitional. Next, let B denote the Borel σ-field on R^p and for any A ∈ B and ε > 0 set A^ε ≡ {a ∈ R^p : inf_{ã∈A} ‖a − ã‖_∞ ≤ ε}; i.e., A^ε is an ‖·‖_∞-enlargement of A. Strassen's Theorem (see Theorem 10.3.1 in Pollard (2002)), Lemma 39 in Belloni et al. (2019), and result (S.91) then establish for any δ > 0

    sup_{A∈B} { P(L_n(P) ∈ A | {Z_i, W_{i,n}}_{i=1}^n) − P( ∑_{i=1}^n Ḡ_{i,n}(P) ∈ A^δ | {Z_i, W_{i,n}}_{i=1}^n ) }
        ≲ min_{t≥0} ( 2P(‖Z‖_∞ > t) + B_n(P)t²/δ³ ),   (S.92)

where Z ∈ R^p is distributed according to Z ∼ N(0, I_p).
Furthermore, Proposition A.2.1 in van der Vaart and Wellner (1996) and Lemma S.2.8 imply for some C < ∞

    sup_{P∈P} E_P[ min_{t≥0} ( 2P(‖Z‖_∞ > t) + B_n(P)t²/δ³ ) ] ≲ min_{t≥0} ( exp{−t²/(C log(1+p))} + sup_{P∈P} E_P[B_n(P)]t²/δ³ )
        ≲ min_{t≥0} ( exp{−t²/(C log(1+p))} + (p(log(1+p))^{3/2}M³_{3,Ψ}/√n)(t²/δ³) ),   (S.93)

where the final inequality follows from (S.91), E[|W_{i,n}|³] being bounded uniformly in n by Assumption 4.5(iii), Jensen's inequality, sup_{P∈P}‖Ψ(·,P)‖³_{P,3} ≤ M³_{3,Ψ} with M_{3,Ψ} ≥ 1, and {W_{i,n}}_{i=1}^n being independent of {Z_i}_{i=1}^n by Assumption 4.5(i). Hence, for any K > 0, (p(log(1+p))^{5/2}M³_{3,Ψ}/√n)^{1/3} ≤ b_n, (S.92), and (S.93) imply

    sup_{P∈P} E_P[ sup_{A∈B} E_P[ 1{L_n(P) ∈ A} − 1{ ∑_{i=1}^n Ḡ_{i,n}(P) ∈ A^{Kb_n} } | {Z_i, W_{i,n}}_{i=1}^n ] ]
        ≲ min_{t≥0} ( exp{−t²/(C log(1+p))} + t²/(K³ log(1+p)) ) ≤ exp{−K²/C} + 1/K,   (S.94)

where the final inequality follows by setting t = K√(log(1+p)). Theorem 4 in Monrad and Philipp (1991) and result (S.94) then imply that there exists a Ḡ_n(P) such that

    ‖L_n(P) − Ḡ_n(P)‖_∞ = O_P(b_n)

uniformly in P ∈ P, and its distribution conditional on {Z_i, W_{i,n}}_{i=1}^n is given by

    Ḡ_n(P) ∼ N(0, ∑_{i=1}^n Var{ξ_{i,n}(P) | {Z_i, W_{i,n}}_{i=1}^n}).   (S.95)

Step 4: (Remove Dependence). We next couple Ḡ_n(P) to a Gaussian vector G̃^⋆_n(P) that is independent of {Z_i, W_{i,n}}_{i=1}^n. To this end, we first note result (S.84) implies

    Λ̂_n(P) ≡ ∑_{i=1}^n Var{ξ_{i,n}(P) | {Z_i, W_{i,n}}_{i=1}^n} = (σ̂²_n/n)∑_{i=1}^n ( φ(Z_i,P)φ(Z_i,P)′ − φ̄_n(P)φ̄_n(P)′ ).
Moreover, E_P[φ(Z,P)] = 0 and sup_{P∈P} max_{1≤j≤p} ‖φ_j(·,P)‖_{P,2} being bounded in n by the definition of φ and Assumptions 4.3(i)(ii), and ‖aa′‖_{o,2} = ‖a‖² for any a ∈ R^p imply

    sup_{P∈P} E_P[‖φ̄_n(P)φ̄_n(P)′‖_{o,2}] = sup_{P∈P} E_P[‖φ̄_n(P)‖²] = sup_{P∈P} ∑_{j=1}^p E_P[ ( (1/n)∑_{i=1}^n φ_j(Z_i,P) )² ] ≲ p/n.   (S.96)

Also, ‖φ(Z_i,P)‖² ≤ pΨ²(Z_i,P), Assumption 4.5(iv), and Jensen's inequality imply

    sup_{P∈P} E_P[max_{1≤i≤n} ‖φ(Z_i,P)‖²] ≲ sup_{P∈P} pE_P[max_{1≤i≤n} Ψ²(Z_i,P)] ≤ sup_{P∈P} p( E_P[max_{1≤i≤n} Ψ^q(Z_i,P)] )^{2/q}
        ≤ sup_{P∈P} p( nE_P[Ψ^q(Z_i,P)] )^{2/q} ≤ pn^{2/q}M²_{q,Ψ}.

Setting Λ(P) ≡ E_P[φ(Z,P)φ(Z,P)′], we then note that b_n = o(1), Lemma S.2.9, ‖Λ(P)‖_{o,2} being uniformly bounded in n and P ∈ P by Assumption 4.3(ii) and the definition of φ(Z,P), and Markov's inequality allow us to conclude that

    ‖(1/n)∑_{i=1}^n φ(Z_i,P)φ(Z_i,P)′ − Λ(P)‖_{o,2} = O_P( { p log(1+p) n^{2/q} M²_{q,Ψ}/n }^{1/2} )   (S.97)

uniformly in P ∈ P. Therefore, the triangle inequality, (S.96), (S.97), ‖Λ(P)‖_{o,2} being bounded in n and P ∈ P, and Assumption 4.5(iii) yield

    ‖Λ̂_n(P) − Λ(P)‖_{o,2} ≤ |σ̂²_n − 1|‖Λ(P)‖_{o,2} + O_P( { p log(1+p) n^{2/q} M²_{q,Ψ}/n }^{1/2} ) = O_P( { p log(1+p) n^{2/q} M²_{q,Ψ}/n }^{1/2} )

uniformly in P ∈ P. Hence, since the distribution of Ḡ_n(P) conditional on {Z_i, W_{i,n}}_{i=1}^n equals (S.95), we may apply Lemma S.2.7 with V_n = {Z_i, W_{i,n}}_{i=1}^n to conclude that there exists a G̃^⋆_n(P) ∼ N(0, Λ(P)) independent of {Z_i, W_{i,n}}_{i=1}^n with

    ‖Ḡ_n(P) − G̃^⋆_n(P)‖_∞ = O_P( ( p(log(1+p))³ n^{2/q} M²_{q,Ψ}/n )^{1/4} )

uniformly in P ∈ P.

Step 5: (Couple Ĝ_n).
Combining Steps 2, 3, and 4, we obtain that there exists a Gaussian vector G̃^⋆_n(P) that is independent of {Z_i, W_{i,n}}_{i=1}^n and satisfies

    ‖S_n(P) − G̃^⋆_n(P)‖_∞ = O_P(b_n)

uniformly in P ∈ P. Since, in particular, G̃^⋆_n(P) is independent of {Z_i}_{i=1}^n, the representation in Step 1 and Lemma 2.11 in Dudley and Philipp (1983) imply that there exists a (Ğ^{e⋆}_n(P)′, Ğ^{i⋆}_n(P)′)′ ≡ Ğ^⋆_n(P) ∼ N(0, Λ(P)) independent of {Z_i}_{i=1}^n and such that

    ‖(1/√n)∑_{i=1}^n (W_{i,n} − W̄_n)φ(Z_i,P) − Ğ^⋆_n(P)‖_∞ = O_P(b_n)   (S.98)

uniformly in P ∈ P. To conclude, set G^{j⋆}_n(P) ≡ Ω^j(P)Ğ^{j⋆}_n(P) for j ∈ {e,i} and G^⋆_n(P) ≡ (G^{e⋆}_n(P)′, G^{i⋆}_n(P)′)′. Then note that since Ω^j(P)(Ω^j(P))†ψ^j(Z,P) = ψ^j(Z,P) P-almost surely for j ∈ {e,i} by Assumption 4.4(i), it follows from Λ(P) ≡ E_P[φ(Z,P)φ(Z,P)′] and the definition of φ(Z,P) that G^⋆_n(P) ∼ N(0, Σ(P)) as desired. Furthermore, since Ğ^⋆_n(P) belongs to the range of Λ(P) almost surely by Theorem 3.6.1 in Bogachev (1998), it follows that Ğ^{j⋆}_n(P) = (Ω^j(P))†Ω^j(P)Ğ^{j⋆}_n(P) = (Ω^j(P))†G^{j⋆}_n(P) for j ∈ {e,i}. The lemma thus follows from (S.98), the definition of φ(Z,P), and Assumption 4.5(i).
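The conditional moment identities (S.84) and (S.85) that drive Step 2 can be checked numerically: when U is uniform on (0,1], the empirical quantile function τ_n of the centered weights returns each value W_{i,n} − W̄_n with probability 1/n, so τ_n(U) has conditional mean zero and conditional second moment σ̂²_n. The sketch below is purely illustrative and uses placeholder data (the arrays `phi` and `w` are stand-ins, not the paper's estimators).

```python
import numpy as np

# Illustrative check (not from the paper) of (S.84): with tau_n the empirical
# quantile function of the centered weights, xi_1 = (phi_1 - phibar) tau_n(U_1)/sqrt(n)
# is conditionally mean zero with covariance (sigma_hat^2/n)(phi_1 - phibar)(phi_1 - phibar)'.
rng = np.random.default_rng(0)
n, p = 200, 3
phi = rng.normal(size=(n, p))       # stand-in for phi(Z_i, P)
w = rng.exponential(size=n)         # stand-in bootstrap weights W_{i,n}
w_c = w - w.mean()                  # centered weights W_{i,n} - Wbar_n
sigma2_hat = np.mean(w_c ** 2)      # sigma_hat^2 in (S.84)
phibar = phi.mean(axis=0)

w_sorted = np.sort(w_c)
def tau_n(u):
    # empirical quantile function: inf{c : F_n(c) >= u}; each of the n
    # centered weights is returned with probability 1/n for U ~ Uniform(0,1]
    idx = np.ceil(u * n).astype(int) - 1
    return w_sorted[np.clip(idx, 0, n - 1)]

# Monte Carlo over U, conditioning on the simulated data above
draws = tau_n(rng.uniform(size=1_000_000))
mc_mean = draws.mean()              # approximates E[tau_n(U)] = 0
mc_second = np.mean(draws ** 2)     # approximates E[tau_n(U)^2] = sigma_hat^2
assert abs(mc_mean) < 1e-2
assert abs(mc_second - sigma2_hat) < 5e-2
# conditional covariance of xi_1 given the data, per (S.84)
cov_xi1 = mc_second / n * np.outer(phi[0] - phibar, phi[0] - phibar)
```

Averaging these rank-one conditional covariances over i recovers (S.85) coordinate by coordinate.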
Lemma S.2.6.
If Assumption 4.1 holds and $a_n/\sqrt{\log(1+p)} = o(1)$, then
\begin{align*}
\|(\hat\Omega^{\mathrm e}_n)^\dagger(\hat\Omega^{\mathrm e}_n - \Omega^{\mathrm e}(P))\|_{o,\infty} \vee \|(\hat\Omega^{\mathrm i}_n)^\dagger(\hat\Omega^{\mathrm i}_n - \Omega^{\mathrm i}(P))\|_{o,\infty} = O_P\big(a_n/\sqrt{\log(1+p)}\big)
\end{align*}
uniformly in $P \in \mathbf{P}$.

Proof: First note Assumption 4.1 and Lemma S.2.10 imply $\Omega^{\mathrm e}(P)(\Omega^{\mathrm e}(P))^\dagger\hat\Omega^{\mathrm e}_n = \hat\Omega^{\mathrm e}_n$ and $(\hat\Omega^{\mathrm e}_n)^\dagger\hat\Omega^{\mathrm e}_n(\Omega^{\mathrm e}(P))^\dagger = (\Omega^{\mathrm e}(P))^\dagger$ with probability tending to one uniformly in $P \in \mathbf{P}$. Since $\Omega^{\mathrm e}(P)(\Omega^{\mathrm e}(P))^\dagger\Omega^{\mathrm e}(P) = \Omega^{\mathrm e}(P)$ by Proposition 6.11.1(6) in Luenberger (1969), we thus obtain, with probability tending to one uniformly in $P \in \mathbf{P}$, that
\begin{align*}
\|(\hat\Omega^{\mathrm e}_n)^\dagger(\hat\Omega^{\mathrm e}_n - \Omega^{\mathrm e}(P))\|_{o,\infty}
 &= \|(\hat\Omega^{\mathrm e}_n)^\dagger\Omega^{\mathrm e}(P)(\Omega^{\mathrm e}(P))^\dagger(\hat\Omega^{\mathrm e}_n - \Omega^{\mathrm e}(P))\|_{o,\infty} \\
 &\le \|(\hat\Omega^{\mathrm e}_n)^\dagger(\Omega^{\mathrm e}(P) - \hat\Omega^{\mathrm e}_n)\|_{o,\infty} \times o_P(1) + \|(\hat\Omega^{\mathrm e}_n)^\dagger\hat\Omega^{\mathrm e}_n(\Omega^{\mathrm e}(P))^\dagger(\hat\Omega^{\mathrm e}_n - \Omega^{\mathrm e}(P))\|_{o,\infty} \\
 &= \|(\hat\Omega^{\mathrm e}_n)^\dagger(\Omega^{\mathrm e}(P) - \hat\Omega^{\mathrm e}_n)\|_{o,\infty} \times o_P(1) + \|(\Omega^{\mathrm e}(P))^\dagger(\hat\Omega^{\mathrm e}_n - \Omega^{\mathrm e}(P))\|_{o,\infty}, \tag{S.99}
\end{align*}
where the inequality holds due to $\|(\Omega^{\mathrm e}(P))^\dagger(\hat\Omega^{\mathrm e}_n - \Omega^{\mathrm e}(P))\|_{o,\infty} = O_P(a_n/\sqrt{\log(1+p)})$ uniformly in $P \in \mathbf{P}$ by Assumption 4.1(ii) and $a_n/\sqrt{\log(1+p)} = o(1)$ by hypothesis. Since $\|(\Omega^{\mathrm e}(P))^\dagger(\hat\Omega^{\mathrm e}_n - \Omega^{\mathrm e}(P))\|_{o,\infty} = O_P(a_n/\sqrt{\log(1+p)})$ uniformly in $P \in \mathbf{P}$ by Assumption 4.1(ii), result (S.99) implies $\|(\hat\Omega^{\mathrm e}_n)^\dagger(\hat\Omega^{\mathrm e}_n - \Omega^{\mathrm e}(P))\|_{o,\infty} = O_P(a_n/\sqrt{\log(1+p)})$ uniformly in $P \in \mathbf{P}$. The claim $\|(\hat\Omega^{\mathrm i}_n)^\dagger(\hat\Omega^{\mathrm i}_n - \Omega^{\mathrm i}(P))\|_{o,\infty} = O_P(a_n/\sqrt{\log(1+p)})$ uniformly in $P \in \mathbf{P}$ can be established by identical arguments.

Lemma S.2.7.
Let $\{V_n\}_{n=1}^\infty$ be random variables with distribution parametrized by $P \in \mathbf{P}$ and $\bar G_n(P) \in \mathbf{R}^{d_n}$ be such that $\bar G_n(P) \sim N(0, \hat\Sigma_n(P))$ conditionally on $V_n$. If there exist non-random matrices $\Sigma_n(P)$ such that $\|\hat\Sigma_n(P) - \Sigma_n(P)\|_{o,2} = O_P(\delta_n)$ uniformly in $P \in \mathbf{P}$, then there exists a Gaussian $G^\star_n(P) \sim N(0, \Sigma_n(P))$ independent of $V_n$ and satisfying $\|\bar G_n(P) - G^\star_n(P)\|_\infty = O_P(\sqrt{\log(1+d_n)\delta_n})$ uniformly in $P \in \mathbf{P}$.

Proof: Let $\{\hat\nu_j(P)\}_{j=1}^{d_n}$ and $\{\hat\lambda_j(P)\}_{j=1}^{d_n}$ denote the unit length eigenvectors and corresponding eigenvalues of $\hat\Sigma_n(P)$. Further letting $N_{d_n}$ be independent of $(V_n, \bar G_n(P))$ and distributed according to $N_{d_n} \sim N(0, I_{d_n})$, we then define $Z_n(P) \in \mathbf{R}^{d_n}$ to be given by
\begin{align*}
Z_n(P) \equiv \sum_{j:\hat\lambda_j(P) \ne 0} \frac{\hat\nu_j(P)\hat\nu_j(P)'\bar G_n(P)}{\hat\lambda^{1/2}_j(P)} + \sum_{j:\hat\lambda_j(P) = 0} \hat\nu_j(P)(\hat\nu_j(P)'N_{d_n}).
\end{align*}
Since $N_{d_n}$ is independent of $V_n$ and $\bar G_n(P)$ is Gaussian conditional on $V_n$, it follows that $Z_n(P)$ is Gaussian conditional on $V_n$ as well. Moreover, we have
\begin{align*}
E[Z_n(P)Z_n(P)'|V_n] = \sum_{j=1}^{d_n} \hat\nu_j(P)\hat\nu_j(P)' = I_{d_n}
\end{align*}
by direct calculation, and hence we conclude that $Z_n(P) \sim N(0, I_{d_n})$ and is independent of $V_n$. Next, we note that Theorem 3.6.1 in Bogachev (1998) implies that $\bar G_n(P)$ belongs to the range of $\hat\Sigma_n(P): \mathbf{R}^{d_n} \to \mathbf{R}^{d_n}$ almost surely. Thus, since $\{\hat\nu_j(P): \hat\lambda_j(P) \ne 0\}$ is an orthonormal basis for the range of $\hat\Sigma_n(P)$, we obtain that almost surely
\begin{align*}
\hat\Sigma^{1/2}_n(P)Z_n(P) = \sum_{j:\hat\lambda_j(P) \ne 0} \hat\nu_j(P)(\hat\nu_j(P)'\bar G_n(P)) = \bar G_n(P). \tag{S.100}
\end{align*}
Employing that $Z_n(P)$ is independent of $V_n$, we then define the desired $G^\star_n(P)$ by
\begin{align*}
G^\star_n(P) \equiv \Sigma^{1/2}_n(P)Z_n(P). \tag{S.101}
\end{align*}
Next, set $\hat\Delta_n(P) \equiv \hat\Sigma^{1/2}_n(P) - \Sigma^{1/2}_n(P)$ and let $\hat\Delta^{(j,k)}_n(P)$ denote its $(j,k)$ entry. Note (S.100), (S.101), Lemma S.2.8, and $\sup_{\|v\|_2 = 1}\langle v, a\rangle = \|a\|_2$ for any vector $a \in \mathbf{R}^{d_n}$ yield
\begin{align*}
E[\|\bar G_n(P) - G^\star_n(P)\|_\infty|V_n]
 &\lesssim \sqrt{\log(1+d_n)}\max_{1 \le j \le d_n}\Big(\sum_{k=1}^{d_n}(\hat\Delta^{(j,k)}_n(P))^2\Big)^{1/2} \\
 &= \sqrt{\log(1+d_n)}\sup_{\|v\|_2 = 1}\|\hat\Delta_n(P)v\|_\infty \le \sqrt{\log(1+d_n)}\,\|\hat\Delta_n(P)\|_{o,2}, \tag{S.102}
\end{align*}
where $\|\hat\Delta_n(P)\|_{o,2}$ denotes the operator norm of $\hat\Delta_n(P): \mathbf{R}^{d_n} \to \mathbf{R}^{d_n}$ when $\mathbf{R}^{d_n}$ is endowed with the norm $\|\cdot\|_2$, and the final inequality follows from $\|\cdot\|_\infty \le \|\cdot\|_2$. Moreover, Theorem X.1.1 in Bhatia (1997) further implies that
\begin{align*}
\|\hat\Delta_n(P)\|_{o,2} \le \|\hat\Sigma_n(P) - \Sigma_n(P)\|^{1/2}_{o,2} = O_P(\delta^{1/2}_n), \tag{S.103}
\end{align*}
where the equality holds uniformly in $P \in \mathbf{P}$ by hypothesis. Therefore, Fubini's theorem, Markov's inequality, and result (S.102) allow us to conclude for any $C > 0$ that
\begin{align*}
\sup_{P \in \mathbf{P}} P\Big(\|\bar G_n(P) - G^\star_n(P)\|_\infty > C^2\sqrt{\log(1+d_n)\delta_n} \text{ and } \|\hat\Delta_n(P)\|_{o,2} \le C\sqrt{\delta_n}\Big) \\
 \le \sup_{P \in \mathbf{P}} E_P\Big[\frac{\|\hat\Delta_n(P)\|_{o,2}}{C^2\sqrt{\delta_n}}1\{\|\hat\Delta_n(P)\|_{o,2} \le C\sqrt{\delta_n}\}\Big] \le \frac{1}{C}. \tag{S.104}
\end{align*}
The claim of the lemma then follows from results (S.103) and (S.104).
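The eigendecomposition construction in the proof of Lemma S.2.7 is easy to check numerically. The sketch below (a minimal illustration with a hypothetical rank-deficient $\hat\Sigma_n$; all names are ours, not the paper's) builds $Z_n(P)$ exactly as above, completing the null space with an independent standard normal draw, and verifies the identity (S.100):

```python
import numpy as np

rng = np.random.default_rng(0)
d_n = 4

# Hypothetical rank-deficient conditional covariance Sigma_hat (rank 2).
B = rng.standard_normal((d_n, 2))
Sigma_hat = B @ B.T
lam, nu = np.linalg.eigh(Sigma_hat)
lam = np.where(lam < 1e-10, 0.0, lam)          # zero out numerically null eigenvalues
sqrt_Sigma_hat = nu @ np.diag(np.sqrt(lam)) @ nu.T

# Draw G_bar ~ N(0, Sigma_hat); it lies in the range of Sigma_hat a.s.
G_bar = sqrt_Sigma_hat @ rng.standard_normal(d_n)

# The construction in the proof: rescale on the range of Sigma_hat and fill
# its null space with an independent N(0, I) draw N_d.
N_d = rng.standard_normal(d_n)
Z = np.zeros(d_n)
for j in range(d_n):
    v = nu[:, j]
    if lam[j] > 0:
        Z += v * (v @ G_bar) / np.sqrt(lam[j])
    else:
        Z += v * (v @ N_d)

# Verify (S.100): Sigma_hat^{1/2} Z recovers G_bar.
print(np.allclose(sqrt_Sigma_hat @ Z, G_bar))  # True
```

Setting $G^\star_n \equiv \Sigma^{1/2}_n Z$ for the target covariance then yields the coupled Gaussian in (S.101).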
Lemma S.2.8.
Let $Z = (Z_1, \ldots, Z_p)' \in \mathbf{R}^p$ be jointly Gaussian with $E[Z_j] = 0$ and $E[Z_j^2] \le \sigma^2$ for all $1 \le j \le p$. Then, there is a universal $K < \infty$ such that for any $q \ge 1$
\begin{align*}
E[\|Z\|^q_\infty] \le \Big(\frac{q!\sqrt{\log(1+p)}\,\sigma K}{\sqrt{\log(2)}}\Big)^q.
\end{align*}

Proof: The result is well known and stated here for completeness and ease of reference. Define the function $\psi_2: \mathbf{R}_+ \to \mathbf{R}$ to equal $\psi_2(u) = \exp\{u^2\} - 1$, and recall that for any random variable $V \in \mathbf{R}$ its Orlicz norm $\|V\|_{\psi_2}$ is given by $\|V\|_{\psi_2} \equiv \inf\{C > 0 : E[\psi_2(|V|/C)] \le 1\}$. Further note that for any $q \ge 1$ and random variable $V$ we have $(E[|V|^q])^{1/q} \le q!\|V\|_{\psi_2}/\sqrt{\log(2)}$; see, e.g., van der Vaart and Wellner (1996) pg. 95. Hence, Lemmas 2.2.1 and 2.2.2 in van der Vaart and Wellner (1996) imply that there exist finite constants $K_0$ and $K$ such that for all $q \ge 1$
\begin{align*}
E[\|Z\|^q_\infty] \le \Big(\frac{q!}{\sqrt{\log(2)}}\Big)^q\Big\|\max_{1 \le j \le p}|Z_j|\Big\|^q_{\psi_2}
 \le \Big(\frac{q!}{\sqrt{\log(2)}}\Big)^q\Big\{K_0\sqrt{\log(1+p)}\max_{1 \le j \le p}\|Z_j\|_{\psi_2}\Big\}^q
 \le \Big(\frac{q!\sqrt{\log(1+p)}\,\sigma K}{\sqrt{\log(2)}}\Big)^q.
\end{align*}
The claim of the lemma therefore follows.
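For intuition, the $\sqrt{\log(1+p)}$ growth in Lemma S.2.8 (its $q = 1$ case) can be seen in a small Monte Carlo exercise. The sketch below uses independent coordinates purely as a convenient example of the joint Gaussian case the lemma covers:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 1.0

# For mean-zero Gaussians with E[Z_j^2] <= sigma^2, E[max_j |Z_j|] grows at
# most like sqrt(log(1 + p)); the ratio below should stay bounded as p grows.
ratios = []
for p in (10, 100, 1000):
    Z = rng.standard_normal((5000, p)) * sigma          # 5000 Monte Carlo draws
    ratios.append(np.abs(Z).max(axis=1).mean() / np.sqrt(np.log(1 + p)))
print([round(r, 2) for r in ratios])
```

The printed ratios remain of order one across the three values of $p$, consistent with the lemma's bound.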
Lemma S.2.9.
Let $\{V_i\}_{i=1}^n$ be an i.i.d. sample with $V_i \in \mathbf{R}^k$ and $\Sigma \equiv E[VV']$. Then:
\begin{align*}
E\Big[\Big\|\frac{1}{n}\sum_{i=1}^n V_iV_i' - \Sigma\Big\|_{o,2}\Big] \le \max\{\|\Sigma\|^{1/2}_{o,2}\delta, \delta^2\},
\end{align*}
where $\delta \equiv D\sqrt{E[\max_{1 \le i \le n}\|V_i\|_2^2]\log(1+k)/n}$ for some universal constant $D$.

Proof: This is essentially Theorem E.1 in Kato (2013) if $k \ge 2$. Suppose $k = 1$. Then by Lemma 2.3.1 in van der Vaart and Wellner (1996) it follows that
\begin{align*}
E\Big[\Big\|\frac{1}{n}\sum_{i=1}^n V_iV_i' - \Sigma\Big\|_{o,2}\Big] \le 2E\Big[\Big|\frac{1}{n}\sum_{i=1}^n \epsilon_iV_i^2\Big|\Big], \tag{S.105}
\end{align*}
where $\{\epsilon_i\}_{i=1}^n$ are i.i.d. Rademacher random variables that are independent of $\{V_i\}_{i=1}^n$. For $\|\cdot\|_{\psi_2}$ the Orlicz norm induced by $\psi_2(u) = \exp\{u^2\} - 1$, it then follows from $E[|U|] \le \|U\|_{\psi_2}/\sqrt{\log(2)}$ for any random variable $U \in \mathbf{R}$ (see, e.g., van der Vaart and Wellner (1996) pg. 95) and Lemma 2.2.7 in van der Vaart and Wellner (1996) that
\begin{align*}
E\Big[\Big|\frac{1}{n}\sum_{i=1}^n \epsilon_iV_i^2\Big|\Big]
 = E\Big[E\Big[\Big|\frac{1}{n}\sum_{i=1}^n \epsilon_iV_i^2\Big|\,\Big|\,\{V_i\}_{i=1}^n\Big]\Big]
 \le \frac{\sqrt 6}{\sqrt{\log(2)}}E\Big[\Big\{\sum_{i=1}^n\Big(\frac{V_i^2}{n}\Big)^2\Big\}^{1/2}\Big]
 \le \frac{\sqrt 6}{\sqrt{\log(2)}}E\Big[\max_{1 \le i \le n}|V_i|\Big\{\sum_{i=1}^n\frac{V_i^2}{n^2}\Big\}^{1/2}\Big]. \tag{S.106}
\end{align*}
Therefore, the Cauchy-Schwarz inequality and result (S.106) allow us to conclude that
\begin{align*}
E\Big[\Big|\frac{1}{n}\sum_{i=1}^n \epsilon_iV_i^2\Big|\Big] \le \frac{\sqrt 6}{\sqrt{\log(2)}}\Big\{E\Big[\max_{1 \le i \le n}|V_i|^2\Big]\Big\}^{1/2}\Big\{\frac{E[V^2]}{n}\Big\}^{1/2},
\end{align*}
which together with (S.105) establishes the claim of the lemma.

Lemma S.2.10.
Let $\Omega_1$ and $\Omega_2$ be $k \times k$ symmetric matrices such that $\mathrm{range}\{\Omega_1\} = \mathrm{range}\{\Omega_2\}$. It then follows that $\Omega_2\Omega_2^\dagger\Omega_1 = \Omega_1$ and $\Omega_1^\dagger\Omega_1\Omega_2^\dagger = \Omega_2^\dagger$.

Proof: For any $k \times k$ matrix $M$, let $\mathcal{R}(M) \subseteq \mathbf{R}^k$ and $\mathcal{N}(M) \subseteq \mathbf{R}^k$ denote its range and null space. Also recall that for any vector subspace $V \subseteq \mathbf{R}^k$ we set $V^\perp \equiv \{s \in \mathbf{R}^k : \langle s, v\rangle = 0 \text{ for all } v \in V\}$. In order to establish the first claim of the lemma, let $s_1 \in \mathbf{R}^k$ be arbitrary and observe that since $\mathcal{R}(\Omega_1) = \mathcal{R}(\Omega_2)$ it follows that there exists an $s_2 \in \mathbf{R}^k$ such that $\Omega_1s_1 = \Omega_2s_2$. Therefore, Proposition 6.11.1(6) in Luenberger (1969) yields
\begin{align*}
\Omega_2\Omega_2^\dagger\Omega_1s_1 = \Omega_2\Omega_2^\dagger\Omega_2s_2 = \Omega_2s_2 = \Omega_1s_1.
\end{align*}
Hence, since $s_1 \in \mathbf{R}^k$ was arbitrary, it follows that $\Omega_2\Omega_2^\dagger\Omega_1 = \Omega_1$.

In order to establish the second claim of the lemma, first note that $\mathcal{R}(M^\dagger) = \mathcal{N}(M)^\perp$ for any $k \times k$ matrix $M$. Thus, since for $j \in \{1, 2\}$ we have $\mathcal{N}(\Omega_j)^\perp = \mathcal{R}(\Omega_j)$ due to $\Omega_j' = \Omega_j$ and Theorem 6.7.3(2) in Luenberger (1969), we can conclude that
\begin{align*}
\mathcal{R}(\Omega_2^\dagger) = \mathcal{N}(\Omega_2)^\perp = \mathcal{R}(\Omega_2) = \mathcal{R}(\Omega_1) = \mathcal{N}(\Omega_1)^\perp = \mathcal{R}(\Omega_1^\dagger),
\end{align*}
where the third equality holds by assumption. Letting $s_1 \in \mathbf{R}^k$ be arbitrary, it then follows that there exists an $s_2 \in \mathbf{R}^k$ for which $\Omega_2^\dagger s_1 = \Omega_1^\dagger s_2$, and thus
\begin{align*}
\Omega_1^\dagger\Omega_1\Omega_2^\dagger s_1 = \Omega_1^\dagger\Omega_1\Omega_1^\dagger s_2 = \Omega_1^\dagger s_2 = \Omega_2^\dagger s_1,
\end{align*}
where the second equality holds by Proposition 6.11.1(5) in Luenberger (1969). Since $s_1 \in \mathbf{R}^k$ was arbitrary, it follows that $\Omega_1^\dagger\Omega_1\Omega_2^\dagger = \Omega_2^\dagger$.

Lemma S.2.11.
Let $(Z_1, \ldots, Z_d)' \equiv Z \in \mathbf{R}^d$ be Gaussian with $E[Z_j] \ge 0$ and $\mathrm{Var}\{Z_j\} = \sigma^2 > 0$ for all $1 \le j \le d$, and define $S \equiv \max_{1 \le j \le d}Z_j$ and $\mathrm{m}_0 \equiv \mathrm{med}\{S\}$. Then, the distribution of $S$ is absolutely continuous and its density is bounded on $\mathbf{R}$ by $\frac{2}{\sigma}\max\{\frac{\mathrm{m}_0}{\sigma}, 1\}$.

Proof:
The result immediately follows from results in Chapter 11 of Davydov et al. (1998). First, let $F$ denote the c.d.f. of $S$ and note that Theorem 11.2 in Davydov et al. (1998) implies that $F$ is absolutely continuous with density $F'$ satisfying
\begin{align*}
F'(r) = q(r)\exp\Big\{-\frac{r^2}{2\sigma^2}\Big\}, \tag{S.107}
\end{align*}
where $q: \mathbf{R} \to \mathbf{R}_+$ is a nondecreasing function. Moreover, we can further conclude that
\begin{align*}
q(r)\int_r^\infty\exp\Big\{-\frac{u^2}{2\sigma^2}\Big\}du \le \int_r^\infty q(u)\exp\Big\{-\frac{u^2}{2\sigma^2}\Big\}du = P(S \ge r) \le 1, \tag{S.108}
\end{align*}
where the first inequality follows from $q: \mathbf{R} \to \mathbf{R}_+$ being nondecreasing and the equality follows from (S.107). Setting $\Phi$ and $\Phi'$ to denote the c.d.f. and density of a standard normal random variable respectively, then note that we may write
\begin{align*}
\int_r^\infty\exp\Big\{-\frac{u^2}{2\sigma^2}\Big\}du = \sqrt{2\pi}\int_r^\infty\Phi'(u/\sigma)du = \sqrt{2\pi}\sigma(1 - \Phi(r/\sigma)). \tag{S.109}
\end{align*}
Therefore, we can combine results (S.107), (S.108), and (S.109) to obtain the bound
\begin{align*}
F'(r) \le \frac{\exp\{-r^2/2\sigma^2\}}{\sqrt{2\pi}\sigma(1 - \Phi(r/\sigma))} = \frac{\Phi'(r/\sigma)}{\sigma(1 - \Phi(r/\sigma))} \le \frac{2}{\sigma}\max\Big\{\frac{r}{\sigma}, 1\Big\}, \tag{S.110}
\end{align*}
where the final result follows from Mill's inequality implying $\Phi'(r)/(1 - \Phi(r)) \le 2\max\{r, 1\}$ for all $r \in \mathbf{R}$ (see, e.g., pg. 64 in Chernozhukov et al. (2014)).

Next note that for any $\eta > 0$, the definitions of $S$ and $\mathrm{m}_0$, and the distribution of $S$ first order stochastically dominating that of $Z_j$ for any $1 \le j \le d$ imply that
\begin{align*}
P(S \le \mathrm{m}_0 + \eta) \ge P\Big(S \le \max_{1 \le j \le d}\mathrm{med}\{Z_j\} + \eta\Big) \ge P\Big(\max_{1 \le j \le d}(Z_j - E[Z_j]) \le \eta\Big) > 0, \tag{S.111}
\end{align*}
where the final inequality follows from $E[Z]$ belonging to the support of $Z$. Theorem 11.2 in Davydov et al. (1998) thus implies $q: \mathbf{R} \to \mathbf{R}_+$ is continuous at any $r > \mathrm{m}_0$, which together with (S.107) and the first fundamental theorem of calculus establishes $F$ is in fact differentiable at any $r > \mathrm{m}_0$ with derivative given by $F'$. Setting $\Gamma \equiv \Phi^{-1}\circ F$, then observe $F = \Phi\circ\Gamma$ and hence at any $r > \mathrm{m}_0$ we obtain
\begin{align*}
F'(r) = \Phi'(\Gamma(r))\Gamma'(r) \tag{S.112}
\end{align*}
for $\Gamma'$ the derivative of $\Gamma$. However, note that $\Gamma'$ is decreasing since $\Gamma$ is concave by Proposition 11.3 in Davydov et al. (1998), while $\Phi'(\Gamma(r))$ is decreasing on $[\mathrm{m}_0, +\infty)$ due to $\Phi'$ being decreasing on $[0, \infty)$ and $\Gamma(r) \in [0, \infty)$ for any $r > \mathrm{m}_0$. In particular, (S.112) implies that $F'$ is decreasing on $(\mathrm{m}_0, +\infty)$, which together with (S.110) yields
\begin{align*}
\sup_{r \in (\mathrm{m}_0, +\infty)}F'(r) = \limsup_{r\downarrow\mathrm{m}_0}F'(r) \le \limsup_{r\downarrow\mathrm{m}_0}\frac{\Phi'(r/\sigma)}{\sigma(1 - \Phi(r/\sigma))} = \frac{\Phi'(\mathrm{m}_0/\sigma)}{\sigma(1 - \Phi(\mathrm{m}_0/\sigma))}. \tag{S.113}
\end{align*}
Since result (S.110) implies $F'(r)$ is bounded by $2\max\{\mathrm{m}_0/\sigma, 1\}/\sigma$ on $(-\infty, \mathrm{m}_0]$ and result (S.113) implies the same bound applies on $(\mathrm{m}_0, +\infty)$, the claim of the lemma follows.

Lemma S.2.12.
Let $C \subseteq \mathbf{R}^k$ be a nonempty, closed, polyhedral set containing no lines, and let $\mathcal{E}$ denote its set of extreme points. Then: $\mathcal{E} \ne \emptyset$ and for any $y \in \mathbf{R}^k$ such that $\sup_{c \in C}\langle c, y\rangle < \infty$, it follows that $\sup_{c \in C}\langle c, y\rangle = \max_{c \in \mathcal{E}}\langle c, y\rangle$.

Proof: The claim that
$\mathcal{E} \ne \emptyset$ follows from Corollary 18.5.3 in Rockafellar (1970). Moreover, for $\mathcal{D}$ the set of extreme directions of $C$, Corollary 19.1.1 in Rockafellar (1970) implies both $\mathcal{E}$ and $\mathcal{D}$ are finite. Thus, writing $\mathcal{E} = \{a_j\}_{j=1}^m$ and $\mathcal{D} \equiv \{a_j\}_{j=m+1}^n$ (with $n = m$ when $\mathcal{D} = \emptyset$), Theorem 18.5 in Rockafellar (1970) yields the representation
\begin{align*}
C \equiv \Big\{c \in \mathbf{R}^k : c = \sum_{j=1}^n a_j\lambda_j \text{ s.t. } \sum_{j=1}^m\lambda_j = 1 \text{ and } \lambda_j \ge 0 \text{ for all } 1 \le j \le n\Big\}. \tag{S.114}
\end{align*}
Next note that if $\sup_{c \in C}\langle c, y\rangle$ is finite, then Corollary 5.3.7 in Borwein and Lewis (2010) implies that the supremum is attained. Hence, by result (S.114) we obtain
\begin{align*}
\sup_{c \in C}\langle c, y\rangle
 &= \max_{\{\lambda_j\}_{j=1}^n}\Big\langle y, \sum_{j=1}^n\lambda_ja_j\Big\rangle \text{ s.t. } \sum_{j=1}^m\lambda_j = 1,\ \lambda_j \ge 0 \text{ for all } 1 \le j \le n \\
 &= \max_{\{\lambda_j\}_{j=1}^m}\Big\langle y, \sum_{j=1}^m\lambda_ja_j\Big\rangle \text{ s.t. } \sum_{j=1}^m\lambda_j = 1,\ \lambda_j \ge 0 \text{ for all } 1 \le j \le m, \tag{S.115}
\end{align*}
where the second equality follows because $\sup_{c \in C}\langle c, y\rangle$ being finite implies we must have $\langle y, a_j\rangle \le 0$ for all $m + 1 \le j \le n$. Since $\mathcal{E} = \{a_j\}_{j=1}^m$ and the maximization in (S.115) is solved by setting $\lambda_{j^\star} = 1$ for some $1 \le j^\star \le m$, the claim of the lemma follows.

Lemma S.2.13.
Let $V^{\mathrm i}(P)$ be as defined in (20). Then, the set $(AA')^\dagger V^{\mathrm i}(P)$ is nonempty, closed, polyhedral, contains no lines, and zero is one of its extreme points.

Proof: First note that $0 \in (AA')^\dagger V^{\mathrm i}(P)$ and therefore $(AA')^\dagger V^{\mathrm i}(P)$ is nonempty. To show $(AA')^\dagger V^{\mathrm i}(P)$ is closed, suppose $\{v_j\}_{j=1}^\infty \subseteq (AA')^\dagger V^{\mathrm i}(P)$ and $\|v_j - v^\star\| = o(1)$ for some $v^\star \in \mathbf{R}^p$. Since $v_j \in (AA')^\dagger V^{\mathrm i}(P)$ it follows that there is an $s_j \in V^{\mathrm i}(P)$ such that $v_j = (AA')^\dagger s_j$. Next, let $\tilde s_j \equiv AA^\dagger s_j$ and note that
\begin{align*}
(AA')^\dagger\tilde s_j = (AA')^\dagger AA^\dagger s_j = (A')^\dagger A^\dagger s_j = (AA')^\dagger s_j \tag{S.116}
\end{align*}
since $(AA')^\dagger A = (A')^\dagger$ by Proposition 6.11.1(8) in Luenberger (1969) and $(A')^\dagger A^\dagger = (AA')^\dagger$ (see, e.g., Seber (2008) pg. 139). Moreover, note $A^\dagger\tilde s_j = A^\dagger AA^\dagger s_j = A^\dagger s_j$ by Proposition 6.11.1(5) in Luenberger (1969), while (S.116) implies $\|\Omega^{\mathrm i}(P)(AA')^\dagger\tilde s_j\| = \|\Omega^{\mathrm i}(P)(AA')^\dagger s_j\|$. Hence, if $s_j \in V^{\mathrm i}(P)$, then $\tilde s_j \in V^{\mathrm i}(P)$, and by (S.116) we have $(AA')^\dagger\tilde s_j = v_j$. Furthermore, by construction $\tilde s_j \in \mathrm{range}\{A\}$ and hence $(AA')(AA')^\dagger\tilde s_j = \tilde s_j$, which together with $(AA')^\dagger\tilde s_j = v_j$ implies $\tilde s_j = AA'v_j$. By continuity, it then follows from $\|v_j - v^\star\| = o(1)$ that $\|\tilde s_j - s^\star\| = o(1)$ for $s^\star = AA'v^\star$, and thus $s^\star \in V^{\mathrm i}(P)$ due to $V^{\mathrm i}(P)$ being closed. Furthermore, $v_j = (AA')^\dagger\tilde s_j$ yields
\begin{align*}
\|v^\star - (AA')^\dagger s^\star\| \le \lim_{j\to\infty}\|v_j - v^\star\| + \|(AA')^\dagger(\tilde s_j - s^\star)\| = 0 \tag{S.117}
\end{align*}
due to $\|v_j - v^\star\| = o(1)$ and $\|\tilde s_j - s^\star\| = o(1)$.
Since, as argued, $s^\star \in V^{\mathrm i}(P)$, we can conclude that $v^\star \in (AA')^\dagger V^{\mathrm i}(P)$ and hence that $(AA')^\dagger V^{\mathrm i}(P)$ is closed as desired.

The fact that $(AA')^\dagger V^{\mathrm i}(P)$ is polyhedral is immediate from the definition of $V^{\mathrm i}(P)$, and thus we next show $(AA')^\dagger V^{\mathrm i}(P)$ contains no lines. To this end, suppose $v \in (AA')^\dagger V^{\mathrm i}(P)$, which implies $v = (AA')^\dagger s$ for some $s \in V^{\mathrm i}(P)$. Since $A'(AA')^\dagger = A^\dagger$ by Proposition 6.11.1(9) in Luenberger (1969), we are able to conclude that
\begin{align*}
A'v = A'(AA')^\dagger s = A^\dagger s \le 0 \tag{S.118}
\end{align*}
for any $s \in V^{\mathrm i}(P)$. Similarly, if $-v \in (AA')^\dagger V^{\mathrm i}(P)$, then we must have $A'(-v) \le 0$, and hence $-v, v \in (AA')^\dagger V^{\mathrm i}(P)$ imply that $A'v = 0$. However, for $\mathcal{N}(A')^\perp$ the orthocomplement to the null space of $A'$, note that
\begin{align*}
v = (AA')^\dagger s = (A')^\dagger A^\dagger s \text{ implies that } v \in \mathcal{N}(A')^\perp. \tag{S.119}
\end{align*}
Since $v \in \mathcal{N}(A')^\perp$ and $A'v = 0$ imply $v = 0$, it follows that if $-v, v \in (AA')^\dagger V^{\mathrm i}(P)$, then $v = 0$ and hence $(AA')^\dagger V^{\mathrm i}(P)$ contains no lines as claimed.

Finally, to see that zero is an extreme point of $(AA')^\dagger V^{\mathrm i}(P)$, suppose that $0 = \lambda v_1 + (1 - \lambda)v_2$ for some $v_1, v_2 \in (AA')^\dagger V^{\mathrm i}(P)$ and $\lambda \in (0, 1)$. Since $A'v_1 \le 0$ and $A'v_2 \le 0$ for any $v_1, v_2 \in (AA')^\dagger V^{\mathrm i}(P)$, while $\lambda \in (0, 1)$ and $0 = A'(\lambda v_1 + (1 - \lambda)v_2)$, it then follows that $A'v_1 = A'v_2 = 0$. Therefore, result (S.119) holding for any $v \in (AA')^\dagger V^{\mathrm i}(P)$ implies $v_1 = v_2 = 0$, which verifies that zero is indeed an extreme point of $(AA')^\dagger V^{\mathrm i}(P)$.

S.3 Computational Details

In this appendix, we provide details on how we compute our test statistic, $T_n$, defined in (19), the restricted estimator $\hat\beta^{\mathrm r}_n$, defined in (24), and obtain a critical value. One computational theme that we found important in our simulations is that the pseudoinverse $A^\dagger$ can be poorly conditioned.
As we show below, however, it is possible to implement our procedure without ever needing to compute $A^\dagger$ explicitly.

First, we need to select a specific estimator $\hat x^\star_n$. In the mixed logit simulation in Section 5, the parameter $\beta(P)$ can be decomposed into $\beta(P) = (\beta_{\mathrm u}(P)', \beta_{\mathrm k}')'$, where $\beta_{\mathrm u}(P) \in \mathbf{R}^{p_{\mathrm u}}$ and $\beta_{\mathrm k} \in \mathbf{R}^{p_{\mathrm k}}$ is a known constant for all $P \in \mathbf{P}$. Similarly, we decompose any $b \in \mathbf{R}^p$ into $b = (b_{\mathrm u}', b_{\mathrm k}')'$ with $b_{\mathrm u} \in \mathbf{R}^{p_{\mathrm u}}$ and $b_{\mathrm k} \in \mathbf{R}^{p_{\mathrm k}}$, and partition the matrix $A$ into the corresponding submatrices $A_{\mathrm u}$ (of dimension $p_{\mathrm u} \times d$) and $A_{\mathrm k}$ (of dimension $p_{\mathrm k} \times d$). In our simulations, we then set $\hat x^\star_n$ to be a solution to the quadratic program
\begin{align*}
\min_{x \in \mathbf{R}^d}\big(\hat\beta_{{\mathrm u},n} - A_{\mathrm u}x\big)'\hat\Xi^{-1}_n\big(\hat\beta_{{\mathrm u},n} - A_{\mathrm u}x\big) \text{ s.t. } A_{\mathrm k}x = \beta_{\mathrm k}, \tag{S.120}
\end{align*}
where $\hat\beta_n = (\hat\beta_{{\mathrm u},n}', \beta_{\mathrm k}')'$ and $\hat\Xi_n$ is an estimate of the asymptotic variance matrix of $\hat\beta_{{\mathrm u},n}$. While the solution to (S.120) may not be unique, we note that any two minimizers $x_1$ and $x_2$ of (S.120) must satisfy $Ax_1 = Ax_2$. Since in our reformulations below $\hat x^\star_n$ only enters through $A\hat x^\star_n$, the specific choice of minimizer in (S.120) is immaterial.

Throughout, we let $\hat\Omega^{\mathrm e}_n$ be the sample standard deviation matrix of the entire vector $\hat\beta_n$. Note that, since $\hat\beta_n = (\hat\beta_{{\mathrm u},n}', \beta_{\mathrm k}')'$ and $\beta_{\mathrm k}$ is non-stochastic, $\hat\Omega^{\mathrm e}_n$ has the form
\begin{align*}
\hat\Omega^{\mathrm e}_n = \begin{bmatrix}\hat\Xi^{1/2}_n & 0 \\ 0 & 0\end{bmatrix}. \tag{S.121}
\end{align*}
We further let $\hat\Omega^{\mathrm i}_n$ be the sample standard deviation of $A\hat x^\star_n$, although this choice of studentization plays no special computational role in what follows.

Next, consider the first component of $T_n$ (see (19)), which we reproduce here as
\begin{align*}
T^{\mathrm e}_n \equiv \sup_{s \in \hat V^{\mathrm e}_n}\sqrt n\langle s, \hat\beta_n - A\hat x^\star_n\rangle \text{ where } \hat V^{\mathrm e}_n \equiv \{s \in \mathbf{R}^p : \|\hat\Omega^{\mathrm e}_ns\|_1 \le 1\}. \tag{S.122}
\end{align*}
As in the main text, the superscript "e" alludes to the relation to the "equality" condition in Theorem 3.1; i.e., this statistic is designed to detect violations of the requirement that $\beta(P)$ is an element of the range of $A$. As noted in the main text, $\hat\beta_n = A\hat x^\star_n$ and hence $T^{\mathrm e}_n = 0$ whenever $A$ is full rank and $d \ge p$. In other cases, we use the fact that $\hat x^\star_n$, as the solution to (S.120), must satisfy $A_{\mathrm k}\hat x^\star_n = \beta_{\mathrm k}$, and that our choice of $\hat\Omega^{\mathrm e}_n$ in (S.121) has $\hat\Xi^{1/2}_n$ as its upper left block. From these observations, we deduce that
\begin{align*}
T^{\mathrm e}_n = \sup_{s_{\mathrm u} \in \mathbf{R}^{p_{\mathrm u}}}\sqrt n\langle s_{\mathrm u}, \hat\beta_{{\mathrm u},n} - A_{\mathrm u}\hat x^\star_n\rangle \text{ s.t. } \|\hat\Xi^{1/2}_ns_{\mathrm u}\|_1 \le 1 = \big\|\sqrt n\,\hat\Xi^{-1/2}_n\big(\hat\beta_{{\mathrm u},n} - A_{\mathrm u}\hat x^\star_n\big)\big\|_\infty. \tag{S.123}
\end{align*}
Thus, $T^{\mathrm e}_n$ can be computed by simply taking the maximum of a vector of length $p_{\mathrm u}$.

The second component of $T_n$, defined in (19), is reproduced here as
\begin{align*}
T^{\mathrm i}_n \equiv \sup_{s \in \hat V^{\mathrm i}_n}\sqrt n\langle A^\dagger s, \hat x^\star_n\rangle \text{ where } \hat V^{\mathrm i}_n \equiv \{s \in \mathbf{R}^p : A^\dagger s \le 0,\ \|\hat\Omega^{\mathrm i}_n(AA')^\dagger s\|_1 \le 1\}, \tag{S.124}
\end{align*}
where the superscript "i" alludes to the relation to the "inequality" condition in Theorem 3.1; i.e., this statistic is designed to detect violations of the requirement that a positive solution to $Ax = \beta(P)$ exists. To compute $T^{\mathrm i}_n$ without explicitly using $A^\dagger$, we first note
\begin{align*}
A^\dagger = A'(AA')^\dagger; \tag{S.125}
\end{align*}
see, e.g., Proposition 6.11.1(9) in Luenberger (1969).
Then, we observe that
\begin{align*}
\mathrm{range}\{(AA')^\dagger\} = \mathrm{null}\{AA'\}^\perp = \mathrm{range}\{AA'\} = \mathrm{range}\{A\}. \tag{S.126}
\end{align*}
The first equality in (S.126) is a property of pseudoinverses; see, e.g., Luenberger (1969, pg. 164). The second equality is a standard result in linear algebra; see, e.g., Theorem 6.6.1 in Luenberger (1969). This result is also used in the third equality, which uses the following logic: if $t = As$ for some $s \in \mathbf{R}^p$, then also $t = As_1$, where $s_1 \in \mathrm{null}\{A\}^\perp = \mathrm{range}\{A'\}$ is determined from the orthogonal decomposition $s = s_1 + s_2$ with $s_2 \in \mathrm{null}\{A\}$, and hence $t \in \mathrm{range}\{AA'\}$, implying $\mathrm{range}\{A\} \subseteq \mathrm{range}\{AA'\}$. Since trivially $\mathrm{range}\{AA'\} \subseteq \mathrm{range}\{A\}$, the third equality follows. We thus obtain that
\begin{align*}
T^{\mathrm i}_n &= \sup_{s \in \mathbf{R}^p}\sqrt n\langle A'(AA')^\dagger s, \hat x^\star_n\rangle \text{ s.t. } A'(AA')^\dagger s \le 0,\ \|\hat\Omega^{\mathrm i}_n(AA')^\dagger s\|_1 \le 1, \\
 &= \sup_{x \in \mathbf{R}^d}\sqrt n\langle A'Ax, \hat x^\star_n\rangle \text{ s.t. } A'Ax \le 0,\ \|\hat\Omega^{\mathrm i}_nAx\|_1 \le 1, \\
 &= \sup_{x \in \mathbf{R}^d, s \in \mathbf{R}^p}\sqrt n\langle s, A\hat x^\star_n\rangle \text{ s.t. } Ax = s,\ A's \le 0,\ \|\hat\Omega^{\mathrm i}_ns\|_1 \le 1, \tag{S.127}
\end{align*}
where the first equality follows from (S.125), the second from (S.126), and in the third we substituted $s = Ax$. The final program in (S.127) can be written explicitly as a linear program by introducing non-negative slack variables, so that
\begin{align*}
T^{\mathrm i}_n = \sup_{x \in \mathbf{R}^d, s \in \mathbf{R}^p, \phi^+ \in \mathbf{R}^p_+, \phi^- \in \mathbf{R}^p_+}\sqrt n\langle s, A\hat x^\star_n\rangle \text{ s.t. } Ax = s,\ A's \le 0,\ \langle 1_p, \phi^+\rangle + \langle 1_p, \phi^-\rangle \le 1,\ \phi^+ - \phi^- = \hat\Omega^{\mathrm i}_ns, \tag{S.128}
\end{align*}
where $1_p \in \mathbf{R}^p$ is the vector with all coordinates equal to one. Note that if $d \ge p$ and $A$ has full rank, then the constraint $Ax = s$ is redundant since $Ax$ ranges across all of $\mathbf{R}^p$ as $x$ varies across $\mathbf{R}^d$. In these cases, the constraint $Ax = s$ together with the variable $x$ can be entirely removed from the linear program in (S.128).
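Both reductions above can be verified numerically. The sketch below is ours, not part of the paper's implementation: it uses hypothetical inputs ($A$, $\hat\Xi_n$, $\hat x^\star_n$, and $\hat\Omega^{\mathrm i}_n = I$ for simplicity), checks the closed form in (S.123) against a direct linear program over its constraint set, and checks that the slack-variable program (S.128) agrees with the middle formulation in (S.127), using `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, p, d, p_u = 100, 3, 2, 3

# Hypothetical inputs standing in for the objects in the text.
A = rng.standard_normal((p, d))
x_star = rng.standard_normal(d)
M = rng.standard_normal((p_u, p_u))
Xi = M @ M.T + np.eye(p_u)               # positive definite variance estimate
v = rng.standard_normal(p_u)             # stands in for sqrt(n)(beta_u_hat - A_u x_star)

# --- (S.123): sup of <s, v> over ||Xi^{1/2} s||_1 <= 1 is an l-infinity norm. ---
lam, Q = np.linalg.eigh(Xi)
Xi_sqrt = Q @ np.diag(np.sqrt(lam)) @ Q.T          # symmetric square root
closed_form = np.abs(Q @ np.diag(1 / np.sqrt(lam)) @ Q.T @ v).max()

# Direct LP with slacks u >= |Xi^{1/2} s|; variables (s, u).
c1 = np.concatenate([-v, np.zeros(p_u)])
A1 = np.block([[Xi_sqrt, -np.eye(p_u)],
               [-Xi_sqrt, -np.eye(p_u)],
               [np.zeros((1, p_u)), np.ones((1, p_u))]])
b1 = np.concatenate([np.zeros(2 * p_u), [1.0]])
r1 = linprog(c1, A_ub=A1, b_ub=b1, bounds=[(None, None)] * (2 * p_u))
print(np.isclose(-r1.fun, closed_form))  # True

# --- (S.128) versus the middle formulation in (S.127), with Omega_i = I. ---
obj = np.sqrt(n) * (A @ x_star)
nz = d + 3 * p                                     # variables (x, s, phi+, phi-)
c2 = np.zeros(nz); c2[d:d + p] = -obj
A_eq = np.zeros((2 * p, nz))
A_eq[:p, :d] = A; A_eq[:p, d:d + p] = -np.eye(p)   # Ax = s
A_eq[p:, d:d + p] = np.eye(p)                      # s ...
A_eq[p:, d + p:d + 2 * p] = -np.eye(p); A_eq[p:, d + 2 * p:] = np.eye(p)  # ... = phi+ - phi-
A_ub = np.zeros((d + 1, nz))
A_ub[:d, d:d + p] = A.T                            # A's <= 0
A_ub[d, d + p:] = 1.0                              # <1, phi+> + <1, phi-> <= 1
b_eq = np.zeros(2 * p); b_ub = np.zeros(d + 1); b_ub[d] = 1.0
bnd = [(None, None)] * (d + p) + [(0, None)] * (2 * p)
r2 = linprog(c2, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bnd)

c3 = np.concatenate([-np.sqrt(n) * (A.T @ A @ x_star), np.zeros(p)])
A3 = np.block([[A.T @ A, np.zeros((d, p))],
               [A, -np.eye(p)],
               [-A, -np.eye(p)],
               [np.zeros((1, d)), np.ones((1, p))]])
b3 = np.concatenate([np.zeros(d + 2 * p), [1.0]])
r3 = linprog(c3, A_ub=A3, b_ub=b3, bounds=[(None, None)] * d + [(0, None)] * p)
print(np.isclose(-r2.fun, -r3.fun))  # True
```

Note that neither linear program ever forms $A^\dagger$, which is the point of the reformulations.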
Taking the maximum of (S.123) and (S.128) yields our test statistic $T_n$.

Turning to our bootstrap procedure, we first show how to solve (24) to find $\hat\beta^{\mathrm r}_n$. The optimization problem to solve is here reproduced as:
\begin{align*}
\min_{x \in \mathbf{R}^d_+,\ b = (b_{\mathrm u}', b_{\mathrm k}')'}\Big[\sup_{s \in \hat V^{\mathrm i}_n}\sqrt n\langle A^\dagger s, \hat x^\star_n - A^\dagger b\rangle\Big] \text{ s.t. } b_{\mathrm k} = \beta_{\mathrm k},\ Ax = b. \tag{S.129}
\end{align*}
We first observe that the inner problem has the same structure as (S.124), but with $\hat x^\star_n$ replaced by $\hat x^\star_n - A^\dagger b$, where $b$ is a fixed variable of optimization from the outer problem. Applying the same logic employed in (S.127) to this inner problem yields
\begin{align*}
\sup_{x \in \mathbf{R}^d}\sqrt n\langle A'Ax, \hat x^\star_n - A^\dagger b\rangle \text{ s.t. } A'Ax \le 0,\ \|\hat\Omega^{\mathrm i}_nAx\|_1 \le 1. \tag{S.130}
\end{align*}
Introducing slack variables as in (S.128) turns (S.130) into a linear program. The dual of the resulting linear program can be shown to be given by
\begin{align*}
\inf_{\phi_1 \in \mathbf{R}_+,\ \phi_p \in \mathbf{R}^p,\ \phi_d \in \mathbf{R}^d_+}\phi_1 \text{ s.t. } 1_p\phi_1 - \phi_p \ge 0,\ 1_p\phi_1 + \phi_p \ge 0,\ -A'\hat\Omega^{\mathrm i}_n\phi_p + A'A\phi_d = \sqrt nA'A(\hat x^\star_n - A^\dagger b). \tag{S.131}
\end{align*}
Next, let $V \equiv \mathrm{range}\{AA'\}$ and note that since $A^\dagger = A'(AA')^\dagger$ by Proposition 6.11.1(8) in Luenberger (1969), it follows that $A'AA^\dagger b = A'AA'(AA')^\dagger b = A'\Pi_Vb$. However, by (S.126), $V \equiv \mathrm{range}\{AA'\} = \mathrm{range}\{A\} = \mathrm{null}\{A'\}^\perp$, where the final equality follows by Theorem 6.6.1 in Luenberger (1969). Hence, $A'\Pi_Vb = A'b$ and (S.131) equals
\begin{align*}
\inf_{\phi_1 \in \mathbf{R}_+,\ \phi_p \in \mathbf{R}^p,\ \phi_d \in \mathbf{R}^d_+}\phi_1 \text{ s.t. } 1_p\phi_1 - \phi_p \ge 0,\ 1_p\phi_1 + \phi_p \ge 0,\ -A'\hat\Omega^{\mathrm i}_n\phi_p + A'A\phi_d = \sqrt nA'(A\hat x^\star_n - b). \tag{S.132}
\end{align*}
Substituting (S.132) back into the inner problem in (S.129) then yields a single linear program that determines $\hat\beta^{\mathrm r}_n$. Given $\hat\beta^{\mathrm r}_n$ it is then straightforward to compute our bootstrap statistic.
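As a numerical sanity check on the duality step, the following sketch (ours, with hypothetical $A$, $\hat x^\star_n$, $b$, and $\hat\Omega^{\mathrm i}_n = I$; the sign convention for $\phi_p$ is immaterial because the remaining constraints are symmetric in $\pm\phi_p$) solves the slack-variable version of (S.130) and the dual (S.132) and confirms their optimal values coincide. Note the dual's right-hand side $\sqrt nA'(A\hat x^\star_n - b)$ avoids $A^\dagger$ entirely:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
n, p, d = 100, 3, 2

# Hypothetical inputs: A (p x d, full column rank a.s.), x_star, b; Omega_i = I.
A = rng.standard_normal((p, d))
x_star = rng.standard_normal(d)
b = rng.standard_normal(p)

# Primal (S.130) with slacks u >= |A x|; variables (x, u).
g = np.sqrt(n) * (A.T @ A @ (x_star - np.linalg.pinv(A) @ b))
c_p = np.concatenate([-g, np.zeros(p)])
A_ub_p = np.block([[A.T @ A, np.zeros((d, p))],   # A'Ax <= 0
                   [A, -np.eye(p)],               # Ax - u <= 0
                   [-A, -np.eye(p)],              # -Ax - u <= 0
                   [np.zeros((1, d)), np.ones((1, p))]])  # sum(u) <= 1
b_ub_p = np.concatenate([np.zeros(d + 2 * p), [1.0]])
primal = linprog(c_p, A_ub=A_ub_p, b_ub=b_ub_p, bounds=[(None, None)] * (d + p))
primal_value = -primal.fun

# Dual (S.132); variables (phi_1, phi_p, phi_d), with no pseudoinverse needed.
rhs = np.sqrt(n) * (A.T @ (A @ x_star - b))
c_d = np.zeros(1 + p + d); c_d[0] = 1.0
A_ub_d = np.block([[-np.ones((p, 1)), np.eye(p), np.zeros((p, d))],   # phi_p <= phi_1
                   [-np.ones((p, 1)), -np.eye(p), np.zeros((p, d))]]) # -phi_p <= phi_1
A_eq_d = np.block([[np.zeros((d, 1)), -A.T, A.T @ A]])
dual = linprog(c_d, A_ub=A_ub_d, b_ub=np.zeros(2 * p), A_eq=A_eq_d, b_eq=rhs,
               bounds=[(0, None)] + [(None, None)] * p + [(0, None)] * d)
print(np.isclose(primal_value, dual.fun))  # True
```

By strong duality for feasible bounded linear programs, the two values must agree, which is what the check confirms.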
For instance, in the simulations in Section 5, we let
\begin{align*}
\hat G^{\mathrm e}_n = \sqrt n\{(\hat\beta_{b,n} - A\hat x^\star_{b,n}) - (\hat\beta_n - A\hat x^\star_n)\} \qquad \hat G^{\mathrm i}_n = \sqrt nA(\hat x^\star_{b,n} - \hat x^\star_n),
\end{align*}
where $\hat\beta_{b,n}$ and $\hat x^\star_{b,n}$ are nonparametric bootstrap analogues to $\hat\beta_n$ and $\hat x^\star_n$. Arguing as in result (S.123) it is then straightforward to show that
\begin{align*}
\sup_{s \in \hat V^{\mathrm e}_n}\langle s, \hat G^{\mathrm e}_n\rangle = \|\hat\Xi^{-1/2}_n\hat G^{\mathrm e}_{{\mathrm u},n}\|_\infty \tag{S.133}
\end{align*}
for $\hat G^{\mathrm e}_{{\mathrm u},n}$ the first $p_{\mathrm u}$ coordinates of $\hat G^{\mathrm e}_n$. In analogy to (S.123), we note that (S.133) equals zero whenever $A$ is full rank and $d \ge p$. Next, we may employ the same arguments as in (S.127) and (S.128), noting $AA^\dagger\hat G^{\mathrm i}_n = \hat G^{\mathrm i}_n$ due to $AA^\dagger A = A$ by Proposition 6.11.1(6) in Luenberger (1969), to obtain
\begin{align*}
\sup_{s \in \hat V^{\mathrm i}_n}\langle A^\dagger s, A^\dagger(\hat G^{\mathrm i}_n + \sqrt n\lambda_n\hat\beta^{\mathrm r}_n)\rangle = \sup_{x \in \mathbf{R}^d, s \in \mathbf{R}^p, \phi^+ \in \mathbf{R}^p_+, \phi^- \in \mathbf{R}^p_+}\langle s, \hat G^{\mathrm i}_n + \sqrt n\lambda_n\hat\beta^{\mathrm r}_n\rangle \text{ s.t. } Ax = s,\ A's \le 0,\ \langle 1_p, \phi^+\rangle + \langle 1_p, \phi^-\rangle \le 1,\ \phi^+ - \phi^- = \hat\Omega^{\mathrm i}_ns. \tag{S.134}
\end{align*}
As in (S.128), we note that if $A$ is full rank and $d \ge p$, then the constraint $Ax = s$ and the variable $x$ may be dropped from the linear program in (S.134). The critical value is then obtained by computing the $1 - \alpha$ quantile of the maximum of (S.133) and (S.134) across bootstrap iterations. Finally, we note the same arguments also show that the problem (36) used to determine $\lambda^b_n$ is equivalent to (S.128) with $A\hat x^\star_n$ replaced by $\hat G^{\mathrm i}_n$.

References
Aliprantis, C. D. and Border, K. C. (2006). Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer-Verlag, Berlin.

Belloni, A., Chernozhukov, V., Chetverikov, D. and Fernández-Val, I. (2019). Conditional quantile processes based on series or many regressors. Journal of Econometrics.

Bhatia, R. (1997). Matrix Analysis. Springer, New York.

Bogachev, V. I. (1998). Gaussian Measures. Vol. 62. American Mathematical Society.

Borwein, J. and Lewis, A. S. (2010). Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer Science & Business Media.

Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields.

Chernozhukov, V., Lee, S. and Rosen, A. M. (2013). Intersection bounds: Estimation and inference. Econometrica.

Davydov, Y. A., Lifshits, M. A. and Smorodina, N. V. (1998). Local Properties of Distributions of Stochastic Functionals. American Mathematical Society, Providence.

Dudley, R. and Philipp, W. (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete.

Hájek, J. (1961). Some extensions of the Wald-Wolfowitz-Noether theorem. The Annals of Mathematical Statistics.

Kato, K. (2013). Quasi-Bayesian analysis of nonparametric instrumental variables models. The Annals of Statistics.

Luenberger, D. G. (1969). Optimization by Vector Space Methods. Wiley, New York.

Monrad, D. and Philipp, W. (1991). Nearby variables with nearby conditional laws and a strong approximation theorem for Hilbert space valued martingales. Probability Theory and Related Fields.

Pollard, D. (2002). A User's Guide to Measure Theoretic Probability. Vol. 8. Cambridge University Press.

Rockafellar, R. T. (1970). Convex Analysis. Vol. 28. Princeton University Press.

Seber, G. A. (2008). A Matrix Handbook for Statisticians. Vol. 15. John Wiley & Sons.

van der Vaart, A. (1999). Asymptotic Statistics. Cambridge University Press, New York.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: with Applications to Statistics. Springer, New York.