Inference for high-dimensional exchangeable arrays
HAROLD D. CHIANG, KENGO KATO, AND YUYA SASAKI
Abstract.
We consider inference for high-dimensional exchangeable arrays where the dimension may be much larger than the cluster sizes. Specifically, we consider separately and jointly exchangeable arrays that correspond to multiway clustered and polyadic data, respectively. Such exchangeable arrays have seen a surge of applications in empirical economics. However, both exchangeability concepts induce highly complicated dependence structures, which poses a significant challenge for inference in high dimensions. In this paper, we first derive high-dimensional central limit theorems (CLTs) over the rectangles for the exchangeable arrays. Building on the high-dimensional CLTs, we develop novel multiplier bootstraps for the exchangeable arrays and derive their finite sample error bounds in high dimensions. The derivations of these theoretical results rely on new technical tools such as Hoeffding-type decompositions and maximal inequalities for the degenerate components in the Hoeffding-type decompositions for the exchangeable arrays. We illustrate applications of our bootstrap methods to robust inference in demand analysis, robust inference in extended gravity analysis, and penalty choice for $\ell_1$-penalized regression under multiway cluster sampling.

1. Introduction
In empirical studies in economics, we often employ data on volumes and attributes of flows of resources and commodities that are affected by supply shocks from the origin of the flow and demand shocks from the destination of the flow. Although supply and demand shocks are essential in economic analysis, a proper treatment of data generated by these shocks requires non-standard econometric methods due to the two-dimensional clustered dependence induced by these shocks. When the set of agents generating the supply and the set of agents generating the demand are different, the data is two-way clustered. Leading examples are market share data that is two-way clustered by products and markets, where shares of a product are dependent across markets due to a common supply shock by the identical producer, and shares of multiple products within a market are dependent due to a common demand shock by consumers in the identical market. When the set of agents generating the supply and the set of agents generating the demand are the same, the data is dyadic. Leading examples are international trade data, where volumes of exports from an exporter are dependent across importers due to a common supply shock, and volumes of imports to an importer are dependent across exporters due to a common demand shock. Both of these types of data naturally entail complex dependence structures through common supply shocks by agents from an identical origin on agents across multiple destinations and common demand shocks by agents from an identical destination on agents across multiple origins. As such, standard microeconometric methods that presume cross-sectional random sampling are not applicable to either of these two types of data.
Date: First arXiv version: September 10, 2020. This version: September 13, 2020.
Key words and phrases. Bootstrap, exchangeable array, polyadic data, multiway clustering.

H.D. Chiang is supported by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation. K. Kato is partially supported by NSF grants DMS-1952306 and DMS-2014636.

Starting with the seminal papers by Fafchamps and Gubert (2007) for dyadic data and Cameron et al. (2011) for multiway clustering, the recent econometrics literature develops methods and theories of how to deal with these types of dependent data; see below for a more comprehensive literature review. The existing literature, however, does not cover a method of high-dimensional inference, even though a number of robust identification strategies for structural economic models entail high-dimensionality in inference; see the next paragraph for examples. In this light, we develop a method of high-dimensional inference under general multiway clustering and polyadic sampling in this paper. For two-way clustered data $\{(X_{1ij},\dots,X_{pij})^T : 1\le i\le N_1,\ 1\le j\le N_2\}$ of random vectors with high dimension $p \gg \min\{N_1, N_2\}$, we develop a method and theory for bootstrap approximation of the distribution of the sample mean $N_1^{-1}N_2^{-1}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}(X_{1ij},\dots,X_{pij})^T$. Similarly, for dyadic data $\{(X_{1ij},\dots,X_{pij})^T : 1\le i,j\le n,\ i\ne j\}$ of random vectors with high dimension $p\gg n$, we develop a method and theory for bootstrap approximation of the distribution of the sample mean $n^{-1}(n-1)^{-1}\sum_{i=1}^{n}\sum_{j\ne i}(X_{1ij},\dots,X_{pij})^T$. We also generalize our results for these cases of two-way clustering and dyadic data to the cases of general multiway clustering and polyadic data, respectively.

Our proposed method applies to a number of important robust identification approaches for structural economic models. For demand analysis with two-way clustered data consisting of $N_1$ products and $N_2$ markets, Gandhi et al.
(2020) derive many moment inequalities of the form
$$E\left[N_1^{-1}N_2^{-1}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}(X_{1ij}(\theta),\dots,X_{pij}(\theta))^T\right] \ge 0, \qquad (1.1)$$
where $(X_{1ij}(\theta),\dots,X_{pij}(\theta))^T$ denotes a $p$-dimensional vector-valued random function of structural parameters $\theta$. Similarly, for extended gravity analysis with two-way clustered data consisting of $N_1$ firms and $N_2$ countries, Morales et al. (2019) derive many moment inequalities of the form (1.1). With our theory of approximating the distribution of the sample mean $N_1^{-1}N_2^{-1}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}(X_{1ij}(\theta),\dots,X_{pij}(\theta))^T$, inverting the Kolmogorov-Smirnov test allows for inference about the structural parameters $\theta$ similarly to Chernozhukov et al. (2019a). See Sections 5.1 and 5.2 ahead for details of these two applications to demand analysis and extended gravity analysis, respectively. As another application, our proposed technology also allows for a theoretically valid choice of penalty for implementing $\ell_1$-regularized regression (Lasso) under multiway clustering. To our knowledge, there is no existing theoretically justified method for Lasso penalty selection under an exchangeable sampling setting. See Section 5.3 for the application of Lasso penalty selection.

The two sampling frameworks of interest in this paper, namely multiway clustering and polyadic sampling, can be formulated as exchangeable random arrays. Specifically, a natural stochastic framework for modeling multiway clustering is that of separately exchangeable arrays (MacKinnon et al., 2019). For network/dyadic data, on the other hand, Bickel and Chen (2009) propose the use of jointly exchangeable arrays, which has since become a popular model for such data structures; see Graham (2019) and Graham and de Paula (2019) for recent reviews. While formal definitions of these exchangeability concepts are postponed until Sections 2 and 3, it is worth noting that exchangeable structures arise naturally in many economic applications.
For example, in the context of modeling dynamic oligopoly with investment, Athey and Schmutzler (2001) indicate that the assumption of firms' profit functions being exchangeable is consistent with models of Cournot oligopoly, vertical product differentiation, and differentiated product models where the firms have identical cross-price effects. In these contexts, exchangeability imposes symmetry in the identities of firms such that each firm cares only about the actions and state variables of its rivals, but not about the match between a competitor's identity and actions/state variables. They also point out the close link between exchangeability and the notion of anonymity in cooperative game theory and social choice theory (e.g., Moulin, 1988). Another such example is from the analysis of supply and demand in differentiated products markets. Berry et al. (1995) point out that both the demand and the cost functions for a product are exchangeable in the vectors of characteristics of all other products. This emerges when the cost functions depend only on own-product characteristics, and is true for differentiated products demand systems in which the demand for a product is independent of the ordering of competitors' products and depends only on their characteristics. They also observe that a unique Nash equilibrium implies several forms of exchangeability in the observed and unobserved random variables in their demand model (Berry et al., 1995, Section 5.1). Furthermore, Menzel (2016) observes that exchangeability of a certain form is a standard feature in almost all commonly used empirical specifications for game-theoretic models with more than two players.

1.1. Relation to the Literature.
High-dimensional central limit theorems (CLTs) and bootstraps over rectangles in the "$p \gg n$" regime are studied by Chernozhukov et al. (2013a, 2014, 2015, 2016, 2017a), Deng and Zhang (2020), Chernozhukov et al. (2019b), Kuchibhotla et al. (2020), and Fang and Koike (2020) for the independent case, by Chen (2018) and Chen and Kato (2019, 2020) for $U$-statistics and $U$-processes, and by Zhang and Wu (2017), Zhang and Cheng (2018), Chernozhukov et al. (2019a), and Koike (2019) for time series dependence. To the best of our knowledge, there is no result that considers extensions to exchangeable arrays in this literature. This paper builds on and complements those references by providing high-dimensional CLTs and bootstrap methods for exchangeable arrays.

Regression models with common shocks have been investigated by Andrews (2005) under exchangeability with a one-dimensional index. Standard errors under multiway clustering (or separately exchangeable arrays) are proposed by Cameron et al. (2011) for parametric models, such as linear and nonlinear regression models; see also Cameron and Miller (2015, Section V) for a survey. Uniform asymptotic theory under multiway clustering is studied by Menzel (2017), covering both degenerate and non-degenerate cases. Focusing on the non-degenerate cases, Davezies et al. (2018, 2020) develop functional limit theorems for Donsker classes under multiway clustering. See also Chiang and Sasaki (2019), Chiang et al. (2019), MacKinnon (2019), and MacKinnon et al. (2019) for some other extensions and applications. To the best of our knowledge, no existing theory in this literature permits increasing- or high-dimensional inference.

The theory of finite dimensional asymptotics (with fixed dimensions) for polyadic data (or jointly exchangeable arrays) is well studied; see, e.g., Silverman (1976) and Eagleson and Weber (1978). Standard errors under dyadic data are first proposed by Fafchamps and Gubert (2007) and further studied by Cameron and Miller (2014), Aronow et al.
(2015), and Tabord-Meehan (2019). Davezies et al. (2020) develop functional limit theorems for Donsker classes under polyadic sampling. To the best of our knowledge, no existing theory in this literature permits increasing- or high-dimensional inference.

Methodologically, this paper is also related to the recent literature on high-dimensional $U$-statistics, such as Chen (2018) and Chen and Kato (2019, 2020), among others. Under suitable assumptions, the data of our interest can be written in a $U$-statistic-like latent structure (in distribution) via the Aldous-Hoover-Kallenberg representation (Aldous, 1981; Hoover, 1979; Kallenberg, 2006), i.e., the data can be written as a kernel function of some latent independent random variables. However, unlike the case with $U$-statistics, neither the kernel nor the latent independent random variables are known to us. In addition, we need to cope with the existence of extra idiosyncratic shocks in the latent structure. Both of these aspects present extra challenges.

The identification-robust inference applications considered in this paper are also related to the extensive literature on testing conditional moment inequalities, which includes, but is not limited to, Andrews and Shi (2013), Chernozhukov et al. (2013b), Lee et al. (2013), Armstrong (2014), Armstrong and Chan (2016), Andrews and Shi (2017), Chetverikov (2018), Lee et al. (2018), Bai et al. (2019), and Chernozhukov et al. (2019a). To the best of our knowledge, no theory that permits multiway clustered or polyadic data has been developed in this literature.

Regarding our bootstraps, McCullagh (2000) shows that no resampling scheme for the raw data is consistent for the variance of a sample mean under multiway clustering. A pigeonhole bootstrap is subsequently proposed by Owen (2007), and its different variants are further investigated in Owen and Eckles (2012), Menzel (2017), and Davezies et al. (2018, 2020).
Whether the pigeonhole bootstrap works for increasing- or high-dimensional test statistics remains unknown to us. We therefore develop a novel bootstrap method in this paper which we argue works for high-dimensional data.

Finally, we develop novel Hoeffding-type decompositions for both separately and jointly exchangeable arrays and show that, in both cases, symmetrization inequalities can be established for each of the Hoeffding-type projection terms. This allows us to obtain several new maximal inequalities that permit tight control over the higher-order approximation errors for any arbitrary index dimension in a simple manner. This approach resonates with the Hoeffding decomposition and empirical process theory for $U$-processes. The proofs of these technical results are more involved than their $U$-statistics counterparts due to the aforementioned unknown (and, in the jointly exchangeable case, index-dependent) nature of the kernel functions and the presence of the extra unobserved shocks. These results are of independent interest. In comparison, the empirical process results in Davezies et al. (2020) rely on substantially different symmetrization inequalities; see Remarks 8 and 9 in the Appendix for details.

1.2. Notations and Organization.
Let $\mathbb{N}$ denote the set of positive integers. We use $\|\cdot\|$, $\|\cdot\|_0$, $\|\cdot\|_1$, and $\|\cdot\|_\infty$ to denote the Euclidean, $\ell_0$, $\ell_1$, and $\ell_\infty$-norms for vectors, respectively (precisely, $\|\cdot\|_0$ is not a norm but a seminorm). For two real vectors $a=(a_1,\dots,a_p)^T$ and $b=(b_1,\dots,b_p)^T$, the notation $a\le b$ means that $a_j\le b_j$ for all $1\le j\le p$. Let $\mathrm{supp}(a)$ denote the support of $a=(a_1,\dots,a_p)^T$, i.e., $\mathrm{supp}(a)=\{j : a_j\ne 0\}$. We denote by $\odot$ the Hadamard (element-wise) product, i.e., for $i=(i_1,\dots,i_K)$ and $j=(j_1,\dots,j_K)$, $i\odot j=(i_1j_1,\dots,i_Kj_K)$. For any $a,b\in\mathbb{R}$, let $a\vee b=\max\{a,b\}$ and $a\wedge b=\min\{a,b\}$. For $0<\beta<\infty$, let $\psi_\beta$ be the function on $[0,\infty)$ defined by $\psi_\beta(x)=e^{x^\beta}-1$, and let $\|\cdot\|_{\psi_\beta}$ denote the associated Orlicz norm, i.e., $\|\xi\|_{\psi_\beta}=\inf\{C>0 : E[\psi_\beta(|\xi|/C)]\le 1\}$ for a real-valued random variable $\xi$. For $\beta\in(0,1)$, $\|\cdot\|_{\psi_\beta}$ is not a norm but a quasi-norm, i.e., there exists a constant $C_\beta$ depending only on $\beta$ such that $\|\xi_1+\xi_2\|_{\psi_\beta}\le C_\beta(\|\xi_1\|_{\psi_\beta}+\|\xi_2\|_{\psi_\beta})$. Let $U[0,1]$ denote the uniform distribution on $[0,1]$. [...] results in Section 4, and illustrate three applications in Section 5. We defer all the technical proofs to the Appendix.

2. Multiway Clustering
In this section, we consider separately exchangeable arrays that correspond to multiway clustered data. Pick any $K\in\mathbb{N}$. With $i=(i_1,\dots,i_K)\in\mathbb{N}^K$, we consider a $K$-array $(X_i)_{i\in\mathbb{N}^K}$ consisting of random vectors in $\mathbb{R}^p$. We denote by $X_i^j$ the $j$-th coordinate of $X_i$: $X_i=(X_i^1,\dots,X_i^p)^T$. We say that the array $(X_i)_{i\in\mathbb{N}^K}$ is separately exchangeable if the following condition is satisfied (cf. Kallenberg, 2006, Section 3.1).

Definition 1 (Separate exchangeability). A $K$-array $(X_i)_{i\in\mathbb{N}^K}$ is called separately exchangeable if for any $K$ permutations $\pi_1,\dots,\pi_K$ of $\mathbb{N}$, the arrays $(X_i)_{i\in\mathbb{N}^K}$ and $(X_{(\pi_1(i_1),\dots,\pi_K(i_K))})_{i\in\mathbb{N}^K}$ are identically distributed in the sense that their finite dimensional distributions agree.

From the Aldous-Hoover-Kallenberg representation (see Kallenberg, 2006, Corollary 7.23), any separately exchangeable array $(X_i)_{i\in\mathbb{N}^K}$ is generated by the structure
$$X_i = f((U_{i\odot e})_{e\in\{0,1\}^K}), \quad i\in\mathbb{N}^K, \qquad \{U_{i\odot e} : i\in\mathbb{N}^K,\ e\in\{0,1\}^K\} \overset{\text{i.i.d.}}{\sim} U[0,1],$$
for some map $f : [0,1]^{2^K}\to\mathbb{R}^p$. For example, when $K=2$, then $X_i$ is generated as $X_{(i_1,i_2)} = f(U_{(0,0)}, U_{(i_1,0)}, U_{(0,i_2)}, U_{(i_1,i_2)})$. The latent variable $U_{(0,\dots,0)}$ appears commonly in all $X_i$'s. In the present paper, as in Andrews (2005) and Menzel (2017), we consider inference conditional on $U_{(0,\dots,0)}$ and treat it as fixed. In the rest of Section 2, we will assume (without further mentioning) that the array $(X_i)_{i\in\mathbb{N}^K}$ has mean zero (conditional on $U_{(0,\dots,0)}$) and is generated by the structure
$$X_i = g((U_{i\odot e})_{e\in\{0,1\}^K\setminus\{0\}}), \quad i\in\mathbb{N}^K, \qquad (2.1)$$
where $g$ is now a map from $[0,1]^{2^K-1}$ into $\mathbb{R}^p$. Suppose that we observe $\{X_i : i\in[N]\}$ with $N=(N_1,\dots,N_K)$ and $[N]=\prod_{k=1}^K\{1,\dots,N_k\}$. We are interested in approximating the distribution of the sample mean
$$S_N = \frac{1}{\prod_{k=1}^K N_k}\sum_{i\in[N]} X_i$$
in the high-dimensional setting where the dimension $p$ is allowed to entail $p\gg\min\{N_1,\dots,N_K\}$.
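To make the representation (2.1) concrete, here is a minimal simulation sketch of a separately exchangeable two-way ($K=2$) array. The particular map $g$ and all names below are our own illustrative choices, not the paper's.

```python
import numpy as np

def simulate_two_way(N1, N2, p, seed=0):
    """Simulate X_{(i1,i2)} = g(U_{(i1,0)}, U_{(0,i2)}, U_{(i1,i2)}) for an
    illustrative (additive, mean-zero) choice of g."""
    rng = np.random.default_rng(seed)
    U_row = rng.uniform(size=N1)          # U_{(i1,0)}: first-way (e.g., product) shocks
    U_col = rng.uniform(size=N2)          # U_{(0,i2)}: second-way (e.g., market) shocks
    U_cell = rng.uniform(size=(N1, N2))   # U_{(i1,i2)}: idiosyncratic shocks
    # g centers each uniform at zero and scales it by coordinate-specific weights
    w = 1.0 + np.arange(p) / p
    base = (U_row[:, None] - 0.5) + (U_col[None, :] - 0.5) + (U_cell - 0.5)
    return base[:, :, None] * w[None, None, :]   # array of shape (N1, N2, p)

X = simulate_two_way(N1=30, N2=20, p=50)
S_N = X.mean(axis=(0, 1))                 # sample mean S_N, a p-vector
```

Any observation sharing a first index depends on the same `U_row` entry, and likewise for `U_col`, which reproduces the two-way clustered dependence described above.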
Example 1 (Empirical process indexed by a function class with increasing cardinality). Our setting covers the following situation: let $\{Y_i : i\in\mathbb{N}^K\}$ be random variables taking values in an abstract measurable space $(S,\mathcal{S})$, and suppose that they are generated as $Y_i = \check{g}((U_{i\odot e})_{e\in\{0,1\}^K\setminus\{0\}})$. Let $f_j : S\to\mathbb{R}$ for $1\le j\le p$ be measurable functions, and define $X_i^j = f_j(Y_i) - E[f_j(Y_i)]$. In this case, the sample mean $S_N$ can be regarded as the empirical process $f\mapsto(\prod_{k=1}^K N_k)^{-1}\sum_{i\in[N]}(f(Y_i)-E[f(Y_i)])$ indexed by the function class $\mathcal{F}=\{f_1,\dots,f_p\}$. Allowing $p\to\infty$ as $\min_{1\le k\le K}N_k\to\infty$ enables us to cover empirical processes indexed by function classes with increasing cardinality.

For later convenience, we fix some additional notations. Let $n=\min_{1\le k\le K}N_k$ and $\overline{N}=\max_{1\le k\le K}N_k$ denote the minimum and maximum cluster sizes, respectively. For $1\le k\le K$, denote by $\mathcal{E}_k=\{e=(e_1,\dots,e_K)\in\{0,1\}^K : \sum_{k'=1}^K e_{k'}=k\}$ the set of vectors in $\{0,1\}^K$ whose support has cardinality $k$. Let $e_k\in\mathbb{R}^K$ denote the vector such that the $k$-th coordinate of $e_k$ is 1 and the other coordinates are 0. For a given $e\in\{0,1\}^K$, define $I_e([N])=\{i\odot e : i\in[N]\}\subset\overline{\mathbb{N}}^K$ with $\overline{\mathbb{N}}=\mathbb{N}\cup\{0\}$.

The following decomposition of the sample mean $S_N$ will play a fundamental role in our analysis, which is reminiscent of the Hoeffding decomposition for $U$-statistics (Lee, 1990; de la Peña and Giné, 1999).

Lemma 1 (Hoeffding decomposition of separately exchangeable arrays). For any $i\in\mathbb{N}^K$, define recursively
$$\hat{X}_{i\odot e_k} = E[X_i \mid U_{i\odot e_k}], \quad k=1,\dots,K,$$
$$\hat{X}_{i\odot e} = E[X_i \mid (U_{i\odot e'})_{e'\le e}] - \sum_{e'\le e,\ e'\ne e}\hat{X}_{i\odot e'}, \quad e\in\bigcup_{k=2}^K\mathcal{E}_k.$$
Then, we have $X_i = \sum_{e\in\{0,1\}^K\setminus\{0\}}\hat{X}_{i\odot e}$. Consequently, we can decompose the sample mean $S_N=(\prod_{k=1}^K N_k)^{-1}\sum_{i\in[N]}X_i$ as
$$S_N = \sum_{k=1}^K\sum_{e\in\mathcal{E}_k}\frac{1}{\prod_{k'\in\mathrm{supp}(e)}N_{k'}}\sum_{i\in I_e([N])}\hat{X}_i. \qquad (2.2)$$
The proof of this lemma can be found in Appendix B.1.
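As a deliberately simple numerical illustration of the decomposition (2.2), suppose $K=2$ and $g$ is additive, $X_{(i_1,i_2)} = a(U_{(i_1,0)}) + b(U_{(0,i_2)}) + c(U_{(i_1,i_2)})$ with mean-zero components; then the projections in Lemma 1 are $\hat{X}_{(i_1,0)}=a(U_{(i_1,0)})$, $\hat{X}_{(0,i_2)}=b(U_{(0,i_2)})$, and $\hat{X}_{(i_1,i_2)}=c(U_{(i_1,i_2)})$, and (2.2) reduces to averaging each component separately. The additive $g$ is our assumption, chosen so that the conditional expectations are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
N1, N2 = 40, 30

# Components of the additive structure X_{(i1,i2)} = a_i + b_j + c_ij
a = rng.uniform(-0.5, 0.5, size=N1)        # a(U_{(i1,0)}), the first-way projection
b = rng.uniform(-0.5, 0.5, size=N2)        # b(U_{(0,i2)}), the second-way projection
c = rng.uniform(-0.5, 0.5, size=(N1, N2))  # c(U_{(i1,i2)}), the degenerate remainder
X = a[:, None] + b[None, :] + c

# The sample-level analog of (2.2): S_N equals the sum of the averaged projections
S_N = X.mean()
decomposition = a.mean() + b.mean() + c.mean()
assert np.isclose(S_N, decomposition)
```

The first two averages are the Hájek-projection terms of order $n^{-1/2}$, while the average of `c` is the degenerate component of order $(N_1 N_2)^{-1/2}$.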
Example 2 ($K=3$ case). For instance, if $K=3$, then for $i=(i_1,i_2,i_3)\in\mathbb{N}^3$,
$$\hat{X}_{(i_1,0,0)} = E[X_i\mid U_{(i_1,0,0)}], \quad \hat{X}_{(0,i_2,0)} = E[X_i\mid U_{(0,i_2,0)}], \quad \hat{X}_{(0,0,i_3)} = E[X_i\mid U_{(0,0,i_3)}],$$
$$\hat{X}_{(i_1,i_2,0)} = E[X_i\mid U_{(i_1,0,0)}, U_{(0,i_2,0)}, U_{(i_1,i_2,0)}] - \hat{X}_{(i_1,0,0)} - \hat{X}_{(0,i_2,0)}, \quad \text{etc.},$$
$$\hat{X}_{(i_1,i_2,i_3)} = X_i - \hat{X}_{(i_1,i_2,0)} - \hat{X}_{(0,i_2,i_3)} - \hat{X}_{(i_1,0,i_3)} - \hat{X}_{(i_1,0,0)} - \hat{X}_{(0,i_2,0)} - \hat{X}_{(0,0,i_3)}.$$

Remark 1 (Hoeffding decomposition). The reason that we call (2.2) the Hoeffding decomposition comes from the fact that if the dimension $p$ is fixed, for each fixed $k=1,\dots,K$ and $e\in\mathcal{E}_k$, the component
$$\frac{1}{\prod_{k'\in\mathrm{supp}(e)}N_{k'}}\sum_{i\in I_e([N])}\hat{X}_i$$
scales as $(\prod_{k'\in\mathrm{supp}(e)}N_{k'})^{-1/2} = O(n^{-k/2})$ with $n=\min_{1\le k'\le K}N_{k'}$ under moment conditions. See Corollary 3 in Appendix A. This is completely analogous to the Hoeffding decomposition of $U$-statistics, and from this analogy we shall call (2.2) the Hoeffding decomposition.

The leading term in the decomposition (2.2) is
$$\sum_{e\in\mathcal{E}_1}\frac{1}{\prod_{k'\in\mathrm{supp}(e)}N_{k'}}\sum_{i\in I_e([N])}\hat{X}_i = \sum_{k=1}^K N_k^{-1}\sum_{i_k=1}^{N_k}E[X_i\mid U_{(0,\dots,0,i_k,0,\dots,0)}],$$
which we call the Hájek projection of $S_N$. Define
$$W_{k,i_k} = E[X_i\mid U_{(0,\dots,0,i_k,0,\dots,0)}], \quad k=1,\dots,K, \qquad S_N^W = \sum_{k=1}^K N_k^{-1}\sum_{i_k=1}^{N_k}W_{k,i_k},$$
and $\Sigma_k^W = E[W_{k,1}W_{k,1}^T]$ for $k=1,\dots,K$. Since $S_N^W$ is a sum of independent random vectors, it is expected that the distribution of $\sqrt{n}S_N$ can be approximated by $N(0,\Sigma)$, where
$$\Sigma = \sum_{k=1}^K (n/N_k)\Sigma_k^W,$$
as long as the remainder term is negligible. This suggests the following multiplier bootstrap for multiway clustering.

2.1. Multiplier bootstrap for multiway clustering.
Let $\{\xi_{1,i_1}\}_{i_1=1}^{N_1},\dots,\{\xi_{K,i_K}\}_{i_K=1}^{N_K}$ be independent $N(0,1)$ random variables independent of the data. Ideally, we want to make use of the bootstrap statistic
$$\sum_{k=1}^K N_k^{-1}\sum_{i_k=1}^{N_k}\xi_{k,i_k}(W_{k,i_k} - S_N).$$
However, this bootstrap is infeasible as $W_{k,i_k} = E[X_i\mid U_{(0,\dots,0,i_k,0,\dots,0)}]$ are unknown to us. Estimation of $W_{k,i_k}$ is nontrivial as $U_{(0,\dots,0,i_k,0,\dots,0)}$ is a latent variable. To gain an insight into how to estimate $W_{k,i_k}$, consider the case where $K=2$. Then $W_{1,i_1} = E[X_{(i_1,i_2)}\mid U_{(i_1,0)}] = E[g(U_{(i_1,0)}, V_{(i_1,i_2)})\mid U_{(i_1,0)}]$ with $V_{(i_1,i_2)} = (U_{(0,i_2)}, U_{(i_1,i_2)})$. Since $U_{(i_1,0)}$ and $V_{(i_1,i_2)}$ are independent and the latter variable is independent across $i_2$, we see that $W_{1,i_1}$ can be estimated by taking the average of $X_{(i_1,i_2)}$ over $i_2$. Building on this intuition, in general, we propose to estimate each $W_{k,i_k}$ by
$$\overline{X}_{k,i_k} = \frac{1}{\prod_{k'\ne k}N_{k'}}\sum_{i_1,\dots,i_{k-1},i_{k+1},\dots,i_K}X_i, \quad i_k=1,\dots,N_k;\ k=1,\dots,K,$$
i.e., the sample mean taken over all indices but $i_k$. Then, we apply the multiplier bootstrap to $\overline{X}_{k,i_k}$ in place of $W_{k,i_k}$:
$$S_N^{MB} = \sum_{k=1}^K N_k^{-1}\sum_{i_k=1}^{N_k}\xi_{k,i_k}(\overline{X}_{k,i_k} - S_N).$$
To the best of our knowledge, this multiplier bootstrap for multiway clustering is new in the literature. We will formally study the validity of this multiplier bootstrap for high-dimensional multiway clustered data with $p\gg n$ in the next two subsections.

2.2. High-dimensional CLT for multiway clustering.
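A minimal numerical sketch of the multiplier bootstrap $S_N^{MB}$ from Section 2.1 in the two-way case $K=2$ (all names are ours; `X` is an $N_1\times N_2\times p$ array of mean-zero observations):

```python
import numpy as np

def multiplier_bootstrap_two_way(X, rng):
    """One draw of S_N^MB for K = 2 multiway clustering."""
    N1, N2, p = X.shape
    S_N = X.mean(axis=(0, 1))            # sample mean, a p-vector
    Xbar_1 = X.mean(axis=1)              # \bar X_{1,i_1}: average over i_2, shape (N1, p)
    Xbar_2 = X.mean(axis=0)              # \bar X_{2,i_2}: average over i_1, shape (N2, p)
    xi_1 = rng.standard_normal(N1)       # Gaussian multipliers, first way
    xi_2 = rng.standard_normal(N2)       # Gaussian multipliers, second way
    return (xi_1 @ (Xbar_1 - S_N)) / N1 + (xi_2 @ (Xbar_2 - S_N)) / N2

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 15, 40))
draws = np.stack([multiplier_bootstrap_two_way(X, rng) for _ in range(200)])
# e.g., a bootstrap critical value for the max-statistic max_j sqrt(n)|S_N^MB,j|:
n = min(25, 15)
crit = np.quantile(np.sqrt(n) * np.abs(draws).max(axis=1), 0.95)
```

Note that with unit multipliers $\xi\equiv 1$ the statistic vanishes identically, since the $\overline{X}_{k,i_k}$ average back to $S_N$; randomness enters only through the Gaussian multipliers, so repeated draws approximate the conditional law of $\sqrt{n}S_N$ over rectangles.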
We first establish a high-dimensional CLT for $S_N$ over the class of rectangles,
$$\mathcal{R} = \left\{\prod_{j=1}^p [a_j,b_j] : -\infty\le a_j\le b_j\le\infty,\ 1\le j\le p\right\}.$$
This high-dimensional CLT will be a building block for establishing the validity of the multiplier bootstrap considered in the preceding section. We start with discussing regularity conditions. Denote by $\mathbf{1}=(1,\dots,1)$ the vector of ones. Let $D_N\ge 1$, and let $\sigma > 0$. We will assume either of the following moment conditions:
$$\max_{1\le j\le p}\|X_{\mathbf{1}}^j\|_{\psi_1}\le D_N, \quad\text{or} \qquad (2.3)$$
$$E[\|X_{\mathbf{1}}\|_\infty^q]\le D_N^q \ \text{ for some } q\in(4,\infty). \qquad (2.4)$$
We will also assume both of the following conditions:
$$\max_{1\le j\le p;\,1\le k\le K}E[|W_{k,1}^j|^{2+\kappa}]\le D_N^\kappa, \quad \kappa=1,2, \qquad (2.5)$$
$$\min_{1\le j\le p;\,1\le k\le K}E[|W_{k,1}^j|^2]\ge\sigma^2. \qquad (2.6)$$
Condition (2.3) requires that each coordinate of $X_{\mathbf{1}}$ is sub-exponential. By Jensen's inequality, Condition (2.3) implies that $\max_{1\le j\le p;\,1\le k\le K}\|W_{k,1}^j\|_{\psi_1}\le D_N$. Condition (2.4) is an alternative moment condition on $X_{\mathbf{1}}$. Condition (2.4) is satisfied, for example, under the following situation: suppose that $X_i$ is given by $X_i=\varepsilon_i Z_i$, where $\varepsilon_i$ is a scalar "error" variable while $Z_i$ is a vector of "covariates". If each coordinate of $Z_i$ is bounded by a constant $D_N$ (that may depend on $N$) and $\varepsilon_i$ has finite $q$-th moment, then $E[\|X_i\|_\infty^q]\le D_N^q E[|\varepsilon_i|^q]$. Again, by Jensen's inequality, Condition (2.4) implies that $\max_{1\le k\le K}E[\|W_{k,1}\|_\infty^q]\le D_N^q$. Condition (2.5) requires the maximum of the third (respectively, fourth) moments across coordinates to be increasing at speed no faster than the first (respectively, second) power of $D_N$. By Jensen's inequality, Condition (2.5) is satisfied if $\max_{1\le j\le p}E[|X_{\mathbf{1}}^j|^{2+\kappa}]\le D_N^\kappa$ for $\kappa=1,2$. Condition (2.6) guarantees that the Hájek projection is nondegenerate. Let $\gamma_\Sigma = N(0,\Sigma)$.
Theorem 1 (High-dimensional CLT for multiway clustering). Suppose that either Condition (2.3) or (2.4) holds, and further that both Conditions (2.5) and (2.6) hold. Then, there exists a constant $C$ such that
$$\sup_{R\in\mathcal{R}}|P(\sqrt{n}S_N\in R)-\gamma_\Sigma(R)| \le \begin{cases} C\left(\dfrac{D_N^2\log^7(p\overline{N})}{n}\right)^{1/6} & \text{if Condition (2.3) holds,}\\[2ex] C\left[\left(\dfrac{D_N^2\log^7(p\overline{N})}{n}\right)^{1/6} + \left(\dfrac{D_N^2\log^3(p\overline{N})}{n^{1-2/q}}\right)^{1/3}\right] & \text{if Condition (2.4) holds,}\end{cases}$$
where the constant $C$ depends only on $\sigma$ and $K$ if Condition (2.3) holds, while $C$ depends only on $q, \sigma$, and $K$ if Condition (2.4) holds.

Remark 2 (Refinement under sub-Gaussianity). The recent paper of Chernozhukov et al. (2019b) provides some improvements on the convergence rate of the Gaussian approximation under a sub-Gaussian tail assumption for the sample mean of independent random vectors. With this new technique, if we strengthen Condition (2.3) by replacing the $\psi_1$-norm $\|\cdot\|_{\psi_1}$ with the $\psi_2$-norm $\|\cdot\|_{\psi_2}$ (i.e., each coordinate of $X_{\mathbf{1}}$ is sub-Gaussian), the bound $C\big(n^{-1}D_N^2\log^7(p\overline{N})\big)^{1/6}$ in Theorem 1 can be improved to $C\big(n^{-1}D_N^2\log^5(p\overline{N})\big)^{1/6}$.

2.3. Validity of multiplier bootstrap for multiway clustering.
We are now in a position to establish the validity of the proposed multiplier bootstrap for multiway clustered data. Let $P_{|X_{[N]}}$ denote the law conditional on the data $X_{[N]}=(X_i)_{i\in[N]}$. Define
$$\hat{\Delta}_W = \max_{1\le j\le p;\,1\le k\le K}\frac{1}{N_k}\sum_{i_k=1}^{N_k}(\overline{X}_{k,i_k}^j - W_{k,i_k}^j)^2,$$
which accounts for the estimation error of $\overline{X}_{k,i_k}$ for $W_{k,i_k}$. Also, let $\overline{\sigma} = \max_{1\le j\le p;\,1\le k\le K}\sqrt{E[|W_{k,1}^j|^2]}$. The following theorem shows that as soon as $\hat{\Delta}_W$ is sufficiently small (i.e., $\overline{\sigma}\hat{\Delta}_W^{1/2}\log p = o_P(1)$), the multiplier bootstrap is consistent over the rectangles under mild conditions on the dimension $p$.

Theorem 2 (Validity of multiplier bootstrap for multiway clustering). Consider the following two cases.
(i) Conditions (2.3), (2.5), and (2.6) hold, and there exist constants $C_1$ and $\zeta_1, \zeta_2\in(0,1)$ such that
$$P\left(\overline{\sigma}\hat{\Delta}_W^{1/2}\log p > C_1 n^{-\zeta_1}\right)\le C_1 n^{-1} \quad\text{and} \qquad (2.7)$$
$$\frac{D_N^2(\log n)\log^4(p\overline{N})}{n}\le C_1 n^{-\zeta_2}. \qquad (2.8)$$
(ii) Conditions (2.4), (2.5), and (2.6) hold, and there exist constants $C_2$ and $\zeta_1, \zeta_2\in(0,1)$ such that Condition (2.7) holds (with $C_1$ replaced by $C_2$) and
$$\frac{D_N^2\log^4(pn)}{n}\vee\left(\frac{D_N^2\log^3(pn)}{n^{1-2/q}}\right)\le C_2 n^{-\zeta_2}. \qquad (2.9)$$
Then, under either Case (i) or (ii), there exists a constant $C$ such that
$$\sup_{R\in\mathcal{R}}\left|P_{|X_{[N]}}(\sqrt{n}S_N^{MB}\in R)-\gamma_\Sigma(R)\right|\le Cn^{-(\zeta_1\wedge\zeta_2)/6}$$
with probability at least $1-Cn^{-1}$, where the constant $C$ depends only on $\sigma, K$, and $C_1$ under Case (i), while $C$ depends only on $q, \sigma, K$, and $C_2$ under Case (ii).

Remark 3 (Discussion on Conditions (2.7)–(2.9)). Conditions (2.7)–(2.9) are placed to guarantee that the error bound for our multiplier bootstrap decreases at a polynomial rate in $n$.
If we are to show a weaker result, namely,
$$\sup_{R\in\mathcal{R}}\left|P_{|X_{[N]}}(\sqrt{n}S_N^{MB}\in R)-\gamma_\Sigma(R)\right| = o_P(1) \qquad (2.10)$$
as $n\to\infty$ (with the understanding that $p, \sigma, D_N$, and $N$ are functions of $n$), then Conditions (2.7)–(2.9) can be weakened to $\overline{\sigma}\hat{\Delta}_W^{1/2}\log p = o_P(1)$, $D_N^2\log^4(p\overline{N})=o(n)$, and $(n^{-1}D_N^2\log^4(pn))\vee(n^{-(1-2/q)}D_N^2\log^3 p)=o(1)$, respectively. (The critical case $q=4$ is allowed for (2.10); note that the high-dimensional CLT (Theorem 1) also holds with $q=4$.)

Condition (2.7) is a high-level condition on the estimation accuracy of $\overline{X}_{k,i_k}$ for $W_{k,i_k}$. We provide primitive sufficient conditions for Condition (2.7) to hold in the following proposition.

Proposition 1 (Primitive sufficient conditions for Condition (2.7)). Consider the following two cases.
(i') Conditions (2.3), (2.5), and (2.6) hold, and there exist constants $C_3$ and $\zeta_3\in(0,1)$ such that
$$\frac{\overline{\sigma}^2 D_N^2\log^2 p}{n}\le C_3 n^{-\zeta_3}. \qquad (2.11)$$
(ii') Conditions (2.4), (2.5), and (2.6) hold, and there exist constants $C_4$ and $\zeta_4\in(2/q,1)$ such that
$$\frac{\overline{\sigma}^2 D_N^2\log^2 p}{n}\le C_4 n^{-\zeta_4}. \qquad (2.12)$$
Under Case (i'), for any $\nu\in(1/\zeta_3,\infty)$, there exists a constant $C$ depending only on $\nu, K$, and $C_3$ such that
$$P\left(\overline{\sigma}\hat{\Delta}_W^{1/2}\log p > Cn^{-\zeta_3+1/\nu}\right)\le Cn^{-1}.$$
Under Case (ii'), there exists a constant $C$ depending only on $q, K$, and $C_4$ such that
$$P\left(\overline{\sigma}\hat{\Delta}_W^{1/2}\log p > Cn^{-\zeta_4+2/q}\right)\le Cn^{-1}.$$

Remark 4 (Discussion on Conditions (2.11) and (2.12)). If we are to follow Remark 3 and to show a sufficient condition for $\overline{\sigma}\hat{\Delta}_W^{1/2}\log p = o_P(1)$, then Conditions (2.11) and (2.12) can be weakened to $\overline{\sigma}^2 D_N^2\log^2 p = o(n)$ and $\overline{\sigma}^2 D_N^2\log^2 p = o(n^{1-2/q})$, respectively.

In practice, we often normalize the coordinates of the sample mean by estimates of the standard deviations, so that each coordinate is approximately distributed as $N(0,1)$. The (approximate) variance of the $j$-th coordinate of $\sqrt{n}S_N$ is given by $\sigma_j^2=\mathrm{Var}(\sqrt{n}S_N^{W,j})$, where $S_N^{W,j}$ is the $j$-th coordinate of $S_N^W$. This can be estimated by
$$\hat{\sigma}_j^2 = \sum_{k=1}^K\frac{n}{N_k^2}\sum_{i_k=1}^{N_k}(\overline{X}_{k,i_k}^j - S_N^j)^2.$$
Let $\Lambda = \mathrm{diag}\{\sigma_1^2,\dots,\sigma_p^2\}$ and $\hat{\Lambda} = \mathrm{diag}\{\hat{\sigma}_1^2,\dots,\hat{\sigma}_p^2\}$. We consider approximating the distribution of $\sqrt{n}\hat{\Lambda}^{-1/2}S_N$ by that of $\sqrt{n}\hat{\Lambda}^{-1/2}S_N^{MB}$.

Corollary 1.
Consider Cases (i) and (ii) in Theorem 2. In Case (i), assume further that
$$\frac{D_N^2\log^4(p\overline{N})}{n}\le C_1 n^{-(\zeta_1\wedge\zeta_2)/2},$$
while in Case (ii) assume further that
$$\frac{D_N^2\log^4(p\overline{N})}{n}\vee\left(\frac{D_N^2\log^3(p\overline{N})}{n^{1-2/q}}\right)\le C_2 n^{-(\zeta_1\wedge\zeta_2)/2}.$$
Then, there exists a constant $C$ such that for $Y\sim N(0,\Sigma)$,
$$\sup_{R\in\mathcal{R}}\left|P(\sqrt{n}\hat{\Lambda}^{-1/2}S_N\in R)-P(\Lambda^{-1/2}Y\in R)\right|\le Cn^{-(\zeta_1\wedge\zeta_2)/6}$$
and
$$P\left\{\sup_{R\in\mathcal{R}}\left|P_{|X_{[N]}}(\sqrt{n}\hat{\Lambda}^{-1/2}S_N^{MB}\in R)-P(\Lambda^{-1/2}Y\in R)\right|\le Cn^{-(\zeta_1\wedge\zeta_2)/6}\right\}\ge 1-Cn^{-1},$$
where the same convention on the constant $C$ as in Theorem 2 applies.

3. Polyadic Data
In this section, we consider another class of exchangeable arrays, namely, jointly exchangeable arrays, which correspond to polyadic data. The notations in the current section are independent from those in Section 2 unless otherwise noted. Joint exchangeability induces a more complex dependence structure on arrays than separate exchangeability, but we are still able to develop results for jointly exchangeable arrays analogous to those of the preceding section. It should be noted, however, that we do require a different bootstrap and different technical tools (cf. Appendix C) to accommodate the specific dependence structure induced by joint exchangeability.

Pick any $K\in\mathbb{N}$. For a given positive integer $n\ge K$, let $I_{n,K}=\{(i_1,\dots,i_K) : 1\le i_1,\dots,i_K\le n \text{ and } i_1,\dots,i_K \text{ are distinct}\}$. Also let $I_{\infty,K}=\bigcup_{n=K}^\infty I_{n,K}$. For any $i=(i_1,\dots,i_K)\in\overline{\mathbb{N}}^K$, let $\{i\}_+$ denote the set of distinct nonzero elements of $(i_1,\dots,i_K)$. For example, $\{(2,0,1,2)\}_+=\{1,2\}$. In this section, we consider a $K$-array $(X_i)_{i\in I_{\infty,K}}$ consisting of random vectors in $\mathbb{R}^p$. We say that the array $(X_i)_{i\in I_{\infty,K}}$ is jointly exchangeable if the following condition is satisfied (cf. Kallenberg, 2006, Section 3.1).

Definition 2 (Joint exchangeability). A $K$-array $(X_i)_{i\in I_{\infty,K}}$ is called jointly exchangeable if for any permutation $\pi$ of $\mathbb{N}$, the arrays $(X_i)_{i\in I_{\infty,K}}$ and $(X_{(\pi(i_1),\dots,\pi(i_K))})_{i\in I_{\infty,K}}$ are identically distributed.

From the Aldous-Hoover-Kallenberg representation (see Kallenberg, 2006, Theorem 7.22), any jointly exchangeable array $(X_i)_{i\in I_{\infty,K}}$ is generated by the structure
$$X_i = f((U_{\{i\odot e\}_+})_{e\in\{0,1\}^K}), \quad i\in I_{\infty,K}, \qquad \{U_{\{i\odot e\}_+} : i\in I_{\infty,K},\ e\in\{0,1\}^K\}\overset{\text{i.i.d.}}{\sim}U[0,1],$$
for some map $f : [0,1]^{2^K}\to\mathbb{R}^p$. For example, when $K=2$, then $X_{(i_1,i_2)}$ is generated as $X_{(i_1,i_2)}=f(U_\emptyset, U_{i_1}, U_{i_2}, U_{\{i_1,i_2\}})$. (We will write $U_{i_k}=U_{\{i_k\}}$ for notational convenience.)
Here the coordinates of the vector $(U_{\{i\odot e\}_+})_{e\in\{0,1\}^K}$ are understood to be properly ordered, so that, e.g., when $K=2$, $X_{(i_1,i_2)}=f(U_\emptyset,U_{i_1},U_{i_2},U_{\{i_1,i_2\}})$ and $X_{(i_2,i_1)}=f(U_\emptyset,U_{i_2},U_{i_1},U_{\{i_1,i_2\}})$ differ (although they have the identical distribution). As in the separately exchangeable case, we consider inference conditional on $U_\emptyset$, and in what follows, we will assume that the array $(X_i)_{i\in I_{\infty,K}}$ has mean zero (conditional on $U_\emptyset$) and is generated by the structure
$$X_i = g((U_{\{i\odot e\}_+})_{e\in\{0,1\}^K\setminus\{0\}}), \quad i\in I_{\infty,K}, \qquad (3.1)$$
where $g$ is now a map from $[0,1]^{2^K-1}$ into $\mathbb{R}^p$. Suppose that we observe $\{X_i : i\in I_{n,K}\}$ with $n\ge K$ and are interested in distributional approximation of the polyadic sample mean
$$S_n := \frac{(n-K)!}{n!}\sum_{i\in I_{n,K}}X_i$$
in the high-dimensional setting where the dimension $p$ is allowed to entail $p\gg n$.

As in Section 2, define $\mathcal{E}_k=\{e=(e_1,\dots,e_K)\in\{0,1\}^K : \sum_{k'=1}^K e_{k'}=k\}$ for $1\le k\le K$. The analysis of the polyadic sample mean relies on the following decomposition of $X_i$:
$$X_i = \sum_{k=1}^K E[X_i\mid U_{i_k}] + \left(E[X_i\mid U_{i_1},\dots,U_{i_K}] - \sum_{k=1}^K E[X_i\mid U_{i_k}]\right) + \sum_{k=2}^K\left(E[X_i\mid (U_{\{i\odot e\}_+})_{e\in\cup_{r=1}^k\mathcal{E}_r}] - E[X_i\mid (U_{\{i\odot e\}_+})_{e\in\cup_{r=1}^{k-1}\mathcal{E}_r}]\right).$$
This leads to the decomposition
$$S_n = \frac{1}{n}\sum_{j=1}^n E\left[\frac{(n-K)!}{(n-1)!}\sum_{k=1}^K\sum_{i\in I_{n,K}:\, i_k=j}X_i \,\Big|\, U_j\right] + \frac{(n-K)!}{n!}\sum_{i\in I_{n,K}}\left(E[X_i\mid U_{i_1},\dots,U_{i_K}] - \sum_{k=1}^K E[X_i\mid U_{i_k}]\right)$$
$$+ \sum_{k=2}^K\frac{(n-K)!}{n!}\sum_{i\in I_{n,K}}\left(E[X_i\mid (U_{\{i\odot e\}_+})_{e\in\cup_{r=1}^k\mathcal{E}_r}] - E[X_i\mid (U_{\{i\odot e\}_+})_{e\in\cup_{r=1}^{k-1}\mathcal{E}_r}]\right). \qquad (3.2)$$
The second term on the right-hand side of (3.2) is a degenerate $U$-statistic and is thus negligible compared with the first term under moment conditions (this term can be expanded into $K-1$ terms of orders $O(n^{-k/2})$ for $k=2,\dots,K$ if $p$ is fixed).
The analysis of the third term is more complicated, but it will be shown that its $k$-th summand scales as $O(n^{-k/2})$ if the dimension $p$ is fixed, so that the third term on the right-hand side of (3.2) is also negligible compared with the first term. See Appendix C for details. Applying the Hoeffding decomposition to the second term on the right-hand side of (3.2), combining it with the third term on the right-hand side of (3.2), and aligning the terms according to their orders, we can obtain a Hoeffding-type decomposition for jointly exchangeable arrays. As in the multiway clustering case, we call the first term on the right-hand side of (3.2) the Hájek projection of $S_n$.

Defining $h_k(u) = E[X_{(1,\dots,K)} \mid U_k = u]$ for $k = 1,\dots,K$, we can simplify the Hájek projection into
\[
S_n^W = \frac{1}{n}\sum_{j=1}^{n} W_j, \quad \text{with } W_j = \sum_{k=1}^{K} h_k(U_j).
\]
Since $\{W_j\}_{j=1}^{n}$ are i.i.d., we can expect that $\sqrt{n}\,S_n^W$ can be approximated (in distribution) by $N(0,\Sigma)$, where $\Sigma = E[W_1 W_1^T]$. This suggests the following version of the multiplier bootstrap for polyadic data.

3.1. Multiplier bootstrap for polyadic data. Let $\{\xi_j\}_{j=1}^{n}$ be independent $N(0,1)$ random variables independent of the data. Ideally, we want to make use of the multiplier bootstrap statistic
\[
\frac{1}{n}\sum_{j=1}^{n} \xi_j (W_j - K S_n).
\]
This is infeasible, however, as the projections $W_j$ are unknown. As an alternative, we replace each $W_j$ by its estimate
\[
\widehat{W}_j = \frac{(n-K)!}{(n-1)!} \sum_{k=1}^{K} \sum_{i \in I_{n,K}:\, i_k = j} X_i,
\]
and apply the multiplier bootstrap to $\widehat{W}_j$, i.e.,
\[
S_n^{MB} := \frac{1}{n}\sum_{j=1}^{n} \xi_j \bigl( \widehat{W}_j - K S_n \bigr).
\]
For example, when $K = 2$ (dyadic), this multiplier bootstrap simplifies into
\[
S_n^{MB} = \frac{1}{n}\sum_{j=1}^{n} \xi_j \Biggl( \frac{1}{n-1} \sum_{i'=1;\, i' \neq j}^{n} \bigl( X_{(i',j)} + X_{(j,i')} \bigr) - 2 S_n \Biggr),
\]
which coincides with the multiplier bootstrap statistic considered in Section 3.2 of Davezies et al. (2020). However, Davezies et al. (2020) do not consider the extension to general $K$-arrays, and focus on the empirical process indexed by a Donsker class, which excludes the high-dimensional sample mean. We will study the validity of this multiplier bootstrap for general polyadic data in the following two subsections.

3.2. High-dimensional CLT for polyadic data.
We consider approximating the distribution of $\sqrt{n}\,S_n$ by a Gaussian distribution on the set of rectangles $\mathcal{R}$ as defined in Section 2. Let $D_n \ge 1$ and $\underline{\sigma} > 0$. We will assume either of the following moment conditions:
\[
\max_{1 \le \ell \le p} \bigl\| X_{(1,\dots,K)}^{\ell} \bigr\|_{\psi_1} \le D_n, \quad \text{or} \tag{3.3}
\]
\[
E\bigl[\| X_{(1,\dots,K)} \|_\infty^q\bigr] \le D_n^q \quad \text{for some } q \in (4,\infty). \tag{3.4}
\]
We will also assume both of the following conditions:
\[
\max_{1 \le \ell \le p} E[|W_1^{\ell}|^{2+\kappa}] \le D_n^{\kappa}, \quad \kappa = 1,2, \tag{3.5}
\]
\[
\min_{1 \le \ell \le p} E[|W_1^{\ell}|^2] \ge \underline{\sigma}^2. \tag{3.6}
\]
The conditions required here are similar to those in the case of multiway clustering in Section 2. The main difference is that Conditions (3.5) and (3.6) are now imposed on $W_1$. Let $\gamma_\Sigma = N(0,\Sigma)$.

Theorem 3 (High-dimensional CLT for polyadic data). Suppose that either Condition (3.3) or (3.4) holds, and further both Conditions (3.5) and (3.6) hold. Then, there exists a constant $C$ such that
\[
\sup_{R \in \mathcal{R}} \bigl| P(\sqrt{n}\,S_n \in R) - \gamma_\Sigma(R) \bigr| \le
\begin{cases}
C \Bigl( \dfrac{D_n^2 \log^7 (pn)}{n} \Bigr)^{1/6} & \text{if Condition (3.3) holds,} \\[6pt]
C \Biggl[ \Bigl( \dfrac{D_n^2 \log^7 (pn)}{n} \Bigr)^{1/6} + \Bigl( \dfrac{D_n^2 \log^3 (pn)}{n^{1-2/q}} \Bigr)^{1/3} \Biggr] & \text{if Condition (3.4) holds,}
\end{cases}
\]
where the constant $C$ depends only on $\underline{\sigma}$ and $K$ if Condition (3.3) holds, while $C$ depends only on $q$, $\underline{\sigma}$, and $K$ if Condition (3.4) holds.

Remark 5 (Comparison with Silverman (1976)). Theorem 3 is a high-dimensional extension of Theorem A in Silverman (1976), which establishes a CLT for jointly exchangeable arrays with fixed $p$. The covariance matrix of the limiting Gaussian distribution in Silverman (1976) has a different expression than our $\Sigma$, but we will verify below that the two expressions are indeed the same. The covariance matrix given in the Corollary to Theorem A in Silverman (1976) reads as follows. Let $\check{X}_{(i_1,\dots,i_K)}$ be the symmetrized version of $X_{(i_1,\dots,i_K)}$, i.e., $\check{X}_{(i_1,\dots,i_K)} = (K!)^{-1} \sum_{(i_1',\dots,i_K')} X_{(i_1',\dots,i_K')}$, where the summation is taken over all permutations of $(i_1,\dots,i_K)$. The covariance matrix given in Silverman (1976) is $\Sigma^S = K^2\, E[\check{X}_{(1,\dots,K)} \check{X}_{(1,K+1,\dots,2K-1)}^T]$. On the other hand,
\[
\sum_{k=1}^{K} E[X_{(1,\dots,K)} \mid U_k = u] = \sum_{k=1}^{K} E[\check{X}_{(1,\dots,K)} \mid U_k = u] = K\, E[\check{X}_{(1,\dots,K)} \mid U_1 = u],
\]
so that
\[
\Sigma = K^2\, E\bigl[ E[\check{X}_{(1,\dots,K)} \mid U_1]\, E[\check{X}_{(1,\dots,K)} \mid U_1]^T \bigr] = K^2\, E[\check{X}_{(1,\dots,K)} \check{X}_{(1,K+1,\dots,2K-1)}^T] = \Sigma^S,
\]
as claimed.

3.3. Validity of multiplier bootstrap for polyadic data.
Let $P_{\mid X_{I_{n,K}}}$ denote the law conditional on the data $(X_i)_{i \in I_{n,K}}$. Define
\[
\widehat{\Delta}_{W,1} = \max_{1 \le \ell \le p} \frac{1}{n} \sum_{j=1}^{n} \bigl( \widehat{W}_j^{\ell} - W_j^{\ell} \bigr)^2.
\]
In addition, let $\bar{\sigma} = \max_{1 \le \ell \le p} \sqrt{E[|W_1^{\ell}|^2]}$.

Theorem 4 (Validity of multiplier bootstrap for polyadic data). Consider the following two cases.
(i) Conditions (3.3), (3.5), and (3.6) hold, and there exist constants $C_1$ and $\zeta_1, \zeta_2 \in (0,1)$ such that
\[
P\Bigl( \bar{\sigma}^2\, \widehat{\Delta}_{W,1} \log^4 p > C_1 n^{-\zeta_1} \Bigr) \le C_1 n^{-1} \quad \text{and} \tag{3.7}
\]
\[
\frac{D_n^2 (\log n)^2 \log^4 (pn)}{n} \le C_1 n^{-\zeta_2}. \tag{3.8}
\]
(ii) Conditions (3.4), (3.5), and (3.6) hold, and there exist constants $C_1$ and $\zeta_1, \zeta_2 \in (0,1)$ such that Condition (3.7) holds and
\[
\frac{D_n^2 \log^4 (pn)}{n} \vee \frac{D_n^2 \log^3 (pn)}{n^{1-2/q}} \le C_1 n^{-\zeta_2}. \tag{3.9}
\]
Then, under either Case (i) or (ii), there exists a constant $C$ such that
\[
\sup_{R \in \mathcal{R}} \Bigl| P_{\mid X_{I_{n,K}}}(\sqrt{n}\,S_n^{MB} \in R) - \gamma_\Sigma(R) \Bigr| \le C n^{-(\zeta_1 \wedge \zeta_2)/3}
\]
with probability at least $1 - C n^{-1}$, where the constant $C$ depends only on $\underline{\sigma}$, $K$, and $C_1$ under Case (i), while $C$ depends only on $q$, $\underline{\sigma}$, $K$, and $C_1$ under Case (ii).

The following proposition provides primitive sufficient conditions for Condition (3.7) to hold.
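To make the procedure concrete, the following sketch implements the estimated Hájek projections $\widehat{W}_j$ and the multiplier bootstrap statistic $S_n^{MB}$ for the dyadic case $K = 2$, together with a bootstrap critical value for the max statistic; the data-generating step is a placeholder, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, B = 30, 5, 200

# Placeholder dyadic data: X[i, j] = X_{(i+1, j+1)}, diagonal unused.
X = rng.normal(size=(n, n, p))
idx = np.arange(n)
X[idx, idx, :] = 0.0                       # zero out the (unused) diagonal

S_n = X.sum(axis=(0, 1)) / (n * (n - 1))   # polyadic sample mean, K = 2

# Hajek-projection estimates:
# W_hat_j = (n-1)^{-1} * sum_{i' != j} (X_{(i',j)} + X_{(j,i')})
W_hat = (X.sum(axis=0) + X.sum(axis=1)) / (n - 1)   # shape (n, p)

# Multiplier bootstrap: S_n^MB = n^{-1} sum_j xi_j (W_hat_j - 2 S_n), xi_j ~ N(0,1)
xi = rng.standard_normal(size=(B, n))
S_MB = xi @ (W_hat - 2.0 * S_n) / n                 # one draw per row, shape (B, p)

# Bootstrap 95% critical value for the max statistic max_l sqrt(n) S_n^{MB,l}
crit_95 = np.quantile(np.sqrt(n) * S_MB.max(axis=1), 0.95)
```

Note that $n^{-1}\sum_j \widehat{W}_j = 2 S_n$ exactly in the dyadic case, which is why the multipliers are applied to the centered terms $\widehat{W}_j - 2S_n$.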
Proposition 2 (Primitive sufficient conditions for Condition (3.7)). Consider the following two cases.
(i') Conditions (3.3), (3.5), and (3.6) hold, and there exist constants $C_1$ and $\zeta_1 \in (0,1)$ such that
\[
\frac{\bar{\sigma}^2 D_n^2 \log^4 p}{n} \le C_1 n^{-\zeta_1}.
\]
(ii') Conditions (3.4), (3.5), and (3.6) hold, and there exist constants $C_1$ and $\zeta_1 \in (2/q, 1)$ such that
\[
\frac{\bar{\sigma}^2 D_n^2 \log^4 p}{n} \le C_1 n^{-\zeta_1}.
\]
Under Case (i'), for any $\nu \in (1/\zeta_1, \infty)$, there exists a constant $C$ depending only on $\nu$, $K$, and $C_1$ such that
\[
P\Bigl( \bar{\sigma}^2\, \widehat{\Delta}_{W,1} \log^4 p > C n^{-\zeta_1 + 1/\nu} \Bigr) \le C n^{-1}.
\]
Under Case (ii'), there exists a constant $C$ depending only on $q$, $K$, and $C_1$ such that
\[
P\Bigl( \bar{\sigma}^2\, \widehat{\Delta}_{W,1} \log^4 p > C n^{-\zeta_1 + 2/q} \Bigr) \le C n^{-1}.
\]

Finally, we consider normalized sample means for polyadic data. In light of the high-dimensional CLT for polyadic data, the approximate variance of the $\ell$-th coordinate of $\sqrt{n}\,S_n$ is given by $\sigma_\ell^2 = \mathrm{Var}(W_1^{\ell})$, which can be estimated by
\[
\widehat{\sigma}_\ell^2 = \frac{1}{n}\sum_{k=1}^{n} \bigl( \widehat{W}_k^{\ell} - K S_n^{\ell} \bigr)^2.
\]
Let $\Lambda = \mathrm{diag}\{\sigma_1^2,\dots,\sigma_p^2\}$ and $\widehat{\Lambda} = \mathrm{diag}\{\widehat{\sigma}_1^2,\dots,\widehat{\sigma}_p^2\}$. We consider approximating the distribution of $\sqrt{n}\,\widehat{\Lambda}^{-1/2} S_n$ by that of $\sqrt{n}\,\widehat{\Lambda}^{-1/2} S_n^{MB}$.

Corollary 2. Consider Cases (i) and (ii) in Theorem 4. In Case (i), assume further that
\[
\frac{D_n^2 \log^4 (pn)}{n} \le C_1 n^{-(\zeta_1 \wedge \zeta_2)/2},
\]
while in Case (ii) assume further that
\[
\frac{D_n^2 \log^4 (pn)}{n} \vee \frac{D_n^2 \log^2 (pn)}{n^{1-2/q}} \le C_1 n^{-(\zeta_1 \wedge \zeta_2)/2}.
\]
Then, there exists a constant $C$ such that, for $Y \sim N(0,\Sigma)$,
\[
\sup_{R \in \mathcal{R}} \Bigl| P(\sqrt{n}\,\widehat{\Lambda}^{-1/2} S_n \in R) - P(\Lambda^{-1/2} Y \in R) \Bigr| \le C n^{-(\zeta_1 \wedge \zeta_2)/3}
\]
and
\[
P\Bigl\{ \sup_{R \in \mathcal{R}} \Bigl| P_{\mid X_{I_{n,K}}}(\sqrt{n}\,\widehat{\Lambda}^{-1/2} S_n^{MB} \in R) - P(\Lambda^{-1/2} Y \in R) \Bigr| \le C n^{-(\zeta_1 \wedge \zeta_2)/3} \Bigr\} \ge 1 - C n^{-1},
\]
where the same convention on the constant $C$ as in Theorem 4 applies. The proof is analogous to that of Corollary 1 and thus omitted.

4. Simulation Studies
4.1. Uniform Coverage under Multiway Clustering. In this section, we present simulation studies to evaluate the finite sample performance of the proposed multiplier bootstrap method for multiway clustering. For the simulation designs, we use two-way and three-way clustered sampling. With $\Sigma_Z$ denoting the $p \times p$ covariance matrix with $4^{-|r-c|}$ in its $(r,c)$-th position, two-way clustered samples are generated according to
\[
X_i = \frac{1}{4}\bigl( Z_{(i_1,0)} + Z_{(0,i_2)} \bigr) + \frac{1}{2} Z_{(i_1,i_2)},
\]
where (i) $Z_{i \odot e} \sim N(\mathbf{0}, \Sigma_Z)$ independently for $i \in \{(i_1,i_2) \in \mathbb{N}^2 : 1 \le i_1 \le N_1,\ 1 \le i_2 \le N_2\}$ and $e \in \{0,1\}^2$ in one design, and (ii) $Z_{i \odot e} \sim B\,N(\mathbf{0}, \Sigma_Z) + (1-B)\,N(\mathbf{0}, 9\Sigma_Z)$ with $B \sim \mathrm{Bernoulli}(0.5)$, independently for the same index sets, in the other design. Likewise, three-way clustered samples are generated according to
\[
X_i = \frac{1}{12}\bigl( Z_{(i_1,0,0)} + Z_{(0,i_2,0)} + Z_{(0,0,i_3)} + Z_{(i_1,i_2,0)} + Z_{(i_1,0,i_3)} + Z_{(0,i_2,i_3)} \bigr) + \frac{1}{2} Z_{(i_1,i_2,i_3)},
\]
where (i) $Z_{i \odot e} \sim N(\mathbf{0}, \Sigma_Z)$ independently for $i \in \{(i_1,i_2,i_3) \in \mathbb{N}^3 : 1 \le i_1 \le N_1,\ 1 \le i_2 \le N_2,\ 1 \le i_3 \le N_3\}$ and $e \in \{0,1\}^3$ in one design, and (ii) $Z_{i \odot e} \sim B\,N(\mathbf{0}, \Sigma_Z) + (1-B)\,N(\mathbf{0}, 9\Sigma_Z)$ with $B \sim \mathrm{Bernoulli}(0.5)$, independently for the same index sets, in the other design. For each of these data generating designs, we run 2,500 Monte Carlo iterations to compute the uniform coverage frequencies of $E[X_i]$ for the nominal probabilities of 80%, 90%, and 95%, using our proposed multiplier bootstrap for multiway clustering with 2,500 bootstrap iterations.

Tables 1 and 2 show simulation results for two-way cluster sampled data and three-way cluster sampled data, respectively. The columns consist of the dimension $p$ of $X$, and the two-way sample size $(N_1, N_2)$ or the three-way sample size $(N_1, N_2, N_3)$. The displayed numbers indicate the simulated uniform coverage frequencies for the nominal probabilities of 80%, 90%, and 95%. For each dimension $p \in \{25, 50, 100\}$, sample sizes vary as $(N_1,N_2) \in \{(25,25), (50,50), (100,100)\}$ in Table 1, and as $(N_1,N_2,N_3) \in \{(25,25,25), (50,50,50), (100,100,100)\}$ in Table 2. Observe that, for each nominal probability, the uniform coverage frequencies approach the nominal probability as the sample size increases. These results support the theoretical properties of our multiplier bootstrap method. We ran many other sets of simulations, with various designs and sample sizes, that are not presented here, but this pattern of support for our theory remains invariant across all of them.

4.2. Uniform Coverage under Polyadic Data.
In this section, we present simulation studies to evaluate the finite sample performance of the proposed multiplier bootstrap method for polyadic data. We focus on the most common case in practice, namely dyadic data, i.e., $K = 2$. With $\Sigma_Z$ denoting the $p \times p$ covariance matrix with $4^{-|r-c|}$ in its $(r,c)$-th position, dyadic samples are generated according to
\[
X_{(i,j)} = \frac{1}{4}\bigl( Z_{(i,0)} + Z_{(j,0)} \bigr) + \frac{1}{2} Z_{(i,j)},
\]
where (i) $Z_{i \odot e} \sim N(\mathbf{0}, \Sigma_Z)$ independently for $i \in \{(i,j) \in \mathbb{N}^2 : 1 \le i, j \le n,\ i \neq j\}$ and $e \in \{1\} \times \{0,1\}$ in one design, and (ii) $Z_{i \odot e} \sim B\,N(\mathbf{0}, \Sigma_Z) + (1-B)\,N(\mathbf{0}, 9\Sigma_Z)$ with $B \sim \mathrm{Bernoulli}(0.5)$, independently for the same index sets, in the other design. We run 2,500 Monte Carlo iterations to compute the uniform coverage frequencies of $S_n$ for the nominal probabilities of 80%, 90%, and 95%, using our proposed multiplier bootstrap for polyadic data with 2,500 bootstrap iterations.

Table 3 shows the simulation results. The columns consist of the dimension $p$ of $X$ and the dyadic sample size $n$. The displayed numbers indicate the simulated uniform coverage frequencies for the nominal probabilities of 80%, 90%, and 95%. For each dimension $p \in \{25, 50, 100\}$, sample sizes vary as $n \in \{50, 100, 200\}$. Observe that, for each nominal probability, the uniform coverage frequencies approach the nominal probability as the sample size increases. These results support the theoretical properties of our multiplier bootstrap method. We ran many other sets of simulations, with various designs and sample sizes, that are not presented here, but this pattern of support for our theory remains invariant across all of them.

Table 1. Simulation results for two-way ($K = 2$) cluster sampled data. Displayed are the dimension $p$ of $X$, the two-way sample size $(N_1, N_2)$ with $N_1 = N_2$, and the simulated uniform coverage frequencies for the nominal probabilities of 80%, 90%, and 95%.

Distribution of $Z_{i \odot e}$: (i) Gaussian; Normalization: No
  Dimension $p$:      25     25     25     50     50     50    100    100    100
  Sample size $N_1=N_2$: 25     50    100     25     50    100     25     50    100
  80% Coverage     0.834  0.834  0.807  0.838  0.829  0.794  0.864  0.815  0.813
  90% Coverage     0.928  0.921  0.909  0.935  0.925  0.906  0.943  0.916  0.910
  95% Coverage     0.973  0.964  0.955  0.973  0.963  0.954  0.976  0.962  0.960

Distribution of $Z_{i \odot e}$: (i) Gaussian; Normalization: Yes
  Dimension $p$:      25     25     25     50     50     50    100    100    100
  Sample size $N_1=N_2$: 25     50    100     25     50    100     25     50    100
  80% Coverage     0.753  0.776  0.788  0.740  0.783  0.793  0.698  0.758  0.791
  90% Coverage     0.876  0.889  0.895  0.860  0.882  0.900  0.834  0.876  0.896
  95% Coverage     0.933  0.943  0.947  0.921  0.938  0.947  0.902  0.936  0.948

Distribution of $Z_{i \odot e}$: (ii) Mixture; Normalization: No
  Dimension $p$:      25     25     25     50     50     50    100    100    100
  Sample size $N_1=N_2$: 25     50    100     25     50    100     25     50    100
  80% Coverage     0.824  0.817  0.803  0.859  0.841  0.814  0.864  0.828  0.814
  90% Coverage     0.927  0.908  0.905  0.942  0.931  0.919  0.943  0.910  0.917
  95% Coverage     0.967  0.954  0.956  0.976  0.968  0.960  0.973  0.957  0.962

Distribution of $Z_{i \odot e}$: (ii) Mixture; Normalization: Yes
  Dimension $p$:      25     25     25     50     50     50    100    100    100
  Sample size $N_1=N_2$: 25     50    100     25     50    100     25     50    100
  80% Coverage     0.747  0.772  0.785  0.716  0.768  0.783  0.711  0.776  0.789
  90% Coverage     0.861  0.882  0.891  0.848  0.878  0.887  0.841  0.884  0.888
  95% Coverage     0.925  0.939  0.940  0.912  0.938  0.944  0.914  0.941  0.942
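As an illustration, the Gaussian dyadic design of Section 4.2 can be generated as follows; the mixture variant would only change the draw of $Z$, and the seed and sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 25

# Sigma_Z with (r, c) entry 4^{-|r - c|} (an AR(1)-type Toeplitz matrix, rho = 1/4)
r, c = np.indices((p, p))
Sigma_Z = 4.0 ** (-np.abs(r - c))
L = np.linalg.cholesky(Sigma_Z)

def draw_Z(size):
    # size independent N(0, Sigma_Z) draws, one p-vector per row
    return rng.standard_normal(size=(size, p)) @ L.T

Z_i = draw_Z(n)                           # individual effects Z_{(i,0)}
Z_ij = draw_Z(n * n).reshape(n, n, p)     # pair effects Z_{(i,j)}

# Dyadic observations X_{(i,j)} = (1/4)(Z_{(i,0)} + Z_{(j,0)}) + (1/2) Z_{(i,j)}
X = 0.25 * (Z_i[:, None, :] + Z_i[None, :, :]) + 0.5 * Z_ij
```

The shared individual effects $Z_{(i,0)}$ and $Z_{(j,0)}$ are what make all observations sharing an index dependent, which is the dependence the bootstrap must account for.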
Table 2. Simulation results for three-way ($K = 3$) cluster sampled data. Displayed are the dimension $p$ of $X$, the three-way sample size $(N_1, N_2, N_3)$ with $N_1 = N_2 = N_3$, and the simulated uniform coverage frequencies for the nominal probabilities of 80%, 90%, and 95%.

Distribution of $Z_{i \odot e}$: (i) Gaussian; Normalization: No
  Dimension $p$:         25     25     25     50     50     50    100    100    100
  Sample size $N_1=N_2=N_3$: 25     50    100     25     50    100     25     50    100
  80% Coverage        0.819  0.808  0.805  0.834  0.812  0.808  0.843  0.817  0.813
  90% Coverage        0.912  0.912  0.910  0.932  0.914  0.908  0.929  0.918  0.902
  95% Coverage        0.952  0.958  0.951  0.971  0.958  0.956  0.973  0.962  0.956

Distribution of $Z_{i \odot e}$: (i) Gaussian; Normalization: Yes
  Dimension $p$:         25     25     25     50     50     50    100    100    100
  Sample size $N_1=N_2=N_3$: 25     50    100     25     50    100     25     50    100
  80% Coverage        0.777  0.780  0.792  0.768  0.789  0.785  0.732  0.768  0.797
  90% Coverage        0.879  0.884  0.892  0.874  0.890  0.888  0.852  0.878  0.898
  95% Coverage        0.938  0.936  0.953  0.939  0.944  0.935  0.925  0.935  0.945

Distribution of $Z_{i \odot e}$: (ii) Mixture; Normalization: No
  Dimension $p$:         25     25     25     50     50     50    100    100    100
  Sample size $N_1=N_2=N_3$: 25     50    100     25     50    100     25     50    100
  80% Coverage        0.829  0.823  0.810  0.824  0.822  0.810  0.852  0.818  0.803
  90% Coverage        0.921  0.916  0.904  0.923  0.915  0.908  0.946  0.913  0.908
  95% Coverage        0.964  0.958  0.952  0.960  0.958  0.956  0.974  0.959  0.958

Distribution of $Z_{i \odot e}$: (ii) Mixture; Normalization: Yes
  Dimension $p$:         25     25     25     50     50     50    100    100    100
  Sample size $N_1=N_2=N_3$: 25     50    100     25     50    100     25     50    100
  80% Coverage        0.777  0.786  0.776  0.779  0.767  0.789  0.741  0.763  0.796
  90% Coverage        0.887  0.891  0.890  0.885  0.880  0.895  0.859  0.878  0.894
  95% Coverage        0.940  0.943  0.940  0.939  0.938  0.943  0.924  0.939  0.946

5. Applications
In this section, we illustrate three applications of our proposed methods and theories. Section 5.1 presents robust inference in demand analysis for differentiated products markets with market share data. Section 5.2 presents robust inference in extended gravity analysis with trade data. Section 5.3 presents a penalty choice for the Lasso and the performance of the resulting estimate.

Table 3. Simulation results for dyadic data. Displayed are the dimension $p$ of $X$, the dyadic sample size $n$, and the simulated uniform coverage frequencies for the nominal probabilities of 80%, 90%, and 95%.

Distribution of $Z_{i \odot e}$: (i) Gaussian; Normalization: No
  Dimension $p$:    25     25     25     50     50     50    100    100    100
  Sample size $n$:  50    100    200     50    100    200     50    100    200
  80% Coverage   0.791  0.782  0.800  0.792  0.798  0.795  0.803  0.805  0.801
  90% Coverage   0.902  0.898  0.901  0.909  0.898  0.905  0.909  0.906  0.912
  95% Coverage   0.953  0.954  0.950  0.958  0.951  0.956  0.956  0.954  0.957

Distribution of $Z_{i \odot e}$: (i) Gaussian; Normalization: Yes
  Dimension $p$:    25     25     25     50     50     50    100    100    100
  Sample size $n$:  50    100    200     50    100    200     50    100    200
  80% Coverage   0.713  0.744  0.780  0.664  0.736  0.770  0.621  0.718  0.768
  90% Coverage   0.837  0.869  0.889  0.806  0.854  0.886  0.780  0.845  0.876
  95% Coverage   0.918  0.928  0.946  0.887  0.923  0.943  0.867  0.915  0.942

Distribution of $Z_{i \odot e}$: (ii) Mixture; Normalization: No
  Dimension $p$:    25     25     25     50     50     50    100    100    100
  Sample size $n$:  50    100    200     50    100    200     50    100    200
  80% Coverage   0.777  0.781  0.786  0.778  0.798  0.786  0.794  0.801  0.797
  90% Coverage   0.884  0.902  0.894  0.904  0.908  0.892  0.911  0.899  0.899
  95% Coverage   0.948  0.953  0.952  0.960  0.958  0.950  0.957  0.953  0.954

Distribution of $Z_{i \odot e}$: (ii) Mixture; Normalization: Yes
  Dimension $p$:    25     25     25     50     50     50    100    100    100
  Sample size $n$:  50    100    200     50    100    200     50    100    200
  80% Coverage   0.697  0.762  0.763  0.659  0.734  0.756  0.615  0.720  0.746
  90% Coverage   0.824  0.870  0.878  0.807  0.863  0.870  0.773  0.857  0.870
  95% Coverage   0.901  0.927  0.941  0.884  0.928  0.925  0.870  0.921  0.933

5.1. Robust inference in demand analysis with market share data. Market share data used for demand analysis in differentiated products markets naturally exhibit two-way clustering due to the economic structure of supply and demand. Typical market share data are double-indexed by products $i$ and markets $j$. Observations are generally dependent across markets $j$ due to a common supply shock generated by the producer of product $i$. Observations are also generally dependent across products $i$ due to a common demand shock in market $j$. We illustrate an application of our proposed theory to the frontier approach of robust identification for demand models using this type of data.

Following Berry (1994) and Berry et al. (1995), consider a model of demand for $N_1$ products indexed by $i = 1,\dots,N_1$, with an outside option $i = 0$, in $N_2$ markets. Consumer $c$ derives utility $u_{cij} = \delta_{ij} + \epsilon_{cij}$ from product $i$ in market $j$, where $\delta_{ij}$ is the mean utility and $\epsilon_{cij}$ denotes an idiosyncratic shock with the type-I extreme value distribution. The mean utility is in turn modeled by $\delta_{ij} = X_{ij}^T \theta + \eta_{ij}$, where $X_{ij}$ is a vector of observed product and market characteristics and $\eta_{ij}$ denotes unobserved characteristics. Suppose that each consumer $c$ in market $j$ chooses the product $i$ yielding the highest utility, i.e., $s_{cij} = 1\{ u_{cij} \ge u_{ci'j} \text{ for all } i' = 0, 1, \dots, N_1 \}$. Aggregation yields the product share $\pi_{ij} = E[s_{cij} \mid \delta_{1j},\dots,\delta_{N_1 j}]$. The standard market share inversion in turn yields the mean utility $\delta_{ij} = \log \pi_{ij} - \log \pi_{0j}$. Suppose that we observe instrumental variables $z_{ij}$ that are mean orthogonal to the unobserved characteristics $\eta_{ij}$.
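The share inversion above is elementary but fragile when shares approach zero; a minimal sketch with made-up shares is:

```python
import numpy as np

# Market shares pi[i, j] for products i = 1, 2, 3 in markets j = 1, 2,
# with pi0[j] the outside-option share in market j (made-up numbers).
pi = np.array([[0.20, 0.10],
               [0.15, 0.25],
               [0.05, 0.30]])
pi0 = 1.0 - pi.sum(axis=0)       # shares sum to one within each market

# Berry inversion: delta_ij = log(pi_ij) - log(pi_0j)
delta = np.log(pi) - np.log(pi0)
```

If some $\pi_{ij}$ is zero or near zero, $\log \pi_{ij}$ diverges, which is precisely the motivation for replacing point identification by the bounds $\delta_{ij}^u$ and $\delta_{ij}^\ell$ of Gandhi et al. (2020) discussed next.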
Specifically, they derive upper and lower bounds of mean utilityfunctions, denoted by δ uij and δ ℓij , respectively, and propose a family of moment inequalities of theform H ( θ ) : ( E [( X Tij θ − δ uij ) g ( z ij )] ≤ E [ − ( X Tij θ − δ ℓij ) g ( z ij )] ≤ g ∈ G in an infinite set G of non-negative instrumental functions – see Gandhi et al. (2020).Applying our proposed method in Section 2, we may conduct inference for the utility parameters θ under two-way clustered market share data in the following manner. Define the p -dimensionalrandom vector X ij ( θ ) by X ij ( θ ) = (cid:0) ( X Tij θ − δ uij ) g ( z ij ) , ( δ ℓij − X Tij θ ) g ( z ij ) , . . . , ( X Tij θ − δ uij ) g p/ ( z ij ) , ( δ ℓij − X Tij θ ) g p/ ( z ij ) (cid:1) T for an increasing number p/ { g , . . . , g p/ } ⊂ G . Define the test statistic T N ( θ ) = max { S N ( θ ) } (or its normalized version), where S N ( θ ) = ( N N ) − P ( i,j ) ∈ [ N ] X ij ( θ ). Toapproximate the distribution of T N ( θ ), let ˆ W ,i ( θ ) = N − P N j =1 X ij ( θ ) − S N ( θ ) and ˆ W ,j ( θ ) = N − P N i =1 X ij ( θ ) − S N ( θ ) . Construct the multiplier process S MB N ( θ ) = N − P N i =1 ξ ,i ˆ W ,i ( θ ) + N − P N j =1 ξ ,j ˆ W ,j ( θ ), where { ξ ,i } and { ξ ,j } are independent N (0 ,
1) random variables indepen-dent of the data. Let c (1 − α ; θ ) denote the conditional (1 − α )-quantile of max { S MB N ( θ ) } . Our testrejects the null hypothesis H ( θ ) if T N ( θ ) > c (1 − α ; θ ). Inverting this test provides a confidenceregion for the utility parameters θ .5.2. Robust inference in extended gravity analysis with trade data.
Trade data used for gravity analysis naturally exhibit two-way clustering due to the economic structure of supply and demand. Typical trade data are double-indexed by exporters $i$ and importers $j$. Observations are generally dependent across importers $j$ due to a common supply shock generated by the exporter $i$. Observations are also generally dependent across exporters $i$ due to a common demand shock in the destination $j$. We illustrate an application of our proposed theory to the frontier approach of robust identification in extended gravity analysis using this type of data.

Morales et al. (2019) introduce an extended gravity model with an implied static profit function $\pi_{ijt}(\cdot)$ that firm $i$ receives from exporting to country $j$ in year $t$, where $\pi_{ijt}(\cdot)$ takes the structural parameters $\theta$ as arguments. Write $\pi_{ijj't}(\theta) = \pi_{ijt}(\theta) - \pi_{ij't}(\theta)$. We are assumed to know the set $\mathcal{A}_{ijt}$ of all the countries $j'$ that share the same cost structure as country $j$ from the viewpoint of firm $i$ in year $t$. Let $d_{ijt}$ denote the indicator that firm $i$ exports to country $j$ in year $t$, let $z_{ijt}$ denote a vector of variables including components of costs that depend on gravity and extended gravity variables, and let $\delta$ denote the rate of future discounting. In this setting, Morales et al. (2019) propose a family of moment inequalities of the form
\[
H_0(\theta) : \quad E\Biggl[ \sum_{j' \in \mathcal{A}_{ijt}} g(z_{ijt}, z_{ij't})\, d_{ijt} (1 - d_{ij't}) \bigl( \pi_{ijj't}(\theta) + \delta\, \pi_{ijj'(t+1)}(\theta) \bigr) \Biggr] \ge 0 \quad \text{for all } g \in \mathcal{G},
\]
for a set $\mathcal{G}$ of non-negative functions satisfying certain restrictions; see Morales et al. (2019).

Define the $p$-dimensional random vector $X_{ij}(\theta)$ by
\[
X_{ij}(\theta) = \begin{pmatrix}
-\sum_{j' \in \mathcal{A}_{ijt}} g_1(z_{ijt}, z_{ij't})\, d_{ijt} (1 - d_{ij't}) \bigl( \pi_{ijj't}(\theta) + \delta\, \pi_{ijj'(t+1)}(\theta) \bigr) \\
\vdots \\
-\sum_{j' \in \mathcal{A}_{ijt}} g_p(z_{ijt}, z_{ij't})\, d_{ijt} (1 - d_{ij't}) \bigl( \pi_{ijj't}(\theta) + \delta\, \pi_{ijj'(t+1)}(\theta) \bigr)
\end{pmatrix}
\]
for an increasing number $p$ of instrumental functions $\{g_1,\dots,g_p\} \subset \mathcal{G}$. Define the test statistic $T_N(\theta) = \max_{1 \le \ell \le p} \sqrt{n}\, S_N^{\ell}(\theta)$ (or its normalized version), where $S_N(\theta) = (N_1 N_2)^{-1} \sum_{(i,j) \in [N_1] \times [N_2]} X_{ij}(\theta)$. Then, confidence regions for $\theta$ can be constructed as in the preceding subsection.

5.3. Penalty choice for the Lasso under multiway clustering.
Consider a regression model
\[
Y_i = f(Z_i) + \varepsilon_i, \quad E[\varepsilon_i \mid Z_i] = 0, \quad i \in [\mathbf{N}],
\]
where $[\mathbf{N}] := \prod_{k=1}^{K}\{1,\dots,N_k\}$, $Y_i$ is a scalar outcome variable, $Z_i \in \mathbb{R}^d$ is a $d$-dimensional vector of covariates, $f : \mathbb{R}^d \to \mathbb{R}$ is an unknown regression function of interest, and $\varepsilon_i$ is an error term. We approximate $f$ by a linear combination of technical controls $X_i = P(Z_i)$ for some transformation $P : \mathbb{R}^d \to \mathbb{R}^p$, i.e.,
\[
f(Z_i) = X_i^T \beta + r_i, \quad i \in [\mathbf{N}],
\]
where $r_i$ is a bias term. The dimension $p$ can be much larger than the cluster sizes $N_k$, but we assume that the vector $\beta \in \mathbb{R}^p$ is sparse in the sense that $\|\beta\|_0 = s \ll n$. Suppose that the array $((Y_i, Z_i^T)^T)_{i \in \mathbb{N}^K}$ is separately exchangeable and generated as
\[
(Y_i, Z_i^T)^T = g\bigl((U_{i \odot e})_{e \in \{0,1\}^K \setminus \{0\}}\bigr), \quad i \in \mathbb{N}^K, \qquad \{U_{i \odot e} : i \in \mathbb{N}^K,\, e \in \{0,1\}^K \setminus \{0\}\} \overset{\text{i.i.d.}}{\sim} U[0,1],
\]
for some Borel measurable map $g : [0,1]^{2^K - 1} \to \mathbb{R}^{d+1}$.

Arguably, one of the most popular estimation methods for such a high-dimensional regression problem is the Lasso (Tibshirani, 1996); we refer to Bühlmann and van de Geer (2011), Giraud (2015), and Wainwright (2019) as standard references on high-dimensional statistics. Let $N = \prod_{k=1}^{K} N_k$ denote the total sample size. The Lasso estimate of $\beta$ is defined by
\[
\hat{\beta}_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{N} \sum_{i \in [\mathbf{N}]} (Y_i - X_i^T \beta)^2 + \lambda \|\beta\|_1,
\]
where $\lambda > 0$ is a penalty level. We estimate $f = (f_i)_{i \in [\mathbf{N}]} = (f(Z_i))_{i \in [\mathbf{N}]}$ by $\hat{f}_\lambda = (X_i^T \hat{\beta}_\lambda)_{i \in [\mathbf{N}]}$. Let $\|t\|_{N,2} = (N^{-1} \sum_{i \in [\mathbf{N}]} t_i^2)^{1/2}$ for $t = (t_i)_{i \in [\mathbf{N}]}$.

In what follows, we discuss the statistical performance of the Lasso estimate. Following Bickel et al. (2009), we say that Condition RE($s, c_0$) holds (RE refers to "restricted eigenvalue") if, for a given positive constant $c_0 \ge 1$, the inequality
\[
\kappa(s, c_0) = \min_{\substack{J \subset \{1,\dots,p\} \\ 1 \le |J| \le s}}\ \inf_{\substack{\theta \in \mathbb{R}^p,\ \theta \neq 0 \\ \|\theta_{J^c}\|_1 \le c_0 \|\theta_J\|_1}} \frac{\sqrt{N^{-1} \sum_{i \in [\mathbf{N}]} (\theta^T X_i)^2}}{\|\theta_J\|_2} > 0
\]
holds, where $J^c = \{1,\dots,p\} \setminus J$. Here, for $\theta = (\theta_1,\dots,\theta_p)^T$ and $J \subset \{1,\dots,p\}$, $\theta_J = (\theta_j)_{j \in J}$. Keep in mind that, as the covariates are random, the restricted eigenvalue $\kappa(s, c_0)$ is random as well. Theorem 1 of Belloni and Chernozhukov (2013) implies that if, for a given $c > 1$,
- $\lambda \ge 2c \|S_N\|_\infty$ with $S_N = N^{-1} \sum_{i \in [\mathbf{N}]} \varepsilon_i X_i$, and
- Condition RE($s, c_0$) holds with $c_0 = (c+1)/(c-1)$,
then the following nonasymptotic bound holds with $\kappa = \kappa(s, c_0)$:
\[
\|\hat{f}_\lambda - f\|_{N,2} \le 2 \|r\|_{N,2} + \Bigl( 1 + \frac{1}{2c} \Bigr) \frac{\lambda \sqrt{s}}{\kappa}.
\]
To ensure that $\lambda \ge 2c \|S_N\|_\infty$ with high probability, say $1 - \eta$ for some small $\eta > 0$, we will choose $\lambda$ to be an estimate of the $(1-\eta)$-quantile of $2c \|S_N\|_\infty$. To this end, we first estimate the error terms $\varepsilon_i$ by pre-estimating $\beta$ with the preliminary Lasso estimate $\tilde{\beta} = \hat{\beta}_{\lambda_0}$ with penalty $\lambda_0 = \tau_n (n^{-1} \log p)^{1/2}$ for some slowly growing sequence $\tau_n \to \infty$. In the following, we take $\tau_n = \log n$ for the sake of simplicity, but other choices also work. We apply the multiplier bootstrap to $\tilde{S}_N = N^{-1} \sum_{i \in [\mathbf{N}]} \tilde{\varepsilon}_i X_i$ instead of $S_N$.

We note that the Hájek projection of $S_N$ is given by $\sum_{k=1}^{K} N_k^{-1} \sum_{i_k=1}^{N_k} V_{k,i_k}$, where
\[
V_{k,i_k} = E\bigl[ \varepsilon_{(1,\dots,1,i_k,1,\dots,1)} X_{(1,\dots,1,i_k,1,\dots,1)} \bigm| U_{(0,\dots,0,i_k,0,\dots,0)} \bigr].
\]
We estimate $V_{k,i_k}$ by
\[
\tilde{V}_{k,i_k} = \Bigl( \prod_{k' \neq k} N_{k'} \Bigr)^{-1} \sum_{i_1,\dots,i_{k-1},i_{k+1},\dots,i_K} \tilde{\varepsilon}_i X_i.
\]
Let $\{\xi_{1,i_1}\}_{i_1=1}^{N_1}, \dots, \{\xi_{K,i_K}\}_{i_K=1}^{N_K}$ be i.i.d. $N(0,1)$ variables independent of the data, and consider
\[
\Lambda_N^{\xi} = \Biggl\| \sum_{k=1}^{K} \frac{1}{N_k} \sum_{i_k=1}^{N_k} \xi_{k,i_k} \bigl( \tilde{V}_{k,i_k} - \tilde{S}_N \bigr) \Biggr\|_\infty.
\]
We propose to choose $\lambda$ as
\[
\lambda = \lambda(\eta) = 2c\, \Lambda_N^{\xi}(1-\eta),
\]
where $\Lambda_N^{\xi}(1-\eta)$ denotes the conditional $(1-\eta)$-quantile of $\Lambda_N^{\xi}$. We allow $\eta$ to decrease with $n$, i.e., $\eta = \eta_n \to 0$. The following proposition establishes the validity of this choice of $\lambda$ (as $n \to \infty$) under multiway clustering. In what follows, we understand that $s, p, N_k, \eta$ are functions of $n$, while other parameters such as $c, q, \kappa$ are independent of $n$.

Proposition 3 (Penalty choice for the Lasso under multiway clustering). Suppose that: (i) there exist a constant $q \in [4,\infty)$ independent of $n$ and $D_N$ that may depend on $\mathbf{N}$ (and thus on $n$) such that $E[|\varepsilon_{\mathbf{1}}|^q] \vee E[\|X_{\mathbf{1}}\|_\infty^q] \le D_N^q$ and $\max_{1 \le j \le p} \max_{1 \le k \le K} E[|V_{k,1}^j|^{2+\ell}] \le D_N^\ell$ for $\ell = 1,2$; (ii) $E[|V_{k,1}^j|^2]$ is bounded and bounded away from zero uniformly in $1 \le j \le p$ and $1 \le k \le K$; (iii) there exists a positive constant $\kappa$ independent of $n$ such that $\kappa(s, c_0) \ge \kappa$ with probability $1 - o(1)$; (iv) as $n \to \infty$, $\|r\|_{N,2} = O(\sqrt{(s \log p)/n})$ and
\[
\frac{s N^{2/q} D_N^2 \log^2 (pN)}{n} \vee \frac{D_N^2 \log^3 (pn)}{n^{1-2/q}} = o(1).
\]
Then, we have $\lambda \ge 2c \|S_N\|_\infty$ with probability $1 - \eta - o(1)$. Further, we have
\[
\lambda = O_P\Biggl( \sqrt{\frac{\log p}{n}} \vee \sqrt{\frac{\log(1/\eta)}{n}} \Biggr).
\]
Consequently, if we take $\eta = \eta_n \to 0$, we have
\[
\|\hat{f}_\lambda - f\|_{N,2} = O_P\Biggl( \sqrt{\frac{s \log p}{n}} \vee \sqrt{\frac{s \log(1/\eta)}{n}} \Biggr).
\]

The proof of Proposition 3 does not follow directly from the results of Section 2, as we have to take care of the estimation error of the preliminary Lasso estimate $\tilde{\beta}$, which requires extra work. Condition (iii) in the preceding proposition is a high-level condition on the sample Gram matrix. The following proposition provides primitive sufficient conditions for Condition (iii) to hold in the two-way clustering case, i.e., $K = 2$.

Proposition 4 (RE condition under two-way clustering, $K = 2$). Consider $K = 2$ and let $B_N = \sqrt{E[\max_{i \in [\mathbf{N}]} \|X_i\|_\infty^2]}$. Suppose that the eigenvalues of $E[X_{\mathbf{1}} X_{\mathbf{1}}^T]$ are bounded and bounded away from zero, and $s B_N^2 \log^2 (pN) = o(n)$. Then, there exists a positive constant $\kappa$ independent of $n$ such that $\kappa(s, c_0) \ge \kappa$ with probability $1 - o(1)$.

Under Condition (i) of Proposition 3, $B_N \le N^{1/q} D_N$, so that $s B_N^2 \log^2 (pN) = o(n)$ reduces to $s N^{2/q} D_N^2 \log^2 (pN) = o(n)$, which is implied by Condition (iv) of Proposition 3. The proof of Proposition 4 relies on Lemma 2.7 in Lecué and Mendelson (2017) and an extension of Lemma P.1 in Belloni et al. (2018), whose proof in turn relies on the techniques in Rudelson and Vershynin (2008), from the i.i.d. case to two-way clustering.

Remark 6 (Column standardization). For interpretability of the Lasso estimate, in practice, we often rescale the penalty by the weighted $\ell_1$-norm (as in Belloni and Chernozhukov (2011) in the quantile regression case) to make sure that the coefficients are penalized in a comparable manner. All the results in this section continue to hold under this practice, as the conditions assumed in Proposition 3 guarantee that the sample second moment of each covariate is consistent uniformly over the coordinates.

6. Summary
Empirical data used for economic analysis are often clustered in two or more ways, where one source of dependence across units of demand is a common supply shock, and the other source of dependence across units of supply is a common demand shock. When the set of agents generating the supply and the set of agents generating the demand are different, such data are separately exchangeable, or two-way clustered. Examples include market share data. When the set of agents generating the supply and the set of agents generating the demand are the same, such data are jointly exchangeable, or dyadic. Examples include international trade data.

In this paper, for both separately exchangeable data and jointly exchangeable data, we develop methods and theories for inference about multi-dimensional, increasing-dimensional, and high-dimensional parameters. Based on non-asymptotic Gaussian approximation error bounds for the test statistic over hyper-rectangles, we propose bootstrap methods and establish their finite sample validity. Simulation studies support the theoretical properties of the methods.

Three applications of the proposed methods are illustrated. For demand analysis with two-way clustered data consisting of $N_1$ products and $N_2$ markets, Gandhi et al. (2020) derive high-dimensional moment inequalities. Similarly, for extended gravity analysis with two-way clustered data consisting of $N_1$ firms and $N_2$ countries, Morales et al. (2019) derive high-dimensional moment inequalities. With our theory for approximating the distribution of a multiway sample mean of a high-dimensional random vector, inverting the Kolmogorov-Smirnov-type test allows for inference about the structural parameters in these two settings. Finally, we also demonstrate an application of our proposed method to the choice of the penalty tuning parameter for $\ell_1$-penalized regression under multiway cluster sampling.

Appendix A. Maximal Inequalities for Multiway Clustering
In this section, we develop maximal inequalities for separately exchangeable arrays. As in Section 2, let $(X_i)_{i \in \mathbb{N}^K}$ be a $K$-array consisting of random vectors in $\mathbb{R}^p$ with mean zero, generated by the structure (2.1), i.e., $X_i = g((U_{i \odot e})_{e \in \{0,1\}^K \setminus \{0\}})$ for $i \in \mathbb{N}^K$. We will follow the notations used in Section 2. The following theorem is fundamental.

Theorem 5.
Pick any $1 \le k \le K$ and $e \in E_k$. Then, for any $q \in [1,\infty)$, we have
\[
\Biggl( E\Biggl[ \Bigl\| \sum_{i \in I_e([\mathbf{N}])} \hat{X}_i \Bigr\|_\infty^q \Biggr] \Biggr)^{1/q} \le C (\log p)^{k/2} \Biggl( E\Biggl[ \max_{1 \le j \le p} \Bigl( \sum_{i \in I_e([\mathbf{N}])} |\hat{X}_i^j|^2 \Bigr)^{q/2} \Biggr] \Biggr)^{1/q},
\]
where $C$ is a constant that depends only on $q$ and $K$.

The following corollary is immediate from Jensen's inequality.
Corollary 3 (Global maximal inequality). For any $1 \le k \le K$, $e \in E_k$, and $q \in [1,\infty)$, we have
\[
\Biggl( E\Biggl[ \Bigl\| \sum_{i \in I_e([\mathbf{N}])} \hat{X}_i \Bigr\|_\infty^q \Biggr] \Biggr)^{1/q} \le C (\log p)^{k/2} \sqrt{\prod_{k' \in \mathrm{supp}(e)} N_{k'}}\ \Bigl( E\bigl[\| \hat{X}_{\mathbf{1} \odot e} \|_\infty^{q \vee 2}\bigr] \Bigr)^{1/(q \vee 2)}, \tag{A.1}
\]
where $C$ is a constant that depends only on $q$ and $K$.

Remark 7. By Jensen's inequality, $E[\|\hat{X}_{\mathbf{1} \odot e}\|_\infty^{q \vee 2}]$ on the right-hand side of (A.1) can be replaced by $E[\|X_{\mathbf{1}}\|_\infty^{q \vee 2}]$ by adjusting the constant $C$.

The proof of Theorem 5 relies on the following symmetrization inequality. Recall that a Rademacher random variable is a random variable taking $\pm 1$ with equal probability.

Lemma 2 (Symmetrization). Pick any $1 \le k \le K$. Let $\{\epsilon_{1,i_1}\}, \dots, \{\epsilon_{k,i_k}\}$ be independent Rademacher random variables independent of the $U$-variables. Then, for any nondecreasing convex function $\Phi : [0,\infty) \to [0,\infty)$, we have
\[
E\Biggl[ \Phi\Biggl( \Bigl\| \sum_{i_1,\dots,i_k} \hat{X}_{(i_1,\dots,i_k,1,\dots,1)} \Bigr\|_\infty \Biggr) \Biggr] \le E\Biggl[ \Phi\Biggl( 2^k \Bigl\| \sum_{i_1,\dots,i_k} \epsilon_{1,i_1} \cdots \epsilon_{k,i_k} \hat{X}_{(i_1,\dots,i_k,1,\dots,1)} \Bigr\|_\infty \Biggr) \Biggr].
\]

The proof of Lemma 2 in turn relies on the following result.
Lemma 3.
Let $\boldsymbol i\in\mathbb N^K$. Pick any $1\le k\le K$ and let $\boldsymbol e\in\mathcal E_k$. Then, for any $\ell\in\operatorname{supp}(\boldsymbol e)$, conditionally on $(U_{\boldsymbol i\odot\boldsymbol e'})_{\boldsymbol e'\le\boldsymbol e-\boldsymbol e_\ell}$, the vector $\hat X_{\boldsymbol i\odot\boldsymbol e}$ has mean zero.

Proof of Lemma 3. For illustration, consider first the $K=3$ case with $\boldsymbol e=(1,1,1)$ and $\ell=3$, for which
$$\hat X_{\boldsymbol i}=X_{\boldsymbol i}-\hat X_{(i_1,i_2,0)}-\hat X_{(0,i_2,i_3)}-\hat X_{(i_1,0,i_3)}-\hat X_{(i_1,0,0)}-\hat X_{(0,i_2,0)}-\hat X_{(0,0,i_3)}.$$
Given $(U_{(i_1,0,0)},U_{(0,i_2,0)},U_{(i_1,i_2,0)})$, we have
$$\mathbb E[\hat X_{(0,i_2,i_3)}\mid U_{(i_1,0,0)},U_{(0,i_2,0)},U_{(i_1,i_2,0)}]=\mathbb E[X_{\boldsymbol i}\mid U_{(0,i_2,0)}]-\mathbb E[X_{\boldsymbol i}\mid U_{(0,i_2,0)}]=0,$$
$$\mathbb E[\hat X_{(i_1,0,i_3)}\mid U_{(i_1,0,0)},U_{(0,i_2,0)},U_{(i_1,i_2,0)}]=\mathbb E[X_{\boldsymbol i}\mid U_{(i_1,0,0)}]-\mathbb E[X_{\boldsymbol i}\mid U_{(i_1,0,0)}]=0.$$
Conclude that
$$\mathbb E[\hat X_{\boldsymbol i}\mid U_{(i_1,0,0)},U_{(0,i_2,0)},U_{(i_1,i_2,0)}]=\mathbb E[X_{\boldsymbol i}\mid U_{(i_1,0,0)},U_{(0,i_2,0)},U_{(i_1,i_2,0)}]-\big(\hat X_{(i_1,i_2,0)}+\hat X_{(i_1,0,0)}+\hat X_{(0,i_2,0)}\big)=0.$$

The proof for the general case is by induction on $k$. The conclusion is trivial when $k=1$. Suppose that the lemma is true up to $k-1$. Then,
$$\begin{aligned}
&\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e}\mid(U_{\boldsymbol i\odot\boldsymbol e'})_{\boldsymbol e'\le\boldsymbol e-\boldsymbol e_\ell}]\\
&\quad=\mathbb E[X_{\boldsymbol i}\mid(U_{\boldsymbol i\odot\boldsymbol e'})_{\boldsymbol e'\le\boldsymbol e-\boldsymbol e_\ell}]-\hat X_{\boldsymbol i\odot(\boldsymbol e-\boldsymbol e_\ell)}-\sum_{\boldsymbol e'\le\boldsymbol e,\ \boldsymbol e'\ne\boldsymbol e,\boldsymbol e-\boldsymbol e_\ell}\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e'}\mid(U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e-\boldsymbol e_\ell}]\quad\text{(by the definition of }\hat X_{\boldsymbol i\odot\boldsymbol e})\\
&\quad=\sum_{\boldsymbol e'\le\boldsymbol e-\boldsymbol e_\ell,\ \boldsymbol e'\ne\boldsymbol e-\boldsymbol e_\ell}\hat X_{\boldsymbol i\odot\boldsymbol e'}-\sum_{\boldsymbol e'\le\boldsymbol e,\ \boldsymbol e'\ne\boldsymbol e,\boldsymbol e-\boldsymbol e_\ell}\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e'}\mid(U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e-\boldsymbol e_\ell}]\quad\text{(by plugging in the expansion of }\hat X_{\boldsymbol i\odot(\boldsymbol e-\boldsymbol e_\ell)})\\
&\quad=\sum_{\boldsymbol e'\le\boldsymbol e-\boldsymbol e_\ell,\ \boldsymbol e'\ne\boldsymbol e-\boldsymbol e_\ell}\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e'}\mid(U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e-\boldsymbol e_\ell}]-\sum_{\boldsymbol e'\le\boldsymbol e,\ \boldsymbol e'\ne\boldsymbol e,\boldsymbol e-\boldsymbol e_\ell}\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e'}\mid(U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e-\boldsymbol e_\ell}]\\
&\quad=-\sum_{\substack{\boldsymbol e'\le\boldsymbol e-\boldsymbol e_{\ell'},\ \ell'\ne\ell\\ \ell\in\operatorname{supp}(\boldsymbol e'),\ \ell'\in\operatorname{supp}(\boldsymbol e)}}\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e'}\mid(U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e-\boldsymbol e_\ell}].
\end{aligned}$$
Here, the third equality uses the fact that $\hat X_{\boldsymbol i\odot\boldsymbol e'}$ is $\sigma((U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e'})$-measurable, so that $\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e'}\mid(U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e-\boldsymbol e_\ell}]=\hat X_{\boldsymbol i\odot\boldsymbol e'}$ as long as $\operatorname{supp}(\boldsymbol e')\subset\operatorname{supp}(\boldsymbol e-\boldsymbol e_\ell)$. For any $\boldsymbol e'\le\boldsymbol e-\boldsymbol e_{\ell'}$ with $\ell'\ne\ell$, $\ell\in\operatorname{supp}(\boldsymbol e')$, and $\ell'\in\operatorname{supp}(\boldsymbol e)$, we have
$$\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e'}\mid(U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e-\boldsymbol e_\ell}]=\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e'}\mid(U_{\boldsymbol i\odot\boldsymbol e''})_{\boldsymbol e''\le\boldsymbol e'-\boldsymbol e_\ell}]=0$$
by the induction hypothesis. Conclude that $\mathbb E[\hat X_{\boldsymbol i\odot\boldsymbol e}\mid(U_{\boldsymbol i\odot\boldsymbol e'})_{\boldsymbol e'\le\boldsymbol e-\boldsymbol e_\ell}]=0$. $\square$
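The decomposition underlying Lemma 3 can be checked numerically. The following sketch is our own toy example, not from the paper: it builds a $K=2$ separately exchangeable array with an additive $g$, chosen so that the Hoeffding-type components have closed forms, and verifies that the decomposition is exact and that the remaining component is degenerate.

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2 = 200, 300

# Separately exchangeable toy array (K = 2), with an additive g chosen so
# the Hoeffding-type components have closed forms (our own example):
# X_{ij} = g(U_{i0}, U_{0j}, U_{ij}) = U_{i0} + U_{0j} + U_{ij} - 3/2.
U_i0 = rng.uniform(size=(N1, 1))
U_0j = rng.uniform(size=(1, N2))
U_ij = rng.uniform(size=(N1, N2))
X = U_i0 + U_0j + U_ij - 1.5                  # mean zero by construction

# Components: Xhat_{(i,0)} = E[X_{ij} | U_{i0}], Xhat_{(0,j)} = E[X_{ij} | U_{0j}],
# and the degenerate remainder Xhat_{(i,j)} = X - Xhat_{(i,0)} - Xhat_{(0,j)}.
Xhat_10 = np.broadcast_to(U_i0 - 0.5, (N1, N2))
Xhat_01 = np.broadcast_to(U_0j - 0.5, (N1, N2))
Xhat_11 = X - Xhat_10 - Xhat_01

# The decomposition (Lemma 1) is exact ...
assert np.allclose(X, Xhat_10 + Xhat_01 + Xhat_11)

# ... and the remainder is degenerate (Lemma 3): averaging it over either
# index gives conditional means that vanish at the usual root-N rate.
print(np.abs(Xhat_11.mean(axis=1)).max())
print(np.abs(Xhat_11.mean(axis=0)).max())
```

For this additive $g$ the degenerate part is simply $U_{ij}-1/2$, so its row and column averages are of order $N_2^{-1/2}$ and $N_1^{-1/2}$, respectively.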
Let $\boldsymbol e=(\underbrace{1,\dots,1}_{k},0,\dots,0)$. Conditionally on $(U_{\boldsymbol i\odot\boldsymbol e'})_{\boldsymbol i\in[\boldsymbol N],\boldsymbol e'\le\boldsymbol e-\boldsymbol e_1}$, the variables $\{\sum_{i_2,\dots,i_k}\hat X_{(i_1,i_2,\dots,i_k,1,\dots,1)}:i_1=1,\dots,N_1\}$ are independent with mean zero (the latter follows from Lemma 3). Hence, applying the symmetrization inequality (van der Vaart and Wellner (1996), Lemma 2.3.6) conditionally on $(U_{\boldsymbol i\odot\boldsymbol e'})_{\boldsymbol i\in[\boldsymbol N],\boldsymbol e'\le\boldsymbol e-\boldsymbol e_1}$, we have
$$\begin{aligned}
\mathbb E\Bigg[\Phi\Bigg(\Bigg\|\sum_{i_1,\dots,i_k}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty\Bigg)\ \Bigg|\ \cdot\Bigg]
&=\mathbb E\Bigg[\Phi\Bigg(\Bigg\|\sum_{i_1}\sum_{i_2,\dots,i_k}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty\Bigg)\ \Bigg|\ \cdot\Bigg]\\
&\le\mathbb E\Bigg[\Phi\Bigg(2\Bigg\|\sum_{i_1}\epsilon_{1,i_1}\sum_{i_2,\dots,i_k}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty\Bigg)\ \Bigg|\ \cdot\Bigg]\\
&=\mathbb E\Bigg[\Phi\Bigg(2\Bigg\|\sum_{i_1,\dots,i_k}\epsilon_{1,i_1}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty\Bigg)\ \Bigg|\ \cdot\Bigg],
\end{aligned}$$
where $\cdot$ abbreviates the conditioning variables. By Fubini's theorem, we have
$$\mathbb E\,\Phi\Bigg(\Bigg\|\sum_{i_1,\dots,i_k}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty\Bigg)\le\mathbb E\,\Phi\Bigg(2\Bigg\|\sum_{i_1,\dots,i_k}\epsilon_{1,i_1}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty\Bigg).$$
Next, given $\{\epsilon_{1,i_1}\}\cup\{U_{\boldsymbol i\odot\boldsymbol e'}\}_{\boldsymbol i\in[\boldsymbol N],\boldsymbol e'\le\boldsymbol e-\boldsymbol e_2}$, the variables $\{\sum_{i_1,i_3,\dots,i_k}\epsilon_{1,i_1}\hat X_{(i_1,\dots,i_k,1,\dots,1)}:i_2=1,\dots,N_2\}$ are independent with mean zero, so that by the symmetrization inequality (applied to the convex function $t\mapsto\Phi(2t)$) and Fubini's theorem, we have
$$\mathbb E\,\Phi\Bigg(2\Bigg\|\sum_{i_1,\dots,i_k}\epsilon_{1,i_1}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty\Bigg)\le\mathbb E\,\Phi\Bigg(4\Bigg\|\sum_{i_1,\dots,i_k}\epsilon_{1,i_1}\epsilon_{2,i_2}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty\Bigg).$$
The conclusion of the lemma follows from repeating this procedure, accumulating a factor of $2$ at each of the $k$ steps. $\square$
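The effect of the Rademacher multipliers can be seen in a quick Monte Carlo check of the one-step ($k=1$) inequality $\mathbb E\|\sum_i X_i\|_\infty\le 2\,\mathbb E\|\sum_i\epsilon_iX_i\|_\infty$ for independent mean-zero vectors, i.e., Lemma 2 with $\Phi(t)=t$. This is our own toy illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, p, reps = 100, 20, 500

lhs, rhs = [], []
for _ in range(reps):
    # independent mean-zero vectors (rows) and independent Rademacher signs
    x = rng.uniform(-1.0, 1.0, size=(n_obs, p))
    eps = rng.choice([-1.0, 1.0], size=(n_obs, 1))
    lhs.append(np.abs(x.sum(axis=0)).max())                 # ||sum_i X_i||_inf
    rhs.append(2.0 * np.abs((eps * x).sum(axis=0)).max())   # 2||sum_i eps_i X_i||_inf

# Monte Carlo estimates of E||sum_i X_i||_inf and 2 E||sum_i eps_i X_i||_inf
print(np.mean(lhs), np.mean(rhs))
```

Since the toy $X_i$ are symmetrically distributed here, $\epsilon_iX_i$ has the same distribution as $X_i$, so the Monte Carlo averages differ by roughly the factor $2$, comfortably within the inequality.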
We are now in a position to prove Theorem 5.
Proof of Theorem 5.
In this proof, the notation $\lesssim$ means that the left-hand side is bounded by the right-hand side up to a constant that depends only on $q$ and $K$. We may assume without loss of generality that $\boldsymbol e=(\underbrace{1,\dots,1}_{k},0,\dots,0)$. By Lemma 2 applied with $\Phi(t)=t^q$, it suffices to prove that
$$\mathbb E\Bigg\|\sum_{i_1,\dots,i_k}\epsilon_{1,i_1}\cdots\epsilon_{k,i_k}\hat X_{(i_1,\dots,i_k,1,\dots,1)}\Bigg\|_\infty^q\lesssim(\log p)^{qk/2}\,\mathbb E\Bigg[\max_{1\le j\le p}\Bigg(\sum_{i_1,\dots,i_k}|\hat X^j_{(i_1,\dots,i_k,1,\dots,1)}|^2\Bigg)^{q/2}\Bigg].$$
By conditioning and Lemma 2.2.2 in van der Vaart and Wellner (1996), together with the fact that the $L^q$-norm is bounded from above by the $\psi_{2/k}$-norm up to a constant that depends only on $(q,k)$, the problem boils down to proving that, for any constants $a_{i_1\cdots i_k}$,
$$\Bigg\|\sum_{i_1,\dots,i_k}\epsilon_{1,i_1}\cdots\epsilon_{k,i_k}a_{i_1\cdots i_k}\Bigg\|_{\psi_{2/k}}\lesssim\sqrt{\sum_{i_1,\dots,i_k}a^2_{i_1\cdots i_k}},$$
but this follows from Corollary 3.2.6 in de la Peña and Giné (1999). Indeed, let
$$(\epsilon'_1,\epsilon'_2,\dots)=(\epsilon_{1,1},\dots,\epsilon_{1,N_1},\epsilon_{2,1},\dots,\epsilon_{k,N_k}),$$
and define correspondingly $b_{j_1\cdots j_k}=a_{i_1\cdots i_k}$ if $j_1=i_1$, $j_2=N_1+i_2,\dots,j_k=\sum_{k'=1}^{k-1}N_{k'}+i_k$ for $i_{k'}=1,\dots,N_{k'}$, $k'=1,\dots,k$, and $b_{j_1\cdots j_k}=0$ otherwise. Then
$$\sum_{i_1,\dots,i_k}\epsilon_{1,i_1}\cdots\epsilon_{k,i_k}a_{i_1\cdots i_k}=\sum_{j_1<\cdots<j_k}\epsilon'_{j_1}\cdots\epsilon'_{j_k}b_{j_1\cdots j_k},$$
to which the cited corollary applies. $\square$

Appendix B. Proofs for Section 2

B.1. Proof of Lemma 1. The lemma follows from the fact that $\mathbb E[X_{\boldsymbol i}\mid(U_{\boldsymbol i\odot\boldsymbol e})_{\boldsymbol e\le\boldsymbol 1}]=X_{\boldsymbol i}$, so that
$$X_{\boldsymbol i}=\hat X_{\boldsymbol i}+\sum_{\boldsymbol e\le\boldsymbol 1,\ \boldsymbol e\ne\boldsymbol 1,\boldsymbol 0}\hat X_{\boldsymbol i\odot\boldsymbol e}=\sum_{\boldsymbol e\in\{0,1\}^K\setminus\{\boldsymbol 0\}}\hat X_{\boldsymbol i\odot\boldsymbol e}.\qquad\square$$

B.2. Proof of Theorem 1. We will assume Condition (2.3); the proof under Condition (2.4) is similar and thus omitted. In this proof, let $C$ denote a generic constant that depends only on $\sigma$ and $K$. We divide the proof into two steps.

Step 1. We first prove the following bound for the Hájek projection:
$$\sup_{R\in\mathcal R}|\mathbb P(\sqrt nS^W_{\boldsymbol N}\in R)-\gamma_\Sigma(R)|\le C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6}.$$
For notational convenience, we assume $K=2$; the proof for the general case is completely analogous.

Let $\overline W_k=N_k^{-1}\sum_{i_k}W_{k,i_k}$. By Proposition 2.1 in Chernozhukov et al. (2017a), we have
$$\sup_{R\in\mathcal R}|\mathbb P(\sqrt{N_k}\,\overline W_k\in R)-\gamma_{\Sigma_{W_k}}(R)|\le C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6},\qquad k=1,2.$$
For any rectangle $R=\prod_{j=1}^p[a_j,b_j]$, vector $w=(w_1,\dots,w_p)^T\in\mathbb R^p$, and scalar $c>0$, define $[cR+w]=\prod_{j=1}^p[ca_j+w_j,cb_j+w_j]$, which is still a rectangle. With this in mind, observe that for any rectangle $R\in\mathcal R$,
$$\mathbb P(\sqrt n(\overline W_1+\overline W_2)\in R)=\mathbb E\Big[\mathbb P\Big(\sqrt{N_1}\,\overline W_1\in\big[\sqrt{N_1/n}\,R-\sqrt{N_1}\,\overline W_2\big]\ \Big|\ \overline W_2\Big)\Big].$$
Since $\overline W_1$ and $\overline W_2$ are independent, the right-hand side is bounded by
$$\mathbb E\Big[\gamma_{\Sigma_{W_1}}\big(\big[\sqrt{N_1/n}\,R-\sqrt{N_1}\,\overline W_2\big]\big)\Big]+C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6}.$$
For $Y_1\sim N(\boldsymbol 0,\Sigma_{W_1})$ independent of $\overline W_2$, we have $\gamma_{\Sigma_{W_1}}([\sqrt{N_1/n}\,R-\sqrt{N_1}\,\overline W_2])=\mathbb P(Y_1\in[\sqrt{N_1/n}\,R-\sqrt{N_1}\,\overline W_2]\mid\overline W_2)$, so that
$$\mathbb E\Big[\gamma_{\Sigma_{W_1}}\big(\big[\sqrt{N_1/n}\,R-\sqrt{N_1}\,\overline W_2\big]\big)\Big]=\mathbb P\big(Y_1\in\big[\sqrt{N_1/n}\,R-\sqrt{N_1}\,\overline W_2\big]\big)=\mathbb P\big(\sqrt{N_2}\,\overline W_2\in\big[\sqrt{N_2/n}\,R-\sqrt{N_2/N_1}\,Y_1\big]\big)=\mathbb E\Big[\mathbb P\big(\sqrt{N_2}\,\overline W_2\in\big[\sqrt{N_2/n}\,R-\sqrt{N_2/N_1}\,Y_1\big]\ \big|\ Y_1\big)\Big].$$
Since $Y_1$ and $\overline W_2$ are independent, the far right-hand side is bounded by
$$\mathbb E\Big[\gamma_{\Sigma_{W_2}}\big(\big[\sqrt{N_2/n}\,R-\sqrt{N_2/N_1}\,Y_1\big]\big)\Big]+C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6}.$$
For $Y_2\sim N(\boldsymbol 0,\Sigma_{W_2})$ independent of $Y_1$, the first term can be written as $\mathbb P(\sqrt{n/N_1}\,Y_1+\sqrt{n/N_2}\,Y_2\in R)=\gamma_\Sigma(R)$. Conclude that
$$\mathbb P(\sqrt n(\overline W_1+\overline W_2)\in R)\le\gamma_\Sigma(R)+C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6}.$$
The reverse inequality follows similarly.

Step 2. We will prove the conclusion of the theorem. Recall the decomposition $S_{\boldsymbol N}=S^W_{\boldsymbol N}+R_{\boldsymbol N}$ with
$$R_{\boldsymbol N}=\sum_{k=2}^K\sum_{\boldsymbol e\in\mathcal E_k}\frac{1}{\prod_{k'\in\operatorname{supp}(\boldsymbol e)}N_{k'}}\sum_{\boldsymbol i\in I_{\boldsymbol e}([\boldsymbol N])}\hat X_{\boldsymbol i}.$$
By Corollary 3, we have
$$\mathbb E[\|R_{\boldsymbol N}\|_\infty]\le C\sum_{k=2}^Kn^{-k/2}(\log p)^{k/2}D_N\le Cn^{-1}D_N\log p.$$
For $R=\prod_{j=1}^p[a_j,b_j]$ with $a=(a_1,\dots,a_p)^T$ and $b=(b_1,\dots,b_p)^T$, we have
$$\begin{aligned}
\mathbb P(\sqrt nS_{\boldsymbol N}\in R)&=\mathbb P(\{-\sqrt nS_{\boldsymbol N}\le-a\}\cap\{\sqrt nS_{\boldsymbol N}\le b\})\\
&\le\mathbb P(\{-\sqrt nS_{\boldsymbol N}\le-a\}\cap\{\sqrt nS_{\boldsymbol N}\le b\}\cap\{\|\sqrt nR_{\boldsymbol N}\|_\infty\le t\})+\mathbb P(\|\sqrt nR_{\boldsymbol N}\|_\infty>t)\\
&\le\mathbb P(\{-\sqrt nS^W_{\boldsymbol N}\le-a+t\}\cap\{\sqrt nS^W_{\boldsymbol N}\le b+t\})+Ct^{-1}n^{-1/2}D_N\log p\\
&\le\gamma_\Sigma(\{y\in\mathbb R^p:-y\le-a+t,\ y\le b+t\})+C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6}+Ct^{-1}n^{-1/2}D_N\log p\\
&\le\gamma_\Sigma(R)+Ct\sqrt{\log p}+C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6}+Ct^{-1}n^{-1/2}D_N\log p,
\end{aligned}$$
where the last line follows from Nazarov's inequality -- see Lemma 7 in Appendix F. Choosing $t=n^{-1/4}D_N^{1/2}(\log p)^{1/4}$, we have
$$\mathbb P(\sqrt nS_{\boldsymbol N}\in R)\le\gamma_\Sigma(R)+C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6}+C\Bigg(\frac{D_N^2\log^3p}{n}\Bigg)^{1/4}\le\gamma_\Sigma(R)+C\Bigg(\frac{D_N^2\log^7(pN)}{n}\Bigg)^{1/6}.$$
The reverse inequality follows similarly. $\square$

B.3. Proof of Theorem 2. We separately prove the theorem under Cases (i) and (ii).

Case (i). Let $C$ denote a generic constant that depends only on $\sigma,K$, and $C_1$. Also, the notation $\lesssim$ means that the left-hand side is bounded by the right-hand side up to a constant that depends only on $\sigma,K$, and $C_1$.

Conditionally on $X_{[\boldsymbol N]}$, we have $\sqrt nS^{MB}_{\boldsymbol N}\sim N(\boldsymbol 0,\hat\Sigma)$, where
$$\hat\Sigma=\sum_{k=1}^K\frac{n}{N_k^2}\sum_{i_k=1}^{N_k}(\bar X_{k,i_k}-S_{\boldsymbol N})(\bar X_{k,i_k}-S_{\boldsymbol N})^T.$$
Hence, to obtain a bound on $\sup_{R\in\mathcal R}|\mathbb P_{|X_{[\boldsymbol N]}}(\sqrt nS^{MB}_{\boldsymbol N}\in R)-\gamma_\Sigma(R)|$, it suffices to bound $\|\hat\Sigma-\Sigma\|_\infty$ in view of Lemma 8 in Appendix F. We note that
$$\|\hat\Sigma-\Sigma\|_\infty\le\sum_{k=1}^K\underbrace{\max_{1\le j,\ell\le p}\Big|\frac{n}{N_k^2}\sum_{i_k=1}^{N_k}\bar X^j_{k,i_k}\bar X^\ell_{k,i_k}-\frac{n}{N_k}S^j_{\boldsymbol N}S^\ell_{\boldsymbol N}-\frac{n}{N_k}\mathbb E[W^j_{k,1}W^\ell_{k,1}]\Big|}_{=:\hat\Delta_{W,k}}.$$
We will focus on bounding $\hat\Delta_{W,1}$, as similar bounds hold for $\hat\Delta_{W,k}$ with $k\in\{2,\dots,K\}$. Observe that
$$\frac{n}{N_1^2}\sum_{i_1=1}^{N_1}\bar X^j_{1,i_1}\bar X^\ell_{1,i_1}=\frac{n}{N_1^2}\sum_{i_1=1}^{N_1}(\bar X^j_{1,i_1}-W^j_{1,i_1})(\bar X^\ell_{1,i_1}-W^\ell_{1,i_1})+\frac{n}{N_1^2}\sum_{i_1=1}^{N_1}(\bar X^j_{1,i_1}-W^j_{1,i_1})W^\ell_{1,i_1}+\frac{n}{N_1^2}\sum_{i_1=1}^{N_1}W^j_{1,i_1}(\bar X^\ell_{1,i_1}-W^\ell_{1,i_1})+\frac{n}{N_1^2}\sum_{i_1=1}^{N_1}W^j_{1,i_1}W^\ell_{1,i_1}.$$
By the Cauchy-Schwarz inequality and the definition of $n$ (so that $n/N_1\le1$), we obtain
$$\hat\Delta_{W,1}\le\underbrace{\max_{1\le\ell\le p}\frac1{N_1}\sum_{i_1=1}^{N_1}(\bar X^\ell_{1,i_1}-W^\ell_{1,i_1})^2}_{=:\hat\Delta_{W,2,1}}+2\hat\Delta^{1/2}_{W,2,1}\sqrt{\max_{1\le\ell\le p}\frac1{N_1}\sum_{i_1=1}^{N_1}|W^\ell_{1,i_1}|^2}+\underbrace{\max_{1\le j,\ell\le p}\Big|\frac1{N_1}\sum_{i_1=1}^{N_1}(W^j_{1,i_1}W^\ell_{1,i_1}-\mathbb E[W^j_{1,1}W^\ell_{1,1}])\Big|}_{=:\hat\Delta_{W,3,1}}+\max_{1\le\ell\le p}|S^\ell_{\boldsymbol N}|^2.\qquad(\text{B.1})$$
For the second term on the right-hand side, we have
$$\frac1{N_1}\sum_{i_1=1}^{N_1}|W^\ell_{1,i_1}|^2\le\mathbb E[|W^\ell_{1,1}|^2]+\frac1{N_1}\sum_{i_1=1}^{N_1}\big(|W^\ell_{1,i_1}|^2-\mathbb E[|W^\ell_{1,1}|^2]\big)\le\sigma^2+\hat\Delta_{W,3,1}.\qquad(\text{B.2})$$
Further, since $S^\ell_{\boldsymbol N}=N_1^{-1}\sum_{i_1}(\bar X^\ell_{1,i_1}-W^\ell_{1,i_1})+N_1^{-1}\sum_{i_1}W^\ell_{1,i_1}$, we have
$$\max_{1\le\ell\le p}|S^\ell_{\boldsymbol N}|^2\lesssim\hat\Delta_{W,2,1}+\hat\Delta^2_{W,4,1},\qquad(\text{B.3})$$
where $\hat\Delta_{W,4,1}=\max_{1\le\ell\le p}|N_1^{-1}\sum_{i_1}W^\ell_{1,i_1}|$. Combining (B.1)-(B.3), we have
$$\hat\Delta_{W,1}\lesssim\hat\Delta_{W,2,1}+\sigma\hat\Delta^{1/2}_{W,2,1}+\hat\Delta_{W,3,1}+\hat\Delta^2_{W,4,1}.$$
It remains to find bounds on the four terms on the right-hand side.

First, by Condition (2.7), we have $\hat\Delta_{W,2,1}\log^2p\le Cn^{-\zeta_1}$ and $\sigma\hat\Delta^{1/2}_{W,2,1}\log^2p\le Cn^{-\zeta_2/2}$ with probability at least $1-Cn^{-1}$. Second, we note that
$$\mathbb E\Big[\max_{1\le i_1\le N_1}\max_{1\le\ell\le p}|W^\ell_{1,i_1}|^2\Big]\lesssim\log^2(pN_1)\max_{1\le\ell\le p}\big\||W^\ell_{1,1}|^2\big\|_{\psi_{1/2}}=\log^2(pN_1)\max_{1\le\ell\le p}\|W^\ell_{1,1}\|^2_{\psi_1}\le\log^2(pN_1)\,D_N^2.$$
Applying Lemma 8 in Chernozhukov et al. (2015), we have
$$\mathbb E[\hat\Delta_{W,3,1}]\lesssim\frac1{N_1}\sqrt{(\log p)\max_{1\le j,\ell\le p}\sum_{i_1=1}^{N_1}\mathbb E[|W^j_{1,i_1}W^\ell_{1,i_1}|^2]}+\frac{\log p}{N_1}\sqrt{\mathbb E\Big[\max_{1\le i_1\le N_1}\max_{1\le\ell\le p}|W^\ell_{1,i_1}|^4\Big]}\lesssim N_1^{-1/2}D_N^2\log^{1/2}p+N_1^{-1}D_N^2(\log p)\log^2(pN)\lesssim n^{-1/2}D_N^2\log^{1/2}p+n^{-1}D_N^2\log^3(pN).$$
Now, applying Lemma E.2 in Chernozhukov et al. (2017a) with $\eta=1$ and $\beta=1/2$, together with the fact that
$$\Big\|\max_{1\le i_1\le N_1}\max_{1\le\ell\le p}|W^\ell_{1,i_1}|^2\Big\|_{\psi_{1/2}}=\Big\|\max_{1\le i_1\le N_1}\max_{1\le\ell\le p}|W^\ell_{1,i_1}|\Big\|^2_{\psi_1}\lesssim\log^2(pN)\,D_N^2,$$
we have
$$\mathbb P\big(\hat\Delta_{W,3,1}\ge\mathbb E[\hat\Delta_{W,3,1}]+t\big)\le\exp\Big(-\frac{nt^2}{3D_N^4}\Big)+3\exp\Bigg(-\Big(\frac{nt}{CD_N^2\log^2(pN)}\Big)^{1/2}\Bigg).$$
Setting $t=\{Cn^{-1}D_N^4\log n\}^{1/2}\vee\{Cn^{-1}D_N^2(\log^2n)\log^2(pN)\}$, we conclude that
$$\mathbb P\Big(\hat\Delta_{W,3,1}\ge C\big\{(n^{-1}D_N^4\log(pn))^{1/2}+n^{-1}D_N^2(\log^2n)\log^2(pN)\big\}\Big)\le Cn^{-1}.$$
Condition (2.8) then guarantees that $\hat\Delta_{W,3,1}\log^2p\le Cn^{-\zeta_2/2}$ with probability at least $1-Cn^{-1}$. Finally, since $\sigma\le(\max_{1\le\ell\le p}\mathbb E[|W^\ell_{1,1}|^2])^{1/2}\le\max_{1\le\ell\le p}\|W^\ell_{1,1}\|_{\psi_1}\lesssim D_N$, using Lemma 8 in Chernozhukov et al. (2015), we have
$$\mathbb E[\hat\Delta_{W,4,1}]\lesssim(n^{-1}D_N^2\log p)^{1/2}+n^{-1}D_N\log(pN).$$
Applying Lemma E.2 in Chernozhukov et al. (2017a) with $\eta=1$ and $\beta=1$, we have
$$\hat\Delta^2_{W,4,1}\log^2p\le C\big\{n^{-1}D_N^2(\log^2p)\log(pn)+n^{-2}D_N^2(\log^2n)(\log^2p)\log^2(pN)\big\}\le Cn^{-\zeta_2}$$
with probability at least $1-Cn^{-1}$. Conclude that $\hat\Delta_{W,1}\log^2p\le Cn^{-(\zeta_1\wedge\zeta_2)/2}$ with probability at least $1-Cn^{-1}$. The desired result then follows from Lemma 8 in Appendix F.

Case (ii). The proof is similar to the previous case; we only point out the required modifications. Let $C$ denote a generic constant that depends only on $q,\sigma,K$, and $C_1$; the same modification applies to $\lesssim$. In view of the previous case, we only have to find bounds on $\hat\Delta_{W,3,1}$ and $\hat\Delta_{W,4,1}$. Applying Lemma 8 in Chernozhukov et al. (2015), we have
$$\mathbb E[\hat\Delta_{W,3,1}]\lesssim N_1^{-1/2}D_N^2\log^{1/2}p+N_1^{-1+2/q}D_N^2\log p\lesssim n^{-1/2}D_N^2\log^{1/2}p+n^{-1+2/q}D_N^2\log p.$$
Applying the Fuk-Nagaev inequality (Lemma E.2 in Chernozhukov et al. (2017a)) with $s=q/2$, we have
$$\mathbb P\big(\hat\Delta_{W,3,1}\ge\mathbb E[\hat\Delta_{W,3,1}]+t\big)\le\exp\Big(-\frac{N_1t^2}{3D_N^4}\Big)+\frac{CN_1D_N^q}{N_1^{q/2}t^{q/2}}\le\exp\Big(-\frac{nt^2}{3D_N^4}\Big)+\frac{CD_N^q}{n^{q/2-1}t^{q/2}}.$$
Setting $t=(Cn^{-1}D_N^4\log n)^{1/2}\vee(Cn^{-1+4/q}D_N^2)$, we have
$$\mathbb P\Big(\hat\Delta_{W,3,1}\ge C\big\{(n^{-1}D_N^4\log(pn))^{1/2}+n^{-1+4/q}D_N^2\log p\big\}\Big)\le Cn^{-1}.$$
Condition (2.9) then guarantees that $\hat\Delta_{W,3,1}\log^2p\le Cn^{-\zeta_2/2}$ with probability at least $1-Cn^{-1}$. A bound for $\hat\Delta_{W,4,1}$ can be obtained similarly. Using Lemma 8 in Chernozhukov et al. (2015), we have
$$\mathbb E[\hat\Delta_{W,4,1}]\lesssim(n^{-1}D_N^2\log p)^{1/2}+n^{-1+1/q}D_N\log p.$$
Applying Lemma E.2 in Chernozhukov et al. (2017a) with $s=q$, we have
$$\mathbb P\big(\hat\Delta_{W,4,1}\ge\mathbb E[\hat\Delta_{W,4,1}]+t\big)\le\exp\Big(-\frac{nt^2}{3D_N^2}\Big)+\frac{CD_N^q}{n^{q-1}t^q}.$$
Setting $t=(Cn^{-1}D_N^2\log n)^{1/2}\vee(Cn^{-1+2/q}D_N)$, we conclude that
$$\hat\Delta^2_{W,4,1}\log^2p\le C\big\{n^{-1}D_N^2(\log^2p)\log(pn)+n^{-2+4/q}D_N^2\log^2p\big\}\le Cn^{-\zeta_2}$$
with probability at least $1-Cn^{-1}$. $\square$

B.4. Proof of Proposition 1. We separately prove the proposition under Cases (i') and (ii').

Case (i'). Let the notation $\lesssim$ mean that the left-hand side is bounded by the right-hand side up to a constant that depends only on $\nu,K$, and $C_1$. We will show that
$$\mathbb P\big(\sigma^2\hat\Delta_{W,2,1}\log^4p>n^{-\zeta_1+1/\nu}\big)\lesssim n^{-1},\qquad\text{where }\hat\Delta_{W,2,1}=\max_{1\le\ell\le p}\frac1{N_1}\sum_{i_1=1}^{N_1}(\bar X^\ell_{1,i_1}-W^\ell_{1,i_1})^2.$$
Similar bounds hold for $\max_{1\le\ell\le p}N_k^{-1}\sum_{i_k=1}^{N_k}(\bar X^\ell_{k,i_k}-W^\ell_{k,i_k})^2$ with $k\in\{2,\dots,K\}$. We first note that
$$\hat\Delta_{W,2,1}=\max_{1\le\ell\le p}\frac1{N_1}\sum_{i_1=1}^{N_1}(\bar X^\ell_{1,i_1}-W^\ell_{1,i_1})^2\le\frac1{N_1}\sum_{i_1=1}^{N_1}\|\bar X_{1,i_1}-W_{1,i_1}\|^2_\infty.$$
Pick any $i_1\in\mathbb N$. For each $\boldsymbol i_{-1}=(i_2,\dots,i_K)\in\mathbb N^{K-1}$ and $\boldsymbol e\in\{0,1\}^{K-1}$, define the vector
$$V_{\boldsymbol i_{-1}\odot\boldsymbol e}=\big(U_{(0,\boldsymbol i_{-1}\odot\boldsymbol e)},U_{(i_1,\boldsymbol i_{-1}\odot\boldsymbol e)}\big).$$
With this notation, we can rewrite $X_{\boldsymbol i}$ with $\boldsymbol i=(i_1,\boldsymbol i_{-1})$ as
$$X_{\boldsymbol i}=g\big(U_{(i_1,0,\dots,0)},(V_{\boldsymbol i_{-1}\odot\boldsymbol e})_{\boldsymbol e\in\{0,1\}^{K-1}\setminus\{\boldsymbol 0\}}\big).$$
From this expression, we see that, conditionally on $U_{(i_1,0,\dots,0)}$, the $(K-1)$-array $(X_{(i_1,\boldsymbol i_{-1})})_{\boldsymbol i_{-1}\in\mathbb N^{K-1}}$ is separately exchangeable with mean vector $W_{1,i_1}$, generated by $\{V_{\boldsymbol i_{-1}\odot\boldsymbol e}:\boldsymbol i_{-1}\in\mathbb N^{K-1},\boldsymbol e\in\{0,1\}^{K-1}\setminus\{\boldsymbol 0\}\}$. Applying Corollary 3 conditionally on $U_{(i_1,0,\dots,0)}$ (the fact that the $U_{\boldsymbol i\odot\boldsymbol e}$ are uniform on $[0,1]$ is not crucial in the proof of Corollary 3) combined with Jensen's inequality, we have
$$\mathbb E[\|\bar X_{1,i_1}-W_{1,i_1}\|^{2\nu}_\infty\mid U_{(i_1,0,\dots,0)}]\lesssim\underbrace{\Bigg(\sum_{k=1}^{K-1}n^{-k/2}(\log p)^{k/2}\Bigg)^{2\nu}}_{\lesssim(n^{-1}\log p)^\nu}\mathbb E[\|X_{(i_1,1,\dots,1)}\|^{2\nu}_\infty\mid U_{(i_1,0,\dots,0)}],$$
so that by Fubini's theorem
$$\mathbb E[\|\bar X_{1,i_1}-W_{1,i_1}\|^{2\nu}_\infty]\lesssim(n^{-1}\log p)^\nu\,\mathbb E[\|X_{(i_1,1,\dots,1)}\|^{2\nu}_\infty]\lesssim(n^{-1}D_N^2\log p)^\nu.$$
This implies that $\mathbb E[(\sigma^2\hat\Delta_{W,2,1}\log^4p)^\nu]\lesssim n^{-\zeta_1\nu}$ under our assumption. By Markov's inequality, we conclude that
$$\mathbb P\big(\sigma^2\hat\Delta_{W,2,1}\log^4p>n^{-\zeta_1+1/\nu}\big)\lesssim n^{-1}.$$
This completes the proof.

Case (ii'). The proof is similar to the previous case; we only point out the required modifications. Set $\nu=q/2$. Then
$$\mathbb E[\|\bar X_{1,i_1}-W_{1,i_1}\|^q_\infty]\lesssim(n^{-1}\log p)^{q/2}\underbrace{\mathbb E[\|X_{(i_1,1,\dots,1)}\|^q_\infty]}_{\le D_N^q},$$
which implies that $\mathbb E[(\sigma^2\hat\Delta_{W,2,1}\log^4p)^{q/2}]\lesssim n^{-\zeta_1q/2}$. Markov's inequality yields the desired result. $\square$

B.5. Proof of Corollary 1. We only prove the corollary under Case (i); the proof for Case (ii) is similar. Let $C$ denote a generic constant that depends only on $\sigma,K$, and $C_1$. We first note that, from the proof of Theorem 2, we have
$$\max_{1\le j\le p}|\sigma_j/\hat\sigma_j-1|\le Cn^{-(\zeta_1\wedge\zeta_2)/2}/\log p$$
with probability at least $1-Cn^{-1}$. By Theorem 1, we have
$$\sup_{R\in\mathcal R}\big|\mathbb P(\sqrt n\Lambda^{-1/2}S_{\boldsymbol N}\in R)-\mathbb P(\Lambda^{-1/2}Y\in R)\big|\le Cn^{-(\zeta_1\wedge\zeta_2)/2}.$$
By the Borell-Sudakov-Tsirel'son inequality and the fact that $\mathbb E[\|\Lambda^{-1/2}Y\|_\infty]\le C\sqrt{\log p}$, which is implied by the Gaussianity of $\Lambda^{-1/2}Y$, we have
$$\mathbb P\big(\|\Lambda^{-1/2}Y\|_\infty>C\sqrt{\log(pn)}\big)\le n^{-1}.$$
Combining the high-dimensional CLT, we see that
$$\mathbb P\big(\|\sqrt n\Lambda^{-1/2}S_{\boldsymbol N}\|_\infty>C\sqrt{\log(pn)}\big)\le Cn^{-(\zeta_1\wedge\zeta_2)/2}.$$
Since $n^{-(\zeta_1\wedge\zeta_2)/2}/\log p\times\sqrt{\log(pn)}\le Cn^{-(\zeta_1\wedge\zeta_2)/2}/\log^{1/2}p$, we have
$$\mathbb P\big(\|\sqrt n(\hat\Lambda^{-1/2}-\Lambda^{-1/2})S_{\boldsymbol N}\|_\infty>t_n\big)\le Cn^{-(\zeta_1\wedge\zeta_2)/2}$$
with $t_n=Cn^{-(\zeta_1\wedge\zeta_2)/2}/\log^{1/2}p$.

Now, for $R=\prod_{j=1}^p[a_j,b_j]$ with $a=(a_1,\dots,a_p)^T$ and $b=(b_1,\dots,b_p)^T$, we have
$$\begin{aligned}
\mathbb P\big(\sqrt n\hat\Lambda^{-1/2}S_{\boldsymbol N}\in R\big)&\le\mathbb P\big(\{-\sqrt n\Lambda^{-1/2}S_{\boldsymbol N}\le-a+t_n\}\cap\{\sqrt n\Lambda^{-1/2}S_{\boldsymbol N}\le b+t_n\}\big)+\mathbb P\big(\|\sqrt n(\hat\Lambda^{-1/2}-\Lambda^{-1/2})S_{\boldsymbol N}\|_\infty>t_n\big)\\
&\le\mathbb P\big(\{-\Lambda^{-1/2}Y\le-a+t_n\}\cap\{\Lambda^{-1/2}Y\le b+t_n\}\big)+Cn^{-(\zeta_1\wedge\zeta_2)/2}\\
&\le\mathbb P(\Lambda^{-1/2}Y\in R)+Cn^{-(\zeta_1\wedge\zeta_2)/2},
\end{aligned}$$
where the last inequality follows from Lemma 7 together with the fact that $t_n\sqrt{\log p}\le Cn^{-(\zeta_1\wedge\zeta_2)/2}$. Thus, we have $\mathbb P(\sqrt n\hat\Lambda^{-1/2}S_{\boldsymbol N}\in R)\le\mathbb P(\Lambda^{-1/2}Y\in R)+Cn^{-(\zeta_1\wedge\zeta_2)/2}$. Likewise, we have $\mathbb P(\sqrt n\hat\Lambda^{-1/2}S_{\boldsymbol N}\in R)\ge\mathbb P(\Lambda^{-1/2}Y\in R)-Cn^{-(\zeta_1\wedge\zeta_2)/2}$. Conclude that
$$\sup_{R\in\mathcal R}\big|\mathbb P(\sqrt n\hat\Lambda^{-1/2}S_{\boldsymbol N}\in R)-\mathbb P(\Lambda^{-1/2}Y\in R)\big|\le Cn^{-(\zeta_1\wedge\zeta_2)/2}.$$
Similarly, using Theorem 2 and following similar arguments, we conclude that
$$\sup_{R\in\mathcal R}\big|\mathbb P_{|X_{[\boldsymbol N]}}(\sqrt n\hat\Lambda^{-1/2}S^{MB}_{\boldsymbol N}\in R)-\mathbb P(\Lambda^{-1/2}Y\in R)\big|\le Cn^{-(\zeta_1\wedge\zeta_2)/2}$$
with probability at least $1-Cn^{-1}$. $\square$

Appendix C. Maximal Inequalities for Polyadic Data

In this section, we develop maximal inequalities for jointly exchangeable arrays. As in Section 3, let $(X_{\boldsymbol i})_{\boldsymbol i\in I_{\infty,K}}$ be a $K$-array consisting of mean-zero random vectors in $\mathbb R^p$ generated by the structure (3.1), i.e., $X_{\boldsymbol i}=g\big((U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\{0,1\}^K\setminus\{\boldsymbol 0\}}\big)$. We follow the notation used in Section 3. Recall that $I_{n,K}=\{(i_1,\dots,i_K):1\le i_1,\dots,i_K\le n\text{ and }i_1,\dots,i_K\text{ are distinct}\}$.

We first point out that, when analyzing the sample mean $S_n$, it is without loss of generality to assume that $X_{\boldsymbol i}$ is symmetric in the components of $\boldsymbol i$, i.e.,
$$X_{(i_1,\dots,i_K)}=X_{(i'_1,\dots,i'_K)}\qquad(\text{C.1})$$
for any permutation $(i'_1,\dots,i'_K)$ of $(i_1,\dots,i_K)$. This is because, even if $X_{\boldsymbol i}$ is not symmetric in the components of $\boldsymbol i$, we can instead work with its symmetrized version
$$\check X_{(i_1,\dots,i_K)}=\frac1{K!}\sum_{(i'_1,\dots,i'_K)}X_{(i'_1,\dots,i'_K)},$$
where the summation is taken over all permutations of $(i_1,\dots,i_K)$. It is not difficult to see that the array $(\check X_{\boldsymbol i})_{\boldsymbol i\in I_{\infty,K}}$ continues to be jointly exchangeable and satisfies
$$S_n=\frac{(n-K)!}{n!}\sum_{\boldsymbol i\in I_{n,K}}\check X_{\boldsymbol i}=\binom nK^{-1}\sum_{1\le i_1<\cdots<i_K\le n}\check X_{(i_1,\dots,i_K)}.$$
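For $K=2$ the symmetrization above amounts to $\check X_{(i,j)}=(X_{(i,j)}+X_{(j,i)})/2$. The following sketch (our own toy data, not from the paper) verifies numerically that symmetrizing leaves the sample mean over $I_{n,2}$ unchanged and that this mean coincides with the U-statistic form over unordered pairs.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# A (possibly asymmetric) dyadic toy array X_{ij}, i != j, stored as an
# n x n matrix whose diagonal is unused.
X = rng.normal(size=(n, n))

# Symmetrized version: Xcheck_{ij} = (X_{ij} + X_{ji}) / 2  (K = 2, so K! = 2)
Xcheck = (X + X.T) / 2.0

off = ~np.eye(n, dtype=bool)                 # ordered pairs (i, j), i != j
S_n = X[off].mean()                          # sample mean over I_{n,2}
S_n_check = Xcheck[off].mean()               # same mean after symmetrizing
assert np.isclose(S_n, S_n_check)

# ... and it equals the binomial-average form over unordered pairs i < j
assert np.isclose(Xcheck[np.triu_indices(n, 1)].mean(), S_n)
```

Both identities hold exactly (up to floating point), since each ordered pair contributes symmetrically to the two averages.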
Lemma 4. For any $q\in[1,\infty)$, we have
$$\big(\mathbb E[\|U_n\|^q_\infty]\big)^{1/q}\le C\sum_{k=2}^Kn^{-k/2}(\log p)^{k/2}\big(\mathbb E[\|X_{(1,\dots,K)}\|^{q\vee2}_\infty]\big)^{1/(q\vee2)},$$
where $C$ is a constant that depends only on $q$ and $K$.

We turn to the analysis of the third term on the right-hand side of (3.2). By the symmetry (C.1),
$$\sum_{k=2}^K\frac{(n-K)!}{n!}\sum_{\boldsymbol i\in I_{n,K}}\Big(\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^k\mathcal E_r}\big]-\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r}\big]\Big)=\sum_{k=2}^K\binom nK^{-1}\sum_{1\le i_1<\cdots<i_K\le n}\Big(\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^k\mathcal E_r}\big]-\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r}\big]\Big).$$

Lemma 5. For any $k=2,\dots,K$ and $q\in[1,\infty)$, we have
$$\Bigg(\mathbb E\Bigg\|\binom nK^{-1}\sum_{1\le i_1<\cdots<i_K\le n}\Big(\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^k\mathcal E_r}\big]-\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r}\big]\Big)\Bigg\|^q_\infty\Bigg)^{1/q}\le Cn^{-k/2}(\log p)^{k/2}\big(\mathbb E[\|X_{(1,\dots,K)}\|^{q\vee2}_\infty]\big)^{1/(q\vee2)},$$
where $C$ is a constant that depends only on $q$ and $K$.

Proof of Lemma 5. In this proof, the notation $\lesssim$ means that the left-hand side is bounded by the right-hand side up to a constant that depends only on $q$ and $K$. Fix any $k=2,\dots,K$. Conditionally on $\mathcal U_{k-1}=\{U_{\{\boldsymbol i\odot\boldsymbol e\}_+}:\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r,\boldsymbol i\in I_{\infty,K}\}$, the component
$$\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^k\mathcal E_r}\big]-\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r}\big]$$
is a function of $(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\mathcal E_k}$ with mean zero:
$$\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^k\mathcal E_r}\big]-\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r}\big]=h_{(\{\boldsymbol i\odot\boldsymbol e\}_+)_{\boldsymbol e\in\mathcal E_k}}\big((U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\mathcal E_k}\big).$$
The function $h_{(\{\boldsymbol i\odot\boldsymbol e\}_+)_{\boldsymbol e\in\mathcal E_k}}$ implicitly depends on $(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r}$, but the vector $(\{\boldsymbol i\odot\boldsymbol e\}_+)_{\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r}$ is uniquely determined by $(\{\boldsymbol i\odot\boldsymbol e\}_+)_{\boldsymbol e\in\mathcal E_k}$, so it is enough to index the function by $(\{\boldsymbol i\odot\boldsymbol e\}_+)_{\boldsymbol e\in\mathcal E_k}$. Define
$$\mathcal J_{n,k}=\big\{(\{\boldsymbol i\odot\boldsymbol e\}_+)_{\boldsymbol e\in\mathcal E_k}:1\le i_1<\cdots<i_K\le n\big\}.$$
This is a collection of vectors of sets, where each vector contains $m_k=\binom Kk$ sets. We denote a generic element of $\mathcal J_{n,k}$ by $J=(J_1,\dots,J_{m_k})$, obtained by ordering the elements of $\mathcal E_k$, and write $U_J=(U_{J_1},\dots,U_{J_{m_k}})$. Then we arrive at the expression
$$\sum_{1\le i_1<\cdots<i_K\le n}\Big(\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^k\mathcal E_r}\big]-\mathbb E\big[X_{\boldsymbol i}\mid(U_{\{\boldsymbol i\odot\boldsymbol e\}_+})_{\boldsymbol e\in\cup_{r=1}^{k-1}\mathcal E_r}\big]\Big)=\sum_{J\in\mathcal J_{n,k}}h_J(U_J).$$
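The practical content of these lemmas -- the degenerate ($k\ge2$) components are of strictly higher order than the Hájek projection -- can be seen in a quick simulation. This is our own toy example with an additive, symmetric $g$, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Jointly exchangeable toy array (K = 2): X_{ij} = U_i + U_j + U_{ij} - 3/2;
# only pairs i < j are used below.
U = rng.uniform(size=n)
U_pair = rng.uniform(size=(n, n))

iu = np.triu_indices(n, 1)
X = (U[:, None] + U[None, :] + U_pair - 1.5)[iu]
S_n = X.mean()

# Hajek projection: for this additive g each node contributes 2(U_j - 1/2)/n,
# so the projection of S_n is exactly 2 * mean(U - 1/2).
hajek = 2.0 * (U - 0.5).mean()          # O_P(n^{-1/2})
remainder = S_n - hajek                 # degenerate part, O_P(n^{-1})
print(abs(hajek), abs(remainder))
```

The remainder here is the mean of roughly $n^2/2$ centered pair-level shocks, an order of magnitude smaller than the Hájek term, matching the $n^{-k/2}$ scaling in Lemmas 4 and 5.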
Remark 9 (Comparison with Davezies et al. (2020)). Lemma A.1 in Davezies et al. (2020) derives a symmetrization inequality for the empirical process of a jointly exchangeable array. Essentially, the same comparison made in Remark 8 applies to the comparison of their Lemma A.1 with the maximal inequalities developed in this section. Lemma S3 in Davezies et al. (2020) covers the degenerate case but focuses only on the $K=2$ case. As seen in the proof of Lemma 5 above, however, handling the degenerate components in the $K>2$ case requires a substantially more involved argument.

Appendix D. Proofs for Section 3

D.1. Proof of Theorem 3. Given Lemmas 4 and 5, the proof is almost identical to that of Theorem 1. We omit the details for brevity. $\square$

D.2. Proof of Theorem 4. Conditionally on $(X_{\boldsymbol i})_{\boldsymbol i\in I_{n,K}}$, we have $\sqrt nS^{MB}_n\sim N(\boldsymbol 0,\hat\Sigma)$, where
$$\hat\Sigma=\frac1n\sum_{j=1}^n(\hat W_j-KS_n)(\hat W_j-KS_n)^T.$$
As in the proof of Theorem 2, the desired result follows from bounding $\hat\Delta_W=\|\hat\Sigma-\Sigma\|_\infty$. We first note that
$$\hat\Delta_W=\max_{1\le\ell,\ell'\le p}\Big|\frac1n\sum_{j=1}^n(\hat W^\ell_j-KS^\ell_n)(\hat W^{\ell'}_j-KS^{\ell'}_n)-\mathbb E[W^\ell_1W^{\ell'}_1]\Big|.$$
For every $\ell,\ell'\in\{1,\dots,p\}$,
$$\begin{aligned}
\frac1n\sum_{j=1}^n(\hat W^\ell_j-KS^\ell_n)(\hat W^{\ell'}_j-KS^{\ell'}_n)&=\frac1n\sum_{j=1}^n\hat W^\ell_j\hat W^{\ell'}_j-K^2S^\ell_nS^{\ell'}_n\\
&=\frac1n\sum_{j=1}^n(\hat W^\ell_j-W^\ell_j)(\hat W^{\ell'}_j-W^{\ell'}_j)+\frac1n\sum_{j=1}^n(\hat W^\ell_j-W^\ell_j)W^{\ell'}_j+\frac1n\sum_{j=1}^nW^\ell_j(\hat W^{\ell'}_j-W^{\ell'}_j)+\frac1n\sum_{j=1}^nW^\ell_jW^{\ell'}_j-K^2S^\ell_nS^{\ell'}_n.
\end{aligned}$$
Using the Cauchy-Schwarz inequality, we have
$$\hat\Delta_W\le\underbrace{\max_{1\le\ell\le p}\frac1n\sum_{j=1}^n(\hat W^\ell_j-W^\ell_j)^2}_{=:\hat\Delta_{W,2}}+2\hat\Delta^{1/2}_{W,2}\sqrt{\max_{1\le\ell\le p}\frac1n\sum_{j=1}^n|W^\ell_j|^2}+\underbrace{\max_{1\le\ell,\ell'\le p}\Big|\frac1n\sum_{j=1}^n(W^\ell_jW^{\ell'}_j-\mathbb E[W^\ell_1W^{\ell'}_1])\Big|}_{=:\hat\Delta_{W,3}}+K^2\max_{1\le\ell\le p}|S^\ell_n|^2.$$
For the second term on the right-hand side, we have
$$\frac1n\sum_{j=1}^n|W^\ell_j|^2\le\mathbb E[|W^\ell_1|^2]+\frac1n\sum_{j=1}^n\big(|W^\ell_j|^2-\mathbb E[|W^\ell_1|^2]\big)\le\sigma^2+\hat\Delta_{W,3}.$$
Further, since $KS^\ell_n=n^{-1}\sum_{j=1}^n(\hat W^\ell_j-W^\ell_j)+n^{-1}\sum_{j=1}^nW^\ell_j$, we have
$$K^2\max_{1\le\ell\le p}|S^\ell_n|^2\le2\hat\Delta_{W,2}+2\hat\Delta^2_{W,4},$$
where $\hat\Delta_{W,4}=\max_{1\le\ell\le p}|n^{-1}\sum_{j=1}^nW^\ell_j|$. Conclude that
$$\hat\Delta_W\lesssim\hat\Delta_{W,2}+\sigma\hat\Delta^{1/2}_{W,2}+\hat\Delta_{W,3}+\hat\Delta^2_{W,4}$$
up to a universal constant. The rest is completely analogous to the latter part of the proof of Theorem 2. We omit the details for brevity. $\square$

D.3. Proof of Proposition 2. We only prove the proposition under Case (i'). The proof for Case (ii') is similar (cf. the proof of Proposition 1). In this proof, the notation $\lesssim$ means that the left-hand side is bounded by the right-hand side up to a constant that depends only on $\nu,K$, and $C_1$. Recall that $W_j$ can be written as
$$W_j=\mathbb E\Bigg[\frac{(n-K)!}{(n-1)!}\sum_{k=1}^K\sum_{\boldsymbol i\in I_{n,K}:i_k=j}X_{\boldsymbol i}\ \Bigg|\ U_j\Bigg].$$
We have
$$\hat\Delta_{W,2}=\max_{1\le\ell\le p}\frac1n\sum_{j=1}^n(\hat W^\ell_j-W^\ell_j)^2\le\frac1n\sum_{j=1}^n\|\hat W_j-W_j\|^2_\infty\lesssim\sum_{k=1}^K\frac1n\sum_{j=1}^n\Bigg\|\frac{(n-K)!}{(n-1)!}\sum_{\boldsymbol i\in I_{n,K}:i_k=j}\big(X_{\boldsymbol i}-\mathbb E[X_{\boldsymbol i}\mid U_j]\big)\Bigg\|^2_\infty.$$
Consider the $k=1$ term. Pick any $j\in\mathbb N$. Let $I^{-j}_{\infty,K-1}=\{(i_2,\dots,i_K)\in(\mathbb N\setminus\{j\})^{K-1}:i_2,\dots,i_K\text{ are distinct}\}$. Given $U_j$, for each $\boldsymbol i_{-1}=(i_2,\dots,i_K)\in I^{-j}_{\infty,K-1}$ and $\boldsymbol e\in\{0,1\}^{K-1}$, define the vector
$$V_{\{\boldsymbol i_{-1}\odot\boldsymbol e\}_+}=\big(U_{\{\boldsymbol i_{-1}\odot\boldsymbol e\}_+},U_{\{(j,\boldsymbol i_{-1}\odot\boldsymbol e)\}_+}\big).$$
With this notation, we can rewrite $X_{\boldsymbol i}$ with $\boldsymbol i=(j,\boldsymbol i_{-1})$ as
$$X_{\boldsymbol i}=g\big(U_j,(V_{\{\boldsymbol i_{-1}\odot\boldsymbol e\}_+})_{\boldsymbol e\in\{0,1\}^{K-1}\setminus\{\boldsymbol 0\}}\big).$$
From this expression, we see that, conditionally on $U_j$, the array $(X_{(j,\boldsymbol i_{-1})})_{\boldsymbol i_{-1}\in I^{-j}_{\infty,K-1}}$ is jointly exchangeable with mean vector $\mathbb E[X_{\boldsymbol i}\mid U_j]$. Applying Lemmas 4 and 5 conditionally on $U_j$ (the fact that the $U$-variables are uniform on $(0,1)$ is not crucial in the proofs), we have
$$\mathbb E\Bigg[\Bigg\|\frac{(n-K)!}{(n-1)!}\sum_{\boldsymbol i\in I_{n,K}:i_1=j}\big(X_{\boldsymbol i}-\mathbb E[X_{\boldsymbol i}\mid U_j]\big)\Bigg\|^{2\nu}_\infty\ \Bigg|\ U_j\Bigg]\lesssim\underbrace{\Bigg(\sum_{k=1}^{K-1}n^{-k/2}(\log p)^{k/2}\Bigg)^{2\nu}}_{\lesssim(n^{-1}\log p)^\nu}\mathbb E[\|X_{(j,\boldsymbol i_{-1})}\|^{2\nu}_\infty\mid U_j],$$
where $\boldsymbol i_{-1}\in I^{-j}_{\infty,K-1}$ is arbitrary. By Fubini's theorem, the expectation of the left-hand side can be bounded as
$$\lesssim(n^{-1}\log p)^\nu\,\mathbb E[\|X_{(j,\boldsymbol i_{-1})}\|^{2\nu}_\infty]\lesssim(n^{-1}D_n^2\log p)^\nu.$$
Similar bounds hold for the other $k$. Conclude that $\mathbb E[(\sigma^2\hat\Delta_{W,2}\log^4p)^\nu]\lesssim n^{-\zeta_1\nu}$ under our assumption. Together with Markov's inequality, we obtain
$$\mathbb P\big(\sigma^2\hat\Delta_{W,2}\log^4p>n^{-\zeta_1+1/\nu}\big)\lesssim n^{-1}.$$
This completes the proof. $\square$

Appendix E. Proofs for Section 5

E.1. Proof of Proposition 3. In this proof, the notation $\lesssim$ means that the left-hand side is bounded by the right-hand side up to a constant independent of $n$. By Theorem 1 (using Condition (2.4)), we have
$$\sup_{t\in\mathbb R}\big|\mathbb P(\|\sqrt nS_{\boldsymbol N}\|_\infty\le t)-\mathbb P(\|G\|_\infty\le t)\big|\to0,$$
where $G\sim N(\boldsymbol 0,\Sigma)$ with $\Sigma=\sum_{k=1}^K(n/N_k)\mathbb E[V_{k,1}V^T_{k,1}]$. Conditionally on $((Y_{\boldsymbol i},Z^T_{\boldsymbol i})^T)_{\boldsymbol i\in[\boldsymbol N]}$, we have
$$\sum_{k=1}^K\frac{\sqrt n}{N_k}\sum_{i_k=1}^{N_k}\xi_{k,i_k}(\tilde V_{k,i_k}-\tilde S_{\boldsymbol N})\sim N(\boldsymbol 0,\tilde\Sigma),\qquad\text{where }\tilde\Sigma=\sum_{k=1}^K\frac n{N_k^2}\sum_{i_k=1}^{N_k}(\tilde V_{k,i_k}-\tilde S_{\boldsymbol N})(\tilde V_{k,i_k}-\tilde S_{\boldsymbol N})^T.$$
Thus, in view of Lemma 8, it suffices to show that $\|\tilde\Sigma-\Sigma\|_\infty\log^2p=o_P(1)$ (the bound on $\lambda$ follows from the Gaussian concentration). Further, Proposition 1 and the proof of Theorem 2 under polynomial moment conditions (see also Remark 3) imply that $\|\hat\Sigma-\Sigma\|_\infty=o_P((\log p)^{-2})$, where $\hat\Sigma=\sum_{k=1}^K(n/N_k^2)\sum_{i_k=1}^{N_k}(\hat V_{k,i_k}-S_{\boldsymbol N})(\hat V_{k,i_k}-S_{\boldsymbol N})^T$ and $\hat V_{k,i_k}=(\prod_{k'\ne k}N_{k'})^{-1}\sum_{i_1,\dots,i_{k-1},i_{k+1},\dots,i_K}\varepsilon_{\boldsymbol i}X_{\boldsymbol i}$. Thus, it suffices to show that $\|\tilde\Sigma-\hat\Sigma\|_\infty=o_P((\log p)^{-2})$.

Recall that $\lambda=(\log n)(n^{-1}\log p)^{1/2}$. We note that
$$\mathbb E[\|G\|_\infty]\lesssim\max_{j,k}\sqrt{\mathbb E[(V^j_{k,1})^2]\log p}\lesssim\sqrt{\log p},$$
so that $\lambda\ge c\|S_{\boldsymbol N}\|_\infty$ with probability $1-o(1)$. By assumption, $\kappa(s,c_0)$ is bounded away from zero with probability $1-o(1)$. Thus, Theorem 1 in Belloni and Chernozhukov (2013) implies that
$$\sqrt{\frac1N\sum_{\boldsymbol i\in[\boldsymbol N]}(X^T_{\boldsymbol i}(\tilde\beta-\beta))^2}=O_P\Bigg(\sqrt{\frac{s\log^2(pN)}n}\Bigg).$$
bserve that k ˜Σ − ˆΣ k ∞ ≤ K X k =1 max ≤ j,ℓ ≤ p (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N k N k X i k =1 ( ˜ V jk,i k ˜ V ℓk,i k − ˆ V jk,i k ˆ V ℓk,i k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)| {z } =:( I k ) + K max ≤ j,ℓ ≤ p (cid:12)(cid:12)(cid:12) ˜ S j N ˜ S ℓ N − S j N S ℓ N (cid:12)(cid:12)(cid:12)| {z } =:( II ) . We first consider the term ( I k ). We shall focus on k = 1 as similar bounds hold for other k . Observethat1 N N X i =1 ( ˜ V j ,i ˜ V ℓ ,i − ˆ V j ,i ˆ V ℓ ,i ) = 1 N N X i =1 ( ˜ V j ,i − ˆ V j ,i )( ˜ V ℓ ,i − ˆ V ℓ ,i ) + 1 N N X i =1 ( ˜ V j ,i − ˆ V j ,i ) ˆ V ℓ ,i + 1 N N X i =1 ˆ V j ,i ( ˜ V ℓ ,i − ˆ V ℓ ,i ) . By Cauchy-Schwarz, we have( I ) ≤ max ≤ j ≤ p N N X i =1 ( ˜ V j ,i − ˆ V j ,i ) | {z } =:( III ) +2( III ) / vuuuuut max ≤ ℓ ≤ p N N X i =1 | ˆ V ℓ ,i | | {z } =:( IV ) . To bound ( IV ), we note thatmax ≤ ℓ ≤ p N N X i =1 | ˆ V ℓ ,i | ≤ N N X i =1 (cid:13)(cid:13)(cid:13) ˆ V ,i − E [ ˆ V ,i | U ( i , ,..., ] (cid:13)(cid:13)(cid:13) ∞ + 1 N N X i =1 (cid:13)(cid:13)(cid:13) E [ ˆ V ,i | U ( i , ,..., ] (cid:13)(cid:13)(cid:13) ∞ Since E [ ˆ V ,i | U ( i , ,..., ] = E [¯ ε X | U (1 , ,..., ] for all i , by Fubini and Jensen’s inequality, we have E " N N X i =1 (cid:13)(cid:13)(cid:13) E [ ˆ V ,i | U ( i , ,..., ] (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:16) E h(cid:13)(cid:13)(cid:13) E [ ε X ℓ | U (1 , ,..., ] (cid:13)(cid:13)(cid:13) q ∞ i(cid:17) /q ≤ (cid:18) E (cid:20) max ≤ ℓ ≤ p | ε X ℓ | q (cid:21)(cid:19) /q ≤ D N . Conditionally on U ( i , ,..., , E (cid:20)(cid:13)(cid:13)(cid:13) ˆ V ,i − E [ ˆ V ,i | U ( i , ,..., ] (cid:13)(cid:13)(cid:13) ∞ | U ( i , ,..., (cid:21) ≤ (cid:16) E h(cid:13)(cid:13)(cid:13) ˆ V ,i − E [ ε i X i | U ( i , ,..., ] (cid:13)(cid:13)(cid:13) q ∞ | U ( i , ,..., i(cid:17) /q . As in the proof of Proposition 1, conditionally on U ( i , ,..., , the array ( ε ( i , i − ) X ( i , i − ) ) i − ∈ N K − is separately exchangeable with mean vector E [ ε i X i | U ( i , ,..., ]. 
By Corollary 3, we have
\[
E\Big[\big\|\hat{V}_{1,i_{1}}-E[\varepsilon_{\boldsymbol{i}}X_{\boldsymbol{i}}\mid U_{(i_{1},0,\dots,0)}]\big\|_{\infty}^{q}\;\Big|\;U_{(i_{1},0,\dots,0)}\Big]
\lesssim n^{-q/2}(\log p)^{q/2}\,E\big[\|\varepsilon_{\boldsymbol{i}}X_{\boldsymbol{i}}\|_{\infty}^{q}\mid U_{(i_{1},0,\dots,0)}\big].
\]
By Fubini, we have
\[
E\Big[\big\|\hat{V}_{1,i_{1}}-E[\varepsilon_{\boldsymbol{i}}X_{\boldsymbol{i}}\mid U_{(i_{1},0,\dots,0)}]\big\|_{\infty}^{q}\Big]
\lesssim n^{-q/2}(\log p)^{q/2}D_{N}^{q}.
\]
Conclude that $(IV)=O_{P}(D_{N}^{2})$.

Next, we shall bound the term $(III)$. Observe that by Cauchy--Schwarz,
\[
\big|\tilde{V}^{j}_{1,i_{1}}-\hat{V}^{j}_{1,i_{1}}\big|
=\Bigg|\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}X^{j}_{\boldsymbol{i}}\big(X_{\boldsymbol{i}}^{T}(\tilde{\beta}-\beta)+r_{\boldsymbol{i}}\big)\Bigg|
\le \sqrt{\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}(X^{j}_{\boldsymbol{i}})^{2}}
\Bigg(\sqrt{\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}\big(X_{\boldsymbol{i}}^{T}(\tilde{\beta}-\beta)\big)^{2}}
+\sqrt{\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}r_{\boldsymbol{i}}^{2}}\Bigg),
\]
so that the term $(III)$ is bounded as
\[
(III)\lesssim \max_{j}\frac{1}{N_{1}}\sum_{i_{1}=1}^{N_{1}}\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}(X^{j}_{\boldsymbol{i}})^{2}
\Bigg\{\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}\big(X_{\boldsymbol{i}}^{T}(\tilde{\beta}-\beta)\big)^{2}+\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}r_{\boldsymbol{i}}^{2}\Bigg\}
\le \max_{j,i_{1}}\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}(X^{j}_{\boldsymbol{i}})^{2}
\underbrace{\Bigg\{\frac{1}{N}\sum_{\boldsymbol{i}\in[N]}\big(X_{\boldsymbol{i}}^{T}(\tilde{\beta}-\beta)\big)^{2}+\frac{1}{N}\sum_{\boldsymbol{i}\in[N]}r_{\boldsymbol{i}}^{2}\Bigg\}}_{=O_{P}\left(\frac{s\log^{3}(pN)}{n}\right)}.
\]
Observe that
\[
E\Bigg[\max_{j,i_{1}}\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}(X^{j}_{\boldsymbol{i}})^{2}\Bigg]
\le E\Bigg[\max_{j,i_{1}}\Bigg|\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}\Big\{(X^{j}_{\boldsymbol{i}})^{2}-E[(X^{j}_{(i_{1},1,\dots,1)})^{2}\mid U_{(i_{1},0,\dots,0)}]\Big\}\Bigg|\Bigg]
+E\Big[\max_{j,i_{1}}E[(X^{j}_{(i_{1},1,\dots,1)})^{2}\mid U_{(i_{1},0,\dots,0)}]\Big].
\]
By H\"older's inequality, we have
\[
E\Big[\max_{j,i_{1}}E[(X^{j}_{(i_{1},1,\dots,1)})^{2}\mid U_{(i_{1},0,\dots,0)}]\Big]
\le E\Big[\max_{i_{1}}E[\|X_{(i_{1},1,\dots,1)}\|_{\infty}^{2}\mid U_{(i_{1},0,\dots,0)}]\Big]
\le E\Big[\max_{i_{1}}\big(E[\|X_{(i_{1},1,\dots,1)}\|_{\infty}^{q}\mid U_{(i_{1},0,\dots,0)}]\big)^{2/q}\Big]
\le E\Bigg[\Big(\sum_{i_{1}}E[\|X_{(i_{1},1,\dots,1)}\|_{\infty}^{q}\mid U_{(i_{1},0,\dots,0)}]\Big)^{2/q}\Bigg]
\le N_{1}^{2/q}D_{N}^{2}.
\]
Applying Corollary 3 conditionally on $U_{(i_{1},0,\dots,0)}$ (cf. the proof of Proposition 1), we have
\[
E\Bigg[\max_{j}\Bigg|\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}\Big\{(X^{j}_{\boldsymbol{i}})^{2}-E[(X^{j}_{(i_{1},1,\dots,1)})^{2}\mid U_{(i_{1},0,\dots,0)}]\Big\}\Bigg|^{q/2}\;\Bigg|\;U_{(i_{1},0,\dots,0)}\Bigg]
\lesssim n^{-q/4}(\log p)^{q/4}E\big[\|X_{(i_{1},1,\dots,1)}\|_{\infty}^{q}\mid U_{(i_{1},0,\dots,0)}\big].
\]
Thus, we have
\[
E\Bigg[\max_{j,i_{1}}\Bigg|\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}\Big\{(X^{j}_{\boldsymbol{i}})^{2}-E[(X^{j}_{(i_{1},1,\dots,1)})^{2}\mid U_{(i_{1},0,\dots,0)}]\Big\}\Bigg|\Bigg]
\le \Bigg(\sum_{i_{1}}E\Bigg[\max_{j}\Bigg|\frac{1}{\prod_{k=2}^{K}N_{k}}\sum_{i_{2},\dots,i_{K}}\Big\{(X^{j}_{\boldsymbol{i}})^{2}-E[(X^{j}_{(i_{1},1,\dots,1)})^{2}\mid U_{(i_{1},0,\dots,0)}]\Big\}\Bigg|^{q/2}\Bigg]\Bigg)^{2/q}
\lesssim N_{1}^{2/q}n^{-1/2}(\log p)^{1/2}D_{N}^{2}.
\]
Conclude that $(III)=O_{P}\big(n^{-1}sN_{1}^{2/q}D_{N}^{2}\log^{3}(pN)\big)$, and consequently
\[
(I_{1})=O_{P}\Big(\big\{n^{-1}sN_{1}^{2/q}D_{N}^{4}\log^{3}(pN)\big\}^{1/2}\Big).
\]
Finally, to bound $(II)$, observe that
\[
\tilde{S}^{j}_{N}\tilde{S}^{\ell}_{N}-S^{j}_{N}S^{\ell}_{N}
=(\tilde{S}^{j}_{N}-S^{j}_{N})(\tilde{S}^{\ell}_{N}-S^{\ell}_{N})+S^{j}_{N}(\tilde{S}^{\ell}_{N}-S^{\ell}_{N})+(\tilde{S}^{j}_{N}-S^{j}_{N})S^{\ell}_{N}.
\]
Then, we have
\[
(II)\le \max_{1\le j\le p}\Bigg|\frac{1}{N}\sum_{\boldsymbol{i}\in[N]}(\tilde{\varepsilon}_{\boldsymbol{i}}-\varepsilon_{\boldsymbol{i}})X^{j}_{\boldsymbol{i}}\Bigg|^{2}
+2\|S_{N}\|_{\infty}\cdot\max_{1\le j\le p}\Bigg|\frac{1}{N}\sum_{\boldsymbol{i}\in[N]}(\tilde{\varepsilon}_{\boldsymbol{i}}-\varepsilon_{\boldsymbol{i}})X^{j}_{\boldsymbol{i}}\Bigg|.
\]
By Cauchy--Schwarz, we have
\[
\max_{1\le j\le p}\Bigg|\frac{1}{N}\sum_{\boldsymbol{i}\in[N]}(\tilde{\varepsilon}_{\boldsymbol{i}}-\varepsilon_{\boldsymbol{i}})X^{j}_{\boldsymbol{i}}\Bigg|
\le \max_{1\le j\le p}\sqrt{\frac{1}{N}\sum_{\boldsymbol{i}\in[N]}(X^{j}_{\boldsymbol{i}})^{2}}
\Bigg(\sqrt{\frac{1}{N}\sum_{\boldsymbol{i}\in[N]}\big(X_{\boldsymbol{i}}^{T}(\beta-\tilde{\beta})\big)^{2}}+\|r\|_{N,2}\Bigg)
=O_{P}\Bigg(\sqrt{\frac{sD_{N}^{2}\log^{3}(pN)}{n}}\Bigg),
\]
so that $(II)=O_{P}\big(n^{-1}sD_{N}^{2}\log^{3}(pN)+\{n^{-2}sD_{N}^{2}(\log p)\log^{3}(pN)\}^{1/2}\big)$.

Combining the above bounds, we have $\|\tilde{\Sigma}-\hat{\Sigma}\|_{\infty}=O_{P}\big(\{n^{-1}sN_{1}^{2/q}D_{N}^{4}\log^{3}(pN)\}^{1/2}\big)$. This implies that $\|\tilde{\Sigma}-\hat{\Sigma}\|_{\infty}\log^{2}p=o_{P}(1)$, as required. $\Box$

E.2.
Proof of Proposition 4. Recall that $K=2$. We write $X_{i,j}$ instead of $X_{(i,j)}$ for notational simplicity. Define the $N\times p$ matrix $\mathbb{X}=(X_{1,1},\dots,X_{N_{1},1},X_{1,2},\dots,X_{N_{1},N_{2}})^{T}$. The $s$-sparse eigenvalue with $1\le s\le p$ for $\mathbb{X}$ is defined by
\[
\phi_{\min}(s)=\min_{\|\theta\|_{0}\le s,\ \|\theta\|_{2}=1}\|\mathbb{X}\theta\|_{N,2}^{2}.
\]
By Lecu\'e and Mendelson (2017, Lemma 2.7), if $\phi_{\min}(s)\ge\phi_{0}$, then for $2\le s\le p$, we have
\[
\|\mathbb{X}\theta\|_{N,2}^{2}\ge\phi_{0}\|\theta\|_{2}^{2}-\frac{\|\theta\|_{1}^{2}}{s-1}\times\underbrace{\max_{1\le\ell\le p}\sum_{(i,j)\in[N]}(X^{\ell}_{i,j})^{2}/N}_{=:\hat{\rho}}
\]
for all $\theta\in\mathbb{R}^{p}$. We can then deduce that for $s_{0}\le(s-1)\phi_{0}/(2(1+c)^{2}\hat{\rho})$, we have $\kappa(s_{0},c)\ge\sqrt{\phi_{0}/2}$. Lemma 6 below implies that $\phi_{\min}(s)$ is bounded away from zero with probability $1-o(1)$. Further, observe that
\[
\hat{\rho}\le\max_{1\le\ell\le p}E[(X^{\ell}_{1,1})^{2}]
+\max_{1\le\ell\le p}\Bigg|N^{-1}\sum_{(i,j)\in[N]}\big\{(X^{\ell}_{i,j})^{2}-E[(X^{\ell}_{1,1})^{2}]\big\}\Bigg|.
\]
The first term on the right-hand side is $O(1)$, while the second term is $o_{P}(1)$ (which follows from Lemma 6 below with $s=1$), so that $\hat{\rho}=O_{P}(1)$. The conclusion of the proposition follows from rescaling $s_{0}$. $\Box$

Lemma 6 (Sparse eigenvalues for two-way clustering). Suppose that $(X_{i,j})_{(i,j)\in[N]}$ with $[N]=\{1,\dots,N_{1}\}\times\{1,\dots,N_{2}\}$ is sampled from a separately exchangeable array $(X_{i,j})_{(i,j)\in\mathbb{N}^{2}}$ generated as $X_{i,j}=g(U_{i,0},U_{0,j},U_{i,j})$ for some Borel measurable map $g:[0,1]^{3}\to\mathbb{R}^{p}$ and i.i.d. $U[0,1]$ variables $U_{i,0},U_{0,j},U_{i,j}$. Pick any $1\le s\le p\wedge n$. Let $B=\sqrt{E[M^{2}]}$ with $M=\max_{(i,j)\in[N]}\|X_{i,j}\|_{\infty}$. Define
\[
\delta_{N}=\sqrt{s}B\left(\frac{1}{\sqrt{n}}\Big\{\log^{1/2}p+(\log s)(\log^{1/2}N)(\log^{1/2}p)\Big\}\vee\frac{\sqrt{s}}{\sqrt{N}}\Big\{\log p+(\log N)(\log p)\Big\}\right).
\]
Then, we have
\[
E\Bigg[\sup_{\|\theta\|_{0}\le s,\ \|\theta\|_{2}=1}\Bigg|\frac{1}{N}\sum_{(i,j)\in[N]}\Big\{(\theta^{T}X_{i,j})^{2}-E[(\theta^{T}X_{1,1})^{2}]\Big\}\Bigg|\Bigg]
\lesssim\delta_{N}^{2}+\delta_{N}\sup_{\|\theta\|_{0}\le s,\ \|\theta\|_{2}=1}\sqrt{E[(\theta^{T}X_{1,1})^{2}]}
\]
up to a universal constant. In addition, we have $\delta_{N}\lesssim\{n^{-1}sB^{2}\log^{4}(pN)\}^{1/2}$ up to a universal constant.

Proof of Lemma 6. In this proof, the notation $\lesssim$ means that the left-hand side is bounded by the right-hand side up to a universal constant. Let $\Theta_{s}=\cup_{|T|=s}\{\theta\in\mathbb{R}^{p}:\|\theta\|_{2}=1,\ \mathrm{supp}(\theta)\subset T\}$. Further, let $Z_{i,j}(\theta)=(\theta^{T}X_{i,j})^{2}-E[(\theta^{T}X_{1,1})^{2}]$. Then, for each $\theta$, $Z_{i,j}(\theta)$ is a centered random variable. Consider the decomposition
\[
Z_{i,j}(\theta)=E[Z_{i,1}(\theta)\mid U_{i,0}]+E[Z_{1,j}(\theta)\mid U_{0,j}]
+\underbrace{Z_{i,j}(\theta)-E[Z_{i,1}(\theta)\mid U_{i,0}]-E[Z_{1,j}(\theta)\mid U_{0,j}]}_{=:\hat{Z}_{i,j}(\theta)}.
\]
We divide the rest of the proof into two steps.

Step 1. Consider first the term $\frac{1}{N}\sum_{i,j}E[Z_{i,j}(\theta)\mid U_{i,0}]=\frac{1}{N_{1}}\sum_{i=1}^{N_{1}}E[Z_{i,1}(\theta)\mid U_{i,0}]$, which consists of i.i.d. variables. Observe that $E[Z_{i,1}(\theta)\mid U_{i,0}]$ has mean 0, and by symmetrization,
\[
E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\frac{1}{N_{1}}\sum_{i=1}^{N_{1}}E[Z_{i,1}(\theta)\mid U_{i,0}]\Bigg|\Bigg]
=E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\frac{1}{N_{1}}\sum_{i=1}^{N_{1}}\big(\theta^{T}E[X_{i,1}X_{i,1}^{T}\mid U_{i,0}]\theta-E[(\theta^{T}X_{1,1})^{2}]\big)\Bigg|\Bigg]
\le 2E\Bigg[E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\frac{1}{N_{1}}\sum_{i=1}^{N_{1}}\epsilon_{i}\,\theta^{T}E[X_{i,1}X_{i,1}^{T}\mid U_{i,0}]\theta\Bigg|\;\Bigg|\;X_{[N]}\Bigg]\Bigg]
\le 2E\Bigg[E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\frac{1}{N_{1}}\sum_{i=1}^{N_{1}}\epsilon_{i}(\theta^{T}X_{i,1})^{2}\Bigg|\;\Bigg|\;X_{[N]}\Bigg]\Bigg],
\]
where $(\epsilon_{i})_{i=1}^{N_{1}}$ is a sequence of independent Rademacher random variables that are independent of $(X_{i,j})_{(i,j)\in[N]}$, and the second inequality follows from Jensen's inequality. Now, the following bound can be obtained by following the proof of Lemma P.1 in Belloni et al. (2018) with $\mathcal{U}$ set to be a singleton set:
\[
E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\sum_{i=1}^{N_{1}}\epsilon_{i}(\theta^{T}X_{i,1})^{2}\Bigg|\;\Bigg|\;X_{[N]}\Bigg]
\lesssim\sqrt{s}MR\Big(\log^{1/2}p+(\log s)(\log^{1/2}N)(\log^{1/2}p)\Big),
\]
where $R=\sup_{\theta\in\Theta_{s}}\big(\sum_{i=1}^{N_{1}}(\theta^{T}X_{i,1})^{2}\big)^{1/2}$. Choosing $\delta_{N,1}=BN_{1}^{-1/2}\sqrt{s}\{\log^{1/2}p+(\log s)(\log^{1/2}N)(\log^{1/2}p)\}$, by Cauchy--Schwarz, we have
\[
I:=E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\frac{1}{N_{1}}\sum_{i=1}^{N_{1}}\epsilon_{i}(\theta^{T}X_{i,1})^{2}\Bigg|\Bigg]
\lesssim\frac{\delta_{N,1}E[MR]}{B\sqrt{N_{1}}}
\le\frac{\delta_{N,1}}{B}\left(\frac{E[M^{2}]E[R^{2}]}{N_{1}}\right)^{1/2}
\le\delta_{N,1}\big(E[R^{2}/N_{1}]\big)^{1/2}
\lesssim\delta_{N,1}\Big(I+\sup_{\theta\in\Theta_{s}}E[(\theta^{T}X_{1,1})^{2}]\Big)^{1/2}.
\]
Using the algebraic fact that $a^{2}\le\delta_{1}a+\delta_{2}b$ implies $a\le\delta_{1}+\sqrt{\delta_{2}b}$, we have
\[
I\lesssim\delta_{N,1}^{2}+\delta_{N,1}\sqrt{\sup_{\theta\in\Theta_{s}}E[(\theta^{T}X_{1,1})^{2}]}.
\]
The same bound holds for $E\big[\sup_{\theta\in\Theta_{s}}\big|N_{2}^{-1}\sum_{j=1}^{N_{2}}E[Z_{1,j}(\theta)\mid U_{0,j}]\big|\big]$. Conclude that
\[
E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\frac{1}{N}\sum_{i,j}\big(E[Z_{i,j}(\theta)\mid U_{i,0}]+E[Z_{i,j}(\theta)\mid U_{0,j}]\big)\Bigg|\Bigg]
\lesssim\delta_{N,1}^{2}+\delta_{N,1}\sqrt{\sup_{\theta\in\Theta_{s}}E[(\theta^{T}X_{1,1})^{2}]},
\]
where $\delta_{N,1}=Bn^{-1/2}\sqrt{s}\{\log^{1/2}p+(\log s)(\log^{1/2}N)(\log^{1/2}p)\}\lesssim Bn^{-1/2}\sqrt{s}\log^{2}(pN)$.

Step 2. Now, to obtain a bound on $E[\sup_{\theta\in\Theta_{s}}|N^{-1}\sum_{i,j}\hat{Z}_{i,j}(\theta)|]$, by Lemma 2, we have the following symmetrization inequality:
\[
E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\sum_{i,j}\hat{Z}_{i,j}(\theta)\Bigg|\Bigg]
\lesssim E\Bigg[E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\sum_{i,j}\epsilon_{i}\epsilon_{j}'\hat{Z}_{i,j}(\theta)\Bigg|\;\Bigg|\;X_{[N]}\Bigg]\Bigg]
\lesssim E\Bigg[E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\sum_{i,j}\epsilon_{i}\epsilon_{j}'(\theta^{T}X_{i,j})^{2}\Bigg|\;\Bigg|\;X_{[N]}\Bigg]\Bigg],
\]
where $(\epsilon_{i})$ and $(\epsilon_{j}')$ are independent copies of Rademacher random variables independent of $(X_{i,j})_{(i,j)\in[N]}$, and the second inequality follows from Jensen's inequality. Conditionally on $(X_{i,j})_{(i,j)\in[N]}$, $\sum_{i,j}\epsilon_{i}\epsilon_{j}'(\theta^{T}X_{i,j})^{2}$ is a Rademacher chaos of degree 2 (cf. the proof of Theorem 5). Hence, Corollary 5.1.8 in de la Pe\~na and Gin\'e (1999) yields that
\[
II:=E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\sum_{i,j}\epsilon_{i}\epsilon_{j}'(\theta^{T}X_{i,j})^{2}\Bigg|\;\Bigg|\;X_{[N]}\Bigg]
\lesssim\Bigg\|\sup_{\theta\in\Theta_{s}}\Bigg|\sum_{i,j}\epsilon_{i}\epsilon_{j}'(\theta^{T}X_{i,j})^{2}\Bigg|\Bigg\|_{\psi_{1}\mid X}
\lesssim\int_{0}^{\mathrm{diam}(\Theta_{s})}\log N(\Theta_{s},\rho_{X},t)\,dt,
\]
where $\|\cdot\|_{\psi_{1}\mid X}$ is the $\psi_{1}$-norm evaluated conditionally on $(X_{i,j})_{(i,j)\in[N]}$, $\rho_{X}$ is a pseudometric on $\Theta_{s}$ defined by $\rho_{X}(\theta_{1},\theta_{2})=\big(\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\{(\theta_{1}^{T}X_{i,j})^{2}-(\theta_{2}^{T}X_{i,j})^{2}\}^{2}\big)^{1/2}$, and $\mathrm{diam}(\Theta_{s})$ is the $\rho_{X}$-diameter of $\Theta_{s}$. Now, for any two $\theta,\bar{\theta}\in\Theta_{s}$,
\[
\rho_{X}(\theta,\bar{\theta})
=\Bigg(\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\big\{(\theta^{T}X_{i,j})^{2}-(\bar{\theta}^{T}X_{i,j})^{2}\big\}^{2}\Bigg)^{1/2}
\le\Bigg(\sum_{i=1}^{N_{1}}\sum_{j=1}^{N_{2}}\big\{(\theta^{T}X_{i,j})+(\bar{\theta}^{T}X_{i,j})\big\}^{2}\Bigg)^{1/2}\max_{(i,j)\in[N]}|(\theta-\bar{\theta})^{T}X_{i,j}|
\le 2R\|\theta-\bar{\theta}\|_{X},
\]
where $R=\sup_{\theta\in\Theta_{s}}\big(\sum_{(i,j)\in[N]}(\theta^{T}X_{i,j})^{2}\big)^{1/2}$ and $\|\theta\|_{X}=\max_{(i,j)\in[N]}|\theta^{T}X_{i,j}|$. Thus, we have
\[
\int_{0}^{\mathrm{diam}(\Theta_{s})}\log N(\Theta_{s},\rho_{X},t)\,dt
\le\int_{0}^{4\sqrt{s}MR}\log N\big(\Theta_{s}/\sqrt{s},\|\cdot\|_{X},t/(2\sqrt{s}R)\big)\,dt
=2\sqrt{s}R\int_{0}^{2M}\log N\big(\Theta_{s}/\sqrt{s},\|\cdot\|_{X},t\big)\,dt.
\]
Lemma 3.9 and Equation (3.10) in Rudelson and Vershynin (2008) yield that for some universal constant $A$,
\[
\int_{0}^{2M}\log N\big(\Theta_{s}/\sqrt{s},\|\cdot\|_{X},t\big)\,dt
\le\int_{0}^{2M/\sqrt{s}}\log\Bigg(\binom{p}{s}(1+2M/t)^{s}\Bigg)\,dt
+\int_{2M/\sqrt{s}}^{2M}\log\Big((2p)^{AM^{2}t^{-2}\log N}\Big)\,dt
\le 2M\sqrt{s}\log p+s\int_{0}^{2M/\sqrt{s}}\log(1+2M/t)\,dt
+AM^{2}(\log N)(\log(2p))\int_{2M/\sqrt{s}}^{2M}\frac{dt}{t^{2}}
\lesssim\sqrt{s}M\big(\log p+(\log N)(\log p)\big),
\]
where the second term follows from the integration by parts
\[
\int_{0}^{2M/\sqrt{s}}\log(1+2M/t)\,dt
=t\log(1+2M/t)\Big|_{0}^{2M/\sqrt{s}}+\int_{0}^{2M/\sqrt{s}}\frac{2M}{t+2M}\,dt
\le\frac{2M}{\sqrt{s}}\log(1+\sqrt{s})+\frac{2M}{\sqrt{s}}.
\]
Hence, we have
\[
II\lesssim sRM\big(\log p+(\log N)(\log p)\big).
\]
Setting $\delta_{N,2}=sN^{-1/2}B\big(\log p+(\log N)(\log p)\big)$, we have
\[
III:=E\Bigg[\sup_{\theta\in\Theta_{s}}\Bigg|\frac{1}{N}\sum_{i,j}\epsilon_{i}\epsilon_{j}'(\theta^{T}X_{i,j})^{2}\Bigg|\Bigg]
\lesssim\frac{\delta_{N,2}E[MR]}{B\sqrt{N}}
\le\frac{\delta_{N,2}}{B}\left(\frac{E[M^{2}]E[R^{2}]}{N}\right)^{1/2}
\le\delta_{N,2}\left(\frac{E[R^{2}]}{N}\right)^{1/2}
\lesssim\delta_{N,2}\Big(III+\sup_{\theta\in\Theta_{s}}E[(\theta^{T}X_{1,1})^{2}]\Big)^{1/2}.
\]
Using the same algebraic fact as in Step 1 yields that $III\lesssim\delta_{N,2}^{2}+\delta_{N,2}\sqrt{\sup_{\theta\in\Theta_{s}}E[(\theta^{T}X_{1,1})^{2}]}$. Finally, since $n\le\sqrt{N}$ and $s\le n$, we have
\[
\frac{sB}{\sqrt{N}}\big(\log p+(\log N)(\log p)\big)
\lesssim\frac{sB}{n}\big(\log p+(\log N)(\log p)\big)
\lesssim\frac{\sqrt{s}B}{\sqrt{n}}\log^{2}(pN).
\]
This completes the proof. $\Box$

Appendix F. Technical Tools

Lemma 7 (Nazarov's inequality). Let $Y=(Y_{1},\dots,Y_{p})^{T}$ be a centered Gaussian random vector in $\mathbb{R}^{p}$ such that $E[Y_{j}^{2}]\ge\sigma^{2}$ for all $1\le j\le p$ and some constant $\sigma>0$. Then for every $y\in\mathbb{R}^{p}$ and $\delta>0$,
\[
P(Y\le y+\delta)-P(Y\le y)\le\frac{\delta}{\sigma}\big(\sqrt{2\log p}+2\big).
\]

Proof. This is Lemma A.1 in Chernozhukov et al. (2017a); see Chernozhukov et al. (2017b) for its proof. $\Box$

Lemma 8 (Gaussian comparison over rectangles). Let $Y$ and $W$ be centered Gaussian random vectors in $\mathbb{R}^{d}$ with covariance matrices $\Sigma^{Y}=(\Sigma^{Y}_{j,k})_{1\le j,k\le d}$ and $\Sigma^{W}=(\Sigma^{W}_{j,k})_{1\le j,k\le d}$, respectively, and let $\Delta=\|\Sigma^{Y}-\Sigma^{W}\|_{\infty}$. Suppose that $\min_{1\le j\le d}\Sigma^{Y}_{j,j}\wedge\min_{1\le j\le d}\Sigma^{W}_{j,j}\ge\sigma^{2}$ for some constant $\sigma>0$. Then
\[
\sup_{R\in\mathcal{R}}|P(Y\in R)-P(W\in R)|\le C(\Delta\log^{2}d)^{1/2},
\]
where $C$ is a constant that depends only on $\sigma$.

Proof. See Corollary 5.1 in Chernozhukov et al. (2019b). $\Box$

References

Aldous, D. J.
(1981): "Representations for partially exchangeable arrays of random variables," Journal of Multivariate Analysis, 11, 581–598.

Andrews, D. W. (2005): "Cross-section regression with common shocks," Econometrica, 73, 1551–1585.

Andrews, D. W. and X. Shi (2013): "Inference based on conditional moment inequalities," Econometrica, 81, 609–666.

——— (2017): "Inference based on many conditional moment inequalities," Journal of Econometrics, 196, 275–287.

Armstrong, T. B. (2014): "Weighted KS statistics for inference on conditional moment inequalities," Journal of Econometrics, 181, 92–116.

Armstrong, T. B. and H. P. Chan (2016): "Multiscale adaptive inference on conditional moment inequalities," Journal of Econometrics, 194, 24–43.

Aronow, P. M., C. Samii, and V. A. Assenova (2015): "Cluster-robust variance estimation for dyadic data," Political Analysis, 23, 564–577.

Athey, S. and A. Schmutzler (2001): "Investment and market dominance," RAND Journal of Economics, 32, 1–26.

Bai, Y., A. Santos, and A. Shaikh (2019): "A practical method for testing many moment inequalities," University of Chicago, Becker Friedman Institute for Economics Working Paper.

Belloni, A. and V. Chernozhukov (2011): "ℓ1-penalized quantile regression in high-dimensional sparse models," Annals of Statistics, 39, 82–130.

——— (2013): "Least squares after model selection in high-dimensional sparse models," Bernoulli, 19, 521–547.

Belloni, A., V. Chernozhukov, D. Chetverikov, and Y. Wei (2018): "Uniformly valid post-regularization confidence regions for many functional parameters in Z-estimation framework," Annals of Statistics, 46, 3643–3675.

Berry, S., J. Levinsohn, and A. Pakes (1995): "Automobile prices in market equilibrium," Econometrica, 63, 841–890.

Berry, S. T. (1994): "Estimating discrete-choice models of product differentiation," RAND Journal of Economics, 25, 242–262.

Bickel, P. J. and A.
Chen (2009): "A nonparametric view of network models and Newman–Girvan and other modularities," Proceedings of the National Academy of Sciences, 106, 21068–21073.

Bickel, P. J., Y. Ritov, and A. B. Tsybakov (2009): "Simultaneous analysis of Lasso and Dantzig selector," Annals of Statistics, 37, 1705–1732.

Bühlmann, P. and S. van de Geer (2011): Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer Series in Statistics, Springer, Heidelberg.

Cameron, A. C. and D. L. Miller (2014): "Robust inference for dyadic data," Unpublished manuscript, University of California-Davis.

Cameron, A. C., J. B. Gelbach, and D. L. Miller (2011): "Robust inference with multiway clustering," Journal of Business & Economic Statistics, 29, 238–249.

Cameron, A. C. and D. L. Miller (2015): "A practitioner's guide to cluster-robust inference," Journal of Human Resources, 50, 317–372.

Chen, X. (2018): "Gaussian and bootstrap approximations for high-dimensional U-statistics and their applications," Annals of Statistics, 46, 642–678.

Chen, X. and K. Kato (2019): "Randomized incomplete U-statistics in high dimensions," Annals of Statistics, 47, 3127–3156.

——— (2020): "Jackknife multiplier bootstrap: finite sample approximations to the U-process supremum with applications," Probability Theory and Related Fields, 176, 1097–1163.

Chernozhukov, V., D. Chetverikov, and K. Kato (2013a): "Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors," Annals of Statistics, 41, 2786–2819.

——— (2014): "Gaussian approximation of suprema of empirical processes," Annals of Statistics, 42, 1564–1597.

——— (2015): "Comparison and anti-concentration bounds for maxima of Gaussian random vectors," Probability Theory and Related Fields, 162, 47–70.
——— (2016): "Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings," Stochastic Processes and their Applications, 126, 3632–3651.

——— (2017a): "Central limit theorems and bootstrap in high dimensions," Annals of Probability, 45, 2309–2352.

——— (2017b): "Detailed proof of Nazarov's inequality," arXiv:1711.10696.

——— (2019a): "Inference on causal and structural parameters using many moment inequalities," Review of Economic Studies, 86, 1867–1900.

Chernozhukov, V., D. Chetverikov, K. Kato, and Y. Koike (2019b): "Improved central limit theorem and bootstrap approximations in high dimensions," arXiv:1912.10529.

Chernozhukov, V., S. Lee, and A. M. Rosen (2013b): "Intersection bounds: Estimation and inference," Econometrica, 81, 667–737.

Chetverikov, D. (2018): "Adaptive tests of conditional moment inequalities," Econometric Theory, 34, 186–227.

Chiang, H. and Y. Sasaki (2019): "Lasso under multi-way clustering: Estimation and post-selection inference," arXiv:1905.02107.

Chiang, H. D., K. Kato, Y. Ma, and Y. Sasaki (2019): "Multiway cluster robust double/debiased machine learning," arXiv:1909.03489.

Davezies, L., X. D'Haultfoeuille, and Y. Guyonvarch (2018): "Asymptotic results under multiway clustering," arXiv:1807.07925.

——— (2020): "Empirical process results for exchangeable arrays," Annals of Statistics, forthcoming.

de la Peña, V. and E. Giné (1999): Decoupling: From Dependence to Independence, Springer.

Deng, H. and C.-H. Zhang (2020): "Beyond Gaussian approximation: Bootstrap for maxima of sums of independent random vectors," Annals of Statistics, forthcoming.

Eagleson, G. K. and N. C. Weber (1978): "Limit theorems for weakly exchangeable arrays," in Mathematical Proceedings of the Cambridge Philosophical Society, Cambridge University Press, vol. 84, 123–130.

Fafchamps, M. and F. Gubert (2007): "The formation of risk sharing networks," Journal of Development Economics, 83, 326–350.
Fang, X. and Y. Koike (2020): "High-dimensional central limit theorems by Stein's method," arXiv:2001.10917.

Gandhi, A., Z. Lu, and X. Shi (2020): "Estimating demand for differentiated products with zeroes in market share data," SSRN 3503565.

Giraud, C. (2015): Introduction to High-Dimensional Statistics, vol. 139 of Monographs on Statistics and Applied Probability, CRC Press, Boca Raton, FL.

Graham, B. and A. de Paula (2019): The Econometric Analysis of Network Data, Academic Press.

Graham, B. S. (2019): "Network data," Tech. rep., National Bureau of Economic Research.

Hoover, D. (1979): "Relations on probability spaces and arrays of random variables," Working paper.

Kallenberg, O. (2006): Probabilistic Symmetries and Invariance Principles, Springer Science & Business Media.

Koike, Y. (2019): "Gaussian approximation of maxima of Wiener functionals and its application to high-frequency data," Annals of Statistics, 47, 1663–1687.

Kuchibhotla, A. K., S. Mukherjee, and D. Banerjee (2020): "High-dimensional CLT: Improvements, non-uniform extensions and large deviations," Bernoulli, forthcoming.

Lecué, G. and S. Mendelson (2017): "Sparse recovery under weak moment assumptions," Journal of the European Mathematical Society, 19, 881–904.

Lee, A. J. (1990): U-Statistics: Theory and Practice, CRC Press.

Lee, S., K. Song, and Y.-J. Whang (2013): "Testing functional inequalities," Journal of Econometrics, 172, 14–32.

——— (2018): "Testing for a general class of functional inequalities," Econometric Theory, 34, 1018–1064.

MacKinnon, J. G. (2019): "How cluster-robust inference is changing applied econometrics," Canadian Journal of Economics, 52, 851–881.

MacKinnon, J. G., M. Ø. Nielsen, and M. D. Webb (2019): "Wild bootstrap and asymptotic inference with multiway clustering," Queen's Economics Department Working Paper, No. 1415.

McCullagh, P. (2000): "Resampling and exchangeable arrays," Bernoulli, 6, 285–301.

Menzel, K.
(2016): "Inference for games with many players," Review of Economic Studies, 83, 306–337.

——— (2017): "Bootstrap with clustering in two or more dimensions," arXiv:1703.03043.

Morales, E., G. Sheu, and A. Zahler (2019): "Extended Gravity," Review of Economic Studies, 86, 2668–2712.

Moulin, H. (1988): Axioms of Cooperative Game Theory, New York: Cambridge University Press.

Owen, A. B. (2007): "The pigeonhole bootstrap," Annals of Applied Statistics, 1, 386–411.

Owen, A. B. and D. Eckles (2012): "Bootstrapping data arrays of arbitrary order," Annals of Applied Statistics, 6, 895–927.

Rudelson, M. and R. Vershynin (2008): "On sparse reconstruction from Fourier and Gaussian measurements," Communications on Pure and Applied Mathematics, 61, 1025–1045.

Silverman, B. W. (1976): "Limit theorems for dissociated random variables," Advances in Applied Probability, 8, 806–819.

Tabord-Meehan, M. (2019): "Inference with dyadic data: Asymptotic behavior of the dyadic-robust t-statistic," Journal of Business & Economic Statistics, 37, 671–680.

Tibshirani, R. (1996): "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58, 267–288.

van der Vaart, A. W. and J. A. Wellner (1996): Weak Convergence and Empirical Processes, Springer.

Wainwright, M. J. (2019): High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.

Zhang, D. and W. B. Wu (2017): "Gaussian approximation for high dimensional time series," Annals of Statistics, 45, 1895–1919.

Zhang, X. and G. Cheng (2018): "Gaussian approximation for high dimensional vector under physical dependence," Bernoulli, 24, 2640–2675.

(H. D. Chiang) Department of Economics, University of Wisconsin-Madison, William H. Sewell Social Science Building, 1180 Observatory Drive, Madison, WI 53706, USA.
E-mail address: [email protected]

(K.
Kato) Department of Statistics and Data Science, Cornell University, 1194 Comstock Hall, Ithaca, NY 14853, USA.
E-mail address: [email protected]

(Y. Sasaki) Department of Economics, Vanderbilt University, VU Station B
E-mail address: [email protected]@vanderbilt.edu