Multiway Cluster Robust Double/Debiased Machine Learning
Harold D. Chiang† Kengo Kato‡ Yukun Ma§ Yuya Sasaki¶

Abstract
This paper investigates double/debiased machine learning (DML) under multiway clustered sampling environments. We propose a novel multiway cross fitting algorithm and a multiway DML estimator based on this algorithm. We also develop a multiway cluster robust standard error formula. Simulations indicate that the proposed procedure has favorable finite sample performance. Applying the proposed method to market share data for demand analysis, we obtain larger two-way cluster robust standard errors for the price coefficient than non-robust ones in the demand model.
Keywords: double/debiased machine learning, multiway clustering, multiway cross fitting
JEL Codes: C10, C13, C14

∗ First arXiv date: September 8, 2019
† Harold D. Chiang: [email protected]. Department of Economics, Vanderbilt University, VU Station B
‡ Kengo Kato: [email protected]. Department of Statistics and Data Science, Cornell University, 1194 Comstock Hall, Ithaca, NY 14853, USA
§ Yukun Ma: [email protected]. Department of Economics, Vanderbilt University, VU Station B
¶ Yuya Sasaki: [email protected]. Department of Economics, Vanderbilt University, VU Station B
‖ We benefited from useful comments by seminar participants at Southern Methodist University, Stony Brook University, University of Bristol, and University of Colorado - Boulder, and participants at the CeMMAP UCL/Vanderbilt Joint Conference on Advances in Econometrics and the CeMMAP Workshop on Causal Learning with Interactions. All remaining errors are ours.

1 Introduction
We propose a novel multiway cross fitting algorithm and a double/debiased machine learning (DML) estimator based on the proposed algorithm. This objective is motivated by the recently growing interest in the use of dependent cross sectional data and the recently increasing demand for DML methods in empirical research. On one hand, researchers frequently use multiway cluster sampled data in empirical studies, such as network data, matched employer-employee data, matched student-teacher data, scanner data where observations are double-indexed by stores and products, and market share data where observations are double-indexed by markets and products. On the other hand, we have witnessed the rapidly increasing popularity of machine learning methods in empirical studies, such as random forests, lasso, post-lasso, elastic nets, ridge, deep neural networks, and boosted trees, among others. To date, available DML methods focus on i.i.d. sampled data. In light of the aforementioned research environments today, a new method of DML that is applicable to multiway cluster sampled data may well be of interest to empirical researchers.

DML was proposed in the recent influential paper by Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (CCDDHNR, 2018). They provide a general DML toolbox for estimation and inference for structural parameters with high-dimensional and/or infinite-dimensional nuisance parameters. In that paper, the estimation method and the properties of the estimator are presented under the typical microeconometric assumption of i.i.d. sampling. We advance this frontier literature of DML by proposing a modified DML estimation procedure with multiway cross fitting, which accommodates multiway cluster sampled data. Even for multiway cluster sampled data, we show that the proposed DML procedure works under a nearly identical set of assumptions to that of CCDDHNR (2018).
To our best knowledge, the present paper is the first to consider generic DML methods under multiway cluster sampling.

Another branch of the literature, following the seminal work by Cameron, Gelbach, and Miller (2011), proposes multiway cluster robust inference methods. Menzel (2017) conducts formal analyses of bootstrap validity under multiway cluster sampling, robustly accounting for non-degenerate and degenerate cases. Davezies, D'Haultfoeuille, and Guyonvarch (2018) develop empirical process theory under multiway cluster sampling which applies to a large class of models. We advance this practically important literature by developing a multiway cluster robust inference method based on DML. In deriving theoretical properties of the proposed estimator, we take advantage of the Aldous-Hoover representation employed by the preceding papers. To our knowledge, the present paper is the first in this literature on multiway clustering to develop generic DML methods.
The past few years have seen a fast growing literature on machine learning based econometric methods. For general overviews of the field, see, e.g., Athey and Imbens (2019) or Mullainathan and Spiess (2017). For a review of estimation and inference methods for high-dimensional data, see Belloni, Chernozhukov, and Hansen (2014a). For an overview of data sketching methods tackling computationally impractically large numbers of observations, see Lee and Ng (2019). The DML of CCDDHNR (2018) is built upon Belloni, Chernozhukov, and Kato (2015), which proposes to use Neyman orthogonal moments for a general class of Z-estimation statistical problems in the presence of high-dimensional nuisance parameters. This framework is further generalized in different directions by Belloni, Chernozhukov, Fernández-Val, and Hansen (2017) and Belloni, Chernozhukov, Chetverikov, and Wei (2018). CCDDHNR (2018) combine the use of the Neyman orthogonality condition with cross fitting to provide a simple yet widely applicable framework that covers a large class of models under i.i.d. settings. The DML is also compatible with various types of machine learning based methods for nuisance parameter estimation.

Driven by the needs of empiricists, the literature on cluster robust inference has a long history in econometrics. For recent reviews of the literature, see, e.g., Cameron and Miller (2015) and MacKinnon (2019). On the other hand, coping with cross-sectional dependence using a multiway cluster robust variance estimator is a relatively recent phenomenon. Cameron et al. (2011) first provide a multiway cluster robust variance estimator for linear regression models without imposing additional parametric assumptions on the intra-cluster correlation structure. This variance estimator has significantly reshaped the landscape of econometric practices in applied microeconomics in the past decade.
In contrast to its popularity among empirical researchers, theoretical justification of the validity of this type of procedure was lagging behind. The first rigorous treatment of the asymptotic properties of multiway cluster robust estimators is established by Menzel (2017) using the Aldous-Hoover representation. (As of December 31, 2019, Cameron et al. (2011) has received over 2,500 citations, the majority of which came from applied economic papers.) Using this asymptotic framework, MacKinnon, Nielsen, and Webb (2019) study linear regression models under the non-degenerate case and examine the validity of several types of wild bootstrap procedures and the robustness of multiway cluster robust variance estimators under different cluster sampling settings.

Despite the popularity of both machine learning and cluster robust inference among empirical researchers, relatively limited cluster robust inference results exist for machine learning based methods. Inference for machine learning based methods with one-way clustering is studied by Belloni, Chernozhukov, Hansen, and Kozbur (2016), Kock (2016), Kock and Tang (2019), Semenova, Goldman, Chernozhukov, and Taddy (2018), and Hansen and Liao (2019) for different variations of regularized regression estimators, and by Athey and Wager (2019) for random forests. Chiang and Sasaki (2019) investigate the performance of lasso and post-lasso in the partially linear model setting of Belloni, Chernozhukov, and Hansen (2014b) under multiway cluster sampling. To our best knowledge, there are no general machine learning based procedures with known validity under multiway cluster sampling environments.
2 Multiway Double/Debiased Machine Learning

Suppose that the researcher observes a sample {W_ij | i ∈ {1, ..., N}, j ∈ {1, ..., M}} of double-indexed observations of size NM. Let P denote the probability law of {W_ij}_ij, and let E_P denote the expectation with respect to P. Let C = N ∧ M denote the sample size in the smaller dimension. We consider two-way clustering where each cell contains one observation for simplicity of notations, but results for higher cluster dimensions and random cluster sizes can be obtained at the expense of involved notations – see Appendix C for a general case. See also Davezies, D'Haultfoeuille, and Guyonvarch (2019) for further generalization of the empirical process theory for dyadic data under the joint exchangeability assumption.

2.1 The Model

Consider the moment restriction

E_P[ψ(W; θ₀, η₀)] = 0    (2.1)

for some score ψ that depends on a low-dimensional parameter vector θ ∈ Θ ⊂ R^{d_θ} and a nuisance parameter η ∈ T for a convex subset T of a normed linear space. The nuisance parameter η may be finite-, high-, or infinite-dimensional, and its true value is denoted by η₀ ∈ T. In this setup, the true value of the low-dimensional target parameter, denoted by θ₀ ∈ Θ, is the object of interest.

Let T̃ = {η − η₀ : η ∈ T}, and define the Gateaux derivative map D_r : T̃ → R^{d_θ} by

D_r[η − η₀] := ∂_r { E_P[ψ(W; θ₀, η₀ + r(η − η₀))] }

for all r ∈ [0, 1). For convenience, we denote

∂_η E_P ψ(W; θ₀, η₀)[η − η₀] := D_0[η − η₀].

We say that the Neyman orthogonality condition holds at (θ₀, η₀) with respect to a nuisance realization set T_n ⊂ T if the score ψ satisfies (2.1), the pathwise derivative D_r[η − η₀] exists for all r ∈ [0, 1) and η ∈ T_n, and the orthogonality equation

∂_η E_P ψ(W; θ₀, η₀)[η − η₀] = 0    (2.2)

holds for all η ∈ T_n. Furthermore, we also say that the λ_n Neyman near-orthogonality condition holds at (θ₀, η₀) with respect to a nuisance realization set T_n ⊂ T if the score ψ satisfies (2.1), the pathwise derivative D_r[η − η₀] exists for all r ∈ [0, 1) and η ∈ T_n, and the near-orthogonality condition

sup_{η ∈ T_n} ‖∂_η E_P ψ(W; θ₀, η₀)[η − η₀]‖ ≤ λ_n    (2.3)

holds for some positive sequence {λ_n}_n such that λ_n = o(C^{−1/2}).

Throughout, we will consider structural models satisfying the moment restriction (2.1) and either form of the Neyman orthogonality conditions, (2.2) or (2.3). Consider linear Neyman orthogonal scores ψ of the form

ψ(w; θ, η) = ψ^a(w; η) θ + ψ^b(w; η),  for all w ∈ supp(W), θ ∈ Θ, η ∈ T.    (2.4)

A generalization to nonlinear scores follows from linearization with Gateaux differentiability as in Section 3.3 of CCDDHNR (2018). We focus on linear scores as they cover a wide range of applications.

2.2 The Multiway Double/Debiased Machine Learning

For the class of models introduced in Section 2.1, we propose a novel K²-fold multiway cross fitting procedure for estimation of θ₀. For any r ∈ N, we use the notation [r] = {1, ..., r}. With a fixed positive integer K, randomly partition [N] into K parts {I_1, ..., I_K} and [M] into K parts {J_1, ..., J_K}. For each (k, ℓ) ∈ [K]², obtain an estimate

η̂_kℓ = η̂((W_ij)_{(i,j) ∈ ([N] \ I_k) × ([M] \ J_ℓ)})

of the nuisance parameter η₀ by some machine learning method (e.g., lasso, post-lasso, elastic nets, ridge, deep neural networks, and boosted trees) using only the subsample of those observations with multiway indices (i, j) in ([N] \ I_k) × ([M] \ J_ℓ). In turn, we define θ̃, the multiway double/debiased machine learning (multiway DML) estimator for θ₀, as the solution to

(1/K²) Σ_{(k,ℓ) ∈ [K]²} E_{n,kℓ}[ψ(W; θ̃, η̂_kℓ)] = 0,    (2.5)

where E_{n,kℓ}[f(W)] = (1/(|I_k||J_ℓ|)) Σ_{(i,j) ∈ I_k × J_ℓ} f(W_ij) denotes the subsample empirical expectation using only those observations with multiway indices (i, j) in I_k × J_ℓ. We call this procedure the K²-fold multiway cross fitting.
Note that, for each (k, ℓ) ∈ [K]², the nuisance parameter estimate η̂_kℓ is computed using the subsample of those observations with multiway indices (i, j) ∈ ([N] \ I_k) × ([M] \ J_ℓ), and in turn the score term E_{n,kℓ}[ψ(W; ·, η̂_kℓ)] is computed using the subsample of those observations with multiway indices (i, j) ∈ I_k × J_ℓ. This two-step computation is repeated for every partitioning pair (k, ℓ) ∈ [K]². Figure 1 illustrates this K²-fold cross fitting for the case of K = 2 and N = M = 4, where the cross fitting repeats K² (= 2² = 4) times.

Remark 1.
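To make the partitioning and the K² nuisance/score pairings concrete, here is a minimal sketch in Python; the function name and the use of `numpy` are our own illustrative choices, not part of the paper:

```python
import numpy as np

def multiway_folds(N, M, K, seed=0):
    """Randomly partition [N] and [M] into K parts each, and enumerate the
    K^2 pairs of (score block, nuisance block) used in multiway cross fitting."""
    rng = np.random.default_rng(seed)
    I = np.array_split(rng.permutation(N), K)  # partition {I_1, ..., I_K} of row indices
    J = np.array_split(rng.permutation(M), K)  # partition {J_1, ..., J_K} of column indices
    pairs = []
    for k in range(K):
        for ell in range(K):
            # score block I_k x J_ell; nuisance block uses complementary rows and columns
            rows_c = np.concatenate([I[a] for a in range(K) if a != k])
            cols_c = np.concatenate([J[b] for b in range(K) if b != ell])
            pairs.append({"score": (I[k], J[ell]), "nuisance": (rows_c, cols_c)})
    return pairs

pairs = multiway_folds(N=4, M=4, K=2)
print(len(pairs))  # K^2 = 4 blocks, as in Figure 1
```

Each dictionary corresponds to one panel of Figure 1: the score block never shares a row index or a column index with its nuisance block, which is the source of the independence exploited in the theory.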
This estimator is a multiway counterpart of DML2 in CCDDHNR (2018). It is also possible to consider the multiway counterpart of their DML1. With this said, we focus on the current estimator, following their simulation finding that DML2 outperforms DML1 in most settings due to the stability of the score function.
Remark 2 (Higher Cluster Dimensions). When we have α-way clustering for an integer α > 2, the above algorithm can be easily generalized into a K^α-fold multiway DML estimator. See Appendix C for a generalization.

Figure 1: An illustration of 2²-fold cross fitting. [Each of the four panels marks one score block I_k × J_ℓ and its complementary nuisance block.]

We propose to estimate the asymptotic variance of √C(θ̃ − θ₀) by

σ̂² = Ĵ⁻¹ Γ̂ (Ĵ⁻¹)′,    (2.6)

where Γ̂ and Ĵ are given by

Γ̂ = (1/K²) Σ_{(k,ℓ)∈[K]²} { ((|I_k| ∧ |J_ℓ|)/(|I_k||J_ℓ|)²) Σ_{i∈I_k} Σ_{j,j′∈J_ℓ} ψ(W_ij; θ̃, η̂_kℓ) ψ(W_ij′; θ̃, η̂_kℓ)′ + ((|I_k| ∧ |J_ℓ|)/(|I_k||J_ℓ|)²) Σ_{i,i′∈I_k} Σ_{j∈J_ℓ} ψ(W_ij; θ̃, η̂_kℓ) ψ(W_i′j; θ̃, η̂_kℓ)′ }

and

Ĵ = (1/K²) Σ_{(k,ℓ)∈[K]²} E_{n,kℓ}[ψ^a(W; η̂_kℓ)],

accounting for multiway cluster dependence. For a d_θ-dimensional vector r, the (1 − a) confidence interval for the linear functional r′θ₀ can be constructed by

CI_a := [ r′θ̃ ± Φ⁻¹(1 − a/2) √(r′σ̂²r/C) ].

2.3 Example: Partially Linear IV Model

For an illustration, consider as a concrete example the partially linear IV model (cf. Okui, Small, Tan, and Robins, 2012; CCDDHNR, 2018, Section 4.2) adapted to multiway cluster sampled data:

Y_ij = D_ij θ₀ + g₀(X_ij) + ε_ij,  E_P[ε_ij | X_ij, Z_ij] = 0,    (2.7)
Z_ij = m₀(X_ij) + v_ij,  E_P[v_ij | X_ij] = 0.    (2.8)

The researcher observes the random variables Y_ij, D_ij, X_ij, and Z_ij, which are typically interpreted as the outcome, endogenous regressor, exogenous regressors, and instrumental variable, respectively. The low-dimensional parameter vector θ₀ is the object of interest.

A Neyman orthogonal score ψ for this model is given by

ψ(w; θ, η) = (y − g₁(x) − θ(d − g₂(x)))(z − m(x))    (2.9)

as in Okui et al. (2012) and CCDDHNR (2018), where w = (y, d, x, z), η = (g₁, g₂, m), and g₁, g₂, m ∈ L²(P).
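The orthogonality of the score (2.9) can also be checked numerically by a finite-difference approximation of the Gateaux derivative. The sketch below is illustrative only: the data-generating process, the perturbation directions, and the tolerance are our own assumptions, chosen so that the conditional expectations g₁, g₂, m have closed forms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200_000, 1e-3
theta0 = 1.0

# toy DGP with known conditional expectations (illustrative assumptions)
X = rng.normal(size=n)
v = rng.normal(size=n)
u = rng.normal(size=n)
e = 0.5 * u + rng.normal(size=n)   # endogeneity: e correlates with u
Z = 0.5 * X + v                    # m(X)  = E[Z|X] = 0.5 X
D = Z + 0.3 * X + u                # g2(X) = E[D|X] = 0.8 X
Y = D * theta0 + X**2 + e          # g1(X) = E[Y|X] = 0.8 X + X^2

def score_mean(g1, g2, m):
    # empirical analogue of E_P[psi(W; theta0, eta)] for the score (2.9)
    return np.mean((Y - g1 - theta0 * (D - g2)) * (Z - m))

g1, g2, m = 0.8 * X + X**2, 0.8 * X, 0.5 * X   # true nuisance functions
d1, d2, dm = np.sin(X), np.cos(X), 0.5 * X     # perturbation directions eta - eta0

# finite-difference Gateaux derivative at r -> 0 along (d1, d2, dm)
fd = (score_mean(g1 + r * d1, g2 + r * d2, m + r * dm) - score_mean(g1, g2, m)) / r
print(abs(fd) < 0.05)  # numerically zero up to Monte Carlo noise
```

The derivative vanishes because the perturbations are functions of X only, while the residuals Y − g₁(X) − θ₀(D − g₂(X)) and Z − m(X) are conditionally mean zero given X.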
It is straightforward to verify that this score satisfies both the moment restriction (2.1), E_P[ψ(W; θ₀, η₀)] = 0, and the Neyman orthogonality condition (2.2), ∂_η E_P ψ(W; θ₀, η₀)[η − η₀] = 0 for all η ∈ T_n, at η₀ = (g₁, g₂, m), where g₁(X) = E_P[Y | X], g₂(X) = E_P[D | X], and m(X) = E_P[Z | X]. The following algorithm is our proposed multiway DML procedure introduced in Section 2.2, specifically applied to this partially linear IV model.

Algorithm 1 (K²-fold Multiway DML for the Partially Linear IV Model with Lasso).
1. Randomly partition [N] into K parts {I_1, ..., I_K} and [M] into K parts {J_1, ..., J_K}.

2. For each (k, ℓ) ∈ [K]²:
(a) Run a lasso of Y on X to obtain ĝ_{1,kℓ}(x) = x′β̂_kℓ using observations from I_k^c × J_ℓ^c.
(b) Run a lasso of D on X to obtain ĝ_{2,kℓ}(x) = x′γ̂_kℓ using observations from I_k^c × J_ℓ^c.
(c) Run a lasso of Z on X to obtain m̂_kℓ(x) = x′ξ̂_kℓ using observations from I_k^c × J_ℓ^c.

3. Solve the equation

(1/K²) Σ_{(k,ℓ)∈[K]²} E_{n,kℓ}[(Y_ij − X′_ij β̂_kℓ − θ(D_ij − X′_ij γ̂_kℓ))(Z_ij − X′_ij ξ̂_kℓ)] = 0

for θ to obtain the multiway DML estimate θ̃.

4. Let ε̂_ij = Y_ij − X′_ij β̂_kℓ − θ̃(D_ij − X′_ij γ̂_kℓ), û_ij = D_ij − X′_ij γ̂_kℓ, and v̂_ij = Z_ij − X′_ij ξ̂_kℓ for each (i, j) ∈ I_k × J_ℓ for each (k, ℓ) ∈ [K]², and let the multiway DML asymptotic variance estimator be given by

σ̂² = Ĵ⁻¹ (1/K²) Σ_{k=1}^K Σ_{ℓ=1}^K { ((|I_k| ∧ |J_ℓ|)/(|I_k||J_ℓ|)²) Σ_{i∈I_k} Σ_{j,j′∈J_ℓ} ε̂_ij v̂_ij v̂_ij′ ε̂_ij′ + ((|I_k| ∧ |J_ℓ|)/(|I_k||J_ℓ|)²) Σ_{i,i′∈I_k} Σ_{j∈J_ℓ} ε̂_ij v̂_ij v̂_i′j ε̂_i′j } (Ĵ⁻¹)′,

where Ĵ = −(1/K²) Σ_{k=1}^K Σ_{ℓ=1}^K E_{n,kℓ}[û_ij v̂_ij].
5. Report the estimate θ̃, its standard error √(σ̂²/C), and/or the (1 − a) confidence interval

CI_a := [ θ̃ ± Φ⁻¹(1 − a/2) √(σ̂²/C) ].

For the sake of concreteness, we present this algorithm specifically based on lasso (in the three sub-steps under step 2), but another machine learning method (e.g., post-lasso, elastic nets, ridge, deep neural networks, and boosted trees) may be substituted for lasso.
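As a concrete illustration of the steps above, the following sketch implements the algorithm for scalar θ, with ridge substituted for lasso in step 2 (the paper permits this substitution); all function names are our own, and the closed-form ridge is chosen only to keep the sketch dependency-free beyond `numpy`:

```python
import numpy as np

def ridge_fit(Xmat, y, lam=1.0):
    # closed-form ridge coefficients: (X'X + lam I)^{-1} X'y
    p = Xmat.shape[1]
    return np.linalg.solve(Xmat.T @ Xmat + lam * np.eye(p), Xmat.T @ y)

def multiway_dml_pliv(Y, D, Z, X, K=2, lam=1.0, seed=0):
    """K^2-fold multiway DML for the partially linear IV model.
    Y, D, Z: (N, M) arrays; X: (N, M, p) array. Returns (theta, se)."""
    N, M, p = X.shape
    C = min(N, M)
    rng = np.random.default_rng(seed)
    I = np.array_split(rng.permutation(N), K)       # partition of [N]
    J = np.array_split(rng.permutation(M), K)       # partition of [M]
    num = den = 0.0
    blocks = []
    for k in range(K):
        for l in range(K):
            ri = np.setdiff1d(np.arange(N), I[k])   # complement rows
            cj = np.setdiff1d(np.arange(M), J[l])   # complement columns
            Xtr = X[np.ix_(ri, cj)].reshape(-1, p)
            beta = ridge_fit(Xtr, Y[np.ix_(ri, cj)].ravel(), lam)  # ~ g1
            gam = ridge_fit(Xtr, D[np.ix_(ri, cj)].ravel(), lam)   # ~ g2
            xi = ridge_fit(Xtr, Z[np.ix_(ri, cj)].ravel(), lam)    # ~ m
            Xev = X[np.ix_(I[k], J[l])]
            eY = Y[np.ix_(I[k], J[l])] - Xev @ beta  # Y residual on score block
            u = D[np.ix_(I[k], J[l])] - Xev @ gam    # D residual
            v = Z[np.ix_(I[k], J[l])] - Xev @ xi     # Z residual
            num += np.mean(eY * v)
            den += np.mean(u * v)
            blocks.append((eY, u, v))
    theta = num / den    # solves (1/K^2) sum_kl E_nkl[(eY - theta u) v] = 0
    Jhat = -den / K**2   # Jhat = -(1/K^2) sum_kl E_nkl[u v]
    Gamma = 0.0
    for eY, u, v in blocks:
        s = (eY - theta * u) * v   # eps_ij * v_ij on the score block
        a, b = s.shape
        scale = (min(a, b) / (a * b) ** 2) / K**2
        # row sums and column sums implement the two double sums in step 4
        Gamma += scale * (np.sum(s.sum(axis=1) ** 2) + np.sum(s.sum(axis=0) ** 2))
    sigma2 = Gamma / Jhat**2
    return theta, np.sqrt(sigma2 / C)

# demo on synthetic data (an i.i.d. special case of the sampling design)
rng = np.random.default_rng(1)
N, M, p = 40, 40, 3
X = rng.normal(size=(N, M, p))
v = rng.normal(size=(N, M)); u = rng.normal(size=(N, M))
Z = X @ np.full(p, 0.5) + v
D = Z + X @ np.full(p, 0.5) + u
Y = 1.5 * D + X @ np.full(p, 0.5) + 0.5 * u + rng.normal(size=(N, M))
theta, se = multiway_dml_pliv(Y, D, Z, X, K=2)
ci = (theta - 1.96 * se, theta + 1.96 * se)  # step 5 with a = 0.05
```

Because both double sums in step 4 include the diagonal terms, the resulting standard error is conservative on genuinely i.i.d. data; under true two-way clustering the extra terms are of lower order.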
Example 1 (Demand Analysis). Consider the model of Berry (1994) in which consumer c derives the utility

δ_ij + X_ij α_c + ε_cij

from choosing product i in market j, where ε_cij independently follows the Type I Extreme Value distribution, α_c is a random coefficient, and the mean utility δ_ij takes the linear-index form δ_ij = D_ij θ₀ + ε_ij. In this framework, Lu, Shi, and Tao (2019, Equation (9)) derive the partial-linear equation

Y_ij = D_ij θ₀ + g₀(X_ij) + ε_ij

for estimation of θ₀, where Y_ij = log(S_ij) − log(S_0j) denotes the observed log share of product i relative to the log of the outside share. Since D_ij usually consists of the endogenous price of product i in market j, researchers often use instruments Z_ij such that E_P[ε_ij | X_ij, Z_ij] = 0. This yields the reduced-form equation (2.7), together with the innocuous nonparametric projection equation (2.8). Since the random vector W_ij = (Y_ij, D_ij, X_ij, Z_ij) is double-indexed by product i and market j, the sample naturally entails two-way dependence. Specifically, for each product i, {W_ij}_{j=1}^M is likely dependent through a supply shock by the producer of product i. Similarly, for each market j, {W_ij}_{i=1}^N is likely dependent through a demand shock in market j. As such, instead of using standard errors based on i.i.d. sampling, we recommend that a researcher use the two-way cluster-robust standard error based on Algorithm 1. △

3 Asymptotic Theory

In this section, we present formal theories to guarantee that the multiway DML method proposed in Section 2 works. We first fix some notations for convenience. The two-way sample sizes (N, M) ∈ N² will be indexed by a single index n ∈ N as (N, M) = (N(n), M(n)), where M(n) and N(n) are non-decreasing in n and M(n)N(n) is increasing in n. With this said, we will suppress the index notation and write (N, M) for simplicity. Let {P_n}_n be a sequence of sets of probability laws of {W_ij}_ij – note that we allow for increasing dimensionality of W_ij in the sample size n. Let P = P_n ∈ P_n denote the law with respect to sample size (N, M). Throughout, we assume that the random vector W_ij is Borel measurable. Recall the notations C = N ∧ M, μ_N = C/N, and μ_M = C/M, and suppose that μ_N → μ̄_N and μ_M → μ̄_M. We write a ≲ b to mean a ≤ cb for some c > 0 that does not depend on n. We also write a ≲_P b to mean a = O_P(b). For any finite dimensional vector v, ‖v‖ denotes the ℓ² or Euclidean norm of v. For any matrix A, ‖A‖ denotes the induced ℓ²-norm of the matrix. For any set B, |B| denotes the cardinality of the set.

We state the following assumption on multiway clustered sampling.

Assumption 1 (Sampling). Suppose C → ∞. The following conditions hold for each n.
(i) (W_ij)_{(i,j)∈N²} is an infinite sequence of separately exchangeable p-dimensional random vectors. That is, for any permutations π₁ and π₂ of N, we have

(W_ij)_{(i,j)∈N²} =_d (W_{π₁(i)π₂(j)})_{(i,j)∈N²}.

(ii) (W_ij)_{(i,j)∈N²} is dissociated. That is, for any (c₁, c₂) ∈ N², (W_ij)_{i∈[c₁], j∈[c₂]} is independent of (W_ij)_{i∈[c₁]^c, j∈[c₂]^c}.

For each n, an econometrician observes (W_ij)_{i∈[N], j∈[M]}.

Recall that we focus on the linear Neyman orthogonal score of the form

ψ(w; θ, η) = ψ^a(w; η) θ + ψ^b(w; η),  for all w ∈ supp(W), θ ∈ Θ, η ∈ T.

Let c₀ > 0, c₁ > 0, s > 0, and q ≥ 4 be finite constants such that c₀ ≤ c₁. Let {δ_n}_{n≥1} (estimation errors) and {Δ_n}_{n≥1} (probability bounds) be sequences of positive constants that converge to zero such that δ_n ≥ C^{−1/2}. Let K ≥ 2 be a fixed integer, and let W denote a copy of W_{11} that is independent from the data and the random set T_n of nuisance realizations.
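Assumption 1 describes exactly the arrays covered by the Aldous-Hoover representation: a dissociated, separately exchangeable array can be written as W_ij = f(α_i, γ_j, ε_ij) for i.i.d. shocks (no common shock, by dissociation). A minimal generator of such an array, with an illustrative choice of f and Gaussian shocks:

```python
import numpy as np

def separately_exchangeable_array(N, M, seed=0):
    """W_ij = f(alpha_i, gamma_j, eps_ij): a dissociated, separately
    exchangeable two-way array in Aldous-Hoover form (the linear f and
    Gaussian shocks below are illustrative choices)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(N, 1))   # row (index-i) shocks
    gamma = rng.normal(size=(1, M))   # column (index-j) shocks
    eps = rng.normal(size=(N, M))     # cell-specific shocks
    return 0.5 * alpha + 0.5 * gamma + eps

W = separately_exchangeable_array(200, 150)
# entries sharing a row or a column are dependent (common alpha_i or gamma_j);
# entries sharing neither index are independent, as in Assumption 1(ii)
```

Permuting the row labels and the column labels separately leaves the joint distribution unchanged, which is Assumption 1(i).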
With these notations, we consider the following assumptions.

Assumption 2 (Linear Neyman Orthogonal Score). For all C ≥ 3 and all P ∈ P_n, the following conditions hold.
(i) The true parameter value θ₀ satisfies (2.1).
(ii) ψ is linear in the sense that it satisfies (2.4).
(iii) The map η ↦ E_P[ψ(W; θ, η)] is twice continuously Gateaux differentiable on T.
(iv) ψ satisfies either the Neyman orthogonality condition (2.2) or, more generally, the λ_n Neyman near-orthogonality condition at (θ₀, η₀) with respect to a nuisance realization set T_n ⊂ T as

λ_n := sup_{η∈T_n} ‖∂_η E_P ψ(W; θ₀, η₀)[η − η₀]‖ ≤ δ_n C^{−1/2}.

(v) The identification condition holds: the singular values of the matrix J₀ := E_P[ψ^a(W; η₀)] are between c₀ and c₁.

Assumption 3 (Score Regularity and Nuisance Parameter Estimators). For all C ≥ 3 and all P ∈ P_n, the following conditions hold.
(i) Given random subsets I ⊂ [N] and J ⊂ [M] such that |I| × |J| = ⌊NM/K²⌋, the nuisance parameter estimator η̂ = η̂((W_ij)_{(i,j)∈I^c×J^c}), where the complements are taken with respect to [N] and [M], respectively, belongs to the realization set T_n with probability at least 1 − Δ_n, where T_n contains η₀.
(ii) The following moment conditions hold:

m_n := sup_{η∈T_n} (E_P[‖ψ(W; θ₀, η)‖^q])^{1/q} ≤ c₁,
m′_n := sup_{η∈T_n} (E_P[‖ψ^a(W; η)‖^q])^{1/q} ≤ c₁.

(iii) The following conditions on the rates r_n, r′_n, and λ′_n hold:

r_n := sup_{η∈T_n} ‖E_P[ψ^a(W; η)] − E_P[ψ^a(W; η₀)]‖ ≤ δ_n,
r′_n := sup_{η∈T_n} (E_P[‖ψ(W; θ₀, η) − ψ(W; θ₀, η₀)‖²])^{1/2} ≤ δ_n,
λ′_n := sup_{r∈(0,1), η∈T_n} ‖∂²_r E_P[ψ(W; θ₀, η₀ + r(η − η₀))]‖ ≤ δ_n/√C.

(iv) All eigenvalues of the matrix

Γ := μ̄_N Γ_N + μ̄_M Γ_M = μ̄_N E_P[ψ(W_{11}; θ₀, η₀) ψ(W_{12}; θ₀, η₀)′] + μ̄_M E_P[ψ(W_{11}; θ₀, η₀) ψ(W_{21}; θ₀, η₀)′]

are bounded from below by c₀.
Assumption 1 is similar to those of the preceding work on multiway cluster robust inference (cf. Menzel, 2017; Davezies et al., 2018; Chiang and Sasaki, 2019). Menzel (2017) does not invoke dissociation, and follows an alternative approach to inference. The other papers assume both separate exchangeability and dissociation, and conduct unconditional inference as in this paper. See Kallenberg (2006, Corollary 7.23 and Lemma 7.35) for representations with and without dissociation under separate exchangeability. Assumption 2 is closely related to Assumption 3.1 of CCDDHNR (2018). It requires the score to be Neyman near orthogonal – see their Section 2.2.1 for the procedure of orthogonalizing a non-orthogonal score. It also imposes some mild smoothness and identification conditions. Assumption 3 corresponds to Assumption 3.2 of CCDDHNR (2018). It imposes some high level conditions on the quality of the nuisance parameter estimator as well as the non-degeneracy of the asymptotic variance. This rules out degenerate cases such as Example 1.6 of Menzel (2017).
Remark 4 (Partial Distributions). Assumptions 2 and 3 state conditions based on W, differently from CCDDHNR (2018), because of our need to deal with dependent observations in cross fitting in our multiway DML framework.

The following result presents the main theorem of this paper, establishing the linear representation and asymptotic normality of the multiway DML estimator. It corresponds to Theorem 3.1 of CCDDHNR (2018), and is an extension of it to the case of multiway cluster sampling.

Theorem 1 (Main Result). Suppose that Assumptions 1, 2 and 3 are satisfied. If δ_n ≥ C^{−1/2} for all C ≥ 3, then

√C σ₀⁻¹(θ̃ − θ₀) = (√C/(NM)) Σ_{i=1}^N Σ_{j=1}^M ψ̄(W_ij) + O_P(ρ_n) ⇝ N(0, I_{d_θ})

holds uniformly over P ∈ P_n, where the size of the remainder term obeys

ρ_n := C^{−1/2} + r_n + r′_n + C^{1/2}λ_n + C^{1/2}λ′_n ≲ δ_n,

the influence function takes the form ψ̄(·) := −σ₀⁻¹ J₀⁻¹ ψ(·; θ₀, η₀), and the asymptotic variance is given by

σ₀² := J₀⁻¹ Γ (J₀⁻¹)′.    (3.1)

As is commonly the case in practice, we need to estimate the unknown asymptotic variance. The following theorem shows the validity of our proposed multiway DML variance estimator.

Theorem 2 (Variance Estimator). Under the assumptions required by Theorem 1, we have

σ̂² = σ₀² + O_P(ρ_n).

Furthermore, the statement of Theorem 1 holds true with σ̂² in place of σ₀². Theorems 1 and 2 can be used for constructing confidence intervals.
Corollary 1.
Suppose that all the assumptions required by Theorem 1 are satisfied. Let r be a d_θ-dimensional vector. The (1 − a) confidence interval of r′θ₀ given by

CI_a := [ r′θ̃ ± Φ⁻¹(1 − a/2) √(r′σ̂²r/C) ]

satisfies

sup_{P∈P_n} |P_P(r′θ₀ ∈ CI_a) − (1 − a)| → 0.
As in Section 3.4 of CCDDHNR (2018), we can also repeatedly compute multiway DML estimates and variance estimates S times for some fixed S ∈ N and consider the average or median of the estimates as the new estimate. This does not have an asymptotic impact, yet it can reduce the impact of a random sample splitting on the estimate.

4 Simulation Studies

Consider the partially linear IV model introduced in Section 2.3. We specifically focus on the following high-dimensional linear representations:

Y_ij = D_ij θ₀ + X′_ij ζ₀ + ε_ij,
D_ij = Z_ij π₁ + X′_ij π₂ + υ_ij,
Z_ij = X′_ij ξ₀ + V_ij,

where the parameter values are set to θ₀ = π₁ = 1.0 and ζ₀ = π₂ = ξ₀ = (0.5¹, 0.5², ..., 0.5^{dim(X)})′ for some large dim(X). The primitive random vector (X′_ij, ε_ij, υ_ij, V_ij)′ is constructed by

X_ij = (1 − ω₁^X − ω₂^X) α^X_ij + ω₁^X α^X_i + ω₂^X α^X_j,
ε_ij = (1 − ω₁^ε − ω₂^ε) α^ε_ij + ω₁^ε α^ε_i + ω₂^ε α^ε_j,
υ_ij = (1 − ω₁^υ − ω₂^υ) α^υ_ij + ω₁^υ α^υ_i + ω₂^υ α^υ_j, and
V_ij = (1 − ω₁^V − ω₂^V) α^V_ij + ω₁^V α^V_i + ω₂^V α^V_j

with two-way clustering weights (ω₁^X, ω₂^X), (ω₁^ε, ω₂^ε), (ω₁^υ, ω₂^υ), and (ω₁^V, ω₂^V), where α^X_ij, α^X_i, and α^X_j are independently generated according to the dim(X)-variate normal distribution N(0, Σ_X) with Toeplitz covariance (Σ_X)_{kk′} = s_X^{|k−k′|}, (α^ε_ij, α^υ_ij)′, (α^ε_i, α^υ_i)′, and (α^ε_j, α^υ_j)′ are independently generated according to the bivariate normal distribution with unit variances and correlation s_ευ, and α^V_ij, α^V_i, and α^V_j are independently generated according to N(0, 1). The weights (ω₁^X, ω₂^X), (ω₁^ε, ω₂^ε), (ω₁^υ, ω₂^υ), and (ω₁^V, ω₂^V) specify the extent of dependence in two-way clustering in X_ij, ε_ij, υ_ij, and V_ij, respectively. The parameter s_X specifies the extent of collinearity among the high-dimensional regressors X_ij.
The parameter s_ευ specifies the extent of endogeneity. We set the values of these parameters to (ω₁^X, ω₂^X) = (ω₁^ε, ω₂^ε) = (ω₁^υ, ω₂^υ) = (ω₁^V, ω₂^V) = (0.25, 0.25) and s_X = s_ευ = 0.25.

Monte Carlo simulations are conducted with 2,500 iterations for each set. Table 1 reports the simulation results. The first four columns in the table indicate the data generating process (N, M, C, and dim(X)). The next column indicates the integer K for our K²-fold cross fitting method. We use K = 2 and 3 in the simulations for the displayed results, since 2² (≈ 5) and 3² (≈ 10) are close to the common numbers of folds used in cross fitting in practice. The next column indicates the machine learning method for estimation of η̂_kℓ. We use the ridge, elastic net, and lasso. The last four columns of the table report Monte Carlo simulation statistics, including the bias (Bias), standard deviation (SD), root mean square error (RMSE), and coverage frequency for the nominal probability of 95% (Cover).

For each covariate dimension dim(X), for each choice K ∈ {2, 3} for the multiway cross fitting, and for each of the three machine learning methods, we observe the following patterns as the effective sample size C = N ∧ M increases: 1) the bias tends to zero; 2) the standard deviation decreases approximately at the C^{−1/2} rate; and 3) the coverage frequency converges to the nominal probability. These results confirm the theoretical properties of the proposed method. We ran several other sets of simulations besides those displayed in the table, and this pattern remains the same across different sets.

Comparing the results across the three machine learning methods, we observe that the ridge entails larger bias and smaller variance relative to the elastic net and lasso in finite samples. This makes the coverage frequency of the ridge less accurate compared with the elastic net and lasso. This result is perhaps specific to the data generating process used for our simulations. On one hand, the choice K = 3 (i.e., 9-fold) of the multiway cross fitting contributes to mitigating the large bias of the ridge relative to the choice K = 2, and hence K = 3 produces more preferred results for the ridge. On the other hand, the choice K = 2 tends to yield preferred results in terms of coverage accuracy for the elastic net and lasso. In light of these results, we recommend the elastic net or lasso along with the use of 2²-fold (i.e., 4-fold) cross fitting. This number of folds in cross fitting is in fact similar to that recommended by CCDDHNR (2018) for i.i.d.
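The two-way clustered components in the design above, each a weighted sum of a cell shock, a row shock, and a column shock, can be generated as follows; the weight and covariance values in the demo are placeholders for illustration, not necessarily the values used in the paper:

```python
import numpy as np

def two_way_component(N, M, omega1, omega2, cov, rng):
    """Draw an (N, M, dim) array x_ij = (1 - w1 - w2) a_ij + w1 a_i + w2 a_j,
    where the cell, row, and column shocks are i.i.d. N(0, cov)."""
    dim = cov.shape[0]
    L = np.linalg.cholesky(cov)
    a_ij = rng.normal(size=(N, M, dim)) @ L.T   # cell-level shocks
    a_i = rng.normal(size=(N, 1, dim)) @ L.T    # row (cluster-i) shocks
    a_j = rng.normal(size=(1, M, dim)) @ L.T    # column (cluster-j) shocks
    return (1 - omega1 - omega2) * a_ij + omega1 * a_i + omega2 * a_j

rng = np.random.default_rng(0)
dimX, sX = 10, 0.25  # placeholder values for dim(X) and s_X
# Toeplitz covariance (Sigma_X)_{kk'} = sX^{|k - k'|}
covX = sX ** np.abs(np.subtract.outer(np.arange(dimX), np.arange(dimX)))
Xsim = two_way_component(50, 50, 0.25, 0.25, covX, rng)
print(Xsim.shape)  # (50, 50, 10)
```

The scalar components ε_ij, υ_ij, and V_ij follow from the same function with a 1×1 (or 2×2, for the correlated (ε, υ) pair) covariance matrix.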
sampling – see their Remark 3.1, where they recommend 4- or 5-fold cross fitting.

5 Empirical Illustration: Demand Analysis

Let us revisit the demand model of Example 1 in Section 2.3. Recall that, for the consumer demand model of Berry (1994) introduced in Example 1, Lu et al. (2019, Equation (9)) derive the partial-linear equation

Y_ij = D_ij θ₀ + g₀(X_ij) + ε_ij    (5.1)

for estimation of θ₀, where Y_ij = log(S_ij) − log(S_0j) denotes the observed log share of product i relative to the log of the outside share in market j, D_ij denotes the log price of product i in market j, and X_ij denotes a vector of observed attributes of product i in market j. To deal with the likely endogeneity of D_ij, researchers often use instruments Z_ij such that E_P[ε_ij | X_ij, Z_ij] = 0. Such instruments often consist of observed attributes of other products in the market. The implied equation (5.1) together with this mean independence assumption yields the reduced-form model (2.7). Furthermore, we write the innocuous nonparametric projection equation (2.8). Therefore, we apply Algorithm 1 in Section 2.3 for the two-way cluster robust DML estimation of θ₀ with a robust standard error.

We present an application of the proposed algorithm to the U.S. automobile data of Berry, Levinsohn, and Pakes (1995). The sample consists of unbalanced two-way clustered observations with N = 557 models of automobiles and M = 20 markets. The observed attributes X_ij consist of horsepower per weight, miles per dollar, miles per gallon, and size. The instrument Z_ij is defined as the sum of the values of these attributes of other products.

For the purpose of highlighting the effect of clustering assumptions, we report estimates and standard errors under the zero-way cluster robust DML (based on the i.i.d. assumption) and the one-way cluster robust DML (based on clustering along each of the product and market dimensions), as well as the two-way cluster robust DML (along both of the product and market dimensions).
The number K = 4 of folds of cross fitting is used for the zero- and one-way cluster robust DML, while the number K² = 4 of folds of two-way cross fitting is used for the two-way cluster robust DML, following the recommendations from Section 4 and those by CCDDHNR (2018, Remark 3.1). To mitigate the uncertainty induced by sample splitting, we compute estimates based on the average of ten rerandomized DML runs following CCDDHNR (2018, Section 3.4), with variance estimation according to CCDDHNR (2018, Equation 3.13) adapted to our two-way cluster-robustness.

Table 2 summarizes the results. For each of the zero-, one-, and two-way cluster robust DML, both the point estimates and standard errors are similar across all the choices of instrument. Furthermore, the point estimates are also similar across all of the zero-, one-, and two-way cluster robust DML. On the other hand, the standard errors tend to increase as the assumed number of ways of clustering increases. In other words, the zero-way cluster robust DML reports the smallest standard error while the two-way cluster robust DML reports the largest standard error. To robustly account for possible cross-sectional dependence of observations in such two-way cluster sampled data as this market share data, we recommend that researchers use the two-way cluster robust DML although it may incur larger standard errors, as is the case with this application.

6 Conclusion

In this paper, we propose a multiway DML procedure based on a new multiway cross fitting algorithm. This multiway DML procedure is valid in the presence of multiway cluster sampled data, which is frequently used in empirical research. We present an asymptotic theory showing that multiway DML is valid under nearly identical regularity conditions to those of CCDDHNR (2018). The proposed method covers a large class of econometric models as is the case with CCDDHNR (2018), and is compatible with various machine learning based estimation methods.
Simulation studies indicate that the proposed procedure has attractive finite sample performance under various multiway cluster sampling environments and for various machine learning methods. To accompany the theoretical findings, we provide easy-to-implement algorithms for multiway DML that are readily implementable using existing statistical packages.

There are a couple of possible directions for future research. First, whereas we focused on linear orthogonal scores, which cover a wide range of applications, it may be possible to develop methods and theories for non-linear orthogonal scores as in CCDDHNR (2018, Section 3.3). Second, whereas we focused on unconditional moment restrictions, it may be possible, and will be important, to develop methods and theories for conditional moment restrictions (Ai and Chen, 2003, 2007; Chen, Linton, and Van Keilegom, 2003; Chen and Pouzo, 2015). We leave these and other extensions for future research.

Appendix

A Proofs of the Main Results
For any $(i,j) \in I_k \times J_\ell$, we use the shorthand notation $E_P[f(W_{ij}) \mid I_k^c \times J_\ell^c]$ to denote the conditional expectation $E_P[f(W_{ij}) \mid (W_{i'j'})_{(i',j') \in ([N]\setminus I_k) \times ([M]\setminus J_\ell)}]$ whenever it exists.

A.1 Proof of Theorem 1
Proof.
In this proof, we follow as closely as possible the five steps of the proof of Theorem 3.1 of CCDDHNR (2018), although all the asymptotic arguments are properly modified to account for multiway cluster sampling. Let $\mathcal{E}_n$ denote the event that $\hat\eta_{k\ell} \in \mathcal{T}_n$ for all $k, \ell \in [K]$. Assumption 3 (i) implies $P(\mathcal{E}_n) \ge 1 - K^2\Delta_n$.

Step 1.
This is the main step, showing the linear representation and asymptotic normality for the proposed estimator. Denote
\[
\hat J := \frac{1}{K^2}\sum_{(k,\ell)\in[K]^2} E_{n,k\ell}[\psi^a(W;\hat\eta_{k\ell})], \qquad R_{n,1} := \hat J - J_0,
\]
\[
R_{n,2} := \frac{1}{K^2}\sum_{(k,\ell)\in[K]^2} E_{n,k\ell}[\psi(W;\theta_0,\hat\eta_{k\ell})] - \frac{1}{NM}\sum_{i=1}^N\sum_{j=1}^M \psi(W_{ij};\theta_0,\eta_0).
\]
We will later show in Steps 2, 3, 4 and 5, respectively, that
\[
\|R_{n,1}\| = O_{P_n}(C^{-1/2} + r_n), \qquad \text{(A.1)}
\]
\[
\|R_{n,2}\| = O_{P_n}(C^{-1/2} r_n' + \lambda_n + \lambda_n'), \qquad \text{(A.2)}
\]
\[
\Big\| \sqrt{C}(NM)^{-1}\sum_{i=1}^N\sum_{j=1}^M \psi(W_{ij};\theta_0,\eta_0) \Big\| = O_{P_n}(1), \qquad \text{(A.3)}
\]
\[
\|\sigma^{-1}\| = O_{P_n}(1). \qquad \text{(A.4)}
\]
Then, under Assumptions 2 and 3, $C^{-1/2} + r_N \le \rho_n = o(1)$ and all singular values of $J_0$ are bounded away from zero. Therefore, with $P_n$-probability at least $1 - o(1)$, all singular values of $\hat J$ are bounded away from zero. Thus, with the same $P_n$-probability, the multiway DML solution is uniquely written as
\[
\tilde\theta = -\hat J^{-1}\frac{1}{K^2}\sum_{(k,\ell)\in[K]^2} E_{n,k\ell}[\psi^b(W;\hat\eta_{k\ell})],
\]
and
\begin{align*}
\sqrt{C}(\tilde\theta - \theta_0)
&= -\sqrt{C}\,\hat J^{-1}\frac{1}{K^2}\sum_{(k,\ell)\in[K]^2}\Big( E_{n,k\ell}[\psi^b(W;\hat\eta_{k\ell})] + \hat J\theta_0 \Big) \\
&= -\sqrt{C}\,\hat J^{-1}\frac{1}{K^2}\sum_{(k,\ell)\in[K]^2} E_{n,k\ell}[\psi(W;\theta_0,\hat\eta_{k\ell})] \\
&= -\big(J_0 + R_{n,1}\big)^{-1}\times\Big( \frac{\sqrt{C}}{NM}\sum_{i=1}^N\sum_{j=1}^M \psi(W_{ij};\theta_0,\eta_0) + \sqrt{C}R_{n,2} \Big). \qquad \text{(A.5)}
\end{align*}
Using the fact that $(J_0+R_{n,1})^{-1} - J_0^{-1} = -(J_0+R_{n,1})^{-1}R_{n,1}J_0^{-1}$, we have
\[
\|(J_0+R_{n,1})^{-1} - J_0^{-1}\| = \|(J_0+R_{n,1})^{-1}R_{n,1}J_0^{-1}\| \le \|(J_0+R_{n,1})^{-1}\|\,\|R_{n,1}\|\,\|J_0^{-1}\| = O_{P_n}(1)\,O_{P_n}(C^{-1/2}+r_n)\,O_{P_n}(1) = O_{P_n}(C^{-1/2}+r_n).
\]
Furthermore, since $r_n' + \sqrt{C}(\lambda_n+\lambda_n') \le \rho_n = o(1)$, it holds that
\[
\Big\| \frac{\sqrt{C}}{NM}\sum_{i=1}^N\sum_{j=1}^M \psi(W_{ij};\theta_0,\eta_0) + \sqrt{C}R_{n,2} \Big\|
\le \Big\| \frac{\sqrt{C}}{NM}\sum_{i=1}^N\sum_{j=1}^M \psi(W_{ij};\theta_0,\eta_0) \Big\| + \big\|\sqrt{C}R_{n,2}\big\|
= O_{P_n}(1) + o_{P_n}(1) = O_{P_n}(1),
\]
where the first equality is due to (A.3) and (A.4).
Combining the above two bounds gives
\[
\big\|(J_0+R_{n,1})^{-1} - J_0^{-1}\big\| \times \Big\| \frac{\sqrt{C}}{NM}\sum_{i=1}^N\sum_{j=1}^M \psi(W_{ij};\theta_0,\eta_0) + \sqrt{C}R_{n,2} \Big\| = O_{P_n}(C^{-1/2}+r_n)\,O_{P_n}(1) = O_{P_n}(C^{-1/2}+r_n). \qquad \text{(A.6)}
\]
Therefore, from (A.4), (A.5) and (A.6), we have
\[
\sqrt{C}\sigma^{-1}(\tilde\theta-\theta_0) = \frac{\sqrt{C}}{NM}\sum_{i=1}^N\sum_{j=1}^M \bar\psi(W_{ij}) + O_{P_n}(\rho_n).
\]
The first term on the right-hand side can be written as $\mathbb{G}_n\bar\psi$. Applying Lemma 1, we obtain the independent linear representation
\[
H_n\bar\psi := \sum_{i=1}^N \frac{\sqrt{C}}{N} E_{P_n}[\bar\psi(W_{ij}) \mid U_{i0}] + \sum_{j=1}^M \frac{\sqrt{C}}{M} E_{P_n}[\bar\psi(W_{ij}) \mid U_{0j}],
\]
and it holds $P_n$-a.s. that $V(\mathbb{G}_n\bar\psi) = V(H_n\bar\psi) + O(C^{-1}) = J_0^{-1}\Gamma(J_0^{-1})' + O(C^{-1})$ and $\mathbb{G}_n\bar\psi = H_n\bar\psi + O_P(C^{-1/2})$ under Assumption 3 (iv). Recall that $q \ge$
4. Then the third moments of both summands of $H_n\bar\psi$ are bounded over $n$ under Assumptions 2 (v) and 3 (ii), (iv). We have thus verified all the conditions for Lyapunov's CLT. An application of Lyapunov's CLT and the Cramér-Wold device gives $H_n\bar\psi \rightsquigarrow N(0, I_{d_\theta})$, and an application of Theorem 2.7 of van der Vaart (1998) concludes the proof.

Step 2.
Since $K$ is fixed, it suffices to show that, for any $(k,\ell)\in[K]^2$,
\[
\big\| E_{n,k\ell}[\psi^a(W;\hat\eta_{k\ell})] - E_P[\psi^a(W;\eta_0)] \big\| = O_{P_n}(C^{-1/2}+r_n).
\]
Fix $(k,\ell)\in[K]^2$. Then
\[
\big\| E_{n,k\ell}[\psi^a(W;\hat\eta_{k\ell})] - E_{P_n}[\psi^a(W_{ij};\eta_0)] \big\| \le \mathcal{I}_{1,k\ell} + \mathcal{I}_{2,k\ell},
\]
where
\[
\mathcal{I}_{1,k\ell} := \big\| E_{n,k\ell}[\psi^a(W;\hat\eta_{k\ell})] - E_{P_n}[\psi^a(W_{ij};\hat\eta_{k\ell}) \mid I_k^c\times J_\ell^c] \big\|,
\qquad
\mathcal{I}_{2,k\ell} := \big\| E_{P_n}[\psi^a(W_{ij};\hat\eta_{k\ell}) \mid I_k^c\times J_\ell^c] - E_{P_n}[\psi^a(W;\eta_0)] \big\|.
\]
Notice that $\mathcal{I}_{2,k\ell} \le r_n$ with $P_n$-probability $1-o(1)$ follows directly from Assumptions 1 (ii) and 3 (iii). Now denote $\tilde\psi^a_{ij,m} = \psi^a_m(W_{ij};\hat\eta_{k\ell}) - E_{P_n}[\psi^a_m(W_{ij};\hat\eta_{k\ell}) \mid I_k^c\times J_\ell^c]$ and $\tilde\psi^a_{ij} = (\tilde\psi^a_{ij,m})_{m\in[d_\theta]}$. To bound $\mathcal{I}_{1,k\ell}$, note that, conditional on $I_k^c\times J_\ell^c$, it holds that
\begin{align*}
E_{P_n}[\mathcal{I}_{1,k\ell}^2 \mid I_k^c\times J_\ell^c]
&= E_{P_n}\Big[ \big\| E_{n,k\ell}[\psi^a(W;\hat\eta_{k\ell})] - E_{P_n}[\psi^a(W_{ij};\hat\eta_{k\ell}) \mid I_k^c\times J_\ell^c] \big\|^2 \,\Big|\, I_k^c\times J_\ell^c \Big] \\
&= \frac{1}{(|I||J|)^2} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \Big( \sum_{(i,j)\in I_k\times J_\ell} \tilde\psi^a_{ij,m} \Big)^2 \,\Big|\, I_k^c\times J_\ell^c \Big] \\
&= \frac{1}{(|I||J|)^2} \sum_{(i,j)\in I_k\times J_\ell}\sum_{j'\in J_\ell, j'\ne j} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \tilde\psi^a_{ij,m}\tilde\psi^a_{ij',m} \,\Big|\, I_k^c\times J_\ell^c \Big] \\
&\quad + \frac{1}{(|I||J|)^2} \sum_{(i,j)\in I_k\times J_\ell}\sum_{i'\in I_k, i'\ne i} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \tilde\psi^a_{ij,m}\tilde\psi^a_{i'j,m} \,\Big|\, I_k^c\times J_\ell^c \Big] \\
&\quad + \frac{1}{(|I||J|)^2} \sum_{(i,j)\in I_k\times J_\ell} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} (\tilde\psi^a_{ij,m})^2 \,\Big|\, I_k^c\times J_\ell^c \Big] + 0 \\
&= \frac{1}{(|I||J|)^2} \sum_{(i,j)\in I_k\times J_\ell}\sum_{j'\in J_\ell, j'\ne j} E_{P_n}[\langle\tilde\psi^a_{ij},\tilde\psi^a_{ij'}\rangle \mid I_k^c\times J_\ell^c] \\
&\quad + \frac{1}{(|I||J|)^2} \sum_{(i,j)\in I_k\times J_\ell}\sum_{i'\in I_k, i'\ne i} E_{P_n}[\langle\tilde\psi^a_{ij},\tilde\psi^a_{i'j}\rangle \mid I_k^c\times J_\ell^c] \\
&\quad + \frac{1}{(|I||J|)^2} \sum_{(i,j)\in I_k\times J_\ell} E_{P_n}[\|\tilde\psi^a_{ij}\|^2 \mid I_k^c\times J_\ell^c],
\end{align*}
where the cross terms with both $i'\ne i$ and $j'\ne j$ vanish by independence.
Each of the three terms on the right-hand side is bounded by
\[
\frac{1}{|I|\wedge|J|} E_{P_n}\Big[ \big\| \psi^a(W_{ij};\hat\eta_{k\ell}) - E_{P_n}[\psi^a(W_{ij};\hat\eta_{k\ell}) \mid I_k^c\times J_\ell^c] \big\|^2 \,\Big|\, I_k^c\times J_\ell^c \Big]
\le \frac{1}{|I|\wedge|J|} E_{P_n}[\|\psi^a(W_{ij};\hat\eta_{k\ell})\|^2 \mid I_k^c\times J_\ell^c]
\le \frac{c_1^{2/q}}{|I|\wedge|J|}
\]
by an application of the Cauchy-Schwarz inequality and Assumptions 1 and 3 (ii). Note that $C \lesssim |I|\wedge|J| \lesssim C$. Hence an application of Lemma 2 (i) implies $\mathcal{I}_{1,k\ell} = O_{P_n}(C^{-1/2})$. This completes the proof of (A.1).
Step 3.
It again suffices to show that, for any $(k,\ell)\in[K]^2$, one has
\[
\Big\| E_{n,k\ell}[\psi(W;\theta_0,\hat\eta_{k\ell})] - \frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell}\psi(W_{ij};\theta_0,\eta_0) \Big\| = O_{P_n}(C^{-1/2}r_n' + \lambda_n + \lambda_n').
\]
Denote
\[
\mathbb{G}_{n,k\ell}[\phi(W)] = \frac{\sqrt{C}}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell}\Big( \phi(W_{ij}) - \int \phi(w)\,dP_n \Big)
\]
for any $P_n$-integrable function $\phi$ on $\mathrm{supp}(W)$. Then
\[
\Big\| E_{n,k\ell}[\psi(W;\theta_0,\hat\eta_{k\ell})] - \frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell}\psi(W_{ij};\theta_0,\eta_0) \Big\| \le \frac{\mathcal{I}_{3,k\ell} + \mathcal{I}_{4,k\ell}}{\sqrt{C}},
\]
where
\[
\mathcal{I}_{3,k\ell} := \big\| \mathbb{G}_{n,k\ell}[\psi(W;\theta_0,\hat\eta_{k\ell})] - \mathbb{G}_{n,k\ell}[\psi(W;\theta_0,\eta_0)] \big\|,
\qquad
\mathcal{I}_{4,k\ell} := \sqrt{C}\,\big\| E_{P_n}[\psi(W_{ij};\theta_0,\hat\eta_{k\ell}) \mid I_k^c\times J_\ell^c] - E_{P_n}[\psi(W;\theta_0,\eta_0)] \big\|.
\]
Denote $\tilde\psi_{ij,m} := \psi_m(W_{ij};\theta_0,\hat\eta_{k\ell}) - \psi_m(W_{ij};\theta_0,\eta_0)$ and $\tilde\psi_{ij} = (\tilde\psi_{ij,m})_{m\in[d_\theta]}$. To bound $\mathcal{I}_{3,k\ell}$, notice that, by a similar argument to the bound of $\mathcal{I}_{1,k\ell}$, one has
\begin{align*}
E_{P_n}[\mathcal{I}_{3,k\ell}^2 \mid I_k^c\times J_\ell^c]
&= E_{P_n}\big[\|\mathbb{G}_{n,k\ell}[\psi(W_{ij};\theta_0,\hat\eta_{k\ell}) - \psi(W_{ij};\theta_0,\eta_0)]\|^2 \mid I_k^c\times J_\ell^c\big] \\
&= E_{P_n}\Big[ \frac{C}{(|I||J|)^2}\sum_{m=1}^{d_\theta}\Big\{ \sum_{(i,j)\in I_k\times J_\ell}\big( \tilde\psi_{ij,m} - E_{P_n}\tilde\psi_{ij,m} \big) \Big\}^2 \,\Big|\, I_k^c\times J_\ell^c \Big] \\
&= \frac{C}{(|I||J|)^2}\sum_{(i,j)\in I_k\times J_\ell}\sum_{j'\in J_\ell, j'\ne j} E_{P_n}\big[ \langle \tilde\psi_{ij} - E_{P_n}\tilde\psi_{ij},\, \tilde\psi_{ij'} - E_{P_n}\tilde\psi_{ij'} \rangle \,\big|\, I_k^c\times J_\ell^c \big] \\
&\quad + \frac{C}{(|I||J|)^2}\sum_{(i,j)\in I_k\times J_\ell}\sum_{i'\in I_k, i'\ne i} E_{P_n}\big[ \langle \tilde\psi_{ij} - E_{P_n}\tilde\psi_{ij},\, \tilde\psi_{i'j} - E_{P_n}\tilde\psi_{i'j} \rangle \,\big|\, I_k^c\times J_\ell^c \big] \\
&\quad + \frac{C}{(|I||J|)^2}\sum_{(i,j)\in I_k\times J_\ell} E_{P_n}\big[ \|\tilde\psi_{ij} - E_{P_n}\tilde\psi_{ij}\|^2 \,\big|\, I_k^c\times J_\ell^c \big] + 0,
\end{align*}
where the cross terms with both $i'\ne i$ and $j'\ne j$ vanish by independence. Each of these terms is bounded up to a constant by
\begin{align*}
&E_{P_n}\Big[ \big\| \psi(W_{ij};\theta_0,\hat\eta) - \psi(W_{ij};\theta_0,\eta_0) - E_{P_n}[\psi(W_{ij};\theta_0,\hat\eta) - \psi(W_{ij};\theta_0,\eta_0)] \big\|^2 \,\Big|\, I_k^c\times J_\ell^c \Big] \\
&\le E_{P_n}[\|\psi(W_{ij};\theta_0,\hat\eta) - \psi(W_{ij};\theta_0,\eta_0)\|^2 \mid I_k^c\times J_\ell^c]
\le \sup_{\eta\in\mathcal{T}_n} E_{P_n}[\|\psi(W;\theta_0,\eta) - \psi(W;\theta_0,\eta_0)\|^2 \mid I_k^c\times J_\ell^c] \\
&= \sup_{\eta\in\mathcal{T}_n} E_{P_n}[\|\psi(W;\theta_0,\eta) - \psi(W;\theta_0,\eta_0)\|^2] = (r_n')^2,
\end{align*}
where the first inequality follows from the Cauchy-Schwarz inequality, the second-to-last equality is due to Assumption 1, and the last equality is due to Assumption 3 (iii). Hence, $\mathcal{I}_{3,k\ell} = O_{P_n}(r_n')$. To bound $\mathcal{I}_{4,k\ell}$, let
\[
f_{k\ell}(r) := E_{P_n}[\psi(W_{ij};\theta_0,\eta_0 + r(\hat\eta_{k\ell}-\eta_0)) \mid I_k^c\times J_\ell^c] - E_{P_n}[\psi(W;\theta_0,\eta_0)], \qquad r\in[0,1].
\]
An application of the mean value expansion coordinate-wise gives
\[
f_{k\ell}(1) = f_{k\ell}(0) + f_{k\ell}'(0) + f_{k\ell}''(\tilde r)/2,
\]
where $\tilde r\in(0,1)$, $f_{k\ell}(0)=0$ under Assumption 2 (i), and
\[
\|f_{k\ell}'(0)\| = \big\| \partial_\eta E_{P_n}\psi(W;\theta_0,\eta_0)[\hat\eta_{k\ell}-\eta_0] \big\| \le \lambda_n
\]
under Assumption 2 (iv). Moreover, under Assumption 3 (iii), on the event $\mathcal{E}_n$, we have $\|f_{k\ell}''(\tilde r)\| \le \sup_{r\in(0,1)}\|f_{k\ell}''(r)\| \le \lambda_n'$. This completes the proof of (A.2).
Step 4.
Note that
\[
E_{P_n}\Big[ \Big\| \frac{\sqrt{C}}{NM}\sum_{i=1}^N\sum_{j=1}^M \psi(W_{ij};\theta_0,\eta_0) \Big\|^2 \Big]
= \frac{C}{(NM)^2} E_{P_n}\Big[ \sum_{m=1}^{d_\theta}\Big( \sum_{i=1}^N\sum_{j=1}^M \psi_m(W_{ij};\theta_0,\eta_0) \Big)^2 \Big],
\]
and, expanding the square, only pairs of cells sharing a cluster index in at least one dimension contribute, since scores of cells sharing no cluster index are independent and mean zero. There are at most $NM(M-1) + NM(N-1) + NM$ such pairs, each with expectation bounded by $E_{P_n}[\|\psi(W;\theta_0,\eta_0)\|^2] \le c_1^{2/q}$ via the Cauchy-Schwarz inequality and Assumption 3 (ii). Hence the display is $O(C(NM^2 + N^2M)/(NM)^2) = O(C/N + C/M) = O(1)$, and Markov's inequality yields (A.3).

Step 5. Note that all singular values of $J_0$ are bounded from above by $c_1$ under Assumption 2 (v), and all eigenvalues of $\Gamma$ are bounded from below by $c_0$ under Assumption 3 (iv). Therefore, we have $\|\sigma^{-1}\| \le c_1/\sqrt{c_0}$, and thus $\|\sigma^{-1}\| = O_{P_n}(1)$. This completes the proof of (A.4).

A.2 Proof of Theorem 2

Proof. Step 2 of the proof of Theorem 1 proves $\|\hat J - J_0\| = O_P(C^{-1/2}+r_n)$, and Assumption 2 (v) implies $\|J_0^{-1}\| \le c_0^{-1}$. Therefore, to prove the claim of the theorem, it suffices to show
\begin{align*}
\Big\| \frac{1}{K^2}\sum_{(k,\ell)\in[K]^2}\Big\{ &\frac{|I|\wedge|J|}{(|I||J|)^2}\sum_{i\in I_k}\sum_{j,j'\in J_\ell} \psi(W_{ij};\tilde\theta,\hat\eta_{k\ell})\psi(W_{ij'};\tilde\theta,\hat\eta_{k\ell})' \\
&+ \frac{|I|\wedge|J|}{(|I||J|)^2}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \psi(W_{ij};\tilde\theta,\hat\eta_{k\ell})\psi(W_{i'j};\tilde\theta,\hat\eta_{k\ell})' \Big\} \\
&- \bar\mu_N E_P[\psi(W_{11};\theta_0,\eta_0)\psi(W_{12};\theta_0,\eta_0)'] - \bar\mu_M E_P[\psi(W_{11};\theta_0,\eta_0)\psi(W_{21};\theta_0,\eta_0)'] \Big\| = O_P(\rho_n).
\end{align*}
Moreover, since $K$ and $d_\theta$ are constants and $\mu_N \to \bar\mu_N \le 1$, $\mu_M \to \bar\mu_M \le 1$, it suffices to show that, for each $(k,\ell)\in[K]^2$ and $l,m\in[d_\theta]$, it holds that
\[
\Big| \frac{|I|\wedge|J|}{(|I||J|)^2}\sum_{i\in I_k}\sum_{j,j'\in J_\ell} \psi_l(W_{ij};\tilde\theta,\hat\eta_{k\ell})\psi_m(W_{ij'};\tilde\theta,\hat\eta_{k\ell}) - \mu_N E_P[\psi_l(W_{11};\theta_0,\eta_0)\psi_m(W_{12};\theta_0,\eta_0)] \Big| = O_P(\rho_n)
\]
and
\[
\Big| \frac{|I|\wedge|J|}{(|I||J|)^2}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \psi_l(W_{ij};\tilde\theta,\hat\eta_{k\ell})\psi_m(W_{i'j};\tilde\theta,\hat\eta_{k\ell}) - \mu_M E_P[\psi_l(W_{11};\theta_0,\eta_0)\psi_m(W_{21};\theta_0,\eta_0)] \Big| = O_P(\rho_n).
\]
We will show the second statement since the first one follows analogously. Denote the left-hand side of the equation as $\mathcal{I}_{k\ell,lm}$.
First, note that, up to rounding, $(|I|\wedge|J|)/|J| = \mu_M$, and apply the triangle inequality to get $\mathcal{I}_{k\ell,lm} \le \mathcal{I}_{k\ell,lm,1} + \mathcal{I}_{k\ell,lm,2}$, where
\[
\mathcal{I}_{k\ell,lm,1} := \Big| \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell}\Big\{ \psi_l(W_{ij};\tilde\theta,\hat\eta_{k\ell})\psi_m(W_{i'j};\tilde\theta,\hat\eta_{k\ell}) - \psi_l(W_{ij};\theta_0,\eta_0)\psi_m(W_{i'j};\theta_0,\eta_0) \Big\} \Big|,
\]
\[
\mathcal{I}_{k\ell,lm,2} := \Big| \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \psi_l(W_{ij};\theta_0,\eta_0)\psi_m(W_{i'j};\theta_0,\eta_0) - E_P[\psi_l(W_{11};\theta_0,\eta_0)\psi_m(W_{21};\theta_0,\eta_0)] \Big|.
\]
We first find a bound for $\mathcal{I}_{k\ell,lm,2}$. Since $q > 4$, it holds that
\begin{align*}
E_P[\mathcal{I}_{k\ell,lm,2}^2]
&= \frac{1}{(|I|^2|J|)^2} E_P\Big[ \Big| \sum_{i,i'\in I_k}\sum_{j\in J_\ell}\big\{ \psi_l(W_{ij};\theta_0,\eta_0)\psi_m(W_{i'j};\theta_0,\eta_0) - E_P[\psi_l(W_{11};\theta_0,\eta_0)\psi_m(W_{21};\theta_0,\eta_0)] \big\} \Big|^2 \Big] \\
&\lesssim \frac{1}{(|I|^2|J|)^2} E_P\Big[ \sum_{i,i',i''\in I_k}\sum_{j,j'\in J_\ell} \psi_l(W_{ij})\psi_m(W_{i'j})\psi_l(W_{ij'})\psi_m(W_{i''j'}) \Big] \\
&\quad + \frac{1}{(|I|^2|J|)^2} E_P\Big[ \sum_{i,i',i'',i'''\in I_k}\sum_{j\in J_\ell} \psi_l(W_{ij})\psi_m(W_{i'j})\psi_l(W_{i''j})\psi_m(W_{i'''j}) \Big] + o((|I|\wedge|J|)^{-1}) + 0 \\
&\lesssim \frac{1}{|I|\wedge|J|} E_P[\|\psi(W;\theta_0,\eta_0)\|^4] \lesssim \frac{c_1^{4/q}}{C} = O(C^{-1}),
\end{align*}
so $\mathcal{I}_{k\ell,lm,2} = O_P(C^{-1/2})$. Now, to bound $\mathcal{I}_{k\ell,lm,1}$, we make use of the following identity from the proof of Theorem 3.2 in CCDDHNR (2018): for any numbers $a$, $b$, $\delta a$, $\delta b$ such that $|a|\vee|b| \le c$ and $|\delta a|\vee|\delta b| \le r$, it holds that $|(a+\delta a)(b+\delta b) - ab| \le 2r(c+r)$. Denote $\psi_{ij,h} := \psi_h(W_{ij};\theta_0,\eta_0)$ and $\hat\psi_{ij,h} := \psi_h(W_{ij};\tilde\theta,\hat\eta_{k\ell})$ for $h\in\{l,m\}$, and apply the above identity with $a = \psi_{ij,l}$, $b = \psi_{i'j,m}$, $a+\delta a = \hat\psi_{ij,l}$, $b+\delta b = \hat\psi_{i'j,m}$, $r = |\hat\psi_{ij,l}-\psi_{ij,l}| \vee |\hat\psi_{i'j,m}-\psi_{i'j,m}|$ and $c = |\psi_{ij,l}| \vee |\psi_{i'j,m}|$.
Then
\begin{align*}
\mathcal{I}_{k\ell,lm,1}
&= \Big| \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell}\big\{ \hat\psi_{ij,l}\hat\psi_{i'j,m} - \psi_{ij,l}\psi_{i'j,m} \big\} \Big| \\
&\le \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big| \hat\psi_{ij,l}\hat\psi_{i'j,m} - \psi_{ij,l}\psi_{i'j,m} \big| \\
&\lesssim \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big( |\hat\psi_{ij,l}-\psi_{ij,l}| \vee |\hat\psi_{i'j,m}-\psi_{i'j,m}| \big)
\times \big( |\psi_{ij,l}| \vee |\psi_{i'j,m}| + |\hat\psi_{ij,l}-\psi_{ij,l}| \vee |\hat\psi_{i'j,m}-\psi_{i'j,m}| \big) \\
&\le \Big( \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big( |\hat\psi_{ij,l}-\psi_{ij,l}| \vee |\hat\psi_{i'j,m}-\psi_{i'j,m}| \big)^2 \Big)^{1/2}
\times \Big( \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big\{ |\psi_{ij,l}| \vee |\psi_{i'j,m}| + |\hat\psi_{ij,l}-\psi_{ij,l}| \vee |\hat\psi_{i'j,m}-\psi_{i'j,m}| \big\}^2 \Big)^{1/2} \\
&\lesssim \Big( \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big( |\hat\psi_{ij,l}-\psi_{ij,l}| \vee |\hat\psi_{i'j,m}-\psi_{i'j,m}| \big)^2 \Big)^{1/2} \\
&\quad \times \Big\{ \Big( \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big( |\psi_{ij,l}| \vee |\psi_{i'j,m}| \big)^2 \Big)^{1/2}
+ \Big( \frac{1}{|I|^2|J|}\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big( |\hat\psi_{ij,l}-\psi_{ij,l}| \vee |\hat\psi_{i'j,m}-\psi_{i'j,m}| \big)^2 \Big)^{1/2} \Big\},
\end{align*}
where the second-to-last step uses the Cauchy-Schwarz inequality. Moreover,
\[
\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big( |\psi_{ij,l}| \vee |\psi_{i'j,m}| \big)^2 \le 2|I| \sum_{i=1}^N\sum_{j=1}^M \|\psi(W_{ij};\theta_0,\eta_0)\|^2,
\]
\[
\sum_{i,i'\in I_k}\sum_{j\in J_\ell} \big( |\hat\psi_{ij,l}-\psi_{ij,l}| \vee |\hat\psi_{i'j,m}-\psi_{i'j,m}| \big)^2 \le 2|I| \sum_{i=1}^N\sum_{j=1}^M \|\psi(W_{ij};\tilde\theta,\hat\eta_{k\ell}) - \psi(W_{ij};\theta_0,\eta_0)\|^2.
\]
Thus, the above bound for $\mathcal{I}_{k\ell,lm,1}$ implies that
\[
\mathcal{I}_{k\ell,lm,1} \lesssim \mathcal{R}_n^{1/2} \times \Big( \frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell} \|\psi(W_{ij};\theta_0,\eta_0)\|^2 + \mathcal{R}_n \Big)^{1/2},
\]
where
\[
\mathcal{R}_n := \frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell} \|\psi(W_{ij};\tilde\theta,\hat\eta_{k\ell}) - \psi(W_{ij};\theta_0,\eta_0)\|^2.
\]
Notice that
\[
\frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell}\|\psi(W_{ij};\theta_0,\eta_0)\|^2 = O_P(1),
\]
which is implied by Markov's inequality and the calculation
\[
E_P\Big[ \frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell}\|\psi(W_{ij};\theta_0,\eta_0)\|^2 \Big] = E_P[\|\psi(W;\theta_0,\eta_0)\|^2] \le c_1^{2/q}
\]
under Assumptions 1 and 3 (ii). Finally, to bound $\mathcal{R}_n$, using Assumption 2 (ii),
\[
\mathcal{R}_n \lesssim \frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell}\|\psi^a(W_{ij};\hat\eta_{k\ell})(\tilde\theta-\theta_0)\|^2 + \frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell}\|\psi(W_{ij};\theta_0,\hat\eta_{k\ell}) - \psi(W_{ij};\theta_0,\eta_0)\|^2.
\]
The first term on the right-hand side is bounded by
\[
\Big( \frac{1}{|I||J|}\sum_{(i,j)\in I_k\times J_\ell}\|\psi^a(W_{ij};\hat\eta_{k\ell})\|^2 \Big)\times\|\tilde\theta-\theta_0\|^2 = O_P(1)\times O_P(C^{-1}) = O_P(C^{-1})
\]
due to Assumption 3 (ii), Markov's inequality, and Theorem 1. Furthermore, given that $(W_{ij})_{(i,j)\in I_k^c\times J_\ell^c}$ satisfies $\hat\eta_{k\ell}\in\mathcal{T}_n$,
\[
E_P\big[ \|\psi(W_{ij};\theta_0,\hat\eta_{k\ell}) - \psi(W_{ij};\theta_0,\eta_0)\|^2 \,\big|\, I_k^c\times J_\ell^c \big]
\le \sup_{\eta\in\mathcal{T}_n} E_P\big[ \|\psi(W_{ij};\theta_0,\eta) - \psi(W_{ij};\theta_0,\eta_0)\|^2 \,\big|\, I_k^c\times J_\ell^c \big] \le (r_n')^2.
\]
Since $\hat\eta_{k\ell}\in\mathcal{T}_n$ happens with probability $1-o(1)$, we have $\mathcal{R}_n = O_P(C^{-1} + (r_n')^2)$. Thus we conclude that $\mathcal{I}_{k\ell,lm,1} = O_P(C^{-1/2} + r_n')$. This completes the proof.

B Useful Lemmas

We collect some useful auxiliary results in this section. First, for any $f: \mathrm{supp}(W)\to\mathbb{R}^d$ for a fixed $d\in\mathbb{N}$, we use
\[
\mathbb{G}_n f := \sqrt{C}\Big\{ \frac{1}{NM}\sum_{i=1}^N\sum_{j=1}^M f(W_{ij}) - E_P[f(W)] \Big\}
\]
to denote its multiway empirical process. The following is a multivariate version of Chiang and Sasaki (2019), Lemma 1; see also Lemma D.2 in Davezies et al. (2018).

Lemma 1 (Independentization via Hájek Projections). Suppose that Assumption 1 holds, $f: \mathrm{supp}(W)\to\mathbb{R}^d$ for some fixed $d\in\mathbb{N}$, and $E_P\|f(W)\|^2 < K$ for a finite constant $K$ that is independent of $n$. Then there exist i.i.d. uniform random variables $U_{i0}$ and $U_{0j}$ such that the Hájek projection $H_n f$ of $\mathbb{G}_n f$ on
\[
\mathcal{G}_n = \Big\{ \sum_{i=1}^N g_i(U_{i0}) + \sum_{j=1}^M g_j(U_{0j}) : g_i, g_j \in L^2(P_n) \Big\}
\]
is equal to
\[
H_n f = \frac{\sqrt{C}}{N}\sum_{i=1}^N E_P\big[ f(W_{i1}) - E_P f(W) \,\big|\, U_{i0} \big] + \frac{\sqrt{C}}{M}\sum_{j=1}^M E_P\big[ f(W_{1j}) - E_P f(W) \,\big|\, U_{0j} \big]
\]
for each $n$. Furthermore,
\[
V(\mathbb{G}_n f) = V(H_n f) + O(C^{-1}) = \bar\mu_N \mathrm{Cov}(f(W_{11}), f(W_{12})) + \bar\mu_M \mathrm{Cov}(f(W_{11}), f(W_{21})) + O(C^{-1})
\]
holds a.s.

Proof. The proof is essentially the same as the proof of Lemma 1 of Chiang and Sasaki (2019) and is therefore omitted.

The following re-states Lemma 6.1
of CCDDHNR (2018):

Lemma 2 (Conditional Convergence Implies Unconditional). Let $(X_n)$ and $(Y_n)$ be sequences of random vectors. (i) If, for $\epsilon_n \to 0$, $P(\|X_n\| > \epsilon_n \mid Y_n) = o_P(1)$ in probability, then $P(\|X_n\| > \epsilon_n) = o(1)$. In particular, this occurs if $E_P[\|X_n\|^q/\epsilon_n^q \mid Y_n] = o_P(1)$ for some $q \ge 1$. (ii) Let $(A_n)$ be a sequence of positive constants. If $\|X_n\| = O_P(A_n)$ conditional on $Y_n$, then $\|X_n\| = O_P(A_n)$ unconditionally, namely, for any $l_n \to \infty$, $P(\|X_n\| > l_n A_n) = o(1)$.

C Extension to General Multiway Clustering

In this section, we extend the main results to a general multiway cluster sampling framework. Notations in the current section are independent of those in the remaining parts of the paper; we introduce different notations in order to enhance the readability of the main results of the paper while economizing on complicated notations in this extension. Consider $\ell$-way clustered data for a fixed dimension $\ell\in\mathbb{N}$. With $C_i\in\mathbb{N}$ denoting the number of clusters in the $i$-th cluster dimension for each $i\in\{1,\dots,\ell\}$, each cell of the $\ell$-way clustered sample is indexed by the $\ell$-dimensional multiway cluster indices $\mathbf{j} = (j_1,\dots,j_\ell) \in \times_{i=1}^\ell [C_i]$. The $\ell$-dimensional size $(C_1,\dots,C_\ell)\in\mathbb{N}^\ell$ of the $\ell$-way clustered sample is indexed by $n\in\mathbb{N}$ as $(C_1,\dots,C_\ell) = (C_1(n),\dots,C_\ell(n))$, where $C_i(n)$ is non-decreasing in $n$ for each $i\in\{1,\dots,\ell\}$ and $\prod_{i=1}^\ell C_i(n)$ is increasing in $n$. With this said, we suppress the index notation and write $(C_1,\dots,C_\ell)$ without $n$ for simplicity. Also define $\mathbf{C} = (C_1,\dots,C_\ell)$, $\prod\mathbf{C} = \prod_{i=1}^\ell C_i$, $C = \min_{1\le i\le\ell} C_i$, $\bar C = \max_{1\le i\le\ell} C_i$, and $\mu_i = C/C_i$ for each $i\in\{1,\dots,\ell\}$. Suppose that $\mu_i \to \bar\mu_i$ for some constant $\bar\mu_i$ for each $i\in\{1,\dots,\ell\}$. The number of observations in the $\mathbf{j}$-th cell is denoted by $N_{\mathbf{j}}$, which is treated as a $\{0,1,\dots,N\}$-valued random variable for some $N\in\mathbb{N}$ not depending on $n$.
When $[\cdot]$ takes the random variable $N_{\mathbf{j}}$ as an argument, we extend the definition of $[\cdot]$ to $[N_{\mathbf{j}}] := \{1,\dots,N_{\mathbf{j}}\}$ if $N_{\mathbf{j}} \ge 1$ and $[N_{\mathbf{j}}] := \emptyset$ if $N_{\mathbf{j}} = 0$. The observed vector for unit $\imath\in[N_{\mathbf{j}}]$ in the $\mathbf{j}$-th cell is denoted by $W_{\imath,\mathbf{j}}$. Let $\{\mathcal{P}_n\}_n$ be a sequence of sets of probability laws of $(N_{\mathbf{j}}, (W_{\imath,\mathbf{j}})_{1\le\imath\le N})_{\mathbf{j}\ge\mathbf{1}}$, where $\mathbf{1} := (1,\dots,1)$ as a short-hand notation, and we write $\mathbf{j}\ge\mathbf{j}'$ to mean $j_i \ge j_i'$ for all $i\in\{1,\dots,\ell\}$.

Example 2. The sampling setting in Section 2.1 fits in the current general framework with $\ell = 2$, $C_1 = N$, $C_2 = M$, and $(N_{\mathbf{j}}, W_{1,\mathbf{j}}) = (1, W_{j_1 j_2})$ for all $\mathbf{j}\in[N]\times[M]$ with probability one. △

The econometric model has the true parameters $(\theta_0,\eta_0)\in\Theta\times T$ satisfying the score moment restriction
\[
E_P\Big[ \sum_{\imath=1}^{N_{\mathbf{1}}} \psi(W_{\imath,\mathbf{1}};\theta_0,\eta_0) \Big] = 0, \qquad \text{(C.1)}
\]
where we focus on the linear Neyman orthogonal score of the form
\[
\psi(w;\theta,\eta) = \psi^a(w;\eta)\theta + \psi^b(w;\eta) \qquad \text{for all } w\in\mathrm{supp}(W),\ \theta\in\Theta,\ \eta\in T \qquad \text{(C.2)}
\]
for $\mathrm{supp}(W) := \cup_{\imath=1}^{N}\mathrm{supp}(W_{\imath,\mathbf{1}})$, $\Theta\subset\mathbb{R}^{d_\theta}$, and a convex set $T$.

For a fixed integer $K > 1$, we randomly split the data into $K$ folds in each of the $\ell$ cluster dimensions, resulting in $K^\ell$ folds in total. Specifically, we randomly partition $[C_i]$ into $K$ parts $\{I_i^1,\dots,I_i^K\}$ for each $i\in\{1,\dots,\ell\}$. We use the $\ell$-dimensional indices $\mathbf{k} := (k_1,\dots,k_\ell)$ to index the $\ell$-way fold $I_{\mathbf{k}} := I_1^{k_1}\times\cdots\times I_\ell^{k_\ell}$ and its complementary product $I_{\mathbf{k}}^c := (I_1^{k_1})^c\times\cdots\times(I_\ell^{k_\ell})^c$ for each $\mathbf{k}\in[K]^\ell$. Let $\hat\eta_{\mathbf{k}} = \hat\eta(((W_{\iota,\mathbf{j}})_{\iota\in[N_{\mathbf{j}}]})_{\mathbf{j}\in I_{\mathbf{k}}^c})$ be a machine learning estimate of $\eta_0$ using the subsample $((W_{\iota,\mathbf{j}})_{\iota\in[N_{\mathbf{j}}]})_{\mathbf{j}\in I_{\mathbf{k}}^c}$ for each $\mathbf{k}\in[K]^\ell$.
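The $\ell$-way splitting just described, partitioning each $[C_i]$ into $K$ parts and indexing the $K^\ell$ folds by $\mathbf{k}\in[K]^\ell$ with nuisances trained on the complementary product $I_{\mathbf{k}}^c$, can be sketched as follows; the helper name `multiway_folds` is ours, not from the paper.

```python
import itertools
import numpy as np

def multiway_folds(cluster_sizes, K, seed=0):
    """For ell-way clustered data with cluster_sizes = (C_1, ..., C_ell),
    partition each [C_i] into K random parts and return, for every multiway
    fold index k in [K]^ell, the held-out per-dimension index sets and the
    per-dimension complements on which eta_hat_k would be trained."""
    rng = np.random.default_rng(seed)
    parts = [np.array_split(rng.permutation(C), K) for C in cluster_sizes]
    folds = {}
    for k in itertools.product(range(K), repeat=len(cluster_sizes)):
        hold = [parts[i][k_i] for i, k_i in enumerate(k)]
        comp = [np.setdiff1d(np.arange(C), h)
                for C, h in zip(cluster_sizes, hold)]
        folds[k] = (hold, comp)
    return folds

# ell = 2 with C_1 = 6, C_2 = 4 clusters and K = 2 gives K^ell = 4 folds
folds = multiway_folds((6, 4), K=2)
print(len(folds))  # 4
```

Each value pairs the held-out index sets with their complements, so the cell-level fold $I_{\mathbf{k}}$ and its training product $I_{\mathbf{k}}^c$ are recovered by taking Cartesian products of the returned per-dimension arrays.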
Let
\[
\hat J := \frac{1}{K^\ell}\sum_{\mathbf{k}\in[K]^\ell} E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]} \psi^a(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \Big],
\quad\text{where}\quad
E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]} f(W_{\imath,\mathbf{j}}) \Big] := \frac{1}{|I_{\mathbf{k}}|}\sum_{\mathbf{j}\in I_{\mathbf{k}}}\sum_{\imath\in[N_{\mathbf{j}}]} f(W_{\imath,\mathbf{j}})
\]
for each $\mathbf{k}\in[K]^\ell$ and any Borel measurable function $f$; the sum $\sum_{\imath\in[N_{\mathbf{j}}]}$ is treated as zero when $N_{\mathbf{j}} = 0$, and $|I_{\mathbf{k}}| := \lfloor \prod_{i=1}^\ell C_i / K^\ell \rfloor$. With this setup and notation, the multiway DML estimator is defined by
\[
\tilde\theta = -\hat J^{-1}\frac{1}{K^\ell}\sum_{\mathbf{k}\in[K]^\ell} E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]} \psi^b(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \Big]. \qquad \text{(C.3)}
\]
Let $|I_{\mathbf{k}}|_{\min} = \min\{|I_1^{k_1}|,\dots,|I_\ell^{k_\ell}|\}$ as a short-hand notation. Also let $I(\mathbf{j})$ denote the multiway fold containing the $\mathbf{j}$-th multiway cluster, i.e., $I(\mathbf{j})\subset\times_{i=1}^\ell[C_i]$ satisfies $I_{\mathbf{k}} = I(\mathbf{j})$ for some $\mathbf{k}\in[K]^\ell$ and $\mathbf{j}\in I(\mathbf{j})$. With these additional notations, we propose to estimate the asymptotic variance of $\sqrt{C}(\tilde\theta-\theta_0)$ by
\[
\hat\sigma^2 = \hat J^{-1}\Big[ \frac{1}{K^\ell}\sum_{\mathbf{k}\in[K]^\ell} \frac{|I_{\mathbf{k}}|_{\min}}{|I_{\mathbf{k}}|^2} \sum_{i=1}^\ell \sum_{\substack{\mathbf{j},\mathbf{j}'\in I_{\mathbf{k}} \\ j_i = j_i'}} \sum_{\imath\in[N_{\mathbf{j}}]}\sum_{\imath'\in[N_{\mathbf{j}'}]} \psi(W_{\imath,\mathbf{j}};\tilde\theta,\hat\eta_{\mathbf{k}})\psi(W_{\imath',\mathbf{j}'};\tilde\theta,\hat\eta_{\mathbf{k}})' \Big](\hat J^{-1})'. \qquad \text{(C.4)}
\]
Example 2, Continued. The two-way DML in Section 2.2 is a special case of the current general methodological framework with $\{I_1^1,\dots,I_1^K\} = \{I_1,\dots,I_K\}$, $\{I_2^1,\dots,I_2^K\} = \{J_1,\dots,J_K\}$, $\hat\eta_{(k_1,k_2)} = \hat\eta((W_{j_1 j_2})_{(j_1,j_2)\in([N]\setminus I_{k_1})\times([M]\setminus J_{k_2})})$,
\[
\hat J = \frac{1}{K^2}\sum_{(k_1,k_2)\in[K]^2} E_{n,(k_1,k_2)}[\psi^a(W_{j_1 j_2};\hat\eta_{(k_1,k_2)})],
\quad\text{where}\quad
E_{n,(k_1,k_2)}[f(W_{j_1 j_2})] = \frac{1}{|I_{k_1}||J_{k_2}|}\sum_{(j_1,j_2)\in I_{k_1}\times J_{k_2}} f(W_{j_1 j_2}),
\]
\[
\tilde\theta = -\hat J^{-1}\frac{1}{K^2}\sum_{(k_1,k_2)\in[K]^2} E_{n,(k_1,k_2)}\big[ \psi^b(W_{j_1 j_2};\hat\eta_{(k_1,k_2)}) \big],
\]
and $\hat\sigma^2 = \hat J^{-1}\hat\Gamma(\hat J^{-1})'$, where
\begin{align*}
\hat\Gamma = \frac{1}{K^2}\sum_{(k_1,k_2)\in[K]^2}\Big\{ &\frac{|I_{k_1}|\wedge|J_{k_2}|}{(|I_{k_1}||J_{k_2}|)^2}\sum_{j_1\in I_{k_1}}\sum_{j_2,j_2'\in J_{k_2}} \psi(W_{j_1 j_2};\tilde\theta,\hat\eta_{(k_1,k_2)})\psi(W_{j_1 j_2'};\tilde\theta,\hat\eta_{(k_1,k_2)})' \\
&+ \frac{|I_{k_1}|\wedge|J_{k_2}|}{(|I_{k_1}||J_{k_2}|)^2}\sum_{j_1,j_1'\in I_{k_1}}\sum_{j_2\in J_{k_2}} \psi(W_{j_1 j_2};\tilde\theta,\hat\eta_{(k_1,k_2)})\psi(W_{j_1' j_2};\tilde\theta,\hat\eta_{(k_1,k_2)})' \Big\}.
\end{align*}
△

We now state assumptions under which (C.4) is an asymptotically valid variance estimator for $\sqrt{C}(\tilde\theta-\theta_0)$ with the multiway DML estimator (C.3). We write $a \lesssim b$ to mean $a \le cb$ for some constant $c > 0$ that does not depend on $n$. We also write $a \lesssim_P b$ to mean $a = O_P(b)$. For any finite-dimensional vector $v$, $\|v\|$ denotes the $\ell^2$ (Euclidean) norm of $v$. For any matrix $A$, $\|A\|$ denotes the induced $\ell^2$-norm of the matrix. The following assumption concerns the multiway clustered sampling.

Assumption 4 (Sampling). The following conditions hold for each $n$.
(i) The array $(N_{\mathbf{j}}, (W_{\imath,\mathbf{j}})_{1\le\imath\le N})_{\mathbf{j}\ge\mathbf{1}}$ is an infinite sequence of separately exchangeable random vectors. That is, for any $\ell$-tuple of permutations $(\pi_1,\dots,\pi_\ell)$ of $\mathbb{N}$, we have
\[
(N_{\mathbf{j}}, (W_{\imath,\mathbf{j}})_{1\le\imath\le N})_{\mathbf{j}\ge\mathbf{1}} \stackrel{d}{=} (N_{\pi_1(j_1),\dots,\pi_\ell(j_\ell)}, (W_{\imath,\pi_1(j_1),\dots,\pi_\ell(j_\ell)})_{1\le\imath\le N})_{\mathbf{j}\ge\mathbf{1}}.
\]
(ii) $(N_{\mathbf{j}}, (W_{\imath,\mathbf{j}})_{1\le\imath\le N})_{\mathbf{j}\ge\mathbf{1}}$ is dissociated. That is, for any $c \ge 1$, $(N_{\mathbf{j}}, (W_{\imath,\mathbf{j}})_{1\le\imath\le N})_{\mathbf{1}\le\mathbf{j}\le c\mathbf{1}}$ is independent of $(N_{\mathbf{j}'}, (W_{\imath',\mathbf{j}'})_{1\le\imath'\le N})_{\mathbf{j}'\ge(c+1)\mathbf{1}}$.
(iii) $E(N_{\mathbf{1}}) > 0$ and $N_{\mathbf{j}} \le N$ for each $\mathbf{1}\le\mathbf{j}\le\mathbf{C}$, where $N\in\mathbb{N}$ does not depend on $n$.
(iv) The econometrician observes $(N_{\mathbf{j}}, (W_{\imath,\mathbf{j}})_{1\le\imath\le N_{\mathbf{j}}})_{\mathbf{1}\le\mathbf{j}\le\mathbf{C}}$.

Remark 5. The dependence among $(W_{\imath,\mathbf{j}})_{\imath\ge1}$ in each cell $\mathbf{j}$ is left unrestricted in this assumption. Assumption 4 is similar to Assumption 1 of Davezies et al. (2018), except for $N$. We introduce $N$ to simplify some concentration arguments.

Let $c_0 > 0$, $c_1 > 0$, $s > 0$ and $q \ge 4$ be finite constants with $c_0 \le c_1$. Let $\{\delta_n\}_{n\ge1}$ (estimation errors) and $\{\Delta_n\}_{n\ge1}$ (probability bounds) be sequences of positive constants that converge to zero such that $\delta_n \ge C^{-1/2}$. Let $K \ge 2$ be a fixed integer. Let $(N_{\mathbf{1}}', (W_{\imath,\mathbf{1}}')_{1\le\imath\le N})$ denote an independent copy of $(N_{\mathbf{1}}, (W_{\imath,\mathbf{1}})_{1\le\imath\le N})$, which is therefore independent of the data and of the random set $\mathcal{T}_n$ of nuisance realizations. With these notations, we state the following assumptions for the model.

Assumption 5 (Linear Neyman Orthogonal Score).
For all $\mathbf{C}\ge\mathbf{1}$ and $P\in\mathcal{P}_n$, the following conditions hold.
(i) The true parameter value $\theta_0$ satisfies (C.1).
(ii) $\psi$ is linear in the sense that it satisfies (C.2).
(iii) The map $\eta \mapsto E_P[\sum_{\imath\in[N_{\mathbf{1}}]}\psi(W_{\imath,\mathbf{1}};\theta,\eta)]$ is twice continuously Gateaux differentiable on $T$.
(iv) $\psi$ satisfies the Neyman near orthogonality condition at $(\theta_0,\eta_0)$:
\[
\lambda_n := \sup_{\eta\in\mathcal{T}_n}\Big\| \partial_\eta E_P\Big[ \sum_{\imath\in[N_{\mathbf{1}}]}\psi(W_{\imath,\mathbf{1}};\theta_0,\eta_0)[\eta-\eta_0] \Big] \Big\| \le \delta_n C^{-1/2}.
\]
(v) The identification condition holds: the singular values of the matrix $J_0 := E_P[\sum_{\imath\in[N_{\mathbf{1}}]}\psi^a(W_{\imath,\mathbf{1}};\eta_0)]$ are between $c_0$ and $c_1$.

Assumption 6 (Score Regularity and Nuisance Parameter Estimators). For all $\mathbf{C}\ge\mathbf{1}$ and $P\in\mathcal{P}_n$, the following conditions hold.
(i) The realization set $\mathcal{T}_n$ contains $\eta_0$, and the nuisance parameter estimator $\hat\eta_{\mathbf{k}} = \hat\eta(((W_{\iota,\mathbf{j}})_{\iota\in[N_{\mathbf{j}}]})_{\mathbf{j}\in I_{\mathbf{k}}^c})$ belongs to the realization set $\mathcal{T}_n$ for each $\mathbf{k}\in[K]^\ell$ with probability at least $1-\Delta_n$.
(ii) The following moment conditions hold:
\[
m_n := \sup_{\eta\in\mathcal{T}_n}\Big( E_P\Big[ \Big\|\sum_{\imath\in[N_{\mathbf{1}}]}\psi(W_{\imath,\mathbf{1}};\theta_0,\eta)\Big\|^q \Big] \Big)^{1/q} \le c_1,
\qquad
m_n' := \sup_{\eta\in\mathcal{T}_n}\Big( E_P\Big[ \Big\|\sum_{\imath\in[N_{\mathbf{1}}]}\psi^a(W_{\imath,\mathbf{1}};\eta)\Big\|^q \Big] \Big)^{1/q} \le c_1.
\]
(iii) The following conditions on the statistical rates $r_n$, $r_n'$ and $\lambda_n'$ hold:
\[
r_n := \sup_{\eta\in\mathcal{T}_n}\Big\| E_P\Big[\sum_{\imath\in[N_{\mathbf{1}}]}\psi^a(W_{\imath,\mathbf{1}};\eta)\Big] - E_P\Big[\sum_{\imath\in[N_{\mathbf{1}}]}\psi^a(W_{\imath,\mathbf{1}};\eta_0)\Big] \Big\| \le \delta_n,
\]
\[
r_n' := \sup_{\eta\in\mathcal{T}_n}\Big( E_P\Big[ \Big\| \sum_{\imath\in[N_{\mathbf{1}}]}\psi(W_{\imath,\mathbf{1}};\theta_0,\eta) - \sum_{\imath\in[N_{\mathbf{1}}]}\psi(W_{\imath,\mathbf{1}};\theta_0,\eta_0) \Big\|^2 \Big] \Big)^{1/2} \le \delta_n,
\]
\[
\lambda_n' := \sup_{r\in(0,1),\,\eta\in\mathcal{T}_n}\Big\| \partial_r^2 E_P\Big[ \sum_{\imath\in[N_{\mathbf{1}}]}\psi(W_{\imath,\mathbf{1}};\theta_0,\eta_0 + r(\eta-\eta_0)) \Big] \Big\| \le \delta_n/\sqrt{C}.
\]
(iv) All eigenvalues of the matrix
\[
\Gamma := \sum_{i=1}^\ell \bar\mu_i \Gamma_i = \sum_{i=1}^\ell \bar\mu_i E_P\Big[ \sum_{\imath=1}^{N_{\mathbf{1}}}\sum_{\imath'=1}^{N_{\mathbf{2}_i}} \psi(W_{\imath,\mathbf{1}};\theta_0,\eta_0)\psi(W_{\imath',\mathbf{2}_i};\theta_0,\eta_0)' \Big]
\]
are bounded from below by $c_0$, where $\mathbf{2}_i$ denotes the $\ell$-tuple vector with 2 in each entry except for 1 in the $i$-th entry.

The following theorems generalize Theorems 1 and 2 to cover general $\ell$-way cluster sampling. Their proofs are contained in Section D.

Theorem 3 (Main Result). Suppose that Assumptions 4, 5 and 6 are satisfied. If $\delta_n \ge C^{-1/2}$ for all $\mathbf{C}\ge\mathbf{1}$, then
\[
\sqrt{C}\sigma^{-1}(\tilde\theta-\theta_0) = \frac{\sqrt{C}}{\prod\mathbf{C}}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_{\mathbf{j}}]} \bar\psi(W_{\imath,\mathbf{j}}) + O_P(\rho_n) \rightsquigarrow N(0, I_{d_\theta})
\]
holds uniformly over $P\in\mathcal{P}_n$, where $\prod\mathbf{C} = \prod_{i=1}^\ell C_i$, the influence function takes the form $\bar\psi(\cdot) := -\sigma^{-1}J_0^{-1}\psi(\cdot;\theta_0,\eta_0)$, the size of the remainder term obeys
\[
\rho_n := C^{-1/2} + r_n + r_n' + C^{1/2}\lambda_n + C^{1/2}\lambda_n' \lesssim \delta_n,
\]
and the asymptotic variance is given by
\[
\sigma^2 := J_0^{-1}\Gamma(J_0^{-1})'. \qquad \text{(C.5)}
\]

Theorem 4 (Variance Estimator). Under the assumptions required by Theorem 3, we have $\hat\sigma^2 = \sigma^2 + O_P(\rho_n)$. Furthermore, the statement of Theorem 3 holds true with $\hat\sigma$ in place of $\sigma$.

D Proofs of the Extended Results

D.1 Proof of Theorem 3

Proof. Let $\mathcal{E}_n$ denote the event that $\hat\eta_{(k_1,\dots,k_\ell)}\in\mathcal{T}_n$ for all $(k_1,\dots,k_\ell)\in[K]^\ell$, and write $\mathbf{k} := (k_1,\dots,k_\ell)$. Assumption 6 (i) implies $P_n(\mathcal{E}_n) \ge 1 - K^\ell\Delta_n$. Let $e\in\{0,1\}^\ell$, and define
\[
\mathcal{A}_e := \{ (\mathbf{j},\mathbf{j}') : \mathbf{1}\le\mathbf{j},\mathbf{j}'\le\mathbf{C},\ \forall i = 1,\dots,\ell,\ e_i = 1 \Leftrightarrow j_i = j_i' \}
\quad\text{and}\quad
\varepsilon_m := \Big\{ e\in\{0,1\}^\ell : \sum_{i'=1}^\ell e_{i'} = m \Big\}.
\]

Step 1. This is the main step, showing the linear representation and asymptotic normality for the proposed estimator. Denote
\[
\hat J := \frac{1}{K^\ell}\sum_{\mathbf{k}\in[K]^\ell} E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi^a(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \Big], \qquad R_{n,1} := \hat J - J_0,
\]
\[
R_{n,2} := \frac{1}{K^\ell}\sum_{\mathbf{k}\in[K]^\ell} E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi(W_{\imath,\mathbf{j}};\theta_0,\hat\eta_{\mathbf{k}}) \Big] - \frac{1}{\prod\mathbf{C}}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_{\mathbf{j}}]} \psi(W_{\imath,\mathbf{j}};\theta_0,\eta_0).
\]
We will later show in Steps 2, 3, 4 and 5, respectively, that
\[
\|R_{n,1}\| = O_{P_n}(C^{-1/2} + r_n), \qquad \text{(D.1)}
\]
\[
\|R_{n,2}\| = O_{P_n}(C^{-1/2} r_n' + \lambda_n + \lambda_n'), \qquad \text{(D.2)}
\]
\[
\Big\| \frac{\sqrt{C}}{\prod\mathbf{C}}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_{\mathbf{j}}]} \psi(W_{\imath,\mathbf{j}};\theta_0,\eta_0) \Big\| = O_{P_n}(1), \qquad \text{(D.3)}
\]
\[
\|\sigma^{-1}\| = O_{P_n}(1). \qquad \text{(D.4)}
\]
Then, under Assumptions 5 and 6, $C^{-1/2} + r_N \le \rho_n = o(1)$ and all singular values of $J_0$ are bounded away from zero. Therefore, with $P_n$-probability at least $1-o(1)$, all singular values of $\hat J$ are bounded away from zero. Thus, with the same $P_n$-probability, the multiway DML solution is uniquely written as
\[
\tilde\theta = -\hat J^{-1}\frac{1}{K^\ell}\sum_{\mathbf{k}\in[K]^\ell} E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi^b(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \Big],
\]
and
\begin{align*}
\sqrt{C}(\tilde\theta-\theta_0)
&= -\sqrt{C}\,\hat J^{-1}\frac{1}{K^\ell}\sum_{\mathbf{k}\in[K]^\ell}\Big( E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi^b(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \Big] + \hat J\theta_0 \Big) \\
&= -\sqrt{C}\,\hat J^{-1}\frac{1}{K^\ell}\sum_{\mathbf{k}\in[K]^\ell} E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi(W_{\imath,\mathbf{j}};\theta_0,\hat\eta_{\mathbf{k}}) \Big] \\
&= -\big(J_0 + R_{n,1}\big)^{-1}\times\Big( \frac{\sqrt{C}}{\prod\mathbf{C}}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_{\mathbf{j}}]} \psi(W_{\imath,\mathbf{j}};\theta_0,\eta_0) + \sqrt{C}R_{n,2} \Big). \qquad \text{(D.5)}
\end{align*}
Using the fact that $(J_0+R_{n,1})^{-1} - J_0^{-1} = -(J_0+R_{n,1})^{-1}R_{n,1}J_0^{-1}$, we have
\[
\|(J_0+R_{n,1})^{-1} - J_0^{-1}\| \le \|(J_0+R_{n,1})^{-1}\|\,\|R_{n,1}\|\,\|J_0^{-1}\| = O_{P_n}(1)\,O_{P_n}(C^{-1/2}+r_n)\,O_{P_n}(1) = O_{P_n}(C^{-1/2}+r_n).
\]
Furthermore, since $r_n' + \sqrt{C}(\lambda_n+\lambda_n') \le \rho_n = o(1)$, it holds that
\[
\Big\| \frac{\sqrt{C}}{\prod\mathbf{C}}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_{\mathbf{j}}]} \psi(W_{\imath,\mathbf{j}};\theta_0,\eta_0) + \sqrt{C}R_{n,2} \Big\|
\le \Big\| \frac{\sqrt{C}}{\prod\mathbf{C}}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_{\mathbf{j}}]} \psi(W_{\imath,\mathbf{j}};\theta_0,\eta_0) \Big\| + \big\|\sqrt{C}R_{n,2}\big\|
= O_{P_n}(1) + o_{P_n}(1) = O_{P_n}(1),
\]
where the first equality is due to (D.3) and (D.4). Combining the above two bounds gives
\[
\big\|(J_0+R_{n,1})^{-1} - J_0^{-1}\big\| \times \Big\| \frac{\sqrt{C}}{\prod\mathbf{C}}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_{\mathbf{j}}]} \psi(W_{\imath,\mathbf{j}};\theta_0,\eta_0) + \sqrt{C}R_{n,2} \Big\| = O_{P_n}(C^{-1/2}+r_n)\,O_{P_n}(1) = O_{P_n}(C^{-1/2}+r_n). \qquad \text{(D.6)}
\]
Therefore, from (D.4), (D.5) and (D.6), we have
\[
\sqrt{C}\sigma^{-1}(\tilde\theta-\theta_0) = \frac{\sqrt{C}}{\prod\mathbf{C}}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_{\mathbf{j}}]} \bar\psi(W_{\imath,\mathbf{j}}) + O_{P_n}(\rho_n).
\]
The first term on the right-hand side can be written as $\mathbb{G}_n\bar\psi$. Applying Lemma 3, we obtain the independent linear representation
\[
H_n\bar\psi := \sum_{j_1=1}^{C_1}\frac{\sqrt{C}}{C_1} E_{P_n}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\bar\psi(W_{\imath,\mathbf{j}}) \,\Big|\, U_{j_1,0,\dots,0} \Big] + \cdots + \sum_{j_\ell=1}^{C_\ell}\frac{\sqrt{C}}{C_\ell} E_{P_n}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\bar\psi(W_{\imath,\mathbf{j}}) \,\Big|\, U_{0,\dots,0,j_\ell} \Big],
\]
and it holds $P_n$-a.s. that $V_n(\mathbb{G}_n\bar\psi) = V_n(H_n\bar\psi) + O(C^{-1}) = J_0^{-1}\Gamma(J_0^{-1})' + O(C^{-1})$ and $\mathbb{G}_n\bar\psi = H_n\bar\psi + O_P(C^{-1/2})$, where $V_n(\cdot) = E_{P_n}[(\cdot - E_{P_n}[\cdot])^2]$, under Assumption 6 (iv). Since $q\ge4$, the third moments of both summands of $H_n\bar\psi$ are bounded over $n$ under Assumptions 5 (v) and 6 (ii), (iv). We have thus verified all the conditions for Lyapunov's CLT. An application of Lyapunov's CLT and the Cramér-Wold device gives $H_n\bar\psi \rightsquigarrow N(0, I_{d_\theta})$, and an application of Theorem 2.7 of van der Vaart (1998) concludes the proof.

Step 2. Since $K$ is fixed, it suffices to show that, for any $\mathbf{k}\in[K]^\ell$,
\[
\Big\| E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi^a(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \Big] - E_P\Big[ \sum_{\imath\in[N_{\mathbf{1}}]}\psi^a(W_{\imath,\mathbf{1}};\eta_0) \Big] \Big\| = O_{P_n}(C^{-1/2}+r_n).
\]
Fix $\mathbf{k}\in[K]^\ell$. Then
\[
\Big\| E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi^a(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \Big] - E_{P_n}\Big[ \sum_{\imath\in[N_{\mathbf{1}}]}\psi^a(W_{\imath,\mathbf{1}};\eta_0) \Big] \Big\| \le \mathcal{I}_{1,\mathbf{k}} + \mathcal{I}_{2,\mathbf{k}},
\]
where
\[
\mathcal{I}_{1,\mathbf{k}} := \Big\| E_{n,\mathbf{k}}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi^a(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \Big] - E_{P_n}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi^a(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \,\Big|\, I_{\mathbf{k}}^c \Big] \Big\|,
\qquad
\mathcal{I}_{2,\mathbf{k}} := \Big\| E_{P_n}\Big[ \sum_{\imath\in[N_{\mathbf{j}}]}\psi^a(W_{\imath,\mathbf{j}};\hat\eta_{\mathbf{k}}) \,\Big|\, I_{\mathbf{k}}^c \Big] - E_{P_n}\Big[ \sum_{\imath\in[N_{\mathbf{1}}]}\psi^a(W_{\imath,\mathbf{1}};\eta_0) \Big] \Big\|.
\]
Notice that $I_{2,k} \leq r_n$ with $P_n$-probability $1-o(1)$ follows directly from Assumptions 4(ii) and 6(iii). Now denote
\[
\widetilde\psi^a_{j,m} := \sum_{\imath\in[N_j]} \psi^a_m(W_{\imath,j};\widehat\eta_k) - E_{P_n}\Big[ \sum_{\imath\in[N_j]} \psi^a_m(W_{\imath,j};\widehat\eta_k) \,\Big|\, I^c_{k_1}\times\cdots\times I^c_{k_\ell} \Big]
\]
and $\widetilde\psi^a_j = (\widetilde\psi^a_{j,m})_{m\in[d_\theta]}$, and $|I_{\underline k}| := \min\{|I_{k_1}|,\dots,|I_{k_\ell}|\}$. Let us denote $I_k := I_{k_1}\times\cdots\times I_{k_\ell}$ and $I^c_k := I^c_{k_1}\times\cdots\times I^c_{k_\ell}$. For $j = (j_1,\dots,j_\ell)$, write $I_i(j) := j_i$ for its $i$-th cluster index, and define
\[
B_e := \{ (j,j') : j,j'\in I_k,\ \forall i = 1,\dots,\ell,\ e_i = 1 \Leftrightarrow I_i(j) = I_i(j') \}
\quad\text{and}\quad
\epsilon_m := \Big\{ e\in\{0,1\}^\ell : \sum_{i'=1}^{\ell} e_{i'} = m \Big\}.
\]
To bound $I_{1,k}$, note that conditional on $I^c_k$, it holds that
\[
E_{P_n}[I_{1,k}^2 \mid I^c_k]
= E_{P_n}\Big[ \Big\| E_{n,k}\Big[\sum_{\imath\in[N_j]}\psi^a(W_{\imath,j};\widehat\eta_k)\Big] - E_{P_n}\Big[\sum_{\imath\in[N_j]}\psi^a(W_{\imath,j};\widehat\eta_k) \,\Big|\, I^c_k\Big] \Big\|^2 \,\Big|\, I^c_k \Big]
= \frac{1}{|I_k|^2}\, E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \Big( \sum_{j\in I_k} \widetilde\psi^a_{j,m} \Big)^2 \,\Big|\, I^c_k \Big]
\]
\[
= \frac{1}{|I_k|^2} \sum_{e\in\epsilon_\ell} \sum_{(j,j')\in B_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \widetilde\psi^a_{j,m}\widetilde\psi^a_{j',m} \,\Big|\, I^c_k \Big]
+ \frac{1}{|I_k|^2} \sum_{r=1}^{\ell-1} \sum_{e\in\epsilon_r} \sum_{(j,j')\in B_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \widetilde\psi^a_{j,m}\widetilde\psi^a_{j',m} \,\Big|\, I^c_k \Big]
+ \frac{1}{|I_k|^2} \sum_{e\in\epsilon_0} \sum_{(j,j')\in B_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \widetilde\psi^a_{j,m}\widetilde\psi^a_{j',m} \,\Big|\, I^c_k \Big]
\]
\[
= \frac{1}{|I_k|^2} \sum_{e\in\epsilon_\ell} \sum_{(j,j')\in B_e} E_{P_n}\big[ \langle \widetilde\psi^a_j, \widetilde\psi^a_{j'} \rangle \mid I^c_k \big] + R_1 + 0.
\]
The first term equals
\[
\frac{1}{|I_k|}\, E_{P_n}\Big[ \Big\| \sum_{\imath\in[N_j]}\psi^a(W_{\imath,j};\widehat\eta_k) - E_{P_n}\Big[\sum_{\imath\in[N_j]}\psi^a(W_{\imath,j};\widehat\eta_k)\,\Big|\,I^c_k\Big] \Big\|^2 \,\Big|\, I^c_k \Big]
\leq \frac{1}{|I_k|}\, E_{P_n}\Big[ \Big\| \sum_{\imath\in[N_j]}\psi^a(W_{\imath,j};\widehat\eta_k) \Big\|^2 \,\Big|\, I^c_k \Big]
\leq \frac{c_1^2}{|I_k|}.
\]
In the third equality, the last term corresponds to the covariance between cells sharing no common cluster.
By independence, the last term is zero. Let $R_1$ denote the second term in the third equality. By the Cauchy–Schwarz inequality and Assumption 4(ii),
\[
|R_1| \leq \frac{1}{|I_k|^2} \sum_{e\in\cup_{l=1}^{\ell-1}\epsilon_l} |B_e|\, E_{P_n}\Big[ \sum_{m=1}^{d_\theta} (\widetilde\psi^a_{j,m})^2 \,\Big|\, I^c_k \Big]. \tag{D.7}
\]
For $1 \leq r \leq \ell-1$ and $e\in\epsilon_r$, we have
\[
|B_e| = |I_k| \times \prod_{i:\,e_i=0} (|I_{k_i}| - 1). \tag{D.8}
\]
Therefore, $R_1 = O(|I_{\underline k}|^{-1})$. Note that $C \lesssim |I_{\underline k}| \lesssim C$. Hence an application of Lemma 2(i) implies
\[
I_{1,k} = O_{P_n}(C^{-1/2}).
\]
This completes the proof of (D.1).

Step 3. It again suffices to show that for any $k\in[K]^\ell$, one has
\[
\Big\| E_{n,k}\Big[\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k)\Big] - \frac{1}{|I_k|}\sum_{j\in I_k}\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\| = O_{P_n}(C^{-1/2} r'_n + \lambda_n + \lambda'_n).
\]
Define
\[
G_{n,k}\Big[\sum_{\imath\in[N_j]}\phi(W_{\imath,j})\Big] := \frac{\sqrt C}{|I_k|} \sum_{j\in I_k}\sum_{\imath\in[N_j]}\Big( \phi(W_{\imath,j}) - \int \phi(w)\, dP_n \Big).
\]
Then
\[
\Big\| E_{n,k}\Big[\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k)\Big] - \frac{1}{|I_k|}\sum_{j\in I_k}\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\| \leq \frac{I_{3,k} + I_{4,k}}{\sqrt C},
\]
where
\[
I_{3,k} := \Big\| G_{n,k}\Big[\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k)\Big] - G_{n,k}\Big[\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0)\Big] \Big\|,
\]
\[
I_{4,k} := \sqrt C\, \Big\| E_{P_n}\Big[\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k)\,\Big|\,I^c_k\Big] - E_{P_n}\Big[\sum_{\imath\in[N_{\mathbf 1}]}\psi(W_{\imath,\mathbf 1};\theta_0,\eta_0)\Big] \Big\|.
\]
Denote $\widetilde\psi_{j,m} := \sum_{\imath\in[N_j]}\psi_m(W_{\imath,j};\theta_0,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi_m(W_{\imath,j};\theta_0,\eta_0)$ and $\widetilde\psi_j = (\widetilde\psi_{j,m})_{m\in[d_\theta]}$.
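The pair count (D.8) is a simple combinatorial identity: the cell $j$ can be chosen freely, and each coordinate of $j'$ that must differ has $|I_{k_i}|-1$ choices. A brute-force check on small index sets (a sketch with illustrative fold sizes; `count_pairs` and `formula` are hypothetical helper names, not the paper's notation):

```python
from itertools import product

def count_pairs(sizes, e):
    """Count pairs (j, j') of cells in the product of index sets of the
    given sizes such that j[i] == j'[i] exactly when e[i] == 1."""
    cells = list(product(*[range(s) for s in sizes]))
    return sum(
        1
        for j in cells
        for jp in cells
        if all((j[i] == jp[i]) == bool(e[i]) for i in range(len(sizes)))
    )

def formula(sizes, e):
    """|B_e| = |I_k| * prod over non-shared dimensions of (|I_{k_i}| - 1)."""
    total = 1
    for s in sizes:
        total *= s
    for i, s in enumerate(sizes):
        if not e[i]:
            total *= s - 1
    return total

sizes = (3, 4, 2)  # illustrative fold sizes along each of ell = 3 dimensions
for e in product([0, 1], repeat=3):
    # the identity holds for every sharing pattern e
    assert count_pairs(sizes, e) == formula(sizes, e)
```

The enumeration confirms the count for every sharing pattern $e$, including the diagonal case $e = (1,\dots,1)$, where both sides reduce to $|I_k|$.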
To bound $I_{3,k}$, notice that, by an argument similar to that used for $I_{1,k}$, one has
\[
E_{P_n}[I_{3,k}^2 \mid I^c_k]
= E_{P_n}\Big[ \Big\| G_{n,k}\Big[\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0)\Big] \Big\|^2 \,\Big|\, I^c_k \Big]
= E_{P_n}\Big[ \frac{C}{|I_k|^2} \sum_{m=1}^{d_\theta} \Big\{ \sum_{j\in I_k} \big(\widetilde\psi_{j,m} - E_{P_n}\widetilde\psi_{j,m}\big) \Big\}^2 \,\Big|\, I^c_k \Big]
\]
\[
= \frac{C}{|I_k|^2} \sum_{e\in\epsilon_\ell} \sum_{(j,j')\in B_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \big(\widetilde\psi_{j,m}-E_{P_n}\widetilde\psi_{j,m}\big)\big(\widetilde\psi_{j',m}-E_{P_n}\widetilde\psi_{j',m}\big) \,\Big|\, I^c_k \Big]
+ \frac{C}{|I_k|^2} \sum_{r=1}^{\ell-1} \sum_{e\in\epsilon_r} \sum_{(j,j')\in B_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \big(\widetilde\psi_{j,m}-E_{P_n}\widetilde\psi_{j,m}\big)\big(\widetilde\psi_{j',m}-E_{P_n}\widetilde\psi_{j',m}\big) \,\Big|\, I^c_k \Big]
+ \frac{C}{|I_k|^2} \sum_{e\in\epsilon_0} \sum_{(j,j')\in B_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta} \big(\widetilde\psi_{j,m}-E_{P_n}\widetilde\psi_{j,m}\big)\big(\widetilde\psi_{j',m}-E_{P_n}\widetilde\psi_{j',m}\big) \,\Big|\, I^c_k \Big]
\]
\[
= \frac{C}{|I_k|^2} \sum_{e\in\epsilon_\ell} \sum_{(j,j')\in B_e} E_{P_n}\big[ \langle \widetilde\psi_j - E_{P_n}\widetilde\psi_j,\ \widetilde\psi_{j'} - E_{P_n}\widetilde\psi_{j'} \rangle \,\big|\, I^c_k \big] + R'_1 + 0.
\]
For each $j$, the summand in the first term is bounded by
\[
E_{P_n}\Big[ \Big\| \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) - E_{P_n}\Big[ \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big] \Big\|^2 \,\Big|\, I^c_k \Big]
\]
\[
\leq E_{P_n}\Big[ \Big\| \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\|^2 \,\Big|\, I^c_k \Big]
\leq \sup_{\eta\in\mathcal T_n} E_{P_n}\Big[ \Big\| \sum_{\imath\in[N_{\mathbf 1}]}\psi(W_{\imath,\mathbf 1};\theta_0,\eta) - \sum_{\imath\in[N_{\mathbf 1}]}\psi(W_{\imath,\mathbf 1};\theta_0,\eta_0) \Big\|^2 \,\Big|\, I^c_k \Big]
\]
\[
= \sup_{\eta\in\mathcal T_n} E_{P_n}\Big[ \Big\| \sum_{\imath\in[N_{\mathbf 1}]}\psi(W_{\imath,\mathbf 1};\theta_0,\eta) - \sum_{\imath\in[N_{\mathbf 1}]}\psi(W_{\imath,\mathbf 1};\theta_0,\eta_0) \Big\|^2 \Big] = (r'_n)^2.
\]
Arguing as for $R_1$, we have $R'_1 = O(C^{-1})$. Hence, $I_{3,k} = O_{P_n}(r'_n)$.
To bound $I_{4,k}$, let
\[
f_k(r) := E_{P_n}\Big[ \sum_{\imath\in[N_j]}\psi\big(W_{\imath,j};\theta_0,\eta_0 + r(\widehat\eta_k - \eta_0)\big) \,\Big|\, I^c_k \Big] - E_{P_n}\Big[ \sum_{\imath\in[N_{\mathbf 1}]}\psi(W_{\imath,\mathbf 1};\theta_0,\eta_0) \Big], \quad r\in[0,1].
\]
An application of the mean value expansion coordinate-wise gives
\[
f_k(1) = f_k(0) + f'_k(0) + f''_k(\widetilde r)/2,
\]
where $\widetilde r\in(0,1)$, $f_k(0) = 0$ under Assumption 5(i), and
\[
\|f'_k(0)\| = \Big\| \partial_\eta E_{P_n}\Big[ \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0)[\widehat\eta_k - \eta_0] \Big] \Big\| \leq \lambda_n
\]
under Assumption 5(iv). Moreover, under Assumption 6(iii), on the event $\mathcal E_n$ we have $\|f''_k(\widetilde r)\| \leq \sup_{r\in(0,1)}\|f''_k(r)\| \leq \lambda'_n$. This completes the proof of (D.2).

Step 4. Note that
\[
E_{P_n}\Big[ \Big\| \frac{\sqrt C}{\prod_i C_i}\sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\|^2 \Big]
= \frac{C}{(\prod_i C_i)^2}\, E_{P_n}\Big[ \sum_{m=1}^{d_\theta}\Big( \sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell}\sum_{\imath\in[N_j]}\psi_m(W_{\imath,j};\theta_0,\eta_0) \Big)^2 \Big]
\]
\[
= \frac{C}{(\prod_i C_i)^2} \sum_{e\in\varepsilon_\ell}\sum_{(j,j')\in A_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta}\sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]}\psi_m(W_{\imath,j};\theta_0,\eta_0)\psi_m(W_{\imath',j'};\theta_0,\eta_0) \Big]
+ \frac{C}{(\prod_i C_i)^2} \sum_{r=1}^{\ell-1}\sum_{e\in\varepsilon_r}\sum_{(j,j')\in A_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta}\sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]}\psi_m(W_{\imath,j};\theta_0,\eta_0)\psi_m(W_{\imath',j'};\theta_0,\eta_0) \Big]
+ \frac{C}{(\prod_i C_i)^2} \sum_{e\in\varepsilon_0}\sum_{(j,j')\in A_e} E_{P_n}\Big[ \sum_{m=1}^{d_\theta}\sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]}\psi_m(W_{\imath,j};\theta_0,\eta_0)\psi_m(W_{\imath',j'};\theta_0,\eta_0) \Big],
\]
where $A_e$ collects the pairs $(j,j')$ of cluster indices in $[C_1]\times\cdots\times[C_\ell]$ that agree exactly in the coordinates $i$ with $e_i = 1$. The last term is zero by independence and $E_{P_n}[\psi(W;\theta_0,\eta_0)] = 0$, and each remaining term is $O(1)$ because $|A_e| \lesssim (\prod_i C_i)^2/C$ for $e$ with at least one nonzero coordinate and
\[
E_{P_n}\Big[ \Big\| \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\|^2 \Big] \leq c_1^2.
\]
Hence (D.3) follows from Markov's inequality.

Step 5. Note that all singular values of $J_0$ are bounded from above by $c_1$ under Assumption 5(v), and all eigenvalues of $\Gamma$ are bounded from below by $c_0$ under Assumption 6(iv). Therefore, we have $\|\sigma^{-1}\| \leq c_1/\sqrt{c_0}$ and thus $\|\sigma^{-1}\| = O_{P_n}(1)$. This completes the proof of (D.4).

D.2 Proof of Theorem 4

Proof. Step 2 of the proof of Theorem 3 proves $\|\widehat J - J_0\| = O_P(C^{-1/2} + r_n)$, and Assumption 5(v) implies $\|J_0^{-1}\| \leq c_0^{-1}$.
Therefore, to prove the claim of the theorem, it suffices to show
\[
\Big\| \frac{1}{K^\ell}\sum_{k\in[K]^\ell} \frac{|I_{\underline k}|}{|I_k|^2} \sum_{i=1}^{\ell} \sum_{\substack{j,j'\in I_k \\ I_i(j)=I_i(j')}} \sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]} \psi(W_{\imath,j};\widetilde\theta,\widehat\eta_k)\psi(W_{\imath',j'};\widetilde\theta,\widehat\eta_k)'
- \sum_{i=1}^{\ell} \bar\mu_i\, E_P\Big[ \sum_{\imath=1}^{N_{\mathbf 1}}\sum_{\imath'=1}^{N_{2_i}} \psi(W_{\imath,\mathbf 1};\theta_0,\eta_0)\psi(W_{\imath',2_i};\theta_0,\eta_0)' \Big] \Big\| = O_P(\rho_n),
\]
where $2_i$ denotes the $\ell$-tuple with $2$ in every entry except $1$ in the $i$-th entry. Moreover, since $K$ and $d_\theta$ are constants and $\mu_i \to \bar\mu_i \leq 1$, it suffices to show that for each $k\in[K]^\ell$, $i\in[\ell]$, and $l,m\in[d_\theta]$, it holds that
\[
\Big| \frac{|I_{\underline k}|}{|I_k|^2} \sum_{\substack{j,j'\in I_k \\ I_i(j)=I_i(j')}} \sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]} \psi_l(W_{\imath,j};\widetilde\theta,\widehat\eta_k)\psi_m(W_{\imath',j'};\widetilde\theta,\widehat\eta_k)
- \mu_i\, E_P\Big[ \sum_{\imath=1}^{N_{\mathbf 1}}\sum_{\imath'=1}^{N_{2_i}} \psi_l(W_{\imath,\mathbf 1};\theta_0,\eta_0)\psi_m(W_{\imath',2_i};\theta_0,\eta_0) \Big] \Big| = O_P(\rho_n).
\]
Denote the left-hand side of the equation as $I_{k,lm}$. First, note that $|I_{\underline k}|/|I_{k_i}| = \mu_i$. We write $i'$ for the index such that $|I_{k_{i'}}| = |I_{\underline k}|$, so that $|I_{\underline k}|/|I_k|^2 = 1/\{(\prod_{i\neq i'}|I_{k_i}|)^2|I_{k_{i'}}|\}$, and apply the triangle inequality to get $I_{k,lm} \leq I_{k,lm,1} + I_{k,lm,2}$, where
\[
I_{k,lm,1} := \frac{|I_{\underline k}|}{|I_k|^2} \Big| \sum_{\substack{j,j'\in I_k \\ I_i(j)=I_i(j')}} \Big\{ \sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]} \psi_l(W_{\imath,j};\widetilde\theta,\widehat\eta_k)\psi_m(W_{\imath',j'};\widetilde\theta,\widehat\eta_k) - \sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]} \psi_l(W_{\imath,j};\theta_0,\eta_0)\psi_m(W_{\imath',j'};\theta_0,\eta_0) \Big\} \Big|,
\]
\[
I_{k,lm,2} := \Big| \frac{|I_{\underline k}|}{|I_k|^2} \sum_{\substack{j,j'\in I_k \\ I_i(j)=I_i(j')}} \sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]} \psi_l(W_{\imath,j};\theta_0,\eta_0)\psi_m(W_{\imath',j'};\theta_0,\eta_0) - \mu_i\, E_P\Big[ \sum_{\imath=1}^{N_{\mathbf 1}}\sum_{\imath'=1}^{N_{2_i}} \psi_l(W_{\imath,\mathbf 1};\theta_0,\eta_0)\psi_m(W_{\imath',2_i};\theta_0,\eta_0) \Big] \Big|.
\]
We first find a bound for $I_{k,lm,2}$.
Since $q > 4$, expanding the square in $E_P[I_{k,lm,2}^2]$ yields sums over quadruples $(j,j',j'',j''')\in I_k^4$ with $I_i(j)=I_i(j')$ and $I_i(j'')=I_i(j''')$, grouped according to the cluster indices shared between the two pairs: quadruples in which the pairs additionally share some cluster index $I_s(j)=I_s(j'')$ with $s\neq i$; quadruples with $I_i(j)=I_i(j')=I_i(j'')=I_i(j''')$; and quadruples sharing no further cluster index, which contribute zero by independence. Together with the centering terms, this gives
\[
E_P[I_{k,lm,2}^2] \lesssim \frac{1}{|I_{\underline k}|}\, E_P\Big[ \Big\| \sum_{\imath\in[N_{\mathbf 1}]} \psi(W_{\imath,\mathbf 1};\theta_0,\eta_0) \Big\|^4 \Big] + o(|I_{\underline k}|^{-1}) \lesssim \frac{c_1^4}{C} = O(C^{-1}).
\]
Now, to bound $I_{k,lm,1}$, we make use of the following identity from the proof of Theorem 3.2 in CCDDHNR (2018): for any numbers $a$, $b$, $\delta a$, $\delta b$ such that $|a|\vee|b| \leq c$ and $|\delta a|\vee|\delta b| \leq r$, it holds that
\[
|(a+\delta a)(b+\delta b) - ab| \leq 2r(c+r).
\]
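This elementary bound follows from $|(a+\delta a)(b+\delta b) - ab| = |a\,\delta b + b\,\delta a + \delta a\,\delta b| \leq 2cr + r^2 \leq 2r(c+r)$. A randomized sanity check of the inequality (a standalone sketch, not part of the proof):

```python
import random

random.seed(0)

for _ in range(10_000):
    c = random.uniform(0.0, 5.0)   # bound on |a| and |b|
    r = random.uniform(0.0, 5.0)   # bound on |da| and |db|
    a = random.uniform(-c, c)
    b = random.uniform(-c, c)
    da = random.uniform(-r, r)
    db = random.uniform(-r, r)
    lhs = abs((a + da) * (b + db) - a * b)
    # |a*db + b*da + da*db| <= c*r + c*r + r*r = r*(2c + r) <= 2r*(c + r)
    assert lhs <= r * (2 * c + r) + 1e-12
    assert lhs <= 2 * r * (c + r) + 1e-12
```

The tighter intermediate bound $r(2c+r)$ is also checked; either version suffices for the argument that follows.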
Denote $\psi_{\imath,j,h} := \psi_h(W_{\imath,j};\theta_0,\eta_0)$ and $\widehat\psi_{\imath,j,h} := \psi_h(W_{\imath,j};\widetilde\theta,\widehat\eta_k)$ for $h\in\{l,m\}$, and apply the above identity with $a = \sum_{\imath\in[N_j]}\psi_{\imath,j,l}$, $b = \sum_{\imath'\in[N_{j'}]}\psi_{\imath',j',m}$, $a+\delta a = \sum_{\imath\in[N_j]}\widehat\psi_{\imath,j,l}$, $b+\delta b = \sum_{\imath'\in[N_{j'}]}\widehat\psi_{\imath',j',m}$,
\[
r = \Big|\sum_{\imath\in[N_j]}\widehat\psi_{\imath,j,l} - \sum_{\imath\in[N_j]}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'\in[N_{j'}]}\widehat\psi_{\imath',j',m} - \sum_{\imath'\in[N_{j'}]}\psi_{\imath',j',m}\Big|
\quad\text{and}\quad
c = \Big|\sum_{\imath\in[N_j]}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'\in[N_{j'}]}\psi_{\imath',j',m}\Big|.
\]
Then, writing $\sum_{(j,j')}$ for the sum over $j,j'\in I_k$ with $I_i(j)=I_i(j')$,
\[
I_{k,lm,1} \leq \frac{|I_{\underline k}|}{|I_k|^2} \sum_{(j,j')} \Big| \sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]} \widehat\psi_{\imath,j,l}\widehat\psi_{\imath',j',m} - \sum_{\imath\in[N_j]}\sum_{\imath'\in[N_{j'}]} \psi_{\imath,j,l}\psi_{\imath',j',m} \Big|
\]
\[
\leq \frac{2|I_{\underline k}|}{|I_k|^2} \sum_{(j,j')} \Big( \Big|\sum_{\imath}\widehat\psi_{\imath,j,l}-\sum_{\imath}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'}\widehat\psi_{\imath',j',m}-\sum_{\imath'}\psi_{\imath',j',m}\Big| \Big)
\Big( \Big|\sum_{\imath}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'}\psi_{\imath',j',m}\Big| + \Big|\sum_{\imath}\widehat\psi_{\imath,j,l}-\sum_{\imath}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'}\widehat\psi_{\imath',j',m}-\sum_{\imath'}\psi_{\imath',j',m}\Big| \Big)
\]
\[
\leq 2\Big( \frac{|I_{\underline k}|}{|I_k|^2} \sum_{(j,j')} \Big( \Big|\sum_{\imath}\widehat\psi_{\imath,j,l}-\sum_{\imath}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'}\widehat\psi_{\imath',j',m}-\sum_{\imath'}\psi_{\imath',j',m}\Big| \Big)^2 \Big)^{1/2}
\times \Big\{ \Big( \frac{|I_{\underline k}|}{|I_k|^2} \sum_{(j,j')} \Big( \Big|\sum_{\imath}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'}\psi_{\imath',j',m}\Big| \Big)^2 \Big)^{1/2}
+ \Big( \frac{|I_{\underline k}|}{|I_k|^2} \sum_{(j,j')} \Big( \Big|\sum_{\imath}\widehat\psi_{\imath,j,l}-\sum_{\imath}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'}\widehat\psi_{\imath',j',m}-\sum_{\imath'}\psi_{\imath',j',m}\Big| \Big)^2 \Big)^{1/2} \Big\},
\]
where the last inequality follows from the Cauchy–Schwarz and Minkowski inequalities. Notice that
\[
\sum_{(j,j')} \Big( \Big|\sum_{\imath}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'}\psi_{\imath',j',m}\Big| \Big)^2 \leq 2\max_{1\leq i\leq\ell}\{|I_{k_i}|\} \sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell} \Big\| \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\|^2,
\]
\[
\sum_{(j,j')} \Big( \Big|\sum_{\imath}\widehat\psi_{\imath,j,l}-\sum_{\imath}\psi_{\imath,j,l}\Big| \vee \Big|\sum_{\imath'}\widehat\psi_{\imath',j',m}-\sum_{\imath'}\psi_{\imath',j',m}\Big| \Big)^2 \leq 2\max_{1\leq i\leq\ell}\{|I_{k_i}|\} \sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell} \Big\| \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\widetilde\theta,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\|^2.
\]
Thus, the above bound for $I_{k,lm,1}$ implies that
\[
I_{k,lm,1} \lesssim R_n^{1/2} \times \Big( \frac{1}{|I_k|} \sum_{j\in I_k} \Big\| \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\|^2 + R_n \Big)^{1/2},
\quad\text{where}\quad
R_n := \frac{1}{|I_k|} \sum_{j\in I_k} \Big\| \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\widetilde\theta,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0) \Big\|^2.
\]
Notice that
\[
\frac{1}{|I_k|}\sum_{j\in I_k}\Big\|\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0)\Big\|^2 = O_P(1),
\]
which is implied by Markov's inequality and the calculation
\[
E_P\Big[ \frac{1}{|I_k|}\sum_{j\in I_k}\Big\|\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0)\Big\|^2 \Big] = E_P\Big[ \Big\|\sum_{\imath=1}^{N_{\mathbf 1}}\psi(W_{\imath,\mathbf 1};\theta_0,\eta_0)\Big\|^2 \Big] \leq c_1^2
\]
under Assumptions 4 and 6(ii). Finally, to bound $R_n$, using Assumption 5(ii),
\[
R_n \lesssim \frac{1}{|I_k|}\sum_{j\in I_k}\Big\|\sum_{\imath\in[N_j]}\psi^a(W_{\imath,j};\widehat\eta_k)(\widetilde\theta-\theta_0)\Big\|^2 + \frac{1}{|I_k|}\sum_{j\in I_k}\Big\|\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0)\Big\|^2.
\]
The first term on the RHS is bounded by
\[
\Big( \frac{1}{|I_k|}\sum_{j\in I_k}\Big\|\sum_{\imath\in[N_j]}\psi^a(W_{\imath,j};\widehat\eta_k)\Big\|^2 \Big) \times \|\widetilde\theta-\theta_0\|^2 = O_P(1)\times O_P(C^{-1}) = O_P(C^{-1})
\]
due to Assumption 6(ii), Markov's inequality, and Theorem 3.
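The target of this proof is a multiway cluster-robust variance. In the two-way case, its sample analogue has the familiar inclusion-exclusion form of Cameron, Gelbach, and Miller (2011): add the row-cluster and column-cluster "meat" terms and subtract the cell-level intersection term. A minimal sketch for the variance of a simple two-way sample mean (illustrative only; `twoway_var_of_mean` is a hypothetical helper, not the paper's estimator, which additionally involves the scores $\psi$ and multiway cross fitting):

```python
def twoway_var_of_mean(x):
    """Two-way cluster-robust variance estimate for the mean of x[i][j],
    clustering on rows and columns: V_rows + V_cols - V_cells."""
    n = sum(len(row) for row in x)
    mean = sum(sum(row) for row in x) / n
    u = [[v - mean for v in row] for row in x]          # residuals from the mean
    v_rows = sum(sum(row) ** 2 for row in u)            # row-cluster meat
    ncols = len(x[0])
    v_cols = sum(sum(row[j] for row in u) ** 2 for j in range(ncols))  # column-cluster meat
    v_cells = sum(v ** 2 for row in u for v in row)     # intersection (cell-level) meat
    return (v_rows + v_cols - v_cells) / n ** 2

print(twoway_var_of_mean([[1.0, 2.0], [3.0, 4.0]]))  # 0.3125
```

For this $2\times 2$ example the three meat terms are $8$, $2$, and $5$, giving $(8+2-5)/16 = 5/16$; the subtraction prevents the doubly shared cell-level variation from being counted twice.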
Furthermore, given that $(W_{\imath,j})_{j\in I^c_k}$ is such that $\widehat\eta_k \in \mathcal T_n$,
\[
E_P\Big[ \Big\|\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\widehat\eta_k) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0)\Big\|^2 \,\Big|\, I^c_k \Big]
\leq \sup_{\eta\in\mathcal T_n} E_P\Big[ \Big\|\sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta) - \sum_{\imath\in[N_j]}\psi(W_{\imath,j};\theta_0,\eta_0)\Big\|^2 \,\Big|\, I^c_k \Big] \leq (r'_n)^2
\]
due to Assumptions 4 and 6(iii). Since the event $\widehat\eta_k\in\mathcal T_n$ occurs with probability $1-o(1)$, we have $R_n = O_P(C^{-1} + (r'_n)^2)$. Thus we conclude that
\[
I_{k,lm,1} = O_P(C^{-1/2} + r'_n).
\]
This completes the proof.

E Additional Lemma

In this section, we establish a multiway generalization of Lemma 1. For any $r = 1,\dots,\ell$, we let
\[
\mathcal I_r(C) := \{ c = j \odot e : e \in \varepsilon_r,\ \mathbf 1 \leq j \leq C \}
\quad\text{and}\quad
\varepsilon_m := \Big\{ e \in \{0,1\}^\ell : \sum_{i=1}^{\ell} e_i = m \Big\},
\]
with $\odot$ the Hadamard product on $\mathbb R^\ell$ and the inequality $\mathbf 1 \leq j \leq C$ read coordinate-wise. For each $n\in\mathbb N$, let $(N^n_j, (W^n_{\imath,j})_{1\leq\imath\leq N^n_j})_{j\geq\mathbf 1}$ be a set of random variables. For any $f : \mathrm{supp}(W^n) \to \mathbb R^d$ for a fixed $d\in\mathbb N$, let us define the multiway empirical process
\[
G_n f := \frac{\sqrt C}{\prod_i C_i} \sum_{j_1=1}^{C_1}\cdots\sum_{j_\ell=1}^{C_\ell} \Big\{ \sum_{\imath\in[N^n_j]} f(W^n_{\imath,j}) - E_P\Big[ \sum_{\imath\in[N^n_{\mathbf 1}]} f(W^n_{\imath,\mathbf 1}) \Big] \Big\}.
\]

Lemma 3 (Independentization via Hájek Projections). For each $n\in\mathbb N$, suppose that $(N^n_j, (W^n_{\imath,j})_{1\leq\imath\leq N^n_j})_{j\geq\mathbf 1}$ satisfies Assumption 4. Let $\mathcal F_n$, $|\mathcal F_n| = d$, be a family of functions $f : \mathrm{supp}(W^n)\to\mathbb R$ that satisfies $E[(\sum_{\imath\in[N^n_{\mathbf 1}]} f(W^n_{\imath,\mathbf 1}))^2] < K < \infty$ for some $K$ independent of $n$. In addition, assume that $C\to\infty$ and, for every $e\in\varepsilon_1$, $C/C_i \to \bar\mu_i \geq 0$, where $i$ is the nonzero coordinate of $e$.
Then there exists a family of mutually independent standard uniform random variables $(U_c)_c$ such that $H_n f$, the Hájek projection of $G_n f$ onto the set of statistics of the form $\sum_{c\in\mathcal I_1(C)} g_c(U_c)$ with each $g_c(U_c)$ square integrable, satisfies
\[
H_n f = \sum_{c\in\mathcal I_1(C)} \frac{\sqrt C}{\prod_{i:\,c_i\neq 0} C_i} \Big\{ E\Big[ \sum_{\imath=1}^{N^n_{c\vee\mathbf 1}} f(W^n_{\imath,c\vee\mathbf 1}) \,\Big|\, U_c \Big] - E\Big[ \sum_{\imath\in[N^n_{\mathbf 1}]} f(W^n_{\imath,\mathbf 1}) \Big] \Big\}. \tag{E.1}
\]
In addition, it holds uniformly over $\mathcal F_n$ that
\[
V(G_n f) = V(H_n f) + O(C^{-1}) = \sum_{e\in\varepsilon_1} \bar\mu_i\, \mathrm{Cov}\Big( \sum_{\imath=1}^{N^n_{\mathbf 1}} f(W^n_{\imath,\mathbf 1}),\ \sum_{\imath=1}^{N^n_{2-e}} f(W^n_{\imath,2-e}) \Big) + O(C^{-1}).
\]

Proof. Throughout the proof, we drop the superscript $n$ for simplicity. Under Assumptions 4(i) and (ii), for each $n$ one can apply Lemma 7.35 of Kallenberg (2006) and obtain a measurable function $\tau_n$ such that
\[
(N_j, (W_{\imath,j})_{1\leq\imath\leq N_j})_{j\geq\mathbf 1} = \big( \tau_n\big( (U_{j\odot e})_{e\in\cup_{r=1}^{\ell}\varepsilon_r} \big) \big)_{j\geq\mathbf 1}, \tag{E.2}
\]
where $(U_c)_c$ denotes a family of mutually independent uniform random variables on $[0,1]$. The Hájek projection $H_n f$ is characterized by
\[
E\Big[ (G_n f - H_n f) \times \sum_{c\in\mathcal I_1(C)} g_c(U_c) \Big] = 0 \quad\text{for any } (g_c)_{c\in\mathcal I_1(C)} \in \big( L^2([0,1]) \big)^{|\mathcal I_1(C)|}.
\]
As a result, we have $E[G_n f \mid U_c] = E[H_n f \mid U_c]$ for any $c\in\mathcal I_1(C)$. Because the range of the projection is a closed subspace of square-integrable random variables and the $(U_c)_{c\in\mathcal I_1(C)}$ are mutually independent,
\[
H_n f = \sum_{c\in\mathcal I_1(C)} E[H_n f \mid U_c] = \sum_{c\in\mathcal I_1(C)} E[G_n f \mid U_c].
\]
Note that for any $c\in\mathcal I_1(C)$, $c\wedge\mathbf 1$ is the unique element $e$ of $\varepsilon_1$ such that $c = j\odot e$ for some $j$ (note that $j$ is not unique). Moreover, for any $c\in\mathcal I_1(C)$, independence between the $U$'s ensures that $\sum_{\imath\in[N_j]} f(W_{\imath,j}) \perp U_c$ if $j\odot e \neq c$, where $e := c\wedge\mathbf 1$. This implies
\[
E[G_n f \mid U_c] = \frac{\sqrt C}{\prod_i C_i} \sum_{\mathbf 1\leq j\leq C} \Big\{ E\Big[ \sum_{\imath\in[N_j]} f(W_{\imath,j}) \,\Big|\, U_c \Big] - E\Big[ \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \Big] \Big\}
= \frac{\sqrt C}{\prod_i C_i} \sum_{\mathbf 1\leq j\leq C} \mathbf 1\{j\odot e = c\} \Big\{ E\Big[ \sum_{\imath\in[N_j]} f(W_{\imath,j}) \,\Big|\, U_c \Big] - E\Big[ \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \Big] \Big\}.
\]
The representation (E.2) of $(N_j, (W_{\imath,j})_{1\leq\imath\leq N_j})_{j\geq\mathbf 1}$ in terms of the $U$'s implies that
\[
E\Big[ \sum_{\imath=1}^{N_j} f(W_{\imath,j}) - E\sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \,\Big|\, U_c \Big] = E\Big[ \sum_{\imath=1}^{N_{c\vee\mathbf 1}} f(W_{\imath,c\vee\mathbf 1}) - E\sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \,\Big|\, U_c \Big]
\]
for any $j$ such that $j\odot e = c$. Moreover, since the number of such $j$ is $\prod_{i:\,c_i=0} C_i$,
\[
E[G_n f \mid U_c] = \frac{\sqrt C\,\prod_{i:\,c_i=0} C_i}{\prod_i C_i} \Big\{ E\Big[ \sum_{\imath=1}^{N_{c\vee\mathbf 1}} f(W_{\imath,c\vee\mathbf 1}) \,\Big|\, U_c \Big] - E\sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \Big\}
= \frac{\sqrt C}{\prod_{i:\,c_i\neq 0} C_i} \Big\{ E\Big[ \sum_{\imath=1}^{N_{c\vee\mathbf 1}} f(W_{\imath,c\vee\mathbf 1}) \,\Big|\, U_c \Big] - E\sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \Big\}.
\]
It follows that
\[
H_n f = \sum_{c\in\mathcal I_1(C)} \frac{\sqrt C}{\prod_{i:\,c_i\neq 0} C_i} \Big\{ E\Big[ \sum_{\imath=1}^{N_{c\vee\mathbf 1}} f(W_{\imath,c\vee\mathbf 1}) \,\Big|\, U_c \Big] - E\sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \Big\}.
\]
This shows the first claim of the lemma. Since $\mathcal F_n$ is a finite family, we are left to prove that for each $f\in\mathcal F_n$,
\[
V(G_n f) = V(H_n f) + O(C^{-1}) = \sum_{i=1}^{\ell} \bar\mu_i\, \mathrm{Cov}\Big( \sum_{\imath=1}^{N_{\mathbf 1}} f(W_{\imath,\mathbf 1}),\ \sum_{\imath=1}^{N_{2_i}} f(W_{\imath,2_i}) \Big) + O(C^{-1}),
\]
where $2_i$ denotes the $\ell$-tuple with $2$ in each entry except for $1$ in the $i$-th entry. Note that
\[
V(H_n f) = \sum_{e\in\varepsilon_1} \frac{C}{\prod_{i:\,e_i=1} C_i}\, V\Big( E\Big[ \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \,\Big|\, U_e \Big] \Big). \tag{E.3}
\]
To conclude, it suffices to show that for each $e\in\varepsilon_1$,
\[
V\Big( E\Big[ \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \,\Big|\, U_e \Big] \Big) = \mathrm{Cov}\Big( \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}),\ \sum_{\imath=1}^{N_{2-e}} f(W_{\imath,2-e}) \Big).
\]
As $(N_j, (W_{\imath,j})_{1\leq\imath\leq N_j})_{j\geq\mathbf 1} = (\tau((U_{j\odot e'})_{e'\in\cup_{r=1}^{\ell}\varepsilon_r}))_{j\geq\mathbf 1}$ with i.i.d. $U$'s, we have
\[
E\Big[ \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \,\Big|\, U_e \Big] = E\Big[ \sum_{\imath\in[N_j]} f(W_{\imath,j}) \,\Big|\, U_e \Big]
\]
for any $j$ such that $j\odot e = \mathbf 1\odot e = e$. Because $(2-e)\odot e = e$, we have
\[
V\Big( E\Big[ \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \,\Big|\, U_e \Big] \Big) = \mathrm{Cov}\Big( E\Big[ \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \,\Big|\, U_e \Big],\ E\Big[ \sum_{\imath=1}^{N_{2-e}} f(W_{\imath,2-e}) \,\Big|\, U_e \Big] \Big).
\]
Note that for any $e\in\varepsilon_1$, $2-e = 2_i$, where $i$ is the nonzero coordinate of $e$.
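The projection formula (E.1) can be made concrete in the simplest additive special case of the representation (E.2): with $\ell = 2$, $W_{ij} = \alpha_i + \gamma_j + \varepsilon_{ij}$ for mean-zero latent components and $f$ the identity, the conditional expectations in (E.1) are just the cluster components, and $G_n f - H_n f$ collapses exactly to the scaled sum of the idiosyncratic terms. A toy numerical check under this assumed additive model (not the general lemma):

```python
import math
import random

random.seed(42)

C1, C2 = 7, 5
C = min(C1, C2)

# Additive-array special case of (E.2): W[i][j] = alpha[i] + gamma[j] + eps[i][j],
# with all latent components mean-zero, so the centering term E[sum f] is zero.
alpha = [random.gauss(0, 1) for _ in range(C1)]
gamma = [random.gauss(0, 1) for _ in range(C2)]
eps = [[random.gauss(0, 1) for _ in range(C2)] for _ in range(C1)]

scale = math.sqrt(C) / (C1 * C2)
G = scale * sum(alpha[i] + gamma[j] + eps[i][j]
                for i in range(C1) for j in range(C2))

# Hajek projection (E.1): here E[W_ij | U_(i,0)] = alpha_i and
# E[W_ij | U_(0,j)] = gamma_j, with coefficients sqrt(C)/C1 and sqrt(C)/C2.
H = (math.sqrt(C) / C1) * sum(alpha) + (math.sqrt(C) / C2) * sum(gamma)

# The remainder G - H is exactly the scaled sum of idiosyncratic terms.
residual = scale * sum(eps[i][j] for i in range(C1) for j in range(C2))
assert abs(G - (H + residual)) < 1e-9
```

In this special case the $O(C^{-1})$ variance discrepancy between $G_n f$ and $H_n f$ is driven entirely by the `residual` term, whose variance is of order $C/(C_1 C_2)$.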
The independence of the $U$'s ensures
\[
(U_{\mathbf 1\odot e'})_{e'\in\cup_{r=1}^{\ell}\varepsilon_r\setminus\{e\}} \perp (U_{(2-e)\odot e'})_{e'\in\cup_{r=1}^{\ell}\varepsilon_r\setminus\{e\}} \,\Big|\, U_e,
\]
and thus $\sum_{\imath=1}^{N_{\mathbf 1}} f(W_{\imath,\mathbf 1}) \perp \sum_{\imath=1}^{N_{2-e}} f(W_{\imath,2-e}) \mid U_e$. Hence, for $e\in\varepsilon_1$,
\[
E\Big[ \mathrm{Cov}\Big( \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}),\ \sum_{\imath=1}^{N_{2-e}} f(W_{\imath,2-e}) \,\Big|\, U_e \Big) \Big] = 0.
\]
By the law of total covariance, we obtain
\[
V\Big( E\Big[ \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}) \,\Big|\, U_e \Big] \Big) = \mathrm{Cov}\Big( \sum_{\imath\in[N_{\mathbf 1}]} f(W_{\imath,\mathbf 1}),\ \sum_{\imath=1}^{N_{2-e}} f(W_{\imath,2-e}) \Big).
\]
This establishes the second claim of the lemma.

References

Ai, C. and X. Chen (2003): "Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions," Econometrica, 71, 1795–1843.

——— (2007): "Estimation of Possibly Misspecified Semiparametric Conditional Moment Restriction Models with Different Conditioning Variables," Journal of Econometrics, 141, 5–43.

Athey, S. and G. W. Imbens (2019): "Machine Learning Methods That Economists Should Know About," Annual Review of Economics, 11.

Athey, S. and S. Wager (2019): "Estimating Treatment Effects with Causal Forests: An Application," arXiv preprint arXiv:1902.07409.

Belloni, A., V. Chernozhukov, D. Chetverikov, and Y. Wei (2018): "Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework," The Annals of Statistics, 46, 3643–3675.

Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2017): "Program evaluation and causal inference with high-dimensional data," Econometrica, 85, 233–298.

Belloni, A., V. Chernozhukov, and C. Hansen (2014a): "High-dimensional methods and inference on structural and treatment effects," Journal of Economic Perspectives, 28, 29–50.

——— (2014b): "Inference on treatment effects after selection among high-dimensional controls," The Review of Economic Studies, 81, 608–650.

Belloni, A., V. Chernozhukov, C. Hansen, and D.
Kozbur (2016): "Inference in high-dimensional panel models with an application to gun control," Journal of Business & Economic Statistics, 34, 590–605.

Belloni, A., V. Chernozhukov, and K. Kato (2015): "Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems," Biometrika, 102, 77–94.

Berry, S., J. Levinsohn, and A. Pakes (1995): "Automobile prices in market equilibrium," Econometrica: Journal of the Econometric Society, 841–890.

Berry, S. T. (1994): "Estimating discrete-choice models of product differentiation," The RAND Journal of Economics, 242–262.

Cameron, A. C. and D. L. Miller (2015): "A practitioner's guide to cluster-robust inference," Journal of Human Resources, 50, 317–372.

Cameron, C. A., J. B. Gelbach, and D. L. Miller (2011): "Robust Inference With Multiway Clustering," Journal of Business and Economic Statistics, 29, 238–249.

Chen, X., O. Linton, and I. Van Keilegom (2003): "Estimation of Semiparametric Models when the Criterion Function Is Not Smooth," Econometrica, 71, 1591–1608.

Chen, X. and D. Pouzo (2015): "Sieve Wald and QLR Inferences on Semi/Nonparametric Conditional Moment Models," Econometrica, 83, 1013–1079.

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Double/debiased machine learning for treatment and structural parameters," Econometrics Journal, 21, C1–C68.

Chiang, H. and Y. Sasaki (2019): "Lasso under Multi-way Clustering: Estimation and Post-selection Inference," arXiv:1905.02107.

Davezies, L., X. D'Haultfoeuille, and Y. Guyonvarch (2018): "Asymptotic Results under Multiway Clustering," arXiv:1807.07925.

——— (2019): "Empirical Process Results for Exchangeable Arrays," arXiv preprint arXiv:1906.11293.

Hansen, C. and Y. Liao (2019): "The factor-lasso and k-step bootstrap approach for inference in high-dimensional economic applications," Econometric Theory, 35, 465–509.

Kallenberg, O.
(2006): Probabilistic symmetries and invariance principles, Springer Science & Business Media.

Kock, A. B. (2016): "Oracle Inequalities, Variable Selection and Uniform Inference in High-Dimensional Correlated Random Effects Panel Data Models," Journal of Econometrics, 195, 71–85.

Kock, A. B. and H. Tang (2019): "Uniform Inference in High-Dimensional Dynamic Panel Data Models with Approximately Sparse Fixed Effects," Econometric Theory, 35, 295–359.

Lee, S. and S. Ng (2019): "An Econometric View of Algorithmic Subsampling," arXiv preprint arXiv:1907.01954.

Lu, Z., X. Shi, and J. Tao (2019): "Semi-Nonparametric Estimation of Random Coefficient Logit Model for Aggregate Demand," Working Paper.

MacKinnon, J. G. (2019): "How cluster-robust inference is changing applied econometrics," Canadian Journal of Economics/Revue canadienne d'économique.

MacKinnon, J. G., M. O. Nielsen, and M. D. Webb (2019): "Wild Bootstrap and Asymptotic Inference with Multiway Clustering," Queen's Economics Department Working Paper, No. 1415.

Menzel, K. (2017): "Bootstrap with Clustering in Two or More Dimensions," arXiv:1703.03043.

Mullainathan, S. and J. Spiess (2017): "Machine learning: an applied econometric approach," Journal of Economic Perspectives, 31, 87–106.

Okui, R., D. S. Small, Z. Tan, and J. M. Robins (2012): "Doubly robust instrumental variable regression," Statistica Sinica, 173–205.

Semenova, V., M. Goldman, V. Chernozhukov, and M. Taddy (2018): "Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels," arXiv preprint arXiv:1608.00033.

van der Vaart, A.
(1998): Asymptotic Statistics, Cambridge University Press.

C1  C2  M   dim(X)  K (K^2)  Machine Learning   Bias     SD     RMSE   Cover
25  25  25  100     2 (4)    Ridge              0.069    0.074  0.102  0.835
                             Elastic Net        0.010    0.079  0.080  0.963
                             Lasso              0.005    0.080  0.080  0.965
50  50  50  100     2 (4)    Ridge              0.014    0.047  0.049  0.940
                             Elastic Net       -0.002    0.048  0.048  0.956
                             Lasso             -0.001    0.049  0.049  0.955
25  25  25  200     2 (4)    Ridge              0.190    0.053  0.197  0.118
                             Elastic Net        0.016    0.077  0.079  0.969
                             Lasso              0.006    0.080  0.080  0.968
50  50  50  200     2 (4)    Ridge              0.037    0.046  0.058  0.876
                             Elastic Net       -0.000    0.048  0.048  0.960
                             Lasso             -0.002    0.048  0.048  0.962
25  25  25  100     3 (9)    Ridge              0.042    0.074  0.085  0.962
                             Elastic Net        0.004    0.074  0.074  0.993
                             Lasso              0.002    0.075  0.075  0.992
50  50  50  100     3 (9)    Ridge              0.007    0.048  0.049  0.962
                             Elastic Net       -0.001    0.047  0.047  0.972
                             Lasso             -0.001    0.048  0.048  0.963
25  25  25  200     3 (9)    Ridge              0.081    0.067  0.105  0.896
                             Elastic Net        0.005    0.073  0.073  0.994
                             Lasso              0.003    0.076  0.077  0.992
50  50  50  200     3 (9)    Ridge              0.018    0.047  0.050  0.944
                             Elastic Net       -0.002    0.048  0.048  0.968
                             Lasso             -0.003    0.049  0.049  0.968

Table 1: Simulation results based on 5,000 Monte Carlo iterations. Results are displayed for each of the three machine learning methods, including the ridge, elastic net, and lasso. Reported statistics are the bias (Bias), standard deviation (SD), root mean square error (RMSE), and coverage frequency for the nominal probability of 95% (Cover).

Instrument (Z_ij)                    No clustering   1-Way: Product   1-Way: Market   2-Way: Product × Market
Horsepower/weight of other products  -5.763          -5.719           -5.815          -5.659
                                     (0.460)         (0.640)          (1.024)         (1.211)
Miles/dollar of other products       -6.121          -6.056           -6.191          -6.121
                                     (0.607)         (0.865)          (1.491)         (3.963)
Size of other products               -5.684          -5.641           -5.727          -5.593
                                     (0.413)         (0.565)          (0.892)         (1.015)

Table 2: Estimates and standard errors of the coefficient θ₀ of log price in the demand model. The first column indicates the instrumental variable. The second column shows the results of the DML by lasso not accounting for clustering with the number K = 4 of folds for cross fitting.
The third and fourth columns show the results of the 1-way cluster-robust DML by lasso clustered at product and market, respectively, with the number K = 4 of folds for cross fitting. The fifth column shows the results of the 2-way cluster-robust DML by lasso with the number K²