Doubly Distributed Supervised Learning and Inference with High-Dimensional Correlated Outcomes
Emily C. Hector and Peter X.-K. Song*
Department of Biostatistics, University of Michigan
Abstract
This paper presents a unified framework for supervised learning and inference procedures using the divide-and-conquer approach for high-dimensional correlated outcomes. We propose a general class of estimators that can be implemented in a fully distributed and parallelized computational scheme. Modelling, computational and theoretical challenges related to high-dimensional correlated outcomes are overcome by dividing data at both outcome and subject levels, estimating the parameter of interest from blocks of data using a broad class of supervised learning procedures, and combining block estimators in a closed-form meta-estimator that is asymptotically equivalent to the estimator obtained from Hansen (1982)'s generalized method of moments (GMM) but does not require the entire data to be reloaded on a common server. We provide rigorous theoretical justification for the use of distributed estimators with correlated outcomes by studying the asymptotic behaviour of the combined estimator with both fixed and diverging numbers of data divisions. Simulations illustrate the finite sample performance of the proposed method, and we provide an R package for ease of implementation.
Keywords: Divide-and-conquer, Generalized method of moments, Estimating functions, Parallel computing, Scalable computing

* We would like to acknowledge support for this project from the National Science Foundation (NSF DMS 1513595) and the National Institutes of Health (NIH R01ES024732).

1 INTRODUCTION
Although the divide-and-conquer paradigm has been widely used in statistics and computer science, its application with correlated data has been little investigated in the literature. We provide a theoretical justification, with theoretical guarantees, for divide-and-conquer methods with correlated data through a general unified estimating function theory framework. In particular, in this paper we focus on the large sample properties of a class of distributed and integrated estimators for supervised learning and inference with high-dimensional correlated outcomes. We consider $N$ independent observations $\{y_i, X_i\}_{i=1}^N$ where both the sample size $N$ and the dimension $M$ of the response vector $y_i$ may be so big that a direct analysis of the data using conventional methodology is computationally intensive, or even prohibitive. Such data may arise, for example, from imaging measurements of brain activity or from genomic data. Denote by $f(Y_i; X_i, \theta, \Gamma_i)$ the $M$-variate joint parametric distribution of $Y_i$ conditioned on $X_i$, where $\theta$ is the parameter of interest and $\Gamma_i$ contains parameters, such as for high-order dependencies, that may be difficult to model or handle computationally.

Statistical inference with big data can be extremely challenging due to the high volume and high variety of these data, as noted recently by Secchi (2018). In the statistics literature, methodological efforts to date have primarily focused on high-dimensional covariates (i.e. high-dimensional $X_i$) with univariate responses (corresponding to $M = 1$); see Johnstone and Titterington (2009) for an overview of the difficulties and methods in linear regression, and the citations therein for references to the extensive publications in this field. By contrast, little work has focused on high-dimensional correlated outcomes (corresponding to large $M$), which pose an entirely new and different set of methodological challenges stemming from a high-dimensional likelihood. The divide-and-combine paradigm holds promise in overcoming these challenges; see Mackey et al. (2011) and Zhang et al. (2015b) for early examples of the power of divide-and-combine algorithms. Some recent divide-and-combine methods for independent outcomes can be found in Singh et al. (2005), Lin and Zeng (2010), Lin and Xi (2011), Chen and Xie (2014), and Liu et al. (2015), among others.

More recently, Hector and Song (2019) proposed a Distributed and Integrated Method of Moments (DIMM), a divide-and-combine strategy for supervised learning and inference in a regression setting with high-dimensional correlated outcomes $Y$. DIMM splits the $M$ elements of $Y$ into blocks of low-dimensional response subvectors, analyzes these blocks in a distributed and parallelized computational scheme using pairwise composite likelihood (CL), and combines block-specific results using a closed-form meta-estimator in a similar spirit to Hansen (1982)'s seminal generalized method of moments (GMM). DIMM overcomes computational challenges associated with high-dimensional outcomes by running block analyses in parallel and combining block-specific results via a computationally and statistically efficient closed-form meta-estimator. DIMM is easily implemented using MapReduce in the Hadoop framework (Khezr and Navimipour (2017)), where blocks of data are loaded only once and in parallel.
DIMM presents a useful and natural extension of the classical GMM framework, which easily accounts for inter-block dependencies. DIMM also improves on classical meta-estimation, where results from blocks are routinely assumed to be independent. DIMM is still challenged, however, when estimating a homogeneous parameter in the presence of heterogeneous parameters. Additionally, it is also challenged computationally when the sample size $N$ is large: the strategy of dividing high-dimensional vectors of correlated outcomes into blocks is insufficient to address the excessive computational demand, since the sample size remains large in the block analyses. Thus, another division at the subject level is inevitable to mitigate the computational burden arising from matrix inversions and iterative calculations in the block analyses.

This paper proposes a new doubly divided procedure to learn and perform inference for a homogeneous parameter of interest in the presence of heterogeneous parameters with a general class of supervised learning procedures. The additional division at the subject level further speeds up computations in comparison to DIMM and results in a double division of the data, visualized in Table 1: a division of the response $Y$, and a random division of subjects into independent subject groups, resulting in blocks of data with a smaller sample of low-dimensional response subvectors. We consider a general class of supervised learning procedures to analyze these blocks separately and in parallel. Then we establish a GMM-type combination procedure that yields a meta-estimator of the parameter of interest. This proposed estimator is more general than the DIMM estimator in Hector and Song (2019), and thus appealing in many practical settings where analyzing data with both large $M$ and $N$ is challenging. We achieve a doubly divided learning and inference procedure implemented in a distributed and parallelized computational scheme. The proposed class of supervised learning procedures is very general, including many important estimation methods as special cases, such as Fisher's maximum likelihood, Wedderburn (1974)'s quasi-likelihood, Liang and Zeger (1986)'s generalized estimating equations, and Huber (1964)'s M-estimation for robust inference, with possible extensions to semi-parametric and non-parametric models.

The proposed Doubly Distributed and Integrated Method of Moments (DDIMM) not only provides a unified framework for various supervised learning procedures of parameters with heterogeneity under the divide-and-combine paradigm, but also provides key theoretical guarantees for statistical inference, such as consistency and asymptotic normality, while offering significant computational gains when the response dimension $M$ and sample size $N$ are large. These are useful and innovative contributions to the arsenal of tools for high-dimensional correlated data analysis, and to the collection of divide-and-combine algorithms, which have so far concentrated on independently sampled data.
Block        Group 1                                   ...   Group K
             Subject 1       ...  Subject $n_1$        ...   Subject 1       ...  Subject $n_K$
1            $y_{11,11}$     ...  $y_{n_1 1,11}$       ...   $y_{11,1K}$     ...  $y_{n_K 1,1K}$
  ...        ...                                       ...   ...
  $m_1$      $y_{1m_1,11}$   ...  $y_{n_1 m_1,11}$     ...   $y_{1m_1,1K}$   ...  $y_{n_K m_1,1K}$
...          ...                                       ...   ...
J            $y_{11,J1}$     ...  $y_{n_1 1,J1}$       ...   $y_{11,JK}$     ...  $y_{n_K 1,JK}$
  ...        ...                                       ...   ...
  $m_J$      $y_{1m_J,J1}$   ...  $y_{n_1 m_J,J1}$     ...   $y_{1m_J,JK}$   ...  $y_{n_K m_J,JK}$

Table 1: Double division of outcome data on both the dimension of responses (into blocks) and sample size (into groups).

In this paper, we focus on the theoretical aspects of doubly distributed learning and inference, including a goodness-of-fit test based on a $\chi^2$ statistic. We also study consistency and asymptotic normality of the proposed estimator as the number of data divisions diverges. This includes theoretical justifications for distributed inference when the dimension of the response and the number of response divisions diverge, which allows the analysis of highly dense outcome data.

The rest of the paper is organized as follows. Section 2 describes the DDIMM, with examples introduced in Section 3. Section 4 discusses large sample properties of the proposed DDIMM. Section 5 presents the main contribution of the paper, a closed-form meta-estimator and its implementation in a parallel and scalable computational scheme. Section 6 illustrates the DDIMM's finite sample performance with simulations. Section 7 concludes with a discussion. Additional proofs and simulation results are deferred to the Appendices and Supplemental Material. An R package is available in the Supplemental Material.

We begin with some notation. Let $\|\cdot\|_2$ be the $\ell_2$-norm for a $D$-dimensional vector $a$ and a $D_1 \times D_2$-dimensional matrix $A$ defined by, respectively:
\[ \|a\|_2 = \Big(\sum_{d=1}^{D} a_d^2\Big)^{1/2} \text{ for } a = [a_d]_{d=1}^{D} \in \mathbb{R}^{D}, \qquad \|A\|_2 = \Big(\sum_{d_1=1}^{D_1}\sum_{d_2=1}^{D_2} A_{d_1 d_2}^2\Big)^{1/2} \text{ for } A = [A_{d_1 d_2}]_{d_1,d_2=1}^{D_1,D_2} \in \mathbb{R}^{D_1 \times D_2}. \]
We define the stacking operator $S(\cdot)$ for matrices $\{A_{jk}\}_{j=1,k=1}^{J,K}$, $A_{jk} \in \mathbb{R}^{D_{jk} \times D}$, as
\[ S(A_{jk}, A_{j'k'}) = \big(A_{jk}^T ~ A_{j'k'}^T\big)^T \in \mathbb{R}^{(D_{jk} + D_{j'k'}) \times D}, \qquad S_J(A_{jk}) = \big(A_{1k}^T \ldots A_{Jk}^T\big)^T \in \mathbb{R}^{D_k \times D}, \]
\[ S_{JK}(A_{jk}) = \big(A_{11}^T \ldots A_{J1}^T \ldots A_{1K}^T \ldots A_{JK}^T\big)^T \in \mathbb{R}^{\bar{D} \times D}, \]
where $D_k = \sum_{j=1}^J D_{jk}$ and $\bar{D} = \sum_{k=1}^K D_k$.

Consider the collection of samples $\{y_i, X_i\}_{i=1}^N$, where $X_i \in \mathbb{R}^{M \times q}$ is fixed and $Y_i \in \mathbb{R}^M$, $q, M \in \mathbb{N}$. The number of covariates $q$ is considered fixed in this paper. Let $\theta$, $\zeta$ take values in parameter spaces $\Theta \subseteq \mathbb{R}^p$, $\Xi \subseteq \mathbb{R}^d$, both compact subsets of $p$- and $d$-dimensional Euclidean space respectively. Let $p, d \in \mathbb{N}$, and consider $\theta$ to be the parameter of interest, and $\zeta$ to be a potentially large vector of parameters of secondary interest. Let $\theta_0 \in \Theta$, $\zeta_0 \in \Xi$ be the true values of $\theta$ and $\zeta$ respectively. Consider a class $\mathcal{P} = \{P_{\theta,\zeta}\}$ of parametric models with associated estimating functions $\Psi$ of parameter $\theta$ (e.g. $\Psi$ can be the derivative of some objective function). Suppose we want to learn the parameter $\theta$ by finding the root of $\Psi(\theta; y, \zeta) = 0$, which is computationally intensive or even prohibitive due to the large dimension $M$ of $y$, the large sample size $N$, or the large dimension $d$ of $\zeta$.
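To fix ideas, the norm and stacking operations defined above are straightforward to implement; the following is a minimal R sketch (the function names l2_norm and stack_JK are our own illustrations, not part of the accompanying package).

    # l2-norm of a vector, and Frobenius-type l2-norm of a matrix:
    l2_norm <- function(A) sqrt(sum(A^2))

    # S_JK(A_jk): row-stack a list-of-lists of matrices, blocks j = 1,...,J
    # within group k = 1, then k = 2, and so on; A[[k]][[j]] is D_jk x D.
    stack_JK <- function(A) {
      do.call(rbind, lapply(A, function(A_k) do.call(rbind, A_k)))
    }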
We focus on a divide-and-combine approach utilizing modern distributed computing platforms to alleviate the computational and modelling challenges posed by analyzing the whole data.

First, for each subject $i$, DDIMM divides the $M$-dimensional response $y_i$ and its associated covariates into $J$ blocks, denoted by
\[ y_i = \big(y_{i,1}^T \ldots y_{i,J}^T\big)^T \text{ and } X_i = \big(X_{i,1}^T \ldots X_{i,J}^T\big)^T, \quad i = 1, \ldots, N. \]
Division into blocks is not restricted to the order of data entry: responses may be grouped according to pre-specified block memberships derived from, say, substantive scientific knowledge, such as functional regions of the brain. In this paper, with no loss of generality, we use the order of data entry in the data division procedure. Further, DDIMM randomly splits the $N$ independent subjects to form $K$ disjoint subject groups $\{y_{i,jk}, X_{i,jk}\}_{i=1}^{n_k}$, where each group has sample size $n_k$, $k = 1, \ldots, K$, with $\sum_{k=1}^K n_k = N$. Refer to Table 1 for notation detail. For ease of exposition, we henceforth use the term "group" to refer to the division along subjects, and "block" to refer to the division along responses. We also use the term "block" to refer to the division along both responses and subjects.

We call $\{y_{i,jk}, X_{i,jk}\}_{i=1}^{n_k}$ block $(j,k)$, $j = 1, \ldots, J$ and $k = 1, \ldots, K$. Within block $(j,k)$, let $m_j$ be the dimension of the sub-response, $y_{i,jk} = (y_{i1,jk}, \ldots, y_{im_j,jk})^T \in \mathbb{R}^{m_j}$, and $X_{i,jk} \in \mathbb{R}^{m_j \times q}$ the associated covariate matrix, with $\sum_{j=1}^J m_j = M$. For each block $j \in \{1, \ldots, J\}$, we have $K$ independent subject groups $\{y_{i,jk}\}_{i=1,k=1}^{n_k,K}$. In contrast, each group $k \in \{1, \ldots, K\}$ has $n_k$ subjects, and for each subject $i \in \{1, \ldots, n_k\}$ the $J$ response blocks $\{y_{i,jk}\}_{j=1}^J$ are dependent.

The primary task is to solve $\Psi(\theta; y, \zeta) = 0$ to learn the parameter $\theta$ in a supervised way over the entire data. Given the above double data split scheme, this task becomes a divide-and-combine procedure: the first step is to solve the following system of block-specific estimating equations: for $j \in \{1, \ldots, J\}$, $k \in \{1, \ldots, K\}$,
\[ \Psi_{jk}(\theta; y_{jk}, \zeta_{jk}) = 0, \quad (1) \]
\[ G_{jk}(\zeta_{jk}; y_{jk}, \theta) = 0, \quad (2) \]
where $G_{jk}$ is an estimating function used to learn parameters $\zeta_{jk}$ (e.g. correlation parameters) that are allowed to be heterogeneous across blocks such that $\zeta = S_{JK}(\zeta_{jk})$. The true values $(\theta_0, \zeta_{0,jk})$ of $(\theta, \zeta_{jk})$ are the values such that $E_{\theta_0,\zeta_{0,jk}} S\big(\Psi_{jk}(\theta_0; y_{jk}, \zeta_{0,jk}), G_{jk}(\zeta_{0,jk}; y_{jk}, \theta_0)\big) = 0$. Parameters $\zeta_{jk}$ take values in parameter space $\Xi_{jk} \subset \mathbb{R}^{d_{jk}}$ for some $d_{jk} > 0$, with $\zeta_0 = S_{JK}(\zeta_{0,jk})$, $\Xi = \bigtimes_{j=1,k=1}^{J,K} \Xi_{jk}$ and $d = \sum_{k=1}^K \sum_{j=1}^J d_{jk}$. Let $\zeta_k = S_J(\zeta_{jk})$ and $\zeta_{0,k} = S_J(\zeta_{0,jk})$. This approach is similar to GEE2, proposed by Zhao and Prentice (1990), with details also in Liang et al. (1992), where unbiased estimating equations for the nuisance parameters are added in order to guarantee consistency. In this way, we impose homogeneity of the parameter of interest $\theta$ across blocks but allow heterogeneity of the parameters of secondary interest.
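For intuition, the double data division described above can be sketched in a few lines of R; this illustrative helper (double_split, our own naming) assumes equal block sizes and a purely random group assignment.

    # Split an N x M outcome matrix Y into J response blocks of equal size m
    # and K random subject groups; blocks[[k]][[j]] holds block (j, k).
    double_split <- function(Y, J, K) {
      m <- ncol(Y) / J                      # assumes M is divisible by J
      grp <- split(seq_len(nrow(Y)), sample(rep(1:K, length.out = nrow(Y))))
      lapply(grp, function(rows)
        lapply(1:J, function(j)
          Y[rows, ((j - 1) * m + 1):(j * m), drop = FALSE]))
    }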
We assume that the class of parametric models $\mathcal{P}$ yields block-specific estimating functions satisfying the following regularity assumptions:

(A.1) (i) $\Psi_{jk}$ and $G_{jk}$ are unbiased; that is, for all $\theta \in \Theta$, $\zeta_{jk} \in \Xi_{jk}$, $E_{\theta,\zeta_{jk}} S\big(\Psi_{jk}(\theta; Y_{jk}, \zeta_{jk}), G_{jk}(\zeta_{jk}; Y_{jk}, \theta)\big) = 0$.
(ii) $E_{\theta,\zeta_{jk}} S\big(\Psi_{jk}(\theta; Y_{jk}, \zeta_{jk}), G_{jk}(\zeta_{jk}; Y_{jk}, \theta)\big)$ has a unique zero at $(\theta_0, \zeta_{0,jk})$.
(iii) $\Psi_{jk}$ and $G_{jk}$ are additive: for some kernel inference functions $\psi_{jk}$ and $g_{jk}$, they take the form
\[ \begin{pmatrix} \Psi_{jk}(\theta; y_{jk}, \zeta_{jk}) \\ G_{jk}(\zeta_{jk}; y_{jk}, \theta) \end{pmatrix} = \frac{1}{n_k} \sum_{i=1}^{n_k} \begin{pmatrix} \psi_{jk}(\theta; y_{i,jk}, \zeta_{jk}) \\ g_{jk}(\zeta_{jk}; y_{i,jk}, \theta) \end{pmatrix}. \]
We define $\Psi_{jk}$ and $G_{jk}$ as being "weakly regular" based on the above conditions (A.1)(i)-(iii), in which the defining properties of a regular inference function are applied to its mean; see Song (2007), Chapter 3.5, for a definition of regular inference functions. Additional conditions on the class $\mathcal{P}$ will be described throughout the paper where appropriate. Within block $(j,k)$, denote by $\widehat{\theta}_{jk}$ and $\widehat{\zeta}_{jk}$ the joint solution to (1) and (2), estimators of $\theta$ and $\zeta_{jk}$ respectively. For notation purposes, let $\widehat{\theta}_{list} = S_{JK}(\widehat{\theta}_{jk})$, $\widehat{\zeta}_k = S_J(\widehat{\zeta}_{jk})$, and $\widehat{\zeta}_{list} = S_{JK}(\widehat{\zeta}_{jk})$. Due to the homogeneity of $\theta$, the next step is integration of the block-specific estimators $\widehat{\theta}_{jk}$. By contrast, the $\widehat{\zeta}_{jk}$ remain heterogeneous and potentially high-dimensional. In the rest of the paper, for convenience of notation, we suppress the dependence of $\Psi_{jk}$, $G_{jk}$, $\psi_{jk}$ and $g_{jk}$ on $y_{jk}$ and $y_{i,jk}$:
\[ \Psi_{jk}(\theta; \zeta_{jk}) = \Psi_{jk}(\theta; y_{jk}, \zeta_{jk}), \qquad G_{jk}(\zeta_{jk}; \theta) = G_{jk}(\zeta_{jk}; y_{jk}, \theta), \]
\[ \psi_{i,jk}(\theta; \zeta_{jk}) = \psi_{jk}(\theta; y_{i,jk}, \zeta_{jk}), \qquad g_{i,jk}(\zeta_{jk}; \theta) = g_{jk}(\zeta_{jk}; y_{i,jk}, \theta). \]

Integrating the block estimates $\widehat{\theta}_{jk}$ into an estimator of $\theta$, denoted by $\widehat{\theta}_c$, will yield a more efficient estimate of $\theta$. In the integration step, our intuition is to treat each system of equations $S\big(\Psi_{jk}(\theta; \zeta_{jk}), G_{jk}(\zeta_{jk}; \theta)\big) = 0$ as a "moment condition" on $\theta$ contributed by block $(j,k)$, $j = 1, \ldots, J$, $k = 1, \ldots, K$. Technically, we want to derive an estimator $\widehat{\theta}_c$ of $\theta$ that satisfies all $JK$ moment conditions and effectively makes use of the $JK$ estimates of $\theta$ obtained from equations (1) and (2).
To address the issue that $\theta$ is over-identified by the $JK$ moment conditions, we invoke Hansen (1982)'s seminal generalized method of moments (GMM) to combine the moment conditions that arise from each block. Another significant advantage of GMM is that it allows us to incorporate between-block dependencies, which cannot easily be done in classical meta-estimation. To this end, define the subject group indicator $\delta_i(k) = 1$(subject $i$ is in blocks $(j,k)$ for some $k \in \{1, \ldots, K\}$ and for all $j = 1, \ldots, J$) for $i = 1, \ldots, N$, $k = 1, \ldots, K$. For subject $i$, let
\[ \psi_i(\theta; \zeta) = S_{JK}\big(\delta_i(k)\, \psi_{i,jk}(\theta; \zeta_{jk})\big), \qquad g_i(\zeta; \theta) = S_{JK}\big(\delta_i(k)\, g_{i,jk}(\zeta_{jk}; \theta)\big), \]
where clearly only one $S_J\big(\delta_i(k)\, \psi_{i,jk}^T(\theta; \zeta_{jk})\big)$ is non-zero. Let $a^{\otimes 2}$ denote the outer product of a vector $a$ with itself, namely $a^{\otimes 2} = a a^T$. Then we can define $\Psi_N(\theta; \zeta) = (1/N) \sum_{i=1}^N \psi_i(\theta; \zeta)$. It is easy to show that
\[ \Psi_N(\theta; \zeta) = \frac{1}{N} S_{JK}\Big(\sum_{i=1}^{n_k} \psi_{i,jk}(\theta; \zeta_{jk})\Big) = \frac{1}{N} S_{JK}\big(n_k \Psi_{jk}(\theta; \zeta_{jk})\big). \]
Similarly, define $G_N(\zeta; \theta) = (1/N) \sum_{i=1}^N g_i(\zeta; \theta) = (1/N)\, S_{JK}\big(n_k G_{jk}(\zeta_{jk}; \theta)\big)$. Since $\Psi_{jk}$ and $G_{jk}$ satisfy assumptions (A.1) for each $j$ and $k$, $\Psi_N$ and $G_N$ are additive and unbiased, and $E_{\theta,\zeta} S\big(\Psi_N(\theta; \zeta), G_N(\zeta; \theta)\big)$ has a unique zero at $(\theta_0, \zeta_0)$. For convenience, we denote
\[ T_N(\theta, \zeta) = \begin{pmatrix} \Psi_N(\theta; \zeta) \\ G_N(\zeta; \theta) \end{pmatrix}, \qquad \tau_i(\theta, \zeta) = \begin{pmatrix} \psi_i(\theta; \zeta) \\ g_i(\zeta; \theta) \end{pmatrix}. \quad (3) \]
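In code, the zero-padding by the group indicator $\delta_i(k)$ used above simply places subject $i$'s $J$ block kernels in the rows of the $JKp$-vector belonging to its own group; a schematic R version (our own helper) follows.

    # Build psi_i(theta; zeta) for a subject in group k: all entries are zero
    # except the J stacked p-dimensional block kernels of group k.
    psi_i_stacked <- function(psi_blocks_i, k, J, K, p) {
      out <- numeric(J * K * p)
      rows <- ((k - 1) * J * p + 1):(k * J * p)
      out[rows] <- unlist(psi_blocks_i)    # S_J of subject i's block kernels
      out
    }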
We assume that the class $\mathcal{P}$ yields $\psi$, $g$ satisfying the following conditions:

(A.2) (i) Both $\psi_{jk}$ and $g_{jk}$ are Lipschitz continuous in $\theta$ and $\zeta$; namely, for $j \in \{1, \ldots, J\}$, $k \in \{1, \ldots, K\}$, and some constants $c_{jk}, b_{jk} > 0$, for all $(\theta_1, \zeta_{1,jk})$, $(\theta_2, \zeta_{2,jk})$ in a neighbourhood of $(\theta_0, \zeta_{0,jk})$,
\[ \big\| \psi_{i,jk}(\theta_1; \zeta_{1,jk}) - \psi_{i,jk}(\theta_2; \zeta_{2,jk}) \big\|_2 \leq c_{jk} \big\| (\theta_1, \zeta_{1,jk}) - (\theta_2, \zeta_{2,jk}) \big\|_2, \]
\[ \big\| g_{i,jk}(\zeta_{1,jk}; \theta_1) - g_{i,jk}(\zeta_{2,jk}; \theta_2) \big\|_2 \leq b_{jk} \big\| (\theta_1, \zeta_{1,jk}) - (\theta_2, \zeta_{2,jk}) \big\|_2. \]
(ii) The sensitivity matrix $-\nabla_{\theta,\zeta}\, E_{\theta,\zeta}\, \tau_i(\theta, \zeta)$ is continuous in a compact neighbourhood $\mathcal{N}(\theta_0, \zeta_0)$ of $(\theta_0, \zeta_0)$, and positive definite.
(iii) The variability matrix $E_{\theta,\zeta}\big(\tau_i(\theta, \zeta)^{\otimes 2}\big)$ is finite and positive definite.

Note that $T_N(\theta, \zeta) = 0$ has no unique solution because its dimension is bigger than the dimension of $\theta$. To overcome this issue, we follow Hansen's GMM for over-identified parameters. Let $W$ be the weight matrix in the GMM equation (4) below. Classical GMM theory states that any positive semi-definite matrix $W$ can be used to guarantee consistency and asymptotic normality of the resulting estimator, and that an optimal choice of $W$, corresponding to the inverse covariance of the estimating function $T_N$ in (3), leads to an efficient GMM estimator. In our setting, a possible formulation for a GMM estimator of $(\theta, \zeta)$ is
\[ (\widehat{\theta}_c, \widehat{\zeta}_c) = \arg\min_{\theta,\zeta} Q_N(\theta, \zeta \mid W), \text{ where } Q_N(\theta, \zeta \mid W) = T_N^T(\theta, \zeta)\, W\, T_N(\theta, \zeta). \quad (4) \]
In (4), the weight matrix $W$ is a positive semi-definite $(JKp + d) \times (JKp + d)$ matrix. The heterogeneity of $\zeta$ allowed by the use of $G_N$ can lead to theoretical and computational challenges due to the high dimensionality of the parameter, a problem from which GEE2 also suffers. See Chan et al. (1998) and Carey et al. (1993) for a discussion of the computational burden of inverting large matrices in GEE2. Note that the block-specific estimators $\widehat{\zeta}_{list}$ are consistent; the only possible improvement from re-learning $\zeta$ in an iterative procedure between $\widehat{\theta}_c$ and $\widehat{\zeta}_c$ is a gain in efficiency. This is not necessary since $\zeta$ are parameters of secondary interest and their efficiency is in general not of interest. We will derive a closed-form meta-estimator of $\theta$ that avoids re-learning of $\zeta$ in Section 5.

Following the work of Hansen (1982), we define a particular instance of the estimator in (4) by specifying $W$ as the inverse sample covariance of $T_N$. We will show in Section 4 that this choice of $W$ is optimal for the efficiency of the resulting estimator. Let $\widehat{V}_N$ be the sample covariance of $T_N(\theta, \zeta)$:
\[ \widehat{V}_N = \frac{1}{N} \sum_{i=1}^N \big( \tau_i(\widehat{\theta}_{list}, \widehat{\zeta}_{list}) \big)^{\otimes 2} = \frac{1}{N} \sum_{i=1}^N \begin{pmatrix} \psi_i(\widehat{\theta}_{list}; \widehat{\zeta}_{list}) \\ g_i(\widehat{\zeta}_{list}; \widehat{\theta}_{list}) \end{pmatrix}^{\otimes 2}, \quad (5) \]
where $\psi_i(\widehat{\theta}_{list}; \widehat{\zeta}_{list}) = S_{JK}\big(\delta_i(k)\, \psi_{i,jk}(\widehat{\theta}_{jk}; \widehat{\zeta}_{jk})\big)$. Letting $W = \widehat{V}_N^{-1}$ yields the following optimal GMM estimator:
\[ (\widehat{\theta}_{opt}, \widehat{\zeta}_{opt}) = \arg\min_{\theta,\zeta}\, T_N^T(\theta, \zeta)\, \widehat{V}_N^{-1}\, T_N(\theta, \zeta). \quad (6) \]
We assume that $W$ and $\widehat{V}_N$ are nonsingular; see Han and Song (2011) for an optimal weighting matrix with QIF when the sample covariance is ill-defined.
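As a schematic illustration (not the implementation in our package), the objective in (4) with the optimal weight in (6) can be coded directly once the stacked estimating function and the per-subject contributions $\tau_i$ are available; tau_mat and T_N_fun below are hypothetical placeholders.

    # Sample covariance V_hat in (5) from an N x (JKp + d) matrix of tau_i's,
    # and the GMM objective Q_N in (4) with W = V_hat^{-1} as in (6).
    V_hat <- function(tau_mat) crossprod(tau_mat) / nrow(tau_mat)
    Q_N <- function(par, T_N_fun, V) {
      TN <- T_N_fun(par)                   # stacked estimating function at par
      drop(crossprod(TN, solve(V, TN)))    # T_N^T V^{-1} T_N
    }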
Before presenting large-sample properties of $(\widehat{\theta}_c, \widehat{\zeta}_c)$ and $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ in Section 4, we demonstrate in Section 3 the flexibility of our framework through several important supervised learning methods.

We now present five examples to illustrate the flexibility of the unifying framework considered in this paper.

Consider the multidimensional regression model $h(\mu_{i,jk}) = X_{i,jk} (\theta^T ~ \beta_{jk}^T)^T$, where $\mu_{i,jk} = E(Y_{i,jk} \mid X_{i,jk}, \theta, \beta_{jk})$ is the mean vector of $Y_{i,jk}$ given $X_{i,jk}$, $\beta_{jk}$, and the $p$-dimensional parameter $\theta$ ($p \leq q$, the number of covariates, which may include an intercept), and $h$ is a known component-wise link function. Let $\zeta_{jk}$ be parameters of the second-order moments of $Y_{i,jk}$, such as dispersion parameters, and parameters in $\beta_{jk}$ (which may be empty). If the full likelihood of $Y_{i,jk}$ is computationally tractable, $\Psi_{jk}$ and $G_{jk}$ correspond to the score functions, and $\widehat{\theta}_{jk}$ and $\widehat{\zeta}_{jk}$ may be given by the maximum likelihood estimates (MLEs). DDIMM can be applied straightforwardly by following the procedure in Section 2.

If the full likelihood is computationally intractable or difficult to construct, one can instead use pseudo-likelihoods such as the pairwise composite likelihood. The pairwise composite likelihood, originally proposed by Lindsay (1988) and detailed in Varin et al. (2011), provides the following forms of the equations for (1) and (2):
\[ \Psi_{jk}(\theta; \zeta_{jk}) = \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{r=1}^{m_j-1} \sum_{t=r+1}^{m_j} \nabla_\theta \log f_j(y_{ir,jk}; y_{it,jk}; \theta, \zeta_{jk}, X_{i,jk}), \]
\[ G_{jk}(\zeta_{jk}; \theta) = \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{r=1}^{m_j-1} \sum_{t=r+1}^{m_j} \nabla_{\zeta_{jk}} \log f_j(y_{ir,jk}; y_{it,jk}; \theta, \zeta_{jk}, X_{i,jk}), \]
for some bivariate marginal $f_j$ which can be chosen according to the nature of the response data. As long as the bivariate marginals $f_j$ are correctly specified, the composite score functions $\Psi_{jk}$ and $G_{jk}$ satisfy the regularity conditions in (A.1). Hence the DDIMM can be used to overcome the computational challenges related to the MLE and pairwise composite likelihood. We refer readers to Chapter 6 of Song (2007) and Chapter 3 of Joe (2014) for details on constructing multivariate distributions using Gaussian and vine copulas respectively, but note that direct computation of the MLE is computationally very challenging when $m_j \geq 4$.
Examples of applications of Gaussian copulas can be found in Song et al. (2009), Bodnar et al. (2010), Bai et al. (2014), and in the importance sampling algorithm proposed in Masarotto and Varin (2012), among others.
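To make the pairwise composite likelihood concrete, here is a hedged R sketch for one block with Gaussian bivariate margins and a common pair correlation; the function and the parametrization $\zeta_{jk} = (\sigma, \rho)$ are our simplifications for illustration, not the interface of the accompanying package.

    # Pairwise Gaussian composite log-likelihood for block (j, k):
    # y is n x m, X is an n x m x q covariate array, theta has length q.
    pairwise_cl <- function(theta, zeta, y, X) {
      sigma <- zeta[1]; rho <- zeta[2]
      R2 <- sigma^2 * matrix(c(1, rho, rho, 1), 2)   # common 2 x 2 covariance
      ll <- 0
      m <- ncol(y)
      for (r in 1:(m - 1)) for (t in (r + 1):m) {
        res <- cbind(y[, r] - X[, r, ] %*% theta,
                     y[, t] - X[, t, ] %*% theta)
        ll <- ll + sum(mvtnorm::dmvnorm(res, sigma = R2, log = TRUE))
      }
      ll
    }

Block estimates $(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$ would then be obtained by maximizing this objective, e.g. with optim(); the composite score functions in (1) and (2) follow by differentiation.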
More generally, Wedderburn (1974)'s quasi-likelihood is a popular alternative method of supervised learning that does not require a fully specified multidimensional likelihood; it receives a full treatment in Heyde (1997). Consider Liang and Zeger (1986)'s marginal mean model $h(\mu_{i,jk}) = X_{i,jk} (\theta^T ~ \beta_{jk}^T)^T$ for the analysis of longitudinal data, where $\mu_{i,jk} = E(Y_{i,jk} \mid X_{i,jk}, \theta, \beta_{jk})$ is the marginal mean vector of serially correlated outcomes $Y_{i,jk}$ given $X_{i,jk}$, $\beta_{jk}$, and the $p$-dimensional parameter $\theta$ ($p \leq q$), and $h$ is a known component-wise link function. In this setting, $\zeta_{jk}$ consists of parameters in $\beta_{jk}$ (which may be empty), parameters for the variances of $Y_{it,jk}$, $t = 1, \ldots, m_j$, and a nuisance parameter $\alpha_{jk}$ which fully characterizes a working correlation matrix $R_{jk}(\alpha_{jk})$. In the case where $\beta_{jk}$ is empty, the generalized estimating equation (GEE) proposed by Liang and Zeger (1986) yields the kernel inference function $\psi_{jk}(\theta; \zeta_{jk}) = D_{i,jk}^T \Sigma_{i,jk}^{-1} r_{i,jk}$ in (A.1)(iii), where $D_{i,jk} = \nabla_\theta\, \mu_{i,jk}$, $r_{i,jk} = y_{i,jk} - \mu_{i,jk}$, and $\Sigma_{i,jk} = A_{i,jk} R_{jk}(\alpha_{jk}) A_{i,jk}$, where $A_{i,jk} = \mathrm{diag}\big\{(Var(Y_{it,jk}))^{1/2}\big\}_{t=1}^{m_j}$. In GEE2, $G_{jk}$ in (2) is specified as another unbiased inference function satisfying (A.1) and (A.2). DDIMM provides a procedure for the application of distributed methods to high-dimensional longitudinal/clustered data.
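For illustration, the GEE kernel $\psi_{jk} = D_{i,jk}^T \Sigma_{i,jk}^{-1} r_{i,jk}$ can be written out for one subject under an identity link and an AR(1) working correlation; this is a simplified sketch (unit marginal variances assumed), not the internals of any GEE package.

    # GEE kernel for subject i in block (j, k): D^T Sigma^{-1} (y - mu),
    # with identity link (so D = X_i) and AR(1) working correlation R(alpha).
    psi_gee <- function(theta, alpha, y_i, X_i) {
      m  <- length(y_i)
      mu <- X_i %*% theta
      R  <- alpha^abs(outer(1:m, 1:m, "-"))   # AR(1) working correlation
      drop(t(X_i) %*% solve(R, y_i - mu))     # A_i set to identity here
    }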
DDIMM can be applied to many learning methods proposed in robust statistics. In the robust statistics literature due to Huber (1964) and, more generally, Huber (2009), an M-estimator is defined as the root of an implicit equation of the form $\Psi_{jk}(\widehat{\theta}_{jk}) = \sum_{i=1}^{n_k} \psi_{jk}(\widehat{\theta}_{jk}) = 0$, where $\psi_{jk}(\theta) = \nabla_\theta\, \rho(\theta)$, $\rho$ is a suitable function that primarily aims to provide estimators robust to influential data points, $\widehat{\theta}_{jk} \in \mathbb{R}^p$, and $\zeta_{jk}$ is empty or known; additional details are available in Huber (2009) for the case when $\zeta_{jk}$ is unknown. In the context of longitudinal data, Wang et al. (2005) robustify the generalized estimating equations of Liang and Zeger (1986) by replacing the standardized residuals with Huber's $M$-residuals.

3.4 Joint mean-variance modelling

Following Pan and Mackenzie (2003), one can jointly model the marginal means and covariances of the longitudinal responses with $h(\mu_{i,jk}) = X_{i,jk,1}\, \beta$, $\log(\sigma^2_{i,jk}) = X_{i,jk,2}\, \lambda$, and $\phi_{irt,jk} = X_{irt,jk,3}\, \gamma$ for $1 \leq t < r \leq m_j$, where $h$ is a known component-wise link function, $\beta \in \mathbb{R}^{q_1}$, $\lambda \in \mathbb{R}^{q_2}$ and $\gamma \in \mathbb{R}^{q_3}$ are unconstrained parameters, $\mu_{i,jk} = E(Y_{i,jk} \mid X_{i,jk,1}, \theta)$ with $X_{i,jk,1} \in \mathbb{R}^{m_j \times q_1}$ a submatrix of $X_{i,jk}$, $\sigma^2_{i,jk} = S(Var(Y_{ir,jk}))_{r=1}^{m_j}$ with $X_{i,jk,2} \in \mathbb{R}^{m_j \times q_2}$ a submatrix of $X_{i,jk}$, and the $\phi_{irt,jk}$ are specified in Zhang et al. (2015a). Estimating functions $\Psi_{jk}$ and $G_{jk}$ in (1) and (2) are given in detail in Zhang et al. (2015a). There is some choice, depending on the problem considered, as to whether $\theta = \beta$, $\theta = (\lambda, \gamma)$, or $\theta = (\beta, \lambda, \gamma)$. In the first case, learning of variance parameters only helps improve estimation efficiency. This type of framework is widely applied in biomedical studies where the mean parameters are of primary interest. In the second case, learning of covariance parameters is of interest and $\beta$ is treated as a nuisance parameter. This is the situation where prediction is of primary interest, such as in kriging in spatial data analysis. In the third case, $G_{jk}$ is null, and learning of variance parameters is of interest to the investigator. This case occurs for example in the study of volatility for risk management in financial data analysis.

Consider the marginal quantile regression model $Q_{Y_{it,jk} \mid X_{it,jk}}(\tau) = X_{it,jk}\, \theta$, where $Q_{Y_{it,jk} \mid X_{it,jk}}(\tau) = F^{-1}_{Y_{it,jk} \mid X_{it,jk}}(\tau) = \inf\{y_{it,jk} : F_{Y_{it,jk} \mid X_{it,jk}}(y_{it,jk}) \geq \tau\}$ is the $\tau$th quantile of $Y_{it,jk} \mid X_{it,jk}$, $\tau \in (0,1)$, and $F_{Y_{it,jk} \mid X_{it,jk}}(y_{it,jk})$ is the conditional distribution function of $Y_{it,jk}$ given $X_{it,jk}$, $t = 1, \ldots, m_j$. Many estimating functions $\Psi_{jk}$ and $G_{jk}$ for the learning of $\theta$ and association parameters $\zeta_{jk}$ of $Y_{i,jk}$ have been proposed; see Jung (1996), Fu and Wang (2012), Lu and Fan (2015), and Yang et al. (2017) for examples.
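As one concrete choice in this last example, the kernel can be taken as the subgradient of the check loss; the following R sketch (our own, marginal and ignoring within-subject association) illustrates the idea for quantile level tau.

    # Marginal quantile-regression kernel for subject i: the subgradient of
    # the check loss, summed over the m_j coordinates of block (j, k).
    psi_quantile <- function(theta, y_i, X_i, tau = 0.5) {
      r <- y_i - X_i %*% theta
      drop(t(X_i) %*% (tau - (r < 0)))    # tau - 1{residual < 0}
    }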
Each of these five examples requires additional work to fully develop a divide-and-conquer strategy via DDIMM, including specific computational details. Here we only present the general framework with a high-level discussion that sheds light on DDIMM's promising generality and flexibility, and its coverage of a wide range of supervised learning methods. The theoretical results presented in Sections 4 and 5 are developed under a general unified framework of estimating functions that includes the above five examples as special cases.

In this section we assume that $K$ and $J$ are fixed; this assumption will be relaxed in Section 5. Let $n_{\min} = \min_{k=1,\ldots,K} n_k$ and $n_{\max} = \max_{k=1,\ldots,K} n_k$. Suppose $W \to_p w$ as $n_{\min} \to \infty$. In this section we study the asymptotic properties of the GMM estimator $(\widehat{\theta}_c, \widehat{\zeta}_c)$ proposed in (4) and its optimal version proposed in (6). We assume throughout that subjects are monotonically allocated to subject groups; that is, as $n_{\min} \to \infty$, a subject cannot be reallocated to another group once it has been assigned to a subject group. Define the variability matrix of $\tau_i(\theta, \zeta)$ in (3) as
\[ v(\theta, \zeta) = Var_{\theta,\zeta}\{\tau_i(\theta, \zeta)\} = \begin{pmatrix} v_\psi(\theta, \zeta) & v_{\psi g}(\theta, \zeta) \\ v_{\psi g}^T(\theta, \zeta) & v_g(\theta, \zeta) \end{pmatrix}, \]
where $v_\psi(\theta, \zeta) = Var_{\theta,\zeta}\{\psi_i(\theta; \zeta)\}$, $v_g(\theta, \zeta) = Var_{\theta,\zeta}\{g_i(\zeta; \theta)\}$, and $v_{\psi g}(\theta, \zeta) = E_{\theta,\zeta}\{\psi_i(\theta; \zeta)\, g_i^T(\zeta; \theta)\}$. Let the sensitivity matrix of $\tau_i(\theta, \zeta)$ be
\[ s(\theta, \zeta) = -\nabla_{\theta,\zeta}\, E_{\theta,\zeta}\, \tau_i(\theta, \zeta) = \begin{pmatrix} s_{\theta\psi}(\theta, \zeta) & s_{\zeta\psi}(\theta, \zeta) \\ s_{\theta g}(\theta, \zeta) & s_{\zeta g}(\theta, \zeta) \end{pmatrix}, \quad (7) \]
where
\[ s_{\theta\psi}(\theta, \zeta) = S_{JK}\Big(\frac{n_k}{N}\, s_{\theta\psi_{jk}}(\theta, \zeta_{jk})\Big), \qquad s_{\zeta\psi}(\theta, \zeta) = \mathrm{diag}\Big\{\frac{n_k}{N}\, s_{\zeta\psi_{jk}}(\theta, \zeta_{jk})\Big\}_{j=1,k=1}^{J,K}, \]
\[ s_{\theta g}(\theta, \zeta) = S_{JK}\Big(\frac{n_k}{N}\, s_{\theta g_{jk}}(\theta, \zeta_{jk})\Big), \qquad s_{\zeta g}(\theta, \zeta) = \mathrm{diag}\Big\{\frac{n_k}{N}\, s_{\zeta g_{jk}}(\theta, \zeta_{jk})\Big\}_{j=1,k=1}^{J,K}, \]
\[ s_{jk}(\theta, \zeta_{jk}) = \begin{pmatrix} s_{\theta\psi_{jk}}(\theta, \zeta_{jk}) & s_{\zeta\psi_{jk}}(\theta, \zeta_{jk}) \\ s_{\theta g_{jk}}(\theta, \zeta_{jk}) & s_{\zeta g_{jk}}(\theta, \zeta_{jk}) \end{pmatrix}. \]
Following Theorem 3.4 of Song (2007), the block-specific estimates $\widehat{\theta}_{jk}$ and $\widehat{\zeta}_{jk}$ are consistent given assumptions (A.1). Consistency and asymptotic normality of the GMM estimator $(\widehat{\theta}_c, \widehat{\zeta}_c)$ in (4) have been established by Hansen (1982) and, more generally, by Newey and McFadden (1994). To establish consistency and asymptotic normality for the combined estimator $(\widehat{\theta}_c, \widehat{\zeta}_c)$, we consider the following additional regularity conditions:

(A.3) Following Newey and McFadden (1994), define $Q(\theta, \zeta \mid W) = E_{\theta,\zeta}\{T_N^T(\theta, \zeta)\}\, w\, E_{\theta,\zeta}\{T_N(\theta, \zeta)\}$. Assume $Q(\theta, \zeta \mid W)$ is twice continuously differentiable in a neighbourhood of $(\theta_0, \zeta_0)$.

(A.4) Let $(\widehat{\theta}_c, \widehat{\zeta}_c) = \arg\min_{\theta,\zeta} Q_N(\theta, \zeta \mid W)$. Following Newey and McFadden (1994), assume $Q_N(\widehat{\theta}_c, \widehat{\zeta}_c \mid W) \leq \inf_{\theta \in \Theta, \zeta \in \Xi} Q_N(\theta, \zeta \mid W) + \epsilon_N$ with $\epsilon_N = o_p(1)$. In addition, assume that $\theta_0$, $\zeta_0$ are interior points of $\Theta$ and $\Xi$ respectively, and that for any $\delta_N \to 0$,
\[ \sup_{\|(\theta,\zeta) - (\theta_0,\zeta_0)\|_2 \leq \delta_N} \frac{N^{1/2}}{1 + N^{1/2}\|(\theta,\zeta) - (\theta_0,\zeta_0)\|_2} \big\| T_N(\theta, \zeta) - T_N(\theta_0, \zeta_0) - E_{\theta,\zeta}\, T_N(\theta, \zeta) \big\|_2 \to_p 0. \]

Theorems 1 and 2 do not require the differentiability of $T_N$ and $Q_N$. Instead, they require the differentiability of their population versions, and that $T_N$ behave "nicely" in a neighbourhood of $(\theta_0, \zeta_0)$, in the sense that higher order terms are asymptotically ignorable. The following two theorems state the consistency and asymptotic normality of $(\widehat{\theta}_c, \widehat{\zeta}_c)$ given in (4) under Newey and McFadden's mild moment conditions given in (A.3) and (A.4).

Theorem 1 (Consistency of $(\widehat{\theta}_c, \widehat{\zeta}_c)$). Suppose assumptions (A.1)-(A.3) hold with $(\widehat{\theta}_c, \widehat{\zeta}_c)$ defined in (4). Then $(\widehat{\theta}_c, \widehat{\zeta}_c) \to_p (\theta_0, \zeta_0)$ as $n_{\min} \to \infty$.

The proof of Theorem 1 follows closely the steps given in Hansen (1982) and Newey and McFadden (1994), and thus is omitted.
Theorem 2 (Asymptotic normality of $(\widehat{\theta}_c, \widehat{\zeta}_c)$). Suppose assumptions (A.1)-(A.4) hold with $(\widehat{\theta}_c, \widehat{\zeta}_c)$ defined in (4). Then as $n_{\min} \to \infty$,
\[ N^{1/2} \begin{pmatrix} \widehat{\theta}_c - \theta_0 \\ \widehat{\zeta}_c - \zeta_0 \end{pmatrix} \to_d \mathcal{N}\big(0,\; j^{-1}(\theta_0, \zeta_0)\, s(\theta_0, \zeta_0)\, \tilde{v}(\theta_0, \zeta_0)\, s^T(\theta_0, \zeta_0)\, j^{-1}(\theta_0, \zeta_0)\big), \]
where $\tilde{v}(\theta_0, \zeta_0) = w\, v(\theta_0, \zeta_0)\, w$, and where the Godambe information $j(\theta_0, \zeta_0)$ of $T_N(\theta, \zeta)$ takes the form $j(\theta_0, \zeta_0) = s(\theta_0, \zeta_0)\, w\, s^T(\theta_0, \zeta_0)$.

The proof of Theorem 2 follows easily from Theorem 7.2 in Newey and McFadden (1994) and Theorem 1 above. We note that requiring $K$ to be finite implies that $N$ and $n_{\min}$ are asymptotically of the same order. We will relax this assumption in Section 5. Conditions (A.3) and (A.4) allow us to consider non-differentiable kernel inference functions in the block $(j,k)$ analysis, extending Hector and Song (2019)'s DIMM beyond CL kernel inference functions. We can now consider quantile regression, M-estimation, and more general estimating functions than the score or CL score equations.

A test of the over-identifying restrictions follows from Hansen (1982) and Hector and Song (2019). This test is useful for detecting invalid moment restrictions, which can inform our choice of data partition and model. Formally, we show in Theorem 3 that the objective function $N Q_N$ evaluated at $(\widehat{\theta}_c, \widehat{\zeta}_c)$ follows a $\chi^2$ distribution with $(JK - 1)p$ degrees of freedom. Unfortunately, it may be difficult to tell if invalid moment restrictions stem from an inappropriate data split or incorrect model specification. Residual analysis for model diagnostics can remove doubt in the latter case.

Theorem 3 (Test of over-identifying restrictions). Suppose assumptions (A.1)-(A.4) hold with $(\widehat{\theta}_c, \widehat{\zeta}_c)$ defined in (4). Then as $n_{\min} \to \infty$, $N Q_N(\widehat{\theta}_c, \widehat{\zeta}_c \mid W) \to_d \chi^2_{(JK-1)p}$.

The proof of Theorem 3 can be carried out with some minor changes from that of Theorem 3 in Hector and Song (2019). The GMM provides an objective function with which to do model selection even when the block analyses do not, such as with GEE and M-estimation.
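In practice, the test in Theorem 3 is one line of code once the minimized objective is available; a sketch, with gof_test our own naming:

    # Over-identifying restrictions test: compare N * Q_N(theta_c, zeta_c | W)
    # to a chi-square with (J*K - 1) * p degrees of freedom; returns a p-value.
    gof_test <- function(QN_min, N, J, K, p) {
      pchisq(N * QN_min, df = (J * K - 1) * p, lower.tail = FALSE)
    }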
In the following, Theorem 4 and Corollary 1 show that our combined GMM estimator derived from (6) is optimal in the sense defined by Hansen (1982): it has an asymptotic covariance matrix at least as small (in terms of the Loewner ordering) as that of any other estimator exploiting the same over-identifying restrictions. We refer to this property as "Hansen optimality".

Theorem 4.
Suppose assumptions (A.1)-(A.2) hold. Then as $n_{\min} \to \infty$, $\widehat{V}_N \to_p v(\theta_0, \zeta_0)$, i.e. $w = v^{-1}(\theta_0, \zeta_0)$.

Proof. The proof uses the consistency of the block estimators and the Central Limit Theorem, and is given in the Supplemental Material.
Corollary 1 (Hansen optimality). Suppose assumptions (A.1)-(A.4) hold with $(\widehat{\theta}_c, \widehat{\zeta}_c)$ defined in (4). Let $j(\theta_0, \zeta_0)$ be as given in Theorem 2. Then as $n_{\min} \to \infty$,
\[ N^{1/2} \begin{pmatrix} \widehat{\theta}_{opt} - \theta_0 \\ \widehat{\zeta}_{opt} - \zeta_0 \end{pmatrix} \to_d \mathcal{N}\big(0,\, j^{-1}(\theta_0, \zeta_0)\big). \]

The theoretical results given in Theorems 1-4 provide a framework for constructing asymptotic confidence intervals and conducting hypothesis tests, so that we can perform inference for $\theta$ when $M$ and/or $N$ are very large. Using an optimal weight matrix improves statistical power, so DDIMM may detect some signals that other methods may miss. Since we consider a broad class of models $\mathcal{P}$, there are no general efficiency results about the block-specific estimator $\widehat{\theta}_{jk}$. When a learning method based on $\Psi_{jk}$ has known efficiency results and performs well enough, DDIMM generally inherits this "local" efficiency to achieve overall efficiency.

Remark 1.
We discuss efficiency for selected examples in Section 3.

(i) In Example 3.1, when the score function exists and satisfies mild regularity conditions, its variance is given by Fisher's information, and is a lower bound on the variances of estimating functions for $\theta$ and $\zeta$. This, coupled with Hansen's optimality, means that using the score function for $\psi_{jk}$ and $g_{jk}$ yields an efficient estimator of $\theta$ and $\zeta$. In an unpublished dissertation, Jin (2011) studied the efficiency of the pairwise composite likelihood under different correlation structures. Hector and Song (2019) showed empirically that the efficiency of the pairwise composite likelihood propagates to the combined estimator.

(ii) In Example 3.2, it is known that the GEE estimator $\widehat{\theta}_{jk}$ is semi-parametrically efficient when the correlation structure of the response $y_{i,jk}$ is correctly specified. This, coupled with Hansen's optimality, means that using GEEs for $\psi_{jk}$ with the correct correlation structure of the response $y_{i,jk}$ yields an efficient estimator of $\theta$.

Remark 2.
The GMM estimator $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ can be interpreted as maximizing an extension of the confidence distribution density, as discussed in Hector and Song (2019). The confidence distribution approach is used for independent data in Xie and Singh (2013). Briefly, we can define the confidence estimating function (CEF) as $U(\theta, \zeta) = \Phi(N^{1/2}\, \widehat{V}_N^{-1/2}\, T_N(\theta, \zeta))$, where $\Phi(\cdot)$ is the $(JKp + d)$-variate standard normal distribution function. Clearly, $U(\theta, \zeta)$ is asymptotically standard uniform at $(\theta_0, \zeta_0)$ as long as $\widehat{V}_N$ is a consistent estimator of the covariance of $T_N$. Then we can define the density of the CEF as $u(\theta, \zeta) = \phi(N^{1/2}\, \widehat{V}_N^{-1/2}\, T_N(\theta, \zeta))$. Maximizing $u(\theta, \zeta)$ with respect to $(\theta, \zeta)$ yields the minimization defined in (6).

By framing our estimator as a GMM estimator, the theoretical framework of DIMM established only for CL can be extended to include a data split at the subject level and a generalization of $\Psi_{jk}$ and $G_{jk}$. Adding moment conditions allows the proposed method to enjoy the power and versatility of the GMM, and the necessary theoretical results to support its use. This divide-and-conquer strategy benefits from handling low-dimensional blocks of data and estimating equations, yielding tremendous computational gains.

Despite the computational gains offered by the divide-and-combine procedure and the GMM estimator, iteratively finding the solution $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ (or $(\widehat{\theta}_c, \widehat{\zeta}_c)$) to (6) can be slow due to the high dimensionality of the parameter $\zeta$ and the need to repeatedly evaluate $\Psi_{jk}$ and $G_{jk}$. To overcome this computational bottleneck, we propose a meta-estimator derived from (6) that delivers a closed-form estimator via a linear function of the block estimates $(\widehat{\theta}_{list}, \widehat{\zeta}_{list})$. We define the DDIMM estimator for $(\theta, \zeta)$:
\[ \begin{pmatrix} \widehat{\theta}_{DDIMM} \\ \widehat{\zeta}_{DDIMM} \end{pmatrix} = \Big( \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \Big)^{-1} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} \\ \widehat{\zeta}_{list} \end{pmatrix}, \quad (8) \]
where $\widehat{C}_{k,i}$ is a function of sample variability and sensitivity matrices and block-specific estimators $\widehat{\theta}_{jk}$ and $\widehat{\zeta}_{jk}$, defined in detail in Section 5.1. If we do not plan to conduct inference for $\zeta$, which is treated as a nuisance parameter, taking $[\widehat{C}^{-1}]_{p:}$ to be rows 1 to $p$ of the matrix $(\sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i})^{-1}$ leads to the closed-form estimator of $\theta$:
\[ \widehat{\theta}_{DDIMM} = [\widehat{C}^{-1}]_{p:} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \big( \widehat{\theta}_{ik}^T ~ \widehat{\zeta}_{list}^T \big)^T. \quad (9) \]
We briefly define sample sensitivity matrices that will appear in the main body of the paper. Let $S_{\theta\psi_{jk}}(\theta, \zeta_{jk})$ be a $n_k^{1/2}$-consistent sample estimator of $s_{\theta\psi_{jk}}(\theta, \zeta_{jk})$, and similarly define $S_{\zeta\psi_{jk}}(\theta, \zeta_{jk})$, $S_{\theta g_{jk}}(\theta, \zeta_{jk})$ and $S_{\zeta g_{jk}}(\theta, \zeta_{jk})$. Let
\[ S_{jk}(\theta, \zeta_{jk}) = \begin{pmatrix} S_{\theta\psi_{jk}}(\theta, \zeta_{jk}) & S_{\zeta\psi_{jk}}(\theta, \zeta_{jk}) \\ S_{\theta g_{jk}}(\theta, \zeta_{jk}) & S_{\zeta g_{jk}}(\theta, \zeta_{jk}) \end{pmatrix}. \]
Note that the uppercase $S$ denotes the sample sensitivity matrix, and the lowercase $s$ denotes the population sensitivity matrix. Let $\widehat{S}_{jk} = S_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$ and similarly define $\widehat{S}_{\theta\psi_{jk}}$, $\widehat{S}_{\zeta\psi_{jk}}$, $\widehat{S}_{\theta g_{jk}}$ and $\widehat{S}_{\zeta g_{jk}}$.
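Once the blocks report their summary statistics, the meta-estimator (8) is a single weighted combination; the following R sketch (ddimm_combine, our own helper operating on precomputed $\widehat{C}_{k,i}$ matrices) shows the one-step nature of the combination.

    # Closed-form combination in (8): (sum n_k C_ki)^{-1} sum n_k C_ki b_ki,
    # where b_hat[[k]][[i]] stacks (theta_hat_ik, zeta_hat_list).
    ddimm_combine <- function(C_hat, b_hat, n) {
      A <- 0; rhs <- 0
      for (k in seq_along(n)) for (i in seq_along(C_hat[[k]])) {
        A   <- A   + n[k] * C_hat[[k]][[i]]
        rhs <- rhs + n[k] * C_hat[[k]][[i]] %*% b_hat[[k]][[i]]
      }
      solve(A, rhs)     # rows 1:p give theta_DDIMM as in (9)
    }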
Sensitivity formulas are summarized in Table A.1 in Appendix A.1.

The DDIMM estimator in (9) can be implemented in a fully parallelized and scalable computational scheme, where only one pass through each block of data is required. The block analyses are run on parallel CPUs, and return the values of the summary statistics $\{\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}, \psi_{i,jk}(\widehat{\theta}_{jk}; \widehat{\zeta}_{jk}), g_{i,jk}(\widehat{\zeta}_{jk}; \widehat{\theta}_{jk}), \widehat{S}_{jk}\}_{j,k=1}^{J,K}$ to the main computing node, which computes $\widehat{\theta}_{DDIMM}$ in (9) in one step.

5.1 Construction of $\widehat{C}_{k,i}$

We give details on the construction of $\widehat{C}_{k,i}$. Readers may wish to omit this section on a first reading, as these details are not necessary for an understanding of the main body of the paper. We consider the optimal case where the GMM weighting matrix takes the form
\[ W = \widehat{V}_N^{-1} = \begin{pmatrix} \widehat{V}_{N,\psi} & \widehat{V}_{N,\psi g} \\ \widehat{V}_{N,\psi g}^T & \widehat{V}_{N,g} \end{pmatrix}^{-1} = \begin{pmatrix} \widehat{V}_N^{\psi} & \widehat{V}_N^{\psi g} \\ \widehat{V}_N^{\psi g, T} & \widehat{V}_N^{g} \end{pmatrix}. \]
For convenience, we introduce a subsetting operation, with technical details available in Appendix A.2: we let $[\widehat{V}_N^{\psi}]_{ij:k}$ subset the rows for the parameters corresponding to block $(i,k)$ and the columns for the parameters corresponding to block $(j,k)$ of the matrix $\widehat{V}_N^{\psi}$. Similarly define $[\widehat{V}_N^{g}]_{ij:k}$ and $[\widehat{V}_N^{\psi g}]_{ij:k}$. For $\eta \in \{\theta, \zeta\}$, let
\[ \widehat{A}^{\eta}_{k,ij} = \Big( \widehat{S}^{T}_{\theta\psi_{jk}} [\widehat{V}_N^{\psi}]_{ji:k} + \widehat{S}^{T}_{\theta g_{jk}} [\widehat{V}_N^{\psi g, T}]_{ji:k} \Big) \widehat{S}_{\eta\psi_{ik}} + \Big( \widehat{S}^{T}_{\theta\psi_{jk}} [\widehat{V}_N^{\psi g}]_{ji:k} + \widehat{S}^{T}_{\theta g_{jk}} [\widehat{V}_N^{g}]_{ji:k} \Big) \widehat{S}_{\eta g_{ik}}, \]
\[ \widehat{B}^{\eta}_{k,ij} = \Big( \widehat{S}^{T}_{\zeta\psi_{jk}} [\widehat{V}_N^{\psi}]_{ji:k} + \widehat{S}^{T}_{\zeta g_{jk}} [\widehat{V}_N^{\psi g, T}]_{ji:k} \Big) \widehat{S}_{\eta\psi_{ik}} + \Big( \widehat{S}^{T}_{\zeta\psi_{jk}} [\widehat{V}_N^{\psi g}]_{ji:k} + \widehat{S}^{T}_{\zeta g_{jk}} [\widehat{V}_N^{g}]_{ji:k} \Big) \widehat{S}_{\eta g_{ik}}. \]
Define $D_{ik}$ as the sum of the dimensions of $\zeta_{1k}, \ldots, \zeta_{(i-1)k}$, and $D_k^0$ as the sum of the dimensions of $\zeta_{11}, \ldots, \zeta_{J(k-1)}$, with technical details in Appendix A.3. Let $d_k = \sum_{j=1}^J d_{jk}$. Then we can define the following:
\[ \widehat{C}_{k,i} = \begin{pmatrix}
\sum_{j=1}^J \widehat{A}^{\theta}_{k,ij} & 0_{p \times D_{ik}} & \sum_{j=1}^J \widehat{A}^{\zeta}_{k,ij} & 0_{p \times (d - d_{ik} - D_{ik})} \\
\multicolumn{4}{c}{0_{D_k^0 \times (p+d)}} \\
\widehat{B}^{\theta}_{k,i1} & 0_{d_{1k} \times D_{ik}} & \widehat{B}^{\zeta}_{k,i1} & 0_{d_{1k} \times (d - d_{ik} - D_{ik})} \\
\vdots & \vdots & \vdots & \vdots \\
\widehat{B}^{\theta}_{k,iJ} & 0_{d_{Jk} \times D_{ik}} & \widehat{B}^{\zeta}_{k,iJ} & 0_{d_{Jk} \times (d - d_{ik} - D_{ik})} \\
\multicolumn{4}{c}{0_{(d - d_k - D_k^0) \times (p+d)}}
\end{pmatrix}. \quad (10) \]

5.2 Asymptotic results for K and J fixed

In this section we assume that $K$ and $J$ are fixed, which will be relaxed in Sections 5.3 and 5.4. Recall that we assume subjects are monotonically allocated to subject groups: as $n_{\min} \to \infty$, a subject cannot be reallocated to another group once it has been assigned to a subject group. Consider the following condition:

(A.5) For each $j = 1, \ldots, J$, $k = 1, \ldots, K$, $\widehat{\theta}_{jk} = \theta_0 + O_p(n_k^{-1/2})$ and $\widehat{\zeta}_{jk} = \zeta_{0,jk} + O_p(n_k^{-1/2})$. For any $\delta_N \to 0$,
\[ \sup_{\|(\theta,\zeta) - (\theta_0,\zeta_0)\|_2 \leq \delta_N} \frac{N^{1/2}}{1 + N^{1/2}\|(\theta,\zeta) - (\theta_0,\zeta_0)\|_2} \big\| T_N(\theta, \zeta) - T_N(\theta_0, \zeta_0) - E_{\theta,\zeta}\, T_N(\theta, \zeta) \big\|_2 = O_p(N^{-1/2}). \]

Consequently, some large-sample results can be established which are helpful in studying the asymptotic behaviour of $\widehat{\theta}_{DDIMM}$.

Lemma 1.
Suppose assumptions (A.1), (A.2) and (A.5) hold. Then we have consistent estimation of the information matrices: $\widehat{V}_N = v(\theta_0, \zeta_0) + O_p(N^{-1/2})$, $\widehat{S}_{jk} = s_{jk}(\theta_0, \zeta_{0,jk}) + O_p(n_k^{-1/2})$ for each $j$, $k$, and
\[ \frac{1}{N} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} = \widehat{S}^T \widehat{V}_N^{-1} \widehat{S} = j(\theta_0, \zeta_0) + O_p(N^{-1/2}), \]
where
\[ \widehat{S} = \begin{pmatrix} S_{JK}\big(\frac{n_k}{N} \widehat{S}_{\theta\psi_{jk}}\big) & \mathrm{diag}\big\{\frac{n_k}{N} \widehat{S}_{\zeta\psi_{jk}}\big\}_{j=1,k=1}^{J,K} \\ S_{JK}\big(\frac{n_k}{N} \widehat{S}_{\theta g_{jk}}\big) & \mathrm{diag}\big\{\frac{n_k}{N} \widehat{S}_{\zeta g_{jk}}\big\}_{j=1,k=1}^{J,K} \end{pmatrix}. \]

Proof.
A detailed proof is given in the Supplemental Material.

We show in Theorem 5 that the proposed closed-form estimator $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ in (8) is consistent and asymptotically normally distributed.
Theorem 5.
Suppose assumptions (A.1), (A.2) and (A.5) hold. Let $j(\theta_0, \zeta_0)$ be as given in Theorem 2. As $n_{\min} \to \infty$,
\[ N^{1/2} \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} \to_d \mathcal{N}\big(0,\, j^{-1}(\theta_0, \zeta_0)\big). \]

Proof of Theorem 5: Here we present the major steps, with all necessary details available in Appendix B.1. First, we show that $\widehat{\theta}_{DDIMM}$ and $\widehat{\zeta}_{DDIMM}$ are consistent. Define
\[ \lambda(\theta, \zeta) = \frac{1}{N} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \begin{pmatrix} \theta - \widehat{\theta}_{ik} \\ \zeta - \widehat{\zeta}_{list} \end{pmatrix}. \quad (11) \]
By definition, $\lambda(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM}) = 0$. As shown in Lemma B.1.1 in Appendix B.1, $\lambda(\theta_0, \zeta_0) \to_p 0$ as $n_{\min} \to \infty$. Given that $\nabla_{\theta,\zeta}\, \lambda(\theta, \zeta)$ exists and is nonsingular, for some $(\theta^*, \zeta^*)$ between $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ and $(\theta_0, \zeta_0)$, the first-order Taylor expansion leads to
\[ \lambda(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM}) - \lambda(\theta_0, \zeta_0) = \nabla_{\theta,\zeta}\, \lambda(\theta, \zeta) \big|_{\theta^*,\zeta^*} \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix}, \quad (12) \]
which converges in probability to $0$ as $n_{\min} \to \infty$. This implies that $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM}) \to_p (\theta_0, \zeta_0)$ as $n_{\min} \to \infty$.

Now we derive the distribution of $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$. With a slight abuse of notation, let $\widehat{\theta}_{list} - \theta_0 = S_{JK}(\widehat{\theta}_{jk} - \theta_0)$. We show in Lemma B.1.2 in Appendix B.1 that
\[ \begin{pmatrix} \Psi_{jk}(\theta_0; \zeta_{0,jk}) \\ G_{jk}(\zeta_{0,jk}; \theta_0) \end{pmatrix} = \widehat{S}_{jk} \begin{pmatrix} \widehat{\theta}_{jk} - \theta_0 \\ \widehat{\zeta}_{jk} - \zeta_{0,jk} \end{pmatrix} + O_p(n_k^{-1}). \quad (13) \]
Recall the form of $T_N$ in (3). By the Central Limit Theorem, $N^{1/2}\, T_N(\theta_0, \zeta_0) \to_d \mathcal{N}(0, v(\theta_0, \zeta_0))$. Then, with $\widehat{S}$ defined in Lemma 1, it follows from equation (13) that
\[ N^{1/2}\, \widehat{S}\, \big( (\widehat{\theta}_{list} - \theta_0)^T ~ (\widehat{\zeta}_{list} - \zeta_0)^T \big)^T \to_d \mathcal{N}(0, v(\theta_0, \zeta_0)). \]
Moreover, by Lemma 1 and Slutsky's theorem we have
\[ N^{1/2}\, \big( (\widehat{\theta}_{list} - \theta_0)^T ~ (\widehat{\zeta}_{list} - \zeta_0)^T \big)^T \to_d \mathcal{N}\big(0, j^{-1}(\theta_0, \zeta_0)\big), \]
so that
\[ N^{1/2} \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} = N^{1/2} \Big( \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \Big)^{-1} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta_0 \\ \widehat{\zeta}_{list} - \zeta_0 \end{pmatrix} \]
is asymptotically distributed $\mathcal{N}(0, j^{-1}(\theta_0, \zeta_0))$.

This key theorem allows us to use $\widehat{\theta}_{DDIMM}$, which is more computationally attractive than $\widehat{\theta}_{opt}$ defined in (6), without sacrificing any of the nice asymptotic properties for inference. Additionally, it follows easily from Theorem 5 that, under suitable conditions, the closed-form estimator $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ in (8) has the same asymptotic distribution as, and is asymptotically equivalent to, the GMM estimator $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ in (6).

Corollary 2.
Suppose assumptions (A.1)-(A.5) hold with $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ defined in (6). Then $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ and $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ are asymptotically equivalent: as $n_{\min} \to \infty$,
\[ N^{1/2} \left\| \begin{pmatrix} \widehat{\theta}_{DDIMM} - \widehat{\theta}_{opt} \\ \widehat{\zeta}_{DDIMM} - \widehat{\zeta}_{opt} \end{pmatrix} \right\|_2 \to_p 0. \]

Proof.
A detailed proof is given in the Supplemental Material.

The computation of $\widehat{\theta}_{DDIMM}$ in (9) relies solely on the block-specific estimators $(\widehat{\theta}_{list}, \widehat{\zeta}_{list})$ and values of summary statistics from each block. To guarantee the appropriate asymptotic distribution of $\widehat{\theta}_{DDIMM}$, we assume in condition (A.5) that these block-specific estimators are $n_k^{1/2}$-consistent estimators of the true values, which restricts the scope of possible block-specific inference methods. For inference methods not satisfying the consistency requirement in condition (A.5), it is still possible to use $\widehat{\theta}_{opt}$ in (6).

5.3 Asymptotic results for diverging K with J fixed

We show in Theorem 6 that the asymptotic distribution of $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ remains unchanged as the number of subject groups $K$ grows with the sample size.

Theorem 6.
Suppose $N^{\delta_1 - 1/2} K$ is bounded as $n_{\min} \to \infty$ for a positive constant $\delta_1 < 1/2$, and assumptions (A.1), (A.2) and (A.5) hold. Let $H \in \mathbb{R}^{h \times (p+d)}$ be a matrix of rank $r \in \mathbb{N}$, $h \in \mathbb{N}$, $r \leq h$, with finite maximum singular value $\bar{\sigma}(H) < \infty$. Let $j(\theta_0, \zeta_0)$ be as given in Theorem 2. Then, as $n_{\min} \to \infty$, we show that the limiting value $j_H(\theta_0, \zeta_0)$ of $H j^{-1}(\theta_0, \zeta_0) H^T$ is a positive semi-definite and symmetric variance matrix, and that
\[ N^{1/2} H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} \to_d \mathcal{N}(0, j_H(\theta_0, \zeta_0)). \]

Proof [Proof of Theorem 6] Here we present the major steps, with all necessary details available in Appendix B.2. First, we know that $\|H\|_2 \leq r^{1/2} \bar{\sigma}(H)$. Let $\lambda(\theta, \zeta)$ be defined by (11), such that $\lambda(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM}) = 0$. We show in Lemma B.2.1 in Appendix B.2 that $\|\lambda(\theta_0, \zeta_0)\|_2 = O_p(N^{-1/2-\delta_1} n^{1/2})$ and $\|\{\nabla_{\theta,\zeta}\, \lambda(\theta_0, \zeta_0)\}^{-1}\|_2 = O_p(N^{1/2+\delta_1} n^{-1})$. From the first-order Taylor expansion in (12), we have
\[ \left\| H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} \right\|_2 \leq \|H\|_2\, \big\| (\nabla_{\theta,\zeta}\, \lambda(\theta, \zeta)|_{\theta^*,\zeta^*})^{-1} \big\|_2\, \|\lambda(\theta_0, \zeta_0)\|_2 \leq r^{1/2} \bar{\sigma}(H)\, O_p(n^{-1/2}). \]
Then $H (\widehat{\theta}_{DDIMM}^T, \widehat{\zeta}_{DDIMM}^T)^T - H (\theta_0^T, \zeta_0^T)^T \to_p 0$ as $n_{\min} \to \infty$.

To derive the distribution of $H (\widehat{\theta}_{DDIMM}^T, \widehat{\zeta}_{DDIMM}^T)^T$, first consider an arbitrary $k \in \{1, \ldots, K\}$. For convenience, denote
\[ T_k(\theta, \zeta_k) = S\big( S_J(\Psi_{jk}(\theta; \zeta_{jk})),\, S_J(G_{jk}(\zeta_{jk}; \theta)) \big), \qquad \tau_{i,k}(\theta, \zeta_k) = S\big( S_J(\psi_{i,jk}(\theta; \zeta_{jk})),\, S_J(g_{i,jk}(\zeta_{jk}; \theta)) \big). \]
By the Central Limit Theorem, $n_k^{1/2}\, T_k(\theta_0, \zeta_{0,k}) = n_k^{-1/2} \sum_{i=1}^{n_k} \tau_{i,k}(\theta_0, \zeta_{0,k}) \to_d \mathcal{N}(0, v_k(\theta_0, \zeta_{0,k}))$ as $n_k \to \infty$, where $v_k(\theta_0, \zeta_{0,k}) = Var_{\theta_0,\zeta_{0,k}}\{\tau_{i,k}(\theta_0, \zeta_{0,k})\}$. Define
\[ s_k(\theta, \zeta_k) = \begin{pmatrix} S_J\big(s_{\theta\psi_{jk}}(\theta, \zeta_{jk})\big) & \mathrm{diag}\big\{s_{\zeta\psi_{jk}}(\theta, \zeta_{jk})\big\}_{j=1}^J \\ S_J\big(s_{\theta g_{jk}}(\theta, \zeta_{jk})\big) & \mathrm{diag}\big\{s_{\zeta g_{jk}}(\theta, \zeta_{jk})\big\}_{j=1}^J \end{pmatrix}, \]
and $j_k(\theta, \zeta_k) = s_k^T(\theta, \zeta_k)\, v_k^{-1}(\theta, \zeta_k)\, s_k(\theta, \zeta_k)$. By (13) in the proof of Theorem 5, Lemma 1, and Slutsky's theorem,
\[ n_k^{1/2} \begin{pmatrix} S_J(\widehat{\theta}_{jk} - \theta_0) \\ \widehat{\zeta}_k - \zeta_{0,k} \end{pmatrix} \to_d \mathcal{N}\big(0, j_k^{-1}(\theta_0, \zeta_{0,k})\big). \]
Note that the above vectors are independent for $k = 1, \ldots, K$. We establish in Lemma B.2.2 in Appendix B.2 that, for some affine transformation matrices $E_k$, $k = 1, \ldots, K$, of $0$'s and $1$'s,
\[ \frac{n_k}{N} \sum_{i=1}^J \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta_0 \\ \widehat{\zeta}_{list} - \zeta_0 \end{pmatrix} = \frac{n_k}{N} E_k Z_k + O_p(N^{-1}), \quad \text{and} \quad \frac{n_k}{N} \sum_{i=1}^J \widehat{C}_{k,i} = \frac{n_k}{N} E_k\, j_k(\theta_0, \zeta_{0,k})\, E_k^T + O_p\big(n_k^{1/2} N^{-1}\big), \]
where $n_k^{1/2} Z_k \to_d \mathcal{N}\big(0, j_k^{-1}(\theta_0, \zeta_{0,k})\big)$. It is clear that $j(\theta_0, \zeta_0) = \sum_{k=1}^K (n_k/N)\, E_k\, j_k(\theta_0, \zeta_{0,k})\, E_k^T$. Since $E_k$ has finitely many 1's, $\|E_k\|_2$ is bounded. Since $\|j_k(\theta_0, \zeta_{0,k})\|_2$ is also bounded, $\|j(\theta_0, \zeta_0)\|_2 = O(K n_{\max} N^{-1}) = O(1)$. $j(\theta_0, \zeta_0)$ is positive semi-definite and symmetric, implying that $H j^{-1}(\theta_0, \zeta_0) H^T$ is also positive semi-definite and symmetric. Following
Followingthe monotone convergence theorem, we can write Hj − ( θ , ζ ) H T → j H ( θ , ζ ), where28 H ( θ , ζ ) exists and is a proper variance matrix.Using the fact that λ ( (cid:98) θ DDIMM , (cid:98) ζ DDIMM ) = and K = O ( N / − δ ), we show in LemmaB.2.3 in Appendix B.2 that N / H ( (cid:98) θ DDIMM − θ , (cid:98) ζ DDIMM − ζ ) can be rewritten as H (cid:40) K (cid:88) k =1 n k N E k j k ( θ , ζ k ) E Tk + O p (cid:0) n / N − / − δ (cid:1)(cid:41) − (cid:34) K (cid:88) k =1 (cid:26)(cid:16) n k N (cid:17) / E k n / k Z k (cid:27) + O p (cid:0) N − δ (cid:1)(cid:35) . Since O p ( n / N − / − δ ) = o p (1) and O p ( N − δ ) = o p (1), it follows as in the proof of Theorem5 that as n min → ∞ , N / H (cid:98) θ DDIMM − θ (cid:98) ζ DDIMM − ζ d → N ( , j H ( θ , ζ )) . In practice, Theorem 6 suggests that we can tune our choice of K and n min to attain thedesired trade-off between inference and computational speed: smaller K and larger n min will slow computations but improve estimation and asymptotic normality, whereas larger K and smaller n min will speed computations but worsen estimation and asymptotic normality. K and J In general, asymptotics for diverging J become very complicated and even analyticallyintractable depending on how, and to what extent, the dependence structure evolves as thedimension M of Y goes to infinity ( M → ∞ ). Cox and Reid (2004) propose constructinga pseudolikelihood from marginal densities when the full joint distribution is difficult toconstruct, and discuss asymptotics for increasing response dimensionality. To make theproblem of diverging M tractable, we consider the following regularity conditions:29A.6) Stationarity: for each M ∗ ∈ N and each ( M ∗ + 1)-dimensional measurableset B a subset of the sample space of Y , the distribution of Y i satisfies P { ( Y i,r , . . . , Y i,r + M ∗ ) ∈ B } = P { ( Y i, , . . . , Y i,M ∗ ) ∈ B } for every r ∈ N .(A.7) Let C k,i be the version of (cid:98) C k,i in (10) evaluated at the true values θ , ζ jk . For k = 1 , . . . , K , i = 1 , . . . , J , ( (cid:80) Kl =1 (cid:80) Jj =1 n l C l,j ) − n k C k,i = O p ( N − δ ) for a constant0 ≤ δ ≤ /
This can be thought of as a type of Lindeberg condition.

(A.8) Conditions required for the asymptotically normal distribution and efficiency of the GMM estimator $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$; see Theorem 5.4 in Donald et al. (2003) and the spanning condition in Newey (2004). See Newey (2004) for related work on semiparametric efficiency of the GMM estimator as the number of moment conditions goes to infinity.

Remark 3.
Condition (A.6) is typical for consistency and asymptotic normality of the GMM estimator $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$, following Hansen (1982) and Newey (2004). It is a typical condition for the application of the central limit theorem to stochastic processes, i.e. to infinite dimensional random vectors. Additionally, in order to make statements about convergence in probability, (A.6) is required to ensure a valid joint probability distribution as the dimension $M$ increases.

Remark 4.
Condition (A.7) ensures the covariance of the outcome $Y_i$ is appropriately controlled as $M \to \infty$. Alternative conditions may be considered, such as $\alpha$-mixing (Bradley (1985)), $\rho$-mixing (Peligrad (1986)), or $\phi$-mixing (Peligrad (1986)), but this is beyond the scope of this paper. Condition (A.7) can be simplified for the case where $n_k = n$ for all $k = 1, \ldots, K$. Then (A.7) becomes $(\sum_{l=1}^K \sum_{j=1}^J C_{l,j})^{-1} C_{k,i} = O_p(N^{-\delta})$.

In Theorem 7 we show the consistency and asymptotic normality of the DDIMM estimator as $K$ and $J$ diverge to $\infty$.

Theorem 7. Suppose $N^{\delta_1} n_{\min}^{-1}$ and $N^{\delta_2 - 1/2} KJ$ are bounded as $n_{\min} \to \infty$ for constants $0 \leq \delta_1 \leq 1$ and $0 < \delta_2 < 1/2$ such that $\delta + \delta_2 + \delta_1/2 > 1$. Suppose assumptions (A.1), (A.2), and (A.5)-(A.8) hold. Let $H \in \mathbb{R}^{h \times (p+d)}$ be a matrix of rank $r \in \mathbb{N}$, $h \in \mathbb{N}$, $r \leq h$, with finite maximum singular value $\bar{\sigma}(H) < \infty$. Let $j_H(\theta_0, \zeta_0)$ be as given in Theorem 6. Then as $n_{\min} \to \infty$,
\[ N^{1/2} H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} \to_d \mathcal{N}(0, j_H(\theta_0, \zeta_0)). \]

Proof
Write
\[ H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} = H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \widehat{\theta}_{opt} \\ \widehat{\zeta}_{DDIMM} - \widehat{\zeta}_{opt} \end{pmatrix} + H \begin{pmatrix} \widehat{\theta}_{opt} - \theta_0 \\ \widehat{\zeta}_{opt} - \zeta_0 \end{pmatrix}. \]
To show the asymptotic distribution of the left-hand side, it is sufficient to show that $H (\widehat{\theta}_{DDIMM}^T - \widehat{\theta}_{opt}^T, \widehat{\zeta}_{DDIMM}^T - \widehat{\zeta}_{opt}^T)^T = o_p(N^{-1/2})$.

Given the assumptions of the theorem, we have the asymptotic distribution of $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt,ik})$ and $(\widehat{\theta}_{ik}, \widehat{\zeta}_{ik})$: both are consistent estimators of $(\theta_0, \zeta_{0,ik})$ and asymptotically normally distributed with rates $N^{-1/2}$ and $n_k^{-1/2}$ respectively. Then for each $k \in \{1, \ldots, K\}$,
\[ \begin{pmatrix} \widehat{\theta}_{opt} - \widehat{\theta}_{ik} \\ \widehat{\zeta}_{opt,ik} - \widehat{\zeta}_{ik} \end{pmatrix} = \begin{pmatrix} \widehat{\theta}_{opt} - \theta_0 \\ \widehat{\zeta}_{opt,ik} - \zeta_{0,ik} \end{pmatrix} - \begin{pmatrix} \widehat{\theta}_{ik} - \theta_0 \\ \widehat{\zeta}_{ik} - \zeta_{0,ik} \end{pmatrix} = O_p(n_k^{-1/2}). \]
Defining $\widehat{C}^*_{k,i}$, a subset of $\widehat{C}_{k,i}$, in Appendix A.4, we can rewrite $(\widehat{\theta}_{DDIMM}^T - \widehat{\theta}_{opt}^T, \widehat{\zeta}_{DDIMM}^T - \widehat{\zeta}_{opt}^T)^T$ as follows:
\[ \Big( \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \Big)^{-1} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \widehat{\theta}_{opt} \\ \widehat{\zeta}_{list} - \widehat{\zeta}_{opt} \end{pmatrix} = \sum_{k=1}^K \sum_{i=1}^J \Big( \sum_{l=1}^K \sum_{j=1}^J n_l \widehat{C}_{l,j} \Big)^{-1} n_k \widehat{C}^*_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \widehat{\theta}_{opt} \\ \widehat{\zeta}_{ik} - \widehat{\zeta}_{opt,ik} \end{pmatrix} \]
\[ = \sum_{k=1}^K \sum_{i=1}^J \big[ O_p(N^{-\delta})\, O_p(n_k^{-1/2}) \big] = O_p\big(KJ\, N^{-\delta}\, n^{-1/2}\big) = O_p\big(N^{1/2-\delta_2}\, N^{-\delta}\, N^{-\delta_1/2}\big) = O_p\big(N^{1/2-\delta-\delta_2-\delta_1/2}\big) = o_p(N^{-1/2}). \]

In this section we consider two sets of simulations to examine the performance of the closed-form estimator $\widehat{\theta}_{DDIMM}$ under the linear regression setting $\mu_i = X_i \theta$, where $\mu_i = E(Y_i \mid X_i, \theta)$ and $Y_i \sim \mathcal{N}(X_i \theta, \Sigma)$. The first set illustrates the finite sample performance and the properties in Theorem 5 of $\widehat{\theta}_{DDIMM}$ with fixed sample size $N$, varying number of subject groups $K$, varying dimensions $M$ of $Y$, and fixed number of response blocks $J$. The second set of simulations illustrates the performance and the properties in Theorem 7 of $\widehat{\theta}_{DDIMM}$ with growing sample size $N$ and response dimension $M$ of $Y$, and varying numbers of subject groups $K$ and response blocks $J$. In both settings, covariates consist of an intercept and two independently simulated $M$-dimensional multivariate normal variables, and the true value of $\theta$ is set to a fixed $\theta_0$. Simulations are conducted using R software on a standard Linux cluster.

We describe the first set of simulations. We specify $\Sigma = S \otimes A$ with a nested correlation structure, where $\otimes$ denotes the Kronecker product, $A$ is an AR(1) covariance matrix with standard deviation $\sigma = 4$ and correlation $\rho = 0.8$, and $S$ is a randomly simulated $J \times J$ positive-definite matrix.
8, and S is a randomly simulated J × J positive-definite matrix. We consider varying dimensions M of Y with fixed J = 5, and afixed sample size N = 5 ,
000 with varying K = 1 , ,
5. We consider two supervised learningprocedures: the pairwise composite likelihood using our own package, and the GEE using Rpackage geepack and our own package (see Supplemental Material). With each procedure,32 l ll l ll l ll l l l l ll l ll l ll l l l l ll l ll l ll l l
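As an illustration of this design, a minimal data-generating sketch is given below. It is not the authors' simulation code: the way $S$ is drawn, the block size $M/J$, and the values in `theta` are illustrative assumptions.

```r
# A minimal sketch of the first simulation design (not the paper's own code).
# Assumptions: M is a multiple of J; S is drawn as an arbitrary random
# positive-definite matrix; theta holds hypothetical placeholder values.
library(mvtnorm)  # for rmvnorm()

N <- 5000; M <- 200; J <- 5
m <- M / J                                   # response dimension per block
sigma <- 4; rho <- 0.8

A <- sigma^2 * rho^abs(outer(1:m, 1:m, "-")) # AR(1) covariance block
Z <- matrix(rnorm(J * J), J, J)
S <- cov2cor(crossprod(Z) + diag(J))         # a random J x J positive-definite matrix
Sigma <- kronecker(S, A)                     # nested covariance Sigma = S (x) A

theta <- c(0.3, 0.6, 0.9)                    # hypothetical true values
dat <- lapply(1:N, function(i) {
  Xi <- cbind(1, rnorm(M), rnorm(M))         # intercept + two M-variate normal covariates
  yi <- drop(Xi %*% theta) + drop(rmvnorm(1, sigma = Sigma))
  list(X = Xi, y = yi)
})
```

Cutting each $y_i$ into consecutive stretches of $m$ coordinates then yields the $J$ response blocks analyzed in parallel.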
Figure 1: Plot of simulation metrics (BIAS, ASE, ESE and RMSE for the intercept, X1 and X2, against response dimension $M$, for $K = 1, 2, 5$) for GEE, averaged over 1,000 simulations.

With each procedure, we fit the model with an AR(1) working block correlation structure. Results for the GEE are in Figure 1; results for the pairwise composite likelihood (CL) are in the Supplemental Material. We see that the mean asymptotic standard error (ASE) of $\widehat{\theta}_{DDIMM}$ approximates the empirical standard error (ESE) for all models, with slight variations due to the type of covariates simulated. This indicates that the covariance formula in Theorem 5 is correct. Additionally, $\widehat{\theta}_{DDIMM}$ appears consistent, since the root mean squared error (RMSE), ASE and ESE are approximately equal. Moreover, we notice that the ASE of $\widehat{\theta}_{DDIMM}$ decreases as the response dimension $M$ increases. This makes intuitive sense, since an increase in $M$ corresponds to an increase in the overall number of observations, resulting in increased power. We also see a decrease in the ASE as the number of subject groups increases, which is due to the heterogeneity of block covariance parameters. Lastly, we observe from Table 2 that the mean CPU time is very fast for the GEE, and decreases substantially as the number of subject groups increases.

Response dimension | Number of subject groups
                   | K=1     | K=2   | K=5
M=200              | 45      | 23    | 11
M=500              | 351     | 184   | 87
M=1,000            | 1,956   | 961   | 417

Table 2: Mean CPU time in seconds for each setting with the GEE block analysis, averaged over 1,000 simulations. Mean CPU time is computed as the maximum CPU time taken over parallelized block analyses, added to the CPU time taken by the rest of the procedure.

We describe the second set of simulations, where we consider diverging sample size $N$ and response dimension $M$, and diverging numbers of subject groups $K$ and response blocks $J$. We consider two settings: in Setting I, we let the sample size $N = 5{,}000$ with number of subject groups $K = 1$, and let the response dimension $M = 4{,}500$ with number of response blocks $J = 6$; in Setting II, we let the sample size $N = 10{,}000$ with number of subject groups $K = 2$, and let the response dimension $M = 9{,}000$ with number of response blocks $J = 12$. Responses are simulated from a multivariate normal distribution with AR(1) covariance structure, with standard deviation $\sigma = 6$ and correlation $\rho = 0.8$. This means there are no heterogeneous block parameters, so we expect a slightly less efficient estimator, since there is less variability in the outcome. We learn mean and covariance parameters using the GEE with an AR(1) working block correlation structure; a sketch of one block analysis is given below. Mean bias (BIAS), RMSE, ESE and ASE of $\widehat{\theta}_{DDIMM}$ are reported in Table 3. We observe that RMSE, ESE and ASE are very close, indicating appropriate estimation of $\widehat{\theta}_{DDIMM}$ and of its covariance given in Theorem 7. We also confirm DDIMM's ability to handle large sample size $N$ and response dimension $M$.
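For a single block, the GEE learning step can be carried out with the R package geepack. The following is a minimal sketch, not the paper's implementation: it assumes the block's data have been reshaped to long format in a hypothetical data frame `block_long` with columns `y`, `x1`, `x2` and a subject identifier `id`, ordered by repeated measurement within subject.

```r
# Minimal sketch of one block's GEE analysis with an AR(1) working structure.
# block_long and its column names are hypothetical.
library(geepack)

fit <- geeglm(y ~ x1 + x2, id = id, data = block_long,
              family = gaussian, corstr = "ar1")

theta_hat_block <- coef(fit)                        # block estimate of theta
se_block <- summary(fit)$coefficients[, "Std.err"]  # robust standard errors
```

Each block analysis can run on its own node; only the block-level estimates and their estimated sensitivity and variability matrices need to be returned for the combination step.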
Table 3: RMSE, BIAS, ESE and ASE of $\widehat{\theta}_{DDIMM}$ for each setting and each covariate, averaged over 500 simulations.
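The summary measures reported above can be computed from replicate-level output; the following is a minimal sketch, assuming `est` and `se` are $R \times p$ matrices of estimates and model-based standard errors over $R$ simulation replicates, and `theta0` is the true parameter vector (all three names are hypothetical).

```r
# Sketch of the reported summary metrics; est, se and theta0 are hypothetical inputs.
BIAS <- colMeans(est) - theta0                   # mean bias
ESE  <- apply(est, 2, sd)                        # empirical standard error
ASE  <- colMeans(se)                             # mean asymptotic standard error
RMSE <- sqrt(colMeans(sweep(est, 2, theta0)^2))  # root mean squared error
```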
DISCUSSION

We have presented large sample theory as a theoretical guarantee for the Doubly Distributed and Integrated Method of Moments (DDIMM), which incorporates a broad class of supervised learning procedures into a doubly distributed and parallelizable computational scheme for the efficient analysis of large samples of high-dimensional correlated responses in the MapReduce framework. Theoretical challenges related to combining correlated estimators were addressed in the proofs, including the asymptotic properties of the proposed closed-form estimator with fixed and diverging numbers of subject groups and response blocks.

The GMM approach to deriving the combined estimator $(\widehat{\theta}_c, \widehat{\zeta}_c)$ proposed in (4) requires only weak regularity of the estimating equations $\Psi_{jk}$ and $G_{jk}$. These assumptions are satisfied by a broad range of learning procedures. The closed-form estimator proposed in equation (9), on the other hand, requires local $n_k^{1/2}$-consistent estimators in individual blocks of size $n_k$, which is easily satisfied if $\Psi_{jk}$ and $G_{jk}$ are regular (see Song (2007), Chapter 3.5, for a definition of regular inference functions). This restricts the class of possible learning procedures, but still includes many analyses of interest.

A detailed discussion of the limitations and trade-offs of the single-split DIMM with CL block analyses is featured in Hector and Song (2019). As mentioned in Section 5, the DDIMM introduces additional flexibility in trading off between computational speed and inference: the number of subject groups $K$ and the smallest block size $n_{\min}$ can be chosen by the investigator to attain the desired speed and efficiency.

Particular applications of DDIMM to time series data are immediately obvious. Similarly, we envision potential applications to nation-wide hospital daily visit numbers of, for example, asthma patients over the course of the last decade. One could split the response (hospital daily intake or daily stock price) into $J$ years and into $K$ groups (of hospitals or stocks), analyze blocks separately and in parallel using GEE, and combine results using DDIMM; the combination step is sketched at the end of this section. Finally, extensions of our work to stochastic process modelling are accessible, with more challenging work involving regularization of $\theta$ also of interest.
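To make the combination step concrete, the closed-form meta-estimator stacks block estimates and weights them by the matrices $\widehat{C}_{k,i}$, as in the proof of Theorem 7. A minimal sketch follows; the list structures `est` and `C_hat` and the vector `n` of group sizes are hypothetical inputs, assumed to hold the stacked block estimates of $(\theta, \zeta)$ and their weight matrices.

```r
# Sketch of the closed-form DDIMM combination:
# (sum_k sum_i n_k C_hat[k,i])^{-1} sum_k sum_i n_k C_hat[k,i] est[k,i].
combine_ddimm <- function(est, C_hat, n) {
  K <- length(est); J <- length(est[[1]])
  pd <- length(est[[1]][[1]])        # dimension of the stacked (theta, zeta)
  lhs <- matrix(0, pd, pd)
  rhs <- numeric(pd)
  for (k in seq_len(K)) for (i in seq_len(J)) {
    W <- n[k] * C_hat[[k]][[i]]
    lhs <- lhs + W
    rhs <- rhs + drop(W %*% est[[k]][[i]])
  }
  drop(solve(lhs, rhs))              # combined closed-form estimator
}
```

Only the block summaries enter this step, so the full data never need to be reloaded on a common server.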
A Technical details

A.1 Summary of sensitivity matrix formulas
Sensitivity matrices are summarized in Table A.1.

sensitivity of             | w.r.t.*               | population                                   | sample                                       | plug-in sample
$\psi_{i,jk}$              | $\theta$              | $s^{\theta}_{\psi jk}(\theta, \zeta_{jk})$   | $S^{\theta}_{\psi jk}(\theta, \zeta_{jk})$   | $\widehat{S}^{\theta}_{\psi jk} = S^{\theta}_{\psi jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$
$\psi_{i,jk}$              | $\zeta_{jk}$          | $s^{\zeta}_{\psi jk}(\theta, \zeta_{jk})$    | $S^{\zeta}_{\psi jk}(\theta, \zeta_{jk})$    | $\widehat{S}^{\zeta}_{\psi jk} = S^{\zeta}_{\psi jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$
$g_{i,jk}$                 | $\theta$              | $s^{\theta}_{g jk}(\theta, \zeta_{jk})$      | $S^{\theta}_{g jk}(\theta, \zeta_{jk})$      | $\widehat{S}^{\theta}_{g jk} = S^{\theta}_{g jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$
$g_{i,jk}$                 | $\zeta_{jk}$          | $s^{\zeta}_{g jk}(\theta, \zeta_{jk})$       | $S^{\zeta}_{g jk}(\theta, \zeta_{jk})$       | $\widehat{S}^{\zeta}_{g jk} = S^{\zeta}_{g jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$
$(\psi_{i,jk}, g_{i,jk})$  | $(\theta, \zeta_{jk})$ | $s_{jk}(\theta, \zeta_{jk})$                | $S_{jk}(\theta, \zeta_{jk})$                 | $\widehat{S}_{jk} = S_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$

Table A.1: Summary of sensitivity formulas. Formulas that are not used are marked "—". *"w.r.t." is shorthand for "with respect to".

A.2 Subsetting operation on variability matrices
Operation $[\widehat{V}^{\psi}_N]_{ij:k}$ extracts the submatrix of $\widehat{V}^{\psi}_N$ consisting of rows $\{(i-1) + (k-1)J\}p + 1$ to $\{i + (k-1)J\}p$ and columns $\{(j-1) + (k-1)J\}p + 1$ to $\{j + (k-1)J\}p$. Operation $[\widehat{V}^{g}_N]_{ij:k}$ extracts the submatrix of $\widehat{V}^{g}_N$ consisting of rows $1 + D_{ik}$ to $d_{ik} + D_{ik}$ and columns $1 + D_{jk}$ to $d_{jk} + D_{jk}$. Operation $[\widehat{V}^{\psi g}_N]_{ij:k}$ extracts the submatrix of $\widehat{V}^{\psi g}_N$ consisting of rows $\{(i-1) + (k-1)J\}p + 1$ to $\{i + (k-1)J\}p$ and columns $1 + D_{jk}$ to $d_{jk} + D_{jk}$, where $d_{jk}$ is the dimension of $\zeta_{jk}$ and $D_{jk}$ is defined in Section 5.1. The index arithmetic is illustrated in the sketch below.
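A small helper makes the first subsetting operation concrete; this is a hypothetical illustration (the function name and arguments are not from the paper), assuming $p$ mean parameters per block and $J$ response blocks per subject group.

```r
# Hypothetical helper for the subsetting operation [V_psi_N]_{ij:k}:
# indices of response block i within subject group k.
psi_block_index <- function(i, k, p, J) {
  (((i - 1) + (k - 1) * J) * p + 1):((i + (k - 1) * J) * p)
}

# Extraction of [V_psi_N]_{ij:k} then reads:
# V_psi_N[psi_block_index(i, k, p, J), psi_block_index(j, k, p, J)]
```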
A.3 Cumulative sum of dimensions of $\zeta$

Recall that we define $D_{ik}$ as the sum of the dimensions of $\zeta_{11}, \ldots, \zeta_{(i-1)k}$, and $D_{1k}$ as the sum of the dimensions of $\zeta_{11}, \ldots, \zeta_{J(k-1)}$. Specifically, let $D_{ik} = \sum_{l=1}^{k-1} \sum_{j=1}^{J} d_{jl} + \sum_{j=1}^{i-1} d_{jk}$ for $i, k > 1$, $D_{1k} = \sum_{l=1}^{k-1} \sum_{j=1}^{J} d_{jl}$ for $k > 1$, and $D_{11} = 0$. Let $D_k = \sum_{l=1}^{k-1} d_l$ for $k > 1$, $D_1 = 0$.

A.4 Definition of $\widehat{C}^*_{k,i}$

Let $k \in \{1, \ldots, K\}$ and $i \in \{1, \ldots, J\}$. Recall the definitions of $\widehat{A}^{\theta}_{k,ij}$, $\widehat{A}^{\zeta}_{k,ij}$, $\widehat{B}^{\theta}_{k,ij}$ and $\widehat{B}^{\zeta}_{k,ij}$ in Section 5.1. Define
$$ \widehat{C}^*_{k,i} = \begin{pmatrix} \sum_{j=1}^{J} \widehat{A}^{\theta}_{k,ij} \quad \sum_{j=1}^{J} \widehat{A}^{\zeta}_{k,ij} \\ 0_{D_{ik} \times (p+d)} \\ \widehat{B}^{\theta}_{k,i1} \quad \widehat{B}^{\zeta}_{k,i1} \\ \vdots \\ \widehat{B}^{\theta}_{k,iJ} \quad \widehat{B}^{\zeta}_{k,iJ} \\ 0_{(d - d_{ik} - D_{ik}) \times (p+d)} \end{pmatrix}. $$

B Additional proofs
B.1 Proof of Theorem 5
The following lemmas complete the proof of Theorem 5 given in the paper, under the assumed conditions.
Lemma B.1.1.
Define $\lambda(\theta, \zeta)$ as in (11) in the proof of Theorem 5. Then $\lambda(\theta, \zeta) \xrightarrow{p} 0$ as $n_{\min} \to \infty$.

Proof.
Using Lemma 1,
$$ \lambda(\theta, \zeta) = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{J} n_k \widehat{C}_{k,i} \begin{pmatrix} \theta - \widehat{\theta}_{ik} \\ \zeta - \widehat{\zeta}_{list} \end{pmatrix} = O_p\left(n_{\min}^{-1/2}\right) \left\{ j(\theta, \zeta) + O_p\left(N^{-1/2}\right) \right\} = O_p\left(n_{\min}^{-1/2}\right) + O_p\left(n_{\min}^{-1/2} N^{-1/2}\right) \xrightarrow{p} 0 \quad \text{as } n_{\min} \to \infty. $$

Lemma B.1.2.
The following relationship holds:
$$ \begin{pmatrix} \Psi_{jk}(\theta; \zeta_{jk}) \\ G_{jk}(\zeta_{jk}; \theta) \end{pmatrix} = \widehat{S}_{jk} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1}\right). $$

Proof.
Let $j \in \{1, \ldots, J\}$ and $k \in \{1, \ldots, K\}$ be fixed. For convenience, denote
$$ T_{jk}(\theta, \zeta_{jk}) = \begin{pmatrix} \Psi_{jk}(\theta; \zeta_{jk}) \\ G_{jk}(\zeta_{jk}; \theta) \end{pmatrix}, \qquad \tau_{i,jk}(\theta, \zeta_{jk}) = \begin{pmatrix} \psi_{i,jk}(\theta; \zeta_{jk}) \\ g_{i,jk}(\zeta_{jk}; \theta) \end{pmatrix}. $$
By a first-order Taylor expansion,
$$ E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) \right\} = E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\theta, \zeta_{jk}) \right\} + \nabla_{\theta, \zeta_{jk}} E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\theta, \zeta_{jk}) \right\} \Big|_{\theta^*, \zeta^*_{jk}} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix}, \qquad (14) $$
where $(\theta^*, \zeta^*_{jk})$ lies between $(\theta, \zeta_{jk})$ and $(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$. By condition (A.5),
$$ T_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) - T_{jk}(\theta, \zeta_{jk}) - E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) \right\} = O_p\left(N^{-1/2}\right) \frac{1 + N^{1/2} O_p\left(n_k^{-1/2}\right)}{N^{1/2}} = O_p\left(n_k^{-1/2} N^{-1/2}\right). \qquad (15) $$
In other words, the norm of the difference between $T_{jk}(\theta, \zeta_{jk})$ and $T_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) - E_{\theta, \zeta_{jk}}\{ \tau_{i,jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) \}$ goes to 0 at a rate faster than $(N n_k)^{-1/2}$. Adding (14) and (15), we have
$$ -T_{jk}(\theta, \zeta_{jk}) = T_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) - T_{jk}(\theta, \zeta_{jk}) - E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\theta, \zeta_{jk}) \right\} = \nabla_{\theta, \zeta_{jk}} E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\theta, \zeta_{jk}) \right\} \Big|_{\theta^*, \zeta^*_{jk}} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1/2} N^{-1/2}\right) = -s_{jk}(\theta^*, \zeta^*_{jk}) \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1/2} N^{-1/2}\right). $$
Rearranging yields
$$ T_{jk}(\theta, \zeta_{jk}) = s_{jk}(\theta^*, \zeta^*_{jk}) \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1/2} N^{-1/2}\right). \qquad (16) $$
Finally, note that $\widehat{S}_{jk} = s_{jk}(\theta, \zeta_{jk}) + O_p(n_k^{-1/2}) = s_{jk}(\theta^*, \zeta^*_{jk}) + O_p(n_k^{-1/2})$. Plugging this into (16), we have
$$ T_{jk}(\theta, \zeta_{jk}) = \left\{ \widehat{S}_{jk} + O_p\left(n_k^{-1/2}\right) \right\} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1/2} N^{-1/2}\right) = \widehat{S}_{jk} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1}\right). $$

B.2 Proof of Theorem 6
The following lemmas complete the proof of Theorem 6 given in the paper, under the assumed conditions.
Lemma B.2.1.
Define $\lambda(\theta, \zeta)$ as in (11) in the proof of Theorem 5. Then $\|\lambda(\theta, \zeta)\| = O_p\left(N^{-1/2-\delta_2} n_{\max}^{1/2}\right)$ and $\left\| \left\{ \nabla_{\theta, \zeta} \lambda(\theta, \zeta) \right\}^{-1} \right\| = O_p\left(N^{1/2+\delta_2} n_{\max}^{-1}\right)$.

Proof.
Due to the independence between subject groups, $\widehat{V}^{\psi}_N$, $\widehat{V}^{\psi g}_N$ and $\widehat{V}^{g}_N$ are all block diagonal: $\widehat{V}^{\psi}_N = \mathrm{diag}\{\widehat{V}^{\psi}_k\}_{k=1}^K$, $\widehat{V}^{\psi g}_N = \mathrm{diag}\{\widehat{V}^{\psi g}_k\}_{k=1}^K$, and $\widehat{V}^{g}_N = \mathrm{diag}\{\widehat{V}^{g}_k\}_{k=1}^K$. By the independence of subject groups, let
$$ v^{-1}(\theta, \zeta) = \begin{pmatrix} v^{\psi}(\theta, \zeta) & v^{\psi g}(\theta, \zeta) \\ v^{\psi g T}(\theta, \zeta) & v^{g}(\theta, \zeta) \end{pmatrix} = \begin{pmatrix} \mathrm{diag}\left\{\frac{N}{n_k} v^{\psi}_k(\theta, \zeta)\right\}_{k=1}^K & \mathrm{diag}\left\{\frac{N}{n_k} v^{\psi g}_k(\theta, \zeta)\right\}_{k=1}^K \\ \mathrm{diag}\left\{\frac{N}{n_k} v^{\psi g T}_k(\theta, \zeta)\right\}_{k=1}^K & \mathrm{diag}\left\{\frac{N}{n_k} v^{g}_k(\theta, \zeta)\right\}_{k=1}^K \end{pmatrix}. $$
Similar to the proof of Lemma 1, it can easily be shown that for each $k = 1, \ldots, K$, $\widehat{V}^{\psi}_k = (N/n_k) v^{\psi}_k(\theta, \zeta) + O_p(N^{-1/2})$, $\widehat{V}^{\psi g}_k = (N/n_k) v^{\psi g}_k(\theta, \zeta) + O_p(N^{-1/2})$, and $\widehat{V}^{g}_k = (N/n_k) v^{g}_k(\theta, \zeta) + O_p(N^{-1/2})$. Consider an arbitrary $k \in \{1, \ldots, K\}$. Let $(N/n_k) [v^{\psi}_k(\theta, \zeta)]_{ji} = [v^{\psi}(\theta, \zeta)]_{ji:k}$, and similarly define $[v^{\psi g}_k(\theta, \zeta)]_{ji}$ and $[v^{g}_k(\theta, \zeta)]_{ji}$. Then $\widehat{A}^{\theta}_{k,ij} = (N/n_k)\{a^{\theta}_{k,ij} + O_p(n_k^{-1/2})\}$, where $a^{\theta}_{k,ij}$ is defined as
$$ \left\{ s^{\theta T}_{\psi jk}(\theta, \zeta_{jk}) \left[ v^{\psi}_k(\theta, \zeta) \right]_{ji} + s^{\theta T}_{g jk}(\theta, \zeta_{jk}) \left[ v^{\psi g T}_k(\theta, \zeta) \right]_{ji} \right\} s^{\theta}_{\psi ik}(\theta, \zeta) + \left\{ s^{\theta T}_{\psi jk}(\theta, \zeta_{jk}) \left[ v^{\psi g}_k(\theta, \zeta) \right]_{ji} + s^{\theta T}_{g jk}(\theta, \zeta_{jk}) \left[ v^{g}_k(\theta, \zeta) \right]_{ji} \right\} s^{\theta}_{g ik}(\theta, \zeta). $$
We can show similar results for $\widehat{A}^{\zeta}_{k,ij}$, $\widehat{B}^{\theta}_{k,ij}$ and $\widehat{B}^{\zeta}_{k,ij}$. Then we can rewrite
$$ \|\lambda(\theta, \zeta)\| \le \sum_{k=1}^{K} O_p\left(n_k^{1/2} N^{-1}\right) = O_p\left(K n_{\max}^{1/2} N^{-1}\right) = O_p\left(N^{-1/2-\delta_2} n_{\max}^{1/2}\right), $$
and
$$ \|\nabla_{\theta, \zeta} \lambda(\theta, \zeta)\| \le \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{J} n_k \left\| \widehat{C}_{k,i} \right\| \le O_p\left(N^{-1/2-\delta_2} n_{\max}^{1/2}\right) + O\left(N^{-1/2-\delta_2} n_{\max}\right) = O_p\left(N^{-1/2-\delta_2} n_{\max}\right). $$
Since $\nabla_{\theta, \zeta} \lambda(\theta, \zeta)$ is symmetric positive-definite, the above provides a bound on its eigenvalues. Therefore, $\left\| \left\{ \nabla_{\theta, \zeta} \lambda(\theta, \zeta) \right\}^{-1} \right\| = O_p\left(N^{1/2+\delta_2} n_{\max}^{-1}\right)$.

Lemma B.2.2.
For some matrices $E_k$, $k = 1, \ldots, K$, of $0$'s and $1$'s, the following asymptotic properties hold:
$$ \frac{n_k}{N} \sum_{i=1}^{J} \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{list} - \zeta \end{pmatrix} = \frac{n_k}{N} E_k Z_k + O_p\left(N^{-1}\right), \quad \text{and} \quad \frac{n_k}{N} \sum_{i=1}^{J} \widehat{C}_{k,i} = \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_k^{1/2} N^{-1}\right), $$
where $n_k^{1/2} Z_k \xrightarrow{d} \mathcal{N}\left(0, j_k^{-1}(\theta, \zeta_k)\right)$.

Proof.
Recall that $\widehat{C}_{k,i} ( \widehat{\theta}_{ik}^T - \theta^T, \widehat{\zeta}_{list}^T - \zeta^T )^T = \widehat{C}^*_{k,i} ( \widehat{\theta}_{ik}^T - \theta^T, \widehat{\zeta}_{ik}^T - \zeta_{ik}^T )^T$. Let $[v_k^{-1}(\theta, \zeta_k)]_{ij}$ subset the rows for the parameters corresponding to block $(i,k)$ and the columns for the parameters corresponding to block $(j,k)$ of the matrix $v_k^{-1}(\theta, \zeta_k)$. Define $j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) = s_{jk}(\theta, \zeta_{jk}) [v_k^{-1}(\theta, \zeta_k)]_{ji} s_{ik}(\theta, \zeta_{ik})$, and $[j_k^{-1}(\theta, \zeta_k)]_i$ the submatrix of $j_k^{-1}(\theta, \zeta_k)$ corresponding to the parameters in block $(i,k)$, such that
$$ n_k^{1/2} \left\{ \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \right\} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} \xrightarrow{d} \mathcal{N}\left(0, \left[ j_k^{-1}(\theta, \zeta_k) \right]_i\right). $$
Then, using the results in the proof of Lemma B.2.1, let $E_k$ and $E_{k,i}$ be matrices of $0$'s and $1$'s such that
$$ \frac{n_k}{N} \sum_{i=1}^{J} \widehat{C}_{k,i} = \frac{n_k}{N} E_k \left\{ j_k(\theta, \zeta_k) + O_p\left(n_k^{-1/2}\right) \right\} E_k^T = \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_k^{1/2} N^{-1}\right), $$
and
$$ \frac{n_k}{N} \sum_{i=1}^{J} \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{list} - \zeta \end{pmatrix} = \frac{n_k}{N} E_k \sum_{i=1}^{J} E_{k,i} \left\{ \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) + O_p\left(n_k^{-1/2}\right) \right\} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} = \frac{n_k}{N} E_k \sum_{i=1}^{J} E_{k,i} \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} + O_p\left(N^{-1}\right). $$
To obtain the desired result, define
$$ Z_k = \sum_{i=1}^{J} E_{k,i} \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix}. $$

Lemma B.2.3. $N^{1/2} H ( \widehat{\theta}_{DDIMM}^T - \theta^T, \widehat{\zeta}_{DDIMM}^T - \zeta^T )^T$ can be rewritten as
$$ H \left\{ \sum_{k=1}^{K} \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_{\max}^{1/2} N^{-1/2-\delta_2}\right) \right\}^{-1} \left[ \sum_{k=1}^{K} \left\{ \left(\frac{n_k}{N}\right)^{1/2} E_k \, n_k^{1/2} Z_k \right\} + O_p\left(N^{-\delta_2}\right) \right]. $$

Proof.
$$ N^{1/2} H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta \\ \widehat{\zeta}_{DDIMM} - \zeta \end{pmatrix} = N^{1/2} H \left( \sum_{k=1}^{K} \sum_{i=1}^{J} \frac{n_k}{N} \widehat{C}_{k,i} \right)^{-1} \sum_{k=1}^{K} \sum_{i=1}^{J} \frac{n_k}{N} \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{list} - \zeta \end{pmatrix} $$
$$ = H \left[ \sum_{k=1}^{K} \left\{ \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_k^{1/2} N^{-1}\right) \right\} \right]^{-1} \cdot \left[ \sum_{k=1}^{K} \frac{n_k}{N^{1/2}} E_k \sum_{i=1}^{J} E_{k,i} \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} + O_p\left(N^{-1/2}\right) \right] $$
$$ = H \left\{ \sum_{k=1}^{K} \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(K n_{\max}^{1/2} N^{-1}\right) \right\}^{-1} \cdot \left[ \sum_{k=1}^{K} \frac{n_k}{N^{1/2}} E_k \sum_{i=1}^{J} E_{k,i} \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} + O_p\left(K N^{-1/2}\right) \right] $$
$$ = H \left\{ \sum_{k=1}^{K} \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_{\max}^{1/2} N^{-1/2-\delta_2}\right) \right\}^{-1} \left[ \sum_{k=1}^{K} \left\{ \left(\frac{n_k}{N}\right)^{1/2} E_k \, n_k^{1/2} Z_k \right\} + O_p\left(N^{-\delta_2}\right) \right]. $$

C Bibliography
Bai, Y., Kang, J., and Song, P. X.-K. (2014). Efficient pairwise composite likelihood estimation for spatial-clustered data. Biometrics, 70(3):661–670.

Bodnar, O., Bodnar, T., and Gupta, A. K. (2010). Estimation and inference for dependence in multivariate data. Journal of Multivariate Analysis, 101(4):869–881.

Bradley, R. C. (1985). On the central limit question under absolute regularity. The Annals of Probability, 13(4):1314–1325.

Carey, V., Zeger, S. L., and Diggle, P. (1993). Modelling multivariate binary data with alternating logistic regressions. Biometrika, 80(3):517–526.

Chan, J. S., Kuk, A. Y., Bell, J., and McGilchrist, C. (1998). The analysis of methadone clinic data using marginal and conditional logistic models with mixture of random effects. Australian and New Zealand Journal of Statistics, 40(1):1–10.

Chen, X. and Xie, M. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24:1655–1684.

Cox, D. R. and Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika, 91(3):729–737.

Donald, S. G., Imbens, G. W., and Newey, W. K. (2003). Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics, 117(1):55–93.

Fu, L. and Wang, Y.-G. (2012). Quantile regression for longitudinal data with a working correlation model. Computational Statistics and Data Analysis, 56(8):2526–2538.

Han, P. and Song, P. X.-K. (2011). A note on improving quadratic inference functions using a linear shrinkage approach. Statistics and Probability Letters, 81(3):438–445.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029–1054.

Hector, E. C. and Song, P. X.-K. (2019). A distributed and integrated method of moments for high-dimensional correlated data analysis. arXiv preprint, arXiv:1910.02986.

Heyde, C. C. (1997). Quasi-likelihood and Its Application: A General Approach to Optimal Parameter Estimation. Springer Series in Statistics.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.

Huber, P. J. (2009). Robust Statistics. Wiley Series in Probability and Statistics, 2nd edition.

Jin, Z. (2011). Aspects of Composite Likelihood Inference. PhD thesis, University of Toronto.

Joe, H. (2014). Dependence Modeling with Copulas. Chapman & Hall, first edition.

Johnstone, I. M. and Titterington, D. M. (2009). Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4237–4253.

Jung, S.-H. (1996). Quasi-likelihood for median regression models. Journal of the American Statistical Association, 91(433):251–257.

Khezr, S. N. and Navimipour, N. J. (2017). MapReduce and its applications, challenges and architecture: a comprehensive review and directions for future research. Journal of Grid Computing, 15(3):295–321.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22.

Liang, K.-Y., Zeger, S. L., and Qaqish, B. (1992). Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society, Series B, 54(1):3–40.

Lin, D.-Y. and Zeng, D. (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97(2):321–332.

Lin, N. and Xi, R. (2011). Aggregated estimating equation estimation. Statistics and Its Interface, 4(1):73–83.

Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics, 80:220–239.

Liu, D., Liu, R. Y., and Xie, M. (2015). Multivariate meta-analysis of heterogeneous studies using only summary statistics: efficiency and robustness. Journal of the American Statistical Association, 110(509):326–340.

Lu, X. and Fan, Z. (2015). Weighted quantile regression for longitudinal data. Computational Statistics, 30(2):569–592.

Mackey, L., Talwalkar, A., and Jordan, M. I. (2011). Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems 24, pages 1134–1142.

Masarotto, G. and Varin, C. (2012). Gaussian copula marginal regression. Electronic Journal of Statistics, 6:1517–1549.

Newey, W. K. (2004). Efficient semiparametric estimation via moment restrictions. Econometrica, 72(6):1877–1897.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245.

Pan, J. and Mackenzie, G. (2003). On modelling mean-covariance structures in longitudinal studies. Biometrika, 90(1):239–244.

Peligrad, M. (1986). Recent advances in the central limit theorem and its weak invariance principle for mixing sequences of random variables (a survey). In Eberlein, E. and Taqqu, M. S., editors, Dependence in Probability and Statistics. Progress in Probability and Statistics, volume 11. Birkhäuser, Boston, MA.

Secchi, P. (2018). On the role of statistics in the era of big data: a call for a debate. Statistics and Probability Letters, 136:10–14.

Singh, K., Xie, M., and Strawderman, W. E. (2005). Combining information from independent sources through confidence distributions. The Annals of Statistics, 33(1):159–183.

Song, P. X.-K. (2007). Correlated Data Analysis: Modeling, Analytics, and Applications. Springer Series in Statistics.

Song, P. X.-K., Li, M., and Yuan, Y. (2009). Joint regression analysis of correlated data using Gaussian copulas. Biometrics, 65(1):60–68.

Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21(1):5–42.

Wang, Y.-G., Lin, X., and Zhu, M. (2005). Robust estimating functions and bias correction for longitudinal data analysis. Biometrics, 61(3):684–691.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61(3):439–447.

Xie, M. and Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: a review. International Statistical Review, 81(1):3–39.

Yang, C.-C., Chen, Y.-H., and Chang, H.-Y. (2017). Joint regression analysis of marginal quantile and quantile association: application to longitudinal body mass index in adolescents. Journal of the Royal Statistical Society, Series C, 66(5):1075–1090.

Zhang, W., Leng, C., and Tang, C. Y. (2015a). A joint modelling approach for longitudinal studies. Journal of the Royal Statistical Society, Series B, 77(1):219–238.

Zhang, Y., Duchi, J., and Wainwright, M. (2015b). Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16:3299–3340.

Zhao, L. P. and Prentice, R. L. (1990). Correlated binary regression using a quadratic exponential model.