Doubly Distributed Supervised Learning and Inference with High-Dimensional Correlated Outcomes
Emily C. Hector and Peter X.-K. Song*
Department of Biostatistics, University of Michigan
Abstract
This paper presents a unified framework for supervised learning and inference procedures using the divide-and-conquer approach for high-dimensional correlated outcomes. We propose a general class of estimators that can be implemented in a fully distributed and parallelized computational scheme. Modelling, computational and theoretical challenges related to high-dimensional correlated outcomes are overcome by dividing data at both outcome and subject levels, estimating the parameter of interest from blocks of data using a broad class of supervised learning procedures, and combining block estimators in a closed-form meta-estimator that is asymptotically equivalent to the estimator obtained from Hansen (1982)'s generalized method of moments (GMM) but does not require the entire data to be reloaded on a common server. We provide rigorous theoretical justification for the use of distributed estimators with correlated outcomes by studying the asymptotic behaviour of the combined estimator with both fixed and diverging numbers of data divisions. Simulations illustrate the finite sample performance of the proposed method, and we provide an R package for ease of implementation.
Keywords: Divide-and-conquer, Generalized method of moments, Estimating functions, Parallel computing, Scalable computing

* We would like to acknowledge support for this project from the National Science Foundation (NSF DMS 1513595) and the National Institutes of Health (NIH R01ES024732).

1 INTRODUCTION
Although the divide-and-conquer paradigm has been widely used in statistics and computer science, its application with correlated data has been little investigated in the literature. We provide a theoretical justification, with theoretical guarantees, for divide-and-conquer methods with correlated data through a general unified estimating function theory framework. In particular, in this paper we focus on the large sample properties of a class of distributed and integrated estimators for supervised learning and inference with high-dimensional correlated outcomes. We consider $N$ independent observations $\{y_i, X_i\}_{i=1}^N$ where both the sample size $N$ and the dimension $M$ of the response vector $y_i$ may be so big that a direct analysis of the data using conventional methodology is computationally intensive, or even prohibitive. Such data may arise, for example, from imaging measurements of brain activity or from genomic data. Denote by $f(Y_i; X_i, \theta, \Gamma_i)$ the $M$-variate joint parametric distribution of $Y_i$ conditioned on $X_i$, where $\theta$ is the parameter of interest and $\Gamma_i$ contains parameters, such as for high-order dependencies, that may be difficult to model or handle computationally.

Statistical inference with big data can be extremely challenging due to the high volume and high variety of these data, as noted recently by Secchi (2018). In the statistics literature, methodological efforts to date have primarily focused on high-dimensional covariates (i.e. high-dimensional $X_i$) with univariate responses (corresponding to $M = 1$); see Johnstone and Titterington (2009) for an overview of the difficulties and methods in linear regression, and the citations therein for references to the extensive publications in this field. By contrast, little work has focused on high-dimensional correlated outcomes (corresponding to large $M$), which pose an entirely new and different set of methodological challenges stemming from a high-dimensional likelihood. The divide-and-combine paradigm holds promise in overcoming these challenges; see Mackey et al. (2011) and Zhang et al. (2015b) for early examples of the power of divide-and-combine algorithms. Some recent divide-and-combine methods for independent outcomes can be found in Singh et al. (2005), Lin and Zeng (2010), Lin and Xi (2011), Chen and Xie (2014), and Liu et al. (2015), among others.

More recently, Hector and Song (2019) proposed a Distributed and Integrated Method of Moments (DIMM), a divide-and-combine strategy for supervised learning and inference in a regression setting with high-dimensional correlated outcomes $Y$. DIMM splits the $M$ elements of $Y$ into blocks of low-dimensional response subvectors, analyzes these blocks in a distributed and parallelized computational scheme using pairwise composite likelihood (CL), and combines block-specific results using a closed-form meta-estimator in a similar spirit to Hansen (1982)'s seminal generalized method of moments (GMM). DIMM overcomes computational challenges associated with high-dimensional outcomes by running block analyses in parallel and combining block-specific results via a computationally and statistically efficient closed-form meta-estimator. DIMM is easily implemented using MapReduce in the Hadoop framework (Khezr and Navimipour (2017)), where blocks of data are loaded only once and in parallel.
DIMM presents a useful and natural extension of the classical GMM framework, which easily accounts for inter-block dependencies. DIMM also improves on classical meta-estimation, where results from blocks are routinely assumed to be independent. DIMM is still challenged, however, when estimating a homogeneous parameter in the presence of heterogeneous parameters. Additionally, it is also challenged computationally when the sample size $N$ is large: the strategy of dividing high-dimensional vectors of correlated outcomes into blocks is insufficient to address the excessive computational demand, since the sample size remains large in the block analyses. Thus, another division at the subject level is inevitable to mitigate the computational burden arising from matrix inversions and iterative calculations in the block analyses.

This paper proposes a new doubly divided procedure to learn and perform inference for a homogeneous parameter of interest in the presence of heterogeneous parameters with a general class of supervised learning procedures. The additional division at the subject level further speeds up computations in comparison to DIMM and results in a double division of the data, visualized in Table 1: a division of the response $Y$, and a random division of subjects into independent subject groups, resulting in blocks of data with a smaller sample of low-dimensional response subvectors. We consider a general class of supervised learning procedures to analyze these blocks separately and in parallel. Then we establish a GMM-type combination procedure that yields a meta-estimator of the parameter of interest. This proposed estimator is more general than the DIMM estimator in Hector and Song (2019), and thus appealing in many practical settings where analyzing data with both large $M$ and $N$ is challenging. We achieve a doubly divided learning and inference procedure implemented in a distributed and parallelized computational scheme. The proposed class of supervised learning procedures is very general, including many important estimation methods as special cases, such as Fisher's maximum likelihood, Wedderburn (1974)'s quasi-likelihood, Liang and Zeger (1986)'s generalized estimating equations, and Huber (1964)'s M-estimation for robust inference, with possible extensions to semi-parametric and non-parametric models.

The proposed Doubly Distributed and Integrated Method of Moments (DDIMM) not only provides a unified framework for various supervised learning procedures of parameters with heterogeneity under the divide-and-combine paradigm, but also provides key theoretical guarantees for statistical inference, such as consistency and asymptotic normality, while offering significant computational gains when the response dimension $M$ and sample size $N$ are large. These are useful and innovative contributions to the arsenal of tools for high-dimensional correlated data analysis, and to the collection of divide-and-combine algorithms, which have so far concentrated on independently sampled data.
Block        Group 1                                   ...   Group K
             Subject 1       ...  Subject $n_1$        ...   Subject 1       ...  Subject $n_K$
1            $y_{11,11}$     ...  $y_{n_1 1,11}$       ...   $y_{11,1K}$     ...  $y_{n_K 1,1K}$
  ...        ...                                       ...   ...
  $m_1$      $y_{1m_1,11}$   ...  $y_{n_1 m_1,11}$     ...   $y_{1m_1,1K}$   ...  $y_{n_K m_1,1K}$
...          ...                                       ...   ...
J            $y_{11,J1}$     ...  $y_{n_1 1,J1}$       ...   $y_{11,JK}$     ...  $y_{n_K 1,JK}$
  ...        ...                                       ...   ...
  $m_J$      $y_{1m_J,J1}$   ...  $y_{n_1 m_J,J1}$     ...   $y_{1m_J,JK}$   ...  $y_{n_K m_J,JK}$

Table 1: Double division of outcome data on both the dimension of responses (into blocks) and sample size (into groups).

In this paper, we focus on the theoretical aspects of doubly distributed learning and inference, including a goodness-of-fit test based on a $\chi^2$ statistic. We also study consistency and asymptotic normality of the proposed estimator as the number of data divisions diverges. This includes theoretical justifications for distributed inference when the dimension of the response and the number of response divisions diverge, which allows the analysis of highly dense outcome data.

The rest of the paper is organized as follows. Section 2 describes the DDIMM, with examples introduced in Section 3. Section 4 discusses large sample properties of the proposed DDIMM. Section 5 presents the main contribution of the paper, a closed-form meta-estimator and its implementation in a parallel and scalable computational scheme. Section 6 illustrates the DDIMM's finite sample performance with simulations. Section 7 concludes with a discussion. Additional proofs and simulation results are deferred to the Appendices and Supplemental Material. An R package is available in the Supplemental Material.

We begin with some notation. Let $\|\cdot\|_2$ be the $\ell_2$-norm for a $D$-dimensional vector $a$ and a $D_1 \times D_2$-dimensional matrix $A$ defined by, respectively:
\[ \|a\|_2 = \Big(\sum_{d=1}^{D} a_d^2\Big)^{1/2} \text{ for } a = [a_d]_{d=1}^{D} \in \mathbb{R}^{D}, \qquad \|A\|_2 = \Big(\sum_{d_1=1}^{D_1}\sum_{d_2=1}^{D_2} A_{d_1 d_2}^2\Big)^{1/2} \text{ for } A = [A_{d_1 d_2}]_{d_1,d_2=1}^{D_1,D_2} \in \mathbb{R}^{D_1 \times D_2}. \]
We define the stacking operator $S(\cdot)$ for matrices $\{A_{jk}\}_{j=1,k=1}^{J,K}$, $A_{jk} \in \mathbb{R}^{D_{jk} \times D}$, as
\[ S(A_{jk}, A_{j'k'}) = \big(A_{jk}^T ~ A_{j'k'}^T\big)^T \in \mathbb{R}^{(D_{jk} + D_{j'k'}) \times D}, \qquad S_J(A_{jk}) = \big(A_{1k}^T \ldots A_{Jk}^T\big)^T \in \mathbb{R}^{D_k \times D}, \]
\[ S_{JK}(A_{jk}) = \big(A_{11}^T \ldots A_{J1}^T \ldots A_{1K}^T \ldots A_{JK}^T\big)^T \in \mathbb{R}^{\bar{D} \times D}, \]
where $D_k = \sum_{j=1}^J D_{jk}$ and $\bar{D} = \sum_{k=1}^K D_k$.

Consider the collection of samples $\{y_i, X_i\}_{i=1}^N$, where $X_i \in \mathbb{R}^{M \times q}$ is fixed and $Y_i \in \mathbb{R}^M$, $q, M \in \mathbb{N}$. The number of covariates $q$ is considered fixed in this paper. Let $\theta$, $\zeta$ take values in parameter spaces $\Theta \subseteq \mathbb{R}^p$, $\Xi \subseteq \mathbb{R}^d$, both compact subsets of $p$- and $d$-dimensional Euclidean space respectively. Let $p, d \in \mathbb{N}$, and consider $\theta$ to be the parameter of interest, and $\zeta$ to be a potentially large vector of parameters of secondary interest. Let $\theta_0 \in \Theta$, $\zeta_0 \in \Xi$ be the true values of $\theta$ and $\zeta$ respectively. Consider a class $\mathcal{P} = \{P_{\theta,\zeta}\}$ of parametric models with associated estimating functions $\Psi$ of parameter $\theta$ (e.g. $\Psi$ can be the derivative of some objective function). Suppose we want to learn the parameter $\theta$ by finding the root of $\Psi(\theta; y, \zeta) = 0$, which is computationally intensive or even prohibitive due to the large dimension $M$ of $y$, the large sample size $N$, or the large dimension $d$ of $\zeta$.
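To fix ideas, the norm and stacking operations defined above are straightforward to implement; the following is a minimal R sketch (the function names l2_norm and stack_JK are our own illustrations, not part of the accompanying package).

    # l2-norm of a vector, and Frobenius-type l2-norm of a matrix:
    l2_norm <- function(A) sqrt(sum(A^2))

    # S_JK(A_jk): row-stack a list-of-lists of matrices, blocks j = 1,...,J
    # within group k = 1, then k = 2, and so on; A[[k]][[j]] is D_jk x D.
    stack_JK <- function(A) {
      do.call(rbind, lapply(A, function(A_k) do.call(rbind, A_k)))
    }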
We focus on a divide-and-combine approach utilizing modern distributed computing platforms to alleviate the computational and modelling challenges posed by analyzing the whole data.

First, for each subject $i$, DDIMM divides the $M$-dimensional response $y_i$ and its associated covariates into $J$ blocks, denoted by
\[ y_i = \big(y_{i,1}^T \ldots y_{i,J}^T\big)^T \text{ and } X_i = \big(X_{i,1}^T \ldots X_{i,J}^T\big)^T, \quad i = 1, \ldots, N. \]
Division into blocks is not restricted to the order of data entry: responses may be grouped according to pre-specified block memberships derived from, say, substantive scientific knowledge, such as functional regions of the brain. In this paper, with no loss of generality, we use the order of data entry in the data division procedure. Further, DDIMM randomly splits the $N$ independent subjects to form $K$ disjoint subject groups $\{y_{i,jk}, X_{i,jk}\}_{i=1}^{n_k}$, where each group has sample size $n_k$, $k = 1, \ldots, K$, with $\sum_{k=1}^K n_k = N$. Refer to Table 1 for notation detail. For ease of exposition, we henceforth use the term "group" to refer to the division along subjects, and "block" to refer to the division along responses. We also use the term "block" to refer to the division along both responses and subjects.

We call $\{y_{i,jk}, X_{i,jk}\}_{i=1}^{n_k}$ block $(j,k)$, $j = 1, \ldots, J$ and $k = 1, \ldots, K$. Within block $(j,k)$, let $m_j$ be the dimension of the sub-response, $y_{i,jk} = (y_{i1,jk}, \ldots, y_{im_j,jk})^T \in \mathbb{R}^{m_j}$, and $X_{i,jk} \in \mathbb{R}^{m_j \times q}$ the associated covariate matrix, with $\sum_{j=1}^J m_j = M$. For each block $j \in \{1, \ldots, J\}$, we have $K$ independent subject groups $\{y_{i,jk}\}_{i=1,k=1}^{n_k,K}$. In contrast, each group $k \in \{1, \ldots, K\}$ has $n_k$ subjects, and for each subject $i \in \{1, \ldots, n_k\}$ the $J$ response blocks $\{y_{i,jk}\}_{j=1}^J$ are dependent.

The primary task is to solve $\Psi(\theta; y, \zeta) = 0$ to learn the parameter $\theta$ in a supervised way over the entire data. Given the above double data split scheme, this task becomes a divide-and-combine procedure: the first step is to solve the following system of block-specific estimating equations: for $j \in \{1, \ldots, J\}$, $k \in \{1, \ldots, K\}$,
\[ \Psi_{jk}(\theta; y_{jk}, \zeta_{jk}) = 0, \quad (1) \]
\[ G_{jk}(\zeta_{jk}; y_{jk}, \theta) = 0, \quad (2) \]
where $G_{jk}$ is an estimating function used to learn parameters $\zeta_{jk}$ (e.g. correlation parameters) that are allowed to be heterogeneous across blocks such that $\zeta = S_{JK}(\zeta_{jk})$. The true values $(\theta_0, \zeta_{0,jk})$ of $(\theta, \zeta_{jk})$ are the values such that $E_{\theta_0,\zeta_{0,jk}} S\big(\Psi_{jk}(\theta_0; y_{jk}, \zeta_{0,jk}), G_{jk}(\zeta_{0,jk}; y_{jk}, \theta_0)\big) = 0$. Parameters $\zeta_{jk}$ take values in parameter space $\Xi_{jk} \subset \mathbb{R}^{d_{jk}}$ for some $d_{jk} > 0$, with $\zeta_0 = S_{JK}(\zeta_{0,jk})$, $\Xi = \bigtimes_{j=1,k=1}^{J,K} \Xi_{jk}$ and $d = \sum_{k=1}^K \sum_{j=1}^J d_{jk}$. Let $\zeta_k = S_J(\zeta_{jk})$ and $\zeta_{0,k} = S_J(\zeta_{0,jk})$. This approach is similar to GEE2, proposed by Zhao and Prentice (1990), with details also in Liang et al. (1992), where unbiased estimating equations for the nuisance parameters are added in order to guarantee consistency. In this way, we impose homogeneity of the parameter of interest $\theta$ across blocks but allow heterogeneity of the parameters of secondary interest.
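For intuition, the double data division described above can be sketched in a few lines of R; this illustrative helper (double_split, our own naming) assumes equal block sizes and a purely random group assignment.

    # Split an N x M outcome matrix Y into J response blocks of equal size m
    # and K random subject groups; blocks[[k]][[j]] holds block (j, k).
    double_split <- function(Y, J, K) {
      m <- ncol(Y) / J                      # assumes M is divisible by J
      grp <- split(seq_len(nrow(Y)), sample(rep(1:K, length.out = nrow(Y))))
      lapply(grp, function(rows)
        lapply(1:J, function(j)
          Y[rows, ((j - 1) * m + 1):(j * m), drop = FALSE]))
    }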
We assume that the class of parametric models $\mathcal{P}$ yields block-specific estimating functions satisfying the following regularity assumptions:

(A.1) (i) $\Psi_{jk}$ and $G_{jk}$ are unbiased; that is, for all $\theta \in \Theta$, $\zeta_{jk} \in \Xi_{jk}$, $E_{\theta,\zeta_{jk}} S\big(\Psi_{jk}(\theta; Y_{jk}, \zeta_{jk}), G_{jk}(\zeta_{jk}; Y_{jk}, \theta)\big) = 0$.
(ii) $E_{\theta,\zeta_{jk}} S\big(\Psi_{jk}(\theta; Y_{jk}, \zeta_{jk}), G_{jk}(\zeta_{jk}; Y_{jk}, \theta)\big)$ has a unique zero at $(\theta_0, \zeta_{0,jk})$.
(iii) $\Psi_{jk}$ and $G_{jk}$ are additive: for some kernel inference functions $\psi_{jk}$ and $g_{jk}$, they take the form
\[ \begin{pmatrix} \Psi_{jk}(\theta; y_{jk}, \zeta_{jk}) \\ G_{jk}(\zeta_{jk}; y_{jk}, \theta) \end{pmatrix} = \frac{1}{n_k} \sum_{i=1}^{n_k} \begin{pmatrix} \psi_{jk}(\theta; y_{i,jk}, \zeta_{jk}) \\ g_{jk}(\zeta_{jk}; y_{i,jk}, \theta) \end{pmatrix}. \]
We define $\Psi_{jk}$ and $G_{jk}$ as being "weakly regular" based on the above conditions (A.1)(i)-(iii), in which the defining properties of a regular inference function are applied to its mean; see Song (2007), Chapter 3.5, for a definition of regular inference functions. Additional conditions on the class $\mathcal{P}$ will be described throughout the paper where appropriate. Within block $(j,k)$, denote by $\widehat{\theta}_{jk}$ and $\widehat{\zeta}_{jk}$ the joint solution to (1) and (2), estimators of $\theta$ and $\zeta_{jk}$ respectively. For notation purposes, let $\widehat{\theta}_{list} = S_{JK}(\widehat{\theta}_{jk})$, $\widehat{\zeta}_k = S_J(\widehat{\zeta}_{jk})$, and $\widehat{\zeta}_{list} = S_{JK}(\widehat{\zeta}_{jk})$. Due to the homogeneity of $\theta$, the next step is integration of the block-specific estimators $\widehat{\theta}_{jk}$. By contrast, the $\widehat{\zeta}_{jk}$ remain heterogeneous and potentially high-dimensional. In the rest of the paper, for convenience of notation, we suppress the dependence of $\Psi_{jk}$, $G_{jk}$, $\psi_{jk}$ and $g_{jk}$ on $y_{jk}$ and $y_{i,jk}$:
\[ \Psi_{jk}(\theta; \zeta_{jk}) = \Psi_{jk}(\theta; y_{jk}, \zeta_{jk}), \qquad G_{jk}(\zeta_{jk}; \theta) = G_{jk}(\zeta_{jk}; y_{jk}, \theta), \]
\[ \psi_{i,jk}(\theta; \zeta_{jk}) = \psi_{jk}(\theta; y_{i,jk}, \zeta_{jk}), \qquad g_{i,jk}(\zeta_{jk}; \theta) = g_{jk}(\zeta_{jk}; y_{i,jk}, \theta). \]

Integrating the block estimates $\widehat{\theta}_{jk}$ into an estimator of $\theta$, denoted by $\widehat{\theta}_c$, will yield a more efficient estimate of $\theta$. In the integration step, our intuition is to treat each system of equations $S\big(\Psi_{jk}(\theta; \zeta_{jk}), G_{jk}(\zeta_{jk}; \theta)\big) = 0$ as a "moment condition" on $\theta$ contributed by block $(j,k)$, $j = 1, \ldots, J$, $k = 1, \ldots, K$. Technically, we want to derive an estimator $\widehat{\theta}_c$ of $\theta$ that satisfies all $JK$ moment conditions and effectively makes use of the $JK$ estimates of $\theta$ obtained from equations (1) and (2).
To address the issue that $\theta$ is over-identified by the $JK$ moment conditions, we invoke Hansen (1982)'s seminal generalized method of moments (GMM) to combine the moment conditions that arise from each block. Another significant advantage of GMM is that it allows us to incorporate between-block dependencies, which cannot easily be done in classical meta-estimation. To this end, define the subject group indicator $\delta_i(k) = 1$(subject $i$ is in blocks $(j,k)$ for some $k \in \{1, \ldots, K\}$ and for all $j = 1, \ldots, J$) for $i = 1, \ldots, N$, $k = 1, \ldots, K$. For subject $i$, let
\[ \psi_i(\theta; \zeta) = S_{JK}\big(\delta_i(k)\, \psi_{i,jk}(\theta; \zeta_{jk})\big), \qquad g_i(\zeta; \theta) = S_{JK}\big(\delta_i(k)\, g_{i,jk}(\zeta_{jk}; \theta)\big), \]
where clearly only one $S_J\big(\delta_i(k)\, \psi_{i,jk}^T(\theta; \zeta_{jk})\big)$ is non-zero. Let $a^{\otimes 2}$ denote the outer product of a vector $a$ with itself, namely $a^{\otimes 2} = a a^T$. Then we can define $\Psi_N(\theta; \zeta) = (1/N) \sum_{i=1}^N \psi_i(\theta; \zeta)$. It is easy to show that
\[ \Psi_N(\theta; \zeta) = \frac{1}{N} S_{JK}\Big(\sum_{i=1}^{n_k} \psi_{i,jk}(\theta; \zeta_{jk})\Big) = \frac{1}{N} S_{JK}\big(n_k \Psi_{jk}(\theta; \zeta_{jk})\big). \]
Similarly, define $G_N(\zeta; \theta) = (1/N) \sum_{i=1}^N g_i(\zeta; \theta) = (1/N)\, S_{JK}\big(n_k G_{jk}(\zeta_{jk}; \theta)\big)$. Since $\Psi_{jk}$ and $G_{jk}$ satisfy assumptions (A.1) for each $j$ and $k$, $\Psi_N$ and $G_N$ are additive and unbiased, and $E_{\theta,\zeta} S\big(\Psi_N(\theta; \zeta), G_N(\zeta; \theta)\big)$ has a unique zero at $(\theta_0, \zeta_0)$. For convenience, we denote
\[ T_N(\theta, \zeta) = \begin{pmatrix} \Psi_N(\theta; \zeta) \\ G_N(\zeta; \theta) \end{pmatrix}, \qquad \tau_i(\theta, \zeta) = \begin{pmatrix} \psi_i(\theta; \zeta) \\ g_i(\zeta; \theta) \end{pmatrix}. \quad (3) \]
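In code, the zero-padding by the group indicator $\delta_i(k)$ used above simply places subject $i$'s $J$ block kernels in the rows of the $JKp$-vector belonging to its own group; a schematic R version (our own helper) follows.

    # Build psi_i(theta; zeta) for a subject in group k: all entries are zero
    # except the J stacked p-dimensional block kernels of group k.
    psi_i_stacked <- function(psi_blocks_i, k, J, K, p) {
      out <- numeric(J * K * p)
      rows <- ((k - 1) * J * p + 1):(k * J * p)
      out[rows] <- unlist(psi_blocks_i)    # S_J of subject i's block kernels
      out
    }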
We assume that the class $\mathcal{P}$ yields $\psi$, $g$ satisfying the following conditions:

(A.2) (i) Both $\psi_{jk}$ and $g_{jk}$ are Lipschitz continuous in $\theta$ and $\zeta$; namely, for $j \in \{1, \ldots, J\}$, $k \in \{1, \ldots, K\}$, and some constants $c_{jk}, b_{jk} > 0$, for all $(\theta_1, \zeta_{1,jk})$, $(\theta_2, \zeta_{2,jk})$ in a neighbourhood of $(\theta_0, \zeta_{0,jk})$,
\[ \big\| \psi_{i,jk}(\theta_1; \zeta_{1,jk}) - \psi_{i,jk}(\theta_2; \zeta_{2,jk}) \big\|_2 \leq c_{jk} \big\| (\theta_1, \zeta_{1,jk}) - (\theta_2, \zeta_{2,jk}) \big\|_2, \]
\[ \big\| g_{i,jk}(\zeta_{1,jk}; \theta_1) - g_{i,jk}(\zeta_{2,jk}; \theta_2) \big\|_2 \leq b_{jk} \big\| (\theta_1, \zeta_{1,jk}) - (\theta_2, \zeta_{2,jk}) \big\|_2. \]
(ii) The sensitivity matrix $-\nabla_{\theta,\zeta}\, E_{\theta,\zeta}\, \tau_i(\theta, \zeta)$ is continuous in a compact neighbourhood $\mathcal{N}(\theta_0, \zeta_0)$ of $(\theta_0, \zeta_0)$, and positive definite.
(iii) The variability matrix $E_{\theta,\zeta}\big(\tau_i(\theta, \zeta)^{\otimes 2}\big)$ is finite and positive definite.

Note that $T_N(\theta, \zeta) = 0$ has no unique solution because its dimension is bigger than the dimension of $\theta$. To overcome this issue, we follow Hansen's GMM for over-identified parameters. Let $W$ be the weight matrix in the GMM equation (4) below. Classical GMM theory states that any positive semi-definite matrix $W$ can be used to guarantee consistency and asymptotic normality of the resulting estimator, and that an optimal choice of $W$, corresponding to the inverse covariance of the estimating function $T_N$ in (3), leads to an efficient GMM estimator. In our setting, a possible formulation for a GMM estimator of $(\theta, \zeta)$ is
\[ (\widehat{\theta}_c, \widehat{\zeta}_c) = \arg\min_{\theta,\zeta} Q_N(\theta, \zeta \mid W), \text{ where } Q_N(\theta, \zeta \mid W) = T_N^T(\theta, \zeta)\, W\, T_N(\theta, \zeta). \quad (4) \]
In (4), the weight matrix $W$ is a positive semi-definite $(JKp + d) \times (JKp + d)$ matrix. The heterogeneity of $\zeta$ allowed by the use of $G_N$ can lead to theoretical and computational challenges due to the high dimensionality of the parameter, a problem from which GEE2 also suffers. See Chan et al. (1998) and Carey et al. (1993) for a discussion of the computational burden of inverting large matrices in GEE2. Note that the block-specific estimators $\widehat{\zeta}_{list}$ are consistent; the only possible improvement from re-learning $\zeta$ in an iterative procedure between $\widehat{\theta}_c$ and $\widehat{\zeta}_c$ is a gain in efficiency. This is not necessary since $\zeta$ are parameters of secondary interest and their efficiency is in general not of interest. We will derive a closed-form meta-estimator of $\theta$ that avoids re-learning of $\zeta$ in Section 5.

Following the work of Hansen (1982), we define a particular instance of the estimator in (4) by specifying $W$ as the inverse sample covariance of $T_N$. We will show in Section 4 that this choice of $W$ is optimal for the efficiency of the resulting estimator. Let $\widehat{V}_N$ be the sample covariance of $T_N(\theta, \zeta)$:
\[ \widehat{V}_N = \frac{1}{N} \sum_{i=1}^N \big( \tau_i(\widehat{\theta}_{list}, \widehat{\zeta}_{list}) \big)^{\otimes 2} = \frac{1}{N} \sum_{i=1}^N \begin{pmatrix} \psi_i(\widehat{\theta}_{list}; \widehat{\zeta}_{list}) \\ g_i(\widehat{\zeta}_{list}; \widehat{\theta}_{list}) \end{pmatrix}^{\otimes 2}, \quad (5) \]
where $\psi_i(\widehat{\theta}_{list}; \widehat{\zeta}_{list}) = S_{JK}\big(\delta_i(k)\, \psi_{i,jk}(\widehat{\theta}_{jk}; \widehat{\zeta}_{jk})\big)$. Letting $W = \widehat{V}_N^{-1}$ yields the following optimal GMM estimator:
\[ (\widehat{\theta}_{opt}, \widehat{\zeta}_{opt}) = \arg\min_{\theta,\zeta}\, T_N^T(\theta, \zeta)\, \widehat{V}_N^{-1}\, T_N(\theta, \zeta). \quad (6) \]
We assume that $W$ and $\widehat{V}_N$ are nonsingular; see Han and Song (2011) for an optimal weighting matrix with QIF when the sample covariance is ill-defined.
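As a schematic illustration (not the implementation in our package), the objective in (4) with the optimal weight in (6) can be coded directly once the stacked estimating function and the per-subject contributions $\tau_i$ are available; tau_mat and T_N_fun below are hypothetical placeholders.

    # Sample covariance V_hat in (5) from an N x (JKp + d) matrix of tau_i's,
    # and the GMM objective Q_N in (4) with W = V_hat^{-1} as in (6).
    V_hat <- function(tau_mat) crossprod(tau_mat) / nrow(tau_mat)
    Q_N <- function(par, T_N_fun, V) {
      TN <- T_N_fun(par)                   # stacked estimating function at par
      drop(crossprod(TN, solve(V, TN)))    # T_N^T V^{-1} T_N
    }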
Before presenting large-sample properties of $(\widehat{\theta}_c, \widehat{\zeta}_c)$ and $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ in Section 4, we demonstrate in Section 3 the flexibility of our framework through several important supervised learning methods.

We now present five examples to illustrate the flexibility of the unifying framework considered in this paper.

Consider the multidimensional regression model $h(\mu_{i,jk}) = X_{i,jk} (\theta^T ~ \beta_{jk}^T)^T$, where $\mu_{i,jk} = E(Y_{i,jk} \mid X_{i,jk}, \theta, \beta_{jk})$ is the mean vector of $Y_{i,jk}$ given $X_{i,jk}$, $\beta_{jk}$, and the $p$-dimensional parameter $\theta$ ($p \leq q$, the number of covariates, which may include an intercept), and $h$ is a known component-wise link function. Let $\zeta_{jk}$ be parameters of the second-order moments of $Y_{i,jk}$, such as dispersion parameters, and parameters in $\beta_{jk}$ (which may be empty). If the full likelihood of $Y_{i,jk}$ is computationally tractable, $\Psi_{jk}$ and $G_{jk}$ correspond to the score functions, and $\widehat{\theta}_{jk}$ and $\widehat{\zeta}_{jk}$ may be given by the maximum likelihood estimates (MLEs). DDIMM can be applied straightforwardly by following the procedure in Section 2.

If the full likelihood is computationally intractable or difficult to construct, one can instead use pseudo-likelihoods such as the pairwise composite likelihood. The pairwise composite likelihood, originally proposed by Lindsay (1988) and detailed in Varin et al. (2011), provides the following forms of the equations for (1) and (2):
\[ \Psi_{jk}(\theta; \zeta_{jk}) = \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{r=1}^{m_j-1} \sum_{t=r+1}^{m_j} \nabla_\theta \log f_j(y_{ir,jk}; y_{it,jk}; \theta, \zeta_{jk}, X_{i,jk}), \]
\[ G_{jk}(\zeta_{jk}; \theta) = \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{r=1}^{m_j-1} \sum_{t=r+1}^{m_j} \nabla_{\zeta_{jk}} \log f_j(y_{ir,jk}; y_{it,jk}; \theta, \zeta_{jk}, X_{i,jk}), \]
for some bivariate marginal $f_j$ which can be chosen according to the nature of the response data. As long as the bivariate marginals $f_j$ are correctly specified, the composite score functions $\Psi_{jk}$ and $G_{jk}$ satisfy the regularity conditions in (A.1). Hence the DDIMM can be used to overcome the computational challenges related to the MLE and pairwise composite likelihood. We refer readers to Chapter 6 of Song (2007) and Chapter 3 of Joe (2014) for details on constructing multivariate distributions using Gaussian and vine copulas respectively, but note that direct computation of the MLE is computationally very challenging when $m_j \geq 4$.
Examples of applications of Gaussian copulas can be found in Song et al. (2009), Bodnar et al. (2010), Bai et al. (2014), and in the importance sampling algorithm proposed in Masarotto and Varin (2012), among others.
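To make the pairwise composite likelihood concrete, here is a hedged R sketch for one block with Gaussian bivariate margins and a common pair correlation; the function and the parametrization $\zeta_{jk} = (\sigma, \rho)$ are our simplifications for illustration, not the interface of the accompanying package.

    # Pairwise Gaussian composite log-likelihood for block (j, k):
    # y is n x m, X is an n x m x q covariate array, theta has length q.
    pairwise_cl <- function(theta, zeta, y, X) {
      sigma <- zeta[1]; rho <- zeta[2]
      R2 <- sigma^2 * matrix(c(1, rho, rho, 1), 2)   # common 2 x 2 covariance
      ll <- 0
      m <- ncol(y)
      for (r in 1:(m - 1)) for (t in (r + 1):m) {
        res <- cbind(y[, r] - X[, r, ] %*% theta,
                     y[, t] - X[, t, ] %*% theta)
        ll <- ll + sum(mvtnorm::dmvnorm(res, sigma = R2, log = TRUE))
      }
      ll
    }

Block estimates $(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$ would then be obtained by maximizing this objective, e.g. with optim(); the composite score functions in (1) and (2) follow by differentiation.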
More generally, Wedderburn (1974)'s quasi-likelihood is a popular alternative method of supervised learning that does not require a fully specified multidimensional likelihood; it receives a full treatment in Heyde (1997). Consider Liang and Zeger (1986)'s marginal mean model $h(\mu_{i,jk}) = X_{i,jk} (\theta^T ~ \beta_{jk}^T)^T$ for the analysis of longitudinal data, where $\mu_{i,jk} = E(Y_{i,jk} \mid X_{i,jk}, \theta, \beta_{jk})$ is the marginal mean vector of serially correlated outcomes $Y_{i,jk}$ given $X_{i,jk}$, $\beta_{jk}$, and the $p$-dimensional parameter $\theta$ ($p \leq q$), and $h$ is a known component-wise link function. In this setting, $\zeta_{jk}$ consists of parameters in $\beta_{jk}$ (which may be empty), parameters for the variances of $Y_{it,jk}$, $t = 1, \ldots, m_j$, and a nuisance parameter $\alpha_{jk}$ which fully characterizes a working correlation matrix $R_{jk}(\alpha_{jk})$. In the case where $\beta_{jk}$ is empty, the generalized estimating equation (GEE) proposed by Liang and Zeger (1986) yields the kernel inference function $\psi_{jk}(\theta; \zeta_{jk}) = D_{i,jk}^T \Sigma_{i,jk}^{-1} r_{i,jk}$ in (A.1)(iii), where $D_{i,jk} = \nabla_\theta\, \mu_{i,jk}$, $r_{i,jk} = y_{i,jk} - \mu_{i,jk}$, and $\Sigma_{i,jk} = A_{i,jk} R_{jk}(\alpha_{jk}) A_{i,jk}$, where $A_{i,jk} = \mathrm{diag}\big\{(Var(Y_{it,jk}))^{1/2}\big\}_{t=1}^{m_j}$. In GEE2, $G_{jk}$ in (2) is specified as another unbiased inference function satisfying (A.1) and (A.2). DDIMM provides a procedure for the application of distributed methods to high-dimensional longitudinal/clustered data.
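For illustration, the GEE kernel $\psi_{jk} = D_{i,jk}^T \Sigma_{i,jk}^{-1} r_{i,jk}$ can be written out for one subject under an identity link and an AR(1) working correlation; this is a simplified sketch (unit marginal variances assumed), not the internals of any GEE package.

    # GEE kernel for subject i in block (j, k): D^T Sigma^{-1} (y - mu),
    # with identity link (so D = X_i) and AR(1) working correlation R(alpha).
    psi_gee <- function(theta, alpha, y_i, X_i) {
      m  <- length(y_i)
      mu <- X_i %*% theta
      R  <- alpha^abs(outer(1:m, 1:m, "-"))   # AR(1) working correlation
      drop(t(X_i) %*% solve(R, y_i - mu))     # A_i set to identity here
    }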
DDIMM can be applied to many learning methods proposed in robust statistics. In the robust statistics literature due to Huber (1964) and, more generally, Huber (2009), an M-estimator is defined as the root of an implicit equation of the form $\Psi_{jk}(\widehat{\theta}_{jk}) = \sum_{i=1}^{n_k} \psi_{jk}(\widehat{\theta}_{jk}) = 0$, where $\psi_{jk}(\theta) = \nabla_\theta\, \rho(\theta)$, $\rho$ is a suitable function that primarily aims to provide estimators robust to influential data points, $\widehat{\theta}_{jk} \in \mathbb{R}^p$, and $\zeta_{jk}$ is empty or known; additional details are available in Huber (2009) for the case when $\zeta_{jk}$ is unknown. In the context of longitudinal data, Wang et al. (2005) robustify the generalized estimating equations of Liang and Zeger (1986) by replacing the standardized residuals with Huber's $M$-residuals.

3.4 Joint mean-variance modelling

Following Pan and Mackenzie (2003), one can jointly model the marginal means and covariances of the longitudinal responses with $h(\mu_{i,jk}) = X_{i,jk,1}\, \beta$, $\log(\sigma^2_{i,jk}) = X_{i,jk,2}\, \lambda$, and $\phi_{irt,jk} = X_{irt,jk,3}\, \gamma$ for $1 \leq t < r \leq m_j$, where $h$ is a known component-wise link function, $\beta \in \mathbb{R}^{q_1}$, $\lambda \in \mathbb{R}^{q_2}$ and $\gamma \in \mathbb{R}^{q_3}$ are unconstrained parameters, $\mu_{i,jk} = E(Y_{i,jk} \mid X_{i,jk,1}, \theta)$ with $X_{i,jk,1} \in \mathbb{R}^{m_j \times q_1}$ a submatrix of $X_{i,jk}$, $\sigma^2_{i,jk} = S(Var(Y_{ir,jk}))_{r=1}^{m_j}$ with $X_{i,jk,2} \in \mathbb{R}^{m_j \times q_2}$ a submatrix of $X_{i,jk}$, and the $\phi_{irt,jk}$ are specified in Zhang et al. (2015a). Estimating functions $\Psi_{jk}$ and $G_{jk}$ in (1) and (2) are given in detail in Zhang et al. (2015a). There is some choice, depending on the problem considered, as to whether $\theta = \beta$, $\theta = (\lambda, \gamma)$, or $\theta = (\beta, \lambda, \gamma)$. In the first case, learning of variance parameters only helps improve estimation efficiency. This type of framework is widely applied in biomedical studies where the mean parameters are of primary interest. In the second case, learning of covariance parameters is of interest and $\beta$ is treated as a nuisance parameter. This is the situation where prediction is of primary interest, such as in kriging in spatial data analysis. In the third case, $G_{jk}$ is null, and learning of variance parameters is of interest to the investigator. This case occurs for example in the study of volatility for risk management in financial data analysis.

Consider the marginal quantile regression model $Q_{Y_{it,jk} \mid X_{it,jk}}(\tau) = X_{it,jk}\, \theta$, where $Q_{Y_{it,jk} \mid X_{it,jk}}(\tau) = F^{-1}_{Y_{it,jk} \mid X_{it,jk}}(\tau) = \inf\{y_{it,jk} : F_{Y_{it,jk} \mid X_{it,jk}}(y_{it,jk}) \geq \tau\}$ is the $\tau$th quantile of $Y_{it,jk} \mid X_{it,jk}$, $\tau \in (0,1)$, and $F_{Y_{it,jk} \mid X_{it,jk}}(y_{it,jk})$ is the conditional distribution function of $Y_{it,jk}$ given $X_{it,jk}$, $t = 1, \ldots, m_j$. Many estimating functions $\Psi_{jk}$ and $G_{jk}$ for the learning of $\theta$ and association parameters $\zeta_{jk}$ of $Y_{i,jk}$ have been proposed; see Jung (1996), Fu and Wang (2012), Lu and Fan (2015), and Yang et al. (2017) for examples.
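As one concrete choice in this last example, the kernel can be taken as the subgradient of the check loss; the following R sketch (our own, marginal and ignoring within-subject association) illustrates the idea for quantile level tau.

    # Marginal quantile-regression kernel for subject i: the subgradient of
    # the check loss, summed over the m_j coordinates of block (j, k).
    psi_quantile <- function(theta, y_i, X_i, tau = 0.5) {
      r <- y_i - X_i %*% theta
      drop(t(X_i) %*% (tau - (r < 0)))    # tau - 1{residual < 0}
    }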
Each of these five examples requires additional work to fully develop a divide-and-conquer strategy via DDIMM, including specific computational details. Here we only present the general framework with a high-level discussion that sheds light on DDIMM's promising generality and flexibility, and its coverage of a wide range of supervised learning methods. The theoretical results presented in Sections 4 and 5 are developed under a general unified framework of estimating functions that includes the above five examples as special cases.

In this section we assume that $K$ and $J$ are fixed; this assumption will be relaxed in Section 5. Let $n_{\min} = \min_{k=1,\ldots,K} n_k$ and $n_{\max} = \max_{k=1,\ldots,K} n_k$. Suppose $W \to_p w$ as $n_{\min} \to \infty$. In this section we study the asymptotic properties of the GMM estimator $(\widehat{\theta}_c, \widehat{\zeta}_c)$ proposed in (4) and its optimal version proposed in (6). We assume throughout that subjects are monotonically allocated to subject groups; that is, as $n_{\min} \to \infty$, a subject cannot be reallocated to another group once it has been assigned to a subject group. Define the variability matrix of $\tau_i(\theta, \zeta)$ in (3) as
\[ v(\theta, \zeta) = Var_{\theta,\zeta}\{\tau_i(\theta, \zeta)\} = \begin{pmatrix} v_\psi(\theta, \zeta) & v_{\psi g}(\theta, \zeta) \\ v_{\psi g}^T(\theta, \zeta) & v_g(\theta, \zeta) \end{pmatrix}, \]
where $v_\psi(\theta, \zeta) = Var_{\theta,\zeta}\{\psi_i(\theta; \zeta)\}$, $v_g(\theta, \zeta) = Var_{\theta,\zeta}\{g_i(\zeta; \theta)\}$, and $v_{\psi g}(\theta, \zeta) = E_{\theta,\zeta}\{\psi_i(\theta; \zeta)\, g_i^T(\zeta; \theta)\}$. Let the sensitivity matrix of $\tau_i(\theta, \zeta)$ be
\[ s(\theta, \zeta) = -\nabla_{\theta,\zeta}\, E_{\theta,\zeta}\, \tau_i(\theta, \zeta) = \begin{pmatrix} s_{\theta\psi}(\theta, \zeta) & s_{\zeta\psi}(\theta, \zeta) \\ s_{\theta g}(\theta, \zeta) & s_{\zeta g}(\theta, \zeta) \end{pmatrix}, \quad (7) \]
where
\[ s_{\theta\psi}(\theta, \zeta) = S_{JK}\Big(\frac{n_k}{N}\, s_{\theta\psi_{jk}}(\theta, \zeta_{jk})\Big), \qquad s_{\zeta\psi}(\theta, \zeta) = \mathrm{diag}\Big\{\frac{n_k}{N}\, s_{\zeta\psi_{jk}}(\theta, \zeta_{jk})\Big\}_{j=1,k=1}^{J,K}, \]
\[ s_{\theta g}(\theta, \zeta) = S_{JK}\Big(\frac{n_k}{N}\, s_{\theta g_{jk}}(\theta, \zeta_{jk})\Big), \qquad s_{\zeta g}(\theta, \zeta) = \mathrm{diag}\Big\{\frac{n_k}{N}\, s_{\zeta g_{jk}}(\theta, \zeta_{jk})\Big\}_{j=1,k=1}^{J,K}, \]
\[ s_{jk}(\theta, \zeta_{jk}) = \begin{pmatrix} s_{\theta\psi_{jk}}(\theta, \zeta_{jk}) & s_{\zeta\psi_{jk}}(\theta, \zeta_{jk}) \\ s_{\theta g_{jk}}(\theta, \zeta_{jk}) & s_{\zeta g_{jk}}(\theta, \zeta_{jk}) \end{pmatrix}. \]
Following Theorem 3.4 of Song (2007), the block-specific estimates $\widehat{\theta}_{jk}$ and $\widehat{\zeta}_{jk}$ are consistent given assumptions (A.1). Consistency and asymptotic normality of the GMM estimator $(\widehat{\theta}_c, \widehat{\zeta}_c)$ in (4) have been established by Hansen (1982) and, more generally, by Newey and McFadden (1994). To establish consistency and asymptotic normality for the combined estimator $(\widehat{\theta}_c, \widehat{\zeta}_c)$, we consider the following additional regularity conditions:

(A.3) Following Newey and McFadden (1994), define $Q(\theta, \zeta \mid W) = E_{\theta,\zeta}\{T_N^T(\theta, \zeta)\}\, w\, E_{\theta,\zeta}\{T_N(\theta, \zeta)\}$. Assume $Q(\theta, \zeta \mid W)$ is twice continuously differentiable in a neighbourhood of $(\theta_0, \zeta_0)$.

(A.4) Let $(\widehat{\theta}_c, \widehat{\zeta}_c) = \arg\min_{\theta,\zeta} Q_N(\theta, \zeta \mid W)$. Following Newey and McFadden (1994), assume $Q_N(\widehat{\theta}_c, \widehat{\zeta}_c \mid W) \leq \inf_{\theta \in \Theta, \zeta \in \Xi} Q_N(\theta, \zeta \mid W) + \epsilon_N$ with $\epsilon_N = o_p(1)$. In addition, assume that $\theta_0$, $\zeta_0$ are interior points of $\Theta$ and $\Xi$ respectively, and that for any $\delta_N \to 0$,
\[ \sup_{\|(\theta,\zeta) - (\theta_0,\zeta_0)\|_2 \leq \delta_N} \frac{N^{1/2}}{1 + N^{1/2}\|(\theta,\zeta) - (\theta_0,\zeta_0)\|_2} \big\| T_N(\theta, \zeta) - T_N(\theta_0, \zeta_0) - E_{\theta,\zeta}\, T_N(\theta, \zeta) \big\|_2 \to_p 0. \]

Theorems 1 and 2 do not require the differentiability of $T_N$ and $Q_N$. Instead, they require the differentiability of their population versions, and that $T_N$ behave "nicely" in a neighbourhood of $(\theta_0, \zeta_0)$, in the sense that higher order terms are asymptotically ignorable. The following two theorems state the consistency and asymptotic normality of $(\widehat{\theta}_c, \widehat{\zeta}_c)$ given in (4) under Newey and McFadden's mild moment conditions given in (A.3) and (A.4).

Theorem 1 (Consistency of $(\widehat{\theta}_c, \widehat{\zeta}_c)$). Suppose assumptions (A.1)-(A.3) hold with $(\widehat{\theta}_c, \widehat{\zeta}_c)$ defined in (4). Then $(\widehat{\theta}_c, \widehat{\zeta}_c) \to_p (\theta_0, \zeta_0)$ as $n_{\min} \to \infty$.

The proof of Theorem 1 follows closely the steps given in Hansen (1982) and Newey and McFadden (1994), and thus is omitted.
Theorem 2 (Asymptotic normality of $(\widehat{\theta}_c, \widehat{\zeta}_c)$). Suppose assumptions (A.1)-(A.4) hold with $(\widehat{\theta}_c, \widehat{\zeta}_c)$ defined in (4). Then as $n_{\min} \to \infty$,
\[ N^{1/2} \begin{pmatrix} \widehat{\theta}_c - \theta_0 \\ \widehat{\zeta}_c - \zeta_0 \end{pmatrix} \to_d \mathcal{N}\big(0,\; j^{-1}(\theta_0, \zeta_0)\, s(\theta_0, \zeta_0)\, \tilde{v}(\theta_0, \zeta_0)\, s^T(\theta_0, \zeta_0)\, j^{-1}(\theta_0, \zeta_0)\big), \]
where $\tilde{v}(\theta_0, \zeta_0) = w\, v(\theta_0, \zeta_0)\, w$, and where the Godambe information $j(\theta_0, \zeta_0)$ of $T_N(\theta, \zeta)$ takes the form $j(\theta_0, \zeta_0) = s(\theta_0, \zeta_0)\, w\, s^T(\theta_0, \zeta_0)$.

The proof of Theorem 2 follows easily from Theorem 7.2 in Newey and McFadden (1994) and Theorem 1 above. We note that requiring $K$ to be finite implies that $N$ and $n_{\min}$ are asymptotically of the same order. We will relax this assumption in Section 5. Conditions (A.3) and (A.4) allow us to consider non-differentiable kernel inference functions in the block $(j,k)$ analysis, extending Hector and Song (2019)'s DIMM beyond CL kernel inference functions. We can now consider quantile regression, M-estimation, and more general estimating functions than the score or CL score equations.

A test of the over-identifying restrictions follows from Hansen (1982) and Hector and Song (2019). This test is useful for detecting invalid moment restrictions, which can inform our choice of data partition and model. Formally, we show in Theorem 3 that the objective function $N Q_N$ evaluated at $(\widehat{\theta}_c, \widehat{\zeta}_c)$ follows a $\chi^2$ distribution with $(JK - 1)p$ degrees of freedom. Unfortunately, it may be difficult to tell if invalid moment restrictions stem from an inappropriate data split or incorrect model specification. Residual analysis for model diagnostics can remove doubt in the latter case.

Theorem 3 (Test of over-identifying restrictions). Suppose assumptions (A.1)-(A.4) hold with $(\widehat{\theta}_c, \widehat{\zeta}_c)$ defined in (4). Then as $n_{\min} \to \infty$, $N Q_N(\widehat{\theta}_c, \widehat{\zeta}_c \mid W) \to_d \chi^2_{(JK-1)p}$.

The proof of Theorem 3 can be carried out with some minor changes from that of Theorem 3 in Hector and Song (2019). The GMM provides an objective function with which to do model selection even when the block analyses do not, such as with GEE and M-estimation.
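In practice, the test in Theorem 3 is one line of code once the minimized objective is available; a sketch, with gof_test our own naming:

    # Over-identifying restrictions test: compare N * Q_N(theta_c, zeta_c | W)
    # to a chi-square with (J*K - 1) * p degrees of freedom; returns a p-value.
    gof_test <- function(QN_min, N, J, K, p) {
      pchisq(N * QN_min, df = (J * K - 1) * p, lower.tail = FALSE)
    }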
In the following, Theorem 4 and Corollary 1 show that our combined GMM estimator derived from (6) is optimal in the sense defined by Hansen (1982): it has an asymptotic covariance matrix at least as small (in terms of the Loewner ordering) as that of any other estimator exploiting the same over-identifying restrictions. We refer to this property as "Hansen optimality".

Theorem 4.
Suppose assumptions (A.1)-(A.2) hold. Then as $n_{\min} \to \infty$, $\widehat{V}_N \to_p v(\theta_0, \zeta_0)$, i.e. $w = v^{-1}(\theta_0, \zeta_0)$.

Proof. The proof uses the consistency of the block estimators and the Central Limit Theorem, and is given in the Supplemental Material.
Corollary 1 (Hansen optimality). Suppose assumptions (A.1)-(A.4) hold with $(\widehat{\theta}_c, \widehat{\zeta}_c)$ defined in (4). Let $j(\theta_0, \zeta_0)$ be as given in Theorem 2. Then as $n_{\min} \to \infty$,
\[ N^{1/2} \begin{pmatrix} \widehat{\theta}_{opt} - \theta_0 \\ \widehat{\zeta}_{opt} - \zeta_0 \end{pmatrix} \to_d \mathcal{N}\big(0,\, j^{-1}(\theta_0, \zeta_0)\big). \]

The theoretical results given in Theorems 1-4 provide a framework for constructing asymptotic confidence intervals and conducting hypothesis tests, so that we can perform inference for $\theta$ when $M$ and/or $N$ are very large. Using an optimal weight matrix improves statistical power, so DDIMM may detect some signals that other methods may miss. Since we consider a broad class of models $\mathcal{P}$, there are no general efficiency results about the block-specific estimator $\widehat{\theta}_{jk}$. When a learning method based on $\Psi_{jk}$ has known efficiency results and performs well enough, DDIMM generally inherits this "local" efficiency to achieve overall efficiency.

Remark 1.
We discuss efficiency for selected examples in Section 3.

(i) In Example 3.1, when the score function exists and satisfies mild regularity conditions, its variance is given by Fisher's information, and is a lower bound on the variances of estimating functions for $\theta$ and $\zeta$. This, coupled with Hansen's optimality, means that using the score function for $\psi_{jk}$ and $g_{jk}$ yields an efficient estimator of $\theta$ and $\zeta$. In an unpublished dissertation, Jin (2011) studied the efficiency of the pairwise composite likelihood under different correlation structures. Hector and Song (2019) showed empirically that the efficiency of the pairwise composite likelihood propagates to the combined estimator.

(ii) In Example 3.2, it is known that the GEE estimator $\widehat{\theta}_{jk}$ is semi-parametrically efficient when the correlation structure of the response $y_{i,jk}$ is correctly specified. This, coupled with Hansen's optimality, means that using GEEs for $\psi_{jk}$ with the correct correlation structure of the response $y_{i,jk}$ yields an efficient estimator of $\theta$.

Remark 2.
The GMM estimator $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ can be interpreted as maximizing an extension of the confidence distribution density, as discussed in Hector and Song (2019). The confidence distribution approach is used for independent data in Xie and Singh (2013). Briefly, we can define the confidence estimating function (CEF) as $U(\theta, \zeta) = \Phi(N^{1/2}\, \widehat{V}_N^{-1/2}\, T_N(\theta, \zeta))$, where $\Phi(\cdot)$ is the $(JKp + d)$-variate standard normal distribution function. Clearly, $U(\theta, \zeta)$ is asymptotically standard uniform at $(\theta_0, \zeta_0)$ as long as $\widehat{V}_N$ is a consistent estimator of the covariance of $T_N$. Then we can define the density of the CEF as $u(\theta, \zeta) = \phi(N^{1/2}\, \widehat{V}_N^{-1/2}\, T_N(\theta, \zeta))$. Maximizing $u(\theta, \zeta)$ with respect to $(\theta, \zeta)$ yields the minimization defined in (6).

By framing our estimator as a GMM estimator, the theoretical framework of DIMM established only for CL can be extended to include a data split at the subject level and a generalization of $\Psi_{jk}$ and $G_{jk}$. Adding moment conditions allows the proposed method to enjoy the power and versatility of the GMM, and the necessary theoretical results to support its use. This divide-and-conquer strategy benefits from handling low-dimensional blocks of data and estimating equations, yielding tremendous computational gains.

Despite the computational gains offered by the divide-and-combine procedure and the GMM estimator, iteratively finding the solution $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ (or $(\widehat{\theta}_c, \widehat{\zeta}_c)$) to (6) can be slow due to the high dimensionality of the parameter $\zeta$ and the need to repeatedly evaluate $\Psi_{jk}$ and $G_{jk}$. To overcome this computational bottleneck, we propose a meta-estimator derived from (6) that delivers a closed-form estimator via a linear function of the block estimates $(\widehat{\theta}_{list}, \widehat{\zeta}_{list})$. We define the DDIMM estimator for $(\theta, \zeta)$:
\[ \begin{pmatrix} \widehat{\theta}_{DDIMM} \\ \widehat{\zeta}_{DDIMM} \end{pmatrix} = \Big( \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \Big)^{-1} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} \\ \widehat{\zeta}_{list} \end{pmatrix}, \quad (8) \]
where $\widehat{C}_{k,i}$ is a function of sample variability and sensitivity matrices and block-specific estimators $\widehat{\theta}_{jk}$ and $\widehat{\zeta}_{jk}$, defined in detail in Section 5.1. If we do not plan to conduct inference for $\zeta$, which is treated as a nuisance parameter, taking $[\widehat{C}^{-1}]_{p:}$ to be rows 1 to $p$ of the matrix $(\sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i})^{-1}$ leads to the closed-form estimator of $\theta$:
\[ \widehat{\theta}_{DDIMM} = [\widehat{C}^{-1}]_{p:} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \big( \widehat{\theta}_{ik}^T ~ \widehat{\zeta}_{list}^T \big)^T. \quad (9) \]
We briefly define sample sensitivity matrices that will appear in the main body of the paper. Let $S_{\theta\psi_{jk}}(\theta, \zeta_{jk})$ be a $n_k^{1/2}$-consistent sample estimator of $s_{\theta\psi_{jk}}(\theta, \zeta_{jk})$, and similarly define $S_{\zeta\psi_{jk}}(\theta, \zeta_{jk})$, $S_{\theta g_{jk}}(\theta, \zeta_{jk})$ and $S_{\zeta g_{jk}}(\theta, \zeta_{jk})$. Let
\[ S_{jk}(\theta, \zeta_{jk}) = \begin{pmatrix} S_{\theta\psi_{jk}}(\theta, \zeta_{jk}) & S_{\zeta\psi_{jk}}(\theta, \zeta_{jk}) \\ S_{\theta g_{jk}}(\theta, \zeta_{jk}) & S_{\zeta g_{jk}}(\theta, \zeta_{jk}) \end{pmatrix}. \]
Note that the uppercase $S$ denotes the sample sensitivity matrix, and the lowercase $s$ denotes the population sensitivity matrix. Let $\widehat{S}_{jk} = S_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$ and similarly define $\widehat{S}_{\theta\psi_{jk}}$, $\widehat{S}_{\zeta\psi_{jk}}$, $\widehat{S}_{\theta g_{jk}}$ and $\widehat{S}_{\zeta g_{jk}}$.
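Once the blocks report their summary statistics, the meta-estimator (8) is a single weighted combination; the following R sketch (ddimm_combine, our own helper operating on precomputed $\widehat{C}_{k,i}$ matrices) shows the one-step nature of the combination.

    # Closed-form combination in (8): (sum n_k C_ki)^{-1} sum n_k C_ki b_ki,
    # where b_hat[[k]][[i]] stacks (theta_hat_ik, zeta_hat_list).
    ddimm_combine <- function(C_hat, b_hat, n) {
      A <- 0; rhs <- 0
      for (k in seq_along(n)) for (i in seq_along(C_hat[[k]])) {
        A   <- A   + n[k] * C_hat[[k]][[i]]
        rhs <- rhs + n[k] * C_hat[[k]][[i]] %*% b_hat[[k]][[i]]
      }
      solve(A, rhs)     # rows 1:p give theta_DDIMM as in (9)
    }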
Sensitivity formulas are summarized in Table A.1 in Appendix A.1.

The DDIMM estimator in (9) can be implemented in a fully parallelized and scalable computational scheme, where only one pass through each block of data is required. The block analyses are run on parallel CPUs, and return the values of the summary statistics $\{\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}, \psi_{i,jk}(\widehat{\theta}_{jk}; \widehat{\zeta}_{jk}), g_{i,jk}(\widehat{\zeta}_{jk}; \widehat{\theta}_{jk}), \widehat{S}_{jk}\}_{j,k=1}^{J,K}$ to the main computing node, which computes $\widehat{\theta}_{DDIMM}$ in (9) in one step.

5.1 Construction of $\widehat{C}_{k,i}$

We give details on the construction of $\widehat{C}_{k,i}$. Readers may wish to omit this section on a first reading, as these details are not necessary for an understanding of the main body of the paper. We consider the optimal case where the GMM weighting matrix takes the form
\[ W = \widehat{V}_N^{-1} = \begin{pmatrix} \widehat{V}_{N,\psi} & \widehat{V}_{N,\psi g} \\ \widehat{V}_{N,\psi g}^T & \widehat{V}_{N,g} \end{pmatrix}^{-1} = \begin{pmatrix} \widehat{V}_N^{\psi} & \widehat{V}_N^{\psi g} \\ \widehat{V}_N^{\psi g, T} & \widehat{V}_N^{g} \end{pmatrix}. \]
For convenience, we introduce a subsetting operation, with technical details available in Appendix A.2: we let $[\widehat{V}_N^{\psi}]_{ij:k}$ subset the rows for the parameters corresponding to block $(i,k)$ and the columns for the parameters corresponding to block $(j,k)$ of the matrix $\widehat{V}_N^{\psi}$. Similarly define $[\widehat{V}_N^{g}]_{ij:k}$ and $[\widehat{V}_N^{\psi g}]_{ij:k}$. For $\eta \in \{\theta, \zeta\}$, let
\[ \widehat{A}^{\eta}_{k,ij} = \Big( \widehat{S}^{T}_{\theta\psi_{jk}} [\widehat{V}_N^{\psi}]_{ji:k} + \widehat{S}^{T}_{\theta g_{jk}} [\widehat{V}_N^{\psi g, T}]_{ji:k} \Big) \widehat{S}_{\eta\psi_{ik}} + \Big( \widehat{S}^{T}_{\theta\psi_{jk}} [\widehat{V}_N^{\psi g}]_{ji:k} + \widehat{S}^{T}_{\theta g_{jk}} [\widehat{V}_N^{g}]_{ji:k} \Big) \widehat{S}_{\eta g_{ik}}, \]
\[ \widehat{B}^{\eta}_{k,ij} = \Big( \widehat{S}^{T}_{\zeta\psi_{jk}} [\widehat{V}_N^{\psi}]_{ji:k} + \widehat{S}^{T}_{\zeta g_{jk}} [\widehat{V}_N^{\psi g, T}]_{ji:k} \Big) \widehat{S}_{\eta\psi_{ik}} + \Big( \widehat{S}^{T}_{\zeta\psi_{jk}} [\widehat{V}_N^{\psi g}]_{ji:k} + \widehat{S}^{T}_{\zeta g_{jk}} [\widehat{V}_N^{g}]_{ji:k} \Big) \widehat{S}_{\eta g_{ik}}. \]
Define $D_{ik}$ as the sum of the dimensions of $\zeta_{1k}, \ldots, \zeta_{(i-1)k}$, and $D_k^0$ as the sum of the dimensions of $\zeta_{11}, \ldots, \zeta_{J(k-1)}$, with technical details in Appendix A.3. Let $d_k = \sum_{j=1}^J d_{jk}$. Then we can define the following:
\[ \widehat{C}_{k,i} = \begin{pmatrix}
\sum_{j=1}^J \widehat{A}^{\theta}_{k,ij} & 0_{p \times D_{ik}} & \sum_{j=1}^J \widehat{A}^{\zeta}_{k,ij} & 0_{p \times (d - d_{ik} - D_{ik})} \\
\multicolumn{4}{c}{0_{D_k^0 \times (p+d)}} \\
\widehat{B}^{\theta}_{k,i1} & 0_{d_{1k} \times D_{ik}} & \widehat{B}^{\zeta}_{k,i1} & 0_{d_{1k} \times (d - d_{ik} - D_{ik})} \\
\vdots & \vdots & \vdots & \vdots \\
\widehat{B}^{\theta}_{k,iJ} & 0_{d_{Jk} \times D_{ik}} & \widehat{B}^{\zeta}_{k,iJ} & 0_{d_{Jk} \times (d - d_{ik} - D_{ik})} \\
\multicolumn{4}{c}{0_{(d - d_k - D_k^0) \times (p+d)}}
\end{pmatrix}. \quad (10) \]

5.2 Asymptotic results for K and J fixed

In this section we assume that $K$ and $J$ are fixed, which will be relaxed in Sections 5.3 and 5.4. Recall that we assume subjects are monotonically allocated to subject groups: as $n_{\min} \to \infty$, a subject cannot be reallocated to another group once it has been assigned to a subject group. Consider the following condition:

(A.5) For each $j = 1, \ldots, J$, $k = 1, \ldots, K$, $\widehat{\theta}_{jk} = \theta_0 + O_p(n_k^{-1/2})$ and $\widehat{\zeta}_{jk} = \zeta_{0,jk} + O_p(n_k^{-1/2})$. For any $\delta_N \to 0$,
\[ \sup_{\|(\theta,\zeta) - (\theta_0,\zeta_0)\|_2 \leq \delta_N} \frac{N^{1/2}}{1 + N^{1/2}\|(\theta,\zeta) - (\theta_0,\zeta_0)\|_2} \big\| T_N(\theta, \zeta) - T_N(\theta_0, \zeta_0) - E_{\theta,\zeta}\, T_N(\theta, \zeta) \big\|_2 = O_p(N^{-1/2}). \]

Consequently, some large-sample results can be established which are helpful in studying the asymptotic behaviour of $\widehat{\theta}_{DDIMM}$.

Lemma 1.
Suppose assumptions (A.1), (A.2) and (A.5) hold. Then we have consistent estimation of the information matrices: $\widehat{V}_N = v(\theta_0, \zeta_0) + O_p(N^{-1/2})$, $\widehat{S}_{jk} = s_{jk}(\theta_0, \zeta_{0,jk}) + O_p(n_k^{-1/2})$ for each $j$, $k$, and
\[ \frac{1}{N} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} = \widehat{S}^T \widehat{V}_N^{-1} \widehat{S} = j(\theta_0, \zeta_0) + O_p(N^{-1/2}), \]
where
\[ \widehat{S} = \begin{pmatrix} S_{JK}\big(\frac{n_k}{N} \widehat{S}_{\theta\psi_{jk}}\big) & \mathrm{diag}\big\{\frac{n_k}{N} \widehat{S}_{\zeta\psi_{jk}}\big\}_{j=1,k=1}^{J,K} \\ S_{JK}\big(\frac{n_k}{N} \widehat{S}_{\theta g_{jk}}\big) & \mathrm{diag}\big\{\frac{n_k}{N} \widehat{S}_{\zeta g_{jk}}\big\}_{j=1,k=1}^{J,K} \end{pmatrix}. \]

Proof.
A detailed proof is given in the Supplemental Material.

We show in Theorem 5 that the proposed closed-form estimator $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ in (8) is consistent and asymptotically normally distributed.
Theorem 5.
Suppose assumptions (A.1), (A.2) and (A.5) hold. Let $j(\theta_0, \zeta_0)$ be as given in Theorem 2. As $n_{\min} \to \infty$,
\[ N^{1/2} \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} \to_d \mathcal{N}\big(0,\, j^{-1}(\theta_0, \zeta_0)\big). \]

Proof of Theorem 5: Here we present the major steps, with all necessary details available in Appendix B.1. First, we show that $\widehat{\theta}_{DDIMM}$ and $\widehat{\zeta}_{DDIMM}$ are consistent. Define
\[ \lambda(\theta, \zeta) = \frac{1}{N} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \begin{pmatrix} \theta - \widehat{\theta}_{ik} \\ \zeta - \widehat{\zeta}_{list} \end{pmatrix}. \quad (11) \]
By definition, $\lambda(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM}) = 0$. As shown in Lemma B.1.1 in Appendix B.1, $\lambda(\theta_0, \zeta_0) \to_p 0$ as $n_{\min} \to \infty$. Given that $\nabla_{\theta,\zeta}\, \lambda(\theta, \zeta)$ exists and is nonsingular, for some $(\theta^*, \zeta^*)$ between $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ and $(\theta_0, \zeta_0)$, the first-order Taylor expansion leads to
\[ \lambda(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM}) - \lambda(\theta_0, \zeta_0) = \nabla_{\theta,\zeta}\, \lambda(\theta, \zeta) \big|_{\theta^*,\zeta^*} \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix}, \quad (12) \]
which converges in probability to $0$ as $n_{\min} \to \infty$. This implies that $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM}) \to_p (\theta_0, \zeta_0)$ as $n_{\min} \to \infty$.

Now we derive the distribution of $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$. With a slight abuse of notation, let $\widehat{\theta}_{list} - \theta_0 = S_{JK}(\widehat{\theta}_{jk} - \theta_0)$. We show in Lemma B.1.2 in Appendix B.1 that
\[ \begin{pmatrix} \Psi_{jk}(\theta_0; \zeta_{0,jk}) \\ G_{jk}(\zeta_{0,jk}; \theta_0) \end{pmatrix} = \widehat{S}_{jk} \begin{pmatrix} \widehat{\theta}_{jk} - \theta_0 \\ \widehat{\zeta}_{jk} - \zeta_{0,jk} \end{pmatrix} + O_p(n_k^{-1}). \quad (13) \]
Recall the form of $T_N$ in (3). By the Central Limit Theorem, $N^{1/2}\, T_N(\theta_0, \zeta_0) \to_d \mathcal{N}(0, v(\theta_0, \zeta_0))$. Then, with $\widehat{S}$ defined in Lemma 1, it follows from equation (13) that
\[ N^{1/2}\, \widehat{S}\, \big( (\widehat{\theta}_{list} - \theta_0)^T ~ (\widehat{\zeta}_{list} - \zeta_0)^T \big)^T \to_d \mathcal{N}(0, v(\theta_0, \zeta_0)). \]
Moreover, by Lemma 1 and Slutsky's theorem we have
\[ N^{1/2}\, \big( (\widehat{\theta}_{list} - \theta_0)^T ~ (\widehat{\zeta}_{list} - \zeta_0)^T \big)^T \to_d \mathcal{N}\big(0, j^{-1}(\theta_0, \zeta_0)\big), \]
so that
\[ N^{1/2} \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} = N^{1/2} \Big( \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \Big)^{-1} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta_0 \\ \widehat{\zeta}_{list} - \zeta_0 \end{pmatrix} \]
is asymptotically distributed $\mathcal{N}(0, j^{-1}(\theta_0, \zeta_0))$.

This key theorem allows us to use $\widehat{\theta}_{DDIMM}$, which is more computationally attractive than $\widehat{\theta}_{opt}$ defined in (6), without sacrificing any of the nice asymptotic properties for inference. Additionally, it follows easily from Theorem 5 that, under suitable conditions, the closed-form estimator $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ in (8) has the same asymptotic distribution as, and is asymptotically equivalent to, the GMM estimator $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ in (6).

Corollary 2.
Suppose assumptions (A.1)-(A.5) hold with $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ defined in (6). Then $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ and $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$ are asymptotically equivalent: as $n_{\min} \to \infty$,
\[ N^{1/2} \left\| \begin{pmatrix} \widehat{\theta}_{DDIMM} - \widehat{\theta}_{opt} \\ \widehat{\zeta}_{DDIMM} - \widehat{\zeta}_{opt} \end{pmatrix} \right\|_2 \to_p 0. \]

Proof.
A detailed proof is given in the Supplemental Material.

The computation of $\widehat{\theta}_{DDIMM}$ in (9) relies solely on the block-specific estimators $(\widehat{\theta}_{list}, \widehat{\zeta}_{list})$ and values of summary statistics from each block. To guarantee the appropriate asymptotic distribution of $\widehat{\theta}_{DDIMM}$, we assume in condition (A.5) that these block-specific estimators are $n_k^{1/2}$-consistent estimators of the true values, which restricts the scope of possible block-specific inference methods. For inference methods not satisfying the consistency requirement in condition (A.5), it is still possible to use $\widehat{\theta}_{opt}$ in (6).

5.3 Asymptotic results for diverging K with J fixed

We show in Theorem 6 that the asymptotic distribution of $(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM})$ remains unchanged as the number of subject groups $K$ grows with the sample size.

Theorem 6.
Suppose $N^{\delta_1 - 1/2} K$ is bounded as $n_{\min} \to \infty$ for a positive constant $\delta_1 < 1/2$, and assumptions (A.1), (A.2) and (A.5) hold. Let $H \in \mathbb{R}^{h \times (p+d)}$ be a matrix of rank $r \in \mathbb{N}$, $h \in \mathbb{N}$, $r \leq h$, with finite maximum singular value $\bar{\sigma}(H) < \infty$. Let $j(\theta_0, \zeta_0)$ be as given in Theorem 2. Then, as $n_{\min} \to \infty$, we show that the limiting value $j_H(\theta_0, \zeta_0)$ of $H j^{-1}(\theta_0, \zeta_0) H^T$ is a positive semi-definite and symmetric variance matrix, and that
\[ N^{1/2} H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} \to_d \mathcal{N}(0, j_H(\theta_0, \zeta_0)). \]

Proof [Proof of Theorem 6] Here we present the major steps, with all necessary details available in Appendix B.2. First, we know that $\|H\|_2 \leq r^{1/2} \bar{\sigma}(H)$. Let $\lambda(\theta, \zeta)$ be defined by (11), such that $\lambda(\widehat{\theta}_{DDIMM}, \widehat{\zeta}_{DDIMM}) = 0$. We show in Lemma B.2.1 in Appendix B.2 that $\|\lambda(\theta_0, \zeta_0)\|_2 = O_p(N^{-1/2-\delta_1} n^{1/2})$ and $\|\{\nabla_{\theta,\zeta}\, \lambda(\theta_0, \zeta_0)\}^{-1}\|_2 = O_p(N^{1/2+\delta_1} n^{-1})$. From the first-order Taylor expansion in (12), we have
\[ \left\| H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} \right\|_2 \leq \|H\|_2\, \big\| (\nabla_{\theta,\zeta}\, \lambda(\theta, \zeta)|_{\theta^*,\zeta^*})^{-1} \big\|_2\, \|\lambda(\theta_0, \zeta_0)\|_2 \leq r^{1/2} \bar{\sigma}(H)\, O_p(n^{-1/2}). \]
Then $H (\widehat{\theta}_{DDIMM}^T, \widehat{\zeta}_{DDIMM}^T)^T - H (\theta_0^T, \zeta_0^T)^T \to_p 0$ as $n_{\min} \to \infty$.

To derive the distribution of $H (\widehat{\theta}_{DDIMM}^T, \widehat{\zeta}_{DDIMM}^T)^T$, first consider an arbitrary $k \in \{1, \ldots, K\}$. For convenience, denote
\[ T_k(\theta, \zeta_k) = S\big( S_J(\Psi_{jk}(\theta; \zeta_{jk})),\, S_J(G_{jk}(\zeta_{jk}; \theta)) \big), \qquad \tau_{i,k}(\theta, \zeta_k) = S\big( S_J(\psi_{i,jk}(\theta; \zeta_{jk})),\, S_J(g_{i,jk}(\zeta_{jk}; \theta)) \big). \]
By the Central Limit Theorem, $n_k^{1/2}\, T_k(\theta_0, \zeta_{0,k}) = n_k^{-1/2} \sum_{i=1}^{n_k} \tau_{i,k}(\theta_0, \zeta_{0,k}) \to_d \mathcal{N}(0, v_k(\theta_0, \zeta_{0,k}))$ as $n_k \to \infty$, where $v_k(\theta_0, \zeta_{0,k}) = Var_{\theta_0,\zeta_{0,k}}\{\tau_{i,k}(\theta_0, \zeta_{0,k})\}$. Define
\[ s_k(\theta, \zeta_k) = \begin{pmatrix} S_J\big(s_{\theta\psi_{jk}}(\theta, \zeta_{jk})\big) & \mathrm{diag}\big\{s_{\zeta\psi_{jk}}(\theta, \zeta_{jk})\big\}_{j=1}^J \\ S_J\big(s_{\theta g_{jk}}(\theta, \zeta_{jk})\big) & \mathrm{diag}\big\{s_{\zeta g_{jk}}(\theta, \zeta_{jk})\big\}_{j=1}^J \end{pmatrix}, \]
and $j_k(\theta, \zeta_k) = s_k^T(\theta, \zeta_k)\, v_k^{-1}(\theta, \zeta_k)\, s_k(\theta, \zeta_k)$. By (13) in the proof of Theorem 5, Lemma 1, and Slutsky's theorem,
\[ n_k^{1/2} \begin{pmatrix} S_J(\widehat{\theta}_{jk} - \theta_0) \\ \widehat{\zeta}_k - \zeta_{0,k} \end{pmatrix} \to_d \mathcal{N}\big(0, j_k^{-1}(\theta_0, \zeta_{0,k})\big). \]
Note that the above vectors are independent for $k = 1, \ldots, K$. We establish in Lemma B.2.2 in Appendix B.2 that, for some affine transformation matrices $E_k$, $k = 1, \ldots, K$, of $0$'s and $1$'s,
\[ \frac{n_k}{N} \sum_{i=1}^J \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta_0 \\ \widehat{\zeta}_{list} - \zeta_0 \end{pmatrix} = \frac{n_k}{N} E_k Z_k + O_p(N^{-1}), \quad \text{and} \quad \frac{n_k}{N} \sum_{i=1}^J \widehat{C}_{k,i} = \frac{n_k}{N} E_k\, j_k(\theta_0, \zeta_{0,k})\, E_k^T + O_p\big(n_k^{1/2} N^{-1}\big), \]
where $n_k^{1/2} Z_k \to_d \mathcal{N}\big(0, j_k^{-1}(\theta_0, \zeta_{0,k})\big)$. It is clear that $j(\theta_0, \zeta_0) = \sum_{k=1}^K (n_k/N)\, E_k\, j_k(\theta_0, \zeta_{0,k})\, E_k^T$. Since $E_k$ has finitely many 1's, $\|E_k\|_2$ is bounded. Since $\|j_k(\theta_0, \zeta_{0,k})\|_2$ is also bounded, $\|j(\theta_0, \zeta_0)\|_2 = O(K n_{\max} N^{-1}) = O(1)$. $j(\theta_0, \zeta_0)$ is positive semi-definite and symmetric, implying that $H j^{-1}(\theta_0, \zeta_0) H^T$ is also positive semi-definite and symmetric. Following
Followingthe monotone convergence theorem, we can write Hj − ( θ , ζ ) H T → j H ( θ , ζ ), where28 H ( θ , ζ ) exists and is a proper variance matrix.Using the fact that λ ( (cid:98) θ DDIMM , (cid:98) ζ DDIMM ) = and K = O ( N / − δ ), we show in LemmaB.2.3 in Appendix B.2 that N / H ( (cid:98) θ DDIMM − θ , (cid:98) ζ DDIMM − ζ ) can be rewritten as H (cid:40) K (cid:88) k =1 n k N E k j k ( θ , ζ k ) E Tk + O p (cid:0) n / N − / − δ (cid:1)(cid:41) − (cid:34) K (cid:88) k =1 (cid:26)(cid:16) n k N (cid:17) / E k n / k Z k (cid:27) + O p (cid:0) N − δ (cid:1)(cid:35) . Since O p ( n / N − / − δ ) = o p (1) and O p ( N − δ ) = o p (1), it follows as in the proof of Theorem5 that as n min → ∞ , N / H (cid:98) θ DDIMM − θ (cid:98) ζ DDIMM − ζ d → N ( , j H ( θ , ζ )) . In practice, Theorem 6 suggests that we can tune our choice of K and n min to attain thedesired trade-off between inference and computational speed: smaller K and larger n min will slow computations but improve estimation and asymptotic normality, whereas larger K and smaller n min will speed computations but worsen estimation and asymptotic normality. K and J In general, asymptotics for diverging J become very complicated and even analyticallyintractable depending on how, and to what extent, the dependence structure evolves as thedimension M of Y goes to infinity ( M → ∞ ). Cox and Reid (2004) propose constructinga pseudolikelihood from marginal densities when the full joint distribution is difficult toconstruct, and discuss asymptotics for increasing response dimensionality. To make theproblem of diverging M tractable, we consider the following regularity conditions:29A.6) Stationarity: for each M ∗ ∈ N and each ( M ∗ + 1)-dimensional measurableset B a subset of the sample space of Y , the distribution of Y i satisfies P { ( Y i,r , . . . , Y i,r + M ∗ ) ∈ B } = P { ( Y i, , . . . , Y i,M ∗ ) ∈ B } for every r ∈ N .(A.7) Let C k,i be the version of (cid:98) C k,i in (10) evaluated at the true values θ , ζ jk . For k = 1 , . . . , K , i = 1 , . . . , J , ( (cid:80) Kl =1 (cid:80) Jj =1 n l C l,j ) − n k C k,i = O p ( N − δ ) for a constant0 ≤ δ ≤ /
This can be thought of as a type of Lindeberg condition.

(A.8) Conditions required for the asymptotically normal distribution and efficiency of the GMM estimator $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$; see Theorem 5.4 in Donald et al. (2003) and the spanning condition in Newey (2004). See Newey (2004) for related work on semiparametric efficiency of the GMM estimator as the number of moment conditions goes to infinity.

Remark 3.
Condition (A.6) is typical for consistency and asymptotic normality of the GMM estimator $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt})$, following Hansen (1982) and Newey (2004). It is a typical condition for the application of the central limit theorem to stochastic processes, i.e. to infinite dimensional random vectors. Additionally, in order to make statements about convergence in probability, (A.6) is required to ensure a valid joint probability distribution as the dimension $M$ increases.

Remark 4.
Condition (A.7) ensures the covariance of the outcome $Y_i$ is appropriately controlled as $M \to \infty$. Alternative conditions may be considered, such as $\alpha$-mixing (Bradley (1985)), $\rho$-mixing (Peligrad (1986)), or $\phi$-mixing (Peligrad (1986)), but this is beyond the scope of this paper. Condition (A.7) can be simplified for the case where $n_k = n$ for all $k = 1, \ldots, K$. Then (A.7) becomes $(\sum_{l=1}^K \sum_{j=1}^J C_{l,j})^{-1} C_{k,i} = O_p(N^{-\delta})$.

In Theorem 7 we show the consistency and asymptotic normality of the DDIMM estimator as $K$ and $J$ diverge to $\infty$.

Theorem 7. Suppose $N^{\delta_1} n_{\min}^{-1}$ and $N^{\delta_2 - 1/2} KJ$ are bounded as $n_{\min} \to \infty$ for constants $0 \leq \delta_1 \leq 1$ and $0 < \delta_2 < 1/2$ such that $\delta + \delta_2 + \delta_1/2 > 1$. Suppose assumptions (A.1), (A.2), and (A.5)-(A.8) hold. Let $H \in \mathbb{R}^{h \times (p+d)}$ be a matrix of rank $r \in \mathbb{N}$, $h \in \mathbb{N}$, $r \leq h$, with finite maximum singular value $\bar{\sigma}(H) < \infty$. Let $j_H(\theta_0, \zeta_0)$ be as given in Theorem 6. Then as $n_{\min} \to \infty$,
\[ N^{1/2} H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} \to_d \mathcal{N}(0, j_H(\theta_0, \zeta_0)). \]

Proof
Write
\[ H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta_0 \\ \widehat{\zeta}_{DDIMM} - \zeta_0 \end{pmatrix} = H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \widehat{\theta}_{opt} \\ \widehat{\zeta}_{DDIMM} - \widehat{\zeta}_{opt} \end{pmatrix} + H \begin{pmatrix} \widehat{\theta}_{opt} - \theta_0 \\ \widehat{\zeta}_{opt} - \zeta_0 \end{pmatrix}. \]
To show the asymptotic distribution of the left-hand side, it is sufficient to show that $H (\widehat{\theta}_{DDIMM}^T - \widehat{\theta}_{opt}^T, \widehat{\zeta}_{DDIMM}^T - \widehat{\zeta}_{opt}^T)^T = o_p(N^{-1/2})$.

Given the assumptions of the theorem, we have the asymptotic distribution of $(\widehat{\theta}_{opt}, \widehat{\zeta}_{opt,ik})$ and $(\widehat{\theta}_{ik}, \widehat{\zeta}_{ik})$: both are consistent estimators of $(\theta_0, \zeta_{0,ik})$ and asymptotically normally distributed with rates $N^{-1/2}$ and $n_k^{-1/2}$ respectively. Then for each $k \in \{1, \ldots, K\}$,
\[ \begin{pmatrix} \widehat{\theta}_{opt} - \widehat{\theta}_{ik} \\ \widehat{\zeta}_{opt,ik} - \widehat{\zeta}_{ik} \end{pmatrix} = \begin{pmatrix} \widehat{\theta}_{opt} - \theta_0 \\ \widehat{\zeta}_{opt,ik} - \zeta_{0,ik} \end{pmatrix} - \begin{pmatrix} \widehat{\theta}_{ik} - \theta_0 \\ \widehat{\zeta}_{ik} - \zeta_{0,ik} \end{pmatrix} = O_p(n_k^{-1/2}). \]
Defining $\widehat{C}^*_{k,i}$, a subset of $\widehat{C}_{k,i}$, in Appendix A.4, we can rewrite $(\widehat{\theta}_{DDIMM}^T - \widehat{\theta}_{opt}^T, \widehat{\zeta}_{DDIMM}^T - \widehat{\zeta}_{opt}^T)^T$ as follows:
\[ \Big( \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \Big)^{-1} \sum_{k=1}^K \sum_{i=1}^J n_k \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \widehat{\theta}_{opt} \\ \widehat{\zeta}_{list} - \widehat{\zeta}_{opt} \end{pmatrix} = \sum_{k=1}^K \sum_{i=1}^J \Big( \sum_{l=1}^K \sum_{j=1}^J n_l \widehat{C}_{l,j} \Big)^{-1} n_k \widehat{C}^*_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \widehat{\theta}_{opt} \\ \widehat{\zeta}_{ik} - \widehat{\zeta}_{opt,ik} \end{pmatrix} \]
\[ = \sum_{k=1}^K \sum_{i=1}^J \big[ O_p(N^{-\delta})\, O_p(n_k^{-1/2}) \big] = O_p\big(KJ\, N^{-\delta}\, n^{-1/2}\big) = O_p\big(N^{1/2-\delta_2}\, N^{-\delta}\, N^{-\delta_1/2}\big) = O_p\big(N^{1/2-\delta-\delta_2-\delta_1/2}\big) = o_p(N^{-1/2}). \]

In this section we consider two sets of simulations to examine the performance of the closed-form estimator $\widehat{\theta}_{DDIMM}$ under the linear regression setting $\mu_i = X_i \theta$, where $\mu_i = E(Y_i \mid X_i, \theta)$ and $Y_i \sim \mathcal{N}(X_i \theta, \Sigma)$. The first set illustrates the finite sample performance and the properties in Theorem 5 of $\widehat{\theta}_{DDIMM}$ with fixed sample size $N$, varying number of subject groups $K$, varying dimensions $M$ of $Y$, and fixed number of response blocks $J$. The second set of simulations illustrates the performance and the properties in Theorem 7 of $\widehat{\theta}_{DDIMM}$ with growing sample size $N$ and response dimension $M$ of $Y$, and varying numbers of subject groups $K$ and response blocks $J$. In both settings, covariates consist of an intercept and two independently simulated $M$-dimensional multivariate normal variables, and the true value of $\theta$ is set to a fixed $\theta_0$. Simulations are conducted using R software on a standard Linux cluster.

We describe the first set of simulations. We specify $\Sigma = S \otimes A$ with a nested correlation structure, where $\otimes$ denotes the Kronecker product, $A$ is an AR(1) covariance matrix with standard deviation $\sigma = 4$ and correlation $\rho = 0.8$, and $S$ is a randomly simulated $J \times J$ positive-definite matrix.
8, and S is a randomly simulated J × J positive-definite matrix. We consider varying dimensions M of Y with fixed J = 5, and afixed sample size N = 5 ,
000 with varying K = 1 , ,
5. We consider two supervised learningprocedures: the pairwise composite likelihood using our own package, and the GEE using Rpackage geepack and our own package (see Supplemental Material). With each procedure,32 l ll l ll l ll l l l l ll l ll l ll l l l l ll l ll l ll l l
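As an illustration of this design, a minimal data-generating sketch is given below. It is not the authors' simulation code: the way $S$ is drawn, the block size $M/J$, and the values in `theta` are illustrative assumptions.

```r
# A minimal sketch of the first simulation design (not the paper's own code).
# Assumptions: M is a multiple of J; S is drawn as an arbitrary random
# positive-definite matrix; theta holds hypothetical placeholder values.
library(mvtnorm)  # for rmvnorm()

N <- 5000; M <- 200; J <- 5
m <- M / J                                   # response dimension per block
sigma <- 4; rho <- 0.8

A <- sigma^2 * rho^abs(outer(1:m, 1:m, "-")) # AR(1) covariance block
Z <- matrix(rnorm(J * J), J, J)
S <- cov2cor(crossprod(Z) + diag(J))         # a random J x J positive-definite matrix
Sigma <- kronecker(S, A)                     # nested covariance Sigma = S (x) A

theta <- c(0.3, 0.6, 0.9)                    # hypothetical true values
dat <- lapply(1:N, function(i) {
  Xi <- cbind(1, rnorm(M), rnorm(M))         # intercept + two M-variate normal covariates
  yi <- drop(Xi %*% theta) + drop(rmvnorm(1, sigma = Sigma))
  list(X = Xi, y = yi)
})
```

Cutting each $y_i$ into consecutive stretches of $m$ coordinates then yields the $J$ response blocks analyzed in parallel.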
Figure 1: Plot of simulation metrics (BIAS, ASE, ESE and RMSE for the intercept, X1 and X2, against response dimension $M$, for $K = 1, 2, 5$) for GEE, averaged over 1,000 simulations.

With each procedure, we fit the model with an AR(1) working block correlation structure. Results for the GEE are in Figure 1; results for the pairwise composite likelihood (CL) are in the Supplemental Material. We see that the mean asymptotic standard error (ASE) of $\widehat{\theta}_{DDIMM}$ approximates the empirical standard error (ESE) for all models, with slight variations due to the type of covariates simulated. This indicates that the covariance formula in Theorem 5 is correct. Additionally, $\widehat{\theta}_{DDIMM}$ appears consistent, since the root mean squared error (RMSE), ASE and ESE are approximately equal. Moreover, we notice that the ASE of $\widehat{\theta}_{DDIMM}$ decreases as the response dimension $M$ increases. This makes intuitive sense, since an increase in $M$ corresponds to an increase in the overall number of observations, resulting in increased power. We also see a decrease in the ASE as the number of subject groups increases, which is due to the heterogeneity of block covariance parameters. Lastly, we observe from Table 2 that the mean CPU time is very fast for the GEE, and decreases substantially as the number of subject groups increases.

Response dimension | Number of subject groups
                   | K=1     | K=2   | K=5
M=200              | 45      | 23    | 11
M=500              | 351     | 184   | 87
M=1,000            | 1,956   | 961   | 417

Table 2: Mean CPU time in seconds for each setting with the GEE block analysis, averaged over 1,000 simulations. Mean CPU time is computed as the maximum CPU time taken over parallelized block analyses, added to the CPU time taken by the rest of the procedure.

We describe the second set of simulations, where we consider diverging sample size $N$ and response dimension $M$, and diverging numbers of subject groups $K$ and response blocks $J$. We consider two settings: in Setting I, we let the sample size $N = 5{,}000$ with number of subject groups $K = 1$, and let the response dimension $M = 4{,}500$ with number of response blocks $J = 6$; in Setting II, we let the sample size $N = 10{,}000$ with number of subject groups $K = 2$, and let the response dimension $M = 9{,}000$ with number of response blocks $J = 12$. Responses are simulated from a multivariate normal distribution with AR(1) covariance structure, with standard deviation $\sigma = 6$ and correlation $\rho = 0.8$. This means there are no heterogeneous block parameters, so we expect a slightly less efficient estimator, since there is less variability in the outcome. We learn mean and covariance parameters using the GEE with an AR(1) working block correlation structure; a sketch of one block analysis is given below. Mean bias (BIAS), RMSE, ESE and ASE of $\widehat{\theta}_{DDIMM}$ are reported in Table 3. We observe that RMSE, ESE and ASE are very close, indicating appropriate estimation of $\widehat{\theta}_{DDIMM}$ and of its covariance given in Theorem 7. We also confirm DDIMM's ability to handle large sample size $N$ and response dimension $M$.
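For a single block, the GEE learning step can be carried out with the R package geepack. The following is a minimal sketch, not the paper's implementation: it assumes the block's data have been reshaped to long format in a hypothetical data frame `block_long` with columns `y`, `x1`, `x2` and a subject identifier `id`, ordered by repeated measurement within subject.

```r
# Minimal sketch of one block's GEE analysis with an AR(1) working structure.
# block_long and its column names are hypothetical.
library(geepack)

fit <- geeglm(y ~ x1 + x2, id = id, data = block_long,
              family = gaussian, corstr = "ar1")

theta_hat_block <- coef(fit)                        # block estimate of theta
se_block <- summary(fit)$coefficients[, "Std.err"]  # robust standard errors
```

Each block analysis can run on its own node; only the block-level estimates and their estimated sensitivity and variability matrices need to be returned for the combination step.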
Table 3: RMSE, BIAS, ESE and ASE of $\widehat{\theta}_{DDIMM}$ for each setting and each covariate, averaged over 500 simulations.
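The summary measures reported above can be computed from replicate-level output; the following is a minimal sketch, assuming `est` and `se` are $R \times p$ matrices of estimates and model-based standard errors over $R$ simulation replicates, and `theta0` is the true parameter vector (all three names are hypothetical).

```r
# Sketch of the reported summary metrics; est, se and theta0 are hypothetical inputs.
BIAS <- colMeans(est) - theta0                   # mean bias
ESE  <- apply(est, 2, sd)                        # empirical standard error
ASE  <- colMeans(se)                             # mean asymptotic standard error
RMSE <- sqrt(colMeans(sweep(est, 2, theta0)^2))  # root mean squared error
```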
DISCUSSION

We have presented large sample theory as a theoretical guarantee for the Doubly Distributed and Integrated Method of Moments (DDIMM), which incorporates a broad class of supervised learning procedures into a doubly distributed and parallelizable computational scheme for the efficient analysis of large samples of high-dimensional correlated responses in the MapReduce framework. Theoretical challenges related to combining correlated estimators were addressed in the proofs, including the asymptotic properties of the proposed closed-form estimator with fixed and diverging numbers of subject groups and response blocks.

The GMM approach to deriving the combined estimator $(\widehat{\theta}_c, \widehat{\zeta}_c)$ proposed in (4) requires only weak regularity of the estimating equations $\Psi_{jk}$ and $G_{jk}$. These assumptions are satisfied by a broad range of learning procedures. The closed-form estimator proposed in equation (9), on the other hand, requires local $n_k^{1/2}$-consistent estimators in individual blocks of size $n_k$, which is easily satisfied if $\Psi_{jk}$ and $G_{jk}$ are regular (see Song (2007), Chapter 3.5, for a definition of regular inference functions). This restricts the class of possible learning procedures, but still includes many analyses of interest.

A detailed discussion of the limitations and trade-offs of the single-split DIMM with CL block analyses is featured in Hector and Song (2019). As mentioned in Section 5, the DDIMM introduces additional flexibility in trading off between computational speed and inference: the number of subject groups $K$ and the smallest block size $n_{\min}$ can be chosen by the investigator to attain the desired speed and efficiency.

Particular applications of DDIMM to time series data are immediately obvious. Similarly, we envision potential applications to nation-wide hospital daily visit numbers of, for example, asthma patients over the course of the last decade. One could split the response (hospital daily intake or daily stock price) into $J$ years and into $K$ groups (of hospitals or stocks), analyze blocks separately and in parallel using GEE, and combine results using DDIMM; the combination step is sketched at the end of this section. Finally, extensions of our work to stochastic process modelling are accessible, with more challenging work involving regularization of $\theta$ also of interest.
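To make the combination step concrete, the closed-form meta-estimator stacks block estimates and weights them by the matrices $\widehat{C}_{k,i}$, as in the proof of Theorem 7. A minimal sketch follows; the list structures `est` and `C_hat` and the vector `n` of group sizes are hypothetical inputs, assumed to hold the stacked block estimates of $(\theta, \zeta)$ and their weight matrices.

```r
# Sketch of the closed-form DDIMM combination:
# (sum_k sum_i n_k C_hat[k,i])^{-1} sum_k sum_i n_k C_hat[k,i] est[k,i].
combine_ddimm <- function(est, C_hat, n) {
  K <- length(est); J <- length(est[[1]])
  pd <- length(est[[1]][[1]])        # dimension of the stacked (theta, zeta)
  lhs <- matrix(0, pd, pd)
  rhs <- numeric(pd)
  for (k in seq_len(K)) for (i in seq_len(J)) {
    W <- n[k] * C_hat[[k]][[i]]
    lhs <- lhs + W
    rhs <- rhs + drop(W %*% est[[k]][[i]])
  }
  drop(solve(lhs, rhs))              # combined closed-form estimator
}
```

Only the block summaries enter this step, so the full data never need to be reloaded on a common server.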
A Technical details

A.1 Summary of sensitivity matrix formulas
Sensitivity matrices are summarized in Table A.1.

sensitivity of             | w.r.t.*               | population                                   | sample                                       | plug-in sample
$\psi_{i,jk}$              | $\theta$              | $s^{\theta}_{\psi jk}(\theta, \zeta_{jk})$   | $S^{\theta}_{\psi jk}(\theta, \zeta_{jk})$   | $\widehat{S}^{\theta}_{\psi jk} = S^{\theta}_{\psi jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$
$\psi_{i,jk}$              | $\zeta_{jk}$          | $s^{\zeta}_{\psi jk}(\theta, \zeta_{jk})$    | $S^{\zeta}_{\psi jk}(\theta, \zeta_{jk})$    | $\widehat{S}^{\zeta}_{\psi jk} = S^{\zeta}_{\psi jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$
$g_{i,jk}$                 | $\theta$              | $s^{\theta}_{g jk}(\theta, \zeta_{jk})$      | $S^{\theta}_{g jk}(\theta, \zeta_{jk})$      | $\widehat{S}^{\theta}_{g jk} = S^{\theta}_{g jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$
$g_{i,jk}$                 | $\zeta_{jk}$          | $s^{\zeta}_{g jk}(\theta, \zeta_{jk})$       | $S^{\zeta}_{g jk}(\theta, \zeta_{jk})$       | $\widehat{S}^{\zeta}_{g jk} = S^{\zeta}_{g jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$
$(\psi_{i,jk}, g_{i,jk})$  | $(\theta, \zeta_{jk})$ | $s_{jk}(\theta, \zeta_{jk})$                | $S_{jk}(\theta, \zeta_{jk})$                 | $\widehat{S}_{jk} = S_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$

Table A.1: Summary of sensitivity formulas. Formulas that are not used are marked "—". *"w.r.t." is shorthand for "with respect to".

A.2 Subsetting operation on variability matrices
Operation $[\widehat{V}^{\psi}_N]_{ij:k}$ extracts the submatrix of $\widehat{V}^{\psi}_N$ consisting of rows $\{(i-1) + (k-1)J\}p + 1$ to $\{i + (k-1)J\}p$ and columns $\{(j-1) + (k-1)J\}p + 1$ to $\{j + (k-1)J\}p$. Operation $[\widehat{V}^{g}_N]_{ij:k}$ extracts the submatrix of $\widehat{V}^{g}_N$ consisting of rows $1 + D_{ik}$ to $d_{ik} + D_{ik}$ and columns $1 + D_{jk}$ to $d_{jk} + D_{jk}$. Operation $[\widehat{V}^{\psi g}_N]_{ij:k}$ extracts the submatrix of $\widehat{V}^{\psi g}_N$ consisting of rows $\{(i-1) + (k-1)J\}p + 1$ to $\{i + (k-1)J\}p$ and columns $1 + D_{jk}$ to $d_{jk} + D_{jk}$, where $d_{jk}$ is the dimension of $\zeta_{jk}$ and $D_{jk}$ is defined in Section 5.1. The index arithmetic is illustrated in the sketch below.
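A small helper makes the first subsetting operation concrete; this is a hypothetical illustration (the function name and arguments are not from the paper), assuming $p$ mean parameters per block and $J$ response blocks per subject group.

```r
# Hypothetical helper for the subsetting operation [V_psi_N]_{ij:k}:
# indices of response block i within subject group k.
psi_block_index <- function(i, k, p, J) {
  (((i - 1) + (k - 1) * J) * p + 1):((i + (k - 1) * J) * p)
}

# Extraction of [V_psi_N]_{ij:k} then reads:
# V_psi_N[psi_block_index(i, k, p, J), psi_block_index(j, k, p, J)]
```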
A.3 Cumulative sum of dimensions of $\zeta$

Recall that we define $D_{ik}$ as the sum of the dimensions of $\zeta_{11}, \ldots, \zeta_{(i-1)k}$, and $D_{1k}$ as the sum of the dimensions of $\zeta_{11}, \ldots, \zeta_{J(k-1)}$. Specifically, let $D_{ik} = \sum_{l=1}^{k-1} \sum_{j=1}^{J} d_{jl} + \sum_{j=1}^{i-1} d_{jk}$ for $i, k > 1$, $D_{1k} = \sum_{l=1}^{k-1} \sum_{j=1}^{J} d_{jl}$ for $k > 1$, and $D_{11} = 0$. Let $D_k = \sum_{l=1}^{k-1} d_l$ for $k > 1$, $D_1 = 0$.

A.4 Definition of $\widehat{C}^*_{k,i}$

Let $k \in \{1, \ldots, K\}$ and $i \in \{1, \ldots, J\}$. Recall the definitions of $\widehat{A}^{\theta}_{k,ij}$, $\widehat{A}^{\zeta}_{k,ij}$, $\widehat{B}^{\theta}_{k,ij}$ and $\widehat{B}^{\zeta}_{k,ij}$ in Section 5.1. Define
$$ \widehat{C}^*_{k,i} = \begin{pmatrix} \sum_{j=1}^{J} \widehat{A}^{\theta}_{k,ij} \quad \sum_{j=1}^{J} \widehat{A}^{\zeta}_{k,ij} \\ 0_{D_{ik} \times (p+d)} \\ \widehat{B}^{\theta}_{k,i1} \quad \widehat{B}^{\zeta}_{k,i1} \\ \vdots \\ \widehat{B}^{\theta}_{k,iJ} \quad \widehat{B}^{\zeta}_{k,iJ} \\ 0_{(d - d_{ik} - D_{ik}) \times (p+d)} \end{pmatrix}. $$

B Additional proofs
B.1 Proof of Theorem 5
The following lemmas complete the proof of Theorem 5 given in the paper, under the assumed conditions.
Lemma B.1.1.
Define $\lambda(\theta, \zeta)$ as in (11) in the proof of Theorem 5. Then $\lambda(\theta, \zeta) \xrightarrow{p} 0$ as $n_{\min} \to \infty$.

Proof.
Using Lemma 1,
$$ \lambda(\theta, \zeta) = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{J} n_k \widehat{C}_{k,i} \begin{pmatrix} \theta - \widehat{\theta}_{ik} \\ \zeta - \widehat{\zeta}_{list} \end{pmatrix} = O_p\left(n_{\min}^{-1/2}\right) \left\{ j(\theta, \zeta) + O_p\left(N^{-1/2}\right) \right\} = O_p\left(n_{\min}^{-1/2}\right) + O_p\left(n_{\min}^{-1/2} N^{-1/2}\right) \xrightarrow{p} 0 \quad \text{as } n_{\min} \to \infty. $$

Lemma B.1.2.
The following relationship holds:
$$ \begin{pmatrix} \Psi_{jk}(\theta; \zeta_{jk}) \\ G_{jk}(\zeta_{jk}; \theta) \end{pmatrix} = \widehat{S}_{jk} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1}\right). $$

Proof.
Let $j \in \{1, \ldots, J\}$ and $k \in \{1, \ldots, K\}$ be fixed. For convenience, denote
$$ T_{jk}(\theta, \zeta_{jk}) = \begin{pmatrix} \Psi_{jk}(\theta; \zeta_{jk}) \\ G_{jk}(\zeta_{jk}; \theta) \end{pmatrix}, \qquad \tau_{i,jk}(\theta, \zeta_{jk}) = \begin{pmatrix} \psi_{i,jk}(\theta; \zeta_{jk}) \\ g_{i,jk}(\zeta_{jk}; \theta) \end{pmatrix}. $$
By a first-order Taylor expansion,
$$ E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) \right\} = E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\theta, \zeta_{jk}) \right\} + \nabla_{\theta, \zeta_{jk}} E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\theta, \zeta_{jk}) \right\} \Big|_{\theta^*, \zeta^*_{jk}} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix}, \qquad (14) $$
where $(\theta^*, \zeta^*_{jk})$ lies between $(\theta, \zeta_{jk})$ and $(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk})$. By condition (A.5),
$$ T_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) - T_{jk}(\theta, \zeta_{jk}) - E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) \right\} = O_p\left(N^{-1/2}\right) \frac{1 + N^{1/2} O_p\left(n_k^{-1/2}\right)}{N^{1/2}} = O_p\left(n_k^{-1/2} N^{-1/2}\right). \qquad (15) $$
In other words, the norm of the difference between $T_{jk}(\theta, \zeta_{jk})$ and $T_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) - E_{\theta, \zeta_{jk}}\{ \tau_{i,jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) \}$ goes to 0 at a rate faster than $(N n_k)^{-1/2}$. Adding (14) and (15), we have
$$ -T_{jk}(\theta, \zeta_{jk}) = T_{jk}(\widehat{\theta}_{jk}, \widehat{\zeta}_{jk}) - T_{jk}(\theta, \zeta_{jk}) - E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\theta, \zeta_{jk}) \right\} = \nabla_{\theta, \zeta_{jk}} E_{\theta, \zeta_{jk}}\left\{ \tau_{i,jk}(\theta, \zeta_{jk}) \right\} \Big|_{\theta^*, \zeta^*_{jk}} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1/2} N^{-1/2}\right) = -s_{jk}(\theta^*, \zeta^*_{jk}) \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1/2} N^{-1/2}\right). $$
Rearranging yields
$$ T_{jk}(\theta, \zeta_{jk}) = s_{jk}(\theta^*, \zeta^*_{jk}) \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1/2} N^{-1/2}\right). \qquad (16) $$
Finally, note that $\widehat{S}_{jk} = s_{jk}(\theta, \zeta_{jk}) + O_p(n_k^{-1/2}) = s_{jk}(\theta^*, \zeta^*_{jk}) + O_p(n_k^{-1/2})$. Plugging this into (16), we have
$$ T_{jk}(\theta, \zeta_{jk}) = \left\{ \widehat{S}_{jk} + O_p\left(n_k^{-1/2}\right) \right\} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1/2} N^{-1/2}\right) = \widehat{S}_{jk} \begin{pmatrix} \widehat{\theta}_{jk} - \theta \\ \widehat{\zeta}_{jk} - \zeta_{jk} \end{pmatrix} + O_p\left(n_k^{-1}\right). $$

B.2 Proof of Theorem 6
The following lemmas complete the proof of Theorem 6 given in the paper, under the assumed conditions.
Lemma B.2.1.
Define $\lambda(\theta, \zeta)$ as in (11) in the proof of Theorem 5. Then $\|\lambda(\theta, \zeta)\| = O_p\left(N^{-1/2-\delta_2} n_{\max}^{1/2}\right)$ and $\left\| \left\{ \nabla_{\theta, \zeta} \lambda(\theta, \zeta) \right\}^{-1} \right\| = O_p\left(N^{1/2+\delta_2} n_{\max}^{-1}\right)$.

Proof.
Due to the independence between subject groups, $\widehat{V}^{\psi}_N$, $\widehat{V}^{\psi g}_N$ and $\widehat{V}^{g}_N$ are all block diagonal: $\widehat{V}^{\psi}_N = \mathrm{diag}\{\widehat{V}^{\psi}_k\}_{k=1}^K$, $\widehat{V}^{\psi g}_N = \mathrm{diag}\{\widehat{V}^{\psi g}_k\}_{k=1}^K$, and $\widehat{V}^{g}_N = \mathrm{diag}\{\widehat{V}^{g}_k\}_{k=1}^K$. By the independence of subject groups, let
$$ v^{-1}(\theta, \zeta) = \begin{pmatrix} v^{\psi}(\theta, \zeta) & v^{\psi g}(\theta, \zeta) \\ v^{\psi g T}(\theta, \zeta) & v^{g}(\theta, \zeta) \end{pmatrix} = \begin{pmatrix} \mathrm{diag}\left\{\frac{N}{n_k} v^{\psi}_k(\theta, \zeta)\right\}_{k=1}^K & \mathrm{diag}\left\{\frac{N}{n_k} v^{\psi g}_k(\theta, \zeta)\right\}_{k=1}^K \\ \mathrm{diag}\left\{\frac{N}{n_k} v^{\psi g T}_k(\theta, \zeta)\right\}_{k=1}^K & \mathrm{diag}\left\{\frac{N}{n_k} v^{g}_k(\theta, \zeta)\right\}_{k=1}^K \end{pmatrix}. $$
Similar to the proof of Lemma 1, it can easily be shown that for each $k = 1, \ldots, K$, $\widehat{V}^{\psi}_k = (N/n_k) v^{\psi}_k(\theta, \zeta) + O_p(N^{-1/2})$, $\widehat{V}^{\psi g}_k = (N/n_k) v^{\psi g}_k(\theta, \zeta) + O_p(N^{-1/2})$, and $\widehat{V}^{g}_k = (N/n_k) v^{g}_k(\theta, \zeta) + O_p(N^{-1/2})$. Consider an arbitrary $k \in \{1, \ldots, K\}$. Let $(N/n_k) [v^{\psi}_k(\theta, \zeta)]_{ji} = [v^{\psi}(\theta, \zeta)]_{ji:k}$, and similarly define $[v^{\psi g}_k(\theta, \zeta)]_{ji}$ and $[v^{g}_k(\theta, \zeta)]_{ji}$. Then $\widehat{A}^{\theta}_{k,ij} = (N/n_k)\{a^{\theta}_{k,ij} + O_p(n_k^{-1/2})\}$, where $a^{\theta}_{k,ij}$ is defined as
$$ \left\{ s^{\theta T}_{\psi jk}(\theta, \zeta_{jk}) \left[ v^{\psi}_k(\theta, \zeta) \right]_{ji} + s^{\theta T}_{g jk}(\theta, \zeta_{jk}) \left[ v^{\psi g T}_k(\theta, \zeta) \right]_{ji} \right\} s^{\theta}_{\psi ik}(\theta, \zeta) + \left\{ s^{\theta T}_{\psi jk}(\theta, \zeta_{jk}) \left[ v^{\psi g}_k(\theta, \zeta) \right]_{ji} + s^{\theta T}_{g jk}(\theta, \zeta_{jk}) \left[ v^{g}_k(\theta, \zeta) \right]_{ji} \right\} s^{\theta}_{g ik}(\theta, \zeta). $$
We can show similar results for $\widehat{A}^{\zeta}_{k,ij}$, $\widehat{B}^{\theta}_{k,ij}$ and $\widehat{B}^{\zeta}_{k,ij}$. Then we can rewrite
$$ \|\lambda(\theta, \zeta)\| \le \sum_{k=1}^{K} O_p\left(n_k^{1/2} N^{-1}\right) = O_p\left(K n_{\max}^{1/2} N^{-1}\right) = O_p\left(N^{-1/2-\delta_2} n_{\max}^{1/2}\right), $$
and
$$ \|\nabla_{\theta, \zeta} \lambda(\theta, \zeta)\| \le \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{J} n_k \left\| \widehat{C}_{k,i} \right\| \le O_p\left(N^{-1/2-\delta_2} n_{\max}^{1/2}\right) + O\left(N^{-1/2-\delta_2} n_{\max}\right) = O_p\left(N^{-1/2-\delta_2} n_{\max}\right). $$
Since $\nabla_{\theta, \zeta} \lambda(\theta, \zeta)$ is symmetric positive-definite, the above provides a bound on its eigenvalues. Therefore, $\left\| \left\{ \nabla_{\theta, \zeta} \lambda(\theta, \zeta) \right\}^{-1} \right\| = O_p\left(N^{1/2+\delta_2} n_{\max}^{-1}\right)$.

Lemma B.2.2.
For some matrices $E_k$, $k = 1, \ldots, K$, of $0$'s and $1$'s, the following asymptotic properties hold:
$$ \frac{n_k}{N} \sum_{i=1}^{J} \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{list} - \zeta \end{pmatrix} = \frac{n_k}{N} E_k Z_k + O_p\left(N^{-1}\right), \quad \text{and} \quad \frac{n_k}{N} \sum_{i=1}^{J} \widehat{C}_{k,i} = \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_k^{1/2} N^{-1}\right), $$
where $n_k^{1/2} Z_k \xrightarrow{d} \mathcal{N}\left(0, j_k^{-1}(\theta, \zeta_k)\right)$.

Proof.
Recall that $\widehat{C}_{k,i} ( \widehat{\theta}_{ik}^T - \theta^T, \widehat{\zeta}_{list}^T - \zeta^T )^T = \widehat{C}^*_{k,i} ( \widehat{\theta}_{ik}^T - \theta^T, \widehat{\zeta}_{ik}^T - \zeta_{ik}^T )^T$. Let $[v_k^{-1}(\theta, \zeta_k)]_{ij}$ subset the rows for the parameters corresponding to block $(i,k)$ and the columns for the parameters corresponding to block $(j,k)$ of the matrix $v_k^{-1}(\theta, \zeta_k)$. Define $j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) = s_{jk}(\theta, \zeta_{jk}) [v_k^{-1}(\theta, \zeta_k)]_{ji} s_{ik}(\theta, \zeta_{ik})$, and $[j_k^{-1}(\theta, \zeta_k)]_i$ the submatrix of $j_k^{-1}(\theta, \zeta_k)$ corresponding to the parameters in block $(i,k)$, such that
$$ n_k^{1/2} \left\{ \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \right\} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} \xrightarrow{d} \mathcal{N}\left(0, \left[ j_k^{-1}(\theta, \zeta_k) \right]_i\right). $$
Then, using the results in the proof of Lemma B.2.1, let $E_k$ and $E_{k,i}$ be matrices of $0$'s and $1$'s such that
$$ \frac{n_k}{N} \sum_{i=1}^{J} \widehat{C}_{k,i} = \frac{n_k}{N} E_k \left\{ j_k(\theta, \zeta_k) + O_p\left(n_k^{-1/2}\right) \right\} E_k^T = \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_k^{1/2} N^{-1}\right), $$
and
$$ \frac{n_k}{N} \sum_{i=1}^{J} \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{list} - \zeta \end{pmatrix} = \frac{n_k}{N} E_k \sum_{i=1}^{J} E_{k,i} \left\{ \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) + O_p\left(n_k^{-1/2}\right) \right\} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} = \frac{n_k}{N} E_k \sum_{i=1}^{J} E_{k,i} \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} + O_p\left(N^{-1}\right). $$
To obtain the desired result, define
$$ Z_k = \sum_{i=1}^{J} E_{k,i} \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix}. $$

Lemma B.2.3. $N^{1/2} H ( \widehat{\theta}_{DDIMM}^T - \theta^T, \widehat{\zeta}_{DDIMM}^T - \zeta^T )^T$ can be rewritten as
$$ H \left\{ \sum_{k=1}^{K} \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_{\max}^{1/2} N^{-1/2-\delta_2}\right) \right\}^{-1} \left[ \sum_{k=1}^{K} \left\{ \left(\frac{n_k}{N}\right)^{1/2} E_k \, n_k^{1/2} Z_k \right\} + O_p\left(N^{-\delta_2}\right) \right]. $$

Proof.
$$ N^{1/2} H \begin{pmatrix} \widehat{\theta}_{DDIMM} - \theta \\ \widehat{\zeta}_{DDIMM} - \zeta \end{pmatrix} = N^{1/2} H \left( \sum_{k=1}^{K} \sum_{i=1}^{J} \frac{n_k}{N} \widehat{C}_{k,i} \right)^{-1} \sum_{k=1}^{K} \sum_{i=1}^{J} \frac{n_k}{N} \widehat{C}_{k,i} \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{list} - \zeta \end{pmatrix} $$
$$ = H \left[ \sum_{k=1}^{K} \left\{ \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_k^{1/2} N^{-1}\right) \right\} \right]^{-1} \cdot \left[ \sum_{k=1}^{K} \frac{n_k}{N^{1/2}} E_k \sum_{i=1}^{J} E_{k,i} \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} + O_p\left(N^{-1/2}\right) \right] $$
$$ = H \left\{ \sum_{k=1}^{K} \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(K n_{\max}^{1/2} N^{-1}\right) \right\}^{-1} \cdot \left[ \sum_{k=1}^{K} \frac{n_k}{N^{1/2}} E_k \sum_{i=1}^{J} E_{k,i} \sum_{j=1}^{J} j_{jik}(\theta, \zeta_{jk}, \zeta_{ik}) \begin{pmatrix} \widehat{\theta}_{ik} - \theta \\ \widehat{\zeta}_{ik} - \zeta_{ik} \end{pmatrix} + O_p\left(K N^{-1/2}\right) \right] $$
$$ = H \left\{ \sum_{k=1}^{K} \frac{n_k}{N} E_k j_k(\theta, \zeta_k) E_k^T + O_p\left(n_{\max}^{1/2} N^{-1/2-\delta_2}\right) \right\}^{-1} \left[ \sum_{k=1}^{K} \left\{ \left(\frac{n_k}{N}\right)^{1/2} E_k \, n_k^{1/2} Z_k \right\} + O_p\left(N^{-\delta_2}\right) \right]. $$

C Bibliography
Bai, Y., Kang, J., and Song, P. X.-K. (2014). Efficient pairwise composite likelihood estimation for spatial-clustered data. Biometrics, 70(3):661–670.

Bodnar, O., Bodnar, T., and Gupta, A. K. (2010). Estimation and inference for dependence in multivariate data. Journal of Multivariate Analysis, 101(4):869–881.

Bradley, R. C. (1985). On the central limit question under absolute regularity. The Annals of Probability, 13(4):1314–1325.

Carey, V., Zeger, S. L., and Diggle, P. (1993). Modelling multivariate binary data with alternating logistic regressions. Biometrika, 80(3):517–526.

Chan, J. S., Kuk, A. Y., Bell, J., and McGilchrist, C. (1998). The analysis of methadone clinic data using marginal and conditional logistic models with mixture of random effects. Australian and New Zealand Journal of Statistics, 40(1):1–10.

Chen, X. and Xie, M. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24:1655–1684.

Cox, D. R. and Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika, 91(3):729–737.

Donald, S. G., Imbens, G. W., and Newey, W. K. (2003). Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics, 117(1):55–93.

Fu, L. and Wang, Y.-G. (2012). Quantile regression for longitudinal data with a working correlation model. Computational Statistics and Data Analysis, 56(8):2526–2538.

Han, P. and Song, P. X.-K. (2011). A note on improving quadratic inference functions using a linear shrinkage approach. Statistics and Probability Letters, 81(3):438–445.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029–1054.

Hector, E. C. and Song, P. X.-K. (2019). A distributed and integrated method of moments for high-dimensional correlated data analysis. arXiv preprint, arXiv:1910.02986.

Heyde, C. C. (1997). Quasi-likelihood and Its Application: A General Approach to Optimal Parameter Estimation. Springer Series in Statistics.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.

Huber, P. J. (2009). Robust Statistics. Wiley Series in Probability and Statistics, 2nd edition.

Jin, Z. (2011). Aspects of Composite Likelihood Inference. PhD thesis, University of Toronto.

Joe, H. (2014). Dependence Modeling with Copulas. Chapman & Hall, first edition.

Johnstone, I. M. and Titterington, D. M. (2009). Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4237–4253.

Jung, S.-H. (1996). Quasi-likelihood for median regression models. Journal of the American Statistical Association, 91(433):251–257.

Khezr, S. N. and Navimipour, N. J. (2017). MapReduce and its applications, challenges and architecture: a comprehensive review and directions for future research. Journal of Grid Computing, 15(3):295–321.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22.

Liang, K.-Y., Zeger, S. L., and Qaqish, B. (1992). Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society, Series B, 54(1):3–40.

Lin, D.-Y. and Zeng, D. (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97(2):321–332.

Lin, N. and Xi, R. (2011). Aggregated estimating equation estimation. Statistics and Its Interface, 4(1):73–83.

Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics, 80:220–239.

Liu, D., Liu, R. Y., and Xie, M. (2015). Multivariate meta-analysis of heterogeneous studies using only summary statistics: efficiency and robustness. Journal of the American Statistical Association, 110(509):326–340.

Lu, X. and Fan, Z. (2015). Weighted quantile regression for longitudinal data. Computational Statistics, 30(2):569–592.

Mackey, L., Talwalkar, A., and Jordan, M. I. (2011). Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems 24, pages 1134–1142.

Masarotto, G. and Varin, C. (2012). Gaussian copula marginal regression. Electronic Journal of Statistics, 6:1517–1549.

Newey, W. K. (2004). Efficient semiparametric estimation via moment restrictions. Econometrica, 72(6):1877–1897.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245.

Pan, J. and Mackenzie, G. (2003). On modelling mean-covariance structures in longitudinal studies. Biometrika, 90(1):239–244.

Peligrad, M. (1986). Recent advances in the central limit theorem and its weak invariance principle for mixing sequences of random variables (a survey). In Eberlein, E. and Taqqu, M. S., editors, Dependence in Probability and Statistics. Progress in Probability and Statistics, volume 11. Birkhäuser, Boston, MA.

Secchi, P. (2018). On the role of statistics in the era of big data: a call for a debate. Statistics and Probability Letters, 136:10–14.

Singh, K., Xie, M., and Strawderman, W. E. (2005). Combining information from independent sources through confidence distributions. The Annals of Statistics, 33(1):159–183.

Song, P. X.-K. (2007). Correlated Data Analysis: Modeling, Analytics, and Applications. Springer Series in Statistics.

Song, P. X.-K., Li, M., and Yuan, Y. (2009). Joint regression analysis of correlated data using Gaussian copulas. Biometrics, 65(1):60–68.

Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21(1):5–42.

Wang, Y.-G., Lin, X., and Zhu, M. (2005). Robust estimating functions and bias correction for longitudinal data analysis. Biometrics, 61(3):684–691.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61(3):439–447.

Xie, M. and Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: a review. International Statistical Review, 81(1):3–39.

Yang, C.-C., Chen, Y.-H., and Chang, H.-Y. (2017). Joint regression analysis of marginal quantile and quantile association: application to longitudinal body mass index in adolescents. Journal of the Royal Statistical Society, Series C, 66(5):1075–1090.

Zhang, W., Leng, C., and Tang, C. Y. (2015a). A joint modelling approach for longitudinal studies. Journal of the Royal Statistical Society, Series B, 77(1):219–238.

Zhang, Y., Duchi, J., and Wainwright, M. (2015b). Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16:3299–3340.

Zhao, L. P. and Prentice, R. L. (1990). Correlated binary regression using a quadratic exponential model.