Individual Data Protected Integrative Regression Analysis of High-dimensional Heterogeneous Data
Tianxi Cai, Molei Liu, and Yin Xia

Abstract
Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra-high dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as the DataSHIELD constraint (Wolfson et al., 2010). Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and the model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.

Keywords: High dimensionality; model heterogeneity; DataSHIELD; sparse meta-analysis; distributed learning; debiased LASSO; rate optimality; sparsistency.

Authors are listed in alphabetical order. Department of Biostatistics, Harvard School of Public Health, Harvard University; the research of Tianxi Cai was supported in part by NIH Grants U54 HG007963 and R21 CA242940. Department of Statistics, School of Management, Fudan University; the research of Yin Xia was supported in part by NSFC Grants 11771094 and 11690013.

1 Introduction
Synthesizing information from multiple studies is crucial for evidence-based medicine and policy decision making. Meta-analyzing multiple studies allows for more precise estimates and enables investigation of generalizability. In the presence of heterogeneity across studies and high dimensional predictors, however, such integrative analysis is highly challenging. An example of such integrative analysis is the development of generalizable predictive models using electronic health records (EHR) data from different hospitals. In addition to high dimensional features, EHR data analysis encounters privacy constraints, in that individual-level data typically cannot be shared across local hospital sites, which makes the challenge of integrative analysis even more pronounced. Breach of privacy arising from data sharing is in fact a growing concern for scientific studies in general. Recently, Wolfson et al. (2010) proposed a generic individual-information protected integrative analysis framework, named DataSHIELD, that transfers only summary statistics from each distributed local site to the central site for pooled analysis. Conceptually highly valued by research communities (e.g., Jones et al., 2012; Doiron et al., 2013), DataSHIELD facilitates important data co-analysis settings where individual-level data meta-analysis (ILMA) is not feasible due to ethical and/or legal restrictions (Gaye et al., 2014). In the low dimensional setting, a number of statistical methods have been developed for distributed analysis that satisfy the DataSHIELD constraint (e.g., Chen et al., 2006; Wu et al., 2012; Liu and Ihler, 2014; Lu et al., 2015; Huang and Huo, 2015; Han and Liu, 2016; He et al., 2016; Zöller et al., 2018; Duan et al., 2019, 2020). Distributed learning methods for high dimensional regression have largely focused on settings without between-study heterogeneity, as detailed in Section 1.2. To the best of our knowledge, no existing distributed learning method can effectively handle high dimensionality together with model heterogeneity across the local sites.
In the context of high dimensional regression, several recently proposed distributed inference approaches can potentially be used for meta-analysis under the DataSHIELD constraint. Specifically, Tang et al. (2016), Lee et al. (2017) and Battey et al. (2018) proposed distributed inference procedures that aggregate local debiased LASSO estimators (Zhang and Zhang, 2014; Van de Geer et al., 2014; Javanmard and Montanari, 2014). By including a debiasing step in their pipelines, the resulting estimators can be used directly for inference. Lee et al. (2017) and Battey et al. (2018) proposed to further truncate the aggregated dense debiased estimators to achieve sparsity; see also Maity et al. (2019). Although this debiasing-based strategy can be extended to fit our heterogeneous modeling assumptions, it loses statistical efficiency because it fails to account for the heterogeneity of the information matrices across sites. In addition, the use of a debiasing procedure at the local sites incurs additional estimation error, as detailed in Section 4.4. Besides, Lu et al. (2015) and Li et al. (2016) proposed distributed approaches for $\ell_1$-regularized logistic and Cox regression that rely on iterative communication across the studies. Their methods require sequential communication between the local sites and the central machine, which can be time- and resource-consuming, especially since human effort is often needed to perform the computation and data transfer in many practical settings. Chen and Xie (2014) proposed to estimate high dimensional parameters by first adopting majority voting to select a positive set and then combining local estimates of the coefficients belonging to this set. Wang et al. (2014) proposed to aggregate the local estimators through their median values rather than their means, which is more robust to poor estimation at local sites with insufficient sample sizes (Minsker, 2019). More recently, Wang et al. (2017) and Jordan et al. (2019) presented a communication-efficient surrogate likelihood framework for distributed statistical learning that transfers only first-order summary statistics, i.e., gradients, between the local sites and the central site. Fan et al. (2019) extended this idea and proposed two iterative distributed optimization algorithms for general penalized likelihood problems. However, their framework, as well as the others summarized in this paragraph, is restricted to homogeneous scenarios and cannot easily be extended to settings with heterogeneous models or covariates.

In this paper, we fill this methodological gap by proposing a novel data-Shielding High-dimensional Integrative Regression (SHIR) method that accommodates cross-study heterogeneity under the DataSHIELD constraints. While SHIR can be viewed as analogous to integrative analysis of debiased local LASSO estimators, it achieves debiasing without having to debias the local estimators. SHIR solves a LASSO problem only once at each local site, requires neither inverse Hessian matrices nor locally debiased estimators, and needs only one round of communication. Statistically, it provides integrative model estimation and variable selection in the presence of high dimensionality and heterogeneity in model parameters across sites.
In addition, under the ultra-high dimensional regime where $p$ can grow exponentially with $n$, SHIR is shown to achieve performance asymptotically equivalent to that of the estimator obtained from the ideal individual patient data (IPD) pooled across sites, and to attain consistent variable selection under certain sparsity assumptions. Such properties are not readily available in the existing literature, and some novel technical tools are developed for the theoretical verification. We also show theoretically that SHIR is statistically more efficient than approaches based on integrating and truncating locally debiased estimators (e.g., Lee et al., 2017; Battey et al., 2018). Our numerical studies further verify this by comparing our method with existing approaches: SHIR enjoys numerical performance close to that of the ideal IPD estimator and outperforms the other methods.

The rest of this paper is organized as follows. We introduce the setting in Section 2 and describe SHIR, our proposed approach, in Section 3. Theoretical properties of SHIR are studied in Section 4: we derive the upper bounds for its prediction and estimation risks, compare them with those of the existing approaches, and show that the errors incurred by aggregating derived data are negligible compared to the statistical minimax rate. When the true model is ultra-sparse, SHIR is shown to be asymptotically equivalent to the IPD estimator and to achieve sparsistency. Section 5 compares the performance of SHIR with existing methods through simulations. We apply SHIR to derive classification models for coronary artery disease (CAD) using EHR data from four different disease cohorts in Section 6. Section 7 concludes the paper with a discussion. Technical proofs of the theoretical results are provided in the Supplementary Material.
2 Settings and Notation

Suppose there are $M$ independent studies, with $n_m$ subjects in the $m$-th study, $m = 1, \dots, M$. For the $i$-th subject in the $m$-th study, let $Y_i^{(m)}$ and $X_i^{(m)}$ respectively denote the response and the $p$-dimensional covariate vector, $D_i^{(m)} = (Y_i^{(m)}, X_i^{(m)\mathsf T})^{\mathsf T}$, $Y^{(m)} = (Y_1^{(m)}, \dots, Y_{n_m}^{(m)})^{\mathsf T}$, and $X^{(m)} = (X_1^{(m)}, X_2^{(m)}, \dots, X_{n_m}^{(m)})^{\mathsf T}$. We assume that the observations in study $m$, $D^{(m)} = \{D_i^{(m)}, i = 1, \dots, n_m\}$, are independent and identically distributed. Without loss of generality, we assume that $X_i^{(m)}$ includes $1$ as its first component and that $X_{i,-1}^{(m)}$ has mean $\mathbf 0$, where for any vector $x = (x_1, x_2, \dots, x_d)^{\mathsf T}$, $x_{-1} = (x_2, \dots, x_d)^{\mathsf T}$. We define the population parameters of interest as
$$\bar\beta^{(m)} = \operatorname{argmin}_{\beta^{(m)}} \bar{\mathcal L}_m(\beta^{(m)}), \quad \text{where } \bar{\mathcal L}_m(\beta^{(m)}) = \mathbb E\{f(\beta^{(m)\mathsf T} X_i^{(m)}, Y_i^{(m)})\}, \quad \beta^{(m)} = (\beta_1^{(m)}, \beta_2^{(m)}, \dots, \beta_p^{(m)})^{\mathsf T},$$
for some specified loss function $f$. Examples of loss functions include $f(\beta^{\mathsf T}x, y) = (y - \beta^{\mathsf T}x)^2$ for linear regression and $f(\beta^{\mathsf T}x, y) = -y\beta^{\mathsf T}x + \log\{1 + \exp(\beta^{\mathsf T}x)\}$ for logistic regression. We assume that $\bar\beta^{(m)}$ varies across studies but may share similar support and magnitudes, which makes it possible to attain more accurate estimation by jointly estimating the regression models across studies. We define such structure more rigorously in Section 2.1.

Throughout, for any integer $d$, $[d] = \{1, \dots, d\}$. For any vector $x = (x_1, x_2, \dots, x_d)^{\mathsf T} \in \mathbb R^d$ and index set $S = \{j_1, \dots, j_k : j_1 < \dots < j_k\} \subseteq [d]$, $x_S = (x_{j_1}, \dots, x_{j_k})^{\mathsf T}$, $\|x\|_q$ denotes the $\ell_q$ norm of $x$, and $\|x\|_\infty = \max_{j \in [d]} |x_j|$. For any matrix $A = [A_{\bullet 1}, \dots, A_{\bullet d}] \in \mathbb R^{n \times d}$ and index sets $S_1, S_2 \subseteq [d]$, let $A_{j\bullet}$ and $A_{\bullet j}$ respectively denote the $j$-th row and column of $A$, let $A_{S_1 S_2}$ denote the sub-matrix with rows in $S_1$ and columns in $S_2$, $A_{\bullet S} = [A_{\bullet j_1}, \dots, A_{\bullet j_k}]$, $\|A\|_2 := [\Lambda_{\max}(A^{\mathsf T}A)]^{1/2}$, and $\|A\|_\infty = \max_j \|A_{j\bullet}\|_1$. Let $\beta_j = (\beta_j^{(1)}, \dots, \beta_j^{(M)})^{\mathsf T}$ and $\beta^{(\bullet)} = (\beta^{(1)\mathsf T}, \dots, \beta^{(M)\mathsf T})^{\mathsf T}$, and let $\bar\beta_j$, $\bar\beta^{(m)}$ and $\bar\beta^{(\bullet)} = (\bar\beta^{(1)\mathsf T}, \bar\beta^{(2)\mathsf T}, \dots, \bar\beta^{(M)\mathsf T})^{\mathsf T}$ respectively denote the true values of $\beta_j$, $\beta^{(m)}$ and $\beta^{(\bullet)}$.
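For concreteness, under the logistic loss the derivatives of $f$ with respect to its first argument, which appear repeatedly below, are
$$f(a, y) = -ya + \log\{1 + e^{a}\}, \qquad f'(a, y) = \operatorname{expit}(a) - y, \qquad f''(a, y) = \operatorname{expit}(a)\{1 - \operatorname{expit}(a)\},$$
where $\operatorname{expit}(a) = 1/(1 + e^{-a})$; in particular, $f''$ here is bounded and Lipschitz, which is relevant to the smoothness conditions of Section 4.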
2.1 Sparse Structures and Penalized Estimation

To estimate $\beta^{(\bullet)}$ based on $D^{(\bullet)} = \{D^{(1)}, \dots, D^{(M)}\}$, consider the empirical global loss function
$$\widehat{\mathcal L}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M n_m\,\widehat{\mathcal L}_m(\beta^{(m)}), \quad N = \sum_{m=1}^M n_m, \quad \widehat{\mathcal L}_m(\beta^{(m)}) = n_m^{-1}\sum_{i=1}^{n_m} f(\beta^{(m)\mathsf T} X_i^{(m)}, Y_i^{(m)}), \quad m = 1, \dots, M.$$
Minimizing $\widehat{\mathcal L}(\beta^{(\bullet)})$ is obviously equivalent to estimating each $\beta^{(m)}$ using $D^{(m)}$ only. To improve the estimation of $\beta^{(\bullet)}$ by synthesizing information across $D^{(\bullet)}$ and to overcome the high dimensionality, we assume that $\bar\beta^{(\bullet)}$ is sparse and employ the penalized loss $\widehat{\mathcal L}(\beta^{(\bullet)}) + \lambda\rho(\beta^{(\bullet)})$, with the penalty function $\rho(\cdot)$ designed to leverage prior information on $\bar\beta^{(\bullet)}$.

Under the prior assumption that $\bar\beta^{(1)}_{-1}, \dots, \bar\beta^{(M)}_{-1}$, i.e., the true coefficients excluding the intercepts, share the same support, we may choose $\rho(\cdot)$ as the group LASSO penalty (Yuan and Lin, 2006), $\rho(\beta^{(\bullet)}) = \sum_{j=2}^p \|\beta_j\|_2$. Alternatively, following typical meta-analysis, we decompose $\beta_j^{(m)}$ as $\beta_j^{(m)} = \mu_j + \alpha_j^{(m)}$, with $\alpha_j = (\alpha_j^{(1)}, \dots, \alpha_j^{(M)})^{\mathsf T}$ and $\mathbf 1_{M\times 1}^{\mathsf T}\alpha_j = 0$ for identifiability. Similarly to Cheng et al. (2015), we may impose a mixture of LASSO and group LASSO penalties:
$$\rho_1(\beta^{(\bullet)}) = \sum_{j=2}^p |\mu_j| + \lambda_g \sum_{j=2}^p \|\alpha_j\|_2, \qquad (1)$$
where $\lambda_g \ge 0$. For each $j$, $\mu_j$ represents the average effect of the covariate $X_j$ and $\alpha_j$ captures the between-study heterogeneity of the effects. This penalty accommodates three types of covariates: (i) homogeneous effects, with $\mu_j \ne 0$ and $\alpha_j = \mathbf 0$; (ii) heterogeneous effects, with $\alpha_j \ne \mathbf 0$; and (iii) null effects, with $\mu_j = 0$ and $\alpha_j = \mathbf 0$. The penalty $\rho_1$ differs slightly from that of Cheng et al. (2015), where $\|\alpha_{j,-1}\|_2$ was used instead of $\|\alpha_j\|_2$. This modification brings two main advantages: (1) the estimator is invariant to permutations of the indices of the $M$ studies; and (2) it yields better theoretical estimation error bounds for the heterogeneous effects, as detailed in the proofs. Other penalties with similar properties include the hierarchical LASSO penalty (Zhou and Zhu, 2010), $\rho(\beta^{(\bullet)}) = \sum_{j=2}^p \|\beta_j\|_1^{1/2}$. Although the proposed SHIR framework can accommodate a variety of sparsity-inducing penalties, our theoretical derivations and implementation focus on the mixture penalty $\rho(\cdot) = \rho_1(\cdot)$ throughout; we comment on extensions of SHIR to alternative penalty functions in Section 7.

With a given penalty function $\rho(\cdot)$, we may form the penalized loss $\widehat Q(\beta^{(\bullet)}) = \widehat{\mathcal L}(\beta^{(\bullet)}) + \lambda\rho(\beta^{(\bullet)})$, and an idealized IPD estimator of $\beta^{(\bullet)}$ can be obtained as
$$\widehat\beta^{(\bullet)}_{\mathrm{IPD}} = \operatorname{argmin}_{\beta^{(\bullet)}}\ \widehat Q(\beta^{(\bullet)}), \qquad (2)$$
with some tuning parameter $\lambda \ge 0$.

3 The SHIR Method

However, the IPD estimator is not feasible under the DataSHIELD constraint. Our goal is to construct an alternative estimator that asymptotically attains the same efficiency as $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ while sharing only summary data. When $p$ is not large, the sparse meta-analysis (SMA) approach of He et al. (2016) achieves this goal via a second-order Taylor expansion, estimating $\beta^{(\bullet)}$ by $\widehat\beta^{(\bullet)}_{\mathrm{SMA}} = \operatorname{argmin}_{\beta^{(\bullet)}} \widehat Q_{\mathrm{SMA}}(\beta^{(\bullet)})$, where
$$\widehat Q_{\mathrm{SMA}}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M (\beta^{(m)} - \breve\beta^{(m)})^{\mathsf T}\,\breve V_m^{-1}\,(\beta^{(m)} - \breve\beta^{(m)}) + \lambda\rho(\beta^{(\bullet)}), \qquad (3)$$
$\breve\beta^{(m)} = \operatorname{argmin}_{\beta^{(m)}} \widehat{\mathcal L}_m(\beta^{(m)})$ and $\breve V_m = \{n_m \nabla^2 \widehat{\mathcal L}_m(\breve\beta^{(m)})\}^{-1}$. When $\beta^{(m)}$ is the same across sites and $\lambda = 0$, $\widehat\beta^{(\bullet)}_{\mathrm{SMA}}$ is the inverse-variance-weighted estimator (Lin and Zeng, 2010). Although SMA attains the oracle property for relatively small $p$, it fails when $p$ is large due to the failure of $\breve\beta^{(m)}$.

In the high dimensional setting, one may try to overcome the limitation of the SMA approach by replacing $\breve\beta^{(m)}$ with the regularized LASSO estimator,
$$\widehat\beta^{(m)}_{\mathrm{LASSO}} = \operatorname{argmin}_{\beta^{(m)}}\ \widehat{\mathcal L}_m(\beta^{(m)}) + \lambda_m \|\beta^{(m)}_{-1}\|_1. \qquad (4)$$
However, aggregating $\{\widehat\beta^{(m)}_{\mathrm{LASSO}}, m \in [M]\}$ is problematic for large $p$ due to their inherent biases. To overcome the bias issue, we construct the SHIR method, motivated by SMA and the debiasing approach for the LASSO (Van de Geer et al., 2014), yet achieving debiasing without actually debiasing the $M$ local estimators. Specifically, we propose the SHIR estimator of $\beta^{(\bullet)}$ as $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}} = \operatorname{argmin}_{\beta^{(\bullet)}} \widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$, where
$$\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M n_m\Big\{\tfrac12\,\beta^{(m)\mathsf T}\widehat H_m\beta^{(m)} - \beta^{(m)\mathsf T}\widehat g_m\Big\} + \lambda\rho(\beta^{(\bullet)}), \qquad (5)$$
$\widehat H_m = \nabla^2\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})$ is an estimate of the Hessian matrix, and $\widehat g_m = \widehat H_m\widehat\beta^{(m)}_{\mathrm{LASSO}} - \nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})$. The SHIR estimator $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ satisfies the DataSHIELD constraint, as $\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$ depends on $D^{(m)}$ only through the summary statistics $\widehat{\mathbb D}_m = \{n_m, \widehat H_m, \widehat g_m\}$, which can be obtained within the $m$-th study; it requires only one round of data transfer from the local sites to the central node.

With $\{\widehat H_m, \widehat g_m, m = 1, \dots, M\}$, we may implement the SHIR procedure using coordinate descent algorithms (Friedman et al., 2010) together with a reparameterization. Specifically, let $\mu = (\mu_1, \dots, \mu_p)^{\mathsf T}$, $\alpha^{(\bullet)} = (\alpha^{(1)\mathsf T}, \dots, \alpha^{(M)\mathsf T})^{\mathsf T}$ and $\alpha^{(\bullet)}_{-1} = (\alpha^{(1)\mathsf T}_{-1}, \dots, \alpha^{(M)\mathsf T}_{-1})^{\mathsf T}$, and let $\bar\mu$ and $\bar\alpha^{(\bullet)}$ denote the true values of $\mu$ and $\alpha^{(\bullet)}$, respectively. Let $\widehat Q_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)}) = \widehat{\mathcal L}_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)}) + \lambda\rho_1(\mu, \alpha^{(\bullet)}; \lambda_g)$, where $\rho_1(\mu, \alpha^{(\bullet)}; \lambda_g) = \|\mu_{-1}\|_1 + \lambda_g\|\alpha^{(\bullet)}_{-1}\|_{2,1}$, $\|\alpha^{(\bullet)}_{-1}\|_{2,1} = \sum_{j=2}^p \|\alpha_j\|_2$, and
$$\widehat{\mathcal L}_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)}) = N^{-1}\sum_{m=1}^M n_m\Big\{\tfrac12\,(\mu + \alpha^{(m)})^{\mathsf T}\widehat H_m(\mu + \alpha^{(m)}) - \widehat g_m^{\mathsf T}(\mu + \alpha^{(m)})\Big\}.$$
Then the optimization problem in (5) can be reparameterized and represented as
$$(\widehat\mu_{\mathrm{SHIR}}, \widehat\alpha^{(\bullet)}_{\mathrm{SHIR}}) = \operatorname{argmin}_{(\mu, \alpha^{(\bullet)})}\ \widehat Q_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)}), \quad \text{subject to } \mathbf 1_{M\times 1}^{\mathsf T}\alpha_j = 0, \ j \in [p],$$
and $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ is recovered through the transformation $\beta_j^{(m)} = \mu_j + \alpha_j^{(m)}$ for every $j \in [p]$.
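To make the data flow concrete, the following sketch (our own naming, not from any released implementation) computes the summary statistics $\widehat{\mathbb D}_m = \{n_m, \widehat H_m, \widehat g_m\}$ at a single site under the logistic loss, assuming the local LASSO estimate $\widehat\beta^{(m)}_{\mathrm{LASSO}}$ has already been obtained, e.g., by cross-validated penalized logistic regression:

```python
import numpy as np

def logistic_summary_stats(X, y, beta_lasso):
    """Compute the SHIR summary statistics {n_m, H_m, g_m} at one site.

    X : (n_m, p) local design (first column all ones); y : (n_m,) binary
    response; beta_lasso : (p,) local LASSO estimate, obtained beforehand.
    """
    n = X.shape[0]
    prob = 1.0 / (1.0 + np.exp(-(X @ beta_lasso)))
    # Gradient of the average logistic loss at the local LASSO solution.
    grad = X.T @ (prob - y) / n
    # Hessian of the average logistic loss: H_m = X' W X / n, where
    # W = diag{p_i (1 - p_i)} collects the f'' weights.
    w = prob * (1.0 - prob)
    H = (X * w[:, None]).T @ X / n
    # g_m = H_m beta_lasso - grad is the only other quantity SHIR needs.
    g = H @ beta_lasso - grad
    return {"n": n, "H": H, "g": g}
```

Only `{n, H, g}` leave the site; the central node assembles the quadratic loss in (5) from these $M$ triplets and minimizes it under the mixture penalty.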
Remark 1.

The first term in $\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$ is essentially the second-order Taylor expansion of $\widehat{\mathcal L}(\beta^{(\bullet)})$ at the local LASSO estimators $\widehat\beta^{(\bullet)}_{\mathrm{LASSO}}$. The SHIR method can also be viewed as approximately aggregating local debiased LASSO estimators without actually carrying out the standard debiasing process. To see this, let
$$\widehat Q_{\mathrm{dLASSO}}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M \frac{n_m}{2}\,(\beta^{(m)} - \widehat\beta^{(m)}_{\mathrm{dLASSO}})^{\mathsf T}\widehat H_m(\beta^{(m)} - \widehat\beta^{(m)}_{\mathrm{dLASSO}}) + \lambda\rho(\beta^{(\bullet)}),$$
where $\widehat\beta^{(m)}_{\mathrm{dLASSO}}$ is the debiased LASSO estimator for the $m$-th study,
$$\widehat\beta^{(m)}_{\mathrm{dLASSO}} = \widehat\beta^{(m)}_{\mathrm{LASSO}} - \widehat\Theta_m\nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}}), \quad m = 1, \dots, M, \qquad (6)$$
and $\widehat\Theta_m$ is a regularized inverse of $\widehat H_m$. We may write
$$\widehat Q_{\mathrm{dLASSO}}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M n_m\Big\{\tfrac12\,\beta^{(m)\mathsf T}\widehat H_m\beta^{(m)} - \beta^{(m)\mathsf T}\widehat H_m\widehat\beta^{(m)}_{\mathrm{dLASSO}} + C_m\Big\} + \lambda\rho(\beta^{(\bullet)}) \approx N^{-1}\sum_{m=1}^M n_m\Big\{\tfrac12\,\beta^{(m)\mathsf T}\widehat H_m\beta^{(m)} - \beta^{(m)\mathsf T}\widehat g_m + C_m\Big\} + \lambda\rho(\beta^{(\bullet)}),$$
which equals $\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$ up to terms not depending on $\beta^{(\bullet)}$. Here we use $\widehat\Theta_m\widehat H_m \approx I$ in the approximation, and the term
$$C_m = \tfrac12\big\{\widehat H_m\widehat\beta^{(m)}_{\mathrm{LASSO}} - \widehat H_m\widehat\Theta_m\nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})\big\}^{\mathsf T}\big\{\widehat\beta^{(m)}_{\mathrm{LASSO}} - \widehat\Theta_m\nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})\big\}$$
does not depend on $\beta^{(\bullet)}$. We use $\widehat\Theta_m\widehat H_m \approx I$ only heuristically here, to show a connection between our SHIR estimator and the debiased LASSO; the validity and asymptotic properties of the SHIR estimator do not require obtaining any $\widehat\Theta_m$, or a theoretical guarantee that $\widehat\Theta_m\widehat H_m$ is sufficiently close to $I$.
Remark 2.

Compared with existing debiasing-type methods (Lee et al., 2017; Battey et al., 2018), SHIR is also computationally and statistically efficient, as it is performed without the debiased statistics (6) and achieves debiasing without calculating $\widehat\Theta_m$, which can only be estimated well under strong conditions (Van de Geer et al., 2014; Janková and Van De Geer, 2016).

The implementation of SHIR requires selecting three sets of tuning parameters: $\{\lambda_m, m \in [M]\}$, $\lambda$ and $\lambda_g$. We select $\{\lambda_m, m \in [M]\}$ for the local LASSO problems via standard $K$-fold cross-validation (CV). Selecting $\lambda$ and $\lambda_g$ requires balancing the trade-off between the model's degrees of freedom, denoted by $\mathrm{DF}(\lambda, \lambda_g)$, and the quadratic loss in $\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$. Tuning $\lambda$ and $\lambda_g$ via CV is not feasible, since individual-level data are not available at the central site. We propose to select $\lambda$ and $\lambda_g$ as the minimizer of the generalized information criterion (GIC) (Wang and Leng, 2007; Zhang et al., 2010), defined as
$$\mathrm{GIC}(\lambda, \lambda_g) = \mathrm{Deviance}(\lambda, \lambda_g) + \gamma_N\,\mathrm{DF}(\lambda, \lambda_g),$$
where $\gamma_N$ is a pre-specified scaling parameter and
$$\mathrm{Deviance}(\lambda, \lambda_g) = N^{-1}\sum_{m=1}^M n_m\Big\{\widehat\beta^{(m)\mathsf T}_{\mathrm{SHIR}}(\lambda, \lambda_g)\,\widehat H_m\,\widehat\beta^{(m)}_{\mathrm{SHIR}}(\lambda, \lambda_g) - 2\,\widehat g_m^{\mathsf T}\widehat\beta^{(m)}_{\mathrm{SHIR}}(\lambda, \lambda_g)\Big\}.$$
Following Zhang et al. (2010) and Vaiter et al. (2012), we define $\mathrm{DF}(\lambda, \lambda_g)$ as the trace of
$$\Big[\partial^2_{\widehat S_\mu, \widehat S_\alpha}\,\widehat Q_{\mathrm{SHIR}}(\widehat\mu_{\mathrm{SHIR}}, \widehat\alpha^{(\bullet)}_{\mathrm{SHIR}})\Big]^{-1}\Big[\partial^2_{\widehat S_\mu, \widehat S_\alpha}\,\widehat{\mathcal L}_{\mathrm{SHIR}}(\widehat\mu_{\mathrm{SHIR}}, \widehat\alpha^{(\bullet)}_{\mathrm{SHIR}})\Big],$$
where $\widehat S_\mu = \{j : \widehat\mu_{\mathrm{SHIR}, j}(\lambda, \lambda_g) \ne 0\}$, $\widehat S_\alpha = \{j : \|\widehat\alpha_{\mathrm{SHIR}, j}(\lambda, \lambda_g)\|_2 \ne 0\}$, and the operator $\partial^2_{\widehat S_\mu, \widehat S_\alpha}$ is the second-order partial derivative with respect to $(\mu_{\widehat S_\mu}^{\mathsf T}, \alpha^{(2)\mathsf T}_{\widehat S_\alpha}, \dots, \alpha^{(M)\mathsf T}_{\widehat S_\alpha})^{\mathsf T}$, after plugging $\alpha^{(1)} = -\sum_{m=2}^M \alpha^{(m)}$ into $\widehat Q_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)})$ or $\widehat{\mathcal L}_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)})$.
Remark 3.

As discussed in Kim et al. (2012), $\gamma_N$ can be chosen depending on the goal, with common choices including $\gamma_N = 2/N$ for AIC (Akaike, 1974), $\gamma_N = \log N/N$ for BIC (Bhat and Kumar, 2010), $\gamma_N = \log\log p \cdot \log N/N$ for modified BIC (Wang et al., 2009), and $\gamma_N = 2\log p/N$ for RIC (Foster and George, 1994). We used the BIC with $\gamma_N = \log N/N$ in our numerical studies.
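As an illustration of the tuning step, the sketch below (our own naming; `shir_fit`, which solves (5) at given tuning values, and `degrees_of_freedom`, which evaluates the trace formula above, are assumed helpers) scans a grid of $(\lambda, \lambda_g)$ values and returns the GIC minimizer with the BIC scaling $\gamma_N = \log N/N$:

```python
import numpy as np

def gic_value(stats, beta_hat, df, gamma_N):
    """GIC(lambda, lambda_g) = Deviance + gamma_N * DF, from summary data.

    stats : list of per-site dicts {n, H, g}; beta_hat : (M, p) fitted SHIR
    coefficients; df : DF(lambda, lambda_g); gamma_N : e.g. log(N)/N (BIC).
    """
    N = sum(s["n"] for s in stats)
    dev = sum(s["n"] * (b @ s["H"] @ b - 2.0 * s["g"] @ b)
              for s, b in zip(stats, beta_hat)) / N
    return dev + gamma_N * df

def select_tuning(stats, lam_grid, lam_g_grid, shir_fit, degrees_of_freedom):
    """Pick (lambda, lambda_g) minimizing the GIC; helpers are assumed given."""
    N = sum(s["n"] for s in stats)
    gamma_N = np.log(N) / N                       # BIC scaling, as in the text
    best, best_gic = None, np.inf
    for lam in lam_grid:
        for lam_g in lam_g_grid:
            beta_hat = shir_fit(stats, lam, lam_g)
            df = degrees_of_freedom(stats, beta_hat, lam, lam_g)
            val = gic_value(stats, beta_hat, df, gamma_N)
            if val < best_gic:
                best, best_gic = (lam, lam_g), val
    return best
```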
Remark 4.
For linear models, it has been shown that a proper choice of $\gamma_N$ guarantees GIC's model selection consistency under various divergence rates of the dimension $p$ (Kim et al., 2012). For example, for fixed $p$, GIC is consistent if $N\gamma_N \to \infty$ and $\gamma_N \to 0$. When $p$ diverges at a polynomial rate $N^\xi$, GIC is consistent with $\gamma_N = \log N/N$ (BIC) for smaller $\xi$, and with $\gamma_N = \log\log p \cdot \log N/N$ (modified BIC) over a wider range of $\xi$. When $p$ diverges at an exponential rate $O(\exp(\kappa N^\xi))$, GIC is consistent with $\gamma_N = N^{\nu - 1}$ for $\nu$ with $\xi < \nu < 1$. These results extend naturally to more general log-likelihood functions.

4 Theoretical Results
In this section, we present theoretical properties of $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ for $\rho(\beta^{(\bullet)}) = \rho_1(\beta^{(\bullet)})$; we discuss how our theoretical results extend to other sparse structures in Section 7. In Sections 4.2 and 4.3, we derive consistency and equivalence results for the prediction and estimation risks of SHIR under a high dimensional sparse model and a smooth loss function $f$. In Section 4.4, we compare the risk bounds of SHIR with those of an estimator derived from the debiasing-based aggregation approaches (Lee et al., 2017; Battey et al., 2018). In Section 4.5, we show that SHIR achieves sparsistency, i.e., variable selection consistency, for the non-zero sets of $\bar\mu$ and $\bar\alpha^{(\bullet)}$.

4.1 Notation and Definitions

We begin with some notation and definitions used throughout the paper. Let $o\{a(n)\}$, $O\{a(n)\}$, $\omega\{a(n)\}$, $\Omega\{a(n)\}$ and $\Theta\{a(n)\}$ respectively denote sequences growing at a smaller, equal/smaller, larger, equal/larger and equal rate relative to the sequence $a(n)$. Similarly, $o_P$, $O_P$, $\omega_P$, $\Omega_P$ and $\Theta_P$ denote the corresponding rates holding with probability approaching $1$ as $n \to \infty$. We define the model-complexity-adjusted effective sample sizes $n^{\mathrm{eff}}_m = n_m/(s\log p)$ for each study and $n^{\mathrm{eff}} = N/\{s(\log p + M)\}$, which are the main drivers of the rates of the proposed estimators. Following Vershynin (2018), we define the sub-Gaussian norm of a random variable $X$ as $\|X\|_{\psi_2} := \sup_{q \ge 1} q^{-1/2}(\mathbb E|X|^q)^{1/q}$. For any symmetric matrix $X$, let $\Lambda_{\min}(X)$ and $\Lambda_{\max}(X)$ denote its minimum and maximum eigenvalues. Let $T = (-\mathbf 1_{(M-1)\times 1}, I_{(M-1)\times(M-1)})^{\mathsf T}$ and define $\|x\|_T := \|Tx\|_2$ for $x \in \mathbb R^{M-1}$, with conjugate norm $\|x\|_{\widetilde T} := \|T(T^{\mathsf T}T)^{-1}x\|_2$. For $a \in \mathbb R$, denote by $\operatorname{sign}(a)$ the sign of $a$, and for an event $E$, denote by $I(E)$ the indicator of $E$. Denote $S_\mu = \{j : \bar\mu_j \ne 0\}$, $S_\alpha = \{j : \|\bar\alpha_j\|_2 \ne 0\}$, $S = S_\mu \cup S_\alpha$, $s_\mu = |S_\mu|$, $s_\alpha = |S_\alpha|$ and $s = |S|$. Let $f'(a, y) = \partial f(a, y)/\partial a$ and $f''(a, y) = \partial^2 f(a, y)/\partial a^2$.

The weighted design matrix corresponding to $\widehat{\mathcal L}_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)})$ with respect to $\theta = (\mu^{\mathsf T}, \alpha^{(2)\mathsf T}, \dots, \alpha^{(M)\mathsf T})^{\mathsf T}$, after setting $\alpha^{(1)} = -\sum_{m=2}^M \alpha^{(m)}$, can be expressed as $W(\beta^{(\bullet)}) = \operatorname{bdiag}\{\Omega_1^{1/2}(\beta^{(1)}), \dots, \Omega_M^{1/2}(\beta^{(M)})\}\,Z$, where $\operatorname{bdiag}$ is the block-diagonal operator, $\Omega_m(\beta) = \operatorname{diag}\{f''(\beta^{\mathsf T}X_1^{(m)}, Y_1^{(m)}), \dots, f''(\beta^{\mathsf T}X_{n_m}^{(m)}, Y_{n_m}^{(m)})\}$, $Z = Z_{[p],[p]}$, and for any $S_1, S_2 \subseteq [p]$,
$$Z_{S_1, S_2} = \begin{pmatrix} X^{(1)}_{\bullet S_1} & -X^{(1)}_{\bullet S_2} & -X^{(1)}_{\bullet S_2} & \cdots & -X^{(1)}_{\bullet S_2} \\ X^{(2)}_{\bullet S_1} & X^{(2)}_{\bullet S_2} & 0 & \cdots & 0 \\ X^{(3)}_{\bullet S_1} & 0 & X^{(3)}_{\bullet S_2} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ X^{(M)}_{\bullet S_1} & 0 & 0 & \cdots & X^{(M)}_{\bullet S_2} \end{pmatrix}.$$
For any $S_1, S_2 \subseteq [p]$, let $H_{m, S_1}(\beta^{(m)})$ denote the sub-matrix of $H_m(\beta^{(m)}) := \nabla^2\widehat{\mathcal L}_m(\beta^{(m)})$ with rows and columns in $S_1$, and let $W_{S_1, S_2}(\beta^{(\bullet)})$ denote the sub-matrix of $W(\beta^{(\bullet)})$ corresponding to $Z_{S_1, S_2}$ and $(\mu_{S_1}^{\mathsf T}, \alpha^{(2)\mathsf T}_{S_2}, \dots, \alpha^{(M)\mathsf T}_{S_2})^{\mathsf T}$. For notational ease, let $\widehat H = \operatorname{bdiag}\{\widehat H_1, \widehat H_2, \dots, \widehat H_M\}$, $S_{\mathrm{full}} = \{S_\mu, S_\alpha\}$, $W_{S_{\mathrm{full}}}(\beta^{(\bullet)}) = W_{S_\mu, S_\alpha}(\beta^{(\bullet)})$, and $H(\beta^{(\bullet)}) = \operatorname{bdiag}\{H_1(\beta^{(1)}), H_2(\beta^{(2)}), \dots, H_M(\beta^{(M)})\}$.
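To keep the rate statements below easy to parse, note that the effective sample sizes simply repackage the familiar high-dimensional rates; the following identities are immediate from the definitions:
$$\Big(\frac{1}{n^{\mathrm{eff}}_m}\Big)^{1/2} = \Big(\frac{s\log p}{n_m}\Big)^{1/2}, \qquad \Big(\frac{1}{n^{\mathrm{eff}}}\Big)^{1/2} = \Big(\frac{s(\log p + M)}{N}\Big)^{1/2}.$$
Thus, for instance, the $\ell_2$-rate in Condition 4 below is the usual single-study LASSO rate, while the rate $(1/n^{\mathrm{eff}})^{1/2}$ appearing in Theorems 1 and 2 is the multi-task rate based on all $N$ observations.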
We next introduce the compatibility condition and the irrepresentable condition.

Definition 1. (Compatibility Condition, $C_{\mathrm{comp}}$): The Hessian matrix $H(\beta^{(\bullet)})$ and the index set $S$ satisfy the compatibility condition with constant $t$ if there exists a constant $\varphi\{t, S, H(\beta^{(\bullet)})\} > 0$ such that for all $(\mu_\Delta^{\mathsf T}, \alpha_\Delta^{(\bullet)\mathsf T})^{\mathsf T} = (\mu_\Delta^{\mathsf T}, \alpha_\Delta^{(1)\mathsf T}, \dots, \alpha_\Delta^{(M)\mathsf T})^{\mathsf T} \in \mathcal C(t, S)$,
$$\big(\|\mu_\Delta\|_1 + \lambda_g\|\alpha^{(\bullet)}_\Delta\|_{2,1}\big)^2 \le \Big\{N^{-1}\sum_{m=1}^M n_m(\mu_\Delta + \alpha^{(m)}_\Delta)^{\mathsf T} H_m(\beta^{(m)})(\mu_\Delta + \alpha^{(m)}_\Delta)\Big\}\,|S|\,\big/\,\varphi^2\{t, S, H(\beta^{(\bullet)})\},$$
where $\mathcal C(t, S) = \{(u^{\mathsf T}, v^{(\bullet)\mathsf T})^{\mathsf T} = (u^{\mathsf T}, v^{(1)\mathsf T}, \dots, v^{(M)\mathsf T})^{\mathsf T} : v^{(1)} + \dots + v^{(M)} = \mathbf 0,\ \|u_{S^c}\|_1 + \lambda_g\|v^{(\bullet)}_{S^c}\|_{2,1} \le t(\|u_S\|_1 + \lambda_g\|v^{(\bullet)}_S\|_{2,1})\}$ for any $t$ and $S$, and $\varphi\{t, S, H(\beta^{(\bullet)})\}$ represents the compatibility constant of $H(\beta^{(\bullet)})$ on the set $S$.

Definition 2. (Irrepresentable Condition, $C_{\mathrm{Irrep}}$): The design matrix $W(\beta^{(\bullet)})$ satisfies the irrepresentable condition on $S_{\mathrm{full}} = (S_\mu, S_\alpha)$ with parameter $\epsilon > 0$ if, for all $j \in S_\mu^c$ and $j' \in S_\alpha^c$,
$$\sup_{u \in G_{S_\mu},\, v^{(\bullet)} \in G_{S_\alpha}}\Big|(u^{\mathsf T}, \lambda_g v^{(\bullet)\mathsf T})\big[W_{S_{\mathrm{full}}}^{\mathsf T}(\beta^{(\bullet)})W_{S_{\mathrm{full}}}(\beta^{(\bullet)})\big]^{-1}W_{S_{\mathrm{full}}}^{\mathsf T}(\beta^{(\bullet)})\,W_{j, \varnothing}(\beta^{(\bullet)})\Big| \le 1 - \epsilon;$$
$$\sup_{u \in G_{S_\mu},\, v^{(\bullet)} \in G_{S_\alpha}}\Big\|(u^{\mathsf T}, \lambda_g v^{(\bullet)\mathsf T})\big[W_{S_{\mathrm{full}}}^{\mathsf T}(\beta^{(\bullet)})W_{S_{\mathrm{full}}}(\beta^{(\bullet)})\big]^{-1}W_{S_{\mathrm{full}}}^{\mathsf T}(\beta^{(\bullet)})\,W_{\varnothing, j'}(\beta^{(\bullet)})\Big\|_{\widetilde T} \le \lambda_g(1 - \epsilon),$$
where
$$G_{S_\mu} = \Big\{u = (u_1, \dots, u_{|S_\mu|})^{\mathsf T} \in \mathbb R^{|S_\mu|} : \max_{j \in [|S_\mu|]}|u_j| \le 1\Big\},$$
$$G_{S_\alpha} = \Big\{v^{(\bullet)} = (v^{(2)\mathsf T}, \dots, v^{(M)\mathsf T})^{\mathsf T} \in \mathbb R^{(M-1)|S_\alpha|} : \max_{j \in [|S_\alpha|]}\|v_j\|_{\widetilde T} \le 1,\ v_j = (v_j^{(2)}, \dots, v_j^{(M)})^{\mathsf T}\Big\}$$
represent the sub-gradient spaces corresponding to $S_\mu$ and $S_\alpha$ of the mixture penalty.

4.2 Risk Bounds

To establish the theoretical properties of the SHIR estimator in terms of estimation and prediction risks, we first introduce some sufficient conditions. Throughout the following analysis, we assume that $n_m = \Theta(N/M)$ for $m \in [M]$ and $\lambda_g = \Theta(M^{-1/2})$.

Condition 1.
There exist $\delta = \omega\{(1/n^{\mathrm{eff}}_m)^{1/2}\}$ and $\varphi_0 = \Theta(1)$ such that for all $\beta^{(\bullet)} = (\beta^{(1)\mathsf T}, \dots, \beta^{(M)\mathsf T})^{\mathsf T}$ satisfying $\|\beta^{(m)} - \bar\beta^{(m)}\|_2 < \delta$ and all $t = \Theta(1)$, the Hessian matrix $H(\beta^{(\bullet)})$ and the index set $S$ satisfy $C_{\mathrm{comp}}$ (Definition 1) with compatibility constant $\varphi\{t, S, H(\beta^{(\bullet)})\} \ge \varphi_0$.

Condition 2.
For all $m \in [M]$, $X_{ij}^{(m)} f'(\bar\beta^{(m)\mathsf T}X_i^{(m)}, Y_i^{(m)})$ is sub-Gaussian, i.e., there exists a positive constant $\kappa = \Theta(1)$ such that $\|X_{ij}^{(m)} f'(\bar\beta^{(m)\mathsf T}X_i^{(m)}, Y_i^{(m)})\|_{\psi_2} < \kappa$. In addition, there exists $B > 0$ such that $\max_{m \in [M],\, i \in [n_m]}\|X_i^{(m)}\|_\infty \le B$.

Condition 3.
There exists a positive constant $C_L = \Theta(1)$ such that $|f''(a, y) - f''(b, y)| \le C_L|a - b|$ for all $a, b \in \mathbb R$.

Remark 5.
Condition 1 is in the same spirit as the restricted strong convexity (RSC) condition for general decomposable penalties introduced in Negahban et al. (2012); in the Supplementary Material we further show that Condition 1 is not stronger than the RSC condition. The first part of Condition 2 controls the tail behavior of $X_{ij}^{(m)} f'(a, y)$ so that the random error $\nabla\widehat{\mathcal L}_m(\bar\beta^{(m)})$ can be bounded properly and the method benefits from the group sparsity of $\bar\alpha^{(\bullet)}$ (Huang and Zhang, 2010). This condition is easily verified for sub-Gaussian designs and a broad class of models, e.g., the logistic model. In addition, the condition $\max_{m \in [M],\, i \in [n_m]}\|X_i^{(m)}\|_\infty \le B$ holds with $B = \Theta(1)$ for bounded designs and with $B = \Theta[\{\log(pN)\}^{1/2}]$ for sub-Gaussian designs. Condition 3 assumes smoothness of $f$ so that the empirical Hessian $\nabla^2\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})$ is close enough to $\nabla^2\widehat{\mathcal L}_m(\bar\beta^{(m)})$, and the term $\widehat g_m = \widehat H_m\widehat\beta^{(m)}_{\mathrm{LASSO}} - \nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})$ is close enough to $\nabla^2\widehat{\mathcal L}_m(\bar\beta^{(m)})\bar\beta^{(m)} - \nabla\widehat{\mathcal L}_m(\bar\beta^{(m)})$.

We further assume in Condition 4 that the local LASSO estimators achieve the minimax optimal error rates up to a logarithmic factor (Raskutti et al., 2011).
Condition 4.
The local estimators satisfy
$$\max_{m \in [M]}\|\widehat\beta^{(m)}_{\mathrm{LASSO}} - \bar\beta^{(m)}\|_1 = O_P\{(s/n^{\mathrm{eff}}_m)^{1/2}\}, \qquad \max_{m \in [M]}\|\widehat\beta^{(m)}_{\mathrm{LASSO}} - \bar\beta^{(m)}\|_2 = O_P\{(1/n^{\mathrm{eff}}_m)^{1/2}\},$$
and $\max_{m \in [M]}\|X^{(m)}(\widehat\beta^{(m)}_{\mathrm{LASSO}} - \bar\beta^{(m)})\|_2 = O_P\{(n_m/n^{\mathrm{eff}}_m)^{1/2}\}$.

Remark 6.
An extensive literature, including Van de Geer et al. (2008), Bühlmann and Van De Geer (2011) and Negahban et al. (2012), has established a complete theoretical framework for this property. See, for example, Negahban et al. (2012), where Condition 4 is proved under a sub-Gaussian design $X^{(m)}$ and a strongly convex loss function $f$.

Next, we present the risk bounds for SHIR, namely the prediction risk $\|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{SHIR}} - \bar\beta^{(\bullet)})\|_2$ and the estimation risk $\|\widehat\mu_{\mathrm{SHIR}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{SHIR}} - \bar\alpha^{(\bullet)}\|_{2,1}$.

Theorem 1. (Risk bounds for SHIR)
Assume Conditions 1–4 hold and $n_m = \Theta(N/M)$ for all $m \in [M]$. There exist $\lambda = \Theta\{1/(s\,n^{\mathrm{eff}})^{1/2} + B/n^{\mathrm{eff}}_m\}$ and $\lambda_g = \Theta(M^{-1/2})$ such that
$$\|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{SHIR}} - \bar\beta^{(\bullet)})\|_2 = O_P\{(1/n^{\mathrm{eff}})^{1/2} + Bs/n^{\mathrm{eff}}_m\};$$
$$\|\widehat\mu_{\mathrm{SHIR}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{SHIR}} - \bar\alpha^{(\bullet)}\|_{2,1} = O_P\{(s/n^{\mathrm{eff}})^{1/2} + Bs/n^{\mathrm{eff}}_m\}.$$

The second term in each of the upper bounds of Theorem 1 is the error incurred by aggregating derived data instead of raw data. These terms are asymptotically negligible under the sparsity condition $s = o(\{N(\log p + M)\}^{1/2}/[BM\log p])$. In that case, $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ achieves the same error rate as the ideal estimator $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ obtained by combining raw data, as shown in the following section, and is nearly rate optimal.

4.3 Asymptotic Equivalence in Prediction and Estimation
Under suitable sparsity assumptions, we show the asymptotic equivalence, with respect to prediction and estimation risks, of SHIR and the ideal IPD estimator $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$, equivalently defined as
$$(\widehat\mu_{\mathrm{IPD}}, \widehat\alpha^{(\bullet)}_{\mathrm{IPD}}) = \operatorname{argmin}_{(\mu, \alpha^{(\bullet)})}\ \widehat{\mathcal L}(\mu, \alpha^{(\bullet)}) + \tilde\lambda\,\rho_1(\mu, \alpha^{(\bullet)}; \lambda_g), \quad \text{subject to } \mathbf 1_{M\times 1}^{\mathsf T}\alpha_j = 0, \ j \in [p],$$
where $\tilde\lambda$ is a tuning parameter.

Theorem 2. (Asymptotic Equivalence)
Under the assumptions of Theorem 1, suppose $s = o(\{N(\log p + M)\}^{1/2}/[BM\log p])$. There exist $\tilde\lambda = \Theta\{1/(s\,n^{\mathrm{eff}})^{1/2}\}$ and $\lambda_g = \Theta(1/M^{1/2})$ such that the IPD estimator $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ satisfies
$$\|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{IPD}} - \bar\beta^{(\bullet)})\|_2 = O_P\{(1/n^{\mathrm{eff}})^{1/2}\}; \qquad \|\widehat\mu_{\mathrm{IPD}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{IPD}} - \bar\alpha^{(\bullet)}\|_{2,1} = O_P\{(s/n^{\mathrm{eff}})^{1/2}\}.$$
Furthermore, for some $\lambda_\Delta = o(\tilde\lambda)$, the IPD estimator and the SHIR estimator defined by (5) with $\lambda = \tilde\lambda + \lambda_\Delta$ are equivalent in prediction and estimation, in the sense that
$$\|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{SHIR}} - \bar\beta^{(\bullet)})\|_2 \le \|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{IPD}} - \bar\beta^{(\bullet)})\|_2 + o_P\{(1/n^{\mathrm{eff}})^{1/2}\};$$
$$\|\widehat\mu_{\mathrm{SHIR}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{SHIR}} - \bar\alpha^{(\bullet)}\|_{2,1} \le \|\widehat\mu_{\mathrm{IPD}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{IPD}} - \bar\alpha^{(\bullet)}\|_{2,1} + o_P\{(s/n^{\mathrm{eff}})^{1/2}\}.$$

Theorem 2 demonstrates the asymptotic equivalence between $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ and $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ with respect to estimation and prediction risks, and hence implies a strict optimality of SHIR. Specifically, when $s = o(\{N(\log p + M)\}^{1/2}/[BM\log p])$, the excess risks of $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ relative to $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ are of smaller order than the risks of IPD itself, i.e., the minimax optimal rates (up to a logarithmic factor) for multi-task learning of high dimensional sparse models (Lounici et al., 2011; Huang and Zhang, 2010). A similar equivalence result was given in Theorem 4.8 of Battey et al. (2018) for the truncated debiased LASSO estimator. To the best of our knowledge, however, such results have not previously been established for LASSO-type estimators obtained directly from a sparse regression model. Compared with Battey et al. (2018), our result does not require the Hessian matrix $\widehat H_m$ to have a sparse inverse, since we do not rely on debiasing $\widehat\beta^{(m)}_{\mathrm{LASSO}}$. Consequently, the proof of Theorem 2 is substantially more involved than those in Battey et al. (2018). The technical difficulties are briefly discussed in Section 7, and the new technical tools are developed and presented in detail in the Supplementary Material.
4.4 Comparison with Debiasing-based Approaches

To compare with existing approaches, we next consider an extension of the debiased-LASSO-based procedures of Lee et al. (2017) and Battey et al. (2018) that incorporates between-study heterogeneity. Specifically, at the $m$-th site, we derive the debiased LASSO estimator $\widehat\beta^{(m)}_{\mathrm{dLASSO}}$ as defined in (6), where $\widehat\Theta_m$ is obtained via nodewise LASSO (Javanmard and Montanari, 2014), and send it to the central site. At the central site, compute $\widehat\mu_{\mathrm{dLASSO}} = M^{-1}\sum_{m=1}^M \widehat\beta^{(m)}_{\mathrm{dLASSO}}$, $\widehat\alpha^{(m)}_{\mathrm{dLASSO}} = \widehat\beta^{(m)}_{\mathrm{dLASSO}} - \widehat\mu_{\mathrm{dLASSO}}$ and $\widehat\alpha^{(\bullet)}_{\mathrm{dLASSO}} = (\widehat\alpha^{(1)\mathsf T}_{\mathrm{dLASSO}}, \dots, \widehat\alpha^{(M)\mathsf T}_{\mathrm{dLASSO}})^{\mathsf T}$. Following Lee et al. (2017) and Battey et al. (2018), the final estimators of $\mu$ and $\alpha$ are obtained by thresholding $\widehat\mu_{\mathrm{dLASSO}}$ and $\widehat\alpha^{(\bullet)}_{\mathrm{dLASSO}}$, namely $\widehat\mu_{\mathrm{L\&B}} = \mathcal T_\mu(\widehat\mu_{\mathrm{dLASSO}}; \tau)$ and $\widehat\alpha^{(\bullet)}_{\mathrm{L\&B}} = \mathcal T_\alpha(\widehat\alpha^{(\bullet)}_{\mathrm{dLASSO}}; \tau)$, where
$$\mathcal T_\mu(\mu; \tau) = \{\mu_1, \mu_2^{h+}(\tau), \dots, \mu_p^{h+}(\tau)\}^{\mathsf T} \ \text{or}\ \{\mu_1, \mu_2^{s+}(\tau), \dots, \mu_p^{s+}(\tau)\}^{\mathsf T},$$
$$\mathcal T_\alpha(\alpha^{(\bullet)}; \tau) = \operatorname{vec}\{[\alpha_1, \alpha_2^{h+}(\tau), \dots, \alpha_p^{h+}(\tau)]^{\mathsf T}\} \ \text{or}\ \operatorname{vec}\{[\alpha_1, \alpha_2^{s+}(\tau), \dots, \alpha_p^{s+}(\tau)]^{\mathsf T}\};$$
for any vector $x = (x_1, \dots, x_d)^{\mathsf T}$ and constant $\tau$, $x^{h+} = x\,I(\|x\|_2 > \tau)$ and $x^{s+} = x(1 - \|x\|_2^{-1}\tau)\,I(\|x\|_2 > \tau)$ respectively denote the hard- and soft-thresholded versions of $x$, and $\operatorname{vec}(A)$ vectorizes the matrix $A$ by column.

The error rates of $\{\widehat\mu_{\mathrm{L\&B}}, \widehat\alpha^{(\bullet)}_{\mathrm{L\&B}}\}$, or of the corresponding $\widehat\beta^{(\bullet)}_{\mathrm{L\&B}}$, can largely be derived by extending Theorem 21 of Lee et al. (2017) and Theorem 4.8 of Battey et al. (2018). We outline the results below and provide additional details in the Supplementary Material. Denote $\bar H_m(\beta^{(m)}) = \mathbb E[H_m(\beta^{(m)})]$, $\bar H_m = \bar H_m(\bar\beta^{(m)})$, $\bar\Theta_m = \{\bar\theta^m_{j\ell}\}_{p\times p} = \bar H_m^{-1}$, and its row-wise sparsity level $s_1 = \max_{m \in [M],\, j \in [p]}|\{\ell \ne j : \bar\theta^m_{j\ell} \ne 0\}|$. Then, in analogy to Theorem 1, under the same regularity conditions as for SHIR, one can obtain
$$\|\widehat\mu_{\mathrm{L\&B}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{L\&B}} - \bar\alpha^{(\bullet)}\|_{2,1} = O_P\{(s/n^{\mathrm{eff}})^{1/2} + B(s + s_1)/n^{\mathrm{eff}}_m\}, \qquad (7)$$
where $B$ is as defined in Condition 2. Compared with the error rates of SHIR in Theorem 1, the debiased-LASSO-based estimator shares the same first term, $(s/n^{\mathrm{eff}})^{1/2}$, which represents the error of the individual-level empirical process. Its second term, incurred by data aggregation, can however be larger than that of SHIR when $s_1 = \omega(s)$. In many practical settings, $s_1$ can be excessively large due to complex intrinsic relationships among the covariates, which inflates the error rates of $\widehat\mu_{\mathrm{L\&B}}$ and $\widehat\alpha^{(\bullet)}_{\mathrm{L\&B}}$. This phenomenon is further validated in our simulation studies and the real example in the following sections.
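For illustration, the aggregation-and-thresholding step just described can be sketched as follows; this is a simplified rendering (our own helper names) that applies a common threshold $\tau$ to both the average and heterogeneity components:

```python
import numpy as np

def aggregate_debiased(beta_dlasso, tau, soft=True):
    """Heterogeneity-aware aggregation of local debiased LASSO estimates.

    beta_dlasso : (M, p) array, one debiased estimate per site; tau :
    threshold. Implements the mu/alpha split with hard or soft (group)
    thresholding; the intercept column (index 0) is never thresholded.
    """
    M, p = beta_dlasso.shape
    mu = beta_dlasso.mean(axis=0)          # simple average across sites
    alpha = beta_dlasso - mu               # per-site deviations from the mean
    mu_out, alpha_out = mu.copy(), alpha.copy()
    for j in range(1, p):
        if soft:
            mu_out[j] = np.sign(mu[j]) * max(abs(mu[j]) - tau, 0.0)
            norm_j = np.linalg.norm(alpha[:, j])
            scale = max(1.0 - tau / norm_j, 0.0) if norm_j > 0 else 0.0
            alpha_out[:, j] = alpha[:, j] * scale
        else:                              # hard thresholding
            mu_out[j] = mu[j] if abs(mu[j]) > tau else 0.0
            if np.linalg.norm(alpha[:, j]) <= tau:
                alpha_out[:, j] = 0.0
    return mu_out, alpha_out
```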
Remark 7.

When the model for $Y$ is correctly specified, i.e., $\mathbb E\{f'(\bar\beta^{(m)\mathsf T}X_i^{(m)}, Y_i^{(m)}) \mid X_i^{(m)}\} = 0$, the risk bound depending on the exact sparsity level $s_1$ can potentially be relaxed to one depending on the approximate sparsity level of $\bar\Theta_m$ (Ma et al., 2020; Liu et al., 2020), i.e., its row-wise $\ell_q$-norm with $0 < q \le 1$. Such a sparsity condition seems more reasonable, but it relies on the correctness of the model for $Y$, another strong assumption that cannot be easily verified in practice. In fact, our simulation studies below demonstrate that the estimators $\widehat\mu_{\mathrm{L\&B}}$ and $\widehat\alpha^{(\bullet)}_{\mathrm{L\&B}}$ no longer perform well in settings with both non-sparse $\bar\Theta_m$ and misspecified models.

In addition, SHIR can be more efficient than the debiasing-based strategy even when the impact of the additional error term depending on $s_1$ in (7) is asymptotically negligible. Consider the setting where all $\bar\beta^{(m)}$'s are the same, i.e., $\bar\beta^{(m)} = \bar\beta$, and $p$ is moderate or small, so that regularization is unnecessary and the maximum likelihood estimator (MLE) of $\bar\beta$ is feasible and asymptotically Gaussian. In this case, SHIR can be viewed as inverse-variance-weighted estimation with asymptotic variance $\Sigma_{\mathrm{SHIR}} = \{\sum_{m=1}^M n_m\bar\Theta_m^{-1}\}^{-1}$, while the debiasing-based approach yields an estimator with variance $\Sigma_{\mathrm{L\&B}} = M^{-2}\sum_{m=1}^M n_m^{-1}\bar\Theta_m$. It is not hard to show that $\Sigma_{\mathrm{SHIR}} \preceq \Sigma_{\mathrm{L\&B}}$, where equality holds only if the $\bar\Theta_m$'s are suitably proportional. Thus, SHIR is strictly more efficient than the debiasing-based approach in the low dimensional setting with heterogeneous $\bar\Theta_m$, which commonly arises in meta-analysis since the distributions of the $X^{(m)}$'s are typically heterogeneous across the local sites. In the high-dimensional setting, SHIR is similarly expected to benefit from the inverse-variance-weight construction, and our simulation results in Section 5 support this point.
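A one-dimensional calculation makes the comparison transparent. With scalar parameters and $\bar\Theta_m = v_m > 0$, the Cauchy–Schwarz inequality $(\sum_m n_m/v_m)(\sum_m v_m/n_m) \ge M^2$ gives
$$\Sigma_{\mathrm{SHIR}} = \Big(\sum_{m=1}^M \frac{n_m}{v_m}\Big)^{-1} \ \le\ \frac{1}{M^2}\sum_{m=1}^M \frac{v_m}{n_m} = \Sigma_{\mathrm{L\&B}},$$
with equality if and only if $v_m/n_m$ is constant in $m$.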
4.5 Sparsistency

In this section, we present theoretical results on the variable selection consistency of SHIR. We begin with some additional sufficient conditions for the sparsistency result.

Condition 5.
There exist $\delta = \omega\{(1/n^{\mathrm{eff}}_m)^{1/2}\}$ and $C_{\min} = \Theta(1)$ such that for all $\beta^{(m)}$ satisfying $\|\beta^{(m)} - \bar\beta^{(m)}\|_2 < \delta$, $\Lambda_{\min}\{H_{m, S}(\beta^{(m)})\} > C_{\min}$.

Condition 6.
There exist $\delta = \omega\{(1/n^{\mathrm{eff}}_m)^{1/2}\}$ and $\epsilon = \Theta(1)$ such that for all $\beta^{(\bullet)} = (\beta^{(1)\mathsf T}, \dots, \beta^{(M)\mathsf T})^{\mathsf T}$ satisfying $\|\beta^{(m)} - \bar\beta^{(m)}\|_2 < \delta$, the weighted design matrix $W(\beta^{(\bullet)})$ satisfies the irrepresentable condition $C_{\mathrm{Irrep}}$ (Definition 2) on $S_{\mathrm{full}}$ with constant $\epsilon$.

Condition 7.
Let $\nu = \min\{\min_{j \in S_\mu}|\bar\mu_j|,\ \min_{j \in S_\alpha}\|\bar\alpha_j\|_2/M^{1/2}\}$. For the $\epsilon$ defined in Condition 6, $[1/(n^{\mathrm{eff}})^{1/2} + Bs^{1/2}/n^{\mathrm{eff}}_m]/(\nu\epsilon) \to 0$ as $N \to \infty$.

Remark 8.
Conditions 5–7 are sparsistency assumptions similar to those of Zhao and Yu (2006) and Nardi et al. (2008). Condition 5 requires the eigenvalues of the covariance of the weighted design matrix restricted to $S$ to be bounded away from zero, so that its inverse behaves well. Condition 6 adapts the commonly used irrepresentable condition (Zhao and Yu, 2006) to our mixture-penalty setting: roughly speaking, it requires that the weighted design corresponding to $S_{\mathrm{full}}$ cannot be represented well by the weighted design for $S_{\mathrm{full}}^c$. Compared with Nardi et al. (2008), $C_{\mathrm{Irrep}}$ is less intuitive but essentially weaker; we verify this condition for several common Hessian structures and compare it with Zhao and Yu (2006) in the Supplementary Material. Condition 7 requires the minimum magnitude of the coefficients to be large enough for the non-zero coefficients to be detectable. As will be discussed later in this section, Condition 7 is essentially weaker than the corresponding minimum-magnitude requirement in existing results based on the local LASSO (Zhao and Yu, 2006), because we leverage the group structure of the $\bar\beta^{(m)}$'s to improve the efficiency of variable selection.

Theorem 3. (Sparsistency)
Let $\widehat S_\mu = \{j : \widehat\mu_{\mathrm{SHIR}, j} \ne 0\}$ and $\widehat S_\alpha = \{j : \|\widehat\alpha_{\mathrm{SHIR}, j}\|_2 \ne 0\}$, and denote the events $\mathcal O_\mu = \{\widehat S_\mu = S_\mu\}$ and $\mathcal O_\alpha = \{\widehat S_\alpha = S_\alpha\}$. Under Conditions 1–7, if $\lambda = o(\nu/s^{1/2})$ and $\lambda = \epsilon^{-1}\,\omega\{1/(s\,n^{\mathrm{eff}})^{1/2} + B/n^{\mathrm{eff}}_m\}$, then $P(\mathcal O_\mu \cap \mathcal O_\alpha) \to 1$ as $N \to \infty$.

Remark 9.

Condition 7 guarantees the existence of the tuning parameter $\lambda$ used in Theorem 3. The admissible range of $\lambda$ in Theorem 3 is similar to those in Zhao and Yu (2006) and Nardi et al. (2008), in the sense that $\lambda$ must be smaller than the minimum signal magnitude divided by $s^{1/2}$, so that the true positives are selected and dominate the random noise. In comparison with Zhao and Yu (2006), our $\lambda$ additionally accounts for the aggregation noise represented by $B/n^{\mathrm{eff}}_m$.

Theorem 3 establishes the sparsistency of the SHIR estimator. When $s = o(\{N(\log p + M)\}^{1/2}/[BM\log p])$ as assumed in Theorem 2, Condition 7 reduces to $\nu\epsilon = \omega\{1/(n^{\mathrm{eff}})^{1/2}\}$, the corresponding sparsistency assumption for the IPD estimator. In contrast, a similar condition, which can be as strong as $\nu\epsilon = \omega\{1/(n^{\mathrm{eff}}_m)^{1/2}\}$, is required for the local LASSO estimator (Zhao and Yu, 2006). Compared with the local estimator, our integrative procedure can therefore detect weaker signals under suitable sparsity assumptions; in this sense, the structure of $\bar\beta^{(\bullet)}$ improves selection efficiency over the local LASSO. Unlike existing work, we must carefully handle the mixture penalty $\rho_1$ and the aggregation noise of SHIR, which introduce additional technical difficulties into the theoretical analysis.

In both Theorems 2 and 3, we allow $M$, the number of studies, to diverge while preserving the theoretical properties. The growth rate of $M$ is allowed to be $M = \min\big(o\{(N/\log p)^{1/2}/(Bs)\},\ o\{N/(Bs\log p)^2\}\big)$ for the equivalence result in Theorem 2, and $M = \min\big(o\{N\epsilon\nu/(Bs^{1/2}\log p)\},\ o\{N(\epsilon\nu)^2/s\}\big)$ for the sparsistency result in Theorem 3.

5 Simulation Studies

We present simulation results in this section to evaluate the performance of our proposed SHIR estimator and compare it with several other approaches. We let $M \in \{4, 8\}$ and $p \in \{100, 800, 1500\}$, and set $n_m = n = 400$ for each $m$. For each configuration, we summarize results based on 200 simulated datasets. We consider three data generating mechanisms:
(i) Sparse precision and correctly specified model (strong signal): Across all studies, we let $S_\mu$ index the non-zero entries of $\bar\mu$ and $S_\alpha$ the non-zero rows of $\bar\alpha^{(\bullet)}$, two overlapping but non-identical sets of six covariates each, with $S = S_\mu \cup S_\alpha$ and $S^c = [p]\setminus S$. For each $m \in [M]$, we generate $X^{(m)}$ from a zero-mean multivariate normal distribution with covariance $C^{(m)}$, where $C^{(m)}_{S^cS^c} = R_{p-|S|}(r_m)$, $C^{(m)}_{S^cS} = R_{p-|S|}(r_m)\Gamma_{p-|S|, |S|}(r_m, 15)$, and $C^{(m)}_{SS} = I_{|S|} + \Gamma^{\mathsf T}_{p-|S|, |S|}(r_m, 15)R_{p-|S|}(r_m)\Gamma_{p-|S|, |S|}(r_m, 15)$. Here $I_q$ denotes the $q\times q$ identity matrix; $R_q(r)$ denotes the $q\times q$ AR(1) correlation matrix with correlation coefficient $r$; $\Gamma_{q_1, q_2}(r, s)$ denotes the $q_1\times q_2$ matrix in which each column has $s$ randomly picked entries set to $r$ or $-r$ at random and the remaining entries equal to $0$; and $r_m = c_r(m-1)/M + 0.15$ for a positive constant $c_r$, so that the correlation strength varies across studies. Given $X^{(m)}$, we generate $Y^{(m)}$ from the logistic model $P(Y^{(m)} = 1 \mid X^{(m)}) = \operatorname{expit}\{X^{(m)\mathsf T}_{S_\mu}\bar\mu_{S_\mu} + X^{(m)\mathsf T}_{S_\alpha}\bar\alpha^{(m)}_{S_\alpha}\}$ with $\bar\mu_{S_\mu} = c_\mu(1, -1, 1, -1, 1, -1)^{\mathsf T}$ and $\bar\alpha^{(m)}_{S_\alpha} = c_\alpha(-1)^m(1, 1, 1, -1, -1, -1)^{\mathsf T}$ for positive constants $c_\mu$ and $c_\alpha$.
(ii) Sparse precision and correctly specified model (weak signal): We use the same data generating mechanism as in (i), except with relatively weak signals, i.e., smaller values of the constants $c_\mu$ and $c_\alpha$ in $\bar\mu_{S_\mu} = c_\mu(1, -1, 1, -1, 1, -1)^{\mathsf T}$ and $\bar\alpha^{(m)}_{S_\alpha} = c_\alpha(-1)^m(1, 1, 1, -1, -1, -1)^{\mathsf T}$.
(iii) Dense precision and wrongly specified model: Let $S = \{1, 2, \dots, 5\}$, $S' = \{6, \dots, 50\}$, and $S'' = [p]\setminus(S \cup S')$. For each $m \in [M]$, we generate $X^{(m)}$ from a zero-mean multivariate normal distribution with covariance matrix $C^{(m)}$, where $C^{(m)}_{(S'\cup S'')(S'\cup S'')} = \operatorname{bdiag}\{R_{45}(r_m), R_{p-50}(r_m)\}$, $C^{(m)}_{SS''} = \mathbf 0$, $C^{(m)}_{S'S} = R_{45}(r_m)\Gamma_{45, 5}(r_m, 45)$, and $C^{(m)}_{SS} = I_5 + \Gamma^{\mathsf T}_{45, 5}(r_m, 45)R_{45}(r_m)\Gamma_{45, 5}(r_m, 45)$. Given $X^{(m)}$, we generate $Y^{(m)}$ from a logistic model with
$$P(Y^{(m)} = 1 \mid X^{(m)}) = \operatorname{expit}\Big[\sum_{j=1}^{5}\{0.25 + a(-1)^m\}\big\{X^{(m)}_j + b\,(X^{(m)}_j)^2\big\} + c\sum_{j=1}^{5} X^{(m)}_j X^{(m)}_{j+1}\Big]$$
for positive constants $a$, $b$ and $c$, so that the fitted linear logistic model is misspecified.

Across all settings, the distribution of $X^{(m)}$ and the model parameters of $Y^{(m)} \mid X^{(m)}$ differ across the $M$ sites, mimicking heterogeneity in both the covariates and the models. The heterogeneity of $X^{(m)}$ is driven by the study-specific correlation coefficient $r_m$ in its covariance matrix $C^{(m)}$. Under Settings (i) and (ii), the fitted logistic loss corresponds to the likelihood under a correctly specified model, with the supports of $\bar\mu$ and $\bar\alpha^{(m)}$ overlapping but not exactly the same. Under Setting (iii), the fitted loss corresponds to a misspecified model, but the true target parameter $\bar\beta^{(m)}$ remains approximately sparse, with only the first 5 elements relatively large, the next 45 close to zero, and the rest exactly zero. For each $j \in S$, the $j$-th column (excluding $j$ itself) of the precision matrix $\bar\Theta_m$ has on average 15 non-zero entries under Settings (i) and (ii), and 45 under Setting (iii). We therefore use Settings (i) and (ii) to examine the case of a sparse precision matrix on the active set, and Setting (iii) the case of a relatively dense precision matrix.
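For illustration, a simplified generator in the spirit of Setting (i) is sketched below; the plain AR(1) covariance, the default signal constants `c_mu` and `c_al`, and the exact form of `r_m` are stand-ins for the construction above rather than the precise specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_site(m, M, n=400, p=100, c_mu=0.5, c_al=0.3):
    """Generate (X, y) for one site, roughly in the spirit of Setting (i).

    Uses a plain AR(1) covariance for X instead of the full blockwise
    construction with the Gamma perturbation matrices; c_mu, c_al and the
    r_m formula below are illustrative stand-ins.
    """
    r_m = 0.3 * (m - 1) / M + 0.15             # site-specific correlation
    idx = np.arange(p)
    cov = r_m ** np.abs(np.subtract.outer(idx, idx))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    X[:, 0] = 1.0                               # intercept column
    beta = np.zeros(p)
    beta[1:7] += c_mu * np.array([1, -1, 1, -1, 1, -1])          # mean effects
    beta[4:10] += c_al * (-1) ** m * np.array([1, 1, 1, -1, -1, -1])  # site shift
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))    # expit link
    y = rng.binomial(1, prob)
    return X, y
```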
For each simulated dataset, we obtain the SHIR estimator as well as the following alternative estimators: (a) the IPD estimator $\widehat\beta^{(\bullet)}_{\mathrm{IPD}} = \operatorname{argmin}_{\beta^{(\bullet)}}\widehat Q(\beta^{(\bullet)})$; (b) the SMA estimator (He et al., 2016), applied after the sure independence screening procedure (Fan and Lv, 2008) that reduces the dimension to $n/(3\log n)$, as recommended by He et al. (2016); and (c) the debiasing-based estimator $\widehat\beta^{(\bullet)}_{\mathrm{L\&B}}$ introduced in Section 4.4, denoted by Debias$_{\mathrm{L\&B}}$. For $\widehat\beta^{(\bullet)}_{\mathrm{L\&B}}$, we used soft thresholding to be consistent with the penalty used by IPD, SMA and SHIR. We used the BIC to choose the tuning parameters for all methods.

In Figures 1 and 2, we present the relative average absolute estimation error (rAEE), based on $\|\widehat\beta^{(\bullet)} - \bar\beta^{(\bullet)}\|_1$, and the relative prediction error (rPE), based on $\|X(\widehat\beta^{(\bullet)} - \bar\beta^{(\bullet)})\|_2$, of each estimator compared to the IPD estimator. Consistent with the theoretical equivalence results, the SHIR estimator attains estimation and prediction accuracy very close to that of the idealized IPD estimator, with rPE and rAEE around 1.03 under Setting (i), 1.02 under Setting (ii), and 1.07 under Setting (iii). The SHIR estimator is substantially more efficient than SMA under all settings, with about a 50% reduction in both AEE and PE on average. This can be attributed to the improved performance of the local LASSO estimator $\widehat\beta^{(m)}_{\mathrm{LASSO}}$ over the MLE $\breve\beta^{(m)}$ in sparse models. The superior performance is more pronounced for large $p$, such as 800 and 1500, because the screening procedure does not select the active set well, especially in the presence of correlations among the covariates. Compared with Debias$_{\mathrm{L\&B}}$, SHIR also demonstrates gains in efficiency: relative to SHIR, Debias$_{\mathrm{L\&B}}$ has 20%–29% higher AEE and 27%–42% higher PE across the three settings. This is consistent with the theoretical results presented in Section 4.4, i.e., that SHIR has smaller error than Debias$_{\mathrm{L\&B}}$ due to the heterogeneous Hessians and the aggregation errors. In addition, compared with Setting (i), the excess error of Debias$_{\mathrm{L\&B}}$ is larger under Setting (iii), where the inverse Hessian $\bar\Theta_m$ is relatively dense and the model for $Y$ is misspecified; this is consistent with our conclusions in Section 4.4 and Remark 7.

In Figure 3, we summarize the average true positive rate (TPR) and false discovery rate (FDR) for recovering the support of $\bar\beta^{(\bullet)}$ under Settings (i) and (ii), where the logistic model is correctly specified. Again, SMA performed poorly under both settings, with either low TPR or high FDR. Both IPD and SHIR have good support recovery, with all TPRs above 0.91 and FDRs below 0.13 under the strong-signal setting, and all TPRs above 0.74 and FDRs below 0.05 under the weak-signal setting. IPD and SHIR attained similar TPRs and FDRs, with absolute differences less than 0.02 across all settings. Compared with IPD and SHIR, Debias$_{\mathrm{L\&B}}$ performs worse under both settings. Under Setting (i), the TPR of Debias$_{\mathrm{L\&B}}$ is consistently lower than that of SHIR by about 0.13, while its FDR is generally higher, except that for $p = 100$ Debias$_{\mathrm{L\&B}}$ attained a very low FDR due to over-shrinkage. Under the weak-signal Setting (ii) with $M = 4$, Debias$_{\mathrm{L\&B}}$ is substantially less powerful than SHIR in recovering the true signals, with TPR lower by as much as 0.52, while its average FDR is comparable to that of SHIR. When $M = 8$, Debias$_{\mathrm{L\&B}}$ attained TPR comparable to SHIR but generally substantially higher FDR.

6 Application to Coronary Artery Disease Phenotyping with EHR Data

Linking EHR data with biorepositories containing "-omics" information has expanded the opportunities for biomedical research (Kho et al., 2011). With the growing availability of these high-dimensional data, the bottleneck in clinical research has shifted from a paucity of biologic data to a paucity of high-quality phenotypic data. Accurately and efficiently annotating patients with disease characteristics among millions of individuals is a critical step in fulfilling the promise of using EHR data for precision medicine. Novel machine learning based phenotyping methods leveraging a large number of predictive features have improved the accuracy and efficiency of existing phenotyping approaches (Liao et al., 2015; Yu et al., 2015).

While the portability of phenotyping algorithms across multiple patient cohorts is of great interest, existing phenotyping algorithms are often developed and evaluated for a specific patient population. To investigate the portability issue and develop EHR phenotyping algorithms for coronary artery disease (CAD) useful for multiple cohorts, Liao et al. (2015) developed a CAD algorithm using a cohort of rheumatoid arthritis (RA) patients and applied the algorithm to other disease cohorts using EHR data from the Partners Healthcare System. Here, we perform an integrative analysis of multiple EHR disease cohorts to jointly develop algorithms for classifying CAD status in four disease cohorts: type 2 diabetes mellitus (DM), inflammatory bowel disease (IBD), multiple sclerosis (MS) and RA. Under the DataSHIELD constraint, the proposed SHIR method lets the data determine whether a single CAD phenotyping algorithm can perform well across the four disease cohorts or whether disease-specific algorithms are needed.

For algorithm training, clinical investigators manually curated gold-standard labels of CAD status, used as the response $Y$, for $n = 172$ DM patients, $n = 230$ IBD patients, $n = 105$ MS patients and $n = 760$ RA patients. There are a total of $p = 533$ candidate features, including codified features, narrative features extracted via natural language processing (NLP) (Zeng et al., 2006), and their two-way interactions. Examples of codified features include demographic information, lab results, medication prescriptions, and counts of International Classification of Diseases (ICD) codes and Current Procedural Terminology (CPT) codes. Since patients may lack certain lab measurements and this missingness is highly informative, we also create missing indicators for the lab measurements as additional features. Examples of NLP terms include mentions of CAD, current smoking (CSMO), non-smoking (NSMO) and CAD-related procedures. Since the count variables, such as the total number of CAD ICD codes, are zero-inflated and skewed, we apply the $\log(x + 1)$ transformation and
include $I(x > 0)$ as an additional feature for each count variable $x$.

For each cohort, we randomly select 50% of the observations to form the training set for developing the CAD algorithms and use the remaining 50% for validation. We trained CAD algorithms based on SHIR, Debias$_{\mathrm{L\&B}}$ and SMA. Since the true model parameters are unknown, we evaluate the methods by the prediction performance of the trained algorithms on the validation set. We consider several standard accuracy measures: the area under the receiver operating characteristic curve (AUC); the Brier score, defined as the mean squared residual on the validation data; and the F-scores at threshold values chosen to attain a false positive rate of 5% and of 10%, where the F-score is defined as the harmonic mean of the sensitivity and the positive predictive value. The standard errors of the estimated prediction performance measures are obtained by bootstrapping the validation data. We report only results based on tuning parameters selected with BIC, as in the simulation studies, noting that the results obtained with AIC are largely similar in terms of prediction performance. Furthermore, to verify the improvement obtained by combining the four datasets, we include the LASSO estimator fitted to each local dataset (Local) as a comparison.

In Table 1, we present the estimated coefficients for variables that received non-zero coefficients from at least one of the included methods. Interestingly, all integrative analysis methods set all heterogeneous coefficients to zero, suggesting that a single CAD algorithm can be used across all cohorts, although different intercepts were used for the different disease cohorts. The magnitudes of the coefficients from SHIR largely agree with the published algorithm, with the most important features being NLP mentions and ICD codes of CAD, as well as the total number of ICD codes, which serves as a measure of healthcare utilization. SMA set all variables to zero except age, non-smoking, and the NLP mentions and ICD codes of CAD, while Debias$_{\mathrm{L\&B}}$ has a support more similar to SHIR's.

The point estimates, along with their 95% bootstrap confidence intervals, of the accuracy measures are presented in Figure 4. The results suggest that SHIR has the best performance among all methods, on nearly all datasets and across all measures. Among the integrative methods, SMA and Debias$_{\mathrm{L\&B}}$ performed much worse than SHIR on all accuracy measures; for example, the AUC (95% confidence interval) of the CAD algorithm for the RA cohort trained via SHIR, SMA and Debias$_{\mathrm{L\&B}}$ is 0.93 (0.90, 0.95), 0.88 (0.84, 0.92) and 0.86 (0.82, 0.90), respectively. Compared with the local estimator, SHIR also performs substantially better: for example, the AUCs of SHIR and Local for the IBD cohort are 0.93 (0.88, 0.97) and 0.90 (0.84, 0.95), respectively. The difference between the integrative procedures and the local estimator is more pronounced for the DM cohort, with AUC around 0.95 for SHIR versus 0.90 for the local estimator trained on DM data only. The local estimator fails to produce an informative algorithm for the MS cohort due to the small size of its training set. These results again demonstrate the power of borrowing information across studies via integrative analysis.
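For completeness, the accuracy measures above can be computed from validation labels and predicted probabilities alone; the following sketch (our own helper, with approximate threshold handling) illustrates the AUC, the Brier score, and the F-score at a target false positive rate:

```python
import numpy as np

def validation_metrics(y, score, fpr_levels=(0.05, 0.10)):
    """AUC, Brier score, and F-scores at fixed false positive rates.

    y : (n,) binary validation labels; score : (n,) predicted probabilities.
    The F-score is the harmonic mean of sensitivity and positive predictive
    value (PPV); thresholds are chosen to attain each target FPR.
    """
    order = np.argsort(-score)
    y_sorted = y[order].astype(float)
    n_pos, n_neg = y_sorted.sum(), len(y) - y_sorted.sum()
    tp = np.cumsum(y_sorted)             # true positives along the ranking
    fp = np.cumsum(1.0 - y_sorted)       # false positives along the ranking
    tpr, fpr = tp / n_pos, fp / n_neg
    auc = np.trapz(tpr, fpr)             # area under the ROC curve
    brier = np.mean((y - score) ** 2)    # mean squared residual
    f_scores = {}
    for level in fpr_levels:
        k = min(np.searchsorted(fpr, level), len(fpr) - 1)
        sens, ppv = tpr[k], tp[k] / (k + 1.0)
        f_scores[level] = (2 * sens * ppv / (sens + ppv)) if sens + ppv else 0.0
    return auc, brier, f_scores
```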
Discussion

In this paper, we proposed a novel approach, SHIR, for integrative analysis of high dimensional data under the DataSHIELD framework, where only summary data are allowed to be transferred from the local sites to the central site, protecting the individual-level data. Note that the DataSHIELD framework is fundamentally different from the concept of differential privacy (Dwork, 2011). Differential privacy imposes a rigorous probabilistic definition on the privacy of a randomized data processing algorithm, requiring the raw data to be perturbed with random noise (Wasserman and Zhou, 2010), and it allows individual-level data to be shared and published for more general data analysis purposes. However, to ensure sufficient privacy, differential privacy has been shown to be too restrictive for high-dimensional regression and inference (Duchi et al., 2013). In comparison, the DataSHIELD framework is designed for regression analysis and is shown to work well under ultra-high dimensionality, with no efficiency loss compared with the ideal case in which the data can be pooled without privacy constraints.

Though motivated by the debiased LASSO, the SHIR approach does not need to obtain $\widehat{\Theta}_m$ or the debiased estimator $\widehat{\beta}^{(m)}_{\text{dLASSO}}$. As we demonstrated via both theoretical analyses and numerical studies, the SHIR estimator is considerably more efficient than estimators obtained by the debiasing-based strategies considered in the literature (Lee et al., 2017; Battey et al., 2018). The superior performance is even more pronounced when $\Theta_m$ is relatively dense. Our method also accommodates heterogeneity in the design matrices, as well as in the coefficients of the local sites, which is not adequately handled in the existing literature under the ultra-high dimensional regime. In terms of theoretical analysis, no existing work on distributed learning provides an estimation equivalence result comparable to Theorem 2 for regularized regression. The challenge in establishing the asymptotic equivalence in Theorem 2 arises because $\widehat{\beta}^{(\bullet)}_{\text{IPD}}$ and $\widehat{\beta}^{(\bullet)}_{\text{SHIR}}$ are not necessarily as sparse as the true coefficient $\beta^{(\bullet)}$. We overcome this challenge by comparing the two estimators in a more elaborate way, as detailed in the proof. Moreover, Theorem 3 ensures the sparsistency of SHIR under some commonly used conditions on deterministic designs. Benefiting from integrating the data, our approach requires a less restrictive assumption on the minimum strength of the coefficients than local estimation, under certain sparsity assumptions. Such a property is not readily available in the existing literature on distributed regression and the mixture penalty.

Our approach is also efficient in both computation and communication, as it solves a LASSO problem only once in each local site, without requiring the computation of $\widehat{\Theta}_m$ or debiasing, and needs only one round of communication. On the other hand, since there is no debiasing procedure in our method, SHIR cannot be directly used for hypothesis testing and confidence interval construction. Future work lies in developing statistical approaches for such purposes under DataSHIELD, high dimensionality and heterogeneity.

For the choice of penalty, in the current paper we focus primarily on the mixture penalty $\rho(\beta^{(\bullet)})$ defined in (1), because it can leverage the prior assumption of similar support and magnitude of $\beta^{(m)}$ across studies. Nevertheless, other penalty functions, such as those discussed in Section 2.1, can be incorporated into our framework provided that they can effectively leverage the prior knowledge.
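To give intuition for how such a mixture penalty encodes the prior of similar support and magnitude across studies, the sketch below evaluates one sparse-group-style form: an $\ell_1$ term on a shared component plus a row-wise group-$\ell_2$ term on study-specific deviations. This is an illustrative choice written under our own assumptions (including the centering used for identifiability), not necessarily the exact penalty (1) of the paper.

```python
import numpy as np

def mixture_penalty(beta, lam_mu, lam_alpha):
    """Evaluate a sparse-group-style mixture penalty on a p x M matrix
    `beta` whose m-th column holds the coefficients of study m.

    Decompose beta[j, m] = mu[j] + alpha[j, m]: mu is a shared component,
    alpha collects study-specific deviations. Centering alpha at the
    across-study mean is one common identifiability choice; in an actual
    estimation procedure mu and alpha are optimized jointly with the loss.
    """
    mu = beta.mean(axis=1)                           # shared component (p,)
    alpha = beta - mu[:, None]                       # deviations; rows sum to 0
    l1_shared = np.abs(mu).sum()                     # l1: sparse common support
    group_l2 = np.linalg.norm(alpha, axis=1).sum()   # row-wise group-l2: shrink heterogeneity
    return lam_mu * l1_shared + lam_alpha * group_l2
```

Because the group-$\ell_2$ term treats each coefficient's deviations across the $M$ studies as a single group, it shrinks entire rows of heterogeneity to zero jointly, which is the mechanism through which similar supports across studies can be leveraged.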
Techniques similar to those used for deriving the theoretical results of SHIR with the mixture penalty can be applied to other penalty functions, with some technical details varying with the choice of $\rho(\cdot)$. See Section A.4 of the Supplementary Material for further justification.

We assume $n_m = \Theta(N/M)$ in Theorems 1-3 and allow $M$ to grow slowly. These conditions seem reasonable in our application but could be violated in some large scale meta-analyses of high dimensional data. The extension to scenarios with larger $M$ is highly non-trivial and warrants further research. As a limitation of our theoretical analysis, the sparsity assumption in Theorem 2 may be somewhat strong for our real example. We note, however, that this assumption is comparable to the assumption required for the estimation equivalence result in Battey et al. (2018), and that the consistency result in Theorem 1 holds under milder sparsity assumptions. It remains of interest to explore whether the aggregation error can be reduced, and the theoretical equivalence property achieved, under weaker sparsity assumptions. One potential approach is to use multiple rounds of communication, as in Fan et al. (2019); a detailed analysis of this approach warrants future research.

Supplementary Material

The supplement provides justifications for Conditions 1 and 6, detailed proofs of Theorems 1-3, and theoretical analyses of SHIR for various penalty functions.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723.
Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z., et al. (2018). Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3):1352-1382.
Bhat, H. S. and Kumar, N. (2010). On the derivation of the Bayesian information criterion. School of Natural Sciences, University of California.
Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
Chen, X. and Xie, M.-g. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, pages 1655-1684.
Chen, Y., Dong, G., Han, J., Pei, J., Wah, B. W., and Wang, J. (2006). Regression cubes with lossless compression and aggregation. IEEE Transactions on Knowledge and Data Engineering, 18(12):1585-1599.
Cheng, X., Lu, W., and Liu, M. (2015). Identification of homogeneous and heterogeneous variables in pooled cohort studies. Biometrics, 71(2):397-403.
Doiron, D., Burton, P., Marcon, Y., Gaye, A., Wolffenbuttel, B. H., Perola, M., Stolk, R. P., Foco, L., Minelli, C., Waldenberger, M., et al. (2013). Data harmonization and federated analysis of population-based studies: the BioSHaRE project. Emerging Themes in Epidemiology, 10(1):12.
Duan, R., Boland, M. R., Liu, Z., Liu, Y., Chang, H. H., Xu, H., Chu, H., Schmid, C. H., Forrest, C. B., Holmes, J. H., et al. (2020). Learning from electronic health records across multiple sites: A communication-efficient and privacy-preserving distributed algorithm. Journal of the American Medical Informatics Association, 27(3):376-385.
Duan, R., Boland, M. R., Moore, J. H., and Chen, Y. (2019). ODAL: A one-shot distributed algorithm to perform logistic regressions on electronic health records data from multiple clinical sites. In PSB, pages 30-41. World Scientific.
Duchi, J. C., Jordan, M. I., and Wainwright, M. J. (2013). Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429-438. IEEE.
Dwork, C. (2011). Differential privacy. Encyclopedia of Cryptography and Security, pages 338-340.
Fan, J., Guo, Y., and Wang, K. (2019). Communication-efficient accurate statistical estimation. arXiv preprint arXiv:1906.04870.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849-911.
Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947-1975.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736.
Gaye, A., Marcon, Y., Isaeva, J., LaFlamme, P., Turner, A., Jones, E. M., Minion, J., Boyd, A. W., Newby, C. J., Nuotio, M.-L., et al. (2014). DataSHIELD: taking the analysis to the data, not the data to the analysis. International Journal of Epidemiology, 43(6):1929-1944.
Han, J. and Liu, Q. (2016). Bootstrap model aggregation for distributed statistical learning. In Advances in Neural Information Processing Systems, pages 1795-1803.
He, Q., Zhang, H. H., Avery, C. L., and Lin, D. (2016). Sparse meta-analysis with high-dimensional data. Biostatistics, 17(2):205-220.
Huang, C. and Huo, X. (2015). A distributed one-step estimator. arXiv preprint arXiv:1511.01443.
Huang, J. and Zhang, T. (2010). The benefit of group sparsity. The Annals of Statistics, 38(4):1978-2004.
Janková, J. and Van De Geer, S. (2016). Confidence regions for high-dimensional generalized linear models under sparsity. arXiv preprint arXiv:1610.01353.
Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869-2909.
Jones, E., Sheehan, N., Masca, N., Wallace, S., Murtagh, M., and Burton, P. (2012). DataSHIELD - shared individual-level analysis without sharing the data: a biostatistical perspective. Norsk Epidemiologi, 21(2).
Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668-681.
Kho, A. N., Pacheco, J. A., Peissig, P. L., Rasmussen, L., Newton, K. M., Weston, N., Crane, P. K., Pathak, J., Chute, C. G., Bielinski, S. J., et al. (2011). Electronic medical records for genetic research: results of the eMERGE consortium. Science Translational Medicine, 3(79):79re1-79re1.
Kim, Y., Kwon, S., and Choi, H. (2012). Consistent model selection criteria on high dimensions. Journal of Machine Learning Research, 13(Apr):1037-1057.
Lee, J. D., Liu, Q., Sun, Y., and Taylor, J. E. (2017). Communication-efficient sparse regression. Journal of Machine Learning Research, 18(5):1-30.
Li, W., Liu, H., Yang, P., and Xie, W. (2016). Supporting regularized logistic regression privately and efficiently. PLoS One, 11(6):e0156479.
Liao, K. P., Ananthakrishnan, A. N., Kumar, V., Xia, Z., Cagan, A., Gainer, V. S., Goryachev, S., Chen, P., Savova, G. K., Agniel, D., et al. (2015). Methods to develop an electronic medical record phenotype algorithm to compare the risk of coronary artery disease across 3 chronic disease cohorts. PLoS One, 10(8):e0136651.
Lin, D. and Zeng, D. (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97(2):321-332.
Liu, M., Xia, Y., Cai, T., and Cho, K. (2020). Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. arXiv preprint arXiv:2004.00816.
Liu, Q. and Ihler, A. T. (2014). Distributed estimation, information loss and exponential families. In Advances in Neural Information Processing Systems, pages 1098-1106.
Lounici, K., Pontil, M., Van De Geer, S., Tsybakov, A. B., et al. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164-2204.
Lu, C.-L., Wang, S., Ji, Z., Wu, Y., Xiong, L., Jiang, X., and Ohno-Machado, L. (2015). WebDISCO: a web service for distributed Cox model learning without patient-level data sharing. Journal of the American Medical Informatics Association, 22(6):1212-1219.
Ma, R., Tony Cai, T., and Li, H. (2020). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, pages 1-15.
Maity, S., Sun, Y., and Banerjee, M. (2019). Communication-efficient integrative regression in high-dimensions. arXiv preprint arXiv:1912.11928.
Minsker, S. (2019). Distributed statistical estimation and rates of convergence in normal approximation. Electronic Journal of Statistics, 13(2):5213-5252.
Nardi, Y., Rinaldo, A., et al. (2008). On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2:605-633.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B., et al. (2012). A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statistical Science, 27(4):538-557.
Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976-6994.
Tang, L., Zhou, L., and Song, P. X.-K. (2016). Method of divide-and-combine in regularized generalized linear models for big data. arXiv preprint arXiv:1611.06208.
Vaiter, S., Deledalle, C., Peyré, G., Fadili, J., and Dossal, C. (2012). The degrees of freedom of the group lasso. arXiv preprint arXiv:1205.1481.
Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166-1202.
Van de Geer, S. A. et al. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614-645.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press.
Wang, H. and Leng, C. (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102(479):1039-1048.
Wang, H., Li, B., and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671-683.
Wang, J., Kolar, M., Srebro, N., and Zhang, T. (2017). Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3636-3645. JMLR.org.
Wang, X., Peng, P., and Dunson, D. B. (2014). Median selection subset aggregation for parallel inference. In Advances in Neural Information Processing Systems, pages 2195-2203.
Wasserman, L. and Zhou, S. (2010). A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375-389.
Wolfson, M., Wallace, S. E., Masca, N., Rowe, G., Sheehan, N. A., Ferretti, V., LaFlamme, P., Tobin, M. D., Macleod, J., Little, J., et al. (2010). DataSHIELD: resolving a conflict in contemporary bioscience: performing a pooled analysis of individual-level data without sharing the data. International Journal of Epidemiology, 39(5):1372-1382.
Wu, Y., Jiang, X., Kim, J., and Ohno-Machado, L. (2012). Grid binary logistic regression (GLORE): building shared models without sharing data. Journal of the American Medical Informatics Association, 19(5):758-764.
Yu, S., Liao, K. P., Shaw, S. Y., Gainer, V. S., Churchill, S. E., Szolovits, P., Murphy, S. N., Kohane, I. S., and Cai, T. (2015). Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association, 22(5):993-1000.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67.
Zeng, Q. T., Goryachev, S., Weiss, S., Sordo, M., Murphy, S. N., and Lazarus, R. (2006). Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making, 6(1):30.
Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217-242.
Zhang, Y., Li, R., and Tsai, C.-L. (2010). Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association, 105(489):312-323.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541-2563.
Zhou, N. and Zhu, J. (2010). Group variable selection via a hierarchical lasso and its oracle property. arXiv preprint arXiv:1006.2871.
Zöller, D., Lenz, S., and Binder, H. (2018). Distributed multivariable modeling for signature development under data protection constraints. arXiv preprint arXiv:1803.00422.

Figure 1: Average estimation errors (AEE) of SHIR, Debias$_{L\&B}$ and SMA relative to those of IPD under different $M$, $p$ and data generation mechanisms (i)-(iii) introduced in Section 5. Panels: Sparse Precision and Correct Model (strong signal); Sparse Precision and Correct Model (weak signal); Dense Precision and Misspecified Model. The vertical axis is the relative AEE.
Figure 2: Prediction errors (PE) of SHIR, Debias$_{L\&B}$ and SMA relative to those of IPD under different $M$, $p$ and data generation mechanisms (i)-(iii) introduced in Section 5. Panels as in Figure 1. The vertical axis is the relative PE.
Figure 3: True positive rate and false discovery rate in recovering the support of $\beta^{(\bullet)}$ for IPD, SHIR, Debias$_{L\&B}$ and SMA, under different $M$, $p$ and the data generation mechanisms (i) (correct model with) strong signal and (ii) weak signal introduced in Section 5. Panels: Original Coefficients (strong signal) and Original Coefficients (weak signal); the vertical axes are the true positive rate and the false discovery rate.
Table 1: Estimated coefficients of the variables that received non-zero coefficients from at least one of the included methods. A : B denotes the interaction term of variables A and B. The $\log(x + 1)$ transformation is taken on the count data and the covariates are normalized.

Variable                                              Debias$_{L\&B}$    SHIR     SMA
Prescription code of statin                                      0.14    0.07       0
Age                                                              0.09    0.26    0.28
Procedure code for echo                                             0   -0.10       0
Total ICD counts                                                -0.38   -0.75       0
NLP mention of CAD                                               0.97    1.34    0.81
NLP mention of CAD procedure related concepts                       0    0.02       0
NLP mention of non-smoker                                       -0.07   -0.25   -0.42
ICD code for CAD                                                 1.00    0.67    0.35
CPT code for stent or CABG                                          0    0.05       0
NLP mention of current-smoker                                       0   -0.03       0
Any NLP mention                                                  0.06    0.05       0
ICD code for CAD : Procedure code for echo                          0   -0.04       0
NLP mention of CAD : NLP mention of possible-smoker                 0   -0.02       0
Oncall : NLP mention of non-smoker                               0.09       0       0
Indication for NLP mention of non-smoker                        -0.53       0       0

Figure 4: The mean and 95% bootstrap confidence interval of AUC, Brier score, $F_{5\%}$ and $F_{10\%}$ of Debias$_{L\&B}$, Local, SHIR and SMA on the validation data from the four studies.