Individual Data Protected Integrative Regression Analysis of High-dimensional Heterogeneous Data
Tianxi Cai, Molei Liu, and Yin Xia

Abstract
Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra-high dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as the DataSHIELD constraint (Wolfson et al., 2010). Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and the model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.

Keywords: High dimensionality; model heterogeneity; DataSHIELD; sparse meta-analysis; distributed learning; debiased LASSO; rate optimality; sparsistency.

Authors are listed in alphabetical order. Department of Biostatistics, Harvard School of Public Health, Harvard University; the research of Tianxi Cai was supported in part by NIH Grants U54 HG007963 and R21 CA242940. Department of Statistics, School of Management, Fudan University; the research of Yin Xia was supported in part by NSFC Grants 11771094 and 11690013.

1 Introduction
Synthesizing information from multiple studies is crucial for evidence-based medicine and policy decision making. Meta-analyzing multiple studies allows for more precise estimates and enables investigation of generalizability. In the presence of heterogeneity across studies and high dimensional predictors, however, such integrative analysis is highly challenging. An example of such integrative analysis is the development of generalizable predictive models using electronic health records (EHR) data from different hospitals. In addition to high dimensional features, EHR data analysis encounters privacy constraints, in that individual-level data typically cannot be shared across local hospital sites, which makes the challenge of integrative analysis even more pronounced. Breach of privacy arising from data sharing is in fact a growing concern for scientific studies in general. Recently, Wolfson et al. (2010) proposed a generic individual-information protected integrative analysis framework, named DataSHIELD, that transfers only summary statistics from each distributed local site to the central site for pooled analysis. Conceptually highly valued by research communities (e.g., Jones et al., 2012; Doiron et al., 2013), DataSHIELD facilitates important data co-analysis settings where individual-level data meta-analysis (ILMA) is not feasible due to ethical and/or legal restrictions (Gaye et al., 2014). In the low dimensional setting, a number of statistical methods have been developed for distributed analysis that satisfy the DataSHIELD constraint (e.g., Chen et al., 2006; Wu et al., 2012; Liu and Ihler, 2014; Lu et al., 2015; Huang and Huo, 2015; Han and Liu, 2016; He et al., 2016; Zöller et al., 2018; Duan et al., 2019, 2020). Distributed learning methods for high dimensional regression have largely focused on settings without between-study heterogeneity, as detailed in Section 1.2. To the best of our knowledge, no existing distributed learning method can effectively handle high dimensionality together with model heterogeneity across the local sites.
In the context of high dimensional regression, several recently proposed distributed inference approaches can potentially be used for meta-analysis under the DataSHIELD constraint. Specifically, Tang et al. (2016), Lee et al. (2017) and Battey et al. (2018) proposed distributed inference procedures that aggregate local debiased LASSO estimators (Zhang and Zhang, 2014; Van de Geer et al., 2014; Javanmard and Montanari, 2014). By including a debiasing step in their pipelines, the resulting estimators can be used directly for inference. Lee et al. (2017) and Battey et al. (2018) proposed to further truncate the aggregated dense debiased estimators to achieve sparsity; see also Maity et al. (2019). Although this debiasing-based strategy can be extended to fit our heterogeneous modeling assumptions, it loses statistical efficiency because it fails to account for the heterogeneity of the information matrices across sites. In addition, the use of a debiasing procedure at the local sites incurs additional estimation error, as detailed in Section 4.4. Besides, Lu et al. (2015) and Li et al. (2016) proposed distributed approaches for $\ell_1$-regularized logistic and Cox regression that rely on iterative communication across the studies. Their methods require sequential communication between the local sites and the central machine, which can be time- and resource-consuming, especially since human effort is often needed to perform the computation and data transfer in many practical settings. Chen and Xie (2014) proposed to estimate high dimensional parameters by first adopting majority voting to select a positive set and then combining local estimates of the coefficients belonging to this set. Wang et al. (2014) proposed to aggregate the local estimators through their median values rather than their means, which is more robust to poor estimation at local sites with insufficient sample sizes (Minsker, 2019). More recently, Wang et al. (2017) and Jordan et al. (2019) presented a communication-efficient surrogate likelihood framework for distributed statistical learning that transfers only first-order summary statistics, i.e., gradients, between the local sites and the central site. Fan et al. (2019) extended this idea and proposed two iterative distributed optimization algorithms for general penalized likelihood problems. However, their framework, as well as the others summarized in this paragraph, is restricted to homogeneous scenarios and cannot easily be extended to settings with heterogeneous models or covariates.

In this paper, we fill this methodological gap by proposing a novel data-Shielding High-dimensional Integrative Regression (SHIR) method that accommodates cross-study heterogeneity under the DataSHIELD constraints. While SHIR can be viewed as analogous to integrative analysis of debiased local LASSO estimators, it achieves debiasing without having to debias the local estimators. SHIR solves a LASSO problem only once at each local site, requires neither inverse Hessian matrices nor locally debiased estimators, and needs only one round of communication. Statistically, it provides integrative model estimation and variable selection in the presence of high dimensionality and heterogeneity in model parameters across sites.
In addition, under the ultra-high dimensional regime where $p$ can grow exponentially with $n$, SHIR is shown to achieve performance asymptotically equivalent to that of the estimator obtained from the ideal individual patient data (IPD) pooled across sites, and to attain consistent variable selection under certain sparsity assumptions. Such properties are not readily available in the existing literature, and some novel technical tools are developed for the theoretical verification. We also show theoretically that SHIR is statistically more efficient than approaches based on integrating and truncating locally debiased estimators (e.g., Lee et al., 2017; Battey et al., 2018). Our numerical studies further verify this by comparing our method with existing approaches: SHIR enjoys numerical performance close to that of the ideal IPD estimator and outperforms the other methods.

The rest of this paper is organized as follows. We introduce the setting in Section 2 and describe SHIR, our proposed approach, in Section 3. Theoretical properties of SHIR are studied in Section 4: we derive the upper bounds for its prediction and estimation risks, compare them with those of the existing approaches, and show that the errors incurred by aggregating derived data are negligible compared to the statistical minimax rate. When the true model is ultra-sparse, SHIR is shown to be asymptotically equivalent to the IPD estimator and to achieve sparsistency. Section 5 compares the performance of SHIR with existing methods through simulations. We apply SHIR to derive classification models for coronary artery disease (CAD) using EHR data from four different disease cohorts in Section 6. Section 7 concludes the paper with a discussion. Technical proofs of the theoretical results are provided in the Supplementary Material.
2 Settings and Notation

Suppose there are $M$ independent studies, with $n_m$ subjects in the $m$-th study, $m = 1, \dots, M$. For the $i$-th subject in the $m$-th study, let $Y_i^{(m)}$ and $X_i^{(m)}$ respectively denote the response and the $p$-dimensional covariate vector, $D_i^{(m)} = (Y_i^{(m)}, X_i^{(m)\mathsf T})^{\mathsf T}$, $Y^{(m)} = (Y_1^{(m)}, \dots, Y_{n_m}^{(m)})^{\mathsf T}$, and $X^{(m)} = (X_1^{(m)}, X_2^{(m)}, \dots, X_{n_m}^{(m)})^{\mathsf T}$. We assume that the observations in study $m$, $D^{(m)} = \{D_i^{(m)}, i = 1, \dots, n_m\}$, are independent and identically distributed. Without loss of generality, we assume that $X_i^{(m)}$ includes $1$ as its first component and that $X_{i,-1}^{(m)}$ has mean $\mathbf 0$, where for any vector $x = (x_1, x_2, \dots, x_d)^{\mathsf T}$, $x_{-1} = (x_2, \dots, x_d)^{\mathsf T}$. We define the population parameters of interest as
$$\bar\beta^{(m)} = \operatorname{argmin}_{\beta^{(m)}} \bar{\mathcal L}_m(\beta^{(m)}), \quad \text{where } \bar{\mathcal L}_m(\beta^{(m)}) = \mathbb E\{f(\beta^{(m)\mathsf T} X_i^{(m)}, Y_i^{(m)})\}, \quad \beta^{(m)} = (\beta_1^{(m)}, \beta_2^{(m)}, \dots, \beta_p^{(m)})^{\mathsf T},$$
for some specified loss function $f$. Examples of loss functions include $f(\beta^{\mathsf T}x, y) = (y - \beta^{\mathsf T}x)^2$ for linear regression and $f(\beta^{\mathsf T}x, y) = -y\beta^{\mathsf T}x + \log\{1 + \exp(\beta^{\mathsf T}x)\}$ for logistic regression. We assume that $\bar\beta^{(m)}$ varies across studies but may share similar support and magnitudes, which makes it possible to attain more accurate estimation by jointly estimating the regression models across studies. We define such structure more rigorously in Section 2.1.

Throughout, for any integer $d$, $[d] = \{1, \dots, d\}$. For any vector $x = (x_1, x_2, \dots, x_d)^{\mathsf T} \in \mathbb R^d$ and index set $S = \{j_1, \dots, j_k : j_1 < \dots < j_k\} \subseteq [d]$, $x_S = (x_{j_1}, \dots, x_{j_k})^{\mathsf T}$, $\|x\|_q$ denotes the $\ell_q$ norm of $x$, and $\|x\|_\infty = \max_{j \in [d]} |x_j|$. For any matrix $A = [A_{\bullet 1}, \dots, A_{\bullet d}] \in \mathbb R^{n \times d}$ and index sets $S_1, S_2 \subseteq [d]$, let $A_{j\bullet}$ and $A_{\bullet j}$ respectively denote the $j$-th row and column of $A$, let $A_{S_1 S_2}$ denote the sub-matrix with rows in $S_1$ and columns in $S_2$, $A_{\bullet S} = [A_{\bullet j_1}, \dots, A_{\bullet j_k}]$, $\|A\|_2 := [\Lambda_{\max}(A^{\mathsf T}A)]^{1/2}$, and $\|A\|_\infty = \max_j \|A_{j\bullet}\|_1$. Let $\beta_j = (\beta_j^{(1)}, \dots, \beta_j^{(M)})^{\mathsf T}$ and $\beta^{(\bullet)} = (\beta^{(1)\mathsf T}, \dots, \beta^{(M)\mathsf T})^{\mathsf T}$, and let $\bar\beta_j$, $\bar\beta^{(m)}$ and $\bar\beta^{(\bullet)} = (\bar\beta^{(1)\mathsf T}, \bar\beta^{(2)\mathsf T}, \dots, \bar\beta^{(M)\mathsf T})^{\mathsf T}$ respectively denote the true values of $\beta_j$, $\beta^{(m)}$ and $\beta^{(\bullet)}$.
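For concreteness, under the logistic loss the derivatives of $f$ with respect to its first argument, which appear repeatedly below, are
$$f(a, y) = -ya + \log\{1 + e^{a}\}, \qquad f'(a, y) = \operatorname{expit}(a) - y, \qquad f''(a, y) = \operatorname{expit}(a)\{1 - \operatorname{expit}(a)\},$$
where $\operatorname{expit}(a) = 1/(1 + e^{-a})$; in particular, $f''$ here is bounded and Lipschitz, which is relevant to the smoothness conditions of Section 4.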
2.1 Sparse Structures and Penalized Estimation

To estimate $\beta^{(\bullet)}$ based on $D^{(\bullet)} = \{D^{(1)}, \dots, D^{(M)}\}$, consider the empirical global loss function
$$\widehat{\mathcal L}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M n_m\,\widehat{\mathcal L}_m(\beta^{(m)}), \quad N = \sum_{m=1}^M n_m, \quad \widehat{\mathcal L}_m(\beta^{(m)}) = n_m^{-1}\sum_{i=1}^{n_m} f(\beta^{(m)\mathsf T} X_i^{(m)}, Y_i^{(m)}), \quad m = 1, \dots, M.$$
Minimizing $\widehat{\mathcal L}(\beta^{(\bullet)})$ is obviously equivalent to estimating each $\beta^{(m)}$ using $D^{(m)}$ only. To improve the estimation of $\beta^{(\bullet)}$ by synthesizing information across $D^{(\bullet)}$ and to overcome the high dimensionality, we assume that $\bar\beta^{(\bullet)}$ is sparse and employ the penalized loss $\widehat{\mathcal L}(\beta^{(\bullet)}) + \lambda\rho(\beta^{(\bullet)})$, with the penalty function $\rho(\cdot)$ designed to leverage prior information on $\bar\beta^{(\bullet)}$.

Under the prior assumption that $\bar\beta^{(1)}_{-1}, \dots, \bar\beta^{(M)}_{-1}$, i.e., the true coefficients excluding the intercepts, share the same support, we may choose $\rho(\cdot)$ as the group LASSO penalty (Yuan and Lin, 2006), $\rho(\beta^{(\bullet)}) = \sum_{j=2}^p \|\beta_j\|_2$. Alternatively, following typical meta-analysis, we decompose $\beta_j^{(m)}$ as $\beta_j^{(m)} = \mu_j + \alpha_j^{(m)}$, with $\alpha_j = (\alpha_j^{(1)}, \dots, \alpha_j^{(M)})^{\mathsf T}$ and $\mathbf 1_{M\times 1}^{\mathsf T}\alpha_j = 0$ for identifiability. Similarly to Cheng et al. (2015), we may impose a mixture of LASSO and group LASSO penalties:
$$\rho_1(\beta^{(\bullet)}) = \sum_{j=2}^p |\mu_j| + \lambda_g \sum_{j=2}^p \|\alpha_j\|_2, \qquad (1)$$
where $\lambda_g \ge 0$. For each $j$, $\mu_j$ represents the average effect of the covariate $X_j$ and $\alpha_j$ captures the between-study heterogeneity of the effects. This penalty accommodates three types of covariates: (i) homogeneous effects, with $\mu_j \ne 0$ and $\alpha_j = \mathbf 0$; (ii) heterogeneous effects, with $\alpha_j \ne \mathbf 0$; and (iii) null effects, with $\mu_j = 0$ and $\alpha_j = \mathbf 0$. The penalty $\rho_1$ differs slightly from that of Cheng et al. (2015), where $\|\alpha_{j,-1}\|_2$ was used instead of $\|\alpha_j\|_2$. This modification brings two main advantages: (1) the estimator is invariant to permutations of the indices of the $M$ studies; and (2) it yields better theoretical estimation error bounds for the heterogeneous effects, as detailed in the proofs. Other penalties with similar properties include the hierarchical LASSO penalty (Zhou and Zhu, 2010), $\rho(\beta^{(\bullet)}) = \sum_{j=2}^p \|\beta_j\|_1^{1/2}$. Although the proposed SHIR framework can accommodate a variety of sparsity-inducing penalties, our theoretical derivations and implementation focus on the mixture penalty $\rho(\cdot) = \rho_1(\cdot)$ throughout; we comment on extensions of SHIR to alternative penalty functions in Section 7.

With a given penalty function $\rho(\cdot)$, we may form the penalized loss $\widehat Q(\beta^{(\bullet)}) = \widehat{\mathcal L}(\beta^{(\bullet)}) + \lambda\rho(\beta^{(\bullet)})$, and an idealized IPD estimator of $\beta^{(\bullet)}$ can be obtained as
$$\widehat\beta^{(\bullet)}_{\mathrm{IPD}} = \operatorname{argmin}_{\beta^{(\bullet)}}\ \widehat Q(\beta^{(\bullet)}), \qquad (2)$$
with some tuning parameter $\lambda \ge 0$.

3 The SHIR Method

However, the IPD estimator is not feasible under the DataSHIELD constraint. Our goal is to construct an alternative estimator that asymptotically attains the same efficiency as $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ while sharing only summary data. When $p$ is not large, the sparse meta-analysis (SMA) approach of He et al. (2016) achieves this goal via a second-order Taylor expansion, estimating $\beta^{(\bullet)}$ by $\widehat\beta^{(\bullet)}_{\mathrm{SMA}} = \operatorname{argmin}_{\beta^{(\bullet)}} \widehat Q_{\mathrm{SMA}}(\beta^{(\bullet)})$, where
$$\widehat Q_{\mathrm{SMA}}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M (\beta^{(m)} - \breve\beta^{(m)})^{\mathsf T}\,\breve V_m^{-1}\,(\beta^{(m)} - \breve\beta^{(m)}) + \lambda\rho(\beta^{(\bullet)}), \qquad (3)$$
$\breve\beta^{(m)} = \operatorname{argmin}_{\beta^{(m)}} \widehat{\mathcal L}_m(\beta^{(m)})$ and $\breve V_m = \{n_m \nabla^2 \widehat{\mathcal L}_m(\breve\beta^{(m)})\}^{-1}$. When $\beta^{(m)}$ is the same across sites and $\lambda = 0$, $\widehat\beta^{(\bullet)}_{\mathrm{SMA}}$ is the inverse-variance-weighted estimator (Lin and Zeng, 2010). Although SMA attains the oracle property for relatively small $p$, it fails when $p$ is large due to the failure of $\breve\beta^{(m)}$.

In the high dimensional setting, one may try to overcome the limitation of the SMA approach by replacing $\breve\beta^{(m)}$ with the regularized LASSO estimator,
$$\widehat\beta^{(m)}_{\mathrm{LASSO}} = \operatorname{argmin}_{\beta^{(m)}}\ \widehat{\mathcal L}_m(\beta^{(m)}) + \lambda_m \|\beta^{(m)}_{-1}\|_1. \qquad (4)$$
However, aggregating $\{\widehat\beta^{(m)}_{\mathrm{LASSO}}, m \in [M]\}$ is problematic for large $p$ due to their inherent biases. To overcome the bias issue, we construct the SHIR method, motivated by SMA and the debiasing approach for the LASSO (Van de Geer et al., 2014), yet achieving debiasing without actually debiasing the $M$ local estimators. Specifically, we propose the SHIR estimator of $\beta^{(\bullet)}$ as $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}} = \operatorname{argmin}_{\beta^{(\bullet)}} \widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$, where
$$\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M n_m\Big\{\tfrac12\,\beta^{(m)\mathsf T}\widehat H_m\beta^{(m)} - \beta^{(m)\mathsf T}\widehat g_m\Big\} + \lambda\rho(\beta^{(\bullet)}), \qquad (5)$$
$\widehat H_m = \nabla^2\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})$ is an estimate of the Hessian matrix, and $\widehat g_m = \widehat H_m\widehat\beta^{(m)}_{\mathrm{LASSO}} - \nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})$. The SHIR estimator $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ satisfies the DataSHIELD constraint, as $\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$ depends on $D^{(m)}$ only through the summary statistics $\widehat{\mathbb D}_m = \{n_m, \widehat H_m, \widehat g_m\}$, which can be obtained within the $m$-th study; it requires only one round of data transfer from the local sites to the central node.

With $\{\widehat H_m, \widehat g_m, m = 1, \dots, M\}$, we may implement the SHIR procedure using coordinate descent algorithms (Friedman et al., 2010) together with a reparameterization. Specifically, let $\mu = (\mu_1, \dots, \mu_p)^{\mathsf T}$, $\alpha^{(\bullet)} = (\alpha^{(1)\mathsf T}, \dots, \alpha^{(M)\mathsf T})^{\mathsf T}$ and $\alpha^{(\bullet)}_{-1} = (\alpha^{(1)\mathsf T}_{-1}, \dots, \alpha^{(M)\mathsf T}_{-1})^{\mathsf T}$, and let $\bar\mu$ and $\bar\alpha^{(\bullet)}$ denote the true values of $\mu$ and $\alpha^{(\bullet)}$, respectively. Let $\widehat Q_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)}) = \widehat{\mathcal L}_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)}) + \lambda\rho_1(\mu, \alpha^{(\bullet)}; \lambda_g)$, where $\rho_1(\mu, \alpha^{(\bullet)}; \lambda_g) = \|\mu_{-1}\|_1 + \lambda_g\|\alpha^{(\bullet)}_{-1}\|_{2,1}$, $\|\alpha^{(\bullet)}_{-1}\|_{2,1} = \sum_{j=2}^p \|\alpha_j\|_2$, and
$$\widehat{\mathcal L}_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)}) = N^{-1}\sum_{m=1}^M n_m\Big\{\tfrac12\,(\mu + \alpha^{(m)})^{\mathsf T}\widehat H_m(\mu + \alpha^{(m)}) - \widehat g_m^{\mathsf T}(\mu + \alpha^{(m)})\Big\}.$$
Then the optimization problem in (5) can be reparameterized and represented as
$$(\widehat\mu_{\mathrm{SHIR}}, \widehat\alpha^{(\bullet)}_{\mathrm{SHIR}}) = \operatorname{argmin}_{(\mu, \alpha^{(\bullet)})}\ \widehat Q_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)}), \quad \text{subject to } \mathbf 1_{M\times 1}^{\mathsf T}\alpha_j = 0, \ j \in [p],$$
and $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ is recovered through the transformation $\beta_j^{(m)} = \mu_j + \alpha_j^{(m)}$ for every $j \in [p]$.
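To make the data flow concrete, the following sketch (our own naming, not from any released implementation) computes the summary statistics $\widehat{\mathbb D}_m = \{n_m, \widehat H_m, \widehat g_m\}$ at a single site under the logistic loss, assuming the local LASSO estimate $\widehat\beta^{(m)}_{\mathrm{LASSO}}$ has already been obtained, e.g., by cross-validated penalized logistic regression:

```python
import numpy as np

def logistic_summary_stats(X, y, beta_lasso):
    """Compute the SHIR summary statistics {n_m, H_m, g_m} at one site.

    X : (n_m, p) local design (first column all ones); y : (n_m,) binary
    response; beta_lasso : (p,) local LASSO estimate, obtained beforehand.
    """
    n = X.shape[0]
    prob = 1.0 / (1.0 + np.exp(-(X @ beta_lasso)))
    # Gradient of the average logistic loss at the local LASSO solution.
    grad = X.T @ (prob - y) / n
    # Hessian of the average logistic loss: H_m = X' W X / n, where
    # W = diag{p_i (1 - p_i)} collects the f'' weights.
    w = prob * (1.0 - prob)
    H = (X * w[:, None]).T @ X / n
    # g_m = H_m beta_lasso - grad is the only other quantity SHIR needs.
    g = H @ beta_lasso - grad
    return {"n": n, "H": H, "g": g}
```

Only `{n, H, g}` leave the site; the central node assembles the quadratic loss in (5) from these $M$ triplets and minimizes it under the mixture penalty.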
Remark 1.

The first term in $\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$ is essentially the second-order Taylor expansion of $\widehat{\mathcal L}(\beta^{(\bullet)})$ at the local LASSO estimators $\widehat\beta^{(\bullet)}_{\mathrm{LASSO}}$. The SHIR method can also be viewed as approximately aggregating local debiased LASSO estimators without actually carrying out the standard debiasing process. To see this, let
$$\widehat Q_{\mathrm{dLASSO}}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M \frac{n_m}{2}\,(\beta^{(m)} - \widehat\beta^{(m)}_{\mathrm{dLASSO}})^{\mathsf T}\widehat H_m(\beta^{(m)} - \widehat\beta^{(m)}_{\mathrm{dLASSO}}) + \lambda\rho(\beta^{(\bullet)}),$$
where $\widehat\beta^{(m)}_{\mathrm{dLASSO}}$ is the debiased LASSO estimator for the $m$-th study,
$$\widehat\beta^{(m)}_{\mathrm{dLASSO}} = \widehat\beta^{(m)}_{\mathrm{LASSO}} - \widehat\Theta_m\nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}}), \quad m = 1, \dots, M, \qquad (6)$$
and $\widehat\Theta_m$ is a regularized inverse of $\widehat H_m$. We may write
$$\widehat Q_{\mathrm{dLASSO}}(\beta^{(\bullet)}) = N^{-1}\sum_{m=1}^M n_m\Big\{\tfrac12\,\beta^{(m)\mathsf T}\widehat H_m\beta^{(m)} - \beta^{(m)\mathsf T}\widehat H_m\widehat\beta^{(m)}_{\mathrm{dLASSO}} + C_m\Big\} + \lambda\rho(\beta^{(\bullet)}) \approx N^{-1}\sum_{m=1}^M n_m\Big\{\tfrac12\,\beta^{(m)\mathsf T}\widehat H_m\beta^{(m)} - \beta^{(m)\mathsf T}\widehat g_m + C_m\Big\} + \lambda\rho(\beta^{(\bullet)}),$$
which equals $\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$ up to terms not depending on $\beta^{(\bullet)}$. Here we use $\widehat\Theta_m\widehat H_m \approx I$ in the approximation, and the term
$$C_m = \tfrac12\big\{\widehat H_m\widehat\beta^{(m)}_{\mathrm{LASSO}} - \widehat H_m\widehat\Theta_m\nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})\big\}^{\mathsf T}\big\{\widehat\beta^{(m)}_{\mathrm{LASSO}} - \widehat\Theta_m\nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})\big\}$$
does not depend on $\beta^{(\bullet)}$. We use $\widehat\Theta_m\widehat H_m \approx I$ only heuristically here, to show a connection between our SHIR estimator and the debiased LASSO; the validity and asymptotic properties of the SHIR estimator do not require obtaining any $\widehat\Theta_m$, or a theoretical guarantee that $\widehat\Theta_m\widehat H_m$ is sufficiently close to $I$.
Remark 2.

Compared with existing debiasing-type methods (Lee et al., 2017; Battey et al., 2018), SHIR is also computationally and statistically efficient, as it is performed without the debiased statistics (6) and achieves debiasing without calculating $\widehat\Theta_m$, which can only be estimated well under strong conditions (Van de Geer et al., 2014; Janková and Van De Geer, 2016).

The implementation of SHIR requires selecting three sets of tuning parameters: $\{\lambda_m, m \in [M]\}$, $\lambda$ and $\lambda_g$. We select $\{\lambda_m, m \in [M]\}$ for the local LASSO problems via standard $K$-fold cross-validation (CV). Selecting $\lambda$ and $\lambda_g$ requires balancing the trade-off between the model's degrees of freedom, denoted by $\mathrm{DF}(\lambda, \lambda_g)$, and the quadratic loss in $\widehat Q_{\mathrm{SHIR}}(\beta^{(\bullet)})$. Tuning $\lambda$ and $\lambda_g$ via CV is not feasible, since individual-level data are not available at the central site. We propose to select $\lambda$ and $\lambda_g$ as the minimizer of the generalized information criterion (GIC) (Wang and Leng, 2007; Zhang et al., 2010), defined as
$$\mathrm{GIC}(\lambda, \lambda_g) = \mathrm{Deviance}(\lambda, \lambda_g) + \gamma_N\,\mathrm{DF}(\lambda, \lambda_g),$$
where $\gamma_N$ is a pre-specified scaling parameter and
$$\mathrm{Deviance}(\lambda, \lambda_g) = N^{-1}\sum_{m=1}^M n_m\Big\{\widehat\beta^{(m)\mathsf T}_{\mathrm{SHIR}}(\lambda, \lambda_g)\,\widehat H_m\,\widehat\beta^{(m)}_{\mathrm{SHIR}}(\lambda, \lambda_g) - 2\,\widehat g_m^{\mathsf T}\widehat\beta^{(m)}_{\mathrm{SHIR}}(\lambda, \lambda_g)\Big\}.$$
Following Zhang et al. (2010) and Vaiter et al. (2012), we define $\mathrm{DF}(\lambda, \lambda_g)$ as the trace of
$$\Big[\partial^2_{\widehat S_\mu, \widehat S_\alpha}\,\widehat Q_{\mathrm{SHIR}}(\widehat\mu_{\mathrm{SHIR}}, \widehat\alpha^{(\bullet)}_{\mathrm{SHIR}})\Big]^{-1}\Big[\partial^2_{\widehat S_\mu, \widehat S_\alpha}\,\widehat{\mathcal L}_{\mathrm{SHIR}}(\widehat\mu_{\mathrm{SHIR}}, \widehat\alpha^{(\bullet)}_{\mathrm{SHIR}})\Big],$$
where $\widehat S_\mu = \{j : \widehat\mu_{\mathrm{SHIR}, j}(\lambda, \lambda_g) \ne 0\}$, $\widehat S_\alpha = \{j : \|\widehat\alpha_{\mathrm{SHIR}, j}(\lambda, \lambda_g)\|_2 \ne 0\}$, and the operator $\partial^2_{\widehat S_\mu, \widehat S_\alpha}$ is the second-order partial derivative with respect to $(\mu_{\widehat S_\mu}^{\mathsf T}, \alpha^{(2)\mathsf T}_{\widehat S_\alpha}, \dots, \alpha^{(M)\mathsf T}_{\widehat S_\alpha})^{\mathsf T}$, after plugging $\alpha^{(1)} = -\sum_{m=2}^M \alpha^{(m)}$ into $\widehat Q_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)})$ or $\widehat{\mathcal L}_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)})$.
Remark 3.

As discussed in Kim et al. (2012), $\gamma_N$ can be chosen depending on the goal, with common choices including $\gamma_N = 2/N$ for AIC (Akaike, 1974), $\gamma_N = \log N/N$ for BIC (Bhat and Kumar, 2010), $\gamma_N = \log\log p \cdot \log N/N$ for modified BIC (Wang et al., 2009), and $\gamma_N = 2\log p/N$ for RIC (Foster and George, 1994). We used the BIC with $\gamma_N = \log N/N$ in our numerical studies.
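As an illustration of the tuning step, the sketch below (our own naming; `shir_fit`, which solves (5) at given tuning values, and `degrees_of_freedom`, which evaluates the trace formula above, are assumed helpers) scans a grid of $(\lambda, \lambda_g)$ values and returns the GIC minimizer with the BIC scaling $\gamma_N = \log N/N$:

```python
import numpy as np

def gic_value(stats, beta_hat, df, gamma_N):
    """GIC(lambda, lambda_g) = Deviance + gamma_N * DF, from summary data.

    stats : list of per-site dicts {n, H, g}; beta_hat : (M, p) fitted SHIR
    coefficients; df : DF(lambda, lambda_g); gamma_N : e.g. log(N)/N (BIC).
    """
    N = sum(s["n"] for s in stats)
    dev = sum(s["n"] * (b @ s["H"] @ b - 2.0 * s["g"] @ b)
              for s, b in zip(stats, beta_hat)) / N
    return dev + gamma_N * df

def select_tuning(stats, lam_grid, lam_g_grid, shir_fit, degrees_of_freedom):
    """Pick (lambda, lambda_g) minimizing the GIC; helpers are assumed given."""
    N = sum(s["n"] for s in stats)
    gamma_N = np.log(N) / N                       # BIC scaling, as in the text
    best, best_gic = None, np.inf
    for lam in lam_grid:
        for lam_g in lam_g_grid:
            beta_hat = shir_fit(stats, lam, lam_g)
            df = degrees_of_freedom(stats, beta_hat, lam, lam_g)
            val = gic_value(stats, beta_hat, df, gamma_N)
            if val < best_gic:
                best, best_gic = (lam, lam_g), val
    return best
```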
Remark 4.
For linear models, it has been shown that a proper choice of $\gamma_N$ guarantees GIC's model selection consistency under various divergence rates of the dimension $p$ (Kim et al., 2012). For example, for fixed $p$, GIC is consistent if $N\gamma_N \to \infty$ and $\gamma_N \to 0$. When $p$ diverges at a polynomial rate $N^\xi$, GIC is consistent with $\gamma_N = \log N/N$ (BIC) for smaller $\xi$, and with $\gamma_N = \log\log p \cdot \log N/N$ (modified BIC) over a wider range of $\xi$. When $p$ diverges at an exponential rate $O(\exp(\kappa N^\xi))$, GIC is consistent with $\gamma_N = N^{\nu - 1}$ for $\nu$ with $\xi < \nu < 1$. These results extend naturally to more general log-likelihood functions.

4 Theoretical Results
In this section, we present theoretical properties of $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ for $\rho(\beta^{(\bullet)}) = \rho_1(\beta^{(\bullet)})$; we discuss how our theoretical results extend to other sparse structures in Section 7. In Sections 4.2 and 4.3, we derive consistency and equivalence results for the prediction and estimation risks of SHIR under a high dimensional sparse model and a smooth loss function $f$. In Section 4.4, we compare the risk bounds of SHIR with those of an estimator derived from the debiasing-based aggregation approaches (Lee et al., 2017; Battey et al., 2018). In Section 4.5, we show that SHIR achieves sparsistency, i.e., variable selection consistency, for the non-zero sets of $\bar\mu$ and $\bar\alpha^{(\bullet)}$.

4.1 Notation and Definitions

We begin with some notation and definitions used throughout the paper. Let $o\{a(n)\}$, $O\{a(n)\}$, $\omega\{a(n)\}$, $\Omega\{a(n)\}$ and $\Theta\{a(n)\}$ respectively denote sequences growing at a smaller, equal/smaller, larger, equal/larger and equal rate relative to the sequence $a(n)$. Similarly, $o_P$, $O_P$, $\omega_P$, $\Omega_P$ and $\Theta_P$ denote the corresponding rates holding with probability approaching $1$ as $n \to \infty$. We define the model-complexity-adjusted effective sample sizes $n^{\mathrm{eff}}_m = n_m/(s\log p)$ for each study and $n^{\mathrm{eff}} = N/\{s(\log p + M)\}$, which are the main drivers of the rates of the proposed estimators. Following Vershynin (2018), we define the sub-Gaussian norm of a random variable $X$ as $\|X\|_{\psi_2} := \sup_{q \ge 1} q^{-1/2}(\mathbb E|X|^q)^{1/q}$. For any symmetric matrix $X$, let $\Lambda_{\min}(X)$ and $\Lambda_{\max}(X)$ denote its minimum and maximum eigenvalues. Let $T = (-\mathbf 1_{(M-1)\times 1}, I_{(M-1)\times(M-1)})^{\mathsf T}$ and define $\|x\|_T := \|Tx\|_2$ for $x \in \mathbb R^{M-1}$, with conjugate norm $\|x\|_{\widetilde T} := \|T(T^{\mathsf T}T)^{-1}x\|_2$. For $a \in \mathbb R$, denote by $\operatorname{sign}(a)$ the sign of $a$, and for an event $E$, denote by $I(E)$ the indicator of $E$. Denote $S_\mu = \{j : \bar\mu_j \ne 0\}$, $S_\alpha = \{j : \|\bar\alpha_j\|_2 \ne 0\}$, $S = S_\mu \cup S_\alpha$, $s_\mu = |S_\mu|$, $s_\alpha = |S_\alpha|$ and $s = |S|$. Let $f'(a, y) = \partial f(a, y)/\partial a$ and $f''(a, y) = \partial^2 f(a, y)/\partial a^2$.

The weighted design matrix corresponding to $\widehat{\mathcal L}_{\mathrm{SHIR}}(\mu, \alpha^{(\bullet)})$ with respect to $\theta = (\mu^{\mathsf T}, \alpha^{(2)\mathsf T}, \dots, \alpha^{(M)\mathsf T})^{\mathsf T}$, after setting $\alpha^{(1)} = -\sum_{m=2}^M \alpha^{(m)}$, can be expressed as $W(\beta^{(\bullet)}) = \operatorname{bdiag}\{\Omega_1^{1/2}(\beta^{(1)}), \dots, \Omega_M^{1/2}(\beta^{(M)})\}\,Z$, where $\operatorname{bdiag}$ is the block-diagonal operator, $\Omega_m(\beta) = \operatorname{diag}\{f''(\beta^{\mathsf T}X_1^{(m)}, Y_1^{(m)}), \dots, f''(\beta^{\mathsf T}X_{n_m}^{(m)}, Y_{n_m}^{(m)})\}$, $Z = Z_{[p],[p]}$, and for any $S_1, S_2 \subseteq [p]$,
$$Z_{S_1, S_2} = \begin{pmatrix} X^{(1)}_{\bullet S_1} & -X^{(1)}_{\bullet S_2} & -X^{(1)}_{\bullet S_2} & \cdots & -X^{(1)}_{\bullet S_2} \\ X^{(2)}_{\bullet S_1} & X^{(2)}_{\bullet S_2} & 0 & \cdots & 0 \\ X^{(3)}_{\bullet S_1} & 0 & X^{(3)}_{\bullet S_2} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ X^{(M)}_{\bullet S_1} & 0 & 0 & \cdots & X^{(M)}_{\bullet S_2} \end{pmatrix}.$$
For any $S_1, S_2 \subseteq [p]$, let $H_{m, S_1}(\beta^{(m)})$ denote the sub-matrix of $H_m(\beta^{(m)}) := \nabla^2\widehat{\mathcal L}_m(\beta^{(m)})$ with rows and columns in $S_1$, and let $W_{S_1, S_2}(\beta^{(\bullet)})$ denote the sub-matrix of $W(\beta^{(\bullet)})$ corresponding to $Z_{S_1, S_2}$ and $(\mu_{S_1}^{\mathsf T}, \alpha^{(2)\mathsf T}_{S_2}, \dots, \alpha^{(M)\mathsf T}_{S_2})^{\mathsf T}$. For notational ease, let $\widehat H = \operatorname{bdiag}\{\widehat H_1, \widehat H_2, \dots, \widehat H_M\}$, $S_{\mathrm{full}} = \{S_\mu, S_\alpha\}$, $W_{S_{\mathrm{full}}}(\beta^{(\bullet)}) = W_{S_\mu, S_\alpha}(\beta^{(\bullet)})$, and $H(\beta^{(\bullet)}) = \operatorname{bdiag}\{H_1(\beta^{(1)}), H_2(\beta^{(2)}), \dots, H_M(\beta^{(M)})\}$.
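To keep the rate statements below easy to parse, note that the effective sample sizes simply repackage the familiar high-dimensional rates; the following identities are immediate from the definitions:
$$\Big(\frac{1}{n^{\mathrm{eff}}_m}\Big)^{1/2} = \Big(\frac{s\log p}{n_m}\Big)^{1/2}, \qquad \Big(\frac{1}{n^{\mathrm{eff}}}\Big)^{1/2} = \Big(\frac{s(\log p + M)}{N}\Big)^{1/2}.$$
Thus, for instance, the $\ell_2$-rate in Condition 4 below is the usual single-study LASSO rate, while the rate $(1/n^{\mathrm{eff}})^{1/2}$ appearing in Theorems 1 and 2 is the multi-task rate based on all $N$ observations.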
We next introduce the compatibility condition and the irrepresentable condition.

Definition 1. (Compatibility Condition, $C_{\mathrm{comp}}$): The Hessian matrix $H(\beta^{(\bullet)})$ and the index set $S$ satisfy the compatibility condition with constant $t$ if there exists a constant $\varphi\{t, S, H(\beta^{(\bullet)})\} > 0$ such that for all $(\mu_\Delta^{\mathsf T}, \alpha_\Delta^{(\bullet)\mathsf T})^{\mathsf T} = (\mu_\Delta^{\mathsf T}, \alpha_\Delta^{(1)\mathsf T}, \dots, \alpha_\Delta^{(M)\mathsf T})^{\mathsf T} \in \mathcal C(t, S)$,
$$\big(\|\mu_\Delta\|_1 + \lambda_g\|\alpha^{(\bullet)}_\Delta\|_{2,1}\big)^2 \le \Big\{N^{-1}\sum_{m=1}^M n_m(\mu_\Delta + \alpha^{(m)}_\Delta)^{\mathsf T} H_m(\beta^{(m)})(\mu_\Delta + \alpha^{(m)}_\Delta)\Big\}\,|S|\,\big/\,\varphi^2\{t, S, H(\beta^{(\bullet)})\},$$
where $\mathcal C(t, S) = \{(u^{\mathsf T}, v^{(\bullet)\mathsf T})^{\mathsf T} = (u^{\mathsf T}, v^{(1)\mathsf T}, \dots, v^{(M)\mathsf T})^{\mathsf T} : v^{(1)} + \dots + v^{(M)} = \mathbf 0,\ \|u_{S^c}\|_1 + \lambda_g\|v^{(\bullet)}_{S^c}\|_{2,1} \le t(\|u_S\|_1 + \lambda_g\|v^{(\bullet)}_S\|_{2,1})\}$ for any $t$ and $S$, and $\varphi\{t, S, H(\beta^{(\bullet)})\}$ represents the compatibility constant of $H(\beta^{(\bullet)})$ on the set $S$.

Definition 2. (Irrepresentable Condition, $C_{\mathrm{Irrep}}$): The design matrix $W(\beta^{(\bullet)})$ satisfies the irrepresentable condition on $S_{\mathrm{full}} = (S_\mu, S_\alpha)$ with parameter $\epsilon > 0$ if, for all $j \in S_\mu^c$ and $j' \in S_\alpha^c$,
$$\sup_{u \in G_{S_\mu},\, v^{(\bullet)} \in G_{S_\alpha}}\Big|(u^{\mathsf T}, \lambda_g v^{(\bullet)\mathsf T})\big[W_{S_{\mathrm{full}}}^{\mathsf T}(\beta^{(\bullet)})W_{S_{\mathrm{full}}}(\beta^{(\bullet)})\big]^{-1}W_{S_{\mathrm{full}}}^{\mathsf T}(\beta^{(\bullet)})\,W_{j, \varnothing}(\beta^{(\bullet)})\Big| \le 1 - \epsilon;$$
$$\sup_{u \in G_{S_\mu},\, v^{(\bullet)} \in G_{S_\alpha}}\Big\|(u^{\mathsf T}, \lambda_g v^{(\bullet)\mathsf T})\big[W_{S_{\mathrm{full}}}^{\mathsf T}(\beta^{(\bullet)})W_{S_{\mathrm{full}}}(\beta^{(\bullet)})\big]^{-1}W_{S_{\mathrm{full}}}^{\mathsf T}(\beta^{(\bullet)})\,W_{\varnothing, j'}(\beta^{(\bullet)})\Big\|_{\widetilde T} \le \lambda_g(1 - \epsilon),$$
where
$$G_{S_\mu} = \Big\{u = (u_1, \dots, u_{|S_\mu|})^{\mathsf T} \in \mathbb R^{|S_\mu|} : \max_{j \in [|S_\mu|]}|u_j| \le 1\Big\},$$
$$G_{S_\alpha} = \Big\{v^{(\bullet)} = (v^{(2)\mathsf T}, \dots, v^{(M)\mathsf T})^{\mathsf T} \in \mathbb R^{(M-1)|S_\alpha|} : \max_{j \in [|S_\alpha|]}\|v_j\|_{\widetilde T} \le 1,\ v_j = (v_j^{(2)}, \dots, v_j^{(M)})^{\mathsf T}\Big\}$$
represent the sub-gradient spaces corresponding to $S_\mu$ and $S_\alpha$ of the mixture penalty.

4.2 Risk Bounds

To establish the theoretical properties of the SHIR estimator in terms of estimation and prediction risks, we first introduce some sufficient conditions. Throughout the following analysis, we assume that $n_m = \Theta(N/M)$ for $m \in [M]$ and $\lambda_g = \Theta(M^{-1/2})$.

Condition 1.
There exist $\delta = \omega\{(1/n^{\mathrm{eff}}_m)^{1/2}\}$ and $\varphi_0 = \Theta(1)$ such that for all $\beta^{(\bullet)} = (\beta^{(1)\mathsf T}, \dots, \beta^{(M)\mathsf T})^{\mathsf T}$ satisfying $\|\beta^{(m)} - \bar\beta^{(m)}\|_2 < \delta$ and all $t = \Theta(1)$, the Hessian matrix $H(\beta^{(\bullet)})$ and the index set $S$ satisfy $C_{\mathrm{comp}}$ (Definition 1) with compatibility constant $\varphi\{t, S, H(\beta^{(\bullet)})\} \ge \varphi_0$.

Condition 2.
For all $m \in [M]$, $X_{ij}^{(m)} f'(\bar\beta^{(m)\mathsf T}X_i^{(m)}, Y_i^{(m)})$ is sub-Gaussian, i.e., there exists a positive constant $\kappa = \Theta(1)$ such that $\|X_{ij}^{(m)} f'(\bar\beta^{(m)\mathsf T}X_i^{(m)}, Y_i^{(m)})\|_{\psi_2} < \kappa$. In addition, there exists $B > 0$ such that $\max_{m \in [M],\, i \in [n_m]}\|X_i^{(m)}\|_\infty \le B$.

Condition 3.
There exists a positive constant $C_L = \Theta(1)$ such that $|f''(a, y) - f''(b, y)| \le C_L|a - b|$ for all $a, b \in \mathbb R$.

Remark 5.
Condition 1 is in the same spirit as the restricted strong convexity (RSC) condition for general decomposable penalties introduced in Negahban et al. (2012); in the Supplementary Material we further show that Condition 1 is not stronger than the RSC condition. The first part of Condition 2 controls the tail behavior of $X_{ij}^{(m)} f'(a, y)$ so that the random error $\nabla\widehat{\mathcal L}_m(\bar\beta^{(m)})$ can be bounded properly and the method benefits from the group sparsity of $\bar\alpha^{(\bullet)}$ (Huang and Zhang, 2010). This condition is easily verified for sub-Gaussian designs and a broad class of models, e.g., the logistic model. In addition, the condition $\max_{m \in [M],\, i \in [n_m]}\|X_i^{(m)}\|_\infty \le B$ holds with $B = \Theta(1)$ for bounded designs and with $B = \Theta[\{\log(pN)\}^{1/2}]$ for sub-Gaussian designs. Condition 3 assumes smoothness of $f$ so that the empirical Hessian $\nabla^2\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})$ is close enough to $\nabla^2\widehat{\mathcal L}_m(\bar\beta^{(m)})$, and the term $\widehat g_m = \widehat H_m\widehat\beta^{(m)}_{\mathrm{LASSO}} - \nabla\widehat{\mathcal L}_m(\widehat\beta^{(m)}_{\mathrm{LASSO}})$ is close enough to $\nabla^2\widehat{\mathcal L}_m(\bar\beta^{(m)})\bar\beta^{(m)} - \nabla\widehat{\mathcal L}_m(\bar\beta^{(m)})$.

We further assume in Condition 4 that the local LASSO estimators achieve the minimax optimal error rates up to a logarithmic factor (Raskutti et al., 2011).
Condition 4.
The local estimators satisfy
$$\max_{m \in [M]}\|\widehat\beta^{(m)}_{\mathrm{LASSO}} - \bar\beta^{(m)}\|_1 = O_P\{(s/n^{\mathrm{eff}}_m)^{1/2}\}, \qquad \max_{m \in [M]}\|\widehat\beta^{(m)}_{\mathrm{LASSO}} - \bar\beta^{(m)}\|_2 = O_P\{(1/n^{\mathrm{eff}}_m)^{1/2}\},$$
and $\max_{m \in [M]}\|X^{(m)}(\widehat\beta^{(m)}_{\mathrm{LASSO}} - \bar\beta^{(m)})\|_2 = O_P\{(n_m/n^{\mathrm{eff}}_m)^{1/2}\}$.

Remark 6.
An extensive literature, including Van de Geer et al. (2008), Bühlmann and Van De Geer (2011) and Negahban et al. (2012), has established a complete theoretical framework for this property. See, for example, Negahban et al. (2012), where Condition 4 is proved under a sub-Gaussian design $X^{(m)}$ and a strongly convex loss function $f$.

Next, we present the risk bounds for SHIR, namely the prediction risk $\|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{SHIR}} - \bar\beta^{(\bullet)})\|_2$ and the estimation risk $\|\widehat\mu_{\mathrm{SHIR}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{SHIR}} - \bar\alpha^{(\bullet)}\|_{2,1}$.

Theorem 1. (Risk bounds for SHIR)
Assume Conditions 1–4 hold and $n_m = \Theta(N/M)$ for all $m \in [M]$. There exist $\lambda = \Theta\{1/(s\,n^{\mathrm{eff}})^{1/2} + B/n^{\mathrm{eff}}_m\}$ and $\lambda_g = \Theta(M^{-1/2})$ such that
$$\|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{SHIR}} - \bar\beta^{(\bullet)})\|_2 = O_P\{(1/n^{\mathrm{eff}})^{1/2} + Bs/n^{\mathrm{eff}}_m\};$$
$$\|\widehat\mu_{\mathrm{SHIR}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{SHIR}} - \bar\alpha^{(\bullet)}\|_{2,1} = O_P\{(s/n^{\mathrm{eff}})^{1/2} + Bs/n^{\mathrm{eff}}_m\}.$$

The second term in each of the upper bounds of Theorem 1 is the error incurred by aggregating derived data instead of raw data. These terms are asymptotically negligible under the sparsity condition $s = o(\{N(\log p + M)\}^{1/2}/[BM\log p])$. In that case, $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ achieves the same error rate as the ideal estimator $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ obtained by combining raw data, as shown in the following section, and is nearly rate optimal.

4.3 Asymptotic Equivalence in Prediction and Estimation
Under suitable sparsity assumptions, we show the asymptotic equivalence, with respect to prediction and estimation risks, of SHIR and the ideal IPD estimator $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$, equivalently defined as
$$(\widehat\mu_{\mathrm{IPD}}, \widehat\alpha^{(\bullet)}_{\mathrm{IPD}}) = \operatorname{argmin}_{(\mu, \alpha^{(\bullet)})}\ \widehat{\mathcal L}(\mu, \alpha^{(\bullet)}) + \tilde\lambda\,\rho_1(\mu, \alpha^{(\bullet)}; \lambda_g), \quad \text{subject to } \mathbf 1_{M\times 1}^{\mathsf T}\alpha_j = 0, \ j \in [p],$$
where $\tilde\lambda$ is a tuning parameter.

Theorem 2. (Asymptotic Equivalence)
Under the assumptions of Theorem 1, suppose $s = o(\{N(\log p + M)\}^{1/2}/[BM\log p])$. There exist $\tilde\lambda = \Theta\{1/(s\,n^{\mathrm{eff}})^{1/2}\}$ and $\lambda_g = \Theta(1/M^{1/2})$ such that the IPD estimator $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ satisfies
$$\|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{IPD}} - \bar\beta^{(\bullet)})\|_2 = O_P\{(1/n^{\mathrm{eff}})^{1/2}\}; \qquad \|\widehat\mu_{\mathrm{IPD}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{IPD}} - \bar\alpha^{(\bullet)}\|_{2,1} = O_P\{(s/n^{\mathrm{eff}})^{1/2}\}.$$
Furthermore, for some $\lambda_\Delta = o(\tilde\lambda)$, the IPD estimator and the SHIR estimator defined by (5) with $\lambda = \tilde\lambda + \lambda_\Delta$ are equivalent in prediction and estimation, in the sense that
$$\|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{SHIR}} - \bar\beta^{(\bullet)})\|_2 \le \|\widehat H^{1/2}(\widehat\beta^{(\bullet)}_{\mathrm{IPD}} - \bar\beta^{(\bullet)})\|_2 + o_P\{(1/n^{\mathrm{eff}})^{1/2}\};$$
$$\|\widehat\mu_{\mathrm{SHIR}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{SHIR}} - \bar\alpha^{(\bullet)}\|_{2,1} \le \|\widehat\mu_{\mathrm{IPD}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{IPD}} - \bar\alpha^{(\bullet)}\|_{2,1} + o_P\{(s/n^{\mathrm{eff}})^{1/2}\}.$$

Theorem 2 demonstrates the asymptotic equivalence between $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ and $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ with respect to estimation and prediction risks, and hence implies a strict optimality of SHIR. Specifically, when $s = o(\{N(\log p + M)\}^{1/2}/[BM\log p])$, the excess risks of $\widehat\beta^{(\bullet)}_{\mathrm{SHIR}}$ relative to $\widehat\beta^{(\bullet)}_{\mathrm{IPD}}$ are of smaller order than the risks of IPD itself, i.e., the minimax optimal rates (up to a logarithmic factor) for multi-task learning of high dimensional sparse models (Lounici et al., 2011; Huang and Zhang, 2010). A similar equivalence result was given in Theorem 4.8 of Battey et al. (2018) for the truncated debiased LASSO estimator. To the best of our knowledge, however, such results have not previously been established for LASSO-type estimators obtained directly from a sparse regression model. Compared with Battey et al. (2018), our result does not require the Hessian matrix $\widehat H_m$ to have a sparse inverse, since we do not rely on debiasing $\widehat\beta^{(m)}_{\mathrm{LASSO}}$. Consequently, the proof of Theorem 2 is substantially more involved than those in Battey et al. (2018). The technical difficulties are briefly discussed in Section 7, and the new technical tools are developed and presented in detail in the Supplementary Material.
4.4 Comparison with Debiasing-based Approaches

To compare with existing approaches, we next consider an extension of the debiased-LASSO-based procedures of Lee et al. (2017) and Battey et al. (2018) that incorporates between-study heterogeneity. Specifically, at the $m$-th site, we derive the debiased LASSO estimator $\widehat\beta^{(m)}_{\mathrm{dLASSO}}$ as defined in (6), where $\widehat\Theta_m$ is obtained via nodewise LASSO (Javanmard and Montanari, 2014), and send it to the central site. At the central site, compute $\widehat\mu_{\mathrm{dLASSO}} = M^{-1}\sum_{m=1}^M \widehat\beta^{(m)}_{\mathrm{dLASSO}}$, $\widehat\alpha^{(m)}_{\mathrm{dLASSO}} = \widehat\beta^{(m)}_{\mathrm{dLASSO}} - \widehat\mu_{\mathrm{dLASSO}}$ and $\widehat\alpha^{(\bullet)}_{\mathrm{dLASSO}} = (\widehat\alpha^{(1)\mathsf T}_{\mathrm{dLASSO}}, \dots, \widehat\alpha^{(M)\mathsf T}_{\mathrm{dLASSO}})^{\mathsf T}$. Following Lee et al. (2017) and Battey et al. (2018), the final estimators of $\mu$ and $\alpha$ are obtained by thresholding $\widehat\mu_{\mathrm{dLASSO}}$ and $\widehat\alpha^{(\bullet)}_{\mathrm{dLASSO}}$, namely $\widehat\mu_{\mathrm{L\&B}} = \mathcal T_\mu(\widehat\mu_{\mathrm{dLASSO}}; \tau)$ and $\widehat\alpha^{(\bullet)}_{\mathrm{L\&B}} = \mathcal T_\alpha(\widehat\alpha^{(\bullet)}_{\mathrm{dLASSO}}; \tau)$, where
$$\mathcal T_\mu(\mu; \tau) = \{\mu_1, \mu_2^{h+}(\tau), \dots, \mu_p^{h+}(\tau)\}^{\mathsf T} \ \text{or}\ \{\mu_1, \mu_2^{s+}(\tau), \dots, \mu_p^{s+}(\tau)\}^{\mathsf T},$$
$$\mathcal T_\alpha(\alpha^{(\bullet)}; \tau) = \operatorname{vec}\{[\alpha_1, \alpha_2^{h+}(\tau), \dots, \alpha_p^{h+}(\tau)]^{\mathsf T}\} \ \text{or}\ \operatorname{vec}\{[\alpha_1, \alpha_2^{s+}(\tau), \dots, \alpha_p^{s+}(\tau)]^{\mathsf T}\};$$
for any vector $x = (x_1, \dots, x_d)^{\mathsf T}$ and constant $\tau$, $x^{h+} = x\,I(\|x\|_2 > \tau)$ and $x^{s+} = x(1 - \|x\|_2^{-1}\tau)\,I(\|x\|_2 > \tau)$ respectively denote the hard- and soft-thresholded versions of $x$, and $\operatorname{vec}(A)$ vectorizes the matrix $A$ by column.

The error rates of $\{\widehat\mu_{\mathrm{L\&B}}, \widehat\alpha^{(\bullet)}_{\mathrm{L\&B}}\}$, or of the corresponding $\widehat\beta^{(\bullet)}_{\mathrm{L\&B}}$, can largely be derived by extending Theorem 21 of Lee et al. (2017) and Theorem 4.8 of Battey et al. (2018). We outline the results below and provide additional details in the Supplementary Material. Denote $\bar H_m(\beta^{(m)}) = \mathbb E[H_m(\beta^{(m)})]$, $\bar H_m = \bar H_m(\bar\beta^{(m)})$, $\bar\Theta_m = \{\bar\theta^m_{j\ell}\}_{p\times p} = \bar H_m^{-1}$, and its row-wise sparsity level $s_1 = \max_{m \in [M],\, j \in [p]}|\{\ell \ne j : \bar\theta^m_{j\ell} \ne 0\}|$. Then, in analogy to Theorem 1, under the same regularity conditions as for SHIR, one can obtain
$$\|\widehat\mu_{\mathrm{L\&B}} - \bar\mu\|_1 + \lambda_g\|\widehat\alpha^{(\bullet)}_{\mathrm{L\&B}} - \bar\alpha^{(\bullet)}\|_{2,1} = O_P\{(s/n^{\mathrm{eff}})^{1/2} + B(s + s_1)/n^{\mathrm{eff}}_m\}, \qquad (7)$$
where $B$ is as defined in Condition 2. Compared with the error rates of SHIR in Theorem 1, the debiased-LASSO-based estimator shares the same first term, $(s/n^{\mathrm{eff}})^{1/2}$, which represents the error of the individual-level empirical process. Its second term, incurred by data aggregation, can however be larger than that of SHIR when $s_1 = \omega(s)$. In many practical settings, $s_1$ can be excessively large due to complex intrinsic relationships among the covariates, which inflates the error rates of $\widehat\mu_{\mathrm{L\&B}}$ and $\widehat\alpha^{(\bullet)}_{\mathrm{L\&B}}$. This phenomenon is further validated in our simulation studies and the real example in the following sections.
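For illustration, the aggregation-and-thresholding step just described can be sketched as follows; this is a simplified rendering (our own helper names) that applies a common threshold $\tau$ to both the average and heterogeneity components:

```python
import numpy as np

def aggregate_debiased(beta_dlasso, tau, soft=True):
    """Heterogeneity-aware aggregation of local debiased LASSO estimates.

    beta_dlasso : (M, p) array, one debiased estimate per site; tau :
    threshold. Implements the mu/alpha split with hard or soft (group)
    thresholding; the intercept column (index 0) is never thresholded.
    """
    M, p = beta_dlasso.shape
    mu = beta_dlasso.mean(axis=0)          # simple average across sites
    alpha = beta_dlasso - mu               # per-site deviations from the mean
    mu_out, alpha_out = mu.copy(), alpha.copy()
    for j in range(1, p):
        if soft:
            mu_out[j] = np.sign(mu[j]) * max(abs(mu[j]) - tau, 0.0)
            norm_j = np.linalg.norm(alpha[:, j])
            scale = max(1.0 - tau / norm_j, 0.0) if norm_j > 0 else 0.0
            alpha_out[:, j] = alpha[:, j] * scale
        else:                              # hard thresholding
            mu_out[j] = mu[j] if abs(mu[j]) > tau else 0.0
            if np.linalg.norm(alpha[:, j]) <= tau:
                alpha_out[:, j] = 0.0
    return mu_out, alpha_out
```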
Remark 7.

When the model for $Y$ is correctly specified, i.e., $\mathbb E\{f'(\bar\beta^{(m)\mathsf T}X_i^{(m)}, Y_i^{(m)}) \mid X_i^{(m)}\} = 0$, the risk bound depending on the exact sparsity level $s_1$ can potentially be relaxed to one depending on the approximate sparsity level of $\bar\Theta_m$ (Ma et al., 2020; Liu et al., 2020), i.e., its row-wise $\ell_q$-norm with $0 < q \le 1$. Such a sparsity condition seems more reasonable, but it relies on the correctness of the model for $Y$, another strong assumption that cannot be easily verified in practice. In fact, our simulation studies below demonstrate that the estimators $\widehat\mu_{\mathrm{L\&B}}$ and $\widehat\alpha^{(\bullet)}_{\mathrm{L\&B}}$ no longer perform well in settings with both non-sparse $\bar\Theta_m$ and misspecified models.

In addition, SHIR can be more efficient than the debiasing-based strategy even when the impact of the additional error term depending on $s_1$ in (7) is asymptotically negligible. Consider the setting where all $\bar\beta^{(m)}$'s are the same, i.e., $\bar\beta^{(m)} = \bar\beta$, and $p$ is moderate or small, so that regularization is unnecessary and the maximum likelihood estimator (MLE) of $\bar\beta$ is feasible and asymptotically Gaussian. In this case, SHIR can be viewed as inverse-variance-weighted estimation with asymptotic variance $\Sigma_{\mathrm{SHIR}} = \{\sum_{m=1}^M n_m\bar\Theta_m^{-1}\}^{-1}$, while the debiasing-based approach yields an estimator with variance $\Sigma_{\mathrm{L\&B}} = M^{-2}\sum_{m=1}^M n_m^{-1}\bar\Theta_m$. It is not hard to show that $\Sigma_{\mathrm{SHIR}} \preceq \Sigma_{\mathrm{L\&B}}$, where equality holds only if the $\bar\Theta_m$'s are suitably proportional. Thus, SHIR is strictly more efficient than the debiasing-based approach in the low dimensional setting with heterogeneous $\bar\Theta_m$, which commonly arises in meta-analysis since the distributions of the $X^{(m)}$'s are typically heterogeneous across the local sites. In the high-dimensional setting, SHIR is similarly expected to benefit from the inverse-variance-weight construction, and our simulation results in Section 5 support this point.
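A one-dimensional calculation makes the comparison transparent. With scalar parameters and $\bar\Theta_m = v_m > 0$, the Cauchy–Schwarz inequality $(\sum_m n_m/v_m)(\sum_m v_m/n_m) \ge M^2$ gives
$$\Sigma_{\mathrm{SHIR}} = \Big(\sum_{m=1}^M \frac{n_m}{v_m}\Big)^{-1} \ \le\ \frac{1}{M^2}\sum_{m=1}^M \frac{v_m}{n_m} = \Sigma_{\mathrm{L\&B}},$$
with equality if and only if $v_m/n_m$ is constant in $m$.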
4.5 Sparsistency

In this section, we present theoretical results on the variable selection consistency of SHIR. We begin with some additional sufficient conditions for the sparsistency result.

Condition 5.
There exist $\delta = \omega\{(1/n^{\mathrm{eff}}_m)^{1/2}\}$ and $C_{\min} = \Theta(1)$ such that for all $\beta^{(m)}$ satisfying $\|\beta^{(m)} - \bar\beta^{(m)}\|_2 < \delta$, $\Lambda_{\min}\{H_{m, S}(\beta^{(m)})\} > C_{\min}$.

Condition 6.
There exist $\delta = \omega\{(1/n^{\mathrm{eff}}_m)^{1/2}\}$ and $\epsilon = \Theta(1)$ such that for all $\beta^{(\bullet)} = (\beta^{(1)\mathsf T}, \dots, \beta^{(M)\mathsf T})^{\mathsf T}$ satisfying $\|\beta^{(m)} - \bar\beta^{(m)}\|_2 < \delta$, the weighted design matrix $W(\beta^{(\bullet)})$ satisfies the irrepresentable condition $C_{\mathrm{Irrep}}$ (Definition 2) on $S_{\mathrm{full}}$ with constant $\epsilon$.

Condition 7.
Let $\nu = \min\{\min_{j \in S_\mu}|\bar\mu_j|,\ \min_{j \in S_\alpha}\|\bar\alpha_j\|_2/M^{1/2}\}$. For the $\epsilon$ defined in Condition 6, $[1/(n^{\mathrm{eff}})^{1/2} + Bs^{1/2}/n^{\mathrm{eff}}_m]/(\nu\epsilon) \to 0$ as $N \to \infty$.

Remark 8.
Conditions 5–7 are sparsistency assumptions similar to those of Zhao and Yu (2006) and Nardi et al. (2008). Condition 5 requires the eigenvalues of the covariance of the weighted design matrix restricted to $S$ to be bounded away from zero, so that its inverse behaves well. Condition 6 adapts the commonly used irrepresentable condition (Zhao and Yu, 2006) to our mixture-penalty setting: roughly speaking, it requires that the weighted design corresponding to $S_{\mathrm{full}}$ cannot be represented well by the weighted design for $S_{\mathrm{full}}^c$. Compared with Nardi et al. (2008), $C_{\mathrm{Irrep}}$ is less intuitive but essentially weaker; we verify this condition for several common Hessian structures and compare it with Zhao and Yu (2006) in the Supplementary Material. Condition 7 requires the minimum magnitude of the coefficients to be large enough for the non-zero coefficients to be detectable. As will be discussed later in this section, Condition 7 is essentially weaker than the corresponding minimum-magnitude requirement in existing results based on the local LASSO (Zhao and Yu, 2006), because we leverage the group structure of the $\bar\beta^{(m)}$'s to improve the efficiency of variable selection.

Theorem 3. (Sparsistency)
Let $\widehat S_\mu = \{j : \widehat\mu_{\mathrm{SHIR}, j} \ne 0\}$ and $\widehat S_\alpha = \{j : \|\widehat\alpha_{\mathrm{SHIR}, j}\|_2 \ne 0\}$, and denote the events $\mathcal O_\mu = \{\widehat S_\mu = S_\mu\}$ and $\mathcal O_\alpha = \{\widehat S_\alpha = S_\alpha\}$. Under Conditions 1–7, if $\lambda = o(\nu/s^{1/2})$ and $\lambda = \epsilon^{-1}\,\omega\{1/(s\,n^{\mathrm{eff}})^{1/2} + B/n^{\mathrm{eff}}_m\}$, then $P(\mathcal O_\mu \cap \mathcal O_\alpha) \to 1$ as $N \to \infty$.

Remark 9.

Condition 7 guarantees the existence of the tuning parameter $\lambda$ used in Theorem 3. The admissible range of $\lambda$ in Theorem 3 is similar to those in Zhao and Yu (2006) and Nardi et al. (2008), in the sense that $\lambda$ must be smaller than the minimum signal magnitude divided by $s^{1/2}$, so that the true positives are selected and dominate the random noise. In comparison with Zhao and Yu (2006), our $\lambda$ additionally accounts for the aggregation noise represented by $B/n^{\mathrm{eff}}_m$.

Theorem 3 establishes the sparsistency of the SHIR estimator. When $s = o(\{N(\log p + M)\}^{1/2}/[BM\log p])$ as assumed in Theorem 2, Condition 7 reduces to $\nu\epsilon = \omega\{1/(n^{\mathrm{eff}})^{1/2}\}$, the corresponding sparsistency assumption for the IPD estimator. In contrast, a similar condition, which can be as strong as $\nu\epsilon = \omega\{1/(n^{\mathrm{eff}}_m)^{1/2}\}$, is required for the local LASSO estimator (Zhao and Yu, 2006). Compared with the local estimator, our integrative procedure can therefore detect weaker signals under suitable sparsity assumptions; in this sense, the structure of $\bar\beta^{(\bullet)}$ improves selection efficiency over the local LASSO. Unlike existing work, we must carefully handle the mixture penalty $\rho_1$ and the aggregation noise of SHIR, which introduce additional technical difficulties into the theoretical analysis.

In both Theorems 2 and 3, we allow $M$, the number of studies, to diverge while preserving the theoretical properties. The growth rate of $M$ is allowed to be $M = \min\big(o\{(N/\log p)^{1/2}/(Bs)\},\ o\{N/(Bs\log p)^2\}\big)$ for the equivalence result in Theorem 2, and $M = \min\big(o\{N\epsilon\nu/(Bs^{1/2}\log p)\},\ o\{N(\epsilon\nu)^2/s\}\big)$ for the sparsistency result in Theorem 3.

5 Simulation Studies

We present simulation results in this section to evaluate the performance of our proposed SHIR estimator and compare it with several other approaches. We let $M \in \{4, 8\}$ and $p \in \{100, 800, 1500\}$, and set $n_m = n = 400$ for each $m$. For each configuration, we summarize results based on 200 simulated datasets. We consider three data generating mechanisms:
(i) Sparse precision and correctly specified model (strong signal): Across all studies, we let $S_\mu$ index the non-zero entries of $\bar\mu$ and $S_\alpha$ the non-zero rows of $\bar\alpha^{(\bullet)}$, two overlapping but non-identical sets of six covariates each, with $S = S_\mu \cup S_\alpha$ and $S^c = [p]\setminus S$. For each $m \in [M]$, we generate $X^{(m)}$ from a zero-mean multivariate normal distribution with covariance $C^{(m)}$, where $C^{(m)}_{S^cS^c} = R_{p-|S|}(r_m)$, $C^{(m)}_{S^cS} = R_{p-|S|}(r_m)\Gamma_{p-|S|, |S|}(r_m, 15)$, and $C^{(m)}_{SS} = I_{|S|} + \Gamma^{\mathsf T}_{p-|S|, |S|}(r_m, 15)R_{p-|S|}(r_m)\Gamma_{p-|S|, |S|}(r_m, 15)$. Here $I_q$ denotes the $q\times q$ identity matrix; $R_q(r)$ denotes the $q\times q$ AR(1) correlation matrix with correlation coefficient $r$; $\Gamma_{q_1, q_2}(r, s)$ denotes the $q_1\times q_2$ matrix in which each column has $s$ randomly picked entries set to $r$ or $-r$ at random and the remaining entries equal to $0$; and $r_m = c_r(m-1)/M + 0.15$ for a positive constant $c_r$, so that the correlation strength varies across studies. Given $X^{(m)}$, we generate $Y^{(m)}$ from the logistic model $P(Y^{(m)} = 1 \mid X^{(m)}) = \operatorname{expit}\{X^{(m)\mathsf T}_{S_\mu}\bar\mu_{S_\mu} + X^{(m)\mathsf T}_{S_\alpha}\bar\alpha^{(m)}_{S_\alpha}\}$ with $\bar\mu_{S_\mu} = c_\mu(1, -1, 1, -1, 1, -1)^{\mathsf T}$ and $\bar\alpha^{(m)}_{S_\alpha} = c_\alpha(-1)^m(1, 1, 1, -1, -1, -1)^{\mathsf T}$ for positive constants $c_\mu$ and $c_\alpha$.
(ii) Sparse precision and correctly specified model (weak signal): We use the same data generating mechanism as in (i), except with relatively weak signals, i.e., smaller values of the constants $c_\mu$ and $c_\alpha$ in $\bar\mu_{S_\mu} = c_\mu(1, -1, 1, -1, 1, -1)^{\mathsf T}$ and $\bar\alpha^{(m)}_{S_\alpha} = c_\alpha(-1)^m(1, 1, 1, -1, -1, -1)^{\mathsf T}$.
(iii) Dense precision and wrongly specified model: Let $S = \{1, 2, \dots, 5\}$, $S' = \{6, \dots, 50\}$, and $S'' = [p]\setminus(S \cup S')$. For each $m \in [M]$, we generate $X^{(m)}$ from a zero-mean multivariate normal distribution with covariance matrix $C^{(m)}$, where $C^{(m)}_{(S'\cup S'')(S'\cup S'')} = \operatorname{bdiag}\{R_{45}(r_m), R_{p-50}(r_m)\}$, $C^{(m)}_{SS''} = \mathbf 0$, $C^{(m)}_{S'S} = R_{45}(r_m)\Gamma_{45, 5}(r_m, 45)$, and $C^{(m)}_{SS} = I_5 + \Gamma^{\mathsf T}_{45, 5}(r_m, 45)R_{45}(r_m)\Gamma_{45, 5}(r_m, 45)$. Given $X^{(m)}$, we generate $Y^{(m)}$ from a logistic model with
$$P(Y^{(m)} = 1 \mid X^{(m)}) = \operatorname{expit}\Big[\sum_{j=1}^{5}\{0.25 + a(-1)^m\}\big\{X^{(m)}_j + b\,(X^{(m)}_j)^2\big\} + c\sum_{j=1}^{5} X^{(m)}_j X^{(m)}_{j+1}\Big]$$
for positive constants $a$, $b$ and $c$, so that the fitted linear logistic model is misspecified.

Across all settings, the distribution of $X^{(m)}$ and the model parameters of $Y^{(m)} \mid X^{(m)}$ differ across the $M$ sites, mimicking heterogeneity in both the covariates and the models. The heterogeneity of $X^{(m)}$ is driven by the study-specific correlation coefficient $r_m$ in its covariance matrix $C^{(m)}$. Under Settings (i) and (ii), the fitted logistic loss corresponds to the likelihood under a correctly specified model, with the supports of $\bar\mu$ and $\bar\alpha^{(m)}$ overlapping but not exactly the same. Under Setting (iii), the fitted loss corresponds to a misspecified model, but the true target parameter $\bar\beta^{(m)}$ remains approximately sparse, with only the first 5 elements relatively large, the next 45 close to zero, and the rest exactly zero. For each $j \in S$, the $j$-th column (excluding $j$ itself) of the precision matrix $\bar\Theta_m$ has on average 15 non-zero entries under Settings (i) and (ii), and 45 under Setting (iii). We therefore use Settings (i) and (ii) to examine the case of a sparse precision matrix on the active set, and Setting (iii) the case of a relatively dense precision matrix.
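For illustration, a simplified generator in the spirit of Setting (i) is sketched below; the plain AR(1) covariance, the default signal constants `c_mu` and `c_al`, and the exact form of `r_m` are stand-ins for the construction above rather than the precise specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_site(m, M, n=400, p=100, c_mu=0.5, c_al=0.3):
    """Generate (X, y) for one site, roughly in the spirit of Setting (i).

    Uses a plain AR(1) covariance for X instead of the full blockwise
    construction with the Gamma perturbation matrices; c_mu, c_al and the
    r_m formula below are illustrative stand-ins.
    """
    r_m = 0.3 * (m - 1) / M + 0.15             # site-specific correlation
    idx = np.arange(p)
    cov = r_m ** np.abs(np.subtract.outer(idx, idx))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    X[:, 0] = 1.0                               # intercept column
    beta = np.zeros(p)
    beta[1:7] += c_mu * np.array([1, -1, 1, -1, 1, -1])          # mean effects
    beta[4:10] += c_al * (-1) ** m * np.array([1, 1, 1, -1, -1, -1])  # site shift
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))    # expit link
    y = rng.binomial(1, prob)
    return X, y
```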
For each simulated dataset, we obtain the SHIR estimator as well as the following alternative estimators: (a) the IPD estimator $\widehat\beta^{(\bullet)}_{\mathrm{IPD}} = \operatorname{argmin}_{\beta^{(\bullet)}}\widehat Q(\beta^{(\bullet)})$; (b) the SMA estimator (He et al., 2016), applied after the sure independence screening procedure (Fan and Lv, 2008) that reduces the dimension to $n/(3\log n)$, as recommended by He et al. (2016); and (c) the debiasing-based estimator $\widehat\beta^{(\bullet)}_{\mathrm{L\&B}}$ introduced in Section 4.4, denoted by Debias$_{\mathrm{L\&B}}$. For $\widehat\beta^{(\bullet)}_{\mathrm{L\&B}}$, we used soft thresholding to be consistent with the penalty used by IPD, SMA and SHIR. We used the BIC to choose the tuning parameters for all methods.

In Figures 1 and 2, we present the relative average absolute estimation error (rAEE), based on $\|\widehat\beta^{(\bullet)} - \bar\beta^{(\bullet)}\|_1$, and the relative prediction error (rPE), based on $\|X(\widehat\beta^{(\bullet)} - \bar\beta^{(\bullet)})\|_2$, of each estimator compared to the IPD estimator. Consistent with the theoretical equivalence results, the SHIR estimator attains estimation and prediction accuracy very close to that of the idealized IPD estimator, with rPE and rAEE around 1.03 under Setting (i), 1.02 under Setting (ii), and 1.07 under Setting (iii). The SHIR estimator is substantially more efficient than SMA under all settings, with about a 50% reduction in both AEE and PE on average. This can be attributed to the improved performance of the local LASSO estimator $\widehat\beta^{(m)}_{\mathrm{LASSO}}$ over the MLE $\breve\beta^{(m)}$ in sparse models. The superior performance is more pronounced for large $p$, such as 800 and 1500, because the screening procedure does not select the active set well, especially in the presence of correlations among the covariates. Compared with Debias$_{\mathrm{L\&B}}$, SHIR also demonstrates gains in efficiency: relative to SHIR, Debias$_{\mathrm{L\&B}}$ has 20%–29% higher AEE and 27%–42% higher PE across the three settings. This is consistent with the theoretical results presented in Section 4.4, i.e., that SHIR has smaller error than Debias$_{\mathrm{L\&B}}$ due to the heterogeneous Hessians and the aggregation errors. In addition, compared with Setting (i), the excess error of Debias$_{\mathrm{L\&B}}$ is larger under Setting (iii), where the inverse Hessian $\bar\Theta_m$ is relatively dense and the model for $Y$ is misspecified; this is consistent with our conclusions in Section 4.4 and Remark 7.

In Figure 3, we summarize the average true positive rate (TPR) and false discovery rate (FDR) for recovering the support of $\bar\beta^{(\bullet)}$ under Settings (i) and (ii), where the logistic model is correctly specified. Again, SMA performed poorly under both settings, with either low TPR or high FDR. Both IPD and SHIR have good support recovery, with all TPRs above 0.91 and FDRs below 0.13 under the strong-signal setting, and all TPRs above 0.74 and FDRs below 0.05 under the weak-signal setting. IPD and SHIR attained similar TPRs and FDRs, with absolute differences less than 0.02 across all settings. Compared with IPD and SHIR, Debias$_{\mathrm{L\&B}}$ performs worse under both settings. Under Setting (i), the TPR of Debias$_{\mathrm{L\&B}}$ is consistently lower than that of SHIR by about 0.13, while its FDR is generally higher, except that for $p = 100$ Debias$_{\mathrm{L\&B}}$ attained a very low FDR due to over-shrinkage. Under the weak-signal Setting (ii) with $M = 4$, Debias$_{\mathrm{L\&B}}$ is substantially less powerful than SHIR in recovering the true signals, with TPR lower by as much as 0.52, while its average FDR is comparable to that of SHIR. When $M = 8$, Debias$_{\mathrm{L\&B}}$ attained TPR comparable to SHIR but generally substantially higher FDR.

6 Application to Coronary Artery Disease Phenotyping with EHR Data

Linking EHR data with biorepositories containing "-omics" information has expanded the opportunities for biomedical research (Kho et al., 2011). With the growing availability of these high-dimensional data, the bottleneck in clinical research has shifted from a paucity of biologic data to a paucity of high-quality phenotypic data. Accurately and efficiently annotating patients with disease characteristics among millions of individuals is a critical step in fulfilling the promise of using EHR data for precision medicine. Novel machine learning based phenotyping methods leveraging a large number of predictive features have improved the accuracy and efficiency of existing phenotyping approaches (Liao et al., 2015; Yu et al., 2015).

While the portability of phenotyping algorithms across multiple patient cohorts is of great interest, existing phenotyping algorithms are often developed and evaluated for a specific patient population. To investigate the portability issue and develop EHR phenotyping algorithms for coronary artery disease (CAD) useful for multiple cohorts, Liao et al. (2015) developed a CAD algorithm using a cohort of rheumatoid arthritis (RA) patients and applied the algorithm to other disease cohorts using EHR data from the Partners Healthcare System. Here, we perform an integrative analysis of multiple EHR disease cohorts to jointly develop algorithms for classifying CAD status in four disease cohorts: type 2 diabetes mellitus (DM), inflammatory bowel disease (IBD), multiple sclerosis (MS) and RA. Under the DataSHIELD constraint, the proposed SHIR method lets the data determine whether a single CAD phenotyping algorithm can perform well across the four disease cohorts or whether disease-specific algorithms are needed.

For algorithm training, clinical investigators manually curated gold-standard labels of CAD status, used as the response $Y$, for $n = 172$ DM patients, $n = 230$ IBD patients, $n = 105$ MS patients and $n = 760$ RA patients. There are a total of $p = 533$ candidate features, including codified features, narrative features extracted via natural language processing (NLP) (Zeng et al., 2006), and their two-way interactions. Examples of codified features include demographic information, lab results, medication prescriptions, and counts of International Classification of Diseases (ICD) codes and Current Procedural Terminology (CPT) codes. Since patients may lack certain lab measurements and this missingness is highly informative, we also create missing indicators for the lab measurements as additional features. Examples of NLP terms include mentions of CAD, current smoking (CSMO), non-smoking (NSMO) and CAD-related procedures. Since the count variables, such as the total number of CAD ICD codes, are zero-inflated and skewed, we apply the $\log(x + 1)$ transformation and
include $I(x > 0)$ as an additional feature for each count variable $x$.

For each cohort, we randomly select 50% of the observations to form the training set for developing the CAD algorithms and use the remaining 50% for validation. We trained CAD algorithms based on SHIR, Debias$_{\mathrm{L\&B}}$ and SMA. Since the true model parameters are unknown, we evaluate the methods by the prediction performance of the trained algorithms on the validation set. We consider several standard accuracy measures: the area under the receiver operating characteristic curve (AUC); the Brier score, defined as the mean squared residual on the validation data; and the F-scores at threshold values chosen to attain a false positive rate of 5% and of 10%, where the F-score is defined as the harmonic mean of the sensitivity and the positive predictive value. The standard errors of the estimated prediction performance measures are obtained by bootstrapping the validation data. We report only results based on tuning parameters selected with BIC, as in the simulation studies, noting that the results obtained with AIC are largely similar in terms of prediction performance. Furthermore, to verify the improvement obtained by combining the four datasets, we include the LASSO estimator fitted to each local dataset (Local) as a comparison.

In Table 1, we present the estimated coefficients for variables that received non-zero coefficients from at least one of the included methods. Interestingly, all integrative analysis methods set all heterogeneous coefficients to zero, suggesting that a single CAD algorithm can be used across all cohorts, although different intercepts were used for the different disease cohorts. The magnitudes of the coefficients from SHIR largely agree with the published algorithm, with the most important features being NLP mentions and ICD codes of CAD, as well as the total number of ICD codes, which serves as a measure of healthcare utilization. SMA set all variables to zero except age, non-smoking, and the NLP mentions and ICD codes of CAD, while Debias$_{\mathrm{L\&B}}$ has a support more similar to SHIR's.

The point estimates, along with their 95% bootstrap confidence intervals, of the accuracy measures are presented in Figure 4. The results suggest that SHIR has the best performance among all methods, on nearly all datasets and across all measures. Among the integrative methods, SMA and Debias$_{\mathrm{L\&B}}$ performed much worse than SHIR on all accuracy measures; for example, the AUC (95% confidence interval) of the CAD algorithm for the RA cohort trained via SHIR, SMA and Debias$_{\mathrm{L\&B}}$ is 0.93 (0.90, 0.95), 0.88 (0.84, 0.92) and 0.86 (0.82, 0.90), respectively. Compared with the local estimator, SHIR also performs substantially better: for example, the AUCs of SHIR and Local for the IBD cohort are 0.93 (0.88, 0.97) and 0.90 (0.84, 0.95), respectively. The difference between the integrative procedures and the local estimator is more pronounced for the DM cohort, with AUC around 0.95 for SHIR versus 0.90 for the local estimator trained on DM data only. The local estimator fails to produce an informative algorithm for the MS cohort due to the small size of its training set. These results again demonstrate the power of borrowing information across studies via integrative analysis.
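For completeness, the accuracy measures above can be computed from validation labels and predicted probabilities alone; the following sketch (our own helper, with approximate threshold handling) illustrates the AUC, the Brier score, and the F-score at a target false positive rate:

```python
import numpy as np

def validation_metrics(y, score, fpr_levels=(0.05, 0.10)):
    """AUC, Brier score, and F-scores at fixed false positive rates.

    y : (n,) binary validation labels; score : (n,) predicted probabilities.
    The F-score is the harmonic mean of sensitivity and positive predictive
    value (PPV); thresholds are chosen to attain each target FPR.
    """
    order = np.argsort(-score)
    y_sorted = y[order].astype(float)
    n_pos, n_neg = y_sorted.sum(), len(y) - y_sorted.sum()
    tp = np.cumsum(y_sorted)             # true positives along the ranking
    fp = np.cumsum(1.0 - y_sorted)       # false positives along the ranking
    tpr, fpr = tp / n_pos, fp / n_neg
    auc = np.trapz(tpr, fpr)             # area under the ROC curve
    brier = np.mean((y - score) ** 2)    # mean squared residual
    f_scores = {}
    for level in fpr_levels:
        k = min(np.searchsorted(fpr, level), len(fpr) - 1)
        sens, ppv = tpr[k], tp[k] / (k + 1.0)
        f_scores[level] = (2 * sens * ppv / (sens + ppv)) if sens + ppv else 0.0
    return auc, brier, f_scores
```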
Discussion

In this paper, we proposed a novel approach, SHIR, for integrative analysis of high dimensional data under the DataSHIELD framework, where only summary data are allowed to be transferred from the local sites to the central site, protecting the individual-level data. Note that the DataSHIELD framework is fundamentally different from the concept of differential privacy (Dwork, 2011). Differential privacy imposes a rigorous probabilistic definition on the privacy of a randomized data processing algorithm, requiring the raw data to be perturbed with random noise (Wasserman and Zhou, 2010), and it allows individual-level data to be shared and published for more general data analysis purposes. However, to ensure sufficient privacy, differential privacy has been shown to be too restrictive for high-dimensional regression and inference (Duchi et al., 2013). In comparison, the DataSHIELD framework is designed for regression analysis and is shown to work well under ultra-high dimensionality, with no efficiency loss compared with the ideal case in which the data can be pooled without privacy constraints.

Though motivated by the debiased LASSO, the SHIR approach does not need to obtain $\widehat{\Theta}_m$ or the debiased estimator $\widehat{\beta}^{(m)}_{\text{dLASSO}}$. As we demonstrated via both theoretical analyses and numerical studies, the SHIR estimator is considerably more efficient than estimators obtained by the debiasing-based strategies considered in the literature (Lee et al., 2017; Battey et al., 2018). The superior performance is even more pronounced when $\Theta_m$ is relatively dense. Our method also accommodates heterogeneity in the design matrices, as well as in the coefficients of the local sites, which is not adequately handled in the existing literature under the ultra-high dimensional regime. In terms of theoretical analysis, no existing work on distributed learning provides an estimation equivalence result comparable to Theorem 2 for regularized regression. The challenge in establishing the asymptotic equivalence in Theorem 2 arises because $\widehat{\beta}^{(\bullet)}_{\text{IPD}}$ and $\widehat{\beta}^{(\bullet)}_{\text{SHIR}}$ are not necessarily as sparse as the true coefficient $\beta^{(\bullet)}$. We overcome this challenge by comparing the two estimators in a more elaborate way, as detailed in the proof. Moreover, Theorem 3 ensures the sparsistency of SHIR under some commonly used conditions on deterministic designs. Benefiting from integrating the data, our approach requires a less restrictive assumption on the minimum strength of the coefficients than local estimation, under certain sparsity assumptions. Such a property is not readily available in the existing literature on distributed regression and the mixture penalty.

Our approach is also efficient in both computation and communication, as it solves a LASSO problem only once in each local site, without requiring the computation of $\widehat{\Theta}_m$ or debiasing, and needs only one round of communication. On the other hand, since there is no debiasing procedure in our method, SHIR cannot be directly used for hypothesis testing and confidence interval construction. Future work lies in developing statistical approaches for such purposes under DataSHIELD, high dimensionality and heterogeneity.

For the choice of penalty, in the current paper we focus primarily on the mixture penalty $\rho(\beta^{(\bullet)})$ defined in (1), because it can leverage the prior assumption of similar support and magnitude of $\beta^{(m)}$ across studies. Nevertheless, other penalty functions, such as those discussed in Section 2.1, can be incorporated into our framework provided that they can effectively leverage the prior knowledge.
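To give intuition for how such a mixture penalty encodes the prior of similar support and magnitude across studies, the sketch below evaluates one sparse-group-style form: an $\ell_1$ term on a shared component plus a row-wise group-$\ell_2$ term on study-specific deviations. This is an illustrative choice written under our own assumptions (including the centering used for identifiability), not necessarily the exact penalty (1) of the paper.

```python
import numpy as np

def mixture_penalty(beta, lam_mu, lam_alpha):
    """Evaluate a sparse-group-style mixture penalty on a p x M matrix
    `beta` whose m-th column holds the coefficients of study m.

    Decompose beta[j, m] = mu[j] + alpha[j, m]: mu is a shared component,
    alpha collects study-specific deviations. Centering alpha at the
    across-study mean is one common identifiability choice; in an actual
    estimation procedure mu and alpha are optimized jointly with the loss.
    """
    mu = beta.mean(axis=1)                           # shared component (p,)
    alpha = beta - mu[:, None]                       # deviations; rows sum to 0
    l1_shared = np.abs(mu).sum()                     # l1: sparse common support
    group_l2 = np.linalg.norm(alpha, axis=1).sum()   # row-wise group-l2: shrink heterogeneity
    return lam_mu * l1_shared + lam_alpha * group_l2
```

Because the group-$\ell_2$ term treats each coefficient's deviations across the $M$ studies as a single group, it shrinks entire rows of heterogeneity to zero jointly, which is the mechanism through which similar supports across studies can be leveraged.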
Techniques similar to those used for deriving the theoretical results of SHIR with the mixture penalty can be applied to other penalty functions, with some technical details varying with the choice of $\rho(\cdot)$. See Section A.4 of the Supplementary Material for further justification.

We assume $n_m = \Theta(N/M)$ in Theorems 1-3 and allow $M$ to grow slowly. These conditions seem reasonable in our application but could be violated in some large scale meta-analyses of high dimensional data. The extension to scenarios with larger $M$ is highly non-trivial and warrants further research. As a limitation of our theoretical analysis, the sparsity assumption in Theorem 2 may be somewhat strong for our real example. We note, however, that this assumption is comparable to the assumption required for the estimation equivalence result in Battey et al. (2018), and that the consistency result in Theorem 1 holds under milder sparsity assumptions. It remains of interest to explore whether the aggregation error can be reduced, and the theoretical equivalence property achieved, under weaker sparsity assumptions. One potential approach is to use multiple rounds of communication, as in Fan et al. (2019); a detailed analysis of this approach warrants future research.

Supplementary Material

The supplement provides justifications for Conditions 1 and 6, detailed proofs of Theorems 1-3, and theoretical analyses of SHIR for various penalty functions.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723.
Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z., et al. (2018). Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3):1352-1382.
Bhat, H. S. and Kumar, N. (2010). On the derivation of the Bayesian information criterion. School of Natural Sciences, University of California.
Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
Chen, X. and Xie, M.-g. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, pages 1655-1684.
Chen, Y., Dong, G., Han, J., Pei, J., Wah, B. W., and Wang, J. (2006). Regression cubes with lossless compression and aggregation. IEEE Transactions on Knowledge and Data Engineering, 18(12):1585-1599.
Cheng, X., Lu, W., and Liu, M. (2015). Identification of homogeneous and heterogeneous variables in pooled cohort studies. Biometrics, 71(2):397-403.
Doiron, D., Burton, P., Marcon, Y., Gaye, A., Wolffenbuttel, B. H., Perola, M., Stolk, R. P., Foco, L., Minelli, C., Waldenberger, M., et al. (2013). Data harmonization and federated analysis of population-based studies: the BioSHaRE project. Emerging Themes in Epidemiology, 10(1):12.
Duan, R., Boland, M. R., Liu, Z., Liu, Y., Chang, H. H., Xu, H., Chu, H., Schmid, C. H., Forrest, C. B., Holmes, J. H., et al. (2020). Learning from electronic health records across multiple sites: A communication-efficient and privacy-preserving distributed algorithm. Journal of the American Medical Informatics Association, 27(3):376-385.
Duan, R., Boland, M. R., Moore, J. H., and Chen, Y. (2019). ODAL: A one-shot distributed algorithm to perform logistic regressions on electronic health records data from multiple clinical sites. In PSB, pages 30-41. World Scientific.
Duchi, J. C., Jordan, M. I., and Wainwright, M. J. (2013). Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429-438. IEEE.
Dwork, C. (2011). Differential privacy. Encyclopedia of Cryptography and Security, pages 338-340.
Fan, J., Guo, Y., and Wang, K. (2019). Communication-efficient accurate statistical estimation. arXiv preprint arXiv:1906.04870.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849-911.
Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947-1975.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736.
Gaye, A., Marcon, Y., Isaeva, J., LaFlamme, P., Turner, A., Jones, E. M., Minion, J., Boyd, A. W., Newby, C. J., Nuotio, M.-L., et al. (2014). DataSHIELD: taking the analysis to the data, not the data to the analysis. International Journal of Epidemiology, 43(6):1929-1944.
Han, J. and Liu, Q. (2016). Bootstrap model aggregation for distributed statistical learning. In Advances in Neural Information Processing Systems, pages 1795-1803.
He, Q., Zhang, H. H., Avery, C. L., and Lin, D. (2016). Sparse meta-analysis with high-dimensional data. Biostatistics, 17(2):205-220.
Huang, C. and Huo, X. (2015). A distributed one-step estimator. arXiv preprint arXiv:1511.01443.
Huang, J. and Zhang, T. (2010). The benefit of group sparsity. The Annals of Statistics, 38(4):1978-2004.
Janková, J. and Van De Geer, S. (2016). Confidence regions for high-dimensional generalized linear models under sparsity. arXiv preprint arXiv:1610.01353.
Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869-2909.
Jones, E., Sheehan, N., Masca, N., Wallace, S., Murtagh, M., and Burton, P. (2012). DataSHIELD - shared individual-level analysis without sharing the data: a biostatistical perspective. Norsk Epidemiologi, 21(2).
Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668-681.
Kho, A. N., Pacheco, J. A., Peissig, P. L., Rasmussen, L., Newton, K. M., Weston, N., Crane, P. K., Pathak, J., Chute, C. G., Bielinski, S. J., et al. (2011). Electronic medical records for genetic research: results of the eMERGE consortium. Science Translational Medicine, 3(79):79re1-79re1.
Kim, Y., Kwon, S., and Choi, H. (2012). Consistent model selection criteria on high dimensions. Journal of Machine Learning Research, 13(Apr):1037-1057.
Lee, J. D., Liu, Q., Sun, Y., and Taylor, J. E. (2017). Communication-efficient sparse regression. Journal of Machine Learning Research, 18(5):1-30.
Li, W., Liu, H., Yang, P., and Xie, W. (2016). Supporting regularized logistic regression privately and efficiently. PLoS One, 11(6):e0156479.
Liao, K. P., Ananthakrishnan, A. N., Kumar, V., Xia, Z., Cagan, A., Gainer, V. S., Goryachev, S., Chen, P., Savova, G. K., Agniel, D., et al. (2015). Methods to develop an electronic medical record phenotype algorithm to compare the risk of coronary artery disease across 3 chronic disease cohorts. PLoS One, 10(8):e0136651.
Lin, D. and Zeng, D. (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97(2):321-332.
Liu, M., Xia, Y., Cai, T., and Cho, K. (2020). Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. arXiv preprint arXiv:2004.00816.
Liu, Q. and Ihler, A. T. (2014). Distributed estimation, information loss and exponential families. In Advances in Neural Information Processing Systems, pages 1098-1106.
Lounici, K., Pontil, M., Van De Geer, S., Tsybakov, A. B., et al. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164-2204.
Lu, C.-L., Wang, S., Ji, Z., Wu, Y., Xiong, L., Jiang, X., and Ohno-Machado, L. (2015). WebDISCO: a web service for distributed Cox model learning without patient-level data sharing. Journal of the American Medical Informatics Association, 22(6):1212-1219.
Ma, R., Tony Cai, T., and Li, H. (2020). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, pages 1-15.
Maity, S., Sun, Y., and Banerjee, M. (2019). Communication-efficient integrative regression in high-dimensions. arXiv preprint arXiv:1912.11928.
Minsker, S. (2019). Distributed statistical estimation and rates of convergence in normal approximation. Electronic Journal of Statistics, 13(2):5213-5252.
Nardi, Y., Rinaldo, A., et al. (2008). On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2:605-633.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B., et al. (2012). A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statistical Science, 27(4):538-557.
Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976-6994.
Tang, L., Zhou, L., and Song, P. X.-K. (2016). Method of divide-and-combine in regularized generalized linear models for big data. arXiv preprint arXiv:1611.06208.
Vaiter, S., Deledalle, C., Peyré, G., Fadili, J., and Dossal, C. (2012). The degrees of freedom of the group lasso. arXiv preprint arXiv:1205.1481.
Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166-1202.
Van de Geer, S. A. et al. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614-645.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press.
Wang, H. and Leng, C. (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102(479):1039-1048.
Wang, H., Li, B., and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671-683.
Wang, J., Kolar, M., Srebro, N., and Zhang, T. (2017). Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3636-3645. JMLR.org.
Wang, X., Peng, P., and Dunson, D. B. (2014). Median selection subset aggregation for parallel inference. In Advances in Neural Information Processing Systems, pages 2195-2203.
Wasserman, L. and Zhou, S. (2010). A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375-389.
Wolfson, M., Wallace, S. E., Masca, N., Rowe, G., Sheehan, N. A., Ferretti, V., LaFlamme, P., Tobin, M. D., Macleod, J., Little, J., et al. (2010). DataSHIELD: resolving a conflict in contemporary bioscience: performing a pooled analysis of individual-level data without sharing the data. International Journal of Epidemiology, 39(5):1372-1382.
Wu, Y., Jiang, X., Kim, J., and Ohno-Machado, L. (2012). Grid binary logistic regression (GLORE): building shared models without sharing data. Journal of the American Medical Informatics Association, 19(5):758-764.
Yu, S., Liao, K. P., Shaw, S. Y., Gainer, V. S., Churchill, S. E., Szolovits, P., Murphy, S. N., Kohane, I. S., and Cai, T. (2015). Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association, 22(5):993-1000.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67.
Zeng, Q. T., Goryachev, S., Weiss, S., Sordo, M., Murphy, S. N., and Lazarus, R. (2006). Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making, 6(1):30.
Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217-242.
Zhang, Y., Li, R., and Tsai, C.-L. (2010). Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association, 105(489):312-323.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541-2563.
Zhou, N. and Zhu, J. (2010). Group variable selection via a hierarchical lasso and its oracle property. arXiv preprint arXiv:1006.2871.
Zöller, D., Lenz, S., and Binder, H. (2018). Distributed multivariable modeling for signature development under data protection constraints. arXiv preprint arXiv:1803.00422.

Figure 1: Average estimation errors (AEE) of SHIR, Debias$_{L\&B}$ and SMA relative to those of IPD under different $M$, $p$ and data generation mechanisms (i)-(iii) introduced in Section 5. Panels: Sparse Precision and Correct Model (strong signal); Sparse Precision and Correct Model (weak signal); Dense Precision and Misspecified Model. The vertical axis is the relative AEE.
Figure 2: Prediction errors (PE) of SHIR, Debias$_{L\&B}$ and SMA relative to those of IPD under different $M$, $p$ and data generation mechanisms (i)-(iii) introduced in Section 5. Panels as in Figure 1. The vertical axis is the relative PE.
Figure 3: True positive rate and false discovery rate in recovering the support of $\beta^{(\bullet)}$ for IPD, SHIR, Debias$_{L\&B}$ and SMA, under different $M$, $p$ and the data generation mechanisms (i) (correct model with) strong signal and (ii) weak signal introduced in Section 5. Panels: Original Coefficients (strong signal) and Original Coefficients (weak signal); the vertical axes are the true positive rate and the false discovery rate.
Table 1: Estimated coefficients of the variables that received non-zero coefficients from at least one of the included methods. A : B denotes the interaction term of variables A and B. The $\log(x + 1)$ transformation is taken on the count data and the covariates are normalized.

Variable                                              Debias$_{L\&B}$    SHIR     SMA
Prescription code of statin                                      0.14    0.07       0
Age                                                              0.09    0.26    0.28
Procedure code for echo                                             0   -0.10       0
Total ICD counts                                                -0.38   -0.75       0
NLP mention of CAD                                               0.97    1.34    0.81
NLP mention of CAD procedure related concepts                       0    0.02       0
NLP mention of non-smoker                                       -0.07   -0.25   -0.42
ICD code for CAD                                                 1.00    0.67    0.35
CPT code for stent or CABG                                          0    0.05       0
NLP mention of current-smoker                                       0   -0.03       0
Any NLP mention                                                  0.06    0.05       0
ICD code for CAD : Procedure code for echo                          0   -0.04       0
NLP mention of CAD : NLP mention of possible-smoker                 0   -0.02       0
Oncall : NLP mention of non-smoker                               0.09       0       0
Indication for NLP mention of non-smoker                        -0.53       0       0

Figure 4: The mean and 95% bootstrap confidence interval of AUC, Brier score, $F_{5\%}$ and $F_{10\%}$ of Debias$_{L\&B}$, Local, SHIR and SMA on the validation data from the four studies.