Least Squares Approximation for a Distributed System
Xuening Zhu, Feng Li, and Hansheng Wang
School of Data Science, Fudan University, Shanghai, China
School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, China
Guanghua School of Management, Peking University, Beijing, China
Abstract
In this work, we develop a distributed least squares approximation (DLSA) method that is able to solve a large family of regression problems (e.g., linear regression, logistic regression, and Cox's model) on a distributed system. By approximating the local objective function using a local quadratic form, we are able to obtain a combined estimator by taking a weighted average of local estimators. The resulting estimator is proved to be statistically as efficient as the global estimator. Moreover, it requires only one round of communication. We further conduct shrinkage estimation based on the DLSA estimation using an adaptive Lasso approach. The solution can be easily obtained by using the LARS algorithm on the master node. It is theoretically shown that the resulting estimator possesses the oracle property and is selection consistent by using a newly designed distributed Bayesian information criterion (DBIC). The finite sample performance and the computational efficiency are further illustrated by an extensive numerical study and an airline dataset. The airline dataset is 52 GB in size. The entire methodology has been implemented in Python for a de facto standard Spark system. The proposed DLSA algorithm on the Spark system takes 26 minutes to obtain a logistic regression estimator, whereas a full likelihood algorithm takes 15 hours to obtain an inferior result.
KEY WORDS:
Distributed System; Least Squares Approximation; Shrinkage Estimation; Distributed BIC.

∗ Xuening Zhu is supported by the National Natural Science Foundation of China (nos. 11901105, U1811461), the Shanghai Sailing Program for Youth Science and Technology Excellence (19YF1402700), and the Fudan-Xinzailing Joint Research Centre for Big Data, School of Data Science, Fudan University. Feng Li's research is supported by the National Natural Science Foundation of China (no. 11501587). Hansheng Wang's research is partially supported by the National Natural Science Foundation of China (nos. 11831008, 11525101, 71532001, and 71332006). It is also supported in part by China's National Key Research Special Program (no. 2016YFC0207704). Feng Li is the corresponding author.

1. INTRODUCTION

Modern data analysis often needs to address huge datasets. In many cases, the size of the dataset could be too large to be conveniently handled by a single computer. Consequently, the dataset must be stored and processed on many connected computer nodes, which are thereafter referred to as a distributed system. More precisely, a distributed system refers to a large cluster of computers, which are typically connected with each other via wire protocols such as RPC and HTTP (Zaharia et al., 2012). Consequently, they are able to communicate with each other and accomplish the intended data analysis tasks at huge scales in a collective manner.

By using a distributed system, we are able to break a large-scale computation problem into many small pieces and then solve them in a distributed manner. A key challenge faced by statistical computation on a distributed system is the communication cost. The communication cost refers to the wall-clock time needed for data communication between different computer nodes, which could be expensive in distributed systems (Zhang et al., 2013; Shamir et al., 2014; Jordan et al., 2018). In this work, we consider a "master-and-worker"-type distributed system with strong workers. We assume that the workers are strong in the sense that they are modern computers with reasonable storage and computing capacities. For example, a worker with 32 CPU cores, 128 GB RAM, and a 512 GB SSD hard disk could be a very strong worker. As one will see later, the most widely used systems, Hadoop (Apache Software Foundation, 2019a) and Spark (Apache Software Foundation, 2019b), belong to this category. Typically, the workers do not communicate with each other directly. However, they should be connected to a common master node, which is another computer with outstanding capacities. Consequently, most data should be distributed to the workers, and most computations should be conducted by the workers. This enables us to solve a large-scale computation problem in a distributed manner. In contrast, the master should take the responsibility to coordinate with different workers.

For this "master-and-worker"-type distributed system, the communication cost is mostly between the master and workers. One can easily verify that good algorithms for some simple moment estimates (e.g., the sample mean) can be easily developed using this type of distributed system. For example, to compute the sample mean on a distributed system, one can first compute the sample mean on each worker, which is known as a map process. Then, each worker reports to the master the resulting sample mean and the associated sample size.
Thereafter, the master can compute the overall sample mean by a weighted average of the sample means from each worker, which is known as a reduce process. Such a "MapReduce" algorithm requires only one "master-and-worker" communication for each worker. It requires no direct communication between workers. Because most computations are accomplished by the workers, it also makes good use of the strong worker capacities. As a result, the algorithm can be considered effective. Unfortunately, cases such as the sample mean are rather rare in statistical analysis. Most statistical algorithms do not have an analytical solution (e.g., the maximum likelihood estimation of a logistic regression model) and thus require multiple iterations (e.g., Newton-Raphson iteration or stochastic-gradient-descent-type algorithms). These iterations unfortunately lead to substantial "master-and-worker" communication, which is communicationally expensive. Therefore, developing algorithms that are highly efficient computationally, communicationally, and statistically for distributed systems has become a problem of great interest.

In the literature, the common wisdom for addressing a distributed statistical problem can be classified into two categories. The first category is the "one-shot" (OS) or "embarrassingly parallel" approach, which requires only one round of communication. Specifically, the local workers compute the estimators in parallel and then communicate them to the master to obtain an averaged global estimator (Zhang et al., 2013; Liu and Ihler, 2014; Lee et al., 2015; Battey et al., 2015; Fan et al., 2017; Chang et al., 2017b,a). Although this approach is highly efficient in terms of communication, it might not achieve the best efficiency in statistical estimation on most occasions (Shamir et al., 2014; Jordan et al., 2018). The second approach includes iterative algorithms, which require multiple rounds of communication between the master and the workers. This approach typically requires multiple iterations to be taken so that the estimation efficiency can be refined to match the global (or centralized) estimator (Shamir et al., 2014; Wang et al., 2017a,b; Jordan et al., 2018). In addition, see Yang et al. (2016); Heinze et al. (2016); Smith et al. (2018); Li et al. (2019) for distributed statistical modelling methods when the data are distributed according to features rather than samples.

The aforementioned two approaches have also been studied for the sparse learning problem using ℓ1 shrinkage estimation. For the first approach, Lee et al. (2015) investigated distributed high-dimensional sparse regression using the OS approach by combining local debiased ℓ1 estimates. Battey et al. (2015) revisited the same problem but further considered distributed testing and estimation methods in a unified likelihood framework, in which a refitted estimation is used to obtain an oracle convergence rate. For the second approach, both Wang et al. (2017a) and Jordan et al. (2018) developed iterative algorithms to solve the sparse estimation problem, and they theoretically proved that the error bounds match those of the centralized estimator. Beyond ℓ1 shrinkage estimation, Chen and Xie (2014) studied a penalized likelihood estimator with more general penalty function forms in a high-dimensional setting. However, to the best of our knowledge, there are no guarantees that simultaneously ensure model selection consistency (Fan and Li, 2001) and establish a criterion for consistent tuning parameter selection (Wang et al., 2007).
In addition, all of the above methods assume independent and identically distributed samples stored by each worker, which is questionable in practice because the distributed dataset might exhibit great heterogeneity from worker to worker. We would like to remark that this heterogeneity cannot be avoided because it is mainly due to the practical need to record data across time or space (for example).

In this work, we aim to develop a novel methodology to address a sparse estimation problem with low dimensions (p < n, where n is the local sample size). Under this setting, we pay greater attention to the communication cost caused by the iterations rather than that of transmitting digits. The data possessed by different workers are allowed to be heterogeneous but share the same regression relationship. The proposed method borrows the idea of the least squares approximation (LSA, Wang and Leng, 2007) and can be used to handle a large class of parametric regression models on a distributed system. Specifically, let Y ∈ R be the response of interest, let X be the associated predictor with finite dimension, and let θ ∈ R^p be the corresponding regression coefficient. The objective is to estimate the regression parameter θ and conduct variable selection on a distributed system that has one master and many strong workers. Assume data, denoted by (Y_i, X_i) with 1 ≤ i ≤ N, that are distributed across different workers. Further, assume that the sample size on each worker is sufficiently large and of the same order. Under this setting, we propose a distributed LSA (DLSA) method. The key idea is as follows:

(1) First, we estimate the parameter θ on each worker separately by using the local data on the distributed workers. This can be done efficiently by using standard statistical estimation methods (e.g., maximum likelihood estimation). By assuming that the sample size of each worker is sufficiently large, the resulting estimator and its asymptotic covariance estimate should be consistent but not statistically efficient, as compared with the global estimates.

(2) Next, each worker passes the local estimator of θ and its asymptotic covariance estimate to the master. Because we do not consider a high-dimensional model setting, the communication cost in this regard should be negligible.

(3) Once the master receives all the local estimators from the workers, a weighted least squares-type objective function can be constructed. This can be viewed as a local quadratic approximation of the global log-likelihood function. As one can expect, the resulting estimator shares the same asymptotic covariance with the full-sample MLE (i.e., the global estimator) under appropriate regularity conditions.
Figure 1: Illustration of the DLSA method.
The major steps of the DLSA method are further illustrated in Figure 1. As one can see, the DLSA method reduces communication costs mainly by using only one round of communication, and it avoids further iterative steps. Given the DLSA objective function on the master node, we can further conduct shrinkage estimation on the master. This is done by formulating an adaptive Lasso-type (Zou, 2006; Zhang and Lu, 2007) objective function. The objective function can be easily solved by the LARS algorithm (Efron et al., 2004) with minimal computation cost on the master. Thus, no communication is required. Accordingly, a solution path can be obtained on the master node. Thereafter, the best estimator can be selected from the solution path in conjunction with the proposed distributed Bayesian information criterion (DBIC). We theoretically show that the resulting estimator is selection consistent and as efficient as the oracle estimator, which is the global estimator obtained under the true model.

To summarize, we aim to make the following important contributions to the existing literature. First, we propose a master with strong workers (MSW) distributed system framework, which solves a large-scale computation problem in a communication-efficient way. Second, given this MSW system, we propose a novel DLSA method, which easily handles a large class of classical regression models such as linear regression, generalized linear regression, and Cox's model. Third, due to the simple quadratic form of the objective function, the analytical solution path can be readily obtained using the LARS algorithm on the master. Then, the best model can be easily selected by the DBIC criterion. Finally, but also most importantly, the proposed DLSA method fully takes advantage of the specialty of the MSW system, which pushes the intensive computation to the workers and therefore is as computationally, communicationally, and statistically efficient as possible. Furthermore, we would like to remark that although the proposed DLSA is designed for a distributed system, it can also be applied on a single computer when there are memory constraints (Chen et al., 2018; Wang et al., 2018).

The remainder of this article is organized as follows. Section 2 introduces the model setting and the least squares approximation method. Section 3 presents a communication-efficient shrinkage estimation and a distributed BIC criterion. Numerical studies are given in Section 4. An application to U.S. airline data with a dataset greater than 52 GB is illustrated using the DLSA method on the Spark system in Section 5. The article concludes with a brief discussion in Section 6. All technical details are delegated to the Appendix.
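To make steps (1) and (2) concrete, the following minimal sketch shows what a single worker would compute and ship to the master for a logistic regression. It is written in plain Python with statsmodels; the function name and the choice of statsmodels are our own illustrative assumptions and are not part of the paper's implementation.

```python
import numpy as np
import statsmodels.api as sm

def local_worker_summary(X_k, y_k):
    """Fit a logistic regression on one worker's data and return what DLSA
    needs on the master: the local estimate, the estimated covariance of
    that estimate, and the local sample size."""
    res = sm.Logit(y_k, X_k).fit(disp=0)      # local MLE on worker k
    theta_hat_k = np.asarray(res.params)      # p-dimensional local estimate
    cov_hat_k = np.asarray(res.cov_params())  # estimated cov(theta_hat_k), i.e. Sigma_k / n_k
    return theta_hat_k, cov_hat_k, len(y_k)
```

Only these O(p^2) numbers travel to the master, which is why a single round of communication suffices.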
2. STATISTICAL MODELLING ON DISTRIBUTED SYSTEMS
Suppose that in the distributed system there are in total N observations, which are indexed as i = 1, ..., N. The i-th observation is denoted as Z_i = (X_i^T, Y_i)^T ∈ R^{p+1}, where Y_i ∈ R is the response of interest and X_i ∈ R^p is the corresponding covariate vector. Specifically, the observations are distributed across K local workers. Define S = {1, ..., N} to be the full set of sample observations and decompose S = ∪_{k=1}^K S_k, where S_k collects the observations distributed to the k-th worker. Obviously, we should have S_{k1} ∩ S_{k2} = ∅ for any k1 ≠ k2. Define n = N/K as the average sample size for each worker. Then, we assume |S_k| = n_k and that all n_k diverge at the same order O(n). Specifically, c_1 ≤ min_k n_k/n ≤ max_k n_k/n ≤ c_2 for some positive constants c_1 and c_2. We know immediately that N = Σ_k n_k. In practice, due to the data storing strategy, the data on different workers could be quite heterogeneous, e.g., they might be collected according to spatial regions. Despite the heterogeneity here, we assume they share the same regression relationship, and the parameter of interest is given by θ ∈ R^p. We focus on the case in which p is fixed.

Let L(θ; Z) be a plausible twice-differentiable loss function. Define the global loss function as L(θ) = N^{-1} Σ_{i=1}^N L(θ; Z_i), whose global minimizer is θ̂ = argmin_θ L(θ), and the true value is θ_0. It is assumed that θ̂ admits the following asymptotic rule:

  √N (θ̂ − θ_0) →_d N(0, Σ)

for some positive definite matrix Σ ∈ R^{p×p} as N → ∞. If L(θ; Z) is the negative log-likelihood function, then θ̂ is the global MLE. Correspondingly, define the local loss function on the k-th worker as L_k(θ) = n_k^{-1} Σ_{i∈S_k} L(θ; Z_i), whose minimizer is θ̂_k = argmin_θ L_k(θ). We assume that

  √n_k (θ̂_k − θ_0) →_d N(0, Σ_k)

as n_k → ∞ for a positive definite matrix Σ_k. The goal is to conduct statistical analysis based on the data on the local workers and to minimize the communication cost as much as possible.

In this section, we motivate our approach through a least squares approximation to the global loss function, which takes a local quadratic form. To motivate this idea, we begin by decomposing and approximating the global loss function using Taylor's expansion as follows:

  L(θ) = N^{-1} Σ_{k=1}^K Σ_{i∈S_k} L(θ; Z_i)
       = N^{-1} Σ_{k=1}^K Σ_{i∈S_k} { L(θ; Z_i) − L(θ̂_k; Z_i) } + C_1
       ≈ N^{-1} Σ_{k=1}^K Σ_{i∈S_k} (θ − θ̂_k)^T L̈(θ̂_k; Z_i) (θ − θ̂_k) + C_2,      (2.1)

where the last equation uses the fact that L̇_k(θ̂_k) = 0, and C_1 and C_2 are constants. Typically, the minimizer θ̂_k will achieve the convergence rate √n_k. Intuitively, the quadratic form in (2.1) should be a good local approximation of the global loss function. This inspires us to consider the following weighted least squares objective function:

  L̃(θ) = N^{-1} Σ_k (θ − θ̂_k)^T { Σ_{i∈S_k} L̈(θ̂_k; Z_i) } (θ − θ̂_k)
        =_def Σ_k (θ − θ̂_k)^T α_k Σ̂_k^{-1} (θ − θ̂_k),

where α_k = n_k/N. This leads to a weighted least squares estimator (WLSE), which takes the following analytical form:
  θ̃ = argmin_θ L̃(θ) = ( Σ_k α_k Σ̂_k^{-1} )^{-1} ( Σ_k α_k Σ̂_k^{-1} θ̂_k ).      (2.2)

It is remarkable that the estimator θ̃ in (2.2) can be easily computed on a distributed system. Specifically, each local worker sends θ̂_k and Σ̂_k to the master node, and then the master node produces the WLSE by (2.2). As a result, the above WLSE requires only one round of communication. Hence, it is highly efficient in terms of communication. Note that instead of taking a simple average of the local estimators θ̂_k as in the literature, the analytical solution in (2.2) takes a weighted average of the θ̂_k using the weights α_k Σ̂_k^{-1}. This will result in higher statistical efficiency if the data are stored heterogeneously. To investigate the asymptotic properties of the WLSE, we assume the following conditions.

(C1) (Parameter Space) The parameter space Θ is a compact and convex subset of R^p. In addition, the true value θ_0 lies in the interior of Θ.

(C2) (Covariate Distribution) Assume the covariates X_i (i ∈ S_k) from the k-th worker are independently and identically distributed from the distribution F_k(x).

(C3) (Identifiability) For any δ > 0, there exists ε > 0 such that

  lim_{n→∞} P{ inf_{‖θ* − θ_0‖ ≥ δ, 1 ≤ k ≤ K} ( L_k(θ*) − L_k(θ_0) ) ≥ ε } = 1,  and  E{ ∂L_k(θ)/∂θ |_{θ=θ_0} } = 0.

(C4) (Local Convexity) Define

  Ω_k(θ) = E{ ∂L(θ; Z_i)/∂θ · ∂L(θ; Z_i)/∂θ^T | i ∈ S_k } = E{ ∂²L(θ; Z_i)/∂θ∂θ^T | i ∈ S_k }

for i ∈ S_k. Assume that Ω_k(θ) is nonsingular at the true value θ_0. In addition, let Σ_k = {Ω_k(θ_0)}^{-1} and Σ = {Σ_k α_k Ω_k(θ_0)}^{-1}.

(C5) (Smoothness) Define B(δ) = {θ* ∈ Θ : ‖θ* − θ_0‖ ≤ δ} as a ball around the true value θ_0 with radius δ > 0. Assume for almost all Z that the loss function L(θ; Z) admits all third derivatives ∂³L(θ; Z)/(∂θ_i ∂θ_j ∂θ_l) for all θ ∈ B(δ). In addition, assume that there exist functions M_ijl(Z) and a constant δ > 0 such that

  | ∂³L(θ*; Z)/(∂θ_i ∂θ_j ∂θ_l) | ≤ M_ijl(Z)  for all θ* ∈ B(δ),      (2.3)

where E{ M_ijl(Z_m) | m ∈ S_k } < ∞ for all 1 ≤ i, j, l ≤ p and 1 ≤ k ≤ K.

The above conditions are standard for establishing the asymptotic properties of M-estimators. First, Condition (C1) assumes the parameter space to be convex (Jordan et al., 2018). Next, Condition (C2) concerns the distribution of the covariates {X_i : i ∈ S_k}. Specifically, the distribution F_k(x) may differ across 1 ≤ k ≤ K. In particular, it allows for a heterogeneous distribution of the covariates across workers. We would like to remark that heterogeneity is a common phenomenon in distributed systems, and it has been ignored in much of the literature. Condition (C3) assures the identifiability of the local loss functions across all workers. Finally, Conditions (C4) and (C5) are standard regularity conditions on the loss functions, which require certain degrees of local convexity and smoothness. These conditions are widely assumed in the literature to guarantee the asymptotic convergence of the estimators (Fan and Li, 2001; Lehmann and Casella, 2006; Jordan et al., 2018). Given the conditions, we can establish the asymptotic properties of the WLSE in the following Proposition 1 and Theorem 1.

Proposition 1.
Assume Conditions (C1)–(C5). Then, we have

  √N (θ̃ − θ_0) = V(θ_0) + B(θ_0)      (2.4)

with cov{V(θ_0)} = Σ and B(θ_0) = O_p(K/√N), where Σ = (Σ_{k=1}^K α_k Σ_k^{-1})^{-1}.

The proof of Proposition 1 is given in Appendix A.1. Proposition 1 separates √N(θ̃ − θ_0) into two parts, namely, a variance part and a bias part. In particular, one should note that the variance is of the same order as that of the global estimator θ̂, which is O(N^{-1}), while the bias order is related to the number of local workers K. Consequently, if the local sample size is sufficiently large, the bias should be sufficiently small, and thus the global statistical efficiency can be achieved. We state this result in the following theorem.

Theorem 1. (Global Asymptotic Normality)
Assume Conditions (C1)–(C5) and further assume n/N^{1/2} → ∞. Then, we have √N (θ̃ − θ_0) →_d N(0, Σ), which achieves the same asymptotic normality as the global estimator θ̂.

The proof of Theorem 1 is given in Appendix A.2. It can be concluded that we should require the local sample size to be of an order larger than √N, which is easy to satisfy in practice. Otherwise, we would have N/n² = K/n → ∞. This implies that the number of workers is even larger than the average local sample size n, which is obviously not the case in practice. In the next section, we further discuss the shrinkage estimation based on the DLSA method.
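Before moving on, here is a minimal NumPy sketch of the master-side combination in (2.2), assuming each worker has shipped its local estimate together with the estimated covariance matrix of that estimate (as in the worker-side sketch in Section 1); the function name is ours.

```python
import numpy as np

def dlsa_combine(theta_hats, cov_hats):
    """Weighted least squares combination of local estimators, eq. (2.2).

    theta_hats : list of K arrays of shape (p,), the local estimates.
    cov_hats   : list of K arrays of shape (p, p), the estimated
                 covariances of the local estimates (Sigma_hat_k / n_k).
    Using the inverse of cov(theta_hat_k) as the weight equals the paper's
    alpha_k * Sigma_hat_k^{-1} weighting up to a common factor 1/N,
    which cancels in the ratio below.
    """
    p = theta_hats[0].shape[0]
    weight_sum = np.zeros((p, p))
    weighted_theta = np.zeros(p)
    for theta_k, cov_k in zip(theta_hats, cov_hats):
        w_k = np.linalg.inv(cov_k)           # information-type weight
        weight_sum += w_k
        weighted_theta += w_k @ theta_k
    cov_tilde = np.linalg.inv(weight_sum)    # estimated covariance of theta_tilde
    theta_tilde = cov_tilde @ weighted_theta
    return theta_tilde, cov_tilde
```

Note that cov_tilde here estimates Σ/N, i.e., the covariance of θ̃ itself rather than the asymptotic covariance of √N(θ̃ − θ_0).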
3. COMMUNICATION-EFFICIENT SHRINKAGE ESTIMATION
Variable selection is a classical but critically important problem. That is because, in practice, the number of available covariates is typically large, but only a small number of covariates are related to the response. Given an appropriate variable selection technique, one can discover the important variables with high probability. In recent decades, various variable selection techniques have been well studied (Tibshirani, 1996; Fan and Li, 2001; Zou, 2006; Wang and Leng, 2007; Zhang, 2010). However, how to conduct variable selection on a distributed system has not been sufficiently investigated. Existing approaches mostly focus on ℓ1 shrinkage estimation and develop corresponding algorithms (Lee et al., 2015; Battey et al., 2015; Wang et al., 2017a; Jordan et al., 2018). However, to the best of our knowledge, three problems remain unsolved on a distributed system: (a) most works do not establish the oracle properties of the shrinkage estimators, (b) no consistent tuning parameter selection criterion is given or investigated, and (c) the computation will be heavy if one needs to conduct estimation and select the tuning parameters simultaneously.

To solve the above problems, we first define some notation. Without loss of generality, we assume the first d (0 < d < p) components of θ_0 to be nonzero, i.e., θ_{0j} ≠ 0 for 1 ≤ j ≤ d and θ_{0j} = 0 for j > d. Correspondingly, we denote M_T = {1, ..., d} to be the true model. In addition, let M = {i_1, ..., i_{|M|}} be an arbitrary candidate model. For an arbitrary vector v = (v_j : 1 ≤ j ≤ p)^T, define v^{(M)} = (v_i : i ∈ M)^T ∈ R^{|M|} and v^{(−M)} = (v_i : i ∉ M)^T ∈ R^{p−|M|}. For an arbitrary matrix M = (m_{j1 j2}), define M^{(M)} = (m_{j1 j2} : j1, j2 ∈ M) ∈ R^{|M|×|M|}.

For simultaneous variable selection and parameter estimation, we follow the idea of Wang and Leng (2007) and consider the adaptive Lasso objective function on the master (Zou, 2006; Zhang and Lu, 2007),

  Q_λ(θ) = L̃(θ) + Σ_j λ_j |θ_j|.      (3.1)

By the adaptive Lasso method, different amounts of shrinkage λ_j are imposed on each coefficient to improve the estimation efficiency (Zou, 2006; Zou and Li, 2008). Compared to the LSA approach of Wang and Leng (2007), we have the following key differences. First, θ̃ is the combined WLSE from the local workers. Second, Σ̂ is constructed from the local asymptotic covariance estimators Σ̂_k. Consequently, to achieve a global convergence rate, one needs to carefully balance the local convergence rates of θ̂_k and Σ̂_k with those of the global ones.

Define θ̃_λ = argmin_θ Q_λ(θ). Then, we can establish the √N-consistency as well as the selection consistency of θ̃_λ under certain conditions on λ = (λ_1, ..., λ_p)^T.

Theorem 2.
Assume Conditions (C1)–(C5). Let a_λ = max{λ_j : j ≤ d} and b_λ = min{λ_j : j > d}. Then, the following results hold.

(a) (√N-consistency) If √N a_λ →_p 0, then θ̃_λ − θ_0 = O_p(N^{-1/2}).

(b) (Selection Consistency) If √N a_λ →_p 0 and √N b_λ →_p ∞, then P(θ̃_λ^{(−M_T)} = 0) → 1.

The proof of Theorem 2 is given in Appendix A.3. Intuitively, a_λ controls the largest amount of penalty on the true nonzero parameters. Consequently, this amount cannot be too large; otherwise, it will result in a highly biased estimator. In contrast, b_λ is responsible for producing sparse solutions for the irrelevant covariates. Therefore, b_λ should be sufficiently large to produce an effective amount of shrinkage.

By Theorem 2, we know that with probability tending to one, we have θ̃_λ^{(−M_T)} = 0. Meanwhile, θ̃_λ^{(M_T)} − θ_0^{(M_T)} = O_p(N^{-1/2}). It is then natural to ask whether the statistical efficiency of θ̃_λ^{(M_T)} can be as good as that of the oracle estimator, which is the global estimator obtained under the true model, i.e., θ̂_oracle = argmin_{θ ∈ R^p, θ_j = 0 ∀ j ∉ M_T} L(θ). To this end, we require the following technical condition.

(C6) (Covariance Assumption)
Define the global unpenalized estimator as θ̂_M = argmin_{θ ∈ R^p : θ_j = 0, ∀ j ∉ M} L(θ). Assume for the global estimator θ̂_M with M ⊃ M_T that √N(θ̂_M^{(M)} − θ_0^{(M)}) →_d N(0, Σ_M) = N(0, Ω_M^{-1}), where Σ_M ∈ R^{|M|×|M|} is a positive-definite matrix. Further assume for any M ⊃ M_T that Ω_M = Ω_{M_F}^{(M)} holds, where M_F = {1, ..., p} denotes the full model.

Condition (C6) does not seem very intuitive. Nevertheless, it is a condition that is well satisfied by most maximum likelihood estimators. A more detailed discussion is provided by Wang and Leng (2007). We then have the oracle property in the following theorem.

Theorem 3. (Oracle Property)
Assume Conditions (C1)–(C6). Let √N a_λ →_p 0 and √N b_λ →_p ∞; then, it holds that

  √N ( θ̃_λ^{(M_T)} − θ_0^{(M_T)} ) →_d N( 0, Σ_{M_T} ).      (3.3)

By Theorems 2 and 3, we know that as long as the tuning parameters are appropriately selected, the resulting estimator is selection consistent and as efficient as the oracle estimator. It is remarkable that tuning a total of p parameters simultaneously is not feasible in practice. To fix this problem, we follow the tradition of Zou (2006) and Wang et al. (2007) and specify λ_j = λ |θ̃_j|^{-1}. Since θ̃_j is √N-consistent, as long as λ satisfies the conditions λ√N → 0 and λN → ∞, the conditions √N a_λ →_p 0 and √N b_λ →_p ∞ are satisfied. Thereafter, the original problem of tuning parameter selection can be replaced by the selection of the single parameter λ.

Although it has been shown that, asymptotically, the oracle property can be guaranteed as long as the tuning parameters are appropriately selected, it is still unclear how to conduct variable selection in practice. That motivates us to design a BIC-type criterion that can select the true model consistently in a completely data-driven manner (Zhang and Lu, 2007; Chen and Chen, 2008; Zou and Zhang, 2009; Wang et al., 2013). Specifically, to consistently recover the sparsity pattern, we consider a distributed Bayesian information criterion (DBIC) as follows:

  DBIC_λ = (θ̃_λ − θ̃)^T Σ̂^{-1} (θ̃_λ − θ̃) + log N × df_λ / N,      (3.4)

where df_λ is the number of nonzero elements in θ̃_λ.

The design of the DBIC is in the spirit of the BIC criterion used in Wang and Leng (2007). The difference is that the DBIC uses the WLSE θ̃ and the average of the distributed covariance estimators, Σ̂, to construct the least squares objective function. Intuitively, if θ̃ and Σ̂ approximate the global estimator θ̂ = argmin L(θ) and its asymptotic covariance well, then the DBIC criterion should be able to facilitate consistent tuning parameter selection. Specifically, the resulting model should be selection consistent (Shao, 1997).
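Because (3.1) is a quadratic form plus an ℓ1 penalty, the whole solution path and the DBIC selection can be computed on the master with off-the-shelf tools. The sketch below, assuming scikit-learn is available, rewrites the problem as an ordinary lasso through a Cholesky factor of Σ̂^{-1} and an adaptive rescaling of the columns; the function name and this particular reduction are our own illustration rather than the authors' released code, which uses the LARS algorithm directly.

```python
import numpy as np
from sklearn.linear_model import lars_path

def sdlsa_path_dbic(theta_tilde, sigma_inv, n_total):
    """Adaptive-lasso shrinkage of the DLSA estimator with DBIC selection.

    theta_tilde : WLSE from (2.2); sigma_inv : the matrix Sigma_hat^{-1}
    defining the quadratic form in (3.1); n_total : the global sample size N.
    """
    L = np.linalg.cholesky(sigma_inv)   # Sigma_hat^{-1} = L @ L.T
    X = L.T                             # pseudo design (p x p)
    y = X @ theta_tilde                 # pseudo response
    w = np.abs(theta_tilde)             # adaptive weights: lambda_j = lambda / |theta_tilde_j|
    Xw = X * w                          # rescale columns to absorb the weights
    _, _, gammas = lars_path(Xw, y, method="lasso")
    best_dbic, best_theta = np.inf, theta_tilde
    for gamma in gammas.T:              # one solution per knot of the LARS path
        theta_lam = gamma * w           # back to the original parameterisation
        resid = theta_lam - theta_tilde
        df = np.count_nonzero(theta_lam)
        dbic = resid @ sigma_inv @ resid + np.log(n_total) * df / n_total
        if dbic < best_dbic:
            best_dbic, best_theta = dbic, theta_lam
    return best_theta, best_dbic
```

The column rescaling works because minimizing ||y − Xθ||² + λ Σ_j |θ_j|/|θ̃_j| over θ is equivalent to minimizing ||y − Xw γ||² + λ Σ_j |γ_j| over γ with θ_j = γ_j |θ̃_j|, so a standard lasso path in γ traces out the adaptive-lasso path in θ.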
To formally investigate the theoretical performance of the DBIC, we first define some notation. First, we define the set of nonzero elements of θ̃_λ by M_λ. Given a tuning parameter λ, M_λ could be underfitted, correctly fitted, or overfitted. We can then form the following partition:

  R_− = {λ ∈ R^p : M_λ ⊅ M_T},
  R_0 = {λ ∈ R^p : M_λ = M_T},
  R_+ = {λ ∈ R^p : M_λ ⊃ M_T, M_λ ≠ M_T},

where R_− denotes the underfitted models and R_+ denotes the overfitted models. We show in the following theorem that the DBIC can consistently identify the true model.

Theorem 4. Assume Conditions (C1)–(C6). Define a reference tuning parameter sequence {λ_N ∈ R^p}, where the first d elements of λ_N are 1/N and the remaining elements are log N/N. Then, we have P( inf_{λ ∈ R_− ∪ R_+} DBIC_λ > DBIC_{λ_N} ) → 1.

By Theorems 2 and 3, we know that with probability tending to one, we should have M_{λ_N} = M_T. Consequently, the sequence λ_N here plays the role of a reference sequence that leads to the true model. Accordingly, Theorem 4 implies that the optimal λ selected by the DBIC will consistently identify the true model. This is because any λ leading to an inconsistent model selection result should perform worse than λ_N in terms of the DBIC value. This result does not imply that the tuning parameter selected by the DBIC is λ_N; it only states that any inconsistent λ should be at least worse than λ_N, whereas λ_N leads to consistent model selection results.

4. NUMERICAL STUDIES

To demonstrate the finite sample performance of the DLSA method, we conduct a number of simulation studies in this section. Five classical regression models are presented, and the corresponding DLSA algorithms are implemented. For each model, we consider two typical settings to verify the numerical performance of the proposed method. They represent two different data storing strategies, together with competing methods. The first strategy is to distribute the data in a completely random manner. Thus, the covariates on different workers are independent and identically distributed (i.i.d.). In contrast, the second strategy allows the covariate distributions on different workers to be heterogeneous. The estimation efficiency as well as the variable selection accuracy are evaluated. Examples are given as follows.
Example 1. (Linear Regression).
We first consider one of the most pop-ular regression analysis tools, i.e., linear regression. In particular, we generate thecontinuous response Y i by a linear relationship with the covariates X i as follows: Y i = X (cid:62) i θ + (cid:15) i , , where the noise term ε i is independently generated using a standard normal dis-tribution N (0 , θ =(3 , . , , , , , , (cid:62) . Example 2. (Logistic Regression).
The logistic regression is a classical modelthat addresses binary responses (Hosmer Jr et al., 2013). In this example, we generate18he response Y i independently by the Bernoulli distribution given the covariate X i as P ( Y i = 1 | X i ) = exp( X (cid:62) i θ )1 + exp( X (cid:62) i θ ) . We follow Wang and Leng (2007) to set the true parameter θ = (3 , , , . , , , , (cid:62) . Example 3. (Poisson Regression).
In this example, we consider the Poissonregression, which is used to model counted responses (Cameron and Trivedi, 2013).The responses are generated according to the Poisson distribution as P ( Y | X i , θ ) = λ Y i Y i ! exp( − λ ) , (4.1)where λ = exp( X (cid:62) i θ ). The true parameter θ is set to (0 . , , , , , , − . , , (cid:62) . Example 4. (Cox Model).
Next, we consider Cox’s model in the survivalanalysis. (Fan and Li, 2002). Specifically, we set the hazard function to be h ( t i | X i ) = exp( X (cid:62) i θ ) , , where t i is the survival time from the i th subject. In practice, we generate the survivaltime from an exponential distribution with mean exp( − X (cid:62) i θ ). The true parameter θ is set as θ = (0 . , , , , , , . , (cid:62) . Next, the censoring time is generated indepen-dently from an exponential distribution with a mean u i exp( X (cid:62) i β ), where u i is sampledfrom a uniform distribution U [1 , Example 5. (Ordered Probit Regression).
Finally, we consider the orderedProbit regression model, which is widely used to model ordinal responses (Harrell Jr,2015). Specifically, the responses take the value Y = 1 , · · · , L , which represents natural19rders. Given the covariates X i , the ordinal responses are independently generated asfollows: P ( Y i = l | X i , θ ) = Φ( c − X (cid:62) θ ) l = 1Φ( c l − X (cid:62) θ ) − Φ( c l − − X (cid:62) θ ) 2 ≤ l ≤ L − − Φ( c L − − X (cid:62) θ ) l = L where Φ( · ) denotes the distribution function for the standard normal distribution, and c , · · · , c L − are the cut points. We set θ = (0 . , , , , , , . , (cid:62) and ( c , · · · , c L − ) (cid:62) =( − , , . (cid:62) with L = 4.For each example, two different data storage strategies are considered. They leadto different covariate distributions F x ( x ). Specifically, the following two settings areinvestigated. • Setting 1 (i.i.d Covariates).
We first consider the setting in which the dataare distributed independently and identically across the workers. Specifically, thecovariates X ij (1 ≤ i ≤ N, ≤ j ≤ p ) are sampled from the standard normaldistribution N (0 , • Setting 2 (Heterogeneous Covariates).
Next, we look at the case wherebythe covariates distributed across each worker are heterogeneous. This is a com-mon case in practice. Specifically, on the k th worker, the covariates are sampledfrom the multivariate normal distribution N ( µ k , Σ k ), where µ k is generated fromthe uniform distribution U [ − , k = ( σ k,ij ) = ( ρ | j − j | k ) with ρ k sampledfrom U [0 . , .
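To make Setting 2 concrete, the following sketch generates the heterogeneous covariates on the K workers. The AR(1)-type correlation structure follows the description above; the exact interval endpoints for μ_k and ρ_k are illustrative choices, since the paper's values are not fully legible here.

```python
import numpy as np

def generate_setting2_covariates(K, n_k, p, seed=None):
    """Heterogeneous covariates: worker k draws X from N(mu_k, Sigma_k)
    with Sigma_k[i, j] = rho_k ** |i - j| (an AR(1)-type structure)."""
    rng = np.random.default_rng(seed)
    workers = []
    for k in range(K):
        mu_k = rng.uniform(-1.0, 1.0, size=p)   # worker-specific mean (illustrative range)
        rho_k = rng.uniform(0.2, 0.8)           # worker-specific correlation (illustrative range)
        idx = np.arange(p)
        sigma_k = rho_k ** np.abs(idx[:, None] - idx[None, :])
        workers.append(rng.multivariate_normal(mu_k, sigma_k, size=n_k))
    return workers
```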
20n this section, we give performance measurements and summarize simulation re-sults with respect to the estimation efficiency as well as the variable selection accuracy.The sample sizes are set as N = (10 , , × . Correspondingly, the number ofworkers is set to K = (5 , , R = 500 times. For the r threplication, denote (cid:98) θ ( r ) and (cid:101) θ ( r ) as the global estimator and WLSE, respectively. Tomeasure the estimation efficiency, we calculate the root mean square error (RMSE)for the j th estimator as RMSE (cid:101) θ,j = { R − (cid:80) r (cid:107) (cid:101) θ ( r ) j − θ j (cid:107) } / . The RMSE for theglobal estimator (cid:98) θ can be defined similarly. Then, the relative estimation efficiency(REE) with respect to the global estimator is given by REE j = RMSE (cid:98) θ,j / RMSE (cid:101) θ,j for j = 1 , · · · , p .Next, based on the WLSE, we further conduct shrinkage estimation on the mas-ter node. Let (cid:99) M ( r ) be the set of selected variables in the r th replication using theDBIC. Correspondingly, (cid:101) θ ( r ) λ is the shrinkage estimator. To measure the sparse dis-covery accuracy, we calculate the average model size (MS) as MS = R − (cid:80) r | (cid:99) M ( r ) | .Next, the percentage of the true model being correctly identified is given by CM= R − (cid:80) r I ( (cid:99) M ( r ) = M T ). In addition, to further investigate the estimation accuracy,we calculate the REE of the shrinkage estimation with respect to the global estimatoras REE sj = RMSE (cid:98) θ,j / RMSE (cid:101) θ λ ,j for j ∈ M T . We compare the proposed DLSA method with (a) the OS estimator (Zhang et al.,2013), and (b) the CSL estimator (Jordan et al., 2018). The simulation results aresummarized in Table 2–6. First, in the i.i.d case, one can observe that all three methodsare as efficient as the global estimator when N is increased. For example, for the Poisson21egression (i.e., Table 4), all the methods could achieve REE ≈ N = 100 , K = 10. However, in the heterogeneous setting (i.e., Setting 2), the finite sampleperformances of the three competing methods are quite different. The proposed DLSAmethod achieves the highest efficiency of all the methods, which is also asymptoticallyefficient as the global estimator. For instance, the REE of the DLSA estimation for thelogistic regression (i.e., Table 3) is near 1 in the second setting with N = 20 ,
000 and K = 5, while the REEs of the OS and CSL methods are approximately 0.88 and 0.37,respectively. Although the OS estimator is less efficient than the DLSA estimator, itis still consistent as N increases. The CSL method behaves worst under this situation.That is because it only uses the local Hessian matrix; this could result in a highlybiased estimator.With respect to the shrinkage estimation, one can observe that the adaptive Lassoestimator is able to achieve higher estimation efficiency in the second setting than theglobal estimator. For example, the REE for the shrinkage DLSA (SDLSA) methodcould be even higher than 1 for Cox’s model in Setting 2 (i.e., Table 5). Finally, thenewly designed DBIC method does a great job at identifying the nonzero variableswith high accuracy. To see this, one could observe, for example, in the ordered Probitregression (i.e., Table 6), that the MS is controlled well to be approximately 3 and,CM is near 1.
5. APPLICATION TO AIRLINE DATA
For illustration purposes, we study a large real-world dataset. Specifically, thedataset considered here is the U.S. Airline Dataset. The dataset is available at http://stat-computing.org/dataexpo/2009 . It contains detailed flight information aboutU.S. airlines from 1987 to 2008. The task is to predict the delayed status of a flight given22ll other flight information. Each sample in the data corresponds to one flight record,which consists of delayed status, departure time, arrival time, distance of the flight,flight date, delay status at departures, carrier information, origin and destination. Thecomplete variable information is described in Table 7. The data contain six continuousvariables and five categorical variables. The categorical variables are converted todummies with appropriate dimensions. We treat the
Year and DayofMonth variables as numerical to capture the time effects. To capture possible seasonal patterns, we also convert the time variables Month and DayofWeek to dummies. Ultimately, a total of 181 variables are used in the model. The total sample size is 113.9 million observations. This leads to the raw dataset being 12 GB on a hard drive. After the dummy transformation described in Table 7, the overall in-memory size is over 52 GB, even if all the dummies are stored in a sparse matrix format. Thus, this dataset can hardly be handled by a single computer. All the numerical variables are standardized to have a mean of zero and a variance of one.
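As an illustration of this preprocessing step, the snippet below shows one way to build the dummy and standardized columns for a single chunk of the data with pandas. The column names are the usual ones in the public airline data and are assumptions here, as is the helper name; in the paper, the standardization is over the full dataset, whereas here it is per chunk purely for illustration.

```python
import pandas as pd

NUMERIC = ["Year", "DayofMonth", "CRSDepTime", "CRSArrTime", "DepTime", "Distance", "ElapsedTime"]
CATEGORICAL = ["Month", "DayofWeek", "Carrier", "Origin", "Dest"]

def preprocess_chunk(df: pd.DataFrame) -> pd.DataFrame:
    """Dummy-encode the categorical variables and standardize the numeric ones."""
    out = pd.get_dummies(df, columns=CATEGORICAL, drop_first=True, sparse=True)
    for col in NUMERIC:
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out
```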
To demonstrate our method, we set up a Spark-on-YARN cluster on the Alibabacloud server ( ). This is astandard industrial-level architecture setup for a distributed system. The system con-sists of one master node and two worker nodes. Each node contains 64 virtual cores,64 GB of RAM and two 80 GB SSD local hard drives. The dataset is stored on theHadoop data file system (HDFS).Because the RAM is larger than the raw data size, one may wonder whether thelogistic regression task can be run on a single node. Unfortunately, this is infeasiblein practice. This is because much more memory (typically >
128 GB) is needed for23perating on matrices of such a huge size. Even for the installed Spark system, a task ofthis magnitude cannot be directly performed using an existing algorithm library (e.g.,Spark ML ). This is because Spark is a very memory-intensive system. For example,to compute a single distributed matrix with a size of approximately 1 GB in memory,one might need each worker to have 2 GB of memory in practice. This overheadmemory consumption grows significantly as the size of the data matrix increases. Fordiscussions, see Chen and Guestrin (2016).If one insists on computing the traditional MLE based on the entire dataset undermemory constraints, then a stochastic gradient decent (SGD) algorithm (Zhang, 2004)has to be used. However, the built distributed SGD algorithm simply fails with theaforementioned cluster due to the well-known out-of-memory problem in Spark. Tothis end, we have to use a serialized SGD algorithm; however, we run it on the samehardware for comparison purposes. The entire dataset needs to be randomly shuffledfirst. Then, a serialized SGD algorithm optimizes the log-likelihood function in aniterative manner. For each iteration, a batch of the data is randomly sampled andthen used for calculating the log likelihood and its gradient. A heuristic batch size andlearning rate are used. The total computational time is 15 . − . × . We would like to remark that the com-puting time for the MLE does not include the data shuffling time. This serves as animportant benchmark to gauge the performance of the other competing methods (e.g.,DLSA and OS methods). Fortunately, both the proposed DLSA and OS methods allow us to develop a user-friendly Spark algorithm with very limited computer resources. As the algorithm is24esigned in a batch manner, it is highly efficient under memory constraints. Thealgorithm is developed with the Spark Python API (PySpark) and run on a Sparksystem (version > https://github.com/feng-li/dlsa . We then use our algorithm to fit the model.To this end, the entire dataset is randomly partitioned into 1139 subgroups. Thesample size for each subgroup is approximately 100 , . . − . × and − . × , respectively.Comparing these results with that of the traditional MLE, we find that the traditionalMLE is extremely difficult to compute. It takes more that 15 hours and obtains aninferior result (i.e., smaller log-likelihood value). In contrast, the log-likelihood valueof the DLSA is the best. We next apply the proposed shrinkage DLSA method (referred to as SDLSA) withthe BIC criterion to conduct variable selection. It is remarkable that this can be fullyconducted on the master, and no further communication is needed. It takes only 0 . Year is 6.12, whichimplies that as the year proceeds, the airline delays become more severe. Next, the pas-sengers are likely to encounter delays in May, June, October, November, and December(with coefficients of 6.03, 0.28, 0.8, 0.4, and 0.19, respectively). In terms of days ofthe week, more delays are expected for certain working days, i.e., Tuesday, Wednesdayand Friday (with coefficients of 0.6, 0.85 and 0.39, respectively) compared to weekends.Finally, within a given day, we find the coefficients for both the scheduled departuretime (
CRSDepTime, −0.12) and the scheduled arrival time (CRSArrTime, −0.13) are negative, indicating that late departing and arriving flights suffer less from delays. Next, with respect to the airline carriers, AA (American Airlines), DL (Delta Air Lines), NW (Northwest Airlines), and UA (United Airlines) have estimated coefficients of 0.49, 0.18, 0.39, and 0.70, which indicates how much more likely they are to be delayed compared with other airlines. In addition, one may be interested in which airports are more likely to have delayed flights. According to our estimation results, the origin airports that cause the most delays are IAH (George Bush Intercontinental Airport), LGA (LaGuardia Airport), PHL (Philadelphia International Airport), RDU (Raleigh-Durham International Airport), ONT (Ontario International Airport), and SMF (Sacramento International Airport), with coefficients of 0.82, 0.87, 0.94, 1. , and .59, respectively. The top five destination airports that cause delays are PBI (Palm Beach International Airport), MCI (Kansas City International Airport), DCA (Ronald Reagan Washington National Airport), SAN (San Diego International Airport), and MEM (Memphis International Airport), with coefficients of 0.99, 1.00, 1. , 1.15, and 1.16, respectively.
Table: Coefficients for the logistic model estimated with SDLSA using the BIC. The carrier and airport abbreviations are assigned by the International Air Transport Association. We denote "Airport", "Origin", and "Destination" by "A", "O", and "D", respectively.

6. CONCLUDING REMARKS
In this article, we develop a novel DLSA algorithm that is able to perform large-scale statistical estimation and inference on a distributed system. The DLSA method can be applied to a large family of regression models (e.g., logistic regression, Poisson regression, and Cox's model). First, it is shown that the DLSA estimator is as statistically efficient as the global estimator. Moreover, it is computationally efficient and requires only one round of communication.

Furthermore, we develop the corresponding shrinkage estimation by using an adaptive Lasso approach. The oracle property is theoretically proven. A new DBIC measure for distributed variable selection, which only needs to be computed on the master and requires no further communication, is designed. We prove the DBIC measure to be selection consistent. Finally, numerical studies are conducted with five classical regression examples. In addition, a Spark toolbox is developed, which is shown to be computationally efficient both in simulation and in the airline data analysis.

To facilitate future research, we now discuss several interesting topics. First, the DLSA method requires the objective function to have continuous second-order derivatives. This assumption might be restrictive and cannot be satisfied for certain regression models, e.g., quantile regression. Consequently, the relaxation of this assumption can be investigated, and corresponding distributed algorithms should be designed for such regression models. Second, the dimension considered in our framework is finite. As a natural extension, one could study the shrinkage estimation properties in high-dimensional settings. Finally, the algorithm is designed for independent data. In practice, dependent data (e.g., time series data and network data) are frequently encountered. It is thus interesting to develop corresponding algorithms by taking the dependency structure into account.
APPENDIX

Appendix A.1: Proof of Proposition 1

Note that θ̃ − θ_0 takes the form

  θ̃ − θ_0 = { Σ_k α_k Σ̂_k^{-1} }^{-1} { Σ_k α_k Σ̂_k^{-1} (θ̂_k − θ_0) }.

Define Σ̂_k(θ)^{-1} = ∂²L_k(θ)/∂θ∂θ^T. In the following, we denote Σ̂_k by Σ̂_k(θ̂_k) to make the dependence on θ̂_k explicit. By Slutsky's theorem, to prove (2.4), it suffices to verify that

  Σ_k α_k {Σ̂_k(θ̂_k)}^{-1} →_p Σ^{-1},      (A.1)

  √N [ Σ_k α_k {Σ̂_k(θ̂_k)}^{-1} (θ̂_k − θ_0) ] = V*(θ_0) + B*(θ_0),      (A.2)

where cov{V*(θ_0)} = Σ^{-1} and B*(θ_0) = O_p(K/√N). We prove these results in the following.
1. Proof of (A.1).
Recall that (cid:98) θ k is a √ n -consistent estimator of θ . Thisenables us to conduct a Taylor’s expansion of (cid:98) Σ − k ( (cid:98) θ k ) at θ , which yields (cid:98) Σ − k ( (cid:98) θ k ) − Σ − k = (cid:98) Σ − k ( (cid:98) θ k ) − (cid:98) Σ − k ( θ ) + (cid:98) Σ − k ( θ ) − Σ − k = (cid:88) j ∂ L k ( θ ) ∂θ j ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ ( θ ∗ j − θ j ) + (cid:98) Σ − k ( θ ) − Σ − k where θ ∗ lies on the line joining θ and (cid:98) θ k . By Condition (C5), we have ∂ L k ( θ ) ∂θ j ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ = O p (1). Therefore, the order of the first term is O p (1 / √ n k ). In addition, we have (cid:98) Σ − k ( θ ) − Σ − k = (cid:98) Σ − k ( θ ) − E { (cid:98) Σ − k ( θ ) } = O p ( n − / k ). Consequently, it can be derivedthat (cid:98) Σ − k ( (cid:98) θ k ) − Σ − k = O p ( n − / k ). Further note that α k = n k /N and (cid:80) k α k = 1. Then,we have (cid:98) Σ − Σ = (cid:88) k α k [ { (cid:98) Σ k ( (cid:98) θ k ) } − − Σ − k ] = O p ( n − / ) = o p (1) . (A.3)29ence (A.1) is proven.
2. Proof of (A.2).
Recall that (cid:98) θ k is the local minimizer of L k ( θ ). Therefore, itholds that0 = ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = (cid:98) θ k = ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ + 1 n k (cid:88) i ∈S k ∂ L ( θ ; Z i ) ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ( (cid:98) θ k − θ )+ 12 n k (cid:88) i ∈S k p (cid:88) j =1 ( θ ∗ − θ ) (cid:62) ∂ L ( θ ; Z i ) ∂θ j ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ ( θ ∗ − θ ) , where θ ∗ lies between θ and (cid:98) θ k . By standard arguments, ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ = O p ( n − / k ) , (cid:98) Σ − k ( θ ) = E (cid:110) ∂ L ( θ ; Z i , i ∈ S k ) ∂θ∂θ (cid:62) (cid:111)(cid:12)(cid:12)(cid:12) θ = θ + O p ( n − / k ) = Σ − k + O p ( n − / k ) .∂ L k ( θ ) ∂θ j ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ = E (cid:110) ∂ L ( θ ; Z i , i ∈ S k ) ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ (cid:111) + o p (1) . Further note that (cid:98) θ k − θ = O p ( n − / k ). Then, we have (cid:98) θ k − θ = − Σ k ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ + B k ( θ ) n k + O p (cid:16) n k (cid:17) , where B k ( θ ) = O (1) is the bias term. Then, it holds that √ N (cid:88) k α k (cid:98) Σ − k ( (cid:98) θ k )( (cid:98) θ k − θ )= √ N (cid:88) k α k Σ − k (cid:110) − Σ k ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ + B k ( θ ) n k + O p ( n − k ) (cid:111) + √ N (cid:88) k α k { (cid:98) Σ − k ( (cid:98) θ k ) − Σ − k } ( (cid:98) θ k − θ )= − √ N (cid:88) k n k ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ + 1 √ N (cid:88) k Σ − k B k ( θ ) + O p (cid:16) K √ N (cid:17) , (A.4)30here the second equation is implied by √ N (cid:80) k α k { (cid:98) Σ − k ( (cid:98) θ k ) − Σ − k } ( (cid:98) θ k − θ ) = O p ( K/ √ N ).By condition (C4), it can be concluded that cov { √ N (cid:80) k n k ∂ L k ( θ ) ∂θ | θ = θ } = Σ − . Conse-quently, (A.2) can be proven. Appendix A.2: Proof of Theorem 1
By Slutsky’s Theorem, the asymptotic normality is directly implied by (A.1) and(A.2) with V ∗ ( θ ) → d N (0 , Σ − ) and B ∗ ( θ ) = o p (1). First, by Condition (C6) and theLyapunov central limit theorem, we have V ∗ ( θ ) → d N (0 , Σ − ). Next, by the conditionthat n (cid:29) √ N , we have K (cid:29) √ N , and thus, B ∗ ( θ ) = o p (1). Appendix A.3: Proof of Theorem 2
1. Proof of √ N -consistency. Note that the objective function Q λ ( θ ) in (3.1) is a strictly convex function. Then,the local minimizer is also a global minimizer. To establish √ N -consistency results, itsuffices to verify the following result (Fan and Li, 2001), i.e., for an arbitrarily small (cid:15) >
0, there exists a sufficiently large constant C such thatlim N inf P (cid:110) inf u ∈ R p : (cid:107) u (cid:107) = C Q λ ( θ + N − / u ) > Q ( θ ) (cid:111) > − (cid:15). (A.5)31et u = ( u , · · · , u p ) (cid:62) and (cid:98) ∆ N = (cid:80) k α k (cid:98) Σ − k ( (cid:98) θ k ) (cid:8) √ N ( θ − (cid:98) θ k ) (cid:9) , Then, we have N (cid:110) Q λ ( θ + N − / u ) − Q λ ( θ ) (cid:111) = u (cid:62) (cid:98) Σ − u + 2 u (cid:62) (cid:98) ∆ N + N p (cid:88) j =1 λ j | θ j + N − / u j | − N p (cid:88) j =1 λ j | θ j |≥ u (cid:62) (cid:98) Σ − u + 2 u (cid:62) (cid:98) ∆ N + N d (cid:88) j =1 λ j (cid:0) | θ j + N − / u j | − | θ j | (cid:1) ≥ u (cid:62) (cid:98) Σ − u + 2 u (cid:62) (cid:98) ∆ N − d ( √ N a N ) (cid:107) u (cid:107) . (A.6)where the second equality holds because we assume θ j = 0 for j > d . Further notethat we assume that √ N a N → p . Consequently, the last term (A.6) is o p (1). Next,note that the first term of (A.6) is lower bounded by λ − ( (cid:98) Σ) C because (cid:107) u (cid:107) = C .By (A.3), we have λ max ( (cid:98) Σ) → p λ max (Σ). Consequently, with probability tending to 1,we have the first term uniformly larger than 0 . λ − (Σ) C , which is positive due toCondition (C4). In addition, by K/ √ N →
0, we have (cid:98) ∆ N = O p (1). Consequently, aslong as C is sufficiently large, the first term will dominate the last two terms. Then,the result of (A.5) is proven.
2. Proof of Selection Consistency.
It suffices to verify that P ( (cid:101) θ λ,j = 0) → d < j ≤ p . Note that Q λ ( θ ) canbe rewritten as Q λ ( θ ) = ( θ − (cid:101) θ ) (cid:62) (cid:98) Σ − ( θ − (cid:101) θ ) + (cid:88) j λ j | θ j | + C, where C is a constant. Define (cid:98) Ω = (cid:98) Σ − , and (cid:98) Ω ( j ) denotes the j th row of the matrix (cid:98) Ω.If (cid:101) θ λ,j (cid:54) = 0 for some j > d , then the partial derivative can be calculated as √ N ∂Q λ ( θ ) ∂θ j (cid:12)(cid:12)(cid:12) θ = (cid:101) θ λ = 2 (cid:98) Ω ( j ) (cid:62) √ N ( (cid:101) θ λ − (cid:101) θ ) + √ N λ j sign( (cid:101) θ λ,j ) . (A.7)32ote that (cid:98) Ω → p Σ − and √ N ( (cid:101) θ λ − (cid:101) θ ) = √ N ( (cid:101) θ λ − θ ) − √ N ( (cid:101) θ − θ ) = O p (1), by (A.1),Theorem 1, and Theorem 2 (a). Consequently, the first term (A.7) is O p (1). Next, bythis condition, we know that √ N λ j ≥ √ N b λ → ∞ for j > d . Because (cid:101) θ λ,j (cid:54) = 0, wehave sign( (cid:101) θ λ,j ) = 1 or -1; thus, the second term (A.7) goes to infinity. Obviously, theequation will not be equal to zero. This implies P ( (cid:101) θ λ,j = 0) → Appendix A.3: Proof of Theorem 3
We first rewrite the asymptotic covariance Σ into the following block matrix form:Σ = Σ Σ Σ Σ , where Σ ∈ R d × d . Similarly, we partition its inverse matrix Ω into 4 correspondingparts, Ω = (Ω , Ω ; Ω , Ω ). By Theorem 2, with probability tending to 1, we have (cid:101) θ ( −M T ) λ = 0. Therefore, (cid:101) θ ( M T ) λ should be the global minimizer of the objective function, Q λ, ( θ ( M T ) ) =( θ ( M T ) − (cid:101) θ ( M T ) ) (cid:62) (cid:98) Ω ( θ ( M T ) − (cid:101) θ ( M T ) ) − θ ( M T ) − (cid:101) θ ( M T ) ) (cid:62) (cid:98) Ω (cid:101) θ ( −M T ) + (cid:101) θ ( −M T ) (cid:62) (cid:98) Ω (cid:101) θ ( −M T ) + d (cid:88) j =1 λ j | θ j | By Theorem 2, it can be concluded that with probability tending to 1, (cid:101) θ ( M T ) λ shouldbe nonzero (otherwise, the √ N -consistency result in Theorem 2 will not hold). As aresult, The partial derivative ∂Q λ ( θ ) /∂θ j should exist for 1 ≤ j ≤ d , which yields0 = 12 ∂Q λ, ( θ ( M T ) ) ∂θ ( M T ) (cid:12)(cid:12)(cid:12) θ ( M T ) = (cid:101) θ ( M T ) λ = (cid:98) Ω ( (cid:101) θ ( M T ) λ − (cid:101) θ ( M T ) ) − (cid:98) Ω (cid:101) θ ( −M T ) + D ( (cid:101) θ ( M T ) λ ) . (A.8)33here D ( (cid:101) θ ( M T ) λ ) is a d -dimensional vector, with its j th component given by 0 . λ j sign( (cid:101) θ λ,j ).By (A.8), it can be derived that √ N ( (cid:101) θ ( M T ) λ − θ ( M T )0 ) = √ N ( (cid:101) θ ( M T ) − θ ( M T )0 ) + (cid:98) Ω − (cid:98) Ω ( √ N (cid:101) θ ( −M T ) ) − (cid:98) Ω − √ N D ( (cid:101) θ ( M T ) λ )= √ N ( (cid:101) θ ( M T ) − θ ( M T )0 ) + Ω − Ω ( √ N (cid:101) θ ( −M T ) ) + o p (1) , (A.9)where the second equality follows because √ N (cid:101) θ ( −M T ) = O p (1) by Theorem 1, (cid:98) Ω → p Ω and (cid:98) Ω → p Ω by (A.1), and √ N λ j = o p (1) (1 ≤ j ≤ d ) by Theorem 2. Further-more, by the matrix inverse formula, it holds that Ω − Ω = − Σ Σ − . Consequently,we have √ N ( (cid:101) θ ( M T ) λ − θ ( M T )0 ) = √ N ( (cid:101) θ ( M T ) − θ ( M T )0 ) − Σ Σ − ( √ N (cid:101) θ ( −M T ) ) + o p (1) . By Theorem 1, we have that the above is asymptotically normal with a mean of 0,and the inverse asymptotic covariance matrix is given by (Σ − Σ Σ − Σ ) − = Ω .By condition (C6), we have Ω = Ω M T . Consequently, the estimator (cid:101) θ ( M ) λ shares thesame asymptotic distribution with the oracle estimator (cid:98) θ ( M T ) M T . Appendix A.4: Proof of Theorem 4
Appendix A.4: Proof of Theorem 4

To establish the selection consistency property of the DBIC, we consider the following two cases for any $\mathcal{M}_\lambda \ne \mathcal{M}_T$. The first case is the underfitted case, and the second is the overfitted case.
1. Underfitted Model.
Note that $\lambda_N$ satisfies the condition in Theorem 2. Consequently, $\widetilde\theta_{\lambda_N}$ is $\sqrt{N}$-consistent, and we thus also have $\mathrm{DBIC}_{\lambda_N} = o_p(1)$. For $\mathcal{M} \not\supset \mathcal{M}_T$, it can be derived that
\[
\mathrm{DBIC}_\lambda = (\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta) + \mathrm{df}_\lambda (\log N)/N
\ge (\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta).
\]
Define $\widetilde\theta_{\mathcal{M}} = \arg\min_{\theta \in \mathbb{R}^p:\, \theta_j = 0,\, \forall j \notin \mathcal{M}} (\theta - \widetilde\theta)^\top \widehat\Sigma^{-1} (\theta - \widetilde\theta)$ as the unpenalized estimator given the model identified by $\mathcal{M}$. Consequently, by definition, we should have
\[
(\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta)
\ge (\widetilde\theta_{\mathcal{M}_\lambda} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\mathcal{M}_\lambda} - \widetilde\theta)
\ge \min_{\mathcal{M} \not\supset \mathcal{M}_T} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)
\to_p \min_{\mathcal{M} \not\supset \mathcal{M}_T} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)^\top \Sigma^{-1} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta),
\]
where the last convergence is due to (A.1). Because $\Sigma$ is positive definite by Condition (C4), we have $\min_{\mathcal{M} \not\supset \mathcal{M}_T} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)^\top \Sigma^{-1} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta) > 0$. It follows that $P(\inf_{\lambda \in R_-} \mathrm{DBIC}_\lambda > \mathrm{DBIC}_{\lambda_N}) \to 1$.
2. Overfitted Model.
We next consider the overfitted model. Let $\lambda$ be an arbitrary tuning parameter that over-selects the parameters. We then have $\mathrm{df}_\lambda - d \ge 1$. It can then be concluded that
\[
N(\mathrm{DBIC}_\lambda - \mathrm{DBIC}_{\lambda_N})
= N(\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta)
- N(\widetilde\theta_{\lambda_N} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\lambda_N} - \widetilde\theta)
+ (\mathrm{df}_\lambda - d)\log N
\]
\[
\ge N(\widetilde\theta_{\mathcal{M}_\lambda} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\mathcal{M}_\lambda} - \widetilde\theta)
- N(\widetilde\theta_{\lambda_N} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\lambda_N} - \widetilde\theta)
+ \log N
\]
\[
\ge \inf_{\mathcal{M} \supset \mathcal{M}_T} N(\widetilde\theta_{\mathcal{M}} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)
- N(\widetilde\theta_{\lambda_N} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\lambda_N} - \widetilde\theta)
+ \log N. \tag{A.10}
\]
First note that for $\mathcal{M} \supset \mathcal{M}_T$, $\widetilde\theta_{\mathcal{M}}$ is $\sqrt{N}$-consistent. As a result, the first term of (A.10) is $O_p(1)$. Similarly, by Theorem 2, $\widetilde\theta_{\lambda_N}$ is $\sqrt{N}$-consistent; thus, the second term of (A.10) is also $O_p(1)$. As a result, (A.10) diverges to infinity as $N \to \infty$. This implies that $P(\inf_{\lambda \in R_+} \mathrm{DBIC}_\lambda > \mathrm{DBIC}_{\lambda_N}) \to 1$. This completes the proof.
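To illustrate the selection rule that Theorem 4 studies, the following minimal Python sketch evaluates the DBIC expression used above, $\mathrm{DBIC}_\lambda = (\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta) + \mathrm{df}_\lambda \log(N)/N$, over a small set of candidate sparse fits and keeps the minimizer. The inputs (the DLSA estimator, the matrix $\widehat\Sigma^{-1}$, and the candidate estimates from the solution path) are assumed to be available from the DLSA and LARS steps; all function and variable names are ours.

import numpy as np

def dbic(theta_lambda, theta_tilde, sigma_inv, N):
    """DBIC_lambda = (theta_lambda - theta_tilde)' Sigma_hat^{-1} (theta_lambda - theta_tilde)
    + df_lambda * log(N) / N, with df_lambda the number of nonzero coefficients."""
    diff = theta_lambda - theta_tilde
    df = np.count_nonzero(theta_lambda)
    return float(diff @ sigma_inv @ diff) + df * np.log(N) / N

# Toy usage: pick the candidate on the solution path with the smallest DBIC.
# (theta_tilde, sigma_inv, and the candidates would come from the DLSA / LARS step.)
N, p = 10_000, 5
rng = np.random.default_rng(0)
theta_tilde = np.array([1.0, 0.0, 0.5, 0.0, -0.8]) + rng.normal(scale=1 / np.sqrt(N), size=p)
sigma_inv = np.eye(p)
candidates = [np.where(np.abs(theta_tilde) > t, theta_tilde, 0.0) for t in (0.0, 0.1, 0.6, 0.9)]
best = min(candidates, key=lambda th: dbic(th, theta_tilde, sigma_inv, N))
print(best)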
Algorithm 1: Spark implementation
Input:
The model function for modelling each partitioned dataset
Output:
The weighted least squares estimator $\widetilde\theta$, the covariance matrix $\widehat\Sigma$, and $\mathrm{DBIC}_\lambda$.

Steps:

Step (1). Pre-determine the overall cluster available memory as $M_{\mathrm{ram}}$, the total number of CPU cores as $C_{\mathrm{cores}}$, and the total data size to be processed as $D_{\mathrm{total}}$.

Step (2). Define the number of batched chunks $N_{\mathrm{chunks}}$ to allow for out-of-memory data processing. We recommend that $N_{\mathrm{chunks}}$ be at least greater than $3 \times D_{\mathrm{total}}/M_{\mathrm{ram}}$ in a Spark system.

Step (3). Define the number of partitions $P_{\mathrm{partition}} = D_{\mathrm{total}}/(N_{\mathrm{chunks}} \times C_{\mathrm{cores}})$.

Step (4). Define a model function whereby the input is an $n \times (p+2)$ Python Pandas DataFrame containing the response variable, the covariates, and the partition id, and the output is a $p \times (p+1)$ Pandas DataFrame whereby the first column is $\widehat\theta_k$ and the remaining columns store $\widehat\Sigma_k^{-1}$.

Step (5). for i in 1:$N_{\mathrm{chunks}}$ do
  (a). Transfer the data chunk to Spark's distributed DataFrame if the data are stored in another format.
  (b). Randomly assign an integer partition label from $\{1, \ldots, P_{\mathrm{partition}}\}$ to each row of the Spark DataFrame.
  (c). Repartition the DataFrame in the distributed system if the data are not partitioned by the partition label.
  (d). Group the Spark DataFrame by the assigned partition label.
  (e). Apply the model function to each grouped dataset with Spark's Grouped Map Pandas UDFs API and obtain a $(pP_{\mathrm{partition}}) \times (p+1)$ distributed Spark DataFrame $R_i$.
end

Step (6). Aggregate $R_i$ over both partitions and chunks and return the $p \times (p+1)$ matrix $R_{\mathrm{final}}$.

Step (7). Return $\widetilde\theta$ by Equation (2.2), $\widehat\Sigma$, and $\mathrm{DBIC}_\lambda$ by Equation (3.4).

• Because the final step in the DLSA algorithm is carried out on the master node and because data transformation from worker nodes to the master node is required, a special tool called "Apache Arrow" (https://arrow.apache.org/) is plugged into our system to allow efficient data transformation between Spark's distributed DataFrame and Python's Pandas DataFrame.
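To make Steps (4)–(7) concrete, the following minimal Python sketch mimics the per-partition model function and the master-side combination. It uses plain NumPy and Pandas rather than a live Spark session, assumes a local linear model on each partition, and assumes that the combined estimator in Equation (2.2) takes the weighted least squares form $\widetilde\theta = (\sum_k \widehat\Sigma_k^{-1})^{-1} \sum_k \widehat\Sigma_k^{-1} \widehat\theta_k$; all function and variable names are ours. In the actual implementation, the model function would be applied to each group through Spark's Grouped Map Pandas UDFs API as in Step 5(e).

import numpy as np
import pandas as pd

def model_fun(pdf: pd.DataFrame) -> pd.DataFrame:
    """Worker-side model function (Step 4): fit a local OLS model on one partition and
    return a p x (p + 1) DataFrame whose first column is theta_hat_k and whose remaining
    columns store a matrix proportional to Sigma_hat_k^{-1} (sigma^2 absorbed for simplicity)."""
    y = pdf["y"].to_numpy()
    X = pdf.drop(columns=["y", "partition_id"]).to_numpy()
    XtX = X.T @ X                                   # local weight matrix W_k
    theta_k = np.linalg.solve(XtX, X.T @ y)          # local estimator theta_hat_k
    return pd.DataFrame(np.column_stack([theta_k, XtX]))

def combine(results):
    """Master-side combination (Steps 6-7), assuming Equation (2.2) has the form
    theta_tilde = (sum_k W_k)^{-1} sum_k W_k theta_hat_k with W_k = Sigma_hat_k^{-1}."""
    p = results[0].shape[0]
    W_sum, Wtheta_sum = np.zeros((p, p)), np.zeros(p)
    for res in results:
        theta_k = res.iloc[:, 0].to_numpy()
        W_k = res.iloc[:, 1:].to_numpy()
        W_sum += W_k
        Wtheta_sum += W_k @ theta_k
    return np.linalg.solve(W_sum, Wtheta_sum)

# Toy usage with simulated partitions standing in for Spark groups.
rng = np.random.default_rng(42)
theta0 = np.array([1.0, -0.5, 0.0, 2.0])
partitions = []
for k in range(5):
    X = rng.normal(size=(1000, 4))
    pdf = pd.DataFrame(X, columns=[f"x{j}" for j in range(4)])
    pdf["y"] = X @ theta0 + rng.normal(size=1000)
    pdf["partition_id"] = k
    partitions.append(pdf)

theta_tilde = combine([model_fun(pdf) for pdf in partitions])
print(theta_tilde)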
References

Apache Software Foundation (2019a), "Apache Hadoop."

— (2019b), "Apache Spark."

Battey, H., Fan, J., Liu, H., Lu, J., and Zhu, Z. (2015), "Distributed estimation and inference with statistical guarantees," arXiv preprint arXiv:1509.05457.

Cameron, A. C. and Trivedi, P. K. (2013), Regression Analysis of Count Data, vol. 53, Cambridge University Press.

Chang, X., Lin, S.-B., and Wang, Y. (2017a), "Divide and conquer local average regression," Electronic Journal of Statistics, 11, 1326–1350.

Chang, X., Lin, S.-B., and Zhou, D.-X. (2017b), "Distributed semi-supervised learning with kernel ridge regression," The Journal of Machine Learning Research, 18, 1493–1514.

Chen, J. and Chen, Z. (2008), "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, 95, 759–771.

Chen, T. and Guestrin, C. (2016), "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 785–794.

Chen, X., Liu, W., and Zhang, Y. (2018), "Quantile regression under memory constraint," arXiv preprint arXiv:1810.08264.

Chen, X. and Xie, M.-g. (2014), "A split-and-conquer approach for analysis of extraordinarily large data," Statistica Sinica, 1655–1684.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least angle regression," Annals of Statistics, 32, 407–499.

Fan, J. and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96, 1348–1360.

— (2002), "Variable selection for Cox's proportional hazards model and frailty model," The Annals of Statistics, 30, 74–99.

Fan, J., Wang, D., Wang, K., and Zhu, Z. (2017), "Distributed estimation of principal eigenspaces," arXiv preprint arXiv:1702.06488.

Harrell Jr, F. E. (2015), Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Springer.

Heinze, C., McWilliams, B., and Meinshausen, N. (2016), "Dual-loco: Distributing statistical estimation using random projections," in Artificial Intelligence and Statistics, pp. 875–883.

Hosmer Jr, D. W., Lemeshow, S., and Sturdivant, R. X. (2013), Applied Logistic Regression, vol. 398, John Wiley & Sons.

Jordan, M. I., Lee, J. D., and Yang, Y. (2018), "Communication-efficient distributed statistical inference," Journal of the American Statistical Association, 1–14.

Lee, J. D., Sun, Y., Liu, Q., and Taylor, J. E. (2015), "Communication-efficient sparse regression: a one-shot approach," arXiv preprint arXiv:1503.04337.

Lehmann, E. L. and Casella, G. (2006), Theory of Point Estimation, Springer Science & Business Media.

Li, X., Li, R., Xia, Z., and Xu, C. (2019), "Distributed feature screening via componentwise debiasing," arXiv preprint arXiv:1903.03810.

Liu, Q. and Ihler, A. T. (2014), "Distributed estimation, information loss and exponential families," in Advances in Neural Information Processing Systems, pp. 1098–1106.

Shamir, O., Srebro, N., and Zhang, T. (2014), "Communication-efficient distributed optimization using an approximate Newton-type method," in International Conference on Machine Learning, pp. 1000–1008.

Shao, J. (1997), "An asymptotic theory for linear model selection," Statistica Sinica, 221–242.

Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M. I., and Jaggi, M. (2018), "CoCoA: A General Framework for Communication-Efficient Distributed Optimization," Journal of Machine Learning Research, 18, 1–49.

Tibshirani, R. (1996), "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, 267–288.

Wang, H. and Leng, C. (2007), "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, 102, 1039–1048.

Wang, H., Li, R., and Tsai, C.-L. (2007), "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, 94, 553–568.

Wang, H., Zhu, R., and Ma, P. (2018), "Optimal subsampling for large sample logistic regression," Journal of the American Statistical Association, 113, 829–844.

Wang, J., Kolar, M., Srebro, N., and Zhang, T. (2017a), "Efficient distributed learning with sparsity," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, pp. 3636–3645.

Wang, J., Wang, W., and Srebro, N. (2017b), "Memory and communication efficient distributed stochastic optimization with minibatch prox," in Proceedings of the 2017 Conference on Learning Theory, eds. Kale, S. and Shamir, O., Amsterdam, Netherlands: PMLR, vol. 65 of Proceedings of Machine Learning Research, pp. 1882–1919.

Wang, L., Kim, Y., and Li, R. (2013), "Calibrating non-convex penalized regression in ultra-high dimension," Annals of Statistics, 41, 2505.

Yang, J., Mahoney, M. W., Saunders, M., and Sun, Y. (2016), "Feature-distributed sparse regression: a screen-and-clean approach," in Advances in Neural Information Processing Systems, pp. 2712–2720.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012), "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, pp. 2–2.

Zhang, C.-H. (2010), "Nearly unbiased variable selection under minimax concave penalty," Annals of Statistics, 38, 894–942.

Zhang, H. H. and Lu, W. (2007), "Adaptive Lasso for Cox's proportional hazards model," Biometrika, 94, 691–703.

Zhang, T. (2004), "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in Proceedings of the Twenty-First International Conference on Machine Learning, ACM, p. 116.

Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2013), "Communication-efficient algorithms for statistical optimization," The Journal of Machine Learning Research, 14, 3321–3363.

Zou, H. (2006), "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 101, 1418–1429.

Zou, H. and Li, R. (2008), "One-step sparse estimates in nonconcave penalized likelihood models," Annals of Statistics, 36, 1509.

Zou, H. and Zhang, H. H. (2009), "On the adaptive elastic-net with a diverging number of parameters," Annals of Statistics, 37, 1733.

Table 2: Simulation results for Example 1 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      1.00  0.99  1.00  1.00  1.00  1.00  1.00  1.00
10    5    CSL     1.00  0.99  1.00  0.98  0.99  0.99  0.99  0.99
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
20    5    OS      1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    CSL     0.99  1.00  0.99  0.99  0.99  1.00  0.99  1.00
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
100   10   OS      1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   CSL     0.99  0.99  1.00  0.99  0.99  0.99  0.99  0.99
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      1.00  0.99  0.99  1.00  0.99  0.98  0.99  0.99
10    5    CSL     0.97  0.97  0.98  0.98  0.98  1.00  0.98  0.98
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.16  -     -     1.25  -     -     1.25  -     3.01  1.00
20    5    OS      0.99  0.99  0.99  1.01  0.99  0.99  0.99  0.99
20    5    CSL     0.99  0.99  0.98  0.97  0.98  1.00  1.00  0.99
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.15  -     -     1.23  -     -     1.20  -     3.01  1.00
100   10   OS      0.99  0.99  1.00  1.00  0.99  0.99  0.99  0.99
100   10   CSL     0.98  0.97  0.95  0.98  0.96  0.97  0.96  0.95
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.15  -     -     1.22  -     -     1.28  -     3.00  1.00

Table 3: Simulation results for Example 2 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      0.92  0.98  0.99  0.95  0.98  0.98  0.94  0.99
10    5    CSL     0.94  1.00  0.99  0.96  0.97  0.99  0.94  0.98
10    5    DLSA    0.98  1.01  1.01  0.99  1.01  1.01  0.98  1.01
10    5    SDLSA   0.91  -     -     0.92  -     -     0.91  -     3.01  1.00
20    5    OS      0.95  1.00  1.00  0.96  0.99  0.99  0.96  0.99
20    5    CSL     0.99  0.99  0.99  1.00  1.00  0.99  0.99  0.99
20    5    DLSA    1.00  1.00  1.00  1.01  1.00  1.00  1.00  1.00
20    5    SDLSA   0.96  -     -     1.00  -     -     0.97  -     3.01  1.00
100   10   OS      0.95  1.00  1.00  0.97  1.00  1.00  0.97  1.00
100   10   CSL     0.99  0.99  1.00  0.99  1.00  1.00  0.98  0.99
100   10   DLSA    1.01  1.00  1.00  1.00  1.00  1.00  0.99  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     0.98  -     3.01  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      0.79  0.86  0.89  0.83  0.88  0.86  0.82  0.90
10    5    CSL     0.24  0.31  0.35  0.27  0.33  0.30  0.26  0.28
10    5    DLSA    0.96  1.01  1.01  0.98  1.01  1.01  0.99  1.01
10    5    SDLSA   0.86  -     -     1.00  -     -     1.00  -     3.00  1.00
20    5    OS      0.86  0.87  0.88  0.87  0.90  0.94  0.89  0.89
20    5    CSL     0.33  0.37  0.41  0.32  0.36  0.37  0.32  0.32
20    5    DLSA    0.98  1.01  1.00  0.99  1.00  1.00  0.97  1.01
20    5    SDLSA   0.96  -     -     1.07  -     -     1.03  -     3.01  1.00
100   10   OS      0.92  0.94  0.92  0.93  0.91  0.92  0.91  0.92
100   10   CSL     0.21  0.24  0.24  0.21  0.23  0.25  0.21  0.21
100   10   DLSA    0.96  1.00  1.00  1.00  1.00  1.00  0.97  1.00
100   10   SDLSA   0.97  -     -     1.09  -     -     1.07  -     3.00  1.00

Table 4: Simulation results for Example 3 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      0.98  0.99  1.00  0.99  1.00  1.00  0.99  0.99
10    5    CSL     0.93  0.95  0.94  0.89  0.93  0.95  0.94  0.93
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
20    5    OS      1.00  0.99  1.00  0.99  0.99  0.99  0.99  1.00
20    5    CSL     0.94  0.98  0.97  0.96  0.98  0.96  0.96  0.99
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.00  -     -     1.01  -     -     1.00  -     3.00  1.00
100   10   OS      1.00  1.00  1.00  0.99  1.00  1.00  0.99  0.99
100   10   CSL     0.97  0.98  0.97  0.95  0.96  0.98  0.98  0.97
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.00  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      0.59  0.62  0.68  0.59  0.65  0.64  0.68  0.68
10    5    CSL     0.01  0.02  0.03  0.01  0.03  0.03  0.02  0.03
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.14  -     -     1.27  -     -     1.26  -     3.00  1.00
20    5    OS      0.59  0.67  0.65  0.62  0.63  0.63  0.70  0.64
20    5    CSL     0.01  0.03  0.03  0.01  0.03  0.03  0.02  0.03
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.18  -     -     1.28  -     -     1.31  -     3.00  1.00
100   10   OS      0.62  0.68  0.75  0.67  0.69  0.69  0.75  0.69
100   10   CSL     0.01  0.02  0.02  0.01  0.03  0.03  0.02  0.03
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.15  -     -     1.28  -     -     1.23  -     3.01  1.00

Table 5: Simulation results for Example 4 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      0.97  0.99  0.99  0.95  0.99  0.98  0.99  0.99
10    5    CSL     0.97  0.99  0.98  0.99  0.99  0.98  0.96  0.99
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
20    5    OS      0.98  1.00  1.00  0.97  1.00  0.99  0.99  1.00
20    5    CSL     0.99  0.99  0.98  1.00  0.99  0.97  0.99  0.98
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   0.99  -     -     1.00  -     -     1.00  -     3.01  1.00
100   10   OS      1.00  1.00  0.99  0.99  1.00  1.00  1.00  1.00
100   10   CSL     0.98  0.98  1.00  0.99  0.99  0.99  0.99  1.00
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.00  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      0.65  0.75  0.71  0.72  0.74  0.75  0.70  0.70
10    5    CSL     0.04  0.05  0.05  0.04  0.05  0.05  0.05  0.04
10    5    DLSA    0.95  0.98  0.99  0.98  0.99  0.98  0.97  0.95
10    5    SDLSA   1.01  -     -     1.04  -     -     1.05  -     3.00  1.00
20    5    OS      0.70  0.80  0.77  0.73  0.75  0.74  0.75  0.70
20    5    CSL     0.05  0.06  0.06  0.05  0.05  0.05  0.06  0.05
20    5    DLSA    0.96  0.99  0.99  0.98  0.98  0.99  0.99  0.95
20    5    SDLSA   1.01  -     -     1.10  -     -     1.12  -     3.00  1.00
100   10   OS      0.80  0.78  0.80  0.79  0.78  0.81  0.77  0.78
100   10   CSL     0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.03
100   10   DLSA    0.98  0.98  1.00  0.98  0.98  0.99  0.98  0.97
100   10   SDLSA   1.04  -     -     1.07  -     -     1.08  -     3.00  1.00

Table 6: Simulation results for Example 5 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      0.97  0.99  0.99  0.95  0.99  0.99  0.99  0.99
10    5    CSL     0.99  1.00  0.99  0.99  0.99  0.99  0.98  0.99
10    5    DLSA    1.00  1.00  1.00  0.99  1.00  1.00  1.00  1.00
10    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
20    5    OS      0.99  1.00  1.00  0.98  1.00  1.00  0.99  1.00
20    5    CSL     0.99  1.00  0.99  0.99  1.00  1.00  0.99  0.99
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
100   10   OS      0.99  1.00  1.00  0.98  1.00  1.00  1.00  1.00
100   10   CSL     0.98  0.99  1.00  0.99  0.99  0.99  0.99  0.99
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.00  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      0.88  0.90  0.93  0.91  0.91  0.94  0.91  0.91
10    5    CSL     0.14  0.19  0.20  0.20  0.20  0.18  0.19  0.12
10    5    DLSA    0.95  0.95  0.98  0.98  0.96  0.98  0.97  0.92
10    5    SDLSA   0.90  -     -     0.98  -     -     0.93  -     3.01  1.00
20    5    OS      0.92  0.93  0.95  0.94  0.93  0.93  0.94  0.93
20    5    CSL     0.13  0.20  0.19  0.20  0.19  0.19  0.19  0.12
20    5    DLSA    0.94  0.96  0.98  0.99  0.97  0.99  0.96  0.93
20    5    SDLSA   0.92  -     -     1.03  -     -     1.00  -     3.01  1.00
100   10   OS      0.89  0.95  0.97  0.96  0.97  0.95  0.96  0.93
100   10   CSL     0.09  0.13  0.14  0.15  0.13  0.14  0.13  0.09
100   10   DLSA    0.93  0.97  0.97  0.98  0.98  0.97  0.98  0.93
100   10   SDLSA   0.88  -     -     0.99  -     -     1.02  -     3.00  1.00

Table 7: Variable description for the U.S. airline data. Numerical variables are standardized to have mean zero and variance one.

Variable      Description                                              Variable used in the model
Delayed       Whether the flight is delayed (Yes/No)                   Used as the response variable
Year          Year of the flight                                       Used as numerical variable
DayofMonth    Which day of the month                                   Used as numerical variable
Month         Which month of the year                                  Converted to 11 dummies
DayofWeek     Which day of the week                                    Converted to 6 dummies
DepTime       Actual departure time                                    Used as numerical variable
CRSDepTime    Scheduled departure time                                 Used as numerical variable
CRSArrTime    Scheduled arrival time                                   Used as numerical variable
ElapsedTime   Actual elapsed time                                      Used as numerical variable
Distance      Distance between the origin and destination in miles    Used as numerical variable
Carrier       Flight carrier code                                      Top carriers converted to 7 dummies
Destination   Destination of the flight                                Top destination cities converted to 75 dummies
Origin        Departing origin of the flight                           Top origin cities converted to 75 dummies