Least Squares Approximation for a Distributed System
Xuening Zhu, Feng Li, and Hansheng Wang
School of Data Science, Fudan University, Shanghai, China
School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, China
Guanghua School of Management, Peking University, Beijing, China
Abstract
In this work, we develop a distributed least squares approximation (DLSA) method that is able to solve a large family of regression problems (e.g., linear regression, logistic regression, and Cox's model) on a distributed system. By approximating the local objective function using a local quadratic form, we are able to obtain a combined estimator by taking a weighted average of local estimators. The resulting estimator is proved to be statistically as efficient as the global estimator. Moreover, it requires only one round of communication. We further conduct shrinkage estimation based on the DLSA estimation using an adaptive Lasso approach. The solution can be easily obtained by using the LARS algorithm on the master node. It is theoretically shown that the resulting estimator possesses the oracle property and is selection consistent by using a newly designed distributed Bayesian information criterion (DBIC). The finite sample performance and the computational efficiency are further illustrated by an extensive numerical study and an airline dataset. The airline dataset is 52 GB in size. The entire methodology has been implemented in Python for a de facto standard Spark system. The proposed DLSA algorithm on the Spark system takes 26 minutes to obtain a logistic regression estimator, whereas a full likelihood algorithm takes 15 hours to obtain an inferior result.
KEY WORDS:
Distributed System; Least Squares Approximation; Shrinkage Estimation; Distributed BIC.

∗ Xuening Zhu is supported by the National Natural Science Foundation of China (nos. 11901105, U1811461), the Shanghai Sailing Program for Youth Science and Technology Excellence (19YF1402700), and the Fudan-Xinzailing Joint Research Centre for Big Data, School of Data Science, Fudan University. Feng Li's research is supported by the National Natural Science Foundation of China (no. 11501587). Hansheng Wang's research is partially supported by the National Natural Science Foundation of China (nos. 11831008, 11525101, 71532001, and 71332006). It is also supported in part by China's National Key Research Special Program (no. 2016YFC0207704). Feng Li is the corresponding author.

1. INTRODUCTION

Modern data analysis often needs to address huge datasets. In many cases, the size of the dataset could be too large to be conveniently handled by a single computer. Consequently, the dataset must be stored and processed on many connected computer nodes, which are thereafter referred to as a distributed system. More precisely, a distributed system refers to a large cluster of computers, which are typically connected with each other via wire protocols such as RPC and HTTP (Zaharia et al., 2012). Consequently, they are able to communicate with each other and accomplish the intended data analysis tasks at huge scales in a collective manner.

By using a distributed system, we are able to break a large-scale computation problem into many small pieces and then solve them in a distributed manner. A key challenge faced by statistical computation on a distributed system is the communication cost. The communication cost refers to the wall-clock time needed for data communication between different computer nodes, which could be expensive in distributed systems (Zhang et al., 2013; Shamir et al., 2014; Jordan et al., 2018). In this work, we consider a "master-and-worker"-type distributed system with strong workers. We assume that the workers are strong in the sense that they are modern computers with reasonable storage and computing capacities. For example, a worker with 32 CPU cores, 128 GB RAM, and a 512 GB SSD hard disk could be a very strong worker. As one will see later, the most widely used systems, Hadoop (Apache Software Foundation, 2019a) and Spark (Apache Software Foundation, 2019b), belong to this category. Typically, the workers do not communicate with each other directly. However, they should be connected to a common master node, which is another computer with outstanding capacities. Consequently, most data should be distributed to the workers, and most computations should be conducted by the workers. This enables us to solve a large-scale computation problem in a distributed manner. In contrast, the master should take the responsibility to coordinate with different workers.

For this "master-and-worker"-type distributed system, the communication cost is mostly between the master and workers. One can easily verify that good algorithms for some simple moment estimates (e.g., the sample mean) can be easily developed using this type of distributed system. For example, to compute the sample mean on a distributed system, one can first compute the sample mean on each worker, which is known as a map process. Then, each worker reports to the master the resulting sample mean and the associated sample size.
Thereafter, the master can compute the overall sample mean by a weighted average of the sample means from each worker, which is known as a reduce process. Such a "MapReduce" algorithm requires only one "master-and-worker" communication for each worker. It requires no direct communication between workers. Because most computations are accomplished by the workers, it also makes good use of the strong worker capacities. As a result, the algorithm can be considered effective. Unfortunately, cases such as the sample mean are rather rare in statistical analysis. Most statistical algorithms do not have an analytical solution (e.g., the maximum likelihood estimation of a logistic regression model) and thus require multiple iterations (e.g., Newton-Raphson iteration or stochastic-gradient-descent-type algorithms). These iterations unfortunately lead to substantial "master-and-worker" communication, which is communicationally expensive. Therefore, developing algorithms that are highly efficient computationally, communicationally, and statistically for distributed systems has become a problem of great interest.

In the literature, the common wisdom for addressing a distributed statistical problem can be classified into two categories. The first category is the "one-shot" (OS) or "embarrassingly parallel" approach, which requires only one round of communication. Specifically, the local workers compute the estimators in parallel and then communicate them to the master to obtain an averaged global estimator (Zhang et al., 2013; Liu and Ihler, 2014; Lee et al., 2015; Battey et al., 2015; Fan et al., 2017; Chang et al., 2017b,a). Although this approach is highly efficient in terms of communication, it might not achieve the best efficiency in statistical estimation on most occasions (Shamir et al., 2014; Jordan et al., 2018). The second approach includes iterative algorithms, which require multiple rounds of communication between the master and the workers. This approach typically requires multiple iterations to be taken so that the estimation efficiency can be refined to match the global (or centralized) estimator (Shamir et al., 2014; Wang et al., 2017a,b; Jordan et al., 2018). In addition, see Yang et al. (2016); Heinze et al. (2016); Smith et al. (2018); Li et al. (2019) for distributed statistical modelling methods when the data are distributed according to features rather than samples.

The aforementioned two approaches have also been studied for the sparse learning problem using ℓ1 shrinkage estimation. For the first approach, Lee et al. (2015) investigated distributed high-dimensional sparse regression using the OS approach by combining local debiased ℓ1 estimates. Battey et al. (2015) revisited the same problem but further considered distributed testing and estimation methods in a unified likelihood framework, in which a refitted estimation is used to obtain an oracle convergence rate. For the second approach, both Wang et al. (2017a) and Jordan et al. (2018) developed iterative algorithms to solve the sparse estimation problem, and they theoretically proved that the error bounds match those of the centralized estimator. Beyond ℓ1 shrinkage estimation, Chen and Xie (2014) studied a penalized likelihood estimator with more general penalty function forms in a high-dimensional setting. However, to the best of our knowledge, there are no guarantees that simultaneously ensure model selection consistency (Fan and Li, 2001) and establish a criterion for consistent tuning parameter selection (Wang et al., 2007).
In addition, all of the above methods assume independent and identically distributed samples stored by each worker, which is questionable in practice because the distributed dataset might exhibit great heterogeneity from worker to worker. We would like to remark that this heterogeneity cannot be avoided because it is mainly due to the practical need to record data across time or space (for example).

In this work, we aim to develop a novel methodology to address a sparse estimation problem with low dimensions (p < n, where n is the local sample size). Under this setting, we pay greater attention to the communication cost caused by the iterations rather than that of transmitting digits. The data possessed by different workers are allowed to be heterogeneous but share the same regression relationship. The proposed method borrows the idea of the least squares approximation (LSA, Wang and Leng, 2007) and can be used to handle a large class of parametric regression models on a distributed system. Specifically, let Y ∈ R be the response of interest, let X be the associated predictor with finite dimension, and let θ ∈ R^p be the corresponding regression coefficient. The objective is to estimate the regression parameter θ and conduct variable selection on a distributed system that has one master and many strong workers. Assume data, denoted by (Y_i, X_i) with 1 ≤ i ≤ N, that are distributed across different workers. Further, assume that the sample size on each worker is sufficiently large and of the same order. Under this setting, we propose a distributed LSA (DLSA) method. The key idea is as follows:

(1) First, we estimate the parameter θ on each worker separately by using the local data on the distributed workers. This can be done efficiently by using standard statistical estimation methods (e.g., maximum likelihood estimation). By assuming that the sample size of each worker is sufficiently large, the resulting estimator and its asymptotic covariance estimate should be consistent but not statistically efficient, as compared with the global estimates.

(2) Next, each worker passes the local estimator of θ and its asymptotic covariance estimate to the master. Because we do not consider a high-dimensional model setting, the communication cost in this regard should be negligible.

(3) Once the master receives all the local estimators from the workers, a weighted least squares-type objective function can be constructed. This can be viewed as a local quadratic approximation of the global log-likelihood function. As one can expect, the resulting estimator shares the same asymptotic covariance with the full-sample MLE (i.e., the global estimator) under appropriate regularity conditions.
Figure 1: Illustration of the DLSA method.
The major steps of the DLSA method are further illustrated in Figure 1. As one can see, the DLSA method reduces communication costs mainly by using only one round of communication, and it avoids further iterative steps. Given the DLSA objective function on the master node, we can further conduct shrinkage estimation on the master. This is done by formulating an adaptive Lasso-type (Zou, 2006; Zhang and Lu, 2007) objective function. The objective function can be easily solved by the LARS algorithm (Efron et al., 2004) with minimal computation cost on the master. Thus, no communication is required. Accordingly, a solution path can be obtained on the master node. Thereafter, the best estimator can be selected from the solution path in conjunction with the proposed distributed Bayesian information criterion (DBIC). We theoretically show that the resulting estimator is selection consistent and as efficient as the oracle estimator, which is the global estimator obtained under the true model.

To summarize, we aim to make the following important contributions to the existing literature. First, we propose a master with strong workers (MSW) distributed system framework, which solves a large-scale computation problem in a communication-efficient way. Second, given this MSW system, we propose a novel DLSA method, which easily handles a large class of classical regression models such as linear regression, generalized linear regression, and Cox's model. Third, due to the simple quadratic form of the objective function, the analytical solution path can be readily obtained using the LARS algorithm on the master. Then, the best model can be easily selected by the DBIC criterion. Finally, but also most importantly, the proposed DLSA method fully takes advantage of the specialty of the MSW system, which pushes the intensive computation to the workers and therefore is as computationally, communicationally, and statistically efficient as possible. Furthermore, we would like to remark that although the proposed DLSA is designed for a distributed system, it can also be applied on a single computer when there are memory constraints (Chen et al., 2018; Wang et al., 2018).

The remainder of this article is organized as follows. Section 2 introduces the model setting and the least squares approximation method. Section 3 presents a communication-efficient shrinkage estimation and a distributed BIC criterion. Numerical studies are given in Section 4. An application to U.S. airline data with a dataset greater than 52 GB is illustrated using the DLSA method on the Spark system in Section 5. The article concludes with a brief discussion in Section 6. All technical details are delegated to the Appendix.
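To make steps (1) and (2) concrete, the following minimal sketch shows what a single worker would compute and ship to the master for a logistic regression. It is written in plain Python with statsmodels; the function name and the choice of statsmodels are our own illustrative assumptions and are not part of the paper's implementation.

```python
import numpy as np
import statsmodels.api as sm

def local_worker_summary(X_k, y_k):
    """Fit a logistic regression on one worker's data and return what DLSA
    needs on the master: the local estimate, the estimated covariance of
    that estimate, and the local sample size."""
    res = sm.Logit(y_k, X_k).fit(disp=0)      # local MLE on worker k
    theta_hat_k = np.asarray(res.params)      # p-dimensional local estimate
    cov_hat_k = np.asarray(res.cov_params())  # estimated cov(theta_hat_k), i.e. Sigma_k / n_k
    return theta_hat_k, cov_hat_k, len(y_k)
```

Only these O(p^2) numbers travel to the master, which is why a single round of communication suffices.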
2. STATISTICAL MODELLING ON DISTRIBUTED SYSTEMS
Suppose that in the distributed system there are in total N observations, which are indexed as i = 1, ..., N. The i-th observation is denoted as Z_i = (X_i^T, Y_i)^T ∈ R^{p+1}, where Y_i ∈ R is the response of interest and X_i ∈ R^p is the corresponding covariate vector. Specifically, the observations are distributed across K local workers. Define S = {1, ..., N} to be the full set of sample observations and decompose S = ∪_{k=1}^K S_k, where S_k collects the observations distributed to the k-th worker. Obviously, we should have S_{k1} ∩ S_{k2} = ∅ for any k1 ≠ k2. Define n = N/K as the average sample size for each worker. Then, we assume |S_k| = n_k and that all n_k diverge at the same order O(n). Specifically, c_1 ≤ min_k n_k/n ≤ max_k n_k/n ≤ c_2 for some positive constants c_1 and c_2. We know immediately that N = Σ_k n_k. In practice, due to the data storing strategy, the data on different workers could be quite heterogeneous, e.g., they might be collected according to spatial regions. Despite the heterogeneity here, we assume they share the same regression relationship, and the parameter of interest is given by θ ∈ R^p. We focus on the case in which p is fixed.

Let L(θ; Z) be a plausible twice-differentiable loss function. Define the global loss function as L(θ) = N^{-1} Σ_{i=1}^N L(θ; Z_i), whose global minimizer is θ̂ = argmin_θ L(θ), and the true value is θ_0. It is assumed that θ̂ admits the following asymptotic rule:

  √N (θ̂ − θ_0) →_d N(0, Σ)

for some positive definite matrix Σ ∈ R^{p×p} as N → ∞. If L(θ; Z) is the negative log-likelihood function, then θ̂ is the global MLE. Correspondingly, define the local loss function on the k-th worker as L_k(θ) = n_k^{-1} Σ_{i∈S_k} L(θ; Z_i), whose minimizer is θ̂_k = argmin_θ L_k(θ). We assume that

  √n_k (θ̂_k − θ_0) →_d N(0, Σ_k)

as n_k → ∞ for a positive definite matrix Σ_k. The goal is to conduct statistical analysis based on the data on the local workers and to minimize the communication cost as much as possible.

In this section, we motivate our approach through a least squares approximation to the global loss function, which takes a local quadratic form. To motivate this idea, we begin by decomposing and approximating the global loss function using Taylor's expansion as follows:

  L(θ) = N^{-1} Σ_{k=1}^K Σ_{i∈S_k} L(θ; Z_i)
       = N^{-1} Σ_{k=1}^K Σ_{i∈S_k} { L(θ; Z_i) − L(θ̂_k; Z_i) } + C_1
       ≈ N^{-1} Σ_{k=1}^K Σ_{i∈S_k} (θ − θ̂_k)^T L̈(θ̂_k; Z_i) (θ − θ̂_k) + C_2,      (2.1)

where the last equation uses the fact that L̇_k(θ̂_k) = 0, and C_1 and C_2 are constants. Typically, the minimizer θ̂_k will achieve the convergence rate √n_k. Intuitively, the quadratic form in (2.1) should be a good local approximation of the global loss function. This inspires us to consider the following weighted least squares objective function:

  L̃(θ) = N^{-1} Σ_k (θ − θ̂_k)^T { Σ_{i∈S_k} L̈(θ̂_k; Z_i) } (θ − θ̂_k)
        =_def Σ_k (θ − θ̂_k)^T α_k Σ̂_k^{-1} (θ − θ̂_k),

where α_k = n_k/N. This leads to a weighted least squares estimator (WLSE), which takes the following analytical form:
  θ̃ = argmin_θ L̃(θ) = ( Σ_k α_k Σ̂_k^{-1} )^{-1} ( Σ_k α_k Σ̂_k^{-1} θ̂_k ).      (2.2)

It is remarkable that the estimator θ̃ in (2.2) can be easily computed on a distributed system. Specifically, each local worker sends θ̂_k and Σ̂_k to the master node, and then the master node produces the WLSE by (2.2). As a result, the above WLSE requires only one round of communication. Hence, it is highly efficient in terms of communication. Note that instead of taking a simple average of the local estimators θ̂_k as in the literature, the analytical solution in (2.2) takes a weighted average of the θ̂_k using the weights α_k Σ̂_k^{-1}. This will result in higher statistical efficiency if the data are stored heterogeneously. To investigate the asymptotic properties of the WLSE, we assume the following conditions.

(C1) (Parameter Space) The parameter space Θ is a compact and convex subset of R^p. In addition, the true value θ_0 lies in the interior of Θ.

(C2) (Covariate Distribution) Assume the covariates X_i (i ∈ S_k) from the k-th worker are independently and identically distributed from the distribution F_k(x).

(C3) (Identifiability) For any δ > 0, there exists ε > 0 such that

  lim_{n→∞} P{ inf_{‖θ* − θ_0‖ ≥ δ, 1 ≤ k ≤ K} ( L_k(θ*) − L_k(θ_0) ) ≥ ε } = 1,  and  E{ ∂L_k(θ)/∂θ |_{θ=θ_0} } = 0.

(C4) (Local Convexity) Define

  Ω_k(θ) = E{ ∂L(θ; Z_i)/∂θ · ∂L(θ; Z_i)/∂θ^T | i ∈ S_k } = E{ ∂²L(θ; Z_i)/∂θ∂θ^T | i ∈ S_k }

for i ∈ S_k. Assume that Ω_k(θ) is nonsingular at the true value θ_0. In addition, let Σ_k = {Ω_k(θ_0)}^{-1} and Σ = {Σ_k α_k Ω_k(θ_0)}^{-1}.

(C5) (Smoothness) Define B(δ) = {θ* ∈ Θ : ‖θ* − θ_0‖ ≤ δ} as a ball around the true value θ_0 with radius δ > 0. Assume for almost all Z that the loss function L(θ; Z) admits all third derivatives ∂³L(θ; Z)/(∂θ_i ∂θ_j ∂θ_l) for all θ ∈ B(δ). In addition, assume that there exist functions M_ijl(Z) and a constant δ > 0 such that

  | ∂³L(θ*; Z)/(∂θ_i ∂θ_j ∂θ_l) | ≤ M_ijl(Z)  for all θ* ∈ B(δ),      (2.3)

where E{ M_ijl(Z_m) | m ∈ S_k } < ∞ for all 1 ≤ i, j, l ≤ p and 1 ≤ k ≤ K.

The above conditions are standard for establishing the asymptotic properties of M-estimators. First, Condition (C1) assumes the parameter space to be convex (Jordan et al., 2018). Next, Condition (C2) concerns the distribution of the covariates {X_i : i ∈ S_k}. Specifically, the distribution F_k(x) may differ across 1 ≤ k ≤ K. In particular, it allows for a heterogeneous distribution of the covariates across workers. We would like to remark that heterogeneity is a common phenomenon in distributed systems, and it has been ignored in much of the literature. Condition (C3) assures the identifiability of the local loss functions across all workers. Finally, Conditions (C4) and (C5) are standard regularity conditions on the loss functions, which require certain degrees of local convexity and smoothness. These conditions are widely assumed in the literature to guarantee the asymptotic convergence of the estimators (Fan and Li, 2001; Lehmann and Casella, 2006; Jordan et al., 2018). Given the conditions, we can establish the asymptotic properties of the WLSE in the following Proposition 1 and Theorem 1.

Proposition 1.
Assume Conditions (C1)–(C5). Then, we have

  √N (θ̃ − θ_0) = V(θ_0) + B(θ_0)      (2.4)

with cov{V(θ_0)} = Σ and B(θ_0) = O_p(K/√N), where Σ = (Σ_{k=1}^K α_k Σ_k^{-1})^{-1}.

The proof of Proposition 1 is given in Appendix A.1. Proposition 1 separates √N(θ̃ − θ_0) into two parts, namely, a variance part and a bias part. In particular, one should note that the variance is of the same order as that of the global estimator θ̂, which is O(N^{-1}), while the bias order is related to the number of local workers K. Consequently, if the local sample size is sufficiently large, the bias should be sufficiently small, and thus the global statistical efficiency can be achieved. We state this result in the following theorem.

Theorem 1. (Global Asymptotic Normality)
Assume Conditions (C1)–(C5) and further assume n/N^{1/2} → ∞. Then, we have √N (θ̃ − θ_0) →_d N(0, Σ), which achieves the same asymptotic normality as the global estimator θ̂.

The proof of Theorem 1 is given in Appendix A.2. It can be concluded that we should require the local sample size to be of an order larger than √N, which is easy to satisfy in practice. Otherwise, we would have N/n² = K/n → ∞. This implies that the number of workers is even larger than the average local sample size n, which is obviously not the case in practice. In the next section, we further discuss the shrinkage estimation based on the DLSA method.
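Before moving on, here is a minimal NumPy sketch of the master-side combination in (2.2), assuming each worker has shipped its local estimate together with the estimated covariance matrix of that estimate (as in the worker-side sketch in Section 1); the function name is ours.

```python
import numpy as np

def dlsa_combine(theta_hats, cov_hats):
    """Weighted least squares combination of local estimators, eq. (2.2).

    theta_hats : list of K arrays of shape (p,), the local estimates.
    cov_hats   : list of K arrays of shape (p, p), the estimated
                 covariances of the local estimates (Sigma_hat_k / n_k).
    Using the inverse of cov(theta_hat_k) as the weight equals the paper's
    alpha_k * Sigma_hat_k^{-1} weighting up to a common factor 1/N,
    which cancels in the ratio below.
    """
    p = theta_hats[0].shape[0]
    weight_sum = np.zeros((p, p))
    weighted_theta = np.zeros(p)
    for theta_k, cov_k in zip(theta_hats, cov_hats):
        w_k = np.linalg.inv(cov_k)           # information-type weight
        weight_sum += w_k
        weighted_theta += w_k @ theta_k
    cov_tilde = np.linalg.inv(weight_sum)    # estimated covariance of theta_tilde
    theta_tilde = cov_tilde @ weighted_theta
    return theta_tilde, cov_tilde
```

Note that cov_tilde here estimates Σ/N, i.e., the covariance of θ̃ itself rather than the asymptotic covariance of √N(θ̃ − θ_0).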
3. COMMUNICATION-EFFICIENT SHRINKAGE ESTIMATION
Variable selection is a classical but critically important problem. That is because, in practice, the number of available covariates is typically large, but only a small number of covariates are related to the response. Given an appropriate variable selection technique, one can discover the important variables with high probability. In recent decades, various variable selection techniques have been well studied (Tibshirani, 1996; Fan and Li, 2001; Zou, 2006; Wang and Leng, 2007; Zhang, 2010). However, how to conduct variable selection on a distributed system has not been sufficiently investigated. Existing approaches mostly focus on ℓ1 shrinkage estimation and develop corresponding algorithms (Lee et al., 2015; Battey et al., 2015; Wang et al., 2017a; Jordan et al., 2018). However, to the best of our knowledge, three problems remain unsolved on a distributed system: (a) most works do not establish the oracle properties of the shrinkage estimators, (b) no consistent tuning parameter selection criterion is given or investigated, and (c) the computation will be heavy if one needs to conduct estimation and select the tuning parameters simultaneously.

To solve the above problems, we first define some notation. Without loss of generality, we assume the first d (0 < d < p) components of θ_0 to be nonzero, i.e., θ_{0j} ≠ 0 for 1 ≤ j ≤ d and θ_{0j} = 0 for j > d. Correspondingly, we denote M_T = {1, ..., d} to be the true model. In addition, let M = {i_1, ..., i_{|M|}} be an arbitrary candidate model. For an arbitrary vector v = (v_j : 1 ≤ j ≤ p)^T, define v^{(M)} = (v_i : i ∈ M)^T ∈ R^{|M|} and v^{(−M)} = (v_i : i ∉ M)^T ∈ R^{p−|M|}. For an arbitrary matrix M = (m_{j1 j2}), define M^{(M)} = (m_{j1 j2} : j1, j2 ∈ M) ∈ R^{|M|×|M|}.

For simultaneous variable selection and parameter estimation, we follow the idea of Wang and Leng (2007) and consider the adaptive Lasso objective function on the master (Zou, 2006; Zhang and Lu, 2007),

  Q_λ(θ) = L̃(θ) + Σ_j λ_j |θ_j|.      (3.1)

By the adaptive Lasso method, different amounts of shrinkage λ_j are imposed on each coefficient to improve the estimation efficiency (Zou, 2006; Zou and Li, 2008). Compared to the LSA approach of Wang and Leng (2007), we have the following key differences. First, θ̃ is the combined WLSE from the local workers. Second, Σ̂ is constructed from the local asymptotic covariance estimators Σ̂_k. Consequently, to achieve a global convergence rate, one needs to carefully balance the local convergence rates of θ̂_k and Σ̂_k with those of the global ones.

Define θ̃_λ = argmin_θ Q_λ(θ). Then, we can establish the √N-consistency as well as the selection consistency of θ̃_λ under certain conditions on λ = (λ_1, ..., λ_p)^T.

Theorem 2.
Assume Conditions (C1)–(C5). Let a_λ = max{λ_j : j ≤ d} and b_λ = min{λ_j : j > d}. Then, the following results hold.

(a) (√N-consistency) If √N a_λ →_p 0, then θ̃_λ − θ_0 = O_p(N^{-1/2}).

(b) (Selection Consistency) If √N a_λ →_p 0 and √N b_λ →_p ∞, then P(θ̃_λ^{(−M_T)} = 0) → 1.

The proof of Theorem 2 is given in Appendix A.3. Intuitively, a_λ controls the largest amount of penalty on the true nonzero parameters. Consequently, this amount cannot be too large; otherwise, it will result in a highly biased estimator. In contrast, b_λ is responsible for producing sparse solutions for the irrelevant covariates. Therefore, b_λ should be sufficiently large to produce an effective amount of shrinkage.

By Theorem 2, we know that with probability tending to one, we have θ̃_λ^{(−M_T)} = 0. Meanwhile, θ̃_λ^{(M_T)} − θ_0^{(M_T)} = O_p(N^{-1/2}). It is then natural to ask whether the statistical efficiency of θ̃_λ^{(M_T)} can be as good as that of the oracle estimator, which is the global estimator obtained under the true model, i.e., θ̂_oracle = argmin_{θ ∈ R^p, θ_j = 0 ∀ j ∉ M_T} L(θ). To this end, we require the following technical condition.

(C6) (Covariance Assumption)
Define the global unpenalized estimator as θ̂_M = argmin_{θ ∈ R^p : θ_j = 0, ∀ j ∉ M} L(θ). Assume for the global estimator θ̂_M with M ⊃ M_T that √N(θ̂_M^{(M)} − θ_0^{(M)}) →_d N(0, Σ_M) = N(0, Ω_M^{-1}), where Σ_M ∈ R^{|M|×|M|} is a positive-definite matrix. Further assume for any M ⊃ M_T that Ω_M = Ω_{M_F}^{(M)} holds, where M_F = {1, ..., p} denotes the full model.

Condition (C6) does not seem very intuitive. Nevertheless, it is a condition that is well satisfied by most maximum likelihood estimators. A more detailed discussion is provided by Wang and Leng (2007). We then have the oracle property in the following theorem.

Theorem 3. (Oracle Property)
Assume Conditions (C1)–(C6). Let √N a_λ →_p 0 and √N b_λ →_p ∞; then, it holds that

  √N ( θ̃_λ^{(M_T)} − θ_0^{(M_T)} ) →_d N( 0, Σ_{M_T} ).      (3.3)

By Theorems 2 and 3, we know that as long as the tuning parameters are appropriately selected, the resulting estimator is selection consistent and as efficient as the oracle estimator. It is remarkable that tuning a total of p parameters simultaneously is not feasible in practice. To fix this problem, we follow the tradition of Zou (2006) and Wang et al. (2007) and specify λ_j = λ |θ̃_j|^{-1}. Since θ̃_j is √N-consistent, as long as λ satisfies the conditions λ√N → 0 and λN → ∞, the conditions √N a_λ →_p 0 and √N b_λ →_p ∞ are satisfied. Thereafter, the original problem of tuning parameter selection can be replaced by the selection of the single parameter λ.

Although it has been shown that, asymptotically, the oracle property can be guaranteed as long as the tuning parameters are appropriately selected, it is still unclear how to conduct variable selection in practice. That motivates us to design a BIC-type criterion that can select the true model consistently in a completely data-driven manner (Zhang and Lu, 2007; Chen and Chen, 2008; Zou and Zhang, 2009; Wang et al., 2013). Specifically, to consistently recover the sparsity pattern, we consider a distributed Bayesian information criterion (DBIC) as follows:

  DBIC_λ = (θ̃_λ − θ̃)^T Σ̂^{-1} (θ̃_λ − θ̃) + log N × df_λ / N,      (3.4)

where df_λ is the number of nonzero elements in θ̃_λ.

The design of the DBIC is in the spirit of the BIC criterion used in Wang and Leng (2007). The difference is that the DBIC uses the WLSE θ̃ and the average of the distributed covariance estimators, Σ̂, to construct the least squares objective function. Intuitively, if θ̃ and Σ̂ approximate the global estimator θ̂ = argmin L(θ) and its asymptotic covariance well, then the DBIC criterion should be able to facilitate consistent tuning parameter selection. Specifically, the resulting model should be selection consistent (Shao, 1997).
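Because (3.1) is a quadratic form plus an ℓ1 penalty, the whole solution path and the DBIC selection can be computed on the master with off-the-shelf tools. The sketch below, assuming scikit-learn is available, rewrites the problem as an ordinary lasso through a Cholesky factor of Σ̂^{-1} and an adaptive rescaling of the columns; the function name and this particular reduction are our own illustration rather than the authors' released code, which uses the LARS algorithm directly.

```python
import numpy as np
from sklearn.linear_model import lars_path

def sdlsa_path_dbic(theta_tilde, sigma_inv, n_total):
    """Adaptive-lasso shrinkage of the DLSA estimator with DBIC selection.

    theta_tilde : WLSE from (2.2); sigma_inv : the matrix Sigma_hat^{-1}
    defining the quadratic form in (3.1); n_total : the global sample size N.
    """
    L = np.linalg.cholesky(sigma_inv)   # Sigma_hat^{-1} = L @ L.T
    X = L.T                             # pseudo design (p x p)
    y = X @ theta_tilde                 # pseudo response
    w = np.abs(theta_tilde)             # adaptive weights: lambda_j = lambda / |theta_tilde_j|
    Xw = X * w                          # rescale columns to absorb the weights
    _, _, gammas = lars_path(Xw, y, method="lasso")
    best_dbic, best_theta = np.inf, theta_tilde
    for gamma in gammas.T:              # one solution per knot of the LARS path
        theta_lam = gamma * w           # back to the original parameterisation
        resid = theta_lam - theta_tilde
        df = np.count_nonzero(theta_lam)
        dbic = resid @ sigma_inv @ resid + np.log(n_total) * df / n_total
        if dbic < best_dbic:
            best_dbic, best_theta = dbic, theta_lam
    return best_theta, best_dbic
```

The column rescaling works because minimizing ||y − Xθ||² + λ Σ_j |θ_j|/|θ̃_j| over θ is equivalent to minimizing ||y − Xw γ||² + λ Σ_j |γ_j| over γ with θ_j = γ_j |θ̃_j|, so a standard lasso path in γ traces out the adaptive-lasso path in θ.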
To formally investigate the theoretical performance of the DBIC, we first define some notation. First, we define the set of nonzero elements of θ̃_λ by M_λ. Given a tuning parameter λ, M_λ could be underfitted, correctly fitted, or overfitted. We can then form the following partition:

  R_− = {λ ∈ R^p : M_λ ⊅ M_T},
  R_0 = {λ ∈ R^p : M_λ = M_T},
  R_+ = {λ ∈ R^p : M_λ ⊃ M_T, M_λ ≠ M_T},

where R_− denotes the underfitted models and R_+ denotes the overfitted models. We show in the following theorem that the DBIC can consistently identify the true model.

Theorem 4. Assume Conditions (C1)–(C6). Define a reference tuning parameter sequence {λ_N ∈ R^p}, where the first d elements of λ_N are 1/N and the remaining elements are log N/N. Then, we have P( inf_{λ ∈ R_− ∪ R_+} DBIC_λ > DBIC_{λ_N} ) → 1.

By Theorems 2 and 3, we know that with probability tending to one, we should have M_{λ_N} = M_T. Consequently, the sequence λ_N here plays the role of a reference sequence that leads to the true model. Accordingly, Theorem 4 implies that the optimal λ selected by the DBIC will consistently identify the true model. This is because any λ leading to an inconsistent model selection result should perform worse than λ_N in terms of the DBIC value. This result does not imply that the tuning parameter selected by the DBIC is λ_N; it only states that any inconsistent λ should be at least worse than λ_N, whereas λ_N leads to consistent model selection results.

4. NUMERICAL STUDIES

To demonstrate the finite sample performance of the DLSA method, we conduct a number of simulation studies in this section. Five classical regression models are presented, and the corresponding DLSA algorithms are implemented. For each model, we consider two typical settings to verify the numerical performance of the proposed method. They represent two different data storing strategies, together with competing methods. The first strategy is to distribute the data in a completely random manner. Thus, the covariates on different workers are independent and identically distributed (i.i.d.). In contrast, the second strategy allows the covariate distributions on different workers to be heterogeneous. The estimation efficiency as well as the variable selection accuracy are evaluated. Examples are given as follows.
Example 1. (Linear Regression).
We first consider one of the most pop-ular regression analysis tools, i.e., linear regression. In particular, we generate thecontinuous response Y i by a linear relationship with the covariates X i as follows: Y i = X (cid:62) i θ + (cid:15) i , , where the noise term ε i is independently generated using a standard normal dis-tribution N (0 , θ =(3 , . , , , , , , (cid:62) . Example 2. (Logistic Regression).
The logistic regression is a classical modelthat addresses binary responses (Hosmer Jr et al., 2013). In this example, we generate18he response Y i independently by the Bernoulli distribution given the covariate X i as P ( Y i = 1 | X i ) = exp( X (cid:62) i θ )1 + exp( X (cid:62) i θ ) . We follow Wang and Leng (2007) to set the true parameter θ = (3 , , , . , , , , (cid:62) . Example 3. (Poisson Regression).
In this example, we consider the Poissonregression, which is used to model counted responses (Cameron and Trivedi, 2013).The responses are generated according to the Poisson distribution as P ( Y | X i , θ ) = λ Y i Y i ! exp( − λ ) , (4.1)where λ = exp( X (cid:62) i θ ). The true parameter θ is set to (0 . , , , , , , − . , , (cid:62) . Example 4. (Cox Model).
Next, we consider Cox’s model in the survivalanalysis. (Fan and Li, 2002). Specifically, we set the hazard function to be h ( t i | X i ) = exp( X (cid:62) i θ ) , , where t i is the survival time from the i th subject. In practice, we generate the survivaltime from an exponential distribution with mean exp( − X (cid:62) i θ ). The true parameter θ is set as θ = (0 . , , , , , , . , (cid:62) . Next, the censoring time is generated indepen-dently from an exponential distribution with a mean u i exp( X (cid:62) i β ), where u i is sampledfrom a uniform distribution U [1 , Example 5. (Ordered Probit Regression).
Finally, we consider the orderedProbit regression model, which is widely used to model ordinal responses (Harrell Jr,2015). Specifically, the responses take the value Y = 1 , · · · , L , which represents natural19rders. Given the covariates X i , the ordinal responses are independently generated asfollows: P ( Y i = l | X i , θ ) = Φ( c − X (cid:62) θ ) l = 1Φ( c l − X (cid:62) θ ) − Φ( c l − − X (cid:62) θ ) 2 ≤ l ≤ L − − Φ( c L − − X (cid:62) θ ) l = L where Φ( · ) denotes the distribution function for the standard normal distribution, and c , · · · , c L − are the cut points. We set θ = (0 . , , , , , , . , (cid:62) and ( c , · · · , c L − ) (cid:62) =( − , , . (cid:62) with L = 4.For each example, two different data storage strategies are considered. They leadto different covariate distributions F x ( x ). Specifically, the following two settings areinvestigated. • Setting 1 (i.i.d Covariates).
We first consider the setting in which the dataare distributed independently and identically across the workers. Specifically, thecovariates X ij (1 ≤ i ≤ N, ≤ j ≤ p ) are sampled from the standard normaldistribution N (0 , • Setting 2 (Heterogeneous Covariates).
Next, we look at the case wherebythe covariates distributed across each worker are heterogeneous. This is a com-mon case in practice. Specifically, on the k th worker, the covariates are sampledfrom the multivariate normal distribution N ( µ k , Σ k ), where µ k is generated fromthe uniform distribution U [ − , k = ( σ k,ij ) = ( ρ | j − j | k ) with ρ k sampledfrom U [0 . , .
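To make Setting 2 concrete, the following sketch generates the heterogeneous covariates on the K workers. The AR(1)-type correlation structure follows the description above; the exact interval endpoints for μ_k and ρ_k are illustrative choices, since the paper's values are not fully legible here.

```python
import numpy as np

def generate_setting2_covariates(K, n_k, p, seed=None):
    """Heterogeneous covariates: worker k draws X from N(mu_k, Sigma_k)
    with Sigma_k[i, j] = rho_k ** |i - j| (an AR(1)-type structure)."""
    rng = np.random.default_rng(seed)
    workers = []
    for k in range(K):
        mu_k = rng.uniform(-1.0, 1.0, size=p)   # worker-specific mean (illustrative range)
        rho_k = rng.uniform(0.2, 0.8)           # worker-specific correlation (illustrative range)
        idx = np.arange(p)
        sigma_k = rho_k ** np.abs(idx[:, None] - idx[None, :])
        workers.append(rng.multivariate_normal(mu_k, sigma_k, size=n_k))
    return workers
```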
20n this section, we give performance measurements and summarize simulation re-sults with respect to the estimation efficiency as well as the variable selection accuracy.The sample sizes are set as N = (10 , , × . Correspondingly, the number ofworkers is set to K = (5 , , R = 500 times. For the r threplication, denote (cid:98) θ ( r ) and (cid:101) θ ( r ) as the global estimator and WLSE, respectively. Tomeasure the estimation efficiency, we calculate the root mean square error (RMSE)for the j th estimator as RMSE (cid:101) θ,j = { R − (cid:80) r (cid:107) (cid:101) θ ( r ) j − θ j (cid:107) } / . The RMSE for theglobal estimator (cid:98) θ can be defined similarly. Then, the relative estimation efficiency(REE) with respect to the global estimator is given by REE j = RMSE (cid:98) θ,j / RMSE (cid:101) θ,j for j = 1 , · · · , p .Next, based on the WLSE, we further conduct shrinkage estimation on the mas-ter node. Let (cid:99) M ( r ) be the set of selected variables in the r th replication using theDBIC. Correspondingly, (cid:101) θ ( r ) λ is the shrinkage estimator. To measure the sparse dis-covery accuracy, we calculate the average model size (MS) as MS = R − (cid:80) r | (cid:99) M ( r ) | .Next, the percentage of the true model being correctly identified is given by CM= R − (cid:80) r I ( (cid:99) M ( r ) = M T ). In addition, to further investigate the estimation accuracy,we calculate the REE of the shrinkage estimation with respect to the global estimatoras REE sj = RMSE (cid:98) θ,j / RMSE (cid:101) θ λ ,j for j ∈ M T . We compare the proposed DLSA method with (a) the OS estimator (Zhang et al.,2013), and (b) the CSL estimator (Jordan et al., 2018). The simulation results aresummarized in Table 2–6. First, in the i.i.d case, one can observe that all three methodsare as efficient as the global estimator when N is increased. For example, for the Poisson21egression (i.e., Table 4), all the methods could achieve REE ≈ N = 100 , K = 10. However, in the heterogeneous setting (i.e., Setting 2), the finite sampleperformances of the three competing methods are quite different. The proposed DLSAmethod achieves the highest efficiency of all the methods, which is also asymptoticallyefficient as the global estimator. For instance, the REE of the DLSA estimation for thelogistic regression (i.e., Table 3) is near 1 in the second setting with N = 20 ,
000 and K = 5, while the REEs of the OS and CSL methods are approximately 0.88 and 0.37,respectively. Although the OS estimator is less efficient than the DLSA estimator, itis still consistent as N increases. The CSL method behaves worst under this situation.That is because it only uses the local Hessian matrix; this could result in a highlybiased estimator.With respect to the shrinkage estimation, one can observe that the adaptive Lassoestimator is able to achieve higher estimation efficiency in the second setting than theglobal estimator. For example, the REE for the shrinkage DLSA (SDLSA) methodcould be even higher than 1 for Cox’s model in Setting 2 (i.e., Table 5). Finally, thenewly designed DBIC method does a great job at identifying the nonzero variableswith high accuracy. To see this, one could observe, for example, in the ordered Probitregression (i.e., Table 6), that the MS is controlled well to be approximately 3 and,CM is near 1.
5. APPLICATION TO AIRLINE DATA
For illustration purposes, we study a large real-world dataset. Specifically, thedataset considered here is the U.S. Airline Dataset. The dataset is available at http://stat-computing.org/dataexpo/2009 . It contains detailed flight information aboutU.S. airlines from 1987 to 2008. The task is to predict the delayed status of a flight given22ll other flight information. Each sample in the data corresponds to one flight record,which consists of delayed status, departure time, arrival time, distance of the flight,flight date, delay status at departures, carrier information, origin and destination. Thecomplete variable information is described in Table 7. The data contain six continuousvariables and five categorical variables. The categorical variables are converted todummies with appropriate dimensions. We treat the
Year and DayofMonth variables as numerical to capture the time effects. To capture possible seasonal patterns, we also convert the time variables Month and DayofWeek to dummies. Ultimately, a total of 181 variables are used in the model. The total sample size is 113.9 million observations. This leads to the raw dataset being 12 GB on a hard drive. After the dummy transformation described in Table 7, the overall in-memory size is over 52 GB, even if all the dummies are stored in a sparse matrix format. Thus, this dataset can hardly be handled by a single computer. All the numerical variables are standardized to have a mean of zero and a variance of one.
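As an illustration of this preprocessing step, the snippet below shows one way to build the dummy and standardized columns for a single chunk of the data with pandas. The column names are the usual ones in the public airline data and are assumptions here, as is the helper name; in the paper, the standardization is over the full dataset, whereas here it is per chunk purely for illustration.

```python
import pandas as pd

NUMERIC = ["Year", "DayofMonth", "CRSDepTime", "CRSArrTime", "DepTime", "Distance", "ElapsedTime"]
CATEGORICAL = ["Month", "DayofWeek", "Carrier", "Origin", "Dest"]

def preprocess_chunk(df: pd.DataFrame) -> pd.DataFrame:
    """Dummy-encode the categorical variables and standardize the numeric ones."""
    out = pd.get_dummies(df, columns=CATEGORICAL, drop_first=True, sparse=True)
    for col in NUMERIC:
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out
```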
To demonstrate our method, we set up a Spark-on-YARN cluster on the Alibabacloud server ( ). This is astandard industrial-level architecture setup for a distributed system. The system con-sists of one master node and two worker nodes. Each node contains 64 virtual cores,64 GB of RAM and two 80 GB SSD local hard drives. The dataset is stored on theHadoop data file system (HDFS).Because the RAM is larger than the raw data size, one may wonder whether thelogistic regression task can be run on a single node. Unfortunately, this is infeasiblein practice. This is because much more memory (typically >
128 GB) is needed for23perating on matrices of such a huge size. Even for the installed Spark system, a task ofthis magnitude cannot be directly performed using an existing algorithm library (e.g.,Spark ML ). This is because Spark is a very memory-intensive system. For example,to compute a single distributed matrix with a size of approximately 1 GB in memory,one might need each worker to have 2 GB of memory in practice. This overheadmemory consumption grows significantly as the size of the data matrix increases. Fordiscussions, see Chen and Guestrin (2016).If one insists on computing the traditional MLE based on the entire dataset undermemory constraints, then a stochastic gradient decent (SGD) algorithm (Zhang, 2004)has to be used. However, the built distributed SGD algorithm simply fails with theaforementioned cluster due to the well-known out-of-memory problem in Spark. Tothis end, we have to use a serialized SGD algorithm; however, we run it on the samehardware for comparison purposes. The entire dataset needs to be randomly shuffledfirst. Then, a serialized SGD algorithm optimizes the log-likelihood function in aniterative manner. For each iteration, a batch of the data is randomly sampled andthen used for calculating the log likelihood and its gradient. A heuristic batch size andlearning rate are used. The total computational time is 15 . − . × . We would like to remark that the com-puting time for the MLE does not include the data shuffling time. This serves as animportant benchmark to gauge the performance of the other competing methods (e.g.,DLSA and OS methods). Fortunately, both the proposed DLSA and OS methods allow us to develop a user-friendly Spark algorithm with very limited computer resources. As the algorithm is24esigned in a batch manner, it is highly efficient under memory constraints. Thealgorithm is developed with the Spark Python API (PySpark) and run on a Sparksystem (version > https://github.com/feng-li/dlsa . We then use our algorithm to fit the model.To this end, the entire dataset is randomly partitioned into 1139 subgroups. Thesample size for each subgroup is approximately 100 , . . − . × and − . × , respectively.Comparing these results with that of the traditional MLE, we find that the traditionalMLE is extremely difficult to compute. It takes more that 15 hours and obtains aninferior result (i.e., smaller log-likelihood value). In contrast, the log-likelihood valueof the DLSA is the best. We next apply the proposed shrinkage DLSA method (referred to as SDLSA) withthe BIC criterion to conduct variable selection. It is remarkable that this can be fullyconducted on the master, and no further communication is needed. It takes only 0 . Year is 6.12, whichimplies that as the year proceeds, the airline delays become more severe. Next, the pas-sengers are likely to encounter delays in May, June, October, November, and December(with coefficients of 6.03, 0.28, 0.8, 0.4, and 0.19, respectively). In terms of days ofthe week, more delays are expected for certain working days, i.e., Tuesday, Wednesdayand Friday (with coefficients of 0.6, 0.85 and 0.39, respectively) compared to weekends.Finally, within a given day, we find the coefficients for both the scheduled departuretime (
CRSDepTime, −0.12) and the scheduled arrival time (CRSArrTime, −0.13) are negative, indicating that late departing and arriving flights suffer less from delays. Next, with respect to the airline carriers, AA (American Airlines), DL (Delta Air Lines), NW (Northwest Airlines), and UA (United Airlines) have estimated coefficients of 0.49, 0.18, 0.39, and 0.70, which indicates how much more likely they are to be delayed compared with other airlines. In addition, one may be interested in which airports are more likely to have delayed flights. According to our estimation results, the origin airports that cause the most delays are IAH (George Bush Intercontinental Airport), LGA (LaGuardia Airport), PHL (Philadelphia International Airport), RDU (Raleigh-Durham International Airport), ONT (Ontario International Airport), and SMF (Sacramento International Airport), with coefficients of 0.82, 0.87, 0.94, 1. , and .59, respectively. The top five destination airports that cause delays are PBI (Palm Beach International Airport), MCI (Kansas City International Airport), DCA (Ronald Reagan Washington National Airport), SAN (San Diego International Airport), and MEM (Memphis International Airport), with coefficients of 0.99, 1.00, 1. , 1.15, and 1.16, respectively.
Table: Coefficients for the logistic model estimated with SDLSA using the BIC. The carrier and airport abbreviations are assigned by the International Air Transport Association. We denote "Airport", "Origin", and "Destination" by "A", "O", and "D", respectively.

6. CONCLUDING REMARKS
In this article, we develop a novel DLSA algorithm that is able to perform large-scale statistical estimation and inference on a distributed system. The DLSA method can be applied to a large family of regression models (e.g., logistic regression, Poisson regression, and Cox's model). First, it is shown that the DLSA estimator is as statistically efficient as the global estimator. Moreover, it is computationally efficient and requires only one round of communication.

Furthermore, we develop the corresponding shrinkage estimation by using an adaptive Lasso approach. The oracle property is theoretically proven. A new DBIC measure for distributed variable selection, which only needs to be computed on the master and requires no further communication, is designed. We prove the DBIC measure to be selection consistent. Finally, numerical studies are conducted with five classical regression examples. In addition, a Spark toolbox is developed, which is shown to be computationally efficient both in simulation and in the airline data analysis.

To facilitate future research, we now discuss several interesting topics. First, the DLSA method requires the objective function to have continuous second-order derivatives. This assumption might be restrictive and cannot be satisfied for certain regression models, e.g., quantile regression. Consequently, the relaxation of this assumption can be investigated, and corresponding distributed algorithms should be designed for such regression models. Second, the dimension considered in our framework is finite. As a natural extension, one could study the shrinkage estimation properties in high-dimensional settings. Finally, the algorithm is designed for independent data. In practice, dependent data (e.g., time series data and network data) are frequently encountered. It is thus interesting to develop corresponding algorithms by taking the dependency structure into account.
APPENDIX

Appendix A.1: Proof of Proposition 1

Note that θ̃ − θ_0 takes the form

  θ̃ − θ_0 = { Σ_k α_k Σ̂_k^{-1} }^{-1} { Σ_k α_k Σ̂_k^{-1} (θ̂_k − θ_0) }.

Define Σ̂_k(θ)^{-1} = ∂²L_k(θ)/∂θ∂θ^T. In the following, we denote Σ̂_k by Σ̂_k(θ̂_k) to make the dependence on θ̂_k explicit. By Slutsky's theorem, to prove (2.4), it suffices to verify that

  Σ_k α_k {Σ̂_k(θ̂_k)}^{-1} →_p Σ^{-1},      (A.1)

  √N [ Σ_k α_k {Σ̂_k(θ̂_k)}^{-1} (θ̂_k − θ_0) ] = V*(θ_0) + B*(θ_0),      (A.2)

where cov{V*(θ_0)} = Σ^{-1} and B*(θ_0) = O_p(K/√N). We prove these results in the following.
1. Proof of (A.1).
Recall that (cid:98) θ k is a √ n -consistent estimator of θ . Thisenables us to conduct a Taylor’s expansion of (cid:98) Σ − k ( (cid:98) θ k ) at θ , which yields (cid:98) Σ − k ( (cid:98) θ k ) − Σ − k = (cid:98) Σ − k ( (cid:98) θ k ) − (cid:98) Σ − k ( θ ) + (cid:98) Σ − k ( θ ) − Σ − k = (cid:88) j ∂ L k ( θ ) ∂θ j ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ ( θ ∗ j − θ j ) + (cid:98) Σ − k ( θ ) − Σ − k where θ ∗ lies on the line joining θ and (cid:98) θ k . By Condition (C5), we have ∂ L k ( θ ) ∂θ j ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ = O p (1). Therefore, the order of the first term is O p (1 / √ n k ). In addition, we have (cid:98) Σ − k ( θ ) − Σ − k = (cid:98) Σ − k ( θ ) − E { (cid:98) Σ − k ( θ ) } = O p ( n − / k ). Consequently, it can be derivedthat (cid:98) Σ − k ( (cid:98) θ k ) − Σ − k = O p ( n − / k ). Further note that α k = n k /N and (cid:80) k α k = 1. Then,we have (cid:98) Σ − Σ = (cid:88) k α k [ { (cid:98) Σ k ( (cid:98) θ k ) } − − Σ − k ] = O p ( n − / ) = o p (1) . (A.3)29ence (A.1) is proven.
2. Proof of (A.2).
Recall that (cid:98) θ k is the local minimizer of L k ( θ ). Therefore, itholds that0 = ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = (cid:98) θ k = ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ + 1 n k (cid:88) i ∈S k ∂ L ( θ ; Z i ) ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ( (cid:98) θ k − θ )+ 12 n k (cid:88) i ∈S k p (cid:88) j =1 ( θ ∗ − θ ) (cid:62) ∂ L ( θ ; Z i ) ∂θ j ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ ( θ ∗ − θ ) , where θ ∗ lies between θ and (cid:98) θ k . By standard arguments, ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ = O p ( n − / k ) , (cid:98) Σ − k ( θ ) = E (cid:110) ∂ L ( θ ; Z i , i ∈ S k ) ∂θ∂θ (cid:62) (cid:111)(cid:12)(cid:12)(cid:12) θ = θ + O p ( n − / k ) = Σ − k + O p ( n − / k ) .∂ L k ( θ ) ∂θ j ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ = E (cid:110) ∂ L ( θ ; Z i , i ∈ S k ) ∂θ∂θ (cid:62) (cid:12)(cid:12)(cid:12) θ = θ ∗ (cid:111) + o p (1) . Further note that (cid:98) θ k − θ = O p ( n − / k ). Then, we have (cid:98) θ k − θ = − Σ k ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ + B k ( θ ) n k + O p (cid:16) n k (cid:17) , where B k ( θ ) = O (1) is the bias term. Then, it holds that √ N (cid:88) k α k (cid:98) Σ − k ( (cid:98) θ k )( (cid:98) θ k − θ )= √ N (cid:88) k α k Σ − k (cid:110) − Σ k ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ + B k ( θ ) n k + O p ( n − k ) (cid:111) + √ N (cid:88) k α k { (cid:98) Σ − k ( (cid:98) θ k ) − Σ − k } ( (cid:98) θ k − θ )= − √ N (cid:88) k n k ∂ L k ( θ ) ∂θ (cid:12)(cid:12)(cid:12) θ = θ + 1 √ N (cid:88) k Σ − k B k ( θ ) + O p (cid:16) K √ N (cid:17) , (A.4)30here the second equation is implied by √ N (cid:80) k α k { (cid:98) Σ − k ( (cid:98) θ k ) − Σ − k } ( (cid:98) θ k − θ ) = O p ( K/ √ N ).By condition (C4), it can be concluded that cov { √ N (cid:80) k n k ∂ L k ( θ ) ∂θ | θ = θ } = Σ − . Conse-quently, (A.2) can be proven. Appendix A.2: Proof of Theorem 1
By Slutsky’s Theorem, the asymptotic normality is directly implied by (A.1) and(A.2) with V ∗ ( θ ) → d N (0 , Σ − ) and B ∗ ( θ ) = o p (1). First, by Condition (C6) and theLyapunov central limit theorem, we have V ∗ ( θ ) → d N (0 , Σ − ). Next, by the conditionthat n (cid:29) √ N , we have K (cid:29) √ N , and thus, B ∗ ( θ ) = o p (1). Appendix A.3: Proof of Theorem 2
1. Proof of √ N -consistency. Note that the objective function Q λ ( θ ) in (3.1) is a strictly convex function. Then,the local minimizer is also a global minimizer. To establish √ N -consistency results, itsuffices to verify the following result (Fan and Li, 2001), i.e., for an arbitrarily small (cid:15) >
0, there exists a sufficiently large constant C such thatlim N inf P (cid:110) inf u ∈ R p : (cid:107) u (cid:107) = C Q λ ( θ + N − / u ) > Q ( θ ) (cid:111) > − (cid:15). (A.5)31et u = ( u , · · · , u p ) (cid:62) and (cid:98) ∆ N = (cid:80) k α k (cid:98) Σ − k ( (cid:98) θ k ) (cid:8) √ N ( θ − (cid:98) θ k ) (cid:9) , Then, we have N (cid:110) Q λ ( θ + N − / u ) − Q λ ( θ ) (cid:111) = u (cid:62) (cid:98) Σ − u + 2 u (cid:62) (cid:98) ∆ N + N p (cid:88) j =1 λ j | θ j + N − / u j | − N p (cid:88) j =1 λ j | θ j |≥ u (cid:62) (cid:98) Σ − u + 2 u (cid:62) (cid:98) ∆ N + N d (cid:88) j =1 λ j (cid:0) | θ j + N − / u j | − | θ j | (cid:1) ≥ u (cid:62) (cid:98) Σ − u + 2 u (cid:62) (cid:98) ∆ N − d ( √ N a N ) (cid:107) u (cid:107) . (A.6)where the second equality holds because we assume θ j = 0 for j > d . Further notethat we assume that √ N a N → p . Consequently, the last term (A.6) is o p (1). Next,note that the first term of (A.6) is lower bounded by λ − ( (cid:98) Σ) C because (cid:107) u (cid:107) = C .By (A.3), we have λ max ( (cid:98) Σ) → p λ max (Σ). Consequently, with probability tending to 1,we have the first term uniformly larger than 0 . λ − (Σ) C , which is positive due toCondition (C4). In addition, by K/ √ N →
0, we have (cid:98) ∆ N = O p (1). Consequently, aslong as C is sufficiently large, the first term will dominate the last two terms. Then,the result of (A.5) is proven.
2. Proof of Selection Consistency.
It suffices to verify that P ( (cid:101) θ λ,j = 0) → d < j ≤ p . Note that Q λ ( θ ) canbe rewritten as Q λ ( θ ) = ( θ − (cid:101) θ ) (cid:62) (cid:98) Σ − ( θ − (cid:101) θ ) + (cid:88) j λ j | θ j | + C, where C is a constant. Define (cid:98) Ω = (cid:98) Σ − , and (cid:98) Ω ( j ) denotes the j th row of the matrix (cid:98) Ω.If (cid:101) θ λ,j (cid:54) = 0 for some j > d , then the partial derivative can be calculated as √ N ∂Q λ ( θ ) ∂θ j (cid:12)(cid:12)(cid:12) θ = (cid:101) θ λ = 2 (cid:98) Ω ( j ) (cid:62) √ N ( (cid:101) θ λ − (cid:101) θ ) + √ N λ j sign( (cid:101) θ λ,j ) . (A.7)32ote that (cid:98) Ω → p Σ − and √ N ( (cid:101) θ λ − (cid:101) θ ) = √ N ( (cid:101) θ λ − θ ) − √ N ( (cid:101) θ − θ ) = O p (1), by (A.1),Theorem 1, and Theorem 2 (a). Consequently, the first term (A.7) is O p (1). Next, bythis condition, we know that √ N λ j ≥ √ N b λ → ∞ for j > d . Because (cid:101) θ λ,j (cid:54) = 0, wehave sign( (cid:101) θ λ,j ) = 1 or -1; thus, the second term (A.7) goes to infinity. Obviously, theequation will not be equal to zero. This implies P ( (cid:101) θ λ,j = 0) → Appendix A.3: Proof of Theorem 3
We first rewrite the asymptotic covariance Σ into the following block matrix form:Σ = Σ Σ Σ Σ , where Σ ∈ R d × d . Similarly, we partition its inverse matrix Ω into 4 correspondingparts, Ω = (Ω , Ω ; Ω , Ω ). By Theorem 2, with probability tending to 1, we have (cid:101) θ ( −M T ) λ = 0. Therefore, (cid:101) θ ( M T ) λ should be the global minimizer of the objective function, Q λ, ( θ ( M T ) ) =( θ ( M T ) − (cid:101) θ ( M T ) ) (cid:62) (cid:98) Ω ( θ ( M T ) − (cid:101) θ ( M T ) ) − θ ( M T ) − (cid:101) θ ( M T ) ) (cid:62) (cid:98) Ω (cid:101) θ ( −M T ) + (cid:101) θ ( −M T ) (cid:62) (cid:98) Ω (cid:101) θ ( −M T ) + d (cid:88) j =1 λ j | θ j | By Theorem 2, it can be concluded that with probability tending to 1, (cid:101) θ ( M T ) λ shouldbe nonzero (otherwise, the √ N -consistency result in Theorem 2 will not hold). As aresult, The partial derivative ∂Q λ ( θ ) /∂θ j should exist for 1 ≤ j ≤ d , which yields0 = 12 ∂Q λ, ( θ ( M T ) ) ∂θ ( M T ) (cid:12)(cid:12)(cid:12) θ ( M T ) = (cid:101) θ ( M T ) λ = (cid:98) Ω ( (cid:101) θ ( M T ) λ − (cid:101) θ ( M T ) ) − (cid:98) Ω (cid:101) θ ( −M T ) + D ( (cid:101) θ ( M T ) λ ) . (A.8)33here D ( (cid:101) θ ( M T ) λ ) is a d -dimensional vector, with its j th component given by 0 . λ j sign( (cid:101) θ λ,j ).By (A.8), it can be derived that √ N ( (cid:101) θ ( M T ) λ − θ ( M T )0 ) = √ N ( (cid:101) θ ( M T ) − θ ( M T )0 ) + (cid:98) Ω − (cid:98) Ω ( √ N (cid:101) θ ( −M T ) ) − (cid:98) Ω − √ N D ( (cid:101) θ ( M T ) λ )= √ N ( (cid:101) θ ( M T ) − θ ( M T )0 ) + Ω − Ω ( √ N (cid:101) θ ( −M T ) ) + o p (1) , (A.9)where the second equality follows because √ N (cid:101) θ ( −M T ) = O p (1) by Theorem 1, (cid:98) Ω → p Ω and (cid:98) Ω → p Ω by (A.1), and √ N λ j = o p (1) (1 ≤ j ≤ d ) by Theorem 2. Further-more, by the matrix inverse formula, it holds that Ω − Ω = − Σ Σ − . Consequently,we have √ N ( (cid:101) θ ( M T ) λ − θ ( M T )0 ) = √ N ( (cid:101) θ ( M T ) − θ ( M T )0 ) − Σ Σ − ( √ N (cid:101) θ ( −M T ) ) + o p (1) . By Theorem 1, we have that the above is asymptotically normal with a mean of 0,and the inverse asymptotic covariance matrix is given by (Σ − Σ Σ − Σ ) − = Ω .By condition (C6), we have Ω = Ω M T . Consequently, the estimator (cid:101) θ ( M ) λ shares thesame asymptotic distribution with the oracle estimator (cid:98) θ ( M T ) M T . Appendix A.4: Proof of Theorem 4
Appendix A.4: Proof of Theorem 4

To establish the selection consistency property of the DBIC, we consider the following two cases for any $\mathcal{M}_\lambda \ne \mathcal{M}_T$. The first case is the underfitted case, and the second is the overfitted case.
1. Underfitted Model.
Note that $\lambda_N$ satisfies the condition in Theorem 2. Consequently, $\widetilde\theta_{\lambda_N}$ is $\sqrt{N}$-consistent, and we thus also have $\mathrm{DBIC}_{\lambda_N} = o_p(1)$. For $\mathcal{M} \not\supset \mathcal{M}_T$, it can be derived that
\[
\mathrm{DBIC}_\lambda = (\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta) + \mathrm{df}_\lambda (\log N)/N
\ge (\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta).
\]
Define $\widetilde\theta_{\mathcal{M}} = \arg\min_{\theta \in \mathbb{R}^p:\, \theta_j = 0,\, \forall j \notin \mathcal{M}} (\theta - \widetilde\theta)^\top \widehat\Sigma^{-1} (\theta - \widetilde\theta)$ as the unpenalized estimator given the model identified by $\mathcal{M}$. Consequently, by definition, we should have
\[
(\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta)
\ge (\widetilde\theta_{\mathcal{M}_\lambda} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\mathcal{M}_\lambda} - \widetilde\theta)
\ge \min_{\mathcal{M} \not\supset \mathcal{M}_T} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)
\to_p \min_{\mathcal{M} \not\supset \mathcal{M}_T} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)^\top \Sigma^{-1} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta),
\]
where the last convergence is due to (A.1). Because $\Sigma$ is positive definite by Condition (C4), we have $\min_{\mathcal{M} \not\supset \mathcal{M}_T} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)^\top \Sigma^{-1} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta) > 0$. It follows that $P(\inf_{\lambda \in R_-} \mathrm{DBIC}_\lambda > \mathrm{DBIC}_{\lambda_N}) \to 1$.
2. Overfitted Model.
We next consider the overfitted model. Let $\lambda$ be an arbitrary tuning parameter that over-selects the parameters. We then have $\mathrm{df}_\lambda - d \ge 1$. It can then be concluded that
\[
N(\mathrm{DBIC}_\lambda - \mathrm{DBIC}_{\lambda_N})
= N(\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta)
- N(\widetilde\theta_{\lambda_N} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\lambda_N} - \widetilde\theta)
+ (\mathrm{df}_\lambda - d)\log N
\]
\[
\ge N(\widetilde\theta_{\mathcal{M}_\lambda} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\mathcal{M}_\lambda} - \widetilde\theta)
- N(\widetilde\theta_{\lambda_N} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\lambda_N} - \widetilde\theta)
+ \log N
\]
\[
\ge \inf_{\mathcal{M} \supset \mathcal{M}_T} N(\widetilde\theta_{\mathcal{M}} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\mathcal{M}} - \widetilde\theta)
- N(\widetilde\theta_{\lambda_N} - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_{\lambda_N} - \widetilde\theta)
+ \log N. \tag{A.10}
\]
First note that for $\mathcal{M} \supset \mathcal{M}_T$, $\widetilde\theta_{\mathcal{M}}$ is $\sqrt{N}$-consistent. As a result, the first term of (A.10) is $O_p(1)$. Similarly, by Theorem 2, $\widetilde\theta_{\lambda_N}$ is $\sqrt{N}$-consistent; thus, the second term of (A.10) is also $O_p(1)$. As a result, (A.10) diverges to infinity as $N \to \infty$. This implies that $P(\inf_{\lambda \in R_+} \mathrm{DBIC}_\lambda > \mathrm{DBIC}_{\lambda_N}) \to 1$. This completes the proof.
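To illustrate the selection rule that Theorem 4 studies, the following minimal Python sketch evaluates the DBIC expression used above, $\mathrm{DBIC}_\lambda = (\widetilde\theta_\lambda - \widetilde\theta)^\top \widehat\Sigma^{-1} (\widetilde\theta_\lambda - \widetilde\theta) + \mathrm{df}_\lambda \log(N)/N$, over a small set of candidate sparse fits and keeps the minimizer. The inputs (the DLSA estimator, the matrix $\widehat\Sigma^{-1}$, and the candidate estimates from the solution path) are assumed to be available from the DLSA and LARS steps; all function and variable names are ours.

import numpy as np

def dbic(theta_lambda, theta_tilde, sigma_inv, N):
    """DBIC_lambda = (theta_lambda - theta_tilde)' Sigma_hat^{-1} (theta_lambda - theta_tilde)
    + df_lambda * log(N) / N, with df_lambda the number of nonzero coefficients."""
    diff = theta_lambda - theta_tilde
    df = np.count_nonzero(theta_lambda)
    return float(diff @ sigma_inv @ diff) + df * np.log(N) / N

# Toy usage: pick the candidate on the solution path with the smallest DBIC.
# (theta_tilde, sigma_inv, and the candidates would come from the DLSA / LARS step.)
N, p = 10_000, 5
rng = np.random.default_rng(0)
theta_tilde = np.array([1.0, 0.0, 0.5, 0.0, -0.8]) + rng.normal(scale=1 / np.sqrt(N), size=p)
sigma_inv = np.eye(p)
candidates = [np.where(np.abs(theta_tilde) > t, theta_tilde, 0.0) for t in (0.0, 0.1, 0.6, 0.9)]
best = min(candidates, key=lambda th: dbic(th, theta_tilde, sigma_inv, N))
print(best)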
Algorithm 1: Spark implementation
Input:
The model function for modelling each partitioned dataset
Output:
The weighted least squares estimator $\widetilde\theta$, the covariance matrix $\widehat\Sigma$, and $\mathrm{DBIC}_\lambda$.

Steps:

Step (1). Pre-determine the overall cluster available memory as $M_{\mathrm{ram}}$, the total number of CPU cores as $C_{\mathrm{cores}}$, and the total data size to be processed as $D_{\mathrm{total}}$.

Step (2). Define the number of batched chunks $N_{\mathrm{chunks}}$ to allow for out-of-memory data processing. We recommend that $N_{\mathrm{chunks}}$ be at least greater than $3 \times D_{\mathrm{total}}/M_{\mathrm{ram}}$ in a Spark system.

Step (3). Define the number of partitions $P_{\mathrm{partition}} = D_{\mathrm{total}}/(N_{\mathrm{chunks}} \times C_{\mathrm{cores}})$.

Step (4). Define a model function whereby the input is an $n \times (p+2)$ Python Pandas DataFrame containing the response variable, the covariates, and the partition id, and the output is a $p \times (p+1)$ Pandas DataFrame whereby the first column is $\widehat\theta_k$ and the remaining columns store $\widehat\Sigma_k^{-1}$.

Step (5). for i in 1:$N_{\mathrm{chunks}}$ do
  (a). Transfer the data chunk to Spark's distributed DataFrame if the data are stored in another format.
  (b). Randomly assign an integer partition label from $\{1, \ldots, P_{\mathrm{partition}}\}$ to each row of the Spark DataFrame.
  (c). Repartition the DataFrame in the distributed system if the data are not partitioned by the partition label.
  (d). Group the Spark DataFrame by the assigned partition label.
  (e). Apply the model function to each grouped dataset with Spark's Grouped Map Pandas UDFs API and obtain a $(pP_{\mathrm{partition}}) \times (p+1)$ distributed Spark DataFrame $R_i$.
end

Step (6). Aggregate $R_i$ over both partitions and chunks and return the $p \times (p+1)$ matrix $R_{\mathrm{final}}$.

Step (7). Return $\widetilde\theta$ by Equation (2.2), $\widehat\Sigma$, and $\mathrm{DBIC}_\lambda$ by Equation (3.4).

• Because the final step in the DLSA algorithm is carried out on the master node and because data transformation from worker nodes to the master node is required, a special tool called "Apache Arrow" (https://arrow.apache.org/) is plugged into our system to allow efficient data transformation between Spark's distributed DataFrame and Python's Pandas DataFrame.
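To make Steps (4)–(7) concrete, the following minimal Python sketch mimics the per-partition model function and the master-side combination. It uses plain NumPy and Pandas rather than a live Spark session, assumes a local linear model on each partition, and assumes that the combined estimator in Equation (2.2) takes the weighted least squares form $\widetilde\theta = (\sum_k \widehat\Sigma_k^{-1})^{-1} \sum_k \widehat\Sigma_k^{-1} \widehat\theta_k$; all function and variable names are ours. In the actual implementation, the model function would be applied to each group through Spark's Grouped Map Pandas UDFs API as in Step 5(e).

import numpy as np
import pandas as pd

def model_fun(pdf: pd.DataFrame) -> pd.DataFrame:
    """Worker-side model function (Step 4): fit a local OLS model on one partition and
    return a p x (p + 1) DataFrame whose first column is theta_hat_k and whose remaining
    columns store a matrix proportional to Sigma_hat_k^{-1} (sigma^2 absorbed for simplicity)."""
    y = pdf["y"].to_numpy()
    X = pdf.drop(columns=["y", "partition_id"]).to_numpy()
    XtX = X.T @ X                                   # local weight matrix W_k
    theta_k = np.linalg.solve(XtX, X.T @ y)          # local estimator theta_hat_k
    return pd.DataFrame(np.column_stack([theta_k, XtX]))

def combine(results):
    """Master-side combination (Steps 6-7), assuming Equation (2.2) has the form
    theta_tilde = (sum_k W_k)^{-1} sum_k W_k theta_hat_k with W_k = Sigma_hat_k^{-1}."""
    p = results[0].shape[0]
    W_sum, Wtheta_sum = np.zeros((p, p)), np.zeros(p)
    for res in results:
        theta_k = res.iloc[:, 0].to_numpy()
        W_k = res.iloc[:, 1:].to_numpy()
        W_sum += W_k
        Wtheta_sum += W_k @ theta_k
    return np.linalg.solve(W_sum, Wtheta_sum)

# Toy usage with simulated partitions standing in for Spark groups.
rng = np.random.default_rng(42)
theta0 = np.array([1.0, -0.5, 0.0, 2.0])
partitions = []
for k in range(5):
    X = rng.normal(size=(1000, 4))
    pdf = pd.DataFrame(X, columns=[f"x{j}" for j in range(4)])
    pdf["y"] = X @ theta0 + rng.normal(size=1000)
    pdf["partition_id"] = k
    partitions.append(pdf)

theta_tilde = combine([model_fun(pdf) for pdf in partitions])
print(theta_tilde)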
References

Apache Software Foundation (2019a), "Apache Hadoop."

— (2019b), "Apache Spark."

Battey, H., Fan, J., Liu, H., Lu, J., and Zhu, Z. (2015), "Distributed estimation and inference with statistical guarantees," arXiv preprint arXiv:1509.05457.

Cameron, A. C. and Trivedi, P. K. (2013), Regression Analysis of Count Data, vol. 53, Cambridge University Press.

Chang, X., Lin, S.-B., and Wang, Y. (2017a), "Divide and conquer local average regression," Electronic Journal of Statistics, 11, 1326–1350.

Chang, X., Lin, S.-B., and Zhou, D.-X. (2017b), "Distributed semi-supervised learning with kernel ridge regression," The Journal of Machine Learning Research, 18, 1493–1514.

Chen, J. and Chen, Z. (2008), "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, 95, 759–771.

Chen, T. and Guestrin, C. (2016), "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 785–794.

Chen, X., Liu, W., and Zhang, Y. (2018), "Quantile regression under memory constraint," arXiv preprint arXiv:1810.08264.

Chen, X. and Xie, M.-g. (2014), "A split-and-conquer approach for analysis of extraordinarily large data," Statistica Sinica, 1655–1684.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least angle regression," Annals of Statistics, 32, 407–499.

Fan, J. and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96, 1348–1360.

— (2002), "Variable selection for Cox's proportional hazards model and frailty model," The Annals of Statistics, 30, 74–99.

Fan, J., Wang, D., Wang, K., and Zhu, Z. (2017), "Distributed estimation of principal eigenspaces," arXiv preprint arXiv:1702.06488.

Harrell Jr, F. E. (2015), Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Springer.

Heinze, C., McWilliams, B., and Meinshausen, N. (2016), "Dual-loco: Distributing statistical estimation using random projections," in Artificial Intelligence and Statistics, pp. 875–883.

Hosmer Jr, D. W., Lemeshow, S., and Sturdivant, R. X. (2013), Applied Logistic Regression, vol. 398, John Wiley & Sons.

Jordan, M. I., Lee, J. D., and Yang, Y. (2018), "Communication-efficient distributed statistical inference," Journal of the American Statistical Association, 1–14.

Lee, J. D., Sun, Y., Liu, Q., and Taylor, J. E. (2015), "Communication-efficient sparse regression: a one-shot approach," arXiv preprint arXiv:1503.04337.

Lehmann, E. L. and Casella, G. (2006), Theory of Point Estimation, Springer Science & Business Media.

Li, X., Li, R., Xia, Z., and Xu, C. (2019), "Distributed feature screening via componentwise debiasing," arXiv preprint arXiv:1903.03810.

Liu, Q. and Ihler, A. T. (2014), "Distributed estimation, information loss and exponential families," in Advances in Neural Information Processing Systems, pp. 1098–1106.

Shamir, O., Srebro, N., and Zhang, T. (2014), "Communication-efficient distributed optimization using an approximate Newton-type method," in International Conference on Machine Learning, pp. 1000–1008.

Shao, J. (1997), "An asymptotic theory for linear model selection," Statistica Sinica, 221–242.

Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M. I., and Jaggi, M. (2018), "CoCoA: A General Framework for Communication-Efficient Distributed Optimization," Journal of Machine Learning Research, 18, 1–49.

Tibshirani, R. (1996), "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, 267–288.

Wang, H. and Leng, C. (2007), "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, 102, 1039–1048.

Wang, H., Li, R., and Tsai, C.-L. (2007), "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, 94, 553–568.

Wang, H., Zhu, R., and Ma, P. (2018), "Optimal subsampling for large sample logistic regression," Journal of the American Statistical Association, 113, 829–844.

Wang, J., Kolar, M., Srebro, N., and Zhang, T. (2017a), "Efficient distributed learning with sparsity," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, pp. 3636–3645.

Wang, J., Wang, W., and Srebro, N. (2017b), "Memory and communication efficient distributed stochastic optimization with minibatch prox," in Proceedings of the 2017 Conference on Learning Theory, eds. Kale, S. and Shamir, O., Amsterdam, Netherlands: PMLR, vol. 65 of Proceedings of Machine Learning Research, pp. 1882–1919.

Wang, L., Kim, Y., and Li, R. (2013), "Calibrating non-convex penalized regression in ultra-high dimension," Annals of Statistics, 41, 2505.

Yang, J., Mahoney, M. W., Saunders, M., and Sun, Y. (2016), "Feature-distributed sparse regression: a screen-and-clean approach," in Advances in Neural Information Processing Systems, pp. 2712–2720.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012), "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, pp. 2–2.

Zhang, C.-H. (2010), "Nearly unbiased variable selection under minimax concave penalty," Annals of Statistics, 38, 894–942.

Zhang, H. H. and Lu, W. (2007), "Adaptive Lasso for Cox's proportional hazards model," Biometrika, 94, 691–703.

Zhang, T. (2004), "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in Proceedings of the Twenty-First International Conference on Machine Learning, ACM, p. 116.

Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2013), "Communication-efficient algorithms for statistical optimization," The Journal of Machine Learning Research, 14, 3321–3363.

Zou, H. (2006), "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 101, 1418–1429.

Zou, H. and Li, R. (2008), "One-step sparse estimates in nonconcave penalized likelihood models," Annals of Statistics, 36, 1509.

Zou, H. and Zhang, H. H. (2009), "On the adaptive elastic-net with a diverging number of parameters," Annals of Statistics, 37, 1733.

Table 2: Simulation results for Example 1 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      1.00  0.99  1.00  1.00  1.00  1.00  1.00  1.00
10    5    CSL     1.00  0.99  1.00  0.98  0.99  0.99  0.99  0.99
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
20    5    OS      1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    CSL     0.99  1.00  0.99  0.99  0.99  1.00  0.99  1.00
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
100   10   OS      1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   CSL     0.99  0.99  1.00  0.99  0.99  0.99  0.99  0.99
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      1.00  0.99  0.99  1.00  0.99  0.98  0.99  0.99
10    5    CSL     0.97  0.97  0.98  0.98  0.98  1.00  0.98  0.98
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.16  -     -     1.25  -     -     1.25  -     3.01  1.00
20    5    OS      0.99  0.99  0.99  1.01  0.99  0.99  0.99  0.99
20    5    CSL     0.99  0.99  0.98  0.97  0.98  1.00  1.00  0.99
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.15  -     -     1.23  -     -     1.20  -     3.01  1.00
100   10   OS      0.99  0.99  1.00  1.00  0.99  0.99  0.99  0.99
100   10   CSL     0.98  0.97  0.95  0.98  0.96  0.97  0.96  0.95
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.15  -     -     1.22  -     -     1.28  -     3.00  1.00

Table 3: Simulation results for Example 2 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      0.92  0.98  0.99  0.95  0.98  0.98  0.94  0.99
10    5    CSL     0.94  1.00  0.99  0.96  0.97  0.99  0.94  0.98
10    5    DLSA    0.98  1.01  1.01  0.99  1.01  1.01  0.98  1.01
10    5    SDLSA   0.91  -     -     0.92  -     -     0.91  -     3.01  1.00
20    5    OS      0.95  1.00  1.00  0.96  0.99  0.99  0.96  0.99
20    5    CSL     0.99  0.99  0.99  1.00  1.00  0.99  0.99  0.99
20    5    DLSA    1.00  1.00  1.00  1.01  1.00  1.00  1.00  1.00
20    5    SDLSA   0.96  -     -     1.00  -     -     0.97  -     3.01  1.00
100   10   OS      0.95  1.00  1.00  0.97  1.00  1.00  0.97  1.00
100   10   CSL     0.99  0.99  1.00  0.99  1.00  1.00  0.98  0.99
100   10   DLSA    1.01  1.00  1.00  1.00  1.00  1.00  0.99  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     0.98  -     3.01  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      0.79  0.86  0.89  0.83  0.88  0.86  0.82  0.90
10    5    CSL     0.24  0.31  0.35  0.27  0.33  0.30  0.26  0.28
10    5    DLSA    0.96  1.01  1.01  0.98  1.01  1.01  0.99  1.01
10    5    SDLSA   0.86  -     -     1.00  -     -     1.00  -     3.00  1.00
20    5    OS      0.86  0.87  0.88  0.87  0.90  0.94  0.89  0.89
20    5    CSL     0.33  0.37  0.41  0.32  0.36  0.37  0.32  0.32
20    5    DLSA    0.98  1.01  1.00  0.99  1.00  1.00  0.97  1.01
20    5    SDLSA   0.96  -     -     1.07  -     -     1.03  -     3.01  1.00
100   10   OS      0.92  0.94  0.92  0.93  0.91  0.92  0.91  0.92
100   10   CSL     0.21  0.24  0.24  0.21  0.23  0.25  0.21  0.21
100   10   DLSA    0.96  1.00  1.00  1.00  1.00  1.00  0.97  1.00
100   10   SDLSA   0.97  -     -     1.09  -     -     1.07  -     3.00  1.00

Table 4: Simulation results for Example 3 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      0.98  0.99  1.00  0.99  1.00  1.00  0.99  0.99
10    5    CSL     0.93  0.95  0.94  0.89  0.93  0.95  0.94  0.93
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
20    5    OS      1.00  0.99  1.00  0.99  0.99  0.99  0.99  1.00
20    5    CSL     0.94  0.98  0.97  0.96  0.98  0.96  0.96  0.99
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.00  -     -     1.01  -     -     1.00  -     3.00  1.00
100   10   OS      1.00  1.00  1.00  0.99  1.00  1.00  0.99  0.99
100   10   CSL     0.97  0.98  0.97  0.95  0.96  0.98  0.98  0.97
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.00  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      0.59  0.62  0.68  0.59  0.65  0.64  0.68  0.68
10    5    CSL     0.01  0.02  0.03  0.01  0.03  0.03  0.02  0.03
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.14  -     -     1.27  -     -     1.26  -     3.00  1.00
20    5    OS      0.59  0.67  0.65  0.62  0.63  0.63  0.70  0.64
20    5    CSL     0.01  0.03  0.03  0.01  0.03  0.03  0.02  0.03
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.18  -     -     1.28  -     -     1.31  -     3.00  1.00
100   10   OS      0.62  0.68  0.75  0.67  0.69  0.69  0.75  0.69
100   10   CSL     0.01  0.02  0.02  0.01  0.03  0.03  0.02  0.03
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.15  -     -     1.28  -     -     1.23  -     3.01  1.00

Table 5: Simulation results for Example 4 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      0.97  0.99  0.99  0.95  0.99  0.98  0.99  0.99
10    5    CSL     0.97  0.99  0.98  0.99  0.99  0.98  0.96  0.99
10    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
10    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
20    5    OS      0.98  1.00  1.00  0.97  1.00  0.99  0.99  1.00
20    5    CSL     0.99  0.99  0.98  1.00  0.99  0.97  0.99  0.98
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   0.99  -     -     1.00  -     -     1.00  -     3.01  1.00
100   10   OS      1.00  1.00  0.99  0.99  1.00  1.00  1.00  1.00
100   10   CSL     0.98  0.98  1.00  0.99  0.99  0.99  0.99  1.00
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.00  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      0.65  0.75  0.71  0.72  0.74  0.75  0.70  0.70
10    5    CSL     0.04  0.05  0.05  0.04  0.05  0.05  0.05  0.04
10    5    DLSA    0.95  0.98  0.99  0.98  0.99  0.98  0.97  0.95
10    5    SDLSA   1.01  -     -     1.04  -     -     1.05  -     3.00  1.00
20    5    OS      0.70  0.80  0.77  0.73  0.75  0.74  0.75  0.70
20    5    CSL     0.05  0.06  0.06  0.05  0.05  0.05  0.06  0.05
20    5    DLSA    0.96  0.99  0.99  0.98  0.98  0.99  0.99  0.95
20    5    SDLSA   1.01  -     -     1.10  -     -     1.12  -     3.00  1.00
100   10   OS      0.80  0.78  0.80  0.79  0.78  0.81  0.77  0.78
100   10   CSL     0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.03
100   10   DLSA    0.98  0.98  1.00  0.98  0.98  0.99  0.98  0.97
100   10   SDLSA   1.04  -     -     1.07  -     -     1.08  -     3.00  1.00

Table 6: Simulation results for Example 5 with 500 replications. The numerical performances are evaluated for different sample sizes N and numbers of workers K. For the OS, CSL, and DLSA methods, the REE_j is reported. For the shrinkage DLSA (SDLSA) method, the REE_sj is reported. Finally, the MS and CM are reported for the SDLSA method to evaluate the variable selection accuracy.

N     K    Est.     θ     θ     θ     θ     θ     θ     θ     θ     MS    CM
Setting 1: i.i.d. Covariates
10    5    OS      0.97  0.99  0.99  0.95  0.99  0.99  0.99  0.99
10    5    CSL     0.99  1.00  0.99  0.99  0.99  0.99  0.98  0.99
10    5    DLSA    1.00  1.00  1.00  0.99  1.00  1.00  1.00  1.00
10    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
20    5    OS      0.99  1.00  1.00  0.98  1.00  1.00  0.99  1.00
20    5    CSL     0.99  1.00  0.99  0.99  1.00  1.00  0.99  0.99
20    5    DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
20    5    SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.01  1.00
100   10   OS      0.99  1.00  1.00  0.98  1.00  1.00  1.00  1.00
100   10   CSL     0.98  0.99  1.00  0.99  0.99  0.99  0.99  0.99
100   10   DLSA    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
100   10   SDLSA   1.00  -     -     1.00  -     -     1.00  -     3.00  1.00
Setting 2: Heterogeneous Covariates
10    5    OS      0.88  0.90  0.93  0.91  0.91  0.94  0.91  0.91
10    5    CSL     0.14  0.19  0.20  0.20  0.20  0.18  0.19  0.12
10    5    DLSA    0.95  0.95  0.98  0.98  0.96  0.98  0.97  0.92
10    5    SDLSA   0.90  -     -     0.98  -     -     0.93  -     3.01  1.00
20    5    OS      0.92  0.93  0.95  0.94  0.93  0.93  0.94  0.93
20    5    CSL     0.13  0.20  0.19  0.20  0.19  0.19  0.19  0.12
20    5    DLSA    0.94  0.96  0.98  0.99  0.97  0.99  0.96  0.93
20    5    SDLSA   0.92  -     -     1.03  -     -     1.00  -     3.01  1.00
100   10   OS      0.89  0.95  0.97  0.96  0.97  0.95  0.96  0.93
100   10   CSL     0.09  0.13  0.14  0.15  0.13  0.14  0.13  0.09
100   10   DLSA    0.93  0.97  0.97  0.98  0.98  0.97  0.98  0.93
100   10   SDLSA   0.88  -     -     0.99  -     -     1.02  -     3.00  1.00

Table 7: Variable description for the U.S. airline data. Numerical variables are standardized to have mean zero and variance one.

Variable      Description                                              Variable used in the model
Delayed       Whether the flight is delayed (Yes/No)                   Used as the response variable
Year          Year of the flight                                       Used as numerical variable
DayofMonth    Which day of the month                                   Used as numerical variable
Month         Which month of the year                                  Converted to 11 dummies
DayofWeek     Which day of the week                                    Converted to 6 dummies
DepTime       Actual departure time                                    Used as numerical variable
CRSDepTime    Scheduled departure time                                 Used as numerical variable
CRSArrTime    Scheduled arrival time                                   Used as numerical variable
ElapsedTime   Actual elapsed time                                      Used as numerical variable
Distance      Distance between the origin and destination in miles    Used as numerical variable
Carrier       Flight carrier code                                      Top carriers converted to 7 dummies
Destination   Destination of the flight                                Top destination cities converted to 75 dummies
Origin        Departing origin of the flight                           Top origin cities converted to 75 dummies