Distributed Bootstrap for Simultaneous Inference Under High Dimensionality
Yang Yu, Department of Statistics, Purdue University
Shih-Kang Chao, Department of Statistics, University of Missouri
Guang Cheng, Department of Statistics, Purdue University

February 22, 2021
Abstract
We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while it is nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-time Performance dataset. The code to reproduce the numerical results is available at GitHub: https://github.com/skchao74/Distributed-bootstrap.

Keywords:
Distributed learning; High-dimensional inference; Multiplier bootstrap

1 Introduction
Modern massive datasets with enormous sample size and tremendous dimensionality are usually impossible to process on a single machine. As a remedy, a master-worker architecture, e.g., Hadoop (Singh & Kaur 2014), which operates a cluster of nodes for data storage and processing, is often adopted, where the master node also contains a portion of the data; see Figure 1. An inherent problem of this architecture is that inter-node communication can be over a thousand times slower than intra-node computation, because the inter-node communication protocol always comes with significant overhead (Lan et al. 2018, Fan, Guo & Wang 2019). Hence, communication efficiency is usually a top concern in algorithm development for distributed learning.
Figure 1: Master-worker architecture for storing and processing distributed data.

Classical statistical methods are usually not communication-efficient, as some of them require hundreds or even thousands of passes over the entire dataset. In the last few years, active research has greatly advanced our ability to perform distributed statistical optimization and inference in, e.g., maximum likelihood estimation (Zhang et al. 2012, Li et al. 2013, Chen & Xie 2014, Battey et al. 2018, Jordan et al. 2019, Huang & Huo 2019, Chen et al. 2018, Zhu et al. 2020), Lasso (Lee et al. 2017, Wang et al. 2017, Wang & Zhang 2017), partially linear models (Zhao et al. 2016), nonstandard regression (Shi et al. 2018, Banerjee et al. 2019), quantile regression (Volgushev et al. 2019, Chen et al. 2019), and principal component analysis (Fan, Wang, Wang & Zhu 2019, Chen et al. 2020), just to name a few. However, solutions for many other problems, for example statistical inference for high-dimensional models, are still elusive.

Simultaneous inference for high-dimensional statistical models has been widely considered in many applications where datasets can be handled with a standalone computer (Cai & Sun 2017), and many recent papers focus on the bootstrap as an effective way to implement simultaneous inference (Dezeure et al. 2017, Zhang & Cheng 2017, Belloni et al. 2018, 2019, Yu, Gupta & Kolar 2020). These existing methods typically utilize the well-celebrated de-biased Lasso (van de Geer et al. 2014, Zhang & Zhang 2014, Javanmard & Montanari 2014a,b), where the de-biased score results from the KKT condition. However, extending these methods to a distributed computational framework requires great care. For one thing, the implementation of the de-biased Lasso requires expensive subroutines such as the nodewise Lasso (van de Geer et al.
2014), which has to be replaced by a more communication-efficient method. For another, the quality of the de-biased score, which is essential to the validity of the bootstrap, is generally worse in a distributed computational framework than in a centralized one. In particular, it is heavily biased, so that it is not asymptotically normal. However, it can be improved with a sufficient number of rounds of communication between the master and worker nodes. The bootstrap validity therefore critically hinges on the interplay between the dimensionality of the model and the intrinsic sparsity level, as well as the number of communication rounds, the number of worker nodes, and the local sample size that are specific to the distributed computational framework.

In this paper, we tackle the challenges discussed above and propose a communication-efficient simultaneous inference method for high-dimensional models. At the core of our method is a novel way to improve the quality of the de-biased score with a carefully selected number of rounds of communication while relaxing the constraint on the number of machines, motivated by the approach of Wang et al. (2017), which improves the estimator itself. Note that the de-biased Lasso has been applied by Lee et al. (2017) to obtain a communication-efficient $\sqrt{N}$-consistent estimator, but their method restricts the number of worker nodes to be less than the local sample size. Next, we apply the communication-efficient multiplier bootstrap methods k-grad and n+k-1-grad, which were originally proposed in Yu, Chao & Cheng (2020) for low-dimensional models. These bootstrap methods avoid repeatedly refitting the model and relax the constraint on the number of machines that plagues the methods proposed earlier (Kleiner et al. 2014, Sengupta et al. 2016).
A key challenge in implementation is that cross-validation, a popular method for selecting tuning parameters, usually requires multiple passes over the entire dataset and is typically inefficient in the distributed computational framework. We propose a new cross-validation that requires only the master node for implementation, without needing to communicate with the worker nodes.

Our theoretical study focuses on explicit lower bounds on the number of communication rounds that warrant the validity of the bootstrap method for high-dimensional generalized linear models; see Section 3.1 for an overview. In short, the greater the number of worker nodes and/or the intrinsic sparsity level, the greater the number of communication rounds required for bootstrap validity. The bootstrap validity and efficiency are corroborated by an extensive simulation study.

We further demonstrate the merit of our method on variable screening with a semi-synthetic dataset based on the large-scale US Airline On-time Performance dataset. By performing a pilot study on an independently sampled subset of the data, we take four key explanatory variables for flight delay, which correspond to the dummy variables of the four years after the September 11 attacks. On another independently sampled subset of the data, we combine the dummy variables of the four years with artificial high-dimensional spurious variables to create a design matrix. We apply our method to this artificial dataset and find that the relevant variables are correctly identified as the number of iterations increases. In particular, we visualize the effect of these four years by confidence intervals.

The rest of the paper is organized as follows. In Section 2, we introduce the problem formulation of distributed high-dimensional simultaneous inference and present the main algorithm. Theoretical guarantees of bootstrap validity for high-dimensional (generalized) linear models are provided in Section 3.
Section 4 presents simulation results that corroborate our theoretical findings. Section 5 showcases an application of our new method to variable screening for high-dimensional logistic regression with a big real dataset. Finally, Section 6 concludes the paper. Technical details are in the Appendices. The code to reproduce the numerical results in Sections 4 and 5 is on GitHub: https://github.com/skchao74/Distributed-bootstrap.

Notations.
We denote the $\ell_p$-norm ($p \ge 1$) of any vector $v = (v_1, \dots, v_n)^\top$ by $\|v\|_p = (\sum_{i=1}^n |v_i|^p)^{1/p}$ and $\|v\|_\infty = \max_{1\le i\le n}|v_i|$. The induced $p$-norm and the max-norm of any matrix $M \in \mathbb{R}^{m\times n}$ (with element $M_{ij}$ at the $i$-th row and $j$-th column) are denoted by $|||M|||_p = \sup_{x\in\mathbb{R}^n:\,\|x\|_p=1}\|Mx\|_p$ and $|||M|||_{\max} = \max_{1\le i\le m;\,1\le j\le n}|M_{ij}|$. We write $a \lesssim b$ if $a = O(b)$, and $a \ll b$ if $a = o(b)$.
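As a quick concrete check of this notation, the vector and matrix norms above can be computed with NumPy (an illustrative sketch of ours, not part of the paper's released code):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
l1 = np.sum(np.abs(v))              # ||v||_1 = sum_i |v_i| = 7
l2 = np.sqrt(np.sum(v ** 2))        # ||v||_2 = 5
linf = np.max(np.abs(v))            # ||v||_inf = max_i |v_i| = 4

M = np.array([[1.0, -2.0],
              [3.0,  4.0]])
# The induced 1-norm equals the maximum absolute column sum; the max-norm
# is simply the largest entry in absolute value.
op1 = np.max(np.sum(np.abs(M), axis=0))   # |||M|||_1 = 6
mmax = np.max(np.abs(M))                  # |||M|||_max = 4
```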
2 Distributed Bootstrap for High-Dimensional Simultaneous Inference

In this section, we introduce the distributed computational framework and present a novel bootstrap algorithm for high-dimensional simultaneous inference under this framework. A communication-efficient cross-validation method is proposed for tuning.
Suppose the data $\{Z_i\}_{i=1}^N$ are i.i.d., and $\mathcal{L}(\theta; Z)$ is a twice-differentiable convex loss function arising from a statistical model, where $\theta = (\theta_1, \dots, \theta_d)^\top \in \mathbb{R}^d$. Suppose that the parameter of interest $\theta^*$ is the minimizer of an expected loss:
$$\theta^* = \arg\min_{\theta\in\mathbb{R}^d} \mathcal{L}^*(\theta), \quad \text{where } \mathcal{L}^*(\theta) := \mathbb{E}_Z[\mathcal{L}(\theta; Z)].$$
We consider a high-dimensional setting where $d > N$ is possible, and $\theta^*$ is sparse, i.e., the support of $\theta^*$ is fixed and small.

We consider a distributed computation framework in which the entire dataset is stored across $k$ machines, each holding $n$ observations. Denote the entire dataset by $\{Z_{ij}\}_{i=1,\dots,n;\,j=1,\dots,k}$, where $Z_{ij}$ is the $i$-th datum on the $j$-th machine $\mathcal{M}_j$, and $N = nk$. Without loss of generality, assume that the first machine $\mathcal{M}_1$ is the master node; see Figure 1. Define the local and global loss functions as
$$\text{global loss: } \mathcal{L}_N(\theta) = \frac{1}{k}\sum_{j=1}^{k}\mathcal{L}_j(\theta), \quad \text{where local loss: } \mathcal{L}_j(\theta) = \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(\theta; Z_{ij}), \quad j = 1,\dots,k. \quad (1)$$
A great computational overhead occurs when the master and worker nodes communicate. In order to circumvent the overhead, the number of communication rounds between the master and worker nodes should be minimized; algorithms with reduced communication overhead are called "communication-efficient".

In this paper, we focus on the simultaneous confidence region for $\theta^*$ in a high-dimensional model, which is one of the effective ways to perform variable selection and inference immune to the well-known multiple testing problem. In particular, given an estimator $\widehat\theta$ that is $\sqrt{N}$-consistent, simultaneous confidence intervals with confidence level $\alpha \in (0,1)$ can be constructed from the quantile
$$c(\alpha) := \inf\{t \in \mathbb{R} : \mathbb{P}(\widehat T \le t) \ge \alpha\}, \quad \text{where} \quad (2)$$
$$\widehat T := \big\|\sqrt{N}\,(\widehat\theta - \theta^*)\big\|_\infty, \quad (3)$$
and $\widehat\theta$ may be computed through the de-biased Lasso (van de Geer et al. 2014, Zhang & Zhang 2014, Javanmard & Montanari 2014a,b):
$$\widehat\theta = \widehat\theta_{\mathrm{Lasso}} - \widehat\Theta\,\nabla\mathcal{L}_N(\widehat\theta_{\mathrm{Lasso}}), \quad (4)$$
where $\widehat\theta_{\mathrm{Lasso}}$ is the Lasso estimator, $\widehat\Theta$ is a surrogate inverse Hessian matrix, and $\mathcal{L}_N(\theta) = N^{-1}\sum_{i=1}^N \mathcal{L}(\theta; Z_i)$ is the empirical loss.

Implementing simultaneous inference based on $\widehat\theta$ and $\widehat T$ in the distributed computational framework inevitably faces computational challenges. First, computing $\widehat\theta$ usually involves iterative optimization routines that can accumulate a large communication overhead without careful engineering. Second, bootstrap methods have been proposed for estimating $c(\alpha)$, e.g., the multiplier bootstrap (Zhang & Cheng 2017), but they cannot be straightforwardly implemented within a distributed computational framework due to excessive resampling and communication. Even though some communication-efficient bootstrap methods have been proposed, e.g., Kleiner et al. (2014), Sengupta et al. (2016), Yu, Chao & Cheng (2020), they either require a large number of machines or are inapplicable to high-dimensional models.

Because of the above-mentioned difficulties, inference based on $\widehat T$ is inapplicable in the distributed computational framework and is regarded as an "oracle" in this paper. Our goal is to provide a method that is communication-efficient while enjoying the same statistical accuracy as that based on the oracle $\widehat T$.

In order to adapt (4) to the distributed computational setting, we first need to find a good substitute $\widetilde\theta$ for $\widehat\theta_{\mathrm{Lasso}}$ that is communication-efficient, noting that standard algorithms for the Lasso are not. Fortunately, $\widetilde\theta$ can be computed by the communication-efficient surrogate likelihood (CSL) algorithm with $\ell_1$-norm regularization (Wang et al. 2017, Jordan et al. 2019), which iteratively generates a sequence of estimators $\widetilde\theta^{(t)}$ with regularization parameters $\lambda^{(t)}$ at each iteration $t = 0, \dots, \tau - 1$. See Remark 2.1 for model tuning and Lines 1-16 of Algorithm 1 for the exact implementation. Under regularity conditions, if $t$ is sufficiently large, it is warranted that $\widetilde\theta$ is close to $\widehat\theta_{\mathrm{Lasso}}$.
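To fix ideas, the gradient averaging in (1) and the one-shot de-biasing step of the form (4) can be sketched in a few lines of NumPy for the least-squares loss (a schematic illustration under our own naming, with the surrogate inverse Hessian taken as given; this is not the paper's implementation):

```python
import numpy as np

def local_grad(theta, X_j, y_j):
    """Gradient of the local least-squares loss L_j at theta."""
    n = X_j.shape[0]
    return X_j.T @ (X_j @ theta - y_j) / n

def global_grad(theta, shards):
    """grad L_N(theta) = k^{-1} sum_j grad L_j(theta): each worker ships
    a single d-vector per round, which is all the master needs."""
    return np.mean([local_grad(theta, X_j, y_j) for X_j, y_j in shards], axis=0)

def debias(theta_tilde, Theta_tilde, shards):
    """One-step de-biasing: theta_tilde - Theta_tilde @ grad L_N(theta_tilde)."""
    return theta_tilde - Theta_tilde @ global_grad(theta_tilde, shards)

rng = np.random.default_rng(0)
d, n, k = 5, 200, 4
theta_star = np.zeros(d)
theta_star[0] = 1.0
shards = []
for _ in range(k):
    X = rng.standard_normal((n, d))
    y = X @ theta_star + 0.1 * rng.standard_normal(n)
    shards.append((X, y))

# With isotropic covariates the population inverse Hessian is the identity,
# so even the crude choice Theta_tilde = I moves a zero initializer close
# to theta_star in a single de-biased step.
theta1 = debias(np.zeros(d), np.eye(d), shards)
```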
Typical algorithms for computing $\widehat\Theta$, e.g., the nodewise Lasso (van de Geer et al. 2014), cannot be extended straightforwardly to the distributed computational framework due to the same issue of communication inefficiency. We overcome this by performing the nodewise Lasso using only $\mathcal{M}_1$, without accessing the entire dataset. This simple approach does not sacrifice accuracy as long as a sufficient number of communication rounds brings $\widetilde\theta$ sufficiently close to $\theta^*$.

Lastly, given the surrogate estimators $\widetilde\theta$ for $\widehat\theta_{\mathrm{Lasso}}$ and $\widetilde\Theta$ for $\widehat\Theta$, we estimate the asymptotic quantile $c(\alpha)$ of $\widehat T$ by bootstrapping $\|\widetilde\Theta\sqrt{N}\,\nabla\mathcal{L}_N(\widetilde\theta)\|_\infty$ using the k-grad or n+k-1-grad bootstrap, originally proposed by Yu, Chao & Cheng (2020) for low-dimensional models. However, the number of communication rounds between the master and worker nodes has to be carefully fine-tuned for high-dimensional models. In particular, the k-grad algorithm computes
$$W^{(b)} := \bigg\|\underbrace{-\widetilde\Theta\,\frac{1}{\sqrt{k}}\sum_{j=1}^{k}\epsilon_j^{(b)}\sqrt{n}\,(g_j - \bar g)}_{=:A}\bigg\|_\infty, \quad (5)$$
where $\epsilon_j^{(b)} \overset{\text{i.i.d.}}{\sim} N(0,1)$ independently of the data, $g_j = \nabla\mathcal{L}_j(\widetilde\theta)$, and $\bar g = k^{-1}\sum_{j=1}^k g_j$. However, it is known that k-grad does not perform well when $k$ is small (Yu, Chao & Cheng 2020). The improved algorithm n+k-1-grad computes
$$\widetilde W^{(b)} := \bigg\|\underbrace{-\widetilde\Theta\,\frac{1}{\sqrt{n+k-1}}\bigg(\sum_{i=1}^{n}\epsilon_i^{(b)}(g_i - \bar g) + \sum_{j=2}^{k}\epsilon_j^{(b)}\sqrt{n}\,(g_j - \bar g)\bigg)}_{=:\widetilde A}\bigg\|_\infty, \quad (6)$$
where $\epsilon_i^{(b)}$ and $\epsilon_j^{(b)}$ are i.i.d. $N(0,1)$ multipliers, and $g_i = \nabla\mathcal{L}(\widetilde\theta; Z_i)$ is based on the single datum $Z_i$ in the master. The key advantage of k-grad and n+k-1-grad is that once the master has the gradients $g_j$ from the worker nodes, the quantile of $\{W^{(b)}\}_{b=1}^B$ can be computed in the master node alone, without further communication with the worker nodes. See Algorithm 3 in the Appendix for the pseudocode of k-grad and n+k-1-grad.

Algorithm 1 presents the complete statistical inference procedure. The number of iterations $\tau$ in Algorithm 1 steers the trade-off between statistical accuracy and communication efficiency. In particular, a larger $\tau$ leads to a more accurate coverage of the simultaneous confidence interval, but it also induces a higher communication cost. Therefore, studying the minimal $\tau$ that warrants the bootstrap accuracy is crucial, which will be done in Section 3.

Algorithm 1 k-grad / n+k-1-grad with de-biased $\ell_1$-CSL estimator
Require: $\tau \ge 1$, $\{\lambda^{(t)}\}_{t=0}^{\tau-1}$, nodewise Lasso procedure Node($\cdot,\cdot$) with hyperparameters $\{\lambda_l\}_{l=1}^d$ (see Section A.2)
1: $\widetilde\theta^{(0)} \leftarrow \arg\min_\theta \mathcal{L}_1(\theta) + \lambda^{(0)}\|\theta\|_1$ at $\mathcal{M}_1$
2: Compute $\widetilde\Theta$ by running Node($\nabla^2\mathcal{L}_1(\widetilde\theta^{(0)})$, $\{\lambda_l\}_{l=1}^d$) at $\mathcal{M}_1$
3: for $t = 1,\dots,\tau$ do
4:   Transmit $\widetilde\theta^{(t-1)}$ to $\{\mathcal{M}_j\}_{j=2}^k$
5:   Compute $\nabla\mathcal{L}_1(\widetilde\theta^{(t-1)})$ at $\mathcal{M}_1$
6:   for $j = 2,\dots,k$ do
7:     Compute $\nabla\mathcal{L}_j(\widetilde\theta^{(t-1)})$ at $\mathcal{M}_j$
8:     Transmit $\nabla\mathcal{L}_j(\widetilde\theta^{(t-1)})$ to $\mathcal{M}_1$
9:   end for
10:  $\nabla\mathcal{L}_N(\widetilde\theta^{(t-1)}) \leftarrow k^{-1}\sum_{j=1}^k \nabla\mathcal{L}_j(\widetilde\theta^{(t-1)})$ at $\mathcal{M}_1$
11:  if $t < \tau$ then
12:    $\widetilde\theta^{(t)} \leftarrow \arg\min_\theta \mathcal{L}_1(\theta) - \theta^\top\big(\nabla\mathcal{L}_1(\widetilde\theta^{(t-1)}) - \nabla\mathcal{L}_N(\widetilde\theta^{(t-1)})\big) + \lambda^{(t)}\|\theta\|_1$ at $\mathcal{M}_1$
13:  else
14:    $\widetilde\theta^{(\tau)} \leftarrow \widetilde\theta^{(\tau-1)} - \widetilde\Theta\,\nabla\mathcal{L}_N(\widetilde\theta^{(\tau-1)})$ at $\mathcal{M}_1$
15:  end if
16: end for
17: Run DistBoots('k-grad' or 'n+k-1-grad', $\widetilde\theta = \widetilde\theta^{(\tau)}$, $\{g_j = \nabla\mathcal{L}_j(\widetilde\theta^{(\tau-1)})\}_{j=1}^k$, $\widetilde\Theta$) at $\mathcal{M}_1$
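Assuming the master already holds the worker gradients $g_j$ (rows of `G`), the per-datum master gradients $g_i$ (rows of `Gi`), and the surrogate $\widetilde\Theta$, one draw of each statistic in (5) and (6) can be sketched as follows (function names are ours, not the paper's DistBoots interface):

```python
import numpy as np

def k_grad_draw(Theta, G, n, rng):
    """One draw of W^(b) in (5); rows of G are the worker gradients g_j."""
    k = G.shape[0]
    eps = rng.standard_normal(k)                    # epsilon_j^(b) ~ N(0, 1)
    A = -Theta @ ((G - G.mean(axis=0)).T @ eps) * np.sqrt(n) / np.sqrt(k)
    return np.max(np.abs(A))                        # ell-infinity norm

def nk1_grad_draw(Theta, G, Gi, n, rng):
    """One draw of W~^(b) in (6); rows of Gi are per-datum master gradients g_i."""
    k = G.shape[0]
    gbar = G.mean(axis=0)
    eps_i = rng.standard_normal(Gi.shape[0])        # multipliers for master data
    eps_j = rng.standard_normal(k - 1)              # multipliers for workers 2..k
    S = (Gi - gbar).T @ eps_i + np.sqrt(n) * ((G[1:] - gbar).T @ eps_j)
    return np.max(np.abs(-Theta @ S / np.sqrt(n + k - 1)))

def boot_quantile(draws, alpha):
    """c_W(alpha): empirical alpha-quantile of {W^(b)}_{b=1}^B, computed at the
    master with no further communication."""
    return float(np.quantile(draws, alpha))
```

Note that both draws only touch quantities already sitting at the master, which is exactly the communication advantage described above.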
Remark 2.1. Two groups of hyperparameters need to be chosen in Algorithm 1: $\{\lambda^{(t)}\}_{t=0}^{\tau-1}$ for regularization in the CSL estimation, and $\{\lambda_l\}_{l=1}^d$ for regularization in the nodewise Lasso (see Algorithm 4). In Section 2.4, we propose a cross-validation method for tuning $\{\lambda^{(t)}\}_{t=0}^{\tau-1}$. As to $\{\lambda_l\}_{l=1}^d$, while van de Geer et al. (2014) suggest choosing the same value for all $\lambda_l$ by cross-validation, a potentially better way is to allow $\lambda_l$ to differ across $l$ and to select each $\lambda_l$ via cross-validation for the corresponding nodewise Lasso, which is the approach we take for the distributed variable screening task in Section 5.
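To make Remark 2.1 concrete, the nodewise Lasso construction of a surrogate inverse Hessian with a per-coordinate penalty $\lambda_l$ can be sketched as follows (our ISTA-based stand-in for the Node procedure, written for a linear-model design matrix; not the paper's implementation):

```python
import numpy as np

def lasso_ista(X, y, lam, steps=300):
    """Plain ISTA for min_g ||y - X g||^2 / (2n) + lam * ||g||_1;
    a simple stand-in for a production Lasso solver."""
    n, d = X.shape
    lr = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + 1e-12)
    g = np.zeros(d)
    for _ in range(steps):
        z = g - lr * (X.T @ (X @ g - y) / n)
        g = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)
    return g

def nodewise_lasso(X, lams):
    """Surrogate inverse Hessian via the nodewise Lasso (van de Geer et al. 2014):
    regress each column X_l on the remaining columns, each with its own lams[l]."""
    n, d = X.shape
    Theta = np.zeros((d, d))
    for l in range(d):
        rest = np.delete(np.arange(d), l)
        gamma = lasso_ista(X[:, rest], X[:, l], lams[l])
        resid = X[:, l] - X[:, rest] @ gamma
        tau2 = X[:, l] @ resid / n                 # tau_l^2 normalizer
        row = np.zeros(d)
        row[l] = 1.0
        row[rest] = -gamma
        Theta[l] = row / tau2
    return Theta
```

With independent covariates, the returned matrix is close to the identity, as the population inverse Hessian would be.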
Remark 2.2. There exist other options than CSL for $\widetilde\theta$, such as the averaging de-biased estimator (Lee et al. 2017), but an additional round of communication may be needed to compute the local gradients. More importantly, their method may be inaccurate when $n < k$.

2.4 Communication-Efficient Cross-Validation

We propose a communication-efficient cross-validation method for tuning the hyperparameters $\{\lambda^{(t)}\}_{t=0}^{\tau-1}$ in Algorithm 1. Wang et al. (2017) propose to hold out a validation set on each node for selecting $\lambda^{(t)}$. However, this method requires fitting the model for each candidate value of $\lambda^{(t)}$, which incurs the same communication cost as the complete CSL estimation procedure.

We propose instead a communication-efficient $K$-fold cross-validation method that chooses $\lambda^{(t)}$ for the CSL estimation at every iteration $t$. At iteration $t$, the master reuses the gradients already communicated from the worker nodes at iteration $t-1$. Hence, the cross-validation needs only the master node, which circumvents costly communication between the master and the worker nodes.

Specifically, notice that the surrogate loss (see Line 12 in Algorithm 1) is constructed using the $n$ observations $\mathcal{Z} = \{Z_i\}_{i=1}^n$ in the master node and the $k-1$ gradients $\mathcal{G} = \{\nabla\mathcal{L}_j(\widetilde\theta^{(t-1)})\}_{j=2}^k$ from the worker nodes. We then create $K$ (approximately) equal-size partitions of both $\mathcal{Z}$ and $\mathcal{G}$. The objective function for training is formed using $K-1$ partitions of $\mathcal{Z}$ and $\mathcal{G}$. As the measure of fit, instead of computing the original likelihood or loss, we calculate the unregularized surrogate loss using the held-out partition of $\mathcal{Z}$ and $\mathcal{G}$, still in the master node. See Algorithm 2 for the pseudocode.
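Under a least-squares loss, the master-only cross-validation just described can be sketched as follows (a simplified stand-in using our own proximal-gradient solver for the penalized surrogate subproblem; names and defaults are ours):

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def surrogate_lasso(X, y, shift, lam, steps=300):
    """Minimize avg (y - X theta)^2 / 2 - theta @ shift + lam * ||theta||_1
    by proximal gradient; a stand-in for the paper's CSL subproblem solver."""
    n, d = X.shape
    lr = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + 1e-12)
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n - shift
        theta = soft(theta - lr * grad, lr * lam)
    return theta

def cv_select(Z_X, Z_y, worker_grads, theta_prev, lambdas, K=5):
    """Master-only K-fold CV in the spirit of Algorithm 2: both the master
    data and the already-communicated worker gradients are split into K folds."""
    folds = np.array_split(np.arange(Z_X.shape[0]), K)
    gfolds = np.array_split(worker_grads, K)
    scores = {lam: 0.0 for lam in lambdas}
    for q in range(K):
        tr = np.concatenate([folds[r] for r in range(K) if r != q])
        te = folds[q]
        Gtr = np.concatenate([gfolds[r] for r in range(K) if r != q])
        Gte = gfolds[q]
        g1tr = Z_X[tr].T @ (Z_X[tr] @ theta_prev - Z_y[tr]) / len(tr)
        g1te = Z_X[te].T @ (Z_X[te] @ theta_prev - Z_y[te]) / len(te)
        gbar_tr = np.vstack([g1tr[None, :], Gtr]).mean(axis=0)
        gbar_te = np.vstack([g1te[None, :], Gte]).mean(axis=0)
        for lam in lambdas:
            beta = surrogate_lasso(Z_X[tr], Z_y[tr], g1tr - gbar_tr, lam)
            test_loss = (np.mean((Z_y[te] - Z_X[te] @ beta) ** 2) / 2.0
                         - beta @ (g1te - gbar_te))
            scores[lam] += test_loss / K
    return min(lambdas, key=lambda lam: scores[lam])
```

No step in the loop touches the worker nodes; the worker gradients enter only through the shift terms of the surrogate loss.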
Algorithm 2 Distributed $K$-fold cross-validation for the $t$-step CSL
Require: $(t-1)$-step estimator $\widetilde\theta^{(t-1)}$, set $\Lambda_t$ of candidate values for $\lambda^{(t)}$, partition of master data $\mathcal{Z} = \bigcup_{q=1}^K \mathcal{Z}_q$, partition of worker gradients $\mathcal{G} = \bigcup_{q=1}^K \mathcal{G}_q$
1: for $q = 1,\dots,K$ do
2:   $\mathcal{Z}_{\mathrm{train}} \leftarrow \bigcup_{r\ne q}\mathcal{Z}_r$; $\mathcal{Z}_{\mathrm{test}} \leftarrow \mathcal{Z}_q$
3:   $\mathcal{G}_{\mathrm{train}} \leftarrow \bigcup_{r\ne q}\mathcal{G}_r$; $\mathcal{G}_{\mathrm{test}} \leftarrow \mathcal{G}_q$
4:   $g_{1,\mathrm{train}} \leftarrow \mathrm{Avg}_{Z\in\mathcal{Z}_{\mathrm{train}}}\big(\nabla\mathcal{L}(\widetilde\theta^{(t-1)};Z)\big)$; $g_{1,\mathrm{test}} \leftarrow \mathrm{Avg}_{Z\in\mathcal{Z}_{\mathrm{test}}}\big(\nabla\mathcal{L}(\widetilde\theta^{(t-1)};Z)\big)$
5:   $\bar g_{\mathrm{train}} \leftarrow \mathrm{Avg}_{g\in\{g_{1,\mathrm{train}}\}\cup\mathcal{G}_{\mathrm{train}}}(g)$; $\bar g_{\mathrm{test}} \leftarrow \mathrm{Avg}_{g\in\{g_{1,\mathrm{test}}\}\cup\mathcal{G}_{\mathrm{test}}}(g)$
6:   for $\lambda \in \Lambda_t$ do
7:     $\beta \leftarrow \arg\min_\theta \mathrm{Avg}_{Z\in\mathcal{Z}_{\mathrm{train}}}\big(\mathcal{L}(\theta;Z)\big) - \theta^\top(g_{1,\mathrm{train}} - \bar g_{\mathrm{train}}) + \lambda\|\theta\|_1$
8:     $\mathrm{Loss}(\lambda,q) \leftarrow \mathrm{Avg}_{Z\in\mathcal{Z}_{\mathrm{test}}}\big(\mathcal{L}(\beta;Z)\big) - \beta^\top(g_{1,\mathrm{test}} - \bar g_{\mathrm{test}})$
9:   end for
10: end for
11: Return $\lambda^{(t)} = \arg\min_{\lambda\in\Lambda_t} K^{-1}\sum_{q=1}^K \mathrm{Loss}(\lambda,q)$

3 Theoretical Analysis
Section 3.1 provides an overview of the theoretical results. Section 3.2 presents the rigorous statements for linear models. Section 3.3 presents the results for generalized linear models (GLMs).
3.1 Overview

As discussed in Section 2.3, $\tau$ has to be large enough to ensure the bootstrap accuracy, yet it also induces a greater communication cost. Hence, our main goal is to pin down the minimal number of iterations $\tau_{\min}$ (communication rounds) sufficient for the bootstrap validity in Algorithm 1. An overview of the theoretical results is provided in Figure 2.

As an overall trend in Figure 2, $\tau_{\min}$ increases logarithmically in $k$ and decreases in $n$ for both k-grad and n+k-1-grad in (generalized) linear models; in addition, $\tau_{\min}$ increases logarithmically in $\bar s$, where $\bar s$ is the maximum of the sparsity of the true coefficient vector and that of the inverse population Hessian matrix, to be formally defined later.

Comparing the left and right panels of Figure 2 under a fixed tuple $(n, k, \bar s)$, the $\tau_{\min}$ for k-grad is always greater than or equal to that for n+k-1-grad, which indicates the greater communication efficiency of n+k-1-grad. For very small $k$, n+k-1-grad can still provably work, while k-grad cannot. In particular, $\tau_{\min} = 1$ suffices for certain instances of n+k-1-grad but is always too small for k-grad.

Regarding the comparison between high-dimensional sparse linear models (top panels) and GLMs (bottom panels), GLMs typically require a greater $n$ than sparse linear models, which ensures that the error between $\widetilde\theta^{(t)}$ and $\theta^*$ decreases in a short transient phase; see Section A.3 in the Appendix for details.
Figure 2: Illustration of Theorems 3.1-3.8. The gray regions are where the bootstrap validity is not warranted by our theory; the remaining area is colored blue with lightness varying according to the lower bound $\tau_{\min}$ on the number of iterations. $\gamma_n = \log_d n$, $\gamma_k = \log_d k$, and $\gamma_{\bar s} = \log_d \bar s$ are the orders of the local sample size $n$, the number of machines $k$, and the sparsity $\bar s$. (Panels: linear model and GLM, each for k-grad and n+k-1-grad.)

3.2 Linear Model

Suppose that $N$ i.i.d. observations are generated by a linear model $y = x^\top\theta^* + e$ with an unknown coefficient vector $\theta^* \in \mathbb{R}^d$, covariate random vector $x \in \mathbb{R}^d$, and noise $e \in \mathbb{R}$ independent of $x$ with zero mean and variance $\sigma^2$. We consider the least-squares loss $\mathcal{L}(\theta; z) = \mathcal{L}(\theta; x, y) = (y - x^\top\theta)^2/2$.

(A1) $x$ is sub-Gaussian, i.e., $\sup_{\|w\|_2 \le 1}\mathbb{E}[\exp((w^\top x)^2/L^2)] = O(1)$ for some absolute constant $L > 0$. Moreover, $1/\lambda_{\min}(\Sigma) \le \mu$ for some absolute constant $\mu > 0$, where $\Sigma = \mathbb{E}[xx^\top]$.

(A2) $e$ is sub-Gaussian, i.e., $\mathbb{E}[\exp(e^2/L'^2)] = O(1)$ for some absolute constant $L' > 0$. Moreover, $\sigma > 0$.

(A3) $\theta^*$ and $\Theta_{l\cdot}$ are sparse for $l = 1,\dots,d$, where $\Theta := \Sigma^{-1} = \mathbb{E}[xx^\top]^{-1}$. Specifically, we denote by $S := \{l : \theta^*_l \ne 0\}$ the active set of covariates and its cardinality by $s_0 := |S|$. Also, we define $s_l := |\{l' \ne l : \Theta_{l,l'} \ne 0\}|$, $s^* := \max_l s_l$, and $\bar s = s_0 \vee s^*$.

Assumption (A1) ensures a restricted eigenvalue condition when $n \gtrsim \bar s \log d$ by Rudelson & Zhou (2013). Under these assumptions, we first investigate the theoretical properties of Algorithm 1, where we apply k-grad with the de-biased $\ell_1$-CSL estimator after $\tau$ rounds of communication. Define
$$T := \big\|\sqrt{N}\,(\widetilde\theta^{(\tau)} - \theta^*)\big\|_\infty, \quad (7)$$
where $\widetilde\theta^{(\tau)}$ is the output of Algorithm 1.

Theorem 3.1 (k-grad, sparse linear model). Suppose (A1), (A2), and (A3) hold, and that we run Algorithm 1 with the k-grad method in the linear model. Let
$$\lambda_l \asymp \sqrt{\frac{\log d}{n}} \quad \text{and} \quad \lambda^{(t)} \asymp \sqrt{\frac{\log d}{nk}} + \sqrt{\frac{\log d}{n}}\bigg(\bar s\sqrt{\frac{\log d}{n}}\bigg)^{t}, \quad (8)$$
for $l = 1,\dots,d$ and $t = 0,\dots,\tau-1$. Assume $n = d^{\gamma_n}$, $k = d^{\gamma_k}$, $\bar s = d^{\gamma_{\bar s}}$ for some constants $\gamma_n, \gamma_k, \gamma_{\bar s} > 0$. If $\gamma_n > \gamma_{\bar s}$, $\gamma_k > \gamma_{\bar s}$, and $\tau \ge \tau_{\min}$, where
$$\tau_{\min} = 1 + \bigg\lfloor \max\bigg\{\frac{\gamma_k + \gamma_{\bar s}}{\gamma_n - \gamma_{\bar s}},\, \frac{\gamma_{\bar s}}{\gamma_n - \gamma_{\bar s}}\bigg\}\bigg\rfloor,$$
then for $T$ defined in (7), we have
$$\sup_{\alpha\in(0,1)}\big|\mathbb{P}(T \le c_W(\alpha)) - \alpha\big| = o(1), \quad (9)$$
where $c_W(\alpha) := \inf\{t \in \mathbb{R} : \mathbb{P}_\epsilon(W \le t) \ge \alpha\}$, in which $W$ is the k-grad bootstrap statistic with the same distribution as $W^{(b)}$ in (5), and $\mathbb{P}_\epsilon$ denotes the probability with respect to the randomness of the multipliers.

In addition, (9) also holds if $T$ is replaced by $\widehat T$ defined in (3).

Theorem 3.1 warrants the bootstrap validity for the simultaneous confidence intervals produced by Algorithm 1 with k-grad.
Furthermore, it also suggests that the bootstrap quantile approximates the quantile of the oracle statistic $\widehat T$; that is, our distributed bootstrap procedure is as statistically efficient as the oracle centralized method. Next, we show that the same distributed bootstrap validity and efficiency also hold for n+k-1-grad in Algorithm 1.

Theorem 3.2 (n+k-1-grad, sparse linear model). Suppose (A1), (A2), and (A3) hold, and that we run Algorithm 1 with the n+k-1-grad method. Let $\lambda_l$ and $\lambda^{(t)}$ be as in (8) for $l = 1,\dots,d$ and $t = 0,\dots,\tau-1$. Assume $n = d^{\gamma_n}$, $k = d^{\gamma_k}$, $\bar s = d^{\gamma_{\bar s}}$ for some constants $\gamma_n, \gamma_k, \gamma_{\bar s} > 0$. If $\gamma_n > \gamma_{\bar s}$, $\gamma_n + \gamma_k > \gamma_{\bar s}$, and $\tau \ge \tau_{\min}$, where
$$\tau_{\min} = 1 + \bigg\lfloor \frac{(\gamma_k \vee \gamma_{\bar s}) + \gamma_{\bar s}}{\gamma_n - \gamma_{\bar s}} \bigg\rfloor,$$
then for $T$ defined in (7), we have
$$\sup_{\alpha\in(0,1)}\big|\mathbb{P}(T \le c_{\widetilde W}(\alpha)) - \alpha\big| = o(1), \quad (10)$$
where $c_{\widetilde W}(\alpha) := \inf\{t \in \mathbb{R} : \mathbb{P}_\epsilon(\widetilde W \le t) \ge \alpha\}$, in which $\widetilde W$ is the n+k-1-grad bootstrap statistic with the same distribution as $\widetilde W^{(b)}$ in (6), and $\mathbb{P}_\epsilon$ denotes the probability with respect to the randomness of the multipliers.

In addition, (10) also holds if $T$ is replaced by $\widehat T$ defined in (3).

Note by Theorem 2.4 of van de Geer et al. (2014) that $\widehat T$ is well approximated by $\|\widehat\Theta\sqrt{N}\,\nabla\mathcal{L}_N(\theta^*)\|_\infty$, which is further approximated by the $\ell_\infty$-norm of the oracle score
$$A_0 = -\Theta\,\frac{1}{\sqrt{N}}\sum_{i=1}^{n}\sum_{j=1}^{k}\nabla\mathcal{L}(\theta^*; Z_{ij}),$$
given that $\widehat\Theta$ deviates from $\Theta$ only up to order $O_P(s^*\sqrt{\log d/N})$ in the $\ell_\infty$-norm. To gain a deeper look into the efficiency of k-grad and n+k-1-grad, we compare the difference between the covariance of $A_0$ and the conditional covariance of $A$ (for k-grad, defined in (5)) and of $\widetilde A$ (for n+k-1-grad, defined in (6)).
In particular, conditioning on the data $Z_{ij}$, we have
$$\big|\big|\big|\mathrm{cov}_\epsilon(A) - \mathrm{cov}(A_0)\big|\big|\big|_{\max} \lesssim s^*\big\|\widetilde\theta^{(\tau-1)} - \theta^*\big\|_1 + n s^*\big\|\widetilde\theta^{(\tau-1)} - \theta^*\big\|_1^2 + O_P\bigg(\sqrt{\frac{s^{*2}}{k}} + \sqrt{\frac{s^{*2}}{n}}\bigg), \quad (11)$$
$$\big|\big|\big|\mathrm{cov}_\epsilon(\widetilde A) - \mathrm{cov}(A_0)\big|\big|\big|_{\max} \lesssim s^*\big\|\widetilde\theta^{(\tau-1)} - \theta^*\big\|_1 + (n\wedge k)\,s^*\big\|\widetilde\theta^{(\tau-1)} - \theta^*\big\|_1^2 + O_P\bigg(\sqrt{\frac{s^{*2}}{n+k}} + \sqrt{\frac{s^{*2}}{n}}\bigg), \quad (12)$$
up to some logarithmic factors in $d$, $n$, or $k$. Overall, n+k-1-grad in (12) has a smaller error than k-grad in (11). In particular, k-grad requires both $n$ and $k$ to be large, while n+k-1-grad requires a large $n$ but not necessarily a large $k$. In addition, $\tau = 1$ can be enough for n+k-1-grad, but not for k-grad: if $\|\widetilde\theta^{(0)} - \theta^*\|_1$ is of order $O_P(s_0/\sqrt{n})$, the right-hand side of (11) can grow with $s^*$, while the error in (12) still shrinks to zero as long as $k \ll n$.
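Taking the exponent formulas in Theorems 3.1 and 3.2 at face value, the two lower bounds can be evaluated numerically (a small helper of our own for sanity-checking the trends in Figure 2; not part of the paper's code):

```python
import math

def tau_min_kgrad(gn, gk, gs):
    """tau_min of Theorem 3.1 (k-grad, linear model); valid when gn > gs, gk > gs."""
    return 1 + math.floor(max((gk + gs) / (gn - gs), gs / (gn - gs)))

def tau_min_nk1grad(gn, gk, gs):
    """tau_min of Theorem 3.2 (n+k-1-grad, linear model); valid when gn > gs."""
    return 1 + math.floor((max(gk, gs) + gs) / (gn - gs))

# More machines (larger gamma_k) or a denser model (larger gamma_s) demand
# more communication rounds; a larger local sample (gamma_n) demands fewer.
rounds = [(gk, tau_min_kgrad(1.0, gk, 0.2), tau_min_nk1grad(1.0, gk, 0.2))
          for gk in (0.3, 0.9, 1.7)]
```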
Remark 3.3. Note in both Theorems 3.1 and 3.2 that the expression of $\tau_{\min}$ does not depend on $d$, because the direct effect of $d$ enters only through an iterated-logarithmic term $\log\log d$, which is dominated by $\log \bar s \asymp \log d$.
Remark 3.4. The rates of $\{\lambda^{(t)}\}_{t=0}^{\tau-1}$ and $\{\lambda_l\}_{l=1}^d$ in Theorems 3.1 and 3.2 are motivated by those in Wang et al. (2017) and van de Geer et al. (2014). Unfortunately, they are not directly usable in practice. We therefore provide a practically useful cross-validation method in Section 2.4.
Remark 3.5. The main result (Theorem 2.2) in Zhang & Cheng (2017) can be seen as a justification of the multiplier bootstrap for high-dimensional linear models with data processed in a centralized manner. Theorem 3.2 complements it by justifying a distributed multiplier bootstrap with at least one round of communication ($\tau \ge 1$).

Remark 3.6. A rate for $\sup_{\alpha\in(0,1)}|\mathbb{P}(T \le c_W(\alpha)) - \alpha|$ that is polynomial in $n$ and $k$ may be shown with a more careful analysis, which is faster than the order obtained by the extreme value distribution approach (Chernozhukov et al. 2013, Zhang & Cheng 2017), which is at best logarithmic.

3.3 Generalized Linear Models

In this section, we consider GLMs, which generate i.i.d. observations $(x, y) \in \mathbb{R}^d \times \mathbb{R}$. We assume that the loss function $\mathcal{L}$ is of the form $\mathcal{L}(\theta; z) = g(y, x^\top\theta)$ for $\theta, x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ with $g : \mathbb{R}\times\mathbb{R} \to \mathbb{R}$, where $g(a, b)$ is three times differentiable with respect to $b$; we denote $\frac{\partial}{\partial b}g(a,b)$, $(\frac{\partial}{\partial b})^2 g(a,b)$, $(\frac{\partial}{\partial b})^3 g(a,b)$ by $g'(a,b)$, $g''(a,b)$, $g'''(a,b)$, respectively. We let $\theta^*$ be the unique minimizer of the expected loss $\mathcal{L}^*(\theta)$.

We let $X \in \mathbb{R}^{n\times d}$ be the design matrix in the master node $\mathcal{M}_1$ and $X^* := P^* X$ be the weighted design matrix with a diagonal $P^* \in \mathbb{R}^{n\times n}$ with elements $\{g''(y_i, x_i^\top\theta^*)^{1/2}\}_{i=1,\dots,n}$. We further let $(X^*)_{-l}\varphi^*_l$ be the $L_2$ projection of $(X^*)_l$ on $(X^*)_{-l}$, for $l = 1,\dots,d$. Equivalently, for $l = 1,\dots,d$, we define $\varphi^*_l := \arg\min_{\varphi\in\mathbb{R}^{d-1}}\mathbb{E}[\|(X^*)_l - (X^*)_{-l}\varphi\|_2^2]$.

We impose the following assumptions on the GLM.

(B1) For some $\Delta > 0$ and $\Delta' > 0$ with $|x^\top\theta^*| \le \Delta'$,
$$\sup_{|b|\vee|b'|\le\Delta+\Delta'}\sup_a \frac{|g''(a,b) - g''(a,b')|}{|b-b'|} = O(1), \quad \max_{|b|\le\Delta}\sup_a|g'(a,b)| = O(1), \quad \text{and} \quad \max_{|b|\le\Delta+\Delta'}\sup_a|g''(a,b)| = O(1).$$

(B2) $\|x\|_\infty = O(1)$. Moreover, $x^\top\theta^* = O(1)$ and $\max_l\big|g''(y, x^\top\theta^*)^{1/2}\,x_{-l}^\top\varphi^*_l\big| = O(1)$, where $x_{-l}$ consists of all but the $l$-th coordinate of $x$.

(B3) The least and the greatest eigenvalues of $\nabla^2\mathcal{L}^*(\theta^*)$ and $\mathbb{E}[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top]$ are bounded away from zero and infinity, respectively.

(B4) For some constant $L_0 > 0$,
$$\max_l \max_{q=1,2}\mathbb{E}[|h_l|^{2+q}/L_0^q] + \mathbb{E}[\exp(|h_l|/L_0)] = O(1), \quad \text{or} \quad \max_l \max_{q=1,2}\mathbb{E}[|h_l|^{2+q}/L_0^q] + \mathbb{E}[(\max_l|h_l|/L_0)^4] = O(1),$$
where $h = \nabla^2\mathcal{L}^*(\theta^*)^{-1}\nabla\mathcal{L}(\theta^*; Z)$ and $h_l$ is its $l$-th coordinate.

(B5) $\theta^*$ and $\Theta_{l\cdot}$ are sparse, where $\Theta := \nabla^2\mathcal{L}^*(\theta^*)^{-1}$ is the inverse population Hessian matrix; i.e., $S := \{l : \theta^*_l \ne 0\}$, $s_0 := |S|$, $s_l := |\{l' \ne l : \Theta_{l,l'} \ne 0\}|$, $s^* := \max_l s_l$, and $\bar s = s_0 \vee s^*$.

Assumption (B1) imposes smoothness conditions on the loss function, which are satisfied by, for example, logistic regression. In particular, logistic regression has $g(a, b) = -ab + \log(1 + \exp(b))$, and it can easily be seen that $|g'(a,b)| \le 1$, $|g''(a,b)| \le 1/4$, and $|g'''(a,b)| \le 1/4$ for $a \in \{0,1\}$. Having established the validity of k-grad and n+k-1-grad for linear models, we now extend them to high-dimensional de-biased GLMs. See Figure 2 for a comparison between the results for high-dimensional linear models and GLMs.

Theorem 3.7 (k-grad, sparse GLM). Suppose (B1)-(B5) hold, and that we run Algorithm 1 with the k-grad method in GLMs. Let $\lambda_l \asymp \sqrt{\log d/n}$ for $l = 1,\dots,d$, and let $\lambda^{(t)}$ satisfy
$$\lambda^{(t)} \asymp \begin{cases} \sqrt{\dfrac{\log d}{nk}} + \bar s\bigg(\bar s\sqrt{\dfrac{\log d}{n}}\bigg)^{2^t}, & t \le \tau_1, \\[1ex] \sqrt{\dfrac{\log d}{nk}} + \bar s\bigg(\bar s\sqrt{\dfrac{\log d}{n}}\bigg)^{2^{\tau_1}}\bigg(\bar s\sqrt{\dfrac{\log d}{n}}\bigg)^{t-\tau_1}, & t \ge \tau_1 + 1, \end{cases} \quad (13)$$
for $t = 0,\dots,\tau-1$, where
$$\tau_1 = 1 + \bigg\lfloor \log_2\frac{\gamma_n - \gamma_{\bar s}}{\gamma_n - 2\gamma_{\bar s}} \bigg\rfloor. \quad (14)$$
Assume $n = d^{\gamma_n}$, $k = d^{\gamma_k}$, $\bar s = d^{\gamma_{\bar s}}$ for some constants $\gamma_n, \gamma_k, \gamma_{\bar s} > 0$. If $\gamma_n > 2\gamma_{\bar s}$, $\gamma_k > \gamma_{\bar s}$, and $\tau \ge \tau_{\min}$, where
$$\tau_{\min} = \max\bigg\{\tau_1 + \bigg\lfloor \frac{\gamma_k + \gamma_{\bar s}}{\gamma_n - \gamma_{\bar s}} + \nu \bigg\rfloor,\; \bigg\lfloor \log_2\frac{\gamma_n - \gamma_{\bar s}}{\gamma_n - 2\gamma_{\bar s}} \bigg\rfloor\bigg\}, \qquad \nu = 2^{-\tau_1}\frac{\gamma_n - \gamma_{\bar s}}{\gamma_n - 2\gamma_{\bar s}} \in (0, 1], \quad (15)$$
then we have (9). In addition, (9) also holds if $T$ is replaced by $\widehat T$ defined in (3).
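The logistic-loss derivative bounds quoted above for (B1) can be verified numerically (a quick self-check of ours, not from the paper's code):

```python
import math

def sigmoid(b):
    return 1.0 / (1.0 + math.exp(-b))

def g1(a, b):
    """g'(a, b) = sigmoid(b) - a for g(a, b) = -a*b + log(1 + exp(b))."""
    return sigmoid(b) - a

def g2(b):
    """g''(a, b) = sigmoid(b) * (1 - sigmoid(b)); it does not depend on a."""
    s = sigmoid(b)
    return s * (1.0 - s)

def g3(b):
    """g'''(a, b) = sigmoid(b) * (1 - sigmoid(b)) * (1 - 2 * sigmoid(b))."""
    s = sigmoid(b)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

# On a grid: |g'| <= 1 for a in {0, 1}, |g''| <= 1/4, and |g'''| <= 1/4.
grid = [x / 10.0 for x in range(-80, 81)]
bounds_hold = (all(abs(g1(a, b)) <= 1.0 for a in (0, 1) for b in grid)
               and all(abs(g2(b)) <= 0.25 for b in grid)
               and all(abs(g3(b)) <= 0.25 for b in grid))
```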
The τ₁ in (14) is the number of preliminary communication rounds needed for the CSL estimator to pass through the regions that are far from θ*. As s grows, the time spent in these regions can increase. However, when n is large, e.g., n ≫ s, the loss function is better behaved, so the number of preliminary communication rounds can reduce to τ₁ = 1. See Section A.3 in the Appendix for more detail.

Theorem 3.8 (n+k-1-grad, sparse GLM). Suppose (B1)-(B4) hold, and that we run Algorithm 1 with the n+k-1-grad method in GLMs. Let λ_l ≍ √(log d/n) for l = 1, …, d, and let λ^(t) be as in (13) for t = 0, …, τ − 1. Assume n = d^{γ_n}, k = d^{γ_k}, s̄ = d^{γ_s} for some constants γ_n, γ_k, γ_s > 0. If γ_n > 5γ_s and τ ≥ τ_min, where

τ_min = max{ ⌊log₂((γ_k + γ_s)/(γ_n − γ_s))⌋, 1 }  if γ_k ≤ γ_n − γ_s,
τ_min = τ₁ + ⌊(γ_k + γ_s)/(γ_n − γ_s) + ν⌋  otherwise,

with τ₁ and ν defined as in (14) and (15) respectively, then we have (10). In addition, (10) also holds if T is replaced by T̂ defined in (3). Remark 3.9.
The selection of {λ_l}_{l=1}^d in Theorems 3.7 and 3.8 is motivated by van de Geer et al. (2014), and {λ^(t)}_{t=0}^{τ−1} is motivated by Wang et al. (2017) and Jordan et al. (2019). Here we perform a more careful analysis for the two phases of model tuning as in (13).

We demonstrate the merits of our methods using synthetic data in this section. The code to reproduce the simulation experiments, results, and plots is available at GitHub: https://github.com/skchao74/Distributed-bootstrap . We consider a Gaussian linear model and a logistic regression model. We fix the total sample size N and the dimension d (both powers of 2), and choose the number of machines k from a range of powers of 2. The true coefficient θ* is a d-dimensional vector in which the first s coordinates are 1 and the rest are 0, where s takes one of two values for the linear model and one of two values for the GLM. We generate the covariate vector x independently from N(0, Σ), while considering two different specifications for Σ:

• Toeplitz: Σ_{l,l'} = ρ^{|l−l'|} for a fixed ρ ∈ (0, 1);
• Equi-correlation: Σ_{l,l'} = ρ₀ for l ≠ l', Σ_{l,l} = 1 for all l, for a fixed ρ₀ ∈ (0, 1).

For the linear model, we generate the model noise independently from a centered Gaussian distribution; for the GLM, y ∼ Ber(1/(1 + exp(−xᵀθ*))). For each choice of s and k, we run Algorithm 1 with k-grad and n+k-1-grad on 1,000 independently generated datasets, and compute the empirical coverage probability and the average width based on the results from these 1,000 replications. At each replication, we draw B = 500 bootstrap samples, from which we calculate the 95% empirical quantile to further obtain the 95% simultaneous confidence interval.

For the ℓ₁-CSL computation, we choose the initial λ^(0) by a local K-fold cross-validation, where K = 10 for linear regression and K = 5 for logistic regression. For each iteration t, λ^(t) is selected by Algorithm 2 in Section 2.4 with K' folds, where K' = min{k − 1, 10}, which ensures that each partition of worker gradients is non-empty when k is small. For an efficient implementation of the nodewise Lasso, we select a single λ̄ at every simulation repetition and set λ_l = λ̄ for all l. Specifically, for each simulated dataset, we select λ̄ = 10^{−1} Σ_{l=1}^{10} λ̂_l, where each λ̂_l is obtained by a cross-validation of the nodewise Lasso regression of the l-th variable on the remaining variables. Since the variables are homogeneous, these λ̂_l's only deviate by some random variation, which can be alleviated by averaging.

The oracle width is computed by first fixing (N, d, s) and then generating 500 independent datasets. For each dataset, we compute the centralized de-biased Lasso estimator θ̂ as in (4). The oracle width is defined as two times the 95% empirical quantile of ‖θ̂ − θ*‖∞ over the 500 samples. The average widths are compared against the oracle widths by taking the ratio of the two.

The empirical coverage probabilities and the average width ratios of k-grad and n+k-1-grad are displayed for the linear model in Figures 3 (Toeplitz design) and 4 (equi-correlation design), and for the logistic regression in Figures 5 (Toeplitz design) and 6 (equi-correlation design), respectively. Note that an increase in k implies a decrease in n, given the fixed N. For small k, k-grad tends to over-cover, whereas n+k-1-grad has more accurate coverage.
By contrast, the coverage of both algorithms falls when k gets too large (or n gets too small), since the estimator θ̃^(τ) deviates from θ̂, as the width deviates from the oracle width, which reflects the discussion of (11) and (12). Moreover, as s = ‖θ*‖₀ increases, it becomes harder for both algorithms to achieve accurate 95% coverage, and both algorithms start to fail at a smaller k (or larger n), which stems from the fact that the bootstrap cannot accurately approximate the variance of the asymptotic distribution, as shown in (11) and (12). Nevertheless, raising the number of iterations improves the coverage, which verifies our theory. We also observe an under-coverage of our bootstrap method in both the linear regression and the logistic regression at the early stage of increasing k. This is due to the loss of accuracy in estimating the inverse Hessian matrices using only the data in the master node when k increases (or n decreases).

Having demonstrated the performance of our method on purely synthetic data using sparse models in the last section, in this section we artificially create spurious variables and mix them with the variables obtained from a real big dataset. We check whether our method can successfully select the relevant variables associated with the response variable from the real dataset. The code to retrieve data and reproduce the analyses, results, and plots is available at GitHub: https://github.com/skchao74/Distributed-bootstrap .
Figure 3: Empirical coverage probability (left axis, solid lines) and average width (right axis, dashed lines) of simultaneous confidence intervals by k-grad and n+k-1-grad in sparse linear regression with Toeplitz design and varying sparsity. The black solid line represents the 95% nominal level and the black dashed line represents 1 on the right y-axis.
Figure 4: Empirical coverage probability (left axis, solid lines) and average width (right axis, dashed lines) of simultaneous confidence intervals by k-grad and n+k-1-grad in sparse linear regression with equi-correlation design and varying sparsity. The black solid line represents the 95% nominal level and the black dashed line represents 1 on the right y-axis.

The US Airline On-time Performance dataset, available at http://stat-computing.org/dataexpo/2009 , consists of flight arrival and departure details for all commercial flights
Figure 5: Empirical coverage probability (left axis, solid lines) and average width (right axis, dashed lines) of simultaneous confidence intervals by k-grad and n+k-1-grad in sparse logistic regression with Toeplitz design and varying sparsity. The black solid line represents the 95% nominal level and the black dashed line represents 1 on the right y-axis.

within the US from 1987 to 2008. Given the high dimensionality after the dummy transformation and the huge sample size of the entire dataset, the most efficient way to process the data is a distributed computational system, with the sample size on each worker node
Figure 6: Empirical coverage probability (left axis, solid lines) and average width (right axis, dashed lines) of simultaneous confidence intervals by k-grad and n+k-1-grad in sparse logistic regression with equi-correlation design and varying sparsity. The black solid line represents the 95% nominal level and the black dashed line represents 1 on the right y-axis.

likely to be smaller than the dimension. Our goal here is to uncover statistically significant independent variables associated with flight delay. We use the following variables in our model:

• Year : from 1987 to 2008,
• Month : from 1 to 12,
• DayOfWeek : from 1 (Monday) to 7 (Sunday),
• CRSDepTime : scheduled departure time (in four digits, the first two representing the hour, the last two representing the minute),
• CRSArrTime : scheduled arrival time (in the same format as above),
• UniqueCarrier : unique carrier code,
• Origin : origin (in IATA airport code),
• Dest : destination (in IATA airport code),
• ArrDelay : arrival delay (in minutes); a positive value means there is a delay.

The complete variable information can be found at http://stat-computing.org/dataexpo/2009/the-data.html . The response variable is labeled by 1 to denote a delay if
ArrDelay is greater than zero, and by 0 otherwise. We categorize CRSDepTime and CRSArrTime into 24 one-hour time intervals (e.g., 1420 is converted to 14 to represent the interval [14:00, 15:00]), and then treat Year, Month, DayOfWeek, CRSDepTime, CRSArrTime, UniqueCarrier, Origin, and Dest as nominal predictors. The nominal predictors are encoded by dummies with appropriate dimensions, merging all categories with low counts into "others", and either "others" or the smallest ordinal value is treated as the baseline. This results in a total of 203 predictors. We provide the details of the dummy variable creation in the Appendix (Section A.4). The total sample size is 113.9 million observations. We randomly sample a dataset D₁ of N = 500,000 observations, distributed to k = 1,000 nodes such that each node receives n = 500 observations. We randomly sample another dataset D₂, with D₁ ∩ D₂ = ∅.

An Artificial Design Matrix and Variable Screening

In the first stage, we perform a preliminary study that informs us of some seemingly relevant variables to include in an artificial design matrix, which will be used to demonstrate the variable screening performance of our method in the second stage. Note that the purpose of this stage is only to preliminarily discover possibly relevant variables, rather than to select variables in a fully rigorous manner. We perform a logistic regression in a centralized manner, with intercept and without regularization, using the observations in D₂. Standard t tests reveal that 144 out of 203 slopes are significantly non-zero (p-values less than 0.05). The smallest p-values correspond to the dummy variables of years 2001-2004, and the coefficients are all negative, which suggests a lower likelihood of flight delay in these years. This interesting finding matches the results of a previous study showing that the September 11 terrorist attacks negatively impacted US airline demand (Ito & Lee 2005), which resulted in fewer flights and less congestion. In addition, the Notice of Market-based Actions to Relieve Airport Congestion and Delay (Docket No. OST-2001-9849), issued by the Department of Transportation on August 21, 2001, might also have helped alleviate US airline delay. To construct the artificial design matrix, we group the 4 predictors with the smallest p-values mentioned above and the intercept, so the number of relevant columns is 5. Given d, we artificially create d − 5 spurious columns by drawing from N(0, C_{d−5}), where C_{d−5} is a Toeplitz matrix (C_{l,l'} = ρ^{|l−l'|}), and then converting half of the columns to either 0 or 1 by their signs.
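The spurious-column construction just described can be sketched as follows (a minimal sketch: the Toeplitz coefficient rho, the sizes, and the choice of binarizing the first half of the columns are all illustrative assumptions):

```python
import numpy as np

def spurious_columns(n_rows, n_cols, rho, rng=None):
    """Draw n_cols spurious columns from N(0, C), where C[l, l'] = rho**|l - l'|
    is a Toeplitz matrix, then convert half of the columns to {0, 1} by their
    signs (here the first half, as an assumption; the text does not say which)."""
    rng = np.random.default_rng(rng)
    idx = np.arange(n_cols)
    C = rho ** np.abs(np.subtract.outer(idx, idx))   # Toeplitz covariance
    Z = rng.multivariate_normal(np.zeros(n_cols), C, size=n_rows)
    half = n_cols // 2
    Z[:, :half] = (Z[:, :half] > 0).astype(float)    # sign -> {0, 1}
    return Z

X_spur = spurious_columns(50, 8, rho=0.9, rng=0)     # rho = 0.9 is illustrative
```

These columns would then be concatenated with the 5 relevant columns to form the artificial design matrix.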
Then, we combine these d − 5 spurious columns with the 5 columns of D₁ that are associated with the selected relevant variables to obtain an artificial design matrix.

In the second stage, using the artificial design matrix with the binary response vector from the ArrDelay in D₁, we test whether our distributed bootstrap n+k-1-grad (Algorithm 1) can screen out the artificially created spurious variables. Note that D₁ and D₂ are disjoint, where D₂ is used in the first stage for the preliminary study. For model tuning, we select λ^(0) by a local 10-fold cross-validation; for each t ≥ 1, λ^(t) is chosen by running a distributed 10-fold cross-validation in Algorithm 2. We select each λ_l by performing a 10-fold cross-validation for the nodewise Lasso of each variable. The same entire procedure is repeated under each dimensionality d ∈ {200, 500, 1,000}.

The left panel of Figure 7 plots the number of significant variables against the number of iterations τ, broken down into the number intersecting with the relevant variables (solid lines) and the number intersecting with the spurious variables (dashed lines). First, all of the 4 relevant variables are tested to be significant at all iterations. For the spurious variables, we see that with τ = 1, the distributed bootstrap falsely detects one of them. However, as the number of iterations increases, fewer spurious variables are detected, until none is detected. We also see that 2 iterations (τ = 2) for d = 500, 1,000 and 3 iterations (τ = 3) for d = 200 are sufficient, which empirically verifies that our method is not very sensitive to the nominal dimension d. As an illustration that is potentially useful in practice, the confidence intervals computed with the simultaneous quantile for the 4 important slopes under d = 1,000 and τ = 2 are plotted in the right panel of Figure 7. It can be seen that flights in years 2002 and 2003 are relatively less likely to be delayed, which matches the decreased air traffic in the aftermath of the September 11 terrorist attacks.

Figure 7: The left panel shows the number of significant variables uncovered by the simultaneous confidence intervals among the 4 relevant variables and among the d − 5 spurious variables, for d = 200, 500, 1,000. The right panel shows the simultaneous confidence intervals for the 4 important slopes (Year_2001-Year_2004) under d = 1,000 and τ = 2.

We propose distributed bootstrap methods k-grad and n+k-1-grad for high-dimensional simultaneous inference based on the de-biased ℓ₁-CSL estimator, which complements the work of Yu, Chao & Cheng (2020). The bootstrap validity and oracle efficiency are rigorously studied, and the merits are further shown via a simulation study on coverage probability and efficiency, and a practical example on variable screening.

References
Banerjee, M., Durot, C., Sen, B. et al. (2019), 'Divide and conquer in nonstandard problems and the super-efficiency phenomenon', The Annals of Statistics (2), 720-757.

Battey, H., Fan, J., Liu, H., Lu, J. & Zhu, Z. (2018), 'Distributed estimation and inference with statistical guarantees', Annals of Statistics (3), 1352-1382.

Belloni, A., Chernozhukov, V., Chetverikov, D. & Wei, Y. (2018), 'Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework', Ann. Statist. (6B), 3643-3675. URL: https://doi.org/10.1214/17-AOS1671
Belloni, A., Chernozhukov, V. & Kato, K. (2019), ‘Valid post-selection inference in high-dimensional approximately sparse quantile regression models’,
Journal of the AmericanStatistical Association (526), 749–758.
URL: https://doi.org/10.1080/01621459.2018.1442339
Cai, T. T. & Sun, W. (2017), ‘Large-scale global and simultaneous inference: Estimationand testing in very high dimensions’,
Annual Review of Economics, 411-439.

Chen, X., Lee, J. D., Li, H. & Yang, Y. (2020), 'Distributed estimation for principal component analysis: a gap-free approach', arXiv preprint arXiv:2004.02336.

Chen, X., Liu, W. & Zhang, Y. (2018), 'First-order Newton-type estimator for distributed estimation and inference', arXiv preprint arXiv:1811.11368.

Chen, X., Liu, W. & Zhang, Y. (2019), 'Quantile regression under memory constraint', Ann. Statist. (6), 3244-3273. URL: https://doi.org/10.1214/18-AOS1777
Chen, X. & Xie, M.-g. (2014), ‘A split-and-conquer approach for analysis of extraordinarilylarge data’,
Statistica Sinica pp. 1655-1684.

Chernozhukov, V., Chetverikov, D., Kato, K. et al. (2013), 'Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors',
The Annals of Statistics (6), 2786-2819.

Dezeure, R., Bühlmann, P. & Zhang, C.-H. (2017), 'High-dimensional simultaneous inference with the bootstrap', Test (4), 685-719.

Fan, J., Guo, Y. & Wang, K. (2019), 'Communication-efficient accurate statistical estimation', arXiv preprint arXiv:1906.04870.

Fan, J., Wang, D., Wang, K. & Zhu, Z. (2019), 'Distributed estimation of principal eigenspaces', Annals of Statistics (6), 3009.

Huang, C. & Huo, X. (2019), 'A distributed one-step estimator', Mathematical Programming (1-2), 41-76.

Ito, H. & Lee, D. (2005), 'Assessing the impact of the September 11 terrorist attacks on US airline demand',
Journal of Economics and Business (1), 75-95.

Javanmard, A. & Montanari, A. (2014a), 'Confidence intervals and hypothesis testing for high-dimensional regression', The Journal of Machine Learning Research (1), 2869-2909.

Javanmard, A. & Montanari, A. (2014b), 'Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory', IEEE Transactions on Information Theory (10), 6522-6554.

Jordan, M. I., Lee, J. D. & Yang, Y. (2019), 'Communication-efficient distributed statistical inference', Journal of the American Statistical Association (526), 668-681.

Kleiner, A., Talwalkar, A., Sarkar, P. & Jordan, M. I. (2014), 'A scalable bootstrap for massive data',
Journal of the Royal Statistical Society: Series B (Statistical Methodology) (4), 795-816.

Lan, G., Lee, S. & Zhou, Y. (2018), 'Communication-efficient algorithms for decentralized and stochastic optimization', Mathematical Programming pp. 1-48.

Lee, J. D., Liu, Q., Sun, Y. & Taylor, J. E. (2017), 'Communication-efficient sparse regression',
The Journal of Machine Learning Research (1), 115-144.

Li, R., Lin, D. K. & Li, B. (2013), 'Statistical inference in massive data sets', Applied Stochastic Models in Business and Industry (5), 399-409.

Rudelson, M. & Zhou, S. (2013), 'Reconstruction from anisotropic random measurements', IEEE Transactions on Information Theory (6), 3434-3447.

Sengupta, S., Volgushev, S. & Shao, X. (2016), 'A subsampled double bootstrap for massive data', Journal of the American Statistical Association (515), 1222-1232.

Shi, C., Lu, W. & Song, R. (2018), 'A massive data framework for m-estimators with cubic-rate',
Journal of the American Statistical Association (524), 1698-1709.

Singh, K. & Kaur, R. (2014), Hadoop: addressing challenges of big data, in '2014 IEEE International Advance Computing Conference (IACC)', IEEE, pp. 686-689.

van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R. et al. (2014), 'On asymptotically optimal confidence regions and tests for high-dimensional models', The Annals of Statistics (3), 1166-1202.

Volgushev, S., Chao, S.-K., Cheng, G. et al. (2019), 'Distributed inference for quantile regression processes', The Annals of Statistics (3), 1634-1662.

Wang, J., Kolar, M., Srebro, N. & Zhang, T. (2017), Efficient distributed learning with sparsity, in 'Proceedings of the 34th International Conference on Machine Learning - Volume 70', JMLR.org, pp. 3636-3645.

Wang, J. & Zhang, T. (2017), 'Improved optimization of finite sums with minibatch stochastic variance reduced proximal iterations', arXiv preprint arXiv:1706.07001.

Yu, M., Gupta, V. & Kolar, M. (2020), 'Simultaneous inference for pairwise graphical models with generalized score matching', Journal of Machine Learning Research (91), 1-51.

Yu, Y., Chao, S.-K. & Cheng, G. (2020), Simultaneous inference for massive data: Distributed bootstrap, in 'International Conference on Machine Learning', PMLR, pp. 10892-10901.

Zhang, C.-H. & Zhang, S. S. (2014), 'Confidence intervals for low dimensional parameters in high dimensional linear models', Journal of the Royal Statistical Society: Series B (Statistical Methodology) (1), 217-242.

Zhang, X. & Cheng, G. (2017), 'Simultaneous inference for high-dimensional linear models', Journal of the American Statistical Association (518), 757-768.

Zhang, Y., Wainwright, M. J. & Duchi, J. C. (2012), Communication-efficient algorithms for statistical optimization, in 'Advances in Neural Information Processing Systems', pp. 1502-1510.

Zhao, T., Cheng, G. & Liu, H. (2016), 'A partially linear framework for massive heterogeneous data', Annals of Statistics (4), 1400.

Zhu, X., Li, F.
& Wang, H. (2020), 'Least squares approximation for a distributed system', arXiv preprint arXiv:1908.04904.

Appendix
A Technical Details
A.1 Pseudocode for k-grad and n+k-1-grad
Algorithm 3
DistBoots(method, θ̃, {g_j}_{j=1}^k, Θ̃): only needs the master node M
Require: local gradients g_j and the estimate Θ̃ of the inverse Hessian obtained at M
  ḡ ← k^{−1} Σ_{j=1}^k g_j
  for b = 1, …, B do
    if method = 'k-grad' then
      Draw ε_1^{(b)}, …, ε_k^{(b)} i.i.d. ∼ N(0, 1) and compute W^{(b)} by (5)
    else if method = 'n+k-1-grad' then
      Draw ε_{11}^{(b)}, …, ε_{n1}^{(b)}, ε_2^{(b)}, …, ε_k^{(b)} i.i.d. ∼ N(0, 1) and compute W^{(b)} by (6)
    end if
  end for
  Compute the quantile c_W(α) of {W^{(1)}, …, W^{(B)}} for α ∈ (0, 1)
  Return θ̃_l ± N^{−1/2} c_W(α), l = 1, …, d

Remark A.1.
Although in Algorithm 3 the same θ̃ is used both for the center of the confidence interval and for evaluating the gradients g_ij, allowing them to be different (such as in Algorithm 1) can save one round of communication. For example, we can use θ̃^{(τ)} for the center of the confidence interval, while the gradients are evaluated at θ̃^{(τ−1)}.

A.2 Nodewise Lasso

In Algorithm 4, we state the nodewise Lasso method for constructing an approximate inverse Hessian matrix, following Section 3.1.1 of van de Geer et al. (2014), which we apply in Algorithm 1. We define the components of γ̂_l as γ̂_l = {γ̂_{l,l'}; l' = 1, …, d, l' ≠ l}. We denote by M̂_{l,−l} the l-th row of M̂ without the diagonal element (l, l), and by M̂_{−l,−l} the submatrix without the l-th row and l-th column. Algorithm 4
Node(M̂)
Require: sample Hessian matrix M̂ ∈ R^{d×d}, hyperparameters {λ_l}_{l=1}^d
  for l = 1, …, d do
    Compute γ̂_l = argmin_{γ ∈ R^{d−1}} ( M̂_{l,l} − 2 M̂_{l,−l} γ + γᵀ M̂_{−l,−l} γ + 2 λ_l ‖γ‖₁ )
    Compute τ̂_l = M̂_{l,l} − M̂_{l,−l} γ̂_l
  end for
  Construct M̂^{−1} as

  M̂^{−1} = diag(τ̂_1^{−1}, …, τ̂_d^{−1}) ·
    [   1         −γ̂_{1,2}   ⋯   −γ̂_{1,d}
      −γ̂_{2,1}      1        ⋯   −γ̂_{2,d}
         ⋮          ⋮        ⋱      ⋮
      −γ̂_{d,1}   −γ̂_{d,2}   ⋯      1      ]

Remark A.2.
Throughout this paper, we fix the choice of the nodewise Lasso in Algorithm 1 for computing an approximate inverse Hessian matrix. In practice, various approaches (e.g., Zhang & Zhang (2014), Javanmard & Montanari (2014a)) can be chosen in consideration of estimation accuracy and computational efficiency.

A.3 CSL Estimator for GLMs

For the ℓ₁-penalized CSL estimator of generalized linear models, Theorem 3.3 of Wang et al. (2017) states that

‖θ̃^{(t+1)} − θ*‖₁ ≲ s√(log d/N) + s√(log d/n) ‖θ̃^{(t)} − θ*‖₁ + M s ‖θ̃^{(t)} − θ*‖₁²,  (16)

where M ≥ sup |g'''|, which exists due to Assumption (B1). In linear models, g(a, b) = (a − b)²/2, so g'' is a constant and M = 0, and the CSL estimator has linear convergence to θ* with rate s(log d)^{1/2} n^{−1/2} until it reaches the upper bound given by the first term, which is also the rate of the centralized (oracle) estimator. For GLMs, however, M > 0, which matters when t is small. For example, when t = 0, given that ‖θ̃^{(0)} − θ*‖₁ ≲ s(log d)^{1/2} n^{−1/2}, it is easy to see that the third term is always s times larger than the second term (up to a constant), and a larger n is required to ensure that the third term is less than ‖θ̃^{(t)} − θ*‖₁ and the error is shrinking. However, when t is sufficiently large, this dominance reverses. The threshold is given by the τ₁ in (14), and this implies three phases of convergence: when t ≤ τ₁, the third term dominates and the convergence is quadratic; when t > τ₁, the second term dominates the third and linear convergence kicks in; finally, when t is sufficiently large, the first term dominates. Our analysis complements that of Wang et al. (2017), in whose Corollary 3.7 it is simply assumed that the second term dominates the third.

A.4 Creation of Dummy Variables
To ensure that none of the columns of the design matrix on the master node is completely zero, so that the nodewise Lasso can be computed, we create the dummy variables using only the observations in the master node on the dataset D₁. Specifically, for the variables UniqueCarrier, Origin, and Dest, we keep the top categories that make up 90% of the data in the master node on D₁; the remaining categories are merged into "others" and treated as the baseline. For CRSDepTime and CRSArrTime, we merge the time intervals 23:00-6:00 and 1:00-7:00, respectively (due to their low counts), and use them as the baseline. For Year, Month, and DayOfWeek, we treat year 1987, January, and Monday as the baseline, respectively.
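For illustration, the "top categories plus 'others'" encoding described above might be sketched like this (only the 90% coverage rule comes from the text; the column name and data are hypothetical):

```python
import pandas as pd

def encode_with_others(s, coverage=0.90):
    """Dummy-encode a categorical column, merging rare categories into 'others'.

    Keeps the most frequent categories that together cover `coverage` of the
    rows (which would be computed on the master node's data); the rest become
    'others', which is dropped so that it acts as the baseline."""
    freq = s.value_counts(normalize=True)
    # small tolerance so a cumulative share of exactly `coverage` is kept
    top = set(freq.index[freq.cumsum() <= coverage + 1e-9])
    merged = s.where(s.isin(top), "others")
    dummies = pd.get_dummies(merged, prefix=s.name)
    return dummies.drop(columns=[f"{s.name}_others"], errors="ignore")

carrier = pd.Series(["AA"] * 5 + ["UA"] * 3 + ["XX"] * 2, name="UniqueCarrier")
encoded = encode_with_others(carrier)  # columns: UniqueCarrier_AA, UniqueCarrier_UA
```

Here "XX" falls outside the 90% coverage, is merged into "others", and is dropped as the baseline, so only the kept categories produce dummy columns.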
SUPPLEMENTARY MATERIAL
B Proofs of Main Results
To simplify the notation, in the proofs we denote θ̄ = θ̃^{(τ−1)}, where θ̃^{(τ−1)} is the ℓ₁-penalized estimator at iteration τ − 1, and θ̃ = θ̃^{(τ)}, the output of Algorithm 1. Proof of Theorem 3.1.
We apply Theorem 3 of Wang et al. (2017), where their Assumption 2 is inherited from Assumption (A1), and obtain that if n ≫ s log d,

‖θ̄ − θ*‖₁ = ‖θ̃^{(τ−1)} − θ*‖₁ = O_P( s√(log d/N) + (s√(log d/n))^τ ).

Then, by Lemma C.1, we have that sup_{α∈(0,1)} |P(T ≤ c_W(α)) − α| = o(1), as long as n ≫ s* log^κ d + s log d, k ≫ s* log^κ d, and

s√(log d/N) + (s√(log d/n))^τ ≪ min{ √k/(s* log^κ d), √n/(s* log^κ d) }.

These conditions hold if n ≫ (s* + s* s) log^κ d, k ≫ s* s log^κ d, and

τ > max{ (log k + log s* + log(C log^κ d))/(log n − log(s²) − log log d), (log s* + log(s²) + log(C log^κ d))/(log n − log(s²) − log log d) }.

If n = d^{γ_n}, k = d^{γ_k}, s̄ = s ∨ s* = d^{γ_s} for some constants γ_n, γ_k, and γ_s, then a sufficient condition is γ_n > γ_s, γ_k > γ_s, and

τ ≥ ⌊ max{ (γ_k + γ_s)/(γ_n − γ_s), γ_s/(γ_n − γ_s) } ⌋.

Proof of Theorem 3.2. Similarly to the proof of Theorem 3.1, applying Theorem 3 of Wang et al. (2017) and Lemma C.2, we have that sup_{α∈(0,1)} |P(T ≤ c_W̃(α)) − α| = o(1), as long as n ≫ s* log^κ d + s log d, n + k ≫ s* log^κ d, and

s√(log d/N) + (s√(log d/n))^τ ≪ min{ √k/(s* log^κ d), 1/(s* √(log((n + k)d)) log^κ d) }.

These conditions hold if n ≫ (s* + s* s) log^κ d, n + k ≫ s* log^κ d, nk ≫ s* s log^κ d, and

τ > max{ (log k + log s* + log(C log^κ d))/(log n − log(s²) − log log d), (log s* + log log((n + k)d) + log(C log^κ d))/(log n − log(s²) − log log d) }.
If n = d^{γ_n}, k = d^{γ_k}, s̄ = s ∨ s* = d^{γ_s} for some constants γ_n, γ_k, and γ_s, then a sufficient condition is γ_n > γ_s, γ_n + γ_k > γ_s, and

τ ≥ ⌊ ((γ_k ∨ γ_s) + γ_s)/(γ_n − γ_s) ⌋.

Proof of Theorem 3.7.
We apply Theorem 6 of Wang et al. (2017), where their Assumption 2 is inherited from Assumption (B3), and obtain that if n ≫ s log d,

‖θ̄ − θ*‖₁ = ‖θ̃^{(τ−1)} − θ*‖₁ = O_P( s√(log d/N) + (1/s)(s²√(log d/n))^{2^{τ−1}} ) for τ ≤ τ₁ + 1,
‖θ̄ − θ*‖₁ = ‖θ̃^{(τ−1)} − θ*‖₁ = O_P( s√(log d/N) + (1/s)(s²√(log d/n))^{2^{τ₁}} (s√(log d/n))^{τ−τ₁−1} ) for τ > τ₁ + 1,

where τ₁ is the smallest integer t such that (s²√(log d/n))^{2^t} ≲ s√(log d/n), i.e.,

τ₁ = ⌈ log₂( (log n − log(s²) − log(C log d)) / (log n − log(s⁴) − log log d) ) ⌉.

Then, by Lemma C.3, we have that sup_{α∈(0,1)} |P(T ≤ c_W(α)) − α| = o(1), as long as n ≫ (s + s*) log^κ d + s log d, k ≫ s* log^κ d, and

s√(log d/N) + (1/s)(s²√(log d/n))^{2^{τ−1}} ≪ min{ √k/(s* s log^κ d), √n/(s* log^κ d) }

if τ ≤ τ₁ + 1, and

s√(log d/N) + (1/s)(s²√(log d/n))^{2^{τ₁}} (s√(log d/n))^{τ−τ₁−1} ≪ min{ √k/(s* s log^κ d), √n/(s* log^κ d) }

if τ > τ₁ + 1. If n = d^{γ_n}, k = d^{γ_k}, s̄ = s ∨ s* = d^{γ_s} for some constants γ_n, γ_k, and γ_s, then a sufficient condition is γ_n > 5γ_s, γ_k > γ_s, and

τ ≥ max{ τ₁ + ⌊(γ_k + γ_s)/(γ_n − γ_s) + ν⌋, ⌈log₂((γ_n − 2γ_s)/(γ_n − 4γ_s))⌉ },

where

τ₁ = 1 + ⌊log₂((γ_n − 2γ_s)/(γ_n − 4γ_s))⌋, ν = 2 − 2^{τ₁}(γ_n − 4γ_s)/(γ_n − 2γ_s) ∈ (0, 1].

Proof of Theorem 3.8.
Similarly to the proof of Theorem 3.2, applying Theorem 6 of Wang et al. (2017) and Lemma C.4, we have that sup_{α∈(0,1)} |P(T ≤ c_W(α)) − α| = o(1), as long as n ≫ (s + s*) log^κ d, n + k ≫ s* log^κ d, and

s√(log d/N) + (1/s)(s²√(log d/n))^{2^{τ−1}} ≪ min{ (n + k)/( s*(n + k√(log d) + k^{3/2} log^{1/2} d) log^κ d ), √k/(s* s log^κ d), (nk/(s* log^κ d))^{1/2} }

if τ ≤ τ₁ + 1, and

s√(log d/N) + (1/s)(s²√(log d/n))^{2^{τ₁}} (s√(log d/n))^{τ−τ₁−1} ≪ min{ (n + k)/( s*(n + k√(log d) + k^{3/2} log^{1/2} d) log^κ d ), √k/(s* s log^κ d), (nk/(s* log^κ d))^{1/2} }

if τ > τ₁ + 1, where

τ₁ = ⌈ log₂( (log n − log(s²) − log(C log d)) / (log n − log(s⁴) − log log d) ) ⌉.

Let s̄ = s ∨ s*. If n = s̄^{γ_n}, k = s̄^{γ_k}, and d = s̄^{γ_d} for some constants γ_n, γ_k, and γ_d, then a sufficient condition is γ_n > 5, together with, if τ ≤ τ₁ + 1,

τ ≥ max{ ⌊log₂((γ_k + 1)/(γ_n − 1))⌋, 1 },

and, if τ > τ₁ + 1,

τ ≥ τ₁ + ⌊(γ_k + 1)/(γ_n − 1) + ν⌋,

where

τ₁ = 1 + ⌊log₂((γ_n − 2)/(γ_n − 4))⌋, ν = 2 − 2^{τ₁}(γ_n − 4)/(γ_n − 2) ∈ (0, 1].

C Technical Lemmas
Lemma C.1 (k-grad). In the sparse linear model, under Assumptions (A1) and (A2), if n ≫ s* log^κ d, k ≫ s* log^κ d, and

‖θ̄ − θ*‖₁ ≪ min{ √k/(s* log^κ d), √n/(s* log^κ d) }

for some κ > 0, then we have that

sup_{α∈(0,1)} |P(T ≤ c_W(α)) − α| = o(1), and  (17)

sup_{α∈(0,1)} |P(T̂ ≤ c_W(α)) − α| = o(1).  (18)

Proof of Lemma C.1.
As noted by Zhang & Cheng (2017), since $\|\sqrt N(\widetilde\theta-\theta^*)\|_\infty=\max_l\sqrt N|\widetilde\theta_l-\theta^*_l|=\sqrt N\max_l\big((\widetilde\theta_l-\theta^*_l)\vee(\theta^*_l-\widetilde\theta_l)\big)$, the arguments for the bootstrap consistency result with
\[
T=\max_l \sqrt N(\widetilde\theta-\theta^*)_l \quad (19)
\]
and
\[
\widehat T=\max_l \sqrt N(\widehat\theta-\theta^*)_l \quad (20)
\]
imply the bootstrap consistency result for $T=\|\sqrt N(\widetilde\theta-\theta^*)\|_\infty$ and $\widehat T=\|\sqrt N(\widehat\theta-\theta^*)\|_\infty$. Hence, from now on, we redefine $T$ and $\widehat T$ as (19) and (20). Define an oracle multiplier bootstrap statistic as
\[
W^*:=\max_{1\le l\le d}-\frac{1}{\sqrt N}\sum_{i=1}^{n}\sum_{j=1}^{k}\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z_{ij})\big)_l\,\epsilon^*_{ij}, \quad (21)
\]
where $\{\epsilon^*_{ij}\}_{i=1,\dots,n;\,j=1,\dots,k}$ are $N$ independent standard Gaussian variables, also independent of the entire dataset. The proof consists of two steps: the first step is to show that $W^*$ achieves bootstrap consistency, i.e., $\sup_{\alpha\in(0,1)}|P(T\le c_{W^*}(\alpha))-\alpha|$ converges to 0, where $c_{W^*}(\alpha)=\inf\{t\in\mathbb R: P_\epsilon(W^*\le t)\ge\alpha\}$, and the second step is to show the bootstrap consistency of our proposed bootstrap statistic by showing that the quantiles of $W$ and $W^*$ are close.

Note that $\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z)=\mathrm E[xx^\top]^{-1}x(x^\top\theta^*-y)=-\Theta xe$ and
\[
\mathrm E\Big[\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z)\,\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z)\big)^\top\Big]=\Theta\,\mathrm E[xx^\top e^2]\,\Theta=\sigma^2\Theta\Sigma\Theta=\sigma^2\Theta.
\]
Then, under Assumptions (A1) and (A2),
\[
\min_l \mathrm E\Big[\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z)\big)_l^2\Big]=\sigma^2\min_l\Theta_{l,l}\ge\sigma^2\lambda_{\min}(\Theta)=\frac{\sigma^2}{\lambda_{\max}(\Sigma)} \quad (22)
\]
is bounded away from zero. Under Assumption (A1), $x$ is sub-Gaussian, that is, $w^\top x$ is sub-Gaussian with uniformly bounded $\psi_2$-norm for all $w\in S^{d-1}$. To show that $w^\top\Theta x$ is also sub-Gaussian with uniformly bounded $\psi_2$-norm, we write it as
\[
w^\top\Theta x=(\Theta w)^\top x=\|\Theta w\|_2\left(\frac{\Theta w}{\|\Theta w\|_2}\right)^{\!\top} x.
\]
Since $\Theta w/\|\Theta w\|_2\in S^{d-1}$, we have that $(\Theta w/\|\Theta w\|_2)^\top x$ is sub-Gaussian with $O(1)$ $\psi_2$-norm, and hence $w^\top\Theta x$ is sub-Gaussian with $O(\|\Theta w\|_2)=O(\lambda_{\max}(\Theta))=O(\lambda_{\min}(\Sigma)^{-1})=O(1)$ $\psi_2$-norm, under Assumption (A1). Since $e$ is also sub-Gaussian under Assumption (A2) and is independent of $w^\top\Theta x$, we have that $w^\top\Theta xe$ is sub-exponential with uniformly bounded $\psi_1$-norm for all $w\in S^{d-1}$, and also, all $(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z))_l$ are sub-exponential with uniformly bounded $\psi_1$-norm. Combining this with (22), we have verified Assumption (E.1) of Chernozhukov et al. (2013) for $\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z)$.

Define
\[
T_0:=\max_{1\le l\le d}-\sqrt N\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\big)_l, \quad (23)
\]
which is a Bahadur representation of $T$. Under the condition $\log^7(dN)/N\lesssim N^{-c}$ for some constant $c>0$, which holds if $N\gtrsim\log^{\kappa}d$ for some $\kappa>7$, applying Theorem 3.2 and Corollary 2.1 of Chernozhukov et al. (2013), we obtain that for some constant $c>0$ and any $v,\zeta>0$,
\[
\sup_{\alpha\in(0,1)}\big|P(T\le c_{W^*}(\alpha))-\alpha\big|\lesssim N^{-c}+v^{1/3}\Big(1\vee\log\frac dv\Big)^{2/3}+P\big(|||\widehat\Omega-\Omega|||_{\max}>v\big)+\zeta\sqrt{1\vee\log\frac d\zeta}+P(|T-T_0|>\zeta), \quad (24)
\]
where
\[
\widehat\Omega:=\mathrm{cov}_\epsilon\left(-\frac{1}{\sqrt N}\sum_{i=1}^n\sum_{j=1}^k\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z_{ij})\,\epsilon^*_{ij}\right)=\nabla^2\mathcal L^*(\theta^*)^{-1}\left(\frac1N\sum_{i=1}^n\sum_{j=1}^k\nabla\mathcal L(\theta^*;Z_{ij})\nabla\mathcal L(\theta^*;Z_{ij})^\top\right)\nabla^2\mathcal L^*(\theta^*)^{-1},\ \text{and} \quad (25)
\]
\[
\Omega:=\mathrm{cov}\big(-\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z)\big)=\nabla^2\mathcal L^*(\theta^*)^{-1}\,\mathrm E\big[\nabla\mathcal L(\theta^*;Z)\nabla\mathcal L(\theta^*;Z)^\top\big]\,\nabla^2\mathcal L^*(\theta^*)^{-1}. \quad (26)
\]
To show that the quantiles of $W$ and $W^*$ are close, we first have that for any $\omega$ such that $\alpha+\omega,\alpha-\omega\in(0,1)$,
\[
P\big(\{T\le c_W(\alpha)\}\,\triangle\,\{T\le c_{W^*}(\alpha)\}\big)\le P\big(c_{W^*}(\alpha-\omega)<T\le c_{W^*}(\alpha+\omega)\big)+P\big(c_{W^*}(\alpha-\omega)>c_W(\alpha)\big)+P\big(c_W(\alpha)>c_{W^*}(\alpha+\omega)\big),
\]
where $\triangle$ denotes the symmetric difference. Following the arguments in the proof of Lemma 3.2 of Chernozhukov et al. (2013), we have that
\[
P\big(c_W(\alpha)>c_{W^*}(\alpha+\pi(u))\big)\le P\big(|||\bar\Omega-\widehat\Omega|||_{\max}>u\big),\quad\text{and}\quad P\big(c_{W^*}(\alpha-\pi(u))>c_W(\alpha)\big)\le P\big(|||\bar\Omega-\widehat\Omega|||_{\max}>u\big),
\]
where $\pi(u):=u^{1/3}(1\vee\log(d/u))^{2/3}$ and
\[
\bar\Omega:=\mathrm{cov}_\epsilon\left(-\frac{1}{\sqrt k}\sum_{j=1}^k\widetilde\Theta\sqrt n\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)\epsilon_j\right)=\widetilde\Theta\left(\frac1k\sum_{j=1}^k n\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)^\top\right)\widetilde\Theta^\top. \quad (27)
\]
By letting $\omega=\pi(u)$, we have that
\[
P\big(\{T\le c_W(\alpha)\}\,\triangle\,\{T\le c_{W^*}(\alpha)\}\big)\le P\big(c_{W^*}(\alpha-\pi(u))<T\le c_{W^*}(\alpha+\pi(u))\big)+2P\big(|||\bar\Omega-\widehat\Omega|||_{\max}>u\big),
\]
where, by (24),
\[
P\big(c_{W^*}(\alpha-\pi(u))<T\le c_{W^*}(\alpha+\pi(u))\big)=P(T\le c_{W^*}(\alpha+\pi(u)))-P(T\le c_{W^*}(\alpha-\pi(u)))\lesssim\pi(u)+N^{-c}+\zeta\sqrt{1\vee\log\frac d\zeta}+P(|T-T_0|>\zeta),
\]
so that
\[
\sup_{\alpha\in(0,1)}\big|P(T\le c_W(\alpha))-\alpha\big|\lesssim N^{-c}+v^{1/3}\Big(1\vee\log\frac dv\Big)^{2/3}+P\big(|||\widehat\Omega-\Omega|||_{\max}>v\big)+\zeta\sqrt{1\vee\log\frac d\zeta}+P(|T-T_0|>\zeta)+u^{1/3}\Big(1\vee\log\frac du\Big)^{2/3}+P\big(|||\bar\Omega-\widehat\Omega|||_{\max}>u\big). \quad (28)
\]
Applying Lemmas C.5, C.10, and C.9, we have that there exist some $\zeta,u,v>0$ such that
\[
\zeta\sqrt{1\vee\log\frac d\zeta}+P(|T-T_0|>\zeta)=o(1), \quad (29)
\]
\[
u^{1/3}\Big(1\vee\log\frac du\Big)^{2/3}+P\big(|||\bar\Omega-\widehat\Omega|||_{\max}>u\big)=o(1),\ \text{and} \quad (30)
\]
\[
v^{1/3}\Big(1\vee\log\frac dv\Big)^{2/3}+P\big(|||\widehat\Omega-\Omega|||_{\max}>v\big)=o(1), \quad (31)
\]
and hence, after simplifying the conditions, obtain the first result in the lemma. To obtain the second result, we use Lemma C.6, which yields
\[
\xi\sqrt{1\vee\log\frac d\xi}+P\big(|\widehat T-T_0|>\xi\big)=o(1). \quad (32)
\]

Lemma C.2 (n+k-1-grad). In the sparse linear model, under Assumptions (A1) and (A2), if $n\gg s^{*2}\log^{\kappa}d+s^{*}\log^{\kappa}d$, $n+k\gg s^{*2}\log^{\kappa}d$, $nk\gtrsim\log^{\kappa}d$, and
\[
\left\|\bar\theta-\theta^*\right\|_1\ll\min\left\{\frac{1}{\sqrt{ks^*}\log^{\kappa}d},\ \frac{1}{s^*\sqrt{\log((n+k)d)}\,\log^{\kappa}d}\right\},
\]
for some $\kappa>0$, then we have that
\[
\sup_{\alpha\in(0,1)}\big|P(T\le c_{\widetilde W}(\alpha))-\alpha\big|=o(1),\ \text{and} \quad (33)
\]
\[
\sup_{\alpha\in(0,1)}\big|P(\widehat T\le c_{\widetilde W}(\alpha))-\alpha\big|=o(1). \quad (34)
\]

Proof of Lemma C.2.
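Throughout these lemmas, the bootstrap quantile $c_{W}(\alpha)=\inf\{t:P_\epsilon(W\le t)\ge\alpha\}$ is a quantile conditional on the data, taken over the Gaussian multipliers, and in practice it is approximated by Monte Carlo. A minimal sketch of this approximation, with i.i.d. standard normal vectors standing in for the centered score vectors (an assumption made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B, alpha = 2_000, 50, 500, 0.95

# Stand-ins for the centered score vectors; in the lemmas these are
# -(Hessian)^{-1} gradients, here simply i.i.d. N(0, I_d) rows.
scores = rng.standard_normal((N, d))

# B bootstrap draws of W = max_l N^{-1/2} * sum_i eps_i * scores[i, l]
eps = rng.standard_normal((B, N))
W = (eps @ scores / np.sqrt(N)).max(axis=1)

# Empirical bootstrap quantile c_W(alpha)
c_alpha = np.quantile(W, alpha)

# Coverage check against fresh draws of T0 = max_l sqrt(N) * (column mean)_l
T0 = np.array([np.sqrt(N) * rng.standard_normal((N, d)).mean(axis=0).max()
               for _ in range(200)])
coverage = (T0 <= c_alpha).mean()
print(round(c_alpha, 2), coverage)
```

The empirical coverage should be close to the nominal level $\alpha$, which is exactly the content of displays such as (17) and (33).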
By the argument in the proof of Lemma C.1, we have that
\[
\sup_{\alpha\in(0,1)}\big|P(T\le c_{\widetilde W}(\alpha))-\alpha\big|\lesssim N^{-c}+v^{1/3}\Big(1\vee\log\frac dv\Big)^{2/3}+P\big(|||\widehat\Omega-\Omega|||_{\max}>v\big)+\zeta\sqrt{1\vee\log\frac d\zeta}+P(|T-T_0|>\zeta)+u^{1/3}\Big(1\vee\log\frac du\Big)^{2/3}+P\big(|||\widetilde\Omega-\widehat\Omega|||_{\max}>u\big), \quad (35)
\]
where
\[
\widetilde\Omega:=\mathrm{cov}_\epsilon\left(-\frac{1}{\sqrt{n+k-1}}\left(\sum_{i=1}^n\widetilde\Theta\big(\nabla\mathcal L(\bar\theta;Z_{i1})-\nabla\mathcal L_N(\bar\theta)\big)\epsilon_i+\sum_{j=2}^k\widetilde\Theta\sqrt n\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)\epsilon_j\right)\right)
\]
\[
=\widetilde\Theta\,\frac{1}{n+k-1}\left(\sum_{i=1}^n\big(\nabla\mathcal L(\bar\theta;Z_{i1})-\nabla\mathcal L_N(\bar\theta)\big)\big(\nabla\mathcal L(\bar\theta;Z_{i1})-\nabla\mathcal L_N(\bar\theta)\big)^\top+\sum_{j=2}^k n\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)^\top\right)\widetilde\Theta^\top, \quad (36)
\]
if $N\gtrsim\log^{\kappa}d$ for some $\kappa>7$. Applying Lemmas C.5, C.10, and C.11, we have that there exist some $\zeta,u,v>0$ such that (29),
\[
u^{1/3}\Big(1\vee\log\frac du\Big)^{2/3}+P\big(|||\widetilde\Omega-\widehat\Omega|||_{\max}>u\big)=o(1), \quad (37)
\]
and (31) hold, and hence, after simplifying the conditions, obtain the first result in the lemma. To obtain the second result, we use Lemma C.6, which yields (32).

Lemma C.3 (k-grad). In the sparse GLM, under Assumptions (B1)–(B4), if $n\gg(s+s^*)^2\log^{\kappa}d+(s+s^*)\log^{\kappa}d$, $k\gg s^{*2}\log^{\kappa}d$, and
\[
\left\|\bar\theta-\theta^*\right\|_1\ll\min\left\{\frac{1}{\sqrt{ks^*}\,s\log^{\kappa}d},\ \frac{1}{\sqrt{ns^*}\log^{\kappa}d}\right\},
\]
for some $\kappa>0$, then we have that (17) and (18) hold.

Proof of Lemma C.3.
We redefine $T$ and $\widehat T$ as (19) and (20), and define an oracle multiplier bootstrap statistic as in (21). Under Assumption (B3),
\[
\min_l \mathrm E\Big[\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z)\big)_l^2\Big]=\min_l\Big(\nabla^2\mathcal L^*(\theta^*)^{-1}\,\mathrm E\big[\nabla\mathcal L(\theta^*;Z)\nabla\mathcal L(\theta^*;Z)^\top\big]\,\nabla^2\mathcal L^*(\theta^*)^{-1}\Big)_{l,l}
\]
\[
\ge\lambda_{\min}\Big(\nabla^2\mathcal L^*(\theta^*)^{-1}\,\mathrm E\big[\nabla\mathcal L(\theta^*;Z)\nabla\mathcal L(\theta^*;Z)^\top\big]\,\nabla^2\mathcal L^*(\theta^*)^{-1}\Big)\ge\lambda_{\min}\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\big)^2\,\lambda_{\min}\Big(\mathrm E\big[\nabla\mathcal L(\theta^*;Z)\nabla\mathcal L(\theta^*;Z)^\top\big]\Big)=\frac{\lambda_{\min}\big(\mathrm E\big[\nabla\mathcal L(\theta^*;Z)\nabla\mathcal L(\theta^*;Z)^\top\big]\big)}{\lambda_{\max}\big(\nabla^2\mathcal L^*(\theta^*)\big)^2}
\]
is bounded away from zero. Combining this with Assumption (B4), we have verified Assumption (E.1) of Chernozhukov et al. (2013) for $\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L(\theta^*;Z)$. Then, we use the same argument as in the proof of Lemma C.1, and obtain (28) with
\[
\bar\Omega:=\widetilde\Theta(\widetilde\theta^{(0)})\left(\frac1k\sum_{j=1}^k n\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)^\top\right)\widetilde\Theta(\widetilde\theta^{(0)})^\top, \quad (38)
\]
under the condition $\log^7(dN)/N\lesssim N^{-c}$ for some constant $c>0$, which holds if $N\gtrsim\log^{\kappa}d$ for some $\kappa>7$. Applying Lemmas C.7, C.13, and C.12, we have that there exist some $\zeta,u,v>0$ such that (29), (30), and (31) hold, and the lemma follows as in the proof of Lemma C.1.

Lemma C.4 (n+k-1-grad). In the sparse GLM, under Assumptions (B1)–(B4), if $n\gg(s+s^*)^2\log^{\kappa}d+(s+s^*)\log^{\kappa}d$, $n+k\gg s^{*2}\log^{\kappa}d$, $nk\gtrsim\log^{\kappa}d$, and
\[
\left\|\bar\theta-\theta^*\right\|_1\ll\min\left\{\frac{n+k}{s^*\big(\sqrt{n+k}\sqrt{\log d}+k^{3/4}\log^{3/4}d\big)\log^{\kappa}d},\ \frac{1}{\sqrt{ks^*}\,s\log^{\kappa}d},\ \big(nk\,s^{*2}\log^{\kappa}d\big)^{-1/4}\right\},
\]
for some $\kappa>0$, then we have that (33) and (34) hold.

Proof of Lemma C.4.
By the argument in the proof of Lemma C.3, we obtain (35) with
\[
\widetilde\Omega:=\widetilde\Theta(\widetilde\theta^{(0)})\,\frac{1}{n+k-1}\left(\sum_{i=1}^n\big(\nabla\mathcal L(\bar\theta;Z_{i1})-\nabla\mathcal L_N(\bar\theta)\big)\big(\nabla\mathcal L(\bar\theta;Z_{i1})-\nabla\mathcal L_N(\bar\theta)\big)^\top+\sum_{j=2}^k n\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)\big(\nabla\mathcal L_j(\bar\theta)-\nabla\mathcal L_N(\bar\theta)\big)^\top\right)\widetilde\Theta(\widetilde\theta^{(0)})^\top, \quad (39)
\]
if $N\gtrsim\log^{\kappa}d$ for some $\kappa>7$. Applying Lemmas C.7, C.13, and C.14, we have that there exist some $\zeta,u,v>0$ such that (29), (37), and (31) hold, and the lemma follows as in the proof of Lemma C.2.
Lemma C.5. $T$ and $T_0$ are defined as in (7) and (23), respectively. In the sparse linear model, under Assumptions (A1) and (A2), provided that $\|\bar\theta-\theta^*\|_1=O_P(r_{\bar\theta})$ and $n\gg s^{*2}\log d$, we have that
\[
|T-T_0|=O_P\left(r_{\bar\theta}\sqrt{s^*k\log d}+\frac{s^*\log d}{\sqrt n}\right).
\]
Moreover, if $n\gg s^{*2}\log^{\kappa}d$ and
\[
\left\|\bar\theta-\theta^*\right\|_1\ll\frac{1}{\sqrt{ks^*}\log^{\kappa}d},
\]
for some $\kappa>0$, then there exists some $\zeta>0$ such that (29) holds.

Proof of Lemma C.5. First, we note that
\[
|T-T_0|\le\max_{1\le l\le d}\Big|\sqrt N(\widetilde\theta-\theta^*)_l+\sqrt N\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\big)_l\Big|=\sqrt N\,\Big\|\widetilde\theta-\theta^*+\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\Big\|_\infty,
\]
where we use the fact that $|\max_l a_l-\max_l b_l|\le\max_l|a_l-b_l|$ for any two vectors $a$ and $b$ of the same dimension. Next, we bound $\big\|\widetilde\theta-\theta^*+\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\big\|_\infty$.
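The elementary inequality invoked here, $|\max_l a_l-\max_l b_l|\le\max_l|a_l-b_l|$, can be checked directly on random vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    a, b = rng.standard_normal(20), rng.standard_normal(20)
    # |max a - max b| <= max |a - b|
    assert abs(a.max() - b.max()) <= np.abs(a - b).max() + 1e-12
print("ok")
```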
In the linear model, we have that
\[
\widetilde\theta-\theta^*+\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)=\bar\theta+\frac{\widetilde\Theta X_N^\top(y_N-X_N\bar\theta)}{N}-\theta^*-\frac{\Theta X_N^\top(y_N-X_N\theta^*)}{N},
\]
and then,
\[
\Big\|\widetilde\theta-\theta^*+\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\Big\|_\infty
=\left\|\bar\theta+\frac{\widetilde\Theta X_N^\top(y_N-X_N\bar\theta)}{N}-\theta^*-\frac{\widetilde\Theta X_N^\top(y_N-X_N\theta^*)}{N}+\frac{\widetilde\Theta X_N^\top(y_N-X_N\theta^*)}{N}-\frac{\Theta X_N^\top(y_N-X_N\theta^*)}{N}\right\|_\infty
\]
\[
\le\left\|\left(\frac{\widetilde\Theta X_N^\top X_N}{N}-I_d\right)(\bar\theta-\theta^*)\right\|_\infty+\left\|\big(\widetilde\Theta-\Theta\big)\frac{X_N^\top e_N}{N}\right\|_\infty
\le\left|\left|\left|\frac{\widetilde\Theta X_N^\top X_N}{N}-I_d\right|\right|\right|_{\max}\big\|\bar\theta-\theta^*\big\|_1+\big|\big|\big|\widetilde\Theta-\Theta\big|\big|\big|_\infty\left\|\frac{X_N^\top e_N}{N}\right\|_\infty,
\]
where we use the triangle inequality in the second-to-last inequality, and the fact that for any matrix $A$ and vector $a$ with compatible dimensions, $\|Aa\|_\infty\le|||A|||_{\max}\|a\|_1$ and $\|Aa\|_\infty\le|||A|||_\infty\|a\|_\infty$, in the last inequality. Further applying the triangle inequality and the fact that for any two matrices $A$ and $B$ with compatible dimensions, $|||AB|||_{\max}\le|||A|||_\infty|||B|||_{\max}$, we have that
\[
\left|\left|\left|\frac{\widetilde\Theta X_N^\top X_N}{N}-I_d\right|\right|\right|_{\max}=\left|\left|\left|\widetilde\Theta\left(\frac{X_N^\top X_N}{N}-\frac{X^\top X}{n}\right)+\frac{\widetilde\Theta X^\top X}{n}-I_d\right|\right|\right|_{\max}
\le|||\widetilde\Theta|||_\infty\left|\left|\left|\frac{X_N^\top X_N}{N}-\frac{X^\top X}{n}\right|\right|\right|_{\max}+\left|\left|\left|\frac{\widetilde\Theta X^\top X}{n}-I_d\right|\right|\right|_{\max}.
\]
Under Assumption (A1), $X_N$ has sub-Gaussian rows. Then, by Lemma C.21, if $n\gg s^{*2}\log d$, we have that
\[
|||\widetilde\Theta|||_\infty=\max_l\|\widetilde\Theta_l\|_1=O_P\big(\sqrt{s^*}\big),\qquad\left|\left|\left|\frac{\widetilde\Theta X^\top X}{n}-I_d\right|\right|\right|_{\max}=O_P\left(\sqrt{\frac{\log d}{n}}\right),\qquad\text{and}\qquad|||\widetilde\Theta-\Theta|||_\infty=\max_l\|\widetilde\Theta_l-\Theta_l\|_1=O_P\left(s^*\sqrt{\frac{\log d}{n}}\right).
\]
It remains to bound $\big|\big|\big|\frac{X_N^\top X_N}{N}-\frac{X^\top X}{n}\big|\big|\big|_{\max}$ and $\big\|\frac{X_N^\top e_N}{N}\big\|_\infty$. Under Assumption (A1), each $x_{ij,l}$ is sub-Gaussian, and therefore the product $x_{ij,l}x_{ij,l'}$ of any two is sub-exponential. By Bernstein's inequality, we have that for any $t>0$,
\[
P\left(\left|\frac{(X_N^\top X_N)_{l,l'}}{N}-\Sigma_{l,l'}\right|>t\right)\le2\exp\left(-cN\left(\frac{t^2}{\Sigma_{l,l'}^2}\wedge\frac{t}{|\Sigma_{l,l'}|}\right)\right),
\]
or, for any $\delta\in(0,1)$,
\[
P\left(\left|\frac{(X_N^\top X_N)_{l,l'}}{N}-\Sigma_{l,l'}\right|>|\Sigma_{l,l'}|\left(\frac{\log\frac{d^2}{\delta}}{cN}\vee\sqrt{\frac{\log\frac{d^2}{\delta}}{cN}}\right)\right)\le\frac{\delta}{d^2},
\]
for some constant $c>0$. Then, by the union bound, we have that
\[
P\left(\left|\left|\left|\frac{X_N^\top X_N}{N}-\Sigma\right|\right|\right|_{\max}>|||\Sigma|||_{\max}\left(\frac{\log\frac{d^2}{\delta}}{cN}\vee\sqrt{\frac{\log\frac{d^2}{\delta}}{cN}}\right)\right)\le\delta. \quad (40)
\]
Similarly, we have that
\[
P\left(\left|\left|\left|\frac{X^\top X}{n}-\Sigma\right|\right|\right|_{\max}>|||\Sigma|||_{\max}\left(\frac{\log\frac{d^2}{\delta}}{cn}\vee\sqrt{\frac{\log\frac{d^2}{\delta}}{cn}}\right)\right)\le\delta. \quad (41)
\]
Then, by the triangle inequality, we have that
\[
\left|\left|\left|\frac{X_N^\top X_N}{N}-\frac{X^\top X}{n}\right|\right|\right|_{\max}\le\left|\left|\left|\frac{X^\top X}{n}-\Sigma\right|\right|\right|_{\max}+\left|\left|\left|\frac{X_N^\top X_N}{N}-\Sigma\right|\right|\right|_{\max}\lesssim|||\Sigma|||_{\max}\left(\frac{\log\frac{d^2}{\delta}}{n}\vee\sqrt{\frac{\log\frac{d^2}{\delta}}{n}}\right)\lesssim\frac{\log\frac{d^2}{\delta}}{n}\vee\sqrt{\frac{\log\frac{d^2}{\delta}}{n}},
\]
with probability at least $1-2\delta$, where we use $|||\Sigma|||_{\max}\le|||\Sigma|||_2=\lambda_{\max}(\Sigma)=O(1)$ under Assumption (A1). This implies that
\[
\left|\left|\left|\frac{X_N^\top X_N}{N}-\frac{X^\top X}{n}\right|\right|\right|_{\max}=O_P\left(\sqrt{\frac{\log d}{n}}\right).
\]
Under Assumptions (A1) and (A2), each $x_{ij,l}$ and $e_{ij}$ are sub-Gaussian, and therefore their product $x_{ij,l}e_{ij}$ is sub-exponential.
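The $\sqrt{\log d/N}$ scaling of the max-norm deviation delivered by Bernstein's inequality and the union bound can be seen empirically. The sketch below (illustrative sizes, $\Sigma=I_d$) reports the ratio of $|||X_N^\top X_N/N-\Sigma|||_{\max}$ to $\sqrt{\log d/N}$, which should stay bounded as $N$ and $d$ vary:

```python
import numpy as np

rng = np.random.default_rng(3)
ratios = []
for N, d in [(2_000, 50), (8_000, 50), (4_000, 200)]:
    X = rng.standard_normal((N, d))              # sub-Gaussian rows with Sigma = I_d
    dev = np.abs(X.T @ X / N - np.eye(d)).max()  # |||X^T X / N - Sigma|||_max
    ratios.append(dev / np.sqrt(np.log(d) / N))
print([round(r, 2) for r in ratios])
```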
Applying Bernstein's inequality, we have that for any $\delta\in(0,1)$,
\[
P\left(\left|\frac{(X_N^\top e_N)_l}{N}\right|>\sqrt{\Sigma_{l,l}}\,\sigma\left(\frac{\log\frac d\delta}{cN}\vee\sqrt{\frac{\log\frac d\delta}{cN}}\right)\right)\le\frac\delta d,
\]
for some constant $c>0$. Then, by the union bound, we have that
\[
P\left(\left\|\frac{X_N^\top e_N}{N}\right\|_\infty>\max_l\sqrt{\Sigma_{l,l}}\,\sigma\left(\frac{\log\frac d\delta}{cN}\vee\sqrt{\frac{\log\frac d\delta}{cN}}\right)\right)\le\delta, \quad (42)
\]
and then,
\[
\left\|\frac{X_N^\top e_N}{N}\right\|_\infty=O_P\left(\sqrt{\frac{\log d}{N}}\right).
\]
Putting all the preceding bounds together, we obtain that
\[
\Big\|\widetilde\theta-\theta^*+\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\Big\|_\infty\le\left(|||\widetilde\Theta|||_\infty\left|\left|\left|\frac{X_N^\top X_N}{N}-\frac{X^\top X}{n}\right|\right|\right|_{\max}+\left|\left|\left|\frac{\widetilde\Theta X^\top X}{n}-I_d\right|\right|\right|_{\max}\right)\big\|\bar\theta-\theta^*\big\|_1+|||\widetilde\Theta-\Theta|||_\infty\left\|\frac{X_N^\top e_N}{N}\right\|_\infty
\]
\[
=\left(O_P\big(\sqrt{s^*}\big)\,O_P\!\left(\sqrt{\frac{\log d}{n}}\right)+O_P\!\left(\sqrt{\frac{\log d}{n}}\right)\right)O_P(r_{\bar\theta})+O_P\!\left(s^*\sqrt{\frac{\log d}{n}}\right)O_P\!\left(\sqrt{\frac{\log d}{N}}\right)=O_P\left(\sqrt{\frac{s^*\log d}{n}}\,r_{\bar\theta}+\frac{s^*\log d}{n\sqrt k}\right),
\]
where we assume that $\|\bar\theta-\theta^*\|_1=O_P(r_{\bar\theta})$, and hence,
\[
|T-T_0|=O_P\left(r_{\bar\theta}\sqrt{s^*k\log d}+\frac{s^*\log d}{\sqrt n}\right).
\]
Choosing
\[
\zeta=\left(r_{\bar\theta}\sqrt{s^*k\log d}+\frac{s^*\log d}{\sqrt n}\right)^{1-\kappa},
\]
with any $\kappa>0$, we deduce that $P(|T-T_0|>\zeta)=o(1)$. We also have that $\zeta\sqrt{1\vee\log(d/\zeta)}=o(1)$, provided that
\[
\left(r_{\bar\theta}\sqrt{s^*k\log d}+\frac{s^*\log d}{\sqrt n}\right)\log^{1/2+\kappa}d=o(1),
\]
which holds if $n\gg s^{*2}\log^{\kappa}d$ and $r_{\bar\theta}\ll\frac{1}{\sqrt{ks^*}\log^{\kappa}d}$.

Lemma C.6. $\widehat T$ and $T_0$ are defined as in (20) and (23), respectively. In the sparse linear model, under Assumptions (A1) and (A2), provided that $n\gg s^{*2}\log d$, we have that
\[
|\widehat T-T_0|=O_P\left(\frac{\big(s\sqrt{s^*}+s^*\big)\log d}{\sqrt n}\right).
\]
Moreover, if $n\gg\big(s^2s^*+s^{*2}\big)\log^{\kappa}d$ for some $\kappa>0$, then there exists some $\xi>0$ such that (32) holds.

Proof of Lemma C.6.
By the proof of Lemma C.5, we obtain that
\[
|\widehat T-T_0|\le\max_{1\le l\le d}\Big|\sqrt N(\widehat\theta-\theta^*)_l+\sqrt N\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\big)_l\Big|=\sqrt N\,\Big\|\widehat\theta-\theta^*+\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\Big\|_\infty
\]
\[
=\sqrt N\left\|\widehat\theta_L+\frac{\widetilde\Theta X_N^\top(y_N-X_N\widehat\theta_L)}{N}-\theta^*-\frac{\Theta X_N^\top(y_N-X_N\theta^*)}{N}\right\|_\infty
\le\sqrt N\left(\left|\left|\left|\frac{\widetilde\Theta X_N^\top X_N}{N}-I_d\right|\right|\right|_{\max}\big\|\widehat\theta_L-\theta^*\big\|_1+|||\widetilde\Theta-\Theta|||_\infty\left\|\frac{X_N^\top e_N}{N}\right\|_\infty\right)
\]
\[
=O_P\big(\sqrt{s^*k\log d}\big)\,\big\|\widehat\theta_L-\theta^*\big\|_1+O_P\left(\frac{s^*\log d}{\sqrt n}\right).
\]
Since $\|\widehat\theta_L-\theta^*\|_1=O_P\big(s\sqrt{\log d/N}\big)$, we have that
\[
|\widehat T-T_0|=O_P\left(\frac{\big(s\sqrt{s^*}+s^*\big)\log d}{\sqrt n}\right).
\]
Choosing
\[
\xi=\left(\frac{\big(s\sqrt{s^*}+s^*\big)\log d}{\sqrt n}\right)^{1-\kappa},
\]
with any $\kappa>0$, we deduce that $P(|\widehat T-T_0|>\xi)=o(1)$. We also have that $\xi\sqrt{1\vee\log(d/\xi)}=o(1)$, provided that
\[
\left(\frac{\big(s\sqrt{s^*}+s^*\big)\log d}{\sqrt n}\right)\log^{1/2+\kappa}d=o(1),
\]
which holds if $n\gg\big(s^2s^*+s^{*2}\big)\log^{\kappa}d$.

Lemma C.7. $T$ and $T_0$ are defined as in (7) and (23), respectively. In the sparse GLM, under Assumptions (B1) and (B2), provided that $\|\bar\theta-\theta^*\|_1=O_P(r_{\bar\theta})$ and $n\gg s^2\log d+s^{*2}\log d$, we have that
\[
|T-T_0|=O_P\left(r_{\bar\theta}\sqrt{s^*k\log d}+\frac{s^*\log d}{\sqrt n}\right).
\]
Moreover, if $n\gg\big(s^{*2}+s^2\big)\log^{\kappa}d$ and
\[
\left\|\bar\theta-\theta^*\right\|_1\ll\min\left\{\frac{1}{\sqrt{ks^*}\,s\log^{\kappa}d},\ \big(nk\,s^{*2}\log^{\kappa}d\big)^{-1/4}\right\},
\]
for some $\kappa>0$, then there exists some $\zeta>0$ such that (29) holds.

Proof of Lemma C.7. Following the argument in the proof of Lemma C.5, we have that
\[
|T-T_0|\le\max_{1\le l\le d}\Big|\sqrt N(\widetilde\theta_l-\theta^*_l)+\sqrt N\big(\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\big)_l\Big|=\sqrt N\,\Big\|\widetilde\theta-\theta^*+\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\Big\|_\infty,
\]
and
\[
\Big\|\widetilde\theta-\theta^*+\nabla^2\mathcal L^*(\theta^*)^{-1}\nabla\mathcal L_N(\theta^*)\Big\|_\infty=\Big\|\bar\theta-\widetilde\Theta(\widetilde\theta^{(0)})\nabla\mathcal L_N(\bar\theta)-\theta^*+\Theta\nabla\mathcal L_N(\theta^*)\Big\|_\infty
\]
\[
=\Big\|\bar\theta-\widetilde\Theta(\widetilde\theta^{(0)})\nabla\mathcal L_N(\bar\theta)-\theta^*+\widetilde\Theta(\widetilde\theta^{(0)})\nabla\mathcal L_N(\theta^*)-\widetilde\Theta(\widetilde\theta^{(0)})\nabla\mathcal L_N(\theta^*)+\Theta\nabla\mathcal L_N(\theta^*)\Big\|_\infty
\]
\[
\le\Big\|\bar\theta-\theta^*-\widetilde\Theta(\widetilde\theta^{(0)})\big(\nabla\mathcal L_N(\bar\theta)-\nabla\mathcal L_N(\theta^*)\big)\Big\|_\infty+\Big\|\big(\widetilde\Theta(\widetilde\theta^{(0)})-\Theta\big)\nabla\mathcal L_N(\theta^*)\Big\|_\infty.
\]
By Taylor’s theorem, we have that ∇L N (¯ θ ) − ∇L N ( θ ∗ ) = (cid:90) ∇ L N ( θ ∗ + t (¯ θ − θ ∗ )) dt (¯ θ − θ ∗ ) , (43)and then, (cid:13)(cid:13)(cid:13)(cid:101) θ − θ ∗ + ∇ L ∗ ( θ ∗ ) − ∇L N ( θ ∗ ) (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13) ¯ θ − θ ∗ − (cid:101) Θ( (cid:101) θ (0) ) (cid:90) ∇ L N ( θ ∗ + t (¯ θ − θ ∗ )) dt (¯ θ − θ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) ∞ + (cid:13)(cid:13)(cid:13)(cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) ∇L N ( θ ∗ ) (cid:13)(cid:13)(cid:13) ∞ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:90) (cid:16) (cid:101) Θ( (cid:101) θ (0) ) ∇ L N ( θ ∗ + t (¯ θ − θ ∗ )) − I d (cid:17) dt (¯ θ − θ ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) ∞ + (cid:13)(cid:13)(cid:13)(cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) ∇L N ( θ ∗ ) (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:90) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:101) Θ( (cid:101) θ (0) ) ∇ L N ( θ ∗ + t (¯ θ − θ ∗ )) − I d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max dt (cid:13)(cid:13) ¯ θ − θ ∗ (cid:13)(cid:13) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ (cid:107)∇L N ( θ ∗ ) (cid:107) ∞ .
By the triangle inequality, we have that
\[
\begin{aligned}
&\big\| \widetilde\Theta(\widetilde\theta^{(0)})\, \nabla^2 \mathcal{L}_N(\theta^* + t(\bar\theta - \theta^*)) - I_d \big\|_{\max} \\
&\quad= \big\| \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_N(\theta^* + t(\bar\theta - \theta^*)) - \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_N(\theta^*)
+ \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_N(\theta^*) - \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_1(\theta^*) \\
&\qquad + \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_1(\theta^*) - \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_1(\widetilde\theta^{(0)})
+ \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_1(\widetilde\theta^{(0)}) - I_d \big\|_{\max} \\
&\quad\le \big\| \widetilde\Theta(\widetilde\theta^{(0)}) \big\|_{\infty} \Big( \big\| \nabla^2 \mathcal{L}_N(\theta^* + t(\bar\theta - \theta^*)) - \nabla^2 \mathcal{L}_N(\theta^*) \big\|_{\max}
+ \big\| \nabla^2 \mathcal{L}_N(\theta^*) - \nabla^2 \mathcal{L}_1(\theta^*) \big\|_{\max} \\
&\qquad + \big\| \nabla^2 \mathcal{L}_1(\theta^*) - \nabla^2 \mathcal{L}_1(\widetilde\theta^{(0)}) \big\|_{\max} \Big)
+ \big\| \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_1(\widetilde\theta^{(0)}) - I_d \big\|_{\max}.
\end{aligned}
\]
Under Assumption (B1), we have by Taylor's theorem that
\[
\big| g''(y_{ij}, x_{ij}^\top(\theta^* + t(\bar\theta - \theta^*))) - g''(y_{ij}, x_{ij}^\top \theta^*) \big|
= \Big| \int_0^1 g'''(y_{ij}, x_{ij}^\top(\theta^* + st(\bar\theta - \theta^*)))\, ds \cdot t\, x_{ij}^\top(\bar\theta - \theta^*) \Big|
\lesssim \big| x_{ij}^\top(\bar\theta - \theta^*) \big|,
\]
and then
\[
\begin{aligned}
\big\| \nabla^2 \mathcal{L}_N(\theta^* + t(\bar\theta - \theta^*)) - \nabla^2 \mathcal{L}_N(\theta^*) \big\|_{\max}
&= \Big\| \frac{1}{N} \sum_{i=1}^n \sum_{j=1}^k x_{ij} x_{ij}^\top \big( g''(y_{ij}, x_{ij}^\top(\theta^* + t(\bar\theta - \theta^*))) - g''(y_{ij}, x_{ij}^\top \theta^*) \big) \Big\|_{\max} \\
&\le \frac{1}{N} \sum_{i=1}^n \sum_{j=1}^k \big\| x_{ij} x_{ij}^\top \big\|_{\max}\, \big| g''(y_{ij}, x_{ij}^\top(\theta^* + t(\bar\theta - \theta^*))) - g''(y_{ij}, x_{ij}^\top \theta^*) \big| \\
&\lesssim \frac{1}{N} \sum_{i=1}^n \sum_{j=1}^k \| x_{ij} \|_\infty \big| x_{ij}^\top(\bar\theta - \theta^*) \big|
\le \frac{1}{N} \sum_{i=1}^n \sum_{j=1}^k \| x_{ij} \|_\infty^2\, \| \bar\theta - \theta^* \|_1
\lesssim \| \bar\theta - \theta^* \|_1, \tag{44}
\end{aligned}
\]
where we use that \( \| x_{ij} \|_\infty = O(1) \) under Assumption (B2) in the last inequality. Similarly, we have that
\[
\big\| \nabla^2 \mathcal{L}_1(\theta^*) - \nabla^2 \mathcal{L}_1(\widetilde\theta^{(0)}) \big\|_{\max} \lesssim \big\| \widetilde\theta^{(0)} - \theta^* \big\|_1 = O_P\bigg( s \sqrt{\frac{\log d}{n}} \bigg),
\]
by noticing that \( \widetilde\theta^{(0)} \) is a local lasso estimator computed using \( n \) observations. Note that
\[
\big\| \nabla^2 \mathcal{L}_N(\theta^*) - \nabla^2 \mathcal{L}^*(\theta^*) \big\|_{\max}
= \Big\| \frac{1}{N} \sum_{i=1}^n \sum_{j=1}^k g''(y_{ij}, x_{ij}^\top \theta^*)\, x_{ij} x_{ij}^\top - \mathbb{E}\big[ g''(y, x^\top \theta^*)\, x x^\top \big] \Big\|_{\max},
\]
and \( g''(y_{ij}, x_{ij}^\top \theta^*) = O(1) \) under Assumption (B1).
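The entrywise Lipschitz bound (44) on the empirical Hessian is easy to sanity-check numerically. The sketch below is illustrative only and is not from the paper's code: it uses the logistic loss (for which \( g'' \) is the sigmoid variance and \( g''' \) is bounded) and hypothetical sizes, and compares the max-norm Hessian difference with the \( \ell_1 \) size of the perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 30
X = rng.uniform(-1, 1, size=(n, d))  # bounded design, ||x||_inf = O(1) as in (B2)

def hessian(theta):
    # logistic loss: g''(y, x^T theta) = p(1 - p) with p = sigmoid(x^T theta)
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    w = p * (1.0 - p)
    return (X * w[:, None]).T @ X / n

theta0 = rng.normal(scale=0.1, size=d)
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    delta = rng.normal(size=d) * eps
    diff = np.max(np.abs(hessian(theta0 + delta) - hessian(theta0)))
    # (44) says this ratio stays bounded by a constant as eps shrinks
    ratios.append(diff / np.sum(np.abs(delta)))
print(ratios)
```

The ratios stay bounded by a constant (here, roughly the uniform bound on \( |g'''| \) times \( \|x\|_\infty^2 \)), as (44) predicts.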
Then, by Hoeffding's inequality and the union bound over the \( d^2 \) entries, we have that
\[
\mathbb{P}\bigg( \Big| \frac{1}{N} \sum_{i=1}^n \sum_{j=1}^k g''(y_{ij}, x_{ij}^\top \theta^*)\, x_{ij,l} x_{ij,l'} - \mathbb{E}\big[ g''(y, x^\top \theta^*)\, x_l x_{l'} \big] \Big| > \sqrt{\frac{c \log(d^2/\delta)}{N}} \bigg) \le \frac{\delta}{d^2},
\]
for any \( \delta \in (0,1) \) and some constant \( c > 0 \), i.e., with probability at least \( 1 - \delta \),
\[
\big\| \nabla^2 \mathcal{L}_N(\theta^*) - \nabla^2 \mathcal{L}^*(\theta^*) \big\|_{\max} \le \sqrt{\frac{c \log(d^2/\delta)}{N}},
\]
which implies that
\[
\big\| \nabla^2 \mathcal{L}_N(\theta^*) - \nabla^2 \mathcal{L}^*(\theta^*) \big\|_{\max} = O_P\bigg( \sqrt{\frac{\log d}{N}} \bigg). \tag{45}
\]
Similarly, we have that
\[
\big\| \nabla^2 \mathcal{L}_1(\theta^*) - \nabla^2 \mathcal{L}^*(\theta^*) \big\|_{\max} = O_P\bigg( \sqrt{\frac{\log d}{n}} \bigg),
\]
and then, by the triangle inequality,
\[
\big\| \nabla^2 \mathcal{L}_N(\theta^*) - \nabla^2 \mathcal{L}_1(\theta^*) \big\|_{\max}
\le \big\| \nabla^2 \mathcal{L}_N(\theta^*) - \nabla^2 \mathcal{L}^*(\theta^*) \big\|_{\max} + \big\| \nabla^2 \mathcal{L}_1(\theta^*) - \nabla^2 \mathcal{L}^*(\theta^*) \big\|_{\max}
= O_P\bigg( \sqrt{\frac{\log d}{n}} \bigg).
\]
Note that \( \nabla \mathcal{L}_N(\theta^*) = \frac{1}{N} \sum_{i=1}^n \sum_{j=1}^k g'(y_{ij}, x_{ij}^\top \theta^*)\, x_{ij} \) and \( g'(y_{ij}, x_{ij}^\top \theta^*)\, x_{ij,l} = O(1) \) for each \( l = 1, \ldots, d \) under Assumptions (B1) and (B2). Then, by Hoeffding's inequality, we have that
\[
\mathbb{P}\big( | \nabla \mathcal{L}_N(\theta^*)_l | > t \big) \le 2 \exp\bigg( -\frac{2 N t^2}{c^2} \bigg), \tag{46}
\]
for any \( t > 0 \), or
\[
\mathbb{P}\bigg( | \nabla \mathcal{L}_N(\theta^*)_l | > \sqrt{\frac{c^2 \log(2d/\delta)}{2N}} \bigg) \le \frac{\delta}{d},
\]
for any \( \delta \in (0,1) \), i.e., by the union bound, with probability at least \( 1 - \delta \),
\[
\| \nabla \mathcal{L}_N(\theta^*) \|_\infty \le \sqrt{\frac{c^2 \log(2d/\delta)}{2N}},
\]
which implies that
\[
\| \nabla \mathcal{L}_N(\theta^*) \|_\infty = O_P\bigg( \sqrt{\frac{\log d}{N}} \bigg). \tag{47}
\]
By Lemma C.22, provided that \( n \gg s^2 \log d + s^{*2} \log d \), we have that
\[
\big\| \widetilde\Theta(\widetilde\theta^{(0)}) \big\|_\infty = O_P\big( \sqrt{s^*} \big), \qquad
\big\| \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_1(\widetilde\theta^{(0)}) - I_d \big\|_{\max} = O_P\bigg( \sqrt{\frac{\log d}{n}} \bigg),
\]
and
\[
\big\| \widetilde\Theta(\widetilde\theta^{(0)}) - \Theta \big\|_\infty = O_P\bigg( (s + s^*) \sqrt{\frac{\log d}{n}} \bigg).
\]
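The \( \sqrt{\log d / N} \) scaling in (45) and (47) can be seen in a small Monte Carlo experiment. The sketch below is an illustration only (a linear-model score \( \frac{1}{N}\sum_i \varepsilon_i x_i \) with hypothetical sizes, not a quantity computed in the paper): quadrupling \( N \) at fixed \( d \) should roughly halve \( \mathbb{E}\|\nabla\mathcal{L}_N(\theta^*)\|_\infty \).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 100

def sup_norm_grad(N, reps=100):
    # linear-model score at theta*: (1/N) sum_i eps_i x_i, bounded x, Gaussian eps
    vals = []
    for _ in range(reps):
        X = rng.uniform(-1, 1, size=(N, d))
        eps = rng.normal(size=N)
        vals.append(np.max(np.abs(X.T @ eps / N)))
    return np.mean(vals)

m1, m2 = sup_norm_grad(500), sup_norm_grad(2000)
print(m1 / m2)  # close to sqrt(2000/500) = 2, the sqrt(1/N) part of the rate
```

Repeating the comparison across several values of \( d \) would likewise exhibit the mild \( \sqrt{\log d} \) growth in the dimension.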
Putting all the preceding bounds together, we obtain that
\[
\big\| \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_N(\theta^* + t(\bar\theta - \theta^*)) - I_d \big\|_{\max}
= O_P(\sqrt{s^*}) \bigg( O_P(r_{\bar\theta}) + O_P\bigg( \sqrt{\frac{\log d}{n}} \bigg) + O_P\bigg( s \sqrt{\frac{\log d}{n}} \bigg) \bigg) + O_P\bigg( \sqrt{\frac{\log d}{n}} \bigg)
= O_P\bigg( \sqrt{s^*} \bigg( r_{\bar\theta} + s \sqrt{\frac{\log d}{n}} \bigg) \bigg),
\]
and then,
\[
\big\| \widetilde\theta - \theta^* - \Theta\big( \nabla \mathcal{L}^*(\theta^*) - \nabla \mathcal{L}_N(\theta^*) \big) \big\|_\infty
= O_P\bigg( \sqrt{s^*} \bigg( r_{\bar\theta} + s \sqrt{\frac{\log d}{n}} \bigg) \bigg) O_P(r_{\bar\theta}) + O_P\bigg( (s + s^*) \sqrt{\frac{\log d}{n}} \bigg) O_P\bigg( \sqrt{\frac{\log d}{N}} \bigg)
= O_P\bigg( \sqrt{s^*} \bigg( r_{\bar\theta} + s \sqrt{\frac{\log d}{n}} \bigg) r_{\bar\theta} + \frac{(s + s^*) \log d}{n \sqrt{k}} \bigg),
\]
since \( \| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta}) \), and hence,
\[
| T - T_0 | = O_P\bigg( \sqrt{s^*} \big( \sqrt{n}\, r_{\bar\theta} + s \sqrt{\log d} \big) \sqrt{k}\, r_{\bar\theta} + \frac{(s + s^*) \log d}{\sqrt{n}} \bigg).
\]
Choosing
\[
\zeta = \bigg( \sqrt{s^*} \big( \sqrt{n}\, r_{\bar\theta} + s \sqrt{\log d} \big) \sqrt{k}\, r_{\bar\theta} + \frac{(s + s^*) \log d}{\sqrt{n}} \bigg)^{1 - \kappa},
\]
with any \( \kappa > 0 \), we deduce that \( \mathbb{P}( | T - T_0 | > \zeta ) = o(1) \). We also have that
\[
\zeta \sqrt{1 \vee \log \frac{d}{\zeta}} = o(1),
\]
provided that
\[
\bigg( \sqrt{s^*} \big( \sqrt{n}\, r_{\bar\theta} + s \sqrt{\log d} \big) \sqrt{k}\, r_{\bar\theta} + \frac{(s + s^*) \log d}{\sqrt{n}} \bigg) \log^{1/(2(1-\kappa))} d = o(1),
\]
which holds if
\[
n \gg (s + s^*)^2 \log^{\kappa} d, \quad \text{and} \quad
r_{\bar\theta} \ll \min\bigg\{ \frac{1}{\sqrt{k s^*}\, s \log^{\kappa} d},\ \big( n k s^* \log^{\kappa} d \big)^{-1/4} \bigg\}.
\]

Lemma C.8. \( \widehat T \) and \( T_0 \) are defined as in (20) and (23), respectively. In the sparse GLM, under Assumptions (B1) and (B2), provided that \( n \gg s^2 \log d + s^{*2} \log d \), we have that
\[
| \widehat T - T_0 | = O_P\bigg( \frac{\big( s^2 \sqrt{s^*} + s^* \big) \log d}{\sqrt{n}} \bigg).
\]
Moreover, if \( n \gg \big( s^4 s^* + s^{*2} \big) \log^{\kappa} d \) for some \( \kappa > 0 \), then there exists some \( \xi > 0 \) such that (32) holds.

Proof of Lemma C.8. By the proof of Lemma C.7, we obtain that
\[
\begin{aligned}
| \widehat T - T_0 |
&\le \max_{1 \le l \le d} \Big| \sqrt{N} \big( \widehat\theta - \theta^* \big)_l - \sqrt{N} \big( \Theta \big( \nabla \mathcal{L}^*(\theta^*) - \nabla \mathcal{L}_N(\theta^*) \big) \big)_l \Big|
= \sqrt{N} \big\| \widehat\theta - \theta^* - \Theta\big( \nabla \mathcal{L}^*(\theta^*) - \nabla \mathcal{L}_N(\theta^*) \big) \big\|_\infty \\
&= \sqrt{N} \big\| \widehat\theta_L - \widetilde\Theta(\widetilde\theta^{(0)}) \nabla \mathcal{L}_N(\widehat\theta_L) - \theta^* + \Theta \nabla \mathcal{L}_N(\theta^*) \big\|_\infty \\
&\le \sqrt{N} \int_0^1 \big\| \widetilde\Theta(\widetilde\theta^{(0)}) \nabla^2 \mathcal{L}_N(\theta^* + t(\widehat\theta_L - \theta^*)) - I_d \big\|_{\max}\, dt\, \big\| \widehat\theta_L - \theta^* \big\|_1
+ \sqrt{N} \big\| \widetilde\Theta(\widetilde\theta^{(0)}) - \Theta \big\|_\infty \| \nabla \mathcal{L}_N(\theta^*) \|_\infty \\
&= O_P\big( \sqrt{n k s^*} \big) \big\| \widehat\theta_L - \theta^* \big\|_1^2 + O_P\big( s \sqrt{k s^* \log d} \big) \big\| \widehat\theta_L - \theta^* \big\|_1 + O_P\bigg( \frac{(s + s^*) \log d}{\sqrt{n}} \bigg),
\end{aligned}
\]
where we use that \( \nabla \mathcal{L}^*(\theta^*) = 0 \) in the second equality. Since
\[
\big\| \widehat\theta_L - \theta^* \big\|_1 = O_P\bigg( s \sqrt{\frac{\log d}{N}} \bigg),
\]
we have that
\[
| \widehat T - T_0 | = O_P\bigg( \frac{\big( s^2 \sqrt{s^*} + s^* \big) \log d}{\sqrt{n}} \bigg).
\]
Choosing
\[
\xi = \bigg( \frac{\big( s^2 \sqrt{s^*} + s^* \big) \log d}{\sqrt{n}} \bigg)^{1 - \kappa},
\]
with any \( \kappa > 0 \), we deduce that \( \mathbb{P}\big( | \widehat T - T_0 | > \xi \big) = o(1) \). We also have that
\[
\xi \sqrt{1 \vee \log \frac{d}{\xi}} = o(1),
\]
provided that
\[
\frac{\big( s^2 \sqrt{s^*} + s^* \big) \log d}{\sqrt{n}} \log^{1/(2(1-\kappa))} d = o(1),
\]
which holds if \( n \gg \big( s^4 s^* + s^{*2} \big) \log^{\kappa} d \).

Lemma C.9. \( \bar\Omega \) and \( \widehat\Omega \) are defined as in (27) and (25), respectively. In the sparse linear model, under Assumptions (A1) and (A2), provided that \( \| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta}) \), \( r_{\bar\theta} \sqrt{\log(kd)} \lesssim 1 \), \( n \gg s^* \log d \), and \( k \gtrsim \log^2(dk) \log d \), we have that
\[
\big\| \bar\Omega - \widehat\Omega \big\|_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} + \sqrt{\log(kd)}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg) + \sqrt{\frac{s^* \log d}{n}} \bigg).
\]
Moreover, if \( n \gg s^* \log^{\kappa} d \), \( k \gg s^{*2} \log^{\kappa} d \), and
\[
\| \bar\theta - \theta^* \|_1 \ll \min\bigg\{ \frac{1}{s^* \sqrt{\log(kd)}\, \log^{\kappa} d},\ \frac{1}{\sqrt{n s^* \log^{\kappa} d}} \bigg\},
\]
for some \( \kappa > 0 \), then there exists some \( u > 0 \) such that (30) holds.

Proof of Lemma C.9.
Note by the triangle inequality that
\[
\big\| \bar\Omega - \widehat\Omega \big\|_{\max} \le \big\| \bar\Omega - \Omega_0 \big\|_{\max} + \big\| \widehat\Omega - \Omega_0 \big\|_{\max},
\]
where \( \Omega_0 \) is defined as in (26). First, we bound \( \| \widehat\Omega - \Omega_0 \|_{\max} \). With Assumption (E.1) of Chernozhukov et al. (2013) verified for \( \nabla \mathcal{L}^*(\theta^*) - \nabla \mathcal{L}(\theta^*; Z) \) in the proof of Lemma C.1, by the proof of Corollary 3.1 of Chernozhukov et al. (2013), we have that
\[
\mathbb{E}\Big[ \big\| \widehat\Omega - \Omega_0 \big\|_{\max} \Big] \lesssim \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N},
\]
and then, by Markov's inequality, with probability at least \( 1 - \delta \),
\[
\big\| \widehat\Omega - \Omega_0 \big\|_{\max} \lesssim \frac{1}{\delta} \bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg),
\]
for any \( \delta \in (0, 1) \), i.e.,
\[
\big\| \widehat\Omega - \Omega_0 \big\|_{\max} = O_P\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Next, we bound \( \| \bar\Omega - \Omega_0 \|_{\max} \). By the triangle inequality, we have that
\[
\begin{aligned}
\big\| \bar\Omega - \Omega_0 \big\|_{\max}
&= \bigg\| \widetilde\Theta \bigg( \frac{1}{k} \sum_{j=1}^k n \big( \nabla \mathcal{L}_j(\bar\theta) - \nabla \mathcal{L}_N(\bar\theta) \big) \big( \nabla \mathcal{L}_j(\bar\theta) - \nabla \mathcal{L}_N(\bar\theta) \big)^\top \bigg) \widetilde\Theta^\top
- \Theta\, \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \Theta \bigg\|_{\max} \\
&\le \bigg\| \widetilde\Theta \bigg( \frac{1}{k} \sum_{j=1}^k n \big( \nabla \mathcal{L}_j(\bar\theta) - \nabla \mathcal{L}_N(\bar\theta) \big) \big( \nabla \mathcal{L}_j(\bar\theta) - \nabla \mathcal{L}_N(\bar\theta) \big)^\top
- \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \bigg) \widetilde\Theta^\top \bigg\|_{\max} \\
&\quad + \Big\| \widetilde\Theta\, \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \widetilde\Theta^\top
- \Theta\, \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \Theta \Big\|_{\max}
=: I_1(\bar\theta) + I_2.
\end{aligned}
\]
To bound \( I_1(\bar\theta) \), we use the fact that for any two matrices \( A \) and \( B \) with compatible dimensions, \( \| A B \|_{\max} \le \| A \|_{\infty} \| B \|_{\max} \) and \( \| A B \|_{\max} \le \| A \|_{\max} \| B \|_{1} \), and obtain that
\[
I_1(\bar\theta) \le \big\| \widetilde\Theta \big\|_{\infty}^2 \bigg\| \frac{1}{k} \sum_{j=1}^k n \big( \nabla \mathcal{L}_j(\bar\theta) - \nabla \mathcal{L}_N(\bar\theta) \big) \big( \nabla \mathcal{L}_j(\bar\theta) - \nabla \mathcal{L}_N(\bar\theta) \big)^\top
- \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \bigg\|_{\max},
\]
since \( \| \widetilde\Theta^\top \|_1 = \| \widetilde\Theta \|_\infty \). Under Assumption (A1), by Lemma C.21, if \( n \gg s^* \log d \), we have that
\[
\big\| \widetilde\Theta \big\|_\infty = \max_l \big\| \widetilde\Theta_l \big\|_1 = O_P\big( \sqrt{s^*} \big).
\]
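The two matrix-norm facts invoked here (with \( \|A\|_\infty \) the maximum row \( \ell_1 \) norm, \( \|B\|_1 \) the maximum column \( \ell_1 \) norm, and \( \|\cdot\|_{\max} \) the entrywise maximum) are elementary Hölder bounds. A quick randomized check, purely as an illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def norm_max(A):  # entrywise max norm |||A|||_max
    return np.max(np.abs(A))

def norm_inf(A):  # maximum row ell_1 norm |||A|||_inf
    return np.max(np.sum(np.abs(A), axis=1))

def norm_one(A):  # maximum column ell_1 norm |||A|||_1
    return np.max(np.sum(np.abs(A), axis=0))

for _ in range(100):
    A = rng.normal(size=(5, 7))
    B = rng.normal(size=(7, 4))
    # |||AB|||_max <= |||A|||_inf |||B|||_max  and  |||AB|||_max <= |||A|||_max |||B|||_1
    assert norm_max(A @ B) <= norm_inf(A) * norm_max(B) + 1e-12
    assert norm_max(A @ B) <= norm_max(A) * norm_one(B) + 1e-12
```

Both inequalities follow by bounding each entry \( (AB)_{il} = \sum_j A_{ij} B_{jl} \) with Hölder's inequality in \( j \).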
Then, we obtain that
\[
I_1(\bar\theta) = O_P( s^* )\, O_P\bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} + \sqrt{\log(kd)}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg)
= O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} + \sqrt{\log(kd)}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg) \bigg),
\]
under Assumptions (A1) and (A2), provided that \( \| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta}) \), \( r_{\bar\theta} \sqrt{\log(kd)} \lesssim 1 \), and \( k \gtrsim \log^2(dk) \log d \). It remains to bound \( I_2 \). In the linear model, we have that
\[
I_2 = \Big\| \widetilde\Theta \big( \sigma^2 \Sigma \big) \widetilde\Theta^\top - \Theta \big( \sigma^2 \Sigma \big) \Theta \Big\|_{\max}
= \sigma^2 \Big\| \widetilde\Theta \Sigma \widetilde\Theta^\top - \Theta \Big\|_{\max},
\]
and by the triangle inequality,
\[
\begin{aligned}
I_2 &= \sigma^2 \Big\| \big( \widetilde\Theta - \Theta + \Theta \big) \Sigma \big( \widetilde\Theta - \Theta + \Theta \big)^\top - \Theta \Big\|_{\max} \\
&= \sigma^2 \Big\| \big( \widetilde\Theta - \Theta \big) \Sigma \big( \widetilde\Theta - \Theta \big)^\top + \Theta \Sigma \big( \widetilde\Theta - \Theta \big)^\top + \big( \widetilde\Theta - \Theta \big) \Sigma \Theta + \Theta \Sigma \Theta - \Theta \Big\|_{\max} \\
&\le \sigma^2 \Big\| \big( \widetilde\Theta - \Theta \big) \Sigma \big( \widetilde\Theta - \Theta \big)^\top \Big\|_{\max} + 2 \sigma^2 \Big\| \widetilde\Theta - \Theta \Big\|_{\max},
\end{aligned}
\]
where we use that \( \Theta \Sigma = I_d \) and \( \Theta \Sigma \Theta = \Theta \) in the last inequality.
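In the linear model, the population target of the sandwich is \( \Theta\, \mathbb{E}[\nabla\mathcal{L}\nabla\mathcal{L}^\top]\, \Theta = \sigma^2 \Theta \). The sketch below is illustrative only: the sizes are hypothetical and, for simplicity, the exact inverse covariance stands in for the nodewise-lasso estimate \( \widetilde\Theta \). It builds the grouped-gradient estimator \( \bar\Omega \) and shows its max-norm error shrinking as the number of groups \( k \) grows.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma = 5, 200, 1.0
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))  # AR(0.5) design
Theta = np.linalg.inv(Sigma)   # stand-in for the nodewise-lasso Theta-tilde
C = np.linalg.cholesky(Sigma)

def omega_bar(k):
    # grouped-gradient covariance (1/k) sum_j n * grad_j grad_j^T, sandwiched by Theta
    grads = []
    for _ in range(k):
        X = rng.normal(size=(n, d)) @ C.T
        eps = sigma * rng.normal(size=n)
        grads.append(-X.T @ eps / n)   # local gradient at theta* in the linear model
    G = np.stack(grads)
    G -= G.mean(axis=0)                # center at the global average gradient
    S = n * (G[:, :, None] * G[:, None, :]).mean(axis=0)
    return Theta @ S @ Theta.T

err = [np.max(np.abs(omega_bar(k) - sigma**2 * Theta)) for k in (10, 5000)]
print(err)  # max-norm error shrinks with k, consistent with the sqrt(log d / k) term
```

This mirrors the \( I_1(\bar\theta) \) term above: the only randomness left is the group-to-group fluctuation of the gradient outer products, which averages out at the \( k^{-1/2} \) rate.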
By Lemma C.21, we have that
\[
\big\| \widetilde\Theta - \Theta \big\|_{\max} \le \max_l \big\| \widetilde\Theta_l - \Theta_l \big\|_2 = O_P\bigg( \sqrt{\frac{s^* \log d}{n}} \bigg),
\]
and
\[
\Big\| \big( \widetilde\Theta - \Theta \big) \Sigma \big( \widetilde\Theta - \Theta \big)^\top \Big\|_{\max}
\le \| \Sigma \|_2 \max_l \big\| \widetilde\Theta_l - \Theta_l \big\|_2^2 = O_P\bigg( \frac{s^* \log d}{n} \bigg),
\]
where we use that \( \| \Sigma \|_2 = O(1) \) under Assumption (A1). Then, we obtain that
\[
I_2 = O_P\bigg( \frac{s^* \log d}{n} \bigg) + O_P\bigg( \sqrt{\frac{s^* \log d}{n}} \bigg) = O_P\bigg( \sqrt{\frac{s^* \log d}{n}} \bigg).
\]
Putting all the preceding bounds together, we obtain that
\[
\big\| \bar\Omega - \Omega_0 \big\|_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} + \sqrt{\log(kd)}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg) + \sqrt{\frac{s^* \log d}{n}} \bigg),
\]
and
\[
\big\| \bar\Omega - \widehat\Omega \big\|_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} + \sqrt{\log(kd)}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg) + \sqrt{\frac{s^* \log d}{n}} \bigg).
\]
Choosing
\[
u = \bigg( s^* \sqrt{\frac{\log d}{k}} + s^* \frac{\log^2(dk) \log d}{k} + s^* \sqrt{\log(kd)}\, r_{\bar\theta} + n s^* r_{\bar\theta}^2 + \sqrt{\frac{s^* \log d}{n}} \bigg)^{1 - \kappa},
\]
with any \( \kappa > 0 \), we deduce that \( \mathbb{P}\big( \| \bar\Omega - \widehat\Omega \|_{\max} > u \big) = o(1) \). We also have that
\[
u^{1/3} \bigg( 1 \vee \log \frac{d}{u} \bigg)^{2/3} = o(1),
\]
provided that
\[
\bigg( s^* \sqrt{\frac{\log d}{k}} + s^* \frac{\log^2(dk) \log d}{k} + s^* \sqrt{\log(kd)}\, r_{\bar\theta} + n s^* r_{\bar\theta}^2 + \sqrt{\frac{s^* \log d}{n}} \bigg) \log^{2/(1-\kappa)} d = o(1),
\]
which holds if
\[
n \gg s^* \log^{\kappa} d, \quad k \gg s^{*2} \log^{\kappa} d, \quad \text{and} \quad
r_{\bar\theta} \ll \min\bigg\{ \frac{1}{s^* \sqrt{\log(kd)}\, \log^{\kappa} d},\ \frac{1}{\sqrt{n s^* \log^{\kappa} d}} \bigg\}.
\]

Lemma C.10. \( \widehat\Omega \) and \( \Omega_0 \) are defined as in (25) and (26), respectively. In the sparse linear model, under Assumptions (A1) and (A2), we have that
\[
\big\| \widehat\Omega - \Omega_0 \big\|_{\max} = O_P\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Moreover, if \( N \gg \log^{\kappa} d \) for some \( \kappa > 0 \), then there exists some \( v > 0 \) such that (31) holds.

Proof of Lemma C.10. In the proof of Lemma C.9, we have shown that
\[
\big\| \widehat\Omega - \Omega_0 \big\|_{\max} = O_P\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Choosing
\[
v = \bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg)^{1 - \kappa},
\]
with any \( \kappa > 0 \), we deduce that \( \mathbb{P}\big( \| \widehat\Omega - \Omega_0 \|_{\max} > v \big) = o(1) \). We also have that
\[
v^{1/3} \bigg( 1 \vee \log \frac{d}{v} \bigg)^{2/3} = o(1),
\]
provided that
\[
\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg) \log^{2/(1-\kappa)} d = o(1),
\]
which holds if \( N \gg \log^{\kappa} d \). The same result applies to the low-dimensional case as well.

Lemma C.11. \( \widetilde\Omega \) and \( \widehat\Omega \) are defined as in (36) and (25), respectively. In the sparse linear model, under Assumptions (A1) and (A2), provided that \( \| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta}) \), \( r_{\bar\theta} \sqrt{\log((n+k)d)} \lesssim 1 \), \( n \gg s^* \log d \), and \( n + k \gtrsim \log^2(d(n+k)) \log d \), we have that
\[
\big\| \widetilde\Omega - \widehat\Omega \big\|_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\log^2(d(n+k)) \log d}{n+k} + \sqrt{\log((n+k)d)}\, r_{\bar\theta} + \frac{nk}{n+k} r_{\bar\theta}^2 \bigg) + \sqrt{\frac{s^* \log d}{n}} \bigg).
\]
Moreover, if \( n \gg s^* \log^{\kappa} d \), \( n + k \gg s^{*2} \log^{\kappa} d \), and
\[
\| \bar\theta - \theta^* \|_1 \ll \min\bigg\{ \frac{1}{s^* \sqrt{\log((n+k)d)}\, \log^{\kappa} d},\ \frac{1}{\sqrt{s^* \log^{\kappa} d}} \sqrt{\frac{1}{n} + \frac{1}{k}} \bigg\},
\]
for some \( \kappa > 0 \), then there exists some \( u > 0 \) such that (37) holds.

Proof of Lemma C.11.
Note by the triangle inequality that
\[
\big\| \widetilde\Omega - \widehat\Omega \big\|_{\max} \le \big\| \widetilde\Omega - \Omega_0 \big\|_{\max} + \big\| \widehat\Omega - \Omega_0 \big\|_{\max},
\]
where \( \Omega_0 \) is defined as in (26). By the proof of Lemma C.9, we have that
\[
\big\| \widehat\Omega - \Omega_0 \big\|_{\max} = O_P\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
We bound \( \| \widetilde\Omega - \Omega_0 \|_{\max} \) using the same argument as in the proof of Lemma C.9. Writing
\[
\widetilde S := \frac{1}{n + k - 1} \bigg( \sum_{i=1}^n \big( \nabla \mathcal{L}(\bar\theta; Z_{i1}) - \nabla \mathcal{L}_N(\bar\theta) \big) \big( \nabla \mathcal{L}(\bar\theta; Z_{i1}) - \nabla \mathcal{L}_N(\bar\theta) \big)^\top
+ \sum_{j=2}^k n \big( \nabla \mathcal{L}_j(\bar\theta) - \nabla \mathcal{L}_N(\bar\theta) \big) \big( \nabla \mathcal{L}_j(\bar\theta) - \nabla \mathcal{L}_N(\bar\theta) \big)^\top \bigg),
\]
we have by the triangle inequality that
\[
\begin{aligned}
\big\| \widetilde\Omega - \Omega_0 \big\|_{\max}
&= \Big\| \widetilde\Theta\, \widetilde S\, \widetilde\Theta^\top - \Theta\, \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \Theta \Big\|_{\max} \\
&\le \Big\| \widetilde\Theta \big( \widetilde S - \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \big) \widetilde\Theta^\top \Big\|_{\max}
+ \Big\| \widetilde\Theta\, \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \widetilde\Theta^\top
- \Theta\, \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \Theta \Big\|_{\max} \\
&=: I_1'(\bar\theta) + I_2.
\end{aligned}
\]
We have shown in the proof of Lemma C.9 that
\[
I_2 = O_P\bigg( \sqrt{\frac{s^* \log d}{n}} \bigg).
\]
To bound \( I_1'(\bar\theta) \), we note that
\[
I_1'(\bar\theta) \le \big\| \widetilde\Theta \big\|_\infty^2 \Big\| \widetilde S - \mathbb{E}\big[ \nabla \mathcal{L}(\theta^*; Z) \nabla \mathcal{L}(\theta^*; Z)^\top \big] \Big\|_{\max}.
\]
Under Assumption (A1), by Lemma C.21, if \( n \gg s^* \log d \), we have that
\[
\big\| \widetilde\Theta \big\|_\infty = \max_l \big\| \widetilde\Theta_l \big\|_1 = O_P\big( \sqrt{s^*} \big).
\]
Then, we obtain that
\[
I_1'(\bar\theta) = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\log^2(d(n+k)) \log d}{n+k} + \sqrt{\log((n+k)d)}\, r_{\bar\theta} + \frac{nk}{n+k} r_{\bar\theta}^2 \bigg) \bigg),
\]
under Assumptions (A1) and (A2), provided that \( \| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta}) \), \( r_{\bar\theta} \sqrt{\log((n+k)d)} \lesssim 1 \), and \( n + k \gtrsim \log^2(d(n+k)) \log d \). Putting all the preceding bounds together, we obtain that
\[
\big\| \widetilde\Omega - \Omega_0 \big\|_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\log^2(d(n+k)) \log d}{n+k} + \sqrt{\log((n+k)d)}\, r_{\bar\theta} + \frac{nk}{n+k} r_{\bar\theta}^2 \bigg) \bigg),
\]
and
\[
\big\| \widetilde\Omega - \widehat\Omega \big\|_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\log^2(d(n+k)) \log d}{n+k} + \sqrt{\log((n+k)d)}\, r_{\bar\theta} + \frac{nk}{n+k} r_{\bar\theta}^2 \bigg) + \sqrt{\frac{s^* \log d}{n}} \bigg).
\]
Choosing
\[
u = \bigg( s^* \sqrt{\frac{\log d}{n+k}} + s^* \frac{\log^2(d(n+k)) \log d}{n+k} + s^* \sqrt{\log((n+k)d)}\, r_{\bar\theta} + \frac{nk s^*}{n+k} r_{\bar\theta}^2 + \sqrt{\frac{s^* \log d}{n}} \bigg)^{1 - \kappa},
\]
with any \( \kappa > 0 \), we deduce that \( \mathbb{P}\big( \| \widetilde\Omega - \widehat\Omega \|_{\max} > u \big) = o(1) \). We also have that
\[
u^{1/3} \bigg( 1 \vee \log \frac{d}{u} \bigg)^{2/3} = o(1),
\]
provided that
\[
\bigg( s^* \sqrt{\frac{\log d}{n+k}} + s^* \frac{\log^2(d(n+k)) \log d}{n+k} + s^* \sqrt{\log((n+k)d)}\, r_{\bar\theta} + \frac{nk s^*}{n+k} r_{\bar\theta}^2 + \sqrt{\frac{s^* \log d}{n}} \bigg) \log^{2/(1-\kappa)} d = o(1),
\]
which holds if
\[
n \gg s^* \log^{\kappa} d, \quad n + k \gg s^{*2} \log^{\kappa} d, \quad \text{and} \quad
r_{\bar\theta} \ll \min\bigg\{ \frac{1}{s^* \sqrt{\log((n+k)d)}\, \log^{\kappa} d},\ \frac{1}{\sqrt{s^* \log^{\kappa} d}} \sqrt{\frac{1}{n} + \frac{1}{k}} \bigg\}.
\]

Lemma C.12. \( \bar\Omega \) and \( \widehat\Omega \) are defined as in (38) and (25), respectively. In the sparse GLM, under Assumptions (B1)–(B4), provided that \( \| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta}) \), \( r_{\bar\theta} \lesssim 1 \), \( n \gg s^2 \log d + s^{*2} \log d \), and \( k \gtrsim \log d \), we have that
\[
\big\| \bar\Omega - \widehat\Omega \big\|_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{k}} + \sqrt{\log d}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg) + \sqrt{\frac{(s + s^*) \log d}{n}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Moreover, if \( n \gg (s + s^*) \log^{\kappa} d \), \( k \gg s^{*2} \log^{\kappa} d \), and
\[
\| \bar\theta - \theta^* \|_1 \ll \min\bigg\{ \frac{1}{s^* \log^{1/2+\kappa} d},\ \frac{1}{\sqrt{n s^* \log^{\kappa} d}} \bigg\},
\]
for some \( \kappa > 0 \), then there exists some \( u > 0 \) such that (30) holds.

Proof of Lemma C.12.
We use the same argument as in the proof of Lemma C.9. Noteby the triangle inequality that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Ω − (cid:98) Ω (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Ω − Ω (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) Ω − Ω (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max , where Ω is defined as in (26). First, we bound (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) Ω − Ω (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max . With Assumption (E.1) ofChernozhukov et al. (2013) verified for ∇ L ∗ ( θ ∗ ) − ∇L ( θ ∗ ; Z ) in the proof of Lemma C.3,by the proof of Corollary 3.1 of Chernozhukov et al. (2013), we have that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) Ω − Ω (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max = O P (cid:32)(cid:114) log dN + log ( dN ) log dN (cid:33) . (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Ω − Ω (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max . 
By the triangle inequality, we have that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Ω − Ω (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:101) Θ( (cid:101) θ (0) ) (cid:32) k k (cid:88) j =1 n (cid:0) ∇L j (¯ θ ) − ∇L N (¯ θ ) (cid:1) (cid:0) ∇L j (¯ θ ) − ∇L N (¯ θ ) (cid:1) (cid:62) (cid:33) (cid:101) Θ( (cid:101) θ (0) ) (cid:62) − Θ E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) Θ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:101) Θ( (cid:101) θ (0) ) (cid:32) k k (cid:88) j =1 n (cid:0) ∇L j (¯ θ ) − ∇L N (¯ θ ) (cid:1) (cid:0) ∇L j (¯ θ ) − ∇L N (¯ θ ) (cid:1) (cid:62) − E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3)(cid:33) (cid:101) Θ( (cid:101) θ (0) ) (cid:62) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:101) Θ( (cid:101) θ (0) ) E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) (cid:101) Θ( (cid:101) θ (0) ) (cid:62) − Θ E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) Θ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max : = I (¯ θ ) + I . 
Note that (cid:101) Θ( (cid:101) θ (0) ) E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) (cid:101) Θ( (cid:101) θ (0) ) (cid:62) = (cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) (cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) (cid:62) + Θ E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) (cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) (cid:62) + (cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) Θ + Θ E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) Θ . By the triangle inequality, we have that I ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) (cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) (cid:62) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max + 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Θ E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3) (cid:16) (cid:101) Θ( (cid:101) θ (0) ) − Θ (cid:17) (cid:62) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max l (cid:13)(cid:13)(cid:13) (cid:101) Θ( (cid:101) θ (0) ) l − Θ l (cid:13)(cid:13)(cid:13) + 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E (cid:2) ∇L ( θ ∗ ; Z ) ∇L ( θ ∗ ; Z ) (cid:62) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max l (cid:107) Θ l (cid:107) max l (cid:13)(cid:13)(cid:13) (cid:101) Θ( (cid:101) θ (0) ) l − Θ l (cid:13)(cid:13)(cid:13) . Note that max l (cid:107) Θ l (cid:107) ≤ ||| Θ ||| = O (1) under Assumption (B3). 
By Lemma C.22, provided that $n \gg s^2 \log d + s^{*2} \log d$, we have that
\[
I_2 = O_P\bigg( \frac{(s+s^*) \log d}{n} + \sqrt{s+s^*} \sqrt{\frac{\log d}{n}} \bigg) = O_P\bigg( \sqrt{\frac{(s+s^*) \log d}{n}} \bigg) .
\]
To bound $I_1(\bar\theta)$, we note that
\[
I_1(\bar\theta) \le ||| \widetilde\Theta(\widetilde\theta^{(0)}) |||_{\infty}^2 \, \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max} .
\]
By Lemma C.22, we have that $||| \widetilde\Theta(\widetilde\theta^{(0)}) |||_{\infty} = O_P(\sqrt{s^*})$. Then, applying Lemma C.19, we obtain that
\[
I_1(\bar\theta) = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{k}} + \sqrt{\log d}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg) \bigg),
\]
provided that $\| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta})$, $r_{\bar\theta} \lesssim 1$, $n \gtrsim \log d$, and $k \gtrsim \log d$. Putting all the preceding bounds together, we obtain that
\[
||| \widetilde\Omega - \Omega_0 |||_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{k}} + \sqrt{\log d}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg) + \sqrt{\frac{(s+s^*) \log d}{n}} \bigg),
\]
and
\[
||| \widetilde\Omega - \widehat\Omega |||_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{k}} + \sqrt{\log d}\, r_{\bar\theta} + n r_{\bar\theta}^2 \bigg) + \sqrt{\frac{(s+s^*) \log d}{n}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Choosing
\[
u_0 = \bigg( s^* \sqrt{\frac{\log d}{k}} + s^* \sqrt{\log d}\, r_{\bar\theta} + n s^* r_{\bar\theta}^2 + \sqrt{\frac{(s+s^*) \log d}{n}} + \frac{\log^2(dN) \log d}{N} \bigg) \log^{\kappa_0} d
\]
with any $\kappa_0 > 0$, we deduce that
\[
P\big( ||| \widetilde\Omega - \widehat\Omega |||_{\max} > u_0 \big) = o(1).
\]
We also have that
\[
u_0^{1/3} \big( 1 \vee \log(d/u_0) \big)^{2/3} = o(1),
\]
provided that
\[
\bigg( s^* \sqrt{\frac{\log d}{k}} + s^* \sqrt{\log d}\, r_{\bar\theta} + n s^* r_{\bar\theta}^2 + \sqrt{\frac{(s+s^*) \log d}{n}} + \frac{\log^2(dN) \log d}{N} \bigg) \log^{\kappa_0} d = o(1),
\]
which holds if
\[
n \gg (s+s^*) \log^{1+2\kappa_0} d, \qquad k \gg s^{*2} \log^{1+2\kappa_0} d, \qquad \text{and} \qquad r_{\bar\theta} \ll \min\bigg\{ \frac{1}{s^* \log^{1/2+\kappa_0} d}, \frac{1}{\sqrt{n s^* \log^{\kappa_0} d}} \bigg\}.
\]

Lemma C.13. $\widehat\Omega$ and $\Omega_0$ are defined as in (25) and (26), respectively. In the sparse GLM, under Assumptions (B3)–(B4), we have that
\[
||| \widehat\Omega - \Omega_0 |||_{\max} = O_P\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Moreover, if $N \gg \log^{1+2\kappa_0} d$ for some $\kappa_0 > 0$, then there exists some $v_0 > 0$ such that (31) holds.

Proof of Lemma C.13.
In the proof of Lemma C.12, we have shown that
\[
||| \widehat\Omega - \Omega_0 |||_{\max} = O_P\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Choosing
\[
v_0 = \bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg) \log^{\kappa_0} d
\]
with any $\kappa_0 > 0$, we deduce that
\[
P\big( ||| \widehat\Omega - \Omega_0 |||_{\max} > v_0 \big) = o(1).
\]
We also have that
\[
v_0^{1/3} \big( 1 \vee \log(d/v_0) \big)^{2/3} = o(1),
\]
provided that
\[
\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg) \log^{\kappa_0} d = o(1),
\]
which holds if $N \gg \log^{1+2\kappa_0} d$. The same result applies to the low-dimensional case as well.
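As an added numerical illustration (not part of the original proof), the rate in Lemma C.13 can be sanity-checked in the simplest possible setting: a linear model with identity design, where $\Theta = \Sigma = I_d$ and $\Omega_0 = \sigma^2 I_d$, so the oracle estimator reduces to the empirical covariance of the score vectors $x_i e_i$. All names in this sketch are ours, and the identity design is a simplifying assumption; the check only shows that the entrywise max-norm error shrinks as $N$ grows, consistent with the $\sqrt{\log d / N}$ leading term.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 1.0
Omega0 = sigma**2 * np.eye(d)  # identity design: Omega_0 = sigma^2 Theta Sigma Theta = sigma^2 I

def max_norm_err(N):
    X = rng.standard_normal((N, d))          # rows x_i ~ N(0, I_d)
    e = sigma * rng.standard_normal(N)       # noise e_i
    G = X * e[:, None]                       # rows are score vectors x_i e_i = grad L(theta*; Z_i)
    Omega_hat = G.T @ G / N                  # (1/N) sum_i (x_i e_i)(x_i e_i)^T
    return np.abs(Omega_hat - Omega0).max()  # |||Omega_hat - Omega_0|||_max

err_small, err_large = max_norm_err(100), max_norm_err(10000)
print(err_small, err_large)  # the error at N = 10000 should be much smaller
```

With a fixed seed the decrease in the error is stable; averaging `max_norm_err` over repetitions would make the $N^{-1/2}$ decay even more visible.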
Lemma C.14. $\widetilde\Omega$ and $\widehat\Omega$ are defined as in (39) and (25), respectively. In the sparse GLM, under Assumptions (B1)–(B4), provided that $\| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta})$, $r_{\bar\theta} \lesssim 1$, and $n \gg s^2 \log d + s^{*2} \log d$, we have that
\[
||| \widetilde\Omega - \widehat\Omega |||_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\sqrt{n+k}\,\sqrt{\log d} + k^{3/4} \log^{3/4} d}{n+k}\, r_{\bar\theta} + \frac{nk}{n+k}\, r_{\bar\theta}^2 \bigg) + \sqrt{\frac{(s+s^*) \log d}{n}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Moreover, if $n \gg (s+s^*) \log^{1+2\kappa_0} d + s^2 \log d + s^{*2} \log d$, $n+k \gg s^{*2} \log^{1+2\kappa_0} d$, and
\[
\| \bar\theta - \theta^* \|_1 \ll \min\bigg\{ \frac{n+k}{s^* \big( \sqrt{n+k}\,\sqrt{\log d} + k^{3/4} \log^{3/4} d \big) \log^{\kappa_0} d}, \frac{1}{\sqrt{s^* \log^{\kappa_0} d}} \sqrt{\frac{1}{n} + \frac{1}{k}} \bigg\}
\]
for some $\kappa_0 > 0$, then there exists some $u_0 > 0$ such that (37) holds.

Proof of Lemma C.14.
Note by the triangle inequality that
\[
||| \widetilde\Omega - \widehat\Omega |||_{\max} \le ||| \widetilde\Omega - \Omega_0 |||_{\max} + ||| \widehat\Omega - \Omega_0 |||_{\max},
\]
where $\Omega_0$ is defined as in (26). By the proof of Lemma C.12, we have that
\[
||| \widehat\Omega - \Omega_0 |||_{\max} = O_P\bigg( \sqrt{\frac{\log d}{N}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Next, we bound $||| \widetilde\Omega - \Omega_0 |||_{\max}$ using the same argument as in the proof of Lemma C.12. By the triangle inequality, we have that
\[
\begin{aligned}
||| \widetilde\Omega - \Omega_0 |||_{\max}
&= \Big|\Big|\Big| \widetilde\Theta(\widetilde\theta^{(0)}) \frac{1}{n+k-1} \Big( \sum_{i=1}^{n} \big( \nabla\mathcal{L}(\bar\theta; Z_i) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}(\bar\theta; Z_i) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} \\
&\qquad + \sum_{j=2}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} \Big) \widetilde\Theta(\widetilde\theta^{(0)})^{\top} - \Theta\, \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Theta \Big|\Big|\Big|_{\max} \\
&\le \Big|\Big|\Big| \widetilde\Theta(\widetilde\theta^{(0)}) \Big( \frac{1}{n+k-1} \Big( \sum_{i=1}^{n} \big( \nabla\mathcal{L}(\bar\theta; Z_i) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}(\bar\theta; Z_i) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} \\
&\qquad + \sum_{j=2}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} \Big) - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big) \widetilde\Theta(\widetilde\theta^{(0)})^{\top} \Big|\Big|\Big|_{\max} \\
&\quad + \Big|\Big|\Big| \widetilde\Theta(\widetilde\theta^{(0)})\, \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big]\, \widetilde\Theta(\widetilde\theta^{(0)})^{\top} - \Theta\, \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Theta \Big|\Big|\Big|_{\max} \\
&=: I_1'(\bar\theta) + I_2 .
\end{aligned}
\]
We have shown in the proof of Lemma C.12 that
\[
I_2 = O_P\bigg( \sqrt{\frac{(s+s^*) \log d}{n}} \bigg).
\]
To bound $I_1'(\bar\theta)$, we note that $I_1'(\bar\theta)$ is at most $||| \widetilde\Theta(\widetilde\theta^{(0)}) |||_{\infty}^2$ times the $|||\cdot|||_{\max}$ norm of the quantity in the inner parentheses above. By Lemma C.22, provided that $n \gg s^2 \log d + s^{*2} \log d$, we have that $||| \widetilde\Theta(\widetilde\theta^{(0)}) |||_{\infty} = O_P(\sqrt{s^*})$. Then, applying Lemma C.20, we have that
\[
I_1'(\bar\theta) = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\sqrt{n+k}\,\sqrt{\log d} + k^{3/4} \log^{3/4} d}{n+k}\, r_{\bar\theta} + \frac{nk}{n+k}\, r_{\bar\theta}^2 \bigg) \bigg),
\]
under Assumptions (B1)–(B3), provided that $\| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta})$, $r_{\bar\theta} \lesssim 1$, and $n+k \gtrsim \log d$. Putting all the preceding bounds together, we obtain that
\[
||| \widetilde\Omega - \Omega_0 |||_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\sqrt{n+k}\,\sqrt{\log d} + k^{3/4} \log^{3/4} d}{n+k}\, r_{\bar\theta} + \frac{nk}{n+k}\, r_{\bar\theta}^2 \bigg) + \sqrt{\frac{(s+s^*) \log d}{n}} \bigg),
\]
and
\[
||| \widetilde\Omega - \widehat\Omega |||_{\max} = O_P\bigg( s^* \bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\sqrt{n+k}\,\sqrt{\log d} + k^{3/4} \log^{3/4} d}{n+k}\, r_{\bar\theta} + \frac{nk}{n+k}\, r_{\bar\theta}^2 \bigg) + \sqrt{\frac{(s+s^*) \log d}{n}} + \frac{\log^2(dN) \log d}{N} \bigg).
\]
Choosing
\[
u_0 = \bigg( s^* \sqrt{\frac{\log d}{n+k}} + \frac{\sqrt{n+k}\,\sqrt{\log d} + k^{3/4} \log^{3/4} d}{n+k}\, s^* r_{\bar\theta} + \frac{nk s^*}{n+k}\, r_{\bar\theta}^2 + \sqrt{\frac{(s+s^*) \log d}{n}} + \frac{\log^2(dN) \log d}{N} \bigg) \log^{\kappa_0} d
\]
with any $\kappa_0 > 0$, we deduce that
\[
P\big( ||| \widetilde\Omega - \widehat\Omega |||_{\max} > u_0 \big) = o(1).
\]
We also have that
\[
u_0^{1/3} \big( 1 \vee \log(d/u_0) \big)^{2/3} = o(1),
\]
provided that
\[
\bigg( s^* \sqrt{\frac{\log d}{n+k}} + \frac{\sqrt{n+k}\,\sqrt{\log d} + k^{3/4} \log^{3/4} d}{n+k}\, s^* r_{\bar\theta} + \frac{nk s^*}{n+k}\, r_{\bar\theta}^2 + \sqrt{\frac{(s+s^*) \log d}{n}} + \frac{\log^2(dN) \log d}{N} \bigg) \log^{\kappa_0} d = o(1),
\]
which holds if
\[
n \gg (s+s^*) \log^{1+2\kappa_0} d + s^2 \log d + s^{*2} \log d, \qquad n+k \gg s^{*2} \log^{1+2\kappa_0} d,
\]
and
\[
r_{\bar\theta} \ll \min\bigg\{ \frac{n+k}{s^* \big( \sqrt{n+k}\,\sqrt{\log d} + k^{3/4} \log^{3/4} d \big) \log^{\kappa_0} d}, \frac{1}{\sqrt{s^* \log^{\kappa_0} d}} \sqrt{\frac{1}{n} + \frac{1}{k}} \bigg\}.
\]

Lemma C.15.
For any $\theta$, we have that
\[
\Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \big( \nabla\mathcal{L}_j(\theta) - \nabla\mathcal{L}_N(\theta) \big) \big( \nabla\mathcal{L}_j(\theta) - \nabla\mathcal{L}_N(\theta) \big)^{\top} - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max} \le U_1(\theta) + U_2 + U_3(\theta),
\]
where
\[
\begin{aligned}
U_1(\theta) &:= \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} \Big( n \big( \nabla\mathcal{L}_j(\theta) - \nabla\mathcal{L}^*(\theta) \big) \big( \nabla\mathcal{L}_j(\theta) - \nabla\mathcal{L}^*(\theta) \big)^{\top} - n \nabla\mathcal{L}_j(\theta^*) \nabla\mathcal{L}_j(\theta^*)^{\top} \Big) \Big|\Big|\Big|_{\max}, \\
U_2 &:= \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \nabla\mathcal{L}_j(\theta^*) \nabla\mathcal{L}_j(\theta^*)^{\top} - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max}, \\
U_3(\theta) &:= n \big\| \nabla\mathcal{L}_N(\theta) - \nabla\mathcal{L}^*(\theta) \big\|_{\infty}^2 .
\end{aligned}
\]

Lemma C.16.
In the sparse linear model, under Assumptions (A1) and (A2), provided that $\| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta})$, we have that
\[
\begin{aligned}
&\Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max} \\
&\quad = O_P\bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} + \bigg( 1 + \Big( \frac{\log d}{k} \Big)^{1/4} + \sqrt{\frac{\log^2(dk) \log d}{k}} \bigg) \sqrt{\log(kd)}\, r_{\bar\theta} + \bigg( n + \sqrt{\frac{n \log d}{k}} + \log(kd) \bigg) r_{\bar\theta}^2 \bigg).
\end{aligned}
\]

Proof of Lemma C.16.
By Lemma C.15, it suffices to bound $U_1(\bar\theta)$, $U_2$, and $U_3(\bar\theta)$. We begin by bounding $U_2$. In the linear model, we have that
\[
U_2 = \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \bigg( \frac{X_j^{\top} e_j}{n} \bigg) \bigg( \frac{X_j^{\top} e_j}{n} \bigg)^{\top} - \sigma^2 \Sigma \Big|\Big|\Big|_{\max} .
\]
Note that
\[
\mathrm{E}\Bigg[ \bigg( \frac{(X_j^{\top} e_j)_l}{\sqrt{n}} \bigg)^2 \Bigg] = \frac{\mathrm{E}\big[ \big( \sum_{i=1}^{n} X_{ij,l}\, e_{ij} \big)^2 \big]}{n} = \sigma^2 \Sigma_{l,l}
\]
is bounded away from zero under Assumptions (A1) and (A2). Also, using the same argument as for obtaining (42), we have that for any $t > 0$,
\[
P\Bigg( \bigg| \frac{(X_j^{\top} e_j)_l}{n} \bigg| > t \Bigg) \le 2 \exp\Bigg( -c n \bigg( \frac{t^2}{\Sigma_{l,l} \sigma^2} \wedge \frac{t}{\sqrt{\Sigma_{l,l}}\, \sigma} \bigg) \Bigg),
\]
and then,
\[
P\Bigg( \bigg| \frac{(X_j^{\top} e_j)_l}{\sqrt{n}} \bigg| > t \Bigg) \le 2 \exp\Bigg( -c \bigg( \frac{t^2}{\Sigma_{l,l} \sigma^2} \wedge \frac{t \sqrt{n}}{\sqrt{\Sigma_{l,l}}\, \sigma} \bigg) \Bigg) \le C_1 \exp(-c' t)
\]
for some constants $c$, $c'$, and $C_1 > 0$; that is, $(X_j^{\top} e_j)_l / \sqrt{n}$ is sub-exponential with $O(1)$ $\psi_1$-norm for each $(j,l)$. Then, by the proof of Corollary 3.1 of Chernozhukov et al. (2013), we have that
\[
\mathrm{E}[U_2] = \mathrm{E}\Bigg[ \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} \bigg( \frac{X_j^{\top} e_j}{\sqrt{n}} \bigg) \bigg( \frac{X_j^{\top} e_j}{\sqrt{n}} \bigg)^{\top} - \sigma^2 \Sigma \Big|\Big|\Big|_{\max} \Bigg] \lesssim \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k},
\]
and then, for any $\delta \in (0,1)$, with probability at least $1 - \delta$,
\[
U_2 \lesssim \delta^{-1} \bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} \bigg),
\]
by Markov's inequality, which implies that
\[
U_2 = O_P\bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} \bigg).
\]
Next, we bound $U_3(\bar\theta)$. By the triangle inequality and the fact that, for any matrix $A$ and vector $a$ with compatible dimensions, $\| A a \|_{\infty} \le ||| A |||_{\max} \| a \|_1$, we have that
\[
\begin{aligned}
\big\| \nabla\mathcal{L}_N(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) \big\|_{\infty}
&\le \big\| \nabla\mathcal{L}_N(\bar\theta) - \nabla\mathcal{L}_N(\theta^*) \big\|_{\infty} + \big\| \nabla\mathcal{L}_N(\theta^*) \big\|_{\infty} + \big\| \nabla\mathcal{L}^*(\bar\theta) \big\|_{\infty} \\
&= \bigg\| \frac{X_N^{\top}(X_N \bar\theta - y_N)}{N} - \frac{X_N^{\top}(X_N \theta^* - y_N)}{N} \bigg\|_{\infty} + \bigg\| \frac{X_N^{\top}(X_N \theta^* - y_N)}{N} \bigg\|_{\infty} + \big\| \Sigma (\bar\theta - \theta^*) \big\|_{\infty} \\
&= \bigg\| \frac{X_N^{\top} X_N}{N} (\bar\theta - \theta^*) \bigg\|_{\infty} + \bigg\| \frac{X_N^{\top} e_N}{N} \bigg\|_{\infty} + \big\| \Sigma (\bar\theta - \theta^*) \big\|_{\infty} \\
&\le \Big|\Big|\Big| \frac{X_N^{\top} X_N}{N} \Big|\Big|\Big|_{\max} \big\| \bar\theta - \theta^* \big\|_1 + \bigg\| \frac{X_N^{\top} e_N}{N} \bigg\|_{\infty} + ||| \Sigma |||_{\max} \big\| \bar\theta - \theta^* \big\|_1 \\
&\lesssim \Big|\Big|\Big| \frac{X_N^{\top} X_N}{N} - \Sigma \Big|\Big|\Big|_{\max} \big\| \bar\theta - \theta^* \big\|_1 + \bigg\| \frac{X_N^{\top} e_N}{N} \bigg\|_{\infty} + ||| \Sigma |||_{\max} \big\| \bar\theta - \theta^* \big\|_1 .
\end{aligned}
\]
By (40) and (42), we have that, with probability at least $1 - \delta$,
\[
\Big|\Big|\Big| \frac{X_N^{\top} X_N}{N} - \Sigma \Big|\Big|\Big|_{\max} \le ||| \Sigma |||_{\max} \bigg( \frac{\log \frac{d^2}{\delta}}{c N} \vee \sqrt{\frac{\log \frac{d^2}{\delta}}{c N}} \bigg) = O_P\bigg( \sqrt{\frac{\log d}{N}} \bigg),
\]
\[
\bigg\| \frac{X_N^{\top} e_N}{N} \bigg\|_{\infty} \le \max_l \sqrt{\Sigma_{l,l}}\, \sigma \bigg( \frac{\log \frac{d}{\delta}}{c N} \vee \sqrt{\frac{\log \frac{d}{\delta}}{c N}} \bigg) = O_P\bigg( \sqrt{\frac{\log d}{N}} \bigg),
\]
where $\max_l \sqrt{\Sigma_{l,l}} \le ||| \Sigma |||_{\max}^{1/2} = O(1)$ under Assumption (A1). Then, assuming that $\| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta})$, we have that
\[
\big\| \nabla\mathcal{L}_N(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) \big\|_{\infty} = \bigg( O(1) + O_P\bigg( \sqrt{\frac{\log d}{N}} \bigg) \bigg) O_P(r_{\bar\theta}) + O_P\bigg( \sqrt{\frac{\log d}{N}} \bigg) = O_P\bigg( \bigg( 1 + \sqrt{\frac{\log d}{N}} \bigg) r_{\bar\theta} + \sqrt{\frac{\log d}{N}} \bigg),
\]
and then,
\[
U_3(\bar\theta) = O_P\bigg( \bigg( 1 + \sqrt{\frac{\log d}{N}} \bigg)^2 n r_{\bar\theta}^2 + \frac{\log d}{k} \bigg).
\]
Lastly, we bound $U_1(\bar\theta)$.
We write $\nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta)$ as $\big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big) + \nabla\mathcal{L}_j(\theta^*)$, and obtain by the triangle inequality that
\[
\begin{aligned}
U_1(\bar\theta) &\le \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big)^{\top} \Big|\Big|\Big|_{\max} \\
&\quad + \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \nabla\mathcal{L}_j(\theta^*) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big)^{\top} \Big|\Big|\Big|_{\max} + \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big) \nabla\mathcal{L}_j(\theta^*)^{\top} \Big|\Big|\Big|_{\max} \\
&= \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big)^{\top} \Big|\Big|\Big|_{\max} \\
&\quad + 2 \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \nabla\mathcal{L}_j(\theta^*) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big)^{\top} \Big|\Big|\Big|_{\max} \\
&=: U_{1,1}(\bar\theta) + 2 U_{1,2}(\bar\theta),
\end{aligned}
\]
where the equality uses $||| M |||_{\max} = ||| M^{\top} |||_{\max}$.
To bound $U_{1,2}(\bar\theta)$, we first note that, for any $A, B \in \mathbb{R}^{d \times k}$, each entry $(A B^{\top})_{l,l'}$ is the inner product of a row of $A$ and a row of $B$, so applying the Cauchy–Schwarz inequality entrywise yields
\[
||| A B^{\top} |||_{\max} \le ||| A A^{\top} |||_{\max}^{1/2} \, ||| B B^{\top} |||_{\max}^{1/2} .
\]
We apply this with
\[
A = \sqrt{\frac{n}{k}} \Big[ \nabla\mathcal{L}_1(\theta^*) \ \cdots \ \nabla\mathcal{L}_k(\theta^*) \Big] \quad \text{and} \quad B = \sqrt{\frac{n}{k}} \Big[ \nabla\mathcal{L}_1(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_1(\theta^*) \ \cdots \ \nabla\mathcal{L}_k(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_k(\theta^*) \Big]
\]
and obtain that
\[
U_{1,2}(\bar\theta) \le \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \nabla\mathcal{L}_j(\theta^*) \nabla\mathcal{L}_j(\theta^*)^{\top} \Big|\Big|\Big|_{\max}^{1/2} \cdot \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) \big)^{\top} \Big|\Big|\Big|_{\max}^{1/2} = \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \nabla\mathcal{L}_j(\theta^*) \nabla\mathcal{L}_j(\theta^*)^{\top} \Big|\Big|\Big|_{\max}^{1/2} U_{1,1}(\bar\theta)^{1/2} .
\]
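The Cauchy–Schwarz step used here, $||| A B^{\top} |||_{\max} \le ||| A A^{\top} |||_{\max}^{1/2} ||| B B^{\top} |||_{\max}^{1/2}$, is a deterministic matrix inequality, so it can be checked directly on random matrices. The following sketch is our added illustration (the helper name `max_norm` is ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def max_norm(M):
    # entrywise max norm |||M|||_max
    return np.abs(M).max()

# Cauchy-Schwarz applied entrywise: |(A B^T)_{l,l'}| = |<A_l, B_{l'}>|
#   <= ||A_l||_2 ||B_{l'}||_2 <= |||A A^T|||_max^{1/2} |||B B^T|||_max^{1/2}
for _ in range(100):
    A = rng.standard_normal((5, 7))
    B = rng.standard_normal((5, 7))
    lhs = max_norm(A @ B.T)
    rhs = np.sqrt(max_norm(A @ A.T)) * np.sqrt(max_norm(B @ B.T))
    assert lhs <= rhs + 1e-12
print("Cauchy-Schwarz max-norm bound verified on 100 random pairs")
```

The bound is tight when one row of $A$ aligns with one row of $B$ and both have the largest row norms, which is why no dimension-dependent factor appears.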
By the triangle inequality, we have that
\[
\begin{aligned}
\Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \nabla\mathcal{L}_j(\theta^*) \nabla\mathcal{L}_j(\theta^*)^{\top} \Big|\Big|\Big|_{\max}
&\le \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \nabla\mathcal{L}_j(\theta^*) \nabla\mathcal{L}_j(\theta^*)^{\top} - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max} + \Big|\Big|\Big| \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max} \\
&= U_2 + \sigma^2 ||| \Sigma |||_{\max} = O_P\bigg( 1 + \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} \bigg).
\end{aligned}
\]
It remains to bound $U_{1,1}(\bar\theta)$. Note that
\[
\nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}^*(\bar\theta) - \nabla\mathcal{L}_j(\theta^*) = \frac{X_j^{\top}(X_j \bar\theta - y_j)}{n} - \Sigma(\bar\theta - \theta^*) - \frac{X_j^{\top}(X_j \theta^* - y_j)}{n} = \bigg( \frac{X_j^{\top} X_j}{n} - \Sigma \bigg) (\bar\theta - \theta^*).
\]
Then, we have that
\[
\begin{aligned}
U_{1,1}(\bar\theta) &= \Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \bigg( \frac{X_j^{\top} X_j}{n} - \Sigma \bigg) (\bar\theta - \theta^*)(\bar\theta - \theta^*)^{\top} \bigg( \frac{X_j^{\top} X_j}{n} - \Sigma \bigg) \Big|\Big|\Big|_{\max} \\
&\le \frac{1}{k} \sum_{j=1}^{k} n \Big|\Big|\Big| \bigg( \frac{X_j^{\top} X_j}{n} - \Sigma \bigg) (\bar\theta - \theta^*)(\bar\theta - \theta^*)^{\top} \bigg( \frac{X_j^{\top} X_j}{n} - \Sigma \bigg) \Big|\Big|\Big|_{\max} \\
&= \frac{1}{k} \sum_{j=1}^{k} n \Big\| \bigg( \frac{X_j^{\top} X_j}{n} - \Sigma \bigg) (\bar\theta - \theta^*) \Big\|_{\infty}^2 \le \frac{1}{k} \sum_{j=1}^{k} n \Big|\Big|\Big| \frac{X_j^{\top} X_j}{n} - \Sigma \Big|\Big|\Big|_{\max}^2 \big\| \bar\theta - \theta^* \big\|_1^2 ,
\end{aligned}
\]
where we use the triangle inequality and the facts that $||| a a^{\top} |||_{\max} = \| a \|_{\infty}^2$ for any vector $a$, and $\| A a \|_{\infty} \le ||| A |||_{\max} \| a \|_1$ for any matrix $A$ and vector $a$ with compatible dimensions. By (41), we have that
\[
P\Bigg( \Big|\Big|\Big| \frac{X_j^{\top} X_j}{n} - \Sigma \Big|\Big|\Big|_{\max} > ||| \Sigma |||_{\max} \bigg( \frac{\log \frac{k d^2}{\delta}}{c n} \vee \sqrt{\frac{\log \frac{k d^2}{\delta}}{c n}} \bigg) \Bigg) \le \frac{\delta}{k},
\]
and then, by the union bound,
\[
P\Bigg( \max_j \Big|\Big|\Big| \frac{X_j^{\top} X_j}{n} - \Sigma \Big|\Big|\Big|_{\max} > ||| \Sigma |||_{\max} \bigg( \frac{\log \frac{k d^2}{\delta}}{c n} \vee \sqrt{\frac{\log \frac{k d^2}{\delta}}{c n}} \bigg) \Bigg) \le \delta,
\]
which implies that
\[
\max_j \Big|\Big|\Big| \frac{X_j^{\top} X_j}{n} - \Sigma \Big|\Big|\Big|_{\max} = O_P\bigg( \sqrt{\frac{\log(kd)}{n}} \bigg).
\]
It follows that
\[
U_{1,1}(\bar\theta) = O_P\big( \log(kd)\, r_{\bar\theta}^2 \big), \qquad U_{1,2}(\bar\theta) = O_P\bigg( \bigg( 1 + \Big( \frac{\log d}{k} \Big)^{1/4} + \sqrt{\frac{\log^2(dk) \log d}{k}} \bigg) \sqrt{\log(kd)}\, r_{\bar\theta} \bigg),
\]
\[
U_1(\bar\theta) = O_P\bigg( \bigg( 1 + \Big( \frac{\log d}{k} \Big)^{1/4} + \sqrt{\frac{\log^2(dk) \log d}{k}} \bigg) \sqrt{\log(kd)}\, r_{\bar\theta} + \log(kd)\, r_{\bar\theta}^2 \bigg),
\]
and finally,
\[
\begin{aligned}
&\Big|\Big|\Big| \frac{1}{k} \sum_{j=1}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max} \\
&\quad = O_P\bigg( \sqrt{\frac{\log d}{k}} + \frac{\log^2(dk) \log d}{k} + \bigg( 1 + \Big( \frac{\log d}{k} \Big)^{1/4} + \sqrt{\frac{\log^2(dk) \log d}{k}} \bigg) \sqrt{\log(kd)}\, r_{\bar\theta} + \bigg( n + \sqrt{\frac{n \log d}{k}} + \log(kd) \bigg) r_{\bar\theta}^2 \bigg).
\end{aligned}
\]

Lemma C.17.
For any $\theta$, we have that
\[
\begin{aligned}
&\Big|\Big|\Big| \frac{1}{n+k-1} \bigg( \sum_{i=1}^{n} \big( \nabla\mathcal{L}(\theta; Z_i) - \nabla\mathcal{L}_N(\theta) \big) \big( \nabla\mathcal{L}(\theta; Z_i) - \nabla\mathcal{L}_N(\theta) \big)^{\top} + \sum_{j=2}^{k} n \big( \nabla\mathcal{L}_j(\theta) - \nabla\mathcal{L}_N(\theta) \big) \big( \nabla\mathcal{L}_j(\theta) - \nabla\mathcal{L}_N(\theta) \big)^{\top} \bigg) - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max} \\
&\quad \le V_1(\theta) + V_1'(\theta) + V_2 + V_2' + V_3(\theta),
\end{aligned}
\]
where
\[
\begin{aligned}
V_1(\theta) &:= \frac{k-1}{n+k-1} \Big|\Big|\Big| \frac{1}{k-1} \sum_{j=2}^{k} \Big( n \big( \nabla\mathcal{L}_j(\theta) - \nabla\mathcal{L}^*(\theta) \big) \big( \nabla\mathcal{L}_j(\theta) - \nabla\mathcal{L}^*(\theta) \big)^{\top} - n \nabla\mathcal{L}_j(\theta^*) \nabla\mathcal{L}_j(\theta^*)^{\top} \Big) \Big|\Big|\Big|_{\max}, \\
V_1'(\theta) &:= \frac{n}{n+k-1} \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^{n} \Big( \big( \nabla\mathcal{L}(\theta; Z_i) - \nabla\mathcal{L}^*(\theta) \big) \big( \nabla\mathcal{L}(\theta; Z_i) - \nabla\mathcal{L}^*(\theta) \big)^{\top} - \nabla\mathcal{L}(\theta^*; Z_i) \nabla\mathcal{L}(\theta^*; Z_i)^{\top} \Big) \Big|\Big|\Big|_{\max}, \\
V_2 &:= \frac{k-1}{n+k-1} \Big|\Big|\Big| \frac{1}{k-1} \sum_{j=2}^{k} n \nabla\mathcal{L}_j(\theta^*) \nabla\mathcal{L}_j(\theta^*)^{\top} - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max}, \\
V_2' &:= \frac{n}{n+k-1} \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^{n} \nabla\mathcal{L}(\theta^*; Z_i) \nabla\mathcal{L}(\theta^*; Z_i)^{\top} - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max}, \\
V_3(\theta) &:= \frac{nk}{n+k-1} \big\| \nabla\mathcal{L}_N(\theta) - \nabla\mathcal{L}^*(\theta) \big\|_{\infty}^2 .
\end{aligned}
\]
Lemma C.17 is the same as Lemma F.3 of Yu, Chao & Cheng (2020). We omit the proof.
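The decompositions in Lemmas C.15 and C.17 are deterministic triangle-inequality bounds, so they can also be verified numerically. The snippet below is an added sanity check of the C.15 inequality under simplifying assumptions of ours (linear model, identity $\Sigma$, known $\theta^*$): it forms the machine-level gradients $\nabla\mathcal{L}_j(\theta) = X_j^{\top}(X_j\theta - y_j)/n$, for which $\nabla\mathcal{L}^*(\theta) = \theta - \theta^*$ and $\mathrm{E}[\nabla\mathcal{L}\nabla\mathcal{L}^{\top}] = \sigma^2 I$, and checks that the left-hand side is at most $U_1(\theta) + U_2 + U_3(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k, sigma = 6, 40, 8, 1.0
theta_star = np.zeros(d)
theta = theta_star + 0.3 * rng.standard_normal(d)  # arbitrary evaluation point
Sigma = np.eye(d)                                   # identity design (simplifying assumption)
E_gg = sigma**2 * Sigma                             # E[grad L(theta*) grad L(theta*)^T]

def max_norm(M):
    return np.abs(M).max()

# per-machine gradients of the least-squares loss
grads, grads_star = [], []
for _ in range(k):
    X = rng.standard_normal((n, d))
    e = sigma * rng.standard_normal(n)
    y = X @ theta_star + e
    grads.append(X.T @ (X @ theta - y) / n)            # grad L_j(theta)
    grads_star.append(X.T @ (X @ theta_star - y) / n)  # grad L_j(theta*)
G, H = np.array(grads), np.array(grads_star)
g_bar = G.mean(axis=0)                # grad L_N(theta)
g_pop = Sigma @ (theta - theta_star)  # grad L*(theta)

lhs = max_norm(n * (G - g_bar).T @ (G - g_bar) / k - E_gg)
U1 = max_norm((n * ((G - g_pop).T @ (G - g_pop) - H.T @ H)) / k)
U2 = max_norm(n * H.T @ H / k - E_gg)
U3 = n * np.max(np.abs(g_bar - g_pop))**2
assert lhs <= U1 + U2 + U3 + 1e-9
print("C.15 decomposition holds:", lhs, "<=", U1 + U2 + U3)
```

The key algebraic fact being exercised is that $\frac{1}{k}\sum_j (a_j - \bar a)(a_j - \bar a)^{\top} = \frac{1}{k}\sum_j a_j a_j^{\top} - \bar a \bar a^{\top}$, after which the three $U$ terms follow from the triangle inequality; the C.17 version differs only in weighting the first machine at the sample level.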
Lemma C.18.
In the sparse linear model, under Assumptions (A1) and (A2), provided that $\| \bar\theta - \theta^* \|_1 = O_P(r_{\bar\theta})$, we have that
\[
\begin{aligned}
&\Big|\Big|\Big| \frac{1}{n+k-1} \bigg( \sum_{i=1}^{n} \big( \nabla\mathcal{L}(\bar\theta; Z_i) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}(\bar\theta; Z_i) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} + \sum_{j=2}^{k} n \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big) \big( \nabla\mathcal{L}_j(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta) \big)^{\top} \bigg) - \mathrm{E}\big[ \nabla\mathcal{L}(\theta^*;Z) \nabla\mathcal{L}(\theta^*;Z)^{\top} \big] \Big|\Big|\Big|_{\max} \\
&\quad = O_P\bigg( \sqrt{\frac{\log d}{n+k}} + \frac{\log^2(d(n+k)) \log d}{n+k} + \bigg( \bigg( 1 + \sqrt{\frac{\log d}{N}} \bigg)^2 \frac{nk}{n+k} + \log((n+k)d) \bigg) r_{\bar\theta}^2 \\
&\qquad + \bigg( \sqrt{\log((n+k)d)} + \Big( \frac{\log d}{n+k} \Big)^{1/4} \sqrt{\log((n+k)d)} + \sqrt{\frac{\log^2(d(n+k)) \log d}{n+k}} \bigg) r_{\bar\theta} \bigg).
\end{aligned}
\]

Proof of Lemma C.18.
By Lemma C.17, it suffices to bound V (¯ θ ), V (cid:48) (¯ θ ), V , V (cid:48) ,and V (¯ θ ). By the proof of Lemma C.16, we have that under Assumptions (A1) and (A2),89ssuming that (cid:13)(cid:13) ¯ θ − θ ∗ (cid:13)(cid:13) = O P ( r ¯ θ ), V (¯ θ ) = k − n + k − O P (cid:18) log dk (cid:19) / + (cid:115) log ( dk ) log dk (cid:112) log( kd ) r ¯ θ + log( kd ) r θ = O P (cid:18) log dk (cid:19) / + (cid:115) log ( dk ) log dk k (cid:112) log( kd ) n + k r ¯ θ + k log( kd ) n + k r θ ,V = k − n + k − O P (cid:32)(cid:114) log dk + log ( dk ) log dk (cid:33) = O P (cid:18) √ k log dn + k + log ( dk ) log dn + k (cid:19) , and V (¯ θ ) = nkn + k − O P (cid:32)(cid:32) (cid:114) log dN (cid:33) r θ + log dN (cid:33) = O P (cid:32)(cid:32) (cid:114) log dN (cid:33) nkn + k r θ + log dn + k (cid:33) . It remains to bound V (cid:48) (¯ θ ) and V (cid:48) .To bound V (cid:48) , we have that in linear model, under Assumptions (A1) and (A2), V (cid:48) = nn + k − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 ( x i e i ) ( x i e i ) (cid:62) − σ Σ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max . Note that E (cid:2) ( x i e i ) l (cid:3) = σ Σ l,l is bounded away from zero, and also, ( x i e i ) l is sub-exponential with O(1) ψ -norm foreach ( i, l ). Then, by the proof of Corollary 3.1 of Chernozhukov et al. 
(2013), we have that E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 ( x i e i ) ( x i e i ) (cid:62) − σ Σ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max (cid:35) (cid:46) (cid:114) log dn + log ( dn ) log dn , and then, for any δ ∈ (0 , − δ , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 ( x i e i ) ( x i e i ) (cid:62) − σ Σ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) max (cid:46) δ (cid:32)(cid:114) log dn + log ( dn ) log dn (cid:33) ,
by Markov's inequality, which implies that
\[
V_2' = \frac{n}{n+k-1}\,O_P\!\left(\sqrt{\frac{\log d}{n}}+\frac{\log^2(dn)\log d}{n}\right) = O_P\!\left(\frac{\sqrt{n\log d}}{n+k}+\frac{\log^2(dn)\log d}{n+k}\right).
\]
Lastly, we bound $V_1'(\bar\theta)$ using the same argument as in bounding $U_1(\bar\theta)$ in the proof of Lemma C.16. We write $\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}^*(\bar\theta)$ as $(\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}(\theta^*;Z_i))+\nabla\mathcal{L}(\theta^*;Z_i)$, and obtain by the triangle inequality (the two cross terms have equal max-norms) that, writing $\Delta_i := \nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}(\theta^*;Z_i)$,
\[
\frac{n+k-1}{n}V_1'(\bar\theta) \le \left|\left|\left|\frac{1}{n}\sum_{i=1}^n \Delta_i\Delta_i^\top\right|\right|\right|_{\max} + 2\left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\Delta_i^\top\right|\right|\right|_{\max} =: V_{11}'(\bar\theta)+2V_{12}'(\bar\theta).
\]
Applying the Cauchy--Schwarz inequality, we obtain that
\[
V_{12}'(\bar\theta) \le \left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\nabla\mathcal{L}(\theta^*;Z_i)^\top\right|\right|\right|_{\max}^{1/2} V_{11}'(\bar\theta)^{1/2}.
\]
By the triangle inequality, we have that
\begin{align*}
\left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\nabla\mathcal{L}(\theta^*;Z_i)^\top\right|\right|\right|_{\max}
&\le \left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\nabla\mathcal{L}(\theta^*;Z_i)^\top - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\right|\right|\right|_{\max}\\
&\quad + \left|\left|\left|E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\right|\right|\right|_{\max}\\
&= \frac{n+k-1}{n}V_2' + \sigma^2|||\Sigma|||_{\max} = O_P\!\left(1+\sqrt{\frac{\log d}{n}}+\frac{\log^2(dn)\log d}{n}\right).
\end{align*}
It remains to bound $V_{11}'(\bar\theta)$. Note that
\[
\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}(\theta^*;Z_i) = x_i(x_i^\top\bar\theta-y_i)-\Sigma(\bar\theta-\theta^*)-x_i(x_i^\top\theta^*-y_i) = \big(x_ix_i^\top-\Sigma\big)(\bar\theta-\theta^*).
\]
Then, we have by the triangle inequality that
\begin{align*}
V_{11}'(\bar\theta) &= \left|\left|\left|\frac{1}{n}\sum_{i=1}^n \big(x_ix_i^\top-\Sigma\big)(\bar\theta-\theta^*)(\bar\theta-\theta^*)^\top\big(x_ix_i^\top-\Sigma\big)\right|\right|\right|_{\max}\\
&\le \frac{1}{n}\sum_{i=1}^n \left|\left|\left|\big(x_ix_i^\top-\Sigma\big)(\bar\theta-\theta^*)(\bar\theta-\theta^*)^\top\big(x_ix_i^\top-\Sigma\big)\right|\right|\right|_{\max}
= \frac{1}{n}\sum_{i=1}^n \big\|\big(x_ix_i^\top-\Sigma\big)(\bar\theta-\theta^*)\big\|_\infty^2\\
&\le \frac{1}{n}\sum_{i=1}^n \big|\big|\big|x_ix_i^\top-\Sigma\big|\big|\big|_{\max}^2\,\big\|\bar\theta-\theta^*\big\|_1^2.
\end{align*}
Similarly to obtaining (41), we have that
\[
P\left(\big|\big|\big|x_ix_i^\top-\Sigma\big|\big|\big|_{\max} > |||\Sigma|||_{\max}\left(\frac{\log\frac{nd}{\delta}}{c}\vee\sqrt{\frac{\log\frac{nd}{\delta}}{c}}\right)\right) \le \frac{\delta}{n},
\]
and, by the union bound,
\[
P\left(\max_i \big|\big|\big|x_ix_i^\top-\Sigma\big|\big|\big|_{\max} > |||\Sigma|||_{\max}\left(\frac{\log\frac{nd}{\delta}}{c}\vee\sqrt{\frac{\log\frac{nd}{\delta}}{c}}\right)\right) \le \delta,
\]
which implies that $\max_i |||x_ix_i^\top-\Sigma|||_{\max} = O_P\big(\sqrt{\log(nd)}\big)$.
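As an aside, the max-norm concentration rates used in this proof are easy to check numerically. The sketch below is purely illustrative (the standard Gaussian design with $\Sigma = I_d$, the noise level, and the sample sizes are our own choices, not part of the proof); it verifies that the entrywise maximum deviation of $\frac1n\sum_i(x_ie_i)(x_ie_i)^\top$ from $\sigma^2\Sigma$, the quantity controlling $V_2'$, shrinks as $n$ grows, consistent with the $\sqrt{\log d/n}$ leading term above.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_norm_dev(n, d, sigma=1.0):
    """Entrywise max deviation of (1/n) * sum_i (x_i e_i)(x_i e_i)^T
    from sigma^2 * Sigma, with Sigma = I_d (standard Gaussian design)."""
    X = rng.standard_normal((n, d))
    e = sigma * rng.standard_normal(n)
    G = X * e[:, None]              # row i is the per-sample gradient x_i * e_i
    S = G.T @ G / n                 # empirical second-moment matrix
    return np.abs(S - sigma**2 * np.eye(d)).max()

# Each 10-fold increase in n should shrink the deviation by roughly sqrt(10).
devs = [max_norm_dev(n, d=50) for n in (200, 2000, 20000)]
assert devs[0] > devs[1] > devs[2]
```

Because the entries $(x_ie_i)_l(x_ie_i)_{l'}$ are only sub-exponential, the constants are noticeably larger than in the purely sub-Gaussian case, which is where the higher-order $\log^2(dn)\log d/n$ term in the bound comes from.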
Putting all the preceding bounds together, we obtain that
\begin{align*}
V_{11}'(\bar\theta) &= O_P\big(\log(nd)\,r_{\bar\theta}^2\big),\\
V_{12}'(\bar\theta) &= O_P\!\left(\left(1+\Big(\frac{\log d}{n}\Big)^{1/4}+\sqrt{\frac{\log^2(dn)\log d}{n}}\right)\sqrt{\log(nd)}\,r_{\bar\theta}\right),\\
V_1'(\bar\theta) &= \frac{n}{n+k-1}\,O_P\!\left(\left(1+\Big(\frac{\log d}{n}\Big)^{1/4}+\sqrt{\frac{\log^2(dn)\log d}{n}}\right)\sqrt{\log(nd)}\,r_{\bar\theta}+\log(nd)\,r_{\bar\theta}^2\right)\\
&= O_P\!\left(\left(1+\Big(\frac{\log d}{n}\Big)^{1/4}+\sqrt{\frac{\log^2(dn)\log d}{n}}\right)\frac{n\sqrt{\log(nd)}}{n+k}\,r_{\bar\theta}+\frac{n\log(nd)}{n+k}\,r_{\bar\theta}^2\right),
\end{align*}
and finally,
\begin{align*}
&\Bigg|\Bigg|\Bigg|\frac{1}{n+k-1}\Bigg(\sum_{i=1}^n \big(\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}_N(\bar\theta)\big)\big(\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}_N(\bar\theta)\big)^\top\\
&\qquad + \sum_{j=2}^k n\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)^\top\Bigg) - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\Bigg|\Bigg|\Bigg|_{\max}\\
&= O_P\Bigg(\sqrt{\frac{\log d}{n+k}}+\frac{\log^2(d(n+k))\log d}{n+k}+\left(\sqrt{\frac{\log d}{N}}\,\frac{nk}{n+k}+\log((n+k)d)\right)r_{\bar\theta}^2\\
&\qquad +\left(\sqrt{\log((n+k)d)}+\frac{\log^{1/4}d\,\sqrt{\log((n+k)d)}}{(n+k)^{1/4}}+\sqrt{\frac{\log^2(d(n+k))\log d}{n+k}}\right)r_{\bar\theta}\Bigg).
\end{align*}
Lemma C.19.
In sparse GLM, under Assumptions (B1)–(B3), provided that $\|\bar\theta-\theta^*\|_1 = O_P(r_{\bar\theta})$, we have that
\begin{align*}
&\left|\left|\left|\frac{1}{k}\sum_{j=1}^k n\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)^\top - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\right|\right|\right|_{\max}\\
&= O_P\!\left(\sqrt{\frac{\log d}{k}}+\frac{\log d}{k}+\left(1+\Big(\frac{\log d}{k}\Big)^{1/4}\right)\big(\sqrt{\log d}+\sqrt{n}\,r_{\bar\theta}\big)r_{\bar\theta}+\big(n+\log d+n\,r_{\bar\theta}^2\big)r_{\bar\theta}^2\right).
\end{align*}
Proof of Lemma C.19.
By Lemma C.15, it suffices to bound $U_1(\bar\theta)$, $U_2$, and $U_3(\bar\theta)$. We begin by bounding $U_2$. Using the argument for obtaining (46), we have that for any $t>0$,
\[
P\big(|\nabla\mathcal{L}_j(\theta^*)_l| > t\big) \le 2\exp\!\left(-\frac{nt^2}{c^2}\right),
\quad\text{and then,}\quad
P\big(\sqrt{n}\,|\nabla\mathcal{L}_j(\theta^*)_l| > t\big) \le 2\exp\!\left(-\frac{t^2}{c^2}\right),
\]
that is, $\sqrt{n}\,\nabla\mathcal{L}_j(\theta^*)_l$ is sub-Gaussian with $O(1)$ $\psi_2$-norm. Therefore, $n\nabla\mathcal{L}_j(\theta^*)_l\nabla\mathcal{L}_j(\theta^*)_{l'}$ is sub-exponential with $O(1)$ $\psi_1$-norm. Note that $E[n\nabla\mathcal{L}_j(\theta^*)_l\nabla\mathcal{L}_j(\theta^*)_{l'}] = E[\nabla\mathcal{L}(\theta^*;Z)_l\nabla\mathcal{L}(\theta^*;Z)_{l'}]$. Then, we apply Bernstein's inequality and obtain that for any $t>0$,
\[
P\left(\left|\frac{1}{k}\sum_{j=1}^k n\nabla\mathcal{L}_j(\theta^*)_l\nabla\mathcal{L}_j(\theta^*)_{l'} - E[\nabla\mathcal{L}(\theta^*;Z)_l\nabla\mathcal{L}(\theta^*;Z)_{l'}]\right| > t\right) \le 2\exp\big(-ck\big(t^2\wedge t\big)\big),
\]
or, for any $\delta\in(0,1)$,
\[
P\left(\left|\frac{1}{k}\sum_{j=1}^k n\nabla\mathcal{L}_j(\theta^*)_l\nabla\mathcal{L}_j(\theta^*)_{l'} - E[\nabla\mathcal{L}(\theta^*;Z)_l\nabla\mathcal{L}(\theta^*;Z)_{l'}]\right| > \sqrt{\frac{\log\frac{d^2}{\delta}}{ck}}\vee\frac{\log\frac{d^2}{\delta}}{ck}\right) \le \frac{\delta}{d^2},
\]
and by the union bound, with probability at least $1-\delta$,
\[
U_2 \le \sqrt{\frac{\log\frac{d^2}{\delta}}{ck}}\vee\frac{\log\frac{d^2}{\delta}}{ck},
\quad\text{that is,}\quad
U_2 = O_P\!\left(\sqrt{\frac{\log d}{k}}\right).
\]
Next, we bound $U_3(\bar\theta)$. By the triangle inequality, we have that
\[
\big\|\nabla\mathcal{L}_N(\bar\theta)-\nabla\mathcal{L}^*(\bar\theta)\big\|_\infty \le \big\|\nabla\mathcal{L}_N(\bar\theta)-\nabla\mathcal{L}_N(\theta^*)\big\|_\infty + \|\nabla\mathcal{L}_N(\theta^*)\|_\infty + \big\|\nabla\mathcal{L}^*(\bar\theta)\big\|_\infty.
\]
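The Bernstein rate $U_2 = O_P(\sqrt{\log d/k})$ just obtained can be illustrated with a small simulation. The setup below is purely illustrative and not from the paper: a logistic model with $\theta^*=0$ and a bounded uniform design, chosen so that $E[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top] = \frac{1}{12}I_d$ is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

def machine_gradient(n, d):
    """Local average gradient of the logistic loss at theta* = 0 on one machine."""
    X = rng.uniform(-1.0, 1.0, size=(n, d))        # bounded design, as in (B1)
    y = (rng.uniform(size=n) < 0.5).astype(float)  # P(y = 1 | x) = 1/2 at theta* = 0
    return X.T @ (0.5 - y) / n

def u2(k, n, d):
    """Max-norm deviation of (1/k) sum_j n * grad_j grad_j^T from I_d / 12."""
    G = np.stack([machine_gradient(n, d) for _ in range(k)])
    S = n * (G.T @ G) / k
    return np.abs(S - np.eye(d) / 12.0).max()

# The deviation should decay roughly like sqrt(log d / k) as k grows (n, d fixed).
devs = [u2(k, n=100, d=20) for k in (20, 200, 2000)]
assert devs[0] > devs[1] > devs[2]
```

Here $E[xx^\top] = \frac13 I_d$ for the uniform design and $(1/2-y)^2 = 1/4$ almost surely, which gives the closed-form reference matrix $I_d/12$ without any Monte Carlo error in the target.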
By (43), we have that
\[
\nabla\mathcal{L}_N(\bar\theta)-\nabla\mathcal{L}_N(\theta^*) = \int_0^1 \nabla^2\mathcal{L}_N(\theta^*+t(\bar\theta-\theta^*))\,dt\,(\bar\theta-\theta^*)
= \int_0^1 \frac{1}{N}\sum_{i=1}^n\sum_{j=1}^k g''\big(y_{ij},x_{ij}^\top(\theta^*+t(\bar\theta-\theta^*))\big)\,x_{ij}x_{ij}^\top\,dt\,(\bar\theta-\theta^*),
\]
and then, under Assumptions (B1) and (B2),
\[
\big\|\nabla\mathcal{L}_N(\bar\theta)-\nabla\mathcal{L}_N(\theta^*)\big\|_\infty \le \int_0^1 \frac{1}{N}\sum_{i=1}^n\sum_{j=1}^k \big|g''\big(y_{ij},x_{ij}^\top(\theta^*+t(\bar\theta-\theta^*))\big)\big|\,\|x_{ij}\|_\infty^2\,dt\,\big\|\bar\theta-\theta^*\big\|_1 \lesssim \big\|\bar\theta-\theta^*\big\|_1.
\]
Note that for any $\theta$,
\begin{align*}
\|\nabla\mathcal{L}^*(\theta)\|_\infty &= \|\nabla\mathcal{L}^*(\theta)-\nabla\mathcal{L}^*(\theta^*)\|_\infty = \big\|E\big[\big(g'(y,x^\top\theta)-g'(y,x^\top\theta^*)\big)x\big]\big\|_\infty\\
&= \left\|E\left[\int_0^1 g''\big(y,x^\top(\theta^*+t(\theta-\theta^*))\big)\,dt\,xx^\top(\theta-\theta^*)\right]\right\|_\infty\\
&\le E\left[\int_0^1 \big|g''\big(y,x^\top(\theta^*+t(\theta-\theta^*))\big)\big|\,dt\,\|x\|_\infty^2\,\|\theta-\theta^*\|_1\right] \lesssim \|\theta-\theta^*\|_1.
\end{align*}
In particular, $\|\nabla\mathcal{L}^*(\bar\theta)\|_\infty \lesssim \|\bar\theta-\theta^*\|_1$. By (47), we have that $\|\nabla\mathcal{L}_N(\theta^*)\|_\infty = O_P\big(\sqrt{\log d/N}\big)$. Then, assuming that $\|\bar\theta-\theta^*\|_1 = O_P(r_{\bar\theta})$, we have that
\[
\big\|\nabla\mathcal{L}_N(\bar\theta)-\nabla\mathcal{L}^*(\bar\theta)\big\|_\infty = O_P\!\left(r_{\bar\theta}+\sqrt{\frac{\log d}{N}}\right),
\quad\text{and then,}\quad
U_3(\bar\theta) = O_P\!\left(nr_{\bar\theta}^2+\frac{\log d}{k}\right).
\]
Lastly, we bound $U_1(\bar\theta)$.
As in the proof of Lemma C.16, writing $\Delta_j := \nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}_j(\theta^*)$, we have that
\[
U_1(\bar\theta) \le \left|\left|\left|\frac{1}{k}\sum_{j=1}^k n\,\Delta_j\Delta_j^\top\right|\right|\right|_{\max} + 2\left|\left|\left|\frac{1}{k}\sum_{j=1}^k n\,\nabla\mathcal{L}_j(\theta^*)\Delta_j^\top\right|\right|\right|_{\max} =: U_{11}(\bar\theta)+2U_{12}(\bar\theta),
\]
and, by the Cauchy--Schwarz inequality,
\[
U_{12}(\bar\theta) \le \left|\left|\left|\frac{1}{k}\sum_{j=1}^k n\,\nabla\mathcal{L}_j(\theta^*)\nabla\mathcal{L}_j(\theta^*)^\top\right|\right|\right|_{\max}^{1/2} U_{11}(\bar\theta)^{1/2}.
\]
Note that $|||E[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top]|||_{\max} = O(1)$ under Assumption (B3).
Then, by the triangle inequality, we have that
\begin{align*}
\left|\left|\left|\frac{1}{k}\sum_{j=1}^k n\,\nabla\mathcal{L}_j(\theta^*)\nabla\mathcal{L}_j(\theta^*)^\top\right|\right|\right|_{\max}
&\le \left|\left|\left|\frac{1}{k}\sum_{j=1}^k n\,\nabla\mathcal{L}_j(\theta^*)\nabla\mathcal{L}_j(\theta^*)^\top - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\right|\right|\right|_{\max}\\
&\quad + \big|\big|\big|E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\big|\big|\big|_{\max}\\
&= U_2 + \big|\big|\big|E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\big|\big|\big|_{\max} = O_P\!\left(1+\sqrt{\frac{\log d}{k}}\right).
\end{align*}
It remains to bound $U_{11}(\bar\theta)$. Note that
\[
\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_j(\theta^*) = \int_0^1 \nabla^2\mathcal{L}_j(\theta^*+t(\bar\theta-\theta^*))\,dt\,(\bar\theta-\theta^*)
= \int_0^1 \frac{1}{n}\sum_{i=1}^n g''\big(y_{ij},x_{ij}^\top(\theta^*+t(\bar\theta-\theta^*))\big)\,x_{ij}x_{ij}^\top\,dt\,(\bar\theta-\theta^*),
\]
and
\[
g''\big(y_{ij},x_{ij}^\top(\theta^*+t(\bar\theta-\theta^*))\big) = g''\big(y_{ij},x_{ij}^\top\theta^*\big) + \int_0^1 g'''\big(y_{ij},x_{ij}^\top(\theta^*+st(\bar\theta-\theta^*))\big)\,ds\,x_{ij}^\top\big(t(\bar\theta-\theta^*)\big),
\]
and then
\begin{align*}
\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_j(\theta^*) &= \frac{1}{n}\sum_{i=1}^n g''\big(y_{ij},x_{ij}^\top\theta^*\big)\,x_{ij}x_{ij}^\top(\bar\theta-\theta^*)\\
&\quad + \int_0^1\!\!\int_0^1 \frac{1}{n}\sum_{i=1}^n g'''\big(y_{ij},x_{ij}^\top(\theta^*+st(\bar\theta-\theta^*))\big)\,x_{ij}^\top t(\bar\theta-\theta^*)\,x_{ij}x_{ij}^\top\,dt\,ds\,(\bar\theta-\theta^*).
\end{align*}
In a similar way, we have that
\begin{align*}
\nabla\mathcal{L}^*(\bar\theta) &= \nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}^*(\theta^*)\\
&= E\big[g''(y,x^\top\theta^*)xx^\top\big](\bar\theta-\theta^*) + \int_0^1\!\!\int_0^1 E_{x,y}\big[g'''\big(y,x^\top(\theta^*+st(\bar\theta-\theta^*))\big)\,x^\top t(\bar\theta-\theta^*)\,xx^\top\big]\,dt\,ds\,(\bar\theta-\theta^*),
\end{align*}
and hence
\begin{align*}
\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}_j(\theta^*) &= \left(\frac{1}{n}\sum_{i=1}^n g''\big(y_{ij},x_{ij}^\top\theta^*\big)\,x_{ij}x_{ij}^\top - E\big[g''(y,x^\top\theta^*)xx^\top\big]\right)(\bar\theta-\theta^*)\\
&\quad + \int_0^1\!\!\int_0^1 \bigg(\frac{1}{n}\sum_{i=1}^n g'''\big(y_{ij},x_{ij}^\top(\theta^*+st(\bar\theta-\theta^*))\big)\,x_{ij}^\top t(\bar\theta-\theta^*)\,x_{ij}x_{ij}^\top\\
&\qquad\qquad - E_{x,y}\big[g'''\big(y,x^\top(\theta^*+st(\bar\theta-\theta^*))\big)\,x^\top t(\bar\theta-\theta^*)\,xx^\top\big]\bigg)\,dt\,ds\,(\bar\theta-\theta^*)\\
&=: U_{111,j} + U_{112,j}(\bar\theta).
\end{align*}
Then, we have by the triangle inequality that
\begin{align*}
U_{11}(\bar\theta) &= \left|\left|\left|\frac{1}{k}\sum_{j=1}^k n\big(U_{111,j}+U_{112,j}(\bar\theta)\big)\big(U_{111,j}+U_{112,j}(\bar\theta)\big)^\top\right|\right|\right|_{\max}\\
&\le \frac{1}{k}\sum_{j=1}^k n\left|\left|\left|\big(U_{111,j}+U_{112,j}(\bar\theta)\big)\big(U_{111,j}+U_{112,j}(\bar\theta)\big)^\top\right|\right|\right|_{\max}
= \frac{1}{k}\sum_{j=1}^k n\big\|U_{111,j}+U_{112,j}(\bar\theta)\big\|_\infty^2\\
&\le \frac{1}{k}\sum_{j=1}^k n\Big(\|U_{111,j}\|_\infty+\big\|U_{112,j}(\bar\theta)\big\|_\infty\Big)^2.
\end{align*}
Using the argument for obtaining (45), we have that
\[
\|U_{111,j}\|_\infty = \big\|\big(\nabla^2\mathcal{L}_j(\theta^*)-\nabla^2\mathcal{L}^*(\theta^*)\big)(\bar\theta-\theta^*)\big\|_\infty \le \big|\big|\big|\nabla^2\mathcal{L}_j(\theta^*)-\nabla^2\mathcal{L}^*(\theta^*)\big|\big|\big|_{\max}\big\|\bar\theta-\theta^*\big\|_1 = O_P\!\left(\sqrt{\frac{\log d}{n}}\right)O_P(r_{\bar\theta}) = O_P\!\left(\sqrt{\frac{\log d}{n}}\,r_{\bar\theta}\right),
\]
and
\begin{align*}
\big\|U_{112,j}(\bar\theta)\big\|_\infty &\le \int_0^1\!\!\int_0^1 \bigg(\frac{1}{n}\sum_{i=1}^n \big|g'''\big(y_{ij},x_{ij}^\top(\theta^*+st(\bar\theta-\theta^*))\big)\big|\,\|x_{ij}\|_\infty^2\,t\,\big\|\bar\theta-\theta^*\big\|_1\\
&\qquad + E_{x,y}\big[\big|g'''\big(y,x^\top(\theta^*+st(\bar\theta-\theta^*))\big)\big|\,\|x\|_\infty^2\,t\,\big\|\bar\theta-\theta^*\big\|_1\big]\bigg)\,dt\,ds\,\big\|\bar\theta-\theta^*\big\|_1 \lesssim \big\|\bar\theta-\theta^*\big\|_1^2 = O_P\big(r_{\bar\theta}^2\big).
\end{align*}
Hence, we have that
\[
U_{11}(\bar\theta) = n\left(O_P\!\left(\frac{\log d}{n}\,r_{\bar\theta}^2\right)+O_P\big(r_{\bar\theta}^4\big)\right) = O_P\big(\big(\log d+nr_{\bar\theta}^2\big)r_{\bar\theta}^2\big).
\]
Putting all the preceding bounds together, we obtain that
\begin{align*}
U_{12}(\bar\theta) &= O_P\!\left(\left(1+\Big(\frac{\log d}{k}\Big)^{1/4}\right)\big(\sqrt{\log d}+\sqrt{n}\,r_{\bar\theta}\big)r_{\bar\theta}\right),\\
U_1(\bar\theta) &= O_P\!\left(\left(1+\Big(\frac{\log d}{k}\Big)^{1/4}\right)\big(\sqrt{\log d}+\sqrt{n}\,r_{\bar\theta}\big)r_{\bar\theta}+\big(\log d+nr_{\bar\theta}^2\big)r_{\bar\theta}^2\right),
\end{align*}
and finally,
\begin{align*}
&\left|\left|\left|\frac{1}{k}\sum_{j=1}^k n\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)^\top - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\right|\right|\right|_{\max}\\
&= O_P\!\left(\sqrt{\frac{\log d}{k}}+\frac{\log d}{k}+\left(1+\Big(\frac{\log d}{k}\Big)^{1/4}\right)\big(\sqrt{\log d}+\sqrt{n}\,r_{\bar\theta}\big)r_{\bar\theta}+\big(n+\log d+nr_{\bar\theta}^2\big)r_{\bar\theta}^2\right).
\end{align*}
Lemma C.20.
In sparse GLM, under Assumptions (B1)–(B3), provided that $\|\bar\theta-\theta^*\|_1 = O_P(r_{\bar\theta})$, we have that
\begin{align*}
&\Bigg|\Bigg|\Bigg|\frac{1}{n+k-1}\Bigg(\sum_{i=1}^n \big(\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}_N(\bar\theta)\big)\big(\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}_N(\bar\theta)\big)^\top\\
&\qquad + \sum_{j=2}^k n\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)^\top\Bigg) - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\Bigg|\Bigg|\Bigg|_{\max}\\
&= O_P\Bigg(\sqrt{\frac{\log d}{n+k}}+\frac{\log d}{n+k}+\frac{nk}{n+k}\,r_{\bar\theta}^2+\left(1+\Big(\frac{\log d}{n}\Big)^{1/4}\right)\frac{n}{n+k}\big(r_{\bar\theta}+r_{\bar\theta}^2\big)+\frac{n}{n+k}\,r_{\bar\theta}^2\\
&\qquad +\left(1+\Big(\frac{\log d}{k}\Big)^{1/4}\right)\frac{k\sqrt{\log d}+k\sqrt{n}\,r_{\bar\theta}}{n+k}\,r_{\bar\theta}+\frac{k\log d+kn\,r_{\bar\theta}^2}{n+k}\,r_{\bar\theta}^2\Bigg).
\end{align*}
Proof of Lemma C.20.
By Lemma C.17, it suffices to bound $V_1(\bar\theta)$, $V_1'(\bar\theta)$, $V_2$, $V_2'$, and $V_3(\bar\theta)$. By the proof of Lemma C.19, we have that under Assumptions (B1)–(B3), assuming that $\|\bar\theta-\theta^*\|_1 = O_P(r_{\bar\theta})$,
\begin{align*}
V_1(\bar\theta) &= \frac{k-1}{n+k-1}\,O_P\!\left(\left(1+\Big(\frac{\log d}{k}\Big)^{1/4}\right)\big(\sqrt{\log d}+\sqrt{n}\,r_{\bar\theta}\big)r_{\bar\theta}+\big(\log d+nr_{\bar\theta}^2\big)r_{\bar\theta}^2\right)\\
&= O_P\!\left(\left(1+\Big(\frac{\log d}{k}\Big)^{1/4}\right)\frac{k\sqrt{\log d}+k\sqrt{n}\,r_{\bar\theta}}{n+k}\,r_{\bar\theta}+\frac{k\log d+kn\,r_{\bar\theta}^2}{n+k}\,r_{\bar\theta}^2\right),\\
V_2 &= \frac{k-1}{n+k-1}\,O_P\!\left(\sqrt{\frac{\log d}{k}}\right) = O_P\!\left(\frac{\sqrt{k\log d}}{n+k}\right),
\end{align*}
and
\[
V_3(\bar\theta) = \frac{nk}{n+k-1}\,O_P\!\left(r_{\bar\theta}^2+\frac{\log d}{N}\right) = O_P\!\left(\frac{nk}{n+k}\,r_{\bar\theta}^2+\frac{\log d}{n+k}\right).
\]
It remains to bound $V_1'(\bar\theta)$ and $V_2'$. To bound $V_2'$, we note that each $\nabla\mathcal{L}(\theta^*;Z_i)_l\nabla\mathcal{L}(\theta^*;Z_i)_{l'} = g'(y_i,x_i^\top\theta^*)^2\,x_{i,l}x_{i,l'}$ is bounded under Assumptions (B1) and (B2). Applying Hoeffding's inequality, we obtain that for any $t>0$,
\[
P\left(\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)_l\nabla\mathcal{L}(\theta^*;Z_i)_{l'} - E[\nabla\mathcal{L}(\theta^*;Z)_l\nabla\mathcal{L}(\theta^*;Z)_{l'}]\right| > t\right) \le 2\exp\!\left(-\frac{nt^2}{c^2}\right)
\]
for some constant $c$, or, for any $\delta\in(0,1)$,
\[
P\left(\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)_l\nabla\mathcal{L}(\theta^*;Z_i)_{l'} - E[\nabla\mathcal{L}(\theta^*;Z)_l\nabla\mathcal{L}(\theta^*;Z)_{l'}]\right| > \sqrt{\frac{c^2\log\frac{d^2}{\delta}}{n}}\right) \le \frac{\delta}{d^2},
\]
and by the union bound, with probability at least $1-\delta$,
\[
\left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\nabla\mathcal{L}(\theta^*;Z_i)^\top - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\right|\right|\right|_{\max} \le \sqrt{\frac{c^2\log\frac{d^2}{\delta}}{n}},
\]
which implies that
\[
V_2' = \frac{n}{n+k-1}\,O_P\!\left(\sqrt{\frac{\log d}{n}}\right) = O_P\!\left(\frac{\sqrt{n\log d}}{n+k}\right).
\]
Lastly, we bound $V_1'(\bar\theta)$. As in the proof of Lemma C.18, writing $\Delta_i := \nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}(\theta^*;Z_i)$, we have that
\[
\frac{n+k-1}{n}V_1'(\bar\theta) \le \left|\left|\left|\frac{1}{n}\sum_{i=1}^n \Delta_i\Delta_i^\top\right|\right|\right|_{\max} + 2\left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\Delta_i^\top\right|\right|\right|_{\max} =: V_{11}'(\bar\theta)+2V_{12}'(\bar\theta),
\]
and
\[
V_{12}'(\bar\theta) \le \left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\nabla\mathcal{L}(\theta^*;Z_i)^\top\right|\right|\right|_{\max}^{1/2} V_{11}'(\bar\theta)^{1/2}.
\]
Note that $|||E[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top]|||_{\max} = O(1)$ under Assumption (B3).
Then, by the triangle inequality, we have that
\begin{align*}
\left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\nabla\mathcal{L}(\theta^*;Z_i)^\top\right|\right|\right|_{\max}
&\le \left|\left|\left|\frac{1}{n}\sum_{i=1}^n \nabla\mathcal{L}(\theta^*;Z_i)\nabla\mathcal{L}(\theta^*;Z_i)^\top - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\right|\right|\right|_{\max}\\
&\quad + \big|\big|\big|E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\big|\big|\big|_{\max}\\
&= \frac{n+k-1}{n}V_2' + \big|\big|\big|E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\big|\big|\big|_{\max} = O_P\!\left(1+\sqrt{\frac{\log d}{n}}\right).
\end{align*}
It remains to bound $V_{11}'(\bar\theta)$.
Using the same argument for analyzing $\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}_j(\theta^*)$ in the proof of Lemma C.19, we obtain that
\begin{align*}
\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}^*(\bar\theta)-\nabla\mathcal{L}(\theta^*;Z_i) &= \big(g''(y_i,x_i^\top\theta^*)\,x_ix_i^\top - E\big[g''(y,x^\top\theta^*)xx^\top\big]\big)(\bar\theta-\theta^*)\\
&\quad + \int_0^1\!\!\int_0^1 \Big(g'''\big(y_i,x_i^\top(\theta^*+st(\bar\theta-\theta^*))\big)\,x_i^\top t(\bar\theta-\theta^*)\,x_ix_i^\top\\
&\qquad\qquad - E_{x,y}\big[g'''\big(y,x^\top(\theta^*+st(\bar\theta-\theta^*))\big)\,x^\top t(\bar\theta-\theta^*)\,xx^\top\big]\Big)\,dt\,ds\,(\bar\theta-\theta^*)\\
&=: V_{111,i}' + V_{112,i}'(\bar\theta),
\end{align*}
and
\begin{align*}
V_{11}'(\bar\theta) &= \left|\left|\left|\frac{1}{n}\sum_{i=1}^n \big(V_{111,i}'+V_{112,i}'(\bar\theta)\big)\big(V_{111,i}'+V_{112,i}'(\bar\theta)\big)^\top\right|\right|\right|_{\max}\\
&\le \frac{1}{n}\sum_{i=1}^n \left|\left|\left|\big(V_{111,i}'+V_{112,i}'(\bar\theta)\big)\big(V_{111,i}'+V_{112,i}'(\bar\theta)\big)^\top\right|\right|\right|_{\max}
= \frac{1}{n}\sum_{i=1}^n \big\|V_{111,i}'+V_{112,i}'(\bar\theta)\big\|_\infty^2\\
&\le \frac{1}{n}\sum_{i=1}^n \Big(\big\|V_{111,i}'\big\|_\infty+\big\|V_{112,i}'(\bar\theta)\big\|_\infty\Big)^2.
\end{align*}
We have that
\[
\big\|V_{111,i}'\big\|_\infty = \big\|\big(\nabla^2\mathcal{L}(\theta^*;Z_i)-\nabla^2\mathcal{L}^*(\theta^*)\big)(\bar\theta-\theta^*)\big\|_\infty
\le \Big(\big|g''(y_i,x_i^\top\theta^*)\big|\,\|x_i\|_\infty^2+\big|\big|\big|\nabla^2\mathcal{L}^*(\theta^*)\big|\big|\big|_{\max}\Big)\big\|\bar\theta-\theta^*\big\|_1 = O_P(r_{\bar\theta}),
\]
and
\begin{align*}
\big\|V_{112,i}'(\bar\theta)\big\|_\infty &\le \int_0^1\!\!\int_0^1 \Big(\big|g'''\big(y_i,x_i^\top(\theta^*+st(\bar\theta-\theta^*))\big)\big|\,\|x_i\|_\infty^2\,t\,\big\|\bar\theta-\theta^*\big\|_1\\
&\qquad + E_{x,y}\big[\big|g'''\big(y,x^\top(\theta^*+st(\bar\theta-\theta^*))\big)\big|\,\|x\|_\infty^2\,t\,\big\|\bar\theta-\theta^*\big\|_1\big]\Big)\,dt\,ds\,\big\|\bar\theta-\theta^*\big\|_1 \lesssim \big\|\bar\theta-\theta^*\big\|_1^2 = O_P\big(r_{\bar\theta}^2\big),
\end{align*}
and hence,
\[
V_{11}'(\bar\theta) = O_P\big(r_{\bar\theta}^2+r_{\bar\theta}^4\big).
\]
Putting all the preceding bounds together, we obtain that
\begin{align*}
V_{12}'(\bar\theta) &= O_P\!\left(\left(1+\Big(\frac{\log d}{n}\Big)^{1/4}\right)\big(r_{\bar\theta}+r_{\bar\theta}^2\big)\right),\\
V_1'(\bar\theta) &= \frac{n}{n+k-1}\,O_P\!\left(\left(1+\Big(\frac{\log d}{n}\Big)^{1/4}\right)\big(r_{\bar\theta}+r_{\bar\theta}^2\big)+r_{\bar\theta}^2+r_{\bar\theta}^4\right)
= O_P\!\left(\left(1+\Big(\frac{\log d}{n}\Big)^{1/4}\right)\frac{n}{n+k}\big(r_{\bar\theta}+r_{\bar\theta}^2\big)+\frac{n}{n+k}\,r_{\bar\theta}^2\right),
\end{align*}
and finally,
\begin{align*}
&\Bigg|\Bigg|\Bigg|\frac{1}{n+k-1}\Bigg(\sum_{i=1}^n \big(\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}_N(\bar\theta)\big)\big(\nabla\mathcal{L}(\bar\theta;Z_i)-\nabla\mathcal{L}_N(\bar\theta)\big)^\top\\
&\qquad + \sum_{j=2}^k n\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)\big(\nabla\mathcal{L}_j(\bar\theta)-\nabla\mathcal{L}_N(\bar\theta)\big)^\top\Bigg) - E\big[\nabla\mathcal{L}(\theta^*;Z)\nabla\mathcal{L}(\theta^*;Z)^\top\big]\Bigg|\Bigg|\Bigg|_{\max}\\
&= O_P\Bigg(\sqrt{\frac{\log d}{n+k}}+\frac{\log d}{n+k}+\frac{nk}{n+k}\,r_{\bar\theta}^2+\left(1+\Big(\frac{\log d}{n}\Big)^{1/4}\right)\frac{n}{n+k}\big(r_{\bar\theta}+r_{\bar\theta}^2\big)+\frac{n}{n+k}\,r_{\bar\theta}^2\\
&\qquad +\left(1+\Big(\frac{\log d}{k}\Big)^{1/4}\right)\frac{k\sqrt{\log d}+k\sqrt{n}\,r_{\bar\theta}}{n+k}\,r_{\bar\theta}+\frac{k\log d+kn\,r_{\bar\theta}^2}{n+k}\,r_{\bar\theta}^2\Bigg).
\end{align*}
Lemma C.21.
In the high-dimensional linear model, under Assumption (A1), if $n \gg s^*\log d$, we have that
\[
\big|\big|\big|\widetilde\Theta\big|\big|\big|_\infty = O_P\big(\sqrt{s^*}\big),\qquad
\big|\big|\big|\widetilde\Theta-\Theta\big|\big|\big|_\infty = O_P\!\left(s^*\sqrt{\frac{\log d}{n}}\right),\qquad
\left|\left|\left|\widetilde\Theta\,\frac{X^\top X}{n}-I_d\right|\right|\right|_{\max} = O_P\!\left(\sqrt{\frac{\log d}{n}}\right),
\]
and
\[
\max_l \big\|\widetilde\Theta_l-\Theta_l\big\|_2 = O_P\!\left(\sqrt{\frac{s^*\log d}{n}}\right).
\]
Proof of Lemma C.21.
In the high-dimensional setting, $\widetilde\Theta$ is constructed using the nodewise lasso. We obtain the bounds in the lemma from the proof of Lemma 5.3 and Theorem 2.4 of van de Geer et al. (2014).
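The nodewise-lasso construction of $\widetilde\Theta$ referenced in this proof can be sketched as follows. This is an illustrative implementation following van de Geer et al. (2014), not the authors' released code; the penalty level $\lambda$ and the problem sizes below are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso(X, lam):
    """Approximate inverse of Sigma_hat = X^T X / n (van de Geer et al., 2014):
    regress each column X_l on the remaining columns with an l1 penalty."""
    n, d = X.shape
    Theta = np.zeros((d, d))
    for l in range(d):
        rest = np.delete(np.arange(d), l)
        gamma = Lasso(alpha=lam, fit_intercept=False).fit(X[:, rest], X[:, l]).coef_
        resid = X[:, l] - X[:, rest] @ gamma
        tau2 = resid @ X[:, l] / n        # tau_l^2 = X_l^T (X_l - X_{-l} gamma) / n
        row = np.zeros(d)
        row[l] = 1.0
        row[rest] = -gamma
        Theta[l] = row / tau2
    return Theta

rng = np.random.default_rng(2)
n, d = 500, 20
X = rng.standard_normal((n, d))
Theta = nodewise_lasso(X, lam=np.sqrt(np.log(d) / n))
# |||Theta (X^T X / n) - I_d|||_max should be small, as in Lemma C.21.
err = np.abs(Theta @ (X.T @ X) / n - np.eye(d)).max()
assert err < 0.5
```

By the lasso KKT conditions (scikit-learn's objective is $\frac{1}{2n}\|y-Xw\|_2^2+\alpha\|w\|_1$), the off-diagonal entries of $\widetilde\Theta\,\widehat\Sigma - I_d$ are bounded by $\lambda/\hat\tau_l^2$, while the diagonal entries equal one exactly by the definition of $\hat\tau_l^2$.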
Lemma C.22.
In the high-dimensional GLM, under Assumptions (B1)–(B3), if $n \gg s\log d + s^*\log d$, we have that
\[
\big|\big|\big|\widetilde\Theta(\widetilde\theta^{(0)})\big|\big|\big|_\infty = O_P\big(\sqrt{s^*}\big),\qquad
\big|\big|\big|\widetilde\Theta(\widetilde\theta^{(0)})-\Theta\big|\big|\big|_\infty = O_P\!\left((s+s^*)\sqrt{\frac{\log d}{n}}\right),
\]
\[
\big|\big|\big|\widetilde\Theta(\widetilde\theta^{(0)})\,\nabla^2\mathcal{L}_1(\widetilde\theta^{(0)})-I_d\big|\big|\big|_{\max} = O_P\!\left(\sqrt{\frac{\log d}{n}}\right),
\quad\text{and}\quad
\max_l \big\|\widetilde\Theta(\widetilde\theta^{(0)})_l-\Theta_l\big\|_2 = O_P\!\left(\sqrt{\frac{(s+s^*)\log d}{n}}\right).
\]
Proof of Lemma C.22.
In the high-dimensional setting, $\widetilde\Theta(\widetilde\theta^{(0)})$