Boosting with Structural Sparsity: A Differential Inclusion Approach
Chendi Huang^a, Xinwei Sun^a, Jiechao Xiong^a, Yuan Yao^{a,b,∗}

^a School of Mathematical Science, Peking University, Beijing, 100871, China
^b Department of Mathematics and Division of Biomedical Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong SAR, China
Abstract
Boosting as a gradient descent algorithm is one popular method in machine learning. In this paper a novel Boosting-type algorithm is proposed, based on restricted gradient descent with structural sparsity control, whose underlying dynamics are governed by differential inclusions. In particular, we present an iterative regularization path with structural sparsity, where the parameter is sparse under some linear transform, based on variable splitting and the Linearized Bregman Iteration; hence it is called Split LBI. Despite its simplicity, Split LBI outperforms the popular generalized Lasso in both theory and experiments. A theory of path consistency is presented, showing that, equipped with proper early stopping, Split LBI may achieve model selection consistency under a family of Irrepresentable Conditions which can be weaker than the necessary and sufficient condition for the generalized Lasso. Furthermore, some ℓ₂ error bounds are also given at the minimax optimal rates. The utility and benefit of the algorithm are illustrated by several applications, including image denoising, partial order ranking of sport teams, and world university grouping with crowdsourced ranking data.

Keywords: Boosting, differential inclusions, structural sparsity, linearized Bregman iteration, variable splitting, generalized Lasso, model selection, consistency

∗ Corresponding author
Email addresses: [email protected] (Chendi Huang), [email protected] (Xinwei Sun), [email protected] (Jiechao Xiong), [email protected] (Yuan Yao)
April 18, 2017

1. Introduction

In this paper, consider the recovery from linear noisy measurements of β⋆ ∈ ℝ^p, which satisfies the following structural sparsity: the linear transformation γ⋆ := Dβ⋆, for some D ∈ ℝ^{m×p}, has most of its elements being zeros. For a design matrix X ∈ ℝ^{n×p}, let

    y = Xβ⋆ + ε,  γ⋆ = Dβ⋆  (S = supp(γ⋆), s = |S|),   (1.1)

where ε ∈ ℝ^n has independent identically distributed components, each of which has a sub-Gaussian distribution with parameter σ² (E[exp(tε_i)] ≤ exp(σ²t²/2) for all t). The transform D has various examples, including the Fourier transform, the wavelet transform, graph gradient operators, etc. Here γ⋆ is sparse, i.e. s ≪ m. Given (y, X, D), the purpose is to estimate β⋆ as well as γ⋆, and, in particular, to recover the support of γ⋆.

There is a large literature on this problem. Perhaps the most popular approach is the following ℓ₁-penalized convex optimization problem,

    argmin_β ( ‖y − Xβ‖²₂/(2n) + λ‖Dβ‖₁ ).   (1.2)

Such a problem can be traced back at least to Rudin et al. (1992), as total variation regularization for image denoising in applied mathematics; in statistics it was formally proposed by Tibshirani et al. (2005) as the fused Lasso. As D = I it reduces to the well-known Lasso (Tibshirani, 1996), and since different choices of D include many special cases, it is often called the generalized Lasso (Tibshirani and Taylor, 2011) in statistics.

Various algorithms have been studied for solving (1.2) at fixed values of the tuning parameter λ, most of which are based on the ADMM or Split Bregman method using operator splitting ideas (see for example Goldstein and Osher (2009); Ye and Xie (2011); Wahlberg et al. (2012); Ramdas and Tibshirani (2014); Zhu (2017) and references therein).
To avoid the difficulty in dealing with the structural sparsity in ‖Dβ‖₁, these algorithms exploit an augmented variable γ to enforce sparsity while keeping it close to Dβ.

On the other hand, regularization paths are crucial for model selection, computing estimators as functions of regularization parameters. For example, Efron et al. (2004) studies the regularization path of the standard Lasso with D = I, the algorithm in Hoefling (2010) computes the regularization path of the fused Lasso, and the dual path algorithm in Tibshirani and Taylor (2011) can deal with the generalized Lasso. Recently, Arnold and Tibshirani (2016) discussed various efficient implementations of the algorithm in Tibshirani and Taylor (2011), and the related R package genlasso can be found in the CRAN repository. All of these are based on homotopy methods for solving the convex optimization problem (1.2).

Our departure here, instead of solving (1.2), is to look at an extremely simple yet novel iterative scheme which finds a new regularization path with structural sparsity. We are going to show that it works in a better way than genlasso, in both theory and experiments. Define a loss function which splits Dβ and γ,

    ℓ(β, γ) := ‖y − Xβ‖²₂/(2n) + ‖γ − Dβ‖²₂/(2ν)   (ν > 0).   (1.3)

Now consider the following iterative algorithm,

    β_{k+1} = β_k − κα ∇_β ℓ(β_k, γ_k),   (1.4a)
    z_{k+1} = z_k − α ∇_γ ℓ(β_k, γ_k),   (1.4b)
    γ_{k+1} = κ · prox_{‖·‖₁}(z_{k+1}),   (1.4c)

where the initial choice is z_0 = γ_0 = 0 ∈ ℝ^m, β_0 = 0 ∈ ℝ^p, the parameters satisfy κ > 0, α > 0, ν > 0, and the proximal map associated with a convex function h is defined by prox_h(z) = argmin_x ( ‖z − x‖²₂/2 + h(x) ), which reduces to the shrinkage operator when h is taken to be the ℓ₁-norm:

    prox_{‖·‖₁}(z) = S(z, 1),  where S(z, λ) = sign(z) · max(|z| − λ, 0)   (λ ≥ 0).

The algorithm generates a sequence (β_k, γ_k)_{k∈ℕ} which defines a discrete regularization path. Iteration (1.4a) has appeared as L₂-Boost (Bühlmann and Yu, 2002) in machine learning, and can be traced back to the Landweber Iteration in inverse problems (Yao et al., 2007), where early stopping regularization is needed against overfitting noise. On the other hand, (1.4b) and (1.4c), generating a sparse regularization path on γ_k, are known as the Linearized Bregman Iteration (LBI), firstly proposed in Yin et al. (2008). Recently, in sparse linear regression, Osher et al. (2016) shows that, under nearly the same conditions as for the Lasso, LBI with early stopping may reach the oracle estimator, which is optimal over all estimators. Equipped with a variable splitting between Dβ and γ, algorithm (1.4) thus combines the L₂-Boost on β for prediction and the LBI on γ for sparse structure. Hence in this paper we call (1.4) the Split LBI, or Boosting with structural sparsity.

The gap ‖γ − Dβ‖²₂/ν controls the affinity between Dβ and γ. As ν → 0, Dβ = γ, which meets the generalized Lasso constraint; while for a finite ν > 0, Dβ is not necessarily sparse. Such an increase in the degrees of freedom, however, leaves us a new space for improving model selection consistency, as we shall see in the following experiment and in the theoretical development in the later part of this paper.

The following example shows that the iterative regularization path (1.4) can be more accurate than the regularization path of the generalized Lasso, in terms of the Area Under the Curve (AUC) measurement of the order in which parameters become nonzero, compared with the ground truth sparsity pattern (a higher AUC means better variable selection, in the sense that the true parameters become nonzero along the algorithmic regularization path earlier than the null parameters; the AUC is the area under the Receiver Operating Characteristic (ROC) curve, whose definition can be found, for example, in Brown and Davis (2006)). The following simple experiment illustrates this phenomenon by simulations.

Example 1. Consider two problems: the standard Lasso and the 1-D fused Lasso. In both cases, set n = p = 50, and generate X ∈ ℝ^{n×p} denoting n i.i.d. samples from N(0, I_p), ε ∼ N(0, I_n), and y = Xβ⋆ + ε, where β⋆ is piecewise constant: β⋆_j = 2 on a leading block of indices, β⋆_j = −2 on the adjacent block, and β⋆_j = 0 otherwise. For the standard Lasso, D = I, and for the 1-D fused Lasso we choose D = (D₁; D₂) ∈ ℝ^{(p−1+p)×p} such that (D₁β)_j = β_j − β_{j+1} (for 1 ≤ j ≤ p − 1) and D₂ = I_p. Figure 1 shows the regularization paths computed by genlasso ({Dβ_λ}) and by iteration (1.4) (linear interpolation of {γ_k}), with κ = 200 and ν ∈ {1, 5, 10}, respectively. The generalized Lasso path is in fact piecewise linear with respect to λ, while we plot it against t = 1/λ for comparison. Note that the iterative paths exhibit a variety of different shapes depending on the choice of ν. However, in terms of the order in which those curves enter the nonzero range, the iterative paths exhibit a better accuracy than genlasso. Table 1 shows this by the mean AUC of 100 independent experiments in each case, where increasing ν improves the model selection accuracy of the Split LBI paths beyond that of the generalized Lasso.

[Figure 1: {Dβ_λ} (t = 1/λ) by genlasso and {γ_k} (t = kα) by Split LBI (1.4) with ν = 1, 5, 10, for the 1-D fused Lasso.]

[Table 1: Mean AUC (with standard deviation) comparisons where Split LBI (1.4) beats genlasso; the first panel is for the standard Lasso and the second for the 1-D fused Lasso in Example 1, with columns genlasso and Split LBI under ν = 1, 5, 10.]
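To make the scheme concrete, iteration (1.4) and the fused-Lasso design of Example 1 can be implemented in a few lines. The following is a minimal NumPy sketch: the function names, the default step size (taken conservatively below the stability bound (3.2) of Section 3), and the tie-handling in the simplified activation-order AUC are our own choices, not specifications from the paper.

```python
import numpy as np

def split_lbi(X, y, D, nu=1.0, kappa=200.0, alpha=None, n_iter=500):
    """Minimal sketch of the Split LBI iteration (1.4).

    Returns the discrete regularization path [(beta_k, gamma_k), ...].
    If alpha is None, it is set conservatively below the stability
    bound (3.2), so that the loss (1.3) is non-increasing along the path.
    """
    n, p = X.shape
    m = D.shape[0]
    if alpha is None:
        Lam_X = np.linalg.norm(X.T @ X / n, 2)
        Lam_D2 = np.linalg.norm(D, 2) ** 2
        alpha = nu / (kappa * (1 + nu * Lam_X + Lam_D2))
    beta, gamma, z = np.zeros(p), np.zeros(m), np.zeros(m)
    path = []
    for _ in range(n_iter):
        # gradients of the split loss (1.3) at the current iterate
        grad_beta = X.T @ (X @ beta - y) / n - D.T @ (gamma - D @ beta) / nu
        grad_gamma = (gamma - D @ beta) / nu
        beta = beta - kappa * alpha * grad_beta                        # (1.4a)
        z = z - alpha * grad_gamma                                     # (1.4b)
        gamma = kappa * np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)  # (1.4c)
        path.append((beta.copy(), gamma.copy()))
    return path

def fused_lasso_D(p):
    """D = (D1; D2) from Example 1: pairwise differences (D1 b)_j = b_j - b_{j+1},
    stacked over D2 = I_p, so D has shape ((p-1)+p, p)."""
    D1 = np.zeros((p - 1, p))
    for j in range(p - 1):
        D1[j, j], D1[j, j + 1] = 1.0, -1.0
    return np.vstack([D1, np.eye(p)])

def activation_auc(gamma_path, support):
    """Simplified AUC of the activation order along a path: the chance that a
    true coordinate becomes nonzero before a null one (ties count 1/2)."""
    m = len(gamma_path[0])
    first = np.full(m, np.inf)
    for k, g in enumerate(gamma_path):
        newly = (g != 0) & np.isinf(first)
        first[newly] = k
    pos = [first[j] for j in range(m) if j in support]
    neg = [first[j] for j in range(m) if j not in support]
    score = sum(1.0 if a < b else (0.5 if a == b else 0.0)
                for a in pos for b in neg)
    return score / (len(pos) * len(neg))
```

With D = fused_lasso_D(p) this mimics the 1-D fused Lasso setting of Example 1, and with D = np.eye(p) it reduces to the standard Lasso setting.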
Why does Split LBI perform better in model selection than the generalized Lasso? Some limit dynamics of algorithm (1.4) actually shed light on the cause. Below we derive several limit dynamics of Split LBI, which are differential inclusions and lead to explanations of how our algorithm might improve over the generalized Lasso. First of all, note the following Moreau decomposition: for ρ ∈ ∂‖γ‖₁,

    z = ρ + γ/κ  ⟺  γ = κ S(z, 1),  ρ = z − S(z, 1).   (1.5)

By (1.5), the Split LBI (1.4) can be rewritten as

    β_{k+1}/κ = β_k/κ − α ∇_β ℓ(β_k, γ_k),   (1.6a)
    ρ_{k+1} + γ_{k+1}/κ = ρ_k + γ_k/κ − α ∇_γ ℓ(β_k, γ_k),   (1.6b)
    ρ_k ∈ ∂‖γ_k‖₁,   (1.6c)

where ρ_0 = γ_0 = 0 ∈ ℝ^m, β_0 = 0 ∈ ℝ^p. Now, taking ρ(kα) = ρ_k, γ(kα) = γ_k, β(kα) = β_k, and letting α → 0, (1.6) is a forward Euler discretization of the following limit dynamics, called the Split Linearized Bregman Inverse Scale Space (Split LBISS) here.
Definition 1 (Split LBISS). For α → 0, define the following differential inclusion as the limit dynamics of Split LBI,

    β̇(t)/κ = −∇_β ℓ(β(t), γ(t)),   (1.7a)
    ρ̇(t) + γ̇(t)/κ = −∇_γ ℓ(β(t), γ(t)),   (1.7b)
    ρ(t) ∈ ∂‖γ(t)‖₁,   (1.7c)

where ρ(t), β(t), γ(t) are right continuously differentiable, with ρ̇(t), β̇(t), γ̇(t) denoting the right derivatives in t of ρ(t), β(t), γ(t), respectively, and ρ(0) = γ(0) = 0 ∈ ℝ^m, β(0) = 0 ∈ ℝ^p.

Next, taking κ → ∞, we reach the following dynamics, called the Split Inverse Scale Space (Split ISS) in this paper.

Definition 2 (Split ISS).
For κ → ∞ and α → 0, define the differential inclusion

    0 = −∇_β ℓ(β(t), γ(t)),   (1.8a)
    ρ̇(t) = −∇_γ ℓ(β(t), γ(t)),   (1.8b)
    ρ(t) ∈ ∂‖γ(t)‖₁,   (1.8c)

where ρ(t) is right continuously differentiable, β(t), γ(t) are right continuous, and ρ(0) = γ(0) = 0 ∈ ℝ^m, β(0) = 0 ∈ ℝ^p.

Solving for β(t) in (1.8a) and plugging it into (1.8b), (1.8) can be reduced to

    ρ̇(t) = −Σ^{1/2} ( Σ^{1/2} γ(t) − (Σ^{1/2})† D A† X* y ),   (1.9a)
    ρ(t) ∈ ∂‖γ(t)‖₁,   (1.9b)

where Σ and A are given by

    Σ = Σ(ν) := (I − D A† D^T)/ν,  and  A = A(ν) = ν X*X + D^T D.   (1.10)

In fact, by (1.8a) we have β(t) = argmin_β ℓ(β, γ(t)) = A† (ν X* y + D^T γ(t)), where A = ν X*X + D^T D. Substituting this for β(t) in (1.8b), and noting (C.1) (D A† X* = Σ^{1/2} (Σ^{1/2})† D A† X*), we thus get (1.9a).
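The closed form β(t) = A†(νX*y + D^Tγ(t)) above can be sanity-checked numerically, by verifying that it annihilates the gradient of ℓ(·, γ) from (1.3). A small sketch on a random instance (dimensions and seed are our own choices):

```python
import numpy as np

# Check: beta = A^† (nu X^* y + D^T gamma), with A = nu X^* X + D^T D and
# X^* = X^T / n, minimizes ell(., gamma) in (1.3), i.e. its beta-gradient is 0.
rng = np.random.default_rng(2)
n, p, m, nu = 15, 5, 7, 0.5
X = rng.standard_normal((n, p))
D = rng.standard_normal((m, p))
y = rng.standard_normal(n)
gamma = rng.standard_normal(m)
Xs = X.T / n                                   # X^* = X^T / n
A = nu * Xs @ X + D.T @ D
beta = np.linalg.pinv(A) @ (nu * Xs @ y + D.T @ gamma)
grad = Xs @ (X @ beta - y) + D.T @ (D @ beta - gamma) / nu
assert np.linalg.norm(grad) < 1e-8             # stationarity of the minimizer
```

Here X has full column rank almost surely, so A is invertible and the pseudoinverse coincides with the ordinary inverse.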
Remark 1. Note that (1.9) coincides with the differential inclusion proposed in Chapter 8 of Moeller (2012), where it was introduced in a different way. The existence and uniqueness of solutions of Split LBISS and Split ISS will be characterized precisely in Section 3.

Now consider the particular case of the standard Lasso, where D = I and Σ(ν) = X*(I + νXX*)^{−1}X. Hence, as ν → 0, we have Σ(ν) → X*X, and (1.9) leads to the standard Inverse Scale Space (ISS) dynamics studied in Osher et al. (2016), by identifying β = γ.
Proposition 1. Let D = I and ν → 0. Then γ = β, and (1.8) reduces to

    ρ̇(t) = −X*(Xβ(t) − y),   (1.11a)
    ρ(t) ∈ ∂‖β(t)‖₁,   (1.11b)

with the same notations as above.

Model Selection Consistency: Under what conditions does there exist a time τ̄ (or an iteration k̄) such that supp(γ(τ̄)) = S (or supp(γ_k̄) = S), or, more specifically, such that the so-called sign-consistency holds: sign(γ(τ̄)) = sign(γ⋆) (or sign(γ_k̄) = sign(γ⋆), respectively)?

Comparing the reduced Split ISS (1.9) with the ISS (1.11), one can see that Σ(ν) plays a similar role to X*X. For the special case that D = I and ν → 0, Osher et al. (2016) shows that, under nearly the same conditions as for the Lasso, ISS (1.11) achieves model selection consistency, but with the unbiased oracle estimator, which is better than the Lasso. Here an unbiased estimator means that the expectation of the estimator equals the ground truth, and the Lasso is well known to be biased. In fact, under a so-called Irrepresentable Condition (IRR) on X*X, ISS (1.11) is guaranteed to evolve, before the stopping time, on the oracle subspace whose coordinate index set lies within the support set S of the true parameter, i.e. with no false positives. Similarly, the Lasso regularization path also has no false positives under the same condition. Moreover, if the signal is strong enough, the Lasso may pick up an estimator which is sign-consistent yet biased, while the ISS path with an early stopping may reach the oracle estimator, which is both sign-consistent and unbiased.

For the comparison with the generalized Lasso, the Irrepresentable Condition on Σ(ν) will replace that on X*X, where the additional degree of freedom provided by ν gives us a chance to beat the generalized Lasso.

Model selection and estimation consistency of the generalized Lasso (1.2) have been studied in previous work. Sharpnack et al. (2012) considered the model selection consistency of the edge Lasso, with a special D in (1.2), which has applications over graphs. Liu et al. (2013) provides an upper bound on the estimation error by assuming that the design matrix X is a Gaussian random matrix. In particular, Vaiter et al. (2013) proposes a general condition called the Identifiability Criterion (IC) for sign consistency. Lee et al. (2013) establishes a general framework for model selection consistency of penalized M-estimators, proposing an Irrepresentable Condition which is equivalent to the IC of Vaiter et al. (2013) under the specific setting of (1.2). In fact, both of these conditions are sufficient and necessary for structural sparse recovery by the generalized Lasso (1.2) in a certain sense.
In this paper, we shall present a new family of Irrepresentable Conditions depending on Σ(ν), under which model selection consistency can be established for both Split ISS (1.8) and Split LBI (1.4). In particular, this condition family can be strictly weaker than IC as the parameter ν grows, which sheds light on the superb performance of Split LBI observed in the experiment above. Therefore, the benefits of exploiting Split LBI (1.4) not only lie in its algorithmic simplicity, but also in the possibility of a theoretical improvement on model selection consistency.

[Figure 2: Illustration of the global behaviour of the dynamics in this paper.]

Roughly speaking, the global picture of our theoretical development is illustrated in Figure 2:
1. Equipped with the Irrepresentable Condition on Σ(ν), all the dynamics (differential inclusions and the discrete iterations) evolve in a subspace of estimators whose support set lies within that of the true parameter, whence the subspace is called the oracle subspace here;

2. Further enhanced by restricted strong convexity, along the paths of these dynamics the loss decreases rapidly at an exponential speed, first approaching a saddle point near the oracle estimator and then flowing away;

3. Early stopping regularization is designed here to stop the dynamics around the saddle point, picking up an estimator close to the oracle before escaping to overfitted solutions;

4. If the signal is strong enough, such that the true parameters are all of sufficiently large magnitudes, such a good estimator is guaranteed to recover the sparsity pattern of the ground truth.

In the sequel, we are going to elaborate these in a precise way.

1.4. Paper Organization

This paper is a long version of a conference report (Huang et al., 2016), which states the main results about the discrete algorithm (1.4) without proofs, together with part of the experiments.
The full version here is organized as follows. Section 2 presents the Irrepresentable Condition, together with other assumptions for Split ISS and LBI, and shows that it can be strictly weaker than IC, the necessary and sufficient condition for model selection consistency of the generalized Lasso. Some basic properties of the dynamic paths are presented in Section 3, including the existence and uniqueness of differential inclusion solutions, as well as the non-increasing loss along the paths. Section 4 collects the path consistency results for both the differential inclusions and the discrete algorithm. A brief description of the proof ideas for these results is presented in Section 5, with specific details left to the appendices. Section 6 collects three more applications, including image denoising, partial order (group) estimation in sports, and crowdsourced university ranking. The conclusion is given in Section 7. The appendices collect all the remaining proofs in this paper.
Notation. For a matrix Q with m rows (D for example) and J ⊆ {1, 2, ..., m}, let Q_J = Q_{J,·} be the submatrix of Q with rows indexed by J. However, for Q ∈ ℝ^{n×p} (X for example) and J ⊆ {1, 2, ..., p}, let Q_J = Q_{·,J} be the submatrix of Q with columns indexed by J, abusing the notation. P_L denotes the projection matrix onto a linear subspace L. Let L₁ + L₂ := {ξ₁ + ξ₂ : ξ₁ ∈ L₁, ξ₂ ∈ L₂} for subspaces L₁, L₂. For a matrix Q, let Q† denote the Moore-Penrose pseudoinverse of Q, and recall that Q† = (Q^T Q)† Q^T. Let λ_max(Q), λ_min(Q), λ_{min,+}(Q) denote the largest singular value, the smallest singular value, and the smallest nonzero singular value of Q, respectively. For symmetric matrices P and Q, Q ≻ P (or Q ⪰ P) means that Q − P is positive (semi-)definite, respectively. Let Q* := Q^T/n. Sometimes we use ⟨a, b⟩ := a^T b, denoting the inner product between vectors a and b. Also, for tidiness in some situations, we write (Q₁; Q₂) := (Q₁^T, Q₂^T)^T.
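The pseudoinverse identity Q† = (Q^T Q)† Q^T recalled above holds for any real matrix; a quick numerical check on a rank-deficient instance of our own choosing:

```python
import numpy as np

# Numerically verify Q^† = (Q^T Q)^† Q^T on a rank-deficient random matrix.
rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4)) @ np.diag([1.0, 1.0, 1.0, 0.0])  # rank 3
lhs = np.linalg.pinv(Q)
rhs = np.linalg.pinv(Q.T @ Q) @ Q.T
assert np.allclose(lhs, rhs)
```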
2. Assumptions and Comparisons
We need some conventions, definitions and assumptions. For the identifiability of β⋆, we can assume that β⋆ and its estimators of interest are restricted on

    L := (ker(X) ∩ ker(D))⊥ = Im(X^T) + Im(D^T),

since replacing β⋆ with the projection of β⋆ onto L does not change the model. We also have β⋆ ∈ M, where M is the model subspace defined as M := {β : D_{S^c} β = 0}. Note that ℓ(β, γ) is quadratic, and we can define its Hessian matrix

    H = H(ν) := ∇²ℓ(β, γ) ≡ ( X*X + D^T D/ν,  −D^T/ν ;  −D/ν,  I_m/ν )   (2.1)

(sometimes we use the notation H(ν), stressing the dependence on ν). Now we assume that there exist constants λ_D, Λ_D, Λ_X > 0 such that

    min( λ_{min,+}(D), λ_{min,+}(D_{S^c}) ) ≥ λ_D,   (2.2a)
    λ_max(D) ≤ Λ_D,   (2.2b)
    λ_max(X*X) ≤ Λ_X.   (2.2c)

Besides, we consider the following assumption.
Assumption 1 (Restricted Strong Convexity (RSC)). There exists a constant λ > 0 such that

    β^T X*X β ≥ λ‖β‖²₂,  for any β ∈ L ∩ M.   (2.3)

Remark 2. When L = ℝ^p, i.e. there is only one β⋆ satisfying (1.1), Assumption 1 is the same as that proposed by Lee et al. (2013). Specifically, when D = I, Assumption 1 reduces to X*_S X_S ⪰ λI, the usual RSC for the Lasso.

Proposition 2. If there exist C > 0 and ν₀ > 0 such that (2.4) holds for ν = ν₀, where

    (β^T, γ_S^T) · H_{(β,S),(β,S)}(ν) · (β; γ_S) ≥ (C/(1+ν)) ‖(β; γ_S)‖²₂   (β ∈ L, γ_S ∈ ℝ^s),   (2.4)

then (2.3) holds. Conversely, if (2.3) holds, then there exists C > 0 such that (2.4) holds for all ν > 0.

Remark 3. The traditional RSC for the partial Lasso min_{β,γ} ( ℓ(β, γ) + λ‖γ‖₁ ) requires ℓ to be restricted strongly convex, i.e. strongly convex when restricted on N := L ⊕ ℝ^s ⊕ {0}^{m−s} (which is the sparse subspace corresponding to the support of γ⋆). Proposition 2 implies that Assumption 1 is necessary for ℓ to be restricted strongly convex for a specific ν > 0 (note that ℓ depends on ν), and also sufficient for ℓ to be restricted strongly convex for all ν > 0.

Remark 4. Let us further note the rate C/(1+ν) in (2.4). When ν → 0, it approaches C, a constant independent of ν. When ν → +∞, the rate C/(1+ν) ∼ ν^{−1} is the best possible, since ‖H‖₂ ≲ ν^{−1} by (C.3).
Assumption 2 (Irrepresentable Condition(ν) (IRR(ν))). There exists a constant η ∈ (0, 1] such that

    sup_{ρ_S ∈ [−1,1]^s} ‖ H_{S^c,(β,S)}(ν) H_{(β,S),(β,S)}(ν)† · (0_p; ρ_S) ‖_∞ ≤ 1 − η.   (2.5)

Remark 5. Assumption 2 actually concerns a family of assumptions with varying ν. However, in practice we only require that IRR(ν) holds for the specific ν used in the Split LBI algorithm.

Remark 6. Assumption 2 directly generalizes the Irrepresentable Condition from the standard Lasso (Zhao and Yu, 2006) and OMP/BP (Tropp, 2004) to the partial Lasso min_{β,γ} ( ℓ(β, γ) + λ‖γ‖₁ ). Conditions of this type were firstly proposed by Tropp (2004) for Orthogonal Matching Pursuit (OMP) and Basis Pursuit (BP) in the noise-free case, under the name Exact Recovery Condition; later, Cai and Wang (2011) extended it to OMP with noisy measurements; Zhao and Yu (2006) established it for model selection consistency of the Lasso under Gaussian noise, while Wainwright (2009) extended it to the sub-Gaussian case; Yuan and Lin (2007) and Zou (2006) also independently presented this condition in other studies. Here, following the standard Lasso case (Wainwright, 2009), one version of the Irrepresentable Condition would be

    ‖ H_{S^c,(β,S)}(ν) H_{(β,S),(β,S)}(ν)† · ρ⋆_{(β,S)} ‖_∞ ≤ 1 − η,  where ρ⋆_{(β,S)} = (0_p; ρ⋆_S).

Here ρ⋆_{(β,S)} is the value of the gradient (subgradient) of the ℓ₁ penalty ‖·‖₁ at (β⋆; γ⋆_S), and ρ⋆_β = 0_p because β is not assumed to be sparse and hence is not penalized. Assumption 2 slightly strengthens this by a supremum over ρ_S, for uniform sparse recovery independent of a particular sign pattern of γ⋆.

2.2. Some Equivalent Assumptions

Recall that, in order to obtain the path consistency results of the standard LBISS and LBI in Osher et al. (2016), one proposes Restricted Strong Convexity (RSC) and the Irrepresentable Condition (IRR) based on X*X, and these assumptions are actually the same as those for the Lasso. In contrast, for Split LBISS and Split LBI, we can propose assumptions based on Σ(ν), i.e. that Σ_{S,S}(ν) is positive definite, and that ‖Σ_{S^c,S}(ν) Σ_{S,S}(ν)^{−1}‖_∞ ≤ 1 − η. These assumptions actually prove to be equivalent to Assumptions 1 and 2, as follows.
Proposition 3. If there exist C > 0 and ν₀ > 0 such that (2.4) holds for ν = ν₀, then there exists C′ > 0 such that for all ν > 0,

    Σ_{S,S}(ν) ⪰ (C′/(1+ν)) I.   (2.6)

Conversely, if there exist C′ > 0 and ν₀ > 0 such that (2.6) holds for ν = ν₀, then there exists C > 0 such that for all ν > 0, (2.4) holds.

Proposition 4. Under Assumption 1, the left hand side of (2.5) in Assumption 2 becomes ‖Σ_{S^c,S}(ν) Σ_{S,S}(ν)^{−1}‖_∞, and (2.5) is equivalent to

    ‖ Σ_{S^c,S}(ν) Σ_{S,S}(ν)^{−1} ‖_∞ ≤ 1 − η.   (2.7)
Remark 7. From Propositions 2 and 4, Σ seems to be closely related to H, which is truly the case. In fact, Σ is the Schur complement of H_{β,β} in H.

We now present a comparison theorem showing that IRR(ν) can be weaker than IC, a necessary and sufficient condition for model selection consistency of the generalized Lasso (Vaiter et al., 2013). Define irr(ν) as the left hand side of (2.5) (or equivalently the left hand side of (2.7), due to Proposition 4), and

    irr(0) := lim_{ν→0} irr(ν),  irr(∞) := lim_{ν→+∞} irr(ν).

Let W be a matrix whose columns form an orthogonal basis of ker(D_{S^c}), and define

    Ω_S := (D†_{S^c})^T ( X*X W (W^T X*X W)† W^T − I ) D_S^T,
    ic₁ := ‖Ω_S‖_∞,
    ic₀ := min_{u ∈ ker(D^T_{S^c})} ‖ Ω_S sign(D_S β⋆) − u ‖_∞.

[Figure 3: A comparison between our family of Irrepresentable Conditions (IRR(ν)) and IC in Vaiter et al. (2013), with log-scale horizontal axis. As ν grows, irr(ν) can be significantly smaller than ic₀ and ic₁, so that our model selection condition is easier to meet.]

Vaiter et al. (2013) proved the sign consistency of the generalized Lasso estimator of (1.2), for a specifically chosen λ, under the assumption ic₀ < 1. As we shall see later, the same conclusion holds for our algorithm under the assumption irr(ν) ≤ 1 − η. Which assumption is easier to satisfy? The following theorem, with proof in Appendix E, answers this.
Theorem 1 (Comparisons between IRR(ν) and IC).

1. ic₁ ≥ ic₀.

2. irr(0) exists, and irr(0) = ic₁.

3. irr(∞) exists, and irr(∞) = 0 if and only if ker(X) ⊆ ker(D_S).

From this comparison theorem, with a design matrix X of full column rank, irr(ν) < ic₀ ≤ ic₁ as ν grows, hence Assumption 2 is weaker than IC. Now recall the setting of Example 1, where ker(X) = 0 generically. In Figure 3, the (solid and dashed) horizontal red lines denote ic₀ and ic₁, and the blue curve denoting irr(ν) approaches ic₁ when ν → 0 and 0 when ν → +∞, which illustrates Theorem 1 (here each of ic₀, ic₁, irr(ν) is the mean of 100 values calculated under 100 generated X's). Although irr(0) = ic₁ is slightly larger than ic₀, irr(ν) can be significantly smaller than both if ν is not tiny. On the right side of the vertical line, irr(ν) drops below 1, indicating that Assumption 2 is satisfied while IC fails.
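Under Assumption 1, irr(ν) can be computed directly from Proposition 4 and the definition (1.10) of Σ(ν). The following helper (our own sketch; S is given as a list of row indices of D) does so numerically:

```python
import numpy as np

def irr(X, D, S, nu):
    """Sketch of irr(nu) = ||Sigma_{S^c,S}(nu) Sigma_{S,S}(nu)^{-1}||_inf,
    with Sigma(nu) = (I - D A(nu)^† D^T)/nu, A(nu) = nu X^*X + D^T D,
    and X^* = X^T / n, as in (1.10) and Proposition 4."""
    n = X.shape[0]
    m = D.shape[0]
    A = nu * (X.T @ X) / n + D.T @ D
    Sigma = (np.eye(m) - D @ np.linalg.pinv(A) @ D.T) / nu
    Sc = [j for j in range(m) if j not in S]
    M = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
    return np.abs(M).sum(axis=1).max()   # matrix infinity-norm
```

On a design with orthonormal columns and D = I, the off-diagonal blocks of Σ(ν) vanish, so irr(ν) = 0 and the Irrepresentable Condition holds trivially; for generic designs one can sweep ν over a grid to reproduce the qualitative behaviour in Figure 3.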
Remark 8. Although Theorem 1 suggests adopting a large ν, ν cannot be arbitrarily large, for otherwise C/(1+ν) in (2.4) becomes small and ℓ becomes "flat", which deteriorates the estimator in terms of the ℓ₂ error bounds shown later.
3. Basic Properties of Paths
The following theorem establishes the existence as well as the uniqueness of solutions of Split ISS and Split LBISS, in almost the same way as Osher et al. (2016). The proof is given in Appendix C.
Theorem 2 (Existence and uniqueness of solutions).

1. As for Split ISS (1.8), assume that ρ(t) is right continuously differentiable and β(t), γ(t) are right continuous. Then a solution exists for t ≥ 0, with piecewise linear ρ(t) and piecewise constant β(t), γ(t). Besides, ρ(t) is unique. If additionally Σ_{S(t),S(t)} ≻ 0 for 0 ≤ t ≤ τ, where Σ is defined in (1.10) and S(t) := supp(γ(t)), then β(t), γ(t) are unique for 0 ≤ t ≤ τ.

2. As for Split LBISS (1.7), assume that ρ(t), β(t) are right continuously differentiable. Then there is a unique solution for t ≥ 0.

The following theorem states that, along the solution path of either the differential inclusions or the iterative algorithm, the loss function is always non-increasing. Its proof is provided in Appendix C.
Theorem 3 (Non-increasing loss along the paths). Consider the loss function ℓ defined in (1.3).

1. For a solution (ρ(t), β(t), γ(t)) of Split ISS (1.8), ℓ(β(t), γ(t)) is non-increasing in t.

2. For a solution (ρ(t), β(t), γ(t)) of Split LBISS (1.7), ℓ(β(t), γ(t)) is non-increasing in t.

3. For a solution (ρ_k, β_k, γ_k) of Split LBI (1.6), ℓ(β_k, γ_k) is non-increasing in k if

    κα‖H‖₂ ≤ 2.   (3.1)

Moreover, ‖H‖₂ ≤ (1 + νΛ_X + Λ_D²)/ν holds, so (3.1) holds if

    κα ≤ 2ν/(1 + νΛ_X + Λ_D²).   (3.2)

4. Path Consistency of Split LBISS and Split LBI

The following theorem, with proof in Appendix G, says that under Assumptions 1 and 2, Split LBISS will automatically evolve in the "oracle" subspace (unknown to us), restricted within the support set of (β⋆, γ⋆), before leaving it; and if the signal parameters are strong enough, sign consistency will be reached. Moreover, ℓ₂ error bounds on γ(t) and β(t) are given.
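The operator-norm bound ‖H‖₂ ≤ (1 + νΛ_X + Λ_D²)/ν from Theorem 3 can be checked numerically by assembling H from (2.1); the dimensions and seed below are arbitrary choices of ours:

```python
import numpy as np

# Verify ||H||_2 <= (1 + nu*Lambda_X + Lambda_D^2)/nu on a random instance,
# with H the Hessian (2.1) of the split loss (1.3).
rng = np.random.default_rng(1)
n, p, m, nu = 12, 6, 9, 0.7
X = rng.standard_normal((n, p))
D = rng.standard_normal((m, p))
Xs = X.T / n                                  # X^* = X^T / n
H = np.block([[Xs @ X + D.T @ D / nu, -D.T / nu],
              [-D / nu,               np.eye(m) / nu]])
Lam_X = np.linalg.norm(Xs @ X, 2)
Lam_D2 = np.linalg.norm(D, 2) ** 2
assert np.linalg.norm(H, 2) <= (1 + nu * Lam_X + Lam_D2) / nu + 1e-9
```

The bound follows since the ν-dependent part of H equals B^T B/ν with B = (D, −I), whose squared spectral norm is at most 1 + Λ_D².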
Theorem 4 (Consistency of Split LBISS). Under Assumptions 1 and 2, define λ_H = C/(1+ν) (from (2.4)), and suppose that κ is large enough that

    κ ≥ (1/η) · ((λ_D + Λ_X)/(λ λ_D)) · √((1 + νΛ_X + Λ_D²)/(λ_H ν)) · ( (1 + Λ_D)‖β⋆‖₂ + (2σ/λ_H) · ( Λ_X/λ_D + (Λ_X + λ_H λ_D)/(λ λ_D) ) ).   (4.1)

Let

    τ̄ := (η/σ) · (λ_D/Λ_X) · √(n / log m).   (4.2)

Then, with probability not less than 1 − c₁/m − c₂ e^{−n/c₃} (for universal constants c₁, c₂, c₃ > 0), we have all the following properties.
1. No-false-positive: The solution has no false positive, i.e. supp(γ(t)) ⊆ S, for 0 ≤ t ≤ τ̄.
2. Sign consistency of γ(t): Once the signal is strong enough, in the sense that

    γ⋆_min := (D_S β⋆)_min ≥ (σ/(η λ_H)) · (Λ_X Λ_D/λ_D) · (2 log s + 5 + log(8Λ_D)) · √(log m / n),   (4.3)

then γ(t) has sign consistency at τ̄, i.e. sign(γ(τ̄)) = sign(Dβ⋆).

3. ℓ₂ consistency of γ(t):

    ‖γ(τ̄) − Dβ⋆‖₂ ≤ (σ/(η λ_H)) · (Λ_X/λ_D) · √(s log m / n).

4. ℓ₂ "consistency" of β(t):

    ‖β(τ̄) − β⋆‖₂ ≤ (σ/(η λ_H)) · ((Λ_X(1 + λ_D) + Λ_X)/(λ λ_D)) · √(s log m / n) + (2σ/λ) · √(r′ log m / n) + ν σ (Λ_X + Λ_X/λ_D)/λ,

where

    r′ = dim({Xβ : β ∈ ker(D)}),   (4.4)

which is very small in most cases.

Despite the fact that the sign consistency of γ(t) can be established here, usually one cannot expect Dβ(t) to recover the sparsity pattern of γ⋆, due to the variable splitting. As shown in the last term of the ℓ₂ error bound for β(t), increasing ν will sacrifice its accuracy, since to achieve the minimax optimal ℓ₂ error rate one needs ν = O(√((s log m)/n)). However, one can remedy this by projecting β(t) onto a subspace using the support set of γ(t), and obtain a good estimator β̃(t) with both sign consistency and ℓ₂ consistency at the minimax optimal rates. This leads to the following theorem.
Theorem 5 (Consistency of revised Split LBISS). Under Assumptions 1 and 2, define λ_H = C/(1+ν) (from (2.4)) and suppose that κ satisfies (4.1). Define τ̄ the same as in Theorem 4, and define

    S(t) := supp(γ(t)),  P_{S(t)} := P_{ker(D_{S(t)^c})} = I − D†_{S(t)^c} D_{S(t)^c},  β̃(t) := P_{S(t)} β(t).

If S(t)^c = ∅, define P_{S(t)} = I. Then we have the following properties.

1. Sign consistency of β̃(t): Once (4.3) holds, then with probability not less than 1 − c₁/m − c₂ e^{−n/c₃} (for universal constants c₁, c₂, c₃ > 0), there holds sign(D β̃(τ̄)) = sign(Dβ⋆).

2. ℓ₂ consistency of β̃(t): With probability not less than 1 − c₁/m − c₂ r′/m − c₃ e^{−n/c₄}, we have

    ‖β̃(τ̄) − β⋆‖₂ ≤ (σ/(η λ_H)) · (Λ_X(Λ_D + λ_D)/λ_D) · √(s log m / n) + (2σ/λ_H) · (Λ_X/λ_D + λ_H/λ_D + Λ_X/(λ λ_D)) · √(r′ log m / n) + 2 ‖D†_{S(τ̄)^c} D_{S(τ̄)^c ∩ S} β⋆‖₂,

where r′ is defined in (4.4). If additionally S(τ̄) ⊇ S, then the last term on the right hand side drops.

Remark 9. In most cases r′ is very small, so the dominant ℓ₂ error rate is O(√((s log m)/n)) (as long as ν is upper bounded by a constant), which is minimax optimal (Lee et al., 2013; Liu et al., 2013).

Based on the theorems on consistency of Split LBISS, one can naturally derive similar results for Split LBI with large κ and small α.
Under Assumptions 1 and 2, define λ_H = C/(1 + ν) (from (2.4)). Suppose that κ is large and α is small, so that

κα‖H‖₂ < 2,   (4.5)

κ satisfies (4.1) with λ_H replaced by λ′_H := λ_H(1 − κα‖H‖₂/2) > 0, and α < ¯τ := (ηλ_D)/(σΛ_X) · √(n / log m). Let ¯k := ⌊¯τ/α⌋. Then with probability at least 1 − c₁/m − c₂e^{−c₃n}, we have all the following properties.
1. No-false-positive: The solution path has no false positives, i.e. supp(γ^k) ⊆ S, for all k with 0 ≤ kα ≤ ¯τ.
2. Sign consistency of γ^k: Once the signal is strong enough such that

γ⋆_min := (D_S β⋆)_min ≥ (σ/(ηλ′_H(1 − α/¯τ))) · (Λ_X Λ_D/λ_D) · (2 log s + 5 + log(8Λ_D)) · √(log m / n),   (4.6)

then γ^k has sign consistency at ¯k, i.e. sign(γ^{¯k}) = sign(Dβ⋆).

3. ℓ₂ consistency of γ^k:

‖γ^{¯k} − Dβ⋆‖₂ ≤ (σ/(ηλ′_H(1 − α/¯τ))) · (Λ_X/λ_D) · √(s log m / n).

4. ℓ₂ "consistency" of β^k:

‖β^{¯k} − β⋆‖₂ ≤ (σ/(ηλ′_H(1 − α/¯τ))) · ((λ₂Λ_X(1 + λ_D) + Λ_X)/(λ₂λ_D)) · √(s log m / n) + (2σ/λ₂) · √(r′ log m / n) + ν · σ · (λ₂Λ_X + Λ_X)/(λ₂λ_D),

where r′ is defined in (4.4).

Similarly to Theorem 5, one can revise β^k such that D˜β^k is sparse and the corresponding ℓ₂ error bound is improved.

Theorem 7 (Consistency of revised Split LBI).
Under Assumptions 1 and 2, define λ_H = C/(1 + ν) (from (2.4)). Suppose that κ, α satisfy the same conditions as in Theorem 6, and that λ′_H, ¯τ are defined as in Theorem 6. Define

S_k := supp(γ^k),  P_{S_k} := P_{ker(D_{S_k^c})} = I − D†_{S_k^c} D_{S_k^c},  ˜β^k := P_{S_k} β^k.

If S_k^c = ∅, define P_{S_k} = I. Then we have the following properties.
1. Sign consistency of ˜β^k: Once (4.6) holds, then with probability at least 1 − c₁/m − c₂e^{−c₃n}, there holds sign(D˜β^{¯k}) = sign(Dβ⋆).

2. ℓ₂ consistency of ˜β^k: With probability at least 1 − (c₁ + r′)/m − c₂e^{−c₃n}, we have

‖˜β^{¯k} − β⋆‖₂ ≤ (σ/(ηλ′_H(1 − α/¯τ))) · (Λ_X(Λ_D + λ_D)/λ_D) · √(s log m / n) + 2σλ′_H · ((Λ_Xλ_D + λ′_Hλ_D + Λ_X)/(λ₂λ_D)) · √(r′ log m / n) + 2‖D†_{S_{¯k}^c} D_{S_{¯k}^c ∩ S} β⋆‖₂,

where r′ is defined in (4.4). If additionally S_{¯k} ⊇ S, then the last term on the right hand side drops.
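The iteration behind Theorems 6–7 can be sketched in a few lines of NumPy. The following is an illustrative sketch, not the released Matlab implementation: it uses the splitted loss ℓ(β, γ) = ‖y − Xβ‖²/(2n) + ‖γ − Dβ‖²/(2ν), the soft-thresholding map γ = κ·S(z, 1) from the Linearized Bregman Iteration, the step-size rule quoted in Section 6 (reading Λ_X² = ‖X‖₂²/n and Λ_D = ‖D‖₂, our assumption), and the projection ˜β^k = P_{S_k} β^k of Theorem 7. All names are ours.

```python
import numpy as np

def soft_threshold(z, lam=1.0):
    """Soft-thresholding S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def split_lbi(X, y, D, nu=1.0, kappa=100.0, n_steps=2000):
    """Sketch of the Split LBI iteration; returns the (beta, gamma, tilde_beta) path."""
    n, p = X.shape
    m = D.shape[0]
    # Step-size rule from Section 6: alpha = nu / (kappa (1 + nu Lambda_X^2 + Lambda_D^2)),
    # which keeps kappa * alpha * ||H|| below 2 (condition (4.5)).
    Lx2 = np.linalg.norm(X, 2) ** 2 / n
    Ld2 = np.linalg.norm(D, 2) ** 2
    alpha = nu / (kappa * (1 + nu * Lx2 + Ld2))
    beta, z = np.zeros(p), np.zeros(m)
    path = []
    for _ in range(n_steps):
        gamma = kappa * soft_threshold(z)
        # Gradients of the splitted loss l(beta, gamma)
        grad_beta = X.T @ (X @ beta - y) / n + D.T @ (D @ beta - gamma) / nu
        grad_z = (gamma - D @ beta) / nu
        beta = beta - kappa * alpha * grad_beta
        z = z - alpha * grad_z
        gamma = kappa * soft_threshold(z)
        # Revised estimator of Theorem 7: project beta onto ker(D_{S_k^c})
        Sc = np.abs(gamma) == 0
        if Sc.any():
            tilde_beta = beta - np.linalg.pinv(D[Sc]) @ (D[Sc] @ beta)
        else:
            tilde_beta = beta.copy()
        path.append((beta.copy(), gamma.copy(), tilde_beta))
    return path
```

Early along the path γ^k = 0, so ˜β^k is the projection of β^k onto ker(D); as t = kα grows, supports enter γ^k and ˜β^k becomes piecewise constant under D, while the loss is non-increasing (Theorem 3).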
5. Proof Ideas for SLBISS Path Consistency Theorems
Proof sketch of Theorem 4.
The Split LBISS dynamics always start within the oracle subspace (γ_{S^c}(t) = 0), and by Lemma 7 we prove that, under the Irrepresentable Condition, the exit time from the oracle subspace is no earlier than some ¯τ ≳ √(n/log m) (i.e. the no-false-positive condition holds before ¯τ), with high probability.

Before ¯τ, the dynamics follow the identical path of the following oracle dynamics of Split LBISS restricted to the oracle subspace:

ρ′_{S^c}(t) = γ′_{S^c}(t) ≡ 0,   (5.1a)
β̇′(t)/κ = −X*(Xβ′(t) − y) − D^T(Dβ′(t) − γ′(t))/ν,   (5.1b)
ρ̇′_S(t) + γ̇′_S(t)/κ = −(γ′_S(t) − D_S β′(t))/ν,   (5.1c)
ρ′_S(t) ∈ ∂‖γ′_S(t)‖₁,   (5.1d)

where ρ′_S(0) = γ′_S(0) = 0 ∈ R^s and β′(0) = 0 ∈ R^p. Theorem 3 shows that the loss is always dropping along the paths. Hence, to monitor the distance of an estimator to the oracle estimator

(β^o, γ^o) ∈ argmin_{β,γ: β∈L, γ_{S^c}=0} ℓ(β, γ) ⊆ argmin_{β,γ: γ_{S^c}=0} ℓ(β, γ),   (5.2)

which is an optimal estimate of the true parameter (β⋆, γ⋆) (with error bounds in Lemma 8), we define a potential function

Ψ(t) := D_{ρ′_S(t)}(γ^o_S, γ′_S(t)) + d(t)²/(2κ),

where

d_β(t) := β′(t) − β^o,  d_γ(t) := γ′(t) − γ^o,  d(t) := √(‖d_{γ,S}(t)‖² + ‖d_β(t)‖²),   (5.3)

and the Bregman distance

D_{ρ′_S(t)}(γ^o_S, γ′_S(t)) := ‖γ^o_S‖₁ − ‖γ′_S(t)‖₁ − ⟨γ^o_S − γ′_S(t), ρ′_S(t)⟩ = ‖γ^o_S‖₁ − ⟨γ^o_S, ρ′_S(t)⟩.
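The oracle estimator in (5.2) can be computed directly: for fixed β the optimal γ with γ_{S^c} = 0 is γ_S = D_S β, so β^o minimizes ‖y − Xβ‖²/(2n) + ‖D_{S^c}β‖²/(2ν), a ridge-like least squares problem. A minimal NumPy sketch of this reduction (our own illustration; the true support S is assumed known; the min-norm `lstsq` solution lies in Im(X^T) + Im(D_{S^c}^T) ⊆ L, so the β ∈ L constraint is automatic):

```python
import numpy as np

def oracle_estimator(X, y, D, S, nu=1.0):
    """Oracle estimator (5.2): minimize l(beta, gamma) subject to gamma_{S^c} = 0.

    Eliminating gamma (gamma_S = D_S beta at the optimum) leaves the
    least squares problem  min_beta ||y - X beta||^2/(2n) + ||D_{S^c} beta||^2/(2 nu).
    """
    n = X.shape[0]
    Sc = np.setdiff1d(np.arange(D.shape[0]), S)
    # Stack the two quadratic terms into one least squares system.
    A = np.vstack([X / np.sqrt(n), D[Sc] / np.sqrt(nu)])
    b = np.concatenate([y / np.sqrt(n), np.zeros(len(Sc))])
    beta_o, *_ = np.linalg.lstsq(A, b, rcond=None)
    gamma_o = np.zeros(D.shape[0])
    gamma_o[S] = D[S] @ beta_o
    return beta_o, gamma_o
```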
Equipped with this potential function, the original differential inclusion is reduced to the following differential inequality, called the generalized Bihari inequality (Lemma 1), whose proof is given in Appendix F.
Lemma 1 (Generalized Bihari's inequality). Under Assumption 1, for all t ≥ 0 we have

dΨ(t)/dt ≤ −λ_H F⁻¹(Ψ(t)),

where γ^o_min := min{|γ^o_j| : γ^o_j ≠ 0},

F(x) := { x/(2κ), for 0 ≤ x < (γ^o_min)²;  x/γ^o_min, for (γ^o_min)² ≤ x < s(γ^o_min)²;  √(sx), for x ≥ s(γ^o_min)² },

and F⁻¹(x) := inf{y ≥ 0 : F(y) ≥ x}.

(The inclusion on the right hand side of (5.2) is based on ℓ(P_L β^o, γ^o) = ℓ(β^o, γ^o).)

This inequality, combined with the restricted strong convexity (RSC), leads to an exponential decrease of the potential above, enforcing convergence to the oracle estimator. Then we can show that, as long as the signal is strong enough, with all the magnitudes of the entries of γ⋆ large enough (≳ (log s)√((log m)/n)), the dynamics stopped at ¯τ exactly select all nonzero entries of γ^o ((F.8) in Lemma 6), hence also of γ⋆, with high probability, achieving sign consistency.

Even without the strong signal condition, RSC still shows that the dynamics at ¯τ return a good estimator of γ^o ((F.9) in Lemma 6), hence also of γ⋆, with an ℓ₂ error ≲ √((s log m)/n) (the minimax optimal rate) with high probability. Combining the ℓ₂ bounds on β′(t) − β^o (from (F.9) in Lemma 6) and β^o − β⋆ (Lemma 8), we obtain the ℓ₂ bound on β′(t) − β⋆ at ¯τ, similarly at the minimax optimal rate.

A detailed proof of Theorem 4 can be found in Appendix G. □

Remark 10.
It is an interesting open problem how to relax the Irrepresentable Condition so as to achieve a minimax optimal estimator under weaker conditions, such as the restricted eigenvalue condition of Bickel et al. (2009).
Proof sketch of Theorem 5.
By Theorem 4, the exit time from the oracle subspace is no earlier than some ¯τ ≳ √(n/log m), i.e. the no-false-positive condition holds before ¯τ, or equivalently S(t) ⊆ S for t ≤ ¯τ, with high probability. The definition of ˜β(t) enforces D_{S^c}˜β(¯τ) = 0 = D_{S^c}β⋆. Using the error bounds on β′(t) − β^o (from (F.8) in Lemma 6) and β^o − β⋆ (Lemma 8), we obtain

‖D_S ˜β(¯τ) − D_S β⋆‖_∞ < γ⋆_min = (D_S β⋆)_min  ⟹  sign(D_S ˜β(¯τ)) = sign(D_S β⋆),

as long as the magnitudes of the entries of γ⋆ are all large enough, achieving sign consistency. We can also obtain the ℓ₂ bound on ˜β(t) − β⋆.

A detailed proof of Theorem 5 can be found in Appendix G. □
6. Experiments
In this section, we show three applications of the algorithm proposed in this paper. The first application is traditional image denoising using TV regularization, or fused Lasso. The remaining two are new applications in partial order ranking: the second is ranking basketball teams into a partial order, and the third is grouping world universities from crowdsourced ranking data. For reproducible research, Matlab source codes are released at the following website: https://github.com/yuany-pku/split-lbi.

The parameter κ should be large enough according to (4.1). Moreover, the step size α should be small enough to ensure the stability of Split LBI. Once ν and κ are determined, α can be set as α = ν/(κ(1 + νΛ_X² + Λ_D²)) (see (3.2)).

Consider the image denoising problem in Tibshirani and Taylor (2011).
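The step-size rule above can be made concrete. The sketch below (our own illustration, not the released Matlab code) builds the gradient operator D_G of a 4-nearest-neighbor grid graph, as used for TV denoising, and computes α = ν/(κ(1 + νΛ_X² + Λ_D²)), reading Λ_X² = ‖X‖₂²/n and Λ_D = ‖D‖₂ (our assumption about the norms).

```python
import numpy as np

def grid_gradient(h, w):
    """Gradient operator D_G of the 4-nearest-neighbor grid graph on an h x w image:
    one row per edge e_ij, with (D_G x)(e_ij) = x_i - x_j."""
    def idx(r, c):
        return r * w + c
    rows = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:   # horizontal edge
                e = np.zeros(h * w); e[idx(r, c)] = 1.0; e[idx(r, c + 1)] = -1.0
                rows.append(e)
            if r + 1 < h:   # vertical edge
                e = np.zeros(h * w); e[idx(r, c)] = 1.0; e[idx(r + 1, c)] = -1.0
                rows.append(e)
    return np.array(rows)

def step_size(X, D, nu, kappa, n):
    """alpha = nu / (kappa (1 + nu Lambda_X^2 + Lambda_D^2))."""
    Lx2 = np.linalg.norm(X, 2) ** 2 / n
    Ld2 = np.linalg.norm(D, 2) ** 2
    return nu / (kappa * (1 + nu * Lx2 + Ld2))
```

For a 50 × 50 image, D_G has 2·50·49 = 4900 rows; since the grid graph has maximum degree 4, ‖D_G‖₂² = λ_max(D_G^T D_G) is at most twice the maximum degree, i.e. at most 8, so α is easy to bound in advance.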
The original image is resized to 50 × 50 and reset to only four colors, as in the top left image of Figure 4. Noise is then added by randomly changing some pixels to white, as in the bottom left. Let G = (V, E) be the 4-nearest-neighbor grid graph on pixels; then β = (β_R; β_G; β_B) ∈ R^{3|V|}, since there are 3 color (RGB) channels. Set X = I_{3|V|} and D = diag(D_G, D_G, D_G), where D_G ∈ R^{|E|×|V|} is the gradient operator on the graph G, defined by (D_G x)(e_{ij}) = x_i − x_j for e_{ij} ∈ E. Set ν = 180, κ = 100. The regularization path of Split LBI is shown in Figure 4: as t evolves, images on the path gradually select visually salient features before picking up the random noise.

Now compare the AUC (Area Under the Curve) of genlasso and of the Split LBI algorithm with different ν. For simplicity we show the AUC corresponding to the red color channel, for ν ranging over a grid of increasing values. As shown in the right panel of Figure 4, as ν increases, Split LBI beats genlasso with higher AUC values.

(Figure 4 panels: original image; noisy image; path snapshots at t = 9.3798, 23.7812, 60.5532, 617.1275.)

Here we consider a new application: ranking p = 12 FIBA basketball teams into partial orders. The teams are listed in Figure 5. We collected n = 134 pairwise comparison game results, mainly from important championships such as the Olympic Games, the FIBA World Championship, and the FIBA continental championships on 5 continents, from 2006 to 2014 (8 years is not too long for teams to keep relatively stable levels, yet not too short to collect enough samples). For each sample indexed by k and corresponding
Figure 4: Left: image denoising results by Split LBI. Right: the AUC of Split LBI (blue solid line) increases and exceeds that of genlasso (dashed red line) as ν increases.

Figure 5: Partial order ranking for basketball teams. Top left shows {β_λ} (t = 1/λ) by genlasso and ˜β^k (t = kα) by Split LBI. Top right shows the same grouping result just passing t₁. Bottom is the FIBA ranking of all teams.

team pair (i_k, j_k), y_k = s_{i_k} − s_{j_k} is the score difference between teams i_k and j_k. We assume the model y_k = β⋆_{i_k} − β⋆_{j_k} + ε_k, where β⋆ ∈ R^p measures the strength of the teams. So the design matrix X ∈ R^{n×p} is defined by its k-th row: x_{k,i_k} = 1, x_{k,j_k} = −1, and x_{k,l} = 0 for l ≠ i_k, j_k. In sports, teams with similar strength generally meet more often than those at different levels. Thus we hope to find a coarse-grained partial order ranking by imposing structural sparsity on Dβ⋆, where D = cX (c scales the smallest nonzero singular value of D to 1).

The top left panel of Figure 5 shows {β_λ} by genlasso and ˜β^k by Split LBI with ν = 1 and κ = 100. Both paths give the same partial order at early stages, though the Split LBI path looks qualitatively better. For example, the top right panel shows the same partial order after the change point t₁. It is interesting to compare it against the FIBA ranking in September
of that year. The USA itself forms a group, agreeing with the common sense that it is much better than any other team. Spain has much higher FIBA ranking points (705.0) than the 3rd team Argentina (about 455).

Crowdsourcing techniques have recently been used to rank universities by Internet voters, e.g.
CrowdRank. In the following, a crowdsourcing experiment was conducted to rank p = 261 universities in the world on an online platform. The majority of the participants are undergraduates or alumni of Peking University, mostly majoring in applied mathematics and statistics, while some have engineering backgrounds. Voters are widely distributed around the world, with one fifth of them from Beijing; see Figure 6. Every voter is presented with a randomly chosen pair of universities and asked the question "Which university would you rather attend?". The voter may then choose either of the two universities, or simply "I can't decide". Our collection consists of about eight thousand votes. To make our result more robust, we remove some indecisive votes and outliers using the technique from Xu et al. (2014), and are left with n = 6,125 paired comparison samples in the cleaned dataset for the study in this paper.

Figure 6: The map of voter distribution.

For each sample indexed by k and corresponding university pair (i_k, j_k): if the voter considers i_k to be better than j_k, then y_k = 1, otherwise y_k = −1. We assume the model y_k = β⋆_{i_k} − β⋆_{j_k} + ε_k, where β⋆ ∈ R^p measures the strength of the universities. So the design matrix X ∈ R^{n×p} is defined by its k-th row: x_{k,i_k} = 1, x_{k,j_k} = −1, and x_{k,l} = 0 for l ≠ i_k, j_k. D is taken to be the total variation matrix of the complete graph, i.e. ‖Dβ‖₁ = Σ_{i<j} |β_i − β_j|.

The 2nd group of universities is listed in Table 3. It includes top universities in Asia (Peking University, Tsinghua University, University of Tokyo, University of Hong Kong, and Hong Kong University of Science and Technology), Europe (Swiss Federal Institute of Technology/ETH, Imperial College London, University College London, and London School of Economics and Political Science), and North America. A surprising result is that MIT is listed in this second group, while most authoritative ranking systems clearly place it in the first tier. This phenomenon is probably due to sampling bias in our crowdsourcing experiment: a large portion of the voters are statistics majors, and MIT does not have a statistics department or program. Hence such voters tend not to choose MIT when considering graduate programs.
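Both ranking examples share the same construction: a paired-comparison design matrix X with x_{k,i_k} = 1, x_{k,j_k} = −1, and a penalty matrix D, which for the universities is the complete-graph total variation matrix so that ‖Dβ‖₁ = Σ_{i<j}|β_i − β_j|. A sketch of both constructions (our own code; pairs are 0-indexed):

```python
import numpy as np
from itertools import combinations

def comparison_design(pairs, p):
    """Design matrix X for paired comparisons: row k has x[k, i_k] = 1, x[k, j_k] = -1."""
    X = np.zeros((len(pairs), p))
    for k, (i, j) in enumerate(pairs):
        X[k, i] = 1.0
        X[k, j] = -1.0
    return X

def complete_graph_tv(p):
    """Total variation matrix of the complete graph: ||D beta||_1 = sum_{i<j} |beta_i - beta_j|."""
    edges = list(combinations(range(p), 2))
    D = np.zeros((len(edges), p))
    for k, (i, j) in enumerate(edges):
        D[k, i] = 1.0
        D[k, j] = -1.0
    return D
```

With this D, a zero entry of Dβ merges two items into the same group, so the support pattern of Dβ along the Split LBI path directly yields the coarse-grained grouping.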
Massachusetts Institute of Technology (MIT); University of Southern California; University of British Columbia (Canada); University of Wisconsin-Madison; Peking University (China); Northwestern University; University of Chicago; Swiss Federal Institute of Technology (Switzerland); Brown University; Georgia Institute of Technology; Imperial College London (UK); University of Washington; University of Toronto (Canada); University of California, Santa Barbara; Duke University; University of Tokyo (Japan); The University of Hong Kong (Hong Kong); Purdue University; University of Texas at Austin; Dartmouth College; University of California, Irvine; University of California, Santa Cruz; University of California, Davis; Tsinghua University (China); University of Maryland, College Park; London School of Economics and Political Science (UK); Boston University; Hong Kong University of Science and Technology (Hong Kong); University College London (UK); Rice University.

Table 3: The universities in the 2nd group. The full grouping results are available at https://github.com/yuany-pku/split-lbi/tree/master/examples/university.

7. Conclusion

In this paper, we introduce a novel iterative regularization path with structural sparsity, such that parameters are sparse under certain linear transforms. Variable splitting is exploited to lift the parameters into a higher dimensional space, with separate parameters for data fitting and for sparse model selection. A statistical benefit of such a splitting lies in its improved model selection consistency under conditions weaker than those of the traditional generalized Lasso, as shown in both theory and experiments. For the statistical analysis of the algorithm, several limit dynamics in the form of differential inclusions are introduced, which shed light on the consistency properties of the regularization paths. Finally, some applications are given with real world data, including image denoising, partial order ranking of basketball teams, and grouping of world universities by crowdsourced ranking.
These results show that the benefit of the proposed algorithm lies both in its simplicity of computing the regularization path iteratively and in its solid theoretical guarantee of path consistency. Hence it can be regarded as a generalization of L₂Boost in machine learning, or of the Landweber iteration in inverse problems, equipped with structural sparsity control.

Appendix A. Further Notations throughout the Appendix

Apart from Sections 1.5 and 2.1, we need more notation throughout the appendix. Let the compact singular value decomposition (compact SVD) of D be

D = U₁Λ₁V^T  (Λ₁ ∈ R^{r×r}, Λ₁ ≻ 0, U₁ ∈ R^{m×r}, V ∈ R^{p×r}),   (A.1)

and let (V, ˜V) be an orthogonal square matrix. Let the compact SVD of X˜V/√n be

X˜V/√n = U₂Λ₂V₂^T  (Λ₂ ∈ R^{r′×r′}, Λ₂ ≻ 0, U₂ ∈ R^{n×r′}, V₂ ∈ R^{(p−r)×r′}),   (A.2)

and let (V₂, ˜V₂) be an orthogonal square matrix. Here r′ in (A.2) is the rank of X˜V, which meets the definition (4.4).

We have λ_D I ⪯ Λ₁ ⪯ Λ_D I. If ker(D) ⊆ ker(X) (for example, if D has full column rank), then X˜V = 0, and U₂, Λ₂, V₂, ˜V₂ all drop. Generally, r′ ≤ p − r. If Assumption 1 holds, then noting that for any ξ ∈ R^{p−r}, ˜Vξ ∈ ker(D) ⊆ M, and

(˜Vξ)^T X*X (˜Vξ) ≥ λ‖˜Vξ‖² = λ‖ξ‖²,

we have V₂Λ₂²V₂^T = ˜V^T X*X ˜V ⪰ λI. Since V₂ is a tall matrix, we further know it is square (otherwise V₂Λ₂²V₂^T would not be invertible), i.e. r′ = p − r. Besides, we have λ₂ := λ_min(Λ₂) ≥ √λ (when Λ₂ drops, set λ₂ := +∞).

From now on, we also write λ_H = C/(1 + ν) and λ_Σ = C′/(1 + ν), according to Propositions 2 and 3.

Appendix B. Some Useful Technical Lemmas

Lemma 2 (Concentration inequalities). Suppose that ε ∈ R^n has independent identically distributed components, each of which has a sub-Gaussian distribution with parameter σ, i.e.
E [exp( t(cid:15) i )] ≤ exp( σ t / , then P (cid:18) (cid:107) B(cid:15) (cid:107) ∞ σ ≥ z (cid:19) ≤ q exp (cid:32) − z (cid:107) B (cid:107) (cid:33) (cid:0) B ∈ R q × n , z ≥ (cid:1) , (B.1) P (cid:32) (cid:107) (cid:15) (cid:107) nσ ≥ z (cid:33) ≤ exp (cid:18) − n ( z − log (1 + z ))2 (cid:19) ( z ≥ . (B.2) Moreover, by (B.1) we have that for B ∈ R q × n , with probability not less than − q/m , (cid:107) B(cid:15) (cid:107) ∞ ≤ σ · (cid:107) B (cid:107) (cid:112) log m. (B.3) By (B.2) we have that with probability not less than − exp( − n/ , (cid:107) (cid:15) (cid:107) ≤ σ √ n. (B.4) Proof. As for (B.1), let B = ( B i,j ) q × n and 1 ≤ i ≤ q , it is well-known that B i, · (cid:15) = B i, (cid:15) + B i, (cid:15) + · · · + B i,n (cid:15) n is also sub-Gaussian, with parameter b i = ( B i, + · · · + B i,n ) σ . Thus P ( (cid:107) B(cid:15) (cid:107) ∞ ≥ z ) ≤ q · max ≤ i ≤ q P ( | B i, · (cid:15) | ≥ z ) ≤ q exp (cid:18) − z b i (cid:19) ≤ q exp (cid:32) − z (cid:107) B (cid:107) (cid:33) . 29s for (B.2), note that for 0 ≤ ζ < / P (cid:32) (cid:107) (cid:15) (cid:107) nσ ≥ z (cid:33) ≤ P (cid:32) exp (cid:32) ζ (cid:107) (cid:15) (cid:107) σ (cid:33) ≥ exp ( ζn (1 + z )) (cid:33) ≤ exp ( − ζn (1 + z )) E (cid:34) exp (cid:32) ζ (cid:107) (cid:15) (cid:107) σ (cid:33)(cid:35) = exp ( − ζn (1 + z )) (cid:18) E (cid:20) exp (cid:18) ζ(cid:15) σ (cid:19)(cid:21)(cid:19) n ≤ exp ( − ζn (1 + z )) · (cid:18) − ζ (cid:19) n/ . Take ζ = z/ (2(1 + z )) ∈ [0 , / Lemma 3 ( Transformations and bounds for quadratic forms ). If K = (cid:18) P QQ T R (cid:19) (cid:23) , then (cid:0) u T , v T (cid:1) (cid:18) P QQ T R (cid:19) (cid:18) uv (cid:19) ≥ max (cid:0) u T (cid:0) P − QR † Q T (cid:1) u, v T (cid:0) R − Q T P † Q (cid:1) v (cid:1) . (B.5) Moreover, for ≤ ζ ≤ , the following two statements are equivalent: P − QR † Q T (cid:23) ζP, (B.6) R − Q T P † Q (cid:23) ζR. 
(B.7) And if (B.6) and (B.7) hold, then by (B.5) we easily obtain (cid:0) u T , v T (cid:1) (cid:18) P QQ T R (cid:19) (cid:18) uv (cid:19) ≥ max (cid:0) ζu T P u, ζv T Rv (cid:1) ≥ ξ (cid:0) λ min ( P ) (cid:107) u (cid:107) + λ min ( R ) (cid:107) v (cid:107) (cid:1) ≥ ξ /λ min( P ) + 1 /λ min ( R ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) uv (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . (B.8) Proof. Theorem 1.19 in Zhang (2006) tells that P P † Q = Q , so it is easy toverify K = (cid:18) I Q T P † I (cid:19) (cid:18) P R − Q T P † Q (cid:19) (cid:18) I P † Q I (cid:19) . (cid:0) u T , v T (cid:1) K (cid:18) uv (cid:19) = (cid:18) u + P † Rvv (cid:19) T (cid:18) P R − Q T P † Q (cid:19) (cid:18) u + P † Rvv (cid:19) ≥ v T (cid:0) R − Q T P † Q (cid:1) v. Similarly we can obtain another inequality.If (B.6) holds, then P † / QR † / · R † / Q T P † / (cid:22) (1 − ζ ) P † / P P † / (cid:22) (1 − ζ ) I = ⇒ R † / Q T P † / · P † / QR † / (cid:22) (1 − ζ ) I = ⇒ R / R † / Q T P † QR † / R / (cid:22) (1 − ζ ) R. By Theorem 1.19 in Zhang (2006) we have QR † / R / = Q , thus Q T P † Q (cid:22) (1 − ζ ) R , i.e. (B.7) holds. Similarly (B.7) implies (B.6). Lemma 4 ( Representation of L ). Adopt notations from (A.1) and (A.2) . β ∈ L if and only if β = V δ + ˜ V V ξ, where δ = V T β, ξ = V T ˜ V T β. Proof. Note that I = V V T + ˜ V ˜ V T = V V T + ˜ V (cid:16) V V T + ˜ V ˜ V T (cid:17) ˜ V T . Right multiplying β on both side leads to β = V δ + ˜ V V ξ + ˜ V ˜ V (cid:16) ˜ V T ˜ V T β (cid:17) . (B.9)It suffices to show ker (cid:16) ˜ V T ˜ V T (cid:17) = L , which is equivalent to L (cid:48) := Im (cid:16) ˜ V ˜ V (cid:17) = L ⊥ (= ker( X ) ∩ ker( D )) . For any β ∈ L (cid:48) , we have Xβ = 0 , Dβ = 0 since X ˜ V ˜ V = 0 , D ˜ V = 0, so β ∈ L ⊥ . Conversely, if β ∈ L ⊥ , left multiplying D on both sides of (B.9)leads to δ = 0. Then left multiplying X on both sides of (B.9) further leadsto ξ = 0. Now (B.9) tells that β ∈ L (cid:48) . So L (cid:48) = L ⊥ .31 emma 5. 
Adopt the notation from (A.1) and (A.2) . Define B := Λ + νV T X ∗ (cid:0) I − U U T (cid:1) XV. We have DA † = U Λ B − V T (cid:18) I − √ n X T U Λ − V T ˜ V T (cid:19) . (B.10) Consequently, Σ = (cid:0) I − DA † D T (cid:1) /ν = (cid:0) I − U Λ B − Λ U T (cid:1) /ν. (B.11) Proof. Note that (cid:18) V T ˜ V T (cid:19) A (cid:16) V, ˜ V (cid:17) = (cid:18) Λ + νV T X ∗ XV νV T X T U Λ V T / √ nνV Λ U T XV / √ n νV Λ V T (cid:19) = QM Q T , where Q := (cid:18) I r V T X T U Λ − V T / √ n I p − r (cid:19) , M := (cid:18) B νV Λ V T (cid:19) We can directly verify that ( QM Q T ) † = ( Q T ) − M † Q − , thus DA † = D (cid:16) V, ˜ V (cid:17) (cid:18)(cid:18) V T ˜ V T (cid:19) A (cid:16) V, ˜ V (cid:17)(cid:19) † (cid:18) V T ˜ V T (cid:19) = ( U Λ , (cid:0) Q T (cid:1) − M † Q − (cid:18) V T ˜ V T (cid:19) , which comes to be the right hand side of (B.10). Now it is easy to verify(B.11). Appendix C. Proof on Basic Path Properties of Split ISS, SplitLBISS and Split LBI Proof of Theorem 2. For Split ISS, by (1.8a) and the fact that β ( t ) ∈ L =Im( X T ) + Im( D T ) = Im( A ) = Im( A † ), we can solve β ( t ) = A † ( νX ∗ y + D T γ ( t )) which is determined by γ ( t ). Plugging it into (1.8b) we have˙ ρ ( t ) + ˙ γ ( t ) /κ = − Σ γ ( t ) + DA † X ∗ y. Taking M = I p + m − ( (cid:112) ν/nX T , D T ) † ( (cid:112) ν/nX T , D T ) in Theorem 1.19 inZhang (2006) leads to DA † X ∗ = ΣΣ † (cid:0) DA † X ∗ (cid:1) = Σ / Σ † / ( DAX ∗ ) . (C.1)32he inclusion becomes˙ ρ ( t ) + ˙ γ ( t ) /κ = − Σ / (cid:0) Σ / γ ( t ) − Σ † / DA † X ∗ y (cid:1) , which is a standard ISS (on γ ( t )) and has been sufficiently discussed in Osheret al. (2016) (let X, y in that paper take √ n Σ / and √ n Σ † / DA † X ∗ y inthis paper). Specifially, there exists a solution with piecewise linear ρ ( t ) and piecewise constant β ( t ) , γ ( t ). Besides, ρ ( t ) is unique. 
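The elimination step in the proof of Theorem 2, β(t) = A†(νX*y + D^Tγ(t)), amounts to solving the β-stationarity condition of ℓ(β, γ) for fixed γ. A quick numerical check of this identity, under our reading of the notation A = νX*X + D^TD with X* = X^T/n (the right hand side νX*y + D^Tγ lies in Im(X^T) + Im(D^T) = Im(A), so the pseudoinverse solves the system exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 30, 6, 5
nu = 2.0
X = rng.standard_normal((n, p))
D = rng.standard_normal((m, p))
y = rng.standard_normal(n)
gamma = rng.standard_normal(m)

Xs = X.T / n                       # X* = X^T / n (sample-normalized adjoint)
A = nu * Xs @ X + D.T @ D          # A = nu X* X + D^T D
beta = np.linalg.pinv(A) @ (nu * Xs @ y + D.T @ gamma)

# beta should zero the beta-gradient of l(beta, gamma) = ||y - X beta||^2/(2n) + ||gamma - D beta||^2/(2 nu)
grad = Xs @ (X @ beta - y) + D.T @ (D @ beta - gamma) / nu
print(np.linalg.norm(grad))        # numerically ~ 0
```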
If additionally, whenΣ S ( t ) ,S ( t ) (cid:31) 0, we have that Σ · ,S ( t ) has full column rank, and γ ( t ) (hence β ( t ))is unique.For Split LBISS, letting z ( t ) = ρ ( t ) + γ ( t ) /κ and noting (1.5), the SplitLBISS (1.7) is equivalent to (cid:18) ˙ β ( t )˙ z ( t ) (cid:19) = − (cid:18) − κX ∗ ( Xβ ( t ) − y ) − κD T ( Dβ ( t ) − κ S ( z ( t ) , /ν − ( κ S ( z ( t ) , − Dβ ( t )) /ν (cid:19) . The Picard-Lindel¨of Theorem implies that this ODE has a unique solution( β ( t ) , z ( t )), so there exists a unique solution to the Split LBISS (1.7). (cid:3) Proof of Theorem 3. For Split ISS, one can easily imitates the technique inthe proof of Theorem 2.1 in Osher et al. (2016) to show that ( β ( t ) , γ ( t )) isthe solution of the following optimization problem.min β,γ (cid:96) ( β, γ )subject to γ j ≥ , if ρ j ( t ) = 1 ,γ j ≤ , if ρ j ( t ) = − ,γ j = 0 , if ρ j ( t ) ∈ ( − , . (C.2)for any t > 0, due to the continuity of ρ ( · ), there is a small neighborhood of t , on which every τ satisfies ρ j ( τ ) > − γ j ( τ ) ≥ , if ρ j ( t ) = 1 ,ρ j ( τ ) < γ j ( τ ) ≥ , if ρ j ( t ) = − ,ρ j ( τ ) ∈ ( − , 1) hence γ j ( τ ) = 0 , if ρ j ( t ) ∈ ( − , . That is to say, ( β ( τ ) , γ ( τ )) satisfies the constraints in (C.2), so the valueof (cid:96) ( β ( τ ) , γ ( τ )) is not less than (cid:96) ( β ( t ) , γ ( t )) (the solution of (C.2)). This implies that any t ≥ ( β ( · ) , γ ( · )). 
Then by standard techniques in mathematical analysis, we havethat (cid:96) ( β ( t ) , γ ( t )) is non-increasing.For Split LBISS, by (1.7c), we have ˙ γ j ( t ) · ˙ ρ j ( t ) ≡ j , so (cid:96) isnon-increasing sincedd t (cid:96) ( β ( t ) , γ ( t )) = (cid:28)(cid:18) ˙ β ( t )˙ γ ( t ) (cid:19) , (cid:18) ∇ β (cid:96) ( β ( t ) , γ ( t )) ∇ γ (cid:96) ( β ( t ) , γ ( t )) (cid:19)(cid:29) = (cid:28)(cid:18) ˙ β ( t )˙ γ ( t ) (cid:19) , (cid:18) − ˙ β ( t ) /κ − ˙ ρ ( t ) − ˙ γ ( t ) /κ (cid:19)(cid:29) = 1 κ (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) ˙ β ( t )˙ γ ( t ) (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ . For Split LBI, noting ( ρ k +1 − ρ k )( γ k +1 − γ k ) = (cid:107) ρ k +1 (cid:107) − (cid:104) ρ k +1 , γ k (cid:105) + (cid:107) γ k +1 (cid:107) − (cid:104) ρ k , γ k +1 (cid:105) ≥ 0, we have − α ∇ (cid:96) ( β k , γ k ) T (cid:18) β k +1 − β k γ k +1 − γ k (cid:19) = (cid:18)(cid:18) ρ k +1 − ρ k (cid:19) + 1 κ (cid:18) β k +1 − β k γ k +1 − γ k (cid:19)(cid:19) (cid:18) β k +1 − β k γ k +1 − γ k (cid:19) ≥ κ (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) β k +1 − β k γ k +1 − γ k (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . By κα (cid:107) H (cid:107) < 2, we have (cid:96) ( β k +1 , γ k +1 ) − (cid:96) ( β k , γ k )= ∇ (cid:96) ( β k , γ k ) T (cid:18) β k +1 − β k γ k +1 − γ k (cid:19) + 12 (cid:0) β Tk +1 − β Tk , γ Tk +1 − γ Tk (cid:1) H (cid:18) β k +1 − β k γ k +1 − γ k (cid:19) ≤ − κα (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) β k +1 − β k γ k +1 − γ k (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:107) H (cid:107) · (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) β k +1 − β k γ k +1 − γ k (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ . 
Moreover, it is easy to verify that (cid:0) β T , γ T (cid:1) H (cid:18) βγ (cid:19) = 1 n (cid:107) Xβ (cid:107) + 1 ν (cid:107) Dβ − γ (cid:107) ≤ n (cid:107) Xβ (cid:107) + 2 ν (cid:107) Dβ (cid:107) +2 (cid:107) γ (cid:107) ≤ ν Λ X + Λ D ) ν (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) βγ (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) (cid:18)(cid:18) βγ (cid:19) ∈ R m + p (cid:19) , = ⇒ (cid:107) H (cid:107) ≤ ν Λ X + Λ D ) ν . (C.3)34 ppendix D. Proof on Equivalence of Assumptions Proof of Proposition 2. If there exists C > , ν > β ∈ L ∩ M , taking γ S = D S β and noting D S c β = 0, the left handside of (2.4) is1 n (cid:107) Xβ (cid:107) + 1 ν (cid:107) γ S − D S β (cid:107) + 1 ν (cid:107) D S c β (cid:107) = 1 n (cid:107) Xβ (cid:107) . which should be not less than ( C/ (1 + ν )) (cid:107) β (cid:107) . Thus Assumption 1 holds for λ = C/ (1 + ν ) > λ > 0, for any β ∈ L , let β = β (cid:48) + β (cid:48)(cid:48) where β (cid:48) ∈ L∩M and β (cid:48)(cid:48) ∈ L∩M ⊥ . Since D S c β (cid:48) = 0 , β (cid:48)(cid:48) ∈ M ⊥ = Im( D TS c ),we have β T D TS c D S c β = β (cid:48)(cid:48) T D TS c D S c β (cid:48)(cid:48) ≥ λ D (cid:107) β (cid:48)(cid:48) (cid:107) . For constant ν = 2 λ D / ( λ + 2Λ X ) > β T (cid:0) ν X ∗ X + D TS c D S c (cid:1) β = ν · β T X ∗ Xβ + β T D TS c D S c β ≥ ν (cid:16) β (cid:48) / T X ∗ X ( β (cid:48) / − ( − β (cid:48)(cid:48) ) T X ∗ X ( − β (cid:48)(cid:48) ) (cid:17) + λ D (cid:107) β (cid:48)(cid:48) (cid:107) ≥ λν (cid:107) β (cid:48) (cid:107) + (cid:0) λ D − ν Λ X (cid:1) (cid:107) β (cid:48)(cid:48) (cid:107) = λν (cid:16) (cid:107) β (cid:48) (cid:107) + (cid:107) β (cid:48)(cid:48) (cid:107) (cid:17) = λν (cid:107) β (cid:107) . The left hand side of (2.4), denoted by L or L ( ν ), satisfies L ≥ n (cid:107) Xβ (cid:107) + 1 ν (cid:107) D S c β (cid:107) ≥ ν , ν ) β T (cid:0) ν X ∗ X + D TS c D S c (cid:1) β ≥ λν ν + ν ) (cid:107) β (cid:107) . 
Furthermore, by the inequality above and Cauchy’s inequality, L ≥ λν ν + ν ) (cid:107) β (cid:107) + 1 ν (cid:107) γ S − D S β (cid:107) ≥ λν D ( ν + ν ) (cid:107) D S β (cid:107) + 1 ν (cid:107) γ S − D S β (cid:107) ≥ D ( ν + ν ) / ( λν ) + ν (cid:107) γ S (cid:107) , Consequently, (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) βγ S (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:18) ν + ν ) λν + (cid:18) D ( ν + ν ) λν + ν (cid:19)(cid:19) L ≤ νC L, C = λ · min( ν , D + λν (cid:18) ν = 2 λ D λ + 2Λ X > (cid:19) (D.1)is a constant. Thus (2.4) holds for all ν > (cid:3) Proof of Proposition 3. Let L = L ( ν ) denotes the left hand side of (2.4).Suppose there exists C > , ν > ν = ν . Since H ( β,S ) , ( β,S ) ( ν ) ≥ min (cid:16) , ν ν (cid:17) H ( β,S ) , ( β,S ) ( ν ) ≥ ν ν + ν H ( β,S ) , ( β,S ) ( ν ) , (D.2)we can find C > C = C and all ν > 0. Now H ( β,S ) , ( β,S ) = QM Q T , where Q := (cid:18) I p − D S A † I s (cid:19) , M := (cid:18) A/ν 00 Σ S,S (cid:19) . (D.3)So L = (cid:18) β − A † D TS γ S γ S (cid:19) T (cid:18) A/ν 00 Σ S,S (cid:19) (cid:18) β − A † D TS γ S γ S (cid:19) .L ≥ ( C / (1 + ν )) (cid:107) ( β ; γ S ) (cid:107) implies γ TS Σ S,S γ S ≥ ( C / (1 + ν )) (cid:107) γ S (cid:107) (letting β = A † D TS γ S ∈ L ). So (2.6) holds for C (cid:48) = C and all ν > C (cid:48) > , ν > ν = ν . Forany β ∈ L , represent β = V δ + ˜ V V ξ by Lemma 4, then β T (cid:0) ν X ∗ X + D T D (cid:1) β = (cid:0) δ T , ξ T (cid:1) (cid:18) Λ + ν V T X ∗ XV ν V T X ∗ X ˜ V V ν V T ˜ V T X ∗ XV ν V T ˜ V T X ∗ X ˜ V V (cid:19) (cid:18) δξ (cid:19) = (cid:0) δ T , ξ T (cid:1) (cid:18) Λ + ν V T X ∗ XV ν V T X T U Λ / √ nν Λ U T XV / √ n ν Λ (cid:19) (cid:18) δξ (cid:19) . For P = Λ + ν V T X ∗ XV, Q = ν V T X T U Λ / √ n, R = ν Λ , P − QR † Q T = Λ + ν V T X ∗ ( I − U U T ) XV (cid:23) λ D I (cid:23) λ D ν Λ X + Λ D · P. 
By Lemma 3 we have β T (cid:0) ν X ∗ X + D T D (cid:1) β ≥ λ D ν Λ X + Λ D · /λ min ( P ) + 1 /λ min ( R ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) δξ (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≥ λ D ν Λ X + Λ D · /λ D + 1 / ( ν λ ) (cid:107) β (cid:107) . H S,S ( ν ) − H S,β ( ν ) H β,β ( ν ) † H β,S ( ν ) = Σ S,S ( ν ) (cid:23) C (cid:48) ν I = C (cid:48) ν ν H S,S ( ν ) . By Lemma 3 we have H β,β ( ν ) − H β,S ( ν ) H S,S ( ν ) † H S,β ( ν ) (cid:23) C (cid:48) ν ν H β,β ( ν )= ⇒ (cid:18) − C (cid:48) ν ν (cid:19) (cid:0) ν X ∗ X + D TS c D S c (cid:1) (cid:23) C (cid:48) ν ν D TS D S = ⇒ ν X ∗ X + D TS c D S c (cid:23) C (cid:48) ν ν (cid:0) ν X ∗ X + D T D (cid:1) . Thus L ( ν ) = 12 n (cid:107) Xβ (cid:107) + 12 ν (cid:107) γ S − D S β (cid:107) + (cid:107) D S c β (cid:107) ≥ ν β T (cid:0) ν X ∗ X + D TS c D S c (cid:1) β ≥ C (cid:48) ν ) β T (cid:0) ν X ∗ X + D T D (cid:1) β ≥ C (cid:48) (cid:107) β (cid:107) . Where C (cid:48) > L ( ν ) ≥ γ TS (cid:0) H S,S ( ν ) − H S,β ( ν ) H β,β ( ν ) † H β,S ( ν ) (cid:1) γ S ≥ C (cid:48) ν (cid:107) γ S (cid:107) . Thus we can find C (cid:48) > L ( ν ) ≥ C (cid:48) (cid:107) ( β ; γ S ) (cid:107) . Combining with(D.2), we can find C > ν > (cid:3) Proof of Proposition 4. Under Assumption 1, by Proposition 2 and 3 we haveΣ S,S (cid:31) 0. By (D.3), we knowrank (cid:0) H ( β,S ) , ( β,S ) (cid:1) = rank (cid:18)(cid:18) A/ν 00 Σ S,S (cid:19)(cid:19) = rank( A ) + rank (Σ S,S ) = rank ( H β,β ) + rank ( H S,S ) . Then by Theorem 1.21 in Zhang (2006), we have that H ( β,S ) , ( β,S ) † = (cid:18) νA † + A † D TS Σ − S,S D S A † A † D TS Σ − S,S Σ − S,S D S A † Σ − S,S (cid:19) . By H S c , ( β,S ) = ( − D S c /ν, 0) and − D S c A † D S /ν = Σ S c ,S , we have H S c , ( β,S ) H ( β,S ) , ( β,S ) † = (cid:0) − D S c A † + Σ S c ,S Σ − S,S D S A † , Σ S c ,S Σ − S,S (cid:1) . (D.4)The rest is easy. (cid:3) ppendix E. Proof of the Comparison Theorem Proof of Theorem 1. 
By definition, we have

    ic ≥ ‖Ω_S sign(D_Sβ⋆)‖_∞ ≥ ic.

Now we prove that irr(0) exists and irr(0) = ic. Let M := Λ^{-1}V^TX*(I − UU^T)XVΛ^{-1}. When ν is small, by (B.11),

    νΣ = I − UΛB^{-1}ΛU^T = I − U(I + νM)^{-1}U^T = I − U(I − νM + O(ν²))U^T = I − UU^T + νUMU^T + O(ν²)
    ⇒ νΣ_{S^c,S} = −U_{S^c}U_S^T + νU_{S^c}MU_S^T + O(ν²), νΣ_{S,S} = I − U_SU_S^T + νU_SMU_S^T + O(ν²).

Let F := I − U_SU_S^T and F = U′Λ′U′^T be the "compact" eigendecomposition of F (Λ′ ≻ 0), and let G := U_SMU_S^T. Suppose (U′, Ũ′) is an orthogonal square matrix, and

    K = [K₁, K₂; K₂^T, K₃] := [U′^T; Ũ′^T] G (U′, Ũ′).

By F + νG ≻ 0, we have K₃ ≻ 0. Now

    F + νG = (U′, Ũ′) [Λ′ + νK₁, νK₂; νK₂^T, νK₃] [U′^T; Ũ′^T].

Define Q_ν = K₃ − νK₂^T(Λ′ + νK₁)^{-1}K₂ and R_ν = K₂^T(Λ′ + νK₁)^{-1}, and we can calculate

    (F + νG)^{-1} = (U′, Ũ′) [(Λ′ + νK₁)^{-1} + νR_ν^TQ_ν^{-1}R_ν, −R_ν^TQ_ν^{-1}; −Q_ν^{-1}R_ν, Q_ν^{-1}/ν] [U′^T; Ũ′^T].

Note that Q_ν → K₃, R_ν → K₂^TΛ′^{-1}, and note that

    U_{S^c}^TU_{S^c}U_S^TŨ′ = (I − U_S^TU_S)U_S^TŨ′ = U_S^T(I − U_SU_S^T)Ũ′ = U_S^TU′Λ′ · U′^TŨ′ = 0
    ⇒ (U_{S^c}U_S^TŨ′)^T U_{S^c}U_S^TŨ′ = 0 ⇒ U_{S^c}U_S^TŨ′ = 0. (E.1)

Combining it with the representation of (F + νG)^{-1},

    −U_{S^c}U_S^T(F + νG)^{-1} = −(U_{S^c}U_S^TU′, 0) [(Λ′ + νK₁)^{-1} + νR_ν^TQ_ν^{-1}R_ν, −R_ν^TQ_ν^{-1}; −Q_ν^{-1}R_ν, Q_ν^{-1}/ν] [U′^T; Ũ′^T]
    → (−U_{S^c}U_S^TU′Λ′^{-1}, U_{S^c}U_S^TU′Λ′^{-1}K₂K₃^{-1}) [U′^T; Ũ′^T] = −U_{S^c}U_S^TU′Λ′^{-1}(U′^T − K₂K₃^{-1}Ũ′^T),

and

    U_{S^c}MU_S^T · ν(F + νG)^{-1} → U_{S^c}MU_S^TŨ′K₃^{-1}Ũ′^T.

So when ν → 0,

    Σ_{S^c,S}Σ_{S,S}^{-1} → −U_{S^c}U_S^TU′Λ′^{-1}(U′^T − K₂K₃^{-1}Ũ′^T) + U_{S^c}MU_S^TŨ′K₃^{-1}Ũ′^T
    = −U_{S^c}U_S^TU′Λ′^{-1}U′^T + U_{S^c}(U_S^TU′Λ′^{-1}U′^TU_S + I)MU_S^TŨ′K₃^{-1}Ũ′^T
    = −D_{S^c}VΛ^{-1}U_S^TU′Λ′^{-1}U′^T + D_{S^c}VΛ^{-1}(I + U_S^TU′Λ′^{-1}U′^TU_S)MU_S^TŨ′K₃^{-1}Ũ′^T.

The infinity norm of the right hand side is irr(0). On the other hand,

    ic = ‖D_{S^c}(D_{S^c}^TD_{S^c})†(X*XW(W^TX*XW)†W^T − I)D_S^T‖_∞.

In order to prove irr(0) = ic, it suffices to show

    (X*XW(W^TX*XW)†W^T − I)D_S^T = −D_{S^c}^TD_{S^c}VΛ^{-1}U_S^TU′Λ′^{-1}U′^T + D_{S^c}^TD_{S^c}VΛ^{-1}(I + U_S^TU′Λ′^{-1}U′^TU_S)MU_S^TŨ′K₃^{-1}Ũ′^T.
The first term of the right hand side is

    −VΛU_{S^c}^TU_{S^c}U_S^TU′Λ′^{-1}U′^T = −VΛ(I − U_S^TU_S)U_S^TU′Λ′^{-1}U′^T = −VΛU_S^T(I − U_SU_S^T)U′Λ′^{-1}U′^T = −VΛU_S^TU′Λ′U′^TU′Λ′^{-1}U′^T = −D_S^TU′U′^T,

while by the fact that

    (I − U_S^TU_S)(I + U_S^TU′Λ′^{-1}U′^TU_S) = I − U_S^TU_S + U_S^TU′Λ′^{-1}U′^TU_S − U_S^TU_SU_S^TU′Λ′^{-1}U′^TU_S
    = I − U_S^TU_S + U_S^T(I − U_SU_S^T)U′Λ′^{-1}U′^TU_S = I − U_S^TU_S + U_S^TU′U′^TU_S = I − U_S^TŨ′Ũ′^TU_S,

the second term becomes

    VΛU_{S^c}^TU_{S^c}(I + U_S^TU′Λ′^{-1}U′^TU_S)MU_S^TŨ′K₃^{-1}Ũ′^T = VΛ(I − U_S^TU_S)(I + U_S^TU′Λ′^{-1}U′^TU_S)MU_S^TŨ′K₃^{-1}Ũ′^T
    = VΛ(I − U_S^TŨ′Ũ′^TU_S)MU_S^TŨ′K₃^{-1}Ũ′^T = VΛMU_S^TŨ′K₃^{-1}Ũ′^T − VΛU_S^TŨ′ · Ũ′^TU_SMU_S^TŨ′ · K₃^{-1}Ũ′^T
    = X*(I − UU^T)XVΛ^{-1}U_S^TŨ′K₃^{-1}Ũ′^T − D_S^TŨ′ · K₃ · K₃^{-1}Ũ′^T = X*(I − UU^T)XVΛ^{-1}U_S^TŨ′K₃^{-1}Ũ′^T − D_S^TŨ′Ũ′^T.

So it suffices to show

    X*XW(W^TX*XW)†W^TD_S^T = X*(I − UU^T)XVΛ^{-1}U_S^TŨ′K₃^{-1}Ũ′^T,

which is equivalent to

    X*(XWW^TX*)†XWW^TD_S^T = X*(I − UU^T)XVΛ^{-1}U_S^TŨ′K₃^{-1}Ũ′^T. (E.2)

First we prove

    ker(U_{S^c}) = Im(U_S^TŨ′). (E.3)

In fact, by (E.1) we have Im(U_S^TŨ′) ⊆ ker(U_{S^c}).
For any ζ ∈ ker(U_{S^c}), we have (I − U_S^TU_S)ζ = U_{S^c}^TU_{S^c}ζ = 0. Let ζ = U_S^Tζ₁ + ζ₂ with ζ₂ ∈ ker(U_S); then

    0 = (I − U_S^TU_S)(U_S^Tζ₁ + ζ₂) = ζ₂ + (I − U_S^TU_S)U_S^Tζ₁ = ζ₂ + U_S^T(I − U_SU_S^T)ζ₁,

which implies ζ₂ ∈ Im(U_S^T). But ζ₂ ∈ ker(U_S), so ζ₂ = 0, and 0 = (I − U_S^TU_S)U_S^Tζ₁ = U_S^T(I − U_SU_S^T)ζ₁ = U_S^TU′Λ′U′^Tζ₁. Assume that ζ₁ = U′ζ₃ + Ũ′ζ̃₃; then U_S^TU′Λ′ζ₃ = 0. Thus

    0 = U_SU_S^TU′Λ′ζ₃ = (I − U′Λ′U′^T)U′Λ′ζ₃ = U′Λ′(I − Λ′)ζ₃ ⇒ (I − Λ′)ζ₃ = 0
    ⇒ U_SU_S^TU′ζ₃ = U′(I − Λ′)ζ₃ = 0 ⇒ (U_S^TU′ζ₃)^T U_S^TU′ζ₃ = 0 ⇒ U_S^TU′ζ₃ = 0
    ⇒ ζ = U_S^Tζ₁ = U_S^TU′ζ₃ + U_S^TŨ′ζ̃₃ = U_S^TŨ′ζ̃₃ ∈ Im(U_S^TŨ′).

So (E.3) holds. Now for any β ∈ R^p, let β = Vδ + Ṽδ̃; then β ∈ ker(D_{S^c}) if and only if U_{S^c}Λδ = 0, which means δ ∈ Λ^{-1}ker(U_{S^c}) = Im(Λ^{-1}U_S^TŨ′). So

    ker(D_{S^c}) = Im(J) + Im(Ṽ), where J := VΛ^{-1}U_S^TŨ′.

Since Ṽ^TV = 0, the linear subspaces spanned by J and Ṽ are orthogonal, and we have

    WW^T = J(J^TJ)†J^T + ṼṼ^T.
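The orthogonal-decomposition identity just derived — when the columns of J and Ṽ span mutually orthogonal subspaces, the projector onto their joint span splits as J(J^TJ)†J^T + ṼṼ^T — is easy to sanity-check numerically. The sketch below is purely illustrative (random toy matrices, not objects from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build two subspaces orthogonal to each other, as in the proof:
# Vt has orthonormal columns (playing the role of \tilde{V}), and the
# columns of J span a subspace orthogonal to Im(Vt).
Q, _ = np.linalg.qr(rng.standard_normal((8, 5)))   # 8x5 orthonormal basis
Vt = Q[:, :2]
J = Q[:, 2:] @ rng.standard_normal((3, 3))         # arbitrary basis, a.s. full rank

W = np.hstack([J, Vt])           # columns span Im(J) + Im(Vt)
P_W = W @ np.linalg.pinv(W)      # orthogonal projector onto Im(W)
P_split = J @ np.linalg.pinv(J.T @ J) @ J.T + Vt @ Vt.T

assert np.allclose(P_W, P_split)
assert np.allclose(P_W @ P_W, P_W)   # idempotent, as a projector should be
```

Here W @ pinv(W) is used as the projector onto Im(W); in the proof, W is an orthonormal basis of ker(D_{S^c}), so the same projector is written as WW^T.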
By Ṽ^TV = 0 and Ṽ^TX*(I − UU^T) = 0, we have

    X*(XWW^TX*)†XWW^TD_S^TŨ′K₃ = X*(XWW^TX*)†XJ(J^TJ)†J^TVΛU_S^TŨ′K₃
    = X*(XWW^TX*)†XJ(J^TJ)† · Ũ′^TU_SU_S^TŨ′ · Ũ′^TU_SMU_S^TŨ′
    = X*(XWW^TX*)†XJ(J^TJ)† · Ũ′^TU_SMU_S^TŨ′
    = X*(XWW^TX*)†XJ(J^TJ)†J^TX*(I − UU^T)XVΛ^{-1}U_S^TŨ′
    = X*(XWW^TX*)†(XWW^TX*)(I − UU^T)XJ.

Since (XWW^TX*)†(XWW^TX*) is the projection matrix onto the linear subspace Im(XW) = Im(XṼ) + Im(XJ) = Im(U) + Im(XJ), and (I − UU^T)XJ = XJ − U·U^TXJ lies in this subspace, the last term above becomes X*(I − UU^T)XJ. Therefore, we get

    X*(XWW^TX*)†XWW^TD_S^TŨ′K₃ = X*(I − UU^T)XJ ⇔ X*(XWW^TX*)†XWW^TD_S^TŨ′ = X*(I − UU^T)XVΛ^{-1}U_S^TŨ′K₃^{-1}.

Now to prove (E.2), it suffices to show

    X*(XWW^TX*)†XWW^TD_S^T(I − Ũ′Ũ′^T) = 0 ⇐ WW^TD_S^TU′U′^T = 0 ⇐ J(J^TJ)†J^TD_S^TU′ = 0 ⇐ J^TD_S^TU′ = 0
    ⇐ Ũ′^TU_SΛ^{-1}V^T · VΛU_S^TU′ = 0 ⇐ Ũ′^TU_SU_S^TU′ = 0 ⇐ Ũ′^T(I − U′Λ′U′^T)U′ = 0,

which is surely true since Ũ′^TU′ = 0. Then irr(0) = ic is proved.

Now we turn to irr(∞). Let M = U″Λ″U″^T be the compact eigendecomposition of M, and suppose (U″, Ũ″) is an orthogonal square matrix.
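The projection fact invoked above — that (XWW^TX*)†(XWW^TX*) is the orthogonal projector onto Im(XW) — is an instance of the general identity that for any matrix Z, (ZZ^T)†(ZZ^T) projects onto Im(Z). A toy numerical check (illustrative matrices only, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))  # 6x4, rank 3
G = Z @ Z.T
P = np.linalg.pinv(G) @ G        # (Z Z^T)^† (Z Z^T)

assert np.allclose(P, P.T)       # symmetric
assert np.allclose(P @ P, P)     # idempotent
assert np.allclose(P @ Z, Z)     # fixes Im(Z): P is the projector onto Im(Z)
```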
Then

    νΣ = I − U(I + νM)^{-1}U^T = I − U(U″, Ũ″)[(U″^T; Ũ″^T)(I + νM)(U″, Ũ″)]^{-1}(U″^T; Ũ″^T)U^T
    = I − U(U″, Ũ″)[I + νΛ″, 0; 0, I]^{-1}(U″^T; Ũ″^T)U^T = I − UU″(I + νΛ″)^{-1}U″^TU^T − UŨ″Ũ″^TU^T
    → I − UŨ″Ũ″^TU^T (ν → +∞).

Besides, νΣ_{S,S} → I − U_SŨ″Ũ″^TU_S^T, and this limit ⪰ νΣ_{S,S} ≻ 0 for any ν > 0. Thus Σ_{S^c,S}Σ_{S,S}^{-1} has a limit when ν → +∞.

Now we study when irr(∞) = 0. Let D_S^T = X^TC₁ + D_{S^c}^TC₂, which implies U_S^T = Λ^{-1}V^TX^TC₁ + U_{S^c}^TC₂. Then

    0 = Ṽ^TD_S^T = Ṽ^TX^TC₁ + 0 = √n Λ₂U^TC₁,

which implies U^TC₁ = 0. So for N = Λ^{-1}V^TX^T(I − UU^T)/√n, we have NC₁ = Λ^{-1}V^TX^TC₁/√n. Then irr(∞) = 0 ⇔ −U_{S^c}Ũ″Ũ″^TU_S^T = 0 ⇔ −U_{S^c}(I − MM†)U_S^T = 0. By M = NN^T, the equation is further equivalent to

    −U_{S^c}(I − NN†)U_S^T = 0 ⇔ −U_{S^c}(I − NN†)(Λ^{-1}V^TX^TC₁ + U_{S^c}^TC₂) = 0 ⇔ −U_{S^c}(I − NN†)(√n NC₁ + U_{S^c}^TC₂) = 0
    ⇔ −U_{S^c}(I − NN†)U_{S^c}^TC₂ = 0 ⇔ C₂^TU_{S^c}(I − NN†)·(I − NN†)U_{S^c}^TC₂ = 0 ⇔ (I − NN†)U_{S^c}^TC₂ = 0 ⇔ Im(U_{S^c}^TC₂) ⊆ Im(N).

It suffices to show that the last property holds if and only if ker(X) ⊆ ker(D_S) or, equivalently, Im(D_S^T) ⊆ Im(X^T). In fact, if Im(D_S^T) ⊆ Im(X^T), then C₂ can be set 0 in the beginning, and Im(U_{S^c}^TC₂) = Im(0) ⊆ Im(N).
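The ν → +∞ limit computed above rests on the fact that, for a symmetric positive semi-definite M, (I + νM)^{-1} tends to the orthogonal projector onto ker(M), i.e. to I − MM†. A quick numerical check with an illustrative low-rank M (chosen for this sketch, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
M = Q @ np.diag([2.0, 1.0, 0.0, 0.0, 0.0]) @ Q.T   # symmetric PSD, rank 2

nu = 1e6
limit = np.eye(5) - M @ np.linalg.pinv(M)          # projector onto ker(M)
approx = np.linalg.inv(np.eye(5) + nu * M)

# eigenvalues 1/(1 + nu*lambda) vanish on Im(M) and stay equal to 1 on ker(M)
assert np.allclose(approx, limit, atol=1e-5)
```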
If Im(U_{S^c}^TC₂) ⊆ Im(N), let U_{S^c}^TC₂ = NC₃; then

    D_{S^c}^TC₂ = VΛU_{S^c}^TC₂ = VV^TX^T(I − UU^T)C₃/√n = (VV^T + ṼṼ^T)X^T(I − UU^T)C₃/√n = X^T(I − UU^T)C₃/√n,

and hence D_S^T = X^TC₁ + D_{S^c}^TC₂ = X^T(C₁ + (I − UU^T)C₃/√n), which implies Im(D_S^T) ⊆ Im(X^T). We have finished the proof that irr(∞) = 0 if and only if ker(X) ⊆ ker(D_S). □

Appendix F. Proof on Oracle Properties

Proof of Lemma 1. From the definition of oracle estimators (5.2),

    ∇_β ℓ(β^o, γ^o) = X*(Xβ^o − y) + D^T(Dβ^o − γ^o)/ν = 0, ∇_{γ_S} ℓ(β^o, γ^o) = (γ^o_S − D_Sβ^o)/ν = 0. (F.1)

Adding (F.1) to (5.1b) and (5.1c), we have

    (0_p; ρ̇′_S(t)) + (1/κ)(β̇′(t); γ̇′_S(t)) = −H_{(β,S),(β,S)} (d_β(t); d_{γ,S}(t)). (F.2)

Besides, since (β′(t); γ′(t)), (β^o; γ^o) ∈ L ⊕ R^s ⊕ {0}^{m−s}, by (5.2) and the Pythagorean theorem,

    ℓ(β′(t), γ′(t)) = (1/2n)‖(y; 0) − [X, 0; −√(n/ν)D, √(n/ν)I_m](β′(t); γ′(t))‖²
    = (1/2n)‖[X, 0; −√(n/ν)D, √(n/ν)I_m]((β′(t); γ′(t)) − (β^o; γ^o))‖² + (1/2n)‖(y; 0) − [X, 0; −√(n/ν)D, √(n/ν)I_m](β^o; γ^o)‖²
    = L(t) + constant (independent of t), (F.3)

where

    L(t) := (1/2n)‖[X, 0; −√(n/ν)D, √(n/ν)I_m](d_β(t); d_γ(t))‖² = (1/2)(d_β(t)^T, d_γ(t)^T) H (d_β(t); d_γ(t)) = (1/2)(d_β(t)^T, d_{γ,S}(t)^T) H_{(β,S),(β,S)} (d_β(t); d_{γ,S}(t)). (F.4)

Noting γ_j(t)·ρ̇_j(t) ≡ 0 for each j, by (5.1d), (F.2) and (F.4) we have

    (d/dt)Ψ(t) = ⟨−γ^o_S, ρ̇′_S(t)⟩ + d_{γ,S}(t)^Tγ̇′_S(t)/κ + d_β(t)^Tβ̇′(t)/κ = ⟨(d_β(t); d_{γ,S}(t)), (0_p; ρ̇′_S(t)) + (1/κ)(β̇′(t); γ̇′_S(t))⟩ = −2L(t). (F.5)

Thus it suffices to show

    F(2L(t)/λ_H) ≥ Ψ(t).

Note that ‖γ^o_S‖₁ − ⟨γ^o_S, ρ′_S(t)⟩ = 0 if ‖γ′_S(t) − γ^o_S‖² < (γ^o_min)², and, since |γ^o_j| ≤ |γ′_j(t) − γ^o_j| for j ∈ N(t),

    ‖γ^o_S‖₁ − ⟨γ^o_S, ρ′_S(t)⟩ ≤ 2∑_{j∈N(t)} |γ^o_j| (N(t) := {j : sign(γ′_j(t)) ≠ sign(γ^o_j)})
    ≤ (2/γ^o_min)∑_{j∈N(t)} (γ^o_j)², and also ≤ 2√s·√(∑_{j∈N(t)} (γ^o_j)²) ≤ 2√s‖γ′_S(t) − γ^o_S‖.

Thus

    Ψ(t) − (1/2κ)(‖d_{γ,S}(t)‖² + ‖d_β(t)‖²) ≤ F(‖d_{γ,S}(t)‖²) − (1/2κ)‖d_{γ,S}(t)‖².

It suffices to show

    F(2L(t)/λ_H) ≥ F(‖d_{γ,S}(t)‖²) + (1/2κ)‖d_β(t)‖²,

which is true since by Assumption 1,

    2L(t) = (d_β(t)^T, d_{γ,S}(t)^T) · H_{(β,S),(β,S)} · (d_β(t); d_{γ,S}(t)) ≥ λ_H·d(t)², (F.6)

and by F(· + x) ≥ F(·) + x/(2κ),

    F(d(t)²) = F(‖d_β(t)‖² + ‖d_{γ,S}(t)‖²) ≥ F(‖d_{γ,S}(t)‖²) + (1/2κ)‖d_β(t)‖². □

Lemma 6. Under Assumption 1, let γ^o_min := min(|γ^o_j| : γ^o_j ≠ 0).
For

    t ≥ τ_∞(μ) := (1/(2κλ_H))·log(1/μ) + (2 log s + 4 + d(0)²/κ)/(λ_H γ^o_min) (0 < μ < 1), (F.7)

we have

    d(t)² ≤ μ(γ^o_min)² (⇒ sign(γ′_S(t)) = sign(γ^o_S), if γ^o_j ≠ 0 for all j ∈ S). (F.8)

For t ≥ 0, we have

    d(t) ≤ min( (4√s + d(0)²/κ)/(λ_H t), √((νΛ_X² + Λ_D²)/(λ_H ν)) · d(0) ). (F.9)

Proof of Lemma 6. Noting (F.3) and that ℓ(β′(t), γ′(t)) is non-increasing, we know L(t) is non-increasing. (F.5) tells that Ψ(t) is non-increasing since L(t) ≥ 0. If L(t) = 0 for t = τ_∞(μ), by (F.6) and the fact that L(t) is non-increasing, we have

    d(t)² ≤ (2/λ_H)L(t) = 0 (t ≥ τ_∞(μ)).

Therefore (F.8) holds for t ≥ τ_∞(μ). Now assume that L(t) > 0 for t = τ_∞(μ) (and hence for 0 ≤ t ≤ τ_∞(μ)); then Ψ(t) is strictly decreasing on [0, τ_∞(μ)]. Besides, F is strictly increasing and continuous on [(γ^o_min)², +∞). Moreover,

    F(d(0)²) ≥ F(‖γ^o_S‖²) + ‖β^o‖²/(2κ) ≥ Ψ(0), d(0)² ≥ ‖γ^o_S‖² ≥ s(γ^o_min)².

If there does not exist some τ ≤ τ_∞(μ) satisfying (F.8), then for 0 ≤ t ≤ τ_∞(μ),

    Ψ(t) ≥ d(t)²/(2κ) ≥ μ(γ^o_min)²/(2κ) > 0 if κ < +∞, and Ψ(t) > 0 if κ = +∞,

which also implies that F^{-1}(Ψ(t)) > 0.
By Lemma 1,

    λ_H τ_∞(μ) ≤ ∫₀^{τ_∞(μ)} (−(d/dt)Ψ(t))/F^{-1}(Ψ(t)) dt = ∫_{Ψ(τ_∞(μ))}^{Ψ(0)} dx/F^{-1}(x)
    ≤ (∫_{μ(γ^o_min)²/(2κ)}^{(γ^o_min)²/(2κ)} + ∫_{(γ^o_min)²/(2κ)}^{F((γ^o_min)²)} + ∫_{F((γ^o_min)²)}^{F(s(γ^o_min)²)} + ∫_{F(s(γ^o_min)²)}^{F(d(0)²)}) dx/F^{-1}(x)
    ≤ ∫_{μ(γ^o_min)²/(2κ)}^{(γ^o_min)²/(2κ)} dx/(2κx) + ∫_{(γ^o_min)²/(2κ)}^{F((γ^o_min)²)} dx/(γ^o_min)² + ∫_{(γ^o_min)²}^{s(γ^o_min)²} dF(x)/x + ∫_{s(γ^o_min)²}^{d(0)²} dF(x)/x
    = (1/2κ)·log(1/μ) + 2/γ^o_min + ∫_{(γ^o_min)²}^{s(γ^o_min)²} (1/(2κx) + 2/(γ^o_min x)) dx + ∫_{s(γ^o_min)²}^{d(0)²} (1/(2κx) + √s/x^{3/2}) dx
    < (1/2κ)·log(1/μ) + 2/γ^o_min + (1/2κ)·log(d(0)²/(γ^o_min)²) + (2 log s)/γ^o_min + 2/γ^o_min
    ≤ (1/2κ)·log(1/μ) + (2 log s + 4 + d(0)²/κ)/γ^o_min = λ_H τ_∞(μ),

a contradiction. Thus (F.8) holds for some 0 ≤ τ ≤ τ_∞(μ). If κ = +∞, we see that for t ≥ τ, Ψ(t) ≤ Ψ(τ) = 0. Then −2L(t), the derivative of Ψ(t), is 0 (which means d(t) = 0) when t ≥ τ_∞(μ), and (F.8) holds. If κ < +∞, just note that for t ≥ τ,

    d(t)²/(2κ) ≤ Ψ(t) ≤ Ψ(τ) = d(τ)²/(2κ) ⇒ d(t)² ≤ d(τ)² ≤ μ(γ^o_min)².

So (F.8) holds for t ≥ τ_∞(μ).

For any t > 0, if L(t) = 0, then d(t) = 0 and (F.9) holds. If L(t) > 0, let C₀ = √(2L(t)/λ_H) > 0; then for any 0 ≤ t′ ≤ t,

    (d/dt′)Ψ(t′) = −2L(t′) ≤ −2L(t) = −λ_H C₀².

Besides, for F̃(x) = x/(2κ) + 2√(sx) ≥ F(x), by Lemma 1 we have

    (d/dt′)Ψ(t′) ≤ −λ_H F^{-1}(Ψ(t′)) ≤ −λ_H F̃^{-1}(Ψ(t′)).
By (F.5) and the fact that

    F̃(d(0)²) ≥ F̃(‖γ^o_S‖²) + ‖β^o‖²/(2κ) ≥ Ψ(0),

we have that, if d(0) > C₀, then

    λ_H t ≤ ∫₀^t (−(d/dt′)Ψ(t′))/max(C₀², F̃^{-1}(Ψ(t′))) dt′ = ∫_{Ψ(t)}^{Ψ(0)} dx/max(C₀², F̃^{-1}(x)) ≤ ∫_{F̃(0)}^{F̃(d(0)²)} dx/max(C₀², F̃^{-1}(x))
    = ∫_{F̃(0)}^{F̃(C₀²)} dx/C₀² + ∫_{C₀²}^{d(0)²} dF̃(x)/x = (C₀²/(2κ) + 2√s·C₀)/C₀² + ∫_{C₀²}^{d(0)²} (1/(2κx) + √s/x^{3/2}) dx
    ≤ 4√s/C₀ + (1/2κ)·(d(0)/C₀)² ≤ (4√s + d(0)²/κ)/C₀.

If d(0) ≤ C₀, then similarly

    λ_H t ≤ ∫_{F̃(0)}^{F̃(d(0)²)} dx/max(C₀², F̃^{-1}(x)) ≤ ∫_{F̃(0)}^{F̃(d(0)²)} dx/C₀² = (d(0)²/(2κ) + 2√s·d(0))/C₀² ≤ (4√s + d(0)²/κ)/C₀.

Then

    d(t)² ≤ (2/λ_H)L(t) = C₀² ≤ ((4√s + d(0)²/κ)/(λ_H t))².

Besides, noting (C.3), we have

    2L(0) = (d_β(0)^T, d_{γ,S}(0)^T) H_{(β,S),(β,S)} (d_β(0); d_{γ,S}(0)) ≤ ‖H‖·‖(d_β(0); d_γ(0))‖² ≤ ((νΛ_X² + Λ_D²)/ν)·d(0)².

Thus

    d(t)² ≤ (2/λ_H)L(t) ≤ (2/λ_H)L(0) ≤ ((νΛ_X² + Λ_D²)/(λ_H ν))·d(0)².

Thus (F.9) holds. □

Appendix G. Proof on Consistency of Split LBISS

Before proving Theorems 4 and 5, we need the following lemmas.

Lemma 7 (No-false-positive condition for Split LBISS). For the oracle dynamics (5.1), if there is τ > 0 such that for 0 ≤ t ≤ τ the inequality

    ‖H_{S^c,(β,S)} H_{(β,S),(β,S)}† ((0_p; ρ′_S(t)) + (1/κ)(β′(t); γ′_S(t)) − t(X*ε; 0_s))‖_∞ < 1 (G.1)

holds, then the solution path of the original dynamics (1.7) has no false positive for 0 ≤ t ≤ τ.

Proof of Lemma 7.
It is easy to see that

    (0_p; ρ̇(t)) + (1/κ)(β̇(t); γ̇(t)) = −H((β(t); γ(t)) − (β⋆; γ⋆)) + (X*ε; 0_m). (G.2)

Now define the exit time of the oracle subspace,

    τ_exit := inf(t ≥ 0 : ‖ρ_{S^c}(t)‖_∞ = 1).

It suffices to show τ_exit > τ. For 0 ≤ t < τ_exit, we have γ_{S^c}(t) = 0, which also implies the paths of Split LBISS and the oracle dynamics are identical, i.e. ρ_S(t) = ρ′_S(t) and γ_S(t) = γ′_S(t). Hence by (G.2) we have

    (0_p; ρ̇′_S(t)) + (1/κ)(β̇′(t); γ̇′_S(t)) = −H_{(β,S),(β,S)}((β′(t); γ′_S(t)) − (β⋆; γ⋆_S)) + (X*ε; 0_s), (G.3)
    ρ̇_{S^c}(t) = −H_{S^c,(β,S)}((β′(t); γ′_S(t)) − (β⋆; γ⋆_S)).

We claim that

    (β′(t); γ′_S(t)) − (β⋆; γ⋆_S) ∈ L ⊕ R^s = Im(H_{(β,S),(β,S)}†)

(the equality above will be shown at last), so by (G.3) we have

    (β′(t); γ′_S(t)) − (β⋆; γ⋆_S) = −H_{(β,S),(β,S)}†((0_p; ρ̇′_S(t)) + (1/κ)(β̇′(t); γ̇′_S(t)) − (X*ε; 0_s))
    ⇒ ρ̇_{S^c}(t) = H_{S^c,(β,S)} H_{(β,S),(β,S)}†((0_p; ρ̇′_S(t)) + (1/κ)(β̇′(t); γ̇′_S(t)) − (X*ε; 0_s)).
Integration on both sides leads to, for 0 ≤ t < τ_exit,

    ρ_{S^c}(t) = H_{S^c,(β,S)} H_{(β,S),(β,S)}†((0_p; ρ′_S(t)) + (1/κ)(β′(t); γ′_S(t)) − t(X*ε; 0_s)).

Due to the continuity of ρ_{S^c}(t), ρ′_S(t) (and γ′_S(t), if κ < +∞), the equation above also holds for t = τ_exit. According to the definition of τ_exit, we know (G.1) does not hold for t = τ_exit. Thus τ < τ_exit, and the desired result follows.

So it suffices to prove

    L ⊕ R^s = Im(H_{(β,S),(β,S)}†). (G.4)

Actually, let H_{(β,S),(β,S)} = U′Λ′U′^T, where U′^TU′ = I and Λ′ is an invertible diagonal matrix. It suffices to show L ⊕ R^s = Im(U′). First, by the definition of H, one can easily verify that

    Im(U′) = Im(H_{(β,S),(β,S)}) ⊆ (Im(X^T) + Im(D^T)) ⊕ R^s = L ⊕ R^s.

On the other hand, assume that (U′, Ũ′) is an orthogonal square matrix. For any ζ ∈ L ⊕ R^s, since P_{Im(U′)}ζ ∈ Im(U′) ⊆ L ⊕ R^s, we have P_{Im(Ũ′)}ζ = ζ − P_{Im(U′)}ζ ∈ L ⊕ R^s, and (2.3) tells us

    0 = ‖Λ′^{1/2}U′^T P_{Im(Ũ′)}ζ‖² ≥ λ_H ‖P_{Im(Ũ′)}ζ‖² ⇒ P_{Im(Ũ′)}ζ = 0
    ⇒ ζ = P_{Im(U′)}ζ + P_{Im(Ũ′)}ζ = P_{Im(U′)}ζ ∈ Im(U′).

Thus (G.4) holds. □

Lemma 8. Suppose Σ_{S,S} ⪰ λ_Σ I.
For β^o ∈ L and γ^o_S ∈ R^s satisfying (F.1), we have

    ‖β^o − β⋆‖² = ‖δ^o − δ⋆‖² + ‖ξ^o − ξ⋆‖², where δ^o − δ⋆ := V^T(β^o − β⋆), ξ^o − ξ⋆ := Ṽ^T(β^o − β⋆), (G.5)

and

    δ^o − δ⋆ = (νB^{-1} + B^{-1}ΛU_S^TΣ_{S,S}^{-1}U_SΛB^{-1}) V^TX*(I − UU^T) ε =: B_δ ε, with ‖B_δ‖ ≤ Λ_X/(√n·λ_Σλ_D²), (G.6)
    ξ^o − ξ⋆ = n^{-1/2}Λ₂^{-1}U^T(I − XVB_δ) ε =: B_ξ ε, with ‖B_ξ‖ ≤ 1/(√n·λ₂) + Λ_X²/(√n·λ₂λ_Σλ_D²). (G.7)

Besides, we have

    γ^o_S − γ⋆_S = Σ_{S,S}^{-1}U_SΛB^{-1}V^TX*(I − UU^T) ε =: B_γ ε, with ‖B_γ‖ ≤ Λ_X/(√n·λ_Σλ_D). (G.8)

Proof. By Lemma 4 and β^o − β⋆ ∈ L, we have (G.5). By (F.1), we have

    γ^o_S − γ⋆_S = D_S(β^o − β⋆) = U_SΛ(δ^o − δ⋆), (G.9)

and

    X*ε + D_S^T(γ^o_S − γ⋆_S)/ν = (X*X + D^TD/ν)(β^o − β⋆),
    X*ε + VΛU_S^T(γ^o_S − γ⋆_S)/ν = (X*X + VΛ²V^T/ν)(V(δ^o − δ⋆) + Ṽ(ξ^o − ξ⋆)) = (X*XV + VΛ²/ν)(δ^o − δ⋆) + √n X*UΛ₂(ξ^o − ξ⋆). (G.10)

Left multiplying Λ₂^{-2}Ṽ^T on both sides of (G.10) leads to

    ξ^o − ξ⋆ = (1/√n)Λ₂^{-1}U^T(ε − XV(δ^o − δ⋆)). (G.11)

Then left multiplying V^T on both sides of (G.10) leads to

    V^TX*ε + ΛU_S^T(γ^o_S − γ⋆_S)/ν = (V^TX*XV + Λ²/ν)(δ^o − δ⋆) + √n V^TX*UΛ₂ · (1/√n)Λ₂^{-1}U^T(ε − XV(δ^o − δ⋆))
    = (V^TX*(I − UU^T)XV + Λ²/ν)(δ^o − δ⋆) + V^TX*UU^T ε.
Recalling the definition of B in Lemma 5, the equation above implies

    δ^o − δ⋆ = B^{-1}ΛU_S^T(γ^o_S − γ⋆_S) + νB^{-1}V^TX*(I − UU^T) ε. (G.12)

Plugging it into (G.9), we obtain γ^o_S − γ⋆_S = B_γ ε. Then noting B ⪰ λ_D² I, we have

    ‖B_γ‖ ≤ ‖Σ_{S,S}^{-1}‖·‖ΛB^{-1}‖·‖X*‖·‖I − UU^T‖ ≤ Λ_X/(√n·λ_Σλ_D),

so (G.8) holds. Now by (G.12) we have δ^o − δ⋆ = B_δ ε. Noting (B.11) and Σ_{S,S} ⪰ λ_Σ I, we have

    U_SΛB^{-1/2}·B^{-1/2}ΛU_S^T ⪯ (1 − λ_Σν)I ⇔ B^{-1/2}ΛU_S^T·U_SΛB^{-1/2} ⪯ (1 − λ_Σν)I ⇔ ΛU_S^TU_SΛ ⪯ (1 − λ_Σν)B.

Thus

    νB^{-1} + B^{-1}ΛU_S^TΣ_{S,S}^{-1}U_SΛB^{-1} ⪯ νB^{-1} + (1/λ_Σ)B^{-1}ΛU_S^TU_SΛB^{-1} ⪯ νB^{-1} + ((1 − λ_Σν)/λ_Σ)B^{-1} = (1/λ_Σ)B^{-1},

which immediately leads to (G.6). Finally, combining (G.11) with (G.6) we have (G.7). □

Proof of Theorem 4. By (B.3), (G.7) and (G.8), we have that with probability not less than 1 − 2s/m² ≥ 1 − 2/m,

    ‖γ^o_S − γ⋆_S‖_∞ < (2σ/λ_H)·(Λ_X/λ_D)·√(log m/n), (G.13)
    ‖ξ^o − ξ⋆‖_∞ < (2σ/λ_H)·((λ_Hλ_D² + Λ_X²)/(λ₂λ_D²))·√(log m/n). (G.14)

By (B.4) and (G.5) to (G.8), with probability not less than 1 − 2/m − 2e^{−n/8},

    ‖ε‖ ≤ 2σ√n,

which implies

    ‖γ^o_S − γ⋆_S‖ < (2σ/λ_H)·(Λ_X/λ_D), ‖δ^o − δ⋆‖ < (2σ/λ_H)·(Λ_X/λ_D²), ‖ξ^o − ξ⋆‖ < (2σ/λ_H)·((λ_Hλ_D² + Λ_X²)/(λ₂λ_D²)).
(G.15)

The inequalities above also imply

    ‖β^o − β⋆‖ ≤ ‖δ^o − δ⋆‖ + ‖ξ^o − ξ⋆‖ < (2σ/λ_H)·(Λ_X/λ_D² + (λ_Hλ_D² + Λ_X²)/(λ₂λ_D²)), (G.16)

and

    d(0) = √(‖γ^o_S‖² + ‖β^o‖²) ≤ ‖γ⋆_S‖ + ‖β⋆‖ + ‖γ^o_S − γ⋆_S‖ + ‖β^o − β⋆‖
    < (1 + Λ_D)‖β⋆‖ + (2σ/λ_H)·(2Λ_X/λ_D + Λ_X/λ_D² + (λ_Hλ_D² + Λ_X²)/(λ₂λ_D²)). (G.17)

From now on, we assume all the inequalities above hold. The condition on κ now tells us

    κ ≥ (8/η)·(1 + (λ_D² + Λ_X²)/(λ₁²λ_D²))·√((νΛ_X² + Λ_D²)/(λ_H ν))·d(0) (≥ d(0)). (G.18)

Now we prove the no-false-positive property. By Lemma 7, it suffices to show that for 0 ≤ t ≤ τ̄, (G.1) holds with probability not less than 1 − 2/m. By (B.10), (D.4) and (F.9),

    (1/κ)‖H_{S^c,(β,S)} H_{(β,S),(β,S)}† (β′(t); γ′_S(t))‖_∞ = ‖(−D_{S^c} + Σ_{S^c,S}Σ_{S,S}^{-1}D_S)A†β′(t) + Σ_{S^c,S}Σ_{S,S}^{-1}γ′_S(t)‖_∞/κ
    ≤ ‖D_{S^c}A†β′(t)‖_∞/κ + ‖Σ_{S^c,S}Σ_{S,S}^{-1}D_SA†β′(t)‖_∞/κ + ‖γ′_S(t)‖_∞/κ ≤ 2‖DA†‖·‖β′(t)‖/κ + ‖γ′_S(t)‖/κ
    ≤ (2(λ_D² + Λ_X²)/(λ_D²λ₁²) + 1)·√(‖β′(t)‖² + ‖γ′_S(t)‖²)/κ ≤ (2(λ_D² + Λ_X²)/(λ_D²λ₁²) + 1)·(d(0) + d(t))/κ
    ≤ 4(1 + (λ_D² + Λ_X²)/(λ_D²λ₁²))·√((νΛ_X² + Λ_D²)/(λ_H ν))·d(0)/κ ≤ η/2.
Besides, by (D.4) we have

    ‖H_{S^c,(β,S)} H_{(β,S),(β,S)}† (X*ε; 0)‖_∞ = ‖(−D_{S^c} + Σ_{S^c,S}Σ_{S,S}^{-1}D_S)A†X*ε‖_∞ ≤ ‖D_{S^c}A†X*ε‖_∞ + ‖D_SA†X*ε‖_∞ ≤ 2‖DA†X*ε‖_∞.

By (B.11), DA†D^T = UΛB^{-1}ΛU^T and Λ² ⪯ B ⪯ (1 + νΛ_X²/λ_D²)Λ²; therefore 1 is an upper bound of the largest eigenvalue of DA†D^T, and 1/(1 + νΛ_X²/λ_D²) is a lower bound of the smallest nonzero eigenvalue of DA†D^T. Then

    DA†X*(DA†X*)^T = (1/(nν))DA†(A − D^TD)A†D^T = (1/(nν))(DA†D^T − (DA†D^T)²)
    ⪯ (1/(nν))·min(1/4, (νΛ_X²/λ_D²)/(1 + νΛ_X²/λ_D²))·I ⪯ (Λ_X²/(nλ_D²))·I.

By (B.3), with probability not less than 1 − 2/m, for any 0 ≤ t ≤ τ̄,

    ‖H_{S^c,(β,S)} H_{(β,S),(β,S)}† · t(X*ε; 0)‖_∞ ≤ 2τ̄·‖DA†X*ε‖_∞ ≤ 2τ̄·σ·(Λ_X/λ_D)·√(2 log m/n) < η/2.

Combining the results above with Assumption 2, we have for 0 ≤ t ≤ τ̄ that (G.1) holds with probability not less than 1 − 2/m, and we have the no-false-positive property (which tells that (β(t), γ_S(t)) coincides with that of the oracle dynamics for 0 ≤ t ≤ τ̄).

Next we prove the sign consistency of γ(t). If the γ⋆_min condition (4.3) holds, by (G.13),

    ‖γ^o_S − γ⋆_S‖_∞ ≤ (2σ/λ_H)·(Λ_X/λ_D)·√(log m/n) ≤ γ⋆_min/4 ⇒ γ^o_min ≥ 3γ⋆_min/4. (G.19)

Thus sign(γ^o_S) = sign(γ⋆_S), and

    γ^o_min ≥ 3γ⋆_min/4 ≥ (2 log s + 5)/(λ_H τ̄) > (2 log s + 4 + d(0)²/κ)/(λ_H τ̄) ⇒ τ̄ > (2 log s + 4 + d(0)²/κ)/(λ_H γ^o_min).
By (F.8), the sign consistency of γ′_S(t) holds for

    t > inf_{0<μ<1} ((1/(2κλ_H))·log(1/μ) + (2 log s + 4 + d(0)²/κ)/(λ_H γ^o_min)) = (2 log s + 4 + d(0)²/κ)/(λ_H γ^o_min),

thus also for τ̄. Then under the no-false-positive property,

    sign(γ_S(τ̄)) = sign(γ′_S(τ̄)) = sign(γ^o_S) = sign(γ⋆_S), and sign(γ_{S^c}(τ̄)) = 0 = sign(γ⋆_{S^c}).

Now we prove the ℓ₂ consistency of γ(t). Under the no-false-positive property, for 0 ≤ t ≤ τ̄,

    ‖γ(t) − Dβ⋆‖ = ‖γ′_S(t) − γ⋆_S‖ ≤ ‖d_{γ,S}(t)‖ + ‖γ^o_S − γ⋆_S‖ ≤ d(t) + √s·‖γ^o_S − γ⋆_S‖_∞
    ≤ (4√s + d(0)²/κ)/(λ_H t) + (2σ/λ_H)·(Λ_X/λ_D)·√(s log m/n).

Finally, we prove the ℓ₂ consistency of β(t). Under the no-false-positive property, for 0 ≤ t ≤ τ̄,

    ‖β(t) − β⋆‖ = ‖β′(t) − β⋆‖ ≤ ‖d_β(t)‖ + ‖β^o − β⋆‖ ≤ d(t) + ‖β^o − β⋆‖.
By Lemma 8 (especially noting (G.12)), we have

    ‖β^o − β⋆‖ ≤ ‖δ^o − δ⋆‖ + ‖ξ^o − ξ⋆‖ ≤ ‖(1/√n)Λ₂^{-1}U^Tε‖ + (1 + ‖(1/√n)Λ₂^{-1}U^TXV‖)·‖δ^o − δ⋆‖
    ≤ √r′·‖(1/√n)Λ₂^{-1}U^Tε‖_∞ + (1 + Λ_X/λ₂)·(ν‖B^{-1}V^TX*(I − UU^T)ε‖ + ‖B^{-1}ΛU_S^T‖·√s·‖γ^o_S − γ⋆_S‖_∞)
    ≤ √r′·‖(1/√n)Λ₂^{-1}U^Tε‖_∞ + (1 + Λ_X/λ₂)·(2νσΛ_X/λ_D² + (1/λ_D)·√s·(2σ/λ_H)·(Λ_X/λ_D)·√(log m/n)).

By (B.3), with probability not less than 1 − 2/m, we have

    ‖(1/√n)Λ₂^{-1}U^Tε‖_∞ ≤ 2σ·‖(1/√n)Λ₂^{-1}U^T‖·√(log m) ≤ (2σ/λ₂)·√(log m/n).

In this case, combining the inequalities above with d(t) ≤ (4√s + d(0)²/κ)/(λ_H t), the desired result follows. □

Proof of Theorem 5. By the proof details of Theorem 4, we know that with probability not less than 1 − 4/m − 2e^{−n/8}, the no-false-positive property and all the inequalities established there hold for 0 ≤ t ≤ τ̄. From now on, we assume that these properties are all valid.

First we prove the sign consistency of β̃(t). If the γ⋆_min condition (4.3) holds, then by Theorem 4, S(τ̄) = S holds, and we have

    D_{S^c}P_{S(τ̄)} = D_{S^c}(I − D†_{S^c}D_{S^c}) = 0 ⇒ sign(D_{S^c}β̃(τ̄)) = 0 = sign(D_{S^c}β⋆).
To prove sign(D_Sβ̃(τ̄)) = sign(D_Sβ⋆), note that

    ‖D_Sβ̃(τ̄) − D_Sβ⋆‖_∞ = ‖D_S(I − D†_{S^c}D_{S^c})(β′(τ̄) − β⋆)‖_∞
    ≤ ‖D_S(I − D†_{S^c}D_{S^c})d_β(τ̄)‖_∞ + ‖D_S(I − D†_{S^c}D_{S^c})(β^o − β⋆)‖_∞
    ≤ ‖D_S(I − D†_{S^c}D_{S^c})d_β(τ̄)‖_∞ + ‖γ^o_S − γ⋆_S‖_∞ + ‖D_SD†_{S^c}D_{S^c}(β^o − β⋆)‖_∞.

First, by (G.18), κ ≥ d(0) ≥ ‖γ^o_S‖ ≥ γ^o_min, and

    τ̄ ≥ (log(64Λ_D²))/(λ_H γ^o_min) + (2 log s + 5)/(λ_H γ^o_min) ≥ (1/(2κλ_H))·log(64Λ_D²) + (2 log s + 4 + d(0)²/κ)/(λ_H γ^o_min).

By (F.8) (with μ = 1/(64Λ_D²)), we have d(τ̄) ≤ γ^o_min/(8Λ_D), and thus

    ‖D_S(I − D†_{S^c}D_{S^c})d_β(τ̄)‖_∞ ≤ ‖D_S‖·‖I − D†_{S^c}D_{S^c}‖·‖d_β(τ̄)‖ ≤ Λ_D·d(τ̄) ≤ γ^o_min/8 ≤ γ⋆_min/4,

where the last step uses γ^o_min ≤ γ⋆_min + ‖γ^o_S − γ⋆_S‖_∞ ≤ (5/4)γ⋆_min. Besides, by (G.6), we have

    D_SD†_{S^c}D_{S^c}(β^o − β⋆) = U_SΛV^TD†_{S^c}U_{S^c}Λ(δ^o − δ⋆) = U_SΛV^TD†_{S^c}U_{S^c}ΛB_δ ε

with

    ‖U_SΛV^TD†_{S^c}U_{S^c}ΛB_δ‖ ≤ Λ_D·‖D†_{S^c}·U_{S^c}ΛV^T‖·‖B_δ‖ ≤ Λ_XΛ_D/(√n·λ_Hλ_D²).

By (B.3), with probability not less than 1 − 2/m,

    ‖D_SD†_{S^c}D_{S^c}(β^o − β⋆)‖_∞ < (2σ/λ_H)·(Λ_XΛ_D/λ_D²)·√(log m/n) ≤ γ⋆_min/4.

Finally, we note (G.19).
Then sign( D S ˜ β (¯ τ )) = sign( D S β (cid:63) ) holds, since (cid:13)(cid:13)(cid:13) D S (cid:16) ˜ β (¯ τ ) − β (cid:63) (cid:17)(cid:13)(cid:13)(cid:13) ∞ < γ (cid:63) min γ (cid:63) min γ (cid:63) min D S β (cid:63) ) min . Then we prove the (cid:96) consistency of ˜ β ( t ). For any 0 ≤ t ≤ ¯ τ , S ( t ) ⊆ S ,which implies D S c ˜ β ( t ) = D S c β (cid:63) = 0. Then (cid:13)(cid:13)(cid:13) ˜ β ( t ) − β (cid:63) (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) V T (cid:16) ˜ β ( t ) − β (cid:63) (cid:17)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) V T ˜ V T (cid:16) ˜ β ( t ) − β (cid:63) (cid:17)(cid:13)(cid:13)(cid:13) ≤ (cid:0)(cid:13)(cid:13) V T P S ( t ) ( β (cid:48) ( t ) − β (cid:63) ) (cid:13)(cid:13) + (cid:13)(cid:13) V T (cid:0) I − P S ( t ) (cid:1) β (cid:63) (cid:13)(cid:13) (cid:1) + (cid:16)(cid:13)(cid:13)(cid:13) V T ˜ V T P S ( t ) ( β (cid:48) ( t ) − β (cid:63) ) (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) V T ˜ V T (cid:0) I − P S ( t ) (cid:1) β (cid:63) (cid:13)(cid:13)(cid:13) (cid:17) ≤ (cid:13)(cid:13) V T P S ( t ) ( β (cid:48) ( t ) − β (cid:63) ) (cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) V T ˜ V T P S ( t ) ( β (cid:48) ( t ) − β (cid:63) ) (cid:13)(cid:13)(cid:13) +2 (cid:13)(cid:13)(cid:13) D † S ( t ) c D S ( t ) c ∩ S β (cid:63) (cid:13)(cid:13)(cid:13) . 
The first and second terms of the right-hand side are respectively not greater than
\[
\left\|V^TP_{S(t)}d_\beta(t)\right\| + \left\|V^TP_{S(t)}\left(\beta^o-\beta^\star\right)\right\|
\le \|d_\beta(t)\| + \frac{1}{\lambda_D}\left\|DP_{S(t)}\left(\beta^o-\beta^\star\right)\right\|
\le d(t) + \frac{1}{\lambda_D}\left\|D_{S(t)}P_{S(t)}\left(\beta^o-\beta^\star\right)\right\|
= d(t) + \frac{1}{\lambda_D}\left\|U_{S(t)}\Lambda\left(I-V^TD_{S(t)^c}^{\dagger}U_{S(t)^c}\Lambda\right)\left(\delta^o-\delta^\star\right)\right\|
\]
(using $D_{S(t)^c}P_{S(t)}=0$), and
\[
\left\|\tilde V^TP_{S(t)}d_\beta(t)\right\| + \left\|\tilde V^TP_{S(t)}\left(\beta^o-\beta^\star\right)\right\|
\le \|d_\beta(t)\| + \left\|\left(\xi^o-\xi^\star\right)-\tilde V^TD_{S(t)^c}^{\dagger}D_{S(t)^c}\left(\beta^o-\beta^\star\right)\right\|
\le d(t) + \|\xi^o-\xi^\star\| + \left\|\tilde V^TD_{S(t)^c}^{\dagger}U_{S(t)^c}\Lambda\left(\delta^o-\delta^\star\right)\right\|.
\]
Noting (F.9) and (G.14), as well as applying the definition of $B_\delta$ in Lemma 8, now we only need to show that with probability not less than $1-C/m-C'r'/m$,
\[
\left\|U_{S(t)}\Lambda\left(I-V^TD_{S(t)^c}^{\dagger}U_{S(t)^c}\Lambda\right)B_\delta\epsilon\right\|_\infty
\le \sigma\lambda_H\cdot\frac{\Lambda_D\Lambda_X}{\lambda_D}\sqrt{\frac{\log m}{n}},
\qquad
\left\|\tilde V^TD_{S(t)^c}^{\dagger}U_{S(t)^c}\Lambda B_\delta\epsilon\right\|_\infty
\le \sigma\lambda_H\cdot\frac{\Lambda_X}{\lambda_D}\sqrt{\frac{\log m}{n}},
\]
which are both true, according to (B.3), as well as (G.6) which leads to
\[
\left\|U_{S(t)}\Lambda\left(I-V^TD_{S(t)^c}^{\dagger}U_{S(t)^c}\Lambda\right)B_\delta\right\|
\le \Lambda_D\left(1+\left\|V^TD_{S(t)^c}^{\dagger}\cdot U_{S(t)^c}\Lambda V^T\right\|\right)\|B_\delta\|
\le \frac{\Lambda_X\Lambda_D}{\sqrt n}\cdot\frac{\lambda_H}{\lambda_D},
\]
and
\[
\left\|\tilde V^TD_{S(t)^c}^{\dagger}U_{S(t)^c}\Lambda B_\delta\right\|
\le \left\|D_{S(t)^c}^{\dagger}\cdot U_{S(t)^c}\Lambda V^T\right\|\cdot\|B_\delta\|
\le \frac{\Lambda_X}{\sqrt n}\cdot\frac{\lambda_H}{\lambda_D}. \qquad\Box
\]

Appendix H. Proof on Consistency of Split LBI

Proof of Theorems 6 and 7. They are merely discrete versions of the proofs of Theorems 4 and 5. In the proofs, Lemmas 9 and 10 stated below are applied instead of Lemmas 1 and 6. $\Box$

Specifically, one can define the oracle iteration of Split LBI as an oracle version of Split LBI (1.6) (with $S$ known and $\rho_{k,S^c}$, $\gamma_{k,S^c}$ set to be 0), resembling the idea of the oracle dynamics of Split LBISS. Define
\[
\Psi_k := \|\gamma^o_S\|_1 - \langle\gamma^o_S,\rho_{k,S}\rangle + \|\gamma_{k,S}-\gamma^o_S\|^2/(2\kappa) + \|\beta_k-\beta^o\|^2/(2\kappa).
\]
Then we have

Lemma 9 (Discrete Generalized Bihari's inequality). Under Assumption 1, suppose $\kappa\alpha\|H\|<2$ and $\lambda'_H=\lambda_H(1-\kappa\alpha\|H\|/2)$. For all $k$ we have
\[
\Psi_{k+1}-\Psi_k \le -\alpha\lambda'_H F^{-1}(\Psi_k),
\]
where $\gamma^o_{\min}$, $F(x)$, $F^{-1}(x)$ are defined the same as in Lemma 1.

Proof of Lemma 9.
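Remark (numerical sketch of (1.6)). For orientation before the estimates below, the Split LBI iteration (1.6) analyzed in Lemma 9 can be simulated directly: gradient descent in $\beta$ and a linearized Bregman step in $\gamma$ through an auxiliary variable $z$, with the quadratic loss $\ell(\beta,\gamma)=\|y-X\beta\|^2/(2n)+\|\gamma-D\beta\|^2/(2\nu)$, following Huang et al. (2016). The minimal Python sketch below is illustrative, not prescribed by the text; in particular, the default step size is only a crude choice ensuring $\kappa\alpha\|H\|<2$ as Lemma 9 requires.

```python
import numpy as np

def shrink(z):
    # soft-thresholding at level 1: the proximal map of the l1 norm
    return np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)

def split_lbi(X, y, D, kappa=100.0, nu=1.0, alpha=None, n_iters=500):
    # Sketch of the Split LBI iteration (Huang et al., 2016):
    #   beta_{k+1}  = beta_k - kappa * alpha * grad_beta  l(beta_k, gamma_k)
    #   z_{k+1}     = z_k    -         alpha * grad_gamma l(beta_k, gamma_k)
    #   gamma_{k+1} = kappa * shrink(z_{k+1})
    # with l(beta, gamma) = ||y - X beta||^2/(2n) + ||gamma - D beta||^2/(2 nu).
    n, p = X.shape
    m = D.shape[0]
    beta, gamma, z = np.zeros(p), np.zeros(m), np.zeros(m)
    if alpha is None:
        # crude upper bound on ||H||, so that kappa * alpha * ||H|| < 2
        # holds (cf. Lemma 9); this default is our assumption
        H_bound = np.linalg.norm(X, 2) ** 2 / n \
                  + (1 + np.linalg.norm(D, 2) ** 2) / nu
        alpha = 1.0 / (kappa * H_bound)
    path = []
    for _ in range(n_iters):
        resid = X @ beta - y
        grad_beta = X.T @ resid / n - D.T @ (gamma - D @ beta) / nu
        grad_gamma = (gamma - D @ beta) / nu
        beta = beta - kappa * alpha * grad_beta
        z = z - alpha * grad_gamma
        gamma = kappa * shrink(z)
        path.append((beta.copy(), gamma.copy()))
    return path
```

On easy noiseless data with $D=I$, $\gamma$ stays at zero until the accumulated subgradient $z$ crosses the unit threshold, after which the support enters the path; this is exactly the dynamics that Lemmas 9 and 10 quantify.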
The proof is almost a discrete version of the continuous case. The only non-trivial step is to show that
\[
\Psi_{k+1}-\Psi_k \le -\alpha\left(1-\kappa\alpha\|H\|/2\right)L_k,
\]
where
\[
L_k := \frac{1}{2}\left(d_{k,\beta}^T,\,d_{k,\gamma,S}^T\right)H_{(\beta,S),(\beta,S)}\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix},
\qquad
\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix}:=\begin{pmatrix}\beta'_k-\beta^o\\ \gamma'_{k,S}-\gamma^o_S\end{pmatrix}.
\]
By (1.6), we have
\[
-\alpha H_{(\beta,S),(\beta,S)}\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix}
=\begin{pmatrix}0\\ \rho'_{k+1,S}-\rho'_{k,S}\end{pmatrix}
+\frac{1}{\kappa}\begin{pmatrix}\beta'_{k+1}-\beta'_k\\ \gamma'_{k+1,S}-\gamma'_{k,S}\end{pmatrix}.
\]
Noting $(\rho'_{k+1,S}-\rho'_{k,S})^T\gamma'_{k+1,S}\ge 0$ and multiplying $(d_{k,\beta}^T,\,d_{k,\gamma,S}^T)$ on both sides, we have
\[
-2\alpha L_k = d_{k,\gamma,S}^T\left(\rho'_{k+1,S}-\rho'_{k,S}\right)
+\frac{1}{\kappa}\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix}^T\begin{pmatrix}\beta'_{k+1}-\beta'_k\\ \gamma'_{k+1,S}-\gamma'_{k,S}\end{pmatrix}
\ge -\left(\rho'_{k+1,S}-\rho'_{k,S}\right)^T\left(\gamma'_{k+1,S}-\gamma'_{k,S}\right)
-\left(\rho'_{k+1,S}-\rho'_{k,S}\right)^T\gamma^o_S
+\frac{1}{\kappa}\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix}^T\begin{pmatrix}\beta'_{k+1}-\beta'_k\\ \gamma'_{k+1,S}-\gamma'_{k,S}\end{pmatrix}.
\]
Thus
\[
\Psi_{k+1}-\Psi_k
= -\left(\rho'_{k+1,S}-\rho'_{k,S}\right)^T\gamma^o_S
+ \frac{1}{2\kappa}\left(\left\|\begin{pmatrix}d_{k+1,\beta}\\ d_{k+1,\gamma,S}\end{pmatrix}\right\|^2
- \left\|\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix}\right\|^2\right)
\]
\[
= -\left(\rho'_{k+1,S}-\rho'_{k,S}\right)^T\gamma^o_S
+ \frac{1}{2\kappa}\begin{pmatrix}\beta'_{k+1}-\beta'_k\\ \gamma'_{k+1,S}-\gamma'_{k,S}\end{pmatrix}^T
\left(\begin{pmatrix}\beta'_{k+1}-\beta'_k\\ \gamma'_{k+1,S}-\gamma'_{k,S}\end{pmatrix}
+ 2\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix}\right)
\]
\[
\le -2\alpha L_k
+ \left(\rho'_{k+1,S}-\rho'_{k,S}\right)^T\left(\gamma'_{k+1,S}-\gamma'_{k,S}\right)
+ \frac{1}{2\kappa}\left\|\begin{pmatrix}\beta'_{k+1}-\beta'_k\\ \gamma'_{k+1,S}-\gamma'_{k,S}\end{pmatrix}\right\|^2
\]
\[
\le -2\alpha L_k
+ \frac{\kappa}{2}\left\|\begin{pmatrix}0\\ \rho'_{k+1,S}-\rho'_{k,S}\end{pmatrix}
+ \frac{1}{\kappa}\begin{pmatrix}\beta'_{k+1}-\beta'_k\\ \gamma'_{k+1,S}-\gamma'_{k,S}\end{pmatrix}\right\|^2
= -\left(d_{k,\beta}^T,\,d_{k,\gamma,S}^T\right)\left(\alpha H_{(\beta,S),(\beta,S)}
- \frac{\kappa\alpha^2}{2}H_{(\beta,S),(\beta,S)}^2\right)\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix}
\]
\[
\le -\alpha\left(1-\frac{\kappa\alpha\left\|H_{(\beta,S),(\beta,S)}\right\|}{2}\right)
\left(d_{k,\beta}^T,\,d_{k,\gamma,S}^T\right)H_{(\beta,S),(\beta,S)}\begin{pmatrix}d_{k,\beta}\\ d_{k,\gamma,S}\end{pmatrix}
\le -\alpha\left(1-\kappa\alpha\|H\|/2\right)L_k. \qquad\Box
\]

Lemma 10. Under Assumption 1, suppose $\kappa\alpha\|H\|<2$ and $\lambda'_H=\lambda_H(1-\kappa\alpha\|H\|/2)$. Let
\[
\gamma^o_{\min}:=\min\left\{|\gamma^o_j|:\gamma^o_j\neq 0\right\},\qquad
d_{k,\beta}=\beta'_k-\beta^o,\qquad d_{k,\gamma}=\gamma'_k-\gamma^o,\qquad
d_k=\sqrt{\|d_{k,\beta}\|^2+\|d_{k,\gamma,S}\|^2}.
\]
Then for any $k$ such that
\[
k\alpha \ge \tau'_\infty(\mu) := \frac{\log\frac{1}{\mu}+2\log s+4+d_0^2/\kappa}{\lambda'_H\gamma^o_{\min}} + 4\alpha \qquad (0<\mu<1), \tag{H.1}
\]
we have
\[
d_k \le \mu\gamma^o_{\min}\quad\left(\Longrightarrow\ \mathrm{sign}\left(\gamma'_{k,S}\right)=\mathrm{sign}(\gamma^o_S),\ \text{if }\gamma^o_j\neq 0\text{ for }j\in S\right). \tag{H.2}
\]
For any $k$, we have
\[
d_k \le \min\left(\frac{\sqrt s + d_0^2/\kappa}{\lambda'_H k\alpha},\
\sqrt{\frac{\nu\Lambda_X^2+\Lambda_D^2}{\lambda'_H\nu}}\cdot d_0\right). \tag{H.3}
\]

Proof of Lemma 10. The proof is almost a discrete version of the continuous case. The only non-trivial part is described as follows. First, suppose there does not exist $k\le\tau'_\infty(\mu)/\alpha$ satisfying (H.2); then for any $0\le k\alpha\le\tau'_\infty(\mu)$, we have $\Psi_k>\mu^2(\gamma^o_{\min})^2/(2\kappa)$. Letting $k_0=0$, then $\Psi_{k_0}=\Psi_0\le F(d_0^2)$. Suppose that
\[
F\left(d_0^2\right)\ge\Psi_{k_0},\dots,\Psi_{k_1-1} > F\left(s(\gamma^o_{\min})^2\right)\ge\Psi_{k_1},\dots,\Psi_{k_2-1} > F\left((\gamma^o_{\min})^2\right)\ge\Psi_{k_2},\dots,
\]
\[
\Psi_{k_3-1} > (\gamma^o_{\min})^2/(2\kappa)\ge\Psi_{k_3},\dots,\Psi_{k_4-1} > \mu^2(\gamma^o_{\min})^2/(2\kappa)\ge\Psi_{k_4},\dots
\]
Then $k_4\alpha>\tau'_\infty(\mu)$. Besides, by Lemma 9,
\[
\alpha \le \frac{\Psi_k-\Psi_{k+1}}{\lambda'_H F^{-1}(\Psi_k)}\qquad (0\le k\alpha\le\tau'_\infty(\mu)).
\]
Hence $\lambda'_H(k_4-k_0)\alpha$ is not greater than
\[
\left(\sum_{k=k_0}^{k_1-1}+\sum_{k=k_1}^{k_2-1}+\sum_{k=k_2}^{k_3-1}+\sum_{k=k_3}^{k_4-1}\right)\frac{\Psi_k-\Psi_{k+1}}{F^{-1}(\Psi_k)}
\le \sum_{k=k_3}^{k_4-1}\frac{\Psi_k-\Psi_{k+1}}{2\kappa\Psi_k}
+ \sum_{k=k_2}^{k_3-1}\frac{\Psi_k-\Psi_{k+1}}{(\gamma^o_{\min})^2}
+ \sum_{k=k_1}^{k_2-1}\frac{F(\Delta_k)-F(\Delta_{k+1})}{\Delta_k}
+ \sum_{k=k_0}^{k_1-1}\frac{F(\Delta_k)-F(\Delta_{k+1})}{\Delta_k}
\quad\left(\Delta_k := F^{-1}(\Psi_k)\right)
\]
\[
= \sum_{k=k_3}^{k_4-1}\frac{\Psi_k-\Psi_{k+1}}{2\kappa\Psi_k}
+ \sum_{k=k_2}^{k_3-1}\frac{\Psi_k-\Psi_{k+1}}{(\gamma^o_{\min})^2}
+ \sum_{k=k_1}^{k_2-1}\left(\frac{\Delta_k-\Delta_{k+1}}{2\kappa\Delta_k}
+ \frac{2(\Delta_k-\Delta_{k+1})}{\gamma^o_{\min}\Delta_k}\right)
+ \sum_{k=k_0}^{k_1-1}\left(\frac{\Delta_k-\Delta_{k+1}}{2\kappa\Delta_k}
+ \frac{2\sqrt s\left(\sqrt{\Delta_k}-\sqrt{\Delta_{k+1}}\right)}{\Delta_k}\right).
\]
By $(u-v)/u\le\log(u/v)$ and $(\sqrt u-\sqrt v)/u\le 1/\sqrt v-1/\sqrt u$ for $u\ge v>0$, the sum above is not greater than
\[
\frac{\log(\Psi_{k_3}/\Psi_{k_4-1})}{2\kappa}
+ \frac{\Psi_{k_2}-\Psi_{k_3}}{(\gamma^o_{\min})^2}
+ \frac{\log(\Delta_{k_1}/\Delta_{k_2-1})}{2\kappa}
+ \frac{2\log(\Delta_{k_1}/\Delta_{k_2-1})}{\gamma^o_{\min}}
+ 2\sqrt s\left(\frac{1}{\sqrt{\Delta_{k_1}}}-\frac{1}{\sqrt{\Delta_{k_0}}}\right)
\]
\[
< \frac{\log(1/\mu^2)}{2\kappa}
+ \frac{2\gamma^o_{\min}}{(\gamma^o_{\min})^2}
+ \frac{\log\left(d_0^2/(\gamma^o_{\min})^2\right)}{2\kappa}
+ \frac{2\log s}{\gamma^o_{\min}}
+ \frac{2\sqrt s}{\sqrt{s(\gamma^o_{\min})^2}}.
\]
Therefore we get
\[
\lambda'_H\left(\tau'_\infty(\mu)-4\alpha\right) \le \lambda'_H(k_4-k_0)\alpha
< \frac{\log\frac{1}{\mu}+2\log s+4+d_0^2/\kappa}{\gamma^o_{\min}},
\]
a contradiction with the definition of $\tau'_\infty(\mu)$. So there exists some $k\le\tau'_\infty(\mu)/\alpha$ satisfying (H.2). Then, continuing to imitate the proof in the continuous version, we obtain (H.2) for all $k\ge\tau'_\infty(\mu)/\alpha$. The proof of (H.3) follows the same spirit. $\Box$

References

Arnold, T. B., Tibshirani, R. J., 2016. Efficient implementations of the generalized lasso dual path algorithm. Journal of Computational and Graphical Statistics 25 (1), 1–27.

Bühlmann, P., Yu, B., 2002. Boosting with the ℓ2-loss: Regression and classification. Journal of the American Statistical Association 98, 324–340.

Cai, T., Wang, L., 2011. Orthogonal matching pursuit for sparse signal recovery. IEEE Transactions on Information Theory 57 (7), 4680–4688.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. The Annals of Statistics 32 (2), 407–499.

Goldstein, T., Osher, S., 2009. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences 2 (2), 323–343.

Hoefling, H., 2010. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics 19 (4), 984–1006.

Huang, C., Sun, X., Xiong, J., Yao, Y., 2016. Split LBI: An iterative regularization path with structural sparsity. In: Advances in Neural Information Processing Systems (NIPS) 29. pp. 3369–3377.

Lee, J. D., Sun, Y., Taylor, J. E., 2013. On model selection consistency of penalized M-estimators: a geometric theory.
In: Advances in Neural Information Processing Systems (NIPS) 26. pp. 342–350.

Liu, J., Yuan, L., Ye, J., 2013. Guaranteed sparse recovery under linear transformation. In: Proceedings of the 30th International Conference on Machine Learning (ICML). pp. 91–99.

Moeller, M., 2012. Multiscale methods for polyhedral regularizations and applications in high dimensional imaging. Ph.D. thesis, University of Muenster.

Osher, S., Ruan, F., Xiong, J., Yao, Y., Yin, W., 2016. Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis 41 (2), 436–469.

Sharpnack, J., Singh, A., Rinaldo, A., 2012. Sparsistency of the edge lasso over graphs. In: International Conference on Artificial Intelligence and Statistics. pp. 1028–1036.

Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 267–288.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K., 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 91–108.

Tibshirani, R. J., Taylor, J., 2011. The solution path of the generalized lasso. The Annals of Statistics 39 (3), 1335–1371.

Tropp, J. A., 2004. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory 50 (10), 2231–2242.

Vaiter, S., Peyré, G., Dossal, C., Fadili, J., 2013. Robust sparse analysis regularization. IEEE Transactions on Information Theory 59 (4), 2001–2016.

Wahlberg, B., Boyd, S., Annergren, M., Wang, Y., 2012. An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes 45 (16), 83–88.

Wainwright, M. J., 2009. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory 55 (5), 2183–2202.

Xu, Q., Xiong, J., Huang, Q., Yao, Y., 2014. Robust statistical ranking: Theory and algorithms.
arXiv:1408.3467 [cs, stat]. URL http://arxiv.org/abs/1408.3467

Yao, Y., Rosasco, L., Caponnetto, A., 2007. On early stopping in gradient descent learning. Constructive Approximation 26 (2), 289–315.

Ye, G.-B., Xie, X., 2011. Split Bregman method for large scale fused lasso. Computational Statistics & Data Analysis 55 (4), 1552–1569.

Yin, W., Osher, S., Darbon, J., Goldfarb, D., 2008. Bregman iterative algorithms for compressed sensing and related problems. SIAM Journal on Imaging Sciences 1 (1), 143–168.

Yuan, M., Lin, Y., 2007. On the nonnegative garrote estimator. Journal of the Royal Statistical Society, Series B 69 (2), 143–161.

Zhang, F., 2006. The Schur Complement and Its Applications. Springer Science & Business Media.

Zhao, P., Yu, B., 2006. On model selection consistency of lasso. Journal of Machine Learning Research 7, 2541–2567.

Zhu, Y., 2017. An augmented ADMM algorithm with application to the generalized lasso problem. Journal of Computational and Graphical Statistics 26 (1), 195–204. URL http://dx.doi.org/10.1080/10618600.2015.1114491