Joint estimation of K related regression models with simple L1-norm penalties

E. Ollier* and V. Viallon‡

* Université de Lyon, F-69622, Lyon, France
‡ Université de Lyon, F-69622, Lyon, France; Université Lyon 1, UMRESTTE, F-69373 Lyon; IFSTTAR, UMRESTTE, F-69675 Bron.
[email protected]  [email protected]
Abstract
We propose a new approach, along with refinements, based on L1 penalties and aimed at jointly estimating several related regression models. Its main interest is that it can be rewritten as a weighted lasso on a simple transformation of the original data set. In particular, it does not need new dedicated algorithms and is ready to implement under a variety of regression models, e.g., using standard R packages. Moreover, asymptotic oracle properties are derived, along with preliminary non-asymptotic results, suggesting good theoretical properties. Our approach is further compared with state-of-the-art competitors under various settings on synthetic data: these empirical results confirm that our approach performs at least similarly to its competitors. As a final illustration, an analysis of road safety data is provided.
1. Introduction
With the emergence of high-dimensional data, penalized regression models have now become standard, with such penalties as the $L_q$-norm or quasinorm of the parameter vector, for some $q \ge 0$. Underlying the use of these penalties, a very common assumption when working with moderate to high dimensional data is that the theoretical parameter vector $\beta^*$ is sparse: only a small subset of its $p$ components is expected to be non-null. Sparsity-inducing penalties, such as those relying on the $L_1$-norm, are especially useful in this context: the $L_1$-norm being convex, it further leads to approaches that can generally be solved efficiently; they are often referred to as the lasso. Under appropriate conditions, lasso estimates attain optimal convergence rates (up to logarithmic terms): sparsity-inducing approaches are not only appealing for interpretation matters, but also because they can improve upon non-penalized procedures and lead to optimal estimates regarding estimation and/or prediction accuracy (see, e.g., Bickel et al. (2009); Bühlmann and van de Geer (2011); Dalalyan et al. (2014) for the linear regression case).

In many applications, the objective is actually to estimate several, $K$ say, parameter vectors $\beta^*_1, \ldots, \beta^*_K$ corresponding to $K$ related regression models; this problem is often referred to as multi-task learning in the literature (Argyriou et al., 2008; Maurer and Pontil, 2013; Negahban and Wainwright, 2011; Lounici et al., 2009; Jacob et al., 2009). Our motivating example corresponds to a standard situation in epidemiology or clinical research, where data come from several strata of the population (Gertheiss and Tutz, 2012; Viallon et al., 2014). Each stratum can correspond to a type or dosage of treatment, a given geographical area, or can be defined by crossing age category and gender, etc. The relationship between the response variable and the covariates has then to be studied on each stratum. This relationship is rarely strictly the same over all the strata, but some homogeneity is generally expected. As a result, simple strategies consisting in estimating one model per stratum, or one model on all the strata pooled together, are generally sub-optimal in this context, and appropriate approaches should automatically adapt to the level of heterogeneity over the strata. Besides the sparsity assumption on each $\beta^*_k$, the vectors $\beta^*_k$ are usually expected to be close to each other in some sense. The principle of multi-task learning strategies is then to account for the presumed similarity among the $\beta^*_k$'s, while not masking potential heterogeneities, in order to improve estimation performance. Recent methods rely on penalties encouraging estimators of the matrix $B^* = (\beta^*_1, \ldots, \beta^*_K) \in \mathbb{R}^{p\times K}$ to exhibit some given structure: low rank with the trace-norm penalty, group structure with $L_1/L_2$ or $L_1/L_\infty$ penalties, etc. To solve the corresponding optimization problems, dedicated algorithms are generally needed, which limits their use and generalization by practitioners: most of the available algorithms only handle the $L_2$ loss (for linear regression typically), and/or the logistic or hinge loss. In this paper we propose a new approach which is very intuitive and easy to implement under a variety of models, i.e., for a variety of loss functions. Indeed, our approach reduces to a weighted lasso on a straightforward transformation of the data. As a result, any function or package enabling the resolution of the weighted lasso can directly be used for the implementation. In particular, the glmnet R package of Friedman et al. (2010) allows the treatment of linear, logistic, Poisson, multinomial and Cox models.
The principle of our approach is very simple: we encourage sparsity within each individual $\beta^*_k$ and similarity between the $\beta^*_k$'s. As will be shown below, we reach this objective by simply using $L_1$-norm penalties, and our approach returns estimates of the matrix $B^*$ that are sums of a rank-one matrix and a sparse one.

In the following Section 2, we formally introduce the setting we consider, describe our approach, state its connections with previous works, discuss its practical implementation and present theoretical results. Then, results from simulation studies are presented in Section 3, where comparisons are made with state-of-the-art methods. Finally, an illustration of our approach is provided on road safety data in Section 4, where $K = 18$ experts have been asked to determine drivers' responsibility in road traffic accidents and the objective is to assess expert agreement.

2. Methods

Although the main applications we have in mind concern data gathered from various strata of a population, our work falls into the more general multi-task learning context. For the sake of simplicity, we will use the terminology stratum everywhere, but everything can be extended by replacing stratum by task. In addition, methods as well as their theoretical properties will be presented in the linear regression model with no intercept for ease of notation. Extensions to generalized linear models (McCullagh and Nelder, 1989) are straightforward. How to deal with intercept terms in practice is briefly discussed in Section 2.2.2 below.
For any integer $m \ge 1$, we will denote by $[m]$ the set of values $\{1, \ldots, m\}$. Moreover, $0_m$ and $1_m$ will denote the vectors of size $m$ with components all equal to 0 and 1 respectively, while $I_m$ will stand for the $(m \times m)$ identity matrix. For any vector $x = (x_1, \ldots, x_m) \in \mathbb{R}^m$, we further denote by $\mathrm{supp}(x)$ its support (i.e., $\mathrm{supp}(x) = \{j \in [m] : x_j \ne 0\}$), and we set $\|x\|_q = (\sum_{j \in [m]} |x_j|^q)^{1/q}$ for any real number $q \in (0, \infty)$, $\|x\|_\infty = \max_j |x_j|$ and $\|x\|_0 = |\mathrm{supp}(x)|$, where $|E|$ denotes the cardinality of the set $E$. For any set $E \subseteq [m]$, we will denote by $x_E$ the vector of $\mathbb{R}^{|E|}$ with components $(x_j)_{j \in E}$.

Denote by $K \ge 2$ the number of strata and by $n_k$ the number of observations in stratum $k \in [K]$, with $n = \sum_{k \in [K]} n_k$ the total number of observations. We will assume that for any $k \in [K]$ the response vectors $y^{(k)} = (y^{(k)}_1, \ldots, y^{(k)}_{n_k})^T \in \mathbb{R}^{n_k}$ are related to the design matrices $X^{(k)} = (x^{(k)T}_1, \ldots, x^{(k)T}_{n_k})^T \in \mathbb{R}^{n_k \times p}$ according to the following linear model

$$y^{(k)} = X^{(k)}\beta^*_k + \varepsilon^{(k)}. \qquad (1)$$

In this equation, the vectors $\varepsilon^{(k)} = (\varepsilon^{(k)}_1, \ldots, \varepsilon^{(k)}_{n_k})^T \in \mathbb{R}^{n_k}$ denote noise vectors: the variables $\varepsilon^{(k)}_i$ will be assumed to be independent and identically distributed (i.i.d.) with $\mathrm{E}\,\varepsilon^{(k)}_i = 0$ and $\mathrm{Var}(\varepsilon^{(k)}_i) = \sigma^2$ for some unknown $\sigma^2 > 0$, for all $i \in [n_k]$ and $k \in [K]$. Finally, the vectors $\beta^*_k \in \mathbb{R}^p$ are the $K$ parameter vectors of interest, to be estimated. In the context considered here, the $\beta^*_k$'s are expected to be sparse and close to each other in some sense.

Our proposal relies on the following decomposition for $\beta^*_k$, $k \in [K]$:

$$\beta^*_k = \beta^* + \gamma^*_k. \qquad (2)$$

Here $\beta^*$ describes what is "common" over the strata, while $\gamma^*_k$ captures the variation in stratum $k$ around $\beta^*$. Of course, there are infinitely many such decompositions, but assuming that the vectors $\beta^*_k$ are close to each other, those minimizing the quantity $\sum_k \|\gamma^*_k\|_q^q$, for some $q \ge 0$, are naturally appealing. They correspond to very natural choices for the vector $\beta^*$: $\beta^* \in \arg\min_\beta \sum_k \|\beta^*_k - \beta\|_q^q$. In particular, it is straightforward that the choices $q = 0$, 1 and 2 imply that, for each $j \in [p]$, $\beta^*_j$ is a mode, a median and the mean of $(\beta^*_{1,j}, \ldots, \beta^*_{K,j})$, respectively.
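To fix ideas, here is a small component-wise illustration of this choice (a toy numerical example of ours, not taken from the data), for a given covariate $j$ and $K = 5$ strata with effects $(0, 0, 0, 1, 2)$:

$$\beta^*_j \in \arg\min_{\beta_j} \sum_{k=1}^{5} \lvert \beta^*_{k,j} - \beta_j \rvert^q
\quad\Longrightarrow\quad
\beta^*_j = \begin{cases}
0 & (q = 0:\ \text{a mode}),\\
0 & (q = 1:\ \text{the median}),\\
0.6 & (q = 2:\ \text{the mean}).
\end{cases}$$

With $q = 1$, the common effect is exactly zero as soon as a majority of strata have a null effect on covariate $j$, which is precisely the kind of sparsity pattern the $L_1$ penalties introduced below exploit.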
The principle of our proposal is to obtain estimators of the form $\widehat{\beta}_k = \widehat{\beta} + \widehat{\gamma}_k$ with sparse $\widehat{\beta}$ and sparse $\widehat{\gamma}_k$. For appropriate values $\lambda_1 \ge 0$ and $\lambda_{2,k} \ge 0$, our estimates are then defined as

$$(\widehat{\beta}, \widehat{\gamma}_1, \ldots, \widehat{\gamma}_K) \in \operatorname*{argmin}_{\beta, \gamma_1, \ldots, \gamma_K} \ \sum_{k \ge 1} \frac{\| y^{(k)} - X^{(k)}(\beta + \gamma_k) \|_2^2}{2n} + \lambda_1 \|\beta\|_1 + \sum_{k \ge 1} \lambda_{2,k} \|\gamma_k\|_1. \qquad (3)$$

Clearly, $\widehat{\beta} = 0_p$ and $\widehat{\gamma}_k = \widehat{\beta}_k$ for $\lambda_1$ large enough: the $K$ problems are then solved independently (the corresponding strategy will be referred to as IndepLasso in the sequel). On the other hand, we have $\widehat{\beta}_k = \widehat{\beta}$ for all $k$ if the $\lambda_{2,k}$'s are large enough: optimal solutions are those of the lasso run on all the data pooled together (the corresponding strategy will be referred to as IdentLasso in the sequel). This is a nice property since selecting appropriate values for $\lambda_1$ and the $\lambda_{2,k}$'s allows us to adapt to the extent of homogeneity over the strata. We refer to Section 3.1.1 below for a discussion on how to choose $\lambda_1$ and $\lambda_{2,k}$ in practice.

The optimization problem described in Equation (3) is of course equivalent to the minimization of

$$\sum_{k \ge 1} \frac{\|y^{(k)} - X^{(k)}\beta_k\|_2^2}{2n} + \lambda_1 \Big\{ \|\beta\|_1 + \sum_{k \ge 1} \frac{\lambda_{2,k}}{\lambda_1} \|\beta_k - \beta\|_1 \Big\} \qquad (4)$$

over $\beta_k \in \mathbb{R}^p$ for $k \in [K]$ and $\beta \in \mathbb{R}^p$. Because $\beta$ now only appears in the penalty terms, it is easy to see that, at the optimum, $\widehat{\beta}_j$ is a weighted and shrunk version of the median of $(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$; we will denote it by $\mathrm{WSmedian}_{\mu_{[K]}}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$ with $\mu_{[K]} = (\mu_1, \ldots, \mu_K)$ and $\mu_k = \lambda_{2,k}/\lambda_1$. If all $\mu_k$'s are equal and tend to 0, then $\mathrm{WSmedian}_{\mu_{[K]}}$ tends to the constant function returning 0: this corresponds to the decomposition $\widehat{\beta}_k = \widehat{\gamma}_k$, and then to the IndepLasso strategy. On the other hand, if all $\mu_k$'s are equal but tend to infinity, then $\mathrm{WSmedian}_{\mu_{[K]}}$ tends to the standard median function: this corresponds to the decomposition $\widehat{\beta}_k = \widehat{\beta} + \widehat{\gamma}_k$, where $\widehat{\beta}_j = \mathrm{median}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$. If the $\mu_k$'s are all equal to $1/\tau$, with $\tau \in \mathbb{N}$, then $\mathrm{WSmedian}_{\mu_{[K]}}$ is a shrunk median in the sense that

$$\widehat{\beta}_j = \mathrm{WSmedian}_{1/\tau, \ldots, 1/\tau}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j}) = \mathrm{median}(0_\tau^T, \widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j}),$$

that is, the median computed after appending $\tau$ zeros to the list of estimates. A last interesting example is when, for some $\ell \in [K]$, $\mu_\ell \to \infty$ while the other $\mu_k$'s are fixed (finite) for all $k \ne \ell$. Then $\mathrm{WSmedian}_{\mu_{[K]}}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j}) \to \widehat{\beta}_{\ell,j}$, leading to the decomposition $\widehat{\beta}_k = \widehat{\beta}_\ell + \widehat{\gamma}_k$, with $\widehat{\gamma}_\ell = 0_p$: this corresponds to considering stratum $\ell$ as the reference one (see Section 2.3.1 below). To sum up, each particular choice of the $\lambda_{2,k}/\lambda_1$ ratios "identifies" a particular common effect vector $\widehat{\beta}$, with $\widehat{\beta}_j$ defined as $\mathrm{WSmedian}_{\mu_{[K]}}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$, and our approach encourages solutions $(\widehat{\beta}_1, \ldots, \widehat{\beta}_K)$ with a typically sparse vector of common effects $\widehat{\beta}$ and sparse vectors of differences $\widehat{\beta}_k - \widehat{\beta}$.

A very attractive property of our approach is that it can be rewritten as a simple weighted lasso. To state this, first set $Y = (y^{(1)T}, \ldots, y^{(K)T})^T \in \mathbb{R}^n$ and introduce the quantities

$$\mathbf{X} = \begin{pmatrix} X^{(1)} & X^{(1)} & 0 & \cdots & 0\\ X^{(2)} & 0 & X^{(2)} & \cdots & 0\\ \vdots & \vdots & & \ddots & \vdots\\ X^{(K)} & 0 & \cdots & 0 & X^{(K)} \end{pmatrix} \quad \text{and} \quad \theta = \begin{pmatrix} \beta \\ \gamma^{(1)} \\ \vdots \\ \gamma^{(K)} \end{pmatrix},$$

which are elements of $\mathbb{R}^{n \times (K+1)p}$ and $\mathbb{R}^{(K+1)p}$ respectively. Further introduce the vector of weights $\omega = (1_p^T, (\lambda_{2,1}/\lambda_1)\,1_p^T, \ldots, (\lambda_{2,K}/\lambda_1)\,1_p^T)^T \in \mathbb{R}^{(K+1)p}$. Then, setting $\|\theta\|_{1,\omega} = \sum_{j=1}^{(K+1)p} \omega_j |\theta_j|$, the criterion to be minimized in Equation (3) can be rewritten as

$$\frac{\|Y - \mathbf{X}\theta\|_2^2}{2n} + \lambda_1 \|\theta\|_{1,\omega}, \qquad (5)$$

which is the criterion minimized in the weighted lasso. Of course, it is easy to show that this remains true under generalized linear models, Cox models, etc. As mentioned above, this is particularly interesting since it means that our approach can be implemented under a variety of models using available packages for the lasso, e.g., the glmnet R package of Friedman et al. (2010); a minimal R sketch is given below.

Throughout the paper, only linear regression models with no intercept are considered for ease of notation. In practice however, an intercept generally has to be included in the model. A first option consists in using the glmnet function with "intercept=FALSE" and replacing $X^{(k)}$ by $Z^{(k)} = (1_{n_k}, X^{(k)}) \in \mathbb{R}^{n_k \times (p+1)}$ in the definition of $\mathbf{X}$ above: this way, the absolute value of the common intercept term is penalized, and so are the variations around this common intercept. In many situations, it makes more sense not to penalize the common intercept: an alternative option is then to use the glmnet function with "intercept=TRUE" and to replace $X^{(k)}$ by $Z^{(k)}$ in the definition of $\mathbf{X}$ everywhere except in the first $p$ columns of $\mathbf{X}$: this way, only the variations around the common intercept (which corresponds to the median of the intercepts over the strata in this case) are penalized.
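To make this reduction concrete, here is a minimal R sketch for the linear case with no intercept. It is a sketch under our own naming conventions: the helpers build_design and joint_lasso, and the ratios mu $= \lambda_{2,k}/\lambda_1$, are ours and purely illustrative; only the glmnet call and its penalty.factor argument come from the glmnet package.

library(glmnet)

## Stack the K strata into the matrix X of Equation (5):
## columns 1..p carry the common effect beta,
## columns (k*p+1)..((k+1)*p) carry the stratum-specific gamma_k.
build_design <- function(Xlist) {
  K <- length(Xlist); p <- ncol(Xlist[[1]])
  do.call(rbind, lapply(seq_len(K), function(k) {
    Z <- matrix(0, nrow(Xlist[[k]]), (K + 1) * p)
    Z[, 1:p] <- Xlist[[k]]                        # common-effect block
    Z[, (k * p + 1):((k + 1) * p)] <- Xlist[[k]]  # stratum-specific block
    Z
  }))
}

## Weighted lasso of Equation (5): weight 1 on beta, mu_k on gamma_k.
joint_lasso <- function(Xlist, ylist, mu, ...) {
  p <- ncol(Xlist[[1]])
  w <- c(rep(1, p), rep(mu, each = p))            # weights omega
  glmnet(build_design(Xlist), unlist(ylist), family = "gaussian",
         intercept = FALSE, standardize = FALSE, penalty.factor = w, ...)
}

The whole regularization path in $\lambda_1$ is then returned by glmnet, and any of its supported families (binomial, poisson, cox, ...) can be substituted for the Gaussian one; estimates for stratum $k$ are recovered as $\widehat{\beta} + \widehat{\gamma}_k$ from the fitted coefficient vector $\widehat{\theta}$.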
Decomposition (2) was first suggested in Evgeniou and Pontil (2004), who use the SVM machinery with $L_2$-norm penalties. For the application we have in mind, using sparsity-inducing norms such as the $L_1$-norm is more appealing. Indeed, detecting the subset of covariates that have a differential effect over the strata is often of primary interest (see our application in Section 4 below). By using the $L_1$-norm of the $\gamma_k$'s, our approach enjoys good properties regarding support recovery for both the $\beta_k$'s and the $\gamma_k$'s (as will be shown below), which is of course not the case for methods based on $L_2$-norm penalties. Another advantage when using $L_1$-norms is that the common effect estimation is more robust (median versus mean).

Jalali et al. (2013) consider a closely related decomposition, this time for the parameter matrix $B = (\beta_1, \ldots, \beta_K) \in \mathbb{R}^{p \times K}$. They write $B = R + S$, where $R$ and $S$ are two $(p \times K)$ matrices with element $(j,k)$ denoted by $r^{(k)}_j$ and $s^{(k)}_j$, respectively. Further denote by $r^{(k)}$ and $s^{(k)}$ the $k$-th columns of the matrices $R$ and $S$ respectively, so that $\beta^{(k)} = s^{(k)} + r^{(k)}$, and by $r_j$ and $s_j$ the $j$-th rows of the matrices $R$ and $S$. Further set, for any matrix $M = (m_1, \ldots, m_p)^T \in \mathbb{R}^{p \times K}$ and $q \ge 1$, $\|M\|_{1,q} = \sum_{j \in [p]} \|m_j\|_q$. Then their approach returns an estimate of $B^*$ derived from minimizers of the following criterion:

$$\sum_k \frac{\|y^{(k)} - X^{(k)}(r^{(k)} + s^{(k)})\|_2^2}{2n} + \lambda_s \|S\|_{1,1} + \lambda_r \|R\|_{1,\infty}. \qquad (6)$$

The term $\|S\|_{1,1}$ encourages the matrix $S$ to be sparse (few elements are non-zero), while $\|R\|_{1,\infty}$ encourages the matrix $R$ to have a row-wise group structure (few rows have non-zero entries): together, they encourage solutions $\widehat{B}$ that can be written as the sum of a row-sparse matrix and a sparse one. As mentioned in Jalali et al. (2013), setting $d = \lfloor \lambda_r / \lambda_s \rfloor$, the combination of the two penalties leads to solutions $\widehat{R}$ such that, for any $j$ with $\|\widehat{r}_j\|_\infty > 0$, $|M_j| \ge d+1$ where $M_j = \{k : |\widehat{r}^{(k)}_j| = \|\widehat{r}_j\|_\infty\}$ (see their Lemma 2): in other words, rows that are not uniformly null have at least $d+1$ components with equal absolute values. In contrast, our approach returns an estimate of $B^*$ that is the sum of a rank-one matrix (each row has all its components equal, instead of at least $d+1$ components of equal absolute values) plus a sparse matrix: it is therefore less flexible, but easier to interpret and, above all, to implement. From a theoretical point of view, Jalali et al. especially considered the case where $K = 2$ and covariates are generated from a standard multivariate Gaussian distribution. Further assuming that $n_1 = n_2$, they show that the number of samples needed to ensure support recovery with high probability (i.e., the sample complexity) is lower for Dirty than for both IndepLasso (where $K$ lassos are run independently on the $K$ strata) and the group-lasso strategy relying on the $L_1/L_\infty$ penalty (Negahban and Wainwright, 2011). As shown in Section 6.1.1 of the Appendix, the sample complexity of our approach is never larger than that of Dirty if, in addition to the assumptions considered in Jalali et al., $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$. Our empirical results presented in Section 3 further suggest that this may still be the case for $K > 2$. Stating this theoretically is out of the scope of the present paper and will be considered elsewhere.

2.3 Adaptive version and other refinements
An adaptive version of our approach can be derived by selecting appropriate weights and replacing the $L_1$-norms by weighted $L_1$-norms in Equation (3). Following the ideas of the adaptive lasso (Zou, 2006), these weights can be constructed from initial estimators of $\beta_j$ and $\gamma_{k,j} = \beta_{k,j} - \beta_j$, for $j \in [p]$ and $k \in [K]$. Denoting by $\widetilde{\beta}_k$ initial estimates of $\beta_k$ (e.g., OLS estimates or MLEs if $n_k \gg p$ for all $k \in [K]$), and defining for any $j$, $\ell_j = \min\{k : \widetilde{\beta}_{k,j} \in \mathrm{median}(\widetilde{\beta}_{1,j}, \ldots, \widetilde{\beta}_{K,j})\}$, initial estimators of $\beta_j$ and $\gamma_{k,j}$ are set to $\widetilde{\beta}_{\ell_j,j}$ and $\widetilde{\beta}_{k,j} - \widetilde{\beta}_{\ell_j,j}$. Note that because the values $\widetilde{\beta}_{1,j}, \ldots, \widetilde{\beta}_{K,j}$ are generally all distinct, their median can be a range of values if $K$ is even. In this case, we want to set the initial estimator of the common effect of covariate $j$ to one of the values in $(\widetilde{\beta}_{1,j}, \ldots, \widetilde{\beta}_{K,j})$, hence the use of the minimum in the definition of $\ell_j$ (alternatively, the maximum could be used: there is no reason for favoring one or the other, at least a priori). Then setting, for all $k \in [K]$ and $j \in [p]$, $w_{k,j} = 1/|\widetilde{\beta}_{k,j}|^\rho$ and $\nu_{k,j} = 1/|\widetilde{\beta}_{k,j} - \widetilde{\beta}_{\ell_j,j}|^\rho$, for some $\rho > 0$ (typically $\rho = 1$), the adaptive version of our approach consists in minimizing the following objective function:

$$\sum_{k \ge 1} \frac{\|y^{(k)} - X^{(k)}(\beta + \gamma_k)\|_2^2}{2n} + \lambda_1 \sum_{j=1}^p w_{\ell_j,j} |\beta_j| + \sum_{k \ge 1} \lambda_{2,k} \sum_{j=1}^p \nu_{k,j} |\gamma_{k,j}|.$$

But because $\nu_{\ell_j,j} = +\infty$ for all $j \in [p]$, this is equivalent to minimizing the criterion

$$\sum_{k \ge 1} \frac{\|y^{(k)} - X^{(k)}\beta_k\|_2^2}{2n} + \lambda_1 \sum_{j=1}^p w_{\ell_j,j} |\beta_{\ell_j,j}| + \sum_{k \ge 1} \lambda_{2,k} \sum_{j=1}^p \nu_{k,j} |\beta_{k,j} - \beta_{\ell_j,j}|. \qquad (7)$$

In other words, the adaptive version of our approach can be seen as a refinement of the following strategy, which is very common in epidemiology and clinical research when data come from several strata. Many practitioners would first select a reference stratum, say $\ell \in [K]$, and then use the decomposition $\beta_k = \beta_\ell + \delta_k$ for any $k \ne \ell$. Then, a lasso can be used, and parameter estimates for each stratum can be obtained from minimizers of the following objective function:

$$\frac{1}{2n}\Big\{ \|y^{(\ell)} - X^{(\ell)}\beta_\ell\|_2^2 + \sum_{k \ne \ell} \|y^{(k)} - X^{(k)}(\beta_\ell + \delta^{(k)})\|_2^2 \Big\} + \lambda_1 \|\beta_\ell\|_1 + \sum_{k \ne \ell} \lambda_{2,k} \|\delta^{(k)}\|_1. \qquad (8)$$

This approach reduces to a standard lasso in which the original vector of covariates is augmented by interaction terms between the covariates and indicator functions describing membership to each stratum $k \ne \ell$. The adaptive version of our approach enjoys two advantages compared to this strategy, referred to as InterLasso hereafter. First, the selection of the reference stratum is automatic with our approach, based on the initial estimates of the $\beta^*_k$'s. Second, and above all, the reference stratum $\ell_j$ is covariate-specific. Whenever the initial estimates are consistent, $\ell_j$ belongs to the set $L^*_j = \{k : \beta^*_{k,j} \in \mathrm{median}(\beta^*_{1,j}, \ldots, \beta^*_{K,j})\}$ with high probability for $n$ large enough (if the $n_k$'s all tend to $\infty$ as $n \to \infty$), and is therefore an appealing reference stratum for covariate $j$. As a result, our approach will generally yield models with lower complexity, and hence better performance, than the simple InterLasso (even if InterLasso can of course lead to better models than our approach in some situations; e.g., if there exists a stratum $\ell$ for which $\beta^*_{\ell,j} \in \mathrm{mode}(\beta^*_{1,j}, \ldots, \beta^*_{K,j})$ for all $j \in [p]$, this stratum $\ell$ is chosen as the reference one, and $\mathrm{mode}(\beta^*_{1,j}, \ldots, \beta^*_{K,j}) \cap \mathrm{median}(\beta^*_{1,j}, \ldots, \beta^*_{K,j}) = \emptyset$ for some $j \in [p]$). A small R sketch of this adaptive-weight construction is given below.
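The following R fragment sketches how the covariate-specific reference stratum $\ell_j$ and the adaptive weights $w_{k,j}$ and $\nu_{k,j}$ could be computed from per-stratum OLS fits. All names here (e.g., adaptive_weights) are ours and purely illustrative; for simplicity the lower median value is used when $K$ is even, in the spirit of the definition of $\ell_j$ above.

## Initial per-stratum OLS estimates, then covariate-specific reference stratum
## ell_j = smallest k whose estimate equals the (lower) median of column j.
adaptive_weights <- function(Xlist, ylist, rho = 1) {
  B0 <- sapply(seq_along(Xlist), function(k)      # p x K matrix of initial estimates
    coef(lm(ylist[[k]] ~ Xlist[[k]] - 1)))
  K <- ncol(B0); p <- nrow(B0)
  ell <- apply(B0, 1, function(b) {
    med <- sort(b)[ceiling(K / 2)]                # lower median, one of the K values
    min(which(b == med))
  })
  ref <- B0[cbind(seq_len(p), ell)]               # beta_tilde_{ell_j, j}
  list(ell = ell,
       w   = 1 / abs(ref)^rho,                    # weights on |beta_{ell_j, j}|
       nu  = 1 / abs(B0 - ref)^rho)               # p x K weights on |beta_{k,j} - beta_{ell_j,j}|;
}                                                 # Inf entries correspond to the reference stratum

These weights can then be stacked in the same order as the columns of the transformed design sketched after Equation (5) and passed through the penalty.factor argument of glmnet, so that the adaptive version of $M_1$ remains a plain weighted lasso.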
With our first approach, sparsity of the vectors $\widehat{\beta}_k = \widehat{\beta} + \widehat{\gamma}_k$ is not directly encouraged, and is only "induced" by the sparsity of the common effects $\widehat{\beta}$ and of the variations $\widehat{\gamma}_k$. For instance, if for some $a \ne 0$ and $j \in [p]$, $\beta^*_{\ell_j,j} = a$ and $\beta^*_{k,j} = 0$ for some $k \ne \ell_j$, our first approach will typically not return $\widehat{\beta}_{k,j} = 0$ (unless it also returns $\widehat{\beta}_{\ell_j,j} = 0$). We therefore propose a refined version which directly penalizes the $L_1$-norms of the $\beta_k$'s. More precisely, a second set of estimators can be defined as minimizers of the following criterion

$$\sum_{k \ge 1} \Big\{ \frac{\|y^{(k)} - X^{(k)}\beta_k\|_2^2}{2n} + \lambda_{1,k} \|\beta_k\|_1 + \lambda_{2,k} \|\beta_k - \beta\|_1 \Big\} \qquad (9)$$

over $\beta_k \in \mathbb{R}^p$ for $k \in [K]$ and $\beta \in \mathbb{R}^p$. In the sequel, this refined version will be referred to as $M_2$, while $M_1$ will refer to our first approach. Because $\beta$ now only appears in the last term of the objective function, it is clear that the $j$-th component of any optimal solution $\widehat{\beta}$ is such that $\widehat{\beta}_j = \mathrm{median}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$: for each covariate $j \in [p]$, its estimated common effect $\widehat{\beta}_j$ corresponds to the median of its estimated effects over all the strata, irrespective of the $\lambda_{2,k}/\lambda_{1,k}$ ratios.

The implementation of $M_2$ is however less trivial than that of our first approach. In Section 6.2 of the Appendix, we derive the dual formulation of the optimization problem described in Equation (9), which is shown to reduce to a standard optimization problem for linear regression (quadratic programming) and for logistic regression (an entropy maximization problem) and can therefore be solved with available optimization toolboxes; the hinge loss can also be easily treated.

An adaptive version of $M_2$ can further be recast in the generalized fused lasso framework (Höfling et al., 2010; Viallon et al., 2014). This makes its implementation straightforward with packages dedicated to the generalized fused lasso (e.g., the FusedLasso R package of Höfling et al. (2010)). Recall the notations introduced in Section 2.3.1 above. The adaptive version of $M_2$ consists in minimizing the following objective function:

$$\sum_{k \ge 1} \Big\{ \frac{\|y^{(k)} - X^{(k)}\beta_k\|_2^2}{2n} + \lambda_{1,k} \sum_{j=1}^p w_{k,j} |\beta_{k,j}| + \lambda_{2,k} \sum_{j=1}^p \nu_{k,j} |\beta_{k,j} - \beta_{\ell_j,j}| \Big\}. \qquad (10)$$

Here again, we have $\nu_{\ell_j,j} = +\infty$ for all $j \in [p]$. To see the connection with the generalized fused lasso, set $Y = (y^{(1)T}, \ldots, y^{(K)T})^T \in \mathbb{R}^n$ as above, and introduce $\mathbf{X}_F$, the $(n \times Kp)$ block-diagonal matrix with $k$-th block equal to $X^{(k)}$. Further define $b = (\beta_1^T, \ldots, \beta_K^T)^T = (b_1, \ldots, b_{Kp}) \in \mathbb{R}^{pK}$. Then, the criterion of Equation (10) can be rewritten as

$$\frac{\|Y - \mathbf{X}_F b\|_2^2}{2n} + \lambda_1 \|b\|_1 + \lambda_2 \sum_{j_1 \sim j_2} |b_{j_1} - b_{j_2}|.$$
The condition $j_1 \sim j_2$ indicates that the components $b_{j_1}$ and $b_{j_2}$ are connected in a particular graph that describes the absolute differences that are penalized. Here, this graph consists of $p$ star graphs, where the $j$-th star graph has coefficient $\beta_{\ell_j,j}$ at its center, connected to each $\beta_{k,j}$ for $k \ne \ell_j$. The adaptive version of $M_1$ can also be seen as a particular case of the generalized fused lasso, using the same graph made of $p$ star graphs (our two adaptive versions penalize the same absolute differences), but where only the absolute value of the central node $|\beta_{\ell_j,j}|$ of each star graph is penalized (while all the $|\beta_{k,j}|$'s are penalized in Equation (10) above). Finally, note that the InterLasso strategy can also be seen as a special case of the generalized fused lasso: the underlying graph is made of $p$ star graphs with $\beta_{\ell,j}$ as the central node (instead of $\beta_{\ell_j,j}$), and only the $|\beta_{\ell,j}|$ terms are penalized, for $j \in [p]$. A sketch of this graph construction is given below.

These observations formally establish a connection with another strategy that was considered in Gertheiss and Tutz (2012) and Viallon et al. (2014). It will be referred to as CliqueFused, and it corresponds to a generalized fused lasso using this time a graph made of $p$ cliques (instead of $p$ star graphs): for each covariate $j$, all the absolute differences $|\beta_{k,j} - \beta_{k',j}|$ are penalized, for all $k \ne k'$.

The rewriting of the adaptive version of $M_2$ as a special case of the generalized fused lasso shows that it is easy to implement with available packages, like the FusedLasso R package for linear and logistic regression models; the same naturally holds for the adaptive version of $M_1$, but since the latter can still be written as a particular case of the (adaptive) lasso, its implementation is faster using the glmnet function for instance. In addition, asymptotic oracle properties for the adaptive version of $M_2$ (and $M_1$) readily follow as direct consequences of the results obtained in Viallon et al. (2014) (see Section 2.4 below); for the adaptive version of $M_1$, results obtained by Zou (2006) for the adaptive lasso can also be used.
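As an illustration of this graph structure, the generic R fragment below (ours, independent of any particular fused-lasso solver) builds the edge list of the $p$ star graphs from the reference strata $\ell_j$. Indices refer to the stacked vector $b = (\beta_1^T, \ldots, \beta_K^T)^T$, so that coefficient $\beta_{k,j}$ sits at position $(k-1)p + j$.

## Edge list of the p star graphs: for each covariate j, connect the central node
## beta_{ell_j, j} to beta_{k, j} for every k != ell_j.
star_graph_edges <- function(ell, p, K) {
  do.call(rbind, lapply(seq_len(p), function(j) {
    center <- (ell[j] - 1) * p + j
    others <- (seq_len(K)[-ell[j]] - 1) * p + j
    cbind(center, others)                       # one row per penalized difference
  }))
}
## Example: p = 3 covariates, K = 4 strata, references ell = c(2, 1, 4)
## star_graph_edges(c(2, 1, 4), 3, 4) returns 9 edges (K - 1 = 3 per covariate).

The CliqueFused strategy would instead connect, for each covariate $j$, all pairs of nodes $\beta_{k,j}$ and $\beta_{k',j}$, $k \ne k'$.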
For the sake of brevity, we only present here a summary of the asymptotic oracle properties of the adaptive versions of our approach. We refer to Section 6.1 in the Appendix for more details, along with preliminary non-asymptotic results for our first approach $M_1$. We consider the situation where $p$ and $K$ are held fixed (they do not increase with $n$). We further assume that, for each $k \in [K]$, the matrix $X^{(k)T}X^{(k)}/n_k$ converges to a positive definite matrix $C^{(k)}$ as $n_k \to \infty$, and that $n_k/n \to \kappa_k$ as $n \to \infty$, with $0 < \kappa_k < 1$ (stratum sizes all tend to infinity at the same rate). These assumptions essentially imply that the MLEs $\widetilde{\beta}_{k,j}$ exist and are $\sqrt{n}$-consistent as $n \to \infty$; they will therefore be used in the definition of the weights in Equations (7) and (10). Our results essentially show that the adaptive version of $M_1$, say $M_1^{ad}$, enjoys asymptotic oracle properties. In particular, setting for any $j \in [p]$, $\ell^*_j = \min\{k : \beta^*_{k,j} \in \mathrm{median}(\beta^*_{1,j}, \ldots, \beta^*_{K,j})\}$ and $\mathcal{K}^*_{\mathcal{A}^*,j} = \{k : \beta^*_{k,j} = \beta^*_{\ell^*_j,j}\}$ if $\beta^*_{\ell^*_j,j} \ne 0$, all the parameters $\beta^*_{k,j}$ for $k \in \mathcal{K}^*_{\mathcal{A}^*,j}$ are estimated by the common value $\widehat{\beta}^{(ad)}_{\ell_j,j}$, with probability tending to 1 as $n \to \infty$. This estimator has the same Gaussian limit distribution as the estimator we would obtain by pooling together all the data corresponding to covariate $j$ and strata in $\mathcal{K}^*_{\mathcal{A}^*,j}$. Moreover, $M_1^{ad}$ is asymptotically optimal in situations where, for all $j \in [p]$, $\beta^*_{k_1,j} = \beta^*_{k_2,j}$ implies that either $\beta^*_{k_1,j} = \beta^*_{k_2,j} = 0$ or $\beta^*_{k_1,j} = \beta^*_{k_2,j} = \beta^*_{\ell^*_j,j}$, as in examples 1 to 3 of Figure 4 in the Appendix. Asymptotic oracle properties can easily be derived for method $M_2^{ad}$ as well, but for $M_2^{ad}$ to be optimal we must have in addition: for all $j \in [p]$, $\{\beta^*_{\ell^*_j,j} \ne 0\} \Rightarrow \{\forall k \in [K],\ \beta^*_{k,j} \ne 0\}$. For instance, in example 2 of Figure 4, the estimator of $\beta^*_{1,j}$ returned by $M_2^{ad}$ cannot be 0 (unless $M_2^{ad}$ returns a zero estimate for $\beta^*_{k_j,j}$ as well). Therefore, $M_2^{ad}$ would typically return a model with overall complexity higher than the theoretical one, and hence be sub-optimal. On the other hand, contrary to the CliqueFused strategy, both our approaches use star graphs and are therefore sub-optimal in situations where the non-zero values in $(\beta^*_{1,j}, \ldots, \beta^*_{K,j})$ consist of at least two groups of identical values (and possibly some distinct non-zero values). For instance, in example 4 of Figure 4, neither $M_1^{ad}$ nor $M_2^{ad}$ can return non-null equal values for the components $\widehat{\beta}^{(ad)}_{1,j}$, $\widehat{\beta}^{(ad)}_{2,j}$ and $\widehat{\beta}^{(ad)}_{3,j}$ (nor for $\widehat{\beta}^{(ad)}_{8,j}$, $\widehat{\beta}^{(ad)}_{9,j}$ and $\widehat{\beta}^{(ad)}_{10,j}$), while CliqueFused typically would, for $n$ large enough. However, CliqueFused may of course be outperformed by our approaches on finite samples. This was illustrated in the simulation study conducted in Viallon et al. (2014) in the single-task setting: they evaluated the robustness of the generalized fused lasso to graph misspecification and especially observed that the clique-based strategy, though asymptotically optimal, was outperformed by other graph-based strategies on finite samples. In the present multi-task setting, if $\beta^*_{1,j}, \ldots, \beta^*_{K-1,j}$ are all equal and $\beta^*_{K,j}$ is different from the $K-1$ others, then $\ell_j \in [K-1]$ for $n$ large enough, and our approaches would only penalize $|\beta_{K,j} - \beta_{\ell_j,j}|$: they are therefore more likely to detect this difference than CliqueFused, which penalizes all the differences $|\beta_{K,j} - \beta_{k,j}|$ for $k \in [K-1]$.
3. Simulation Study

Our main objective here is to illustrate the two new approaches introduced in Section 2 and to compare them with state-of-the-art competitors, in particular with those presented above which show some links with our proposal: CliqueFused and Dirty. For comparison, we further included IndepLasso and IdentLasso. The sparse group-lasso strategy relying on the $L_1 + L_1/L_\infty$ penalty, referred to as spGroupLasso hereafter, is considered too (see Section 6.3 of the Appendix for a brief description of this approach, and Negahban and Wainwright (2011) and Simon et al. (2013) for more details). Results obtained with the stepwise method described in Gertheiss and Tutz (2012) will also be presented in the low-dimensional example.

The glmnet R function of Friedman et al. (2010) was used to implement IndepLasso and IdentLasso, as well as our approach $M_1$ (and its adaptive version). As for $M_2$, it was implemented as described in Section 2.3.2 above. The spGroupLasso strategy was implemented with the spams R package of Mairal et al. (2010), and we used the gvcm.cat R package for the stepwise approach of Gertheiss and Tutz (2012). Finally, we used the script of Jalali et al. (2013) (available at http://ali-jalali.com/index_files/L1Linf_LASSO.r) to implement Dirty. However, this script actually implements a revised version of Dirty: at each iteration $\iota$ of their algorithm, the tuning parameters are divided by $\sqrt{\iota}$ in the coordinate gradient descent steps, making the soft-thresholding operator closer and closer to the hard-thresholding one as $\iota$ increases. This trick returns solutions that are barely shrunk and is related in some way to the relaxed lasso of Meinshausen (2007), the adaptive lasso of Zou (2006) and $L_q$-penalization with $0 \le q < 1$. This revised version will be referred to as RevDirty hereafter; the original Dirty was implemented by removing the "$1/\sqrt{\iota}$" terms in the RevDirty script.

All the methods presented above involve tuning parameters that need to be carefully selected in practice. Generally speaking, a predefined grid of $\lambda_1$ values has first to be constructed. We refer to Section 6.4.3 of the Appendix for a complete description of the grid construction. Given grids of $\lambda_1$ and $\lambda_2$ values, cross-validation can be seen as the strategy of choice when $p$ is large (i.e., at least comparable to $n$) and/or only prediction accuracy matters. When $p \ll n$ and support recovery is of primary interest, a commonly preferred criterion is the BIC computed with unbiased estimates: for the lasso, these unbiased estimates can be obtained by the OLS-Hybrid two-step strategy (Efron et al., 2004), which corresponds to a simplified version of the relaxed lasso with the particular choice $\phi = 0$ (Meinshausen, 2007). It can easily be extended to the approaches considered in this paper. This criterion will be referred to as 2stepBIC in the sequel.
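As an illustration, the following R fragment sketches one plausible implementation of such a two-step BIC in the Gaussian case: for each candidate point on the $(\lambda_1, \lambda_2)$ grid, the selected support is refit by unpenalized OLS and a BIC is computed from the refit. The function two_step_bic and its arguments are ours and purely illustrative, not necessarily the exact criterion used in this paper.

## Two-step BIC for one candidate fit: refit the selected support by OLS,
## then BIC = n * log(RSS / n) + df * log(n), with df the size of the support.
two_step_bic <- function(Xtilde, y, support) {
  n <- length(y)
  if (length(support) == 0) {
    rss <- sum(y^2); df <- 0                     # no-intercept model, empty support
  } else {
    fit <- lm(y ~ Xtilde[, support, drop = FALSE] - 1)   # unpenalized refit
    rss <- sum(residuals(fit)^2); df <- length(support)
  }
  n * log(rss / n) + df * log(n)
}
## Typical use (hypothetical object names): support <- which(as.vector(coef(fit1, s = lambda1))[-1] != 0),
## then keep the (lambda1, lambda2) pair minimizing two_step_bic(Xtilde, y, support).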
In the first simulation study, we consider the case where $K = 5$ and $n_k = 15p = 225$. For each stratum $k \in [K]$, each of the $n_k$ rows (observations) of the design matrix $X^{(k)}$ is generated under a multivariate Gaussian distribution $\mathcal{N}_p(0_p, \Sigma)$, where the $(j_1, j_2)$-element of $\Sigma$ is $2^{-|j_1 - j_2|}$ (for $j_1, j_2 \in [p]$). For given values $(\mathrm{SpGlob}, \mathrm{SpSpec}) \in [0,1]^2$, each component $\beta^*_j$ of the vector $\beta^* \in \mathbb{R}^p$ is generated from a Bernoulli distribution with parameter SpGlob, while for each $k \in [K]$, each component of the vector $\delta^*_k \in \mathbb{R}^p$ is generated from a Bernoulli distribution with parameter SpSpec. The expected number of non-zero components in $\beta^*$ and $\delta^*_k$ is then $p \times \mathrm{SpGlob}$ and $p \times \mathrm{SpSpec}$, respectively. Each vector $\beta^*_k$ is then set to $\beta^* + \delta^*_k$, with components equal to either 0, 1 or 2. By making SpSpec vary, we make the level of heterogeneity over the strata vary, while SpGlob enables the control of the sparsity level of the "common" effects. Observe that the components $\beta^*_j$ do not necessarily correspond to $\mathrm{median}(\beta^*_{1,j}, \ldots, \beta^*_{K,j})$ (they do in some cases, e.g., when SpSpec $= 0$, but not always). Given $\beta^*_k$ and $X^{(k)}$, the vector $y^{(k)}$ is generated from a multivariate Gaussian distribution $\mathcal{N}_{n_k}(X^{(k)}\beta^*_k, \sigma^2 I_{n_k})$. By varying the noise variance $\sigma^2$, we can make the signal-to-noise ratio (SNR) vary; a sketch of this data-generating mechanism is given below. For each simulation design, corresponding to a given triple of values (SpGlob, SpSpec, SNR), 100 replications are performed, and the results presented and discussed below correspond to averages over these 100 replications. Methods are evaluated according to their prediction accuracy (Figure 1), measured by $\log(\sum_{k \in [K]} \|X^{(k)}(\beta^*_k - \widehat{\beta}_k)\|_2^2)$, and their support recovery accuracy (Figure 2), which measures the ability to correctly recover the zero and non-zero elements of the matrix $B^* = (\beta^*_1, \ldots, \beta^*_K)$. Tuning parameters are selected with the 2stepBIC described above.

In Figure 1, the first row corresponds to SpSpec $= 0$, so that all the $\beta^*_k$'s are equal. In this case, the optimal strategy is of course IdentLasso, but $M_1$, spGroupLasso, $M_2$ and Dirty all perform as well (or nearly as well) as this optimal strategy. The stepwise approach and CliqueFused are less efficient, but still perform better than IndepLasso. On the other hand, when SpGlob is low and SpSpec is high (last row, first two columns), the $K$ true models do not have much in common and IndepLasso is nearly optimal: in this case again, the multi-task learning strategies perform nearly as well as the optimal strategy. In addition, and as expected, the prediction performance of all methods tends to decrease as SpGlob and/or SpSpec increases, that is, as the true model complexity increases. Overall, $M_1$ and Dirty show very similar performance, irrespective of the true model complexity. They perform the best for SpSpec $\ne 0$, closely followed by the other multi-task learning strategies. For non-zero values of SpGlob and SpSpec, we observed moderate performance for both $M_2$ and spGroupLasso. As shown in Section 6.4.3 of the Appendix for method $M_2$, alternative choices for the initial grid of $(\lambda_1, \lambda_2)$ values lead to better results, suggesting that our proposal for the construction of this grid might not always be optimal for $M_2$ (and spGroupLasso) and that more theoretical effort might be needed to get a better initial grid.
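For reference, here is a minimal R sketch of the data-generating mechanism just described. The helper name simulate_strata is ours; mvtnorm::rmvnorm is assumed to be available.

library(mvtnorm)

## Generate K strata under the first simulation design:
## covariance Sigma[j1, j2] = 2^-|j1 - j2|, Bernoulli supports for beta* and delta*_k.
simulate_strata <- function(K, nk, p, sp_glob, sp_spec, sigma) {
  Sigma <- 2^(-abs(outer(seq_len(p), seq_len(p), "-")))
  beta  <- rbinom(p, 1, sp_glob)                  # common effects (0/1)
  lapply(seq_len(K), function(k) {
    delta <- rbinom(p, 1, sp_spec)                # stratum-specific deviations (0/1)
    bk    <- beta + delta                         # stratum effects in {0, 1, 2}
    X     <- rmvnorm(nk, sigma = Sigma)
    y     <- as.vector(X %*% bk) + rnorm(nk, sd = sigma)
    list(X = X, y = y, beta_k = bk)
  })
}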
[Figure 1 about here. Panels: Sparsity_glob (columns) by Sparsity_spec (rows); x-axis: snr; methods: M1, M2, Dirty, CliqueFused, spGroupLasso, Stepwise, IdentLasso, IndepLasso.]

Figure 1: Results from the first simulation study: prediction accuracy.

Turning our attention to support recovery, Figure 2 presents results obtained with Dirty and RevDirty, $M_1$, and the adaptive versions of $M_1$, $M_2$ and CliqueFused. See Figure 6 in the Appendix for the corresponding analysis of the methods compared in Figure 1: overall, conclusions for these methods are very similar to those drawn from the analysis of the prediction performance. Weights for the adaptive versions were derived from initial OLS estimates of the $\beta^*_k$'s. The comparison of $M_1$ with its adaptive version $M_1^{(ad)}$ clearly illustrates the potential gain in support recovery accuracy when using adaptive weights (no such gain was observed on prediction accuracy; results not shown). Moreover, the adaptive versions of CliqueFused and $M_2$ attain nearly the same performance as $M_1^{(ad)}$. Overall, RevDirty performs the best, but (i) the comparison is unfair since RevDirty is more than an adaptive version of Dirty, and (ii) the adaptive versions of $M_1$ and $M_2$ perform only slightly worse in most cases.
[Figure 2 about here. Same layout as Figure 1; methods: M1, Dirty, AdaptM1, AdaptM2, RevDirty, AdaptCliqueFused.]

Figure 2: Results from the first simulation study: support recovery accuracy.
The objective of the second simulation study was to describe the performance of the methods when the dimension $p$ is of the order of $n_k$ and the selection of tuning parameters is done by cross-validation. In view of the results of the first simulation study, and to save computational time, IdentLasso, IndepLasso and CliqueFused were not included in this analysis. In addition, a relaxed version of $M_1$ was included, following the idea of the relaxed lasso (Meinshausen, 2007) (we made the $\phi$-value vary over a grid between 0 and 1). We considered the case where $K = 5$ and $p = n_k = 100$. The vectors $\beta^*$ and $\delta^*_k$, for $k \in [K]$, are constructed as follows. Only the first 10 components of these vectors can be non-zero. The sparsity of the common effect $\beta^*$ is fixed to 4, with non-zero components all equal to 1. Three levels of heterogeneity are considered: complete homogeneity (all the $\delta^*_k$'s are equal to $0_p$), low heterogeneity (the number of non-zero elements of the matrix $\Delta^* = (\delta^*_1, \ldots, \delta^*_K)$ is 5) and moderate heterogeneity (the number of non-zero elements of $\Delta^*$ is 15), the non-zero elements of $\Delta^*$ being of equal magnitude and of either sign. Design matrices $X^{(k)}$ and response vectors $y^{(k)}$ are generated as in the first simulation study, except that we now set $\Sigma = I_p$. As in the first simulation study, we make the SNR vary by varying the noise level.

Results are presented in Figure 3. Because Dirty and RevDirty take a very long time to run in this second simulation study, results correspond to averages over 20 replications only. Methods are evaluated according to their prediction accuracy, measured by $\log(\sum_{k \in [K]} \|X^{(k)}(\beta^*_k - \widehat{\beta}_k)\|_2^2)$, their estimation accuracy (measured by $\sum_{k \in [K]} \|\beta^*_k - \widehat{\beta}_k\|_2^2 / (Kp)$), and their ability to recover the true support (measured in this highly sparse setting by the F1-score, defined as the harmonic mean of precision and recall). Because estimators selected by cross-validation may have many tiny non-zero components, hard-thresholding these estimators can improve their performance regarding support recovery. F1-scores for these hard-thresholded versions are therefore also presented, with the threshold set to 0.25 and 0.5. Comparing first the crude versions (no adaptive weights, no relaxation, nor the revised-Dirty trick), $M_1$ and $M_2$ generally perform at least similarly to Dirty and spGroupLasso. The relaxed and adaptive versions of $M_1$ generally achieve better performance than the crude $M_1$, especially in the cases of complete homogeneity or low heterogeneity. In these cases, these versions also outperform RevDirty in terms of prediction and estimation accuracy. As for the F1-score, RevDirty is clearly the best: only in the case of complete homogeneity does the relaxed version of $M_1$ achieve similar performance. Note however that the comparison is again unfair, since RevDirty is more than an adaptive or a relaxed version of Dirty. After the hard-thresholding step, the differences among methods in terms of F1-scores are much narrower: focusing on low values of the SNR, the different versions of our approaches appear very competitive. (Our results confirm that RevDirty returns nearly unshrunk estimates and hence fewer tiny non-zero components than "purely" $L_1$-based approaches: it makes fewer mistakes (its F1-score is higher), but apparently bigger ones, since its prediction and estimation accuracies can be outperformed by the relaxed or adaptive versions of $M_1$.)

[Figure 3 about here. Columns: No / Low / Moderate heterogeneity; rows: log(L2 prediction error), L2 error, F1, F1 after thresholding at 0.25 and at 0.5; x-axis: snr; methods: M1, M2, Dirty, spGroupLasso, AdaptM1, RelaxM1, RevDirty.]

Figure 3: Results from the second simulation study with $n_k = p = 100$ and $K = 5$.
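For completeness, here is a small R helper (ours, purely illustrative) computing the support-recovery F1-score used above, optionally after hard-thresholding the estimates.

## F1-score of support recovery: precision/recall of the non-zero pattern of B_hat
## against that of B_star, after an optional hard-thresholding of the estimates.
support_f1 <- function(B_hat, B_star, threshold = 0) {
  est  <- abs(B_hat) > threshold
  true <- B_star != 0
  tp <- sum(est & true)
  precision <- tp / max(sum(est), 1)
  recall    <- tp / max(sum(true), 1)
  if (precision + recall == 0) return(0)
  2 * precision * recall / (precision + recall)
}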
4. Application on road safety data
One major objective in road safety is to determine the factors associated with being responsible for a road traffic accident. To do so, epidemiologists typically use data sets of car crashes in which drivers' responsibility has been determined by experts. Of course, it is crucial that these experts use a proper and consistent "rule" to determine responsibility. Consistency especially implies that two experts should agree on a driver's responsibility. The level of agreement can be measured by Cohen's $\kappa$ or other standard agreement criteria, but this would require each driver's responsibility to be rated by at least two experts. Here, we will use a joint estimation strategy to assess expert agreement in situations where each driver's responsibility is only rated once, by one expert. As will be made clearer below, our strategy will also help us to check the relevance of the rule used by the experts for the determination of drivers' responsibility.

Our data set consists of drivers involved in road traffic accidents, each rated as responsible or not by one of the $K = 18$ experts. If the experts all apply the same rule consistently, the parameter vectors $\beta^*_k$ of the corresponding logistic models ($k = 1, \ldots, 18$) should be equal. To check whether this is the case, we propose to jointly estimate the 18 corresponding sparse logistic models using the two approaches described in this paper (both the crude and refined versions); a minimal sketch of such a fit is given below.
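In practice, this joint fit only differs from the linear sketch given after Equation (5) through the loss. A purely illustrative R fragment, reusing the hypothetical build_design helper introduced there (Xlist, ylist, mu and p are assumed to be defined as before, with ylist holding the 0/1 responsibility indicators of each expert's stratum), would be:

## Joint sparse logistic models over the K = 18 experts: same stacked design,
## binomial loss, single common intercept left unpenalized (cf. Section 2.2.2).
library(glmnet)
fit <- glmnet(build_design(Xlist), unlist(ylist), family = "binomial",
              intercept = TRUE, standardize = FALSE,
              penalty.factor = c(rep(1, p), rep(mu, each = p)))

For simplicity this fragment fits a single common intercept; the stratum-specific intercept deviations discussed in Section 2.2.2 would require augmenting the stratum-specific blocks with a column of ones.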
Each method returns a $(206 \times 18)$ matrix in which each column contains the parameter vector (including the intercept) of one stratum, i.e., one expert. We evaluated both methods on a predefined grid of $(\lambda_1, \lambda_{2,k})$ values and retained the model minimizing the 2stepBIC. Contrary to the situations considered in our simulation study, the $n_k$'s are not constant here and several strategies can be put forward for determining the grid. The simplest one consists in using $\lambda_{2,k} = \lambda_2$ and proceeding as described in Section 3.1.1. As mentioned in Section 3.1.1 (and as suggested by the preliminary non-asymptotic analysis of Section 6.1.2 of the Appendix), another strategy for method $M_1$ consists in first standardizing the columns of the matrix $\mathbf{X}$, then setting $\lambda_{2,k} = \lambda_2$ and finally proceeding again as described in Section 3.1.1. Both strategies were applied for method $M_1$ and they led to similar results.

With either method, good concordance was observed between the supports of the vectors $\widehat{\beta}_k$: only two [resp. three] elements of the matrix $\widehat{\Delta} = (\widehat{\delta}_1, \ldots, \widehat{\delta}_{18})$ were non-null when using $M_1$ [resp. $M_2$]. Moreover, the non-zero elements returned by $M_1$ have small absolute values. In particular, the elements of $\widehat{\Delta}$ related to driving under the influence were all zero: this suggests that all experts account for alcohol consumption in the same way.

Now, looking at the common vector $\widehat{\beta}$ can further help us to check the adequacy of the common rule used by the experts. We will pay particular attention to the association between alcohol consumption and the probability of being considered responsible by the experts. Indeed, for road safety matters, and as recalled to the experts during the training phase, the mere fact of driving under the influence should not imply being (nor increase the risk of being) considered as responsible. If the components of the vector $\widehat{\beta}$ corresponding to alcohol consumption are non-null, then the adequacy of the rule may be questioned. Method $M_1$ returns a vector $\widehat{\beta}$ with 74 non-null components, while $M_2$ returned a slightly sparser one with 60 non-null components. Overall, the variables that are the most associated with a high probability of being responsible are as expected: a driver who drove (or ran) away, and/or went through a stop sign or a red light, crossed the lane, etc., is likely to be considered as responsible for the accident. Regarding alcohol consumption, both $M_1$ and $M_2$ return two non-zero components in the vector $\widehat{\beta}$: those corresponding to the highest levels of alcohol.
5. Discussion
In this paper we present a new approach, along with refinements, aimed at jointly modeling several related regression models. Although most of the presentation was made in the situation where these models correspond to several strata of a population, our approach can be used in the more general multi-task learning setting. Also, for ease of notation, mostly linear regression models were considered in this paper. As illustrated in the application of Section 4, extension to other models is however straightforward. This is particularly true for our first approach and its adaptive or relaxed versions, since they can all be rewritten as simple weighted lasso problems: in particular, linear, logistic, Poisson and Cox models can be treated thanks to the glmnet R package (Friedman et al., 2010). Other functions exist for other models, like the clogitL1 R package for conditional logistic models of Reid and Tibshirani (2014), and allow for further extensions of our approach.

Our approach naturally connects with several strategies formerly proposed in the literature. First, it is a simplified version of the Dirty models proposed by Jalali et al. (2013) in the linear regression setting, and is much easier to extend to other regression models. On the designs considered in our simulation study, it still performs similarly to the Dirty models, and we also exhibited situations where its theoretical sample complexity is at least as good as that of the Dirty models. Second, the adaptive versions of our approaches are special cases of the generalized fused lasso. This connection is particularly appealing for interpretation matters since it allows for a simple comparison between CliqueFused, InterLasso and our two strategies. First, CliqueFused uses cliques, while InterLasso and our two approaches use star graphs. Second, for all $j \in [p]$, the centers of the star graphs are set to $\beta_{\ell,j}$ when using InterLasso with reference stratum $\ell$, while when using the adaptive versions of our approaches the centers are adaptively set to $\beta_{\ell_j,j}$: the reference stratum is covariate-specific, automatically selected from the initial estimates, and it corresponds (with probability tending to 1) to one of the strata in $\mathcal{K}^*_j = \{k : \beta^*_{k,j} = \beta^*_{\ell^*_j,j}\}$ if the initial estimates are consistent. Finally, besides the structure of the graph (edges corresponding to the differences $\beta_{k,j} - \beta_{k',j}$ that are penalized), these four strategies exhibit differences according to which nodes in the graph (coefficients $\beta_{k,j}$) are directly encouraged to be sparse: they are all penalized when using CliqueFused or $M_2$, while only the $\beta_{\ell,j}$'s for InterLasso and the $\beta_{\ell_j,j}$'s for $M_1$ are penalized, for $j \in [p]$.

From a theoretical point of view, the connection with the generalized fused lasso allowed us to derive asymptotic oracle properties for our adaptive versions. Preliminary non-asymptotic results were also stated, under strong but not necessary conditions. Regarding prediction performance, these preliminary results might be extended to more general settings by adapting the recent results of Dalalyan et al. (2014), who study the prediction performance of the lasso in the presence of correlated designs. As for support recovery of the $\beta^*_k$'s, adapting the dual-witness idea of Wainwright (2009) is a promising lead (as Jalali et al. (2013) did for the Dirty models). Other extensions of this work could rest on greedy algorithms or recent MCMC-based approaches (such as the exponential screening of Rigollet and Tsybakov (2012)), for which optimal prediction bounds have been obtained with no particular assumption on the design matrix.

References
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Berlin: Springer-Verlag, 2011.

A. S. Dalalyan, M. Hebiri, and J. Lederer. On the prediction performance of the lasso. arXiv preprint arXiv:1402.1700, 2014.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

J. Gertheiss and G. Tutz. Regularization and model selection with categorial effect modifiers. Statistica Sinica, 22:957–982, 2012.

H. Höfling, H. Binder, and M. Schumacher. A coordinate-wise optimization algorithm for the Fused Lasso. arXiv preprint arXiv:1011.6409, 2010.

L. Jacob, J.-P. Vert, and F. Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745–752, 2009.

A. Jalali, P. Ravikumar, and S. Sanghavi. A dirty model for multiple sparse regression. IEEE Transactions on Information Theory, 59(12):7947–7968, 2013.

K. Lounici. Sup-norm convergence rate and sign concentration property of lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102, 2008.

K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. In COLT'09, 2009.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.

A. Maurer and M. Pontil. Excess risk bounds for multitask learning with trace norm regularization. In COLT'13, pages 55–76, 2013.

P. McCullagh and J. A. Nelder. Generalized Linear Models. 2nd ed. New York: Chapman & Hall, 1989.

N. Meinshausen. Relaxed lasso. Computational Statistics and Data Analysis, 52(1):374–393, 2007.

S. Negahban and M. Wainwright. Simultaneous support recovery in high dimensions: Benefits and perils of block-regularization. IEEE Transactions on Information Theory, 57(6):3841–3863, 2011.

S. Reid and R. Tibshirani. Regularization paths for conditional logistic regression: The clogitL1 package. Journal of Statistical Software, 58(12), 2014.

P. Rigollet and A. B. Tsybakov. Sparse estimation by exponential weighting. Statistical Science, 27(4):558–575, 2012.

N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.

S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

V. Viallon, S. Lambert-Lacroix, H. Höfling, and F. Picard. On the robustness of the generalized fused lasso to prior specifications. Statistics and Computing (to appear), 2014.

M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

6. Appendix: Supplementary Materials
In this section, we derive preliminary non-asymptotic properties for our first approach $M_1$ (Sections 6.1.1 and 6.1.2), as well as asymptotic oracle properties for our adaptive versions (especially for $M_1$, see Section 6.1.3).

6.1.1 K = 2 and $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$

Here we consider the case where $K = 2$. Working under the assumptions considered in Jalali et al. (2013), it is easy to show that the sample complexity of $M_1$ is at most that of the Dirty models, under the additional assumption that $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$. Indeed, for any given pair of potential estimates $\beta_{1,j}$ and $\beta_{2,j}$, the penalty term involved in the Dirty approach for covariate $j$ can be written

$$\phi(r_1, r_2) = \lambda_1 \max(|r_1|, |r_2|) + \lambda_2 \{|\beta_{1,j} - r_1| + |\beta_{2,j} - r_2|\},$$

while for $M_1$ it reduces to

$$\bar{\phi}(r) = \lambda_1 |r| + \lambda_2 \{|\beta_{1,j} - r| + |\beta_{2,j} - r|\} = \phi(r, r).$$

Moreover, for any positive $\lambda_1, \lambda_2$ and any $\beta_{1,j}, \beta_{2,j}$, it is easy to show that $\min_{r_1,r_2} \phi(r_1, r_2) = \min_r \bar{\phi}(r)$ if $\beta_{1,j}\beta_{2,j} \ge 0$, and $\min_{r_1,r_2} \phi(r_1, r_2) < \min_r \bar{\phi}(r)$ otherwise. In words, pairs of values such that $\beta_{1,j}\beta_{2,j} \ge 0$ are equally penalized by Dirty and $M_1$, while all other pairs are more heavily penalized by $M_1$. Consequently, assuming that $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$, the probability of correct support recovery for $M_1$ is at least that of Dirty. In particular, under the assumptions considered in Theorem 3 of Jalali et al. (2013), and assuming that $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$, perfect support recovery is ensured for method $M_1$ as soon as the common number of observations per stratum (task) is larger than $(2 - \alpha)s\log(p - (2 - \alpha)s)$, where $s$ is the common support size of $\beta^*_1$ and $\beta^*_2$ and $\alpha$ is the overlap proportion between the two supports. As shown in Jalali et al. (2013) for their Dirty model, this implies that $M_1$ outperforms both IndepLasso and the $L_1/L_\infty$ group-lasso strategy in terms of sample complexity (except in the two extreme situations of no sharing at all and full sharing, where it matches the best strategy).

In addition to the sample complexity in the simple case described above, other preliminary non-asymptotic properties can easily be derived for our approach thanks to its rewriting as a weighted lasso. Most non-asymptotic theoretical results for the lasso have been established under strong conditions on the design matrix: to name a few, the irrepresentability condition, the mutual incoherence, the restricted eigenvalue condition or the RIP property (see van de Geer and Bühlmann (2009) for an overview). In Equation (5) of the main paper, it is clear that the matrix $\mathbf{X}$ generally does not enjoy such properties. Assuming for simplicity that $n_k = n/K$ for all $k \in [K]$, we present here simple situations where it does. A more thorough study would be needed to establish non-asymptotic properties under more general assumptions; an interesting lead lies in the recent results concerning the prediction accuracy of lasso estimators under correlated designs (Dalalyan et al., 2014). This will be considered elsewhere.

The first situation is when $K$ is large enough and the $\beta^*_k$'s remain highly similar. More precisely, assume that the design matrices $X^{(k)}$ fulfill the following mutual incoherence condition (Lounici, 2008), for all $k \in [K]$: setting $\Sigma^{(k)} = (X^{(k)T}X^{(k)})/n_k$ for all $k \in [K]$, we assume that $\Sigma^{(k)}_{j,j} = 1$ for all $j \in [p]$ and $\max_{j_1 \ne j_2} |\Sigma^{(k)}_{j_1,j_2}| \le 1/(7\alpha s)$, for some integer $s \ge 1$ and some $\alpha > 1$.
Then the following standardized version of the matrix $\mathbf{X}$,

$$\widetilde{\mathbf{X}} = \begin{pmatrix} X^{(1)} & \sqrt{K}\,X^{(1)} & 0 & \cdots & 0\\ X^{(2)} & 0 & \sqrt{K}\,X^{(2)} & \cdots & 0\\ \vdots & \vdots & & \ddots & \vdots\\ X^{(K)} & 0 & \cdots & 0 & \sqrt{K}\,X^{(K)} \end{pmatrix},$$

can easily be shown to fulfill the same mutual incoherence assumption, as long as $K > (7\alpha s)^2$. For such values of $K$, and assuming that the true parameter vector $\theta^*$ contains at most $s$ non-zero components (such a $\theta^*$ is then ensured to be unique), the results of Lounici (2008) can be applied to derive the $L_\infty$ rate of convergence and the sign consistency of the estimators obtained from $M_1$, with sparsity parameters set to $\lambda_{2,k} = \lambda_1$ (which is equivalent to working with the non-standardized matrix $\mathbf{X}$ and $\lambda_{2,k} = \lambda_1 K^{-1/2}$). Of course the assumption $K > (7\alpha s)^2$ is very strong: it typically holds when strata are numerous and highly similar (so that $s$ remains low).

Another simple situation arises when $X^{(k)T}X^{(k)}/n_k = I_p$ for all $k \in [K]$ and the vectors $\beta^*_k$ are all equal, and we set again $\lambda_{2,k} = \lambda_1$. Setting $J = \{j : \theta_j \ne 0\}$, this ensures that $J \subseteq [p]$, i.e., $\theta^*_J = \beta^*_J$. Then it is easy to show that the irrepresentability condition of Wainwright (2009) holds with $\gamma = 1 - \sqrt{1/K}$, and then to obtain the following $L_\infty$ bound.
Another simple situation arises when X^(k)^T X^(k)/n_k = I_p for all k ∈ [K] and the vectors β*_k are all equal, with again λ_{2,k} = λ_1. Setting J = {j : θ*_j ≠ 0}, the equality of the β*_k's ensures that J ⊆ [p], i.e., θ*_J = β*_J. Then it is easy to show that the irrepresentability condition of Wainwright (2009) holds with γ = 1 − 1/√K, and then to obtain the following L∞ bound: for any δ, ε > 0, and assuming that the noise variables ξ^(k)_i are i.i.d. σ-sub-Gaussian, we have

  max_k ||β̂_k − β*_k||_∞ = ||β̂ − β*||_∞ ≤ σ { √((|J| + ε)/n) + (2/γ) √((|J| log(Kp − |J|) + δ)/n) },

with probability greater than 1 − e^{−δ/2} − e^{−ε/2}, when the sparsity parameter is taken of order σ √({|J| log(p − |J|) + δ}/n). In other words, when X^(k)^T X^(k)/n_k = I_p for all k ∈ [K] and all the β*_k's are equal, our approach performs as well as IdentLasso (up to constants and log-terms), which is optimal in this case.

6.1.3 Asymptotic results for the adaptive version of M1 in the fixed p, fixed K case

In this paragraph, we consider the simple situation where p and K are held fixed (they do not increase with n). We further assume that, for all k ∈ [K], the matrices X^(k)^T X^(k)/n_k converge to positive definite matrices C^(k) as n_k → ∞, and that n_k/n → κ_k as n → ∞, with 0 < κ_k < 1 (stratum sizes all tend to infinity at the same rate). These assumptions essentially imply that the MLEs β̃_{k,j} can be used in the definition of the weights in Equations (10) and (7) of the main paper, and are √n-consistent as n → ∞.

Before stating our result, some notation is needed; in particular, the overall complexity of the model returned by M1^ad has to be defined properly. For any j ∈ [p], introduce the quantities ℓ*_j = min{k : β*_{k,j} ∈ median(β*_{1,j}, ..., β*_{K,j})}, K*_j = {k : β*_{k,j} = β*_{ℓ*_j,j}} ⊆ [K], and N*_j = Σ_{k ∈ K*_j} n_k. Further introduce A* = {(k, j) : β*_{k,j} ≠ 0} and, for all j ∈ [p], A*_j = {k : β*_{k,j} ≠ 0} and K*_{A*,j} = {k : β*_{k,j} = β*_{ℓ*_j,j} ≠ 0}.

To define the empirical counterparts of these quantities returned by M1^ad, first recall that for any j ∈ [p], ℓ_j = min{k : β̃_{k,j} ∈ median(β̃_{1,j}, ..., β̃_{K,j})}. Then denote by β̂^(ad)_{k,j}, for any k ∈ [K] and j ∈ [p], the estimator of β*_{k,j} returned by M1^ad. We can now define Â = {(k, j) : β̂^(ad)_{k,j} ≠ 0} and, for all j ∈ [p], K̂_j = {k : β̂^(ad)_{k,j} = β̂^(ad)_{ℓ_j,j}} ⊆ [K], Â_j = {k : β̂^(ad)_{k,j} ≠ 0} and K̂_{Â,j} = {k : β̂^(ad)_{k,j} = β̂^(ad)_{ℓ_j,j} ≠ 0}. Finally, for any j ∈ [p], denote by ŝ_j the number of distinct non-null values in (β̂^(ad)_{1,j}, ..., β̂^(ad)_{K,j}), that is

  ŝ_j = |Â_j|                      if β̂^(ad)_{ℓ_j,j} = 0,
        |Â_j| − |K̂_{Â,j}| + 1      otherwise.

The overall complexity of the model returned by M1^ad, which corresponds to the number of "free" parameters it returns, is then defined as ŝ = Σ_{j ∈ [p]} ŝ_j. More precisely, for any j ∈ [p] such that ŝ_j > 0, M1^ad returns a vector η̂^(ad)_j of size ŝ_j defined by

  η̂^(ad)_j = β̂^(ad)_{ℓ_j,j}                                                          if β̂^(ad)_{ℓ_j,j} ≠ 0 and ŝ_j = 1;
             (β̂^(ad)_{k̄_{j,1},j}, ..., β̂^(ad)_{k̄_{j,ŝ_j},j})^T                        if β̂^(ad)_{ℓ_j,j} = 0 and ŝ_j ≥ 1, with {k̄_{j,1}, ..., k̄_{j,ŝ_j}} = Â_j;
             (β̂^(ad)_{ℓ_j,j}, β̂^(ad)_{k̄_{j,1},j}, ..., β̂^(ad)_{k̄_{j,ŝ_j−1},j})^T       otherwise, with {k̄_{j,1}, ..., k̄_{j,ŝ_j−1}} = Â_j \ K̂_{Â,j}.
Now denote by η̂^(ad) the estimator returned by M1^ad (after some reordering), i.e., η̂^(ad) = (η̂^(ad)_j)_{j : ŝ_j > 0} ∈ IR^ŝ. We can further introduce the theoretical counterparts of ŝ_j and ŝ by setting s*_j = |A*_j| if β*_{ℓ*_j,j} = 0 and s*_j = |A*_j| − |K*_{A*,j}| + 1 otherwise, and s* = Σ_{j ∈ [p]} s*_j. Finally, introduce for any j ∈ [p] such that s*_j > 0

  η*_j = β*_{ℓ*_j,j}                                              if β*_{ℓ*_j,j} ≠ 0 and s*_j = 1;
         (β*_{k̄*_{j,1},j}, ..., β*_{k̄*_{j,s*_j},j})^T              if β*_{ℓ*_j,j} = 0 and s*_j ≥ 1, with {k̄*_{j,1}, ..., k̄*_{j,s*_j}} = A*_j;
         (β*_{ℓ*_j,j}, β*_{k̄*_{j,1},j}, ..., β*_{k̄*_{j,s*_j−1},j})^T  otherwise, with {k̄*_{j,1}, ..., k̄*_{j,s*_j−1}} = A*_j \ K*_{A*,j},

and η* = (η*_j)_{j : s*_j > 0} ∈ IR^{s*}.

As a simple consequence of Theorem 2 in Viallon et al. (2014), we have IP(Â = A*) → 1 and IP(∩_{j ∈ [p]} {K̂_{Â,j} = K*_{A*,j}}) → 1, as n → ∞.
In particular, this implies that the two vectors η* and η̂^(ad) are of identical size s* = ŝ with probability tending to one. Moreover, we have the following Gaussian limit distribution:

  √n (η̂^(ad) − η*) → N(0_{s*}, σ² (X̃_{s*}^T X̃_{s*})^{−1})   as n → ∞,

with X̃_{s*} the (n × s*) matrix that can be written as (X̃_{j,s*_j})_{j : s*_j > 0}, where each submatrix X̃_{j,s*_j}, of size (n × s*_j), is of the form

  X̃_{j,s*_j} = Σ_{k ∈ K*_{A*,j}} X̃^(k)_j                                                     if β*_{ℓ*_j,j} ≠ 0 and s*_j = 1;
               (X̃^{(k̄*_{j,1})}_j, ..., X̃^{(k̄*_{j,s*_j})}_j)                                   if β*_{ℓ*_j,j} = 0 and s*_j ≥ 1, with {k̄*_{j,1}, ..., k̄*_{j,s*_j}} = A*_j;
               (Σ_{k ∈ K*_{A*,j}} X̃^(k)_j, X̃^{(k̄*_{j,1})}_j, ..., X̃^{(k̄*_{j,s*_j−1})}_j)       otherwise, with {k̄*_{j,1}, ..., k̄*_{j,s*_j−1}} = A*_j \ K*_{A*,j}.

Here, X̃^(k)_j is the column vector of size n obtained by padding the j-th column of X^(k) with zeros, namely (0^T_{Σ_{k' < k} n_{k'}}, X^(k)_j^T, 0^T_{Σ_{k' > k} n_{k'}})^T.
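For concreteness, a minimal R sketch (not taken from the original text) of the complexity count ŝ is given below, computed from a hypothetical p × K matrix B_hat of adaptive estimates. Ties and medians are handled up to a numerical tolerance, and for even K the sketch simply assumes that the sample median coincides with one of the estimates.

```r
# Complexity count s_hat = sum_j s_hat_j, following the definitions above.
complexity <- function(B_hat, tol = 1e-8) {
  s_j <- apply(B_hat, 1, function(b) {
    med  <- median(b)
    l_j  <- min(which(abs(b - med) <= tol))                 # l_j: first index attaining the median
    A_j  <- which(abs(b) > tol)                             # non-zero components
    K_Aj <- which(abs(b - b[l_j]) <= tol & abs(b) > tol)    # non-zero components equal to the median value
    if (abs(b[l_j]) <= tol) length(A_j) else length(A_j) - length(K_Aj) + 1
  })
  sum(s_j)
}

# toy example with p = 2 covariates and K = 4 strata
B_hat <- rbind(c(0.5, 0.5, 0.5, 1.0),   # s_j = 2: the shared (median) value plus one distinct value
               c(0.0, 0.0, 0.3, 0.0))   # s_j = 1: median is 0, one non-zero value
complexity(B_hat)                        # 3
```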
Under example 3 of Figure 4, several values are valid medians, while only one value corresponds to the median under the other three examples (0, 0.5 and 0 for examples 1, 2 and 4, respectively). Contrary to the CliqueFused strategy, which is based on cliques, both our approaches use star-graphs and are therefore sub-optimal in situations where the non-zero values in (β*_{1,j}, ..., β*_{K,j}) consist of at least two groups of identical values (and possibly some distinct non-zero values). For instance, in example 4 of Figure 4, neither M1^ad nor M2^ad can return equal values for the components β̂^(ad)_{1,j}, β̂^(ad)_{2,j} and β̂^(ad)_{3,j} (nor for β̂^(ad)_{8,j}, β̂^(ad)_{9,j} and β̂^(ad)_{10,j}), while CliqueFused would, for n large enough. However, CliqueFused, i.e., clique-based strategies, can of course be outperformed on finite samples. This was illustrated in the simulation study conducted in Viallon et al. (2014) in the single-stratum setting. In the present "multiple strata" setting, if β*_{1,j}, ..., β*_{K−1,j} are all equal and β*_{K,j} differs from these K − 1 values, then we have
ℓ_j ∈ [K − 1] for n large enough, and our approaches would only penalize |β_{K,j} − β_{ℓ_j,j}|: they are therefore more likely to detect this difference than the clique-based approach, which penalizes all the differences |β_{K,j} − β_{k,j}| for k ∈ [K − 1].

6.2 Dual problem for M2

In this paragraph, we consider a general framework that encompasses linear and logistic models, along with the SVM, as special cases. Let f be a convex loss function and, for any k ∈ [K], set L_k(β_k) = Σ_{i ∈ [n_k]} f(β_k^T a^(k)_i + c^(k)_i) for β_k ∈ IR^p and some given a^(k)_i ∈ IR^p and c^(k)_i ∈ IR. We define the k-th feature matrix A^(k) := [a^(k)_1, ..., a^(k)_{n_k}]^T ∈ IR^{n_k × p}. We consider a generic supervised learning problem of the form

  (β̂, (γ̂_k)_{k=1}^K) ∈ argmin_{β, (γ_k)_{k=1}^K} Σ_{k=1}^K { L_k(β + γ_k) + λ_{1,k} ||β + γ_k||_1 + λ_{2,k} ||γ_k||_1 }.   (11)

Table 1: Expression of the conjugate of popular loss functions, adopting the convention 0 log 0 = 0 for the logistic loss.

  loss       f(ξ)                          conjugate f*(ϑ)                        domain of f*
  squared    f_sq(ξ) = ξ²/(2n)             n ϑ²/2                                 IR
  logistic   f_log(ξ) = log(1 + e^{−ξ})    (−ϑ) log(−ϑ) + (ϑ + 1) log(ϑ + 1)      [−1, 0]
  hinge      f_hi(ξ) = (1 − ξ)_+           ϑ                                      [−1, 0]

Recall that throughout the paper, the data are of the form (x^(k)_1, y^(k)_1), ..., (x^(k)_{n_k}, y^(k)_{n_k}) with x^(k)_i ∈ IR^p and y^(k)_i ∈ IR, for i ∈ [n_k] and k ∈ [K]. In this context, the aforementioned formalism covers linear regression with the loss function set to the squared loss f(ξ) = f_sq(ξ) = ξ²/(2n), by setting a^(k)_i = x^(k)_i (the vector of covariates) and c^(k)_i = −y^(k)_i (the negative response) for observation i ∈ [n_k] of the k-th stratum. Likewise, by setting c^(k)_i = 0 and a^(k)_i = y^(k)_i x^(k)_i for i ∈ [n_k], our formalism covers both logistic regression and the SVM framework, with the logistic loss f(ξ) = f_log(ξ) = log(1 + e^{−ξ}) and the hinge loss f(ξ) = f_hi(ξ) = (1 − ξ)_+, respectively.

For future use, we denote by f* the Fenchel conjugate of the loss function f, which is the extended-value convex function defined as f*(ϑ) := max_ξ {ξϑ − f(ξ)}. Beyond convexity, we make a few mild assumptions on the loss function f. First, we assume that it is non-negative everywhere and that it is closed (its epigraph is closed), so that f** = f. These assumptions are met by the squared, logistic and hinge loss functions, as well as by other popular loss functions. The conjugates of the squared, logistic and hinge losses are given in Table 1.

We have

  φ((λ_{1,k})_{k ∈ [K]}, (λ_{2,k})_{k ∈ [K]}) = min_{β, (γ_k)_{k=1}^K} Σ_{k=1}^K { L_k(β + γ_k) + λ_{1,k} ||β + γ_k||_1 + λ_{2,k} ||γ_k||_1 }
                                              = min_{β, (β_k)_{k=1}^K, (z_k)_{k=1}^K} Σ_{k=1}^K { G_k(z_k) + λ_{1,k} ||β_k||_1 + λ_{2,k} ||β_k − β||_1 }
                                                s.t. z_k = A^(k) β_k + c_k, k ∈ [K],

where, for z ∈ IR^{n_k}, G_k(z) = Σ_{i=1}^{n_k} f(z_i) and c_k = (c^(k)_1, ..., c^(k)_{n_k}) for any k ∈ [K].

We can now express the problem in min-max form as follows:

  φ((λ_{1,k})_{k ∈ [K]}, (λ_{2,k})_{k ∈ [K]}) = min_{β, (β_k), (z_k)} max_{(u_k), (v_k), (α_k)} Σ_{k=1}^K { G_k(z_k) + α_k^T (A^(k) β_k + c_k − z_k) + u_k^T β_k + v_k^T (β_k − β) }
  s.t.
  ||u_k||_∞ ≤ λ_{1,k}, ||v_k||_∞ ≤ λ_{2,k}, k ∈ [K].

Now, solving for β and (β_k)_{k=1}^K, we obtain the dual constraints

  Σ_{k=1}^K v_k = 0_p,   A^(k)^T α_k = u_k + v_k, for k ∈ [K].

Still denoting by f* the Fenchel conjugate of the function f, and eliminating the variables u_k, k = 1, ..., K, we obtain the dual problem

  φ((λ_{1,k})_{k ∈ [K]}, (λ_{2,k})_{k ∈ [K]}) = min_{α_1, ..., α_K} Σ_k { α_k^T c_k + Σ_{i ∈ [n_k]} f*(α_{k,i}) }

under the constraints

  ||A^(k)^T α_k − v_k||_∞ ≤ λ_{1,k},  ||v_k||_∞ ≤ λ_{2,k},  Σ_k v_k = 0_p  and  α_{k,i} ∈ Dom(f*), for k ∈ [K].

For instance, in the logistic regression setting, this takes the form

  φ((λ_{1,k})_{k ∈ [K]}, (λ_{2,k})_{k ∈ [K]}) = min_{α_1, ..., α_K} Σ_{k ∈ [K]} { α_k^T log α_k + (1 − α_k)^T log(1 − α_k) }

under the constraints

  ||A^(k)^T α_k − v_k||_∞ ≤ λ_{1,k},  ||v_k||_∞ ≤ λ_{2,k},  Σ_k v_k = 0_p  and  α_k ∈ [0, 1]^{n_k}, for k ∈ [K].

This problem is then equivalent to a standard entropy maximization problem and can therefore be solved using standard optimization toolboxes such as the Mosek toolbox in R or Matlab, which return both optimal dual and primal solutions. (In the linear regression setting, the problem reduces to a quadratic program and can also be solved with Mosek.)
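Alternatively, the primal problem (11) can be handed directly to a generic convex solver. The following R sketch uses the CVXR package with the squared loss on simulated data; it is only an illustration of problem (11) under these assumptions, not the Mosek-based dual approach described above.

```r
# Solve problem (11) with the squared loss f_sq(xi) = xi^2/(2n) on toy data.
library(CVXR)

set.seed(1)
K <- 3; p <- 10; n_k <- 50; n <- K * n_k
A    <- lapply(seq_len(K), function(k) matrix(rnorm(n_k * p), n_k, p))  # feature matrices A^(k)
cvec <- lapply(seq_len(K), function(k) rnorm(n_k))                      # offsets c^(k) (e.g., -y^(k))
lambda1 <- rep(0.1, K); lambda2 <- rep(0.05, K)

beta  <- Variable(p)                                  # shared component
gamma <- lapply(seq_len(K), function(k) Variable(p))  # stratum-specific deviations

obj <- 0
for (k in seq_len(K)) {
  xi  <- A[[k]] %*% (beta + gamma[[k]]) + cvec[[k]]
  obj <- obj + sum_squares(xi) / (2 * n) +
         lambda1[k] * p_norm(beta + gamma[[k]], 1) +
         lambda2[k] * p_norm(gamma[[k]], 1)
}
fit <- solve(Problem(Minimize(obj)))
beta_hat <- fit$getValue(beta)   # estimated shared effect
```

Such a generic formulation is mainly useful as a check on small problems.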
6.3 (Sparse) group lasso (spGroupLasso)

Lounici et al. (2009) and Negahban and Wainwright (2011) consider two different group lasso strategies. Their estimators are defined as minimizers of the criterion

  Σ_k ||y^(k) − X^(k) β^(k)||²_2 / n + λ ||B||_{1,q},   (12)

with typical values q = 2 (Lounici et al., 2009) or q = ∞ (Negahban and Wainwright, 2011). The ||·||_{1,q} norm encourages solutions B̂ to exhibit a row-wise group structure, in the following sense: for each covariate j ∈ [p], the parameters β̂^(k)_j, k ∈ [K], are either all zero or all non-zero (the estimated effect of each covariate is either null for all tasks or non-null for all tasks).

A slightly more flexible approach consists in adding an L1 penalty, to allow covariates to have a non-zero estimated effect on all but some tasks. This leads to the so-called sparse group lasso in standard regression models (Simon et al., 2013). In our context, this suggests minimizing the following criterion:

  Σ_k ||y^(k) − X^(k) β^(k)||²_2 / n + λ_1 ||B||_{1,1} + λ_2 ||B||_{1,q}.   (13)

Note that in Equation (13) above, the solution B̂ is encouraged to be sparse and to exhibit a group structure, while in the Dirty model of Jalali et al. (2013) (see Equation (6) of the main paper) the solution is the sum of two matrices: Ŝ, which is encouraged to be sparse, and R̂, which is encouraged to have a group structure.
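As an illustration, criterion (13) with q = 2 can be written down directly for a generic solver; the R sketch below again relies on CVXR and simulated data, and is only meant to make the structure of the two penalties explicit (it is not the algorithm used in the simulation study).

```r
# Sparse group lasso criterion (13) with q = 2 on toy multi-task data.
library(CVXR)

set.seed(1)
K <- 3; p <- 10; n_k <- 50; n <- K * n_k
X <- lapply(seq_len(K), function(k) matrix(rnorm(n_k * p), n_k, p))
y <- lapply(seq_len(K), function(k) rnorm(n_k))
lambda1 <- 0.05; lambda2 <- 0.1

B <- Variable(p, K)                     # column k holds beta^(k)
loss <- 0
for (k in seq_len(K)) loss <- loss + sum_squares(y[[k]] - X[[k]] %*% B[, k]) / n

l1_pen <- p_norm(B, 1)                  # ||B||_{1,1}: entrywise L1 norm
group_pen <- 0                          # ||B||_{1,2}: sum over covariates of row-wise L2 norms
for (j in seq_len(p)) group_pen <- group_pen + p_norm(B[j, ], 2)

fit <- solve(Problem(Minimize(loss + lambda1 * l1_pen + lambda2 * group_pen)))
B_hat <- fit$getValue(B)
```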
Non-asymptotic properties have been established for the non-sparse group lasso in multi-task settings (Lounici et al., 2009; Negahban and Wainwright, 2011). In particular, Negahban and Wainwright (2011) compare the L1/L∞-regularized method to IndepLasso in the special case of K = 2 linear regression problems with standard Gaussian designs whose supports have common size s and overlap in a fraction α ∈ [0, 1] of their entries. They focus on the capacity of each method to correctly recover the union of the two supports. Assuming a common number n/2 of observations per task, they show that the L1/L∞-regularized method yields improved statistical efficiency if the overlap parameter is large enough (α ≥ 2/3), and is outperformed by IndepLasso otherwise (α < 2/3). Under the additional assumption that β*_{1,j} β*_{2,j} ≥ 0 for all j ∈ [p],
we have established that our first approach M1 performs at least similarly to the Dirty model (see Section 6.1.1 above).

6.4 Additional results from the simulation study

6.4.1 K = 2

Here we present a comparison of Dirty and M1 when K = 2. We actually mimic the simulation study presented in Jalali et al. (2013). For a given value of p, the sparsity of both β*_1 and β*_2 is set to a common value s (a fixed fraction of p). Results are presented for p = 128 (results for p = 256 were similar). The proportion α of non-zero and equal components in β*_1 and β*_2 takes two values. For a given value of β > 0, non-zero components are set to ±β with probability 1/2. Design matrices were generated under a multivariate Gaussian N_p(0_p, I_p) distribution, and noise variables under a N(0, 0.1) distribution. The common number of observations per stratum, n_1 = n_2 = n/2, is set to n_k = c s log[p − (2 − α)s] for several values of c defining the rescaled sample size (Jalali et al., 2013).
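A minimal R sketch of this data-generating mechanism is given below; the support size s, the overlap α, the constant c and the signal level β are illustrative values, not the (partly unreadable) ones of the original study.

```r
# Generate one replicate of the K = 2 design described above (illustrative values).
set.seed(1)
p <- 128; s <- 16; alpha <- 0.5; beta_val <- 1; c_const <- 2
n_k <- ceiling(c_const * s * log(p - (2 - alpha) * s))   # per-stratum sample size
n_overlap <- floor(alpha * s)                            # number of shared (equal) components

supp1 <- 1:s
supp2 <- c(1:n_overlap, (s + 1):(2 * s - n_overlap))     # supports overlap in a fraction alpha of entries
beta1 <- beta2 <- numeric(p)
beta1[supp1] <- sample(c(-beta_val, beta_val), s, replace = TRUE)
beta2[supp2] <- sample(c(-beta_val, beta_val), s, replace = TRUE)
beta2[1:n_overlap] <- beta1[1:n_overlap]                 # shared components are equal

X1 <- matrix(rnorm(n_k * p), n_k, p)
X2 <- matrix(rnorm(n_k * p), n_k, p)
y1 <- as.vector(X1 %*% beta1 + rnorm(n_k, sd = sqrt(0.1)))
y2 <- as.vector(X2 %*% beta2 + rnorm(n_k, sd = sqrt(0.1)))
```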
Selection of the optimal sparsity parameter was performed using an independent test sample. Methods referred to as 2step Dirty and 2step M1 further use a two-step strategy to unshrink the estimates before computing the L2 prediction error on the test sample. Figure 5 presents the results we obtained for the support recovery accuracy. It confirms that Dirty and M1 perform similarly.
Figure 5: Results from the simulation study with K = 2: support recovery accuracy as a function of the rescaled sample size, for Dirty, M1, 2step Dirty and 2step M1 (panels indexed by the values of α and β).

6.4.2 Support recovery accuracy for the methods compared on Figure 1 of the main paper

Figure 6 presents the support recovery accuracies for the methods compared in terms of prediction accuracy on Figure 1 of the main paper. As mentioned in the main paper, the conclusions drawn from Figure 6 are very similar to those drawn from Figure 1 of the main paper.
Figure 6: Results from the first simulation study: support recovery accuracy as a function of the signal-to-noise ratio (snr), with panels indexed by Sparsity_glob and Sparsity_spec, for M1, M2, Dirty, CliqueFused, spGroupLasso, Stepwise, IdentLasso and IndepLasso.

6.5 Performance of M2 depending on the (λ_1, λ_2) grid

All methods presented in the main article involve tuning parameters that need to be carefully selected in practice. Generally speaking, a predefined grid of λ values first has to be constructed. For instance, in the single-stratum setting, the minimal value λ_max ensuring that the vector returned by the lasso is null for all λ ≥ λ_max has a closed-form expression (e.g., for linear and logistic regression), and the grid is then generally taken as a geometric sequence of values decreasing from λ_max. Most methods considered in the main paper involve two tuning parameters, and the derivation of λ_{1,max} and λ_{2,max} is not straightforward. Because our simulated data are such that X^(k)_j ∼ N(0, 1)
for all k ∈ [K] and j ∈ [p], and n_k = n/K for all k ∈ [K], we limit the presentation to the case where the tuning parameters do not depend on k; if the n_k's are not equal, one solution for M1 consists in first standardizing the columns of matrix X and then proceeding as described here. For the Dirty approach of Jalali et al. (2013), we proceed as the authors did in their simulation study. Because of its connection with the lasso, the following strategy can be used for our first approach M1 (or its adaptive version). For any pair (λ_1, λ_2) such that λ_1 ≥ Kλ_2, it is easy to see from Equation (4) of the main paper that the estimated common effect β̂ will be 0_p, so that M1 reduces to a version of IndepLasso run with the tuning parameter set to λ_2. On the other hand, if λ_1 = 0 then β is unpenalized, and M1 returns parameter vectors β̂_k = β̂ for λ_2 large enough. Therefore, our strategy is to first make the ratio λ_1/λ_2 vary over the interval [0, K] (we take 50 values equally spaced on a log scale), and, for each value r of this ratio, the glmnet function can be used to compute the value λ_{2,max}(r), along with the 50 models returned when λ_2(r) varies over a grid of 50 values, equally spaced on the log scale and decreasing from λ_{2,max}(r).

For the other methods (M2, CliqueFused and spGroupLasso) we proceed as follows. We first consider the case λ_2 = 0. The methods then all reduce to a standard lasso, for which the λ_{1,max} value can be computed. The grid of λ_1 values is then chosen as a set of 50 values, equally spaced on the log scale and decreasing from λ_{1,max}. Then, for each λ_1 value on this grid, the minimal value λ_{2,max}(λ_1) is numerically approximated. This value is such that, for all λ_2 ≥ λ_{2,max}(λ_1), the considered method computed with parameters (λ_1, λ_2) returns a parameter vector β̂(λ_1, λ_2) with the lowest possible overall complexity. In particular, for methods M1, M2 and CliqueFused, this means that β̂_k(λ_1, λ_2) = β̂_{k'}(λ_1, λ_2) for all (k, k') ∈ [K]² and λ_2 ≥ λ_{2,max}(λ_1). Such an approximate λ_{2,max}(λ_1) value is simply obtained by iteratively running the considered method for various values of λ_2, starting from a very large value and successively dividing it by 2, for instance. Another, simpler, strategy is as follows. We first consider the case λ_2 = 0, for which the methods all reduce to a standard lasso and the λ_{1,max} value can be computed. Then, we use the same grid of 50 values, equally spaced on the log scale and decreasing from λ_{1,max}, for both λ_1 and λ_2. This strategy is referred to as AlternativeGrid in Figure 7 below, which compares the performance of M2 implemented with the two strategies on the first simulation study. The comparison of the black and grey curves illustrates how difficult it is to empirically compare methods. When using the AlternativeGrid strategy, M2 performs similarly to M1, while with the other strategy it performs like spGroupLasso (which uses the same strategy for the grid construction). Therefore, the observed differences of performance between the various strategies can be (at least partly) explained by the choice of the initial grid of (λ_1, λ_2) values.
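For M1, this ratio-based grid can be sketched with glmnet as follows. The block structure of the transformed design (Equation (5) of the main paper) is assumed here rather than reproduced, the lower end of the ratio grid is truncated at a small positive value, and glmnet internally rescales penalty.factor, so the returned lambda sequences are only proportional to the λ_2 values discussed above.

```r
# Grid construction for M1: one glmnet path per value of the ratio r = lambda_1/lambda_2.
library(glmnet)
library(Matrix)

build_design <- function(X_list) {
  # first p columns: common effect; then one block of p columns per stratum
  cbind(Matrix(do.call(rbind, X_list), sparse = TRUE), bdiag(X_list))
}

grid_M1 <- function(X_list, y_list, n_ratios = 50, n_lambda = 50) {
  K <- length(X_list); p <- ncol(X_list[[1]])
  Xt <- build_design(X_list)
  y  <- unlist(y_list)
  ratios <- exp(seq(log(1e-3), log(K), length.out = n_ratios))  # ratio lambda_1/lambda_2 in (0, K]
  lapply(ratios, function(r) {
    pf  <- c(rep(r, p), rep(1, K * p))       # penalty.factor encodes the ratio between the two penalties
    fit <- glmnet(Xt, y, family = "gaussian", penalty.factor = pf, nlambda = n_lambda)
    list(ratio = r, lambda = fit$lambda, fit = fit)  # fit$lambda[1] plays the role of lambda_{2,max}(r)
  })
}
```

Each element of the returned list then corresponds to one value of the ratio, with its own decreasing sequence of penalty levels.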
Figure 7: Comparison of the prediction performance for M2 depending on the strategy used to construct the (λ_1, λ_2) grid, as a function of the signal-to-noise ratio (snr) and with panels indexed by Sparsity_glob and Sparsity_spec; methods shown: M1, M2, M2_AlternativeGrid and spGroupLasso.