Joint estimation of K related regression models with simple L1-norm penalties

E. Ollier* and V. Viallon‡

* Université de Lyon, F-69622, Lyon, France
‡ Université de Lyon, F-69622, Lyon, France; Université Lyon 1, UMRESTTE, F-69373 Lyon; IFSTTAR, UMRESTTE, F-69675 Bron.
[email protected]  [email protected]
Abstract
We propose a new approach, along with refinements, based on L1 penalties and aimed at jointly estimating several related regression models. Its main interest is that it can be rewritten as a weighted lasso on a simple transformation of the original data set. In particular, it does not need new dedicated algorithms and is ready to implement under a variety of regression models, e.g., using standard R packages. Moreover, asymptotic oracle properties are derived, along with preliminary non-asymptotic results, suggesting good theoretical properties. Our approach is further compared with state-of-the-art competitors under various settings on synthetic data: these empirical results confirm that our approach performs at least similarly to its competitors. As a final illustration, an analysis of road safety data is provided.
1. Introduction
With the emergence of high-dimensional data, penalized regression models have now become standard, with such penalties as the $L_q$-norm or quasinorm of the parameter vector, for some $q \ge 0$. Underlying the use of these penalties, a very common assumption when working with moderate to high dimensional data is that the theoretical parameter vector $\beta^*$ is sparse: only a small subset of its $p$ components is expected to be non-null. Sparsity-inducing penalties, such as those relying on the $L_1$-norm, are especially useful in this context: the $L_1$-norm being convex, it further leads to approaches that can generally be solved efficiently; they are often referred to as the lasso. Under appropriate conditions, lasso estimates attain optimal convergence rates (up to logarithmic terms): sparsity-inducing approaches are not only appealing for interpretation matters, but also because they can improve upon non-penalized procedures and lead to optimal estimates regarding estimation and/or prediction accuracy (see, e.g., Bickel et al. (2009); Bühlmann and van de Geer (2011); Dalalyan et al. (2014) for the linear regression case).

In many applications, the objective is actually to estimate several, $K$ say, parameter vectors $\beta^*_1, \ldots, \beta^*_K$ corresponding to $K$ related regression models; this problem is often referred to as multi-task learning in the literature (Argyriou et al., 2008; Maurer and Pontil, 2013; Negahban and Wainwright, 2011; Lounici et al., 2009; Jacob et al., 2009). Our motivating example corresponds to a standard situation in epidemiology or clinical research, where data come from several strata of the population (Gertheiss and Tutz, 2012; Viallon et al., 2014). Each stratum can correspond to a type or dosage of treatment, a given geographical area, or can be defined by crossing age category and gender, etc. The relationship between the response variable and the covariates has then to be studied on each stratum. This relationship is rarely strictly the same over all the strata, but some homogeneity is generally expected. As a result, simple strategies consisting in estimating one model per stratum, or one model on all the strata pooled together, are generally sub-optimal in this context, and appropriate approaches should automatically adapt to the level of heterogeneity over the strata. Besides the sparsity assumption on each $\beta^*_k$, the vectors $\beta^*_k$ are usually expected to be close to each other in some sense. The principle of multi-task learning strategies is then to account for the presumed similarity among the $\beta^*_k$'s, while not masking potential heterogeneities, in order to improve estimation performance. Recent methods rely on penalties encouraging estimators of the matrix $B^* = (\beta^*_1, \ldots, \beta^*_K) \in \mathbb{R}^{p\times K}$ to exhibit some given structure: low rank with the trace-norm penalty, group structure with $L_1/L_2$ or $L_1/L_\infty$ penalties, etc. To solve the corresponding optimization problems, dedicated algorithms are generally needed, which limits their use and generalization by practitioners: most of the available algorithms only handle the $L_2$ loss (for linear regression typically), and/or the logistic or hinge loss. In this paper we propose a new approach which is very intuitive and easy to implement under a variety of models, i.e., for a variety of loss functions. Indeed, our approach reduces to a weighted lasso on a straightforward transformation of the data. As a result, any function or package enabling the resolution of the weighted lasso can directly be used for the implementation. In particular, the glmnet R package of Friedman et al. (2010) allows the treatment of linear, logistic, Poisson, multinomial and Cox models.
The principle of our approach is very simple: we encourage sparsity within each individual $\beta^*_k$ and similarity between the $\beta^*_k$'s. As will be shown below, we reach this objective by simply using $L_1$-norm penalties, and our approach returns estimates of the matrix $B^*$ that are sums of a rank-one matrix and a sparse one.

In the following Section 2, we formally introduce the setting we consider, describe our approach, state its connections with previous works, discuss its practical implementation and present theoretical results. Then, results from simulation studies are presented in Section 3, where comparisons are made with state-of-the-art methods. Finally, an illustration of our approach is provided on road safety data in Section 4, where $K = 18$ experts have been asked to determine drivers' responsibility in road traffic accidents and the objective is to assess expert agreement.

2. Methods

Although the main applications we have in mind concern data gathered from various strata of a population, our work falls into the more general multi-task learning context. For the sake of simplicity, we will use the terminology stratum everywhere, but everything can be extended by replacing stratum by task. In addition, methods as well as their theoretical properties will be presented in the linear regression model with no intercept for ease of notation. Extensions to generalized linear models (McCullagh and Nelder, 1989) are straightforward. How to deal with intercept terms in practice is briefly discussed in Section 2.2.2 below.
For any integer $m \ge 1$, we will denote by $[m]$ the set of values $\{1, \ldots, m\}$. Moreover, $0_m$ and $1_m$ will denote the vectors of size $m$ with components all equal to 0 and 1 respectively, while $I_m$ will stand for the $(m \times m)$ identity matrix. For any vector $x = (x_1, \ldots, x_m) \in \mathbb{R}^m$, we further denote by $\mathrm{supp}(x)$ its support (i.e., $\mathrm{supp}(x) = \{j \in [m] : x_j \ne 0\}$), and we set $\|x\|_q = (\sum_{j \in [m]} |x_j|^q)^{1/q}$ for any real number $q \in (0, \infty)$, $\|x\|_\infty = \max_j |x_j|$ and $\|x\|_0 = |\mathrm{supp}(x)|$, where $|E|$ denotes the cardinality of the set $E$. For any set $E \subseteq [m]$, we will denote by $x_E$ the vector of $\mathbb{R}^{|E|}$ with components $(x_j)_{j \in E}$.

Denote by $K \ge 2$ the number of strata and by $n_k$ the number of observations in stratum $k \in [K]$, with $n = \sum_{k \in [K]} n_k$ the total number of observations. We will assume that for any $k \in [K]$ the response vectors $y^{(k)} = (y^{(k)}_1, \ldots, y^{(k)}_{n_k})^T \in \mathbb{R}^{n_k}$ are related to the design matrices $X^{(k)} = (x^{(k)T}_1, \ldots, x^{(k)T}_{n_k})^T \in \mathbb{R}^{n_k \times p}$ according to the following linear model

$$y^{(k)} = X^{(k)}\beta^*_k + \varepsilon^{(k)}. \qquad (1)$$

In this equation, the vectors $\varepsilon^{(k)} = (\varepsilon^{(k)}_1, \ldots, \varepsilon^{(k)}_{n_k})^T \in \mathbb{R}^{n_k}$ denote noise vectors: the variables $\varepsilon^{(k)}_i$ will be assumed to be independent and identically distributed (i.i.d.) with $\mathrm{E}\,\varepsilon^{(k)}_i = 0$ and $\mathrm{Var}(\varepsilon^{(k)}_i) = \sigma^2$ for some unknown $\sigma^2 > 0$, for all $i \in [n_k]$ and $k \in [K]$. Finally, the vectors $\beta^*_k \in \mathbb{R}^p$ are the $K$ parameter vectors of interest, to be estimated. In the context considered here, the $\beta^*_k$'s are expected to be sparse and close to each other in some sense.

Our proposal relies on the following decomposition for $\beta^*_k$, $k \in [K]$:

$$\beta^*_k = \beta^* + \gamma^*_k. \qquad (2)$$

Here $\beta^*$ describes what is "common" over the strata, while $\gamma^*_k$ captures the variation in stratum $k$ around $\beta^*$. Of course, there are infinitely many such decompositions, but assuming that the vectors $\beta^*_k$ are close to each other, those minimizing the quantity $\sum_k \|\gamma^*_k\|_q^q$, for some $q \ge 0$, are naturally appealing. They correspond to very natural choices for the vector $\beta^*$: $\beta^* \in \arg\min_\beta \sum_k \|\beta^*_k - \beta\|_q^q$. In particular, it is straightforward that the choices $q = 0$, 1 and 2 imply that, for each $j \in [p]$, $\beta^*_j$ is a mode, a median and the mean of $(\beta^*_{1,j}, \ldots, \beta^*_{K,j})$, respectively.
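To fix ideas, here is a small component-wise illustration of this choice (a toy numerical example of ours, not taken from the data), for a given covariate $j$ and $K = 5$ strata with effects $(0, 0, 0, 1, 2)$:

$$\beta^*_j \in \arg\min_{\beta_j} \sum_{k=1}^{5} \lvert \beta^*_{k,j} - \beta_j \rvert^q
\quad\Longrightarrow\quad
\beta^*_j = \begin{cases}
0 & (q = 0:\ \text{a mode}),\\
0 & (q = 1:\ \text{the median}),\\
0.6 & (q = 2:\ \text{the mean}).
\end{cases}$$

With $q = 1$, the common effect is exactly zero as soon as a majority of strata have a null effect on covariate $j$, which is precisely the kind of sparsity pattern the $L_1$ penalties introduced below exploit.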
The principle of our proposal is to obtain estimators of the form $\widehat{\beta}_k = \widehat{\beta} + \widehat{\gamma}_k$ with sparse $\widehat{\beta}$ and sparse $\widehat{\gamma}_k$. For appropriate values $\lambda_1 \ge 0$ and $\lambda_{2,k} \ge 0$, our estimates are then defined as

$$(\widehat{\beta}, \widehat{\gamma}_1, \ldots, \widehat{\gamma}_K) \in \operatorname*{argmin}_{\beta, \gamma_1, \ldots, \gamma_K} \ \sum_{k \ge 1} \frac{\| y^{(k)} - X^{(k)}(\beta + \gamma_k) \|_2^2}{2n} + \lambda_1 \|\beta\|_1 + \sum_{k \ge 1} \lambda_{2,k} \|\gamma_k\|_1. \qquad (3)$$

Clearly, $\widehat{\beta} = 0_p$ and $\widehat{\gamma}_k = \widehat{\beta}_k$ for $\lambda_1$ large enough: the $K$ problems are then solved independently (the corresponding strategy will be referred to as IndepLasso in the sequel). On the other hand, we have $\widehat{\beta}_k = \widehat{\beta}$ for all $k$ if the $\lambda_{2,k}$'s are large enough: optimal solutions are those of the lasso run on all the data pooled together (the corresponding strategy will be referred to as IdentLasso in the sequel). This is a nice property since selecting appropriate values for $\lambda_1$ and the $\lambda_{2,k}$'s allows us to adapt to the extent of homogeneity over the strata. We refer to Section 3.1.1 below for a discussion on how to choose $\lambda_1$ and $\lambda_{2,k}$ in practice.

The optimization problem described in Equation (3) is of course equivalent to the minimization of

$$\sum_{k \ge 1} \frac{\|y^{(k)} - X^{(k)}\beta_k\|_2^2}{2n} + \lambda_1 \Big\{ \|\beta\|_1 + \sum_{k \ge 1} \frac{\lambda_{2,k}}{\lambda_1} \|\beta_k - \beta\|_1 \Big\} \qquad (4)$$

over $\beta_k \in \mathbb{R}^p$ for $k \in [K]$ and $\beta \in \mathbb{R}^p$. Because $\beta$ now only appears in the penalty terms, it is easy to see that, at the optimum, $\widehat{\beta}_j$ is a weighted and shrunk version of the median of $(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$; we will denote it by $\mathrm{WSmedian}_{\mu_{[K]}}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$ with $\mu_{[K]} = (\mu_1, \ldots, \mu_K)$ and $\mu_k = \lambda_{2,k}/\lambda_1$. If all $\mu_k$'s are equal and tend to 0, then $\mathrm{WSmedian}_{\mu_{[K]}}$ tends to the constant function returning 0: this corresponds to the decomposition $\widehat{\beta}_k = \widehat{\gamma}_k$, and then to the IndepLasso strategy. On the other hand, if all $\mu_k$'s are equal but tend to infinity, then $\mathrm{WSmedian}_{\mu_{[K]}}$ tends to the standard median function: this corresponds to the decomposition $\widehat{\beta}_k = \widehat{\beta} + \widehat{\gamma}_k$, where $\widehat{\beta}_j = \mathrm{median}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$. If the $\mu_k$'s are all equal to $1/\tau$, with $\tau \in \mathbb{N}$, then $\mathrm{WSmedian}_{\mu_{[K]}}$ is a shrunk median in the sense that

$$\widehat{\beta}_j = \mathrm{WSmedian}_{1/\tau, \ldots, 1/\tau}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j}) = \mathrm{median}(0_\tau^T, \widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j}),$$

that is, the median computed after appending $\tau$ zeros to the list of estimates. A last interesting example is when, for some $\ell \in [K]$, $\mu_\ell \to \infty$ while the other $\mu_k$'s are fixed (finite) for all $k \ne \ell$. Then $\mathrm{WSmedian}_{\mu_{[K]}}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j}) \to \widehat{\beta}_{\ell,j}$, leading to the decomposition $\widehat{\beta}_k = \widehat{\beta}_\ell + \widehat{\gamma}_k$, with $\widehat{\gamma}_\ell = 0_p$: this corresponds to considering stratum $\ell$ as the reference one (see Section 2.3.1 below). To sum up, each particular choice of the $\lambda_{2,k}/\lambda_1$ ratios "identifies" a particular common effect vector $\widehat{\beta}$, with $\widehat{\beta}_j$ defined as $\mathrm{WSmedian}_{\mu_{[K]}}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$, and our approach encourages solutions $(\widehat{\beta}_1, \ldots, \widehat{\beta}_K)$ with a typically sparse vector of common effects $\widehat{\beta}$ and sparse vectors of differences $\widehat{\beta}_k - \widehat{\beta}$.

A very attractive property of our approach is that it can be rewritten as a simple weighted lasso. To state this, first set $Y = (y^{(1)T}, \ldots, y^{(K)T})^T \in \mathbb{R}^n$ and introduce the quantities

$$\mathbf{X} = \begin{pmatrix} X^{(1)} & X^{(1)} & 0 & \cdots & 0\\ X^{(2)} & 0 & X^{(2)} & \cdots & 0\\ \vdots & \vdots & & \ddots & \vdots\\ X^{(K)} & 0 & \cdots & 0 & X^{(K)} \end{pmatrix} \quad \text{and} \quad \theta = \begin{pmatrix} \beta \\ \gamma^{(1)} \\ \vdots \\ \gamma^{(K)} \end{pmatrix},$$

which are elements of $\mathbb{R}^{n \times (K+1)p}$ and $\mathbb{R}^{(K+1)p}$ respectively. Further introduce the vector of weights $\omega = (1_p^T, (\lambda_{2,1}/\lambda_1)\,1_p^T, \ldots, (\lambda_{2,K}/\lambda_1)\,1_p^T)^T \in \mathbb{R}^{(K+1)p}$. Then, setting $\|\theta\|_{1,\omega} = \sum_{j=1}^{(K+1)p} \omega_j |\theta_j|$, the criterion to be minimized in Equation (3) can be rewritten as

$$\frac{\|Y - \mathbf{X}\theta\|_2^2}{2n} + \lambda_1 \|\theta\|_{1,\omega}, \qquad (5)$$

which is the criterion minimized in the weighted lasso. Of course, it is easy to show that this remains true under generalized linear models, Cox models, etc. As mentioned above, this is particularly interesting since it means that our approach can be implemented under a variety of models using available packages for the lasso, e.g., the glmnet R package of Friedman et al. (2010); a minimal R sketch is given below.

Throughout the paper, only linear regression models with no intercept are considered for ease of notation. In practice however, an intercept generally has to be included in the model. A first option consists in using the glmnet function with "intercept=FALSE" and replacing $X^{(k)}$ by $Z^{(k)} = (1_{n_k}, X^{(k)}) \in \mathbb{R}^{n_k \times (p+1)}$ in the definition of $\mathbf{X}$ above: this way, the absolute value of the common intercept term is penalized, and so are the variations around this common intercept. In many situations, it makes more sense not to penalize the common intercept: an alternative option is then to use the glmnet function with "intercept=TRUE" and to replace $X^{(k)}$ by $Z^{(k)}$ in the definition of $\mathbf{X}$ everywhere except in the first $p$ columns of $\mathbf{X}$: this way, only the variations around the common intercept (which corresponds to the median of the intercepts over the strata in this case) are penalized.
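To make this reduction concrete, here is a minimal R sketch for the linear case with no intercept. It is a sketch under our own naming conventions: the helpers build_design and joint_lasso, and the ratios mu $= \lambda_{2,k}/\lambda_1$, are ours and purely illustrative; only the glmnet call and its penalty.factor argument come from the glmnet package.

library(glmnet)

## Stack the K strata into the matrix X of Equation (5):
## columns 1..p carry the common effect beta,
## columns (k*p+1)..((k+1)*p) carry the stratum-specific gamma_k.
build_design <- function(Xlist) {
  K <- length(Xlist); p <- ncol(Xlist[[1]])
  do.call(rbind, lapply(seq_len(K), function(k) {
    Z <- matrix(0, nrow(Xlist[[k]]), (K + 1) * p)
    Z[, 1:p] <- Xlist[[k]]                        # common-effect block
    Z[, (k * p + 1):((k + 1) * p)] <- Xlist[[k]]  # stratum-specific block
    Z
  }))
}

## Weighted lasso of Equation (5): weight 1 on beta, mu_k on gamma_k.
joint_lasso <- function(Xlist, ylist, mu, ...) {
  p <- ncol(Xlist[[1]])
  w <- c(rep(1, p), rep(mu, each = p))            # weights omega
  glmnet(build_design(Xlist), unlist(ylist), family = "gaussian",
         intercept = FALSE, standardize = FALSE, penalty.factor = w, ...)
}

The whole regularization path in $\lambda_1$ is then returned by glmnet, and any of its supported families (binomial, poisson, cox, ...) can be substituted for the Gaussian one; estimates for stratum $k$ are recovered as $\widehat{\beta} + \widehat{\gamma}_k$ from the fitted coefficient vector $\widehat{\theta}$.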
Decomposition (2) was first suggested in Evgeniou and Pontil (2004), who use the SVM machinery with $L_2$-norm penalties. For the application we have in mind, using sparsity-inducing norms such as the $L_1$-norm is more appealing. Indeed, detecting the subset of covariates that have a differential effect over the strata is often of primary interest (see our application in Section 4 below). By using the $L_1$-norm of the $\gamma_k$'s, our approach enjoys good properties regarding support recovery for both the $\beta_k$'s and the $\gamma_k$'s (as will be shown below), which is of course not the case for methods based on $L_2$-norm penalties. Another advantage when using $L_1$-norms is that the common effect estimation is more robust (median versus mean).

Jalali et al. (2013) consider a closely related decomposition, this time for the parameter matrix $B = (\beta_1, \ldots, \beta_K) \in \mathbb{R}^{p \times K}$. They write $B = R + S$, where $R$ and $S$ are two $(p \times K)$ matrices with element $(j,k)$ denoted by $r^{(k)}_j$ and $s^{(k)}_j$, respectively. Further denote by $r^{(k)}$ and $s^{(k)}$ the $k$-th columns of the matrices $R$ and $S$ respectively, so that $\beta^{(k)} = s^{(k)} + r^{(k)}$, and by $r_j$ and $s_j$ the $j$-th rows of the matrices $R$ and $S$. Further set, for any matrix $M = (m_1, \ldots, m_p)^T \in \mathbb{R}^{p \times K}$ and $q \ge 1$, $\|M\|_{1,q} = \sum_{j \in [p]} \|m_j\|_q$. Then their approach returns an estimate of $B^*$ derived from minimizers of the following criterion:

$$\sum_k \frac{\|y^{(k)} - X^{(k)}(r^{(k)} + s^{(k)})\|_2^2}{2n} + \lambda_s \|S\|_{1,1} + \lambda_r \|R\|_{1,\infty}. \qquad (6)$$

The term $\|S\|_{1,1}$ encourages the matrix $S$ to be sparse (few elements are non-zero), while $\|R\|_{1,\infty}$ encourages the matrix $R$ to have a row-wise group structure (few rows have non-zero entries): together, they encourage solutions $\widehat{B}$ that can be written as the sum of a row-sparse matrix and a sparse one. As mentioned in Jalali et al. (2013), setting $d = \lfloor \lambda_r / \lambda_s \rfloor$, the combination of the two penalties leads to solutions $\widehat{R}$ such that, for any $j$ with $\|\widehat{r}_j\|_\infty > 0$, $|M_j| \ge d+1$ where $M_j = \{k : |\widehat{r}^{(k)}_j| = \|\widehat{r}_j\|_\infty\}$ (see their Lemma 2): in other words, rows that are not uniformly null have at least $d+1$ components with equal absolute values. In contrast, our approach returns an estimate of $B^*$ that is the sum of a rank-one matrix (each row has all its components equal, instead of at least $d+1$ components of equal absolute values) plus a sparse matrix: it is therefore less flexible, but easier to interpret and, above all, to implement. From a theoretical point of view, Jalali et al. especially considered the case where $K = 2$ and covariates are generated from a standard multivariate Gaussian distribution. Further assuming that $n_1 = n_2$, they show that the number of samples needed to ensure support recovery with high probability (i.e., the sample complexity) is lower for Dirty than for both IndepLasso (where $K$ lassos are run independently on the $K$ strata) and the group-lasso strategy relying on the $L_1/L_\infty$ penalty (Negahban and Wainwright, 2011). As shown in Section 6.1.1 of the Appendix, the sample complexity of our approach is never larger than that of Dirty if, in addition to the assumptions considered in Jalali et al., $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$. Our empirical results presented in Section 3 further suggest that this may still be the case for $K > 2$. Stating this theoretically is out of the scope of the present paper and will be considered elsewhere.

2.3 Adaptive version and other refinements
An adaptive version of our approach can be derived by selecting appropriate weights and replacing the $L_1$-norms by weighted $L_1$-norms in Equation (3). Following the ideas of the adaptive lasso (Zou, 2006), these weights can be constructed from initial estimators of $\beta_j$ and $\gamma_{k,j} = \beta_{k,j} - \beta_j$, for $j \in [p]$ and $k \in [K]$. Denoting by $\widetilde{\beta}_k$ initial estimates of $\beta_k$ (e.g., OLS estimates or MLEs if $n_k \gg p$ for all $k \in [K]$), and defining for any $j$, $\ell_j = \min\{k : \widetilde{\beta}_{k,j} \in \mathrm{median}(\widetilde{\beta}_{1,j}, \ldots, \widetilde{\beta}_{K,j})\}$, initial estimators of $\beta_j$ and $\gamma_{k,j}$ are set to $\widetilde{\beta}_{\ell_j,j}$ and $\widetilde{\beta}_{k,j} - \widetilde{\beta}_{\ell_j,j}$. Note that because the values $\widetilde{\beta}_{1,j}, \ldots, \widetilde{\beta}_{K,j}$ are generally all distinct, their median can be a range of values if $K$ is even. In this case, we want to set the initial estimator of the common effect of covariate $j$ to one of the values in $(\widetilde{\beta}_{1,j}, \ldots, \widetilde{\beta}_{K,j})$, hence the use of the minimum in the definition of $\ell_j$ (alternatively, the maximum could be used: there is no reason for favoring one or the other, at least a priori). Then setting, for all $k \in [K]$ and $j \in [p]$, $w_{k,j} = 1/|\widetilde{\beta}_{k,j}|^\rho$ and $\nu_{k,j} = 1/|\widetilde{\beta}_{k,j} - \widetilde{\beta}_{\ell_j,j}|^\rho$, for some $\rho > 0$ (typically $\rho = 1$), the adaptive version of our approach consists in minimizing the following objective function:

$$\sum_{k \ge 1} \frac{\|y^{(k)} - X^{(k)}(\beta + \gamma_k)\|_2^2}{2n} + \lambda_1 \sum_{j=1}^p w_{\ell_j,j} |\beta_j| + \sum_{k \ge 1} \lambda_{2,k} \sum_{j=1}^p \nu_{k,j} |\gamma_{k,j}|.$$

But because $\nu_{\ell_j,j} = +\infty$ for all $j \in [p]$, this is equivalent to minimizing the criterion

$$\sum_{k \ge 1} \frac{\|y^{(k)} - X^{(k)}\beta_k\|_2^2}{2n} + \lambda_1 \sum_{j=1}^p w_{\ell_j,j} |\beta_{\ell_j,j}| + \sum_{k \ge 1} \lambda_{2,k} \sum_{j=1}^p \nu_{k,j} |\beta_{k,j} - \beta_{\ell_j,j}|. \qquad (7)$$

In other words, the adaptive version of our approach can be seen as a refinement of the following strategy, which is very common in epidemiology and clinical research when data come from several strata. Many practitioners would first select a reference stratum, say $\ell \in [K]$, and then use the decomposition $\beta_k = \beta_\ell + \delta_k$ for any $k \ne \ell$. Then, a lasso can be used, and parameter estimates for each stratum can be obtained from minimizers of the following objective function:

$$\frac{1}{2n}\Big\{ \|y^{(\ell)} - X^{(\ell)}\beta_\ell\|_2^2 + \sum_{k \ne \ell} \|y^{(k)} - X^{(k)}(\beta_\ell + \delta^{(k)})\|_2^2 \Big\} + \lambda_1 \|\beta_\ell\|_1 + \sum_{k \ne \ell} \lambda_{2,k} \|\delta^{(k)}\|_1. \qquad (8)$$

This approach reduces to a standard lasso in which the original vector of covariates is augmented by interaction terms between the covariates and indicator functions describing membership to each stratum $k \ne \ell$. The adaptive version of our approach enjoys two advantages compared to this strategy, referred to as InterLasso hereafter. First, the selection of the reference stratum is automatic with our approach, based on the initial estimates of the $\beta^*_k$'s. Second, and above all, the reference stratum $\ell_j$ is covariate-specific. Whenever the initial estimates are consistent, $\ell_j$ belongs to the set $L^*_j = \{k : \beta^*_{k,j} \in \mathrm{median}(\beta^*_{1,j}, \ldots, \beta^*_{K,j})\}$ with high probability for $n$ large enough (if the $n_k$'s all tend to $\infty$ as $n \to \infty$), and is therefore an appealing reference stratum for covariate $j$. As a result, our approach will generally yield models with lower complexity, and hence better performance, than the simple InterLasso (even if InterLasso can of course lead to better models than our approach in some situations; e.g., if there exists a stratum $\ell$ for which $\beta^*_{\ell,j} \in \mathrm{mode}(\beta^*_{1,j}, \ldots, \beta^*_{K,j})$ for all $j \in [p]$, this stratum $\ell$ is chosen as the reference one, and $\mathrm{mode}(\beta^*_{1,j}, \ldots, \beta^*_{K,j}) \cap \mathrm{median}(\beta^*_{1,j}, \ldots, \beta^*_{K,j}) = \emptyset$ for some $j \in [p]$). A small R sketch of this adaptive-weight construction is given below.
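The following R fragment sketches how the covariate-specific reference stratum $\ell_j$ and the adaptive weights $w_{k,j}$ and $\nu_{k,j}$ could be computed from per-stratum OLS fits. All names here (e.g., adaptive_weights) are ours and purely illustrative; for simplicity the lower median value is used when $K$ is even, in the spirit of the definition of $\ell_j$ above.

## Initial per-stratum OLS estimates, then covariate-specific reference stratum
## ell_j = smallest k whose estimate equals the (lower) median of column j.
adaptive_weights <- function(Xlist, ylist, rho = 1) {
  B0 <- sapply(seq_along(Xlist), function(k)      # p x K matrix of initial estimates
    coef(lm(ylist[[k]] ~ Xlist[[k]] - 1)))
  K <- ncol(B0); p <- nrow(B0)
  ell <- apply(B0, 1, function(b) {
    med <- sort(b)[ceiling(K / 2)]                # lower median, one of the K values
    min(which(b == med))
  })
  ref <- B0[cbind(seq_len(p), ell)]               # beta_tilde_{ell_j, j}
  list(ell = ell,
       w   = 1 / abs(ref)^rho,                    # weights on |beta_{ell_j, j}|
       nu  = 1 / abs(B0 - ref)^rho)               # p x K weights on |beta_{k,j} - beta_{ell_j,j}|;
}                                                 # Inf entries correspond to the reference stratum

These weights can then be stacked in the same order as the columns of the transformed design sketched after Equation (5) and passed through the penalty.factor argument of glmnet, so that the adaptive version of $M_1$ remains a plain weighted lasso.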
With our first approach, sparsity of the vectors $\widehat{\beta}_k = \widehat{\beta} + \widehat{\gamma}_k$ is not directly encouraged, and is only "induced" by the sparsity of the common effects $\widehat{\beta}$ and of the variations $\widehat{\gamma}_k$. For instance, if for some $a \ne 0$ and $j \in [p]$, $\beta^*_{\ell_j,j} = a$ and $\beta^*_{k,j} = 0$ for some $k \ne \ell_j$, our first approach will typically not return $\widehat{\beta}_{k,j} = 0$ (unless it also returns $\widehat{\beta}_{\ell_j,j} = 0$). We therefore propose a refined version which directly penalizes the $L_1$-norms of the $\beta_k$'s. More precisely, a second set of estimators can be defined as minimizers of the following criterion

$$\sum_{k \ge 1} \Big\{ \frac{\|y^{(k)} - X^{(k)}\beta_k\|_2^2}{2n} + \lambda_{1,k} \|\beta_k\|_1 + \lambda_{2,k} \|\beta_k - \beta\|_1 \Big\} \qquad (9)$$

over $\beta_k \in \mathbb{R}^p$ for $k \in [K]$ and $\beta \in \mathbb{R}^p$. In the sequel, this refined version will be referred to as $M_2$, while $M_1$ will refer to our first approach. Because $\beta$ now only appears in the last term of the objective function, it is clear that the $j$-th component of any optimal solution $\widehat{\beta}$ is such that $\widehat{\beta}_j = \mathrm{median}(\widehat{\beta}_{1,j}, \ldots, \widehat{\beta}_{K,j})$: for each covariate $j \in [p]$, its estimated common effect $\widehat{\beta}_j$ corresponds to the median of its estimated effects over all the strata, irrespective of the $\lambda_{2,k}/\lambda_{1,k}$ ratios.

The implementation of $M_2$ is however less trivial than that of our first approach. In Section 6.2 of the Appendix, we derive the dual formulation of the optimization problem described in Equation (9), which is shown to reduce to a standard optimization problem for linear regression (quadratic programming) and for logistic regression (an entropy maximization problem) and can therefore be solved with available optimization toolboxes; the hinge loss can also be easily treated.

An adaptive version of $M_2$ can further be recast in the generalized fused lasso framework (Höfling et al., 2010; Viallon et al., 2014). This makes its implementation straightforward with packages dedicated to the generalized fused lasso (e.g., the FusedLasso R package of Höfling et al. (2010)). Recall the notations introduced in Section 2.3.1 above. The adaptive version of $M_2$ consists in minimizing the following objective function:

$$\sum_{k \ge 1} \Big\{ \frac{\|y^{(k)} - X^{(k)}\beta_k\|_2^2}{2n} + \lambda_{1,k} \sum_{j=1}^p w_{k,j} |\beta_{k,j}| + \lambda_{2,k} \sum_{j=1}^p \nu_{k,j} |\beta_{k,j} - \beta_{\ell_j,j}| \Big\}. \qquad (10)$$

Here again, we have $\nu_{\ell_j,j} = +\infty$ for all $j \in [p]$. To see the connection with the generalized fused lasso, set $Y = (y^{(1)T}, \ldots, y^{(K)T})^T \in \mathbb{R}^n$ as above, and introduce $\mathbf{X}_F$, the $(n \times Kp)$ block-diagonal matrix with $k$-th block equal to $X^{(k)}$. Further define $b = (\beta_1^T, \ldots, \beta_K^T)^T = (b_1, \ldots, b_{Kp}) \in \mathbb{R}^{pK}$. Then, the criterion of Equation (10) can be rewritten as

$$\frac{\|Y - \mathbf{X}_F b\|_2^2}{2n} + \lambda_1 \|b\|_1 + \lambda_2 \sum_{j_1 \sim j_2} |b_{j_1} - b_{j_2}|.$$
The condition $j_1 \sim j_2$ indicates that the components $b_{j_1}$ and $b_{j_2}$ are connected in a particular graph that describes the absolute differences that are penalized. Here, this graph consists of $p$ star graphs, where the $j$-th star graph has coefficient $\beta_{\ell_j,j}$ at its center, connected to each $\beta_{k,j}$ for $k \ne \ell_j$. The adaptive version of $M_1$ can also be seen as a particular case of the generalized fused lasso, using the same graph made of $p$ star graphs (our two adaptive versions penalize the same absolute differences), but where only the absolute value of the central node $|\beta_{\ell_j,j}|$ of each star graph is penalized (while all the $|\beta_{k,j}|$'s are penalized in Equation (10) above). Finally, note that the InterLasso strategy can also be seen as a special case of the generalized fused lasso: the underlying graph is made of $p$ star graphs with $\beta_{\ell,j}$ as the central node (instead of $\beta_{\ell_j,j}$), and only the $|\beta_{\ell,j}|$ terms are penalized, for $j \in [p]$. A sketch of this graph construction is given below.

These observations formally establish a connection with another strategy that was considered in Gertheiss and Tutz (2012) and Viallon et al. (2014). It will be referred to as CliqueFused, and it corresponds to a generalized fused lasso using this time a graph made of $p$ cliques (instead of $p$ star graphs): for each covariate $j$, all the absolute differences $|\beta_{k,j} - \beta_{k',j}|$ are penalized, for all $k \ne k'$.

The rewriting of the adaptive version of $M_2$ as a special case of the generalized fused lasso shows that it is easy to implement with available packages, like the FusedLasso R package for linear and logistic regression models; the same naturally holds for the adaptive version of $M_1$, but since the latter can still be written as a particular case of the (adaptive) lasso, its implementation is faster using the glmnet function for instance. In addition, asymptotic oracle properties for the adaptive version of $M_2$ (and $M_1$) readily follow as direct consequences of the results obtained in Viallon et al. (2014) (see Section 2.4 below); for the adaptive version of $M_1$, results obtained by Zou (2006) for the adaptive lasso can also be used.
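As an illustration of this graph structure, the generic R fragment below (ours, independent of any particular fused-lasso solver) builds the edge list of the $p$ star graphs from the reference strata $\ell_j$. Indices refer to the stacked vector $b = (\beta_1^T, \ldots, \beta_K^T)^T$, so that coefficient $\beta_{k,j}$ sits at position $(k-1)p + j$.

## Edge list of the p star graphs: for each covariate j, connect the central node
## beta_{ell_j, j} to beta_{k, j} for every k != ell_j.
star_graph_edges <- function(ell, p, K) {
  do.call(rbind, lapply(seq_len(p), function(j) {
    center <- (ell[j] - 1) * p + j
    others <- (seq_len(K)[-ell[j]] - 1) * p + j
    cbind(center, others)                       # one row per penalized difference
  }))
}
## Example: p = 3 covariates, K = 4 strata, references ell = c(2, 1, 4)
## star_graph_edges(c(2, 1, 4), 3, 4) returns 9 edges (K - 1 = 3 per covariate).

The CliqueFused strategy would instead connect, for each covariate $j$, all pairs of nodes $\beta_{k,j}$ and $\beta_{k',j}$, $k \ne k'$.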
For the sake of brevity, we only present here a summary of the asymptotic oracle properties of the adaptive versions of our approach. We refer to Section 6.1 in the Appendix for more details, along with preliminary non-asymptotic results for our first approach $M_1$. We consider the situation where $p$ and $K$ are held fixed (they do not increase with $n$). We further assume that, for each $k \in [K]$, the matrix $X^{(k)T}X^{(k)}/n_k$ converges to a positive definite matrix $C^{(k)}$ as $n_k \to \infty$, and that $n_k/n \to \kappa_k$ as $n \to \infty$, with $0 < \kappa_k < 1$ (stratum sizes all tend to infinity at the same rate). These assumptions essentially imply that the MLEs $\widetilde{\beta}_{k,j}$ exist and are $\sqrt{n}$-consistent as $n \to \infty$; they will therefore be used in the definition of the weights in Equations (7) and (10). Our results essentially show that the adaptive version of $M_1$, say $M_1^{ad}$, enjoys asymptotic oracle properties. In particular, setting for any $j \in [p]$, $\ell^*_j = \min\{k : \beta^*_{k,j} \in \mathrm{median}(\beta^*_{1,j}, \ldots, \beta^*_{K,j})\}$ and $\mathcal{K}^*_{\mathcal{A}^*,j} = \{k : \beta^*_{k,j} = \beta^*_{\ell^*_j,j}\}$ if $\beta^*_{\ell^*_j,j} \ne 0$, all the parameters $\beta^*_{k,j}$ for $k \in \mathcal{K}^*_{\mathcal{A}^*,j}$ are estimated by the common value $\widehat{\beta}^{(ad)}_{\ell_j,j}$, with probability tending to 1 as $n \to \infty$. This estimator has the same Gaussian limit distribution as the estimator we would obtain by pooling together all the data corresponding to covariate $j$ and strata in $\mathcal{K}^*_{\mathcal{A}^*,j}$. Moreover, $M_1^{ad}$ is asymptotically optimal in situations where, for all $j \in [p]$, $\beta^*_{k_1,j} = \beta^*_{k_2,j}$ implies that either $\beta^*_{k_1,j} = \beta^*_{k_2,j} = 0$ or $\beta^*_{k_1,j} = \beta^*_{k_2,j} = \beta^*_{\ell^*_j,j}$, as in examples 1 to 3 of Figure 4 in the Appendix. Asymptotic oracle properties can easily be derived for method $M_2^{ad}$ as well, but for $M_2^{ad}$ to be optimal we must have in addition: for all $j \in [p]$, $\{\beta^*_{\ell^*_j,j} \ne 0\} \Rightarrow \{\forall k \in [K],\ \beta^*_{k,j} \ne 0\}$. For instance, in example 2 of Figure 4, the estimator of $\beta^*_{1,j}$ returned by $M_2^{ad}$ cannot be 0 (unless $M_2^{ad}$ returns a zero estimate for $\beta^*_{k_j,j}$ as well). Therefore, $M_2^{ad}$ would typically return a model with overall complexity higher than the theoretical one, and hence be sub-optimal. On the other hand, contrary to the CliqueFused strategy, both our approaches use star graphs and are therefore sub-optimal in situations where the non-zero values in $(\beta^*_{1,j}, \ldots, \beta^*_{K,j})$ consist of at least two groups of identical values (and possibly some distinct non-zero values). For instance, in example 4 of Figure 4, neither $M_1^{ad}$ nor $M_2^{ad}$ can return non-null equal values for the components $\widehat{\beta}^{(ad)}_{1,j}$, $\widehat{\beta}^{(ad)}_{2,j}$ and $\widehat{\beta}^{(ad)}_{3,j}$ (nor for $\widehat{\beta}^{(ad)}_{8,j}$, $\widehat{\beta}^{(ad)}_{9,j}$ and $\widehat{\beta}^{(ad)}_{10,j}$), while CliqueFused typically would, for $n$ large enough. However, CliqueFused may of course be outperformed by our approaches on finite samples. This was illustrated in the simulation study conducted in Viallon et al. (2014) in the single-task setting: they evaluated the robustness of the generalized fused lasso to graph misspecification and especially observed that the clique-based strategy, though asymptotically optimal, was outperformed by other graph-based strategies on finite samples. In the present multi-task setting, if $\beta^*_{1,j}, \ldots, \beta^*_{K-1,j}$ are all equal and $\beta^*_{K,j}$ is different from the $K-1$ others, then $\ell_j \in [K-1]$ for $n$ large enough, and our approaches would only penalize $|\beta_{K,j} - \beta_{\ell_j,j}|$: they are therefore more likely to detect this difference than CliqueFused, which penalizes all the differences $|\beta_{K,j} - \beta_{k,j}|$ for $k \in [K-1]$.
3. Simulation Study

Our main objective here is to illustrate the two new approaches introduced in Section 2 and to compare them with state-of-the-art competitors, in particular with those presented above which show some links with our proposal: CliqueFused and Dirty. For comparison, we further included IndepLasso and IdentLasso. The sparse group-lasso strategy relying on the $L_1 + L_1/L_\infty$ penalty, referred to as spGroupLasso hereafter, is considered too (see Section 6.3 of the Appendix for a brief description of this approach, and Negahban and Wainwright (2011) and Simon et al. (2013) for more details). Results obtained with the stepwise method described in Gertheiss and Tutz (2012) will also be presented in the low-dimensional example.

The glmnet R function of Friedman et al. (2010) was used to implement IndepLasso and IdentLasso, as well as our approach $M_1$ (and its adaptive version). As for $M_2$, it was implemented as described in Section 2.3.2 above. The spGroupLasso strategy was implemented with the spams R package of Mairal et al. (2010), and we used the gvcm.cat R package for the stepwise approach of Gertheiss and Tutz (2012). Finally, we used the script of Jalali et al. (2013) (available at http://ali-jalali.com/index_files/L1Linf_LASSO.r) to implement Dirty. However, this script actually implements a revised version of Dirty: at each iteration $\iota$ of their algorithm, the tuning parameters are divided by $\sqrt{\iota}$ in the coordinate gradient descent steps, making the soft-thresholding operator closer and closer to the hard-thresholding one as $\iota$ increases. This trick returns solutions that are barely shrunk and is related in some way to the relaxed lasso of Meinshausen (2007), the adaptive lasso of Zou (2006) and $L_q$-penalization with $0 \le q < 1$. This revised version will be referred to as RevDirty hereafter; the original Dirty was implemented by removing the "$1/\sqrt{\iota}$" terms in the RevDirty script.

All the methods presented above involve tuning parameters that need to be carefully selected in practice. Generally speaking, a predefined grid of $\lambda_1$ values has first to be constructed. We refer to Section 6.4.3 of the Appendix for a complete description of the grid construction. Given grids of $\lambda_1$ and $\lambda_2$ values, cross-validation can be seen as the strategy of choice when $p$ is large (i.e., at least comparable to $n$) and/or only prediction accuracy matters. When $p \ll n$ and support recovery is of primary interest, a commonly preferred criterion is the BIC computed with unbiased estimates: for the lasso, these unbiased estimates can be obtained by the OLS-Hybrid two-step strategy (Efron et al., 2004), which corresponds to a simplified version of the relaxed lasso with the particular choice $\phi = 0$ (Meinshausen, 2007). It can easily be extended to the approaches considered in this paper. This criterion will be referred to as 2stepBIC in the sequel.
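As an illustration, the following R fragment sketches one plausible implementation of such a two-step BIC in the Gaussian case: for each candidate point on the $(\lambda_1, \lambda_2)$ grid, the selected support is refit by unpenalized OLS and a BIC is computed from the refit. The function two_step_bic and its arguments are ours and purely illustrative, not necessarily the exact criterion used in this paper.

## Two-step BIC for one candidate fit: refit the selected support by OLS,
## then BIC = n * log(RSS / n) + df * log(n), with df the size of the support.
two_step_bic <- function(Xtilde, y, support) {
  n <- length(y)
  if (length(support) == 0) {
    rss <- sum(y^2); df <- 0                     # no-intercept model, empty support
  } else {
    fit <- lm(y ~ Xtilde[, support, drop = FALSE] - 1)   # unpenalized refit
    rss <- sum(residuals(fit)^2); df <- length(support)
  }
  n * log(rss / n) + df * log(n)
}
## Typical use (hypothetical object names): support <- which(as.vector(coef(fit1, s = lambda1))[-1] != 0),
## then keep the (lambda1, lambda2) pair minimizing two_step_bic(Xtilde, y, support).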
In the first simulation study, we consider the case where $K = 5$ and $n_k = 15p = 225$. For each stratum $k \in [K]$, each of the $n_k$ rows (observations) of the design matrix $X^{(k)}$ is generated under a multivariate Gaussian distribution $\mathcal{N}_p(0_p, \Sigma)$, where the $(j_1, j_2)$-element of $\Sigma$ is $2^{-|j_1 - j_2|}$ (for $j_1, j_2 \in [p]$). For given values $(\mathrm{SpGlob}, \mathrm{SpSpec}) \in [0,1]^2$, each component $\beta^*_j$ of the vector $\beta^* \in \mathbb{R}^p$ is generated from a Bernoulli distribution with parameter SpGlob, while for each $k \in [K]$, each component of the vector $\delta^*_k \in \mathbb{R}^p$ is generated from a Bernoulli distribution with parameter SpSpec. The expected number of non-zero components in $\beta^*$ and $\delta^*_k$ is then $p \times \mathrm{SpGlob}$ and $p \times \mathrm{SpSpec}$, respectively. Each vector $\beta^*_k$ is then set to $\beta^* + \delta^*_k$, with components equal to either 0, 1 or 2. By making SpSpec vary, we make the level of heterogeneity over the strata vary, while SpGlob enables the control of the sparsity level of the "common" effects. Observe that the components $\beta^*_j$ do not necessarily correspond to $\mathrm{median}(\beta^*_{1,j}, \ldots, \beta^*_{K,j})$ (they do in some cases, e.g., when SpSpec $= 0$, but not always). Given $\beta^*_k$ and $X^{(k)}$, the vector $y^{(k)}$ is generated from a multivariate Gaussian distribution $\mathcal{N}_{n_k}(X^{(k)}\beta^*_k, \sigma^2 I_{n_k})$. By varying the noise variance $\sigma^2$, we can make the signal-to-noise ratio (SNR) vary; a sketch of this data-generating mechanism is given below. For each simulation design, corresponding to a given triple of values (SpGlob, SpSpec, SNR), 100 replications are performed, and the results presented and discussed below correspond to averages over these 100 replications. Methods are evaluated according to their prediction accuracy (Figure 1), measured by $\log(\sum_{k \in [K]} \|X^{(k)}(\beta^*_k - \widehat{\beta}_k)\|_2^2)$, and their support recovery accuracy (Figure 2), which measures the ability to correctly recover the zero and non-zero elements of the matrix $B^* = (\beta^*_1, \ldots, \beta^*_K)$. Tuning parameters are selected with the 2stepBIC described above.

In Figure 1, the first row corresponds to SpSpec $= 0$, so that all the $\beta^*_k$'s are equal. In this case, the optimal strategy is of course IdentLasso, but $M_1$, spGroupLasso, $M_2$ and Dirty all perform as well (or nearly as well) as this optimal strategy. The stepwise approach and CliqueFused are less efficient, but still perform better than IndepLasso. On the other hand, when SpGlob is low and SpSpec is high (last row, first two columns), the $K$ true models do not have much in common and IndepLasso is nearly optimal: in this case again, the multi-task learning strategies perform nearly as well as the optimal strategy. In addition, and as expected, the prediction performance of all methods tends to decrease as SpGlob and/or SpSpec increases, that is, as the true model complexity increases. Overall, $M_1$ and Dirty show very similar performance, irrespective of the true model complexity. They perform the best for SpSpec $\ne 0$, closely followed by the other multi-task learning strategies. For non-zero values of SpGlob and SpSpec, we observed moderate performance for both $M_2$ and spGroupLasso. As shown in Section 6.4.3 of the Appendix for method $M_2$, alternative choices for the initial grid of $(\lambda_1, \lambda_2)$ values lead to better results, suggesting that our proposal for the construction of this grid might not always be optimal for $M_2$ (and spGroupLasso) and that more theoretical effort might be needed to get a better initial grid.
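For reference, here is a minimal R sketch of the data-generating mechanism just described. The helper name simulate_strata is ours; mvtnorm::rmvnorm is assumed to be available.

library(mvtnorm)

## Generate K strata under the first simulation design:
## covariance Sigma[j1, j2] = 2^-|j1 - j2|, Bernoulli supports for beta* and delta*_k.
simulate_strata <- function(K, nk, p, sp_glob, sp_spec, sigma) {
  Sigma <- 2^(-abs(outer(seq_len(p), seq_len(p), "-")))
  beta  <- rbinom(p, 1, sp_glob)                  # common effects (0/1)
  lapply(seq_len(K), function(k) {
    delta <- rbinom(p, 1, sp_spec)                # stratum-specific deviations (0/1)
    bk    <- beta + delta                         # stratum effects in {0, 1, 2}
    X     <- rmvnorm(nk, sigma = Sigma)
    y     <- as.vector(X %*% bk) + rnorm(nk, sd = sigma)
    list(X = X, y = y, beta_k = bk)
  })
}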
[Figure 1 about here. Panels: Sparsity_glob (columns) by Sparsity_spec (rows); x-axis: snr; methods: M1, M2, Dirty, CliqueFused, spGroupLasso, Stepwise, IdentLasso, IndepLasso.]

Figure 1: Results from the first simulation study: prediction accuracy.

Turning our attention to support recovery, Figure 2 presents results obtained with Dirty and RevDirty, $M_1$, and the adaptive versions of $M_1$, $M_2$ and CliqueFused. See Figure 6 in the Appendix for the corresponding analysis of the methods compared in Figure 1: overall, conclusions for these methods are very similar to those drawn from the analysis of the prediction performance. Weights for the adaptive versions were derived from initial OLS estimates of the $\beta^*_k$'s. The comparison of $M_1$ with its adaptive version $M_1^{(ad)}$ clearly illustrates the potential gain in support recovery accuracy when using adaptive weights (no such gain was observed on prediction accuracy; results not shown). Moreover, the adaptive versions of CliqueFused and $M_2$ attain nearly the same performance as $M_1^{(ad)}$. Overall, RevDirty performs the best, but (i) the comparison is unfair since RevDirty is more than an adaptive version of Dirty, and (ii) the adaptive versions of $M_1$ and $M_2$ perform only slightly worse in most cases.
[Figure 2 about here. Same layout as Figure 1; methods: M1, Dirty, AdaptM1, AdaptM2, RevDirty, AdaptCliqueFused.]

Figure 2: Results from the first simulation study: support recovery accuracy.
The objective of the second simulation study was to describe the performance of the methods when the dimension $p$ is of the order of $n_k$ and the selection of tuning parameters is done by cross-validation. In view of the results of the first simulation study, and to save computational time, IdentLasso, IndepLasso and CliqueFused were not included in this analysis. In addition, a relaxed version of $M_1$ was included, following the idea of the relaxed lasso (Meinshausen, 2007) (we made the $\phi$-value vary over a grid between 0 and 1). We considered the case where $K = 5$ and $p = n_k = 100$. The vectors $\beta^*$ and $\delta^*_k$, for $k \in [K]$, are constructed as follows. Only the first 10 components of these vectors can be non-zero. The sparsity of the common effect $\beta^*$ is fixed to 4, with non-zero components all equal to 1. Three levels of heterogeneity are considered: complete homogeneity (all the $\delta^*_k$'s are equal to $0_p$), low heterogeneity (the number of non-zero elements of the matrix $\Delta^* = (\delta^*_1, \ldots, \delta^*_K)$ is 5) and moderate heterogeneity (the number of non-zero elements of $\Delta^*$ is 15), the non-zero elements of $\Delta^*$ being of equal magnitude and of either sign. Design matrices $X^{(k)}$ and response vectors $y^{(k)}$ are generated as in the first simulation study, except that we now set $\Sigma = I_p$. As in the first simulation study, we make the SNR vary by varying the noise level.

Results are presented in Figure 3. Because Dirty and RevDirty take a very long time to run in this second simulation study, results correspond to averages over 20 replications only. Methods are evaluated according to their prediction accuracy, measured by $\log(\sum_{k \in [K]} \|X^{(k)}(\beta^*_k - \widehat{\beta}_k)\|_2^2)$, their estimation accuracy (measured by $\sum_{k \in [K]} \|\beta^*_k - \widehat{\beta}_k\|_2^2 / (Kp)$), and their ability to recover the true support (measured in this highly sparse setting by the F1-score, defined as the harmonic mean of precision and recall). Because estimators selected by cross-validation may have many tiny non-zero components, hard-thresholding these estimators can improve their performance regarding support recovery. F1-scores for these hard-thresholded versions are therefore also presented, with the threshold set to 0.25 and 0.5. Comparing first the crude versions (no adaptive weights, no relaxation, nor the revised-Dirty trick), $M_1$ and $M_2$ generally perform at least similarly to Dirty and spGroupLasso. The relaxed and adaptive versions of $M_1$ generally achieve better performance than the crude $M_1$, especially in the cases of complete homogeneity or low heterogeneity. In these cases, these versions also outperform RevDirty in terms of prediction and estimation accuracy. As for the F1-score, RevDirty is clearly the best: only in the case of complete homogeneity does the relaxed version of $M_1$ achieve similar performance. Note however that the comparison is again unfair, since RevDirty is more than an adaptive or a relaxed version of Dirty. After the hard-thresholding step, the differences among methods in terms of F1-scores are much narrower: focusing on low values of the SNR, the different versions of our approaches appear very competitive. (Our results confirm that RevDirty returns nearly unshrunk estimates and hence fewer tiny non-zero components than "purely" $L_1$-based approaches: it makes fewer mistakes (its F1-score is higher), but apparently bigger ones, since its prediction and estimation accuracies can be outperformed by the relaxed or adaptive versions of $M_1$.)

[Figure 3 about here. Columns: No / Low / Moderate heterogeneity; rows: log(L2 prediction error), L2 error, F1, F1 after thresholding at 0.25 and at 0.5; x-axis: snr; methods: M1, M2, Dirty, spGroupLasso, AdaptM1, RelaxM1, RevDirty.]

Figure 3: Results from the second simulation study with $n_k = p = 100$ and $K = 5$.
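For completeness, here is a small R helper (ours, purely illustrative) computing the support-recovery F1-score used above, optionally after hard-thresholding the estimates.

## F1-score of support recovery: precision/recall of the non-zero pattern of B_hat
## against that of B_star, after an optional hard-thresholding of the estimates.
support_f1 <- function(B_hat, B_star, threshold = 0) {
  est  <- abs(B_hat) > threshold
  true <- B_star != 0
  tp <- sum(est & true)
  precision <- tp / max(sum(est), 1)
  recall    <- tp / max(sum(true), 1)
  if (precision + recall == 0) return(0)
  2 * precision * recall / (precision + recall)
}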
4. Application on road safety data
One major objective in road safety is to determine the factors associated with being responsible for a road traffic accident. To do so, epidemiologists typically use data sets of car crashes in which drivers' responsibility has been determined by experts. Of course, it is crucial that these experts use a proper and consistent "rule" to determine responsibility. Consistency especially implies that two experts should agree on a driver's responsibility. The level of agreement can be measured by Cohen's $\kappa$ or other standard agreement criteria, but this would require each driver's responsibility to be rated by at least two experts. Here, we will use a joint estimation strategy to assess expert agreement in situations where each driver's responsibility is only rated once, by one expert. As will be made clearer below, our strategy will also help us to check the relevance of the rule used by the experts for the determination of drivers' responsibility.

Our data set consists of drivers involved in road traffic accidents, each rated as responsible or not by one of the $K = 18$ experts. If the experts all apply the same rule consistently, the parameter vectors $\beta^*_k$ of the corresponding logistic models ($k = 1, \ldots, 18$) should be equal. To check whether this is the case, we propose to jointly estimate the 18 corresponding sparse logistic models using the two approaches described in this paper (both the crude and refined versions); a minimal sketch of such a fit is given below.
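In practice, this joint fit only differs from the linear sketch given after Equation (5) through the loss. A purely illustrative R fragment, reusing the hypothetical build_design helper introduced there (Xlist, ylist, mu and p are assumed to be defined as before, with ylist holding the 0/1 responsibility indicators of each expert's stratum), would be:

## Joint sparse logistic models over the K = 18 experts: same stacked design,
## binomial loss, single common intercept left unpenalized (cf. Section 2.2.2).
library(glmnet)
fit <- glmnet(build_design(Xlist), unlist(ylist), family = "binomial",
              intercept = TRUE, standardize = FALSE,
              penalty.factor = c(rep(1, p), rep(mu, each = p)))

For simplicity this fragment fits a single common intercept; the stratum-specific intercept deviations discussed in Section 2.2.2 would require augmenting the stratum-specific blocks with a column of ones.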
Each method returns a $(206 \times 18)$ matrix in which each column contains the parameter vector (including the intercept) of one stratum, i.e., one expert. We evaluated both methods on a predefined grid of $(\lambda_1, \lambda_{2,k})$ values and retained the model minimizing the 2stepBIC. Contrary to the situations considered in our simulation study, the $n_k$'s are not constant here and several strategies can be put forward for determining the grid. The simplest one consists in using $\lambda_{2,k} = \lambda_2$ and proceeding as described in Section 3.1.1. As mentioned in Section 3.1.1 (and as suggested by the preliminary non-asymptotic analysis of Section 6.1.2 of the Appendix), another strategy for method $M_1$ consists in first standardizing the columns of the matrix $\mathbf{X}$, then setting $\lambda_{2,k} = \lambda_2$ and finally proceeding again as described in Section 3.1.1. Both strategies were applied for method $M_1$ and they led to similar results.

With either method, good concordance was observed between the supports of the vectors $\widehat{\beta}_k$: only two [resp. three] elements of the matrix $\widehat{\Delta} = (\widehat{\delta}_1, \ldots, \widehat{\delta}_{18})$ were non-null when using $M_1$ [resp. $M_2$]. Moreover, the non-zero elements returned by $M_1$ have small absolute values. In particular, the elements of $\widehat{\Delta}$ related to driving under the influence were all zero: this suggests that all experts account for alcohol consumption in the same way.

Now, looking at the common vector $\widehat{\beta}$ can further help us to check the adequacy of the common rule used by the experts. We will pay particular attention to the association between alcohol consumption and the probability of being considered responsible by the experts. Indeed, for road safety matters, and as recalled to the experts during the training phase, the mere fact of driving under the influence should not imply being (nor increase the risk of being) considered as responsible. If the components of the vector $\widehat{\beta}$ corresponding to alcohol consumption are non-null, then the adequacy of the rule may be questioned. Method $M_1$ returns a vector $\widehat{\beta}$ with 74 non-null components, while $M_2$ returned a slightly sparser one with 60 non-null components. Overall, the variables that are the most associated with a high probability of being responsible are as expected: a driver who drove (or ran) away, and/or went through a stop sign or a red light, crossed the lane, etc., is likely to be considered as responsible for the accident. Regarding alcohol consumption, both $M_1$ and $M_2$ return two non-zero components in the vector $\widehat{\beta}$: those corresponding to the highest levels of alcohol.
5. Discussion
In this paper we present a new approach, along with refinements, aimed at jointly modeling several related regression models. Although most of the presentation was made in the situation where these models correspond to several strata of a population, our approach can be used in the more general multi-task learning setting. Also, for ease of notation, mostly linear regression models were considered in this paper. As illustrated in the application of Section 4, extension to other models is however straightforward. This is particularly true for our first approach and its adaptive or relaxed versions, since they can all be rewritten as simple weighted lasso problems: in particular, linear, logistic, Poisson and Cox models can be treated thanks to the glmnet R package (Friedman et al., 2010). Other functions exist for other models, like the clogitL1 R package for conditional logistic models of Reid and Tibshirani (2014), and allow for further extensions of our approach.

Our approach naturally connects with several strategies formerly proposed in the literature. First, it is a simplified version of the Dirty models proposed by Jalali et al. (2013) in the linear regression setting, and is much easier to extend to other regression models. On the designs considered in our simulation study, it still performs similarly to the Dirty models, and we also exhibited situations where its theoretical sample complexity is at least as good as that of the Dirty models. Second, the adaptive versions of our approaches are special cases of the generalized fused lasso. This connection is particularly appealing for interpretation matters since it allows for a simple comparison between CliqueFused, InterLasso and our two strategies. First, CliqueFused uses cliques, while InterLasso and our two approaches use star graphs. Second, for all $j \in [p]$, the centers of the star graphs are set to $\beta_{\ell,j}$ when using InterLasso with reference stratum $\ell$, while when using the adaptive versions of our approaches the centers are adaptively set to $\beta_{\ell_j,j}$: the reference stratum is covariate-specific, automatically selected from the initial estimates, and it corresponds (with probability tending to 1) to one of the strata in $\mathcal{K}^*_j = \{k : \beta^*_{k,j} = \beta^*_{\ell^*_j,j}\}$ if the initial estimates are consistent. Finally, besides the structure of the graph (edges corresponding to the differences $\beta_{k,j} - \beta_{k',j}$ that are penalized), these four strategies exhibit differences according to which nodes in the graph (coefficients $\beta_{k,j}$) are directly encouraged to be sparse: they are all penalized when using CliqueFused or $M_2$, while only the $\beta_{\ell,j}$'s for InterLasso and the $\beta_{\ell_j,j}$'s for $M_1$ are penalized, for $j \in [p]$.

From a theoretical point of view, the connection with the generalized fused lasso allowed us to derive asymptotic oracle properties for our adaptive versions. Preliminary non-asymptotic results were also stated, under strong but not necessary conditions. Regarding prediction performance, these preliminary results might be extended to more general settings by adapting the recent results of Dalalyan et al. (2014), who study the prediction performance of the lasso in the presence of correlated designs. As for support recovery of the $\beta^*_k$'s, adapting the dual-witness idea of Wainwright (2009) is a promising lead (as Jalali et al. (2013) did for the Dirty models). Other extensions of this work could rest on greedy algorithms or recent MCMC-based approaches (such as the exponential screening of Rigollet and Tsybakov (2012)), for which optimal prediction bounds have been obtained with no particular assumption on the design matrix.

References
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Berlin: Springer-Verlag, 2011.

A. S. Dalalyan, M. Hebiri, and J. Lederer. On the prediction performance of the lasso. arXiv preprint arXiv:1402.1700, 2014.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

J. Gertheiss and G. Tutz. Regularization and model selection with categorial effect modifiers. Statistica Sinica, 22:957–982, 2012.

H. Höfling, H. Binder, and M. Schumacher. A coordinate-wise optimization algorithm for the Fused Lasso. arXiv preprint arXiv:1011.6409, 2010.

L. Jacob, J.-P. Vert, and F. Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745–752, 2009.

A. Jalali, P. Ravikumar, and S. Sanghavi. A dirty model for multiple sparse regression. IEEE Transactions on Information Theory, 59(12):7947–7968, 2013.

K. Lounici. Sup-norm convergence rate and sign concentration property of lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102, 2008.

K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. In COLT'09, 2009.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.

A. Maurer and M. Pontil. Excess risk bounds for multitask learning with trace norm regularization. In COLT'13, pages 55–76, 2013.

P. McCullagh and J. A. Nelder. Generalized Linear Models. 2nd ed. New York: Chapman & Hall, 1989.

N. Meinshausen. Relaxed lasso. Computational Statistics and Data Analysis, 52(1):374–393, 2007.

S. Negahban and M. Wainwright. Simultaneous support recovery in high dimensions: Benefits and perils of block-regularization. IEEE Transactions on Information Theory, 57(6):3841–3863, 2011.

S. Reid and R. Tibshirani. Regularization paths for conditional logistic regression: The clogitL1 package. Journal of Statistical Software, 58(12), 2014.

P. Rigollet and A. B. Tsybakov. Sparse estimation by exponential weighting. Statistical Science, 27(4):558–575, 2012.

N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.

S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

V. Viallon, S. Lambert-Lacroix, H. Höfling, and F. Picard. On the robustness of the generalized fused lasso to prior specifications. Statistics and Computing (to appear), 2014.

M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

6. Appendix: Supplementary Materials
In this section, we derive preliminary non-asymptotic properties for our first approach $M_1$ (Sections 6.1.1 and 6.1.2), as well as asymptotic oracle properties for our adaptive versions (especially for $M_1$, see Section 6.1.3).

6.1.1 K = 2 and $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$

Here we consider the case where $K = 2$. Working under the assumptions considered in Jalali et al. (2013), it is easy to show that the sample complexity of $M_1$ is at most that of the Dirty models, under the additional assumption that $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$. Indeed, for any given pair of potential estimates $\beta_{1,j}$ and $\beta_{2,j}$, the penalty term involved in the Dirty approach for covariate $j$ can be written

$$\phi(r_1, r_2) = \lambda_1 \max(|r_1|, |r_2|) + \lambda_2 \{|\beta_{1,j} - r_1| + |\beta_{2,j} - r_2|\},$$

while for $M_1$ it reduces to

$$\bar{\phi}(r) = \lambda_1 |r| + \lambda_2 \{|\beta_{1,j} - r| + |\beta_{2,j} - r|\} = \phi(r, r).$$

Moreover, for any positive $\lambda_1, \lambda_2$ and any $\beta_{1,j}, \beta_{2,j}$, it is easy to show that $\min_{r_1,r_2} \phi(r_1, r_2) = \min_r \bar{\phi}(r)$ if $\beta_{1,j}\beta_{2,j} \ge 0$, and $\min_{r_1,r_2} \phi(r_1, r_2) < \min_r \bar{\phi}(r)$ otherwise. In words, pairs of values such that $\beta_{1,j}\beta_{2,j} \ge 0$ are equally penalized by Dirty and $M_1$, while all other pairs are more heavily penalized by $M_1$. Consequently, assuming that $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$, the probability of correct support recovery for $M_1$ is at least that of Dirty. In particular, under the assumptions considered in Theorem 3 of Jalali et al. (2013), and assuming that $\beta^*_{1,j}\beta^*_{2,j} \ge 0$ for all $j \in [p]$, perfect support recovery is ensured for method $M_1$ as soon as the common number of observations per stratum (task) is larger than $(2 - \alpha)s\log(p - (2 - \alpha)s)$, where $s$ is the common support size of $\beta^*_1$ and $\beta^*_2$ and $\alpha$ is the overlap proportion between the two supports. As shown in Jalali et al. (2013) for their Dirty model, this implies that $M_1$ outperforms both IndepLasso and the $L_1/L_\infty$ group-lasso strategy in terms of sample complexity (except in the two extreme situations of no sharing at all and full sharing, where it matches the best strategy).

In addition to the sample complexity in the simple case described above, other preliminary non-asymptotic properties can easily be derived for our approach thanks to its rewriting as a weighted lasso. Most non-asymptotic theoretical results for the lasso have been established under strong conditions on the design matrix: to name a few, the irrepresentability condition, the mutual incoherence, the restricted eigenvalue condition or the RIP property (see van de Geer and Bühlmann (2009) for an overview). In Equation (5) of the main paper, it is clear that the matrix $\mathbf{X}$ generally does not enjoy such properties. Assuming for simplicity that $n_k = n/K$ for all $k \in [K]$, we present here simple situations where it does. A more thorough study would be needed to establish non-asymptotic properties under more general assumptions; an interesting lead lies in the recent results concerning the prediction accuracy of lasso estimators under correlated designs (Dalalyan et al., 2014). This will be considered elsewhere.

The first situation is when $K$ is large enough and the $\beta^*_k$'s remain highly similar. More precisely, assume that the design matrices $X^{(k)}$ fulfill the following mutual incoherence condition (Lounici, 2008), for all $k \in [K]$: setting $\Sigma^{(k)} = (X^{(k)T}X^{(k)})/n_k$ for all $k \in [K]$, we assume that $\Sigma^{(k)}_{j,j} = 1$ for all $j \in [p]$ and $\max_{j_1 \ne j_2} |\Sigma^{(k)}_{j_1,j_2}| \le 1/(7\alpha s)$, for some integer $s \ge 1$ and some $\alpha > 1$.
Then the following standardized version of the matrix $\mathbf{X}$,

$$\widetilde{\mathbf{X}} = \begin{pmatrix} X^{(1)} & \sqrt{K}\,X^{(1)} & 0 & \cdots & 0\\ X^{(2)} & 0 & \sqrt{K}\,X^{(2)} & \cdots & 0\\ \vdots & \vdots & & \ddots & \vdots\\ X^{(K)} & 0 & \cdots & 0 & \sqrt{K}\,X^{(K)} \end{pmatrix},$$

can easily be shown to fulfill the same mutual incoherence assumption, as long as $K > (7\alpha s)^2$. For such values of $K$, and assuming that the true parameter vector $\theta^*$ contains at most $s$ non-zero components (such a $\theta^*$ is then ensured to be unique), the results of Lounici (2008) can be applied to derive the $L_\infty$ rate of convergence and the sign consistency of the estimators obtained from $M_1$, with sparsity parameters set to $\lambda_{2,k} = \lambda_1$ (which is equivalent to working with the non-standardized matrix $\mathbf{X}$ and $\lambda_{2,k} = \lambda_1 K^{-1/2}$). Of course the assumption $K > (7\alpha s)^2$ is very strong: it typically holds when strata are numerous and highly similar (so that $s$ remains low).

Another simple situation arises when $X^{(k)T}X^{(k)}/n_k = I_p$ for all $k \in [K]$ and the vectors $\beta^*_k$ are all equal, and we set again $\lambda_{2,k} = \lambda_1$. Setting $J = \{j : \theta_j \ne 0\}$, this ensures that $J \subseteq [p]$, i.e., $\theta^*_J = \beta^*_J$. Then it is easy to show that the irrepresentability condition of Wainwright (2009) holds with $\gamma = 1 - \sqrt{1/K}$, and then to obtain the following $L_\infty$ bound.
Another simple situation arises when X^(k)^T X^(k)/n_k = I_p for all k ∈ [K] and the vectors β*_k are all equal, with again λ_{2,k} = λ_1. Setting J = {j : θ*_j ≠ 0}, the equality of the β*_k's ensures that J ⊆ [p], i.e., θ*_J = β*_J. Then it is easy to show that the irrepresentability condition of Wainwright (2009) holds with γ = 1 − 1/√K, and then to obtain the following L∞ bound: for any δ, ε > 0, and assuming that the noise variables ξ^(k)_i are i.i.d. σ-sub-Gaussian, we have

  max_k ||β̂_k − β*_k||_∞ = ||β̂ − β*||_∞ ≤ σ { √((|J| + ε)/n) + (2/γ) √((|J| log(Kp − |J|) + δ)/n) },

with probability greater than 1 − e^{−δ/2} − e^{−ε/2}, when the sparsity parameter is taken of order σ √({|J| log(p − |J|) + δ}/n). In other words, when X^(k)^T X^(k)/n_k = I_p for all k ∈ [K] and all the β*_k's are equal, our approach performs as well as IdentLasso (up to constants and log-terms), which is optimal in this case.

6.1.3 Asymptotic results for the adaptive version of M1 in the fixed p, fixed K case

In this paragraph, we consider the simple situation where p and K are held fixed (they do not increase with n). We further assume that, for all k ∈ [K], the matrices X^(k)^T X^(k)/n_k converge to positive definite matrices C^(k) as n_k → ∞, and that n_k/n → κ_k as n → ∞, with 0 < κ_k < 1 (stratum sizes all tend to infinity at the same rate). These assumptions essentially imply that the MLEs β̃_{k,j} can be used in the definition of the weights in Equations (10) and (7) of the main paper, and are √n-consistent as n → ∞.

Before stating our result, some notation is needed; in particular, the overall complexity of the model returned by M1^ad has to be defined properly. For any j ∈ [p], introduce the quantities ℓ*_j = min{k : β*_{k,j} ∈ median(β*_{1,j}, ..., β*_{K,j})}, K*_j = {k : β*_{k,j} = β*_{ℓ*_j,j}} ⊆ [K], and N*_j = Σ_{k ∈ K*_j} n_k. Further introduce A* = {(k, j) : β*_{k,j} ≠ 0} and, for all j ∈ [p], A*_j = {k : β*_{k,j} ≠ 0} and K*_{A*,j} = {k : β*_{k,j} = β*_{ℓ*_j,j} ≠ 0}.

To define the empirical counterparts of these quantities returned by M1^ad, first recall that for any j ∈ [p], ℓ_j = min{k : β̃_{k,j} ∈ median(β̃_{1,j}, ..., β̃_{K,j})}. Then denote by β̂^(ad)_{k,j}, for any k ∈ [K] and j ∈ [p], the estimator of β*_{k,j} returned by M1^ad. We can now define Â = {(k, j) : β̂^(ad)_{k,j} ≠ 0} and, for all j ∈ [p], K̂_j = {k : β̂^(ad)_{k,j} = β̂^(ad)_{ℓ_j,j}} ⊆ [K], Â_j = {k : β̂^(ad)_{k,j} ≠ 0} and K̂_{Â,j} = {k : β̂^(ad)_{k,j} = β̂^(ad)_{ℓ_j,j} ≠ 0}. Finally, for any j ∈ [p], denote by ŝ_j the number of distinct non-null values in (β̂^(ad)_{1,j}, ..., β̂^(ad)_{K,j}), that is

  ŝ_j = |Â_j|                      if β̂^(ad)_{ℓ_j,j} = 0,
        |Â_j| − |K̂_{Â,j}| + 1      otherwise.

The overall complexity of the model returned by M1^ad, which corresponds to the number of "free" parameters it returns, is then defined as ŝ = Σ_{j ∈ [p]} ŝ_j. More precisely, for any j ∈ [p] such that ŝ_j > 0, M1^ad returns a vector η̂^(ad)_j of size ŝ_j defined by

  η̂^(ad)_j = β̂^(ad)_{ℓ_j,j}                                                          if β̂^(ad)_{ℓ_j,j} ≠ 0 and ŝ_j = 1;
             (β̂^(ad)_{k̄_{j,1},j}, ..., β̂^(ad)_{k̄_{j,ŝ_j},j})^T                        if β̂^(ad)_{ℓ_j,j} = 0 and ŝ_j ≥ 1, with {k̄_{j,1}, ..., k̄_{j,ŝ_j}} = Â_j;
             (β̂^(ad)_{ℓ_j,j}, β̂^(ad)_{k̄_{j,1},j}, ..., β̂^(ad)_{k̄_{j,ŝ_j−1},j})^T       otherwise, with {k̄_{j,1}, ..., k̄_{j,ŝ_j−1}} = Â_j \ K̂_{Â,j}.
Now denote by η̂^(ad) the estimator returned by M1^ad (after some reordering), i.e., η̂^(ad) = (η̂^(ad)_j)_{j : ŝ_j > 0} ∈ IR^ŝ. We can further introduce the theoretical counterparts of ŝ_j and ŝ by setting s*_j = |A*_j| if β*_{ℓ*_j,j} = 0 and s*_j = |A*_j| − |K*_{A*,j}| + 1 otherwise, and s* = Σ_{j ∈ [p]} s*_j. Finally, introduce for any j ∈ [p] such that s*_j > 0

  η*_j = β*_{ℓ*_j,j}                                              if β*_{ℓ*_j,j} ≠ 0 and s*_j = 1;
         (β*_{k̄*_{j,1},j}, ..., β*_{k̄*_{j,s*_j},j})^T              if β*_{ℓ*_j,j} = 0 and s*_j ≥ 1, with {k̄*_{j,1}, ..., k̄*_{j,s*_j}} = A*_j;
         (β*_{ℓ*_j,j}, β*_{k̄*_{j,1},j}, ..., β*_{k̄*_{j,s*_j−1},j})^T  otherwise, with {k̄*_{j,1}, ..., k̄*_{j,s*_j−1}} = A*_j \ K*_{A*,j},

and η* = (η*_j)_{j : s*_j > 0} ∈ IR^{s*}.

As a simple consequence of Theorem 2 in Viallon et al. (2014), we have IP(Â = A*) → 1 and IP(∩_{j ∈ [p]} {K̂_{Â,j} = K*_{A*,j}}) → 1, as n → ∞.
In particular, this implies that the two vectors η* and η̂^(ad) are of identical size s* = ŝ with probability tending to one. Moreover, we have the following Gaussian limit distribution:

  √n (η̂^(ad) − η*) → N(0_{s*}, σ² (X̃_{s*}^T X̃_{s*})^{−1})   as n → ∞,

with X̃_{s*} the (n × s*) matrix that can be written as (X̃_{j,s*_j})_{j : s*_j > 0}, where each submatrix X̃_{j,s*_j}, of size (n × s*_j), is of the form

  X̃_{j,s*_j} = Σ_{k ∈ K*_{A*,j}} X̃^(k)_j                                                     if β*_{ℓ*_j,j} ≠ 0 and s*_j = 1;
               (X̃^{(k̄*_{j,1})}_j, ..., X̃^{(k̄*_{j,s*_j})}_j)                                   if β*_{ℓ*_j,j} = 0 and s*_j ≥ 1, with {k̄*_{j,1}, ..., k̄*_{j,s*_j}} = A*_j;
               (Σ_{k ∈ K*_{A*,j}} X̃^(k)_j, X̃^{(k̄*_{j,1})}_j, ..., X̃^{(k̄*_{j,s*_j−1})}_j)       otherwise, with {k̄*_{j,1}, ..., k̄*_{j,s*_j−1}} = A*_j \ K*_{A*,j}.

Here, X̃^(k)_j is the column vector of size n obtained by padding the j-th column of X^(k) with zeros, namely (0^T_{Σ_{k' < k} n_{k'}}, X^(k)_j^T, 0^T_{Σ_{k' > k} n_{k'}})^T.
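For concreteness, a minimal R sketch (not taken from the original text) of the complexity count ŝ is given below, computed from a hypothetical p × K matrix B_hat of adaptive estimates. Ties and medians are handled up to a numerical tolerance, and for even K the sketch simply assumes that the sample median coincides with one of the estimates.

```r
# Complexity count s_hat = sum_j s_hat_j, following the definitions above.
complexity <- function(B_hat, tol = 1e-8) {
  s_j <- apply(B_hat, 1, function(b) {
    med  <- median(b)
    l_j  <- min(which(abs(b - med) <= tol))                 # l_j: first index attaining the median
    A_j  <- which(abs(b) > tol)                             # non-zero components
    K_Aj <- which(abs(b - b[l_j]) <= tol & abs(b) > tol)    # non-zero components equal to the median value
    if (abs(b[l_j]) <= tol) length(A_j) else length(A_j) - length(K_Aj) + 1
  })
  sum(s_j)
}

# toy example with p = 2 covariates and K = 4 strata
B_hat <- rbind(c(0.5, 0.5, 0.5, 1.0),   # s_j = 2: the shared (median) value plus one distinct value
               c(0.0, 0.0, 0.3, 0.0))   # s_j = 1: median is 0, one non-zero value
complexity(B_hat)                        # 3
```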
Under example 3 of Figure 4, several values are valid medians, while only one value corresponds to the median under the other three examples (0, 0.5 and 0 for examples 1, 2 and 4, respectively). Contrary to the CliqueFused strategy, which is based on cliques, both our approaches use star-graphs and are therefore sub-optimal in situations where the non-zero values in (β*_{1,j}, ..., β*_{K,j}) consist of at least two groups of identical values (and possibly some distinct non-zero values). For instance, in example 4 of Figure 4, neither M1^ad nor M2^ad can return equal values for the components β̂^(ad)_{1,j}, β̂^(ad)_{2,j} and β̂^(ad)_{3,j} (nor for β̂^(ad)_{8,j}, β̂^(ad)_{9,j} and β̂^(ad)_{10,j}), while CliqueFused would, for n large enough. However, CliqueFused, i.e., clique-based strategies, can of course be outperformed on finite samples. This was illustrated in the simulation study conducted in Viallon et al. (2014) in the single-stratum setting. In the present "multiple strata" setting, if β*_{1,j}, ..., β*_{K−1,j} are all equal and β*_{K,j} differs from these K − 1 values, then we have
ℓ_j ∈ [K − 1] for n large enough, and our approaches would only penalize |β_{K,j} − β_{ℓ_j,j}|: they are therefore more likely to detect this difference than the clique-based approach, which penalizes all the differences |β_{K,j} − β_{k,j}| for k ∈ [K − 1].

6.2 Dual problem for M2

In this paragraph, we consider a general framework that encompasses linear and logistic models, along with the SVM, as special cases. Let f be a convex loss function and, for any k ∈ [K], set L_k(β_k) = Σ_{i ∈ [n_k]} f(β_k^T a^(k)_i + c^(k)_i) for β_k ∈ IR^p and some given a^(k)_i ∈ IR^p and c^(k)_i ∈ IR. We define the k-th feature matrix A^(k) := [a^(k)_1, ..., a^(k)_{n_k}]^T ∈ IR^{n_k × p}. We consider a generic supervised learning problem of the form

  (β̂, (γ̂_k)_{k=1}^K) ∈ argmin_{β, (γ_k)_{k=1}^K} Σ_{k=1}^K { L_k(β + γ_k) + λ_{1,k} ||β + γ_k||_1 + λ_{2,k} ||γ_k||_1 }.   (11)

Table 1: Expression of the conjugate of popular loss functions, adopting the convention 0 log 0 = 0 for the logistic loss.

  loss       f(ξ)                          conjugate f*(ϑ)                        domain of f*
  squared    f_sq(ξ) = ξ²/(2n)             n ϑ²/2                                 IR
  logistic   f_log(ξ) = log(1 + e^{−ξ})    (−ϑ) log(−ϑ) + (ϑ + 1) log(ϑ + 1)      [−1, 0]
  hinge      f_hi(ξ) = (1 − ξ)_+           ϑ                                      [−1, 0]

Recall that throughout the paper, the data are of the form (x^(k)_1, y^(k)_1), ..., (x^(k)_{n_k}, y^(k)_{n_k}) with x^(k)_i ∈ IR^p and y^(k)_i ∈ IR, for i ∈ [n_k] and k ∈ [K]. In this context, the aforementioned formalism covers linear regression with the loss function set to the squared loss f(ξ) = f_sq(ξ) = ξ²/(2n), by setting a^(k)_i = x^(k)_i (the vector of covariates) and c^(k)_i = −y^(k)_i (the negative response) for observation i ∈ [n_k] of the k-th stratum. Likewise, by setting c^(k)_i = 0 and a^(k)_i = y^(k)_i x^(k)_i for i ∈ [n_k], our formalism covers both logistic regression and the SVM framework, with the logistic loss f(ξ) = f_log(ξ) = log(1 + e^{−ξ}) and the hinge loss f(ξ) = f_hi(ξ) = (1 − ξ)_+, respectively.

For future use, we denote by f* the Fenchel conjugate of the loss function f, which is the extended-value convex function defined as f*(ϑ) := max_ξ {ξϑ − f(ξ)}. Beyond convexity, we make a few mild assumptions on the loss function f. First, we assume that it is non-negative everywhere and that it is closed (its epigraph is closed), so that f** = f. These assumptions are met by the squared, logistic and hinge loss functions, as well as by other popular loss functions. The conjugates of the squared, logistic and hinge losses are given in Table 1.

We have

  φ((λ_{1,k})_{k ∈ [K]}, (λ_{2,k})_{k ∈ [K]}) = min_{β, (γ_k)_{k=1}^K} Σ_{k=1}^K { L_k(β + γ_k) + λ_{1,k} ||β + γ_k||_1 + λ_{2,k} ||γ_k||_1 }
                                              = min_{β, (β_k)_{k=1}^K, (z_k)_{k=1}^K} Σ_{k=1}^K { G_k(z_k) + λ_{1,k} ||β_k||_1 + λ_{2,k} ||β_k − β||_1 }
                                                s.t. z_k = A^(k) β_k + c_k, k ∈ [K],

where, for z ∈ IR^{n_k}, G_k(z) = Σ_{i=1}^{n_k} f(z_i) and c_k = (c^(k)_1, ..., c^(k)_{n_k}) for any k ∈ [K].

We can now express the problem in min-max form as follows:

  φ((λ_{1,k})_{k ∈ [K]}, (λ_{2,k})_{k ∈ [K]}) = min_{β, (β_k), (z_k)} max_{(u_k), (v_k), (α_k)} Σ_{k=1}^K { G_k(z_k) + α_k^T (A^(k) β_k + c_k − z_k) + u_k^T β_k + v_k^T (β_k − β) }
  s.t.
  ||u_k||_∞ ≤ λ_{1,k}, ||v_k||_∞ ≤ λ_{2,k}, k ∈ [K].

Now, solving for β and (β_k)_{k=1}^K, we obtain the dual constraints

  Σ_{k=1}^K v_k = 0_p,   A^(k)^T α_k = u_k + v_k, for k ∈ [K].

Still denoting by f* the Fenchel conjugate of the function f, and eliminating the variables u_k, k = 1, ..., K, we obtain the dual problem

  φ((λ_{1,k})_{k ∈ [K]}, (λ_{2,k})_{k ∈ [K]}) = min_{α_1, ..., α_K} Σ_k { α_k^T c_k + Σ_{i ∈ [n_k]} f*(α_{k,i}) }

under the constraints

  ||A^(k)^T α_k − v_k||_∞ ≤ λ_{1,k},  ||v_k||_∞ ≤ λ_{2,k},  Σ_k v_k = 0_p  and  α_{k,i} ∈ Dom(f*), for k ∈ [K].

For instance, in the logistic regression setting, this takes the form

  φ((λ_{1,k})_{k ∈ [K]}, (λ_{2,k})_{k ∈ [K]}) = min_{α_1, ..., α_K} Σ_{k ∈ [K]} { α_k^T log α_k + (1 − α_k)^T log(1 − α_k) }

under the constraints

  ||A^(k)^T α_k − v_k||_∞ ≤ λ_{1,k},  ||v_k||_∞ ≤ λ_{2,k},  Σ_k v_k = 0_p  and  α_k ∈ [0, 1]^{n_k}, for k ∈ [K].

This problem is then equivalent to a standard entropy maximization problem and can therefore be solved using standard optimization toolboxes such as the Mosek toolbox in R or Matlab, which return both optimal dual and primal solutions. (In the linear regression setting, the problem reduces to a quadratic program and can also be solved with Mosek.)
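Alternatively, the primal problem (11) can be handed directly to a generic convex solver. The following R sketch uses the CVXR package with the squared loss on simulated data; it is only an illustration of problem (11) under these assumptions, not the Mosek-based dual approach described above.

```r
# Solve problem (11) with the squared loss f_sq(xi) = xi^2/(2n) on toy data.
library(CVXR)

set.seed(1)
K <- 3; p <- 10; n_k <- 50; n <- K * n_k
A    <- lapply(seq_len(K), function(k) matrix(rnorm(n_k * p), n_k, p))  # feature matrices A^(k)
cvec <- lapply(seq_len(K), function(k) rnorm(n_k))                      # offsets c^(k) (e.g., -y^(k))
lambda1 <- rep(0.1, K); lambda2 <- rep(0.05, K)

beta  <- Variable(p)                                  # shared component
gamma <- lapply(seq_len(K), function(k) Variable(p))  # stratum-specific deviations

obj <- 0
for (k in seq_len(K)) {
  xi  <- A[[k]] %*% (beta + gamma[[k]]) + cvec[[k]]
  obj <- obj + sum_squares(xi) / (2 * n) +
         lambda1[k] * p_norm(beta + gamma[[k]], 1) +
         lambda2[k] * p_norm(gamma[[k]], 1)
}
fit <- solve(Problem(Minimize(obj)))
beta_hat <- fit$getValue(beta)   # estimated shared effect
```

Such a generic formulation is mainly useful as a check on small problems.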
6.3 (Sparse) group lasso (spGroupLasso)

Lounici et al. (2009) and Negahban and Wainwright (2011) consider two different group lasso strategies. Their estimators are defined as minimizers of the criterion

  Σ_k ||y^(k) − X^(k) β^(k)||²_2 / n + λ ||B||_{1,q},   (12)

with typical values q = 2 (Lounici et al., 2009) or q = ∞ (Negahban and Wainwright, 2011). The ||·||_{1,q} norm encourages solutions B̂ to exhibit a row-wise group structure, in the following sense: for each covariate j ∈ [p], the parameters β̂^(k)_j, k ∈ [K], are either all zero or all non-zero (the estimated effect of each covariate is either null for all tasks or non-null for all tasks).

A slightly more flexible approach consists in adding an L1 penalty, to allow covariates to have a non-zero estimated effect on all but some tasks. This leads to the so-called sparse group lasso in standard regression models (Simon et al., 2013). In our context, this suggests minimizing the following criterion:

  Σ_k ||y^(k) − X^(k) β^(k)||²_2 / n + λ_1 ||B||_{1,1} + λ_2 ||B||_{1,q}.   (13)

Note that in Equation (13) above, the solution B̂ is encouraged to be sparse and to exhibit a group structure, while in the Dirty model of Jalali et al. (2013) (see Equation (6) of the main paper) the solution is the sum of two matrices: Ŝ, which is encouraged to be sparse, and R̂, which is encouraged to have a group structure.
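As an illustration, criterion (13) with q = 2 can be written down directly for a generic solver; the R sketch below again relies on CVXR and simulated data, and is only meant to make the structure of the two penalties explicit (it is not the algorithm used in the simulation study).

```r
# Sparse group lasso criterion (13) with q = 2 on toy multi-task data.
library(CVXR)

set.seed(1)
K <- 3; p <- 10; n_k <- 50; n <- K * n_k
X <- lapply(seq_len(K), function(k) matrix(rnorm(n_k * p), n_k, p))
y <- lapply(seq_len(K), function(k) rnorm(n_k))
lambda1 <- 0.05; lambda2 <- 0.1

B <- Variable(p, K)                     # column k holds beta^(k)
loss <- 0
for (k in seq_len(K)) loss <- loss + sum_squares(y[[k]] - X[[k]] %*% B[, k]) / n

l1_pen <- p_norm(B, 1)                  # ||B||_{1,1}: entrywise L1 norm
group_pen <- 0                          # ||B||_{1,2}: sum over covariates of row-wise L2 norms
for (j in seq_len(p)) group_pen <- group_pen + p_norm(B[j, ], 2)

fit <- solve(Problem(Minimize(loss + lambda1 * l1_pen + lambda2 * group_pen)))
B_hat <- fit$getValue(B)
```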
Non-asymptotic properties have been established for the non-sparse group lasso in multi-task settings (Lounici et al., 2009; Negahban and Wainwright, 2011). In particular, Negahban and Wainwright (2011) compare the L1/L∞-regularized method to IndepLasso in the special case of K = 2 linear regression problems with standard Gaussian designs whose supports have common size s and overlap in a fraction α ∈ [0, 1] of their entries. They focus on the capacity of each method to correctly recover the union of the two supports. Assuming a common number n/2 of observations per task, they show that the L1/L∞-regularized method yields improved statistical efficiency if the overlap parameter is large enough (α ≥ 2/3), and is outperformed by IndepLasso otherwise (α < 2/3). Under the additional assumption that β*_{1,j} β*_{2,j} ≥ 0 for all j ∈ [p],
we have established that our first approach M1 performs at least similarly to the Dirty model (see Section 6.1.1 above).

6.4 Additional results from the simulation study

6.4.1 K = 2

Here we present a comparison of Dirty and M1 when K = 2. We actually mimic the simulation study presented in Jalali et al. (2013). For a given value of p, the sparsity of both β*_1 and β*_2 is set to a common value s (a fixed fraction of p). Results are presented for p = 128 (results for p = 256 were similar). The proportion α of non-zero and equal components in β*_1 and β*_2 takes two values. For a given value of β > 0, non-zero components are set to ±β with probability 1/2. Design matrices were generated under a multivariate Gaussian N_p(0_p, I_p) distribution, and noise variables under a N(0, 0.1) distribution. The common number of observations per stratum, n_1 = n_2 = n/2, is set to n_k = c s log[p − (2 − α)s] for several values of c defining the rescaled sample size (Jalali et al., 2013).
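A minimal R sketch of this data-generating mechanism is given below; the support size s, the overlap α, the constant c and the signal level β are illustrative values, not the (partly unreadable) ones of the original study.

```r
# Generate one replicate of the K = 2 design described above (illustrative values).
set.seed(1)
p <- 128; s <- 16; alpha <- 0.5; beta_val <- 1; c_const <- 2
n_k <- ceiling(c_const * s * log(p - (2 - alpha) * s))   # per-stratum sample size
n_overlap <- floor(alpha * s)                            # number of shared (equal) components

supp1 <- 1:s
supp2 <- c(1:n_overlap, (s + 1):(2 * s - n_overlap))     # supports overlap in a fraction alpha of entries
beta1 <- beta2 <- numeric(p)
beta1[supp1] <- sample(c(-beta_val, beta_val), s, replace = TRUE)
beta2[supp2] <- sample(c(-beta_val, beta_val), s, replace = TRUE)
beta2[1:n_overlap] <- beta1[1:n_overlap]                 # shared components are equal

X1 <- matrix(rnorm(n_k * p), n_k, p)
X2 <- matrix(rnorm(n_k * p), n_k, p)
y1 <- as.vector(X1 %*% beta1 + rnorm(n_k, sd = sqrt(0.1)))
y2 <- as.vector(X2 %*% beta2 + rnorm(n_k, sd = sqrt(0.1)))
```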
Selection of the optimal sparsity parameter was performed using an independent test sample. Methods referred to as 2step Dirty and 2step M1 further use a two-step strategy to unshrink the estimates before computing the L2 prediction error on the test sample. Figure 5 presents the results we obtained for the support recovery accuracy. It confirms that Dirty and M1 perform similarly.
Figure 5: Results from the simulation study with K = 2: support recovery accuracy as a function of the rescaled sample size, for Dirty, M1, 2step Dirty and 2step M1 (panels indexed by the values of α and β).

6.4.2 Support recovery accuracy for the methods compared on Figure 1 of the main paper

Figure 6 presents the support recovery accuracies for the methods compared in terms of prediction accuracy on Figure 1 of the main paper. As mentioned in the main paper, the conclusions drawn from Figure 6 are very similar to those drawn from Figure 1 of the main paper.
Figure 6: Results from the first simulation study: support recovery accuracy as a function of the signal-to-noise ratio (snr), with panels indexed by Sparsity_glob and Sparsity_spec, for M1, M2, Dirty, CliqueFused, spGroupLasso, Stepwise, IdentLasso and IndepLasso.

6.5 Performance of M2 depending on the (λ_1, λ_2) grid

All methods presented in the main article involve tuning parameters that need to be carefully selected in practice. Generally speaking, a predefined grid of λ values first has to be constructed. For instance, in the single-stratum setting, the minimal value λ_max ensuring that the vector returned by the lasso is null for all λ ≥ λ_max has a closed-form expression (e.g., for linear and logistic regression), and the grid is then generally taken as a geometric sequence of values decreasing from λ_max. Most methods considered in the main paper involve two tuning parameters, and the derivation of λ_{1,max} and λ_{2,max} is not straightforward. Because our simulated data are such that X^(k)_j ∼ N(0, 1)
for all k ∈ [K] and j ∈ [p], and n_k = n/K for all k ∈ [K], we limit the presentation to the case where the tuning parameters do not depend on k; if the n_k's are not equal, one solution for M1 consists in first standardizing the columns of matrix X and then proceeding as described here. For the Dirty approach of Jalali et al. (2013), we proceed as the authors did in their simulation study. Because of its connection with the lasso, the following strategy can be used for our first approach M1 (or its adaptive version). For any pair (λ_1, λ_2) such that λ_1 ≥ Kλ_2, it is easy to see from Equation (4) of the main paper that the estimated common effect β̂ will be 0_p, so that M1 reduces to a version of IndepLasso run with the tuning parameter set to λ_2. On the other hand, if λ_1 = 0 then β is unpenalized, and M1 returns parameter vectors β̂_k = β̂ for λ_2 large enough. Therefore, our strategy is to first make the ratio λ_1/λ_2 vary over the interval [0, K] (we take 50 values equally spaced on a log scale), and, for each value r of this ratio, the glmnet function can be used to compute the value λ_{2,max}(r), along with the 50 models returned when λ_2(r) varies over a grid of 50 values, equally spaced on the log scale and decreasing from λ_{2,max}(r).

For the other methods (M2, CliqueFused and spGroupLasso) we proceed as follows. We first consider the case λ_2 = 0. The methods then all reduce to a standard lasso, for which the λ_{1,max} value can be computed. The grid of λ_1 values is then chosen as a set of 50 values, equally spaced on the log scale and decreasing from λ_{1,max}. Then, for each λ_1 value on this grid, the minimal value λ_{2,max}(λ_1) is numerically approximated. This value is such that, for all λ_2 ≥ λ_{2,max}(λ_1), the considered method computed with parameters (λ_1, λ_2) returns a parameter vector β̂(λ_1, λ_2) with the lowest possible overall complexity. In particular, for methods M1, M2 and CliqueFused, this means that β̂_k(λ_1, λ_2) = β̂_{k'}(λ_1, λ_2) for all (k, k') ∈ [K]² and λ_2 ≥ λ_{2,max}(λ_1). Such an approximate λ_{2,max}(λ_1) value is simply obtained by iteratively running the considered method for various values of λ_2, starting from a very large value and successively dividing it by 2, for instance. Another, simpler, strategy is as follows. We first consider the case λ_2 = 0, for which the methods all reduce to a standard lasso and the λ_{1,max} value can be computed. Then, we use the same grid of 50 values, equally spaced on the log scale and decreasing from λ_{1,max}, for both λ_1 and λ_2. This strategy is referred to as AlternativeGrid in Figure 7 below, which compares the performance of M2 implemented with the two strategies on the first simulation study. The comparison of the black and grey curves illustrates how difficult it is to empirically compare methods. When using the AlternativeGrid strategy, M2 performs similarly to M1, while with the other strategy it performs like spGroupLasso (which uses the same strategy for the grid construction). Therefore, the observed differences of performance between the various strategies can be (at least partly) explained by the choice of the initial grid of (λ_1, λ_2) values.
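For M1, this ratio-based grid can be sketched with glmnet as follows. The block structure of the transformed design (Equation (5) of the main paper) is assumed here rather than reproduced, the lower end of the ratio grid is truncated at a small positive value, and glmnet internally rescales penalty.factor, so the returned lambda sequences are only proportional to the λ_2 values discussed above.

```r
# Grid construction for M1: one glmnet path per value of the ratio r = lambda_1/lambda_2.
library(glmnet)
library(Matrix)

build_design <- function(X_list) {
  # first p columns: common effect; then one block of p columns per stratum
  cbind(Matrix(do.call(rbind, X_list), sparse = TRUE), bdiag(X_list))
}

grid_M1 <- function(X_list, y_list, n_ratios = 50, n_lambda = 50) {
  K <- length(X_list); p <- ncol(X_list[[1]])
  Xt <- build_design(X_list)
  y  <- unlist(y_list)
  ratios <- exp(seq(log(1e-3), log(K), length.out = n_ratios))  # ratio lambda_1/lambda_2 in (0, K]
  lapply(ratios, function(r) {
    pf  <- c(rep(r, p), rep(1, K * p))       # penalty.factor encodes the ratio between the two penalties
    fit <- glmnet(Xt, y, family = "gaussian", penalty.factor = pf, nlambda = n_lambda)
    list(ratio = r, lambda = fit$lambda, fit = fit)  # fit$lambda[1] plays the role of lambda_{2,max}(r)
  })
}
```

Each element of the returned list then corresponds to one value of the ratio, with its own decreasing sequence of penalty levels.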
Figure 7: Comparison of the prediction performance for M2 depending on the strategy used to construct the (λ_1, λ_2) grid, as a function of the signal-to-noise ratio (snr) and with panels indexed by Sparsity_glob and Sparsity_spec; methods shown: M1, M2, M2_AlternativeGrid and spGroupLasso.