Co-Clustering for Multitask Learning
Keerthiram Murugesan, Jaime Carbonell, Yiming Yang

Abstract
This paper presents a new multitask learning framework that learns a shared representation among the tasks, incorporating both task and feature clusters. The jointly-induced clusters yield a shared latent subspace where task relationships are learned more effectively and more generally than in state-of-the-art multitask learning methods. The proposed general framework enables the derivation of more specific or restricted state-of-the-art multitask methods. The paper also proposes a highly scalable multitask learning algorithm, based on the new framework, using conjugate gradient descent and generalized Sylvester equations. Experimental results on synthetic and benchmark datasets show that the proposed method systematically outperforms several state-of-the-art multitask learning methods.
1. Introduction
Multitask learning leverages shared structures among the tasks to jointly build a better model for each task. Most existing work in multitask learning focuses on how to take advantage of task similarities, either by learning the relationship between the tasks via cross-task regularization techniques (Zhang & Yeung, 2014; Zhang & Schneider, 2010; Rothman et al., 2010; Xue et al., 2007) or by learning a shared feature representation across all the tasks, leveraging low-dimensional subspaces in the feature space (Argyriou et al., 2008; Jalali et al., 2010; Liu et al., 2009; Swirszcz & Lozano, 2012). Learning task relationships has been shown beneficial in (positive and negative) transfer of knowledge from information-rich tasks to information-poor tasks (Zhang & Yeung, 2014), whereas the shared feature representation has been shown to perform well when each task has a limited number of training instances (observations) compared to the total number across all tasks (Argyriou et al., 2008). Existing research in multitask learning considers either the first approach and learns a task relationship matrix in addition to the task parameters, or relies on the latter approach and learns a shared latent feature representation from the task parameters. To the best of our knowledge, there is no prior work that utilizes both principles jointly for multitask learning.

School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. {kmuruges,jgc,yiming}@cs.cmu.edu. Correspondence to: Keerthiram Murugesan <[email protected]>. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017. JMLR: W&CP. Copyright 2017 by the author(s).
In this paper, we propose a new approach that jointly learns a shared feature representation along with the task relationship matrix, combining the advantages of both principles into a general multitask learning framework.

Early work on latent shared representation includes (Zhang et al., 2005), which proposes a model based on Independent Component Analysis (ICA) for learning multiple related tasks; the task parameters are assumed to be generated from independent sources. (Argyriou et al., 2008) consider sparse representations common across many learning tasks. Similar in spirit to PCA for unsupervised tasks, their approach learns a low-dimensional representation of the observations (Ding & He, 2004). More recently, (Kumar & Daume, 2012) assume that relationships among tasks are sparse, enforcing that each observed task is obtained from only a few of the latent features, and from there learn the overlapping group structure among the tasks. (Crammer & Mansour, 2012) propose a K-means-like procedure that simultaneously clusters different tasks and learns a small pool of m ≪ T shared models. Specifically, each task is free to choose a model from the pool that better classifies its own data, and each model is learned by pooling together all the training data that belongs to the same cluster. (Barzilai & Crammer, 2015) propose a similar approach that clusters the T tasks into K task-clusters with hard assignments.

These methods compute the factorization of the task weight matrix to learn the shared feature representation and the task structure. This matrix factorization induces the simultaneous clustering of both the tasks and the features in the K-dimensional latent subspace (Li & Ding, 2006). One of the major disadvantages of this assumption is that it restricts the model to define both the tasks and the features to have the same number of clusters.
Consider, for example, sentiment analysis, where each task belongs to a certain domain or a product category such as books, automobiles, etc., and each feature is simply a word from the vocabulary of the product reviews. Clearly, assuming that the features and the tasks have the same number of clusters is unjustified: the number of feature clusters is typically larger than the number of task clusters, but the latter grows faster than the former as new products are introduced. Such a restrictive assumption may (and often does) hurt the performance of the model.

Unlike previous work, our proposed approach provides a flexible way to cluster both the tasks and the features. We introduce an additional degree of freedom that allows the number of task clusters to differ from the number of feature clusters (Ding et al., 2006; Wang et al., 2011). In addition, our proposed models learn both the task relationship matrix and the feature relationship matrix along with the co-clustering of both the tasks and the features (Gu & Zhou, 2009; Sindhwani et al., 2009). Our proposed approach is closely related to Output Kernel Learning (OKL), where one learns the kernel between the components of the output vector for problems such as multi-output learning and multitask learning (Dinuzzo et al., 2011; Sindhwani et al., 2013). The key disadvantage of OKL is that it requires the computation of a kernel matrix between every pair of instances from all the tasks. This results in a scalability constraint, especially when the number of tasks/features is large (Weinberger et al., 2009). Our proposed models achieve a similar effect by learning a shared feature representation common across the tasks.

A key challenge in factoring with the extra degree of freedom is optimizing the resulting objective function. Previous work on co-clustering for multitask learning requires strong assumptions on the task parameters (Zhong & Kwok, 2012) or is not scalable to large-scale applications (Xu et al., 2015). We propose an efficient algorithm that scales well to large-scale multitask learning and utilizes the structure of the objective function to learn the factorized task parameters. We formulate the learning of the latent variables in terms of a generalized Sylvester equation, which can be efficiently solved using the conjugate gradient descent algorithm. We start from the mathematical background and motivate our approach in Section 2. We then introduce our proposed models and their learning procedures in Section 3. Section 4 reports the empirical analysis of our proposed models and shows that learning both the task clusters and the feature clusters along with the task parameters gives significant improvements over the state-of-the-art baselines in multitask learning.
2. Background
Suppose we have T tasks and D_t = {X_t, Y_t} = {(x_ti, y_ti) : i = 1, 2, ..., N_t} is the training set for each task t ∈ {1, 2, ..., T}. Let W_t represent the weight vector for the task indexed by t. These task weight vectors are stacked as the columns of a matrix W of size P × T, with P being the feature dimension. Traditional multitask learning imposes additional assumptions on W, such as low rank, the ℓ1 norm, the ℓ2,1 norm, etc., to leverage the shared characteristics among the tasks. In this paper, we consider a similar assumption based on the factorization of the task weight matrix W. In factored models, we decompose the weight matrix W as FG^⊤, where F can be interpreted as a feature cluster matrix of size P × K with K feature clusters and, similarly, G as a task cluster matrix of size T × K with K task clusters. If we consider squared error losses for all the tasks, then the objective function for learning F and G can be given as follows:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}, F ∈ Γ_F, G ∈ Γ_G}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + P_λ1(F) + P_λ2(G)    (1)

In the above objective function, the latent feature representation is captured by the matrix F and the grouping structure on the tasks is determined by the matrix G. The predictor W_t for task t can then be computed as F G_t^⊤, where G_t is the t-th row of the matrix G. P_λ1(F) is a regularization term that penalizes the unknown matrix F with regularization parameter λ1; similarly, P_λ2(G) penalizes the unknown matrix G with regularization parameter λ2. Γ_F and Γ_G are their corresponding constraint spaces. Without these additional constraints on F and G, the objective function reduces to solving each task independently, since any task weight matrix W can then be attained by the product of F and G.

Several assumptions can be enforced on these unknown factors F and G.
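To make the factored model concrete, the following minimal NumPy sketch (illustrative only; all dimensions and data are made up) builds W = FG^⊤ and evaluates the unregularized data-fit term of Equation 1.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T, K, N = 20, 5, 3, 30          # features, tasks, clusters, examples per task

F = rng.standard_normal((P, K))    # feature cluster matrix (P x K)
G = rng.standard_normal((T, K))    # task cluster matrix (T x K)
W = F @ G.T                        # stacked task weights (P x T); W_t = F G_t^T

X = [rng.standard_normal((N, P)) for _ in range(T)]
Y = [X[t] @ W[:, t] + 0.1 * rng.standard_normal(N) for t in range(T)]

# Data-fit term of Equation (1): sum_t ||Y_t - X_t F G_t^T||^2
loss = sum(float(np.sum((Y[t] - X[t] @ F @ G[t]) ** 2)) for t in range(T))
print(loss)
```

Note that the per-task predictor is recovered as W_t = F G_t^⊤, i.e., column t of W equals F times row t of G.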
Below we discuss some previous models that make well-known assumptions on F and G and can be written in terms of the above objective function.

(1) Factored Multitask Learning (FMTL) (Amit et al., 2007) considers a squared Frobenius norm on both F and G:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ1 ||F||_F² + λ2 ||G||_F²    (2)

It can be shown that the above problem can equivalently be written as multitask learning with a trace norm constraint on the task weight matrix W.

(2) Multitask Feature Learning (MTFL) (Argyriou et al., 2008) assumes that the matrix G learns sparse representations common across many tasks. Similar in spirit to PCA for unsupervised tasks, MTFL learns a low-dimensional representation of the observations X_t for each task, using F such that FF^⊤ = I_P:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}, FF^⊤ = I_P}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ ||G||²_{2,1}    (3)

where K is usually set to P. It considers an ℓ2,1 norm on G to force all the tasks to have a similar sparsity pattern, so that the tasks select the same latent features (columns of F). It is worth noting that Equation 3 can be equivalently written as follows:

    argmin_{W ∈ R^{P×T}, Σ ⪰ 0}  ∑_{t ∈ [T]} ||Y_t − X_t W_t||² + λ tr(W^⊤ Σ^{−1} W)    (4)

which can then be rewritten as multitask learning with a trace norm constraint on the task weight matrix W, as before.

(3) Group Overlap MTL (GO-MTL) (Kumar & Daume, 2012) assumes that the matrix G is sparse, enforcing that each observed task is obtained from only a few of the latent features, indexed by the non-zero pattern of the corresponding rows of the matrix G:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ1 ||F||_F² + λ2 ||G||_1    (5)

The above objective function can be compared to dictionary learning, where each column of F is considered a dictionary atom and each row of G its corresponding sparse code (Maurer et al., 2013).
(4) Multitask Learning by Clustering (CMTL) (Barzilai & Crammer, 2015) assumes that the T tasks can be clustered into K task-clusters with hard assignment. For example, if the k-th element of G_t is one and all other elements of G_t are zero, we say that task t is associated with cluster k:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}, G_t ∈ {0,1}^K, ||G_t||_1 = 1 ∀t ∈ [T]}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ ||F||_F²    (6)

The constraints G_t ∈ {0,1}^K and ||G_t||_1 = 1 ensure that G is a proper clustering matrix. Since the above problem is computationally expensive, as it involves solving a combinatorial problem, the constraint on G is relaxed to G_t ∈ [0,1]^K.

These four methods require the number of task clusters to be the same as the number of feature clusters, which, as mentioned earlier, is a restrictive assumption that may, and often does, hurt performance. In addition, these methods do not leverage the inherent relationship between the features (via F) or the relationship between the tasks (via G). Note that these objective functions are bi-convex problems: the optimization is convex in F when fixing G, and vice versa. We cannot guarantee a globally optimal solution, but one can show that the alternating algorithm reaches a locally optimal solution in a fixed number of iterations.
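The bi-convex structure noted above is typically exploited by alternating minimization. The sketch below is a simplified illustration (not the authors' code): it runs alternating gradient steps on the FMTL objective of Equation 2 on made-up data, with arbitrary dimensions, step size, and iteration count, and checks that the objective decreases.

```python
import numpy as np

rng = np.random.default_rng(1)
P, T, K, N = 10, 4, 2, 50
lam1 = lam2 = 0.1
X = [rng.standard_normal((N, P)) for _ in range(T)]
W_true = rng.standard_normal((P, T))
Y = [X[t] @ W_true[:, t] for t in range(T)]

def objective(F, G):
    # Equation (2): squared losses plus Frobenius penalties on both factors
    fit = sum(np.sum((Y[t] - X[t] @ F @ G[t]) ** 2) for t in range(T))
    return fit + lam1 * np.sum(F ** 2) + lam2 * np.sum(G ** 2)

F = 0.01 * rng.standard_normal((P, K))
G = 0.01 * rng.standard_normal((T, K))
lr = 1e-3
obj_start = objective(F, G)
for _ in range(50):
    # gradient step in F with G fixed (a convex subproblem)
    gF = sum(-2.0 * X[t].T @ np.outer(Y[t] - X[t] @ F @ G[t], G[t])
             for t in range(T)) + 2.0 * lam1 * F
    F = F - lr * gF
    # gradient step in each row of G with F fixed (a convex subproblem)
    for t in range(T):
        r = Y[t] - X[t] @ F @ G[t]
        G[t] = G[t] + lr * (2.0 * (X[t] @ F).T @ r - 2.0 * lam2 * G[t])
obj_end = objective(F, G)
print(obj_start, obj_end)
```

In the paper each subproblem is instead solved to optimality via a linear (Sylvester-type) equation; the gradient steps here merely illustrate the alternating structure.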
3. Proposed Approach
Existing models do not take into consideration both the relationship between the tasks and the relationship between the features. Here we consider a more general formulation in which, in addition to estimating the parameters F and G, we learn the task relationship matrix Ω and the feature relationship matrix Σ. We call this framework BiFactor multitask learning, following the factorization of the task parameters W into two low-rank matrices F and G:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}, Σ ⪰ 0, Ω ⪰ 0}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ1 tr(F^⊤ Σ^{−1} F) + λ2 tr(G^⊤ Ω^{−1} G)    (7)

In the above objective function, we choose P_λ1(F) and P_λ2(G) so as to learn the feature relationship matrix Σ and the task relationship matrix Ω. The motivation for these regularization terms is based on (Argyriou et al., 2008; Zhang & Yeung, 2014), which separately considered either the task relationship matrix Ω or the feature relationship matrix Σ. Note that the value of K is typically set to a value less than min(P, T).

It is easy to see that by setting G to I_T, our objective function reduces to multitask feature learning (MTFL), discussed in the previous section. Similarly, by setting F to I_P, our objective function reduces to multitask relationship learning (MTRL) (Zhang & Yeung, 2014). If we set Ω = I_T and Σ = I_P, we obtain the factored multitask learning setting (FMTL) defined in Equation 2. Hence the prior art can be cast as special cases of our more general formulation by imposing certain limiting restrictions.
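As an illustration of the BiFactor objective, the sketch below (a toy NumPy example; the dimensions, data, and the small stabilizing ridge are our own choices) evaluates Equation 7 and applies the trace-normalized matrix square-root updates for Σ and Ω used in the alternating scheme (Zhang & Yeung, 2014).

```python
import numpy as np

rng = np.random.default_rng(7)
P, T, K, N = 8, 4, 3, 25
lam1 = lam2 = 0.1

X = [rng.standard_normal((N, P)) for _ in range(T)]
Y = [rng.standard_normal(N) for _ in range(T)]
F = rng.standard_normal((P, K))
G = rng.standard_normal((T, K))
Sigma = np.eye(P)   # feature relationship matrix
Omega = np.eye(T)   # task relationship matrix

def bifactor_objective(F, G, Sigma, Omega):
    # Equation (7): squared losses plus trace regularizers
    fit = sum(np.sum((Y[t] - X[t] @ F @ G[t]) ** 2) for t in range(T))
    reg = lam1 * np.trace(F.T @ np.linalg.solve(Sigma, F)) \
        + lam2 * np.trace(G.T @ np.linalg.solve(Omega, G))
    return fit + reg

def psd_sqrt_unit_trace(S):
    # closed-form relationship-matrix update: S^{1/2} / tr(S^{1/2})
    w, V = np.linalg.eigh(S)
    R = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return R / np.trace(R)

obj = bifactor_objective(F, G, Sigma, Omega)
Sigma = psd_sqrt_unit_trace(F @ F.T + 1e-8 * np.eye(P))   # Sigma update from FF^T
Omega = psd_sqrt_unit_trace(G @ G.T + 1e-8 * np.eye(T))   # Omega update from GG^T
print(obj, np.trace(Sigma), np.trace(Omega))
```

Both updated matrices are symmetric positive semidefinite with unit trace, matching the normalization in the closed-form solutions below.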
BiFactor MTL

We propose an efficient learning algorithm for solving the BiFactor MTL objective function. Consider an alternating minimization algorithm, where we learn the shared representation F while fixing the task structure G, and learn the task structure G while fixing the shared representation F. We repeat these steps until we converge to a locally optimal solution.

Optimizing w.r.t. F gives a generalized Sylvester equation of the form AQB^⊤ + CQD^⊤ = E for the unknown Q. We show later in this section how to solve these linear equations efficiently. From the objective function, we have:

    ∑_t (X_t^⊤ X_t) F (G_t^⊤ G_t) + λ1 Σ^{−1} F = ∑_t X_t^⊤ Y_t G_t    (8)

Optimizing w.r.t. G for the squared error loss results in a similar linear equation:

    (F^⊤ X_t^⊤ X_t F) G_t + λ2 Ω^{−1} G = F^⊤ X_t^⊤ Y_t,  for all t ∈ [T]    (9)

Optimizing w.r.t. Ω and Σ: the optimization of the above function w.r.t. Ω and Σ, while fixing the other unknowns, admits the following closed-form solutions (Zhang & Yeung, 2014):

    Ω = (GG^⊤)^{1/2} / tr((GG^⊤)^{1/2}),    Σ = (FF^⊤)^{1/2} / tr((FF^⊤)^{1/2})

As mentioned earlier, one of the restrictions in BiFactor MTL and the factored models is that both the number of feature clusters and the number of task clusters are set to K. This poses a serious model restriction, by assuming that both the latent task and feature representations live in the same subspace. Such an assumption can significantly hinder the flexibility of the model search space, and we address this problem with a modification to our previous framework.

Following previous work on matrix tri-factorization, we introduce an additional factor S such that we write W as FSG^⊤, where F is a feature cluster matrix of size P × K1 with K1 feature clusters, G is a task cluster matrix of size T × K2 with K2 task clusters, and S is the matrix that maps feature clusters to task clusters. With this representation, the latent features lie in a K1-dimensional subspace and the latent tasks lie in a K2-dimensional subspace:

    argmin_{F ∈ R^{P×K1}, G ∈ R^{T×K2}, S ∈ R^{K1×K2}, Σ ⪰ 0, Ω ⪰ 0}  ∑_{t ∈ [T]} ||Y_t − X_t F S G_t^⊤||² + λ1 tr(F^⊤ Σ^{−1} F) + λ2 tr(G^⊤ Ω^{−1} G)    (10)

The cluster mapping matrix S introduces an additional degree of freedom in the factored models and accommodates the realistic assumptions encountered in many applications. Note that we do not consider any regularization on S in this paper, but one may impose additional constraints on S, such as ℓ1 (sparse penalty), ℓ2 (ridge penalty), non-negativity constraints, etc., to further improve performance.

TriFactor MTL
We introduce an efficient learning algorithm for solving TriFactor MTL, similar to the optimization procedure for BiFactor MTL. As before, we consider an alternating minimization algorithm: we learn the shared representation F while fixing G and S, we learn the task structure G while fixing F and S, and we learn the cluster mapping matrix S while fixing F and G. We repeat these steps until we converge to a locally optimal solution.

Optimizing w.r.t. F gives a generalized Sylvester equation, as before:

    ∑_t (X_t^⊤ X_t) F (S G_t^⊤ G_t S^⊤) + λ1 Σ^{−1} F = ∑_t X_t^⊤ Y_t G_t S^⊤    (11)

Optimizing w.r.t. G gives the following linear equation:

    (S^⊤ F^⊤ X_t^⊤ X_t F S) G_t + λ2 Ω^{−1} G = S^⊤ F^⊤ X_t^⊤ Y_t    (12)

for all t ∈ [T].

Optimizing w.r.t. S results in the following equation:

    ∑_t (F^⊤ X_t^⊤ X_t F) S (G_t^⊤ G_t) = ∑_t F^⊤ X_t^⊤ Y_t G_t    (13)

Optimizing w.r.t. Ω and Σ: the optimization w.r.t. Ω and Σ, while fixing the other unknowns, proceeds as in BiFactor MTL. Note that one may consider ℓ1 regularization on Ω and Σ to learn sparse relationships between the tasks and between the features (Zhang & Schneider, 2010).

We now give some details on how to solve the generalized Sylvester equations (8, 9, 11, 12, 13) encountered in the BiFactor and TriFactor MTL optimization steps. The generalized Sylvester equation of the form AQB^⊤ + CQD^⊤ = E has a unique solution Q under certain regularity conditions, which can be obtained exactly by an extended version of the classical Bartels-Stewart method, whose complexity is O((p + q)³) for a p × q matrix variable Q, compared to the naive matrix inversion which requires O(p³q³).

Alternatively, one can solve the linear equation using the properties of the Kronecker product: (B^⊤ ⊗ A) vec(Q) + (D^⊤ ⊗ C) vec(Q) = vec(E), where ⊗ is the Kronecker product and vec(·) vectorizes Q in a column-oriented way. Below, we show the alternative form for the TriFactor MTL equations:

    ∑_t ((S G_t^⊤ G_t S^⊤) ⊗ (X_t^⊤ X_t)) vec(F) + λ1 (I_K1 ⊗ Σ^{−1}) vec(F) = vec(∑_t X_t^⊤ Y_t G_t S^⊤)    (14)

    diag(S^⊤ F^⊤ X_t^⊤ X_t F S)_{t=1}^T vec(G) + λ2 (I_K2 ⊗ Ω^{−1}) vec(G) = vec([S^⊤ F^⊤ X_t^⊤ Y_t]_{t=1}^T)    (15)

    ∑_t ((G_t^⊤ G_t) ⊗ (F^⊤ X_t^⊤ X_t F)) vec(S) = vec(∑_t F^⊤ X_t^⊤ Y_t G_t)    (16)

We can do the same for BiFactor MTL, enabling us to use conjugate gradient descent (CG) to learn our unknown factors, whose complexity depends on the condition number of the matrix (B^⊤ ⊗ A) + (D^⊤ ⊗ C). To optimize F, G, and S, we iteratively run conjugate gradient descent for each factor, while fixing the other unknowns, until a convergence tolerance is met. In addition, CG can exploit the solution from the previous iteration, the low-rank structure in the equation, and the fact that the matrix-vector products can be computed relatively efficiently. In our experiments, we find that our algorithm converges quickly, i.e., in a few iterations.
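To make the Kronecker reformulation concrete, the sketch below solves a small dense instance of AQB^⊤ + CQD^⊤ = E by vectorization, with A and B mirroring the Gram-matrix structure of Equation 8 (the dimensions are arbitrary; at realistic scale one would use a matrix-free conjugate gradient solver instead of forming the Kronecker products explicitly).

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 5, 3
Xs = rng.standard_normal((20, p))
A = Xs.T @ Xs                      # like X^T X in Eq. (8): symmetric PSD
Gs = rng.standard_normal((7, q))
B = Gs.T @ Gs                      # like G^T G in Eq. (8): symmetric PSD
C = 0.5 * np.eye(p)                # like lambda * Sigma^{-1} (here a scaled identity)
D = np.eye(q)

Q_true = rng.standard_normal((p, q))
E = A @ Q_true @ B.T + C @ Q_true @ D.T

# (B^T kron A + D^T kron C) vec(Q) = vec(E), with column-major vec
M = np.kron(B.T, A) + np.kron(D.T, C)
Q = np.linalg.solve(M, E.flatten(order="F")).reshape((p, q), order="F")
print(np.allclose(A @ Q @ B.T + C @ Q @ D.T, E))
```

Because A, B are PSD and C is positive definite here, the Kronecker system is symmetric positive definite, which is exactly the setting in which CG applies.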
4. Experiments
In this section, we report experiments on both synthetic datasets and three real-world datasets to evaluate the effectiveness of our proposed MTL methods. We compare both our models with several state-of-the-art baselines discussed in Section 2. We include results for Shared Multitask Learning (SHAMO) (Crammer & Mansour, 2012), which uses a K-means-like procedure that simultaneously clusters different tasks using a small pool of m ≪ T shared models. Following (Barzilai & Crammer, 2015), we use a gradient-projection algorithm to optimize the dual of the objective function (Equation 6). In addition, we compare our results with Single-task learning (STL), which learns a single model by pooling together the data from all the tasks, and Independent task learning (ITL), which learns each task independently.

The parameters for the proposed formulations and the state-of-the-art baselines are chosen by cross-validation. We fix the value of λ1 in order to reduce the search space and choose the value of λ2 from a logarithmic search grid. The values of K, K1, and K2 are chosen from a small set of candidate values. We evaluate the models using Root Mean Squared Error (RMSE) for the regression tasks and the F-measure for the classification tasks. For our experiments, we consider the squared error loss for each task. We repeat all our experiments over several random runs to compensate for statistical variability. The best model and the statistically competitive models (by paired t-test with α = 0.05) are shown in boldface.

We evaluate our models on five synthetic datasets based on the assumptions considered in both the baselines and the proposed methods. We generate examples from X_t ∼ N(0, I_P) with P = 20 for each task t. All the datasets consist of the same number of tasks, with an equal number of training examples per task. Each task is constructed using Y_t = X_t W_t + ε_t, with i.i.d. Gaussian noise ε_t. The task parameters for each synthetic dataset are generated as follows:

1. syn1 consists of groups of tasks with no overlap between groups. We generate K = 15 latent features from F ∼ N(0, 1), and each W_t is constructed by linearly combining the latent features of its group from F.

2. syn2 is generated with overlapping groups of tasks. As before, we generate K = 15 latent features from F ∼ N(0, 1), but the tasks in each of the three groups are constructed from partially overlapping subsets of the latent features.

3. syn3 simulates the BiFactorMTL assumption. We randomly generate a task covariance matrix Ω and a feature covariance matrix Σ. We sample F ∼ N(0, Σ) and G ∼ N(0, Ω) and compute W = FG^⊤.

4. syn4 simulates the TriFactorMTL assumption. We randomly generate a task covariance matrix Ω and a feature covariance matrix Σ. We sample F ∼ N(0, Σ), G ∼ N(0, Ω), and S ∼ U(0, 1), and compute the task weight matrix as W = FSG^⊤.

5. syn5 simulates an experiment with the task weight matrix drawn from a matrix normal distribution (Zhang & Schneider, 2010). We randomly generate a task covariance matrix Ω and a feature covariance matrix Σ and sample vec(W) ∼ N(0, Σ ⊗ Ω).

Table 1. Performance results (RMSE) on the synthetic datasets (syn1-syn5). The table reports the mean and standard errors over random runs; the best model and the statistically competitive models (by paired t-test with α = 0.05) are shown in boldface. Models compared: STL, ITL, SHAMO, MTFL, GO-MTL, BiFactorMTL, and TriFactorMTL.

We compare the proposed methods BiFactorMTL and TriFactorMTL against the baselines. We can see in Table 1 that BiFactorMTL and TriFactorMTL outperform all the baselines on all the synthetic datasets. STL performs the worst, since it naively combines the data from all the tasks. SHAMO performs better than STL but worse than ITL, which shows that learning these tasks separately is more beneficial than combining them to learn a few shared models.

As mentioned earlier, since MTFL is similar to FMTL in Equation 2, we can see how the results of BiFactorMTL improve when it learns both the task relationship matrix and the feature relationship matrix. Note that the syn1 and syn2 datasets are based on the assumptions of GO-MTL; hence, it performs better than the other baselines on them. BiFactorMTL and TriFactorMTL are equally competitive with GO-MTL, which shows that our proposed methods can easily adapt to these assumptions. The synthetic datasets syn3, syn4, and syn5 are generated with both a task covariance matrix and a feature covariance matrix. Since BiFactorMTL and TriFactorMTL learn the task and feature relationship matrices along with the task weight parameters, they perform significantly better than the other baselines on these datasets.
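The syn3 construction can be sketched as follows (a minimal NumPy illustration; the covariance construction, noise scale, and dimensions are our own arbitrary choices, since the exact values are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(4)
P, T, K, N = 20, 10, 5, 30

# Random PSD feature covariance Sigma (P x P) and task covariance Omega (T x T)
A = rng.standard_normal((P, P))
Sigma = A @ A.T / P + 0.1 * np.eye(P)
B = rng.standard_normal((T, T))
Omega = B @ B.T / T + 0.1 * np.eye(T)

# F ~ N(0, Sigma) and G ~ N(0, Omega) column-wise; W = F G^T (BiFactor assumption)
F = np.linalg.cholesky(Sigma) @ rng.standard_normal((P, K))
G = np.linalg.cholesky(Omega) @ rng.standard_normal((T, K))
W = F @ G.T                                   # task weight matrix (P x T)

# Per-task data: X_t ~ N(0, I_P), Y_t = X_t W_t + Gaussian noise
X = [rng.standard_normal((N, P)) for _ in range(T)]
Y = [X[t] @ W[:, t] + 0.1 * rng.standard_normal(N) for t in range(T)]
print(W.shape, len(X), len(Y))
```

Replacing the last matrix product with W = F @ S @ G.T, with S drawn uniformly, would give the analogous syn4 construction.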
We evaluate the proposed methods on examination score prediction data, a benchmark dataset in multitask regression reported in several previous articles (Argyriou et al., 2008; Kumar & Daume, 2012; Zhang & Yeung, 2014). The school dataset consists of examination scores of 15,362 students from 139 schools in London. Each school is considered a task, and we need to predict the exam scores for students from these schools. The feature set includes the year of the examination, four school-specific and three student-specific attributes. We replace each categorical attribute with one binary variable for each possible attribute value, as suggested in (Argyriou et al., 2008). This results in 27 attributes, with an additional attribute to account for the bias term. (http://ttic.uchicago.edu/~argyriou/code/mtl_feat/school_splits.tar)

Clearly, the dataset has school- and student-specific feature clusters that can help in learning the shared feature representation better than the other factored baselines. In addition, there must be several task clusters in the data to account for the differences among the schools. The training and test sets are obtained by dividing the examples of each task into many small datasets, varying the number of training examples per task, in order to evaluate the proposed methods on many tasks with limited numbers of examples.

Table 2. Performance results (RMSE) on the school dataset. The table reports the mean and standard errors over random runs, for three training-set sizes, comparing STL, ITL, SHAMO, MTFL, GO-MTL, BiFactorMTL, and TriFactorMTL.

Table 2 shows the experimental results for the school data. All the factorized MTL methods outperform STL and ITL. We can see that both TriFactorMTL and BiFactorMTL significantly outperform the other baselines. It is interesting to see that TriFactorMTL performs considerably well even when the tasks have limited numbers of examples. When there is more training data, the advantage of TriFactorMTL over the strongest baseline, GO-MTL, is reduced.
We follow the experimental setup in (Crammer & Mansour, 2012; Barzilai & Crammer, 2015) and evaluate our algorithm on product reviews from Amazon. The dataset contains product reviews from domains such as books, dvd, etc. We consider each domain as a binary classification task. The reviews are stemmed, and stopwords are removed from the review text. We represent each review as a bag of unigrams/bigrams with TF-IDF scores. We choose a fixed number of reviews from each domain, each associated with a rating from {1, 2, 4, 5}. Reviews with rating 3 are not included in this experiment, as such sentiments are ambiguous and therefore cannot be reliably predicted.

We ran several experiments on this dataset to test the importance of learning a shared feature representation and the co-clustering of tasks and features. In Experiment I, we construct classification tasks with reviews labeled positive (+) when the rating is above 3 and negative (−) when the rating is below 3. We use a subset of the examples from each task for training and the remainder as the test set. Since all the tasks are essentially the same, STL performs better than all the other models by combining the data from all the tasks. The results for our proposed methods BiFactorMTL and TriFactorMTL are comparable to that of STL. See the supplementary material for the results of Experiment I.

For Experiment II, we split each domain into two equal sets, from which we create two prediction tasks based on two different rating thresholds. Obviously, combining all the tasks together will not help in this setting. Experiments III and IV are similar to Experiment II, except that each task is further divided into sub-tasks. Experiment V splits each domain into three equal sets to construct three prediction tasks based on three different rating thresholds. This setting captures reviews with different levels of sentiment. As before, we build the datasets for Experiments VI and VII by further dividing the three prediction tasks from Experiment V into sub-tasks.

The results from our experiments are reported in Table 3. The top rows of the table show the number of tasks in each experiment and the number of training examples per task. The general trend is that the factorized models perform significantly better than the other baselines. Since MTFL, BiFactorMTL, and TriFactorMTL learn the feature relationship matrix Σ in addition to the task parameters, they achieve better results than CMTL, which considers only the task clusters.

We notice that as we increase the number of tasks, the gap between the performance of TriFactorMTL and that of BiFactorMTL (and GO-MTL) widens, since the assumption that the number of feature clusters and the number of task clusters should both be K is clearly violated. TriFactorMTL, on the other hand, learns with different numbers of feature and task clusters (K1, K2) and hence achieves better performance than all the other methods considered in these experiments.

Table 3. Performance results (F-measure) for the various experiments on sentiment detection. The table reports the mean and standard errors over random runs.

    Data        II    III   IV    V     VI    VII
    Tasks       28    56    84    42    86    126
    Train Size  120   60    40    80    40    26

Models compared: STL, ITL, SHAMO, CMTL, MTFL, GO-MTL, BiFactor, and TriFactor.
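The threshold-based task construction used in Experiments II-VII can be sketched as follows (illustrative only; the ratings are simulated and the thresholds are hypothetical, not the actual experimental splits).

```python
import numpy as np

rng = np.random.default_rng(5)
ratings = rng.choice([1, 2, 4, 5], size=200)   # rating 3 excluded, as in the setup

def make_task(ratings, threshold):
    # binary labels: +1 if rating > threshold, -1 otherwise
    return np.where(ratings > threshold, 1, -1)

# Split one domain into two halves and apply a different (hypothetical) threshold
# to each half, yielding two related but distinct binary prediction tasks.
half = len(ratings) // 2
task_a = make_task(ratings[:half], threshold=4)   # e.g. "is the rating 5?"
task_b = make_task(ratings[half:], threshold=1)   # e.g. "is the rating above 1?"
print(task_a.shape, task_b.shape)
```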
Table 4. Performance results (F-measure) on the 20 Newsgroups dataset (Tasks 1-5). The table reports the mean and standard errors over random runs. Models compared: GO-MTL, BiFactorMTL, and TriFactorMTL.
Finally, we evaluate our proposed models on the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) for transfer learning. The dataset contains postings from 20 Usenet newsgroups. As before, the postings are stemmed and the stopwords are removed from the text. We represent each posting as a bag of unigrams/bigrams with TF-IDF scores. We construct tasks from the postings of the newsgroups: we randomly select a pair of newsgroup classes to build each one-vs-one classification task. We follow the hold-out experiment suggested by Raina et al. (2006) for the transfer learning setup. For each task (the target task), we learn F (F and S in the case of TriFactorMTL) from the remaining tasks (the source tasks). With F (F and S) known from the source tasks, we select a portion of the data from the target task to learn G for the target. This experiment shows how well the latent feature representation learned from the source tasks in a K-dimensional subspace (a K1-dimensional subspace for TriFactorMTL) adapts to the new task. We evaluate our results on the remaining data from the target task. We select GO-MTL as the baseline for comparison; since CMTL does not explicitly learn F, we did not include it in this experiment.

Table 4 shows the results for this experiment. We report the first five tasks here; see the supplementary material for the performance results of all the tasks. GO-MTL and BiFactorMTL perform almost the same, since both learn the latent feature representation in a K-dimensional space. As is evident from the table, TriFactorMTL outperforms both GO-MTL and BiFactorMTL, which shows that learning both factors F and S improves information transfer from the source tasks to the target task.
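The adaptation step above can be sketched concretely. The following is a minimal illustration, assuming the simplest form of the model (a target parameter vector w = F g with squared loss and a ridge penalty); the function name, data shapes, and the regularizer `lam` are illustrative, and the paper's actual objective also includes sparsity terms:

```python
import numpy as np

def adapt_to_target(F, X, y, lam=0.1):
    """Fit target-task weights g with the latent factors F held fixed.

    Model: w_target = F @ g, so predictions are X @ F @ g.
    Solving min_g ||X F g - y||^2 + lam ||g||^2 gives a closed form.
    """
    Z = X @ F                      # project inputs into the K-dim subspace
    K = Z.shape[1]
    g = np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ y)
    return F @ g                   # recovered target-task parameter vector

# Toy check: with enough target data, we recover a w lying in span(F).
rng = np.random.default_rng(0)
d, K, n = 20, 4, 200
F = rng.standard_normal((d, K))
g_true = rng.standard_normal(K)
X = rng.standard_normal((n, d))
y = X @ F @ g_true
w = adapt_to_target(F, X, y, lam=1e-6)
print(np.allclose(w, F @ g_true, atol=1e-3))  # → True
```

Because F is fixed, only the K-dimensional vector g is estimated per target task, which is why a small fraction of target data suffices.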
5. Conclusions

In this paper, we proposed a novel framework for multitask learning that factors the task parameters into a shared feature representation and a task structure to learn from multiple related tasks. We formulated two approaches, motivated by recent work in multitask latent feature learning. The first (BiFactorMTL) decomposes the task parameters W into two low-rank matrices: a latent feature representation F and a task structure G. As this approach is restrictive on the number of clusters in the latent feature and task spaces, we proposed a second method (TriFactorMTL), which introduces an additional degree of freedom to permit different clusterings in each. We developed a highly scalable and efficient learning algorithm using conjugate gradient descent and generalized Sylvester equations. Extensive empirical analysis on both synthetic and real datasets shows that TriFactorMTL outperforms the other state-of-the-art multitask baselines, demonstrating the effectiveness of the proposed approach.
References

Amit, Yonatan, Fink, Michael, Srebro, Nathan, and Ullman, Shimon. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pp. 17–24. ACM, 2007.

Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

Barzilai, Aviad and Crammer, Koby. Convex multi-task learning by clustering. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS-15), 2015.

Crammer, Koby and Mansour, Yishay. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems, pp. 1475–1483, 2012.

Ding, Chris and He, Xiaofeng. K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 29. ACM, 2004.

Ding, Chris, Li, Tao, Peng, Wei, and Park, Haesun. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 126–135. ACM, 2006.

Dinuzzo, Francesco, Ong, Cheng S, Pillonetto, Gianluigi, and Gehler, Peter V. Learning output kernels with block coordinate descent. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 49–56, 2011.

Gu, Quanquan and Zhou, Jie. Co-clustering on manifolds. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 359–368. ACM, 2009.

Jalali, Ali, Sanghavi, Sujay, Ruan, Chao, and Ravikumar, Pradeep K. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems, pp. 964–972, 2010.

Kumar, Abhishek and Daume, Hal. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1383–1390, 2012.

Li, Tao and Ding, Chris. The relationships among various nonnegative matrix factorization methods for clustering. In Sixth International Conference on Data Mining (ICDM'06), pp. 362–371. IEEE, 2006.

Liu, Jun, Ji, Shuiwang, and Ye, Jieping. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 339–348. AUAI Press, 2009.

Maurer, Andreas, Pontil, Massimiliano, and Romera-Paredes, Bernardino. Sparse coding for multitask and transfer learning. In ICML (2), pp. 343–351, 2013.

Raina, Rajat, Ng, Andrew Y, and Koller, Daphne. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 713–720. ACM, 2006.

Rothman, Adam J, Levina, Elizaveta, and Zhu, Ji. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.

Sindhwani, Vikas, Hu, Jianying, and Mojsilovic, Aleksandra. Regularized co-clustering with dual supervision. In Advances in Neural Information Processing Systems, pp. 1505–1512, 2009.

Sindhwani, Vikas, Minh, Ha Quang, and Lozano, Aurélie C. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and granger causality. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 586–595. AUAI Press, 2013.

Swirszcz, Grzegorz and Lozano, Aurelie C. Multi-level lasso for sparse multi-task regression. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 361–368, 2012.

Wang, Hua, Nie, Feiping, Huang, Heng, and Makedon, Fillia. Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp. 1553, 2011.

Weinberger, Kilian, Dasgupta, Anirban, Langford, John, Smola, Alex, and Attenberg, Josh. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1113–1120. ACM, 2009.

Xu, Linli, Huang, Aiqing, Chen, Jianhui, and Chen, Enhong. Exploiting task-feature co-clusters in multi-task learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 1931–1937. AAAI Press, 2015.

Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. Multi-task learning for classification with dirichlet process priors. The Journal of Machine Learning Research, 8:35–63, 2007.

Zhang, Jian, Ghahramani, Zoubin, and Yang, Yiming. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems, pp. 1585–1592, 2005.

Zhang, Yi and Schneider, Jeff G. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems, pp. 2550–2558, 2010.

Zhang, Yu and Yeung, Dit-Yan. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.

Zhong, Wenliang and Kwok, James T. Convex multi-task learning with flexible task clusters. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 49–56, 2012.
Sensitivity Analysis

Figure 1 shows the hyper-parameter sensitivity analysis for GO-MTL, BiFactorMTL and TriFactorMTL. As before, one regularization parameter is held fixed while the others are varied. GO-MTL and BiFactorMTL have two hyper-parameters (λ and K) to tune, and TriFactorMTL has three (λ, K1 and K2). We can see from the plots that our proposed models yield stable results even as K, K1 and K2 change. On the other hand, the GO-MTL results are sensitive to the value of λ, the regularization parameter for the sparse penalty on G.

Additional Results
Tables 5 and 6 show the complete experimental results for the sentiment analysis and transfer learning experiments.

List of one-vs-one classification tasks used in Table 6:
(Task 1) comp.windows.x vs comp.os.ms-windows.misc
(Task 2) soc.religion.christian vs rec.sport.hockey
(Task 3) misc.forsale vs talk.politics.guns
(Task 4) sci.med vs rec.autos
(Task 5) comp.sys.mac.hardware vs talk.politics.misc
(Task 6) sci.space vs alt.atheism
(Task 7) comp.graphics vs comp.sys.ibm.pc.hardware
(Task 8) talk.politics.mideast vs sci.electronics
(Task 9) rec.motorcycles vs talk.religion.misc
(Task 10) rec.sport.baseball vs sci.crypt

Figure 1. Top: Sensitivity analysis for the regularization parameter λ with K = 2 (left), and for the numbers of clusters K1 and K2 with the regularization parameters held fixed (right), calculated on the syn5 dataset (RMSE). Middle: Sensitivity analysis for the school dataset (RMSE). Bottom: Sensitivity analysis for sentiment detection (F-measure).

Table 5.
Performance results (F-measure) for various experiments on sentiment detection. The table reports the mean and standard errors over random runs.

Data                  I      II     III    IV     V      VI     VII
Tasks                 14     28     56     84     42     86     126
Thresholds (Splits)
Train Size            240    120    60     40     80     40     26
STL
ITL
SHAMO
CMTL
MTFL
BiFactorMTL    0.722 (0.006)   0.611 (0.018)   0.561 (0.013)
TriFactorMTL   0.733 (0.006)
Table 6. Performance results (F-measure) on 20 Newsgroups dataset. The table reports the mean and standard errors over random runs. The best model and the statistically competitive models (by paired t-test at significance level α) are shown in boldface.

Models          Task 1   Task 2   Task 3   Task 4   Task 5   Task 6   Task 7   Task 8   Task 9   Task 10
GO-MTL          (0.09)   (0.06)   (0.04)   (0.06)   (0.03)   (0.02)   (0.02)   (0.01)   (0.00)   (0.05)
BiFactorMTL     (0.09)   (0.05)   (0.04)   (0.03)   (0.01)   (0.02)   (0.02)   (0.02)   (0.00)   (0.04)
TriFactorMTL    (0.03)   (0.02)   (0.02)   (0.02)   (0.02)   (0.01)   (0.02)   (0.01)   (0.03)   0.62
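The boldface rule in Tables 4 and 6 rests on a paired t-test over the random runs. A minimal sketch of that comparison, with illustrative per-run F-measures (not the paper's actual numbers) and an assumed significance level:

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative per-run F-measures for two models evaluated
# on the same sequence of random splits.
runs_a = np.array([0.61, 0.63, 0.60, 0.62, 0.64])
runs_b = np.array([0.55, 0.58, 0.54, 0.57, 0.56])

# Paired test: the models are compared run-by-run on identical
# splits, so the per-run differences are what gets tested.
t_stat, p_value = ttest_rel(runs_a, runs_b)

alpha = 0.05  # assumed significance level for this sketch
# If p >= alpha, the runner-up counts as "statistically
# competitive" and would also appear in boldface.
print(p_value < alpha)  # → True
```

Pairing on identical splits removes split-to-split variance from the comparison, which is why it is preferred over an unpaired test for this kind of table.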