Co-Clustering for Multitask Learning
Keerthiram Murugesan, Jaime Carbonell, Yiming Yang

Abstract
This paper presents a new multitask learning framework that learns a shared representation among the tasks, incorporating both task and feature clusters. The jointly-induced clusters yield a shared latent subspace where task relationships are learned more effectively and more generally than in state-of-the-art multitask learning methods. The proposed general framework enables the derivation of more specific or restricted state-of-the-art multitask methods. The paper also proposes a highly scalable multitask learning algorithm, based on the new framework, using conjugate gradient descent and generalized Sylvester equations. Experimental results on synthetic and benchmark datasets show that the proposed method systematically outperforms several state-of-the-art multitask learning methods.
1. Introduction
Multitask learning leverages shared structures among the tasks to jointly build a better model for each task. Most existing work in multitask learning focuses on how to take advantage of task similarities, either by learning the relationship between the tasks via cross-task regularization techniques (Zhang & Yeung, 2014; Zhang & Schneider, 2010; Rothman et al., 2010; Xue et al., 2007) or by learning a shared feature representation across all the tasks, leveraging low-dimensional subspaces in the feature space (Argyriou et al., 2008; Jalali et al., 2010; Liu et al., 2009; Swirszcz & Lozano, 2012). Learning task relationships has been shown beneficial in (positive and negative) transfer of knowledge from information-rich tasks to information-poor tasks (Zhang & Yeung, 2014), whereas the shared feature representation has been shown to perform well when each task has a limited number of training instances (observations) compared to the total number across all tasks (Argyriou et al., 2008). Existing research in multitask learning considers either the first approach and learns a task relationship matrix in addition to the task parameters, or relies on the latter approach and learns a shared latent feature representation from the task parameters. To the best of our knowledge, there is no prior work that utilizes both principles jointly for multitask learning.

School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. {kmuruges,jgc,yiming}@cs.cmu.edu. Correspondence to: Keerthiram Murugesan <[email protected]>. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017. JMLR: W&CP. Copyright 2017 by the author(s).
In this paper, we propose a new approach that jointly learns a shared feature representation along with the task relationship matrix, combining the advantages of both principles into a general multitask learning framework.

Early work on latent shared representation includes (Zhang et al., 2005), which proposes a model based on Independent Component Analysis (ICA) for learning multiple related tasks; the task parameters are assumed to be generated from independent sources. (Argyriou et al., 2008) consider sparse representations common across many learning tasks. Similar in spirit to PCA for unsupervised tasks, their approach learns a low-dimensional representation of the observations (Ding & He, 2004). More recently, (Kumar & Daume, 2012) assume that relationships among tasks are sparse, enforcing that each observed task is obtained from only a few of the latent features, and from there learn the overlapping group structure among the tasks. (Crammer & Mansour, 2012) propose a K-means-like procedure that simultaneously clusters different tasks and learns a small pool of m ≪ T shared models. Specifically, each task is free to choose a model from the pool that better classifies its own data, and each model is learned by pooling together all the training data that belongs to the same cluster. (Barzilai & Crammer, 2015) propose a similar approach that clusters the T tasks into K task-clusters with hard assignments.

These methods compute the factorization of the task weight matrix to learn the shared feature representation and the task structure. This matrix factorization induces the simultaneous clustering of both the tasks and the features in the K-dimensional latent subspace (Li & Ding, 2006). One of the major disadvantages of this assumption is that it restricts the model to define both the tasks and the features to have the same number of clusters.
Consider, for example, sentiment analysis, where each task belongs to a certain domain or a product category such as books, automobiles, etc., and each feature is simply a word from the vocabulary of the product reviews. Clearly, assuming that the features and the tasks have the same number of clusters is unjustified: the number of feature clusters is typically larger than the number of task clusters, but the latter grows faster than the former as new products are introduced. Such a restrictive assumption may (and often does) hurt the performance of the model.

Unlike previous work, our proposed approach provides a flexible way to cluster both the tasks and the features. We introduce an additional degree of freedom that allows the number of task clusters to differ from the number of feature clusters (Ding et al., 2006; Wang et al., 2011). In addition, our proposed models learn both the task relationship matrix and the feature relationship matrix along with the co-clustering of both the tasks and the features (Gu & Zhou, 2009; Sindhwani et al., 2009). Our proposed approach is closely related to Output Kernel Learning (OKL), where one learns the kernel between the components of the output vector for problems such as multi-output learning and multitask learning (Dinuzzo et al., 2011; Sindhwani et al., 2013). The key disadvantage of OKL is that it requires the computation of a kernel matrix between every pair of instances from all the tasks. This results in a scalability constraint, especially when the number of tasks/features is large (Weinberger et al., 2009). Our proposed models achieve a similar effect by learning a shared feature representation common across the tasks.

A key challenge in factoring with the extra degree of freedom is optimizing the resulting objective function. Previous work on co-clustering for multitask learning requires strong assumptions on the task parameters (Zhong & Kwok, 2012) or is not scalable to large-scale applications (Xu et al., 2015). We propose an efficient algorithm that scales well to large-scale multitask learning and utilizes the structure of the objective function to learn the factorized task parameters. We formulate the learning of the latent variables in terms of a generalized Sylvester equation, which can be efficiently solved using the conjugate gradient descent algorithm. We start from the mathematical background and motivate our approach in Section 2. We then introduce our proposed models and their learning procedures in Section 3. Section 4 reports the empirical analysis of our proposed models and shows that learning both the task clusters and the feature clusters along with the task parameters gives significant improvements over the state-of-the-art baselines in multitask learning.
2. Background
Suppose we have T tasks and D_t = {X_t, Y_t} = {(x_ti, y_ti) : i = 1, 2, ..., N_t} is the training set for each task t ∈ {1, 2, ..., T}. Let W_t represent the weight vector for the task indexed by t. These task weight vectors are stacked as the columns of a matrix W of size P × T, with P being the feature dimension. Traditional multitask learning imposes additional assumptions on W, such as low rank, the ℓ1 norm, the ℓ2,1 norm, etc., to leverage the shared characteristics among the tasks. In this paper, we consider a similar assumption based on the factorization of the task weight matrix W. In factored models, we decompose the weight matrix W as FG^⊤, where F can be interpreted as a feature cluster matrix of size P × K with K feature clusters and, similarly, G as a task cluster matrix of size T × K with K task clusters. If we consider squared error losses for all the tasks, then the objective function for learning F and G can be given as follows:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}, F ∈ Γ_F, G ∈ Γ_G}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + P_λ1(F) + P_λ2(G)    (1)

In the above objective function, the latent feature representation is captured by the matrix F and the grouping structure on the tasks is determined by the matrix G. The predictor W_t for task t can then be computed as F G_t^⊤, where G_t is the t-th row of the matrix G. P_λ1(F) is a regularization term that penalizes the unknown matrix F with regularization parameter λ1; similarly, P_λ2(G) penalizes the unknown matrix G with regularization parameter λ2. Γ_F and Γ_G are their corresponding constraint spaces. Without these additional constraints on F and G, the objective function reduces to solving each task independently, since any task weight matrix W can then be attained by the product of F and G.

Several assumptions can be enforced on these unknown factors F and G.
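To make the factored model concrete, the following minimal NumPy sketch (illustrative only; all dimensions and data are made up) builds W = FG^⊤ and evaluates the unregularized data-fit term of Equation 1.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T, K, N = 20, 5, 3, 30          # features, tasks, clusters, examples per task

F = rng.standard_normal((P, K))    # feature cluster matrix (P x K)
G = rng.standard_normal((T, K))    # task cluster matrix (T x K)
W = F @ G.T                        # stacked task weights (P x T); W_t = F G_t^T

X = [rng.standard_normal((N, P)) for _ in range(T)]
Y = [X[t] @ W[:, t] + 0.1 * rng.standard_normal(N) for t in range(T)]

# Data-fit term of Equation (1): sum_t ||Y_t - X_t F G_t^T||^2
loss = sum(float(np.sum((Y[t] - X[t] @ F @ G[t]) ** 2)) for t in range(T))
print(loss)
```

Note that the per-task predictor is recovered as W_t = F G_t^⊤, i.e., column t of W equals F times row t of G.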
Below we discuss some previous models that make well-known assumptions on F and G and can be written in terms of the above objective function.

(1) Factored Multitask Learning (FMTL) (Amit et al., 2007) considers a squared Frobenius norm on both F and G:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ1 ||F||_F² + λ2 ||G||_F²    (2)

It can be shown that the above problem can equivalently be written as multitask learning with a trace norm constraint on the task weight matrix W.

(2) Multitask Feature Learning (MTFL) (Argyriou et al., 2008) assumes that the matrix G learns sparse representations common across many tasks. Similar in spirit to PCA for unsupervised tasks, MTFL learns a low-dimensional representation of the observations X_t for each task, using F such that FF^⊤ = I_P:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}, FF^⊤ = I_P}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ ||G||²_{2,1}    (3)

where K is usually set to P. It considers an ℓ2,1 norm on G to force all the tasks to have a similar sparsity pattern, so that the tasks select the same latent features (columns of F). It is worth noting that Equation 3 can be equivalently written as follows:

    argmin_{W ∈ R^{P×T}, Σ ⪰ 0}  ∑_{t ∈ [T]} ||Y_t − X_t W_t||² + λ tr(W^⊤ Σ^{−1} W)    (4)

which can then be rewritten as multitask learning with a trace norm constraint on the task weight matrix W, as before.

(3) Group Overlap MTL (GO-MTL) (Kumar & Daume, 2012) assumes that the matrix G is sparse, enforcing that each observed task is obtained from only a few of the latent features, indexed by the non-zero pattern of the corresponding rows of the matrix G:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ1 ||F||_F² + λ2 ||G||_1    (5)

The above objective function can be compared to dictionary learning, where each column of F is considered a dictionary atom and each row of G its corresponding sparse code (Maurer et al., 2013).
(4) Multitask Learning by Clustering (CMTL) (Barzilai & Crammer, 2015) assumes that the T tasks can be clustered into K task-clusters with hard assignment. For example, if the k-th element of G_t is one and all other elements of G_t are zero, we say that task t is associated with cluster k:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}, G_t ∈ {0,1}^K, ||G_t||_1 = 1 ∀t ∈ [T]}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ ||F||_F²    (6)

The constraints G_t ∈ {0,1}^K and ||G_t||_1 = 1 ensure that G is a proper clustering matrix. Since the above problem is computationally expensive, as it involves solving a combinatorial problem, the constraint on G is relaxed to G_t ∈ [0,1]^K.

These four methods require the number of task clusters to be the same as the number of feature clusters, which, as mentioned earlier, is a restrictive assumption that may, and often does, hurt performance. In addition, these methods do not leverage the inherent relationship between the features (via F) or the relationship between the tasks (via G). Note that these objective functions are bi-convex problems: the optimization is convex in F when fixing G, and vice versa. We cannot guarantee a globally optimal solution, but one can show that the alternating algorithm reaches a locally optimal solution in a fixed number of iterations.
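The bi-convex structure noted above is typically exploited by alternating minimization. The sketch below is a simplified illustration (not the authors' code): it runs alternating gradient steps on the FMTL objective of Equation 2 on made-up data, with arbitrary dimensions, step size, and iteration count, and checks that the objective decreases.

```python
import numpy as np

rng = np.random.default_rng(1)
P, T, K, N = 10, 4, 2, 50
lam1 = lam2 = 0.1
X = [rng.standard_normal((N, P)) for _ in range(T)]
W_true = rng.standard_normal((P, T))
Y = [X[t] @ W_true[:, t] for t in range(T)]

def objective(F, G):
    # Equation (2): squared losses plus Frobenius penalties on both factors
    fit = sum(np.sum((Y[t] - X[t] @ F @ G[t]) ** 2) for t in range(T))
    return fit + lam1 * np.sum(F ** 2) + lam2 * np.sum(G ** 2)

F = 0.01 * rng.standard_normal((P, K))
G = 0.01 * rng.standard_normal((T, K))
lr = 1e-3
obj_start = objective(F, G)
for _ in range(50):
    # gradient step in F with G fixed (a convex subproblem)
    gF = sum(-2.0 * X[t].T @ np.outer(Y[t] - X[t] @ F @ G[t], G[t])
             for t in range(T)) + 2.0 * lam1 * F
    F = F - lr * gF
    # gradient step in each row of G with F fixed (a convex subproblem)
    for t in range(T):
        r = Y[t] - X[t] @ F @ G[t]
        G[t] = G[t] + lr * (2.0 * (X[t] @ F).T @ r - 2.0 * lam2 * G[t])
obj_end = objective(F, G)
print(obj_start, obj_end)
```

In the paper each subproblem is instead solved to optimality via a linear (Sylvester-type) equation; the gradient steps here merely illustrate the alternating structure.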
3. Proposed Approach
Existing models do not take into consideration both the relationship between the tasks and the relationship between the features. Here we consider a more general formulation in which, in addition to estimating the parameters F and G, we learn the task relationship matrix Ω and the feature relationship matrix Σ. We call this framework BiFactor multitask learning, following the factorization of the task parameters W into two low-rank matrices F and G:

    argmin_{F ∈ R^{P×K}, G ∈ R^{T×K}, Σ ⪰ 0, Ω ⪰ 0}  ∑_{t ∈ [T]} ||Y_t − X_t F G_t^⊤||² + λ1 tr(F^⊤ Σ^{−1} F) + λ2 tr(G^⊤ Ω^{−1} G)    (7)

In the above objective function, we choose P_λ1(F) and P_λ2(G) so as to learn the feature relationship matrix Σ and the task relationship matrix Ω. The motivation for these regularization terms is based on (Argyriou et al., 2008; Zhang & Yeung, 2014), which separately considered either the task relationship matrix Ω or the feature relationship matrix Σ. Note that the value of K is typically set to a value less than min(P, T).

It is easy to see that by setting G to I_T, our objective function reduces to multitask feature learning (MTFL), discussed in the previous section. Similarly, by setting F to I_P, our objective function reduces to multitask relationship learning (MTRL) (Zhang & Yeung, 2014). If we set Ω = I_T and Σ = I_P, we obtain the factored multitask learning setting (FMTL) defined in Equation 2. Hence the prior art can be cast as special cases of our more general formulation by imposing certain limiting restrictions.
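As an illustration of the BiFactor objective, the sketch below (a toy NumPy example; the dimensions, data, and the small stabilizing ridge are our own choices) evaluates Equation 7 and applies the trace-normalized matrix square-root updates for Σ and Ω used in the alternating scheme (Zhang & Yeung, 2014).

```python
import numpy as np

rng = np.random.default_rng(7)
P, T, K, N = 8, 4, 3, 25
lam1 = lam2 = 0.1

X = [rng.standard_normal((N, P)) for _ in range(T)]
Y = [rng.standard_normal(N) for _ in range(T)]
F = rng.standard_normal((P, K))
G = rng.standard_normal((T, K))
Sigma = np.eye(P)   # feature relationship matrix
Omega = np.eye(T)   # task relationship matrix

def bifactor_objective(F, G, Sigma, Omega):
    # Equation (7): squared losses plus trace regularizers
    fit = sum(np.sum((Y[t] - X[t] @ F @ G[t]) ** 2) for t in range(T))
    reg = lam1 * np.trace(F.T @ np.linalg.solve(Sigma, F)) \
        + lam2 * np.trace(G.T @ np.linalg.solve(Omega, G))
    return fit + reg

def psd_sqrt_unit_trace(S):
    # closed-form relationship-matrix update: S^{1/2} / tr(S^{1/2})
    w, V = np.linalg.eigh(S)
    R = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return R / np.trace(R)

obj = bifactor_objective(F, G, Sigma, Omega)
Sigma = psd_sqrt_unit_trace(F @ F.T + 1e-8 * np.eye(P))   # Sigma update from FF^T
Omega = psd_sqrt_unit_trace(G @ G.T + 1e-8 * np.eye(T))   # Omega update from GG^T
print(obj, np.trace(Sigma), np.trace(Omega))
```

Both updated matrices are symmetric positive semidefinite with unit trace, matching the normalization in the closed-form solutions below.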
BiFactor MTL

We propose an efficient learning algorithm for solving the BiFactor MTL objective function. Consider an alternating minimization algorithm, where we learn the shared representation F while fixing the task structure G, and learn the task structure G while fixing the shared representation F. We repeat these steps until we converge to a locally optimal solution.

Optimizing w.r.t. F gives a generalized Sylvester equation of the form AQB^⊤ + CQD^⊤ = E for the unknown Q. We show later in this section how to solve these linear equations efficiently. From the objective function, we have:

    ∑_t (X_t^⊤ X_t) F (G_t^⊤ G_t) + λ1 Σ^{−1} F = ∑_t X_t^⊤ Y_t G_t    (8)

Optimizing w.r.t. G for the squared error loss results in a similar linear equation:

    (F^⊤ X_t^⊤ X_t F) G_t + λ2 Ω^{−1} G = F^⊤ X_t^⊤ Y_t,  for all t ∈ [T]    (9)

Optimizing w.r.t. Ω and Σ: the optimization of the above function w.r.t. Ω and Σ, while fixing the other unknowns, admits the following closed-form solutions (Zhang & Yeung, 2014):

    Ω = (GG^⊤)^{1/2} / tr((GG^⊤)^{1/2}),    Σ = (FF^⊤)^{1/2} / tr((FF^⊤)^{1/2})

As mentioned earlier, one of the restrictions in BiFactor MTL and the factored models is that both the number of feature clusters and the number of task clusters are set to K. This poses a serious model restriction, by assuming that both the latent task and feature representations live in the same subspace. Such an assumption can significantly hinder the flexibility of the model search space, and we address this problem with a modification to our previous framework.

Following previous work on matrix tri-factorization, we introduce an additional factor S such that we write W as FSG^⊤, where F is a feature cluster matrix of size P × K1 with K1 feature clusters, G is a task cluster matrix of size T × K2 with K2 task clusters, and S is the matrix that maps feature clusters to task clusters. With this representation, the latent features lie in a K1-dimensional subspace and the latent tasks lie in a K2-dimensional subspace:

    argmin_{F ∈ R^{P×K1}, G ∈ R^{T×K2}, S ∈ R^{K1×K2}, Σ ⪰ 0, Ω ⪰ 0}  ∑_{t ∈ [T]} ||Y_t − X_t F S G_t^⊤||² + λ1 tr(F^⊤ Σ^{−1} F) + λ2 tr(G^⊤ Ω^{−1} G)    (10)

The cluster mapping matrix S introduces an additional degree of freedom in the factored models and accommodates the realistic assumptions encountered in many applications. Note that we do not consider any regularization on S in this paper, but one may impose additional constraints on S, such as ℓ1 (sparse penalty), ℓ2 (ridge penalty), non-negativity constraints, etc., to further improve performance.

TriFactor MTL
We introduce an efficient learning algorithm for solving TriFactor MTL, similar to the optimization procedure for BiFactor MTL. As before, we consider an alternating minimization algorithm: we learn the shared representation F while fixing G and S, we learn the task structure G while fixing F and S, and we learn the cluster mapping matrix S while fixing F and G. We repeat these steps until we converge to a locally optimal solution.

Optimizing w.r.t. F gives a generalized Sylvester equation, as before:

    ∑_t (X_t^⊤ X_t) F (S G_t^⊤ G_t S^⊤) + λ1 Σ^{−1} F = ∑_t X_t^⊤ Y_t G_t S^⊤    (11)

Optimizing w.r.t. G gives the following linear equation:

    (S^⊤ F^⊤ X_t^⊤ X_t F S) G_t + λ2 Ω^{−1} G = S^⊤ F^⊤ X_t^⊤ Y_t    (12)

for all t ∈ [T].

Optimizing w.r.t. S results in the following equation:

    ∑_t (F^⊤ X_t^⊤ X_t F) S (G_t^⊤ G_t) = ∑_t F^⊤ X_t^⊤ Y_t G_t    (13)

Optimizing w.r.t. Ω and Σ: the optimization w.r.t. Ω and Σ, while fixing the other unknowns, proceeds as in BiFactor MTL. Note that one may consider ℓ1 regularization on Ω and Σ to learn sparse relationships between the tasks and between the features (Zhang & Schneider, 2010).

We now give some details on how to solve the generalized Sylvester equations (8, 9, 11, 12, 13) encountered in the BiFactor and TriFactor MTL optimization steps. The generalized Sylvester equation of the form AQB^⊤ + CQD^⊤ = E has a unique solution Q under certain regularity conditions, which can be obtained exactly by an extended version of the classical Bartels-Stewart method, whose complexity is O((p + q)³) for a p × q matrix variable Q, compared to the naive matrix inversion which requires O(p³q³).

Alternatively, one can solve the linear equation using the properties of the Kronecker product: (B^⊤ ⊗ A) vec(Q) + (D^⊤ ⊗ C) vec(Q) = vec(E), where ⊗ is the Kronecker product and vec(·) vectorizes Q in a column-oriented way. Below, we show the alternative form for the TriFactor MTL equations:

    ∑_t ((S G_t^⊤ G_t S^⊤) ⊗ (X_t^⊤ X_t)) vec(F) + λ1 (I_K1 ⊗ Σ^{−1}) vec(F) = vec(∑_t X_t^⊤ Y_t G_t S^⊤)    (14)

    diag(S^⊤ F^⊤ X_t^⊤ X_t F S)_{t=1}^T vec(G) + λ2 (I_K2 ⊗ Ω^{−1}) vec(G) = vec([S^⊤ F^⊤ X_t^⊤ Y_t]_{t=1}^T)    (15)

    ∑_t ((G_t^⊤ G_t) ⊗ (F^⊤ X_t^⊤ X_t F)) vec(S) = vec(∑_t F^⊤ X_t^⊤ Y_t G_t)    (16)

We can do the same for BiFactor MTL, enabling us to use conjugate gradient descent (CG) to learn our unknown factors, whose complexity depends on the condition number of the matrix (B^⊤ ⊗ A) + (D^⊤ ⊗ C). To optimize F, G, and S, we iteratively run conjugate gradient descent for each factor, while fixing the other unknowns, until a convergence tolerance is met. In addition, CG can exploit the solution from the previous iteration, the low-rank structure in the equation, and the fact that the matrix-vector products can be computed relatively efficiently. In our experiments, we find that our algorithm converges quickly, i.e., in a few iterations.
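To make the Kronecker reformulation concrete, the sketch below solves a small dense instance of AQB^⊤ + CQD^⊤ = E by vectorization, with A and B mirroring the Gram-matrix structure of Equation 8 (the dimensions are arbitrary; at realistic scale one would use a matrix-free conjugate gradient solver instead of forming the Kronecker products explicitly).

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 5, 3
Xs = rng.standard_normal((20, p))
A = Xs.T @ Xs                      # like X^T X in Eq. (8): symmetric PSD
Gs = rng.standard_normal((7, q))
B = Gs.T @ Gs                      # like G^T G in Eq. (8): symmetric PSD
C = 0.5 * np.eye(p)                # like lambda * Sigma^{-1} (here a scaled identity)
D = np.eye(q)

Q_true = rng.standard_normal((p, q))
E = A @ Q_true @ B.T + C @ Q_true @ D.T

# (B^T kron A + D^T kron C) vec(Q) = vec(E), with column-major vec
M = np.kron(B.T, A) + np.kron(D.T, C)
Q = np.linalg.solve(M, E.flatten(order="F")).reshape((p, q), order="F")
print(np.allclose(A @ Q @ B.T + C @ Q @ D.T, E))
```

Because A, B are PSD and C is positive definite here, the Kronecker system is symmetric positive definite, which is exactly the setting in which CG applies.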
4. Experiments
In this section, we report experiments on both synthetic datasets and three real-world datasets to evaluate the effectiveness of our proposed MTL methods. We compare both our models with several state-of-the-art baselines discussed in Section 2. We include results for Shared Multitask Learning (SHAMO) (Crammer & Mansour, 2012), which uses a K-means-like procedure that simultaneously clusters different tasks using a small pool of m ≪ T shared models. Following (Barzilai & Crammer, 2015), we use a gradient-projection algorithm to optimize the dual of the objective function (Equation 6). In addition, we compare our results with Single-task learning (STL), which learns a single model by pooling together the data from all the tasks, and Independent task learning (ITL), which learns each task independently.

The parameters for the proposed formulations and the state-of-the-art baselines are chosen by cross-validation. We fix the value of λ1 in order to reduce the search space and choose the value of λ2 from a logarithmic search grid. The values of K, K1, and K2 are chosen from a small set of candidate values. We evaluate the models using Root Mean Squared Error (RMSE) for the regression tasks and the F-measure for the classification tasks. For our experiments, we consider the squared error loss for each task. We repeat all our experiments over several random runs to compensate for statistical variability. The best model and the statistically competitive models (by paired t-test with α = 0.05) are shown in boldface.

We evaluate our models on five synthetic datasets based on the assumptions considered in both the baselines and the proposed methods. We generate examples from X_t ∼ N(0, I_P) with P = 20 for each task t. All the datasets consist of the same number of tasks, with an equal number of training examples per task. Each task is constructed using Y_t = X_t W_t + ε_t, with i.i.d. Gaussian noise ε_t. The task parameters for each synthetic dataset are generated as follows:

1. syn1 consists of groups of tasks with no overlap between groups. We generate K = 15 latent features from F ∼ N(0, 1), and each W_t is constructed by linearly combining the latent features of its group from F.

2. syn2 is generated with overlapping groups of tasks. As before, we generate K = 15 latent features from F ∼ N(0, 1), but the tasks in each of the three groups are constructed from partially overlapping subsets of the latent features.

3. syn3 simulates the BiFactorMTL assumption. We randomly generate a task covariance matrix Ω and a feature covariance matrix Σ. We sample F ∼ N(0, Σ) and G ∼ N(0, Ω) and compute W = FG^⊤.

4. syn4 simulates the TriFactorMTL assumption. We randomly generate a task covariance matrix Ω and a feature covariance matrix Σ. We sample F ∼ N(0, Σ), G ∼ N(0, Ω), and S ∼ U(0, 1), and compute the task weight matrix as W = FSG^⊤.

5. syn5 simulates an experiment with the task weight matrix drawn from a matrix normal distribution (Zhang & Schneider, 2010). We randomly generate a task covariance matrix Ω and a feature covariance matrix Σ and sample vec(W) ∼ N(0, Σ ⊗ Ω).

Table 1. Performance results (RMSE) on the synthetic datasets (syn1-syn5). The table reports the mean and standard errors over random runs; the best model and the statistically competitive models (by paired t-test with α = 0.05) are shown in boldface. Models compared: STL, ITL, SHAMO, MTFL, GO-MTL, BiFactorMTL, and TriFactorMTL.

We compare the proposed methods BiFactorMTL and TriFactorMTL against the baselines. We can see in Table 1 that BiFactorMTL and TriFactorMTL outperform all the baselines on all the synthetic datasets. STL performs the worst, since it naively combines the data from all the tasks. SHAMO performs better than STL but worse than ITL, which shows that learning these tasks separately is more beneficial than combining them to learn a few shared models.

As mentioned earlier, since MTFL is similar to FMTL in Equation 2, we can see how the results of BiFactorMTL improve when it learns both the task relationship matrix and the feature relationship matrix. Note that the syn1 and syn2 datasets are based on the assumptions of GO-MTL; hence, it performs better than the other baselines on them. BiFactorMTL and TriFactorMTL are equally competitive with GO-MTL, which shows that our proposed methods can easily adapt to these assumptions. The synthetic datasets syn3, syn4, and syn5 are generated with both a task covariance matrix and a feature covariance matrix. Since BiFactorMTL and TriFactorMTL learn the task and feature relationship matrices along with the task weight parameters, they perform significantly better than the other baselines on these datasets.
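The syn3 construction can be sketched as follows (a minimal NumPy illustration; the covariance construction, noise scale, and dimensions are our own arbitrary choices, since the exact values are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(4)
P, T, K, N = 20, 10, 5, 30

# Random PSD feature covariance Sigma (P x P) and task covariance Omega (T x T)
A = rng.standard_normal((P, P))
Sigma = A @ A.T / P + 0.1 * np.eye(P)
B = rng.standard_normal((T, T))
Omega = B @ B.T / T + 0.1 * np.eye(T)

# F ~ N(0, Sigma) and G ~ N(0, Omega) column-wise; W = F G^T (BiFactor assumption)
F = np.linalg.cholesky(Sigma) @ rng.standard_normal((P, K))
G = np.linalg.cholesky(Omega) @ rng.standard_normal((T, K))
W = F @ G.T                                   # task weight matrix (P x T)

# Per-task data: X_t ~ N(0, I_P), Y_t = X_t W_t + Gaussian noise
X = [rng.standard_normal((N, P)) for _ in range(T)]
Y = [X[t] @ W[:, t] + 0.1 * rng.standard_normal(N) for t in range(T)]
print(W.shape, len(X), len(Y))
```

Replacing the last matrix product with W = F @ S @ G.T, with S drawn uniformly, would give the analogous syn4 construction.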
We evaluate the proposed methods on examination score prediction data, a benchmark dataset in multitask regression reported in several previous articles (Argyriou et al., 2008; Kumar & Daume, 2012; Zhang & Yeung, 2014). The school dataset consists of examination scores of 15,362 students from 139 schools in London. Each school is considered a task, and we need to predict the exam scores for students from these schools. The feature set includes the year of the examination, four school-specific and three student-specific attributes. We replace each categorical attribute with one binary variable for each possible attribute value, as suggested in (Argyriou et al., 2008). This results in 27 attributes, with an additional attribute to account for the bias term. (http://ttic.uchicago.edu/~argyriou/code/mtl_feat/school_splits.tar)

Clearly, the dataset has school- and student-specific feature clusters that can help in learning the shared feature representation better than the other factored baselines. In addition, there must be several task clusters in the data to account for the differences among the schools. The training and test sets are obtained by dividing the examples of each task into many small datasets, varying the number of training examples per task, in order to evaluate the proposed methods on many tasks with limited numbers of examples.

Table 2. Performance results (RMSE) on the school dataset. The table reports the mean and standard errors over random runs, for three training-set sizes, comparing STL, ITL, SHAMO, MTFL, GO-MTL, BiFactorMTL, and TriFactorMTL.

Table 2 shows the experimental results for the school data. All the factorized MTL methods outperform STL and ITL. We can see that both TriFactorMTL and BiFactorMTL significantly outperform the other baselines. It is interesting to see that TriFactorMTL performs considerably well even when the tasks have limited numbers of examples. When there is more training data, the advantage of TriFactorMTL over the strongest baseline, GO-MTL, is reduced.
We follow the experimental setup in (Crammer & Mansour, 2012; Barzilai & Crammer, 2015) and evaluate our algorithm on product reviews from Amazon. The dataset contains product reviews from domains such as books, dvd, etc. We consider each domain as a binary classification task. The reviews are stemmed, and stopwords are removed from the review text. We represent each review as a bag of unigrams/bigrams with TF-IDF scores. We choose a fixed number of reviews from each domain, each associated with a rating from {1, 2, 4, 5}. Reviews with rating 3 are not included in this experiment, as such sentiments are ambiguous and therefore cannot be reliably predicted.

We ran several experiments on this dataset to test the importance of learning a shared feature representation and the co-clustering of tasks and features. In Experiment I, we construct classification tasks with reviews labeled positive (+) when the rating is above 3 and negative (−) when the rating is below 3. We use a subset of the examples from each task for training and the remainder as the test set. Since all the tasks are essentially the same, STL performs better than all the other models by combining the data from all the tasks. The results for our proposed methods BiFactorMTL and TriFactorMTL are comparable to that of STL. See the supplementary material for the results of Experiment I.

For Experiment II, we split each domain into two equal sets, from which we create two prediction tasks based on two different rating thresholds. Obviously, combining all the tasks together will not help in this setting. Experiments III and IV are similar to Experiment II, except that each task is further divided into sub-tasks. Experiment V splits each domain into three equal sets to construct three prediction tasks based on three different rating thresholds. This setting captures reviews with different levels of sentiment. As before, we build the datasets for Experiments VI and VII by further dividing the three prediction tasks from Experiment V into sub-tasks.

The results from our experiments are reported in Table 3. The top rows of the table show the number of tasks in each experiment and the number of training examples per task. The general trend is that the factorized models perform significantly better than the other baselines. Since MTFL, BiFactorMTL, and TriFactorMTL learn the feature relationship matrix Σ in addition to the task parameters, they achieve better results than CMTL, which considers only the task clusters.

We notice that as we increase the number of tasks, the gap between the performance of TriFactorMTL and that of BiFactorMTL (and GO-MTL) widens, since the assumption that the number of feature clusters and the number of task clusters should both be K is clearly violated. TriFactorMTL, on the other hand, learns with different numbers of feature and task clusters (K1, K2) and hence achieves better performance than all the other methods considered in these experiments.

Table 3. Performance results (F-measure) for the various experiments on sentiment detection. The table reports the mean and standard errors over random runs.

    Data        II    III   IV    V     VI    VII
    Tasks       28    56    84    42    86    126
    Train Size  120   60    40    80    40    26

Models compared: STL, ITL, SHAMO, CMTL, MTFL, GO-MTL, BiFactor, and TriFactor.
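The threshold-based task construction used in Experiments II-VII can be sketched as follows (illustrative only; the ratings are simulated and the thresholds are hypothetical, not the actual experimental splits).

```python
import numpy as np

rng = np.random.default_rng(5)
ratings = rng.choice([1, 2, 4, 5], size=200)   # rating 3 excluded, as in the setup

def make_task(ratings, threshold):
    # binary labels: +1 if rating > threshold, -1 otherwise
    return np.where(ratings > threshold, 1, -1)

# Split one domain into two halves and apply a different (hypothetical) threshold
# to each half, yielding two related but distinct binary prediction tasks.
half = len(ratings) // 2
task_a = make_task(ratings[:half], threshold=4)   # e.g. "is the rating 5?"
task_b = make_task(ratings[half:], threshold=1)   # e.g. "is the rating above 1?"
print(task_a.shape, task_b.shape)
```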
Table 4. Performance results (F-measure) on the 20 Newsgroups dataset (Tasks 1-5). The table reports the mean and standard errors over random runs. Models compared: GO-MTL, BiFactorMTL, and TriFactorMTL.
Finally, we evaluate our proposed models on the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) for transfer learning. The dataset contains postings from 20 Usenet newsgroups. As before, the postings are stemmed and the stopwords are removed from the text. We represent each posting as a bag of unigrams/bigrams with TF-IDF scores. We construct tasks from the postings of the newsgroups: we randomly select a pair of newsgroup classes to build each one-vs-one classification task. We follow the hold-out experiment suggested by Raina et al. (2006) for the transfer learning setup. For each task (the target task), we learn F (F and S in the case of TriFactorMTL) from the remaining tasks (the source tasks). With F (F and S) known from the source tasks, we select a portion of the data from the target task to learn G for the target. This experiment shows how well the latent feature representation learned from the source tasks in a K-dimensional subspace (a K1-dimensional subspace for TriFactorMTL) adapts to the new task. We evaluate our results on the remaining data from the target task. We select GO-MTL as the baseline for comparison; since CMTL does not explicitly learn F, we did not include it in this experiment.

Table 4 shows the results for this experiment. We report the first five tasks here; see the supplementary material for the performance results of all the tasks. GO-MTL and BiFactorMTL perform almost the same, since both learn the latent feature representation in a K-dimensional space. As is evident from the table, TriFactorMTL outperforms both GO-MTL and BiFactorMTL, which shows that learning both factors F and S improves information transfer from the source tasks to the target task.
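The adaptation step above can be sketched concretely. The following is a minimal illustration, assuming the simplest form of the model (a target parameter vector w = F g with squared loss and a ridge penalty); the function name, data shapes, and the regularizer `lam` are illustrative, and the paper's actual objective also includes sparsity terms:

```python
import numpy as np

def adapt_to_target(F, X, y, lam=0.1):
    """Fit target-task weights g with the latent factors F held fixed.

    Model: w_target = F @ g, so predictions are X @ F @ g.
    Solving min_g ||X F g - y||^2 + lam ||g||^2 gives a closed form.
    """
    Z = X @ F                      # project inputs into the K-dim subspace
    K = Z.shape[1]
    g = np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ y)
    return F @ g                   # recovered target-task parameter vector

# Toy check: with enough target data, we recover a w lying in span(F).
rng = np.random.default_rng(0)
d, K, n = 20, 4, 200
F = rng.standard_normal((d, K))
g_true = rng.standard_normal(K)
X = rng.standard_normal((n, d))
y = X @ F @ g_true
w = adapt_to_target(F, X, y, lam=1e-6)
print(np.allclose(w, F @ g_true, atol=1e-3))  # → True
```

Because F is fixed, only the K-dimensional vector g is estimated per target task, which is why a small fraction of target data suffices.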
5. Conclusions

In this paper, we proposed a novel framework for multitask learning that factors the task parameters into a shared feature representation and a task structure to learn from multiple related tasks. We formulated two approaches, motivated by recent work in multitask latent feature learning. The first (BiFactorMTL) decomposes the task parameters W into two low-rank matrices: a latent feature representation F and a task structure G. As this approach is restrictive on the number of clusters in the latent feature and task spaces, we proposed a second method (TriFactorMTL), which introduces an additional degree of freedom to permit different clusterings in each. We developed a highly scalable and efficient learning algorithm using conjugate gradient descent and generalized Sylvester equations. Extensive empirical analysis on both synthetic and real datasets shows that TriFactorMTL outperforms the other state-of-the-art multitask baselines, demonstrating the effectiveness of the proposed approach.
References

Amit, Yonatan, Fink, Michael, Srebro, Nathan, and Ullman, Shimon. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pp. 17–24. ACM, 2007.

Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

Barzilai, Aviad and Crammer, Koby. Convex multi-task learning by clustering. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS-15), 2015.

Crammer, Koby and Mansour, Yishay. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems, pp. 1475–1483, 2012.

Ding, Chris and He, Xiaofeng. K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 29. ACM, 2004.

Ding, Chris, Li, Tao, Peng, Wei, and Park, Haesun. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 126–135. ACM, 2006.

Dinuzzo, Francesco, Ong, Cheng S, Pillonetto, Gianluigi, and Gehler, Peter V. Learning output kernels with block coordinate descent. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 49–56, 2011.

Gu, Quanquan and Zhou, Jie. Co-clustering on manifolds. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 359–368. ACM, 2009.

Jalali, Ali, Sanghavi, Sujay, Ruan, Chao, and Ravikumar, Pradeep K. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems, pp. 964–972, 2010.

Kumar, Abhishek and Daume, Hal. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1383–1390, 2012.

Li, Tao and Ding, Chris. The relationships among various nonnegative matrix factorization methods for clustering. In Sixth International Conference on Data Mining (ICDM'06), pp. 362–371. IEEE, 2006.

Liu, Jun, Ji, Shuiwang, and Ye, Jieping. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 339–348. AUAI Press, 2009.

Maurer, Andreas, Pontil, Massimiliano, and Romera-Paredes, Bernardino. Sparse coding for multitask and transfer learning. In ICML (2), pp. 343–351, 2013.

Raina, Rajat, Ng, Andrew Y, and Koller, Daphne. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 713–720. ACM, 2006.

Rothman, Adam J, Levina, Elizaveta, and Zhu, Ji. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.

Sindhwani, Vikas, Hu, Jianying, and Mojsilovic, Aleksandra. Regularized co-clustering with dual supervision. In Advances in Neural Information Processing Systems, pp. 1505–1512, 2009.

Sindhwani, Vikas, Minh, Ha Quang, and Lozano, Aurélie C. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and granger causality. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 586–595. AUAI Press, 2013.

Swirszcz, Grzegorz and Lozano, Aurelie C. Multi-level lasso for sparse multi-task regression. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 361–368, 2012.

Wang, Hua, Nie, Feiping, Huang, Heng, and Makedon, Fillia. Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp. 1553, 2011.

Weinberger, Kilian, Dasgupta, Anirban, Langford, John, Smola, Alex, and Attenberg, Josh. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1113–1120. ACM, 2009.

Xu, Linli, Huang, Aiqing, Chen, Jianhui, and Chen, Enhong. Exploiting task-feature co-clusters in multi-task learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 1931–1937. AAAI Press, 2015.

Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. Multi-task learning for classification with dirichlet process priors. The Journal of Machine Learning Research, 8:35–63, 2007.

Zhang, Jian, Ghahramani, Zoubin, and Yang, Yiming. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems, pp. 1585–1592, 2005.

Zhang, Yi and Schneider, Jeff G. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems, pp. 2550–2558, 2010.

Zhang, Yu and Yeung, Dit-Yan. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.

Zhong, Wenliang and Kwok, James T. Convex multi-task learning with flexible task clusters. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 49–56, 2012.
Sensitivity Analysis

Figure 1 shows the hyper-parameter sensitivity analysis for GO-MTL, BiFactorMTL and TriFactorMTL. As before, one regularization parameter is held fixed while the others are varied. GO-MTL and BiFactorMTL have two hyper-parameters (λ and K) to tune, and TriFactorMTL has three (λ, K1 and K2). We can see from the plots that our proposed models yield stable results even as K, K1 and K2 change. On the other hand, the GO-MTL results are sensitive to the value of λ, the regularization parameter for the sparse penalty on G.

Additional Results
Tables 5 and 6 show the complete experimental results for the sentiment analysis and transfer learning experiments.

List of one-vs-one classification tasks used in Table 6:
(Task 1) comp.windows.x vs comp.os.ms-windows.misc
(Task 2) soc.religion.christian vs rec.sport.hockey
(Task 3) misc.forsale vs talk.politics.guns
(Task 4) sci.med vs rec.autos
(Task 5) comp.sys.mac.hardware vs talk.politics.misc
(Task 6) sci.space vs alt.atheism
(Task 7) comp.graphics vs comp.sys.ibm.pc.hardware
(Task 8) talk.politics.mideast vs sci.electronics
(Task 9) rec.motorcycles vs talk.religion.misc
(Task 10) rec.sport.baseball vs sci.crypt

Figure 1. Top: Sensitivity analysis for the regularization parameter λ with K = 2 (left), and for the numbers of clusters K1 and K2 with the regularization parameters held fixed (right), calculated on the syn5 dataset (RMSE). Middle: Sensitivity analysis for the school dataset (RMSE). Bottom: Sensitivity analysis for sentiment detection (F-measure).

Table 5.
Performance results (F-measure) for various experiments on sentiment detection. The table reports the mean and standard errors over random runs.

Data                  I      II     III    IV     V      VI     VII
Tasks                 14     28     56     84     42     86     126
Thresholds (Splits)
Train Size            240    120    60     40     80     40     26
STL
ITL
SHAMO
CMTL
MTFL
BiFactorMTL    0.722 (0.006)   0.611 (0.018)   0.561 (0.013)
TriFactorMTL   0.733 (0.006)
Table 6. Performance results (F-measure) on 20 Newsgroups dataset. The table reports the mean and standard errors over random runs. The best model and the statistically competitive models (by paired t-test at significance level α) are shown in boldface.

Models          Task 1   Task 2   Task 3   Task 4   Task 5   Task 6   Task 7   Task 8   Task 9   Task 10
GO-MTL          (0.09)   (0.06)   (0.04)   (0.06)   (0.03)   (0.02)   (0.02)   (0.01)   (0.00)   (0.05)
BiFactorMTL     (0.09)   (0.05)   (0.04)   (0.03)   (0.01)   (0.02)   (0.02)   (0.02)   (0.00)   (0.04)
TriFactorMTL    (0.03)   (0.02)   (0.02)   (0.02)   (0.02)   (0.01)   (0.02)   (0.01)   (0.03)   0.62
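The boldface rule in Tables 4 and 6 rests on a paired t-test over the random runs. A minimal sketch of that comparison, with illustrative per-run F-measures (not the paper's actual numbers) and an assumed significance level:

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative per-run F-measures for two models evaluated
# on the same sequence of random splits.
runs_a = np.array([0.61, 0.63, 0.60, 0.62, 0.64])
runs_b = np.array([0.55, 0.58, 0.54, 0.57, 0.56])

# Paired test: the models are compared run-by-run on identical
# splits, so the per-run differences are what gets tested.
t_stat, p_value = ttest_rel(runs_a, runs_b)

alpha = 0.05  # assumed significance level for this sketch
# If p >= alpha, the runner-up counts as "statistically
# competitive" and would also appear in boldface.
print(p_value < alpha)  # → True
```

Pairing on identical splits removes split-to-split variance from the comparison, which is why it is preferred over an unpaired test for this kind of table.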