Self-Paced Multitask Learning with Shared Knowledge
Keerthiram Murugesan and
Jaime Carbonell
Carnegie Mellon University, Pittsburgh, PA, USA
{kmuruges,jgc}@cs.cmu.edu
Abstract
This paper introduces self-paced task selection to multitask learning, where instances from more closely related tasks are selected in a progression of easier-to-harder tasks, to emulate an effective human education strategy, but applied to multitask machine learning. We develop the mathematical foundation for the approach based on iterative selection of the most appropriate task, learning the task parameters, and updating the shared knowledge, optimizing a new bi-convex loss function. This proposed method applies quite generally, including to multitask feature learning, multitask learning with alternating structure optimization, etc. Results show that in each of the above formulations self-paced (easier-to-harder) task selection outperforms the baseline version of these methods in all the experiments.
Self-paced learning, inspired by established human education principles, defines a new machine learning paradigm based on a curriculum defined dynamically by the learner ("self-paced") instead of a fixed curriculum set a priori by a teacher. It is an iterative approach that alternately learns the model parameters and selects easier instances at first, progressing to harder ones [Kumar et al., 2010]. However, naive extension of self-paced learning to the multitask setting may result in intractable increases in the number of learning parameters, and therefore in inefficient use of the knowledge shared among the tasks. Existing work in this area is not scalable and/or lacks sufficient generality to apply to several multitask learning challenges [Li et al., 2017].

Not all tasks are equal. Some tasks are easy to learn, while others are complex but can be solved efficiently with the help of previously learned tasks. For example, the task of classifying whether an image contains a bird can be learned by solving easier component tasks first, such as
Is there a wing?, Is there a beak?, Does it have feathers?, etc. The knowledge gained from these previously learned easier tasks can be used to solve the complex tasks effectively, and such shared knowledge plays an important role in the transfer of information between tasks. This phenomenon is evident in many real-world problems such as object detection, weather prediction, landmine detection, etc.

We introduce a new learning framework for multiple tasks that addresses the aforementioned issues. It starts with an easier set of tasks, and gradually introduces more difficult ones to build the shared knowledge base. Our proposed method provides a natural way to specify the trade-off between choosing the easier tasks to update the shared knowledge and learning new tasks using the knowledge acquired from previously learned tasks. Our proposed framework based on self-paced learning for multiple tasks addresses three key challenges: 1) it embeds task selection into the model learning; 2) it gradually learns the shared knowledge at the system's own pace; 3) it generalizes to a wide group of multitask problems.

We first briefly introduce the self-paced learning framework. Next, we describe our proposed approach for self-paced multitask learning with efficient learning of latent task weights. We give a probabilistic interpretation of these task weights, based on their training errors. We apply our learning framework to a few popular multitask problems such as Multitask Feature Learning, Multitask Learning with Alternating Structure Optimization (ASO), and Mean Regularized Multitask Learning, and show that self-paced multitask learning significantly improves the learning performance of the original problem. In addition, we evaluate our method against several algorithms for sequential learning of multiple tasks.
Given a set of N training instances along with their labels (x_i, y_i), i \in [N], the general form of the objective function for single-task learning is given by:

E_\lambda\{\hat{w}\} = \arg\min_{w} \sum_{i \in [N]} \ell(y_i, f(x_i, w)) + \rho_\gamma(w) \qquad (1)

where \rho_\gamma(w) is the regularization term on the model parameters, typically set to \rho_\gamma(w) = \gamma\|w\|_2^2 (ridge or L2 penalty) or \gamma\|w\|_1 (lasso or L1 penalty). \gamma is the regularization parameter and [N] is the index set \{1, 2, \ldots, N\}. Self-paced learning (
SPL) provides a strategy for simultaneously selecting the easier instances and re-estimating the model parameters w at each iteration [Kumar et al., 2010]. We assume a linear predictor function f(x_i, w) with unknown parameter w. Self-paced learning solves the following objective function:

E_\lambda\{\hat{w}, \hat{\tau}\} = \arg\min_{w, \tau \in \Omega} \sum_{i \in [N]} \tau_i \,\ell(y_i, f(x_i, w)) + \rho_\gamma(w) + \lambda r(\tau) \qquad (2)

where r(\tau) is the regularization term on \tau, \Omega is the domain space of \tau, \rho_\gamma(w) is the regularization term on the model parameters w as defined earlier, and \lambda is the regularization parameter that identifies the difficulty of the instances. There are two unknowns in Equation 2: the model parameter vector w and the selection parameter \tau (restricted to the domain \Omega).

A common choice of the constraint space C = \{\rho_\gamma(w), r(\tau), \Omega\} in SPL is \{\gamma\|w\|_2^2, -\|\tau\|_1, \{0,1\}^N\}. See [Jiang et al., 2015] for more examples of constraint spaces. With this setting, Equation 2 is a bi-convex optimization problem over w and \tau, which can be efficiently solved by alternating minimization. Given a fixed \tau, the solution for w can be obtained using any off-the-shelf solver, and for a fixed w, the solution for \tau is given as follows:

\hat{\tau}_i = \begin{cases} 1 & \text{if } \ell(y_i, f(x_i, w)) < \lambda \\ 0 & \text{otherwise} \end{cases} \quad \forall i \in [N] \qquad (3)

There is an intuitive explanation for this alternating search strategy: 1) when updating \tau with a fixed w, a sample whose loss is smaller than a certain threshold \lambda is taken as an "easy" sample because it is a sample with "less error", and will be selected in training (\tau^*_i = 1) or otherwise unselected (\tau^*_i = 0); 2) when updating w with a fixed \tau, the classifier is trained only on the selected "easy" samples. When \lambda is small, only "easy" samples with small losses will be considered.

Suppose we are given T tasks, where the t-th task is associated with N_t training examples.
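As a concrete illustration, the alternating minimization of Equations 2 and 3 takes only a few lines. The sketch below assumes a squared loss, a closed-form ridge solver, and a geometric schedule for the pace parameter (multiplying λ by a constant c each round) — illustrative choices, not the paper's exact setup:

```python
import numpy as np

def self_paced_learning(X, y, gamma=1.0, lam=0.1, c=1.1, n_iters=50):
    """Single-task SPL sketch: alternate the tau update (Eq. 3) with a
    ridge solve on the currently selected "easy" instances."""
    n, d = X.shape
    w = np.zeros(d)
    tau = np.zeros(n)
    for _ in range(n_iters):
        # Step 1: update tau with w fixed (Eq. 3): keep instances with loss < lambda
        losses = (y - X @ w) ** 2
        tau = (losses < lam).astype(float)
        if tau.sum() == 0:           # no instance is easy enough yet
            lam *= c
            continue
        # Step 2: update w with tau fixed: ridge regression on the selected rows
        Xs, ys = X[tau == 1], y[tau == 1]
        w = np.linalg.solve(Xs.T @ Xs + gamma * np.eye(d), Xs.T @ ys)
        lam *= c                     # relax the threshold to admit harder instances
    return w, tau
```

As λ grows, the selected set expands until all instances participate, at which point the procedure reduces to ordinary regularized training.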
Denote by \{(x_{ti}, y_{ti})\}_{i=1}^{N_t} and L(y_t, f(X_t, w_t)) = \frac{1}{N_t}\sum_{i \in [N_t]} \ell(y_{ti}, f(x_{ti}, w_t)) the training set and loss for task t, respectively. In this paper, we consider a more general formulation for multitask learning, which is given by [Caruana, 1997; Baxter and others, 2000; Evgeniou and Pontil, 2004]:

E_\lambda\{\hat{W}, \hat{\Theta}\} = \arg\min_{W, \Theta \in \Gamma} \sum_{t \in [T]} L(y_t, f(X_t, w_t)) + P_\gamma(W, \Theta) \qquad (4)

where P_\gamma(W, \Theta) is the regularization term on the task parameters W, and \Theta is the knowledge shared among the tasks, which depends on the problem under consideration. We assume that P_\gamma(W, \Theta) can be written as \sum_{t \in [T]} P_\gamma(w_t, \Theta), such that, for a given \Theta, the above objective function decomposes into T independent optimization problems. P_\gamma(w_t, \Theta) gives a scoring function on how easy a task is, compared to the learned knowledge \Theta. Several multitask learning problems fall under this general characterization: for example, Multitask Feature Learning (MTFL), Regularized Multitask Learning (MMTL), Multitask Learning with Manifold Regularization (MTML), Multitask Learning via Alternating Structure Optimization (MTASO), Sparse Coding for Multitask Learning (SC-MTL), etc. [Evgeniou and Pontil, 2007; Evgeniou and Pontil, 2004; Agarwal et al., 2010; Ando and Zhang, 2005; Maurer et al., 2013]. With this formulation, one can easily extend the SPL framework to the multitask setting by considering instance weights for each task:

E_\lambda\{\hat{W}, \hat{\Theta}, \hat{\tau}\} = \arg\min_{W, \Theta \in \Gamma, \tau \in \Omega} \sum_{t \in [T]} \frac{1}{N_t} \sum_{i \in [N_t]} \tau_{ti}\, \ell(y_{ti}, f(x_{ti}, w_t)) + P_\gamma(W, \Theta) + \lambda r(\tau) \qquad (5)

But there are two key issues with this naive extension of SPL: 1) the above formulation fails to effectively utilize the knowledge shared among the tasks; 2) the number of unknown parameters \tau grows with the total number of instances N = \sum_t N_t from all the tasks. This is a serious problem especially when the number of tasks T is large [Weinberger et al., 2009] and/or when manual annotation of task instances is expensive [Kshirsagar et al., 2013].

To address these issues, we consider task-level weights instead of instance-level weights. Our motivation behind this approach is based on the human educational process. When students learn a new concept, they (or their teachers) choose a new task that is relevant to their recently acquired knowledge, rather than more distant tasks or concepts or other haphazard selections. Inspired by this interpretation, we propose the following objective function for Self-Paced Multitask Learning (spMTL):

E_\lambda\{\hat{W}, \hat{\Theta}, \hat{\tau}\} = \arg\min_{W, \Theta \in \Gamma, \tau \in \Omega} \sum_{t \in [T]} \tau_t \big[ L(y_t, f(X_t, w_t)) + P_\gamma(w_t, \Theta) \big] + \lambda r(\tau) \qquad (6)

Note that the number of parameters \tau_t depends on T instead of N, and that \tau_t depends on both the training error of the task and the task regularization term for the shared knowledge \Theta.

The pseudo-code is in Algorithm 1. The learning algorithm defines a task as "easy" if it has low training error \frac{1}{N_t}\sum_{i \in [N_t]} \ell(y_{ti}, f(x_{ti}, w_t)) and is similar to the shared knowledge representation P_\gamma(w_t, \Theta). These tasks will be selected in building the shared knowledge \Theta.
Following Equation 3, we can define \tau_t as:

\hat{\tau}_t = \begin{cases} 1 & \text{if } L(y_t, f(X_t, w^{(k)}_t)) + P_\gamma(w^{(k)}_t, \Theta^{(k-1)}) < \lambda \\ \delta & \text{otherwise} \end{cases} \quad \forall t \in [T] \qquad (7)

For the multitask setting, it is desirable to consider an alternative constraint space that gives a probabilistic interpretation for \tau. (For correctness of the algorithm, we set \tau_t = \delta for the hard tasks, instead of \tau_t = 0, with \delta close to 0.)

Algorithm 1: Self-Paced Multitask Learning: A General Framework
Input: D = \{(X_t, y_t)\}_{t=1}^{T}, \Theta^{(0)}, c > 1
Output: W, \Theta
k \leftarrow 1, \lambda \leftarrow \lambda_0
repeat
    Solve w^{(k)}_t \leftarrow \arg\min_w L(y_t, f(X_t, w)) + P_\gamma(w, \Theta^{(k-1)}) \;\forall t;
    Solve for \tau^{(k)} using Equation (7) or Equation (8);
    Solve for \Theta^{(k)}: \Theta^{(k)} \leftarrow \arg\min_\Theta \sum_{t \in [T]} \tau^{(k)}_t P_\gamma(w^{(k)}_t, \Theta);
    \lambda \leftarrow c\lambda; \; k \leftarrow k + 1;
until \|\tau^{(k)} - \tau^{(k-1)}\| \leq \epsilon

By setting C = \{\gamma\|w\|_2^2, -H(\tau), \Delta^{T-1}\}, we get

\hat{\tau}_t \propto \exp(-[L(y_t, f(X_t, w_t)) + P_\gamma(w_t, \Theta)]/\lambda), \qquad (8)

where H(\tau) = -\sum_{t \in [T]} \tau_t \log \tau_t denotes the entropy of the probability distribution \tau over the tasks and \Delta^{T-1} is the probability simplex. The key idea is that the algorithm, at each iteration, maintains a probability distribution over the tasks to identify the simpler tasks based on the shared knowledge. A similar approach has been used for learning the relationships between multiple tasks in an online setting [Murugesan et al., 2016]. Using this representation, we can use \tau to sample the "easy" tasks at each iteration, which makes the learning problem scalable via stochastic approximation when the number of tasks is large. It is worth noting that our framework can easily handle outlier tasks through a simple modification of Algorithm 1: since outlier tasks are different from the main tasks and are usually difficult to learn, we can take advantage of this observation for early stopping, before the algorithm visits all the tasks [Romera-Paredes et al., 2012].

Our algorithm can be easily generalized to other types of updating rules by replacing exp in (8) with other functions. In the latter cases, however, \tau may no longer have a probabilistic interpretation. Algorithm 1 shows the basic steps in learning the task weights and the shared knowledge. The algorithm uses an additional parameter c that controls the learning pace of the self-paced procedure. Typically, c is set to some value greater than 1 (in our experiments, we set it to 1.1) such that, at each iteration, the threshold \lambda is relaxed to include more tasks.
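As an illustration of Algorithm 1, the sketch below instantiates the framework with a mean-regularized penalty P_γ(w_t, Θ) = γ‖w_t − θ‖² and the softmax τ update of Equation 8. The squared loss and the closed-form ridge solver for step 1 are assumptions made for the sake of a runnable example; any per-task solver would do:

```python
import numpy as np

def sp_mtl(tasks, gamma=1.0, lam=1.0, c=1.1, eps=1e-6, max_iter=100):
    """Sketch of Algorithm 1 for `tasks` = [(X_1, y_1), ..., (X_T, y_T)],
    with P_gamma(w_t, theta) = gamma * ||w_t - theta||^2."""
    T = len(tasks)
    d = tasks[0][0].shape[1]
    theta = np.zeros(d)                  # Theta^(0): initial shared knowledge
    tau = np.full(T, 1.0 / T)
    for k in range(max_iter):
        W = np.empty((T, d))
        scores = np.empty(T)
        for t, (X, y) in enumerate(tasks):
            # Step 1: w_t = argmin_w (1/N_t)||y - Xw||^2 + gamma*||w - theta||^2
            n = len(y)
            A = X.T @ X / n + gamma * np.eye(d)
            W[t] = np.linalg.solve(A, X.T @ y / n + gamma * theta)
            train_err = np.mean((y - X @ W[t]) ** 2)
            scores[t] = train_err + gamma * np.sum((W[t] - theta) ** 2)
        # Step 2: tau update (Eq. 8): softmax over negated task scores
        tau_new = np.exp(-scores / lam)
        tau_new /= tau_new.sum()
        # Step 3: Theta update: the tau-weighted mean minimizes the penalty sum
        theta = tau_new @ W
        lam *= c                         # relax the pace parameter
        converged = np.linalg.norm(tau_new - tau) <= eps
        tau = tau_new
        if converged:
            break
    return W, theta, tau
```

With this choice of penalty, step 3 has a closed form: θ = Σ_t τ_t w_t / Σ_t τ_t, i.e., the τ-weighted mean of the task parameters; for other penalties, step 3 would call the corresponding shared-knowledge solver instead.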
The algorithm also takes as input \Theta^{(0)}, the initial knowledge about the domain, which can be initialized from external sources.

We give three examples to motivate our self-paced learning procedure, briefly discussing how our algorithm alters the learning pace of the original problem. Note that existing implementations of these problems can easily be "self-paced" by adding a few lines of code, yielding better performance than the original problem. We refer the reader to [Evgeniou and Pontil, 2007; Agarwal et al., 2010; Ando and Zhang, 2005] for additional background.
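For instance, to make an MTFL implementation self-paced, only its shared-knowledge step needs to change: the standard closed-form update of the feature matrix D from [Argyriou et al., 2008] is simply reweighted by τ. A sketch, where the ε-smoothing term is an added numerical safeguard rather than part of the original formulation:

```python
import numpy as np

def update_shared_D(W, tau, eps=1e-8):
    """W: (T, d) stacked task parameters; tau: (T,) task weights.
    Returns the (d, d) shared feature matrix D = C^{1/2} / tr(C^{1/2}),
    where C is the tau-weighted second-moment matrix of the task parameters."""
    d = W.shape[1]
    C = (W * tau[:, None]).T @ W + eps * np.eye(d)   # sum_t tau_t w_t w_t^T
    # Matrix square root via eigendecomposition (C is symmetric PSD)
    vals, vecs = np.linalg.eigh(C)
    root = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    return root / np.trace(root)                      # trace-normalized D
```

Setting all τ_t equal recovers the original MTFL update, so the self-paced variant is a drop-in replacement.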
Example 1: Self-Paced Mean Regularized Multitask Learning (spMMTL)
Mean Regularized Multitask Learning assumes that all task parameters are close to some fixed parameter w_0 in the parameter space. spMMTL learns \tau to select the easy tasks based on the distance of each task parameter w_t from w_0:

E_{MMTL,\lambda} = \arg\min_{\{w_0, w_1, \ldots, w_T\}, \tau \in \Omega} \sum_{t \in [T]} \big[ \tau_t L(y_t, f(X_t, w_t)) + \gamma \|w_t - w_0\|^2 \big] + \lambda \|\tau\|_1 \qquad (9)

In the above objective function, we get the closed-form solution w_0 = \frac{1}{T}\sum_{t=1}^{T} w_t, which is the mean of the task parameters.

Example 2: Self-Paced Multitask Feature Learning (spMTFL)
Multitask feature learning learns a common feature representation D shared across multiple related tasks. In addition to learning the task parameters and the shared feature representation, spMTFL learns \tau to select the easy tasks first, as determined by the learning parameter \lambda. The algorithm starts with these easy tasks to learn the shared feature representation, which is then used for solving progressively harder tasks.

E_{MTFL,\lambda} = \arg\min_{\{w_1, \ldots, w_T\}, D \in S^d_{++}, \tau \in \Omega} \sum_{t \in [T]} \tau_t L(y_t, f(X_t, w_t)) + \gamma \sum_{t \in [T]} \tau_t \langle w_t, D^{-1} w_t \rangle + \lambda r(\tau) \qquad (10)

The value of \tau_t determines the importance of a task in learning this shared feature representation, i.e., tasks with high probability contribute more towards learning D than tasks with low probability.

Example 3: Self-Paced Multitask Learning with Alternating Structure Optimization (spMTASO)
Alternating Structure Optimization learns a shared low-dimensional predictive structure U on a hypothesis space from multiple related tasks. This low-dimensional structure, along with the low-dimensional model parameters v_t, is learned gradually from easy tasks guided by \tau.

E_{MTASO,\lambda} = \arg\min_{\{w_1, \ldots, w_T\}, UU^\top = I_{h \times h}, \tau \in \Omega} \sum_{t \in [T]} \tau_t L(y_t, f(X_t, w_t)) + \gamma \sum_{t \in [T]} \tau_t \|w_t - U^\top v_t\|^2 + \lambda r(\tau) \qquad (11)

In this section, we briefly review the two learning methods most closely related to our proposed learning algorithm. Both methods learn from multiple tasks sequentially, in a specific order, either to improve the learning performance or to speed up the algorithm. Pentina et al. (2015) propose a curriculum learning method (CL) for multiple tasks that finds the best order in which tasks should be learned, based on training error. The tasks are solved sequentially in this order, transferring information from previously learned tasks to the next ones through shared task parameters. They show that this sequential learning of tasks in a meaningful order can be superior to solving the tasks simultaneously. The objective function of CL for learning the best task order and the task parameters is given as follows:

E_{CL} = \arg\min_{\{w_1, \ldots, w_T\}, \pi \in \Psi_T} \sum_{t \in [T]} L(y_{\pi(t)}, f(X_{\pi(t)}, w_{\pi(t)})) + \gamma \sum_{t \in [T]} \|w_{\pi(t)} - w_{\pi(t-1)}\|^2 \qquad (12)

where \Psi_T is the symmetric group of all permutations over [T]. Since minimizing with respect to all possible permutations \pi \in \Psi_T is an expensive combinatorial problem, they suggest a greedy, incremental procedure for approximating the task order. Their method shares with ours the motivation of learning from easier tasks first and then gradually adding more difficult ones, based on training errors. But unlike our proposed method, which utilizes shared knowledge from all previous tasks, their method does not allow sharing between different levels of task relatedness.
In addition, the Euclidean-distance-based regularization in their objective function forces the parameters of a newly learned task to be similar to those of its immediate predecessor. This more myopic approach can be a restrictive assumption for many applications.

Perhaps the most relevant work to ours in the context of lifelong learning is [Ruvolo and Eaton, 2013b], which learns a shared basis L from tasks that arrive sequentially. They propose an efficient online multitask learning algorithm (ELLA) that allows the transfer of knowledge from previously learned tasks to new tasks using this shared basis; the task parameters are represented as a sparse linear combination of the columns of the basis, w_t = L s_t. The motivations for ELLA and our method are significantly different. Whereas ELLA tries to achieve performance nearly identical to that of batch MTL with increased learning speed, our proposed method focuses on improving the learning performance over that of the original algorithm, with minimal changes to said original algorithm. Unlike our proposed method, ELLA cannot be easily generalized to other existing multitask problems, as it relies on efficient update equations specific to its own objective function.
All reported results in this section are averaged over random runs of the training data. Unless otherwise specified, all model parameters are chosen via cross-validation. For all experiments, we update the \tau values using Equation 8. We evaluate our self-paced multitask learning algorithm on three well-known multitask problems (MMTL, MTFL, MTASO), briefly discussed in the previous section. We also compare our results with independent task learning (ITL), where each task is learned independently, and single-task learning (STL), where we learn a single model by pooling together the data from all the tasks.

Synthetic data (syn1) consists of tasks that belong to groups of tasks, with a fixed number of training examples per task. We generate the task parameters as in [Kang et al., 2011]. We randomly select a subset of tasks and increase their variance (\sigma = 25), while the variances of the remaining tasks are set to be low (\sigma = 5), in order to simulate the difference between easy and hard tasks. With this setting, we expect that our self-paced learning algorithm should be able to learn the shared knowledge from the easier tasks and use this knowledge to improve the performance of the harder tasks.

Synthetic data (syn2) consists of tasks with the same number of training examples per task as before. We randomly generate a vector (s_1, s_2, s_3, \ldots) such that the parameter for each task t is given as w_t = (s_1, s_2, \ldots, s_t, 0, 0, \ldots, 0). The dataset is constructed in such a way that learning task t is easier than learning task t+1, and so on.

The results for syn1 and syn2 are shown in Table 1. We report the RMSE (mean and standard deviation) of our methods. All of our self-paced methods perform better than their baseline methods on average on both synthetic datasets. Figure 1 (bottom-left) shows the \tau learned using self-paced task selection (spMTFL) at each iteration.
We can see that the tasks are selected based on their difficulty and the number of features used in each task. Figure 1 (top-left) shows the task-specific test errors for the syn2 dataset (spMTFL vs. the corresponding baseline methods MTFL and ITL). Each red point in the plot compares the RMSE of ITL with that of spMTFL, and each blue point compares the RMSE of MTFL with that of spMTFL. Points above the line y = x show that the self-paced method does better than ITL or the MTL baseline. From the (MTFL vs. spMTFL) plot, we can see that our self-paced learning method spMTFL achieves significant improvement on harder tasks (blue points in the top-right) compared to the easier tasks (blue points in the bottom-left). Based on our learning procedure, these harder tasks must have been learned in the later part of the learning, and thus efficiently utilize the knowledge learned from the easier tasks to improve their performance. Similar behaviour can be observed in the other two plots. Note that some of the points fall slightly below the y = x line, but since the decrease in performance on these tasks is small, it has very little impact on the overall score. We believe this can be avoided by tuning a different regularization parameter \lambda_t for each task; however, this would increase the number of parameters to tune, in addition to the task weight parameters \tau.

We use the following benchmark real datasets for our experiments on self-paced multitask learning.
London School data (school) consists of examination scores of students from schools in London. Each school is considered as a task, and the feature set includes the year of the examination, four school-specific features, and three student-specific features. We replace each categorical feature with one binary variable for each possible feature value, as suggested in [Argyriou et al., 2008], with an additional feature to account for the bias term. We use the ten train-test splits that came with the dataset for our experiments.

Figure 1: Error of MTFL and ITL vs. error of spMTFL for the syn2 dataset (top-left), the school dataset (top-middle), and the cs dataset (top-right). Values of \hat{\tau} from spMTFL at each iteration for the syn2 dataset (bottom-left). Convergence of the algorithm with varying threshold \lambda (bottom-middle), from spMTFL on the school dataset: MTFL (RMSE = 12.13), spMTFL with \lambda = 50, 25, 10, 1, 0.1, 0.01 (RMSE = 10.96, 10.97, 10.98, 11.34, 12.35, 12.37, respectively). Convergence of the algorithm with different learning paces c = 2, 1.5, 1.1 (bottom-right), from spMTFL on the cs dataset. The experiment shows that c = 1.1 for the learning pace yields stable performance.

Computer Survey data (cs) was collected from the ratings given by students to each of several different personal computers. Each student here is considered as a single task, with a fixed number of observations per task. Each computer is represented by different features such as RAM, cache size, CPU speed, etc. We add an additional feature to account for the bias term.
Train-test splits are obtained by random selection, giving a fixed number of examples for training and the rest for the test set.

Sentiment Detection data (sentiment) contains reviews from several domains. The reviews are represented by bag-of-unigram/bigram TF-IDF features from a large dictionary. Each review is associated with a rating, and we select an equal number of reviews for each domain and create two tasks per domain based on two different rating thresholds, in order to represent the different levels of sentiment. This gives us a set of binary classification tasks. We use a subset of the reviews in each task for training and the rest for the test set.

Landmine Detection data (landmine) consists of 19 tasks collected from different landmine fields. Each task is a binary classification problem: landmines (+) or clutter (−), and each example consists of 9 features extracted from radar images. The landmine data is collected from two different terrains: tasks 1-10 are from highly foliated regions and tasks 11-19 are from desert regions, so the tasks naturally form two clusters. We use a fixed number of examples from each task for training and the rest as test data. We repeat the experiments on (stratified) splits to measure the performance reliably. Since the dataset is highly skewed, we use the AUC score to compare our results.

Table 1 summarizes the performance of our methods on the four real datasets. We can see that our proposed self-paced learning algorithm does well on almost all datasets. As in our synthetic experiments, we observe that spMTFL performs significantly better than MTFL, which is a state-of-the-art method for multitask problems. It is interesting to see that when the self-paced learning procedure doesn't help the original algorithm, it also does not perform worse than the baseline. In such cases, our self-paced learning algorithm gives equal probability to all the tasks (\tau_t = 1/T, \forall t \in [T]) within the first few iterations.
Thus the proposed self-paced methods reduce to their original methods, and the performance of the self-paced methods is on par with their baselines.

We also notice that if a dataset does not adhere to the assumptions of a model, such as the task parameters lying on a manifold or in a low-dimensional space, then our self-paced methods yield little improvement, as can be seen on cs (and also on sentiment for spMTASO). It is worth mentioning that our proposed self-paced multitask learning algorithm does exceptionally well on school, which is a benchmark dataset for multitask experiments in the existing literature [Agarwal et al., 2010; Kumar and Daume, 2012]. Our proposed methods achieve considerable improvement over their baselines on some experiments. Figure 1 (top-middle) and (top-right) show the task-specific errors for the school and cs datasets. We can see a similar pattern as in syn2: the easier tasks learned at an earlier stage help the harder tasks at later stages, as is evident from these plots.

Models    syn1          syn2          school         cs    sentiment    landmine
STL
ITL
MMTL
spMMTL    1.03 (0.05)
MTFL
spMTFL    0.73 (0.05)   2.34 (0.12)   10.99 (0.08)
spMTASO

Table 1: Average performance on six datasets: means and standard errors over random runs. We use RMSE as our performance measure for syn1, syn2, school, and cs, and area under the curve (AUC) for sentiment and landmine. Self-paced methods with the best performance against their corresponding MTL baselines (paired t-tests) are shown in boldface.

spMTFL vs. Sequential Learning Algorithms
Finally, we compare our self-paced multitask learning algorithm against sequential multitask learning algorithms: curriculum learning for multiple tasks [Pentina et al., 2015] and efficient lifelong learning [Ruvolo and Eaton, 2013b; Ruvolo and Eaton, 2013a]. We choose spMTFL for this comparison based on its overall performance in the previous experiments, and use the landmine dataset for evaluation. We use different variants of ELLA for a fair comparison against our proposed approach. The original ELLA algorithm assumes that the tasks arrive randomly and that the lifelong learner has no control over their order (ELLA-random). Ruvolo and Eaton (2013a) show that if the learner can choose the next task actively, it can improve the learning performance using as few tasks as possible. They proposed two active task selection procedures for choosing the next best task: 1) Information Maximization (ELLA-infomax) chooses the next task to maximize the expected information gain about the basis L; 2) Diversity (ELLA-diversity) chooses the next task as the one on which the current basis L performs worst. Both of these approaches select tasks that are significantly different from the previously learned tasks (active task selection), rather than a progression of tasks that build upon each other. Our proposed method selects the next task based on its training error and its relevance to the shared knowledge learned from the previous tasks (self-paced task selection).

Figure 2 shows the task-specific test performance results for this experiment on the landmine dataset. We compare our results from spMTFL against CL and the variants of ELLA, using the (1 − AUC) score for comparison. As in Figure 1, points above the line y = x show that spMTFL does better than the other sequential learning methods. We can see that spMTFL outperforms all the baselines on average. Compared to spMTFL, CL performs better on easier tasks but worse on harder tasks. On the other hand, the performance of the variants of ELLA on harder tasks is comparable to that of our self-paced method, but worse on some easier tasks.

Figure 2: Average performance on landmine for the sequential learning algorithms and spMTFL: means and standard errors over random runs. We use the (1 − AUC) score as our performance measure for comparison; mean AUC scores are shown in brackets: CL (75.40), ELLA-random (72.94), ELLA-infomax (75.92), ELLA-diversity (73.08), ELLA-diversity++ (71.74), MTFL (75.67).
In this work, we proposed a novel self-paced learning framework for multiple tasks that jointly learns the latent task weights and the knowledge shared across all the tasks. The proposed method iteratively updates the shared knowledge based on these task weights and thus improves the learning performance. By allowing \tau to take a probabilistic interpretation, we can easily see which tasks are easier to learn at any iteration, and prefer those for task selection. The effectiveness of our algorithm is empirically verified on several benchmark datasets. In future work, we plan to consider a stochastic version of this algorithm to update the shared knowledge base efficiently, and to study the algorithm's ability to handle outlier tasks.

References

[Agarwal et al., 2010] Arvind Agarwal, Samuel Gerber, and Hal Daume. Learning multiple tasks using manifold regularization. In Advances in Neural Information Processing Systems, pages 46–54, 2010.

[Ando and Zhang, 2005] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data.
Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.

[Argyriou et al., 2008] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[Baxter and others, 2000] Jonathan Baxter et al. A model of inductive bias learning. Journal of Artificial Intelligence Research (JAIR), 12:149–198, 2000.

[Caruana, 1997] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[Evgeniou and Pontil, 2004] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.

[Evgeniou and Pontil, 2007] A. Evgeniou and Massimiliano Pontil. Multi-task feature learning. Advances in Neural Information Processing Systems, 19:41, 2007.

[Jiang et al., 2015] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G. Hauptmann. Self-paced curriculum learning. In AAAI, volume 2, page 6, 2015.

[Kang et al., 2011] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 521–528, 2011.

[Kshirsagar et al., 2013] Meghana Kshirsagar, Jaime Carbonell, and Judith Klein-Seetharaman. Multitask learning for host–pathogen protein interactions. Bioinformatics, 29(13):i217–i226, 2013.

[Kumar and Daume, 2012] Abhishek Kumar and Hal Daume. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1383–1390, 2012.

[Kumar et al., 2010] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.

[Li et al., 2017] Changsheng Li, Fan Wei, Junchi Yan, Weishan Dong, Qingshan Liu, and Hongyuan Zha. Self-paced multi-task learning. In AAAI, pages 2175–2181, 2017.

[Maurer et al., 2013] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. Sparse coding for multitask and transfer learning. In ICML (2), pages 343–351, 2013.

[Murugesan et al., 2016] Keerthiram Murugesan, Hanxiao Liu, Jaime Carbonell, and Yiming Yang. Adaptive smoothed online multi-task learning. In Advances in Neural Information Processing Systems, pages 4296–4304, 2016.

[Pentina et al., 2015] Anastasia Pentina, Viktoriia Sharmanska, and Christoph H. Lampert. Curriculum learning of multiple tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5492–5500, 2015.

[Romera-Paredes et al., 2012] Bernardino Romera-Paredes, Andreas Argyriou, Nadia Berthouze, and Massimiliano Pontil. Exploiting unrelated tasks in multi-task learning. In AISTATS, volume 22, pages 951–959, 2012.

[Ruvolo and Eaton, 2013a] Paul Ruvolo and Eric Eaton. Active task selection for lifelong machine learning. In AAAI, 2013.

[Ruvolo and Eaton, 2013b] Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. ICML (1), 28:507–515, 2013.

[Weinberger et al., 2009] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.