Exploring and Predicting Transferability across NLP Tasks
Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, Mohit Iyyer
University of Massachusetts Amherst    Microsoft Research Montreal    Intuit AI
{tuvu,smaji,miyyer}@cs.umass.edu    andrew [email protected]
{tong.wang,tsendsuren.munkhdalai}@microsoft.com    {alsordo,adam.trischler}@microsoft.com

Abstract
Recent advances in NLP demonstrate the effectiveness of training large-scale language models and transferring them to downstream tasks. Can fine-tuning these models on tasks other than language modeling further improve performance? In this paper, we conduct an extensive study of the transferability between 33 NLP tasks across three broad classes of problems (text classification, question answering, and sequence labeling). Our results show that transfer learning is more beneficial than previously thought, especially when target task data is scarce, and can improve performance even when the source task is small or differs substantially from the target task (e.g., part-of-speech tagging transfers well to the DROP QA dataset). We also develop task embeddings that can be used to predict the most transferable source tasks for a given target task, and we validate their effectiveness in experiments controlled for source and target data size. Overall, our experiments reveal that factors such as source data size, task and domain similarity, and task complexity all play a role in determining transferability.
Introduction

With the advent of methods such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), the dominant paradigm for developing NLP models has shifted to transfer learning: first, pretrain a large language model, and then fine-tune it on the target dataset. Prior work has explored whether fine-tuning on intermediate source tasks before the target task can further improve this pipeline (Phang et al., 2018; Talmor and Berant, 2019; Liu et al., 2019a), but the conditions for successful transfer remain opaque, and choosing arbitrary source
* Part of this work was done during an internship at Microsoft Research.

[Figure 1 here. Pipeline steps: 1. given a target task of interest, compute a task embedding from BERT's layer-wise gradients; 2. identify the most similar source task embedding from a precomputed library; 3. fine-tune BERT on the selected source task; 4. fine-tune the resulting model on the target task.]
Figure 1: A demonstration of our task embedding pipeline. We first compute task embeddings from BERT's gradients for all 33 tasks in our empirical study. Then, given a target task, we identify the most similar source task (in this example, WikiHop) via cosine similarity of the task embeddings. Finally, we perform intermediate fine-tuning of BERT on the selected source task before fine-tuning on the target task. (Credit to Jay Alammar for creating the BERT image used in this figure.)

tasks can even adversely impact downstream performance (Wang et al., 2019b). Our work has two main contributions: (1) we perform a large-scale empirical study across 33 different NLP datasets to shed light on when intermediate fine-tuning helps, and (2) we develop task embeddings to predict which source tasks to use for a given target task. Our study includes over 3,000 combinations of tasks and data regimes within and across three broad classes of problems (text classification/regression, question answering, and sequence labeling), which is considerably more comprehensive than prior work (Wang et al., 2019a; Talmor and Berant, 2019; Liu et al., 2019a). Our results show that transfer learning is more beneficial than previously thought, especially for target tasks with limited training data, and even source tasks that are small or on the surface very different from the target task can result in transfer gains. While previous work has recommended using the amount of labeled data as a criterion to select source tasks (Phang et al., 2018), our analysis suggests that the similarity between the source and target tasks and domains is crucial for successful transfer, particularly in data-constrained regimes.

Motivated by these results, we move on to a more practical research question: given a particular target task, can we predict which source tasks (out of some predefined set) will yield the largest transfer learning improvement, especially in limited-data settings?
We address this challenge by learning embeddings of tasks that encode their individual characteristics (Figure 1). More specifically, we process all examples from a dataset through BERT and compute a task embedding based on the model's gradients with respect to the task-specific loss, following recent meta-learning work in computer vision (Achille et al., 2019). We empirically demonstrate the practical value of these task embeddings for selecting source tasks (via simple cosine similarity) that effectively transfer to a given target task. To the best of our knowledge, this is the first work within NLP that builds explicit representations of NLP tasks for meta-learning.

Our task library, which consists of pretrained models and task embeddings for the 33 NLP tasks studied in this work, will be made publicly available (https://github.com/ngram-lab/task-transferability), in addition to a codebase that computes task embeddings for new datasets and identifies source tasks that will likely improve downstream performance.

To better understand the relationships between different tasks in the transfer learning setting, we perform an empirical study with 33 tasks across three broad classes of problems: text classification/regression (CR), question answering (QA), and sequence labeling (SL). In each experiment, we follow the STILTs pipeline of Phang et al. (2018) by taking a pretrained BERT model, fine-tuning it on an intermediate source task, and then fine-tuning the resulting model on a target task.

We define a task as a dataset paired with an objective function. We use the BERT-Base, Uncased model, which has 12 layers, 768-d hidden size, 12 heads, and 110M total parameters.

We explore
Task                                              |Train|
text classification/regression (CR)
  SNLI (Bowman et al., 2015)                      570K
  MNLI (Williams et al., 2018)                    393K
  QQP
question answering (QA)
  SQuAD-2 (Rajpurkar et al., 2018)                162K
  NewsQA (Trischler et al., 2017)                 120K
  HotpotQA (Yang et al., 2018)                    113K
  SQuAD-1 (Rajpurkar et al., 2016)                108K
  DuoRC-p (Saha et al., 2018)                     100K
  DuoRC-s (Saha et al., 2018)                     86K
  DROP (Dua et al., 2019)                         77K
  WikiHop (Welbl et al., 2018)                    51K
  BoolQ (Clark et al., 2019)                      16K
  ComQA (Abujabal et al., 2019)                   11K
  CQ (Bao et al., 2016)                           2K
sequence labeling (SL)
  ST (Bjerva et al., 2016)                        43K
  CCG (Hockenmaier and Steedman, 2007)            40K
  Parent (Liu et al., 2019a)                      40K
  GParent (Liu et al., 2019a)                     40K
  GGParent (Liu et al., 2019a)                    40K
  POS-PTB (Marcus et al., 1993)                   38K
  GED (Yannakoudakis et al., 2011)                29K
  NER (Tjong Kim Sang and De Meulder, 2003)       14K
  POS-EWT (Silveira et al., 2014)                 13K
  Conj (Ficler and Goldberg, 2016)                13K
  Chunk (Tjong Kim Sang and Buchholz, 2000)       9K

Table 1: Datasets used in our experiments, grouped by task class and sorted by training dataset size.

in-class and out-of-class transfer in both data-rich and data-constrained regimes and demonstrate that positive transfer can occur in a more diverse array of settings than previously thought.
As all of our experiments rely on the pretrained BERT model, we choose tasks that can be solved without modifying the base architecture. We denote a dataset D = {(x_i, y_i)}_{i=1}^n, with n total examples of inputs x and associated outputs y. An input x can be either a single text (e.g., in sentence classification) or a concatenation of multiple segments (e.g., a question-passage pair in reading comprehension). We encode each input x as [CLS] w^1_1 w^1_2 ... w^1_{L_1} [SEP] w^2_1 w^2_2 ... w^2_{L_2}, where w^j_i is token i of the j-th segment of x, [CLS] is a special symbol for classification output, and
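The segment concatenation above can be sketched as follows. This is a toy, whitespace-tokenized illustration with our own helper name; real inputs go through BERT's WordPiece tokenizer, and production BERT also appends a trailing [SEP], which we omit to match the paper's schematic.

```python
def encode_input(segments):
    """Concatenate text segments in the paper's input format:
    [CLS] seg1 [SEP] seg2 ..., with one [SEP] between consecutive segments."""
    tokens = ["[CLS]"]
    for i, segment in enumerate(segments):
        if i > 0:
            tokens.append("[SEP]")
        tokens.extend(segment.split())
    return tokens

# A question-passage pair, as in reading comprehension.
encoded = encode_input(["who wrote it ?", "the author wrote it ."])
```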
[SEP] is a special symbol to separate any text segments if they exist in x. Finally, each task is solved by applying a classification layer over either the final [CLS] token representation (for CR) or the entire sequence of final-layer token representations (for QA or SL).

For both stages of fine-tuning, we follow Devlin et al. (2019) by backpropagating into all model parameters for a fixed number of epochs. While individual task performance can likely be further improved with more involved hyperparameter tuning for each experimental setting, we standardize hyperparameters across each of the three classes to cut down on computational expense, following prior work (Phang et al., 2018; Wang et al., 2019b).

Table 1 lists the 33 datasets we use in our experiments. We select these datasets by mostly following prior work: nine of the eleven CR tasks come from the GLUE benchmark (Wang et al., 2019b); all eleven QA tasks are from the MultiQA repository (Talmor and Berant, 2019); and all eleven SL tasks were also used by Liu et al. (2019a). We consider all possible pairs of source and target datasets, and each experiment is evaluated on the development set of the target dataset.

For each (source, target) dataset pair, we perform transfer experiments in three data regimes to examine the impact of source and target data size: full source → full target, full source → limited target, and limited source → limited target. In the full training regime, all training data for the associated task is used for fine-tuning. In the limited setting, we artificially limit the amount of training data by randomly selecting 1K training examples without replacement, following Phang et al. (2018). Since fine-tuning BERT can be unstable on small datasets (Devlin et al., 2019), we perform 20 random restarts for each experiment and report the mean across restarts (see Appendix B for variance statistics).
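As a concrete illustration, the limited-data protocol described above (1K examples drawn without replacement, resampled independently for each of the 20 restarts, with sub-1K datasets used in full) can be sketched as follows; the function name and per-restart seeding scheme are our own, not from the paper's released code.

```python
import random

def limited_sample(train_examples, restart, size=1000):
    """Draw the training subset for one random restart.

    Examples are sampled without replacement and resampled for every
    restart; tasks with fewer than `size` examples use the full set.
    """
    if len(train_examples) <= size:
        return list(train_examples)
    rng = random.Random(restart)  # one independent seed per restart
    return rng.sample(train_examples, size)

# One subset per restart, mirroring the paper's 20-restart protocol.
subsets = [limited_sample(range(5000), restart=r) for r in range(20)]
```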
We fine-tune all CR and QA tasks for three epochs, and SL tasks for six epochs, using the HuggingFace Transformers library (Wolf et al., 2019) and its recommended hyperparameters. We also experimented with freezing BERT's parameters and fine-tuning only the linear classification layer during both stages, which usually reduces overall performance. Appendix A contains more details about dataset characteristics and their associated evaluation metrics. The MultiQA repository is available at https://github.com/alontalmor/MultiQA. We resample 1K examples for each restart in the limited data setting. For tasks with fewer than 1K training examples, we use the full training dataset.

full source → full target (↓ src, tgt →: CR QA SL)
  CR  6.3 (11)  (10)  (10)
  QA  3.2 (10)  (11)  (9)
  SL  5.3 (8)   (10)  (11)

full source → limited target (↓ src, tgt →: CR QA SL)
  CR  56.9 (11)  (10)  (10)
  QA  44.3 (11)  (11)  (11)
  SL  45.6 (11)  (6)   (11)

limited source → limited target (↓ src, tgt →: CR QA SL)
  CR  23.7 (11)  (11)  (11)
  QA  37.3 (11)  (11)  (11)
  SL  29.3 (10)  (8)   (11)

Table 2: A summary of transfer results for each combination of the three task classes in the three data regimes. Rows denote source task classes while columns denote target task classes. Each cell represents the relative gain of the best source task in the source class for a given target task, averaged across all target tasks in the target class. In parentheses, we additionally report the number of target tasks (out of 11) for which at least one source task results in a positive transfer gain. The cells along the diagonal indicate in-class transfer.
As evaluation metrics are not consistent among the three classes of tasks (accuracy and F1 are most common across all tasks, while exact match is also used to evaluate QA tasks), we measure the impact of intermediate fine-tuning by computing the relative transfer gain g_{s→t} given a source task s and target task t. More concretely, if a baseline model that is only fine-tuned on the target dataset (i.e., without any intermediate fine-tuning) achieves a performance of p_t, while a transferred model achieves a performance of p_{s→t}, the relative transfer gain is g_{s→t} = (p_{s→t} − p_t) / p_t.

Table 2 contains the results of our transfer experiments across each combination of classes and data regimes. In each cell, we first compute the transfer gain of the best source task for each target task in a particular class, and then average across all target tasks in the same class (see Appendix B for tables for each individual task). We summarize our findings as follows:

• Contrary to prior belief, transfer gains are possible even when the source dataset is small.
• Out-of-class transfer succeeds in many cases, some of which are unintuitive.
• Factors other than source dataset size, such as the similarity between source and target tasks, matter more in low-data regimes.

In the rest of this section, we first provide a quick overview of the results before analyzing each of these three findings in more detail.
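In code, the relative transfer gain is a one-liner; the scores below are illustrative values, not results from the paper.

```python
def transfer_gain(p_target_only, p_transferred):
    """Relative transfer gain g_{s->t} = (p_{s->t} - p_t) / p_t."""
    return (p_transferred - p_target_only) / p_target_only

# e.g., a baseline score of 60.0 improved to 69.0 is a +15% relative gain
gain = transfer_gain(60.0, 69.0)
```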
In-class transfer: The diagonal of each block of Table 2 shows the results for in-class transfer, in which source tasks are from the same class as the target task. Across all three data regimes, most target tasks benefit from in-class transfer, and the average transfer gain is larger for CR and QA tasks than for SL tasks. Changing the data regimes significantly impacts the average transfer gain, which is lowest in the full source → full target regime (+5.4% average relative gain across all tasks) and highest in the full source → limited target regime (+47.0% average gain). In general, tasks with fewer training examples benefit the most from transfer, such as RTE (+17.0 accuracy gain) and CQ (+14.9 F1), and the best source tasks in the full source → full target regime tend to be data-rich tasks such as MNLI, SNLI, and SQuAD (Figure 2).

Out-of-class transfer:
Having analyzed the effects of in-class transfer in different data regimes, we turn now to out-of-class transfer, in which the source task comes from a different class than the target task. The off-diagonal entries of each block of Table 2 summarize our results. In general, we observe that most tasks benefit from out-of-class transfer, although the magnitude of the transfer gains is lower than for in-class transfer, and that CR and QA tasks benefit more than SL tasks (similar to our in-class transfer results). While some of the results are intuitive (e.g., SQuAD is a good source task for QNLI, which is an entailment task built from QA pairs), others are more difficult to explain (using part-of-speech tagging as a source task for DROP results in huge transfer gains in limited target regimes). We also observe that transfer learning in limited target regimes significantly reduces the variance of target task performance across random restarts in many cases, which is consistent with previous work (Phang et al., 2018) and shown in Appendix B.
Large source datasets are not always best for data-constrained target tasks: Phang et al. (2018) observe that source data size is a good heuristic for obtaining positive transfer gain. In the full source → limited target regime, we find to the contrary that the largest source datasets do not always result in the largest transfer gains. For CR tasks, MNLI/SNLI are the best source tasks for only four CR target tasks (three of which are textual entailment tasks), compared to seven in full source → full target. STS-B, which is much smaller than MNLI and SNLI, is the best source task for MRPC and QQP, while MRPC, an even smaller dataset, is the best source task for STS-B. As STS-B, QQP, and MRPC are all sentence similarity and paraphrase tasks, this result suggests that the similarity between the source and target tasks matters more for data-constrained target tasks. We observe similar task similarity patterns for QA (the best source task for WikiHop is the other multi-hop QA task, HotpotQA) and SL (POS-PTB is the best source task for POS-EWT, the only other part-of-speech tagging task). However, the large SQuAD 2.0 dataset is almost always the best source task within QA tasks. We hypothesize that another important factor, especially apparent in our QA tasks, is domain similarity (e.g., SQuAD, HotpotQA, DROP, and DuoRC were all built from Wikipedia).

When does transfer work with data-constrained source tasks?
We now turn to the limited source → limited target regime, which eliminates the source data size confound. For CR, STS-B is the best source task for six CR target tasks out of 11, including four textual entailment tasks (MNLI, QNLI, SNLI, SciTail), whereas MNLI/SNLI are the best source tasks for only two tasks (RTE, WNLI). This result suggests that source/target task similarity, which we found to be a factor in the full source → limited target regime, is not the only important factor for effective transfer in data-constrained scenarios. We hypothesize that the complexity of the source task can also play a role: perhaps regression objectives (as used in the STS-B task) are more useful for transfer learning than classification objectives (MNLI/SNLI). Unknown factors may also play a role: in QA, SQuAD is no longer the best source task for any target task, while NewsQA is the best source for five target tasks.

[Figure 2 here: three panels (full source → full target, full source → limited target, limited source → limited target), one violin per target task; legend: CR tasks, QA tasks, SL tasks, baseline (no transfer), task chosen by TASKEMB. Panel annotations: large datasets like MNLI, SNLI, and SQuAD are often the best source tasks; intermediate fine-tuning does not result in significant improvements for SL tasks in the full target data regime; a small dataset like STS-B is the best source for two CR targets, MRPC and QQP; DROP benefits significantly from SL source tasks; QA tasks are good sources for CR targets; SQuAD is no longer the best source task for any QA targets in the limited source regime.]
Figure 2: In these plots (best viewed zoomed in and with color), each violin corresponds to a target task in the specified data regime, the points within each violin represent individual source tasks, the color of each point denotes its task class, and the y-coordinate of each point represents the target task performance after transferring from the source. Above each violin, we provide the identity of both the best source task (the highest point within the violin) and the top-ranked source task identified by our TASKEMB method (the red star). The horizontal black line in each violin represents the baseline performance of BERT fine-tuned on the target task without intermediate fine-tuning. TASKEMB consistently selects source tasks that yield positive transfer, and often selects the best source task.

Predicting task transferability
The above analysis suggests that while many target tasks benefit from intermediate fine-tuning, no single factor (e.g., data size, task and domain similarity, task complexity) is predictive of transfer gain across all of our settings. Given a novel target task, how can we select a single source task that maximizes transfer gain? One straightforward but extremely expensive approach is to enumerate every possible source and target task combination, as typically done in both the previous section and in prior work (Wang et al., 2019b; Talmor and Berant, 2019; Liu et al., 2019a). Work on multi-task learning within NLP offers a more practical alternative by developing feature-based models to identify task and dataset characteristics that are predictive of task synergies (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017; Kerinec et al., 2018). Here, we take a different approach, inspired by recent computer vision methods (Achille et al., 2019), by computing vector-based task embeddings from layer-wise gradients of BERT. Our method outperforms heuristics like data size in terms of selecting the most transferable source tasks across all regimes and task classes.
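The selection procedure can be sketched as follows: embed every task, then rank candidate source tasks for a target by cosine similarity. This is a minimal illustration with toy vectors and our own helper names; in the paper, the embeddings are derived from BERT's gradients.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_sources(target_emb, source_embs):
    """Return source task names sorted by cosine similarity, best first."""
    return sorted(source_embs,
                  key=lambda name: cosine(target_emb, source_embs[name]),
                  reverse=True)

# Toy library: the target embedding is most similar to "task_a".
library = {"task_a": [1.0, 0.9, 0.0], "task_b": [0.0, 0.1, 1.0]}
ranking = rank_sources([1.0, 1.0, 0.1], library)
```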
We develop two methods for computing task embeddings from BERT. The first, TEXTEMB, is computed by pooling BERT's representations across an entire dataset, and as such captures properties of the text and domain. The second, TASKEMB, relies on the correlation between the fine-tuning loss function and the parameters of BERT, and encodes more information about the type of knowledge and reasoning required to solve the task.

TEXTEMB: As our analysis indicates that domain similarity is a relevant factor for transfer, we first explore a simple method based on averaging BERT token-level representations of the inputs. Given a dataset D, we process each input sample x through the pretrained BERT model without any fine-tuning and compute h_x, the average of the final-layer token-level representations. Finally, the text embedding is just the average of these pooled vectors over the entire dataset: TEXTEMB(D) = (1/|D|) Σ_{x∈D} h_x. This method captures linguistic properties of the input text x and does not depend on the training labels y.

TASKEMB: Ideally, we want a way of capturing task similarity beyond just the input properties represented by TEXTEMB. Following the methodology of TASK2VEC, which develops task embeddings for meta-learning over vision tasks (Achille et al., 2019), we create representations of tasks derived from the Fisher information matrix (or simply Fisher). The Fisher captures the curvature of the loss surface (the sensitivity of the loss to small perturbations of model parameters), which intuitively tells us which of the model parameters are most useful for the task and thus provides a rich source of knowledge about the task itself.

To begin, we fine-tune BERT on the training dataset of a given task, as in the baseline experiments of Section 2. The fine-tuned model without the final task-specific layer forms our feature extractor. Next, we feed the entire training dataset into the model and compute the task embedding based on the Fisher of the feature extractor's parameters θ, i.e., the expected covariance of the gradients of the log-likelihood with respect to θ:

F_θ = E_{x,y ∼ P_θ(x,y)} [∇_θ log P_θ(y|x) ∇_θ log P_θ(y|x)^T].

We compute the empirical Fisher, which uses the training labels instead of sampling from P_θ(x,y):

F_θ = (1/n) Σ_{i=1}^n [∇_θ log P_θ(y_i|x_i) ∇_θ log P_θ(y_i|x_i)^T],

and only consider the diagonal entries to reduce computational complexity. Additionally, we consider the Fisher F_φ with respect to the feature extractor's outputs (activations) φ, which encodes useful features about the inputs for solving the task. The diagonal F_φ is averaged across all tokens and also across all input samples.

While Fisher matrices computed from networks with different parameters are theoretically not comparable, we find empirically that computing TASKEMB from a fine-tuned task-specific BERT results in better correlations to task transferability than when using the frozen BERT.
We leave further exploration of this phenomenon to future work. (We additionally find that using a frozen BERT leads to degenerate results in data-constrained scenarios, since the Fisher can be noisy when trained with few samples.)

We explore task embeddings computed from the diagonal Fisher of different components of BERT, including the token embeddings, multi-head attention, feed-forward layers, and the layer output, performing layer-wise averaging. Since our base model is BERT, this method results in high-dimensional task embeddings, from 768-d for task embeddings computed from hidden representations to millions of dimensions for those computed through the subword embedding matrix. While one can optionally perform dimensionality reduction (e.g., through PCA), all of our experiments are conducted directly on the original task embeddings.
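To make the recipe concrete, the sketch below computes a diagonal empirical Fisher for a toy logistic-regression model standing in for the BERT feature extractor; for logistic regression, ∇_θ log P_θ(y|x) = (y − σ(θ·x))·x. All names are illustrative assumptions, and the paper's actual implementation operates on BERT's layer-wise gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def diagonal_fisher(theta, data):
    """Diagonal empirical Fisher: the per-parameter mean of the squared
    gradient of the log-likelihood over the training set."""
    fisher = [0.0] * len(theta)
    for x, y in data:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        grad = [(y - p) * xi for xi in x]  # d/dtheta of log P(y|x)
        for j, g in enumerate(grad):
            fisher[j] += g * g / len(data)
    return fisher  # this vector serves as the task embedding

data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
emb = diagonal_fisher([0.5, -0.5], data)
```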
We investigate whether a high similarity between two different task embeddings correlates with a high degree of transferability between those two tasks. Our evaluation centers around the meta-task of selecting the best source task for a given target task. Specifically, given a target task, we compute the cosine similarity between its task embedding t and the task embeddings for every other source task s_i in our task library. We then rank the source tasks in descending order of cosine similarity. This ranking is evaluated using two metrics: (1) the average rank ρ of the source task with the highest absolute transfer gain as determined by our experiments in Section 2, and (2) the Normalized Discounted Cumulative Gain (NDCG; Järvelin and Kekäläinen, 2002), a common information retrieval measure that evaluates the quality of the entire ranking, not just the rank of the best source task. The NDCG at position p is defined as NDCG_p = DCG_p(R_pred) / DCG_p(R_true), where R_pred and R_true are the predicted and gold rankings of the source tasks, respectively, and

DCG_p(R) = Σ_{i=1}^p (2^{rel_i} − 1) / log_2(i + 1),

where rel_i is the relevance (target performance) of the source task with rank i in the evaluated ranking R. An NDCG of 100% indicates a perfect ranking.

We compare rankings derived from TEXTEMB and TASKEMB to DATASIZE, a heuristic baseline that ranks all source tasks by the number of training examples. We include it because source data size is a major factor in our transfer results, particularly in the full source → full target regime. (Note that while task transferability, unlike cosine distance, is asymmetric, our preliminary experiments with asymmetric distance metrics did not yield better results, so we leave further exploration to future work.)
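A minimal sketch of the NDCG computation, with our own helper names; the relevance values play the role of target-task performance, and the ideal ordering serves as the gold ranking R_true.

```python
import math

def dcg(relevances):
    """DCG with the (2^rel - 1) / log2(i + 1) weighting, ranks starting at 1."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(relevances_in_predicted_order):
    """NDCG_p = DCG_p(R_pred) / DCG_p(R_true); 1.0 means a perfect ranking."""
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    return dcg(relevances_in_predicted_order) / dcg(ideal)
```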
Aggregating similarity signals from embedding spaces: For our TASKEMB approach, we aggregate rankings from all of the different components of BERT rather than evaluate each component-specific ranking separately. We expect that task embeddings derived from different components might contain complementary information about the task, which motivates this decision. Concretely, given a target task t, assume that r_1, ..., r_c are the rank scores assigned to a source task s by c different components of BERT. Then, the aggregated score is computed according to the reciprocal rank fusion algorithm (Cormack et al., 2009): RRF(s) = Σ_{i=1}^c 1/(k + r_i). We also use this approach to ensemble different task embedding methods, which results in TEXT+TASK.
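A sketch of reciprocal rank fusion as used above, with k = 60 following Cormack et al. (2009); the dict-based interface is our own illustrative choice.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings of the same candidates.

    `rankings` is a list of dicts mapping a source-task name to its
    1-based rank under one component (or one embedding method).
    Returns candidates sorted by fused RRF score, best first.
    """
    scores = {}
    for ranking in rankings:
        for task, rank in ranking.items():
            scores[task] = scores.get(task, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two components disagree; fusion favors the candidate ranked well by both.
fused = reciprocal_rank_fusion([{"a": 1, "b": 2, "c": 3},
                                {"b": 1, "a": 3, "c": 2}])
```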
ASK E MB in more detail. Baselines: D ATA S IZE is a good heuristic whenthe full source training data is available, but it strug-gles in all out-of-class transfer scenarios as well ason sequence labeling tasks, for which most datasetscontain roughly the same number of examples (Ta-ble 1). T EXT E MB performs better than D ATA -S IZE on average, especially within the limited dataregimes. Interestingly, T
EXT E MB underperformssignificantly on CR tasks compared to QA and SL.We theorize that this effect is partly due to therelative homogeneity of the QA and SL datasets(i.e., many QA datasets use Wikipedia while manySL tasks are extracted from the Penn Treebank)compared to the more diverse CR datasets. If T EX - T E MB is capturing mainly domain similarity, thenit may struggle when that is not a relevant transferfactor. We observe that rankings derived from certain compo-nents are more useful than others (e.g., token embeddings arecrucial for classification), but aggregating across all compo-nents consistently outperforms individual ones. We use k = 60 as in Cormack et al. (2009). All methods obtain a higher NDCG score on SL tasksin the full source → full target regime because there is littledifference in target task performance between source taskshere (see Figure 2a), and thus the rankings are not penalizedheavily. ull source → full target full source → lim. target lim. source → lim. target in-class (10) all-class (32) in-class (10) all-class (32) in-class (10) all-class (32) Method ρ NDCG ρ NDCG ρ NDCG ρ NDCG ρ NDCG ρ NDCG classification / regression D ATA S IZE
EXT E MB ASK E MB T EXT +T ASK question answering D ATA S IZE
EXT E MB ASK E MB EXT +T ASK sequence labeling D ATA S IZE
EXT E MB ASK E MB T EXT +T ASK
Table 3: To evaluate our task embedding methods, we measure the average rank ( ρ ) that they assign to the bestsource task (i.e., the one that results in the largest transfer gain) across target tasks, as well as the average NDCGmeasure of the overall ranking’s quality. Combining the complementary signals in T ASK E MB and T EXT E MB consistently decreases ρ (lower is better) and increases NDCG across all settings, and both methods in isolationgenerally perform better than the D ATA S IZE heuristic. T ASK E MB improves transferability prediction: Table 3 shows that T
ASK E MB can substantiallyboost the quality of the rankings, frequently out-performing both T EXT E MB and D ATA S IZE acrossdifferent classes of problems, data regimes, andtransfer scenarios. These results indicate that thetask similarity between the computed embeddingsis a robust predictor of effective transfer. The en-semble of T
EXT + T
ASK results in further slightimprovements, but the small magnitude of thesegains suggests that T
ASK E MB partially encodes do-main similarity. In the limited source → limitedtarget , where the D ATA S IZE heuristic does not ap-ply, T
ASK E MB still performs strongly, although notas well as in the full source data regimes. Figure 2shows that our task embedding methods usuallyselect the best or near the best available source taskfor a given target task across data regimes. Understanding the task embedding space:
Togain further insight into the kinds of informationthat are encoded by different task embeddings,Figure 3 visualizes the different task spaces inthe full source → full target regime using theFruchterman-Reingold force-directed placement al-gorithm (Fruchterman and Reingold, 1991).The task space of T EXT E MB (Figure 3, top)shows that datasets with similar sources are nearone another: in QA, tasks built from web snip- pets are closely linked (CQ and ComQA), whilein SL, tasks extracted from Penn Treebank areclustered together (CCG, POS-PTB, Parent, GPar-ent, GGParent, Chunk, and Conj). Additionally,the SQuAD datasets are strongly linked to QNLI,which was created by converting SQuAD ques-tions. T ASK E MB also captures the dataset domainto some extent (Figure 3, bottom), but it also en-codes task similarity: for example, POS-PTB isnow moved closer to POS-EWT, another part-of-speech tagging task that uses a different data source.Neither method captures some of the unintuitivecases in the low-data regimes, such as STS-B’shigh transferability to other CR target tasks, orthat DROP benefits most from SL tasks in low-dataregimes (see Tables 8, 9, 26, and 27 in Appendix B).T ASK E MB and T EXT E MB clearly do not captureall of the factors that influence task transferability,which motivates the development of more sophisti-cated task embedding techniques in the future. We build on existing work in (1) exploring transferrelationships between NLP tasks, and (2) identify-ing beneficial transferable tasks via meta-learning.
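As noted in the footnote above, per-component rankings are aggregated with reciprocal rank fusion (Cormack et al., 2009), using k = 60. A minimal sketch of that fusion step; the function name and input format here are our own, not the paper's code:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings of the same candidate source tasks.

    rankings: list of lists, each an ordering of source tasks
    (best first). Each task receives a score of 1 / (k + rank),
    summed over rankings, with rank starting at 1; tasks are
    returned sorted by total score (best first).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, task in enumerate(ranking, start=1):
            scores[task] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two component rankings that disagree; fusion favors the task
# that is ranked consistently high across both:
fused = reciprocal_rank_fusion([["MNLI", "SNLI", "QQP"],
                                ["SNLI", "QQP", "MNLI"]])
# → ["SNLI", "MNLI", "QQP"]
```

The large constant k damps the influence of any single ranking's top position, which is why a task ranked 1st and 3rd can lose to one ranked 2nd and 1st.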
Transferability between NLP tasks:
Sharing knowledge across different tasks, as in multi-task and transfer learning, often improves over standard single-task learning (Ruder, 2017). Within multi-task learning, several works (Luong et al., 2016; Mou et al., 2016; Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017; Kerinec et al., 2018; Changpinyo et al., 2018; Liu et al., 2019b) combine related tasks for better regularization and transfer. More related to our work, Phang et al. (2018) explore intermediate fine-tuning and find that transferring from closely-related data-rich source tasks boosts target task performance for text classification, while Liu et al. (2019a) observe similar gains for sequence labeling tasks. Expanding from single to multi-source transfer, Talmor and Berant (2019) show that pretraining on multiple related tasks improves generalization on QA tasks. Nevertheless, exploiting synergies between tasks remains difficult, with many combinations of tasks negatively impacting downstream performance (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017; McCann et al., 2018; Wang et al., 2019a), and the factors that determine successful transfer still remain murky.

Figure 3: A 2D visualization of the task spaces of TextEmb (upper) and TaskEmb (lower). TextEmb captures a lot of domain similarity (e.g., the Penn Treebank SL tasks are highly interconnected), while TaskEmb focuses more on task similarity (the two part-of-speech tagging tasks are highly similar despite their domain dissimilarity).

Identifying beneficial relationships among tasks:
To predict transferable tasks, some methods (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017) rely on features derived from dataset characteristics and learning curves, which is a time-consuming process that may not generalize well across classes of problems (Kerinec et al., 2018). Recent work on task embeddings in computer vision offers a more principled way to encode tasks for meta-learning (Zamir et al., 2018; Achille et al., 2019; Yan et al., 2020). Taskonomy (Zamir et al., 2018) models the underlying structure among visual tasks to reduce the need for supervision, while Task2Vec (Achille et al., 2019) uses a frozen feature extractor pretrained on ImageNet to represent visual tasks in a topological space (analogous to our method's reliance on BERT). Finally, concurrent work in NLP augments a generative model for multi-task language generation with a task embedding space for modeling latent skills (Cao and Yogatama, 2020).
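Task2Vec-style embeddings represent a task by the diagonal of the Fisher information of a fixed probe network's parameters, so that tasks inducing similar gradients get similar embeddings. The toy sketch below illustrates this idea with a logistic-regression probe standing in for BERT; the synthetic data, function names, and probe choice are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fisher_task_embedding(X, y, w):
    """Diagonal empirical Fisher of a logistic-regression probe.

    X: (n, d) features from a frozen feature extractor,
    y: (n,) binary labels for the task, w: (d,) probe weights.
    Returns the per-parameter mean squared gradient of the
    log-likelihood, used as the task's embedding vector.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))     # predicted P(y = 1 | x)
    grads = (y - p)[:, None] * X         # per-example d logL / dw
    return (grads ** 2).mean(axis=0)     # diagonal Fisher estimate

def task_similarity(e1, e2):
    """Cosine similarity between two task embeddings."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

# Two closely related synthetic "tasks" sharing the same features:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
w = rng.normal(size=8)
y_a = (X @ w > 0.0).astype(float)
y_b = (X @ w > 0.5).astype(float)
sim = task_similarity(fisher_task_embedding(X, y_a, w),
                      fisher_task_embedding(X, y_b, w))
```

Because Fisher diagonals are nonnegative, the cosine similarity between two such embeddings always lands in (0, 1], and related tasks score near the top of that range.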
Conclusion

In this work, we conduct a large-scale empirical study of the transferability between 33 NLP tasks across three broad classes of problems, encompassing classification, question answering, and sequence labeling. We show that the benefits of transfer learning are more pronounced than previously thought, especially when target training data is limited, and we develop methods that learn vector representations of tasks that can be used to reason about the relationships between them. These task embeddings allow us to predict source tasks that will positively transfer to a given target task. Our analysis suggests that data size, the similarity between the source and target tasks and domains, and task complexity are crucial for effective transfer, particularly in data-constrained regimes.
Acknowledgments
We thank Yoshua Bengio and researchers at Microsoft Research Montreal for valuable feedback on this project. We also thank Kalpesh Krishna, Nader Akoury, Shiv Shankar, and the UMass NLP group for many insightful discussions. We are grateful to Nelson Liu and Alon Talmor for sharing the QA and SL datasets. Finally, we thank Peter Potash for his experimentation efforts. TV and MI were supported by an Intuit AI Award for this project.

References
Abdalghani Abujabal, Rishiraj Saha Roy, Mohamed Yahya, and Gerhard Weikum. 2019. ComQA: A community-sourced dataset for complex factoid question answering with paraphrase clusters. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2019), pages 307–317.

Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The Parallel Meaning Bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), pages 242–247.

Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C. Fowlkes, Stefano Soatto, and Pietro Perona. 2019. Task2Vec: Task embedding for meta-learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2019), pages 6430–6439.

Junwei Bao, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2016. Constraint-based question answering with knowledge graph. In Proceedings of the International Conference on Computational Linguistics (COLING 2016), pages 2503–2514.

Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), pages 164–169.

Johannes Bjerva, Barbara Plank, and Johan Bos. 2016. Semantic tagging with deep residual networks. In Proceedings of the International Conference on Computational Linguistics (COLING 2016), pages 3531–3541.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 632–642.

Kris Cao and Dani Yogatama. 2020. Modelling latent skills for multitask language generation. arXiv preprint arXiv:2002.09543.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017), pages 1–14.

Soravit Changpinyo, Hexiang Hu, and Fei Sha. 2018. Multi-task learning for sequence tagging: An empirical study. In Proceedings of the International Conference on Computational Linguistics (COLING 2018), pages 2965–2977.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2019), pages 2924–2936.

Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pages 758–759.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Proceedings of the International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment (MLCW 2006), pages 177–190.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2019), pages 4171–4186.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP 2005).

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2019), pages 2368–2378.

Jessica Ficler and Yoav Goldberg. 2016. Coordination annotation extension in the Penn Tree Bank. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 834–842.

Thomas M. J. Fruchterman and Edward M. Reingold. 1991. Graph drawing by force-directed placement. Software: Practice and Experience, 21(11):1129–1164.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.

Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446.

Emma Kerinec, Chloé Braud, and Anders Søgaard. 2018. When does deep multi-task learning work for loosely related document classification tasks? In Proceedings of the EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 1–8.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Proceedings of the Conference on Artificial Intelligence (AAAI 2018).

Hector Levesque. 2011. The Winograd Schema Challenge. In Proceedings of the AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2019), pages 1073–1094.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 4487–4496.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In Proceedings of the International Conference on Learning Representations (ICLR 2016).

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), pages 44–53.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), pages 479–489.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018), pages 2227–2237.

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2018), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), pages 2383–2392.

Marek Rei and Helen Yannakoudakis. 2016. Compositional sequence labeling models for error detection in learner writing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1181–1191.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2018), pages 1683–1693.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2014), pages 2897–2904.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1631–1642.

Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 4911–4921.

Alon Talmor, Mor Geva, and Jonathan Berant. 2017. Evaluating semantic parsing against a simple web-based question answering model. In Proceedings of the Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 161–167.

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the Conference on Computational Natural Language Learning and the Learning Language in Logic Workshop (CoNLL-LLL 2000), pages 127–132.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Conference on Natural Language Learning (CoNLL 2003), pages 142–147.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the Workshop on Representation Learning for NLP, pages 191–200.

Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, and Samuel R. Bowman. 2019a. Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 4465–4476.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the International Conference on Learning Representations (ICLR 2019).

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics (TACL), 7:625–641.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics (TACL), 6:287–302.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018), pages 1112–1122.

Thomas Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Xi Yan, David Acuna, and Sanja Fidler. 2020. Neural Data Server: A large-scale search engine for transfer learning data. arXiv preprint arXiv:2001.02799.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 2369–2380.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011), pages 180–189.

Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. 2018. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), pages 3712–3722.

Appendices

A Tasks and datasets
In this work, we experiment with 33 datasets across three broad classes of problems (text classification/regression, question answering, and sequence labeling). Please see below for details.
Classification/regression (eleven tasks):
We use the nine GLUE datasets (Wang et al., 2019b), including grammatical acceptability judgments (CoLA; Warstadt et al., 2019); sentiment analysis (SST-2; Socher et al., 2013); paraphrase identification (MRPC; Dolan and Brockett, 2005); semantic similarity with STS-Benchmark (STS-B; Cer et al., 2017) and Quora Question Pairs (QQP); natural language inference (NLI) with Multi-Genre NLI (MNLI; Williams et al., 2018), SQuAD (Rajpurkar et al., 2016) converted into Question-answering NLI (QNLI; Wang et al., 2019b), Recognizing Textual Entailment 1, 2, 3, and 5 (RTE; Dagan et al., 2006, et seq.), and the Winograd Schema Challenge (Levesque, 2011) recast as Winograd NLI (WNLI). Additionally, we include the Stanford NLI dataset (SNLI; Bowman et al., 2015) and the science QA dataset (Khot et al., 2018) converted into NLI (SciTail). We report F1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. For MNLI, we report the average score on the "matched" and "mismatched" development sets.
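The Spearman correlation reported for STS-B is the Pearson correlation computed on rank vectors. A minimal sketch without tie handling (which suffices when scores are distinct; the function name is ours):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of rank vectors.

    Assumes no ties; argsort of argsort converts each value to
    its 0-based rank within the list.
    """
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# A monotone but nonlinear relationship still gives rho = 1.0,
# which is why Spearman suits graded-similarity predictions:
rho = spearman_rho([0.1, 0.4, 0.2, 0.9], [1.0, 8.0, 2.0, 9.5])
# → 1.0
```

Production evaluation would use a tie-aware implementation such as `scipy.stats.spearmanr`, which averages the ranks of tied values.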
Question answering (eleven tasks):
We use eleven QA datasets from the MultiQA (Talmor and Berant, 2019) repository, including the Stanford Question Answering datasets SQuAD-1 and SQuAD-2 (Rajpurkar et al., 2016, 2018); NewsQA (Trischler et al., 2017); HotpotQA (Yang et al., 2018), the version where the context includes 10 paragraphs retrieved by an information retrieval system; the Natural Yes/No Questions dataset (BoolQ; Clark et al., 2019); the Discrete Reasoning Over Paragraphs dataset (DROP; Dua et al., 2019), for which we only use the extractive examples in the original dataset but evaluate on the entire development set following Talmor and Berant (2019); WikiHop (Welbl et al., 2018); the DuoRC Self (DuoRC-s) and DuoRC Paraphrase (DuoRC-p) datasets (Saha et al., 2018), where the questions are taken from either the same version or a different version of the document from which the questions were asked, respectively; ComplexQuestions (CQ; Bao et al., 2016; Talmor et al., 2017); and ComQA (Abujabal et al., 2019), where contexts are not provided but the questions are augmented with web snippets retrieved from the Google search engine (Talmor and Berant, 2019). We report F1 scores for all QA tasks.

Footnotes: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs ; https://github.com/alontalmor/MultiQA
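The F1 reported for QA is the standard SQuAD-style token-overlap F1 between a predicted and a gold answer span. A minimal sketch, omitting the official script's answer normalization (lowercasing, stripping articles and punctuation); the function name is ours:

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer.

    Uses the multiset intersection of whitespace tokens; precision
    is overlap over predicted tokens, recall over gold tokens.
    """
    pred_toks = prediction.split()
    gold_toks = gold.split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# Partial credit for an answer that adds an extra token:
score = token_f1("the eiffel tower", "eiffel tower")
# → 0.8  (precision 2/3, recall 1)
```

On datasets with multiple gold answers per question, the official metric takes the maximum F1 over the gold set.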
Sequence labeling (eleven tasks):
We experiment with the eleven sequence labeling tasks used by Liu et al. (2019a), including CCG supertagging with CCGbank (CCG; Hockenmaier and Steedman, 2007); part-of-speech tagging with the Penn Treebank (POS-PTB; Marcus et al., 1993) and the Universal Dependencies English Web Treebank (POS-EWT; Silveira et al., 2014); syntactic constituency ancestor tagging, i.e., predicting the constituent label of the parent (Parent), grandparent (GParent), and great-grandparent (GGParent) of each word in the PTB phrase-structure tree; semantic tagging (ST; Bjerva et al., 2016; Abzianidze et al., 2017); syntactic chunking with the CoNLL-2000 shared task dataset (Chunk; Tjong Kim Sang and Buchholz, 2000); named entity recognition with the CoNLL-2003 shared task dataset (NER; Tjong Kim Sang and De Meulder, 2003); grammatical error detection with the First Certificate in English dataset (GED; Yannakoudakis et al., 2011; Rei and Yannakoudakis, 2016); and conjunct identification, i.e., identifying the tokens that comprise the conjuncts in a coordination construction, with the coordination-annotated PTB dataset (Conj; Ficler and Goldberg, 2016). We report F1 scores for all SL tasks.
B Full results for fine-tuning and transfer learning across tasks
For both fine-tuning and transfer learning, we use the same architecture across tasks, apart from the task-specific output layer. The feature extractor, i.e., BERT, is pretrained while the task-specific output layer is randomly initialized for each task. All the parameters are fine-tuned end-to-end. An alternative approach is to keep the feature extractor frozen during fine-tuning. We find that fine-tuning the whole model for a given task leads to better performance in most cases, except for WNLI and DROP, possibly because of their adversarial nature (see Tables 4, 5, and 6). In our experiments, we follow the fine-tuning recipe of Devlin et al. (2019), i.e., only fine-tuning for a fixed number of epochs t for each class of problems. We develop our infrastructure using HuggingFace's Transformers (Wolf et al., 2019) and use the hyperparameters recommended by the library for each class.

We show the full results for fine-tuning and transfer learning across tasks from Table 4 to Table 33. Below we describe the setting for these tables in more detail.

In Tables 4, 5, and 6, we report the results of fine-tuning BERT (without any intermediate fine-tuning) on the 33 NLP tasks studied in this work. We perform experiments in two data regimes: full and limited. In the full regime, all training data for the associated task is used, while in the limited setting, we artificially limit the amount of training data by randomly selecting 1K training examples without replacement following Phang et al. (2018). For each experiment in the limited regime, we perform 20 random restarts (1K examples are resampled for each restart) and report the mean and standard deviation. We show the results after each training epoch t.

For our transfer experiments, we consider every possible pair of (source, target) tasks within and across classes of problems in the three data regimes described in 2.1.1, which results in 3267 combinations of tasks and data regimes. We follow the transfer recipe of Phang et al. (2018) by first fine-tuning BERT on the source task (intermediate fine-tuning) before fine-tuning on the target task. For both stages, we only perform training for a fixed number t of epochs following previous work (Devlin et al., 2019; Phang et al., 2018), and for each task, we use the same value of t as in our fine-tuning experiments.

From Table 7 to Table 15, we show our in-class transfer results for each combination of (source, target) tasks, in which source tasks come from the same class as the target task. In each table, rows denote source tasks while columns denote target tasks. Each cell represents the target task performance of the transferred model from the associated source task to the associated target task. The orange-colored cells along the diagonal indicate the results of fine-tuning BERT on target tasks without any intermediate fine-tuning. Positive transfers are shown in blue and the best results are highlighted in bold (blue). For transfer results in the limited setting, we report the mean and standard deviation across 20 random restarts.

Finally, from Table 16 to Table 33, we present our out-of-class transfer results, in which source tasks come from a different class than the target task. In each table, results are shown in a similar way as above, except that the orange-colored row Baseline shows the results of fine-tuning BERT on target tasks without any intermediate fine-tuning.

Table 4: Fine-tuning results for classification/regression tasks. [Numeric body not recoverable from this extraction.]
Data regimefull limited frozen BERT unfrozen BERT unfrozen BERTt = 1 t = 2 t = 3 t = 1 t = 2 t = 3 t = 1 t = 2 t = 3SQuAD-1 10.6 12.1 13.0 86.8 87.7 87.9 12.5 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 5: Fine-tuning results for question answering tasks. ask
Data regimefull limited frozen BERT unfrozen BERT unfrozen BERTt = 2 t = 4 t = 6 t = 2 t = 4 t = 6 t = 2 t = 4 t = 6CCG 39.7 44.9 48.1 95.2 95.5 95.6 11.1 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 6: Fine-tuning results for sequence labeling tasks. ask CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE WNLI SNLI SciTailCoLA 51.0 92.2 86.6 86.4
QNLI 49.9 92.5 86.6
Table 7: In-class transfer results for classification/regression tasks in full source → full target . ask CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE WNLI SNLI SciTailCoLA 4.7 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± MNLI 1.0 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 8: In-class transfer results for classification/regression tasks in full source → limited target . ask CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE WNLI SNLI SciTailCoLA 4.7 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± QQP 3.2 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 9: In-class transfer results for classification/regression tasks in limited source → limited target . ask SQuAD-1 SQuAD-2 NewsQA HotpotQA BoolQ DROP WikiHop DuoRC-p DuoRC-s CQ ComQASQuAD-1 87.9 73.4 65.5 70.1 71.0 26.9 63.7 51.1 62.9 45.2 64.8SQuAD-2 87.8 71.9 HotpotQA 88.6 72.8 64.8 67.9 73.1 26.1
DuoRC-p 88.1 71.7 64.6 68.4 71.5 23.9 63.3 50.6 63.1 44.1 65.1DuoRC-s 88.5 72.6 64.5 69.0 71.1 24.3 63.9
Table 10: In-class transfer results for question answering tasks in full source → full target . ask SQuAD-1 SQuAD-2 NewsQA HotpotQA BoolQ DROP WikiHop DuoRC-p DuoRC-s CQ ComQASQuAD-1 26.8 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± BoolQ 26.6 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 11: In-class transfer results for question answering tasks in full source → limited target . ask SQuAD-1 SQuAD-2 NewsQA HotpotQA BoolQ DROP WikiHop DuoRC-p DuoRC-s CQ ComQASQuAD-1 26.8 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± HotpotQA 59.4 ± ± ± ± ± ± ± ± ± ± ± BoolQ 32.4 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 12: In-class transfer results for question answering tasks in limited source → limited target . ask CCG POS-PTB POS-EWT Parent GParent GGParent ST Chunk NER GED ConjCCG 95.6 96.7 96.4 95.3 91.8 89.6 95.8 97.7 94.0 45.8 90.3POS-PTB GGParent 95.5 96.6 96.5 95.4 91.9 89.5 95.8 97.5 94.5 46.5 90.8ST 95.5 96.6 96.5 95.1 91.6 89.3 95.8 96.9
Table 13: In-class transfer results for sequence labeling tasks in full source → full target . ask CCG POS-PTB POS-EWT Parent GParent GGParent ST Chunk NER GED ConjCCG 53.2 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ST ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 14: In-class transfer results for sequence labeling tasks in full source → limited target . ask CCG POS-PTB POS-EWT Parent GParent GGParent ST Chunk NER GED ConjCCG 53.2 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ST 67.5 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 15: In-class transfer results for sequence labeling tasks in limited source → limited target . ask CoLA SST-2 MRPC STS-B QQP MNLI QNLI RTE WNLI SNLI SciTailBaseline 51.0 91.9 84.0 85.9 87.3 84.2 91.4 60.6 45.1 90.7 93.9SQuAD-1 52.4 92.1 87.0 WikiHop 49.2 91.9 84.6 86.8 86.8 83.7 90.7 66.1 38.0 90.7 93.5DuoRC-p 42.4 92.2 86.3 87.3 86.7 83.4 90.9 62.8 36.6 90.5 92.5DuoRC-s 48.8 91.5 86.4 87.9 87.1 83.6 90.8 67.1 42.3 90.6 93.9CQ 52.1 91.9 85.4 86.9 86.9 84.0 90.6 68.2 45.1 90.8 93.6ComQA 49.5
Table 16: Out-of-class transfer results from question answering tasks to classification/regression tasks in full source → full target. (Remaining rows not recoverable from this extraction.)

Table 17: Out-of-class transfer results from question answering tasks to classification/regression tasks in full source → limited target. (Per-cell values not recoverable from this extraction.)

Table 18: Out-of-class transfer results from question answering tasks to classification/regression tasks in limited source → limited target. (Per-cell values not recoverable from this extraction.)

Task      CoLA  SST-2  MRPC  STS-B  QQP   MNLI  QNLI  RTE   WNLI  SNLI  SciTail
Baseline  51.0  91.9   84.0  85.9   87.3  84.2  91.4  60.6  45.1  90.7  93.9
CCG       46.2  90.5   83.7  86.3   86.4  83.4  90.2  61.7  35.2  90.6  93.3
POS-PTB   39.7  91.2   85.7  86.2   86.9  82.9  90.3  61.7  42.3  (row truncated)

Table 19: Out-of-class transfer results from sequence labeling tasks to classification/regression tasks in full source → full target. (Remaining rows not recoverable from this extraction.)

Table 20: Out-of-class transfer results from sequence labeling tasks to classification/regression tasks in full source → limited target. (Per-cell values not recoverable from this extraction.)
Table 21: Out-of-class transfer results from sequence labeling tasks to classification/regression tasks in limited source → limited target. (Per-cell values not recoverable from this extraction.)

Task      SQuAD-1  SQuAD-2  NewsQA  HotpotQA  BoolQ  DROP  WikiHop  DuoRC-p  DuoRC-s  CQ    ComQA
Baseline  87.9     71.9     64.1    67.9      65.7   22.4  62.8     50.6     63.3     30.5  63.2
CoLA      87.8     70.1     64.6    68.2      64.9   22.3  62.9     51.0     (row truncated)

Table 22: Out-of-class transfer results from classification/regression tasks to question answering tasks in full source → full target. (Remaining rows not recoverable from this extraction.)

Table 23: Out-of-class transfer results from classification/regression tasks to question answering tasks in full source → limited target. (Per-cell values not recoverable from this extraction.)

Table 24: Out-of-class transfer results from classification/regression tasks to question answering tasks in limited source → limited target. (Per-cell values not recoverable from this extraction.)
Task      SQuAD-1  SQuAD-2  NewsQA  HotpotQA  BoolQ  DROP  WikiHop  DuoRC-p  DuoRC-s  CQ    ComQA
Baseline  87.9     71.9     64.1    67.9      65.7   22.4  62.8     50.6     63.3     30.5  63.2
CCG       87.0     68.1     63.8    66.3      65.5   22.0  62.2     49.7     62.1     30.5  61.1
POS-PTB   87.4     70.2     62.2    65.8      64.7   21.6  62.2     49.7     63.5     28.4  62.8
POS-EWT   85.9     66.7     62.6    66.2      65.4   22.0  62.6     50.2     (row truncated)

Table 25: Out-of-class transfer results from sequence labeling tasks to question answering tasks in full source → full target. (Remaining rows not recoverable from this extraction.)

Table 26: Out-of-class transfer results from sequence labeling tasks to question answering tasks in full source → limited target. (Per-cell values not recoverable from this extraction.)

Table 27: Out-of-class transfer results from sequence labeling tasks to question answering tasks in limited source → limited target. (Per-cell values not recoverable from this extraction.)

Task      CCG   POS-PTB  POS-EWT  Parent  GParent  GGParent  ST    Chunk  NER   GED   Conj
Baseline  95.6  96.7     96.6     95.4    91.9     89.5      95.8  97.1   94.7  46.6  89.4
CoLA      95.5  96.7     (row truncated)
MNLI      95.4  96.7     96.6     95.1    91.9     89.0      95.7  97.1   94.6  46.6  (row truncated)
QNLI      95.5  96.7     (row truncated)
Table 28: Out-of-class transfer results from classification/regression tasks to sequence labeling tasks in full source → full target. (Remaining rows not recoverable from this extraction.)

Table 29: Out-of-class transfer results from classification/regression tasks to sequence labeling tasks in full source → limited target. (Per-cell values not recoverable from this extraction.)

Table 30: Out-of-class transfer results from classification/regression tasks to sequence labeling tasks in limited source → limited target. (Per-cell values not recoverable from this extraction.)

Task      CCG   POS-PTB  POS-EWT  Parent  GParent  GGParent  ST    Chunk  NER   GED   Conj
Baseline  95.6  96.7     96.6     95.4    91.9     89.5      95.8  97.1   94.7  46.6  89.4
SQuAD-1   95.4  96.7     (row truncated)
SQuAD-2   95.4  96.7     96.6     95.3    91.8     89.4      95.8  97.1   94.5  46.4  89.9
NewsQA    95.5  96.7     96.4     95.3    91.6     89.2      95.8  97.0   94.4  45.6  90.0
HotpotQA  95.4  96.7     96.3     95.1    91.7     89.1      95.8  96.9   94.5  45.8  90.0
BoolQ     95.5  96.7     96.6     95.3    91.7     89.5      95.8  96.9   94.7  (row truncated)
Table 31: Out-of-class transfer results from question answering tasks to sequence labeling tasks in full source → full target. (Remaining rows not recoverable from this extraction.)

Table 32: Out-of-class transfer results from question answering tasks to sequence labeling tasks in full source → limited target. (Per-cell values not recoverable from this extraction.)

Table 33: Out-of-class transfer results from question answering tasks to sequence labeling tasks in limited source → limited target. (Per-cell values not recoverable from this extraction.)
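The "±" entries in the limited-target tables report variability across repeated fine-tuning runs (mean plus a spread over random restarts). As a minimal sketch of that aggregation, assuming the spread is a sample standard deviation and using hypothetical scores (the helper name `summarize_runs` is not from the paper):

```python
import statistics

def summarize_runs(scores):
    """Collapse per-run metric scores into a 'mean ± std' table cell,
    in the style of the limited-data transfer tables above."""
    mean = statistics.mean(scores)
    # With a single run there is no spread to report.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"{mean:.1f} ± {std:.1f}"

# Hypothetical F1 scores from three fine-tuning runs with different seeds.
print(summarize_runs([52.4, 54.1, 53.0]))  # → 53.2 ± 0.9
```

Averaging over restarts matters especially in the limited-target setting, where fine-tuning on small datasets is known to be unstable across random seeds.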