The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding

Xiaodong Liu*, Yu Wang*, Jianshu Ji, Hao Cheng, Xueyun Zhu, Emmanuel Awa, Pengcheng He, Weizhu Chen, Hoifung Poon, Guihong Cao and Jianfeng Gao
Microsoft Corporation
{xiaodl,yuwan,jianshuj,chehao,xuzhu}@microsoft.com

* Equal contribution.
The complete name of our toolkit is MT-DNN (the Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding), but we use MT-DNN for the sake of simplicity.

Abstract
We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM). A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without a significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across the general and biomedical domains. The software and pre-trained models will be publicly available at https://github.com/namisan/mt-dnn.

Introduction

NLP model development has seen a paradigm shift in recent years, driven by the success of pre-trained language models in improving a wide range of NLP tasks (Peters et al., 2018; Devlin et al., 2019). Unlike the traditional pipeline approach that conducts annotation in stages using primarily supervised learning, the new paradigm features a universal pre-training stage that trains a large neural language model via self-supervision on a large unlabeled text corpus, followed by a fine-tuning step that starts from the pre-trained contextual representations and conducts supervised learning for individual tasks. The pre-trained language models can effectively capture textual variations and distributional similarity. Therefore, they can make subsequent task-specific training more sample-efficient and often significantly boost performance on downstream tasks. However, these models are quite large and pose significant challenges to production deployment under stringent memory or speed requirements. As a result, knowledge distillation has become another key feature of this new learning paradigm. An effective distillation step can often substantially compress a large model for efficient deployment (Clark et al., 2019; Tang et al., 2019; Liu et al., 2019a).

In the NLP community, there are several well-designed frameworks for research and commercial purposes, including toolkits that provide conventional layered linguistic annotations (Manning et al., 2014), platforms for developing novel neural models (Gardner et al., 2018), and systems for neural machine translation (Ott et al., 2019). However, it is hard to find an existing tool that supports all the features of the new paradigm and can be easily customized for new tasks. For example, Wolf et al. (2019) provide a number of popular Transformer-based (Vaswani et al., 2017) text encoders in a nice unified interface, but do not offer multi-task learning or adversarial training, state-of-the-art techniques that have been shown to significantly improve performance.
Additionally, most public frameworks do not offer knowledge distillation. A notable exception is DistilBERT (Sanh et al., 2019), but it provides a standalone compressed model and does not support task-specific model compression, which can further improve performance.

We introduce MT-DNN, a comprehensive and easily-configurable open-source toolkit for building robust and transferable natural language understanding models. MT-DNN is built upon PyTorch (Paszke et al., 2019) and the popular Transformer-based text-encoder interface (Wolf et al., 2019). It supports a large inventory of pre-trained models, neural architectures, and NLU tasks, and can be easily customized for new tasks.

A key distinguishing feature of MT-DNN is that it provides out-of-the-box adversarial training, multi-task learning, and knowledge distillation. Users can train a set of related tasks jointly so that the tasks amplify each other. They can also invoke adversarial training (Miyato et al., 2018; Jiang et al., 2019; Liu et al., 2020), which helps improve model robustness and generalizability. For production deployment, where large model size becomes a practical obstacle, users can use MT-DNN to compress the original models into substantially smaller ones, even with a completely different architecture (e.g., compressing BERT or other Transformer-based text encoders into LSTMs (Hochreiter and Schmidhuber, 1997)). The distillation step can likewise leverage multi-task learning and adversarial training. Users can also conduct pre-training from scratch using the masked language model objective in MT-DNN. Moreover, in the fine-tuning step, users can incorporate this objective as an auxiliary task on the training text, which has been shown to improve performance. MT-DNN provides a comprehensive list of state-of-the-art pre-trained NLU models, together with step-by-step tutorials for using such models in general and biomedical applications.

Design

MT-DNN is designed for modularity, flexibility, and ease of use. Its modules are built upon PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2019), allowing the use of state-of-the-art pre-trained models, e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019c) and UniLM (Dong et al., 2019). The unique attribute of this package is a flexible interface for adversarial multi-task fine-tuning and knowledge distillation, so that researchers and developers can build large SOTA NLU models and then compress them into small ones for online deployment. The overall workflow and system architecture are shown in Figure 1 and Figure 3, respectively.
As shown in Figure 1, starting from neural language model pre-training, there are three different training configurations, following the directed arrows:

• Single-task configuration: single-task fine-tuning and single-task knowledge distillation;

• Multi-task configuration: multi-task fine-tuning and multi-task knowledge distillation;

• Multi-stage configuration: multi-task fine-tuning, single-task fine-tuning and single-task knowledge distillation.

Moreover, all configurations can additionally be equipped with adversarial training. Each stage of the workflow is described in detail below.

Figure 1: The workflow of MT-DNN: train a neural language model on a large amount of unlabeled raw text to obtain general contextual representations; then fine-tune the learned contextual representations on downstream tasks, e.g., GLUE (Wang et al., 2018); lastly, distill this large model into a lighter one for online deployment. In the latter two phases, we can leverage powerful multi-task learning and adversarial training to further improve performance.
Neural Language Model Pre-Training
Due to the great success of deep contextual representations, such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), it has become common practice to develop NLU models by first pre-training the underlying neural text representations (text encoders) through massive language modeling, which results in superior text representations that are transferable across multiple NLP tasks. Because of this, there has been an increasing effort to develop better pre-trained text encoders by scaling up either the data (Liu et al., 2019c) or the model (Raffel et al., 2019). Similar to existing codebases (Devlin et al., 2019), MT-DNN supports LM pre-training from scratch with multiple types of objectives, such as masked LM (Devlin et al., 2019) and next sentence prediction (Devlin et al., 2019).

Figure 2: Process of knowledge distillation for MTL. A set of tasks with task-specific labeled training data is picked. Then, for each task, an ensemble of different neural nets (the teacher) is trained. The teacher is used to generate a set of soft targets for each task-specific training sample. Given the soft targets of the training datasets across multiple tasks, a single MT-DNN (the student), shown in Figure 3, is trained using multi-task learning and back-propagation, except that if task t has a teacher, the task-specific loss is the average of two objective functions, one for the correct targets and the other for the soft targets assigned by the teacher.

Moreover, users can leverage LM pre-training, such as the masked LM used by BERT, as an auxiliary task for fine-tuning under the multi-task learning (MTL) framework (Sun et al., 2019; Liu et al., 2019b).
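As an illustration only (not MT-DNN's actual API), the sketch below shows one way to mix an auxiliary masked-LM loss with a task loss during fine-tuning. The masking rate, the mixing weight, and the `encoder`, `task_head` and `mlm_head` modules are hypothetical placeholders.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mlm_prob=0.15):
    """BERT-style corruption for the auxiliary masked-LM objective.

    Returns corrupted inputs and labels; -100 marks positions excluded
    from the MLM loss. (The full BERT recipe also keeps/randomizes some
    masked positions; this sketch always inserts [MASK].)
    """
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mlm_prob
    labels[~mask] = -100
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id
    return corrupted, labels

def training_step(encoder, task_head, mlm_head, batch, mlm_weight=0.1):
    """Hypothetical fine-tuning step: task loss plus weighted auxiliary MLM loss."""
    corrupted, mlm_labels = mask_tokens(batch["input_ids"], mask_token_id=103)
    hidden = encoder(corrupted)                              # (batch, seq, dim)
    task_loss = torch.nn.functional.cross_entropy(
        task_head(hidden[:, 0]), batch["labels"])            # [CLS]-based classification
    mlm_logits = mlm_head(hidden)                            # (batch, seq, vocab)
    mlm_loss = torch.nn.functional.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1), ignore_index=-100)
    return task_loss + mlm_weight * mlm_loss
```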
Fine-tuning

Once the text encoder has been trained in the pre-training stage, an additional task-specific layer is usually added and fine-tuned for the downstream task. Besides the typical single-task fine-tuning, MT-DNN facilitates joint fine-tuning on a configurable list of related tasks in an MTL fashion. By encoding task-relatedness and sharing the underlying text representations, MTL is a powerful training paradigm that promotes model generalization and results in improved performance (Caruana, 1997; Liu et al., 2019b; Luong et al., 2015; Liu et al., 2015; Ruder, 2017; Collobert et al., 2011). Additionally, a two-step fine-tuning procedure is supported to utilize datasets from related tasks, i.e., a single-task fine-tuning following a multi-task fine-tuning. MT-DNN also supports two popular sampling strategies in MTL training: 1) sampling tasks uniformly (Caruana, 1997; Liu et al., 2015); 2) sampling tasks based on the size of their datasets (Liu et al., 2019b). This makes it easy to explore various ways of feeding training data to MTL training. Finally, to further improve model robustness, MT-DNN also offers a recipe for applying adversarial training (Madry et al., 2017; Zhu et al., 2019; Jiang et al., 2019) in the fine-tuning stage.
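To make the two sampling strategies concrete, the sketch below builds one epoch's mini-batch schedule either uniformly across tasks or in proportion to dataset size. It is an illustrative sketch, not MT-DNN's implementation; the task names and the `strategy` argument are our own.

```python
import random

def make_batch_schedule(task_batches, strategy="proportional", seed=0):
    """Build one epoch's schedule of (task, batch_index) pairs for MTL training.

    task_batches maps a task name to its number of mini-batches.
    - "uniform": every task contributes the same number of batches
      (that of the smallest task), so tasks are sampled uniformly.
    - "proportional": every batch of every task is used once, so tasks are
      effectively sampled in proportion to their dataset sizes.
    """
    rng = random.Random(seed)
    if strategy == "uniform":
        n = min(task_batches.values())
        schedule = [(task, i) for task in task_batches for i in range(n)]
    elif strategy == "proportional":
        schedule = [(task, i) for task, n in task_batches.items() for i in range(n)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    rng.shuffle(schedule)  # interleave tasks within the epoch
    return schedule

# Example with three hypothetical tasks of very different sizes.
schedule = make_batch_schedule({"mnli": 12000, "rte": 80, "sst": 2100},
                               strategy="proportional")
print(len(schedule), schedule[:5])
```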
Knowledge Distillation
Although contextual text representation models pre-trained with massive text data have led to remarkable progress in NLP, it is computationally prohibitive and inefficient to deploy such models, which have hundreds of millions of parameters, for real-world applications (e.g., the BERT large model has 344 million parameters). Therefore, to expedite deployment of an NLU model learned in either a single-task or multi-task fashion, MT-DNN additionally supports multi-task knowledge distillation (Clark et al., 2019; Liu et al., 2019a; Tang et al., 2019; Balan et al., 2015; Ba and Caruana, 2014), an extension of Hinton et al. (2015), to compress cumbersome models into lighter ones. The multi-task knowledge distillation process is illustrated in Figure 2. As in the fine-tuning stage, adversarial training is available in the knowledge distillation stage.

Architecture

Figure 3: Overall System Architecture. The lower layers are shared across all tasks while the top layers are task-specific. The input X (either a sentence or a set of sentences) is first represented as a sequence of embedding vectors, one for each word, in l1. Then the encoder, e.g., a Transformer or a recurrent neural network (LSTM) model, captures the contextual information for each word and generates the shared contextual embedding vectors in l2. Finally, for each task, additional task-specific layers generate task-specific representations, followed by the operations necessary for classification, similarity scoring, or relevance ranking. In the case of adversarial training, we perturb the embeddings from the lexicon encoder and add an extra loss term during training. Note that no perturbation is required at inference time.

Lexicon Encoder (l1): The input X = {x_1, ..., x_m} is a sequence of tokens of length m. The first token x_1 is always a special token, e.g., [CLS] for BERT (Devlin et al., 2019) and <s> for RoBERTa (Liu et al., 2019c). If X is a pair of sentences (X_1, X_2), we separate the sentences with special tokens, e.g., [SEP] for BERT and </s> for RoBERTa. The lexicon encoder maps X into a sequence of input embedding vectors, one for each token, constructed by summing the corresponding word, positional, and optional segment embeddings.

Encoder (l2): We support a multi-layer bidirectional Transformer (Vaswani et al., 2017) or an LSTM (Hochreiter and Schmidhuber, 1997) encoder to map the input representation vectors (l1) into a sequence of contextual embedding vectors C ∈ R^{d×m}. This is the shared representation across different tasks. Note that MT-DNN also allows developers to customize their own encoders. For example, one can design an encoder with a few Transformer layers (e.g., 3 layers) to distill knowledge from the BERT large model (24 layers), so that the small model can be deployed online to meet latency restrictions, as shown in Figure 2.

Task-Specific Output Layers:
We can incorporate arbitrary natural language tasks, each with its own task-specific output layer. For example, we implement the output layers as a neural decoder, a neural ranker for relevance ranking, a logistic regression layer for text classification, and so on. A multi-step reasoning decoder, SAN (Liu et al., 2018a,b), is also provided. Users can choose from the existing task-specific output layers or implement a new one themselves.
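Figure 2 describes the distillation objective for a task with a teacher as the average of two losses: one against the correct (hard) targets and one against the teacher's soft targets. A minimal sketch of such a per-task loss is shown below; the temperature parameter and the equal weighting are illustrative assumptions, not necessarily MT-DNN's exact settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, temperature=1.0):
    """Average of the hard-target cross-entropy and the soft-target loss.

    The soft targets are the teacher's (temperature-scaled) class probabilities;
    the soft-target term is a cross-entropy against that full distribution.
    """
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = -(soft_targets * log_probs).sum(dim=-1).mean()
    return 0.5 * (hard_loss + soft_loss)

# Toy usage with random logits for a 3-class task.
student = torch.randn(8, 3)
teacher = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student, teacher, labels).item())
```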
Applications

In this section, we present a comprehensive set of examples that illustrate how to customize MT-DNN for new tasks. We use popular benchmarks from the general and biomedical domains, including GLUE (Wang et al., 2018), SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018), SQuAD (Rajpurkar et al., 2016), ANLI (Nie et al., 2019), and biomedical named entity recognition (NER), relation extraction (RE) and question answering (QA) (Lee et al., 2019). To make the experiments reproducible, we make all the configuration files publicly available. We also provide a quick guide for customizing a new task in Jupyter notebooks.

• GLUE. The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding (NLU) tasks. As shown in Table 1, it includes question answering (Rajpurkar et al., 2016), linguistic acceptability (Warstadt et al., 2018), sentiment analysis (Socher et al., 2013), text similarity (Cer et al., 2017), paraphrase detection (Dolan and Brockett, 2005), and natural language inference (NLI) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009; Levesque et al., 2012; Williams et al., 2018). The diversity of these tasks makes GLUE well suited for evaluating the generalization and robustness of NLU models.
Corpus       Task            Formulation
GLUE
  CoLA       Acceptability   Classification
  SST        Sentiment       Classification
  MNLI       NLI             Classification
  RTE        NLI             Classification
  WNLI       NLI             Classification
  QQP        Paraphrase      Classification
  MRPC       Paraphrase      Classification
  QNLI       QA/NLI          Classification
  QNLI v1.0  QA/NLI          Pairwise Ranking
  STS-B      Similarity      Regression
Others
  SNLI       NLI             Classification
  SciTail    NLI             Classification
  ANLI       NLI             Classification
  SQuAD      MRC             Span Classification
Table 1: Summary of the four benchmarks: GLUE, SNLI, SciTail and ANLI.
Model             MNLI   RTE    QNLI   SST    MRPC
                  Acc    Acc    Acc    Acc    F1
BERT              84.5   63.5   91.1   92.9   89.0
BERT + MTL        85.3   79.1   91.5   93.6   89.2
BERT + AdvTrain   85.6   71.2   91.6   93.0   91.3
Table 2: Comparison among single-task, multi-task and adversarial training on MNLI, RTE, QNLI, SST and MRPC in GLUE.
Model                              Dev    Test
BERT-LARGE (Nie et al., 2019)      49.3   44.2
RoBERTa-LARGE (Nie et al., 2019)   53.7   49.7
RoBERTa-LARGE + AdvTrain           57.1   57.1
Table 3: Results in terms of accuracy on ANLI.

• SNLI. The Stanford Natural Language Inference (SNLI) dataset contains 570k human-annotated sentence pairs, in which the premises are drawn from the captions of the Flickr30 corpus and the hypotheses are manually annotated (Bowman et al., 2015). This is the most widely used entailment dataset for NLI.
• SciTail. This is a textual entailment dataset derived from a science question answering (SciQ) dataset (Khot et al., 2018). In contrast to the other entailment datasets mentioned above, the hypotheses in SciTail are created from science questions, while the corresponding answer candidates and premises come from relevant web sentences retrieved from a large corpus.

• ANLI. The Adversarial Natural Language Inference benchmark (ANLI; Nie et al., 2019) is a new large-scale NLI dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. In particular, the data is selected to be difficult for state-of-the-art models, including BERT and RoBERTa.

• SQuAD. The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains about 23K passages and 100K questions. The passages come from approximately 500 Wikipedia articles, and the questions and answers are obtained by crowdsourcing.

Following Devlin et al. (2019), Table 2 compares different training algorithms: 1) BERT denotes single-task fine-tuning; 2) BERT + MTL indicates that the model is trained jointly via MTL; and 3) BERT + AdvTrain denotes single-task fine-tuning with adversarial training. Both MTL and adversarial training clearly lead to better results. We further test our model on the adversarial natural language inference (ANLI) dataset (Nie et al., 2019). Table 3 summarizes the results on ANLI. Following Nie et al. (2019), the training data combines ANLI (Nie et al., 2019), MNLI (Williams et al., 2018), SNLI (Bowman et al., 2015) and FEVER (Thorne et al., 2018). RoBERTa-LARGE + AdvTrain obtains the best performance compared with all the strong baselines, demonstrating the advantage of adversarial training.
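The adversarial training used above perturbs the embeddings produced by the lexicon encoder and adds an extra loss term (cf. Figure 3). The sketch below shows a common single-step variant of this idea, assuming a `model` that maps embeddings directly to logits; the step size, the KL-based smoothness term, and the single-step approximation are our simplifications rather than MT-DNN's exact recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_regularizer(model, embeddings, clean_logits, epsilon=1e-3):
    """One-step adversarial smoothness term on the input embeddings.

    Finds a small embedding perturbation that most changes the prediction,
    then penalizes the divergence between clean and perturbed predictions.
    The result is added to the task loss, e.g.
    total_loss = task_loss + lambda_adv * adversarial_regularizer(...).
    """
    # Start from tiny random noise; at exactly zero the gradient vanishes.
    noise = (torch.randn_like(embeddings) * 1e-5)
    noise.requires_grad_(True)
    perturbed_logits = model(embeddings + noise)
    div = F.kl_div(F.log_softmax(perturbed_logits, dim=-1),
                   F.softmax(clean_logits.detach(), dim=-1),
                   reduction="batchmean")
    grad, = torch.autograd.grad(div, noise)
    # Move in the direction that most increases the divergence.
    delta = epsilon * grad / (grad.norm(p=2, dim=-1, keepdim=True) + 1e-12)
    adv_logits = model(embeddings + delta.detach())
    return F.kl_div(F.log_softmax(adv_logits, dim=-1),
                    F.softmax(clean_logits.detach(), dim=-1),
                    reduction="batchmean")
```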
There has been rising interest in exploring natural language understanding tasks in high-value domains other than newswire and the Web. In our release, we provide MT-DNN customization for three representative biomedical natural language understanding tasks:

• Named entity recognition (NER): In biomedical natural language understanding, NER has received greater attention than other tasks, and datasets are available for recognizing various biomedical entities such as diseases, genes and drugs (chemicals).

• Relation extraction (RE): Relation extraction is more closely related to end applications, but the annotation effort is significantly higher than for NER. Most existing RE tasks focus on binary relations within a short text span such as a sentence of an abstract. Examples include gene-disease or protein-chemical relations.

• Question answering (QA): Inspired by interest in QA for the general domain, there has been some effort to create question-answering datasets in biomedicine. Annotation requires domain expertise, so it is significantly harder than in the general domain, where large-scale datasets can be produced by crowdsourcing.

The MT-DNN customization can work with standard or biomedicine-specific pre-trained models such as BioBERT, and can be directly applied to biomedical benchmarks (Lee et al., 2019).
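As a sketch of how a domain-specific encoder can be plugged in, the snippet below loads a biomedical BERT checkpoint through the Hugging Face Transformers interface and attaches a token-classification head for NER. The checkpoint name, the label set, and the v4-style Transformers API are illustrative assumptions, not part of the MT-DNN release.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for disease-mention NER (BIO scheme).
labels = ["O", "B-Disease", "I-Disease"]

# Any BERT-compatible checkpoint can be used; a biomedical one such as
# BioBERT is a natural choice here (checkpoint name assumed).
model_name = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels))

inputs = tokenizer("BRCA1 mutations increase the risk of breast cancer.",
                   return_tensors="pt")
logits = model(**inputs).logits          # (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)      # per-token label ids (head is untrained)
```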
Figure 4: The configuration of SNLI.
We now walk through a typical natural language inference task, SNLI, one of the most popular benchmarks, to show how to apply our toolkit to a new task. MT-DNN is driven by a configuration file and command-line arguments. The SNLI configuration is shown in Figure 4. The configuration defines the task, the model architecture, and the loss functions. We briefly introduce these attributes as follows:

1. data_format is a required attribute; for SNLI it denotes that each sample includes two sentences (premise and hypothesis). Please refer to the tutorial and the API documentation for the supported formats.

2. task_layer_type specifies the architecture of the task-specific layer. The default is a linear layer.
3. labels lists the unique label values. This helps convert back and forth between text labels and numeric indices during training and evaluation. Without it, MT-DNN assumes that the prediction labels are numbers.

4. metric_meta is the evaluation metric used for validation.

5. loss is the loss function for SNLI. Other functions are also supported, e.g., MSE for regression.

6. kd_loss is the loss function used in the knowledge distillation setting.

7. adv_loss is the loss function used in the adversarial training setting.

8. n_class denotes the number of categories for SNLI.

9. task_type specifies whether the task is a classification task or a regression task.

Once the configuration is provided, one can train a customized model for the task, using any supported pre-trained model as the starting point. MT-DNN is also highly extensible: as shown in Figure 4, loss and task_layer_type point to existing classes in the code. Users can write customized classes and plug them into MT-DNN; the customized classes can then be used via the configuration.
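For concreteness, a task definition in the spirit of Figure 4 might look like the following, expressed here as a Python dictionary. The attribute names follow the list above, but the exact spellings and values are assumptions for illustration rather than verbatim MT-DNN syntax.

```python
# Hypothetical SNLI task configuration mirroring the attributes described above.
snli_task = {
    "data_format": "PremiseAndOneHypothesis",   # each sample is a sentence pair
    "task_layer_type": "linear",                # default task-specific head
    "labels": ["contradiction", "neutral", "entailment"],
    "metric_meta": ["ACC"],                     # validation metric
    "loss": "CrossEntropy",                     # classification loss
    "kd_loss": "MSE",                           # used only for knowledge distillation
    "adv_loss": "SymmetricKL",                  # used only for adversarial training
    "n_class": 3,
    "task_type": "Classification",
}
```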
Conclusion

Microsoft MT-DNN is an open-source natural language understanding toolkit that helps researchers and developers build customized deep learning models. Its key features are: 1) support for robust and transferable learning using the adversarial multi-task learning paradigm; 2) knowledge distillation under the multi-task learning setting, which can be leveraged to derive lighter models for efficient online deployment. In the future, we will extend MT-DNN to support natural language generation tasks, e.g., question generation, and incorporate more pre-trained encoders, e.g., T5 (Raffel et al., 2019).
Acknowledgments
We thank Liyuan Liu, Sha Li, Mehrad Moradshahi and the other contributors to the package, and the anonymous reviewers for valuable discussions and comments.

References
Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662.

Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. 2015. Bayesian dark knowledge. In Advances in Neural Information Processing Systems, pages 3438–3446.

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, and Danilo Giampiccolo. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the Text Analysis Conference (TAC09).

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D Manning, and Quoc V Le. 2019. BAM! Born-again multi-task networks for natural language understanding. arXiv preprint arXiv:1907.04829.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, MLCW'05, pages 177–190, Berlin, Heidelberg. Springer-Verlag.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague. Association for Computational Linguistics.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2019. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. arXiv preprint arXiv:1911.03437.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. 2020. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994.

Xiaodong Liu, Kevin Duh, and Jianfeng Gao. 2018a. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence, Italy. Association for Computational Linguistics.

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019. ERNIE 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412.

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Thomas Goldstein, and Jingjing Liu. 2019. FreeLB: Enhanced adversarial training for language understanding. arXiv preprint arXiv:1909.11764.