Hierarchical Multitask Learning Approach for BERT
Çağla Aksoy, Alper Ahmetoğlu, Tunga Güngör
Department of Computer Engineering, Boğaziçi University, Bebek, Istanbul 34342, Turkey
{cagla.aksoy,alper.ahmetoglu,gungort}@boun.edu.tr

Abstract
Recent works show that learning contextualized embeddings for words is beneficial for downstream tasks. BERT is one successful example of this approach. It learns embeddings by solving two tasks, which are masked language model (masked LM) and next sentence prediction (NSP). The pre-training of BERT can also be framed as a multitask learning problem. In this work, we adopt hierarchical multitask learning approaches for BERT pre-training. Pre-training tasks are solved at different layers instead of the last layer, and information from the NSP task is transferred to the masked LM task. Also, we propose a new pre-training task, bigram shift, to encode word order information. We choose two downstream tasks, one of which requires sentence-level embeddings (textual entailment), and the other requires contextualized embeddings of words (question answering). Due to computational restrictions, we use the downstream task data instead of a large dataset for the pre-training to see the performance of the proposed models when given a restricted dataset. We test their performance on several probing tasks to analyze the learned embeddings. Our results show that imposing a task hierarchy in pre-training improves the performance of embeddings.
1 Introduction

There have been many studies in natural language processing (NLP) to find suitable word representations (embeddings) that carry information of a language. Even though finding these word representations can be computationally demanding, this is advantageous since they are computed only once. These learned representations can then be used for various downstream tasks.

Word2Vec (Mikolov et al., 2013) finds word embeddings by predicting a word given its neighborhood (CBOW) or predicting its neighborhood given the word (Skip-gram). Words that are used together have similar word embeddings due to the training strategy. However, these embeddings do not contain word order information and contextual information. ELMo (Peters et al., 2018) uses a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) to predict a word given its context. Since a BiLSTM is used for creating embeddings, both left-to-right and right-to-left contexts are implicitly encoded. The transformer (Vaswani et al., 2017) is shown to be more appropriate for training on large datasets due to its self-attention mechanism. OpenAI GPT (Radford et al., 2018) has the same objective as ELMo in the forward direction, except it uses the transformer architecture. BERT (Devlin et al., 2018) also uses the transformer architecture with bidirectional pre-training tasks. Training objectives affect the information encoded in embeddings. Each objective and architecture presumes a different inductive bias.

In this work, we focus on BERT as it uses multiple training objectives. These objectives can create an inhibitory effect or a regulatory effect on each other. For this reason, we applied a hierarchical multitask learning approach to BERT by modifying its original structure. Our motivation is to create embeddings that encode the information from each task in a balanced way. Our contributions are as follows:

• Instead of training the masked language model (masked LM) and next sentence prediction (NSP) classifiers with the last layer embeddings, we train masked LM classifiers with embeddings from lower layers of the transformer (Lower Mask). We do the same experiment for the NSP classifier as well (Lower NSP). By evaluating the performance of the embeddings on downstream tasks, we provide insights about the hierarchy between the pre-training tasks.

• We incorporate the input or the output of the NSP classifier into the input of the masked LM classifier in order to enrich the sentence-level embedding.

• We propose a new pre-training objective, bigram shift, in addition to the masked LM and NSP tasks, to enforce embeddings also to learn word order information.

Our experimental results show that Lower NSP has a competitive performance when compared with the original BERT structure. We also evaluate the learned embeddings on probing tasks to provide useful insights into training strategies. Results on the probing task experiments show that using the bigram shift task for pre-training is useful for specific tasks. The remaining part of this paper is organized as follows. In Section 2, we mention related works. In Section 3, we explain our methods in detail. In Section 4, we report our experiment results. Lastly, we give a conclusion in Section 5.

2 Related Work

Multitask learning (MTL) (Caruana, 1997) is an approach to construct one model for related tasks to achieve a better generalization performance. BERT pre-training can be thought of as a multitask model by this definition. Besides this parallel task structure as in BERT, the hierarchical MTL approach models tasks in a successive fashion.
With this approach, complex tasks are predicted at deeper layers so that they can take advantage of the representations of lower-level tasks (Søgaard and Goldberg, 2016; Hashimoto et al., 2016; Sanh et al., 2019). Examples of this approach can also be seen in computer vision (Szegedy et al., 2016; Sabour et al., 2017). The other important issue is choosing related tasks. In case of training a model for tasks which do not support each other, the accuracy for each task can be lower than their single-task counterparts (Bingel and Søgaard, 2017).

BERT pre-training is done by predicting randomly masked words and predicting whether two sentences are consecutive or not. These two objectives are optimized simultaneously. This model is trained on a large unlabeled corpus; it can then be easily fine-tuned on various downstream tasks with high performance. There are different studies to replicate and improve the BERT model architecture. RoBERTa (Liu et al., 2019) makes slight adjustments to the pre-training tasks. One relevant change to our work is removing the NSP task. This approach also hints at the issue of choosing related tasks (Bingel and Søgaard, 2017). ALBERT (Lan et al., 2019) proposes a way to reduce the parameter size of the BERT structure. They also propose a new pre-training task, sentence order prediction, instead of NSP. StructBERT (Wang et al., 2019) also proposes two auxiliary tasks, word structure and sentence structure, instead of NSP to create embeddings that are faithful to both word and sentence orders. The word structural task is similar to our proposed bigram shift task. However, we predict whether words are in the correct position or not instead of predicting the correct word in a position.

In general, BERT is fine-tuned on downstream tasks by using its last layer activations. There are several works that use lower (intermediate) layers of BERT as feature extractors and aggregate them with additional layers (Yang and Zhao, 2019; Zhu et al., 2018). These works only focus on using lower layer activations in the fine-tuning step without changing the pre-training procedure. Our work differs in the sense that we modify the pre-training of BERT with hierarchical multitask learning to get lower layers that are specialized on one task.
3 Methods

In the original BERT pre-training, there are two objectives: masked LM and NSP. These are optimized simultaneously to learn a language model. From a multitask learning point of view, minimizing the losses of the two objectives at the same time should enhance the performance if these two tasks are related to each other (Bingel and Søgaard, 2017). To our knowledge, there is no prior work on the relation between these two tasks. As the relation of tasks is an essential factor, another important factor is the task complexity. For example, if one of these tasks is more straightforward than the other, the model may overfit to the easier task while the training for the other task is not complete. This is an unwanted consequence since a part of the model memorizes the task instead of generalizing it. This could be prevented by early stopping; however, it is hard to decide when to stop even for one task (Prechelt, 1998).
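To make the parallel objective concrete, the following is a minimal sketch (not the authors' released code; class and variable names are illustrative) of how the two losses are combined in BERT-style pre-training, assuming an encoder that returns last-layer hidden states of shape (batch, sequence, hidden):

```python
import torch
import torch.nn as nn

class JointPretrainingHeads(nn.Module):
    """Parallel multitask heads: masked LM and NSP, both trained on the last layer."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # predicts the masked tokens
        self.nsp_head = nn.Linear(hidden_size, 2)            # is-next vs. not-next

    def forward(self, hidden_states, mlm_labels, nsp_labels):
        # hidden_states: (batch, seq_len, hidden) from the last encoder layer
        mlm_logits = self.mlm_head(hidden_states)
        nsp_logits = self.nsp_head(hidden_states[:, 0])       # [CLS] is the first token
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)      # -100 marks unmasked positions
        mlm_loss = loss_fn(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
        nsp_loss = loss_fn(nsp_logits, nsp_labels)
        return mlm_loss + nsp_loss                             # both heads share all encoder layers
```

Because both heads read the same top-layer representation, an easier task can start to dominate; the hierarchical variants described next move one head to a lower layer instead.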
Figure 1: Different hierarchical BERT architectures. (a) using the lower-level embedding of the [CLS] token for the NSP classifier; (b) using lower-level embeddings of masked tokens for the masked LM classifier.

We experimented with multitask learning approaches to BERT pre-training in several ways:

1. The NSP classifier is trained with the [CLS] embedding from lower layers instead of the last layer. We refer to this model as Lower NSP (Figure 1a; a sketch of this variant is given after this list). Here [CLS] is the start token of the input. This embedding is considered to be a sentence-level embedding (Devlin et al., 2018). Since we do not exactly know the task hierarchy, we also do the opposite, where the masked LM classifier is trained with the embeddings from lower layers which correspond to the masked tokens in the input. We refer to this model as Lower Mask (Figure 1b).

2. We concatenate the [CLS] embedding (Figure 2a) or the output of the NSP classifier (Figure 2b) to the input of the masked LM classifier. This will (1) indirectly regularize the sentence-level embedding since it will be used for both tasks, and (2) explicitly provide sentence-level information to the masked LM classifier.

3. Motivated by probing tasks (Conneau et al., 2018), we randomly swap the order of bigrams 15% of the time for each input and use an additional classifier, which predicts whether a token is in the right position in the sentence or not. This will force embeddings to encapsulate the word order information.

4. In our experiments with Lower NSP models, we see that the NSP classifier fits faster than the masked LM classifier. Further training will cause overfitting for the NSP classifier. Therefore we do experiments on freezing all parts that affect the NSP classifier. We refer to this model as Lower NSP-freeze.
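A minimal sketch of the Lower NSP idea, with the optional [CLS] concatenation of item 2, is given below. This is not the paper's released implementation; the class and argument names are illustrative, and the encoder layers are treated as generic blocks mapping (batch, seq, hidden) tensors to the same shape:

```python
import torch
import torch.nn as nn

class LowerNSPModel(nn.Module):
    """NSP head reads [CLS] at an intermediate layer; the masked LM head stays on the last layer."""

    def __init__(self, encoder_layers, hidden_size, vocab_size, nsp_layer, concat_cls=False):
        super().__init__()
        self.layers = nn.ModuleList(encoder_layers)   # e.g. the 12 blocks of BERT base
        self.nsp_layer = nsp_layer                     # index of the layer supervised by NSP
        self.concat_cls = concat_cls                   # item 2: feed the sentence embedding to the LM head
        self.nsp_head = nn.Linear(hidden_size, 2)
        mlm_in = hidden_size * 2 if concat_cls else hidden_size
        self.mlm_head = nn.Linear(mlm_in, vocab_size)

    def forward(self, x):
        nsp_logits, cls_embedding = None, None
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.nsp_layer:                    # hierarchical supervision point
                cls_embedding = x[:, 0]                # sentence-level embedding
                nsp_logits = self.nsp_head(cls_embedding)
        mlm_input = x                                   # last-layer token embeddings
        if self.concat_cls:
            # assumption: the [CLS] embedding taken at the NSP level is broadcast to every position
            cls = cls_embedding.unsqueeze(1).expand(-1, x.size(1), -1)
            mlm_input = torch.cat([mlm_input, cls], dim=-1)
        return self.mlm_head(mlm_input), nsp_logits
```

The Lower Mask variant would swap the roles, reading the masked-token embeddings at an intermediate layer while keeping the NSP head on the last layer, and Lower NSP-freeze would additionally stop updating the parameters that feed the NSP head once it has converged.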
Figure 2: Different concatenation techniques. (a) concatenation of the [CLS] embedding to each input of the masked LM classifier; (b) concatenation of the output of the NSP classifier to each input of the masked LM classifier.

Figure 3: Different downstream architectures. In all four figures, C is a classifier. (a) The [CLS] embedding from the NSP level is used as input for sentence-level tasks. (b) Embeddings from the masked LM level are used as inputs for token-level tasks. In (c), the [CLS] embedding from the NSP level is concatenated to each input. In (d), the NSP classifier output is concatenated to each input.

There are several changes in the fine-tuning to accommodate the changes in the pre-training. In the original BERT structure, last layer embeddings are used for both sentence-level tasks (Figure 3a) and token-level tasks (Figure 3b). With our hierarchical approach, the [CLS] embedding is selected from the layer on which we train the NSP classifier (Figure 1a); token embeddings are selected from the layer on which we train the masked LM classifier (Figure 1b). If the [CLS] embedding or the NSP output is included in the pre-training, we also include this extra information in the fine-tuning for token-level tasks (Figures 3c, 3d).

4 Experiments

We tested the different proposed architectures on several datasets. Due to computational issues, we do the pre-training on small datasets, as opposed to BERT. For this, we have two different approaches. First, we used the raw text data of the downstream task for pre-training. In the second approach, we use a different raw text of similar size to the first data. We experimented on two downstream tasks: question answering (token-level task) and textual entailment (sentence-level task).

For question answering, we used the SQuAD1.1 (Rajpurkar et al., 2016) and SQuAD2.0 (Rajpurkar et al., 2018) datasets. SQuAD2.0 is an extended version of SQuAD1.1. It contains questions that do not have a possible answer in the given context. These two datasets are used for both pre-training and fine-tuning on downstream tasks. Additionally, we used WikiText-2 for pre-training. We followed the same strategy as in BERT to create pre-training data from the paragraphs of these three datasets separately. We pre-train the model with inputs that contain less than 128 tokens for 90% of the steps, as in BERT. However, for the remaining 10% of the steps, we used inputs containing less than 384 tokens instead of 512.

For textual entailment, we used the Multi-Genre Natural Language Inference (MultiNLI) corpus (Williams et al., 2017). This dataset does not contain paragraphs but independent sentences. Therefore we did not use this dataset for pre-training. Instead, we used the model pre-trained on the WikiText-2 dataset.

To evaluate the quality of the learned embeddings, we used the probing tasks in (Conneau et al., 2018). We only evaluate embeddings that are pre-trained on the WikiText-2 dataset since these are used for both downstream tasks. There are 10 different probing tasks, which are categorized into three groups: surface information, syntactic information, and semantic information.
Surface information tasks are sentence length (SL) and word content (WC). Syntactic information tasks are tree depth (TD), bigram shift (BS), and top constituents (TC). Semantic information tasks are tense (T), subject number (SN), object number (ON), semantic odd man out (OM), and coordination inversion (CI). Details of these tasks can be found in (Conneau et al., 2018).

4.2 Pre-training and Fine-tuning Results

We made the modifications explained in Section 3 to the BERT base architecture. We pre-trained all variants of the architectures on our pre-training datasets and fine-tuned them on the downstream tasks. We select 1e-4 as the learning rate with 1e-4 weight decay for pre-training, and 1e-5 as the learning rate with no weight decay for fine-tuning. These hyperparameters are found by a grid search done on the BERT base architecture. We set the batch size to 32 for inputs with length up to 128 tokens and to one for longer inputs in the pre-training. The batch size for fine-tuning is set to one for all downstream tasks. We set the dropout rate to 0.1. For all experiments, the Adam optimizer (Kingma and Ba, 2014) is used with default beta parameters and the AMSGrad (Reddi et al., 2019) option.

For the Lower NSP and Lower Mask architectures, we experimented with all intermediate layers to choose the best level of selected embeddings for the lower classifier. As we used the BERT base architecture, which has 12 encoder layers, there are 11 different intermediate layers. As an additional experiment, we use [CLS] concatenation or NSP concatenation for the BERT, Lower NSP, and Lower Mask architectures. Apart from these models, we freeze the NSP-related parts of the model for cases in which the Lower NSP classifier overfitted to the data before the masked LM training was complete. To see the importance of the NSP task, we removed the NSP classifier, as in RoBERTa. The best models for each architecture are fine-tuned on the downstream tasks. We add three baselines:

1. The pre-trained BERT base model is fine-tuned.

2. The BERT base architecture without any pre-training is directly fine-tuned, to understand whether the performance is due to the architecture or to large-scale training.

3. A BiLSTM, chosen as another modern architecture due to its success, is fine-tuned with fastText (Bojanowski et al., 2017) word embeddings.
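The optimizer configuration described above corresponds roughly to the following sketch, assuming PyTorch's Adam implementation; the model object is a placeholder for any of the pre-training architectures:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # placeholder standing in for a BERT-base-sized model

# Pre-training: lr 1e-4, weight decay 1e-4, default betas, AMSGrad enabled.
pretrain_opt = torch.optim.Adam(model.parameters(), lr=1e-4,
                                weight_decay=1e-4, amsgrad=True)

# Fine-tuning: lr 1e-5, no weight decay.
finetune_opt = torch.optim.Adam(model.parameters(), lr=1e-5,
                                weight_decay=0.0, amsgrad=True)
```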
Models             Conc.       SQuAD                         WikiText-2
                   CLS  NSP    SQuAD1.1      SQuAD2.0        SQuAD1.1      SQuAD2.0
                               EM / F1       EM / F1         EM / F1       EM / F1
Pre-trained BERT    -    -     75.9 / 85.4   71.1 / 73.8     -             -
BERT                -    -     49.5 / 60.6   54.8 / 57.4     48.1 / 60.5   54.7 / 57.9
                    +    -     48.1 / 59.1
                    -    +     48.7 / 60.2   55.5 / 58.5     48.2 / 59.7   51.2 / 53.2
Lower NSP           -    -     47.2 / 58.3   56.3 / 58.8

Table 1: Fine-tuning results (EM / F1) of the QA classifiers on the SQuAD1.1 and SQuAD2.0 validation sets. The SQuAD and WikiText-2 columns represent pre-training sets. The Conc. (concatenation) column represents whether the [CLS] embedding or the NSP output is used in both pre-training and fine-tuning.
Fine-tuning results of the QA classifiers on the SQuAD1.1 and SQuAD2.0 validation sets are shown in Table 1. Here, exact match (EM) measures whether the prediction overlaps with the ground truth exactly. F-measure (F1) evaluates partial overlaps (Rajpurkar et al., 2016). We see that pre-trained BERT embeddings outperform all approaches. However, we note that it is not a fair comparison since we only pre-train our models with small datasets. In the table, BERT is the replica, except that it is pre-trained on the small data. Therefore, we compare our approaches with this model.

One obvious result is the underperformance of the Lower Mask models. However, Lower NSP models show competitive performance with BERT, even better on some datasets. This might give a clue about the task hierarchy between the masked LM and NSP tasks. One might expect the masked LM task to be more straightforward than the NSP task, because the former makes predictions at the word level while the latter makes predictions at the sentence level. However, the results suggest that the masked LM task is complicated, as it requires deeper layers to perform successfully. The NSP task can be done without encoding the full knowledge of each word. Unlike RoBERTa, the performance drops when we remove the NSP loss. The reason for this might be pre-training with small data; the model may not need the sentence-level task when it has access to many word combinations. Instead of removing the NSP loss, one promising direction can be training the NSP task at lower layers for large-scale training.

The Bigram Shift model performs slightly worse than the other models when the pre-training is done on the SQuAD datasets. One possible explanation is that the question answering problem does not require the word order information. On WikiText-2, the results are dramatically worse. This might be due to optimization; a better hyperparameter search could improve the performance.

Self-supervised pre-training shows its contribution once more. Even when the pre-training is done on the downstream data with a poorly constructed architecture (Lower Mask), it performs better than the original transformer without any pre-training. In Table 1, the model without pre-training has a 51.0 F1 score on SQuAD2.0. However, this is not a good model, as it predicts that all question-paragraph pairs are impossible to answer (51% of the examples in the validation set are impossible to answer). Furthermore, all models outperform BiLSTM with fastText. Notice that fastText is also a model pre-trained on large-scale data. A transformer pre-trained directly on the downstream data has better representations for the specific downstream tasks than a BiLSTM with fastText.
Models                 Pre-training    Accuracy
                       CLS   NSP       Matched    Mismatched
Pre-trained BERT        -     -        84.4       84.9
BERT                    -     -
                        +     -        70.2       70.8
                        -     +        71.1       71.3
Lower Mask              -     -        69.9       70.7
                        +     -        68.5       69.2
                        -     +        67.1       67.9
Lower NSP               -     -        69.1       70.2
                        +     -        70.3       71.9
                        -     +        70.5       71.3
Lower NSP-Freeze        -     -        45.2       45.3
Bigram Shift            -     -        69.8       70.7
BiLSTM                  -     -        66.8       67.3
Without Pre-training    -     -        61.2       61.4

Table 2: Accuracy results on the MultiNLI validation sets. The Pre-training column represents whether the [CLS] embedding or the NSP output is used in pre-training.
Results on the MultiNLI corpus for both the matched and mismatched validation sets are shown in Table 2. The matched set contains examples from the same sources as the training set; the mismatched set contains examples from different sources. We cannot use concatenated parts for fine-tuning, unlike in question answering. Only the [CLS] embedding from the NSP level is used for fine-tuning.

As in the QA results, pre-training on small data has a bootstrapping effect compared with the model without pre-training. All models except Lower NSP-Freeze outperform BiLSTM. This shows that the transformer aggregates sentence-level information better than BiLSTM. This is an interesting result since BiLSTM sees each word explicitly to encode the sentence-level embedding, while the transformer encodes this implicitly via the NSP task. While BERT has the best performance, Lower NSP is also competitive. Unlike in question answering, the Lower Mask and Bigram Shift models have better performance.
Models    PT          FT          SQuAD                         WikiText-2
          CLS  NSP    CLS  NSP    SQuAD1.1      SQuAD2.0        SQuAD1.1      SQuAD2.0
                                  EM / F1       EM / F1         EM / F1       EM / F1
BERT       -    -      -    -     49.5 / 60.6   54.8 / 57.4     48.1 / 60.5   54.7 / 57.9
           -    -      +    -     50.0 / 61.6   55.1 / 58.4     48.0 / 60.7   54.9 / 58.2
           -    -      -    +     50.4 / 61.7   56.5 / 59.5     47.4 / 60.1
           +    -      +    -     48.1 / 59.1

Table 3: Fine-tuning results (EM / F1) of the QA classifiers when the [CLS] embedding or the NSP output is used in pre-training (PT) and/or fine-tuning (FT).

To evaluate the effect of the concatenated parts (the [CLS] embedding or the NSP output), we also test the following architectures: (1) used in pre-training, (2) used in fine-tuning, (3) used in both pre-training and fine-tuning, (4) not used. QA classifier results for these experiments are shown in Table 3. Even if the model is pre-trained without using the concatenated parts, using these extra inputs in fine-tuning slightly increases the performance. In some cases, the peak performance is achieved by using the concatenated parts in both pre-training and fine-tuning.

For the probing tasks, a multi-layer perceptron (MLP) with two hidden layers of 128 units is used. Here, the transformer part is frozen, and we only train the probing task classifier. The learning rate is set to 1e-3 with no weight decay. The batch size is 32. The classifier uses the [CLS] embedding from the NSP level as input (a sketch of this setup is given after the list below).

The results of the probing tasks are shown in Table 4. These results suggest the following:

• In the sentence length and tree depth tasks, pre-trained BERT performs worse than some models. This is quite surprising considering its large-scale training.

• Lower NSP models perform better than Lower Mask models and better than BERT for some tasks. We can say that Lower NSP constructs a more informative sentence-level representation than the others.

• The Bigram Shift model achieves good results for tasks that require order information (tree depth, bigram shift, coordination inversion). While it is not as good as the other models in the downstream tasks, including order information might be beneficial for other tasks.
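As a rough illustration of the probing setup described above (the names and the frozen encoder are placeholders, not the authors' code):

```python
import torch
import torch.nn as nn

def build_probe(hidden_size: int, num_classes: int) -> nn.Module:
    # Two hidden layers with 128 units, as described above.
    return nn.Sequential(
        nn.Linear(hidden_size, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, num_classes),
    )

probe = build_probe(hidden_size=768, num_classes=2)  # e.g. binary probes such as bigram shift
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3, weight_decay=0.0)
# The transformer is kept frozen by simply not passing its parameters to the optimizer;
# the probe is trained on [CLS] embeddings taken from the NSP level with batch size 32.
```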
Models             PT          SL    WC    TD    BS    TC    T     SN    ON    OM    CI
                   CLS  NSP
Pre-trained BERT    -    -     68.3  32.4  34.3  86.5  75.2  88.7  83.0  77.8  64.3  74.4
BERT                -    -     83.8   9.6  36.0  62.5  70.1
                    +    -     75.7  10.1  34.6  57.9  63.6  70.2  73.8  68.9  50.5  55.6
                    -    +     84.9
Table 4: Accuracy results of the probing tasks. The PT (pre-training) column represents whether the [CLS] embedding or the NSP output is used in pre-training.

5 Conclusion

We proposed to pre-train BERT with a hierarchical multitask learning approach. Our results on restricted data (due to computational resources) show that this approach achieves better or equal performance. We incorporate sentence-level information to solve word-level tasks. This also shows a slight increase in performance. We propose an additional pre-training task, bigram shift, which causes embeddings to contain word order information.

We believe that implementing these techniques in large-scale training will further advance the state of the art. Probing tasks show that different training techniques lead embeddings to contain different linguistic properties. This is an essential point since various problems in the NLP domain have different requirements. Therefore, selecting an appropriate pre-training strategy is an important factor.
Acknowledgments
The numerical calculations reported in this work were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources) and on TETAM servers.
References
Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. arXiv preprint arXiv:1702.08303.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2016. A joint many-task model: Growing a neural network for multiple NLP tasks. arXiv preprint arXiv:1611.01587.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Lutz Prechelt. 1998. Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55–69. Springer.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2019. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.

Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6949–6956.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231–235.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, and Luo Si. 2019. StructBERT: Incorporating language structures into pre-training for deep language understanding. arXiv preprint arXiv:1908.04577.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Junjie Yang and Hai Zhao. 2019. Deepening hidden representations from pre-trained language models for natural language understanding. arXiv preprint arXiv:1911.01940.

Chenguang Zhu, Michael Zeng, and Xuedong Huang. 2018. SDNet: Contextualized attention-based deep network for conversational question answering. arXiv preprint arXiv:1812.03593.