Accelerating Pre-trained Language Models via Calibrated Cascade
Lei Li†, Yankai Lin‡, Shuhuai Ren†, Deli Chen†,‡, Xuancheng Ren†, Peng Li‡, Jie Zhou‡, Xu Sun†
† MOE Key Lab of Computational Linguistics, School of EECS, Peking University
‡ Pattern Recognition Center, WeChat AI, Tencent Inc., China
{lilei, shuhuai ren}@stu.pku.edu.cn, {chendeli, renxc, xusun}@pku.edu.cn, {yankailin, patrickpli, withtomzhou}@tencent.com

Abstract
Dynamic early exiting aims to accelerate the inference of pre-trained language models (PLMs) by exiting at a shallow layer without passing through the entire model. In this paper, we analyze the working mechanism of dynamic early exiting and find that it cannot achieve a satisfying trade-off between inference speed and performance. On the one hand, the PLMs' representations in shallow layers are not sufficient for accurate prediction. On the other hand, the internal off-ramps cannot provide reliable exiting decisions. To remedy this, we instead propose CascadeBERT, which dynamically selects a proper-sized, complete model in a cascading manner. To obtain more reliable model selection, we further devise a difficulty-aware objective, encouraging the model's output class probability to reflect the real difficulty of each instance. Extensive experimental results demonstrate the superiority of our proposal over strong baselines for PLM acceleration, including both dynamic early exiting and knowledge distillation methods.
Large-scale pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have demonstrated superior performance on various natural language understanding tasks. While the increasing model size brings more power, the huge training costs and especially the long inference time hinder the deployment of PLMs in real-time applications. Researchers have recently explored various approaches for accelerating PLM inference, including model-level compression and instance-level speed-up. The former aims at obtaining a compact model via quantization (Shen et al., 2020), pruning (Voita et al., 2019; Michel et al., 2019), and knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2020), while the latter treats each instance differently and designs exiting metrics for emitting predictions from intermediate off-ramps at early layers (Xin et al., 2020; Liu et al., 2020; Schwartz et al., 2020).

The idea of dynamic early exiting is intuitive and simple, and can be utilized to accelerate inference and reduce the potential risk of the overthinking problem (Kaya et al., 2019). However, such a straightforward paradigm can be sub-optimal, especially when the speed-up ratio is high, i.e., when most examples are predicted only based on shallow representations. This is due to two reasons. First, as revealed by previous studies (Tenney et al., 2019), PLMs like BERT exhibit a hierarchy of representations and rediscover the traditional text processing pipeline, e.g., shallow layers extract low-level features like lexical/syntactic information while deep layers capture semantic-level relations. High-level semantic inference ability is usually required even for easy instances, and therefore we cannot conduct inference solely based on low-level features; this is verified by the analysis experiments in Section 2. Second, to measure the quality of exiting decisions, we design a metric for examining a model's ability to distinguish difficult instances from easy ones. Our experiments show that intermediate classifiers in early exiting models cannot provide reliable exiting decisions, which hinders a better trade-off between speed-up and performance.

In this paper, we propose CascadeBERT to accelerate pre-trained language model inference based on a series of complete models with different layer numbers, used in a cascading manner. Specifically, when inferring a given instance, rather than directly exiting in the middle layers of a PLM, we progressively check whether the instance can be solved by the current PLM, from the smallest to the largest one. Furthermore, we propose to calibrate the PLMs' predictions according to the example difficulty, making them reflect the real difficulty of each instance and therefore serve as a good indicator for model selection. Experimental results on six natural language understanding benchmarks demonstrate that our model obtains a much better trade-off between inference speed and task performance than early exiting methods, achieving close and even superior performance compared to state-of-the-art knowledge distillation methods.

Dynamic early exiting aims to speed up the inference of PLMs by adding internal off-ramps (classifiers) after each layer of the original model.
For each instance, if the internal off-ramp's prediction based on the current layer representation is confident enough, e.g., the maximum class probability exceeds a threshold, then the prediction is emitted without passing through the entire model. However, whether the internal representations provide sufficient information for high-quality predictions, and whether intermediate classifiers can be utilized for making robust exiting decisions, remain unclear. In this section, we investigate the working mechanism of dynamic early exiting by exploring these two questions.
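As a concrete illustration of the exiting mechanism just described, the following sketch runs confidence-threshold early exiting over a stack of layers with one off-ramp per layer. It is a minimal PyTorch example with hypothetical module and variable names (`layers`, `off_ramps`, `threshold`) and a batch size of one; it is not the implementation of any particular early exiting method.

```python
import torch
import torch.nn.functional as F

def early_exit_forward(hidden, layers, off_ramps, threshold):
    """Run layers sequentially; emit the first off-ramp prediction whose
    maximum class probability exceeds the threshold.

    hidden:    [1, seq_len, dim] input representation (batch size 1)
    layers:    list of Transformer layer modules (each returns a tensor)
    off_ramps: list of per-layer classifiers, one per layer
    threshold: confidence threshold for exiting
    """
    for depth, (layer, ramp) in enumerate(zip(layers, off_ramps), start=1):
        hidden = layer(hidden)
        logits = ramp(hidden[:, 0])          # classify on the [CLS] position
        probs = F.softmax(logits, dim=-1)
        confidence = probs.max(dim=-1).values
        if confidence.item() > threshold:    # confident enough: exit early
            return probs, depth
    return probs, depth                      # fall through to the last layer
```

Lowering the threshold makes more instances exit early (faster inference, riskier predictions), while raising it defers more instances to deeper layers.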
As discussed by Tenney et al. (2019), pre-trained language models like BERT learn a hierarchy of representations and rediscover the traditional text processing pipeline, e.g., basic syntactic information emerges in shallow layers, while deeper layers mainly capture high-level semantic structures. Our motivation is that high-level semantics are usually required even for easy instances, and therefore predictions based on shallow representations are not accurate. To examine this, we conduct a behavioral analysis that evaluates model performance based on shallow-layer outputs. If the representation from a shallow layer contains sufficient features for a task, it is reasonable to expect a classifier fine-tuned on it to achieve decent performance. Specifically, we compare the following models:
DeeBERT (Xin et al., 2020), which is a representative of early exiting methods. The internal classifiers in DeeBERT are used for emitting predictions.
BERT-nL, which only utilizes the first n layers of the original BERT model for prediction. A classifier is added directly after the first n layers and fine-tuned on the training dataset (a construction sketch is given after this analysis).
Figure 1: Model performance comparison utilizing the same number of layers on the MNLI (m) dataset. Complete models with a comprehensive pipeline clearly outperform models like DeeBERT without high-level semantic features.
BERT-Complete (Turc et al., 2019), which is a light version of the original BERT model pre-trained from scratch. We assume this model has the complete text processing pipeline ability.

We conduct experiments on the MNLI (Williams et al., 2018) dataset, a natural language inference task requiring the model to predict the relation between a premise sentence and a hypothesis sentence. Figure 1 illustrates the results on the matched version (MNLI-m) development set with different numbers of layers. From the figure, we can see that: (1) There is a clear performance gap between the models with and without a full pipeline, especially when the layer number is small, indicating that the pipeline ability is vital for handling complicated natural language tasks. (2) BERT-nL also outperforms DeeBERT. We attribute this to the final layer learning task-specific information during fine-tuning, which yields decent performance; a similar phenomenon is observed in Merchant et al. (2020). However, since the internal layer representations in DeeBERT are constrained by their relative position in the whole model, this adaptation effect cannot be fully exploited. (3) The gap narrows as the number of layers increases, which in turn validates our assumption that shallow representations are not sufficient for accurate predictions.
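The BERT-nL baseline above can be obtained by truncating a pre-trained BERT encoder to its first n layers before fine-tuning. The snippet below shows one way to do this with the Huggingface Transformers library; the paper does not give its exact construction code, so treat this as an illustrative sketch.

```python
from transformers import AutoModelForSequenceClassification

# Keep only the first n Transformer layers of BERT and fine-tune a
# classifier on top; the deeper layers of the checkpoint are ignored.
n = 6
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_hidden_layers=n,   # builds an n-layer encoder; layers 0..n-1 are loaded
    num_labels=3,          # e.g., 3 classes for MNLI
)
# `model` can now be fine-tuned on the downstream task as usual.
```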
We further probe the early exiting decisions made by internal classifiers. In more detail, we denote difficult instances as instances that the model cannot predict correctly (refer to Section 3.2 for details), and easy instances as those that can be handled well. Intuitively, a difficult instance should be predicted with a lower confidence score than an easy one, so that the confidence score can serve as an indicator for early exiting decisions. Following most existing early exiting work, for each instance x, we utilize the maximum class probability of the output distribution, c(x), as the confidence score.

To measure how well the model can tell the difference between easy and difficult examples, we propose the Difficulty Inversion Score (DIS). First, we sort the instances by their confidence scores in ascending order, i.e., c(x_i) < c(x_j) for any i < j. Then we count the difficulty inversion pairs:

\mathrm{DISum} = \sum_{i=1}^{N} \sum_{j=i+1}^{N} D(x_i, x_j)  (1)

where N is the number of instances and D(·, ·) is an indicator function, computed as:

D(x_i, x_j) = \begin{cases} 1, & \text{if } d_i > d_j \text{ and } c(x_i) < c(x_j) \\ 0, & \text{otherwise} \end{cases}  (2)

where d_i and d_j denote the difficulty labels of x_i and x_j, respectively, e.g., 0 for difficult instances and 1 for easy instances. The final DIS is a normalized DISum:

\mathrm{DIS} = 1 - \frac{\mathrm{DISum}}{K}  (3)

where K is a normalizing factor, i.e., the product of the number of easy instances and the number of difficult instances, which rescales DIS to the range from 0 to 1. A higher DIS indicates that the model ranks instances well according to the confidence score, distinguishing difficult instances from easy ones. Exiting decisions based on classifiers with lower DIS scores are thus unreliable, since they emit more wrongly predicted difficult instances.

Figure 2: DIS (%) heatmap of different models on the MNLI (m) development set. The DIS of internal off-ramps in the shallow layers of DeeBERT is relatively low, leading to unreliable dynamic exiting decisions.

To measure the ability of the internal classifiers in the dynamic early exiting framework to rank instance difficulty, we compute the DIS metric on MNLI (m) for the models discussed in Section 2.1; the results are illustrated in Figure 2. We can observe that: (1) The internal off-ramps in the shallow layers of DeeBERT show a clear gap to BERT-nL and BERT-Complete. This indicates that exiting decisions made in shallow layers can be unreliable, and thus task performance can be poor when instances are mostly emitted in shallow layers. (2) The ability to distinguish difficult examples from easy ones is enhanced as the layer number increases. As shown in the previous section, deeper layer representations boost task performance, so it is reasonable to expect the off-ramps in deeper layers to provide more reliable early exiting decisions. In all, our analysis demonstrates that the exiting decisions made by internal classifiers in current dynamic early exiting are not reliable.
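The sketch below computes the DIS metric from per-instance confidence scores and binary difficulty labels. Function and variable names are illustrative, and the pair counting follows the reconstruction of Equations (1)–(3) given above: it counts pairs in which a difficult instance receives a higher confidence score than an easy one.

```python
def difficulty_inversion_score(confidences, is_easy):
    """DIS = 1 - (#inversion pairs) / (#easy * #difficult).

    confidences: list of max-class probabilities, one per instance
    is_easy:     list of booleans (True = easy, False = difficult)
    """
    order = sorted(zip(confidences, is_easy))     # ascending confidence
    inversions = 0
    easy_seen = 0
    for _, easy in order:
        if easy:
            easy_seen += 1           # easy instance with lower confidence so far
        else:
            inversions += easy_seen  # this difficult instance outranks those easy ones
    n_easy = easy_seen
    n_difficult = len(order) - n_easy
    k = n_easy * n_difficult
    return 1.0 - inversions / k if k > 0 else 1.0
```

For a perfect ranking in which every difficult instance has lower confidence than every easy instance, DIS equals 1; for the fully inverted ranking it equals 0.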
To tackle the drawbacks of dynamic early exiting investigated above, we propose a novel framework, named CascadeBERT, that utilizes a suite of complete PLMs with different layer numbers for acceleration in a cascading manner, and further devises a difficulty-aware calibration regularization to inform the model of instance difficulty.
Formally, given n complete PLM models M_1, M_2, ..., M_n trained on the downstream task dataset with l_1, ..., l_n layers respectively, our goal is to select the model with the minimal number of layers for each input instance x while maintaining the model performance. We formulate this as a cascade exiting problem, i.e., we execute model predictions sequentially for each input example, from the smallest M_1 to the largest M_n, and examine whether the prediction for the input instance x can be emitted.
Algorithm 1: Cascade Exiting
Input: models {M_i}, thresholds {τ_i}
Data: input x
Result: class probability distribution Pr(y | x)
for i ← 1 to n do
    Pr(y | x) = M_i(x)
    c(x) = max_y Pr(y | x)
    if c(x) > τ_i then
        return Pr(y | x)   (early exit)
return Pr(y | x)

Specifically, we use the confidence score c(x), i.e., the maximum class probability, as the metric to determine whether a prediction is confident enough to be emitted:

c(x) = \max_{y \in \mathcal{L}} \Pr(y \mid x)  (4)

where \mathcal{L} is the label set of the task. Given a confidence threshold τ, the prediction result is emitted once the confidence score exceeds the threshold. By varying the confidence threshold τ, we can obtain different speed-up ratios depending on application requirements: a smaller τ means that more examples are emitted by the current model, making inference faster, while a bigger τ makes more examples go through larger models for better performance. The whole framework is summarized in Algorithm 1. Since every candidate model in our cascading framework is a complete model with the pipeline processing ability, predictions are more robust even when only the smallest model is executed.

To make the confidence-based cascade exiting more reliable, we design a difficulty-based margin loss that further improves the efficiency of the cascade mechanism. In more detail, we add a regularization objective for each instance pair:

L(x_i, x_j) = \max\{0, -g(x_i, x_j)\,(c(x_i) - c(x_j)) + \epsilon\}  (5)

where \epsilon is a confidence margin, and g(x_i, x_j) is defined as:

g(x_i, x_j) = \begin{cases} 1, & \text{if } d_i > d_j \\ 0, & \text{if } d_i = d_j \\ -1, & \text{otherwise} \end{cases}  (6)

where d_i and d_j are the difficulty labels of x_i and x_j, respectively. This objective is added to the original task-specific loss with a weight factor λ to adjust its impact. By optimizing the combined objective, the confidence scores of more difficult instances are adjusted to be lower than those of easier instances, making the confidence-based emitting decisions more accurate. Note that traditional post-hoc calibration methods like temperature scaling (Guo et al., 2017) are not applicable here, since such re-scaling does not change the ranking of different instances.

To measure instance difficulty, we first split the training dataset D into K folds {D̃_i | i = 1, ..., K} and train K small models with multiple seeds using the leave-one-out method, i.e., model θ_i is trained on D − D̃_i. We then utilize θ_i to evaluate the difficulty of the examples in D̃_i: samples are marked as easy if they can be correctly classified, and as difficult otherwise. To eliminate the impact of randomness, we aggregate the predictions across seeds and strictly label as easy only the examples that are correctly predicted under all seeds, while the others are labeled as difficult. A similar approach is adopted in Xu et al. (2020a) for applying curriculum learning to natural language understanding.

We estimate the inference speed-up ratio according to the number of layers actually executed in forward propagation for each example (Xin et al., 2020; Zhou et al., 2020), since we do not introduce any external parameters into our framework. Specifically, the speed-up ratio over an original model with n layers is calculated as:

\text{speed-up ratio} = \frac{\sum_i n \times m_i}{\sum_i C_i \times m_i}  (7)

where m_i is the number of test instances that actually cost C_i layers in total. Note that in our cascade exiting framework, the overhead incurred by instances that run forward propagation through multiple models is counted in C_i: for example, an instance that is first fed into an l_1-layer model and then goes through an l_2-layer model to obtain the final prediction executes l_1 + l_2 layers in total.
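The difficulty-aware regularization in Equations (5) and (6) can be realized as a pairwise margin penalty on confidence scores within a batch. The snippet below is a minimal PyTorch sketch under the conventions used here (difficulty labels with a higher value for easier instances, a margin epsilon, and a weight lambda); tensor names and the all-pairs batching scheme are illustrative rather than the authors' exact implementation.

```python
import torch

def difficulty_margin_loss(confidence, difficulty, epsilon=0.1):
    """Pairwise margin loss over all instance pairs in a batch.

    confidence: [B] max class probability per instance (differentiable)
    difficulty: [B] difficulty label per instance (higher = easier here)
    """
    c_i = confidence.unsqueeze(1)            # [B, 1]
    c_j = confidence.unsqueeze(0)            # [1, B]
    d_i = difficulty.unsqueeze(1).float()
    d_j = difficulty.unsqueeze(0).float()
    g = torch.sign(d_i - d_j)                        # +1, 0, or -1, as in Eq. (6)
    pair_loss = torch.clamp(-g * (c_i - c_j) + epsilon, min=0.0)  # Eq. (5)
    pair_loss = pair_loss * (g != 0)                 # ignore equal-difficulty pairs
    return pair_loss.mean()

# Combined objective: task loss plus the weighted regularizer, e.g.
# total_loss = task_loss + lam * difficulty_margin_loss(confidence, difficulty)
```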
Method | MNLI(m) | MNLI(mm) | MRPC | QNLI | QQP | RTE | SST-2 | Avg.
BERT-base† | – (1.00×) | 83.4 (1.00×) | 88.9 (1.00×) | 90.5 (1.00×) | 71.2 (1.00×) | 66.4 (1.00×) | 93.5 (1.00×) | 82.6

~2.00× speed-up, static:
BERT-6L‡ | – (2.00×) | 79.2 (2.00×) | 85.1 (2.00×) | 86.2 (2.00×) | 68.9 (2.00×) | 65.0 (2.00×) | 90.9 (2.00×) | 79.3
BERT-small† | – (2.00×) | – (2.00×) | 86.8 (2.00×) | 88.9 (2.00×) | 70.4 (2.00×) | 65.3 (2.00×) | 91.8 (2.00×) | 81.2
BERT-PKD† | – (2.00×) | 81.0 (2.00×) | 85.0 (2.00×) | 89.0 (2.00×) | 70.7 (2.00×) | 65.5 (2.00×) | 92.0 (2.00×) | 80.7
BERT-of-Theseus† | – (2.00×) | 82.1 (2.00×) | – (2.00×) | – (2.00×) | – (2.00×) | – (2.00×) | – (2.00×) | –

~2.00× speed-up, dynamic:
DeeBERT† | – | 73.1 (1.88×) | 84.4 (2.07×) | 85.6 (2.09×) | 70.4 (2.13×) | 64.3 (1.95×) | 90.2 (2.00×) | 77.5
PABEE† | – | 78.7 (2.08×) | 84.4 (2.01×) | 88.0 (1.87×) | 70.4 (2.09×) | 64.0 (1.81×) | 89.3 (1.95×) | 79.2
CascadeBERT | 83.0 (2.01×) | – (2.01×) | – (2.01×) | – (2.01×) | – (2.01×) | – (2.03×) | – (2.08×) | –

~3.00× speed-up, static:
BERT-4L‡ | – (3.00×) | 75.1 (3.00×) | – (3.00×) | 84.7 (3.00×) | 66.5 (3.00×) | – (3.00×) | 87.5 (3.00×) | 76.5
BERT-small‡ | – (3.00×) | 78.3 (3.00×) | 82.3 (3.00×) | – (3.00×) | 69.8 (3.00×) | 59.2 (3.00×) | – (3.00×) | 78.0
BERT-PKD‡ | – (3.00×) | – (3.00×) | – (3.00×) | 84.9 (3.00×) | – (3.00×) | 62.8 (3.00×) | 89.2 (3.00×) | –
BERT-of-Theseus‡ | – (3.00×) | 77.4 (3.00×) | 82.2 (3.00×) | 85.5 (3.00×) | 68.3 (3.00×) | 59.5 (3.00×) | 89.7 (3.00×) | 77.3

~3.00× speed-up, dynamic:
DeeBERT‡ | – | 61.3 (3.03×) | 83.5 (3.00×) | 82.4 (2.99×) | 67.0 (2.97×) | 59.9 (3.00×) | 88.8 (2.97×) | 72.3
PABEE‡ | – | 75.3 (2.71×) | 82.6 (2.72×) | 82.6 (3.04×) | 69.5 (2.57×) | 60.5 (2.38×) | 85.2 (3.15×) | 75.9
CascadeBERT | 81.2 (3.00×) | – (3.00×) | – (3.00×) | – (3.02×) | – (3.02×) | – (3.03×) | – (3.00×) | –

Table 1: Test results from the GLUE server. We report F1-score for QQP and MRPC and accuracy for the other tasks. The corresponding speed-up ratio is shown in parentheses. For baseline methods, † denotes results taken from the original paper and ‡ denotes results based on our implementation. The middle rows report model performance around 2× speed-up and the bottom rows represent 3× acceleration. The best results among methods of the same kind are shown in bold.

We evaluate our method on the GLUE benchmark with the BERT (Devlin et al., 2019) model as our backbone architecture. We first give a brief introduction of the datasets used and the experimental settings, followed by a description of the baseline models used for comprehensive evaluation. The results and analysis of the experiments are presented at the end.
We use six classification tasks from the GLUE benchmark (Wang et al., 2018): MNLI (Williams et al., 2018), MRPC (Dolan and Brockett, 2005), QNLI (Rajpurkar et al., 2016), QQP (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), RTE (Bentivogli et al., 2009), and SST-2 (Socher et al., 2013). The evaluation metrics are F1-score for QQP and MRPC, and accuracy for the remaining tasks. Our implementation is based on the Huggingface Transformers library (Wolf et al., 2020). We use a cascade of two complete models of different depths for selection, since we find in practice that this works well, and we leave exploring more candidate models for future work. We utilize the weights provided by Turc et al. (2019) to initialize the models in our suite. The hyper-parameters, including the margin ε and the margin loss weight λ, are tuned on the development set, and we select the best-performing models for evaluation on the test set.

We implement two kinds of baseline models for a comprehensive evaluation of our framework:
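To make the two-model cascade concrete, the sketch below runs a shallow and a deep fine-tuned sequence classification model in cascade order and tracks the number of layers executed, from which the speed-up ratio of Equation (7) can be computed. The checkpoint paths, threshold value, and helper names are placeholders, not the authors' released artifacts.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical paths to a shallow and a deep model fine-tuned on the same task.
small = AutoModelForSequenceClassification.from_pretrained("path/to/small-model")
large = AutoModelForSequenceClassification.from_pretrained("path/to/large-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/large-model")
small.eval(); large.eval()

def cascade_predict(text, threshold=0.9):
    """Return (class probabilities, total layers executed) for one input."""
    inputs = tokenizer(text, return_tensors="pt")
    layers_executed = 0
    with torch.no_grad():
        for model in (small, large):
            layers_executed += model.config.num_hidden_layers
            probs = model(**inputs).logits.softmax(dim=-1)
            if probs.max().item() > threshold:   # confident enough: emit
                break
    return probs, layers_executed

# Speed-up ratio over a 12-layer backbone, following Eq. (7):
# speedup = (12 * num_examples) / sum(layers_executed over all examples)
```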
Early Exiting includes BERT-nL, where only the first n layers of the original model are used for making the final classification and a classifier is fine-tuned for acquiring task-specific information; we take n = 6 and n = 4 to obtain statically compressed models with speed-up ratios of 2× and 3×, respectively. DeeBERT (Xin et al., 2020) makes dynamic early predictions based on internal classifiers. PABEE (Zhou et al., 2020) is also included; it is a robustly enhanced variant that emits a prediction only after several consecutive layers produce a consistent exiting decision.

Knowledge Distillation aims at compressing the original large model into a small one with fewer layers. We compare our framework to distillation methods that do not require external data, including BERT-PKD (Sun et al., 2019), which distills internal states of the teacher model into the student model, and BERT-of-Theseus (Xu et al., 2020b), which achieves compression by gradually replacing modules of the original model.
The main results are presented in Table 1. We find, somewhat surprisingly, that the early exiting methods cannot beat the simple BERT-nL baseline, which directly fine-tunes a classifier on top of an internal layer. This validates our motivation that both the emitting decisions and the predictions based on shallow-layer representations are not reliable. Our method instead selects among complete models in a cascading manner and outperforms early exiting methods by a large margin. Furthermore, CascadeBERT also outperforms the enhanced dynamic early exiting variant, PABEE, in average score when the speed-up ratio is 2×, and the gap becomes clearer when the speed-up ratio rises to 3×, demonstrating that our model can maintain satisfying performance even when the speed-up ratio is relatively high.

Besides, our proposal is comparable with state-of-the-art knowledge distillation methods like BERT-PKD and BERT-of-Theseus at 2× speed-up, and is in an advantageous position when the speed-up ratio grows to 3×. Although distillation methods can implicitly learn the pipeline ability by forcing student models to mimic the intermediate representations of the teacher model (Sun et al., 2019), or by gradually replacing modules of the teacher model (Xu et al., 2020b), it is still relatively hard to obtain a well-performing student model with the pipeline processing ability for all instances of different difficulties, especially when the compression ratio is high. On the contrary, since every model candidate in our framework is a complete model and the predictions are calibrated to reflect the instance difficulty, the cascade of different models can still produce robust results.

Our work aims to accelerate the inference of large-scale pre-trained language models while maintaining their superior performance. Previous efforts towards this goal can be categorized into model-level compression and instance-level speed-up:
Model-level compression aims to obtain a computation-efficient model, for instance via knowledge distillation or quantization. Knowledge distillation (KD) focuses on transferring the knowledge of a teacher model into a small student model (Hinton et al., 2015), and various KD techniques have been applied to pre-trained language models to obtain more compact student models (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2020). Quantization methods aim to represent the model efficiently using fewer physical bits (Shen et al., 2020). Note that our proposal is agnostic to the model distillation technique, and thus these advanced methods can be incorporated into our framework to further enhance performance.
Instance-level speed-up accelerates inference via early exiting, i.e., producing results based on intermediate representations (Xin et al., 2020; Liu et al., 2020; Schwartz et al., 2020). The motivation is that instances with different complexities can exit at different levels of a big model, based on the prediction confidence of internal classifiers. However, we argue that this approach makes unreliable predictions due to the lack of high-level semantic understanding. We instead propose to achieve acceleration based on a series of complete models, and calibrate the model predictions for more accurate selection decisions.
In this paper, we address the unreliable results produced by dynamic early exiting methods and propose CascadeBERT, a simple and effective framework for accelerating the inference of pre-trained language models. Experimental results demonstrate that our proposal achieves superior performance over previous acceleration methods when the speed-up ratio is high. We hope to analyze the impact of the number of models available for selection and to explore more backbone architectures for evaluating the universality of our framework in the future.
Acknowledgements
This work was supported by a Tencent Research Grant. Xu Sun is the corresponding author of this paper.
References
Luisa Bentivogli, Ido Kalman Dagan, Dang Hoa, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC Workshop.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP).

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In ICML, pages 1321–1330.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP, pages 4163–4174.

Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-deep networks: Understanding and mitigating network overthinking. In ICML, pages 3301–3310.

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. 2020. FastBERT: a self-distilling BERT with adaptive inference time. In ACL, pages 6035–6044.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, and Ian Tenney. 2020. What happens to BERT embeddings during fine-tuning? In BlackboxNLP Workshop, pages 33–44.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In NeurIPS, pages 14014–14024.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020. The right tool for the job: Matching model and instance complexities. In ACL, pages 6640–6651.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI, pages 8815–8821.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In EMNLP-IJCNLP, pages 4323–4332.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In ACL, pages 4593–4601.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, pages 5797–5808.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP Workshop on BlackboxNLP, pages 353–355.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In System Demonstrations, EMNLP, pages 38–45.

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In ACL, pages 2246–2251.

Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. 2020a. Curriculum learning for natural language understanding. In ACL, pages 6095–6104.

Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020b. BERT-of-Theseus: Compressing BERT by progressive module replacing. In EMNLP, pages 7859–7869.

Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. 2020. BERT loses patience: Fast and robust inference with early exit. arXiv preprint arXiv:2006.04152.