Accelerating Pre-trained Language Models via Calibrated Cascade
Lei Li†, Yankai Lin‡, Shuhuai Ren†, Deli Chen†,‡, Xuancheng Ren†, Peng Li‡, Jie Zhou‡, Xu Sun†
† MOE Key Lab of Computational Linguistics, School of EECS, Peking University
‡ Pattern Recognition Center, WeChat AI, Tencent Inc., China
{lilei, shuhuai ren}@stu.pku.edu.cn, {chendeli, renxc, xusun}@pku.edu.cn, {yankailin, patrickpli, withtomzhou}@tencent.com

Abstract
Dynamic early exiting aims to accelerate the inference of pre-trained language models (PLMs) by exiting at a shallow layer without passing through the entire model. In this paper, we analyze the working mechanism of dynamic early exiting and find that it cannot achieve a satisfying trade-off between inference speed and performance. On the one hand, the PLMs' representations in shallow layers are not sufficient for accurate prediction. On the other hand, the internal off-ramps cannot provide reliable exiting decisions. To remedy this, we instead propose CascadeBERT, which dynamically selects a proper-sized, complete model in a cascading manner. To obtain more reliable model selection, we further devise a difficulty-aware objective, encouraging the model's output class probability to reflect the real difficulty of each instance. Extensive experimental results demonstrate the superiority of our proposal over strong baselines for PLM acceleration, including both dynamic early exiting and knowledge distillation methods.
Large-scale pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have demonstrated superior performance on various natural language understanding tasks. While the increasing model size brings more power, the huge training costs and especially the long inference time hinder the deployment of PLMs in real-time applications. Researchers have recently explored various approaches for accelerating PLM inference, including model-level compression and instance-level speed-up. The former aims at obtaining a compact model via quantization (Shen et al., 2020), pruning (Voita et al., 2019; Michel et al., 2019), and knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2020), while the latter treats each instance differently and designs exiting metrics for emitting predictions from intermediate off-ramps at early layers (Xin et al., 2020; Liu et al., 2020; Schwartz et al., 2020).

The idea of dynamic early exiting is intuitive and simple, and can be utilized to accelerate inference and reduce the potential risk of the overthinking problem (Kaya et al., 2019). However, such a straightforward paradigm can be sub-optimal, especially when the speed-up ratio is high, i.e., when most examples are predicted only based on shallow representations. This is due to two reasons. First, as revealed by previous studies (Tenney et al., 2019), PLMs like BERT exhibit a hierarchy of representations and rediscover the traditional text processing pipeline, e.g., shallow layers extract low-level features like lexical/syntactic information while deep layers capture semantic-level relations. High-level semantic inference ability is usually required even for easy instances, and therefore we cannot conduct inference solely based on low-level features; this is verified by the analysis experiments in Section 2. Second, to measure the quality of exiting decisions, we design a metric for examining a model's ability to distinguish difficult instances from easy ones. Our experiments show that intermediate classifiers in early exiting models cannot provide reliable exiting decisions, which hinders a better trade-off between speed-up and performance.

In this paper, we propose CascadeBERT to accelerate pre-trained language model inference based on a series of complete models with different layer numbers, used in a cascading manner. Specifically, when inferring a given instance, rather than directly exiting in the middle layers of a PLM, we progressively check whether the instance can be solved by the current PLM, from the smallest to the largest one. Furthermore, we propose to calibrate the PLMs' predictions according to the example difficulty, making them reflect the real difficulty of each instance and therefore serve as a good indicator for model selection. Experimental results on six natural language understanding benchmarks demonstrate that our model obtains a much better trade-off between inference speed and task performance than early exiting methods, achieving close and even superior performance compared to state-of-the-art knowledge distillation methods.

Dynamic early exiting aims to speed up the inference of PLMs by adding internal off-ramps (classifiers) after each layer of the original model.
For each instance, if the internal off-ramp's prediction based on the current layer representation is confident enough, e.g., the maximum class probability exceeds a threshold, then the prediction is emitted without passing through the entire model. However, whether the internal representations provide sufficient information for high-quality predictions, and whether intermediate classifiers can be utilized for making robust exiting decisions, remain unclear. In this section, we investigate the working mechanism of dynamic early exiting by exploring these two questions.
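As a concrete illustration of the exiting mechanism just described, the following sketch runs confidence-threshold early exiting over a stack of layers with one off-ramp per layer. It is a minimal PyTorch example with hypothetical module and variable names (`layers`, `off_ramps`, `threshold`) and a batch size of one; it is not the implementation of any particular early exiting method.

```python
import torch
import torch.nn.functional as F

def early_exit_forward(hidden, layers, off_ramps, threshold):
    """Run layers sequentially; emit the first off-ramp prediction whose
    maximum class probability exceeds the threshold.

    hidden:    [1, seq_len, dim] input representation (batch size 1)
    layers:    list of Transformer layer modules (each returns a tensor)
    off_ramps: list of per-layer classifiers, one per layer
    threshold: confidence threshold for exiting
    """
    for depth, (layer, ramp) in enumerate(zip(layers, off_ramps), start=1):
        hidden = layer(hidden)
        logits = ramp(hidden[:, 0])          # classify on the [CLS] position
        probs = F.softmax(logits, dim=-1)
        confidence = probs.max(dim=-1).values
        if confidence.item() > threshold:    # confident enough: exit early
            return probs, depth
    return probs, depth                      # fall through to the last layer
```

Lowering the threshold makes more instances exit early (faster inference, riskier predictions), while raising it defers more instances to deeper layers.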
As discussed by Tenney et al. (2019), pre-trained language models like BERT learn a hierarchy of representations and rediscover the traditional text processing pipeline, e.g., basic syntactic information emerges in shallow layers, while deeper layers mainly capture high-level semantic structures. Our motivation is that high-level semantics are usually required even for easy instances, and therefore predictions based on shallow representations are not accurate. To examine this, we conduct a behavioral analysis that evaluates model performance based on shallow-layer outputs. If the representation from a shallow layer contains sufficient features for a task, it is reasonable to expect a classifier fine-tuned on it to achieve decent performance. Specifically, we compare the following models:
DeeBERT (Xin et al., 2020), which is a representative of early exiting methods. The internal classifiers in DeeBERT are used for emitting predictions.
BERT-nL, which only utilizes the first n layers of the original BERT model for prediction. A classifier is added directly after the first n layers and fine-tuned on the training dataset (a construction sketch is given after this analysis).
Figure 1: Model performance comparison utilizing the same number of layers on the MNLI (m) dataset. Complete models with a comprehensive pipeline clearly outperform models like DeeBERT without high-level semantic features.
BERT-Complete (Turc et al., 2019), which is a light version of the original BERT model pre-trained from scratch. We assume this model has the complete text processing pipeline ability.

We conduct experiments on the MNLI (Williams et al., 2018) dataset, a natural language inference task requiring the model to predict the relation between a premise sentence and a hypothesis sentence. Figure 1 illustrates the results on the matched version (MNLI-m) development set with different numbers of layers. From the figure, we can see that: (1) There is a clear performance gap between the models with and without a full pipeline, especially when the layer number is small, indicating that the pipeline ability is vital for handling complicated natural language tasks. (2) BERT-nL also outperforms DeeBERT. We attribute this to the final layer learning task-specific information during fine-tuning, which yields decent performance; a similar phenomenon is observed in Merchant et al. (2020). However, since the internal layer representations in DeeBERT are constrained by their relative position in the whole model, this adaptation effect cannot be fully exploited. (3) The gap narrows as the number of layers increases, which in turn validates our assumption that shallow representations are not sufficient for accurate predictions.
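The BERT-nL baseline above can be obtained by truncating a pre-trained BERT encoder to its first n layers before fine-tuning. The snippet below shows one way to do this with the Huggingface Transformers library; the paper does not give its exact construction code, so treat this as an illustrative sketch.

```python
from transformers import AutoModelForSequenceClassification

# Keep only the first n Transformer layers of BERT and fine-tune a
# classifier on top; the deeper layers of the checkpoint are ignored.
n = 6
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_hidden_layers=n,   # builds an n-layer encoder; layers 0..n-1 are loaded
    num_labels=3,          # e.g., 3 classes for MNLI
)
# `model` can now be fine-tuned on the downstream task as usual.
```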
We further probe the early exiting decisions made by internal classifiers. In more detail, we denote difficult instances as instances that the model cannot predict correctly (refer to Section 3.2 for details), and easy instances as those that can be handled well. Intuitively, a difficult instance should be predicted with a lower confidence score than an easy one, so that the confidence score can serve as an indicator for early exiting decisions. Following most existing early exiting work, for each instance x, we utilize the maximum class probability of the output distribution, c(x), as the confidence score.

To measure how well the model can tell the difference between easy and difficult examples, we propose the Difficulty Inversion Score (DIS). First, we sort the instances by their confidence scores in ascending order, i.e., c(x_i) < c(x_j) for any i < j. Then we count the difficulty inversion pairs:

\mathrm{DISum} = \sum_{i=1}^{N} \sum_{j=i+1}^{N} D(x_i, x_j)  (1)

where N is the number of instances and D(·, ·) is an indicator function, computed as:

D(x_i, x_j) = \begin{cases} 1, & \text{if } d_i > d_j \text{ and } c(x_i) < c(x_j) \\ 0, & \text{otherwise} \end{cases}  (2)

where d_i and d_j denote the difficulty labels of x_i and x_j, respectively, e.g., 0 for difficult instances and 1 for easy instances. The final DIS is a normalized DISum:

\mathrm{DIS} = 1 - \frac{\mathrm{DISum}}{K}  (3)

where K is a normalizing factor, i.e., the product of the number of easy instances and the number of difficult instances, which rescales DIS to the range from 0 to 1. A higher DIS indicates that the model ranks instances well according to the confidence score, distinguishing difficult instances from easy ones. Exiting decisions based on classifiers with lower DIS scores are thus unreliable, since they emit more wrongly predicted difficult instances.

Figure 2: DIS (%) heatmap of different models on the MNLI (m) development set. The DIS of internal off-ramps in the shallow layers of DeeBERT is relatively low, leading to unreliable dynamic exiting decisions.

To measure the ability of the internal classifiers in the dynamic early exiting framework to rank instance difficulty, we compute the DIS metric on MNLI (m) for the models discussed in Section 2.1; the results are illustrated in Figure 2. We can observe that: (1) The internal off-ramps in the shallow layers of DeeBERT show a clear gap to BERT-nL and BERT-Complete. This indicates that exiting decisions made in shallow layers can be unreliable, and thus task performance can be poor when instances are mostly emitted in shallow layers. (2) The ability to distinguish difficult examples from easy ones is enhanced as the layer number increases. As shown in the previous section, deeper layer representations boost task performance, so it is reasonable to expect the off-ramps in deeper layers to provide more reliable early exiting decisions. In all, our analysis demonstrates that the exiting decisions made by internal classifiers in current dynamic early exiting are not reliable.
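The sketch below computes the DIS metric from per-instance confidence scores and binary difficulty labels. Function and variable names are illustrative, and the pair counting follows the reconstruction of Equations (1)–(3) given above: it counts pairs in which a difficult instance receives a higher confidence score than an easy one.

```python
def difficulty_inversion_score(confidences, is_easy):
    """DIS = 1 - (#inversion pairs) / (#easy * #difficult).

    confidences: list of max-class probabilities, one per instance
    is_easy:     list of booleans (True = easy, False = difficult)
    """
    order = sorted(zip(confidences, is_easy))     # ascending confidence
    inversions = 0
    easy_seen = 0
    for _, easy in order:
        if easy:
            easy_seen += 1           # easy instance with lower confidence so far
        else:
            inversions += easy_seen  # this difficult instance outranks those easy ones
    n_easy = easy_seen
    n_difficult = len(order) - n_easy
    k = n_easy * n_difficult
    return 1.0 - inversions / k if k > 0 else 1.0
```

For a perfect ranking in which every difficult instance has lower confidence than every easy instance, DIS equals 1; for the fully inverted ranking it equals 0.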
To tackle the drawbacks of dynamic early exiting investigated above, we propose a novel framework, named CascadeBERT, that utilizes a suite of complete PLMs with different layer numbers for acceleration in a cascading manner, and further devises a difficulty-aware calibration regularization to inform the model of instance difficulty.
Formally, given n complete PLM models M_1, M_2, ..., M_n trained on the downstream task dataset with l_1, ..., l_n layers respectively, our goal is to select the model with the minimal number of layers for each input instance x while maintaining the model performance. We formulate this as a cascade exiting problem, i.e., we execute model predictions sequentially for each input example, from the smallest M_1 to the largest M_n, and examine whether the prediction for the input instance x can be emitted.
Algorithm 1: Cascade Exiting
Input: models {M_i}, thresholds {τ_i}
Data: input x
Result: class probability distribution Pr(y | x)
for i ← 1 to n do
    Pr(y | x) = M_i(x)
    c(x) = max_y Pr(y | x)
    if c(x) > τ_i then
        return Pr(y | x)   (early exit)
return Pr(y | x)

Specifically, we use the confidence score c(x), i.e., the maximum class probability, as the metric to determine whether a prediction is confident enough to be emitted:

c(x) = \max_{y \in \mathcal{L}} \Pr(y \mid x)  (4)

where \mathcal{L} is the label set of the task. Given a confidence threshold τ, the prediction result is emitted once the confidence score exceeds the threshold. By varying the confidence threshold τ, we can obtain different speed-up ratios depending on application requirements: a smaller τ means that more examples are emitted by the current model, making inference faster, while a bigger τ makes more examples go through larger models for better performance. The whole framework is summarized in Algorithm 1. Since every candidate model in our cascading framework is a complete model with the pipeline processing ability, predictions are more robust even when only the smallest model is executed.

To make the confidence-based cascade exiting more reliable, we design a difficulty-based margin loss that further improves the efficiency of the cascade mechanism. In more detail, we add a regularization objective for each instance pair:

L(x_i, x_j) = \max\{0, -g(x_i, x_j)\,(c(x_i) - c(x_j)) + \epsilon\}  (5)

where \epsilon is a confidence margin, and g(x_i, x_j) is defined as:

g(x_i, x_j) = \begin{cases} 1, & \text{if } d_i > d_j \\ 0, & \text{if } d_i = d_j \\ -1, & \text{otherwise} \end{cases}  (6)

where d_i and d_j are the difficulty labels of x_i and x_j, respectively. This objective is added to the original task-specific loss with a weight factor λ to adjust its impact. By optimizing the combined objective, the confidence scores of more difficult instances are adjusted to be lower than those of easier instances, making the confidence-based emitting decisions more accurate. Note that traditional post-hoc calibration methods like temperature scaling (Guo et al., 2017) are not applicable here, since such re-scaling does not change the ranking of different instances.

To measure instance difficulty, we first split the training dataset D into K folds {D̃_i | i = 1, ..., K} and train K small models with multiple seeds using the leave-one-out method, i.e., model θ_i is trained on D − D̃_i. We then utilize θ_i to evaluate the difficulty of the examples in D̃_i: samples are marked as easy if they can be correctly classified, and as difficult otherwise. To eliminate the impact of randomness, we aggregate the predictions across seeds and strictly label as easy only the examples that are correctly predicted under all seeds, while the others are labeled as difficult. A similar approach is adopted in Xu et al. (2020a) for applying curriculum learning to natural language understanding.

We estimate the inference speed-up ratio according to the number of layers actually executed in forward propagation for each example (Xin et al., 2020; Zhou et al., 2020), since we do not introduce any external parameters into our framework. Specifically, the speed-up ratio over an original model with n layers is calculated as:

\text{speed-up ratio} = \frac{\sum_i n \times m_i}{\sum_i C_i \times m_i}  (7)

where m_i is the number of test instances that actually cost C_i layers in total. Note that in our cascade exiting framework, the overhead incurred by instances that run forward propagation through multiple models is counted in C_i: for example, an instance that is first fed into an l_1-layer model and then goes through an l_2-layer model to obtain the final prediction executes l_1 + l_2 layers in total.
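The difficulty-aware regularization in Equations (5) and (6) can be realized as a pairwise margin penalty on confidence scores within a batch. The snippet below is a minimal PyTorch sketch under the conventions used here (difficulty labels with a higher value for easier instances, a margin epsilon, and a weight lambda); tensor names and the all-pairs batching scheme are illustrative rather than the authors' exact implementation.

```python
import torch

def difficulty_margin_loss(confidence, difficulty, epsilon=0.1):
    """Pairwise margin loss over all instance pairs in a batch.

    confidence: [B] max class probability per instance (differentiable)
    difficulty: [B] difficulty label per instance (higher = easier here)
    """
    c_i = confidence.unsqueeze(1)            # [B, 1]
    c_j = confidence.unsqueeze(0)            # [1, B]
    d_i = difficulty.unsqueeze(1).float()
    d_j = difficulty.unsqueeze(0).float()
    g = torch.sign(d_i - d_j)                        # +1, 0, or -1, as in Eq. (6)
    pair_loss = torch.clamp(-g * (c_i - c_j) + epsilon, min=0.0)  # Eq. (5)
    pair_loss = pair_loss * (g != 0)                 # ignore equal-difficulty pairs
    return pair_loss.mean()

# Combined objective: task loss plus the weighted regularizer, e.g.
# total_loss = task_loss + lam * difficulty_margin_loss(confidence, difficulty)
```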
Method | MNLI(m) | MNLI(mm) | MRPC | QNLI | QQP | RTE | SST-2 | Avg.
BERT-base† | – (1.00×) | 83.4 (1.00×) | 88.9 (1.00×) | 90.5 (1.00×) | 71.2 (1.00×) | 66.4 (1.00×) | 93.5 (1.00×) | 82.6

~2.00× speed-up, static:
BERT-6L‡ | – (2.00×) | 79.2 (2.00×) | 85.1 (2.00×) | 86.2 (2.00×) | 68.9 (2.00×) | 65.0 (2.00×) | 90.9 (2.00×) | 79.3
BERT-small† | – (2.00×) | – (2.00×) | 86.8 (2.00×) | 88.9 (2.00×) | 70.4 (2.00×) | 65.3 (2.00×) | 91.8 (2.00×) | 81.2
BERT-PKD† | – (2.00×) | 81.0 (2.00×) | 85.0 (2.00×) | 89.0 (2.00×) | 70.7 (2.00×) | 65.5 (2.00×) | 92.0 (2.00×) | 80.7
BERT-of-Theseus† | – (2.00×) | 82.1 (2.00×) | – (2.00×) | – (2.00×) | – (2.00×) | – (2.00×) | – (2.00×) | –

~2.00× speed-up, dynamic:
DeeBERT† | – | 73.1 (1.88×) | 84.4 (2.07×) | 85.6 (2.09×) | 70.4 (2.13×) | 64.3 (1.95×) | 90.2 (2.00×) | 77.5
PABEE† | – | 78.7 (2.08×) | 84.4 (2.01×) | 88.0 (1.87×) | 70.4 (2.09×) | 64.0 (1.81×) | 89.3 (1.95×) | 79.2
CascadeBERT | 83.0 (2.01×) | – (2.01×) | – (2.01×) | – (2.01×) | – (2.01×) | – (2.03×) | – (2.08×) | –

~3.00× speed-up, static:
BERT-4L‡ | – (3.00×) | 75.1 (3.00×) | – (3.00×) | 84.7 (3.00×) | 66.5 (3.00×) | – (3.00×) | 87.5 (3.00×) | 76.5
BERT-small‡ | – (3.00×) | 78.3 (3.00×) | 82.3 (3.00×) | – (3.00×) | 69.8 (3.00×) | 59.2 (3.00×) | – (3.00×) | 78.0
BERT-PKD‡ | – (3.00×) | – (3.00×) | – (3.00×) | 84.9 (3.00×) | – (3.00×) | 62.8 (3.00×) | 89.2 (3.00×) | –
BERT-of-Theseus‡ | – (3.00×) | 77.4 (3.00×) | 82.2 (3.00×) | 85.5 (3.00×) | 68.3 (3.00×) | 59.5 (3.00×) | 89.7 (3.00×) | 77.3

~3.00× speed-up, dynamic:
DeeBERT‡ | – | 61.3 (3.03×) | 83.5 (3.00×) | 82.4 (2.99×) | 67.0 (2.97×) | 59.9 (3.00×) | 88.8 (2.97×) | 72.3
PABEE‡ | – | 75.3 (2.71×) | 82.6 (2.72×) | 82.6 (3.04×) | 69.5 (2.57×) | 60.5 (2.38×) | 85.2 (3.15×) | 75.9
CascadeBERT | 81.2 (3.00×) | – (3.00×) | – (3.00×) | – (3.02×) | – (3.02×) | – (3.03×) | – (3.00×) | –

Table 1: Test results from the GLUE server. We report F1-score for QQP and MRPC and accuracy for the other tasks. The corresponding speed-up ratio is shown in parentheses. For baseline methods, † denotes results taken from the original paper and ‡ denotes results based on our implementation. The middle rows report model performance around 2× speed-up and the bottom rows represent 3× acceleration. The best results among methods of the same kind are shown in bold.

We evaluate our method on the GLUE benchmark with the BERT (Devlin et al., 2019) model as our backbone architecture. We first give a brief introduction of the datasets used and the experimental settings, followed by a description of the baseline models used for comprehensive evaluation. The results and analysis of the experiments are presented at the end.
We use six classification tasks from the GLUE benchmark (Wang et al., 2018): MNLI (Williams et al., 2018), MRPC (Dolan and Brockett, 2005), QNLI (Rajpurkar et al., 2016), QQP (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), RTE (Bentivogli et al., 2009), and SST-2 (Socher et al., 2013). The evaluation metrics are F1-score for QQP and MRPC, and accuracy for the remaining tasks. Our implementation is based on the Huggingface Transformers library (Wolf et al., 2020). We use a cascade of two complete models of different depths for selection, since we find in practice that this works well, and we leave exploring more candidate models for future work. We utilize the weights provided by Turc et al. (2019) to initialize the models in our suite. The hyper-parameters, including the margin ε and the margin loss weight λ, are tuned on the development set, and we select the best-performing models for evaluation on the test set.

We implement two kinds of baseline models for a comprehensive evaluation of our framework:
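To make the two-model cascade concrete, the sketch below runs a shallow and a deep fine-tuned sequence classification model in cascade order and tracks the number of layers executed, from which the speed-up ratio of Equation (7) can be computed. The checkpoint paths, threshold value, and helper names are placeholders, not the authors' released artifacts.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical paths to a shallow and a deep model fine-tuned on the same task.
small = AutoModelForSequenceClassification.from_pretrained("path/to/small-model")
large = AutoModelForSequenceClassification.from_pretrained("path/to/large-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/large-model")
small.eval(); large.eval()

def cascade_predict(text, threshold=0.9):
    """Return (class probabilities, total layers executed) for one input."""
    inputs = tokenizer(text, return_tensors="pt")
    layers_executed = 0
    with torch.no_grad():
        for model in (small, large):
            layers_executed += model.config.num_hidden_layers
            probs = model(**inputs).logits.softmax(dim=-1)
            if probs.max().item() > threshold:   # confident enough: emit
                break
    return probs, layers_executed

# Speed-up ratio over a 12-layer backbone, following Eq. (7):
# speedup = (12 * num_examples) / sum(layers_executed over all examples)
```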
Early Exiting includes BERT-nL, where only the first n layers of the original model are used for making the final classification and a classifier is fine-tuned for acquiring task-specific information; we take n = 6 and n = 4 to obtain statically compressed models with speed-up ratios of 2× and 3×, respectively. DeeBERT (Xin et al., 2020) makes dynamic early predictions based on internal classifiers. PABEE (Zhou et al., 2020) is also included; it is a robustly enhanced variant that emits a prediction only after several consecutive layers produce a consistent exiting decision.

Knowledge Distillation aims at compressing the original large model into a small one with fewer layers. We compare our framework to distillation methods that do not require external data, including BERT-PKD (Sun et al., 2019), which distills internal states of the teacher model into the student model, and BERT-of-Theseus (Xu et al., 2020b), which achieves compression by gradually replacing modules of the original model.
The main results are presented in Table 1. We find, somewhat surprisingly, that the early exiting methods cannot beat the simple BERT-nL baseline, which directly fine-tunes a classifier on top of an internal layer. This validates our motivation that both the emitting decisions and the predictions based on shallow-layer representations are not reliable. Our method instead selects among complete models in a cascading manner and outperforms early exiting methods by a large margin. Furthermore, CascadeBERT also outperforms the enhanced dynamic early exiting variant, PABEE, in average score when the speed-up ratio is 2×, and the gap becomes clearer when the speed-up ratio rises to 3×, demonstrating that our model can maintain satisfying performance even when the speed-up ratio is relatively high.

Besides, our proposal is comparable with state-of-the-art knowledge distillation methods like BERT-PKD and BERT-of-Theseus at 2× speed-up, and is in an advantageous position when the speed-up ratio grows to 3×. Although distillation methods can implicitly learn the pipeline ability by forcing student models to mimic the intermediate representations of the teacher model (Sun et al., 2019), or by gradually replacing modules of the teacher model (Xu et al., 2020b), it is still relatively hard to obtain a well-performing student model with the pipeline processing ability for all instances of different difficulties, especially when the compression ratio is high. On the contrary, since every model candidate in our framework is a complete model and the predictions are calibrated to reflect the instance difficulty, the cascade of different models can still produce robust results.

Our work aims to accelerate the inference of large-scale pre-trained language models while maintaining their superior performance. Previous efforts towards this goal can be categorized into model-level compression and instance-level speed-up:
Model-level compression aims to obtain a computation-efficient model, for instance via knowledge distillation or quantization. Knowledge distillation (KD) focuses on transferring the knowledge of a teacher model into a small student model (Hinton et al., 2015), and various KD techniques have been applied to pre-trained language models to obtain more compact student models (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2020). Quantization methods aim to represent the model efficiently using fewer physical bits (Shen et al., 2020). Note that our proposal is agnostic to the model distillation technique, and thus these advanced methods can be incorporated into our framework to further enhance performance.
Instance-level speed-up accelerates inference via early exiting, i.e., producing results based on intermediate representations (Xin et al., 2020; Liu et al., 2020; Schwartz et al., 2020). The motivation is that instances with different complexities can exit at different levels of a big model, based on the prediction confidence of internal classifiers. However, we argue that this approach makes unreliable predictions due to the lack of high-level semantic understanding. We instead propose to achieve acceleration based on a series of complete models, and calibrate the model predictions for more accurate selection decisions.
In this paper, we address the unreliable results produced by dynamic early exiting methods and propose CascadeBERT, a simple and effective framework for accelerating the inference of pre-trained language models. Experimental results demonstrate that our proposal achieves superior performance over previous acceleration methods when the speed-up ratio is high. We hope to analyze the impact of the number of models available for selection and to explore more backbone architectures for evaluating the universality of our framework in the future.
Acknowledgements
This work was supported by a Tencent Research Grant. Xu Sun is the corresponding author of this paper.
References
Luisa Bentivogli, Ido Kalman Dagan, Dang Hoa, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC Workshop.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP).

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In ICML, pages 1321–1330.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP, pages 4163–4174.

Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-deep networks: Understanding and mitigating network overthinking. In ICML, pages 3301–3310.

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. 2020. FastBERT: a self-distilling BERT with adaptive inference time. In ACL, pages 6035–6044.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, and Ian Tenney. 2020. What happens to BERT embeddings during fine-tuning? In BlackboxNLP Workshop, pages 33–44.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In NeurIPS, pages 14014–14024.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020. The right tool for the job: Matching model and instance complexities. In ACL, pages 6640–6651.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI, pages 8815–8821.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In EMNLP-IJCNLP, pages 4323–4332.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In ACL, pages 4593–4601.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, pages 5797–5808.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP Workshop on BlackboxNLP, pages 353–355.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In System Demonstrations, EMNLP, pages 38–45.

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In ACL, pages 2246–2251.

Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. 2020a. Curriculum learning for natural language understanding. In ACL, pages 6095–6104.

Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020b. BERT-of-Theseus: Compressing BERT by progressive module replacing. In EMNLP, pages 7859–7869.

Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. 2020. BERT loses patience: Fast and robust inference with early exit. arXiv preprint arXiv:2006.04152.