BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
Canwen Xu∗, Wangchunshu Zhou∗, Tao Ge, Furu Wei, Ming Zhou
University of California, San Diego; Beihang University; Microsoft Research Asia
[email protected], [email protected], {tage,fuwei,mingzhou}@microsoft.com

Abstract
In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction between the original and compact models. Compared to previous knowledge distillation approaches for BERT compression, our approach does not introduce any additional loss function. Our approach outperforms existing knowledge distillation approaches on the GLUE benchmark, showing a new perspective on model compression.

∗ Equal contribution. Work done during these two authors' internship at Microsoft Research Asia. The code and pretrained model are available at https://github.com/JetRunner/BERT-of-Theseus

Introduction

With the prevalence of deep learning, many huge neural models have been proposed and achieve state-of-the-art performance in various fields (He et al., 2016; Vaswani et al., 2017). Specifically, in Natural Language Processing (NLP), pretraining and fine-tuning have become the new norm for most tasks. Transformer-based pretrained models (Devlin et al., 2019; Liu et al., 2019b; Yang et al., 2019; Song et al., 2019; Dong et al., 2019) have dominated the fields of both Natural Language Understanding (NLU) and Natural Language Generation (NLG). These models benefit from their "overparameterized" nature (Nakkiran et al., 2020) and contain millions or even billions of parameters, making them computationally expensive and inefficient considering both memory consumption and high latency. This drawback enormously hinders the application of these models in production.

To resolve this problem, many techniques have been proposed to compress a neural network. Generally, these techniques can be categorized into Quantization (Gong et al., 2014), Weights Pruning (Han et al., 2016) and Knowledge Distillation (KD) (Hinton et al., 2015). Among them, KD has received much attention for compressing pretrained language models. KD exploits a large teacher model to "teach" a compact student model to mimic the teacher's behavior. In this way, the knowledge embedded in the teacher model can be transferred into the smaller model. However, the retained performance of the student model relies on a well-designed distillation loss function which forces the student model to behave as the teacher does. Recent studies on KD (Sun et al., 2019; Jiao et al., 2019) even leverage more sophisticated model-specific distillation loss functions for better performance.

Different from previous KD studies, which explicitly exploit a distillation loss to minimize the distance between the teacher model and the student model, we propose a new genre of model compression. Inspired by the famous thought experiment "Ship of Theseus" in Philosophy (https://en.wikipedia.org/wiki/Ship_of_Theseus), where all components of a ship are gradually replaced by new ones until no original component exists, we propose Theseus Compression for BERT
(BERT-of-Theseus), which progressively substitutes modules of BERT with modules of fewer parameters. We call the original model and the compressed model the predecessor and the successor, in correspondence to the concepts of teacher and student in KD, respectively. As shown in Figure 1, we first specify a substitute (successor module) for each predecessor module (i.e., a module in the predecessor model). Then, we randomly replace each predecessor module with its corresponding successor module by a probability and make them work together in the training phase. After convergence, we combine all successor modules to form the successor model for inference. In this way, the large predecessor model is compressed into a compact successor model.

Theseus Compression shares a similar idea with KD, encouraging the compressed model to behave like the original, but it holds many merits. First, we only use the task-specific loss function in the compression process, whereas KD-based methods use a task-specific loss together with one or multiple distillation losses as their optimization objective. Also, selecting various loss functions and balancing the weights of each loss for different tasks and datasets can be laborious (Sun et al., 2019; Sanh et al., 2019). Second, different from recent work (Jiao et al., 2019), Theseus Compression does not use Transformer-specific features for compression and thus can potentially compress a wide spectrum of models. Third, instead of using the original model only for inference as in KD, our approach allows the predecessor model to work in association with the compressed successor model, enabling a possible gradient-level interaction. Moreover, the different module permutations mixing both predecessor and successor modules may add extra regularization, similar to Dropout (Srivastava et al., 2014). With a Curriculum Learning (Bengio et al., 2009) driven replacement scheduler, our approach achieves promising performance compressing BERT (Devlin et al., 2019), a large pretrained Transformer model.

To summarize, our contribution is two-fold: (1) We propose a novel approach, Theseus Compression, revealing a new pathway to model compression, with no additional loss function. (2) Our compressed BERT model is 1.94× faster while retaining more than 98% of the performance of the original model, outperforming other KD-based compression baselines.

Related Work

Model Compression
Model compression aims to reduce the size and computational cost of a large model while retaining as much performance as possible. Conventional explanations (Denil et al., 2013; Zhai et al., 2016) claim that the large number of weights is necessary for the training of deep neural networks, but a high degree of redundancy exists after training. Recent work (Frankle and Carbin, 2019) proposes the Lottery Ticket Hypothesis, claiming that dense, randomly initialized, feed-forward networks contain subnetworks that can be recognized and trained to reach a test accuracy comparable to the original network. Quantization (Gong et al., 2014) reduces the number of bits used to represent a number in a model. Weights Pruning (Han et al., 2016; He et al., 2017) conducts a binary classification to decide which weights to trim from the model. Knowledge Distillation (KD) (Hinton et al., 2015) aims to train a compact model which behaves like the original one. FitNets (Romero et al., 2015) demonstrates that "hints" learned by the large model can benefit the distillation process. Born-Again Neural Network (Furlanello et al., 2018) reveals that ensembling multiple identically parameterized students can outperform the teacher model. LIT (Koratana et al., 2019) introduces block-wise intermediate representation training. Liu et al. (2019a) distilled knowledge from ensemble models to improve the performance of a single model on NLU tasks. Tan et al. (2019) exploited KD for multilingual machine translation. Different from KD-based methods, our proposed Theseus Compression is the first approach to mix the original model and the compact model for training. Also, no additional loss is used throughout the whole compression procedure, which simplifies the implementation.
Faster BERT
Very recently, many attempts have been made to speed up a large pretrained language model (e.g., BERT (Devlin et al., 2019)). Michel et al. (2019) reduced the parameters of a BERT model by pruning unnecessary heads in the Transformer. Shen et al. (2020) quantized BERT to 2-bit using Hessian information. Also, substantial modifications have been made to the Transformer architecture. Fan et al. (2020) exploited a structured dropping mechanism to train a BERT-like model which is resilient to pruning. ALBERT (Lan et al., 2020) leverages matrix decomposition and parameter sharing. However, these models cannot exploit ready-made model weights and require a full retraining. Tang et al. (2019) used a BiLSTM architecture to extract task-specific knowledge from BERT. DistilBERT (Sanh et al., 2019) applies a naive Knowledge Distillation on the same corpus used to pretrain BERT. Patient Knowledge Distillation (PKD) (Sun et al., 2019) designs multiple distillation losses between the module hidden states of the teacher and student models. Pretrained Distillation (PD-BERT) (Turc et al., 2019) first trains the compact student on an unlabeled corpus with a masked LM objective before task-specific distillation.
Module Replacing

The basic idea of Theseus Compression is very similar to KD: we want the successor model to act like the predecessor model. KD explicitly defines a loss to measure the similarity of the teacher and student. However, the performance vastly relies on the design of the loss function (Hinton et al., 2015; Sun et al., 2019; Jiao et al., 2019), and this loss function needs to be combined with a task-specific loss (Sun et al., 2019; Koratana et al., 2019). Different from KD, Theseus Compression only requires one task-specific loss function (e.g., Cross Entropy), which closely resembles a fine-tuning procedure. Inspired by Dropout (Srivastava et al., 2014), we propose module replacing, a novel technique for model compression. We call the original model and the target model predecessor and successor, respectively. First, we specify a successor module for each module in the predecessor. For example, in the context of BERT compression, we let one Transformer layer be the successor module for two Transformer layers.

Consider a predecessor model $P$ which has $n$ modules and a successor model $S$ which has $n$ predefined modules. Let $P = \{prd_1, \ldots, prd_n\}$ denote the predecessor model, where $prd_i$ and $scc_i$ denote the predecessor modules and their corresponding substitutes, respectively. The output vector of the $i$-th module is denoted as $y_i$. Thus, the forward operation can be described as:

$$y_{i+1} = prd_i(y_i) \quad (1)$$

During compression, we apply module replacing. First, for the $(i+1)$-th module, $r_{i+1}$ is an independent Bernoulli random variable which has probability $p$ to be $1$ and $1-p$ to be $0$:

$$r_{i+1} \sim \mathrm{Bernoulli}(p) \quad (2)$$

Then, the output of the $(i+1)$-th module is calculated as:

$$y_{i+1} = r_{i+1} * scc_i(y_i) + (1 - r_{i+1}) * prd_i(y_i) \quad (3)$$

where $*$ denotes the element-wise multiplication and $r_{i+1} \in \{0, 1\}$. In this way, the predecessor modules and successor modules work together in the training. Since the permutation of the hybrid model is random, it adds extra noise as a regularization for the training of the successor, similar to Dropout (Srivastava et al., 2014).

During training, similar to a fine-tuning process, we optimize a regular task-specific loss, e.g., Cross Entropy:

$$L = - \sum_{j \in |X|} \sum_{c \in C} \left[ \mathbb{1}[z_j = c] \cdot \log P(z_j = c \mid x_j) \right] \quad (4)$$

where $x_j \in X$ is the $j$-th training sample; $z_j$ is its corresponding ground-truth label; $c$ and $C$ denote a class label and the set of class labels, respectively. For back-propagation, the weights of all predecessor modules are frozen. Both the embedding layer and the output layer of the predecessor model are weight-frozen and directly adopted for the successor model in this training phase. In this way, the gradient can be calculated across both the predecessor and successor modules, allowing deeper interaction.

To make the training and inference processes as close as possible, we further carry out a post-replacement fine-tuning phase to allow all successor modules to work together. After the replacing compression converges, we collect all successor modules and combine them to form the successor model $S$:

$$S = \{scc_1, \ldots, scc_n\}, \qquad y_{i+1} = scc_i(y_i) \quad (5)$$

Since each $scc_i$ is smaller than $prd_i$ in size, the predecessor model $P$ is in essence compressed into a smaller model $S$. Then, we fine-tune the successor model by optimizing the same loss as in Equation 4. The whole procedure, including module replacing and successor fine-tuning, is illustrated in Figure 2(a). Finally, we use the fine-tuned successor for inference as in Equation 5.
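To make the procedure concrete, below is a minimal PyTorch sketch of module replacing (Equations 1-5). This is our illustration rather than the authors' released implementation; the class and attribute names are assumptions, and each module here stands for a block of one or more Transformer layers (e.g., one predecessor module wrapping two BERT layers).

```python
import torch
import torch.nn as nn


class TheseusStack(nn.Module):
    """Sketch of progressive module replacing for a stack of n modules.

    Each scc[i] (e.g., one Transformer layer) is trained to stand in
    for its frozen counterpart prd[i] (e.g., two Transformer layers).
    """

    def __init__(self, predecessor_modules, successor_modules):
        super().__init__()
        assert len(predecessor_modules) == len(successor_modules)
        self.prd = nn.ModuleList(predecessor_modules)
        self.scc = nn.ModuleList(successor_modules)
        self.replacing_rate = 0.5  # p in Eq. 2; a scheduler may update it
        for param in self.prd.parameters():
            param.requires_grad = False  # predecessor weights stay frozen

    def forward(self, y):
        if self.training:
            for prd_i, scc_i in zip(self.prd, self.scc):
                # Eq. 2-3: one independent Bernoulli draw per module and
                # batch decides which module processes the hidden states.
                if torch.rand(()) < self.replacing_rate:
                    y = scc_i(y)
                else:
                    y = prd_i(y)
        else:
            # Eq. 5: inference uses the assembled successor only.
            for scc_i in self.scc:
                y = scc_i(y)
        return y
```

During training, the regular task loss of Equation 4 is applied on top of this stack. Because the predecessor parameters have `requires_grad=False`, only the successor modules receive updates, while gradients still flow through frozen predecessor modules on mixed paths, which is the gradient-level interaction described above.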
Replacement Scheduler

Although setting a constant replacement rate $p$ can meet the need for compressing a model, we further highlight a Curriculum Learning (Bengio et al., 2009) driven replacement scheduler, which coordinates the progressive replacement of the modules.

Figure 2: The replacing curves of a constant module replacing rate ((a) constant $p = 0.5$) and a replacement scheduler ((b) linear replacement scheduler); the y-axis is the replacing rate. We use different shades of gray to mark the two phases of Theseus Compression: (1) module replacing; (2) successor fine-tuning.
Similar to (Morerio et al., 2017; Zhou et al., 2020a), we devise a replacement scheduler to dynamically tune the replacement rate $p$. Here, we leverage a simple linear scheduler $\theta(t)$ to output the dynamic replacement rate $p_d$ for step $t$:

$$p_d = \min(1, \theta(t)) = \min(1, kt + b) \quad (6)$$

where $k > 0$ is the coefficient and $b$ is the basic replacement rate. The replacing rate curve with a replacement scheduler is illustrated in Figure 2(b). In this way, we unify the two previously separated training stages and encourage an end-to-end, easy-to-hard learning process. First, with more predecessor modules present, the model is more likely to predict correctly and thus has a relatively small cross-entropy loss, which is helpful for smoothing the learning process. Then, at a later stage of compression, more successor modules can be present together, encouraging the model to gradually learn to predict with less guidance from the predecessor and steadily transit to the successor fine-tuning stage.

Second, at the beginning of the compression, when $\theta(t) < 1$, considering the average learning rate for all $n$ successor modules, the expected number of replaced modules is $n \cdot p_d$ and the expected average learning rate is:

$$lr' = (n p_d / n) \cdot lr = (kt + b) \cdot lr \quad (7)$$

where $lr$ is the constant learning rate set for the compression and $lr'$ is the equivalent learning rate considering all successor modules. Thus, when applying a replacement scheduler, a warm-up mechanism (Popel and Bojar, 2018) is essentially adopted at the same time, which helps the training of a Transformer.
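A minimal sketch of the linear scheduler in Equation 6 follows; the function name and the example constants are ours, not values prescribed by the paper.

```python
def dynamic_replacing_rate(step: int, k: float, b: float) -> float:
    """Eq. 6: p_d = min(1, k * t + b), with slope k > 0 and base rate b.

    Once k * t + b reaches 1, every module is replaced and training
    smoothly turns into plain fine-tuning of the successor (Figure 2(b)).
    """
    return min(1.0, k * step + b)


# Hypothetical setting: start at b = 0.3 and reach full replacement
# after 7,000 steps. By Eq. 7, the equivalent learning rate grows as
# (k * t + b) * lr during this phase, i.e., a built-in linear warm-up.
k = (1.0 - 0.3) / 7000
assert dynamic_replacing_rate(0, k, 0.3) == 0.3
assert abs(dynamic_replacing_rate(7000, k, 0.3) - 1.0) < 1e-9
```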
Experiments

In this section, we introduce the experiments of Theseus Compression for BERT (Devlin et al., 2019). We compare BERT-of-Theseus with other compression methods and further conduct experiments to analyze the results.
We evaluate our proposed approach on the GLUE benchmark (Wang et al., 2019; Dolan and Brockett, 2005; Conneau and Kiela, 2018; Socher et al., 2013; Williams et al., 2018; Rajpurkar et al., 2016; Warstadt et al., 2019). Note that we exclude WNLI (Levesque, 2011), following the original BERT paper (Devlin et al., 2019).

Accuracy is used as the metric for SST-2, MNLI-m, MNLI-mm, QNLI and RTE. F1 and accuracy are used for MRPC and QQP. Pearson correlation and Spearman correlation are used for STS-B. Matthews correlation is used for CoLA. The results reported for the test set of GLUE are in the same format as on the official leaderboard. For the sake of comparison with (Sanh et al., 2019), on the development set of GLUE, the result for MNLI is an average over MNLI-m and MNLI-mm; the results on MRPC and QQP are reported as the average of F1 and accuracy; the result reported on STS-B is the average of the Pearson and Spearman correlations.
We test our approach under a task-specific compression setting (Sun et al., 2019; Turc et al., 2019) instead of a pretraining compression setting (Sanh et al., 2019; Sun et al., 2020). That is to say, we use no external unlabeled corpus but only the training set of each task in GLUE to compress the model. The reason behind this decision is that we intend to straightforwardly verify the effectiveness of our generic compression approach. The fast training process of task-specific compression (e.g., no longer than 20 GPU hours for any task of GLUE) computationally enables us to conduct more analytical experiments. For comparison, DistilBERT (Sanh et al., 2019) takes
720 GPU hours to train. Plus, in real-world applications, this setting provides more flexibility when selecting from different pretrained LMs (e.g., BERT, RoBERTa (Liu et al., 2019b)) for various downstream tasks, and it is easy to adopt a newly released model without a time-consuming pretraining compression. We also discuss the possibility of using an MNLI-compressed model for general purposes via intermediate transfer learning (Pruksachatkun et al., 2020).

Formally, we define the task of compression as retaining as much performance as possible when compressing the officially released BERT-base (uncased) (https://github.com/google-research/bert) to a 6-layer compact model with the same hidden size, following the settings in (Sanh et al., 2019; Sun et al., 2019; Turc et al., 2019). Under this setting, the compressed model has 24M parameters for the token embedding (identical to the original model) and 42M parameters for the Transformer layers, and obtains a 1.94× speed-up for inference.

We fine-tune BERT-base as the predecessor model for each task with a fixed batch size, learning rate, and number of epochs. As a result, we are able to obtain predecessor models with performance comparable to that reported in previous studies (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2019). Afterward, for training successor models, following (Sanh et al., 2019; Sun et al., 2019), we use the first 6 layers of BERT-base to initialize the successor model, since the over-parameterized nature of the Transformer (Vaswani et al., 2017) could prevent the model from converging when trained on small datasets. During module replacing, we fix the batch size as 32 for all evaluated tasks to reduce the search space. All $r_{i+1}$ variables are sampled only once per training batch. The maximum sequence length is set to 256 on QNLI and 128 for the other tasks. We perform grid search over the learning rate $lr$, the basic replacing rate $b$, and the scheduler coefficient $k$, which determines how many training steps the dynamic replacing rate takes to increase to 1. We apply an early stopping mechanism and select the model with the best performance on the development set. We conduct our experiments on a single Nvidia V100 16GB GPU. The peak memory usage is approximately identical to that of fine-tuning a BERT-base, since at most 12 layers are training at the same time. The training time for each task varies with the size of the training set; for example, it takes 20 hours to train on MNLI but less than 30 minutes on MRPC.
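As an illustration of the successor initialization described above, the following sketch uses the HuggingFace transformers library; this is our assumption of one possible implementation, not necessarily how the authors' released code does it.

```python
import copy

import torch.nn as nn
from transformers import BertModel

# Load the 12-layer predecessor (in practice, a checkpoint already
# fine-tuned on the target task; the name below is the public one).
predecessor = BertModel.from_pretrained("bert-base-uncased")

# Initialize a 6-layer successor from the predecessor's bottom 6 layers.
# The embedding layer is shared with the predecessor and, like the task
# head, kept frozen during the module-replacing phase.
successor = copy.deepcopy(predecessor)
successor.encoder.layer = nn.ModuleList(
    [copy.deepcopy(predecessor.encoder.layer[i]) for i in range(6)]
)
successor.config.num_hidden_layers = 6
```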
NSP - -Fine-tuning 6 66M CE
TASK (cid:55) (cid:51)
Vanilla KD (2015) 6 66M CE KD + CE TASK (cid:55) (cid:51)
BERT-PKD (2019) 6 66M CE KD + PT KD + CE TASK (cid:55) (cid:51)
DistilBERT (2019) 6 66M CE KD + Cos KD + CE MLM (cid:51) (unlabeled) (cid:51)
PD-BERT (2019) 6 66M CE
MLM + CE KD + CE TASK (cid:51) (unlabeled) (cid:51)
TinyBERT (2019) 4 15M MSE attn + MSE hidn + MSE embd + CE KD (cid:51) (unlabeled + labeled) (cid:55) MobileBERT (2020) 24 25M FMT+AT+PKT+CE KD +CE MLM (cid:51) (unlabeled) (cid:55)
BERT-of-Theseus (Ours) 6 66M CE
TASK (cid:55) (cid:51)
Table 1: Comparison of different BERT compression approaches. “CE” and “MSE” stand for Cross Entropy andMean Square Error, respectively. “KD” indicates the loss is for Knowledge Distillation. “CE
TASK ”, “CE
MLM ”and “CE
NSP ” indicate Cross Entropy calculated on downstream tasks, Masked LM pretraining and Next SentencePrediction, respectively. Other loss functions are described in their corresponding papers. train on MNLI but less than 30 minutes on MRPC.
As shown in Table 1, we compare the layer number, parameter number, loss functions, external data usage and model agnosticism of our proposed approach against existing methods. We set up a baseline of vanilla Knowledge Distillation (Hinton et al., 2015) as in (Sun et al., 2019). Additionally, we directly fine-tune a truncated 6-layer BERT model (the bottom 6 layers of the original BERT) on GLUE tasks to obtain a natural fine-tuning baseline. (We also tried the top 6 layers and interleaving 6 layers, but both perform worse than the bottom 6 layers.) Under the setting of compressing 12-layer BERT-base to a 6-layer compact model, we choose BERT-PKD (Sun et al., 2019), PD-BERT (Turc et al., 2019), and DistilBERT (Sanh et al., 2019) as strong baselines. Note that DistilBERT (Sanh et al., 2019) is not directly comparable here since it uses a pretraining compression setting. Both PD-BERT and DistilBERT use an external unlabeled corpus. Additionally, we use LayerDrop (Fan et al., 2020) on BERT weights to prune the model on downstream tasks. We do not include TinyBERT (Jiao et al., 2019) since it conducts distillation twice and leverages extra augmented data for GLUE tasks. We also exclude MobileBERT (Sun et al., 2020) due to its redesigned Transformer block and different model size. Besides, in these two studies, the loss functions are not architecture-agnostic, which limits their application to other types of models.

We report the experimental results on the development set of GLUE in Table 2, and we submit our predictions to the GLUE test server and obtain the results from the official leaderboard, as shown in Table 3. Note that DistilBERT does not report results on the test set. (The reported results of DistilBERT differ across versions on arXiv; the results here are from v3, which was the newest version when we composed this paper.) The BERT-base performance reported on the GLUE development set is that of the predecessor fine-tuned by us. The results of BERT-PKD on the development set are reproduced by us using the official implementation. In the original paper of BERT-PKD, the results of CoLA and STS-B on the test set are not reported, so we reproduce these two results. The fine-tuning and vanilla KD baselines are both implemented by us. All other results are from the original papers. The macro scores here are calculated in the same way as on the official leaderboard, but they are not directly comparable with the GLUE leaderboard since we exclude WNLI from the calculation.

Overall, our BERT-of-Theseus retains 98.35% and 98.05% of the BERT-base performance on the GLUE development set and test set, respectively. On every task of GLUE, our model dramatically outperforms the fine-tuning baseline, indicating that with the same loss function, our proposed approach can effectively transfer knowledge from the predecessor to the successor. Also, our model clearly outperforms vanilla KD (Hinton et al., 2015) and Patient Knowledge Distillation (PKD) (Sun et al., 2019), showing its supremacy over KD-based compression approaches. On MNLI, our model performs better than BERT-PKD but slightly lower than PD-BERT (Turc et al., 2019). However, PD-BERT exploits an additional corpus which provides many more samples for knowledge transfer.

Method | CoLA (8.5K) | MNLI (393K) | MRPC (3.7K) | QNLI (105K) | QQP (364K) | RTE (2.5K) | SST-2 (67K) | STS-B (5.7K) | Macro Score
BERT-base (2019) | 54.3 | 83.5 | 89.5 | 91.2 | 89.8 | 71.1 | 91.5 | 88.9 | 82.5
DistilBERT (2019) | 43.6 | 79.0 | 87.5 | 85.3 | 84.9 | 59.9 | 90.7 | 81.2 | 76.5
PD-BERT (2019) |  |  |  |  |  |  |  |  | 
Table 2: Experimental results (median of 5 runs) on the development set of GLUE. The numbers under each dataset indicate the number of training samples. All models listed above (except BERT-base) have 66M parameters, 6 layers and a 1.94× speed-up.

Method | CoLA (8.5K) | MNLI-m/mm (393K) | MRPC (3.7K) | QNLI (105K) | QQP (364K) | RTE (2.5K) | SST-2 (67K) | STS-B (5.7K) | Macro Score
BERT-base (2019) | 52.1 | 84.6 / 83.4 | 88.9 / 84.8 | 90.5 | 71.2 / 89.2 | 66.4 | 93.5 | 87.1 / 85.8 | 80.0
PD-BERT (2019) |  |  |  |  |  |  |  |  | 
Table 3: Experimental results on the test set from the GLUE server. All models listed above (except BERT-base) have 66M parameters, 6 layers and a 1.94× speed-up.

Also, we would like to highlight that on RTE, our model achieves nearly identical performance to BERT-base, and on QQP our model even outperforms BERT-base. To analyze, a moderate model size may help generalization and prevent overfitting on downstream tasks. Notably, on both large datasets with more than 350K samples (e.g., MNLI and QQP) and small datasets with fewer than 4K samples (e.g., MRPC and RTE), our model consistently achieves good performance, verifying the robustness of our approach.

Although our approach achieves good performance under a task-specific setting, it requires more computational resources to fine-tune a full-size predecessor than a compact BERT (e.g., DistilBERT (Sanh et al., 2019)). Pruksachatkun et al. (2020) found that models trained on some datasets can be used for a second round of fine-tuning. Thus, we use MNLI as the intermediate task and release a compressed model obtained by conducting compression on MNLI to facilitate downstream applications. After compression, we fine-tune the successor model on other sentence classification tasks and compare the results with DistilBERT (Sanh et al., 2019) in Table 4. Our model achieves identical performance on MRPC and outperforms DistilBERT on the other sentence-level tasks. Our intermediate-task transfer results also outperform PD-BERT (Turc et al., 2019) on three tasks, indicating that our task-specific model is also competitive for general purposes through the intermediate-task transfer learning approach.
In this section, we conduct extensive experiments to analyze our BERT-of-Theseus.
As pointed out in previous work (Fan et al., 2020), different layers of a Transformer play imbalanced roles in inference. To explore the effect of different module replacements, we iteratively use one compressed successor module (constant replacing rate, without successor fine-tuning) to replace its corresponding predecessor module on QNLI, MNLI and QQP, as shown in Table 5. Our results show that the replacement of the last two modules has limited influence on the overall performance, while the replacement of the first module significantly harms the performance. To analyze, the linguistic features are mainly extracted by the first few layers. Therefore, the reduced representation capability becomes the bottleneck for the following layers.

Method | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B
BERT-base (2019) | 83.5 | 89.5 | 91.2 | 89.8 | 71.1 | 91.5 | 88.9
DistilBERT (2019) | 79.0 | 87.5 | 85.3 | 84.9 | 59.9 | 90.7 | 81.2
BERT-of-Theseus_MNLI |  |  |  |  |  |  | 

Table 4: Experimental results of intermediate-task transfer learning on GLUE-dev.

Replacement | QNLI (∆) | MNLI (∆) | QQP (∆)
Predecessor | 91.87 | 84.54 | 89.48
prd_1 → scc_1 |  |  | 
prd_2 → scc_2 |  |  | 
prd_3 → scc_3 |  |  | 
prd_4 → scc_4 |  |  | 
prd_5 → scc_5 |  |  | 
prd_6 → scc_6 |  |  | 

Table 5: Impact of the replacement of different modules on GLUE-dev. prd_i → scc_i indicates the replacement of the i-th module from the predecessor.

We attempt to adopt different replacing rates on GLUE tasks. First, we fix the batch size and the learning rate $lr$, and conduct compression on each task. On the other hand, as analyzed in Section 3.3, the equivalent learning rate $lr'$ is affected by the replacing rate. To further eliminate the influence of the learning rate, we fix the equivalent learning rate $lr'$ and adjust the learning rate $lr$ for different replacing rates by $lr = lr'/p$.

We illustrate the results with different replacing rates on two representative tasks (MRPC and RTE) in Figure 3. The trivial gap between the two curves in both figures indicates that the effect of different replacing rates on the equivalent learning rate is not the main factor behind the performance differences. A replacing rate in a moderate range can always lead to satisfying performance on all GLUE tasks, while a significant performance drop is observed on all tasks if the replacing rate is too small. On the other hand, the best replacing rate differs across tasks.

Figure 3: Performance with different replacing rates on MRPC ((a), average of accuracy and F1) and RTE ((b), accuracy). "LR" and "ELR" denote that the learning rate and the equivalent learning rate are fixed, respectively.

To study the impact of our curriculum replacement strategy, we compare the results of BERT-of-Theseus compressed with a constant replacing rate and with a replacement scheduler. The constant replacing rate for the baseline is selected by grid search. Additionally, we implement an "anti-curriculum" baseline, similar to the one in (Morerio et al., 2017). For each task, we adopt the same coefficient $k$ and basic replacing rate $b$ to calculate $p_d$ as in Equation 6 for both curriculum replacement and anti-curriculum; however, we use $1 - p_d$ as the dynamic replacing rate for the anti-curriculum baseline. Thus, we can determine whether the improvement of curriculum replacement is simply due to an inconstant replacing rate or to the easy-to-hard curriculum design.

As shown in Table 6, our model compressed with the curriculum scheduler consistently outperforms the model compressed with a constant replacing rate. In contrast, a substantial performance drop is observed for the model compressed with the anti-curriculum scheduler, which further verifies the effectiveness and importance of the curriculum replacement strategy.
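For concreteness, the learning-rate adjustment and the anti-curriculum schedule used in these analyses can be written as two small helpers (our naming; a sketch, not the authors' code):

```python
def adjusted_lr(equiv_lr: float, p: float) -> float:
    """Fix the equivalent learning rate lr' of Eq. 7 and recover the
    actual learning rate lr = lr' / p for a given replacing rate p."""
    return equiv_lr / p


def anti_curriculum_rate(step: int, k: float, b: float) -> float:
    """Anti-curriculum baseline: use 1 - p_d (Eq. 6) as the replacing
    rate, so replacement starts aggressive and decays over training."""
    return 1.0 - min(1.0, k * step + b)
```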
MNLI (∆)
MRPC (∆)
QNLI (∆)
QQP (∆)
RTE (∆)
SST-2 (∆)
STS-B (∆)
Constant Rate 44.4 81.9 87.1 88.5 88.6 66.4 90.6 88.4Anti-curriculum 42.8 (-1.6) 79.8 (-2.1) 85.6 (-1.5) 87.8 (-0.7) 87.6 (-1.0) 62.4 (-4.0) 88.8 (-1.8) 85.4 (-3.0)Curriculum
Table 6: Comparison of models compressed with a constant replacing rate, a curriculum replacement schedulerand its corresponding anti-curriculum scheduler on GLUE-dev.
We further replace different numbers of Transformer layers with one layer to verify the effectiveness of Theseus Compression under different settings. We replace every 3 or 4 predecessor layers with one Transformer layer, resulting in a 4-layer or 3-layer BERT model, respectively.

Method | Layers | Speed-up
Fine-tuning | 4 | 2.82×
BERT-of-Theseus | 4 | 2.82×
Fine-tuning | 3 | 3.66×
BERT-of-Theseus | 3 | 3.66×

Table 7: Experimental results of replacing different numbers of layers with one layer on GLUE-dev.
The results are shown in Table 7. BERT-of-Theseus consistently outperforms the fine-tuned truncated BERT baselines, demonstrating its effectiveness under different settings.
Conclusion

In this paper, we propose Theseus Compression, a novel model compression approach. We use this approach to compress BERT into a compact model that outperforms models compressed by Knowledge Distillation. Our work highlights a new genre of model compression and reveals a new path towards it.

For future work, we would like to explore the possibility of applying Theseus Compression to heterogeneous network modules. First, many existing in-place substitutes (e.g., the ShuffleNet unit (Zhang et al., 2018) for the ResBlock (He et al., 2016), or the Reformer layer (Kitaev et al., 2020) for the Transformer layer (Vaswani et al., 2017)) are natural successor modules that can be directly adopted in Theseus Compression. Also, it is possible to use a feed-forward neural network to map features between hidden spaces of different sizes (Jiao et al., 2019), enabling replacement between modules with different input and output sizes. Although our model has achieved good performance compressing BERT, it would be interesting to explore its possible applications to other neural models. As summarized in Table 1, our approach does not rely on any model-specific features to compress BERT; therefore, it is potentially applicable to other large models (e.g., ResNet (He et al., 2016) in Computer Vision). In addition, we would like to conduct Theseus Compression on more types of neural networks, including Convolutional Neural Networks and Graph Neural Networks. We will also investigate combining our compression-based approach with the recently proposed dynamic acceleration method (Zhou et al., 2020b) to further improve the efficiency of pretrained language models.
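The feed-forward mapping mentioned above for mismatched hidden sizes could be sketched as follows (our hypothetical wrapper, not an implementation from the paper): project the predecessor-sized hidden states down to the successor module's dimension and back up after it runs.

```python
import torch.nn as nn


class ProjectedSuccessor(nn.Module):
    """Hypothetical successor wrapper for heterogeneous hidden sizes:
    linear maps let a smaller module stand in for a predecessor module
    that operates on a larger hidden dimension."""

    def __init__(self, module: nn.Module, d_pred: int, d_succ: int):
        super().__init__()
        self.down = nn.Linear(d_pred, d_succ)  # into the successor's space
        self.module = module
        self.up = nn.Linear(d_succ, d_pred)    # back to the predecessor's space

    def forward(self, y):
        return self.up(self.module(self.down(y)))
```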
Acknowledgments
We are grateful for the insightful comments from the anonymous reviewers. Tao Ge is the corresponding author.
References
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML.
Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In LREC.
Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. 2013. Predicting parameters in deep learning. In NeurIPS.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP@IJCNLP.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In NeurIPS.
Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In ICLR.
Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR.
Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born-again neural networks. In ICML.
Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
Song Han, Huizi Mao, and William J. Dally. 2016. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In ICCV.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In ICLR.
Animesh Koratana, Daniel Kang, Peter Bailis, and Matei Zaharia. 2019. LIT: Learned intermediate representation training for model compression. In ICML.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.
Hector J. Levesque. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In NeurIPS.
Pietro Morerio, Jacopo Cavazza, Riccardo Volpi, René Vidal, and Vittorio Murino. 2017. Curriculum dropout. In ICCV.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2020. Deep double descent: Where bigger models and more data hurt. In ICLR.
Martin Popel and Ondřej Bojar. 2018. Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43–70.
Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-task transfer learning with pretrained language models: When and why does it work? In ACL.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. FitNets: Hints for thin deep nets. In ICLR.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In ICML.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In EMNLP-IJCNLP.
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In ACL.
Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. In ICLR.
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136.
Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. TACL, 7:625–641.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.
Shuangfei Zhai, Yu Cheng, Zhongfei (Mark) Zhang, and Weining Lu. 2016. Doubly convolutional neural networks. In NeurIPS.
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR.
Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, and Ming Zhou. 2020a. Scheduled DropHead: A regularization method for transformer models. arXiv preprint arXiv:2004.13342.
Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. 2020b. BERT loses patience: Fast and robust inference with early exit. In NeurIPS.