Distilling Knowledge Learned in BERT for Text Generation

Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, Jingjing Liu
Microsoft Dynamics 365 AI Research; Carnegie Mellon University
{yen-chun.chen, zhe.gan, yu.cheng, jinjl}@microsoft.com; [email protected]

Abstract
Large-scale pre-trained language models such as BERT have achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable the finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is exploited as extra supervision to improve conventional Seq2Seq models (student) for better text generation performance. By leveraging BERT's idiosyncratic bidirectional nature, distilling knowledge learned in BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization. Our proposed model also achieves new state of the art on the IWSLT German-English and English-Vietnamese MT datasets. Code is available at https://github.com/ChenRocks/Distill-BERT-Textgen.

1 Introduction

Large-scale pre-trained language models, such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), have become the de facto first encoding step for many natural language processing (NLP) tasks. For example, BERT, pre-trained with a deep bidirectional Transformer (Vaswani et al., 2017) via masked language modeling and next sentence prediction, has revolutionized the state of the art in many language understanding tasks, such as natural language inference (Bowman et al., 2015) and question answering (Rajpurkar et al., 2016).
However, beyond the common practice of finetuning BERT for language understanding (Wang et al., 2019), applying BERT to language generation remains an open question. Text generation aims to generate natural language sentences conditioned on certain input, with applications ranging from machine translation (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) and text summarization (Nallapati et al., 2016; Gehring et al., 2017; Chen and Bansal, 2018) to image captioning (Vinyals et al., 2015; Xu et al., 2015; Gan et al., 2017). In this work, we study how to use BERT for better text generation, which is still a relatively unexplored territory.

Intuitively, as BERT is learned with a generative objective via Masked Language Modeling (MLM) during the pre-training stage, a natural assumption is that this training objective should have learned essential, bidirectional, contextual knowledge that can help enhance text generation. Unfortunately, this MLM objective is not auto-regressive, which encumbers its direct application to auto-regressive text generation in practice.

We tackle this challenge by proposing a novel and generalizable approach to distilling knowledge learned in BERT for text generation tasks. We first propose a new Conditional Masked Language Modeling (C-MLM) task, inspired by MLM but requiring additional conditional input, which enables finetuning pre-trained BERT on a target dataset. In order to extract knowledge from the finetuned BERT and apply it to a text generation model, we leverage the finetuned BERT as a teacher model that generates sequences of word probability logits for the training samples, and treat the text generation model as a student network, which can effectively learn from the teacher's outputs for imitation. The proposed approach improves text generation by providing a good estimation of the word probability distribution for each token in a sentence, consuming both the left and the right context, the exploitation of which encourages conventional text generation models to plan ahead. At inference time, the teacher model (BERT) is not required, so the decoding speed is as fast as the underlying student model.

Text generation models are usually trained via Maximum Likelihood Estimation (MLE), or teacher forcing (Bengio et al., 2015): at each time step, the model maximizes the likelihood of the next word conditioned on its previous ground-truth words. This corresponds to optimizing one-step-ahead prediction. As there is no explicit signal towards global planning in the training objective, the generation model may incline to focusing on local structure rather than global coherence. With our proposed approach, BERT's looking-into-the-future ability can act as an effective regularization method, capturing subtle long-term dependencies that ensure global coherence and in consequence boost model performance on text generation.

An alternative way to leverage BERT for text generation is to initialize the parameters of the encoder or decoder of Seq2Seq with pre-trained BERT, and then finetune on the target dataset. However, this approach requires the encoder/decoder to be identical to BERT, inevitably making the final text generation model too large. Our approach, on the other hand, is modular and compatible with any text-generation model, and has no restriction on model size or model architecture (e.g., LSTM or Transformer).

The main contributions of this work are threefold: (i) We present a novel approach to utilizing BERT for text generation.
The proposed method induces sequence-level knowledge into the conventional one-step-ahead and teacher-forcing training paradigm, by introducing an effective regularization term to the MLE training loss. (ii) We conduct comprehensive evaluations on multiple text generation tasks, including machine translation and text summarization. Experiments show that our proposed approach significantly outperforms strong Transformer baselines and is generalizable to different tasks. (iii) The proposed model achieves new state of the art on both the IWSLT14 German-English and IWSLT15 English-Vietnamese datasets.

2 Related Work

Pre-trained Language Models
Prior to large-scale pre-trained language models, word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) were widely used for NLP tasks. Recently, CoVe (McCann et al., 2017) introduced (conditional) language models pre-trained on paired machine translation corpora. ELMo (Peters et al., 2018) learned a contextual language model on a large corpus with bidirectional RNNs. GPT (Radford et al., 2018) used a unidirectional Transformer to achieve better contextualized word representations. By fine-tuning pre-trained language models, ULMFiT (Howard and Ruder, 2018) also achieved promising results on text classification.

In our study, we focus on BERT due to its superior performance on multiple language understanding tasks. However, different from previous work exploiting BERT for language understanding tasks, here we aim to apply BERT to text generation. To the best of our knowledge, this is still a relatively unexplored space. The proposed approach is also model-agnostic and can be applied to other pre-trained language models as well.
BERT for Text Generation
There have been some recent attempts at applying BERT to text generation. Specifically, Lample and Conneau (2019) trained a cross-lingual MLM and demonstrated promising results for cross-lingual natural language inference (Conneau et al., 2018) and unsupervised neural machine translation (NMT) (Lample et al., 2018). Wang and Cho (2019) formulated BERT as a Markov Random Field LM and showed preliminary results on unsupervised text generation with improved diversity. Zhang et al. (2019a) utilized an encoder with BERT and a two-stage decoder for text summarization. Song et al. (2019) proposed Masked Seq2Seq (MASS) pre-training, demonstrating promising results on unsupervised NMT, text summarization and conversational response generation. Concurrent with our work, Ghazvininejad et al. (2019) proposed a similar conditional MLM for constant-time translation, and Yang et al. (2019) studied how to fine-tune BERT for NMT.

Our approach is novel in the sense that we do not directly use the parameters of BERT in the Seq2Seq model. Instead, BERT acts as an effective regularization on the MLE training loss, by proactively injecting future information for predicting the present.
Right-to-Left Generation
Our work also shares a high-level intuition with approaches that regularize left-to-right generative models with a right-to-left counterpart. Specifically, Liu et al. (2016) trained a separate reverse NMT model and performed joint decoding at inference time to enforce agreement between the forward and reverse models. Twin Networks (Serdyuk et al., 2018) used a backward RNN jointly trained with a forward RNN decoder by matching their hidden states. Zhang et al. (2019b) further extended the idea to the Transformer with joint training, so that the forward and the backward models iteratively improve each other. Our proposed approach stems from a similar intuition. However, we focus on using a pre-trained language model such as BERT to regularize an auto-regressive generation model.

Figure 1: Illustration of distilling knowledge from BERT for text generation (BERT as teacher, via conditional MLM over the input sequence and a masked output sequence; an attention-based encoder-decoder Seq2Seq model as student, over the input sequence and a partial output sequence, trained with knowledge distillation). See Sections 3.2 and 3.3 for details.
Knowledge Distillation
Our method shares the same loss formulation as Knowledge Distillation (KD) proposed in Bucilu et al. (2006); Hinton et al. (2015); Kim and Rush (2016), where a smaller student model is trained on soft labels provided by a larger teacher model. More recently, Tan et al. (2019) applied KD to multilingual NMT, and Sun et al. (2019) proposed patient KD for BERT model compression. Compared with these previous studies, where both the teacher and the student are trained on the same task, our approach is different in the sense that the BERT teacher is not designed to perform the student's generation task. We focus on using KD to leverage the knowledge learned in BERT for text generation, while previous work mostly focused on model compression.
3 Distilling Knowledge Learned in BERT for Text Generation

In this section, we present our proposed approach to distilling the knowledge in BERT for text generation in a generic sequence-to-sequence (Seq2Seq) setting. We first review Seq2Seq learning in Section 3.1, and then describe the proposed approach in Sections 3.2 and 3.3.

3.1 Seq2Seq Learning
Seq2Seq learning (Sutskever et al., 2014) aims to generate a sequence of discrete output Y = (y_1, ..., y_N) of length N, conditioned on a sequence of discrete input X = (x_1, ..., x_M) of length M. A Seq2Seq model learns parameters θ to estimate the conditional likelihood P_θ(Y | X), typically trained via Maximum Likelihood Estimation (MLE), or equivalently, minimizing the cross-entropy loss:

L_xe(θ) = − log P_θ(Y | X) = − ∑_{t=1}^{N} log P_θ(y_t | y_{1:t−1}, X),   (1)

where each conditional probability can be calculated via an attention-based recurrent neural network (RNN) (Bahdanau et al., 2015; Luong et al., 2015), a Transformer (Vaswani et al., 2017), or any other neural sequence-generation model.

This generic Seq2Seq learning framework is the state of the art on a wide range of text generation tasks. Using modern deep neural networks, the conditional probabilities can be readily modeled as a sequence of classifications over the word vocabulary. However, during training, in order to generate the t-th token y_t, the model only sees a partial sentence y_{1:t−1} from the ground-truth training data. Intuitively, it is reasonable to assume that a bidirectional model can be more informative than a left-to-right generation model, since additional context from the right (or future) is also incorporated to predict the current word. Unfortunately, this additional information is not utilized in a standard Seq2Seq model, since it can only be trained in a left-to-right manner, where the future context is masked out to prevent each word from indirectly "seeing itself". To compensate for this single-directional limitation of the Seq2Seq setting, we propose a new conditional language model (C-MLM) to enable the finetuning of BERT on the target generation task, in the hope that the finetuned bidirectional BERT can be utilized for better text generation.

3.2 Finetuning BERT with Conditional MLM

BERT (Devlin et al., 2019) is a deep bidirectional Transformer trained via Masked Language Modeling (MLM).
In a similar setting, where the input is a sequence pair (X, Y), 15% of the tokens are randomly masked. Formally, we denote the masked token sets as X^m and Y^m, and the disjoint counterparts (i.e., the unmasked tokens) as X^u and Y^u, respectively. The trained BERT model aims to estimate the joint probability:

P(x^m_1, ..., x^m_i, y^m_1, ..., y^m_j | X^u, Y^u),   (2)

where i and j denote the number of masked tokens in X and Y, respectively; each x^m_* ∈ X^m, and each y^m_* ∈ Y^m. Eqn. (2) can be trained with the standard word-level cross-entropy loss. (Besides MLM, Devlin et al. (2019) also introduced the next sentence prediction task for training BERT; we omit this task since it is unrelated to our work. In BERT pre-training, the two sequences are consecutive paragraphs sampled from a very large corpus such as Wikipedia.)

We aim to marry MLM pre-training with Seq2Seq learning, to leverage a bidirectional language model for text generation. To this end, we propose a conditional MLM (C-MLM), a variant of MLM that allows further finetuning of pre-trained BERT on a target dataset. For example, for machine translation, X and Y represent the source and the target sentence, respectively. We first concatenate them together and randomly mask tokens only in Y (with the same masking rate as in BERT training), then train the network to model the joint probability:

P(y^m_1, ..., y^m_j | X, Y^u).   (3)

The above C-MLM objective is similar to the conditional language modeling (LM) objective in Eqn. (1), but conditional LM only permits predicting a word based on its left context. C-MLM is also related to Masked Seq2Seq (MASS) pre-training (Song et al., 2019). However, in MASS, the encoder takes a sentence with a randomly masked fragment (several consecutive tokens) as input, and the decoder tries to predict this masked fragment, which is different from our model design. The final goal is also different: MASS focuses on Seq2Seq pre-training, while we focus on leveraging BERT for text generation. In our experiments, we observe that the C-MLM task can obtain high accuracy and good generalization on word prediction. However, it is not feasible to generate sequential output directly from C-MLM. Instead, we use knowledge distillation to distill the knowledge learned from the finetuned BERT into a Seq2Seq model for direct text generation, which will be explained in the next sub-section.
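To make the C-MLM input format concrete, the following is a minimal sketch of how one training example could be assembled, assuming the HuggingFace tokenizer API for the BERT checkpoints mentioned in Section 4.2. The function name, the 15% masking probability argument, and the omission of token-type ids and batching are our own simplifications, not the authors' released code.

```python
import random
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def build_cmlm_example(src: str, tgt: str, mask_prob: float = 0.15):
    """Build one C-MLM training example: [CLS] X [SEP] Y [SEP], masking only Y."""
    src_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(src))
    tgt_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(tgt))

    input_ids = ([tokenizer.cls_token_id] + src_ids + [tokenizer.sep_token_id]
                 + tgt_ids + [tokenizer.sep_token_id])
    # -100 is ignored by the cross-entropy loss; only masked target positions are supervised.
    labels = [-100] * len(input_ids)

    tgt_offset = 2 + len(src_ids)          # index of the first target token in input_ids
    for i, tok in enumerate(tgt_ids):
        if random.random() < mask_prob:    # mask only tokens belonging to Y
            labels[tgt_offset + i] = tok
            input_ids[tgt_offset + i] = tokenizer.mask_token_id

    return torch.tensor(input_ids), torch.tensor(labels)
```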
3.3 Knowledge Distillation for Text Generation

Our inspiration springs from the observation that the probability distribution of a masked word y^m_t is estimated using both y^u_{1:t−1} and y^u_{t+1:N} from Y^u. In other words, the distribution for a given word, P(y^m_t | X, Y^u), contains information from both backward and forward contexts, which is a desirable benefit for providing sequence-level global guidance. This probability distribution can be considered as a soft target for a text generation model to mimic, which potentially contains more useful and fine-grained information than the usual hard-assigned, one-hot label, therefore enhancing conventional left-to-right generation models to look into the future.

In a knowledge distillation setting, the BERT model can be considered as a teacher, while the Seq2Seq model acts as a student. Specifically, the Seq2Seq model can be trained with the following objective function:

L_bidi(θ) = − ∑_{w∈V} [ P_φ(y_t = w | Y^u, X) · log P_θ(y_t = w | y_{1:t−1}, X) ],   (4)

where P_φ(y_t) is the soft target estimated by the finetuned BERT with learned parameters φ, and V denotes the output vocabulary. Note that φ is fixed during the distillation process. An illustration of this learning process is provided in Figure 1: the goal is to match the word probability distribution P_θ(y_t) provided by the student with P_φ(y_t) provided by the teacher (i.e., distillation).

To further improve the Seq2Seq student model, hard-assigned labels are also utilized. The final model is trained with the following compound objective:

L(θ) = α · L_bidi(θ) + (1 − α) · L_xe(θ),   (5)

where α is a hyper-parameter for tuning the relative importance of the two training targets: the soft estimation from the finetuned BERT, and the ground-truth hard label. Note that our proposed approach places only a minimal requirement on the architecture of the incorporated Seq2Seq model. As long as the model is trained to estimate word-level probabilities as in Eqn. (1), it can be trained jointly with the proposed objective function in Eqn. (5).

At a higher level, the additional loss term L_bidi can be interpreted as a sequence-level objective function. Our auto-regressive (or causal) model θ tries to predict the probability distribution that matches the estimation of the bidirectional teacher model, hence encouraging the planning of future (right context) for generation.
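The compound objective in Eqn. (4) and (5) reduces to a soft-target cross-entropy plus the usual MLE term. The PyTorch sketch below shows one straightforward way to compute it, assuming the teacher distribution P_φ has already been evaluated at every target position; distill_loss, the default α = 0.5, and the padding handling are illustrative choices, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_probs, gold_ids, alpha=0.5, pad_id=0):
    """Compound objective of Eqn. (5): alpha * L_bidi + (1 - alpha) * L_xe.

    student_logits: (batch, T, vocab) scores from the Seq2Seq decoder
    teacher_probs:  (batch, T, vocab) soft targets P_phi from the finetuned BERT teacher
    gold_ids:       (batch, T) ground-truth target token ids
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)

    # Eqn. (4): cross-entropy between the teacher's soft distribution and the student.
    l_bidi = -(teacher_probs * log_p_student).sum(dim=-1)            # (batch, T)

    # Eqn. (1): standard MLE loss against the one-hot ground truth.
    l_xe = F.nll_loss(log_p_student.transpose(1, 2), gold_ids,
                      ignore_index=pad_id, reduction="none")          # (batch, T)

    mask = (gold_ids != pad_id).float()
    loss = (alpha * l_bidi + (1.0 - alpha) * l_xe) * mask
    return loss.sum() / mask.sum()
```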
4 Experiments

In this section, we describe our experiments on two well-studied text generation tasks: machine translation and abstractive text summarization.

4.1 Datasets

Machine Translation

We consider two relatively small-scale datasets, IWSLT15 English-Vietnamese (En-Vi, 113k training samples) and IWSLT14 German-English (De-En, 160k training samples), and one medium-scale dataset, WMT14 English-German (En-De, 4.5M training samples). For IWSLT15 En-Vi, we use the pre-processed dataset provided by Luong and Manning (2015); we use tst2012 as the dev set and test on tst2013. For IWSLT14 De-En, we follow the pre-processing steps and the same train/dev/test split as in Wu et al. (2019). For WMT14 En-De, we follow the pre-processing steps in Vaswani et al. (2017) for fair comparison; we use newstest2013 as the dev set and newstest2014 as the test set. We report BLEU scores (Papineni et al., 2002) for evaluation of MT performance following the Moses script. (For fair comparison to previous work, we report tokenized BLEU scores using https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl, and for WMT14 En-De we further split the compound words after tokenization.)

Abstractive Summarization
For summarization, we conduct experiments on the Gigaword summarization dataset (Rush et al., 2015). Note that the original train/valid/test split of Gigaword is 3.8M/190k/2k. In our experiments, we observed a severe distribution mismatch between the validation and test data (see Tables 4 and 5 and Section 4.4 for a detailed discussion). Therefore, we further sampled 5k/5k dev/test-dev splits from the validation set and tuned hyper-parameters on the dev set only. We report ROUGE scores (Lin, 2004) on test-dev for the evaluation of our proposed approach, and include results on the standard test split for comparison with prior work.
4.2 Implementation Details

Our implementation is based on the PyTorch (Paszke et al., 2017) version of the OpenNMT (Klein et al., 2018) seq2seq toolkit. We use the 'base' model of a 6-layer Transformer with 512-hidden, 8-head attention blocks and a 2048-hidden feed-forward layer for all experiments, with label smoothing regularization (LSR) (Szegedy et al., 2016) of 0.1. (Our method can also be viewed as a 'learned LSR': the reported results of our proposed method are trained together with regular LSR, showing the effectiveness of our teacher.) We batch examples with similar sequence lengths and count batch size by the number of tokens. For MT we use the pre-trained BERT-base-multilingual-cased model, and for summarization we use BERT-base-uncased, as the starting points for BERT finetuning. (BERT pre-trained models are available at https://github.com/google-research/bert; our finetuning implementation is modified from code available at https://github.com/huggingface/pytorch-pretrained-BERT.) We use the corresponding pre-trained byte-pair encoding (Sennrich et al., 2016) shipped with each BERT model for tokenization.

For all training runs of all Transformer models, the learning rate follows the schedule lr = η · d_model^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5)), where d_model = 512 is the attention representation size (Vaswani et al., 2017).
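For reference, the inverse-square-root schedule above can be written as a small helper; the default warmup_steps and η below are placeholders (the per-dataset values are listed in the supplementary material).

```python
def transformer_lr(step: int, d_model: int = 512,
                   warmup_steps: int = 4000, eta: float = 1.0) -> float:
    """lr = eta * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return eta * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```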
For all BERT finetuning, we follow Devlin et al. (2019) and use a triangular learning rate schedule with maximum learning rate η. The parameters are updated with the Adam optimizer (Kingma and Ba, 2015). In the distillation stage, we pre-compute BERT's prediction logits on the training data and use top-K distillation (Tan et al., 2019) to reduce computation overhead and memory footprint, where K is set to 8 across all experiments. (The masking strategy used to obtain the teacher's predictions is described in the supplementary material. We also tune the temperature T for the softmax applied to the teacher's logits; different from the original KD, we do not apply the same T to the student, since in preliminary experiments a high T on the Seq2Seq student resulted in much worse performance. We hypothesize that the low-entropy nature of conditional text generation is not suited to temperature scaling on the student side.)
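Since storing the teacher's full-vocabulary distribution at every position would be prohibitively large, only the top-K probabilities are kept. Below is a minimal sketch of this precomputation under our own assumptions; the renormalization of the kept probability mass and the dense re-expansion are illustrative choices, not necessarily what the released code does.

```python
import torch

@torch.no_grad()
def precompute_topk_targets(teacher_logits, k=8, temperature=1.0):
    """Keep only the teacher's top-K probabilities per position (Tan et al., 2019).

    teacher_logits: (T, vocab) logits from the finetuned BERT teacher at each
    target position. Returns (values, indices), each of shape (T, k).
    """
    probs = torch.softmax(teacher_logits / temperature, dim=-1)
    topk_probs, topk_ids = probs.topk(k, dim=-1)
    # Renormalize so the kept mass sums to 1 (one common choice; keeping the raw
    # top-K mass without renormalization is also possible).
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_probs, topk_ids

def expand_to_soft_targets(topk_probs, topk_ids, vocab_size):
    """Scatter the stored top-K values back into a dense (T, vocab) distribution
    so they can be fed to the distillation loss sketched earlier."""
    T = topk_probs.size(0)
    dense = torch.zeros(T, vocab_size, dtype=topk_probs.dtype)
    dense.scatter_(1, topk_ids, topk_probs)
    return dense
```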
For the detailed values of the hyper-parameters for each experiment, please refer to the supplementary material. We found it necessary to train longer with L_bidi, since the model is still improving after the step at which the baseline Transformer starts to plateau. At inference time, we use beam search with beam size 4 and a length penalty (Wu et al., 2016) of 0.6 across all models. All hyper-parameters are tuned on the development set. Note that our Transformer baselines achieve higher scores than the reference implementations on each dataset (in most cases comparable to the state of the art).

4.3 Results on Machine Translation

Table 1: BLEU scores for IWSLT14 German-English translation. (†) tuned with checkpoint averaging. (‡) from Edunov et al. (2018). (♦) from Wu et al. (2019).

De-En Models            dev      test
Our Implementations
Transformer (base)      35.27    34.09
 + BERT teacher
Other Reported Results
ConvS2S + MRT ‡♦        -        34.4†
Lightweight Conv ♦      -        34.8†
Dyn. Convolution ♦      -        35.2†

Table 2: BLEU scores for IWSLT15 English-Vietnamese translation. (†) from Luong et al. (2017). (⋆) from Chen et al. (2019). (♦) from Clark et al. (2018).

En-Vi Models            tst2012  tst2013
Our Implementations
RNN                     23.37    26.80
 + BERT teacher         25.14    27.59
Transformer (base)      27.03    30.76
 + BERT teacher
Other Reported Results
RNN †                   -        26.1
Seq2Seq-OT ⋆            -        29.3
CVT ♦                   -        29.6

We first validate our proposed approach on the machine translation task. Experimental results are summarized in Tables 1, 2 and 3, which show that our model significantly improves over the strong Transformer baseline across all three datasets.
Table 3: BLEU scores for WMT14 English-German translation. (†) tuned with checkpoint averaging. (‡) trained on WMT16, a slightly different version of the training data. (♦) from Vaswani et al. (2017). (⋆) from Ott et al. (2018). (•) from Wu et al. (2019).

En-De Models            NT2013   NT2014
Our Implementations
Transformer (base)      25.95    26.94
 + BERT teacher
Other Reported Results
Transformer (base) ♦†
Transformer (big) ⋆‡†
Dyn. Convolution •‡†

Note that our baseline is the 'base' model of the Transformer, which has 44M trainable parameters, whereas the reference implementation by Wu et al. (2019) is the 'big' model with 176M parameters. (Parameter counts exclude the word embedding and the final linear projection, which mostly depend on the vocabulary size; BERT-base has 86M trainable parameters.)

For IWSLT German-English translation, our method improves over the Transformer baseline by 1.54 BLEU points and achieves new state of the art. Our approach outperforms previously reported results such as ConvS2S+MRT, a convolution-based model (Gehring et al., 2017) with minimum risk training (Edunov et al., 2018), as well as Lightweight and Dynamic Convolution (Wu et al., 2019). Note that Wu et al. (2019) also tuned checkpoint averaging, which creates a soft ensemble effect, and their model has roughly the same number of parameters as Transformer (big).

For IWSLT English-Vietnamese translation, since most prior work experimented with RNN models, we also report RNN-based results, which also suggests that our method is model-agnostic. Our best model outperforms Seq2Seq-OT (Chen et al., 2019), which utilizes optimal transport for sequence-level training, as well as the ELMo and CVT results reported in Clark et al. (2018). (The CVT results used a much larger RNN and a CNN-based character embedding, as well as a customized structure; therefore, we did not try to match their setup with our RNN.) For WMT14 English-German translation, our method still improves over the well-tuned Transformer baseline. We also report the scores of Transformer (big) and the state-of-the-art Dynamic Convolution model (Wu et al., 2019) for reference.
4.4 Results on Abstractive Summarization

Table 4 and Table 5 show the results of our approach on the abstractive summarization task, where R-1, R-2, and R-L denote F scores of ROUGE-1, ROUGE-2, and ROUGE-L, respectively.

Table 4: ROUGE F scores for Gigaword abstractive summarization on our internal dev and test-dev splits.

GW Models               R-1     R-2     R-L
Dev
Transformer (base)      46.64   24.37   43.17
 + BERT teacher
Test-Dev
Transformer (base)      46.84   24.80   43.58
 + BERT teacher

Table 5: ROUGE F scores for Gigaword abstractive summarization on the official test set (Trm: Transformer). (†) from Nallapati et al. (2016). (‡) from Lin et al. (2018). (⋆) from Cao et al. (2018b). (♦) from Amplayo et al. (2018). (•) from Cao et al. (2018a).

GW Models               R-1     R-2     R-L
Seq2Seq †
CGU ‡
FTSum_g ⋆
E2T_cnn ♦
Re^3Sum •
Trm + BERT teacher

Our method shows improvement on all metrics, as shown in Table 4. We observe a large gap between dev and test scores, which suggests that the data in the test set is very different from that in the validation set, as mentioned in Section 4.1. Given that the official test split contains only 1,951 noisy examples, we believe that our results on the dev/test-dev sets further strengthen our claim. (When we manually inspected the test set data, we found many corrupted examples, such as extremely short input articles, meaningless summaries, and dominating unknown words.) On the test split, our best model is comparable to state-of-the-art models that use much more complex architectures specifically designed for summarization: CGU (Lin et al., 2018) augmented convolutional gating units; FTSum_g (Cao et al., 2018b) leveraged extra information extraction and dependency parsing features; E2T_cnn (Amplayo et al., 2018) utilized entities provided by an external entity linking system; and Re^3Sum (Cao et al., 2018a) carefully designed a retrieve-and-rerank pipeline with human-written soft templates. Although our model has no summarization-specific design, we still achieve comparable performance to these models on all metrics.
Table 6: Ablation study (Trm: Transformer).

Methods                 De-En (dev)   En-Vi (tst2012)
Transformer (base)      35.27         27.03
Trm + BERT_lr teacher
Trm + BERT_sm teacher
Trm + BERT teacher
4.5 Ablation Study

There are several possible factors that could contribute to the performance gain: the additional parameters of BERT, the extra data (pre-training corpus) of BERT, and its bidirectional nature. To better understand the key contributions of our method, we conduct an ablation study as follows. We finetune two extra teachers: BERT_sm and BERT_lr. For BERT_sm, we use a smaller BERT (6 layers) for C-MLM finetuning, which has approximately the same number of parameters as Transformer-base. For BERT_lr, we use the full BERT model but finetune it using a left-to-right LM objective as in the conventional Seq2Seq model. (We still use the pre-trained weights of BERT for BERT_lr; otherwise the C-MLM does not converge very well.) Next, we apply the proposed KD method to train the Transformer on the En-Vi and De-En MT tasks. Results are shown in Table 6. BERT_sm still works well, though the full BERT provides further improvement. On the other hand, BERT_lr slightly hurts the performance; we hypothesize that it generates noisy learning targets for the student, hence the performance drop. Empirically, this shows that the bidirectional knowledge could be more important than the extra parameters, while the pre-trained weights remain useful for more stable C-MLM training.

4.6 Analysis of Output Length

We next analyze the effect of our proposed approach on different output lengths, by plotting BLEU scores on MT with respect to different output generation lengths N on the development set. Results are provided in Figure 2 and Figure 3. (For Gigaword summarization, almost all summaries are short sentences, with less than 0.5% of the summaries containing more than 16 words, so we omit this analysis.) For the IWSLT German-English dataset (Figure 2, left), we can see a shared trend that the proposed L_bidi objective gains higher BLEU points on longer translation pairs. For WMT English-German (Figure 3), although the proposed method performs much worse when the output sentences are very short, it achieves relatively consistent improvement on longer cases, hence resulting in an overall BLEU improvement. For IWSLT English-Vietnamese (Figure 2, right), we see a similar trend for longer output lengths N.

Figure 2: BLEU scores on IWSLT German-English and English-Vietnamese for different output lengths.

Figure 3: BLEU scores on WMT English-German for different output lengths.

4.7 Qualitative Analysis

In Table 7, we show some translation examples from the IWSLT German-English dataset. In the first example, the baseline Transformer cannot recover from 'with' and 'of', which renders the full sentence nonsensical. "I started reading with ..." would make sense given only the left context; however, if the model also considers the right context "the age of two", the word 'with' would be assigned a lower probability by the soft labels provided by the BERT teacher. Even though at test time the model cannot 'look ahead', the soft targets at training time prevent the over-confidence of the model on the one-hot label, hence the better generalization at test time. Similarly, the other examples show that our model can generate text more coherently with respect to the context on the right (underlined in Table 7), thus producing more accurate and natural translations.

Table 7: Qualitative examples from IWSLT German-English translation. Numbers inside the parentheses are sentence-level BLEU scores. The red word is where the baseline Transformer makes a mistake without considering the possible future phrase and fails to recover; our model makes the right decision at the blue word, hence generating a more coherent sentence. See Section 4.7 for a detailed explanation.

Reference: my mother says that i started reading at the age of two , although i think four is probably close to the truth .
Transformer: my mother says that i started reading with two years , but i think that four of them probably correspond to the truth . (39.6)
Ours: my mother says that i started reading at the age of two , but i think four is more likely to be the truth . (65.2)

Reference: we already have the data showing that it reduces the duration of your flu by a few hours .
Transformer: we 've already got the data showing that it 's going to crash the duration of your flu by a few hours . (56.6)
Ours: we already have the data showing that it reduces the duration of your flu by a few hours . (100.0)

Reference: we now know that at gombe alone , there are nine different ways in which chimpanzees use different objects for different purposes .
Transformer: we know today that alone in gombe , there are nine different ways that chimpanzees use different objects in different ways . (35.8)
Ours: we now know that in gombe alone , there are nine different ways that chimpanzees use different objects for different purposes . (71.5)
5 Conclusion

In this work, we propose a novel and generic approach to utilizing pre-trained language models to improve text generation without explicit parameter sharing, feature extraction, or augmenting with auxiliary tasks. Our proposed Conditional MLM mechanism leverages unsupervised language models pre-trained on large corpora, and then adapts to supervised sequence-to-sequence tasks. Our distillation approach indirectly influences the text generation model by providing soft-label distributions only, and hence is model-agnostic. Experiments show that our model improves over strong Transformer baselines on multiple text generation tasks such as machine translation and abstractive summarization, and achieves new state of the art on some of the translation tasks. For future work, we will explore the extension of Conditional MLM to multimodal input such as image captioning.
References
Reinald Kim Amplayo, Seonjae Lim, and Seung-won Hwang. 2018. Entity commonsense representation for neural abstractive summarization. In NAACL.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In KDD.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2018a. Retrieve, rerank and rewrite: Soft template based neural summarization. In ACL.

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018b. Faithful to the original: Fact aware neural abstractive summarization. In AAAI.

Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Improving sequence-to-sequence learning via optimal transport. In ICLR.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In EMNLP.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In EMNLP.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In NAACL.

Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2017. Semantic compositional networks for visual captioning. In CVPR.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In ICML.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Constant-time machine translation with conditional masked language models. arXiv preprint arXiv:1904.09324.

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In EMNLP.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Guillaume Klein, Yoon Kim, Yuntian Deng, Vincent Nguyen, Jean Senellart, and Alexander Rush. 2018. OpenNMT: Neural machine translation toolkit. In AMTA.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In ICLR.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL Text Summarization Branches Out Workshop.

Junyang Lin, Xu Sun, Shuming Ma, and Qi Su. 2018. Global encoding for abstractive summarization. In ACL.

Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Agreement on target-bidirectional neural machine translation. In NAACL.

Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domain. In IWSLT.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In WMT.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, and Yoshua Bengio. 2018. Twin networks: Matching the future for sequence generation. In ICLR.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In ICML.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR.

Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. In ICLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In ICLR.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.

Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li. 2019. Towards making the most of BERT in neural machine translation. arXiv preprint arXiv:1908.05672.

Haoyu Zhang, Jianjun Xu, and Ji Wang. 2019a. Pretraining-based natural language generation for text summarization. arXiv preprint arXiv:1902.09243.

Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. 2019b. Regularizing neural machine translation by target-bidirectional agreement. In AAAI.

A Implementation Details and Hyper-parameter Values
We run all experiments on a single GPU (NVIDIA Titan RTX or V100), except for WMT En-De, for which we use 4 V100s for training. For large batch sizes that do not fit in GPU memory, we use the gradient accumulation trick as in Ott et al. (2018). Batch sizes are counted in number of tokens. All hyper-parameters are tuned on the development set only.

To compute the logits (soft labels) from the teacher, we repeat a training pair 7 times and create a circular mask as illustrated in Figure 4, which approximates the masking rate of BERT training and yields soft labels for every output position. From the masked positions we obtain the soft probabilities predicted by the BERT teacher for each output token y. These logits are pre-computed once for the training set so that we do not have to repeatedly sample random masks and run the forward pass of BERT during training.

Figure 4: Illustration of the masking strategy for computing the teacher soft labels. Gray slashed boxes denote the [MASK] positions.
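One natural reading of this circular masking scheme is that copy k of a training pair masks the target positions congruent to k modulo 7, so that every position is masked exactly once across the 7 copies. The sketch below implements that reading; the exact pattern of Figure 4 is not fully recoverable from the text, so treat the indexing as illustrative.

```python
import torch

N_COPIES = 7  # number of repeated copies per training pair, as described above

def circular_masks(tgt_len: int, n_copies: int = N_COPIES):
    """Return a (n_copies, tgt_len) boolean mask; True marks a [MASK]ed position.

    Copy k masks target positions k, k + n_copies, k + 2 * n_copies, ...,
    so that across all copies every target position is masked exactly once.
    """
    positions = torch.arange(tgt_len)
    return torch.stack([(positions % n_copies) == k for k in range(n_copies)])

# Example: a 10-token target yields 7 masked variants covering each position once.
masks = circular_masks(10)
assert masks.sum(dim=0).eq(1).all()
```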
IWSLT De-En

For C-MLM fine-tuning, we train for 100k steps with 5k warmup steps, η = 5·10⁻⁵, and a batch size of 16k tokens. For the baseline model, we train for 50k steps with 4k warmup steps and a batch size of 6k tokens; the learning rate η is set to 1. For the proposed model, we train for 100k steps with 8k warmup steps and a batch size of 6k tokens; the learning rate η is set to 2, with T = 10. The Seq2Seq model uses dropout (Srivastava et al., 2014) of 0.3 in both cases.

IWSLT En-Vi

For C-MLM fine-tuning and the baseline Transformer, the hyper-parameters are identical to those of IWSLT De-En. For the proposed model, we train for 100k steps with 8k warmup steps and a batch size of 6k tokens; the learning rate η is set to 2, with T = 5. Dropout is 0.1.

WMT En-De

For C-MLM fine-tuning, we train for 100k steps with 5k warmup steps, η = 5·10⁻⁵, and a batch size of 512k tokens. For the baseline model, we train for 30k steps with 4k warmup steps and a batch size of 384k tokens; the learning rate η is set to 4. Since this is our largest dataset and training is slow, for the proposed model we use the baseline Transformer to initialize the Seq2Seq student and continue training for 50k steps with 4k warmup steps and a batch size of 64k tokens; the learning rate η is set to 2, with T = 5. The Seq2Seq model uses dropout of 0.1 in both cases.

Gigaword

For C-MLM fine-tuning, we train for 100k steps with 5k warmup steps, η = 5·10⁻⁵, and a batch size of 64k tokens. For the baseline model, we train for 50k steps with 4k warmup steps and a batch size of 40k tokens; the learning rate η is set to 1. For the proposed model, we train for 70k steps with 4k warmup steps and a batch size of 36k tokens; the learning rate η is set to 2, with T = 10. The Seq2Seq model uses dropout of 0.1 in both cases.
B Additional Generation Examples

We show Gigaword summarization examples in Table 9 and additional De-En generation examples in Table 8. Qualitatively, our Transformer + BERT teacher outperforms the baseline Transformer and generates more coherent sentences.

Table 8: Qualitative examples from IWSLT German-English translation. Numbers inside the parentheses are sentence-level BLEU scores. The red word is where the baseline Transformer makes a mistake without considering the possible future phrase and fails to recover; our model makes the right decision at the blue word, hence generating a more coherent sentence. Please refer to Section 4.7 in the main paper for a detailed explanation.

Reference: the political climate in the u.s. at the time was tense , and there were debates going on about immigration .
Transformer: the political climate in the u.s. was back then , and there was constant disasters . (29.5)
Ours: the political climate in the united states at the time was tense , and there were ongoing shifting debates . (57.3)

Reference: it would be immoral to leave these young people with a climate system spiraling out of control .
Transformer: it would be immoral to let these young people leave a climate system that was out of control . (44.6)
Ours: it would be immoral to leave these young people with a climate system out of control . (84.3)

Reference: the tahltan have called for the creation of a tribal heritage reserve which will set aside the largest protected area in british columbia .
Transformer: tahltan demands the institution of a tribe in british columbia that should make the largest protection area in british columbia . (19.9)
Ours: the tahltan demands to build a tribe reserve that should be the largest protected area in british columbia . (32.2)