NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application
Chuhan Wu, Fangzhao Wu, Yang Yu, Tao Qi, Yongfeng Huang, Qi Liu
Tsinghua University, Beijing 100084, China
Microsoft Research Asia, Beijing 100080, China
University of Science and Technology of China, Hefei 230027, China
{wuchuhan15,wufangzhao,taoqi.qt}@gmail.com
ABSTRACT
Pre-trained language models (PLMs) like BERT have made great progress in NLP. News articles usually contain rich textual information, and PLMs have the potential to enhance news text modeling for various intelligent news applications like news recommendation and retrieval. However, most existing PLMs are huge, with hundreds of millions of parameters. Many online news applications need to serve millions of users with low latency tolerance, which poses great challenges to incorporating PLMs in these scenarios. Knowledge distillation techniques can compress a large PLM into a much smaller one while keeping good performance. However, existing language models are pre-trained and distilled on general corpora like Wikipedia, which have some gaps with the news domain and may be suboptimal for news intelligence. In this paper, we propose NewsBERT, which can distill PLMs for efficient and effective news intelligence. In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models, where the student model can learn from the learning experience of the teacher model. In addition, we propose a momentum distillation method that incorporates the gradients of the teacher model into the update of the student model to better transfer the useful knowledge learned by the teacher model. Extensive experiments on two real-world datasets with three tasks show that NewsBERT can effectively improve model performance in various intelligent news applications with much smaller models.
KEYWORDS
Knowledge distillation, Pre-trained language model, News application, BERT
ACM Reference Format:
Chuhan Wu, Fangzhao Wu, Yang Yu, Tao Qi, Yongfeng Huang, and Qi Liu. 2021. NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2021). ACM, New York, NY, USA, 9 pages.
1 INTRODUCTION
Pre-trained language models like BERT [8] and GPT [23] have achieved remarkable success in various NLP applications [16, 42]. These powerful language models are usually huge, with hundreds of millions of parameters [22]. For example, the BERT-Base model contains about 110M parameters and 12 Transformer [33] layers, which raises a high demand for computational resources in model training and inference. However, many online applications need to provide services for a large number of concurrent users and the tolerance of latency is often low, which hinders the deployment of large-scale language models in these systems [24].

In recent years, online news websites such as MSN News and Google News have gained huge popularity among users for digesting digital news [37]. These news websites usually involve a series of intelligent news applications like automatic news topic classification [38], news headline generation [31] and news recommendation [19]. In these applications, text modeling is a critical technique because news articles usually contain rich textual content [40]. Thus, these applications would benefit a lot from the powerful language understanding ability of pre-trained language models if they could be incorporated in an efficient way, which further has the potential to improve the news reading experience of millions of users.

Knowledge distillation is a technique that can compress a cumbersome teacher model into a lighter-weight student model by transferring useful knowledge [11, 14]. It has been employed to compress many huge pre-trained language models into much smaller versions while keeping most of the original performance [13, 24, 29, 35]. For example, Sanh et al. [24] proposed DistilBERT, which learns the student model from the soft target probabilities of the teacher model by using a distillation loss with softmax temperature [12], and regularizes the hidden state directions of the student and teacher models to be aligned. Jiao et al. [13] proposed TinyBERT, an improved version of DistilBERT: in addition to the distillation loss, they regularize the token embeddings, hidden states and attention heatmaps of the student and teacher models to be aligned via the mean squared error loss. These methods usually learn the teacher and student models successively, where the student can only learn from the results of the teacher model. However, the learning experience of the teacher may also be useful for the learning of the student model [44], which is not considered by existing methods. In addition, the corpus used for pre-training and distilling general language models (e.g., Wikipedia) may have some domain shifts with news corpora, which may not be optimal for intelligent news applications.

In this paper, we propose NewsBERT, which can distill pre-trained language models for various intelligent news applications. In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models in news intelligence tasks by sharing the parameters of the top layers, and meanwhile distill the student model by regularizing the output soft probabilities and hidden representations. In this way, the student model can learn from the teacher's learning experience to better imitate the teacher model, and the teacher can also be aware of the learning status of the student model to enhance student teaching.
In addition, we propose a momentum distillation method that uses the gradients of the teacher model to boost the gradients of the student model in a momentum way, which can better transfer the useful knowledge learned by the teacher model to enhance the learning of the student model. We conduct extensive experiments on two real-world datasets that involve three news intelligence tasks. The results validate that our proposed NewsBERT approach can consistently improve the performance of these tasks using much smaller models and outperform many baseline methods for distilling pre-trained language models.

The main contributions of this work include:
• We propose a NewsBERT approach to distill pre-trained language models for intelligent news applications.
• We propose a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models by sharing the useful knowledge obtained in their learning process.
• We propose a momentum distillation method that uses the gradients of the teacher model to boost the learning of the student model in a momentum manner.
• Extensive experiments on real-world datasets validate that our method can effectively improve model performance in various intelligent news applications in an efficient way.
2 RELATED WORK
In recent years, many researchers have explored using knowledge distillation techniques to compress large-scale PLMs into smaller ones [13, 18, 24, 29, 30, 32, 34, 35, 41]. For example, Tang et al. [32] proposed a
BiLSTM-SOFT method that distills the BERT model into a single-layer BiLSTM using the distillation loss in downstream tasks. Sanh et al. [24] proposed DistilBERT, which distills the student model at the pre-training stage using the distillation loss and a cosine embedding loss that aligns the hidden states of the teacher and student models. Sun et al. [29] proposed a patient knowledge distillation method for BERT compression named BERT-PKD, which distills the student model by learning from the teacher's output soft probabilities and the hidden states produced by intermediate layers. Wang et al. [35] proposed MiniLM, which employs a deep self-attention distillation method that uses the KL-divergence loss between the teacher's and student's attention heatmaps computed by query-key inner products and the value relations computed by value-value inner products. Jiao et al. [13] proposed TinyBERT, which distills the BERT model at both the pre-training and fine-tuning stages by using the distillation loss and the MSE loss between the embeddings, hidden states and attention heatmaps. There are also a few works that explore distilling pre-trained language models for specific downstream tasks such as document retrieval [6, 17]. For example, Lu et al. [17] proposed a TwinBERT approach for document retrieval, which employs a two-tower architecture with two separate language models to encode the query and document, respectively, and uses the distillation loss function to compress the two BERT models into smaller ones. These methods usually train the teacher and student models successively, i.e., distilling the student model based on a well-tuned teacher model. However, the useful experience evoked by the teacher's learning process cannot be exploited by the student, and the teacher is also not aware of the student's learning status. In addition, the corpus used for pre-training and distilling these language models usually has some domain shifts with news texts. Thus, it may not be optimal to apply off-the-shelf distilled language models to intelligent news applications.
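To make the shared idea of these methods concrete, the soft-target distillation objective they build on can be sketched as follows. This is a minimal PyTorch sketch of the generic temperature-scaled distillation loss, not the exact objective of any particular method above; the tensor names and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_target_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic temperature-scaled soft-target distillation loss.

    Both logits tensors have shape [batch_size, num_classes].
    """
    # Soften the teacher and student output distributions with the temperature.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy between the softened distributions; the teacher is not updated here.
    return -(teacher_probs.detach() * student_log_probs).sum(dim=-1).mean()

# Example usage with random logits for a batch of 8 examples and 5 classes.
teacher_logits = torch.randn(8, 5)
student_logits = torch.randn(8, 5)
loss = soft_target_distillation_loss(teacher_logits, student_logits)
```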
In this work, we propose a NewsBERT method to distill pre-trained language models for intelligent news applications, which can effectively reduce the computational cost of PLMs and meanwhile achieve promising performance. We propose a teacher-student joint learning and distillation framework, where the student model can exploit the useful knowledge produced by the learning process of the teacher model. In addition, we propose a momentum distillation method that integrates the gradient of the teacher model into the student model gradient as a momentum to boost the learning of the student.
Online news platforms usually involve various intelligent news applications like news topic classification [4], fake news detection [26], news headline generation [9], news retrieval [5] and personalized news recommendation [39]. Since news articles usually contain rich textual information, learning accurate news representations from their texts is a core problem in these applications [40]. Many prior works on news intelligence use handcrafted features to represent news texts. For example, Bourgonje et al. [3] used the TF-IDF features extracted from the news headline and body as well as their lengths to represent news. Lian et al. [15] used the topic and entity features extracted from the news title to represent news texts. However, these methods usually require heavy manual feature engineering, and their features may not be adaptable across different news applications. In addition, handcrafted features may not be optimal for capturing the semantic information of news content, which is critical for news intelligence.

In recent years, many works have explored deep learning techniques to model news texts [25, 36, 39]. For example, Shu et al. [25] proposed a fake news detection approach named dEFEND, which uses a hierarchical Bi-LSTM model to learn sentence representations from words and then learn news representations from sentence representations. Wu et al. [39] proposed the NRMS approach for news recommendation, which uses multi-head self-attention networks to learn news representations. There are also a few methods that explore using pre-trained language models to model news texts for news classification. For instance, Sun et al. [28] studied how to fine-tune the BERT model for text classification (e.g., news topic classification). However, besides news classification, incorporating pre-trained language models in many intelligent news applications like personalized news recommendation is less studied due to the huge computational costs of PLMs. In our work, we propose to distill pre-trained language models into lightweight models for intelligent news applications, which has the potential to improve the performance of various news-related tasks and the news reading experience of massive users.
Figure 1: The framework of our NewsBERT approach in a typical news classification task.
3 METHODOLOGY
In this section, we introduce our NewsBERT approach, which can distill pre-trained language models for intelligent news applications. We first introduce the teacher-student joint learning and distillation framework of NewsBERT using the news classification task as a representative example, then introduce our proposed momentum distillation method, and finally introduce how to learn NewsBERT in more complicated tasks like news recommendation.
3.1 Teacher-Student Joint Learning and Distillation Framework
The overall framework of our NewsBERT approach in a typical news classification task is shown in Fig. 1. It contains a teacher model with a parameter set Θ_t and a student model with a parameter set Θ_s. The teacher is a strong but large-scale PLM (e.g., BERT) with heavy computational cost, and the goal is to learn a lightweight student model that keeps most of the teacher's performance. Different from existing knowledge distillation methods that first learn the teacher model and then distill the student model from the fixed teacher, in our approach we jointly learn the teacher and student models and meanwhile distill useful knowledge from the teacher model. Both the teacher and student language models contain an embedding layer and several Transformer [33] layers. We assume that the teacher model has NK
Transformer layers on top of the embedding layer and the student model contains N Transformer layers on top of the embedding layer, so the inference speed of the student model is approximately K times faster than that of the teacher. We first use the teacher and student models to separately process the input news text (denoted as x) through their Transformer layers and obtain the hidden representation of each token. We use a shared attentive pooling [43] layer (with parameter set Θ_p) to convert the hidden representation sequences output by the teacher and student models into unified news embeddings, and finally use a shared dense layer (with parameter set Θ_d) to predict the classification probability scores based on the news embedding. By sharing the parameters of the top pooling and dense layers, the student model can get richer supervision information from the teacher, and the teacher can also be aware of the student's learning status. Thus, the teacher and student can be learned reciprocally by sharing the useful knowledge they encode, which is helpful for learning a strong student model.

Next, we introduce the knowledge distillation details of our approach. We assume the i-th Transformer layer in the student model corresponds to the layers [(i-1)K+1, ..., iK] in the teacher model, and we call this stack of K teacher layers a "block". Motivated by [29], we apply a hidden loss to align the hidden representations given by each layer in the student model and its corresponding block in the teacher model, which can help the student better learn from the teacher. We denote the token representations output by the embedding layers of the teacher and student models as E^t and E^s, respectively. The hidden representations produced by the i-th layer in the student model are denoted as H^s_i, and the hidden representations given by the corresponding block in the teacher model are denoted as H^t_{iK}. The hidden loss applied to these layers is formulated as follows:

\mathcal{L}^{l}_{hidden}(x, \Theta_t; \Theta_s) = \mathrm{MSE}(\mathbf{E}^t, \mathbf{E}^s) + \sum_{i=1}^{N} \mathrm{MSE}(\mathbf{H}^t_{iK}, \mathbf{H}^s_i),   (1)

where MSE stands for the mean squared error loss function. In addition, since the pooling layer is shared between the student and teacher, we expect the unified news embeddings learned by the pooling layers in the teacher and student models (denoted as h^t and h^s, respectively) to be similar. Thus, we apply an additional hidden loss to these embeddings:

\mathcal{L}^{p}_{hidden}(x, \Theta_t; \Theta_s, \Theta_p) = \mathrm{MSE}(\mathbf{h}^t, \mathbf{h}^s).   (2)

Besides, to encourage the student model to give predictions similar to the teacher model, we use the distillation loss function to regularize the output soft labels. We denote the soft label vectors predicted by the teacher and student models as ŷ^t and ŷ^s, respectively. The distillation loss is formulated as follows:

\mathcal{L}_{distill}(x, \Theta_t; \Theta_s, \Theta_p, \Theta_d) = \mathrm{CE}(\hat{\mathbf{y}}^t / t, \hat{\mathbf{y}}^s / t),   (3)

where CE stands for the cross-entropy function and t is the temperature value. The overall loss function for distillation is the summation of the hidden losses and the distillation loss:

\mathcal{L}_d(x, \Theta_t; \Theta_s, \Theta_p, \Theta_d) = \mathcal{L}^{l}_{hidden} + \mathcal{L}^{p}_{hidden} + \mathcal{L}_{distill}.   (4)

Since the original teacher and student models are task-agnostic, both of them need to receive task-specific supervision signals from the task labels (denoted as y) to tune their parameters.
Thus, the unified loss function L_s for training the student model is the summation of the overall distillation loss and the classification loss:

\mathcal{L}_s(x, \Theta_t; \Theta_s, \Theta_p, \Theta_d) = \mathcal{L}_d(x, \Theta_t; \Theta_s, \Theta_p, \Theta_d) + \mathrm{CE}(\hat{\mathbf{y}}^s, \mathbf{y}).   (5)

Since we do not expect the teacher to be influenced by the student too heavily, the loss function L_t for training the teacher model is only the classification loss:

\mathcal{L}_t(x; \Theta_t, \Theta_p, \Theta_d) = \mathrm{CE}(\hat{\mathbf{y}}^t, \mathbf{y}).   (6)
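To illustrate how these objectives fit together, the following is a minimal PyTorch sketch of the loss computation in Eqs. (1)-(6). It assumes the per-layer hidden states of both models are already computed; the module and function names (AttentivePooling, newsbert_losses, classifier) are illustrative, not from a released implementation, and the classifier stands for the shared dense layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """Shared attentive pooling layer (parameter set Theta_p in the text)."""
    def __init__(self, hidden_dim, query_dim=256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, query_dim)
        self.query = nn.Parameter(torch.randn(query_dim))

    def forward(self, hidden_states):                      # [batch, seq_len, hidden_dim]
        scores = torch.tanh(self.proj(hidden_states)) @ self.query   # [batch, seq_len]
        weights = F.softmax(scores, dim=-1)
        return (weights.unsqueeze(-1) * hidden_states).sum(dim=1)    # [batch, hidden_dim]


def newsbert_losses(teacher_embed, student_embed, teacher_hiddens, student_hiddens,
                    pooling, classifier, labels, K, temperature=1.0):
    """Losses of Eqs. (1)-(6) for one batch of news texts.

    teacher_hiddens: list of N*K per-layer teacher hidden states.
    student_hiddens: list of N per-layer student hidden states.
    pooling / classifier: pooling and dense layers shared by both models.
    Teacher tensors are detached inside the distillation terms because the
    teacher itself is trained only with its own task loss (Eq. 6).
    """
    # Eq. (1): align embedding outputs and each student layer with its teacher block.
    hidden_loss = F.mse_loss(student_embed, teacher_embed.detach())
    for i, h_stu in enumerate(student_hiddens, start=1):
        hidden_loss = hidden_loss + F.mse_loss(h_stu, teacher_hiddens[i * K - 1].detach())

    # Shared pooling and dense layers produce news embeddings and soft labels.
    h_t = pooling(teacher_hiddens[-1])
    h_s = pooling(student_hiddens[-1])
    y_t = classifier(h_t)
    y_s = classifier(h_s)

    # Eq. (2): align the pooled news embeddings of the two models.
    pooled_loss = F.mse_loss(h_s, h_t.detach())

    # Eq. (3): distillation loss between temperature-softened predictions.
    teacher_soft = F.softmax(y_t.detach() / temperature, dim=-1)
    distill_loss = -(teacher_soft * F.log_softmax(y_s / temperature, dim=-1)).sum(-1).mean()

    # Eq. (4): overall distillation loss; Eqs. (5)-(6): student and teacher losses.
    loss_d = hidden_loss + pooled_loss + distill_loss
    loss_s = loss_d + F.cross_entropy(y_s, labels)
    loss_t = F.cross_entropy(y_t, labels)
    return loss_s, loss_t
```

In this sketch the student's news embedding passes through the same pooling and dense layers as the teacher's, so the shared top layers receive supervision from both models, which reflects the joint learning design described above.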
Figure 2: The framework of incorporating NewsBERT in personalized news recommendation.
By jointly optimizing the loss functions of the teacher and student models via backward propagation, we can obtain a lightweight student model that generates task-specific news representations for inferring the labels in downstream tasks, as the teacher model does.
3.2 Momentum Distillation
In our approach, each Transformer layer in the student model corresponds to a block in the teacher model, and we expect them to have similar behaviors in learning hidden text representations. To help the student model better imitate the teacher model, we propose a momentum distillation method that injects the gradients of the teacher model into the student model as a gradient momentum to boost the learning of the student model. We denote the gradients of the j-th layer in the i-th block of the teacher model as g^t_{i,j}, which are computed by optimizing the teacher's training loss L_t via backward propagation. The gradients of the k-th layer in the student model are denoted as g^s_k, which are derived from L_s. We use the average of the gradients of the layers in the i-th block of the teacher model as the overall gradients of this block (denoted as g^t_i):

\mathbf{g}^t_i = \frac{1}{K} \sum_{j=1}^{K} \mathbf{g}^t_{i,j}.   (7)

Motivated by the momentum mechanism [10, 21], we combine the block gradients g^t_i with the gradients of the corresponding layer in the student model in a momentum manner:

\mathbf{g}^s_k = \beta \mathbf{g}^t_k + (1 - \beta) \mathbf{g}^s_k,   (8)

where β is a momentum hyperparameter that controls the strength of the gradient momentum from the teacher model. In this way, the teacher's gradients are explicitly injected into the student model, which has the potential to better guide the learning of the student by pushing each layer in the student model to have a similar function to the corresponding block in the teacher model.
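The following is a minimal sketch of the momentum distillation step in Eqs. (7) and (8), assuming the teacher and student Transformer layers share parameter names and shapes (as is the case when the student is initialized from the teacher's layers); the function name and arguments are illustrative. It would be called after both losses have been back-propagated and before the optimizer step.

```python
import torch

def apply_gradient_momentum(teacher_blocks, student_layers, beta=0.1):
    """Momentum distillation (Eqs. 7-8): blend the averaged gradients of each
    teacher block into the gradients of the corresponding student layer.

    teacher_blocks: list of N blocks, each a list of the K teacher layers it contains.
    student_layers: list of the N student layers; each has the same architecture
                    (and hence parameter names and shapes) as one teacher layer.
    Call after loss_t.backward() and loss_s.backward(), before optimizer.step().
    """
    for block, student_layer in zip(teacher_blocks, student_layers):
        teacher_params = [dict(layer.named_parameters()) for layer in block]
        for name, s_param in student_layer.named_parameters():
            t_grads = [params[name].grad for params in teacher_params]
            if s_param.grad is None or any(g is None for g in t_grads):
                continue  # skip parameters that received no gradient in this step
            # Eq. (7): average the gradients of the K layers in the teacher block.
            block_grad = torch.stack(t_grads).mean(dim=0)
            # Eq. (8): momentum combination of teacher and student gradients.
            s_param.grad = beta * block_grad + (1.0 - beta) * s_param.grad
```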
3.3 Applications of NewsBERT for News Intelligence
In this section, we briefly introduce the applications of NewsBERT in other news intelligence scenarios like personalized news recommendation. An illustrative framework for news recommendation is shown in Fig. 2, which is a two-tower framework. The input is a sequence of a user's T historical clicked news (denoted as [D_1, D_2, ..., D_T]) and a candidate news D_c, and the output is the click probability score ŷ, which can be further used for personalized news ranking and display. We first use a shared NewsBERT model to encode each clicked news and the candidate news into their hidden representations [h_1, h_2, ..., h_T] and h_c. Then, we use a user encoder to capture user interest from the representations of the clicked news and obtain a user embedding u. The final click probability score is predicted by matching the user embedding u and h_c via a click predictor, which can be implemented by the inner product function. In this framework, the teacher and student NewsBERT models are used to generate news embeddings separately, while the user encoder and click predictor are shared between the teacher and student models to generate the prediction scores, which are further constrained by the distillation loss function. In addition, the MSE hidden losses are simultaneously applied to all news embeddings generated by the shared NewsBERT model and the user embedding u generated by the user encoder, which encourages the student model to be similar to the teacher model in supporting user interest modeling.
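Below is a minimal PyTorch sketch of this two-tower framework, assuming a news_encoder module (e.g., a distilled NewsBERT followed by attentive pooling) that maps the token ids of one news text to a single vector; all class, module, and argument names are illustrative rather than part of the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NewsRecommendationModel(nn.Module):
    """Two-tower framework of Fig. 2: a shared news encoder, an attentive-pooling
    user encoder, and an inner-product click predictor."""
    def __init__(self, news_encoder, hidden_dim, query_dim=256):
        super().__init__()
        self.news_encoder = news_encoder
        self.user_proj = nn.Linear(hidden_dim, query_dim)
        self.user_query = nn.Parameter(torch.randn(query_dim))

    def encode_news(self, token_ids):                      # [batch, num_news, seq_len]
        b, n, l = token_ids.shape
        news_vecs = self.news_encoder(token_ids.reshape(b * n, l))
        return news_vecs.reshape(b, n, -1)                 # [batch, num_news, hidden_dim]

    def forward(self, clicked_token_ids, candidate_token_ids):
        # Encode the T clicked news and the candidate news with the shared encoder.
        clicked_vecs = self.encode_news(clicked_token_ids)                    # [b, T, d]
        cand_vec = self.encode_news(candidate_token_ids.unsqueeze(1)).squeeze(1)  # [b, d]
        # User encoder: attentive pooling over the clicked-news embeddings.
        scores = torch.tanh(self.user_proj(clicked_vecs)) @ self.user_query   # [b, T]
        weights = F.softmax(scores, dim=-1)
        user_vec = (weights.unsqueeze(-1) * clicked_vecs).sum(dim=1)          # [b, d]
        # Click predictor: inner product between the user and candidate embeddings.
        return (user_vec * cand_vec).sum(dim=-1)                              # [b]
```

During distillation, the teacher and student would each instantiate such a model with their own news encoder while sharing the user encoder and click predictor, as described above.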
4 EXPERIMENTS
We conduct experiments on two real-world datasets. The first dataset is the MIND dataset [40], a large-scale public news recommendation dataset. It contains the news impression logs of 1 million users on the Microsoft News website during 6 weeks (from 10/12/2019 to 11/22/2019). We used this dataset for learning and distilling our
NewsBERT model in the news topic classification and personalized news recommendation tasks. On the MIND dataset, the logs of the first 5 weeks were used for training and validation, and the logs of the last week were reserved for test. Since some news articles appear in multiple dataset splits, in the news topic classification task we only used the news that do not appear in the training and validation sets for test, to avoid label leakage. The second dataset is a news retrieval dataset (named
NewsRetrieval), which was sampled from the logs of the Bing search engine from 07/31/2020 to 09/13/2020. It contains the search queries of users and the corresponding clicked news. On this dataset, we finetuned models distilled on
MIND to measure their cross-task performance in the news retrieval task. We used the logs in the first month for training, the logs in the next week for validation, and the rest of the logs for test. The statistics of the two datasets are summarized in Table 1.

In our experiments, motivated by [7], we used the first 8 layers of the pre-trained UniLM [1] model as the teacher model (we used the UniLM V2 model), and we used the parameters of its first 1, 2 or 4 layers to initialize the student models with different capacities.
Table 1: Detailed statistics of the MIND and NewsRetrieval datasets.
NewsRetrieval student models with different capacities. In the news recommen-dation task, the user encoder was implemented by an attentivepooling layer, and the click predictor was implemented by innerproduct. The query vectors in all attentive pooling layers were256-dimensional. We used Adam [2] as the model optimizer, andthe learning rate was 3e-6. The temperature value 𝑡 was set to 1.The batch size was 32. The dropout [27] ratio was 0.2. The gradi-ent momentum hyperparameter 𝛽 was set to 0.1 and 0.15 in thenews topic classification task and the news recommendation task,respectively. These hyperparamters were tuned on the validationset. Since the topic categories in MIND are imbalanced, we usedaccuracy and macro-F1 score (denoted as macro-F) as the metricsfor the news topic classification task. Following [40], we used theAUC, MRR, nDCG@5 and nDCG@10 scores to measure the per-formance of news recommendation models. On the news retrievaltask, we used AUC as the main metric. We independently repeatedeach experiment 5 times and reported the average results.
In this section, we compare the performance of our
NewsBERT approach with several baseline methods, including:
• Glove [20], a widely used pre-trained word embedding. We used Glove to initialize the word embeddings in a Transformer [33] model for news topic classification and in the NRMS [39] model for news recommendation.
• BERT [8], a popular pre-trained language model with bidirectional Transformers. We compare the performance of the 12-layer BERT-Base model and of its first 8 layers.
• UniLM [1], a unified language model for natural language understanding and generation, which is the teacher model in our approach. We also compare its 12-layer version and its variants using the first 1, 2, 4, or 8 layers.
• TwinBERT [17], a method to distill pre-trained language models for document retrieval. For a fair comparison, we used the same UniLM model as in our approach, and compare the performance of the student models with 1, 2, and 4 layers.
• TinyBERT [13], a state-of-the-art two-stage knowledge distillation method for compressing pre-trained language models. We compare the performance of the officially released 4-layer and 6-layer TinyBERT models distilled from BERT-Base and the performance of student models with 1, 2, and 4 layers distilled from the UniLM model.
Table 2 shows the performance of all the compared methods in the news topic classification and news recommendation tasks. From the results, we have the following observations. First, compared with the Glove baseline, the methods based on pre-trained language models achieve consistently better performance.
Table 2: Performance comparisons of different methods. * means using the UniLM model for distillation. The results of the best-performing teacher and student models are highlighted.

Model     | Accuracy | Macro-F | AUC   | MRR   | nDCG@5 | nDCG@10 | Speedup
Glove     | 71.13    | 49.71   | 67.92 | 33.09 | 36.03  | 41.80   | -
BERT-12   | 73.68    | 51.44   | 69.78 | 34.56 | 37.90  | 43.45   | 1.0x
BERT-8    | 73.95    | 51.56   | 70.04 | 34.70 | 38.09  | 43.79   | 1.5x
UniLM-12  | 74.54    | 51.75   | 70.53 | 35.29 | 38.61  | 44.29   | 1.0x
Compared with TwinBERT, the student models distilled by TinyBERT and NewsBERT are usually better. This may be because the TwinBERT method only distills the teacher model based on the output soft labels, while the other two methods also align the hidden representations learned by intermediate layers, which can help the student model better imitate the teacher model. Fifth, our
NewsBERT approach outperforms all the other compared baseline methods, and further t-tests show that the improvements are significant at p < 0.01 (comparing models with the same number of layers). This is because our approach employs a teacher-student joint learning and distillation framework where the student can learn from the learning process of the teacher, which is beneficial for the student to extract useful knowledge from the teacher model. In addition, our approach uses a momentum distillation method that injects the gradients of the teacher model into the student model in a momentum way, which can help each layer in the student model better imitate the corresponding part in the teacher model. Thus, our approach can achieve better performance than other distillation methods. Sixth,
NewsBERT can achieve satisfactory and even comparable results with respect to the original pre-trained language model. For example, there is only a 0.24% accuracy gap between
NewsBERT-4 and the teacher model in the topic classification task. In addition, the size of the student models is much smaller than the original 12-layer model, and their training or inference speed is much faster (e.g., about a 12.0x speedup for the one-layer NewsBERT). Thus, our approach has the potential to empower various intelligent news applications in an efficient way. Next, to validate the generalization ability of our approach, we evaluate the performance of
NewsBERT in an additional news retrieval task. We used the
NewsBERT model learned in the news recommendation task, and we finetuned it with the labeled news retrieval data in the two-tower framework used by TwinBERT [17].
Figure 3: Cross-task performance in the news retrieval task.

Figure 4: Influence of the teacher-student joint learning and distillation framework on the student model. (a) Topic classification; (b) News recommendation.
We compared its performance with several methods, including fine-tuning the general UniLM model and the TwinBERT and TinyBERT models distilled in the news recommendation task. The results are shown in Fig. 3, from which we have several findings. First, directly fine-tuning the generally pre-trained UniLM model is worse than fine-tuning the models distilled in the news recommendation task. This is probably because language models are usually pre-trained on general corpora like Wikipedia, which have some domain shifts with the news domain. Thus, generally pre-trained language models may not be optimal for intelligent news applications. Second, our
NewsBERT approach also achieves better cross-task performance than TinyBERT and TwinBERT. This shows that our approach is more suitable for distilling pre-trained language models for intelligent news applications than these methods.
In this section, we conduct experiments to validate the advantage of our proposed teacher-student joint learning and distillation framework over conventional methods that learn the teacher and student models successively [11]. We first compare the performance of the student models under our framework and their variants learned in a disjoint manner.
Figure 5: Influence of the teacher-student joint learning and distillation framework on the teacher model. (a) Topic classification; (b) News recommendation.

The results are shown in Figs. 4(a) and 4(b). We find that our proposed joint learning and distillation framework can consistently improve the performance of student models with different capacities. This is because in our approach the student model can learn from the useful experience evoked by the learning process of the teacher model, and the teacher model is also aware of the student's learning status. In the disjoint learning framework, by contrast, the student can only learn from the results of a passive teacher. Thus, learning the teacher and student models successively may not be the optimal way to distill a high-quality student model.

We also explore the influence of the teacher-student joint learning and distillation framework on the teacher model. We compare the performance of the original UniLM-8 model and its variants that serve as the teacher model for distilling different student models. The results are shown in Figs. 5(a) and 5(b). We find an interesting phenomenon: some teacher models that teach students with sufficient capacities perform better than the original UniLM-8 model that does not participate in the joint learning and distillation framework. This may be because the teacher model can also benefit from the useful knowledge encoded by the student model. These results show that our teacher-student joint learning and distillation framework can help learn the teacher and student models reciprocally, which may improve both of their performance.
Figure 6: Effect of each core component in NewsBERT. (a) Topic classification; (b) News recommendation.
In this section, we conduct experiments to validate the effectiveness of several core techniques in our approach, including the hidden loss, the distillation loss and the momentum distillation method. We compare the performance of
NewsBERT and its variants with one of these components removed. The results are shown in Figs. 6(a) and 6(b). We find that the momentum distillation method plays a critical role in our method, because the performance declines considerably when it is removed. This may be because the gradients of the teacher model condense the knowledge and experience obtained in its learning process, which can better teach the student model to have a similar function to the teacher model and thereby yields better performance. In addition, the distillation loss function is also important for our approach. This is because the distillation loss regularizes the output of the student model to be similar to that of the teacher model, which encourages the student model to behave similarly to the teacher model. Besides, the hidden loss functions are also useful for our approach. This may be because the hidden loss functions align the hidden representations learned by the teacher and student models, which is beneficial for the student model to imitate the teacher.

Figure 7: Influence of the gradient momentum hyperparameter β. (a) Topic classification; (b) News recommendation.

In this section, we conduct experiments to study the influence of the gradient momentum hyperparameter β on model performance. We vary the value of β from 0 to 0.3, and the results are shown in Figs. 7(a) and 7(b). We observe that the performance is not optimal when the value of β is too small. This is because the gradient momentum is too weak under a small β, and the useful experience from the teacher model cannot be effectively exploited. However, the performance starts to decline when β is relatively large. Thus, a moderate β between 0.1 and 0.2 is recommended.

5 CONCLUSION
In this paper, we propose a knowledge distillation approach named NewsBERT to compress pre-trained language models for intelligent news applications. We propose a teacher-student joint learning and distillation framework to collaboratively train both teacher and student models, where the student model can learn from the learning experience of the teacher model and the teacher model is aware of the learning status of the student model. In addition, we propose a momentum distillation method that combines the gradients of the teacher model with the gradients of the student model in a momentum way, which can boost the learning of the student model by injecting the knowledge learned by the teacher. We conduct extensive experiments on two real-world datasets with three different news intelligence tasks, and the results demonstrate that our NewsBERT approach can effectively improve the performance of these tasks with considerably smaller models.

In the future, we plan to deploy NewsBERT in online news intelligence applications like personalized news recommendation and news retrieval to improve user experiences. We will also apply our NewsBERT approach to more intelligent news applications to improve their model performance in an efficient way. We are also interested in developing more effective knowledge distillation methods to better compress pre-trained language models while maximally keeping their performance.
REFERENCES
[1] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, et al. 2020. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training. In ICML. PMLR, 642–652.
[2] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[3] Peter Bourgonje, Julian Moreno Schneider, and Georg Rehm. 2017. From clickbait to fake news detection: an approach based on detecting the stance of headlines to articles. In EMNLP 2017 Workshop: Natural Language Processing meets Journalism. 84–89.
[4] Ricardo Carreira, Jaime M Crato, Daniel Gonçalves, and Joaquim A Jorge. 2004. Evaluating adaptive user profiles for news classification. In IUI. 206–212.
[5] Matteo Catena, Ophir Frieder, Cristina Ioana Muntean, Franco Maria Nardini, Raffaele Perego, and Nicola Tonellotto. 2019. Enhanced News Retrieval: Passages Lead the Way!. In SIGIR. 1269–1272.
[6] Xuanang Chen, Ben He, Kai Hui, Le Sun, and Yingfei Sun. 2020. Simplified TinyBERT: Knowledge Distillation for Document Retrieval. arXiv preprint arXiv:2009.07531 (2020).
[7] Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2020. InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. arXiv preprint arXiv:2007.07834 (2020).
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171–4186.
[9] Daniil Gavrilov, Pavel Kalaidin, and Valentin Malykh. 2019. Self-attentive model for headline generation. In ECIR. Springer, 87–93.
[10] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR. 9729–9738.
[11] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[12] Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016).
[13] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. In EMNLP Findings. 4163–4174.
[14] Yoon Kim and Alexander M Rush. 2016. Sequence-Level Knowledge Distillation. In EMNLP. 1317–1327.
[15] Jianxun Lian, Fuzheng Zhang, Xing Xie, and Guangzhong Sun. 2018. Towards Better Representation Learning for Personalized News Recommendation: a Multi-Channel Deep Fusion Approach. In IJCAI. 3805–3811.
[16] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
[17] Wenhao Lu, Jian Jiao, and Ruofei Zhang. 2020. TwinBERT: Distilling Knowledge to Twin-Structured Compressed BERT Models for Large-Scale Retrieval. In CIKM. 2645–2652.
[18] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In AAAI, Vol. 34. 5191–5198.
[19] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based news recommendation for millions of users. In KDD. ACM, 1933–1942.
[20] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP. 1532–1543.
[21] Ning Qian. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks 12, 1 (1999), 145–151.
[22] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences (2020), 1–26.
[23] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[24] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[25] Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019. dEFEND: Explainable Fake News Detection. In KDD. 395–405.
[26] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19, 1 (2017), 22–36.
[27] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR 15, 1 (2014), 1929–1958.
[28] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune BERT for Text Classification?. In CCL. Springer, 194–206.
[29] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient Knowledge Distillation for BERT Model Compression. In EMNLP-IJCNLP. 4314–4323.
[30] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. In ACL. 2158–2170.
[31] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. From Neural Sentence Summarization to Headline Generation: A Coarse-to-Fine Approach. In IJCAI, Vol. 17. 4109–4115.
[32] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136 (2019).
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS. 5998–6008.
[34] Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2020. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers. arXiv preprint arXiv:2012.15828 (2020).
[35] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In NeurIPS.
[36] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. Neural News Recommendation with Attentive Multi-View Learning. In IJCAI. 3863–3869.
[37] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural News Recommendation with Personalized Attention. In KDD. 2576–2584.
[38] Chuhan Wu, Fangzhao Wu, Mingxiao An, Yongfeng Huang, and Xing Xie. 2019. Neural News Recommendation with Topic-Aware News Representation. In ACL. 1154–1159.
[39] Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang, and Xing Xie. 2019. Neural News Recommendation with Multi-Head Self-Attention. In EMNLP. 6390–6395.
[40] Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, et al. 2020. MIND: A Large-scale Dataset for News Recommendation. In ACL. 3597–3606.
[41] Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020. BERT-of-Theseus: Compressing BERT by Progressive Module Replacing. In EMNLP. 7859–7869.
[42] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS, Vol. 32. 5753–5763.
[43] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In NAACL-HLT. 1480–1489.
[44] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. 2018. Deep mutual learning. In CVPR.