Training Large-Scale News Recommenders with Pretrained Language Models in the Loop
Shitao Xiao
BUPT, Beijing, China
[email protected]

Zheng Liu
Microsoft Research Asia, Beijing, China
[email protected]

Yingxia Shao
BUPT, Beijing, China
[email protected]

Tao Di
Microsoft, Redmond, WA, USA
[email protected]

Xing Xie
Microsoft Research Asia, Beijing, China
[email protected]
ABSTRACT
News recommendation calls for deep insights into news articles' underlying semantics. Therefore, pretrained language models (PLMs), like BERT and RoBERTa, may substantially contribute to the recommendation quality. However, it is extremely challenging to have news recommenders trained together with such big models: the learning of news recommenders requires intensive news encoding operations, whose cost is prohibitive if PLMs are used as the news encoder. In this paper, we propose a novel framework, SpeedyFeed, which efficiently trains PLMs-based news recommenders of superior quality. SpeedyFeed is highlighted for its light-weighted encoding pipeline, which gives rise to three major advantages. Firstly, it makes the intermediate results fully reusable for the training workflow, which removes most of the repetitive but redundant encoding operations. Secondly, it improves the data efficiency of the training workflow, where non-informative data can be eliminated from encoding. Thirdly, it further saves cost by leveraging simplified news encoding and compact news representation.

SpeedyFeed leads to more than 100× acceleration of the training process, which enables big models to be trained efficiently and effectively over massive user data. The well-trained PLMs-based model significantly outperforms the state-of-the-art news recommenders in comprehensive offline experiments. It is applied to Microsoft News to empower the training of large-scale production models, which demonstrate highly competitive online performances. SpeedyFeed is also a model-agnostic framework, thus being potentially applicable to a wide spectrum of content-based recommender systems. We have made the source code open to the public so as to facilitate research and applications in related areas: https://github.com/staoxiao/SpeedyFeed
KEYWORDS
News Recommendation, Pretrained Language Models, Training Framework, Efficiency and Effectiveness
ACM Reference Format:
Shitao Xiao, Zheng Liu, Yingxia Shao, Tao Di, and Xing Xie. 2021. Training Microsoft News Recommenders with Pretrained Language Models in the Loop. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION

Online news platforms have become important media of information acquisition. Given the huge volume of online news articles, personalized news feeds [13, 18, 29] become imperative, with which users may get the news articles they feel interested in. High-quality news recommendation is built upon a precise understanding of news articles' underlying semantics. Therefore, pretrained language models (PLMs), e.g., BERT and RoBERTa [8, 14], which achieve remarkable performances on general text understanding tasks, are desirable as the news encoder. However, PLMs are not quite friendly to the end-to-end training of news recommenders. On the one hand, it is expensive to work with PLMs: the encoding speed will be relatively slow and the GPU RAM consumption will be huge, given the considerable sizes of PLMs. On the other hand, the training of news recommenders requires intensive news encoding operations: to learn from every click signal of a user, the user's entire history of news clicks needs to be encoded, whose cost will be prohibitive if PLMs are utilized. As a result, the development of PLMs-based news recommenders is severely limited by this efficiency bottleneck.

To overcome the above challenge, a novel framework, SpeedyFeed, is proposed in this work, which trains PLMs-based news recommenders with high efficiency and high quality. SpeedyFeed is highlighted for its light-weighted encoding pipeline, which leads to the following advantages.

• The intermediate results are made highly reusable. Instead of having training instances encoded for one-shot use and discarded afterwards, our framework saves cost by making the following intermediate results fully reusable. Firstly, there is a small fraction of "breaking news articles", which are highly popular and widely exist in the majority of users' histories. Such news articles may frequently appear throughout the training process, and would otherwise need to be re-encoded every time. Knowing that news recommenders are trained with small learning rates (especially when PLMs are fine-tuned), the same news article's embedding will merely be slightly changed in every training step. As such, a caching mechanism is developed, which enables freshly generated news embeddings to be reused for multiple steps. Secondly, it is also wasteful to have a user history encoded for the prediction of one single click signal. In our framework, autoregressive user modeling is proposed, where an encoded prefix of the user history can be reused for the calculation of all its subsequent user embeddings.

• The data efficiency is significantly improved. The typical training workflow is prone to poor data efficiency: given that the lengths of news articles and user histories are highly diversified, plenty of padded elements have to be introduced so that raw data can be batched as input tensors. The padded data is not only non-informative, but also severely slows down the training. In our framework, a centralized news encoding workflow is designed, which completely eliminates the padded data in user history. Besides, the data loader is designed to adaptively group the training instances, so that less padded data is needed for the news articles.

• The news encoding cost is reduced with well-preserved encoding quality.
PLMs are limited by their quadratic encoding complexity, which makes the news encoding cost grow dramatically as news articles become longer. In our framework, two techniques are utilized to mitigate this problem. Firstly, bus language modeling (BusLM) is introduced for news encoding: on the one hand, it partitions each news article into small segments, which results in a linear reduction of the encoding complexity; on the other hand, it establishes a bus connection between the segments, which makes them jointly encoded for a high-quality news embedding. Secondly, content refinement is performed for each news article before it is encoded by PLMs: the useful part of a news article is identified from the raw content, based on which the news article is transformed into a more compact representation.

It is worth noting that SpeedyFeed is not solely for training speedup. Because of the high training speed, it becomes feasible to train PLMs-based news recommenders with a huge amount of user data. The enlarged model scale, together with the enriched training data, ultimately makes our recommender superior in making high-quality news recommendations.

SpeedyFeed is applied to Microsoft News, where it leads to more than 100× acceleration of the training speed compared with the conventional workflow. Besides, our well-trained PLMs-based recommender significantly outperforms the state-of-the-art approaches in comprehensive offline evaluations. The PLMs-based recommenders demonstrate highly competitive performances in production, where notable improvements are achieved in the online A/B test.

Finally, SpeedyFeed is a model-agnostic framework. In fact, it can be helpful to a wide variety of content-based recommender systems where user behaviors are associated with rich textual information, such as commodity and advertisement recommendation. The whole project is now made publicly available, which aims to facilitate research and applications in the corresponding areas.

2 RELATED WORK

• Deep News Recommendation Systems. News recommendation systems are designed to identify users' interested news articles with intensive exploitation of their historical news browsing behaviors [7, 13, 29]. As a result, two inherent problems need to be resolved within this process. One problem is the modeling of users' behaviors. With the progress of deep-learning-based recommendation systems, plenty of techniques have been proposed for user modeling. In YouTube-DNN [6], users are represented as the averages of their interacted items' embeddings; in GRU4Rec [10], users' historical behaviors are aggregated with GRUs for sequential awareness; in DIN/DIEN, users' historical behaviors are attentively aggregated to establish candidate dependency; and in RUM, memory networks are utilized to capture the diversity of users' behaviors. Such technical advancement also inspires the development of news recommenders. In DKN [24], users' historical news clicks are attended by the candidate news for more precise modeling of user interest; and in LSTUR [2], recurrent neural networks are leveraged to capture users' short-term interests.

The other problem, which is more specific to news recommendation, is the modeling of news content. In recent years, the prosperity of natural language processing has pushed forward the progress of news modeling.
For example, the hierarchical attention network (HAN) [30], which was originally proposed for document classification, is adapted for the multi-view representation of news articles [26]; meanwhile, the Deep Attention Matching Network (DAMN), which was designed for response selection in chatbots, is applied to perform fine-grained matching between the news content and user history. The remarkable progress of pretrained language models brings huge potential for the further enhancement of news modeling. However, the efficiency issue becomes one of the major obstacles to applying PLMs for news recommendation: compared with conventional small-scale text encoders, making use of PLMs in news recommenders is expensive in terms of both training time and computation resources. It usually requires the models to be trained on powerful GPU clusters, and still takes tremendously more running time. As a result, the research progress and real-world applications of PLMs-based news recommenders are comparatively limited at the current stage.

• Pretrained Language Models. Pretrained language models are proposed to learn universal representation/generation models with neural networks trained on large-scale corpora. The early works started with shallow structures, e.g., Skip-Gram [17] and GloVe [19]; in recent years, the network structures have been quickly scaled up: from ELMo [20], GPT [21], to BERT [8], RoBERTa [14], UniLM [3], till today's GPT-3 [4]. The large-scale models, which get fully trained with massive corpora, demonstrate superior capabilities on general NLP tasks, e.g., semantic matching, question answering, machine translation, and response generation. Pretrained language models are also being intensively applied to retrieval- or information-filtering-related scenarios [5, 11, 12]; e.g., in [9], PLMs are trained for knowledge retrieval, and in [15], PLMs are fine-tuned for advertisement keyword matching. In these scenarios, PLMs are required to represent a query and a keyword as latent embeddings, where the query-keyword relationship can be reflected by their embedding similarity. Apparently, news recommenders turn out to be similar applications. However, PLMs-based news recommenders can be relatively more expensive: to match a user towards a candidate news article, it needs to encode all of the user's historical news clicks with PLMs, which will lead to huge encoding costs.

Figure 1: Left: the typical training workflow of news recommendation. Middle: underlying problems within the typical workflow. Right: the constitution of SpeedyFeed, and how it contributes to the training workflow.
3 PRELIMINARIES

The news recommender is to predict a user's future news preference given their news clicks in history. Therefore, a typical training workflow consists of 3 steps (shown on the left side of Figure 1), as implemented by Microsoft Recommenders [1].

• Input Processing. The trainer needs to transfer the raw data, i.e., users' historical interactions with news, into the required format, such that it can be loaded for training. Two operations are involved in this stage. On the one hand, the news articles are tokenized, and then padded/truncated into token sequences of a unified length. On the other hand, all users' histories are padded/truncated into news sequences of a unified length.

• News & User Encoding. The input tensors are encoded via two steps [29]. Firstly, the news encoding, which maps the user's historical news clicks and all of the candidate news into news embeddings. Secondly, the user encoding, which generates user embeddings based on the encoded historical news and other auxiliary features.

• Learning from Prediction. Finally, the prediction is made about which candidate news is clicked, given the user and news embeddings. The prediction loss (e.g., cross entropy or BPR) will be back-propagated for the model parameters' update. (A minimal sketch of this workflow is given below.)
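To make the three steps concrete, here is a minimal sketch in PyTorch. The tensor layout and the `news_encoder`/`user_encoder` interfaces are assumptions for illustration; this is not the Microsoft Recommenders implementation.

```python
import torch
import torch.nn.functional as F

def train_step(user_tokens, cand_tokens, labels, news_encoder, user_encoder, optimizer):
    """user_tokens: [B, H, S] token ids of the padded user history;
    cand_tokens: [B, C, S] token ids of the padded candidates;
    labels: [B] index of the clicked candidate."""
    B, H, S = user_tokens.shape
    C = cand_tokens.size(1)
    # Step 2a: news encoding -- every historical and candidate article is encoded
    hist_emb = news_encoder(user_tokens.reshape(B * H, S)).reshape(B, H, -1)
    cand_emb = news_encoder(cand_tokens.reshape(B * C, S)).reshape(B, C, -1)
    # Step 2b: user encoding -- aggregate the historical news embeddings
    user_emb = user_encoder(hist_emb)                                  # [B, D]
    # Step 3: learning from prediction (cross entropy over the candidates)
    scores = torch.bmm(cand_emb, user_emb.unsqueeze(-1)).squeeze(-1)   # [B, C]
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Note how the news encoder runs over B × (H + C) articles for every single batch; this is exactly the cost that dominates when the encoder is a PLM.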
One of the most notable things about training a news recommender is its huge text encoding cost: to make a prediction for one single user click, the trainer needs to encode 1) all of the news articles in the user history and 2) the candidate news. Considering the large magnitudes of pretrained language models, the text-encoding-related computations will almost dominate the entire training cost. Moreover, because of the following issues (shown in the middle of Figure 1), the typical training workflow becomes severely limited in efficiency, which makes it prohibitive to train a PLMs-based news recommender.

• High Encoding Cost. First of all, PLMs are considerably larger in scale than the text encoders used in conventional text-related recommendation, e.g., bi-LSTM, CNN, and shallow transformers. What is worse, PLMs are highly unfavorable to the processing of long texts. Particularly, the encoding cost is vulnerable to the length of the input news (N), whose time complexity is O(N²), given that the mainstream PLMs are all based on transformer architectures [23]. Considering that many news articles require long textual descriptions to fully express their underlying information, encoding such news articles results in a huge computation overhead.

• Low Reusability. Secondly, reusability is seldom emphasized: every time a training instance is given, it is processed for the calculation of its own loss; once the loss is back-propagated, all related intermediate results, especially the news embeddings, are discarded after being used just one time. Considering that it is quite an expensive operation to encode a news article with PLMs, such a defect severely slows down the training progress.

• Low Data Efficiency. Lastly, due to the existence of substantial padded data (the padded elements in news articles and user histories), the meaningful computation throughput can be severely limited. Particularly, given that a token is the atomic data unit for both the user tensor and the candidate tensor, we define data efficiency as the ratio of valid (i.e., non-padded) tokens within the input tensors:

\mathrm{DE} = \frac{|\text{valid tokens}|}{|\text{valid tokens}| + |\text{padded tokens}|} \times 100\%. \quad (1)

Due to the highly diversified lengths of user histories and news articles, a huge number of padded elements will probably be introduced. We empirically find that the data efficiency is usually lower than 50% in practice, which leads to a big waste of computation capacity and further jeopardizes the training efficiency.
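As a concrete reading of Eq. 1, the following helper computes the data efficiency of a batch of token-id tensors, assuming (for illustration) that padding uses token id 0:

```python
def data_efficiency(batch_token_ids, pad_id=0):
    """Data efficiency (Eq. 1): the share of non-padded tokens in an input
    tensor. `batch_token_ids` is any integer tensor of token ids."""
    valid = (batch_token_ids != pad_id).sum().item()
    total = batch_token_ids.numel()
    return valid / total
```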
Figure 2: Illustration of centralized news encoding.
4 SPEEDYFEED

We develop an efficient training framework, SpeedyFeed, which enables news recommenders built upon large-scale PLMs to be trained with both high speed and high quality. With SpeedyFeed, the news and user encoding are carried out through a light-weighted pipeline, which is characterized by the following techniques: 1) the centralized news encoding for high data efficiency, 2) the cache acceleration and the autoregressive user modeling for high reusability, and 3) the bus language modeling for economic encoding complexity (as the rightmost of Figure 1). Besides, two auxiliary techniques, content refinement and dynamic batching, are introduced, which give rise to a more compact representation of the news content and a further reduction of padded data, respectively.

4.1 Light-weighted Encoding Pipeline
4.1.1 Centralized News Encoding. The overall news encoding workflow is discussed in the first place. In the typical training workflow, the news encoder directly works on the input tensors (i.e., user tensor, news tensor) to produce the news embeddings. During this process, the padded news articles are encoded together with the valid news, which results in low data efficiency.

Unlike the typical method, all of the news articles within a common mini-batch are jointly encoded in SpeedyFeed (as Figure 2). The centralized news encoding takes 3 steps: gathering, encoding, and dispatching. Once a mini-batch is given, it gathers the news articles from all users and candidates into a merged set. The padded news and duplicated news are all removed. Then, the news embeddings are generated for all remaining news in the merged set. Finally, the news embeddings are dispatched to their original training instances. Note that the padded news articles also require their embeddings so as to infer the user embeddings; in this place, a dummy vector is plugged into the positions taken by the padded news, whereby no additional encoding cost is needed.
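A minimal sketch of the gather-encode-dispatch procedure follows, assuming each news item is a fixed-length tuple of token ids and `pad_key` marks padded slots; the function names and data layout are illustrative, not the released implementation.

```python
import torch

def centralized_encode(instances, news_encoder, pad_key=None):
    """Gather -> encode -> dispatch, as in Figure 2. `instances` is a list of
    lists of hashable news items (e.g., tuples of token ids)."""
    # 1) Gathering: merge all news of the mini-batch, dropping pads and duplicates
    merged = list({n for inst in instances for n in inst if n != pad_key})
    index = {n: i for i, n in enumerate(merged)}
    # 2) Encoding: each distinct article is encoded exactly once
    emb = news_encoder(torch.tensor(merged))            # [|merged|, D]
    dummy = torch.zeros(emb.size(-1))                   # dummy vector for pads
    # 3) Dispatching: route the embeddings back to their original positions
    return [torch.stack([emb[index[n]] if n != pad_key else dummy for n in inst])
            for inst in instances]
```

Because duplicates are collapsed before encoding, a popular article appearing in many users' histories within the batch is encoded only once.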
Top-α            1%     3%     5%     10%    20%    30%
Click Ratio (%)  59.53  77.65  84.82  92.47  97.36  98.97

Table 1: The long-tail property of news clicks. The top-1% popular news yields almost 60% of the overall clicks.

4.1.2 Cache-Accelerated News Encoding. The cache mechanism is developed to fully reuse the intermediate news encoding results. Particularly, one notable observation about Microsoft News is the long-tail property of its news click distribution. As shown in Table 1, the top-1% popular news articles yield almost 60% of the overall news clicks.
Figure 3: Illustration of Cache-accelerated News Encoding.
Therefore, such popular news articles may widely exist in the majority of users' histories, making them frequently re-encoded across different training batches. Knowing that the model parameters are updated with a fairly small learning rate (8e-6 for the PLM in our configuration; see Appendix A.3), one news article's recent embedding can be reused in the current mini-batch as an approximation. Based on this intuition, we propose Cache-Accelerated News Encoding, where a cache is maintained in memory for the storage of fresh news embeddings. The news encoding workflow is changed accordingly, as in Figure 3.

• Cached News Encoding. For each news article in a mini-batch, the trainer will check the cache in the first place: if there is a copy of the news embedding in the cache, it will be directly reused without encoding; otherwise, the news article will be encoded from scratch.
Figure 4: Illustration of Cache Management.

• Cache Management Policy. The cache is managed with the following principles. Firstly, all of the embeddings in the cache must have been newly generated in the past few steps; otherwise, they will be incompatible with the current model. Secondly, the cache lookup should be dynamically scheduled: in the initial stage, the cached news embeddings should be used with a relatively low probability, because the step-wise change of the model parameters is sharp; as the training goes on, the lookup rate should be gradually increased, because the change of the model parameters becomes mild.

Based on the above principles, the cache management policy is made (as Figure 4), which is subject to two decisive variables: the stepwise lookup rate p_t, and the expiration step γ. 1) An exponential scheduler is used to control the probability of whether to look up the cache: the cache is looked up with probability 0 when the training gets started; the lookup probability gradually grows to p_t at the t-th step w.r.t. the following relationship:

p_t = 1.0 - \exp(-\beta t). \quad (2)

β is the hyper-parameter for the growth rate, which lets p_t grow to 1.0 after the initial stage of the training process. 2) A cached embedding is expired and removed from the cache after γ steps since it was written in.

Finally, instead of maintaining a private cache for each training thread, we establish a global cache in the master node. As a result, the newly generated news embeddings on one node can be shared across all devices, which accommodates the distributed training of the news recommender. Besides, the cache is maintained in memory instead of GPU RAM; therefore, it is almost free of cost, and the storage capacity can be large enough to host all needed embeddings.
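The following is a minimal sketch of this cache policy under the interface assumed here (the released code may differ). The default β = 2e-3 and γ = 20 follow the paper's configuration; note that Algorithm 2 in the Appendix draws the lookup decision once per mini-batch, while this sketch exposes a per-article `get` for brevity.

```python
import math
import random

class EmbeddingCache:
    """Cache of fresh news embeddings (Figure 4): lookups follow the
    exponential schedule of Eq. 2, and entries expire after `gamma` steps."""

    def __init__(self, beta=2e-3, gamma=20):
        self.beta, self.gamma = beta, gamma
        self.store = {}  # news_id -> (embedding, step when written)

    def lookup_rate(self, step):
        return 1.0 - math.exp(-self.beta * step)  # Eq. 2

    def get(self, news_id, step):
        if random.random() >= self.lookup_rate(step):
            return None                       # skip the cache early in training
        hit = self.store.get(news_id)
        if hit is None or step - hit[1] > self.gamma:
            return None                       # missing or expired entry
        return hit[0]

    def put(self, news_id, emb, step):
        # embeddings live in CPU memory, detached from the autograd graph
        self.store[news_id] = (emb.detach().cpu(), step)
```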
4.1.3 Bus Language Modeling.

Figure 5: Illustration of BusLM (using the i-th layer for demonstration).

We make a further analysis of how we conduct news encoding in an economic way. The news encoding complexity is quadratic to the news length, i.e., O(N²). A straightforward way of time reduction is to split the news into several sub-components, e.g., the title, abstract, and body, as done in [26]; the text segments are processed independently, and their encoding results are aggregated into the final news embedding. This operation may cut down the time complexity to O(N²/K), if the text can be partitioned into K "almost equal-length" segments. Yet, the naive split of text harms the news embedding quality, as the text segments cannot make reference to each other during the encoding process.

Inspired by recent progress on efficient transformers [22], we propose BusLM (Figure 5) to encode the news articles, where the acceleration is achieved with fully preserved embedding quality. In BusLM, the input news is uniformly partitioned into K text segments, such that the encoding complexity is reduced to O(N²/K). The segments are still encoded by transformers; however, a layer-wise "bus connection" is established between the transformers, which enables information to be exchanged across the segments. In each layer of the transformers, a "proxy embedding" is chosen for each segment, which serves as the sketch of its underlying information. To avoid additional computation as much as possible, we directly select the first embedding of each segment as its proxy; e.g., for the i-th layer of segment j, H^i_j[0] is chosen as the proxy (let H^i_j be the j-th segment's embedding sequence in the i-th layer). The i-th layer's proxy embeddings from all of the segments are gathered as the i-th bus:

\text{Bus}^i = \{\mathbf{H}^i_j[0]\}_{j=1}^{K}. \quad (3)

The bus is used as a medium of information exchange, which all the segments may attend to directly. Particularly, the (i+1)-th layer of the j-th segment becomes:

\mathbf{H}^{i+1}_j = \text{Transformer}^i\big(\big[\mathbf{H}^i_j, \text{Bus}^i\big]\big), \quad (4)

in which "[·]" denotes concatenation, and "Transformer^i(·)" is the i-th layer of the transformers. The final news embedding is acquired by aggregating all of the hidden states in the last layer, i.e., H^{-1}_* (see Appendix A.1.1). It is empirically verified that both time efficiency and memory efficiency benefit from BusLM; meanwhile, the information loss due to the split of text is fully mitigated by the bus connection.
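The sketch below shows one BusLM layer (Eqs. 3-4) built on `nn.MultiheadAttention` for brevity; the paper instead adapts UniLM's own transformer layers, so treat this as a structural illustration under that assumption. Per Eq. 8 in the Appendix, the query uses only the segment while the key/value are extended with the bus.

```python
import torch
import torch.nn as nn

class BusLayer(nn.Module):
    """One BusLM layer: each segment attends to itself plus the bus of
    first-position proxy embeddings gathered from all segments."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, segs):
        # segs: [B, K, S, D] -- K segments of length S
        B, K, S, D = segs.shape
        bus = segs[:, :, 0]                                   # Eq. 3: proxies H_j^i[0]
        h = segs.reshape(B * K, S, D)
        kv = torch.cat([h, bus.repeat_interleave(K, 0)], 1)   # Eq. 4: [H_j^i, Bus^i]
        h = self.ln1(h + self.attn(h, kv, kv, need_weights=False)[0])
        h = self.ln2(h + self.ffn(h))
        return h.reshape(B, K, S, D)
```

Each segment's self-attention now costs O(S·(S+K)) instead of O(N²) over the full article, which is where the roughly K-fold reduction comes from.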
4.1.4 Autoregressive User Modeling.

Figure 6: Illustration of Autoregressive User Modeling.

The computation cost is further saved by reusing the encoded prefix of the user history. As discussed, the news recommender is trained based on the prediction loss of the user's news clicks. Therefore, a typical training instance consists of a user's news click at one timestamp t and its preceding news clicks: ⟨click_{=t}, clicks_{≤t−1}⟩. However, when another training instance ⟨click_{=t+1}, clicks_{≤t}⟩ is presented, clicks_{≤t−1} becomes the prefix of the user history, which would need to be re-encoded.

We propose autoregressive user modeling for a more efficient utilization of the user history (Figure 6), where all of the news clicks of a user can be predicted with one shot of news encoding. Instead of processing each training instance ⟨click_{=t}, clicks_{≤t−1}⟩ case-by-case, the whole user history clicks_{≤L} is treated as one unified training instance (let L be the max length of the user history). The trainer encodes all of the historical news clicks, which gives the news embedding set {θ_l}_{l≤L}. The trainer then calculates the user embedding set {μ_t}_{t≤L}, where μ_t is conditioned on the preceding news embeddings {θ_l}_{l≤t}. Each user embedding is used to predict the news click at the next timestamp, where the overall prediction loss L_auto w.r.t. one sample user is computed as:

\mathcal{L}_{auto} = -\sum_{t<L} \log \frac{\exp(\langle \boldsymbol{\theta}_{t+1}, \boldsymbol{\mu}_t \rangle)}{\exp(\langle \boldsymbol{\theta}_{t+1}, \boldsymbol{\mu}_t \rangle) + \sum_{\boldsymbol{\theta}'_{t+1}} \exp(\langle \boldsymbol{\theta}'_{t+1}, \boldsymbol{\mu}_t \rangle)}. \quad (5)

"⟨·⟩" calculates the relevance of the user and news embeddings, e.g., inner product; and θ'_{t+1} is the embedding of a negative sample.
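A minimal sketch of Eq. 5 follows. For clarity it loops over history prefixes; in the actual pipeline all μ_t can be produced in one pass (e.g., with a causally masked user encoder), which is precisely how the prefix encodings get reused. The `user_encoder` interface is an assumption.

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(news_emb, user_encoder, neg_emb):
    """news_emb: [L, D] embeddings of one user's clicks in time order;
    neg_emb: [L-1, N, D] embeddings of the sampled negatives per step;
    `user_encoder` maps a history prefix [t+1, D] to a user embedding [D]."""
    L, D = news_emb.shape
    losses = []
    for t in range(L - 1):
        mu_t = user_encoder(news_emb[: t + 1])       # user state after t+1 clicks
        pos = news_emb[t + 1] @ mu_t                 # <theta_{t+1}, mu_t>
        neg = neg_emb[t] @ mu_t                      # <theta'_{t+1}, mu_t>, shape [N]
        logits = torch.cat([pos.unsqueeze(0), neg])  # positive at index 0
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()
```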
Algorithm 1: Light-weighted Encoding Pipeline
Input: a mini-batch: user tensor U, news tensor N
Output: overall prediction loss L_auto
begin
    Merged set M ← gather(U.news ∪ N.news);
    Cached set M_C ← {m : m in cache} ∩ M;
    Get lookup rate p_t from the scheduler (Eq. 2);
    Lookup set M_L ← sample from M_C with p_t;
    News embedding set Θ₁ ← CacheLookup(M_L);
    News embedding set Θ₂ ← BusLM(M \ M_L);
    Dispatch Θ₁ ∪ Θ₂;
    Refresh cache with Θ₂;
    Get L_auto from autoregressive user modeling.

We summarize the light-weighted encoding pipeline in Algorithm 1. For each mini-batch, all the included news articles from both the user tensor and the news tensor are gathered into the merged set M, with all padded news and duplicated news removed from it. The lookup set M_L is sampled from the cached news articles M_C with the lookup rate p_t. The news articles within the lookup set directly use their cached embeddings, which gives Θ₁. The news articles outside the lookup set are encoded with BusLM, which gives Θ₂. The whole set of news embeddings Θ₁ ∪ Θ₂ is dispatched to their original positions (in either the user history or the candidate news); then, the cache is refreshed with the newly generated news embeddings Θ₂. Finally, the overall prediction loss is calculated with the autoregressive user modeling, as in Eq. 5.

4.2 Auxiliary Techniques

4.2.1 Content Refinement. Although the news encoding cost is reduced with BusLM, it is still prohibitive to process extremely long news articles. Therefore, moderate truncation of the news article is inevitable in practice. Instead of simply taking the head of a news article, the truncation is treated as a filtering operation in SpeedyFeed: we try to remove the redundant or non-informative data, and preserve the important information as much as possible. The Ordered Bag-of-Words (OBoW) model is proposed for this purpose (Figure 7), which compactly represents the informative data distilled from the raw content. In OBoW, all special characters and stop words are discarded; besides, the news article is represented as a sequence of "word: count" tuples ordered by the words' first appearances in the original content. The remaining words are labeled with their BM25 scores. The words with the top-k BM25 importance are believed to be the most informative, and they are preserved by the final OBoW model.

One minor modification is required to encode the OBoW with PLMs: apart from the original token embedding, position embedding, and segment embedding, an additional frequency embedding is added to each input token, which brings in the information about each word's number of appearances in the original content.
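The following sketch builds an OBoW representation as described above. The `bm25_score` callable (backed by corpus statistics) and the tiny stopword list are assumptions for illustration.

```python
import re
from collections import Counter, OrderedDict

def ordered_bag_of_words(text, bm25_score, top_k=32,
                         stopwords=frozenset({"the", "a", "is", "to", "as"})):
    """OBoW content refinement (Figure 7): strip special characters and stop
    words, keep `word: count` tuples ordered by first appearance, and retain
    the top-k words by BM25 importance."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    words = [w for w in words if w not in stopwords]
    counts, first_pos = Counter(words), {}
    for i, w in enumerate(words):
        first_pos.setdefault(w, i)
    kept = sorted(counts, key=bm25_score, reverse=True)[:top_k]  # top-k by BM25
    ordered = sorted(kept, key=first_pos.get)                    # restore text order
    return OrderedDict((w, counts[w]) for w in ordered)
```

The counts returned here are what the extra frequency embedding consumes at encoding time.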
Figure 7: Illustration of Content Refinement (raw text → ordered bag-of-words → refined by BM25).

4.2.2 Dynamic Batching. Dynamic batching is an asynchronous data loading process, which runs in parallel to the model training process. It uses multiple threads to read the user log and generate training instances (user history and candidate news) from it. The training instances are gathered as mini-batches and consumed by the training process. To reduce the number of padded tokens and maximize the GPU utilization, the following treatments are adopted. Firstly, the training instances are grouped based on the lengths of their included news articles. As each training instance is virtually a collection of news articles, it can be marked by the "max-length" of its included news; e.g., an instance with 3 news articles, whose lengths are (32, 48, 36), will be marked with 48. Each training instance is routed to a bucket based on its max-length; e.g., the above instance may be routed to the bucket which hosts training instances of max lengths 40∼50. All the news within one bucket will be padded to the same length, determined by the currently longest news in the bucket. Secondly, the bucket is checked after every fill-in: whether the total number of tokens reaches the threshold determined by the GPU RAM's capacity. Once a bucket is full, all of its training instances will be dumped into a mini-batch and appended to the mini-batch queue, which is consumed by the training process.

The above generation of mini-batches is "dynamic", as the padded length and the batch size are determined case-by-case. Thanks to the grouping operation, the overall padded length is minimized; and with the dynamic batch size, the GPU capacity can be used as much as possible.
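A minimal sketch of this bucketing logic follows; the token threshold of 39,800 is the value reported in Appendix A.3, while the bucket width, the pad id, and the single-threaded generator form are simplifying assumptions (the real loader runs asynchronously on multiple threads, and would also flush partial buckets at epoch end).

```python
from collections import defaultdict

def dynamic_batcher(instances, bucket_width=10, max_tokens=39800):
    """Route each instance to a bucket by the max length of its news, pad
    within the bucket only, and emit a mini-batch once a bucket holds roughly
    `max_tokens` tokens. `instances` is an iterable of lists of token-id lists."""
    buckets = defaultdict(list)  # bucket id -> pending instances
    for inst in instances:
        max_len = max(len(news) for news in inst)
        b = max_len // bucket_width                     # e.g., lengths 40~49 share a bucket
        buckets[b].append(inst)
        n_tokens = sum(len(n) for i in buckets[b] for n in i)
        if n_tokens >= max_tokens:                      # bucket is full: emit a batch
            batch, buckets[b] = buckets[b], []
            pad_to = max(len(n) for i in batch for n in i)  # currently longest news
            yield [[n + [0] * (pad_to - len(n)) for n in i] for i in batch]
```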
#news      #users     #clicks
1,202,576  4,720,192  72,093,576

Table 2: Statistics of the dataset.
5 EXPERIMENTS

We use a massive dataset for evaluation. The dataset contains MSN (EN-US market) users' news reading behaviors from 2020-05-01 to 2020-08-31. The data in the first 3 months (2020-05-01 to 2020-07-31) is used for training; the last month's data (2020-08-*) is used for testing. There are 1,202,576 news articles and 4,720,192 users, yielding 72,093,576 news clicks in total (summarized in Table 2). The following features are utilized for the offline experiments. The news articles consist of their titles, abstracts, and bodies; the users are represented by their historical news clicks. Other features, like news categories, user demography, and contexts, are omitted here, but exploited in production. More specifications of the data related to reproducibility are included in the Appendix.
             AUC    MRR    NDCG@5  NDCG@10  Recall@50  Recall@100  Recall@200  Time (hour)
NPA          65.01  24.66  26.06   30.90    2.24%      4.04%       7.14%       23.6
NAML         67.57  26.90  28.73   33.75    2.29%      5.22%       8.67%       26.6
LSTUR        64.37  24.35  25.58   30.63    2.28%      3.96%       6.84%       30.4
NRMS         68.62  27.30  29.09   34.15    3.10%      5.81%       9.15%       27.8
UniLM        –      –      –       –        –          –           –           2497.5
Speed-Mini   72.06  30.16  32.63   37.74    6.93%      10.75%      16.13%      3.1
Speed-Half   69.47  28.06  30.04   35.11    4.78%      7.47%       11.84%      3.3
Speed-Last   70.98  28.92  31.09   36.26    4.69%      8.07%       12.94%      6.5
Speed-UniLM                                                                    19.4

Table 3: Upper: baselines w.o. SpeedyFeed acceleration. Lower: PLMs-based news recommenders accelerated with SpeedyFeed.
A default news recommender is trained by SpeedyFeed, whose configurations are listed as follows.

• News Encoder. Our news encoder is initialized with the pretrained checkpoint of UniLMv2-base [3] (UniLM for short), which is a 12-layer, 768-hidden-dimension language model trained by Microsoft. Leveraging the state-of-the-art pretraining techniques, it outperforms other well-known PLMs of the same scale, e.g., BERT-base and RoBERTa-base, on the GLUE benchmark and many other general NLP tasks.

• User Encoder. A highly simplified user encoder is adopted by the default news recommender, as we focus on the impact brought by the news encoder. Particularly, a simple adaptation of YouTube-DNN [6], namely Attentive YouTube-DNN, is utilized: it makes use of the weighted sum of the user's historical news embeddings for the user representation, where a learnable attention vector is introduced to generate the aggregation weights. (A sketch of this encoder is given below, after this list.)

Besides, we also combine the default user encoder with the following alternative news encoders, which are certain simplifications of the original UniLM.

• MiniLM [25], a high-quality distillation of UniLM; both its depth and width are reduced to 50% of the original model (i.e., with 6 layers and 384 hidden dimensions).

• UniLM-Half, where the model scale is the same as MiniLM, but the model weights are directly inherited from the original UniLM.

• UniLM-Last, which uses the whole UniLM but only fine-tunes the last layer in the training stage. (It is different from the default one, where UniLM is trained end-to-end.) Although it does not contribute to the feed-forward speed, it reduces the training cost as most of the layers are frozen: the GPU RAM usage will be lower, so that we may use larger batch sizes for acceleration. Besides, it also saves cost as many fewer model parameters call for updates.

All the above approaches are trained with SpeedyFeed, thus referred to as Speed-UniLM, Speed-Mini, Speed-Half, and Speed-Last in the experiments.
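A minimal sketch of the Attentive YouTube-DNN user encoder described above; the parameter names and masking convention are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveYouTubeDNN(nn.Module):
    """A learnable attention vector scores each historical news embedding;
    the user embedding is the attention-weighted sum of the history."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))

    def forward(self, hist_emb, mask=None):
        # hist_emb: [B, H, D]; mask: [B, H], True on valid (non-padded) positions
        scores = hist_emb @ self.query                     # [B, H]
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        w = torch.softmax(scores, dim=-1)
        return (w.unsqueeze(-1) * hist_emb).sum(1)         # [B, D]
```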
The following representative news recommender baselines are utilized in our experiments.

• NPA [27], which leverages personalized attention to select and aggregate useful information in the user history.
• NAML [26], which uses multi-view attention to aggregate the user's historical news clicks.
• LSTUR [2], which relies on multiple neural network structures to jointly capture the user's long-term and short-term interests.
• NRMS [28], which makes use of multi-head self-attention to improve the quality of the user representation.

The above approaches try out various user modeling strategies for news recommendation. But one thing in common is that all of them make use of comparatively small-scale text encoders to generate the news embeddings, such as 1D-CNN or self-attention. These methods are trained following the default workflow as demonstrated in Microsoft Recommenders [1].
The experiment results are comprehensively evaluated from three perspectives. On the one hand, the evaluation is made for the ranking performance: given a testing impression, the recommender is required to generate the ranking orders of the impressed news articles; the ranking orders are compared with the ground truth (i.e., the clicked news within the impression), and the performance is measured with the typical ranking metrics: AUC, NDCG, and MRR. On the other hand, the evaluation is made for the recall performance: based on the user embedding generated by the recommender, the relevant news articles are retrieved from the whole production index. Since the relevance between the user embedding and the news embedding is measured by inner product, it turns out to be a Max-Inner-Product-Search problem, where HNSW [16] is used as the backbone of the ANN index. The performance is measured with Recall@K, where the ground truth is still the clicked news of the testing impression. We also evaluate the training efficiency, where the time cost is measured with the following configurations.
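For illustration, a minimal sketch of this recall evaluation with the `hnswlib` HNSW implementation, assuming one ground-truth click per impression; the index parameters are illustrative defaults rather than the production settings.

```python
import hnswlib
import numpy as np

def recall_at_k(news_emb, user_emb, clicked_ids, k=100):
    """Index the news embeddings under inner-product similarity, then check
    whether each test user's clicked article appears in the top-k retrieval."""
    index = hnswlib.Index(space="ip", dim=news_emb.shape[1])
    index.init_index(max_elements=len(news_emb), ef_construction=200, M=16)
    index.add_items(news_emb, np.arange(len(news_emb)))
    index.set_ef(max(200, k))
    labels, _ = index.knn_query(user_emb, k=k)          # [n_users, k]
    hits = [clicked_ids[i] in labels[i] for i in range(len(user_emb))]
    return float(np.mean(hits))
```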
All the training jobs are performed on an Azure machine with 4× Nvidia V100-32G GPUs and 40× Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz processors, running Ubuntu 16.04.6. The models are implemented with PyTorch 1.7.0. More specifications about the training process are included in the Appendix. All the code has been open-sourced in our GitHub repo.
The main experiments are performed to clarify the following issues: 1) the effect on recommendation quality when PLMs are utilized as the news encoders, and 2) the effect on efficiency when SpeedyFeed is leveraged for recommender training. The following conclusions can be drawn based on the experiment results reported in Table 3.

Training Time                    Speedup
w.o. SpeedyFeed: 2497.5 hours    Overall: 128.7×
w. SpeedyFeed: 19.4 hours        Central & Batch: 3.0×; Cache: 1.98×; Autoregressive: 17.0×; BusLM: 1.27×

Table 4: Left: default recommender's training time w./w.o. acceleration by SpeedyFeed. Right: overall speedup, and the speedup effect from each basic module.
Firstly, our default recommender, which is equipped with a full-scale and end-to-end trained UniLM, beats all models with a simplified (MiniLM, UniLM-Half) or insufficiently trained (UniLM-Last) UniLM. Besides, it outperforms all of the baseline recommenders by huge margins: all the ranking metrics are significantly improved, and the recall metrics go above the baselines by several times. Therefore, it validates that 1) the recommendation quality can be greatly improved by large-scale PLMs, and 2) the PLMs need to be fully trained within the recommender so as to achieve the best performance. The UniLM-based recommender is also verified with an online A/B test, where most of the critical metrics are significantly improved.

Secondly, the training efficiency is improved dramatically. It is impractical to train the UniLM-based recommender without SpeedyFeed; its time cost in Table 3 is therefore estimated (as the average-time-per-step × the-total-required-steps). In other words, the training speed is accelerated by over 100× with SpeedyFeed.

Finally, one additional issue is whether we should resort to distilled PLMs for a further training speedup. Given the results in Table 3, we incline to stay with the full-scale PLMs (i.e., UniLM vs. MiniLM), as the recommendation quality can be significantly improved with an acceptable increment of training cost. Besides, although MiniLM is faster, the online inference speed is hardly a bottleneck for news recommendation in MSN: there are merely tens of thousands of fresh news articles generated each day, which can be encoded and cached for the recommender with very little cost. However, we do not exclude distilled PLMs' necessity in other scenarios, like search and ads, where fresh content is generated rapidly and needs to be processed in real time.
Experiments are performed to further clarify the following issues: 1) SpeedyFeed's impact on training speedup,
Figure 8: Effect of Dynamic Batching (DB) and Centralized News Encoding (CNE) on data efficiency.
             AUC    Recall@50  Recall@100
w.o. Bus     73.20  6.72       10.81
w.o. Cache   73.70  8.11       12.73
w.o. Refine  73.70  7.88       12.45
Default

Table 5: Impact on recommendation quality when individual modules are disabled.
2) SpeedyFeed's impact on data efficiency, 3) SpeedyFeed's impact on recommendation quality, 4) a detailed analysis of cached news encoding, and 5) a detailed analysis of BusLM.

• Impact on Speedup.
We study the overall speedup effect of SpeedyFeed (Table 4): the training time is reduced from 2497.5 hours (estimated) to 19.4 hours, which means a 128.7× speedup. We further analyze the detailed speedup effect of each module. We find that the autoregressive user modeling brings the biggest gain, accelerating the training by 17×. This observation is natural to interpret, as the encoding cost for the entire prefix of the user history can now be saved by reusing the preceding encoding results (as discussed in 4.1.4). The centralized news encoding (Central) and the dynamic batching (Batch) jointly improve the data efficiency, which results in another 3× speedup. Besides, the cached news encoding (Cache) and BusLM generate 1.98× and 1.27× speedup, respectively.

• Impact on Data Efficiency. The data efficiency (Eq. 1) is jointly increased by the centralized news encoding and the dynamic batching (Figure 8). The original data efficiency (1-Bucket, w.o. CNE) is merely around 30%, which means roughly 70% of the computation is wasted on the padded data. With the adoption of both techniques, a large portion of the padded data is removed, and the data efficiency can easily be improved to more than 70%.

• Impact on Recommendation Quality. As shown in Table 5, the recommendation quality is reduced when the bus connection and the content refinement are disabled. Particularly, for "w.o. Bus", the bus connection is removed, so the partitioned news segments become independently encoded. Without making reference to the whole context, it is hard to fully express each segment's underlying semantics, which harms the quality of the news embedding. For "w.o. Refine", the front part of each news article is truncated for input; therefore, it results in more severe information loss, which reduces the recommendation quality.

• More on Cache. The effect of cached news encoding is tested with different values of γ (the "expiration step" defined in 4.1.2). With the increment of γ, the training speed is accelerated tremendously. We choose γ = 20 as our default setting for the trade-off between training efficiency and quality.
Figure 9: BusLM’s effect on speed and GPU RAM cost.
        AUC    Recall@50  Recall@100  Time (h)
γ = 0
γ = 20  73.74  8.32
γ = 30  73.69  7.99       12.37       15.4

Table 6: Effect of cached news encoding. The cache is disabled when γ = 0.

Besides, it is more interesting to see that the cached news encoding also contributes to the recommendation quality. This is probably because the cached news encoding helps to put more emphasis on the long-tailed news articles, which could suffer from insufficient training due to the dominance of the most popular news. In other words, most of the training opportunities would be taken by a small number of news articles (if the cache were disabled), given that a large portion of the news clicks result from the hottest minority (as reflected by Table 1).

• More on BusLM. As shown in Figure 9, the training speed is improved and the GPU RAM consumption is reduced thanks to BusLM.
6 CONCLUSION

In this paper, we propose a novel framework, SpeedyFeed, for the efficient training of PLMs-based news recommenders. SpeedyFeed enjoys three technical advantages: high reusability, high data efficiency, and economic news encoding complexity, which jointly lead to a huge speedup of the training workflow. The proposed framework is applied to Microsoft News, where significant improvements are achieved in both offline and online evaluations. The framework is made publicly available so as to facilitate development in related areas. In the future, we will proactively extend this framework to support more real-world applications, such as commodity and advertisement recommendation.
REFERENCES
[1] 2020. Microsoft Recommenders. https://github.com/microsoft/recommenders/
[2] Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu, and Xing Xie. 2019. Neural news recommendation with long- and short-term user representations. In Proceedings of the 57th ACL. 336–345.
[3] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, et al. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. In ICML.
[4] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
[5] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932 (2020).
[6] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th RecSys. 191–198.
[7] Abhinandan S Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th WWW. 271–280.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[9] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909 (2020).
[10] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[11] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969 (2019).
[12] Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd SIGIR. 39–48.
[13] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th WWW. 661–670.
[14] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[15] Wenhao Lu, Jian Jiao, and Ruofei Zhang. 2020. TwinBERT: Distilling knowledge to twin-structured compressed BERT models for large-scale retrieval. In Proceedings of the 29th CIKM. 2645–2652.
[16] Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2018), 824–836.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546 (2013).
[18] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based news recommendation for millions of users. In Proceedings of the 23rd KDD.
[19] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. 1532–1543.
[20] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
[21] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018).
[22] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732 (2020).
[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[24] Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. DKN: Deep knowledge-aware network for news recommendation. In Proceedings of WWW.
[25] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957 (2020).
[26] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. Neural news recommendation with attentive multi-view learning. arXiv preprint arXiv:1907.05576 (2019).
[27] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural news recommendation with personalized attention. In Proceedings of the 25th KDD. 2576–2584.
[28] Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang, and Xing Xie. 2019. Neural news recommendation with multi-head self-attention. In EMNLP-IJCNLP. Hong Kong, China, 6389–6394.
[29] Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, et al. 2020. MIND: A large-scale dataset for news recommendation. In Proceedings of the 58th ACL. 3597–3606.
[30] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In NAACL.

A APPENDIX

A.1 Implementation Details of SpeedyFeed
A.1.1 Bus Language Modeling.
The news article is split into K segments, which are interconnected by the bus technique. For the i-th layer of segment j, the first token (i.e., [CLS]) is chosen to form the bus for information exchange; then the bus is spliced into each segment as the input of the transformer layer:

\text{Bus}^i = \{\mathbf{H}^i_j[0]\}_{j=1}^{K}, \quad (6)

\mathbf{H}^{i+1}_j = \text{Transformer}^i\big(\big[\mathbf{H}^i_j, \text{Bus}^i\big]\big). \quad (7)

In particular, for the self-attention layer of the transformer, the bus is applied to the key and value, while the query still only adopts the original embedding sequence of the segment:

\mathbf{Q}^i_j = \mathbf{H}^i_j \mathbf{W}^i_Q, \quad \mathbf{K}^i_j = \big[\mathbf{H}^i_j, \text{Bus}^i\big] \mathbf{W}^i_K, \quad \mathbf{V}^i_j = \big[\mathbf{H}^i_j, \text{Bus}^i\big] \mathbf{W}^i_V. \quad (8)

In this way, H^{i+1}_j has the same shape as H^i_j. We repeat Eqns. 6 and 7 for each layer.

After all transformer layers, we aggregate the hidden states of the last layer (i.e., H^{-1}_*) into the news embedding with two additional attention layers. Specifically, the first attention layer is proposed to learn more informative representations of the segments. The attention weight α_{j,n} of the n-th token in the j-th segment is computed as:

\alpha_{j,n} = \mathbf{q}^{T} \tanh(\mathbf{W} \mathbf{H}^{-1}_{j,n} + \mathbf{b}), \quad (9)

\alpha_{j,n} = \frac{\exp(\alpha_{j,n})}{\sum_{n=1}^{L} \exp(\alpha_{j,n})}, \quad (10)

where W and b are projection parameters, and q is the query vector. The representation of a segment is the weighted sum of the contextual token representations, formulated as:

\mathbf{v}_j = \sum_{n=1}^{L} \alpha_{j,n} \mathbf{H}^{-1}_{j,n}. \quad (11)

The second attention layer aggregates the segment embeddings v_j. Similarly, denote the attention weight of the j-th segment as α_j, which is calculated by:

\alpha_j = \mathbf{q}^{T} \tanh(\mathbf{W} \mathbf{v}_j + \mathbf{b}), \quad (12)

\alpha_j = \frac{\exp(\alpha_j)}{\sum_{j=1}^{K} \exp(\alpha_j)}, \quad (13)

where q, W, and b are learnable parameters. The final representation of a news article is the sum of the segment representations weighted by their attention weights:

\mathbf{e} = \sum_{j=1}^{K} \alpha_j \mathbf{v}_j. \quad (14)
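A minimal sketch of the two attention layers in Eqs. 9-14; both levels share the same functional form with separate parameters, and the module/parameter names are illustrative.

```python
import torch
import torch.nn as nn

class TwoLevelAttentionPool(nn.Module):
    """Token-level pooling inside each segment (-> v_j, Eqs. 9-11), then
    segment-level pooling into the news embedding e (Eqs. 12-14)."""
    def __init__(self, dim):
        super().__init__()
        self.tok = nn.ModuleDict({"W": nn.Linear(dim, dim),
                                  "q": nn.Linear(dim, 1, bias=False)})
        self.seg = nn.ModuleDict({"W": nn.Linear(dim, dim),
                                  "q": nn.Linear(dim, 1, bias=False)})

    @staticmethod
    def _pool(h, W, q):
        # alpha = softmax(q^T tanh(W h + b)); output is the weighted sum
        a = torch.softmax(q(torch.tanh(W(h))).squeeze(-1), dim=-1)
        return (a.unsqueeze(-1) * h).sum(-2)

    def forward(self, last_hidden):
        # last_hidden: [K, S, D] -- last-layer states of the K segments
        v = self._pool(last_hidden, self.tok["W"], self.tok["q"])   # [K, D]
        return self._pool(v, self.seg["W"], self.seg["q"])          # [D]
```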
We describe the cache mechanism in detail in Algorithm 2. At training step t, the input is a merged news set M produced by the centralized news encoding. Firstly, the lookup rate p_t is generated according to the current step t and the hyper-parameter β. We use a random value to determine whether to read embeddings from the cache. If yes, for the in-cache news, we load the embeddings which were encoded fewer than γ training steps before the current step. These loaded embeddings are output as Θ₁, and the news they represent is added into M_L. The news in M but not in M_L is encoded into the embeddings Θ₂ with BusLM, and Θ₂ is written into the cache. Finally, Θ₁ ∪ Θ₂ is the whole set of news embeddings at step t.
Algorithm 2: Cached News Encoding
Input: the merged news set M in the mini-batch, current training step t
Output: news embeddings Θ for all news in M
begin
    initialize Θ₁ and M_L;
    p_t = 1 − exp(−βt);
    if random < p_t then
        for n in M do
            if n in cache then
                e_n ← Read(cache, n);
                if t − t_{e_n} ≤ γ then
                    Θ₁.add(e_n); M_L.add(n);
    Θ₂ ← BusLM(M \ M_L);
    Write(cache, Θ₂);
    Θ = Θ₁ ∪ Θ₂.

A.2 Details of Dataset

• News. Each news article is composed of its title, abstract, and body. The average text length is 659.64, which is even longer than the default maximum length (512) of ordinary PLMs. It is also challenging to load a sufficient number of training instances into a mini-batch given such long texts. However, knowing that the title, the abstract, and the first paragraph of the body are the most informative parts for the majority of MSN news, we take a text segment from each of them, whose length is no more than 32. Finally, the overall text length is confined within 96 for the trade-off between quality and feasibility.

• User. The users are characterized by their historical interactions with the platform. As a result, each user is associated with one record, containing all of the user's impressions ordered by time: "User-ID, …".
A.3 Training Configurations
All the training jobs are performed on an Azure machine (https://azure.microsoft.com/en-us/services/machine-learning/), with 4× Nvidia V100-32G GPUs and 40× Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz processors, running Ubuntu 16.04.6. The models are implemented with PyTorch 1.7.0.

We optimize the parameters with the Adam optimizer. The learning rate is 8e-6 for the pretrained model and 1e-4 for the other layers (e.g., the user encoder). The negative sampling ratio is 1. The max length of the user click history is 100. The default max length of a news article is 96, and we split the text into three segments according to the structure of title, abstract, and body. The default pretrained model is the complete UniLM. We use 2 buckets and push the training data in a bucket to the mini-batch queue once the bucket is filled with 39,800 tokens. For content refinement, the k of BM25 is 2, and we reserve the words with the top-32 BM25 scores for each segment. For the cache management policy, the hyper-parameter β is 2e-3, and the default expiration step γ is 20.