Self-Attentive Model for Headline Generation
Daniil Gavrilov, Pavel Kalaidin, and Valentin Malykh
VK, Nevsky ave. 28, 191023 Saint Petersburg, Russia
{firstname.lastname}@vk.com
Abstract.
Headline generation is a special type of text summarization task. While the amount of available training data for this task is almost unlimited, it still remains challenging, as learning to generate headlines for news articles implies that the model has strong reasoning about natural language. To overcome this issue, we applied the recent Universal Transformer architecture paired with the byte-pair encoding technique and achieved new state-of-the-art results on the New York Times Annotated corpus with a ROUGE-L F1-score of 24.84 and a ROUGE-2 F1-score of 13.48. We also present the new RIA corpus, on which we reach a ROUGE-L F1-score of 36.81 and a ROUGE-2 F1-score of 22.15.
Keywords: universal transformer · headline generation · BPE · summarization.

Introduction

Headline writing style has broader applications than those used purely within the journalism community. So-called naming is one of the arts of journalism. Just as natural language processing techniques help people with tasks such as incoming message classification (see [5] or [6]), the naming problem could also be solved using modern machine learning and, in particular, deep learning techniques. In the field of machine learning, the naming problem is formulated as headline generation: given a text, the model needs to generate a title for it.

Headline generation can also be seen as a special type of text summarization. The aim of summarization is to produce a shorter version of a text that captures the main idea of the source version. We focus on abstractive summarization, where the summary is generated on the fly, conditioned on the source text, and possibly contains novel words not used in the original.

The downside of traditional summarization is that finding a source of summaries for a large number of texts is rather costly. The advantage of headline generation over the traditional approach is that we have an endless supply of news articles, since they are available in every major language and almost always have a title. The task can also be considered language-independent, since neither markup nor model development requires native speakers.

While the task of learning to generate article headlines may seem easier than generating full summaries, it still requires that the learning algorithm be
able to capture structural dependencies in natural language, and it could therefore serve as an interesting benchmark for testing various approaches.

In this paper, we present a new approach to headline generation based on the Universal Transformer architecture, which explicitly learns non-local representations of the text and which, we argue, is necessary for training a summarization model. We also present the test results of our model on the New York Times Annotated corpus and the RIA corpus.
Related Work

Rush et al. [11] were the first to apply an attention mechanism to abstractive text summarization. In the recent work of Hayashi [4], an encoder-decoder approach was presented in which the first sentence of an article is reformulated into a headline. Our Encoder-Decoder baseline (see the Encoder-Decoder baseline below) follows their setup.

A related approach was presented in [10], where the first-sentence approach was extended with a so-called topic sentence. The topic sentence is chosen as the first sentence containing the most important information from a news article (the so-called 5W1H information, where 5W1H stands for who, what, where, when, why, how). Our Encoder-Decoder baseline could be considered to implement their approach in the OF (trained On First sentence) setup.

Tan et al. [15] present an encoder-decoder approach based on a pregenerated summary of the article, produced by a statistical summarization approach. The authors note that the first sentence alone is not enough for the New York Times corpus, but they use only the summary rather than the whole text, thus relying on external summarization tools.
Model

Consider a dataset $D = \{(\mathrm{title}_i, \mathrm{fulltext}_i)\}_{i=1}^{N}$ of news articles and their titles. An approach to learning summarization is to define the conditional probability $P(y_t \mid \{y_1, \dots, y_{t-1}\}, X, \theta)$ of a token $y_t \in V$ at time step $t$, with respect to the article text $X = \{x_1, \dots, x_N\}$ ($x_i \in V$ as well) and the previous tokens of the title $\{y_1, \dots, y_{t-1}\}$, parameterized by a neural network with parameters $\theta$. The model parameters are then found by maximum likelihood estimation:

$$\theta_{\mathrm{MLE}} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} P(Y_i \mid X_i, \theta).$$

We can then apply two methods for finding the most probable sentence under the trained model: greedy decoding, which picks the most probable token at each time step, and beam search, which keeps the top-k most probable partial sequences at each step. The latter method yields better results, though it is more computationally expensive.

Sutskever et al. [14] proposed a model that defines $P(y_t \mid \{y_1, \dots, y_{t-1}\}, X, \theta)$ by propagating the input sequence $X$ through a recurrent neural network (RNN). The last hidden state of the RNN is used as a context vector $c$, which is then passed to a second RNN together with $y_1, \dots, y_{t-1}$ to obtain a distribution over $y_t$.

RNNs have a commonly known flaw: they rapidly forget earlier timesteps (see, e.g., [2]). To mitigate this issue, attention [1] was introduced into the encoder-decoder architecture. The attention mechanism lets the model obtain a new context vector at every decoding iteration from different parts of the encoded sequence. This helps capture all the relevant information from the input sequence, removing the bottleneck of the fixed-size hidden vector of the decoder's RNN.
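As a rough illustration of the two decoding strategies (a sketch, not the authors' implementation; `next_token` is a hypothetical stand-in for the trained model, returning a next-token log-probability distribution):

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given the source text X and a partial title,
# return (token, log-probability) pairs for the next title token.
NextToken = Callable[[List[str], List[str]], List[Tuple[str, float]]]

def greedy_decode(next_token: NextToken, source: List[str], max_len: int = 15) -> List[str]:
    """Pick the single most probable token at every time step."""
    title: List[str] = []
    for _ in range(max_len):
        token, _ = max(next_token(source, title), key=lambda p: p[1])
        if token == "</s>":  # assumed end-of-title marker
            break
        title.append(token)
    return title

def beam_search(next_token: NextToken, source: List[str], k: int = 4, max_len: int = 15) -> List[str]:
    """Keep the k highest-scoring partial titles at every time step."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "</s>":
                candidates.append((prefix, score))  # keep finished hypotheses
                continue
            for token, logp in next_token(source, prefix):
                candidates.append((prefix + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return [t for t in beams[0][0] if t != "</s>"]
```

Greedy decoding runs the model once per output token; beam search multiplies that cost by roughly the beam width k in exchange for exploring more hypotheses.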
While RNNs can easily be used to define the encoder-decoder model, training the recurrent model is computationally expensive. The other drawback is that they use only local information while computing the sequence of hidden states $H = \{h_1, \dots, h_N\}$: any two hidden states $h_i$ and $h_j$ are connected by $|j - i|$ RNN computations, which makes it hard to capture all the dependencies between them due to limited capacity. To train a rich model that learns complex text structure, we have to define a model that relies on non-local dependencies in the data.

In this work, we adopt the Universal Transformer architecture [3], a modified version of the Transformer [16]. This approach has several benefits over RNNs. First of all, it can be trained in parallel. Furthermore, every input vector is connected to every other via the attention mechanism, so the Transformer learns non-local dependencies between tokens regardless of the distance between them and is thus able to learn a more complex representation of the article text, which proves necessary to solve the summarization task effectively. Also, unlike [4,15], our model is trained end-to-end on the full text and title of each news article.
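A minimal sketch of the scaled dot-product self-attention at the core of the (Universal) Transformer, in NumPy; the score matrix connects every position to every other in a single step, which is what gives the model its non-local view of the text:

```python
import numpy as np

def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over a token sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (n, n): all token pairs, regardless of distance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # each output mixes information from all positions

# Toy usage: 5 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # shape (5, 8)
```

The Universal Transformer applies such a block recurrently, sharing weights across refinement steps, rather than stacking distinct layers as the original Transformer does.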
We also adopt byte-pair encoding (BPE), introduced for the machine translation task by Sennrich et al. [13]. BPE is a data compression technique in which frequently encountered pairs of bytes are replaced by additional extra-alphabet symbols. Applied to text, as in machine translation, the most frequent words are kept in the vocabulary, while less frequent words are replaced by a sequence of (typically two) tokens. For morphologically rich languages, for example, word endings can be detached, since each word form is necessarily less frequent than its stem. BPE allows us to represent all words, including ones unseen during training, with a fixed vocabulary.
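A minimal sketch of BPE merge learning in the spirit of [13] (simplified: real implementations operate on a word-frequency dictionary and learn tens of thousands of merges):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent symbol pair."""
    corpus = [list(w) + ["</w>"] for w in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = "".join(best)
        for symbols in corpus:  # apply the new rule everywhere
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# Frequent fragments such as "ay" become single tokens, so an unseen word
# still decomposes into known subword units instead of an <UNK> token.
print(learn_bpe(["playing", "praying", "saying", "played"], 5))
```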
Datasets

In our experiments, we consider two corpora: one in Russian and one in English. It is important to mention that we have done no preprocessing other than lowercasing, unlike other approaches [4,10]. We apply
BPE encoding, which allows us to avoid using the <UNK> token for out-of-vocabulary words. For our experiments, we withheld 20,000 random articles to form the test set. We repeated each experiment 5 times with different random seeds and report mean values.
English Dataset
We use the New York Times Annotated Corpus (NYT) as presented by the Linguistic Data Consortium in [12]. This dataset contains 1.8 million news articles from the New York Times, written between 1987 and 2006. For our experiments, we filtered out news articles with titles shorter than 3 words or longer than 15 words. We also filtered out articles with body text shorter than 20 words or longer than 2000 words. In addition, we skipped the obituaries in the dataset. After filtering, 1,444,919 news articles were available to us, with a mean title length of 7.9 words and a mean text length of 707.6 words.
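These filtering criteria can be summarized as a simple predicate (a sketch; the obituary flag is schematic, since the paper does not state how obituaries were identified):

```python
def keep_article(title: str, body: str, is_obituary: bool) -> bool:
    """NYT filtering: keep titles of 3-15 words and bodies of 20-2000 words, skip obituaries."""
    title_len = len(title.split())
    body_len = len(body.split())
    return 3 <= title_len <= 15 and 20 <= body_len <= 2000 and not is_obituary
```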
Russian Dataset
The Russian news agency “Rossiya Segodnya” provided us with a dataset (RIA) for research purposes; the dataset is available at https://vk.cc/8W0l5P. It contains news documents from January 2010 to December 2014. In total, there are 1,003,869 news articles in the provided corpus, with a mean title length of 9.5 words and a mean text length of 315.6 words.

First Sentence

This model takes the first sentence of an article and uses it as the hypothesis for the article's headline. This is a strong baseline for generating headlines from news articles.
Encoder-Decoder
Following [10], we use the encoder-decoder architecture on the first sentence of an article. The model itself is the sequence-to-sequence RNN model of Sutskever et al. [14] described above. For this approach, we use the same preprocessing as for our model, including byte-pair encoding.
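For reference, this Sutskever-style baseline can be sketched as follows (a minimal PyTorch version with placeholder dimensions, not the exact configuration used in the experiments):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder's last hidden state conditions the decoder."""
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        _, context = self.encoder(self.embed(src))   # fixed-size context vector
        states, _ = self.decoder(self.embed(tgt), context)
        return self.out(states)                      # logits over the next title token

model = Seq2Seq(vocab_size=30000)
src = torch.randint(0, 30000, (2, 50))  # a batch of 2 first sentences, 50 BPE tokens each
tgt = torch.randint(0, 30000, (2, 10))  # shifted title tokens for teacher forcing
logits = model(src, tgt)                # shape (2, 10, 30000)
```

The single context vector here is exactly the bottleneck discussed above, which attention-based models remove.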
Training Details

For both datasets, NYT and RIA, we used the same set of hyperparameters for the models, namely 4 layers in the encoder and decoder, with 8 attention heads. In addition, we applied dropout before layer normalization [8]. The models were trained with the Adam optimizer using a scaled learning rate with warmup, as proposed by the authors of the original Transformer, with the same schedule in both cases. Both models were trained until convergence.

We trained the BPE tokenizer separately on each dataset. In addition, we limited document length to 3000 BPE tokens for the RIA dataset and 2000 BPE tokens for the NYT dataset; any exceeding tokens were omitted. word2vec [9] embeddings were trained on each dataset. For headline generation, we used beam search.
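The scaled learning rate mentioned above is the schedule from the original Transformer [16]: linear warmup followed by inverse-square-root decay. A sketch (the d_model and warmup values here are the defaults from [16], not numbers reported in this paper):

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Noam schedule: rise linearly for `warmup` steps, then decay as 1/sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for s in (100, 4000, 40000):
    print(s, round(transformer_lr(s), 6))  # rises, peaks at warmup, then decays
```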
Model                                       R-1-f  R-1-r  R-2-f  R-2-r  R-L-f  R-L-r
New York Times
First Sentence                              11.64
Encoder-Decoder                             23.02  21.90  11.84  11.44  21.23  21.31
summ-hieratt [15]                             -    29.60    -     8.17    -    26.05
Universal Transformer w/ smoothing (ours)   25.60  23.90  12.92  12.42  23.66  25.27
Universal Transformer (ours)                              13.48         24.84
Rossiya Segodnya
Encoder-Decoder                             39.10  38.31  22.13
Universal Transformer (ours)                              22.15         36.81

Table 1. ROUGE-1, ROUGE-2, and ROUGE-L F1 and recall scores on the NYT and RIA corpora. We provide the results of Tan et al. [15] achieved on the NYT corpus; unfortunately, the authors have not published all of their filtering criteria and the seed for random sampling on this corpus, so we could not follow their setup completely, and their results are presented here for reference.

Results

In Tab. 1 we present results on two corpora: the New York Times Annotated (NYT) corpus for English and the Rossiya Segodnya (RIA) corpus for Russian. For the NYT corpus, we reach a new state of the art in ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. For the RIA corpus, for which there is no prior art, we present results for the baselines and our model. For our model we also experimented with label smoothing, following [7].

In our experiments, we noticed that some of the generated headlines are scored low by ROUGE despite seeming reasonable, e.g. the top sample in Tab. 3. This led us to a new series of experiments: a human evaluation of the obtained results for both the NYT and RIA corpora. The results are presented in Tab. 2.

Dataset                    Human  Tie   Machine
New York Times Annotated   57.4   27.4  15.2
Rossiya Segodnya           54.4   30.6  15.0
Table 2. Human evaluation results for the NYT and RIA datasets: annotator preference, %.

Five annotators marked up 100 randomly sampled articles from the train set of each corpus. Each number shows the percentage of annotator preference among three options: the original headline (Human), the generated headline (Machine), or no preference (Tie). For both corpora, our model does not reach human parity yet, with 42.6% and 45.6% (Machine + Tie) user preference for the NYT and RIA datasets respectively, but this result is already close to human parity and leaves room for improvement.
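For context on the scoring itself: ROUGE-L compares a candidate against a reference via their longest common subsequence, which is why a reasonable paraphrase with little verbatim overlap can score low. A minimal sketch of the F1 variant, assuming whitespace tokenization:

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: precision/recall of the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]  # LCS dynamic program
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("southwest airlines to add 16 flights from chicago",
                 "southwest is adding flights to protect its chicago hub"))
```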
Original text, truncated: Unethical and irresponsible as the assertion that antidepressant medication, an excellent treatment for some forms of depression, will turn a man into a fish. It does a disservice to psychoanalysis, which offers rich and valuable insights into the human mind. ... Homosexuality is not an illness by any of the usual criteria in medicine, such as an increased risk of morbidity or mortality, painful symptoms or social, interpersonal or occupational dysfunction as a result of homosexuality itself...
Original headline : homosexuality, not an illness, can’t be cured
Generated headline : why we can’t let gay therapy begin
Original text, truncated: southwest airlines said yesterday that it would add 16 flights a day from chicago midway airport, moving to protect a valuable hub amid the fight breaking out over the assets of ata airlines, the airport’s biggest carrier. southwest said that beginning in january, it would add the flights to 13 cities that it already served from midway...
Original headline : southwest is adding flights to protect its chicago hub
Generated headline : southwest airlines to add 16 flights from chicago
Original text, truncated: москва, 1 апр - риа новости. количество сделок продажи элитных квартир в москве выросло в первом квартале этого года, по сравнению с аналогичным периодом предыдущего, в два раза, говорится в отчете компании intermarksavills. при этом, также сообщается в нем, количество заключенных в столице первичных сделок в сегменте бизнес-класса в первом квартале 2010 года оказалось на 20% выше, чем в первом квартале прошлого года... [English: moscow, april 1 - ria novosti. the number of luxury apartment sales in moscow doubled in the first quarter of this year compared to the same period a year earlier, according to a report by intermarksavills. the report also says that the number of primary business-class deals concluded in the capital in the first quarter of 2010 was 20% higher than in the first quarter of the previous year...]
Original headline: продажи элитного жилья в москве увеличились в 1 квартале в два раза [English: luxury housing sales in moscow doubled in the first quarter]
Generated headline: продажи элитных квартир в москве в 1 квартале выросли вдвое [English: luxury apartment sales in moscow doubled in the first quarter]
Table 3. Samples of headlines generated by our model.
Conclusion

In this paper, we explore the application of the Universal Transformer architecture to the task of abstractive headline generation and outperform the abstractive state-of-the-art result on the New York Times Annotated corpus. We also present the newly released Rossiya Segodnya corpus and the results achieved by our model on it.
Acknowledgments.
The authors are thankful to Alexey Samarin for useful discussions, David Prince for proofreading, Madina Kabirova for proofreading and organizing the human evaluation, Anastasia Semenyuk and Maria Zaharova for help in obtaining the New York Times Annotated corpus, and Alexey Filippovskii for providing the Rossiya Segodnya corpus.
References