DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization

Jiaxin Shi,* Chen Liang,* Lei Hou,† Juanzi Li, Zhiyuan Liu, Hanwang Zhang
Tsinghua University; Nanyang Technological University
{shijx12, lliangchenc}@gmail.com, {houlei, lijuanzi, liuzy}@tsinghua.edu.cn, [email protected]
* Equal contribution. † Corresponding author.

Abstract
We propose DeepChannel, a robust, data-efficient, and interpretable neural model for extractive document summarization. Given any document-summary pair, we estimate a salience score, which is modeled using an attention-based deep neural network, to represent the salience degree of the summary for yielding the document. We devise a contrastive training strategy to learn the salience estimation network, and then use the learned salience score as a guide to iteratively extract the most salient sentences from the document as our generated summary. In experiments, our model not only achieves state-of-the-art ROUGE scores on the CNN/Daily Mail dataset, but also shows strong robustness in an out-of-domain test on the DUC 2007 test set. Moreover, our model reaches a ROUGE-1 F-1 score of 39.41 on the CNN/Daily Mail test set with merely 1/100 of the training set, demonstrating tremendous data efficiency.

Introduction
Automatic document summarization is a challenging task in natural language understanding, aiming to compress a textual document into a shorter highlight that contains the most representative information of the original text. Existing summarization approaches are mainly classified into two categories: extractive methods and abstractive methods. Extractive summarization methods, on which this paper focuses, aim to select salient snippets, sentences or passages directly from the input document, while abstractive summarization generates summaries that may contain words or phrases not present in the input.

Recently, as end-to-end deep learning has made great progress in many NLP fields, such as machine translation (Luong, Pham, and Manning 2015) and question answering (Iyyer et al. 2014), many researchers have proposed neural models to address the document summarization problem. For example, SummaRuNNer (Nallapati, Zhai, and Zhou 2017) uses a Recurrent Neural Network (RNN) based sequence model for extractive summarization, Refresh (Narayan, Cohen, and Lapata 2018) assigns each document sentence a score to indicate its probability of being extracted, and many abstractive models (See, Liu, and Manning 2017; Jadhav and Rajan 2018) are developed based on the encoder-decoder framework that encodes a document and decodes its summary. These existing neural summarizers mostly aim to build an end-to-end mapping from the input document to its summary. The learning of such an end-to-end neural network 1) always requires a huge amount of training corpus, 2) easily suffers from the overfitting problem (Srivastava et al. 2014; Erhan et al. 2010), and 3) usually lacks interpretability.

To alleviate these problems, we propose a neural extractive summarizer named DeepChannel, which estimates salience to guide the extraction procedure instead of learning an end-to-end mapping. DeepChannel is inspired by the noisy-channel model (Knight and Marcu 2002; Daumé III and Marcu 2002), a probabilistic approach for sentence-level and document-level compression. Given an input document D, the noisy-channel model aims to find an optimal summary S that maximizes P(S|D). It 1) splits P(S|D) using Bayes' rule, so that S* = argmax_S P(S|D) = argmax_S P(S) P(D|S), 2) independently estimates a language model probability P(S) and a channel model probability P(D|S), 3) defines expanding rules, and 4) learns the parameters in a traditional statistical manner. Such a statistical approach depends on manual rules, lacks generality, suffers from data sparsity, and fails to capture semantics (Mnih and Hinton 2009), which is the key to document understanding. To this end, we design a neural channel model to draw support from the great representation power of deep learning.

D: Rutgers University has banned fraternity and sorority house parties at its main campus in New Brunswick, New Jersey, for the rest of the spring semester after several alcohol-related problems this school year, including the death of a student.
S1: Rutgers University has banned fraternity and sorority house parties because of an alcohol-related accident that led to the death of a student.
S2: The main campus of Rutgers University is located in New Brunswick, New Jersey.

Table 1: Examples of different degrees of salience. We consider P(D|S1) > P(D|S2) because S1 contains more important information compared with S2 and thus is more salient for yielding D.
Given any document-summary pair (D, S), we learn a channel probability (i.e., salience score) P(D|S), representing how likely D is produced if we start with a short summary S and add "noise" to it, yielding a longer document. It can be considered a measure of how much salient information of D is contained in S. Table 1 gives an example where S1 is more salient than S2 for yielding D. We design an attention-based neural network to model the channel probability, and train it with a contrastive training strategy. That is, we first use a heuristic way to randomly produce contrastive samples, including two candidate summaries S+ and S- for an input D, where the former is more salient, and then maximize the margin between P(D|S+) and P(D|S-). This training strategy implicitly increases the number of training instances and incorporates randomness into the training procedure, and thus helps our model perform well even on a small training set. With a well-learned P(D|S), we produce the optimal summary S* = argmax_S P(D|S) by greedily extracting the most salient sentences, which have a maximum probability of expanding to the whole document. (The language model P(S) is not taken into consideration in our current model; we leave it for future work.) Compared with the statistical noisy-channel model, our neural model can 1) make use of the semantics involved in distributed representations, 2) alleviate the training sparseness, and 3) avoid high-cost expert-designed rules.

Our model consists of two parts, salience estimation and salience-guided extraction. Only the first part is parametric and requires an annotated corpus for training. Different from most state-of-the-art approaches, which usually learn a direct mapping from a document to its annotated summary, our salience estimation learns a mapping from any document-summary pair to a salience score. This brings two significant benefits: 1) Our model is more robust to domain variations. DeepChannel performs much better than other end-to-end baselines when testing on DUC 2007 while training on CNN/Daily Mail (https://github.com/deepmind/rc-data). 2) Our model is much more data-efficient and alleviates the overfitting problem to a great degree. DeepChannel performs well even when we reduce the size of the CNN/Daily Mail training set to 1/100.

We also conduct quantitative and qualitative experiments on the standard CNN/Daily Mail benchmark, demonstrating that our model not only performs on par with state-of-the-art summarization systems, but also shows high interpretability due to the well-designed attention mechanism.

To sum up, our contributions are as follows:
• we propose DeepChannel, an extractive summarization approach consisting of a deep neural network for salience estimation and a salience-guided greedy extraction strategy;
• we demonstrate that our model outperforms or matches state-of-the-art summarizers, is robust to domain variations, performs well on small training sets, and is highly interpretable.

Related Work
Traditional summarization methods usually depend on manual rules and expert knowledge, such as the expanding rules of noisy-channel models (Daumé III and Marcu 2002; Knight and Marcu 2002), the objectives and constraints of Integer Linear Programming (ILP) models (Woodsend and Lapata 2012; Parveen, Ramsl, and Strube 2015; Bing et al. 2015), the human-engineered features of some sequence classification methods (Shen et al. 2007), and so on.

Deep learning models can learn continuous features automatically and have made substantial progress in multiple NLP areas. Many deep learning-based summarization models have been proposed recently for both extractive and abstractive summarization tasks.
Extractive. (Nallapati, Zhai, and Zhou 2017) considers extraction as a sequence classification task and proposes SummaRuNNer, a simple RNN-based model that decides whether or not to include a sentence in the summary. (Wu and Hu 2018) takes the coherence of summaries into account and designs a reinforcement learning (RL) method to maximize a combined ROUGE (Lin 2004) and coherence reward. (Narayan, Cohen, and Lapata 2018) conceptualizes extractive summarization as a sentence ranking task and optimizes the ROUGE evaluation metric through an RL objective. (Jadhav and Rajan 2018) models the interaction of keywords and salient sentences using a two-level pointer network and combines them to generate the extractive summary.
Abstractive. A vast majority of abstractive summarizers are built on the encoder-decoder structure. (See, Liu, and Manning 2017) incorporates a pointing mechanism into the encoder-decoder, such that their model can directly copy words from the source text while decoding summaries. (Paulus, Xiong, and Socher 2017) combines the standard cross-entropy loss with RL objectives to maximize the ROUGE metric at the same time as sequence prediction training. (Chen and Bansal 2018) proposes a fast summarization model that first selects salient sentences and then rewrites them abstractively to generate a concise overall summary. Their hybrid approach jointly learns an extractor and a rewriter, capable of both extractive and abstractive summarization. (Hsu et al. 2018) also combines extraction and abstraction, but implements it by unifying a sentence-level attention and a word-level attention and guiding these two parts with an inconsistency loss.

Most of these deep summarization models aim to learn a direct mapping from the document to the summary. Instead, our DeepChannel aims to learn a channel probability that measures the salience of any document-summary pair. (Peyrard and Eckle-Kohler 2017) learns to estimate automatic Pyramid scores and extracts summaries by solving an ILP problem, but their model depends on many manual features and their ILP-based extraction is entirely different from ours.
DeepChannel
We represent a document-summary pair as (D, S), where S is either an annotated or a generated summary for document D. D consists of |D| sentences [d_1, d_2, ..., d_{|D|}], and S consists of |S| sentences [s_1, s_2, ..., s_{|S|}].

Salience Estimation
To estimate P(D|S), we consider that the document D is generated based on the given S. For simplicity, we assume that the sentences of the document are conditionally independent. Then we have P(D|S) = ∏_{i=1}^{|D|} P(d_i|S), where P(d_i|S) denotes the chance that d_i is produced from S. Another assumption is that different summary sentences make different amounts of contribution to the generation of d_i. When calculating P(d_i|S), we should concentrate more on those summary sentences that have higher semantic relevance to d_i. We use an attention mechanism to model this.

As our target is the probability value rather than decoding the texts, we compute the probability at the sentence level instead of further deriving the equation into a word-level sequence generation process (i.e., the encoder-decoder). Some sentence embedding models (Logeswaran and Lee 2018) use a similar simplification strategy, which makes the learning much more efficient.

Specifically, we encode each sentence of (D, S) via a Gated Recurrent Unit (GRU) (Chung et al. 2014), one of the most renowned variants of RNNs, to obtain the sentence-level semantic vectors:

    d_i = GRU([dw_{i,1}, ..., dw_{i,|d_i|}]; θ_1),  i = 1, ..., |D|
    s_j = GRU([sw_{j,1}, ..., sw_{j,|s_j|}]; θ_1),  j = 1, ..., |S|.    (1)

Sentences of the document and the summary share the same encoder, whose parameters are denoted as θ_1.

To compute P(d_i|S), we design an attention mechanism (see Figure 1) that assigns a weight A_{i,j} to each summary sentence s_j, which will be large if the semantics of s_j are similar to d_i. We then calculate the weighted sum of summary sentence vectors, denoted by s̄_i, concatenate it with d_i, and feed them into a multi-layer perceptron (MLP). Besides, for further information interaction, we take the element-wise product of these two vectors, d_i ⊙ s̄_i, as another input of the MLP. Formally, we have

    A_i = Softmax(d_i^T [s_1; s_2; ...; s_{|S|}]),
    s̄_i = Σ_{j=1}^{|S|} A_{i,j} s_j,
    P(d_i|S) = Sigmoid(MLP([d_i; s̄_i; d_i ⊙ s̄_i]; θ_2)),    (2)

where θ_2 denotes the parameters of the MLP. Letting θ include both θ_1 and θ_2, we can reformulate our channel probability as

    P(D|S; θ) = ∏_{i=1}^{|D|} Sigmoid(MLP([d_i; s̄_i; d_i ⊙ s̄_i])).    (3)
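To make Equations (1)-(3) concrete, the following is a minimal PyTorch sketch of the salience network. The class and variable names (SalienceScorer, hidden_dim, and so on) are illustrative assumptions rather than the authors' released implementation; dimensions loosely follow the Implementation Details section, and padding handling and dropout are omitted for brevity.

# Minimal sketch of the salience network in Eqs. (1)-(3); names are illustrative.
import torch
import torch.nn as nn

class SalienceScorer(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Shared sentence encoder (theta_1): one GRU for document and summary sentences.
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # MLP (theta_2) applied to [d_i; s_bar_i; d_i * s_bar_i].
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def encode(self, sents):
        # sents: LongTensor (num_sents, max_words) -> sentence vectors (num_sents, hidden_dim)
        _, h = self.encoder(self.embed(sents))
        return h.squeeze(0)

    def forward(self, doc_sents, sum_sents):
        d = self.encode(doc_sents)            # (|D|, H)
        s = self.encode(sum_sents)            # (|S|, H)
        A = torch.softmax(d @ s.t(), dim=1)   # attention of Eq. (2), shape (|D|, |S|)
        s_bar = A @ s                         # weighted sum of summary vectors
        feats = torch.cat([d, s_bar, d * s_bar], dim=1)
        p = torch.sigmoid(self.mlp(feats)).squeeze(1)   # P(d_i | S) for each i
        log_p_doc = torch.log(p + 1e-12).sum()          # log P(D | S), Eq. (3) in log-space
        return log_p_doc, A

Working in log-space avoids numerical underflow when |D| is large; since the logarithm is monotone, the greedy comparisons in Algorithm 1 below are unaffected.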
Contrastive Learning

We expect that P(D|S) should be large if S contains salient information for constructing D, and small otherwise. To achieve this goal, we devise a contrastive training strategy. That is, given a document D, we construct a pair of contrastive candidate summaries S+ and S-, one positive and one negative, such that S+ is more salient for summarizing D than S-. We then train our channel model by maximizing the margin between P(D|S+) and P(D|S-).
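The exact form of the contrastive objective (the paper's Equation 4) is not recoverable from this copy of the text, so the following is only a hedged sketch of one standard way to maximize such a margin, a hinge loss on log-probabilities; the margin value is an assumed hyperparameter.

import torch

def contrastive_loss(log_p_pos, log_p_neg, margin=1.0):
    """Hinge-style margin loss encouraging log P(D|S+) > log P(D|S-) + margin.

    log_p_pos / log_p_neg: scalar tensors returned by the scorer for the
    positive and negative candidate summaries of the same document.
    """
    return torch.clamp(margin - (log_p_pos - log_p_neg), min=0.0)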
[Figure: an example CNN/Daily Mail document about Raul Castro taking over Cuba's presidency, shown together with candidate summaries and their ROUGE-1 F1 scores.]
Let A denote the attention matrix of (D, S). We consider that a reasonable attention should satisfy the following two conditions: 1) A_i is sharp, that is, the i-th document sentence should focus on its most relevant summary sentences. 2) All summary sentences are important, and each summary sentence should receive attention from some document sentences. Inspired by (Lin et al. 2017), we introduce a penalization term to achieve both goals:

    L_penal(θ) = || A^T A - (|D|/|S|) I ||_F.    (5)

Here ||·||_F stands for the Frobenius norm of a matrix, the shape of A is |D| × |S|, and the shape of I is |S| × |S|. This penalization term is minimized together with the contrastive loss.

Because of the softmax, we have Σ_{i=1}^{|S|} A_{k,i} = 1 for any valid k. We denote the element of the A^T A matrix as a_{i,j}, which equals the inner product of A_{:,i} and A_{:,j}. As all elements of A are non-negative, we can conclude that 1) a_{i,j} ≥ 0, and 2) a_{i,j} = 0 iff A_{k,i} = 0 or A_{k,j} = 0 for every k. In other words, if A_k is not sharp and attends to s_i and s_j at the same time, then a_{i,j} will be greater than 0. By forcing the non-diagonal a_{i,j} (i ≠ j) to approximate 0, we encourage each d_k to focus on summary sentences as sharply as possible. On the other hand, we force the diagonal of A^T A to approximate |D|/|S|, meaning that each summary sentence should receive nearly average attention, avoiding the case where a certain s_j is not attended to at all. To understand this intuitively, consider the case where each row of A is a one-hot vector, meaning that each document sentence attends to only one summary sentence. Then a_{i,i} is exactly equal to the amount of attention received by s_i, and Σ_{i=1}^{|S|} a_{i,i} = |D|. The diagonal part of our penalization term amounts to encouraging an even division of this attention. This simple averaging assumption is not exact, but it is efficient to compute and is demonstrated to be effective.

Our final loss function is:

    L(θ) = L_con + α L_penal,    (6)

where α is a hyperparameter and L_penal is computed using (D, S+).
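As a concrete illustration, a minimal PyTorch computation of the penalization term in Equation (5) might look like the following; the function name is ours, and the unsquared Frobenius norm follows the equation as written here.

import torch

def attention_penalty(A):
    """Penalization term of Eq. (5).

    A: attention matrix of shape (|D|, |S|), rows sum to 1 (softmax output).
    Off-diagonal entries of A^T A are pushed towards 0 (sharp attention);
    diagonal entries are pushed towards |D|/|S| (even coverage of summary sentences).
    """
    num_doc, num_sum = A.shape
    target = (num_doc / num_sum) * torch.eye(num_sum, device=A.device)
    return torch.norm(A.t() @ A - target, p="fro")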
Greedy Extraction

For testing, we devise a greedy extraction strategy in terms of our well-trained channel model P(D|S), described in Algorithm 1. We iteratively extract one sentence from the document and add it to S*, such that P(D|S*) is greedily maximized, until the upper bound l on the length of the summary is reached. Such a simple greedy extraction algorithm is computationally efficient. Furthermore, it automatically avoids redundancy between extracted sentences, because the salience score of S* will not increase if we add a redundant sentence to S*. Benefiting from the strength of the channel model, what we extract at each step tends to be unique and valuable. We further demonstrate this in our experiments.

Algorithm 1: Greedy Extraction Algorithm
Input: document D = {d_1, d_2, ..., d_{|D|}}, a well-pretrained channel model P(D|S), expected summary length l
Output: optimal summary S*
  S* ← {}
  while |S*| < l do
    d, p ← nil, 0
    for d_i ∈ D − S* do
      p_i ← P(D | S* ∪ {d_i}) according to Formula (3)
      if p_i > p then
        d, p ← d_i, p_i
      end if
    end for
    S* ← S* ∪ {d}
  end while
  Re-sort S* based on the sentence order in D
  return S*

In Algorithm 1 we exclude S* from D at each step because we observed some "magic sentences" in experiments. That is, after a document sentence d_magic is extracted into S*, appending any other d_i, i ≠ magic, to S* leads to a decrease of P(D|S*), and thus d_magic would be repeatedly selected, as it can hold that probability. We guess this is because d_magic is much more salient than other d_i, and appending another d_i to S* would "distract" the channel attention. Using this greedy extraction strategy, we can produce an extracted summary containing l sentences for any given input document.
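A direct Python rendering of Algorithm 1 could look as follows; score_fn stands for any callable returning log P(D|S) (for instance, a wrapper around the scorer sketched earlier), and the helper name is ours.

def greedy_extract(doc_sents, score_fn, max_sents=3):
    """Greedy extraction of Algorithm 1.

    doc_sents: list of document sentences (in original order).
    score_fn:  callable(doc_sents, summary_sents) -> log P(D | S).
    Returns the selected sentences, re-sorted by their position in the document.
    """
    selected = []   # indices of sentences chosen so far
    while len(selected) < max_sents:
        best_i, best_score = None, float("-inf")
        for i, sent in enumerate(doc_sents):
            if i in selected:        # exclude already-extracted sentences
                continue
            candidate = [doc_sents[j] for j in selected] + [sent]
            score = score_fn(doc_sents, candidate)
            if score > best_score:
                best_i, best_score = i, score
        if best_i is None:           # fewer than max_sents sentences in the document
            break
        selected.append(best_i)
    return [doc_sents[i] for i in sorted(selected)]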
Experiments

Datasets
We evaluate our model on two datasets: CNN/Daily Mail (Hermann et al. 2015; Nallapati et al. 2016; See, Liu, and Manning 2017; Hsu et al. 2018) and DUC 2007. The CNN/Daily Mail dataset contains news stories from the CNN and Daily Mail websites and the corresponding human-written highlights as summaries. This dataset has two versions: anonymized, which replaces named entities with special tokens, and non-anonymized, which preserves the raw texts. We follow (Hsu et al. 2018) and obtain the non-anonymized version of this dataset, which has 287,113 training pairs, 13,368 validation pairs, and 11,490 test pairs.

DUC 2007 is a multi-document dataset containing 45 topics, where each topic corresponds to 25 relevant documents and 4 summary annotations. We concatenate the documents of the same topic to obtain a single-document test set of size 45. After training on CNN/Daily Mail, we use the DUC 2007 dataset as an additional out-of-domain test set to compare the robustness of different models.
Implementation Details
For preprocessing, we lowercase all document and summary sentences, replace numbers with a placeholder "(zero)", and remove sentences containing fewer than 4 words. We set the vocabulary size to 50k and replace low-frequency words with a special token "(unk)".

For the model, we set the dimension of the word embedding to 300 and the GRU hidden dimension to 1024. We use a 3-layered MLP to calculate P(d_i|S) in Formula (2), which consists of 3 linear layers, 2 ReLU layers, and an output sigmoid layer. We use dropout (Srivastava et al. 2014) with probability 0.3 after the word embedding layer and before the first layer of the MLP.

For the training and hyperparameters, we initialize our word embeddings with GloVe (Pennington, Socher, and Manning 2014) pretrained vectors and then finetune them in our task. We use the Adam (Kingma and Ba 2014) optimizer with a fixed learning rate of 1e-5 to train our model. We set the weight of the penalization term α based on the comparison reported in Table 6. When extracting sentences, we fix the number of target sentences (i.e., l in Algorithm 1) to 3. The implementation is made publicly available at https://github.com/lliangchenc/DeepChannel.
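A minimal sketch of the preprocessing described above follows; token-level details such as the exact tokenizer and number pattern are assumptions.

import re

def preprocess_sentences(sentences, min_words=4):
    """Lowercase, replace numbers with a placeholder, and drop very short sentences."""
    cleaned = []
    for sent in sentences:
        sent = sent.lower()
        sent = re.sub(r"\d+(\.\d+)?", "<zero>", sent)   # numbers -> placeholder
        tokens = sent.split()
        if len(tokens) >= min_words:
            cleaned.append(tokens)
    return cleaned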
Evaluation

For the CNN/Daily Mail experiments, we use the full-length Rouge F1 metric (Lin 2004). For DUC 2007, we use limited-length Rouge recall at 75 bytes and 275 bytes. We report Rouge-1, Rouge-2, and Rouge-L scores, which are computed using the matches of unigrams, bigrams, and longest common subsequences, respectively, against the ground-truth summaries.
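For illustration, Rouge-1/2/L F1 can be computed with the rouge-score Python package as below. This is a convenient stand-in only; the paper's numbers come from the standard ROUGE evaluation (full-length F1 for CNN/Daily Mail, limited-length recall for DUC 2007), which this snippet does not replicate exactly.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# Hypothetical reference/prediction strings, just to show the call.
reference = "rutgers university has banned fraternity and sorority house parties ."
prediction = "rutgers has banned fraternity house parties for the spring semester ."
scores = scorer.score(reference, prediction)
for name, s in scores.items():
    print(name, round(s.fmeasure, 4))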
Baselines
Our extractive baselines include: lead-3 (See, Liu, and Manning 2017), SummaRuNNer (Nallapati, Zhai, and Zhou 2017), Refresh (Narayan, Cohen, and Lapata 2018), SWAP-NET (Jadhav and Rajan 2018), and rnn-ext+RL (Chen and Bansal 2018). We also compare our performance with state-of-the-art abstractive baselines, including PointerGenerator (See, Liu, and Manning 2017), ML+RL+intra-attention (Paulus, Xiong, and Socher 2017), controlled (Fan, Grangier, and Auli 2017), and inconsistency loss (Hsu et al. 2018).

For further analyses such as the out-of-domain test, we select the 3 most representative approaches, SummaRuNNer, Refresh, and PointerGenerator, as the baselines. SummaRuNNer predicts a binary label for each document sentence, indicating whether it is extracted. Refresh learns to rank sentences using reinforcement learning and then directly extracts the top-k. PointerGenerator, which is built on the sequence-to-sequence (seq2seq) framework, is one of the most typical abstractive summarizers.

Results
Results on CNN/Daily Mail
Table 2 shows the performance comparison between our DeepChannel and state-of-the-art baselines on the CNN/Daily Mail dataset using full-length Rouge F-1 as the metric. We can see that DeepChannel performs better than or at least on par with state-of-the-art models. Besides DeepChannel, there are two other approaches achieving more than 41 Rouge-1 points: SWAP-NET and rnn-ext+RL.

Method                    Rouge-1  Rouge-2  Rouge-L
Extractive
lead-3                    40.34    17.70    36.57
SummaRuNNer               39.60    16.20    35.30
Refresh                   40.00    18.20    36.60
SWAP-NET                  41.60    18.30    37.70
rnn-ext + RL              41.47    18.72    37.76
DeepChannel               41.50    17.77    37.62
Abstractive
PointerGenerator          39.53    17.28    36.38
ML+RL+intra-attention     39.87    15.82    36.90
controlled                39.75    17.29    36.54
inconsistency loss        40.68    17.97    37.13
Table 2: Performance on the CNN/Daily Mail test set using the full-length Rouge F-1 score.
Results on DUC 2007
                   Rouge-1  Rouge-2  Rouge-L
75 bytes
SummaRuNNer        18.32    4.57     12.96
PointerGenerator   13.74    2.49     10.97
Refresh            18.39    5.04     14.85
DeepChannel        19.53
275 bytes
SummaRuNNer        27.06    6.09     6.49
PointerGenerator   23.93    4.70     5.98
Refresh            26.80    6.30     6.66
DeepChannel        28.85
Table 3: Performance on the DUC 2007 dataset using the limited-length recall variants of Rouge. The upper section shows results at 75 bytes, and the lower shows results at 275 bytes. DeepChannel outperforms the other baselines stably, indicating that it is more robust for out-of-domain application.

To compare the robustness of models, we conducted out-of-domain experiments by training models on the CNN/Daily Mail training set while evaluating on the DUC 2007 dataset. Table 3 shows the limited-length Rouge recall scores at 75 bytes and 275 bytes. We can see that DeepChannel obtains a Rouge-1 score of 19.53 at 75 bytes and 28.85 at 275 bytes, stably and significantly better than the other three baselines, demonstrating the strong robustness of our model. It is worth noting that PointerGenerator, a seq2seq-based abstractive approach, suffers a performance drop by a large margin when transferred to the out-of-domain dataset. After being trained on the CNN/Daily Mail training set, it performs on par with SummaRuNNer and Refresh when testing on the CNN/Daily Mail test set (Table 2), while performing much worse on DUC 2007. We consider that seq2seq summarization systems are more prone to overfitting, as they attempt to memorize as many details of the training data as possible (i.e., learn to decode each word).
Results on Reduced CNN/Daily Mail
                   Rouge-1  Rouge-2  Rouge-L
1/10 training set
SummaRunner        35.95    15.87    32.38
PointerGenerator   34.32    11.82    31.54
Refresh            36.30    14.56    33.06
DeepChannel
1/100 training set
SummaRunner        35.44    15.50    31.88
PointerGenerator   28.57    6.28     25.90
Refresh            36.05    14.23    32.79
DeepChannel        39.41
Table 4: Performance when training on reduced CNN/Daily Mail training sets. The full-length Rouge F-1 scores on the CNN/Daily Mail test set are reported. The two sections show results for 1/10 and 1/100 of the training set, respectively. Our model obtains high scores even with only 1/100 of the training samples, while the other baselines, especially the seq2seq-based PointerGenerator, suffer a significant performance degradation on the reduced training sets.

We reduced the size of the training set to explore the data efficiency of different models. We conducted two experiments, preserving respectively 1/10 (28,711 pairs) and 1/100 (2,871 pairs) of the CNN/Daily Mail training set. Models were trained on the reduced training sets and evaluated on the original test set. Table 4 shows the performance of the different models, using full-length Rouge F-1 as the measurement.

We can see that, trained on merely 2,871 training samples, our DeepChannel can still achieve a good Rouge score, just slightly lower than the score obtained on the complete training set. In contrast, the Rouge scores of SummaRunner, Refresh, and especially PointerGenerator all suffer a drastic drop on the reduced training sets. When the fraction reduces from 1/10 to 1/100, PointerGenerator's Rouge-1 F1 score drops sharply, from 34.32 to 28.57. We think this is due to the same reason why PointerGenerator performs badly on DUC 2007: the seq2seq structure attempts to learn all details of the training set, leading to a more serious overfitting problem when the number of training samples is limited. Attributed to our salience estimation, DeepChannel has strong generalization ability, can learn from a very small training set, and avoids overfitting to a great extent.
Document: Rutgers University has banned fraternity and sorority house parties at its main campus in New Brunswick, New Jersey, for the rest of the spring semester after several alcohol-related problems this school year, including the death of a student. The probation was decided last week but announced by the university Monday. 'Rutgers takes seriously its ...' the university said in a statement.
Last month, a fraternity was shut down because of an underage drinking incident in November, in which a member of Sigma Phi Epsilon was taken to a hospital after drinking heavily at the fraternity house.
Rutgers University has banned fraternity and sorority house parties at its main campus for the rest of the spring semester after several alcohol-related problems ...
Gold Summary:
Rutgers University has banned fraternity and sorority house parties at its main campus for the rest of the spring semester. The probation was decided last week, but the school announced the move on Monday. 86 recognized fraternities and sororities will be allowed to hold spring formals and other events where third-party vendors serve alcohol. Last month, a fraternity was shut down because of an underage drinking incident in November. A member of Sigma Phi Epsilon was taken to a hospital after drinking heavily at the fraternity house during the incident. In September, a 19-year-old student, Caitlyn Kovacs, died of alcohol poisoning after attending a fraternity party.
Document: ...... are not as kind on the body as they purport to be.
Investigators found that a number of flavors were labeled 'healthy' - brimming with fiber, protein and antioxidants, while being low in fat and sodium. However, upon closer inspection, it was found that 'none of the products met the requirements to make such content claims' and were in fact 'misbranded'.
Mislabeled? The FDA has ruled that KIND bars are not as kind on the body as they purport to be.
Indeed, Daily Mail Online calculated that one KIND bar flavor - not included in the FDA investigation - contains more calories, fat and sodium than a Snickers bar.
A 40g Honey Smoked BBQ KIND Bar ...
Gold Summary:
FDA investigators found that a number of flavors were labeled 'healthy' - brimming with fiber and antioxidants, while being low in fat and sodium. However, upon closer inspection it was found that 'none of the products met the requirements to make such content claims'. Daily Mail Online calculated that one KIND bar flavor - not included in the FDA investigation - contains more calories and fat than a Snickers bar. New York University nutritionist Marion Nestle likened KIND bars to candy.
Table 5: Example documents and gold summaries from the CNN/Daily Mail test set. The sentences chosen by DeepChannel for extractive summarization are highlighted in bold, and the corresponding summary sentences with equivalent semantics are underlined.
Influence of the Penalization Term

We set α, the weight of the penalization term, according to the comparison in Table 6, which reports results on the CNN/Daily Mail test set for different α values and illustrates this choice. When we remove the penalization term (that is, α = 0), Rouge scores drop considerably, as the model cannot learn a reasonable attention without regularization; we show qualitative cases for further explanation below. On the other hand, performance also degrades when the penalization weight is set too high, as this causes unstable training of the contrastive loss.

α        Rouge-1  Rouge-2  Rouge-L
0        40.89    17.21    37.08
0.001

Table 6: Performance on the CNN/Daily Mail test set with different weights of the penalization term.

Qualitative Analyses
We show qualitative results to demonstrate that our model can successfully extract salient sentences. Table 5 gives two examples from the CNN/Daily Mail test set. The three sentences we extract are marked in bold, and the corresponding equivalent summary sentences are marked with underlines. We can see that DeepChannel can indeed find the most salient sentences in a document. Besides, redundant sentences are automatically avoided in our extractive results, which is attributed to the good properties of the channel probability and our greedy strategy.

Figure 3 shows an example of an attention heatmap, where each row corresponds to a document sentence and each column corresponds to a sentence of the gold summary. We can see that our model successfully learns high attention scores for sentence pairs that have relevant semantics. We also display the heatmap of the same document when the penalization term is removed during training (Figure 4). In that case, all document sentences focus on s[1], while s[0] receives no attention at all. Our proposed penalization term makes sure that no summary sentence is left out.

Figure 3: Example of an attention heatmap between document sentences (rows) and gold summary sentences (columns). Best viewed in color. The document sentences are: d[0] 'It looks like an ordinary forest, with moss climbing up the walls and brown leaves covering the floor.' d[1] 'The amazing illusion is the work of German body-painting artist Joerg Duesterwald, who spent hours painting his model so she would blend in with her surroundings.' d[2] 'The stunning set of pictures was taken in a forest in Langenfeld, Germany, yesterday.' d[3] 'The temperature was thought to be between 10 to 15 degrees during the shoot. Above, the model curls herself into the shade of the rockface.' The gold summary sentences are: s[0] 'The illusion is the work of German body-painting artist Joerg Duesterwald, who spent hours painting his model.' s[1] 'Stunning set of pictures was taken in front of a rockface in a forest in Langenfeld, Germany, yesterday.'

Figure 4: Heatmap for the same document when the penalization term is removed during training; s[0] does not receive attention at all. Best viewed in color.
Conclusions and Future Work
We propose DeepChannel, consisting of a deep neural network-based channel model and an iterative extraction strategy, for extractive document summarization. Experiments on CNN/Daily Mail demonstrate that our model performs on par with state-of-the-art summarization systems. Furthermore, DeepChannel has three significant advantages: 1) strong robustness to domain variations; 2) high data efficiency; 3) high interpretability.

For future work, we will consider more fine-grained, i.e., word-level, attention and extraction mechanisms. Besides, we will try to take the language model P(S) into account, to reflect the influence and coherence between adjacent sentences.

Acknowledgments
This work is supported by NSFC key projects (U1736204, 61533018, 61661146007), the Ministry of Education and China Mobile Research Fund (No. 20181770250), and the THU-NUS NExT Co-Lab.
References

Bing, L.; Li, P.; Liao, Y.; Lam, W.; Guo, W.; and Passonneau, R. 2015. Abstractive multi-document summarization via phrase selection and merging. In ACL-IJCNLP.
Chen, Y.-C., and Bansal, M. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.
Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Daumé III, H., and Marcu, D. 2002. A noisy-channel model for document compression. In ACL.
Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research.
Fan, A.; Grangier, D.; and Auli, M. 2017. Controllable abstractive summarization. arXiv preprint arXiv:1711.05217.
Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1693-1701.
Hsu, W.-T.; Lin, C.-K.; Lee, M.-Y.; Min, K.; Tang, J.; and Sun, M. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. arXiv preprint arXiv:1805.06266.
Iyyer, M.; Boyd-Graber, J.; Claudino, L.; Socher, R.; and Daumé III, H. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 633-644.
Jadhav, A., and Rajan, V. 2018. Extractive summarization with SWAP-NET: Sentences and words from alternating pointer networks. In ACL.
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Knight, K., and Marcu, D. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence.
Lin, Z.; Feng, M.; Santos, C. N. d.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Logeswaran, L., and Lee, H. 2018. An efficient framework for learning sentence representations. In ICLR.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Mnih, A., and Hinton, G. E. 2009. A scalable hierarchical distributed language model. In NIPS.
Nallapati, R.; Zhou, B.; dos Santos, C.; Gulcehre, C.; and Xiang, B. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.
Nallapati, R.; Zhai, F.; and Zhou, B. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI.
Narayan, S.; Cohen, S. B.; and Lapata, M. 2018. Ranking sentences for extractive summarization with reinforcement learning. In NAACL.
Parveen, D.; Ramsl, H.-M.; and Strube, M. 2015. Topical coherence for graph-based extractive summarization. In EMNLP.
Paulus, R.; Xiong, C.; and Socher, R. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In EMNLP.
Peyrard, M., and Eckle-Kohler, J. 2017. Supervised learning of automatic pyramid for optimization-based multi-document summarization. In ACL.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
Shen, D.; Sun, J.-T.; Li, H.; Yang, Q.; and Chen, Z. 2007. Document summarization using conditional random fields. In IJCAI.
Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR.
Woodsend, K., and Lapata, M. 2012. Multiple aspect summarization using integer linear programming. In EMNLP-CoNLL.
Wu, Y., and Hu, B. 2018. Learning to extract coherent summary via deep reinforcement learning. arXiv preprint arXiv:1804.07036.