Neural Network-Based Abstract Generation for Opinions and Arguments
Lu Wang
College of Computer and Information Science, Northeastern University, Boston, MA 02115, [email protected]
Wang Ling
Google DeepMind, London, N1 0AE, [email protected]
Abstract
We study the problem of generating abstractive summaries for opinionated text. We propose an attention-based neural network model that is able to absorb information from multiple text units to construct informative, concise, and fluent summaries. An importance-based sampling method is designed to allow the encoder to integrate information from an important subset of the input. Automatic evaluation indicates that our system outperforms state-of-the-art abstractive and extractive summarization systems on two newly collected datasets of movie reviews and arguments. Our system summaries are also rated as more informative and grammatical in human evaluation.
Introduction

Collecting opinions from others is an integral part of our daily activities. Discovering what other people think can help us navigate different aspects of life, ranging from making decisions on regular tasks to judging fundamental societal issues and forming personal ideology. To efficiently absorb the massive amount of opinionated information, there is a pressing need for automated systems that can generate concise and fluent opinion summaries about an entity or a topic. In spite of substantial research in opinion summarization, the most prominent approaches mainly rely on extractive summarization methods, where phrases or sentences from the original documents are selected for inclusion in the summary (Hu and Liu, 2004; Lerman et al., 2009). One of the problems that extractive methods suffer from
Movie: The Martian
Reviews:
- One of the smartest, sweetest, and most satisfyingly suspenseful sci-fi films in years.
- ...an intimate sci-fi epic that is smart, spectacular and stirring.
- The Martian is a thrilling, human and moving sci-fi picture that is easily the most emotionally engaging film Ridley Scott has made...
- It's pretty sunny and often funny, a space oddity for a director not known for pictures with a sense of humor.
- The Martian highlights the book's best qualities, tones down its worst, and adds its own style...
Opinion Consensus (Summary): Smart, thrilling, and surprisingly funny, The Martian offers a faithful adaptation of the bestselling book that brings out the best in leading man Matt Damon and director Ridley Scott.

Topic: This House supports the death penalty.
Arguments:
- The state has a responsibility to protect the lives of innocent citizens, and enacting the death penalty may save lives by reducing the rate of violent crime.
- While the prospect of life in prison may be frightening, surely death is a more daunting prospect.
- A 1985 study by Stephen K. Layson at the University of North Carolina showed that a single execution deters 18 murders.
- Reducing the wait time on death row prior to execution can dramatically increase its deterrent effect in the United States.
Claim (Summary): The death penalty deters crime.

Figure 1: Examples of an opinion consensus of professional reviews (critics) about the movie "The Martian" from RottenTomatoes, and a claim about "death penalty" supported by arguments from idebate.org. Content with similar meaning is highlighted in the same color.

is that they unavoidably include secondary or redundant information. On the contrary, abstractive summarization methods, which are able to generate text beyond the original input, can produce more coherent and concise summaries.

In this paper, we present an attention-based neural network model for generating abstractive summaries of opinionated text. Our system takes as input a set of text units containing opinions about the same topic (e.g., reviews for a movie, or arguments on a controversial social issue), and then outputs a one-sentence abstractive summary that describes the opinion consensus of the input.

Specifically, we investigate our abstract generation model on two types of opinionated text: movie reviews and arguments on controversial topics. Examples are displayed in Figure 1. The first example contains a set of professional reviews (or critics) about the movie "The Martian" and an opinion consensus written by an editor. It would be more useful to automatically generate a fluent opinion consensus than to simply extract features (e.g., plot, music, etc.) and opinion phrases as done in previous summarization work (Zhuang et al., 2006; Li et al., 2010). The second example lists a set of arguments on "death penalty", where each argument supports the central claim "death penalty deters crime". Arguments, as a special type of opinionated text, contain reasons to persuade or inform people on certain issues.
Given a set of arguments on the same topic, we aim to investigate the capability of our abstract generation system on the novel task of claim generation.

Existing abstract generation systems for opinionated text mostly take an approach that first identifies salient phrases and then merges them into sentences (Bing et al., 2015; Ganesan et al., 2010). Those systems are not capable of generating new words, and the output summary may suffer from ungrammatical structure. Another line of work requires a large amount of human input to enforce summary quality. For example, Gerani et al. (2014) utilize a set of human-constructed templates, which are filled with extracted phrases to generate grammatical sentences that serve different discourse functions.

To address the challenges above, we propose an attention-based abstract generation model: a data-driven approach trained to generate informative, concise, and fluent opinion summaries. Our method is based on the recently proposed framework of neural encoder-decoder models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014a), which translates a sentence from a source language into a target language. Different from previous work, our summarization system is designed to support multiple input text units. An attention-based model (Bahdanau et al., 2014) is deployed to allow the encoder to automatically search for salient information within context. Furthermore, we propose an importance-based sampling method so that the encoder can integrate information from an important subset of the input text. The importance score of a text unit is estimated by a novel regression model with a pairwise preference-based regularizer. With importance-based sampling, our model can be trained within manageable time, and is still able to learn from diversified input.

We demonstrate the effectiveness of our model on two newly collected datasets of movie reviews and arguments.
Automatic evaluation by BLEU (Papineni et al., 2002) indicates that our system outperforms state-of-the-art extract-based and abstract-based methods on both tasks. For example, we achieve a BLEU score of 24.88 on RottenTomatoes movie reviews, compared to 19.72 by the abstractive opinion summarization system of Ganesan et al. (2010). ROUGE evaluation (Lin and Hovy, 2003) also indicates that our system summaries have reasonable information coverage. Human judges further rated our summaries as more informative and grammatical than those of the compared systems.

Data Collection

We collected two datasets with gold-standard abstracts: one of movie reviews and one of arguments on controversial topics. Rotten Tomatoes is a movie review website that aggregates both professional critics and user-generated reviews (henceforth
RottenTomatoes). For each movie, a one-sentence critic consensus is constructed by an editor to summarize the opinions in the professional critics. We crawled 246,164 critics and their opinion consensus for 3,731 movies (i.e., around 66 reviews per movie on average). We select 2,458 movies for training, 536 for validation, and 737 for testing. The opinion consensus is treated as the gold-standard summary. We also collect an argumentation dataset from idebate.org (henceforth
Idebate), a Wikipedia-style website gathering pro and con arguments on controversial issues. The arguments under each debate (or topic) are organized into different "for" and "against" points. Each point contains a one-sentence central claim constructed by the editors to summarize the corresponding arguments, which is treated as the gold-standard. For instance, in a debate about "death penalty", one claim is "the death penalty deters crime", with the argument "enacting the death penalty may save lives by reducing the rate of violent crime" (Figure 1). We crawled 676 debates with 2,259 claims. We treat each sentence as an argument, which results in 17,359 arguments in total. 450 debates are used for training, 67 for validation, and 150 for testing. (The datasets can be downloaded from .)

In this section, we first define our problem in Section 3.1, followed by the model description. In general, we utilize a Long Short-Term Memory network for generating abstracts (Section 3.2) from a latent representation computed by an attention-based encoder (Section 3.3). The encoder is designed to search for relevant information in the input to better inform the abstract generation process. We also discuss an importance-based sampling method that allows the encoder to integrate information from an important subset of the input (Sections 3.4 and 3.5). Post-processing (Section 3.6) is conducted to re-rank the generations and pick the best one as the final summary.
In summarization, the goal is to generate a summary y, composed of the sequence of words y_1, ..., y_{|y|}. Unlike previous neural encoder-decoder approaches, which decode from only one input, our input consists of an arbitrary number of reviews or arguments (henceforth text units, wherever there is no ambiguity), denoted as x = {x^1, ..., x^M}. Each text unit x^k is composed of a sequence of words x^k_1, ..., x^k_{|x^k|}. Each word takes the form of a representation vector, which is initialized randomly or with pre-trained embeddings (Mikolov et al., 2013), and is updated during training. The summarization task is defined as finding ŷ, the most likely sequence of words ŷ_1, ..., ŷ_N, such that:

ŷ = argmax_y log P(y | x)    (1)

where log P(y | x) denotes the conditional log-likelihood of the output sequence y given the input text units x. In the next sections, we describe the attention model used to model log P(y | x).

Similar to previous work (Sutskever et al., 2014b; Bahdanau et al., 2014), we decompose log P(y | x) into a sequence of word-level predictions:

log P(y | x) = Σ_{j=1..|y|} log P(y_j | y_1, ..., y_{j−1}, x)    (2)

where each word y_j is predicted conditioned on the previously generated words y_1, ..., y_{j−1} and the input x. The probability is estimated by a standard word softmax:

P(y_j | y_1, ..., y_{j−1}, x) = softmax(h_j)    (3)

where h_j is the Recurrent Neural Network (RNN) state variable at timestamp j, modeled as:

h_j = g(y_{j−1}, h_{j−1}, s)    (4)

Here g is a recurrent update function that generates the new state h_j from the representation of the previously generated word y_{j−1} (obtained from a word lookup table), the previous state h_{j−1}, and the input text representation s (see Section 3.3). In this work, we implement g using a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), which has been shown to be effective at capturing long-range dependencies.
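Before the formal update rules, a minimal numpy sketch of one LSTM step may help; the dictionary keys and the diagonal (vector-valued) peephole connections are our own implementation choices, not specified in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(u, h_prev, c_prev, W, b):
    """One LSTM update. W maps gate names to weight matrices; the peephole
    terms (keys 'ic', 'fc', 'oc') are diagonal, stored as vectors."""
    i = sigmoid(W["iu"] @ u + W["ih"] @ h_prev + W["ic"] * c_prev + b["i"])  # input gate
    f = sigmoid(W["fu"] @ u + W["fh"] @ h_prev + W["fc"] * c_prev + b["f"])  # forget gate
    c = f * c_prev + i * np.tanh(W["cu"] @ u + W["ch"] @ h_prev + b["c"])    # cell memory
    o = sigmoid(W["ou"] @ u + W["oh"] @ h_prev + W["oc"] * c + b["o"])       # output gate
    h = o * np.tanh(c)                                                       # new state
    return h, c
```

The cell memory c is updated linearly (a gated sum), which is what mitigates vanishing gradients over long sequences.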
Here we summarize the update rules for LSTM cells, and refer readers to the original work (Hochreiter and Schmidhuber, 1997) for more details. Given an arbitrary input vector u_j at timestamp j and the previous state h_{j−1}, a typical LSTM defines the following update rules:

i_j = σ(W_{iu} u_j + W_{ih} h_{j−1} + W_{ic} c_{j−1} + b_i)
f_j = σ(W_{fu} u_j + W_{fh} h_{j−1} + W_{fc} c_{j−1} + b_f)
c_j = f_j ⊙ c_{j−1} + i_j ⊙ tanh(W_{cu} u_j + W_{ch} h_{j−1} + b_c)
o_j = σ(W_{ou} u_j + W_{oh} h_{j−1} + W_{oc} c_j + b_o)
h_j = o_j ⊙ tanh(c_j)    (5)

σ is the component-wise logistic sigmoid function, and ⊙ denotes the Hadamard product. The projection matrices W_{∗∗} and biases b_∗ are parameters to be learned during training. Long-range dependencies are captured by the cell memory c_j, which is updated linearly to avoid the vanishing gradient problem. This is accomplished by predicting two vectors i_j and f_j, which determine what to keep and what to forget from the current timestamp. The vector o_j then decides what information from the new cell memory c_j is passed on to the new state h_j. Finally, the model concatenates the representation of the previous output word y_{j−1} and the input representation s (see Section 3.3) as u_j, which serves as the input at each timestamp.

The representation s of the input text units is computed using an attention model (Bahdanau et al., 2014). Given a single text unit x_1, ..., x_{|x|} and the previous state h_{j−1}, the model generates s as a weighted sum:

s = Σ_{i=1..|x|} a_i b_i    (6)

where a_i is the attention coefficient obtained for word x_i, and b_i is the context-dependent representation of x_i. In our work, we construct b_i by building a bidirectional LSTM over the whole input sequence x_1, ..., x_{|x|} and then combining the forward and backward states. Formally, we use the LSTM formulation from Eq. 5 to generate the forward states h^f_1, ..., h^f_{|x|} by setting u_j = x_j (the projection of word x_j using a word lookup table).
Likewise, the backward states h^b_{|x|}, ..., h^b_1 are generated using a backward LSTM by feeding the input in reverse order, that is, u_j = x_{|x|−j+1}. The coefficients a_i are computed with a softmax over all input positions:

a_i = softmax(v(b_i, h_{j−1}))    (7)

where the function v computes the affinity of each word x_i and the current output context h_{j−1}, i.e., how likely the input word is to be used to generate the next word in the summary. We set v(b_i, h_{j−1}) = W_s · tanh(W_{cg} b_i + W_{hg} h_{j−1}), where W_s, W_{cg}, and W_{hg} are parameters to be learned.

A key distinction between our model and existing sequence-to-sequence models (Sutskever et al., 2014b; Bahdanau et al., 2014) is that our input consists of multiple separate text units. Given an input of N text units {x^k_1, ..., x^k_{|x^k|}}, k = 1..N, a simple extension would be to concatenate them into one sequence z = x^1_1, ..., x^1_{|x^1|}, SEG, x^2_1, ..., x^2_{|x^2|}, SEG, ..., x^N_1, ..., x^N_{|x^N|}, where SEG is a special token that delimits the inputs. However, there are two problems with this approach. First, the model is sensitive to the order of the text units. Moreover, z may contain thousands of words, which becomes a bottleneck for our model, with a training time of O(N|z|), since attention coefficients must be computed over all input words to generate each output word.

We address these two problems by sub-sampling from the input. The intuition is that even though the number of input text units is large, many of them are redundant or contain secondary information. As our task is to emphasize the main points made in the input, some of the units can be removed without losing too much information. Therefore, we define an importance score f(x^k) ∈ [0, 1] for each text unit x^k (see Section 3.5). During training, K candidates are sampled from the multinomial distribution constructed by normalizing f(x^k) over the input text units.
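This sampling step, and the top-K selection used at test time, can be sketched in numpy as follows; the text does not state whether sampling is with or without replacement, so we assume without, and the function names are ours:

```python
import numpy as np

def sample_units(scores, K, rng):
    """Training: draw K distinct text-unit indices from the multinomial
    obtained by normalizing the importance scores f(x^k)."""
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=min(K, len(p)), replace=False, p=p)

def top_k_units(scores, K):
    """Testing: the K units with the highest scores, in descending order."""
    return np.argsort(scores)[::-1][:K]
```

Sampling (rather than always taking the top K) lets repeated passes over the training set expose the model to more than K distinct units per cluster.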
Notice that the training process goes over the training set multiple times, so our model is still able to learn from more than K text units. For testing, the top-K candidates with the highest importance scores are collapsed, in descending order, into z.

We now describe the importance estimation model, which outputs importance scores for text units. In general, we start with a ridge regression model, and add a regularizer to enforce the separation of summary-worthy text units from the others. Given a cluster of text units {x^1, ..., x^M} and their summary y, we compute the number of overlapping content words between each text unit and the summary y as its gold-standard importance score. The scores are uniformly normalized to [0, 1]. Each text unit x^k is represented as a d-dimensional feature vector r_k ∈ R^d, with label l_k. The text units in the training data are thus denoted by a feature matrix R̃ and a label vector L̃. We aim at learning f(x^k) = r_k · w by minimizing ||R̃w − L̃||^2 + β · ||w||^2. This is the standard formulation of ridge regression, and we use the features in Table 1.

Furthermore, pairwise preference constraints have been utilized for learning ranking models (Joachims, 2002). We therefore consider adding a pairwise preference-based regularizing constraint to incorporate a bias towards summary-worthy text units: λ · Σ_T Σ_{x_p, x_q ∈ T, l_p > 0, l_q = 0} ||(r_p − r_q) · w − 1||^2, where T is a cluster of text units to be summarized. The term (r_p − r_q) · w enforces the separation of summary-worthy text from the others. We further construct R̃′ to contain all the pairwise differences (r_p − r_q), and L̃′ as a vector of the same size with each element equal to 1. The objective function becomes:

J(w) = ||R̃w − L̃||^2 + λ · ||R̃′w − L̃′||^2 + β · ||w||^2    (8)

λ and β are tuned on the development set.
With β̃ = β · I_d and λ̃ = λ · I_{|R̃′|}, the closed-form solution for ŵ is:

ŵ = (R̃^T R̃ + R̃′^T λ̃ R̃′ + β̃)^{−1} (R̃^T L̃ + R̃′^T λ̃ L̃′)    (9)

Table 1: Features used for text unit importance estimation.
- num of words
- unigram
- num of POS tags
- num of named entities
- centroidness (Radev, 2001)
- avg/max TF-IDF scores
- category in General Inquirer (Stone et al., 1966)
- num of positive/negative/neutral words (General Inquirer, MPQA (Wilson et al., 2005))
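The closed-form solution in Eq. 9 is a direct linear solve; a numpy sketch, with λ̃ and β̃ folded in as scalar multiples of the identity and variable names of our own choosing:

```python
import numpy as np

def fit_importance_weights(R, L, R_diff, lam=1.0, beta=1.0):
    """Closed-form solution of the regularized objective (Eq. 9).
    R: text-unit feature matrix, L: gold importance scores,
    R_diff: rows (r_p - r_q) pairing summary-worthy units with
    non-worthy ones; the pairwise targets L' are all ones."""
    d = R.shape[1]
    L_diff = np.ones(R_diff.shape[0])
    A = R.T @ R + lam * (R_diff.T @ R_diff) + beta * np.eye(d)
    rhs = R.T @ L + lam * (R_diff.T @ L_diff)
    return np.linalg.solve(A, rhs)
```

With lam=0 (or an empty R_diff) this reduces to ordinary ridge regression.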
At test time, we re-rank the n-best summaries according to their cosine similarity with the input text units. The one with the highest similarity is included in the final summary. The use of more sophisticated re-ranking methods (Charniak and Johnson, 2005; Konstas and Lapata, 2012) will be investigated in future work.

Data Pre-processing.
We pre-process the datasets with Stanford CoreNLP (Manning et al., 2014) for tokenization and for extracting POS tags and dependency relations. For the RottenTomatoes dataset, we replace movie titles with a generic label in training, and substitute the label with the movie name if any generic label is generated in testing.
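This substitution amounts to a simple string replacement; a sketch, where the generic label token is our own placeholder (the text does not specify one):

```python
def mask_title(text, title, label="_MOVIE_"):
    """Training: replace the movie title with a generic label."""
    return text.replace(title, label)

def unmask_title(summary, title, label="_MOVIE_"):
    """Testing: substitute the movie name for any generated label."""
    return summary.replace(label, title)
```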
Pre-trained Embeddings and Features.
The size of the word representations is set to 300, both for input and output words. These can be initialized randomly or with pre-trained embeddings learned from Google News (Mikolov et al., 2013). We also extend our model with the additional features described in Table 2. Discrete features, such as POS tags, are mapped into the word representation via lookup tables. Continuous features (e.g., TF-IDF scores) are attached to the word vectors as additional values.

Table 2: Token-level features used for abstract generation.
- part of a named entity?
- capitalized?
- POS tag
- dependency relation
- category in General Inquirer
- sentiment polarity (General Inquirer, MPQA)
- TF-IDF score
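The token representation just described can be sketched as a simple concatenation; the sizes and names below are illustrative only:

```python
import numpy as np

def token_representation(word_vec, pos_table, pos_tag, tfidf):
    """Concatenate the word embedding with a POS-tag embedding from a
    lookup table, and append the continuous TF-IDF score as an extra value."""
    return np.concatenate([word_vec, pos_table[pos_tag], [tfidf]])
```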
Hyper-parameters and Stop Criterion.
The LSTMs (Equation 5) for the decoder and encoders are defined with states and cells of 150 dimensions. The attention for each input word and state pair is computed by projecting into a vector of 100 dimensions (Equation 7). Training is performed via Adagrad (Duchi et al., 2011), and terminates when performance does not improve on the development set. We use BLEU (up to 4-grams) (Papineni et al., 2002) as the evaluation metric, which computes the precision of n-grams in generated summaries with gold-standard abstracts as the reference. Finally, the importance-based sampling rate (K) is set to 5 for the experiments in Sections 5.2 and 5.3.

Decoding is performed by beam search with a beam size of 20, i.e., we keep the 20 most probable output sequences in the stack at each step. Outputs with the end-of-sentence token are also considered for re-ranking. Decoding stops when every beam in the stack generates the end-of-sentence token.

We first evaluate the importance estimation component described in Section 3.5. We compare with Support Vector Regression (SVR) (Smola and Vapnik, 1997) and two baselines: (1) a length baseline that ranks text units by their length, and (2) a centroid baseline that ranks text units by their centroidness, computed as the cosine similarity between a text unit and the centroid of the cluster to be summarized (Erkan and Radev, 2004).

Figure 2: Evaluation of importance estimation by mean reciprocal rank (MRR), and normalized discounted cumulative gain at the top 3 and 5 returned results (NDCG@3 and NDCG@5). Our regression model with the pairwise preference-based regularizer uniformly outperforms the baseline systems on both datasets.

We evaluate using mean reciprocal rank (MRR), and normalized discounted cumulative gain at the top 3 and 5 returned results (NDCG@3 and NDCG@5). Text units are considered relevant if they have at least one overlapping content word with the gold-standard summary. From Figure 2, we can see that our importance estimation model produces uniformly better ranking performance on both datasets.
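For reference, the two ranking metrics can be computed as follows for binary relevance (a sketch with our own input conventions):

```python
import math

def mrr(queries):
    """Mean reciprocal rank: each query is a list of 0/1 relevance flags
    in ranked order; the score is 1/rank of the first relevant item."""
    total = 0.0
    for rels in queries:
        total += next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0)
    return total / len(queries)

def ndcg_at_k(rels, k):
    """NDCG@k: discounted cumulative gain normalized by the ideal ordering."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0
```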
For automatic summary evaluation, we consider three popular metrics. ROUGE (Lin and Hovy, 2003) is employed to evaluate the n-gram recall of the summaries, with gold-standard abstracts as reference; we report ROUGE-SU4, which measures unigrams and skip-bigrams separated by up to four words. We also utilize BLEU, a precision-based metric, which has been used to evaluate various language generation systems (Chiang, 2005; Angeli et al., 2010; Karpathy and Fei-Fei, 2014). We further consider METEOR (Denkowski and Lavie, 2014); as a recall-oriented metric, it calculates the similarity between generations and references by considering synonyms and paraphrases.

For comparison, we first consider the abstractive summarization method of Ganesan et al. (2010) on the RottenTomatoes dataset, which utilizes a graph-based algorithm to remove repetitive information and to merge opinionated expressions based on the syntactic structure of product reviews. For both datasets, we consider two extractive summarization approaches: (1) LexRank (Erkan and Radev, 2004), an unsupervised method that computes text centrality with the PageRank algorithm; and (2) the supervised SUBMODULAR summarization model of Sipos et al. (2012), trained with Support Vector Machines. In addition, the LONGEST sentence is picked as a baseline. Four variations of our system are tested: one uses randomly initialized word embeddings; the rest use pre-trained word embeddings, the additional features in Table 2, and their combination. For all systems, we generate a one-sentence summary.

Results are displayed in Table 3. Our system with pre-trained word embeddings and additional features achieves the best BLEU scores on both datasets (in boldface), with statistical significance (two-tailed Wilcoxon signed rank test). Notice that our system summaries are more concise (i.e., shorter on average), which leads to higher scores on precision-based metrics (e.g., BLEU) and lower scores on recall-based metrics (e.g., METEOR and ROUGE). On the RottenTomatoes dataset, where the summaries generated by different systems are similar in length, our system still outperforms the other methods in METEOR and ROUGE, in addition to its significantly better BLEU scores. This does not hold on Idebate, since the summaries of the extract-based systems are significantly longer; still, the BLEU scores of our system are considerably higher. Among our four systems, the models with pre-trained word embeddings in general achieve better scores. Though the additional features do not always improve performance, we find that they help our systems converge faster.

Table 3: Automatic evaluation results by BLEU, METEOR, and ROUGE-SU4 scores (multiplied by 100) for abstract generation systems on RottenTomatoes and Idebate (columns: Length, BLEU, METEOR, ROUGE per dataset; rows: the extract-based LONGEST, LEXRANK, and SUBMODULAR, the abstract-based OPINOSIS, and OUR SYSTEMS: words, words (pre-trained), words + features, words (pre-trained) + features). The numeric entries, including the average lengths of the human-written summaries, are not recoverable in this copy. The best performing system for each column is highlighted in boldface; our system with pre-trained word embeddings and additional features achieves the best BLEU scores on both datasets. Our systems that are statistically significantly better than the comparisons are marked with * (two-tailed Wilcoxon signed rank test). Our system also has the best METEOR and ROUGE scores (in italics) on the RottenTomatoes dataset among the learning-based systems.

For human evaluation, we consider three aspects: informativeness, which indicates how much salient information is contained in the summary; grammaticality, which measures whether a summary is grammatical; and compactness, which denotes whether a summary contains unnecessary information. Each aspect is rated on a 1-to-5 scale (5 is best). The judges are also asked to rank all summary variations according to their overall quality. (We do not run Opinosis on Idebate because it relies on high redundancy to detect repetitive expressions, which is not observed on Idebate.)

We randomly sampled 40 movies from the RottenTomatoes test set, each of which was evaluated by 5 distinct human judges. We hired 10 proficient English speakers for the evaluation. Three system summaries (LexRank, Opinosis, and our system) and the human-written abstract, along with 20 representative reviews, were displayed for each movie. The reviews with the highest gold-standard importance scores were selected.

Table 4: Human evaluation results for abstract generation systems (rows: LEXRANK, OPINOSIS, OUR SYSTEM, HUMAN ABSTRACT; columns: Info, Gram, Comp, Avg Rank, Best%; the numeric entries are not recoverable in this copy). Inter-rater agreement for the overall ranking is 0.71 by Krippendorff's α. Informativeness (Info), grammaticality (Gram), and compactness (Comp) are rated on a 1-to-5 scale, with 5 as the best. Our system achieves the best informativeness and grammaticality scores among the three learning-based systems. Our summaries are ranked the best in 18% of the evaluations, and are also ranked higher than the compared systems on average.

Results are reported in Table 4. As can be seen, our system outperforms the abstract-based system Opinosis in all aspects, and also achieves better informativeness and grammaticality scores than LexRank, which extracts sentences in their original form. Our system summaries are ranked the best in 18% of the evaluations, and have an average ranking of 2.3, higher than both Opinosis and LexRank on average. An inter-rater agreement of 0.71 (Krippendorff's α) is achieved for the overall ranking. This implies that our attention-based abstract generation model can produce summaries of better quality than existing summarization systems. We also find that our system summaries are constructed in a style closer to human abstracts than the others. Sample summaries are displayed in Figure 3.

We further investigate whether taking inputs sampled from the distributions estimated by importance scores trains better models than learning from fixed input or uniformly-sampled input. Recall that we sample K text units based on their importance scores (Importance-Based Sampling). Here we consider two other setups: sampling K text units uniformly from the input (Uniform Sampling), and picking the K text units with the highest scores (Top K). We try various values of K. The results in Figure 4 demonstrate that Importance-Based Sampling produces BLEU scores comparable to the Top-K method, while both outperform Uniform Sampling. For METEOR, Importance-Based Sampling uniformly outperforms the other two methods. (We observe similar results on the Idebate dataset.)

Finally, we discuss some other observations and potential improvements. First, applying the re-ranking component after the model generates the n-best abstracts leads to better performance.

Movie: The Neverending Story
Reviews: (1) Here is a little adventure that fed on our uncultivated need to think, and wonder... (2) Magical storytelling targeted at children still fascinates. (3) ...the art direction involved a lot of imagination.
Human: A magical journey about the power of a young boy's imagination to save a dying fantasy land, The Neverending Story remains a much-loved kids adventure.
LexRank: It pokes along at times and lapses occasionally into dark moments of preachy philosophy, but this is still a charming, amusing and harmless film for kids.
Opinosis: The Neverending Story is a silly fantasy movie that often shows its age.
Our System: The Neverending Story is an entertaining children's adventure, with heart and imagination to spare.
Movie: Joe Strummer: The Future is Unwritten
Reviews: (1) The late punk rock legend Joe Strummer is rendered fully human in Julian Temple's engrossing and all-encompassing portrait. (2) The movie fascinates not so much because of Strummer... but because of the way Temple organized and edited the film. (3) One of the most compelling documentary portraits of a musician yet made.
Human: Displaying Joe Strummer warts and all, The Future is Unwritten succeeds as both an engrossing documentary and a comprehensive examination of one of music's most legendary figures.
LexRank: Joe Strummer: The Future Is Unwritten is a film for fans – really big fans.
Opinosis: Joe Strummer: The Future Is Unwritten is for fans – really big fans.
Our System: Fascinating and insightful, Joe Strummer: The Future Is Unwritten is a thoroughly engrossing documentary.
Topic: This House would detain terror suspects without trial.
Arguments: (1) Governments must have powers to protect their citizens against threats to the life of the nation. (2) Everyone would recognise that rules that are applied in peacetime may not be appropriate during wartime.
Human: Governments must have powers to protect citizens from harm.
LexRank: This is not merely to directly protect citizens from political violence, but also because political violence handicaps the process of reconstruction in nation-building efforts.
Our System: Governments have the obligation to protect citizens from harmful substances.
Topic: This House would replace Christmas with a festival for everyone.
Arguments: (1) Christmas celebrations in the Western world... do not respect the rights of those who are not religious. (2) States should instead be sponsoring and celebrating events that everyone can join in equally, regardless of religion, race or class.
Human: States should respect the freedom from religion, as well as the freedom of religion.
LexRank: For school children who do not share the majority-Christian faith, Christmas celebrations require either their participation when they do not want to, through coercion, or their non-participation and therefore isolation whilst everyone else celebrates their inclusiveness.
Our System: People have a right to freedom of religion.
Figure 3: Sample summaries generated by different systems on movie reviews and arguments. We only show a subset of reviews and arguments due to limited space.
Figure 4: Sampling effect on RottenTomatoes.

Preliminary experiments show that simply picking the top-1 generation produces inferior results compared to re-ranking with simple heuristics. This suggests that the current models are oblivious to some task-specific issues, such as informativeness; post-processing is needed to make better use of the summary candidates. For example, future work can study more sophisticated re-ranking algorithms (Charniak and Johnson, 2005; Konstas and Lapata, 2012).

Furthermore, we also look at the difficult cases where our summaries are judged to have lower informativeness. They are often much shorter than the gold-standard human abstracts, so their information coverage is limited. In other cases, some generations contain incorrect information on domain-dependent facts, e.g., named entities or numbers. For instance, the summary "a poignant coming-of-age tale marked by a breakout lead performance from Cate Shortland" is generated for the movie "Lore"; "Cate Shortland" is the director of the movie, not an actor. Handling this issue would require semantic features, which has yet to be attempted.
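The simple heuristic re-ranking from Section 3.6, cosine similarity between each n-best candidate and the input, can be sketched with bag-of-words counts; whitespace tokenization is our simplifying assumption here:

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    den = math.sqrt(sum(v * v for v in c1.values())) * \
          math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def rerank(candidates, input_units):
    """Pick the candidate summary most similar to the concatenated input."""
    inp = Counter(w for unit in input_units for w in unit.lower().split())
    return max(candidates, key=lambda c: cosine(Counter(c.lower().split()), inp))
```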
Our work belongs to the area of opinion summarization. Constructing fluent natural language opinion summaries has mainly been studied for product reviews (Hu and Liu, 2004; Lerman et al., 2009), community question answering (Wang et al., 2014), and editorials (Paul et al., 2010). Extractive summarization approaches are employed to identify summary-worthy sentences. For example, Hu and Liu (2004) first identify the frequent product features and then attach extracted opinion sentences to the corresponding feature. Our model instead utilizes abstract generation techniques to construct natural language summaries. As far as we know, we are also the first to study claim generation for arguments.

Recently, there has been a growing interest in generating abstractive summaries for news articles (Bing et al., 2015), spoken meetings (Wang and Cardie, 2013), and product reviews (Ganesan et al., 2010; Di Fabbrizio et al., 2014; Gerani et al., 2014). Most approaches are based on phrase extraction, after which an algorithm concatenates the extracted phrases into sentences (Bing et al., 2015; Ganesan et al., 2010). Nevertheless, the output summaries are not guaranteed to be grammatical. Gerani et al. (2014) therefore design a set of manually constructed realization templates for producing grammatical sentences that serve different discourse functions. Our approach does not require any human-annotated rules, and can be applied in various domains.

Our task is closely related to recent advances in neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014a). Based on the sequence-to-sequence paradigm, RNN-based models have been investigated for compression (Filippova et al., 2015) and summarization (Filippova et al., 2015; Rush et al., 2015; Hermann et al., 2015) at the sentence level. Built on the attention-based translation model of Bahdanau et al. (2014), Rush et al. (2015) study the problem of constructing an abstract for a single sentence.
Our task differs from the models presented above in that our model carries out abstractive decoding from multiple sentences instead of a single sentence.
In this work, we presented a neural approach to generating abstractive summaries for opinionated text. We employed an attention-based method that finds salient information in different input text units to generate an informative and concise summary. To cope with the large amount of input text, we deployed an importance-based sampling mechanism for model training. Experiments showed that our system obtains state-of-the-art results under both automatic and human evaluation.
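The importance-based sampling mechanism mentioned above is specified earlier in the paper; as a hedged illustration only, one generic way to sample a subset of input text units in proportion to importance scores (weighted sampling without replacement; the function name and interface are assumptions, not the authors' code) is:

```python
import random

def sample_important(units, scores, k, seed=0):
    """Illustrative sketch: draw up to k text units without
    replacement, each draw weighted by its importance score."""
    rng = random.Random(seed)  # seeded for reproducibility
    pool = list(zip(units, [float(s) for s in scores]))
    chosen = []
    for _ in range(min(k, len(pool))):
        total = sum(s for _, s in pool)
        r = rng.uniform(0, total)
        acc = 0.0
        for i, (unit, s) in enumerate(pool):
            acc += s
            if r <= acc:
                chosen.append(unit)
                pool.pop(i)  # without replacement
                break
    return chosen
```

Highly scored units are sampled more often, so the encoder tends to see the important subset of the input while keeping the training cost bounded by k.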
References

[Angeli et al.2010] Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 502–512. Association for Computational Linguistics.

[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

[Bing et al.2015] Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1587–1597, Beijing, China, July. Association for Computational Linguistics.

[Charniak and Johnson2005] Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Chiang2005] David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 263–270. Association for Computational Linguistics.

[Denkowski and Lavie2014] Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

[Di Fabbrizio et al.2014] Giuseppe Di Fabbrizio, Amanda J. Stent, and Robert Gaizauskas. 2014. A hybrid approach to multi-document summarization of opinions in reviews. INLG 2014, page 54.

[Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July.

[Erkan and Radev2004] Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479, December.

[Filippova et al.2015] Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368, Lisbon, Portugal, September. Association for Computational Linguistics.

[Ganesan et al.2010] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 340–348. Association for Computational Linguistics.

[Gerani et al.2014] Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1602–1613, Doha, Qatar, October. Association for Computational Linguistics.

[Hermann et al.2015] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. CoRR, abs/1506.03340.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.

[Hu and Liu2004] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, New York, NY, USA. ACM.

[Joachims2002] Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 133–142, New York, NY, USA. ACM.

[Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, pages 1700–1709. ACL.

[Karpathy and Fei-Fei2014] Andrej Karpathy and Li Fei-Fei. 2014. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306.

[Konstas and Lapata2012] Ioannis Konstas and Mirella Lapata. 2012. Concept-to-text generation via discriminative reranking. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 369–378, Jeju Island, Korea, July. Association for Computational Linguistics.

[Lerman et al.2009] Kevin Lerman, Sasha Blair-Goldensohn, and Ryan McDonald. 2009. Sentiment summarization: Evaluating and learning user preferences. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pages 514–522, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Li et al.2010] Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. 2010. Structure-aware review mining and summarization. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 653–661, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Lin and Hovy2003] Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 71–78.

[Manning et al.2014] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

[Paul et al.2010] Michael J. Paul, ChengXiang Zhai, and Roxana Girju. 2010. Summarizing contrastive viewpoints in opinionated text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 66–76, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Radev2001] Dragomir R. Radev. 2001. Experiments in single and multidocument summarization using MEAD. In First Document Understanding Conference.

[Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal, September. Association for Computational Linguistics.

[Sipos et al.2012] Ruben Sipos, Pannaga Shivaswamy, and Thorsten Joachims. 2012. Large-margin learning of submodular summarization models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 224–233, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Smola and Vapnik1997] Alex Smola and Vladimir Vapnik. 1997. Support vector regression machines. Advances in Neural Information Processing Systems, 9:155–161.

[Stone et al.1966] Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press, Cambridge, MA.

[Sutskever et al.2014a] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014a. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.

[Sutskever et al.2014b] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014b. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215.

[Wang and Cardie2013] Lu Wang and Claire Cardie. 2013. Domain-independent abstract generation for focused meeting summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1395–1405, Sofia, Bulgaria, August. Association for Computational Linguistics.

[Wang et al.2014] Lu Wang, Hema Raghavan, Claire Cardie, and Vittorio Castelli. 2014. Query-focused opinion summarization for user-generated content. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1660–1669, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

[Wilson et al.2005] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 347–354, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Zhuang et al.2006] Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie review mining and summarization. In