Sentence Ordering and Coherence Modeling using Recurrent Neural Networks
Lajanugen Logeswaran, Honglak Lee, Dragomir Radev
Department of Computer Science & Engineering, University of Michigan
Department of Computer Science, Yale University
[email protected], [email protected], [email protected]
Abstract
Modeling the structure of coherent texts is a key NLP problem. The task of coherently organizing a given set of sentences has been commonly used to build and evaluate models that understand such structure. We propose an end-to-end unsupervised deep learning approach based on the set-to-sequence framework to address this problem. Our model strongly outperforms prior methods in the order discrimination task and a novel task of ordering abstracts from scientific articles. Furthermore, our work shows that useful text representations can be obtained by learning to order sentences. Visualizing the learned sentence representations shows that the model captures high-level logical structure in paragraphs. Our representations perform comparably to state-of-the-art pre-training methods on sentence similarity and paraphrase detection tasks.
Introduction
Modeling the structure of coherent texts is an important problem in NLP. A well-written text has a particular high-level logical and topical structure. The actual word and sentence choices and their transitions come together to convey the purpose of the text. Our primary goal is to build models that can learn such structure by arranging a given set of sentences to make coherent text.

Multi-document Summarization (MDS) and retrieval-based Question Answering (QA) involve extracting information from multiple documents and organizing it into a coherent summary. Since the relative ordering of sentences from different sources can be unclear, being able to automatically evaluate a particular order is essential. Barzilay and Elhadad (2002) discuss the importance of an ordering component in MDS and show that finding acceptable orderings can enhance user comprehension.

More importantly, by learning to order sentences we can model text coherence. It is difficult to explicitly characterize the properties of text that make it coherent. Ordering models attempt to understand these properties by learning the high-level structure that causes sentences to appear in a specific order in human-authored texts. Automatic methods for evaluating human- and machine-generated text have great importance, with applications in essay scoring (Miltsakaki and Kukich 2004; Burstein, Tetreault, and Andreyev 2010) and text generation
(Park and Kim 2015; Kiddon, Zettlemoyer, and Choi 2016). Coherence models aid the better design of these systems.

Exploiting unlabelled corpora to learn semantic representations of data has become an active area of investigation. Self-supervised learning is a typical approach that uses information naturally available as part of the data as supervisory signals (Wang and Gupta 2015; Doersch, Gupta, and Efros 2015). Noroozi and Favaro (2016) attempt to learn visual representations by solving image jigsaw puzzles. Sentence ordering can be considered a jigsaw puzzle in the language domain, and an interesting question is whether we can learn useful textual representations by performing this task.

Our approach to coherence modeling is driven by recent success in capturing semantics using distributed representations and modeling sequences using Recurrent Neural Networks (RNNs). RNNs are now the dominant approach to sequence learning and mapping problems. The sequence-to-sequence (Seq2seq) framework (Sutskever, Vinyals, and Le 2014) and its variants have fueled RNN-based approaches to a range of problems such as language modeling, text generation, machine translation, question answering, and many others.

In this work we propose an RNN-based approach to the sentence ordering problem which exploits the set-to-sequence framework of Vinyals, Bengio, and Kudlur (2015). A word-level RNN encoder produces sentence embeddings, and a sentence-level set encoder RNN iteratively attends to these embeddings and constructs a context representation. Initialized with this representation, a sentence-level pointer network selects the sentences sequentially.

The most widely studied task relevant to sentence ordering and coherence modeling is the order discrimination task. Given a document and a permuted version of it, the task involves identifying the more coherent ordering. Our proposed model achieves state-of-the-art performance on two benchmark datasets for this task, outperforming several classical approaches and recent data-driven approaches.

Addressing the more challenging task of ordering a given collection of sentences, we consider the novel and interesting task of ordering sentences from abstracts of scientific articles. Our model strongly outperforms previous work on this task. We visualize the learned sentence representations and show that our model captures high-level discourse structure. We provide visualizations that help understand what information in the sentences the model uses to identify the next sentence. Finally, we demonstrate that our ordering model learns coherence properties and text representations that are useful for several downstream tasks including summarization, sentence similarity and paraphrase detection. In summary, our key contributions are as follows:
• We propose an end-to-end trainable model based on the set-to-sequence framework to address the problem of coherently ordering a collection of sentences.
• We consider the novel task of understanding structure in abstract paragraphs and demonstrate state-of-the-art results in order discrimination and sentence ordering tasks.
• We show that our model learns sentence representations that perform comparably to recent unsupervised pre-training methods on downstream tasks.

Related Work
Coherence modeling & sentence ordering.
Coherence modeling and sentence ordering have been approached by closely related techniques. Many approaches propose a measure of coherence and formulate the ordering problem as finding an order with maximal coherence. Recurring themes from prior work include linguistic features, centering theory, and local and global coherence.

Local coherence has been modeled by considering properties of local windows of sentences, such as sentence similarity and transition structure. Lapata (2003) represents sentences by vectors of linguistic features and learns the transition probabilities between features of adjacent sentences. The Entity-Grid model (Barzilay and Lapata 2008) captures local coherence by modeling patterns of entity distributions. Sentences are represented by the syntactic roles of entities appearing in the document, and features extracted from the entity grid are used to train a ranking SVM. These two methods are motivated by centering theory (Grosz, Weinstein, and Joshi 1995), which states that nouns and entities in coherent discourses exhibit certain patterns.

Global models of coherence typically use HMMs to model document structure. The content model (Barzilay and Lee 2004) represents topics in a particular domain as states in an HMM. State transitions capture possible presentation orderings within the domain, and the words of a sentence are modeled using a topic-specific language model. The content model inspired several subsequent works that combine the strengths of local and global models. Elsner, Austerweil, and Charniak (2007) combine the entity grid and the content model using a non-parametric HMM. Soricut and Marcu (2006) use several models as feature functions and define a log-linear model to assign probability to a text. Louis and Nenkova (2012) model the intentional structure in documents using syntax features.

Unlike previous approaches, we do not use any handcrafted features and adopt an embedding-based approach. Local coherence is taken into account by a next-sentence prediction component in our model, and global dependencies are naturally captured by an RNN. We demonstrate that our model can capture both logical and topical structure on several evaluation benchmarks.
Data-driven approaches.
Neural approaches have gained attention recently. Li and Hovy (2014) model sentences as embeddings derived from recurrent neural nets and train a feed-forward neural network that takes an input window of sentence embeddings and outputs a probability representing the coherence of the sentence window. Coherence evaluation is performed by sliding the window over the text and aggregating the scores. Li and Jurafsky (2016) study the same model in a larger scale task and also use a sequence-to-sequence approach in which the model is trained to generate the next sentence given the current sentence and vice versa. Nguyen and Joty (2017) learn to model coherence using a convolutional network that operates on the Entity-Grid representation of an input document. These models are limited by their local nature; our experiments show that considering larger contexts is beneficial.
Hierarchical RNNs for document modeling.
Word-level and sentence-level RNNs have been used in a hierarchical fashion for modeling documents in prior work. Li, Luong, and Jurafsky (2015) proposed a hierarchical autoencoder for generation and summarization applications. More relevant to our work is a similar model considered by Lin et al. (2015): a sentence-level RNN predicts the bag of words in the next sentence given the previous sentences, and a word-level RNN predicts the word sequence conditioned on the sentence RNN hidden state. Our model has a hierarchical structure similar to these models, but takes a discriminative approach.
Combinatorial optimization with RNNs.
Vinyals, Bengio, and Kudlur (2015) equip sequence-to-sequence models with the ability to handle input and output sets, and discuss experiments on sorting, language modeling and parsing. This is called the read, process and write (or set-to-sequence) model. The read block maps input tokens to a fixed-length vector representation. The process block is an RNN encoder which, at each time-step, attends to the input token embeddings and computes an attention readout, appending it to the current hidden state. The write block is an RNN which produces the target sequence conditioned on the representation produced by the process block. Their goal is to show that input and output orderings can matter in these tasks, which is demonstrated using small-scale experiments. Our work exploits this framework to address the challenging problem of modeling logical and hierarchical structure in text. Vinyals, Fortunato, and Jaitly (2015) proposed pointer networks for combinatorial optimization problems where the output dictionary size depends on the number of input elements. We use a pointer network as the decoder to sequentially pick the next sentence.

Approach
Our proposed model is inspired by the way a human would solve this task. First, the model reads the sentences to capture their meaning and the general context of the paragraph. Given this knowledge, the model tries to pick the sentences one by one sequentially till exhaustion.

Our model is based on the read, process and write framework of Vinyals, Bengio, and Kudlur (2015) briefly discussed in the previous section. We use the more common encoder-decoder terminology in the following discussion.
Figure 1: Model Overview: The input set of sentences is represented as vectors using a sentence encoder. At each time step of the model, attention weights are computed for the sentence embeddings based on the current hidden state. The encoder uses the attention probabilities to compute the input for the next time-step and the decoder uses them for prediction.

The model is comprised of a sentence encoder RNN, an encoder RNN and a decoder RNN (Fig. 1). The sentence encoder takes as input the words of a sentence s sequentially and computes an embedding representation of the sentence. Henceforth, we use s to refer to a sentence or its embedding interchangeably. The embeddings {s_1, s_2, ..., s_n} of a given set of n sentences constitute the sentence memory, available to be accessed by subsequent components.

The encoder LSTM is identical to the originally proposed process block, defined by Eqs. 1-4. At each time step the input to the LSTM is computed by taking a weighted sum over the memory elements, the weights being attention probabilities obtained using the previous hidden state as query (Eqs. 1, 2). This is iterated for a fixed number of times, called the read cycles. Intuitively, the model identifies a soft input order in which to read the sentences. As described in Vinyals, Bengio, and Kudlur (2015), the encoder has the desirable property of being invariant to the order in which the sentence embeddings reside in the memory.

e^{t,i}_{enc} = f(s_i, h^t_{enc}); \quad i \in \{1, ..., n\}   (1)
a^t_{enc} = \mathrm{Softmax}(e^t_{enc})   (2)
s^t_{att} = \sum_{i=1}^{n} a^{t,i}_{enc} \, s_i   (3)
h^{t+1}_{enc}, c^{t+1}_{enc} = \mathrm{LSTM}(h^t_{enc}, c^t_{enc}, s^t_{att})   (4)

The decoder is a pointer network that takes a similar form with a few differences (Eqs. 5-7). The LSTM takes the embedding of the previous sentence as input instead of the attention readout. At training time the correct order of sentences (s_{o_1}, s_{o_2}, ..., s_{o_n}) = (x_1, x_2, ..., x_n) is known (o represents the correct order) and x_{t-1} is used as the input. At test time the predicted assignment \hat{x}_{t-1} is used instead. The attention computation is identical to that of the encoder, but now a^{t,i}_{dec} is interpreted as the probability of s_i being the correct sentence choice at position t, conditioned on the previous sentence assignments, p(S_t = s_i | S_1, ..., S_{t-1}). The initial state of the decoder LSTM is initialized with the final hidden state of the encoder, and x_0 is a vector of zeros.

h^t_{dec}, c^t_{dec} = \mathrm{LSTM}(h^{t-1}_{dec}, c^{t-1}_{dec}, x_{t-1})   (5)
e^{t,i}_{dec} = f(s_i, h^t_{dec}); \quad i \in \{1, ..., n\}   (6)
a^t_{dec} = \mathrm{Softmax}(e^t_{dec})   (7)

Scoring Function. We consider two choices for the scoring function f in Eqs. 1, 6. The first (Eq. 8) is a single hidden layer feed-forward net that takes s, h as inputs (W, b, W', b' are learnable parameters). The structure of f is similar to the window network of Li and Hovy (2014). While they used a local window of sentences to capture context, this scoring function exploits the entire history of sentences encoded in the RNN hidden state to score candidates for the next sentence.

f(s, h) = W' \tanh(W[s; h] + b) + b'   (8)

We also consider a bilinear scoring function (Eq. 9). Compared to the previous scoring function, this takes a generative approach, regressing the next sentence given the current hidden state (Wh + b) and enforcing that it be most similar to the correct next sentence. We observed that this scoring function led to better sentence representations (Sec. 4.4).

f(s, h) = s^T (Wh + b)   (9)
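To make the computation above concrete, the following is a minimal PyTorch-style sketch of the set encoder read cycles (Eqs. 1-4), the pointer-network decoding step (Eqs. 5-7) and the MLP scoring function (Eq. 8). The class and method names, the single-document (batch size 1) treatment and the omission of the word-level sentence encoder are our own simplifications; this is an illustration under stated assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SetToSequenceOrderer(nn.Module):
    """Sketch of the set encoder (Eqs. 1-4) and pointer-network decoder (Eqs. 5-7)
    with the MLP scoring function of Eq. 8. Hypothetical names; batch size 1."""

    def __init__(self, dim=1000, score_hidden=500, read_cycles=10):
        super().__init__()
        self.read_cycles = read_cycles
        self.enc_cell = nn.LSTMCell(dim, dim)   # "process" block (Eq. 4)
        self.dec_cell = nn.LSTMCell(dim, dim)   # pointer-network decoder (Eq. 5)
        # f(s, h) = W' tanh(W [s; h] + b) + b'  (Eq. 8)
        self.W = nn.Linear(2 * dim, score_hidden)
        self.W_prime = nn.Linear(score_hidden, 1)

    def score(self, memory, h):
        # memory: (n, dim) sentence embeddings; h: (dim,) query hidden state (Eqs. 1, 6)
        h_rep = h.unsqueeze(0).expand(memory.size(0), -1)
        return self.W_prime(torch.tanh(self.W(torch.cat([memory, h_rep], dim=1)))).squeeze(1)

    def encode(self, memory):
        # Read cycles: an attention readout over the sentence memory feeds the encoder LSTM.
        h = memory.new_zeros(1, memory.size(1))
        c = memory.new_zeros(1, memory.size(1))
        for _ in range(self.read_cycles):
            a = F.softmax(self.score(memory, h.squeeze(0)), dim=0)      # Eq. 2
            s_att = (a.unsqueeze(1) * memory).sum(dim=0, keepdim=True)  # Eq. 3
            h, c = self.enc_cell(s_att, (h, c))                         # Eq. 4
        return h, c  # used to initialize the decoder

    def decode_step(self, memory, prev_sentence, state):
        # One pointer step: attend over the memory to get a distribution over
        # candidate next sentences, conditioned on the previous pick (Eqs. 5-7).
        h, c = self.dec_cell(prev_sentence.unsqueeze(0), state)
        log_probs = F.log_softmax(self.score(memory, h.squeeze(0)), dim=0)
        return log_probs, (h, c)
```

The log-probabilities returned by decode_step are what a search procedure (beam search in the experiments below) operates on; the bilinear score of Eq. 9 would simply replace the score method.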
Contrastive Sentences. In its vanilla form, we found that the set-to-sequence model tends to rely on certain word clues to perform the ordering task. To encourage holistic sentence understanding, we add a random set of sentences to the sentence memory when the decoder makes classification decisions. This makes the problem more challenging for the decoder, since it now has to distinguish between sentences that are relevant and irrelevant to the current context when identifying the correct sentence.
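A rough sketch of this augmentation is shown below; the number of distractors and the pool they are sampled from are our assumptions, since the text does not specify them.

```python
import random
import torch


def augment_with_contrastive(memory, corpus_embeddings, num_distractors=5):
    """Append randomly sampled distractor sentence embeddings to the memory the
    decoder scores against. Target indices still refer only to the original rows;
    the distractor count and sampling pool are hypothetical choices."""
    idx = random.sample(range(corpus_embeddings.size(0)), num_distractors)
    return torch.cat([memory, corpus_embeddings[idx]], dim=0)
```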
Coherence modeling. We define the coherence score of an arbitrary partial or complete assignment (s_{p_1}, ..., s_{p_k}) to the first k sentence positions as

\sum_{i=1}^{k} \log p(S_i = s_{p_i} \mid S_{1,...,i-1} = s_{p_1,...,p_{i-1}})   (10)

where S_1, ..., S_k are random variables representing the sentence assignments to positions 1 through k. The conditional probabilities are derived from the network. This is our measure for comparing the coherence of different renderings of a document. It is also used as a heuristic during decoding.

Training Objective. The model is trained using the maximum likelihood objective

\max \sum_{x \in D} \sum_{t=1}^{|x|} \log p(x_t \mid x_1, ..., x_{t-1})   (11)

where D denotes the training set and each training instance is given by an ordered document of sentences x = (x_1, ..., x_{|x|}).
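Below is a small sketch of how the coherence score of Eq. 10, greedy decoding, and the teacher-forced loss implied by Eq. 11 can be computed, reusing the hypothetical SetToSequenceOrderer interface from the earlier sketch. The actual decoding in this work uses beam search with the coherence score as the heuristic; the functions and their names here are illustrative only.

```python
import torch


def coherence_score(model, memory, order):
    """Sum of log p(S_i = s_{p_i} | previous picks) over a given ordering (Eq. 10)."""
    prev = torch.zeros(memory.size(1))      # x_0 is a vector of zeros
    state = model.encode(memory)            # decoder initialized with encoder state
    total = 0.0
    for idx in order:
        log_probs, state = model.decode_step(memory, prev, state)
        total += log_probs[idx].item()
        prev = memory[idx]
    return total


def greedy_order(model, memory):
    """Pick the highest-probability unused sentence at each step."""
    prev = torch.zeros(memory.size(1))
    state = model.encode(memory)
    remaining, order = set(range(memory.size(0))), []
    while remaining:
        log_probs, state = model.decode_step(memory, prev, state)
        idx = max(remaining, key=lambda i: log_probs[i].item())
        order.append(idx)
        remaining.remove(idx)
        prev = memory[idx]
    return order


def training_loss(model, memory):
    """Teacher-forced negative log-likelihood of the correct order (Eq. 11);
    `memory` holds the sentence embeddings in their gold order."""
    prev = torch.zeros(memory.size(1))
    state = model.encode(memory)
    loss = 0.0
    for t in range(memory.size(0)):
        log_probs, state = model.decode_step(memory, prev, state)
        loss = loss - log_probs[t]   # gold next sentence is position t
        prev = memory[t]             # teacher forcing
    return loss
```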
Experiments
We first consider the order discrimination task that has been widely used in the literature for evaluating coherence models. We then consider the more challenging ordering problem, where a coherent order for a given collection of sentences needs to be determined. We then demonstrate that our ordering model learns coherence properties useful for summarization. Finally, we show that our model learns sentence representations that are useful for downstream applications.

For all tasks discussed in this section we train the model with the maximum likelihood objective on the training data relevant to the task. We used the single hidden layer MLP scoring function for the order discrimination and sentence ordering tasks. Models are trained end-to-end. We use pre-trained 300-dimensional GloVe word embeddings (Pennington, Socher, and Manning 2014) to initialize word vectors. All LSTMs use a hidden layer size of 1000 and the MLP in Eq. 8 has a hidden layer size of 500. The number of read cycles in the encoder is set to 10. The same model architecture is used across all experiments. We used the Adam optimizer (Kingma and Ba 2014) with batch size 10 and learning rate 5e-4. The model is regularized using early stopping. Hyperparameters were chosen using the validation set.
Order Discrimination
The ordering problem is traditionally formulated as a binary classification task: given a reference paragraph and its permuted version, identify the more coherent one (Barzilay and Lapata 2008). The datasets widely used for this task in previous work are the Accidents and Earthquakes news reports. In each of these datasets the training and test sets include 100 articles and approximately 20 permutations of each article.

In Table 1 we compare our results with traditional approaches and recent data-driven approaches. The entity grid model provides a strong baseline on the Accidents dataset, only outperformed by our model and Li and Jurafsky (2016). On the Earthquakes data the window approach of Li and Jurafsky (2016) performs strongly. Our approach outperforms prior models on both datasets, achieving near perfect performance on the Earthquakes dataset.

Table 1: Mean accuracy comparison on the Accidents and Earthquakes data for the order discrimination task. The reference models are Entity-Grid (Barzilay and Lapata 2008), HMM (Louis and Nenkova 2012), Graph (Guinaudeau and Strube 2013), Window network (Li and Hovy 2014) and sequence-to-sequence (Li and Jurafsky 2016).

              Entity-Grid   HMM     Graph   Window (Recurrent)   Window (Recursive)   Seq2seq   Ours
Accidents     0.904         0.842   0.846   0.840                0.864                0.930
Earthquakes   0.872         0.957   0.635   0.951                0.976                0.992
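Under this protocol, evaluation reduces to comparing coherence scores of the original and permuted documents. A minimal sketch follows, reusing the hypothetical coherence_score function from the earlier sketch; note the benchmark uses roughly 20 permutations per article rather than the single random permutation drawn here.

```python
import random


def order_discrimination_accuracy(model, documents):
    """Fraction of (original, permuted) pairs where the original order receives
    the higher coherence score (Eq. 10). `documents` holds sentence-embedding
    matrices in their correct order."""
    correct = 0
    for memory in documents:
        gold = list(range(memory.size(0)))
        permuted = gold[:]
        random.shuffle(permuted)
        if coherence_score(model, memory, gold) > coherence_score(model, memory, permuted):
            correct += 1
    return correct / len(documents)
```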
While these datasets have been widely used, they are quite formulaic in nature and are no longer challenging. We hence turn to the more challenging task of ordering a given collection of sentences to make a coherent document.

Sentence Ordering
In this task we directly address the ordering problem. We do not assume the availability of a set of candidate orderings to choose from and instead find a good ordering from all possible permutations of the sentences.

The difficulty of the ordering problem depends on the nature of the text, as well as the length of the paragraphs considered. Evaluation on text from arbitrary sources makes it difficult to interpret the results, since it may not be clear whether to attribute the observed performance to a deficient model or to ambiguity in next-sentence choices due to many plausible orderings.

Text summaries are a suitable source of data for this task. They often exhibit a clear flow of ideas and have little redundancy. We specifically look at abstracts of conference papers and research proposals. This data has several favorable properties. First, abstracts usually have a particular high-level format: they begin with a brief introduction, a description of the problem and the proposed approach, and conclude with performance remarks. This allows us to identify whether the model can capture high-level logical structure. Second, abstracts have an average length of about 10 sentences, making the ordering task more accessible. This also gives us a significant amount of data to train and test our models.

We use the following sources of abstracts for this task.
• NIPS Abstracts. We consider abstracts from NIPS papers in the past 10 years. We parsed 3280 abstracts from paper PDFs and obtained 3259 abstracts after omitting erroneous extracts. The dataset was split into years 2005-2013 for training and 2014, 2015 respectively for validation and testing.
• ACL Abstracts. A second source of abstracts are papers from the ACL Anthology Network (AAN) corpus (Radev et al. 2009). We extracted 12,157 abstracts from the text parses using simple keyword matching for the strings 'Abstract' and 'Introduction'. We use all extracts of papers published up to year 2010 for training, year 2011 for validation and years 2012-2013 for testing.
• NSF Abstracts. We also used the NSF Research Award Abstracts dataset (Lichman 2013). It comprises abstracts from a diverse set of scientific areas, in contrast to the previous two sources of data, and the abstracts are also lengthier, making this dataset more challenging. Years 1990-1999 were used for training, 2000 for validation and 2001-2003 for testing. We capped the parses of the abstracts to a maximum length of 40 sentences. Unsuccessful parses and parses of excessive length were discarded.

Further details about the data are provided in the supplement. The following metrics are used to evaluate performance.
Accuracy measures how often the absolute position of a sentence was correctly predicted. Kendall's tau (τ) is computed as 1 - 2N/\binom{n}{2}, where N is the number of pairs in the predicted sequence with incorrect relative order and n is the sequence length. Lapata (2006) discusses that this metric reliably correlates with human judgements.
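The two metrics follow directly from their definitions; the small sketch below (function names are ours) assumes predicted and gold are sequences of sentence identifiers.

```python
from itertools import combinations


def position_accuracy(predicted, gold):
    """Fraction of sentences placed at exactly their correct absolute position."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def kendall_tau(predicted, gold):
    """tau = 1 - 2*N / C(n, 2), where N counts sentence pairs whose relative
    order in `predicted` disagrees with `gold`."""
    n = len(gold)
    rank = {s: i for i, s in enumerate(predicted)}
    discordant = sum(1 for a, b in combinations(gold, 2) if rank[a] > rank[b])
    return 1.0 - 2.0 * discordant / (n * (n - 1) / 2)
```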
The following baselines are used for comparison:
• Entity Grid. Our first baseline is the Entity Grid model of Barzilay and Lapata (2008). We use the Stanford parser (Klein and Manning 2003) and the Brown Coherence Toolkit (bitbucket.org/melsner/browncoherence) to derive Entity Grid representations. A ranking SVM is trained to score correct orderings higher than incorrect orderings as in the original work. We used 20 permutations per document as training data. Since the entity grid only provides a means of feature extraction, we evaluate the model in the ordering setting as follows. We choose 1000 random permutations for each document, one of them being the correct order, and pick the order with maximum coherence. We experimented with transitions of length at most 3 in the entity grid.
• Seq2seq. The second baseline we consider is a sequence-to-sequence model which is trained to predict the next sentence given the current sentence. Li and Jurafsky (2016) consider similar methods and our model is the same as their uni-directional model. These methods were shown to yield sentence embeddings that have competitive performance in several semantic tasks in Kiros et al. (2015).
• Window Network. We consider the window approach of Li and Hovy (2014) and Li and Jurafsky (2016), which demonstrated strong performance in the order discrimination task, as our third baseline. We adopt the same coherence score interpretation considered by the authors. In both the above models we consider a special embedding vector which is padded at the beginning of a paragraph and learned during training. This vector allows us to identify the initial few sentences during greedy decoding.
• RNN Decoder. Another baseline is our proposed model without the encoder. The decoder hidden state is initialized with zeros. We observed that using a special start symbol, as for the other baselines, helped obtain better performance with this model. However, a start symbol did not help when the model is equipped with an encoder, as the hidden state initialization alone was good enough.

We do not place emphasis on the particular search algorithm in this work and thus use beam search with the coherence score heuristic for all models. A beam size of 100 was used. During decoding, sentence candidates that have already been chosen are pruned from the beam. All RNNs use a hidden layer size of 1000. For the window network we used a window size of 3 and a hidden layer size of 2000. We initialize all models with pre-trained GloVe word embeddings.

We assess the performance of our model against baseline methods in Table 2. The window network performs strongly compared to the other baselines. Our model does better by a significant margin by exploiting global context, demonstrating that global context is important in this task.

Table 2: Comparison against prior methods on the abstracts data.

                                           NIPS Abstracts      AAN Abstracts       NSF Abstracts
                                           Accuracy   τ        Accuracy   τ        Accuracy   τ
Random                                     15.59      0        19.36      0        9.46       0
Entity Grid (Barzilay and Lapata 2008)     20.10      0.09     21.82      0.10     -          -
Seq2seq (Uni) (Li and Jurafsky 2016)       27.18      0.27     36.62      0.40     13.68      0.10
Window network (Li and Hovy 2014)          41.76      0.59     50.87      0.65     18.67      0.28
RNN Decoder                                48.22      0.67     52.06      0.66     25.79      0.48
Proposed model

While the Entity-Grid model has been fairly successful for the order discrimination task in the past, we observe that it fails to discriminate between a large number of candidates. One reason could be that the feature representation is less sensitive to local changes in sentence order (such as swapping adjacent sentences).
The computational expense of obtaining parse trees and constructing grids on a large amount of data prohibited experimenting with this model on the NSF abstracts data.

The Seq2seq model performs worse than the window network. Interestingly, Li and Jurafsky (2016) observe that the Seq2seq model outperforms the window network in an order discrimination task on Wikipedia data. However, the Wikipedia data considered in their work is an order of magnitude larger than the datasets considered here, which could have helped the generative model. These models are also expensive during inference since they involve computing and sampling from word distributions.

Fig. 2 shows t-SNE embeddings of sentence representations learned by our sentence encoder. These are sentences from the test sets, color coded by their position in the source abstract.
Figure 2: t-SNE embeddings of representations learned by the model for sentences from the test set: (a) NIPS Abstracts, (b) AAN Abstracts, (c) NSF Abstracts. Embeddings are color coded by the position (first sentence to last sentence) of the sentence in the document in which it appears.

This shows that our model learns high-level structure in the documents, generalizing well to unseen text. The structure is less apparent in the NSF dataset due to its data diversity and longer documents. While approaches based on the Barzilay and Lee (2004) model explicitly capture topics by discovering clusters in sentences, our neural approach implicitly discovers such structure.
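A visualization of this kind can be produced roughly as follows, assuming a matrix of sentence embeddings from the sentence encoder and the normalized position of each sentence; the t-SNE and plotting parameters are our own choices, not those used for Fig. 2.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def plot_sentence_embeddings(embeddings, positions):
    """Project sentence embeddings to 2D with t-SNE and color each point by the
    relative position of the sentence in its source abstract."""
    points = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(embeddings))
    plt.scatter(points[:, 0], points[:, 1], c=positions, cmap="viridis", s=5)
    plt.colorbar(label="relative position in abstract")
    plt.show()
```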
In this section we show that sentence ordering models learn coherence properties useful for summarization. We consider a variation of our model where the model takes a set of sentences from several documents as input and sequentially picks summary sentences until it predicts a special 'stop' symbol. A key distinction between this model and recent work (Cheng and Lapata 2016; Nallapati, Zhou, and Ma 2016) is that the input order of sentences is assumed to be unknown, making it applicable to multi-document summarization.

We train a model from scratch to perform extractive summarization in the above fashion. We then consider a model that is pre-trained on the ordering task and fine-tuned on the above task. The DailyMail and CNN datasets (Cheng and Lapata 2016) were used for experimentation. We use DailyMail for pre-training purposes and CNN for fine-tuning and evaluation. The labels in DailyMail are not used. We compare ROUGE scores of the two models in Table 3 under standard evaluation settings.

Table 3: Comparison on extractive summarization between models trained from scratch and models pre-trained with the ordering task.

                         ROUGE-1   ROUGE-2   ROUGE-L
Summary length = 75b
From scratch             18.29     47.56     12.79
Pre-train ord.           18.77     50.32     13.25
Summary length = 275b
From scratch             35.82     10.67     33.69
Pre-train ord.           36.47     10.99     34.27

We observe that the model pre-trained with the ordering task scores consistently better than the model trained from scratch. The results can be improved further by using larger news corpora. This shows that sentence ordering is an attractive unsupervised objective for exploiting large unlabelled corpora to improve summarization systems. It further shows that the coherence scores obtained from the ordering model correlate well with summary quality.
One of the original motivations for this work is the question of whether we can learn high-quality sentence representations by learning to model text coherence. To address this question we trained our model on a large number of paragraphs using the BookCorpus dataset (Kiros et al. 2015).

To evaluate the quality of sentence embeddings derived from the model, we use the evaluation pipeline of Kiros et al. (2015) for tasks that involve understanding sentence semantics. These evaluations are performed by training a classifier on top of the embeddings derived from the model (holding the embeddings fixed), so that the performance is indicative of the quality of sentence representations. We present a comparison for the semantic relatedness and paraphrase detection tasks in Table 4. Results for only uni-directional versions of models are discussed here for a fair comparison. Similar to the skip-thought (ST) paper, we train two models - one predicting the correct order in the forward direction and another in the backward direction. The numbers shown for the ordering model were obtained by concatenating the representations from the two models.

Table 4: Performance comparison for semantic similarity and paraphrase detection. The first row shows the best performing purely supervised methods. The last section shows our models.

                              SICK                      MSRP
                              r       ρ       MSE       (Acc)   (F1)
Supervised                    0.868   0.808   0.253     80.4    86.0
Uni-ST (Kiros et al. 2015)    0.848   0.778   0.287     73.0    81.9
Ordering model                0.807   0.742   0.356     72.3    81.1
+ BoW                         0.842   0.775   0.299     74.0    81.9
+ Uni-ST                      0.860   0.795   0.270     74.9    82.5

Concatenating the above representation with the bag-of-words representation (using the fine-tuned word embeddings) of the sentence further improves performance. This is because the ordering model can choose to pay less attention to specific lexical information and focus on high-level document structure; hence, the two representations capture complementary semantics. Adding ST features improves performance further. We observed that the bilinear scoring function and introducing contrastive sentences to the decoder improved the quality of learned representations significantly.

Our model has several key advantages over ST. ST has a word-level reconstruction objective and is trained with large softmax output layers.
This limits the vocabulary size and slows down training (they use a vocabulary size of 20k and report two weeks of training). Our model achieves comparable performance and does not have such a word reconstruction component. We train with a vocabulary of 400k words; the above results are based on a training time of two days on a GTX Titan X GPU.
We attempt to understand what text-level clues the model uses to perform the ordering task. Inspired by Li et al. (2015), we use gradients of prediction decisions with respect to words of the correct sentence as a proxy for the salience of each word. We feed sentences to the decoder in the correct order and at each time step compute the derivative of the score e (Eq. 6) of the correct next sentence s = (w_1, ..., w_n) with respect to its word embeddings. The importance of word w_i in correctly predicting s as the next sentence is defined as \|\partial e / \partial w_i\|. We assume the hidden states of the decoder to be fixed and only back-propagate gradients through the sentence encoder.

Table 5 shows visualizations of two abstracts. Darker shades correspond to higher gradient norms. In the first example the model appears to be using the word clues 'first', 'second' and 'third'. A similar observation was made by Chen, Qiu, and Huang (2016). In the second example we observe that the model pays attention to phrases such as 'We present' and 'We argue', which are typical of abstract texts. It also focuses on the word 'representation' appearing in the first two sentences. These observations link to centering theory, which states that entity distributions in coherent discourses exhibit certain patterns. The model implicitly learns these patterns with no syntax annotations or handcrafted features.

Table 5: Visualizing salient words (abstracts are from the AAN corpus).

In this paper, we propose a new method for semantic class induction. First, we introduce a generative model of sentences, based on dependency trees and which takes into account homonymy. Our model can thus be seen as a generalization of Brown clustering. Second, we describe an efficient algorithm to perform inference and learning in this model. Third, we apply our proposed method on two large datasets (10^8 tokens, 10^5 word types), and demonstrate that classes induced by our algorithm improve performance over Brown clustering on the task of semi-supervised supersense tagging and named entity recognition.

Representation learning is a promising technique for discovering features that allow supervised classifiers to generalize from a source domain dataset to arbitrary new domains. We present a novel, formal statement of the representation learning task. We argue that because the task is computationally intractable in general, it is important for a representation learner to be able to incorporate expert knowledge during its search for helpful features. Leveraging the Posterior Regularization framework, we develop an architecture for incorporating biases into representation learning. We investigate three types of biases, and experiments on two domain adaptation tasks show that our biased learners identify significantly better sets of features than unbiased learners, resulting in a relative reduction in error of more than 16% for both tasks, with respect to state-of-the-art representation learning techniques.

Conclusion
This work investigated the challenging problem of coherently organizing a set of sentences. Our RNN-based model performs strongly compared to baselines and prior work on sentence ordering and order discrimination tasks. We further demonstrated that it captures high-level document structure and learns useful sentence representations when trained on large amounts of data. Our approach to the ordering problem deviates from most prior work that uses handcrafted features. However, exploiting linguistic features for next-sentence classification can potentially further improve performance. Entity distribution patterns can provide useful features about named entities that are treated as out-of-vocabulary words. The ordering problem can be further studied at higher-level discourse units such as paragraphs, sections and chapters.
Acknowledgements
This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of IBM. We thank the UMich/IBM Sapphire team and Junhyuk Oh, Ruben Villegas, Xinchen Yan, Rui Zhang, Kibok Lee and Yuting Zhang for helpful comments and discussions.
References
Barzilay, R., and Elhadad, N. 2002. Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research.
Barzilay, R., and Lapata, M. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics.
Barzilay, R., and Lee, L. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. arXiv preprint cs/0405039.
Burstein, J.; Tetreault, J.; and Andreyev, S. 2010. Using entity-based features to model coherence in student essays. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 681-684. Association for Computational Linguistics.
Chen, X.; Qiu, X.; and Huang, X. 2016. Neural sentence ordering. arXiv preprint arXiv:1607.06952.
Cheng, J., and Lapata, M. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, 1422-1430.
Elsner, M.; Austerweil, J. L.; and Charniak, E. 2007. A unified local and global model for discourse coherence. In HLT-NAACL, 436-443.
Grosz, B. J.; Weinstein, S.; and Joshi, A. K. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics.
Guinaudeau, C., and Strube, M. 2013. Graph-based local coherence modeling. In ACL (1), 93-103.
Kiddon, C.; Zettlemoyer, L.; and Choi, Y. 2016. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, 3276-3284.
Klein, D., and Manning, C. D. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, 423-430. Association for Computational Linguistics.
Lapata, M. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, 545-552. Association for Computational Linguistics.
Lapata, M. 2006. Automatic evaluation of information ordering: Kendall's tau. Computational Linguistics.
Li, J., and Hovy, E. 2014. A model of coherence based on distributed sentence representation. In EMNLP, 2039-2048.
Li, J., and Jurafsky, D. 2016. Neural net models for open-domain discourse coherence. arXiv preprint arXiv:1606.01545.
Li, J.; Chen, X.; Hovy, E.; and Jurafsky, D. 2015. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.
Li, J.; Luong, M.-T.; and Jurafsky, D. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.
Lichman, M. 2013. UCI machine learning repository.
Lin, R.; Liu, S.; Yang, M.; Li, M.; Zhou, M.; and Li, S. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 899-907.
Louis, A., and Nenkova, A. 2012. A coherence model based on syntactic patterns. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 1157-1168. Association for Computational Linguistics.
Miltsakaki, E., and Kukich, K. 2004. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering.
Nallapati, R.; Zhou, B.; and Ma, M. 2016. arXiv preprint arXiv:1611.04244.
Nguyen, D. T., and Joty, S. 2017. A neural local coherence model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1320-1330.
Noroozi, M., and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 69-84. Springer.
Park, C. C., and Kim, G. 2015. Expressing an image stream with a sequence of natural sentences. In Advances in Neural Information Processing Systems, 73-81.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, 1532-43.
Radev, D. R.; Joseph, M. T.; Gibson, B.; and Muthukrishnan, P. 2009. A bibliometric and network analysis of the field of computational linguistics. Journal of the American Society for Information Science and Technology.
Soricut, R., and Marcu, D. 2006. Discourse generation using utility-trained coherence models. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, 803-810. Association for Computational Linguistics.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104-3112.
Vinyals, O.; Bengio, S.; and Kudlur, M. 2015. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems, 2674-2682.
Wang, X., and Gupta, A. 2015. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision.