Modeling Topical Coherence in Discourse without Supervision
Disha Shrivastava*
MILA, Université de Montréal, Montreal, Canada
[email protected]

Abhijit Mishra
IBM Research, Bangalore, India
[email protected]

Karthik Sankaranarayanan
IBM Research, Bangalore, India
[email protected]

*Work done as part of IBM Research, Bangalore.
Abstract
Coherence of text is an important attribute to be measured for both manually and automatically generated discourse, but well-defined quantitative metrics for it are still elusive. In this paper, we present a metric for scoring the topical coherence of an input paragraph on a real-valued scale by analyzing its underlying topical structure. We first extract all possible topics that the sentences of a paragraph of text are related to. Coherence of this text is then measured by computing: (a) the degree of uncertainty of the topics with respect to the paragraph, and (b) the relatedness between these topics. All components of our modular framework rely only on unlabeled data and WordNet, thus making it completely unsupervised, which is an important feature for general-purpose usage of any metric. Experiments are conducted on two datasets: a publicly available dataset for essay grading (representing human discourse), and a synthetic dataset constructed by mixing content from multiple paragraphs covering diverse topics. Our evaluation shows that the measured coherence scores are positively correlated with the ground truth for both datasets. Further validation of our coherence scores is provided by a human evaluation on the synthetic data, showing a significant agreement of 79.3%.
1 Introduction

Discourse coherence measurement has been an important task for evaluating human-generated text (Attali and Burstein, 2004; Crossley and McNamara, 2011; Taghipour and Ng, 2016) and text output produced by natural language generation (NLG) systems such as summarizers, descriptive question answering systems and automatic creative content generators. Measuring coherence is an essential need for the evaluation of NLG systems, which is currently impeded by the lack of robust quality-estimation metrics (Belz and Kilgarriff, 2006; Nenkova et al., 2007). A robust evaluation metric may not only evaluate and interpret NLG systems better, but can also contribute to better system designs (such as using the metric as a loss or reward function in risk minimization or reinforcement learning settings).

Example 1 (Coherent): The most important part of an essay is the thesis statement. The thesis statement introduces the argument of the essay. It also helps to create a structure for the essay. Therefore, one should always begin with a thesis statement while writing an essay.

Example 2 (Locally Incoherent): It also helps to create a structure for the essay. The thesis statement introduces the argument of the essay. The most important part of an essay is the thesis statement.

Example 3 (Topically Incoherent): The most important part of an essay is the thesis statement. Essays can be written on various topics from domains such as politics, sports, current affairs etc. I like to write about Cricket because it is the most popular team sport played at international level.

Table 1: Examples of topically coherent, locally incoherent, and topically incoherent paragraphs.

We propose a metric for topical coherence (referred to as coherence, henceforth) in paragraphs. Our work is motivated by the fact that automatic NLG systems (such as abstractive summarizers) often produce text with topics that may be quite unrelated to each other. For example, in the recently proposed work of See et al. (2017), the summary generated by the
POINTER-GEN+COVERAGE system (Figure 1 in See et al. (2017)) has a considerable topic shift (from "administration" to "winning the election"), a mistake that is more likely to be committed by machines than by humans.

Topical coherence differs from local coherence. To illustrate the difference, Table 1 shows examples of coherent, locally incoherent and topically incoherent paragraphs. Example 1 is quite coherent as it revolves around one central topic, i.e., importance of thesis statement. Example 2 is more incoherent compared to Example 1, because its sentences are not ordered naturally; however, it still talks about the same central topic as Example 1. Example 3, on the other hand, confusingly covers multiple topics such as importance of thesis statement, possible essay domains, and popular sports, which are quite unrelated to each other; hence, it is topically incoherent.

Our metric for topical coherence is designed to address two key aspects: (a) how effectively each sentence contributes to the topic(s) that the paragraph expresses, and (b) to what extent the topics expressed by the paragraph are related to each other. These aspects are tackled by independent modules, making our metric modular and interpretable. Gold labels for coherence scores are typically difficult to obtain, so the unsupervised nature of our framework, along with its simplicity, makes it convenient to employ. We perform experiments on different sets of a publicly available human-graded essay dataset, and on a synthetic dataset constructed by injecting incoherence into already available coherent paragraphs (possibly mimicking machine-generated discourse). Evaluation results show a positive correlation of the measured coherence scores with the gold-standard scores, making our system acceptable for coherence-based ranking of paragraphs. Additionally, human evaluation on a set of synthetic-data essays shows a significant agreement (79.3%), demonstrating the effectiveness of our proposed metric. Finally, a comparison with the supervised systems discussed in Barzilay and Lapata (2008) shows that our framework, even though unsupervised and less complex, can exhibit performance competitive with supervised systems that require a significant amount of training data and, in some cases, deep linguistic meta-information such as role labels, coreferences, dependency parses etc., to produce reasonable results.

2 Modeling Topical Coherence

2.1 Formulation

Based on the motivation laid out above, a prima facie formulation of the coherence of a paragraph (denoted as CS) can be proposed assuming that CS varies linearly and positively with: (a) the degree of certainty with which the topics expressed by the paragraph (denoted as T) are supported by its constituent sentences (denoted by the variable E(T)), and (b) the extent to which these topics are related to each other (denoted by the variable λ(T)). Mathematically,

    CS = \kappa \times E(T) \times \lambda(T)    (1)

For empirical purposes, the constant term κ can be set to 1. The two components of the above formula require the topics expressed by the paragraph to be extracted first. This process is explained below.

2.2 Defining Generic Topics

The first step before extracting the topics (which are abstract "concepts") is to define a wide variety of generic topics that could be mapped to any given paragraph. This one-time process is carried out by performing word clustering on large-scale, mixed-domain unlabeled data. The assumption here is that clusters extracted from large-scale data represent generic topics in the universal space of the English language. After defining generic topics, the next step is to infer the topics discussed in a given input paragraph. We explain each of these steps in detail below.

For word clustering, we consider embeddings learned on a large-scale corpus (Pennington et al., 2014) and cluster them via k-means clustering. The assumption here is that words connected to a particular topic would exhibit strong syntagmatic and paradigmatic relations. Since word embeddings are good at capturing such relations, topically connected words may eventually lie in the same cluster in the vector space. Each cluster in this space hence represents a topic. Topic models such as the ones based on Latent Dirichlet Allocation (Blei et al., 2003) are alternatives for topic extraction. However, it is well known that the behavior of topic models changes drastically based on the prior distribution, hyper-parameters and document processing applied (Chang et al., 2009); our basic LDA-based experiments also did not show good performance in this setting. Hence, we keep the topic extraction process simple by using word embeddings.
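To make this step concrete, the following is a minimal sketch of how such topic clusters could be built. It assumes pre-trained GloVe vectors in the standard whitespace-separated text format and uses scikit-learn's k-means; the file path and the value of K shown here are illustrative (Section 4.1 describes how K was actually chosen).

```python
# Minimal sketch: building generic "topic" clusters from pre-trained word
# embeddings, as described above. The path and K are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def load_glove(path):
    """Load GloVe vectors from the standard whitespace-separated text format."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(vecs)

words, vectors = load_glove("glove.6B.300d.txt")  # hypothetical local file
kmeans = KMeans(n_clusters=1000, n_init=10, random_state=0).fit(vectors)

# word -> topic-cluster id; each cluster id stands for one generic topic
word2topic = dict(zip(words, kmeans.labels_))
```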
2.3 Inferring Paragraph Topics

Each sentence in the input paragraph is POS-tagged and only nouns are selected for topic extraction. The intuition behind choosing only nouns is that nouns are most representative of the topic in a given sentence. For each noun, we determine the topic-cluster to which it belongs. To avoid noise, for each sentence we choose a dominant cluster by assigning cluster topic scores to all clusters identified within the sentence (the reason we extract topics for each sentence separately is that sentences are the atomic units that are capable of discussing topics independently of one another). The cluster topic score of a cluster is (a) directly proportional to the fraction of nouns within the sentence that belong to the cluster and to the mean pairwise cosine similarity of the nouns in the sentence, and (b) inversely proportional to the diameter of the cluster, i.e., the maximum distance between any two points in the cluster. The dominant cluster for a sentence is the one with the highest cluster topic score; a sketch of this scoring is shown below. After determining the dominant cluster, we take all points within the cluster and check for their existence in WordNet. This eliminates highly specific terms, jargon and named entities which could potentially add noise to the clusters. Therefore, each topic can now be represented as a subgraph of WordNet where the nodes in the subgraph are a bag of words representing the topic (topicBOW).
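A minimal sketch of the per-sentence dominant-cluster selection follows. It reuses word2topic from the previous snippet; the exact multiplicative combination of the three factors is our assumption, since the paper states only the proportionalities.

```python
# Sketch of the cluster topic score: proportional to the fraction of the
# sentence's nouns in the cluster and their mean pairwise cosine similarity,
# and inversely proportional to the cluster's diameter. The multiplicative
# combination below is an assumption; the paper states only proportionalities.
from itertools import combinations
from numpy.linalg import norm

def cosine(u, v):
    return float(u @ v / (norm(u) * norm(v)))

def dominant_cluster(nouns, word2vec, word2topic, cluster_diameter):
    """nouns: nouns of one sentence; cluster_diameter: precomputed
    dict mapping cluster id -> max pairwise distance within the cluster."""
    best, best_score = None, -1.0
    clusters = {word2topic[n] for n in nouns if n in word2topic}
    vecs = [word2vec[n] for n in nouns if n in word2vec]
    # mean pairwise cosine similarity of the sentence's nouns
    pairs = list(combinations(vecs, 2))
    mean_sim = sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 1.0
    for c in clusters:
        frac = sum(1 for n in nouns if word2topic.get(n) == c) / len(nouns)
        score = frac * mean_sim / cluster_diameter[c]
        if score > best_score:
            best, best_score = c, score
    return best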
Given the topics relevant to the paragraph, we now explain the two components E(T) and λ(T) of Equation 1 in the following subsections.

2.4 Estimating E(T)

For an input paragraph P with M sentences (s_1, s_2, ..., s_M), expressing a set of topics T = {t_1, t_2, ..., t_N}, the component E(T) can be expressed as:

    E(T) = -\sum_{i=1}^{N} p(t_i | P) \log p(t_i | P)    (2)

where p(t_i | P) = p(t_i | s_1, s_2, ..., s_M) represents the probability of the topic t_i conditioned on the paragraph; E(T) thus corresponds to the conditional entropy of the topics. Intuitively, if the sentences in the paragraph are well distributed across all the topics emerging from the paragraph, i.e., each sentence is somewhat related (even if loosely) to all the topics, the paragraph may exhibit more coherence, with less topic shift. On the other hand, if the paragraph can be divided in such a way that each segment is related to a unique topic, the paragraph will be less coherent. This is adequately captured by the conditional entropy formula, which rewards a smooth distribution p(T | P) and penalizes a sparse one. Moreover, this formulation of E(T) captures another essential aspect: if the number of topics in the paragraph becomes very large (it is highly improbable for sentences to discuss a large number of topics), the conditional probabilities p(t_i | P) are inherently pulled down, making the distribution sparse and thus reducing the entropy. So, when the number of topics grows, E(T) is penalized more.
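Given the distribution p(t_i | P), whose estimation is described next, Equation 2 is a one-liner; a minimal sketch:

```python
# Sketch: conditional entropy of the topic distribution (Equation 2).
import math

def topic_entropy(p_topic_given_para):
    """p_topic_given_para: list of probabilities p(t_i | P) summing to 1."""
    return -sum(p * math.log(p) for p in p_topic_given_para if p > 0.0)

# A smooth distribution scores higher than a sparse one:
print(topic_entropy([0.5, 0.5]))    # ~0.693
print(topic_entropy([0.99, 0.01]))  # ~0.056
```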
The term p(t_i | P) in Equation 2 can be expanded with the help of Bayes' rule as follows:

    p(t_i | P) = \frac{p(P | t_i)\, p(t_i)}{\sum_{j=1}^{N} p(P | t_j)\, p(t_j)}    (3)

Furthermore, the term p(P | t_i) can be expanded using the chain rule, as:

    p(P | t_i) = p(s_1, s_2, ..., s_M | t_i)
               = p(s_1 | t_i)\, p(s_2 | s_1, t_i) \cdots p(s_M | s_1, s_2, ..., s_{M-1}, t_i)
               = \prod_{k=1}^{M} p(s_k | C_{k-1}, t_i)    (4)

Here, C_{k-1} can be considered the context that appears before the k-th sentence. Since no topic in the paragraph can be given more importance than another, the probability term p(t_i) is uniformly distributed across T, i.e., p(t_i) = 1/N. For estimating p(s_k | C_{k-1}, t_i) without applying any simplifying assumptions, we follow the distributed bag-of-words (BOW) model of Le and Mikolov (2014). The idea is to train the distributed bag-of-words model on a large number of sentences (covering a good number of topics), and later use the model for inferring p(s_k | C_{k-1}, t_i). The inference-time snapshot of the model is given in Figure 1 and the probability estimation procedure is given in Algorithm 1.

[Figure 1: Architecture for estimating p(P, t_i)]

Algorithm 1 PROBABILITY ESTIMATION
  model ← TrainDistBOWModel
      ▷ TrainDistBOWModel trains a distributed representation model for bag-of-words with unlabeled data.
  C_0 ← null
  t_v ← InferBOWVector(t_i)
      ▷ InferBOWVector infers the encoded vector for topic t_i from the distributed representation model.
  p(P | t_i) ← 1
  for s_k ∈ P = s_1, s_2, ..., s_M do
      s_v ← InferBOWVector(s_k)
      C_k ← C_{k-1} ⊕ s_v
      C_curr ← C_k ⊕ t_v
      C_prev ← C_{k-1} ⊕ t_v
      p(s_k, C_{k-1}, t_i) ← InferProb(C_curr)
          ▷ InferProb produces the joint probability score of a given bag-of-words distributed representation.
      p(C_{k-1}, t_i) ← InferProb(C_prev)
      p(s_k | C_{k-1}, t_i) ← p(s_k, C_{k-1}, t_i) / p(C_{k-1}, t_i)
      p(P | t_i) ← p(P | t_i) × p(s_k | C_{k-1}, t_i)
  end for
  Output: p(P | t_i)
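The paper does not tie Algorithm 1 to a specific library. As one possible realization, the sketch below uses gensim's PV-DBOW Doc2Vec for the distributed BOW model; infer_prob is left as a user-supplied placeholder, since off-the-shelf Doc2Vec exposes vector inference but no direct joint-probability score, so some density estimate over inferred BOW vectors must stand in for InferProb.

```python
# Sketch of Algorithm 1 on top of a PV-DBOW model (Le and Mikolov, 2014),
# here realized with gensim's Doc2Vec(dm=0). `infer_prob` is a placeholder:
# gensim exposes vector inference but no joint-probability score, so a
# density model over inferred BOW vectors must be supplied by the user.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_dist_bow_model(sentences):
    docs = [TaggedDocument(words, [i]) for i, words in enumerate(sentences)]
    return Doc2Vec(docs, dm=0, vector_size=300, epochs=20)  # dm=0 -> PV-DBOW

def p_para_given_topic(model, sentences, topic_bow, infer_prob):
    """sentences: token lists s_1..s_M; topic_bow: tokens of topic t_i.
    infer_prob: callable mapping (model, bag of words) -> probability (assumed)."""
    context, p = [], 1.0
    for s_k in sentences:
        context_prev = list(context)
        context = context + s_k                              # C_k = C_{k-1} (+) s_k
        p_joint = infer_prob(model, context + topic_bow)     # p(s_k, C_{k-1}, t_i)
        p_ctx = infer_prob(model, context_prev + topic_bow)  # p(C_{k-1}, t_i)
        p *= p_joint / p_ctx                                 # p(s_k | C_{k-1}, t_i)
    return p                                                 # p(P | t_i), Eq. 4
```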
As discussed earlier in Section 2.3, since topics are also treated as bags of words (topicBOW), it is easy to do conditional inference using a bag-of-words based distributed representation model. In such a setting, both sentences and topics are treated as bags of words, and hence no special signal needs to be passed to the model with respect to the topic bag-of-words representations. We concede that bag-of-words based techniques are agnostic to within-sentence sequentiality and to the natural order of sentences. However, since modeling topical coherence involves inferring sentence-topic associations and does not aim to model within-sentence properties, not preserving sentence order does not adversely affect our goal. We now describe the second component of Equation 1.

2.5 Estimating λ(T)

The component λ(T) aims to capture the relatedness between the topics expressed by the paragraph. For two paragraphs with the same number of topics, the coherence score should be higher if the topics are strongly related to each other than if the correlation between them is weak. This lets us refrain from penalizing the coherence score of a paragraph that contains a large number of topics, provided those topics are strongly related. For example, a topical shift from the topic carnivore to mammals in general should be penalized less than one from carnivore to electronics. The inter-relatedness between topics is captured by λ(T), for which we rely on lexical knowledge networks such as WordNet (Fellbaum, 1998), which preserve various forms of conceptual-semantic and lexical relations between words and are well curated. Note that we do not opt for simplistic topic-relatedness measures such as the inter-cluster distance between topic-clusters, as such measures are significantly affected by bias in the data used for clustering and by the noise introduced by imperfect algorithm and parameter selection.

Each word in the topic bag-of-words (for the whole paragraph) is mapped to a node in WordNet (if its lemma exists in WordNet), according to its most frequent sense. The mapped nodes are then connected with each other through other intermediate nodes to form a subgraph. Figure 2 illustrates a subgraph extracted for Example 2 discussed in Section 1. From the WordNet subgraph, we compute λ(T) as follows:

    \lambda(T) = \frac{NodeSim(T) \times ND(T)}{TC(T) \times ED(T)}    (5)

[Figure 2: Sample WordNet subgraph extracted for calculating λ(T). Dashes indicate indirect connections.]

• NodeSim(T): This corresponds to the average similarity between the nodes representing the topics in the subgraph. It is obtained by calculating the average cosine similarity between the corresponding node embeddings, obtained via the TransE multi-relational embedding technique (Bordes et al., 2013). Applying TransE to a graph yields low-dimensional embeddings of the nodes that capture each node's relationships with the other nodes in a distributional manner. In our setting, a higher similarity between two node embeddings indicates a stronger inter-relationship between them. We compute the cosine similarity between the embeddings of each pair of nodes in the subgraph and then average the similarity scores.
• ND(T): This represents the average neighborhood degree of the nodes in the subgraph. Intuitively, a higher average neighborhood degree indicates a higher degree of connectedness amongst the nodes, indicating higher topic relatedness.

• ED(T): This denotes the edge density of the subgraph (see https://en.wikipedia.org/wiki/Dense_graph). The notion behind this measure is that if the topics are distantly placed in the WordNet graph (a case of higher incoherence), the generated subgraph will be denser, with more nodes and even more edges established through various WordNet relationships. So, a higher ED(T) should penalize λ(T) and the overall coherence score.

• TC(T): We define this term as:
    TC(T) = \frac{\#\text{edges in the subgraph}}{\#\text{edges in its transitive closure}}
Since the transitive closure of a graph captures node reachability, the term TC relates the direct edges of the subgraph to the overall reachability of its nodes. For a graph like WordNet, where there are limited relations between nodes, subgraphs with non-ambiguous reachable paths are indicative of stronger topic-relatedness. Hence, a higher TC score should penalize λ(T) and the overall coherence score.
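A minimal sketch of these four graph statistics over a WordNet subgraph, using networkx. The TransE node embeddings are assumed to be precomputed (e.g., the pre-trained WordNet embeddings of Bordes et al. (2013)), and the subgraph is treated as directed so that the transitive closure is well defined.

```python
# Sketch: the four ingredients of lambda(T) (Equation 5) over a WordNet
# subgraph. Assumes `sg` is a networkx.DiGraph and `node_emb` is a
# precomputed dict mapping node -> numpy TransE embedding.
from itertools import combinations
import networkx as nx
import numpy as np

def node_sim(topic_nodes, node_emb):
    """Average pairwise cosine similarity between topic-node embeddings."""
    sims = [float(np.dot(node_emb[u], node_emb[v]) /
                  (np.linalg.norm(node_emb[u]) * np.linalg.norm(node_emb[v])))
            for u, v in combinations(topic_nodes, 2)]
    return sum(sims) / len(sims)

def lambda_T(sg, topic_nodes, node_emb):
    nd = np.mean(list(nx.average_neighbor_degree(sg).values()))              # ND(T)
    ed = nx.density(sg)                                                      # ED(T)
    tc = sg.number_of_edges() / nx.transitive_closure(sg).number_of_edges()  # TC(T)
    return node_sim(topic_nodes, node_emb) * nd / (tc * ed)                  # Eq. 5
```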
We provide an end-to-end algorithm for coherence score calculation in Algorithm 2.

Algorithm 2 COHERENCE SCORE CALCULATION
  function CalculateCoherenceScores(paragraph P, WordNet graph G)
      T ← FindTopics(P)
      SG ← CreateSubGraph(G, T)
      E(T) ← calcEntropy(T, P)
      NodeSim(T) ← calcTransESimilarity(SG)
      ND(T) ← calcConnectivity(SG)
      ED(T) ← calcEdgeDensity(SG)
      TC(T) ← calcTCScore(SG)
      λ(T) ← (NodeSim(T) × ND(T)) / (ED(T) × TC(T))
      CS ← κ × E(T) × λ(T)
      return CS
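Putting the earlier sketches together, Algorithm 2 composes directly. The wrapper below reuses topic_entropy and lambda_T from the previous snippets; find_topics, create_subgraph and p_topic_given_para are placeholders for the topic-inference, subgraph-construction and Equation 3 steps described above.

```python
# Sketch: end-to-end coherence score (Algorithm 2), composing the pieces
# sketched earlier. `find_topics`, `create_subgraph` and `p_topic_given_para`
# are placeholders for the steps described in Sections 2.2-2.5.
KAPPA = 1.0  # the constant term of Equation 1

def coherence_score(paragraph, wordnet_graph, node_emb,
                    find_topics, create_subgraph, p_topic_given_para):
    topics = find_topics(paragraph)                             # T
    sg, topic_nodes = create_subgraph(wordnet_graph, topics)    # SG
    e_t = topic_entropy(p_topic_given_para(paragraph, topics))  # E(T), Eq. 2
    lam = lambda_T(sg, topic_nodes, node_emb)                   # lambda(T), Eq. 5
    return KAPPA * e_t * lam                                    # CS, Eq. 1
```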
3 Datasets

We carry out our experiments on two sets of data, as described below.

3.1 Human-Graded Essay Data

We take the Kaggle data released by the Hewlett Foundation for the task of Automated Essay Grading (Foundation, 2012). The dataset consists of eight essay sets corresponding to two types of essay prompts. We consider the persuasive/narrative/expository essay sets (i.e., essay set ids 1, 2, 7 and 8). Each essay is provided with the scores of two or more human experts. As gold labels, we take the resolved expert scores for essay sets 1, 7 and 8, and the mean of the domain1 and domain2 scores for essay set 2. The expert scores indicate the overall goodness of the essay in terms of coherence, cohesion, organization, language structure etc. Though the overall grades are not exact labels for coherence, coherence plays an influential role in grading the essays. Hence, showing a positive correlation with these human-graded essay scores can provide a validation of our coherence metric. From the Kaggle data we extracted 5870 essays in total, with the number of sentences per essay ranging from 1 to 84. More details on the dataset can be found in Table 2.

3.2 Synthetic Data

We created synthetic data based on the essays provided by The Louvain Corpus of Native English Essays (LOCNESS) (catholique de Louvain, 2017). The corpus consists of argumentative and literary essays written by British and American university students, on topics ranging from computers and biology to British society. We take the original essays and replace a fraction of each paragraph with sentences randomly chosen from essays on completely different topics. All original paragraphs are labeled with a coherence score of 1.0. The coherence scores of the synthesized variants are determined by the degree of incoherence introduced. For example, if 20% of the original paragraph is replaced with sentences from essays on a different topic, the coherence score is reduced by 20%; if the replaced sentences are extracted from essays on two different topics, the score is reduced by a further 20%. Since many of the essays are extremely long, we sampled 81 essays (including variants) that had fewer than 1000 words. In the dataset, essays are labeled with coherence scores of C = [1.0, 0.8, 0.6, 0.4] based on the above criteria, which are later treated as ground truth. We refrain from creating a dataset with manually labeled coherence scores, due to the subjective nature of the labeling task.

State-of-the-art data-driven NLG systems that generate discourse often mix different topics (cf. the abstractive summarization work discussed in the introduction). We try to mimic this by replacing portions of coherent paragraphs with sentences from other paragraphs discussing unrelated topics; this is the rationale behind creating such a dataset for evaluation.

4 Experiments and Results

4.1 Setup: Estimating E(T) and λ(T)

To cluster the GloVe word vectors, we experimented with different values of K for the k-means clustering algorithm, and finally chose K=1000 based on the values of the average inter-cluster and intra-cluster distances. The conditional probability inference model discussed in Section 2.4 was trained on a mixed-domain corpus, the UMBC WebBase corpus of 3 billion English words (http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/). The pre-computed TransE embeddings trained on the WordNet graph were obtained from Bordes et al. (2013). Our coherence scores lie in the real-valued range [0.1, 1000].
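The selection of K is described only at the level of average inter- and intra-cluster distances; a minimal sketch of such a sweep might look as follows, reusing the GloVe vectors loaded earlier (the candidate K values, and the implicit rule of picking a K whose clusters are tight yet well separated, are our assumptions).

```python
# Sketch: choosing K for k-means by comparing average intra-cluster distance
# (within-cluster spread) against average inter-cluster distance (separation
# of centroids). Candidate values and the decision rule are assumptions.
from itertools import combinations
import numpy as np
from sklearn.cluster import KMeans

def cluster_distances(vectors, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    # distance of each point to its own centroid, averaged over all points
    intra = np.mean(np.linalg.norm(vectors - km.cluster_centers_[km.labels_], axis=1))
    # average pairwise distance between centroids
    inter = np.mean([np.linalg.norm(a - b)
                     for a, b in combinations(km.cluster_centers_, 2)])
    return intra, inter

for k in (250, 500, 1000, 2000):  # illustrative candidate values
    intra, inter = cluster_distances(vectors, k)
    print(f"K={k}: avg intra={intra:.3f}, avg inter={inter:.3f}")
```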
4.2 Results

We obtained coherence scores CS using Algorithm 2 for each of the four sets of human-graded essays (referred to as Set 1, Set 2, Set 7 and Set 8 henceforth) and for the synthetic data (referred to as Synthetic). We computed Spearman's rank correlation coefficient between the gold labels and the calculated coherence scores to see how they are correlated. Since, for most practical purposes, the relative ranking of paragraphs by coherence matters more than their absolute coherence scores, we chose Spearman's rank correlation instead of Pearson's correlation. We also computed the rMSE between the scaled coherence scores and the gold labels for each of the five sets of data.
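Both evaluation statistics are standard; a sketch with scipy, where gold and predicted are 1-D numpy arrays of gold labels and coherence scores. The min-max normalization is our assumption, since the paper says only that the scores are scaled/normalized.

```python
# Sketch: the two evaluation statistics used above. Min-max normalization of
# both score vectors is an assumption; the paper only says "normalized".
import numpy as np
from scipy.stats import spearmanr

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

def evaluate(gold, predicted):
    rho, p_value = spearmanr(gold, predicted)  # Spearman's rank correlation
    rmse = float(np.sqrt(np.mean((minmax(predicted) - minmax(gold)) ** 2)))
    return rho, p_value, rmse
```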
Dataset     #Essays   Avg. #Sentences
Synthetic        81             27.38
Set 1          1783             22.77
Set 2          1800             20.36
Set 7          1569             11.71
Set 8           723             34.88

Table 2: Data statistics.
The results for the Spearman's rank correlation and rMSE between the coherence scores and the gold labels are reported in Table 3. As can be seen, in all cases our coherence scores obtain a positive correlation with the corresponding gold labels, suggesting that we are indeed modeling coherence, which plays an essential role in human essay grading. The low rMSE values indicate that the predicted coherence scores are quite acceptable.

Dataset     Correlation (p)     rMSE
Synthetic   0.417 (1e-4)        0.37
Set 1       0.502 (1.2e-114)    0.63
Set 2       0.433 (4.3e-83)     0.78
Set 7       0.411 (4.9e-65)     0.72
Set 8       0.283 (8.2e-15)     0.46

Table 3: Results for the synthetic and human essay datasets. Correlation → Spearman's rank correlation coefficient between measured and gold values of coherence, with statistical significance test values p. rMSE → root mean squared error between normalized measured and gold values of coherence. All correlation values are within 99% confidence (p < 0.01).

To see the importance of each component of our coherence score, we calculated the component-wise Spearman's rank correlation coefficients w.r.t. the gold labels for each of the five datasets (Table 4). As expected, entropy and average neighborhood degree are positively correlated, while the TC score and edge density are negatively correlated. The correlation with NodeSim(T) is mostly positive, though for essay sets 2 and 7 we get slightly negative correlations. This might be due to the specific nature of the prompts of these essay sets. Set 1 and Set 2 are persuasive essay prompts; hence, the responses may contain complex relations between words which might not be captured by the WordNet graph, which models only a few specific kinds of relations. The comparatively high correlation of the entropy component shows that it is an essential part of our coherence score formulation.

We ran our pipeline on Example 1 given in Table 1 (coherent) and on the example shown in Figure 3 (incoherent). We obtained coherence score values of . and . , respectively, for the two cases. In each case, out of all the topics obtained for the paragraph, we determined the topic membership of each sentence. Interestingly, for the second case, as shown in Figure 3, out of the two topics obtained for the paragraph, the first, third and fourth sentences belong to one topic (coloured blue), while the second sentence belongs to the second topic (coloured red), indicating a significant drift in topic and hence incoherence.

[Figure 3: Colour-coded representation of topic membership for the incoherent paragraph.]

4.3 Human Evaluation

We conducted a human evaluation on a random group of 27 essays (9 sets) from our synthetic dataset.
The task was to rank each paragraph within a set (1 original and 3 perturbed variants) based on coherence. We compared these rankings by human subjects with the ranking produced by our system based on the values of the coherence scores. We found an agreement of 79.3% between the two rankings, which suggests that the rankings produced by our system are acceptable.
4.4 Comparison with Supervised Approaches

Intrigued to see how our system performs compared to supervised techniques for measuring local coherence, we tested it on the Earthquakes and Accidents datasets released by Barzilay and Lapata (2008). We obtained coherence scores for the essays in the test sets of the two datasets; accuracy was then measured as the fraction of test pairs ranked correctly based on our coherence scores. If the relative difference between the coherence scores was less than a fixed tolerance, we marked the pair as positive. We obtained accuracies of and on the Earthquakes and Accidents datasets, respectively. These values are competitive with the reported accuracy figures of 81.4% and 86.0% (row Coreference-Syntax-Salience in Table 5 of Barzilay and Lapata (2008)) on the two datasets. Considering that our system is unsupervised and does not need additional complex meta-information such as dependency parses, coreferences, syntax and saliency information, which the best supervised systems use today, our system offers significant advantages and is more generalizable than the popular supervised techniques.
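The pairwise evaluation described above is simple to reproduce; a sketch follows, where the tolerance value is our assumption, since the paper fixes one but does not report it.

```python
# Sketch: pairwise ranking accuracy on (original, permuted) document pairs,
# as in the Barzilay and Lapata (2008) evaluation. The tolerance value is
# an assumption; the paper uses a fixed tolerance but does not report it.
def pairwise_accuracy(score_pairs, tolerance=0.05):
    """score_pairs: list of (cs_original, cs_permuted) coherence scores."""
    correct = 0
    for cs_orig, cs_perm in score_pairs:
        rel_diff = abs(cs_orig - cs_perm) / max(abs(cs_orig), abs(cs_perm), 1e-9)
        # a pair counts as positive if the original ranks higher, or if the
        # two scores lie within the tolerance of each other
        if cs_orig > cs_perm or rel_diff < tolerance:
            correct += 1
    return correct / len(score_pairs)
```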
5 Related Work

Dataset     E(T)     NodeSim(T)   ED(T)    TC(T)    ND(T)
Synthetic   0.255     0.107       -0.084   -0.015   0.042
Set 1       0.207    -0.015       -0.494   -0.497   0.516
Set 2       0.205    -0.115       -0.384   -0.438   0.496
Set 7       0.293     0.157       -0.427   -0.351   0.401
Set 8       0.080     0.142       -0.385   -0.280   0.356

Table 4: Spearman rank correlation between the different components of the coherence metric and the gold labels.

The importance of discourse coherence analysis and measurement was identified long ago by classical and computational linguists. Earlier works by Bamberg (1983), Ryan (1984) and McCulley (1985) formalize coherence and the properties of coherent discourses. There have been several works on automated essay grading (Attali and Burstein, 2004), modeling paragraph organization (Persing et al., 2010), automated text scoring (Alikaniotis et al., 2016), and measuring coherence quality (Somasundaran et al., 2014); our metric can certainly be used in these scenarios. In domains such as education, e-commerce, and judicial and compliance settings, many automatic scorers have been proposed over the last couple of decades. Higgins et al. (2004) and Miltsakaki and Kukich (2004) propose frameworks for measuring the text coherence of essays collected by ETS; since their data is not publicly available, a comparative study could not be carried out. Foltz et al. (1998) propose a coherence model using latent semantic analysis. Using various textual and grammatical properties of text, Crossley and McNamara (2011) implemented a statistical regression based system for essay scoring and ranking. Recent works on evaluating the holistic scores of essays rely on deep learning based techniques (Alikaniotis et al., 2016; Taghipour and Ng, 2016). However, relatively little work has been done on individual aspects of the essay, such as organization (Persing et al., 2010) and coherence and cohesion (Somasundaran et al., 2014).

A significant amount of research has been carried out on modeling sentence ordering and local coherence in paragraphs. Barzilay and Lapata (2008), in a pioneering work, modeled local coherence in paragraphs (a comparison with them is provided in Section 4.4 above); rank labels are predicted in a supervised setting with features extracted from the paragraphs based on an entity-grid model. Several works have addressed the problem of local coherence using the same (or similar) datasets: (a) an HMM-based approach considering syntactic patterns by Louis and Nenkova (2012), (b) a window-based approach by Li and Hovy (2014), (c) a sequence-to-sequence based approach by Li and Jurafsky (2017), and (d) a recurrent neural network based approach by Logeswaran et al. (2016). These approaches, unlike ours, are supervised, and some of them require complicated meta-linguistic information to be extracted through role labeling, coreference resolution, dependency parsing etc., thus requiring expensive additional resources.
6 Conclusion

In this paper, we presented a metric for scoring the topical coherence of natural language paragraphs on a real-valued scale. To measure topical congruency, our system first extracts a set of possible topics that emanate as the sentences in the paragraph unfold. Paragraph coherence is then measured as a product of two components capturing (a) paragraph-topic association and (b) topic relatedness. Experiments on two datasets of human-generated and automatically synthesized paragraphs reveal that the coherence scores produced by our system are positively correlated with the ground truth. An additional human evaluation on a subset of the synthetic dataset further supports our measure, showing a strong agreement between coherence-based rankings of paragraphs done by humans and by our system. Our framework is quite simple, unsupervised and highly modular, making it possible to interpret, evaluate and plug-and-play individual components. Moreover, our framework offers the flexibility to be extended trivially to other languages that have a WordNet. Our future agenda includes introducing additional relevant components of coherence measurement into our formulation. We would also like to apply our metric to optimize NLG systems for abstractive summarization and descriptive QA.
References
Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. 2016. Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 715–725, Berlin, Germany. Association for Computational Linguistics.

Yigal Attali and Jill Burstein. 2004. Automated essay scoring with e-rater v.2.0. ETS Research Report Series, 2004(2).

Betty Bamberg. 1983. What makes a text coherent? College Composition and Communication, 34(4):417–429.

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.

Anja Belz and Adam Kilgarriff. 2006. Shared-task evaluations in HLT: Lessons for NLG. In Proceedings of the Fourth International Natural Language Generation Conference, pages 133–135. Association for Computational Linguistics.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296.

Scott Crossley and Danielle McNamara. 2011. Text coherence and judgments of essay quality: Models of quality and coherence. In Proceedings of the Cognitive Science Society, volume 33.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Peter W. Foltz, Walter Kintsch, and Thomas K. Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25(2-3):285–307.

Hewlett Foundation. 2012. The Hewlett Foundation essay grading data.

Derrick Higgins, Jill Burstein, Daniel Marcu, and Claudia Gentile. 2004. Evaluating multiple aspects of coherence in student essays. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196.

Jiwei Li and Eduard Hovy. 2014. A model of coherence based on distributed sentence representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2039–2048.

Jiwei Li and Dan Jurafsky. 2017. Neural net models of open-domain discourse coherence. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 198–209.

Lajanugen Logeswaran, Honglak Lee, and Dragomir Radev. 2016. Sentence ordering using recurrent neural networks. arXiv preprint arXiv:1611.02654.

Annie Louis and Ani Nenkova. 2012. A coherence model based on syntactic patterns. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1157–1168. Association for Computational Linguistics.

Université catholique de Louvain. 2017. LOCNESS.

George A. McCulley. 1985. Writing quality, coherence, and cohesion. Research in the Teaching of English, pages 269–282.

Eleni Miltsakaki and Karen Kukich. 2004. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering, 10(1):25–55.

Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. 2007. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing (TSLP), 4(2):4.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Isaac Persing, Alan Davis, and Vincent Ng. 2010. Modeling organization in student essays. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 229–239. Association for Computational Linguistics.

Michael P. Ryan. 1984. Conceptions of prose coherence: Individual differences in epistemological standards. Journal of Educational Psychology, 76(6):1226.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Swapna Somasundaran, Jill Burstein, and Martin Chodorow. 2014. Lexical chaining for measuring discourse coherence quality in test-taker essays. In COLING, pages 950–961.

Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.