Extending Text Informativeness Measures to Passage Interestingness Evaluation (Language Model vs. Word Embedding)
Carlos-Emiliano González-Gallardo, Eric SanJuan, Juan-Manuel Torres-Moreno
LIA, Université d'Avignon et des Pays de Vaucluse, 74 Rue Louis Pasteur, 84029 Avignon, France
École Polytechnique de Montréal, 2900 Edouard Montpetit Blvd, Montreal, QC H3T 1J4, Canada
{carlos-emiliano.gonzalez-gallardo,eric.sanjuan,juan-manuel.torres}@univ-avignon.fr

Abstract.
Standard informativeness measures used to evaluate Automatic Text Summarization mostly rely on n-gram overlap between the automatic summary and the reference summaries. These measures differ in the metric they use (cosine, ROUGE, Kullback-Leibler, Logarithm Similarity, etc.) and in the bag of terms they consider (single words, word n-grams, entities, nuggets, etc.). Recent word embedding approaches offer a continuous alternative to discrete approaches based on the presence/absence of a text unit. Informativeness measures have been extended to Focused Information Retrieval evaluation, involving a user's information need represented by short queries. In particular, for the CLEF-INEX Tweet Contextualization task, tweet contents have been considered as queries. In this paper we define the concept of Interestingness as a generalization of Informativeness, whereby the information need is diverse and formalized as an unknown set of implicit queries. We then study the ability of state of the art Informativeness measures to cope with this generalization. We show that within this new framework, standard word embeddings outperform discrete measures only on uni-grams, whereas bi-grams appear to be a key point of Interestingness evaluation. Lastly, we show that the CLEF-INEX Tweet Contextualization 2012 Logarithm Similarity measure provides the best results.
Keywords:
Information Retrieval, Automatic Text Summarization, Evaluation, Informativeness, Interestingness

Introduction
Following Bellot et al. [2], we consider Informativeness in the context of Focused Retrieval (FR) [9]. Given a query representing a user information need, a FR system returns a ranked list of short passages extracted from a document collection. A user then reads a passage top to bottom and tags it as: 1) informative, 2) partially informative or 3) uninformative, depending on whether all, only parts, or no part of it contains useful information relevant to the query. This information can be factual and explicitly linked to the query, as in Question Answering (QA), or more abstract, providing some general background about the query. FR systems are evaluated according to the cumulative length of extracted informative passages (Precision) and their diversity (Recall).

Interestingness, by contrast, is a much broader concept used in Data Mining, where it is defined as "the power of attracting or holding attention" and relates to the ideas of lift and information gain used to mine very large sets of association rules between numerous features. Mining interesting associations between features is a complex interactive process. Unlike Information Retrieval (IR), in Interestingness there is no precise query to initiate the search. As expected, it has been shown that none of the numerous Interestingness measures proposed in Business Intelligence software captures all the associations that experts would consider useful [10]. Each measure grasps only one facet of Interestingness. However, Hilderman et al. defined in [8] five principles to be satisfied by this type of measure: minimal and maximal values, skewness, permutation invariance and transfer.

Reusing this concept of Interestingness, in the context of FR we define it as: a text passage that is clearly informative for some implicit user's information need.
More precisely, given a set of users and a set of passages that were considered interesting by at least one of these users, the task consists in finding new interesting passages not related to previous topics and implicit queries. We test the ability of state of the art FR Informativeness metrics to deal with this specific new task. Therefore, in this paper we consider short factual passages, most of them single sentences, instead of complete summaries. From a formal point of view, we consider text units like words, word n-grams, terms or entities as attributes, as in [8]. Passages S to be ranked by decreasing Interestingness are then represented as sets of unique tuples (ω, ω(S)), where ω is an attribute and ω(S) its frequency.

The data we used consists of a large pool of passages extracted from Wikipedia by state of the art FR systems [2] over 63 topics from Twitter. Each passage's real informativeness has been individually assessed by the CLEF-INEX Tweet Contextualization (TC@INEX, http://inex.mmci.uni-saarland.de/tracks/qa/) task organizers. These assessments resulted in a reference score scaled between 0 and 2, depending on the relative length of the informative part and the assessors' inter-agreement.

Based on this extended dataset, we analyze correlations between the reference Informativeness scores and a set of state of the art automatic Informativeness measures including F-score, Kullback-Leibler (KL) divergence, ROUGE, Logarithm Similarity (LogSim) and cosine measures. Because passages are extracted from Wikipedia, we also consider the restriction of these measures to anchor texts inside passages. Wikipedia anchors are references to related entities; anchors inside informative passages can be considered as informative nuggets annotated by Wikipedia contributors.

The rest of the paper is organized as follows: we first review state of the art Informativeness measures, then formalize Interestingness evaluation, describe the TC@INEX 2012 experimental setup, and present Informativeness and Interestingness results before concluding.

Informativeness Measures

Informativeness measures have been introduced in various contexts. Depending on the type of information unit that is used and the way this unit is represented, it is possible to separate them into language model approaches and continuous space approaches (word embeddings).

Language model (LM) approaches of Informativeness are based on smoothed probabilities over all text units. The most popular is ROUGE, which compares an automatically generated summary with a (human-produced) reference summary. ROUGE-N is defined as the average number of common term n-grams between the summary and a set of k ≥ 1 reference summaries [11]. In particular, ROUGE-1 computes the distribution of uni-grams; it counts the number of uni-grams (words or stems) that occur both in the reference summary and in the produced summary [27].

Apart from ROUGE-1, other variants of ROUGE are also available. While ROUGE-1 (uni-grams) considers each term independently, ROUGE-2 (bi-grams) considers sequences of two terms. Skip-grams, used in ROUGE-S, are pairs of consecutive terms with possible insertions between them; the number of skipped terms is a parameter of ROUGE-SU.

ROUGE-2 and ROUGE-SU4 were used to evaluate the generated summaries at the Document Understanding Conference (DUC) in 2005 [3]. ROUGE proved to correlate better with human judgments under readability assumptions than classical cosine measures [22]. Indeed, the various ROUGE variants were evaluated on the three years of DUC data in [13], showing that some ROUGE versions are more appropriate for specific contexts.
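To make the unit types concrete, the sketch below enumerates the uni-grams, bi-grams and skip-grams with a one-word gap that measures such as ROUGE-N and ROUGE-S count; the function names are ours, not taken from any ROUGE implementation.

# Sketch: enumerating the text units counted by n-gram overlap
# measures such as ROUGE-N and ROUGE-S; function names are ours.
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, gap=1):
    """Pairs of tokens separated by at most `gap` skipped tokens."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + gap, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "the tiger is a large cat".split()
print(Counter(ngrams(tokens, 2)))    # bi-gram frequencies
print(Counter(skipgrams(tokens)))    # skip-grams with a one-word gap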
ROUGE implies that reference summaries have been built by humans. In the case of a very large number of documents, it is easier to apply a measure that can be computed automatically to compare the content of the produced summary with that of the full set of documents. In this framework, measures such as the KL divergence and the Jensen-Shannon (JS) probability distribution divergence used in the Text Analysis Conference (TAC) [4] compare the distributions of words or word sequences between the produced summary and the original documents. In this approach, Informativeness relies on complex hidden word distributions and not on specific units.

In addition, it has been shown that the JS divergence correlates with human judgments in the same way as ROUGE when used to evaluate summaries built by passage extraction from the documents to be summarized [15,24]. JS is the symmetric variant of the KL divergence used in LMs for IR and in Latent Dirichlet Allocation to reveal latent concepts. Both measures aim at calculating the similarity between two smoothed conditional probabilistic distributions of terms P(ω|R) and P(ω|S), where R is a textual reference and S a short summary.

We can also cite the LogSim measure used in the TC@INEX 2012 task [26]. This measure was introduced because 1) there were no reference summaries produced by humans, 2) the automatically produced summaries were of various sizes and 3) some automatically produced summaries were too short. Although these various measures have been introduced to evaluate similar tasks such as Automatic Text Summarization (ATS), they are applied in different contexts. Moreover, all of them can be sensitive to the text unit that is being used. The influence of the type of text unit (uni-gram, bi-gram, skip-gram, etc.) has been evaluated for ROUGE in [13], as the units correspond to the ROUGE variants.

Informativeness measures can be classified into three main families. The most common approach to Informativeness likelihood in IR relies on the probabilistic KL divergence. In automatic summarization evaluation, the most common approaches are F_β and ROUGE similarity scores based on n-grams. These two families rely on exact term overlapping. A third family, based on word embedding representations, has recently introduced a vectorial approach less dependent on explicit term overlapping.

More formally, let Ω be a type of text unit and S a sentence whose Informativeness is to be evaluated against a textual reference R; we use the following three discrete metrics and one continuous approach.

Kullback-Leibler (KL) Divergence

We implement it as the expectation, based on the reference R, of the log difference between normalized frequencies in R and smoothed probabilities over S. We fix the smoothing parameter µ at its minimal value (µ = 1). This divergence is not normalized.

D_{KL}(R \| S) = \sum_{\omega \in \Omega(R)} \ln\left(\frac{P(\omega|R) \cdot (|S|+1)}{P(\omega|S) \cdot |S| + P(\omega|\Omega)}\right) \cdot P(\omega|R)    (1)
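As a minimal sketch (our transcription, not the authors' code), Equation 1 can be implemented over unit-count dictionaries, assuming a background model giving P(ω|Ω); with µ = 1 the smoothed probability of ω in S is (c(ω,S) + P(ω|Ω))/(|S| + 1), which yields the ratio inside the logarithm.

# Sketch of Equation 1 (our transcription, not the authors' code):
# KL divergence between reference R and sentence S over unit counts,
# with Dirichlet smoothing at mu = 1, so the smoothed probability of
# w in S is (count(w, S) + P(w|Omega)) / (|S| + 1).
import math
from collections import Counter

def kl_divergence(ref_counts, sent_counts, background):
    ref_len = sum(ref_counts.values())
    sent_len = sum(sent_counts.values())
    d = 0.0
    for w, c in ref_counts.items():
        p_ref = c / ref_len                                # P(w|R)
        p_sent = sent_counts.get(w, 0) / max(sent_len, 1)  # P(w|S)
        # Ratio of P(w|R) to the smoothed sentence probability:
        ratio = p_ref * (sent_len + 1) / (p_sent * sent_len + background[w])
        d += math.log(ratio) * p_ref
    return d

R = Counter("the tiger is a large wild cat".split())
S = Counter("the tiger is a cat".split())
bg = {w: 1e-4 for w in R}   # toy background model P(w|Omega)
print(kl_divergence(R, S, bg))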
Logarithm Similarity (LogSim)

Like the KL divergence, this is also an expectation based on the reference R, but of a normalized similarity that is only defined over Ω(S) ∩ Ω(R) [26]:

LS(S \| R) = \sum_{\omega \in \Omega(S) \cap \Omega(R)} e^{-\left|\ln\left(\frac{LR(\omega,S)}{LR(\omega,R)}\right)\right|} \cdot P(\omega|R)    (2)

where

LR(\omega, X) = \ln(1 + P(\omega|X) \cdot |R|)    (3)

F_β and ROUGE Scores

F_β measures the weighted harmonic mean between Precision (p) and Recall (r). It is normally expressed as

F_\beta = \frac{(\beta^2 + 1) \times p \times r}{\beta^2 \times p + r}    (4)

where β is the factor that controls the relative emphasis between p and r. If β = 1, it is possible to rewrite Equation 4 as:

F_1 = \frac{2 \times p \times r}{p + r}    (5)

The F1-score (F_1) is the most common normalized set-theoretic similarity, giving equal emphasis to p and r. To represent F_1 in terms of Ω, R and S, let us first rewrite p and r as:

p = \frac{|\Omega(S) \cap \Omega(R)|}{|\Omega(S)|}    (6)

r = \frac{|\Omega(S) \cap \Omega(R)|}{|\Omega(R)|}    (7)

Finally, Equation 5 is redefined as:

F_1(S | R) = \frac{2 \times |\Omega(S) \cap \Omega(R)|}{|\Omega(S)| + |\Omega(R)|}    (8)

As explained in [11], the idea behind all ROUGE metrics is to automatically determine the quality of a candidate summary by comparing it with reference summaries written by humans. The quality is obtained by comparing the number of overlapping n-grams, such as word sequences or word pairs, between the candidate and a set of reference summaries. ROUGE-N is defined as the average number of common n-grams between the candidate summary and a set of reference summaries:

\text{ROUGE-N} = \frac{\sum_{\omega \in \Omega(R)} \mathrm{Count}_{match}(\omega)}{\sum_{\omega \in \Omega(R)} \mathrm{Count}(\omega)}    (9)

where Count(ω) is the frequency of the n-gram ω and Count_match(ω) is the co-occurring frequency of the n-gram ω in R and in the candidate summary S. All ROUGE variants base their functionality on the lexical similarities between the candidate and the reference summaries; this is a problem when the candidate summary does not share the same words as the references, as is the case for abstractive summaries and summaries with a significant amount of paraphrasing.
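The following sketch, again our own reading of Equations 2, 3 and 8 rather than the official TC@INEX implementation, computes F1 and LogSim over unit-count dictionaries, treating Ω(S) ∩ Ω(R) as a multiset intersection.

# Sketch of Equations 2, 3 and 8 (our reading, not the official
# TC@INEX code): F1 and LogSim overlap between a sentence S and a
# reference R, both given as unit-count dictionaries.
import math
from collections import Counter

def f1_overlap(ref_counts, sent_counts):
    inter = sum((ref_counts & sent_counts).values())  # |Omega(S) ∩ Omega(R)|
    return 2 * inter / (sum(sent_counts.values()) + sum(ref_counts.values()))

def logsim(ref_counts, sent_counts):
    ref_len = sum(ref_counts.values())
    sent_len = sum(sent_counts.values())
    score = 0.0
    for w in set(ref_counts) & set(sent_counts):
        lr_s = math.log1p(ref_len * sent_counts[w] / sent_len)  # LR(w, S)
        lr_r = math.log1p(ref_counts[w])                        # LR(w, R)
        score += math.exp(-abs(math.log(lr_s / lr_r))) * ref_counts[w] / ref_len
    return score

R = Counter("the tiger is a large wild cat".split())
S = Counter("the tiger is a cat".split())
print(f1_overlap(R, S), logsim(R, S))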
Continuous Space Approach

Classic approaches in Natural Language Processing transform the words of a dataset into a bag-of-words representation, which leads to sparse vectors and a loss of semantic information. In a dataset containing n different words, each word is represented by a one-hot vector that is absolutely independent of the rest of the words in the dataset.

Table 1 shows the one-hot vectors of a fictitious dataset with 10,000 different words. Some kind of relation certainly exists between tiger ↔ cat and wolf ↔ dog, but with one-hot vectors this relationship is impossible to maintain. The restriction can be overcome with word embeddings.

Table 1. One-hot encoding vectors example

ID      word    One-hot
1       tiger   [1 0 0 0 0 ... 0]
2       cat     [0 1 0 0 0 ... 0]
...     ...     ...
9999    wolf    [0 0 0 0 0 ... 1 0]
10000   dog     [0 0 0 0 0 ... 0 1]
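A two-line check makes the point concrete: distinct one-hot vectors are orthogonal, so their cosine similarity is exactly zero for every pair of different words, related or not (toy sketch, vocabulary indices are illustrative).

# Toy illustration of why one-hot vectors carry no similarity signal:
# the cosine between any two distinct one-hot vectors is zero, so
# tiger/cat are exactly as unrelated as tiger/dog.
import numpy as np

vocab = {"tiger": 0, "cat": 1, "wolf": 2, "dog": 3}

def one_hot(word, size=10_000):
    v = np.zeros(size)
    v[vocab[word]] = 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot("tiger"), one_hot("cat")))   # 0.0
print(cosine(one_hot("tiger"), one_hot("dog")))   # 0.0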
Word embeddings are another way to represent words within a dataset. In this representation, vectors are capable of maintaining the relationships between words in the dataset, following the distributional semantics hypothesis [7]: words that appear in the same contexts share similar meanings.

Word2vec is a popular Neural Network embedding model described in [18]. It aims to map the vocabulary of a dataset into a multidimensional vector space in which the distance between projections corresponds to the semantic similarity between them [20]. Two different model architectures are proposed in word2vec: Continuous Bag-of-Words (CBOW) and Continuous Skip-gram. The first one predicts a target word based on its context, while the second one predicts a context given a target word [16]. In both cases the size of the embeddings is defined by the size of the projection layer.

Learning the output vectors of the neural network is a very expensive task, so in order to increase the computational efficiency of the training process, it is necessary to limit the number of output vectors that are updated for each training instance. In [23] two different methods are proposed: Hierarchical Softmax and Negative Sampling.
– Hierarchical Softmax [19]: It represents the output layer of the neural network as a binary tree in which each leaf corresponds to one of the V words of the vocabulary within the dataset and each node represents the relative probabilities of its child nodes. Given the nature of the binary tree, there exists exactly one path from the root to each leaf, making it possible to use this path to estimate the probability of each word (leaf) [17,23].
– Negative Sampling [17]: The goal is to reduce the number of output vectors that have to be updated during each iteration of the training process. Instead of using all the samples during the loss function evaluation, just a small sample of them is taken into account.

As proposed in [17], it is possible to combine words by an element-wise addition of their embeddings. Given a document D of length n represented by the set of word embeddings D = {v_1, v_2, ..., v_n}, where |v_i| = m, a simple way to represent D with a unique embedding is to add each component j of each embedding v_i in D, obtaining a unique vector d of length m.

To measure the similarity between the vector of a reference document (d_R) and that of a proposed sentence (d_S), we calculate the cosine similarity between the two vectors:

\cos_{RS}(\theta) = \frac{d_R \cdot d_S}{\|d_R\| \cdot \|d_S\|}    (10)
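As a sketch of this pipeline (toy corpus and helper names of our own; the dimensionality and negative sampling below mirror the settings used later in the paper), a skip-gram model with negative sampling can be trained with gensim, document vectors composed by element-wise addition, and compared with Equation 10:

# Sketch: skip-gram word2vec with negative sampling (gensim), additive
# document embeddings as in [17], and the cosine of Equation 10.
# Corpus, helper names and training epochs are toy stand-ins.
import numpy as np
from gensim.models import Word2Vec

corpus = [["the", "tiger", "is", "a", "large", "cat"],
          ["the", "wolf", "is", "a", "wild", "dog"]]
model = Word2Vec(corpus, vector_size=300, sg=1, negative=15,
                 min_count=1, epochs=50)

def doc_vector(tokens):
    """Element-wise sum of the embeddings of the known tokens."""
    return np.sum([model.wv[t] for t in tokens if t in model.wv], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d_R = doc_vector(corpus[0] + corpus[1])    # reference document
d_S = doc_vector(["a", "large", "tiger"])  # candidate passage
print(cosine(d_R, d_S))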
Nugget-Based Evaluation

LM approaches of Informativeness are based on smoothed probabilities over all text units, but it is possible to use only a subset of them (nuggets). In the context of QA, Dang et al. defined a nugget as an informative text unit (ITU) about the target that is interesting, atomicity being linked to the fact that a binary decision can be made on the relevance of the nugget to answering a question [4]. This makes it possible for documents that have not been evaluated to be labeled as relevant or not relevant (simply according to whether they contain relevant nuggets or not).

It has been shown that real ITUs can be automatically extracted to convert textual references into sets of nuggets [5]. This simplifies the complex problem of Informativeness evaluation by providing a method to measure the proximity between two sets of ITUs. For that, standard Precision-Recall or Pyramid measures can be used if nuggets are unambiguous entities [14], and more sophisticated nugget score metrics based on shingles if not [21]. We refer the reader to these publications [4,5,14,21] for advanced and non-trivial technical details. Here we shall point out that the original Pyramid method relies on human evaluation to:
1. identify in text summaries all short sequences of words that are relevant to some query or question;
2. supervise the clustering of these units into coherent features that allow computing some informativeness score.
The global idea is that Informativeness relies on the presence or absence of some specific text units and can then be evaluated based on their counts. Pavlu et al. share this same idea, even though they try to automatize the process of selecting and identifying these ITUs [21].

For the purpose of this work, we consider sentences as sets of items (nuggets, words or word n-grams). There is a wide range of Interestingness measures, but they all combine the three following properties [8]:
1. Diversity, which can rely on concentration (the sentence reveals an important piece of information) or on dispersion (the sentence links several different entities).
2. Permutation invariance: the order of the items has no impact on the overall score.
3. Transfer: a sentence with few highly informative items should be considered more interesting than a long sentence with less informative items.
Given a text reference, only the LogSim measure covers all three properties. The KL measure, due to smoothing, does not fulfill (1), while the F_β measure does not fulfill (3), since it favors long sentences with large overlaps.

We formally define short passage Informativeness evaluation as a ternary relation between a set of topics T, a subset of short text extracts P from a large document resource, and a set S of graded scores, such that top ranked passages certainly contain relevant information about the related topic or its background. By contrast, we define short passage Interestingness evaluation as a projection of this relation over content and scores: a binary relation between a set of short text passages and a graded score, such that top ranked passages are informative for some unknown topic.

We consider the data collection from the TC@INEX 2012 task [26] and state of the art measures to automatically evaluate summary Informativeness and Interestingness. The participants in this task had to provide a summary composed of passages extracted from Wikipedia that should contextualize a tweet by revealing its implicit background and providing some factual explanations about related concepts. During 2012 and 2013, tweets were collected from non-personal Twitter accounts. In 2012, a total of 33 valid runs were submitted by 13 teams from 10 countries, while in 2013, 13 teams from 9 countries submitted 24 runs [1].
Textual Relevance Judgments
Reusable assessments to evaluate text informativeness have been one of the main goals of the TC@INEX tracks; all the collections that have been built are indeed freely available (http://tc.talne.eu). Assessment has been performed by organizers on a pool of 63 topics (tweets) in 2012, and 70 topics in 2013. Assessments consist of textual relevance judgments (t-rels) to be used for content comparison with automatically built summaries. Since summaries returned by participant systems are composed of passages extracted from Wikipedia, the reference is built on a pool of these passages.

An important fact in the pre-processing of this pool is that passages starting and ending with the same 25 characters have been considered as duplicates; therefore passages in the reference are unique, but short sub-passages can appear twice inside longer ones. Moreover, since for each topic all passages from all participants have been merged and displayed to the assessor in alphabetical order, each passage's informativeness has been evaluated independently from the others, even within the same summary. This results in a reference that is highly redundant at the level of noun phrases and, consequently, of all types of ITUs that can be extracted.

The soundness of the pooling procedure was verified during the TC@INEX 2011 campaign over a corpus extracted from The New York Times [6,25]. Indeed, topics in 2011 were only tweets coming from The New York Times, and a full article was associated with each tweet. To check that the resulting pool of relevant answers was sound, a second automatic evaluation of summary informativeness was carried out, taking as reference the textual content of the article. None of the participants had reported having used this information, available on The New York Times website. Both rankings, one based on submitted run pooling and the second one based on New York Times articles, appeared to be highly correlated (Pearson's product-moment correlation = 88%, with a highly significant p-value).

The TC@INEX 2012 task collection provides:
– A set of different topics: a topic is a short sentence (a tweet) that is used by participants to build a query-driven summary. Each summary should be 500 words long or less and is supposed to be built by sentence extraction from Wikipedia;
– A set of runs: a run consists of several summaries, one per topic, each built by one system from a participant;
– For each summary produced by a participant, the passages that have been marked as informative (and thus the ones considered as non-informative) by human assessors. This yields a large set of assessed passages, either informative or non-informative.

From this test collection we extracted the set of passages from participants' automatic summaries that human evaluators marked as informative regarding a topic. Each topic was evaluated by two people, including the one who chose the original tweet as the topic. Therefore each passage obtained a graded score in [0, 1] ∪ {2}, namely the relative total length of the passage that was highlighted as informative by at least one evaluator; in the case where the two evaluators agreed that the whole passage was informative, a score of 2 was assigned.

Another resource was built using this data. From each passage we extracted text units: in the first place stems, which were in turn used to build uni-grams, bi-grams and skip-grams; secondly, Wikipedia entities in anchor texts. Stems were simply obtained using Porter's stemmer after a process of stop-word removal; they correspond to uni-grams. Bi-grams were composed of two adjacent text units after stop-word removal.
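A minimal sketch of this unit extraction, using NLTK's Porter stemmer and a toy stop-word list standing in for the one actually used:

# Sketch of the unit extraction described above: stop-word removal,
# Porter stemming, then uni-grams and bi-grams over the stemmed
# tokens. The stop-word list is a tiny stand-in, not the real one.
from nltk.stem import PorterStemmer

STOP = {"the", "is", "a", "of", "and", "in"}   # toy stop-word list
stemmer = PorterStemmer()

def units(text):
    stems = [stemmer.stem(t) for t in text.lower().split() if t not in STOP]
    unigrams = stems
    bigrams = list(zip(stems, stems[1:]))      # adjacent after removal
    return unigrams, bigrams

print(units("The tigers are running in the large forests"))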
If all passages that have been selected as informative are considered and separated by topic, it is possible to build a textual reference for each topic and apply state of the art Informativeness metrics. By contrast, by merging all the references we obtain a textual reference to evaluate Interestingness.

We also focus on other types of token characteristics. We hypothesize that the measures can also be sensitive to term distribution. Indeed, it is likely that automatic summaries containing highly frequent terms are less interesting for a user than summaries containing less frequent, and thus probably more informative, terms. In the same way, it is likely that summaries containing named entities are more informative than summaries that do not contain any. These phenomena have not been studied in the literature, which justifies the purpose of this paper. While in previous studies tokens are words or stems, in this paper we also study other types of tokens such as DBpedia entities.

To perform experiments with word embeddings we used four different word2vec models, all with an embedding size of 300 dimensions and a negative sampling of 15 units. Table 2 lists the text unit used in each model.
Table 2. Word2vec models

Model                   Text unit
Google News             uni-grams
CLEF-INEX               uni-grams
CLEF-INEX               bi-grams
Wikipedia               bi-grams

The Google News uni-gram model is the same used by [20] to calculate ROUGE-WE. Both CLEF-INEX models (uni-gram and bi-gram) were trained on the assessed passages of the TC@INEX 2012 task. Finally, the Wikipedia bi-gram model was trained on a partition of the 2012 Wikipedia.

Rather than calculating the correlation between systems' average measure scores and their human-assigned average coverage scores, as in [12] with a limited number of systems and abstracts, we considered individually each passage for which we have a human assessment. We then calculated all scores for each metric. Following the same approach as the TREC 2015 Microblog Track (https://github.com/lintool/twitter-tools/wiki/TREC-2015-Track-Guidelines) [28], to evaluate one specific metric we ranked passages in decreasing order of that metric. Considering different cut-off values, we computed the normalized Cumulative Gain (nCG) over the top ranked passages. This score is the sum of the graded human judgments of the top ranked passages obtained in TC@INEX 2012, divided by the maximum score that could have been expected at this precise cut-off value.

More precisely, given a set of topics T and a set of passages Ω for which there is a graded evaluation ref_τ(ω) of their Informativeness for at least one topic τ ∈ T, nCG_k(m) is computed as follows for any metric m and any cut-off value k:

nCG_k(m) = \frac{\max\left\{\sum_{\omega \in S} m(\omega) : S \subset \Omega, |S| \le k\right\}}{\max\left\{\sum_{\omega \in S} \mathrm{ref}_\tau(\omega) : S \subset \Omega, |S| \le k, \tau \in T\right\}}    (11)

where |S| is the cardinality of S.

For cut-off values k lower than the number of passages ω such that ref_τ(ω) > 0, nCG_k(m) reflects the measure's Precision. On the contrary, for cut-off values higher than the number of passages ω such that ref_τ(ω) > 0, nCG_k(m) indicates the maximum Recall that can be expected using this measure.

We considered and evaluated the following set M of 13 textual overlap measures:
– F1 1: F1-score among uni-grams
– F1 2: F1-score among bi-grams
– F1 sk: F1-score among skip-grams with a gap of one word
– KL 1: KL divergence among uni-grams
– KL 2: KL divergence among bi-grams
– KL sk: KL divergence among skip-grams with a gap of one word
– LS 1: LogSim score among uni-grams
– LS 2: LogSim score among bi-grams
– LS sk: LogSim score among skip-grams with a gap of one word
– w2v g: Word2vec cosine similarity over the Google News model (uni-grams)
– w2v c: Word2vec cosine similarity over the set of all passages in Ω (CLEF-INEX uni-gram model)
– w2v c bi: Word2vec cosine similarity over the same set Ω but considering word bi-grams instead of single words (CLEF-INEX bi-gram model)
– w2v wp bi: Word2vec cosine similarity over the Wikipedia version used as corpus in TC@INEX 2012, considering word bi-grams (Wikipedia bi-gram model)

The first nine measures are discrete measures that we also applied to sibling operators over Wikipedia anchor texts. We consider the anchors associated with Wikipedia entities as potential nuggets, as defined in Pyramid evaluations [3].

To evaluate Informativeness, each measure was applied to estimate the overlap between the reference informative passages of the topic and the passage itself; statistical significance was tested over topics. Meanwhile, to evaluate Interestingness, each measure was applied to estimate the overlap between the set of all informative passages, disregarding their specific topics, and the passage itself; statistical significance was tested based on a 12-fold split of the corpus.

We used nCG over the TC@INEX 2012 data to test the ability of state of the art Informativeness measures to evaluate ATS systems. The goal is to distinguish informative from non-informative short text passages rather than entire abstracts. This is done by computing the Informativeness score of all passages manually assessed in TC@INEX 2012 for any topic and sorting them in decreasing order. Then, the ability of a specific measure to find the k most informative passages is evaluated. For each measure µ ∈ M and passage ω ∈ Ω associated with a topic τ ∈ T, we compute µ(ϕ_τ, ω) to estimate the semantic overlap between ω and the textual reference ϕ_τ defined as:

\varphi_\tau = \bigcup \{\omega \in \Omega : \mathrm{ref}_\tau(\omega) > 0\}    (12)
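Read together with the prose above, the evaluation loop can be sketched as follows (helper names and toy dictionaries are ours, standing in for the real t-rels): passages are ranked by a metric, the human gains of the top k are summed, and the total is normalized by the best achievable gain at that cut-off.

# Sketch of the nCG evaluation loop described above: rank passages by
# a metric's score, sum the human reference gains of the top k, and
# normalize by the ideal top-k gain. Dictionaries are toy stand-ins.
def ncg(metric_score, ref_score, k):
    by_metric = sorted(metric_score, key=metric_score.get, reverse=True)[:k]
    ideal = sorted(ref_score.values(), reverse=True)[:k]
    return sum(ref_score[p] for p in by_metric) / sum(ideal)

ref_score = {"p1": 2.0, "p2": 0.5, "p3": 0.0, "p4": 1.0}     # toy t-rels
metric_score = {"p1": 0.9, "p2": 0.1, "p3": 0.8, "p4": 0.4}  # toy metric
print(ncg(metric_score, ref_score, k=2))   # 2.0 / 3.0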
Discrete vs. Continuous Metrics

Figure 1 presents the results of the F1-scores over uni-grams, bi-grams and skip-grams and compares them against the word2vec uni-gram models. It shows that word2vec approaches are less efficient at evaluating short passage informativeness, even compared with standard uni-grams, while F1-scores over bi-grams and skip-grams reach similar performance. The figure also shows that for low cut-off values, Precision remains low and bi-grams do not improve the measure's performance. For higher cut-off values, maximal Recall is only reached by the F1-scores over bi-grams and skip-grams.

A slight improvement can be seen with the local word2vec model (CLEF-INEX uni-grams) over the model trained on the Google News dataset. This behavior can be explained by the fact that the number of unknown words in the CLEF-INEX model is smaller than in the Google News model. Although this improvement in the cumulative score is very small, it shows that it is better to create the word embeddings with a smaller but more specialized dataset.

Fig. 1. Informativeness nCG scores: F1-scores vs. word2vec approaches

Best Discrete Informativeness Metric
In Figure 2 it can be seen that all KL metrics show similar performance for medium cut-off values. Among all metrics over uni-grams, the F1-score achieves the best Precision and Recall over all cut-off values, with a Recall peak at high cut-off values. F1-scores over bi-grams and skip-grams outperform LogSim over the same units, but this improvement is not significant. Again, only bi-gram based measures reach maximal Recall for high cut-off values.

Fig. 2. Informativeness nCG scores: F1-scores vs. KL and LogSim approaches

Discrete Metrics Restricted to Entities
As shown in Figure 3, for low cut-off values, restricting references and passages to DBpedia entities provides scores equivalent to those of complete passages for F1-scores. By contrast, for high cut-off values and maximal Recall there is a clear gap between restricted entities and complete passages.

An interesting remark concerns the behavior of the LogSim measures with DBpedia entities. For low cut-off values, all three LogSim scores are extremely low and fail to highlight informative passages; for high cut-off values, LogSim scores over bi-grams and skip-grams approach maximal Recall. Regarding the F1-scores with uni-grams, restricted entities slightly outperform complete passages for high cut-off values, but this difference is not significant.

Fig. 3. Informativeness nCG scores: DBpedia entities vs. complete passages
We now use nCG over the TC@INEX 2012 data to test the ability of the same Informativeness measures to distinguish interesting from non-interesting text passages without considering an explicit topic. For each measure µ ∈ M and passage ω ∈ Ω associated with a topic τ ∈ T, we compute µ(∆_τ, ω) to estimate the semantic overlap between ω and the textual reference ∆_τ, defined as the concatenation of passages that are informative for at least one different topic τ′ ≠ τ:

\Delta_\tau = \bigcup \{\omega \in \Omega : (\exists \tau' \in T \setminus \{\tau\})\ \mathrm{ref}_{\tau'}(\omega) > 0\}    (13)

However, in order to avoid any overfitting effect, we split the dataset of passages ranked per topic into 12 folds and restricted ∆_τ to passages in a different fold than the one including τ.
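A minimal sketch of this reference construction (illustrative data structures, not the authors' code): for a passage judged on topic t, the reference pools the informative passages of all other topics, restricted to folds that do not contain t.

# Sketch of Equation 13 with the cross-fold restriction: the
# Interestingness reference for a topic pools informative passages
# from other topics in other folds. Data structures are toy stand-ins.
def interestingness_reference(topic, judgments, fold_of):
    """judgments: list of (topic, passage_text, ref_score) triples."""
    return [text for (t, text, score) in judgments
            if score > 0                        # informative for t'
            and t != topic                      # a different topic t'
            and fold_of[t] != fold_of[topic]]   # cross-fold restriction

judgments = [("t1", "passage about jazz", 1.0),
             ("t2", "passage about rugby", 2.0),
             ("t3", "uninformative text", 0.0)]
fold_of = {"t1": 0, "t2": 1, "t3": 2}   # toy stand-in for the 12 folds
print(interestingness_reference("t1", judgments, fold_of))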
Discrete vs. Continuous Metrics

Figure 4 presents the results of F1-scores and word2vec models over uni-grams and bi-grams. For small cut-off values, word2vec models and the F1-score over uni-grams show a similarly low performance. The improvement of F1-scores over bi-grams and skip-grams against uni-grams for all cut-off values is striking for this task.

Fig. 4. Interestingness nCG scores: F1-scores vs. word2vec approaches
This time, F1-scores do not reach maximal Recall for high cut-off values; the best compromise between Recall and Precision is reached for a medium cut-off value relative to the number of informative passages in the reference. Word2vec models trained over Wikipedia using bi-grams significantly improve Precision for small cut-off values, but then become less efficient at higher cut-offs because of the proportion of missing bi-grams in the model. The other word2vec models do not outperform the baseline that takes the inverse passage length (len inv) as its measure, following the idea that short passages could be more specific.

Best Discrete Interestingness Metric

As shown in Figure 5, among discrete measures there is another contrast with the previous results over Informativeness. LogSim measures over bi-grams significantly outperform F1-scores and reach total Recall at high numbers of retrieved passages. Again, only bi-gram and skip-gram based measures reach maximal Recall for high cut-off values. LogSim bi-gram and skip-gram measures significantly outperform all other measures for any cut-off value. It also appears that KL metrics provide the best results over uni-grams among all metrics with the same units.

Fig. 5. Interestingness nCG scores: F1-scores vs. KL and LogSim approaches

Discrete Metrics Restricted to Entities
Finally, in Figure 6 we look at the impact of restricting references and passages to DBpedia entities in the case of Interestingness. For low and medium cut-off values, F1-scores over bi-grams and skip-grams with complete passages show a better performance than the ones with DBpedia entities. This difference in performance is reduced for high cut-off values, where the Recall for bi-grams with DBpedia entities rises substantially.

From Figures 5 and 6 we see that LogSim measures over bi-grams and skip-grams, with both complete and restricted passages, outperform all the other measures. In general, LogSim scores with complete passages show a better performance than those with restricted entities. Surprisingly, LogSim measures over bi-grams remain almost stable and outperform F1-scores both over complete and restricted passages.

Fig. 6. Interestingness nCG scores: DBpedia entities vs. complete passages
Conclusion
In this paper we have defined the concept of Interestingness in FR as a generalization of the concept of Informativeness, where the information need is diverse and formalized as an unknown set of implicit queries. We then studied the ability of state of the art Informativeness measures to cope with this generalization, showing that in this new framework, cosine similarity between word embedding vector representations outperforms discrete measures only on uni-grams. Moreover, discrete bi-gram and skip-gram LMs appeared to be a key point of Interestingness evaluation, significantly outperforming all experiments with word2vec models. We did show that an alternative word2vec bi-gram model learned on Wikipedia outperforms the uni-gram word2vec models on nCG scores for small cut-off values, but its performance decreases for higher values. Finally, we showed that the TC@INEX 2012 LogSim measure indeed provides the best results for efficient Interestingness detection over this corpus, both on complete passages and on passages restricted to their anchor texts referring to entities.