Real-Time Web Scale Event Summarization Using Sequential Decision Making
Chris Kedzie
Columbia University, Dept. of Computer Science
[email protected]
Fernando Diaz
Microsoft
[email protected]
Kathleen McKeown
Columbia University, Dept. of Computer Science
[email protected]
Abstract
We present a system based on sequential decision making for the online summarization of massive document streams, such as those found on the web. Given an event of interest (e.g. "Boston marathon bombing"), our system is able to filter the stream for relevance and produce a series of short text updates describing the event as it unfolds over time. Unlike previous work, our approach is able to jointly model the relevance, comprehensiveness, novelty, and timeliness required by time-sensitive queries. We demonstrate a 28.3% improvement in summary F1 and a 43.8% improvement in time-sensitive F1 metrics.

Introduction

Tracking unfolding news at web scale continues to be a challenging task. Crisis informatics, monitoring of breaking news, and intelligence tracking all have difficulty in identifying new, relevant information within the massive quantities of text that appear online each second. One broad need that has emerged is the ability to provide real-time, event-specific updates of streaming text data which are timely, relevant, and comprehensive while avoiding redundancy.

Unfortunately, many approaches have adapted standard automatic multi-document summarization techniques that are inadequate for web scale applications. Typically, such systems assume full retrospective access to the documents to be summarized, or that at most a handful of updates to the summary will be made [Dang and Owczarzak, 2008]. Furthermore, evaluation of these systems has assumed reliable and relevant input, something missing in an inconsistent, dynamic, noisy stream of web or social media data. As a result, these systems are poor fits for most real world applications.

In this paper, we present a novel streaming-document summarization system based on sequential decision making. Specifically, we adopt the "learning to search" approach, a technique which adapts methods from reinforcement learning for structured prediction problems [Daumé III et al., 2009; Ross et al., 2011]. In this framework, we cast streaming summarization as a form of greedy search and train our system to imitate the behavior of an oracle summarization system.

– Two explosions shattered the euphoria of the Boston Marathon finish line on Monday, sending authorities out on the course to carry off the injured while the stragglers were rerouted away...
– Police in New York City and London are stepping up security following explosions at the Boston Marathon.
– A senior U.S. intelligence official says two more explosive devices have been found near the scene of the Boston marathon where two bombs detonated earlier.
– Several candidates for Massachusetts Senate special election have suspended campaign activity in response to the explosions...
Figure 1: Excerpt of summary for the query 'Boston marathon bombing' generated from an input stream.

Given a stream of sentence-segmented news webpages and an event query (e.g. "Boston marathon bombing"), our system monitors the stream to detect relevant, comprehensive, novel, and timely content. In response, our summarizer produces a series of short text updates describing the event as it unfolds over time. We present an example of our real-time update stream in Figure 1. We evaluate our system in a crisis informatics setting on a diverse set of event queries, covering severe storms, social unrest, terrorism, and large accidents. We demonstrate a 28.3% improvement in summary F1 and a 43.8% improvement in time-sensitive F1 metrics against several state-of-the-art baselines.

Related Work

Multi-document summarization (MDS) has long been studied by the natural language processing community. We focus specifically on extractive summarization, where the task is to take a collection of text and select some subset of sentences from it that adequately describes the content subject to some budget constraint (e.g. the summary must not exceed k words). For a more in depth survey of the field, see [Nenkova and McKeown, 2012].

Because labeled training data is often scarce, unsupervised approaches to clustering and ranking predominate the field. Popular approaches involve ranking sentences by various notions of input similarity or graph centrality [Radev et al., 2000; Lin and Hovy, 2000; Haghighi and Vanderwende, 2009; Du et al., 2011; Mason and Charniak, 2011; Conroy et al., 2011].

Streaming or temporal summarization was first explored in the context of topic detection and tracking [Allan et al., 2001] and more recently at the Text Retrieval Conference (TREC) [Aslam et al., 2013]. Top performers at TREC included an affinity propagation clustering approach [Kedzie et al., 2015] and a ranking/MDS system combination method [McCreadie et al., 2014]. Both methods are unfortunately constrained to work in hourly batches, introducing potential latency. Perhaps most similar to our work is that of [Guo et al., 2013], which iteratively fits a pair of regression models to predict ngram recall and precision of candidate updates to a model summary. However, their learning objective fails to account for errors made in subsequent prediction steps.

Problem Definition

A streaming summarization task is composed of a brief text query q, including a categorical event type (e.g. 'earthquake', 'hurricane'), as well as a document stream [X_1, X_2, ...]. In practice, we assume that each document is segmented into a sequence of sentences and we therefore consider a sentence stream [x_1, x_2, ...]. A streaming summarization algorithm then selects or skips each sentence as it is observed such that the end user is provided a filtered stream of sentences that are relevant, comprehensive, low redundancy, and timely (see Section 5.2). We refer to the selected sentences as updates and collectively they make up an update summary. We show a fragment of an update summary for the query 'Boston marathon bombing' in Figure 1.

Streaming Summarization as Sequential Decision Making

We could naively treat this problem as classification and predict which sentences to select or skip. However, this would make it difficult to take advantage of many features (e.g. sentence novelty with respect to previous updates).
What is more concerning, however, is that the classification objective for this task is somewhat ill-defined: successfully predicting select on one sentence changes the true label (from select to skip) for sentences that contain the same information but occur later in the stream.

In this work, we pose streaming summarization as a greedy search over a binary branching tree where each level corresponds to a position in the stream (see Figure 2). The height of the tree corresponds to the length of the stream. A path through the tree is determined by the system select and skip decisions.

Figure 2: Search space for a stream of size two. The depth of the tree corresponds to the position in the stream. Left branches indicate selecting the current sentence as an update. Right branches skip the current sentence. The path in green corresponds to one trajectory through this space: select sentence one, then skip sentence two. The state represented by the hollow dot corresponds to the stream at sentence position two with the update summary containing sentence one.

When treated as a sequential decision making problem, our task reduces to defining a policy for selecting a sentence based on its properties as well as properties of its ancestors (i.e. all of the observed sentences and previous decisions). The union of properties, also known as the features, represents the current state in the decision making process. The feature representation provides state abstraction both within a given query's search tree as well as to states in other queries' search trees, and also allows for complex interactions between the current update summary, candidate sentences, and stream dynamics, unlike the classification approach.

In order to learn an effective policy for a query q, we can take one of several approaches. We could use a simulator to provide feedback to a reinforcement learning algorithm. Alternatively, if provided access to an evaluation algorithm at training time, we can simulate (approximately) optimal decisions. That is, using the training data, we can define an oracle policy that is able to omnisciently determine which sentences to select and which to skip. Moreover, it can make these determinations by starting at the root or at an arbitrary node in the tree, allowing us to observe optimal performance in states unlikely to be reached by the oracle. We adopt locally optimal learning to search to learn our model from the oracle policy [Chang et al., 2015].

In this section, we will begin by describing the learning algorithm abstractly and then in detail for our task. We will conclude with details on how to train the model with an oracle policy.

In the induced search problem, each search state s_t corresponds to observing the first t sentences in the stream x_1, ..., x_t and a sequence of t - 1 actions a_1, ..., a_{t-1}. For all states s ∈ S, the set of actions is a ∈ {0, 1}, with 1 indicating we add the t-th sentence to our update summary and 0 indicating we ignore it. For simplicity, we assume a fixed length stream of size T, but this is not strictly necessary. From each input stream, x = x_1, ..., x_T, we produce a corresponding output a ∈ {0, 1}^T. We use x_{1:t} to indicate the first t elements of x.

Algorithm 1: Locally optimal learning to search.
  Input: {x_q, π*_q} for q ∈ Q, number of iterations N, and a mixture parameter β ∈ (0, 1) for roll-out.
  Output: π̃
   1: Initialize π̃_i, i ← 1
   2: for n ∈ {1, 2, ..., N} do
   3:   for q ∈ Q do
   4:     Γ ← ∅
   5:     for t ∈ {0, 1, ..., T - 1} do
   6:       Roll in by executing π̃_i for t rounds and reach s_t.
   7:       for a ∈ A(s_t) do
   8:         Let π_o = π*_q with probability β, else π̃_i.
   9:         Compute c_t(a) by rolling out with π_o.
  10:         Γ ← Γ ∪ {⟨Φ(s_t), a, c_t(a)⟩}
  11:     π̃_{i+1} ← UpdateCostSensitiveClassifier(π̃_i, Γ)
  12:     i ← i + 1
  13: Return π̃_i
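A minimal sketch of this training loop in Python follows. The stream representation, the feature function phi, the per-query oracle policies, the loss over complete decision sequences, and the squared-loss cost regressor are illustrative assumptions, not the authors' released implementation.

```python
import random
import numpy as np

class CostRegressor:
    """Cost-sensitive classifier: a linear regression of cost on state features,
    one weight vector per action. The induced policy picks the action with the
    lowest predicted cost."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros((2, dim))
        self.lr = lr

    def cost(self, feats, action):
        return float(self.w[action].dot(feats))

    def act(self, feats):
        return int(self.cost(feats, 1) < self.cost(feats, 0))

    def update(self, feats, action, cost):
        err = self.cost(feats, action) - cost      # one SGD step on squared error
        self.w[action] -= self.lr * err * feats


def rollout(stream, prefix, policy_act, phi):
    """Complete a decision sequence from the given prefix to the end of the stream."""
    actions = list(prefix)
    for t in range(len(prefix), len(stream)):
        actions.append(policy_act(phi(stream, t, actions)))
    return actions


def lols_train(queries, phi, oracles, loss, dim, n_iters=10, beta=0.5):
    """queries: one sentence stream per training query; oracles[q]: reference policy
    for query q; phi(stream, t, actions): feature vector for the state at position t;
    loss(q, actions): loss of a complete update summary for query q."""
    policy = CostRegressor(dim)
    for _ in range(n_iters):
        for q, stream in enumerate(queries):
            actions = []                            # roll-in with the current learned policy
            for t in range(len(stream)):
                feats = phi(stream, t, actions)
                # roll out each candidate action with a mixture of oracle and learned policy
                roll = oracles[q] if random.random() < beta else policy.act
                costs = [loss(q, rollout(stream, actions + [a], roll, phi))
                         for a in (0, 1)]
                for a in (0, 1):                    # regret-style costs for the regressor
                    policy.update(feats, a, costs[a] - min(costs))
                actions.append(policy.act(feats))
    return policy
```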
For a training query q, a reference policy π*_q can be constructed from the training data. Specifically, with access to the relevance and novelty of every x_i in the stream, we can omnisciently make a decision whether to select or skip based on a long term metric (see Section 5.2). The goal then is to learn a policy π̃ that imitates the reference well across a set of training queries Q. We encode each state as a vector in R^d with a feature function Φ, and our learned policy is a mapping π̃ : R^d → A of states to actions.

We train π̃ using locally optimal learning to search [Chang et al., 2015], presented in Algorithm 1. The algorithm operates by iteratively updating a cost-sensitive classifier. For each training query, we construct a query-specific training set Γ by simulating the processing of the training input stream x_q. The instances in Γ are triples comprised of a feature vector derived from the current state s, a candidate action a, and the cost c(a) associated with taking action a in state s. Constructing Γ consists of (1) selecting states and actions, and (2) computing the cost for each state-action pair.

The number of states is exponential in T, so constructing Γ using the full set of states may be computationally prohibitive. Beyond this, the states in Γ would not be representative of those visited at test time. In order to address this, we sample from S by executing the current policy π̃ throughout the training simulation, resulting in T state samples for Γ (lines 5-10).

Given a sampled state s, we need to compute the cost of taking actions a ∈ {0, 1}. With access to a query-specific oracle, π*_q, we can observe its preferred decision at s and penalize choosing the other action. The magnitude of this penalty is proportional to the difference in expected performance between the oracle decision and the alternative decision. The performance of a decision is derived from a loss function ℓ, to be introduced in Section 4.3. Importantly, our loss function is defined over a complete update summary, incorporating the implications of selecting an action on future decisions. Therefore, our cost needs to incorporate a sequence of decisions after taking some action in state s. The algorithm accomplishes this by rolling out a policy after a until the stream has been exhausted (line 9). As a result, we have a prefix defined by π̃, an action, and then a suffix defined by the roll-out policy. In our work, we use a mixture policy that combines both the current model π̃ and the oracle π* (line 8). This mixture policy encourages learning from states that are likely to be visited by the current learned policy but not by the oracle.

After our algorithm has gathered Γ for a specific q using π̃_i, we train on the data to produce π̃_{i+1}. Here π̃_i is implemented as a cost-sensitive classifier, i.e. a linear regression of the costs on features and actions; the natural policy is to select the action with lowest predicted cost. With each query, we update the regression with stochastic gradient descent on the newly sampled (state, action, cost) tuples (line 11). We repeat this process for N passes over all queries in the training set.

In the following sections, we specify the feature function Φ, the loss ℓ, and our reference policy π*.
Features

As mentioned in the previous section, we represent each state as a feature vector. In general, at time t, these features are functions of the current sentence (i.e. x_t), the stream history (i.e. x_{1:t}), and/or the decision history (a_{1:t-1}). We refer to features determined only by x_t as static features and all others as dynamic features.

Static Features

Basic Features
Our most basic features look at the length in words of a sentence, its position in the document, and the ratio of specific named entity tags to non-named entity tokens. We also compute the average number of sentence tokens that match the event query words and synonyms using WordNet.
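A sketch of these basic features; the tokenization, per-token named-entity tags, and the NLTK WordNet expansion are illustrative assumptions, and the entity ratio is aggregated over all entity types here for brevity:

```python
from nltk.corpus import wordnet  # requires the NLTK WordNet data (nltk.download('wordnet'))

def basic_features(tokens, sent_index, doc_num_sents, ne_tags, query_words):
    """tokens: tokens of the current sentence; ne_tags: one tag per token ('O' for
    non-entities); query_words: the event query terms."""
    # expand the query with WordNet synonyms of each query term
    synonyms = set(w.lower() for w in query_words)
    for w in query_words:
        for syn in wordnet.synsets(w):
            synonyms.update(l.name().lower() for l in syn.lemmas())
    matches = sum(1 for t in tokens if t.lower() in synonyms)
    num_ne = sum(1 for t in ne_tags if t != 'O')
    return {
        "sent_length": len(tokens),
        "doc_position": sent_index / max(1, doc_num_sents),   # relative position in document
        "ne_ratio": num_ne / max(1, len(tokens) - num_ne),    # entity vs. non-entity tokens
        "avg_query_match": matches / max(1, len(tokens)),
    }
```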
Language Model Features
Similar to [Kedzie et al., 2015], we compute the average token log probability of the sentence on two language models: i) an event type specific language model and ii) a general newswire language model. The first language model is built from Wikipedia articles relevant to the event-type domain. The second model is built from the New York Times and Associated Press sections of the Gigaword-5 corpus [Graff and Cieri, 2003].
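A sketch of this feature using a simple unigram model in place of the full language models; event_type_lm and newswire_lm are assumed dictionaries of token probabilities estimated from the two corpora:

```python
import math

def avg_token_logprob(tokens, unigram_probs, unk_prob=1e-6):
    """Average per-token log probability under a unigram language model;
    unigram_probs maps token -> probability estimated from the relevant corpus."""
    if not tokens:
        return 0.0
    logps = [math.log(unigram_probs.get(t.lower(), unk_prob)) for t in tokens]
    return sum(logps) / len(logps)

# Two such scores are computed per sentence, e.g.:
#   avg_token_logprob(tokens, event_type_lm)   # domain (Wikipedia) language model
#   avg_token_logprob(tokens, newswire_lm)     # general newswire (Gigaword) language model
```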
Single Document Summarization Features
These features are computed using the current sentence's document as a context and are also commonly used as ranking features in other document summarization systems. Where a similarity or distance is needed, we use either a tf-idf bag-of-words or k-dimensional latent vector representation. The latter is derived by projecting the former onto a k-dimensional space using the weighted textual matrix factorization method [Guo and Diab, 2012].

We compute SumBasic features [Nenkova and Vanderwende, 2005]: the average and sum of unigram probabilities in a sentence. We compute the arithmetic and geometric means of the sentence's cosine distance to the other sentences of the document [Guo et al., 2013]. We refer to this quantity as novelty and compute it with both vector representations. We also compute the centroid rank [Radev et al., 2000] and LexRank of each sentence [Erkan and Radev, 2004], again using both vector representations.

We have attempted to use a comprehensive set of static features used in previous summarization systems. We omit details for space, but source code is available at: https://github.com/kedz/ijcai2016
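A sketch of two of these features, the SumBasic unigram probabilities and the cosine-distance novelty score; the sentence vectors (tf-idf or latent) are assumed to be precomputed:

```python
import numpy as np
from collections import Counter

def sumbasic_features(sent_tokens, doc_tokens):
    """SumBasic-style features: average and sum of the document unigram probabilities
    of the sentence tokens."""
    counts = Counter(t.lower() for t in doc_tokens)
    total = float(sum(counts.values()))
    probs = [counts[t.lower()] / total for t in sent_tokens]
    return {"sumbasic_avg": float(np.mean(probs)) if probs else 0.0,
            "sumbasic_sum": float(np.sum(probs))}

def novelty_features(sent_vec, other_sent_vecs, eps=1e-9):
    """Arithmetic and geometric mean cosine distance of a sentence to the other
    sentences of its document."""
    dists = [1.0 - float(np.dot(sent_vec, v) /
                         (np.linalg.norm(sent_vec) * np.linalg.norm(v) + eps))
             for v in other_sent_vecs]
    dists = np.array(dists) if dists else np.array([0.0])
    return {"novelty_amean": float(dists.mean()),
            "novelty_gmean": float(np.exp(np.log(dists + eps).mean()))}
```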
Summary Content Probability
For a subset of the stream sentences we have manual judgements as to whether they match model summary content or not (see Sec. 5.1, Expanding Relevance Judgments). We use this data (restricted to sentences from the training query streams) to train a decision tree classifier, using the sentences' term ngrams as classifier features. As this data is aggregated across the training queries, the purpose of this classifier is to capture the importance of general ngrams predictive of summary-worthy content. Using this classifier, we obtain the probability that the current sentence x_t contains summary content and use this as a model feature.
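A sketch of this classifier using scikit-learn; the n-gram range and tree depth are assumptions, and the labels come from the expanded relevance judgments on the training streams:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

def train_content_classifier(train_sentences, train_labels):
    """Fit a decision tree over term n-grams to predict whether a sentence matches
    model summary content (labels come from the expanded relevance judgments)."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), lowercase=True),  # term n-gram features
        DecisionTreeClassifier(max_depth=10),                 # depth limit is an assumption
    )
    model.fit(train_sentences, train_labels)
    return model

def content_probability(model, sentence):
    """Probability that a sentence contains summary-worthy content, used as a feature."""
    return float(model.predict_proba([sentence])[0, 1])
```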
Dynamic Features

Stream Language Models

We maintain several unigram language models that are updated with each new document in the stream. Using these counts, we compute the sum, average, and maximum token probability of the non-stop words in the sentence. We compute similar quantities restricted to the person, location, and organization named entities.
The average and maximum cosine similarity of the current sentence to all previous updates is computed under both the tf-idf bag-of-words and latent vector representations. We also include indicator features for when the set of updates is empty (i.e. at the beginning of a run) and when either similarity is 0.
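A sketch of the update similarity features for one vector representation; in the model these are computed for both the tf-idf and latent representations:

```python
import numpy as np

def update_similarity_features(sent_vec, update_vecs, eps=1e-9):
    """Average and maximum cosine similarity of the candidate sentence to all previous
    updates, plus indicators for an empty update set and for zero similarity."""
    if not update_vecs:
        return {"avg_update_sim": 0.0, "max_update_sim": 0.0,
                "updates_empty": 1.0, "sim_is_zero": 1.0}
    sims = [float(np.dot(sent_vec, u) /
                  (np.linalg.norm(sent_vec) * np.linalg.norm(u) + eps))
            for u in update_vecs]
    return {"avg_update_sim": float(np.mean(sims)),
            "max_update_sim": float(np.max(sims)),
            "updates_empty": 0.0,
            "sim_is_zero": 1.0 if max(sims) == 0.0 else 0.0}
```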
Document Frequency
We also compute the hour-to-hour percent change in document frequency of the stream. This feature helps gauge breaking developments in an unfolding event. As this feature is also heavily affected by the daily news cycle (larger average document frequencies in the morning and evening), we standardize it to zero mean and unit variance, using the training streams to estimate the mean and variance for each hour of the day.
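A sketch of this feature, assuming per-hour document counts for the query stream and per-hour-of-day means and standard deviations estimated from the training streams:

```python
def doc_frequency_feature(hourly_doc_counts, hour_of_day, train_mean, train_std, eps=1e-9):
    """Hour-to-hour percent change in the stream's document frequency, standardized
    per hour of the day with statistics estimated from the training streams."""
    prev, curr = hourly_doc_counts[-2], hourly_doc_counts[-1]
    pct_change = (curr - prev) / max(1.0, float(prev))
    return (pct_change - train_mean[hour_of_day]) / (train_std[hour_of_day] + eps)
```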
Feature Interactions
Many of our features are helpful for determining the importance of a sentence with respect to its document. However, they are more ambiguous for determining importance to the event as a whole. For example, it is not clear how to compare the document level PageRank of sentences from different documents. To compensate for this, we leverage two features which we believe to be good global indicators of update selection: the summary content probability and the document frequency. These two features are proxies for detecting (1) good summary sentences (regardless of novelty with respect to other previous decisions) and (2) when an event is likely to be producing novel content. We compute the conjunctions of all previously mentioned features with the summary content probability and document frequency separately and together.
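A sketch of the conjunction features, assuming the base features are held in a dictionary:

```python
def add_conjunctions(features, content_prob, doc_freq):
    """Conjoin every base feature with the summary content probability and the
    document frequency signal, separately and together."""
    conj = dict(features)
    conj["content_prob"] = content_prob
    conj["doc_freq"] = doc_freq
    for name, value in features.items():
        conj[name + "*content_prob"] = value * content_prob
        conj[name + "*doc_freq"] = value * doc_freq
        conj[name + "*content_prob*doc_freq"] = value * content_prob * doc_freq
    return conj
```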
Oracle Policy and Loss Function

Much of the multi-document summarization literature employs greedy selection methods. We adopt a greedy oracle that selects a sentence if it improves our evaluation metric (see Section 5.2).

We design our loss function to penalize policies that severely over- or under-generate. Given two sets of decisions, usually one from the oracle and another from the candidate model, we define the loss as the complement of the Dice coefficient between the decisions,

ℓ(a, a′) = 1 − 2 Σ_i a_i a′_i / (Σ_i a_i + Σ_i a′_i).

This encourages not only local agreement between policies (the numerator of the second term) but also that the learned and oracle policies should generate roughly the same number of updates (the denominator of the second term).
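A sketch of the loss and the greedy oracle; metric is assumed to score a set of selected sentences against the relevance judgments (e.g. the combination of expected gain and comprehensiveness described in Section 5.2):

```python
def dice_loss(a, a_prime):
    """Complement of the Dice coefficient between two binary decision sequences:
    penalizes local disagreement and mismatched numbers of updates."""
    overlap = sum(x * y for x, y in zip(a, a_prime))
    total = sum(a) + sum(a_prime)
    return 1.0 - (2.0 * overlap / total) if total > 0 else 0.0

def greedy_oracle(stream, metric):
    """Greedy reference policy: select a sentence only if doing so improves the
    evaluation metric of the update summary built so far."""
    selected, decisions = [], []
    for sentence in stream:
        if metric(selected + [sentence]) > metric(selected):
            selected.append(sentence)
            decisions.append(1)
        else:
            decisions.append(0)
    return decisions
```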
Materials and Methods

We evaluate our method on the publicly available TREC Temporal Summarization Track data. This data is comprised of three parts.

The corpus consists of a 16.1 terabyte set of 1.2 billion timestamped documents crawled from the web between October 2011 and February 2013 [Frank et al., 2012] (available at http://streamcorpus.org/). The crawl includes news articles, forum data, weblogs, as well as a variety of other crawled web pages.

The queries consist of a set of 44 events which occurred during the timespan of the corpus. Each query has an associated time range to limit the experiment to a timespan of interest, usually around two weeks. In addition, each query is associated with an 'event category' (e.g. 'earthquake', 'hurricane'). Each query is also associated with an ideal summary, a set of short, timestamped textual descriptions of facts about the event. The items in this set, also known as nuggets, are considered the completed and irreducible sub-events associated with the query. For example, the phrases "multiple people have been injured" and "at least three people have been killed" are two of the nuggets extracted for the query 'Boston marathon bombing'. On average, 73.35 nuggets were extracted for each event.

The relevance judgments consist of a sample of sentences pooled from participant systems, each of which has been manually assessed as related to one or more of a query's nuggets or not. For example, the following sentence, "Two explosions near the finish line of the Boston Marathon on Monday killed three people and wounded scores," matches the nuggets mentioned above. The relevance judgments can be used to compute evaluation metrics (Section 5.2) and, as a result, to also define our oracle policy (Section 4.3).
Expanding Relevance Judgments
Because of the large size of the corpus and the limited size of the sample, many good candidate sentences were not manually reviewed. After aggressive document filtering (see below), less than 1% of the sentences received manual review. In order to increase the amount of data for training and evaluation of our system, we augmented the manual judgements with automatic or "soft" matches. A separate gradient boosting classifier was trained for each nugget with a sufficient number of manually matched sentences.

Document Filtering
For any given event query, most of the documents in the corpus are irrelevant. Because our queries all consist of news events, we restrict ourselves to the news section of the corpus, consisting of 7,592,062 documents.

These documents are raw web pages, mostly from local news outlets running stories from syndication services (e.g. Reuters), in a variety of layouts. In order to normalize these inputs we filtered the raw stream for relevancy and redundancy with the following three stage process. We first preprocessed each document's raw html using an article extraction library (python-goose: https://github.com/grangier/python-goose). Articles were truncated to the first 20 sentences. We then removed any articles that did not contain all of the query keywords in the article text, resulting in one document stream for each query. Finally, documents whose cosine similarity to any previous document exceeded a threshold were removed from the stream.

Metrics

We are interested in measuring a summary's relevance, comprehensiveness, redundancy, and latency (the delay in selecting nugget information). The Temporal Summarization Track adopts three principal metrics which we review here. Complete details can be found in the Track's official metrics document. We use the official evaluation code to compute all metrics.

Given a system's update summary a and our sentence-level relevance judgments, we can compute the number of matching nuggets found. Importantly, a summary only gets credit for the number of unique matching nuggets, not the number of matching sentences. This prevents a system from receiving credit for selecting several sentences which match the same nugget. We refer to the number of unique matching nuggets as the gain. We can also penalize a system which retrieves a sentence matching a nugget far after the timestamp of the nugget. The latency-penalized gain discounts each match's contribution to the gain proportionally to the delay of the first matching sentence.

The gain value can be used to compute latency- and redundancy-penalized analogs to precision and recall. Specifically, the expected gain divides the gain by the number of system updates. This precision-oriented metric can be considered the expected number of new nuggets in a sentence selected by the system. The comprehensiveness divides the gain by the number of nuggets. This recall-oriented metric can be considered the completeness of a user's information after the termination of the experiment. Finally, we also compute the harmonic mean of expected gain and comprehensiveness (i.e. F1). We present results using either gain or latency-penalized gain in order to better understand system behavior.

To evaluate our model, we randomly select five events to use as a development set and then perform a leave-one-out style evaluation on the remaining 39 events.

Even after filtering, each training query's document stream is still too large to be used directly in our combinatorial search space. In order to make training time reasonable yet representative, we downsample each stream to a length of 100 sentences. The downsampling is done uniformly over the entire stream. This is repeated 10 times for each training event to create a total of 380 training streams. In the event that a downsample contains no nuggets (either human or automatically labeled) we resample until at least one exists in the sample.

In order to avoid over-fitting, we select the model iteration for each training fold based on its performance (in F1 score of expected gain and comprehensiveness) on the development set.
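A sketch of the unpenalized versions of these metrics; update_matches maps each system update to the set of nugget ids it matches, and the latency discounting is omitted for brevity:

```python
def summary_metrics(update_matches, num_event_nuggets):
    """update_matches: for each system update, the set of nugget ids it matches
    (possibly empty). Returns expected gain, comprehensiveness, and their harmonic
    mean (F1); only unique nugget matches are credited."""
    matched = set()
    for nuggets in update_matches:
        matched |= set(nuggets)
    gain = len(matched)
    exp_gain = gain / max(1, len(update_matches))          # precision-oriented
    comprehensiveness = gain / max(1, num_event_nuggets)   # recall-oriented
    denom = exp_gain + comprehensiveness
    f1 = 2 * exp_gain * comprehensiveness / denom if denom > 0 else 0.0
    return {"expected_gain": exp_gain,
            "comprehensiveness": comprehensiveness,
            "F1": f1}
```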
We refer to our "learning to search" model in the results as LS. We compare our proposed model against several baselines and extensions.

Cosine Similarity Threshold
One of the top performing systems in temporal summarization at TREC 2015 was a heuristic method that only examined article first sentences, selecting those that were below a cosine similarity threshold to any of the previously selected updates. We implemented a variant of that approach using the latent-vector representation used throughout this work. The development set was used to set the threshold. We refer to this model as COS (team WaterlooClarke at TREC 2015).
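A sketch of this baseline; the latent sentence vectors and the tuned threshold are assumed to be given:

```python
import numpy as np

def cos_baseline(first_sentence_vecs, threshold):
    """Select an article's first sentence as an update if its cosine similarity to
    every previously selected update is below the threshold."""
    updates, decisions = [], []
    for v in first_sentence_vecs:
        sims = [float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-9))
                for u in updates]
        if not sims or max(sims) < threshold:
            updates.append(v)
            decisions.append(1)
        else:
            decisions.append(0)
    return decisions
```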
Affinity Propagation
The next baseline was a top performer at the previous year's TREC evaluations [Kedzie et al., 2015]. This system processes the stream in non-overlapping windows of time, using affinity propagation (AP) clustering [Frey and Dueck, 2007] to identify update sentences (i.e. sentences that are cluster centers). As in the COS model, a similarity threshold is used to filter out updates that are too similar to previous updates (i.e. previous clustering outputs). We use the summary content probability feature as the preference or salience parameter. The time window size, similarity threshold, and an offset for the cluster preference are tuned on the development set. We use the authors' publicly available implementation and refer to this method as APSAL.

Learning2Search+Cosine Similarity Threshold
In this model, which we refer to as LSCos, we run LS as before, but filter the resulting updates using the same cosine similarity threshold as COS. The threshold was also tuned on the development set.

Results and Discussion

Figure 3: Average system performance and average number of updates per event. Columns report expected gain, comprehensiveness, and F1 under both unpenalized and latency-penalized gain; rows are APSAL, COS, LS, and LSCos. Superscripts indicate significant improvements (p < 0.05) between the run and competing algorithms using the paired randomization test with the Bonferroni correction for multiple comparisons (s: APSAL, c: COS, l: LS, f: LSCos).

Figure 4: Average system performance (latency-penalized expected gain, comprehensiveness, and F1) for COS, LS-FS, and LSCos-FS. LS-FS and LSCos-FS runs are trained and evaluated on first sentences only (like the COS system). Unpenalized results are omitted for space but the rankings are consistent.

Figure 5: Error counts (miss lead, miss body, empty, duplicate, total) for APSAL, COS, LS-FS, LSCos-FS, LS, and LSCos.

Results for system runs are shown in Figure 3. On average, LS and LSCos achieve higher F1 scores than the baseline systems in both latency-penalized and unpenalized evaluations. For LSCos, the difference in mean F1 score was significant compared to all other systems (for both latency settings).

APSAL achieved the overall highest expected gain, partially because it was the tersest system we evaluated. However, only COS was statistically significantly worse than it on this measure.

In comprehensiveness, LS recalls on average a fifth of the nuggets for each event. This is even more impressive when compared to the average number of updates produced by each system (Figure 3); while COS achieves similar comprehensiveness, it takes on average about 62% more updates than LS and almost 400% more updates than LSCos. The output size of COS stretches the limit of the term "summary," which is typically shorter than 145 sentences in length. This is especially important if the intended application is negatively affected by verbosity (e.g. crisis monitoring).

Since COS only considers the first sentence of each document, it may miss relevant sentences below the article's lead. In order to confirm the importance of modeling the oracle, we also trained and evaluated the LS based approaches on first sentence only streams. Figure 4 shows the latency-penalized results of the first sentence only runs. The LS approaches still dominate COS and receive larger positive effects from the latency penalty despite also being restricted to the first sentence. Clearly having a model (beyond similarity) of what to select is helpful. Ultimately we do much better when we can look at the whole document.

We also performed an error analysis to further understand how each system operates. Figure 5 shows the errors made by each system on the test streams. Errors were broken down into four categories. Miss lead and miss body errors occur when a system skips a sentence containing a novel nugget in the lead or article body respectively. An empty error indicates an update was selected that contained no nugget.
Duplicate errors occur when an update contains nuggets but none are novel.

Overall, errors of the miss type are most common and suggest future development effort should focus on summary content identification. About a fifth to a third of all system error comes from missing content in the lead sentence alone.

After misses, empty errors (false positives) are the next largest source of error. COS was especially prone to empty errors (41% of its total errors). LS is also vulnerable to empties (19.9%), but after applying the similarity filter and restricting to first sentences, these errors can be reduced dramatically (to 1%).

Surprisingly, duplicate errors are a minor issue in our evaluation. This is not to suggest we should ignore this component, however, as efforts to increase recall (reduce miss errors) are likely to require more robust redundancy detection.

Conclusion

In this paper we presented a fully online streaming document summarization system capable of processing web-scale data efficiently. We also demonstrated the effectiveness of "learning to search" algorithms for this task. As shown in our error analysis, improving summary content selection, especially in the article body, should be the focus of future work. We would like to explore deeper linguistic analysis (e.g. coreference and discourse structures) to identify places likely to contain content rather than processing whole documents.
Acknowledgements
We would like to thank Hal Daumé III for answering our questions about learning to search. The research described here was supported in part by the National Science Foundation (NSF) under IIS-1422863. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF.
References

[Allan et al., 2001] James Allan, Rahul Gupta, and Vikas Khandelwal. Temporal summaries of new topics. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 10–18. ACM, 2001.

[Aslam et al., 2013] Javed Aslam, Matthew Ekstrand-Abueg, Virgil Pavlu, Fernando Diaz, and Tetsuya Sakai. TREC 2013 temporal summarization. In Proceedings of the 22nd Text Retrieval Conference (TREC), November 2013.

[Chang et al., 2015] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé, and John Langford. Learning to search better than your teacher. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2058–2066. JMLR Workshop and Conference Proceedings, 2015.

[Conroy et al., 2011] John M. Conroy, Judith D. Schlesinger, Jeff Kubina, Peter A. Rankel, and Dianne P. O'Leary. CLASSY 2011 at TAC: Guided and multi-lingual summaries and evaluation metrics. In Proceedings of the Text Analysis Conference, 2011.

[Dang and Owczarzak, 2008] Hoa Trang Dang and Karolina Owczarzak. Overview of the TAC 2008 update summarization task. In Proceedings of the Text Analysis Conference, pages 1–16, 2008.

[Daumé III et al., 2009] Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.

[Du et al., 2011] Pan Du, Jipeng Yuan, Xianghui Lin, Jin Zhang, Jiafeng Guo, and Xueqi Cheng. Decayed DivRank for guided summarization. In Proceedings of the Text Analysis Conference, 2011.

[Erkan and Radev, 2004] Günes Erkan and Dragomir R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. JAIR, pages 457–479, 2004.

[Frank et al., 2012] John R. Frank, Max Kleiman-Weiner, Daniel A. Roberts, Feng Niu, Ce Zhang, Christopher Ré, and Ian Soboroff. Building an entity-centric stream filtering test collection for TREC 2012. Technical report, DTIC Document, 2012.

[Frey and Dueck, 2007] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.

[Graff and Cieri, 2003] David Graff and C. Cieri. English Gigaword corpus. Linguistic Data Consortium, 2003.

[Guo and Diab, 2012] Weiwei Guo and Mona Diab. A simple unsupervised latent semantics based approach for sentence similarity. In Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 586–590, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[Guo et al., 2013] Qi Guo, Fernando Diaz, and Elad Yom-Tov. Updating users about time critical events. In ECIR, pages 483–494, Berlin, Heidelberg, 2013. Springer-Verlag.

[Haghighi and Vanderwende, 2009] Aria Haghighi and Lucy Vanderwende. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370. Association for Computational Linguistics, 2009.

[Kedzie et al., 2015] Chris Kedzie, Kathleen McKeown, and Fernando Diaz. Predicting salient updates for disaster summarization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1608–1617. Association for Computational Linguistics, July 2015.

[Lin and Hovy, 2000] Chin-Yew Lin and Eduard Hovy. The automated acquisition of topic signatures for text summarization. In ACL, pages 495–501. ACL, 2000.

[Mason and Charniak, 2011] Rebecca Mason and Eugene Charniak. Extractive multi-document summaries should explicitly not contain document-specific content. In Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, pages 49–54. Association for Computational Linguistics, 2011.

[McCreadie et al., 2014] Richard McCreadie, Craig Macdonald, and Iadh Ounis. Incremental update summarization: Adaptive sentence selection based on prevalence and novelty. In CIKM, pages 301–310. ACM, 2014.

[Nenkova and McKeown, 2012] Ani Nenkova and Kathleen McKeown. A survey of text summarization techniques. In Mining Text Data, pages 43–76. Springer, 2012.

[Nenkova and Vanderwende, 2005] A. Nenkova and L. Vanderwende. The impact of frequency on summarization. Technical Report MSR-TR-2005-101, January 2005.

[Radev et al., 2000] Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization. Association for Computational Linguistics, 2000.

[Ross et al., 2011] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.