Real-Time Web Scale Event Summarization Using Sequential Decision Making
Chris Kedzie
Columbia University, Dept. of Computer Science
[email protected]
Fernando Diaz
Microsoft
[email protected]
Kathleen McKeown
Columbia University, Dept. of Computer Science
[email protected]
Abstract
We present a system based on sequential decision making for the online summarization of massive document streams, such as those found on the web. Given an event of interest (e.g. "Boston marathon bombing"), our system is able to filter the stream for relevance and produce a series of short text updates describing the event as it unfolds over time. Unlike previous work, our approach is able to jointly model the relevance, comprehensiveness, novelty, and timeliness required by time-sensitive queries. We demonstrate a 28.3% improvement in summary F1 and a 43.8% improvement in time-sensitive F1 metrics.

Introduction

Tracking unfolding news at web scale continues to be a challenging task. Crisis informatics, monitoring of breaking news, and intelligence tracking all have difficulty in identifying new, relevant information within the massive quantities of text that appear online each second. One broad need that has emerged is the ability to provide real-time, event-specific updates of streaming text data which are timely, relevant, and comprehensive while avoiding redundancy.

Unfortunately, many approaches have adapted standard automatic multi-document summarization techniques that are inadequate for web scale applications. Typically, such systems assume full retrospective access to the documents to be summarized, or that at most a handful of updates to the summary will be made [Dang and Owczarzak, 2008]. Furthermore, evaluation of these systems has assumed reliable and relevant input, something missing in an inconsistent, dynamic, noisy stream of web or social media data. As a result, these systems are poor fits for most real world applications.

In this paper, we present a novel streaming-document summarization system based on sequential decision making. Specifically, we adopt the "learning to search" approach, a technique which adapts methods from reinforcement learning for structured prediction problems [Daumé III et al., 2009; Ross et al., 2011]. In this framework, we cast streaming summarization as a form of greedy search and train our system to imitate the behavior of an oracle summarization system.

– Two explosions shattered the euphoria of the Boston Marathon finish line on Monday, sending authorities out on the course to carry off the injured while the stragglers were rerouted away...
– Police in New York City and London are stepping up security following explosions at the Boston Marathon.
– A senior U.S. intelligence official says two more explosive devices have been found near the scene of the Boston marathon where two bombs detonated earlier.
– Several candidates for Massachusetts Senate special election have suspended campaign activity in response to the explosions...
Figure 1: Excerpt of summary for the query 'Boston marathon bombing' generated from an input stream.

Given a stream of sentence-segmented news webpages and an event query (e.g. "Boston marathon bombing"), our system monitors the stream to detect relevant, comprehensive, novel, and timely content. In response, our summarizer produces a series of short text updates describing the event as it unfolds over time. We present an example of our real-time update stream in Figure 1. We evaluate our system in a crisis informatics setting on a diverse set of event queries, covering severe storms, social unrest, terrorism, and large accidents. We demonstrate a 28.3% improvement in summary F1 and a 43.8% improvement in time-sensitive F1 metrics against several state-of-the-art baselines.

Related Work

Multi-document summarization (MDS) has long been studied by the natural language processing community. We focus specifically on extractive summarization, where the task is to take a collection of text and select some subset of sentences from it that adequately describes the content subject to some budget constraint (e.g. the summary must not exceed k words). For a more in depth survey of the field, see [Nenkova and McKeown, 2012].

Because labeled training data is often scarce, unsupervised approaches to clustering and ranking predominate the field. Popular approaches involve ranking sentences by various notions of input similarity or graph centrality [Radev et al., 2000; Lin and Hovy, 2000; Haghighi and Vanderwende, 2009; Du et al., 2011; Mason and Charniak, 2011; Conroy et al., 2011].

Streaming or temporal summarization was first explored in the context of topic detection and tracking [Allan et al., 2001] and more recently at the Text Retrieval Conference (TREC) [Aslam et al., 2013]. Top performers at TREC included an affinity propagation clustering approach [Kedzie et al., 2015] and a ranking/MDS system combination method [McCreadie et al., 2014]. Both methods are unfortunately constrained to work in hourly batches, introducing potential latency. Perhaps most similar to our work is that of [Guo et al., 2013], which iteratively fits a pair of regression models to predict ngram recall and precision of candidate updates to a model summary. However, their learning objective fails to account for errors made in subsequent prediction steps.

Problem Definition

A streaming summarization task is composed of a brief text query q, including a categorical event type (e.g. 'earthquake', 'hurricane'), as well as a document stream [X_1, X_2, ...]. In practice, we assume that each document is segmented into a sequence of sentences and we therefore consider a sentence stream [x_1, x_2, ...]. A streaming summarization algorithm then selects or skips each sentence as it is observed such that the end user is provided a filtered stream of sentences that are relevant, comprehensive, low redundancy, and timely (see Section 5.2). We refer to the selected sentences as updates and collectively they make up an update summary. We show a fragment of an update summary for the query 'Boston marathon bombing' in Figure 1.

Streaming Summarization as Sequential Decision Making

We could naively treat this problem as classification and predict which sentences to select or skip. However, this would make it difficult to take advantage of many features (e.g. sentence novelty with respect to previous updates).
What is more concerning, however, is that the classification objective for this task is somewhat ill-defined: successfully predicting select on one sentence changes the true label (from select to skip) for sentences that contain the same information but occur later in the stream.

In this work, we pose streaming summarization as a greedy search over a binary branching tree where each level corresponds to a position in the stream (see Figure 2). The height of the tree corresponds to the length of the stream. A path through the tree is determined by the system select and skip decisions.

Figure 2: Search space for a stream of size two. The depth of the tree corresponds to the position in the stream. Left branches indicate selecting the current sentence as an update. Right branches skip the current sentence. The path in green corresponds to one trajectory through this space: select sentence one, then skip sentence two. The state represented by the hollow dot corresponds to the stream at sentence position two with the update summary containing sentence one.

When treated as a sequential decision making problem, our task reduces to defining a policy for selecting a sentence based on its properties as well as properties of its ancestors (i.e. all of the observed sentences and previous decisions). The union of properties, also known as the features, represents the current state in the decision making process. The feature representation provides state abstraction both within a given query's search tree as well as to states in other queries' search trees, and also allows for complex interactions between the current update summary, candidate sentences, and stream dynamics, unlike the classification approach.

In order to learn an effective policy for a query q, we can take one of several approaches. We could use a simulator to provide feedback to a reinforcement learning algorithm. Alternatively, if provided access to an evaluation algorithm at training time, we can simulate (approximately) optimal decisions. That is, using the training data, we can define an oracle policy that is able to omnisciently determine which sentences to select and which to skip. Moreover, it can make these determinations by starting at the root or at an arbitrary node in the tree, allowing us to observe optimal performance in states unlikely to be reached by the oracle. We adopt locally optimal learning to search to learn our model from the oracle policy [Chang et al., 2015].

In this section, we will begin by describing the learning algorithm abstractly and then in detail for our task. We will conclude with details on how to train the model with an oracle policy.

In the induced search problem, each search state s_t corresponds to observing the first t sentences in the stream x_1, ..., x_t and a sequence of t - 1 actions a_1, ..., a_{t-1}. For all states s ∈ S, the set of actions is a ∈ {0, 1}, with 1 indicating we add the t-th sentence to our update summary and 0 indicating we ignore it. For simplicity, we assume a fixed length stream of size T, but this is not strictly necessary. From each input stream, x = x_1, ..., x_T, we produce a corresponding output a ∈ {0, 1}^T. We use x_{1:t} to indicate the first t elements of x.

Algorithm 1: Locally optimal learning to search.
  Input: {x_q, π*_q} for q ∈ Q, number of iterations N, and a mixture parameter β ∈ (0, 1) for roll-out.
  Output: π̃
   1: Initialize π̃_i, i ← 1
   2: for n ∈ {1, 2, ..., N} do
   3:   for q ∈ Q do
   4:     Γ ← ∅
   5:     for t ∈ {0, 1, ..., T - 1} do
   6:       Roll in by executing π̃_i for t rounds and reach s_t.
   7:       for a ∈ A(s_t) do
   8:         Let π_o = π*_q with probability β, else π̃_i.
   9:         Compute c_t(a) by rolling out with π_o.
  10:         Γ ← Γ ∪ {⟨Φ(s_t), a, c_t(a)⟩}
  11:     π̃_{i+1} ← UpdateCostSensitiveClassifier(π̃_i, Γ)
  12:     i ← i + 1
  13: Return π̃_i
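A minimal sketch of this training loop in Python follows. The stream representation, the feature function phi, the per-query oracle policies, the loss over complete decision sequences, and the squared-loss cost regressor are illustrative assumptions, not the authors' released implementation.

```python
import random
import numpy as np

class CostRegressor:
    """Cost-sensitive classifier: a linear regression of cost on state features,
    one weight vector per action. The induced policy picks the action with the
    lowest predicted cost."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros((2, dim))
        self.lr = lr

    def cost(self, feats, action):
        return float(self.w[action].dot(feats))

    def act(self, feats):
        return int(self.cost(feats, 1) < self.cost(feats, 0))

    def update(self, feats, action, cost):
        err = self.cost(feats, action) - cost      # one SGD step on squared error
        self.w[action] -= self.lr * err * feats


def rollout(stream, prefix, policy_act, phi):
    """Complete a decision sequence from the given prefix to the end of the stream."""
    actions = list(prefix)
    for t in range(len(prefix), len(stream)):
        actions.append(policy_act(phi(stream, t, actions)))
    return actions


def lols_train(queries, phi, oracles, loss, dim, n_iters=10, beta=0.5):
    """queries: one sentence stream per training query; oracles[q]: reference policy
    for query q; phi(stream, t, actions): feature vector for the state at position t;
    loss(q, actions): loss of a complete update summary for query q."""
    policy = CostRegressor(dim)
    for _ in range(n_iters):
        for q, stream in enumerate(queries):
            actions = []                            # roll-in with the current learned policy
            for t in range(len(stream)):
                feats = phi(stream, t, actions)
                # roll out each candidate action with a mixture of oracle and learned policy
                roll = oracles[q] if random.random() < beta else policy.act
                costs = [loss(q, rollout(stream, actions + [a], roll, phi))
                         for a in (0, 1)]
                for a in (0, 1):                    # regret-style costs for the regressor
                    policy.update(feats, a, costs[a] - min(costs))
                actions.append(policy.act(feats))
    return policy
```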
For a training query q, a reference policy π*_q can be constructed from the training data. Specifically, with access to the relevance and novelty of every x_i in the stream, we can omnisciently make a decision whether to select or skip based on a long term metric (see Section 5.2). The goal then is to learn a policy π̃ that imitates the reference well across a set of training queries Q. We encode each state as a vector in R^d with a feature function Φ, and our learned policy is a mapping π̃ : R^d → A of states to actions.

We train π̃ using locally optimal learning to search [Chang et al., 2015], presented in Algorithm 1. The algorithm operates by iteratively updating a cost-sensitive classifier. For each training query, we construct a query-specific training set Γ by simulating the processing of the training input stream x_q. The instances in Γ are triples comprised of a feature vector derived from the current state s, a candidate action a, and the cost c(a) associated with taking action a in state s. Constructing Γ consists of (1) selecting states and actions, and (2) computing the cost for each state-action pair.

The number of states is exponential in T, so constructing Γ using the full set of states may be computationally prohibitive. Beyond this, the states in Γ would not be representative of those visited at test time. In order to address this, we sample from S by executing the current policy π̃ throughout the training simulation, resulting in T state samples for Γ (lines 5-10).

Given a sampled state s, we need to compute the cost of taking actions a ∈ {0, 1}. With access to a query-specific oracle, π*_q, we can observe its preferred decision at s and penalize choosing the other action. The magnitude of this penalty is proportional to the difference in expected performance between the oracle decision and the alternative decision. The performance of a decision is derived from a loss function ℓ, to be introduced in Section 4.3. Importantly, our loss function is defined over a complete update summary, incorporating the implications of selecting an action on future decisions. Therefore, our cost needs to incorporate a sequence of decisions after taking some action in state s. The algorithm accomplishes this by rolling out a policy after a until the stream has been exhausted (line 9). As a result, we have a prefix defined by π̃, an action, and then a suffix defined by the roll-out policy. In our work, we use a mixture policy that combines both the current model π̃ and the oracle π* (line 8). This mixture policy encourages learning from states that are likely to be visited by the current learned policy but not by the oracle.

After our algorithm has gathered Γ for a specific q using π̃_i, we train on the data to produce π̃_{i+1}. Here π̃_i is implemented as a cost-sensitive classifier, i.e. a linear regression of the costs on features and actions; the natural policy is to select the action with lowest predicted cost. With each query, we update the regression with stochastic gradient descent on the newly sampled (state, action, cost) tuples (line 11). We repeat this process for N passes over all queries in the training set.

In the following sections, we specify the feature function Φ, the loss ℓ, and our reference policy π*.
Features

As mentioned in the previous section, we represent each state as a feature vector. In general, at time t, these features are functions of the current sentence (i.e. x_t), the stream history (i.e. x_{1:t}), and/or the decision history (a_{1:t-1}). We refer to features determined only by x_t as static features and all others as dynamic features.

Static Features

Basic Features
Our most basic features look at the length in words of a sentence, its position in the document, and the ratio of specific named entity tags to non-named entity tokens. We also compute the average number of sentence tokens that match the event query words and synonyms using WordNet.
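A sketch of these basic features; the tokenization, per-token named-entity tags, and the NLTK WordNet expansion are illustrative assumptions, and the entity ratio is aggregated over all entity types here for brevity:

```python
from nltk.corpus import wordnet  # requires the NLTK WordNet data (nltk.download('wordnet'))

def basic_features(tokens, sent_index, doc_num_sents, ne_tags, query_words):
    """tokens: tokens of the current sentence; ne_tags: one tag per token ('O' for
    non-entities); query_words: the event query terms."""
    # expand the query with WordNet synonyms of each query term
    synonyms = set(w.lower() for w in query_words)
    for w in query_words:
        for syn in wordnet.synsets(w):
            synonyms.update(l.name().lower() for l in syn.lemmas())
    matches = sum(1 for t in tokens if t.lower() in synonyms)
    num_ne = sum(1 for t in ne_tags if t != 'O')
    return {
        "sent_length": len(tokens),
        "doc_position": sent_index / max(1, doc_num_sents),   # relative position in document
        "ne_ratio": num_ne / max(1, len(tokens) - num_ne),    # entity vs. non-entity tokens
        "avg_query_match": matches / max(1, len(tokens)),
    }
```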
Language Model Features
Similar to [Kedzie et al., 2015], we compute the average token log probability of the sentence on two language models: i) an event type specific language model and ii) a general newswire language model. The first language model is built from Wikipedia articles relevant to the event-type domain. The second model is built from the New York Times and Associated Press sections of the Gigaword-5 corpus [Graff and Cieri, 2003].
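A sketch of this feature using a simple unigram model in place of the full language models; event_type_lm and newswire_lm are assumed dictionaries of token probabilities estimated from the two corpora:

```python
import math

def avg_token_logprob(tokens, unigram_probs, unk_prob=1e-6):
    """Average per-token log probability under a unigram language model;
    unigram_probs maps token -> probability estimated from the relevant corpus."""
    if not tokens:
        return 0.0
    logps = [math.log(unigram_probs.get(t.lower(), unk_prob)) for t in tokens]
    return sum(logps) / len(logps)

# Two such scores are computed per sentence, e.g.:
#   avg_token_logprob(tokens, event_type_lm)   # domain (Wikipedia) language model
#   avg_token_logprob(tokens, newswire_lm)     # general newswire (Gigaword) language model
```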
Single Document Summarization Features
These features are computed using the current sentence's document as a context and are also commonly used as ranking features in other document summarization systems. Where a similarity or distance is needed, we use either a tf-idf bag-of-words or k-dimensional latent vector representation. The latter is derived by projecting the former onto a k-dimensional space using the weighted textual matrix factorization method [Guo and Diab, 2012].

We compute SumBasic features [Nenkova and Vanderwende, 2005]: the average and sum of unigram probabilities in a sentence. We compute the arithmetic and geometric means of the sentence's cosine distance to the other sentences of the document [Guo et al., 2013]. We refer to this quantity as novelty and compute it with both vector representations. We also compute the centroid rank [Radev et al., 2000] and LexRank of each sentence [Erkan and Radev, 2004], again using both vector representations.

We have attempted to use a comprehensive set of static features used in previous summarization systems. We omit details for space, but source code is available at: https://github.com/kedz/ijcai2016
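A sketch of two of these features, the SumBasic unigram probabilities and the cosine-distance novelty score; the sentence vectors (tf-idf or latent) are assumed to be precomputed:

```python
import numpy as np
from collections import Counter

def sumbasic_features(sent_tokens, doc_tokens):
    """SumBasic-style features: average and sum of the document unigram probabilities
    of the sentence tokens."""
    counts = Counter(t.lower() for t in doc_tokens)
    total = float(sum(counts.values()))
    probs = [counts[t.lower()] / total for t in sent_tokens]
    return {"sumbasic_avg": float(np.mean(probs)) if probs else 0.0,
            "sumbasic_sum": float(np.sum(probs))}

def novelty_features(sent_vec, other_sent_vecs, eps=1e-9):
    """Arithmetic and geometric mean cosine distance of a sentence to the other
    sentences of its document."""
    dists = [1.0 - float(np.dot(sent_vec, v) /
                         (np.linalg.norm(sent_vec) * np.linalg.norm(v) + eps))
             for v in other_sent_vecs]
    dists = np.array(dists) if dists else np.array([0.0])
    return {"novelty_amean": float(dists.mean()),
            "novelty_gmean": float(np.exp(np.log(dists + eps).mean()))}
```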
Summary Content Probability
For a subset of the stream sentences we have manual judgements as to whether they match model summary content or not (see Sec. 5.1, Expanding Relevance Judgments). We use this data (restricted to sentences from the training query streams) to train a decision tree classifier, using the sentences' term ngrams as classifier features. As this data is aggregated across the training queries, the purpose of this classifier is to capture the importance of general ngrams predictive of summary-worthy content. Using this classifier, we obtain the probability that the current sentence x_t contains summary content and use this as a model feature.
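A sketch of this classifier using scikit-learn; the n-gram range and tree depth are assumptions, and the labels come from the expanded relevance judgments on the training streams:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

def train_content_classifier(train_sentences, train_labels):
    """Fit a decision tree over term n-grams to predict whether a sentence matches
    model summary content (labels come from the expanded relevance judgments)."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), lowercase=True),  # term n-gram features
        DecisionTreeClassifier(max_depth=10),                 # depth limit is an assumption
    )
    model.fit(train_sentences, train_labels)
    return model

def content_probability(model, sentence):
    """Probability that a sentence contains summary-worthy content, used as a feature."""
    return float(model.predict_proba([sentence])[0, 1])
```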
Dynamic Features

Stream Language Models

We maintain several unigram language models that are updated with each new document in the stream. Using these counts, we compute the sum, average, and maximum token probability of the non-stop words in the sentence. We compute similar quantities restricted to the person, location, and organization named entities.
The average and maximum cosine similarity of the current sentence to all previous updates is computed under both the tf-idf bag-of-words and latent vector representations. We also include indicator features for when the set of updates is empty (i.e. at the beginning of a run) and when either similarity is 0.
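A sketch of the update similarity features for one vector representation; in the model these are computed for both the tf-idf and latent representations:

```python
import numpy as np

def update_similarity_features(sent_vec, update_vecs, eps=1e-9):
    """Average and maximum cosine similarity of the candidate sentence to all previous
    updates, plus indicators for an empty update set and for zero similarity."""
    if not update_vecs:
        return {"avg_update_sim": 0.0, "max_update_sim": 0.0,
                "updates_empty": 1.0, "sim_is_zero": 1.0}
    sims = [float(np.dot(sent_vec, u) /
                  (np.linalg.norm(sent_vec) * np.linalg.norm(u) + eps))
            for u in update_vecs]
    return {"avg_update_sim": float(np.mean(sims)),
            "max_update_sim": float(np.max(sims)),
            "updates_empty": 0.0,
            "sim_is_zero": 1.0 if max(sims) == 0.0 else 0.0}
```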
Document Frequency
We also compute the hour-to-hour percent change in document frequency of the stream. This feature helps gauge breaking developments in an unfolding event. As this feature is also heavily affected by the daily news cycle (larger average document frequencies in the morning and evening), we standardize it to zero mean and unit variance, using the training streams to estimate the mean and variance for each hour of the day.
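A sketch of this feature, assuming per-hour document counts for the query stream and per-hour-of-day means and standard deviations estimated from the training streams:

```python
def doc_frequency_feature(hourly_doc_counts, hour_of_day, train_mean, train_std, eps=1e-9):
    """Hour-to-hour percent change in the stream's document frequency, standardized
    per hour of the day with statistics estimated from the training streams."""
    prev, curr = hourly_doc_counts[-2], hourly_doc_counts[-1]
    pct_change = (curr - prev) / max(1.0, float(prev))
    return (pct_change - train_mean[hour_of_day]) / (train_std[hour_of_day] + eps)
```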
Feature Interactions
Many of our features are helpful for determining the importance of a sentence with respect to its document. However, they are more ambiguous for determining importance to the event as a whole. For example, it is not clear how to compare the document level PageRank of sentences from different documents. To compensate for this, we leverage two features which we believe to be good global indicators of update selection: the summary content probability and the document frequency. These two features are proxies for detecting (1) good summary sentences (regardless of novelty with respect to other previous decisions) and (2) when an event is likely to be producing novel content. We compute the conjunctions of all previously mentioned features with the summary content probability and document frequency separately and together.
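A sketch of the conjunction features, assuming the base features are held in a dictionary:

```python
def add_conjunctions(features, content_prob, doc_freq):
    """Conjoin every base feature with the summary content probability and the
    document frequency signal, separately and together."""
    conj = dict(features)
    conj["content_prob"] = content_prob
    conj["doc_freq"] = doc_freq
    for name, value in features.items():
        conj[name + "*content_prob"] = value * content_prob
        conj[name + "*doc_freq"] = value * doc_freq
        conj[name + "*content_prob*doc_freq"] = value * content_prob * doc_freq
    return conj
```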
Oracle Policy and Loss Function

Much of the multi-document summarization literature employs greedy selection methods. We adopt a greedy oracle that selects a sentence if it improves our evaluation metric (see Section 5.2).

We design our loss function to penalize policies that severely over- or under-generate. Given two sets of decisions, usually one from the oracle and another from the candidate model, we define the loss as the complement of the Dice coefficient between the decisions,

ℓ(a, a′) = 1 − 2 Σ_i a_i a′_i / (Σ_i a_i + Σ_i a′_i).

This encourages not only local agreement between policies (the numerator of the second term) but also that the learned and oracle policies should generate roughly the same number of updates (the denominator of the second term).
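A sketch of the loss and the greedy oracle; metric is assumed to score a set of selected sentences against the relevance judgments (e.g. the combination of expected gain and comprehensiveness described in Section 5.2):

```python
def dice_loss(a, a_prime):
    """Complement of the Dice coefficient between two binary decision sequences:
    penalizes local disagreement and mismatched numbers of updates."""
    overlap = sum(x * y for x, y in zip(a, a_prime))
    total = sum(a) + sum(a_prime)
    return 1.0 - (2.0 * overlap / total) if total > 0 else 0.0

def greedy_oracle(stream, metric):
    """Greedy reference policy: select a sentence only if doing so improves the
    evaluation metric of the update summary built so far."""
    selected, decisions = [], []
    for sentence in stream:
        if metric(selected + [sentence]) > metric(selected):
            selected.append(sentence)
            decisions.append(1)
        else:
            decisions.append(0)
    return decisions
```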
Materials and Methods

We evaluate our method on the publicly available TREC Temporal Summarization Track data. This data is comprised of three parts.

The corpus consists of a 16.1 terabyte set of 1.2 billion timestamped documents crawled from the web between October 2011 and February 2013 [Frank et al., 2012] (available at http://streamcorpus.org/). The crawl includes news articles, forum data, weblogs, as well as a variety of other crawled web pages.

The queries consist of a set of 44 events which occurred during the timespan of the corpus. Each query has an associated time range to limit the experiment to a timespan of interest, usually around two weeks. In addition, each query is associated with an 'event category' (e.g. 'earthquake', 'hurricane'). Each query is also associated with an ideal summary, a set of short, timestamped textual descriptions of facts about the event. The items in this set, also known as nuggets, are considered the completed and irreducible sub-events associated with the query. For example, the phrases "multiple people have been injured" and "at least three people have been killed" are two of the nuggets extracted for the query 'Boston marathon bombing'. On average, 73.35 nuggets were extracted for each event.

The relevance judgments consist of a sample of sentences pooled from participant systems, each of which has been manually assessed as related to one or more of a query's nuggets or not. For example, the following sentence, "Two explosions near the finish line of the Boston Marathon on Monday killed three people and wounded scores," matches the nuggets mentioned above. The relevance judgments can be used to compute evaluation metrics (Section 5.2) and, as a result, to also define our oracle policy (Section 4.3).
Expanding Relevance Judgments
Because of the large size of the corpus and the limited size of the sample, many good candidate sentences were not manually reviewed. After aggressive document filtering (see below), less than 1% of the sentences received manual review. In order to increase the amount of data for training and evaluation of our system, we augmented the manual judgements with automatic or "soft" matches. A separate gradient boosting classifier was trained for each nugget with a sufficient number of manually matched sentences.

Document Filtering
For any given event query, most of the documents in the corpus are irrelevant. Because our queries all consist of news events, we restrict ourselves to the news section of the corpus, consisting of 7,592,062 documents.

These documents are raw web pages, mostly from local news outlets running stories from syndication services (e.g. Reuters), in a variety of layouts. In order to normalize these inputs we filtered the raw stream for relevancy and redundancy with the following three stage process. We first preprocessed each document's raw html using an article extraction library (python-goose: https://github.com/grangier/python-goose). Articles were truncated to the first 20 sentences. We then removed any articles that did not contain all of the query keywords in the article text, resulting in one document stream for each query. Finally, documents whose cosine similarity to any previous document exceeded a threshold were removed from the stream.

Metrics

We are interested in measuring a summary's relevance, comprehensiveness, redundancy, and latency (the delay in selecting nugget information). The Temporal Summarization Track adopts three principal metrics which we review here. Complete details can be found in the Track's official metrics document. We use the official evaluation code to compute all metrics.

Given a system's update summary a and our sentence-level relevance judgments, we can compute the number of matching nuggets found. Importantly, a summary only gets credit for the number of unique matching nuggets, not the number of matching sentences. This prevents a system from receiving credit for selecting several sentences which match the same nugget. We refer to the number of unique matching nuggets as the gain. We can also penalize a system which retrieves a sentence matching a nugget far after the timestamp of the nugget. The latency-penalized gain discounts each match's contribution to the gain proportionally to the delay of the first matching sentence.

The gain value can be used to compute latency- and redundancy-penalized analogs to precision and recall. Specifically, the expected gain divides the gain by the number of system updates. This precision-oriented metric can be considered the expected number of new nuggets in a sentence selected by the system. The comprehensiveness divides the gain by the number of nuggets. This recall-oriented metric can be considered the completeness of a user's information after the termination of the experiment. Finally, we also compute the harmonic mean of expected gain and comprehensiveness (i.e. F1). We present results using either gain or latency-penalized gain in order to better understand system behavior.

To evaluate our model, we randomly select five events to use as a development set and then perform a leave-one-out style evaluation on the remaining 39 events.

Even after filtering, each training query's document stream is still too large to be used directly in our combinatorial search space. In order to make training time reasonable yet representative, we downsample each stream to a length of 100 sentences. The downsampling is done uniformly over the entire stream. This is repeated 10 times for each training event to create a total of 380 training streams. In the event that a downsample contains no nuggets (either human or automatically labeled) we resample until at least one exists in the sample.

In order to avoid over-fitting, we select the model iteration for each training fold based on its performance (in F1 score of expected gain and comprehensiveness) on the development set.
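A sketch of the unpenalized versions of these metrics; update_matches maps each system update to the set of nugget ids it matches, and the latency discounting is omitted for brevity:

```python
def summary_metrics(update_matches, num_event_nuggets):
    """update_matches: for each system update, the set of nugget ids it matches
    (possibly empty). Returns expected gain, comprehensiveness, and their harmonic
    mean (F1); only unique nugget matches are credited."""
    matched = set()
    for nuggets in update_matches:
        matched |= set(nuggets)
    gain = len(matched)
    exp_gain = gain / max(1, len(update_matches))          # precision-oriented
    comprehensiveness = gain / max(1, num_event_nuggets)   # recall-oriented
    denom = exp_gain + comprehensiveness
    f1 = 2 * exp_gain * comprehensiveness / denom if denom > 0 else 0.0
    return {"expected_gain": exp_gain,
            "comprehensiveness": comprehensiveness,
            "F1": f1}
```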
We refer to our "learning to search" model in the results as LS. We compare our proposed model against several baselines and extensions.

Cosine Similarity Threshold
One of the top performing systems in temporal summarization at TREC 2015 was a heuristic method that only examined article first sentences, selecting those that were below a cosine similarity threshold to any of the previously selected updates. We implemented a variant of that approach using the latent-vector representation used throughout this work. The development set was used to set the threshold. We refer to this model as COS (team WaterlooClarke at TREC 2015).
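A sketch of this baseline; the latent sentence vectors and the tuned threshold are assumed to be given:

```python
import numpy as np

def cos_baseline(first_sentence_vecs, threshold):
    """Select an article's first sentence as an update if its cosine similarity to
    every previously selected update is below the threshold."""
    updates, decisions = [], []
    for v in first_sentence_vecs:
        sims = [float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-9))
                for u in updates]
        if not sims or max(sims) < threshold:
            updates.append(v)
            decisions.append(1)
        else:
            decisions.append(0)
    return decisions
```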
Affinity Propagation
The next baseline was a top performer at the previous year's TREC evaluations [Kedzie et al., 2015]. This system processes the stream in non-overlapping windows of time, using affinity propagation (AP) clustering [Frey and Dueck, 2007] to identify update sentences (i.e. sentences that are cluster centers). As in the COS model, a similarity threshold is used to filter out updates that are too similar to previous updates (i.e. previous clustering outputs). We use the summary content probability feature as the preference or salience parameter. The time window size, similarity threshold, and an offset for the cluster preference are tuned on the development set. We use the authors' publicly available implementation and refer to this method as APSAL.

Learning2Search+Cosine Similarity Threshold
In this model, which we refer to as LSCos, we run LS as before, but filter the resulting updates using the same cosine similarity threshold as COS. The threshold was also tuned on the development set.

Results and Discussion

Figure 3: Average system performance and average number of updates per event. Columns report expected gain, comprehensiveness, and F1 under both unpenalized and latency-penalized gain; rows are APSAL, COS, LS, and LSCos. Superscripts indicate significant improvements (p < 0.05) between the run and competing algorithms using the paired randomization test with the Bonferroni correction for multiple comparisons (s: APSAL, c: COS, l: LS, f: LSCos).

Figure 4: Average system performance (latency-penalized expected gain, comprehensiveness, and F1) for COS, LS-FS, and LSCos-FS. LS-FS and LSCos-FS runs are trained and evaluated on first sentences only (like the COS system). Unpenalized results are omitted for space but the rankings are consistent.

Figure 5: Error counts (miss lead, miss body, empty, duplicate, total) for APSAL, COS, LS-FS, LSCos-FS, LS, and LSCos.

Results for system runs are shown in Figure 3. On average, LS and LSCos achieve higher F1 scores than the baseline systems in both latency-penalized and unpenalized evaluations. For LSCos, the difference in mean F1 score was significant compared to all other systems (for both latency settings).

APSAL achieved the overall highest expected gain, partially because it was the tersest system we evaluated. However, only COS was statistically significantly worse than it on this measure.

In comprehensiveness, LS recalls on average a fifth of the nuggets for each event. This is even more impressive when compared to the average number of updates produced by each system (Figure 3); while COS achieves similar comprehensiveness, it takes on average about 62% more updates than LS and almost 400% more updates than LSCos. The output size of COS stretches the limit of the term "summary," which is typically shorter than 145 sentences in length. This is especially important if the intended application is negatively affected by verbosity (e.g. crisis monitoring).

Since COS only considers the first sentence of each document, it may miss relevant sentences below the article's lead. In order to confirm the importance of modeling the oracle, we also trained and evaluated the LS based approaches on first sentence only streams. Figure 4 shows the latency-penalized results of the first sentence only runs. The LS approaches still dominate COS and receive larger positive effects from the latency penalty despite also being restricted to the first sentence. Clearly having a model (beyond similarity) of what to select is helpful. Ultimately we do much better when we can look at the whole document.

We also performed an error analysis to further understand how each system operates. Figure 5 shows the errors made by each system on the test streams. Errors were broken down into four categories. Miss lead and miss body errors occur when a system skips a sentence containing a novel nugget in the lead or article body respectively. An empty error indicates an update was selected that contained no nugget.
Duplicate errors occur when an update contains nuggets but none are novel.

Overall, errors of the miss type are most common and suggest future development effort should focus on summary content identification. About a fifth to a third of all system error comes from missing content in the lead sentence alone.

After misses, empty errors (false positives) are the next largest source of error. COS was especially prone to empty errors (41% of its total errors). LS is also vulnerable to empties (19.9%), but after applying the similarity filter and restricting to first sentences, these errors can be reduced dramatically (to 1%).

Surprisingly, duplicate errors are a minor issue in our evaluation. This is not to suggest we should ignore this component, however, as efforts to increase recall (reduce miss errors) are likely to require more robust redundancy detection.

Conclusion

In this paper we presented a fully online streaming document summarization system capable of processing web-scale data efficiently. We also demonstrated the effectiveness of "learning to search" algorithms for this task. As shown in our error analysis, improving summary content selection, especially in the article body, should be the focus of future work. We would like to explore deeper linguistic analysis (e.g. coreference and discourse structures) to identify places likely to contain content rather than processing whole documents.
Acknowledgements
We would like to thank Hal Daumé III for answering our questions about learning to search. The research described here was supported in part by the National Science Foundation (NSF) under IIS-1422863. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF.
References

[Allan et al., 2001] James Allan, Rahul Gupta, and Vikas Khandelwal. Temporal summaries of new topics. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 10–18. ACM, 2001.

[Aslam et al., 2013] Javed Aslam, Matthew Ekstrand-Abueg, Virgil Pavlu, Fernando Diaz, and Tetsuya Sakai. TREC 2013 temporal summarization. In Proceedings of the 22nd Text Retrieval Conference (TREC), November 2013.

[Chang et al., 2015] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé, and John Langford. Learning to search better than your teacher. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2058–2066. JMLR Workshop and Conference Proceedings, 2015.

[Conroy et al., 2011] John M. Conroy, Judith D. Schlesinger, Jeff Kubina, Peter A. Rankel, and Dianne P. O'Leary. CLASSY 2011 at TAC: Guided and multi-lingual summaries and evaluation metrics. In Proceedings of the Text Analysis Conference, 2011.

[Dang and Owczarzak, 2008] Hoa Trang Dang and Karolina Owczarzak. Overview of the TAC 2008 update summarization task. In Proceedings of the Text Analysis Conference, pages 1–16, 2008.

[Daumé III et al., 2009] Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.

[Du et al., 2011] Pan Du, Jipeng Yuan, Xianghui Lin, Jin Zhang, Jiafeng Guo, and Xueqi Cheng. Decayed DivRank for guided summarization. In Proceedings of the Text Analysis Conference, 2011.

[Erkan and Radev, 2004] Günes Erkan and Dragomir R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. JAIR, pages 457–479, 2004.

[Frank et al., 2012] John R. Frank, Max Kleiman-Weiner, Daniel A. Roberts, Feng Niu, Ce Zhang, Christopher Ré, and Ian Soboroff. Building an entity-centric stream filtering test collection for TREC 2012. Technical report, DTIC Document, 2012.

[Frey and Dueck, 2007] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.

[Graff and Cieri, 2003] David Graff and C. Cieri. English Gigaword corpus. Linguistic Data Consortium, 2003.

[Guo and Diab, 2012] Weiwei Guo and Mona Diab. A simple unsupervised latent semantics based approach for sentence similarity. In Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 586–590, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[Guo et al., 2013] Qi Guo, Fernando Diaz, and Elad Yom-Tov. Updating users about time critical events. In ECIR, pages 483–494, Berlin, Heidelberg, 2013. Springer-Verlag.

[Haghighi and Vanderwende, 2009] Aria Haghighi and Lucy Vanderwende. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370. Association for Computational Linguistics, 2009.

[Kedzie et al., 2015] Chris Kedzie, Kathleen McKeown, and Fernando Diaz. Predicting salient updates for disaster summarization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1608–1617. Association for Computational Linguistics, July 2015.

[Lin and Hovy, 2000] Chin-Yew Lin and Eduard Hovy. The automated acquisition of topic signatures for text summarization. In ACL, pages 495–501. ACL, 2000.

[Mason and Charniak, 2011] Rebecca Mason and Eugene Charniak. Extractive multi-document summaries should explicitly not contain document-specific content. In Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, pages 49–54. Association for Computational Linguistics, 2011.

[McCreadie et al., 2014] Richard McCreadie, Craig Macdonald, and Iadh Ounis. Incremental update summarization: Adaptive sentence selection based on prevalence and novelty. In CIKM, pages 301–310. ACM, 2014.

[Nenkova and McKeown, 2012] Ani Nenkova and Kathleen McKeown. A survey of text summarization techniques. In Mining Text Data, pages 43–76. Springer, 2012.

[Nenkova and Vanderwende, 2005] A. Nenkova and L. Vanderwende. The impact of frequency on summarization. Technical Report MSR-TR-2005-101, January 2005.

[Radev et al., 2000] Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization. Association for Computational Linguistics, 2000.

[Ross et al., 2011] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.