Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings
Kailash Karthik Saravanakumar, Miguel Ballesteros, Muthu Kumar Chandrasekaran, Kathleen McKeown
Kailash Karthik Saravanakumar∗  Miguel Ballesteros  Muthu Kumar Chandrasekaran  Kathleen McKeown
Amazon AI, USA
Department of Computer Science, Columbia University, NY, USA
{ballemig, cmuthuk, mckeownk}@amazon.com  {kailashkarthik.s}@columbia.edu
∗ Work done during internship at Amazon

Abstract
We propose a method for online news stream clustering that is a variant of the non-parametric streaming K-means algorithm. Our model uses a combination of sparse and dense document representations, aggregates document-cluster similarity along these multiple representations and makes the clustering decision using a neural classifier. The weighted document-cluster similarity model is learned using a novel adaptation of the triplet loss into a linear classification objective. We show that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings for clustering. Our model achieves a new state-of-the-art on a standard stream clustering dataset of English documents.
Human presentation and understanding of news articles is almost never isolated. Seminal real-world events spawn a chain of strongly correlated news articles that form a news story over time. Given the abundance of online news sources, the consumption of news in the context of the stories they belong to is challenging. Unless people are able to scour the many news sources multiple times a day, major events of interest can be missed as they occur. The real-time monitoring of news, segregating articles into their corresponding stories, thus enables people to follow news stories over time. This goal of identifying and tracking topics from a news stream was first introduced in the Topic Detection and Tracking (TDT) task (Allan et al., 1998). Topics in the news stream setting usually correspond to real-world events, while news articles may also be categorized thematically into sports, politics, etc. We focus on the task of clustering news on the basis of event-based story chains. We make a distinction between our definition of an event topic, which follows TDT and refers to large-scale real-world events, and the fine-grained events used in trigger-based event detection (Ahn, 2006). Given the non-parametric nature of our task (the number of events is not known beforehand and evolves over time), the two primary approaches have been topic modeling using Hierarchical Dirichlet Processes (HDPs) (Teh et al., 2005; Beykikhoshk et al., 2018) and stream clustering (MacQueen, 1967; Laban and Hearst, 2017; Miranda et al., 2018). While HDPs use word distributions within documents to infer topics, stream clustering models use representation strategies to encode and cluster documents. Contemporary models have adopted stream clustering using TF-IDF weighted bag of words representations to achieve state-of-the-art results (Staykovski et al., 2019).

In this paper, we present a model for event topic detection and tracking from news streams that leverages a combination of dense and sparse document representations. Our dense representations are obtained from BERT models (Devlin et al., 2019) fine-tuned using the triplet network architecture (Hoffer and Ailon, 2015) on the event similarity task, which we describe in Section 3. We also use an adaptation of the triplet loss to learn a Support Vector Machine (SVM) (Boser et al., 1992) based document-cluster similarity model and handle the non-parametric cluster creation using a shallow neural network. We empirically show consistent improvement in clustering performance across many clustering metrics and significantly less cluster fragmentation.

The main contributions of this paper are:

• We present a novel technique for event-driven news stream clustering, which, to the best of our knowledge, is the first attempt at using contextual representations for this task.

• We investigate the impact of BERT's fine-tuning objective on clustering performance and show that tuning on the event similarity task using triplet loss improves the effectiveness of embeddings for clustering.

• We demonstrate the importance of adding external knowledge to contextual embeddings for clustering by introducing entity awareness to BERT.
• Contrary to a previous claim (Staykovski et al., 2019), we empirically show that dense embeddings improve clustering performance when augmented with task-pertinent fine-tuning, external knowledge and the conjunction of sparse and temporal representations.

• We analyze the problem of cluster fragmentation and show that it is not captured well by the metrics reported in the literature. We propose an additional metric that captures fragmentation better and report results on both.

In this section, we introduce the TDT task, prior work on tracking events from news streams and a few related parametric variants of the TDT task.

The goal of the TDT task is to organize a collection of news articles into groups called topics. Topics are defined as sets of highly correlated news articles that are related to some seminal real-world event.
This is a narrower definition than the general notion of a topic, which could include subjects (like New York City) as well. TDT defines an event to be represented by a triple <location, time, people involved>, which spawns a series of news articles over time. We are interested in all five sub-tasks of TDT - story segmentation, first story detection, cluster detection, tracking and story link detection - though we do not explicitly tackle these sub-problems individually.

After the initial work on the TDT corpora, interest in news stream clustering was rekindled by the news tracking system NewsLens (Laban and Hearst, 2017). NewsLens tackled the problem in multiple stages: (1) document representation through keyphrase extraction; (2) non-parametric batch clustering using the Louvain algorithm (Blondel et al., 2008); and (3) linking of clusters across batches. Staykovski et al. (2019) presented a modified version of this model, using TF-IDF bag of words document representations instead of keyphrases. They also compared the relative performance of sparse TF-IDF bag of words and dense doc2vec (Le and Mikolov, 2014) representations and showed that the latter performs worse, both individually and in unison with sparse representations. Linger and Hajaiej (2020) extended this batch clustering idea to the multilingual setting by incorporating a Siamese Multilingual-DistilBERT (Sanh et al., 2019) model to link clusters across languages.

In contrast to the batch-clustering approach, Miranda et al. (2018) adopt an online clustering paradigm, where streaming documents are compared against existing clusters to find the best match or to create a new cluster. We adopt this stream clustering approach as it is robust to temporal density variations in the news stream. Batch clustering models tune a batch size hyper-parameter that is training corpus dependent and might not be able to adjust to temporal variations in stream density. In their model, they also use a pipeline architecture, with separate models for document-cluster similarity computation and cluster creation. Similarity between a document and a cluster is computed along multiple document representations and then aggregated using a Rank-SVM model (Joachims, 2002). The decision to merge a document with a cluster or create a new cluster is taken by an SVM classifier. Our model also follows this architecture, but critically adds dense document representations, an SVM trained on the adapted triplet loss for aggregating document-cluster similarities and a shallow neural network for cluster creation.

News event tracking has also been framed as a non-parametric topic modeling problem (Zhou et al., 2015) and HDPs that share parameters across temporal batches have been used for this task (Beykikhoshk et al., 2018). Dense document representations have been shown to be useful in the parametric variant of our problem, with neural LDA (Dieng et al., 2019a; Keya et al., 2019; Dieng et al., 2019b; Bianchi et al., 2020), temporal topic evolution models (Zaheer et al., 2017; Gupta et al., 2018; Zaheer et al., 2019; Brochier et al., 2020) and embedding space clustering (Momeni et al., 2018; Sia et al., 2020) being some prominent approaches in the literature.

Figure 1: The architecture of the news stream clustering model, showing the clustering process for a single document in the news stream. At the end of the clustering process for each document, the cluster pool is updated based on the output from the cluster creation model, either by adding document d to cluster c∗ or by creating a new cluster with the document.
Our clustering model is a variant of the streaming K-means algorithm (MacQueen, 1967) with two key differences: (1) we compute the similarity between documents and clusters along a set of representations instead of a single vector representation; and (2) we decide the cluster membership using the output of a neural classifier, a learned model, instead of a static tuned threshold.

At any point in time t, let n be the number of clusters the model has created thus far, called the cluster pool. Given a continuous stream of news documents, the goal of the model is to decide the cluster membership (if any) for each input document. In our task, we assume that each document belongs to a single event, represented by a cluster. The architecture of the model, as shown in Figure 1, consists of three main components: (1) document representations, (2) document-cluster similarity computation using a weighted similarity model and (3) cluster creation model. In what follows, we describe each of these components.

Documents in the news stream have a set of representations, as shown in Figure 1, where each representation is one of the following types - sparse TF-IDF, dense embedding or temporal. We describe below these representation types and how clusters, which are created by our model, build representations from their assigned documents.
Separate TF-IDF models that are trained only on the tokens, lemmas and entities in a corpus are used to encode documents separately. For every document in the news stream, its title, body and title+body are each encoded into separate bags of tokens, lemmas and entities, creating nine sparse bag of words representations per document.
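As a concrete illustration, the minimal sketch below builds the nine sparse bags for one document. It assumes scikit-learn TfidfVectorizer models (one per view: tokens, lemmas, entities) and a spaCy pipeline; the actual system uses the TF-IDF weights shipped with the Miranda et al. (2018) corpus, so the fitting step here is purely illustrative.

# Minimal sketch of the nine sparse representations per document.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_md")  # medium English model, as used for NER in the paper

def doc_views(text):
    """Token, lemma and entity views of a piece of text."""
    doc = nlp(text)
    return {
        "tokens": " ".join(t.text.lower() for t in doc if not t.is_punct),
        "lemmas": " ".join(t.lemma_.lower() for t in doc if not t.is_punct),
        "entities": " ".join(e.text.lower() for e in doc.ents) or "none",
    }

def sparse_representations(title, body, vectorizers):
    """Nine TF-IDF bags: {title, body, title+body} x {tokens, lemmas, entities}."""
    reps = {}
    sections = {"title": title, "body": body, "title_body": title + " " + body}
    for sec_name, text in sections.items():
        for view_name, surface in doc_views(text).items():
            reps[(sec_name, view_name)] = vectorizers[view_name].transform([surface])
    return reps

# Fitting toy vectorizers, one per view, over a corpus of raw document texts:
# vectorizers = {v: TfidfVectorizer().fit([doc_views(d)[v] for d in corpus])
#                for v in ("tokens", "lemmas", "entities")}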
Dense document representations are obtained by embedding the body of documents using BERT, with pre-trained BERT (P-BERT) without any fine-tuning as our baseline embedding model. In order to improve the effectiveness of contextual embeddings for our clustering task, we experiment with enhancements along two dimensions: (1) the fine-tuning objective, and (2) the provision of external knowledge. We train separate BERT models for (1) and (2) and use them to encode documents.

To evaluate the impact of the fine-tuning objective, we fine-tune BERT models on two different tasks - event classification (C-BERT) and event similarity (S-BERT). We also evaluate the impact of external knowledge on the embeddings through an entity-aware BERT architecture, which may be paired with either of the fine-tuning objectives.
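For the baseline P-BERT representation, a document embedding can be obtained by mean pooling the encoder's token embeddings over the document body (the pooling strategy stated in our experimental setup). The sketch below assumes the Hugging Face transformers library; the model name is illustrative.

# Sketch: dense document representation from a (pre-trained or fine-tuned) BERT
# encoder, mean-pooled over all token embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed_document(body: str) -> torch.Tensor:
    inputs = tokenizer(body, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state        # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)           # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # mean pooling over tokens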
Fine-tuning on Event Classification
The goal of this fine-tuning is to tune the CLS token embedding such that it encodes information about the event that a document corresponds to. (The CLS token, introduced in Devlin et al. (2019), is a special token added to the beginning of every document before being embedded by BERT.) A dense and softmax layer are stacked on top of the CLS token embedding to classify a document into one of the events in the output space.
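A minimal PyTorch sketch of such a C-BERT head is given below; the intermediate layer size and activation are our assumptions, while the 593-event output space matches the training set described later.

# Sketch of the event-classification head: dense + softmax stacked on the CLS embedding.
import torch
import torch.nn as nn

class EventClassificationHead(nn.Module):
    def __init__(self, hidden_size: int = 768, num_events: int = 593):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_events)

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # cls_embedding: (batch, hidden), taken from BERT's CLS position
        hidden = torch.tanh(self.dense(cls_embedding))
        return torch.log_softmax(self.classifier(hidden), dim=-1)  # trained with NLL over events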
Fine-tuning on Event Similarity
Fine-tuning on the task of event classification constrains the embeddings of documents corresponding to different events to be non-linearly separable. Semantics about events can be better captured if the vector similarity between document embeddings encodes whether they are from the same event or not. For this, we adapt the triplet network architecture (Hoffer and Ailon, 2015) and fine-tune on the task of event similarity. Triplet BERT networks were introduced for the semantic text similarity (STS) task (Reimers and Gurevych, 2019), where the vector similarity between sentence embeddings was tuned to reflect the semantic similarity between them. We formulate the event similarity task, where the term "similarity" refers to whether two documents are from the same event cluster or not. In our task, documents from the same event are similar (with similarity = 1), while those from different events are dissimilar (with similarity = 0). Given the embeddings of an anchor document d_a, a positive document d_p (from the same event as the anchor) and a negative document d_n (from a different event), the triplet loss is computed as

l_triplet = sim(d_a, d_n) − sim(d_a, d_p) + m    (1)

where sim is the cosine similarity function and m is the hyper-parameter margin.
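Equation 1 can be sketched in PyTorch as follows; the clamp at zero is our assumption of the standard hinge formulation (Equation 1 states the unclamped expression), and the margin m is a tuned hyper-parameter.

# Sketch of the event-similarity triplet loss computed on document embeddings.
import torch
import torch.nn.functional as F

def event_similarity_triplet_loss(anchor, positive, negative, margin: float = 1.0):
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)
    # sim(d_a, d_n) - sim(d_a, d_p) + m, clamped so satisfied triplets give no gradient.
    return torch.clamp(sim_an - sim_ap + margin, min=0.0).mean()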
Providing External Entity Knowledge

In line with TDT's definition, entities are central to events and thus need to be highlighted in document representations for our clustering task. We follow Logeswaran et al. (2019) to introduce entity awareness to BERT by leveraging knowledge from an external NER system. Apart from token, position and token type embeddings, we also add an entity presence-absence embedding for each token depending on whether it corresponds to an entity or not. The entity-aware BERT model architecture is shown in Figure 2. This enhanced entity-aware model can then be coupled with the event similarity (E-S-BERT) objective for fine-tuning.
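A sketch of how the entity presence-absence embedding can be added on top of BERT's input embeddings is shown below; the wiring is illustrative and assumes binary per-token entity flags produced by the external NER system.

# Sketch of the entity-aware input layer (E_E / E_NE added to the usual embeddings).
import torch
import torch.nn as nn

class EntityAwareEmbeddings(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.entity_embeddings = nn.Embedding(2, hidden_size)  # index 0 = E_NE, 1 = E_E

    def forward(self, base_embeddings: torch.Tensor, entity_flags: torch.Tensor) -> torch.Tensor:
        # base_embeddings: (batch, seq_len, hidden) = token + position + token-type embeddings
        # entity_flags:    (batch, seq_len) long tensor, 1 where the token is part of an entity
        return base_embeddings + self.entity_embeddings(entity_flags)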
Figure 2: Entity-aware BERT model, with the additional entity presence (E_E) and absence (E_NE) embeddings.

Documents are also represented with the timestamp of publication. Unlike TF-IDF and dense embeddings, which are vector-valued representations, the temporal representation of a document is just a single value (e.g. "05-09-2020") which has an associated subtraction operation. The difference between two timestamps is defined as the number of intervening days between them. Section 3.2 describes how these timestamps are used for clustering.

Since clusters are created and updated by our model, their representations need to be generated dynamically from the documents assigned to them. While documents in the news stream have a set of 11 representations (9 TF-IDF, dense embedding and timestamp), clusters have two additional timestamp representations. Cluster representations are derived from documents in the cluster through aggregation. While dense embedding and sparse TF-IDF representations of a cluster are aggregated using mean pooling, clusters have three timestamp representations corresponding to different aggregation strategies - min, max and mean pooling.
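The sketch below illustrates this aggregation; the data layout (dense numpy vectors keyed by representation name, timestamps expressed in days) is our assumption for exposition.

# Sketch of deriving cluster representations from member documents: vector-valued
# representations are mean-pooled, the timestamp is aggregated by min, max and mean.
import numpy as np

def cluster_representation(member_docs):
    """member_docs: list of dicts with dense vectors under 'vectors' and a 'timestamp' in days."""
    keys = member_docs[0]["vectors"].keys()
    reps = {k: np.mean([d["vectors"][k] for d in member_docs], axis=0) for k in keys}
    days = np.array([d["timestamp"] for d in member_docs], dtype=float)
    reps["time_min"], reps["time_max"], reps["time_mean"] = days.min(), days.max(), days.mean()
    return reps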
Once documents are encoded by a set of representations, they are compared to the clusters in the cluster pool to find the most compatible cluster. The similarity between a document and a cluster is computed along each representation separately and is then aggregated into a single compatibility score (c-score). While similarity along contextual embeddings and TF-IDF bag representations is computed using cosine similarity (as shown in Equation 2), timestamp similarity is computed using the Gaussian similarity function introduced in Miranda et al. (2018) (as shown in Equation 3). Let r^d_v and r^c_v denote a dense or sparse vector representation of a document d and cluster c respectively. Let r^d_t and r^c_t denote their timestamp representations. Let (i, j) correspond to a pair of document-cluster representations of the same type (as defined in Section 3.1). Document-cluster similarity is computed along each representation and aggregated using a weighted summation as

sim(r^d, r^c) = { sim(r^d_i, r^c_j) ∀ (i, j) }

sim(r^d_v, r^c_v) = (r^d_v · r^c_v) / (|r^d_v| |r^c_v|)    (2)

sim(r^d_t, r^c_t) = e^(−((r^d_t − r^c_t) − µ)² / σ)    (3)

c-score(r^d, r^c) = Σ_(i,j) w_j · sim(r^d_i, r^c_j)

where µ and σ are tuned hyper-parameters of the temporal similarity function. It is noted here that since clusters have two additional timestamp representations, all three timestamp similarities are computed using the single document timestamp representation, as illustrated in Figure 3.

The dataset does not contain annotation for the degree of membership between a document and a cluster and thus, the weights for combining the representation similarities can't be learned directly. To circumvent this issue, we train a linear model on a relevant task so that the trained weights can then be adapted to compute the compatibility score. In our model, we train a linear model on a novel adaptation of the event similarity triplet loss used to train the S-BERT model. The triplet loss, as defined in Equation 1, can be adapted to a linear classifier if similarity has a related notion with regards to the classifier. SVM is an appropriate model since the degree of compatibility between a point x and a class c is given by the distance of the point from the class' decision hyperplane w_c. This distance, computed as w_c · x + b, can thus be used as the similarity metric to adapt the triplet loss.

In our case, the inputs to the SVM model are vectors of document-cluster similarities along the set of representations sim(r^d, r^c). The adapted SVM-triplet loss is thus computed as shown below. Since we want to minimize this loss, we analyze its point of minima.

l_svm-triplet = w · sim(r^a, r^n) − w · sim(r^a, r^p) + m

l_svm-triplet = 0  ⟹  m = w · (sim(r^a, r^p) − sim(r^a, r^n))

The adapted triplet loss can thus be modeled as a classification task with inputs (sim(r^a, r^p) − sim(r^a, r^n)) and output m. For mathematical convenience, we set m = 1 without loss of generality. In this manner, we transform the event similarity triplet loss objective into a classification objective to train an SVM model. The novelty of this supervision is that we focus on learning useful weights and not a useful classifier. The learned weights, which minimize the event similarity triplet loss, are utilized for document-cluster c-score computation.
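The sketch below ties the pieces together: per-representation similarities (Equations 2-3), their weighted aggregation into a c-score, selection of the most compatible cluster, and the construction of the adapted-triplet training instances for a linear SVM. Data layouts, the µ and σ values and the SVM settings are assumptions for illustration, not the paper's exact configuration.

# Sketch of the weighted similarity model and its training objective.
import numpy as np
from sklearn.svm import LinearSVC

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def time_sim(doc_days, cluster_days, mu=0.0, sigma=30.0):
    # Gaussian similarity on the day difference (Equation 3); mu and sigma are tuned.
    return float(np.exp(-(((doc_days - cluster_days) - mu) ** 2) / sigma))

def similarity_vector(doc, cluster):
    """Similarities along every document-cluster representation pair."""
    sims = [cosine(doc["vectors"][k], cluster[k]) for k in doc["vectors"]]
    sims += [time_sim(doc["timestamp"], cluster[t]) for t in ("time_min", "time_max", "time_mean")]
    return np.array(sims)

def c_score(doc, cluster, weights):
    return float(weights @ similarity_vector(doc, cluster))

def best_cluster(doc, pool, weights):
    # Most compatible cluster = argmax of the c-score over the cluster pool.
    return max(pool, key=lambda c: c_score(doc, c, weights))

def train_weights(triples):
    """triples: (anchor document, its true cluster, a sampled negative cluster).
    Each triple becomes one instance x = sim(r_a, r_p) - sim(r_a, r_n) with label 1;
    half the instances are negated (label 0) to balance the dataset, and the learned
    hyperplane w is reused as the c-score weights."""
    X, y = [], []
    for i, (doc, pos_cluster, neg_cluster) in enumerate(triples):
        x = similarity_vector(doc, pos_cluster) - similarity_vector(doc, neg_cluster)
        X.append(x if i % 2 == 0 else -x)
        y.append(1 if i % 2 == 0 else 0)
    svm = LinearSVC()
    svm.fit(np.array(X), np.array(y))
    return svm.coef_[0]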
During the clustering process, a document d is compared against all the clusters in the pool C to determine the most compatible cluster c∗ as

c∗ = arg max_{c ∈ C} c-score(r^d, r^c)

Since our clustering problem is non-parametric, each document in the stream could potentially be the start of a new event cluster. Thus, the most compatible cluster c∗ might not actually be the cluster that the document corresponds to. Given a document and its most compatible cluster, the cluster creation model decides whether or not a new cluster is to be created. For this, we employ a shallow neural network which takes document-cluster similarities along the set of representations as input and decides if a new cluster should be created. Since the dimensionality of the input space for the network is small, we use a shallow network to prevent overfitting.

To train and evaluate our clustering models, we use the standard multilingual news stream clustering dataset (Miranda et al., 2018), which contains articles in English, French and German. For our clustering task, we only use the English subset of the corpus, which consists of 20,959 articles. Articles are annotated with language, timestamp and the event cluster to which they belong, in addition to their title and body text. We use the same training and evaluation split provided by Miranda et al. (2018) and use the training set to fine-tune the parameters of the clustering model. The training and evaluation sets are temporally disjoint to ensure that the clustering models are tuned independently of the events seen during training.

Figure 3: Computation of c-score: (a) similarities are computed for each representation individually using the appropriate similarity function (cosine or Gaussian); (b) subsequently, the computed similarities are aggregated into a single c-score value using the weights of the weighted similarity model (w).

We train our model pipeline in a sequence where each component model is supplied with the output from the component trained before it in the sequence. For instance, the cluster creation model is trained using the embeddings from the fine-tuned BERT model and by selecting the most compatible cluster determined by the trained weighted similarity model. We experiment with multiple document representation sets, training all the component models each time and evaluating the entire clustering model on the test set.

We use the TF-IDF weights provided in the Miranda et al. (2018) corpus to ensure fair comparison with prior work. For training the event similarity BERT model (S-BERT), triplets are generated for each document using the batch-hard regime (Hermans et al., 2017) by picking the hardest positive and negative examples from its mini-batch (we use the batch-hard implementation provided by Reimers and Gurevych (2019) at https://github.com/UKPLab/sentence-transformers). We train the S-BERT model for 2 epochs using a batch size of 32, with 10% of the training data being used for linear warmup. We use the Adam optimizer with learning rate e− and epsilon e−. Document embeddings are obtained by mean pooling across all of a document's tokens. For NER, we use the medium English model provided by spaCy (Honnibal and Montani, 2017).

Training instances for the weighted similarity and cluster creation models are generated by simulating the stream clustering process on the training set and assigning each document to its true event cluster.
For the weighted similarity model, we generate triples of <document, true cluster, sampled negative cluster> and convert them into SVM training instances as described in Section 3.2. Since all the training instances have the same label m, half the training set is negated to balance the dataset. To generate training samples for the cluster creation model, the most compatible cluster is determined using the trained weighted similarity model for each document. A sample is then generated with input as the document-cluster similarities and output as 0 or 1 depending on whether the true cluster for that document is in the cluster pool or not. The dataset contains over 12k documents but only 593 clusters, entailing that the fraction of training samples where a new cluster is created is only 5%, making the dataset extremely biased. To mitigate this issue, we use the SVMSMOTE algorithm (Nguyen et al., 2011) to oversample the minority class and make the classes equal in size. For cluster creation, we train a shallow single layer neural network with two nodes using the L-BFGS solver (Nocedal, 1980). The weighted similarity and cluster creation models are trained using 5-fold cross validation to tune hyper-parameters and then on the entire training set using the best settings.

The clustering output is evaluated by comparing against the ground truth clusters. We report results on the B-Cubed metrics (Bagga and Baldwin, 1998) in Table 1 to compare against prior work.

Model                              Clusters Count (True Count - 222)   B-Cubed Precision   Recall   F Score
Laban and Hearst (2017)            873                                  94.37               85.58    89.76
Miranda et al. (2018)              326                                  94.27               90.25    92.36
Staykovski et al. (2019)           484                                  95.16               93.66    94.41
Linger and Hajaiej (2020)          298                                  94.19               93.55    93.86
Ours - TF-IDF                      530                                  93.50               80.23    86.36
Ours - TF-IDF (out-of-order)       413                                  90.57               87.51    89.01
Ours - TF-IDF + Time               222                                  87.57               96.27    91.72
Ours - E-S-BERT                    452                                  79.76               60.77    68.98
Ours - E-S-BERT + Time             471                                  92.70               74.69    82.73
Ours - TF-IDF + P-BERT + Time      196                                  83.12

Table 1: Results of clustering performance for different document representation strategies as compared against contemporary models. P-BERT refers to pre-trained BERT; C-BERT refers to BERT fine-tuned on event classification; S-BERT refers to BERT fine-tuned using triplet loss on event similarity; E-S-BERT refers to entity-aware BERT fine-tuned on event similarity.

Prior work has shown that sparse TF-IDF bag representations achieve competitive performance (Laban and Hearst, 2017; Miranda et al., 2018) and our experiments validate this observation. The clustering model that uses only sparse TF-IDF bags to represent documents achieves a very high score of 86.8% B-Cubed F score, as shown in Table 1. If TF-IDF bags are used in combination with timestamps, then the performance further increases to 91.7%, setting a tough baseline to beat.

Contextual embeddings, by themselves, achieve sub-par clustering performance:
In line with prior work, we observe that dense document embeddings, both when used as the sole representation and in conjunction with timestamps, are unable to match the clustering performance of TF-IDF bags. It can be seen in Table 1 that even our best BERT model (entity-aware BERT trained on event similarity) only achieves an F1 score of 69% individually and 82.7% when combined with timestamp representations. These scores are 17.8% and 9% lower than their corresponding TF-IDF counterparts. BERT embeddings are richer representations that encode linguistic information including syntax and semantics through pre-training. Thus, the model is unable to distinguish between events at the desired granularity and ends up clustering together topically related events (for instance, two different events related to soccer).
Fine-tuning objective impacts the effectiveness of embeddings for clustering:
Although fine-tuning contextual embeddings on a pertinent related objective is beneficial in most NLP tasks, we observe that the choice of fine-tuning objective is critical to task performance. While the baseline pre-trained P-BERT model achieves a clustering score of 89.6% when used in conjunction with TF-IDF and timestamp representations (TF-IDF + P-BERT + Time), fine-tuning embeddings on event classification (TF-IDF + C-BERT + Time) drops the performance to 87%. This drop in performance can be attributed to the following reasons. Firstly, the large output space (593 events) and small dataset size (12k documents) make it hard for the model to learn effectively during fine-tuning. In addition, the classification objective requires that the embeddings of documents from different events be non-linearly separable. But this is not directly compatible with how the embeddings are used by the weighted similarity model, which is to compute cosine similarity. This discordance entails that the fine-tuning process degrades the clustering performance. The event similarity triplet loss is a more suitable fine-tuning objective and it is observed that fine-tuning BERT on this objective (TF-IDF + S-BERT + Time) results in a better clustering performance of 92.04%.
External entity knowledge makes embeddings more effective for clustering:
The introduction of external knowledge through the entity-aware BERT architecture significantly improves the clustering performance of the model. It can be seen in Table 1 that introducing entity awareness and training on the event similarity task (TF-IDF + E-S-BERT + Time) results in a clustering score of 94.76%, achieving a new state-of-the-art on the dataset. (The mean and standard deviation of the precision, recall and F-1 scores over five independent trainings and evaluations of our model are . ± . , . ± . and . ± . . We observe similar results on the TDT Pilot dataset (Allan et al., 1998), as shown in Section 4.4.) The results are statistically significant and p values from a paired Student's t-test are reported in Table 2. This is almost 3 points better than the corresponding model without entity awareness, which highlights the importance of this external knowledge. When given external knowledge from an NER system, the BERT model, like sparse TF-IDF representations, is able to draw attention to entities and highlight them in the document embeddings. It is observed that the model learns to project entities and non-entities in mutually orthogonal directions and thereby adds emphasis to entities.

In our experiments, we observe an increase of almost 1 point in F score by considering only a subset of the OntoNotes corpus (Weischedel et al., 2013) labels. Ignoring entity classes like ORDINAL and CARDINAL helps as they don't provide useful information for our clustering task. The scores reported in Table 1 correspond to entity-aware models trained on this label subset (PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK OF ART, LAW and LANGUAGE). We also experimented with separate embeddings for each entity type instead of the binary entity presence-absence embeddings and observed that it degrades F score by more than 2 points.

Metric                          TF-IDF + E-S-BERT + Time   Miranda   Gain
B-Cubed                         94.76                      92.36     2.40 †
CEAF-e                          76.93                      69.57     7.36 †
CEAF-m                          93.31                      90.19     3.12 †
MUC                             99.30                      98.88     0.42 ‡
BLANC                           98.13                      96.93     1.20 §
V Measure                       97.98                      97.01     0.97 †
Adjusted Rand Score             96.26                      93.87     2.39 §
Adjusted Mutual Information     97.99                      97.02     2.97 §
Fowlkes Mallows Score           96.38                      94.11     2.27 §

Table 2: Results of clustering performance across different evaluation metrics. For each metric computed using precision, recall and F-1 scores, only the F-1 scores are reported. Statistically significant gains, with p <<< . , are denoted by † and p < . by ‡. Gains denoted by § are not evaluated for significance, in line with the literature.

Ablating time and non-streaming input:
When we ablate timestamp from the representation (rows that are not marked with "Time" in Table 1) and then stream documents in random order (rows marked with "out-of-order" in Table 1), the number of clusters increases relative to when time is accounted for. When ablating time, we also observe that supplying documents in random order produces fewer clusters and better B-Cubed F scores. We observe examples of clusters that are incorrectly merged in the absence of temporal information (in the out-of-order setting). See the Appendix for actual examples from our output.

Cluster fragmentation is not captured well by B-Cubed metrics:
The improvements our model makes can be seen clearly by observing the number of clusters created by the model. While the previous state-of-the-art model produced 484 clusters, ours produces only 276, which is closer to the true cluster count of 222. (The mean and standard deviation of the cluster count over five independent trainings and evaluations of our model are ± . .) Our model produces far less cluster fragmentation, resulting in a 79.4% reduction in the number of erroneous additional clusters created. We argue that this is an important improvement that is not well captured by the small increase in B-Cubed metrics.

While B-Cubed F score is the standard metric reported in the literature, it is an article-level metric which gives more importance to large clusters. This entails that B-Cubed metrics heavily penalize the model's output for making mistakes on large clusters while mistakes on smaller clusters can fall through without incurring much penalty. In our experiments, we observed that this property of the metric prevents it from capturing cluster fragmentation errors on smaller events. In the news stream clustering setting, small events may correspond to recent salient events and thus, we want our metric to be agnostic to the size of the clusters.

We thus use an additional metric that weights every cluster equally - CEAF-e (Luo, 2005). The CEAF-e metric creates a one-to-one mapping between the clustering output and gold clusters using the Kuhn-Munkres algorithm. The similarity between a gold cluster G and an output cluster O is computed as the fraction of articles that are common to the clusters. Once the clusters are aligned, precision and recall are computed using the aligned pairs of clusters. This ensures that unaligned clusters contribute a penalty to the score and that cluster fragmentation and coalescing are captured by the metric.

In order to ensure that our model's better performance is metric-agnostic, we also empirically evaluated our clustering model against prior work using several clustering metrics, the results of which are presented in Table 2. For this, we compare with Miranda et al. (2018) since their results are readily replicable. It can be observed that our model consistently achieves better performance across most metrics and is thus robust to metric idiosyncrasies. Our model achieves an improvement of 7.36 points on the CEAF-e metric, which shows that our clustering model performs better than contemporary models on smaller clusters as well.
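As a reference for how this metric behaves, the sketch below follows the CEAF-e computation described above, assuming scipy for the Kuhn-Munkres assignment and using the standard normalized similarity 2|G ∩ O| / (|G| + |O|) as the "fraction of shared articles"; this is our reading of the metric, not the official scorer.

# Sketch of CEAF-e: one-to-one alignment of gold and output clusters, equal weight per cluster.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf_e(gold_clusters, output_clusters):
    """gold_clusters, output_clusters: lists of sets of article ids."""
    sim = np.zeros((len(gold_clusters), len(output_clusters)))
    for i, g in enumerate(gold_clusters):
        for j, o in enumerate(output_clusters):
            sim[i, j] = 2 * len(g & o) / (len(g) + len(o))
    rows, cols = linear_sum_assignment(-sim)       # maximize total aligned similarity
    total = sim[rows, cols].sum()
    precision = total / len(output_clusters)       # unaligned output clusters are penalized
    recall = total / len(gold_clusters)            # unaligned gold clusters are penalized
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1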
To validate the robustness of our clustering model, we evaluate it on the TDT Pilot corpus (Allan et al., 1998). The TDT Pilot corpus consists of a set of newswire and broadcast news transcripts that span the period from July 1, 1994 to June 30, 1995. Out of the 16,000 documents collected, 1,382 are annotated as relevant to one of 25 events during that period. Unlike the Miranda et al. (2018) corpus, TDT Pilot does not have article titles; we therefore train all the components of our ensemble architecture using only the document body text. The TDT corpus does not provide pre-trained TF-IDF weights, so we train the weights on the entire corpus as a pre-processing step. Unlike Miranda, the TDT corpus also lacks standard train and test splits. We create our own splits across the 25 events; the splits are described and listed in the Appendix.

In line with our observations on the Miranda et al. (2018) corpus, we observe similar results on the TDT corpus. We achieve the best result on this corpus with a model that combines TF-IDF representations, temporal representations and entity-aware BERT representations fine-tuned on the event similarity task. The best result has a B-Cubed precision of , B-Cubed recall of and a B-Cubed F of 88.18. We generate 12 clusters, which matches the number of clusters in the ground truth.

We show that even in a cross-corpus setting, dense contextual embeddings, when augmented with pertinent fine-tuning, external knowledge and the conjunction of sparse and temporal representations, are a potent representation strategy for event topic clustering.
In this paper, we present a novel news stream clustering algorithm that uses a combination of sparse and dense vector representations. We show that while dense embeddings by themselves do not achieve the best clustering results, enhancements like entity awareness and event similarity fine-tuning make them effective in conjunction with sparse and temporal representations. Our model achieves new state-of-the-art results on the Miranda et al. (2018) dataset. We also analyze the problem of cluster fragmentation, noting that our approach produces a similar number of clusters as in the test set, in contrast to prior work which produces far too many clusters. We note issues with the B-Cubed metrics and complement our results using CEAF-e as an additional metric for our clustering task. In addition, we provide a comprehensive empirical evaluation across many metrics to show the robustness of our model to metric idiosyncrasies.
Acknowledgments
We thank Chao Zhao, Heng Ji, Rishita Anubhai and Graham Horwood for valuable discussions.
References
David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8, Sydney, Australia. Association for Computational Linguistics.
James Allan, Jaime G Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang. 1998. Topic detection and tracking pilot study final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 194–218.
Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The first international conference on language resources and evaluation workshop on linguistics coreference, volume 1, pages 563–566. Citeseer.
Adham Beykikhoshk, Ognjen Arandjelović, Dinh Phung, and Svetha Venkatesh. 2018. Discovering topic structures of a temporally evolving document corpus. Knowledge and Information Systems, 55(3):599–632.
Federico Bianchi, Silvia Terragni, and Dirk Hovy. 2020. The dynamic embedded topic model. arXiv preprint arXiv:2004.03974.
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.
Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, page 144–152, New York, NY, USA. Association for Computing Machinery.
Robin Brochier, Adrien Guille, and Julien Velcin. 2020. Inductive document network embedding with topic-word attention. Advances in Information Retrieval, page 326–340.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Adji B Dieng, Francisco J R Ruiz, and David M Blei. 2019a. Topic modeling in embedding spaces. arXiv preprint arXiv:1907.04907.
Adji B Dieng, Francisco JR Ruiz, and David M Blei. 2019b. The dynamic embedded topic model. arXiv preprint arXiv:1907.05545.
Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, and Bernt Andrassy. 2018. Deep temporal-recurrent-replicated-softmax for topical trends over time. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1079–1089, New Orleans, Louisiana. Association for Computational Linguistics.
Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer.
Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, page 133–142, New York, NY, USA. Association for Computing Machinery.
Kamrun Naher Keya, Yannis Papanikolaou, and James R. Foulds. 2019. Neural embedding allocation: Distributed representations of topic models. arXiv preprint arXiv:1909.04702.
Philippe Laban and Marti Hearst. 2017. newsLens: building and visualizing long-ranging news stories. In Proceedings of the Events and Stories in the News Workshop, pages 1–9, Vancouver, Canada. Association for Computational Linguistics.
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, page II–1188–II–1196. JMLR.org.
Mathis Linger and Mhamed Hajaiej. 2020. Batch clustering for multilingual news streaming. arXiv preprint arXiv:2004.08123.
Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449–3460, Florence, Italy. Association for Computational Linguistics.
Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada. Association for Computational Linguistics.
J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281–297, Berkeley, Calif. University of California Press.
Sebastião Miranda, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. Multilingual clustering of streaming news. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4535–4544, Brussels, Belgium. Association for Computational Linguistics.
Elaheh Momeni, Shanika Karunasekera, Palash Goyal, and Kristina Lerman. 2018. Modeling evolution of topics in large-scale temporal text corpora. In Twelfth International AAAI Conference on Web and Social Media.
Hien M. Nguyen, Eric W. Cooper, and Katsuari Kamei. 2011. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigm., 3(1):4–21.
Jorge Nocedal. 1980. Updating quasi-newton matrices with limited storage. Mathematics of Computation, 35(151):773–782.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Suzanna Sia, Ayush Dalmia, and Sabrina J. Mielke. 2020. Tired of topic models? Clusters of pretrained word embeddings make for fast and good topics too! arXiv preprint arXiv:2004.14914.
Todor Staykovski, Alberto Barrón-Cedeño, Giovanni Da San Martino, and Preslav Nakov. 2019. Dense vs. sparse representations for news stream clustering. In Proceedings of Text2Story - 2nd Workshop on Narrative Extraction From Texts, co-located with the 41st European Conference on Information Retrieval, Text2Story@ECIR 2019, Cologne, Germany, April 14th, 2019, volume 2342 of CEUR Workshop Proceedings, pages 47–52. CEUR-WS.org.
Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei. 2005. Sharing clusters among related groups: Hierarchical dirichlet processes. In Advances in Neural Information Processing Systems, pages 1385–1392.
Ralph Weischedel et al. 2013. OntoNotes Release 5.0 LDC2013T19. Web Download. Linguistic Data Consortium, Philadelphia.
Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. 2017. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3967–3976, International Convention Centre, Sydney, Australia. PMLR.
Manzil Zaheer, Amr Ahmed, Yuan Wang, Daniel Silva, Marc Najork, Yuchen Wu, Shibani Sanan, and Surojit Chatterjee. 2019. Uncovering hidden structure in sequence data via threading recurrent models. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, page 186–194, New York, NY, USA. Association for Computing Machinery.
Deyu Zhou, Haiyang Xu, and Yulan He. 2015. An unsupervised Bayesian modelling approach for storyline detection on news articles. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1943–1948, Lisbon, Portugal. Association for Computational Linguistics.
A Appendix
The TDT corpus does not have a training and test split and we thus partition the corpus into two almost equal portions such that all documents in a single event are part of the same split. Our training set consists of 873 documents and our test set consists of 680 documents. The events in each partition of the TDT corpus are shown in Table 3.
Events in Our Train Split
Karrigan Harding, Shannon Faulker, Quayle lung clot, Haiti ousts observers, NYC Subway bombing, Carlos the Jackal, USAir 427 crash, Lost in Iraq, Death of Kim Jong Il, Clinic Murders, Kobe Japan quake, Serbs violate Bihac, OK-City bombing
Events in Our Test Split
Pentium chip flaw, Cuban riot in Panama, Justice-to-be Breyer, Humble TX flooding, WTC Bombing trial, Cessna on White House, Aldrich Ames, Comet into Jupiter, Serbians down F-16, Carter in Bosnia, Halls copter, DNA in OJ trial
Table 3: Events in the training and test splits of the TDT Pilot corpus