Event Evolution Tracking from Streaming Social Posts
Pei Lee
University of British Columbia, Vancouver, BC, Canada
[email protected]

Laks V.S. Lakshmanan
University of British Columbia, Vancouver, BC, Canada
[email protected]

Evangelos Milios
Dalhousie University, Halifax, NS, Canada
[email protected]
ABSTRACT
Online social post streams such as Twitter timelines and forum discussions have emerged as important channels for information dissemination. They are noisy, informal, and surge quickly. Real-life events, which may happen and evolve every minute, are perceived and circulated in post streams by social users. Intuitively, an event can be viewed as a dense cluster of posts with a life cycle, sharing the same descriptive words. There is much previous work on event detection from social streams. However, there has been surprisingly little work on tracking the evolution patterns of events, e.g., birth/death, growth/decay, merge/split, which we address in this paper. To define a tracking scope, we use a sliding time window, where old posts disappear and new posts appear at each moment. Following that, we model a social post stream as an evolving network, where each social post is a node, and edges between posts are constructed when the post similarity is above a threshold. We propose a framework which summarizes the information in the stream within the current time window as a "sketch graph" composed of "core" posts. We develop incremental update algorithms to handle highly dynamic social streams and track event evolution patterns in real time. Moreover, we visualize events as word clouds to aid human perception. Our evaluation on a real data set consisting of 5.2 million posts demonstrates that our method can effectively track event dynamics over the whole life cycle from very large volumes of social streams on the fly.
1. INTRODUCTION
In the current social web age, people easily feel overwhelmed by the information deluge coming from post streams which flow in from channels like Twitter, Facebook, forums, blog websites and mailing lists. As of 2009, it was reported [1] that each Twitter user follows 126 users on average, and that reading the social posts received each day costs users considerable time, only to discover a small interesting part: a huge overhead paid to find a small amount of interesting information. There is thus an urgent need to provide users with tools which can automatically extract and summarize significant information from highly dynamic social streams, e.g., report emerging bursty events, or track the evolution of a specific event in a given time span. There are many previous studies [19, 22, 24, 25, 12, 7] on detecting new emerging events from text streams; they serve the need for answering the query "what's happening?" over social streams. However, in many scenarios, users may want to know more details about an event and may like to issue advanced queries like "how're things going?". For example, for the event "SOPA (Stop Online Piracy Act) protest" happening in January 2012, existing event detection approaches can discover bursty activities at each moment, but cannot answer queries like "how has the SOPA protest evolved in the past few days?". An ideal answer to such an evolution query would be a "panoramic view" of the event history, which improves the user experience. In this work, we consider this kind of query as an instance of the event evolution tracking problem, which aims to track the evolutionary dynamics of events at each moment from social streams. Typical event evolution patterns include emerging (birth) or disappearing (death), inflating (growth) or shrinking (decay), and merging or splitting of events.
Event detection can be viewed as a subproblem of event evolution tracking, which is the problem we solve in this paper. There are several major challenges in tracking event evolution. The first challenge is the effective organization of noisy social post streams into a meaningful structure. Social posts such as tweets are usually written in an informal way, with lots of abbreviations, misspellings and grammatical errors. Even worse, a correctly written post may have no significance and be just noise. Recent works on event detection from Twitter [28, 22] recognize and handle noise in post streams in a limited and ad hoc manner, but do not handle noise in a systematic formal framework. The second challenge is how to track and express the event evolution behaviors precisely and incrementally. Most related work reports event activity by volume on the time dimension [22, 19]. While certainly useful, this cannot show evolution behaviors such as how events split or merge. The third challenge is the summarization and annotation of events. Since an event may easily contain thousands of posts, it is important to summarize and annotate it in order to facilitate human perception. Recent related works [22, 19, 28] typically show users just a list of posts ranked by importance or time freshness, which falls short of addressing this challenge.

To handle the above challenges, we first model social streams as an evolving post network, and then propose a sketch graph-based framework to track the evolution of events in the post network (Section 4.1). Intuitively, a sketch graph can be viewed as a compact summary of the original post network. The sketch graph only contains core posts from the post network, where core posts are defined as posts that play a central role in the network. Noise posts are directly pruned by the sketch graph. As we will discuss in Section 5, evolution behaviors can be effectively and incrementally expressed based on a group of primitive operations on the sketch graph. Technically, we define an event as a cluster in the post network, and then summarize and annotate each event by topical word clouds. We show an overview of the major steps for event tracking from social streams in Figure 1. Note that as time rolls on, the post network, events and their annotations are updated incrementally at each moment.

Figure 1: (a) The sketch-based framework. The post network captures the correlation between posts in the time window at each moment, and evolves as time rolls on; core nodes and core edges are bolded. (b) Event evolution tracking. From moment t to t+1, typical event evolution behaviors include birth or death, growth or decay, merging or splitting of events. Each event is annotated by a topical word cloud.

We notice that, at a high level, our definition seemingly resembles previous work on density-based clustering over streaming data, e.g., DenStream in [9] and cluster maintenance in [2] and [5]. However, there are several major differences. First, our approach works on an evolving graph structure and provides users the flexibility of choosing the scope for tracking and monitoring new events by means of a fading time window, while the existing work does not provide a tracking scope. Second, the existing work can only process the addition of nodes/edges one by one, while our approach can handle adding, deleting and fading of nodes subgraph by subgraph. This is an important requirement for dealing with the high throughput rate of online post streams. Third, the focus of our approach is tracking and analyzing the evolution dynamics over the whole life cycle of events.
By contrast, they focus on the maintenance of clusters, which is only a sub-task in our problem.

Finally, comparing with traditional topic tracking on news article streams, we note that that problem is usually formulated as a classification problem [4]: when a new story arrives, compare it with topic features in the training set and, if it matches sufficiently, declare it to be on a topic. Commonly used techniques include decision trees and k-NN [29]. This approach assumes that topics are predefined before tracking. Thus, we cannot simply apply topic tracking techniques to event tracking in social streams, since future events are unknown and may not conform to any previously known topics. Moreover, traditional topic tracking has difficulty tracking composite behaviors such as merging and splitting, which are a key aspect of event evolution.

In summary, the main problem we study in this paper is captured by the following questions: how can we efficiently track the evolution behavior of social post streams such as Twitter, which are noisy and highly dynamic? How can we do this incrementally? What is an effective way to express the evolution behavior of events?
In this paper, we develop a framework and algorithms to answer all these questions. Our main contributions are the following:

• We design an effective approach to extract and organize meaningful information from noisy social streams (Section 3);

• We filter noisy posts by introducing the sketch graph (Section 4.1), based on which we define a group of primitive operations and express evolution behaviors with respect to graphs and events using these primitive operations (Section 5);

• We propose efficient algorithms cTrack and eTrack to track the evolution of clusters and events accurately and incrementally, superior in both quality and performance to existing approaches that find the evolution patterns by matching events in consecutive time moments (Section 6);

• Our evaluation on a large real data set demonstrates that our method can effectively track all six kinds of event evolution behaviors from highly dynamic social post streams in a scalable manner (Section 7).

More related work is discussed in Section 2. We summarize the paper and discuss extensions in Section 8. For convenience, we summarize the major notation used in the paper in Table 1.

Table 1: Notation.
  S(p_i, p_j)       the content similarity between p_i and p_j
  S_F(p_i, p_j)     the fading similarity between p_i and p_j
  (δ, ε, ε̂)         density factors for core post, edge and core edge
  w_t(p)            the weight of post p at time moment t
  G_t(V_t, E_t)     the post network at moment t
  Ĝ_t(V̂_t, Ê_t)     the sketch graph of the post network at moment t
  N(p)              the neighbor set of p in G_t
  Ĉ, Ŝ_t            a component, a component set at moment t
  C, S_t            a cluster, a cluster set at moment t
  N_c(p)            the set of clusters where p's neighboring core posts belong
  E, S̃_t            an event, an event set at moment t
  A                 the annotation for an event
2. RELATED WORK
Work related to this paper mainly falls into one of the following categories.
Topic/Event/Community detection and tracking. Most previous works detect events by discovering topic bursts from a document stream. Their major techniques either detect the frequency peaks of event-indicating phrases over time in a histogram, or monitor the formation of a cluster from a structural perspective. A feature-pivot clustering approach is proposed in [12] to detect bursty events from text streams. Sarma et al. [25] design efficient algorithms to discover events from a large graph of dynamic relationships. Jin et al. [15] present Topic Initiator Detection (TID) to automatically find which web document initiated a topic on the Web. The Louvain method [8], based on modularity optimization, is a state-of-the-art community detection approach which outperforms others; however, it cannot resist massive noise. None of the above works address the event evolution tracking problem.

There is less work on evolution tracking. An event-based characterization of behavioral patterns for communities in temporal interaction graphs is presented in [6]. A framework for tracking short, distinctive phrases (called "memes") that travel relatively intact through on-line text was developed in [19]. Unlike these works, we focus on tracking real-world, event-specific evolution patterns from social streams.
Social Stream Mining. Weng et al. [28] build signals for individual words and apply wavelet analysis to the frequency of words to detect events from Twitter. Twitinfo [22] detects events by keyword peaks and represents an event it discovers from Twitter by a timeline of related tweets. Recently, Agarwal et al. [2] discover events that are unraveling in microblog streams by modeling events as dense clusters in highly dynamic graphs. Angel et al. [5] study the efficient maintenance of dense subgraphs under streaming edge weight updates. Both [2] and [5] model the social stream as an evolving entity graph, but suffer from the drawback that post attributes like time and author cannot be reflected. Another drawback of [2] and [5] is that they can only handle edge-by-edge updates, and cannot handle subgraph-by-subgraph bulk updates. Both drawbacks are addressed in our paper.
Clustering and Evolving Graphs. In this paper, we summarize an original post network into a sketch graph based on density parameters. Compared with partitioning-based approaches (e.g., K-Means [14]) and hierarchical approaches (e.g., BIRCH [14]), density-based clustering (e.g., DBSCAN [14]) is effective in finding arbitrarily-shaped clusters, and is robust to noise. The main challenge is to apply density-based clustering to fast-evolving post networks. CluStream [3] is a framework that divides the clustering process into an online component, which periodically generates detailed summary statistics for nodes, and an offline component, which uses only the summary statistics for clustering. However, CluStream is based on K-Means only. DenStream [9] presents a new approach for discovering clusters in an evolving data stream by extending DBSCAN. This work is related to ours in that both employ density-based clustering; the differences between our approach and DenStream were discussed in detail in the introduction. Subsequently, D-Stream [10] uses an online component which maps each input data record into a grid and an offline component which generates grid clusters based on density. Another related work is by Kim et al. [16], which first clusters individual snapshots into quasi-cliques and then maps them over time by looking at the density of bipartite graphs between quasi-cliques in adjacent snapshots. Although [16] can handle the birth, growth, decay and death of clusters, splitting and merging behaviors are not supported. In contrast, our approach is fully incremental and is able to track composite behaviors like merging and splitting in real time.
3. POST NETWORK CONSTRUCTION
In this section, we describe how we construct a post network from a social post stream. The main challenge is detecting similarity between streaming posts efficiently and accurately, taking the time of the posts into account. We use a notion of fading similarity (Section 3.2) and propose a technique called linkage search to efficiently detect posts similar to a post as it streams in (Section 3.3).
Social posts such as tweets are usually written in an informal way. Our aim is to design a processing strategy that can quickly judge what a post talks about and that is robust to the informal writing style. In particular, we focus on the entity words contained in a post, since entities depict the topic. For example, given the tweet "iPad 3 battery pointing to thinner, lighter tablet?", the entities are "iPad", "battery" and "tablet". However, traditional Named Entity Recognition tools [23] only support a narrow range of entities like Locations, Persons and Organizations. NLP parser based approaches [17] are not appropriate due to the informal writing style of posts and the need for high processing speed. Also, simply treating each post text as a bag of words [21] will lead to loss of accuracy, since different words have different weights in deciding the topic of a post. To broaden the applicability, we treat each noun in the post text as a candidate entity. Technically, we obtain nouns from a post text using a Part-Of-Speech Tagger, and if a noun is plural (POS tag "NNS" or "NNPS"), we obtain the prototype of this noun using the WordNet stemmer. In practice, we find this preprocessing technique to be robust and efficient. In the Twitter dataset we used in our experiments (see Section 7), each tweet contains 4.9 entities on average. We formally define a social post as follows.
DEFINITION 1 (Post). A post p is a triple (L, τ, u), where L is the list of entities in the post, τ is the time stamp of the post, and u is the user who posted it.

We let p^L denote L in the post p for simplicity, and analogously for p^τ and p^u. We use |p^L| to denote the number of entities in p.

Post similarity is the most crucial criterion in correlating posts of the same event together. Traditional similarity measures such as TF-IDF based cosine similarity, the Jaccard coefficient and Pearson correlation [21] only consider the post content. However, the time dimension should clearly play an important role in determining post similarity, since posts created closer together in time are more likely to discuss the same event than posts created at very different moments. We introduce the notion of fading similarity to capture both content similarity and time proximity. Formally, we define the fading similarity between a pair of posts p_i and p_j as

    S_F(p_i, p_j) = S(p_i^L, p_j^L) / D(|p_i^τ − p_j^τ|)    (1)

where S(p_i^L, p_j^L) is a set-based similarity measure that maps the similarity between p_i^L and p_j^L to the interval [0, 1], and D(|p_i^τ − p_j^τ|) is a distance measure that is monotonically increasing in |p_i^τ − p_j^τ| and satisfies D(|p_i^τ − p_j^τ|) ≥ 1. For example, S(p_i^L, p_j^L) may be the Jaccard coefficient S(p_i^L, p_j^L) = |p_i^L ∩ p_j^L| / |p_i^L ∪ p_j^L|, and D may be D(|p_i^τ − p_j^τ|) = e^{|p_i^τ − p_j^τ|}. We compare different measures in our experiments. It is known that nouns are usually more topic-relevant than verbs, adjectives, etc. [20]. Consequently, entity-based similarity of posts is more appropriate than similarity based on all words; besides, it has the advantage of smaller computational overhead. It is easy to see that 0 ≤ S_F(p_i, p_j) ≤ 1 and that S_F(p_i, p_j) is symmetric.
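To make Eq. (1) concrete, the following minimal Python sketch instantiates S as the Jaccard coefficient and D as the exponential decay above. This is an illustration rather than the paper's implementation: posts are modeled simply as (entity-set, timestamp) pairs, and the `decay` rate parameter is our own addition for flexibility.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient between two entity sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def fading_similarity(p, q, decay=1.0):
    """Eq. (1): content similarity discounted by temporal distance.

    p and q are (entities, timestamp) pairs.  D(x) = e^(decay * x) >= 1,
    so the result stays in [0, 1] and is symmetric in p and q.
    """
    d = math.exp(decay * abs(p[1] - q[1]))
    return jaccard(p[0], q[0]) / d
```

Note that with identical timestamps the fading similarity reduces to the plain Jaccard coefficient, and it shrinks toward 0 as the time gap grows.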
To find the implicit correlation between posts as they stream in, we build a post network based on the following rule: if the fading similarity between two posts is higher than a given threshold ε, we create an edge between them. More formally:
DEFINITION 2 (Post Network). The snapshot of the post network at moment t is defined as a graph G_t(V_t, E_t), where each node p ∈ V_t is a post, and an edge (p_i, p_j) ∈ E_t is constructed iff S_F(p_i, p_j) ≥ ε, where 0 < ε < 1 is a given parameter.

(Footnotes: POS Tagger, http://nlp.stanford.edu/software/tagger.shtml; JWI WordNet Stemmer, http://projects.csail.mit.edu/jwi/)

Figure 2: The functional relationships between the different types of objects defined in this paper. Refer to Table 1 for notation.

We monitor social streams using a fading time window, which will be introduced explicitly in Section 5.3. For now, imagine a time window of observation and consider the post network at the beginning of the time window. As we move forward in time, new posts flow in and old posts fade out, and G_t(V_t, E_t) is dynamically updated at each moment, with new nodes/edges added and old nodes/edges removed. Removing a node and its associated edges from G_t(V_t, E_t) is an easy operation, so let us investigate the case when a node is added. When a new post p flows in, we need to construct the linkage (i.e., edges) between p and nodes in V_t, following Definition 2. In a real scenario, since the number of nodes |V_t| can easily go up to millions, it is impractical to compare p with every node in V_t to verify the satisfaction of Definition 2. Below, we propose Linkage Search to identify the neighbors of p accurately by accessing only a small subset of nodes in V_t.

Linkage Search. Let N(p) denote the set of neighbors of p that satisfy Definition 2. The problem of linkage search is to find N(p) accurately by accessing only a small node set N′, where N(p) ⊆ N′ ⊂ V_t and |N′| ≪ |V_t|. To solve this problem, we first construct a post-entity bipartite graph, and then perform a two-step random walk process on this bipartite graph and use the hitting counts. The main idea of linkage search is to let a random surfer start from post node p and walk to any entity node in p^L on the first step, and then walk back to post nodes other than p on the second step. All the post nodes visited on the second step form a set N′. Any node q ∈ N′ can be hit multiple times from different entities, and the total hitting count H(p, q) can be aggregated. Assuming the measure S(p^L, q^L) in Eq. (1) is the Jaccard coefficient, we can verify the linkage between p and q by checking the condition

    S_F(p, q) = H(p, q) / [(|p^L| + |q^L| − H(p, q)) · D(|p^τ − q^τ|)] ≥ ε    (2)

Linkage search supports the construction of a post network on the fly.
It is easy to see that for a post q ∉ N′, S_F(p, q) = 0. Thus, we do not need to access posts that are not in N′. Since a post like a tweet usually connects to very few entities, |N′| ≪ |V_t| generally holds, making linkage search very efficient. Suppose the average number of entities in each post is d_1 and the average number of posts mentioning each entity is d_2. Then linkage search can find the neighbor set of a given post in time O(d_1 · d_2).
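The linkage search procedure can be sketched as follows; this is an illustrative re-implementation with hypothetical names, not the authors' code. The inverted index from entities to posts plays the role of the post-entity bipartite graph, and the aggregated hitting count H(p, q) equals |p^L ∩ q^L|, so Eq. (2) can be checked without scanning all of V_t.

```python
from collections import defaultdict
import math

class PostIndex:
    """Entity -> posts inverted index supporting linkage search (Eq. 2)."""

    def __init__(self, eps=0.3):
        self.eps = eps                      # edge threshold epsilon
        self.posts = {}                     # post id -> (entity set, timestamp)
        self.by_entity = defaultdict(set)   # entity -> ids of posts containing it

    def add(self, pid, entities, tau):
        self.posts[pid] = (set(entities), tau)
        for e in entities:
            self.by_entity[e].add(pid)

    def neighbors(self, pid):
        """Two-step walk: post -> its entities -> other posts.
        H(p, q) counts how often q is hit, i.e. |p_L ∩ q_L|."""
        ents, tau = self.posts[pid]
        hits = defaultdict(int)
        for e in ents:
            for q in self.by_entity[e]:
                if q != pid:
                    hits[q] += 1
        result = set()
        for q, h in hits.items():
            q_ents, q_tau = self.posts[q]
            union = len(ents) + len(q_ents) - h     # |p_L ∪ q_L|
            sf = h / (union * math.exp(abs(tau - q_tau)))  # Eq. (2)
            if sf >= self.eps:
                result.add(q)
        return result
```

Only posts sharing at least one entity with p are ever touched, matching the O(d_1 · d_2) cost stated above.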
4. SKETCH-BASED SUMMARIZATION
Here we first introduce the notion of a sketch graph based on density parameters, and define events in the post network based on components of the sketch graph. The relationships between the different types of objects defined in this paper are illustrated in Figure 2. As an example to explain this figure, the arrow from G_t to Ĝ_t labeled Ske means Ĝ_t = Ske(G_t). See Table 1 for notation.

Many posts tend to be just noise, so it is essential to identify those posts that play a central role in describing events. We formally capture them using the notion of core posts. On web link graphs, there is a lot of research on node authority ranking, e.g., HITS and PageRank [21]. However, most of these methods are iterative and not applicable to one-scan computation on streaming data.

In this paper, we introduce a sketch graph Ĝ_t(V̂_t, Ê_t) of the post network G_t(V_t, E_t) based on density parameters (δ, ε, ε̂), where ε̂ ≥ ε. In density-based clustering, δ (a.k.a. MinPts in DBSCAN [11]) is the minimum number of nodes in an ε-neighborhood required to form a cluster. In the post network, we consider δ to be the threshold for judging whether a post is important, and similarly ε̂ is a threshold for core edges. The reason we choose density-based clustering is that, compared with partitioning-based approaches (e.g., K-Means [14]) and hierarchical approaches (e.g., BIRCH [14]), density-based clustering (e.g., DBSCAN) defines clusters as areas of higher density than the remainder of the data set, which is effective in finding arbitrarily-shaped clusters and is robust to noise. We adapt the concepts from density-based clustering and define post weights as follows.
DEFINITION 3 (Post Weight). Given a post p = (L, τ, u) in G_t(V_t, E_t) and its neighbor set N(p), the weight of p at moment t, t ≥ p^τ, is defined as

    w_t(p) = (1 / D(|t − p^τ|)) · Σ_{q ∈ N(p)} S_F(p, q)    (3)

Notice that the post weight decays as time moves forward, though the neighbor set N(p) may also change. Thus, post weights need to be continuously updated. In practice, we only store the sum Σ_{q ∈ N(p)} S_F(p, q) with p to avoid frequent updates, and compute w_t(p) on demand when we need to judge the importance of a post. Based on post weight, we distinguish three types of nodes in G_t(V_t, E_t), as defined below.
DEFINITION 4 (Node Types).
• A post p is a core post if w_t(p) ≥ δ;
• it is a border post if w_t(p) < δ but there exists at least one core post q ∈ N(p);
• it is a noise post if it is neither core nor border, i.e., w_t(p) < δ and there is no core post in N(p).

Intuitively, a post is a core post if it shares enough common entities with many other posts. Neighbors of a core post are at least border posts, if not core posts themselves. In density-based clustering, core posts play a central role: if a core post p is found to be part of a cluster C, its neighboring (border or core) posts will also be part of C. Analogously to the notion of core posts, we use the threshold ε̂ to define core edges. An edge (p, q) is a core edge if S_F(p, q) ≥ ε̂, where ε̂ ≥ ε. Notice that a core edge may connect non-core nodes. Core posts connected by core edges build a summary of G_t(V_t, E_t) that we call the sketch graph (see Figure 2).
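Definitions 3 and 4 translate directly into code. In the minimal sketch below (illustrative names, not the paper's implementation), each post caches the sum Σ_{q∈N(p)} S_F(p, q) as suggested above, and w_t(p) is computed on demand before classifying the node.

```python
import math

def post_weight(t, tau, sf_sum):
    """Eq. (3): cached neighbor-similarity sum discounted by post age,
    assuming the exponential decay D(x) = e^x from Eq. (1)."""
    return sf_sum / math.exp(abs(t - tau))

def node_type(p, graph, weights, delta):
    """Definition 4: core if heavy enough, border if adjacent to a core,
    noise otherwise.  `graph` maps post -> neighbor set and `weights`
    maps post -> w_t(p), both computed elsewhere."""
    if weights[p] >= delta:
        return "core"
    if any(weights[q] >= delta for q in graph[p]):
        return "border"
    return "noise"
```

Because only the similarity sum is stored, fading a post one step forward in time never requires touching its neighbors.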
DEFINITION 5 (Sketch Graph). Given the post network G_t(V_t, E_t) and density parameters (δ, ε, ε̂), we define the sketch graph of G_t as the subgraph induced by its core posts and core edges. More precisely, the sketch graph Ĝ_t(V̂_t, Ê_t) satisfies the condition that each node p ∈ V̂_t is a core post and each edge (p, q) ∈ Ê_t is a core edge. We write Ĝ_t = Ske(G_t).

Intuitively, all important nodes in G_t(V_t, E_t) and their relationships are retained in Ĝ_t(V̂_t, Ê_t). Empirically, we found that adjusting the granularity of (δ, ε, ε̂) to make the size |V̂_t| roughly equal to 20% of |V_t| leads to a good balance between the quality of the sketch graph, in terms of the information retained, and its space complexity. The tuning details can be found in Section 7.1.

We define events based on post clusters. Recall that if two core posts are connected by core edges, they should belong to the same cluster. This implies that all core posts in a cluster C form a connected component Ĉ in Ĝ_t(V̂_t, Ê_t), and we write Ĉ = Ske(C). Let Ŝ_t denote the set of connected components in Ĝ_t. Notice that Ŝ_t and Ĝ_t have the same node set and the same structure, so we write Ŝ_t = Ĝ_t for simplicity. To generate clusters based on the sketch graph discussed above, we can start from a connected component Ĉ ∈ Ŝ_t to build a cluster C. See Figure 2 for the functional relationships between the different types of objects. We define cluster C as follows.
DEFINITION 6 (Cluster). Given G_t(V_t, E_t) and the corresponding sketch graph Ĝ_t(V̂_t, Ê_t), a cluster C of G_t is a set of core posts and border posts generated from a connected component Ĉ in Ĝ_t(V̂_t, Ê_t), written C = Gen(Ĉ), and defined as follows:
• All posts in the component Ĉ form the core posts of C.
• For every core post in Ĉ, all its neighboring border posts in G_t form the border posts of C.

A core post appears in only one cluster (by definition). If a border post is associated with multiple core posts in different clusters, it will appear in multiple clusters. Events are defined based on clusters in G_t. However, not every cluster can form an event; we treat clusters with a very small size as outliers, since they do not enjoy wide popularity at the current moment. We use E and O to denote an event and an outlier, respectively.
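The pipeline of extracting components Ĉ from the sketch graph and then expanding each into a cluster C = Gen(Ĉ) can be sketched as follows. This is an illustrative rendering with hypothetical data structures, not the authors' implementation: a BFS enumerates the connected components of the sketch graph, and each component is grown with the border posts adjacent to its core posts in the full post network.

```python
from collections import deque

def sketch_components(core_posts, core_edges):
    """Connected components of the sketch graph (core posts + core edges)."""
    adj = {p: set() for p in core_posts}
    for p, q in core_edges:
        adj[p].add(q)
        adj[q].add(p)
    seen, comps = set(), []
    for p in core_posts:
        if p in seen:
            continue
        comp, queue = set(), deque([p])
        seen.add(p)
        while queue:                      # BFS over core edges
            u = queue.popleft()
            comp.add(u)
            for v in adj[u] - seen:
                seen.add(v)
                queue.append(v)
        comps.append(comp)
    return comps

def gen_cluster(component, graph, border_posts):
    """Definition 6: a component's core posts plus their border neighbors.
    `graph` is the full post network's adjacency map."""
    cluster = set(component)
    for p in component:
        cluster |= graph[p] & border_posts
    return cluster
```

A border post adjacent to cores in two different components is added to both clusters, exactly as Definition 6 allows.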
DEFINITION 7 (Event). Given a cluster C, the following function distinguishes between an event and an outlier:

    Event(C) = true if |C| ≥ ϕ, false otherwise    (4)

where ϕ is a size threshold.

We empirically set ϕ = 10 for event identification. Note that outliers are different from noise; noise is associated with the level of posts, as opposed to clusters. Besides, an outlier at the current moment may grow into an event in the future, and an event may also degrade into an outlier as time passes.

Entity Annotation. Considering the huge volume of posts in an event, it is important to summarize and present a post cluster as a conceptual event to aid human perception. In related work, Twitinfo [22] represents an event it discovers from Twitter by a timeline of tweets. Although the tweet activity by volume over time is shown, it is tedious for users to read tweets one by one to figure out the event details. In this paper, we summarize a snapshot of an event by a word cloud [13]. The font size of a word in the cloud indicates its popularity. Compared with Twitinfo, a word cloud provides a summary of the event at a glance and is much easier for a human to perceive. Technically, given an event E, the annotation A for E, computed as A = Ann(E), is a set of entities with popularity scores, expressed as A = {(e_1, P_{e_1}), (e_2, P_{e_2}), ...}.

We take entities as the word candidates in the cloud. Intuitively, an entity e is popular in an event if many important posts in this event contain e in their entity list. Formally, the relationship between posts and entities can be modeled as a bipartite graph. Recall that the HITS algorithm [18] computes hub and authority scores by mutual reinforcement: the hub score H_i of a node i is decided by the sum of the authority scores of all nodes pointed to by i, and simultaneously, the authority score A_j of a node j is decided by the sum of the hub scores of all nodes pointing to j.
Inspired by HITS, we let the post weight obtained from the post network be the initial hub score of each post; the authority scores of entities can then be computed in one iteration: A = M^T H, where M is the adjacency matrix between posts and entities in an event, A = [P_{e_1}, P_{e_2}, ...]^T is the vector of entity authorities, and H = [w_t(p_1), w_t(p_2), ...]^T is the vector of post weights.
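Because M is a 0/1 post-entity incidence matrix, the single iteration A = M^T H reduces to summing the weights of the posts that mention each entity. A minimal sketch (illustrative names, not the paper's code):

```python
def annotate(posts, weights):
    """One HITS-style iteration A = M^T H: the authority of an entity is
    the sum of the weights of the posts that mention it.
    `posts` maps post -> entity list; `weights` maps post -> w_t(p)."""
    authority = {}
    for p, entities in posts.items():
        for e in entities:
            authority[e] = authority.get(e, 0.0) + weights[p]
    # entities sorted by descending authority; the top entry would get
    # the largest font in the word cloud
    return sorted(authority.items(), key=lambda kv: -kv[1])
```

Unlike full HITS, no further mutual-reinforcement iterations are needed, which keeps annotation a one-scan operation suitable for streaming updates.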
5. TRACKING EVENT EVOLUTION
In this section, we discuss the evolution of the sketch graph and events and develop primitive operations, which form the theoretical basis for our algorithms in Section 6. We introduce the fading time window to serve as a monitor on social streams. Formally, the problem we try to solve in this paper is stated below.

PROBLEM. Given an evolving post network sequence G = {G_1, G_2, ...}, the event evolution tracking problem is to generate an event set S̃_i at each moment i, and to discover all the evolution behaviors between events in S̃_i and S̃_{i+1}, i = 1, 2, ....

The traditional approach taken for tracking evolving-network related problems [4, 16] is illustrated in Figure 3(a). Given a post network G_t at time t, identify the event set S̃_t associated with G_t (marked as step ① in the figure). The network evolves from G_t to G_{t+1} at the next moment (step ②). Again, the events associated with G_{t+1} are identified from scratch (step ③). Finally, the correspondence between event sets S̃_t and S̃_{t+1} is determined to extract the evolution behaviors of events from time t to t+1 (step ④). This traditional approach has two major shortcomings. Firstly, repeatedly extracting events from post networks from scratch is an expensive operation; similarly, tracing the correspondence between event sets at successive moments is also expensive. Secondly, the step of tracing correspondence, since it is done after the two event sets are generated, may lead to loss of accuracy. The method we propose is incremental tracking of event evolution. It corresponds to step ⑤ in the figure. More precisely, for the very first snapshot of the post network, say G_1, our approach generates the corresponding event set S̃_1. After this, this step is never applied again. In the steady state, we apply step ⑤, i.e., we incrementally derive S̃_{t+1} from S̃_t. Our experiments show that our incremental tracking approach outperforms the traditional baselines both in quality and in performance.

We analyze the evolutionary process of events at each moment and abstract it into five primitive operators: +, −, ⊙, ↑, ↓.
We classify the operators based on the objects they manipulate: nodes or clusters. Note that both events and outliers are clusters. The following operators manipulate nodes in a post network.
8. (Node Operations) . • G t + p : add a new post p into G t ( V t , E t ) where p (cid:54)∈ V t . All thenew edges associated with p will be constructed automaticallyby linkage search; • G t − p : delete a post p from G t ( V t , E t ) where p ∈ V t . Allthe existing edges associated with p will be automatically re-moved from E t . • (cid:12) ( G t ) : update the weight of posts in G t . Typically, the adding/deleting of a post will trigger the updatingof post weights. For convenience, we define two composite opera-tors on post networks.D
EFINITION
9. (Composite Operators) . • G t ⊕ p = (cid:12) ( G t + p ) : add a post p into G t ( V t , E t ) where p (cid:54)∈ V t and update the weight of posts in V t ; • G t (cid:9) p = (cid:12) ( G t − p ) : delete a post p from G t ( V t , E t ) where p ∈ V t and update the weight of posts in V t . The following operators manipulate clusters:D
EFINITION
10. (Cluster Operations) . • S t + C : add cluster C to the cluster set S t ; S t S Time Dimension U s e r S p ace t t o n Len t o t n t t t ①② ③ ④ (a) (b) ⑤ Figure 3: (a) Illustration of traditional approach vs. incre-mental approach to evolution tracking. Steps on edges are de-scribed in Section 5.1. (b) An illustration of the fading timewindow from time t to t + 1 , where post weights fade w.r.t. theend of time window. G t will be updated by deleting subgraph G o and adding subgraph G n . • S t − C : remove cluster C from the cluster set S t ; • ↑ ( C, p ) : increase the size of C by adding a new post p ; • ↓ ( C, p ) : decrease the size of C by removing an old post p . Each operator defined above on a single object can be extendedto a set of objects, i.e., for a node set X = { p , p , · · · , p n } , G t + X = G t + p + p + · · · + p n . This is well defined since + isassociative and commutative.We use the left-associative convention for ‘ − ’: that is, we write A − B − C means ( A − B ) − C . In particular, we write A − B + C to mean ( A − B ) + C and the order in which posts in a set areadded/deleted to G t does not matter. These operators will be usedlater in the formal description of the evolution procedures. Figure4(a) depicts the role played by the primitive operators in the eventevolution tracking, end to end, from the post network onward. Fading (or decay) and sliding time window are two common ag-gregation schemes used in time-evolving graphs [9]. Fading schemeputs a higher emphasis on newer posts, and this characteristic hasbeen captured by fading similarity in Eq. (1). Sliding time windowscheme (posts are first-in, first-out) is essential because older postsare less important and it is not necessary to retain all the posts in thehistory. Since events evolve quickly, even inside a given time win-dow, it is important to highlight new posts and degrade old postsusing the fading scheme. 
Thus, we combine these two schemes and introduce a fading time window, as illustrated in Figure 3(b). In practice, users can specify the length of the time window to adjust the scope of monitoring. Users can also choose different fading functions to penalize old posts and highlight new posts in different ways. Let ∆t denote the interval between moments. For simplicity, we abbreviate the moment (t + i · ∆t) as (t + i). When the time window slides from moment t to t + 1, the post network G_t(V_t, E_t) is updated to G_{t+1}(V_{t+1}, E_{t+1}). Suppose G_o(V_o, E_o) is the old subgraph (of G_t) that lapses at moment t + 1 and G_n(V_n, E_n) is the new subgraph (of G_{t+1}) that appears (see Figure 3(b)). Clearly,

    G_{t+1} = G_t − G_o + G_n    (5)

Let Len be the length of the time window. We assume Len > ∆t, which makes V_o ∩ V_n = ∅. This assumption usually holds in real applications; e.g., we set Len to 1 week and ∆t to 1 day. The following claim shows different ways to compute the overlapping part of G_{t+1} and G_t.

CLAIM 1. G_{t+1} − G_n = G_t − G_o = G_{t+1} ⊖ G_n = G_t ⊖ G_o

Proof Sketch: From Eq. (5) we know G_{t+1} − G_n = G_t − G_o. At the post network level, post weight is orthogonal to the node/edge sets, so we have G_t ⊖ G_o = G_t − G_o and G_{t+1} ⊖ G_n = G_{t+1} − G_n. □

The updating of sketch graphs from G_t to G_{t+1} is the core task in event evolution tracking. We already know from Claim 1 that G_{t+1} − G_n = G_t − G_o at the post network level. This is the overlapping part between G_t and G_{t+1}. However, at the sketch graph level, Ske(G_{t+1} − G_n) ≠ Ske(G_t − G_o): some core posts in G_t − G_o may no longer be core posts, due to the removal of edges incident with nodes in G_o or simply due to the passing of time; some non-core posts may become core posts because of edges added to nodes in G_n. To measure the changes in the overlapping part, we define the following three components.

DEFINITION 11 (Updated Components in Overlap).
• S_+: components of non-core posts in G_t − G_o that become core posts in G_{t+1} − G_n due to the addition of G_n, i.e., S_+ = Ske(G_{t+1} − G_n) − Ske(G_{t+1} ⊖ G_n);
• S_−: components of core posts in G_t − G_o that become non-core posts in G_{t+1} − G_n due to the removal of G_o, i.e., S_− = Ske(G_t − G_o) − Ske(G_t ⊖ G_o);
• S_⊙: components of core posts in G_t − G_o that become non-core posts in G_{t+1} − G_n due to the passing of time, i.e., S_⊙ = Ske(G_t ⊖ G_o) − Ske(G_{t+1} ⊖ G_n).

Based on Definition 11, Theorem 1 shows how the overlapping parts of G_t and G_{t+1} differ at the sketch graph level.

THEOREM 1. From moment t to t + 1, the changes of core posts in the overlapping part, i.e., G_{t+1} − G_n (equivalently, G_t − G_o), can be updated using the components S_+, S_− and S_⊙. That is,

    Ske(G_{t+1} − G_n) − Ske(G_t − G_o) = S_+ − S_− − S_⊙    (6)

Proof Sketch: S_+ − S_− − S_⊙ = Ske(G_{t+1} − G_n) − Ske(G_{t+1} ⊖ G_n) − (Ske(G_t − G_o) − Ske(G_t ⊖ G_o)) − (Ske(G_t ⊖ G_o) − Ske(G_{t+1} ⊖ G_n)) = Ske(G_{t+1} − G_n) − Ske(G_t − G_o). □

The following theorem shows the iterative and incremental updating of sketch graphs from moment t to t + 1.

THEOREM 2. From moment t to t + 1, the sketch graph evolves by removing the core posts in G_o, adding the core posts in G_n and updating the core posts in the overlapping part. That is,

    S_{t+1} = Ske(G_{t+1}) = S_t − S_o − S_− − S_⊙ + S_n + S_+    (7)

Proof Sketch: Decomposing each sketch graph around the overlap, S_t = Ske(G_t − G_o) + S_o and S_{t+1} = Ske(G_{t+1} − G_n) + S_n. Substituting Eq. (6) into the latter gives S_{t+1} = Ske(G_t − G_o) + S_+ − S_− − S_⊙ + S_n = S_t − S_o − S_− − S_⊙ + S_n + S_+. □

Theorem 2 indicates that we can incrementally maintain S_{t+1} from S_t.
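To make the role of the three components concrete, here is a toy illustration (with made-up density rules standing in for Section 3's definition of core posts) of how fading alone can demote a core post, which is the S_⊙ case; S_− and S_+ arise analogously when edges to G_o disappear or edges to G_n appear:

```python
def ske(adj, weights, min_deg=2, min_w=0.5):
    """Toy core-post test: a post stays 'core' while it keeps enough
    neighbors and enough weight (illustrative stand-ins for the
    density parameters of Section 3)."""
    return {p for p in adj if len(adj[p]) >= min_deg and weights[p] >= min_w}

# The overlapping part G_t − G_o = G_{t+1} − G_n (Claim 1): same nodes/edges.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
w_t = {"a": 1.0, "b": 1.0, "c": 0.55, "d": 1.0}

cores_t = ske(adj, w_t)                         # d fails the degree test
w_faded = {p: w * 0.8 for p, w in w_t.items()}  # one ⊙ step of fading
cores_faded = ske(adj, w_faded)                 # c now drops below min_w

s_fade = cores_t - cores_faded                  # demoted purely by time
```

Post "c" survives at moment t but loses its core status after fading, without any edge ever changing: exactly the kind of change that S_⊙ is defined to capture.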
Since we define events based on connected components in a sketch graph, this incremental updating of sketch graphs directly benefits the incremental computation of event evolution.

Let Clu(G_t) denote the clustering operation on post network G_t. The clustering shifts and reorders the nodes in V_t by grouping together nodes that belong to the same cluster. We use S_t = Clu(G_t) to represent the set of clusters obtained from G_t. Notice that noise posts in G_t do not appear in any cluster, so the number of posts in S_t is typically smaller than |V_t|. Similarly, we let Clu(G_n) = S_n and Clu(G_o) = S_o be the cluster sets on the graphs G_n and G_o.

Cluster Evolution. Based on Definition 6, a cluster at moment t is generated from a component in S_t. We can apply the Gen function on sketch graphs to get the iterative computation of clusters.
Figure 4: (a) The relationships between primitives and evolutions. Each box represents an evolution object and the arrows between them describe inputs/outputs. (b) The evolutionary behavior table for clusters when adding or deleting a core post p.

COROLLARY. From moment t to t + 1, the set of clusters can be incrementally updated by the iterative computation

    S_{t+1} = Clu(G_{t+1}) = S_t − S_o − S_− − S_⊙ + S_n + S_+    (8)

Proof Sketch: Since the generation of any two clusters from two different components is independent, applying the Gen function to both sides of Eq. (7) yields the iterative computation of clusters. □
The basic operations underlying Eq. (8) are the cases where S_t is modified by the addition or deletion of a cluster containing only one post. In the following, we analyze the evolution of clusters on adding a post p_n or deleting a post p_o. Let N_c(p_n) denote the set of clusters to which p_n's neighboring core posts belong before p_n is added, and let N_c(p_o) denote the set of clusters to which p_o's neighboring core posts belong after p_o is removed. Note that |N_c(p)| = 0 means p has no neighboring core posts. Notice also that Merge and Split are composite operations, composed from a series of primitive cluster operations. The evolutionary behaviors of clusters are shown in Figure 4(b) and explained below.

(a) Addition: S_t + {p_n}. If p_n is a noise post after being added into G_t, ignore p_n. If p_n is a border post, add p_n to each cluster in N_c(p_n). Else, p_n is a core post and we do the following:
• If |N_c(p_n)| = 0: +C, with C = {p_n} ∪ N(p_n);
• If |N_c(p_n)| = 1: ↑(C, {p_n} ∪ N(p_n));
• If |N_c(p_n)| ≥ 2: Merge = +C − Σ_{C′ ∈ N_c(p_n)} C′, with C = N_c(p_n) ∪ {p_n} ∪ N(p_n).

(b) Deletion: S_t − {p_o}. If p_o is a noise post before being deleted from G_t, ignore p_o. If p_o is a border post, delete p_o from each cluster in N_c(p_o). Else, p_o is a core post and we do the following:
• If |N_c(p_o)| = 0: −C, where p_o ∈ C;
• If |N_c(p_o)| = 1: ↓(C, {p_o} ∪ N(p_o));
• If |N_c(p_o)| ≥ 2: Split = −C + Σ_{C′ ∈ N_c(p_o)} C′, where p_o ∈ C.

Using the commutativity and associativity of operator '+' and the left-associative convention for operator '−', we can rewrite an expression such as S_t − p_1 + p_2 − p_3 as S_t − (p_1 + p_3) + p_2. While the two expressions are equivalent, different orderings of the operation sequence may incur different total numbers of primitive cluster operations and hence different performance. In the following, we give a theorem indicating how to reduce the number of primitive cluster operations by reordering. As an example, S_t + p_1 − p_2 + p_3 − p_4 can be reordered as S_t − (p_2 + p_4) + (p_1 + p_3) to reduce the number of primitive operations during cluster evolution.

THEOREM 3. Suppose posts have an equal probability of joining any cluster. Given an initial set of clusters and an arbitrary sequence of node additions and deletions, the number of primitive cluster operations can be reduced by performing all the deletions first.
Proof Sketch: Since posts have an equal probability of joining clusters, a smaller |N(p)| (i.e., the degree of p) indicates a smaller number of neighboring clusters. Clearly, if the deletions are performed first, we save time because there will be no edges between the deleted nodes and the added nodes. Otherwise, the edges between the deleted nodes and the added nodes must be constructed first and removed later, which implies a higher |N(p)| during the computation. □

The Evolution of Events and Outliers. Recall that, at any moment, an outlier may grow into an event and an event may degrade into an outlier. Based on Definition 7, the evolution procedures of events and outliers follow the evolution of clusters. That is, the addition of a new event or outlier, as well as the deletion of an existing event or outlier, is exactly the same as for a cluster. The only difference is that the Event function is applied to check whether the cluster being added/deleted is an event or an outlier. When the size of a cluster increases (as with ↑(C, p)), it may correspond to an outlier/event growing into a larger outlier/event, or to an outlier growing in size, exceeding the threshold φ and turning into an event. Similarly, when a cluster shrinks (as with ↓(C, p)), it may be an outlier/event becoming a smaller outlier/event, or an event shrinking below the threshold and turning into an outlier.
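A minimal sketch of the addition cases of Figure 4(b), together with the event/outlier check of Definition 7 (the cluster representation, the ids and the neighbor sets are illustrative; deletion mirrors this with −, ↓ and Split):

```python
PHI = 10  # event threshold φ (Table 2 uses φ = 10)

def event_or_outlier(cluster):
    """Definition 7 in miniature: a cluster is an event iff it is big enough."""
    return "event" if len(cluster) >= PHI else "outlier"

def add_core_post(clusters, p, neighbors, nc):
    """Add a core post p (Figure 4(b), addition column).
    clusters: dict cluster_id -> set of posts
    neighbors: N(p), p's neighbor posts
    nc: ids of the clusters holding p's neighboring core posts, N_c(p)."""
    if len(nc) == 0:                        # birth: +C with C = {p} ∪ N(p)
        clusters[max(clusters, default=0) + 1] = {p} | neighbors
    elif len(nc) == 1:                      # growth: ↑(C, {p} ∪ N(p))
        clusters[next(iter(nc))] |= {p} | neighbors
    else:                                   # Merge: one cluster absorbs N_c(p)
        cid = max(clusters, default=0) + 1  # fresh id, chosen before absorbing
        merged = {p} | neighbors
        for c in nc:
            merged |= clusters.pop(c)
        clusters[cid] = merged
    return clusters
```

For example, adding a core post whose neighboring core posts span two existing clusters collapses both into one merged cluster, exactly the |N_c(p)| ≥ 2 row of the table.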
6. INCREMENTAL ALGORITHMS
The approach of decomposing an evolving graph into a series of snapshots, one per moment, is the traditional way to tackle evolving graph problems [6, 16]. However, this approach suffers in both quality and performance, since events are generated from scratch at each snapshot and matched heuristically across snapshots. To overcome these challenges, we propose an incremental tracking framework, introduced in Section 5.1 and illustrated in Figure 3(a).

In this section, we leverage our primitive operators for evolution tracking, proposing Algorithms cTrack and eTrack to track the evolution of clusters and events, respectively. The key observation is that at each moment |V_o| + |V_n| ≪ |V_t|, where V_o and V_n are the sets of old and new posts between moments t and t + 1. We can therefore save a great deal of computation by adjusting clusters and events incrementally, rather than generating them from scratch.

Bulk Updating. Traditional incremental computation on evolving graphs usually treats the addition/deletion of nodes or edges one by one [10, 27]. In Section 5.5, we discussed the updating of S_t by adding or deleting a single post. However, in a real scenario, since social posts arrive at high speed, post-by-post incremental updating leads to poor performance. In this section, we speed up the incremental computation of S_t by bulk updating. We define a bulk as a cluster of posts and "lift" the one-by-one updating of S_t to bulk updating. Recall that S_n and S_o denote the sets of clusters in the graphs G_n = G_{t+1} − G_t and G_o = G_t − G_{t+1}, respectively. Specifically, given a cluster C_n ∈ S_n, let N_c(C_n) denote the set of clusters in G_t to which the neighboring core posts of nodes in C_n belong, i.e., N_c(C_n) = ∪_{p_n ∈ C_n} N_c(p_n). Analogously, given C_o ∈ S_o, we let N_c(C_o) = ∪_{p_o ∈ C_o} N_c(p_o). Clearly, updating S_t with a single node {p} is a special case of bulk updating.

cTrack. The steps for incremental tracking of cluster evolution (cTrack) are summarized in Algorithm 1. cTrack follows the iterative computation of Eq. (8) and the sequence order of Theorem 3, that is, S_{t+1} = S_t − S_o − S_− − S_⊙ + S_n + S_+. As analyzed in Section 5.5, each bulk addition and bulk deletion has three possible evolutionary behaviors, decided by the size of N_c(C).

Algorithm 1: cTrack
Input: S_t, S_o, S_n, S_−, S_+, S_⊙
Output: S_{t+1}
 1: S_{t+1} = S_t
 2: // Delete S_o ∪ S_− ∪ S_⊙
 3: for each cluster C_o in S_o ∪ S_− ∪ S_⊙ do
 4:   C_o = Ske(C_o)
 5:   N_c(C_o) = ∪_{p_o ∈ C_o} N_c(p_o)
 6:   if |N_c(C_o)| = 0 then
 7:     remove cluster C_o from S_{t+1}
 8:   else if |N_c(C_o)| = 1 then
 9:     delete C_o from cluster C′, where C′ ∈ N_c(C_o)
10:   else
11:     remove the cluster that C_o belongs to from S_{t+1}
12:     for each cluster C′ ∈ N_c(C_o) do
13:       assign a new cluster id for C′; add C′ into S_{t+1}
14: // Add S_n ∪ S_+
15: for each cluster C_n in S_n ∪ S_+ do
16:   C_n = Ske(C_n)
17:   N_c(C_n) = ∪_{p_n ∈ C_n} N_c(p_n)
18:   if |N_c(C_n)| = 0 then
19:     assign a new cluster id for C_n
20:     add C_n to S_{t+1}
21:   else if |N_c(C_n)| = 1 then
22:     add C_n into cluster C′, where C′ ∈ N_c(C_n)
23:   else
24:     assign a new cluster id for C_n
25:     for each cluster C′ ∈ N_c(C_n) do
26:       C_n = C_n ∪ C′; remove C′ from S_{t+1}
27:     add C_n into S_{t+1}
28: return S_{t+1}

Lines 3-13 deal with deleting a bulk C_o, handling the {−, ↓, Split} patterns. Lines 15-27 deal with adding a bulk C_n, handling the {+, ↑, Merge} patterns. The time complexity of cTrack is linear in the total number of bulk updates.

eTrack. Algorithm eTrack works on top of cTrack. We summarize the steps for incremental event tracking (eTrack) in Algorithm 2. Based on cTrack, eTrack monitors the changes in the cluster set effected by cTrack at each moment.
If a cluster is not changed, eTrack takes no action; otherwise, eTrack determines which case applies and invokes the Event function to handle the event evolution behaviors (Lines 4-13). Notice that in Lines 5-9, if a cluster C in S_i has ClusterId id, we use the convention that S_i(id) = C, and S_i(id) = ∅ means there is no cluster in S_i with ClusterId id. In particular, Lines 7-9 handle the case where an event in S_i evolves into an event in S_{i+1}: the part S_i(id) − S_{i+1}(id) is deleted first and the part S_{i+1}(id) − S_i(id) is added afterwards.
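The per-moment reporting loop of eTrack can be approximated with plain dictionaries; this sketch assumes cTrack has already produced S_{i+1} with stable cluster ids (the function name and tuple encoding of behaviors are illustrative):

```python
def etrack_step(S_i, S_i1):
    """One eTrack-style iteration over cluster sets keyed by ClusterId:
    emit ↑/↓ parts for surviving clusters, + for new ones, − for dead ones."""
    behaviors = []
    for cid, C in S_i1.items():
        prev = S_i.get(cid)
        if prev is None:
            behaviors.append(("+", cid))         # newly added cluster
        else:
            if prev - C:
                behaviors.append(("down", cid))  # ↓(C, S_i(id) − S_{i+1}(id))
            if C - prev:
                behaviors.append(("up", cid))    # ↑(C, S_{i+1}(id) − S_i(id))
    for cid in S_i:
        if cid not in S_i1:
            behaviors.append(("-", cid))         # cluster disappeared
    return behaviors

S_i = {1: {"a", "b"}, 2: {"c"}}
S_i1 = {1: {"a", "b", "d"}, 3: {"e"}}
```

On this toy input, cluster 1 grows, cluster 3 is born and cluster 2 dies; in the full algorithm each emitted behavior would additionally pass through the Event function to decide between event and outlier.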
7. EXPERIMENTS
In this section, we first "tune" the construction of the post network and sketch graph to find the best choices of fading similarity measure and density parameters. Then, we test the quality and performance of the event evolution tracking algorithms on two social streams, Tech-Lite and Tech-Full, which we crawled from Twitter. We have designed two types of baseline algorithms, for event detection and for evolution tracking. Our event detection baseline covers the major
Algorithm 2: eTrack
Input: G = {G_1, G_2, · · ·, G_n}, S_1
Output: Event evolution behaviors
 1: for i from 1 to n − 1 do
 2:   obtain S_o, S_n, S_−, S_+, S_⊙ from G_{i+1} − G_i
 3:   S_{i+1} = cTrack(S_i, S_o, S_n, S_−, S_+, S_⊙)
 4:   for each cluster C ∈ S_{i+1} do
 5:     id = ClusterId(C)
 6:     if S_i(id) ≠ ∅ then
 7:       output ↓(C, S_i(id) − S_{i+1}(id))
 8:       output ↑(C, S_{i+1}(id) − S_i(id))
 9:       Event(C)
10:     else Event(+C)
11:   for each cluster C ∈ S_i do
12:     id = ClusterId(C)
13:     if S_{i+1}(id) = ∅ then Event(−C)

Figure 5: Graph schemas of (a) the post network and (b) the sketch graph. Rectangles represent nodes and ellipses represent attributes. "Sim" denotes similarity edges between posts and "CoreSim" denotes core edges. "Weight" is the post weight. "ExpTime" is the time when a core post expires due to the passing of time. "Count" is the number of core posts in a component.

techniques reported in [19, 22, 24, 25]. Our tracking baseline summarizes the state-of-the-art algorithms reported in [6, 16]. All experiments are conducted on a computer with an Intel 2.66 GHz CPU and 4 GB of RAM, running 64-bit Windows 7. All algorithms are implemented in Java. We use a graph database, Neo4J, to store and manipulate the post network and sketch graph.

Datasets. All datasets are crawled from Twitter.com via the Twitter API. Although our event tracking algorithm works regardless of domain, we make the datasets domain-specific in order to facilitate evaluation of the generated results. We built a technology-domain dataset called Tech-Lite by aggregating all the timelines of the users listed in the Technology category of "Who to follow" and of their retweeted users. Tech-Lite has 352,328 tweets and 1402 users, and its streaming rate is about 11,700 tweets/day. Based on the intuition that users followed by users in the Technology category are most likely also in the technology domain, we obtained a larger technology social stream called Tech-Full by collecting all the timelines of users that are followed by users in the Technology category. Tech-Full has 5,196,086 tweets created by 224,242 users, and its streaming rate is about 7216 tweets/hour. The timelines in both Tech-Lite and Tech-Full include retweets and span Jan. 1 to Feb. 1, 2012. Since Tech-Lite is a sampled subset of Tech-Full, the parameters learned from Tech-Lite generally apply to Tech-Full.

Graph Storage Schema. Neo4J provides convenience in managing the large and fast-evolving data structure that represents the post network and sketch graph.
A database in Neo4J (http://neo4j.org/) is based on the attributed graph model [26], where each node/edge can be associated with multiple attributes in the form of key-value pairs. Graph schemas of the post network and sketch graph are shown in Figures 5(a) and (b), respectively. Indices are created on the attributes to support fast exact queries and range queries over nodes.

(The "Who to follow" directory: http://twitter.com/who to follow/interests)

Figure 6: The changing trends of the number of core posts, core edges and events when increasing δ = ε from 0.3 to 0.8. We take the lowest setting, δ = ε = 0.3, as the 100% basis and keep the remaining density parameter fixed.

The quality of post network and sketch graph construction determines the quality of event generation. Recall from Section 3 that the following factors influence the construction of post networks and sketch graphs: (1) entity extraction from post contents; (2) similarity/distance measures between posts; and (3) density parameters for the generation of core posts and core edges. We measure the influence of each factor below to make informed choices.
Post Content Preprocessing. As described in Section 3, we extract entities from posts with the Stanford POS tagger. One alternative approach to entity extraction is to use hashtags. However, only 11% of the tweets in our dataset have hashtags; this leaves many pairs of posts in the dataset with no similarity score between them. Another approach is to simply tokenize tweets into unigrams and treat the unigrams as entities; we call this the "Unigrams" approach. It is based on the state-of-the-art method for event detection discussed in [22]. Table 2(a) compares the three entity extraction approaches in the first time window of the Tech-Full social stream. With "Unigrams", the number of entities is obviously larger than with the other two approaches, but the number of edges between posts tends to be smaller, because tweets written by different users usually share very few common words even when they talk about the same event. The "Hashtags" approach also produces smaller numbers of edges, core posts and events, since it generates a much sparser post network. Overall, the "POS-Tagger" approach discovers more similarity relationships between posts and produces more core posts and events given the same social stream and parameter setting.
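The coverage gap between the "Hashtags" and "Unigrams" approaches is easy to see on a pair of toy tweets (the regex tokenization and example tweets are illustrative; the paper's default pipeline uses the Stanford POS tagger, which is not reproduced here):

```python
import re

def hashtag_entities(tweet):
    """'Hashtags' approach: only explicit #tags count as entities."""
    return set(re.findall(r"#(\w+)", tweet.lower()))

def unigram_entities(tweet):
    """'Unigrams' approach: every token counts as an entity."""
    return set(re.findall(r"[a-z0-9']+", tweet.lower()))

t1 = "Wikipedia goes dark to protest SOPA #sopa"
t2 = "SOPA protest: Wikipedia blackout today"   # no hashtag at all

# Hashtag entities give these two posts nothing in common...
no_overlap = hashtag_entities(t1) & hashtag_entities(t2)
# ...while unigrams still connect them through shared words.
shared = unigram_entities(t1) & unigram_entities(t2)
```

With only 11% of tweets carrying hashtags, pairs like (t1, t2) end up with no hashtag-based similarity edge at all, even though they clearly describe the same event.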
Similarity/Distance Measures. Many set-based similarity measures such as the Jaccard coefficient [21] can be used to compute the similarity S(p_i^L, p_j^L) between posts. Since entity frequency is usually 1 in a tweet, measures such as Cosine similarity and Pearson correlation [21] degenerate to a form very similar to Jaccard, so we use Jaccard. The distance function D(|p_i^τ − p_j^τ|) along the time dimension determines how similarity to older posts is penalized relative to recent posts. We compared three distance functions: (1) reciprocal fading ("Reci-Fading"), with D(|p_i^τ − p_j^τ|) = |p_i^τ − p_j^τ| + 1; (2) exponential fading ("Exp-Fading"), with D(|p_i^τ − p_j^τ|) = e^{|p_i^τ − p_j^τ|}; and (3) no fading ("No-Fading"), with D(|p_i^τ − p_j^τ|) = 1. For any posts p_i, p_j, clearly e^{|p_i^τ − p_j^τ|} ≥ |p_i^τ − p_j^τ| + 1 ≥ 1. Since a time window usually contains many moments, Exp-Fading penalizes the posts in the old part of the time window severely (see Table 2(b)): the number of core posts and events generated by Exp-Fading is much lower than with the other approaches. Since No-Fading does not penalize old posts in the time window at all, too many edges and core posts are generated, without regard for recency. Reci-Fading lies in between, as a more gradual penalty function than Exp-Fading, and we use it by default.

Table 2(a): results of the different entity extraction approaches (POS-Tagger vs. the alternatives). Table 2(b): results of the different distance functions (Reci-Fading vs. the alternatives). Table 2(c): precision and recall of the top 20 events:

Methods        Precision        Recall           Precision
               (major events)   (major events)   (G-Trends)
HashtagPeaks   0.40             0.30             0.25
UnigramPeaks   0.45             0.40             0.20
Louvain        0.60             0.55             0.75
eTrack

Table 2: (a) and (b) are measured in the first time window of Tech-Lite, with 75,849 posts in one week; density parameters (δ, ε, ε) for core posts and threshold φ = 10 for event identification. (c) shows the precision and recall of the top 20 events generated by Baselines 1a and 1b, Louvain and eTrack.

Figure 7: Examples of Google Trends peaks in January 2012 ("SOPA Wikipedia", "SOPA Facebook", "SOPA Megaupload" and "Apple iBooks", around Jan 15 to Jan 29). We validate the events generated by eTrack by checking for a volume peak at a nearby moment in Google Trends. Although such peaks can detect bursty events, Google Trends cannot discover merging/splitting patterns.
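The three distance functions compared above can be sketched directly from their formulas (Jaccard over entity sets, as in the paper; the exact algebraic form of the fading similarity is given by Eq. (1) earlier in the paper, so treat this as an illustrative approximation):

```python
import math

def jaccard(e1, e2):
    """Set-based content similarity between two posts' entity sets."""
    return len(e1 & e2) / len(e1 | e2) if e1 | e2 else 0.0

def fading_sim(e1, t1, e2, t2, scheme="reci"):
    """Content similarity divided by a time-distance penalty D."""
    gap = abs(t1 - t2)
    if scheme == "reci":
        D = gap + 1.0          # Reci-Fading: D = |Δτ| + 1
    elif scheme == "exp":
        D = math.exp(gap)      # Exp-Fading:  D = e^{|Δτ|}
    else:
        D = 1.0                # No-Fading:   D = 1
    return jaccard(e1, e2) / D

e = {"sopa", "protest"}
# Identical content two moments apart: Exp-Fading punishes hardest.
sims = {s: fading_sim(e, 0, e, 2, s) for s in ("exp", "reci", "none")}
```

Since e^{|Δτ|} ≥ |Δτ| + 1 ≥ 1 for any gap, the resulting similarities are always ordered Exp-Fading ≤ Reci-Fading ≤ No-Fading, which is why Exp-Fading yields the fewest core posts and events.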
Density Parameters. The density parameters (δ, ε, ε) control the construction of the sketch graph: the higher the density parameters, the smaller and sparser the sketch graph. Figure 6 shows the number of core posts, core edges and events as a percentage of the numbers at δ = ε = 0.3, as δ = ε increases from 0.3 to 0.8. Results are obtained from the first time window of the Tech-Lite social stream. The rate of decrease for events is higher than for core posts and core edges, because events are less likely to form in sparser sketch graphs. More small events can be detected with lower density parameters, but the computational cost increases because of the larger and denser sketch graphs. Big events, however, are not very sensitive to these density parameters. We therefore set δ = ε to a moderate value as a trade-off between the compactness of events and complexity.
Figure 8: Lists of the top 10 events detected from Twitter technology streams in January 2012 by the baselines HashtagPeaks, UnigramPeaks and Louvain, and by our approach eTrack. To adapt it for event detection, the Louvain method defines events as post communities. Events generated by both Louvain and eTrack are annotated with highly frequent entities; eTrack also shows the birthday of each event.

Ground truth. To generate the ground truth, we crawl news articles from January 2012 published on famous technology websites such as TechCrunch, Wired and CNET, without looking at tweets. We then treat the titles of the news articles as posts and apply our event tracking algorithm to extract event evolution patterns. Finally, a total of 20 major events with life cycles are identified as the ground truth. Typical events include "happy new year", "CES 2012", "jerry yang yahoo resignation", "sopa wikipedia blackout", etc.

To find more noticeable events in the technology news domain, we use Google Trends for Search, which shows the traffic trends of keywords appearing in Google Search along the time dimension. If an event-indicating phrase has a volume peak in Google Trends at a specific time, the event is sufficiently validated by the real world. We validate the correctness of an event E_i generated by our approach as follows: we pick the top 3 entities of E_i, ranked by frequency in E_i, and search them in Google Trends; if the traffic trend of these top entities has a distinct peak at a time near E_i, we consider that E_i corresponds to a real-world event widely witnessed by the public. Four examples of Google Trends peaks are shown in Figure 7.

Baseline 1: Peak-Detection. In recent works [19, 22, 24, 25], events are generally detected as volume peaks of phrases over time in social streams. These approaches share the same spirit: aggregate the frequency of event-indicating phrases at each moment to build a histogram, and generate events by detecting volume peaks in the histogram.
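This volume-peak heuristic can be sketched as follows (the mean-plus-k-standard-deviations burst rule is an illustrative choice of our own; the cited works each define their own peak criterion):

```python
def detect_peaks(counts, k=2.0):
    """Flag the moments whose volume exceeds mean + k*std of the series,
    a simple stand-in for histogram-based burst detection."""
    n = len(counts)
    mean = sum(counts) / n
    std = (sum((c - mean) ** 2 for c in counts) / n) ** 0.5
    return [i for i, c in enumerate(counts) if std > 0 and c > mean + k * std]

# A mostly flat daily histogram with one burst: only the burst qualifies.
trend = [3, 4, 2, 5, 40, 6, 3]
```

Applied to a phrase's daily frequency histogram, this flags the bursty moment; note that such a detector reports only that a burst occurred, with no notion of the event's internal structure or its merge/split history.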
We design two variants of Peak-Detection to capture the major techniques used by these state-of-the-art approaches:
• Baseline 1a: HashtagPeaks, which aggregates frequent hashtags;
• Baseline 1b: UnigramPeaks, which aggregates frequent unigrams.

Notice that both baselines are for event detection only. Lists of the top 10 events detected by HashtagPeaks and UnigramPeaks are presented in Figure 8 (first two columns). Hashtags are employed by Twitter users as a manual way to indicate an event, but they require manual assignment by a human. UnigramPeaks uses the entities extracted in the social stream preprocessing stage, which gives it better quality than HashtagPeaks. However, both are limited in their representation of events, because the internal structure of events is missing. Moreover, although these peaks can detect bursty words, they cannot discover evolution patterns such as the merging or splitting of events. For example, in Figure 7, there is no way to know that "Apple announced iBooks" is an event split from the earlier big event "SOPA", as illustrated in detail in Figure 9(b).

Baseline 2: Community Detection. A community in a large network refers to a structure of nodes with dense internal connections and sparse connections between communities. It is possible to define an event as a community of posts. The Louvain method [8], based on modularity optimization, is the state-of-the-art approach, outperforming other known community detection methods in terms of performance. We design a baseline called “
Louvain” to detect events defined as post communities.

The top 10 events generated by Louvain are shown in Figure 8. As we can see, not every result of the Louvain method is meaningful. For example, "Apple iphone ipad" and "Internet people time" are too vague to correspond to any concrete real-world event. The reason is that, although the Louvain method ensures that every community has relatively dense internal and sparse external connections, it cannot guarantee that every node in a community is important and has sufficiently high connectivity to the other nodes in the same community. It is quite possible for a low-degree node to belong to a community only because it has zero connectivity with other communities. Furthermore, noise posts are prevalent on Twitter, and they negatively impact the Louvain method.
Baseline 3: Pattern-Matching. We design a baseline that tracks the evolution patterns of events between snapshots. In graph mining, the "divide-and-conquer" approach of decomposing an evolving graph into a series of snapshot graphs, one per moment, is the traditional way to tackle evolving graph problems [6, 16]. As an example, Kim et al. [16] first cluster individual snapshots into quasi-cliques and then map them across adjacent snapshots over time. Inspired by this approach, we design a baseline for event evolution tracking which detects events in each snapshot of the post network independently, and then characterizes the evolution of these events at consecutive moments by identifying the following heuristic patterns:
• If |E_t ∩ E_{t+1}| / |E_t ∪ E_{t+1}| ≥ κ% and |E_t| ≤ |E_{t+1}|, then E_{t+1} = ↑E_t;
• If |E_t ∩ E_{t+1}| / |E_t ∪ E_{t+1}| ≥ κ% and |E_t| > |E_{t+1}|, then E_{t+1} = ↓E_t;
where E_t and E_{t+1} are any two events detected at moments t and t + 1, respectively, and κ% is the minimal commonality required to say that E_t and E_{t+1} are snapshots of the same event. A higher κ% results in higher precision but lower recall of the evolution tracking. Empirically, we found that we needed to set κ% = 90% to guarantee quality. It is worth noting that this baseline generates the same events as eTrack, but with a non-incremental evolution tracking approach.

Precision and Recall. To measure the quality of event detection, we use HashtagPeaks, UnigramPeaks and Louvain as baselines against our Algorithm eTrack. (Baseline 3 is designed for tracking event evolution dynamics between moments, so we omit it here.) We compare the
(a) The life cycle of "CES Conference": Jan 8 "CES prediction" → Jan 9 "CES tomorrow, blogs" → Jan 10 "CES conference, news, press" → Jan 11 "CES video, press, blogs, photo, events" → Jan 12 "CES video, gadgets" → Jan 13 "CES video, award" → Jan 14 "CES last day", growing through Jan 11 and decaying afterwards.
(b) The merging and splitting of "SOPA" and "Apple": "SOPA Wikipedia protest/blackout" (Jan 16) grows into "SOPA PIPA protest" (Jan 18), merges with "Apple products" (Jan 18) into "SOPA PIPA protest and Apple iBooks" (Jan 19 and 20), then splits into "SOPA PIPA protest" and "Apple iBooks" (Jan 21).
Figure 9: Tracking the evolution of events by eTrack and baselines. At each moment, an event is annotated by a word cloud. Baselines 1 and 2 only work for detecting new emerging events, and are not applicable to tracking merging and splitting dynamics. The evolution trajectories of eTrack and Baseline 3 are depicted by solid and hollow arrows respectively.

precision and recall of the top 20 events generated by the baselines and eTrack, and show the results in Table 2(c). On the ground truth of 20 major events obtained from news articles on technology websites,
HashtagPeaks and UnigramPeaks have rather low precision and recall scores, because of their poor ability to capture event bursts. Notice that precision and recall may not be equal even when the number of extracted events equals the number of ground-truth events, because multiple extracted events may correspond to the same event in the ground truth. eTrack outperforms the baselines in both precision and recall. Since many events are discussed in social media without being very noticeable on news websites, we also validate the precision of the generated events using Google Trends. As we can see, HashtagPeaks and UnigramPeaks do not gain much in the Google Trends validation, since the words they generate are less informative and not strongly event-indicating. eTrack achieves a precision of 95% in Google Trends; the only failed result is "Samsung galaxy nexus", whose volume is steadily high without obvious peaks in Google Trends, possibly because the social stream is more dynamic.
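The observation above, that precision and recall can differ even when the extracted and ground-truth event lists have the same size, can be made concrete with a small sketch. This is an illustration only, not the paper's evaluation code; the keyword-based `matches` predicate is a hypothetical stand-in for whatever matching criterion is applied against the ground truth.

```python
# Illustrative only: several extracted events may match the same
# ground-truth event, so precision and recall can diverge even when
# both lists have equal length. `matches` is a hypothetical predicate.

def precision_recall(extracted, truth, matches):
    # An extracted event counts toward precision if it matches any truth event.
    hit_extracted = sum(1 for e in extracted if any(matches(e, t) for t in truth))
    # A truth event counts toward recall if some extracted event matches it.
    hit_truth = sum(1 for t in truth if any(matches(e, t) for e in extracted))
    precision = hit_extracted / len(extracted) if extracted else 0.0
    recall = hit_truth / len(truth) if truth else 0.0
    return precision, recall

# Four extracted events vs. four ground-truth events: two extracted
# events both describe "ces", and "sopa" is never recovered.
extracted = ["ces preview", "ces keynote", "iphone rumor", "galaxy launch"]
truth = ["ces", "sopa", "iphone", "galaxy"]
p, r = precision_recall(extracted, truth, lambda e, t: t in e)
# p == 1.0 (every extracted event is correct), r == 0.75 ("sopa" is missed)
```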
Louvain is also worse than eTrack. These results show that eTrack is significantly better than the other baselines in quality.
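Baseline 3's snapshot-matching heuristic, defined earlier, can be sketched in a few lines. This is a simplified illustration under the assumption that events are represented as sets of post IDs; the function names are ours, not from the actual implementation.

```python
# Sketch of Baseline 3's pattern matching (assumes events are sets of
# post IDs; names are illustrative, not from the paper's implementation).

def jaccard(e_t, e_t1):
    """|E_t ∩ E_{t+1}| / |E_t ∪ E_{t+1}|."""
    return len(e_t & e_t1) / len(e_t | e_t1) if (e_t or e_t1) else 0.0

def match_snapshots(e_t, e_t1, kappa=0.9):
    """Label the transition between two snapshots, or None if unmatched."""
    if jaccard(e_t, e_t1) < kappa:
        return None  # not recognized as the same event
    return "grow" if len(e_t) <= len(e_t1) else "decay"

# A slowly changing event is tracked: Jaccard = 10/11 ≈ 0.91 ≥ 0.9.
assert match_snapshots(set(range(10)), set(range(11))) == "grow"
# A merge is lost: each half overlaps the merged event with Jaccard 0.5.
half1, half2 = set(range(5)), set(range(5, 10))
assert match_snapshots(half1, half1 | half2) is None
```

The last two lines also quantify why merges and splits evade this baseline: when two events of comparable size merge, each part's Jaccard overlap with the merged whole is at most about 1/2, far below κ% = 90%.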
Life Cycle of Event Evolution. Our approach is capable of tracking the whole life cycle of an event, from birth and growth to decay and death. We illustrate this using the example of "CES 2012", a major consumer electronics show held in Las Vegas from January 10 to 13. As early as Jan 6, our approach had already detected some discussions about CES and generated an event about CES. Figure 9(a) shows the major snapshots of this event, from Jan 8 to Jan 14. As we can see, on Jan 8 most people talked about "CES prediction", and on Jan 9 the highlighted topic was "CES tomorrow", together with some hearsay about the "ultrabook" that would be shown at CES. After the actual event happened on Jan 10, the event grew distinctly bigger, with lots of products, news and messages spreading over the social network, and this situation continued until Jan 13, the last day of CES. Afterwards, the discussions became weaker and continued until Jan 14, when "CES" was no longer the biggest mention of the day but still appeared in some discussions. Compared with our approach, Baselines 1 and 2 can detect the emergence of "CES" with a frequency count at each moment, but generate no trajectory. Baseline 3 can track only a very coarse trajectory of this event, i.e., from Jan 10 to Jan 12. The reason is that if an event changes rapidly and many posts at consecutive moments cannot be associated with each other, Baseline 3 will fail to track the evolution. Since posts in social streams usually surge quickly, our approach is superior to the baselines.

[Figure 10: Running time (min) on the two datasets as the time window length and step length are varied. (a) Varying time window (eTrack on Tech-Full and Tech-Lite); (b) varying step length (eTrack and Baseline 3 on Tech-Full and Tech-Lite).]
Event Merging & Splitting. Figure 9(b) illustrates an example of event merging and splitting generated by Algorithm eTrack. eTrack detected the event of SOPA (Stop Online Piracy Act) and Wikipedia on Jan 16, because on that day Wikipedia announced a blackout on Wednesday (Jan 18) to protest SOPA. This event grew distinctly on Jan 17 and Jan 18, drawing more people in the social network into discussing the topic. At the same time, another event was detected on Jan 18, discussing Apple's products. On Jan 19, the SOPA event and the Apple event actually merged, because Apple joined the SOPA protest and many Apple products, such as iBooks in education, are directly related to SOPA. This event evolved on Jan 20 by adding more discussions about iBooks 2. Apple iBooks 2 was unveiled on Jan 21; as this new product gained a lot of attention, people who talked about iBooks no longer talked about SOPA. Thus, on Jan 21, the SOPA-Apple event was split into two events, which evolved independently afterwards. Unfortunately, this merging and splitting process cannot be tracked by any of the baselines, which output only independent events. The reason for Baseline 3 is that, given the ground truth E_{t+1} = E_t^1 + E_t^2, i.e., events E_t^1 and E_t^2 merged into E_{t+1}, the conditions |E_t^1 ∩ E_{t+1}| / |E_t^1 ∪ E_{t+1}| ≥ κ% or |E_t^2 ∩ E_{t+1}| / |E_t^2 ∪ E_{t+1}| ≥ κ% are most likely invalid, so the ground truth cannot be tracked.

7.3 Running Time
We measure how the baseline and eTrack scale w.r.t. both the time window width and the step length. We use both the Tech-Lite and Tech-Full streams, and set the time step interval Δt = 1 day for Tech-Lite and Δt = 1 hour for Tech-Full, to track events at different time granularities. The streaming post rates for Tech-Lite and Tech-Full are 11700/day and 7126/hour respectively.
Figure 10(a) shows the running time of eTrack as we increase the time window length; we can see that for a time window of t hours on Tech-Full, our approach finishes post preprocessing, post network construction and event tracking in just 3 minutes. A key observation is that the running time of eTrack does not depend on the overall size of the dataset; rather, it depends on the streaming speed of posts within Δt. Figure 10(b) shows that if we fix the time window length as t and increase the step length of the sliding time window, the running time of eTrack grows nearly linearly. Compared with our incremental computation, Baseline 3 has to process the posts in the whole time window from scratch at each moment, so its running time stays steadily high. If the step length is larger than t on Tech-Full, eTrack has no advantage in running time over Baseline 3, because a large part of the post network is updated at each moment. However, this extreme case is rare: in a real scenario, the step length is much smaller than the time window length, so our approach is much more efficient than the baseline.
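The claim that per-step cost tracks the stream rate within Δt rather than the total stream size follows from the sliding-window maintenance itself: each step only touches posts that enter or expire. The following is a minimal sketch of that bookkeeping under our own simplifications; the places where eTrack would incrementally update the post network and events are reduced to a counter.

```python
from collections import deque

# Simplified sliding-window maintenance: per step, only posts that
# enter or expire are processed, so the work per step is proportional
# to the stream rate in one step, not to the total stream size.
# The incremental network/event updates of eTrack are abstracted
# into the `processed` counter.

class SlidingWindow:
    def __init__(self, window_len):
        self.window_len = window_len  # window length in time units
        self.posts = deque()          # (timestamp, post) in arrival order
        self.processed = 0            # posts touched by incremental updates

    def advance(self, now, new_posts):
        # Expire posts that fell out of the window ending at `now`.
        while self.posts and self.posts[0][0] <= now - self.window_len:
            self.posts.popleft()
            self.processed += 1       # incremental removal work
        # Insert posts that arrived during the last step.
        for ts, p in new_posts:
            self.posts.append((ts, p))
            self.processed += 1       # incremental insertion work
```

By contrast, a from-scratch baseline re-processes all posts currently in the window at every step, so with window length w and step length s its per-step cost is roughly w/s times higher; the gap vanishes as s approaches w, matching the extreme case noted above.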
8. CONCLUSION
Our main goal in this paper is to track the evolution of events over social streams such as Twitter. To that end, we extract meaningful information from noisy post streams and organize it into an evolving network of posts under a sliding time window. We model events as sufficiently large clusters of posts sharing the same topics, and propose a framework that describes event evolution behaviors using a set of primitive operations. Unlike previous approaches, our evolution tracking algorithm eTrack performs incremental updates and efficiently tracks event evolution patterns in real time. We experimentally demonstrate the performance and quality of our algorithm on two real data sets crawled from Twitter. As a natural progression, it would be interesting to investigate tracking the evolution of social emotions about products, with its obvious application to business intelligence.
9. REFERENCES
PVLDB, 5(10):980–991, 2012.
[3] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In VLDB, pages 81–92, 2003.
[4] J. Allan, editor. Topic Detection and Tracking: Event-Based Information Organization. Kluwer Academic Publishers, 2002.
[5] A. Angel, N. Koudas, N. Sarkas, and D. Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. PVLDB, 5(6):574–585, 2012.
[6] S. Asur, S. Parthasarathy, and D. Ucar. An event-based framework for characterizing the evolutionary behavior of interaction graphs. In KDD, pages 913–921, 2007.
[7] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In WSDM, pages 291–300, 2010.
[8] V. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. J. Stat. Mech., P10008, 2008.
[9] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In SDM, 2006.
[10] Y. Chen and L. Tu. Density-based clustering for real-time stream data. In KDD, pages 133–142, 2007.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.
[12] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu. Parameter free bursty events detection in text streams. In VLDB, pages 181–192, 2005.
[13] M. Halvey and M. T. Keane. An assessment of tag presentation techniques. In WWW, pages 1313–1314, 2007.
[14] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[15] X. Jin, W. S. Spangler, R. Ma, and J. Han. Topic initiator detection on the world wide web. In WWW, pages 481–490, 2010.
[16] M.-S. Kim and J. Han. A particle-and-density based evolutionary clustering method for dynamic networks. PVLDB, 2(1):622–633, 2009.
[17] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, pages 423–430, 2003.
[18] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
[19] J. Leskovec, L. Backstrom, and J. M. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD, pages 497–506, 2009.
[20] B. Liu, C. W. Chin, and H. T. Ng. Mining topic-specific concepts and definitions on the web. In WWW, pages 251–260, 2003.
[21] C. D. Manning, P. Raghavan, and H. Schuetze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[22] A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, and R. C. Miller. Twitinfo: aggregating and visualizing microblogs for event exploration. In CHI, pages 227–236, 2011.
[23] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.
[24] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In WWW, pages 851–860, 2010.
[25] A. D. Sarma, A. Jain, and C. Yu. Dynamic relationship and event discovery. In WSDM, pages 207–216, 2011.
[26] S. Srinivasa. Data, storage and index models for graph databases. In Graph Data Management, pages 47–70, 2011.
[27] L. Wan, W. K. Ng, X. H. Dang, P. S. Yu, and K. Zhang. Density-based clustering of data streams at multiple resolutions. TKDD, 3(3), 2009.
[28] J. Weng and B.-S. Lee. Event detection in twitter. In ICWSM, 2011.
[29] Y. Yang, T. Ault, T. Pierce, and C. W. Lattimer. Improving text categorization methods for event tracking. In