Event-Radar: Real-time Local Event Detection System for Geo-Tagged Tweet Streams
arXiv [cs.IR], Oct
Sibo Zhang
University of Illinois
Yuan Cheng
University of Illinois
Deyuan Ke
University of Illinois
ABSTRACT
Local event detection uses people’s geo-tagged posts on social networks to reveal ongoing events and their locations [1]. Recent studies have demonstrated that the geo-tagged tweet stream serves as an unprecedentedly valuable source for local event detection. Nevertheless, how to effectively extract local events from large geo-tagged tweet streams in real time remains challenging. A robust and efficient cloud-based real-time local event detection system would benefit many aspects of society, from shopping recommendations for customer service providers to disaster alerts for emergency departments.
We use the preliminary research GeoBurst [1], which proposed a novel method to detect local events, as a starting point. GeoBurst+ [2] leverages a novel cross-modal authority measure to identify several pivots in the query window. Such pivots reveal different geo-topical activities and naturally attract related tweets to form candidate events. It further summarises the continuous stream and compares the candidates against the historical summaries to pinpoint truly interesting local events. We implement a website demonstration system, Event-Radar, with an improved algorithm to show real-time local events online for the public. Better still, as the query window shifts, our method can update the event list with little time cost, thus achieving continuous monitoring of the stream.
KEYWORDS
Event detection, Local event, Location-based service, Data stream, Data mining, Web mining
With over 500 million tweets written every day and more than 100 million users, Twitter has become one of the most popular online news and social networking services. This means that a large amount of data is generated continuously. Users post and interact with tweets, which are restricted to 140 characters. Beyond Twitter, online social media sites such as Facebook, YouTube, and Instagram have transformed the way we connect with individuals, groups, and communities and have altered everyday practices [5]. Numerous recent workshops, such as Semantic Analysis in Social Media [6], increasingly focus on the influence of social media on our daily lives. Unlike other media sources, Twitter messages offer timely and fine-grained information about any event, reflecting personal perspectives, social information, emotional reactions, and local events.
A local event is an unusual burst of activity in a local area and within a specified duration that engages a considerable number of participants. Empirical studies [7] show that Twitter is often the first medium to break significant natural events such as earthquakes, often within seconds of their occurrence. Twitter is a “what’s-happening-right-now” tool [8], and its tweets form a real-time flow of text messages coming from very different sources and covering various kinds of subjects in distinct languages and locations. The free Twitter stream is therefore an interesting source of data for “real-time” event detection based on text mining techniques. Note that “real time” here means that events need to be discovered as early as possible after they start unfolding in the stream of the online social networking service. Such information about emerging events can be hugely valuable if it is visible in real time.
Studying those data can provide us with useful information. “What is happening right now?” is a fascinating question that many people ask every day.
People are interested in events that happen locally [9]. Corporations are interested in promoting their products to favourable customers [10]. Event detection can answer this question. Besides that, natural disasters might be detected through Twitter, so that people can be warned even faster than through other media [11]. Some predictions can also be made from Twitter data, such as crime prediction [12]. Typical examples include the bomb blasts in Mumbai in November 2008, the flooding of the Red River Valley in the United States and Canada in March and April 2009, and the “Arab Spring” in the Middle East and North Africa region [13]. Several studies have analysed the intentions of Twitter users. For example, user intentions on Twitter can be categorised into daily chatter, conversations, information sharing, and news reporting. These studies also identified Twitter users as information sources, friends, and information seekers.
Considerable research efforts have been made in detecting real-time events. However, most of them lack either accuracy when dealing with local events or the capability to work in real time. For example, Abdelhaq’s EvenTweet [3], which uses spatial entropy, clustering, and feature ranking to extract and rank local events, cannot deal with the real-time environment.
Nevertheless, event detection in Twitter also poses challenges compared with event detection in traditional media. Twitter messages are usually not well organised. Twitter streams contain huge amounts of meaningless messages, which negatively affect the detection performance. Furthermore, conventional text mining techniques are not appropriate because of the short length of tweets, the significant number of spelling and grammatical mistakes, and the frequent use of informal and mixed language. Since spelling and grammar errors, mixed languages, colloquial expressions, and shortened words are very common in tweets, it is very hard to understand their semantic meaning.
Similarly, given the significant amount of data, it is hard to find a systematic way to select valuable tweets on Twitter. While the real-time detection of local events was nearly impossible years ago due to the lack of reliable data sources, the explosive growth of geo-tagged tweet data brings new opportunities. With the ubiquitous connectivity of wireless networks and the vast proliferation of mobile devices, more than 10 million geo-tagged tweets are created on Twitter every day. Numerous real-world examples have shown the effectiveness and timeliness of the information reported on Twitter during disasters and social movements. For example, when the Tohoku Earthquake hit Japan in March 2011 and when the Baltimore Riot took place in April 2015, many people posted geo-tagged tweets to broadcast what was happening on the spot. Its sheer size, multi-faceted information, and real-time nature make the geo-tagged tweet stream an unprecedentedly valuable source for detecting local events [2].
Tweets cover content ranging from everyday matters to the latest local and worldwide events. Twitter streams contain significant amounts of meaningless messages (pointless babble) and rumours [13]. These are important for understanding people’s reactions to events. Nevertheless, they adversely affect event detection performance. A major challenge facing event detection from Twitter streams is to separate the mundane and polluted information from interesting real-life events. In practice, highly scalable and efficient approaches are required for handling and processing the increasingly large quantity of Twitter data, especially for real-time event detection. Other challenges are intrinsic to Twitter’s nature. These are due to the short length of tweet messages, the frequent use of simple words, and the enormous number of spelling and grammatical errors. Such data sparseness, lack of context, and diversity of vocabulary make traditional text analysis techniques less appropriate for tweets [14].
Also, different events may enjoy different popularity among users and can differ significantly in content, the number of messages and participants, periods, internal structure, and causal relationships [15].
Thus, the challenges lie in the following three aspects:
1. Integrating diverse types of data.
The geo-tagged tweet stream involves three different data types: location, time, and text. Considering the entirely different representations of those data types and the complex cross-modal interactions among them, how to effectively integrate them for local event detection is challenging.
2. Capturing the semantics of short text.
Since every tweet is limited to 140 characters, the semantics of the user’s activity is expressed through short and sparse text messages. Compared with traditional documents (e.g., news), it is much harder to capture the semantics of short tweet messages and extract high-quality local events.
3. On-line and real-time detection.
When a local event breaks out, it is key to report the event instantly to allow for timely actions. As massive geo-tagged tweets stream in, the detector should work in an on-line and real-time manner instead of a batch-wise and inefficient one. Such a requirement is the third challenge of our problem [2].
The Topic Detection and Tracking program by Jonathan G. Fiscus and George R. Doddington [16] gave the following definition of an event:
An event is “something that happens at some specific time and place along with all necessary preconditions and unavoidable consequences”.
Sakaki et al. [17] define an event as an arbitrary classification of a space/time region that might have actively participating agents, passive factors, products, and a location in space/time, as defined in the event ontology by Raimond and Abdallah [18]. The target events in this work are significant events that are visible through messages, posts, or status updates of active users of the Twitter online social network service. These events have several properties: (i) they are of large scale because many users experience the event, (ii) they particularly influence people’s daily life, which is the main reason why users are induced to mention them, and (iii) they have both spatial and temporal regions. The importance of an event is related to the distance between the users and the event and to the time elapsed since its occurrence.
Figure 1: Earthquake location estimation based on tweets. Balloons show the tweets on the earthquake. The cross shows the earthquake centre. Red represents new tweets; blue represents later tweets [17].
Events are evaluated using a decision based on whether a document reports a new topic that has not been reported previously or whether it should be merged with an existing event [22]. Depending on how the data is treated, two groups of event detection systems can be identified [23].
Online New Event Detection (NED).
Online New Event Detection refers to the task of identifying events from live streams of tweets in real time. Most new and retrospective event detection methods rely on well-known clustering-based algorithms [24]. Usually, new event detection involves the continuous monitoring of tweet feeds to discover events in near real time, covering real-world events such as breaking news, natural disasters, or football games.
Figure 2: Screenshot of Toretter, an earthquake reporting system [17].
Retrospective Event Detection (RED).
Retrospective Event Detection refers to the process of identifying previously unidentified events from accumulated historical data. In Retrospective Event Detection, most methods are founded on the retrieval of event-relevant documents by performing queries over a collection of records. Both techniques assume that event-relevant documents contain the query terms. A variation of the previous approach is the use of query expansion techniques: some messages related to a specific event do not contain explicit event-related information, but with the use of improved queries, messages related to the event can still be retrieved.
Event detection can be classified into specified and unspecified event detection techniques [25]. By using specific pre-known information and features about an event, traditional information retrieval and extraction techniques can be adapted to perform specified event detection. Most traditional information retrieval and extraction methods are of little use when no prior information is available about the event. Unspecified event detection methods address this issue on the basis that temporal signals constructed via document analysis can reveal real-world events. Monitoring trends in text streams, grouping features with similar trends, and classifying events into different categories are among the tasks needed to perform unspecified event detection.
Event detection has been studied extensively in the past few years, and various methods have been proposed to address the problem. Frequently used feature representations are also presented and discussed. This survey does not provide an exhaustive review of existing approaches but rather covers the techniques related to our most important research directions.
The event detection problem is not a new research topic. Yang et al. [26], in a 1998 study on retrospective and on-line event detection, examined the use and extension of text retrieval and clustering techniques. The main task was to automatically detect new events from a temporally ordered stream of news stories. The system performed quite well and showed that basic techniques such as document clustering can be highly effective for event detection. Depending on the type of event, these methods are classified into unspecified and specified event detection.
Unspecified Event Detection:
This kind of event mainly concerns emerging events, breaking news, and general topics that attract the attention of a considerable number of users. We are interested in using tweets to find ongoing local events. Thus, general events are of interest to us [34]. Typically, such events come with a significant temporary boost in the use of certain keywords. The trends in tweets can be clustered according to frequently occurring features. However, there are also trivial non-events that act as noise when conducting event detection [27]. Therefore, the major challenge in unspecified event detection is to distinguish significant trending events from trivial non-events. Several techniques have been proposed to tackle this challenge by applying a range of machine learning, data mining, and text mining techniques.
TwitterStand: News in tweets [35] presented TwitterStand, a novel system that captures tweet trends related to breaking news. Two techniques were used: a naïve Bayes classifier and an online clustering algorithm. The naïve Bayes classifier was applied to distinguish breaking news from irrelevant non-events in the tweet stream, whereas the online clustering algorithm, which employs term frequency–inverse document frequency (tf-idf) weighting and a cosine similarity measure, was used to form news clusters. The paper used tweets’ hashtags and timestamps as additional signals to reduce the clustering errors of the online clustering algorithm.
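The online clustering step described above can be illustrated with a minimal sketch. It is not the TwitterStand implementation: the class name `OnlineClusterer`, the whitespace tokenizer, the smoothed idf formula, and the similarity threshold of 0.3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class OnlineClusterer:
    """Single-pass clustering of tweets with tf-idf vectors (illustrative only)."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # hypothetical similarity cut-off
        self.doc_freq = Counter()   # term -> number of tweets containing it so far
        self.n_docs = 0
        self.centroids = []         # one summed tf-idf vector per cluster

    def _tfidf(self, tokens):
        # Smoothed idf so that every weight stays positive.
        tf = Counter(tokens)
        return {
            t: c * (math.log((1 + self.n_docs) / (1 + self.doc_freq[t])) + 1.0)
            for t, c in tf.items()
        }

    def add(self, text):
        """Assign a tweet to the most similar cluster, or open a new one."""
        tokens = text.lower().split()
        self.n_docs += 1
        self.doc_freq.update(set(tokens))
        vec = self._tfidf(tokens)

        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim

        if best is not None and best_sim >= self.threshold:
            # Fold the tweet into the existing cluster centroid.
            for t, w in vec.items():
                self.centroids[best][t] = self.centroids[best].get(t, 0.0) + w
            return best

        self.centroids.append(dict(vec))
        return len(self.centroids) - 1


clusterer = OnlineClusterer()
for tweet in ["earthquake hits tokyo now",
              "huge earthquake hits tokyo",
              "pizza dinner tonight friends"]:
    print(tweet, "->", clusterer.add(tweet))
```

Because each tweet is compared only against cluster centroids, the cost per tweet grows with the number of clusters rather than with the number of tweets seen, which is what makes this style of clustering usable on a stream.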
Breaking news detection and tracking on Twitter [36] proposed a technique to capture breaking news from Twitter, with the additional functions of tracking and ranking. The tweets were extracted through pre-defined queries to the Twitter API and indexed before similarity grouping. Grouping was based on the tf-idf similarity between messages. All tweets were ranked by weights assigned to authors, proper nouns, and hashtags. The Stanford Named Entity Recognizer (NER) was used to identify proper nouns, and the number of the posters’ followers and the number of shares of the tweets were taken into consideration. In the paper, the author factor was introduced to reflect the reliability of the tweets, which improved the accuracy. They also developed an application called Hot-streams to validate the algorithms.
Streaming first story detection with application to Twitter [37] focused on detecting new events that never occurred in previous tweets. The approach was mainly about improving the efficiency of cosine similarity computation between documents. The paper developed locality sensitive hashing methods, which restrict the search operations to a small number of records and bound the complexity to constant time and space. However, replies, the number of shares, and hashtags were not taken into consideration in the paper. The experimental results indicated two notable facts: 1. user-based ranking outperforms tweet-based ranking, and 2. taking the entropy of the information into account leads to less message spam.
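The idea of restricting the nearest-neighbour search to a small number of records can be sketched with random-hyperplane locality sensitive hashing. The sketch below is a simplified illustration under stated assumptions, not the method of [37]: it uses a single hash table, raw term counts instead of weighted vectors, and a hypothetical class name `FirstStoryDetector` with hypothetical parameters `n_bits`, `threshold`, and `seed`.

```python
import math
import random
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class FirstStoryDetector:
    """Flags a tweet as a first story when no similar tweet shares its bucket."""

    def __init__(self, n_bits=8, threshold=0.5, seed=0):
        self.n_bits = n_bits          # hypothetical signature length
        self.threshold = threshold    # hypothetical novelty cut-off
        self.seed = seed
        self.buckets = defaultdict(list)
        self._weights = {}            # cache of hyperplane coordinates

    def _plane_weight(self, bit, token):
        # Coordinate of hyperplane `bit` along the axis of `token`,
        # drawn lazily from a standard normal with a deterministic seed.
        key = (bit, token)
        if key not in self._weights:
            rng = random.Random(f"{self.seed}:{bit}:{token}")
            self._weights[key] = rng.gauss(0.0, 1.0)
        return self._weights[key]

    def _signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        bits = []
        for b in range(self.n_bits):
            projection = sum(w * self._plane_weight(b, t) for t, w in vec.items())
            bits.append(projection >= 0.0)
        return tuple(bits)

    def process(self, text):
        """Return True if the tweet looks like a first story."""
        vec = Counter(text.lower().split())
        candidates = self.buckets[self._signature(vec)]
        # Only tweets in the same bucket are compared, not the whole history.
        best = max((cosine(vec, c) for c in candidates), default=0.0)
        candidates.append(vec)
        return best < self.threshold


detector = FirstStoryDetector()
for tweet in ["earthquake hits tokyo", "earthquake hits tokyo", "pizza dinner tonight"]:
    print(tweet, "->", "first story" if detector.process(tweet) else "seen before")
```

With a single table, two similar tweets can still fall into different buckets and be missed; using several independent tables reduces that risk at the cost of more memory.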
Real-world event identification on Twitter [38] used an online clustering technique to associate tweets with real-world events. It keeps clustering related tweets and then classifies the clusters into two categories, real events or trivial non-events. A significant difference between actual events and non-events is that non-events include Twitter-centric topics. Naaman pointed out that such topics are trending but do not reflect or represent any real-world events. All tweets were represented as tf-idf weight vectors based on their contents. The paper used cosine similarity to calculate the distance between a tweet and the cluster centroids. The weight of hashtags was doubled, as it was hypothesised that they provide a strong connection between the text and the tweet topic. Standard pre-processing methods, such as stemming and stop-word removal, were also used. Cluster features were computed as a combination of temporal features based on term frequency, social features based on Twitter operations, topical features, and Twitter-centric features. Term-frequency-based features are based on the number of appearances of a term in the message set of a cluster. The Twitter operations include comments, replies, shares, etc., and the corresponding feature contains the percentage of those operations in the cluster content. The paper assumes that event clusters tend to revolve around a certain meaningful topic, whereas non-event clusters tend to revolve around irrelevant terms such as “dinner”, “sleep”, or “right”. Based on all of this, an SVM is trained to classify the clusters, and the tweets associated with them, as real-world events or non-events.
Towards effective event detection, tracking and summarization on microblog data [39] proposed a technique that assigns topic-word-based features to the microblog data in order to build clusters.
Topic words are those words that are more popular than others in an event. These words are extracted from the daily messages in the microblog data based on the frequency of the phrase, the frequency of the hashtags associated with the phrase, and entropy. A co-occurrence graph was generated by adding edges between messages and topical words, and hierarchical clustering was then applied to it to turn the set of topical words into event clusters. The paper claims that hierarchical clustering outperforms the traditional K-means algorithm.
Real-time event detection for online behavioural analysis of big social data [40] employed a five-stage method for real-time event detection. It collected tweets with search conditions and converted them into JSON format. The terms are then extracted from the tweets by named entity recognition. It constructs the signals by tracking both the occurrences of the terms obtained in the extraction phase and the diffusion of the information. After this, a weighted graph whose nodes are tweets is computed. The edges are weighted by the complement of the similarity degree. Clustering is the final stage, which groups adjacent points that are close in terms of timestamp and occurrences. Each cluster is treated as a candidate to be classified as a real-world event or a non-event.
Specified Event Detection: A specified event can be a public or pre-planned social gathering such as a concert. It should contain metadata such as venue, time, attendees, and musicians. The work introduced here attempts to exploit Twitter textual content, metadata information, or both.
Popescu and Pennacchiotti [41] focused on identifying controversial events that provoke public discussions with opposing opinions on Twitter, such as controversies involving celebrities. Their detection framework is based on the idea of a Twitter snapshot, a triple consisting of a target entity, a given period, and a set of tweets about the entity from the target period. Given a set of Twitter snapshots, an event detection module first distinguishes between event and non-event snapshots using supervised gradient boosted decision trees [42], trained on a manually labelled data set. To rank these event snapshots, a controversy model assigns higher scores to controversial event snapshots by means of a regression algorithm applied to a large number of features. The employed features are based on Twitter-specific characteristics, including linguistic, structural, buzziness, sentiment, and controversy features, and on external features, for example news buzz. These external features require time alignment of entities in news media and Twitter sources, in order to capture entities that are trending in both sources, because they are more likely to refer to real-world events.
The authors also proposed merging the two stages into a single-stage system by including the event detection score as an extra feature in the controversy model. This produced an improved performance. Feature analysis of the single-stage system revealed that the event score is the most relevant feature because it discriminates event from non-event snapshots.
Hashtags are found to be important semantic features for tweets, since they help identify the topic of a tweet and estimate the topical cohesiveness of a set of tweets. External features based on news and the Web are also found to be useful; hence, correlation with traditional media helps validate and explain social media reactions. The linguistic, structural, and sentiment features also have significant effects. The authors concluded that a rich, diverse set of features is crucial for controversy detection.
Benson et al. [43] present a novel approach to identify Twitter messages for concert events using a factor graph model, which simultaneously examines individual messages, clusters them according to the event type, and induces a canonical value for each event property. The motivation is to infer a comprehensive list of musical events from Twitter (based on artist–venue pairs) to complete an existing list (e.g., a city event calendar) by discovering new musical events mentioned by Twitter users that are difficult to find in other media sources. At the message level, this approach relies on a conditional random field (CRF) to extract the artist name and the location of the event. The input features to the CRF model include word shape; a set of regular expressions for common emoticons, time references, and venue types; a bag of words for artist names extracted from an external source; and a bag of words for city place names. Clustering is guided by term popularity, which is an alignment score between the message term labels and a candidate value. To capture the large textual variation in Twitter messages, this score is based on a weighted combination of term similarity measures, including complete string matching and adjacency and equality indicators scaled by the inverse document frequency.
Also, a uniqueness factor is employed during clustering to expose rare event messages that are dominated by the general ones and to discourage messages about the same event from clustering into multiple events. Conversely, a consistency indicator is employed to discourage messages from multiple events from forming a single cluster. The factor graph model is then used to capture the interaction between all components and provide the final decision. The output of the model consists of an event-based clustering of messages, where each cluster is characterised by an artist–venue pair.
Lee and Sumiya [44] present a geo-social local event detection system based on modelling and monitoring crowd behaviours via Twitter, in order to identify local festivals. They rely on geographical regularities deduced from the usual behaviour patterns of crowds using geotags. First, Twitter geo-tagged data are collected and preprocessed over an extended period for a specific region [45]. The area is then divided into several regions of interest (ROI) using the k-means algorithm, applied to the geographical coordinates (longitudes/latitudes) of the collected data. Geographical regularities of the crowd within each ROI are then estimated from historical data based on three main features: the number of tweets, the number of users, and the number of moving users within an ROI. Statistics for these features are then accumulated over historical data using 6-hour time intervals to form the estimated behaviour of the crowd within each ROI. Finally, unusual events in the monitored geographical area can be detected by comparing statistics from new tweets with those of the estimated behaviour. The authors found that an increased number of users combined with an increased number of tweets provides a strong indicator of local festivals.
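The comparison of new statistics against the estimated crowd behaviour can be sketched as follows. This is an illustration under several assumptions rather than the procedure of [44]: the ROI centroids are taken as already computed (for example by k-means), the moving-user feature is omitted, the tuple layout `(day, hour, lat, lon, user)` is hypothetical, and the rule "count exceeds mean plus k standard deviations" is a stand-in decision rule chosen for simplicity.

```python
import statistics
from collections import defaultdict


def nearest_roi(lat, lon, centroids):
    """Index of the closest ROI centroid (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (lat - centroids[i][0]) ** 2 + (lon - centroids[i][1]) ** 2,
    )


def window_stats(tweets, centroids):
    """Count tweets and distinct users per (day, ROI, 6-hour slot)."""
    n_tweets = defaultdict(int)
    users = defaultdict(set)
    for day, hour, lat, lon, user in tweets:
        key = (day, nearest_roi(lat, lon, centroids), hour // 6)
        n_tweets[key] += 1
        users[key].add(user)
    return {k: (n_tweets[k], len(users[k])) for k in n_tweets}


def baseline(history_stats):
    """Group historical (tweets, users) counts by (ROI, slot) across days."""
    grouped = defaultdict(list)
    for (day, roi, slot), counts in history_stats.items():
        grouped[(roi, slot)].append(counts)
    return grouped


def is_unusual(current, past, k=3.0):
    """True when both tweet and user counts exceed mean + k * std of the past."""
    flags = []
    for idx in (0, 1):  # 0: number of tweets, 1: number of users
        values = [p[idx] for p in past]
        mean = statistics.mean(values)
        std = statistics.pstdev(values)
        flags.append(current[idx] > mean + k * std)
    return all(flags)


centroids = [(0.0, 0.0), (10.0, 10.0)]  # assumed output of k-means
history = [(day, 7, 0.1, 0.1, u) for day in range(5) for u in ("u1", "u2")]
past = baseline(window_stats(history, centroids))[(0, 1)]
print(is_unusual((20, 15), past), is_unusual((2, 2), past))
```

Requiring both counts to rise mirrors the observation that more users together with more tweets is the stronger indicator. Note that windows with no tweets at all never enter the baseline in this sketch, which biases the historical mean upwards.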
Abdelhaq [31] presents a method, EVENTTWEET, which extracts hashtags and Twitter keywords based on temporal burstiness and spatial location. It then clusters these keywords to compute events depending on their location distribution.
Krumm [30] introduced a method, Eyewitness, to find local events from a large-scale stream of Twitter texts. The paper considers the location statistics of tweets and uses the location data to train a classifier for locating meaningful tweets. It then envisions the classifier being harnessed in a user-interaction system to identify and monitor events based on users’ locations. Eyewitness adopts a regression model to predict the number of geo-tagged tweets in a certain amount of time. If the actual number of tweets is larger than the predicted number, the burst is regarded as a local event. It also employs a text summarization algorithm to extract the tweets that belong to the event.
Chao [1] proposed another approach, GEOBURST, for local event detection. The paper assumes that a significant local event results in many geo-tagged tweets around one particular place. A method was built on this assumption, which first looks for all geo-topical clusters and then ranks those clusters based on spatiotemporal burstiness to obtain the significant local events. An ad-hoc streaming process is embedded in the method to process and update continuous real-time tweets.
In this section, we describe the Local Event Detection algorithm, which consists of a Candidate Generator, Candidate Classification, and an Online Updater.
Given a query window $Q$ and the set $D_Q$ of tweets falling in $Q$, the candidate generator divides $D_Q$ into several geo-topical clusters, such that the tweets in each cluster are geographically close and semantically coherent. The clustering of $D_Q$, however, poses several challenges. How can the geographical and semantic similarities be combined in a reasonable way? How can the correlations between different keywords be captured? Moreover, how can quality clusters be generated without knowing the suitable number of clusters in advance? To address these challenges, we perform a novel pivot seeking process to identify the centres of geo-topical clusters. Our key insight is that the spot where the event occurs acts as a pivot that produces relevant tweets around it; the closer we are to the pivot, the more likely we are to observe relevant tweets. Therefore, we define a geo-topical authority score for each tweet, where a kernel function captures the geographical influence among tweets and the semantic influence is captured by a random walk on a keyword co-occurrence graph. With this authority measure, we develop an authority ascent procedure to retrieve authority maxima as pivots, and each pivot naturally attracts similar tweets to form a quality geo-topical cluster. Below, we first introduce our geo-topical authority measure to define pivot tweets and then develop an authority ascent procedure for pivot seeking.
Pivot Tweet. An amount of $G(d' \to d)$ energy is distributed from $d'$ to $d$ through a random walk on the graph, $G(d' \to d)\,S(d' \to d)$ is the amount that successfully reaches $d$, and the authority of $d$ is the total energy that $d$ receives from its neighbors [38]. The authority score is analogous to the kernel density in nonparametric kernel density estimation [7]. In kernel density estimation, the density of any point $x$ in the Euclidean space is contributed mainly by the observed points that are close enough to $x$. As such, the density maxima can be defined in a nonparametric manner.
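The authority measure and the ascent idea outlined above can be made concrete with a small sketch. It is not the implementation used in this work. The geographical influence is modelled with an Epanechnikov kernel of bandwidth `h`, which is an assumed choice; the random-walk semantic score is replaced by a plain Jaccard overlap of keyword sets, which is a much simpler stand-in; and the dictionary fields `loc` and `words`, the threshold `delta`, and all function names are hypothetical.

```python
import math


def geo_kernel(a, b, h):
    """Epanechnikov kernel on Euclidean distance; zero beyond bandwidth h."""
    u = math.dist(a["loc"], b["loc"]) / h
    return 0.75 * (1.0 - u * u) if u < 1.0 else 0.0


def semantic_sim(a, b):
    """Jaccard overlap of keyword sets (stand-in for the random-walk score)."""
    union = len(a["words"] | b["words"])
    return len(a["words"] & b["words"]) / union if union else 0.0


def neighbours(i, tweets, h, delta):
    """Tweets within bandwidth h and semantic similarity >= delta; includes i."""
    return [
        j for j in range(len(tweets))
        if j == i
        or (geo_kernel(tweets[j], tweets[i], h) > 0.0
            and semantic_sim(tweets[j], tweets[i]) >= delta)
    ]


def authority(i, tweets, h, delta):
    """Sum of geographical times semantic influence received from neighbours."""
    return sum(
        geo_kernel(tweets[j], tweets[i], h) * semantic_sim(tweets[j], tweets[i])
        for j in neighbours(i, tweets, h, delta)
    )


def find_pivot(i, tweets, scores, h, delta):
    """Shift to the highest-authority neighbour until no neighbour is better."""
    current = i
    while True:
        best = max(neighbours(current, tweets, h, delta), key=lambda j: scores[j])
        if scores[best] <= scores[current]:
            return current  # local authority maximum reached
        current = best


def cluster_by_pivot(tweets, h=1.0, delta=0.2):
    """Map each pivot index to the list of tweet indices it attracts."""
    scores = [authority(i, tweets, h, delta) for i in range(len(tweets))]
    clusters = {}
    for i in range(len(tweets)):
        clusters.setdefault(find_pivot(i, tweets, scores, h, delta), []).append(i)
    return clusters


tweets = [
    {"loc": (0.0, 0.0), "words": {"concert", "music"}},
    {"loc": (0.1, 0.0), "words": {"concert", "music", "stage"}},
    {"loc": (0.0, 0.1), "words": {"concert", "stage"}},
    {"loc": (5.0, 5.0), "words": {"fire", "smoke"}},
    {"loc": (5.1, 5.0), "words": {"fire", "smoke", "alarm"}},
    {"loc": (5.0, 5.1), "words": {"fire", "alarm"}},
]
print(cluster_by_pivot(tweets))
```

Each shift moves to a tweet with strictly higher authority, so over a finite set of tweets the loop always terminates; ties are resolved by staying at the current tweet. The number of clusters is not specified in advance and simply equals the number of authority maxima found.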
Analogously, in our problem, the geo-topic au-thority of any tweet d is contributed by the observed tweets thatare similar to d both geographically and semantically. As a result,the salient tweets for different activities can be selected in the geo-topical space. Nowour task is to nd all pivots in D Q and assign each tweet to its corre-sponding pivot. We develop an authority ascent procedure for thispurpose. As shown in Figure 3, starting from a tweet d as the ini-tial center, we perform step-by-step center shifting. Assuming thecenter at step t is tweet dt , we nd dt neighborhood N ( dt ) , and thelocal pivot l ( dt ) ? the tweet having the largest authority in N ( dt ) .Then we regard l ( dt ) as our new center, i.e., dt + = l ( dt ) . Aswe continue such an authority ascent process, the center is guar-anteed to converge to an authority maximum. It is because everyshift operation increases the authority of the curr Up to now, we have obtained a set of geo-topical clusters in thequery window as candidate events. Nevertheless, as aforementioned,not necessarily does every candidate correspond to a local event.In this section, we describe the module for candidate event classi-fication. The foundation of our classification is the summarization ibo Zhang et al.ibo Zhang et al.
module, which learns word embeddings to capture the semantics of short tweet messages and meanwhile constructs the activity timeline to reveal routine regional activities. In what follows, we describe embedding learning and activity timeline construction and then present the classifier.

Algorithm 1: Pivot seeking.
Input: The tweet set D_Q, the kernel bandwidth h, the semantic threshold δ.
Output: The pivot for each tweet in D_Q.
  // Neighborhood computation.
  foreach d ∈ D_Q do
    N(d) ← { d′ | d′ ∈ D_Q, G(d′ → d) > 0, S(d′ → d) > δ };
  // Authority computation.
  foreach d ∈ D_Q do
    A(d) ← d's authority score computed from N(d);
  // Find local pivot for each tweet.
  foreach d ∈ D_Q do
    l(d) ← arg max_{d′ ∈ N(d)} A(d′);
  // Authority ascent.
  foreach d ∈ D_Q do
    Perform authority ascent to find the pivot for d;

Algorithm 2: Approximate RWR score computation.
Input: The keyword co-occurrence graph G, a keyword q, the restart probability α, an error bound ε.
Output: q's vicinity V_q.
  // p(u) is the score of node u that needs to be propagated.
  s(q) ← α, p(q) ← α, V_q ← ∅;
  Q ← a priority queue that keeps p(u) for the keywords in G;
  while Q.peek() ≥ αε do
    u ← Q.pop();
    for v ∈ I(u) do
      Δs(v) ← (1 − α) p_vu p(u);
      s(v) ← s(v) + Δs(v);
      V_q[v] ← s(v);
      Q.update(v, p(v) + Δs(v));
    p(u) ← 0;
  return V_q;

Figure 3: An illustration of the authority ascent process.
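The four steps of Algorithm 1 can be illustrated with a small Python sketch. It is only a sketch under stated assumptions, not the GeoBurst implementation: the Epanechnikov kernel is one possible choice for the geographical kernel G, tweet locations are treated as planar points, and the semantic scores S(d′ → d) are assumed to be precomputed and passed in as a matrix (in the paper they come from random walks on the keyword co-occurrence graph). The function names `epanechnikov` and `seek_pivots` are hypothetical. Each tweet is also added to its own candidate set, and ties are broken by index, so that the ascent is guaranteed to terminate.

```python
import math


def epanechnikov(dist, h):
    """Geographical kernel (assumed choice): positive inside bandwidth h, zero outside."""
    c = dist / h
    return 0.75 * (1.0 - c * c) if c < 1.0 else 0.0


def seek_pivots(locations, semantic, h, delta):
    """Return the pivot index for every tweet.

    locations: list of planar (x, y) points, one per tweet (assumption).
    semantic:  semantic[j][i] stands for S(d_j -> d_i), assumed precomputed.
    """
    n = len(locations)

    # Neighborhood computation: keep d' with G(d' -> d) > 0 and S(d' -> d) > delta.
    weights = [{} for _ in range(n)]
    for i in range(n):
        for j in range(n):
            g = epanechnikov(math.dist(locations[i], locations[j]), h)
            if g > 0.0 and semantic[j][i] > delta:
                weights[i][j] = g * semantic[j][i]

    # Authority computation: total energy received from the neighbors.
    authority = [sum(w.values()) for w in weights]

    # Local pivot: the highest-authority tweet among the neighbors and the
    # tweet itself; ties go to the smaller index so the ascent cannot cycle.
    local = [
        max(set(weights[i]) | {i}, key=lambda j: (authority[j], -j))
        for i in range(n)
    ]

    # Authority ascent: follow local pivots until a fixed point is reached.
    def ascend(i):
        while local[i] != i:
            i = local[i]
        return i

    return [ascend(i) for i in range(n)]


# Toy input: two spatially separated groups with coherent semantics inside each group.
locations = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
groups = [0, 0, 0, 1, 1]
semantic = [[1.0 if a == b else 0.0 for b in groups] for a in groups]
print(seek_pivots(locations, semantic, h=1.0, delta=0.5))
```

In this toy input every tweet ends up at one of two pivots, one per group. The quadratic neighborhood loop is kept for readability; a spatial index would be the natural replacement for larger windows.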
The embedding learner aims at capturing the semantics of short text by jointly mapping the tweet messages and keywords into the same low-dimensional space. If two tweets (keywords) are semantically similar, they are forced to have close embedding vectors in the latent space. The learner continuously consumes a massive amount of tweets from the input stream and learns to preserve their intrinsic semantics. As such, it can generate fixed-length vectors for any text pieces (e.g., the candidate event and the background activity), which serve as high-quality features to discriminate whether a candidate event is indeed a local event or not. Relying on the tweet caching strategy and the stochastic gradient descent (SGD) optimisation procedure, the embedding learner keeps updating the embeddings for different keywords and tweets as the geo-tagged tweet stream arrives. With the learnt keyword embeddings, the embedding of any ad-hoc text piece can be easily derived with SGD. As we will illustrate shortly, such a property enables us to quantify the spatiotemporal unusualness of each candidate event and extract highly discriminative features to pinpoint true local events.
The activity timeline aims at unveiling the normal activities in different regions during different time periods. For this purpose, we design a structure called tweet cluster (TC) and extend the CluStream algorithm [2]. The TC of a set of tweets S essentially provides a concise where-when-what summary for S: (1) where: with n, m_l, and m_l^2, one can easily compute the location mean and variance for S; (2) when: with n, m_t, and m_t^2, one can compute the average time and temporal variance for S; and (3) what: m_e keeps the number of occurrences for each keyword. These fields enable us to estimate the number of keyword occurrences at any location. First, the quantities n, m_l, and m_l^2 allow us to compute the center location of the TC S. Second, m_e tracks the number of occurrences for different keywords around the center location of S. With either spatial interpolation or kernel density estimation, one can estimate the occurrences of keyword k at any ad-hoc location based on its distance to the center location of S.

Moreover, the TC satisfies the additive property, i.e., the fields can be easily incremented when a new tweet is absorbed. Based on this property, we adapt CluStream to continuously cluster the stream into a set of TCs. When a new tweet d arrives, it finds the TC M that is geographically closest to d. If d is within M's boundary (computed from n, m_l, and m_l^2; see [2] for details), M absorbs d and updates its fields; otherwise, a new TC is created for d. Meanwhile, we employ two strategies to limit the maximum number of TCs: (1) deleting the TCs that are too old and contain few tweets; and (2) merging the closest TC pairs until the number of remaining TCs is small enough. We cluster the continuous stream and store the clustering snapshots at different timestamps. Since storing the snapshot of every timestamp is unrealistic, we use the pyramid time frame (PTF) structure [2] to achieve both good space efficiency and high coverage of the stream history.
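The additive property of the tweet cluster can be made concrete with a short Python sketch. It assumes that n is the tweet count, that m_l and m_l^2 are the per-coordinate sum and squared sum of tweet locations, that m_t and m_t^2 play the same role for timestamps, and that m_e is a keyword counter; these readings follow from how the fields are used in the text but are not spelled out there. The class and method names are hypothetical, and the boundary test and PTF snapshots are omitted.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TweetCluster:
    """Additive where-when-what summary of a set of tweets (illustrative only)."""
    n: int = 0                                              # number of tweets
    ml: list = field(default_factory=lambda: [0.0, 0.0])    # sum of (x, y)
    ml2: list = field(default_factory=lambda: [0.0, 0.0])   # sum of (x^2, y^2)
    mt: float = 0.0                                         # sum of timestamps
    mt2: float = 0.0                                        # sum of squared timestamps
    me: Counter = field(default_factory=Counter)            # keyword occurrences

    def absorb(self, x, y, t, keywords):
        """Absorb one tweet by incrementing every field."""
        self.n += 1
        for k, v in enumerate((x, y)):
            self.ml[k] += v
            self.ml2[k] += v * v
        self.mt += t
        self.mt2 += t * t
        self.me.update(keywords)

    def merge(self, other):
        """Merge another cluster; additivity makes this a field-wise sum."""
        self.n += other.n
        for k in range(2):
            self.ml[k] += other.ml[k]
            self.ml2[k] += other.ml2[k]
        self.mt += other.mt
        self.mt2 += other.mt2
        self.me.update(other.me)

    def center(self):
        """Location mean, recovered from n and ml."""
        return (self.ml[0] / self.n, self.ml[1] / self.n)

    def location_variance(self):
        """Per-coordinate variance, recovered from n, ml, and ml2."""
        return tuple(
            self.ml2[k] / self.n - (self.ml[k] / self.n) ** 2 for k in range(2)
        )

    def mean_time(self):
        """Average timestamp, recovered from n and mt."""
        return self.mt / self.n


tc = TweetCluster()
tc.absorb(0.0, 0.0, 10.0, ["concert"])
tc.absorb(2.0, 4.0, 20.0, ["concert", "music"])
print(tc.center(), tc.location_variance(), tc.mean_time(), tc.me)
```

Because every field is a plain sum, absorbing a tweet and merging two clusters both cost time proportional to the number of fields and keywords touched, which is what makes the CluStream-style online maintenance cheap.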
We use logistic regression to train a binary classifier and judge whether each candidate is indeed a local event. We choose logistic regression because of its robustness when there is only a limited amount of training data. While we have also tried other classifiers such as Random Forest and SVM, we find that the logistic regression classifier produces the best result in our experiments. The labelled instances for the classifier are collected through a large-scale experiment on a popular crowdsourcing platform. We will shortly detail the annotation process in Section 6.

We analyse the complexity of the candidate classification step as follows. As the prediction time of logistic regression is linear in the number of features, which is fixed, prediction has O(1) complexity, and the time cost is dominated by the feature extraction process. Let N_C be the maximum number of tweets in each candidate, M be the keyword vocabulary size, D be the latent embedding dimension, and N_Q be the number of tweets in the query window. We need to extract the features for all the candidates in the query window. The time costs for extracting different features for each candidate event are analysed as follows: (1) for the temporal unusualness measure, the time complexity is O(M + N_A + D), where N_A is the maximum number of TCs in one snapshot of the activity timeline; (2) for the spatial unusualness measure, the time complexity is O(M + N_Q + D); (3) for the temporal burstiness measure, the time complexity is O(M N_A); (4) for the spatial burstiness measure, the time complexity is O(M N_C); (5) for the static features, the total time complexity is O(N_C).

In this section, we present the online updater of GeoBurst+. Consider a query window Q, and let Q′ be the new query window after Q shifts.
Instead of finding the local events in Q′ from scratch, the online updater leverages the results in Q and updates the event list with little cost. If one runs the batch detection algorithm in the updated window Q′, the candidate generation step will dominate the total time cost in the two-step detection process, while the candidate classification step is very efficient. Hence, our focus in supporting efficient online detection is to develop algorithms that can quickly update the geo-topical clustering results when the query window shifts from Q to Q′. To guarantee correct clustering results in Q′, the key is to find the new pivots in the new window Q′ based on the previous results in Q. Let D_Q be the tweets falling in Q and D_Q′ be the tweets in Q′. We denote by R_Q the tweets removed from D_Q, i.e., R_Q = D_Q − D_Q′, and by I_Q the tweets inserted into D_Q, i.e., I_Q = D_Q′ − D_Q. In the sequel, we design a strategy that finds pivots in D_Q′ by just processing R_Q and I_Q.

Recall that the pivot seeking process first computes the local pivot for each tweet and then performs authority ascent via a path of local pivots. So long as the local pivot information is correctly maintained for each tweet, the authority ascent can be completed quickly. The major idea for avoiding finding pivots from scratch is that, as D_Q is changed to D_Q′, only some tweets have their local pivots changed. We call them mutated tweets, defined as follows.

Definition (Mutated Tweet). A tweet d ∈ D_Q′ is a mutated tweet if d's local pivot in D_Q′ is different from its local pivot in D_Q.

Now the question is: how do we quickly identify the mutated tweets by analysing the influence of R_Q and I_Q? Our observation is that a tweet can become a mutated tweet only if at least one of its neighbours has an authority change. Therefore, we take a reverse search strategy to find mutated tweets: (1) First, we identify in D_Q′ all the tweets whose authorities have changed.
(2) Second, for each authority-changed tweet t, we retrieve the tweets that regard t as a neighbor and update their local pivots.

Event Radar was implemented to test and simulate our approach as an experiment. The test machine is a macOS laptop with a 1.6 GHz processor and 8 GB of RAM. Event Radar was implemented in the MEAN stack, with MongoDB as the database and Express.js as the server.
Event Radar can visualise all inputs from MongoDB on Google Maps and enables users to view each event's tags, original tweets, timestamp, and rank score.
Figure 4: Query for events with a timestamp from 5 to 6. The system returns 23 events.
Figure 5: Query for events with a timestamp from 1 to 10 and the keyword "dinner". The system returns 199 events.
The system also provides a query mode in which users send queries to the server to select the desired events by specifying conditions on the terms of the event tweets, the geospatial distance between the user and the event location, and the timestamp query window.

In summary, Event Radar provides a web-based application for users to view the local events in a given area. Additionally, the system offers a query mode for users to search for events of interest. Such a system therefore has promising prospects for use by local security authorities or the press, because it can detect local events and update them as the stream evolves.
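To illustrate the three query conditions of the query mode (terms, geospatial distance, and timestamp window), the following Python sketch filters an in-memory list of events. It is a stand-in under stated assumptions rather than the system's actual MongoDB/Express.js code: the `Event` fields, the `query_events` function, the inclusive timestamp bounds, and the haversine great-circle distance are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stored event record."""
    keywords: set
    lat: float
    lon: float
    timestamp: int
    score: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a spherical Earth."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def query_events(events, t_start, t_end, keyword=None, user_loc=None, max_km=None):
    """Select events by time window, optional keyword, and optional distance."""
    hits = []
    for e in events:
        # Timestamp window, inclusive on both ends (assumption).
        if not (t_start <= e.timestamp <= t_end):
            continue
        # Term condition: the keyword must be one of the event's keywords.
        if keyword is not None and keyword not in e.keywords:
            continue
        # Distance condition: applied only when both a location and a radius are given.
        if user_loc is not None and max_km is not None:
            if haversine_km(user_loc[0], user_loc[1], e.lat, e.lon) > max_km:
                continue
        hits.append(e)
    # Highest rank score first.
    return sorted(hits, key=lambda e: e.score, reverse=True)


events = [
    Event({"dinner", "pizza"}, 40.11, -88.24, 3, 0.9),
    Event({"concert"}, 40.11, -88.23, 5, 0.7),
    Event({"dinner"}, 41.88, -87.63, 4, 0.8),
    Event({"dinner"}, 40.12, -88.24, 12, 0.95),
]
nearby = query_events(events, 1, 10, keyword="dinner", user_loc=(40.11, -88.24), max_km=50)
print([e.score for e in nearby])
```

In a MongoDB-backed deployment the same three conditions would more naturally be pushed into the database query, so that only matching events are sent to the front end.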
We studied the problem of real-time local event detection in geo-tagged tweet streams. Event detection aims at finding real-world occurrences that unfold over space and time. We mainly implement a website demonstration system with an improved algorithm to show real-time local events online for public interest. Our system Event-Radar is not limited to Twitter. Rather, any geo-textual social media stream (e.g., Instagram photo tags, Facebook posts) can be used to extract interesting local events as well. For future work, it would be interesting to extend Event-Radar to handle tweets that mention geo-entities but do not include exact GPS coordinates.

We built a demonstration system to visualise the local event detection results dynamically. The system consists of two servers connected to the MongoDB database. One server is in charge of loading Twitter data from the Twitter API, constructing the keyword co-occurrence graph, and running batch mode to generate local event candidates based on geographic impact and semantic impact. It ranks the candidates by making a vertical comparison across time frames and a horizontal comparison across all clusters, and finally writes the local event results into the database. Meanwhile, we optimised the original project by saving the keyword co-occurrence graph into the database, so that when the system restarts, it reloads the graph from the database, which saves considerable time. The other server handles front-end requests and extracts local event results from the database, sending them to the front end to be visualised.
ACKNOWLEDGMENTS
Special thanks to Chao Zhang, the TAs, and Prof. Han for giving valuable feedback and help during the whole semester.
REFERENCES
[1] Zhang, Chao, et al. “GeoBurst: Real-Time Local Event Detection in Geo-Tagged Tweet Streams.” Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2016.
[2] Chao Zhang, Dongming Lei, Quan Yuan, Honglei Zhuang, Lance Kaplan, Shaowen Wang, and Jiawei Han. 2017. GeoBurst+: Effective and Real-Time Local Event Detection in Geo-Tagged Tweet Streams. ACM Trans. Intell. Syst. Technol. 1, 1, Article 1 (March 2017), 23 pages.
[3] Abdelhaq, Hamed, Christian Sengstock, and Michael Gertz. “Eventweet: Online localized event detection from twitter.” Proceedings of the VLDB Endowment 6.12 (2013): 1326-1329.
[4] Zhang, Chao, et al. “Fast inbound top-k query for random walk with restart.” Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer International Publishing, 2015.
[5] Ellison, Nicole B. “Social network sites: Definition, history, and scholarship.” Journal of Computer-Mediated Communication 13.1 (2007): 210-230.
[6] Farzindar, A., and D. Inkpen. 2012. Proceedings of the Workshop on Semantic Analysis in Social Media. Association for Computational Linguistics, Avignon, France 50.
[7] Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? Categories and subject descriptors. Most 112(2), 591–600 (2010).
[8] Schonfeld, E.: Techcrunch: mining the thought stream (2009). Accessed 9 July 2013.
[9] Q. Yuan, G. Cong, Z. Ma, A. Sun, and N. M. Thalmann, “Who, where, when and what: discover spatio-temporal topics for twitter users,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 605–613.
[10] A. Farzindar, “Industrial perspectives on social networks,” in EACL 2012 - Workshop on Semantic Analysis in Social Media, vol. 2, 2012.
[11] A. Java, X. Song, T. Finin, and B. Tseng, “Why we twitter: understanding microblogging usage and communities,” in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis. ACM, 2007, pp. 56–65.
[12] X. Wang, M. S. Gerber, and D. E. Brown, “Automatic crime prediction using events extracted from twitter posts,” in Social Computing, Behavioral-Cultural Modeling and Prediction. Springer, 2012, pp. 231–238.
[13] Atefeh, Farzindar, and Wael Khreich. “A survey of techniques for event detection in twitter.” Computational Intelligence 31.1 (2015): 132-164.
[14] Metzler, D., S. Dumais, and C. Meek. 2007. Similarity measures for short segments of text. In Proceedings of the 29th European Conference on IR Research, ECIR’07. Springer-Verlag: Berlin, Heidelberg, pp. 16–27.
[15] Nallapati, R., A. Feng, F. Peng, and J. Allan. 2004. Event threading within news topics. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM’04, ACM, New York, NY, pp. 446–453.
[16] Fiscus, J.G., Doddington, G.R.: Topic detection and tracking evaluation overview. In: Topic Detection and Tracking, pp. 17–31 (2002).
[17] Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th International Conference on World Wide Web, pp. 851–860. ACM (2010).
[18] Raimond, Y., Abdallah, S.: The event ontology (2007).
[19] Linguistic Data Consortium: TDT 2004: Annotation Manual - version 1.2, 4 August 2004.
[20] Strassel, S.: Topic Detection & Tracking (TDT-5) (2004).
[21] Cordeiro, Mário, and João Gama. “Online social networks event detection: a survey.” Solving Large Scale Learning Tasks. Challenges and Algorithms. Springer International Publishing, 2016. 1-41.
[22] Yang, C.C., Shi, X., Wei, C.P.: Discovering event evolution graphs from news corpora. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 39, 850–863 (2009).
[23] Allan, J., Papka, R., Lavrenko, V.: On-line new event detection and tracking. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR 1998, New York, USA, pp. 37–45 (1998).
[24] Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Aggarwal, C.C., Zhai, C. (eds.) Mining Text Data, pp. 77–128. Springer, New York (2012).
[25] Farzindar, A.: Social network integration in document summarization. In: Fiori, A. (ed.) Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding. IGI-Global, Hershey (2014).
Boettcher, Alexander, and Dongman Lee. “Eventradar: A real-time local event detection scheme using twitter stream.” Green Computing and Communications (GreenCom), 2012 IEEE International Conference on. IEEE, 2012.
[26] Yang, Y., Pierce, T.T., Carbonell, J.G.: A study of retrospective and on-line event detection. In: SIGIR 1998: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998, pp. 28–36. ACM, New York (1998).
[27] Naaman, M., H. Becker, and L. Gravano. 2011. Hip and trendy: characterizing emerging trends on Twitter. Journal of the American Society of Information Science and Technology, 62(5): 902–918.
[28] Walther, Maximilian, and Michael Kaisser. “Geo-spatial event detection in the twitter stream.” European Conference on Information Retrieval. Springer Berlin Heidelberg, 2013.
[29] Girju, Roxana, and Dan I. Moldovan. “Text mining for causal relations.” FLAIRS Conference. 2002.
[30] Krumm, John, and Eric Horvitz. “Eyewitness: Identifying local events via space-time signals in twitter feeds.” Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2015.
[31] H. Abdelhaq, C. Sengstock, and M. Gertz. Eventweet: Online localized event detection from twitter. PVLDB, 6(12):1326–1329, 2013.
[32] Q. He, K. Chang, and E.-P. Lim. Analyzing feature trajectories for event detection. In SIGIR, pages 207–214, 2007.
[33] C. C. Aggarwal and K. Subbian. Event detection in social streams. In SDM, pages 624–635, 2012.
[34] Mathioudakis, M., and N. Koudas. 2010. TwitterMonitor: Trend detection over the Twitter stream. In SIGMOD Conference, Indianapolis, IN, pp. 1155–1158.
[35] Sankaranarayanan, J., H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. 2009. TwitterStand: News in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS’09, ACM, New York, NY, pp. 42–51.
[36] Phuvipadawat, S., and T. Murata. 2010. Breaking news detection and tracking in Twitter. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 3, Toronto, ON, pp. 120–123.
[37] Petrovic, S., M. Osborne, and V. Lavrenko. 2010. Streaming first story detection with application to Twitter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT’10, pp. 181–189.
[38] Becker, H., M. Naaman, and L. Gravano. 2011a. Beyond trending topics: Real-world event identification on Twitter. In ICWSM, Barcelona, Spain.
[39] Long, R., H. Wang, Y. Chen, O. Jin, and Y. Yu. 2011. Towards effective event detection, tracking and summarization on microblog data. In Web-Age Information Management, Vol. 6897 of Lecture Notes in Computer Science. Edited by Wang, H., S. Li, S. Oyama, X. Hu, and T. Qian. Springer: Berlin/Heidelberg, pp. 652–663.
[40] Nguyen, Duc T., and Jai E. Jung. “Real-time event detection for online behavioral analysis of big social data.” Future Generation Computer Systems 66 (2017): 137-145.