Graph-of-Tweets: A Graph Merging Approach to Sub-event Identification
Xiaonan Jing
Computer and Information Technology, Purdue University
West Lafayette, IN, [email protected]
Julia Taylor Rayz
Computer and Information Technology, Purdue University
West Lafayette, IN, [email protected]
Abstract—Graph structures are powerful tools for modeling the relationships between textual elements. Graph-of-Words (GoW) has been adopted in many Natural Language Processing tasks to encode the association between terms. However, GoW provides few document-level relationships in cases when the connections between documents are also essential. For identifying sub-events on social media like Twitter, features from both the word and document level can be useful, as they supply different information about the event. We propose a hybrid Graph-of-Tweets (GoT) model which combines the word- and document-level structures for modeling tweets. To compress large amounts of raw data, we propose a graph merging method which utilizes FastText word embeddings to reduce the GoW. Furthermore, we present a novel method to construct a GoT with the reduced GoW and a Mutual Information (MI) measure. Finally, we identify maximal cliques to extract popular sub-events. Our model showed promising results on condensing lexical-level information and capturing keywords of sub-events.
Index Terms—Twitter, event detection, word embedding, graph, mutual information.
I. INTRODUCTION
With Twitter and other types of social networks being the mainstream platforms of information sharing, an innumerable amount of textual data is generated every day. Social network driven communication has made it easier to learn user interests and discover popular topics. An event on Twitter can be viewed as a collection of sub-events as users update new posts through time. Trending sub-events can provide information on group interests, which can assist with learning group behaviours. Previously, a Twitter event has been described as a collection of hashtags [1], [2], a (set of) named entities [3], a Knowledge Graph (KG) triplet [4], or a tweet embedding [5]. While these representations can illustrate the same Twitter event from various aspects, it can be argued that a KG triplet, which utilizes a graph structure, exhibits richer features than the other representations. In other words, the graph structure allows more semantic relationships between entities to be preserved. Besides KG, other NLP tasks such as word disambiguation [6], [7], text classification [8], [9], summarization [10], [11], and event identification [12], [13] have also widely adopted graph structures. A graph G = (V, E) typically consists of a set of vertices V and a set of edges E which describes the relations between the vertices. The main benefit of a graph structure lies in its flexibility to model a variety of linguistic elements. Depending on the needs, "the graph itself can represent different entities, such as a sentence, a single document, multiple documents or even the entire document collection. Furthermore, the edges on the graphs can be directed or undirected, as well as associated with weights or not" [14]. Following this line of reasoning, we utilize a graph structure to combine both token- and tweet-level associations in modeling Twitter events.

Graph-of-Words (GoW) is a common method inspired by the traditional Bag-of-Words (BoW) representation. Typically, the vertices in a GoW represent the BoW from a corpus. In addition, the edges encode the co-occurrence association (i.e., the co-occurrence frequency) between the words in the BoW. Although the traditional GoW improves upon BoW by including word association information, it still fails to incorporate semantic information. One may argue that, as previously mentioned, using a KG can effectively incorporate both semantic information and corpus-level associations into the graph. However, any pre-existing KG, such as WordNet [15] and FreeBase [16], cannot guarantee an up-to-date lexicon/entity collection. Therefore, we propose a novel vocabulary-rich graph structure to cope with constantly changing real-time event modeling.

In this paper, we employ a graph structure to model tokens, tweets, and their relationships. To the best of our knowledge, this is the first work to represent document-level graphs with token-level graphs in tweet representation. Our main contributions are the developments of 1) a novel GoT; 2) an unsupervised graph merging algorithm to condense token-level information; 3) an adjusted mutual information (MI) measure for conceptualized tweet similarities.

II. RELATED WORK
Various studies have adopted graph structures to assist with unsupervised modeling of diverse entity relationships.
Event Clustering.
Jin and Bai [17] utilized a directed GoW for clustering long documents. Each document was converted to a graph with nodes, edges, and edge weights representing word features, co-occurrence, and co-occurrence frequencies respectively. The similarity between documents was subsequently converted to the similarity between their maximum common sub-graphs. With this graph similarity as a metric, K-means clustering was applied to the maximum common document graphs to generate the clusters. Jinarat et al. [18] proposed a pretrained Word2Vec embedding [19] based GoW edge removal approach for tweet clustering. They constructed an undirected graph with nodes being the unique words in all tweets and edges being the similarity between the words. Token and tweet clusters were created by removing edges below a certain similarity value. However, pretrained embeddings can perform poorly on rare words in tweets, where abbreviations and tags are a popular means for delivering information.

Event Stream Detection.
Meladianos et al. [20] presented a similar graph-of-words approach in identifying sub-events of a World Cup match on Twitter. They improved the edge weight metric by incorporating tweet length into the global co-occurrence frequency. The sub-events were generated by selecting tweets which contain the top k-degenerate subgraphs. Another effort by Fedoryszak et al. [13] considered an event stream as a cluster chain consisting of trending Twitter entities in time order. The clusters were treated as nodes and the similarities between them were labeled as edge weights. While a common issue in entity comparison is limited entity coverage [3], Fedoryszak et al. [13] were able to overcome this issue with the help of an internal Twitter KG. However, potential synonyms were not taken into account in the weight assignment.
Summarization.
Parveen and Strube [21] proposed an entity and sentence based bipartite graph for multi-document summarization, which utilized the Hyperlink-Induced Topic Search algorithm [22]. However, they only considered nouns in the graph and ignored other parts-of-speech from the documents. In another attempt, Nayeem and Chali [10] adopted the TextRank algorithm [23], which employs a graph of sentences to rank similar sentence clusters. To improve on the crisp token matching metric used by the algorithm, Nayeem and Chali [10] instead used pretrained Word2Vec embeddings [19] for sentence similarities.
Graph Construction.
Glavaš et al. [24] built a semantic relatedness graph for text segmentation where the nodes and edges denote sentences and the relatedness of the sentences respectively. They showed that extracting maximal cliques was effective in exploiting the structures of semantic relatedness graphs. In an attempt at automated KG completion, Szubert and Steedman [25] proposed a word embedding based graph merging method improving on AskNET [26]. Similar to our merging approach, tokens were merged incrementally into a pre-constructed KG based on word embedding similarities. The difference is that AskNET used a graph-level global ranking mechanism while our approach considers a neighbor-level local ranking for the tokens. Additionally, Szubert and Steedman limited their scope to only named entity related relations during the graph construction.
Evaluations of Event Identification.
It should be noted that there is no benchmark dataset for event identification evaluation, as many event identification approaches are unsupervised and the event dataset varies with the research interests. Among the previously mentioned studies, Jin and Bai [17] conducted their clustering on a dataset with category labels, which allowed their unsupervised approach to be evaluated with precision, recall, and F-scores. Meladianos et al. [20] were able to generate labels through the sports highlight website ESPN, as the sub-events contained in a soccer game exhibit a simpler structure than most real-life events. Fedoryszak et al. [13] created an evaluation subset from scratch with the help of the Twitter internal KG and manually examined their clustering results. Both of the event summarization approaches adopted the classic ROUGE metric [27] as well as the benchmark dataset DUC.
III. METHODOLOGY
We propose the following steps (1)-(8) for sub-event identification (Figure 1). The dataset used in this paper was collected from Twitter for a particular event (Section III-A). Step (1) includes tokenization and lemmatization with the Stanford CoreNLP and NLTK toolkits, as well as removing stopwords, punctuation, links, and @ mentions. All processed texts are converted to lowercase. Steps (2)-(3) contribute to GoW construction (Section III-B). Step (4) performs our proposed GoW reduction method (Section III-C). Step (5) elaborates on Graph-of-Tweets (GoT) construction using MI (Section III-D). Steps (6)-(8) finalize the sub-event extraction from the GoT (Section III-E).

Fig. 1: 8-Step Framework Used in This Paper: (1) Process Raw Tweets, (2) Train FastText Model, (3) Construct Token Graph, (4) Reduce Token Graph, (5) Construct Tweet Graph, (6) Extract Top MI Subgraph, (7) Identify Maximal Cliques, (8) Identify Sub-events.
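As a concrete illustration of step (1), the sketch below uses NLTK only; the paper combines Stanford CoreNLP and NLTK, so the preprocess helper and its exact filtering rules are our illustrative assumptions, not the authors' released code.

    # A minimal sketch of step (1), assuming NLTK resources are downloaded.
    import re
    import string
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    STOPWORDS = set(stopwords.words("english"))
    LEMMATIZER = WordNetLemmatizer()

    def preprocess(tweet: str) -> list:
        # Strip links and @ mentions before tokenizing and lowercasing.
        text = re.sub(r"https?://\S+|@\w+", " ", tweet)
        tokens = word_tokenize(text.lower())
        return [LEMMATIZER.lemmatize(t) for t in tokens
                if t not in STOPWORDS and t not in string.punctuation]

    # Reproduces the "us" -> "u" lemmatizer artifact discussed in Section IV-B.
    print(preprocess("US CDC warns of disruption to 'everyday life'"))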
A. Dataset
Following a recent trend on social media, we collected data on "COVID-19" from Twitter as our dataset. We define tweets containing the case-insensitive keywords {"covid", "corona"} as information related to the "COVID-19" event. We fetched 500,000 random tweets containing one of the keywords every day in the one-month period from Feb 25 to Mar 25 and kept only original tweets as our dataset. More specifically, retweets and quoted tweets were filtered out during the data cleaning process. The statistics of the processed dataset used for FastText model training can be found in Table I. Besides FastText training, which requires a large corpus to achieve accurate vector representations, we focused on a single day of data, the Feb 25 subset, for the rest of our experiment in this paper.

Footnote: Stanford CoreNLP ver. 3.9.2; NLTK ver. 3.4.5
TABLE I: Statistics of the dataset on COVID-19 from Feb 25 - Mar 25

    number of tweets                            1,289,004
    number of unique tokens                       337,365
    number of tweets (Feb 25)                      38,508
    number of unique tokens (Feb 25)               29,493
    average tokens per tweet                        12.85
    standard deviation of tokens per tweet           6.89
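The collection filter described above can be sketched as follows; it assumes tweets arrive as Twitter API v1.1 dicts (fields "text", "retweeted_status", "is_quote_status"), and raw_tweets plus the helper name are illustrative.

    # A minimal sketch of the keyword and originality filter.
    KEYWORDS = ("covid", "corona")

    def is_original_match(tweet: dict) -> bool:
        """Keep original (non-retweet, non-quote) tweets mentioning a keyword."""
        text = tweet.get("text", "").lower()
        original = ("retweeted_status" not in tweet
                    and not tweet.get("is_quote_status", False))
        return original and any(k in text for k in KEYWORDS)

    dataset = [t for t in raw_tweets if is_original_match(t)]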
B. Graph-of-Words Construction
We trained a word embedding model on the large 30-day dataset as our word-level similarity measure in GoW construction. Pretrained Word2Vec models [19] have been applied previously to graphs as edge weights [8], [11], [18]. Trained on the Google News corpus, Word2Vec is powerful in finding context-based word associations. Words appearing in similar contexts receive a higher cosine similarity score based on the hypothesis that "a word is characterized by the company it keeps" [28]. However, a pretrained Word2Vec model can only handle words within its vocabulary coverage. In other words, if any low-frequency words are ignored during training, they will not be captured in the model. In our case, the Twitter data contain many informal words, spelling mistakes, and new COVID-19 related words, which make pretrained models unsuitable for our task. On the other hand, the FastText model [29] uses character n-grams to enrich word representations when learning the word embeddings. An informal word such as "likeee" can be denoted as a combination of {"like", "ikee", "keee"} (4-grams), and its vector representation, after summing the subword vectors, will be closer to the intended word "like", given that both words are used in similar contexts. Thus, we employ FastText word embeddings as the word-level similarity measure in our GoW construction. A skip-gram model with the Gensim implementation was trained on the large 30-day dataset.

For the basic structure of the GoW, we adopt the approach from Meladianos et al. [20], where the vertices V represent the tokens and the edges E_co represent the co-occurrence of two tokens in a tweet. For k tweets in the corpus T: {t_1, t_2, ..., t_k}, the co-occurrence weight w^co_12 between the unique tokens v_1 and v_2 is computed as:

    w^co_12 = sum_{i=1}^{k} 1 / (n_i - 1)        (1)

where n_i (n_i > 1) denotes the number of unique tokens in tweet t_i that contains both tokens v_1 and v_2. An edge e_co is only drawn when the token pair co-occurs at least once. In addition to the base graph, we add another set of edges E_s denoting the cosine similarity w_s between the token embeddings obtained from FastText. Figure 2 illustrates an example GoW for two processed tweets.

Footnote: Gensim ver. 3.8.1; hyperparameters (that differ from default): size: 300d, alpha: 0.05, min_alpha: 0.0005, epoch: 30.

Fig. 2: An example GoW for the processed tweets {"virus", "test", "positive"} and {"corona", "test"}, with nodes (a) "positive", (b) "test", (c) "virus", (d) "corona" and edges w^co_bc, w^co_bd, w^s_bc, w^s_bd. Solid and dotted edges denote E_co and E_s respectively.
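A minimal sketch of steps (2)-(3), assuming tweets holds the processed token lists from step (1); it is written for Gensim 4.x, where the paper's reported size/iter hyperparameters are named vector_size/epochs, and the graph construction is our own rendering of Equation (1).

    from collections import defaultdict
    from itertools import combinations

    import networkx as nx
    from gensim.models import FastText

    tweets = [["virus", "test", "positive"], ["corona", "test"]]

    # Skip-gram FastText model with the hyperparameters reported above.
    model = FastText(sentences=tweets, sg=1, vector_size=300, alpha=0.05,
                     min_alpha=0.0005, epochs=30, min_count=1)

    # Equation (1): each tweet with n_i unique tokens adds 1/(n_i - 1) to
    # the co-occurrence weight w_co of every token pair it contains.
    w_co = defaultdict(float)
    for tweet in tweets:
        unique = sorted(set(tweet))
        if len(unique) < 2:
            continue
        for v1, v2 in combinations(unique, 2):
            w_co[(v1, v2)] += 1.0 / (len(unique) - 1)

    # GoW: co-occurrence edges E_co plus FastText cosine similarity edges E_s.
    graph = nx.Graph()
    for (v1, v2), w in w_co.items():
        graph.add_edge(v1, v2, w_co=w, w_s=float(model.wv.similarity(v1, v2)))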
C. Graph-of-Words Reduction

A raw GoW constructed directly from the dataset often contains a large number of nodes. Many of these nodes carry repeating information, which increases the difficulty of the subsequent tasks. To condense the number of nodes, we propose a two-phase node merging method to reduce the proposed GoW:
• Phase I: Linear reduction by node merging based on word occurrence frequency.
• Phase II: Semantic reduction by node merging based on token similarity.
Phase I.
The goal of this phase is to reduce the number of nodes in the raw graph in a fast and efficient manner. For tokens that occur in fewer than 5 tweets, we merge each of them into its most similar token node in the graph. This phase is performed on nodes in order from the least frequently appearing to the most frequently appearing.
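A minimal sketch of Phase I, reusing graph and model from the previous sketch; doc_freq (the number of tweets each token appears in) and the topn=50 candidate cutoff are illustrative assumptions rather than the paper's exact settings.

    def merge_nodes(graph, src, dst):
        """Merge src into dst: reconnect src's edges to dst and sum w_co.
        Similarity weights w_s stay as determined by the leading token."""
        for nbr, data in list(graph[src].items()):
            if nbr == dst:
                continue
            if graph.has_edge(dst, nbr):
                graph[dst][nbr]["w_co"] += data["w_co"]
            else:
                graph.add_edge(dst, nbr, **data)
        graph.remove_node(src)

    # Tokens appearing in fewer than 5 tweets, least frequent first.
    rare = sorted((t for t in graph.nodes if doc_freq[t] < 5),
                  key=lambda t: doc_freq[t])
    for token in rare:
        if token not in graph:
            continue  # already absorbed by an earlier merge
        # Merge into the most FastText-similar token still in the graph.
        for cand, _sim in model.wv.most_similar(token, topn=50):
            if cand in graph and cand != token:
                merge_nodes(graph, src=token, dst=cand)
                break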
Phase II.
The goal of this phase is to combine frequent tokens in the graph. Algorithm 1 describes this process.
Algorithm 1 Semantic Node Collapse

    for node v_i ∈ V: {v_1, v_2, ..., v_m} do
        degree d_i = sum_j w^co_ij
    end for
    V = sort_asc(V, key = d_i)
    for v_i ∈ V do
        neighbors(v_i) = B_i: {v_i1, ..., v_iz}
        neighbor_weights(v_i) = W_i: {w^co_i1, ..., w^co_iz}
        B_i = sort_asc(B_i, key = w^co_ij)
        for v_ij ∈ B_i do
            sim_neighbors = list()
            sim_tokens = fasttext.most_similar(v_ij, top_n)
            for any other neighbor v_ik ∈ sim_tokens do
                sim_neighbors.insert((v_ik, sim_val))
            end for
            if sim_neighbors is not empty then
                parent = max(sim_neighbors, key = sim_val)
                node_merge(src = v_ij, dst = parent)
            end if
        end for
    end for

For a node in the GoW, we merge its lower-weighted neighbor into another neighbor if the top N similar tokens of the lower-weighted neighbor contain another neighbor. It should be addressed that the ordering of the nodes and the direction of merging matter in this process. For token nodes in the GoW, we perform this phase on nodes in order from lowest degree to highest degree; and for neighbors of the same node, we perform the phase on neighbors in order from lowest weight to highest weight. When the top N similar tokens contain more than one other neighbor, we select the node with the highest similarity value as the parent node.

The node merging process consists of removing original nodes, reconnecting edges between neighbors, and migrating co-occurrence weights w^co and merged tokens. Essentially, if a target node is merged into a parent node, the target node will be removed from the graph and the neighbors of the target node will be reconnected to the parent. It should be noted that when migrating neighbors and weights, we only consider the co-occurrence edges E_co and only increment the weights w^co in the edges of the parent node, while w_s remains the same as determined by the leading token of the merged nodes. For a merged node with multiple tokens, we define the leading token as a single token that is carried by the original GoW.

More precisely, suppose the target node "corona" is to be merged into the parent node "virus" in Figure 2. Since node "corona" only neighbors node "test", we add the weights together so that the new weight between nodes "test" and "virus" is w^co'_bc = w^co_bd + w^co_bc, and remove node "corona" from the graph. Furthermore, for the new node "virus" containing both tokens "virus" and "corona", the leading token is "virus", and the weight w^s_bc remains the same as the cosine similarity between the word vectors of "virus" and "test".
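Read as Python, Algorithm 1 might look like the sketch below; it reuses merge_nodes from the Phase I sketch, and top_n=10 is an assumed value rather than the paper's reported setting.

    def semantic_node_collapse(graph, model, top_n=10):
        # Co-occurrence degree of every node, computed once up front.
        degree = {v: sum(d["w_co"] for d in graph[v].values())
                  for v in graph.nodes}
        # Visit nodes from lowest to highest degree.
        for v in sorted(degree, key=degree.get):
            if v not in graph:
                continue  # node was merged away earlier
            # Neighbors of v, lowest co-occurrence weight first.
            for vij in sorted(graph[v], key=lambda n: graph[v][n]["w_co"]):
                similar = dict(model.wv.most_similar(vij, topn=top_n))
                # Other neighbors of v among vij's most similar tokens.
                cands = [(vik, similar[vik]) for vik in graph[v]
                         if vik in similar and vik != vij]
                if cands:
                    parent = max(cands, key=lambda c: c[1])[0]
                    merge_nodes(graph, src=vij, dst=parent)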
D. Graph-of-Tweets Construction with Normalized Mutual Information as Graph Edges

Similar to the GoW, we introduce a novel GoT which maps tweets to nodes. In addition, to compare the shared information between the tweet nodes, we construct the edges with an adjusted MI metric. Each tweet is represented as a set of unique merged nodes obtained from the previous two-phase GoW reduction, and tweets with identical token node representations are treated as one node. For instance, after merging token node "corona" into "virus" in Figure 2, a processed tweet {"corona", "virus", "positive", "test"} can be represented as a tweet node t_1: {"virus", "positive", "test"}, which contains the set of unique merged token nodes "virus", "positive", and "test".

Footnote: In this paper, we use the italic "leading token" to represent a token node, and normal "text" to represent plain text.

Originally from Information Theory, MI is used to measure the information overlap between two random variables X and Y. Pointwise MI (PMI) [30] in Equation 2 was introduced to computational linguistics to measure the associations between bigrams / n-grams. PMI uses unigram frequencies and co-occurrence frequencies to compute the marginal and joint probabilities respectively (Equation 3), in which W denotes the total number of words in the corpus where words x and y co-occur:

    i(x, y) = log [ p(x, y) / (p(x) p(y)) ]        (2)

    p(x) = f(x)/W,  p(y) = f(y)/W,  p(x, y) = f(x, y)/W        (3)

The drawbacks of PMI are: 1) low-frequency word pairs tend to receive relatively higher scores; 2) PMI suffers from the absence of boundaries in comparison tasks [31]. In an extreme case when x and y are perfectly correlated, the marginal probabilities p(x) and p(y) take the same value as the joint probability p(x, y). In other words, when x and y only occur together, p(x, y) = p(x) = p(y), which causes i(x, y) in Equation 2 to equal -log p(x, y). Therefore, with W remaining the same, a lower f(x, y) results in a higher PMI value. Additionally, it can be noted that PMI in Equation 2 suffers from the absence of boundaries when applied to comparisons between word pairs [31]. To mitigate the scoring bias and the comparison difficulty, a common solution is to normalize the PMI values with -log p(x, y) or a combination of -log p(x) and -log p(y) to the range [-1, 1] to smooth the overall distribution.

Apart from measuring word associations, MI is also widely applied in clustering evaluations. In the case when the ground truth clusters are known, MI can be used to score the "similarity" between the predicted clusters and the labels. A contingency table (Figure 3) is used to illustrate the number of overlapping elements between the predicted clusterings A and the ground truth clusterings B.

Fig. 3: Contingency Table Between Clusterings A and B [32]

One disadvantage of using MI for clustering evaluation is the existence of selection bias, which Romano et al. [33] described as "the tendency to choose inappropriate clustering solutions with more clusters or induced with fewer data points." However, normalization can effectively reduce the bias as well as add an upper bound for easier comparison. Romano et al. [33] proposed a variance and expectation based normalization method. Other popular normalization methods include using the joint entropy of A and B or some combination of the entropies of A and B as the denominator [32].

In our GoT case, since tweet nodes are represented by different sets of token nodes, we can treat the tweet nodes as clusters which allow repeating elements.
Thus, the total number of elements is the number of token nodes obtained from the reduced GoW. Correspondingly, the intersection between two tweet nodes can be represented by the overlapping elements (token nodes). Following this line of logic, we define the normalized MI between two tweets in Algorithm 2.

Algorithm 2
Normalized MI (NMI) Between Tweet Nodes

Let T: {t_1, t_2, ..., t_k} be the set of tweets in the GoT. Let V: {v_1, v_2, ..., v_m} be the m token nodes in the reduced GoW. For tweets t_i: {v_i1, ..., v_ix} and t_j: {v_j1, ..., v_jy}, the MI is defined as:

    MI(t_i, t_j) = log [ p(t_i, t_j) / (p(t_i) p(t_j)) ],

where p(t_i) and p(t_j) are the probabilities of the token nodes contained in an individual tweet with respect to the total number of token nodes m, with p(t_i) = x/m and p(t_j) = y/m. The joint probability p(t_i, t_j) represents the intersection of token nodes between the two tweets, with p(t_i, t_j) = count(t_i ∩ t_j)/m.

To normalize MI to the range [-1, 1], we use the normalization denominator

    norm = max[-log p(t_i), -log p(t_j)],

thus NMI = MI / norm. Note that when p(t_i, t_j) = 0 (indicating no intersection), NMI takes the boundary value -1.

It should be noted that as the fetching keywords {"corona", "covid"} appear in every tweet in the dataset, tokens containing these words are removed when the GoT is constructed. Consequently, two tweet nodes with only "corona" or "covid" in common will result in an NMI value of -1, while the outcomes of sub-event identification are not affected by removing the main event "COVID-19".
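A minimal sketch of Algorithm 2, assuming tweet nodes are Python sets of token nodes and m is the number of token nodes in the reduced GoW; the worked example mirrors the one discussed in Section IV-B.

    import math

    def nmi(ti: set, tj: set, m: int) -> float:
        """Normalized mutual information between two tweet nodes."""
        shared = len(ti & tj)
        if shared == 0:
            return -1.0  # boundary value when there is no intersection
        p_i, p_j = len(ti) / m, len(tj) / m
        p_ij = shared / m
        mi = math.log(p_ij / (p_i * p_j))
        norm = max(-math.log(p_i), -math.log(p_j))
        return mi / norm

    # With m = 100, two 5-token tweets sharing 3 tokens score higher than
    # a 5-token and a 10-token tweet sharing the same 3 tokens.
    t1, t2, t3 = set(range(5)), set(range(2, 7)), set(range(2, 12))
    print(round(nmi(t1, t2, 100), 2))  # ~0.83
    print(round(nmi(t1, t3, 100), 2))  # ~0.60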
E. Sub-event Extraction from Graph-of-Tweets

We hypothesize that popular sub-events are included within a subgraph of the GoT which is fully connected and highly similar in content. Following this assumption, we extract a GoT subgraph with only the tweet nodes included in the top n NMI values. Subsequently, we identify all maximal cliques of size greater than 2 from the subgraph for further analysis. Since cliques obtained this way consist only of nodes with large NMI values, indicating that the tweet nodes are highly similar, a clique can be represented by the token nodes included in its tweet nodes. Thus, we treat a clique as the set of all token nodes contained in its tweet nodes, and each clique represents a popular sub-event.
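A minimal sketch of steps (6)-(8), assuming tweet nodes are hashable ids, tweet_tokens maps each id to its set of leading tokens, and nmi_edges maps id pairs to NMI values; all of these names are illustrative.

    import networkx as nx

    # Step (6): keep only the tweet nodes incident to the top 1000 NMI edges.
    top = sorted(nmi_edges.items(), key=lambda kv: kv[1], reverse=True)[:1000]
    subgraph = nx.Graph()
    for (ti, tj), w in top:
        subgraph.add_edge(ti, tj, nmi=w)

    # Steps (7)-(8): maximal cliques with more than 2 tweet nodes; each
    # clique's union of token nodes is read as one popular sub-event.
    cliques = [c for c in nx.find_cliques(subgraph) if len(c) > 2]
    sub_events = [set().union(*(tweet_tokens[t] for t in c)) for c in cliques]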
IV. RESULTS AND DISCUSSION
To summarize, the raw GoW consisted of 29,493 unique token nodes for tweets from the Feb 25 data division. The two-phase GoW reduction reduced the graph by 83.8%, with 24,711 nodes merged and 4,782 token nodes left in the GoW. On the other hand, the raw GoT consisted of 31,383 unique tweet nodes. The extracted subgraph of the top 1000 MI values consisted of 1,259 tweet nodes. Finally, 83 maximal cliques were identified from the subgraph.
A. Graph-of-Words Reduction
Phase I reduction merged 19,663 token nodes that appeared in fewer than 5 tweets, which is roughly 66.7% of all token nodes contained in the raw GoW. This phase primarily contributes to reducing uncommon words within the corpus. It can be argued that rare words may provide non-trivial information. However, because our goal is to identify popular sub-events, the information loss brought by rare words does not heavily affect the results. Furthermore, the word similarity provided by FastText can effectively map rare terms to more common words with a similar meaning. For instance, in our FastText model, the most similar word to the rare word "floridah" is the intended word "florida". Another fun fact is that an emoji was merged into the node "coffee".

There were 5,048 token nodes merged during Phase II reduction. This phase mainly contributes to information compression within common tokens. By merging the neighbors that carry similar information, the same target node can appear in a less complex network without information loss. Table II presents statistics of the resulting GoW. Table III shows some example merged nodes from the reduced GoW. The largest node (248 tokens) is not shown here as it contains many expletives. The second largest node "lmaoooo" (221 tokens) contains mostly informal terms like "omgggg" and emojis.

TABLE II: Reduced GoW Statistics

    single / merged token node count            2,119 / 2,663
    max / min node size                         248 / 2
    avg / std node size                         10.28 / 16.56
    max / min within-merged-node similarity     0.9672 / 0.1337
    avg / std within-merged-node similarity     0.4074 / 0.1544

TABLE III: Some Example Merged Nodes After Reduction

    leading token    merged tokens
    airline          frontier, iata, swoop, airliner, pilot, delta, piloting
    cdc              wcdc, cdcS, cdcwho, cdcas, lrts, rts, lrt, ...
We identify roughly three patterns of merges during Phase II reduction, namely 1) merges of words with the same stem or synonyms, 2) merges of common bi-grams or fixed expressions, and 3) merges of topically related but semantically different words. Table IV illustrates some example merges from source node to destination node, with the green, blue, and red columns corresponding to type-1, type-2, and type-3 respectively. Among the type-1 merges (green), it can be seen that common abbreviations such as "btc" for "bitcoin" are also captured in the merging process. In the type-2 merges (blue), examples such as "test positive" and "health department" are frequent bi-grams in the context of our data, and other examples such as "silicon valley" and "long term" are fixed expressions.

Footnote: The full reduced Graph-of-Words (4,728 token nodes) can be found at: https://github.com/lostkuma/GraphOfTweets
TABLE IV: Some example merges in Phase II reduction

    type-1 (green)         type-2 (blue)          type-3 (red)
    src node -> dst node   src node -> dst node   src node -> dst node
    covid19 -> covid       positive -> test       buy -> sell
    covid -> coronavirus   department -> health   always -> never

One general drawback of word embedding models is that, instead of semantic similarity, words with frequent co-occurrence will be considered very similar, as noted in the distributional hypothesis [28]. An improvement can be made by combining named entities, fixed expressions, and frequent bi-grams in the data processing stage so that a node can also represent a concept in the GoW. Finally, the type-3 merges (red) also suffer from this drawback of word embedding models. Word pairs like "buy" and "sell", "always" and "never", "masked" and "unmasked" are antonyms/reversives in meaning. However, these word pairs tend to be used in the same context, so they are considered highly similar during the merging process. Word pairs like "latter" and "sp[latter]", or "undo" and "m[undo]" (which means "world" in Spanish), are subwords of each other. Recall that the FastText model uses character n-grams in training. The subword mechanism leads rare words to be considered similar to the common words which share a part with them. It should be noted that in Phase II, the merging is performed from the lower-weighted neighbors to the higher-weighted neighbors and from lower-degree nodes to higher-degree nodes. It is plausible that a common word like "undo" is a lower-weighted neighbor compared to the uncommon word "mundo" if the target token is also a rare word in Spanish and "undo" does not co-occur frequently with the target token.
B. Graph-of-Tweets and Sub-events Extraction
Using the reduced GoW, we constructed a GoT where each tweet node was represented as a set of leading tokens from the token nodes. Of the original 38,508 tweets, 31,383 tweets were represented as unique tweet nodes. Take the tweet node tweet_18736, which was repeated the most times with a frequency of 221, as an example. The original text of tweet_18736 says "US CDC warns of disruption to 'everyday life' with spread of coronavirus https://t.co/zO2IovjJlq", which is an event created by Twitter on Feb 25. After preprocessing, we obtained {"u", "cdc", "warns", "disruption", "everyday", "life", "spread", "coronavirus"}. Finally, the GoW leading-token represented tweet node became {"probably", "cdc", "expects", "destroy", "risking", "spreading", "coronavirus"}, with both "everyday" and "life" merged into "risking". One may notice that the word "US" is converted to "u" after preprocessing, which consecutively affected the FastText training, the GoW reduction, and the GoT representation. This is due to a limitation of the WordNet lemmatizer, which maps the lowercase "us" to "u". We would consider keeping the texts uncased and employing named entity recognition to identify cases like this and preserve the correct word in future work.

Subsequently, the tweet nodes constituting the top 1000 NMI weights were extracted to construct a new GoT subgraph for further analysis. The top 1000 NMI subgraph contained 1,259 tweet nodes, which consisted of 1,024 token nodes. The minimum NMI value within the subgraph was 0.9769. In Figure 4, edges are only drawn between node pairs with intersecting token nodes. The edges with an NMI value of -1 are not shown in the figure. It should be noted that if all the tweet node pairs extracted from the top 1000 MI edges were unique, there would be 2,000 nodes in the subgraph.

Fig. 4: GoT subgraph for the top 1000 MI edges with 1,259 tweet nodes and their edges (top 1000 MI edges and 397,849 other edges). The red edges represent the top 1000 MI edges and the gray edges represent other MI edges. Nodes with higher saturation contain more leading tokens. Similarly, edges with higher saturation indicate larger MI weights.

After examining the tweet nodes with the top 10 total degrees (the sum of all MI weights between the node and its neighbors), we found that some of the nodes are subsets of each other. For instance, tweet_23146 is represented as {"news", "epidemiologist", "nation", "probably", "confirmed", "know", "spreading", "humping", ...}, of which two other tweets, tweet_20023 and tweet_11475, are subsets with 2 and 1 token node(s) difference. Further statistics on the associations between total node degrees, average node degrees, and node size (number of leading tokens) are shown in Figure 5. It can be seen in Figure 5a that total node degree is positively correlated with the number of neighbors. We also examined the correlation between total node degree and average neighbor degree (the average NMI weight of all neighbors), but found no correlation. On the other hand, in Figure 5b, the average neighbor degree is negatively correlated with the number of leading tokens in a tweet node.

Fig. 5: Node degree distribution for the top 1000 MI subgraph. (a) Total degrees (top) and number of neighbors (bottom) of tweet nodes, ordered by total degree descending. (b) Average neighbor degrees (top) and number of leading tokens (bottom) of tweet nodes, ordered by average degree descending.

This indicates that NMI relatively favors tweets with fewer elements. Imagine a random case where only independent tokens are present in a tweet node.
Larger nodes (with more tokens) have less chance to share information with other nodes. More precisely, for two pairs of tweets that share the same intersection size, the sizes of the tweets determine the NMI values. For instance, suppose t_1 (5 elements) and t_2 (5 elements) share 3 elements, while t_1 and t_3 (10 elements) also share 3 elements. Assuming there are a total of m = 100 elements, the NMI values as defined in Algorithm 2 are NMI(t_1, t_2) ≈ 0.83 and NMI(t_1, t_3) ≈ 0.60. It should be noted that different normalization methods can cause the NMI values to vary. The normalization metric we chose largely normalizes the upper bound of the MI values. When a different normalization metric is applied, for example using the joint probability, both the upper and lower bounds of the MI values can be normalized.

Finally, we identified 83 maximal cliques of size greater than 2 from the top 1000 NMI subgraph. While the largest clique contained 14 tweet nodes, there were 21 cliques consisting of 4 to 6 tweet nodes, and the rest contained 3 tweet nodes. We observed that the tweet nodes contained in the same clique shared highly similar content. Table V illustrates the shared token nodes in all 14 tweet nodes from the largest clique. We can derive the information contained in this clique as "the stock and/or bitcoin market has a crash/slide".

TABLE V: Token Nodes Shared Between Tweet Nodes in the Extracted Cliques

    Shared token nodes in all 14 tweet nodes of a strong representative clique:
        bitcoinprice, stockmarket, best, crashing, gdx, history, probably, slide
    Shared token nodes in all 4 tweet nodes of a somewhat representative clique:
        probably, hhs, open, immediate, likelihood, know, american

Further investigation indicated that the nodes in this clique are associated with the same tweet: "The Dow just logged its worst 2-day point slide in history — here are 5 reasons the stock market is tanking, and only one of them is the coronavirus", which elaborates on the stock market dropping. We marked this type of clique as "strong representative" of the sub-event. However, not all cliques represented the original tweet contents well. Another clique in Table V, which contained the tweets regarding "US HHS says there's 'low immediate risk' of coronavirus for Americans", only suggested the "likelihood of" the main "COVID-19" event. We marked this type of clique as "somewhat representative" of the sub-event. Other types of cliques are marked as "not representative". Table VI shows the content validity evaluation on the 83 maximal cliques. It should be noted that the labels are annotated in acknowledgement of the meaningful token nodes, which on average make up 64.3% of each clique. We also analyzed the event categories among the generated cliques and manually annotated the cliques with type labels. Figure 6 shows the content validity distribution by event type. We can see that our proposed model performed very well on generating meaningful clique content for the categories "health department announcement" and "stock, economy", and not very well for the "politics related" category. The difference in performance may be due to the variation in the amount of information contained in the described events.

TABLE VI: Content Validation of 83 Maximal Cliques

    strong representative    somewhat representative    not representative
    42 (50.60%)              33 (39.76%)                8 (9.64%)

Fig. 6: Clique validity distribution by manually labeled event type categories. Deep blue, light blue, and grey correspond to "strong representative", "somewhat representative", and "not representative" categories respectively.

It should be addressed that the dataset used in this paper was collected directly from Twitter and that the approach proposed in this paper is fully automated and unsupervised. As we mentioned previously, there is no benchmark dataset for event identification, and the absence of a gold standard makes it difficult to conduct quantitative evaluations of the results. Our best effort, as presented in Table VI and Figure 6, was to manually analyze the content of the generated cliques for validation.
V. CONCLUSION AND FUTURE WORK
In this paper, we proposed a hybrid graph model which uses conceptualized GoW nodes to represent tweet nodes for sub-event identification. We developed an incremental graph merging approach to condense the raw GoW leveraging word embeddings. In addition, we outlined how the reduced GoW is connected to the GoT and developed an adjusted NMI metric to measure node similarity in the GoT. Finally, we utilized a fundamental graph structure, cliques, to assist with identifying sub-events. Our approach showed promising results on identifying popular sub-events in a fully unsupervised manner on real-world data. There remain adjustments that can be made to the GoT to improve the robustness of the model so that more detailed and less noisy events can be captured. In the future, we plan to employ named entity recognition in raw data processing to identify key concepts. Additionally, frequent bi-grams/n-grams will be examined and combined prior to FastText training to improve the similarity metrics from word embeddings. We will also compare different MI normalization methods to neutralize the bias from sequence length. Finally, we plan to improve the conceptualization method of the token nodes so that, instead of the leading token, a node can be represented by the concept it is associated with.
REFERENCES

[1] W. Feng, C. Zhang, W. Zhang, J. Han, J. Wang, C. Aggarwal, and J. Huang, "Streamcube: hierarchical spatio-temporal hashtag clustering for event exploration over the twitter stream." IEEE, 2015, pp. 1561-1572.
[2] S.-F. Yang and J. T. Rayz, "An event detection approach based on twitter hashtags," arXiv preprint arXiv:1804.11243, 2018.
[3] A. J. McMinn and J. M. Jose, "Real-time entity-based event detection for twitter," in International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham, 2015, pp. 65-77.
[4] Y. Qin, Y. Zhang, M. Zhang, and D. Zheng, "Frame-based representation for event detection on twitter," IEICE Transactions on Information and Systems, vol. 101, no. 4, pp. 1180-1188, 2018.
[5] B. Dhingra, Z. Zhou, D. Fitzpatrick, M. Muehl, and W. W. Cohen, "Tweet2vec: Character-based distributed representations for social media," arXiv preprint arXiv:1605.03481, 2016.
[6] L. N. Pina and R. Johansson, "Embedding senses for efficient graph-based word sense disambiguation," in Proceedings of TextGraphs-10: the Workshop on Graph-based Methods for Natural Language Processing, 2016, pp. 1-5.
[7] M. Bevilacqua and R. Navigli, "Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2854-2864.
[8] K. Skianis, F. Malliaros, and M. Vazirgiannis, "Fusing document, collection and label graph-based representations with word embeddings for text classification," in Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12). ACL, 2018, pp. 49-58.
[9] L. Yao, C. Mao, and Y. Luo, "Graph convolutional networks for text classification," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 7370-7377.
[10] M. T. Nayeem and Y. Chali, "Extract with order for coherent multi-document summarization," in Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing. Vancouver, Canada: ACL, 2017, pp. 51-56.
[11] M. Yasunaga, R. Zhang, K. Meelu, A. Pareek, K. Srinivasan, and D. Radev, "Graph-based neural multi-document summarization," arXiv preprint arXiv:1706.06681, 2017.
[12] A. Tonon, P. Cudré-Mauroux, A. Blarer, V. Lenders, and B. Motik, "Armatweet: detecting events by semantic tweet analysis," in European Semantic Web Conference. Springer, 2017, pp. 138-153.
[13] M. Fedoryszak, B. Frederick, V. Rajaram, and C. Zhong, "Real-time event detection on social data streams," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2774-2782.
[14] M. Vazirgiannis, F. D. Malliaros, and G. Nikolentzos, "Graphrep: boosting text mining, nlp and information retrieval with graphs," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 2295-2296.
[15] G. A. Miller, "Wordnet: a lexical database for english," Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.
[16] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247-1250.
[17] C.-X. Jin and Q.-C. Bai, "Text clustering algorithm based on the graph structures of semantic word co-occurrence." IEEE, 2016, pp. 497-502.
[18] S. Jinarat, B. Manaskasemsak, and A. Rungsawang, "Short text clustering based on word semantic graph with word embedding model." IEEE, 2018, pp. 1427-1432.
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013.
[20] P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, and M. Vazirgiannis, "Degeneracy-based real-time sub-event detection in twitter stream," in Ninth International AAAI Conference on Web and Social Media, 2015.
[21] D. Parveen and M. Strube, "Multi-document summarization using bipartite graphs," in Proceedings of TextGraphs-9: the Workshop on Graph-based Methods for Natural Language Processing. Doha, Qatar: ACL, 2014, pp. 15-24.
[22] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM (JACM), vol. 46, no. 5, pp. 604-632, 1999.
[23] R. Mihalcea and P. Tarau, "Textrank: Bringing order into text," in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404-411.
[24] G. Glavaš, F. Nanni, and S. P. Ponzetto, "Unsupervised text segmentation using semantic relatedness graphs." Association for Computational Linguistics, 2016.
[25] I. Szubert and M. Steedman, "Node embeddings for graph merging: Case of knowledge graph construction," in Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), 2019, pp. 172-176.
[26] B. Harrington and S. Clark, "Asknet: Creating and evaluating large scale integrated semantic networks," International Journal of Semantic Computing, vol. 2, no. 03, pp. 343-364, 2008.
[27] C.-Y. Lin, "Rouge: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74-81.
[28] J. R. Firth, "A synopsis of linguistic theory, 1930-1955," Studies in Linguistic Analysis, 1957.
[29] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[30] K. Church and P. Hanks, "Word association norms, mutual information, and lexicography," Computational Linguistics, vol. 16, no. 1, pp. 22-29, 1990.
[31] G. Bouma, "Normalized (pointwise) mutual information in collocation extraction," in Proceedings of GSCL, 2009, pp. 31-40.
[32] A. Amelio and C. Pizzuti, "Correction for closeness: Adjusting normalized mutual information measure for clustering comparison," Computational Intelligence, vol. 33, no. 3, pp. 579-601, 2017.
[33] S. Romano, J. Bailey, V. Nguyen, and K. Verspoor, "Standardized mutual information for clustering comparisons: one step further in adjustment for chance," in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.