A Query-Driven System for Discovering Interesting Subgraphs in Social Media

Abstract

Social media data are often modeled as heterogeneous graphs with multiple types of nodes and edges. We present a discovery algorithm that first chooses a "background" graph based on a user's analytical interest and then automatically discovers subgraphs that are structurally and content-wise distinctly different from the background graph. The technique combines the notion of a \texttt{group-by} operation on a graph and the notion of subjective interestingness, resulting in an automated discovery of interesting subgraphs. Our experiments on a socio-political database show the effectiveness of our technique.


Social Network Analysis and Mining manuscript No. (will be inserted by the editor)

A Query-Driven System for Discovering Interesting Subgraphs in Social Media

Subhasis Dasgupta · Amarnath Gupta

Received: date / Accepted: date


Keywords social network · interesting subgraph discovery · subjective interestingness

An information system designed for analysis of social media must consider a common set of properties that characterize all social media data.

– Information elements in social media are essentially heterogeneous in nature: users, posts, images, and external URL references, although related, all bear different kinds of information.
– Most social information is temporal: a timestamp is associated with user events like the creation of or response to a post, as well as system events like user account creation, deactivation and deletion. The system should therefore allow both temporal and time-agnostic analyses.
– Information in social media evolves fast. In one study (Zhu et al., 2013), it was shown that the number of users in a social media platform is a power function of time. More recently, (Antonakaki et al., 2018) showed that Twitter's growth is supralinear and follows Leskovec's model of graph evolution (Leskovec et al., 2007). Therefore, an analyst may first have to perform exploration tasks on the data before figuring out their analysis plan.
– Social media has significant textual content, sometimes with specific entity markers (e.g., mentions) and topic markers (e.g., hashtags). Therefore any information element derived from text (e.g., named entities, topics, sentiment scores) may also be used for analysis. To be practically useful, the system must accommodate semantic synonyms.
– Relationships between information items in social media data must capture not only canonical relationships like (tweet-15 mentions user-392) but also a wide variety of computed relationships over base entities (users, posts, ...) and text-derived information (e.g., named entities).

It is also imperative that such an information system support three styles of analysis tasks.

1. Search, where the user specifies a content predicate without specifying the structure of the data. For example, seeking the number of tweets related to Kamala Harris should count tweets where she is the author, as well as tweets where any synonym of “Kamala Harris” is in the tweet text.
2. Query, where the user specifies query conditions based on the structure of the data. For example, tweets with a creation date between 9 and 9:30 am on January 6th, 2021, with text containing the string “Pence”, and that were favorited at least 100 times during the same time period.
3. Discovery, where the user may or may not know the exact predicates on the data items to be retrieved, but can specify analytical operations (together with some post-filters) whose results will provide insights into the data. For example, we call a query like “Perform community detection on all tweets on January 6, 2021 and return the users from the largest community” a discovery query.

In general, a real-life analytics workload will freely combine these modalities as part of a user's information exploration process.

In this paper, we present a general-purpose graph-based model for social media data and a subgraph discovery algorithm atop this data model. Physically, the data model is implemented on AWESOME (Dasgupta et al., 2016), an analytical platform designed to enable large-scale social media analytics over continuously acquired data from social media APIs. The platform, developed as a polystore system, natively supports relational, graph and document data, and hence enables a user to perform complex analyses that include arbitrary combinations of search, query and discovery operations.

We use the term query-driven discovery to reflect the scenario where the user does not want to run the discovery algorithm on a large and continuously collected body of data; rather, the user knows a starting point that can be specified as an expressive query (illustrated later), and puts bounds on the discovery process so that it terminates within an acceptable time limit.

Contributions.

This paper makes the following contributions. (a) It offers a new formulation for the subgraph interestingness problem for social media; (b) based on this formulation, it presents a discovery algorithm for social media; (c) it demonstrates the efficacy of the algorithm on multiple data sets.

Organization of the paper.

The rest of the paper is organized as follows. Section 2 describes the related research on interesting subgraph finding as investigated by researchers in Knowledge Discovery, Information Management, as well as Social Network Mining. Section 3 presents the abstract data model over which the discovery algorithm operates, and the basic definitions that establish the domain of discourse for the discovery process. Section 4 presents our method of generating candidate subgraphs that will be tested for interestingness. Section 5 first presents our interestingness metrics and then the testing process based on these metrics. Section 6 describes the experimental validation of our approach on multiple data sets. Section 7 presents concluding discussions.

The problem of finding “interesting” information in a data set is not new. (Silberschatz and Tuzhilin, 1996) described that an “interestingness measure” can be “objective” or “subjective”. A measure is “objective” when it is computed solely based on the properties of the data. In contrast, a “subjective” measure must take into account the user's perspective. They propose that (a) a pattern is interesting if it is “surprising” to the user (unexpectedness) and (b) a pattern is interesting if the user can act on it to his or her advantage (actionability). Of these criteria, actionability is hard to determine algorithmically; unexpectedness, on the other hand, can be viewed as the departure from the user's beliefs. For example, a user may believe that the 24-hour occurrence patterns of all hashtags are nearly identical. In this case, a discovery would be to find a set of hashtags and sample dates for which this belief is violated. Following (Geng and Hamilton, 2006), there are three possibilities regarding how a system is informed of a user's knowledge and beliefs: (a) the user provides a formal specification of his or her knowledge, and after obtaining the mining results, the system chooses which unexpected patterns to present to the user (Bing Liu et al., 1999); (b) according to the user's interactive feedback, the system removes uninteresting patterns (Sahar, 1999); and (c) the system applies the user's specifications as constraints during the mining process to narrow down the search space and provide fewer results. Our work roughly corresponds to the third strategy.

Early research on finding interesting subgraphs focused primarily on finding interesting substructures. This body of research primarily found interestingness in two directions: (a) finding frequently occurring subgraphs in a collection of graphs (e.g., chemical structures) (Kuramochi and Karypis, 2001; Yan and Han, 2003; Thoma et al., 2010) and (b) finding regions of a large graph that have high edge density (Lee et al., 2010; Sariyuce et al., 2015; Epasto et al., 2015; Wen et al., 2017) compared to other regions in the graph. Note that while a dense region in the graph can definitely be interesting, the inverse situation, where a sparsely connected region is surrounded by an otherwise dense periphery, can be equally interesting for an application.

We illustrate the situation in Figure 1. The primary data set is a collection of tweets on COVID-19 vaccination, but this specific graph shows a sparse core on Indian politics that is loosely connected to nodes of an otherwise dense periphery on the primary topic. Standard network features like the hashtag histogram and the node degree histogram do not reveal this substructure, requiring us to explore new methods of discovery.

Fig. 1

An example of an interesting subgraph. Tweets on Indian politics form a sparse center in a dense periphery (separately shown in the top right). The bottom two figures show the hashtag distribution of the center and the degree distribution of the center (log scale) respectively.

A serious limitation of the above class of work is that the interestingness criteria do not take into account node content (resp. edge content), which may be present in a property graph data model (Angles, 2018) where nodes and edges have attributes. (Bendimerad, 2019) presents several discovery algorithms for graphs with vertex attributes and edge attributes. They perform both structure-based graph clustering and subspace clustering of attributes to identify interesting (in their domain “anomalous”) subgraphs.

On the “subjective” side of the interestingness problem, one approach treats interesting subgraph discovery as a subgraph matching problem (Shan et al., 2019). Their general idea is to compute all matching subgraphs that satisfy a user's query and then rank the results based on the rarity and the likelihood of the associations among entities in the subgraphs. In contrast, (Adriaens et al., 2019) uses the notion of “subjective interestingness”, which roughly corresponds to finding subgraphs whose connectivity properties (e.g., the average degree of the vertices) are distinctly different from an “expected” background graph. This approach uses a constrained optimization problem that maximizes an objective function over the information content (IC) and the description length (DL) of the desired subgraph pattern.

Our work is conceptually most inspired by (Bendimerad et al., 2019), which explores the subjective interestingness problem for attributed graphs. Their main contribution centers around CSEA (Cohesive Subgraph with Exceptional Attributes) patterns that inform the user that a given set of attributes has exceptional values throughout a set of vertices in the graph. The subjective interestingness is given by

$S(U, S) = \frac{IC(U, S)}{DL(U)}$

where U is a subset of nodes and S is a set of restrictions on the value domains of the attributes. The system models the prior beliefs of the user as the Maximum Entropy distribution subject to any stated prior beliefs the user may hold about the data (e.g., the distribution of an attribute value). The information content IC(U, S) of a CSEA pattern (U, S) is formalized as the negative of the logarithm of the probability that the pattern is present under the background distribution. The length of a description of U is the intersection of all neighborhoods in a subset X ⊆ N(U), along with the set of “exceptions”, that is, vertices that are in the intersection but not part of U. However, we have a completely different, more database-centric formulation of the background and the user's beliefs.

The abstract data model is a heterogeneous information network (an information network with multiple types of nodes and edges), which we view as a temporal property graph G. Let N be the node set and E be the edge set of G. N can be viewed as a disjoint union of different subsets (called node types): users U, posts P, topic markers (e.g., hashtags) H, term vocabulary V (the set of all terms appearing in a corpus), and references (e.g., URLs) R(τ), where τ represents the type of resource (e.g., image, video, web site ...). Each type of node has a different set of properties (attributes) $\bar{A}(\cdot)$. We denote the attributes of U as $\bar{A}(U) = a_1(U), a_2(U), \ldots$ such that $a_i(U)$ is the i-th attribute of U. An attribute of a node type may be temporal: a post p ∈ P may have a temporal attribute called creationDate. Edges in this network can be directional and have a single edge type. The following is a set of base (but not exhaustive) edge types:

− writes: $U \mapsto P$
− uses: $P \mapsto H$
− mentions: $P \mapsto U$ maps a post p to a user u if u is mentioned in p
− repostOf: $P \mapsto P$ maps a post $p_2$ to a post $p_1$ if $p_2$ is a repost of $p_1$. This implies that $ts(p_1) < ts(p_2)$, where ts is the timestamp attribute
− replyTo/comment: $P \mapsto P$ maps a post $p_2$ to a post $p_1$ if $p_2$ is a reply to $p_1$. This implies that ts ( p )

The discovery process starts with the specification of the user's universe of discourse identified with a query Q. We provide some illustrative examples of user interest with queries of increasing complexity.

Example 1. “All tweets related to COVID-19 between 06/01/2020 and 07/31/2020 that refer to Hydroxychloroquine”. In this specification, the condition “related to COVID-19” amounts to finding tweets containing any k terms from a user-provided list, and the condition on “Hydroxychloroquine” is expressed as a fuzzy search.

Fig. 2

Hydroxychloroquine and COVID related Hashtags

Figure 2 shows the top hashtags related to this search. Note that the hashtag co-occurrence graph around COVID-19 and hydroxychloroquine includes “FakeNewsMedia”.
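To make the two conditions of Example 1 concrete, the following is a minimal sketch of how such a query predicate could be evaluated over tweets held in memory. It is not the AWESOME implementation: the tweet fields (`created`, `text`), the helper names, the reading of “any k terms” as “at least k terms”, and the similarity threshold of 0.85 are all assumptions made for illustration.

```python
import re
from datetime import date
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (hyphens kept)."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def fuzzy_contains(text, target, threshold=0.85):
    """True if some token is approximately equal to `target` (assumed threshold)."""
    target = target.lower()
    return any(
        SequenceMatcher(None, tok, target).ratio() >= threshold
        for tok in tokenize(text)
    )


def in_initial_post_set(tweet, topic_terms, k, start, end, fuzzy_term):
    """Example 1 as a predicate: date window, at least k topic terms, fuzzy match."""
    tokens = set(tokenize(tweet["text"]))
    topic_hits = sum(1 for term in topic_terms if term in tokens)
    return (
        start <= tweet["created"] <= end
        and topic_hits >= k
        and fuzzy_contains(tweet["text"], fuzzy_term)
    )


# Hypothetical data; the term list stands in for the user-provided list.
covid_terms = {"covid19", "covid-19", "coronavirus"}
tweets = [
    {"id": 1, "created": date(2020, 6, 15),
     "text": "#COVID19 trial of hyroxychloroquine halted"},
    {"id": 2, "created": date(2020, 6, 20),
     "text": "coronavirus vaccine update"},
    {"id": 3, "created": date(2020, 9, 1),
     "text": "covid-19 and hydroxychloroquine"},
]

initial_post_set = [
    t["id"]
    for t in tweets
    if in_initial_post_set(
        t, covid_terms, 1, date(2020, 6, 1), date(2020, 7, 31), "hydroxychloroquine"
    )
]
```

In this toy data only the first tweet qualifies: the misspelled drug name still passes the fuzzy match, the second tweet never mentions the drug, and the third falls outside the date window.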

Example 2. “All tweets from users who mention Trump in their user profile and Fauci in at least n of their tweets”. Notice that this query is about users with a certain behavioral pattern: it captures all tweets from users who have a combination of specific profile features and tweet content.

Example 3. “All tweets from users $U_1$ whose tweets appear in the hashtag co-occurrence 3-neighborhood around a given hashtag, together with all tweets of users $U_2$ who belong to the user-mention-user networks of these users (i.e., $U_1$)”, where the hashtag refers to “American Descendant of Slaves”, which represents an African American cause.

The end result of Q is a collection of posts that we call the initial post set $P_0$. Using the posts in $P_0$, the system creates a background graph as follows.

Initial Background Graph.

The initial background graph $G_0$ is the graph derived from $P_0$ over which the discovery process runs. However, to define the initial graph, we first develop the notion of a conversation.

Definition 1 (Semantic Neighborhood.) N(p), the semantic neighborhood of a post p, is the graph connecting p to the instances of U ∪ H ∪ P ∪ V that directly relate to p.

Definition 2 (Conversation Context.)

The conversation context C(p) of post p is a subgraph satisfying the following conditions:

1. $P_1$: The set of posts reachable to/from p along the relationships repostOf, replyTo belongs to C(p).
2. $P_2$: The union of posts in the semantic neighborhood of $P_1$ belongs to C(p).
3. E: The induced subgraph of $P_1 \cup P_2$ belongs to C(p).
4. Nothing else belongs to C(p).

Clearly, we can assert that C(p) is a connected graph and that $N(p) \sqsubset_g C(p)$, where $\sqsubset_g$ denotes a subgraph relationship.

Definition 3 (Initial Background Graph.)

The initial background graph $G_0$ is a merger of all conversation contexts $C(p_i)$, $p_i \in P_0$, together with all computed edges induced on the nodes of $\bigcup_i C(p_i)$.

The initial background graph itself can be a gateway to finding interesting properties of the graph. To illustrate this, consider the graph obtained from our Example 2. Figure 3 presents two views of the cluster of hashtags from January 2021. The left chart shows the time vs. count of the hashtags while the right chart shows the dominant hashtags of the same period in this cluster. The strong peak in the timeline was due to an intense discussion, revealed by topic modeling, on the creation of an office on African American issues. The occurrence of this peak is interesting because most of the social media conversation in this time period was focused on the Capitol attack on January 6.

Given the graph $G_0$, we discover subgraphs $S_i \subset G_0$ whose content and structure are distinctly different from that of $G_0$. However, unlike previous approaches, we apply a generate-and-test paradigm for discovery. The generate-step (Section 4) uses a graph-cube-like technique (Zhao et al., 2011) to generate candidate subgraphs that might be interesting, and the test-step (Section 5.2) computes whether (a) the candidate is sufficiently distinct from $G'$, and (b) the candidates in the collection are sufficiently distinct from each other.

Subgraph Interestingness.

For a subgraph $S_i$ to be considered as a candidate, it must satisfy the following conditions.

C1. $S_i$ must be connected and should satisfy a size threshold $\theta_n$, the minimal number of nodes.

C2. Let $A_{ij}$ (resp. $B_{ik}$) be the set of local properties of node j (resp. edge k) of subgraph $S_i$. A property is called “local” if it is not a network property like vertex degree. All nodes (resp. edges) of $S_i$ must satisfy some user-specified predicate $\phi_N$ (resp. $\phi_E$) specified over $A_{ij}$ (resp. $B_{ik}$). For example, a node predicate might require that all “post” nodes in the subgraph must have a re-post count of at least 300, while an edge predicate may require that all hashtag co-occurrence relationships must have a weight of at least 10. A user-defined constraint on the candidate subgraph improves the interpretability of the result. Typical subjective interestingness techniques (van Leeuwen et al., 2016; Adriaens et al., 2019) use only structural features of the network and do not consider attribute-based constraints, which limits their pragmatic utility.

C3. For each text-valued attribute a of $A_{ij}$, let C(a) be the collection of the values of a over all nodes of $S_i$, and let D(C(a)) be a textual diversity metric computed over C(a). For $S_i$ to be interesting, it must have at least one attribute a such that D(C(a)) does not have the usual power-law distribution expected in social networks. (Zheng and Gupta, 2019) used vocabulary diversity and topic diversity as textual diversity measures.

Fig. 3

Two views of the cluster of hashtags

Section 3.2 describes the creation of the initial background graph $G_0$ that serves as the domain of discourse for discovery. Depending on the number of initial posts $P_0$ resulting from the initial query, the size of $G_0$ might be too large; in this case the user can specify followup queries on $G_0$ to narrow down the scope of discovery. We call this narrowed-down graph of interest $G'$; if no followup queries were used, $G' = G_0$. The next step is to generate some candidate subgraphs that will be tested for interestingness.

Node Grouping.

A node group is a subset of nodes($G'$) where all nodes in a group have some similar property. We generalize the groupby operation, commonly used in relational database systems, to heterogeneous information networks. To describe the generalization, let us assume R(A, B, C, D, ...) is a relation (table) with attributes A, B, C, D, ... A groupby operation takes as input (a) a subset of grouping attributes (e.g., A, B), (b) a grouped attribute (e.g., C) and (c) an aggregation function (e.g., count). The operation first computes each distinct cross-product value of the grouping attributes (in our example, A × B), creates a list of all values of the grouped attribute corresponding to each distinct value of the grouping attributes, and then applies the aggregation function to the list. Thus, the result of the groupby operation is a single aggregated value for each distinct cross-product value of the grouping attributes.

To apply this operation to a social network graph, we recognize that there are two distinct ways of defining the “grouping-object”.

(1) Node properties can be directly used just like in the relational case. For example, for tweets a grouping condition might be getDate(Tweet.created_at) ∧ bin(Tweet.favoriteCount, 100), where the getDate function extracts the date of a tweet and the bin function creates buckets of size 100 from the favorite count of each tweet.

(2) The grouping-object is a subgraph pattern. For example, the subgraph pattern

(:tweet {date})-[:uses]->(:hashtag {text})    (P1)

states that all “tweet” nodes having the same posting date, together with every distinct hashtag text, will be placed in a separate group. Notice that while (1) produces disjoint groups of tweets, (2) produces a “soft” partitioning on the tweets and hashtags due to the many-to-many relationship between tweets and hashtags.

In either case, the result is a set of node groups, designated here as $N_i$. The grouping pattern P1 is expressed in a Cypher-like syntax (Francis et al., 2018), as implemented in the Neo4J graph data management system.
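The difference between the two kinds of grouping-object can be seen on a toy example. The sketch below is an illustration only, written over plain Python dictionaries rather than a graph database; the tweet records, field names and helper functions are hypothetical. It groups by (date, favorite-count bucket) for style (1) and by the (date, hashtag text) pattern P1 for style (2), then applies a count aggregation.

```python
from collections import defaultdict

# Hypothetical toy tweets; field names are illustrative, not the paper's schema.
tweets = [
    {"id": 1, "date": "2021-01-06", "favorite_count": 250,
     "hashtags": ["capitol", "election"]},
    {"id": 2, "date": "2021-01-06", "favorite_count": 40,
     "hashtags": ["capitol"]},
    {"id": 3, "date": "2021-01-07", "favorite_count": 230,
     "hashtags": ["election"]},
]


def group_by_attributes(tweets, bin_size=100):
    """Style (1): group tweet nodes by (date, favorite-count bucket)."""
    groups = defaultdict(set)
    for t in tweets:
        key = (t["date"], t["favorite_count"] // bin_size)
        groups[key].add(("tweet", t["id"]))
    return dict(groups)


def group_by_pattern(tweets):
    """Style (2): group tweet and hashtag nodes by (date, hashtag text), as in P1."""
    groups = defaultdict(set)
    for t in tweets:
        for tag in t["hashtags"]:
            key = (t["date"], tag)
            groups[key].add(("tweet", t["id"]))
            groups[key].add(("hashtag", tag))
    return dict(groups)


attribute_groups = group_by_attributes(tweets)
pattern_groups = group_by_pattern(tweets)

# A count aggregation over each node group, as in the relational case.
group_sizes = {key: len(nodes) for key, nodes in pattern_groups.items()}

# Number of pattern groups that contain tweet 1 (it uses two hashtags).
groups_with_tweet_1 = sum(
    1 for nodes in pattern_groups.values() if ("tweet", 1) in nodes
)
```

Each tweet lands in exactly one attribute group, whereas tweet 1 appears in two pattern groups and the hashtag “election” appears under two different dates, which is the soft partitioning the text describes.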
Notice that this process produces a “fuzzy” partitioning on the tweets and hashtags due to the many-to-many relationship between tweets and hashtags. Hence, the same tweet node can belong to two different groups because it has multiple hashtags. Similarly, a hashtag node can belong to multiple groups because tweets from different dates may have used the same hashtag. While the grouping condition specification language can express more complex grouping conditions, in this paper we will use simpler cases to highlight the efficacy of the discovery algorithm. We denote the node set in each group as $N_i$.

Graph Construction.

To complete the groupby operation, we also need to specify the aggregation function in addition to the grouping-object and the grouped-object. This function takes the form of a graph construction operation that constructs a subgraph $S_i$ by expanding on the node set $N_i$. Different expansion rules can be specified, leading to the formation of different graphs. Here we list three rules that we have found fairly useful in practice.

G1. Identify all the tweet nodes in $N_i$. Construct a relaxed induced subgraph of the tweet-labeled nodes in $N_i$. The subgraph is induced because it only uses tweets contained within $N_i$, and it is relaxed because it contains all nodes directly associated with these tweet nodes, such as author, hashtags, URLs, and mentioned users.

G2. Construct a mention network from within the tweet nodes in $N_i$; the mention network initially connects all tweet- and user-labeled nodes. Extend the network by including all nodes directly associated with these tweet nodes.

G3. A third construction relaxes the grouping constraint. We first compute either G1 or G2, and then extend the graph by including the first-order neighborhood of mentioned users or hashtags. While this clearly breaks the initial group boundaries, a network thus constructed includes tweets of similar themes (through hashtags) or audience (through mentions).

Automated Group Generation.

In a practical setting, as shown in Section 6, the parameters for the node grouping operation can be specified by a user, or they can be generated automatically. Automatic generation of grouping-objects is based on the considerations described below. To keep the autogeneration manageable, we will only consider one or two objects for attribute grouping and only a single edge for subgraph patterns.

– Since temporal shifts in social media themes and structure are almost always of interest, the posting timestamp is always a grouping variable. For our purposes, we set the granularity to a day by default, although a user can change it.
– Since the frequency of most nontemporal attributes (like hashtags) has a mixture distribution of double-Pareto lognormal distribution and power law (Bhattacharya et al., 2020), we adopt the following strategy.
  ◦ Let f(A) be the distribution of attribute A, and κ(f(A)) be the curvature of f(A). If A is a discrete variable, we find $a^*$, the maximum curvature (elbow) point of f(A), numerically (Antunes et al., 2018).
  ◦ We compute $A'$, the values of attribute A to the left of $a^*$, for all attributes and choose the attribute where the cardinality of $A'$ is maximum. In other words, we choose attributes which have the highest number of pre-elbow values.
– We adopt a similar strategy for subgraph patterns. If $T_1(a_i) \xrightarrow{L} T_2(b_j)$ is an edge where $T_1$, $T_2$ are node labels, $a_i$, $b_j$ are node properties and L is an edge label, then $a_i$ and $b_j$ will be selected based on the conditions above. Since the number of edge labels is fairly small in our social media data, we will evaluate the estimated cardinality of the edge for all such triples and select the one with the lowest cardinality.

The interestingness of a subgraph S is measured in reference to a background graph $G_b$ (e.g., $G'$), and consists of a structural as well as a content component. We first discuss the structural component. To compare a subgraph $S_i$ with the background graph, we first compute a set of network properties $P_j$ (see below) for nodes (or edges) and then compute the frequency distribution $f(P_j(S_i))$ of these properties over all nodes (resp. edges) of (a) the subgraphs $S_i$, and (b) the reference graph (e.g., $G'$). A distance between $f(P_j(S_i))$ and $f(P_j(G_b))$ is computed using the Jensen–Shannon divergence (JSD). In the following, we use $\Delta(f_1, f_2)$ to refer to the JS-divergence of distributions $f_1$ and $f_2$.

Eigenvector Centrality Disparity:

The testing process starts by identifying the distributions of nodes with high node centrality between the networks. While there is no shortage of centrality measures in the literature, we choose eigenvector centrality (Das et al., 2018), defined below, to represent the dominant nodes. Let $A = (a_{i,j})$ be the adjacency matrix of a graph. The eigenvector centrality $x_i$ of node i is given by:

$x_i = \frac{1}{\lambda} \sum_k a_{k,i} x_k$

where $\lambda \neq 0$ is a constant. The rationale for this choice follows from earlier studies in (Bonacich, 2007; Ruhnau, 2000; Yan et al., 2014), which establish that since the eigenvector centrality can be seen as a weighted sum of direct and indirect connections, it represents the true structure of the network more faithfully than other centrality measures. Further, (Ruhnau, 2000) proved that the eigenvector centrality under the Euclidean norm can be transformed into node-centrality, a property not exhibited by other common measures. Let the distributions of eigenvector centrality of subgraphs A and B be $\beta_a$ and $\beta_b$ respectively, and that of the background graph be $\beta_t$. Then $|\Delta_e(\beta_t, \beta_a)| > \theta$ indicates that A is sufficiently structurally distinct from $G_b$, and $|\Delta_e(\beta_t, \beta_a)| > |\Delta_e(\beta_t, \beta_b)|$ indicates that A contains significantly more or significantly fewer influential nodes than B.

Topical Navigability Disparity:

Navigability measures ease of flow. If subgraph S is more navigable than subgraph $S'$, then there will be more traffic through S compared to $S'$. However, the likelihood of seeing a higher flow through a subgraph depends not just on the structure of the network, but also on extrinsic covariates like time and topic. So, a subgraph is interesting in terms of navigability if, for some values of a covariate, its navigability measure is different from that of a background subgraph.

Inspired by its application in biology (Seguin et al., 2018), traffic analysis (Scellato et al., 2010), and network attack analysis (Lekha and Balakrishnan, 2020), we use edge betweenness centrality (Das et al., 2018) as the generic (non-topic) measure of navigability. Let $\alpha_{ij}$ be the number of shortest paths from node i to j and $\alpha_{ij}(k)$ be the number of those paths that pass through the edge k. Then the edge betweenness centrality is

$C_{eb}(k) = \sum_{(i,j) \in V} \frac{\alpha_{ij}(k)}{\alpha_{ij}}$

By this definition, the edge betweenness centrality is the portion of all-pairs shortest paths that pass through an edge. Since the edge betweenness centrality of edge e measures the proportion of paths that pass through e, a subgraph S with a higher proportion of edges with high edge betweenness centrality may be more navigable than $G_b$ or another subgraph $S'$ of the graph. That is, information propagation is higher through this subgraph compared to the whole background network or, for that matter, any other subgraph of the network having a lower proportion of edges with high edge betweenness centrality. Let the distributions of the edge betweenness centrality of two subgraphs A and B be $c_1$ and $c_2$ respectively, and that of the reference graph be $c_t$. Then, $|\Delta_b(c_t, c_1)| < |\Delta_b(c_t, c_2)|$ means the second subgraph is more navigable than the first.

To associate navigability with topics, we detect topic clusters over the background graph and the subgraph being inspected.
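The structural comparisons described so far follow one recipe: compute a per-node (or per-edge) property on the candidate and on the background, bin both sets of values over shared equi-width bins, and take the JS-divergence of the two histograms. The sketch below illustrates that recipe for eigenvector centrality using only the Python standard library. It is a hedged illustration and not the authors' code: the function names, the power-iteration settings, the choice of 10 bins and the base-2 logarithm are assumptions, and the same `shared_histograms` and `js_divergence` steps would apply unchanged to edge betweenness values.

```python
import math


def eigenvector_centrality(adj, max_iter=200, tol=1e-10):
    """Power iteration on an undirected graph given as {node: set(neighbors)}."""
    x = {n: 1.0 for n in adj}
    for _ in range(max_iter):
        # Iterating with A + I avoids oscillation on bipartite graphs and
        # leaves the dominant eigenvector unchanged.
        nxt = {n: x[n] + sum(x[m] for m in adj[n]) for n in adj}
        norm = math.sqrt(sum(v * v for v in nxt.values()))
        nxt = {n: v / norm for n, v in nxt.items()}
        delta = sum(abs(nxt[n] - x[n]) for n in adj)
        x = nxt
        if delta < tol:
            break
    return x


def shared_histograms(t, x, n_bins=10):
    """Bin two value lists over the same equi-width bins; return relative frequencies."""
    lo, hi = min(t + x), max(t + x)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range

    def hist(values):
        counts = [0] * n_bins
        for v in values:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
        return [c / len(values) for c in counts]

    return hist(t), hist(x)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, an assumption) of two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def centrality_disparity(subgraph_adj, background_adj, n_bins=10):
    """JS-divergence between the eigenvector-centrality distributions of two graphs."""
    s = list(eigenvector_centrality(subgraph_adj).values())
    b = list(eigenvector_centrality(background_adj).values())
    return js_divergence(*shared_histograms(b, s, n_bins))


# Hypothetical toy graphs: a star (one dominant node) and a path.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

theta = 0.5  # assumed threshold
star_is_distinct = centrality_disparity(star, path) > theta
```

With base-2 logarithms the divergence lies between 0 (identical histograms) and 1 (disjoint histograms), which makes a threshold such as θ easy to interpret. If SciPy is used instead, note that `scipy.spatial.distance.jensenshannon` returns the square root of the divergence rather than the divergence itself.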
The exact method for topic cluster finding is independent of the use of topical navigability. In our setting, we have used topic modeling and dense region detection in hashtag co-occurrence networks. For each topic cluster, we identify posts (within the subgraph) that belong to the cluster. If the number of posts is greater than a threshold, we compute the navigability disparity.

Propagativeness Disparity:

The concept of propagativeness builds on the concept of navigability. Propagativeness attempts to capture how strongly the network is spreading information through a navigable subgraph S. We illustrate the concept with a network constructed over tweets where a political personality (Senator Kamala Harris in this example) is mentioned in January 2021. The three rows in Figure 10 show the network characteristics of the subregions of this graph, related, respectively, to themes including “black lives matter” (Row 1), the Capitol insurrection (Row 2) and socioeconomic issues related to COVID-19, including stimulus funding and business reopening (Row 3). In earlier work (Zheng and Gupta, 2019), we have shown that a well-known propagation mechanism for tweets is to add user-mentions to improve the reach of a message; hence the user-mention subgraph is indicative of propagative activity. In Figure 10, we compare the hashtag activity (measured by the hashtag subgraph) and the mention activity (size of the mention graph) in these three subgraphs. Figure 10 (e) shows a low and fairly steady size of the user mention activity in relation to the hashtag activity on the same topic, and these two indicators are not strongly correlated. Further, Figure 10 (f) shows that the mean and standard deviation of the node degree of hashtag activity are fairly close, and the average degree of the user co-mention graph (two users mentioned in the same tweet) is relatively steady over the period of observation, showing low propagativeness. In contrast, Row 1 and Row 2 show sharper peaks. But the curve in Figure 10 (c) declines and has low, uncorrelated user mention activity. Hence, for this topic, although there is a lot of discussion (leading to high navigability edges), the propagativeness is quite low. In comparison, Figure 10 (a) shows a strong peak and a stronger correlation between the two curves, indicating more propagativeness. The higher standard deviation in the co-mention node degree over time (Figure 10 (b)) also shows more propagation building around this topic compared to the others.

We capture propagativeness using current flow betweenness centrality (Brandes and Fleischer, 2005), which is based on Kirchhoff's current laws. We combine this with the average neighbor degree of the nodes of S to measure the spreading propensity of S. The current flow betweenness centrality is described here as the portion of all-pairs shortest paths that pass through a node, and the average neighbor degree is the average degree of the neighborhood of each node. Note that the formula below counts shortest paths through a node; the current flow variant of Brandes and Fleischer replaces these path counts with the electrical current passing through the node. If a subgraph has higher current flow betweenness centrality plus a higher average neighbor degree, the network should have faster communicability. Let $\alpha_{ij}$ be the number of shortest paths from node i to j and $\alpha_{ij}(n)$ be the number of those paths that pass through the node n. Then the current flow betweenness centrality is:

$C_{nb}(n) = \sum_{(i,j) \in V} \frac{\alpha_{ij}(n)}{\alpha_{ij}}$

Suppose the distributions of the current flow betweenness centrality of two subgraphs A and B are $p_1$ and $p_2$ respectively, and the distribution of the reference graph is $p_t$. Also, let the distributions of $\beta_n$, the average neighbor degree of the node n, for the subgraphs A and B be $\gamma_1$ and $\gamma_2$ respectively, and the reference distribution be $\gamma_t$. If the condition

$\Delta(p_t, p_1) * \Delta(\gamma_t, \gamma_1) < \Delta(p_t, p_2) * \Delta(\gamma_t, \gamma_2)$

holds, we can conclude that subgraph B is a faster propagating network than subgraph A. This measure is of interest in social media based on the observation that misinformation/disinformation propagation groups either try to increase the average neighbor degree by adding fake nodes or try to involve influential nodes with high edge centrality to propagate the message faster (Besel et al., 2018).

Subgroups within a Candidate Subgraph:

The purpose of the last metric is to determine whether a candidate subgraph identified using the previous measures needs to be further decomposed into smaller subgraphs. We use subgraph centrality (Estrada and Rodriguez-Velazquez, 2005) and coreness of nodes as our metrics. The subgraph centrality measures the number of subgraphs a vertex participates in, and the core number of a node is the largest value $k$ of a $k$-core containing that node. So a subgraph for which the core number and subgraph centrality distributions are right-skewed compared to the background subgraph is either (i) split around high-coreness nodes, or (ii) reported to the user as a mixture of diverse topics. The node grouping, per-group subgraph generation and candidate subgraph identification process is presented in Algorithm 1. In the algorithm, function cut2bin extends the cut function, which compares the histograms of the two distributions whose domains (X-values) must overlap, and produces equi-width bins to ensure that two histograms (i.e., frequency distributions) have compatible bins.

Algorithm 1: Graph Construction Algorithm

INPUT: Q_out, output of the query; L, graph construction rules; gv, grouping variable; th_size, the minimum size of the subgraph

Function gmetrics(Q_out, L, groupVar)
    G[] ← ConstructGraph(Q_out, L);
    T ← [];
    for g ∈ G do
        t_α ← ComputeMetrics(g);
        T.push(t_α);
    end
    return T
end

Function ComputeMetrics(Graph g)
    m ← [];
    m.push(eigenVectorCentrality(g));
    .........
    m.push(coreNumber(g));
    return m
end

Function CompareHistograms(List t, List x)
    s_g ← cut2bin(x, bin_edges);
    bin_edges ← getBinEdges(x);
    t_g ← cut2bin(t, bin_edges);
    β_js ← distance.jensenShannon(t_g, s_g);
    h_t ← histogram(t_g, s_g, bin_edges);
    return β_js, h_t, bin_edges;
end

5.2 The Testing Process

The discovery algorithm's input is the list of divergence values of two candidate sets computed against the same reference graph. It produces four lists at the end. Each of the first three lists contains one specific factor of interestingness of the subgraph. The most interesting subgraph should be present in all three vectors. If the subgraph has many cores and is sufficiently dense, then the system considers the subgraph to be uninterpretable and sends it for re-partitioning. Therefore, the fourth list contains the subgraphs that should be partitioned again. Currently, our repartitioning strategy is to take subsets of the original keyword list provided by the user at the beginning of the discovery process to re-initiate the discovery process for the dense, uninterpretable subgraph.

Algorithm 2: Graph Discovery Algorithm

Input: Set of all subgraph divergences σ
Output: Feature vectors v_1, v_2, v_3; list l for re-partition recommendations
ev: eigenvector centrality; ec: edge current flow betweenness centrality; nc: current flow betweenness centrality; µ: core number; z: average neighbor degree

Function discover(σ)
    for any two sets of divergences σ_1 and σ_2 from σ do
        if σ_1(ev) > σ_2(ev) then
            v_1(σ_1) = v_1(σ_1) + 1;
            if σ_1(ec) > σ_2(ec) then
                v_2(σ_1) = v_2(σ_1) + 1;
                if (σ_1(nc) + σ_1(µ)) > (σ_2(ec) + σ_2(µ)) then
                    v_3(σ_1) = v_3(σ_1) + 1;
                end
                if (σ_1(sc) + σ_1(z)) > (σ_2(sc) + σ_2(z)) then
                    l(σ_1) = 1;
                end
            end
        end
    end
end
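The voting logic of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the dictionary layout of the divergence values, the function names `discover` and `top_k`, the sample values, and the choice that the subgraph with the larger divergence in a pair receives the vote are all made for illustration. The third comparison follows the pseudocode as printed, with `ec` on its right-hand side.

```python
from itertools import permutations


def discover(sigma):
    """Vote on interestingness for every ordered pair of candidate subgraphs.

    sigma maps a subgraph id to its divergence values (all computed against
    the same reference graph), keyed by metric: ev, ec, nc, mu, sc, z.
    This layout is an assumption made for illustration.
    """
    v1 = {s: 0 for s in sigma}  # centrality votes
    v2 = {s: 0 for s in sigma}  # navigability votes
    v3 = {s: 0 for s in sigma}  # propagativeness votes
    repartition = set()         # candidates recommended for re-partitioning

    for a, b in permutations(sigma, 2):
        sa, sb = sigma[a], sigma[b]
        if sa["ev"] > sb["ev"]:
            v1[a] += 1
            if sa["ec"] > sb["ec"]:
                v2[a] += 1
                # Right-hand side uses ec, as printed in Algorithm 2.
                if sa["nc"] + sa["mu"] > sb["ec"] + sb["mu"]:
                    v3[a] += 1
                if sa["sc"] + sa["z"] > sb["sc"] + sb["z"]:
                    repartition.add(a)
    return v1, v2, v3, repartition


def top_k(scores, k):
    """Return the k candidate ids with the highest vote counts."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical divergence values for two candidate subgraphs.
sigma = {
    "A": {"ev": 0.5, "ec": 0.4, "nc": 0.3, "mu": 0.2, "sc": 0.1, "z": 0.1},
    "B": {"ev": 0.2, "ec": 0.1, "nc": 0.1, "mu": 0.1, "sc": 0.3, "z": 0.3},
}
v1, v2, v3, repartition = discover(sigma)
print(top_k(v1, 1), repartition)
```

Because the comparisons are nested, a candidate can only collect a navigability or propagativeness vote in a pair where it has already won the centrality comparison, which mirrors the nesting of the pseudocode.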

Fig. 4
Fig. 5 AVG Degree Distributions and Std Dev
Fig. 6 Insurrection and Capitol Attack
Fig. 7 AVG Degree Distributions and Std Dev
Fig. 8 Socioeconomic Issues During COVID-19
Fig. 9 AVG Degree Distributions and Std Dev
Fig. 10 Various Metrics for Daily Partitioned Hashtag Co-occurrence and User Mention Graphs from the Political Personality Dataset. Q retrieves tweets where Kamala Harris is mentioned in a hashtag, text body or user mention.

The output of each metric produces a value for each participant node of the input. However, to compare two different candidates in terms of the metrics mentioned above, we need to convert them to comparable histograms by applying a binning function depending on the data type of the grouping function.
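The conversion of per-node metric values into comparable histograms, followed by a divergence computation, can be sketched as follows. This is a minimal standard-library sketch, not the system's code: the function names (`equi_width_edges`, `cut_to_bins`, `js_divergence`, `compare_histograms`), the default of five bins, the sample values, and the clamping of out-of-range values into the first or last bin are assumptions made for illustration.

```python
import math


def equi_width_edges(values, n_bins):
    """Return n_bins + 1 equally spaced bin edges spanning the values."""
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate domain: widen it so bins have nonzero width
        hi = lo + 1.0
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def cut_to_bins(values, edges):
    """Count values per bin; values outside the edges are clamped into the
    first or last bin (an assumption of this sketch)."""
    n_bins = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * n_bins)
        counts[max(0, min(idx, n_bins - 1))] += 1
    return counts


def js_divergence(counts_p, counts_q):
    """Jensen-Shannon divergence (base 2, in [0, 1]) of two histograms."""
    p = [c / sum(counts_p) for c in counts_p]
    q = [c / sum(counts_q) for c in counts_q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def compare_histograms(reference, candidate, n_bins=5):
    """Bin both value lists on the reference's edges and compare them."""
    edges = equi_width_edges(reference, n_bins)
    return js_divergence(cut_to_bins(reference, edges),
                         cut_to_bins(candidate, edges))


# Hypothetical per-node metric values for a reference and a candidate graph.
reference = [0.0, 0.1, 0.2, 0.2, 0.4, 0.5, 0.9, 1.0]
candidate = [0.7, 0.8, 0.8, 0.9, 1.0]
print(compare_histograms(reference, candidate))
```

Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, which is the square root of the divergence computed here.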

Bin Formation (cut2bin):

Cut is a conventional operator (available in R, Matlab, Pandas, etc.) that segments and sorts data values into bins. The cut2bin function is an extension of the standard cut function, which compares the histograms of the two distributions whose domains (X-values) must overlap. The cut function accepts as input a set of node property values (e.g., the centrality metrics), and optionally a set of edge boundaries for the bins. It returns the histogram of the distribution. Using cut, we first produce $n$ equi-width bins from the distribution with the narrower domain. Then we extract the bin edges from the result and use them as the input bin edges to create the wider distribution's cut. This forces the histograms to be compatible. In case one of the distributions is known to be a reference distribution (the distribution from the background graph) against which the second distribution is compared, we use the reference distribution for equi-width binning and bin the second distribution relative to the first.

The CompareHistograms function uses the cut2bin function to produce the histograms, and then computes the JS Divergence on the comparable histograms. The CompareHistograms function returns the set of divergence values for each metric of a subgraph, which is the input of the discovery algorithm. The function requires the user to specify which of the compared graphs should be considered as a reference; this is required to ensure that our method is scalable for large background graphs (which are typically much larger than the interesting subgraphs). If the background graph is very large, we take several random subgraphs from this graph to ensure they are representative before the actual comparisons are conducted. To this end, we adopt the well-known random walk strategy.

In the algorithm, $v_1$, $v_2$ and $v_3$ are the three vectors that store the interestingness factors of the subgraphs, and $l$ is the list for repartitioning. For two subgraphs, if one of them qualifies for $v_1$, the subgraph has higher centrality than the other. In that case, the algorithm increases the value of that candidate's entry in the vector by one. Similarly, it increases the value of $v_2$ by one if the same candidate has higher navigability. Finally, it increases $v_3$ if the candidate has higher propagativeness. The algorithm selects the top-$k$ scores of candidates from each vector, and marks them interesting.

In the experiments, we constructed two subgraphs from each subquery. The first is the “hashtag co-occurrence” graph, where each hashtag is a node, and two nodes are connected by an edge if they coexist in a tweet. The second is the “user co-mention” graph, where each user is a node, and there is an edge between two nodes if a tweet mentioned them jointly. Intuitively, the hashtag co-occurrence subgraph captures topic prevalence and propagation, whereas the co-mention subgraph captures the tendency to influence and propagate messages to a larger audience.
Our goal is to discover surprises (and lack thereof) in these two aspects for our data sets.

We note that the dataset chosen is from a month where the US experienced a major event in the form of the Capitol Attack, and a new administration was sworn in. This explains why the number of tweets in the “Capitol Attack” subgraph is high for both politicians in this week, and, not surprisingly, it is also the most discussed topic, as evidenced by the high average node degree. Therefore, this “selection bias” sets our expectation for subjective interestingness: given the specific week we have chosen, this issue will dominate most social media conversations in the USA. We also observe the low ratio of the number of unique nodes to the number of tweets, signifying the high number of retweets, which signals a form of information propagation over the network. The propagativeness of the network during this eventful week is also evidenced by the fact that the unique node count of a co-mention network is almost 75% - 88% higher on average compared to the hashtag co-occur network of the same class. In Section 6.3, we show how our interestingness technique performs in the face of this dataset.

2 https://code.awesome.sdsc.edu/awsomelabpublic/datasets/int-springer-snam/
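The two graph types used in the experiments, hashtag co-occurrence and user co-mention, can both be built with the same pairing step over the items of each tweet. The sketch below is illustrative only: the function name `co_occurrence_graph`, the tweet record layout, the sample tweets, and the case-folding and de-duplication of items within a tweet (which means this sketch produces no self-loops) are assumptions, not details taken from the system described here.

```python
from itertools import combinations


def co_occurrence_graph(item_lists):
    """Build an undirected weighted graph from lists of co-occurring items.

    Each list holds the hashtags (or mentioned users) of one tweet. Every
    item becomes a node; every pair of distinct items in the same tweet
    adds one to the weight of the edge between them.
    """
    nodes = set()
    edges = {}
    for items in item_lists:
        # Case-folding and de-duplication are assumptions of this sketch.
        unique = sorted({item.lower() for item in items})
        nodes.update(unique)
        for a, b in combinations(unique, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return nodes, edges


# Hypothetical tweets; the field names are illustrative.
tweets = [
    {"hashtags": ["BLM", "Capitol"], "mentions": ["user_a", "user_b", "user_c"]},
    {"hashtags": ["capitol", "blm", "Stimulus"], "mentions": ["user_a"]},
]

hashtag_nodes, hashtag_edges = co_occurrence_graph(t["hashtags"] for t in tweets)
mention_nodes, mention_edges = co_occurrence_graph(t["mentions"] for t in tweets)

print(len(hashtag_nodes), len(hashtag_edges))
print(len(mention_nodes), len(mention_edges))
```

Keeping the edge weights makes it possible to distinguish a pair that co-occurs once from a pair that co-occurs in many tweets, while the number of keys in the edge dictionary gives the unique edge count.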

Table 1 Dataset Descriptions

| Data Set | Total Collection Size | SubQuery | Network Type | Total Tweets | Unique nodes | Unique edges | Self Loop | Density | Avg Degree |
|---|---|---|---|---|---|---|---|---|---|
| Kamala Harris | | Capitol Attack | Hashtag co-occur | 164397 | 1398 | 7801 | 16 | 0.0025 | 4.3 |
| | | | user co-mention | 164397 | 8012 | 48604 | 87 | 0.00012 | 3.19 |
| | | | Hashtag co-occur | 158419 | 3671 | 10738 | 29 | 0.0015 | 5.8 |
| | | | user co-mention | 158419 | 30829 | 39865 | 49 | 8.3 | 2.5 |
| | | Economic Issues | Hashtag co-occur | 36678 | 1278 | 1828 | 4 | 0.0022 | 2.8 |
| | | | user co-mention | 36678 | 6971 | 11584 | 19 | 0.0004 | 3.4 |
| Joe Biden | | Capitol Attack | Hashtag co-occur | 676898 | 7728 | 21422 | 50 | 0.00071 | 5.49 |
| | | | user co-mention | 676898 | 82046 | 101646 | 130 | 3.0146 | 2.473 |
| | | | Hashtag co-occur | 183765 | 3007 | 11008 | 29 | 0.002 | 3.85 |
| | | | user co-mention | 158419 | 29547 | 40932 | 56 | 9.3 | 2.7 |
| | | Economic Issues | Hashtag co-occur | 138754 | 2961 | 5733 | 10 | 0.0013 | 3.87 |
| | | | user co-mention | 138754 | 21417 | 19691 | 23 | 8.5 | 1.83 |
| Vaccine | | Vaccine Anti-vax | Hashtag co-occur | 1000000 | 18809 | 24195 | 44 | 2.52 | 2.5 |
| | | | user co-mention | 1000000 | 203211 | 41877 | 46 | 2.02 | 0.4 |
| | | Covid Test | Hashtag co-occur | 1000000 | 26671 | 45378 | 69 | 0.00012 | 3.4 |
| | | | user co-mention | 1000000 | 188761 | 83656 | 109 | 4.67 | 0.886 |
| | | economy | hashtag co-occur | 917890 | 3002 | 4395 | 9 | 0.0009 | 2.9 |
| | | | user co-mention | 917890 | 20590 | 8528 | 13 | 4.023 | 0.8 |

Kamala Harris Network: Figure 29 represents the Kamala Harris network. This network is interesting because even in the context of the Capitol Attack and the Senate approval of election results, it is dominated by

Fig. 11 Capitol Attack
Fig. 12
Fig. 13 Economic issues
Fig. 14 Top Hashtag Distributions of the Kamala Harris Data set
Fig. 15 Capitol Attack
Fig. 16
Fig. 17 Economic issues
Fig. 18 Top Hashtag Distributions of the Joe Biden Data set

portant. However, there are a few spikes on

Biden Network: While the predominance of

Vaccine Network: In the vaccine network, we found that “economic issues” and “covid tests” are more propagative than “vaccine and anti-vaccine” related topics (Figure 43). The surprising result here is that the “vaccine - anti-vaccine” topics show a strong correlation with “Economy” in the other two charts. We observe that while the vaccine issues are navigable through the network, this topic cluster is not very propagative in the network. In contrast, in the co-mention network, vaccine and anti-vaccine issues are both very navigable and strongly propagative. Further, the propagativeness in the co-mention network for the Covid-test shows many spikes at different levels, which signifies that, for testing-related issues, the network serves as a vehicle of message propagation and influencing.

Fig. 19 Vaccine anti-vaccine Issues
Fig. 20 COVID-19 Test
Fig. 21 Economic issues
Fig. 22 Top Hashtag Distributions of the Vaccine Data set

6.4 Result Validation

There is not a lot of work in the literature on interesting subgraph finding. Additionally, there are no benchmark data sets against which our “interestingness finding” technique can be compared. This prompted us to evaluate the results using a core-periphery analysis, as indicated earlier in Figure 1. The idea is to demonstrate that the parts of the network claimed to be interesting stand out in comparison to the network of a random sample of comparable size from the background graph. These results are presented in Figures 48, 53, 58. In each of these cases, we have shown a representative random graph in the rightmost subfigure to represent the background graph. To us, the large and dense core formation in Figure 48(b) is an expected, non-surprising result. However, the lack of core formation in Figure 48(c) is interesting because it shows that while there was a sizeable network for economics-related terms for Kamala Harris, the topics never “gelled” into a core-forming conversation and showed little propagation. Figure 48(a) is somewhat more interesting because the density of the periphery is far less strong than the random graph while the core has about the same density as the random graph. The core-periphery separation is much more prominent in the first three plots of Figure 53.
Unlike the Kamala Harris random graph, Figure 53(d) shows that the random graph of this data set itself has a very large dense core and a moderately dense periphery. Among the three subfigures, the small (but lighter) core of Figure 53(b) has the maximum difference compared to the random graph, although we find 53(a) to be conceptually more interesting. For the vaccine data set, Figure 58(c) is closest to the random graph, showing that most conversations around the topic touch economic issues, while the discussion on the vaccine itself (Figure 58(a)) and that of COVID-testing (Figure 58(b)) are more focused and propagative.
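One simple way to obtain a core and a periphery for plots of this kind is k-core decomposition, using the core number defined earlier (the largest $k$ of a $k$-core containing the node). The sketch below is a hedged illustration and not the procedure used to produce the figures: treating the nodes with the maximum core number as the core, the function names `core_numbers` and `core_periphery`, and the toy graph are all assumptions.

```python
def core_numbers(adj):
    """Compute the core number of every node by repeated peeling.

    adj maps each node to the set of its neighbours (undirected graph,
    no self-loops). The node of minimum remaining degree is removed at
    each step; its core number is the largest such minimum seen so far.
    """
    degree = {node: len(neighbours) for node, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adj[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


def core_periphery(adj):
    """Split nodes into the innermost core and the rest (an assumption:
    the core is taken to be the nodes with the maximum core number)."""
    core = core_numbers(adj)
    k_max = max(core.values())
    inner = {n for n, k in core.items() if k == k_max}
    return inner, set(adj) - inner


# Toy graph: a triangle (a, b, c) with one pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
inner, outer = core_periphery(adj)
print(sorted(inner), sorted(outer))
```

The peeling loop here is quadratic in the number of nodes, which is fine for a small sample graph; a bucket-based implementation would be preferable for graphs of the sizes listed in Table 1.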

In this paper, we presented a general technique to discover interesting subgraphs in social networks, with the intent that social network researchers from different domains would find it a viable research tool. The results obtained from the tool will help researchers to probe deeper into the underlying phenomena that lead to the surprising results discovered by the tool. While we used Twitter as our example data source, similar analysis can be performed on other social media, where the content features would be different. Further, in this paper, we have used a few centrality measures to compute divergence-based features, but the system is designed to plug in other measures as well.

Fig. 23 Eigenvector Centrality Disparity Hashtag Network
Fig. 24 Topical Navigability Disparity Hashtag Network
Fig. 25 Propagativeness Disparity Hashtag Network
Fig. 26 Eigenvector Centrality Disparity co-mention Network
Fig. 27 Topical Navigability Disparity co-mention Network
Fig. 28 Propagativeness Disparity co-mention Network
Fig. 29 Comparative studies of all sub-queries using the “Kamala Harris” data set.

Fig. 30 Eigenvector Centrality Disparity Hashtag Network
Fig. 31 Topical Navigability Disparity Hashtag Network
Fig. 32 Propagativeness Disparity Hashtag Network
Fig. 33 Eigenvector Centrality Disparity co-mention Network
Fig. 34 Topical Navigability Disparity co-mention Network
Fig. 35 Propagativeness Disparity co-mention Network
Fig. 36 Comparative studies of all sub-queries using the “Joe Biden” data set.

Fig. 37 Eigenvector Centrality Disparity: “Vaccine” Hashtag Network
Fig. 38 Topical Navigability Disparity: Vaccine Hashtag Network
Fig. 39 Propagativeness Disparity: “Vaccine” Hashtag Network
Fig. 40 Eigenvector Centrality Disparity: “Vaccine” co-mention Network
Fig. 41 Topical Navigability Disparity: “Vaccine” co-mention Network
Fig. 42 Propagativeness Disparity: “Vaccine” co-mention Network
Fig. 43 Comparative studies of all sub-queries using the “Vaccine” data set.

Fig. 44
Fig. 45 Capitol attack
Fig. 46 Economic issues
Fig. 47 Random Graph
Fig. 48 Core and Periphery visualization of “Kamala Harris” Data set

Fig. 49
Fig. 50 Capitol attack
Fig. 51 Economic issues
Fig. 52 Random Data
Fig. 53 Core and Periphery visualization of “Joe Biden” Data set

Fig. 54 Vaccine Issues
Fig. 55 COVID-19 Test
Fig. 56 Economic issues
Fig. 57 Random Graph
Fig. 58 Core and Periphery visualization of the Vaccine Data set

Acknowledgements

This work was partially supported by NSF Grants

References

Adriaens F, Lijffijt J, De Bie T (2019) Subjectively interesting connecting trees and forests. Data Mining and Knowledge Discovery 33(4):1088–1124

Angles R (2018) The property graph database model. In: AMW

Antonakaki D, Ioannidis S, Fragopoulou P (2018) Utilizing the average node degree to assess the temporal growth rate of twitter. Social Network Analysis and Mining 8(1):12

Antunes M, Gomes D, Aguiar RL (2018) Knee/elbow estimation based on first derivative threshold. In: Proc. of 4th IEEE International Conference on Big Data Computing Service and Applications (BigDataService), IEEE, pp 237–240

Bendimerad A (2019) Mining useful patterns in attributed graphs. PhD thesis, Université de Lyon

Bendimerad A, Mel A, Lijffijt J, Plantevit M, Robardet C, De Bie T (2019) Mining subjectively interesting attributed subgraphs. arXiv preprint arXiv:190503040

Besel C, Echeverria J, Zhou S (2018) Full cycle analysis of a large-scale botnet attack on twitter. In: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, pp 170–177

Bhattacharya S, Sinha S, Roy S, Gupta A (2020) Towards finding the best-fit distribution for OSN data. J Supercomput 76(12):9882–9900

Bing Liu, Wynne Hsu, Lai-Fun Mun, Hing-Yan Lee (1999) Finding interesting patterns using user expectations. IEEE Transactions on Knowledge and Data Engineering 11(6):817–832, DOI 10.1109/69.824588

Bonacich P (2007) Some unique properties of eigenvector centrality. Soc Networks 29(4):555–564

Brandes U, Fleischer D (2005) Centrality measures based on current flow. In: Annual symposium on theoretical aspects of computer science, Springer, pp 533–544

Consens MP, Mendelzon AO (1993) Low-complexity aggregation in graphlog and datalog. Theoretical Computer Science 116(1):95–116

Das K, Samanta S, Pal M (2018) Study on centrality measures in social networks: a survey. Social network analysis and mining 8(1):13

Dasgupta S, Coakley K, Gupta A (2016) Analytics-driven data ingestion and derivation in the AWESOME polystore. In: IEEE International Conference on Big Data, Washington DC, USA, IEEE Computer Society, pp 2555–2564

Epasto A, Lattanzi S, Sozio M (2015) Efficient densest subgraph computation in evolving graphs. In: Proc. of the 24th International Conference on World Wide Web, pp 300–310

Estrada E, Rodriguez-Velazquez JA (2005) Subgraph centrality in complex networks. Physical Review E 71(5):056103

Francis N, Green A, Guagliardo P, Libkin L, Lindaaker T, Marsault V, Plantikow S, Rydberg M, Selmer P, Taylor A (2018) Cypher: An evolving query language for property graphs. In: Proc. of the Int. Conf. on Management of Data (SIGMOD), pp 1433–1445

Geng L, Hamilton HJ (2006) Interestingness measures for data mining: A survey. ACM Computing Surveys (CSUR) 38(3):9–es

Kuramochi M, Karypis G (2001) Frequent subgraph discovery. In: Proc. of the IEEE Int. Conf. on Data Mining, IEEE, pp 313–320

Lee VE, Ruan N, Jin R, Aggarwal C (2010) A survey of algorithms for dense subgraph discovery. In: Managing and Mining Graph Data, Springer, pp 303–336

van Leeuwen M, De Bie T, Spyropoulou E, Mesnage C (2016) Subjective interestingness of subgraph patterns. Machine Learning 105(1):41–75

Lekha DS, Balakrishnan K (2020) Central attacks in complex networks: A revisit with new fallback strategy. Physica A: Statistical Mechanics and its Applications 549:124347

Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD) 1(1):2–es

Ruhnau B (2000) Eigenvector-centrality—a node-centrality? Soc Networks 22(4):357–365

Sahar S (1999) Interestingness via what is not interesting. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, USA, p 332–336, DOI 10.1145/312129.312272, URL https://doi.org/10.1145/312129.312272