On The Network You Keep: Analyzing Persons of Interest using Cliqster

Abstract

Our goal is to determine the structural differences between different categories of networks and to use these differences to predict the network category. Existing work on this topic has looked at social networks such as Facebook, Twitter, co-author networks etc. We, instead, focus on a novel data set that we have assembled from a variety of sources, including law-enforcement agencies, financial institutions, commercial database providers and other similar organizations. The data set comprises networks of "persons of interest" with each network belonging to different categories such as suspected terrorists, convicted individuals etc. We demonstrate that such "anti-social" networks are qualitatively different from the usual social networks and that new techniques are required to identify and learn features of such networks for the purposes of prediction and classification. We propose Cliqster, a new generative Bernoulli process-based model for unweighted networks. The generating probabilities are the result of a decomposition which reflects a network's community structure. Using a maximum likelihood solution for the network inference leads to a least-squares problem. By solving this problem, we are able to present an efficient algorithm for transforming the network to a new space which is both concise and discriminative. This new space preserves the identity of the network as much as possible. Our algorithm is interpretable and intuitive. Finally, by comparing our research against the baseline method (SVD) and against a state-of-the-art Graphlet algorithm, we show the strength of our algorithm in discriminating between different categories of networks.


Saber Shokat Fadaee · Mehrdad Farajtabar · Ravi Sundaram · Javed A. Aslam · Nikos Passas


A preliminary version of this paper appeared in Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining [1].

S. Shokat Fadaee, College of Computer and Information Science, Northeastern University
M. Farajtabar, College of Computing, Georgia Institute of Technology
R. Sundaram, College of Computer and Information Science, Northeastern University
J. A. Aslam, College of Computer and Information Science, Northeastern University
N. Passas, School of Criminology and Criminal Justice, Northeastern University

Keywords

Social network analysis · Persons of interest · Community structure

[...] $G_1, G_2, \dots, G_n$ and another graph $G_m$. We would like to find out which graph has the most similar structure to $G_m$, and whether $G_m$ can be used to reconstruct any of those graphs.

Rather than studying individuals through popular social networks (such as Twitter, Facebook, etc.), the presented research is based on a new data-set which has been collected through law-enforcement agencies, financial institutions, commercial databases and other public resources. Our data-set is a collection of networks of persons of interest. This approach of building networks from public resources has been successful because it is often easier to infer the connections among individuals from widely available resources than through the private activities of specific individuals.

[...].89% accuracy for the page rank and 40.61% accuracy for the graph degree. This justifies the quest for new techniques to identify features in the underlying structure of the networks that will enable accurate classification of their categories.

1.3 Our contributions

After performing experiments with decomposition methods (and their variants) from existing literature, we finally discovered a novel technique we call Cliqster, based on decomposing the network into a linear combination of its maximal cliques, similar to the Graphlet decomposition [11] of a network. We compare Cliqster against the traditional SVD (Singular Value Decomposition) as well as state-of-the-art Graphlet methods. The most important yardstick of comparison is the discriminating power of the methods. We find that Cliqster is superior to Graphlet and significantly superior to SVD in its discriminating power, i.e., in its ability to distinguish between different categories of persons of interest. Efficiency is another important criterion and comprises both the speed of the inference algorithm and the size of the resulting representation. Both the algorithm speed and the model size are closely tied to the dimension of the bases used in the representation. Here again, the dimension of the Cliqster bases was smaller than that of the Graphlet bases in a majority of the categories and substantially smaller than that of SVD in all the categories. A third criterion is the interpretability of the model. By using cliques, Cliqster naturally captures interactions between groups or cells of individuals and is thus useful for detecting subversive sets of individuals with the potential to act in concert.


In summary, we provide a new generative statistical model for networks with an efficient inference algorithm. Cliqster is computationally efficient and intuitive, and gives interpretable results. We have also created a new and comprehensive data-set gathered from public and commercial records that has independent value. Our findings validate the promise of statistics-based technologies for categorizing and drawing inferences about sub-networks of people entirely through the structure of their network.

The remaining part of the paper is organized as follows. In §2, we briefly introduce related work. [...] In §4, experimental results are presented, demonstrating the effectiveness of our algorithm in finding an appropriate and discriminating representation of a social network's structure. At the end of this section, we present a comprehensive discussion of observations regarding the dataset. [...]

Significant attention has been given to the approach of studying criminal activity through an analysis of social networks [12], [13], and [14]. [12] discovered that two-thirds of criminals commit crimes alongside another person. [13] demonstrated that charting social interactions can facilitate an understanding of criminal activity. [14] investigated the importance of weak ties to interpret criminal activity.

Statistical network models have also been widely studied in order to demonstrate interactions among people in different contexts. Such network models have been used to analyze social relationships, communication networks, publishing activity, terrorist networks, and protein interaction patterns, as well as many other huge data-sets. [15] considered random graphs with a fixed number of vertices and studied the properties of this model as the number of edges increases. [16] studied a related version in which every edge had a fixed probability p of appearing in a network. Exchangeable random graphs [17] and exponential random graphs [18] are other important models. The authors of [19] created a toolbox to resolve duplicate nodes in a social network.

The problem of finding the roles of a person in a network has been widely studied. [20] takes a link-based approach to this problem. [21] studied how to identify a group of vertices that can mutually verify each other. The relationship between social roles and the diffusion process in a social network is studied in [22]. [23] combines the problems of capturing uncertainty over the existence of edges, uncertainty over attribute values of nodes, and identity uncertainty.
[24] uses an unsupervised method to solve the problem of discovering the roles of a node in a network. [25] studied how the network characteristics reflect the social situation of users in an online society. [26] studies the role discovery problem under the assumption that nodes with similar structural patterns belong to the same role. The difference between our work and the works of [24], [25], [26] and similar works such as [27], [28], [29] is that they are interested in the roles of a node in a specific network, while we are interested in studying the structural differences among different networks. In this work, we assume all the nodes in a network have the same role/job. Despite the various applications of finding the roles of different sub-networks in a graph, this problem has only received a limited amount of attention. In this paper we study the role discovery problem for a network.

Recently researchers have become interested in stochastic block-modeling and latent graph models [30,31,32]. These methods attempt to analyze the latent community structure behind interactions. Instead of modeling the community structure of the network directly, we propose a simple stochastic process based on a Bernoulli trial for generating networks. We implicitly consider the community structure in the network generating model through a decomposition and projection onto the space of baseline communities (cliques in our model). For a comprehensive review of statistical network models we refer interested readers to [33].

Formerly, Singular Value Decomposition was used for the decomposition of a network [34,35,36]. However, since SVD basis elements are not interpretable in terms of community structure, it cannot capture the notion of social information we are interested in quantifying.
The authors of [11] introduced the Graphlet decomposition of a weighted network; by abandoning the orthogonality constraint they were able to gain interpretability. The resulting method works with weighted graphs; however, alternate techniques, such as power graphs (which involve powering the adjacency matrix of a graph to obtain a weighted graph), need to be used in order to apply this method to unweighted graphs such as (most) social networks.

[...] $n$ nodes in the network (for example, $n = 10$ in Figure 1). Consider $Y$ as an $n \times n$ matrix representing the connectivity in the network: $Y(r, s) = 1$ if node $r$ is connected to node $s$, and 0 otherwise.

Fig. 1: Network of ten people

In Cliqster, the generative model for the network is:

$$Y = \mathrm{Bernoulli}(Z) \tag{1}$$

which means $Y(r, s) = Y(s, r) = 1$ with probability $Z(r, s)$, and $Y(r, s) = Y(s, r) = 0$ with probability $1 - Z(r, s)$, for all $r > s$. Since the graph is undirected, the matrix $Z$ is lower triangular.

Inspired by PCA and SVD, in Cliqster we choose to represent $Z$ in a new space [34], [36]. Community structure is a key factor in understanding and analyzing a network, and because of this we are motivated to choose bases in a way that reflects the community structure [35]. Consequently, we decided to factorize $Z$ as

$$Z = \sum_{k=1}^{K} \mu_k B_k \tag{2}$$

where $K$ is the number of maximal cliques (bases), $B_k$ is the $k$-th lower triangular basis matrix, which represents the $k$-th maximal clique, and $\mu_k$ is its contribution to the network. In section 3.4 we elaborate on this basis selection process. From this point forward, we consider these bases as cliques of a network. We also represent a network in this new space. Each network is parameterized by the coefficients and the bases which construct $Z$, the network's generating matrix.

3.2 Inference

When given a network $Y$ of people and their connections, our goal is to infer the parameters generating this network. We must first assume the bases are selected as baseline cliques. The likelihood of the network parameters (coefficients) given the observation is:

$$L(\mu_{1:K}) = \prod_{r>s:\,Y(r,s)=1} Z(r, s) \prod_{r>s:\,Y(r,s)=0} \bigl(1 - Z(r, s)\bigr)$$

We estimate these parameters by maximizing their likelihood under the constraint $0 \le Z(r, s) \le 1$ for $r > s$. One can easily see the likelihood is maximized when $Z(r, s) = 1$ if $Y(r, s) = 1$ and $Z(r, s) = 0$ if $Y(r, s) = 0$. Therefore

$$Y = \sum_{k=1}^{K} \mu_k B_k \tag{3}$$

should be used for the lower triangle of $Y$. Unfolding the above equation results in

$$
\begin{aligned}
Y(2, 1) &= \mu_1 B_1(2, 1) + \dots + \mu_K B_K(2, 1) \\
Y(3, 1) &= \mu_1 B_1(3, 1) + \dots + \mu_K B_K(3, 1) \\
Y(3, 2) &= \mu_1 B_1(3, 2) + \dots + \mu_K B_K(3, 2) \\
&\;\;\vdots \\
Y(n, n-1) &= \mu_1 B_1(n, n-1) + \dots + \mu_K B_K(n, n-1)
\end{aligned}
$$

Define

$$\mu = (\mu_1, \dots, \mu_K)^\top, \qquad b_{rs} = (B_1(r, s), \dots, B_K(r, s))^\top \tag{4}$$

So the new objective function can be written as

$$J = \sum_{r>s} \bigl(\mu^\top b_{rs} - Y(r, s)\bigr)^2 \tag{5}$$

$J$ is convex with respect to $\mu$ under the constraints $0 \le \mu^\top b_{rs} \le 1$. This is essentially a constrained least squares problem, which can be solved through existing efficient algorithms [37], [38]. Through this formula, the representation parameters $\mu_{1:K}$ are thus computed easily and we are done with the inference procedure.

We turn our attention to the new representation and try to find an algorithm which can produce a more interpretable result. The exact generating parameters are no longer needed in our application. Therefore, by relaxing the constraints we are able to present a simple and very efficient algorithm. In addition, the solution to this unconstrained problem provides us with an intuitive understanding of what is happening behind this inference procedure. To determine the optimal parameters, we take the derivative with respect to $\mu$:

$$\frac{\partial J}{\partial \mu} = 2 \sum_{r>s} b_{rs} \bigl(b_{rs}^\top \mu - Y(r, s)\bigr) \tag{6}$$

By equating the above derivative to zero and rearranging, we obtain the solution

$$\mu = A^{-1} d \tag{7}$$

where

$$A = \sum_{r>s} b_{rs} b_{rs}^\top, \qquad d = \sum_{r>s} Y(r, s)\, b_{rs} \tag{8}$$

$A$ is a $K \times K$ matrix and $d$ is a $K \times 1$ vector. [...] $O(n^2)$ constraints. Despite this fact, we obtain very good results, and we will soon explain why this happens.

Our novel decomposition method finds $\mu$, which is used to represent a network and which could stand in for a network in network analysis applications. This representation is used in the next section in order to discriminate between different types of networks.

The results of the decomposition of the network presented in figure 1 are shown in table 1.

Table 1: $\mu$ within each cluster (columns: Cluster members, $\mu$)

$A$ and $d$ give you an intuition about the network. For further insight into this process, consider the matrix $A$. Every entry of this matrix is equal to the number of edges shared by the two corresponding cliques.
This matrix encodes the power relationships between baseline clusters, as a part of network reconstruction. The intersection between two bases shows how much one basis can overpower another basis as they are reconstructing a network. In contrast, $d$ presents the commonalities between a given network and its baseline communities. Through this equation, a community's contribution to a network is encoded.

With this interpretation in mind, the equation $A\mu = d$ is now more meaningful for understanding the significance of our new representation of a network. Consider multiplying the first row of the matrix by the vector $\mu$, which should be equal to $d_1$. In order to solve this equation, we have chosen our coefficients in such a way that when the intersections of cluster 1 with the other clusters are multiplied by their corresponding coefficients and added together, the result is the first cluster's contribution to the network construction.

3.4 Basis Selection

Users in a persons-of-interest network usually form associations in particular ways; thus, community structure is a good distinguishing factor for different networks. There are different structures that form a community. One of the interesting structures is the set of maximal cliques of that community. We use them as the bases of our method. There are many ways to compute the maximal cliques of a network. We use the Bron-Kerbosch algorithm [39] for identifying our network's communities. As mentioned in [11], this is one of the most efficient algorithms for identifying all of the maximal cliques in an undirected network. After applying the Bron-Kerbosch algorithm to figure 1, we identify the communities that are represented in table 1. The Bron-Kerbosch algorithm is described in algorithm 1.

Algorithm 1 Bron-Kerbosch algorithm

    C = ∅        ▷ We keep the maximal clique in C
    I = V(G)     ▷ The set of vertices that can be added to C
    X = ∅        ▷ The set of vertices that are connected to C but are excluded from it
    procedure Enumerate(C, I, X)
        if I == ∅ and X == ∅ then
            C is a maximal clique
        else
            for each vertex v in I do
                Enumerate(C ∪ {v}, I ∩ N(v), X ∩ N(v))
                I ← I \ {v}
                X ← X ∪ {v}

The Bron-Kerbosch algorithm has many different versions. We use the version introduced in [40].
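As an illustration of Algorithm 1, the following is a minimal Python sketch of the basic Bron-Kerbosch recursion. It is not the degeneracy-ordered variant of [40] used in the paper, and the adjacency-dictionary input, the function name `bron_kerbosch`, and the toy graph are assumptions made for this example only.

```python
def bron_kerbosch(adj):
    """Return all maximal cliques of an undirected graph.

    adj maps each vertex to the set of its neighbours (no self-loops).
    """
    cliques = []

    def enumerate_cliques(c, i, x):
        # c: current clique, i: candidate vertices, x: excluded vertices
        if not i and not x:
            cliques.append(sorted(c))  # c cannot be extended: it is maximal
            return
        for v in list(i):  # iterate over a snapshot, since i is updated below
            enumerate_cliques(c | {v}, i & adj[v], x & adj[v])
            i = i - {v}  # v has been fully explored ...
            x = x | {v}  # ... so exclude it from later branches

    enumerate_cliques(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical toy graph: a triangle 0-1-2 with a pendant edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(bron_kerbosch(toy))  # [[0, 1, 2], [2, 3]]
```

Each maximal clique returned here would correspond to one basis matrix $B_k$ in equation (2).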

One of the most successful aspects of this algorithm is that it provides a multi-resolution perspective of the network. This algorithm identifies communities at a variety of scales, which, as we will see, allows us to locate the most natural and representative set of coefficients and bases.

3.5 Complexity

The aforementioned inference equation requires $A$ and $d$ to be computed, which can be done in $O(m + n)$ time, where $m$ is the number of edges and $n$ is the number of nodes in the network. The least-squares solution requires $O(K^3)$ operations. A graph's degeneracy measures its sparsity and is the smallest value $f$ such that every nonempty induced subgraph of that graph contains a vertex of degree at most $f$ [41]. [40] proposed a variation of the Bron-Kerbosch algorithm which runs in $O(f n 3^{f/3})$, where $f$ is the network's degeneracy number. This is close to the best possible running time, since the largest possible number of maximal cliques in an $n$-vertex graph with degeneracy $f$ is $(n - f) 3^{f/3}$ [40].

A power law graph is a graph in which the number of vertices with degree $x$ is proportional to $x^{-\alpha}$, where $1 \le \alpha \le 3$. When $1 < \alpha \le 2$ we have $f = O(n^{1/(2\alpha)})$, and when $2 < \alpha < 3$ we have $f = O(n^{(3-\alpha)/4})$ [42]. Combining this with the running time $O(f n 3^{f/3})$ of the Bron-Kerbosch variant [40], we find the running time for finding all maximal cliques in a power law graph to be $2^{O(\sqrt{n})}$.

However, the maximum number of cliques in graphs based on real world networks is typically $O(\log n)$ [11].

In this section we investigate the properties of the new features we have learned about the network in question. Firstly, we introduce the new dataset we have built. Our experiments attempt to prove two claims:

1. the new representation is concise, and
2. it can discriminate between different network types.

We will now compare our results with the SVD and Graphlet decomposition algorithms [11].

4.1 Dataset

We have built a dataset by gathering and fusing information from a variety of public and commercial sources. Our final dataset comprised around 750,000 persons of interest with 3,000,000 connections among them. We then filtered this dataset to slightly less than 550,000 individuals who fell into one of the following 5 categories:

1. Suspicious Individuals: Persons who have appeared on sanctioned lists, been arrested or detained, but not been convicted of a crime.
2. Convicted Individuals: Persons who have been indicted, tried and convicted in a court of law.
3. Lawyers/Legal Professionals: Persons currently employed in a legal profession.
4. Politically Exposed Persons: Elected officials, heads of parties, or persons who currently hold or have previously held political positions.
5. Suspected Terrorists: Persons suspected of aiding, abetting or committing terrorist activities.

This dataset is publicly available at [9].

Table 2: Categories and their sizes, with the number of connected components and the density of each category

Category                      Members   Components   Density
Suspicious Individuals        316,990   77,811       0.0000180
Convicted Individuals         165,411   35,517       0.0000427
Lawyers/Legal Professionals   3,723     1,492        0.0006220
Politically Exposed Persons   13,776    4,947        0.0001533
Suspected Terrorists          31,817    5,016        0.0002068

The color scheme we use for our figures is as follows: red for Suspicious Individuals (SI), blue for Convicted Individuals (CI), brown for Lawyers/Legal Professionals (LL), orange for Politically Exposed Persons (PEPS), and black for Suspected Terrorists (ST).

4.2 Basic properties

We want to know whether our dataset has the common properties of social networks, i.e. a power-law degree distribution. The first thing to check is the degree distribution of each subnetwork, and whether it can be fitted to a power-law distribution. If the degree distributions in our subnetworks follow a power-law distribution, we have scale-free networks. We used the poweRlaw [43] and igraph [44] packages to calculate the maximum likelihood power law fit of the Legal subnetwork, and the results are shown in figure 2. It looks like a scale-free network, but we need to check this with more accurate measures. In a power-law distribution, $P(X = x)$ is proportional to $x^{-\alpha}$. The $\alpha$ of each subnetwork can be seen in table 3. Each of our subnetworks can be fitted to a power-law distribution, so all of them are scale-free networks. However, these networks are not small-world networks. The number of connected components in each network indicates that, starting at a given node, it is impossible to reach most of the other nodes in that network.

Fig. 2: The cumulative distribution function and its maximum likelihood power law fit for the Legal subnetwork (x-axis: Neighbors; y-axis: CDF)

Table 3: Alpha, the exponent of the fitted power-law distribution in each category

Category                      α
Suspicious Individuals        1.838563
Convicted Individuals         1.733839
Lawyers/Legal Professionals   2.977307
Politically Exposed Persons   3.107326
Suspected Terrorists          1.770715

[...] 1,000 vertices as a sample. We then analyze this data, repeat this operation 1,000 times, and represent the data's average with bold lines in the following graphs. All figures also show a margin of two standard deviations around this average, which we illustrate through lines in a lighter shade of the same color. We analyzed this data with three different methods: the Singular Value Decomposition, Graphlet Decomposition, and our own proposed model.

Fig. 3: Number of bases and amplitude of coefficient for Convicted Individuals using SVD (x-axis: Coefficient Index; y-axis: Amplitude of Coefficient)

4.4 Singular Value Decomposition

We first analyzed our data using the Singular Value Decomposition method [34]. Figure 3 shows the effective number of non-zero coefficients for this algorithm. Figure 4 demonstrates the ability of this algorithm to discriminate between two different categories. Finally, the ability of the algorithm to distinguish between the 5 categories is illustrated in figure 5. The average number of bases we observed in the samples of 1,000 vertices is around 800, as can be seen in figures 3, 4 and 5.

4.5 Graphlet Decomposition

We next performed the same tests using Graphlet Decomposition. Figure 6 demonstrates the effective number of non-zero coefficients for this algorithm. Figure 7 shows the ability of this algorithm to discriminate between two different types of networks. The algorithm's ability to distinguish between the 5 categories is again illustrated in figure 8. As can be seen in these figures, the number of basis elements for Graphlet Decomposition is around 20.

Fig. 4: Comparison of coefficients between Terrorist sub networks and Legal sub networks using SVD (panel: Terrorism vs Legal; legend: ST, LL; x-axis: Coefficient Index; y-axis: Amplitude of Coefficient)

Fig. 5: The ability of the SVD method to distinguish between different categories of networks (panel: ALL; legend: SI, CI, PEPS, ST, LL; x-axis: Coefficient Index; y-axis: Amplitude of Coefficient)

4.6 Cliqster

Finally, we performed the same tests using our method. We first determined appropriate bases using the Bron-Kerbosch algorithm. We then computed $A$ and $d$. The new representation for a sample network of one category that resulted from our new method is shown in Figure 9. Figure 10 shows the ability of our algorithm to discriminate between two different types of networks. Our new algorithm's ability to distinguish between the 5 categories is illustrated in Figure 11, which also shows that the number of basis elements for Cliqster is around 50.

Fig. 6: Number of bases and amplitude of coefficient for Convicted Individuals using Graphlet Decomposition Algorithm (x-axis: Coefficient Index; y-axis: Amplitude of Coefficient)

4.7 Performance

We analyzed the time complexity of Cliqster in section 3.5. We now check whether the empirical results agree with our theory. For the Convicted Individuals subnetwork we ran both our method and SVD using the igraph package in R. The performance of the Graphlet method is very similar to that of Cliqster, so we do not include it in this experiment.

We ran our experiment on an "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz (8 CPUs), 3.4GHz" processor with [...] of memory. As you can see in figure 12, as we grow the sample size our method performs twice as fast as the SVD method.
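The Cliqster pipeline used in these experiments (clique bases, then $A$ and $d$, then $\mu = A^{-1} d$ from equations (7) and (8)) can be sketched in plain Python as follows. This is a hedged illustration, not the authors' implementation, which the text says was run in R: the function names, the 0-based node labels, the toy graph, and the use of Gaussian elimination to solve $A\mu = d$ are all assumptions.

```python
from itertools import combinations


def solve(a, d):
    """Solve a x = d by Gaussian elimination with partial pivoting."""
    k = len(d)
    m = [row[:] + [rhs] for row, rhs in zip(a, d)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("A is singular: clique bases are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * k
    for r in reversed(range(k)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (m[r][k] - s) / m[r][r]
    return x


def cliqster_coefficients(n, edges, cliques):
    """Unconstrained least-squares coefficients mu = A^{-1} d."""
    edge_set = {frozenset(e) for e in edges}
    clique_sets = [set(c) for c in cliques]
    pairs = [frozenset(p) for p in combinations(range(n), 2)]  # all node pairs
    k = len(cliques)
    # b[p][j] = 1 when both endpoints of pair p lie in clique j, i.e. B_j(r, s) = 1
    b = [[1.0 if p <= c else 0.0 for c in clique_sets] for p in pairs]
    y = [1.0 if p in edge_set else 0.0 for p in pairs]
    # A = sum_p b_p b_p^T and d = sum_p y_p b_p, as in equation (8)
    a = [[sum(row[i] * row[j] for row in b) for j in range(k)] for i in range(k)]
    d = [sum(yv * row[i] for yv, row in zip(y, b)) for i in range(k)]
    return solve(a, d)


# Hypothetical toy graph: two triangles {0,1,2} and {1,2,3} sharing edge (1, 2).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
mu = cliqster_coefficients(4, edges, [[0, 1, 2], [1, 2, 3]])
print(mu)
```

In this toy graph the two cliques share one edge, so the off-diagonal entry of $A$ is 1 and both coefficients come out at approximately 0.75. The reconstructed value on the shared edge is then about 1.5, which illustrates that dropping the constraint $0 \le \mu^\top b_{rs} \le 1$ can produce values outside $[0, 1]$.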

Fig. 7: Comparison of coefficients between Terrorist sub networks and Legal sub networks using Graphlet Decomposition Algorithm (panel: Terrorism vs Legal; legend: ST, LL; x-axis: Coefficient Index; y-axis: Amplitude of Coefficient)

4.8 Distinguishability

In order to compare the ability of each of these methods to distinguish between different types of social networks, we sampled 100 networks from each category, combined all of these samples, ran the K-means clustering algorithm (with 5 as the number of clusters), and repeated this procedure 100 times. We used each network's 20 largest coefficients, and want to know whether the coefficients of different sub-networks can be distinguished from each other. We gave the combined coefficients of all sub-networks to the K-means clustering algorithm as input, and calculated the mean clustering error. As you can see in table 4, our method often returns the bases with the best ability to distinguish between the types of social network presented. The Graphlet Decomposition slightly outperforms our method in two of the sub-networks, and such a difference is negligible in practice.

SICI

ALL

Coefficient Index A m p li t ude o f C oe ff i c i en t . . . . PEPSSTLL

Fig. 8: The ability of Graphlet method to distinguish between diﬀerentcategories of networksTable 4: Mean error of clustering with 20 coeﬃcients ( µ ) Category SVD Graphlet Cliqster

SI 0.51461

LL 0.75006 0.10931

PEPS 0.66082 0.12195

ST 0.65381 k − nearest neighborsalgorithm (or k − N N for short). k − N N is a non-parametric method that isused for classiﬁcation in a supervised setting. Let’s assume we want to comparethe features that are used to distinguish between these two groups: Suspicious

Convicted Individual

Coefficient Index A m p li t ude o f C oe ff i c i en t . . . . . . Fig. 9: Number of bases and amplitude of coeﬃcient for ConvictedIndividuals using Cliqster

STLL

Terrorism vs Legal

Coefficient Index A m p li t ude o f C oe ff i c i en t . . . . . . . Fig. 10: Comparison of coeﬃcients between Terrorist sub networks and Legalsub networks using Cliqster n The Network You Keep: Analyzing Persons of Interest using Cliqster 19

Fig. 11: The ability of Cliqster to distinguish between diﬀerent categories ofnetworks

Performance

Sample Size E l ap s ed t i m e i n s e c ond s CliqsterSVD

Fig. 12: Comparison of performance between Cliqster and SVD

Suspicious Individuals versus Convicted Individuals

Size of training set A cc u r a cy Fig. 13: The accuracy of community detection based on the training sizeIndividuals and Convicted Individuals. We train Cliqster with samples of size1 ,

000 that are randomly selected from both communities, gather the featuresand repeat this operation 1 ,

000 times. After that we run k-NN with k = 3 and a test set of size 100. To avoid ties in binary classification, k must be odd. With k = 3, each test network is assigned the majority label among its three nearest neighbors in the training set. We also make sure that the training and test sets share no members, so that the evaluation does not reward over-fitting.

Figure 13 shows the result of this experiment. With a training set of size 40, k-NN learns to distinguish between these two groups with an accuracy of 97%.

Things are somewhat different when we compare the Lawyers/Legal professionals network with the Politically Exposed Persons network. As Figure 14 shows, we need a training set of size 100 to reach an accuracy of 74%. This difference suggests a contrast between the characteristics of these networks. According to Cliqster, the network structures of Lawyers/Legal professionals and Politically Exposed Persons have more in common than the network structures of Suspicious Individuals and Convicted Individuals.

Fig. 14: The accuracy of community detection based on the training size (Lawyer/Legal professionals versus PEPS; x-axis: size of training set, y-axis: accuracy)

If we compare the network structure of Suspected Terrorists with that of Convicted Individuals, we reach 100% accuracy with a training set of size around 20; k-NN classifies these two groups with no error (Figure 15). We then compare the network structures of Suspected Terrorists and Politically Exposed Persons (Figure 16). With a training set of size 50, we reach 99% accuracy.

Fig. 15: The accuracy of community detection based on the training size (Suspected Terrorists versus Convicted Individuals; x-axis: size of training set, y-axis: accuracy)

Fig. 16: The accuracy of community detection based on the training size (Suspected Terrorists versus PEPS; x-axis: size of training set, y-axis: accuracy)

4.10 Discussion

Figures 3, 6, and 9 compare the ability of the three methods to compress data. These graphs demonstrate that the SVD method is inefficient for summarizing a network's features. They also show that the Graphlet method produces the smallest feature space. Our representation is also very small, however, and the difference in size between the two is negligible in real-world applications. Earlier we demonstrated that the 20 largest coefficients in the representation produced by our method are sufficient to outperform the Graphlet algorithm in terms of distinguishability and clustering.

Figures 4, 7, and 10 demonstrate the ability of the algorithms to distinguish between two selected categories. When we compare our method with the SVD and Graphlet Decomposition methods, the coefficients produced by our method appear very similar to those produced by the SVD method; nevertheless, our method performs as well as the Graphlet Decomposition method in distinguishing between two types of networks. This demonstrates that community structure is a natural basis for interpreting social networks. By decomposing a network into cliques, our method provides an efficient transformation that is concise and easier to analyze than SVD bases, which are constrained by their requirement to be orthogonal. Figures 5, 8, and 11 verify these claims for all 5 categories.

Table 4 demonstrates the ability of our algorithm to summarize each network consistently according to its category. We then clustered all coefficients using k-means. Through this process, it became clear that the SVD method could not identify the category of the network being analyzed. From this we infer that selecting the community structure (cliques) as bases considerably improves our ability to identify a network. Our proposed algorithm was also more accurate in clustering than the Graphlet Decomposition algorithm. This suggests that the Bernoulli distribution (as used in the seminal work of Erdős and Rényi) is a simpler and more natural process for generating networks. Our proposed method is also easier to interpret and does not run the risk of getting stuck in local minima like the Graphlet method.

Finally, Figures 13, 14, 15, and 16 demonstrate the ability of k-NN to classify features produced by Cliqster in binary classification settings. They also offer some insight into the similarities and differences between the network structures of different groups.
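The k-NN protocol used in these binary experiments (k = 3, disjoint training and test sets, accuracy measured as a function of training-set size) can be illustrated with a minimal sketch. The function names and the synthetic Gaussian feature vectors below are assumptions made purely for illustration; they stand in for the Cliqster coefficient vectors and do not reproduce the data or the accuracies reported above.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(features, labels, train_size, test_size, k=3, seed=0):
    """Accuracy of k-NN on one random split with disjoint training and test sets."""
    rng = random.Random(seed)
    idx = list(range(len(features)))
    rng.shuffle(idx)
    train_idx = idx[:train_size]
    # The test slice starts where the training slice ends, so the two never overlap.
    test_idx = idx[train_size:train_size + test_size]
    train_x = [features[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    hits = sum(
        knn_predict(train_x, train_y, features[i], k) == labels[i] for i in test_idx
    )
    return hits / len(test_idx)


def make_toy_data(n_per_class=150, dim=20, shift=3.0, seed=1):
    """Hypothetical stand-in for coefficient vectors of two network categories."""
    rng = random.Random(seed)
    features, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            features.append([rng.gauss(label * shift, 1.0) for _ in range(dim)])
            labels.append(label)
    return features, labels


features, labels = make_toy_data()
for train_size in (10, 20, 40, 100):
    acc = knn_accuracy(features, labels, train_size, test_size=100)
    print(f"train={train_size:3d}  accuracy={acc:.2f}")
```

Repeating `knn_accuracy` over many seeds and averaging would give a smoother accuracy curve; the single split here is kept only for brevity.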

After proposing Cliqster, a new generative model for decomposing random networks, we applied this method to our new dataset of persons of interest. Our primary discovery in this research has been that a variant of our decomposition method provides a statistical test capable of accurately discriminating between different categories of social networks. The resulting method is both accurate and efficient. We created a similar discriminant based on the traditional Singular Value Decomposition and Graphlet methods, and found that they are not capable of discriminating between social network categories. Our research also demonstrates that community structure, in the form of cliques, is a natural choice for bases. This choice allows for a high degree of compression while preserving the identity of the network very well. The new representation produced by our method is concise and discriminative.

Comparing the three methods, we found that the dimensions of the Graphlet bases and our bases were significantly smaller than those of the SVD bases, while also accurately identifying the category of the network being analyzed. Therefore, our method is an extremely accurate and efficient means of identifying different network types.

On the non-technical side, we would like to see how we can get law-enforcement agencies to adopt our methods. There are a number of directions for further research on the technical front. We would like to extend our simple, intuitive algorithm to weighted networks, such as networks with an edge-generating process based on the Gamma distribution. A problem with the maximum likelihood solution for a network is that it is subject to over-fitting or biased estimation. Adding a regularization term would adjust for this discrepancy. A natural choice for such a term would be a sparse regularization, which is in accordance with real social networks. Extensive possibilities for future work also exist in incorporating prior knowledge into Cliqster by using Bayesian inference. Another natural avenue for further investigation is to consider how Cliqster can be adapted to regular social networks.
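To make the regularization idea concrete, the following is a minimal sketch of a sparse (L1-penalized) least-squares fit solved by cyclic coordinate descent. The text above does not specify the form of the regularizer or the solver, so both are assumptions here, as are the function names and the toy design matrix, whose columns are meant only as hypothetical stand-ins for clique indicators over node pairs.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal map of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(A, b, lam, n_sweeps=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by cyclic coordinate descent."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    residual = list(b)  # residual = b - A x, and x starts at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_sweeps):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot explain anything
            # Correlation of column j with the residual that ignores x[j]'s own contribution.
            rho = sum(A[i][j] * (residual[i] + A[i][j] * x[j]) for i in range(n_rows))
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_xj - x[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= A[i][j] * delta
                x[j] = new_xj
    return x


# Hypothetical toy problem: rows are node pairs, columns are candidate cliques,
# and b holds the observed adjacency entries for those pairs.
A = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
]
b = [1.0, 1.0, 1.0, 0.0]

for lam in (0.0, 0.5, 2.0):
    coeffs = lasso_coordinate_descent(A, b, lam)
    print(lam, [round(c, 3) for c in coeffs])
```

Setting `lam` to zero recovers the unregularized least-squares fit, while larger values drive more coefficients exactly to zero, which is the sense in which such a penalty favors sparse representations.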

Acknowledgment

The authors would like to thank Hossein Azari Soufiani for his comments on different aspects of this work.

References

1. S. Shokat Fadaee, M. Farajtabar, R. Sundaram, J. Aslam, and N. Passas, “The network you keep: Analyzing persons of interest using cliqster,” in Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on
10. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. New York, NY, USA: Cambridge University Press, 2010.
11. H. Azari Soufiani and E. M. Airoldi, “Graphlets decomposition of a weighted network,” Journal of Machine Learning Research, 2012.
12. A. Reiss, “Understanding changes in crime rates,” in Crime and Justice: A Review of Research, vol. 10, 1980.
13. E. L. Glaeser, B. Sacerdote, and J. A. Scheinkman, “Crime and social interactions,” The Quarterly Journal of Economics, vol. 111, no. 2, pp. 507–48, May 1996.
14. E. Patacchini and Y. Zenou, “The strength of weak ties in crime,” European Economic Review, vol. 52, no. 2, pp. 209–236, 2008.
15. P. Erdős and A. Rényi, “On random graphs,” Publicationes Mathematicae Debrecen, vol. 6, pp. 290–297, 1959.
16. E. N. Gilbert, “Random graphs,” The Annals of Mathematical Statistics, vol. 30, no. 4, pp. 1141–1144, 1959.
17. E. M. Airoldi, “Bayesian mixed-membership models of complex and evolving networks,” DTIC Document, Tech. Rep., 2006.
18. G. Robins, P. Pattison, Y. Kalish, and D. Lusher, “An introduction to exponential random graph (p*) models for social networks,” Social Networks, vol. 29, no. 2, pp. 173–191, 2007, special Section: Advances in Exponential Random Graph (p*) Models.
19. M. Bilgic, L. Licamele, L. Getoor, and B. Shneiderman, “D-dupe: An interactive tool for entity resolution in social networks,” in Visual Analytics Science and Technology (VAST), Baltimore, October 2006.
20. G. Barta, “A link-based approach to entity resolution in social networks,” CoRR, vol. abs/1404.3017, 2014.
21. Y.-C. Lo, J.-Y. Li, M.-Y. Yeh, S.-D. Lin, and J. Pei, “What distinguish one from its peers in social networks?” Data Mining and Knowledge Discovery, vol. 27, no. 3, pp. 396–420, 2013.
22. Y. Yang, J. Tang, C. W.-k. Leung, Y. Sun, Q. Chen, J. Li, and Q. Yang, “Rain: Social role-aware information diffusion,” 2014.
23. W. E. Moustafa, A. Kimmig, A. Deshpande, and L. Getoor, “Subgraph pattern matching over uncertain graphs with identity linkage uncertainty,” CoRR, vol. abs/1305.7006, 2013.
24. K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, and L. Li, “Rolx: Structural role extraction & mining in large graphs,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’12. New York, NY, USA: ACM, 2012, pp. 1231–1239.
25. Y. Zhao, G. Wang, P. S. Yu, S. Liu, and S. Zhang, “Inferring social roles and statuses in social networks,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’13. New York, NY, USA: ACM, 2013, pp. 695–703.
26. R. A. Rossi and N. K. Ahmed, “Role discovery in networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 99, no. PrePrints, p. 1, 2014.
27. K. Li, S. Guo, N. Du, J. Gao, and A. Zhang, “Learning, analyzing and predicting object roles on dynamic networks,” in Data Mining (ICDM), 2013 IEEE 13th International Conference on, Dec 2013, pp. 428–437.
28. S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social networks,” CoRR, vol. abs/1101.3291, 2011.
29. H. Xu, Y. Yang, L. Wang, and W. Liu, “Node classification in social network via a factor graph model,” in Advances in Knowledge Discovery and Data Mining, ser. Lecture Notes in Computer Science, J. Pei, V. Tseng, L. Cao, H. Motoda, and G. Xu, Eds. Springer Berlin Heidelberg, 2013, vol. 7818, pp. 213–224.
30. K. Nowicki and T. A. B. Snijders, “Estimation and prediction for stochastic blockstructures,” Journal of the American Statistical Association, vol. 96, no. 455, pp. 1077–1087, 2001.
31. E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed membership stochastic blockmodels,” Journal of Machine Learning Research, 2008.
32. B. Karrer and M. E. Newman, “Stochastic blockmodels and community structure in networks,” Physical Review E, vol. 83, no. 1, p. 016107, 2011.
33. A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi, “A survey of statistical network models,” ArXiv e-prints, Dec 2009.
34. F. R. K. Chung, “Spectral graph theory,” American Mathematical Society, 1997.
35. P. Hoff, “Multiplicative latent factor models for description and prediction of social networks,” Computational & Mathematical Organization Theory, vol. 15, no. 4, pp. 261–272, 2009.
36. M. Kim and J. Leskovec, “Multiplicative attribute graph model of real-world networks,” Internet Mathematics, vol. 8, no. 1-2, pp. 113–160, 2012.
37. C. L. Lawson and R. J. Hanson, Solving least squares problems. SIAM, 1974, vol. 161.
38. S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
39. C. Bron and J. Kerbosch, “Finding all cliques of an undirected graph,” Communications of the ACM, 1973.
40. D. Eppstein and D. Strash, “Listing all maximal cliques in large sparse real-world graphs,” in Experimental Algorithms. Springer, 2011, pp. 364–375.
41. D. R. Lick and A. T. White, “k-degenerate graphs,” Canad. J. Math., vol. 22, pp. 1082–1096, 1970.
42. A. Buchanan, J. Walteros, S. Butenko, and P. Pardalos, “Solving maximum clique in sparse graphs: an $O(nm + n2^{d/4})$ algorithm for d-degenerate graphs,” Optimization Letters, 2013.
43. C. S. Gillespie, “Fitting heavy tailed distributions: The poweRlaw package,” Journal of Statistical Software, vol. 64, no. 2, pp. 1–16, 2015.
44. G. Csardi and T. Nepusz, “The igraph software package for complex network research,”