A Synthetic Network Generator for Covert Network Analytics
Amr Elsisy, Aamir Mandviwalla, Boleslaw Szymanski, Thomas Sharkey
Amr Elsisy a,b, Aamir Mandviwalla a,b, Boleslaw K. Szymanski a,b,∗, Thomas Sharkey b,c
a Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
b Network Science and Technology Center, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
c Industrial and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
Abstract
We study social networks and focus on covert (also known as hidden) networks, such as terrorist or criminal networks. Their structures, memberships, and activities are illegal. Thus, data about covert networks is often incomplete and partially incorrect, making it challenging to interpret the structures and activities of such networks. For legal reasons, real data about active covert networks is inaccessible to researchers. To address these challenges, we introduce here a network generator for synthetic networks that are statistically similar to a real network but void of personal information about its members. The generator uses statistical data about a real or imagined covert organization network. It generates randomized instances of the Stochastic Block Model of the network groups but preserves the organizational structure of this network. The direct use of such anonymized networks is for training research and analytical tools for finding the structure and dynamics of covert networks. Since these synthetic networks differ in their sets of edges and communities, they can be used as a new source for network analytics. First, they provide alternative interpretations of the data about the original network. The distribution of probabilities for these alternative interpretations enables new network analytics. The analysts can find community structures which are frequent, and therefore stable under perturbations. They may also analyze how the stability changes with the strength of perturbation. For covert networks, the analysts can quantify statistically expected outcomes of interdiction. This kind of analytics applies to all complex networks in which the data are incomplete or partially incorrect.

Keywords: Social networks, Random weighted network generator, Network structure stability, Covert networks, Hierarchical networks

∗ Corresponding author
Email address: [email protected] (Boleslaw K. Szymanski)
Preprint submitted to Elsevier, August 12, 2020
1. Introduction
The randomized generation of synthetic networks is often used for testing which properties of a given network depend on its structure and features (Barabási et al., 2000; Elsisy et al., 2019). Such a use has become popular since Erdős and Rényi introduced the random network model (Erdős and Rényi, 1959). It generates an edge between each pair of nodes with a fixed probability p. The resulting networks are highly random, as expected. Yet, despite the interesting mathematical properties of this model, later research discovered that random networks rarely arise in nature or engineering practice. The more advanced models proposed later include the scale-free network model (Barabási and Albert, 1999) and its variants that represent the structures of many social and naturally arising networks. Another one is the Stochastic Block Model (SBM) (Holland et al., 1983), which extends the random network model by grouping nodes into blocks. The probability of an edge between a pair of nodes is now determined by the probability of connection between the blocks to which the nodes belong. This model produces community structures resembling those arising in real networks but limits the differences between the degrees of individual nodes located in the same block. This weakness is addressed by SBM variants, such as the degree-corrected SBM (Karrer and Newman, 2011) or the degree-corrected planted partition model (Newman, 2016). The Lancichinetti-Fortunato-Radicchi (LFR) benchmark (Lancichinetti et al., 2008) generates synthetic networks with the desired heterogeneity of the node degrees and community size distribution. The generated networks are customized to mimic real networks using such parameters as power law exponents for the node degrees, the community size, and the density of edges within the same community. These networks are often used to test community detection methods. A model presented in (Xiao et al., 2020) generates synthetic networks by rewiring edges in real-life networks. The more edges are rewired, the blurrier the communities become, allowing only the increasingly more stable communities to prevail. Another generative model creates hierarchical multi-layer networks that are often used to represent hierarchical management structures (Ravasz and Barabási, 2003).

In general, we are interested in computer-based social networks that have been supported by the growing number of computer platforms and companies. The activities in such networks generate massive amounts of data, to which, however, access is increasingly limited due to privacy concerns. We are particularly interested in covert (a.k.a. hidden) networks, such as terrorist and criminal networks. The common trait of such networks is that they try to hide their membership, structure, and illegal activities. The membership in such networks is often secretive. Their essential interactions are covert, so their non-essential interactions seem overly visible, making essential ones difficult to observe. Law enforcement in most countries is limited in its means of collecting data by privacy laws protecting citizens from unjustified surveillance. As a result, the data about covert networks is often incomplete and partially incorrect. This creates a challenge in interpreting or discerning the structure and activities of such networks. An additional challenge arises from the inaccessibility to researchers of data about networks under investigation. Such data becomes publicly available only for the fraction of networks whose members were prosecuted in court.

The network generator introduced here addresses these challenges. It is designed to create a set of synthetic networks, each structurally similar to a real network, but with anonymous nodes that are interconnected or clustered slightly differently than the nodes in the original network. The direct use of such anonymized networks is to train research and analytical tools for finding the structure and dynamics of covert networks. However, the synthetic networks also provide alternative interpretations of the data about the original network. The distribution of probabilities for these alternative interpretations enables users to quantify statistically expected outcomes of operations on the covert networks. Another use of the alternative interpretations is to decide if the data collected for the given covert network is sufficient for getting a reliable interpretation of this network's structure and dynamics. A high frequency of alternative interpretations of the original network structure makes them likely candidates for the ground truth structure. Deciding which interpretation is most likely the true structure of the original network may require collecting additional data.

The synthetic networks created by our generator can also be useful for analyzing social networks with partial information about their structures, and bio-medical networks in which the massive collection of experimental data about network dynamics makes the data partially incomplete and incorrect.
2. A model of organizations represented by covert networks
Discovery and monitoring of covert networks often rely on getting access to information flows among nodes suspected to be involved in network activities. This flow may involve wiretapped telephone interactions, message exchanges, recorded conversations, or copies of written documents. We refer to an organization represented by such a network as covert and to the nodes as members. Using community detection, we discover what we call groups of members who interact among themselves more often than with other members. We prefer this term over communities, which usually denotes nodes having casual relationships, or clusters, which refers to subsets of nodes with similar values for selected attributes. In small organizations, groups could be independent of the organizational structure of the network, but more often they have a hierarchical structure for management nodes. The number of hierarchy levels depends on the organization size. Our model represents each group by parameters such as the average density of edges inside the group, the average density of edges from this group to other groups, and the placements of group members in the organization hierarchy. The model uses parameters to individualize the connections of a group leader to members of its group and, separately, to other nodes. Given an original network, the generator individualizes it by randomly rewiring edges of its groups and its management hierarchy (Ravasz and Barabási, 2003). As a result, our model combines two network models: SBM for groups and a hierarchical network for the management structure.

Most of the previously proposed synthetic network generators rely on only one generative network model. For example, in (Shang et al., 2020) the authors propose a network generator that creates synthetic multi-layered networks. A network generator presented in (Guo and Kraines, 2009) implements small-world social networks with the desired high clustering coefficient. Another network generator described in (Benyahia et al., 2016) creates dynamic networks with the prescribed community structure.
3. Data
To present our generator in action, we use the Caviar and Ciel datasets (Morselli, 2009). The former defines a drug smuggling network, using data collected from 1994 to 1996 by an investigation of the West End Gang in Montreal, Canada. This gang was active in trafficking hashish and cocaine. During the investigation, police repeatedly confiscated shipments of drugs, but made no arrests until the investigation ended. The collected data mainly consists of the lists of phone calls made among gang members. Every two months, investigators created a snapshot of the network, in which each edge represents the calls made between two nodes during the corresponding two months and the edge weight is equal to the number of these calls. In total, 11 snapshots were collected. The snapshots allow us to observe the network reactions to the shipment confiscations, and to the changes in positions of members in the network occurring between snapshots.

In the case of the Caviar network, we know from the publicly released court proceedings which nodes had some management roles. For example, these proceedings identified node N1 (see Fig. 2) as the leader of the hashish group, and node N12 as the leader of the cocaine group. Fig. 1 shows more nodes in management roles in the Caviar groups. Still, the community detection algorithm alone did not assign many low degree nodes to any group (Bahulkar et al., 2018). We also present the results obtained with the Ciel dataset (Morselli, 2009).
Figure 1: Illustration of our model's integration of the stochastic block model (SBM) with the hierarchical multi-layer network model in the Caviar network. Nodes marked with the same color belong to the same group. Nodes placed at the same hierarchy levels play the same roles in the management of this network.
Network             Caviar                          Ciel
Node identifier     N1      N3      N12     N76     N1      N2      N10
RBC score           0.430   0.180   0.303   0.078   0.591   0.641   0.015
Rank of the score   1       3       2       4       2       1       7
Table 1: The relative betweenness centrality of the management nodes of the Caviar and Ciel networks reveals their different priorities in managing a criminal organization. In the Caviar network, managers are strongly connected to the nodes in their groups, which improves the efficiency of management. In contrast, in the Ciel network, management nodes connect directly only to subordinates. This may benefit the security of node N10.

The betweenness centrality of a node measures the fraction of the shortest paths of information flow between all pairs of nodes that pass through this node (Unnithan et al., 2014). A normalized version of this metric, called relative betweenness centrality (RBC), limits its range to [0,1]. We use it for ease of comparison of the results. We conclude that the structures of the Caviar and Ciel networks are fundamentally different. The Caviar network prioritized efficiency. This is indicated by the management nodes, which are the boss N1 and the managers N3, N12, and N76, having the highest relative betweenness centrality scores among all nodes in the Caviar network (cf. Table 1). The implied easy access to these nodes from others supports high efficiency of information flow in the network, but low security for the leader nodes. In contrast, in the Ciel network, the boss limits the connections to direct subordinates. Consequently, node N10 has a very low RBC when compared to the manager nodes, N1 and N2 (see Table 1), even though some management role was plausible for him (Morselli, 2009).

We chose the Caviar and Ciel datasets for our study because they are well studied and analyzed using public data released during court proceedings (Bahulkar et al., 2018; Berlusconi, 2013; Skillicorn et al., 2013). We show that synthetic networks generated by our generator for both networks are similar to the original ones. This is important because only pieces of the ground truth are available for covert networks under investigation.
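To make the relative betweenness centrality concrete, the following is a minimal sketch that computes it with Brandes' accumulation scheme. It is an illustration under simplifying assumptions, not the computation used for Table 1: the graph is treated as undirected and unweighted, the function name `relative_betweenness` is hypothetical, and the normalization by the number of ordered node pairs is one common way to map the score into [0,1].

```python
from collections import deque


def relative_betweenness(adj):
    """Return {node: betweenness normalized to [0, 1]}.

    adj maps every node to an iterable of its neighbours
    (assumption: undirected, unweighted, every node is a key).
    """
    nodes = list(adj)
    raw = {v: 0.0 for v in nodes}
    for source in nodes:
        # Breadth-first search that counts shortest paths from `source`.
        sigma = {v: 0 for v in nodes}   # number of shortest paths
        dist = {v: -1 for v in nodes}   # -1 marks "not reached yet"
        preds = {v: [] for v in nodes}  # predecessors on shortest paths
        sigma[source], dist[source] = 1, 0
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                raw[w] += delta[w]
    n = len(nodes)
    ordered_pairs = (n - 1) * (n - 2)  # pairs of other nodes, both directions
    if ordered_pairs <= 0:
        return {v: 0.0 for v in nodes}
    return {v: raw[v] / ordered_pairs for v in nodes}


# Toy star network: every shortest path between leaves passes through "c".
star = {"c": ["a", "b", "d"], "a": ["c"], "b": ["c"], "d": ["c"]}
scores = relative_betweenness(star)
```

In this toy star, the hub receives the maximum score and the leaves the minimum, which mirrors the contrast between highly exposed managers and shielded nodes discussed above.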
4. Implementation of the Random Anonymized Network Generator
The Random Anonymized Network Generator (RANG) organizes its network generation process into the following three steps. The owners of the original network data execute the first step. They need to replace the identity and personal data of each node with a single unique numerical ID of this node. They also need to assign to each node its place in the management hierarchy, the group to which this node belongs, and two lists, one of its subordinates and the other of its superiors. The second step is executed by the RANG software. It summarizes the group structure of the network as the list of probabilities of an edge between any pair of nodes. These probabilities depend on the groups to which the nodes belong and the roles these nodes play in the network. The obtained data are shared with outside users or used internally by the owners. The third step generates a set of synthetic random networks using the edge probabilities defined in the second step and analyzes this set for the stability of the group structure and the consistency of the nodes' management roles.
Social networks often have directed weighted edges to represent the intensity of interactions measured in the frequency of calls, messages, or meetings. Within SBM, we use one of two models to assign weights to edges.

The first process for assigning edges is called the Weighted Random Graph (WRG) generator (Garlaschelli, 2009). Let $W$ denote the sum of the weights of all edges and $E$ the maximum number of edges we can generate between subsets of nodes. In WRG, the edge weights are generated by running Bernoulli trials with probability $p = \frac{W}{W+E}$. This run stops at the first failed trial. The number of successful trials before this failure defines the weight of the generated edge. This process gives rise to the geometric distribution of edge weights.

We introduce here an alternative approach, named the Bernoulli Weighted Random Network (BWRN) model. It has two parameters. One is a vector of the weights $w$ of edges in the original graph, and the other is the probability $p_B$ that controls the variance of the generated weight distribution. The process of generating edges starts with the heaviest edges and progresses down to edges with the smallest weight. Given the currently processed weight $w$, an associated weight is computed as $w_B = \lfloor w/p_B \rfloor$. For each edge with weight $w$ in the original graph, we select a weight in the range $[0, \lceil w/p_B \rceil]$ as follows. First, we randomly select a pair of not yet connected nodes and run $w_B$ Bernoulli trials with probability $p_B$. If $p_B w_B < w$, we do one more Bernoulli trial with probability $p_a = w - p_B w_B$. The weight from such a run is equal to the number of successes in those trials. The edge is not created when this run returns weight 0.

This design yields a distribution of weights in which the probability of choosing weight $k$, where $0 \le k \le \lceil w/p_B \rceil$, is defined as:

$$
p_w(k, p_B) =
\begin{cases}
\dfrac{(w_B)!}{(w_B-k)!\,k!}\, p_B^{k} (1-p_B)^{w_B-k} & \text{if } k \le w_B \\
w - p_B w_B & \text{for } k = w_B + 1 \text{ if } p_B w_B < w
\end{cases}
\qquad (1)
$$

As a result, the average sum of weights of all edges created by this process is the same as in the original network. Indeed, the expected weight from Equation 1 is $p_B w_B + w - p_B w_B = w$.

The parameter $p_B$ defines the probability that the edge of weight $w$ will not be generated, which is $(1-p_B)^{w_B}$ if $w = p_B w_B$ and $(1-p_B)^{w_B}(1 - w + p_B w_B)$ otherwise, so it quickly decreases with the increase of $w$ and $p_B$. With large $p_B$, even edges with weight 1 have a low chance of being lost. An interesting trade-off arises for slightly lower values of $p_B$. For example, with the value of $p_B$ that we used for the computational experiments here, a fraction of such edges will not appear in the generated network, but a similar fraction of edges will increase their weight to 2, strengthening the cohesiveness of some communities.

The variance of the distribution of the weights generated for an edge with weight $w$ in the original data is $w(1-p_B) + p_a(p_B - p_a) \approx w(1-p_B)$ (Feller, 1968), so it grows with the increase of $w$ but decays with the increase of $p_B$. Indeed, with $p_k = \binom{w_B}{k} p_B^{k}(1-p_B)^{w_B-k}$, the second moment is $\sum_{k=0}^{w_B} \left[ k^2 p_k (1-p_a) + k^2 p_k p_a + (2k+1) p_k p_a \right] = w(1-p_B) + w^2 + p_a(p_B - p_a)$. Hence $Var(p_B, w) = w(1-p_B) + p_a(p_B - p_a)$. Thus, selecting a large $p_B$ will make the generated synthetic networks more similar to the original one, while decreasing it would have the opposite effect. Hence, different kinds of analyses could be conducted with different choices of $p_B$.

By taking into account the weights of edges, our model allows users to define different edge densities in the group for edges at the same level of hierarchy (e.g., among peers) than for edges across the hierarchy (e.g., between the group leader and a subordinate). This enables users to account for typically higher information flow intensities between managers and subordinates than among peers.

To test the accuracy of our synthetic network generator, we ran the Louvain community detection algorithm (Blondel et al., 2008) on all the generated networks and compared the generated groups to the groups in the original network. Yet, there are many high quality community detection algorithms that can be used on social and covert networks, such as SpeakEasy (Gaiteri et al., 2015), which we tried and which yielded similar results, CPM (Traag et al., 2011), modularity maximization (Chen et al., 2014), fast (Clauset et al., 2004), or adaptive modularity (Lu et al., 2018), so users of RANG can use any of them for this purpose.

We observe that in the Caviar and Ciel networks, the generated and original groups often do not match perfectly. There are several reasons for such differences. First, datasets for covert networks are incomplete and may have many undetected edges. Second, important nodes are often so highly interconnected that they may belong to several different groups. For example, in Caviar, nodes N1 and N3 belong to the same group in the original graph, but N3 has so many connections to other groups that the generator occasionally assigns it to them. For similar reasons, such misassignments may happen to the managers. At the same time, such a misassignment may signal diverse roles that such a node plays in the network. Hence, comparative analysis of groups discovered in the synthetically generated networks may shed new light on the operations of the original covert network.

We also detect the network hierarchy levels to reveal the structural properties of the generated networks. We use the relative betweenness centrality, which is easy to compute, as a hierarchy level measure because in networks there is a strong correlation between the hierarchy measures and betweenness centrality scores (Rajeh et al., 2020). Comparing the nodes with high relative betweenness centrality in the generated network and such nodes in the original network enables us to measure how well the generated networks preserve the leadership hierarchy. A Combined Score (CS) measures the overall similarity of the generated networks to the original one. CS is the product of the group and hierarchy similarities.
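The two weight-sampling procedures described in this section, WRG and BWRN, can be sketched for a single edge as follows. This is a minimal illustration, not the RANG implementation: the function names and the explicit `rng` argument are assumptions introduced here, the selection of the node pair is omitted, and the extra-trial probability is clamped at zero only to guard against floating-point round-off.

```python
import math
import random


def sample_wrg_weight(total_weight, max_edges, rng):
    """WRG: count successful Bernoulli trials before the first failure.

    Uses p = W / (W + E); the result is geometrically distributed.
    """
    p = total_weight / (total_weight + max_edges)
    weight = 0
    while rng.random() < p:
        weight += 1
    return weight


def sample_bwrn_weight(w, p_b, rng):
    """BWRN: w_B = floor(w / p_B) trials with probability p_B, plus one
    extra trial with probability p_a = w - p_B * w_B when that is positive.

    A returned weight of 0 means the edge is not created.
    """
    w_b = math.floor(w / p_b)
    weight = sum(1 for _ in range(w_b) if rng.random() < p_b)
    p_a = max(0.0, w - p_b * w_b)  # clamp guards against round-off only
    if p_a > 0.0 and rng.random() < p_a:
        weight += 1
    return weight


rng = random.Random(7)  # fixed seed so that runs are repeatable
bwrn_samples = [sample_bwrn_weight(3, 0.8, rng) for _ in range(20000)]
bwrn_mean = sum(bwrn_samples) / len(bwrn_samples)  # close to w = 3
wrg_samples = [sample_wrg_weight(30, 20, rng) for _ in range(20000)]
wrg_mean = sum(wrg_samples) / len(wrg_samples)     # close to W / E = 1.5
```

Averaging many BWRN samples for one original weight approaches that weight, in line with the expectation argument above, while the WRG samples spread much more widely around their mean because of the geometric tail.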
The owners of the data about the original (real or synthetic) network need to provide as input three lists about this network. They first need to anonymize the input data by removing any personal information about the nodes. The first list then contains just the nodes, each represented by a pair with an abstract node identifier and the level of management hierarchy to which this node belongs. The second is the list of all edges, each providing the identifiers of its starting and ending nodes and its weight. The third is the list of groups, each group represented by the list of identifiers of its members based on the original network and, unless a group is independent, the identifier of a group leader. Alternatively, the owners may prefer to generate a synthetic network from this input, since the output of the generator is in the required format and the generated network is different from the original.

The processing of the input data starts with the computation of probabilities of existence and weights of edges for the generated network. We use two models to assign randomized weights to the weighted edges. Both methods classify the edges into several classes and generate edges and weights for each class separately. The first class includes internal edges of a group $g_i$ of size $|g_i|$ at the same level of management hierarchy, so it includes the group members but not their superiors. In such a case, there are $E_i = |g_i|(|g_i| - 1)$ directed edges. Let $W_i$ denote the sum of their weights. The second class includes edges across the members of two different groups $g_i$, $g_j$. There are $E_{i,j} = |g_i||g_j|$ such edges from $g_i$ to $g_j$ and $E_{j,i}$ in the opposite direction. Let the sums of their weights be denoted as $W_{i,j}$, $W_{j,i}$. The next class includes edges from a superior $s(i)$ to the members of its group $g_i$ and from the group members to this superior. There are $E_{s(i),i} = E_{i,s(i)} = |g_i|$ such edges in each direction. Their sums of weights are denoted as $W_{s(i),i}$, $W_{i,s(i)}$. Let $|\neg i|$ denote the number of all nodes not in a group $g_i$ at the level of management of the members of this group. We define the class of edges from the superior of group $i$ to $\neg i$ nodes as $E_{s(i),\neg i} = |\neg i|$ in each direction, with the sum of weights denoted $W_{s(i),\neg i}$.

One solution uses the Weighted Random Graph (WRG) approach (Garlaschelli, 2009). The second solution uses the Bernoulli Weighted Random Network (BWRN). The section on generative network models discusses both solutions.

Next, the management roles are assigned to the nodes. Our model allows for an arbitrary number of management hierarchy levels. Yet, given that both the Caviar and Ciel networks are composed of small groups, we set here the limit for their hierarchy levels at three. The number of nodes assigned to each hierarchy level higher than one depends on the total number of groups at the immediately lower level. The third level of management consists of the highest authority node in the local network. We refer to such a node as a boss. We refer to the nodes at the second level as managers. The first level of hierarchy is composed of the remaining nodes organized into groups. We call such nodes members. As explained in the next section, our network generator attempts to find the group managers even without any information from the network investigators.

Managers serve as intermediaries between the boss and the members. A review of the literature reveals that the majority of small companies have a ratio of four employees per manager (Davison, 2003). This is a bit smaller than in the case of covert networks, such as Caviar, where this ratio is close to six. The plausible reasons for this difference include the self-motivation of the members for doing their tasks, as well as keeping the fraction of the organization members interacting with the boss small for safety reasons.

Not all groups are supervised through hierarchical management. We refer to such a group as independent. A group of small size and with a low fraction of reciprocal connections is likely to be independent. An example is the money-laundering group in the Caviar network that might be regarded as independent. Its members have only outgoing edges targeting the outside nodes. Thus, its members do not have incoming edges from any other node in the network.
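The bookkeeping for the first two edge classes above, within-group and between-group edges, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `edge_class_stats` and the dictionary layout are hypothetical, only member-level groups are handled (the superior-related classes would be tallied analogously), and the per-class probability is the WRG value $W/(W+E)$.

```python
def edge_class_stats(groups, edges):
    """Tally capacity E, total weight W, and WRG probability p per class.

    groups: {group_id: set of member ids} (members at one hierarchy level)
    edges:  list of (source, target, weight) directed edges
    Returns {(source_group, target_group): {"E": ..., "W": ..., "p": ...}}.
    """
    group_of = {node: g for g, members in groups.items() for node in members}
    stats = {}
    for gi, members_i in groups.items():
        for gj, members_j in groups.items():
            if gi == gj:
                # E_i = |g_i| (|g_i| - 1) directed edges inside one group
                capacity = len(members_i) * (len(members_i) - 1)
            else:
                # E_{i,j} = |g_i| |g_j| directed edges from g_i to g_j
                capacity = len(members_i) * len(members_j)
            stats[(gi, gj)] = {"E": capacity, "W": 0}
    for source, target, weight in edges:
        if source in group_of and target in group_of:
            stats[(group_of[source], group_of[target])]["W"] += weight
    for entry in stats.values():
        total = entry["W"] + entry["E"]
        entry["p"] = entry["W"] / total if total else 0.0
    return stats


# Hypothetical toy input: two groups and four weighted directed edges.
toy_groups = {"A": {1, 2, 3}, "B": {4, 5}}
toy_edges = [(1, 2, 3), (2, 3, 1), (1, 4, 2), (5, 1, 6)]
toy_stats = edge_class_stats(toy_groups, toy_edges)
```

Keeping the two directions of a group pair as separate keys preserves the asymmetry between $W_{i,j}$ and $W_{j,i}$ that the text relies on.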
To get a better baseline for network accuracy than the Erdős-Rényi random network, we built a straightforward extension of SBM that supports weighted directed edges with integer weights. It takes as input the lists of network nodes, edges with integer weights, and groups in the original network. The first step is to compute, for each group, the sum of weights of all edges inside this group, and then of the edges of this group leading to and from each other group. These sums become probabilities by dividing them by the sum of weights of all the network edges. Using the built-in numpy (Oliphant, 2006) random choice method, the baseline generator picks a pair of groups, including pairs in which the source and target groups are the same. Then, a node from the source group and a node from the target group are selected repeatedly until the selected nodes are different, thus avoiding self-loops. From the created set of probabilities, we select the probability for the selected pair of groups and execute a Bernoulli trial with this probability. On success of the trial, the weight of the connection between these two nodes is increased by one. This entire process is repeated until the total weight of all edges becomes the same as in the input network. The last step is to run Louvain community detection on the resulting network and compare the output with the original network communities.
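The baseline procedure above can be sketched as follows. This is an illustration under several assumptions rather than the authors' code: Python's standard `random` module stands in for the numpy random choice method, the pair of groups is drawn uniformly from the pairs with positive weight (the text does not state the selection distribution), the input is assumed to contain no self-loops, and the function name `weighted_sbm_baseline` is hypothetical. The final Louvain step is omitted.

```python
import random
from collections import defaultdict


def weighted_sbm_baseline(groups, edges, rng):
    """Generate {(source, target): integer weight} with the input's total weight.

    groups: {group_id: iterable of node ids}
    edges:  list of (source, target, integer weight), no self-loops
    """
    group_of = {node: g for g, members in groups.items() for node in members}
    members = {g: sorted(ms) for g, ms in groups.items()}
    total_weight = sum(weight for _, _, weight in edges)

    # Sum of weights per ordered pair of groups, turned into probabilities.
    pair_weight = defaultdict(int)
    for source, target, weight in edges:
        pair_weight[(group_of[source], group_of[target])] += weight
    pairs = [pair for pair, weight in pair_weight.items() if weight > 0]
    prob = {pair: pair_weight[pair] / total_weight for pair in pairs}

    generated = defaultdict(int)
    placed = 0
    while placed < total_weight:
        g_source, g_target = rng.choice(pairs)  # assumption: uniform pick
        source = rng.choice(members[g_source])
        target = rng.choice(members[g_target])
        while source == target:                 # re-draw to avoid self-loops
            source = rng.choice(members[g_source])
            target = rng.choice(members[g_target])
        if rng.random() < prob[(g_source, g_target)]:
            generated[(source, target)] += 1    # success adds one unit of weight
            placed += 1
    return dict(generated)


toy_groups = {"A": [1, 2, 3], "B": [4, 5]}
toy_edges = [(1, 2, 3), (2, 3, 1), (1, 4, 2), (5, 1, 6)]
synthetic = weighted_sbm_baseline(toy_groups, toy_edges, random.Random(1))
```

Restricting the draw to group pairs with positive weight does not change the outcome, since a trial for a zero-probability pair can never succeed; it only avoids wasted iterations.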
5. Results
We test the network generator by generating 100 synthetic networks and taking the average scores to avoid any statistically insignificant outliers. We use the Louvain community detection algorithm to find the group structure of each generated network. Then, using the Normalized Mutual Information (NMI) method (Danon et al., 2005), we compare the groups detected in the generated networks to the groups in the corresponding original network.
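For readers unfamiliar with NMI, the comparison of two group assignments can be sketched as below. This is a minimal illustration, assuming the common normalization $2I(A;B)/(H(A)+H(B))$ and that both partitions label the same set of nodes; the function name `nmi` and the toy partitions are hypothetical.

```python
import math
from collections import Counter


def nmi(part_a, part_b):
    """Normalized mutual information of two partitions.

    part_a, part_b: {node: group label}, defined over the same nodes.
    Returns a value in [0, 1]; 1 means identical group structures.
    """
    nodes = list(part_a)
    n = len(nodes)
    count_a = Counter(part_a[v] for v in nodes)
    count_b = Counter(part_b[v] for v in nodes)
    joint = Counter((part_a[v], part_b[v]) for v in nodes)

    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    entropy_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions put all nodes into a single group
    return 2 * mutual / (entropy_a + entropy_b)


original = {1: "g1", 2: "g1", 3: "g2", 4: "g2"}
relabeled = {1: "x", 2: "x", 3: "y", 4: "y"}   # same groups, new labels
unrelated = {1: "p", 2: "q", 3: "p", 4: "q"}   # cuts across both groups
same_score = nmi(original, relabeled)
cross_score = nmi(original, unrelated)
```

Because NMI depends only on how nodes are grouped and not on the labels, it suits comparisons between the original groups and groups detected independently in each synthetic network.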
To evaluate how similar the synthetic networks generated by RANG are to the original network used as input to RANG, we measure two aspects of similarity. The first is the group structure. For this purpose, we used Louvain community detection to find the group structure of all compared networks. The second aspect is the management nodes. We find all management nodes by their high relative betweenness centrality in all compared networks. As the threshold, we use the relative betweenness centrality of the node whose rank among the highest values of this metric is equal to the number of management nodes in the original network. For both datasets, we used weighted edge generators to create three sets of synthetic networks. For the first set, we used our Bernoulli Weighted Random Network method, for the second set, the Weighted Random Graph method, and for the last set, the weighted SBM described in Section 4.5.

Table 2 contains the results of measurements of the first aspect of similarity. They show that synthetic networks generated by the BWRN method are by a factor of two more similar to the original network than those generated by the WRG method.

Original network   Caviar                                  Ciel
Generator          RANG BWRN   RANG WRG   Weighted SBM     RANG BWRN   RANG WRG   Weighted SBM
mean
median
min
max
Table 2: NMI scores from comparing the groups in networks generated from the Caviar (columns 2-4) and Ciel (columns 5-7) original networks to the groups in those original networks. Three generators are compared. Columns 2 and 5 show the performance of RANG with our Bernoulli Weighted Random Network method. Columns 3 and 6 display the results for RANG using the Weighted Random Graph method. Columns 4 and 7 contain the results for the weighted SBM baseline method. Our BWRN method outperformed the other two in terms of median values for both the Caviar and Ciel networks and of the mean value for the Ciel network. The weighted SBM method was close behind, with the best score for the mean value for Caviar. The WRG method was far behind, with scores about half of the leading scores.
Metric               RANG & BWRN   RANG & WRG   Weighted SBM
Group NMI median
Jaccard Leadership
Combined Score
Table 3: Similarity of the original Caviar network to networks generated by our Random Anonymized Network Generator, first with our Bernoulli Weighted Random Network method, then with the Weighted Random Graph method. The third set of networks was generated by the weighted SBM baseline. The first row of the results shows the group structure similarity from Table 2, the second the leadership similarity, and the last row shows the Combined Score, which is the product of the first two scores. In all three rows, the best score is shown in bold font.
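The leadership similarity and the Combined Score reported in Table 3 can be sketched as follows. This is an illustration under assumptions, not the evaluation code: the function names are hypothetical, management nodes of a generated network are taken to be the top-ranked nodes by relative betweenness centrality (as many as there are management nodes in the original), ties are broken by node name, and all scores in the example are made up.

```python
def top_ranked_nodes(scores, count):
    """Return the `count` nodes with the highest scores (ties by name)."""
    ranked = sorted(scores, key=lambda node: (-scores[node], str(node)))
    return set(ranked[:count])


def jaccard(set_a, set_b):
    """Jaccard similarity |A intersect B| / |A union B| of two sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def combined_score(group_nmi, original_managers, generated_rbc):
    """Product of group similarity (NMI) and leadership similarity (Jaccard)."""
    detected = top_ranked_nodes(generated_rbc, len(original_managers))
    return group_nmi * jaccard(set(original_managers), detected)


# Hypothetical scores for one generated network.
managers = {"N1", "N3", "N12", "N76"}
generated_rbc = {"N1": 0.40, "N3": 0.10, "N12": 0.30,
                 "N76": 0.05, "N5": 0.20, "N9": 0.01}
leadership = jaccard(managers, top_ranked_nodes(generated_rbc, len(managers)))
score = combined_score(0.5, managers, generated_rbc)
```

In this made-up example, one original manager drops out of the top four, so the leadership similarity is 3/5 and the Combined Score is reduced accordingly.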
This means that the management structure is a differentiating factor incomparing these two generators. It is evaluated in Table 3 that demonstratesthe importance of leadership detection for keeping the synthetic distinct butclose to the original network. The results obtained using Jaccard metric (Zakiand Meira, 2020) show that the RANG generated networks using our BernoulliWeighted Random Network method preserve the group structure and leadershiphierarchy twice as well as the baseline generated networks and nearly four timesbetter than RANG using Weighted Random Graph method. Moreover, theunweighted SBM generator performs much worse than the baseline with theweighted SBM. n summary, our RANG approach better differentiates betweenleadership and membership roles for nodes within groups while performing atleast slightly better than others on community detection. These results verifythe strength of RANG in preserving groups and hierarchy within the generatednetworks. 16 igure 2: The presented Caviar networks depict a) the groups in the original network, andgroups detected in the synthetic networks with similarity that is (b) highest, (c) lowest, and(d) average. .2. Stability of group structures of covert networks Generating statistically similar networks to a real network enables us toanalyze how stable is the structure of the original network. These networksrepresent sets of small perturbations of the original network interconnections.If many generated synthetic networks are structurally similar to the originalnetwork, the latter network is stable and resistant to such perturbations. Onthe other hand, if the structure of the original network is not stable, only a fewsynthetic networks will be structurally similar to the real network.We apply this analysis to the Caviar network, by generating G = 1000 ran-dom synthetic networks, and comparing their structures to each other, and tothe structure of the Caviar original network. 
We start this analysis by creatinga meta-graph, nodes of which are the generated networks, so the size of themeta-graph is G = 1000 nodes. For each node, we draw an undirected edge be-tween this node and any other node in the meta-graph with a matching groupstructure. We create two versions of the meta-graph. In one, edges represent anexact matching. In another, the edges show flexible matching, which allows upto one node difference in each group for drawing an edge. Figure 3 shows theresulting distribution of the node degrees in meta-graph with exact matching.We find that the original network’s structure repeats ten times for exact match-ing among 1000 generated synthetic networks. Thus, this structure is not verystable. It is sensitive to small perturbations in node connectivity in the gener-ated networks. In contrast, we also find that the most frequent group structureoccurs 177 times with exact matching, and 310 times for flexible matching. Italso happens to be the most similar to the original detected group structure.The synthetic network with this structure is shown in Sub-figure 2(b). More-over, each of the top ten most frequently occurring structures repeats at least20 times for exact matching, and at least 206 times for flexible matching. Theremaining generated networks have either unique structures, or structures thatwere similar to only a few generated networks.From a practical perspective, the identification of stable groups in a criminalorganization is important. It will help analysts to concentrate on group struc-18 igure 3: The histogram of meta-graph degree distribution for exact matching of groups inthe generated networks. The x-axis defines the node degree d , while the y-axis shows n ( d ) ,the number of nodes with d degree. The group structure of the original network is not stablebecause only of synthetic networks share this structure. tures that arise frequently and thus represent plausible interpretations of datacollected about the network. 
Simulating interdiction on alternative structures reveals a range of outcomes. Using the meta-graph of generated networks, analysts can compute the probabilities of these outcomes, which provide the distribution of the interdiction outcomes for the given original network.
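The meta-graph construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function names (canonical, exact_match, flexible_match, meta_graph_degrees) are hypothetical, each group structure is assumed to be a partition of node ids, and "up to one node difference in each group" is interpreted as a symmetric difference of at most one node between paired groups. The toy input is invented for the example.

```python
from collections import Counter
from itertools import combinations


def canonical(groups):
    """Return an order-independent form of a group structure."""
    return frozenset(frozenset(g) for g in groups)


def exact_match(a, b):
    """Two structures match exactly when they contain the same groups."""
    return a == b


def flexible_match(a, b, tolerance=1):
    """Pair each group of `a` with its closest unused group of `b`.

    Assumption: "up to one node difference in each group" is read as a
    symmetric difference of at most `tolerance` nodes per paired group.
    """
    if len(a) != len(b):
        return False
    remaining = set(b)
    for g in a:
        closest = min(remaining, key=lambda h: len(g ^ h))
        if len(g ^ closest) > tolerance:
            return False
        remaining.remove(closest)
    return True


def meta_graph_degrees(structures, match):
    """Degree of every meta-graph node; an edge joins matching structures."""
    degrees = [0] * len(structures)
    for i, j in combinations(range(len(structures)), 2):
        if match(structures[i], structures[j]):
            degrees[i] += 1
            degrees[j] += 1
    return degrees


# Toy input (hypothetical): five detected group structures over six nodes.
structures = [canonical(s) for s in (
    [{1, 2, 3}, {4, 5, 6}],
    [{1, 2, 3}, {4, 5, 6}],
    [{1, 2}, {3, 4, 5, 6}],
    [{1, 2, 3, 4}, {5, 6}],
    [{1}, {2, 3, 4, 5, 6}],
)]
exact_degrees = meta_graph_degrees(structures, exact_match)
flexible_degrees = meta_graph_degrees(structures, flexible_match)
# n(d): number of meta-graph nodes with degree d, as plotted in a histogram.
degree_histogram = Counter(exact_degrees)
```

With exact matching, the degree of a node is the number of other generated networks that share its group structure, so frequently occurring structures show up as high-degree nodes. The pairwise loop performs G(G-1)/2 comparisons, which remains manageable for G = 1000.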
6. Conclusions
We introduce and make available to researchers and analysts a network generator that produces synthetic networks statistically similar to a given real or synthetic network. The first goal of this work is to enable sharing of data between analysts, who own sensitive data about current investigations, and researchers, who need realistic networks to develop efficient tools for network analytics. Only the owners of the data can enable sharing of sensitive data. They can properly separate abstract structural information about the network from the sensitive personal and operational information. The former is first encapsulated by the owners into three lists of nodes, groups and management roles, and then further transformed into numeric, possibly shuffled node ids, the clustering of nodes into groups, and statistics about edges and their weights across those clusters. Hence, the description of the original network passed to researchers is succinct and void of personal data. The output describing the generated networks follows the same format as is required for the input. The model uses the Bernoulli Weighted Random Network generator introduced here, which creates edges whose average sum of weights is the same as in the original network. It also uses the Stochastic Block Model to create alternative group structures to those existing in the original network. To preserve the organizational aspects of the original network, we use a hierarchical network model.
In general, our generator is broadly applicable to social networks, both with formal and informal organizational structures. It aims at aiding analyses and investigations of covert networks with partial information about their nodes, edges, and internal structures. The generator enables researchers to develop algorithms and systems on the generated synthetic networks for use in real applications.
Even analysts have only partial knowledge of covert networks during an investigation.
Therefore, an important second use of the generator is as a tool for analysts to perform network analytics on the original network. Studying the generated synthetic networks that are statistically similar to a given covert network is useful in two ways. Differences between those networks suggest potential alternative interpretations of the data or the need for collecting more data. The similarities between those networks enable the analysts to find stable parts of the original network structure. The more synthetic networks we have, the more we can analyze and understand the operations, leadership, and groups present in the original network. More generally, the distribution of probabilities for the alternative interpretations enables new network analytics. The analysts can find community structures that are frequent and therefore stable under perturbations. They may also analyze how the stability changes with the strength of perturbation. For covert networks, the analysts can quantify statistically expected outcomes of interdiction. This kind of analytics applies to all complex networks in which the data are incomplete or partially incorrect.
We thoroughly tested the generator on two real covert networks, Caviar and Ciel. To measure the similarity between the generated networks and the original one, we use the well-known Louvain community detection algorithm. Applying the NMI and the Jaccard metrics, we measured the results, which demonstrate the high similarities among the generated networks and the original one.
We conclude that accounting for groups and a management hierarchy of a network is essential for generating synthetic networks that are statistically similar to the original network.
In future work, we plan to add a fourth step of network generation that goes beyond the original graph to enable its hierarchical network expansion.
One way to accomplish this is to replicate any part of the original network multiple times at any level of the hierarchy and provide additional levels of management hierarchy if needed. Such an extension will allow researchers and analysts to study the evolution of covert networks and potential interdiction outcomes in large and complex criminal networks.
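To illustrate the similarity measurement summarized in these conclusions, the following minimal sketch computes a pair-counting Jaccard coefficient between two group structures. It assumes the pair-counting form of the coefficient (pairs of nodes grouped together in both structures, divided by pairs grouped together in at least one); the function names labels_from_groups and pair_jaccard are hypothetical, and the computation behind the reported results may differ in detail.

```python
from itertools import combinations


def labels_from_groups(groups):
    """Map each node id to the index of the group that contains it."""
    return {node: gid for gid, group in enumerate(groups) for node in group}


def pair_jaccard(groups_a, groups_b):
    """Pair-counting Jaccard coefficient between two group structures.

    Counts node pairs placed together in both structures (tp), only in
    the first (fn), or only in the second (fp); returns tp / (tp + fn + fp).
    Only nodes present in both structures are compared.
    """
    la = labels_from_groups(groups_a)
    lb = labels_from_groups(groups_b)
    nodes = list(set(la) & set(lb))
    tp = fn = fp = 0
    for u, v in combinations(nodes, 2):
        same_a = la[u] == la[v]
        same_b = lb[u] == lb[v]
        if same_a and same_b:
            tp += 1
        elif same_a:
            fn += 1
        elif same_b:
            fp += 1
    total = tp + fn + fp
    # Two structures with no co-grouped pairs at all are treated as identical.
    return tp / total if total else 1.0


# Toy example (hypothetical data): one node moves between two groups.
original = [{1, 2, 3}, {4, 5, 6}]
synthetic = [{1, 2}, {3, 4, 5, 6}]
similarity = pair_jaccard(original, synthetic)
```

In this toy example, four node pairs are grouped together in both structures, two only in the original, and three only in the synthetic one, so the coefficient is 4/9.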
References
Bahulkar, A., Szymanski, B. K., Baycik, N. O., and Sharkey, T. C. (2018). Community detection with edge augmentation in criminal networks. In , pages 1168–1175. IEEE.
Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
Barabási, A.-L., Albert, R., and Jeong, H. (2000). Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281(1-4):69–77.
Benyahia, O., Largeron, C., Jeudy, B., and Zaïane, O. R. (2016). Dancer: Dynamic attributed network with community structure generator. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 41–44. Springer.
Berlusconi, G. (2013). Do all the pieces matter? assessing the reliability of law enforcement data sources for the network analysis of wire taps. Glob. Crime, 14(1):61–81.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp., 2008(10):P10008.
Chen, M., Kuzmin, K., and Szymanski, B. (2014). Community detection via maximization of modularity and its variants. IEEE Trans. Comput. Social Sys., 1(1):46–65.
Clauset, A., Newman, M., and Moore, C. (2004). Finding community structure in very large networks. Phys. Rev. E, 70(6):066111.
Danon, L., Diaz-Guilera, A., Duch, J., and Arenas, A. (2005). Comparing community structure identification. J. Stat. Mech.: Theory Exp., 2005(09):P09008.
Davison, B. (2003). Management span of control: how wide is too wide? J. Bus. Strategy.
Elsisy, A., Holzbauer, B. O., Szymanski, B. K., Qi, M., and Pentland, A. (2019). What makes social search efficient. arxiv:1904.06551.
Erdös, P. and Rényi, A. (1959). On random graphs. I. Publ. Math., 6:290–297.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications. Wiley, New York, NY, 3 edition.
Gaiteri, C., Chen, M., Szymanski, B., Kuzmin, K., Xie, J., Lee, C., Blanche, T., Chaibub Neto, E., Huang, S.-C., Grabowski, T., Madhyastha, T., and Komashko, V. (2015). Identifying robust communities and multi-community nodes by combining top-down and bottom-up approaches to clustering. Sci. Rep., 5:16361.
Garlaschelli, D. (2009). The weighted random graph model. New J. Phys., 11(7):073005.
Guo, W. and Kraines, S. B. (2009). A random network generator with finely tunable clustering coefficient for small-world social networks. In , pages 10–17. IEEE.
Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Soc. Netw., 5(2):109–137.
Karrer, B. and Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83(1):016107.
Lancichinetti, A., Fortunato, S., and Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Phys. Rev. E, 78(4):046110.
Lu, X., Kuzmin, K., Chen, M., and Szymanski, B. K. (2018). Adaptive modularity maximization via edge weighting scheme. Inf. Sci., 424:55–68.
Morselli, C. (2009). Inside criminal networks, volume 8. Springer.
Newman, M. E. (2016). Equivalence between modularity optimization and maximum likelihood methods for community detection. Phys. Rev. E, 94(6):052315.
Oliphant, T. E. (2006). A guide to NumPy, volume 1. Trelgol Publishing USA.
Rajeh, S., Savonnet, M., Leclercq, E., and Cherifi, H. (2020). Interplay between hierarchy and centrality in complex networks. IEEE Access, 8:129717–129742.
Ravasz, E. and Barabási, A. L. (2003). Hierarchical organization in complex networks. Phys. Rev. E, 67(2):026112.
Shang, K.-k., Yang, B., Moore, J. M., Ji, Q., and Small, M. (2020). Growing networks with communities: A distributive link model. Chaos, 30(4):041101.
Skillicorn, D. B., Zheng, Q., and Morselli, C. (2013). Spectral embedding for dynamic social networks. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 316–323.
Traag, V., Van Dooren, P., and Nesterov, Y. (2011). Narrow scope for resolution-limit-free community detection. Phys. Rev. E, 84(1):016114.
Unnithan, S. K. R., Kannan, B., and Jathavedan, M. (2014). Betweenness centrality in some classes of graphs. Int. J. Comb., 2014:241723.
Xiao, J., Ren, H.-F., and Xu, X.-K. (2020). Constructing real-life benchmarks for community detection by rewiring edges. Complexity, 2020.
Zaki, M. J. and Meira, W. J. (2020). Data Mining and Machine Learning: Fundamental Concepts and Algorithms. Cambridge University Press, Cambridge, U.K., 2 edition.
7. Author contributions statement
Amr Elsisy: Conceptualization, Methodology, Software, Investigation, Writing - Original Draft, Writing - Review & Editing, Visualization.
Aamir Mandviwalla: Methodology, Software, Investigation, Writing - Original Draft, Writing - Review & Editing.
Boleslaw K. Szymanski: Conceptualization, Methodology, Writing - Original Draft, Writing - Review & Editing, Supervision, Project Administration.
Thomas Sharkey: Writing - Review & Editing.
8. Acknowledgments
This work was partially supported by the U.S. Department of Homeland Security under Grant Award Number 2017-ST061-CINA01, the Office of Naval Research (ONR) Grant N00014-15-1-2640, the Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) under Contract W911NF-17-C-0099, and the Army Research Office (ARO) Grant W911NF-16-1-0524. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security or U.S. Department of Defense.
9. Competing Interests