Survey of Graph Analysis Applications
Tim Hegeman, Delft University of Technology
Alexandru Iosup, Vrije Universiteit Amsterdam
July 3, 2018
Abstract
Recently, many systems for graph analysis have been developed to address the growing needs of both industry and academia to study complex graphs. Insight into the practical uses of graph analysis will allow future developments of such systems to optimize for real-world usage, instead of targeting single use cases or hypothetical workloads. This insight may be derived from surveys on the applications of graph analysis. However, existing surveys are limited in the variety of application domains, datasets, and/or graph analysis techniques they study. In this work we present and apply a systematic method for identifying practical use cases of graph analysis. We identify commonly used graph features and analysis methods and use our findings to construct a taxonomy of graph analysis applications. We conclude that practical use cases of graph analysis cover a diverse set of graph features and analysis methods. Furthermore, most applications combine multiple features and methods. Our findings motivate further development of graph analysis systems to support a broader set of applications and to facilitate the combination of multiple analysis methods in an (interactive) workflow.
Graph analysis is used across many application domains to interpret complex webs of relationships and connections formed by people, roads, financial transactions, etc. Understanding the practical uses of graph analysis is key to tuning existing graph analysis systems and guiding the development of new systems. However, this understanding requires knowledge of applications from many domains, including the datasets and graph analysis methods they use. Existing surveys focus on studying in depth the datasets and analysis methods used in a single domain [1, 12, 27, 63, 69], identifying applications of specific (classes of) graph algorithms [17, 20, 25], or exploring a variety of application domains [10, 18]. In contrast, we identify applications across a large number of application domains and characterize the datasets and graph analysis methods used in practice.
To meet the growing need for analyzing graphs, many graph analysis systems have been developed. Most systems target generic applications of graph analysis, e.g., by providing a generic programming model like Pregel [52], without explicitly considering the characteristics of real-world applications. However, the performance of graph analysis applications depends on a combination of three characteristics [29, 38], known as the PAD-triangle: the platform, the algorithm, and the dataset. Thus, when developing and tuning graph analysis platforms, knowledge of the algorithms and datasets used in practice is essential to achieving good performance across many applications.
To address the gaps in knowledge left by previous work, we pose our main research question: What are the characteristics of the datasets and analysis methods used in practical applications of graph analysis?
We further define two sub-questions: How to identify practical applications of graph analysis? and How to characterize graph datasets and graph analysis methods? In this work we make three contributions to answer these questions:
1. We present a systematic method for identifying practical use cases of graph analysis (Section 2) and apply this method to find a set of graph analysis applications (Section 3).
2. We identify commonly used graph features (Section 4) and classes of graph analysis methods (Section 5). Based on these common elements, we present a taxonomy of graph analysis applications (Section 6).
3. We propose directions for future research in the development of graph analysis systems (Section 7).
Footnote: Doekemeijer et al. [21] identified over 80 parallel graph processing frameworks between 2004 and 2014.
Method for Finding, Selecting, and Characterizing Relevant Material
Applications of graph analysis can be found across many application domains and use a wide range of datasets and algorithms. In this section we define a method for finding and selecting literature on graph analysis applications and for characterizing their datasets and methods.
Three common methods used to conduct literature surveys are: unguided traversal of the material, snowballing [83, 86], and the Systematic Literature Review (SLR) method proposed by Kitchenham et al. [44]. Unguided traversal of the material is the simple process of reading as much as possible on the topic, starting from a seed set of articles (e.g., provided by the supervisor) and continuing with as many articles as the reader can find using the typical repositories and search tools for scientific literature. For example, the reader could pursue every relevant link in each article read, or check all articles in the best conferences and journals in the past decade. The unguided element of the method comes from the lack of definition of stop and search criteria. The decision of which articles to select for review from the set of found articles is left entirely to the reader and is not guided by a set of selection criteria. The snowballing method uses similar mechanisms, but imposes guidance elements such as criteria for finding and selecting material.
The Systematic Literature Review method of Kitchenham et al. is a comprehensive method for conducting literature reviews. As summarized in Table 1, the method consists of three major stages: planning, conducting, and reporting. Each of these stages comprises a set of steps whose application depends on the application domain and the specific goals of the survey. The stage of conducting the review in the SLR method includes at least three important elements: identifying the repositories and search engines that can deliver relevant material, defining a set of specific keywords (queries) used as automated selection criteria for relevant material, and defining a procedure for manually selecting truly relevant material from the set obtained through automated search.
We compare these three methods qualitatively, considering two criteria: the scale of the resulting dataset and selection bias. The unguided traversal method can yield any amount of material, but it has an implicit selection bias towards material already known to the reader because it relies on the reader to select directions for searching material. The snowballing method results in a large body of relevant material with a minor selection bias caused by the choice of a seed set of relevant material. The SLR method yields a limited set of relevant material and avoids a selection bias through a systematic search.
To make a selection, we first specify the desired outcomes for the two criteria. We prefer a limited number of articles because we would like to do an in-depth manual inspection of each, and we envision a large number of application domains (many of which we do not know), which requires a lack of selection bias. Therefore, we select the Systematic Literature Review method of Kitchenham et al., which best meets our criteria. In the remainder of this work, we follow the steps of the method as listed in Table 1 where applicable. Notably absent in our approach is the “study quality assessment” step. The quality of the work presented in surveyed material is largely irrelevant to our analysis; we consider only the datasets and graph analysis methods used and do not investigate how well these methods perform with respect to other approaches in a given domain.
To identify relevant literature we considered typical search engines for scientific literature, as recommended by the SLR method. Due to the wide range of application domains that potentially use graph analysis, we excluded any search engine dedicated to specific fields. We selected Google Scholar for its extensive corpus and open access. We used eight queries to search for relevant material as listed in Table 2. For each query we retrieved the first 100 search results for further inspection. We conducted our search during January 2018.
Our search queries were formulated to match a wide range of possible applications, but they also match a large volume of irrelevant material. We used manual selection to extract all relevant literature from the body of search results. We selected all articles that explicitly describe the use of one or more algorithms or methods for graph analysis as a primary contribution to address their research question(s). We specifically excluded secondary studies, books, and theses because they typically present multiple applications and inclusion of these applications may have led to overrepresentation of the corresponding application domain. We also excluded articles presenting a system or algorithm for graph analysis unless they target a specific application (domain).
The results of the identification and selection process are summarized in Table 2. From 800 search results we selected 96 relevant articles (12%). We further reduced this set to 60 articles for in-depth analysis through manual selection while preserving the diversity of application domains among selected articles.

Table 1: Overview of steps that comprise the Systematic Literature Review method by Kitchenham et al. [44]. The steps we apply in this work are indicated by a checkmark (✓) and we list the section(s) implementing the step (if applicable).

Stage                    Applied  Step
Planning the review      ✓        Identification of the need for a review (S. 1)
                         ✗        Commissioning a review
                         ✓        Specifying the research question(s) (S. 1)
                         ✓        Developing a review protocol (S. 2.1)
                         ✗        Evaluating the review protocol
Conducting the review    ✓        Identification of research (S. 2.2)
                         ✓        Selection of primary studies (S. 2.2)
                         ✗        Study quality assessment
                         ✓        Data extraction and monitoring (S. 2.3, 3, 6)
                         ✓        Data synthesis (S. 4-6)
Reporting the review     ✗        Specifying dissemination mechanisms
                         ✓        Formatting the main report
                         ✗        Evaluating the report

Table 2: Search queries used to identify relevant literature, with the number of relevant articles found and the number of articles selected for in-depth analysis per query. (*) Three articles were retrieved via two search queries.

ID      Query               Relevant  Selected
Q1      graph analysis      39        34
Q2      graph analytics     4         3
Q3      graph mining        19        14
Q4      graph processing    3         2
Q5      network analysis    15        7
Q6      network analytics   6         2
Q7      network mining      12        1
Q8      network processing  1         0
Total:                      96*       60*

To analyze the selected material we used a three-step process. The primary purpose of this process is characterizing the datasets and methods for graph analysis used in practice. First, we scanned each selected article to summarize the application it describes (presented in Section 3). We also identified any notable features of the dataset and the primary method of graph analysis used by each application. Second, we derived a list of common (classes of) graph features (Section 4) and graph analysis methods (Section 5) from the initial analysis performed in the first step. Third, we extracted from each selected article the graph features and graph analysis methods it uses. The resulting data was used to construct a taxonomy as presented in Section 6.
We manually classified the graph features present in each application. Due to a combination of frequently used terms (e.g., many articles refer to both “directed” and “undirected” graphs) and features not explicitly identified by the authors (e.g., vertex and edge properties, heterogeneity), keyword-based search as primary classification method was not feasible. Where plausible we used keyword-based search to validate the results of the manual inspection process (the keywords used are listed in Section 4 where applicable).
To classify the graph analysis methods used by each application, we used search queries for most classes of methods (listed in Section 2 where applicable), followed by manual inspection of the context of each search result to rule out false positives. We supplemented these results with the list of primary methods extracted during the initial scan of each article, especially for (classes of) methods without well-defined terminology (e.g., no suitable keywords for the “graph mutation” class were found due to the ubiquity of candidates such as “reduction”, “merge”, “mutate”). Some atypical methods for graph analysis may have been missed in our classification process if such a method was not identified as the primary method of analysis of an article.
Although we selected our method to identify relevant graph analysis applications from a broad range of domains, the specifics of our search strategy introduce three potential biases. First, we restrict our search to scientific literature, so we do not identify any commercial applications if their methods have not been published. Second, we restrict our search to English literature, which may exclude applications that are not well-known in the English-speaking scientific community. Third, some of our search queries show a strong correspondence with specific types of analysis, e.g., “mining” returns many applications of pattern matching or subgraph isomorphism, whereas “network analysis” often occurs in the phrase “social network analysis”, referring to a common set of graph analysis methods.
In this section we present the graph analysis applications we have selected and characterized using the method presented in Section 2. Applications are grouped by application domain.
Biological networks are used to study the interactions of numerous biological entities, including proteins, genes, and organisms. We present in turn the biological applications we characterized.
Protein-protein interaction networks:
Li et al. [50] propose an algorithm to identify protein complexes in protein-protein interaction networks. Their Local Clique Merging Algorithm (LCMA) iteratively identifies local cliques and merges them if they overlap significantly (i.e., are similar).
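The merge step described above can be illustrated with a minimal Python sketch. It is not the LCMA implementation of Li et al.; the function name merge_overlapping, the overlap measure (shared vertices divided by the size of the smaller group), and the default threshold of 0.5 are assumptions made for illustration, and the local cliques are assumed to be given as non-empty vertex sets.

```python
def merge_overlapping(groups, threshold=0.5):
    """Repeatedly merge vertex sets whose overlap ratio reaches the threshold.

    The overlap ratio used here (shared vertices / size of the smaller set)
    is an illustrative assumption, not the similarity measure of LCMA.
    """
    merged = [set(g) for g in groups]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                overlap = len(a & b) / min(len(a), len(b))
                if overlap >= threshold:
                    # Replace the first set by the union and drop the second.
                    merged[i] = a | b
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged
```

For example, merge_overlapping([{1, 2, 3}, {2, 3, 4}, {7, 8, 9}]) merges the first two sets, which share two of three vertices, and leaves the third untouched.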
Gene regulatory networks:
Pinna et al. [65] address the problem of deriving a gene regulatory network from observed gene expression levels. The authors construct a network of genes and inferred edges with weights signifying an initial estimate of the probability of an edge’s existence. Edges with a probability below some threshold are removed from the network. Using strongly connected components, the authors identify unessential edges (feed-forward edges) to derive a final regulatory network.
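As a rough illustration of the first two building blocks mentioned above, the following Python sketch prunes low-probability edges and then computes strongly connected components with Kosaraju's two-pass algorithm. It does not reproduce the authors' procedure for identifying feed-forward edges; the function names, the input format (a list of (source, target, probability) tuples), and the choice of Kosaraju's algorithm are assumptions.

```python
def prune_edges(weighted_edges, threshold):
    """Keep only the edges whose estimated probability reaches the threshold."""
    return [(u, v) for u, v, p in weighted_edges if p >= threshold]


def strongly_connected_components(vertices, edges):
    """Kosaraju's algorithm; the recursive first pass suits small graphs only."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    # First pass: record vertices in order of depth-first completion.
    order, seen = [], set()

    def visit(v):
        seen.add(v)
        for w in succ[v]:
            if w not in seen:
                visit(w)
        order.append(v)

    for v in vertices:
        if v not in seen:
            visit(v)

    # Second pass: explore the reversed graph in reverse completion order.
    components, assigned = [], set()
    for v in reversed(order):
        if v in assigned:
            continue
        component, stack = set(), [v]
        assigned.add(v)
        while stack:
            x = stack.pop()
            component.add(x)
            for w in pred[x]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        components.append(component)
    return components
```

In this sketch, genes that end up in the same component regulate each other through some directed cycle of retained edges.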
Metabolic networks:
Koyutürk et al. [45] present an algorithm for identifying common patterns in graph-based representations of metabolic pathways. Their graphs contain one vertex for every unique enzyme in a pathway and edges for interaction between those enzymes. By mining the metabolic pathway graphs of multiple organisms for frequent subgraphs, the authors are able to identify sub-pathways common to many organisms.
Tissue modeling:
Bilgin et al. [7, 8] classify tissue samples by segmenting tissue images to identify cells, linking those cells in cell graphs, computing various graph theoretical measures for each cell graph, and applying machine learning techniques. They apply variations of this method to detect cancer in breast tissue [8] and bone tissue [7].
Microbial communities:
Barberán et al. [3] study the co-occurrence of microbes. Their network consists of vertices corresponding to microbes and edges corresponding to statistically significant correlation in occurrence across soil samples. Based on several network measures, e.g., average path length, and a comparison of the network’s structure to a random graph, the authors find a distinct separation between two types of microbes: generalists and specialists.
Protein assembly:
A typical approach to identifying proteins involves collecting information on the peptide fragments (i.e., parts of a protein) in a sample and assembling these fragments to find proteins that could have been present in the sample. Zhang et al. [91] solve the assembly problem using a bipartite graph: observed peptides and potential proteins are mapped to vertices and an edge is added between a peptide and all proteins it is part of. To find likely candidates for proteins present in the sample, the authors reduce the resulting graph by merging vertices with identical connections, extract all connected components, and use a greedy set cover algorithm to find a set of proteins that cover all observed peptides.
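The final step above, greedy set cover, can be sketched in a few lines of Python. This is a generic textbook greedy cover, not the implementation of Zhang et al.; the function name greedy_protein_cover, the dictionary input format, and the alphabetical tie-breaking are assumptions, and the vertex merging and component extraction steps are omitted.

```python
def greedy_protein_cover(protein_peptides):
    """Greedily pick proteins until every observed peptide is covered.

    protein_peptides maps each candidate protein to the set of observed
    peptides it contains (the edges of the bipartite graph).
    """
    uncovered = set().union(*protein_peptides.values())
    chosen = []
    while uncovered:
        # Pick the protein covering the most still-uncovered peptides;
        # iterating in sorted order makes ties resolve alphabetically.
        best = max(
            sorted(protein_peptides),
            key=lambda p: len(protein_peptides[p] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_peptides[best]
    return chosen
```

The greedy choice does not guarantee a minimum cover, but it always terminates because every uncovered peptide belongs to at least one candidate protein.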
Other:
Royer et al. [71] propose the Power Graph Analysis method for compressing biological networks. They identify three basic motifs found in many networks: stars, cliques, and bicliques. By iteratively identifying motifs and replacing subgraphs with power nodes, the authors achieve compression rates of up to 85% without losing information for a variety of biological networks.
The human brain consists of an estimated 100 billion neurons and 1 quadrillion synapses. Although these neurons and synapses naturally form a brain network, this network is difficult to collect and analyze due to the small scale of the neurons and the large scale of the resulting network. Instead, brain networks used for practical neuroscience applications consist of brain regions and their communication paths.
Brain networks collected using fMRI are typically undirected and weighted, with each weight representing the level of communication between two brain regions. These weighted graphs are typically converted to unweighted graphs by dropping all edges with a weight below some threshold. The resulting graph can be analyzed for typical small-world properties: high average clustering coefficient and relatively low characteristic path length [82]. Medical conditions that have been studied using this technique include Alzheimer’s disease [74, 77], brain tumors [4], and traumatic brain injury [13]. Weighted networks can also be studied directly, as done by Stam et al. [76] for Alzheimer’s disease (AD) patients. The authors further study two damage models, Random Failure and Targeted Attack, through simulation and find that the Targeted Attack model best approximates the deterioration in brain connectivity observed in AD patients.
A directed brain network can be obtained from EEG recordings of electrical activity in the cerebral cortex. This type of network has also been shown to have typical small-world properties [90]: high average clustering coefficient and low average path length. Various studies have shown deviations in connectivity in patients affected by certain medical conditions. For example, patients with epilepsy have more regular (i.e., not centralized) activity between brain regions during a seizure [85]. Patients with spinal cord injuries show higher levels of internal organization and fault tolerance, possibly as compensation triggered by the injury [24].
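The thresholding step and the two small-world measures described above for undirected fMRI networks can be sketched in plain Python as follows. This is an illustrative sketch rather than the pipeline of any cited study: the function names, the input format (a dictionary from region pairs to weights), and the convention of averaging path lengths over reachable pairs only are assumptions.

```python
from collections import deque
from itertools import combinations


def threshold_graph(weights, threshold):
    """Drop edges below the threshold; return an unweighted adjacency map."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if w >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; vertices of degree < 2 count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def characteristic_path_length(adj):
    """Mean shortest path length over all reachable ordered vertex pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs  # assumes the graph has at least one edge
```

A small-world network would show a clustering coefficient well above that of a random graph of the same size and density, while keeping a comparably short characteristic path length.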
Graph analysis powers a variety of security applications, most notably to identify vulnerabilities or anomalies in computer networks.
Network vulnerability analysis:
Phillips et al. [64] present a system for analyzing vulnerabilities in computer networks using attack graphs. They generate attack graphs from attack templates, i.e., specifications of the pre- and post-conditions of an attack. These templates are combined in a graph in which each node represents a combination of machines, users, and/or permissions that the attacker has obtained and each edge represents an action that may be performed by the attacker to compromise a new machine, obtain more permissions, etc. A cost or probability of success may be associated with each edge to study the most likely paths of intrusion. The authors further describe a variety of techniques to improve security using their system, including selecting the most cost-effective measures from a set of possible security measures, or determining a minimal set of monitors to place such that each attack may be detected by multiple monitors.
Noel et al. [61] define a suite of metrics to quantify the vulnerability of a network based on its attack graph, including three metrics based on graph theoretical properties. First, they identify weakly connected components. The presence of separate components in the attack graph is indicative of a lack of vulnerabilities between multiple sets of machines in the network. Second, they identify cycles. Infecting any machine in a cycle gives indirect access to all other machines in the cycle, which represents a larger surface area for attacks. Third, they compute the diameter. A large diameter implies a large number of actions that an attacker needs to take to compromise the entire network.
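The three graph theoretical properties used by Noel et al. can each be computed with a short standard routine. The Python sketch below is a generic illustration, not the metric definitions from [61]: the function names, the edge-list input, and the definition of diameter as the longest shortest path among reachable vertex pairs are assumptions.

```python
from collections import deque


def weak_components(vertices, edges):
    """Connected components of the graph with edge directions ignored."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    component.add(y)
                    queue.append(y)
        components.append(component)
    return components


def has_cycle(vertices, edges):
    """Kahn-style check: vertices that can never be removed lie on or behind a cycle."""
    succ = {v: [] for v in vertices}
    indegree = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    removed = 0
    while ready:
        x = ready.pop()
        removed += 1
        for y in succ[x]:
            indegree[y] -= 1
            if indegree[y] == 0:
                ready.append(y)
    return removed < len(vertices)


def diameter(vertices, edges):
    """Longest directed shortest path among reachable vertex pairs."""
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    longest = 0
    for source in vertices:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in succ[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        longest = max(longest, max(dist.values()))
    return longest
```

On an attack graph with machines m1 to m5, a cycle m1, m2, m3, one further edge from m3 to m4, and an isolated m5, the sketch reports two weak components, a cycle, and a diameter of 3.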
Malware detection:
Kwon et al. [46] analyze download graphs to identify droppers, malicious programs that download other programs (i.e., malware, or other droppers) to a host machine. The download graph consists of programs as vertices, and download relationships as edges. The authors extract subgraphs, called influence graphs, rooted in a single program vertex and containing all other programs that have been directly or transitively downloaded by the root program. The authors use a variety of metrics to summarize each influence graph and use machine learning techniques to classify the root of each influence graph as a legitimate program or a dropper.
Polonium [15] detects malware by analyzing a large, bipartite graph of machines and files found on those machines. Polonium uses a belief propagation algorithm to estimate the probability of a file being malicious. This algorithm is seeded with external data on machine reputation and known malicious files.
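The influence graphs used by Kwon et al. are, in graph terms, the subgraphs reachable from a root vertex. A minimal Python sketch of that extraction is given below; the function name influence_graph, the dictionary input format, and the example file names are hypothetical, and the summary metrics and classifier from [46] are not reproduced.

```python
def influence_graph(downloads, root):
    """Collect every program downloaded directly or transitively by root.

    downloads maps a program to the list of programs it downloaded.
    Returns the reached programs and the download edges among them.
    """
    reached, stack = {root}, [root]
    edges = []
    while stack:
        program = stack.pop()
        for child in downloads.get(program, []):
            edges.append((program, child))
            if child not in reached:
                reached.add(child)
                stack.append(child)
    return reached, edges
```

For instance, if a hypothetical dropper.exe downloads a.exe and b.exe, and a.exe in turn downloads c.exe, the influence graph rooted in dropper.exe contains all four programs and three download edges.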
Botnet detection:
Millions of computers worldwide have been infected by malicious software and are now part of networks of infected hosts called botnets. Many such botnets rely on peer-to-peer (P2P) communication to spread commands to members of the network. Nagaraja et al. [56] propose a method for identifying a P2P botnet in a communication graph, i.e., a graph in which every host is a vertex and two vertices are connected by an undirected edge if the corresponding hosts have communicated during some period of time. First, the authors use random walks on the communication graph to distinguish fast-mixing hosts (i.e., likely members of a P2P network) from slow-mixing hosts. Next, the vertices are clustered into subgraphs using k-means, and the nodes are clustered using an extension of SybilInfer [CITE], an algorithm based on random walks to extract a strongly connected group of nodes from a graph.
Anomaly detection:
Jiang et al. [41] analyze DNS traffic to identify anomalies. They model DNS failures as a bipartite graph mapping hosts that have issued at least one failed DNS request to domain names they have queried. Using a matrix factorization algorithm (tNMF), the authors decompose the DNS failure graph into communities of hosts and domains. They analyze each community and identify typical structures: stars and bi-meshes. Further, they track the evolution of communities over time.
Video surveillance:
Calderara et al. [14] use spectral graph theory to analyze trajectories of people observed by video surveillance. The physical space observed in the video is quantized and the observed trajectories are translated to sequences of quantized locations. A graph is constructed with nodes representing a transition from one location to another and weighted edges representing the possibility of moving from one location to another location via precisely one intermediate location (shared by the source and target node). Spectral graph theory is used to filter out noisy trajectories from the graph and to determine if a new trajectory is either consistent with earlier observations or anomalous.
Graphs occur naturally in various aspects of software development, e.g., call graphs, control flow graphs, dependency graphs. We describe several applications of graph analysis in software development and related activities.
Identifying and locating software bugs:
Cheng et al. [16] localize bugs by mining software behavior graphs. Execution traces of a program are captured as method-level or block-level behavior graphs in which each vertex represents a method or basic block, respectively. Edges in software behavior graphs can capture a variety of control flow relationships, e.g., one method calling another. The authors capture behavior graphs from multiple executions of the same program and label which behavior graphs correspond to faulty executions. Next, they mine the most discriminative subgraph(s) to localize the difference(s) in execution between correct and faulty behavior.
Maxwell et al. [53] identify memory leaks in heap dumps by mining recurring patterns of object references. Their heap graph representation contains a vertex for every object in the heap dump and an edge for every reference. The authors first reduce the graph by extracting the dominator tree and using graph grammar reduction to compress typical (recursive) patterns. Next, they mine frequent subgraphs to identify potential memory leaks.
Eichinger et al. [23] locate software bugs in call graphs using graph mining. Their approach first reduces call graphs by identifying sequences of identical calls and merging the corresponding subgraphs. Edge weights are introduced to capture the frequencies of calls. Next, the authors use weighted frequent subgraph mining to identify differences between successful and unsuccessful executions, thus revealing potential locations of bugs.
Defect prediction:
Zimmermann et al. [92] use network measures on software dependency graphs to identify critical binaries and predict defects. They find significant correlation between several network measures (e.g., eigenvector and degree centrality) and the number of defects in a binary. They find that network measures are able to predict 60% of defects, compared to 30% for traditional software complexity measures.
Diagnosing distributed systems:
G [30] is a graph processing system for analyzing large software execution graphs, i.e., graphs describing system events and their relationships. G can be programmed to perform custom queries on an execution graph. Underlying these queries are two key operations: slicing (i.e., extracting all events that are causally related to a queried event) and hierarchical aggregation (i.e., merging a set of related events into a single event).
Software plagiarism detection:
GPLAG [51] detects software plagiarism by representing programs as program dependence graphs. Statements in the program are encoded as vertices, and control flow and data dependencies are encoded as directed edges. Type information is encoded as vertex and edge properties. Commonalities between two programs are detected using subgraph isomorphism.
Developer collaboration:
Surian et al. [78] analyze the SourceForge collaboration network to discover how well connected developers are, what topological structures characterize developer communities, etc. The authors first identify connected components and mark each as either a small or a large community. Second, they mine common patterns among the small communities. Third, they compute the frequency of each pattern in both the small and large communities.
Navigation systems are a stereotypical example of real-world graph analysis applications. More generally, logistics and planning problems are often solved using graphs.
Road networks:
Urban street networks can be naturally modeled as a graph; intersections can be mapped to vertices and the roads connecting intersections can be mapped to edges. This is typically referred to as the primal graph representation. For some applications an alternative representation may be preferred. The dual graph representation maps roads to vertices and intersections to edges. Porta et al. describe various methods for analyzing both the primal [67] and dual [66] urban street graphs.
TrajGraph [35] uses graph-based visual analytics to study traffic in urban street networks, in particular using taxi route data. It uses partitioning techniques to reduce the size of the graph for visualization purposes. Further, TrajGraph provides automated analysis of traffic to identify hubs at different points in time. Finally, it allows users to highlight arbitrary areas (i.e., subgraphs) to analyze local traffic.
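The relation between the primal and dual representations described above for road networks can be made concrete with a small Python sketch. It assumes, purely for illustration, that every primal edge carries a street name and that all segments with the same name form one road; how segments are grouped into roads in the work of Porta et al. is not covered here, and the function name dual_graph is hypothetical.

```python
from itertools import combinations


def dual_graph(primal_edges):
    """Derive dual edges (pairs of streets) from a primal street graph.

    primal_edges is a list of (intersection_a, intersection_b, street_name)
    tuples. Two streets are connected in the dual graph if they meet at
    at least one intersection.
    """
    streets_at = {}
    for a, b, street in primal_edges:
        streets_at.setdefault(a, set()).add(street)
        streets_at.setdefault(b, set()).add(street)
    dual = set()
    for streets in streets_at.values():
        # Sorting gives each undirected dual edge a single canonical form.
        for s, t in combinations(sorted(streets), 2):
            dual.add((s, t))
    return dual
```

In the dual graph, a long street that crosses many others becomes a single high-degree vertex, a property that is hidden in the primal graph where the same street is split over many low-degree intersections.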
General planning problems:
GraphPlan [9] represents a planning problem as a Planning Graph, a leveled DAG in which nodes represent propositions (states?), edges represent actions, and levels represent timesteps in the produced plan. The graph is dynamically constructed and pruned one level at a time. A custom heuristic-based traversal algorithm is used to find valid plans.
Hong et al. [34] propose a goal graph-based approach to goal recognition, loosely based on GraphPlan. They use an iterative two-stage algorithm to repeatedly extend the goal graph based on observed actions, followed by the identification of plans and goals that are consistent with the new graph.
Helmert [32] proposes a greedy algorithm based on a causal graph for planning problems. The causal graph encodes state variables as vertices and possible causal relationships as directed edges (i.e., an edge from vertex A to vertex B indicates that the value of state variable B may depend on the value of state variable A). Further, every state variable has an associated domain transition graph with a vertex for every possible value of the variable and a directed edge for every transition, annotated with the preconditions for the transition. A plan is found using a heuristic-based traversal of the causal graph. The heuristic includes repeated shortest path searches on domain transition graphs.
Network routing:
Daly et al. [19] apply social network analysis to routing in mobile ad hoc networks (MANETs). Routing in MANETs is challenging because the network graph is rarely connected. Efficiently delivering messages requires identifying devices that are likely to connect to many other devices, thus quickly spreading the message through the network. By using network analysis techniques, the authors identify devices in the network with high betweenness (i.e., short routes to many other devices) as key candidates for spreading messages.
Communication between people has long been studied to understand how groups of people interact. One-on-one communication methods, including (e-)mail, phone calls, and text messaging, can be modeled as a social interaction network in which each vertex represents a person and edges represent that two people have communicated. Similarly, social media platforms like Facebook and Twitter naturally form a social interaction network, often with additional information on types of interactions (friendships, follower-relationships, likes, etc.).
Schwartz et al. [75] analyze a directed interaction graph derived from mail traffic to discover shared interests. They apply typical social network analysis (SNA) techniques, including extracting the largest weakly connected component, computing the diameter, average path length, etc. Further, they cluster the graph using Aggregate Specialization Graph Isolation and Specialization Subgraph Derivation to identify communities with shared interests.
Recently, many studies have analyzed the Twitter graph to identify important users (influencers), authoritative users, etc. For example, TURank [87] ranks Twitter users based on authority scores. These scores are computed using a variation of PageRank, called ObjectRank, on the user-tweet graph consisting of users, tweets, follow relationships, post relationships, and retweet relationships. Khrabov et al. [43] identify influential Twitter users using a combination of PageRank and a custom ranking based on a user’s number of mentions. By using dynamic metrics, they identify key users by periods of consecutive, accelerated growth in influence. Yang et al. [88] analyze retweets using a variation of the HITS algorithm to identify interesting posts, i.e., posts that may be of interest to a wider audience than the direct neighborhood of the user who posted them. They first identify authoritative users in the user-retweeted-user graph, and next identify interesting tweets based on the authority of their creator and retweets. They conclude that, by considering both users and retweets, their method performs better than considering only retweets.
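Several of the Twitter studies above build on PageRank. As background, the following Python sketch shows plain PageRank computed by power iteration; it is not ObjectRank, TURank, or the custom ranking of Khrabov et al. The function name, the damping factor of 0.85, the fixed number of iterations, and the uniform redistribution of rank from nodes without outgoing links are conventional choices assumed here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links maps each node to the list of nodes it points to, e.g., a user
    to the users they follow or retweet.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in links.items():
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank
```

With links {"a": ["c"], "b": ["c"], "c": ["a"]}, node c receives the highest score because two nodes point to it, and b, which nobody points to, receives the lowest.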
Mota et al. [55] analyze speech graphs derived from dream reports produced by schizophrenic, bipolar, and control subjects. A speech graph consists of a node for every unique word in a body of text and a directed edge for every consecutive pair of words. By comparing 14 attributes (e.g., largest connected component, repeated and parallel edges, cycles, clustering coefficient) of each speech graph, the authors achieve high accuracy in classifying the group of a subject based on their dream report.
Verbal fluency tests are used to assess a person's ability to produce a sequence of words satisfying some task. For example, in a category fluency test, the subject is asked to list as many words fitting the given category as they can in a limited time. By converting this sequence of words to a speech graph, the subject's speech patterns can be analyzed. Lerner et al. [49] study the results of category fluency tests taken by adults with mild cognitive impairment (MCI), with Alzheimer's disease (AD), or neither. They merge the speech graphs of all subjects in the same subject group. Analysis of the three resulting graphs, one for each subject group, reveals that they have typical small-world properties, but that those properties are less prevalent for the MCI and AD groups. This is consistent with earlier research. Bertola et al. [5] perform a similar study, but analyze each subject's speech graph individually. Using an extensive set of graph measures, they can accurately classify a subject as belonging to one of the three subject groups.
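The speech graph construction described above (one vertex per unique word, one directed edge per consecutive word pair) can be sketched in a few lines. The word sequence below is hypothetical, and storing repeated edges as counts is an assumption of this sketch rather than the representation used in the cited studies.

```python
from collections import Counter


def speech_graph(words):
    """Build a speech graph: one vertex per unique word and one directed edge
    per consecutive word pair. Repeated pairs are kept as edge multiplicities."""
    vertices = set(words)
    edges = Counter(zip(words, words[1:]))
    return vertices, edges


# Hypothetical category fluency response (category: animals).
response = "dog cat horse cat dog cat".split()
vertices, edges = speech_graph(response)
repeated_edges = {pair: count for pair, count in edges.items() if count > 1}
```

Attributes such as the number of repeated edges or the size of the largest connected component can then be computed from `vertices` and `edges` and used as classification features.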
By virtue of the incremental nature of scientific progress, publications naturally form networks through the citations that connect them. Such citation networks have been studied for many scientific communities. Jacovi et al. [39] apply social network analytics techniques to study the citation graph of the CSCW conferences and related work. They identify communities and track their evolution over time. Further, they identify chasm-papers: papers that were influential outside the CSCW conferences, but overlooked in the CSCW community. Gondal [28] analyzes a citation network for an emergent research field. In addition to small-world properties, the author considers the presence of various patterns, including stars, papers sharing a large number of cited authors, and papers citing primarily authors from a single country. Tsatsaronis et al. [80] apply Power Graph Analysis, a method originally developed to analyze and visualize biological networks, to study the DBLP citation network and its evolution.
Molecules can be represented by graphs with atoms as vertices and the bonds connecting atoms as edges. Large databases of such molecular graphs and the properties of the associated molecules have been compiled and are used in, e.g., pharmaceutical research to identify (fragments of) molecules with desirable properties for a new drug. Nijssen et al. [60] propose a method based on frequent subgraph mining to extract common molecule fragments from a set of molecules. This method may be used to identify molecule fragments that characterize a set of input molecules sharing a desirable property. Wegner et al. [84] propose a method for classifying molecules by mining their corresponding molecular graphs for maximum common substructures.
Iori et al. [37] study overnight inter-bank traffic of Italian banks between 1999 and 2002. They define a temporal graph in which each vertex corresponds to a bank and each edge indicates that at least one transfer occurred between two banks during the selected time period. Each edge is further annotated with the number and total volume of transfers. After analyzing various network measures, the authors conclude that while some microstructure characteristics were found (e.g., degree correlates strongly with the size of the bank), the network is somewhat random, which is indicative of an efficient system. Wang et al. [81] use a different approach to study financial data as time series. They apply visibility graph analysis to study four time series related to China's quarterly economic growth between 1992 and 2010. They identify small-world properties in all corresponding graphs.
3.11 Linguistics
Jiang et al. [40] propose a graph-based approach to document classification. Each document is represented by a graph containing a variety of vertex types (e.g., word, part of speech) and edge types (e.g., part of speech to a word, word order). The authors use frequent subgraph mining on document sets to group documents that share similar, uncommon phrasing. Biemann [6] proposes Chinese Whispers, a graph clustering algorithm designed to address various natural language processing problems. Example applications studied with Chinese Whispers include language separation (i.e., detecting languages in word co-occurrence graphs) and word class acquisition (i.e., clustering a word co-occurrence graph to detect classes of words).
We identified several other applications of graph analysis that do not fit any of the previously discussed application domains.
Cardiographs:
Jiang et al. [42] study the impact of meditation on heartbeat dynamics using visibility graph analysis. They monitor the heart rate of volunteers before and during meditation. The authors find that meditation causes significant changes in heartbeat rhythms.
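Visibility graph analysis converts a time series into a graph. The text does not specify which variant the cited studies use, so the sketch below assumes the commonly used natural visibility criterion: two samples are connected if the straight line between them passes strictly above every sample in between. The series values are hypothetical.

```python
def visibility_graph(series):
    """Natural visibility graph of a time series sampled at times 0, 1, 2, ...

    Samples i < j are connected if every intermediate sample k lies strictly
    below the straight line from (i, series[i]) to (j, series[j])."""
    n = len(series)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            visible = all(
                series[k] < series[i] + (series[j] - series[i]) * (k - i) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.add((i, j))
    return edges


# Hypothetical series with a single peak at index 2.
heart_rate = [2.0, 1.0, 3.0, 1.0, 2.0]
edges = visibility_graph(heart_rate)
```

The peak at index 2 sees every other sample and blocks visibility across it, which is how local extremes in the series become hubs in the resulting graph. This quadratic-time version is meant for illustration only.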
Geometric constraint problem:
Geometric constraint problems describe sets of geometric entities (e.g., points, lines) and constraints between pairs of entities (e.g., distance, angle) with the purpose of assigning to each object a position, orientation, etc. such that all constraints are met. Lee et al. [48] propose an approach based on graph reduction. Their algorithm initially constructs a graph containing each geometric entity as a vertex and each constraint as an edge. Next, it iteratively identifies clusters based on degrees of freedom analysis (typically small cycles) and collapses them to a single pseudo-geometric entity until the graph is reduced to a single vertex.
Knowledge-based systems:
Rodriguez [70] analyzes the relationships between datasets maintained by the Linked Data community. They create a graph in which every vertex represents a dataset and every directed edge indicates a reference from one dataset to another. They identify strongly connected communities that correspond well with known domains (e.g., biology, computer science).
Recommendation systems:
Mirza et al. [54] propose a method for evaluating recommendation systems by interpreting their outcomes as a pair of graphs: a social graph comprising all users and a bidirectional edge for every pair of similar users, and a recommender graph which extends the social graph with all recommended artifacts. By analyzing the connected components in the recommender graph and comparing average path lengths from users to artifacts between the recommender graph and a random graph, the authors evaluate the effectiveness of a recommendation algorithm.
Seismology:
The visibility graph analysis method has been applied to seismology data with varying success. For example, analyzing earthquakes in Italy between 2005 and 2010 reveals a long-range correlation in their magnitudes [79]. However, time-clustering was not detected by the method.
Video analysis:
Yeung et al. [89] use a Scene Transition Graph to represent shots (as vertices) and the transitions between them (as directed edges). The authors identify strongly connected components as scenes in the analyzed video.
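To illustrate how strongly connected components group shots into scenes, the sketch below applies Kosaraju's algorithm to a small, hypothetical transition graph. The shot names and transitions are invented for the example, and the choice of Kosaraju's algorithm is an assumption; the text does not say which algorithm Yeung et al. use.

```python
def strongly_connected_components(vertices, edges):
    """Kosaraju's algorithm; returns a list of components (sets of vertices)."""
    forward = {v: [] for v in vertices}
    backward = {v: [] for v in vertices}
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)

    # Pass 1: record vertices in order of DFS completion on the original graph.
    visited, order = set(), []

    def visit(v):
        visited.add(v)
        for w in forward[v]:
            if w not in visited:
                visit(w)
        order.append(v)

    for v in vertices:
        if v not in visited:
            visit(v)

    # Pass 2: DFS on the reversed graph in decreasing completion order.
    assigned, components = set(), []

    def collect(v, component):
        assigned.add(v)
        component.add(v)
        for w in backward[v]:
            if w not in assigned:
                collect(w, component)

    for v in reversed(order):
        if v not in assigned:
            component = set()
            collect(v, component)
            components.append(component)
    return components


# Hypothetical video: shots A and B alternate, then shots C and D alternate.
shots = ["A", "B", "C", "D"]
transitions = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("D", "C")]
scenes = strongly_connected_components(shots, transitions)
```

Because the video never cuts back from C or D to A or B, the two alternating pairs end up in separate components. The recursive traversal is adequate for small graphs; large graphs would call for an iterative version.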
Web usage mining:
Website operators track the browsing patterns of visitors to study how visitors interact with the website, and consequently to identify changes that can be made to the website to improve the browsing experience. Heydari et al. [33] propose a graph-based approach to web usage mining. They capture each visitor's traversal through a website as a graph by representing web pages as vertices and links followed from one page to another as edges. Vertices are weighted by the time a user spent on the corresponding page. The authors mine a set of browsing graphs to identify common browsing habits.
A graph is a structure consisting of a set of vertices and a set of vertex pairs (edges). Graphs are typically used to represent a collection of entities and their pairwise relationships. Many applications extend this model to encode additional information: the strength of a relationship, specific properties of an entity, differentiation of relationship types, etc. In this section we identify several commonly used graph features among the surveyed applications presented in Section 3. We present typical use cases for each feature.
Edges in a graph may be either directed, indicating a one-way relationship between two vertices, or undirected, indicating a two-way relationship. The types of relationships used by the surveyed applications are diverse and we did not identify a clear classification of use cases of either directed or undirected edges. However, we highlight several recurring relationship types and other observations. First, directed edges are frequently used in applications that analyze sequential or causally connected processes, such as tracing movement through time [14], analyzing software call graphs [23], finding valid plans in a planning graph [9], or capturing a person's speech to understand psychological disorders [5]. Directed edges are also prevalent in sociology graphs to represent social interactions that are typically initiated by one party. For some types of graphs we found both directed and undirected variations, e.g., some methods for measuring brain activity do not capture the direction of communication, but other methods do.
Keywords used to validate manual inspection: weight, length, distance, strength.
A typical extension of the graph model is the addition of weights to edges (and, less commonly, to vertices). That is, the definition of a graph is extended with a function that maps every edge (or vertex) to a single, typically real-valued, weight. Although weights may be used to model any property, we highlight two typical use cases of edge weights.
First, edge weights are commonly used to represent the strength of a relationship. For example, the edges in a brain network model communication paths between two regions. Their edge weights model the amount of activity, e.g., in [4, 13, 24, 74, 76, 77, 85]. Similarly, edges can model the frequency of a relationship's occurrence, e.g., the number of calls made to a function [23], the number of co-authored papers [80], or the number of bank transfers [37].
Second, edge weights may be used to represent the length or distance of a connection. The stereotypical example for this use of edge weights is road networks. In such a network, edges often represent roads and their weights represent the length of roads (or the expected time taken to travel from the start to the end of the road). This application has been described in [67] and many introductory textbooks on graph theory.
Only the first use case of edge weights as a measure of relationship strength was prevalent among the applications presented in Section 3. However, shortest path queries, an important class of graph analysis methods described in Section 5.2, operate on either unweighted graphs or on weighted graphs representing distances. To accommodate shortest path queries on graphs with edge strengths, two approaches are commonly used. First, the weighted graph may be converted to an unweighted graph by removing all edges with a strength below some threshold and removing the weights of all remaining edges. This approach is frequently used for brain networks [4, 13, 24, 74, 77, 85]. Second, the edge strengths may be converted to lengths by defining the length of an edge to be the inverse of its strength. Thus, strong edges are short and a path along several strong relationships may be shorter than a direct path via one weak relationship. This approach has been applied to citation networks [58], brain networks [72], etc.
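The two conversions described above can be sketched as follows. The graph, the strength values, and the function names are hypothetical, and the sketch assumes strictly positive strengths so that the inverse is defined. Dijkstra's algorithm is used for the shortest path query on the converted lengths.

```python
import heapq


def threshold_to_unweighted(strengths, threshold):
    """Approach 1: drop weak edges, then discard the weights of the rest."""
    return {edge for edge, strength in strengths.items() if strength >= threshold}


def strengths_to_lengths(strengths):
    """Approach 2: define the length of an edge as the inverse of its strength."""
    return {edge: 1.0 / strength for edge, strength in strengths.items()}


def shortest_path_length(lengths, source, target):
    """Dijkstra's algorithm on an undirected graph with positive edge lengths."""
    adjacency = {}
    for (u, v), length in lengths.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, u = heapq.heappop(queue)
        if u == target:
            return dist
        if dist > best.get(u, float("inf")):
            continue  # stale queue entry
        for v, length in adjacency.get(u, []):
            candidate = dist + length
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                heapq.heappush(queue, (candidate, v))
    return float("inf")


# Hypothetical strengths: a weak direct link a-c and two strong links via b.
strengths = {("a", "b"): 4.0, ("b", "c"): 4.0, ("a", "c"): 1.0}
lengths = strengths_to_lengths(strengths)
distance = shortest_path_length(lengths, "a", "c")
```

Here the route through `b` (length 0.25 + 0.25) is shorter than the weak direct edge (length 1.0), which matches the observation that a path along several strong relationships may be shorter than a direct path via one weak relationship.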
By definition, graphs consist of one set of vertices and one set of edges. In some graphs, all vertices and edges are homogeneous. That is, they represent instances of one type of entity or relationship, respectively. However, some applications include multiple types of entities and relationships, which are combined in a single heterogeneous graph. These types may be encoded in the graph representation as vertex and edge labels. To avoid confusion with other uses of graph labels, we use the terms vertex/entity type and edge/relationship type to distinguish types of entities and relationships modeled by a heterogeneous graph.
Heterogeneity can have significant impact on the structure of a graph. For example, some types of relationships may only occur between two particular types of entities, some types of entities may not be able to have direct relationships with other types, and some types of relationships may be more numerous than others. Each of these examples introduces a bias in the number and location of relationships in a graph that cannot be explained by topological measures such as degree distribution.
Examples of heterogeneity in vertex types include user and tweet vertices in a Twitter graph [87], hosts and domain names in a DNS network [41], and proteins and peptides in proteomics [91]. Examples of heterogeneity in edge types include dependencies between program statements in a call graph [51], causal relationships in diagnosing distributed systems [30], and relationships between words in text classification [40].
Footnote: In graph theory, labeling most commonly refers to assigning a unique label (or identifier) to every vertex/edge, typically integers from the range [1, ...].
4.4 Property Graphs
Weighted and heterogeneous graphs both extend graphs with a function mapping vertices and/or edges to a single weight or type. Property graphs generalize this approach to an arbitrary number of functions mapping to arbitrary values. Each function represents a property of the modeled entity or relationship. For example, in a social network one function maps each user vertex to its name and another function maps each friendship edge to the date the friendship was formed.
In our characterization, we use a more restricted definition of property graphs: an application is characterized as using property graphs if and only if the analysis of the graph uses the values of these properties. Although the entities and relationships of almost all applications have properties, these properties are not typically used in the analysis of the corresponding graph. For example, a social network application may identify communities by considering only the edges present in the graph and not considering the name, date of birth, or other personal information of each user. This application can use any graph analysis system without support for property graphs. Thus, we do not consider such an application to have a property graph.
Examples of property graphs include chemical structure graphs annotated with atom and bond characteristics [60, 84], and an attack graph annotated with pre- and post-conditions for modeling attack vectors in a computer network [64].
Keywords used to validate manual inspection: temporal, timestamp, evolution.
In many real-world applications, graphs evolve over time as the entities and relationships captured in these graphs change. Most surveyed applications do not address this evolution in their analysis; the graphs analyzed by these applications represent snapshots of the real-world networks they model. In contrast, some applications use the temporal nature of their graphs explicitly, e.g., to study the evolution of the network. Graphs that include temporal data (e.g., the time at which a relationship was established) are known as temporal graphs.
We identify three use cases for temporal graphs. First, they are used to study the evolution of relationships in networks. For example, a citation graph annotated with the publication date of each paper can reveal how scientific communities interact and co-evolve [39]. Second, some temporal graphs have static structures but dynamic weights or other properties, e.g., traffic data per hour in a road network [35], or brain activity levels captured periodically [85]. Third, temporal graphs may be used to study sequences of events connected by causality relationships, e.g., communication via messages in a distributed system [30].
Footnote: In our taxonomy we classify an application as using temporal graphs if and only if the application explicitly uses timestamped data for analysis. That is, even if a graph consists of timestamped data or data collected over a period of time, it is not classified as a temporal graph if the temporal data is not explicitly analyzed.
Many applications that operate on graphs use graph analysis techniques to extract information from a graph that is not evident from individual entities or relationships. For example, the degree of a single vertex in a social network graph may inform us how many friends the corresponding user has, but by analyzing all relationships in the graph, we can determine whether the user is popular, how important they are for cohesion in the network, what communities they are part of, etc. In this section we characterize the methods for graph analysis used by the applications presented in Section 3.
Keywords used to identify uses: clustering coefficient, degree distribution, degree assortativity, local efficiency, local subgraph.
Many graph-based applications use metrics to quantify a graph's structural properties. We refer to a common class of such metrics as neighborhood statistics. These statistics are computed for each vertex in a graph and depend only on the immediate neighborhood of a vertex.
Two common neighborhood statistics are the degree distribution and local clustering coefficient of a graph. The degree distribution of a graph is used across many application domains (e.g., biology [24], finance [81]) to support the hypothesis that a graph's structure is different from a random graph. Similarly, the local clustering coefficient is [...]
Footnote: The neighborhood of a vertex includes the vertex itself, all vertices with an edge to or from the target vertex, and all edges that connect two vertices included in the neighborhood.
Footnote: Studies across many domains have found that their graphs are power-law graphs, i.e., that the degree distribution of the graph follows a power-law distribution. This matches the observations of Barabási and Albert [2], who predicted the presence of power-law distributions across many large complex networks. However, the "power-law hypothesis" has frequently been argued against, including in a recent study by Broido and Clauset [11], which found strong evidence of power-law properties in only 4% of the graphs they analyzed.
Keywords used to identify uses: path length, shortest path, breadth-first, depth-first, traversal, global efficiency, random walk.
Many applications of graph analysis are not restricted to the neighborhood of vertices, but instead involve the analysis of paths between vertices. Of particular interest are shortest path queries, i.e., identifying a path of minimal total length between two given vertices (or of a minimal number of edges in unweighted graphs). Although these queries can be answered individually using Dijkstra's algorithm (or breadth-first search on unweighted graphs), many applications require computing the shortest path between multiple or all vertex pairs (e.g., using the Floyd-Warshall algorithm).
We identify three primary uses of shortest path algorithms. First, shortest paths can be used directly, e.g., to find an attack path with the highest probability of success in a network attack graph [64]. Second, the average path length can be used to identify "small-world" properties [82] in a graph. Global efficiency [47], the average inverse shortest path length, is used to determine the efficiency of communication between brain regions [24], to identify potential software defects [92], etc. Third, the maximum shortest path length (diameter) and related metrics (e.g., eccentricity) are used across many domains to study worst-case paths.
We further identify two applications of path-based graph analysis. First, the enumeration of near-shortest paths (i.e., listing all paths between two vertices that are no longer than ρ times the shortest path for a given parameter ρ) is used to identify a broad set of security vulnerabilities in a network [64]. Second, the number of shortest paths between two vertices (i.e., the path multiplicity of a vertex pair) is computed as a measure of flexibility in communication between brain regions [90].
Closely related to path queries are graph traversals, i.e., traversing through a graph from a source vertex via its outgoing edges. Typical examples of traversals include breadth-first search and depth-first search, which are both commonly used as building blocks in more complex methods of analysis. In the applications presented in Section 3 we further identified the use of random walks to identify sub-networks of bots in a larger network [56], and domain-specific traversals used for planning [9, 32, 34] and debugging distributed systems [30].
Keywords used to identify uses: connected components.
Instead of identifying how vertices in a graph are connected (i.e., discovering paths), some graph problems are concerned with determining if vertices are connected. The typical application of analyzing the connectivity of a graph is identifying its connected components. That is, dividing a graph into components such that two vertices belong to the same component if and only if a path exists between them. For directed graphs we distinguish two types of connected components: weak and strong. In a weakly connected component, any pair of vertices is connected by an undirected path (i.e., a path that treats every directed edge as though it is undirected). In a strongly connected component, directed paths exist from each vertex to every other vertex in the component. We identify several use cases of identifying connected components in the reviewed articles.
First, some applications decompose their graph into connected components to analyze components individually. For example, the largest component(s) in a communication network [75] or DNS failure graph [41] contain the vast majority of interesting vertices. By extracting only those components, any further analysis is not biased by a large number of isolated vertices. For other applications all components are of interest, e.g., a matching problem in a bipartite protein-peptide graph can be split into one smaller sub-problem per connected component [91].
Second, the number of components or size of the largest component can be used as metrics to classify graphs. For example, the fraction of vertices that belong to the largest connected component is a highly discriminative feature in classifying breast [8] or bone [7] tissue samples for cancer diagnosis, or in classifying speech graphs [5, 55].
Third, a graph's components can be interpreted as communities or otherwise related entities. For example, strongly connected components in a scene transition graph correspond to story units [89] and components in a network attack graph correspond to sets of servers with vulnerable interconnections [61]. Other methods for detecting communities in a graph are presented in Section 5.4.
Keywords used to identify uses: centrality, PageRank, ranking, HITS.
In most real-world networks, not all entities are of equal importance to the network. Key to numerous applications is identifying the most important entities in a network. This is typically achieved by assigning to each entity a score to signify its importance and deriving a ranking from these scores.
Most commonly used measures of importance are based on vertex- or edge-centrality. The centrality of a vertex or edge is a measure of how important it is to the structure of a network. Three commonly used categories of centrality are betweenness centrality, closeness centrality, and degree centrality, as first popularized by Freeman [26] in the context of social networks. More recently, PageRank centrality [62] has seen adoption across application domains after first being used to identify important websites in the Google search engine.
Other measures for ranking used by reviewed applications include: information centrality in software dependency graphs [92], straightness centrality in road networks [67], ObjectRank as an adaptation of PageRank to a heterogeneous Twitter graph [87], and Hyperlink-Induced Topic Search (HITS) to identify hubs and authorities (e.g., on Twitter [88]).
Keywords used to identify uses: cluster(ing), community/communities.
Clustering is the act of identifying clusters (or groups) of entities in a graph. Clustering techniques are often used to extract groups of related entities from a graph based on the relationships they have formed. Because the definition of "related entities" is application-specific, there exist many specialized algorithms for clustering. Among the surveyed applications, we identify two main approaches. First, some applications decompose a graph into well-defined groups, e.g., connected components [78, 89, 91] or cliques [50, 71, 80]. We refer to this approach as rule-based clustering.
Second, applications can use community detection algorithms. Although there are many conflicting definitions of communities, a community is typically characterized by a large number of internal relationships and a small number of relationships with entities outside the community. Applications of community detection include grouping related papers in a citation network [39], identifying city regions [35], and finding people with shared interests [75]. We refer to this approach as community-based clustering.
Footnote: Although Freeman's centrality measures were defined in earlier work by different authors, Freeman is often credited for providing an intuitive conceptual interpretation of centrality in the context of social networks and for identifying three types of centrality that cover important structural characteristics of social networks.
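As a minimal sketch of rule-based clustering, the example below groups the vertices of an undirected graph by connected component, so that two vertices share a cluster if and only if a path connects them. The union-find approach, the function name, and the sample graph are assumptions chosen for illustration, not the method of any cited study.

```python
def connected_component_clusters(vertices, edges):
    """Rule-based clustering: group vertices by connected component."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the representative, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)

    clusters = {}
    for v in vertices:
        clusters.setdefault(find(v), set()).add(v)
    return list(clusters.values())


# Hypothetical undirected graph with two groups and one isolated vertex.
vertices = ["p1", "p2", "p3", "q1", "q2", "r1"]
edges = [("p1", "p2"), ("p2", "p3"), ("q1", "q2")]
clusters = connected_component_clusters(vertices, edges)
```

The rule here is fixed and unambiguous, which is what distinguishes it from community detection, where the notion of a "good" group depends on the chosen algorithm and its objective.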
Keywords used to identify uses: subgraph, isomorphism, motif, pattern matching, cycle, clique, star, mesh.
The subgraph isomorphism problem is the task of identifying a subgraph of the input graph that is isomorphic to a second graph (the pattern). The outcome of the subgraph isomorphism problem may be used directly by applications, e.g., to look up chemical compounds by structure in a database [60, 84], or to determine the similarity of two program dependence graphs for plagiarism detection [51]. Subgraph isomorphism is also used as a component for solving other problems. For example, frequent subgraph mining uses subgraph isomorphism to identify subgraphs that occur frequently in a set of graphs. This technique can be used to discover common web browsing patterns [33], localize software bugs [23], classify text [40], etc. More recently, discriminative subgraph mining algorithms have been developed to identify discriminative subgraphs. That is, given a set of labeled input graphs, these algorithms identify subgraphs that typically occur in graphs with one label, but not in graphs with another label. An exemplary use case for discriminative subgraph mining is locating bugs through program flow graphs, each labeled as either "triggered bug" or "did not trigger bug" [16].
Related to subgraph isomorphism is the problem of identifying subgraphs that match a more loosely defined pattern, e.g., a clique or a star. Unlike the patterns used as input to subgraph isomorphism, stars and cliques describe a general structure of vertices and edges, but no fixed size. We refer to this set of problems as pattern detection. Example applications include identifying protein complexes as clique patterns [50] and identifying malicious hosts or domain names as star patterns in a DNS failure graph [41].
Many applications add properties to an input graph as part of their graph analysis method, e.g., by adding a distance attribute to every vertex in a shortest path computation, or a component identifier as output of a connected components algorithm. In contrast, the structure of the graph is often static: vertices and edges are neither added nor removed. We identify several use cases for mutating a graph's structure as part of the analysis of a graph. We distinguish between graph construction, i.e., extending the input graph or constructing a new graph algorithmically, and graph reduction, i.e., removing or merging vertices or edges from the input graph. In our taxonomy, we do not consider extracting a subgraph from the input graph to be graph mutation.
Graph construction is used to dynamically generate parts of a planning graph that are relevant to the analysis, especially when the full planning graph contains many irrelevant vertices and may not fit in memory [9, 34]. Graph reduction is often used to merge groups of related vertices into one vertex per group, thereby creating a new, smaller graph that hides relationships within groups and exposes relationships between groups. This approach is used to group protein complexes in protein interaction networks [71], communities of authors in citation graphs [80], city regions in a road network [35], etc. Other use cases include simulating brain damage models by deleting edges from a brain network [76] and replacing frequently occurring patterns in a graph with a single copy and an associated count to shrink the graph [53].
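The group-merging form of graph reduction can be sketched as follows, assuming an undirected graph and a precomputed assignment of vertices to groups. The group labels, edge list, and the decision to keep a count of merged edges are all assumptions of this example.

```python
from collections import Counter


def reduce_graph(edges, group_of):
    """Merge every group of vertices into one vertex and count edges between groups."""
    reduced = Counter()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # relationships within a group are hidden
            reduced[tuple(sorted((gu, gv)))] += 1
    return reduced


# Hypothetical grouping, e.g., the output of a clustering step.
group_of = {"a": "G1", "b": "G1", "c": "G2", "d": "G2", "e": "G3"}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
reduced = reduce_graph(edges, group_of)
```

The five original edges collapse into two edges between groups, with the counts recording how many original relationships each reduced edge stands for.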
We identify three applications whose primary method of graph analysis does not match any of the aforementioned classes of methods. First, Calderara et al. [14] use spectral graph theory to characterize a graph constructed from paths formed by moving people and to identify anomalous paths. Second, Polonium [15] uses an algorithm based on belief propagation to identify files that are likely to be malicious. Third, Zhang et al. [91] use a greedy set cover algorithm to identify a feasible set of proteins to explain the observed set of protein fragments.
Footnote: It is possible to identify these specific patterns by creating stars or cliques of all possible sizes (up to the size of the graph) and using these as concrete patterns for a subgraph isomorphism algorithm. However, this process is inefficient and may not be feasible for more complex patterns. Incidentally, this approach proves the NP-completeness of subgraph isomorphism by reduction from the clique problem.
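For readers unfamiliar with greedy set cover, the third method named above, a generic sketch follows. It is the textbook heuristic, not a reconstruction of the algorithm of Zhang et al. [91], whose details are not given here. The protein and fragment names are hypothetical.

```python
def greedy_set_cover(universe, candidates):
    """Repeatedly pick the candidate that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda name: len(candidates[name] & uncovered))
        if not candidates[best] & uncovered:
            break  # remaining elements cannot be covered by any candidate
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered


# Hypothetical data: observed fragments and the fragments each protein could explain.
observed = {"f1", "f2", "f3", "f4", "f5"}
proteins = {
    "P1": {"f1", "f2", "f3"},
    "P2": {"f3", "f4"},
    "P3": {"f4", "f5"},
    "P4": {"f5"},
}
chosen, uncovered = greedy_set_cover(observed, proteins)
```

The heuristic does not guarantee a minimum cover in general, but it yields a feasible one whenever the candidates jointly cover the universe.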
As described in Section 2.3, for every selected application we classified the graph features and analysis methods it uses. The results of this process are presented as a taxonomy in Table 3. The table depicts for each class of graph features (see Section 4) and analysis methods (see Section 5, except "other" methods) whether it is used by a given application, and, where applicable, which variants of a feature or method are used. The variants are listed in the caption of Table 3 and described in more detail in their respective subsections in Sections 4 and 5. Table 4 summarizes the frequency of occurrence for each class and each variant of graph features and analysis methods.
Overall, we find significant diversity in the graphs and analysis methods used by the surveyed applications. As depicted in Table 4, all classes of graph features are present in at least 9 applications (15%), whereas all classes of analysis methods are present in at least 13 applications (∼ [...] combinations of features and methods. Some combinations occur with high frequency, e.g., local clustering coefficient and all-pair shortest paths are combined in many applications to identify "small-world" properties in a graph. Many other combinations are infrequent, but there is no clear segregation between classes of features or methods. This suggests that all features and methods should be supported in a single graph analysis platform.
We observe that diversity was not found across all application domains. In particular, we find significant overlap in the approaches used within the neuroscience, psychology, and chemistry domains, although the number of surveyed articles in each domain is too small to make sound claims about the representativeness of our results for individual domains.
Table 3: Taxonomy of surveyed articles. A dot indicates that a graph feature or analysis method was not identified in an article.
Legend by column:
Weighted: vertex (V), edge (E), edge weights removed after filtering low weight edges from input (⋆).
Heterogeneous: vertex (V), edge (E).
Properties: vertex (V), edge (E).
Neighborhood Statistics: clustering coefficient (C), degree distribution (D), other (⋆).
Paths & Traversals: shortest path (S), average path length (A), maximum path length (M), traversal (T), other (⋆).
Connectivity: weakly connected components (W), strongly connected components (S).
Centrality & Ranking: betweenness centrality (B), closeness centrality (C), degree centrality (D), PageRank (P), other (⋆).
Clustering: rule-based (R), community-based (C).
Subgraph Isomorphism & Mining: pattern detection (P), subgraph isomorphism (S), frequent subgraph mining (F), discriminative subgraph mining (D).
Graph Mutation: construction (C), reduction (R).
Column key: Dir = Directed Edges, Wgt = Weighted, Het = Heterogeneous, Prop = Properties, Temp = Temporal, NS = Neighborhood Statistics, P&T = Paths & Traversals, Conn = Connectivity, C&R = Centrality & Ranking, Clus = Clustering, SI&P = Subgraph Isomorphism & Patterns, Mut = Graph Mutation.

                            Graph Features                Graph Analysis Methods
Domain                Ref.  Dir   Wgt   Het   Prop  Temp  NS    P&T   Conn  C&R   Clus  SI&P  Mut
Biology               [3]   ·     ·     ·     ·     ·     C     A     ·     ·     ·     ·     ·
                      [7]   ·     ·     ·     ·     ·     CD    AM    W     ·     ·     ·     ·
                      [8]   ·     ·     ·     ·     ·     CD    AM    W     C     ·     ·     ·
                      [45]  ✓     ·     ·     ·     ·     ·     ·     ·     ·     ·     F     ·
                      [50]  ·     ·     ·     ·     ·     ·     ·     ·     ·     R     P     R
                      [65]  ✓     E     ·     ·     ·     ·     T     S     ·     ·     ·     R
                      [71]  ·     ·     ·     ·     ·     ·     ·     W     ·     R     P     R
                      [91]  ·     ·     V     ·     ·     ·     ·     W     ·     R     ·     R
Neuroscience          [4]   ·     ⋆     ·     ·     ·     C     A     ·     ·     ·     ·     ·
                      [13]  ·     E     ·     ·     ·     C⋆    A     ·     B     ·     ·     ·
                      [24]  ✓     ⋆     ·     ·     ·     CD⋆   A     ·     ·     ·     ·     ·
                      [74]  ·     ⋆     ·     ·     ·     C     A     ·     ·     ·     ·     ·
                      [76]  ·     E     ·     ·     ·     C     A     ·     ·     ·     ·     R
                      [77]  ·     ⋆     ·     ·     ·     CD    A     ·     ·     ·     ·     ·
                      [85]  ✓     ⋆     ·     ·     ✓     ·     ·     ·     B     ·     ·     ·
                      [90]  ✓     ·     ·     ·     ·     CD    A⋆    ·     ·     C     ·     ·
Security              [14]  ✓     E     ·     ·     ·     ·     ·     ·     ·     ·     ·     ·
                      [15]  ·     ·     V     ·     ·     ·     ·     ·     ·     ·     ·     ·
                      [41]  ·     ·     V     ·     ✓     ·     ·     W     ·     C     P     ·
                      [46]  ✓     ·     ·     VE    ✓     CD⋆   M     ·     ·     ·     ·     ·
                      [56]  ·     ·     ·     ·     ·     ·     T     ·     ·     C     ·     ·
                      [61]  ✓     ·     ·     ·     ·     ·     M     WS    ·     ·     ·     ·
                      [64]  ✓     E     ·     VE    ·     ·     S⋆    ·     ·     ·     ·     ·
Software Engineering  [16]  ✓     ·     E     V     ·     ·     ·     ·     ·     ·     D     ·
                      [23]  ✓     E     ·     V     ·     ·     T     ·     ·     ·     F     ·
                      [30]  ✓     ·     E     VE    ✓     ·     T     ·     ·     ·     ·     R
                      [51]  ✓     ·     E     ·     ·     ·     ·     ·     ·     ·     S     ·
                      [53]  ✓     ·     ·     VE    ·     ·     ·     ·     ·     ·     F     R
                      [78]  ·     ·     ·     ·     ·     D     ·     W     ·     R     SF    ·
                      [92]  ✓     ·     ·     ·     ·     C⋆    ·     W     BCD⋆  ·     P     ·
Logistics & Planning  [9]   ✓     ·     V     ·     ·     ·     T     ·     ·     ·     ·     C
                      [19]  ·     ·     ·     ·     ·     ·     ·     ·     B     ·     ·     ·
                      [32]  ✓     ·     ·     ·     ·     ·     T     ·     ·     ·     ·     ·
                      [34]  ✓     ·     V     ·     ·     ·     T     ·     ·     ·     ·     C
                      [35]  ✓     V     ·     ·     ✓     ·     ·     ·     BP    C     ·     R
                      [66]  ·     ·     ·     ·     ·     CD⋆   A     ·     ·     ·     ·     ·
                      [67]  ·     E     ·     ·     ·     ·     ·     ·     BCD⋆  ·     ·     ·
Sociology             [43]  ✓     ·     ·     ·     ✓     ·     ·     ·     P⋆    ·     ·     ·
                      [75]  ✓     ·     ·     ·     ·     ·     AM    W     ·     ·     ·     ·
                      [87]  ✓     E     VE    ·     ·     ·     ·     ·     P⋆    ·     ·     ·
                      [88]  ✓     ·     V     ·     ·     ·     ·     ·     ⋆     ·     ·     ·
Psychology            [5]   ✓     ·     ·     ·     ·     C⋆    A     S     ·     ·     ·     ·
                      [49]  ·     ·     ·     ·     ·     C     AM    ·     ·     ·     ·     ·
                      [55]  ✓     ·     ·     ·     ·     C⋆    A     WS    ·     ·     ·     ·
Science               [28]  ·     ·     ·     V     ·     D     A     ·     ·     ·     ·     R
                      [39]  ✓     ·     ·     ·     ✓     ·     ·     ·     B     C     ·     ·
                      [80]  ·     E     ·     ·     ✓     ·     ·     ·     ·     R     P     R
Chemistry             [60]  ·     ·     ·     VE    ·     ·     ·     ·     ·     ·     S     ·
                      [84]  ·     ·     ·     VE    ·     ·     ·     ·     ·     ·     S     ·
Finance               [37]  ·     E     ·     ·     ✓     CD⋆   M     ·     ·     ·     ·     ·
                      [81]  ·     ·     ·     ·     ·     CD⋆   A     ·     D     C     ·     ·
Linguistics           [6]   ·     E     ·     ·     ·     ·     ·     ·     ·     C     ·     ·
                      [40]  ✓     VE    VE    ·     ·     ·     ·     ·     ·     ·     F     ·
Other                 [33]  ✓     V     ·     ·     ·     ·     ·     ·     ·     ·     F     ·
                      [42]  ·     ·     ·     ·     ·     D     ·     ·     ·     ·     ·     ·
                      [48]  ·     ·     V     ·     ·     ·     ·     ·     ·     ·     S     R
                      [54]  ✓     ·     V     ·     ·     ·     A     W     ·     ·     ·     ·
                      [70]  ✓     ·     ·     ·     ·     D⋆    AM    WS    P     C     ·     ·
                      [79]  ·     ·     ·     ·     ·     D     ·     ·     ·     ·     ·     ·
                      [89]  ✓     ·     ·     ·     ·     ·     ·     S     ·     R     ·     ·

Finally, we note that two entries in the taxonomy have no associated analysis methods: applications [14] and [15] in the security domain. As described in Section 5.8, these applications use graph analysis methods that do not fit any of the identified classes and are not used by any other application.
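The frequent pairing of clustering coefficient (C) with average path length (A) in the taxonomy reflects the usual check for “small-world” properties: high local clustering together with short paths. The following is a minimal sketch of the two measures on an unweighted, undirected graph. The adjacency-map representation and all function names are assumptions made for illustration, and the sketch does not reproduce the method of any surveyed article.

```python
from collections import deque
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean local clustering coefficient over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over all connected ordered vertex pairs."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
clustering = average_clustering(adj)
path_length = average_path_length(adj)
```

For this toy graph the average clustering coefficient is 7/12 and the average path length is 4/3. In practice both values are typically compared against those of a random graph of the same size and density before a network is called small-world.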
Table 4: Summary of the number of applications that use a given graph feature or analysis method, derived from the taxonomy presented in Table 3.

                                   Total   Variants
Graph Features
  Directed Edges                   31      ✓: 31
  Weighted                         19      V: 3, E: 12, ⋆: 5
  Heterogeneous                    13      V: 10, E: 5
  Properties                       9       V: 9, E: 6
  Temporal                         9       ✓: 9
Graph Analysis Methods
  Neighborhood Statistics          23      C: 18, D: 14, ⋆: 10
  Paths & Traversals               30      S: 1, A: 19, M: 8, T: 7, ⋆: 2
  Connectivity                     15      W: 12, S: 6
  Centrality & Ranking             13      B: 7, C: 3, D: 3, P: 4, ⋆: 5
  Clustering                       14      R: 6, C: 8
  Subgraph Isomorphism & Patterns  16      P: 5, S: 5, F: 6, D: 1
  Graph Mutation                   13      R: 11, C: 2

Using the analysis of our taxonomy as presented in Section 6, we present several directions for future research in the development and tuning of graph analysis systems with a focus on functionality and performance. First, the significant diversity in graph features, analysis methods, and their combinations used in real-world applications suggests a need for graph analysis platforms with support for many classes of graphs and analysis methods. Recent developments of graph analysis platforms have focused on either graph databases (e.g., Neo4j [57]) or (parallel) graph processing platforms (Doekemeijer et al. [21] surveyed many). Graph databases have extensive support for mutable, heterogeneous property graphs, for path queries, and for pattern matching. In contrast, graph processing platforms typically focus on static graphs and large-scale analytics using methods such as centrality and clustering. We do observe a trend towards bridging this gap, e.g., with Neo4j announcing support for graph algorithms in their database [36].
Second, many applications use more than one method of analysis on the same graph, so there is a need for platforms that allow either interactive analysis or workflows of algorithms for batch processing. Typical graph databases already offer this functionality by providing a service that can be interactively queried, but graph processing platforms typically require loading a copy of the graph from disk for each job (i.e., a single algorithm). Future research on graph processing platforms may study the reuse of a graph stored in memory to speed up sequences of jobs, or even multiple algorithms executing in parallel on the same graph.
This research, as well as the development of new benchmarks to assess support for graph analysis workflows, may require additional insight into how practitioners use graph analysis platforms (e.g., do they define in advance a set of algorithms to run, or is the selection of analysis methods guided by the results of the previous method?).
Third, we observe that many graphs analyzed in practice share a common set of high-level features, as evident from our taxonomy, but that the structural properties of graphs vary widely and are known to significantly impact the performance of graph analysis [38]. Although some structural characteristics are well-defined, e.g., bipartite graphs, there is no comprehensive set of metrics that captures all structural properties relevant to graph analysis performance. Ongoing work on this topic includes the development of synthetic dataset generators that aim to replicate various structural properties of real-world datasets, e.g., DataSynth [68] and Darwini [22].
Finally, we find that many surveyed applications operate on small graphs, i.e., under a million vertices and edges. Most of the larger graphs, i.e., hundreds of millions of edges and more, were found in domains studying digital networks, including social media networks, software engineering networks, and computer security networks. The largest graph we identified is used to identify malware and consists of nearly 1 billion vertices and 37 billion edges [15]. These findings contradict a recent user study by Sahu et al. [73], which found that large-scale graphs are more frequent and can be found across many domains. Possible explanations include increased adoption of large-scale graph analysis techniques since the publication of many of our surveyed articles, a potential bias in [73] due to surveying users of specific software products (of which many explicitly target large-scale graph analysis), and a disparity between graphs used in academia versus industry.
Footnote: Some surveyed articles do not explicitly state the size of their graph, or they present a general solution without a concrete input graph. We have not quantified the precise scales of each application’s graph, but draw conclusions based on estimations.
To facilitate the validation and reproduction of our results, we provide the data collected and produced while performing this survey as open-access data [31]. We provide a complete list of search results obtained via Google Scholar, a shortlist of relevant identified results (see Section 2.2), and the resulting analysis and characterization (see Sections 2.3 and 6).
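The second research direction above, reusing a graph held in memory across a sequence of jobs, can be illustrated with a minimal sketch. The example below loads an undirected graph once and then runs two analysis methods from the taxonomy (degree distribution and connected components) against the same in-memory structure. The names `load_graph`, `run_workflow`, and the job list are hypothetical and do not correspond to the API of any existing platform.

```python
from collections import Counter


def load_graph(edges):
    """Build an undirected adjacency map once; every job below reuses it."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree to the number of vertices that have it."""
    return Counter(len(nbrs) for nbrs in adj.values())


def connected_components(adj):
    """Return the vertex sets of all connected components (iterative DFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            u = stack.pop()
            component.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components


def run_workflow(adj, jobs):
    """Run each (name, function) job on the same in-memory graph."""
    return {name: job(adj) for name, job in jobs}


# Illustrative input: a path 1-2-3 and a separate edge 4-5.
graph = load_graph([(1, 2), (2, 3), (4, 5)])
results = run_workflow(graph, [
    ("degrees", degree_distribution),
    ("components", connected_components),
])
```

The design point is only that the loading cost is paid once. An interactive variant would keep `graph` alive between user requests instead of fixing the job list in advance.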
Footnote: The development of large-scale graph processing platforms did not gain significant traction until the publication of Pregel in 2010 [21], so we believe it is feasible that adoption of large-scale graph analysis techniques has increased between the publication of our surveyed articles (many of which appeared before 2010) and the 2017 study by Sahu et al.
The development and tuning of graph analysis platforms benefits greatly from knowledge of their practical applications. However, such knowledge requires a broad view of graph analysis use cases across many domains and an in-depth view of the datasets and algorithms they use. Earlier work focuses on few applications, provides a high-level view of many applications, or studies in depth a small set of algorithms.
In this work we made a threefold contribution towards gaining insight into the practical use of graph analysis: (i) we defined and applied a systematic method for identifying and selecting relevant literature on graph analysis applications across many domains, (ii) we presented a taxonomy of the graph features and graph analysis methods used in practice, and (iii) we proposed several directions for future research in developing and tuning graph analysis platforms.
Our primary observation is the large diversity in the domains to which the identified applications belong and in the classes of graph features and analysis methods they use. Of the 5 classes of graph features and 7 classes of analysis methods, each was encountered in at least 15% of surveyed applications. We further observe that most applications use multiple analysis methods and that many classes of analysis are combined in practice. In contrast, some domains show significant homogeneity, especially neuroscience, a domain that focuses on the analysis of brain networks and has seemingly developed a common set of techniques for analyzing such networks.
We conclude that future research in the development and tuning of graph analysis platforms should focus on integrating support for a wide range of graph features and analysis methods, in contrast to the dichotomy of graph databases and parallel graph processing frameworks that currently exists. Additionally, the combination of multiple analysis methods, interactively or as a workflow, should be further investigated and supported in future systems. Finally, to understand the differences in graphs across domains and their impact on performance, additional research is needed to characterize the structural properties of graphs beyond simple structures such as bipartite graphs.
References
[1] T. Aittokallio and B. Schwikowski. Graph-based methods for analysing networks in cell biology. Briefings in bioinformatics, 7(3):243–255, 2006.
[2] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[3] A. Barberán, S. T. Bates, E. O. Casamayor, and N. Fierer. Using network analysis to explore co-occurrence patterns in soil microbial communities. The ISME journal, 6(2):343, 2012.
[4] F. Bartolomei, I. Bosma, M. Klein, J. C. Baayen, J. C. Reijneveld, T. J. Postma, J. J. Heimans, B. W. van Dijk, J. C. de Munck, A. de Jongh, K. S. Cover, and C. J. Stam. Disturbed functional connectivity in brain tumour patients: Evaluation by graph analysis of synchronization matrices. Clinical Neurophysiology, 117(9):2039–2049, 2006.
[5] L. Bertola, N. B. Mota, M. Copelli, T. Rivero, B. S. Diniz, M. A. Romano-Silva, S. Ribeiro, and L. F. Malloy-Diniz. Graph analysis of verbal fluency test discriminate between patients with alzheimer’s disease, mild cognitive impairment and normal elderly controls. Frontiers in aging neuroscience, 6:185, 2014.
[6] C. Biemann. Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the first workshop on graph based methods for natural language processing, pages 73–80. Association for Computational Linguistics, 2006.
[7] C. C. Bilgin, P. Bullough, G. E. Plopper, and B. Yener. Ecm-aware cell-graph mining for bone tissue modeling and classification. Data mining and knowledge discovery, 20(3):416–438, 2010.
[8] C. C. Bilgin, C. Demir, C. Nagi, and B. Yener. Cell-graph mining for breast tissue modeling and classification. In Engineering in Medicine and Biology Society, 2007. EMBS 2007. 29th Annual International Conference of the IEEE, pages 5311–5314. IEEE, 2007.
[9] A. Blum and M. L. Furst. Fast planning through planning graph analysis. Artif. Intell., 90(1-2):281–300, 1997.
[10] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang. Complex networks: Structure and dynamics. Physics reports, 424(4-5):175–308, 2006.
[11] A. D. Broido and A. Clauset. Scale-free networks are rare. CoRR, abs/1801.03400, 2018.
[12] E. Bullmore and O. Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10(3):186, 2009.
[13] K. Caeyenberghs, A. Leemans, M. H. Heitger, I. Leunissen, T. Dhollander, S. Sunaert, P. Dupont, and S. P. Swinnen. Graph analysis of functional brain networks for cognitive control of action in traumatic brain injury. Brain, 135(4):1293–1307, 2012.
[14] S. Calderara, U. Heinemann, A. Prati, R. Cucchiara, and N. Tishby. Detecting anomalies in people’s trajectories using spectral graph analysis. Computer Vision and Image Understanding, 115(8):1099–1111, 2011.
[15] D. H. Chau, C. Nachenberg, J. Wilhelm, A. Wright, and C. Faloutsos. Polonium: Tera-scale graph mining and inference for malware detection. In Proceedings of the 2011 SIAM International Conference on Data Mining, pages 131–142. SIAM, 2011.
[16] H. Cheng, D. Lo, Y. Zhou, X. Wang, and X. Yan. Identifying bug signatures using discriminative graph mining. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ISSTA 2009, Chicago, IL, USA, July 19-23, 2009, pages 141–152, 2009.
[17] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. International journal of pattern recognition and artificial intelligence, 18(03):265–298, 2004.
[18] L. d. F. Costa, O. N. Oliveira Jr, G. Travieso, F. A. Rodrigues, P. R. Villas Boas, L. Antiqueira, M. P. Viana, and L. E. Correa Rocha. Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Advances in Physics, 60(3):329–412, 2011.
[19] E. M. Daly and M. Haahr. Social network analysis for routing in disconnected delay-tolerant manets. In Proceedings of the 8th ACM International Symposium on Mobile Ad Hoc Networking and Computing, MobiHoc 2007, Montreal, Quebec, Canada, September 9-14, 2007, pages 32–40, 2007.
[20] M. Demange, T. Ekim, B. Ries, and C. Tanasescu. On some applications of the selective graph coloring problem. European Journal of Operational Research, 240(2):307–314, 2015.
[21] N. Doekemeijer and A. L. Varbanescu. A survey of parallel graph processing frameworks. Delft University of Technology, page 21, 2014.
[22] S. Edunov, D. Logothetis, C. Wang, A. Ching, and M. Kabiljo. Darwini: Generating realistic large-scale social graphs. arXiv preprint arXiv:1610.00664, 2016.
[23] F. Eichinger, K. Böhm, and M. Huber. Mining edge-weighted call graphs to localise software bugs. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Antwerp, Belgium, September 15-19, 2008, Proceedings, Part I, pages 333–348, 2008.
[24] F. D. V. Fallani, L. Astolfi, F. Cincotti, D. Mattia, M. G. Marciani, S. Salinari, J. Kurths, S. Gao, A. Cichocki, A. Colosimo, et al. Cortical functional connectivity networks in normal and spinal cord injured patients: evaluation by graph analysis. Human brain mapping, 28(12):1334–1346, 2007.
[25] S. Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
[26] L. C. Freeman. Centrality in social networks conceptual clarification. Social networks, 1(3):215–239, 1978.
[27] R. García-Domenech, J. Gálvez, J. V. de Julián-Ortiz, and L. Pogliani. Some new trends in chemical graph theory. Chemical Reviews, 108(3):1127–1169, 2008.
[28] N. Gondal. The local and global structure of knowledge production in an emergent research field: An exponential random graph analysis. Social Networks, 33(1):20–30, 2011.
[29] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How well do graph-processing platforms perform? an empirical performance evaluation and analysis. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 395–404. IEEE, 2014.
[30] Z. Guo, D. Zhou, H. Lin, M. Yang, F. Long, C. Deng, C. Liu, and L. Zhou. G2: A graph processing system for diagnosing distributed systems. In , 2011.
[31] T. Hegeman and A. Iosup. Dataset for survey on applications of graph analysis. https://doi.org/10.5281/zenodo.1298640, June 2018.
[32] M. Helmert. A planning heuristic based on causal graph analysis. In Proceedings of the Fourteenth International Conference on Automated Planning and Scheduling (ICAPS 2004), June 3-7 2004, Whistler, British Columbia, Canada, pages 161–170, 2004.
[33] M. Heydari, R. A. Helal, and K. I. Ghauth. A graph-based web usage mining method considering client side data. In Electrical Engineering and Informatics, 2009. ICEEI’09. International Conference on, volume 1, pages 147–153. IEEE, 2009.
[34] J. Hong. Goal recognition through goal graph analysis. J. Artif. Intell. Res., 15:1–30, 2001.
[35] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang. Trajgraph: A graph-based visual analytics approach to studying urban network centralities using taxi trajectory data. IEEE Trans. Vis. Comput. Graph., 22(1):160–169, 2016.
[36] M. Hunger. Proudly Releasing: Efficient Graph Algorithms in Neo4j. https://neo4j.com/blog/efficient-graph-algorithms-neo4j/, August 3, 2017. Accessed: 2018-04-05.
[37] G. Iori, G. De Masi, O. V. Precup, G. Gabbi, and G. Caldarelli. A network analysis of the italian overnight money market. Journal of Economic Dynamics and Control, 32(1):259–278, 2008.
[38] A. Iosup, T. Hegeman, W. L. Ngai, S. Heldens, A. Prat-Pérez, T. Manhardt, H. Chafi, M. Capotă, N. Sundaram, M. Anderson, et al. Ldbc graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms. Proceedings of the VLDB Endowment, 9(13):1317–1328, 2016.
[39] M. Jacovi, V. Soroka, G. Gilboa-Freedman, S. Ur, E. Shahar, and N. Marmasse. The chasms of CSCW: a citation graph analysis of the CSCW conference. In Proceedings of the 2006 ACM Conference on Computer Supported Cooperative Work, CSCW 2006, Banff, Alberta, Canada, November 4-8, 2006, pages 289–298, 2006.
[40] C. Jiang, F. Coenen, R. Sanderson, and M. Zito. Text classification using graph mining-based feature extraction. In Research and Development in Intelligent Systems XXVI, Incorporating Applications and Innovations in Intelligent Systems XVII, Peterhouse College, Cambridge, UK, 15-17 December 2009, pages 21–34, 2009.
[41] N. Jiang, J. Cao, Y. Jin, E. L. Li, and Z. Zhang. Identifying suspicious activities through DNS failure graph analysis. In Proceedings of the 18th annual IEEE International Conference on Network Protocols, ICNP 2010, Kyoto, Japan, 5-8 October, 2010, pages 144–153, 2010.
[42] S. Jiang, C. Bian, X. Ning, and Q. D. Ma. Visibility graph analysis on heartbeat dynamics of meditation training. Applied Physics Letters, 102(25):253702, 2013.
[43] A. Khrabrov and G. Cybenko. Discovering influence in communication networks using dynamic graph analysis. In Proceedings of the 2010 IEEE Second International Conference on Social Computing, SocialCom / IEEE International Conference on Privacy, Security, Risk and Trust, PASSAT 2010, Minneapolis, Minnesota, USA, August 20-22, 2010, pages 288–294, 2010.
[44] B. Kitchenham and S. Charters. Guidelines for performing systematic literature reviews in software engineering. Technical report, Keele University and Durham University Joint Report, 2007.
[45] M. Koyutürk, A. Grama, and W. Szpankowski. An efficient algorithm for detecting frequent subgraphs in biological networks. Bioinformatics, 20(suppl 1):i200–i207, 2004.
[46] B. J. Kwon, J. Mondal, J. Jang, L. Bilge, and T. Dumitras. The dropper effect: Insights into malware distribution with downloader graph analytics. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12-16, 2015, pages 1118–1129, 2015.
[47] V. Latora and M. Marchiori. Efficient behavior of small-world networks. Physical review letters, 87(19):198701, 2001.
[48] K.-Y. Lee, O.-H. Kwon, J.-Y. Lee, and T.-W. Kim. A hybrid approach to geometric constraint solving with graph analysis and reduction. Advances in Engineering Software, 34(2):103–113, 2003.
[49] A. J. Lerner, P. K. Ogrocki, and P. J. Thomas. Network graph analysis of category fluency testing. Cognitive and Behavioral Neurology, 22(1):45–52, 2009.
[50] X.-L. Li, C.-S. Foo, S.-H. Tan, and S.-K. Ng. Interaction graph mining for protein complexes using local clique merging. Genome Informatics, 16(2):260–269, 2005.
[51] C. Liu, C. Chen, J. Han, and P. S. Yu. GPLAG: detection of software plagiarism by program dependence graph analysis. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pages 872–881, 2006.
[52] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135–146. ACM, 2010.
[53] E. K. Maxwell, G. Back, and N. Ramakrishnan. Diagnosing memory leaks using graph mining on heap dumps. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July 25-28, 2010, pages 115–124, 2010.
[54] B. J. Mirza, B. J. Keller, and N. Ramakrishnan. Studying recommendation algorithms by graph analysis. J. Intell. Inf. Syst., 20(2):131–160, 2003.
[55] N. B. Mota, R. Furtado, P. P. Maia, M. Copelli, and S. Ribeiro. Graph analysis of dream reports is especially informative about psychosis. Scientific reports, 4:3691, 2014.
[56] S. Nagaraja, P. Mittal, C. Hong, M. Caesar, and N. Borisov. Botgrep: Finding P2P bots with structured graph analysis. In , pages 95–110, 2010.
[57] Neo4j, Inc. The Neo4j Graph Platform. https://neo4j.com/. Accessed: 2018-04-05.
[58] M. E. Newman. Scientific collaboration networks. ii. shortest paths, weighted networks, and centrality. Physical review E, 64(1):016132, 2001.
[59] M. E. Newman. Assortative mixing in networks. Physical review letters, 89(20):208701, 2002.
[60] S. Nijssen and J. N. Kok. Frequent graph mining and its application to molecular databases. In Systems, Man and Cybernetics, 2004 IEEE International Conference on, volume 5, pages 4571–4577. IEEE, 2004.
[61] S. Noel and S. Jajodia. Metrics suite for network attack graph analytics. In Cyber and Information Security Research Conference, CISR ’14, Oak Ridge, TN, USA, April 8-10, 2014, pages 5–8, 2014.
[62] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[63] B. Peng, L. Zhang, and D. Zhang. A survey of graph theoretical approaches to image segmentation. Pattern Recognition, 46(3):1020–1038, 2013.
[64] C. A. Phillips and L. P. Swiler. A graph-based system for network-vulnerability analysis. In Proceedings of the 1998 Workshop on New Security Paradigms, Charlottesville, VA, USA, September 22-25, 1998, pages 71–79, 1998.
[65] A. Pinna, N. Soranzo, and A. de la Fuente. From knockouts to networks: Establishing direct cause-effect relationships through graph analysis. PLOS ONE, 5(10):1–8, 10 2010.
[66] S. Porta, P. Crucitti, and V. Latora. The network analysis of urban streets: a dual approach. Physica A: Statistical Mechanics and its Applications, 369(2):853–866, 2006.
[67] S. Porta, P. Crucitti, and V. Latora. The network analysis of urban streets: a primal approach. Environment and Planning B: planning and design, 33(5):705–725, 2006.
[68] A. Prat-Pérez, J. Guisado-Gámez, X. F. Salas, P. Koupy, S. Depner, and D. B. Bartolini. Towards a property graph generator for benchmarking. In Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems, page 6. ACM, 2017.
[69] T. Reps. Program analysis via graph reachability. Information and software technology, 40(11-12):701–726, 1998.
[70] M. A. Rodriguez. A graph analysis of the linked data cloud. CoRR, abs/0903.0194, 2009.
[71] L. Royer, M. Reimann, B. Andreopoulos, and M. Schroeder. Unraveling protein networks with power graph analysis. PLOS Computational Biology, 4(7):1–17, 07 2008.
[72] M. Rubinov and O. Sporns. Complex network measures of brain connectivity: uses and interpretations. Neuroimage, 52(3):1059–1069, 2010.
[73] S. Sahu, A. Mhedhbi, S. Salihoglu, J. Lin, and M. T. Özsu. The ubiquity of large graphs and surprising challenges of graph processing. Proceedings of the VLDB Endowment, 11(4), 2017.
[74] E. J. Sanz-Arigita, M. M. Schoonheim, J. S. Damoiseaux, S. A. R. B. Rombouts, E. Maris, F. Barkhof, P. Scheltens, and C. J. Stam. Loss of small-world networks in alzheimer’s disease: Graph analysis of fmri resting-state functional connectivity. PLOS ONE, 5(11):1–14, 11 2010.
[75] M. F. Schwartz and D. C. M. Wood. Discovering shared interests using graph analysis. Commun. ACM, 36(8):78–89, 1993.
[76] C. Stam, W. De Haan, A. Daffertshofer, B. Jones, I. Manshanden, A.-M. van Cappellen van Walsum, T. Montez, J. Verbunt, J. De Munck, B. Van Dijk, et al. Graph theoretical analysis of magnetoencephalographic functional connectivity in alzheimer’s disease. Brain, 132(1):213–224, 2008.
[77] K. Supekar, V. Menon, D. Rubin, M. Musen, and M. D. Greicius. Network analysis of intrinsic functional brain connectivity in alzheimer’s disease. PLoS computational biology, 4(6):e1000100, 2008.
[78] D. Surian, D. Lo, and E. Lim. Mining collaboration patterns from a large developer network. In , pages 269–273, 2010.
[79] L. Telesca and M. Lovallo. Analysis of seismic sequences by using the method of visibility graph. EPL (Europhysics Letters), 97(5):50002, 2012.
[80] G. Tsatsaronis, I. Varlamis, S. Torge, M. Reimann, K. Nørvåg, M. Schroeder, and M. Zschunke. How to become a group leader? or modeling author types based on graph mining. In Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries, TPDL 2011, Berlin, Germany, September 26-28, 2011. Proceedings, pages 15–26, 2011.
[81] N. Wang, D. Li, and Q. Wang. Visibility graph analysis on quarterly macroeconomic series of china based on complex network theory. Physica A: Statistical Mechanics and its Applications, 391(24):6543–6555, 2012.
[82] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440, 1998.
[83] J. Webster and R. T. Watson. Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, 26(2), 2002.
[84] J. K. Wegner, H. Fröhlich, H. M. Mielenz, and A. Zell. Data and graph mining in chemical space for adme and activity data sets. Molecular Informatics, 25(3):205–220, 2006.
[85] C. Wilke, G. Worrell, and B. He. Graph analysis of epileptogenic networks in human partial epilepsy. Epilepsia, 52(1):84–93, 2011.
[86] C. Wohlin. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In , pages 38:1–38:10, 2014.
[87] Y. Yamaguchi, T. Takahashi, T. Amagasa, and H. Kitagawa. Turank: Twitter user ranking based on user-tweet graph analysis. In Web Information Systems Engineering - WISE 2010 - 11th International Conference, Hong Kong, China, December 12-14, 2010. Proceedings, pages 240–253, 2010.
[88] M. Yang, J. Lee, S. Lee, and H. Rim. Finding interesting posts in twitter based on retweet graph analysis. In The 35th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, 2012, pages 1073–1074, 2012.
[89] M. M. Yeung, B. Yeo, and B. Liu. Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding, 71(1):94–109, 1998.
[90] G. Zamora-López, C. Zhou, and J. Kurths. Graph analysis of cortical networks reveals complex anatomical communication substrate. Chaos: An Interdisciplinary Journal of Nonlinear Science, 19(1):015117, 2009.
[91] B. Zhang, M. C. Chambers, and D. L. Tabb. Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. Journal of proteome research, 6(9):3549–3557, 2007.
[92] T. Zimmermann and N. Nagappan. Predicting defects using network analysis on dependency graphs. In 30th International Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May 10-18, 2008