Overcoming Bias in Community Detection Evaluation
Jeancarlo Campos Leão, Alberto H. F. Laender, Pedro O. S. Vaz de Melo
Instituto Federal do Norte de Minas Gerais (IFNMG), Brazil ([email protected])
Universidade Federal de Minas Gerais, Brazil ({laender,olmo}@dcc.ufmg.br)
Abstract.
Community detection is a key task to further understand the function and the structure of complex networks. Therefore, a strategy used to assess this task must be able to avoid biased and incorrect results that might invalidate further analyses or applications that rely on such communities. Two widely used strategies to assess this task are generally known as structural and functional. The structural strategy basically consists in detecting and assessing such communities by using multiple methods and structural metrics. On the other hand, the functional strategy might be used when ground truth data are available to assess the detected communities. However, the evaluation of communities based on such strategies is usually done in experimental configurations that are largely susceptible to biases, a situation that is inherent to the algorithms, metrics and network data used in this task. Furthermore, such strategies are not systematically combined in a way that allows for the identification and mitigation of bias in the algorithms, metrics or network data to converge into more consistent results. In this context, the main contribution of this article is an approach that supports a robust quality evaluation when detecting communities in real-world networks. In our approach, we measure the quality of a community by applying the structural and functional strategies, and the combination of both, to obtain different pieces of evidence. Then, we consider the divergences and the consensus among the pieces of evidence to identify and overcome possible sources of bias in community detection algorithms, evaluation metrics, and network data. Experiments conducted with several real and synthetic networks provided results that show the effectiveness of our approach to obtain more consistent conclusions about the quality of the detected communities.

Categories and Subject Descriptors: H.2 [Data Mining]: Miscellaneous; G.2.2 [Graph Theory]: Miscellaneous; J.4 [Social and Behavioral Sciences]: Miscellaneous

Keywords: Community Structure, Quality Evaluation, Bias, Ensemble Approach, Triangulation Method
1. INTRODUCTION

The community detection problem has been much studied in the context of social networks due to its wide application in many domains, giving rise to many methods to address it [Almeida et al. 2012; Fortunato 2010; Gandica et al. 2020; Yang and Leskovec 2015]. However, one of the major challenges related to this problem is the difficulty to evaluate the detected communities with respect to the various methods proposed in the literature. Part of this difficulty lies on the fact that there is still no universally accepted definition for the concept of community [Fortunato 2010], as well as for what we understand as being the quality of a community [Hric et al. 2014]. Besides, the evaluation of such communities is usually carried out by using experimental configurations greatly susceptible to biases, which are inherent to the algorithms, metrics and network data used in this task [Jebabli et al. 2018; Leão et al. 2019; Liu et al. 2020]. Moreover, this kind of evaluation is, in general, carried out without explicitly dealing with such biases, which may lead to inconsistent results.

In order to illustrate this problem, let us consider the example shown in Figure 1. Specifically, Figure 1a shows a social network formed by 34 members (vertices) of a karate club interconnected by edges representing interactions between them outside the club. Originally, this network was divided into two non-overlapping communities labeled by Zachary [1977] with 16 and 18 members, respectively, each one supervised by a specific instructor.

Fig. 1. Example of how bias can affect community detection in social networks.

Figure 1b, on the other hand, shows the communities detected in this same network by the Louvain algorithm [Blondel et al. 2008], a well-known and very effective community detection algorithm. Note that the community structure revealed by the Louvain algorithm is different from those presented in Figures 1c-e, which were respectively obtained by the Girvan-Newman [Newman and Girvan 2004], Walktrap [Pons and Latapy 2005] and Spin-Glass [Reichardt and Bornholdt 2006] algorithms, all of them also considered very effective. Here, we raise the possibility that the bias of each heuristic algorithm interferes with its final results, making them different from each other and from the absolute optimal theoretical result, not necessarily present in Figure 1.

Thus, let us check the pieces of evidence present in this example in order to reach a consensus on which algorithm produces the best quality communities. First, the modularity metric indicates that the communities shown in Figure 1b present a better quality with respect to their modular structure (i.e., they present the highest modularity value). On the other hand, when comparing the detected communities with the ground truth (Figure 1a) using the Rand Index similarity metric [Rand 1971], it indicates that the network in Figure 1d is the best one, since its communities are among the most modular ones, being also more similar to those shown in Figure 1a, even though there is not a perfect match. However, due to its own bias this similarity metric scored the communities in Figure 1e better, even though they show more visual differences with respect to the ground truth than the communities in Figure 1d. Finally, due to some specific bias in the original network data or in the ground truth data, the Girvan-Newman algorithm has not been able to identify good communities (Figure 1c), as shown by the values of the two metrics considered.
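To make the divergence concrete, the kind of comparison behind Figure 1 can be reproduced in a few lines. The sketch below is only an illustration, not the authors' code: it assumes the networkx library, whose bundled karate club graph is the Zachary network of Figure 1a, and uses three of its built-in detectors as stand-ins for the algorithms compared in the figure.

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()  # Zachary's karate club (Figure 1a)

# Three detectors as stand-ins for the algorithms compared in Figure 1
partitions = {
    "Louvain": community.louvain_communities(G, seed=42),
    "Greedy modularity": list(community.greedy_modularity_communities(G)),
    "Girvan-Newman (first split)": [set(c) for c in next(community.girvan_newman(G))],
}

# Each heuristic yields a different partition and a different modularity score
for name, parts in partitions.items():
    print(f"{name}: {len(parts)} communities, Q = {community.modularity(G, parts):.3f}")
```

Running this typically shows the heuristics disagreeing on both the number of communities and the modularity reached, which is precisely the situation Figure 1 illustrates.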
Note that, although the modularity value of these communities is the one closest to the ground truth's, these two sets of communities are less structurally similar, since the number of subgraphs obtained by this algorithm is the largest among all networks.

In face of these pieces of evidence pointing to opposite directions with respect to the quality of the communities in our example, a question arises on which one presents the best structure and what causes this divergence. We could try to obtain a consensual result, but this would be inadequate, because hypotheses about bias were not tested. In addition, evidence obtained from visual inspection is only viable on very small networks such as the ones considered in Figure 1. Nevertheless, our assumption is that all pieces of evidence provide a significant amount of information about the network structure that, combined with other pieces of evidence on the main sources of bias, results in a consistent decision on the quality of the detected communities.

To the best of our knowledge, there is no comprehensive evaluation approach that, considering multiple strategies, is able to identify which one provides the best interpretation. More importantly, as we detail later, due to their own biases it is not always possible to find a consensus among different metrics and community detection algorithms on which community structure has the highest quality. This requires a cross-checking approach involving at least three distinct evaluation strategies to indicate a consensus and estimate a possible bias with respect to the quality of the revealed communities.

Table I. Methods for community detection.

Main Method                      Algorithm                               ξ   References
Modularity maximization          Louvain Modularity (LM)                 D   [Blondel et al. 2008]
                                 Greedy Optimization of Modularity (GM)  D   [Clauset et al. 2004]
                                 Leading Eigenvector (LE)                D   [Newman 2006]
Dynamic process                  Label Propagation (LP)                  N   [Raghavan et al. 2007]
                                 Spin-glass (SG)                         N   [Reichardt and Bornholdt 2006]
Removal of edges between
communities                      Girvan–Newman (GN)                      D   [Newman and Girvan 2004]
Node closeness given by
random walks                     Walktrap (WT)                           N   [Pons and Latapy 2005]
                                 Infomap (IM)                            N   [Rosvall and Bergstrom 2011]
ξ: state model (D = deterministic / N = non-deterministic).

A strategy generally employed to improve the reliability and validity of methods, measures, and data is triangulation, which consists in using different approaches for measuring the same characteristic [Brender 2006]. Note that, with only one measure of a specific characteristic, the error and biases inherent in that measure are confounded with the characteristic itself. Thus, when measuring different aspects using different metrics, it is possible to decide which one can bring into better focus the characteristic of interest. Based on this idea, the main contribution of this article is a robust approach for community quality evaluation that allows one to obtain results less prone to bias when detecting communities in synthetic and real networks.

Thus, given a network, its set of ground truth communities and a set of its communities to be evaluated, our approach allows one to overcome biases in network data, detection algorithms and evaluation metrics by using distinct evaluation strategies when analyzing the quality of such communities. For this, each strategy must strongly highlight a distinct aspect of a community's quality in addition to considering multiple and diverse detection methods, metrics and network datasets. For example, in Figure 1 the structural and functional aspects of the communities are represented, respectively, by their modularity and similarity with the respective ground truths.
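The two aspects can be scored side by side on the Zachary network of Figure 1. The sketch below is only an illustration under the assumption that networkx is available; its node attribute `club` plays the role of the ground truth labeling.

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# Functional side: the two instructor factions recorded by Zachary act as ground truth
truth = [{n for n in G if G.nodes[n]["club"] == c} for c in ("Mr. Hi", "Officer")]

# Structural side: modularity of the ground truth vs. a detected partition
detected = community.louvain_communities(G, seed=1)
q_truth = community.modularity(G, truth)
q_detected = community.modularity(G, detected)
print(f"ground truth: Q = {q_truth:.3f}; detected: Q = {q_detected:.3f}")
```

On this network the detected partition typically reaches a higher modularity than the labeled factions, which is exactly the kind of divergence between structural and functional evidence discussed in the text.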
Notice that, for our purpose, the choice of the best metrics, detection methods and ground truth data is not important, since we are not trying to identify the best existing community, but the best one among those being compared. Thus, when there is a large variety of metrics and methods available for this task, they can be pre-selected based on some criteria of relevance and diversification.

The rest of this article is organized as follows. Section 2 reviews related work. Section 3 describes our approach for community quality evaluation. Then, Section 4 analyzes the experimental results obtained by applying our proposed approach to real and simulated networks. Finally, Section 5 presents our conclusions and some considerations for future work.

According to O'Donoghue and Punch [2003], triangulation is a "method of cross-checking data from multiple sources to search for regularities in the research data."

2. RELATED WORK

Although community detection has become one of the most popular and best-studied research topics in network science [Fortunato 2010; Gandica et al. 2020; Hric et al. 2014; Kivelä et al. 2014; Leão et al. 2018; Vieira et al. 2020; Zhao 2017], the problem of validating the quality of a community derived from a real network has not received the due attention in the literature, since there is no consensus on what is meant by a good community [Yang and Leskovec 2015]. Moreover, there are several different definitions of what a community is, which has resulted in many distinct approaches to community detection [Coscia 2019; Ghasemian et al. 2020]. For example, the algorithms listed in Table I usually extract different communities from a given network, which are in general considered of good quality by distinct metrics.

In a previous work [Leão et al.
2018], we analyzed the similarity of communities detected by distinct methods in order to identify the set of algorithms that tend to produce more similar results and those that provide more distinct ones, when compared by multiple similarity metrics and considering a given network. Doing so, we aimed to increase the scope and the consistency of the evaluation process when detecting communities. More recently, Coscia [2019] proposed an analysis essentially similar to ours also aimed at classifying community detection algorithms according to the similarity of their results. For this, he considers how many times two algorithms provide similar communities. Besides the similarity of the results provided, other criteria have been used to distinguish community detection algorithms. For example, Abrahao et al. [2012], Yang and Leskovec [2015], and Ghasemian et al. [2019] consider the structural properties of the detected communities for such a distinction.

Although the analysis of the detected communities is commonly used to distinguish the detection algorithms, we have not identified any work that uses the variability of the characteristics of these communities to deal with bias. In fact, in general, such works show some bias when evaluating the quality of a community. Next, we describe some works that have proposed methods that make it possible to deal with some types of bias when assessing the quality of their detected communities.
Bias in Community Detection Algorithms.
Heuristic algorithms for community detection often find communities that are systematically biased, in the sense that their results might differ from the optimum of the objective function chosen for this specific task [Abrahao et al. 2012; Leskovec et al. 2010; Peel et al. 2017]. Moreover, distinct goals for community detection might lead to totally distinct objective functions [Coscia 2019; Gandica et al. 2020; Ghasemian et al. 2020]. In addition, the intensity of these differences might vary from one network to another due to detection methods being sensitive to different community structures, topologies, and types or instances of a network [Coscia et al. 2011; Ghasemian et al. 2020; Leskovec et al. 2010].

Note that detection algorithms are expected to introduce in their final results the same bias of the metrics they use in their optimization functions [Jebabli et al. 2018; Peel et al. 2017]. A popular case is that of the modularity, conductance and coverage metrics, which have strong structural biases that make them favor smaller clusters and whose maximization is aimed at by most detection algorithms [Almeida et al. 2012; Jebabli et al. 2018].

In this context, different approaches have been proposed with the aim of reducing the effect of biases and improving the detection of communities. For instance, Lancichinetti et al. [2012] show how to combine the communities obtained from various detection methods into a consensual one, statistically more stable and with a better structure. It is worth noting that this approach seeks consensus only on the structural aspect of the communities and does not explore the different results produced by multiple methods to identify, analyze and consider any other kind of bias.
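The consensus idea of Lancichinetti et al. [2012] is built on a co-occurrence (consensus) matrix over repeated detection runs. A minimal sketch of that ingredient follows; the function name and shapes are our own illustrative choices, not theirs, and numpy is assumed.

```python
import numpy as np

def consensus_matrix(partitions, n_nodes):
    """Fraction of runs in which each pair of nodes falls in the same community.

    `partitions` is a list of label vectors, one per detection run.
    Entries near 1 mark stable pairs; the matrix can then be thresholded
    and re-clustered to obtain a consensual partition.
    """
    M = np.zeros((n_nodes, n_nodes))
    for labels in partitions:
        labels = np.asarray(labels)
        # pairwise equality of labels, accumulated over runs
        M += (labels[:, None] == labels[None, :]).astype(float)
    return M / len(partitions)
```

For example, two runs labeling three nodes as [0, 0, 1] and [0, 1, 1] agree on each pair only half of the time, and the off-diagonal entries of the matrix reflect that.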
Bias in Evaluation Metrics.
Existing works usually consider only specific aspects to assess the quality of a community, for example by measuring the structure derived from its connectivity (structural aspect) [Newman and Girvan 2004] or by measuring its similarity with a ground truth community (functional aspect) [Peel et al. 2017], to finally perform a comparison with a good baseline score [Hric et al. 2014]. Regarding the structural aspect, community detection algorithms are usually evaluated by correlated metrics or by the same metrics used by their optimization function, such as modularity [Fortunato 2010; Yang and Leskovec 2015], which can produce some biased results. Another type of bias associated with a metric may be an incorrect score systematically attributed according to a specific characteristic present in the data. For example, popular quality metrics present strong bias when applied to networks with different sizes or number of clusters [Almeida et al. 2012; Coscia et al.]. Regarding the functional aspect, widely used similarity metrics include the Rand Index and Normalized Mutual Information. Note that similarity metrics can provide different scores for a same network [Leão et al. 2019], in some cases even outliers. In addition, such metrics are also susceptible to producing biased results when comparing communities with specific characteristics [Amelio and Pizzuti 2015; Lei et al. 2017].

To deal with a specific type of bias associated with a metric, works like those carried out by Liu et al. [2019], Gösgens et al. [2020] and Labatut [2015] usually propose a new metric or enhance an existing one. However, we have not found in the literature any community quality or similarity metric that does not present any bias, i.e., that is capable of dealing with multiple types of bias or does not present any vulnerability to bias originating from other sources, such as detection algorithms or network data. In this sense, our contribution relies on mitigating bias in the evaluation of communities by using multiple existing metrics.
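As a concrete instance of similarity-metric behavior, the Rand Index can be written in a few lines. The helper below is our own illustrative pure-Python implementation, not a reference one; it also shows how a partition that merely refines the ground truth still scores much higher than one that collapses everything into a single community.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of node pairs on which two partitions agree (Rand, 1971).

    A pair agrees when both partitions place it together, or both apart.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

truth = [0, 0, 0, 0, 1, 1, 1, 1]
print(rand_index(truth, [0, 0, 0, 0, 1, 1, 2, 2]))  # refinement of the truth
print(rand_index(truth, [0] * 8))                   # everything in one community
```

Contrasting the two calls makes the metric's sensitivity to community granularity visible, which is one of the structural characteristics discussed above.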
Bias in Data.
In addition to the specific biases of each algorithm introduced in the data of the detected communities [Leskovec et al. 2010], there may also be some bias in data from the network [Jebabli et al. 2018; Leão et al. 2018] and from its ground truth data [Leão et al. 2019]. In a previous work, Rocha et al. [2017] described how the representation of real temporal interactions can result in biased data. More recently, Leão et al. [2018] proposed a solution that avoids the biased data problem by directly removing noise produced by sporadic relationships found in a social network. They also showed that this kind of noise may cause errors when detecting communities. Note that datasets with good ground truths are rare [Yang and Leskovec 2015] and might not be always suitable for the type of community we want to analyze [Leão et al. 2018].

Final Comments.
In addition to the need of overcoming the specific limitations discussed earlier, some studies report that it is also important to consider multiple strategies when assessing the quality of a community [Dao et al. 2020; Jebabli et al. 2018; Leão et al. 2019]. In addition, Dao et al. [2020] also systematize a process for obtaining a conclusion on the structural and functional quality of the communities, in this case, by applying the multiple criteria decision making process. Jebabli et al. [2018] also highlight the distinction between different community detection algorithms, however limited to a single comparison criterion that is based on the distribution of the number of communities.

Thus, by analyzing the above works, we identified the following main contributions of our proposed approach to assess the quality of a community: (i) it deals with different types of bias inherent in data, metrics and algorithms; (ii) it involves different aspects of the quality of a community through the use of multiple strategies; (iii) it uses multiple criteria to distinguish among different community detection algorithms and metrics used to assess them; (iv) it seeks a consensual and consistent decision; (v) it provides requirements for an evaluation maturity model; and, finally, (vi) it provides a framework to systematically assess the community detection task. These contributions are complementary to our preliminary results presented in our previous work [Leão et al. 2019].

In the history of interactions of a social network there are those that represent a strong relationship between two people in a community (e.g., a teacher and a student in a school) and others, result of chance, that represent interactions between people from different communities and most likely will not occur in the future (e.g., a phone call from a telemarketer) [Leão et al. 2017; Leão et al. 2018].
[Figure 2: pipeline with an input step (network data, community structure to be evaluated, ground truth data and communities); an experimental setup step with the community detection algorithms of Table I (LM, GM, LE, LP, SG, GN, WT, IM); a quantitative evaluation step covering the structural aspect (modularity, conductance, density, community count, community size), the functional aspect (Variation of Information, Mutual Information, Split Join Distance, Rand Index) and combined structural and functional results; an evidence gathering step (similarity matrix, score distance, statistical analysis; performance, network distance and network similarity scoring); triangulation; and a qualitative decision step with a control method feedback loop.]
Fig. 2. Overview of the proposed approach to community quality evaluation.

3. PROPOSED APPROACH

Figure 2 summarizes our approach for community quality evaluation. First, in the input step, we provide a network, its ground truth communities and the set of its communities that we want to evaluate. Next, in the experimental setup step, in addition to the set of ground truth communities, we also consider as a further source of evidence the communities detected by distinct algorithms, for example, those listed in Table I. Then, in the quantitative evaluation step, all communities are assessed by multiple structural and functional metrics in order to be compared to each other to provide a set of combined evidence, whereas in the evidence gathering step we group the results produced by each algorithm in a new set of pieces of evidence to highlight structural and functional aspects related to the quality of each community. Finally, in the qualitative decision step, we compare all pieces of evidence to get a final decision on the quality of the communities. However, it is important to notice that, in case a robust decision based on the existing pieces of evidence is not available, it is required to apply a control method and start the whole process again to collect additional evidence, as we further describe next.

A control method consists in raising a hypothesis about the effect of a given class of entities on an experimental result and testing such a hypothesis by comparing its result with that obtained without considering that class of entities on the data [Creswell and Creswell 2018].

3.1 Collecting Evidence

By using structural metrics, we are able to quantify the connectivity of specific sets of nodes in the network in terms of structural characteristics that are typical of real-world communities [Yang and Leskovec 2015]. For this, we take into account multiple pieces of evidence on the quality of a
community expressed by the results of statistical analyses of the scores obtained by metrics such as modularity, conductance and density [Newman and Girvan 2004; Yang and Leskovec 2015]. We also use specific statistics, such as the number and size of the detected communities, the distribution of the component sizes, the variance of these values, and other network metrics, such as the ones considered in Table II (see Section 4), to help analyze the results.

In this step, we also statistically estimate the consensus and the divergence between the detection algorithms with respect to the structure of the communities of a network. For this, we measure the similarity between communities detected by different algorithms, considering similarity metrics such as Variation of Information (VI), Normalized Mutual Information (NMI), Split Join Distance (SJD) and Rand Index (RI) to provide some functional evidence. By using such additional metrics, we collect evidence on the functional aspects of the detected communities by measuring the similarity between them and their respective ground truths.

Finally, in the last step (Evidence Gathering), we combine structural and functional aspects in order to make a final decision on the quality of the set of communities provided as input. For this, we measure the structural characteristics of the networks' ground truth communities. Then, we collect evidence about the agreement between such measures and the measures of the communities obtained by distinct community detection algorithms, by using the distance measure between two scores s1 and s2, defined by Equation 1.

d(s1, s2) = |s1 - s2| / (s1 + s2)    (1)

3.2 Diversification Criterion

In our experimental configuration, we estimate the diversification of community detection algorithms based on the distinction between the communities detected by them and the tendency of pairs of algorithms to detect very different communities. For this, we measure the similarity between the sets of detected communities and the distance d (see Equation 1) between the quality scores of these communities. Then, we model each set of such measures as a multilayer network, by separating in each one of its layers the measures derived from the same metric. In this network, the nodes represent different algorithms and the edges correspond to a relation of similarity among them, weighted by the respective metric value (see Figure 2). Next, a clustering analysis is performed on each layer of the network to identify the algorithms that produce similar results (i.e., belong to the same group). Finally, the algorithms that are often assigned to different groups represent a greater diversification in the baseline repertoire.

Regarding this multilayer network, we also analyze the diversification of the quality measures. For this, we measure the similarity between different layers.
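The score distance of Equation 1 is straightforward to implement; the sketch below is ours, assuming non-negative scores such as modularity or community counts.

```python
def score_distance(s1, s2):
    """Normalized distance of Equation 1 between two non-negative scores.

    Returns 0 for identical scores and approaches 1 as one score dominates.
    """
    if s1 + s2 == 0:
        return 0.0  # convention (ours): two null scores are treated as identical
    return abs(s1 - s2) / (s1 + s2)
```

For example, comparing a ground-truth modularity of 0.4 with a detected one of 0.2 gives d = 0.2 / 0.6, roughly 0.33.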
Thus, each pair of layers with very similar clusters reinforces the respective pieces of evidence collected in those layers, when they correspond to uncorrelated metrics. On the other hand, very different clusters indicate diversity of measurements and can be evidence of the effect of bias in some of the metrics. Then, this set of evidence is considered to decide on the final quality of the communities.

3.3 Decision by Triangulation

By analyzing all pieces of evidence considered (structural, functional and the two combined), we capture distinct aspects of the communities' quality and conclude on the quality of their structures by means of a consensual decision. For this, we first cross-check these pieces of evidence to estimate their validity and consistency. Thus, any agreed evidence on a specific aspect reinforces the internal consistency of that strategy on that specific aspect. On the other hand, the agreement on pieces of evidence collected considering different aspects allows our approach to validate the quality evidenced by the corresponding strategies.
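At its simplest, the triangulated decision reduces to checking whether independent strategies elect the same candidate. The toy helper below uses our own naming and is not part of the paper's tooling; it only captures that rule.

```python
def triangulate(votes):
    """Return the consensual winner if every strategy agrees, else None.

    `votes` maps a strategy name (e.g. "structural", "functional") to the
    algorithm it ranks first; disagreement signals a possible bias to inspect.
    """
    winners = set(votes.values())
    return winners.pop() if len(winners) == 1 else None

print(triangulate({"structural": "WT", "functional": "WT", "combined": "WT"}))  # WT
print(triangulate({"structural": "LM", "functional": "WT", "combined": "WT"}))  # None
```

In the approach described here a `None` outcome is not a dead end: it triggers the control method and another round of evidence collection.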
The final decision on the quality of the community structures is obtained when a consensual consistency among all strategies is achieved. Possibly, the disagreement between different pieces of evidence explains the internal bias of a strategy (for example, a disagreement in one of its metrics) or even a bias of the strategy itself (when a set of its pieces of evidence disagrees with a set of pieces of evidence produced by another strategy). In this analysis, a cross-checking of the consistency and validity of the pieces of evidence with respect to the quality of the communities detected by a specific algorithm is also carried out.

3.4 Controlling Data Bias

To strengthen the pieces of evidence on the quality of a community and allow a more consistent and consensual conclusion, we first verify the existence of bias in the network data and in its ground truth data. Then, we check the effect of data bias by controlling its source. More specifically, we estimate and minimize the effect of bias by removing from the network nodes and edges that systematically damage its structure. Note that here "data bias" is any error generated by a community detection algorithm that might be associated with some noise in the network being assessed.

For example, after collecting evidence about the structural and functional quality of the communities existing in a network, a low consensus among very convincing measures that express such a quality is an indication that the source of bias is the data. To verify this, we apply a control method, i.e., we deliberately remove the supposedly biased part from the network structure by using a specific network filter. Then, we apply the entire evaluation flow shown in Figure 2 and collect additional pieces of evidence for each evaluation strategy.
If this procedure leads to a higher level of consensus (not necessarily indicating higher quality communities), we obtain the confirmation that the structure removed from the data influenced the assessment, possibly in a systematic way. Note that we adopt three different types of control filter: for noisy edges, noisy nodes and small components, which are described next.

To filter out noisy edges, we use the framework proposed in our previous work [Leão et al. 2018]. When identifying nodes in the network that belong to an entity class that violates the community structure, we use a filter per class, i.e., we remove from the network (and from its ground truth) all nodes labeled in the respective ground truth with that class identifier. With the removal of these noisy nodes and edges, we aim to extract communities with a better defined structure and, most importantly, that allow a more objective assessment and the collection of evidence to decide on the quality of the communities. Finally, the third filter, used as a control method, aims to remove very small and dense components from the network, since they contribute little to distinguish specific bias from the algorithms, but can distort the structure of the detected communities and influence their quality assessment due to the inherent bias of algorithms and metrics.

4. EXPERIMENTAL RESULTS

To evaluate our proposed approach, we ran a series of experiments to assess the communities derived from twelve networks by applying a combination of eight algorithms based on state-of-the-art community detection methods. Note that in these experiments we analyze the communities generated by each algorithm separately, considering the other ones as their baselines.
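The third control filter of Section 3.4 (removal of very small components) might be sketched with networkx as follows; the function name and the threshold are our assumptions, not the paper's implementation.

```python
import networkx as nx

def drop_small_components(G, min_size=5):
    """Control filter: remove connected components smaller than `min_size`.

    Tiny components add little signal for distinguishing algorithm bias,
    but they can distort structural metrics and the detected communities.
    """
    keep = [c for c in nx.connected_components(G) if len(c) >= min_size]
    if not keep:
        return nx.Graph()
    return G.subgraph(set.union(*keep)).copy()
```

After filtering, the whole evaluation flow of Figure 2 is re-run on the reduced network and the new evidence is compared with the original one.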
In addition, experiments involving non-deterministic algorithms (see Table I) were performed several times (at least 30 repetitions) to ensure the reliability of the results.

4.1 Networks

Table II. Characterization of the networks.

Application Domain            Network                          |V|    |E|    ∆     D    CC    C
Scientific Collaboration      APS [Brandão and Moro 2017]      181k   852k   305   0.5  0.33  5k
                              PubMed [Brandão and Moro 2017]   444k   5.5M   4869  0.6  0.36  9k
                              arXiv [Leão 2018]                33k    180k   424   3.3  -     3k
                              SIC10 [Leão et al. 2019]         3k     12k    96    20   0.34  119
Contact in a Hospital         LH10 [Vanhems et al. 2013]       76     1k     65    4k   0.6   1
Contact in a High and
Primary School                HSC [Génois and Barrat 2018]     327    5818   87    1k   0.44  1
                              PSC [Stehlé et al. 2011]         242    8k     134   3k   0.48  1
Contact in a French Health
Institute                     IVS13 [Génois et al. 2015]       92     1k     44    2k   0.37  1
                              IVS15 [Génois and Barrat 2018]   217    4k     84    2k   0.36  1
Contact in the Hypertext
ACM Conference                ACM09 [Isella et al. 2011]       403    10k    189   1k   0.24  1
Simulated Networks            HSC(S) [Leão et al. 2019]        ≈      ≈      ≈     ≈    ≈     ≈
|V|: set of vertices; |E|: set of edges; ∆: max degree; D: density (x−); CC: cluster coefficient; C: number of components. The min degree is 1 in all networks.

Initially, we modeled as temporal and aggregate edge graphs the following scientific collaboration networks (here identified according to their respective datasets): APS - coauthorship network of members of the American Physical Society [Brandão and Moro 2017; Leão et al. 2018]; PubMed - coauthorship network derived from scientific articles available in MEDLINE [Brandão and Moro 2017; Leão et al. 2019]; arXiv - coauthorship network derived from scientific articles deposited in arXiv [Leão 2018]; SIC10 - coauthorship network of papers presented at the 2019 Seminar on Scientific Initiation held at the Federal Institute of Northern Minas Gerais [Leão et al. 2019]; LH10 - contact network between people in a hospital [Vanhems et al. 2013]; HSC - contact network in a high school [Gemmetto et al. 2014; Génois and Barrat 2018]; PSC - contact network in a primary school [Stehlé et al. 2011]; IVS13 and IVS15 - contact networks in a French health institute in two different years [Génois et al. 2015; Génois and Barrat 2018]; ACM09 - contact network in the ACM 2009 Hypertext Conference [Isella et al.
2011]; HSC(S) - simulated network based on the HSC network [Leão 2018]; EEU(S) - simulated network based on an e-mail exchange network [Leão 2018].

In addition, we modeled the metadata of these networks to use them as ground truths for their respective communities, where each community is identified by the predominant research area of their respective researchers in the collaboration networks, by the class of the students in the high/primary school contact networks, and by the department of the workers in the French Health Institute contact networks. With respect to the Hypertext ACM Conference network, we notice that it represents a single community. Finally, the ground truths of the simulated networks are the same of the real networks from which they were generated [Nunes et al. 2017].

Table II presents a general characterization of these networks. Note that it was important to consider in our experiment synthetic and real-world networks that have many variations of type, size and complex structural features. In addition, the number of networks used is justified by the need to collect sufficient evidence to obtain consistent results.

4.2 Evidence Considered

The combination of functional and structural evidence in our experiments allowed us to corroborate the quality of the ground truths as well as of the communities detected in all networks. This also made it possible to indicate the algorithm that identified the best communities in the networks. For this, we first analyzed the results of each strategy individually, providing hypotheses about the quality of the communities. Then, we combined these results, verifying the consensus among the quality of the communities. In this way, we verified which hypotheses were refuted, as well as the biases identified.
Fig. 3. Details of some of the pieces of evidence considered in the structural evaluation strategy: (left) modularity values for the communities detected by all considered algorithms in each network (boxplot) and for the respective ground truths (blue dot); (right) number of communities detected by all considered algorithms in each network (boxplot) and in the respective ground truths (blue dot).
4.2.1 Identifying the Best Communities.
First, we analyze the structure of the communities detected by the different algorithms. Here, we note that the communities derived from the HSC, SIC10 and arXiv networks present the best defined characteristics. For this, we considered the following pieces of evidence: high average modularity (Figure 3), greater consensus on the structure of the communities (interquartile range of the similarity between them, presented in Figure 4), greater confidence in the modularity value obtained in different experiments with the same non-deterministic algorithm (coefficient of variation below 0.1 for the modularity values across repetitions of the detection experiments), and small variation in the number of communities detected by these algorithms (Figure 3). However, as we shall see below, although such pieces of evidence indicate that the communities from these three networks have the same characteristics, we have not come to the same conclusion about their quality.

From a functional viewpoint, unlike the High School network, in the arXiv and SIC10 networks there is no convergence of evidence to confirm the quality of their communities when compared with their ground truths (Figure 4, right). This can be considered a disagreement with respect to the structural aspect when we compare the distance between the structural measure values, such as the modularity values or the number of communities, of the arXiv and SIC10 networks with those of most other networks.

Note in Figure 3 (left), for example, the large difference between the modularity values of the ground truths and those estimated for the detected communities in the arXiv and SIC10 networks. In addition, according to Figure 3, the number of communities in the ground truths is far from the number of communities actually detected in the networks. Therefore, the strength of these initial pieces of evidence led us to the conviction that the communities detected in the actual networks are the correct ones.
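The stability criterion above (coefficient of variation below 0.1 across repeated runs of a non-deterministic algorithm) can be computed with a few lines of standard-library Python; the modularity values used here are hypothetical:

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample standard deviation / mean. The text treats CV < 0.1
    across repetitions of a detection experiment as evidence that the
    algorithm's modularity estimate is stable."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical modularity values from five runs of the same algorithm.
runs = [0.61, 0.63, 0.62, 0.60, 0.64]
stable = coefficient_of_variation(runs) < 0.1  # True for this example
```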
In addition, the confidence intervals of the measures and the structural evidence that strongly disagree with the functional one corroborate the interpretation that the detected communities are the real ones and not those shown by the ground truths. This means that the two sets of evidence, structural and functional, contradict each other on which communities in the arXiv and SIC10 networks are the best ones. Thus, we need to decide which set of evidence is the strongest one: the structural quality of the ground truth or that of the detected communities. Based on our approach, the possibility of bias being the cause of these divergences makes it necessary to evaluate them by using a third set of evidence (on a new and independent particular aspect) to support one of the two contradictory sets. To do so, we have analyzed these contradictory sets of evidence in order to raise some hypotheses about the main source of bias in the convergence of the results on the quality of the communities of the arXiv and SIC10 networks.

The distance between these two values (number of communities) is measured by the metric given by Equation 1.
Fig. 4. Details of some of the pieces of evidence considered in the functional evaluation strategy: (left) similarity values between the communities detected by different algorithms, expressed by different metrics (Adjusted Rand Index, Rand Index, Split Join Distance, Normalized Mutual Information, Variation of Information); (right) similarity values between the ground truth and the communities detected by distinct algorithms, expressed by the same metrics. Note that, especially for the metrics VI and SJD, the lower their values, the greater the similarity indicated.
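One of the distance-like metrics in Figure 4, the Variation of Information, can be computed from scratch for two partitions given as node-to-label mappings; this is a minimal sketch, not the implementation used in the experiments:

```python
import math
from collections import Counter

def variation_of_information(part_a, part_b):
    """VI(A, B) = H(A) + H(B) - 2*I(A; B) for two partitions of the
    same node set, each given as a node -> community-label mapping.
    VI is a distance: 0 means identical partitions (lower = more
    similar, as noted for VI and SJD in Fig. 4)."""
    n = len(part_a)
    ca = Counter(part_a.values())
    cb = Counter(part_b.values())
    joint = Counter((part_a[v], part_b[v]) for v in part_a)
    h_a = -sum(c / n * math.log(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log(c / n) for c in cb.values())
    mi = sum(c / n * math.log((c / n) / ((ca[i] / n) * (cb[j] / n)))
             for (i, j), c in joint.items())
    return h_a + h_b - 2 * mi
```

For identical partitions VI is 0; comparing the all-in-one partition with the all-singletons partition of n nodes yields log(n).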
As we can see, in the arXiv and SIC10 networks the bias caused by the detection algorithms does not considerably interfere in their results, since the communities suggested by them are structurally similar. In addition, this was evidenced in these networks by all structural metrics considered, whose values corroborate a high-quality community structure, as already shown in Figures 3 and 4. Thus, we hypothesized that the bias interference is predominantly in the data, provoking a disagreement between pieces of structural and functional evidence, as well as the identification of false communities of high quality.

To investigate this, we first analyze the ground truth communities of the scientific collaboration networks and then consider the meaning of these communities in those networks, i.e., they are groups of researchers that publish together and predominantly in the same area of knowledge. However, it should be noted that this definition is not absolute, since there may be a multidisciplinary community with sporadic co-authorships or a community of researchers that work in the same area but do not collaborate with each other. In both possible cases, ground truth communities are not very well captured by detection algorithms that rely on network connectivity. Thus, we consider the hypothesis that the bias that obscures the real community structure of the arXiv and SIC10 networks is a consequence of the existence of edges and nodes that represent, respectively, sporadic collaborations and researchers that work in the same knowledge area but do not significantly interact with each other. Then, to test our hypothesis, we ran the following bias control experiment: we removed such skewed edges and nodes using the filtering framework proposed by Leão et al. [2018] and then, as a new iteration, followed again the steps of our evaluation approach as shown in Figure 2, using as input the filtered version of the network.
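The flavor of this bias control can be illustrated with a toy edge filter. This is NOT the actual filtering framework of Leão et al. [2018], only a stand-in that drops low-weight edges (e.g. sporadic co-authorships); the threshold and data are hypothetical:

```python
def filter_sporadic_edges(edges, min_weight=2):
    """Drop edges whose weight falls below a threshold. `edges` maps an
    (u, v) pair to a weight, e.g. the number of co-authored papers;
    weight-1 edges stand in for sporadic collaborations."""
    return {e: w for e, w in edges.items() if w >= min_weight}

# Hypothetical weighted coauthorship edges.
edges = {("a", "b"): 7, ("a", "c"): 1, ("b", "c"): 3}
kept = filter_sporadic_edges(edges, min_weight=2)
# kept == {("a", "b"): 7, ("b", "c"): 3}
```

After such a filter, community detection would be re-run on the filtered network, as in the new evaluation iteration described above.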
From this new iteration, we obtained new communities in which the structural and functional metrics of the arXiv network converged, as indicated by their greater similarity with the ground truth communities and by a better structural aspect, as indicated by all metrics used for this purpose (Figure 3).

However, this same convergence was not observed in networks such as SIC10, which was therefore also inspected by a new iteration of the evaluation flow of our approach. This time, the second control method used was the filtering of the smallest connected components, due to the considerable number of them in this network. As a result, we found that the measurement of the structural quality of its communities is biased by the presence of these components. Finally, with these findings, a consensual decision was reached among the evaluation strategies, thus confirming that the communities of both networks are of High Quality. Note that the components that distorted the structural quality of both networks to Very High correspond only to noise in the SIC10 network (5% of its structure) and in the arXiv network itself (90% of its structure).
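The second control method, filtering out the smallest connected components, can be sketched with a standard-library BFS; the size threshold is a hypothetical parameter:

```python
from collections import deque

def drop_small_components(adj, min_size=2):
    """Remove connected components smaller than `min_size` from an
    undirected graph given as an adjacency dict (node -> neighbor list),
    mirroring the component filter applied to SIC10 in the text."""
    seen, keep = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:                       # BFS over one component
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        if len(comp) >= min_size:
            keep |= comp
    return {u: [v for v in adj[u] if v in keep] for u in adj if u in keep}

adj = {"a": ["b"], "b": ["a"], "c": []}    # "c" is an isolated component
filtered = drop_small_components(adj, min_size=2)
# filtered == {"a": ["b"], "b": ["a"]}
```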
Table III. Decision on the quality of the communities.

          Premature Decision       Data Quality            Control   Main Source of      Consensual
Network   Structural  Functional   Net        Ground Truth Method    Significant Bias    Decision
LH10      Very Low    Very Low     Low        Medium       NF,EF     Net and GT          Low
PubMed    Medium      Low          Medium     Very Low     EF        GT, Alg and Met     Low
APS       High        Low          Medium     Low          EF        GT, Alg and Met     Low
ACM09     Very Low    Very Low     Low        Very Low     EF        Net and GT          Medium
EEU(S)    Low         Medium       Very Low   High         EF        Net                 High
arXiv     Very High   Low          Medium     Low          EF        Net and GT          High
SIC10     Very High   Medium       High       Medium       EF,CF     GT and Met          High
PSC       Very Low    Very Low     Low        Medium       EF        Net                 Very High
HSC(S)    Low         High         Medium     Very High    EF        Net                 Very High
InVS15    Low         High         Medium     Very High    EF        Net                 Very High
InVS13    Low         High         Medium     Very High    EF        Net                 Very High
HSC       High        Very High    High       Very High    EF        None                Very High

Main source of significant bias: GT (ground truth data); Alg (algorithms); Met (metrics); Net (network data). Control method: NF (node filter); EF (edge filter); CF (component filter).
This way, we were able to identify that the most significant source of bias in the arXiv network was in its data, and in the SIC10 network in the structural metrics used, such as modularity. This influenced the premature results of assessing the quality of the communities in these networks, respectively underestimating and overestimating their values. Furthermore, these two situations exemplify how considering few pieces of evidence can lead to apparently very convincing, but unreliable, results. Note also that the divergence between the consensual conclusions obtained by different strategies and the premature conclusions is somewhat recurrent when evaluating communities in all the networks used. More precisely, in 60% of the networks the quality measured by the structural or functional strategies showed underestimated values, and in 30% of the networks overestimated values, in relation to the decision that considers the influence of biases. Table III shows all premature decisions about the quality of communities, as well as the consensual decision obtained by identifying and considering biases. It is worth highlighting some of these cases, as follows.

In the APS and PubMed networks, data bias is also the main source of divergence between sets of evidence, beyond bias in structural metrics and detection algorithms. The communities of the LH10, ACM09 and PSC networks presented themselves with very low quality, indicated individually by the scores in all strategies. Despite this, the low confidence of these estimates (also widely evidenced, as detailed in Figures 3 and 4) made us investigate and check for bias in their data, in part caused by the presence of noisy edges, as defined by Leão et al. [2018].

The generated datasets are available by request at the http://cnet.jcloud.net.br repository.
In addition, in the LH10 network data we identified a class of nodes that tends to violate the community structure and that, when isolated, allowed us to test its effect on the evaluation result. Finally, our assessment approach revealed communities with different quality factors in these networks, from low to very high. The PSC network stands out, with communities of very high quality revealed by our approach, which demonstrates that the premature decision was very wrong and influenced by the noisy edges. Highlighting milder mistakes in their premature decisions, the identification and isolation of biases in the EEU(S), HSC, InVS15, HSC(S) and InVS13 networks allowed us to conclude that these networks have high-quality communities. In general, the final decision on the quality of the networks was obtained with two iterations of our assessment approach, the first always made on the data of the skewed network and the other controlling the bias of these data by filtering noisy edges (except for LH10 and SIC10, which required a third iteration and specific control methods).

4.2.2 Best Detection Algorithms.
Although the most modular communities are those detected by the Louvain algorithm (upper bounds shown in Figure 3, left), the modularity values of the communities detected by the Infomap algorithm are generally closer to those of the ground truth (there is a greater agreement between them). In addition, Infomap provided several cases in which there was an agreement between the modularity of the detected communities and that of the ground truth. On the other hand, these same metrics achieved smaller values for the Louvain algorithm. Moreover, the modularity of the communities extracted by different algorithms and that of the functional communities varied considerably for most networks. We also verified how different community detection algorithms agree with each other and with the network ground truths with respect to their communities. Despite the variation in the structure of the detected communities, as shown in Figure 4 (left), there was a higher consensus among them than with respect to their ground truth communities (Figure 4, right).
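The modularity values compared above follow the Newman-Girvan definition, which can be computed directly for any partition; this is a minimal sketch for an unweighted undirected graph, with a hypothetical toy graph as usage example:

```python
def modularity(edges, membership):
    """Newman-Girvan modularity Q = sum over communities c of
    (m_c / m - (d_c / 2m)^2), where m is the total number of edges,
    m_c the edges inside c and d_c the sum of degrees in c.
    `edges` is a list of (u, v) pairs; `membership` maps node -> community."""
    m = len(edges)
    intra = {}   # number of edges inside each community
    degree = {}  # sum of degrees per community
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Two triangles joined by a bridge, each triangle its own community.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(edges, part)  # 6/7 - 1/2, about 0.357
```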
Table IV. Best detection algorithms according to distinct experiments.

Best Algorithm  LM             GM   LE           LP            WT             IM
Metrics         ARI, SJD       ARI  SJD*, VI*    RI, SJD, VI   ARI, RI        ARI, NMI*, RI, SJD, VI
Networks        PubMed, arXiv  APS  APS, PubMed  APS, Synth.   arXiv, Synth.  All

*Metrics with the best overall value.
In addition to obtaining a consensus among different algorithms, our approach also identified some algorithms with distinct behavior, such as Infomap, which detected less modular communities but, in general, communities more similar to their ground truths. Despite such divergences among the strategies, most pieces of evidence indicate the Louvain algorithm as the least biased among those based on modularity maximization and the one that obtained estimates with higher values for most of the structural metrics, particularly modularity. We also identified some algorithms that presented the best score on some specific metrics. This is the case of the Louvain algorithm (LM) for modularity, and of the Infomap (IM) and Leading Eigenvector (LE) algorithms for the similarity metrics Normalized Mutual Information (NMI), Split Join Distance (SJD) and Variation of Information (VI), respectively (see Table IV). Notice that our proposed approach is able to analyze distinct alternative solutions for the task at hand, thus being able to identify the algorithms that provide the best trade-off.

5. CONCLUSIONS AND FUTURE WORK

The main contribution of this paper is an approach to identify and reduce the effect of biases when assessing the quality of a set of communities. Specifically, we use multiple and diversified measurement strategies designed to capture different aspects of the quality of a community structure. For its evaluation, we carried out a set of experiments using twelve networks (ten real ones and two synthetic ones) and compared the results obtained by eight community detection algorithms considered the state of the art in the area.
In addition, we also used distinct metrics, each one providing a piece of evidence from a specific point of view on the quality of the assessed communities.

In this context, we consider the consensus and the divergence between such pieces of evidence to hypothesize about the influence of bias coming from metrics, algorithms or network data when assessing the quality of such communities. Thus, outliers observed by statistically analyzing the resulting measurements allowed us to test hypotheses of specific biases with respect to metrics and detection algorithms. To test a hypothesis on bias in network data or in its ground truth metadata, we use control methods, such as filters on nodes or network edges. These methods allowed us to verify that, by removing the supposedly biased structures from the data, a greater consensus in the evaluation of their communities can be verified by the multiple metrics and strategies used by our approach.

By doing so, we were able to sustain our hypothesis by showing that the quality evaluation of communities detected from a network must be supported by multiple pieces of evidence. That is, given the discrepancy between the quality indicated by distinct evaluation strategies, we show that the use of a single quality metric or strategy, be it structural or functional, makes the results biased and unreliable. On the other hand, our multi-strategy evaluation approach makes it possible to explain extreme values for some of the metrics considered and to decide which strategies lead to a more consistent conclusion about the quality of a community. For example, we were able to verify the existence of bias in some metrics, network data and detection algorithms, which allowed us to reach a consensual decision very different from the premature one individually suggested by the metrics used by one of the strategies.

A current limitation of our proposed approach is the use of a predefined set of evaluation metrics and community detection algorithms.
However, this limitation can be easily overcome by providing a configurable framework in which such features could be defined according to specific characteristics of the networks being considered. It is also worth noting that the approach proposed in this article can be applied to other types of algorithms (such as those for clustering tabular data, backbone extraction, core-periphery analysis, and detection of dynamic or overlapping communities), as well as adapted to other tasks besides community detection (such as system modeling or simulation, supervised machine learning techniques, and missing data prediction). Finally, another line of future work could be, for example, adapting this approach to assess the task of link prediction in social networks in order to provide more robust results.

Acknowledgements

Work supported by project MASWeb (FAPEMIG/PRONEX grant APQ-01400-14) and by the authors' individual grants from CNPq and FAPEMIG. Particularly, the first author would like to thank LBD/UFMG, JCLoud.net.br and LabSiCCx - Laboratório de Sistemas Computacionais Complexos (PROPPI/IFNMG, project Nr. 209/2019) for the infrastructure provided.

REFERENCES
Abrahao, B., Soundarajan, S., Hopcroft, J., and Kleinberg, R. On the Separability of Structural Classes of Communities. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA, pp. 624–632, 2012.
Almeida, H., Guedes, D., Meira Jr., W., and Zaki, M. J. Towards a Better Quality Metric for Graph Cluster Evaluation. Journal of Information and Data Management.
Amelio, A. and Pizzuti, C. Is Normalized Mutual Information a Fair Measure for Comparing Community Detection Methods? In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. Paris, France, pp. 1584–1585, 2015.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment.
Brandão, M. A. and Moro, M. M. The strength of co-authorship ties through different topological properties. Journal of the Brazilian Computer Society 23 (1): 5, 2017.
Brender, J. Framework for Meta-Assessment of Assessment Studies. In Handbook of Evaluation Methods for Health Informatics. Burlington, pp. 253–320, 2006.
Clauset, A., Newman, M. E. J., and Moore, C. Finding community structure in very large networks. Physical Review E vol. 70, pp. 066111, 2004.
Coscia, M. Discovering Communities of Community Discovery. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. New York, NY, USA, pp. 1–8, 2019.
Coscia, M., Giannotti, F., and Pedreschi, D. A classification for community discovery methods in complex networks. Statistical Analysis and Data Mining: The ASA Data Science Journal.
Creswell, J. W. and Creswell, J. D. Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. SAGE Publications, Thousand Oaks, California, USA, 2018.
Dao, V. L., Bothorel, C., and Lenca, P. Community structure: A comparative evaluation of community detection methods. Network Science.
Fortunato, S. Community detection in graphs. Physics Reports 486 (3–5): 75–174, 2010.
Gandica, Y., Decuyper, A., Cloquet, C., Thomas, I., and Delvenne, J.-C. Measuring the effect of node aggregation on community detection. EPJ Data Science.
Gemmetto, V., Barrat, A., and Cattuto, C. Mitigation of infectious disease at school: targeted class closure vs school closure. BMC Infectious Diseases 14 (1): 695, 2014.
Génois, M. and Barrat, A. Can co-location be used as a proxy for face-to-face contacts? EPJ Data Science.
Génois, M., Vestergaard, C. L., Fournet, J., Panisson, A., Bonmarin, I., and Barrat, A. Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers. Network Science.
Ghasemian, A., Hosseinmardi, H., and Clauset, A. Evaluating Overfit and Underfit in Models of Network Community Structure. IEEE Transactions on Knowledge and Data Engineering 32 (9): 1722–1735, 2020.
Gösgens, M., Tikhonov, A., and Prokhorenkova, L. Systematic analysis of cluster similarity indices: How to validate validation measures. CoRR vol. arXiv:1911.04773, 2020.
Hric, D., Darst, R. K., and Fortunato, S. Community detection in networks: structural communities versus ground truth. Physical Review E 90 (6): 62805, 2014.
Isella, L., Stehlé, J., Barrat, A., Cattuto, C., Pinton, J.-F., and Van den Broeck, W. What's in a crowd? Analysis of face-to-face behavioral networks. Journal of Theoretical Biology 271 (1): 166–180, 2011.
Jebabli, M., Cherifi, H., Cherifi, C., and Hamouda, A. Community detection algorithm evaluation with ground-truth data. Physica A: Statistical Mechanics and its Applications vol. 492, pp. 651–706, 2018.
Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J. P., Moreno, Y., and Porter, M. A. Multilayer networks. Journal of Complex Networks.
Labatut, V. Generalised measures for the evaluation of community detection methods. International Journal of Social Network Mining.
Lancichinetti, A. and Fortunato, S. Consensus clustering in complex networks. Scientific Reports vol. 2, pp. 336, 2012.
Lei, Y., Bezdek, J. C., Romano, S., Vinh, N. X., Chan, J., and Bailey, J. Ground truth bias in external cluster validity indices. Pattern Recognition vol. 65, pp. 58–70, 2017.
Leskovec, J., Lang, K. J., and Mahoney, M. Empirical Comparison of Algorithms for Network Community Detection. In Proceedings of the 19th International Conference on World Wide Web. New York, NY, USA, pp. 631–640, 2010.
Leão, J. C. An Approach for Detecting Communities from Sequences of Social Interactions. M.S. thesis, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil (in Portuguese), 2018.
Leão, J. C., Brandão, M. A., Vaz de Melo, P. O. S., and Laender, A. H. F. Mineração de Perfis Sociais em Redes Temporais. In Anais do 32º Simpósio Brasileiro de Bancos de Dados, SBBD 2017. Uberlândia, MG, pp. 264–269, 2017.
Leão, J. C., Brandão, M. A., Vaz de Melo, P. O. S., and Laender, A. H. F. Who is really in my social circle? Mining social relationships to improve detection of real communities. Journal of Internet Services and Applications.
Leão, J. C., Cardoso, R. J. S., and Santos, A. B. Uma análise temporal da rede de colaboração científica do IFNMG: 10 anos de iniciação científica e orientação acadêmica. Anais dos Simpósios de Informática do IFNMG - Campus Januária vol. 11, pp. 7, 2019.
Leão, J. C., Laender, A. H. F., and Vaz de Melo, P. O. S. A Multi-Strategy Approach to Overcoming Bias in Community Detection Evaluation. In Anais do 34º Simpósio Brasileiro de Bancos de Dados, SBBD 2019. Fortaleza, CE, pp. 13–24, 2019.
Liu, X., Cheng, H., and Zhang, Z. Evaluation of community detection methods. IEEE Transactions on Knowledge and Data Engineering 32 (9): 1736–1746, 2020.
Newman, M. E. J. Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103 (23): 8577–8582, 2006.
Newman, M. E. J. and Girvan, M. Finding and evaluating community structure in networks. Physical Review E 69 (2): 26113, 2004.
Nunes, I. O., Celes, C., Silva, M. D., Vaz de Melo, P. O. S., and Loureiro, A. A. F. GRM: Group Regularity Mobility Model. In Proceedings of the 20th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems. New York, NY, USA, pp. 85–89, 2017.
O'Donoghue, T. and Punch, K. Qualitative Educational Research in Action: Doing and Reflecting. Routledge, Abingdon, UK, 2003.
Peel, L., Larremore, D. B., and Clauset, A. The ground truth about metadata and community detection in networks. Science Advances.
Pons, P. and Latapy, M. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005: 20th International Symposium, Istanbul, Turkey, October 26-28, 2005, Proceedings, P. Yolum, T. Güngör, F. Gürgen, and C. Özturan (Eds.). Berlin, Heidelberg, pp. 284–293, 2005.
Raghavan, U. N., Albert, R., and Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E 76 (3): 036106, 2007.
Rand, W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 (336): 846–850, 1971.
Reichardt, J. and Bornholdt, S. Statistical mechanics of community detection. Physical Review E vol. 74, pp. 016110, 2006.
Rocha, L. E. C., Masuda, N., and Holme, P. Sampling of temporal networks: Methods and biases. Physical Review E 96 (5): 52302, 2017.
Rosvall, M. and Bergstrom, C. T. Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. PLOS ONE.
Stehlé, J., Voirin, N., Barrat, A., Cattuto, C., Isella, L., Pinton, J.-F., Quaggiotto, M., Van den Broeck, W., Régis, C., Lina, B., and Vanhems, P. High-resolution measurements of face-to-face contact patterns in a primary school. PLOS ONE.
Vanhems, P., Barrat, A., Cattuto, C., Pinton, J.-F., Khanafer, N., Régis, C., Kim, B.-a., Comte, B., and Voirin, N. Estimating potential infection transmission routes in hospital wards using wearable proximity sensors. PLOS ONE.
Vieira, V. F., Xavier, C. R., and Evsukoff, A. G. Comparing the Community Structure Identified by Overlapping Methods. In Complex Networks and Their Applications VIII, H. Cherifi, S. Gaito, J. F. Mendes, E. Moro, and L. M. Rocha (Eds.). Cham, pp. 262–273, 2020.
Yang, J. and Leskovec, J. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems 42 (1): 181–213, 2015.
Zachary, W. W. An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33 (4): 452–473, 1977.
Zaki, M. J. and Meira Jr., W. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, New York, NY, USA, 2014.
Zhao, Y. A survey on theoretical advances of community detection in networks.