Finding Scientific Communities In Citation Graphs: Convergent Clustering

Abstract

Understanding the nature and organization of scientific communities is of broad interest. The ‘Invisible College’ is a historical metaphor for one such type of community, and the search for such ‘colleges’ can be framed as the detection and analysis of small groups of scientists working on problems of common interest. Case studies have previously been conducted on individual communities with respect to their scientific and social behavior. In this study, we introduce a new and scalable community finding approach. Supplemented by expert assessment, we use the convergence of two different clustering methods to select article clusters generated from over two million articles from the field of immunology spanning an eleven year period, with relevant cluster quality indicators for evaluation. Finally, we identify author communities defined by these clusters. A sample of the article clusters produced by this pipeline was reviewed by experts and shows strong thematic relatedness, suggesting that the inferred author communities may represent valid communities of practice. These findings suggest that such convergent approaches may be useful in the future.


Shreya Chandrasekharan, Mariam Zaka, Stephen Gallo, Tandy Warnow, and George Chacko

Netelabs, NET ESolutions Corporation (an NTT DATA Company), McLean, VA, USA
American Institute of Biological Sciences, Herndon, VA, USA
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA

July 30, 2020

Index terms: Invisible College, Clustering, Community Finding, Citation Graph, Scientific Organization


Introduction

In this article, we report on an effort to use citation data to identify groups of scientific articles that may reflect widespread small-scale organization in the scientific enterprise. We are inspired by the ‘Invisible College’ concept, which appears to originate from a group of intellectually active individuals who held meetings around 1660 and became the Royal Society of London in 1663 (Price & Beaver, 1966; The Royal Society, 2020) but more generally refers to a relatively small self-assembled group of scientists with common scientific interests.

Importantly, there is a sense that these colleges are ‘in groups’ with influence over prestige, research funding, and the scientific ideas of their community (Price & Beaver, 1966). Thus, these groups may also advocate for or exhibit resistance to new ideas within their domains of interest (Barber, 1961). Furthermore, while such groups may espouse idealized norms of science (Merton, 1957), they are also likely to be driven by social interests such as personal recognition that influence both individual and collective behavior (Barber, 1962; Crane, 1972; Hagstrom, 1965).

An important distinction has also been made between local (small) and global groups in the discussion of community (Hull, 1988, p. 112), with small groups being credited as the locus of rapid change and innovation. Furthermore, Price noted, in apparent reference to Invisible Colleges, that communications are likely challenged in groups larger than 100 members (Price, 1963) and that the small ‘strips’ at the research front of science may be at most the work of ‘a few hundred’ persons (Price, 1965); this number is also cited in a clustering study (Small & Sweeney, 1985).

In a case study (Price & Beaver, 1966) of the Information Exchange Group No 1 (Green, 1965) that was organized by the US National Institutes of Health to focus on electron transfer and oxidative phosphorylation, Price and Beaver described a group of 517 members, 62% of whom were from the United States and the rest from 27 different countries. Price and Beaver used the memos shared within Information Exchange Group No 1 as proxies for research articles and citations. They noted 1,239 authorships in 533 memos, with two-author memos being the mode, which was stable across a 5 year period. The majority of these authors were associated only with a single memo, and the top 30 authors each contributed to six or more memos. Three conclusions were drawn from this study: first, that there existed a small nucleus of highly active researchers and others who collaborated with them only once; second, that separate groups existed within this college; and third, that collaboration was a key feature. This valuable case study of Price and Beaver is, however, limited by examining a tiny sample of the enterprise as it existed in the first half of the decade of 1960-1970. It is far more likely that a range of group sizes and behaviors exists now, and perhaps even then.

Thus, a natural question is how these small groups form and grow, and how to detect them from the bibliographic literature, which our study addresses. Our study is motivated by two considerations.
First, the scientific enterprise has grown considerably, experienced greater globalization (Wagner, 2008) in the 21st century, and exhibits new features like large international collaborations such as the Human Genome Project (Lander et al., 2001) and the Advanced LIGO project (Harry et al., 2010). Even so, the tendency of scientists to form small groups that collaborate to advance their scientific and social interests is unlikely to have vanished, and these groups may well exist within larger structures. Thus, we seek to understand organizational structures in the modern scientific enterprise and reconcile them with the observations of Price and others from the 1960s. Second, modern bibliographies and accessible computing make it possible to revisit the search for these small communities of practice (and in particular invisible colleges) at scale through community finding analyses of citation graphs.

In our use of citation data to identify and characterize putative colleges or communities of practice (Lave & Wagner, 1991), our working hypothesis is that such groups can be detected by identifying clusters of articles that are citation-dense, since common interests will result in citation of relevant documents, especially those authored by the ‘in group’. Furthermore, as we noted earlier, we are particularly interested in small communities, since these may represent new ideas (Hull, 1988, p. 112).

Rather than attempting to directly identify author communities of practice, we first construct article clusters, and then examine the authors within the article clusters; the motivation for this approach is that each community of practice is by its nature formed around a specific research question or area, while individual scientists may participate in multiple communities of practice based on different scientific and social interests.

We begin by identifying article clusters constructed from direct citations of articles. The use of article-level citation patterns is motivated by the understanding that clusters defined by articles that cite each other can be more informative than clusters derived from journal-derived categories (Waltman & van Eck, 2012; Shu et al., 2019; Milojevic, 2019) or from searches for words or phrases that recur in articles (Klavans & Boyack, 2017). We further restrict our examination to moderately-sized clusters based on the realization that article clusters that are either very small or very large will not assist in discovering author clusters of interest. Our pipeline (shown in Figure 1) shows the steps we used in this development of author clusters.

We have chosen the field of immunology as a test case since it has existed for many years, grown and diversified over time, and exchanges discovery and methods with other biomedical areas. To identify article clusters, we first constructed a citation graph for publications in the field of immunology within an 11 year timeframe, consisting of 2.16 million articles. We then used two different clustering methods, Markov Clustering (Dongen, 2000) and Graclus (Dhillon et al., 2007), which we previously used to cluster citation data within a discipline (Devarakonda et al., 2020), to identify clusters that are robust to choice of clustering method. To evaluate the approach, we use expert annotation on a sample of the Markov Clustering (MCL) clusters we produced that passed several stringent quality criteria (Fig. 1).

These two clustering methods use different strategies and criteria. Markov Clustering (MCL) has several desirable features: it is scalable, does not require pre-specification of the number of clusters to be generated, and has tunable parameters that control breadth of search and granularity of output. Graclus is a spectral clustering method that we have previously used to construct article clusters from citation data (Devarakonda et al., 2020). In order to identify those clusters of interest, we limit our attention to those MCL clusters that have very high overlap with at least one cluster produced by Graclus: the convergence of these two methods serves to identify those publication clusters where the citation signal is high enough to result in robustness to the choice of clustering method. By restricting the set of clusters to a range of sizes consistent with a potential community of practice, we are able to select publication clusters of interest that are then used to build author clusters that may represent communities of interest.

We explore the merit of this combined approach by examining the publication clusters it produces, and use human experts to evaluate a sample of the selected clusters for thematic relatedness. Our study shows strong concordance between the cluster conductance and expert evaluation of thematic relatedness, supporting the potential value of this protocol. Thus, we consider this study a first step in designing and testing a pipeline that could enable large-scale identification of communities of different sizes and types, based on different search criteria, choice of clustering algorithm, and parameters.

Materials and Methods

As a source of bibliographic data for this study, we used Scopus (Elsevier BV, 2020), as implemented in the ERNIE project (Korobskiy et al., 2019). At the time of this analysis, our Scopus data consisted of ∼95 million publication records plus their cited references. From these data, we selected publications in English, of type ‘article’ with publication type ‘core’, and Scopus All Science Journal Classification (ASJC) code of 2403, which corresponds to immunology, for each of the years 1985–1995. We then amplified the set of immunology articles thus extracted by supplementing them with articles that directly cited them as well as with articles cited by these immunology articles. The only constraints we imposed on the cited or citing articles were to require that they were English publications of type ‘core’; in particular, the cited and citing articles were not constrained by ASJC codes.

All in all, we assembled 12 datasets (Table 1) that consisted of (i) 11 ‘year-slices’, each representing the set of immunology articles published in a given year (any of 1985-1995) along with their cited references and the publications that cited them (e.g., the 1985 year-slice) and (ii) the union of data from the 11 year-slices, to create a working dataset of 2,163,683 articles that we refer to as the combined dataset. These data were stored as an annotated list of 6,846,323 pairs representing 11 years of data.

For Markov Clustering analysis, we downloaded and compiled source code for the MCL-edge software (Dongen, 2000). After evaluating different runtime parameters, we clustered test sets using an expansion parameter of 2 (default) and an inflation parameter of 2.0 to minimize the number of large aggregated clusters (Figure 3). We generated clusters using the same parameters for each of the 11 individual years. See Table 1 for empirical properties of the MCL clusters computed for the 11 year-slices and the combined dataset. Under the same conditions, we also generated 134,094 clusters containing 2.16 million nodes from our working dataset (above), resulting in cluster sizes of 1 (minimum), 16.2 (mean), 9 (median), and 3,956 (maximum). For a random graph comparison, we performed 1 million reciprocal citation exchanges between randomly selected pairs of publications on these data and then ran MCL-edge on the resultant data.

| dataset | num clusters | num articles | mean size | median size | mean cond. | mean coh. |
|---|---|---|---|---|---|---|
| 1985 | 10,568 | 293,086 | 27.73 | 18 | 0.30 | 0.09 |
| 1986 | 10,621 | 310,030 | 29.19 | 19 | 0.32 | 0.09 |
| 1987 | 10,984 | 325,661 | 29.65 | 20 | 0.31 | 0.09 |
| 1988 | 11,697 | 363,038 | 31.04 | 21 | 0.33 | 0.09 |
| 1989 | 12,401 | 397,292 | 32.04 | 21 | 0.33 | 0.09 |
| 1990 | 12,542 | 419,500 | 33.45 | 23 | 0.34 | 0.09 |
| 1991 | 13,089 | 463,581 | 35.42 | 24 | 0.34 | 0.09 |
| 1992 | 13,878 | 507,365 | 36.56 | 25 | 0.35 | 0.09 |
| 1993 | 14,135 | 542,948 | 38.41 | 27 | 0.36 | 0.09 |
| 1994 | 14,681 | 584,768 | 39.83 | 27 | 0.38 | 0.09 |
| 1995 | 15,918 | 642,686 | 40.37 | 28 | 0.37 | 0.09 |
| combined | 134,094 | 2,163,683 | 16.14 | 9 | 0.70 | 0.09 |

Table 1: Empirical properties of the MCL clusters computed for each year-slice as well as for the combined Immunology dataset. We show statistics regarding size, conductance (cond.), and coherence (coh.).

To evaluate clusters and shuffled-citation clusters generated by MCL, we measured (i) cluster conductance (Shun et al., 2016; Emmons et al., 2016; Devarakonda et al., 2020), which essentially measures intra-cluster density, and (ii) textual coherence using the Jensen-Shannon divergence (Boyack et al., 2011), with the expectation that ‘good’ clusters would exhibit low conductance and high coherence, respectively. We also clustered data from each of the 11 years using Graclus, adjusting its runtime parameter for the number of clusters to produce a distribution of cluster sizes that roughly approximated those generated by MCL (Figure 3).

To compute textual coherence, we used the titles and abstracts of all the articles in our study. On average, roughly 11% of the publications had missing titles and/or abstracts, reducing the corpus to titles and abstracts from 1.95 million (1,955,164) articles. We first concatenated these titles and abstracts and pre-processed them by lemmatization based on parts of speech (POS) tagging, to preserve the tokens and their context. The four POS considered were adjectives, nouns, adverbs, and verbs. For all other types of parts of speech (including those not classified at all), the token was mapped to ‘noun’ by default; for example, ‘grow’ and ‘growth’ are two different tokens while ‘grow’ and ‘growing’ would be mapped to ‘grow’.
We then removed stop-words using a list of 513 tokens comprising basic NLTK stop-words, PubMed stop-words, and a select list of tokens from the top 500 most frequent words in our dataset.

For each cluster of size greater than 10 (after removing missing values), we performed a second pre-processing by filtering out those tokens that occurred only once in the entire cluster. Next, we converted all the remaining tokens by article in the cluster into a matrix of term frequencies (i.e., for each article, we had a vector of counts of all the tokens). We also obtained a vector of counts for all the unique tokens in the cluster. Textual coherence was measured by using the Jensen-Shannon Divergence (JSD), which is used to compute the distance between two probability distributions. JSD was computed between the vector of term frequencies of the cluster and each article in the cluster using the following:

$$\mathrm{JSD}_{p,q} = \tfrac{1}{2} D_{KL}(p, m) + \tfrac{1}{2} D_{KL}(q, m)$$

where $m = (p + q)/2$, $p$ is the probability of a term in a document, $q$ is the probability of the same term in the cluster, and $D_{KL}$ is the Kullback-Leibler divergence, given by:

$$D_{KL}(p, m) = \sum_i p_i \log\left(\frac{p_i}{m_i}\right)$$

We computed the textual coherence for a given cluster $X$ of size $n$ (after removing missing values) as follows. Letting $\mathrm{JSD}_X$ denote the arithmetic mean of all article JSD values in $X$, we define the textual coherence of $X$ to be $\mathrm{JSD}_X - \mathrm{JSD}_{\mathrm{random}(n)}$ (Boyack et al., 2011), where $\mathrm{JSD}_{\mathrm{random}(n)}$ denotes the JSD of a random cluster of size $n$.

$\mathrm{JSD}_{\mathrm{random}(n)}$ is the arithmetic mean of all the JSD values computed from randomly selected sets of size $n$ from all the articles in our study. For each value $n$, we estimate $\mathrm{JSD}_{\mathrm{random}(n)}$ by selecting 50 article subsets of size $n$ at random, and averaging the results. The method of computing each iteration of $\mathrm{JSD}_{\mathrm{random}(n)}$ is exactly the same as the method described for $\mathrm{JSD}_X$ above.

After completion of MCL clustering and computing conductance and coherence values, we compared a random sample of 1,000 clusters from the year 1990 and visualized the effect of shuffling citations on conductance and coherence compared to the original citation data (Figure 2).

We scored each publication using a weighted citation count (Williams et al., 2015; Keserci, 2017) of intra-graph citations, which assigns a score to each node in a graph that consists of the number of in-graph citations it receives plus the number of citations received by its neighbors. We also computed the number of times each article in our working dataset had been cited in Scopus between publication and July 2020.

Thematic relatedness. Lastly, we used the expertise of two of the authors of this article (Zaka and Gallo), who are professional peer review specialists in the biomedical sciences, and highly experienced at clustering proposals for funding based on multiple criteria such as sub-disciplines, methods, disease, and researchers. In preliminary experiments, we provided a small number of training clusters to these evaluators representing a range of conductance values to assist in developing a common set of principles by which they would evaluate a test set. The clusters in this development set were not considered further.

For expert evaluation, we randomly selected 90 MCL clusters, each with 30–350 publications, and with conductance values of no more than 0.5. Since smaller clusters occur more frequently, the sample of 90 was constructed from two strata based on size to ensure representation of the larger cluster sizes. An additional 10 clusters with conductance values greater than 0.5 were added to the sample. The two evaluators were each asked to evaluate 50 selected clusters (45 from the set of 90 and 5 from the set of 10) for thematic relatedness, given only the titles and abstracts for each publication in each cluster. Using their expertise in peer review, they assigned scores on a simple scale of 1–4, where 1 represented a cluster exhibiting a single discernible scientific theme, 2 a moderate level of thematic relatedness, 3 poor thematic relatedness, and 4 ‘unable to evaluate’. The evaluators also annotated each cluster with keywords such as ‘hemophilia’ or ‘adenosine deaminase’ to indicate the theme that they discerned (Supplementary Data).
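The Jensen-Shannon divergence computation described above for textual coherence can be illustrated with a minimal Python sketch. This is not the code used in the study: the function names (`term_distribution`, `jsd`, `mean_article_jsd`, `random_baseline`) are hypothetical, the logarithm base (2) is an assumption because the text does not state one, and the step that filters out tokens occurring only once in a cluster is omitted for brevity.

```python
import math
import random
from collections import Counter


def term_distribution(tokens):
    """Turn a list of tokens into a term -> probability mapping."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def kl_divergence(p, m):
    """D_KL(p, m) = sum_i p_i * log2(p_i / m_i), over terms with p_i > 0.

    Base-2 logarithm is an assumption; the text only writes 'log'.
    """
    return sum(p_i * math.log2(p_i / m[term]) for term, p_i in p.items() if p_i > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two term distributions."""
    terms = set(p) | set(q)
    # m is the midpoint of p and q, so m_i > 0 wherever p_i > 0 or q_i > 0.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_article_jsd(articles):
    """Mean JSD between each article and the pooled cluster distribution.

    `articles` is a list of non-empty token lists, one per article.
    """
    cluster = term_distribution([tok for tokens in articles for tok in tokens])
    return sum(jsd(term_distribution(tokens), cluster) for tokens in articles) / len(articles)


def random_baseline(corpus, n, trials=50, seed=0):
    """Estimate the random-cluster JSD for size n by averaging over random subsets."""
    rng = random.Random(seed)
    return sum(mean_article_jsd(rng.sample(corpus, n)) for _ in range(trials)) / trials
```

Under these assumptions, `mean_article_jsd` plays the role of the mean article JSD of a cluster and `random_baseline` the role of the random-cluster estimate (50 trials by default, as in the text); the textual coherence is then the difference between the two as defined above.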

Results & Discussion

Markov Clustering of 2.16 million publications. Our initial experiment was to cluster the 2.16 million publications in our combined dataset as well as separately for the 11 individual year-slices using Markov Clustering (MCL). This experiment resulted in 134,094 clusters for the combined dataset and from 10,000-16,000 clusters for each of the individual year-slices, as shown in Table 1.

Some noteworthy trends are immediately apparent. First, the number of publications increases monotonically each year between 1985 and 1995 and the number of clusters generated by MCL (as well as the average and median size of these clusters) similarly increases for each year. However, while the number of clusters computed on the combined dataset is roughly ten-fold the number of clusters in individual years, the average and median cluster size decrease by roughly two-fold for the combined dataset (Table 1).

Another interesting feature is that except for the combined dataset where the mean conductance is somewhat high (0.70), the MCL clusters have very good coherence values (mean of 0.09) and conductance values (mean of 0.30-0.38). However, the conductance values are of greater interest here than the coherence values, since by measuring citations they more directly assess the likely interactions between authors.

A comparison of conductance and coherence profiles of 1000 MCL clusters from the year 1990 to random sets of publications of the same size is shown in Figure 2 (Materials and Methods). This comparison shows that random subsets of publications have much poorer conductance and coherence values, highlighting that MCL clusters are of very high quality with respect to both criteria.

We also observed that each MCL cluster in the combined dataset mapped well to a single MCL cluster in some year-slice but not as well to any MCL cluster in another year (Figure 4). As an example, cluster [...]

Comparing MCL and Graclus clusters. Our study enables a comparison of MCL and Graclus clusters, as performed on this dataset. Interestingly, despite heavy overlap, the conductance values of MCL clusters often differed from those of Graclus clusters, with MCL clusters usually exhibiting low conductance values (median of 0.032) and Graclus clusters usually exhibiting high conductance (median of 1.0; 25 of 35 Graclus clusters had a conductance of 1.0). We traced this observation to nodes of high degree not being present in matched Graclus clusters.

For example, cluster [...] ∼5% of the cases, and MCL conductance was worse than Graclus conductance in ∼ [...]

Conclusions

This study reveals some interesting trends that suggest directions for future work. First, our research suggests a pipeline to identify possible communities of practice: (1) use convergent clustering, in this case both Markov Clustering (MCL) and Graclus (a spectral clustering method), to cluster the citation network, (2) find those MCL clusters within an appropriate size range that have very low conductance values and high Jaccard coefficients to their matched Graclus clusters, and (3) extract the author community for each article cluster, and filter out those communities that are edge cases (e.g., clusters where only one author is citing the other authors, or where only one author is cited by any other author).

Although we were able to conduct an expert evaluation for only a very small sample of the clusters produced by this pipeline, our experts ranked these clusters highly with respect to thematic relatedness, suggesting that this pipeline could produce article clusters derived from a scientific theme.
As a result, the author communities detected using the pipeline represent likely communities of practice, as was our objective.

| row | match year | size(m) | size(g) | cond(m) | cond(g) | coh(m) | coh(g) | int.edges(m) | int.edges(g) | jc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1988 | 190 | 194 | 0.06 | 1.00 | 0.11 | 0.11 | 189 | 0 | 0.96 |
| 2 | 1990 | 328 | 327 | 0.01 | 1.00 | 0.08 | 0.08 | 327 | 0 | 0.99 |
| 3 | 1993 | 114 | 112 | 0.01 | 0.01 | 0.11 | 0.11 | 113 | 111 | 0.98 |
| 4 | 1988 | 211 | 224 | 0.04 | 1.00 | 0.10 | 0.09 | 210 | 0 | 0.92 |
| 5 | 1986 | 185 | 183 | 0.02 | 1.00 | 0.04 | 0.04 | 184 | 0 | 0.98 |
| 6 | 1989 | 182 | 184 | 0.05 | 1.00 | 0.05 | 0.05 | 181 | 0 | 0.97 |
| 7 | 1993 | 173 | 171 | 0.01 | 1.00 | 0.10 | 0.10 | 172 | 0 | 0.98 |
| 8 | 1995 | 201 | 211 | 0.06 | 1.00 | 0.09 | 0.09 | 200 | 0 | 0.91 |
| 9 | 1985 | 225 | 225 | 0.01 | 1.00 | 0.10 | 0.10 | 224 | 0 | 0.98 |
| 10 | 1986 | 184 | 188 | 0.02 | 1.00 | 0.10 | 0.10 | 183 | 0 | 0.96 |
| 11 | 1986 | 167 | 165 | 0.00 | 1.00 | 0.13 | 0.13 | 166 | 0 | 0.99 |
| 12 | 1990 | 146 | 147 | 0.02 | 1.00 | 0.10 | 0.10 | 145 | 0 | 0.97 |
| 13 | 1995 | 165 | 168 | 0.02 | 1.00 | 0.11 | 0.11 | 164 | 0 | 0.96 |
| 14 | 1993 | 168 | 166 | 0.03 | 1.00 | 0.13 | 0.12 | 167 | 0 | 0.98 |
| 15 | 1992 | 153 | 156 | 0.05 | 1.00 | 0.08 | 0.08 | 152 | 0 | 0.94 |
| 16 | 1991 | 155 | 150 | 0.17 | 1.00 | 0.12 | 0.12 | 154 | 0 | 0.96 |
| 17 | 1985 | 213 | 217 | 0.08 | 1.00 | 0.09 | 0.09 | 212 | 0 | 0.95 |
| 18 | 1987 | 150 | 156 | 0.03 | 1.00 | 0.08 | 0.08 | 149 | 0 | 0.94 |
| 19 | 1993 | 121 | 121 | 0.02 | 0.02 | 0.07 | 0.07 | 120 | 120 | 1.00 |
| 20 | 1987 | 151 | 153 | 0.04 | 1.00 | 0.08 | 0.07 | 150 | 0 | 0.96 |
| 21 | 1994 | 175 | 171 | 0.09 | 1.00 | 0.13 | 0.13 | 174 | 0 | 0.93 |
| 22 | 1991 | 141 | 150 | 0.12 | 1.00 | 0.06 | 0.06 | 140 | 0 | 0.90 |
| 23 | 1994 | 328 | 332 | 0.05 | 1.00 | 0.14 | 0.14 | 327 | 0 | 0.95 |
| 24 | 1988 | 264 | 272 | 0.05 | 1.00 | 0.09 | 0.09 | 263 | 0 | 0.95 |
| 25 | 1988 | 139 | 137 | 0.02 | 1.00 | 0.05 | 0.05 | 138 | 0 | 0.97 |
| 26 | 1988 | 156 | 159 | 0.03 | 1.00 | 0.10 | 0.10 | 155 | 0 | 0.96 |
| 27 | 1994 | 128 | 124 | 0.10 | 1.00 | 0.08 | 0.08 | 127 | 0 | 0.97 |
| 28 | 1994 | 116 | 109 | 0.15 | 0.15 | 0.07 | 0.07 | 115 | 108 | 0.92 |
| 29 | 1993 | 106 | 106 | 0.01 | 0.01 | 0.02 | 0.02 | 175 | 175 | 1.00 |
| 30 | 1993 | 118 | 113 | 0.06 | 0.06 | 0.11 | 0.11 | 118 | 112 | 0.93 |
| 31 | 1994 | 97 | 97 | 0.01 | 0.01 | 0.08 | 0.08 | 96 | 96 | 1.00 |
| 32 | 1992 | 78 | 78 | 0.00 | 0.00 | 0.13 | 0.13 | 77 | 77 | 1.00 |
| 33 | 1990 | 55 | 56 | 0.02 | 0.02 | 0.08 | 0.08 | 54 | 55 | 0.98 |
| 34 | 1987 | 47 | 47 | 0.02 | 0.02 | 0.08 | 0.08 | 46 | 46 | 1.00 |
| 35 | 1990 | 68 | 66 | 0.08 | 0.08 | 0.08 | 0.08 | 67 | 65 | 0.94 |

Table 2: Features of the 35 MCL-Graclus pairs of clusters selected for thematic relatedness. m and g refer to MCL and Graclus respectively. size: number of publications in the cluster; cond: conductance; coh: coherence; int.edges: number of internal edges; jc: Jaccard Coefficient for publications in paired MCL and Graclus clusters. All 35 MCL clusters received the top rating (1) by the evaluators, and the Jaccard Coefficient for MCL-Graclus pairs is greater than 0.9 for all pairs (row 22 shows a Jaccard Coefficient of 0.90, but that is due to rounding). The median conductance of the MCL clusters is 0.032; the median conductance of the Graclus clusters is 1.0 (25 Graclus clusters have a conductance of 1.0). Rows 1, 2, and 3 are discussed further in the text.

We applied a stringent Jaccard coefficient to the article clusters we considered in order to reduce the false discovery rate (FDR). However, clusters with lower Jaccard coefficients also showed comparably low median conductance when we examined the 100 clusters evaluated by humans. For example, the median conductance of MCL clusters was 0.03 when the Jaccard coefficient was greater than 0.9 and 0.08 when the Jaccard coefficient was relaxed to greater than 0.7. These data suggest that relaxing these constraints should be rigorously explored since it may not increase the FDR significantly. This hypothesis is also supported by the high ratings that the experts gave to MCL clusters that had low conductance but lower overlap with their paired Graclus clusters.

Despite these promising results, we are well aware of the limitations of using citation and cluster analysis to identify communities of practice. The best techniques would ideally use expert evaluation, which is unfortunately not scalable (and in our study, we only used expert evaluation for 100 of the clusters we generated). Furthermore, this study only examined immunology publications and others connected to them by citation. Other studies have shown that citation behavior can depend significantly on the field, making extrapolation of trends from one field to another premature (Wallace et al., 2012; Bradley et al., 2020). Thus, the trends in this study may not be consistently found in other research disciplines or timeframes. Our future work will characterize these initial observations, evaluate additional clustering techniques, and focus on elucidating interactions between authors within and across clusters to refine the pipeline we envision.
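Step (2) of the pipeline summarized in these conclusions, selecting MCL clusters by size, conductance, and Jaccard overlap with a matched Graclus cluster, could be sketched roughly as follows. This is an illustration under stated assumptions rather than the study's code: the function names are hypothetical, conductance is computed with the common definition (cut edges divided by the smaller of the two volumes) on citation edges treated as undirected, and the default thresholds simply reuse values mentioned in the text (30–350 publications, conductance of no more than 0.5, Jaccard coefficient greater than 0.9).

```python
def conductance(cluster, edges):
    """Cut edges divided by the smaller volume (assumed definition).

    `edges` is an iterable of (u, v) pairs treated as undirected.
    """
    cluster = set(cluster)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        u_in, v_in = u in cluster, v in cluster
        vol_in += u_in + v_in                # edge endpoints inside the cluster
        vol_out += (not u_in) + (not v_in)   # edge endpoints outside the cluster
        if u_in != v_in:
            cut += 1                         # edge crosses the cluster boundary
    denom = min(vol_in, vol_out)
    # Returning 1.0 for a zero denominator is an arbitrary convention of this sketch.
    return cut / denom if denom else 1.0


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets of article ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_clusters(mcl, graclus, edges, min_size=30, max_size=350,
                    min_jaccard=0.9, max_conductance=0.5):
    """Keep MCL clusters in the size range whose best-matching Graclus
    cluster overlaps strongly and whose conductance is low enough."""
    selected = []
    for cluster in mcl:
        if not (min_size <= len(cluster) <= max_size):
            continue
        best = max(graclus, key=lambda g: jaccard(cluster, g))
        jc = jaccard(cluster, best)
        cond = conductance(cluster, edges)
        if jc > min_jaccard and cond <= max_conductance:
            selected.append((cluster, best, jc, cond))
    return selected
```

For example, on a toy graph of two triangles joined by one edge, `select_clusters([{1, 2, 3}, {4, 5, 6}], [{1, 2, 3}, {4, 5}], edges, min_size=3, max_size=10)` keeps only the first cluster, because the second has a Jaccard coefficient of 2/3 with its best match. The pairwise matching here is quadratic in the number of clusters; a run at the scale reported in this study would need an index from article to cluster.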

Supportive Information

Supplementary material and code used in this study are available on our Github site (Korobskiy et al., 2019).

Acknowledgments

We thank Vladimir Smirnov for helpful discussions on using Markov Clustering. The ERNIE project involves a collaboration with Elsevier. The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or Elsevier. We thank our Elsevier colleagues for their support of the ERNIE project.

Author contributions

Shreya Chandrasekharan: Conceptualization; Methodology; Investigation; Writing - Original Draft; Writing - Review and Editing. Mariam Zaka: Investigation; Writing - Review and Editing. Stephen Gallo: Investigation; Writing - Review and Editing. Tandy Warnow: Conceptualization; Methodology; Writing - Original Draft; Writing - Review and Editing. George Chacko: Conceptualization; Methodology; Investigation; Writing - Original Draft; Writing - Review and Editing; Funding Acquisition; Resources; Supervision.

The authors have no competing interests. Scopus data used in this study was available to us through a collaborative agreement with Elsevier on the ERNIE project. Elsevier personnel played no role in conceptualization, experimental design, review of results, or conclusions presented.

Research and development reported in this publication was partially supported by federal funds from the National Institute on Drug Abuse (NIDA), National Institutes of Health, U.S. Department of Health and Human Services, under Contract Nos. HHSN271201700053C (N43DA-17-1216) and HHSN271201800040C (N44DA-18-1216). Tandy Warnow receives funding from the Grainger Foundation.

Data Availability

Access to the bibliographic data analyzed in this study requires a license from Elsevier. Code generated for this study is freely available from our Github site (Korobskiy et al., 2019).

References

Barber, B. (1961). Resistance by scientists to scientific discovery. Science, 596–602.

Barber, B. (1962). Science and the Social Order. New York: Collier Books.

Boyack, K. W., Newman, D., Duhon, R. J., Klavans, R., Patek, M., Biberstine, J. R., Schijvenaars, B., Skupin, A., Ma, N., & Börner, K. (2011). Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLOS ONE, (3), e18029.

Bradley, J., Devarakonda, S., Davey, A., Korobskiy, D., Liu, S., Lakhdar-Hamina, D., Warnow, T., & Chacko, G. (2020). Co-citations in context: Disciplinary heterogeneity is relevant. Quantitative Science Studies, (1), 264–276.

Crane, D. (1972). Invisible Colleges: Diffusion of Knowledge in Scientific Communities. Chicago: University of Chicago Press.

Cucala, M., Bauerfeind, P., Emde, C., Gonvers, J. J., Koelz, H. R., & Blum, A. L. (1987). Is it wise to prescribe NSAIDs with modern gastroprotective agents? Scandinavian Journal of Rheumatology, (sup65), 141–154.

Devarakonda, S., Korobskiy, D., Warnow, T., & Chacko, G. (2020). Viewing computer science through citation analysis: Salton and Bergmark Redux. Scientometrics. https://doi.org/10.1007/s11192-020-03624-0.

Dhillon, I., Guan, Y., & Kulis, B. (2007). Weighted graph cuts without eigenvectors: A multilevel approach. In IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 29:11, (pp. 1944–1957). ACM Press.

Dongen, S. (2000). A cluster algorithm for graphs. CWI (Centre for Mathematics and Computer Science). Accessed May 2020. URL https://micans.org/mcl/src/mcl-05-090.tar.gz

Emmons et al. (2016). PLoS ONE, (7), e0159161.

Green, D. (1965). Information Exchange Group No. 1. Science, 143.

Hagstrom, W. O. (1965). The Scientific Community. New York: Basic Books.

Harry, G., et al. (2010). Advanced LIGO: the next generation of gravitational wave detectors. Classical and Quantum Gravity, (8), 084006.

Hull, D. L. (1988). Science as a Process. Chicago, USA: University of Chicago Press.

Klavans, R., & Boyack, K. (2017). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? Journal of the Association for Information Science and Technology, 984–998.

Korobskiy, D., Davey, A., Liu, S., Devarakonda, S., & Chacko, G. (2019). Enhanced Research Network Informatics Environment (ERNIE). Github repository, NET ESolutions Corporation. https://github.com/NETESOLUTIONS/ERNIE.

Lander, E., et al. (2001). Initial sequencing and analysis of the human genome. Nature, (6822), 860–921.

Lave, J., & Wagner, E. (1991). Situated Learning: Legitimate Peripheral Participation. Cambridge, UK: Cambridge University Press.

Merton, R. K. (1957). Social Theory and Social Structure. Glencoe, IL: Free Press.

Michel, B., Hunder, G., Bloch, D., & Calabrese, L. (1992). Hypersensitivity vasculitis and Henoch-Schönlein purpura: a comparison between the 2 disorders. The Journal of Rheumatology, (5), 721–728.

Milojevic, S. (2019). Practical method to reclassify Web of Science articles into unique subject categories and broad disciplines. Quantitative Science Studies, (pp. 1–24). 10.1162/qss_a_00014.

Price, D. d. S. (1963). Little science, big science. New York: Columbia Univ. Press.

Price, D. d. S. (1965). Networks of Scientific Papers. Science, 510–515.

Price, D. d. S., & Beaver, D. D. (1966). Collaboration in an invisible college. American Psychologist, (11), 1011–1018.

Shu, F., Julien, C.-A., Zhang, L., Qiu, J., Zhang, J., & Larivière, V. (2019). Comparing journal and paper level classifications of science. Journal of Informetrics, (1), 202–225.

Shun, J., Roosta-Khorasani, F., Fountoulakis, K., & Mahoney, M. W. (2016). Parallel Local Graph Clustering. Proc. VLDB Endow., (12), 1041–1052.

Small, H., & Sweeney, E. (1985). Clustering the science citation index using co-citations. Scientometrics, (3), 391–409.

The Royal Society (2020). History of The Royal Society. https://royalsociety.org/about-us/history, accessed July 2020.

Wagner, C. (2008). The New Invisible College: Science for Development. Washington DC: Brookings Institution Press.

Wallace, M. L., Lariviere, V., & Gingras, Y. (2012). A Small World of Citations? The Influence of Collaboration Networks on Citation Practices. PLOS One, e33339.

Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, (12), 2378–2392.

White, J., Herman, A., Pullen, A. M., Kubo, R., Kappler, J. W., & Marrack, P. (1989). The Vβ-specific superantigen staphylococcal enterotoxin B: Stimulation of mature T cells and clonal deletion in neonatal mice. Cell, (1), 27–35.

Figure 1: Workflow to detect communities. Year-slices (center of schematic) were generated from immunology articles in Scopus for the years 1985-1995 (Materials and Methods). The citation data in these year-slices were de-duplicated and combined to construct the combined dataset (right side of schematic). MCL clustering was performed on each year-slice as well as the combined dataset. This produced a set of 134,094 clusters. We then restricted attention to those clusters containing between 30 and 350 publications, which resulted in 16,909 clusters. From these, a sample of 100 clusters was provided to two evaluators, who rated 77 of them as strongly themed. In parallel, each of the 11 year-slices was clustered with MCL and with Graclus. Every cluster from the set of 16,909 was matched, using overlap as the matching criterion, to a single MCL cluster from all clusters generated in the 11 year-slices. Each of these matched MCL clusters from the year-slices was also matched to a Graclus cluster from the same year. Convergence was measured using the intersection/union ratio (Jaccard Coefficient) between the members of a pair of clusters. The 77 clusters selected by the evaluators were further constrained to those with a Jaccard Coefficient greater than 0.9 for their corresponding MCL-Graclus pair; this constraint resulted in 35 clusters. Authors of the publications in each of these clusters were analyzed for the purpose of community inference.

[Figure 2 plot: coherence and conductance (y-axis, 0.00 to 0.20) by group (x-axis: 1, 2).]

Figure 2: Conductance and coherence profiles of 1,000 MCL clusters (group 1) compared to random clusters (group 2), showing that MCL clusters have lower conductance and increased coherence compared to random clusters of the same size. The 1990 immunology year-slice was either clustered directly (x-axis: group 1) or subjected to 1 million citation-shuffling operations (group 2) and then clustered, in both cases using MCL-edge software with an expansion parameter setting of 2 and an inflation parameter setting of 2.0. From each of the resultant datasets, a sample of 1,000 clusters of size 30–350 publications was randomly selected and analyzed for conductance and coherence.
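As a rough illustration of the conductance indicator compared in Figure 2, the sketch below computes conductance for one cluster on a toy graph. It assumes the common definition (edges leaving the cluster divided by the smaller of the two degree volumes) and treats citations as undirected edges; the exact definition used in the study may differ, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def conductance(edges, cluster):
    """Conductance of `cluster`, treating each citation as an undirected edge.

    Assumed definition: cut(S, rest) / min(vol(S), vol(rest)), where vol is
    the sum of node degrees. Lower values mean a better-separated cluster.
    """
    cluster = set(cluster)
    degree = defaultdict(int)
    cut = 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if (u in cluster) != (v in cluster):
            cut += 1  # edge crosses the cluster boundary
    vol_in = sum(d for node, d in degree.items() if node in cluster)
    vol_out = sum(d for node, d in degree.items() if node not in cluster)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Toy graph: two triangles joined by a single edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]

tight = conductance(edges, {"a", "b", "c"})  # one triangle
loose = conductance(edges, {"a", "d"})       # arbitrary pair of nodes
```

On this toy graph the triangle has much lower conductance than the arbitrary pair, which is the same qualitative contrast the figure reports between MCL clusters and random clusters.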

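The size filter and convergence criterion described in the Figure 1 caption can be sketched as follows. This is a minimal sketch, not the authors' pipeline: it simplifies the workflow to a single MCL-to-Graclus matching step, and the function names, cluster identifiers, and toy data are all hypothetical. Only the 30–350 size window, overlap-based matching, and the Jaccard threshold of 0.9 come from the caption.

```python
def jaccard(a, b):
    """Intersection/union ratio between two sets of publication ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def size_filter(clusters, lo=30, hi=350):
    """Keep clusters containing between `lo` and `hi` publications."""
    return {cid: m for cid, m in clusters.items() if lo <= len(m) <= hi}


def best_overlap(members, candidates):
    """Id of the candidate cluster sharing the most members, or None."""
    best_id, best_n = None, 0
    for cid, cand in candidates.items():
        n = len(members & cand)
        if n > best_n:
            best_id, best_n = cid, n
    return best_id


def convergent(mcl, graclus, threshold=0.9):
    """Map each MCL cluster to its best Graclus match if Jaccard > threshold."""
    kept = {}
    for mid, members in mcl.items():
        gid = best_overlap(members, graclus)
        if gid is None:
            continue
        score = jaccard(members, graclus[gid])
        if score > threshold:
            kept[mid] = (gid, score)
    return kept


# Toy data: cluster members are sets of integer publication ids.
mcl = {
    "m1": set(range(40)),        # nearly identical to g1
    "m2": set(range(100, 140)),  # only half-covered by g2
    "m3": set(range(200, 210)),  # too small, removed by the size filter
}
graclus = {
    "g1": set(range(40)) | {999},
    "g2": set(range(100, 120)),
}

result = convergent(size_filter(mcl), graclus)
```

Here only `m1` survives: `m3` falls outside the size window and `m2` has a Jaccard coefficient of 0.5 with its best match.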
Supplementary material is available on our GitHub site (Korobskiy et al., 2019).

[Figure: histograms of cluster size (x-axis: cluster_size, 0 to 300; y-axis: count, 0 to 10,000). Panel titles: grac_5284, grac_2391, grac_10568, grac_7926; mcl, grac_5284.]