Frequently Co-cited Publications: Features and Kinetics
Sitaram Devarakonda, James Bradley, Dmitriy Korobskiy, Tandy Warnow, George Chacko
Netelabs, NET ESolutions Corporation, McLean, VA
Raymond Mason School of Business, Coll. of William & Mary, Williamsburg, VA
Department of Computer Science, Univ. of Illinois, Urbana-Champaign, IL
Corresponding author: George Chacko
May 12, 2020
arXiv [cs.DL]

Abstract
Co-citation measurements can reveal the extent to which a concept representing a novel combination of existing ideas evolves towards a specialty. The strength of co-citation is represented by its frequency, which accumulates over time. Of interest is whether underlying features associated with the strength of co-citation can be identified. We use the proximal citation network for a given pair of articles (x, y) to compute θ, an a priori estimate of the probability of co-citation between x and y, prior to their first co-citation. Thus, low values for θ reflect pairs of articles for which co-citation is presumed less likely. We observe that co-citation frequencies are a composite of power-law and lognormal distributions, and that very high co-citation frequencies are more likely to be composed of pairs with low values of θ, reflecting the impact of a novel combination of ideas. Furthermore, we note that the occurrence of a direct citation between two members of a co-cited pair increases with co-citation frequency. Finally, we identify cases of frequently co-cited publications that accumulate co-citations after an extended period of dormancy.

Introduction
Co-citation, “the frequency with which two documents from the earlier literature are cited together in the later literature”, was first described in 1973 [34, 50]. As noted by [50], co-citation patterns differ from bibliographic coupling patterns [28] but align with patterns of direct citation, and frequently co-cited publications must have high individual citations.

Co-citation has been the subject of further study and characterization, for example, comparisons to bibliographic coupling and direct citation [6], the study of invisible colleges [24, 41], construction of networks by co-citation [52, 53], evaluation of clusters in combination with textual analysis [8], textual similarity at the article and other levels [14], and the fractal nature of publications aggregated by co-citations [58].

Co-citations provide details of the relationship between key (highly cited) ideas, and changes in co-citation patterns over time may provide insight into the mechanism by which new schools of thought develop. Implicit in the definition of co-citation is the novel combination of existing ideas, but only some frequently co-cited article pairs reflect surprising combinations. For example, two publications presenting the leading methods for the same computational problem may be highly co-cited, but this does not reflect a novel combination of ideas. Similarly, two publications describing methods that often constitute part of the same workflow may be highly co-cited, but these co-citations are also not surprising. On the other hand, for two articles in different fields, frequent co-citation is generally unexpected.

Novel, atypical, or otherwise unusual combinations of co-cited articles have been explored at the journal level [65, 10, 7, 57]. However, journal-level classifications have limited resolution relative to article-level studies, which may better represent the actual structure and aggregations of the scientific literature [49, 29, 63, 37, 26].
Accordingly, we sought to discover measurable characteristics of frequently co-cited publications from an article-level perspective.

To study frequently co-cited articles, we have developed a novel graph-theoretic approach that reflects the citation neighborhood of a given pair of articles. In seeking to determine the degree to which a co-cited pair of papers represented a surprising combination, we wished to avoid journal-based field classifications, which present challenges. Instead, we attempted to use citation history to produce an estimate of the probability that a given pair of publications (x, y) would be co-cited. Because we focus on the activity before the pair is first co-cited, the observed “probability” of co-citation is zero by definition: there are no co-citations yet. Hence, we approximated co-citation probabilities: we treat an article that cites one member of a co-cited pair and also cites at least one article that cites the other member as a proxy for co-citation. Specifically, given a pair of publications x, y, we construct a directed bipartite graph whose vertex set contains all publications that cite either x or y prior to their first co-citation. We then compute θ, a normalized count of such proxies, and use it to predict the probability of co-citation between x and y. This approach enables an evaluation that is specific to the given pair of articles, and does so without substantial computational cost, while avoiding definitions of disciplines derived from journals or having to measure disciplinary distances.

To support our analysis, we constructed a dataset of articles from Scopus [19] that were published in the eleven-year period 1985-1995, and extracted the cited references in these articles.
Recognizing that frequently co-cited publications must derive from highly cited publications [50], we identified those reference pairs (33.6 million pairs) for each article in the dataset that are drawn from the top 1% most cited articles in Scopus and measured their frequency of co-citation.

To investigate which statistical distributions might best describe the co-citation frequencies in these 33.6 million co-cited pairs, we reviewed prior work on distributions of citation frequency [47, 20, 44, 45, 40, 64, 55, 56, 48]. This research has fit the frequency distribution of citation strength sometimes to a power law distribution and other times to a lognormal distribution. A graph of the analogous co-citation data suggests that power law or lognormal distributions are candidates for describing co-citation strength as well, and we investigated that conjecture accordingly. Interestingly, [38] notes that the debate over the appropriateness of power law versus lognormal distributions is not confined to bibliometrics, but has been at issue in many disciplines and contexts.

To study how the best-fit distributional function and parameters for co-citation might vary with θ, we stratified co-citation frequency data. We also measured whether a direct link exists between two members of a co-cited pair (i.e., whether one member of a pair cites the other) and how this property is related to co-citation frequencies. We find that the distribution of co-citation frequencies varies with θ and that a power law distribution fits co-citation frequencies more often when θ is small, whereas a lognormal distribution fits more often for large θ.

A pertinent aspect of co-citation is the rate at which frequencies accumulate. While citation dynamics of individual publications have been fairly well studied by others, for example [62, 20], the dynamics of co-cited articles are less well studied.
Our interest was the special case analogous to the Sleeping Beauty phenomenon [59, 27], which may reflect delayed recognition of scientific discovery and the causes attributed to it [35, 21, 22, 15, 2, 23]. Thus, we also identified co-cited pairs that featured a period of dormancy before accumulating co-citations.

Materials & Methods
Data
Citation counts were computed for all Scopus articles (88,639,980 records) updated through December 2019, as implemented in the ERNIE project [30]. Records with corrupted or missing publication years, or classified as ‘dummy’ by the vendor, were then removed, resulting in a dataset of 76,572,284 publications. Hazen percentiles of citation counts, grouped by year of publication, were calculated for these data [4]. The top 1% of highly cited publications from each year were combined into a set of highly cited publications consisting of 768,993 publications.

Publications of type ‘article’, each containing at least five cited references and published in the 11-year period 1985-1995, were selected from Scopus to form a dataset of 3,394,799 publications and 51,801,106 references (8,397,935 unique). For each of these publications, all possible reference pairs were generated and then restricted to those pairs where both members were in the set of highly cited publications (above).

For example, the data for 1985 consisted of 223,485 articles after processing as described above. Computing all reference pairs (that were also members of the highly cited publication set of 768,993) from these 223,485 articles gave rise to 2,600,101 reference pairs (Table 1) that ranged in co-citation frequency from 1 to 874 within the 1985 dataset; from 1 to 11,949 across the 11-year period 1985-1995; and from 1 to 35,755 across all of Scopus. Collectively, the publications in our 1985-1995 dataset generated 33,641,395 unique co-citation pairs, for which we computed co-citation frequencies across all of Scopus.
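The pair-generation step described above can be illustrated with a minimal Python sketch. It is not the ERNIE implementation: the function name `cocitation_counts`, the `min_refs` parameter, and the toy identifiers are assumptions made for illustration, and the real computation runs against Scopus rather than in-memory lists.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists, highly_cited, min_refs=5):
    """Count how often each pair of highly cited references is cited together.

    reference_lists: iterable of reference-ID lists, one list per citing article.
    highly_cited:    set of reference IDs in the top 1% set.
    min_refs:        citing articles with fewer references are skipped
                     (the paper uses five; hypothetical parameter name).
    """
    counts = Counter()
    for refs in reference_lists:
        if len(refs) < min_refs:
            continue
        # Keep only highly cited references; sort so each pair has one canonical order.
        kept = sorted(set(refs) & highly_cited)
        counts.update(combinations(kept, 2))
    return counts


# Toy data: four citing articles, three highly cited references.
articles = [
    ["p1", "p2", "p3", "p9"],
    ["p2", "p1", "p7"],
    ["p3", "p1", "p2"],
    ["p1", "p2"],  # too few references for min_refs=3, so it is skipped
]
pairs = cocitation_counts(articles, {"p1", "p2", "p3"}, min_refs=3)
for pair, freq in sorted(pairs.items()):
    print(pair, freq)
```

Sorting the retained references before pairing means (p1, p2) and (p2, p1) are counted as the same co-cited pair.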
Figure 1: The workflow we used to generate a dataset of 33,641,395 co-cited pairs of publications from references cited by articles in Scopus published in the years 1985-1995. [Flowchart steps: count citations for all publications in Scopus; group by year of publication; generate Hazen percentiles; select the top 1% from each year. Select publications of type article, published in 1985-1995, with at least 5 cited references; generate pairwise combinations of all references within an article for all selected articles; restrict to pairwise combinations of references belonging to the top 1% cited articles; combine across 1985-1995 and compute co-citation frequencies, giving 33.6 million co-cited pairs.]

Table 1: Summary of Analyzed Data. Publications of type article that had at least five cited references indexed in Scopus were selected from the eleven years, 1985-1995. All possible reference pairs were generated for the cited references of these articles and then restricted to those pairs where both members were in the set of 768,993 highly cited publications. The column Co-cited Pairs shows the number of pairs in each year after the restriction was applied.

Year   Articles   References   Co-cited Pairs
1985   223,485    1,796,502    2,600,101
1986   238,096    1,920,225    2,840,557
1987   250,575    2,037,654    3,180,261
1988   269,219    2,182,571    3,406,902
1989   285,873    2,303,481    3,793,986
1990   305,010    2,490,909    4,546,915
1991   325,782    2,662,005    5,039,334
1992   343,239    2,846,607    5,622,164
1993   360,916    3,006,374    6,121,147
1994   387,062    3,228,240    7,022,499
1995   405,503    3,432,228    7,626,684

Derivation of θ
We now show how we define our prior on the probability of x and y being co-cited, based on the citation graph restricted to publications that cite either x or y (but not both) up to the year of their first co-citation. Recall that we defined a proxy co-citation of x and y to be an article that cites one member of the co-cited pair (x, y) and also cites at least one article that cites the other member. The idea behind this definition is that we consider papers that cite x as proxies for x, and papers that cite y as proxies for y. Thus, if a paper a cites both x and y′ (where y′ is a proxy for y), then it is a proxy for a co-citation of x and y. Similarly, if a paper b cites both y and x′ (where x′ is a proxy for x), it is also a proxy for a co-citation of x and y. This motivates the graph-theoretic formulation, which we now formally present.

We fix the pair x, y and we define N(x) to be the set of all publications that cite x (but do not also cite y), and are published no later than the year of the first co-citation of x and y. We similarly define N(y). We define a directed bipartite graph with vertex set N(x) ∪ N(y). Note that if x cites y then x ∈ N(y), and similarly for the case where y cites x. Note also that, since we have restricted N(x) and N(y) in this way, N(x) ∩ N(y) = ∅. We now describe how the directed edge set E(x, y) is constructed. For any pair of articles a, b where a ∈ N(x) and b ∈ N(y), if a cites b then we include the directed edge a → b in E(x, y). Similarly, we include edge b → a if b cites a.
Finally, if a pair of articles both cite each other, then the graph has parallel edges. By construction, this graph is bipartite, which means that all the edges go between the two sets N(x) and N(y) (i.e., no edges exist between two vertices in N(x), nor between two vertices in N(y)).

Note that by the definition, every edge in E(x, y) arises because of a proxy co-citation, so that the number of proxy co-citations is the number of directed edges in E(x, y). Consider the situation where a publication a cites x (so that a ∈ N(x)) and also cites $b_1, b_2, b_3$ in N(y): this defines three directed edges from a to nodes of N(y). We count this as three proxy co-citations, not as one proxy co-citation. Similarly, if we have a publication b that cites y and also cites $a_1, a_2, a_3, a_4$ in N(x), then there are four directed edges that go from b to nodes in N(x) and we will count each of those directed edges as a different proxy co-citation.

Accordingly, letting |X| denote the cardinality of a set X, we note that |E(x, y)|, i.e., the number of directed edges that go between N(x) and N(y), is the number of proxy co-citations between x and y. If no parallel edges are permitted, the maximum number of possible proxy co-citations is |N(x)| × |N(y)|. Under the assumption that N(x) and N(y) each have at least one article, we define θ(x, y), our prior on the probability of x and y being co-cited, as follows:

\[ \theta(x, y) = \frac{|E(x, y)|}{|N(x)| \times |N(y)|}. \]

Note that if parallel edges do not occur in the graph, then θ(x, y) ≤ 1, but that otherwise the value can be greater than 1. Note also that θ(x, y) = 0 if E(x, y) = ∅ (i.e., if there are no proxy co-citations) and that θ(x, y) = 1 if every possible proxy co-citation occurs.

To efficiently calculate θ, we used the following pipeline.
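Independently of the production pipeline, the definition of θ(x, y) can be illustrated with a small in-memory sketch in Python. The function name `theta`, its arguments, and the toy citation data are hypothetical; the sketch assumes the year of first co-citation is already known and that every citing paper has a publication year.

```python
def theta(x, y, cites, year, first_cocitation_year):
    """Estimate theta(x, y) = |E(x, y)| / (|N(x)| * |N(y)|).

    cites: dict mapping each paper to the set of papers it cites.
    year:  dict mapping each citing paper to its publication year.
    """
    def eligible(p):
        return year[p] <= first_cocitation_year

    # N(x): papers citing x but not y; N(y): papers citing y but not x.
    n_x = {p for p, refs in cites.items() if x in refs and y not in refs and eligible(p)}
    n_y = {p for p, refs in cites.items() if y in refs and x not in refs and eligible(p)}
    if not n_x or not n_y:
        raise ValueError("theta is defined only when N(x) and N(y) are non-empty")

    # Each directed citation between the two sets is one proxy co-citation.
    edges = 0
    for a in n_x:
        for b in n_y:
            edges += b in cites.get(a, ())  # edge a -> b
            edges += a in cites.get(b, ())  # edge b -> a
    return edges / (len(n_x) * len(n_y))


# Toy example: a1 and a2 cite x; b1 and b2 cite y; c co-cites x and y in 1995.
cites = {
    "a1": {"x", "b1"},
    "a2": {"x"},
    "b1": {"y"},
    "b2": {"y", "a2"},
    "c": {"x", "y"},
}
year = {"a1": 1994, "a2": 1993, "b1": 1993, "b2": 1994, "c": 1995}
print(theta("x", "y", cites, year, first_cocitation_year=1995))  # 2 edges / (2 * 2)
```

The sketch returns the raw ratio, so a pair with mutually citing neighbors can exceed 1, as noted above.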
We copied the Scopus citation graph from a relational schema in PostgreSQL into the Neo4j 3.5 graph database using an automated Extract Transform Load (ETL) pipeline that combined Postgres CSV export and the Neo4j Bulk Import tool. The graph vertex set is all publications, each with a publication year attribute, and the edge set is all citations between the publications. A Cypher index was created on the publication year. We developed Cypher queries to calculate θ and tuned performance by splitting input publication pairs into small batches and processing them in parallel, using parallelization in Bash and GNU Parallel. Batch size, the number of parallel job slots, and other parameters were tuned for performance, with best results achieved on batch sizes varying from 20 to 100 pairs. The results of θ calculations were cross-checked using SQL calculations. In the small number of cases where θ computed to a value greater than 1 (see above), it was set to 1 for the purpose of this study.

Statistical Calculations
We denote the observed co-citation frequency data by the multiset $X^o = \{x^o_1, \ldots, x^o_N\}$, where N is the total number of pairs of articles and $x^o_i$ is the observed frequency of the i-th pair of papers being co-cited. Note that this is in general a multiset, as different pairs of articles can have the same co-citation frequency. Let n(x) be the number of times that x appears in $X^o$ (equivalently, n(x) is the number of pairs of articles that are co-cited x times), and let $N(\underline{x}) = \sum_{y=\underline{x}}^{\infty} n(y)$ denote the total number of pairs of articles that are co-cited at least $\underline{x}$ times. Then

\[ f^o(x \mid x \ge \underline{x}) = \frac{n(x)}{N(\underline{x})} \quad \text{for } x \in [\underline{x}, \infty), \tag{1} \]

where $\underline{x}$ is a parameter we use to analyze the distribution's right tail starting at varying frequencies. We describe in this subsection (i) the statistical computations for fitting lognormal and power law distributions to right tails of the observed co-citation frequency distributions as defined by (1) for various $\underline{x}$ and (ii) how we assessed the quality of those fits. Further, we performed such analyses for various slices of the data, stratifying by θ and other parameters, as is described in the Results section.

We used a discrete version of a lognormal distribution to represent integer co-citation frequencies, $f_{LN}(\cdot)$, following [55] and [56], while appropriately normalizing for our conditional assessment of the right tail commencing at $\underline{x}$:

\[ f_{LN}(x \mid \mu, \sigma, \underline{x}) = \frac{\tilde{f}(x \mid \mu, \sigma)}{\sum_{n=\underline{x}}^{\infty} \tilde{f}(n \mid \mu, \sigma)} \quad \text{for } x \ge \underline{x}, \tag{2} \]

\[ \tilde{f}(x \mid \mu, \sigma) = \int_{x-0.5}^{x+0.5} \frac{dq}{q \sqrt{2\pi}\, \sigma} \exp\left( -\frac{(\ln q - \mu)^2}{2\sigma^2} \right), \]

where µ and σ are the mean and standard deviation, respectively, of the underlying normal distribution. These probabilities can be computed with the cumulative normal distribution,

\[ \tilde{f}(x \mid \mu, \sigma) = \Phi\left( \frac{\ln(x + 0.5) - \mu}{\sigma} \right) - \Phi\left( \frac{\ln(x - 0.5) - \mu}{\sigma} \right). \]

We fit this distribution for various $\underline{x}$ using a maximum (log) likelihood estimator (MLE).
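As a hedged illustration of equation (2), the following Python sketch evaluates the discrete lognormal tail probability and the corresponding log-likelihood. It is not the authors' C++ code: the function names are hypothetical, and two implementation choices are assumptions of this sketch, namely computing the normal tail with `math.erfc` for numerical accuracy and using the fact that the normalizing sum telescopes to a single tail probability.

```python
import math


def normal_sf(z):
    """Upper-tail probability 1 - Phi(z) of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def f_tilde(x, mu, sigma):
    """Lognormal mass on [x - 0.5, x + 0.5], i.e. Phi(z_hi) - Phi(z_lo)."""
    z_lo = (math.log(x - 0.5) - mu) / sigma
    z_hi = (math.log(x + 0.5) - mu) / sigma
    return normal_sf(z_lo) - normal_sf(z_hi)


def lognormal_tail_pmf(x, mu, sigma, x_min):
    """Equation (2): f_tilde renormalized over integers x >= x_min (x_min >= 1)."""
    if x < x_min:
        return 0.0
    # The sum of f_tilde(n) over n >= x_min telescopes to 1 - Phi(z(x_min - 0.5)).
    tail = normal_sf((math.log(x_min - 0.5) - mu) / sigma)
    return f_tilde(x, mu, sigma) / tail


def log_likelihood(freqs, mu, sigma, x_min):
    """Log-likelihood of the observed frequencies that lie in the right tail."""
    return sum(
        math.log(lognormal_tail_pmf(x, mu, sigma, x_min))
        for x in freqs
        if x >= x_min
    )


# Illustrative parameter values only; they are not fitted values from the paper.
print(lognormal_tail_pmf(10, mu=3.0, sigma=1.0, x_min=10))
print(log_likelihood([10, 12, 40, 51567], mu=3.0, sigma=1.0, x_min=10))
```

An optimizer would then search over µ and σ to maximize `log_likelihood` for each choice of the right-tail cutoff.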
We solved for the best-fit distributional parameters for the lognormal distribution, µ and σ, by modifying a multi-dimensional interval search algorithm from [43] and following [56]. A compiled version of this code using the C++ header file, amoeba.h, is available on our Github site [30].

We fit a discrete power law distribution to the data for various values of $\underline{x}$, which was normalized for our conditional observations of the right tail:

\[ f_{PL}(x \mid \alpha, \underline{x}) = \frac{x^{-\alpha}}{\zeta(\alpha, \underline{x})} \quad \text{for } x \ge \underline{x}, \tag{3} \]

where the Hurwitz zeta function,

\[ \zeta(\alpha, \underline{x}) = \sum_{x=0}^{\infty} \frac{1}{(x + \underline{x})^{\alpha}}, \]

is a generalization of the Riemann zeta function, ζ(α, 1), as is needed for analysis of the right tail.

We solved first-order conditions for the (log) MLE to find the best-fit distributional exponent α,

\[ \frac{\zeta'(\alpha, \underline{x})}{\zeta(\alpha, \underline{x})} = -\frac{1}{N(\underline{x})} \sum_{x \in X^o(\underline{x})} \ln x, \tag{4} \]

as described in [13] and [25], where $X^o(\underline{x}) = \{x \in X^o : x \ge \underline{x}\}$ are the observed co-citations with frequencies at least as great as $\underline{x}$ and $N(\underline{x})$ is the number of such co-citations. We solved (4) to find α using a bisection algorithm.

We used the χ² goodness of fit (χ²) and the Kolmogorov-Smirnov (K-S) tests to assess the null hypothesis that the distribution of the observed co-citation frequencies and the best-fit lognormal distribution are the same, and similarly for the best-fit power law distribution. We also computed the Kullback-Leibler Divergence (K-L) between the observed data and the best-fit distributions.

Both the χ² and K-S tests employed the null hypothesis that the observed co-citation frequencies, n(x) for $x \in [\underline{x}, \infty)$, were sampled from the best-fit lognormal or power law distributions, which we denote by $f_d(\cdot \mid \underline{x})$ for d ∈ {LN, PL}, while suppressing the parameters specific to each of the distributions.

The usual χ² statistic was computed by, first, grouping each of the observed co-citation frequencies into k bins, denoted by $b_i$ for i ∈ {1, ..., k}, and then computing

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \]

where $O_i$ is the observed number of co-citations having frequencies associated with the i-th bin,

\[ O_i = \sum_{x \in b_i} n(x), \]

and $E_i$ is the expected number of observations for frequencies in bin i, if the null hypothesis were true, in a sample with size equal to the number of observed data points, $N(\underline{x})$:

\[ E_i = \sum_{x \in b_i} f_d(x \mid \underline{x})\, N(\underline{x}). \]

If the null hypothesis were true, then we would expect $O_i$ and $E_i$ to be approximately equal, with deviations owing to variability due to sampling.

Constructing the bins $b_i$ requires only that $E_i \ge 5$ for every i = 1, ..., k. Test outcomes are sometimes sensitive to the minimum $E_i$ permitted, which we will denote by $\underline{E}$, and so we tested with multiple thresholds, including 10, 20, 50, and 70. Furthermore, statistical tests are stochastic: these multiple tests permitted a reduction in the probability of erroneously rejecting or accepting the null hypothesis based on a single test. The distribution of observed co-citation frequencies was skewed right with a long tail, so that aggregating bins to satisfy $E_i \ge \underline{E}$ was most critical in the right tail. This motivated a bin construction algorithm that aggregated frequencies in reverse order, starting with the extreme right tail. Algorithm 1 requires a set of the unique observed co-citation frequencies, $\hat{X}^o$, which includes the elements of the multiset $X^o$ without repetition. While Algorithm 1 does not guarantee in general that all bins satisfy $E_i \ge \underline{E}$, that criterion was satisfied for the observed data.

Algorithm 1 Frequency Bin Construction
    i ← 1
    b_1 ← {}
    while |X̂^o| > 0 do
        b_i ← b_i ∪ {max(X̂^o)}
        X̂^o ← X̂^o \ {max(X̂^o)}
        if E_i ≥ E̲ then
            i ← i + 1
            b_i ← {}
        end if
    end while

We implemented a K-S test using simulation to generate a sampling distribution to account for the discrete frequency observations [54]. We denote the cumulative distribution of observed co-citation frequencies by $F^o(x \mid \underline{x}) = \sum_{i=\underline{x}}^{x} f^o(i \mid \underline{x})$, and the best-fit cumulative distribution by $F_d(x \mid \underline{x}) = \sum_{i=\underline{x}}^{x} f_d(i \mid \underline{x})$. The K-S test involves testing the maximum absolute difference between the observed and theorized cumulative distributions,

\[ D_n = \max_x \left| F^o(x \mid \underline{x}) - F_d(x \mid \underline{x}) \right|, \]

where n is the number of observations giving rise to $F^o(x \mid \underline{x})$, against the distribution of such differences between samples from the theorized distribution with the same number of observations, n,

\[ \tilde{D}_n = \max_x \left| \tilde{F}_{d,1}(x \mid \underline{x}) - \tilde{F}_{d,2}(x \mid \underline{x}) \right|, \]

where $\tilde{F}_{d,j}(x \mid \underline{x})$ is the empirical distribution of sample j of size n (notation suppressed) drawn from $F_d(x \mid \underline{x})$. We generated 100 such random variables $\tilde{D}_n$ for each test. We reject the null hypothesis if $D_n$ is larger than substantially all of the $\tilde{D}_n$, say all but 5%, for equivalence with a p-value of 0.05. The number of $\tilde{D}_n$ samples drawn yields a p-value with a resolution of 1%.

We computed the K-L Divergence two ways due to its asymmetry:

\[ D_{K\text{-}L}(f^o \,\|\, f_d) = \sum_{x=\underline{x}}^{\infty} f^o(x \mid \underline{x}) \ln \frac{f^o(x \mid \underline{x})}{f_d(x \mid \underline{x})}, \]

\[ D_{K\text{-}L}(f_d \,\|\, f^o) = \sum_{x=\underline{x}}^{\infty} f_d(x \mid \underline{x}) \ln \frac{f_d(x \mid \underline{x})}{f^o(x \mid \underline{x})}. \]

Separate from the tests above, we used a χ² test of the null hypothesis that the co-citation frequency distribution was independent of θ.
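The simulation-based K-S procedure described earlier in this subsection can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the fitted distribution is passed in as a dictionary `pmf` over a finite (truncated) support whose probabilities sum to one, and every observed value is assumed to lie in that support.

```python
import random
from itertools import accumulate


def ecdf(sample, support):
    """Empirical CDF of `sample`, evaluated at each point of the sorted support."""
    counts = {x: 0 for x in support}
    for s in sample:
        counts[s] += 1
    n = len(sample)
    return [c / n for c in accumulate(counts[x] for x in support)]


def max_gap(cdf_a, cdf_b):
    """Maximum absolute difference between two CDFs on the same support."""
    return max(abs(a - b) for a, b in zip(cdf_a, cdf_b))


def ks_simulated_p_value(observed, pmf, n_sim=100, seed=0):
    """Return (D_n, p) where p is the share of simulated gaps that reach D_n."""
    rng = random.Random(seed)
    support = sorted(pmf)
    weights = [pmf[x] for x in support]
    model_cdf = list(accumulate(weights))
    n = len(observed)

    # D_n: observed data against the fitted distribution.
    d_obs = max_gap(ecdf(observed, support), model_cdf)

    # Reference distribution: gaps between two samples of size n from the model.
    exceed = 0
    for _ in range(n_sim):
        s1 = rng.choices(support, weights=weights, k=n)
        s2 = rng.choices(support, weights=weights, k=n)
        if max_gap(ecdf(s1, support), ecdf(s2, support)) >= d_obs:
            exceed += 1
    return d_obs, exceed / n_sim


# Toy check: data concentrated on one value against a uniform model on 10..19.
uniform = {x: 0.1 for x in range(10, 20)}
d, p = ks_simulated_p_value([10] * 200, uniform)
print(d, p)  # a large gap and a small p-value, so the model is rejected
```

With `n_sim=100` the returned p-value has a resolution of 1%, matching the number of simulated statistics described above.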
We initially created a contingency table on θ and co-citation frequency using five bins for θ and, to accommodate the skewed distributions, logarithmic bins for frequency with lower bounds of 10, 100, 1000, and 10000. Because the number of observations decreases as θ and frequency increase, we subsequently aggregated the frequency bins into just two intervals, with lower bounds of 10 and 100, so that the expected number of co-citations in each bin was equal to or greater than 5.

Kinetics of Co-citation
We extended prior work on delayed recognition and the Sleeping Beauty phenomenon [27, 59, 33, 23] towards co-citation. We have modified the beauty coefficient (B) of [27] to address co-citations by: (i) counting citations to a pair of publications (co-citations) rather than citations to individual papers, (ii) setting $t_0$ (age zero) to the first year in which a pair of publications could be co-cited (i.e., the publication year of the more recently published member of a co-cited pair), and (iii) setting $C_0$ to the number of co-citations occurring in year $t_0$. Rather than calculate awakening time as in [27], we opted to measure the simpler length of time between $t_0$ and the first year in which a co-citation was recorded; we label this measurement as the timelag $t_l$, so that $t_l = 0$ if a co-citation was recorded in $t_0$.

Results and Discussion
Our base dataset, described in Table 1, consists of the 33,641,395 co-cited reference pairs (33.6 million pairs) and their co-citation frequencies, gathered from Scopus during the 11-year period from 1985-1995 (Materials and Methods). The distribution of co-citation frequencies has a strikingly long right tail, with a minimum co-citation frequency of 1, a median of 2, and a maximum of 51,567 (Figure 2). Approximately 33.3 of 33.6 million pairs (99% of observations) have co-citation frequencies ranging from 1 to 67 and the remaining 1% have co-citation frequencies ranging from 68 to 51,567. Since the focus of our study was co-citations of frequently cited publications, we further restricted this dataset to those pairs with a co-citation frequency of at least 10, which resulted in a smaller dataset of 4,119,324 co-cited pairs (4.1 million pairs) with a minimum co-citation frequency of 10, a median of 18, and a maximum of 51,567. In order to focus on co-citations derived from highly cited publications, θ was calculated for all pairs with a co-citation frequency of at least 10. We also noted whether one article in a co-citation pair cites the other (connectedness).

Influenced by the use of linked co-citations for clustering [52], we also examined the extent to which members of a co-cited pair were also found in other co-cited pairs. We found that 205,543 articles contributed to 4.12 million co-cited pairs. The highest frequency observed in our dataset, 51,567 co-citations, was for a pair of articles from the field of physical chemistry: Becke (1993) [3] and Lee, Yang, and Parr (1988) [32]. The members of this pair are not connected and are found in a total of 1,504 co-cited pairs with frequencies ranging from 10 to 51,567. The second highest frequency, 28,407 co-citations, was for a pair of articles from the field of biochemistry: [31, 9].
Members of this pair are not connected and are found in 41,909 co-cited pairs, 24,558 for the Laemmli gel electrophoresis article and 17,352 for the Bradford protein estimation article. In terms of this second pair, both articles describe methods heavily used in biochemistry and molecular biology, an area with strong referencing activity, so this result is not entirely surprising.

Having developed θ(x, y) as a prediction of the probability that articles x and y would be co-cited, we first tested whether the distribution of co-citation frequencies was independent of θ (Materials and Methods). The null hypothesis that the co-citation frequency distribution was independent of θ was rejected with a very small p-value: the statistical software indicated a p-value with no significant non-zero digits. We next investigated what distribution functions might fit the frequencies of co-citation as θ varied.

Based on the long tails of citation frequencies, prior research has assessed the fit of lognormal and power law distributions [55, 47, 56]. We noted long right tails in co-citation frequencies, which, similarly, motivated us to assess the fit of lognormal and power law distributions to co-citation data. Further, we stratified the data according to (i) the minimum frequency for the right tail $\underline{x}$, (ii) θ, and (iii) whether the two members of each co-citation pair were connected. Figure 3 shows which distribution, if either, fits the data in each slice, based on tests of statistical significance. Note that there were no circumstances where both distributions fit: if one fit, then the other did not.

Statistical tests were not possible for some slices due to an insufficient number of data points. This was the case for certain combinations of large $\underline{x}$, large θ, and co-citations that were not connected.
The number of data points necessarily decreases as $\underline{x}$ increases, and we found the decrease to be more precipitous when θ was large and co-citations were unconnected, owing to the lighter right tails for these parameter combinations. The graph in the right panel of Figure 4, which has a logarithmic y-axis, shows that the number of data points per θ interval analyzed usually decreases by more than an order of magnitude from one interval to the next as θ increases. Most pairs of publications that are co-cited at least ten times, therefore, have small values of θ.

Figure 3 indicates when the null hypothesis of a best-fit lognormal or power law fitting the observed data cannot be rejected. We computed two types of statistics for evaluating the null hypothesis (χ² and K-S) and, moreover, we computed the χ² statistic for four binning strategies. Specifically, Figure 3 indicates a distributional fit if either the K-S p-value is greater than 0.05 or two or more of the χ² tests yield p-values greater than 0.05. While we computed the K-L Divergence (see supplementary material), we did not use these computations for formal statements of distributional fit because the K-L Divergence is not a norm and does not determine statistical significance. These K-L computations did, however, support the findings based on formal tests of statistical significance.

Power law distributions fit most often when co-citations are connected (Fig. 3), when more extreme right tails are considered, and when co-citations have small values of θ. Lognormal distributions fit, conversely, in some circumstances, when a greater portion of the right tail is considered. These observations support the existence of heavy tails for small θ, even if a lognormal distribution fits the observed data more broadly.
This observation is consistent with our observations of the most frequent co-citations having small θ values, as shown in the scatter plot in the left panel of Figure 4.

Mitzenmacher [38] shows a close relationship between the power law and lognormal distributions: subtle variations in generative mechanisms determine whether the resulting distribution is power law or lognormal. The stratified layers in Figure 3 where a lognormal distribution fits some portion of the right tail and, in the same instance, a power law describes the more extreme tail may therefore be due to a generative mechanism whose parameters are close to those for a power law distribution as well as those for a lognormal distribution.

Table 2: Exponents of best-fit power law distributions. These observations are for power law exponents where comparison across intervals of θ was possible, and where statistical tests indicated that a power law was a good fit to the data. The articles of the co-citations were connected for all data shown.
[Table 2 body: columns are right-tail cutoff ($\underline{x}$), θ interval, and power law exponent (α); the first listed cutoff is 200.]

Comparisons of power law exponents across θ were possible for two θ intervals, for connected co-citations, and for three right-tail cutoffs $\underline{x}$ (Table 2). The power law exponent α in these comparisons was smaller for the lower θ interval than for the higher one, indicating heavier tails for small θ and, therefore, a greater chance of extreme co-citation frequency. Figure 5 shows a log-log plot of the number of co-citations (y-axis) exhibiting the counts on the x-axis, for a single θ interval. The pattern for points below the 99th percentile clearly indicates that the number of co-citations referenced at a given frequency decreases greatly as the frequency increases. Also, the broadening of the scatter where fewer co-citations are cited more frequently is indicative of a long right tail, as has been observed in other research where lognormal or power law distributions have been fit to data, as in [39].

Perline [42] warns against fitting a power law function to truncated data. Informally, a portion of the entire data set can appear linear on a log-log plot, while the entire data set would not. He cites instances where researchers have mistakenly characterized an entire data set as following a power law due to an analysis of only a portion of the data, when a lognormal distribution might provide a better fit to the entire data set. Indeed, the scatter plot in Figure 5 is not linear and so, as Figure 3 shows, a power law does not fit the entire data set. This is what Perline calls a weak power law, where a power law distribution function fits the tail, but not the entire distribution.
Our concern, however, is not with characterizing the distributional function for the entire data set, but with characterizing the features of high frequency co-citations, which by definition means we were concerned with the right tail of the distribution. Moreover, the results avoid confusion between lognormal and power law distribution functions because we have shown not only that a power law provides a statistically significant fit, but also that a lognormal distribution function does not fit.

Our analysis found particularly heavy tails that were well fit by power law distributions for small θ , in the intervals [0 . , . and [0 . , . , and for co-citations whose constituents are connected, as shown in Fig. 3. The closely related Matthew Effect [36], cumulative advantage [45], and the preferential attachment class of models [1] provide a possible explanation for citation frequencies following a power law distribution for some sufficiently extreme portion of the right tail. For greater values of θ , insufficient data in the right tails precludes a definitive assessment in this regard, although one might argue that the lack of observations in the tails is counter to the existence of a power law relationship. It is also noteworthy that the exponents we found for co-citations (Table 2) are close in value to those reported for citations by [45] and [47].

Delayed Co-citations
The delayed onset of citations to a well cited publication, also referred to as ‘Delayed Recognition’ and ‘Sleeping Beauty’, has been studied by Garfield, van Raan, and others [21, 60, 27, 59, 33, 23, 5]. We sought to extend this concept to frequently co-cited articles. As an initial step, we calculated two parameters (Materials and Methods): (1) the beauty coefficient [27] modified for co-cited articles and (2) timelag t l , the length of time between the first possible year of co-citation and the first year in which a co-citation was recorded. We further restricted our consideration of delayed co-citations to the 95th percentile or greater of co-citation frequencies in our dataset of 4.1 million co-cited pairs. Within the bounds of this restriction, 24 co-cited pairs have a beauty coefficient of 1,000 or greater and all 24 are in the 99th percentile of co-citation frequencies. Thus, very high beauty coefficients are associated with high co-citation frequencies.

We also examined the relationship of t l with co-citation frequencies (Fig. 6) and observed that high t l values were associated with lower co-citation frequencies. These data appear to be consistent with a report from van Raan and Winnink [60], who conclude that ‘probability of awakening after a period of deep sleep is becoming rapidly smaller for longer sleeping periods’. Further, when two articles are connected, they tend to have smaller t l values compared to pairs that are not connected in the same frequency range.

Figures

[Figure 2 plots: three panels; y-axes: frequency, frequency, percent connected]

Figure 2: The x-axis shows percentiles for all three plots
Left Side
Co-citation frequencies of highly cited publications from Scopus 1985-1995
Co-citation frequencies are plotted against their percentile values. The upper and lower plots were both generated from 33,641,395 data points. The lower plot shows the same data with a logarithmic (ln) transformation of the y-axis. The minimum co-citation frequency is 1, the median is 2, the third quartile is 4, and the maximum is 51,567. Additionally, 15,140,356 pairs (45 %) have a co-citation frequency of 1. Frequencies of 12, 22, 67, and 209 correspond to quantile values of 0.9, 0.95, 0.99, and 0.999 respectively.
Right Side
Direct citations between members of a co-cited pair (connectedness) increase with co-citation frequency. The proportion of connected pairs (a direct citation exists between the two members of a pair) within each percentile is shown. Data are plotted for all pairs with a co-citation frequency of at least (4.1 million pairs)

Figure 3: Distributional fits to the observed co-citation frequencies
The graph shows where a lognormal or power law distribution demonstrated a statistically significant fit with the observed co-citation frequencies, stratified by θ , extent of the right tail tested x , and whether co-citations were connected. A power law fit more often for θ in the intervals [0 . , . and [0 . , . when co-citation constituents were connected. When a lognormal distribution fit, it was for broader portions of the data set. Data were insufficient for testing as θ increased due to (i) fewer observations and (ii) less prominent right tails.

(a) Co-citation Scopus frequency versus θ (b) Number of co-cited pairs per θ interval
Figure 4: Co-citation dynamics relative to θ . (a) Points represent the Scopus frequency vs. θ value for each co-cited pair. Darker regions indicate denser plots of the translucent points. Co-cited pairs with greater frequency are observed at smaller θ . (b) The y -axis employs a log scale and shows the number of co-cited pairs per θ interval. The number of co-cited pairs decreases, most often, by more than an order of magnitude per interval as θ increases. The dominance of co-cited pairs with smaller θ is also reflected by regions of greater density in panel (a).

Figure 5: Log-log plot of the number of co-citations versus co-citation count for θ ∈ [0 . , . The y -axis shows the number of co-cited pairs observed having the citation counts plotted along the x -axis. The tightly clustered plot below the 99th percentile demonstrates a clear pattern of decreasing number of co-cited pairs having an increasing number of citation counts. The scatter plot for the tail above the 99th percentile broadens, indicating a long tail of relatively few co-cited pairs that were cited with extreme frequency.

Figure 6: Relationship between time lag ( t l ) and co-citation frequency. Extended lag times are associated with lower co-citation frequencies. Connected pairs have lower t l values.
Data are shown for 207,214 pairs consisting of ≥ t l , the time between first possible co-citation and first co-citation.

[Figure 7 plots: panels example_1 and example_2; x-axis: Publication Year; y-axis: Frequency; series: pub1, pub2, cocite]

Figure 7: Co-citation frequencies of highly cited publications from Scopus 1985-1995
Upper panel
Publication 1: Instability of the interface of two gases accelerated by a shock wave (1972) doi: 10.1007/BF01015969, first cited (1993), total citations (566). Publication 2: Taylor instability in shock acceleration of compressible fluids (1960) doi: 10.1002/cpa.3160130207, first cited (1973), total citations (566), first co-cited (1993), total co-citations (541).
Lower Panel
Publication 1: Colorimetric assay of catalase (1972) doi: 10.1016/0003-2697(72)90132-7, first cited (1972), total citations (2683). Publication 2: Levels of glutathione, glutathione reductase and glutathione S-transferase activities in rat lung and liver (1979) doi: 10.1016/0304-4165(79)90289-7, first cited (1979), total citations (2464), first co-cited (1979), total co-citations (470).

Conclusions
In this article, we report on our exploration of features that impact the frequency of co-citations. In particular, we wished to examine article pairs with high co-citation frequencies with respect to whether they originated from the same school(s) of thought or represented novel combinations of existing ideas. However, defining a discipline is challenging, and determining the discipline(s) relevant to specific publications remains a difficult problem. Journal-level classifications of disciplines have known limitations and, while article-level approaches offer some advantages, they are not free of their own limitations [37].

Consequently, we designed θ , a statistic that examines the citation neighborhood of a pair of articles x and y to estimate the probability that they would be co-cited. Our approach has advantages compared to alternate approaches: it avoids the challenges of journal-level analyses, it does not require a definition of “discipline” (or “disciplinary distance”), it does not require assignment of disciplines to articles, it is computationally feasible, and, most importantly, it enables an evaluation that is specific to a given pair of articles.

We note that when x and y are from the same sub-field, θ may be very large and, conversely, when x and y are from very different fields, it might be reasonable to expect that θ will be small. Thus, in a sense, θ may correlate with disciplinary similarity, with large values for θ reflecting conditions where the two publications are in the same (or very close) sub-disciplines, and small values for θ reflecting that the disciplines for the two publications are very distantly related. We also comment that in this initial study, we have not considered second-degree information, that is, publications that cite publications that cite an article of interest.

Our data indicate that the most frequent co-citations occur when co-citations have small values of θ , as shown in Figure 4.
Our study considered the hypothesis that the frequency distribution is independent of θ , but our statistical tests rejected this hypothesis, and showed instead that the frequency distribution is best characterized by a power law for small values of θ and connected publications, and in many other regions is best characterized by a lognormal distribution.

The observation that power laws are associated with small values of θ and connected co-citations is consistent with the theory of preferential attachment for these parameter settings. To the extent that preferential attachment is the mechanism giving rise to a power law, this suggests that preferential attachment is, at least, stronger for small θ values and connected co-citations than for other parameter combinations, or that preferential attachment is not applicable to other parameter values.

Observing power laws, heavy tails, and pairs with extreme co-citation strength for small values of θ (i.e., pairs that have small a priori probabilities of being co-cited) may seem, on its face, paradoxical. One possible explanation of the pairs in the extreme right tail with both small θ and large co-citation strength is that those pairs represent novel combinations of ideas that, when recognized within the research community, catalyze an increased citation rate, consistent with preferential attachment coupled to time-dependent initial attractiveness [20] as an underlying generative mechanism. However, small values of θ do not guarantee a high co-citation count: indeed, even for small values of θ , co-citations with a power law predominantly have relatively low co-citation strength.

We also note the increasing proportion of connected pairs as the percentile for co-citation frequency increases (Fig. 2); this pair of parameters appears to be associated with a fertile environment where extremely high co-citation frequencies are possible.
This observation raises the question of whether small values of θ and connected co-citations are associated with preferential attachment and, if a causal relationship exists, how θ and co-citation connection provide an environment supporting preferential attachment. One possibility is that one article in a co-cited pair citing the other makes the potential significance of the combination of their ideas apparent to researchers. The clear pattern of the highest frequency co-cited pairs typically having low θ values suggests that these pairs are highly cited and hence impactful because of the novelty in the ideas or fields that are combined (as reflected in low θ ). However, other factors should be considered, such as the prominence of authors and prestige of a journal [22] where the first co-citation appears.

We did not apply field-normalization techniques when assembling the parent pool of 768,993 highly cited articles consisting of the top 1% of highly cited articles from each year in the Scopus bibliography. Thus, the highly co-cited pairs we observe are biased towards high-referencing areas such as biomedicine and parts of the physical sciences [51]. However, the dataset we analyzed has a lower bound of 10 on co-citation frequencies and includes pairs from fields other than those that are high referencing. For example, the maximum t l we observed in the dataset of 4.1 million pairs was 149 years, and is associated with a pair of articles independently published in 1840, establishing the eponymous Staudt-Clausen theorem [12, 61]; this pair of articles was apparently co-cited 10 times since their publication. A second pair of articles concerning the electron theory of metals [17, 18] was first co-cited in 1994 and has been co-cited a total of 109 times, with an observed t l of 94 years. Both cases are drawn from mathematics and physics rather than the medical literature. They are also consistent with the suggestion that the probability of awakening is smaller after a period of deep sleep [60].
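As a worked illustration of the time lag, the following minimal sketch computes t l for a pair of articles. It assumes that the first possible year of co-citation is the publication year of the later article in the pair; the function name and this reading of "first possible" are illustrative assumptions, and the definition in Materials and Methods takes precedence.

```python
def time_lag(pub_year_x, pub_year_y, first_cocited_year):
    """Years between the first possible co-citation and the first recorded one.

    Assumption: a pair can first be co-cited in the year the later of the
    two articles was published.
    """
    first_possible = max(pub_year_x, pub_year_y)
    if first_cocited_year < first_possible:
        raise ValueError("first co-citation precedes the later publication year")
    return first_cocited_year - first_possible


# Pair on the electron theory of metals: both articles published in 1900,
# first co-cited in 1994.
print(time_lag(1900, 1900, 1994))  # 94, matching the t_l reported in the text
```

Under the same assumption, a t l of 149 years for two articles published in 1840 corresponds to a first recorded co-citation in 1989.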
As we have defined t l , with its heavy penalty for early citation, we create additional sensitivity to coverage and data quality, especially for pairs with low citation numbers. Indeed, for the Staudt-Clausen pair, a manual search of other sources revealed an article [11] in which they are co-cited. Both these articles were originally published in German and it is possible that additional co-citations were not captured. Thus, big data approaches that serve to identify trends should be accompanied by more meticulous case studies, where possible. Other approaches for examining depth of sleep and awakening time should certainly be considered [59, 27]. Lastly, using our approach to revisit invisible colleges [46, 16, 52] seems warranted, since the upper bound of a hundred members predicted by [46] is likely to have increased in a global scientific enterprise with electronic publishing and social media.

Finally, we view these results as a first step towards further investigation of co-citation behavior, and we introduce a new technique based on exploring first-degree neighbors of co-cited publications; we are hopeful that this graph-theoretic study will stimulate new approaches that will provide additional insights, and prove complementary to other article-level approaches.

Acknowledgments
In addition to support through federal funding, the ERNIE project features a collaboration with Elsevier. We thank our colleagues from Elsevier for their support of the collaboration.
Competing Interests
The authors have no competing interests. Scopus data used in this study was available to us through a collaborative agreement with Elsevier on the ERNIE project. Elsevier personnel played no role in conceptualization, experimental design, review of results, or conclusions presented. The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or Elsevier. Sitaram Devarakonda’s present affiliation is Randstad USA. His contributions to this article were made while he was a full-time employee of NET ESolutions Corporation.
Author Contributions
Conceptualization, GC, JB, SD, and TW; Methodology, AD, DK, GC, JB, SD, SL, and TW; Investigation, DL-H, GC, JB, and SD; Writing - Original Draft, GC, JB, TW; Writing - Review and Editing, AD, DK, DL-H, GC, JB, SD, SL, and TW; Funding Acquisition, GC; Resources, DK and GC; Supervision, GC. Authors are listed in alphabetic order.
References

[1] Albert, R., and Barabási, A.-L. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 1 (2002), 47–97.
[2] Barber, B. Resistance by scientists to scientific discovery. Science 134 (1961), 596–602.
[3] Becke, A. D. Density-functional thermochemistry. III. The role of exact exchange. The Journal of Chemical Physics 98, 7 (Apr. 1993), 5648–5652. Publisher: American Institute of Physics.
[4] Bornmann, L., Leydesdorff, L., and Mutz, R. The use of percentiles and percentile rank classes in the analysis of bibliometric data: Opportunities and limits. Journal of Informetrics 7, 1 (2013), 158–165.
[5] Bornmann, L., Ye, A. Y., and Ye, F. Y. Identifying “hot papers” and papers with “delayed recognition” in large-scale datasets by using dynamically normalized citation impact scores. Scientometrics 116, 2 (Aug. 2018), 655–674.
[6] Boyack, K., and Klavans, R. Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology 61, 12 (2010), 2389–2404.
[7] Boyack, K., and Klavans, R. Atypical combinations are confounded by disciplinary effects. In International Conference on Science and Technology Indicators (Leiden, Netherlands, 2014), CWTS-Leiden University, pp. 49–58.
[8] Braam, R. R., Moed, H. F., and Raan, A. F. J. v. Mapping of science by combined co-citation and word analysis. I. Structural aspects. Journal of the American Society for Information Science 42, 4 (1991), 233–251.
[9] Bradford, M. M. A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding. Analytical Biochemistry 72 (May 1976), 248–254.
[10] Bradley, J., Devarakonda, S., Davey, A., Korobskiy, D., Liu, S., Lakhdar-Hamina, D., Warnow, T., and Chacko, G. Co-citations in context: Disciplinary heterogeneity is relevant. Quantitative Science Studies (2020), 1–13.
[11] Carlitz, L. The Staudt-Clausen Theorem. Mathematics Magazine 34 (1961), 131–146.
[12] Clausen, T. Theorem. Astronomische Nachrichten 17 (1840), 351–352.
[13] Clauset, A., Shalizi, C. R., and Newman, M. E. J. Power-Law Distributions in Empirical Data. SIAM Review 51, 4 (Nov. 2009), 661–703.
[14] Colavizza, G., Boyack, K., Eck, N. J. v., and Waltman, L. The Closer the Better: Similarity of Publication Pairs at Different Cocitation Levels. Journal of the Association for Information Science and Technology 69, 4 (2018), 600–609.
[15] Cole, S. Professional Standing and the Reception of Scientific Discoveries. American Journal of Sociology 76, 2 (1970), 286–306.
[16] Crane, D. Invisible Colleges: Diffusion of Knowledge in Scientific Communities. University of Chicago Press, Chicago, 1972.
[17] Drude, P. Zur Elektronentheorie der Metalle. Annalen der Physik 306 (1900), 566–613.
[18] Drude, P. Zur Elektronentheorie der Metalle; II. Teil. Galvanomagnetische und thermomagnetische Effecte. Annalen der Physik 308 (1900), 369–402.
[19] Elsevier BV. Scopus, 2019. Accessed Dec 2019.
[20] Eom, Y.-H., and Fortunato, S. Characterizing and modeling citation dynamics. PLOS ONE 6, 9 (09 2011), 1–7.
[21] Garfield, E. Would Mendel’s work have been ignored if the Science Citation Index was available 100 years ago? Essays of an Information Scientist 1 (1970), 69–70.
[22] Garfield, E. Premature Discovery or Delayed Recognition - Why? Essays of an Information Scientist 4 (1980), 488–493.
[23] Glänzel, W., and Garfield, E. The myth of delayed recognition. Scientist 18, 11 (June 2004).
[24] Gmür, M. Co-citation analysis and the search for invisible colleges: A methodological evaluation. Scientometrics 57, 1 (May 2003), 27–57.
[25] Goldstein, M. L., Morris, S. A., and Yen, G. G. Problems with fitting to the power-law distribution. The European Physical Journal B - Condensed Matter and Complex Systems 41, 2 (Sept. 2004), 255–258.
[26] Gómez, I., Bordons, M., Fernàndez, M., and Méndez, A. Coping with the problem of subject classification diversity. Scientometrics 35 (02 1996), 223–235.
[27] Ke, Q., Ferrara, E., Radicchi, F., and Flammini, A. Defining and identifying Sleeping Beauties in science. Proceedings of the National Academy of Sciences 112, 24 (June 2015), 7426–7431.
[28] Kessler, M. M. Bibliographic coupling between scientific papers. American Documentation 14, 1 (1963), 10–25. https://onlinelibrary.wiley.com/doi/pdf/10.1002/asi.5090140103.
[29] Klavans, R., and Boyack, K. Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? Journal of the Association for Information Science and Technology 68 (04 2017), 984–998.
[30] Korobskiy, D., Davey, A., Liu, S., Devarakonda, S., and Chacko, G. Enhanced Research Network Informatics Environment (ERNIE). Github repository, NET ESolutions Corporation, 2019.
[31] Laemmli, U. K. Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature 227, 5259 (Aug. 1970), 680–685.
[32] Lee, C., Yang, W., and Parr, R. G. Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density. Physical Review B 37, 2 (Jan. 1988), 785–789. Publisher: American Physical Society.
[33] Li, J., and Ye, F. Y. Distinguishing sleeping beauties in science. Scientometrics 108, 2 (Aug. 2016), 821–828.
[34] Marshakova-Shaikevich, I. System of Document Connections Based on References. Scientific and Technical Information Serial of VINITI 6, 2 (1973), 3–8.
[35] Merton, R. Resistance to the systematic study of multiple discoveries in science. European Journal of Sociology 4, 2 (1963), 237–282.
[36] Merton, R. The Matthew Effect in science. Science 159, 3810 (1968), 56–63.
[37] Milojevic, S. Practical method to reclassify web of science articles into unique subject categories and broad disciplines. Quantitative Science Studies (12 2019), 1–24.
[38] Mitzenmacher, M. A brief history of generative models for power law and lognormal distributions. Internet Mathematics 1 (2003), 226–251.
[39] Montebruno, P., Bennett, R. J., van Lieshout, C., and Smith, H. A tale of two tails: Do Power Law and Lognormal models fit firm-size distributions in the mid-Victorian era? Physica A: Statistical Mechanics and its Applications 523 (June 2019), 858–875.
[40] Newman, M. E. J. The Structure and Function of Complex Networks. SIAM Review 45, 2 (Jan. 2003), 167–256.
[41] Noma, E. Co-citation analysis and the invisible college. Journal of the American Society for Information Science 35, 1 (1984), 29–33.
[42] Perline, R. Strong, weak and false inverse power laws. Statistical Science 20, 1 (2005), 68–88.
[43] Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. Numerical recipes in C: The art of scientific computing (3rd ed.). Cambridge University Press, New York, September 2007.
[44] Price, D. d. S. Networks of Scientific Papers. Science 149 (1965), 510–515.
[45] Price, D. d. S. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science 27, 5 (Sept. 1976), 292–306.
[46] Price, D. d. S., and Beaver, D. D. Collaboration in an invisible college. American Psychologist 21, 11 (1966), 1011–1018.
[47] Radicchi, F., Fortunato, S., and Castellano, C. Universality of citation distributions: toward an objective measure of scientific impact. PNAS 105, 45 (2008), 17268–17272.
[48] Redner, S. Citation statistics from 110 years of physical review. Physics Today 58, 6 (2005), 49–54.
[49] Shu, F., Julien, C.-A., Zhang, L., Qiu, J., Zhang, J., and Larivière, V. Comparing journal and paper level classifications of science. Journal of Informetrics 13, 1 (Feb. 2019), 202–225.
[50] Small, H. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science 24, 4 (1973), 265–269.
[51] Small, H., and Greenlee, E. Citation context analysis of a co-citation cluster: Recombinant-DNA. Scientometrics 2, 4 (July 1980), 277–301.
[52] Small, H., and Sweeney, E. Clustering the science citation index® using co-citations. Scientometrics 7, 3 (Mar. 1985), 391–409.
[53] Small, H., Sweeney, E., and Greenlee, E. Clustering the science citation index using co-citations. II. Mapping science. Scientometrics 8, 5 (Nov. 1985), 321–340.
[54] StackExchange. Can I use the Kolmogorov–Smirnov test on my Data?, 2014. https://stats.stackexchange.com/questions/112910/can-i-use-kolmogorov-smirnov-test-on-my-data.
[55] Stringer, M. J., Sales-Pardo, M., and Amaral, L. A. N. Effectiveness of journal ranking schemes as a tool for locating information. PLoS ONE 3, 2 (2008), e1683.
[56] Stringer, M. J., Sales-Pardo, M., and Amaral, L. A. N. Statistical validation of a global model for the distribution of the ultimate number of citations accrued by papers published in a scientific journal. Journal of the American Society for Information Science and Technology 61, 7 (2010), 1377–1385.
[57] Uzzi, B., Mukherjee, S., Stringer, M., and Jones, B. Atypical Combinations and Scientific Impact. Science 342, 6157 (Oct. 2013), 468–472.
[58] van Raan, A. F. J. Fractal dimension of co-citations. Nature 347, 6294 (Oct. 1990), 626–626.
[59] van Raan, A. F. J. Sleeping Beauties in science. Scientometrics 59, 3 (Mar. 2004), 467–472.
[60] van Raan, A. F. J., and Winnink, J. J. The occurrence of ‘sleeping beauty’ publications in medical research: Their scientific impact and technological relevance. PLOS ONE 14, 10 (10 2019), 1–34.
[61] von Staudt, K. Beweis eines Lehrsatzes, die Bernoullischen Zahlen betreffend. Journal für die reine und angewandte Mathematik 21 (1840), 372–374.
[62] Wallace, M., Larivière, V., and Gingras, Y. Modeling a century of citation distributions. Journal of Informetrics 3, 4 (2009), 296–303.
[63] Waltman, L., and van Eck, N. J. A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology 63, 12 (2012), 2378–2392.
[64] Wang, D., Song, C., and Barabási, A.-L. Quantifying Long-Term Scientific Impact. Science 342, 6154 (Oct. 2013), 127–132.
[65] Wang, J., Veugelers, R., and Stephan, P. Bias against novelty in science: A cautionary tale for users of bibliometric indicators.