Co-citations in context: disciplinary heterogeneity is relevant
James Bradley, Sitaram Devarakonda, Avon Davey, Dmitriy Korobskiy, Siyu Liu, Djamil Lakhdar-Hamina, Tandy Warnow, George Chacko
Raymond Mason School of Business, Coll. of William & Mary, Williamsburg, VA
Netelabs, NET ESolutions Corporation, McLean, VA
Department of Computer Science, Univ. of Illinois, Urbana-Champaign, IL

September 20, 2019

Accepted for publication in Quantitative Science Studies.

Abstract

Citation analysis of the scientific literature has been used to study and define disciplinary boundaries, to trace the dissemination of knowledge, and to estimate impact. Co-citation, the frequency with which pairs of publications are cited, provides insight into how documents relate to each other and across fields. Co-citation analysis has been used to characterize combinations of prior work as conventional or innovative and to derive features of highly cited publications. Given the organization of science into disciplines, a key question is the sensitivity of such analyses to frame of reference. Our study examines this question using semantically-themed citation networks. We observe that trends reported to be true across the scientific literature do not hold for focused citation networks, and we conclude that inferring novelty using co-citation analysis and random graph models benefits from disciplinary context.

Keywords: co-citation analysis, bibliometrics, random graphs
Citation and network analysis of scientific literature reveals information on semantic relationships between publications, collaboration between scientists, and the practice of citation itself [4, 3, 14, 17, 15, 19]. Co-citation, the frequency with which two documents are cited together in other documents, provides additional insights, including the identification of semantically related documents, fields, specializations, and new ideas in science [18, 12, 1, 26, 25].

In a novel approach, Uzzi and colleagues [22] used co-citation analysis to characterize a subset of highly cited articles with respect to both novel and conventional combinations of prior research. The frequency with which references were co-cited in 17.9 million articles and their cited references from the Web of Science (WoS) was calculated and expressed as journal pair frequencies (observed co-citation frequencies). Expected co-citation values were generated using Monte Carlo simulations under a random graph model. Observed frequencies were then normalized (shifted and scaled) to averaged expected values from ten randomized networks and termed z-scores. Consequently, every article was associated with multiple z-scores corresponding to co-cited journal pairs in its references. For each article, positional statistics of z-scores were calculated and evaluated to set thresholds for a binary classification of conventionality using the median z-score of an article, and novelty using the tenth percentile of z-scores within an article.

Thus, LNHC would denote low novelty (LN) and high conventionality (HC), with all four combinations of LN and HN with LC and HC being possible. The authors observed that HNHC articles were twice as likely to be highly cited compared to the background rate, suggesting that novel combinations of ideas flavoring a body of conventional thought were a feature of impact.

Key to the findings of Uzzi et al. is the random graph model used, and its underlying assumptions.
The citation switching algorithm used to generate expected values by substituting cited references with randomly selected references published in the same year is designed to preserve the number of publications, the number of references in each publication, and the year of publication of both publications and references. Importantly, disciplinary origin does not affect the probability that a reference is selected to replace another one. For example, a reference in quantum physics can be substituted, with equal probability, by a reference published in the same year but from the field of quantum physics, quantum chemistry, classical literature, entomology, or anthropology. Such substitutions do not account well for the disciplinary nature of scientific research and citation behavior [24, 13, 8, 5]. Accordingly, model misspecification is likely to arise because the simulated values do not correspond well to the empirical data.

A follow-up study by Boyack and Klavans (2014) [2] explored the impact of discipline and journal effects on these definitions of conventionality and novelty. While their study had some methodological differences in the use of Scopus data rather than WoS data, a smaller data set, and a χ² calculation rather than Monte Carlo simulations to generate expected values of journal pairs, Boyack and Klavans noted strong effects from disciplines and journals. While they also reported the trend that HNHC is more probable in highly cited papers, they observed that “only 64.4% of 243 WoS subject categories” in the Uzzi et al. study met the criterion of having the highest probability of hit papers in the HNHC category. Further, they observed that journals vary widely in terms of size and influence and that 20 journals accounted for 15.9% of co-citations in their measurements.
Lastly, they noted that three multidisciplinary journals accounted for 9.4% of all atypical combinations.

Despite different methods used to generate expected values, both of these key preceding studies measured co-citation frequencies across the scientific literature (using either WoS or Scopus) and normalized them without disciplinary constraints before subsequently analyzing disciplinary subsets. We hypothesized instead that modifying the normalization to constrain substitution references to be drawn only from the citation network being studied (the “local network”) rather than all of WoS (the “global network”) would reduce model misspecification by limiting substitutions from references that were ectopic to these networks.

Consequently, we used keyword searches of the scientific literature to construct exemplar citation networks themed around academic disciplines of interest: applied physics, immunology, and metabolism. The cited references in these networks, while predominantly aligned with the parent discipline (physics or life sciences in this case), also included articles from other disciplines. Within these disciplinary frameworks, we calculated observed and expected co-citation frequencies using a refined random graph model and an efficient Monte Carlo simulation algorithm.

Our analyses, using multiple techniques, provide substantial evidence that a constrained model where reference substitutions are limited to a local (disciplinary) network reduces model misspecification compared to the unconstrained model that uses the global network (WoS). Furthermore, re-analyses of these three semantically-themed citation networks under the improved model reveal strikingly different trends. For example, while Uzzi et al. reported that highly cited articles are more likely than expected to be both HC and HN and that this trend largely held across all disciplines, we find that these trends vary with the discipline so that universal trends are not apparent.
Specifically, HC remains highly correlated with highly cited articles in the immunology and metabolism data sets but not with applied physics, and HN is highly correlated with highly cited articles in applied physics but not with immunology and metabolism. Thus, disciplinary networks are different from each other, and trends that hold for the full WoS network do not hold for even large networks (such as metabolism). Furthermore, we also found that the categories demonstrating the highest percentage of highly cited articles (e.g., HC, HN, etc.) are not robust with respect to varying thresholds for high citation counts or for highly novel citation patterns. Overall, our study, although limited to three disciplinary networks, suggests that co-citation analysis that inadequately considers disciplinary differences may not be very useful at detecting universal features of impactful publications.

We have previously developed ERNIE, an open source knowledge platform into which we parse the Web of Science (WoS) Core Collection [7]. WoS data stored in ERNIE spans the period 1900-2019 and consists of over 72 million publications. For this study, we generated an analytical data set from years 1985 to 2005 using data in ERNIE. The total number of publications in this data set was just over 25 million publications (25,134,073), which were then stratified by year of publication. For each of these years, we further restricted analysis to publications of type Article. Since WoS data also contains incomplete references or references that point at other indexes, we also considered only those references for which there were complete records (Table 1). For example, WoS data for year 2005 contained 1,753,174 publications, which after restricting to type Article and considering only those references described above resulted in 916,573 publications, 6,095,594 unique references (set of references), and 17,167,347 total references (multiset of references).
Given consistent trends in the data (Table 1), we analyzed the two boundary years (1985 and 2005) and the mid-point (1995). We also used the number of times each of these articles was cited in the first 8 years since publication as a measure of its impact.

Table 1: Summary of base WoS Analytical data set. Only publications of type Article with at least two references and references with complete publication data were selected for this data set. The number of unique publications of type Article, unique references (ur), and total references (tr) increases monotonically with each year, and the ratio of total references to unique references rises over the period, indicating that both the number of documents and citation activity increase over time.

Year   Unique Publications   Unique References (ur)   Total References (tr)   tr/ur
1985   391,860               2,266,584                5,588,861               2.47
1986   402,309               2,316,451                5,708,796               2.46
1987   412,936               2,427,347                5,998,513               2.47
1988   426,001               2,545,647                6,354,917               2.50
1989   443,144               2,673,092                6,749,319               2.52
1990   458,768               2,827,517                7,209,413               2.55
1991   477,712               2,977,784                7,729,776               2.60
1992   492,181               3,134,109                8,188,940               2.61
1993   504,488               3,278,102                8,676,583               2.65
1994   523,660               3,458,072                9,255,748               2.68
1995   537,160               3,680,616                9,875,421               2.68
1996   663,110               4,144,581                11,641,286              2.81
1997   677,077               4,340,733                12,135,104              2.80
1998   693,531               4,573,584                12,728,629              2.78
1999   709,827               4,784,024                13,280,828              2.78
2000   721,926               5,008,842                13,810,746              2.76
2001   727,816               5,203,078                14,261,189              2.74
2002   747,287               5,464,045                15,001,390              2.75
2003   786,284               5,773,756                16,024,652              2.78
2004   826,834               6,095,594                17,167,347              2.82
2005   886,648               6,615,824                19,036,324              2.88

We constructed three disciplinary data sets in areas of our interest based on keyword searches: immunology, metabolism, and applied physics. For the first two, rooted in biomedical research, we searched Pubmed for the term ‘immunology’ or ‘metabolism’ in the years 1985, 1995, and 2005 (Table 2). Pubmed IDs (pmids) returned were matched to WoS IDs (wos_ids) and used to retrieve relevant articles. For the applied physics data set, we directly searched traditional subject labels in WoS for ‘Physics, Applied’. While applied physics and immunology represent somewhat small networks (roughly 3-6% of our analytical WoS datasets) over the three years examined, metabolism represents approximately 20-23%, making them interesting and meaningful test cases. We also examined publications in the five major research areas in WoS: life sciences & biomedicine, physical sciences, technology, social sciences, and arts & humanities, using the extended WoS subcategory classification of 153 sub-groups to categorize disciplinary composition of cited references in the data sets we studied.
We performed analyses on publications from 1985, 1995, and 2005. Building upon prior work [22], all $\binom{n}{2}$ reference pairs were generated for each publication, where $n$ is the number of cited references in the publication. These reference pairs were then mapped to the journals they were published in using ISSN numbers as identifiers. Where multiple ISSN numbers exist for a journal, the most frequently used one in WoS was assigned to the journal. In addition, publications containing fewer than two references were discarded. Journal pair frequencies were summed across the data set to create observed frequencies ($F_{obs}$).

Table 2: Disciplinary data sets. PubMed and WoS were searched for articles using search terms, ‘immunology’, ‘metabolism’, and ‘applied physics.’ Counts of publications are shown for each of the three years analyzed and expressed in parentheses as a percentage of the total number of publications in our analytical WoS data set (Table 1) for that year. Note that Applied Physics and Immunology represent about 3-6% of the publications in our analytical WoS datasets, but Metabolism occupies 20-23%.

Year   Applied Physics   Immunology       Metabolism
1985   10,298 (2.7%)     21,606 (5.5%)    78,998 (20.2%)
1995   21,012 (3.9%)     29,320 (5.5%)    121,247 (22.6%)
2005   35,600 (4.0%)     37,296 (4.2%)    200,052 (22.6%)

For citation shuffling, we developed a performant citation switching algorithm, runtime enhanced permuting citation switcher (repcs) [10], that randomly permuted citations within each disciplinary data set and within each year of publication: each citation within each article was switched within its permutation group in order to preserve the number of references from each publication year within each article. In so doing, the number of publications, the number of references in each data set, and the disciplinary composition of the references in each data set were preserved.
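The permutation step described above can be illustrated with a minimal Python sketch. This is not the repcs implementation: the function name `shuffle_citations`, the tuple layout of the input, and the in-memory approach are assumptions made for illustration. The sketch permutes cited references within groups defined by the reference's publication year, keeps repeated references in the pool (a multiset), allows a reference to land back in its original slot, and drops any publication that ends up citing the same reference twice.

```python
import random
from collections import defaultdict

def shuffle_citations(citations, seed=None):
    """Permute cited references within each reference-year group.

    citations: list of (publication_id, reference_id, reference_year)
               for one data set and one year of publication (assumed layout).
    Returns a dict mapping publication_id -> list of shuffled reference_ids.
    """
    rng = random.Random(seed)

    # Group citation slots by the publication year of the cited reference.
    groups = defaultdict(list)
    for pub, ref, year in citations:
        groups[year].append((pub, ref))

    shuffled = defaultdict(list)
    for slots in groups.values():
        refs = [ref for _, ref in slots]   # multiset: repeated references are kept
        rng.shuffle(refs)                  # a reference may return to its own slot
        for (pub, _), new_ref in zip(slots, refs):
            shuffled[pub].append(new_ref)

    # Drop publications that accumulated duplicate references.
    return {pub: refs for pub, refs in shuffled.items()
            if len(refs) == len(set(refs))}
```

Because each slot keeps its publication and its reference year, the number of references per publication and per reference year is unchanged by construction; only the identity of the cited reference varies between runs.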
Our approach is different from previous studies in these ways: (i) we sampled citations in proportion to their citation frequency (equivalently from a multiset rather than a set) in order to better reflect citation practice, (ii) we permitted a substitution to match the original reference in a publication when the random selection process dictated it rather than attempting to enforce that a different reference be substituted, and (iii) we introduced an error correction step to delete any publications that accumulated duplicate references during the substitution process. As a benchmark, we used the citation switching algorithm of [22], henceforth referred to as umsj as also done in [2], using code kindly provided by the authors. A single comparative analysis showed that while 10 simulations of the WoS 1985 data set (391,860 selected articles) completed in 2,186 hours using the umsj algorithm, it completed in less than one hour using our implementation of the repcs algorithm on a Spark cluster. We also tested repcs under comparable conditions to umsj and estimated a runtime advantage of at least two orders of magnitude. This runtime advantage was significant enough that we chose to use the repcs algorithm in our study and generated expected values averaged over 1,000 simulations for improved coverage of every data set we analyzed.

Using averaged results from 1,000 simulations for each data set studied, z-scores were calculated for each journal pair using the formula $z = (F_{obs} - F_{exp})/\sigma$, where $F_{obs}$ is the observed frequency, $F_{exp}$ is the averaged simulated frequency, and $\sigma$ is the standard deviation of the simulated frequencies for a journal pair [22]. As a result of these calculations, each publication becomes associated with a set of z-scores corresponding to the journal pairs derived from pairwise combinations of its cited references.
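As an illustration of the z-score calculation, the following Python sketch counts co-cited journal pairs and normalizes the observed counts against simulated ones. The function names, the dictionary-based inputs, the use of the population standard deviation, and the choice to omit pairs whose simulated counts have zero variance are all assumptions of this sketch rather than details taken from the study's code.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev

def journal_pair_counts(pub_refs, journal_of):
    """Count co-cited journal pairs over all publications.

    pub_refs:   dict publication_id -> list of reference_ids.
    journal_of: dict reference_id -> journal identifier (e.g. an ISSN).
    """
    counts = Counter()
    for refs in pub_refs.values():
        if len(refs) < 2:                      # fewer than two references: skipped
            continue
        for r1, r2 in combinations(refs, 2):   # all C(n, 2) reference pairs
            pair = tuple(sorted((journal_of[r1], journal_of[r2])))
            counts[pair] += 1
    return counts

def z_scores(observed, simulated):
    """Return z = (F_obs - F_exp) / sigma for each observed journal pair.

    simulated: list of Counters, one per shuffled network.
    Pairs with sigma == 0 are omitted (an assumption of this sketch).
    """
    z = {}
    for pair, f_obs in observed.items():
        sims = [s.get(pair, 0) for s in simulated]
        sigma = pstdev(sims)                   # population SD (assumed)
        if sigma > 0:
            z[pair] = (f_obs - mean(sims)) / sigma
    return z
```

In use, `journal_pair_counts` would be applied once to the observed network and once to each shuffled network, and the resulting Counters passed to `z_scores`.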
Positional statistics of z-scores were calculated for each publication, which was then labeled according to conventionality and novelty: (i) HC if the median z-score exceeded the median of median z-scores for all publications and LC otherwise and (ii) HN if the tenth percentile of z-scores for a publication was less than zero and LN otherwise. We also analyzed the effect of defining high novelty using the first percentile of z-scores.

To consider the relationship between citation impact, conventionality, and novelty we calculated percentiles for the number of accumulated citations in the first 8 years since publication for each article we studied and stratified. We investigated multiple definitions of hit articles, with hits defined as the 1%, 2%, 5%, and 10% top-cited articles.

A source of misspecification arises from not accounting for disciplinary heterogeneity by treating all eligible references within WoS as equiprobable substituents when studying a disciplinary network. Under this model [22], the probability of selecting a reference from a discipline is identical to the proportion of the articles in WoS in that discipline for a given year. If the global model accurately reflects citation practice, the expected proportion of references within papers published in a given discipline D would be approximately equal to the proportion of references in D, and conversely, the degree to which the proportion deviates from the expected value would reflect the extent of model misspecification.

To study the disciplinary composition of references in our custom data sets, we first used the high level WoS classification of five major research areas: life sciences & biomedicine, physical sciences, social sciences, technology, and arts & humanities. The two largest of these research areas are physical sciences and life sciences & biomedicine, which contribute on average approximately 35.1% and 62.8%, respectively, of the references in WoS over the three years of interest.
Under the unconstrained model, we would expect close to 35% of the references cited by the publications in any large network to be drawn from the physical sciences and close to 63% of the references to be drawn from life sciences and biomedicine. Yet the empirical data present a very different story: roughly 80% of the references cited in physical sciences publications are from the physical sciences and 90% of the references cited in life sciences & biomedicine publications are from the life sciences & biomedicine. In other words, the empirical data show a strong tendency of publications to cite papers that are in the same major research area rather than in some other research area. Thus, there is a strong bias towards citations that are intra-network. Our observations are in agreement with [24], who found that, often, a majority of an article’s citations are from the specialty of the article, even though that percentage varied among disciplines in the eight specialties they investigated (from approximately 39% to 89% for 2006). Furthermore, these findings argue that a discipline-indifferent random graph model would exhibit misspecification in deviating substantially from the empirical data, and support the concern about definitions of innovation and conventionality that are based on deviation from expected values.

We also analyzed disciplinary composition at a deeper level using all 153 Subjects in the WoS extended classification and examining the consequences of citation shuffling within a disciplinary set or all of the Web of Science. References in publications belonging to these three data sets were summarized as a frequency distribution of 153 WoS Subjects as classes. A single shuffle of the references in the disciplinary data sets and in the corresponding WoS year slice was performed, using either the repcs or umsj algorithms, after which subject frequencies were computed again.
The fold difference in subject frequencies of references before and after shuffling was calculated for these groups using all 153 subject categories and summarized in the box plots in Fig 1. As an example, the applied physics data set contained one reference labeled Genetics and Heredity, but after the shuffle (using the WoS background), acquired 1496 references labeled Genetics and Heredity. Similarly, the metabolism data set contained one reference labeled Philosophy, but after a single shuffle (again using the WoS background) it had 661 occurrences with this label. The data show convincingly that a publication’s disciplinary composition of references in a network is preserved when citation shuffling is constrained to the network, but is significantly distorted when the WoS superset is used as a source of substitution. A second inference is that the two algorithms, repcs and umsj, have equivalent effects in this experiment (and so are only distinguishable for running time considerations).

We then tested the conjecture that model misspecification would be reduced by constraining the substitutions to disciplinary networks by examining the Kullback-Leibler (K-L) Divergence [11] between observed and predicted citation distributions, restricted to the set of journals in a given disciplinary network. The results (Table 3) confirm this prediction: simulations under the constrained model (where the background network is the local disciplinary network) consistently have a lower K-L divergence compared to simulations under the unconstrained model (where the background network is WoS). Furthermore, the K-L divergence for the unconstrained model is generally twice as large as the K-L divergence for the constrained models, with ratios that range from 1.96 to 2.77, and are greater than 2.0 in eight out of nine cases.
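For readers unfamiliar with the K-L divergence, a minimal Python sketch of the comparison is given here. The function name `kl_divergence`, the count-dictionary inputs, and the treatment of empty cells are assumptions of this illustration; the study's own calculation may differ in how it handles journals with zero predicted citations.

```python
import math

def kl_divergence(observed, predicted):
    """D_KL(P || Q) between two count distributions.

    observed, predicted: dicts mapping a journal to a citation count.
    Counts are normalized to probabilities before comparison.
    """
    p_total = sum(observed.values())
    q_total = sum(predicted.values())
    total = 0.0
    for key in set(observed) | set(predicted):
        p = observed.get(key, 0) / p_total
        q = predicted.get(key, 0) / q_total
        if p == 0:
            continue            # 0 * log(0 / q) is taken as 0
        if q == 0:
            return math.inf     # observed mass where the model predicts none
        total += p * math.log(p / q)
    return total
```

A lower value means the simulated distribution is closer to the observed one, which is the sense in which the constrained and unconstrained backgrounds are compared.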
These results clearly demonstrate that constraining reference substitutions to the given local disciplinary network better fits the observed data, and hence reduces model misspecification.

[Figure 1: box plots. Rows: repcs, umsj. x-axis: network background (ap_local, ap_WoS, imm_local, imm_WoS, metab_local, metab_WoS). y-axis: (log) fold change in subject frequency of references.]

Figure 1: Citation shuffling using the local network preserves the disciplinary composition of references within networks, but using the global network does not. Publications of type Article belonging to the three disciplinary networks (ap=applied physics, imm=immunology, and metab=metabolism) were subject to a single shuffle of all their cited references using either the local network (i.e., the cited references in these networks, denoted bg_local) or the global network (i.e., references from all articles in WoS, denoted bg_WoS) as the source of allowed substitutions, where “bg” indicates the disciplinary network. Citation shuffling was performed using either our algorithm (repcs, top row) or that of Uzzi et al. (umsj, bottom row). The disciplinary composition of cited references before and after shuffling was measured as frequencies for each of 153 sub-disciplines (from the extended subject classification in WoS) and expressed as a fold difference between citation counts grouped by subject for original (o) and shuffled (s) references using the formula fold_difference = ifelse(o > s, o/s, s/o) and rounded to the nearest integer. A fold difference of 1 indicates that citation shuffling did not alter disciplinary composition. Data are shown for articles published in 1985. All eight boxplots are generated from 153 observations each. Null values were set to .
Note y-axis values: log.

Calculation of Novelty and Conventionality using the constrained model

Since the constrained model better fits the observed data, we evaluated the distribution of highly cited articles (i.e., “hit articles”) in the four categories (HNHC, HNLC, LNHC, LNLC), for different thresholds for hit articles. Figure 2, Panels (a) and (b), compares hit rates for the four categories among the immunology, metabolism, applied physics, and WoS data sets for 1995, where the hit rate is defined as the number of hit articles in each category divided by the number of articles in the category. The calculation for the hit rates for the WoS data set (bottom row, Figure 2) mirrors Uzzi et al.’s results, whereby the largest hit rates were for the HNHC category, despite our methodological changes in sampling citations in proportion to their frequency. However, the trends for all three disciplinary networks are different from those for WoS. Specifically, the highest hit rates for the 1995 immunology and metabolism data sets are in the LNHC category for the top 1% of cited articles (and tied between LNHC and HNHC for the top 10%), and the highest hit rates for the 1995 applied physics data set are in the HNLC category for both the top 1% and top 10% of all cited articles. Thus, the category exhibiting the highest hit rate among highly cited papers depends on the specific disciplinary network and to some extent on the threshold for being highly cited.

Furthermore, the categories displaying the greatest hit rate vary to some extent with the year.
For example, when the 10% top-cited articles are deemed to be hits and novelty is defined at the 10th percentile of z-scores, the category with the highest hit rate in applied physics for 1995 is HNLC (12.3% versus 10.9% for HNHC), while the hit rate for HNHC is greater than for HNLC in 1985 and 2005 (13.2% versus 10.9%, and 11.4% versus 10.7%, respectively).

We evaluated the statistical significance of the categorical hit rates using multiple methods. Our first test was based on the null hypothesis that hits were distributed randomly among the four categories with uniform probability in proportion to the number of articles in each category. Rejecting the null hypothesis, using a Chi-Square Goodness of Fit test, supports a non-uniform dispersion of hits with some of the four categories being associated with higher or lower than expected hit rates. The null hypothesis was rejected at a p < . in all cases in Figure 2, with the exception of the immunology and applied physics data sets where hit articles are designated as the top 1% of articles: valid tests were not possible in those instances due to too few expected hits. The null hypothesis was rejected with p < . for all valid tests for all parameter settings, all data sets, and all years: hypothesis tests were valid in 73 of 96 instances.
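The labeling rule stated earlier (HC from the median z-score, HN from the tenth percentile), the categorical hit rate, and the goodness-of-fit statistic can be sketched together in Python. This is an illustrative sketch only: the function names are hypothetical, the tenth percentile uses a nearest-rank definition (other percentile conventions would give slightly different labels), and only the chi-square statistic is computed, not its p-value.

```python
from collections import Counter
from statistics import median

def label_publications(pub_z):
    """pub_z: dict publication_id -> non-empty list of journal-pair z-scores."""
    medians = {p: median(z) for p, z in pub_z.items()}
    cutoff = median(medians.values())        # median of per-publication medians
    labels = {}
    for p, z in pub_z.items():
        zs = sorted(z)
        tenth = zs[(len(zs) + 9) // 10 - 1]  # nearest-rank 10th percentile (assumed)
        novelty = "HN" if tenth < 0 else "LN"
        conventionality = "HC" if medians[p] > cutoff else "LC"
        labels[p] = novelty + conventionality
    return labels

def hit_rates(labels, hits):
    """Hits in each category divided by the number of articles in it."""
    totals = Counter(labels.values())
    hit_counts = Counter(labels[p] for p in hits)
    return {c: hit_counts[c] / n for c, n in totals.items()}

def chi_square_gof(labels, hits):
    """Chi-square statistic under the null that hits are spread across
    categories in proportion to the number of articles in each category."""
    totals = Counter(labels.values())
    hit_counts = Counter(labels[p] for p in hits)
    n_pubs, n_hits = sum(totals.values()), len(hits)
    stat = 0.0
    for c, n in totals.items():
        expected = n_hits * n / n_pubs
        stat += (hit_counts[c] - expected) ** 2 / expected
    return stat
```

With four populated categories the statistic has three degrees of freedom, so a value above 7.815 corresponds to p < 0.05. Very small expected counts make the approximation unreliable, which is consistent with the invalid tests noted above for small data sets at the top 1% threshold.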
We conclude that it is likely that the distribution of hits among categories is not uniform and that, instead, hit rates vary among the categories in all disciplinary data sets.

We also tested the explanatory power of each framework dimension by classifying articles
as LN or HN and, separately, as LC or HC. We tested the null hypothesis that hits are distributed between LN and HN (LC and HC) in proportion to the total number of articles assigned to those categories. That null hypothesis was rejected for the WoS data along both dimensions. Consistent with prior findings, hit articles were overrepresented in the HC category in every instance of WoS data at a p < . and also overrepresented in the HN category at a p < . in all but two cases: the p-values in those exceptions were . and . . Hits in the immunology and metabolism data were overrepresented in the HC category with the same statistical significance as for WoS. The relationship of novelty with hits in the immunology and metabolism data sets differed dramatically from WoS, however, with statistically significant findings of hit articles being sometimes overrepresented in the LN category, and sometimes being underrepresented.

[Figure 2: bar charts for Applied Physics, Immunology, Metabolism, and WoS. Axes: Novelty (High, Low), Conventionality (Low, High), Category Hit Rate. Panels: (a) Top 1% of cited articles, (b) Top 10% of cited articles.]

Figure 2: Effect of using the improved model on categorical hit rates for Immunology, Applied Physics, and WoS for 1995. Panels (a) and (b) show hit rates for the LNLC, LNHC, HNLC, and HNHC categories for the applied physics, immunology, metabolism, and WoS data sets when hit articles are defined as the top 1% and top 10% of articles, respectively. Novelty in both panels is defined at the 10th percentile of articles’ z-score distributions. The results for the WoS data set also show that the highest hit rate is for the HNHC category. Results for the three disciplinary networks all differ from the overall WoS results: the highest hit rates for the immunology and metabolism data sets are in the LNHC category and the highest hit rate for the applied physics data set is in the HNLC category. The number of data points in the applied physics, immunology, metabolism, and WoS data sets are 18,305, 21,917, 97,405, and 476,288, respectively.
Consistent with WoS, hit articles in applied physics were positively related with HN with a statistical significance of at least p < . in all 12 parameter sets, and at p < . in 10 of 12 cases. In contrast, a strong positive relationship was found between LC and hit articles in applied physics in 5 of 12 instances with p < . . These results suggest that (1) both conventionality and novelty are strongly related to hits in WoS, (2) the conventionality dimension is strongly related with hits in immunology and metabolism and novelty is not, and (3) novelty is more strongly related with hits in applied physics than is conventionality. More generally, we find that the dimensions most strongly related with hit articles vary between disciplinary and broad data sets, and also among disciplines.

We described concerns with model misspecification along two general dimensions: the background data set and the sampling methodology for the random graph. The differences we found from prior research, in terms of which categories demonstrated the highest hit rates, arose both from using disciplinary data sets and from our sampling methodology, repcs, acting through the article z-score distributions. For example, when one algorithm shifts z-scores downward relative to another, the former can yield a larger percentage of HN articles. We therefore examined the extent to which each of our methodological differences contributed to our observations. We found that z-scores changed sign more as a consequence of background network (local network or WoS) and much less as a consequence of sampling algorithm (umsj or repcs).
For example, on the immunology data set, 28.6% of the journal pairs changed signs with our sampling algorithm (repcs) when the background network was changed from global (WoS) to local, whereas only 2.8% of z-scores changed signs in the WoS data set depending on whether umsj or repcs was used.

We conclude that the choice of background data set is the source of a majority of the differences we observed in the categories demonstrating the highest hit rates, although our sampling approach, most notably sampling from a multiset so as to reflect the observed frequencies of individual citations as well as their associated journals and disciplines, can also create material differences.

Discussion
The principal difference between the two models we discuss is a single parameter: the set of references that can be used as substitutes during the substitution process. The keyword search we use also has the advantage of selecting only relevant articles from multidisciplinary journals. However, it is important to note that the local networks we evaluated are not monodisciplinary; the references cited within them exhibit disciplinary diversity. We provided several lines of evidence showing that changing this one parameter from a global network to the local disciplinary network reduces model misspecification. Using the constrained model (which allows substitutions only within the local network) instead of the unconstrained model (which allows substitutions in the WoS network) produces different trends in terms of conventionality and novelty, depending on the network and the parent discipline. In particular, when using the unconstrained model, highly cited papers were most likely to be in the HNHC category, but this trend does not consistently hold when using the constrained model. Instead, we find that conventionality flavored with novelty is not generally a feature of impactful research. Further, high "novelty" is not always indicative of impactful research.

More generally, these results show that the trends approaching universality in highly cited papers are not robust to changes in the thresholds for defining high impact or high novelty articles, or over time, and may be the consequence of using a random model that fits the observed data poorly. On the other hand, while the constrained model reduces model misspecification compared to the unconstrained model, this does not imply that the constrained model is reasonable, nor that trends observed under the constrained model convincingly explain scientific practice.
Indeed, there are significant challenges in using random models to understand human behavior, of which citation practice is one example. As we note above, under our conditions of analysis, the trends for all three disciplinary networks are different from those for WoS.

Our work has shown that the use of local networks enables simulations that are more consistent with research citation patterns. Further work might explore additional constraints on random assignment of citations to publications to better align benchmarks with citation practice. For example, proximity defined by co-author networks [24] might be considered when defining probabilities for citation substitutions. Another interesting but challenging direction would be to find ways to distinguish intra-disciplinary from cross-disciplinary novelty. In this respect, the related work of Wang et al. [25] is insightful with its use of empirical data and its observations on novelty and quality, as well as on the dispersion and kinetics of accrued citations of articles classified as novel.

We note that journals are used as grouping units for articles in the three studies we discuss [25, 22, 2] as well as in this one. While we used keyword searches to identify sets of articles, we still relied on journal grouping to generate z-scores. Such a grouping, while appealing on account of its relative simplicity, obscures measurements of novel pairings at the article level. Journals are also of limited use in representing individual fields, and repeating some of these studies using article clusters may be more informative [21, 9]. Various factors contribute to citation counts [16, 23], and further study of these in the context of co-citation analysis may be of interest. We also acknowledge the limitations of using citation counts to identify impactful publications. Overall, evaluation in context [6] and further consideration of the disciplinary nature of the scientific enterprise are likely to result in improved models that yield further knowledge.
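The contrast between the unconstrained and constrained models, and the sign-change comparison reported in the results, can be made concrete with a toy sketch. It is deliberately simplified and is not the umsj or repcs algorithm: each reference is replaced by a uniform draw from a multiset pool of journals, observed journal-pair counts are compared with ten shuffled networks, and z-scores are formed by shifting and scaling the observed counts against the simulated ones. The function names, the toy articles, and the extra journals in the global pool are all hypothetical.

```python
import random
import statistics
from collections import Counter
from itertools import combinations


def journal_pair_counts(articles):
    """Count co-cited journal pairs; each article is a list of cited journals."""
    counts = Counter()
    for refs in articles:
        for a, b in combinations(sorted(refs), 2):
            counts[(a, b)] += 1
    return counts


def shuffled_counts(articles, pool, rng):
    """Replace every reference with a draw from the pool (a multiset, so
    frequently cited journals are drawn more often), then recount pairs.
    Simplified stand-in for the study's citation shuffling."""
    shuffled = [[rng.choice(pool) for _ in refs] for refs in articles]
    return journal_pair_counts(shuffled)


def z_scores(articles, pool, n_sims=10, seed=0):
    """z = (observed - mean(simulated)) / sd(simulated) per journal pair."""
    rng = random.Random(seed)
    observed = journal_pair_counts(articles)
    sims = [shuffled_counts(articles, pool, rng) for _ in range(n_sims)]
    scores = {}
    for pair, obs in observed.items():
        expected = [sim[pair] for sim in sims]  # Counter gives 0 if absent
        sd = statistics.pstdev(expected)
        if sd > 0:  # skip pairs whose simulated counts never vary
            scores[pair] = (obs - statistics.mean(expected)) / sd
    return scores


def sign_change_fraction(z_a, z_b):
    """Fraction of shared pairs whose z-score is positive under one
    background but not the other (zero is treated as non-positive)."""
    shared = [pair for pair in z_a if pair in z_b]
    if not shared:
        return 0.0
    changed = sum(1 for pair in shared if (z_a[pair] > 0) != (z_b[pair] > 0))
    return changed / len(shared)


# Hypothetical local network: four articles citing journals J1 to J3.
local_articles = [["J1", "J2", "J2"], ["J1", "J2"], ["J2", "J3"], ["J1", "J3", "J3"]]
local_pool = [j for refs in local_articles for j in refs]  # constrained model
global_pool = local_pool + ["J4", "J5", "J6"] * 5          # unconstrained model
z_local = z_scores(local_articles, local_pool, seed=1)
z_global = z_scores(local_articles, global_pool, seed=1)
fraction_changed = sign_change_fraction(z_local, z_global)
```

Because the global pool contains journals that the local articles never cite, simulated counts for local journal pairs tend to be lower under the unconstrained pool, which pushes z-scores upward. This is the kind of background-dependent shift that can move a pair from one side of zero to the other.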
We thank two anonymous reviewers for helpful comments. We thank the authors of Uzzi et al. [22], who kindly shared their Python code for citation shuffling. We are grateful to Kevin Boyack and Dick Klavans for constructively critical discussions. We also thank Stephen Gallo and Scott Glisson for helpful suggestions. Research and development reported in this publication was partially supported by Federal funds from the National Institute on Drug Abuse, National Institutes of Health, US Department of Health and Human Services, under Contract Nos. HHSN271201700053C (N43DA-17-1216) and HHSN271201800040C (N44DA-18-1216). The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. TW receives funding from the Grainger Foundation.
The authors have no competing interests. Web of Science data leased from Clarivate Analytics was used in this study. Clarivate Analytics had no role in conceptualization, experimental design, review of results, conclusions presented, or funding. Avon Davey's present affiliation is GlaxoSmithKline, Research Triangle Park, NC, USA. His contributions to this article were made while he was a full-time employee of NET ESolutions Corporation.
Access to the bibliographic data analyzed in this study requires a license from Clarivate Analytics. We have made supplementary data available on Mendeley Data at DOI: 10.17632/4n8ns8vzvz. Code generated for this study is freely available from our GitHub site [10].
Author Contributions
Conceptualization, GC, JB, SD, and TW; Methodology, AD, DK, GC, JB, SD, SL, and TW; Investigation, DL-H, GC, JB, and SD; Writing - Original Draft, GC, JB, TW; Writing - Review and Editing, AD, DK, DL-H, GC, JB, SD, SL, and TW; Funding Acquisition, GC; Resources, DK and GC; Supervision, GC. Authors are listed in alphabetical order.
References

[1] Kevin Boyack and Richard Klavans. Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12):2389–2404, 2010.
[2] Kevin Boyack and Richard Klavans. Atypical combinations are confounded by disciplinary effects. In International conference on science and technology indicators, pages 49–58, Leiden, Netherlands, 2014. CWTS-Leiden University.
[3] D. J. de Solla Price. Networks of Scientific Papers. Science, 149(3683):510–515, July 1965.
[4] Eugene Garfield. Citation Science: A New Dimension in Documentation through Association of Ideas. Science, 122(3159):108–111, July 1955.
[5] Eugene Garfield. Citation Indexing - Its Theory and Application in Science, Technology, and Humanities. John Wiley and Sons, ISI Press, New York, NY, USA, 1st edition, 1979.
[6] Diana Hicks, Paul Wouters, Ludo Waltman, Sarah de Rijcke, and Ismael Rafols. Bibliometrics: The Leiden Manifesto for research metrics. Nature News, 520(7548):429, April 2015.
[7] Samet Keserci, Avon Davey, Alexander R Pico, Dmitriy Korobskiy, and George Chacko. ERNIE: A data platform for research assessment. bioRxiv, 2018.
[8] Richard Klavans and Kevin W. Boyack. Research portfolio analysis and topic prominence. Journal of Informetrics, 11(4):1158–1174, November 2017.
[9] Richard Klavans and Kevin W. Boyack. Which Type of Citation Analysis Generates the Most Accurate Taxonomy of Scientific Technical Knowledge. Journal of the Association for Information Science and Technology, 68(4):984–998, 2017.
[10] D. Korobskiy, A. Davey, S. Liu, S. Devarakonda, and G. Chacko. Enhanced Research Network Informatics Environment (ERNIE) https://github.com/netesolutions/ernie. Github repository, NET ESolutions Corporation, 2019.
[11] S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, March 1951.
[12] Irina Marshakova-Shaikevich. System of document connections based on references. Nauchno-Tekhnicheskaya Informatsiya Seriya 2-Informatsionnye Protsessy I Sistemy, 6(4):3–8, July 1973.
[13] Henk F. Moed. Measuring contextual citation impact of scientific journals. Journal of Informetrics, 4(3):265–277, 2010.
[14] M. E. J. Newman. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2):404–409, January 2001.
[15] G. S. Patience, C. A. Patience, B. Blais, and F. Bertrand. Citation analysis of scientific categories. Heliyon, 3(5):e00300, May 2017.
[16] H. P. F. Peters and Anthony F. J. van Raan. On Determinants of Citation Scores: A Case Study in Chemical Engineering. JASIS, 45:39–49, 1994.
[17] Xiaolin Shi, Jure Leskovec, and Daniel A. McFarland. Citing for high impact. In Proceedings of the 10th annual joint conference on digital libraries, JCDL '10, pages 49–58, New York, NY, USA, 2010. ACM.
[18] Henry Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265–269, July 1973.
[19] Stephen Stigler. Citation patterns in the journals of statistics and probability. Statistical Science, 9(1):94–108, 1994.
[20] J. Sueur, T. Aubin, and C. Simonis. Seewave: a free modular tool for sound analysis and synthesis. Bioacoustics, 18:213–226, 2008.
[21] V. A. Traag, L. Waltman, and N. J. van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1):1–12, March 2019.
[22] Brian Uzzi, Satyam Mukherjee, Michael Stringer, and Ben Jones. Atypical combinations and scientific impact. Science (New York, N.Y.), 342(6157):468–472, October 2013.
[23] E. S. Vieira and J. A. N. F. Gomes. Citations to scientific articles: Its distribution and dependence on the article features. Journal of Informetrics, 4(1):1–13, January 2010.
[24] Mathew L. Wallace, Vincent Lariviere, and Yves Gingras. A small world of citations? The influence of collaboration networks on citation practices. PLOS One, 7(3):e33339, 2012.
[25] Jian Wang, Reinhilde Veugelers, and Paula Stephan. Bias against novelty in science: A cautionary tale for users of bibliometric indicators. Research Policy, 46(8):1416–1436, October 2017.
[26] Harriet Zuckerman. The Sociology of Science and the Garfield Effect: Happy Accidents, Unanticipated Developments and Unexploited Potentials.