A SIR epidemic model for citation dynamics
Sandro M. Reia and José F. Fontanari
Instituto de Física de São Carlos, Universidade de São Paulo, Caixa Postal 369, 13560-970 São Carlos, São Paulo, Brazil
The study of citations in the scientific literature crosses the boundaries between the traditional branches of science and stands on its own as a most profitable research field dubbed the ‘science of science’. Although the understanding of the citation histories of individual papers involves many intangible factors, the basic assumption that citations beget citations can explain most features of the empirical citation patterns. Here we use the SIR epidemic model as a mechanistic model for the citation dynamics of well-cited papers published in selected journals of the American Physical Society. The estimated epidemiological parameters offer insight into otherwise unknown quantities, such as the size of the community that could cite a paper and its ultimate impact on that community. We find a good, though imperfect, agreement between the rank of the journals obtained using the epidemiological parameters and the impact factor rank.
I. INTRODUCTION
Regardless of the controversial and widespread use of citation-based measures as a quantitative proxy of a paper’s importance [1–6], the study of citations seems to have acquired a life of its own [7]. In fact, citation networks, citation distributions and citation dynamics are topics that range over most of the issues addressed by the science of complexity. In addition, the large citation datasets, which unfortunately are rarely freely accessible, make the subject very attractive since, contrary to most complex systems problems, the theories about citation patterns can readily be tested against empirical data [8–10].

A remarkable outcome of the quantitative study of citation patterns is the realization that starkly different citation histories, such as the rare ‘sleeping beauties’ (i.e., papers that are not cited for a long while and then suddenly become popular [11, 12]) or the more common ‘shooting stars’ (i.e., papers that are highly cited initially but die quickly), can be explained by tuning a few parameters of mechanistic models of the citation dynamics [13–15]. A seemingly natural mechanistic model to describe the spread of ideas in academia is the SIR epidemic model [16], which, however, has not yet been applied to the analysis of the citation histories of individual papers.

Accordingly, here we use the SIR epidemic model to fit the number of citations received over a period of 15 years by 300 hit papers published in 6 selected APS journals. We define hit papers as the 50 most cited papers published from 2000 to 2003 in each of those journals. The data used in our analysis are from the American Physical Society Data Sets for Research (available upon request at [17]), which include only internal citations, i.e., citations to papers published in APS journals from papers published in APS journals as well. In Sec. II, we offer a brief overview of the APS Dataset.

The epidemiological parameters of the SIR epidemic model have a direct interpretation in terms of the citation dynamics, as we explain in detail in Sec. III. Following the maxim ‘citations beget citations’ we assume that the citations of a given paper are promoted by a certain number of influential papers whose bibliographies include that paper and whose potential to influence yet-to-be-written papers to cite it is determined by the transmission parameter $\beta$. Papers cease to be influential at a rate $\gamma$. The ratio $R_0 = \beta/\gamma$ is very close to 1 for most of the hit papers considered here, so that the ‘shooting stars’ and the ‘sleeping beauties’ citation patterns are obtained for large and small values of $\beta$, respectively. An epidemiological parameter of particular interest is the potential maximum number of citations $S_0$ a hit paper can receive, which yields an estimate of the size of the community that could in principle cite that paper. The SIR model allows the ready estimate of the total number of citations a paper ever acquires, $\Upsilon_\infty \le S_0$, which can be seen as the ultimate impact of a paper [15]. The relative ultimate impact $\Upsilon_\infty/S_0$ happens to depend on the basic reproductive number $R_0$ only.

The distributions of the epidemiological parameters for each selected journal enable the ranking of the journals using their medians, as we describe in Sec. IV. In particular, the ranks produced by the medians of the distributions of the parameters $S_0$ and $\beta$ interchange the positions of two journals only in comparison with the Impact Factor (IF) rank [18]. In general, however, the higher the IF of a journal, the lower its relative ultimate impact.

II. APS DATASET
The APS Dataset comprises citing article pairs and article metadata of about 636000 papers published in 17 journals of the American Physical Society from 1893 to 2018 [17]. The first journal, Physical Review, ceased publication in 1969, and the most recent journal in the dataset, Physical Review Materials, was launched in 2017.

With more than one century of existence, the APS journals lived through World War I and World War II. The total number of papers published per year in the APS journals, shown in the upper panel of Fig. 1, reveals the distinct impact these two major events had on the academic productivity of physicists. Whereas WWI caused no discernible change in the number of papers published in the Physical Review, most likely because this journal was not yet the main publication choice of European physicists, WWII caused a sharp drop in the number of published papers, which reflects the worldwide disruption this event produced in all activities unrelated to warfare. More importantly, this panel shows that the number of published papers grows at an exponential rate [1], which probably prompted the splitting of the Physical Review Series II into Physical Review A, B, C, and D in 1970. Nevertheless, the exponential growth trend continued for the offspring journals, leading to the current difficulty physicists have in keeping pace with the advances of their own research subfields [19, 20]. The lower panel of Fig. 1 shows the number of papers published in the APS journals we will consider in this paper. In addition to the journals already mentioned, the panel includes Physical Review Letters, which was introduced in 1958, and Physical Review E, which was launched in 1993. For completeness, we also include in the panel the Physical Review, Series II, which replaced the Physical Review, Series I and was active from 1913 to 1969. Those are the 7 APS journals with the largest number of papers published since 1913.

FIG. 1. Number of papers published in all APS journals from 1893 to 2018 (upper panel, which labels World War I and World War II). The dashed straight line is the fitting function f(x) = a exp[(x − )/b] with a = 63 and b = 18. The lower panel shows the number of papers published in Phys. Rev. Series II, Phys. Rev. Lett., Phys. Rev. A, B, C, D and E.
TABLE I. Number of papers published from 2000 to 2018 in the 6 APS journals used in our citation dynamics analysis.
APS Journal    Number of papers
Phys. Rev. A
Phys. Rev. B
Phys. Rev. C
Phys. Rev. D
Phys. Rev. E
Phys. Rev. Lett.
Here we focus on the 54874 papers published in Phys. Rev. Lett., Phys. Rev. A, B, C, D, and E from 2000 to 2018. Table I shows the number of papers published in each of these journals in that time window. For each journal we pick the 50 papers published from 2000 to 2003 that received the highest number of citations up to 15 years (180 months) after their publication dates. We call those papers hit papers, and next we show how to model their citation patterns using an epidemiological model. We refer the reader to Ref. [9] for the analysis of the citation statistics of Physical Review from 1893 through 2003.

III. EPIDEMIOLOGICAL MODEL
We characterize the citation histories of the hit papers by their cumulative number of citations received in the period of 180 months from their publication dates. Figure 2 illustrates this quantity for three representative citation histories in the time window considered. In particular, the number of citations of the paper shown in the upper panel [21] exhibits a very rapid increase followed by stabilization within about 30 months after its publication. This is the editors’ dream paper, for it has the perfect timing to boost the IF of a journal. The citation record of the paper shown in the middle panel [22] exhibits a steady and consistent growth of a little less than one citation per month. Perhaps the most interesting citation pattern is that of the paper exhibited in the lower panel [23], which displays a latent period followed by a steady speed-up of the number of citations. This panel illustrates the ‘sleeping beauties’ citation pattern that makes the prediction of the impact of a piece of research using short-time information (say, a 24-month window) a somewhat shortsighted enterprise [11, 12].

The main assumption behind our approach to model the citation histories illustrated in Fig. 2 is that citations beget more citations, so that the citation dynamics can be modeled as the spread of an infectious disease. This means that a particular hit paper comes to the knowledge of prospective citing authors through the reading of papers that cite the hit paper. Here we model the citation dynamics of a hit paper using the popular SIR model [24, 25], where the susceptible (S), infected (I) and removed (R) classes must be properly reinterpreted within the citation dynamics context.

FIG. 2. Cumulative number of citations $\Upsilon$ of three representative hit papers as a function of the time $t$, measured in months, after their publication dates. The symbols are the citation data extracted from the APS dataset and the solid curves are the fits with the SIR model.
The epidemiological parameters $[S_0, \beta, \gamma]$ are [42000 , . , . 25] (upper panel), [3150 , . , . 47] (middle panel) and [1050 , . , . 10] (lower panel).

In particular, once a hit paper is published we assume that there is a maximum number of citations it can receive, which we denote by $S_0$. This is the number of papers in an abstract population of papers not yet written that are susceptible to cite the hit paper. This number can only decrease with time and we denote by $S(t) \le S_0$ the number of citations the hit paper can still receive after time $t$ from its publication date. Of course,

$$\Upsilon(t) = S_0 - S(t) \tag{1}$$

is the measurable total number of citations the hit paper received until time $t$, which is shown in Fig. 2 for three selected hit papers. In principle, $S_0$ could be estimated if we knew the size of the community that works on the subject addressed by the hit paper, the mean number of papers published per month by researchers in that community and the average number of references those papers contain.

A simple way to model the decrease of the number of susceptible papers with time is through the contact process

$$\frac{dS}{dt} = -\beta \frac{S I}{N}, \tag{2}$$

where $I = I(t)$ is the number of papers that have cited the hit paper before or at time $t$ and that can still influence susceptible papers to cite that paper. In other words, $I(t)$ is the number of influential (or infective) papers. Although the hit paper does not cite itself, we will assume that it contributes to $I(0) = I_0$. We note that the hit paper may not contribute to $I(t)$, i.e., it may not be influential any more at time $t > 0$. The parameter $\beta$ in Eq. (2) is a measure of the persuasion power of the influential papers, i.e., it is a measure of the likelihood that an author will cite the hit paper because that author read a paper that cites the hit paper. Because $\beta$ is a per capita transmission rate we have introduced the constant factor $N = S_0 + I_0$ in Eq. (2) to guarantee that its value is on the order of 1 regardless of the value of $S_0$, and that this equation is dimensionally correct. (The unit of $S$ and $I$ is papers.)

The equation for the number of influential papers,

$$\frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I, \tag{3}$$

makes plain the fair assumption that influential papers cease to be influential at a rate $\gamma$ and move into the removed class. Papers in the removed class play no role in the citation dynamics and their number $R = R(t)$ is given by

$$\frac{dR}{dt} = \gamma I. \tag{4}$$

Since $d(S + I + R)/dt = 0$ we have $S(t) + I(t) + R(t) = S_0 + I_0 = N$, because the removed class is empty at $t = 0$. Moreover, since the numbers of citations are reported on a monthly basis, we use the month as our time unit so that $\beta$ and $\gamma$ have unit 1/month.

We note that our epidemiological approach builds on an assumption different from the vastly popular cumulative advantage or preferential attachment assumption, in which the probability that a publication is cited is an increasing function of its current total number of citations [2, 9]. In our case, this probability is a function of a (variable) fraction of the total number of citations, namely, it is a function of the number of influential papers.

Perhaps the most interesting quantity in the citation dynamics context is

$$\Upsilon_\infty = \lim_{t \to \infty} \left[ S_0 - S(t) \right] = S_0 - S_\infty,$$

which gives the total number of citations a hit paper acquires during its lifetime, i.e., its ultimate impact. Of course, $\Upsilon_\infty$ cannot be measured but can be easily inferred using our epidemiological approach. In fact, it is given by the positive root of the transcendental equation [25]

$$\Upsilon_\infty = S_0 \left[ 1 - \exp\left( -\frac{\Upsilon_\infty + I_0}{N \rho} \right) \right], \tag{5}$$

where $\rho = \gamma/\beta$. Hence $\Upsilon_\infty = S_0$ only in the limit $\rho \to 0$. For nonzero $\rho$, a hit paper will receive only the fraction $\upsilon = \Upsilon_\infty/S_0$ of the potential citations it could receive. For instance, for the papers analyzed in Fig. 2, the estimated total number of citations they will receive is $\Upsilon_\infty = 1060$ (upper panel), 166 (middle panel) and 446 (lower panel). Assuming that $I_0 \ll N \approx S_0$ we rewrite Eq. (5) as

$$\upsilon = 1 - \exp\left( -\frac{\upsilon}{\rho} \right), \tag{6}$$

which has a nonzero solution provided that $R_0 = 1/\rho > 1$. Here $R_0$ is the basic reproductive number that ultimately determines the overall impact of a hit paper on the abstract population of susceptible papers. Since our focus is on hit papers only, we have $R_0 > 1$. The estimated values of $R_0$ of the hit papers were all very close to 1.

The SIR model has only three adjustable parameters, namely, $S_0$, $\beta$ and $\gamma$, and our goal is to estimate these parameters by fitting Eq. (1) to the cumulative number of citations extracted from the APS dataset. The quality of the fits can be appreciated in Fig. 2. Incidentally, the paper considered in the upper panel of this figure has the largest value of $\beta$ among the 300 hit papers considered in our study.

IV. EPIDEMIOLOGICAL JOURNAL RANKING
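As a minimal sketch of how the model of Sec. III can be compared with data, the Python code below integrates Eqs. (2)-(4) with a fourth-order Runge-Kutta scheme and returns the monthly values of $\Upsilon(t)$ from Eq. (1). The initial condition $I_0 = 1$, the step size, the function names and the sum-of-squares misfit are assumptions made for illustration only; the text does not specify the numerical method or the objective function used in the fits.

```python
def sir_cumulative_citations(s0, beta, gamma, months=180, i0=1.0, steps_per_month=20):
    """Return [Upsilon(0), ..., Upsilon(months)] with Upsilon(t) = S0 - S(t), Eq. (1).

    Integrates dS/dt = -beta*S*I/N and dI/dt = beta*S*I/N - gamma*I
    (Eqs. (2) and (3)) with N = S0 + I0. The value i0 = 1 is an assumption.
    """
    n = s0 + i0

    def rhs(s, i):
        new_citations = beta * s * i / n
        return -new_citations, new_citations - gamma * i

    h = 1.0 / steps_per_month  # time step in months
    s, i = float(s0), float(i0)
    upsilon = [0.0]
    for _ in range(months):
        for _ in range(steps_per_month):
            # Classical fourth-order Runge-Kutta step
            k1s, k1i = rhs(s, i)
            k2s, k2i = rhs(s + 0.5 * h * k1s, i + 0.5 * h * k1i)
            k3s, k3i = rhs(s + 0.5 * h * k2s, i + 0.5 * h * k2i)
            k4s, k4i = rhs(s + h * k3s, i + h * k3i)
            s += h * (k1s + 2 * k2s + 2 * k3s + k4s) / 6.0
            i += h * (k1i + 2 * k2i + 2 * k3i + k4i) / 6.0
        upsilon.append(s0 - s)
    return upsilon


def sum_squared_error(params, observed):
    """Hypothetical misfit between the model and monthly cumulative citation counts."""
    s0, beta, gamma = params
    model = sir_cumulative_citations(s0, beta, gamma, months=len(observed) - 1)
    return sum((m - o) ** 2 for m, o in zip(model, observed))


# Illustrative parameter values, not taken from the paper
curve = sir_cumulative_citations(s0=1000.0, beta=1.05, gamma=1.0)
```

Minimizing `sum_squared_error` over $(S_0, \beta, \gamma)$ with any standard optimizer would give one possible way to obtain estimates of the kind discussed in this section.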
The results of our fitting procedure are summarized in Fig. 3, which shows the epidemiological parameters $S_0$, $\beta$ and $\gamma$ that best fit the theoretical estimate of $\Upsilon(t)$ to the empirical cumulative number of citations. Each symbol in a panel corresponds to the estimated epidemiological parameter of a particular hit paper. Because of the considerable spread of the values of the estimates, which is particularly pronounced for $S_0$, it is convenient to summarize the parameter distributions by their medians, $\tilde{S}_0$, $\tilde{\beta}$ and $\tilde{\gamma}$, which are shown in Table II together with the medians of the basic reproductive number $\tilde{R}_0$ and of the ultimate fraction of the potential citations received $\tilde{\upsilon}$. The order of the journals listed in this table is determined by the value of $\tilde{S}_0$.

FIG. 3. Epidemiological parameters $S_0$, $\beta$ and $\gamma$ that best fit the theoretical estimate of $\Upsilon(t)$ to the empirical cumulative citation numbers. Each symbol corresponds to a particular hit paper published in the indicated APS journal. The unit of $\beta$ and $\gamma$ is 1/month and the unit of $S_0$ is papers.

TABLE II. Medians of the potential number of citations ($\tilde{S}_0$), the per capita transmission rate ($\tilde{\beta}$), the removal rate ($\tilde{\gamma}$), the basic reproductive number ($\tilde{R}_0$) and the fraction of the potential citations received ($\tilde{\upsilon}$) for the selected APS journals.
APS Journal    $\tilde{S}_0$    $\tilde{\beta}$    $\tilde{\gamma}$    $\tilde{R}_0$    $\tilde{\upsilon}$
Phys. Rev. Lett.
Phys. Rev. D
Phys. Rev. B
Phys. Rev. A
Phys. Rev. E
Phys. Rev. C

Remarkably, Table II shows that $\tilde{S}_0$, $\tilde{\beta}$ and $\tilde{\gamma}$ are good predictors of the rank of the selected APS journals according to the IF metric [18]. In fact, if we consider that the hit papers were published between 2000 and 2003, the epidemiological rank offered in this table interchanges the positions of Phys. Rev. C and Phys. Rev. E only as compared with the IF rank (see Fig. 4). The good agreement between these ranks is not very surprising in the sense that it is well known that the size of the community, which in our approach is measured by $S_0$, correlates well with the IF [26]. It is important to note, however, that $S_0$ does not correlate well with the number of papers published in a journal during the period of analysis shown in Table I. In addition and perhaps more importantly, because the IF is measured in a 24-month window, it correlates well with the transmission rate $\beta$, since large values of this parameter result in many citations in a short time provided there are plenty of susceptible papers (see upper panel of Fig. 2). These findings validate our theoretical approach to model the citation dynamics as well as the procedure we used to estimate the epidemiological parameters.

FIG. 4. Evolution of the Impact Factor from 2000 to 2019 of the journals Phys. Rev. Lett., Phys. Rev. A, B, C, D and E. The hit papers considered in this paper were published between 2000 and 2003.

The positive correlation between the IF metric and $\tilde{\gamma}$ is more intriguing and, perhaps, illuminating. We recall that a high value of $\gamma$ implies a quick obsolescence of the influential papers, i.e., those papers are influential for a short period of time only. This means that the epidemiological approach predicts that papers that cite hit papers published in high impact factor journals are not likely to be very impactful themselves. This scenario is evocative of the many application papers that use a novel analysis method presented in a hit paper.

The surprising finding revealed in Table II is the negative correlation between $\tilde{\beta}$ (and hence the IF metric) and $\tilde{R}_0$ or $\tilde{\upsilon}$. We recall that $R_0$ and $\upsilon$ are related through Eq. (6).
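Equation (6) has no closed-form solution, but its nonzero root is easy to obtain numerically. The sketch below (function names are illustrative and not part of the original analysis) finds $\upsilon$ by bisection for a given $R_0 = 1/\rho$ and also inverts the relation as $R_0 = -\ln(1 - \upsilon)/\upsilon$, which follows directly from Eq. (6). Under the approximation $I_0 \ll N \approx S_0$, the upper-panel paper of Fig. 2 has $\upsilon = 1060/42000 \approx 0.025$, which corresponds to $R_0 \approx 1.013$. For $R_0$ close to 1 the root behaves as $\upsilon \approx 2(R_0 - 1)$ to leading order.

```python
import math


def relative_ultimate_impact(r0, tol=1e-12):
    """Nonzero root of v = 1 - exp(-r0 * v), i.e. Eq. (6) with rho = 1/r0.

    Returns 0.0 when r0 <= 1, since only the trivial root exists in that case.
    """
    if r0 <= 1.0:
        return 0.0

    def g(v):
        # g(v) = 1 - exp(-r0*v) - v, written with expm1 for accuracy at small v
        return -math.expm1(-r0 * v) - v

    # g is concave with g(0) = 0 and a maximum at v = ln(r0)/r0, so the nonzero
    # root lies between that maximum and v = 1, where g(1) = -exp(-r0) < 0.
    lo, hi = math.log(r0) / r0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def r0_from_impact(v):
    """Invert Eq. (6): R0 = -ln(1 - v) / v for 0 < v < 1."""
    return -math.log1p(-v) / v


# Upper-panel paper of Fig. 2, using the values quoted in the text
r0_upper = r0_from_impact(1060 / 42000)
```

The inversion makes explicit why values of $R_0$ only slightly above 1 go together with small fractions $\upsilon$ of the potential citations.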
Actually, the true relevance of $R_0$ can be appreciated only through its link to $\upsilon$, which reveals that $R_0 >$ […] $S_0$ is taken into account, hit papers published in high impact journals actually have a smaller (relative) number of citations than hit papers published in low impact journals. We note that $\tilde{R}_0$ and $\tilde{\upsilon}$ are not related by Eq. (6): these quantities are estimated from the distributions of $R_0$ and $\upsilon$ of the hit papers for each journal.

V. CONCLUSION
The literature already offers several mechanistic models for the citation dynamics of individual papers. Some of them build on the similarity between the S-shaped curves of the cumulative number of citations and the curves that describe the diffusion of innovations to argue that the same mechanisms that drive the adoption of a new product, viz., innovation and imitation [27, 28], may explain the citation process as well [13, 14]. However, what is likely the most successful mechanistic model of citation dynamics builds on assumptions proper to this dynamics, viz., preferential attachment, fitness and aging [15]. As already pointed out, preferential attachment or cumulative advantage means that the probability that a publication is cited is an increasing function of its current number of citations [1, 2, 9]. Fitness expresses the notion that papers differ with respect to the perceived novelty and importance of their contents [29, 30], and aging captures the fact that the perceived novelty and importance of a paper eventually fade out [31]. There are, of course, many intangible factors behind an author’s decision to cite a paper, such as the reputation of its authors and the journal where it was published, that can be identified in a citation network analysis but cannot be implemented in a mechanistic model [32].

Here we take a different approach that is inspired by the attempt to describe the spread of Feynman diagrams through the theoretical physics communities of different countries using models of epidemics [16]. In particular, we fit the citation history of 300 hit papers from 6 selected APS journals using a SIR epidemic model. The advantage of this approach is that the epidemiological parameters have a direct interpretation in terms of the citation dynamics. For instance, a paper’s relative long-term impact (i.e., the total number of citations a paper will ever acquire relative to the number it could potentially receive) is easily derived within the epidemiological framework (see Eq. (6)) and it is a function of the basic reproductive number $R_0 = \beta/\gamma$ only. This result is similar to the ultimate impact of a paper derived in Ref. [15], which happens to depend only on the relative fitness of the paper. In fact, recalling that in the infectious disease context $R_0$ is the average number of people infected by one person and that in the evolutionary context the fitness of an organism is measured by the number of offspring per generation it produces, it is fair to think of $R_0$ as the fitness of the paper, so the two distinct approaches reach similar conclusions. However, the epidemiological approach does not make the preferential attachment assumption, since it assumes that the probability that a publication is cited at a certain time is a function of the number of influential papers at that time, which is a time-dependent fraction of the current number of citations.

The distributions of the values of the epidemiological parameters that describe the citation histories of the 50 hit papers for each one of the 6 APS journals considered allow us to characterize those journals and define an epidemiological rank. It turns out that there is a good, though not perfect, correlation between the rank obtained using the transmission rate $\beta$ or the potential maximum number of citations $S_0$ a paper acquires and the IF rank. Surprisingly, this rank correlates negatively with the rank obtained using the basic reproductive number $R_0$, which implies that hit papers published in high impact factor journals have less relative long-term impact ($\upsilon = \Upsilon_\infty/S_0$) than hit papers published in low impact factor journals, although their absolute long-term impact ($\Upsilon_\infty$) is much greater. The fact that the IF rank is obtained using citations from papers published in all journals indexed in the Web of Science may explain the discrepancies with the epidemiological ranks that use data of APS journals only, particularly in research areas such as Nuclear Physics where there are many traditional journals owned by other publishers.

In summary, the SIR epidemic model proved very valuable to fit the citation histories of hit papers and, in addition, offered unexpected insights into the citation dynamics. The good correlation between the IF rank and the epidemiological ranks suggests that this simple epidemic model succeeded in picking out the essential elements behind the citation dynamics.

ACKNOWLEDGMENTS
We thank the American Physical Society for letting us use their citation database. The research of JFF was supported in part by Grant No. 2020/03041-3, Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) and by Grant No. 305058/2017-7, Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). SMR was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

[1] D. J. S. Price, Science since Babylon (Yale University Press, New Haven, 1975).
[2] R. Merton, The Sociology of Science (University of Chicago Press, Chicago, 1973).
[3] F. Radicchi, S. Fortunato and C. Castellano, Proc. Natl. Acad. Sci. USA, 17268 (2008).
[4] F. Radicchi, S. Fortunato, B. Markines, and A. Vespignani, Phys. Rev. E, 056103 (2009).
[5] B. Uzzi, S. Mukherjee, M. Stringer and B. Jones, Science, 468 (2013).
[6] J. Ioannidis, K. W. Boyack, H. Small, A. A. Sorensen, and R. Klavans, Nature, 561 (2014).
[7] J. Mingers and L. Leydesdorff, Eur. J. Oper. Res., 1 (2015).
[8] S. Redner, Eur. Phys. J. B, 131 (1988).
[9] S. Redner, Phys. Today, 49 (2005).
[10] M. L. Wallace, V. Larivière and Y. Gingras, J. Informetr., 296 (2009).
[11] A. F. J. van Raan, Scientometrics, 467 (2004).
[12] Q. Ke, E. Ferrara, F. Radicchi and A. Flammini, Proc. Natl. Acad. Sci. USA, 7426 (2015).
[13] J. Mingers, J. Oper. Res. Soc., 1013 (2008).
[14] C. Min, Y. Ding, J. Li, Y. Bu, L. Pei, and J. Sun, J. Assoc. Inf. Sci. Technol., 1271 (2018).
[15] D. Wang, C. Song, A.-L. Barabási, Science, 127 (2013).
[16] L. M. Bettencourt, A. Cintrón-Arias, D. I. Kaiser and C. Castillo-Chávez, Physica A, 513 (2006).
[17] https://journals.aps.org/datasets.
[18] E. Garfield, JAMA, 90 (2006).
[19] P. Larsen and M. Von Ins, Scientometrics, 575 (2010).
[20] E. Landhuis, Nature, 457 (2016).
[21] K. Hagiwara et al., Phys. Rev. D, 010001 (2002).
[22] M. S. Kim, W. Son, V. Bužek and P. L. Knight, Phys. Rev. A, 032323 (2002).
[23] I. Souza, N. Marzari and D. Vanderbilt, Phys. Rev. B, 035109 (2001).
[24] W. O. Kermack and A. G. McKendrick, Proc. R. Soc. A, 700 (1927).
[25] J. D. Murray, Mathematical Biology I: An Introduction (Springer, New York, 1993).
[26] B. M. Althouse, J. D. West, C. T. Bergstrom and T. Bergstrom, J. Assoc. Inf. Sci. Technol., 27 (2009).
[27] E. M. Rogers, Diffusion of Innovations (Simon and Schuster, New York, 2010).
[28] F. M. Bass, Manag. Sci., 215 (1969).
[29] J. G. Foster, A. Rzhetsky, and J. A. Evans, Am. Sociol. Rev., 875 (2015).
[30] J. Li, Y. Yin, S. Fortunato and D. Wang, Nat. Rev. Phys., 301 (2019).
[31] Y.-H. Eom and S. Fortunato, PLoS ONE, e24926 (2011).
[32] Y. Dong, R. A. Johnson, and N. V. Chawla, IEEE Trans. Big Data2