arXiv: [physics.soc-ph] Oct
Prediction of citation dynamics of individual papers
Michael Golosovsky ∗
The Racah Institute of Physics, The Hebrew University of Jerusalem, 9190401 Jerusalem, Israel
(Dated: October 3, 2019)
Abstract
We apply the stochastic model of citation dynamics of individual papers developed in our previous work (M. Golosovsky and S. Solomon, Phys. Rev. E , 012324 (2017)) to forecast the citation career of individual papers. We focus not only on the estimate of the future citations of a paper but also on the probabilistic margins of such an estimate.

PACS numbers:

∗ [email protected]

I. INTRODUCTION

The interest in predicting the citation behavior of scientific papers is motivated by the need to forecast the journal impact factor, by the early identification of breakthrough papers, and by career considerations [1–3]. Prediction is usually based on a priori and a posteriori factors that, in principle, can determine the citation career of a paper. The former are set at the moment of publication: subject, title, the author's previous record and reputation [4–9], venue (journal) [10, 11], the length and composition of the reference list [8, 11–13], the style of the paper [11, 14, 15], etc. The difficulty of this approach is that the most important attributes, such as novelty, originality, significance, and timeliness of the results, are qualitative. In principle, some of them can be quantified, but this is challenging. A brilliant example of such quantification is Ref. [12], which managed to characterize the novelty of a paper through the diversity (frequency of atypical combinations) of its references.
A posteriori factors develop during a short time after the paper has been published. These include the "impact factor" (the number of citations during a short period after publication) [4, 16–19] and the place that the paper occupies in its community. There are two complementary approaches to predicting the citation career of a paper based on these factors.

Computer scientists focus more on a priori factors. They take a large set of papers whose citation careers have been evolving for a long time and use it for training, namely, they measure the correlation between these factors and the number of citations of a paper in the long-time limit. Then the factors are ranked according to their importance and a predictive model is built by machine learning. The general consensus is that a predictive algorithm should use several factors or a combination of them [7, 20], whereas the relative weight of these factors can vary between disciplines. It has also been realized that linear correlations do not tell the whole story [4, 16, 18, 21] and that a predictive algorithm should preferably be nonlinear, similar to that of Ref. [18]. Once the predictive algorithm has been validated, it works as follows. For a new paper, one determines all relevant factors and builds a prediction. The result of the prediction is the number of citations of a paper after some predetermined time. Although this prediction is probabilistic, the margins of predictability have never been studied properly.

The approach of researchers with a background in the natural sciences is different. They focus more on a posteriori factors, such as the recent citation history of a paper. They construct empirical models of citation dynamics which are based on some predetermined scheme of the citation process, namely, they assume a certain strategy that the author of a new paper adopts when citing previous studies.
Such a model predicts the future citation behavior of a paper based on its citation history and several paper-specific parameters, the most important of them being fitness, a hidden parameter that can be reliably estimated only after the citation career of the paper has been developing for 2-3 years [22]. Once the model has been constructed and validated, the prediction is performed as follows. One takes a new paper and, by studying its initial citation history, makes a probabilistic estimate of its fitness and other specific parameters. After this estimate has been made and the corresponding parameters have been substituted into the model of citation dynamics, the model predicts the number of citations of this paper in the long-time limit. This approach has been most completely embodied in the Wang-Song-Barabasi model [16].

Bibliometric analysis considers both a priori and a posteriori factors. The researchers in this area have long recognized that the early citation history of a paper is a good predictor of its future success. On the other hand, they were the first to draw attention to sleeping beauties [16, 23–25], the papers that started to gain popularity long after publication. Many important papers exhibited the sleeping-beauty behavior, which no model of citation dynamics can predict. Thus, the presence of such papers sets a limit to the prediction of the future citation count of a paper. On the other hand, this poor predictability is what makes science fun for so many researchers.

Our purpose is to forecast the future citation career of a paper based on our recently developed stochastic model of citation dynamics [21]. This model includes several empirical parameters. Some of them are common to the whole discipline, while all individual attributes of the paper are lumped into one parameter, fitness, which does not vary with time.
Our first goal is to explore the limits of predictability of the citation career of a paper with a given fitness, the uncertainty of prediction being related to the intrinsic stochasticity of the citation process. Our second goal is to quantify the ingredients of fitness; in particular, we show how one can quantify such an attribute of a paper as the timeliness of its results.
II. STOCHASTIC MODEL OF CITATION DYNAMICS - A SUMMARY
Assume a paper $j$ published in year $t_j$. To quantify its citation dynamics, we introduce $\Delta K_j = k_j(t_j, t_i)\,dt$, the number of citations garnered by this paper in the time window $(t_i, t_i + dt)$, where $k_j(t_j, t_i)$ is the citation rate of paper $j$ in year $t_i$. The model assumes that $\Delta K_j$ is a random variable that follows a time-inhomogeneous stochastic point process, namely, the probability of having $\Delta K_j$ citations in a short time interval $dt$ is $\frac{\lambda_j^{\Delta K_j}}{\Delta K_j!}\, e^{-\lambda_j}$, where $\lambda_j\,dt$ is the paper-specific probabilistic citation rate. The model assumes that this rate consists of direct and indirect contributions,

$$\lambda_j(t_j, t_i) = \lambda_j^{dir}(t_j, t_i) + \lambda_j^{indir}(t_j, t_i), \qquad (1)$$

where the first term captures those papers that cite paper $j$ and do not cite any other paper that cites $j$, while the second term captures the papers that cite both $j$ and one or more of its citing papers.

The model yields the following expression for $\lambda_j(t_j, t_i)$:

$$\lambda_j(t) = \eta_j R \tilde{A}(t) + \int_0^t m(t-\tau)\, \frac{T(t-\tau)}{R}\, k_j(\tau)\, d\tau, \qquad (2)$$

where, in order to shorten notation, we introduced $t = t_i - t_j$, the number of years after publication, and dropped $t_j$. The first addend in Eq. 2 stands for the direct citation rate. Here, $\eta_j$ is the fitness of paper $j$, $R$ is the average reference list length of the papers published in year $t_j$, and $\tilde{A}(t)$ is the aging function. The second addend in Eq. 2 captures the indirect citation rate. Here, $m(t)$ is the average citation rate of the papers published in year $t_j$, $T(t)$ is the obsolescence function, and $k_j(\tau)$ is the past citation rate of paper $j$.

Our measurements yielded the factors and functions $R$, $\tilde{A}(t)$, $m(t)$, and $T(t)$. We showed that they are the same for all papers in the same field published in the same year.
Thus, the paper's individuality is captured by the fitness $\eta_j$ and by its past citation history, $k_j(\tau)$. To predict the citation dynamics of the paper, we need to measure its past citation dynamics and to estimate its fitness (which is supposed to be constant during the paper's lifetime). Then we substitute these numbers into Eq. 2 and run a numerical simulation with the known functions $R$, $\tilde{A}(t)$, $m(t)$, and $T(t)$. Technically, Eq. 2 describes a self-exciting or Hawkes process, since there is a positive feedback between the past and present citation rates. Hence, the prediction of future citations is inherently probabilistic and its margins increase with time.

III. PROBABILISTIC CHARACTER OF THE CITATION PROCESS. IMPLICATIONS WITH RESPECT TO PREDICTABILITY OF FUTURE CITATIONS
The citation process is stochastic, and this stochasticity imposes limits on the predictability of future citations. Moreover, as we showed earlier [21], the citation dynamics of a paper follows a self-exciting (Hawkes) process whereby past fluctuations are amplified. The positive feedback between past fluctuations and future citations renders the task of long-term prediction of the citation behavior of a paper almost futile and limits predictive algorithms to a range of 2-3 years. In our previous study [26] we illustrated this by measurements.

We explore here the following question: if we knew a paper's fitness, what would be the margins of predictability of its citation trajectory? To answer this question, we analyzed the relation between the paper's fitness and the number of citations it garners in the long-time limit. This was done using our calibrated and verified model of citation dynamics. We wish to estimate $K_\infty(\eta)$, the expected number of citations after 25 years for a paper with a certain fitness $\eta$. To this end, we performed numerical simulations based on Eq. 2 with the parameters for Physics papers published in 1984. We considered 4000 papers with the same fitness $\eta$, found the statistical distribution of their citations after 25 years, and measured the mean $K_\infty$ and the width of this distribution. We consider $K_\infty$ as the expected number of citations in the long-time limit.

Figure 1a shows that the expected number of citations, $K_\infty$, grows nonlinearly with fitness $\eta$. Figure 1b focuses on the width of the $K_\infty$-distribution. We observe that for the papers with low $\eta$, the citation distribution in the long-time limit is wide, while for the papers with high $\eta$ it is narrow. This means that while the citation dynamics of a low-fitness paper strongly depends on chance, the citation dynamics of a high-fitness paper is more deterministic.

A. Divergence of citation dynamics of the papers with the same fitness - numerical simulation
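The ensemble experiment described above (many papers with the same fitness, followed for 25 years) can be sketched in a few lines. This is only an illustration under stated assumptions, not the calibrated model: time is discretized into years, the indirect term of Eq. 2 is read as $m\,T/R$ per past citation, and the functions `aging` and `indirect_kernel` as well as the value `REF_LEN` are hypothetical placeholders rather than the measured $\tilde{A}(t)$, $m(t)$, $T(t)$, and $R$ for Physics papers published in 1984.

```python
import math
import random
import statistics

REF_LEN = 20.0  # hypothetical average reference-list length R


def aging(t):
    """Hypothetical placeholder for the aging function A~(t), t = 1, 2, ... years."""
    return 0.3 * math.exp(-(t - 1) / 4.0)


def indirect_kernel(s):
    """Hypothetical placeholder for m(s) * T(s) / R, the indirect citations
    triggered s years later by one past citation."""
    mean_rate = 3.0 * math.exp(-s / 6.0)     # stands in for m(s)
    obsolescence = 2.0 * math.exp(-s / 3.0)  # stands in for T(s)
    return mean_rate * obsolescence / REF_LEN


def poisson(lam, rng):
    """Draw a Poisson variate with mean lam (Knuth's method, applied in chunks
    of at most 30 so that exp(-chunk) never underflows)."""
    total = 0
    while lam > 0:
        chunk = min(lam, 30.0)
        limit = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        total += k
        lam -= chunk
    return total


def simulate_total(fitness, years, rng):
    """Simulate one citation trajectory and return the total citation count."""
    yearly = []  # yearly[tau - 1] = citations received in year tau
    for t in range(1, years + 1):
        direct = fitness * REF_LEN * aging(t)
        indirect = sum(indirect_kernel(t - tau) * yearly[tau - 1]
                       for tau in range(1, t))
        yearly.append(poisson(direct + indirect, rng))
    return sum(yearly)


def ensemble(fitness, n_papers=4000, years=25, seed=0):
    """Mean and standard deviation of total citations for papers of equal fitness."""
    rng = random.Random(seed)
    totals = [simulate_total(fitness, years, rng) for _ in range(n_papers)]
    return statistics.mean(totals), statistics.pstdev(totals)


if __name__ == "__main__":
    for eta in (0.1, 0.5, 2.0):
        mean, std = ensemble(eta, n_papers=1000)
        print(f"eta={eta}: mean={mean:.1f}, std={std:.1f}, std/mean={std / mean:.2f}")
```

With these placeholder functions, the ratio of the width to the mean shrinks as the fitness grows, which is the qualitative behavior discussed in this section. The absolute numbers carry no meaning because the parameters are invented.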
In particular, Fig. 1b shows that if the expected number of citations in the long-time limit is 3, the actual number of citations can be anything between 0 and 7; if the expected number is 10, the actual number can be between 3 and 20; if the expected number is 100, the actual number can be between 50 and 130; and if the expected number is 1000, the actual number can be between 700 and 1200. If we compare two papers that garnered 3 and 20 citations in the long-time limit, they can have the same fitness $\eta$, namely, they are most probably in the same "quality" league. Two papers that garnered 700 and 1200 citations are probably in the same "quality" league as well, namely, they can have the same fitness. But the papers that garnered 100 and 1000 citations should have different fitness and belong to different "quality" leagues.

FIG. 1. (a) Expected number of citations after 25 years, $K_\infty(\eta)$, as a function of the paper's fitness $\eta$. Numerical simulation for 4000 papers with the same $\eta$. The simulation is based on Eq. 2 with the parameters for Physics papers published in 1984. The continuous line shows an empirical power-law dependence of $K_\infty$ on $\eta$. (b) Actual number of citations after 25 years versus the expected number. The error bars show the width of the distribution, $K_\infty \pm \mathrm{std}(K_\infty)$. Citation distributions are broad for low $\eta$ and narrow for high $\eta$.

Figure 2 shows the $K_\infty$-distributions from a slightly different perspective. We observe that a paper which is worth 10 citations can, with 10% probability, garner less than 3 or more than 18 citations; a paper which is worth 100 citations can, with 10% probability, garner less than 54 or more than 135 citations; a paper which is worth

FIG. 2. Statistical distribution of the number of citations after 25 years for the papers with the same fitness $\eta$. Numerical simulation based on Eq. 2 for 4000 papers. The mean of the distribution, $K_\infty$, is indicated at each curve. Continuous lines show the cumulative probability of getting more than $K$ citations, $\int_K^\infty p(K)\,dK\,|_{\eta=\mathrm{const}}$. Dashed lines show the complementary probability of getting less than $K$ citations, $\int_0^K p(K)\,dK\,|_{\eta=\mathrm{const}}$. The intervals of the 10% probabilities of having $K$ expected citations are shown by arrows.

career of the paper. A similar assumption was adopted by Ref. [27] in their description of Web-page popularity, and it was justified by measurements. This assumption is reasonable for ordinary papers but not for sleeping beauties, which can be dormant for a long time and then become popular.

B. Fitness estimation
Refs. [16, 28] associate fitness with the ultimate impact of the paper, namely the total number of citations in the long-time limit; Ref. [29] determines paper fitness by ranking; Ref. [30] estimates patent fitness as a combination of attributes found through factor analysis; Ref. [31] associates fitness with the number of citations during a couple of years after publication. We define fitness slightly differently, namely, $\eta$ is the number of direct citations in the long-time limit. Obviously, this definition cannot be a basis for the prediction of future citations, since it can be used only when the citation career of a paper is close to completion.

FIG. 3. The relation between direct citations and total citations during the first three years after publication for Physics papers published in 1984. The publication year corresponds to $t = 1$. The dashed line shows the linear approximation, $K^{dir}(3) = K(3)$. This approximation is good only for low-cited papers. The red continuous line shows the empirical power-law approximation of $K^{dir}(3)$ in terms of $K(3)$, which should be used for highly-cited papers. The fitness is estimated from Eq. 2 as $\eta_j = \frac{K_j^{dir}(3)}{R \sum \tilde{A}(t)}$.

To work out an operational definition of fitness that can be used for prediction, we note that the fitness in our sense is related to the initial citation rate, since at the beginning of the citation career of a paper citations are predominantly direct. We base our operational definition of fitness on the "magic of three years", well recognized in bibliometrics. Namely, the number of citations garnered by a paper during the first 2-3 years after publication (for computer science papers this initial period is 0.5-1 year) is a basis for fitness estimation. Figure 3 shows that the relation is nonlinear (see also Fig. 1a), and from this calibration plot we estimate fitness.
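The operational estimate quoted in the caption of Fig. 3 can be written out as a short sketch. The function name `estimate_fitness` and all numerical inputs are hypothetical. In particular, `gamma` is a placeholder for the exponent of the empirical calibration curve, whose value is not given here; `gamma = 1.0` reproduces the linear approximation $K^{dir}(3) = K(3)$, which the text says holds only for low-cited papers.

```python
def estimate_fitness(citations_3y, ref_len, aging_values, gamma=1.0):
    """Estimate fitness from the first three years of citations.

    citations_3y : total citations K(3) in the first three years
    ref_len      : average reference-list length R (assumed known)
    aging_values : values of the aging function A~(t) for t = 1, 2, 3 (assumed known)
    gamma        : hypothetical calibration exponent, K_dir(3) = K(3) ** gamma
    """
    direct_3y = citations_3y ** gamma  # calibration step (Fig. 3)
    return direct_3y / (ref_len * sum(aging_values))


if __name__ == "__main__":
    # Invented inputs, for illustration only.
    eta = estimate_fitness(citations_3y=6, ref_len=20.0, aging_values=[0.1, 0.1, 0.1])
    print(f"estimated fitness: {eta:.3f}")
```

Choosing `gamma` below 1 lowers the estimated share of direct citations for highly-cited papers, which is the direction suggested by the statement that the linear approximation fails for them; the actual value would have to come from the calibration plot.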
IV. FITNESS ESTIMATION BASING ON PAPER’S CONTENT
We believe that $\eta_i$ is determined by the journal (venue), the number of researchers in the area, the reputation of the research group, and, last but not least, by the paper's novelty, timeliness, and quality, although the latter can be a subjective notion. It should be noted, however, that the paper's fitness and the number of citations gauge not the quality of a paper but its impact. Note that even an erroneous paper can have a great impact. On the other hand, the impact can depend on factors unrelated to the paper's content: institution, reputation of the research group, catchy title, etc.

For example, H. Brot and Y. Louzoun showed [32] that the name of the first author matters for the citation count. In particular, the Physics papers whose first author's name starts with the letters A, B, C have, in the long-time limit, ∼10% more citations than the papers whose list of authors starts with X, Y, Z. (Maybe the success of the famous Alpher-Bethe-Gamow paper partially derives from the lucky combination of their last names?)

To further demonstrate the importance of the author's name for the success of a paper, we consider anecdotal evidence based on a pair of papers. Richard Lewontin and Jack Hubby made a landmark study in molecular evolution while collaborating at the University of Chicago. To get equal credit for their contributions, their scientific report was published as two companion papers with very similar titles and subjects:

1. J.L. Hubby and R.C. Lewontin, "A molecular approach to study of genic heterozygosity in natural populations. Number of alleles at different loci in drosophila pseudoobscura", Genetics 54(2), 577-594 (1966).
2. R.C. Lewontin and J.L. Hubby, "A molecular approach to study of genic heterozygosity in natural populations. Amount of variation and degree of heterozygosity in natural populations of drosophila pseudoobscura", Genetics 54(2), 595-609 (1966).

The main difference between these two papers is the order of authors. By 2018, the second paper had around 900 citations while the first paper had only around 500 citations! This difference is explained by the fact that, when the papers were first published in 1966, Lewontin, who was three years older than Hubby, was better known in the scientific community. Thus, researchers preferred to cite the paper in which Lewontin was the first author. Eventually, Hubby also became well known, and the paper in which he was the first author got fair credit and a large number of citations. However, the citation count of the Lewontin paper remained bigger due to its impressive head start. On the other hand, do the citation counts of these two papers reflect a difference in their "quality"? Our model and Fig. 1 show that the probability of two papers, which garnered 500 and 900 citations in the long-time limit, to have the same fitness is ∼ .

V. TIMELINESS OF RESULTS
One of the important criteria which the editor and reviewers use in their evaluation of submitted papers is the timeliness of results. This criterion singles out the papers that deal with a hot topic. In our parlance, a paper that focuses on a hot topic has enhanced fitness as compared to a paper belonging to a mature research direction. How can one quantify the corresponding contribution to fitness?

Suppose that in year $t_0$ there appeared one or several breakthrough papers which were followed by a flurry of subsequent developments. This means that a new field (hot topic) has been born. The number of publications in this new field starts to grow explosively and then saturates. As we have shown before [21], the authors are conservative in their citing habits, and the length and the age composition of their reference lists remain more or less the same. In particular, the papers that were published in the same year constitute ∼ − ∼ −10% of all references, the papers that were published two years before also constitute ∼ −10% of all references, etc. Thus, the papers that were published long after the onset of a new topic have a wide choice of references, while the papers published soon after the onset of a new topic have a very limited choice for filling their reference lists and all choose the papers that were published close to the onset. Thus, the papers that were published soon after the birth of a new field, namely, timely papers, should have an enhanced number of citations (enhanced fitness).

To put these considerations into quantitative terms, we consider a new field that appeared at time $t_0$. We denote the annual number of publications in this field by $N(t_0 + t)$. Equation 2 yields the average number of direct citations that a paper in this field, which was published in year $t_0 + t$, garners during three subsequent years,

$$K^{dir}(t_0 + t,\, t_0 + t + 3) = \eta(t_0 + t)\, R(t_0 + t) \sum_{\tau} \tilde{A}(\tau)\, N(t_0 + t + \tau), \qquad (3)$$

where $\eta$ is the average fitness of the papers in the new field which were published in year $t_0 + t$, $R$ is the average reference list length of the papers published in that year, and $\tilde{A}(\tau)$ is the aging function for citations. Note also that $\tilde{A}(\tau) = A(\tau)\, e^{(\alpha + \beta)\tau}$, where $A(\tau)$ is the aging function for references. While the aging function for citations is specific for each discipline and publication year, the aging function for references turns out to be surprisingly universal and almost independent of the publication year [21]. The reference-citation duality [21] yields the average fitness for the papers published in year $t_0 + t$,

$$\eta(t_0 + t) = \frac{\sum_{\tau} A(\tau)\, \frac{N(t_0 + t + \tau)}{N(t_0 + t)}\, e^{\beta\tau}}{\sum_{\tau} A(\tau)\, e^{(\alpha + \beta)\tau}}. \qquad (4)$$

If the new field grows at the same rate as the whole discipline, namely, $\frac{N(t_0 + t + \tau)}{N(t_0 + t)} = e^{\alpha\tau}$, then the numerator and denominator of Eq. 4 coincide and $\eta$ does not depend on $t$. However, if this new field grows faster than the whole discipline, then $\eta$ is enhanced.

Figure 4 illustrates these considerations.
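A small numerical sketch may help to see how Eq. 4 behaves. Everything in it is an assumption made for illustration: the function name `timeliness_fitness`, the three-year summation window $\tau = 1, 2, 3$, the values used for $A(\tau)$, the growth exponents $\alpha$ and $\beta$, and the publication counts. None of them are measured quantities.

```python
import math


def timeliness_fitness(aging_refs, pubs, year_index, alpha, beta):
    """Average fitness from Eq. 4 for papers published year_index years after onset.

    aging_refs : A(tau) for tau = 1, 2, ... (aging function for references)
    pubs       : pubs[i] = N(t0 + i), annual number of publications in the field
    alpha      : growth exponent entering exp((alpha + beta) * tau) in Eq. 4
    beta       : second growth exponent entering exp(beta * tau) in Eq. 4
    """
    n_now = pubs[year_index]
    numerator = 0.0
    denominator = 0.0
    for tau, a in enumerate(aging_refs, start=1):
        growth = pubs[year_index + tau] / n_now  # N(t0+t+tau) / N(t0+t)
        numerator += a * growth * math.exp(beta * tau)
        denominator += a * math.exp((alpha + beta) * tau)
    return numerator / denominator


if __name__ == "__main__":
    alpha, beta = 0.05, 0.02          # invented exponents
    aging_refs = [0.08, 0.09, 0.09]   # invented A(1), A(2), A(3)

    # Field that grows like the whole discipline: fitness stays at 1.
    steady = [100 * math.exp(alpha * i) for i in range(10)]
    print(timeliness_fitness(aging_refs, steady, 0, alpha, beta))

    # Hot topic: explosive growth that later saturates (logistic curve).
    hot = [1000 / (1 + math.exp(-(i - 3))) for i in range(10)]
    for year in range(6):
        print(year, round(timeliness_fitness(aging_refs, hot, year, alpha, beta), 2))
```

In this toy setting the fitness is largest for the papers published right after the onset and decays toward a value slightly below 1 once the field saturates, in line with the argument that timely papers enjoy enhanced fitness.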
We know that a hot topic usually appears abruptly and can be identified through a burst of citations and publications [33]. We chose several such research areas in Physics with a well-defined onset $t_0$; with some of these areas the author of this book has had personal experience. Using Web of Science, we found all papers belonging to each of these topics that were published in year $t_0 + t$. For each $t$, we measured the annual number of papers and the statistical distribution of the number of citations garnered by them during the first three years after publication. Then we determined the mean and the width of these distributions. Using Eq. 4 and Fig. 3, we found the average fitness of the papers in each topic published in year $t_0 + t$, based on the mean of the distribution. On the other hand, we estimated this fitness using Eq. 4. Figure 4 shows that the model prediction based on Eq. 4 captures our measurements well.

Figure 4 implies that any paper published soon after the new topic appeared has a good head start, and this quantifies the "first mover advantage" introduced by Newman [35]. However, this does not mean that the papers published long after the onset of a hot topic are doomed to be undercited. In fact, Fig. 4 shows only the mean of the fitness distribution for each year. The actual fitness distribution is very wide and its width is comparable to the mean. Hence, at each moment after the onset of a hot topic there are many papers whose fitness considerably exceeds the average one.

VI. DISCUSSION
We showed here that our stochastic model of citation dynamics can be a basis for predicting the citation trajectory of papers. This model should be compared to the physics-inspired predictive model developed by Wang, Song, and Barabasi [16]. Pham, Sheridan, and Shimodaira [36, 37] developed a software package based on the latter model and demonstrated that it is a valid predictive tool. The model of Ref. [16] includes three paper-specific parameters: fitness $\eta$, immediacy $\mu$, and $\sigma$. To determine these parameters, one needs to measure the initial citation trajectory of a paper, and 2-3 years are not enough for that. As a predictive tool, this model works best for highly-cited papers. Although this deterministic model predicts the citation trajectory of a paper, it cannot specify the probabilistic margins of the prediction. In contrast, our probabilistic model includes only one paper-specific parameter, fitness, and it does provide probabilistic margins of the future citation count. However, our model works better with ordinary papers and does not predict well the citation trajectories of highly-cited papers. Thus, our model is complementary to that of Ref. [16].

What are its possible applications? We believe that our model can be used for forecasting the five-year journal impact factor. The papers published in one year in one journal represent a more or less homogeneous set of papers, hence predicting the mean number of citations for this set is more reliable than predicting the citation trajectory of a single paper. In addition, our model can give probabilistic margins of such a prediction.

Another application can be the early identification of breakthrough papers. So far, this was done by analyzing the diversity and age structure of the reference list of papers [12, 38], the diversity and interdisciplinarity of the paper's content [39], or through identification of an atypical citation trajectory, corresponding to sleeping beauties [25]. An important question is how soon we can identify such a rising star.
Obviously, if a paper (or patent) gets more citations than what is expected from an ordinary paper published in the same year and in the same journal, then it is a candidate to be a breakthrough paper [40]. On the other hand, the deviation from the ordinary citation trajectory may be accidental. Our model can estimate the probability of the enhanced citation count in order to judge whether it occurred by chance or not.
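One simple way to turn such a judgment into a number is to compare the observed citation count with an ensemble of simulated counts for an ordinary paper of the same venue and year. The sketch below is an assumption about how this could be done, not a procedure taken from the model's description: the function name `chance_probability` is hypothetical, the simulated counts are supposed to come from a simulation of Eq. 2, and the add-one correction is the usual convention for Monte Carlo tail estimates.

```python
def chance_probability(simulated_counts, observed):
    """Estimate the probability that an ordinary paper reaches at least
    `observed` citations, given simulated citation counts for ordinary papers.

    The add-one correction keeps the estimate strictly positive, so a finite
    ensemble never reports a probability of exactly zero.
    """
    exceed = sum(1 for k in simulated_counts if k >= observed)
    return (exceed + 1) / (len(simulated_counts) + 1)


if __name__ == "__main__":
    # Invented ensemble of simulated three-year citation counts.
    ordinary = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 9, 12]
    for observed in (4, 12, 40):
        p = chance_probability(ordinary, observed)
        print(f"observed={observed}: probability by chance ~ {p:.3f}")
```

A small value suggests that the enhanced count is unlikely to be a fluctuation. The resolution of the estimate is limited by the ensemble size, so a large ensemble is needed before very rare outcomes can be distinguished.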
[1] A. Clauset, D. B. Larremore, and R. Sinatra, Science , 477 (2017).
[2] A. Zeng, Z. Shen, J. Zhou, J. Wu, Y. Fan, Y. Wang, and H. E. Stanley, Physics Reports (2017).
[3] I. Tahamtan, A. Safipour Afshar, and K. Ahamdzadeh, Scientometrics , 1195 (2016).
[4] C. Castillo, D. Donato, and A. Gionis, in String Processing and Information Retrieval, edited by N. Ziviani and R. Baeza-Yates (Springer Berlin Heidelberg, Berlin, Heidelberg, 2007) pp. 107–117.
[5] W. Ke, Scientometrics , 981 (2013).
[6] H. P. F. Peters and A. F. J. van Raan, J. Am. Soc. Inf. Sci. , 39 (1994).
[7] C. Stegehuis, N. Litvak, and L. Waltman, Journal of Informetrics , 642 (2015).
[8] R. Yan, C. Huang, J. Tang, Y. Zhang, and X. Li, in Proc. 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12 (ACM, New York, NY, USA, 2012) pp. 51–60.
[9] R. Yan, J. Tang, X. Liu, D. Shan, and X. Li, in Proc. 20th ACM International Conference on Information and Knowledge Management, CIKM '11 (ACM, New York, NY, USA, 2011) pp. 1247–1252.
[10] V. Larivière and Y. Gingras, J. of the Association for Information Science and Technology , 424 (2010).
[11] F. Didegah and M. Thelwall, Journal of Informetrics , 861 (2013).
[12] B. Uzzi, S. Mukherjee, M. Stringer, and B. Jones, Science , 468 (2013).
[13] P. Klimek, A. S. Jovanovic, R. Egloff, and R. Schneider, Scientometrics , 1265 (2016).
[14] A. Letchford, H. S. Moat, and T. Preis, Royal Society Open Science , 150266 (2015), https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.150266.
[15] C. W. Fox, C. E. T. Paine, and B. Sauterey, Ecology and Evolution , 7717 (2019).
[16] D. Wang, C. Song, and A.-L. Barabasi, Science , 127 (2013).
[17] J. Adams, Scientometrics , 567 (2005).
[18] L. Li and H. Tong, in Proc. 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15 (ACM, New York, NY, USA, 2015) pp. 655–664.
[19] X. Cao, Y. Chen, and K. R. Liu, Journal of Informetrics , 471 (2016).
[20] J. Bollen, H. Van de Sompel, A. Hagberg, and R. Chute, PLOS ONE , 1 (2009).