Prediction of citation dynamics of individual papers

Abstract

We apply the stochastic model of citation dynamics of individual papers developed in our previous work (M. Golosovsky and S. Solomon, Phys. Rev. E 95, 012324 (2017)) to forecast the citation career of individual papers. We focus not only on the estimate of the future citations of a paper but also on the probabilistic margins of such an estimate.


arXiv [physics.soc-ph]

Michael Golosovsky∗, The Racah Institute of Physics, The Hebrew University of Jerusalem, 9190401 Jerusalem, Israel (Dated: October 3, 2019)

∗ [email protected]

I. INTRODUCTION

The interest in predicting the citation behavior of scientific papers is motivated by the need to forecast the journal impact factor, by the early identification of breakthrough papers, and by career considerations [1–3]. Prediction is usually based on a priori and a posteriori factors that, in principle, can determine the citation career of a paper. The former are set at the moment of publication: subject, title, the author's previous record and reputation [4–9], venue (journal) [10, 11], the length and the composition of the reference list [8, 11–13], the style of the paper [11, 14, 15], etc. The difficulty of this approach is that the most important attributes, such as novelty, originality, significance, and timeliness of the results, are qualitative. In principle, some of them can be quantified, but this is challenging. A brilliant example of such quantification is Ref. [12], which managed to characterize the novelty of a paper through the diversity (frequency of atypical combinations) of its references.

A posteriori factors develop during a short time after the paper has been published. They include the "impact factor" (the number of citations during a short period after publication) [4, 16–19] and the place that the paper occupies in its community. There are two complementary approaches to predicting the citation career of a paper based on these factors.

Computer scientists focus more on a priori factors. They take a large set of papers whose citation career has been evolving for a long time and use it for training, namely, they measure the correlation between these factors and the number of citations of a paper in the long time limit. Then the factors are ranked according to their importance and a predictive model is built by machine learning. The general consensus is that a predictive algorithm should use several factors or a combination of them [7, 20], whereas the relative weight of these factors can vary across disciplines. It has also been realized that linear correlations do not tell the whole story [4, 16, 18, 21] and that a predictive algorithm should preferably be nonlinear, similar to that of Ref. [18]. Once the predictive algorithm has been validated, it works as follows. For a new paper, one determines all relevant factors and builds a prediction. The result of the prediction is the number of citations of the paper after some predetermined time. Although this prediction is probabilistic, the margins of predictability have never been studied properly.

The approach of researchers with a background in natural sciences is different. They focus more on a posteriori factors, such as the recent citation history of a paper. They construct empirical models of citation dynamics which are based on some predetermined scheme of the citation process, namely, they assume a certain strategy that the author of a new paper adopts when citing previous studies.
Such a model predicts the future citation behavior of a paper based on its citation history and several paper-specific parameters, the most important of them being fitness, a hidden parameter that can be reliably estimated only after the citation career of the paper has been developing for 2-3 years [22]. When the model has been constructed and validated, the prediction is performed as follows. One takes a new paper and, by studying its initial citation history, makes a probabilistic estimate of its fitness and other specific parameters. After such an estimate has been made and the corresponding parameters have been substituted into the model of citation dynamics, it predicts the number of citations of this paper in the long time limit. This approach has been most completely embodied in the Wang-Song-Barabasi model [16].

Bibliometric analysis considers both a priori and a posteriori factors. Researchers in this area have long recognized that the early citation history of a paper is a good predictor of its future success. On the other hand, they were the first to draw attention to sleeping beauties [16, 23–25], the papers that started to gain popularity long after publication. Many important papers exhibited sleeping-beauty behavior, which no model of citation dynamics can predict. Thus, the presence of such papers sets a limit to the prediction of the future citation count of a paper. On the other hand, this poor predictability is what makes science fun for so many researchers.

Our purpose is to forecast the future citation career of a paper based on our recently developed stochastic model of citation dynamics [21]. This model includes several empirical parameters; some of them are common to the whole discipline, while all individual attributes of the paper are lumped into one parameter, fitness, which does not vary with time.
Our first goal is to explore the limits of predictability of the citation career of a paper with a given fitness, the uncertainty of prediction being related to the intrinsic stochasticity of the citation process. Our second goal is to quantify the ingredients of fitness; in particular, we show how one can quantify such an attribute of a paper as the timeliness of its results.

II. STOCHASTIC MODEL OF CITATION DYNAMICS - A SUMMARY

Assume a paper $j$ published in year $t_j$. To quantify its citation dynamics, we introduce $\Delta K_j = k_j(t_j, t_i)\,dt$, the number of citations garnered by this paper in the time window $(t_i, t_i + dt)$, where $k_j(t_j, t_i)$ is the citation rate of paper $j$ in year $t_i$. The model assumes that $\Delta K_j$ is a random variable that follows a time-inhomogeneous stochastic point process, namely, the probability of having $\Delta K_j$ citations in a short time interval $dt$ is $\frac{\lambda_j^{\Delta K_j}}{\Delta K_j!} e^{-\lambda_j}$, where $\lambda_j dt$ is the paper-specific probabilistic citation rate. The model assumes that this rate consists of direct and indirect contributions,

$$\lambda_j(t_j, t_i) = \lambda_j^{dir}(t_j, t_i) + \lambda_j^{indir}(t_j, t_i), \tag{1}$$

where the first term captures those papers that cite paper $j$ and do not cite any other paper that cites $j$, while the second term captures the papers that cite both $j$ and one or more of its citing papers.

The model yields the following expression for $\lambda_j(t_j, t_i)$:

$$\lambda_j(t) = \eta_j R_0 \tilde{A}(t) + \int_0^t m(t-\tau)\,\frac{T(t-\tau)}{R_0}\,k_j(\tau)\,d\tau, \tag{2}$$

where, in order to shorten notation, we introduced $t = t_i - t_j$, the number of years after publication, and dropped $t_j$. The first addend in Eq. 2 stands for the direct citation rate. Here, $\eta_j$ is the fitness of paper $j$, $R_0$ is the average reference list length of the papers published in year $t_j$, and $\tilde{A}(t)$ is the aging function. The second addend in Eq. 2 captures the indirect citation rate. Here, $m(t)$ is the average citation rate of the papers published in year $t_j$, $T(t)$ is the obsolescence function, and $k_j(\tau)$ is the past citation rate of paper $j$.

Our measurements yielded the factors and functions $R_0$, $\tilde{A}(t)$, $m(t)$, and $T(t)$. We showed that they are the same for all papers in the same field published in the same year.
Thus, the paper's individuality is captured by the fitness $\eta_j$ and by its past citation history, $k_j(\tau)$. To predict the citation dynamics of a paper, we need to measure its past citation dynamics and to estimate its fitness (which is assumed to be constant during the paper's lifetime). Then we substitute these numbers into Eq. 2 and run a numerical simulation with the known functions $R_0$, $\tilde{A}(t)$, $m(t)$, and $T(t)$. Technically, Eq. 2 describes a self-exciting or Hawkes process, since there is a positive feedback between the past and present citation rate. Hence, the prediction of future citations is inherently probabilistic and its margins increase with time.

III. PROBABILISTIC CHARACTER OF THE CITATION PROCESS. IMPLICATIONS WITH RESPECT TO PREDICTABILITY OF FUTURE CITATIONS

The citation process is stochastic, and this stochasticity imposes limits on the predictability of future citations. Moreover, as we showed earlier [21], the citation dynamics of a paper follows a self-exciting (Hawkes) process whereby past fluctuations are amplified. The positive feedback between past fluctuations and future citations renders the task of long-term prediction of the citation behavior of a paper almost futile and limits predictive algorithms to a range of 2-3 years. In our previous study [26] we illustrated this by measurements.

We explore here the following question: if we knew the paper's fitness, what would be the margins of predictability of its citation trajectory? To answer this question, we analyzed the relation between the paper's fitness and the number of citations it garners in the long-time limit. This was done using our calibrated and verified model of citation dynamics. We wish to estimate $K_\infty(\eta)$, the expected number of citations after 25 years for a paper with a certain fitness $\eta$. To this end, we performed numerical simulations based on Eq. 2 with parameters for Physics papers published in 1984. We considered 4000 papers with the same fitness $\eta$, found the statistical distribution of their citations after 25 years, and measured the mean $K_\infty$ and the width of this distribution. We consider $K_\infty$ as the expected number of citations in the long time limit.

Figure 1a shows that the expected number of citations, $K_\infty$, grows nonlinearly with fitness $\eta$. Figure 1b focuses on the width of the $K_\infty$ distribution. We observe that for papers with low $\eta$ the citation distribution in the long time limit is wide relative to its mean, while for papers with high $\eta$ it is narrow. This means that while the citation dynamics of a low-fitness paper strongly depends on chance, the citation dynamics of a high-fitness paper is more deterministic.

A. Divergence of citation dynamics of the papers with the same fitness - numerical simulation
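A minimal Python sketch of the Monte Carlo procedure used in this section is given here for orientation. It discretizes Eq. 2 in yearly steps and draws each year's citations from a Poisson distribution whose rate contains a direct term and a feedback term from past citations. The functions `aging` and `indirect_kernel`, their numerical constants, and the value of `r0` are hypothetical placeholders chosen only for illustration; they are not the measured functions for Physics papers published in 1984. In particular, `indirect_kernel` lumps together all factors that multiply $k_j(\tau)$ in the integrand of Eq. 2.

```python
import numpy as np


def aging(t):
    """Hypothetical aging function A~(t); t = 1 is the publication year."""
    return 0.2 * np.exp(-0.3 * (t - 1))


def indirect_kernel(s):
    """Hypothetical lumped feedback kernel acting on citations gained s years ago."""
    return 0.15 * np.exp(-0.4 * s)


def simulate_citations(eta, years=25, n_papers=4000, r0=20.0, seed=0):
    """Simulate total citations after `years` for n_papers papers of equal fitness.

    Each year, the citation rate of every paper is the sum of a direct term
    (eta * r0 * aging) and an indirect term that feeds back the paper's own
    past citations, as in a discrete-time self-exciting process.
    """
    rng = np.random.default_rng(seed)
    # k[:, t] holds the citations gained in year t (column 0 is unused).
    k = np.zeros((n_papers, years + 1))
    for t in range(1, years + 1):
        lam = np.full(n_papers, eta * r0 * aging(t))   # direct rate
        for tau in range(1, t):                        # feedback from past years
            lam += indirect_kernel(t - tau) * k[:, tau]
        k[:, t] = rng.poisson(lam)                     # stochastic yearly count
    return k.sum(axis=1)
```

Calling `simulate_citations(eta)` for several values of `eta` and comparing `totals.std() / totals.mean()` gives a rough feel for how the relative width of the long-time distribution shrinks as fitness grows; the actual numbers depend entirely on the placeholder functions.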

In particular, Fig. 1b shows that if the expected number of citations in the long-time limit is 3, the actual number of citations can be anything between 0 and 7; if the expected number is 10, the actual number can be between 3 and 20; if the expected number is 100, the actual number can be between 50 and 130; and if the expected number is 1000, the actual number can be between 700 and 1200. If we compare two papers that garnered 3 and 20 citations in the long-time limit, they can have the same fitness $\eta$, namely, they are most probably in the same "quality" league. Two papers that garnered 700 and 1200 citations are probably in the same "quality" league, namely, they can have the same fitness. But papers that garnered 100 and 1000 citations should have different fitness and belong to different "quality" leagues.

FIG. 1. (a) Expected number of citations after 25 years, $K_\infty(\eta)$, as a function of the paper's fitness $\eta$. Numerical simulation for 4000 papers with the same $\eta$. The simulation is based on Eq. 2 with the parameters for Physics papers published in 1984. The continuous line shows an empirical power-law dependence, $K_\infty \propto \eta^{\ldots}$. (b) Actual number of citations after 25 years versus expected number. The error bars show the width of the distribution, $K_\infty \pm \mathrm{std}(K_\infty)$. Citation distributions are broad for low $\eta$ and narrow for high $\eta$.

Figure 2 shows the $K_\infty$ distributions from a slightly different perspective. We observe that a paper which is worth 10 citations can, with 10% probability, garner less than 3 or more than 18 citations; a paper which is worth 100 citations can, with 10% probability, garner less than 54 or more than 135 citations; a paper which is worth [...]

FIG. 2. Statistical distribution of the number of citations after 25 years for the papers with the same fitness $\eta$. Numerical simulation based on Eq. 2 for 4000 papers. The mean of the distribution, $K_\infty$, is indicated at each curve. Continuous lines show the cumulative probability of getting more than $K$ citations, $\int_K^\infty p(K)\,dK\,|_{\eta=const}$. Dashed lines show the complementary probability of getting less than $K$ citations, $\int_0^K p(K)\,dK\,|_{\eta=const}$. The intervals of the 10% probabilities of having $K$ expected citations are shown by arrows.

[...] career of the paper. A similar assumption was adopted by Ref. [27] in their description of Web-page popularity, and it was justified by measurements. This assumption is reasonable for ordinary papers but not for sleeping beauties, which can be dormant for a long time and then become popular.

B. Fitness estimation

Refs. [16, 28] associate fitness with the ultimate impact of the paper, namely, the total number of citations in the long-time limit; Ref. [29] determines paper fitness by ranking; Ref. [30] estimates patent fitness as a combination of attributes found through factor analysis; Ref. [31] associates fitness with the number of citations during a couple of years after publication. We define fitness slightly differently, namely, $\eta$ is the number of direct citations in the long-time limit. Obviously, this definition cannot be a basis for prediction of future citations, since it can be used only when the citation career of a paper is close to completion.

To work out an operational definition of fitness that can be used for prediction, we note that fitness in our sense is related to the initial citation rate, since at the beginning of the citation career of a paper citations are predominantly direct. We base our operational definition of fitness on the "magic of three years", well recognized in bibliometrics. Namely, the number of citations garnered by a paper during the first 2-3 years after publication (for computer science papers this initial period is 0.5-1 year) is a basis for fitness estimation. Figure 3 shows that the relation is nonlinear (see also Fig. 1a), and from this calibration plot we estimate fitness.

FIG. 3. The relation between direct citations and total citations during the first three years after publication for Physics papers published in 1984. Publication year corresponds to $t = 1$. The dashed line shows the linear approximation, $K^{dir}(3) = K(3)$. This approximation is good only for low-cited papers. The red continuous line shows the empirical approximation $K^{dir}(3) = [K(3)]^{\ldots}$, which should be used for highly-cited papers. The fitness is estimated from Eq. 2 as $\eta_j = \frac{K_j^{dir}(3)}{R_0 \sum_{t=1}^{3} \tilde{A}(t)}$.
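The fitness estimate quoted in the caption of Fig. 3 follows from summing the direct term of Eq. 2 over the first three years, and it can be sketched in a few lines of Python. The function name `estimate_fitness` and the numbers in the usage note are hypothetical; the sketch also assumes that the number of direct citations $K^{dir}(3)$ has already been obtained from the total count $K(3)$ using the calibration of Fig. 3, and that the sum of the aging function runs over the first three years.

```python
def estimate_fitness(k_dir_3, r0, aging_values):
    """Estimate fitness as K_dir(3) / (R0 * sum of the aging function).

    k_dir_3      -- direct citations gained during the first three years
    r0           -- average reference list length for the publication year
    aging_values -- values of the aging function A~(t) for t = 1, 2, 3
                    (illustrative inputs; the measured function is not given here)
    """
    denominator = r0 * sum(aging_values)
    if denominator <= 0:
        raise ValueError("r0 and the aging function values must be positive")
    return k_dir_3 / denominator
```

For instance, with the made-up inputs `estimate_fitness(9.2, 20.0, [0.20, 0.15, 0.11])` the denominator is $20 \times 0.46 = 9.2$, so the estimated fitness is close to one.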

IV. FITNESS ESTIMATION BASED ON THE PAPER'S CONTENT

We believe that $\eta_i$ is determined by the journal (venue), the number of researchers in the area, the reputation of the research group, and, last but not least, by the paper's novelty, timeliness, and quality, although the latter can be a subjective notion. It should be noted, however, that the paper's fitness and the number of citations gauge not the quality of a paper but its impact. Note that even an erroneous paper can have a great impact. On the other hand, the impact can depend on factors unrelated to the paper's content: institution, reputation of the research group, catchy title, etc.

For example, H. Brot and Y. Louzoun showed [32] that the name of the first author matters for citation count; in particular, the Physics papers whose first author's name starts with the letters A, B, C have, in the long time limit, ∼10% more citations than the papers whose list of authors starts with X, Y, Z. (Maybe the success of the famous Alpher-Bethe-Gamow paper partially derives from the lucky combination of their last names?)

To further demonstrate the importance of the author's name for the success of a paper, we consider anecdotal evidence based on a couple of papers. Richard Lewontin and Jack Hubby made a landmark study in molecular evolution while collaborating at the University of Chicago. To get equal credit for their contribution, their scientific report was published as two companion papers with very similar titles and subjects:

1. J.L. Hubby and R.C. Lewontin, "A molecular approach to the study of genic heterozygosity in natural populations. I. The number of alleles at different loci in Drosophila pseudoobscura", Genetics 54(2), 577-594 (1966).

2. R.C. Lewontin and J.L. Hubby, "A molecular approach to the study of genic heterozygosity in natural populations. II. Amount of variation and degree of heterozygosity in natural populations of Drosophila pseudoobscura", Genetics 54(2), 595-609 (1966).

The main difference between these two papers is the order of authors. By 2018, the second paper had received around 900 citations while the first paper had received only around 500 citations! This difference is explained by the fact that, when the papers were first published in 1966, Lewontin, who was three years older than Hubby, was better known in the scientific community. Thus, researchers preferred to cite the paper in which Lewontin was the first author. Eventually, Hubby also became well known, and the paper in which he was the first author got fair credit and a large number of citations. However, the citation count of the Lewontin paper remained bigger due to its impressive head start. On the other hand, do the citation counts of these two papers reflect a difference in their "quality"? Our model and Fig. 1 show that the probability of two papers, which garnered 500 and 900 citations in the long time limit, to have the same fitness is ∼ [...]

V. TIMELINESS OF RESULTS

One of the important criteria that editors and reviewers use in their evaluation of submitted papers is the timeliness of results. This criterion singles out the papers that deal with a hot topic. In our parlance, a paper that focuses on a hot topic has enhanced fitness as compared to a paper belonging to a mature research direction. How can one quantify the corresponding contribution to fitness?

Suppose that in year $t_0$ there appeared one or several breakthrough papers which were followed by a flurry of subsequent developments. This means that a new field (hot topic) has been born. The number of publications in this new field starts to grow explosively and then saturates. As we have shown before [21], authors are conservative in their citing habits, and the length and the age composition of their reference lists remain more or less the same. In particular, the papers that were published in the same year constitute ∼ [...] of all references, [...] ∼ [...]−10% of all references, the papers that were published two years before also constitute ∼ [...]−10% of all references, etc. Thus, the papers published long after the onset of a new topic have a wide choice of references, while the papers published soon after the onset have a very limited choice for filling their reference lists and all choose the papers that were published close to the onset. Consequently, the papers that were published soon after the birth of a new field, namely, timely papers, should have an enhanced number of citations (enhanced fitness).

To put these considerations into quantitative terms, we consider a new field that appeared at time $t_0$. We denote the annual number of publications in this field by $N(t_0 + t)$. Equation 2 yields the average number of direct citations that a paper in this field, published in year $t_0 + t$, garners during three subsequent years,

$$K^{dir}(t_0 + t,\, t_0 + t + 3) = \eta(t_0 + t)\, R_0(t_0 + t) \sum_{\tau} \tilde{A}(\tau)\, N(t_0 + t + \tau), \tag{3}$$

where $\eta$ is the average fitness of the papers in the new field which were published in year $t_0 + t$, $R_0$ is the average reference list length of the papers published in the same year, and $\tilde{A}(\tau)$ is the aging function for citations. Note also that $\tilde{A}(\tau) = A(\tau)\, e^{(\alpha + \beta)\tau}$, where $A(\tau)$ is the aging function for references. While the aging function for citations is specific to each discipline and publication year, the aging function for references turns out to be surprisingly universal and almost independent of the publication year [21]. The reference-citation duality [21] yields the average fitness of the papers published in year $t_0 + t$,

$$\eta(t_0 + t) = \frac{\sum_{\tau} A(\tau)\, \frac{N(t_0 + t + \tau)}{N(t_0 + t)}\, e^{\beta\tau}}{\sum_{\tau} A(\tau)\, e^{(\alpha + \beta)\tau}}. \tag{4}$$

If the new field grows at the same rate as the whole discipline, namely, $\frac{N(t_0 + t + \tau)}{N(t_0 + t)} = e^{\alpha\tau}$, then $\eta$ does not depend on $t$. However, if this new field grows faster than the whole discipline, then $\eta$ is enhanced.

Figure 4 illustrates these considerations. We know that a hot topic usually appears abruptly and can be identified through a burst of citations and publications [33].
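As a numerical illustration of Eq. 4, the following Python sketch computes the average fitness of papers published $t$ years after the onset of a field from the annual publication counts. The function name `average_fitness`, the choice of summing over $\tau = 1, 2, \ldots$ up to the length of the supplied aging list, and all numbers in the usage note are assumptions made for the example; the measured reference-aging function $A(\tau)$ and the growth exponents $\alpha$ and $\beta$ are not reproduced here.

```python
import math


def average_fitness(n_pubs, t, ref_aging, alpha, beta):
    """Evaluate Eq. 4 for papers published t years after the field's onset.

    n_pubs    -- annual number of publications in the field, indexed by years
                 since onset (n_pubs[0] is the onset year)
    t         -- publication year measured from the onset
    ref_aging -- values of the reference aging function A(tau), tau = 1, 2, ...
                 (summation range is an assumption of this sketch)
    alpha     -- growth exponent of the whole discipline, as in e^{alpha*tau}
    beta      -- second exponent entering A~(tau) = A(tau) e^{(alpha+beta)tau}
    """
    numerator = 0.0
    denominator = 0.0
    for tau, a in enumerate(ref_aging, start=1):
        growth = n_pubs[t + tau] / n_pubs[t]          # growth of the new field
        numerator += a * growth * math.exp(beta * tau)
        denominator += a * math.exp((alpha + beta) * tau)
    return numerator / denominator
```

If `n_pubs` grows exactly as $e^{\alpha\tau}$, the numerator and denominator coincide and the function returns a value of one for every `t`, in line with the remark following Eq. 4; publication counts that grow faster than the discipline give a value above one.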
We chose several such research areas in Physics with a well-defined onset $t_0$; with some of these areas the author has had personal experience. Using Web of Science, we found all papers belonging to each of these topics that were published in year $t_0 + t$. For each $t$, we measured the annual number of papers and the statistical distribution of the number of citations garnered by them during the first three years after publication. Then we determined the mean and the width of these distributions. Using Eq. 4 and Fig. 3, we found the average fitness of the papers in each topic published in year $t_0 + t$, based on the mean of the distribution. On the other hand, we estimated this fitness using Eq. 4. Figure 4 shows that the model prediction based on Eq. 4 captures our measurements perfectly well.

Figure 4 implies that any paper published soon after a new topic appeared has a good head start, and this quantifies the "first mover advantage" introduced by Newman [35]. However, this does not mean that the papers published long after the onset of a hot topic are doomed to be undercited. In fact, Fig. 4 shows only the mean of the fitness distribution for each year. The actual fitness distribution is very wide and its width is comparable to the mean. Hence, at each moment after the onset of a hot topic there are many papers whose fitness considerably exceeds the average one.

VI. DISCUSSION

We showed here that our stochastic model of citation dynamics can be a basis for predicting the citation trajectory of papers. This model should be compared to the physics-inspired predictive model developed by Wang, Song, and Barabasi [16]. Pham, Sheridan, and Shimodaira [36, 37] developed a software package based on the latter model and demonstrated that it is a valid predictive tool. The Wang-Song-Barabasi model includes three paper-specific parameters: fitness $\eta$, immediacy $\mu$, and $\sigma$. To determine these parameters, one needs to measure the initial citation trajectory of a paper; 2-3 years are not enough. As a predictive tool, this model works best for highly-cited papers. Although this deterministic model predicts the citation trajectory of a paper, it cannot specify the probabilistic margins of the prediction. By contrast, our probabilistic model includes only one paper-specific parameter, fitness, and it does provide probabilistic margins of the future citation count. However, our model works better with ordinary papers and does not predict well the citation trajectories of highly-cited papers. Thus, our model is complementary to that of Ref. [16].

What are its possible applications? We believe that our model can be used for forecasting the five-year journal impact factor. The papers published in one year in one journal represent a more or less homogeneous set of papers; hence, predicting the mean number of citations for this set is more reliable than predicting the citation trajectory of a single paper. In addition, our model can give the probabilistic margins of such a prediction.

Another application can be the early identification of breakthrough papers. So far, this was done by analyzing the diversity and age structure of the reference list of papers [12, 38], the diversity and interdisciplinarity of the paper's content [39], or through identification of an atypical citation trajectory corresponding to sleeping beauties [25]. An important question is how soon we can identify such a rising star.
Obviously, if a paper (or patent) gets more citations than what is expected from an ordinary paper published in the same year and in the same journal, then it is a candidate to be a breakthrough paper [40]. On the other hand, the deviation from the ordinary citation trajectory may be accidental. Our model can estimate the probability of the enhanced citation count in order to judge whether it occurred by chance or not.
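One simple way to turn this last idea into a number is sketched below in Python: given a set of simulated long-time citation counts for an ordinary paper, it returns the fraction of simulated papers that reach at least the observed count. The function name `exceedance_probability` and the sample values are hypothetical, and the simulated counts are assumed to come from a model run such as the one based on Eq. 2; this is an illustration of the idea rather than the procedure used in this work.

```python
import numpy as np


def exceedance_probability(simulated_totals, observed):
    """Fraction of simulated citation counts that are >= the observed count.

    simulated_totals -- citation counts of simulated "ordinary" papers
                        (assumed to be produced by a model simulation)
    observed         -- citation count of the paper under scrutiny
    """
    sims = np.asarray(simulated_totals)
    if sims.size == 0:
        raise ValueError("at least one simulated count is required")
    # An empirical tail probability: small values suggest that the enhanced
    # citation count is unlikely to be a chance fluctuation.
    return float(np.mean(sims >= observed))
```

A small returned value suggests that the paper's citation count is hard to explain by chance alone, whereas a value of the order of tens of percent is compatible with an ordinary paper that was merely lucky.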

[1] A. Clauset, D. B. Larremore, and R. Sinatra, Science , 477 (2017).
[2] A. Zeng, Z. Shen, J. Zhou, J. Wu, Y. Fan, Y. Wang, and H. E. Stanley, Physics Reports (2017).
[3] I. Tahamtan, A. Safipour Afshar, and K. Ahamdzadeh, Scientometrics , 1195 (2016).
[4] C. Castillo, D. Donato, and A. Gionis, in String Processing and Information Retrieval, edited by N. Ziviani and R. Baeza-Yates (Springer Berlin Heidelberg, Berlin, Heidelberg, 2007) pp. 107–117.
[5] W. Ke, Scientometrics , 981 (2013).
[6] H. P. F. Peters and A. F. J. van Raan, J. Am. Soc. Inf. Sci. , 39 (1994).
[7] C. Stegehuis, N. Litvak, and L. Waltman, Journal of Informetrics , 642 (2015).
[8] R. Yan, C. Huang, J. Tang, Y. Zhang, and X. Li, in Proc. 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '12 (ACM, New York, NY, USA, 2012) pp. 51–60.
[9] R. Yan, J. Tang, X. Liu, D. Shan, and X. Li, in Proc. 20th ACM International Conference on Information and Knowledge Management, CIKM '11 (ACM, New York, NY, USA, 2011) pp. 1247–1252.
[10] V. Larivière and Y. Gingras, J. of the Association for Information Science and Technology , 424 (2010).
[11] F. Didegah and M. Thelwall, Journal of Informetrics , 861 (2013).
[12] B. Uzzi, S. Mukherjee, M. Stringer, and B. Jones, Science , 468 (2013).
[13] P. Klimek, A. S. Jovanovic, R. Egloff, and R. Schneider, Scientometrics , 1265 (2016).
[14] A. Letchford, H. S. Moat, and T. Preis, Royal Society Open Science , 150266 (2015), https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.150266.
[15] C. W. Fox, C. E. T. Paine, and B. Sauterey, Ecology and Evolution , 7717 (2019).
[16] D. Wang, C. Song, and A.-L. Barabasi, Science , 127 (2013).
[17] J. Adams, Scientometrics , 567 (2005).
[18] L. Li and H. Tong, in Proc. 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15 (ACM, New York, NY, USA, 2015) pp. 655–664.
[19] X. Cao, Y. Chen, and K. R. Liu, Journal of Informetrics , 471 (2016).
[20] J. Bollen, H. Van de Sompel, A. Hagberg, and R. Chute, PLOS ONE , 1 (2009).