Predicting the Citations of Scholarly Paper

Abstract

Citation prediction of scholarly papers is of great significance in guiding funding allocations, recruitment decisions, and rewards. However, little is known about how citation patterns evolve over time. By exploring the inherent involution property in scholarly paper citation, we introduce the Paper Potential Index (PPI) model based on four factors: inherent quality of scholarly paper, scholarly paper impact decaying over time, early citations, and early citers' impact. In addition, by analyzing factors that drive citation growth, we propose a multi-feature model for impact prediction. Experimental results demonstrate that the two models improve the accuracy in predicting scholarly paper citations. Compared to the multi-feature model, the PPI model yields superior predictive performance in terms of range-normalized RMSE. The PPI model also interprets the changes in citation better, without the need to adjust parameters. Compared to the PPI model, the multi-feature model predicts better in terms of Mean Absolute Percentage Error and Accuracy; however, its predictive performance depends more heavily on parameter adjustment.


Xiaomei Bai (a), Fuli Zhang (b,*), Ivan Lee (c)
(a) Computing Center, Anshan Normal University, Anshan, China
(b) Library, Anshan Normal University, Anshan, China
(c) School of Information Technology and Mathematical Sciences, University of South Australia, Australia


Keywords:

Scholarly Paper, Paper Potential Index, Multi-feature Model

∗ Corresponding author

Email address: [email protected] (Fuli Zhang)

Preprint submitted to Journal of Informetrics, August 13, 2020

1. Introduction

There is increasing interest in understanding the citation dynamics of scholarly papers and the evolution of science (Xia et al., 2017). So far, studies in this field have primarily focused on success of science (Xia et al., 2016; Bai et al., 2016; Cao et al., 2016; Fiala and Tutoky, 2018; Zhang et al., 2017), academic collaboration networks (Panagopoulos et al., 2017), team science (Heidi, 2015) and scientific impact prediction (Bai et al., 2017). While citation serves as a popular indicator for measuring research outcomes, it is often necessary to estimate future impact as well. For instance, research impact prediction helps in the effective allocation of research funds (Clauset et al., 2017). An important challenge in scientific impact prediction is to characterize the change in citations over time, and it is important to identify the factors that affect the citations of scholarly papers.

Previous studies have mainly focused on predicting citations or analyzing future citation distributions. Some studies utilize machine learning algorithms such as Gradient Boosting Decision Tree (Sandulescu and Chiru, 2016), Support Vector Machine (Adankon and Cheriet, 2010), and XGBoost (Chen and Guestrin, 2016). To train the predictive models, crucial features have been identified for citation prediction, including early citations, journal impact factor, authors' authority, journal reputation, topic of scholarly paper, and age (Petersen et al., 2014; Sarigöl et al., 2014; Yu et al., 2014). Some citation prediction studies have applied generative models to reflect the observation that older papers typically attract more citations (Newman, 2008), or to address citation patterns that show an initial period of growth followed by a gradual decline over time (Wang et al., 2008, 2013). More recently, Xiao et al. (2016) proposed a point process model to predict the long-term impact of individual publications based on early citations. Furthermore, Singh et al. (2017) found that influential early citers negatively affected long-term scientific impact, possibly due to attention stealing, whereas non-influential early citers positively affected long-term scientific impact.

Inspired by the prior work of Wang et al. (2008, 2013), Xiao et al. (2016) and Singh et al. (2017), we model the Paper Potential Index (PPI) by considering the following factors: inherent quality of scholarly paper, scholarly paper impact decaying over time, early citations, and early citers' impact. The PPI predictive model combines these factors and extends the Hawkes process, and it mainly depends on the inherent involution mechanism of paper citations with the following three properties: (1) paper citation declines along with the decay of paper novelty over time; (2) the early citers' impact can increase scholarly paper impact in the predictive model; (3) early citations help retain long-term citations.

In addition, we propose a multi-feature predictive model, which considers author-based features, journal-based features, and the citations feature. We compare the prediction results of the two models in terms of mean absolute error, root mean squared error, range-normalized RMSE, mean absolute percentage error and accuracy.

The main contributions of this paper include: (1) introduction of PPI, which reflects the potential impact of a scholarly article; (2) consideration of scholarly paper impact decaying over time, scholarly papers' quality, early citations, and early citing authors' impact, to quantify the potential impact of scholarly articles; (3) discussion of how PPI outperforms the existing multi-feature models in citation prediction.

2. Related work

Citation prediction of scholarly papers has been extensively investigated, and these studies are mostly based on the analysis of a mixture of features, including author-based features (the number of authors, the country of the author's institution, authors' authority, etc.), journal-based features (the total citations of the journal, journal impact factor, keyword frequency of each journal, etc.), paper-based features (the topic of scholarly paper, scholarly paper length, keyword repetition in the abstract of a paper, the number of references, etc.), and other features such as institutional features (institutional rankings and reputation, etc.). In addition, Altmetrics are also employed to predict the citations of scholarly papers. Various investigations have been conducted to explore the correlation between Twitter activities and citation patterns (Peoples et al., 2016; Timilsina et al., 2016; Erdt et al., 2016). Seminal examples of citation prediction using a mixture of features are summarized in Table 1. Three categories of features are used in our multi-feature predictive model: author-based features, journal-based features, and the citations feature. In order to improve the performance of prediction, Author Impact Factor (AIF), Q value, H-index, Journal Impact Factor and citations are used to predict the citations of scholarly papers. The main difference between our multi-feature predictive model and the prior studies is the selection of features.

Table 1: Examples of multi-feature citation prediction of scholarly paper.

| Source | Author features | Journal features | Paper features | Other features |
|---|---|---|---|---|
| Haslam et al. (2008) | the number of authors, first author gender | journal prestige | title length, the number of references | first author institution's prestige |
| Bornmann et al. (2012) | the number of authors, the reputation of the authors | the language of the publishing journal | citation count | citation performance of the cited references, reviewers' ratings of importance |
| Livne et al. (2013) | H-index, g-index | journal prestige | citations | content similarity, graph density, clustering coefficient |
| Yu et al. (2014) | the number of authors, the country of the author's institution, H-index | journal impact factor, total citations, 5-year impact factor, the cited half-life | the number of references, the reciprocal of the first-cited age of this paper | the document type |
| Singh et al. (2015) | H-index, author rank, past influence of authors, productivity, sociality, authority, versatility | journal rank, journal centrality, past influence of journals | publication count, citation count, novelty, topic rank, diversity | average countX, average citeWords |
| Robson and Mousquès (2016) | the number of authors, author name | the number of journal pages, journal prestige | the year of publication, title length, abstract length | special issue |
| Sohrabi and Iraj (2017) | the number of authors | - | title length, abstract length | SCImago quartile |

In order to analyze the efficiency of multi-feature citation prediction, regression models are often used. Popular regression models for citation prediction include quantile regression (Robson and Mousquès, 2016), semi-continuous regression (Sohrabi and Iraj, 2017) and Gradient Boosted Regression Trees (GBDT) (Chen and Zhang, 2015). Generative models can also be used to predict the citations of scholarly papers (Li et al., 2015; Zhang et al., 2016). Wang et al. (2013) proposed a point process by identifying three fundamental mechanisms in paper impact prediction: preferential attachment (highly cited papers are more likely to be cited again), decay rate, and fitness (capturing the inherent differences between papers) to predict the probability of a paper being cited. To characterize the citation dynamics of scientific papers, a nonlinear stochastic model of citation dynamics based on the copying-redirection-triadic closure mechanism was reported by Golosovsky and Solomon (2017).

3. Modeling citing behavior as a point process

The American Physical Society (APS) dataset includes all papers published in 9 journals, namely Physical Review A, Physical Review B, Physical Review C, Physical Review D, Physical Review E, Physical Review I, Physical Review L, Physical Review ST and Review of Modern Physics, from 1970 to 2013 (http://publish.aps.org/datasets). Each record in the APS dataset includes paper title, author names, author affiliations, date of publication, and a list of cited papers. Because the APS dataset does not provide unique author identifiers, we first perform name disambiguation in our experiments, based on the method proposed by Sinatra et al. (2016). Two authors are considered to be the same individual if all of the following three conditions are fulfilled: (1) the last names of the two authors are identical; (2) the first names are identical or have matching initials; (3) at least one of the following is true: the two authors cited each other at least once; the two authors share at least one co-author; the two authors share at least one similar affiliation.

We select 183,336 papers from the APS dataset, published from 1978 to 1998, as experimental data. Scholarly papers with at least 5 citations within the first 5 years of publication are used as the training data, and their citations in the subsequent 10 years are used as the testing data.

3.2. Prediction model
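Before the model components are introduced, the three-condition author-matching rule from the dataset description above can be made concrete. The following is a minimal sketch, not the authors' implementation: the `AuthorRecord` type, its field names, and the helper functions are hypothetical. It also assumes that "cited each other" means a citation in either direction and that a "similar affiliation" is an exact shared affiliation string.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """Hypothetical container for one author mention in the dataset."""
    record_id: str
    first_name: str
    last_name: str
    cited_ids: set = field(default_factory=set)     # record_ids this author cites
    coauthors: set = field(default_factory=set)     # co-author name strings
    affiliations: set = field(default_factory=set)  # normalized affiliation strings


def first_names_match(a: str, b: str) -> bool:
    """Condition (2): identical first names, or one is an initial matching the other."""
    a, b = a.strip().rstrip(".").lower(), b.strip().rstrip(".").lower()
    if not a or not b:
        return False
    if a == b:
        return True
    return (len(a) == 1 or len(b) == 1) and a[0] == b[0]


def same_author(a: AuthorRecord, b: AuthorRecord) -> bool:
    """True only if conditions (1), (2) and (3) all hold."""
    # Condition (1): identical last names.
    if a.last_name.strip().lower() != b.last_name.strip().lower():
        return False
    if not first_names_match(a.first_name, b.first_name):
        return False
    # Condition (3): at least one piece of supporting evidence.
    cites = b.record_id in a.cited_ids or a.record_id in b.cited_ids  # assumption: either direction
    shares_coauthor = bool(a.coauthors & b.coauthors)
    shares_affiliation = bool(a.affiliations & b.affiliations)        # assumption: exact match
    return cites or shares_coauthor or shares_affiliation


a = AuthorRecord("a1", "John", "Smith", coauthors={"Lee"})
b = AuthorRecord("a2", "J.", "Smith", coauthors={"Lee", "Wu"})
print(same_author(a, b))
```

Requiring evidence from condition (3) keeps two different people who merely share a surname and an initial from being merged.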

Intrinsic potential

Citations reflect the impact of a research paper, which corresponds to the authors' impact, quantified as $Q_i$ for an author $i$ (Sinatra et al., 2016). A scholar with a high $Q_i$ is expected to publish high-impact papers. In this paper, we use the parameter $Q_i$ to indicate the intrinsic potential of a paper's impact.

Paper impact decaying over time

As the new ideas presented in each paper are developed further in follow-up studies, their novelty eventually fades and the impact of the paper decays over time (Wang et al., 2013). Figure 1 shows the citation pattern of individual scholarly papers over time. The vertical axis is the yearly citations of 100 randomly selected scholarly papers published between 1978 and 1997 in the APS dataset. The color represents the publication year of each scholarly paper. According to Figure 1, each paper has its own inherent citation trend, and the patterns may not correlate with one another.

Figure 1: Citation pattern of individual scholarly papers over time.

Early citers’ impact

Some prior studies have ignored the citers' impact on the citation dynamics (Wang et al., 2013). According to the study in Singh et al. (2017), influential early citers might negatively affect the long-term scientific impact of papers due to attention stealing, whereas non-influential early citers could positively affect the long-term scientific impact of papers. Inspired by this idea, the early citers' impact is used in PPI to model the citation pattern of a scholarly paper.

Early citation

Based on the observation that high early citations lead to more citations in the future, we model the Paper Potential Index $\lambda_d(t)$ of a scholarly paper $d$ by extending a self-exciting Hawkes process:

$$\lambda_d(t) = \beta_d Q_{dMax}\, e^{-w_{d1} t} + \alpha_d \sum_{j,\, t_j < t} D_j\, e^{-w_{d2}(t - t_j)} \tag{1}$$

[...]

[...] we maximize the probability that the $i$-th citation is reached at time $t_i$. The concept can be formulated as follows:

$$p(t_i \mid t_{i-1}) = \exp\!\left(-\int_{t_{i-1}}^{t_i} \lambda(t)\,dt\right) \lambda(t_i) \tag{4}$$

We then use maximum likelihood estimation: we form the likelihood function on the citation sequence of each article and take its logarithm:

$$\log \prod_{i=1}^{n} p(t_i \mid t_{i-1}) = \sum_{i=1}^{n} \log \lambda(t_i) - \int_{0}^{T} \lambda(t)\,dt \tag{5}$$

where $n$ is the citation count of a scholarly paper, $t_i$ is the time at which the $i$-th citation occurs, and $T$ is the period of time over which a paper is cited. The maximum of the log-likelihood function is obtained by minimizing its negative. Substituting the intensity function into the above formula and adding a sparse regularization term $\|\beta\|_1$, we get the objective function $L_\beta$:

$$L_\beta = -\sum_{d=1}^{N} \left\{ \sum_{i=1}^{n} \log\!\left( \beta s_d\, e^{-w_{d1} t_i} + \sum_{j=1}^{i-1} \alpha_d D_j\, e^{-w_{d2}(t_i - t_j)} \right) - \frac{\beta s_d}{w_{d1}}\left(1 - e^{-w_{d1} T}\right) - \frac{\alpha_d}{w_{d2}} \sum_{i=1}^{n} \left( D_i - e^{-w_{d2}(T - t_i)} \right) \right\} + \lambda \|\beta\|_1 \tag{6}$$

where $N$ is the number of papers in the experimental data and $s_d$ denotes the features of a paper. Because adding the regularization term makes the objective function non-differentiable, we use the Alternating Direction Method of Multipliers (ADMM) to decompose the original optimization problem into a few simpler sub-problems. By introducing the auxiliary variable $z$, the optimization problem in equation (6) can be formulated as the following constrained optimization:

$$\min\; L + \lambda \|z\|_1 \quad \text{s.t.}\quad \beta = z \tag{7}$$

The corresponding augmented Lagrangian is:

$$L_\rho = L + \lambda \|z\|_1 + \rho\mu(\beta - z) + \frac{\rho}{2} \|\beta - z\|_2^2 \tag{8}$$

where $\mu$ (written $u$ in the update steps below) is the dual variable or Lagrange multiplier, and $\rho$ is the penalty coefficient, which is usually used as an iterative step to update the dual variable. The steps to solve the above augmented Lagrangian optimization problem using the ADMM algorithm are as follows:

$$(\beta^{l+1}, \alpha^{l+1}) = \arg\min_{\beta \ge 0,\; \alpha \ge 0} L_\rho(\beta^{l}, \alpha^{l}, z^{l}, u^{l}) \tag{9}$$

$$z^{l+1} = S_{\lambda/\rho}(\beta^{l+1} + \alpha^{l+1}) \tag{10}$$

$$u^{l+1} = u^{l} + \beta^{l} - z^{l+1} \tag{11}$$

where $S_{\lambda/\rho}$ is a soft-thresholding function. The ADMM algorithm is similar to the dual ascent algorithm, including a parameter minimization process, as in equation (9); an auxiliary parameter minimization process, as in equation (10); and a dual parameter update process, as in equation (11). In order to efficiently solve the optimization problem in equation (9), we use the EM framework to update the parameters $\alpha$ and $\beta$. Given that the probability that feature $k$ activates event $i$ is $p_{ki}$, and the probability that event $i$ activates event $j$ is $p_{ij}$, the EM algorithm is as follows:

$$p^{d(l+1)}_{ki} = \frac{\beta_k s_{dk}\, e^{-w_{d1} t_i}}{\lambda(t_i)} \tag{12}$$

$$p^{d(l+1)}_{ij} = \frac{\alpha_d D_j\, e^{-w_{d2}(t_i - t_j)}}{\lambda(t_i)} \tag{13}$$

$$\beta^{l+1}_k = \frac{-B + \sqrt{B^2 + 4\rho \sum_{d=1}^{N} \sum_{i=1}^{n} p^{d}_{ki}}}{2\rho} \tag{14}$$

$$\alpha^{(l+1)}_d = \frac{\sum_{i=1}^{n} \sum_{j=1}^{i-1} p^{d}_{ij}}{\sum_{i=1}^{n} \left( D_i - e^{-w_{d2}(T - t_i)} \right) / w_{d2}} \tag{15}$$

where $B = \sum_{d=1}^{N} s_{dk}\left(1 - e^{-w_{d1} T}\right)/w_{d1} + \rho(u_k - z_k)$. Equation (12) represents the probability that the value of the $k$-th feature $s_{dk}$ and the coefficient $\beta_k$ corresponding to feature $k$ affect the citations of the paper when the paper is cited $i$ times. Equation (13) represents the probability that the $j$-th ($j < i$) citation affects the citations of a paper when it is cited $i$ times. Therefore, $\sum_{d=1}^{N} \sum_{i=1}^{n} \lambda(t_i)\, p^{d}_{ki}$ indicates the expectation that the coefficient $\beta_k$ corresponding to feature $k$ affects the citations of the paper over the entire data set, and $\sum_{i=1}^{n} \sum_{j=1}^{i-1} \lambda(t_i)\, p^{d}_{ij}$ indicates the expectation that the number of existing citations of the paper affects its citations. In equation (8), we find the maximum of these two expectations and derive the partial derivatives for $\alpha$ and $\beta$. When the partial derivative is zero, equations (14) and (15) are obtained. By iterating until convergence, we get the optimal values of the parameters $\alpha$ and $\beta$. After that, the new values of $\alpha$ and $\beta$ are brought back to the values of $u$ and $z$ in the ADMM algorithm.

After obtaining the parameters $\alpha$ and $\beta$, the parameters $w_1$ and $w_2$ of each paper are solved by the gradient descent method. The gradient of the objective function with respect to $w_1$ and $w_2$ is as follows:

$$\frac{\partial L_\rho}{\partial w_1} = \sum_{i=1}^{n} \frac{\beta s\, t_i\, e^{-w_1 t_i}}{\lambda(t_i)} + \frac{\beta s}{w_1^2}\left( e^{-w_1 T} + T \cdot w_1 \cdot e^{-w_1 T} - 1 \right) \tag{16}$$

$$\frac{\partial L_\rho}{\partial w_2} = \sum_{i=1}^{n} \left\{ \frac{\sum_{j=1}^{i-1} (t_i - t_j)\, \alpha D_j\, e^{-w_2(t_i - t_j)}}{\lambda(t_i)} + \frac{\alpha}{w_2^2}\left[ w_2 (T - t_i)\, e^{-w_2(T - t_i)} + e^{-w_2(T - t_i)} - 1 \right] \right\} \tag{17}$$

After obtaining the optimal values of all parameters $\alpha$, $\beta$, $w_1$ and $w_2$, we estimate the citations of a scholarly paper after a certain period of time by taking the integral of the intensity function $\lambda(t)$.
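As an illustration of how an intensity of this kind is evaluated, integrated, and scored, here is a minimal sketch under stated assumptions. It assumes the intensity is a decaying baseline `beta * q_max * exp(-w1 * t)` plus one exponentially decaying term per earlier citation, weighted by a per-citation value (`citer_weights`, standing in for $D_j$). The function names and numeric values are hypothetical. The closed-form integral is derived directly from that assumed intensity rather than copied from the paper's objective, and parameter fitting (ADMM, EM, gradient descent) is not shown.

```python
import math


def ppi_intensity(t, cite_times, citer_weights, beta, q_max, alpha, w1, w2):
    """Assumed intensity: decaying baseline plus self-excitation from citations before t."""
    baseline = beta * q_max * math.exp(-w1 * t)
    excitation = sum(d * math.exp(-w2 * (t - tj))
                     for tj, d in zip(cite_times, citer_weights) if tj < t)
    return baseline + alpha * excitation


def ppi_integral(T, cite_times, citer_weights, beta, q_max, alpha, w1, w2):
    """Closed-form integral of the assumed intensity over [0, T]."""
    baseline = beta * q_max * (1.0 - math.exp(-w1 * T)) / w1
    excitation = sum(d * (1.0 - math.exp(-w2 * (T - tj))) / w2
                     for tj, d in zip(cite_times, citer_weights) if tj < T)
    return baseline + alpha * excitation


def log_likelihood(T, cite_times, citer_weights, **params):
    """Point-process log-likelihood: sum of log-intensities minus the integrated intensity."""
    log_terms = sum(math.log(ppi_intensity(ti, cite_times, citer_weights, **params))
                    for ti in cite_times)
    return log_terms - ppi_integral(T, cite_times, citer_weights, **params)


def soft_threshold(x, kappa):
    """Scalar soft-thresholding operator: shrink x toward zero by kappa."""
    return math.copysign(max(abs(x) - kappa, 0.0), x)


# Illustrative values only; none of these numbers come from the paper.
params = dict(beta=0.5, q_max=2.0, alpha=0.3, w1=0.4, w2=0.8)
times, weights = [0.5, 1.2, 3.0], [1.0, 2.0, 0.5]
print(ppi_intensity(4.0, times, weights, **params))
print(ppi_integral(5.0, times, weights, **params))
print(log_likelihood(5.0, times, weights, **params))
```

Because both terms decay exponentially, the integral has a closed form, so the log-likelihood can be evaluated without numerical quadrature.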

4. Multi-features predictive model

Author-based features.

• Author Impact Factor (AIF). Similar to the concept of journal impact factor, an author's AIF in year T is the average citations of published papers in a period of ∆T years before year T. Based on the APS dataset, we compute each author's AIF value according to the author's publishing history and use the statistics of all authors' AIF of a given institution as a group of its features, including sum, maximum, minimum, median, average and deviation. We briefly explore and report the authors' AIF features in this work.

• Q value. The Q value is calculated according to equation 2.

• H-index. A scholar has an index value of H if the scholar has H papers with at least H citations. The H-index gives an estimate of the impact of a scholar's cumulative research contributions.

Journal-based feature.

Journal Impact Factor is a quantitative index for evaluating the impact of a journal. It is the ratio of the citations of a journal to the papers published in the journal.

Citations feature.

The historical citations of each paper are used to predict the impact of a paper.
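The author-based features above lend themselves to a short worked example. The sketch below is illustrative only: the function names are hypothetical, the AIF is simplified to the mean citation count of papers published in the ∆T years before year T, and "deviation" is assumed to mean the population standard deviation. The text fixes neither of those last two details.

```python
from statistics import mean, median, pstdev


def h_index(citations):
    """Largest H such that H papers have at least H citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def author_impact_factor(papers, year, window):
    """Mean citations of papers published in the `window` years before `year`.

    `papers` is a list of (publication_year, citation_count) pairs (assumed layout).
    """
    recent = [c for (y, c) in papers if year - window <= y < year]
    return sum(recent) / len(recent) if recent else 0.0


def aggregate(values):
    """Per-paper statistics over its authors: sum, max, min, avg, med, dev."""
    return {
        "sum": sum(values),
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "med": median(values),
        "dev": pstdev(values),  # assumption: population standard deviation
    }


# Example: H-index values of the four authors of one paper (made-up numbers).
print(h_index([10, 8, 5, 4, 3]))
print(author_impact_factor([(2000, 4), (2001, 6), (1990, 100)], year=2002, window=2))
print(aggregate([1, 2, 3, 6]))
```

The same `aggregate` helper can be applied to the authors' Q values, H-indices, or AIF values to produce one group of summary features per paper.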

In order to investigate the effect of the author-based features, the journal-based feature and the citations feature, we evaluate the importance of features (see Table 2).

In this section, we describe the multi-feature predictive model, which feeds the author-based features, the journal-based feature and citations into Gradient Boosting Decision Trees (GBDT). The GBDT model suits a large number of features and non-linear relationships between the predictor variables and the target variable. For the multi-feature predictive model, parameter adjustment is crucial for predictive performance. The main parameters include:

(1) Learning rate: the model's learning speed on the distribution characteristics of the sample, expressed as the weight of the regression tree for each iteration in the algorithm. The larger the learning rate is, the faster the algorithm converges. The smaller the learning rate is, the slower the algorithm converges, but the prediction accuracy may increase.

(2) Number of iterations: the number of weak learners obtained in the model. In general, the number of iterations depends on the learning rate.

(3) Minimum samples of leaf nodes: this parameter defines the conditions under which the subtree continues to be divided. If the number of samples on the leaf node is smaller than the set value, the node will not be further divided.

(4) Maximum depth of decision tree: this parameter controls the maximum depth of the decision tree generated in each iteration. The purpose is to prevent over-fitting.

(5) Sampling rate: this parameter indicates the proportion of training samples used in each training round, and its value ranges from 0 to 1. A value of 1 indicates that all the samples are involved in training. The main role of this parameter is to add sample perturbation to prevent over-fitting. The sampling rate is generally set between 0.5 and 0.8. If the value is too large, the risk of over-fitting increases. If the value is too small, correct samples may not be learned due to too few samples, and the model deviations will increase.

Table 2: Features used in the prediction model.

| Feature | Description | Feature | Description |
|---|---|---|---|
| c1 | one-year citations | max(H-index) | maximum of H-index |
| c2 | two-year citations | min(H-index) | minimum of H-index |
| c3 | three-year citations | avg(H-index) | average of H-index |
| c4 | four-year citations | med(H-index) | median of H-index |
| c5 | five-year citations | dev(H-index) | deviation of H-index |
| sum(Q) | sum of Q value | sum(AIF) | sum of AIF |
| max(Q) | maximum of Q value | max(AIF) | maximum of AIF |
| min(Q) | minimum of Q value | min(AIF) | minimum of AIF |
| avg(Q) | average of Q value | avg(AIF) | average of AIF |
| med(Q) | median of Q value | med(AIF) | median of AIF |
| dev(Q) | deviation of Q value | dev(AIF) | deviation of AIF |
| sum(H-index) | sum of H-index | JIF | journal impact factor |

We used the Grid Search method to adjust the above-mentioned parameters. The learning rate ranges from 0.0005 to 0.5 with a step size of 0.0005. The number of iterations ranges from 500 to 3000 with a step size of 500. The minimum number of samples per leaf node ranges from 10 to 80 with a step size of 10. The maximum depth of the decision tree ranges from 5 to 7 with a step size of 1. The sampling rate ranges from 0.5 to 1.0 with a step size of 0.1.

According to the range of values and the step size of each parameter, a grid covering the parameter space is generated for grid search. Each point on the grid is traversed, and the parameter combination corresponding to the point is used to train the model on the training set. Prediction is then performed on the validation set, and the predictive accuracy is calculated as an estimate of the prediction performance of the model under that set of parameters. After traversing all the parameter combinations, the set of parameters with the highest prediction accuracy on the validation set is taken as the parameters of the final model.
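To make the size of this search concrete, the grid described above can be enumerated directly. This is a sketch under assumptions: the text does not name the GBDT library, so the dictionary keys simply borrow the parameter names of scikit-learn's `GradientBoostingRegressor`, and `score_fn` is a hypothetical callable that trains on the training set and returns the validation accuracy.

```python
from itertools import product


def build_grid():
    """Parameter ranges and step sizes as stated in the text."""
    return {
        "learning_rate": [round(0.0005 * k, 4) for k in range(1, 1001)],  # 0.0005 .. 0.5
        "n_estimators": list(range(500, 3001, 500)),                       # 500 .. 3000
        "min_samples_leaf": list(range(10, 81, 10)),                       # 10 .. 80
        "max_depth": [5, 6, 7],
        "subsample": [round(0.5 + 0.1 * k, 1) for k in range(6)],          # 0.5 .. 1.0
    }


def grid_size(grid):
    """Number of parameter combinations on the grid."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


def grid_search(grid, score_fn):
    """Visit every grid point and keep the combination with the highest score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        candidate = dict(zip(names, combo))
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


print(grid_size(build_grid()))  # 1000 * 6 * 8 * 3 * 6 combinations
```

With these ranges the full grid has 864,000 points, each of which requires training one model, which illustrates why the multi-feature model's performance is tied to a costly parameter adjustment step.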

5. Results and discussion

In this subsection, we introduce several evaluation metrics for validatingthe PPI prediction model.

Mean absolute error (MAE)

MAE quantifies how close the predictions are to the ground truth. MAE is given by:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |e_i| \tag{18}$$

The mean absolute error is the average of the absolute errors $|e_i| = |f_i - y_i|$, where $f_i$ is the prediction and $y_i$ is the true value; $n$ represents the number of predictions.

Root mean squared error (RMSE)

RMSE is similar to MAE and is defined as follows:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2} \tag{19}$$

RMSE also provides an average error and quantifies the overall error rate. In some cases, we need to compare results across activities, but RMSE cannot give an indication of the relative error. We need a normalized error, such as range-normalized RMSE.

Range-normalized RMSE (NRMSE)

$$NRMSE = \frac{RMSE}{\max(y_i) - \min(y_i)} \tag{20}$$

where $\max(y_i)$ and $\min(y_i)$ are the maximum and minimum of the ground-truth values over all test instances.

Mean absolute percentage error (MAPE)

A useful normalized metric is MAPE, which normalizes the error of each prediction by the true value. This metric shows the average deviation between predicted output and true output over the $n$ experimental data. MAPE is defined as follows:

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|e_i|}{y_i} \tag{21}$$

Accuracy

Accuracy shows the fraction of papers correctly predicted for a given error tolerance $\epsilon$:

$$Accuracy = \frac{1}{n} \sum_{i=1}^{n} \left|\, \frac{|e_i|}{y_i} \le \epsilon \,\right| \tag{22}$$

where the inner term counts 1 when the condition holds and 0 otherwise.

Figure 2 shows the feature importance scores of all features for predicting the 15th-year citations of the published papers. The features c[...], c[...], c[...], c[...], [...] Q value, JIF, and the authors' Q value sum, respectively; their values are 0.0608, 0.0527, 0.0441, 0.0338, 0.0331, and 0.0317. The feature importance scores for predicting the 6th-14th year citations of the published papers are shown in the appendix.

Figure 2: Feature importance score of all features.

Based on the importance of all features in predicting the 6th-15th year citations of papers, we selected the top 10 features and retrained the model; the prediction accuracy remains high. There are differences in feature importance scores for different predictive years (see appendix). Figure 3 shows the top 10 feature importance scores for predicting the 15th-year citations of the published papers. The features c[...], c[...] [...] Q value's minimum ranks fourth, and its value is 0.0969. Other feature importance scores are less than 0.0950.

Figure 3: Importance scores of top 10 features.

[...] Q value, and JIF are ranked in the top 10 of the feature importance ranking. Author H-index related features and author AIF related features rank toward the bottom of the feature importance list.

In summary, we observe that historical citations play an important role in predicting the impact of a paper. In addition, author-based features are important in predicting paper impact, especially the authors' Q value.

To test the validity of the PPI prediction model, its predictive performance is compared against four competing models: PPI_NECAI, GBDT_All, GBDT_10 and PLI_Science, published by Wang et al. (2013). The comparison is made in terms of MAE, RMSE, NRMSE, MAPE, and accuracy.

Figure 4 shows the MAE values of the five models.
According to Figure 4, we observe that PPI outperforms all competing models, with lower MAE values for predicting citations after a scholarly paper has been published for 5 years. We also observe that the MAE values of all five models increase with the year, indicating that the predictive performance of all five models degrades over time.

Figure 4: Comparing MAE for different models. (Axes: MAE against Time; series: PPI, PPI_NECAI, GBDT_All, GBDT_10, PLI_Science.)

Figure 5: Comparing RMSE for different models. (Axes: RMSE against Time; series: PPI, PPI_NECAI, GBDT_All, GBDT_10, PLI_Science.)

Figure 6 shows the NRMSE values of the five models. For the PPI model and the PPI_NECAI model, the NRMSE values are about 0.006. The NRMSE values of the GBDT_All model and the GBDT_10 model show increasing trends, reaching about 0.018 in the 10th year after the fifth year of publication. The NRMSE values of the PLI_Science model show a decaying trend. In terms of NRMSE, the predictive performance of the PPI model is better than that of the other four models.

Figure 7 shows the MAPE values of the five models. We observe that the MAPE values of the GBDT_All model and the GBDT_10 model are below those of the other three models. The MAPE value of the PPI model is slightly higher than those of the GBDT_All model and the GBDT_10 model.

Figure 8 shows the accuracy of the five models. The accuracy values of [...]

Figure 6: Comparing NRMSE for different models. (Axes: NRMSE against Time; series: PPI, PPI_NECAI, GBDT_All, GBDT_10, PLI_Science.)

Figure 7: Comparing MAPE for different models. (Axes: MAPE against Time; series: PPI, PPI_NECAI, GBDT_All, GBDT_10, PLI_Science.)

Figure 8: Comparing Accuracy for different models. (Axes: Accuracy against Time; series: PPI, PPI_NECAI, GBDT_All, GBDT_10, PLI_Science.)
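The five evaluation metrics compared in Figures 4 to 8 (MAE, RMSE, NRMSE, MAPE and accuracy) can be written down in a few lines. The sketch below follows their standard definitions. The function names and the sample numbers are illustrative, and it assumes every true citation count is positive so that the relative errors are defined.

```python
import math


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(f - y) for f, y in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root mean squared error."""
    return math.sqrt(sum((f - y) ** 2 for f, y in zip(pred, true)) / len(true))


def nrmse(pred, true):
    """RMSE divided by the range of the ground-truth values."""
    return rmse(pred, true) / (max(true) - min(true))


def mape(pred, true):
    """Mean absolute percentage error (assumes every true value is positive)."""
    return sum(abs(f - y) / y for f, y in zip(pred, true)) / len(true)


def accuracy(pred, true, eps):
    """Fraction of predictions whose relative error is at most eps."""
    hits = sum(1 for f, y in zip(pred, true) if abs(f - y) / y <= eps)
    return hits / len(true)


# Made-up predicted and true citation counts for three papers.
predicted, actual = [12, 18, 33], [10, 20, 30]
print(mae(predicted, actual), rmse(predicted, actual), nrmse(predicted, actual))
print(mape(predicted, actual), accuracy(predicted, actual, eps=0.15))
```

MAE, RMSE and NRMSE are driven by absolute errors, so highly cited papers dominate them, whereas MAPE and accuracy weigh each paper's relative error equally. This is one reason a model can lead on one group of metrics and trail on the other.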

By comparing PPI and PPI_NECAI, we observe that early citing authors' impact contributes to improved prediction of scholarly paper impact. PPI yields superior citation prediction over PPI_NECAI, GBDT_All, GBDT_10 and PLI_Science in terms of MAE and NRMSE. Although the predictive performances of the GBDT_All model and the GBDT_10 model are better than those of the other three models in terms of MAPE and accuracy, the proposed PPI prediction model gives a clear explanation for the predictive effect of the model through the following factors: inherent quality of scholarly paper, scholarly paper impact decaying over time, early citations, and early citers' impact. Compared to PPI_NECAI and PLI_Science, PPI more accurately predicts scholarly paper impact. Although considering early citers' impact can improve the predictive performance of the PPI model, other factors exist, such as the impact of the authors' team, journal impact, authors' cooperation relationships, and disciplinary differences. In addition, the APS dataset only contains local citations, which might limit the predictive accuracy of this work. Uncovering the essence of the paper potential index is promising future work, which might improve the predictive performance of the PPI model and could provide a better understanding of the evolution of scholarly paper impact.

6. Conclusion

Based on point estimation process, we present the PPI predictive model, which considers the following four factors: (1) inherent quality of scholarly paper; (2) scholarly paper impact decaying over time; (3) early citations; and (4) early citers' impact. Experimental results indicate that the PPI model improves citation prediction of scholarly papers. The predictive performance of PPI is better than that of PPI_NECAI, which reflects that early citing authors' impact is important for predicting the citations of scholarly papers. Although the predictive performance of the GBDT_All model and the GBDT_10 model is better than that of the other three models in terms of MAPE and accuracy, the proposed PPI predictive model gives a clear explanation for the predictive effect, indicating that an ultimate understanding of the long-term impact of scholarly papers will benefit from understanding the inherent evolutionary mechanism of citations of scholarly papers.

Acknowledgement

We thank Feng Xia and Jie Hou from the School of Software, Dalian University of Technology, for valuable discussions on this work. This work was partially supported by Liaoning Provincial Key R&D Guidance Project (2018104021) and Liaoning Provincial Natural Fund Guidance Plan (20180550011).