An Overview on Evaluating and Predicting Scholarly Article Impact

Abstract

Scholarly article impact reflects the significance of academic output recognised by academic peers, and it often plays a crucial role in assessing the scientific achievements of researchers, teams, institutions and countries. It is also used for addressing various needs in the academic and scientific arena, such as recruitment decisions, promotions, and funding allocations. This article provides a comprehensive review of recent progress in article impact assessment and prediction. The review starts by sharing some insight into article impact research and outlines the current research status. Some core methods and recent progress are presented to outline how article impact metrics and prediction have evolved to integrate multiple networks. Key techniques, including statistical analysis, machine learning, data mining and network science, are discussed. In particular, we highlight important applications of each technique in article impact research. Subsequently, we discuss the open issues and challenges of article impact research. At the same time, this review points out some important research directions, including article impact evaluation that considers Conflict of Interest, time and location information, various distributions of scholarly entities, and rising stars.


information

Article


Xiaomei Bai, Hui Liu *, Fuli Zhang, Zhaolong Ning, Xiangjie Kong, Ivan Lee and Feng Xia
School of Software, Dalian University of Technology, Dalian 116620, China; [email protected] (X.B.); [email protected] (Z.N.); [email protected] (X.K.); [email protected] (F.X.)
Computing Center, Anshan Normal University, Anshan 114007, China
Library, Anshan Normal University, Anshan 114007, China; [email protected]
School of Information Technology and Mathematical Sciences, University of South Australia, SA 5095, Australia; [email protected]
* Correspondence: [email protected]; Tel.: +86-0411-6227-4391
Academic Editor: David Bawden
Received: 23 May 2017; Accepted: 23 June 2017; Published: date


Keywords: scholarly big data; article impact; machine learning; data mining

1. Introduction

Scholarly impact acts as one of the strongest currencies in academia, and it is frequently measured in terms of citations of research articles. Citations indicate the impact of scholars, articles, journals, institutions, and other scholarly entities [1]. The influence of an article is often quantified as an index, which represents its contribution to improving the research findings of other scholars [2]. Research on the impact of scientific articles mainly focuses on two interrelated questions: how to assess the past impact of an article, and how to accurately predict its future impact. The study of article impact is important for evaluating the impact of individual scientists, journals, teams, institutions, and even countries. It is also crucial for addressing fundamental problems such as rewards, funding allocation, promotion, and recruitment decisions. Evaluating and predicting article impact have attracted great attention in the academic and scientific arena over the past decades. Over this period, the methods have shifted from one dimension to multiple dimensions, and from unstructured metrics to structured metrics (Figure 1). Citations [3] are a popular indicator of article impact; however, they capture only a single dimension. Altmetrics [4] provide information on downloads, views, shares, and citations to assess article impact from a multidimensional perspective. PageRank has been introduced to evaluate article impact [5], which can be viewed as a milestone in impact research because it offers a structured method to quantify article impact. Meanwhile, in order to objectively evaluate article impact and accurately predict its future impact, machine learning and data mining techniques play crucial roles, such as mining the important characteristics of scholarly networks and optimizing the performance of algorithms [6].

Figure 1. Methods of evaluating and predicting article impact. (Figure not reproduced; recoverable labels: Article; Citations; Altmetrics; Citation-based; Time; Location; Author; Journal; Institution; Social Network; Download; Twitter; View; Share; Bookmark; Like.)

What drives the rapid development in evaluating and predicting article impact? The past decade has witnessed rapid growth in the ability of network platforms to gather and transport large amounts of academic data, a phenomenon usually referred to as “Big Scholarly Data” (see Figure 2). Figure 2 shows different networks with various scholarly entities and their relationships. Scholars can collect such data to solve the problems of scholarly impact evaluation, and can obtain useful insights from these datasets by leveraging statistical analysis, machine learning, data mining, and network science techniques. The exponentially growing academic data have become essential for developing scholarly impact metrics, which combine statistical and computational considerations. However, one problem cannot be ignored: each of these datasets has its own scope and characteristics. SCOPUS contains abstracts and citations of journal papers. Web of Science, from Thomson Reuters, offers online scientific citations. PubMed includes more than 23 million citations for biomedical literature. CiteULike allows users to search and share scholarly papers. Mendeley can be used not only to manage references but also as an academic social platform. Digital Bibliography & Library Project (DBLP) lists publications of journals and conferences but does not include citation information. Microsoft Academic Graph (MAG) includes heterogeneous information with publication records, authors, institutions, journals, conferences, fields of study and citation relationships. In these raw data, the most prominent problems are missing and incomplete data, which are likely to degrade the performance of evaluation and prediction to some extent. Data cleaning and supplementation are therefore necessary for obtaining accurate evaluative and predictive results. Besides, these datasets can be jointly investigated to complement one another. For example, DBLP does not include citation information, but it has an effective mechanism for name disambiguation. Integrating the DBLP dataset with the citation information of SCOPUS can meet the needs of some scholarly analyses.

Figure 2. Characterizing scholarly networks. (Figure not reproduced; it depicts author, article, journal, institution, conference and keyword nodes. Networks listed in the figure: author citation networks; co-author networks; author co-cited networks; author-article networks; author-journal networks; conference-author networks; author-institution networks; author-keyword networks; article citation networks; bibliographic co-cited networks; bibliographic coupling networks; journal-article networks; conference-article networks; article-institution networks; article-keyword networks; journal citation networks; journal co-cited networks; institution citation networks; institution collaboration networks; co-word networks; topic-occurrence networks; conference citation networks.)

2. Key Techniques

In this section, we discuss four crucial techniques for evaluating and predicting article impact: statistical analysis, machine learning, data mining and network science.

Statistical methods cover the process of collecting, processing, analysing and explaining data. Researchers can gain scientific knowledge from data through statistics. Statistical analysis is mainly concerned with analyzing and understanding data, and includes regression models, variable selection, principal component analysis, factor analysis, cluster analysis, canonical correlation analysis, time series analysis, probability and density estimation, and so on [7,8]. Regression models comprise single-variable regression and multiple-variable regression. Statistical techniques are used in most fields of natural and social science, such as finance, medical treatment, and industry. In scholarly impact research, the benefits of statistical techniques are as follows:

• Pre-process data;
• Optimize parameters, for instance by multiple-variable linear regression;
• Select features and improve models for scholarly evaluation and prediction, for example by using the massive existing statistics to estimate a probability density function;
• Analyse scholarly data to obtain statistical data, and then use a statistical model to predict the trends of impact, top scholars, top articles, etc.

To explore the relationships between citations and citation distance, statistical analyses such as grouping and clustering may be applied. When group analysis is used, an appropriate segmentation point of citation distance is crucial, and its selection usually depends on the experimental data. An advantage of group analysis is that the data are relatively easy to handle. Its obvious disadvantage is that the groups are imposed on the data in advance. Clustering analysis remedies this drawback; for example, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm can be used to analyze the relationships between citations and citation distance based on the density of institutions.

In short, statistical analysis can be used to pre-process data, capture intermediate results or obtain final results in research on evaluating and predicting scholarly impact. For example, a multivariate linear regression was used to estimate the parameters of three algorithms for evaluating the impact of papers [9]. Based on principal component analysis, a factor analysis was used to explore the main components in bibliometric and altmetric indicators [10]. In particular, when scholars predict the citations of a paper or a scholar’s H-index, they usually give an estimated range instead of a specific value.
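As a hedged illustration of the regression-based use of statistics described above, the following sketch fits a single-variable linear regression by ordinary least squares and reports a prediction as a range rather than a single value. The sample data, the function names `fit_line` and `predict_range`, and the choice of a range of plus or minus two residual standard deviations are illustrative assumptions, not taken from the cited studies.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_range(xs, ys, x_new, width=2.0):
    """Return (low, point, high): a point prediction with a rough range of
    +/- `width` residual standard deviations (an illustrative choice)."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # n - 2 degrees of freedom because two parameters were estimated
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    point = intercept + slope * x_new
    return point - width * sigma, point, point + width * sigma

# Hypothetical data: citations in the first two years vs. citations after ten years.
early = [1, 2, 4, 5, 8]
late = [6, 9, 17, 22, 31]
low, point, high = predict_range(early, late, x_new=6)
print(f"predicted long-run citations: {point:.1f} (range {low:.1f} to {high:.1f})")
```

A multiple-variable regression follows the same idea with more than one explanatory feature; the single-variable case is shown only because it fits in a few lines without external libraries.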

Machine learning is one of the most rapidly developing techniques, and it helps computers solve problems by learning from experience [6]. Machine learning mainly includes three major paradigms: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is widely applied in e-mail spam classification, face identification, and medical diagnosis. It aims to generate predictions according to its mapping functions. Depending on the mapping function, learning algorithms are divided into neural networks [11], support vector machines (SVM) [12], decision trees [13], logistic regression [14], and decision forests [15]. The mapping functions are driven by different kinds of application needs. Unsupervised learning focuses on direct inference of predictions without the help of training samples of previously solved cases [16]. Reinforcement learning aims to learn a mapping function from training data whose information is intermediate between that of supervised and unsupervised learning. Reinforcement learning has been successfully applied in human-level control [17].

In recent years, one prominent advance in supervised learning involves deep neural networks. Deep learning [18] has played an important role in computer vision, speech recognition, natural language translation, and collaborative filtering. Deep learning algorithms can be used to discover useful representations of the input data without the requirement of labelled training data. The development of machine learning is closely related to progress in other research fields. As machine learning theory develops, we will see the benefits it brings us. Machine learning contributes to scholarly impact research as follows:

• Design effective algorithms fitted to various sources of scholarly data;
• Predict future trends, such as the future impact of articles and scholars;
• Conduct scholarly recommendation, such as recommending collaborators and the articles with top impact in various research fields.

Currently, some researchers have leveraged machine learning techniques to successfully predict scholarly impact, including that of articles, scholars, institutions and even countries. The commonly used methods for predicting scholarly impact include neural networks, SVM, Markov models [19], XGBoost [20], etc. In terms of predictive performance, the neural network model is better than the Markov model. Compared with the Markov and SVM models, the neural network model can handle large amounts of data. The SVM model can be used directly for both regression and classification, and it is more suitable for small amounts of scholarly data. Moreover, combining the SVM model with the neural network model can achieve better predictive performance than either model alone. The predictive power of XGBoost is better than that of Markov, neural network, and SVM models. However, a disadvantage of XGBoost is that it needs a large number of parameters to be tuned. In the future, we believe that machine learning can provide more support for resolving emerging issues about scholarly impact, and can also provide models for understanding learning in scholarly impact, biological evolution, neural systems and other research fields.

Data mining is used to discover knowledge hidden in large amounts of data, and includes spatial data mining, temporal data mining, sequence data mining and intention data mining [21]. Data mining has important applications in finance, telecommunication, science, and engineering. In recent years, we have witnessed a rapid expansion in the ability to collect data in different formats from various sensors and online media platforms. For instance, large volumes of data are generated by online platforms like Facebook, Twitter, and Google. Big data drives scholars to continually explore useful patterns for better services. Meanwhile, it poses a challenge for big data mining. Data mining can benefit scholarly impact research as follows:

• Mining heterogeneous academic networks, such as article-author networks, author-journal networks, author-institution networks, etc.;
• Exploring the complex relationships among academic entities, including the relationships of papers, authors, journals, conferences, teams, institutions and countries;
• Automatically seeking patterns in scholarly data to predict future trends and improve predictive performance;
• Mining large data streams for effective scholarly recommendations;
• Cleaning scholarly data to gain valuable information;
• Integrating diverse kinds of scholarly data.

In brief, data mining can solve scholarly evaluation and prediction problems by analyzing data in databases. Discovering meaningful patterns in scholarly networks brings several advantages. Useful patterns allow scholars to predict scholarly impact based on new data. For example, in order to predict the impact of an article, we may first train machine learning models, such as neural networks, Markov models and SVMs, on data from previous years. In the same way, we can predict the future impact of an institution on testing datasets. How a pattern is expressed is important. A pattern can be expressed in two ways: as a transparent box or as a black box. A transparent box discloses the structure of the pattern and thereby explains something about the scholarly data, whereas the inner construction of a black box cannot be inspected. Data mining also involves learning to find structural patterns in scholarly data, which helps to explain the data before making predictions. In data mining, machine learning is applied in many research fields. It is used to capture the explicit knowledge structures that are important for performing well on new data.
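The train-then-test workflow mentioned above can be sketched with the simplest of the listed models, a first-order Markov chain. In this minimal sketch, yearly citation counts are discretised into coarse states, transition counts are estimated on the earlier years, and the most frequent successor state is compared with a held-out final year. The state cut-offs, the sample histories and all function names are hypothetical; the cited studies may use quite different formulations.

```python
from collections import Counter, defaultdict

def to_state(citations):
    """Discretise a yearly citation count into a coarse state (illustrative cut-offs)."""
    if citations == 0:
        return "none"
    return "low" if citations < 5 else "high"

def train_transitions(sequences):
    """Count first-order transitions state(t) -> state(t+1) over all training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, state):
    """Return the most frequent successor of `state`, or `state` itself if it was never seen."""
    if not counts[state]:
        return state
    return counts[state].most_common(1)[0][0]

# Hypothetical yearly citation counts per article; the last year is held out for testing.
histories = {
    "paper_a": [0, 1, 3, 6, 8],
    "paper_b": [0, 0, 1, 2, 2],
    "paper_c": [2, 5, 7, 9, 12],
}
train = [[to_state(c) for c in h[:-1]] for h in histories.values()]
model = train_transitions(train)

hits = 0
for h in histories.values():
    predicted = predict_next(model, to_state(h[-2]))
    hits += predicted == to_state(h[-1])
print(f"held-out accuracy: {hits}/{len(histories)}")
```

Splitting by time rather than at random keeps information from the future out of the training data, which matters when the goal is to forecast impact.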

Network science can help to understand the structure, development and weaknesses of networks. Despite their apparent diversity, many networks are generated, evolve, and are driven by some basic laws and mechanisms. For instance, degree distributions have been shown to follow a power law in many networks, and the small-world property is an important principle in many networks. Two important organizing principles of the evolution of networks have been introduced, i.e., preferential attachment and fitness [22].

Well-known scholarly network structures are complex, including homogeneous and heterogeneous networks [23], and directed and undirected networks [24]. Homogeneous citation networks include article-article networks, author-author networks, journal-journal networks, word-word networks, etc. Figure 3 shows the citation relationships of an article-article network generated from 486 randomly extracted articles and their references in the APS dataset, with 561 edges in total. Each circle represents an article and the links represent citation relationships. Blue, yellow and green represent nodes with small, medium and large degrees, respectively. Heterogeneous citation networks include article-author networks, author-journal networks, article-journal networks, etc. In particular, co-author networks and co-word networks are also important homogeneous networks in scholarly impact studies. A citation network is a representative directed network, with a link from a citing paper to a cited paper, while a co-author network is an undirected network. The important indices of nodes in undirected networks include degree centrality, betweenness centrality, closeness centrality, k-shell, k-core, and eigenvector centrality. In directed networks, two representative algorithms, i.e., PageRank [25] and HITS [26], are commonly used to calculate the importance of nodes.
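To make the use of PageRank on a directed citation network more concrete, the following minimal sketch runs the standard power iteration on a toy graph. The toy network, the damping factor of 0.85, the fixed number of iterations and the function name `pagerank` are assumptions chosen for illustration; the variants used in the cited work may differ.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed citation graph.

    `citations` maps each paper to the list of papers it cites
    (an edge points from the citing paper to the cited paper).
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            references = citations.get(p, [])
            if references:
                share = damping * rank[p] / len(references)
                for cited in references:
                    new_rank[cited] += share
            else:
                # A paper without references spreads its score evenly (dangling node).
                for q in papers:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical toy network: A and B cite C, C cites D, D has no references.
toy = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(toy)
for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
```

In this toy example D receives the highest score although it is cited only once, because its single citation comes from the well-cited paper C. This is the sense in which a structured metric differs from a plain citation count.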

Figure 3. Characterizing citation relationships of article-article networks. The degrees of nodes range from small (blue) to large (green). The larger the degree of a node is, the more references the article has.

In diversified scholarly networks, network analysis plays a key role in the following three aspects. First, network analysis helps to identify key nodes in scholarly networks. These key nodes are a series of scholarly entities, including top influential articles, top influential authors, top influential journals, top influential teams, top influential institutions, co-authors with super ties, academic rising stars [27,28], serendipity in scientific collaboration [29], Sleeping Beauties in science [30], etc. We also need to study how the importance of various nodes differs between unweighted and weighted networks. Most scholarly networks are weighted [31,32], but appropriate weights cannot always be obtained, even though an appropriate weight is the key to quantifying scholarly impact. For example, the impact of an article is no longer a simple citation count. The importance of each article in a citation network should take into account, through analysis of the network, the authority of the authors of the citing articles and the prestige of the journal in which the article is published. Citation-based structured measurements have provided a new perspective for evaluating scholarly impact. Second, network analysis helps to explore the most important structural features, such as the features that determine the success of scholars [33], of an article [34], and of teams [35]. Third, network analysis helps to quantify the relationships among scholarly entities, including articles, authors, journals, conferences, institutions, teams and countries. For example, previous researchers have quantified the relationships of co-authors in the scientific community, identifying scientific collaborations with weak, strong, and super ties from a longitudinal perspective [36]. All in all, structured network analysis provides a solution for quantifying scholarly impact.

In the next section, we will introduce article impact metrics and prediction, mainly covering two aspects: core methods and their recent research progress.

3. Article Impact Metrics

Figure 4 provides a framework for evaluating article impact by building and testing a set of scholarly data models, including data collection, data pre-processing, data analysis, feature selection, algorithm design, algorithm optimization and algorithm evaluation. Datasets refer to original datasets like DBLP, APS and MAG. According to the targets of evaluation, we can use the original datasets or complement them by crawling necessary data from scholarly websites like SCOPUS. Studying the various relationships, including citation, co-author, co-cited, etc., is beneficial for assessing scholarly impact. For example, identifying different citation relationships provides an objective evaluation method [9]. In the assessment framework, the assessment method is the most central part. Currently, there are several types of assessment methods: citations, Altmetrics and citations-based structured metrics. Verifying the validity of the method is also an essential part. Common evaluation methods include Spearman’s correlation coefficient, recommendation intensity and so on. According to the way of measurement, impact metrics can be broadly divided into two categories: unstructured metrics (or statistical metrics) and structured metrics.
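As an illustration of the evaluation step, the sketch below computes Spearman’s correlation coefficient between the scores produced by a metric and a reference ranking such as later citation counts. It uses the standard definition (the Pearson correlation of the rank vectors, with tied values sharing their average rank); the sample numbers and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores from a new metric and the later citation counts of five articles.
metric_scores = [0.9, 0.4, 0.7, 0.1, 0.5]
citation_counts = [120, 30, 35, 2, 40]
print(round(spearman(metric_scores, citation_counts), 3))
```

A coefficient close to 1 means that the metric orders the articles almost exactly as the reference does. Because only ranks are compared, the coefficient is insensitive to the scale of either quantity.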

Figure 4. Frameworks of evaluating article impact. (Figure not reproduced; recoverable labels: Datasets: DBLP, APS, MAG, …; Processed datasets: article DOI, article name, author name, institution name, authors’ order, country, …; Relationships: citation, co-author, co-cited, co-occurrence, …; Methods: citations, Altmetrics, citations-based, …; Intelligent techniques: statistics, machine learning, data mining, network science, …; Optimization algorithms: single variable linear regression, multiple variables regression, …; Evaluation methods: Spearman’s correlation coefficient, recommendation intensity, …; Steps: 1. Preprocess, 2. Analyze, 3. Select features, 4. Design, 5. Optimize, 6. Evaluate.)

Citations, as a statistical method, are perhaps the oldest and most widely used metric for article impact evaluation, yet their use as a metric has always been disputed. From the perspective of objective evaluation, can raw citations truly characterize the quality of an article? The answer is obviously no. The biggest obstacles are self-citation and mandatory citation [37], which have increased the difficulty of objectively measuring article impact. Accurately identifying the various kinds of self-citation and mandatory citation is challenging. Meanwhile, negative citation has attracted scholars’ attention [38]. However, researchers have not stopped at distinguishing citation patterns. Scholarly publication is shifting from traditional print to online platforms. This change generates some open issues, but it also presents an opportunity to characterize article impact from a multidimensional perspective.

Altmetrics [39] have emerged at this historic moment and attracted much attention in the academic community. Altmetrics are the study of measuring scholarly impact based on activities in social media platforms, and go beyond citations [40]. Altmetrics present various quantitative values, including citations, downloads, mentions, tweets, shares, views, discussions, saves and bookmarks, from a statistical perspective. Altmetrics scores (mentions in blogs) can be used to identify highly cited articles. At the same time, Altmetrics can complement and improve the evaluation of article impact with new insights [10]. Although they broaden the methods for measuring scholarly impact, Altmetrics lack authority and credibility as metrics, partly because they are easy for malicious scholars to game [41].

Citations-based structured methods have made some progress. One significant development in impact metrics in recent years involves homogeneous and heterogeneous networks [42], including citation networks, co-author networks, co-citation networks, article-author networks, article-journal networks, and author-journal networks. The diversity of scholarly networks can satisfy the diverse needs of applications, with different scholarly structures capturing different kinds of scholarly characteristics. One thing is certain: citations-based structural metrics can generate a truer measure of the importance of an article than citations alone. Previous researchers have contributed to structural metrics for evaluating article impact [5,43–48]. These assessment methods are mostly based on the PageRank and HITS algorithms. The PageRank algorithm provides a fast and objective way to rank the nodes in a network. In a citation network, papers with higher PageRank scores have more chances to be visited. PageRank is more suitable for homogeneous networks. In scholarly networks, the HITS algorithm distinguishes scholarly entities as authorities and hubs based on the local structure, and calculates their scores in a mutually reinforcing way. The HITS algorithm can also be applied to heterogeneous networks like paper-author networks and paper-journal networks, in which the authors and journals are regarded as hub nodes, and the papers are regarded as authority nodes. It is worth mentioning that the S-index metric measures article impact through influence propagation in heterogeneous citation networks [49]. Meanwhile, Neil Shah et al. suggested that a good impact metric should consider the following aspects: volume sensitivity, prestige sensitivity, robustness, extensibility, temporality, interpretability and computability. Exploiting the structural characteristics of networks may provide an opportunity to develop a refined and objective metric for measuring scholarly impact.

While co-citation analysis can be used to associate relevance across different disciplines and to identify bridging nodes [50], it should be noted that citation-based metrics are biased by differing domain sizes and citation activities [51]. Domain variation may hamper a fair evaluation of scholarly impact, because scholarly papers in some disciplines are cited much more or much less than those in others [52]. Two important reasons cause this. One is the uneven number of papers cited per article in different domains; the other is unbalanced cross-discipline citation. Although scholarly papers can be cited by different domains, Schneider et al. [53] suggested that relative citation patterns within disciplines should be considered in the evaluation of scholarly impact.
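The mutually reinforcing computation of HITS on a heterogeneous network, described earlier in this section, can be sketched as follows for a paper-author network in which authors act as hubs and papers as authorities. This is a minimal illustration of the textbook iteration with Euclidean normalisation; the toy data and the function name `hits_paper_author` are hypothetical, and the metrics in [5,43–48] add further ingredients that are not modelled here.

```python
import math

def hits_paper_author(authorship, iterations=50):
    """HITS on a bipartite paper-author network.

    `authorship` maps each paper to its list of authors. Following the text,
    authors act as hubs and papers as authorities.
    """
    papers = list(authorship)
    authors = sorted({a for names in authorship.values() for a in names})
    authority = {p: 1.0 for p in papers}
    hub = {a: 1.0 for a in authors}

    for _ in range(iterations):
        # A paper's authority is the sum of the hub scores of its authors.
        authority = {p: sum(hub[a] for a in authorship[p]) for p in papers}
        norm = math.sqrt(sum(v * v for v in authority.values()))
        authority = {p: v / norm for p, v in authority.items()}
        # An author's hub score is the sum of the authority scores of their papers.
        hub = {a: sum(authority[p] for p in papers if a in authorship[p]) for a in authors}
        norm = math.sqrt(sum(v * v for v in hub.values()))
        hub = {a: v / norm for a, v in hub.items()}
    return authority, hub

# Hypothetical data: three papers and their authors.
toy = {"P1": ["Ann", "Bob"], "P2": ["Ann"], "P3": ["Cid"]}
authority, hub = hits_paper_author(toy)
print(max(authority, key=authority.get), max(hub, key=hub.get))
```

Note that plain HITS concentrates its scores on the most densely connected part of the network, so the isolated pair P3 and Cid ends up with scores near zero. This is one reason why practical metrics combine several networks or add weights.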

4. Article Impact Prediction

Prediction of future impact is an emerging area of research within the “science of science”. Impact prediction is arguably more important than impact evaluation, because it can directly inform the allocation of funds, scientific awards, and other decisions. Figure 5 provides a flowchart of a computational model for predicting article impact. The left column (Input) is the input data, capturing publication, citation, downloads, reviews, and other information. The center column (Model) describes model learning and testing. The right column (Output) provides a few specific examples of what the model can predict.

Article impact prediction in particular has attracted a lot of attention in recent years. Predicting the impact of an article mainly focuses on predicting citations or citation distributions through network science, data mining and machine learning techniques (see Table 1). Early citations of an article play a critical role in predicting its long-run citations [54]; the authors of [54] showed that a university ranking based on cumulative citations can be easily predicted from the early citations received across the economics discipline at a university. Cao et al. [55] presented a Gaussian mixture model to predict future citations of papers based on short-term citation activities. Peter et al. [56] constructed a keyword-term network to predict the number of future citations by analyzing recursive centrality measures, indicating that document centrality has higher predictive ability for the future citations of papers. Based on quantile regression, Stegehuis et al. [57] proposed a model to predict the probability distribution of future citations of an article, and considered two key features: early citations and journal impact factor. Yu et al. [58] leveraged four categories of features, including articles, authors, citations, and journals, to predict future citations of an article based on stepwise regression analysis. Based on co-authorship networks, a machine learning classifier was developed to predict whether a publication would be highly cited [59]. A supervised classification model based on a random forest classifier was also presented, in which multidimensional feature vectors were considered to predict the future citations of a paper. Wang et al. [3] constructed a generative model for predicting the long-term impact of an article by using three key factors: preferential attachment, citation trend, and fitness.

In short, previous studies mostly rely on early citations for predicting the impact of a paper. They mainly focus on the autocorrelation of historical data in the citation network. However, a common drawback of these predictive methods is that they depend too much on historical citations. Exploring the fundamental characteristics of how citations arise may lead to a novel predictive method that does not rely on early citations. In recent years, with the development of social media, social media activities have been used to reflect the underlying impact of an article. For example, tweets posted within three days of an article’s publication can predict whether the article will be cited frequently [60]. Based on a heterogeneous scholarly network, Mohan et al. [61] predicted academic impact by integrating bibliometric data with social data like weblogs and mainstream news, indicating that a graph-based measure can reasonably predict the impact of early-stage researchers.
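To give a feeling for how a generative model such as that of Wang et al. [3] turns a few per-article parameters into a long-term forecast, the sketch below evaluates a closed-form cumulative citation curve. The text above does not reproduce the model’s equation, so the formula used here, c(t) = m (exp(λ Φ((ln t − μ)/σ)) − 1) with Φ the standard normal distribution function, as well as the constant m = 30 and all parameter values, should be read as assumptions of this sketch and checked against [3] before use.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cumulative_citations(t, fitness, mu, sigma, m=30.0):
    """Expected cumulative citations t years after publication (t > 0).

    fitness (lambda), mu and sigma are per-article parameters; m is a global
    constant. The closed form and m = 30 are assumptions of this sketch.
    """
    return m * (math.exp(fitness * normal_cdf((math.log(t) - mu) / sigma)) - 1.0)

def ultimate_impact(fitness, m=30.0):
    """Limit of cumulative_citations as t grows without bound."""
    return m * (math.exp(fitness) - 1.0)

# Hypothetical parameters for one article.
params = dict(fitness=2.0, mu=1.5, sigma=1.0)
for year in (1, 5, 20):
    print(year, round(cumulative_citations(year, **params), 1))
print("ultimate:", round(ultimate_impact(params["fitness"]), 1))
```

In this form the long-run total depends only on the fitness parameter, while mu and sigma control how quickly the article approaches it. In practice the three parameters would be fitted to the early citation history of each article.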

Figure 5. Flowchart of predicting article impact. Figure labels: Input (APS, DBLP, MAG..., Altmetrics, other data); Model (learning, testing, simulations); Output (citations, citation distribution, Sleeping Beauty, Rising Star). Further labels: citation history, authors' social structure, citation distribution, time and other determinants, institutions, new data, different features.

Table 1. Several representative methods for predicting article impact.

| Features | Prediction Goal | Main Techniques |
|---|---|---|
| early citations, Journal Impact Factor | quantiles of the citation distribution | quantile regression |
| author characteristics, institutional factors, features of article organization, research approach | citations | multivariate analysis |
| Social dimension: co-authorship networks | citations | random forest classifier |
| year, page count, author count, author name, journal, abstract length, title length, special issue, etc. | long-term citations | random forest |
| Altmetrics: tweets | citations | correlation analysis, linear regression analysis |

There is increasing interest in identifying Sleeping Beauties in science. A Sleeping Beauty in the scientific community is an article whose value is recognized only years after its publication [30]. Ke et al. [30] proposed a parameter-free method for identifying Sleeping Beauties in large-scale datasets.
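The parameter-free measure of Ke et al. [30] is the beauty coefficient B. It compares an article's yearly citation curve with the straight line joining its citation count in the publication year to its peak: the longer and deeper the curve stays below that line, the larger B becomes. The sketch below is a minimal implementation under stated assumptions: the input is a list of yearly citation counts starting at the publication year, and ties for the peak year are resolved by taking the earliest one, which is a choice of this sketch.

```python
def beauty_coefficient(yearly_citations):
    """Beauty coefficient B of a single article.

    yearly_citations[t] is the number of citations received in year t
    after publication (t = 0 is the publication year).
    """
    if not yearly_citations:
        return 0.0
    # Year of maximum yearly citations (earliest year on ties).
    t_m = max(range(len(yearly_citations)), key=lambda t: yearly_citations[t])
    if t_m == 0:
        # B is defined as 0 when the peak is in the publication year.
        return 0.0
    c_0 = yearly_citations[0]
    c_m = yearly_citations[t_m]
    slope = (c_m - c_0) / t_m  # reference line from (0, c_0) to (t_m, c_m)
    return sum(
        (slope * t + c_0 - c_t) / max(1, c_t)
        for t, c_t in enumerate(yearly_citations[: t_m + 1])
    )


if __name__ == "__main__":
    # Hypothetical citation histories.
    print(beauty_coefficient([0, 0, 0, 0, 10]))  # long sleep, sudden awakening
    print(beauty_coefficient([0, 1, 2, 3, 4]))   # steady linear growth
```

An article whose citations grow linearly up to the peak scores B = 0, while a long dormant period followed by a burst produces a large positive value.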

5. Open Issues and Challenges on Article Impact Metrics

Although pioneers have achieved success, article impact remains a young field with many open issues. Previous research has used many different datasets to quantify scholarly article impact, and these granular and inconsistent data have been applied across various scholarly studies. Sharing datasets is necessary and valuable for objectively evaluating article impact and generating new metrics, so unified and consistent scholarly datasets are an open issue. Citation-based structured metrics are relatively new and have received less attention. The importance of citation structures has been newly shaped by the introduction of the PageRank and HITS algorithms into scholarly networks. In addition, socially dimensioned assessment and citation distributions have been less explored. Thus, multidimensional metrics for quantifying article impact are an open issue. Altmetrics have been considered as a complement to article-level metrics, and pioneering researchers have made some progress, but using Altmetrics to evaluate scholarly article impact is still an open issue. Within it, locating reasonable and available benchmarks is an urgent problem.

With the rapid emergence of a large number of social platforms, scholarly datasets capture previously unobserved events in academia. Even though these datasets have distinctive characteristics, they suffer from missing data, duplicated data, and data uncertainty. Evaluation metrics based on such inconsistent datasets can cause problems. For example, scientific findings from previous research may be hard to reproduce. Therefore, unified and consistent scholarly datasets should be established and shared by researchers for impact metrics.

In previous research, citation-based structured metrics mainly consider the dimensions of authors, journals, articles and time. The importance of each citing author in the citation network is usually ignored: the impact an article gains from a citation is regarded as the same no matter who cites it. In fact, citing authors' impact in citation networks should be investigated to quantify article impact objectively. Copying citations from other articles is a frequently observed practice in academic publications [62]. Therefore, an article may gain more citations through frequency-dependent copying if it is cited by experienced scholars. Article impact can be influenced by many factors, such as authors' social relationships and the citation distributions of authors, journals, institutions and countries. In particular, identifying anomalous citation patterns and weakening their citation strength are critical for objectively measuring article impact [9]. Although analysing Conflict of Interest (COI) relationships between authors offers one way to identify anomalous citations, COI relationships need to be mined further for more objective assessment. These problems have not yet been addressed. Therefore, future impact metrics need to explore author importance in citation networks, authors' social relationships, various citation distributions, etc.

Altmetrics are recent article-level metrics [63] and are usually considered a complement to citations. Altmetrics have some merits for evaluation. However, they are based only on web usage statistics [64] and are more easily manipulated through artificial downloading, sharing, commenting, etc. What can be done to guarantee the credibility of social media data for evaluating article impact? What can be measured by Altmetrics? How should data sources for Altmetrics be selected? What relationships exist between Altmetrics and citations? Using data analysis techniques to explore Altmetrics indicators in depth provides a possible way to validate Altmetrics. There are many unexplored opportunities in article impact research.

Available and credible benchmarks are key to measuring article impact. Although the past decades have witnessed important progress, it is difficult to verify the performance of article impact metrics. Without the right datasets and standards, the metrics developed are not contextually robust and cannot be properly understood [65]. Therefore, how to select benchmarks based on unified and consistent scholarly datasets, with the aim of objectively quantifying impact, is an important open issue.
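The question of what relationship exists between Altmetrics and citations is commonly approached with correlation analysis (see Table 1). As a minimal sketch, the code below computes Spearman's rank correlation between two per-article indicators, using average ranks for ties. Spearman's coefficient is chosen here as an assumption because both indicators tend to be heavily skewed; the data in the usage example are invented for illustration.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented per-article counts, for illustration only.
    tweets = [0, 3, 12, 1, 40, 7]
    citations = [2, 5, 30, 1, 25, 9]
    print(round(spearman(tweets, citations), 3))
```

A high coefficient on real data would only show that the two indicators rank articles similarly; it would not by itself establish that the Altmetrics data are credible or free of manipulation.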

6. Open Issues and Challenges on Article Impact Prediction

Although this review has summarized article impact prediction so far, many further issues and challenges must be addressed to predict impact accurately. In this section, we point out some potential issues beyond unified scholarly datasets and benchmarks.

Despite previous analyses of the Sleeping Beauty phenomenon, various issues remain to be addressed. How can Sleeping Beauties in science be identified? How can their impact be predicted? Are trending topics related to Sleeping Beauties? Do trending topics help to predict Sleeping Beauties? Do the correlations between Sleeping Beauties and journals, and between Sleeping Beauties and institutions, influence the impact of Sleeping Beauties? More effort is needed to explore these critical scientific problems.

Although pioneering researchers have successfully predicted article impact from a multidimensional perspective, full integration of multidimensional datasets still needs to be explored. The breadth and depth of an article's impact are unfortunately often characterized from a single perspective. For example, previous research generally focused on early citations to predict the impact of an article [54]. However, little attention has been paid to location information (such as institutions and countries), social relationships and citation distributions for predicting impact. Therefore, future research needs to predict article impact from multiple dimensions.

Predicting which articles will see fast-rising citations provides valuable guidance to academia. It can help researchers to find popular or new topics, advanced techniques, significant findings, etc. A direct benefit is that researchers avoid wasting time in the ocean of scholarly data. Which features contribute to enhancing an article's impact? Finding these features is beneficial for predicting rising stars among articles.

7. Conclusions

This article presents a detailed overview of evaluating and predicting article impact and discusses the open issues and challenges that remain to be solved. First, we gave a brief introduction to article impact research. Next, we elaborated on core methods and recent progress. Then, we introduced some key techniques and the opportunities that arise from leveraging statistics, machine learning, data mining and network science. Finally, we presented open research issues regarding the assessment and prediction of article impact, and pointed out potential research directions.

Author Contributions: X.B. conceived the study and wrote the manuscript; F.X. supervised the design and the development of the proposed study; F.Z. contributed statistical analysis work; X.B., H.L., F.Z., Z.N., X.K., I.L. and F.X. revised the manuscript; and all authors have read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Aguinis, H.; Suárez-González, I.; Lannelongue, G.; Joo, H. Scholarly impact revisited. Acad. Manag. Perspect. , , 105–132.
2. Gargouri, Y.; Hajjem, C.; Larivière, V.; Gingras, Y.; Carr, L.; Brody, T.; Harnad, S. Self-selected or mandated, open access increases citation impact for higher quality research. PLoS ONE , , e13636.
3. Wang, D.; Song, C.; Barabási, A.L. Quantifying long-term scientific impact. Science , , 127–132.
4. Piwowar, H. Altmetrics: Value all research products. Nature , , 159–159.
5. Chen, P.; Xie, H.; Maslov, S.; Redner, S. Finding scientific gems with Google’s PageRank algorithm. J. Informetr. , , 8–15.
6. Jordan, M.; Mitchell, T. Machine learning: Trends, perspectives, and prospects. Science , , 255–260.
7. Johnson, R.A.; Wichern, D.W. Applied Multivariate Statistical Analysis; Prentice Hall: Upper Saddle River, NJ, USA, 2002; Volume 5.
8. Di Ciaccio, A.; Coli, M.; Ibanez, J.M.A. Advanced Statistical Methods for the Analysis of Large Data-Sets; Springer: Berlin/Heidelberg, Germany, 2012.
9. Bai, X.; Xia, F.; Lee, I.; Zhang, J.; Ning, Z. Identifying anomalous citations for objective evaluation of scholarly article impact. PLoS ONE , , e0162364.
10. Costas, R.; Zahedi, Z.; Wouters, P. Do ‘altmetrics’ correlate with citations? Extensive comparison of altmetric indicators with citations from a multidisciplinary perspective. J. Assoc. Inf. Sci. Technol. , , 2003–2019.
11. Rojas, R. Neural Networks: A Systematic Introduction; Springer: Berlin/Heidelberg, Germany, 2013.
12. Hearst, M.A.; Dumais, S.T.; Osman, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. , , 18–28.
13. Quinlan, J.R. Induction of decision trees. Mach. Learn. , , 81–106.
14. Hosmer, D.W., Jr.; Lemeshow, S. Applied Logistic Regression. Journal of the American Statistical Association , , 411.
15. Ho, T.K. Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282.
16. Hastie, T.; Tibshirani, R.; Friedman, J. Unsupervised learning. In The Elements of Statistical Learning; Springer: New York, NY, USA, 2009; pp. 485–585.
17. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature , , 529–533.
18. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature , , 436–444.
19. Jarrow, R.A.; Lando, D.; Turnbull, S.M. A Markov model for the term structure of credit risk spreads. Rev. Financ. Stud. , , 481–523.
20. Chen, T.; He, T. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
21. Bhise, R.; Thorat, S.; Supekar, A. Importance of data mining in higher education system. IOSR J. Hum. Soc. Sci. , , 18–21.
22. Barabási, A.L. Network science. Philos. Trans. R. Soc. Lond. A Math. Phys. Eng. Sci. , , 20120375.
23. Yan, E.; Ding, Y. Scholarly network similarities: How bibliographic coupling networks, citation networks, cocitation networks, topical networks, coauthorship networks, and coword networks relate to each other. J. Am. Soc. Inf. Sci. Technol. , , 1313–1326.
24. West, J.D.; Jensen, M.C.; Dandrea, R.J.; Gordon, G.J.; Bergstrom, C.T. Author-level Eigenfactor metrics: Evaluating the influence of authors, institutions, and countries within the social science research network community. J. Am. Soc. Inf. Sci. Technol. , , 787–801.

25. Page, L.; Brin, S.; Motwani, R.; Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web; Technical Report; Stanford InfoLab: Stanford, CA, USA, 1999.
26. Kleinberg, J.M. Authoritative sources in a hyperlinked environment. J. ACM , , 604–632.
27. Zhang, C.; Liu, C.; Yu, L.; Zhang, Z.K.; Zhou, T. Identifying the Academic Rising Stars. arXiv , arXiv:1606.05752.
28. Zhang, J.; Ning, Z.; Bai, X.; Wang, W.; Yu, S.; Xia, F. Who are the Rising Stars in Academia? In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, Newark, NJ, USA, 19–23 June 2016; pp. 211–212.
29. Sugiyama, K.; Kan, M.Y. Serendipitous recommendation for scholarly papers considering relations among researchers. In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, Ottawa, ON, Canada, 13–17 June 2011; pp. 307–310.
30. Ke, Q.; Ferrara, E.; Radicchi, F.; Flammini, A. Defining and identifying Sleeping Beauties in science. Proc. Natl. Acad. Sci. USA , , 7426–7431.
31. Bai, X.; Zhang, J.; Cui, H.; Ning, Z.; Xia, F. PNCOIRank: Evaluating the Impact of Scholarly Articles with Positive and Negative Citations. In Proceedings of the 25th International Conference Companion on World Wide Web, Montréal, QC, Canada, 11–15 April 2016; pp. 9–10.
32. Zhu, X.; Turney, P.; Lemire, D.; Vellino, A. Measuring academic influence: Not all citations are equal. J. Assoc. Inf. Sci. Technol. , , 408–427.
33. Sutherland, K.A. Constructions of success in academia: An early career perspective. Stud. High. Educ. , , 743–759.
34. Letchford, A.; Moat, H.S.; Preis, T. The advantage of short paper titles. R. Soc. Open Sci. , , 150266.
35. Anicich, E.M.; Swaab, R.I.; Galinsky, A.D. Hierarchical cultural values predict success and mortality in high-stakes teams. Proc. Natl. Acad. Sci. USA , , 1338–1343.
36. Petersen, A.M. Quantifying the impact of weak, strong, and super ties in scientific careers. Proc. Natl. Acad. Sci. USA , , E4671–E4680.
37. Esfe, M.H.; Wongwises, S.; Asadi, A.; Karimipour, A.; Akbari, M. Mandatory and self-citation; types, reasons, their benefits and disadvantages. Sci. Eng. Ethics , , 1581–1585.
38. Catalini, C.; Lacetera, N.; Oettl, A. The incidence and role of negative citations in science. Proc. Natl. Acad. Sci. USA , , 13823–13826.
39. Priem, J. Altmetrics. In Beyond Bibliometrics: Harnessing Multidimensional Indicators of Scholarly Impact; MIT Press: Cambridge, MA, USA, 2014; pp. 263–288.
40. Kwok, R. Research impact: Altmetrics make their mark. Nature , , 491–493.
41. Cheung, M.K. Altmetrics: Too soon for use in assessment. Nature , , 176–176.
42. Yan, E.; Ding, Y. Measuring scholarly impact in heterogeneous networks. Proc. Am. Soc. Inf. Sci. Technol. , , 1–7.
43. Wang, Y.; Tong, Y.; Zeng, M. Ranking Scientific Articles by Exploiting Citations, Authors, Journals, and Time Information. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue, WA, USA, 14–18 July 2013.
44. Walker, D.; Xie, H.; Yan, K.K.; Maslov, S. Ranking scientific publications using a model of network traffic. J. Stat. Mech. Theory Exp. , , P06010.
45. Sayyadi, H.; Getoor, L. FutureRank: Ranking Scientific Articles by Predicting their Future PageRank. In Proceedings of the SIAM International Conference on Data Mining (SDM 2009), Sparks, NV, USA, 30 April–2 May 2009; pp. 533–544.
46. Zhou, Y.B.; Lü, L.; Li, M. Quantifying the influence of scientists and their publications: Distinguishing between prestige and popularity. New J. Phys. , , 033033.
47. Wang, S.; Xie, S.; Zhang, X.; Li, Z.; Yu, P.S.; Shu, X. Future influence ranking of scientific literature. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, PA, USA, 24–26 April 2014.
48. Liu, Z.; Huang, H.; Wei, X.; Mao, X. Tri-Rank: An Authority Ranking Framework in Heterogeneous Academic Networks by Mutual Reinforce. In Proceedings of the IEEE 26th International Conference on Tools with Artificial Intelligence (ICTAI), Limassol, Cyprus, 10–12 November 2014; pp. 493–500.
49. Shah, N.; Song, Y. S-index: Towards better metrics for quantifying research impact. arXiv , arXiv:1507.03650.

50. Small, H. Maps of science as interdisciplinary discourse: Co-citation contexts and the role of analogy. Scientometrics , , 835–849.
51. Kaur, J.; Radicchi, F.; Menczer, F. Universality of scholarly impact metrics. J. Informetr. , , 924–932.
52. Radicchi, F.; Fortunato, S.; Castellano, C. Universality of citation distributions: Toward an objective measure of scientific impact. Proc. Natl. Acad. Sci. USA , , 17268–17272.
53. Schneider, M.; Kane, C.M.; Rainwater, J.; Guerrero, L.; Tong, G.; Desai, S.R.; Trochim, W. Feasibility of common bibliometrics in evaluating translational science. J. Clin. Transl. Sci. , , 45–52.
54. Bruns, S.B.; Stern, D.I. Research assessment using early citation information. Scientometrics , , 917–935.
55. Cao, X.; Chen, Y.; Liu, K.J.R. A data analytic approach to quantifying scientific impact. J. Informetr. , , 471–484.
56. Klimek, P.; Jovanovic, A.S.; Egloff, R.; Schneider, R. Successful fish go with the flow: Citation impact prediction based on centrality measures for term-document networks. Scientometrics , , 1265–1282.
57. Stegehuis, C.; Litvak, N.; Waltman, L. Predicting the long-term citation impact of recent publications. J. Informetr. , , 642–657.
58. Yu, T.; Yu, G.; Li, P.Y.; Wang, L. Citation impact prediction for scientific papers using stepwise regression analysis. Scientometrics , , 1233–1252.
59. Sarigöl, E.; Pfitzner, R.; Scholtes, I.; Garas, A.; Schweitzer, F. Predicting scientific success based on coauthorship networks. EPJ Data Sci. , , doi:10.1140/epjds/s13688-014-0009-x.
60. Eysenbach, G. Can tweets predict citations? Metrics of social impact based on Twitter and correlation with traditional metrics of scientific impact. J. Med. Internet Res. , , E123.
61. Timilsina, M.; Davis, B.; Taylor, M.; Hayes, C. Towards predicting academic impact from mainstream news and weblogs: A heterogeneous graph based approach. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, CA, USA, 18–21 August 2016; pp. 1388–1389.
62. Simkin, M.; Roychowdhury, V. Read Before You Cite! Complex Syst. , , 269–274.
63. Thelwall, M. Data Science Altmetrics. J. Data Inf. Sci. , , 7–12.
64. Barbaro, A.; Gentili, D.; Rebuffi, C. Altmetrics as new indicators of scientific impact. J. Eur. Assoc. Health Inf. Libr. , , 3–6.
65. Wilsdon, J. We need a measured approach to metrics. Nature , , 129.