The Role of Positive and Negative Citations in Scientific Evaluation
IEEE Access
Xiaomei Bai, Ivan Lee, Zhaolong Ning, Amr Tolba, Feng Xia
Abstract—Quantifying the impact of scientific papers objectively is crucial for research output assessment, which subsequently affects institution and country rankings, research funding allocations, academic recruitment and national/international scientific priorities. Most assessment schemes based on publication citations can potentially be manipulated through negative citations. In this study, we explore Conflict of Interest (COI) relationships to discover negative citations and subsequently weaken the associated citation strength. We develop PANDORA (Positive And Negative COI-Distinguished Objective Rank Algorithm), which captures positive and negative COI relationships, together with positive and negative suspected COI relationships. To alleviate the influence of negative COI relationships, collaboration times, collaboration time span, citation times and citation time span are employed to determine the citing strength; positive COI relationships are treated as normal citation relationships. Furthermore, we calculate the impact of scholarly papers with the PageRank and HITS algorithms, combined with a credit allocation algorithm that is utilized to assess the impact of institutions fairly and objectively. Experiments are conducted on the American Physical Society (APS) dataset, and the results demonstrate that our method significantly outperforms current solutions in Recommendation Intensity of list R at top-K and Spearman's rank correlation coefficient at top-K.
Index Terms—Conflict of Interest, Negative Citations, Impact Evaluation.
I. INTRODUCTION

With the development of scholarly big data, quantifying research output plays an important role in ranking institutions and countries, allocating research grants, making hiring decisions and planning scientific priorities [1], [2], [3]. For instance, the institutional appraisal system has been consistently evolving to address the various needs of scientific impact evaluation [4], [5], [6]. While self-citations are meant to reflect research progress or knowledge diffusion and are a standard and acceptable practice [7], they can be abused to artificially inflate research impact [8] or to introduce excessive self-advertisement [9]. In this paper, the scope of self-citations is extended to cover co-author citations (note: co-author citations are not limited to
X. Bai, Z. Ning and F. Xia are with the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, School of Software, Dalian University of Technology, Dalian 116620, China. X. Bai is also with the Computing Center, Anshan Normal University, Anshan 114007, China. I. Lee is with the School of Information Technology and Mathematical Sciences, University of South Australia, Australia. A. Tolba is with Riyadh Community College, King Saud University, Riyadh 11437, Saudi Arabia, and also with the Mathematics and Computer Science Department, Faculty of Science, Menoufia University. Corresponding Author: Z. Ning; Email: [email protected]

the co-authors of the citing paper, but co-authors in all papers over a pre-defined time frame, such as over the past three years), as well as colleague citations (i.e. co-workers belonging to the same institute). This extended scope of self-citation, referred to as COI citations in this paper, may be either necessary or negative as discussed above. A prior study reports the share of self-citations under a three-year citation window [10], and this figure is expected to be higher under COI, since COI is an extended scope of self-citation. Unfortunately, there is no universal ground truth to differentiate legitimate citations among all COI citations. Thus, a mechanism is needed to model COI citations and to associate different weights with different types of citations. It should be noted that legitimate self-citations should be treated as regular citations with standard weight (usually 1), whereas negative citations should be given a lower weight (i.e. less than 1) to reduce their inflated impact. Current approaches to institution appraisal fall into two primary categories: full counting and fractional counting. Full counting based methods assume that one scholarly paper contributes equally to all authors' institutions, so the impact of the paper may be counted multiple times [11].
Fractional counting based methods consider rate indicators over a selected list of top journals and best papers, or focus only on highly-cited papers [12]. Fractional counting may also suffer from potential distortion in institutional impact evaluation, because the diversity of negative citations [13], [14] is neglected. A feasible solution is to design a better assessment metric that can identify negative citations, assess the impact of academic papers objectively and allocate fair shares to contributing institutions. Evaluation techniques for the impact of scientific outputs have evolved dramatically in recent years, from citation-based indicators [15] to rank-based metrics [13], [16], [17]. A nonlinear PageRank algorithm was proposed to improve the effectiveness of reference ranking [13]. FutureRank was presented to rank scientific articles by citation, authors and time [16]. Based on FutureRank, Wang et al. [17] proposed CAJTRank to evaluate scientific articles fairly by considering citation, author, journal and time information. Due to the rapid development of impact metrics, some researchers have appealed for a measured approach to metrics [18]. Unfortunately, most existing metrics give all citations equal weight, so they are unlikely to reflect the accurate impact of authors, journals and institutions. Therefore, there is a crucial need for an improved metric to fairly assess academic papers. There are two technical challenges for fair appraisal of
academic affiliations: (1) identifying negative citations; and (2) distinguishing the citation strength and allocating the impact of a scientific paper to the affiliation(s) of each author. Dissecting the citation relationship is a feasible way to solve the first problem, and we define positive/negative COI relationships and positive/negative suspected COI relationships as follows.

Positive/negative COI relationships: for two papers with an existing citing relationship, if the authors of the two papers have ever been co-authors, and one or more papers cite the two papers at the same time, the citation behavior is viewed as a positive COI relationship. Otherwise, if no independent papers (papers from different authors) co-cite the two papers, the citation is considered a negative COI relationship.

Positive/negative suspected COI relationships: similar to the definition of positive/negative COI, which leverages citation behavior among co-authors, suspected COI leverages citation behavior among authors from the same affiliation. Likewise, if independent papers (papers without authors from the same institute) recognize the correlation of papers from the same affiliation, the citation is viewed as a positive suspected COI; otherwise the citation is considered a negative suspected COI.

We propose PANDORA to model both the COI and suspected COI relationships for fairly measuring the impact of papers. We leverage the credit allocation algorithm [19] to capture co-authors' contributions to a paper, and adjusted weighting is applied to distribute the impact of a publication to the affiliation(s) of each author. In this paper, we differentiate positive and negative COI relationships, and positive and negative suspected COI relationships, according to citation patterns, as illustrated in Fig. 1.
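Putting the four definitions together, the classification of a single citation from paper P_i to paper P_j can be sketched as follows. This is a minimal sketch: the three boolean inputs are assumed to be precomputed from the corpus and are not part of the paper's notation.

```python
def classify_citation(ever_coauthored: bool,
                      shared_affiliation: bool,
                      co_cited_independently: bool) -> str:
    """Four-way COI classification of a citation P_i -> P_j (sketch).

    ever_coauthored: authors of P_i and P_j have co-authored before.
    shared_affiliation: an author of P_i shares an institute with an author of P_j.
    co_cited_independently: P_i and P_j are co-cited by papers from
        unrelated authors (demonstrated relevance).
    """
    if ever_coauthored:
        # COI: prior collaboration between the two author sets.
        return "positive COI" if co_cited_independently else "negative COI"
    if shared_affiliation:
        # Suspected COI: no prior collaboration, but a shared affiliation.
        return ("positive suspected COI" if co_cited_independently
                else "negative suspected COI")
    return "normal citation"
```

Only the two negative categories receive a down-weighted citation strength; the positive categories keep the standard weight of 1.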
The measurement of negative COI and negative suspected COI relationships between researchers employs the following four factors: collaboration times, collaboration time span, citation times and citation time span. It should be noted that if the citation from paper P_i to paper P_j is a positive COI, the citation is regarded as normal, and the citation strength is set to 1. Extensive experiments are conducted on two subsets of the American Physical Society (APS) corpus, i.e. PRC and PRE. The results demonstrate that our method outperforms existing approaches in Recommendation Intensity at top-K and Spearman's rank correlation coefficient at top-K.

Fig. 1: Illustrative example of different COI relationships between authors. (Note: P_i and A_i denote papers and authors, respectively; the red and blue lines denote citing relationships. The figure illustrates four cases: (A) Before P_i cites P_j, the authors of P_i and P_j have co-authored one or multiple publications, and P_i and P_j are co-cited by other publications; the corresponding author pair forms a positive COI pair. (B) Compared with (A), P_i and P_j are not co-cited by other publications, so the author pair forms a negative COI pair. (C) Before P_i cites P_j, the authors of P_i and P_j have not collaborated with each other, but they belong to the same affiliation, and P_i and P_j are co-cited by other papers; the author pair forms a positive suspected COI pair. (D) Compared with (C), P_i and P_j are not co-cited by other publications, so the author pair forms a negative suspected COI pair.)

II. DESIGN OF PANDORA
A. Dataset
Our experiments are conducted on the APS dataset, which contains 71,287 publications published in two subsets of APS, Physical Review C and Physical Review E (http://publish.aps.org/datasets), between 1970 and 2013 (43 years). Each record in the dataset includes the paper's title, DOI, author(s), date of publication, affiliation(s) of the authors and publisher. A list of citations is provided by an independent dataset within the APS dataset.
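A record of this form could be modeled as follows. This is only a sketch: the field names are illustrative and do not reflect the APS dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class PaperRecord:
    """One publication record as described above (illustrative fields)."""
    title: str
    doi: str
    authors: list                 # author names, in signing order
    affiliations: list            # one affiliation list per author
    date: str                     # date of publication
    publisher: str
    # Citations come from the separate citation file within the dataset.
    references: list = field(default_factory=list)
```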
B. Impact of a scholarly paper
The structure of PANDORA is illustrated in Fig. 2. PANDORA quantifies publication impact by differentiating positive COI, negative COI, positive suspected COI and negative suspected COI relationships between the citing and cited publications. Furthermore, PANDORA computes the authoritative score of each paper based on CAJTRank [17]. CAJTRank is a graph-based ranking method that uses the PageRank [20] and HITS [21] algorithms simultaneously to rank scientific papers, relying on four factors: citations, authors, journals (conferences) and publication time. PANDORA adopts a weighted PageRank to fairly evaluate the impact of scholarly papers, and it also leverages the credit allocation algorithm [19] to allocate the impact of a paper to its signed authors. In this paper, we consider citation, author, journal and time factors, mainly because these factors can influence the impact of papers.
C. Identification of positive COI
If the authors of a citing paper and a cited paper have been co-authors of one or more papers, COI relationships exist. Although COI citations may distort the appraisal of research impact, some of these citations are positive and necessary. Consider a scenario where the authors of the citing paper P_i and the cited paper P_j have published papers together. If P_i and P_j are co-cited by papers from different authors, we consider the citation from P_i to P_j reasonable, and we refer to it as a positive COI, with its citation strength $W^{P\text{-}PCOI}_{i,j}$ set to 1.
Fig. 2: The structure of PANDORA. (Note: ranking the impact of publications involves three steps: (1) extract positive COI, negative COI, positive suspected COI and negative suspected COI relationships from the APS dataset; (2) calculate the citing strength according to the relationships of step (1); (3) list the top N papers by the weighted PageRank algorithm and HITS algorithm.)
D. Identification of negative COI
If COI relationships exist between the authors of a citing paper P_i and a cited paper P_j, and no paper cites P_i and P_j simultaneously, we define the case as a negative COI. The citing strength is quantified by the correlation strength between the authors, and we introduce two factors, collaboration times and collaboration time span, for adjusting the weight of the citation strength. The negative COI strength between co-authors in PANDORA is defined as:

$$W^{A\text{-}NCOI}_{x,y} = \frac{N^{Co\text{-}author}_{x,y}}{\Delta T_c} \quad (1)$$

where $W^{A\text{-}NCOI}_{x,y}$ is the negative COI strength between the $x$th author and the $y$th author. $N^{Co\text{-}author}_{x,y}$ indicates the cumulative number of papers co-authored by the $x$th author and the $y$th author: $N^{Co\text{-}author}_{x,y} = |S_x \cap S_y|$, where $S_x$ is the set of papers published by the $x$th author and $S_y$ is the set of papers published by the $y$th author. $T^N_{x,y}$ is the last year and $T_{x,y}$ the initial year of collaboration between authors $A_x$ and $A_y$, and $\Delta T_c = T^N_{x,y} - T_{x,y} + 1$ indicates the number of years between the first and the last collaborations of $A_x$ and $A_y$. The relationship strength between two publications based on the co-authors' negative COI is calculated by:

$$W^{P\text{-}NCOI}_{i,j} = \sum_i^I \sum_j^J \left( \frac{N^{Co\text{-}paper}_{i,j}}{\Delta T_c} \right) \quad (2)$$

where $W^{P\text{-}NCOI}_{i,j}$ is the negative COI strength between the $i$th paper and the $j$th paper; it is the sum of the negative COI strengths between each pair of authors of the citing and cited papers. $N^{Co\text{-}paper}_{i,j}$ indicates the cumulative number of papers co-authored between each pair of signed authors of the $i$th paper and the $j$th paper: $N^{Co\text{-}paper}_{i,j} = |S_i \cap S_j|$, where $S_i$ is the set of all papers by the signed authors of the $i$th paper and $S_j$ is the set of all papers by the signed authors of the $j$th paper.
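The per-author-pair and per-paper-pair strengths of Eqs. (1) and (2) can be sketched as follows. This is a sketch under assumed inputs: the author pairs and collaboration years are taken as precomputed.

```python
def negative_coi_strength(papers_x, papers_y, first_collab_year, last_collab_year):
    """Eq. (1): co-author negative COI strength N^{Co-author}_{x,y} / dT_c.

    papers_x, papers_y: sets of papers published by the two authors.
    """
    n_coauthored = len(papers_x & papers_y)               # |S_x intersect S_y|
    delta_t_c = last_collab_year - first_collab_year + 1  # collaboration span
    return n_coauthored / delta_t_c


def paper_negative_coi_strength(author_pair_args):
    """Eq. (2): sum Eq. (1) over every author pair formed from the citing
    and cited papers' author lists (the pairs are assumed precomputed)."""
    return sum(negative_coi_strength(*args) for args in author_pair_args)
```

A longer, denser collaboration history thus yields a larger negative COI strength, which in turn lowers the final citation weight through Eq. (3).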
The citation strength between two publications is defined as follows:

$$W^{P\text{-}NCite}_{i,j} = e^{-\rho (T_{Current} - T_{Cite} + 1)}\, W^{P\text{-}NCOI}_{i,j} \quad (3)$$

where an exponentially decaying factor is used to calculate the citation strength between paper P_i and paper P_j. We consider the citing strength to range from 0 to 1. As scholars usually intend to cite recent work, if a paper fails to attract citations in the early years after being disseminated, its chance of attracting more citations over time is limited. $W^{P\text{-}NCite}_{i,j}$ is the citation strength of the $i$th paper citing the $j$th paper; it is defined within the range between 0 and 1 in our work, instead of assuming a citation strength of 1 (i.e. disregarding COI relationships) as in previous works. The formula favors recent citations. $\rho$ is a constant predefined decay parameter [17]. $T_{Current}$ stands for the current time, $T_{Cite}$ is the time at which paper P_i cites paper P_j, and $T_{Current} - T_{Cite} + 1$ represents the elapsed time since paper P_i cited paper P_j.

E. Identification of positive suspected COI
Given that paper P_i cites paper P_j, if the authors of the two papers have not collaborated on any joint paper but the authors of P_i and P_j belong to the same affiliation, we consider that a suspected COI relationship exists between them. If the two publications are co-cited by one or more publications (i.e. demonstrated relevance), we regard the citation from P_i to P_j as a positive suspected COI, and the citing strength $W^{P\text{-}PSCOI}_{i,j}$ is set to 1.
F. Identification of negative suspected COI
Given that paper P_i cites paper P_j, if the authors of the two papers have not co-authored any paper, the authors of P_i and P_j belong to the same affiliation, and the two publications are not co-cited by other publications (i.e. no demonstrated relevance), we regard the citation from P_i to P_j as a negative suspected COI. The citing strength is quantified by introducing two factors: citing times and citing time span. The strength of the negative suspected COI relationship between two authors is defined as follows:

$$W^{A\text{-}NSCOI}_{x,y} = \frac{N^{Cite}_{x,y}}{\Delta T_s} \quad (4)$$

where $W^{A\text{-}NSCOI}_{x,y}$ is the negative suspected COI strength between the $x$th author and the $y$th author. $N^{Cite}_{x,y}$ is the cumulative number of papers in which the $x$th author cites the $y$th author, and $\Delta T_s = T^N_{x,y} - T_{x,y} + 1$ indicates the number of years between the first and the last citation from author $A_x$ to author $A_y$. The strength of the suspected COI relationship between two articles is calculated by:

$$W^{P\text{-}NSCOI}_{i,j} = \sum_i^I \sum_j^J \left( \frac{N^{Cite}_{i,j}}{\Delta T_s} \right) \quad (5)$$

where $W^{P\text{-}NSCOI}_{i,j}$ is the negative suspected COI strength between the $i$th paper and the $j$th paper; it is a summation of all the negative suspected COI strengths between the authors of paper P_i and paper P_j. $N^{Cite}_{i,j}$ is the cumulative number of citations between the authors of the $i$th paper and the $j$th paper. The citation strength between two articles is defined as:

$$W^{P\text{-}NSCite}_{i,j} = e^{-\rho (T_{Current} - T_{Cite} + 1)}\, W^{P\text{-}NSCOI}_{i,j} \quad (6)$$
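Eq. (4) and the shared decay factor of Eqs. (3) and (6) can be sketched as follows. The value of the decay parameter rho here is an assumption; the paper only states that rho is a predefined constant [17].

```python
import math


def negative_suspected_coi_strength(n_cites_x_to_y, first_cite_year, last_cite_year):
    """Eq. (4): W^{A-NSCOI}_{x,y} = N^{Cite}_{x,y} / dT_s.

    n_cites_x_to_y: cumulative number of papers in which author x cites author y.
    """
    delta_t_s = last_cite_year - first_cite_year + 1  # citing span in years
    return n_cites_x_to_y / delta_t_s


def decayed_citation_strength(w_coi, current_year, cite_year, rho=0.05):
    """Eqs. (3)/(6): apply the decay e^{-rho (T_Current - T_Cite + 1)} to a
    negative (suspected) COI strength. rho = 0.05 is an assumed value."""
    return math.exp(-rho * (current_year - cite_year + 1)) * w_coi
```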
G. The authority score of a paper
In CAJTRank, all citing weights are set to 1 and abnormal citations are ignored, which leads to potential issues in article impact evaluation. PANDORA addresses these issues to improve the objectivity and accuracy of research article impact assessment. The core idea of PANDORA is that the prestige of a publication, namely its impact, is quantified by its weighted PageRank score together with the scores of its authors, the journal in which it was published, and its references. Compared to the CAJTRank algorithm, PANDORA constructs a weighted PageRank to capture the authority score of each publication in the citation network; the initial score of each paper is set to 1/N, where N is the total number of scholarly papers in the experiment. Meanwhile, PANDORA also considers the authority scores of each author, journal and reference set, calculated by the HITS algorithm. In particular, the CAJTRank algorithm assumed that all co-authors contribute equally to a paper, neglecting the fact that the contributions of different authors of a paper are rarely equal. To resolve this problem, credit allocation is introduced to reasonably distribute the influence score of an individual paper among its different authors [19]. The concrete process of computing the prestige score of each publication is as follows. The weighted PageRank score of paper P_i is computed by:

$$weightedPageRank(P_i) = \sum_{P_j \in IN(P_i)} \frac{W_{j,i}}{|OUT(P_j)|}\, S(P_j) \quad (7)$$

where $IN(P_i)$ denotes all papers linking to paper P_i, $|OUT(P_j)|$ is the total number of outgoing links of publication P_j, $W_{j,i}$ is the citation strength of paper P_j citing paper P_i, and $S(P_j)$ is the score of paper P_j before the iterative update. The authors' authority scores of a single publication are determined by the influence score of each author, while the prestige score of an individual author is related to the impact scores of his/her published papers, and can be calculated by the HITS algorithm.
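One synchronous update of Eq. (7) can be sketched as follows. The dictionary-based graph representation is an assumption; citations without a COI-derived weight default to the standard strength of 1.

```python
def weighted_pagerank_step(scores, in_links, out_degree, weights):
    """One update of Eq. (7): the new score of P_i sums, over each citing
    paper P_j, the term W_{j,i} / |OUT(P_j)| * S(P_j).

    scores:     {paper: current score S(P_j)}
    in_links:   {paper: list of papers citing it, i.e. IN(P_i)}
    out_degree: {paper: number of outgoing references |OUT(P_j)|}
    weights:    {(citing, cited): citation strength W_{j,i}}
    """
    return {
        p_i: sum(weights.get((p_j, p_i), 1.0) / out_degree[p_j] * scores[p_j]
                 for p_j in in_links.get(p_i, []))
        for p_i in scores
    }
```

A negative (suspected) COI citation carries a weight below 1 and therefore transfers less score than a normal citation.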
When computing the hub score of each author, the credit allocation algorithm is used to distribute a proportion of the impact of each paper to its different authors. Specifically, the authors' prestige score of each paper, $Author(P_i)$, is calculated by:

$$Author(P_i) = \frac{1}{T(A)} \cdot \sum_{A_j \in Neighbor(P_i)} \frac{\sum_{P_k \in Neighbor(A_j)} CreditShare(P_k, A_j) \cdot S(P_k)}{|Neighbor(A_j)|} \quad (8)$$

where $T(A)$ is the sum of all authors' hub scores in the experimental data, $Neighbor(P_i)$ denotes the co-author list of paper P_i, $Neighbor(A_j)$ is the set of all papers published by author A_j, $CreditShare(P_k, A_j)$ is the proportion of publication P_k's impact score allocated to author A_j, $S(P_k)$ is the prestige score of paper P_k, and $|Neighbor(A_j)|$ is the number of publications of author A_j. Likewise, the prestige score a paper receives from its journal is computed by the HITS algorithm:

$$Journal(P_i) = \frac{1}{T(J)} \cdot \sum_{J_j \in Neighbor(P_i)} \frac{\sum_{P_k \in Neighbor(J_j)} S(P_k)}{|Neighbor(J_j)|} \quad (9)$$

where $Journal(P_i)$ is the prestige score of publication P_i transmitted from the journal in which it was published, and the hub score of each journal aggregates the prestige scores of all papers published in that journal. $T(J)$ is the sum of all journals' hub scores, $Neighbor(P_i)$ is the journal that published paper P_i (each paper belongs to one journal), $Neighbor(J_j)$ is the set of papers published in journal J_j, and $|Neighbor(J_j)|$ is the total number of publications in $Neighbor(J_j)$. The reference score of each publication is also computed by the HITS algorithm:

$$Reference(P_i) = \frac{1}{T(P)} \cdot \sum_{P_j \in Neighbor(P_i)} \frac{\sum_{P_k \in Neighbor(P_j)} S(P_k)}{|Neighbor(P_j)|} \quad (10)$$

where $Reference(P_i)$ is the score publication P_i collects from the authority scores of its references.
$T(P)$ indicates the total score transmitted from all hub papers, $Neighbor(P_j)$ contains all publications that P_j links to, and $|Neighbor(P_j)|$ is the number of references of P_j. The authority score of each article comprises four components: the PageRank score of the paper and the scores of its authors, journal and references. The authority score is defined as the weighted sum of these components plus a normalization term:

$$S(P_i) = \alpha \cdot weightedPageRank(P_i) + \beta \cdot Author(P_i) + \gamma \cdot Journal(P_i) + \delta \cdot Reference(P_i) + \frac{1 - \alpha - \beta - \gamma - \delta}{N} \quad (11)$$

In our experiment, the PANDORA iterative algorithm is considered to have converged when the gap between the present and previous prestige scores of every paper is less than 0.0001. $S(P_i)$ indicates the authority score of publication P_i. $\alpha$, $\beta$, $\gamma$ and $\delta$ are constant parameters ranging from 0 to 1. The probability of a random jump is set to 0.15 experimentally; accordingly, the sum $\alpha + \beta + \gamma + \delta$ is set to 0.85 to obtain good experimental results. Typical approaches for parameter estimation include simple linear regression, multivariate linear regression and support vector machine regression [22]. Given the characteristics of our experiments, multivariate linear regression is employed to estimate the parameters of PANDORA, CAJTRank and FutureRank, owing to the different factors involved and the linear form of the evaluation. We set the sum of all parameters to 0.85 in the three algorithms, and multivariate linear regression finds that one parameter takes a relatively high value while the other parameters receive relatively low values. We then estimate three groups of optimal parameters to compare the accuracy of RI and Spearman's rank correlation coefficients of the three algorithms. More details can be found in reference [23].
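The author component of Eq. (8) and the combination of Eq. (11) can be sketched as follows. The data layout and the individual values of alpha, beta, gamma, delta are assumptions; the paper only fixes their sum at 0.85 (random-jump probability 0.15).

```python
def author_component(p_i, coauthors, papers_of, credit_share, scores, t_a):
    """Eq. (8) sketch: author component of P_i's prestige.

    coauthors[p]        -> list of authors of paper p (Neighbor(P_i))
    papers_of[a]        -> list of papers of author a (Neighbor(A_j))
    credit_share[(p,a)] -> share of p's impact credited to author a
    t_a                 -> sum of all authors' hub scores (normalization T(A))
    """
    total = 0.0
    for a_j in coauthors[p_i]:
        contrib = sum(credit_share[(p_k, a_j)] * scores[p_k]
                      for p_k in papers_of[a_j])
        total += contrib / len(papers_of[a_j])
    return total / t_a


def authority_score(pr, author, journal, reference, n_papers,
                    alpha=0.4, beta=0.15, gamma=0.15, delta=0.15):
    """Eq. (11): weighted combination of the four components plus the
    random-jump term (1 - alpha - beta - gamma - delta) / N."""
    jump = (1 - alpha - beta - gamma - delta) / n_papers
    return (alpha * pr + beta * author + gamma * journal
            + delta * reference + jump)
```

In the full algorithm these updates are repeated until every paper's score changes by less than the 0.0001 convergence threshold.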
H. Impact of institution and country
To assess the impact of each institution objectively, the computation in this section is divided into two parts. The first part is to objectively allocate the impact of each publication to its co-authors by PANDORA. If a publication is signed by a single author, the prestige score of the paper belongs entirely to that author; this situation was relatively common decades ago. With the development of the Internet, collaboration among multiple authors has come to prevail. Correspondingly, the fair distribution of publication impact to multiple authors has become an important practice in evaluating the impact of different scholars and their institutions. To fairly distribute the impact of a paper, the credit allocation algorithm is leveraged to resolve the co-authors' contributions to a paper; in this algorithm, the contribution proportion of each author to a paper is determined by the total credit from all of his/her co-cited papers. The second part is to calculate the impact of different institutions. The impact of a scholar is defined as follows:

$$I_A = \sum_i R \cdot S(P_i) \quad (12)$$

where $I_A$ refers to the individual scholar's impact aggregated from his/her publications, $R$ refers to the proportion of credit allocated to the author, and $S(P_i)$ indicates the prestige score of a scholarly paper. The institutional impact is determined by the impact of all scholars' publications in the institution:

$$I_I = \sum_i I_{A,i} \quad (13)$$

where $I_I$ indicates the institutional impact, comprising the impact of all scholars in the institution, and $I_{A,i}$ denotes a scholar's impact. For authors with multiple affiliations, we consider the first affiliation as the primary institution. Furthermore, the impact of a country is determined by the impact of the publications of all institutions in that country. Given a country with $i$ institutions, the impact of a country is defined as follows:

$$I_C = \sum_i I_{I,i} \quad (14)$$

where $I_C$ represents the country's impact.
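Eqs. (12) to (14) can be sketched as straightforward aggregations; the dictionary inputs below are assumptions for illustration.

```python
def scholar_impact(paper_scores, credit_shares):
    """Eq. (12): sum over a scholar's papers of the credit-allocated
    share R times the paper's prestige score S(P_i).

    paper_scores:  {paper: S(P_i)}
    credit_shares: {paper: R, the scholar's share of that paper}
    """
    return sum(credit_shares[p] * s for p, s in paper_scores.items())


def institution_impact(scholar_impacts):
    """Eq. (13): sum of the impacts of all scholars whose primary
    (first listed) affiliation is this institution."""
    return sum(scholar_impacts)


def country_impact(institution_impacts):
    """Eq. (14): sum of the impacts of all institutions in the country."""
    return sum(institution_impacts)
```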
$I_{I,i}$ denotes the institutional impact.

III. RESULTS
A. Measuring the impact of papers
To objectively evaluate the impact of institutions, we first explore the COI relationships in citation networks. 71,287 papers with 755,902 author pairs are analyzed in our experiment. The author pairs are divided into four categories: positive COI, negative COI, positive suspected COI and negative suspected COI. It is interesting to note that there are 59,388 author pairs with positive COI relationships and 48,377 with negative COI relationships. There are 25,803 papers with positive COI and 24,294 papers with negative COI. We also find 3,196 author pairs with suspected COI relationships, of which 1,645 are positive suspected COI and 1,551 are negative suspected COI. We conduct experiments on the two subsets of APS to compare PANDORA with FutureRank [16] and CAJTRank [17], using two metrics: RI [23] and Spearman's rank correlation coefficient [24]. Due to the lack of ground truth for ranking papers, previous research adopted the future PageRank score [16] or future citations [17] as the ground truth. In our study, citations without COI are used as the ground truth instead, because both future PageRank scores and future citations contain negative citations; such a ground truth may be biased, which would likely result in an inaccurate appraisal of the impact of academic publications. The concept of RI is described as follows: let R indicate the list of top-K papers returned by a ranking method and L represent the ground-truth list. For any scholarly paper P_i in R with rank order ro, the RI of P_i at k is defined as:

$$RI(P_i)@k = \begin{cases} (k - ro)/k, & P_i \in L \\ 0, & P_i \notin L \end{cases} \quad (15)$$

This formula implies that for any article P_i in the top-K ground-truth list, the higher it is ranked, the higher its RI. Correspondingly, we are able to obtain the RI of the list R at k.
According to the RI of each article, the RI of the list R at k is defined as follows:

$$RI(R)@k = \sum_{P_i \in R} RI(P_i)@k \quad (16)$$

To better understand the influence of the PageRank, author, journal, citation and time factors on the prestige score of each article, multivariate linear regression is implemented to estimate the parameters of PANDORA, FutureRank and CAJTRank, optimizing the ranking results in terms of RI performance. Fig. 3 shows the RI accuracy rates of the three algorithms on the top-K papers, with K ranging from 10 to 300. The RI accuracy rates of CAJTRank and FutureRank are between 0.6 and 0.677, and between 0.5 and 0.633, respectively. In comparison, the RI accuracy rates of PANDORA are between 0.656 and 0.8. As shown in Fig. 3, PANDORA consistently outperforms CAJTRank and FutureRank in terms of RI for all K values.
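Eqs. (15) and (16) can be sketched as follows; taking the rank positions ro as 1-based is an assumption, since the paper does not state the indexing convention.

```python
def ri_at_k(ranked_list, ground_truth, k):
    """Eqs. (15)-(16): Recommendation Intensity of list R at top-k.

    Each paper in the top-k of `ranked_list` that also appears in the
    top-k ground-truth list contributes (k - ro) / k, where ro is its
    1-based rank position; papers outside the ground truth contribute 0.
    """
    truth = set(ground_truth[:k])
    return sum((k - ro) / k
               for ro, paper in enumerate(ranked_list[:k], start=1)
               if paper in truth)
```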
Fig. 3: Comparing the probabilities of Recommendation Intensity for PANDORA, FutureRank and CAJTRank.

Moreover, Spearman's rank correlation coefficient is utilized to assess the similarity of these algorithms:

$$\rho = \frac{\sum_i (R_1(P_i) - \bar{R}_1)(R_2(P_i) - \bar{R}_2)}{\sqrt{\sum_i (R_1(P_i) - \bar{R}_1)^2 \sum_i (R_2(P_i) - \bar{R}_2)^2}} \quad (17)$$

where $R_1(P_i)$ and $R_2(P_i)$ indicate the position of publication P_i in the ground-truth rank list and the corresponding algorithm's rank list, respectively, and $\bar{R}_1$ and $\bar{R}_2$ are the average rank positions of all publications in the two rank lists. In Fig. 4, we observe that the Spearman's rank correlation coefficients of PANDORA and CAJTRank are around 0.5 and 0.42 respectively, while the coefficient varies from 0.267 to 0.519 for FutureRank. The best result is obtained by considering all kinds of information: PageRank, authors, journal, references and time factors. PANDORA outperforms the other state-of-the-art methods on the two subsets of APS.
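Eq. (17) can be computed directly from the two rank lists, as in this minimal sketch:

```python
def spearman_rho(ranks_truth, ranks_method):
    """Eq. (17): Spearman's rank correlation between a method's rank
    positions and the ground-truth rank positions for the same papers."""
    n = len(ranks_truth)
    m1 = sum(ranks_truth) / n    # mean rank in the ground-truth list
    m2 = sum(ranks_method) / n   # mean rank in the method's list
    num = sum((r1 - m1) * (r2 - m2)
              for r1, r2 in zip(ranks_truth, ranks_method))
    den = (sum((r1 - m1) ** 2 for r1 in ranks_truth)
           * sum((r2 - m2) ** 2 for r2 in ranks_method)) ** 0.5
    return num / den
```

Identical rankings yield a coefficient of 1, and fully reversed rankings yield a coefficient of minus 1.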
Fig. 4: Plots of the Spearman's rank correlation coefficient against top-K papers for the PANDORA, FutureRank and CAJTRank algorithms.

By comparing the RI and Spearman's rank correlation coefficients of CAJTRank and FutureRank, we find that the journal factor is beneficial for assessing the impact of papers. At the same time, by comparing PANDORA and CAJTRank, the preceding results indicate that, by using COI relationships, the weighted PageRank improves the performance of PANDORA. We take the dynamic evolutionary nature of citation networks, time information and the COI relationships of authors into consideration, and jointly use them to determine the citing strength, which has an important effect on achieving better RI and Spearman's rank correlation coefficients. Moreover, PANDORA's author credit allocation scheme helps generate a fair and objective ranking list.

B. Quantifying the impact of institutions
To capture the existence of COI citations at the institution level and at the country level, several characteristics are investigated for institution m and country n: the institution size $A_m$ represents the total number of authors from institution m who published at least one publication; $P_m$ is the number of publications from institution m; and $COIC_m$ is the cumulative number of COI citations over all publications of institution m. The country size $A_n$ denotes the total number of authors from country n who published at least one publication; $P_n$ represents the number of publications from country n; and $COIC_n$ is the cumulative number of COI citations over all publications of country n. Figs. 5a and 5b show the correlation between the institution size $A_m$ and both the average COI citations per author $COIC/A_m$ and the average COI citations per publication $COIC/P_m$. Fig. 5a indicates that most institutions have a small number of scholars publishing in APS journals. It can be observed that COI citations exist in institutions of different scales, and the larger the institution, the more COI citations it is likely to have. Fig. 5b shows the correlation between institution size $A_m$ and the average COI citations per paper $COIC/P_m$; the average COI citations per paper range from 0.529 to 1.022. Figs. 5c and 5d show the correlation between the country size $A_n$ and both the average COI citations $COIC/A_n$ and the average COI citations per publication $COIC/P_n$. Fig. 5c indicates that most countries have a small number of scholars in the APS publications, with a few exceptions. For example, American scholars are the highest contributors to PRC and PRE, with 20,468 scholars.
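The per-institution characteristics $A_m$, $P_m$ and $COIC_m$ above can be tallied as follows. This is a sketch; the record fields are assumed for illustration and are not the APS dataset's schema.

```python
from collections import defaultdict


def institution_characteristics(papers):
    """Tally A_m (distinct authors), P_m (papers) and COIC_m (cumulative
    COI citations) per institution from a list of paper records.

    Each record is assumed to look like:
        {"authors": [(name, institution), ...], "coi_citations": int}
    """
    authors = defaultdict(set)
    n_papers = defaultdict(int)
    coi_citations = defaultdict(int)
    for paper in papers:
        for name, aff in paper["authors"]:
            authors[aff].add(name)
        # Count each paper once per institution, not once per author.
        for aff in {a for _, a in paper["authors"]}:
            n_papers[aff] += 1
            coi_citations[aff] += paper["coi_citations"]
    return {m: (len(authors[m]), n_papers[m], coi_citations[m])
            for m in authors}
```

The country-level quantities $A_n$, $P_n$ and $COIC_n$ follow the same pattern with the country in place of the institution.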
We also find that the country size $A_n$ positively correlates with COI citations, which indicates that large countries are likely to accumulate COI citations, according to the analysis of each author's COI citations. A potential cause of this phenomenon is that large institutions and large countries have more international and internal collaborations. According to Fig. 5c, we also observe that when the country size falls between ranks 60 and 100, $COIC/A_n$ shows a sudden fall. To obtain a better understanding of this decrease, we investigate these countries and their adjacent countries in three aspects: the number of papers, the number of co-authors and COI citations. We observe an interesting phenomenon: these countries have few COI citations, and the number of co-authors per paper (about 3) is lower than that of their adjacent countries (around 4). This may be the reason why the impact of these countries is relatively small compared with other countries. Fig. 5d shows the correlation between the country size $A_n$ and the average COI citations per paper $COIC/P_n$. It indicates that each publication contains more COI citations in large countries than in small countries; COI citations per paper range from 0.156 to 0.883, while the lower $COIC/P_n$ values are similar to those in Fig. 5c. To investigate the impact of the average institution and average scholarly paper across different institution and country sizes, several external characteristics are examined. The institution impact ($I_I$) represents the total score collected by all authors over all papers P_i in the affiliation. The country impact ($I_C$) denotes the total sum over all institutions, collected by all authors in the country. Figs. 6a and 6b show the correlation between the institution size $A_m$ and both the average institution impact $I_I/A_m$ and the average scholarly paper impact $I_I/P_m$.
We find that institution size influences the impact of each author in his/her affiliation; that is, scholars in large institutions obtain a higher average impact (Fig. 6a). We also notice that the institution size has little influence on the average impact of publications in institutions on the whole (Fig. 6b). In Figs. 6c and 6d, the correlation between the country size A_n and both the average country impact I_C/A_n and the average impact of a country's scholarly papers I_C/P_n is demonstrated, which indicates that country size has some positive influence on the impact of a country; however, country size has little influence on the average impact of publications in countries on the whole. According to Figs. 5c and 6c, we observe that if the impact of a country is small, its COI citations are small.
Regarding the relationship between COI citations and the impact of papers evolving at the institution level and at the country level, previous studies have focused on citation distributions, yet little attention has been paid to the evolution of COI citations of individual institutions and countries. The trends of COI citation variations in different institutions and countries are well illustrated by the COI citation history of publications extracted from the PRC and PRE subsets of the APS corpus (Figs. 7a and 7b). In Fig. 7b, we analyze the relationship between the impact of the top 100 institutions and their COI citations. The distribution indicates that the higher the impact of an institution, the more COI citations it may contain. For example, the impact and COI citations of the top-ranked institution are far beyond those of the other institutions. Most institutions with lower COI citations have a relatively smaller impact than institutions with higher COI citations. According to Fig. 7b, we observe that the top 20 institutions by impact rank are American national laboratories and universities. In Fig.
7a, we observe that countries with higher impact are likely to have more COI citations than countries with lower impact. Ranked by COI citations, the top 20 countries from high to low are: America, Germany, France, Japan, Italy, China, UK, Spain, Australia, Canada, India, Netherlands, Russia, Belgium, Brazil, Israel, Poland, Switzerland, Denmark and Sweden.
According to Fig. 7c, we observe that the COI citations of some countries suddenly increase from 1993, and the trend of high COI citations continues until 2005. After that, the COI citations of countries present a decreasing trend. The reason is that the COI citations of a paper accumulate over time: for recently published papers the elapsed time is shorter, and the possible COI citations are relatively fewer; therefore, around 2013, the COI citations of countries become smaller. Fig. 7d characterizes the dynamic impact of scholarly papers at the country level. We observe that the impact of countries presents a sudden increase from 1993, which is consistent with Fig. 7c. What fueled this increase in the impact of countries? From the statistics of the yearly sum of national publications, we observe that the number of publications grew rapidly in 1993, and for some countries even increased tenfold compared to 1992. We believe this shift is closely related to the "information superhighway" strategy of the United States, which originated in the Clinton period. In September 1993, shortly after Clinton became president of the United States, he officially launched the cross-century national information infrastructure project plan. This program not only had a very broad impact on the world, but also created a brilliant future for the United States information economy. In addition, a paper published in Nature in 2013 also reported that international collaboration has increased more than ten-fold since the mid-1990s, which coincides with our experimental results from another aspect [25].
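The monotone size–COI and size–impact relationships read off Figs. 5–7 can be quantified with Spearman's rank correlation coefficient, the same rank-based measure used in our evaluation. A self-contained sketch follows; the size and COI values are purely illustrative, not drawn from the APS data.

```python
def rank(xs):
    # Average ranks for ties (1-based), as in the standard Spearman rho.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation computed on the ranks of x and y.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Illustrative values: institution sizes vs. average COI citations.
sizes = [5, 20, 80, 300, 1200]
avg_coi = [0.2, 0.4, 0.5, 0.9, 1.0]
rho = spearman(sizes, avg_coi)  # 1.0 for a perfectly monotone relationship
```

Because the coefficient depends only on ranks, it is insensitive to the heavy skew of institution and country sizes noted above.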
Ranked by country impact, the top 20 countries from high to low are: America, Germany, France, China, Japan, India, Italy, UK, Spain, Brazil, Russia, Canada, Netherlands, Australia, Israel, Belgium, Poland, Argentina, Korea and Switzerland. In particular, America is the most prominent country, and its impact is about 3.5 times that of Germany. From these rankings, we observe that the order of country impact correlates with scientific and technological competitiveness; more details on the scientific competitiveness of nations can be found in reference [6].
IV. DISCUSSION
Our analysis of the citation relationships in the APS dataset over time indicates that, regardless of institution size, COI relationships are a common phenomenon in scientific evolution networks, and COI citations present variable trends in different institutions and countries. Institutions with the highest COI citations are found in the USA, followed by Australia, Germany, Israel, Finland and Spain. While the USA and Germany belong to the G8 countries, Australia and Israel are countries with high investments in research [26]. This finding suggests that leading countries are not insulated from negative citation behaviors. Understanding how COI relationships affect the appraisal of scholarly papers is of great importance for the scientific community. We investigate the criteria for fair assessment of scientific impact, and the proposed PANDORA technique helps identify highly influential papers, scholars, and academic institutions.

Fig. 5: Quantitative patterns of COI citations for institutions and countries. (Note: (a) Institution size vs. the average COI citations, indicating that most institutions are very small and a few have many scholars. (b) The correlation between institution size and the average publication COI citations shows that institution size has little influence on COIC/P_m. (c) The correlation between country size and each scholar's average COI citations indicates that large countries have more authors with COI citations. (d) The relationship between country size and the average COI citations of each publication illustrates that large countries have more papers with COI citations.)

Fig. 6: Quantitative patterns of impact for institutions and countries. (Note: (a) Institution size positively correlates with the average impact of institutions on the whole. (b) Institution size has no specific influence on the average publication impact of an institution. (c) Country size has a positive influence on the average impact across different country sizes as a whole. (d) Country size has a very small influence on the average publication impact.)

Fig. 7: Comparing the impact of institutions and countries against COI citations, and characterizing the dynamics of COI citations and the impact of countries. (Note: (a) COI citations have a positive correlation with the impact of country C_i; each colored dot indicates a country. (b) The higher the impact of an institution, the more COI citations it may contain. (c) The yearly variance of COI citations at the country level indicates that since 1993, COI citations have shown a tendency to surge. (d) Striking similarities can be found between the yearly impact of country C_i(t) from 1970 to 2013 and the trend of COI citations of countries in Fig. 7c.)

There are several potential explanations for why COI relationships exist in citation networks. Firstly, cooperation under the competition mechanism may play a crucial role. In particular, an extremely strong collaboration relationship (super ties) has a remarkably positive impact on citations [27]. For example, pursuing collaborative research in a close community tends to be more beneficial than working alone, and negative COI citations could increase at the same time. Secondly, some researchers may deliberately cite each other's work to boost citation counts and hence the implied academic reputations [13]; i.e., researchers may cite their friends' newly published papers. Lastly, a shared scientific affiliation may also contribute to negative COI citations; i.e., researchers likely cite their colleagues just because they belong to the same scientific team rather than for actual research relevance. Although we have stated several possible causes of COI citations, the behavior in the real scientific community is more complicated, and further investigation of COI citations is required to fairly evaluate scientific impact.

In summary, we have proposed a COI-based method to give the audience deeper insight into the impact of papers, institutions and countries. Our method has several distinct features over conventional approaches: (i) The citing strength is determined based on the COI relationships in citation networks, and the four categories of COI relationships are investigated for adjusting the citation weights.
To this end, a co-citation mechanism is used to identify positive COI, negative COI, positive suspected COI and negative suspected COI relationships in citation networks. In our method, a homogeneous and heterogeneous fusion perspective is adopted, so that the scores of PageRank, authors, journals and references fit together in calculating the corresponding impact of articles. To quantify the impact of papers, the well-known PageRank approach treats all citation weights as 1, giving all citations equal importance. However, as COI relationships exist, our approach adopts a weighted PageRank that distinguishes different citation relationships, to fairly evaluate the impact of papers. (ii) PANDORA performs better than current methods in terms of evaluating the impact of institutions. We adopt a credit allocation algorithm for fairly distributing the impact of each publication to its authors, which guarantees an accurate appraisal of authors' impact. Then we assign the authority score of each scholar to his/her affiliation and country. Previous research allocated the impact of a single paper equally to all its authors, or distributed it according to the sequence order of authors; thus, some papers are counted twice or more when calculating the total publications of a country [28]. In our method, we take the first author's first affiliation to resolve these issues. (iii) Our approach could be utilized to improve established metrics of scientific impact, such as the JIF and the H-index, by considering COI relationships. Current evaluation methods of citation-based impact, from the JIF to the H-index, inevitably contain anomalous citations, leading to inaccurate evaluation.
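The weighted-PageRank idea in (i) can be sketched as follows. The damping factor and the down-weighting of the negative-COI edge are illustrative placeholders only; PANDORA derives the actual citing strength from collaboration times, collaboration time span, citation times and citation time span.

```python
def weighted_pagerank(edges, n, d=0.85, iters=100):
    """edges: (citing, cited, weight); weight 1.0 for a normal citation,
    < 1.0 for a citation down-weighted by a negative (suspected) COI."""
    out_w = [0.0] * n  # total outgoing citation weight per paper
    for u, _, w in edges:
        out_w[u] += w
    score = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - d) / n] * n
        for u, v, w in edges:
            # Each paper passes on score in proportion to edge weight.
            nxt[v] += d * score[u] * w / out_w[u]
        # Papers that cite nothing spread their score uniformly.
        dangling = d * sum(score[u] for u in range(n) if out_w[u] == 0)
        score = [s + dangling / n for s in nxt]
    return score

# Paper 2 cites paper 0 normally and paper 1 through a suspected
# negative-COI link (weight 0.3); paper 1 cites paper 0 normally.
scores = weighted_pagerank([(2, 0, 1.0), (2, 1, 0.3), (1, 0, 1.0)], n=3)
```

With all weights set to 1.0 this reduces to the classical PageRank of [20]; the down-weighted edge transfers less authority to the suspect citation's target, which is exactly how the negative COI relationships weaken citing strength.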
However, PANDORA can effectively discover such abnormal citations. If the abnormal citations can be eliminated before calculating the JIF and H-index, these metrics can reflect true impact more effectively.
Our proposed method has a few limitations: (i) Our experiments covered a decent number of research papers (71,287); these papers come from the APS dataset, and the study is limited to the physics discipline. (ii) Dependence on data quality: for example, we extracted the affiliation details from the APS dataset, and thus relied on the mechanism that APS chose to combine multiple affiliations that are variants of the same institution. (iii) COI relationships are identified through co-authorship and shared institutions, whereas other forms of COI relationships might also exist.
Our analysis provides guidance on quantifying the impact of academic publications, institutions and countries using a metrics-based system. As such, understanding how the trajectories of research papers shift at the affiliation level and the country level is valuable for career planning and the selection of collaboration opportunities. Furthermore, a better understanding of COI relationships in citation networks is also significant for assessing the efficiency and productivity of scientific research.
ACKNOWLEDGMENT
The authors extend their appreciation to the International Scientific Partnership Program (ISPP) at King Saud University for funding this research work through ISPP
REFERENCES
[1] J. Ypersele, "The maze of impact metrics," Nature, vol. 502, no. 7472, p. 423, 2013.
[2] F. Xia, W. Wang, T. M. Bekele, and H. Liu, "Big scholarly data: A survey," vol. 3, pp. 18–35, 2017.
[3] X. Bai, H. Liu, F. Zhang, Z. Ning, X. Kong, I. Lee, and F. Xia, "An overview on evaluating and predicting scholarly article impact," Information, vol. 8, no. 3, p. 73, 2017.
[4] A. Aragón, "A measure for the impact of research," Scientific Reports, vol. 3, p. 1649, 2012.
[5] L. Bornmann, "Ranking institutions by the handicap principle," Scientometrics, vol. 100, no. 2, pp. 603–604, 2014.
[6] L. Bornmann, M. Stefaner, F. Anegón, and R. Mutz, "What is the effect of country-specific characteristics on the research performance of scientific institutions? Using multi-level statistical models to rank and map universities and research-focused institutions worldwide," Journal of Informetrics, vol. 8, no. 3, pp. 581–593, 2014.
[7] R. Gálvez, "Assessing author self-citation as a mechanism of relevant knowledge diffusion," Scientometrics, pp. 1–12, 2017.
[8] C. Bartneck and S. Kokkelmans, "Detecting h-index manipulation through self-citation analysis," Scientometrics, vol. 87, no. 1, pp. 85–98, 2011.
[9] P. Seglen, "The skewness of science," Journal of the Association for Information Science and Technology, vol. 43, no. 9, pp. 628–638, 1992.
[10] D. Aksnes, "A macro study of self-citation," Scientometrics, vol. 56, no. 2, pp. 235–246, 2003.
[11] P. Vinkler, The Evaluation of Research by Scientometric Indicators. Chandos Publishing, 2010.
[12] L. Bornmann, M. Stefaner, F. Anegón, and R. Mutz, "Ranking and mapping of universities and research-focused institutions worldwide based on highly-cited papers: A visualisation of results from multi-level models," Online Information Review, vol. 38, no. 1, pp. 43–58, 2014.
[13] L. Yao, T. Wei, A. Zeng, Y. Fan, and Z. Di, "Ranking scientific publications: the effect of nonlinearity," Scientific Reports, vol. 4, 2014.
[14] X. Bai, F. Xia, I. Lee, J. Zhang, and Z. Ning, "Identifying anomalous citations for objective evaluation of scholarly article impact," PLoS One, vol. 11, no. 9, p. e0162364, 2016.
[15] E. Garfield, "The history and meaning of the journal impact factor," JAMA, vol. 295, no. 1, pp. 90–93, 2006.
[16] H. Sayyadi and L. Getoor, "FutureRank: Ranking scientific articles by predicting their future PageRank," in SDM. SIAM, 2009, pp. 533–544.
[17] Y. Wang, Y. Tong, and M. Zeng, "Ranking scientific articles by exploiting citations, authors, journals, and time information," in Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013, pp. 933–939.
[18] J. Wilsdon, "We need a measured approach to metrics," Nature, vol. 523, no. 7559, p. 129, 2015.
[19] H. Shen and A. Barabási, "Collective credit allocation in science," Proceedings of the National Academy of Sciences, vol. 111, no. 34, pp. 12325–12330, 2014.
[20] L. Page, "The PageRank citation ranking: Bringing order to the web," Stanford Digital Libraries Working Paper, vol. 9, no. 1, pp. 1–14, 1998.
[21] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," in ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 668–677.
[22] H. Purwins, B. Barak, A. Nagi, R. Engel, U. Hockele, A. Kyek, S. Cherla, B. Lenz, G. Pfeifer, and K. Weinzierl, "Regression methods for virtual metrology of layer thickness in chemical vapor deposition," IEEE/ASME Transactions on Mechatronics, vol. 19, no. 1, pp. 1–8, 2014.
[23] X. Jiang, X. Sun, and H. Zhuge, "Towards an effective and unbiased ranking of scientific literature through mutual reinforcement," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 2012, pp. 714–723.
[24] J. Myers and A. Well, Research Design and Statistical Analysis. L. Erlbaum Associates, 2010.
[25] J. Adams, "Collaborations: The fourth age of research," Nature, vol. 497, no. 7451, pp. 557–560, 2013.
[26] G. Cimini, A. Gabrielli, and F. S. Labini, "The scientific competitiveness of nations," PLoS One, vol. 9, no. 12, p. e113470, 2014.
[27] A. Petersen, "Quantifying the impact of weak, strong, and super ties in scientific careers," Proceedings of the National Academy of Sciences, vol. 112, no. 34, pp. E4671–E4680, 2015.
[28] D. King, "The scientific impact of nations,"