Best of both worlds? Simultaneous evaluation of researchers and their works
Ephrance Abu Ujum, Gangan Prathap, and Kuru Ratnavelu
[email protected], [email protected], [email protected]
CSIR National Institute for Interdisciplinary Science and Technology (NIIST), Council of Scientific and Industrial Research, Thiruvananthapuram, 695 019, Kerala, India
April 23, 2018
Abstract
This paper explores a dual score system that simultaneously evaluates the relative importance of researchers and their works. It is a modification of the CITEX algorithm recently described in Pal and Ruj (2015). Using available publication data for m author keywords (as a proxy for researchers) and n papers, it is possible to construct an m × n author-paper feature matrix. This is further combined with citation data to construct a HITS-like algorithm that iteratively satisfies two criteria: first, a good author is cited by good authors, and second, a good paper is cited by good authors. Following Pal and Ruj, the resulting algorithm produces an author eigenscore and a paper eigenscore. The algorithm is tested on 213,530 citable publications listed under Thomson ISI's "Information Science & Library Science" JCR category from 1980–2012.
* This work was supported by the University of Malaya High Impact Research Grant UM.C/625/1/HIR/MOHE/SC/13.

1 Introduction

Rankings provide an effective means to artificially assign order to the ever increasing volume of published research and researchers. This activity falls under the study of bibliometric analytics, which we define here as "key indicators derived from bibliometric data through mathematical or statistical analysis for the purpose of generating insight". In addition to information retrieval, bibliometric analytics focuses on discovering patterns specific to the data at hand in order to support decision-making or inference-related tasks. This paper is yet another step in this direction.

Specifically, this paper builds on recent work established by Pal and Ruj (2015) to simultaneously score research authors and papers by relative importance. The proposed algorithm, dubbed CITEX (CITation indEX), takes advantage of the many-to-many correspondence between a given set of authors and the papers they have collectively published. For this purpose, mappings between both sets can be formalized as linkages on a bipartite graph, hereon referred to as the author-paper (or author-document) network. In this sense, the cumulative advantage accrued by authors due to their papers, and vice versa, can be quantified using graph theoretic methods.

Furthermore, papers are interconnected through citation links; that is, a typical paper refers to previous works in order to acknowledge relevance in addition to specifying its own placement within the existing literature. The resulting paper citation network can thus be represented as a directed graph. Since the distribution of citation links varies from one paper to the next, usually in a highly skewed manner (Simon, 1955; de Solla Price, 1965; Price, 1976; Newman, 2009), this can be used as a basis to distinguish which papers are more prominently located than others. Several schemes have been proposed to exploit precisely this feature; i.e.
scores are computed for each paper based on some discriminatory function of its connectivity, that is, of how it is embedded within a structure of links (Chen et al., 2007). These papers can then be ordered according to the computed scores to produce rankings. Such schemes are integral to information retrieval tasks on online databases, for example, Google Scholar, CiteSeerX, and Microsoft Academic Search. CITEX extends this tradition by combining information from the author-paper network with the paper citation network to determine which authors and papers are relatively more important.

We note that Bhatt and Martens (2009) and Rethlefsen and Aldrich (2013) used the term "bibliometric analytics" but have not provided a formal definition.

In general, citation linkages are made to indicate reaction to past work rather than concrete dependence. Hence, the presence of a citation linkage, that is, a link pointing from the citing (referring) paper to the cited (referred) paper, serves to describe intellectual flows in successive works, which in itself does not necessarily imply a flow of influence.

This is a notion of the paper's location as opposed to its position. As with a citation count, the presence of a citation link does not explicitly convey whether it takes on the position of supporting or opposing the referred work.
The remainder of this paper describes the CITEX algorithm, analyzes its behaviour, introduces our proposed modification, and tests both algorithms on publications indexed under the JCR category "Information Science & Library Science". The paper is concluded in Section 6.
2 The CITEX algorithm

Suppose we are presented with a corpus consisting of m authors and n papers. Furthermore, suppose that from this corpus we are able to extract the binary m × n author-paper feature matrix, M, and the binary n × n citation matrix, C. Let an entry M_{ij} = 1 denote that author i on the i-th row of M has (co)authored paper j on the j-th column of M (M_{ij} = 0 otherwise). This implies that row sums of M correspond to the total papers published by each author, while column sums of M correspond to the total authors for each paper. A column-normalized version of M (with the same dimensions) can be constructed so that the authorship share of author i in paper j is divided equally as W_{ij} = M_{ij} / \sum_i M_{ij}.

In a similar way, let C_{ij} = 1 denote that cited paper j on the j-th column of C receives a citation from citing paper i on the i-th row of C (C_{ij} = 0 otherwise). Additionally, we require that C contains no self-citations (C_{ii} = 0). For the extreme case of a zero citation matrix C = 0_{n×n}, Pal and Ruj define the CITEX paper and author scores as y_j = \sum_{i=1}^{m} M_{ij} x_i and x_i = \sum_{j=1}^{n} W_{ij} y_j, respectively. These expressions are written in matrix form as y ← M^T x and x ← W y. This captures the notion that the y-score for paper j depends on the relative importance of its authors, while the x-score for author i depends on her authorship share (W_{ij}) for each paper j multiplied by its corresponding score y_j.

A complete description, however, requires the inclusion of citation features. Since this must reduce to the case of a zero citation matrix, Pal and Ruj achieve this through the inclusion of an (I + C^T) term (which is equivalent to adding paper self-citations to C). Since y ← M^T W y and x ← W M^T x in the zero-citation case, then for the k-th recursion:

x^{(k)} = W (I + C^T) M^T x^{(k-1)}    (1)

y^{(k)} = (I + C^T) M^T W y^{(k-1)}    (2)

is one such possible choice.
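As a concrete illustration, the recursion in Equations 1 and 2 can be iterated in a few lines of Python (a minimal sketch, assuming NumPy is available; the matrix names follow the text, and per-step normalization is added purely to keep the iterates bounded):

```python
import numpy as np

def citex(M, C, iters=100, tol=1e-12):
    """Iterate the CITEX recursion (Eqs. 1-2) to obtain author (x) and paper (y) scores."""
    m, n = M.shape
    # Column-normalized authorship-share matrix W (guard against empty columns)
    W = M / np.maximum(M.sum(axis=0, keepdims=True), 1)
    A = W @ (np.eye(n) + C.T) @ M.T      # author-score operator, Eq. 1
    B = (np.eye(n) + C.T) @ M.T @ W      # paper-score operator, Eq. 2
    x, y = np.ones(m), np.ones(n)        # initial guess vectors of ones
    for _ in range(iters):
        x_new = A @ x
        x_new /= x_new.sum() or 1.0      # normalize to avoid overflow
        y = B @ y
        y /= y.sum() or 1.0
        if np.linalg.norm(x_new - x) < tol:  # convergence relative to tol
            x = x_new
            break
        x = x_new
    return x, y
```

On a toy corpus of two solo-authored papers where the second paper cites the first, the cited paper and its author end up with the higher scores, as expected.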
By induction, we obtain:

x^{(k)} = [W (I + C^T) M^T]^k x^{(0)}    (3)

y^{(k)} = [(I + C^T) M^T W]^k y^{(0)}    (4)

For initial guess vectors, Pal and Ruj use x^{(0)} = 1_{m×1} and y^{(0)} = 1_{n×1}. Supposing P = W (I + C^T) M^T, so that x^{(k)} = P^k x^{(0)}, then:

x^{(k+1)} = P P^k x^{(0)} = P x^{(k)}    (5)

If the distance between two successive x score vectors satisfies ‖x^{(k+1)} − x^{(k)}‖ < ε, then convergence is met relative to tolerance ε (Franceschet, 2011). Since P is a nonnegative matrix with dimensions m × m and x^{(0)} >
0, then in accordance with the Perron-Frobenius theorem the x scores become stationary as k → ∞, thus satisfying P x* = x* (Perron, 1907; Frobenius, 1912). A similar argument is applicable for y by setting Q = (I + C^T) M^T W.

There are other algorithms that combine author and paper features. One notable example is the Co-Ranking framework proposed in Zhou et al. (2007). This approach uses a PageRank-based model on a bipartite coauthorship/paper citation network, whereby two intra-class random walks allow traversal strictly between one class of nodes, while an inter-class random walk allows jumps between networks. The stationary probabilities for author nodes and paper nodes are computed by coupling the random walks (assuming the status of researchers and the work they produce are mutually reinforced). The resulting algorithm yields improvements compared to applying PageRank on either feature (network) in isolation, although at the expense of introducing three additional adjustable parameters beyond the usual damping parameter. Regarding the convergence claim above: given that
P x = c x with c the largest eigenvalue in magnitude, the normalized iterates P^k x^{(0)} converge to a vector x* (in the direction of the leading eigenvector) as k → ∞. CITEX adds an interesting twist to the current literature since, unlike PageRank, it does not depend on any adjustable parameters.

3 Analysis of CITEX

Since the performance of a data mining algorithm depends on its design (Jähne, 2000; Balakin, 2010), it is useful to determine precisely which features are emphasized by CITEX in order to anticipate the qualitative aspects of the ranking it will necessarily produce. In particular, we are interested in the conditions that maximize a given score, since the highest percentile is designed to correspond to the topmost ranks. Specific to the CITEX author score, Equation 1 can be expanded as:

x^{(k)} = W M^T x^{(k-1)} + W C^T M^T x^{(k-1)}    (6)

x_i^{(k)} = \sum_{a=1}^{m} \sum_{p=1}^{n} W_{ip} M_{ap} x_a^{(k-1)} + \sum_{a=1}^{m} \sum_{p,p'=1}^{n} W_{ip} C_{p'p} M_{ap'} x_a^{(k-1)}    (7)

The first term on the right hand side of Equation 7 captures the cumulative authorship share of author i with author a. This term is positively biased towards author i if she is prolific (adjusting for authorship share), and more so if she collaborates frequently with "good authors" (those with high x-scores). This includes the case where a = i, so that if the cumulative authorship share of i herself is significantly large, then x_i^{(k)} ∼ x_i^{(k-1)} \sum_{p=1}^{n} W_{ip}.

As for the second term, a citation from paper p' → p corresponds to an author citation from a → i, fractionalized by W_{ip}. Hence, this term increases the larger the number of citations from a → i, the larger the authorship share for each paper authored by i (for which credit is minimally split), and the larger the x-score of i's citing authors.
Put together, CITEX defines a good author as one who publishes frequently with good authors, more so if he or she is cited by good authors.

A similar analysis can be done for the CITEX paper score as given in Equation 2:

y^{(k)} = M^T W y^{(k-1)} + C^T M^T W y^{(k-1)}    (8)

y_j^{(k)} = \sum_{a=1}^{m} \sum_{p=1}^{n} M_{aj} W_{ap} y_p^{(k-1)} + \sum_{a=1}^{m} \sum_{p,p'=1}^{n} C_{pj} M_{ap} W_{ap'} y_{p'}^{(k-1)}    (9)

(Above, we are referring to the damping parameter originally described in Brin and Page (1998). The interested reader is referred to Langville and Meyer (2006) and Chen et al. (2007) for an in-depth discussion of the PageRank algorithm.)

In the first term of Equation 9, paper j receives fractional y-score contributions for each author a appearing in both papers j and p. Essentially, M_{aj} W_{ap} is an author similarity term; hence, this part of the equation increases for papers that share the same authors. This term will also increase the larger the y-score for each "similar author" paper p (relative to j), and whenever W_{ap} →
1. For the case of an author i with a significantly large number of papers, we could end up with y_j^{(k)} ∼ y_j^{(k-1)} \sum_{p=1}^{n} W_{ip}.

For the second term, we see that y_j depends on the sum of y-scores from papers that cite it ("good papers" have high y-scores). With some rearranging, the second term also contains the product W_{ap'} C_{pj} M_{ap}. This means that the y-score of paper j depends on the sum of fractionalized citations from all citing papers p (i.e. \sum_{p=1}^{n} W_{ap} C_{pj}). Combining this with the effect from the first term of Equation 9, we surmise that CITEX defines a good paper as one with high author similarity with good papers, even more so if it is cited by good papers.

Based on our analysis, we have determined two quirks with the original formulation of CITEX. These are:

1. x_i^{(k)} ∼ x_i^{(k-1)} \sum_{p=1}^{n} W_{ip}: the CITEX author score for an author i can increase from being highly prolific, more so if he or she tends to coauthor in small teams. This allows an extremely prolific solo author to be over-represented by the algorithm. He or she may not even need a boost from citation count (from good authors or otherwise) in order to obtain a high CITEX author score.

2. y_j^{(k)} ∼ y_j^{(k-1)} \sum_{p=1}^{n} W_{ip}: the CITEX paper score can increase just by having the same author list repeat over a significant fraction of the collection, with this effect becoming more pronounced if the listing tends to be short. Similarly, such cases can be over-represented by CITEX without a boost from citation count (from good papers or otherwise).

To illustrate the potential problems associated with these quirks, we construct two toy calculations analogous to those posed in Pal and Ruj (2015), as shown in Figure 1 and Figure 2.

As a result of the quirks highlighted in Figure 1 and Figure 2, we can expect that author and paper rankings generated by CITEX will suffer from specificity issues, since extreme publication and citation traits are mixed together.
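Quirk 1 can be reproduced numerically. The following sketch (hypothetical toy data, assuming NumPy) builds a collection in which author 0 publishes three uncited solo papers while authors 1 and 2 share a single paper; iterating the CITEX author recursion x ← W(I + C^T)M^T x shows the prolific solo author capturing essentially all of the score despite receiving no citations:

```python
import numpy as np

# Toy data: 3 authors, 4 papers, no citations at all (C = 0).
# Author 0 writes papers 0-2 solo; authors 1 and 2 coauthor paper 3.
M = np.array([[1, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)
C = np.zeros((4, 4))

W = M / np.maximum(M.sum(axis=0, keepdims=True), 1)  # equal authorship shares
A = W @ (np.eye(4) + C.T) @ M.T                      # CITEX author-score operator

x = np.ones(3)
for _ in range(50):
    x = A @ x
    x /= x.sum()          # normalize each step

print(x)  # author 0 dominates despite zero citations
```

The solo author's self-share term grows by a factor of 3 per iteration versus 1 for the coauthoring pair, so the normalized score vector converges towards (1, 0, 0): exactly the over-representation described in quirk 1.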
The task of this paper is to propose a more elegant variation of the CITEX algorithm that addresses the above mentioned issues.

Figure 1: Problem 1, a hypothetical case of one prolific solo author with no citations. The two papers that receive citations obtain the highest CITEX paper scores, compared to no citations for the other papers. Oddly, a paper written by the better-cited author is ranked below the uncited papers of the prolific solo author.

Figure 2: Problem 2, the effect of high author similarity with good papers. The setup is the same as in Figure 1, with one additional citation link between two papers. The prolific solo author now leads by author score despite the absence of (co)author citations. One paper is ranked highest despite having only one citation because it is cited by a good paper; due to the way paper scores are propagated in CITEX, papers with high author similarity to it also receive high scores.

4 An improved Coupled Author-Paper Scoring algorithm
As highlighted in Section 3, CITEX has a built-in tendency to produce a rank ordering that gives undesired priority to highly productive authors (even if they are relatively uninfluential), in addition to assigning high relative importance to papers associated with highly prolific authors (overriding the citation impact of other papers).

To circumvent these issues, we propose dropping the self-citation term (I + C^T) in Equations 1 and 2, and replacing the M matrices with W matrices to ensure conservation of citation count when switching from the paper citation network to the author citation network (inter-author citations are fractionalized). This results in the following set of equations, which defines our Coupled Author-Paper Scoring (CAPS) algorithm:

x^{(k)} = W C^T W^T x^{(k-1)}    (10)

y^{(k)} = C^T W^T x^{(k)}    (11)

Following previous conventions (Kleinberg, 1999; Pal and Ruj, 2015), we start with initial guess vectors (specifically, x^{(0)} = 1_{m×1} and y^{(0)} = 1_{n×1}) and determine the values of the scores iteratively (i.e. iterate over k ≥ 1 until convergence). Equation 10 quantifies the criterion that "a good author is cited by good authors". Equation 11 quantifies the criterion that "a good paper is cited by good authors". The equations above provide a self-consistent basis for repeated improvement (Easley and Kleinberg, 2010, pp. 355–356). This can be seen by writing L = W C:

x^{(k)} = W L^T x^{(k-1)} = W y^{(k-1)}    (12)

y^{(k)} = L^T x^{(k)}    (13)

Hence, a good author has good papers that are cited by good authors who have good papers, and so on. The m × n matrix L has entries (L)_{ij} = \sum_{p=1}^{n} W_{ip} C_{pj}, which correspond to the cumulative fractional citations made by citing author i (through papers p authored by i) to some cited paper j. Essentially, L encodes the author-paper citation matrix. Entries of the m × m matrix product W L^T in Equation 12 correspond to the cumulative fractional citations received by the author in row i from the author in column a.
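The CAPS recursion can be sketched as follows (a minimal illustration assuming NumPy; the toy data, in which author 0 publishes three uncited solo papers while the papers of authors 1 and 2 cite each other, is hypothetical). In contrast to CITEX, the prolific but uncited author receives no score at all:

```python
import numpy as np

def caps(M, C, iters=50):
    """Iterate the CAPS recursion (Eqs. 10-11)."""
    W = M / np.maximum(M.sum(axis=0, keepdims=True), 1)
    x = np.ones(M.shape[0])
    for _ in range(iters):
        x = W @ C.T @ W.T @ x     # Eq. 10: author scores
        s = x.sum()
        if s > 0:
            x /= s                # normalize when the total is nonzero
    y = C.T @ W.T @ x             # Eq. 11: paper scores
    return x, y

# Toy data: author 0 writes papers 0-2 (uncited); author 1 writes
# paper 3 and author 2 writes paper 4; papers 3 and 4 cite each other.
M = np.array([[1, 1, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]], dtype=float)
C = np.zeros((5, 5))
C[4, 3] = 1.0   # paper 4 cites paper 3
C[3, 4] = 1.0   # paper 3 cites paper 4

x, y = caps(M, C)
print(x)  # author 0 scores zero; authors 1 and 2 share the score
print(y)  # only the mutually citing papers receive nonzero scores
```

Because CAPS propagates scores strictly along fractionalized citation links, uncited entities fall out of the ranking entirely; this anticipates the large zero-scoring population reported for the LIS dataset below.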
The identification of W L^T as the (fractional) author citation matrix follows because (W L^T)_{ia} = \sum_{p,p'=1}^{n} W_{ap'} C_{p'p} W_{ip} signifies that author a, in paper p', cites paper p which is (co)authored by i. The sum over all possible papers serves to aggregate all fractional citations received by author i from author a.

In effect, the author score defined in Equation 12 corresponds to x_i^{(k)} = \sum_{a=1}^{m} \sum_{p,p'=1}^{n} W_{ap'} C_{p'p} W_{ip} x_a^{(k-1)}. The author score for author i is therefore proportional to the cumulative author citations received, as well as to the scores of the citing authors. This captures the intuition that authors promote each other through their published works. Similarly, Equation 13 implies that the paper score for paper j is y_j^{(k)} = \sum_{i=1}^{m} L_{ij} x_i^{(k)}. This quantifies the relationship that the relative importance of a paper depends on the authority of its citing authors.

5 Results

We test the CITEX and CAPS algorithms on papers published under the Thomson ISI
Journal Citation Reports (JCR) subject category of “
Information Science & Library Science" (LIS) from the years 1980 up to 2012 inclusive. This dataset consists of 213,530 papers, 471,191 total inter-paper citations, and 73,597 author keywords. We do not conduct author or bibliographic reference disambiguation, in order to assess the output quality of CAPS and CITEX when used with minimal data preprocessing.
The output of a ranking scheme depends on how it scores selected features that are present (or absent) for each datum relative to the rest of the dataset. In general, it is difficult to determine the performance of the underlying scoring algorithm when there is no ground truth on which to base such judgements. In cases like this, the most sensible thing to do is to speak of the properties of the scores generated by the algorithm of interest, and whether the rankings generated show reasonable agreement with known methods and observations. In this respect, the distribution of author scores for CAPS and CITEX exhibits a reasonably high Spearman rank correlation coefficient (ρ) with the h-index (p < 0.01).

The h-index (Hirsch, 2005) provides a useful comparison to CAPS and CITEX as it too combines publication and citation traits. However, unlike CAPS (and to a lesser extent, CITEX), the h-index is not designed to differentiate whether a citation is received from a relatively "good" paper (author) or otherwise, hence some disparity in the resulting rankings is to be expected. This can be seen in Table 1.

Table 1: Top 25 author rankings by publication count (Pubs.), times cited, CAPS score, CITEX score, and h-index. Each cell reports the author key together with the author's h value, computed using available data (only ISI papers indexed under the LIS JCR category from 1980–2012); ordinal ranking is used for the h-index column.
Rank | Pubs. (h, author) | Times cited (h, author) | CAPS (h, author) | CITEX (h, author) | h-index (h, author)
1 | 5 rogers.m | 5 davis.fd | 21 egghe.l | 5 rogers.m | 30 glanzel.w
2 | 0 cassada.j | 29 benbasat.i | 26 leydesdorff.l | 0 cassada.j | 29 bates.dw
3 | 0 klett.re | 12 venkatesh.v | 23 rousseau.r | 0 klett.re | 29 benbasat.i
4 | 1 ramsdell.k | 29 bates.dw | 30 glanzel.w | 1 ramsdell.k | 28 garfield.e
5 | 1 christian.g | 1 pawlak.z | 24 thelwall.m | 1 christian.g | 26 leydesdorff.l
6 | 0 vicarel.ja | 19 straub.dw | 16 burrell.ql | 0 vicarel.ja | 25 schubert.a
7 | 2 hoffert.b | 30 glanzel.w | 25 schubert.a | 2 hoffert.b | 25 spink.a
8 | 1 sutton.j | 1 gruber.tr | 17 ingwersen.p | 1 sutton.j | 24 grover.v
9 | 1 sutton.jc | 25 spink.a | 17 bar-ilan.j | 24 thelwall.m | 24 moed.hf
10 | 1 bigelow.d | 14 salton.g | 22 braun.t | 21 egghe.l | 24 thelwall.m
11 | 1 stevens.n | 3 furnas.gw | 18 cronin.b | 30 glanzel.w | 23 rousseau.r
12 | 0 zlendich.j | 24 grover.v | 17 van.raan.afj | 26 leydesdorff.l | 22 braun.t
13 | 0 fairchild.ca | 3 deerwester.s | 16 white.hd | 1 sutton.jc | 21 egghe.l
14 | 1 pearl.n | 3 dumais.st | 12 jacso.p | 2 decandido.ga | 20 willett.p
15 | 0 richard.o | 2 landauer.tk | 24 moed.hf | 23 rousseau.r | 19 ford.n
16 | 2 gordon.rs | 8 buckley.c | 15 vinkler.p | 3 stlifer.e | 19 saracevic.t
17 | 0 maccann.d | 25 schubert.a | 16 small.h | 1 bigelow.d | 19 smaglik.p
18 | 0 lombardo.d | 5 morris.mg | 18 mccain.kw | 25 schubert.a | 19 straub.dw
19 | 1 williamson.ga | 1 harshman.r | 16 bornmann.l | 2 rawlinson.n | 18 bates.mj
20 | 5 butler.t | 7 todd.pa | 28 garfield.e | 0 davidson.a | 18 chen.hc
21 | 1 raiteri.s | 9 karahanna.e | 9 pao.ml | 0 de.baron.fhk | 18 cronin.b
22 | 1 gillespie.t | 26 leydesdorff.l | 15 vaughan.l | 0 elizabeth.p | 18 dennis.ar
23 | 1 campbell.p | 18 zmud.rw | 6 rao.ikr | 1 furlong.cw | 18 lyytinen.k
24 | 3 burns.a | 14 gefen.d | 15 daniel.hd | 0 hammett.d | 18 mccain.kw
25 | 2 wyatt.n | 19 saracevic.t | 15 oppenheim.c | 0 hemingway.h | 18 zmud.rw

Accordingly, we expect the h-index distributions for the top N ranks by CAPS and CITEX score to resemble each other for increasingly large N. For the top N = 25 ranks, μ_CAPS(h) = 18.
44, while μ_CITEX(h) = 6.
76. For N = 250 the mean h-index values are 8.03 and 7.03, while for N = 2500 we obtain 3.32 and 3.36 for CAPS and CITEX, respectively. Ideally, the top percentile of any ranking should correspond to an easily interpreted ordering by quality; in this sense, CAPS improves on the CITEX author ranking (since its top ranks tend to correspond to high h-index values).

Incidentally, the top ranked author by CITEX (Rogers) accounts for 83.6% of the entire CITEX author score distribution. Together with Cassada, both authors take up a shocking 96% of the total scores. Over the entire list of authors, this corresponds to a Gini coefficient of 0.9999. In contrast, the top 20% (14,719) of authors by CAPS score account for approximately 99.96% of the scores (corresponding to a Gini coefficient of 0.9891). This implies that the difference between CAPS author scores for adjacent ranks becomes progressively smaller as we go down the ranks; the effect is exaggerated to a greater extreme in CITEX.

Interestingly, the Gini coefficients for fractional publication count and fractional citation count of authors in the LIS dataset are 0.7744 and 0.8715, respectively. Furthermore, the top 20% of authors account for 81.4% of the total fractional publications as well as 90% of the total fractional citations. While these values are characteristic of high levels of inequality, they are quite tame compared to the level of inequality implied by CAPS. The presence of such extreme levels of inequality suggests a vast differential in the ability of LIS researchers to capitalize on the resources, technical skills, and opportunities at their disposal (Shockley, 1957).

The top 25 rankings by citation count, CAPS paper score, and CITEX paper score are displayed in Table 2. The topmost ranks of CITEX are populated by papers sharing the same high-scoring author (Rogers). Looking beyond
Looking beyondthe top 25 ranks, we find that with the exception of papers at ranks 7 to 12,the first 3819 positions are papers authored by Rogers, while the next 2610positions (ranks 3820 − rich get richer (Price, 1976; Barab´asi et al., 1999). The Gini coefficient is a measure of statistical dispersion typically used to measure thelevel of inequality in a given sample. For a sample of size n ordered such that x i ≤ x i +1 ,it is given by G = P ni =1 ix i n P ni =1 x i − n +1 n . A Gini coefficient of 1 indicates maximal inequalitywhereby the total score is associated to only one element in the sample while the remainderof the sample contributes nothing to the total score. A Gini coefficient of 0 indicates perfectequality whereby the total score is distributed equally among all elements in the sample. p < .
The pairwise Spearman rank correlation coefficients (p < 0.01) over all papers among citation count, CAPS paper score, and CITEX paper score are modest, with ρ = 0.17 and ρ = 0.26 among the pairwise values. CAPS appears in better agreement with citation count than CITEX.
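Rank agreement of this kind can be computed with a short routine (a self-contained sketch for the tie-free case; in practice one would use a library implementation that handles ties, such as scipy.stats.spearmanr):

```python
def spearman_rho(a, b):
    """Spearman rank correlation for equal-length samples without ties."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))   # classic 1 - 6*sum(d^2)/(n(n^2-1))

print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(spearman_rho([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```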
Table 2: Top 25 paper rankings by citation count, CAPS score, and CITEX score. TC denotes times cited.

Rank | Citation count: Paper (TC) | CAPS: Paper (TC) | CITEX: Paper (TC)
1 | 1982/IJCIS/11/341/pawlak (3319) | 2006/SCI/69/121/egghe (105) | 1995/LJ/120/113/rogers (1)
2 | 1989/MISQ/13/319/davis (3251) | 1990/JIS/16/17/egghe (61) | 1995/LJ/120/119/rogers (1)
3 | 1993/KA/5/199/gruber (2618) | 2006/SCI/69/131/egghe (250) | 1995/LJ/120/130/rogers (1)
4 | 1990/JASIS/41/391/deerwester (2150) | 1998/JD/54/236/ingwersen (199) | 1995/LJ/120/187/rogers (1)
5 | 1980/PAL/14/130/porter (1653) | 2005/S/19/8/braun (86) | 1995/LJ/120/213/rogers (1)
6 | 2003/MISQ/27/425/venkatesh (1534) | 2003/JASIST/54/550/ahlgren (123) | 1996/LJ/121/100/rogers (1)
7 | 1988/IPM/24/513/salton (1449) | 2006/SCI/69/169/braun (127) | 2011/LJ/136/30/fox (0)
8 | 2001/MISQ/25/107/alavi (1075) | 2006/SCI/67/491/van.raan (177) | 2007/LJ/132/36/albanese (4)
9 | 1995/ISR/6/144/taylor (1021) | 1999/JD/55/577/smith (93) | 1993/LJ/118/32/berry (2)
10 | 2003/JMIS/19/9/delone (772) | 2001/JASIST/52/1157/thelwall (94) | 1989/LJ/114/18/decandido (1)
11 | 2004/MISQ/28/75/hevner (724) | 1985/JD/41/173/egghe (48) | 1989/LJ/114/57/decandido (0)
12 | 1995/MISQ/19/189/compeau (684) | 1992/IPM/28/201/egghe (41) | 1995/LJ/120/12/stlifer (1)
13 | 2003/MISQ/27/51/gefen (677) | 1989/SCI/16/3/schubert (165) | 1992/LJ/117/52/rogers (0)
14 | 2000/ISR/11/342/venkatesh (596) | 1997/JD/53/404/almind (163) | 2000/LJ/125/91/rogers (0)
15 | 1999/MISQ/23/67/klein (569) | 2001/SCI/50/65/bjorneborn (93) | 2005/LJ/130/172/rogers (0)
16 | 2000/MISQ/24/169/bharadwaj (568) | 1986/SCI/9/281/schubert (162) | 2006/LJ/131/114/rogers (0)
17 | 1992/MISQ/16/227/adams (542) | 2006/SCI/67/315/glanzel (88) | 2006/LJ/131/114/rogers (0)
18 | 1995/MISQ/19/213/goodhue (540) | 2006/SCI/69/161/banks (60) | 2006/LJ/131/123/rogers (0)
19 | 1987/MISQ/11/369/benbasat (526) | 1996/SCI/36/97/egghe (31) | 2006/LJ/131/123/rogers (0)
20 | 1999/MISQ/23/183/karahanna (513) | 1991/JASIS/42/479/egghe (29) | 2007/LJ/132/132/rogers (1)
21 | 1999/JAMIA/6/313/bates (497) | 2003/SCI/56/357/glanzel (82) | 2007/LJ/132/132/rogers (1)
22 | 1988/MISQ/12/259/doll (477) | 1986/SCI/9/103/leydesdorff (46) | 2007/LJ/132/171/rogers (0)
23 | 1999/IJGIS/13/143/stockwell (475) | 2001/SCI/50/7/bar-ilan (64) | 2007/LJ/132/96/rogers (0)
24 | 2000/MISQ/24/115/venkatesh (475) | 2002/JASIST/53/995/thelwall (72) | 2006/LJ/131/27/rogers (1)
25 | 2003/ISR/14/189/chin (472) | 1996/JIS/22/165/egghe (24) | 2003/LJ/128/40/rogers (2)

In contrast, the CAPS paper score possesses a Gini coefficient of 0.9912, while CITEX has a slightly lower value of 0.9785. This implies that both methods exhibit large score differentials only between the topmost ranks. For the CAPS paper score, this can be traced to the fact that 81.2% of the lowest scoring population has a score of exactly zero (76% of papers in the study data have zero citations). The reason for this is that the coupling
The LIS dataset consists of 103,768 papers from
Library Journal (~48.6% of the total). This is nearly 14 times larger than the 2nd largest contributor, i.e. The Scientist. While this seems excessively high, consider that only 1.9% of papers from Library Journal contributes 1% of the non-zero citations in the LIS dataset (from a total of 471,191 citations for 213,530 papers). In comparison,
Scientometrics is only the 6th largest contributor to the dataset, with 3,100 papers (1.5% of total LIS papers), yet it contributes a total of 29,792 citations (6.3% of the LIS total), making it the 4th largest contributor citation-wise. For reference, the largest citation counts are attributed to MIS Quarterly, J AM MED INFORM ASSN, and J AM SOC INFORM SCI, with 55,736, 30,470, and 30,317 citations, respectively.

of both author features and paper features places strict limits on the size of the non-zero scoring population. On the other hand, the CITEX paper score has no zero scoring population (due to the presence of artificial paper self-citations). The extremely high Gini coefficients for both CITEX and CAPS imply that we can only reasonably differentiate a small fraction of the dataset, corresponding to top scoring papers that coincide with top scoring authors. (For CITEX, 1% of the top scoring population accounts for 50.1% of the total score, while 2% accounts for 91%. In comparison, CAPS has 1% and 2% of the top scoring population accounting for 88% and 97% of total scores, respectively.)

A quick glance at the top scoring papers listed in the "Citation count" column of Table 2 reveals that these mostly correspond to informatics papers rather than informetrics. Contrast this with the listing shown in the "CAPS" column, where the emphasis is more towards informetrics papers. The reason for this is that the CAPS algorithm takes into account authorship features when scoring papers, which are not accounted for in a simple citation count. Since informetrics authors are highlighted in Table 1, it follows that informetrics papers are also highlighted in Table 2. Table 3 provides a listing of journals in the top 100 ranks, which gives some indication of the journal-level emphasis of each ranking scheme.

Table 3: Journals appearing in the top 100 ranked papers by citation count, CAPS score, and CITEX score (number of papers per journal).

Citation count | CAPS | CITEX
MISQ mis.quart (43) | SCI scientometrics (33) | LJ libr.j (100)
ISR inform.syst.res (14) | JASIS j.am.soc.inform.sci (17) |
JAMIA j.am.med.inform.assn (9) | JASIST j.am.soc.inf.sci.tec (14) |
JASIS j.am.soc.inform.sci (6) | JD j.doc (12) |
JD j.doc (3) | JIS j.inform.sci (9) |
IPM inform.process.manag (3) | IPM inform.process.manag (6) |
JMIS j.manage.inform.syst (3) | JI j.informetr (5) |
IJCIS int.j.comput.inf.sci (2) | ARIS annu.rev.inform.sci (2) |
IM inform.manage (2) | SSI soc.sci.inform (1) |
IJGIS int.j.geogr.inf.sci (2) | S scientist (1) |
SCI scientometrics (2) | |
ARIS annu.rev.inform.sci (1) | |
CJIS can.j.inform.sci (1) | |
EJIS eur.j.inform.syst (1) | |
GIQ gov.inform.q (1) | |
IJGIS int.j.geogr.inf.syst (1) | |
IMA inform.manage-amster (1) | |
JASIST j.am.soc.inf.sci.tec (1) | |
JIS j.inf.sci (1) | |
KA knowl.acquis (1) | |
OR online.rev (1) | |
PAL program-autom.libr (1) | |
6 Conclusion

In this paper we have constructed a modified version of the CITEX algorithm originally introduced by Pal and Ruj (2015). This algorithm was designed to assign relative importance scores to papers and authors by taking into account data from both entities simultaneously. Conventional methods like citation count and PageRank, for example, cannot do so without appropriate modification. The modification of CITEX which we propose, dubbed the CAPS (Coupled Author-Paper Scoring) algorithm, is designed to address some of the weaknesses of Pal and Ruj's original algorithm described in Section 3 (essentially, the shortcomings can be traced to artificially introduced self-citations at the paper level).

Using a real dataset (ISI papers published from 1980–2012 under the JCR category "Information Science & Library Science"), we show that our proposed modifications outperform CITEX in identifying important authors and papers. However, the CAPS algorithm appears to suffer from high inequality in the resulting score distributions, as indicated by an extremely high Gini coefficient (~0.99), reflecting a "rich get richer" effect. In contrast, CITEX rewards high scores to authors lying at the tail of the publication productivity distribution and, by association, rewards high scores to papers published by such authors, irrespective of the relative importance of their papers within the paper citation network.
In this sense, CITEX is useful for finding instances where high productivity is mismatched with low impact.

While bibliometric analytic algorithms such as CITEX or CAPS, or even bibliometric adaptations of website ranking algorithms such as HITS or PageRank, can prove useful in identifying what is important in a given dataset, it is crucial to be aware of the limitations and subtleties of such methods. Each method finds exactly what it is designed to seek, and since it is hard to account for, let alone anticipate, every relevant feature or contingency, we must concede that the rankings produced are themselves only facets of the underlying organization in the data. Hence, bibliometric analytic algorithms should be used first and foremost to guide decisions on where to look deeper (i.e. to construct recommendation engines), and, if necessary, used with extreme caution when drawing inferences on the relative standing of bibliometric entities.

References
Balakin, K. V. (2010). Pharmaceutical Data Mining: Approaches and Applications for Drug Discovery. John Wiley & Sons, Hoboken, New Jersey, USA.

Barabási, A., Albert, R., and Jeong, H. (1999). Mean-field theory for scale-free random networks. Physica A: Statistical Mechanics and its Applications, 272(1):173–187.

Bhatt, A. and Martens, B. (2009). The topics of CAAD: An evolutionary perspective. In Tidafi, T. and Dorta, T., editors, Joining Languages, Cultures and Visions: CAAD Futures 2009, Proceedings of the 13th International CAAD Futures Conference, Montréal. Les Presses de l'Université de Montréal.

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117.

Chen, P., Xie, H., Maslov, S., and Redner, S. (2007). Finding scientific gems with Google's PageRank algorithm. Journal of Informetrics, 1(1):8–15.

de Solla Price, D. J. (1965). Networks of scientific papers. Science, 149(3683):510–515.

Easley, D. and Kleinberg, J. (2010). Networks, Crowds, and Markets. Cambridge University Press.

Franceschet, M. (2011). PageRank: Standing on the shoulders of giants. Communications of the ACM, 54(6):92–101.

Frobenius, G. F. (1912). Über Matrizen aus nicht negativen Elementen. Königliche Akademie der Wissenschaften, pages 456–477.

Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46):16569.

Jähne, B. (2000). Computer Vision and Applications: A Guide for Students and Practitioners. Academic Press, San Diego, CA, USA.

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632.

Langville, A. N. and Meyer, C. D. (2006). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, New Jersey, USA.

Newman, M. (2009). The first-mover advantage in scientific publication. EPL (Europhysics Letters), 86(6):68001.

Pal, A. and Ruj, S. (2015). CITEX: A new citation index to measure the relative importance of authors and papers in scientific publications. arXiv preprint arXiv:1501.04894.

Perron, O. (1907). Zur Theorie der Matrices. Mathematische Annalen, 64(2):248–263.

Price, D. d. S. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5):292–306.

Rethlefsen, M. L. and Aldrich, A. M. (2013). Environmental health citation patterns: mapping the literature 2008–2010. Journal of the Medical Library Association: JMLA, 101(1):47.

Shockley, W. (1957). On the statistics of individual variations of productivity in research laboratories. Proceedings of the IRE, 45(3):279–290.

Simon, H. (1955). On a class of skew distribution functions. Biometrika, pages 425–440.

Zhou, D., Orshanskiy, S. A., Zha, H., and Giles, C. L. (2007). Co-ranking authors and documents in a heterogeneous network. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM).