Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees
Dattatreya Mohapatra, Abhishek Maiti, Sumit Bhatia, Tanmoy Chakraborty
IIIT-Delhi, India; IBM Research AI, New Delhi, India
{dattatreya15021,abhishek16005,tanmoy}@iiitd.ac.in, [email protected]
ABSTRACT
Despite a long history of use of ‘citation count’ as a measure to assess the impact or influence of a scientific paper, the evolution of follow-up work inspired by the paper and their interactions through citation links have rarely been explored to quantify how the paper enriches the depth and breadth of a research field. We propose a novel data structure, called Influence Dispersion Tree (IDT), to model the organization of follow-up papers and their dependencies through citations. We also propose the notion of an ideal IDT for every paper and show that an ideal (highly influential) paper should increase the knowledge of a field vertically and horizontally. Upon suitably exploring the structural properties of IDT (both theoretically and empirically), we derive a suite of metrics, namely Influence Dispersion Index (IDI) and Normalized Influence Divergence (NID), to quantify the influence of a paper. Our theoretical analysis shows that an ideal IDT configuration should have equal depth and breadth (and thus minimize the NID value).

We establish the superiority of NID as an influence measure in two experimental settings. First, on a large real-world bibliographic dataset, we show that NID outperforms raw citation count as an early predictor of the number of new citations a paper will receive within a certain period after publication. Second, we show that NID is superior to the raw citation count at identifying the papers recognized as highly influential through the ‘Test of Time Award’ among all their contemporary papers (published in the same venue). We conclude that in order to quantify the influence of a paper, along with the total citation count, one should also consider how the citing papers are organized among themselves. For reproducibility, the code and datasets used in this study are being made available to the community.
A common consensus among the Scientometrics community is that the total number of citations received by a scientific article can be used to quantify its impact on the research field [16, 17]. Citation count, being a simple metric to compute and interpret, is commonly used in many decision-making processes such as faculty recruitment, fund disbursement, and tenure decisions. Many improvements over raw citation count have also been proposed by incorporating additional constraints. Examples include normalizing citation counts by the maximum citation count a paper could achieve in a particular research field [33], metrics inspired by PageRank [12], taking into account the locations of citation mentions in the paper (e.g., Introduction, Related Work, etc.) [37], and understanding the reasons behind citations and assigning different weights to different citations based on these reasons [7].

While these measures improve on the raw citation count, they are fundamentally still aggregate measures, as they ignore the relationships between the different papers that cite a given paper. We posit that such connections are useful and that studying them can help us better understand the propagation of influence from a paper to its different citing papers. Rather than proposing yet another variant of citation count, we are interested in unraveling these structural connections between the followup papers of a given paper and in understanding the differentiating structural properties of influential papers.
Motivation:
We posit that the impact of a scientific paper can broadly be studied across two dimensions: (i) how many different research directions it gives rise to; and (ii) how much traction these individual research directions gather in the field. In the former case, we say that the influence of the paper has breadth: it helps in expanding the field horizontally, leading to an increase in the breadth of the field. A paper with such a broad influence may even trigger the emergence of a new sub-field. In the latter case, we say that the paper has had a deep influence on the field, with a large number of papers in a given research direction. Intuitively, highly influential papers are the ones that have a deep and broad influence on the field. Influence measures that are variants of the raw citation count of the paper may not offer such a fine-grained understanding of the contribution of a paper to its field. Quantifying the impact of a paper in terms of its depth and breadth may also help to uncover the relationship between its different citing papers [24] and thus to understand the diffusion patterns of scientific ideas through citation links [9] and to predict the structural virality [19] and citation cascade [8, 24, 30]. While there have been recent efforts to study these structural properties of networks formed by a paper and its citing papers [24, 30], none of these studies have attempted to develop a metric to quantify the influence of a paper from its network topology.
We are the first to propose a series of metrics to quantify a new facet of influence that a paper has had on its followup papers.

Our Contributions:
Our major contributions are threefold:

(i) A framework to model the depth and breadth of the influence of a paper by a novel network structure, called the Influence Dispersion Tree (IDT) (Section 3). The IDT of a paper $P$ is a directed tree rooted at $P$ with all its citing papers as the children. The tree is constructed such that the citing papers having citation links among themselves are grouped to represent a body of work influenced by the root paper $P$ (Section 3.1). These bodies of work, along with the number of papers in each group, are then used to model the depth and breadth of the impact of $P$. We also present a theoretical analysis of the properties of the IDT structure and show how these properties are related to the citation count of the paper (Section 3.2).

(ii) A series of measures to quantify the influence of a scientific paper: For a scholarly paper $P$, we propose a novel metric, called Influence Dispersion Index (IDI), derived from its IDT to quantify the contribution of the paper to its field (by increasing depth or breadth or both) through influence diffusion (Section 3.3). We argue that in an ideal scenario, the influence of a paper should be dispersed to maximize the depth as well as the breadth of its influence. We then derive the configuration of the IDT of such a paper and prove that such an optimal IDT configuration has equal depth and breadth (both equal to $\lceil \sqrt{n} \rceil$, where $n$ is the number of citations of the given paper). Next, we propose another metric, called Influence Divergence (ID), that measures how the IDI value of a paper diverges from the IDI value of the optimal IDT configuration (Section 3.5). A lower divergence indicates that the influence of the paper under consideration is dispersed in a way that is similar to the ideal case, and consequently the paper is more likely to be considered highly influential. We further derive a normalized version of ID, called Normalized Influence Divergence (NID), that normalizes the influence divergence values of papers with different citation counts to the range $[0, 1]$ and allows for comparing different papers based on their NID values.

(iii) Empirical validation on large real-world datasets: We use a large bibliographic dataset consisting of about 3.9 million articles […] a highly influential paper tends to have an IDT with high breadth as well as high depth. For reproducibility, the code and the dataset are available at https://github.com/LCS2-IIITD/influence-dispersion.

JCDL ’19, June 2019, Urbana-Champaign, Illinois, USA (arXiv, cs.DL). Mohapatra et al.

There has been a plethora of research to measure the impact of scientific articles through various forms of citation analysis. In this section, we separate the related work into two parts: (i) studies dealing with citation count and its variants for measuring the impact, and (ii) studies exploring the detailed orchestration of citations around scientific papers.
Searching for accurate and reliable indicators of research performance has a long and often controversial history. Citation data is frequently used to measure scientific impact [16, 17]. Most citation indicators are based on citation counts: Journal Impact Factor [18], h-index [21], Eigenfactor [14], i-10 index [11], c-index [31], etc. Many variations and adaptations were proposed to compensate for the drawbacks of these indices. For instance, m-quotient [21, 39] attempts to eliminate the bias of h-index towards older researchers/articles. g-index [13] and e-index [41] were proposed to overcome the bias against authors with heavily cited articles. We proposed C-index [32] to resolve ties while ranking medium-cited and low-cited authors by h-index. Even though so many variations of h-index were proposed in the literature, Bornmann et al. [4] concluded that most of them are redundant by showing a mean correlation coefficient of 0.[…] […] influmetrics [3], webometrics [1], usage metrics [26], altmetrics [20], etc. Chakraborty et al. [5] showed that the change in yearly citation count of articles published in journals is different from that of articles published in conferences. Even the evolution of the yearly citation count of papers varies across disciplines [6, 34]. This further motivates the design of domain-specific impact measurement metrics.

Despite such a vast literature on the use of citation count for assessing the quality of the scientific community, the evolution of citation structure has remained largely unexplored. There have been a few recent studies which attempted to understand the organization of citations around a scientific entity (paper, author, venue, etc.), particularly focusing on the topology of the graph constructed from the induced subgraph of papers citing the seed paper.
Waumans and Bersini [40] took an evolutionary perspective to propose an algorithm for constructing genealogical trees of scientific papers on the basis of their citation count evolution over time. This is useful to trace the evolution of certain concepts proposed in the seed paper. Singh et al. [38] developed a relay-linking model for prominence and obsolescence to include factors like aging and decline in the evolving citation network. Min et al. [29] characterized the citation diffusion process using a classic marketing model [2] and noticed some interesting patterns in the spread of scientific ideas. Inspired by information cascade modeling in online social networks [10], they [30] further made an attempt to study the behavior of citation cascades. They concluded that the average depth of the cascade tends to be influenced by both the lifespan and the whole volume of scientific literature. Huang et al. [24] and Chen [8] argued that citation cascades help us better understand the citation impact of a scientific publication. They empirically showed that most of the properties of the cascade graph (such as cascade size, edge count, in-degree, and out-degree) follow typical power law distributions; however, cascade depth follows an exponential distribution.
Although recent studies [8, 24, 30] argued that there is a need to explore the organization of citations (followup papers) around a seed paper in order to better measure scientific impact, no one has quantitatively studied the impact of such a network. We are the first to propose an impact measurement metric, called ‘Influence Dispersion Index’ (Section 3.3), which is derived by converting a rooted citation network to a sparse representation, called ‘Influence Dispersion Tree’ (IDT) (Section 3). We show how an optimal orientation of the IDT (in terms of its depth and breadth) helps in gaining more impact, which may not be explained by simple citation count. Moreover, the construction of the IDT is unique and different from the citation cascade graph proposed earlier [8, 24, 30] (see Section 3 for more details).
In this section, we first develop and define the concept of the Influence Dispersion Tree of a scholarly paper and describe some of the properties of IDTs. We then develop a simple measure to estimate the influence of a scholarly paper given its IDT.
Let us consider a scholarly paper $P$ and let $C_P = \{p_1, p_2, \ldots, p_n\}$ be the set of papers citing $P$. We assume that $P$ has equally and directly influenced each and every paper in $C_P$. (Although previous studies [7, 42] have found that a paper has a varying amount of influence on its citing papers, it is a common practice to assume uniform influence for simplification, e.g., in computing impact factors, h-index [22], etc., and this is the assumption we also make.)

Definition 1 (Influence Dispersion Graph). The Influence Dispersion Graph (IDG) of the paper $P$ is a directed and rooted graph $G_P(V_P, E_P)$ with $V_P = C_P \cup \{P\}$ as the vertex set and $P$ as the root. The edge set $E_P$ consists of edges of the form $\{p_u \to p_v\}$ such that $p_u \in V_P$, $p_v \in C_P$ and $p_v$ cites $p_u$.

Figure 1(a) shows an illustration of an IDG for the paper $P$ and its citing paper set $\{p_1, p_2, p_3, p_4, p_5\}$. Observe that the IDG of paper $P$ is the same as the induced subgraph of the larger citation graph consisting of $P$ and all its citing papers, with edges in the opposite direction to indicate the propagation of influence from the cited paper to the citing paper. Further, note that the construction of an IDG is similar to that of citation cascades [24, 29], with the fundamental difference that the IDG is restricted strictly to the one-hop citation neighborhood of $P$ (i.e., papers that are directly influenced by $P$), as opposed to the citation cascade, which considers higher order citation neighborhoods as well (i.e., papers indirectly influenced by $P$). Thus, an IDG only considers followup papers that are directly influenced by a given paper. If $p_1$ cites $P$, and $p_2$ cites $p_1$ but not $P$, it is not always clear whether $p_2$ is influenced by both $P$ and $p_1$, or solely by $p_1$. Thus, we make the stricter and unambiguous choice of including only $p_1$ in the IDG. Though variants of the IDG could be constructed by adding additional followup papers, we believe that the major conclusions drawn from the paper will remain valid owing to the stricter and unambiguous process of constructing the IDG.

Next, to further analyze and study the influence of paper $P$ on its citing papers, we derive the Influence Dispersion Tree (IDT) of $P$ from its IDG. A tree structure, by definition, provides a hierarchical view of the influence $P$ exerts on its citing papers and an easy to understand representation of the relation between $P$ and its citing papers. The IDT of paper $P$ is a directed and rooted tree $T_P = \{V_P, E'_P\}$ with $P$ as the root. The vertex set is the same as that of the IDG of $P$, and the edge set $E'_P \subset E_P$ is derived from the edge set of the IDG as described next.

Note that a paper $p_v \in C_P$ can cite more than one paper in $V_P$, giving rise to the following three possibilities:

(1) $p_v$ cites only the root paper $P$. In this case, we add the edge $P \to p_v$, creating a new branch in the tree emanating from the root node (e.g., the edges from $P$ to the red papers in Fig. 1(b)).

(2) $p_v$ cites the root paper $P$ and one other paper $p_u \in C_P \setminus \{p_v\}$. In this case, we say that $p_v$ is influenced by $P$ as well as $p_u$. There are two possible edges here: $P \to p_v$ and $p_u \to p_v$. However, since $p_u$ is also influenced by $P$, the edge $p_u \to p_v$ indirectly captures the influence that $P$ has on $p_v$. We therefore retain only the edge $p_u \to p_v$. This choice adds a new leaf node to the IDT, capturing the chain of impact starting from $P$ up to the leaf node $p_v$ (e.g., the edge into a green paper in Fig. 1(b)).

(3) $p_v$ cites the root paper $P$, as well as a set of other papers $P_u \subseteq C_P \setminus \{p_v\}$, $|P_u| \ge 2$. Note that by definition, each $p \in P_u$ also cites the root paper $P$. The possible edges to add here are $E = \{\{p \to p_v\};\ \forall p \in P_u\}$. We add the edge $e$ to $E'_P$ such that $e = p \to p_v$, where

$$p = \arg\max_{p' \in P_u} \mathrm{shortestPathLength}(P, p') \qquad (1)$$

The edge into the blue paper in Fig. 1(b) is such an edge.

The intuition behind adding edges in this way is to maximize the depth of the IDT (if more than one edge maximizes the depth, we choose one of them at random, e.g., the edge into the yellow paper in Fig. 1(b)). The edge construction mechanism is motivated by the citation cascade graph [24, 30]. Upon adding a newly citing paper to $T_P$, we reconstruct $T_P$ in such a way that the richness of $P$’s influence on its citing papers is maximally preserved. Richness maximization can be thought of as maximizing either the breadth or the depth of the IDT. We choose the latter in order to capture the cascading effect in the resultant IDT.

Definition 2 (Influence Dispersion Tree).
The Influence Dispersion Tree (IDT) of paper $P$ is a tree $T_P(V_P, E'_P)$ whose vertex set $V_P$ is the union of $P$ and all the papers citing $P$. If a paper $p_v$ cites only $P$ and no other papers in $V_P$, we add $P \to p_v$ to the edge set $E'_P$. If $p_v$ cites other papers $P_u \subseteq V_P \setminus \{P\}$ along with $P$, we add only one edge $p_x \to p_v$ (where $p_x \in P_u$) according to Equation 1.

Definition 3 ($P$-rooted IDT). An IDT is called a $P$-rooted IDT when the root node of the tree is $P$.

Figure 1 illustrates a toy example of constructing an IDT from an IDG, covering all three possible cases of edge connections discussed above.
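The three edge-selection cases can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the input format (a dict from paper id to the set of ids it cites), the function name `build_idt`, and the toy data are all assumptions. We read shortestPathLength(P, p') in Equation 1 as the depth of p' in the tree built so far, assume the citation data is acyclic, and break ties deterministically (smallest id) where the text breaks them at random.

```python
def build_idt(root, cites):
    """Return (parent, depth) describing the IDT of `root`.

    cites: dict mapping each paper id to the set of paper ids it cites
           (a hypothetical input format, assumed to be acyclic).
    """
    citing = {p for p, refs in cites.items() if root in refs}  # the set C_P
    parent, depth = {}, {root: 0}

    def place(p):
        """Attach p to the tree (if not yet attached) and return its depth."""
        if p not in depth:
            # Papers of C_P that p cites, sorted so that ties are reproducible.
            candidates = sorted(q for q in cites[p] if q in citing)
            if candidates:
                # Cases (2) and (3), Equation 1: attach below the deepest one.
                # max() keeps the first maximum, so ties go to the smallest id
                # (an assumption; the text picks among ties at random).
                parent[p] = max(candidates, key=place)
            else:
                parent[p] = root  # case (1): p cites only the root
            depth[p] = depth[parent[p]] + 1
        return depth[p]

    for p in sorted(citing):
        place(p)
    return parent, depth


# Hypothetical toy data (not a reproduction of Figure 1).
toy_cites = {
    "p1": {"P"},
    "p2": {"P"},
    "p3": {"P", "p1"},
    "p4": {"P", "p1", "p3"},
    "p5": {"P", "p1", "p2"},  # p1 and p2 are equally deep: a tie
    "p6": {"p1"},             # does not cite P, so it is not part of the IDG
}
parent, depth = build_idt("P", toy_cites)
print(parent)
```

Storing the tree as a child-to-parent map keeps the structure compact, since every non-root node of an IDT has exactly one incoming edge.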
Figure 1: (a)-(b) Illustration of the construction of (b) the IDT from (a) the IDG of paper $P$. Papers in red cite only $P$; papers in green cite $P$ and one other paper in the graph; the blue paper cites $P$ and more than one other paper in the graph. For the yellow paper, a tie occurs because connecting it from either of two candidate papers would equally maximize the depth of the IDT; the tie is resolved by choosing one of them at random. (c)-(d) Two corner cases to illustrate the lower bound: the maximum and the minimum number of leaf nodes. (e) A configuration of a $P$-rooted IDT with $n$ non-root nodes that results in the maximum IDI value.

In this section, we describe a few important properties of an IDT.

(i) Depth:
The depth $d$ of a $P$-rooted IDT is defined as the length of the longest path from the root to a leaf node in the tree:

$$d = \max_{p_l \in p_L} \mathrm{shortestPathLength}(P, p_l) \qquad (2)$$

where $d$ is the depth of the tree and $p_L$ is the set of leaf nodes in the IDT. The depth of the IDT shown in Figure 1(b) is 3. The depth of an IDT can be interpreted as the longest chain/series of papers representing a body of work influenced by $P$.

(ii) Breadth: The breadth $b$ of a $P$-rooted IDT is defined as the maximum number of nodes at a given level in the tree:

$$b = \max_{0 \le l \le d} |N_l|; \quad N_l := \{n \in V_P \mid \mathrm{level}(n) = l\} \qquad (3)$$

The breadth of the IDT shown in Figure 1(b) is 2.

(iii) Branch: A branch $P \leadsto p_l$ is a path from the root $P$ to the leaf $p_l$ in an IDT.

(iv) Fragmented and Unified Branch: A branch $P \leadsto p_l$ is called fragmented when an intermediate node (other than the root) $p \in P \leadsto p_l$ is also part of another branch $P \leadsto p_{l'}$. $p$ is then called a fragment point of $P \leadsto p_l$. In Figure 1(e), $P \leadsto p_{k+1}$ is a fragmented branch with $p_k$ as a fragment point. If a branch is not fragmented, it is called a unified branch. In Figure 1(d), $P \leadsto p_n$ is a unified branch.

We now define some properties to describe how the depth and breadth of a $P$-rooted IDT are related to $n$, the number of citations of $P$ (and the number of non-root nodes in the IDT of $P$).

Lemma 1.
For a paper $P$ with $n$ citations, the range of the depth $d$ and breadth $b$ of the $P$-rooted IDT is $1 \le d, b \le n$.

Proof. The breadth of a $P$-rooted IDT will be maximum (i.e., $n$) when all the $n$ papers cite only the root paper $P$ and there is no citation among these $n$ papers (e.g., Figure 1(c)). Likewise, the depth of a $P$-rooted IDT will be maximum (i.e., $n$) when there is a chain of $n$ papers $\{P, p_1, p_2, \cdots, p_n\}$ forming a unified branch such that $p_i$ cites $p_{i-1}$, $\forall\, 2 \le i \le n$, and $p_i$ also cites $P$, $\forall i$ (e.g., Figure 1(d)). □

Lemma 2.
For a paper $P$ with $n$ citations, the sum of depth $d$ and breadth $b$ of the $P$-rooted IDT is bounded by $n + 1$, i.e., $d + b \le n + 1$.

Proof. When a new node is added to the IDT, there are four possibilities: breadth increases, depth increases, both increase, or neither increases. The sum of $d$ and $b$ will be maximum when both of them are individually maximum. This is only possible when all nodes but the root are involved in increasing either depth or breadth or both. However, only one node, i.e., the first node attached to the root node, can increase both depth and breadth; the rest will increase either depth or breadth, but not both. Since the total number of non-root nodes added to the IDT is $n$, the sum of $b$ and $d$ can attain a maximum value of $n + 1$. □

Lemma 3.
For a paper $P$ with $n$ citations and its $P$-rooted IDT, the product of its depth $d$ and breadth $b$ is at least $n$, i.e., $db \ge n$.

Proof. $d$ is the maximum length of any branch, and $b$ is indicative of the number of branches from root to leaf. So, for an IDT whose branching occurs at the root node itself and nowhere else, $db$ represents the number of nodes it can hold while keeping its depth at $d$ and breadth at $b$, obtained by adding nodes to those branches that have length less than $d$. Since $n$ is the number of nodes already present in the IDT, the number of nodes we can still add is $db - n$. Since this quantity counts nodes, it is always non-negative, and we have

$$db - n \ge 0 \implies db \ge n \qquad (4)$$

For those IDTs which have branching in places other than the root, i.e., fragmented branches, the nodes above the branching nodes are counted more than once, as they belong to multiple root-to-leaf paths. Hence $db$ gives a larger number of nodes than are present in the IDT, so that

$$db > n \qquad (5)$$

Therefore, in both cases, $db \ge n$. □

Figure 2: Reconnecting leaf edges of a star IDT (a) to form other configurations.

Given the IDT of a paper, we define its Influence Dispersion Index (IDI) as the sum of the lengths of all the paths from the root node to the leaf nodes.
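Before the formal definition of IDI, the depth and breadth defined above and the bounds of Lemmas 1 to 3 can be checked numerically on small trees. The sketch below is illustrative only: the child-to-parent dict format and the helper names `depth_and_breadth` and `satisfies_lemmas` are assumptions, and the three example trees are hypothetical.

```python
from collections import Counter


def depth_and_breadth(parent, root):
    """Return (d, b) for a rooted tree given as a child -> parent dict."""
    level = {root: 0}

    def level_of(v):
        if v not in level:
            level[v] = level_of(parent[v]) + 1
        return level[v]

    # Count non-root nodes per level. Level 0 holds only the root, so
    # leaving it out does not change the maximum once n >= 1.
    per_level = Counter(level_of(v) for v in parent)
    return max(per_level), max(per_level.values())


def satisfies_lemmas(parent, root):
    """Check Lemma 1 (1 <= d, b <= n), Lemma 2 (d + b <= n + 1), Lemma 3 (db >= n)."""
    n = len(parent)
    d, b = depth_and_breadth(parent, root)
    return 1 <= d <= n and 1 <= b <= n and d + b <= n + 1 and d * b >= n


n = 6
star = {f"p{i}": "P" for i in range(1, n + 1)}  # all papers cite only P
chain = {f"p{i}": ("P" if i == 1 else f"p{i - 1}") for i in range(1, n + 1)}
mixed = {"p1": "P", "p2": "P", "p3": "p1", "p4": "p3", "p5": "p2"}

for name, tree in [("star", star), ("chain", chain), ("mixed", mixed)]:
    print(name, depth_and_breadth(tree, "P"), satisfies_lemmas(tree, "P"))
```

The star and the chain attain the two extremes of Lemma 1 and both meet the bound of Lemma 2 with equality.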
Definition 4 (Influence Dispersion Index).
The IDI of paper $P$ is defined as

$$IDI(P) = \sum_{p_l \in p_L} \mathrm{distance}(P, p_l) \qquad (6)$$

where $p_L$ is the set of leaf nodes of $P$’s IDT $T_P(V_P, E'_P)$. The IDI of $P$ in Figure 1(b) is 5.

Intuitively, each leaf node in $P$’s IDT corresponds to a separate branch emanating from the original paper $P$. Each branch comprises the set of papers which are influenced by the root paper in one direction. We can interpret IDI as a measure of the ability of the paper to distribute its influence. We hypothesize that the more unified branches an IDT has, the greater the chance that the influence emanating from $P$ is distributed uniformly.

For a $P$-rooted IDT with $n$ non-root nodes, the minimum value of IDI is $n$. This is because each node (paper) in the tree will be encountered at least once while computing IDI, resulting in the lower bound of $n$. Figures 1(c) and (d) show the two corner cases: the star configuration with the maximum number of leaf nodes (i.e., $n$) and the chain configuration with the minimum number of leaf nodes (i.e., 1). Note that given the size of the IDT, there can be multiple configurations with the minimum IDI value. Starting from a star IDT (Figure 1(c)), if we pick an edge and connect it to any leaf node or to the root node, the IDI of the resultant configuration remains the same. In fact, if we keep repeating the same reconnection step, all the resultant configurations exhibit the same IDI value. In short, during the transformation of a star IDT into a line IDT by reconnecting a leaf edge (an edge one of whose end nodes is a leaf) to another leaf node or to the root node, all the intermediate IDTs exhibit the same IDI of $n$. Figure 2 shows a toy example of the reconfiguration. We will discuss more in Section 3.4.3.
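Equation 6 and the lower bound of $n$ can be illustrated with a short sketch. The child-to-parent dict format, the function name `idi`, and the example trees are assumptions made for illustration; none of them come from the paper's released code.

```python
def idi(parent, root):
    """Influence Dispersion Index: sum of root-to-leaf path lengths (Equation 6)."""
    internal = set(parent.values())           # nodes with at least one child
    leaves = [v for v in parent if v not in internal]

    def distance_from_root(v):
        hops = 0
        while v != root:
            v = parent[v]
            hops += 1
        return hops

    return sum(distance_from_root(leaf) for leaf in leaves)


n = 5
star = {f"p{i}": "P" for i in range(1, n + 1)}
chain = {f"p{i}": ("P" if i == 1 else f"p{i - 1}") for i in range(1, n + 1)}
rewired = dict(star, p5="p4")                 # one leaf edge of the star moved under leaf p4
fragmented = {"p1": "P", "p2": "p1", "p3": "p1"}  # p1 lies on two branches

print(idi(star, "P"), idi(chain, "P"), idi(rewired, "P"))
print(idi(fragmented, "P"))
```

The star, the chain and the rewired star all consist of unified branches only and give IDI equal to $n$, while the fragmented tree counts the shared node `p1` once per branch and therefore exceeds its $n = 3$.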
In order to maximize the value of IDI, a $P$-rooted IDT should satisfy the following three conditions:

(1) The number of leaves should be as large as possible.
(2) The length of the branch from root to leaf should be as long as possible.
(3) The number of common nodes in each root-to-leaf branch should be maximized so that each node count is maximized.

Subject to the constraint on the number of nodes in the tree (i.e., $n + 1$), these conditions are met by the configuration shown in Figure 1(e). Let the IDI of such a $P$-rooted IDT with $n$ non-root nodes be $IDI(P, k)$, where $k$ is the number of nodes forming a chain from $P$ (excluding $P$) and node $p_k$ has $(n - k)$ descendants. Then, $IDI(P, k)$ is determined as follows:

$$IDI(P, k) = k(n - k) + (n - k) \qquad (7)$$

Differentiating it with respect to $k$ and setting the derivative to zero, we get

$$\frac{\partial\, IDI(P, k)}{\partial k} = n - 2k - 1 = 0 \qquad (8)$$

$$k = \left\lfloor \frac{n - 1}{2} \right\rceil \qquad (9)$$

This yields the maximum value of IDI as

$$IDI(P)_{max} = \left(1 + \left\lfloor \frac{n - 1}{2} \right\rceil\right)\left(n - \left\lfloor \frac{n - 1}{2} \right\rceil\right) \qquad (10)$$

Therefore, for a $P$-rooted IDT with $n$ non-root nodes, we have the following bounds on its IDI:

$$n \le IDI(P) \le \left(1 + \left\lfloor \frac{n - 1}{2} \right\rceil\right)\left(n - \left\lfloor \frac{n - 1}{2} \right\rceil\right) \qquad (11)$$

[…] $d$, $b$ and $n$ for Optimal Dispersion. As discussed above, a paper with a given number of citations $n$ can have differently shaped IDTs and, consequently, very different IDI values. Intuitively, we expect a highly influential paper to have multiple long unified branches, i.e., it should have a high depth value as well as a high breadth value. Thus, we want the IDT of a highly influential paper to have high depth, high breadth, and a tree structure in which the non-root nodes are distributed as uniformly as possible across the different branches of the tree, indicating significant depth in each branch. Also, recall from Lemma 3 that for given values of $d$ and $b$, the number of nodes in an IDT cannot be more than $db$ (i.e., $n \le db$). This leads us to a constrained objective function (Equation 12) that the IDT in its optimal configuration should satisfy.
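The closed form for the maximum IDI (Equations 7 to 10) can be sanity-checked by brute force over the chain length $k$. The sketch below only searches the one-parameter family of Figure 1(e) as we read it (a chain of $k$ nodes followed by $n - k$ leaves), so it checks the algebra rather than proving optimality over all trees; the helper names are hypothetical.

```python
def chain_then_fan(n, k):
    """Child -> parent dict: chain p1..pk below root P, then n - k leaves under pk
    (or directly under P when k = 0)."""
    parent, prev = {}, "P"
    for i in range(1, k + 1):
        parent[f"p{i}"] = prev
        prev = f"p{i}"
    for i in range(k + 1, n + 1):
        parent[f"p{i}"] = prev
    return parent


def idi(parent, root="P"):
    """Sum of root-to-leaf path lengths (Equation 6)."""
    internal = set(parent.values())
    total = 0
    for v in parent:
        if v not in internal:        # v is a leaf
            while v != root:
                v = parent[v]
                total += 1
    return total


def idi_max_closed_form(n):
    """Equation 10, rounding (n - 1) / 2 down; rounding up gives the same product."""
    k = (n - 1) // 2
    return (1 + k) * (n - k)


for n in range(1, 60):
    values = [idi(chain_then_fan(n, k)) for k in range(n)]
    assert values == [k * (n - k) + (n - k) for k in range(n)]  # Equation 7
    assert max(values) == idi_max_closed_form(n)                # Equation 10

print(idi_max_closed_form(5))
```

Because $IDI(P, k) = (k + 1)(n - k)$ is symmetric around $k = (n - 1)/2$, both integer neighbours of a half-integer optimum give the same maximum, which is why the rounding direction does not matter here.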
Figure 3: Illustration of an optimal configuration of a $P$-rooted IDT of a paper $P$ with $n$ citations. The depth and breadth of the IDT are the same ($k = r = \lceil \sqrt{n} \rceil$).

$$\text{minimize } (db - n) \quad \text{s.t.} \quad d + b \le n + 1, \quad db \ge n \ \text{(from Lemma 3)} \qquad (12)$$

This yields an optimal configuration where $d = b = \lceil \sqrt{n} \rceil$.

Proof. As discussed, $db$ represents the maximum number of nodes the tree can have with depth $d$ and breadth $b$. The IDT will have the maximum number of nodes for given $d$ and $b$ only when all the branches in the IDT are unified branches. This condition forces all the branches of the IDT to branch out from the root node. If $k$ is the number of nodes in each unified branch of the optimal tree, and there are $r$ such branches, then the number of nodes in this IDT will be $kr$ (assuming equal length for each branch). Since $k$ and $r$ are equal for an optimal IDT as discussed earlier, we have

$$k^2 = n \Rightarrow k = \sqrt{n} \qquad (13)$$

For IDTs where the nodes cannot be evenly distributed among an equal number of unified branches with each branch having an equal number of nodes (in other words, when the number of non-root nodes is not a perfect square), the corresponding $k$ comes out to be

$$k^2 \ge n \Rightarrow k = \lceil \sqrt{n} \rceil \qquad (14)$$

□

Figure 3 illustrates a paper with an optimal configuration where the IDT has an equitable distribution in terms of both depth and breadth, indicating that the paper has influenced multiple branches and that all the influenced branches have grown significantly. Note that the cost function favors configurations where the impact of the paper is maximized both in terms of depth and breadth, and hence penalizes configurations with a large number of short branches (high $b$, low $d$) or very few long branches (high $d$, low $b$).

In this section, we study the potential of IDI as an early predictor of the overall impact and influence of a scholarly article.
As discussed before, the IDI of a paper $P$ provides a fine-grained view of the influence of $P$ on other papers citing $P$, in terms of the depth and breadth of the IDT. As described in Section 3.4, for a paper with $n$ citations, there exists an ideal configuration of the IDT that optimizes the influence dispersion of the paper such that it has both high breadth (influenced multiple branches of work) and high depth (significantly deepened each individual branch). With this intuition, we posit that the closeness of the actual IDT of a given paper $P$ with $n$ citations, denoted by $T_P$, to its corresponding ideal IDT with $n$ citations, denoted by $\bar{T}_P$, can be used as a surrogate measure of the influence or impact of paper $P$. We can use any distance metric between two graphs, such as Graph Edit Distance [15] or Gromov-Wasserstein distance [28], to measure the closeness between $T_P$ and $\bar{T}_P$. However, all these measures are computationally expensive [15]. Therefore, we here use the IDI of each IDT as a proxy for its topological structure and measure the difference between the IDI values of $T_P$ and $\bar{T}_P$ (as a replacement for the graph distance). Recall from Section 3.4 that the IDI of an ideal IDT with $n$ non-root nodes is $n$ (which is also the lower bound of the IDI of an IDT with $n$ non-root nodes).

We define the Influence Divergence (ID) of a paper as the difference between the IDI value of its original IDT, $IDI(P)$, and that of its corresponding ideal IDT configuration, $\overline{IDI}(P)$:

$$ID(P) = IDI(P) - \overline{IDI}(P) \qquad (15)$$

We further normalize the ID value using max-min normalization.

Definition 5 (Normalized Influence Divergence).
The Normalized Influence Divergence (NID) of a paper $P$ is defined as the difference between the IDI value of its IDT and that of its corresponding ideal IDT configuration, $\overline{IDI}(P)$, normalized by the difference between the maximum and minimum IDI values of IDTs with the same size as $P$’s IDT. Formally, it is written as:

$$NID(P) = \frac{IDI(P) - \overline{IDI}(P)}{IDI^{max}_{|P|} - IDI^{min}_{|P|}} \qquad (16)$$

The normalization is needed to compare two papers with different IDI values. NID ranges between 0 and 1. Clearly, a highly influential paper will have a low $NID(P)$ (i.e., a lower deviation from its ideal dispersion index).

We used a publicly available dataset of scholarly articles provided by Chakraborty and Nandi [6]. The dataset contains about 4 million articles indexed by Microsoft Academic Search (MAS; https://academic.microsoft.com/). For each paper in the dataset, additional metadata such as the title of the paper, its authors and their affiliations, and the year and venue of publication are also available. The publication years of the papers in the dataset span over half a century, allowing us to investigate diverse types of papers in terms of their IDTs. A unique ID is also assigned to each author and publication venue upon resolving the named-entity disambiguation by MAS itself. We passed the dataset through a series of pre-processing stages, such as removing papers that do not have any citation and reference, and removing papers that have forward citations (i.e., citing a paper that is published after the citing paper; this may happen when a paper is archived before it is published). This filtering resulted in a final set of 3,908,805 papers.

Number of papers                                3,908,805
Number of unique venues                         5,149
Number of unique authors                        1,186,412
Avg. number of papers per author                5.21
Avg. number of authors per paper                2.57
Min. (max.) number of references per paper      1 (2,432)
Min. (max.) number of citations per paper       1 (13,102)
Table 1: Some important statistics about the MAS dataset.
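For a dataset of this kind, the per-paper quantities of Equations 15 and 16 reduce to simple arithmetic once the IDI and the citation count $n$ of a paper are known. The sketch below is an illustration under stated assumptions: it takes the ideal and minimum IDI to be $n$ and the maximum IDI from Equation 10, as stated in the text, and it returns 0 when the maximum and minimum coincide ($n \le 2$), a convention the paper does not specify. The function names and the sample records are hypothetical.

```python
def idi_bounds(n):
    """(min, max) IDI of an IDT with n non-root nodes (Equation 11)."""
    k = (n - 1) // 2
    return n, (1 + k) * (n - k)


def influence_divergence(idi_value, n):
    """Equation 15, using the stated ideal IDI of n."""
    return idi_value - n


def normalized_influence_divergence(idi_value, n):
    """Equation 16: max-min normalised divergence, in [0, 1]."""
    lo, hi = idi_bounds(n)
    if hi == lo:
        # Assumption: for n <= 2 every IDT has IDI = n, so we report 0.
        return 0.0
    return (idi_value - lo) / (hi - lo)


# Hypothetical records: (paper id, IDI of its IDT, citation count n).
records = [("A", 5, 5), ("B", 7, 5), ("C", 9, 5), ("D", 2, 2)]
for paper_id, idi_value, n in records:
    print(paper_id,
          influence_divergence(idi_value, n),
          normalized_influence_divergence(idi_value, n))
```

Working from the pair (IDI, $n$) rather than from the tree itself keeps the normalisation step independent of how the IDT was constructed.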
In this section, we report various empirical observations about the IDTs of the papers in our dataset that provide a holistic view of the topological structure of the trees. We also study how the depth and breadth of the IDTs, and the IDI and NID values, vary with the citation count of the papers.
Figure 4 plots the frequency distribution of depth and breadth of the IDTs for all the papers in the dataset. Observe that the values for breadth follow a very long-tailed distribution, with about 75% of papers having a breadth less than or equal to 3 (note the log scale on the x-axis in Fig. 4b). On the other hand, the range of the depth values for IDTs is much smaller than the range of breadth values. The maximum value of depth is 48, compared to the maximum breadth of 4,[…] understand the depth and breadth of the impact of these papers on their citing papers and measure the influence these papers have had on the fields.

[Figure 4 panels: (a) Depth, (b) Breadth; y-axis: Frequency.]
Figure 4: Frequency distributions for depth (4a) and breadth (4b) of IDTs of all the papers in the dataset. The x-axis in the plot for breadth is in logarithmic scale.

Figure 5 shows the distribution of breadth and depth with citations (Figures 5a and 5b, respectively) and the correlation between depth and breadth (Figure 5c). We observe that while breadth is strongly correlated with citation count (ρ = […]) […] (ρ = […]) […] body of work represented by an already formed branch (increasing the depth). Further, note from Figure 5c that the variation in breadth values reduces with increasing depth. Especially for IDTs with depth greater than 30, the values of breadth lie in a relatively narrow band (almost all IDTs with depth greater than 30 have breadth less than 300). This is indicative of highly influential papers that have spawned multiple directions of follow-up work, where incremental citations correspond to the continuation of these independent directions (thus increasing depth).
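To make the two quantities in this analysis concrete, the sketch below computes depth and breadth for a tree stored as a parent-to-children mapping. The definitions used here (depth as the number of edges on the longest root-to-leaf path, breadth as the largest number of nodes on any single level) are assumptions for illustration, as are the function name and the tree representation.

```python
def depth_and_breadth(children, root):
    """Return (depth, breadth) of the tree rooted at `root`.

    `children` maps a node to the list of its child nodes; leaves may be
    omitted from the mapping. The tree is traversed level by level.
    """
    depth = -1
    breadth = 0
    level = [root]
    while level:
        depth += 1
        breadth = max(breadth, len(level))
        # Collect all nodes on the next level.
        level = [child for node in level for child in children.get(node, [])]
    return depth, breadth


# Toy tree: P has three direct followers; one of them starts a chain.
toy_tree = {"P": ["a", "b", "c"], "a": ["d"], "d": ["e"]}
toy_result = depth_and_breadth(toy_tree, "P")
```

In this toy tree the three direct followers set the breadth, while the chain a, d, e sets the depth.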
We now study how the IDI and NID values vary with the citation counts across multiple papers. Figure 6 shows the scatter plots of IDI and NID values against citations for all the papers in the dataset. We observe that IDI values in general increase with the number of citations of a paper. This is along expected lines, as the IDI for a paper is bounded by the number of citations of the paper (Equation 11). A more interesting observation can be made from the plot for NID values (Figure 6b), where we see that, in general, the value of NID decreases with increasing citations: papers having a high number of citations tend to have very low values of NID. Recall that for a given paper, NID captures how far the IDI of the given paper is from that of its corresponding ideal IDT. Thus, highly influential papers tend to have their IDTs close to their ideal IDT configurations (as illustrated by the low NID value). This empirical observation strengthens our hypothesis that highly influential papers will, in general, lead to a considerable amount of follow-up work (high depth) in multiple directions (high breadth).

As discussed before, we hypothesize that highly influential papers produce IDTs that are close to their corresponding ideal configurations. In Section 5.2, we found that highly cited papers have very low NID values. Here we ask a complementary question: is a low NID value of a given paper an indicator of its future influence?
In other words, will a paper whose IDT is close to the ideal configuration at a given time be an influential paper in the near future? We design two experiments to answer the above question. In Section 6.1, we study whether NID can predict how many citations a paper will get in the future. In Section 6.2, we study whether the NID measure can identify highly influential papers, specifically papers that have been judged highly influential by the community and have been awarded Test of Time (ToT) awards. (Many conferences and journals award 'Test of Time' or '10 year influential paper award' to papers that have had a high impact on their respective fields. These papers are generally selected by a committee of senior researchers.)

[Table 2 columns: No., Paper.]
Table 2: A set of representative papers.

[Figure 5 panels: (a) Breadth vs. Citations, (b) Depth vs. Citations, (c) Depth vs. Breadth.]
Figure 5: Scatter plots showing variations of breadth with citations (a), depth with citations (b), and correlation between depth and breadth (c).

[Figure 6 panels: (a) IDI vs. Citations, (b) NID vs. Citations.]
Figure 6: Scatter plots showing variations of (a) IDI and (b) NID values with citation counts.

Let $P_v$ be the set of papers published in a publication venue $v$ (a conference or a journal). Let $y_v$ be the year of organization of $v$. Over the next $t$ years, papers in $P_v$ will influence the follow-up work and will gather citations accordingly. Let $I(p)$ be an influence measure under consideration. Let $R(v, t_1, I)$ be the ranked list of papers in $P_v$ ordered by the value of $I(\cdot)$ at $t_1$. Thus, the top-ranked paper in $R(v, t_1, I)$ is considered to have maximum influence at $t_1$. If $I(\cdot)$ is able to capture the impact correctly, we expect the papers with high influence scores to have more incremental citations in the future compared to papers having low influence scores. Let $C(v, t_1, t_2)$ be the ranked list of papers in $P_v$ ordered by the increase in citations from time $t_1$ to $t_2$. Thus, the papers that received the highest fractional increase in citations in the time period $(t_1, t_2)$ will be ranked at the top. Note that we chose the fractional increase in citation count rather than the absolute count to account for papers that are early risers and receive most of their lifetime citations in the first few years after publication [5]. Also, we consider only those papers published in a venue ($v$ here) rather than all the papers in our dataset to nullify the effect of diverse citation dynamics across fields and venues [6].

Intuitively, if $I(\cdot)$ is a good predictor of a paper's influence, the ranked lists $R(v, t_1, I)$ and $C(v, t_1, t_2)$ should be very similar: influential papers at time $t_1$ should receive more incremental citations from $t_1$ to $t_2$. Thus, the similarity of the two ranked lists can be used as a measure to evaluate the potential of $I(\cdot)$ to capture the influence of papers. We use the Kendall Tau rank distance $\mathcal{K}$ to measure the similarity of the two ranked lists $R(v, t_1, I)$ and $C(v, t_1, t_2)$ as follows.

$$z(v, I) = \mathcal{K}(R(v, t_1, I), C(v, t_1, t_2)) \tag{17}$$

A lower value of the $z$ score indicates that the two ranked lists are highly similar, which in turn shows that $I(\cdot)$ has high predictive power in forecasting the future incremental citations. We use this framework to evaluate the potential of NID (substituted for $I(\cdot)$ in this case) as an early predictor of future incremental citations of a paper. We use the number of citations of a paper as a competitor of NID, as it is the most common and simplest way of judging the influence of a paper [16, 17]. First, we group all the papers in our dataset by their venues and compute the values of the influence metrics (NID and citation count) after five years following the publication year (i.e., $t_1 = 5$) […] 1,219 unique venues and 30,556 papers in total.

With the group of papers published together in a venue and their citation information available, we compute the following three ranked lists:
(1) $R_{v,c} = R(v, 5, c)$: the ranked list of papers in venue $v$ ordered by their citation counts five years after the publication.
(2) $R_{v,nid} = R(v, 5, nid)$: the ranked list of papers in venue $v$ ordered by their NID scores five years after the publication.
(3) $C_v = C(v, 5, 10)$: the ranked list of papers in venue $v$ ordered by the normalized incremental citations received from the beginning of the 5th year after the publication until the 10th year after publication.

For each venue $v$, these lists can be used to compute $z(v, NID)$ and $z(v, c)$, i.e., the $z$ scores with NID and citation count as influence measures, respectively. For the 1,219 venues identified as above, the average value of the $z$ score using citations and NID as the influence measure is found to be 0.[…] […] $z$ score is lower when using NID as the influence measure compared to that with citation count. In other words, more papers identified as influential by NID received more incremental future citations compared to the papers identified as influential by citation count.

Figure 7 provides a fine-grained illustration of the difference of $z$ scores achieved by the two influence measures for each of the 1,219 venues. For each venue, we compute the difference $z(v, c) - z(v, NID)$. We note that for most of the venues, the $z$ score achieved by NID is lower than the $z$ score achieved by the citation count (positive bars). These observations indicate that, when compared with raw citation count, NID is a much stronger predictor of the future impact of a scientific paper. As opposed to the raw citation count, the IDT of a paper provides a fine-grained view of the impact of the paper in terms of its depth and breadth. These results provide compelling evidence for the utility of IDT (and the consequent measures such as IDI and NID derived from it) for studying the impact of scholarly papers.

[Figure 7 axes: x-axis: Venue; y-axis: z(v, c) - z(v, NID).]
Figure 7: z-scores for venues. Papers in a venue are ranked using NID, number of citations and relative gain in citations. The horizontal axis represents venues ordered by the difference in two z-scores.
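The evaluation framework of Equation 17 can be sketched for a single venue as below. This is an illustrative sketch under stated assumptions: the Kendall Tau distance is taken to be the fraction of discordant pairs (the normalization is an assumption), papers are ranked by ascending NID and by descending citation count, the helper names are hypothetical, and the four-paper venue is made-up toy data rather than anything from the dataset.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of item pairs ordered differently in the two rankings."""
    if sorted(ranking_a) != sorted(ranking_b):
        raise ValueError("rankings must contain the same items")
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    discordant = sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    n = len(ranking_a)
    pairs = n * (n - 1) // 2
    return discordant / pairs if pairs else 0.0


def rank_by(scores, descending=True):
    """Paper IDs ordered by score; ties keep insertion order."""
    return sorted(scores, key=scores.get, reverse=descending)


# Toy venue (made-up numbers): citations after 5 and 10 years, NID at 5 years.
cites_5y = {"A": 10, "B": 20, "C": 5, "D": 4}
cites_10y = {"A": 30, "B": 30, "C": 10, "D": 5}
nid_5y = {"A": 0.1, "B": 0.3, "C": 0.2, "D": 0.4}

# Fractional increase in citations between year 5 and year 10.
gain = {p: (cites_10y[p] - cites_5y[p]) / cites_5y[p] for p in cites_5y}

r_c = rank_by(cites_5y)                     # R(v, 5, c)
r_nid = rank_by(nid_5y, descending=False)   # R(v, 5, nid): low NID ranks first
c_v = rank_by(gain)                         # C(v, 5, 10)

z_c = kendall_tau_distance(r_c, c_v)
z_nid = kendall_tau_distance(r_nid, c_v)
```

In this toy venue the NID ranking matches the ranking by fractional gain exactly, while the citation ranking misplaces paper B, so `z_nid` is smaller than `z_c`.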
Many conferences recognize highly influential papers that have had a long-lasting impact on the respective field of research. This recognition is awarded in the form of Test of Time (ToT) awards, 10 Year Influential Paper Awards, etc. We manually collected a set of papers that have received ToT awards from their respective publication venues and obtained a list of 40 such papers (published in conferences like SIGIR, AAAI, ICCV, etc.) that are also present in our dataset.

Let $P$ be a ToT awardee paper that was published in year $y$ at venue $v$. We extracted all the papers from our dataset that were published at venue $v$ in year $y$. We then ordered these papers by their citation count at time $y + 10$ (i.e., 10 years after publication) and selected the top 5% highest-cited papers (including $P$). We consider these papers to be the major competitors of $P$ for the ToT award, since highly influential papers are expected to achieve a high number of citations. (Many conferences, e.g., SIGIR, nominate the top five most cited papers published in a year for the ToT award, in addition to getting nominations from the community.) We then compute the rank of $P$ in this set, denoted by $Rank(P, Cite)$. Similarly, we compute NID at time $y +$ […] $P$, denoted by $Rank(P, NID)$. If NID is a better measure of the paper's impact, then we expect $P$ to have a better rank (1 being the best outcome, i.e., the top paper) compared to the other papers in the compared set. Figure 8 plots $Rank(P, Cite)$ and $Rank(P, NID)$ for each ToT awardee paper $P$. We note that in most of the cases (25 out of 40), the ToT papers are the top-ranked papers by both citation count and NID.

Interestingly, we also note that in 12 out of 40 cases, the ranks of the ToT awardee papers achieved by NID are lower (better) than the ranks achieved by citation counts. Thus, the papers judged most influential by the community (by giving the ToT award) may not always have the highest citations among all their contemporary papers. There may be some subjective evaluation criteria that capture the influence a paper has had on the field. The results of this experiment indicate that NID is much better at capturing the influence of a paper: 33 out of 40 times, the ToT paper achieves rank 1 when ranked by NID. The overall Mean Reciprocal Rank (MRR) achieved by NID is 0.[…]

[Figure 8 axes: x-axis: Venue; y-axis: Rank; series: NID, Citations.]
Figure 8: Absolute ranks (based on citation count and NID) of the ToT papers among their contemporaries.

This paper proposed a novel concept, called 'Influence Dispersion Tree' (IDT), to explore and model the structural information among the follow-up (citing) papers of a given paper linked through citations. We derived several basic and advanced properties of an IDT to understand their relations with the raw citation count. One striking observation is that, with the increase in citation count, the depth of an IDT grows much more slowly than the breadth. However, as the citation count grows, the IDT of a paper moves closer to its ideal IDT configuration. We further proposed a series of metrics to quantify the notion of influence from the IDT. Our proposed metric NID turned out to be superior to the raw citation count in two respects: (i) predicting how many new citations a paper is going to receive within a certain time window after publication, and (ii) identifying and explaining why a paper is recognized by its research community (through various prestigious awards such as Test of Time awards) as highly influential among its contemporaries.

The conclusion we would like to draw from this paper is the following: to understand the contribution of a source paper to its own research field, along with the total number of follow-up papers of the source paper (i.e., citation count), one should also consider how these follow-up papers are organized among themselves through citations. A paper can be treated as highly influential only when it has enriched a field equally in both vertical (deepening the knowledge further inside the field) and horizontal (allowing the emergence of new sub-fields) directions.
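For the Test of Time experiment of Section 6.2, the rank of an awardee paper among its contemporaries and the Mean Reciprocal Rank over several awardees can be sketched as below. The function names and the toy scores are hypothetical and are not taken from the study; ties are broken by insertion order, which is an assumption.

```python
def rank_of(scores, target, higher_is_better=True):
    """1-based rank of `target` when papers are ordered by score."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return ordered.index(target) + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all awardee papers."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy candidate set for one venue and year (made-up numbers).
citations = {"P": 50, "Q": 80, "R": 20}
nid = {"P": 0.05, "Q": 0.20, "R": 0.40}

rank_cite = rank_of(citations, "P")                    # more citations is better
rank_nid = rank_of(nid, "P", higher_is_better=False)   # lower NID is better

toy_mrr = mean_reciprocal_rank([1, 2, 4])
```

Here the hypothetical awardee `P` is second by citations but first by NID, which mirrors the kind of case counted in the 12-out-of-40 observation above.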
ACKNOWLEDGEMENT
Part of the research was supported by the Ramanujan Fellowship, Early Career Research Award (SERB, DST), and the Infosys Centre for AI at IIITD.
REFERENCES
[1] Tomas C Almind and Peter Ingwersen. 1997. Informetric analyses on the world wide web: methodological approaches to 'webometrics'. Journal of Documentation 53, 4 (1997), 404–426.
[2] Frank M Bass. 1969. A new product growth for model consumer durables. Management Science 15, 5 (1969), 215–227.
[3] Johan Bollen and Herbert Van de Sompel. 2006. Mapping the structure of science through usage. Scientometrics 69, 2 (2006), 227–258.
[4] Lutz Bornmann, Rüdiger Mutz, Sven E Hug, and Hans-Dieter Daniel. 2011. A multilevel meta-analysis of studies reporting correlations between the h index and 37 different h index variants. Journal of Informetrics 5, 3 (2011), 346–359.
[5] Tanmoy Chakraborty, Suhansanu Kumar, Pawan Goyal, Niloy Ganguly, and Animesh Mukherjee. 2015. On the categorization of scientific citation profiles in computer science. Commun. ACM 58, 9 (2015), 82–90.
[6] Tanmoy Chakraborty and Subrata Nandi. 2018. Universal trajectories of scientific success. Knowledge and Information Systems 54, 2 (2018), 487–509.
[7] Tanmoy Chakraborty and Ramasuri Narayanam. 2016. All fingers are not equal: Intensity of references in scientific articles. In EMNLP. 1348–1358.
[8] Chaomei Chen. 2018. Cascading Citation Expansion. CoRR abs/1806.00089 (2018). http://arxiv.org/abs/1806.00089
[9] Chaomei Chen and Diana Hicks. 2004. Tracing knowledge diffusion. Scientometrics 59, 2 (2004), 199–211.
[10] Justin Cheng, Lada Adamic, P. Alex Dow, Jon Michael Kleinberg, and Jure Leskovec. 2014. Can Cascades Be Predicted?. In WWW. 925–936.
[11] James Connor. 2011. Google Scholar citations open to all. Google Inc. [cit. 2017/05/13]. Available from: https://scholar.googleblog.com/2011/11/google-scholar-citations-open-to-all.html (2011).
[12] Ying Ding, Erjia Yan, Arthur Frazho, and James Caverlee. 2009. PageRank for ranking authors in co-citation networks. Journal of the American Society for Information Science and Technology 60, 11 (2009), 2229–2243.
[13] Leo Egghe. 2006. An improvement of the h-index: The g-index. ISSI.
[14] Alan Fersht. 2009. The most influential journals: Impact Factor and Eigenfactor.
[15] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. 2010. A survey of graph edit distance. Pattern Analysis and Applications 13, 1 (2010), 113–129.
[16] Eugene Garfield. 1964. "Science Citation Index"-A New Dimension in Indexing. Science […]
[17] […] Science […]
[18] […] JAMA […]
[19] […] Management Science 62, 1 (2015), 180–196.
[20] Stefanie Haustein, Isabella Peters, Judit Bar-Ilan, Jason Priem, Hadas Shema, and Jens Terliesner. 2014. Coverage and adoption of altmetrics sources in the bibliometric community. Scientometrics […]
[21] […] Proceedings of the National Academy of Sciences […]
[22] […] Proceedings of the National Academy of Sciences […]
[23] […] Journal of Counseling Psychology 30, 4 (1983), 600.
[24] Yong Huang, Yi Bu, Ying Ding, and Wei Lu. 2018. Number versus structure: towards citing cascades. Scientometrics […]
[25] […] PLoS One 3, 7 (2008), e2778.
[26] Michael J Kurtz and Johan Bollen. 2011. Usage bibliometrics. arXiv preprint arXiv:1102.2891 (2011).
[27] Janet Lee, Kristin L Kraus, and William T Couldwell. 2009. Use of the h index in neurosurgery. Journal of Neurosurgery […]
[28] […] Foundations of Computational Mathematics 11, 4 (2011), 417–487.
[29] Chao Min, Ying Ding, Jiang Li, Yi Bu, Lei Pei, and Jianjun Sun. 2018. Innovation or imitation: The diffusion of citations. Journal of the Association for Information Science and Technology 69, 10 (2018), 1271–1282.
[30] Chao Min, Jianjun Sun, and Ying Ding. 2017. Quantifying the evolution of citation cascades. Proceedings of the Association for Information Science and Technology 54, 1 (2017), 761–763.
[31] Alex Post, Adam Y Li, Jennifer B Dai, Akbar Y Maniya, Syed Haider, Stanislaw Sobotka, and Tanvir F Choudhri. 2018. c-index and Subindices of the h-index: New Variants of the h-index to Account for Variations in Author Contribution. Cureus 10, 5 (2018).
[32] Dinesh Pradhan, Partha Sarathi Paul, Umesh Maheswari, Subrata Nandi, and Tanmoy Chakraborty. 2017. C3-index: a PageRank based multi-faceted metric for authors' performance measurement. Scientometrics […]
[33] Filippo Radicchi, Santo Fortunato, and Claudio Castellano. 2008. Universality of citation distributions: Toward an objective measure of scientific impact. PNAS […]
[34] […] PLoS One 12, 3 (2017), e0173152.
[35] Sidney Redner. 1998. How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B-Condensed Matter and Complex Systems 4, 2 (1998), 131–134.
[36] Andrej A Romanovsky. 2012. Revised h index for biomedical research.
[37] Mayank Singh, Vikas Patidar, Suhansanu Kumar, Tanmoy Chakraborty, Animesh Mukherjee, and Pawan Goyal. 2015. The role of citation context in predicting long-term citation profiles: An experimental study based on a massive bibliographic text dataset. In CIKM. ACM, 1271–1280.
[38] Mayank Singh, Rajdeep Sarkar, Pawan Goyal, Animesh Mukherjee, and Soumen Chakrabarti. 2017. Relay-Linking Models for Prominence and Obsolescence in Evolving Networks. In SIGKDD. 1077–1086.
[39] Dennis F Thompson, Erin C Callen, and Milap C Nahata. 2009. New indices in scholarship assessment. American Journal of Pharmaceutical Education 73, 6 (2009), 111.
[40] Michaël Charles Waumans and Hugues Bersini. 2016. Genealogical trees of scientific papers. PLoS One 11, 3 (2016), e0150588.
[41] Chun-Ting Zhang. 2009. The e-index, complementing the h-index for excess citations. PLoS One 4, 5 (2009), e5429.
[42] Xiaodan Zhu, Peter Turney, Daniel Lemire, and André Vellino. 2015. Measuring academic influence: Not all citations are equal.