Information Diffusion in Computer Science Citation Networks

Abstract

The paper citation network is a traditional social medium for the exchange of ideas and knowledge. In this paper we view citation networks from the perspective of information diffusion. We study the structural features of the information paths through the citation networks of publications in computer science, and analyze the impact of various citation choices on the subsequent impact of the article. We find that citing recent papers and papers within the same scholarly community garners a slightly larger number of citations on average. However, this correlation is weaker among well-cited papers implying that for high impact work citing within one's field is of lesser importance. We also study differences in information flow for specific subsets of citation networks: books versus conference and journal articles, different areas of computer science, and different time periods.


Xiaolin Shi

Dept. of EECS, University of Michigan, Ann Arbor, [email protected]

Belle Tseng

Yahoo Inc., 3420 Central Expwy, Santa Clara, [email protected]

Lada Adamic

School of Information, University of Michigan, Ann Arbor, [email protected]


Introduction

Information diffusion is the communication of knowledge over time among members of a social system. In order to analyze information diffusion, one needs to study the overall information flow and individual information cascades in the networks. Although much recent attention has been focused on new forms of collective content generation and filtering, such as blogs, wikis, and collaborative tagging systems, there is a well established social medium for aggregating and generating knowledge: published scholarly work. As researchers innovate, they not only publish new results, but also cite previous results and related work that their own innovations are based on. This creates a social ecology of knowledge, where information is shared and flows along co-authorship and citation ties.

In this paper, we examine information flow within and between different areas of computer science and its impact. Our basic assumption is that many citations are evidence of information flow from one article, and its authors, to another. In order to cite a paper, an author usually, though not always (Simkin and Roychowdhury 2005), reads the paper and acknowledges it as being relevant to the subject of their own paper, either by providing information that their work is built upon, or by providing information about related approaches to the same problem. Although not every citation represents the same level of engagement, citation networks provide some of the clearest evidence of information flow.

Our work has two primary goals: first, we are interested in observing the features of information flow in citation networks; and second, we want to know which of these features, such as time spans and community structure representing different fields of research, affect the information flow.

Studying citation networks has been the purview of the field of scientometrics, which aims to measure the impact of scholarly publications (Dieks and Chang 1976). Scientometric data has been available for several decades, and so it was already in the 1960s that de Solla Price first observed power laws in scientific citation networks and developed models of citation dynamics (de Solla Price 1965).

However, the recent emergence of online knowledge sharing has made it particularly easy to study information diffusion on a large scale. Studies of information cascades in blogs (Kumar et al. 2003; Adar et al. 2004; Leskovec et al. 2007), social bookmarking sites, and photo sharing have all revealed a highly skewed distribution in the attention a particular post, URL, news story (Lerman 2007), or photo (Lerman and Jones 2007) will receive. The attention may be measured through links or tags given to the items. In separate studies, it has been shown that such networks exhibit strong community structure (Tseng, Tatemura, and Wu 2005; Adamic and Glance 2005; Chin and Chignell 2006), where links or interactions occur more frequently within communities than between them.

The role of community structure in information diffusion has also been studied in scientific citation networks. It has been found that there is a longer delay for citations across disciplines than for ones within a discipline, implying that information is not only less likely to diffuse across community boundaries, but when it does, it will do so with a longer time delay (Rinia et al. 2001). Information flow between communities is such a relatively small proportion of total information flow that modeling citation networks without it provides realistic citation distributions and clustering coefficients (Börner 2004; Rosvall and Bergstrom 2008). The development of efficient network algorithms has led not just to discoveries of the overall properties of citation networks, but also to the detection of changes in citation patterns where a new trend or paradigm emerges (Leicht et al. 2007).

There has also been interest in visualizing and quantifying the amount of information flow between different areas in science (Boyack, Klavans, and Börner 2005), in effect mapping the generation of human knowledge through information flows. These maps leave open the question, however, of what happens once information has diffused across a community boundary: will it have the same impact as information diffusing within a community?

This is an interesting question, because recent empirical work (Guimera et al. 2005) has shown that new collaborations between experienced authors are more likely to result in a publication in a high impact journal than collaborations between unseasoned authors or repeat collaborations between the same two authors. The argument is that merging ideas and expertise in a novel way will produce higher impact work. But this work did not address whether the authors were from the same scientific communities or not, or whether the publications cited in the work stemmed from the same field. On the theoretical side, agent based models of innovation have shown that independent innovation within communities is important, so that the network as a whole does not converge on suboptimal solutions too quickly (Lazer and Friedman 2005).

In this paper, to answer the question of the impact of cross-community information flows in computer science, we make empirical observations of citations of computer science articles, focusing specifically on information flow across community boundaries and temporal gaps. In the following sections, we first describe the computer science publication data sets we used and the construction of the citation networks. We then examine the properties of the citation networks, and relate the properties of a citing link to the subsequent impact of the citing article.

Preliminaries

Definition of citation networks

Citation networks are networks of references between documents. In this paper, we focus on paper citation networks, which correspond to information diffusion in the corresponding research areas.

From the graph theoretic perspective, citation networks can be thought of as directed graphs with time stamps and community labels on each node:

• Nodes: publications;
• Edges: one paper citing another;
• Edge directions: in order to represent the direction of information flow, we denote the direction of edges from cited papers to citing papers;
• Time stamps: years in which the papers were published;
• Time spans: the time elapsed between the publication of the cited and citing paper;
• Community labels: we classify the papers into different research areas according to their venue information.

Information flows in citation networks can be interpreted as the scientific ideas and knowledge transmitted from publication to publication, which are explicitly indicated by citation relationships. Not all, or perhaps very little, information is preserved from cited to citing paper. Further, the information may be amended in the citing paper. Nevertheless, we assume that the cited paper informed the citing paper.

There are two common and significant features of any typical citation network: first, it is directed and almost acyclic; and second, when it evolves over time, only new nodes and edges are added, and none are removed (Leicht et al. 2007). The acyclic nature of the graph stems from the simple fact that, with very few exceptions, a paper will not cite a paper published in the future. Although publication delays may lead to such occurrences, most citations are limited to previously published work.
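The graph definition above can be sketched in a few lines of Python. This is only an illustration of the stated conventions (edges run from cited to citing paper, nodes carry a year and a community label); the class names, field names, and sample papers are hypothetical and do not come from the data sets.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """A node: one publication with a time stamp and a community label."""
    paper_id: str
    year: int
    community: str


@dataclass
class CitationGraph:
    papers: dict = field(default_factory=dict)     # paper_id -> Paper
    out_edges: dict = field(default_factory=dict)  # cited id -> set of citing ids
    in_edges: dict = field(default_factory=dict)   # citing id -> set of cited ids

    def add_paper(self, paper):
        self.papers[paper.paper_id] = paper
        self.out_edges.setdefault(paper.paper_id, set())
        self.in_edges.setdefault(paper.paper_id, set())

    def add_citation(self, citing_id, cited_id):
        # The edge points from the cited paper to the citing paper,
        # i.e. in the direction of information flow.
        self.out_edges[cited_id].add(citing_id)
        self.in_edges[citing_id].add(cited_id)

    def time_span(self, citing_id, cited_id):
        # Years elapsed between publication of the cited and the citing paper.
        return self.papers[citing_id].year - self.papers[cited_id].year


# Hypothetical toy example: C cites A and B, B cites A.
g = CitationGraph()
g.add_paper(Paper("A", 1990, "Theory"))
g.add_paper(Paper("B", 1995, "Data Mining"))
g.add_paper(Paper("C", 1998, "Data Mining"))
g.add_citation("B", "A")
g.add_citation("C", "A")
g.add_citation("C", "B")
```

Storing both edge directions costs extra memory but makes it cheap to follow information flow forward (who cites a paper) and to trace citations back (what a paper cites).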

Description of data sets

The datasets we study are two large digital libraries encompassing comprehensive scholarly articles primarily in computer science: the ACM data set and the CiteSeer data set (Giles 2004). In the ACM data set, there are several different types of publications, such as books, journal articles, conference proceeding papers, reports, and theses. Books alone account for 113,089 of the publications in the ACM data set. Both of the data sets have information about the publication dates and venues; however, some of the information is incomplete or inaccurate. Since our study considers the time evolution and community structure of the networks, we deleted the nodes with unresolved time or venue information.

While the ACM data set includes citations to publications outside of the data set, the CiteSeer data does not, and so we limit our analysis to citations between articles within each dataset. In addition, some citations between two articles that both reside in the same data set are missing, due to the difficulty in disambiguating and parsing citations from article text (Simkin and Roychowdhury 2005). Even with these limitations, we are left with 346,000 citations for the ACM dataset and 84,000 citations in the CiteSeer dataset, which we use to measure information flows between different computer science communities and the impact of a publication. We will discuss possible biases introduced by missing data below.

Even though we are analyzing two separate datasets, they overlap in subject area and time span. It is therefore reassuring that they have a significant, but relatively small, overlap in the articles that they contain. There are 613,444 proceedings or journal papers in the ACM dataset that we are studying, and 593,386 of them have distinct titles in the database; while there are 716,774 papers in the CiteSeer dataset, and 611,127 have distinct titles. By matching the titles and authors of the 593,386 papers in ACM and 611,127 papers in CiteSeer using a simple cosine similarity measure, we identify 122,978 (20%) papers that are present in both datasets. Finally, Table 1 gives summary statistics of the two data sets and the citation networks we will study.
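The text does not specify how the cosine similarity between records was computed. A minimal sketch, assuming a bag-of-words representation of lower-cased titles (the tokenizer and any matching threshold are assumptions, not the authors' procedure), could look like this:

```python
import math
import re
from collections import Counter


def tokens(text):
    """Bag of lower-cased alphanumeric tokens (an assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between the token-count vectors of two strings."""
    va, vb = tokens(a), tokens(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical titles: identical up to case, and completely disjoint.
same = cosine_similarity("Information Diffusion in Citation Networks",
                         "information diffusion in citation networks")
different = cosine_similarity("graph theory", "operating systems")
```

In practice two records would be declared the same paper when the similarity exceeds some cutoff; the cutoff used in the study is not given, so none is hard-coded here.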

Structural features of citation networks

Since the structural features of citation networks provide explicit evidence of information flow paths, our study of information diffusion starts with them.

ACM: http://portal.acm.org
CiteSeer: http://citeseer.ist.psu.edu

Table 1: Summary statistics of the two data sets and the citation networks.

            Orig.             With Publication Date            With Publication Venue
            Nodes   Edges     Nodes   Edges   Time range       Nodes   Edges   Communities
ACM         […]     […]       […]     […]     […]              […]     […]     […]
CiteSeer    […]     […]       […]     […]     […]              […]     […]     […]

Degree distributions

As stated before, we set the direction of an edge to reflect the direction of information flow. The in-degree is the number of papers cited in a given paper. In effect, it is the number of papers that may have influenced the paper at hand. The out-degree is the number of papers citing the given paper, reflecting the paper’s potential impact and influence.

In the previous section, we mentioned that there are different types of publications in ACM, including those with very high in-degrees, such as books. Since these comprehensive publications normally have many more references than regular papers, we show the degree distributions separately for books and papers in the ACM data set in Figure 1. The distribution of the length of the reference list for publications (their in-degree) is highly skewed; some publications reference from 10 to several hundred papers in the dataset, while many reference none or only a few. In actuality, many of these papers have longer lists of references, but these were not identified, or they fall outside of the dataset. The distributions of out-degrees are similarly skewed, an indication of a linear preferential attachment mechanism: already well cited papers are more easily discovered, and subsequently cited. This is the success-breeds-success phenomenon (Burrell 2003). As one might expect, both the in-degree distributions and out-degree distributions of books in the ACM dataset are significantly heavier tailed, with books both citing more and being cited more. It is therefore unsurprising that the in-degree and out-degree distributions of documents in CiteSeer are more similar to those of ACM papers, as opposed to books.

Figure 1: The degree distributions of the ACM and CiteSeer citation graphs. (a) In-degree distributions; (b) Out-degree distributions. Each panel plots the frequency of documents with in-degree (or out-degree) greater than K against K, for ACM books, ACM papers, and CiteSeer.
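As an illustration of the quantities plotted in Figure 1, the following sketch computes in- and out-degrees under the edge convention used here (edges run from cited to citing paper) and counts the documents whose degree exceeds K. The function names and the toy edge list are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """edges: iterable of (cited, citing) pairs.

    Returns (in_degree, out_degree) as Counters keyed by paper id.
    """
    in_deg, out_deg = Counter(), Counter()
    for cited, citing in edges:
        out_deg[cited] += 1   # the cited paper gains one citation
        in_deg[citing] += 1   # the citing paper gains one reference
    return in_deg, out_deg


def count_above(degree_of, nodes, k):
    """Number of nodes whose degree is strictly greater than k."""
    # A Counter returns 0 for papers that never appear in an edge.
    return sum(1 for n in nodes if degree_of[n] > k)


# Toy example: B and C cite A, C also cites B, D is isolated.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C")]
in_deg, out_deg = degrees(edges)
cited_at_least_once = count_above(out_deg, nodes, 0)
```

Evaluating `count_above` for increasing K gives the cumulative curves of the kind shown in the figure.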

Connectivity

In order to analyze how information flows through the citation networks, we first study their connectivity.

Two vertices A and B are said to be in the same strongly connected component if there exists a path both from A to B and from B to A. For both the ACM and CiteSeer citation graphs with correct time stamps, there are no significant strongly connected components. The absence of large strongly connected components is consistent with the fact that citation graphs are nearly perfect directed acyclic graphs (DAGs). However, in the ACM citation network, […] of the nodes (publications) are in the largest weakly connected component, in which there is a path between every pair of nodes in the undirected version of the graph. Similarly, […] of the nodes are in the largest weakly connected component of the CiteSeer citation network.

In the ACM dataset, 5.0% of the papers are published in the years 2004 and 2005. By tracing back their citations, we find that 48.8% of the papers published in previous years are either directly or indirectly cited by the papers in 2004 and 2005. In the CiteSeer data set, with 3.5% of papers published in the years 2003–2005, 28.9% of earlier papers are reachable by tracing back the citations. Although the two data sets differ in their cohesion (either due to completeness of data, or other issues of coverage), we observe that each subsequent generation of papers is tied directly or indirectly to a significant portion of the prior work.

Figure 2: Subgraphs of the 500 most cited papers and 500 random papers in the ACM citation network. (a) 500 most cited papers; (b) 500 random papers.

If we take the 500 most cited articles in the ACM network, we observe that there is a giant component linking a significant fraction of the most influential papers (see Figure 2 (a)). In contrast, if we were to select 500 random papers, there would be no giant component, as papers are rather unlikely to cite one another (Figure 2 (b)). This observation is consistent with many other networks where the most highly connected nodes tend to be connected to one another (Shi et al. 2008). It is also known as the rich-club phenomenon (Zhou and Mondragon 2004; Colizza et al. 2006). Yet it is still striking that such a small number of the most influential papers, out of tens of thousands in computer science, should be connected to one another.
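The two connectivity measurements in this section can be sketched as follows: the share of nodes in the largest weakly connected component, and the set of earlier papers reached by tracing citations back from a recent cohort. This is a minimal illustration with hypothetical function names and toy data, not the code used for the reported figures.

```python
from collections import defaultdict, deque


def largest_weak_component_fraction(nodes, edges):
    """Fraction of nodes in the largest component of the undirected graph."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search ignoring edge direction
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best / len(nodes)


def cited_directly_or_indirectly(recent, references):
    """Papers reachable by following reference lists back from `recent`.

    references: dict mapping a citing paper id to the ids it cites.
    """
    reached, stack = set(), list(recent)
    while stack:
        paper = stack.pop()
        for cited in references.get(paper, ()):
            if cited not in reached:
                reached.add(cited)
                stack.append(cited)
    return reached


# Toy example: chain A <- B <- C and a separate pair D <- E.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]          # (cited, citing)
references = {"B": ["A"], "C": ["B"], "E": ["D"]}      # citing -> cited
fraction = largest_weak_component_fraction(nodes, edges)
ancestors = cited_directly_or_indirectly(["C"], references)
```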

Average shortest directed path

The shortest paths in graphs (also termed geodesics) relate directly to the accessibility of information. Like many other complex networks, the citation networks we study exhibit the small world phenomenon. The average shortest directed path of the ACM graph is 7.60, and its longest geodesic is 32. Similar to the ACM citation network, CiteSeer has an average shortest directed path of 6.29 and its longest geodesic is 28. However, the pairs of nodes reachable via directed paths comprise […] of all possible paths (or node pairs) in the ACM citation network and […] of all possible paths in the CiteSeer citation network.

From the connectivity and shortest directed paths of the citation networks, we can see that, in spite of the largest weakly connected component occupying nearly the entirety of the network, the percentage of reachable pairs of nodes is smaller. But where paths do exist, we observe that the lengths of information flows are generally short. Note that this does not preclude the existence of more circuitous routes involving several papers; in our measurements, we only account for the shortest path. This form of “deep linking”, citing original articles, considerably shortens the path between articles.
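A small sketch of how the average shortest directed path, the longest geodesic, and the share of reachable pairs could be computed with one breadth-first search per node. Averaging over reachable ordered pairs only, and dividing by n(n-1) ordered pairs, are assumptions; the text does not state the exact normalization. On graphs of the size studied here, an all-sources search of this kind would be expensive and would likely need sampling or a more efficient method.

```python
from collections import deque


def directed_path_stats(nodes, successors):
    """Return (average geodesic, longest geodesic, fraction of reachable pairs).

    successors: dict mapping a paper id to the ids of papers citing it.
    """
    total_length, reachable, longest = 0, 0, 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # BFS gives shortest path lengths in unweighted graphs
            node = queue.popleft()
            for nxt in successors.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for target, d in dist.items():
            if target != source:
                total_length += d
                reachable += 1
                longest = max(longest, d)
    n = len(nodes)
    ordered_pairs = n * (n - 1)
    average = total_length / reachable if reachable else 0.0
    fraction = reachable / ordered_pairs if ordered_pairs else 0.0
    return average, longest, fraction


# Toy example: chain A -> B -> C plus an isolated node D.
average, longest, fraction = directed_path_stats(
    ["A", "B", "C", "D"], {"A": ["B"], "B": ["C"]}
)
```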

Sizes of information cascades

One may be interested not only in direct citations of a given paper, but also in subsequent citations of the citing papers, and so on. Figure 3 gives a simple example of an information cascade. An information cascade represents both the direct and indirect influence of a given publication, although the influence is diluted with each subsequent step. As each paper cites several others, it is difficult to attribute influence to any given chain of citations. It may be that the reason that A cites B is unrelated to the reason why B cited C.

Figure 3: Illustration of an information cascade with multiple temporal levels.

By taking every node in the citation graph as a root, we run a breadth first search along the out-going edges, and obtain an information cascade tree starting from every paper in the network. The distributions of sizes, depths and numbers of leaves of the information cascade trees are shown in Figure 4.

Figure 4: The distributions of sizes, depths and numbers of leaves of the information cascade trees in the ACM and CiteSeer citation graphs. (a) ACM; (b) CiteSeer. Each panel plots the percentage of information cascade trees with value greater than K against K.

From the figure, we see that all of the distributions of sizes, depths and numbers of leaves have a very sharp drop in the tail. These drop-offs naturally correspond to the limits of the data sets: there are no more than several hundreds of thousands of papers in each data set, and a cascade can only encompass papers appearing after the given paper. This is different from the power-law distributions of the cascade sizes in the blogosphere observed by Leskovec et al. (2007), with slightly different cascade definitions. The blog measurements were unaffected by size limitations in the data, as the largest cascade sizes encompassed no more than a few thousand posts out of millions that were observed.

The Spearman correlations of the cascade sizes, depths and numbers of leaves in the information cascades, as well as the out-degrees, are shown in Table 2. We see that, in both the ACM and CiteSeer citation graphs, the sizes and depths of the information cascade trees have large correlations, meaning that cascades that encompass several generations of scholarly work are also the largest.

            size & out-deg   size & depth   size & […]   depth & […]
ACM         .[…]***          .[…]***        .[…]***      .[…]***
CiteSeer    .[…]***          .[…]***        .[…]***      .[…]***

Table 2: Spearman correlations of the sizes, depths, numbers of leaves of information cascades and out-degrees of all papers in the two citation networks. ***, **, and * denote significance at the < […], < […] and ≥ […] levels respectively.

Information diffusion and the effects of citations

After examining the structural features and the information flow paths between papers in the citation networks, we turn to how information flows between communities, and how different types of citations (from and to various communities and citing old or new papers) would affect the subsequent information diffusion in citation networks.

Figure 5: Visualization of the matrices of community weights between different areas of computer science, for ACM and CiteSeer. Darker cells represent more frequent citation than expected if citation were at random, lighter ones depict less frequent citation.

Information flows between communities

We assign papers to communities according to their venues, using the classification system adopted by Microsoft’s Libra academic search service (http://libra.msra.cn). For example, a paper published in the KDD (Knowledge Discovery and Data Mining) Conference would be classified under “Data Mining”, while a paper published in the Journal of Information Processing and Management would be classified under “Information Retrieval”. Because of the incomplete and noisy information in the venues, we are able to classify about […] of the papers with about […] precision. With this community classification, there are about 205,000 within-community citations and 141,000 across-community citations in ACM, and about 42,000 each of within-community and across-community citations in CiteSeer.

In order to quantify the densities of information flow from community to community, we first count the number of citations between every pair of communities for each data set separately (e.g. the number of citations of Theory to Theory, Theory to Data Mining, etc.), and get a matrix $A$ with these numbers as its entries. We then compare the number of citations between any pair of communities relative to the rate of citation we would expect if the volume of inbound and outbound citations were the same, but the citations were allocated at random. We let $N_{ij}$ be the actual number of citations from $i$ to $j$, $N_{i\cdot} = \sum_j N_{ij}$ be the total number of citations from community $i$, $N_{\cdot j} = \sum_i N_{ij}$ be the total number of citations to community $j$, and $N = \sum_{ij} N_{ij}$ be the total number of citations in matrix $A$. Then the expected number of citations from community $i$ to community $j$, assuming indifference to one’s own field and others, is $E[N_{ij}] = N_{i\cdot} \times N_{\cdot j} / N$. We define the community weight as a z-score that tells us how many standard deviations above or below expected $N_{ij}$ is. Here we have the observation that $N \gg N_{i\cdot}$ and $N \gg N_{\cdot j}$, so we approximate the standard deviation by $\sqrt{E[N_{ij}]}$. In this way, for every entry, we get a normalized value, which we call the community weight:

$$W_{ij} = \left( N_{ij} - \frac{N_{i\cdot} \times N_{\cdot j}}{N} \right) \Big/ \sqrt{\frac{N_{i\cdot} \times N_{\cdot j}}{N}}$$

By visualizing the normalized matrix, i.e. the matrix of community weights, as in Figure 5, we can observe different densities of information flow amongst communities. For example, for each community, as expected, the majority of citations are within the community itself. However, there are some closely related communities. For example, there appears to be considerable information flow from Information Science to Information Retrieval, from Databases to Data Mining, from Information Retrieval to Data Mining and from Computer Vision to Computer Graphics. These flows reflect frequent citations by papers from the second community to those in the first. We also observe that the more theoretical areas such as Algorithms & Theory and Physics are less connected with others, while more applied areas, such as Data Mining, Information Retrieval, and Operating Systems, have more information flows to and from other areas.
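The community weight defined above is straightforward to compute from a matrix of citation counts. The sketch below follows the formula term by term; the function name and the 2×2 example counts are hypothetical, and setting the weight to 0 when the expected count is 0 is an added assumption for the degenerate case.

```python
import math


def community_weights(counts):
    """Z-score-like weights W[i][j] from a square matrix of citation counts.

    counts[i][j] is the number of citations from community i to community j.
    """
    n = len(counts)
    row = [sum(counts[i]) for i in range(n)]                        # N_i.
    col = [sum(counts[i][j] for i in range(n)) for j in range(n)]   # N_.j
    total = sum(row)                                                # N
    weights = [[0.0] * n for _ in range(n)]
    if total == 0:
        return weights
    for i in range(n):
        for j in range(n):
            expected = row[i] * col[j] / total   # E[N_ij]
            if expected > 0:
                # Standard deviation approximated by sqrt(E[N_ij]).
                weights[i][j] = (counts[i][j] - expected) / math.sqrt(expected)
    return weights


# Hypothetical counts for two communities that mostly cite themselves.
counts = [[90, 10],
          [20, 80]]
W = community_weights(counts)
```

With these counts the diagonal weights are positive and the off-diagonal weights negative, which corresponds to the darker diagonal cells in a visualization like Figure 5.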

Correlations of information diffusion and citation features

If we define information diffusion to occur when a paper is cited, then many factors affect such information diffusion. They include the popularity of the research field pertaining to the article in a certain period, the reputation of the authors, the specific innovation reported in the publication, etc. However, there is much we can surmise simply from the citation patterns, time lapses and community information. Specifically, we examine what kinds of citations would make the citing papers have greater impact, whether it is citing another paper in a related community with strong information flow, or the time elapsed since the publication of the cited paper.

As we have stated before, to measure the influence of a particular paper, both directly and indirectly influenced papers may need to be taken into consideration, possibly weighing them differently. However, for the sake of clarity of the model, and given the lack of consensus in the literature on a particular weighting scheme (Aksnes 2006), we use the number of citations a paper receives normalized by the average number of citations received by all papers in the same area and year (Valderas et al. 2007). This measure allows us to make a fair comparison between articles that may not have finished accumulating citations due to their recency, and to account for differences in the publication cycle for different areas (Stringer, Sales-Pardo, and Amaral 2008).
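The normalized impact measure described above can be sketched as follows: each paper's citation count is divided by the mean citation count of all papers sharing its area and year. The input layout, the function name, and the sample records are hypothetical; assigning 0 when a group's mean is 0 is an assumption for the degenerate case.

```python
from collections import defaultdict


def normalized_impact(papers):
    """papers: dict mapping paper id -> (area, year, citations received).

    Returns paper id -> citations divided by the mean for that area and year.
    """
    totals = defaultdict(lambda: [0, 0])  # (area, year) -> [citation sum, paper count]
    for area, year, cites in papers.values():
        totals[(area, year)][0] += cites
        totals[(area, year)][1] += 1
    result = {}
    for pid, (area, year, cites) in papers.items():
        cite_sum, count = totals[(area, year)]
        mean = cite_sum / count
        result[pid] = cites / mean if mean > 0 else 0.0
    return result


# Hypothetical records: two Theory papers and one Data Mining paper from 1995.
impact = normalized_impact({
    "p1": ("Theory", 1995, 10),
    "p2": ("Theory", 1995, 30),
    "p3": ("Data Mining", 1995, 2),
})
```

A value of 1.0 means the paper is cited exactly as often as the average paper of its area and year, which is what makes papers from different years and fields comparable.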

Citation networks for all of computer science

Since our study focuses mainly on the relationship between information flow and innovation, as opposed to summaries and reviews, we exclude publications that are book chapters and books, and focus on journal articles and papers published in conference proceedings. In the ACM dataset, the articles are already classified according to publication venue type, and so are easily filtered. In the CiteSeer dataset, we find that a majority of publications having 40 or more references tend to be review manuscripts. We exclude such publications from both data sets. Finally, we exclude papers published after 2000, because their recency means that they have not accumulated most of their citations (Stringer, Sales-Pardo, and Amaral 2008; Burrell 2003).

Table 3 shows the correlations between community weights and time lapse of the citing and cited paper, and the subsequent impact of the citing paper. From the table we see that for both citation networks, the weights of information flows between communities (i.e. the community weights) have positive correlations with the influence metric (normalized out-degrees). This means that, on average, a computer science paper will be rewarded for referencing other papers within its own community or proximate communities.

More recent papers have had an opportunity to cite more distant papers in time. Since pairs of citations are only recorded between papers in the dataset, older papers will have shorter recorded time lags to the papers they reference, since earlier referenced papers may not be included. The above is reflected in the correlation between the publication year of the citing paper and the time elapsed between the two papers (ρ = 0.[…], p < […]). More interestingly, there is a negative correlation between the time elapsed between the papers and the subsequent impact of the citing paper. Note that we are already normalizing by the average citation number of papers in a given year, so that older papers’ chance to accumulate more citations is not a factor. The negative correlation between citation time lag and impact could be interpreted as citing more recent work being rewarded by citations.

However, it is not uncommon to see some extremely innovative and influential work whose citations reach across communities, or draw upon older publications. The overall correlations only reflect the average trend. As we observed in Figure 1, a large proportion of the papers receives very few citations, while a few papers garner large numbers of them. We found interesting trends when, in addition to measuring the overall correlation for all papers, we computed separate correlations for the bottom 90% of the papers according to impact (denoted as ≤ 90% in Table 3) and the top 10% (> 90%).

What we can observe is that for less well cited papers, the correlations between impact and community information flow weight are positive, in agreement with the overall trend. This is where the majority of papers lie: they receive few citations and do not lead to large subsequent impact. However, for papers with high impact (dozens to hundreds of citations), the neutral correlations show that citing within one’s own community is less important.

Similar patterns are observed for time lags as well. The lower impact articles benefit from citing recent work; but for more influential papers, these correlations are reduced or absent. It may be that a truly innovative article draws upon work that had not been garnering much attention recently, and that is not tied to many other relevant publications. This would imply that the more innovative and more highly cited papers may cross boundaries where information normally does not flow.

Subnetworks of papers in different areas

We have seen how the weights of information flows between communities affect the subsequent impact of the citing papers in the overall citation networks of computer science. In this subsection, we investigate the correlations at a finer scope. We choose the areas whose papers constitute more than 5% of the total number of papers with community information in both data sets, such as Theory, Distributed and Parallel Computing, Software Engineering, etc. We consider these papers and their sets of references. Again, we compute the correlations of the community weights of those citation edges and the normalized out-degrees of these citing papers. The correlations are given in Table 4. The results show that the correlations are mostly positive or neutral in three areas, the exception being papers in theory and algorithms in both ACM and CiteSeer. This implies that information diffusion has a different impact on publications in theoretical computer science and applied computer science. In future work we would like to probe these differences in information diffusion for various fields further.
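The correlations in Tables 3 and 4 are Spearman rank correlations. As a reminder of what that computes, here is a minimal pure-Python sketch: values are replaced by their ranks (ties receive the average rank) and the Pearson correlation of the ranks is taken. The function names and sample vectors are hypothetical, and no significance test is included.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation of two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical community weights and normalized out-degrees for four edges.
rho = spearman([0.5, 1.2, 3.0, 4.1], [0.1, 0.4, 0.9, 2.5])
```

Because only ranks matter, the measure is insensitive to the heavy-tailed scale of citation counts, which is one reason a rank correlation suits data of this kind.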

Subnetworks of papers in different time periods

Instead of grouping papers according to their areas, we can also group them by publication date. In order to reduce the noise introduced by the incompleteness and sparsity of the data sets, we only choose papers in the following four time periods for both the ACM and CiteSeer data sets: 1980–1984, 1985–1989, 1990–1994, and 1995–1999.

                   ACM                               CiteSeer
                   Overall    ≤ 90%      > 90%       Overall    ≤ 90%      > 90%
time difference    − . ∗∗∗    − . ∗∗∗    . ∗         − . ∗∗∗    − . ∗∗∗    . ∗
c-weight           . ∗∗∗      . ∗∗∗      . ∗         . ∗∗∗      . ∗∗∗      . ∗

Table 3: Spearman correlations show the effects of community weights and time differences between the cited and citing papers on the subsequent impacts of citing papers.

Venue                      Dataset     Percent.    Correlation
Theory & Algorithms        ACM         32.92%      − . ∗∗∗
                           CiteSeer    17.24%      − . ∗∗∗
Distributed Computing      ACM         6.39%       . ∗
                           CiteSeer    10.45%      − . ∗
Artificial Intelligence    ACM         5.24%       . ∗
                           CiteSeer    8.69%       . ∗∗∗
Software Engineering       ACM         19.54%      . ∗∗∗
                           CiteSeer    19.37%      . ∗∗∗

Table 4: Correlations between community weights and normalized out-degrees of citing papers grouped by different communities.

After grouping the papers according to publication date, as before, we select the edges with destination papers in the chosen set of papers (e.g. published between 1990 and 1994). We use Pearson correlations of community weights on the citation edges and the logarithm of normalized out-degrees of the destination papers, which are shown in Figure 6. Although ACM and CiteSeer have different ranges of confidence intervals for the correlations, the trends of the two sets of correlations are consistent: they are slightly increasing as the time periods grow more recent. Perhaps with the research areas getting finer and deeper, it may be more difficult for researchers to keep up with, understand and cite papers in areas far from their own. At the same time, their own communities have grown and diversified to incorporate information flows from other areas, so that citing within one's area may provide adequate diversity. However, these are only speculations as to the underlying reasons why citing within one's area would be of greater benefit to more recent papers.
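
As a rough illustration of the per-period computation, the sketch below buckets citation edges by the publication year of the destination paper and computes a Pearson correlation between community weight and the logarithm of normalized out-degree within each bucket. The tuple layout `(year, c_weight, normalized_out_degree)`, the function names, and the decision to skip edges whose destination paper has zero out-degree (where the logarithm is undefined) are assumptions made for this example; the text does not say how such cases were handled.

```python
from math import log, sqrt

# The four publication windows named in the text.
PERIODS = [(1980, 1984), (1985, 1989), (1990, 1994), (1995, 1999)]


def pearson(xs, ys):
    """Pearson correlation; NaN when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def period_of(year):
    """Return the (start, end) window containing year, or None if outside all."""
    for start, end in PERIODS:
        if start <= year <= end:
            return (start, end)
    return None


def correlations_by_period(edges):
    """edges: iterable of (year, c_weight, normalized_out_degree) per citation edge,
    where year and out-degree belong to the destination (citing) paper."""
    buckets = {period: ([], []) for period in PERIODS}
    for year, c_weight, out_degree in edges:
        period = period_of(year)
        # Assumption: drop edges outside the windows or with undefined log.
        if period is None or out_degree <= 0:
            continue
        weights, log_degrees = buckets[period]
        weights.append(c_weight)
        log_degrees.append(log(out_degree))
    # Periods with fewer than two usable edges are omitted from the result.
    return {
        period: pearson(ws, ds)
        for period, (ws, ds) in buckets.items()
        if len(ws) >= 2
    }
```

Comparing the returned values across the four windows gives the kind of trend plotted in Figure 6; confidence intervals would need an additional step that is not shown here.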

Subnetworks of papers versus books

We consider one final subset of the citation graph, that of books. As we mentioned before, in the ACM dataset, there are documents labeled as books or book chapters. We select these documents, and study how their citation patterns may be different from those of journal research articles or conference proceedings. Since the datasets did not map books to different fields of computer science, we just consider the raw out-degree of books as the measure of impact and focus on the time elapsed between the publication of the book and the work it cites. We consider citations from books or book chapters to any type of publication, including papers in journals and conference proceedings. Because books have longer reference lists (see Figure 1(a)), any single citation is less likely to have a strong effect on the impact of the citing article. Indeed, we find that the correlation of time spans and the out-degrees of books is − . ∗ . This is lower than the corresponding correlation between out-degrees and time spans for papers in the same dataset, which is − . ∗∗∗ . This trend is also consistent with the fact that books are expected to cover a substantial amount of material, which may necessitate citing earlier publications. On the other hand, papers may just need to cite the most recent work they are building upon.

Figure 6: Correlations between community weights and normalized out-degrees of citing papers, grouped by different time periods. (Axes: Time Period, Correlation; series: ACM, CiteSeer.)

Conclusions and future work

We analyzed a very old, regimented, and established social medium for knowledge sharing in order to discover patterns of information flow with respect to community structure. Consistent with prior results, we find a wide range in the impact individual publications have. Information cascades, encompassing all chains of citations resulting from a single paper, vary dramatically in size, and only a small proportion of paper pairs are linked via cascades. In contrast, the most influential papers are surprisingly interlinked. Many publications go mostly unnoticed, while some garner considerable attention. There are interesting factors, relating to the citation graph, that correlate with the popularity a given publication will enjoy.

Our particular interest is in the impact of a particular citation on the success of the citing article. Through intensive study of two data sets of computer science publications, ACM and CiteSeer, we find that citations that occur within communities lead to a slightly higher number of direct citations, and that citing more recent papers corresponds to receiving more citations in turn. However, our most interesting finding is that for the most influential group of papers, this relationship was reduced or absent, allowing for the possibility that ideas across communities can lead to higher impact work. Finally, we find that the effect of recency and community on citation structure differs among different areas of computer science and among different time periods.

In future work, we would like to expand our study to several additional contexts, including patent citation networks and paper citation networks of various scientific areas, in which the effect of boundary spanning information flows would be investigated. We would also like to extend our analysis to blogs, whose strong community structure has been observed, along with observations of information cascades, but little is known about the effect of this community structure on diffusion properties. However, the sparseness of citation data for blogs, and the loose relationship between them, will present additional challenges. We would also like to extend our study using textual analysis, to map specific ideas that are spreading through the citation network. In doing so, we could identify the points at which a particular idea has crossed a community boundary, and measure whether this occasionally leads to large information cascades.

Acknowledgements

We would like to thank Eytan Bakshy for helpful comments and suggestions. This research was supported in part by a grant from NEC.
