Information Diffusion in Computer Science Citation Networks
Xiaolin Shi
Dept. of EECS, University of Michigan, Ann Arbor
Belle Tseng
Yahoo Inc., 3420 Central Expwy, Santa Clara
Lada Adamic
School of Information, University of Michigan, Ann Arbor
Abstract
The paper citation network is a traditional social medium for the exchange of ideas and knowledge. In this paper we view citation networks from the perspective of information diffusion. We study the structural features of the information paths through the citation networks of publications in computer science, and analyze how various citation choices affect the subsequent impact of the article. We find that citing recent papers and papers within the same scholarly community garners a slightly larger number of citations on average. However, this correlation is weaker among well-cited papers, implying that for high impact work citing within one’s field is of lesser importance. We also study differences in information flow for specific subsets of citation networks: books versus conference and journal articles, different areas of computer science, and different time periods.
Introduction
Information diffusion is the communication of knowledge over time among members of a social system. In order to analyze information diffusion, one needs to study the overall information flow and individual information cascades in the networks. Although much recent attention has been focused on new forms of collective content generation and filtering, such as blogs, wikis, and collaborative tagging systems, there is a well established social medium for aggregating and generating knowledge: published scholarly work. As researchers innovate, they not only publish new results, but also cite previous results and related work that their own innovations are based on. This creates a social ecology of knowledge, where information is shared and flows along co-authorship and citation ties.
In this paper, we examine information flow within and between different areas of computer science and its impact. Our basic assumption is that many citations are evidence of information flow from one article, and its authors, to another. In order to cite a paper, an author usually, though not always (Simkin and Roychowdhury 2005), reads the paper and acknowledges it as being relevant to the subject of their own paper, either by providing information that their work is built upon, or by providing information about related approaches to the same problem. Although not every citation represents the same level of engagement, citation networks provide some of the clearest evidence of information flow.
Our work has two primary goals: first, we are interested in observing the features of information flow in citation networks; and second, we want to know which of these features, such as time spans and community structure representing different fields of research, affect the information flow.
Studying citation networks has been the purview of the field of scientometrics, which aims to measure the impact of scholarly publications (Dieks and Chang 1976). Scientometric data has been available for several decades; as early as the 1960s, de Solla Price observed power laws in scientific citation networks and developed models of citation dynamics (de Solla Price 1965).
However, the recent emergence of online knowledge sharing has made it particularly easy to study information diffusion on a large scale. Studies of information cascades in blogs (Kumar et al. 2003; Adar et al. 2004; Leskovec et al. 2007), social bookmarking sites, and photo sharing have all revealed a highly skewed distribution in the attention a particular post, URL, news story (Lerman 2007), or photo (Lerman and Jones 2007) will receive. The attention may be measured through links or tags given to the items. In separate studies, it has been shown that such networks exhibit strong community structure (Tseng, Tatemura, and Wu 2005; Adamic and Glance 2005; Chin and Chignell 2006), where links or interactions occur more frequently within communities than between them.
The role of community structure in information diffusion has also been studied in scientific citation networks. It has been found that there is a longer delay for citations across disciplines than for ones within a discipline, implying that information is not only less likely to diffuse across community boundaries, but when it does, it will do so with a longer time delay (Rinia et al. 2001). Information flow between communities is such a small proportion of total information flow that modeling citation networks without it provides realistic citation distributions and clustering coefficients (Börner 2004; Rosvall and Bergstrom 2008). The development of efficient network algorithms has led not just to discoveries of the overall properties of citation networks, but also to the detection of changes in citation patterns where a new trend or paradigm emerges (Leicht et al. 2007).
There has also been interest in visualizing and quantifying the amount of information flow between different areas in science (Boyack, Klavans, and Börner 2005), in effect mapping the generation of human knowledge through information flows. These maps leave open the question, however, of what happens once information has diffused across a community boundary: will it have the same impact as information diffusing within a community?
This is an interesting question, because recent empirical work (Guimera et al. 2005) has shown that new collaborations between experienced authors are more likely to result in a publication in a high impact journal than collaborations between unseasoned authors or repeat collaborations between the same two authors. The argument is that merging ideas and expertise in a novel way will produce higher impact work. But this work did not address whether the authors were from the same scientific communities or not, or whether the publications cited in the work stemmed from the same field. On the theoretical side, agent based models of innovation have shown that independent innovation within communities is important, so that the network as a whole does not converge on suboptimal solutions too quickly (Lazer and Friedman 2005).
In this paper, to answer the question of the impact of cross-community information flows in computer science, we make empirical observations of citations of computer science articles, focusing specifically on information flow across community boundaries and temporal gaps. In the following sections, we first describe the computer science publication data sets we used and the construction of the citation networks. We then examine the properties of the citation networks, and relate the properties of a citing link to the subsequent impact of the citing article.
Preliminaries
Definition of citation networks
Citation networks are networks of references between documents. In this paper, we focus on paper citation networks, which reflect information diffusion in the corresponding research areas.
From the graph theoretic perspective, citation networks can be thought of as directed graphs with time stamps and community labels on each node:
• Nodes: publications;
• Edges: one paper citing another;
• Edge directions: in order to represent the direction of information flow, we denote the direction of edges from cited papers to citing papers;
• Time stamps: years in which the papers were published;
• Time spans: the time elapsed between the publication of the cited and citing paper;
• Community labels: we classify the papers into different research areas according to their venue information.
Information flows in citation networks can be interpreted as the scientific ideas and knowledge transmitted from publication to publication, which are explicitly indicated by citation relationships. Not all of the information, and perhaps very little of it, is preserved from cited to citing paper. Further, the information may be amended in the citing paper. Nevertheless, we assume that the cited paper informed the citing paper.
There are two common and significant features of any typical citation network: first, it is directed and almost acyclic; and second, when it evolves over time, only new nodes and edges are added, and none are removed (Leicht et al. 2007). The acyclic nature of the graph stems from the simple fact that, with very few exceptions, a paper will not cite a paper published in the future. Although publication delays may lead to such occurrences, most citations are limited to previously published work.
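The graph definition above can be sketched as a small data structure. This is only an illustration under assumed names (`CitationGraph`, `add_paper`, `add_citation`, and `time_span` are hypothetical and not taken from the paper); it stores edges in the direction of information flow, from cited paper to citing paper.

```python
from dataclasses import dataclass, field


@dataclass
class CitationGraph:
    """Directed graph whose edges point from the cited paper to the citing paper."""
    year: dict = field(default_factory=dict)       # paper id -> publication year
    community: dict = field(default_factory=dict)  # paper id -> research area label
    citing: dict = field(default_factory=dict)     # cited id -> set of citing ids (out-edges)
    cited: dict = field(default_factory=dict)      # citing id -> set of cited ids (in-edges)

    def add_paper(self, pid, year, community=None):
        self.year[pid] = year
        self.community[pid] = community
        self.citing.setdefault(pid, set())
        self.cited.setdefault(pid, set())

    def add_citation(self, citing_id, cited_id):
        # Information flows from the cited paper to the citing paper.
        self.citing[cited_id].add(citing_id)
        self.cited[citing_id].add(cited_id)

    def time_span(self, citing_id, cited_id):
        # Years elapsed between the cited and the citing publication.
        return self.year[citing_id] - self.year[cited_id]


g = CitationGraph()
g.add_paper("A", 1990, "Theory")
g.add_paper("B", 1995, "Data Mining")
g.add_citation("B", "A")  # B cites A, so the edge is A -> B
```

Keeping both adjacency maps makes in-degree (length of the reference list) and out-degree (number of citations received) equally cheap to look up.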
Description of data sets
The datasets we study are two large digital libraries encompassing comprehensive scholarly articles primarily in computer science: the ACM data set and the CiteSeer data set (Giles 2004). In the ACM data set, there are several different types of publications, such as books, journal articles, conference proceeding papers, reports, and theses. Books alone account for 113,089 of the publications in the ACM dataset. Both of the data sets have information about the publication dates and venues; however, some of the information is incomplete or inaccurate. Since our study considers the time evolution and community structure of the networks, we deleted the nodes with unresolved time or venue information.
While the ACM data set includes citations to publications outside of the data set, the CiteSeer data does not, and so we limit our analysis to citations between articles within each dataset. In addition, some citations between two articles that both reside in the same data set are missing, due to the difficulty in disambiguating and parsing citations from article text (Simkin and Roychowdhury 2005). Even with these limitations, we are left with 346,000 citations for the ACM dataset and 84,000 citations in the CiteSeer dataset, which we use to measure information flows between different computer science communities and the impact of a publication. We will discuss possible biases introduced by missing data below.
Even though we are analyzing two separate datasets, they overlap in subject area and time span. It is therefore reassuring that they have a significant, but relatively small, overlap in the articles that they contain. There are 613,444 proceedings or journal papers in the ACM dataset that we are studying, and 593,386 of them have distinct titles in the database; while there are 716,774 papers in the CiteSeer dataset, and 611,127 have distinct titles. By matching the titles and authors of the 593,386 papers in ACM and 611,127 papers in CiteSeer using a simple cosine similarity measure, we identify 122,978 (20%) papers that are present in both datasets. Finally, Table 1 gives summary statistics of the two data sets and the citation networks we will study.
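The text does not specify how the cosine similarity measure was computed. The following is a minimal sketch, assuming a bag-of-words representation of the concatenated title and author strings; the function names and the 0.9 threshold are hypothetical choices for illustration only.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-cased bag of alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = tokens(a), tokens(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def same_paper(rec_a, rec_b, threshold=0.9):
    """rec_a, rec_b: (title, authors) string pairs; threshold is an assumed cut-off."""
    return cosine(" ".join(rec_a), " ".join(rec_b)) >= threshold
```

Comparing every ACM record with every CiteSeer record would be quadratic; in practice some blocking step (for example, grouping by a shared title word) would be needed, which this sketch leaves out.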
Structural features of citation networks
Since the structural features of citation networks provide explicit evidence of information flow paths, our study of information diffusion starts with them.
Data set sources: ACM, http://portal.acm.org; CiteSeer, http://citeseer.ist.psu.edu
Table 1: Summary statistics of the ACM and CiteSeer data sets and citation networks. Columns: original nodes and edges; nodes, edges, and time range with publication date; nodes, edges, and communities with publication venue. Rows: ACM, CiteSeer.
Degree distributions
As stated before, we set the direction of an edge to reflect the direction of information flow. The in-degree is the number of papers cited in a given paper. In effect, it is the number of papers that may have influenced the paper at hand. The out-degree is the number of papers citing the given paper, reflecting the paper’s potential impact and influence.
In the previous section, we mentioned that there are different types of publications in ACM, including those with very high in-degrees, such as books. Since these comprehensive publications normally have many more references than regular papers, we show the degree distributions separately for books and papers in the ACM data set in Figure 1. The distribution of the length of the reference list for publications (their in-degree) is highly skewed; some publications reference from 10 to several hundred papers in the dataset, while many reference few or none. In actuality, many of these papers have longer lists of references, but these were not identified, or they fall outside of the dataset. The distributions of out-degrees are similarly skewed, an indication of a linear preferential attachment mechanism: already well cited papers are more easily discovered, and subsequently cited. This is the success-breeds-success phenomenon (Burrell 2003). As one might expect, both the in-degree distributions and out-degree distributions of books in the ACM dataset are significantly heavier tailed, with books both citing more and being cited more. It is therefore unsurprising that the in-degree and out-degree distributions of documents in CiteSeer are more similar to those of ACM papers, as opposed to books.
Figure 1: The degree distributions of the ACM and CiteSeer citation graphs. (a) In-degree distributions: in-degree K versus frequency of documents with in-degrees > K. (b) Out-degree distributions: out-degree K versus frequency of documents with out-degrees > K. Series: ACM book, ACM paper, CiteSeer.
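The cumulative counts plotted in Figure 1 (documents with degree greater than K) can be computed directly from an edge list. The sketch below assumes edges are given as (cited, citing) pairs, matching the direction of information flow used in this paper; the function names are illustrative.

```python
from collections import Counter


def degrees(edges):
    """edges: iterable of (cited, citing) pairs, i.e. the direction of information flow."""
    in_deg, out_deg = Counter(), Counter()
    for cited, citing in edges:
        out_deg[cited] += 1   # the cited paper receives one more citation
        in_deg[citing] += 1   # the citing paper lists one more reference
    return in_deg, out_deg


def count_above(deg, nodes, k):
    """Number of nodes whose degree is strictly greater than k."""
    return sum(1 for n in nodes if deg[n] > k)


edges = [("A", "B"), ("A", "C"), ("B", "C")]
in_deg, out_deg = degrees(edges)
papers = ["A", "B", "C"]
cumulative_out = [count_above(out_deg, papers, k) for k in range(3)]
```

Passing the full node list to `count_above` matters because papers with no recorded citations or references do not appear in the counters but still belong in the distribution.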
Connectivity
In order to analyze how information flows through the citation networks, we first study their connectivity.
Two vertices A and B are said to be in the same strongly connected component if there exists a path both from A to B and from B to A. For both the ACM and CiteSeer citation graphs with correct time stamps, there are no significant strongly connected components. The absence of large strongly connected components is consistent with the fact that citation graphs are nearly perfect directed acyclic graphs (DAGs). However, . of the nodes (publications) in the ACM citation network are in the largest weakly connected component, in which there is a path between every pair of nodes in the undirected version of the graph. Similarly, . of the nodes are in the largest weakly connected component of the CiteSeer citation network.
In the ACM dataset, 5.0% of the papers are published in the years 2004 and 2005. By tracing back their citations, we find that 48.8% of the papers published in previous years are either directly or indirectly cited by the papers in 2004 and 2005. In the CiteSeer data set, with 3.5% of the papers published in the years 2003 - 2005, 28.9% of earlier papers are reachable by tracing back the citations. Although the two data sets differ in their cohesion (either due to completeness of data, or other issues of coverage), we observe that each subsequent generation of papers is tied directly or indirectly to a significant portion of the prior work.
Figure 2: Subgraphs of the 500 most cited papers and 500 random papers in the ACM citation network. (a) 500 most cited papers (b) 500 random papers
If we take the 500 most cited articles in the ACM network, we observe that there is a giant component linking a significant fraction of the most influential papers (see Figure 2 (a)). In contrast, if we were to select 500 random papers, there would be no giant component, as papers are rather unlikely to cite one another (Figure 2 (b)). This observation is consistent with many other networks where the most highly connected nodes tend to be connected to one another (Shi et al. 2008). It is also known as the rich-club phenomenon (Zhou and Mondragon 2004; Colizza et al. 2006). Yet it is still striking that such a small number of the most influential papers, out of tens of thousands in computer science, should be linked to one another, either directly or through other papers in the same set.
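The "tracing back" measurement in this section (the share of earlier papers cited directly or indirectly by a recent cohort) can be sketched as a breadth-first search over reference lists. The input format and the function name below are assumptions for illustration, not the authors' code.

```python
from collections import deque


def fraction_reached(references, year, recent_years):
    """references: paper id -> iterable of cited paper ids; year: paper id -> int.

    Returns the fraction of papers published before the recent cohort that are
    reachable by following references backwards from the cohort.
    """
    recent = {p for p, y in year.items() if y in recent_years}
    earlier = {p for p, y in year.items() if y < min(recent_years)}
    seen, queue = set(recent), deque(recent)
    while queue:
        p = queue.popleft()
        for q in references.get(p, ()):
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return len(seen & earlier) / len(earlier) if earlier else 0.0


year = {"A": 2000, "B": 2002, "C": 2004, "D": 2001}
references = {"C": ["B"], "B": ["A"]}
share = fraction_reached(references, year, {2004, 2005})
```

In this toy input, A and B are reached from the 2004 paper C while D is not, so two of the three earlier papers are covered.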
Average shortest directed path
The shortest paths in graphs (also termed geodesics) relate directly to the accessibility of information. Like many other complex networks, the citation networks we study exhibit the small world phenomenon. The average shortest directed path length of the ACM graph is 7.60, and its longest geodesic is 32. Similar to the ACM citation network, CiteSeer has an average shortest directed path length of 6.29 and its longest geodesic is 28. However, the reachable pairs of nodes via directed paths in the ACM citation network comprise . of all possible paths (or node pairs) and . of all possible paths in the CiteSeer citation network.
From the connectivity and shortest directed paths of the citation networks, we can see that, in spite of the largest weakly connected component occupying nearly the entirety of the network, the percentage of reachable pairs of nodes is smaller. But where paths do exist, we observe that the lengths of information flows are generally short. Note that this does not preclude more circuitous routes involving several papers; in our measurements, we only account for the shortest path. This form of “deep linking”, citing original articles, considerably shortens the path between articles.
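The three quantities discussed above (average shortest directed path, longest geodesic, and share of reachable ordered pairs) can all be obtained from one breadth-first search per node. This is a sketch with assumed names and input format, suitable only for small graphs.

```python
from collections import deque


def directed_path_stats(successors):
    """successors: node -> iterable of nodes it points to.

    Returns (average shortest path length over reachable ordered pairs,
             longest geodesic, fraction of ordered pairs that are reachable).
    """
    nodes = set(successors) | {v for vs in successors.values() for v in vs}
    total = longest = pairs = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in successors.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                total += d
                pairs += 1
                longest = max(longest, d)
    possible = len(nodes) * (len(nodes) - 1)
    average = total / pairs if pairs else 0.0
    reachable = pairs / possible if possible else 0.0
    return average, longest, reachable


avg_len, diameter, reachable = directed_path_stats({"A": ["B"], "B": ["C"]})
```

Running a search from every node costs on the order of nodes times edges, so for networks with hundreds of thousands of papers one would likely sample source nodes rather than enumerate them all.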
Sizes of information cascades
One may be interested not only in direct citations of a given paper, but also in subsequent citations of the citing papers, and so on. Figure 3 gives a simple example of an information cascade. An information cascade represents both the direct and indirect influence of a given publication, although the influence is diluted with each subsequent step. As each paper cites several others, it is difficult to attribute influence to any given chain of citations. It may be that the reason that A cites B is unrelated to the reason why B cited C.
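A cascade of direct and indirect citations can be collected with a breadth-first search along citing edges, recording the size, depth, and number of leaves of the resulting tree. The sketch below counts the root paper as part of the cascade, which is an assumption; the names are illustrative.

```python
from collections import deque


def cascade(citing, root):
    """citing: paper id -> iterable of papers that cite it.

    Returns (size, depth, leaves) of the breadth-first tree rooted at `root`.
    The root itself is counted in the size (an assumption of this sketch).
    """
    depth_of = {root: 0}
    children = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in citing.get(u, ()):
            if v not in depth_of:  # each paper joins the tree once, at its first discovery
                depth_of[v] = depth_of[u] + 1
                children[v] = 0
                children[u] += 1
                queue.append(v)
    size = len(depth_of)
    depth = max(depth_of.values())
    leaves = sum(1 for c in children.values() if c == 0)
    return size, depth, leaves


citing = {"C": ["B", "D"], "B": ["A"]}
stats_c = cascade(citing, "C")
stats_a = cascade(citing, "A")
```

Because a paper is attached to the first parent that reaches it, the tree is one of several possible spanning trees of the cascade; size and depth do not depend on that choice, but the leaf count can.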
Figure 3: Illustration of an information cascade with multiple temporal levels.
By taking every node in the citation graph as a root, we run a breadth first search along the out-going edges, and obtain an information cascade tree starting from every paper in the network. The distributions of sizes, depths and numbers of leaves of the information cascade trees are shown in Figure 4.
Figure 4: The distributions of sizes, depths and numbers of leaves of the information cascade trees in ACM and CiteSeer citation graphs. (a) ACM (b) CiteSeer. Vertical axis: percentage of information cascade trees with nodes > K; series: size, depth.
From the figure, we see that all of the distributions of sizes, depths and numbers of leaves have a very sharp drop in the tail. These drop-offs naturally correspond to the limits of the data sets: there are no more than several hundreds of thousands of papers in each data set, and a cascade can only encompass papers appearing after the given paper. This is different from the power-law distributions of the cascade sizes in the blogosphere observed by Leskovec et al. (2007), with slightly different cascade definitions. The blog measurements were unaffected by size limitations in the data, as the largest cascade sizes encompassed no more than a few thousand posts out of millions that were observed.
The Spearman correlations of the cascade sizes, depths and numbers of leaves in the information cascades, as well as the out-degrees, are shown in Table 2. We see that, in both the ACM and CiteSeer citation graphs, the sizes and depths of the information cascade trees have large correlations, meaning that cascades that encompass several generations of scholarly work are also the largest.
            size &     size &     size &     depth &
            out-deg    depth
ACM         . ∗∗∗      . ∗∗∗      . ∗∗∗      . ∗∗∗
CiteSeer    . ∗∗∗      . ∗∗∗      . ∗∗∗      . ∗∗∗
Table 2: Spearman correlations of the sizes, depths, numbers of leaves of information cascades and out-degrees of all papers in the two citation networks. ∗∗∗, ∗∗, and ∗ denote significance at the < . , < . and ≥ . levels respectively.
Information diffusion and the effects of citations
After examining the structural features and the information flow paths between papers in the citation networks, we turn to how information flows between communities, and how different types of citations (from and to various communities and citing old or new papers) would affect the subsequent information diffusion in citation networks.
Figure 5: Visualization of the matrices of community weights between different areas of computer science (panels: ACM, CiteSeer). Darker cells represent more frequent citation than expected if citation were at random, lighter ones depict less frequent citation.
Information flows between communities
We assign papers to communities according to their venues, using the classification system adopted by Microsoft’s Libra academic search service (http://libra.msra.cn). For example, a paper published in the KDD (Knowledge Discovery and Data Mining) Conference would be classified under “Data Mining”, while a paper published in the Journal of Information Processing and Management would be classified under “Information Retrieval”. Because of the incomplete and noisy information in the venues, we are able to classify about / of the papers with about − precision. With this community classification, there are about 205,000 within community citations and 141,000 across community citations in ACM, and about 42,000 of each kind in CiteSeer.
In order to quantify the densities of information flow from community to community, we first count the number of citations between every pair of communities for each data set separately (e.g. the number of citations of Theory to Theory, Theory to Data Mining, etc.), and get a matrix A with these numbers as its entries. We then compare the number of citations between any pair of communities relative to the rate of citation we would expect if the volume of inbound and outbound citations were the same, but the citations were allocated at random. We let $N_{ij}$ be the actual number of citations from $i$ to $j$, $N_{i\cdot} = \sum_j N_{ij}$ be the total number of citations from community $i$, $N_{\cdot j} = \sum_i N_{ij}$ be the total number of citations to community $j$, and $N = \sum_{ij} N_{ij}$ be the total number of citations in matrix A. Then the expected number of citations from community $i$ to community $j$, assuming indifference to one’s own field and others, is $E[N_{ij}] = N_{i\cdot} \times N_{\cdot j} / N$. We define the community weight as a z-score that tells us how many standard deviations above or below expected $N_{ij}$ is. Here we have the observation that $N \gg N_{i\cdot}$ and $N \gg N_{\cdot j}$, so we approximate the standard deviation by $\sqrt{E[N_{ij}]}$. In this way, for every entry, we get a normalized value, which we call the community weight:
\[ W_{ij} = \frac{N_{ij} - N_{i\cdot} \times N_{\cdot j} / N}{\sqrt{N_{i\cdot} \times N_{\cdot j} / N}} \]
By visualizing the normalized matrix, i.e. the matrix of community weights, as in Figure 5, we can observe different densities of information flow amongst communities. For example, for each community, as expected, the majority of citations are within the community itself. However, there are some closely related communities. For example, there appears to be considerable information flow from Information Science to Information Retrieval, from Databases to Data Mining, from Information Retrieval to Data Mining and from Computer Vision to Computer Graphics. These flows reflect frequent citations by papers from the second community to those in the first. We also observe that the more theoretical areas such as Algorithms & Theory and Physics are less connected with others, while more applied areas, such as Data Mining, Information Retrieval, and Operating Systems have more information flows to and from other areas.
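The community weight defined above can be computed from the citation count matrix in a few lines. This is a minimal sketch, assuming the matrix is given as a list of lists with `A[i][j]` the number of citations from community i to community j; entries whose expected count is zero are left at 0.0, which is a choice made here rather than something the paper specifies.

```python
import math


def community_weights(A):
    """A[i][j]: number of citations from community i to community j.

    Returns W with W[i][j] = (N_ij - E[N_ij]) / sqrt(E[N_ij]),
    where E[N_ij] = (row total i) * (column total j) / (grand total).
    """
    n = len(A)
    row = [sum(A[i]) for i in range(n)]
    col = [sum(A[i][j] for i in range(n)) for j in range(n)]
    total = sum(row)
    W = [[0.0] * n for _ in range(n)]
    if total == 0:
        return W
    for i in range(n):
        for j in range(n):
            expected = row[i] * col[j] / total
            if expected > 0:
                W[i][j] = (A[i][j] - expected) / math.sqrt(expected)
    return W


# Two communities that mostly cite themselves.
W = community_weights([[8, 2], [2, 8]])
```

In the toy matrix every expected count is 5, so the diagonal entries come out at 3/√5 (more citations than expected) and the off-diagonal entries at −3/√5.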
Correlations of information diffusion and citation features
If we define information diffusion to occur when a paper is cited, then many factors affect such information diffusion. They include the popularity of the research field pertaining to the article in a certain period, the reputation of the authors, the specific innovation reported in the publication, etc. However, there is much we can surmise simply from the citation patterns, time lapses and community information. Specifically, we examine what kinds of citations would make the citing papers have greater impact, whether it is citing another paper in a related community with strong information flow, or the time elapsed since the publication of the cited paper.
As we have stated before, to measure the influence of a particular paper, both directly and indirectly influenced papers may need to be taken into consideration, possibly weighing them differently. However, for the sake of clarity of the model, and given the lack of consensus in the literature on a particular weighting scheme (Aksnes 2006), we use the number of citations a paper receives normalized by the average number of citations received by all papers in the same area and year (Valderas et al. 2007). This measure allows us to make a fair comparison between articles that may not have finished accumulating citations due to their recency, and to account for differences in the publication cycle for different areas (Stringer, Sales-Pardo, and Amaral 2008).
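The normalized impact measure described above divides each paper's citation count by the mean count of papers from the same area and year. The following sketch assumes a simple dictionary input and assigns 0.0 when a whole group has no citations; both are illustrative choices.

```python
from collections import defaultdict


def normalized_impact(papers):
    """papers: id -> (area, year, citations).

    Returns id -> citations divided by the mean citations of the same area and year.
    """
    sums = defaultdict(lambda: [0, 0])  # (area, year) -> [total citations, paper count]
    for area, year, cites in papers.values():
        sums[(area, year)][0] += cites
        sums[(area, year)][1] += 1
    result = {}
    for pid, (area, year, cites) in papers.items():
        total, count = sums[(area, year)]
        mean = total / count
        result[pid] = cites / mean if mean > 0 else 0.0
    return result


papers = {
    "A": ("Theory", 1990, 10),
    "B": ("Theory", 1990, 30),
    "C": ("AI", 1990, 4),
}
impact = normalized_impact(papers)
```

A value of 1.0 means the paper is cited exactly as often as the average paper of its area and year, so values are comparable across cohorts of different ages.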
Citation networks for all of computer science
Since our study focuses mainly on the relationship between information flow and innovation, as opposed to summaries and reviews, we exclude publications that are book chapters and books, and focus on journal articles and papers published in conference proceedings. In the ACM dataset, the articles are already classified according to publication venue type, and so are easily filtered. In the CiteSeer dataset, we find that a majority of publications having 40 or more references tend to be review manuscripts. We exclude such publications from both data sets. Finally, we exclude papers published after 2000, because their recency means that they have not accumulated most of their citations (Stringer, Sales-Pardo, and Amaral 2008; Burrell 2003).
Table 3 shows the correlations between community weights and time lapse of the citing and cited paper, and the subsequent impact of the citing paper. From the table we see that for both citation networks, the weights of information flows between communities (i.e. the community weights) have positive correlations with the influence metric (normalized out-degrees). This means that, on average, a computer science paper will be rewarded for referencing other papers within its own community or proximate communities.
More recent papers have had an opportunity to cite more distant papers in time. Since pairs of citations are only recorded between papers in the dataset, older papers will have shorter recorded time lags to the papers they reference, since earlier referenced papers may not be included. The above is reflected in the correlation between the publication year of the citing paper and the time elapsed between the two papers ( ρ = 0 . , p < − ). More interestingly, there is a negative correlation between the time elapsed between the papers and the subsequent impact of the citing paper. Note that we are already normalizing by the average citation number of papers in a given year, so that older papers’ chance to accumulate more citations is not a factor. The negative correlation between citation time lag and impact could be interpreted to mean that citing more recent work is rewarded with citations.
However, it is not uncommon to see some extremely innovative and influential work whose citations reach across communities, or draw upon older publications. The overall correlations only reflect the average trend. As we observed in Figure 1, a large proportion of the papers receive very few citations, while a few papers garner large numbers of them. We found interesting trends when, in addition to measuring the overall correlation for all papers, we computed separate correlations for the bottom 90% of the papers according to impact (denoted as ≤ 90% in Table 3) and the top 10% (> 90%).
What we can observe is that for less well cited papers, the correlations between impact and community information flow weight are positive, in agreement with the overall trend. This is where the majority of papers lie: they receive few citations and do not lead to large subsequent impact. However, for papers with high impact (dozens to hundreds of citations), the neutral correlations show that citing within one’s own community is less important.
Similar patterns are observed for time lags as well. The lower impact articles benefit from citing recent work; but for more influential papers, these correlations are reduced or absent. It may be that a truly innovative article draws upon work that had not been garnering much attention recently, and that is not tied to many other relevant publications. This would imply that the more innovative and more highly cited papers may cross boundaries where information normally does not flow.
Subnetworks of papers in different areas
We have seen how the weights of information flows between communities affect the subsequent impact of the citing papers in the overall citation networks of computer science. In this subsection, we investigate the correlations at a finer scope. We choose the areas whose papers constitute more than 5% of the total number of papers with community information in both data sets, such as Theory, Distributed and Parallel Computing, Software Engineering, etc. We consider these papers and their sets of references. Again, we compute the correlations of the community weights of those citation edges and the normalized out-degrees of these citing papers. The correlations are given in Table 4. The results show that the correlations are mostly positive or neutral in three of the areas; the exception is Theory & Algorithms, where they are negative in both ACM and CiteSeer. This implies that information diffusion has a different impact on publications in theoretical computer science than on those in applied computer science. In future work we would like to probe these differences in information diffusion for various fields further.
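For readers who want to reproduce rank correlations of this kind, a Spearman coefficient can be computed as the Pearson correlation of average ranks. The sketch below is a plain-Python illustration with hypothetical inputs (one community weight and one normalized out-degree per citation edge); it does not compute significance levels.

```python
def ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


# Hypothetical per-edge values: community weight and normalized out-degree.
weights = [0.5, 1.0, 2.0, 4.0]
impacts = [0.2, 0.3, 0.9, 5.0]
rho = spearman(weights, impacts)
```

Rank-based correlation is a natural fit for citation data because the impact values are heavily skewed, and ranks are unaffected by a handful of very highly cited papers.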
Subnetworks of papers in different time periods
Instead of grouping papers according to their areas, we can also group them by publication date. In order to reduce the noise introduced by the incompleteness and sparsity of the data sets, we only choose papers in the following four time periods for both the ACM and CiteSeer data sets: 1980–1984, 1985–1989, 1990–1994, and 1995–1999.

After grouping the papers according to publication date, as before, we select the edges with destination papers in the chosen set of papers (e.g. published between 1990 and 1994). We use Pearson correlations between the community weights on the citation edges and the logarithm of the normalized out-degrees of the destination papers, which are shown in Figure 6. Although ACM and CiteSeer have different ranges of confidence intervals for the correlations, the trends of the two sets of correlations are consistent: they increase slightly as the time periods grow more recent. Perhaps as research areas become finer and deeper, it becomes more difficult for researchers to keep up with, understand, and cite papers in areas far from their own. At the same time, their own communities have grown and diversified to incorporate information flows from other areas, so that citing within one's area may provide adequate diversity. However, these are only speculations as to the underlying reasons why citing within one's area would be of greater benefit to more recent papers.

Table 3: Spearman correlations show the effects of community weights and time differences between the cited and citing papers on the subsequent impacts of citing papers.

                   ACM                           CiteSeer
                   Overall   ≤90%     >90%       Overall   ≤90%     >90%
time difference    −.∗∗∗     −.∗∗∗    .∗         −.∗∗∗     −.∗∗∗    .∗
c-weight           .∗∗∗      .∗∗∗     .∗         .∗∗∗      .∗∗∗     .∗

Table 4: Correlations between community weights and normalized out-degrees of citing papers grouped by different communities.

Venue                      Dataset    Percent.   Correlation
Theory & Algorithms        ACM        32.92%     −.∗∗∗
                           CiteSeer   17.24%     −.∗∗∗
Distributed Computing      ACM        6.39%      .∗
                           CiteSeer   10.45%     −.∗
Artificial Intelligence    ACM        5.24%      .∗
                           CiteSeer   8.69%      .∗∗∗
Software Engineering       ACM        19.54%     .∗∗∗
                           CiteSeer   19.37%     .∗∗∗
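The per-period computation described in this subsection can be sketched as follows. This is a minimal illustration, not the original analysis code: the edge tuple layout `(dest_year, c_weight, norm_out_degree)` and the function names are hypothetical, and skipping edges whose destination paper has a normalized out-degree of zero (where the logarithm is undefined) is an assumption of this sketch.

```python
import math
from collections import defaultdict

# The four publication periods used for both data sets.
PERIODS = [(1980, 1984), (1985, 1989), (1990, 1994), (1995, 1999)]


def period_of(year):
    """Return a label such as '1990-1994', or None if outside all periods."""
    for start, end in PERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    return None


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def correlations_by_period(edges):
    """Correlate community weight with log normalized out-degree per period.

    edges: iterable of (dest_year, c_weight, norm_out_degree), where
    dest_year is the publication year of the edge's destination paper.
    """
    groups = defaultdict(lambda: ([], []))
    for year, weight, degree in edges:
        label = period_of(year)
        # Assumption: drop edges outside the periods or with log(0) undefined.
        if label is None or degree <= 0:
            continue
        groups[label][0].append(weight)
        groups[label][1].append(math.log(degree))

    # Require at least three edges so the correlation is not degenerate.
    return {p: pearson(ws, ls) for p, (ws, ls) in groups.items() if len(ws) > 2}
```

Comparing the returned values across the four labels, separately for each data set, would show the kind of trend over time that Figure 6 reports.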
Subnetworks of papers versus books
We consider one final subset of the citation graph, that of books. As we mentioned before, in the ACM dataset, there are documents labeled as books or book chapters. We select these documents and study how their citation patterns may differ from those of journal articles or conference proceedings. Since the datasets did not map books to different fields of computer science, we just consider the raw out-degree of books as the measure of impact and focus on the time elapsed between the publication of the book and the work it cites. We consider citations from books or book chapters to any type of publication, including papers in journals and conference proceedings. Because books have longer reference lists (see Figure 1(a)), any single citation is less likely to have a strong effect on the impact of the citing article. Indeed, we find that the correlation of time spans and the out-degrees of books is −.∗. This is lower than the corresponding correlation between out-degrees and time spans for papers in the same dataset, which is −.∗∗∗. This trend is also consistent with the fact that books are expected to cover a substantial amount of material, which may necessitate citing earlier publications. On the other hand, papers may just need to cite the most recent work they are building upon.

Figure 6: Correlations between community weights and normalized out-degrees of citing papers, grouped by different time periods (x-axis: Time Period; y-axis: Correlation; series: ACM and CiteSeer).

Conclusions and future work
We analyzed a very old, regimented, and established social medium for knowledge sharing in order to discover patterns of information flow with respect to community structure. Consistent with prior results, we find a wide range in the impact individual publications have. Information cascades, encompassing all chains of citations resulting from a single paper, vary dramatically in size, and only a small proportion of paper pairs are linked via cascades. In contrast, the most influential papers are surprisingly interlinked. Many publications go mostly unnoticed, while some garner considerable attention. There are interesting factors, relating to the citation graph, that correlate with the popularity a given publication will enjoy.

Our particular interest is in the impact of a particular citation on the success of the citing article. Through intensive study of two data sets of computer science publications, ACM and CiteSeer, we find that citations that occur within communities lead to a slightly higher number of direct citations, and that citing more recent papers corresponds to receiving more citations in turn. However, our most interesting finding is that for the most influential group of papers, this relationship is reduced or absent, allowing for the possibility that ideas across communities can lead to higher impact work. Finally, we find that the effect of recency and community on citation structure differs among different areas of computer science and among different time periods.

In future work, we would like to expand our study to several additional contexts, including patent citation networks and paper citation networks of various scientific areas, in which the effect of boundary-spanning information flows would be investigated. We would also like to extend our analysis to blogs, where strong community structure and information cascades have both been observed, but little is known about the effect of this community structure on diffusion properties. However, the sparseness of citation data for blogs and the loose relationship between them will present additional challenges. We would also like to extend our study using textual analysis, to map specific ideas that are spreading through the citation network. In doing so, we could identify the points at which a particular idea has crossed a community boundary, and measure whether this occasionally leads to large information cascades.
Acknowledgements
We would like to thank Eytan Bakshy for helpful comments and suggestions. This research was supported in part by a grant from NEC.