Characterizing scientific production and consumption in Physics
Qian Zhang, Nicola Perra, Bruno Goncalves, Fabio Ciulla, Alessandro Vespignani
Qian Zhang, Nicola Perra, Bruno Gonçalves, Fabio Ciulla, Alessandro Vespignani ∗
Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern University, Boston MA 02115, USA
Aix Marseille Université, CNRS, CPT, UMR 7332, 13288 Marseille, France
Institute for Scientific Interchange Foundation, Turin 10133, Italy
Institute for Quantitative Social Sciences at Harvard University, Cambridge, MA 02138
Abstract
We analyze the entire publication database of the American Physical Society generating longitudinal (50 years) citation networks geolocalized at the level of single urban areas. We define the knowledge diffusion proxy and scientific production ranking algorithms to capture the spatio-temporal dynamics of Physics knowledge worldwide. By using the knowledge diffusion proxy we identify the key cities in the production and consumption of knowledge in Physics as a function of time. The results from the scientific production ranking algorithm allow us to characterize the top cities for scholarly research in Physics. Although we focus on a single dataset concerning a specific field, the methodology presented here opens the path to comparative studies of the dynamics of knowledge across disciplines and research areas.
∗ To whom correspondence should be addressed; email: [email protected]
Over the last decade, the digitalization of publication datasets has propelled bibliographic studies, allowing for the first time access to the geospatial distribution of millions of publications and citations at different granularities [1, 2, 3, 4, 5, 6, 7, 8] (see [9] for a review). More precisely, authors' names, affiliations, addresses, and references can be aggregated at different scales, and used to characterize publication and citation patterns of single papers [10, 11], journals [12, 13], authors [14, 15, 16], institutions [17], cities [18], or countries [19]. The sheer size of the datasets also allows system-level analyses of research production and consumption [20], migration of authors [21, 22], and change in production in several regions of the world as a function of time [5, 6], just to name a few examples. At the same time, those analyses have spurred an intense research activity aimed at defining metrics able to capture the importance/ranking of authors, institutions, or even entire countries [23, 24, 14, 15, 17, 25, 26, 27, 28, 29]. Whereas such large datasets are extremely useful in understanding scholarly networks and in charting the creation of knowledge, they are also pointing out the limits of our conceptual and modeling frameworks [30] and call for a deeper understanding of the dynamics ruling the diffusion and fruition of knowledge across the social and geographical space.
In this paper we study citation patterns of articles published in the American Physical Society (APS) journals in a fifty-year time interval ( - ) [31]. Although in the early years of this period the dataset was obviously biased toward the scholarly activity within the USA, in the last twenty years only about 35% of the papers are produced in the USA. The same amount of production has been observed in databases that include multiple journals and disciplines [19, 7]. Indeed the journals of the APS are considered worldwide as reference publication venues that well represent the international research activity in Physics. Furthermore this dataset does not bundle different disciplines and publication languages, providing a homogeneous dataset concerning Physics scholarly research. For each paper we geolocalize the institutions contained in the authors' affiliations. In this way we are able to associate each paper in the database with specific urban areas. This defines a time-resolved, geolocalized citation network including 2,307 cities around the world engaged in the production of scholarly work in the area of Physics. Following previous works [17, 8] we assume that the number of given or received citations is a proxy of knowledge consumption or production, respectively. More precisely, we assume that citations are the currency traded between parties in the knowledge exchange. Nodes that receive citations export their knowledge to others. Nodes that cite other works import knowledge from others. According to this assumption we classify nodes considering the unbalance in their trade. Knowledge producers are nodes that are cited (export) more than they cite (import). On the contrary, we label as consumers nodes that cite (import) more than they are cited (export). Using this classification, we define the knowledge diffusion proxy algorithm to explore how scientific knowledge flows from producers to consumers. This tool explicitly assumes a systemic perspective of knowledge diffusion, highlighting the global structure of scientific production and consumption in Physics.
The temporal analysis reveals interesting patterns and the progressive delocalization of knowledge producers. In particular, we find that in the last twenty years the geographical distribution of knowledge production has drastically changed.
A paramount example is the transition in the USA from a knowledge production localized around major urban areas on the east and west coasts to a broad geographical distribution where a significant part of the knowledge production is now occurring also in the midwestern and southern states of the USA. Analogously, we observe the early 90s dominance of the UK and Northern Europe subside in favor of an increase of production from France, Italy and several regions of Spain. Interestingly, the last decade shows that several of China's urban areas are emerging as the largest knowledge consumers worldwide. The reasons underlying this phenomenon may be related to the significant growth of the economy and the research/development compartment in China in the early 21st century [32]. This positive stimulus also pushed up the scientific consumption, with a large number of papers citing work from other world areas. Indeed, the increase of publications is associated with an increase of the citation unbalance, moving China to the top rank as consumers since the recent influx of its new papers has not yet had the time to accumulate citations.
Although the knowledge diffusion proxy provides a measure of knowledge production and consumption, it may be inadequate in providing a rank of the most authoritative cities for Physics research. Indeed, a key issue in appropriately ranking the knowledge production is that not all citations have the same weight. Citations coming from authoritative nodes are heavier than others coming from less important nodes, thus defining a recursive diffusion of ranking of nodes in the citation network. In order to include this element in the ranking of cities we propose the scientific production ranking algorithm. This tool, inspired by PageRank [33], allows us to define the rank of each node, as a function of time, going beyond the knowledge diffusion proxy or simple local measures such as citation counts or h-index [14].
In this algorithm the importance of each node diffuses through the citation links. The rank of a node is determined by the rank of the nodes that cite it, recursively, thus implicitly weighting differently citations from highly (lowly) ranked nodes. Also in this case we observe noticeable changes in the ranking of cities along the years. For instance the presence of both European and Asian cities in the top list increases by in the last years. These findings suggest that the Internet, digitalization and accessibility of publications are creating a more levelled playing field where the dominance of specific areas of the world is being progressively eroded to the advantage of a more widespread and complex knowledge production and consumption dynamic.
Results
We focus our analysis on the APS dataset [31]. It contains all the papers published by the APS from to . We consider only the last years due to the incomplete geolocalization information available for the early years. During this period, the large majority of indexed papers, . , contain complete information such as authors' names, journal of publication, day of publication, list of affiliations and list of citations to other articles published in APS journals. We geolocalized . of papers at urban area level with an accuracy of . . We refer the reader to the Methods section and to the Supplementary Information (SI) for the detailed description of the dataset and the techniques developed to geolocalize the affiliations.
In total, only of papers have been produced inside the USA. Interestingly, over time this fraction has decreased. For example, in the 's it was . , while in the last years it decreased to just . . While one might assume that the APS dataset is biased toward the USA scientific community, the percentage of publications contributed by the USA in APS journals after is almost the same as in other publication datasets [19, 7]. These alternative datasets contain journals published all over the world and mix different scientific disciplines. This supports the idea that the APS journals are now attracting the worldwide physics scientific community independently of nationality, and fairly represent the world production and consumption of Physics. It is not possible to provide a quantitative analysis of possible nationality bias and disentangle it from an actual change of the dynamic of knowledge production. For this reason, and in order to minimize any bias, we focus our analysis on the last years of data.
In order to construct the geolocalized citation network we consider nodes (urban areas) and directed links representing the presence of citations from a paper with affiliation in one urban area to a paper with affiliation in another urban area.
For example, if a paper written in node $i$ cites one paper written in node $j$ there is a link from $i$ to $j$, i.e., $j$ receives a citation from $i$ and $i$ sends a citation to $j$. Each paper may have multiple affiliations and therefore citations have to be proportionally distributed between all the nodes of the papers. For this reason we weight each link in order to take into account the presence of multiple affiliations and multiple citations. In a given time window, the total number of citations for papers written in $j$ received from papers written in $i$ is the weight of the link $i \to j$, and the total number of citations for those papers written in $j$ sent to the papers written in $k$ is the weight of the link $j \to k$. For instance, if in a time window $t$ there is one paper written in node $j$, which cites two papers written in node $k$ and was cited by three papers written in node $i$, then $w_{jk} = 2$, $w_{ij} = 3$, and we add all such weights for each paper written in that node $j$ and obtain the weights for links. For papers written in multiple cities, say $j_1$, $j_2$, the weight will be counted equally. The time window we use in this manuscript is one year. We show an example of the network construction in Figure (1).
Figure 1: Projecting a paper citation relationship into a city-to-city citation network. (A) Paper A written by authors from Ann Arbor, Los Alamos and New York cites one paper B written by authors from Rome and Madrid and another paper C from Oxford and Princeton. (B) In a city-to-city citation network, directed links from Ann Arbor to Madrid, Rome, Oxford and Princeton are generated, and similarly Los Alamos and New York are connected to the above four cited cities.
In order to define main actors in the production and consumption of Physics, we consider citations as a currency of trade. This analogy allows us to immediately grasp the meaning and distinction between producers and consumers of scientific knowledge. Nodes that receive citations export their knowledge to the citing nodes. Instead, nodes that cite papers produced from other nodes of the network import knowledge from the cited nodes. Measuring the trade unbalance in citations, we define producers as cities that export more than they import, and consumers as cities that import more than they export. More precisely, we can measure the total knowledge imported by each urban area as $\sum_j w_{ij}$ and the total export as $\sum_j w_{ji}$ in a given year. Those measures however acquire specific meaning when considered relative to the total trade of physics knowledge worldwide in the same year, i.e. the total number of citations worldwide $S = \sum_{ij} w_{ij}$. The relative trade unbalance of each urban area $i$ is then:
$$\Delta S_i = \frac{\sum_j w_{ji} - \sum_j w_{ij}}{S}. \qquad (1)$$
A negative or positive value of this quantity indicates if the urban area $i$ is consumer or producer, respectively. In Figure (2)-A we show the worldwide geographical distribution of producer (red) and consumer (blue) urban areas for and . Interestingly, during the s the production of Physics knowledge was highly localized in a few cities on the eastern and western coasts of the USA and in a few areas of Great Britain and Northern Europe. In the picture is completely different, with many producer cities in central and southern parts of the USA, Europe and Japan. It is interesting to note that although the fraction of papers produced in the USA is generally decreasing or stable, many more cities in the USA acquire the status of knowledge producers. This implies that the quality of knowledge production from the USA is increasing and thus attracting more citations. This makes it clear that the knowledge produced by an urban area cannot be measured only by the raw number of papers. Citations are a more appropriate proxy that encodes the value of the products.
They serve as an approximation of the actual flow of knowledge. Figure (2)-A also makes it clear that cities in China are playing the role of major consumers in both and . We also observe that cities in other countries like Russia and India consumed less in than . In other words, in both the production and consumption of knowledge are less concentrated on specific places and generally spread more evenly geographically. In order to provide visual support to this conclusion we show in Figure (2)-B the geographical distribution of producers and consumers inside the USA. From the two maps the drift of knowledge production from the two coastal areas in the USA to the midwest, central and southern states is evident. Similarly, in Figure (2)-C we plot the same information for western Europe. In only a few urban areas in Germany and France were clearly producers. By this dominance has been consistently eroded by Italy, Spain and a more widespread geographical distribution of producers in France, Germany and the UK.
Knowledge diffusion proxy.
The definition of producers and consumers is based on a local measure that cannot capture all possible correlations and bonds between nodes that are not directly connected. This might result in a partial view and description of the system, especially when connectivity patterns are complex [36, 37, 38, 39, 40]. Interestingly, a close analysis of each citation network, see Figure (3), clearly shows that citation patterns have indeed all the hallmarks of complex systems [36, 37, 38, 39, 40], especially in the last two decades. The system is self-organized: there is no central authority that assigns citations and papers to cities, there is no blueprint of the system's interactions, and, as clearly shown in Figure (3)-C, the statistical characteristics of the system are described by heavy-tailed distributions [36, 37, 38, 39, 40]. Not surprisingly, the level of complexity of the system has increased with time. In Figure (3)-A we plot the most statistically significant connections of the citation network between cities inside the USA in , and . We filter links by using the backbone extraction algorithm [41], which preserves the relevant connections of weighted networks while removing the least statistically significant ones. We visualize each filtered network by using a bundled representation of links [42]. The direction of each weighted link goes from blue (citing) to red (cited). Similarly, in Figure (3)-B, we visualize the most significant links between cities in Europe (European Union's 27 countries, as well as Switzerland and Norway). It is clear from Figure (3)-A that in the citation patterns inside the USA were limited to a few cities, and in Europe only a few cities were connected. Instead, in and we register an increase in the interactions among a larger number of cities. The observed temporal trend is well known and valid not just for Physics [43].
Among many factors that have been advocated to explain this tendency we find the increase of the research system and the advance in technology that make collaboration and publishing easier [44, 45, 46, 20].
In order to explicitly consider the complex flow of citations between producers and consumers, we propose the knowledge diffusion proxy algorithm (see Methods section for the formal definition). In this algorithm, producers inject citations in the system that flow along the edges of the network to finally reach consumer cities where the injected citations are absorbed. The algorithm allows charting the diffusion of knowledge, going beyond local measures. The entire topology of the networks is explored, uncovering nontrivial correlations induced by global citation patterns. For instance, knowledge produced in a city may be consumed by another producer that in turn produces knowledge for other cities who are consumers. This points out that the actual consumer of knowledge is not just signalled by the unbalance of citations but by the overall topology of the production and consumption of knowledge in the whole network. Indeed, the final consumer of each injected citation may not be directly connected with the producer. Citations flow along all possible paths, sometimes through intermediate cities. In Table (1) and Table (2) we report the rankings of Top final consumers evaluated by the knowledge diffusion proxy for the Top producers in and respectively. We also list the Top neighbours according to the local citation unbalance. From these two tables, it is clear that the final rank of each consumer, obtained by our algorithm, can be extremely different from the ranking obtained by just considering local unbalances. For instance, in Bratislava and Mainz rank in top consumers absorbing knowledge produced in Boston. However, according to the local measure of unbalance, these two cities are ranked out of top (shown in bold in Table (1)).
Interestingly, even the Top consumer for New Haven, Berlin, also does not rank among the Top neighbours according to the citation unbalance. These findings confirm that in order to uncover the complex set of relationships among cities, it is crucial to consider the entire structure of the network, going beyond simple local measures.
Table 1: Rankings from Knowledge diffusion proxy algorithm for top 3 producer cities in 2009. In bold, we highlight cities that are present in top 10 consumers ranked according to the knowledge diffusion proxy but do not appear in top 10 cities ranked according to local citation unbalance.
| Boston: Diffusion proxy | Boston: Citation unbalance | Berkeley: Diffusion proxy | Berkeley: Citation unbalance | New Haven: Diffusion proxy | New Haven: Citation unbalance |
|---|---|---|---|---|---|
| Athens | Madrid | Athens | Athens | **Berlin** | Vancouver |
| Madrid | Athens | Gwangju | Madrid | Athens | Paris |
| Vancouver | Vancouver | Bratislava | Bratislava | **Mainz** | Trieste |
| Gwangju | Moscow | Madrid | Paris | Vancouver | Athens |
| **Bratislava** | Paris | Vancouver | Vancouver | Gwangju | Gwangju |
| Berlin | Tokyo | Trieste | Gwangju | Trieste | Bratislava |
| Trieste | Trieste | Waco | Moscow | Bratislava | Madrid |
| **Mainz** | Beijing | Paris | Trieste | **Coventry** | Liverpool |
| Paris | Berlin | **Berlin** | Seoul | **Valencia** | Oxford |
| **Waco** | Gwangju | **Mainz** | Waco | Madrid | Santa Barbara |
Table 2: Rankings from Knowledge diffusion proxy algorithm for top 3 producer cities in 1990. In bold, we highlight cities that are present in top 10 consumers ranked according to the knowledge diffusion proxy but do not appear in top 10 cities ranked according to local citation unbalance.
| Piscataway: Diffusion proxy | Piscataway: Citation unbalance | Boston: Diffusion proxy | Boston: Citation unbalance | Palo Alto: Diffusion proxy | Palo Alto: Citation unbalance |
|---|---|---|---|---|---|
| Tokyo | Stuttgart | Tokyo | Tokyo | Tokyo | Tokyo |
| **Beijing** | Tokyo | Grenoble | Grenoble | **Beijing** | Ann Arbor |
| **Tsukuba** | Los Angeles | **Beijing** | Los Angeles | **Tsukuba** | Bloomington |
| Grenoble | Urbana | **Tsukuba** | College Park | Seoul | Boulder |
| **Tallahassee** | College Park | **Seoul** | Los Alamos | **Tallahassee** | Urbana |
| Hamilton | Grenoble | Vancouver | Urbana | **Charlottesville** | Berlin |
| **Buffalo** | Rochester | **Tallahassee** | Boulder | **Vancouver** | Orsay |
| **Vancouver** | Boston | **Warsaw** | Rochester | Berlin | Denver |
| **Charlottesville** | Los Alamos | **Kolkata** | Vancouver | **Durham** | Seoul |
| **Tempe** | Hamilton | **Charlottesville** | Bloomington | **Taipei** | Los Alamos |
In Figure (4)-A and Figure (4)-B we visualize the results considering the Top four producer cities in in the USA and in Europe respectively. We show their Top ten consumers over years as a function of time. The size of each circle is proportional to how many times each injected citation is absorbed by that consumer. In the plot, vertical grey strips indicate that the city was not a producer during those years (e.g. Orsay in ). The results show that, on average, Beijing is the top consumer for all of these producers in the past years. Since China registered strong economic growth and an increase of its research population in the early , it is reasonable to assume that, thanks to this positive stimulus, many more papers were written in its capital, a dominant city for scientific research in China. However, the fast publication growth increased the unbalance between sent and received citations. Each paper published in a given city imports knowledge from the cited cities. Reaching a balance might require some time. Each city needs to accumulate citations back to export its knowledge to other cities. We can speculate that in the near future cities in China might move among the strongest producers if a fair number of papers start receiving enough citations, which obviously depends on the quality of the research carried out in the last years. This is the case of cities like Tokyo, which has gradually approached the citation balance in recent years. For instance, Table (2) shows that in Tokyo was among the top consumers. But by , its contribution to citation consumption had become less significant, as observed from Figure (4) and Table (1).
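To make the mechanics of the knowledge diffusion proxy more concrete, the following is a minimal Python sketch of the biased random walk with sources and sinks described in the Methods section. It is an illustration under stated assumptions, not the implementation used for the results above. The function names (`net_flow`, `diffusion_proxy`), the dictionary layout of the weights, and the parameters `n_walks` and `seed` are hypothetical. The sketch also assumes a specific sign convention: on the net trade network $F$ an edge $i \to j$ carries the net citations that $j$ gives to $i$, citations are injected in cities that are cited more than they cite, and the absorption probability is the net inflow on $F$ divided by the total inflow, so that it stays between 0 and 1.

```python
import random


def net_flow(w):
    """Net trade network F from citation weights w[(i, j)].

    w[(i, j)] is the number of citations that node i gives to node j.
    T_ij = w_ij - w_ji is antisymmetric; F_ij = |T_ij| where T_ij < 0,
    so an edge i -> j in F carries the net citations that j gives to i.
    """
    nodes = sorted({n for pair in w for n in pair})
    F = {i: {} for i in nodes}
    for i in nodes:
        for j in nodes:
            t_ij = w.get((i, j), 0.0) - w.get((j, i), 0.0)
            if t_ij < 0:
                F[i][j] = -t_ij
    return F


def diffusion_proxy(w, n_walks=2000, seed=0):
    """Estimate e[i][j]: how often a citation injected in producer i
    is absorbed in city j (hypothetical helper, Monte Carlo estimate)."""
    rng = random.Random(seed)
    F = net_flow(w)
    s_out = {i: sum(F[i].values()) for i in F}
    s_in = {i: sum(F[j].get(i, 0.0) for j in F) for i in F}
    # Assumption: absorption probability = net inflow / total inflow on F,
    # set to zero for nodes with no net inflow.
    p_abs = {
        i: max(0.0, (s_in[i] - s_out[i]) / s_in[i]) if s_in[i] > 0 else 0.0
        for i in F
    }
    e = {}
    for i in F:
        if s_out[i] <= s_in[i]:
            continue  # only net-cited cities act as sources
        counts = {}
        for _ in range(n_walks):
            node = i
            while True:
                # Follow an outgoing edge with probability proportional
                # to its intensity.
                targets = list(F[node])
                weights = [F[node][t] for t in targets]
                node = rng.choices(targets, weights=weights)[0]
                if rng.random() < p_abs[node]:
                    counts[node] = counts.get(node, 0) + 1
                    break
        e[i] = counts
    return e


# Toy example: B cites A four times, C cites B three times and A once.
toy = {("B", "A"): 4, ("C", "B"): 3, ("C", "A"): 1}
absorbed = diffusion_proxy(toy)
```

In this toy network A is the only source. B is both a consumer of A and a producer for C, so most citations injected in A end up absorbed in C even though C cites A directly only once, which is the kind of non-local effect the tables above highlight.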
Ranking Cities.
Authors, departments, institutions, government and many funding agencies are extremely interested in defining the most important sources of knowledge. The necessity to find objective measures of the importance of papers, authors, journals, and disciplines leads to the definition of a wide variety of rankings [23, 24]. Measures such as impact factor, number of citations and h-index [14] are commonly used to assess the importance of scientific production. However, these common indicators might fail to account for the actual importance and prestige associated with each publication. In order to overcome these limitations, many different measures have been proposed [25, 26, 27, 28]. Here we introduce the scientific production ranking algorithm (SPR), an iterative algorithm based on the notion of diffusing scientific credits. It is analogous to PageRank [33], CiteRank [26], HITS [25], SARA [29], and other ranking metrics. In the algorithm each node receives a credit that is redistributed to its neighbours at the next iteration until the process converges to a stationary distribution of credit over all nodes (see Methods section for the formal definition). The credits diffuse following citation links self-consistently, implying that not all links have the same importance. Any city in the network will be more prominent in rank if it receives citations from high-rank sources. This process ensures that the rank of each city is self-consistently determined not just by the raw number of citations but also by whether the citations come from highly ranked cities. In Figure (5) we show the Top cities from to . Interestingly, we clearly see the decline and rise of cities along the years as well as the steady leadership of Boston and Berkeley. This behaviour is clear in Figure (6)-B where we show the rank for cities in the USA in and . Meanwhile, the ranking of cities in European and Asian countries like France, Italy and Japan has increased significantly, as shown in both Figure (5) and Figure (6)-A.
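The credit diffusion behind these rankings can be illustrated with a short Python sketch that iterates the self-consistent equation given in the Methods section (Eq. (2), with the productivity vector of Eq. (3)). It is a minimal sketch under stated assumptions rather than the code used to produce the figures. The function names `production_vector` and `scientific_production_rank`, the dictionary layout, and the stopping parameters `tol` and `max_iter` are hypothetical, and the default `q = 0.15` is only a placeholder for the damping factor, not the value used in this paper.

```python
def production_vector(papers):
    """z_i as in Eq. (3): each paper gives 1/n_p credit to each of the
    n_p cities that wrote it; credits are then normalized to sum to one.

    `papers` is a list of city lists, one list per paper (assumed layout).
    """
    credit = {}
    for cities in papers:
        unique = set(cities)
        for c in unique:
            credit[c] = credit.get(c, 0.0) + 1.0 / len(unique)
    total = sum(credit.values())
    return {c: v / total for c, v in credit.items()}


def scientific_production_rank(w, z, q=0.15, tol=1e-12, max_iter=1000):
    """Iterate Eq. (2) until the scores stop changing.

    w[(j, i)] is the weight of the citation link j -> i and z the
    productivity vector. q = 0.15 is a placeholder value (assumption).
    Every city appearing in w is assumed to be a key of z.
    """
    nodes = list(z)
    s_out = {j: 0.0 for j in nodes}
    for (j, _i), weight in w.items():
        s_out[j] += weight
    P = dict(z)  # start from the productivity-based credit
    for _ in range(max_iter):
        # Credit held by nodes with zero strength-out (third term).
        dangling = sum(P[j] for j in nodes if s_out[j] == 0)
        # Random-jump term plus redistribution of dangling credit.
        new = {i: q * z[i] + (1 - q) * z[i] * dangling for i in nodes}
        # Diffusion of credit along citation links (second term).
        for (j, i), weight in w.items():
            new[i] += (1 - q) * P[j] * weight / s_out[j]
        delta = sum(abs(new[i] - P[i]) for i in nodes)
        P = new
        if delta < tol:
            break
    return P


# Toy example with three cities and three papers.
toy_papers = [["Boston"], ["Boston", "Berkeley"], ["Rome"]]
toy_links = {
    ("Rome", "Boston"): 2.0,
    ("Berkeley", "Boston"): 1.0,
    ("Boston", "Berkeley"): 1.0,
}
toy_rank = scientific_production_rank(toy_links, production_vector(toy_papers))
```

In the toy example Rome has written as much as Berkeley's and Boston's shared output would suggest, but it receives no citations, so its score reduces to the random-jump term, while Berkeley inherits credit from being cited by the highly ranked Boston. The total credit stays normalized to one at every iteration, including when some cities cite nobody.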
In Figure (6)-C we focus on the geographical distribution of ranks for a selected set of European countries in and . In Table (3) we provide a quantitative measure of the change in the landscape of the most highly ranked cities in the world by showing the percentage of cities in the top 100 ranks for different continents. In Figure (7), we compare the ranking obtained by our recursive algorithm with the ranking obtained by considering the total volume of publications produced in each city. Since we are considering only journals by the APS, the impact factor is consistent across all cities and does not include disproportionate effects that often happen when mixing disciplines or journals with varied readership. It is then natural to consider a ranking based on the raw productivity of each place. As we see in the figure, though, the two rankings, although obviously correlated, provide different results. A number of cities whose ranking, according to productivity, is in the Top 20 cities in the world, are ranked one order of magnitude lower by the SPR algorithm. Valuing the number of citations and their origin in the ranking of cities produces results often not consistent with the raw number of papers, signaling that in some places a large fraction of papers are not producing knowledge as they are not cited. We believe that the present algorithm may be considered as an appropriate way to rank scientific production taking properly into account the impact of papers as measured by citations.
Table 3: Percentage of top 100 ranked cities in continents in 1990 and 2009.
| Continent | 1990 | 2009 |
|---|---|---|
| Asia | 4.0% | 11.0% |
| Europe | 24.0% | 33.0% |
| N. America | 72.0% | 56.0% |
Discussion
In this paper we study the scientific knowledge flows among cities as measured by papers and citations contained in APS [31] journals. In order to make clear the meaning and difference between producers and consumers in the context of knowledge, we propose an economic analogy referring to citations as a traded currency between urban areas. We then study the flow of citations from producers to consumers with the knowledge diffusion proxy algorithm. Finally, we rank the importance of cities as a function of time using the scientific production ranking algorithm. This method, inspired by PageRank [33], allows us to evaluate the importance of cities explicitly considering the complex nature of citation patterns. In our analysis we considered just scientific publications contained in the APS journals [31]. We do not have information on citations received or assigned to papers outside this dataset. These limitations certainly affect the count of citations of each city, potentially creating biases in our results. However, our findings, while limited to a particular dataset, are aligned with different observations reported by other studies focused on other datasets and fields. For example, we identify major US cities (e.g. Boston and San Francisco areas) as the most important sources of Physics. Similar observations have been made by Börner et al. [17] at the institution level considering papers published in the Proceedings of the National Academy of Sciences, by Mazloumian et al. [8] at country and city level with the Web of Science dataset, and by Batty [4] at both institution and country level considering the Institute for Scientific Information (ISI) HighlyCited database. We also find that some European, Russian and Japanese cities have gradually improved their productivities and ranks in the last twenty years. Similarly, such growth in scientific production has been observed by King [19] in the ISI database. As discussed in detail in the SI, by aggregating citations of cities to their respective countries, we find the same correlation between the number of citations, as well as the number of papers, and the GDP invested in Research and Development of several countries as reported by Pan et al. [7] based on the ISI database. This analogy between our results and many others in the literature suggests that the APS dataset, although limited, is representative of the overall scientific production of the largest countries and cities in the last 20 years. The methodology proposed in this paper could be readily extended to larger datasets for which the geolocalization of multiple affiliations is possible. In view of the different rates of publications and citations in different scientific fields we believe however that the analysis of scientific knowledge production should only consider homogeneous datasets. This would help the understanding of knowledge flows in different areas and identify the hot spots of each discipline worldwide.
Methods
Dataset.
The dataset covers the American Physical Society journals, considering papers published between and , of which , papers include a list of affiliations [31]. Each paper may have multiple affiliations. In total there are , affiliation strings.
In order to geolocalize the articles, we parse the city names from the affiliation strings for each article. First, we process each affiliation string and try to match country or US state names from a list of known names and their variations in different languages. We crosscheck the results with the Google Maps API, obtaining validated location information for . of affiliation strings, corresponding to , articles. It is worth noticing that we do not use the Google Maps API (or other map APIs like Yahoo! or Bing) directly for geocoding because, to the best of our knowledge, there are no accuracy guarantees for these API results. For each affiliation string with an extracted country or state name, we also match the city name against the GeoNames database [47] corresponding to its country or US state. . of affiliation strings with extracted city names are subsequently verified with the Google Maps API. Finally, a total of , published articles successfully pass the filters we describe here.
The dataset also provides , , records of citations between articles published in APS journals. To build citation networks at the city level, we merge the citation links from the same source node to the same target node, and put the total citations on this link as the weight. For articles with multiple city names, the weight will be equally distributed to the links of these nodes. In total there are , , links for city-to-city citation networks from to . (For the full details of parsing country and city names, as well as building networks, see Supplementary Information (SI).)
Knowledge diffusion proxy algorithm.
This analysis tool is inspired by the dollar experiment, originally developed to characterize the flow of money in economic networks [48]. Formally, it is a biased random walk with sources and sinks where a citation diffuses in the network. The diffusion takes place on top of the network of net trade flows. Let us define $w_{ij}$ as the number of citations that node $i$ gives to $j$ and $w_{ji}$ as the opposite flow. We can define the antisymmetric matrix $T_{ij} = w_{ij} - w_{ji}$. The network of the net trade is defined by the matrix $F$ with $F_{ij} = |T_{ij}| = |T_{ji}|$ for all connected pairs $(i, j)$ with $T_{ij} < 0$ and $F_{ij} = 0$ for all connected pairs $(i, j)$ with $T_{ij} \ge 0$. There are two types of nodes. Producers are nodes with a positive trade unbalance $\Delta s_i = s_i^{in} - s_i^{out} = \sum_j F_{ji} - \sum_j F_{ij}$. Their strength-in is larger than their strength-out. On the other hand, consumers are nodes with a negative unbalance $\Delta s$. On top of this network a citation is injected in a producer city. The citation follows the outgoing edges with a probability proportional to their intensities, and the probability that the citation is absorbed in a consumer city $j$ equals $P_{abs}(j) = \Delta s_j / s_j^{in}$. By repeating this process many times from each starting point (producers) we can build a matrix with elements $e_{ij}$ that measure how many times a citation injected in the producer city $i$ is absorbed in a consumer city $j$.
Scientific production ranking algorithm.
The scientific production rank is defined for each node $i$ according to the self-consistent equation

$$P_i = q z_i + (1 - q) \sum_j \frac{P_j}{s^{out}_j} w_{ji} + (1 - q) z_i \sum_j P_j \, \delta\left(s^{out}_j\right). \qquad (2)$$

$P_i$ is the score of node $i$, $0 \leq q \leq 1$ is the damping factor (defining the probability of random jumps reaching any other node in the network), $w_{ji}$ is the weight of the directed connection from $j$ to $i$, $s^{out}_j$ is the strength-out of node $j$, and finally $\delta(x)$ is the Dirac delta function that is for $x = 0$ and for $x = 1$. Here we use the damping factor q = 0 . . The first term on the r.h.s. of Eq. (2) defines the redistribution of credits to all nodes in the network due to the random jumps in the diffusion. The second term defines the diffusion of credit through the network: each node $i$ gets a fraction of credit from each citing node $j$ proportional to the ratio of the weight of the link $j \to i$ and the strength-out of node $j$. Finally, the last term defines the redistribution of credits to all the nodes in the network due to the nodes with zero strength-out. In the original PageRank the vector $z$ has all components equal to $1/N$ (where $N$ is the total number of nodes). Each component has the same value because the jumps are homogeneous. In this case, instead, the vector $z$ considers the normalized scientific credit given to node $i$ based on its productivity. Mathematically we have

$$z_i = \frac{\sum_p \delta_{p,i} / n_p}{\sum_j \sum_p \delta_{p,j} / n_p}, \qquad (3)$$

where $p$ denotes a generic paper and $n_p$ the number of nodes that have written the paper. It is important to notice that $\delta_{p,i} = 1$ only if the $i$-th node wrote the paper $p$; otherwise it equals zero.

Acknowledgments
This work has been partially funded by NSF CCF-1101743 and NSF CMMI-1125095 awards. We acknowledge the American Physical Society for providing the data about Physical Review’s journals.
Author Contributions
A.V., N.P. & Q.Z. designed research, Q.Z., B.G., & F.C. parsed data, Q.Z., N.P. & A.V. analysed data. All authors wrote, reviewed and approved the manuscript.
Competing financial interests
The authors declare no competing financial interests.

Figure 2: Spatial distributions of scientific producers and consumers of Physics. The geospatial distribution of scientific producer and consumer cities. (A) The world map of producers and consumers at the city level in (top) and (bottom). A producer city, for which the relative unbalance $\Delta S_i > 0$, is coloured on a red scale. A consumer city, for which the relative unbalance $\Delta S_i < 0$, is coloured on a blue scale. The darkness of the colour is proportional to the absolute value of the unbalance: the larger the absolute value, the darker the colour. (B) The map of producer and consumer cities in the continental United States in (left) and (right). (C) The map of producer and consumer cities in selected European countries in (left) and (right). In (B) and (C), a producer city is marked with a red bar, while a consumer city is marked with a blue bar. The height of each bar is scaled with $|\Delta S_i|$. Note that, for visibility, the bar heights in (C) are not on the same scale as those in (B). Maps in panel A were created using ArcGIS® [34], and maps in panels B and C were created using R [35].

Figure 3: Network structure.
The network structures of city-to-city citation networks. (A) The backbones ( α = 0 . ) of the citation networks at the city level within the United States in , , (from left to right). (B) The backbones ( α = 1 , . , . from left to right) of the citation networks at the city level within the European Union 27 countries as well as Switzerland and Norway in , , (from left to right). In (A) and (B), the color shows the direction of links: if node $i$ cites node $j$ there is a link starting with blue and ending with red. (C) The cumulative distribution function of the link weights $F_w(w_{ij}) = P(w \geq w_{ij})$ for the city-to-city citation networks in year , and (from left to right). The maps of networks in (A) and (B) were created using JFlowMap [42].

Figure 4: Knowledge diffusion proxy results. (A) The Top producer cities in the USA in and their Top consumers from the knowledge diffusion proxy algorithm in − . (B) The Top producer cities in the European Union 27 countries as well as Switzerland and Norway in and their Top consumers from the knowledge diffusion proxy algorithm in − . When a producer city becomes a consumer in some year, a grey strip is marked in that year. For each producer city in (A) and (B), the major consumers of the first producer city $m$ in years are plotted as a function of time from to . The size of the bubble in position $(Y, c)$ is also proportional to the counter $g_{m,c}(Y)$ in that year. The consumer cities for each producer are ordered according to the total number of counters in 20 years, i.e., $\sum_{Y = Y_{min}}^{Y_{max}} g_{m,c}(Y)$.

Figure 5: Top 20 ranked cities as a function of time.
The plot summarizes Top ranked cities in , , , and (from left to right), and the relations between the rankings in different years. Grey lines are used when the rank of a city drops out of the Top .

Figure 6: Geospatial distribution of city ranks. (A) The world map of city ranks in (left) and (right). The ranking of each city is represented by color from blue (high ranks) to white (low ranks). (B) The map of ranks for cities in the United States in (left) and (right). (C) The map of ranks for cities in the selected European countries in (left) and (right). In (B) and (C), each city is marked with a bar, and the height of each bar is inversely proportional to the ranking position. The Top rank positions in each region are labelled for reference. Note that, for visibility, the bar heights in (C) are not on the same scale as those in (B). Maps in panel A were created using ArcGIS® [34], and maps in panels B and C were created using R [35].

Figure 7: Correlation between scientific production ranking and ranking based on the number of publications in . The x-axis represents rankings based on the number of papers each city published in , and the y-axis represents the scientific production ranking for each city in . The solid line corresponds to the power-law fitting of the data with slope − . , and separates the space into two regions. In the region below the line (coloured blue), cities gain better rankings from the scientific production ranking algorithm even with relatively fewer publications, such as Chicago and Piscataway. In the region above (coloured green), cities have lower rankings from the algorithm even though they have more papers published, such as Beijing, Berlin, Wako and Shanghai.

Supplementary Information

The database of Physical Review publications used in this paper consists of , articles, each of which is identified by a unique Digital Object Identifier (DOI).
of these articles ( , ) record the publishing year, the author(s) of the article, as well as the corresponding affiliation(s). An article may have more than one affiliation, and the database provides affiliation strings for each article. In total, we have , affiliation strings, and we aim to extract country and city information from the affiliation strings of each article.

We observe that an affiliation string likely stands for a single affiliation, roughly consisting of several comma separated fields:

(SUB-INSTITUTE)*, (INSTITUTE), (OTHER INFORMATION)*, (CITY), (OTHER INFORMATION)*, (COUNTRY/STATE)

where ‘SUB-INSTITUTE’ means a department, college, institute, or laboratory within an institute, the asterisk refers to any repetition of the field (including zero), and ‘OTHER INFORMATION’ usually means the province (or region) name, postal codes, or P. O. Box. For instance,

PHYSICS DEPARTMENT, THE ROCKEFELLER UNIVERSITY, NEW YORK, NEW YORK
THE INSTITUTE FOR PHYSICAL SCIENCES, THE UNIVERSITY OF TEXAS AT DALLAS, P. O. BOX 688, RICHARDSON, TEXAS
PHYSICS DEPARTMENT, UNIVERSITY OF GUELPH, GUELPH, ONTARIO N1G 2W1, CANADA
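The field structure above can be made concrete with a minimal Python sketch. The helper names (`split_fields`, `drop_numeric_words`, `cleaned_fields`) are hypothetical and not taken from the paper. Dropping every word that contains a digit is an illustrative assumption about how postal codes and box numbers could be ignored, not a description of the authors' code.

```python
import re


def split_fields(affiliation):
    """Return the trimmed, non-empty comma separated fields of a string."""
    return [field.strip() for field in affiliation.split(",") if field.strip()]


def drop_numeric_words(field):
    """Remove words containing a digit (assumed to be postal codes etc.)."""
    return " ".join(w for w in field.split() if not re.search(r"\d", w))


def cleaned_fields(affiliation):
    """Split into fields, drop digit-bearing words, discard empty fields."""
    fields = [drop_numeric_words(f) for f in split_fields(affiliation)]
    return [f for f in fields if f]


example = ("PHYSICS DEPARTMENT, UNIVERSITY OF GUELPH, GUELPH, "
           "ONTARIO N1G 2W1, CANADA")
fields = cleaned_fields(example)
# In the pattern above, the last field is the COUNTRY/STATE candidate.
country_or_state_candidate = fields[-1]
```

In this sketch the last cleaned field is only a candidate: it still has to be looked up in a list of known country and state names.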
Figure 8 shows the probability distribution of the number of comma separated fields for all affiliation strings. The mean value of such numbers is . and the standard deviation is . . of all affiliation strings have between 3 and 5 comma separated fields, while the percentage rises to for those with fewer than 8 such fields (mean ± σ ). Therefore, we first assume that an affiliation string with no more than 7 comma separated fields represents a single affiliation, and that the remaining ones may consist of multiple affiliations.

Figure 8: The grey area in the plot represents the band with the width of 3 standard deviations, which implies that most affiliation strings consist of no more than 7 comma separated fields.

We first extract country and U.S. state names from single affiliation strings. To find country names, we create a dataset of country names other than the U.S. from the ISO 3166 country codes [?], and of the names of U.S. states from Wikipedia [?]. We manually add some historical country names of the 20th century (e.g., the Soviet Union, Yugoslavia, East Germany) to the dataset. In addition, for some countries we take into consideration name variations, such as full official names, the name in the official language, and possible abbreviations, e.g., U.S.S.R for the Soviet Union, People’s Republic of China for China, Deutschland for Germany, etc.

Based on the above assumptions and observations, for an affiliation string with no more than 7 comma separated fields, we first search for the field representing a country name, a process we call ‘field match’. For each field in an affiliation string, we eliminate the words containing the digits 0-9, which may represent a postal code, and then try to match the field with any of the country names in our country name dataset.

If there is no field match for an affiliation string, it is possible that either the author did not write a country name explicitly but some other field, like the institution name, includes one (e.g., RANDAL MORGAN LABORATORY OF PHYSICS, UNIVERSITY OF PENNSYLVANIA), or the country name is mixed with other information in a field, like a city name or a non-numeric postal code (e.g., MAX-PLANCK-INSTITUT FÜR MOLEKULARE PHYSIOLOGIE POSTFACH 500247 D-44202 DORTMUND GERMANY). Moreover, for the affiliation strings with ‘field match’ results, other fields in that string may also contain country names in cases of multiple affiliations (e.g., ARGONNE NATIONAL LABORATORY, ARGONNE, ILLINOIS 60439 AND OHIO STATE UNIVERSITY, COLUMBUS, OHIO). For affiliation strings without field match results, we try to match the country name word by word in all fields of the string, and for the ones with some field matched, we match the country names word by word in the other fields. We call this process ‘string match’. If there is a single match from the above two steps, we assign the matched country name to this affiliation string, and classify it among the affiliation strings with a unique country name. If multiple country names are matched, we set the affiliation string aside for later processing.

The above two procedures of ‘field match’ and ‘string match’ give a unique country name to . affiliation strings ( , out of , ), but . ( , out of , ) affiliation strings have no country name detected. The remaining affiliation strings either contain more than one country name or have more than 7 fields, which may represent multiple affiliations.

The next step is ‘splitting the multiple affiliations’ into single records. The case of an affiliation string with multiple country names varies. For instance, it may represent one affiliation but include country names with overlapping words (e.g., Mexico vs. New Mexico for the string match procedure, as in THE UNIVERSITY OF NEW MEXICO, ALBUQUERQUE NEW MEXICO, and Washington vs. Washington, D.C. for the field match procedure, as in THE GEORGE WASHINGTON UNIVERSITY, WASHINGTON, D.C.); or some country names may represent a city, a region or a street (e.g., ST. JOHN’S UNIVERSITY, JAMAICA, NEW YORK); or the constituent states of some historical countries (e.g., FACULTY OF CIVIL ENGINEERING, UNIVERSITY OF BELGRADE, BULEVAR REVOLUCIJE 73, 11000 BEOGRAD, SRBIJA, YUGOSLAVIA). We go through this scenario first, and try to filter out the affiliation strings that represent a unique affiliation. We assume that two country names cannot appear in neighboring fields or in neighboring words. Thus, if we find two country names in neighboring fields, we consider the latter one as the real country name. If two country names are in the same comma separated field, we determine the country name(s) based on their position. We assign an index to each of the words in that field according to the order of the words. If the number of words between the first indices of the two country names is less than the number of words of the longer country name, the longer country name is taken as the country name. For instance, in the above example THE UNIVERSITY OF NEW MEXICO, ALBUQUERQUE NEW MEXICO, we find two country names in the second field: NEW MEXICO and MEXICO, with word indices 2 and 3 respectively. The number of words between the two indices is 1, which is smaller than the length of NEW MEXICO, so we determine that NEW MEXICO is the country name for this affiliation.

After performing the multiple name checking described above, we consider the remaining affiliation strings as consisting of multiple affiliations. We observe that the affiliation strings in this scenario usually contain elements implying multiplicity, like AND and semicolons. For example:

THE RICE INSTITUTE, HOUSTON, TEXAS AND THE COLLEGE OF THE PACIFIC, STOCKTON, CALIFORNIA
INSTITUTE FOR ADVANCED STUDY, PRINCETON, NEW JERSEY 08540 AND PHYSICS DEPARTMENT, CALIFORNIA INSTITUTE OF TECHNOLOGY, PASADENA, CALIFORNIA
ISTITUTO DI FISICA DELL’UNIVERSITA, ROMA, ITALY; AND ISTITUTO NAZIONALE DI FISICA NUCLEARE, SEZIONE DI ROMA, ITALY
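As a rough illustration of how strings like these might be separated, the Python sketch below splits first on semicolons and then on an AND that closely follows a known country or state name, unless the AND joins two subject words (as in PHYSICS AND ASTRONOMY). The function names, the tiny `places` and `descriptive` sets, and the punctuation handling are all hypothetical choices made for this example. It is a sketch under those assumptions, not the authors' implementation.

```python
def _norm(word):
    """Upper-case a word and strip surrounding punctuation."""
    return word.strip(",.;").upper()


def _place_ends_at(words, end, places):
    """True if some (possibly multi-word) place name ends at index `end`."""
    for name in places:
        parts = name.split()
        begin = end - len(parts) + 1
        if begin >= 0 and words[begin:end + 1] == parts:
            return True
    return False


def split_on_and(text, places, descriptive):
    """Split at an AND preceded (within one word) by a place name."""
    raw = text.split()
    words = [_norm(w) for w in raw]
    pieces, start = [], 0
    for k, word in enumerate(words):
        if word != "AND":
            continue
        if k == start:            # leading AND left over from a ';' split
            start = k + 1
            continue
        if k == len(words) - 1:   # trailing AND joins nothing
            continue
        joins_subjects = (words[k - 1] in descriptive
                          and words[k + 1] in descriptive)
        near_place = (_place_ends_at(words, k - 1, places)
                      or (k >= 2 and _place_ends_at(words, k - 2, places)))
        if near_place and not joins_subjects:
            pieces.append(" ".join(raw[start:k]))
            start = k + 1
    pieces.append(" ".join(raw[start:]))
    return [p.strip(" ,") for p in pieces if p.strip(" ,")]


def split_affiliations(text, places, descriptive):
    """Split by semicolons first, then by qualifying ANDs."""
    result = []
    for chunk in text.split(";"):
        result.extend(split_on_and(chunk, places, descriptive))
    return result


# Hypothetical miniature lookups, for illustration only.
places = {"TEXAS", "CALIFORNIA", "NEW JERSEY", "NEW MEXICO", "ITALY"}
descriptive = {"PHYSICS", "ASTRONOMY"}
parts = split_affiliations(
    "THE RICE INSTITUTE, HOUSTON, TEXAS AND THE COLLEGE OF THE PACIFIC, "
    "STOCKTON, CALIFORNIA", places, descriptive)
```

Allowing the place name to end one word before the AND lets a trailing postal code sit between the two, as in the second example above.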
If there are semicolons in the affiliation string, we split it at the position of each semicolon. However, if there is no semicolon but there is an AND, we have to exclude cases like ‘DEPARTMENT OF PHYSICS AND ASTRONOMY’. To do so, we observe that if an AND joins two affiliations, the country name usually appears shortly before the AND. We therefore split the string into two parts at an AND if the last word of the country name before the AND is at most one word away from it (we allow one word between the country name and the AND because of possible non-numeric postal codes), and the AND does not join any two of the descriptive words for research subjects, which usually appear in the institute and sub-institute information. We built a list of descriptive words by calculating the frequency of word appearance in the first field of all affiliation strings. The 20 most frequent descriptive words are listed in Table 4.

Table 4: The 20 most frequent descriptive words.
word | frequency | word | frequency
PHYSICS | 314266 | RESEARCH | 55692
SCIENCE | 37345 | THEORETICAL | 32976
ASTRONOMY | 32247 | ENGINEERING | 28179
MATERIALS | 27572 | PHYSIK | 24083
CHEMISTRY | 23821 | FISICA | 23649
FÍSICA | 22711 | PHYSIQUE | 21928
NUCLEAR | 21860 | TECHNOLOGY | 18769
SCIENCES | 16999 | APPLIED | 16184
THEORETISCHE | 12994 | MATHEMATICS | 10978
SOLID | 10351 | PHYSICAL | 9194

For the affiliation strings with more than 7 fields, e.g.,

CENTER FOR THEORETICAL PHYSICS, DEPARTMENT OF PHYSICS AND ASTRONOMY, UNIVERSITY OF TEXAS AT AUSTIN, TEXAS 79712; CENTER FOR ADVANCED STUDIES, DEPARTMENT OF PHYSICS AND ASTRONOMY, UNIVERSITY OF NEW MEXICO, ALBUQUERQUE, NEW MEXICO 97131; AND MAX-PLANCK-INSTITUT FÜR QUANTENOPTIK, D-8046 GARCHING BEI MUNCHEN, WEST GERMANY

we first split the string by semicolons but not by AND. The resulting substrings are then processed step by step, from field match to string match and possibly splitting multiple affiliations, in the same way as an affiliation string with no more than 7 fields.

It is worth noting that even after the splitting process, some affiliation strings still contain more than one country name, like

LOS ALAMOS NATIONAL LABORATORY, UNIVERSITY OF CALIFORNIA, LOS ALAMOS, NEW MEXICO

for which the above steps give both California and New Mexico as country names, or

INSTITUTE FOR QUANTUM COMPUTING, UNIVERSITY OF WATERLOO, N2L 3G1, WATERLOO, ON, CANADA, ST. JEROME’S UNIVERSITY, N2L 3G3, WATERLOO, ON, CANADA, AND PERIMETER INSTITUTE FOR THEORETICAL PHYSICS, N2L 2Y5, WATERLOO, ON, CANADA

of which the first substring after splitting by AND (INSTITUTE FOR QUANTUM COMPUTING, UNIVERSITY OF WATERLOO, N2L 3G1, WATERLOO, ON, CANADA, ST. JEROME’S UNIVERSITY, N2L 3G3, WATERLOO, ON, CANADA) still contains another affiliation, with no further semicolon or AND to indicate where to split. Figure 8 shows that on average affiliation strings representing a single affiliation consist of four fields; therefore we split affiliation (sub)strings that contain multiple country names but no semicolon or AND at the position of the country names if the number of fields between two country names is not smaller than 4. Thus the final country names for the affiliation strings of the above two examples are ‘New Mexico’ and three ‘Canada’s respectively.

To double check the results obtained from the above procedures, we use the Google geocoders from the geopy toolbox [?] to get the country names found by Google Maps, and call this step Google geocoders checking. Unfortunately, Google geocoders usually cannot code affiliation strings containing department or even institution information. To avoid these exceptions, for affiliation strings with more than three fields we send only the last three fields as an address string to the geocoders, and for the others we input the whole string. The Google geocoders return a comma separated address string for each input. If the returned string is not empty, we match the country names and their 2-letter or 3-letter abbreviations in our country name dataset against the returned result. If the matched result represents the same country as the one we extracted, we say that the country name we parsed for this affiliation string is validated. It should be noted that we do not use Google geocoders (or other geocoders like Yahoo! or Bing) directly to search for country names because, to the best of our knowledge, there is no evidence guaranteeing the accuracy of the results from these APIs. We therefore perform this checking step to obtain better accuracy.

Figure 9 summarizes the above steps to extract country names from affiliation strings in a flow chart. As a result, the of affiliation strings with multiple country names and more than 7 fields are finally split into , new records. In the end, we obtain , records of single affiliation, of which . ( , ) have a country name validated with Google geocoders. Figure 10 indicates that after 1940 we parsed validated country names for more than of papers in each year.
We use these affiliation strings with validated country names to build citation networks at the country level after 1940, and as the inputs to extract city names.

Figure 9: The flow chart of the procedure to extract country name(s) from affiliation strings.

Figure 10: The percentage of papers (DOIs) with validated country names per year. The plot shows that after 1940 we obtain more than of papers with verified country names (blue bars).

We use the GeoNames database to parse the names of cities in the affiliation strings with identified country names. The GeoNames database includes geographical data such as names of villages, cities, and other types of places in various languages, elevation, population and others from various sources [47]. The language variations of geographic names allow us to identify city names written in languages other than English. Each place record in the database also includes its country name and possibly the first level of administrative division (e.g., the states in the United States). We first filter the records that represent cities (by the feature codes attribute in the GeoNames data), and arrange cities by the names of countries and US states. For countries like the Soviet Union and Yugoslavia, we combine the cities of their former constituent countries; for East Germany we simply use the cities in Germany.

The final result of the above section is a set of affiliation strings, each of which has a unique country name, so we argue that, to the best of our effort, each affiliation string now represents only one institution and has at most one city name. Since each affiliation string now has a validated country name, we only use the city list of that country, which avoids confusion between cities with the same name in different countries.

After cleaning the data, the first step to parse city names is ‘field match’, as we performed to find country names. For each field, we delete words with numbers and try to match it with the city names in the filtered city dataset for that country. If there are matched city names, we list both the name and the coordinates as outputs; otherwise we perform ‘string match’ on the affiliation string, trying to match city names word by word.

As we did to validate country names, we use the Google geocoders from the geopy toolbox to check the correctness of the city names we extract from affiliation strings. The procedure is similar to that for the country names: the affiliation strings excluding the department level information are given as input to the Google geocoders, and the non-empty results are saved for the next step of validation. The coordinates and city names given by the Google geocoders for an affiliation string are based on the name of the institution, and may be different from the extracted name and from the coordinates of the city given in the GeoNames database. To determine whether the extracted city name is correct, we simply calculate the geographic distance between the coordinates given by the GeoNames database and the ones given by the Google geocoders, and if the distance is less than 50 km, we say that the extracted result matches the Google result. For affiliation strings with multiple city names, we choose the one with the shortest Vincenty’s distance from the Google geocoded result.

In total, we have . ( , out of , ) affiliation strings with validated city names. Figure 11a shows the percentage of papers (DOIs) with validated city names per year, from which one can observe that we obtain validated city names for more than of papers after 1940; for this reason we use data after that year to perform the analysis at the city level in this paper. Figure 11b displays the percentage of papers with validated city names relative to the total number of papers for each country after 1940. The abscissa shows 60 country names ordered by the total number of papers of each country after 1940. These top 60 countries contribute of the papers published in Physical Review journals after 1940, as shown by the cumulative distribution of the total number of papers for all countries (the red dotted curve). From Figure 11b we argue that for most of the major countries contributing to publications in Physical Review journals the results of parsing city names are unbiased.

Figure 11: The percentage of papers (DOIs) with validated city names per year (a), and the percentage of papers (DOIs) with validated city names per country (b). (a) clearly shows that after 1940 we obtain more than of papers with verified city names for each year (blue bars). In (b), the x-axis shows the top 60 countries ranked by their total number of papers after 1940. The red dotted curve is the cumulative distribution function of the number of papers over countries after 1940. For the major contributing countries in terms of paper production, we have obtained more than of papers with validated city names.

So far we have obtained geographic coordinates and city names for the affiliation strings from the Google geocoders and the GeoNames database. However, different city names may represent the same city, geographically close cities or different administrative levels. For instance,
DEPARTMENT OF PHYSICS, BOSTON COLLEGE, BOSTON, MASSACHUSETTS 02467, USA
DEPARTMENT OF PHYSICS, BOSTON COLLEGE, CHESTNUT HILL, MASSACHUSETTS

Because Chestnut Hill is not a city in Massachusetts in the GeoNames database, the city name extracted from these two affiliation strings for Boston College is Boston, while the Google geocoders give the city name of Newton. In this case, one cannot automatically determine which city this affiliation should be in. One possible way to solve this problem is to project the coordinates onto polygons of ‘cities’ in shapefiles for geographic information systems software. However, the existing shapefiles have different granularities for different countries, and it may be unfair to compare scientific production across countries at different levels of administrative units.

Therefore, we cluster cities according to their geographic coordinates into ‘urban areas’ or ‘academic cities’ in each country. For each country, we perform hierarchical/agglomerative clustering with the geographic distance matrix, in which the distances are calculated with Vincenty’s formula. With the dendrogram produced by the clustering process, we cut the branches from the maximum height value downwards until, for all clusters, the distance between any point in a cluster and the centroid of the cluster is less than 25 km (the maximum distance within the cluster is 50 km). We call such clusters ‘academic cities’. The final coordinates of an academic city are the centroid of all coordinates inside that cluster, and the academic city is named after the city with the most papers in that cluster. We notice that, due to the differences between geographic areas in different countries, some cities are merged into one academic city and some other cities are split into two. For instance, Boston, Cambridge and Newton in Massachusetts are now clustered into one urban area with the name Boston, and Dubna in Moscow Oblast now becomes a separate academic city. Finally, we have a list of academic cities for each paper (DOI), and all the analyses we made at the city level in this paper refer to the urban areas or academic cities.
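The clustering step can be illustrated with a small, dependency-free Python sketch. It makes several simplifying assumptions that differ from the procedure described above: it uses the haversine great-circle distance in place of Vincenty’s formula, and it uses complete-linkage merging that stops once a merge would create a cluster wider than 50 km, instead of cutting a dendrogram on the 25 km distance-to-centroid criterion. The function names, coordinates and paper counts are hypothetical values chosen only for the example.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_cities(coords, papers, max_diameter_km=50.0):
    """Agglomerate cities while every cluster stays within max_diameter_km.

    coords: dict city -> (lat, lon); papers: dict city -> number of papers.
    Returns dict label -> {"members": [...], "centroid": (lat, lon)}.
    """
    clusters = [[name] for name in coords]

    def linkage(c1, c2):
        # Complete linkage: the largest pairwise distance between members.
        return max(haversine_km(coords[a], coords[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        dist, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if dist > max_diameter_km:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    result = {}
    for members in clusters:
        # Name the cluster after its most productive city.
        label = max(members, key=lambda name: papers[name])
        lat = sum(coords[n][0] for n in members) / len(members)
        lon = sum(coords[n][1] for n in members) / len(members)
        result[label] = {"members": sorted(members), "centroid": (lat, lon)}
    return result


# Approximate coordinates and made-up paper counts, for illustration only.
coords = {"Boston": (42.3601, -71.0589), "Cambridge": (42.3736, -71.1097),
          "Newton": (42.3370, -71.2092), "Amherst": (42.3732, -72.5199)}
papers = {"Boston": 100, "Cambridge": 80, "Newton": 5, "Amherst": 20}
academic_cities = cluster_cities(coords, papers)
```

With these inputs Boston, Cambridge and Newton end up in one cluster labelled Boston, while Amherst stays separate. The centroid here is a plain average of latitudes and longitudes, which is only a reasonable approximation for clusters this small.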
A citation network consists of a set of nodes (cities) and directed links representing citations, i.e., a paper written in one city being cited by a paper written in another according to the references of the latter. For example, if a paper written in node $i$ cites a paper written in node $j$, there is an edge from $i$ to $j$, i.e., $j$ receives a citation from $i$ and $i$ sends a citation to $j$. As shown in Figure 1 in the main text, a directed link from Ann Arbor to Rome and another link to Madrid are built since paper A, which is from Ann Arbor, Michigan, cites paper B from Rome, Italy and Madrid, Spain. Because paper A was also contributed by authors from two other cities, Los Alamos in New Mexico and New York City in New York, from each of these two cities there is also a link to Rome and another to Madrid.

The weight of a link is defined as follows. In a given time window, the total number of citations that the papers written in $j$ receive from papers written in $i$ is the weight of the link $(i \to j)$, and the total number of citations that the papers written in $j$ send to papers written in $k$ is the weight of the link $(j \to k)$. For instance, if in time window $t$ there is one paper written in node $j$, which cites two papers written in node $k$ and is cited by three papers written in node $i$, then $w_{i,j} = 3$ and $w_{j,k} = 2$; we add up such weights over all papers written in node $j$ to obtain the weights of its links. For a paper written in multiple cities, say $j_1, j_2$, the weight is counted equally, i.e., $w_{i,j_1} = w_{i,j_2}$, $w_{j_1,k} = w_{j_2,k}$. The time window we use in this paper is 1 year.

We observe a significant growth of the published articles and the citations in recent years, as shown in Figure 12. Meanwhile, the percentage of papers contributed by authors in the United States has decreased from nearly in early 1960’s to current (Figure 13). Correspondingly, the number of cities contributing to publications in APS journals, as well as their internal interactions, has increased dramatically, as illustrated in Figure 14 and Figure 15.

In Table 5 we report basic statistical properties of the city-to-city citation networks in selected years. Figure 16a reports the cumulative distribution functions of the in- and out-degree of the city-to-city citation networks in different years. The distributions are close to a power law with an exponential cutoff. As the year increases, the range of values of $k_{in}$ and $k_{out}$ extends. We define the in/out-strength of node $i$ as the total number of citations it receives/sends in that year. Figure 16b displays the cumulative distribution functions of the in- and out-strength of the city-to-city citation networks in different years. The pattern of the strength distributions is quite similar to that of the degree distributions.

Figure 12: The number of papers (top) and the number of citations (bottom) as a function of time (1960-2009).

Figure 13: The percentage of papers contributed by authors from the USA as a function of time (1960-2009).

Table 5: Summary of basic statistic features for city-to-city citation networks in different years.
year | V | E | k_in mean | k_in std. | k_in min | k_in max | k_out mean | k_out std. | k_out min | k_out max | S_in mean | S_in std. | S_in min | S_in max | S_out mean | S_out std. | S_out min | S_out max | w_ij mean | w_ij std. | w_ij min | w_ij max
1960 | 222 | 2517 | 11.34 | 18.13 | 0 | 90 | 11.34 | 15.20 | 0 | 84 | 41.24 | 111.16 | 0 | 765 | 41.24 | 95.99 | 0 | 940 | 3.64 | 11.57 | 1 | 336
1970 | 438 | 9461 | 21.60 | 38.97 | 0 | 236 | 21.60 | 26.72 | 0 | 153 | 87.53 | 288.39 | 0 | 2893 | 87.53 | 198.54 | 0 | 1758 | 4.05 | 13.98 | 1 | 564
1980 | 635 | 17028 | 26.82 | 47.96 | 0 | 332 | 26.82 | 34.84 | 0 | 206 | 94.08 | 311.71 | 0 | 4182 | 94.08 | 213.94 | 0 | 2164 | 3.51 | 11.02 | 1 | 557
1990 | 897 | 43324 | 48.30 | 80.31 | 0 | 539 | 48.30 | 58.37 | 0 | 329 | 207.59 | 671.95 | 0 | 9125 | 207.59 | 459.34 | 0 | 4372 | 4.30 | 13.00 | 1 | 830
2000 | 1327 | 109438 | 82.47 | 126.79 | 0 | 754 | 82.47 | 102.83 | 0 | 556 | 801.76 | 2640.94 | 0 | 34768 | 801.76 | 2167.73 | 0 | 20862 | 9.72 | 29.71 | 1 | 1568
2009 | 1704 | 204747 | 120.16 | 178.22 | 0 | 968 | 120.16 | 151.16 | 0 | 822 | 3033.86 | 9230.21 | 0 | 104149 | 3033.86 | 8651.34 | 0 | 76044 | 25.25 | 75.12 | 1 | 3004

Figure 16: The cumulative distribution function of degree and strength for city-to-city citation networks in year 1960, 1970, 1980, 1990, 2000 and 2009. (a) The cumulative distribution function of the degrees for citation networks at the city level. (b) The cumulative distribution function of the strength for citation networks at the city level.
Top producers/consumers and results from knowledge diffusion proxy
In Figure 17 we show the cumulative distribution of the absolute citation unbalance |∆s| for producers and consumers at the city level. As with the cumulative distributions of strength, these distributions are characterized by heavy tails, and they have become broader over time.

We list the top 20 producers and consumers at the city level from 1985 to 2009 (Table 6) and from 1960 to 1980 (Table 7). It is worth noting that the unbalance ∆s is defined as the difference between the number of citations sent and received, so it cannot distinguish cities with large volumes of both production and consumption from cities with small volumes of both.

Figure 17: The cumulative distribution function of the citation unbalance for producers and consumers at the city level in the years 1960, 1970, 1980, 1990, 2000 and 2009.

Table 6: Top 20 producers and consumers at the city level (1985-2009)

(a) Top 20 producer cities
rank | 1985 | 1990 | 1995 | 2000 | 2005 | 2009
1 | Piscataway | Piscataway | Piscataway | Boston | Boston | Boston
2 | Boston | Boston | Boston | Piscataway | New York City | Berkeley
3 | Berkeley | Palo Alto | Yorktown Heights | Los Angeles | Los Angeles | New Haven
4 | Princeton | Yorktown Heights | Berkeley | Berkeley | Tallahassee | Suwon
5 | Yorktown Heights | Berkeley | Los Angeles | Chicago | Palo Alto | Princeton
6 | Ithaca | Princeton | Urbana | New York City | Berkeley | Piscataway
7 | New York City | Ithaca | New York City | Lemont | Piscataway | Higashihiroshima
8 | DC | New York City | Chicago | Urbana | Urbana | Prairie View
9 | Palo Alto | San Diego | Ithaca | Philadelphia | Pavia | Los Angeles
10 | Lemont | Philadelphia | Lemont | Princeton | West Lafayette | Lubbock
11 | Los Angeles | Chicago | Princeton | West Lafayette | Ithaca | Palo Alto
12 | Chicago | Santa Barbara | Palo Alto | Batavia | Rochester | Batavia
13 | San Diego | Pittsburgh | Santa Barbara | Rochester | Honolulu | New York City
14 | Seattle | Lemont | Philadelphia | Yorktown Heights | Batavia | Nashville
15 | Rehovot | Los Angeles | Minneapolis | Palo Alto | Yorktown Heights | Bristol
16 | New Haven | New Haven | San Diego | Dallas | Irvine | Rochester
17 | Urbana | Orsay | Batavia | Tsukuba | Lemont | Urbana
18 | Pittsburgh | Holmdel | Zurich | Waltham | Minneapolis | Daegu
19 | Villigen | Stony Brook | Waltham | Madison | Philadelphia | Tallahassee
20 | Waltham | Batavia | Madison | East Lansing | Boulder | Pittsburgh

(b) Top 20 consumer cities
rank | 1985 | 1990 | 1995 | 2000 | 2005 | 2009
1 | Stuttgart | Tokyo | Moscow | Beijing | Beijing | Athens
2 | Toronto | Beijing | Beijing | Seoul | Barcelona | Gwangju
3 | Gaithersburg | Tsukuba | Seoul | Lancaster | Coventry | Bratislava
4 | Annandale | Tallahassee | East Lansing | Grenoble | Valencia | Vancouver
5 | Bloomington | Vancouver | Lubbock | Dubna | Perugia | Madrid
6 | Minneapolis | Grenoble | Montreal | Manhattan | Moscow | Berlin
7 | Warsaw | Seoul | Tallahassee | Quito | Heidelberg | Trieste
8 | Berlin | Kolkata | Davis | Suwon | London | Mainz
9 | Vancouver | Charlottesville | Dallas | Stillwater | Dubna | Waco
10 | Ames | Durham | Taipei | Santander | Riverside | Paris
11 | West Lafayette | Buffalo | Berlin | Lawrence | Amsterdam | Valencia
12 | Charlottesville | Warsaw | Tokyo | Kraków | Hefei | Coventry
13 | Seoul | Tempe | Toyonaka | Marseille | Dresden | Moscow
14 | Montreal | Berlin | Delhi | Tokyo | Bellaterra | Bellaterra
15 | Trieste | Madrid | Trieste | Karlsruhe | Shanghai | Lanzhou
16 | Kyoto | Sao Paulo | St Petersburg | Daegu | Evanston | Shanghai
17 | Tokyo | Taipei | Dresden | Udine | Taipei | Sao Paulo
18 | Varanasi | Brussels | Bologna | Oxford | Glasgow | Kolkata
19 | Rio De Janeiro | Mainz | Munich | Moscow | Liverpool | Clermont
20 | Ridgefield | Davis | Cambridge | Ruston | Bari | Hefei

Table 7: Top 20 producers and consumers at the city level (1960-1980)

(a) Top 20 producer cities
rank | 1960 | 1965 | 1970 | 1975 | 1980
1 | Boston | Princeton | Berkeley | Boston | Boston
2 | Princeton | Berkeley | Boston | Berkeley | Princeton
3 | Urbana | Boston | Princeton | Palo Alto | Piscataway
4 | Oak Ridge | Piscataway | Chicago | Princeton | Berkeley
5 | Piscataway | New York City | Piscataway | Piscataway | Palo Alto
6 | New York City | Los Angeles | Palo Alto | Ithaca | Ithaca
7 | Los Angeles | Los Alamos | Albany | Chicago | New York City
8 | Los Alamos | Albany | San Diego | Oak Ridge | Chicago
9 | Chicago | Ann Arbor | Madison | San Diego | San Diego
10 | Ithaca | Pittsburgh | New York City | New Haven | Los Angeles
11 | Rochester | Meyrin | Pittsburgh | Los Angeles | Stony Brook
12 | DC | Waltham | Waltham | Urbana | New Haven
13 | Madison | Urbana | Meyrin | Pittsburgh | Philadelphia
14 | Bloomington | Cambridge | Ithaca | Batavia | Albany
15 | Utrecht | Bloomington | Cambridge | Providence | Urbana
16 | Durham | Lemont | Los Angeles | Albany | Albuquerque
17 | London | Ithaca | Los Alamos | Durham | Waltham
18 | Saskatoon | DC | New Haven | Rochester | Batavia
19 | Sydney | Chicago | Livermore | Livermore | College Park
20 | St Louis | Zurich | London | DC | Pittsburgh

(b) Top 20 consumer cities
rank | 1960 | 1965 | 1970 | 1975 | 1980
1 | Berkeley | West Lafayette | Evanston | Stony Brook | Austin
2 | Palo Alto | Palo Alto | West Lafayette | Grenoble | Boulder
3 | New Haven | Orsay | Austin | Columbus | Tokyo
4 | Pittsburgh | College Park | Trieste | Stuttgart | Haifa
5 | Waltham | Albuquerque | Columbus | Toronto | Toronto
6 | San Diego | Livermore | Delhi | Austin | Bhubaneswar
7 | Lemont | Delhi | Amherst | East Lansing | Rehovot
8 | Livermore | Minneapolis | Rochester | Amherst | Ottawa
9 | West Lafayette | Trieste | Milwaukee | Mumbai | Paris
10 | Poughkeepsie | Providence | Baton Rouge | Denton | Santa Barbara
11 | Evanston | Ames | Buffalo | Mexico City | Houston
12 | Tallahassee | Rochester | Seattle | Munich | Golden
13 | Columbus | Evanston | Salt Lake City | Paris | Stuttgart
14 | Canberra | San Diego | Haifa | Honolulu | Kolkata
15 | Yorktown Heights | Syracuse | Hoboken | Montreal | Toyonaka
16 | Arlington | Rehovot | Lincoln | Orsay | Kyoto
17 | Rome | Hoboken | Gainesville | Roskilde | Grenoble
18 | Meyrin | Oxford | Tucson | Madison | Jülich
19 | Ames | El Segundo | Bloomington | West Lafayette | Vancouver
20 | Irvine | Milan | East Lansing | Rehovot | Kingston

Top ranked cities from scientific production ranking algorithm
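The cumulative distributions of per-city scores discussed in this section can be built empirically from a plain list of values. The sketch below is a minimal illustration that assumes the complementary form P(X ≥ x), which is a common choice for heavy-tailed data; the function name `ccdf` and the sample scores are hypothetical and are not values from the dataset.

```python
def ccdf(values):
    """Empirical complementary cumulative distribution P(X >= x).

    Returns a list of (x, fraction of observations >= x), one entry
    per distinct x, in ascending order of x.
    """
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # Emit one point per distinct value, at its first occurrence,
        # where exactly n - i observations are >= x.
        if i == 0 or x != ordered[i - 1]:
            points.append((x, (n - i) / n))
    return points


# Hypothetical ranking scores for a handful of cities.
scores = [0.001, 0.001, 0.002, 0.004, 0.03]
curve = ccdf(scores)
for x, p in curve:
    print(f"{x:g}\t{p:.2f}")
```

Plotting such points on doubly logarithmic axes is the usual way to inspect the tail; a steeper tail means smaller gaps between the scores of the highest-ranked entries.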
We show the cumulative distribution of scientific production ranking scores for cities in selected years in Figure 18. The ranking scores are also characterized by heavy-tailed distributions. In addition, both the maximum and minimum ranking scores have decreased with time, and the tail of the distribution has become steeper in recent decades, which indicates that the differences in ranking scores between top-ranked cities have gradually shrunk.

Figure 18: The cumulative distribution function of scientific production ranking scores for cities in the years 1960, 1970, 1980, 1990, 2000 and 2009.

In Table 8 and Table 9 we report the top 50 cities ranked by the scientific production ranking algorithm from 1985 to 2009 and from 1960 to 1980, respectively.

Table 8: Top 50 cities from scientific production ranking algorithm (1985-2009)
rank | 1985 | 1990 | 1995 | 2000 | 2005 | 2009
1 | Piscataway | Piscataway | Boston | Boston | Boston | Boston
2 | Boston | Boston | Piscataway | Berkeley | Los Angeles | Berkeley
3 | Berkeley | Berkeley | Berkeley | Piscataway | Berkeley | Los Angeles
4 | Palo Alto | Palo Alto | Los Angeles | Los Angeles | Orsay | Tokyo
5 | New York City | Yorktown Heights | New York City | New York City | Tokyo | Orsay
6 | Los Angeles | Los Angeles | Urbana | Chicago | Princeton | Chicago
7 | Ithaca | New York City | Chicago | Urbana | Piscataway | Paris
8 | Los Alamos | Los Alamos | Lemont | Rochester | Palo Alto | Princeton
9 | Princeton | Princeton | Palo Alto | Batavia | New York City | Rome
10 | Yorktown Heights | Urbana | Batavia | West Lafayette | Philadelphia | Piscataway
11 | Lemont | Chicago | Philadelphia | Lemont | Urbana | London
12 | Urbana | Philadelphia | Madison | Orsay | Santa Barbara | Urbana
13 | Chicago | Ithaca | Rochester | East Lansing | Rome | Lemont
14 | Philadelphia | Lemont | West Lafayette | Ann Arbor | Columbus | Philadelphia
15 | Orsay | Orsay | Orsay | Tokyo | College Park | Oxford
16 | DC | Santa Barbara | Princeton | College Station | New Haven | Santa Barbara
17 | College Park | College Park | Los Alamos | Tsukuba | Lemont | New Haven
18 | Oak Ridge | Oak Ridge | Rome | Philadelphia | Madison | Rochester
19 | Santa Barbara | Livermore | Tsukuba | Palo Alto | Paris | Madison
20 | Rochester | Batavia | Santa Barbara | Madison | San Diego | Columbus
21 | Rehovot | Tokyo | Yorktown Heights | College Park | Chicago | College Park
22 | San Diego | Rochester | College Station | Pittsburgh | Tsukuba | Batavia
23 | Pittsburgh | San Diego | Pittsburgh | Rome | Oxford | Moscow
24 | New Haven | Columbus | Ithaca | Princeton | Oak Ridge | East Lansing
25 | Stony Brook | Madison | College Park | Los Alamos | Tallahassee | Palo Alto
26 | Seattle | Pittsburgh | New Haven | New Haven | Rochester | Pittsburgh
27 | Columbus | DC | Ann Arbor | Toyonaka | Beijing | San Diego
28 | Boulder | Rehovot | Pisa | Durham | Pittsburgh | Ann Arbor
29 | Paris | Stuttgart | Waltham | Columbus | Ames | Tsukuba
30 | Livermore | Paris | East Lansing | Stony Brook | West Lafayette | Seoul
31 | Madison | Minneapolis | Oak Ridge | Santa Barbara | Batavia | Pisa
32 | Austin | Boulder | Tokyo | Albuquerque | Pisa | West Lafayette
33 | Tokyo | New Haven | Stony Brook | Baltimore | Boulder | Padua
34 | Jülich | West Lafayette | San Diego | Toronto | Padua | Dubna
35 | Zurich | Stony Brook | Minneapolis | Pisa | London | Evanston
36 | Batavia | Bloomington | Baltimore | Tallahassee | Montreal | Ames
37 | Bloomington | Seattle | Padua | Waltham | Livermore | New York City
38 | Minneapolis | Ann Arbor | Toronto | Ithaca | Los Alamos | Toronto
39 | West Lafayette | Austin | Boulder | Moscow | Seoul | Oak Ridge
40 | Ann Arbor | Zurich | Albuquerque | Montreal | East Lansing | Baltimore
41 | East Lansing | Vancouver | Stuttgart | Padua | Moscow | Beijing
42 | Stuttgart | Holmdel | Livermore | San Diego | Nashville | Karlsruhe
43 | Evanston | Rome | DC | Ames | Ann Arbor | Taipei
44 | Grenoble | Ames | Paris | Evanston | College Station | College Station
45 | Syracuse | Waltham | Seattle | Meyrin | Vancouver | Meyrin
46 | Providence | Albuquerque | Rehovot | Gainesville | Irvine | Los Alamos
47 | Ames | Toyonaka | Durham | Honolulu | Taipei | Toyonaka
48 | Albany | Albany | Toyonaka | Paris | Dallas | Liverpool
49 | Waltham | Jülich | Columbus | Oak Ridge | Meyrin | Davis
50 | Nashville | Grenoble | Dallas | Bloomington | Cincinnati | Amsterdam

Table 9: Top 50 cities from scientific production ranking algorithm (1960-1980)
rank | 1960 | 1965 | 1970 | 1975 | 1980
1 | Berkeley | Berkeley | Boston | Boston | Boston
2 | Boston | Boston | Berkeley | Piscataway | Piscataway
3 | New York City | Princeton | Piscataway | Berkeley | Berkeley
4 | Princeton | Piscataway | Palo Alto | Palo Alto | Palo Alto
5 | Chicago | New York City | Princeton | New York City | New York City
6 | Piscataway | Chicago | New York City | Princeton | Princeton
7 | Urbana | Los Angeles | Chicago | Ithaca | Los Angeles
8 | Los Angeles | Urbana | Los Angeles | Los Angeles | Chicago
9 | Ithaca | Palo Alto | Urbana | Chicago | Ithaca
10 | Pittsburgh | Pittsburgh | Ithaca | Lemont | Lemont
11 | Oak Ridge | Lemont | Pittsburgh | Urbana | Los Alamos
12 | Los Alamos | DC | Lemont | Batavia | Philadelphia
13 | DC | Ithaca | San Diego | Philadelphia | Urbana
14 | Rochester | Los Alamos | Oak Ridge | Oak Ridge | Oak Ridge
15 | Philadelphia | Albany | Philadelphia | Pittsburgh | College Park
16 | Albany | Oak Ridge | DC | College Park | Batavia
17 | Palo Alto | Philadelphia | Albany | DC | Orsay
18 | Lemont | Waltham | New Haven | San Diego | Stony Brook
19 | New Haven | New Haven | Waltham | Rochester | DC
20 | Madison | Madison | College Park | Los Alamos | Pittsburgh
21 | College Park | San Diego | Los Alamos | New Haven | Rochester
22 | Bloomington | College Park | Madison | Madison | Yorktown Heights
23 | Waltham | Rochester | Rochester | Waltham | New Haven
24 | Ann Arbor | Ann Arbor | Ann Arbor | Stony Brook | San Diego
25 | Minneapolis | Livermore | West Lafayette | Yorktown Heights | Rehovot
26 | West Lafayette | West Lafayette | Livermore | Albany | Madison
27 | Houston | Meyrin | Minneapolis | Orsay | Livermore
28 | Syracuse | Seattle | Rehovot | Seattle | Seattle
29 | Livermore | Minneapolis | Oxford | Providence | Waltham
30 | Columbus | Rehovot | London | Livermore | Albany
31 | Durham | Cleveland | Yorktown Heights | Rehovot | Evanston
32 | St Louis | Yorktown Heights | Meyrin | Minneapolis | West Lafayette
33 | Oxford | Oxford | Orsay | Evanston | Austin
34 | Cleveland | London | Ames | Durham | Providence
35 | Baltimore | Bloomington | Evanston | West Lafayette | Minneapolis
36 | Seattle | Evanston | Seattle | Ames | Ann Arbor
37 | Providence | Cambridge | Cleveland | London | Albuquerque
38 | Rehovot | St Louis | Stony Brook | Ann Arbor | Paris
39 | Ames | Syracuse | Cambridge | Cleveland | East Lansing
40 | Cambridge | Ames | Providence | East Lansing | Bloomington
41 | London | Detroit | Durham | Albuquerque | Cleveland
42 | Ottawa | Columbus | Santa Barbara | Austin | College Station
43 | Tokyo | Durham | Boulder | Oxford | Zurich
44 | Meyrin | Orsay | Riverside | Santa Barbara | Oxford
45 | Detroit | Houston | St Louis | St Louis | Ames
46 | South Bend | Boulder | Hamburg | Boulder | London
47 | Birmingham | Baltimore | Detroit | Columbus | Durham
48 | Jerusalem | Tokyo | Columbus | Zurich | Boulder
49 | San Diego | Paris | Syracuse | Cambridge | St Louis
50 | Sydney | Rome | Bloomington | Rome | Columbus

Relation between research outputs and investment
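The scaling relation examined in this section can be estimated with an ordinary least-squares fit in log-log space, where a power law y = a·x^β becomes a straight line with slope β. The sketch below shows one common way to do this and is not necessarily the fitting procedure behind Figure 19; the function name `fit_power_law` and the synthetic data points are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + beta * log(x).

    Returns (a, beta). All inputs must be strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    beta = sxy / sxx                      # slope in log-log space
    a = math.exp(mean_y - beta * mean_x)  # prefactor from the intercept
    return a, beta


# Synthetic, noise-free data generated from y = 3 * x (exactly linear);
# these are NOT World Bank or APS values.
spending = [10.0, 100.0, 1000.0, 10000.0]
citations = [3.0 * x for x in spending]
a, beta = fit_power_law(spending, citations)
print(f"a = {a:.3f}, beta = {beta:.3f}")
```

A fitted exponent β close to 1 corresponds to approximately linear scaling, while values above or below 1 would indicate super-linear or sub-linear growth of citations with the chosen input.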
In this section, we report the relation between research output (i.e., citations) and investment in scientific research. As discussed earlier, we parsed city information based on country information for each affiliation; we can therefore aggregate the citations of cities to their countries and measure the relation between research output and research investment in each country. In Figure 19 we plot the average number of citations received by each country in 1996-2009 against the average amount of gross domestic product (GDP) spent on research and development (R&D) (in current US dollars) in that country over the same period. We also plot the average number of citations received by each country in the same period against the average research population in that country within the same time window. The number of citations received scales approximately linearly with both quantities. These findings are consistent with the results reported in [7], which studied the database of the Institute for Scientific Information (ISI). This similarity indicates that, although the APS dataset is limited, it is representative of the scientific production of major countries. The data on GDP, the fraction of GDP spent on R&D, and the research population are from the World Bank [32].

Figure 19: Relation between research outputs and investment. (A) The average citations received by each country as a function of the average GDP spent on research and development (R&D), in million US dollars, from 1996 to 2009. (B) The average citations received by each country as a function of the average research population in that country from 1996 to 2009. The solid black line shows the power-law fitting with the exponent . and . respectively.

References

[1] F. Narin and M. P. Carpenter, “National Publication and Citation Comparisons,” Journal of the American Society for Information Science, vol. 26, pp. 80–93, 1975.
[2] J. D. Frame, F. Narin, and M. P. Carpenter, “The Distribution of World Science,” Social Studies of Science, vol. 7, pp. 501–516, 1977.
[3] R. M. May, “The Scientific Wealth of Nations,” Science, vol. 275, pp. 793–796, 1997.
[4] M. Batty, “The Geography of Scientific Citation,” Environ Plan A, vol. 35, pp. 761–765, 2003.
[5] L. Leydesdorff and P. Zhou, “Are the contributions of China and Korea upsetting the world system of science?,” Scientometrics, vol. 63, pp. 617–630, 2005.
[6] H. Horta and F. Veloso, “Opening the box: comparing EU and US scientific output by scientific field,” Technological Forecasting & Social Change, vol. 74, pp. 1334–1356, 2007.
[7] R. K. Pan, K. Kaski, and S. Fortunato, “World citation and collaboration networks: uncovering the role of geography in science,” Scientific Reports, vol. 2, p. 902, 2012.
[8] A. Mazloumian, D. Helbing, S. Lozano, R. P. Light, and K. Börner, “Global multi-level analysis of the ‘scientific food web’,” Scientific Reports, vol. 3, p. 1167, 2013.
[9] K. Frenken, S. Hardeman, and J. Hoekman, “Spatial scientometrics: Towards a cumulative research program,” Journal of Informetrics, vol. 3, pp. 222–232, 2009.
[10] S. Redner, “How popular is your paper? An empirical study of the citation distribution,” Eur. Phys. J. B, vol. 4, pp. 131–134, 1998.
[11] P. Chen, H. Xie, S. Maslov, and S. Redner, “Finding scientific gems with Google’s PageRank algorithm,” Journal of Informetrics, vol. 1, pp. 8–15, 2007.
[12] E. Garfield, “Citation Analysis as a Tool in Journal Evaluation,” Science, vol. 178, pp. 471–479, 1972.
[13] C. Bergstrom, “Eigenfactor: Measuring the value and prestige of scholarly journals,” College & Research Libraries News, vol. 68, pp. 314–316, 2007.
[14] J. E. Hirsch, “An index to quantify an individual’s scientific research output,” Proc. Natl. Acad. Sci., vol. 102, pp. 16569–16572, 2005.
[15] L. Egghe, “Theory and practise of the g-index,” Scientometrics, vol. 69, pp. 131–152, 2006.
[16] J. E. Hirsch, “Does the h index have predictive power?,” Proc. Natl. Acad. Sci., vol. 104, pp. 19193–19198, 2007.
[17] K. Börner, S. Penumarthy, M. Meiss, and W. Ke, “Mapping the Diffusion of Information Among Major U.S. Research Institutions,” Scientometrics, vol. 68, pp. 415–426, 2006.
[18] L. Bornmann, L. Leydesdorff, C. Walch-Solimena, and C. Ettl, “Mapping excellence in the geography of science: An approach based on Scopus data,” Journal of Informetrics, vol. 5, no. 4, pp. 537–546, 2011.
[19] D. K. King, “The scientific impact of nations,” Nature, vol. 430, pp. 311–316, 2004.
[20] J. Adams, “Collaborations: The rise of research networks,” Nature, vol. 490, pp. 335–336, 2012.
[21] G. Laudel, “Studying the brain drain: Can bibliometric methods help?,” Scientometrics, vol. 57, pp. 215–237, 2003.
[22] R. V. Noorden, “Global mobility: Science on the move,” Nature, vol. 490, pp. 326–329, 2012.
[23] E. Garfield, Citation Indexing. Its Theory and Application in Science, Technology, and Humanities. John Wiley & Sons Inc., 1979.
[24] L. Egghe and R. Rousseau, Introduction to Informetrics: Quantitative Methods in Library, Documentation and Information Science. Elsevier Science Publishers, 1990.
[25] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM, vol. 46, no. 5, pp. 604–632, 1999.
[26] D. Walker, H. Xie, K.-K. Yan, and S. Maslov, “Ranking scientific publications using a model of network traffic,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2007, p. P06010, 2007.
[27] C. Castillo, D. Donato, and A. Gionis, “Estimating Number of Citations Using Author Reputation,” in String Processing and Information Retrieval, vol. 4726 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2007.
[28] A. Sidiropoulos and Y. Manolopoulos, “Generalized comparison of graph-based ranking algorithms for publications and authors,” Journal of Systems and Software, vol. 79, pp. 1679–1700, 2007.
[29] F. Radicchi, S. Fortunato, B. Markines, and A. Vespignani, “Diffusion of scientific credits and the ranking of scientists,” Phys. Rev. E, vol. 80, p. 056103, 2009.
[30] A. Scharnhorst, K. Börner, and P. van den Besselaar, eds., Models of Science Dynamics: Encounters Between Complexity Theory and Information Sciences. Springer-Verlag, 2012.
[31] APS, “Data sets for research,” 2010.
[32] http://data.worldbank.org/, 2012.
[33] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Comp. Net. ISDN Sys., vol. 30, p. 107, 1998.
[34] ESRI, ArcGIS Desktop: Release 9.3. Environmental Systems Research Institute, Redlands, CA, 2010.
[35] R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.
[36] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, p. 509, 1999.
[37] A. Barrat, M. Barthélemy, and A. Vespignani, Dynamical Processes on Complex Networks. Cambridge University Press, 2008.
[38] M. Newman, Networks. An Introduction. Oxford University Press, 2010.
[39] A. Vespignani, “Predicting the behavior of techno-social systems,” Science, vol. 325, pp. 425–428, 2009.
[40] A. Vespignani, “Modeling dynamical processes in complex socio-technical systems,” Nature Physics, vol. 8, pp. 32–39, 2012.
[41] M. Ángeles Serrano, M. Boguñá, and A. Vespignani, “Extracting the multiscale backbone of complex weighted networks,” Proc. Natl. Acad. Sci., vol. 106, pp. 6483–6488, April 2009.
[42] I. Boyandin, E. Bertini, and D. Lalanne, “Using flow maps to explore migrations over time,” in Proceedings of Geospatial Visual Analytics Workshop in conjunction with The 13th AGILE International Conference on Geographic Information Science (GeoVA), 2010.
[43] J. Adams and Z. Griliches, “Measuring science: An exploration,” Proc. Natl. Acad. Sci., vol. 93, pp. 12664–12670, 1996.
[44] T. S. Rosenblat and M. M. Mobius, “Getting Closer or Drifting Apart?,” Quarterly Journal of Economics, vol. 119, no. 3, pp. 971–1009, 2004.
[45] F. Havemann, M. Heinz, and H. Kretschmer, “Collaboration and distances between German immunological institutes – a trend analysis,” Journal of Biomedical Discovery and Collaboration, vol. 1, p. 6, 2006.
[46] A. Agrawal and A. Goldfarb, “Restructuring Research: Communication Costs and the Democratization of University Innovation,” American Economic Review