Scaling Scientometrics: Dimensions on Google BigQuery as an infrastructure for large-scale analysis
Daniel W Hook∗ and Simon J Porter†
Digital Science, 6 Briset Street, London, EC1M 5NR
Cloud computing has the capacity to transform many parts of the research ecosystem, from particular research areas to overall strategic decision making and policy. Scientometrics sits at the boundary between research and the decision making and evaluation processes of research. One of the biggest challenges in research policy and strategy is having access to data that allows iterative analysis to inform decisions. Many of these decisions are based on “global” measures such as benchmark metrics that are hard to source. In this article, Cloud technologies are explored in this context. A novel visualisation technique is presented and used as a means to explore the potential for scaling scientometrics by democratising both access to data and compute capacity using the Cloud.
I. INTRODUCTION
In recent years cloud technologies have come to be used more extensively in research. The combination of cost-efficient storage and on-demand compute capability has lowered barriers for many who are either not technically savvy or who lack the financial resources to create and maintain large-scale real-world computer infrastructure. In the academic disciplines of bibliometrics and scientometrics, and in the related practical fields of research management, strategy and policy, the use of cloud-based tools is still nascent. On one hand, data volumes are relatively small (at least compared with familiar big-data fields such as particle physics), while on the other, the costs and complexity of arranging access to bibliometric data sources, processing raw data and maintaining analysis-ready datasets have been prohibitive for all but the best-funded researchers, analysts and policymakers.

We argue that Cloud technologies applied in the context of scientometrics have the capacity to democratise not only access to data but also access to analysis. Here we define “analysis” to be the combination of data access together with the capacity to calculate. Data access is often thought to be constrained solely by licence agreements, but it is also characterised by technical limitations. Recent progress has been made in improving access to research metadata [47]. Yet, data licence agreements typically do not make arrangements for the delivery of an often-updated analysis-ready database; rather, they give access either to raw flat-file data that must be processed, structured and mounted into a database format, with regular updates applied to keep the data relevant, or to an API, whose output must go through a similar process to create an analysis-ready database. Beyond this logical data-structuring activity, there has also historically been the need for physical hardware, which effectively defines the computational capacity of the user.
∗ Also at Centre for Complexity Research, Imperial College London, London, SW7 2AZ, UK and Department of Physics, Washington University in St Louis, St Louis, Missouri, US.
† [email protected]

Cloud technologies have the capacity to remove both of these constraints by providing an analysis-ready database and computational capacity on a per-use basis.

Few research areas yet take an approach of providing a Cloud-based central store of research data for researchers to query, manipulate and compute with to support their investigations. However, this type of approach can be seen in the conception of “computable data” introduced by [48] as a result of the development of Wolfram Alpha. In this article we seek to highlight the types of analysis that can be carried out if data is made accessible in the Cloud, as described above, as well as the implications for community ownership of research benchmarks, and the opportunity to place analytical capabilities with a far broader range of stakeholders.

To begin, we provide a working definition of accessibility and use Dimensions on Google BigQuery to explore a simple example related to the field of “knowledge cartography”, which was introduced and explored extensively by [4, 7, 8, 10–12]. We use this example as it has great narrative power and makes global use of a dataset. (Here, by global, we mean that to complete an analysis every record in the dataset must contribute toward the result; a good example of a global calculation is a field-weighted citation normalisation, since this requires the citation counts of every publication in a set for a defined time period.) This example brings together the use of a structured, analysis-ready dataset hosted on the Cloud, with unique identifiers to connect metadata records to spatial information, and on-demand computation, to provide a visualisation that can readily be updated, iterated and provided regularly to stakeholders in a maintainable manner.
We believe that the analysis presented here is entirely novel in a bibliometric or scientometric context. It is remarkable that results of this type have not been presented by other researchers, but we take this to be a hallmark of the limitations of prior computational approaches.
A. Defining data accessibility
The viability and trustworthiness of bibliometric data sources has been a matter of significant attention in the bibliometrics community over recent years [5, 18, 23, 28, 32, 33, 36, 44]. The emergence of new data sources has led to significant analytical efforts to understand the strengths and weaknesses of different approaches to collecting and indexing content [34, 39, 42, 46]. The primary focuses of these works are the assessment of coverage (completeness of journal/subject coverage, and accuracy and completeness of the citation network) together with technical issues around stable construction of field normalisations and other benchmarking details. Both of these areas are foundational in whether a database can be used in bibliometric and scientometric analysis, and whether it is appropriate to use these data in evaluative contexts. More recently, there has been innovative work which extends this standard approach to assess coverage in a different manner to examine suitability of datasets for “bibliometric archeology” [6].

For the purposes of this paper, we characterise the majority of existing comparative analyses as focusing on one or more of five key data facets:

1. coverage–the extent to which a body of metadata covers the class of objects that it sets out to catalogue; explanations of editorial decisions, limitations based on geography, subject or nature;

2. structure–the format and field structure of the metadata; the standards which may be relevant;

3. nature–the parts of the scholarly record being captured (e.g. articles, journals, datasets, preprints, grants, peer reviews, seminars, conference proceedings, conference appearances, altmetric information, and so on); level of granularity;

4. context–provenance; details of enhancement techniques that may have been applied; use of AI or machine-learning algorithms;

5. quality–data homogeneity; field completeness; completeness of coverage; quality of sourcing–how easily can a calculation be performed and how reliable is the resulting analysis?

The first four of these aspects of a dataset define the extent of a “data world” that may be explored and analysed to deliver insight. If we wish to push out the boundaries of this world, then we can do that by improving each of these facets: extending the coverage of the database, deepening the sophistication of the facetting, expanding the different types of data that we include for analysis, or broadening the links between different parts of the data to improve context. Data quality determines the accuracy of our view of this landscape and the level of trust that we can have in analyses.

It may be argued that more established data sources have sought to optimise the coverage, structure and quality of their data. But newer databases have brought a new focus on nature and context [22, 26]. By expanding the types of data that they cover, or by creating better linkages between those new data types to improve our ability to contextualise data, they improve the variety and subtlety of the insights that the scientometrics community may generate. We do not suggest that our list of analytical facets that drive value is an exhaustive one. There are many additional features that change the value of any analysis, such as considerations outside the dataset (for example, the affiliation of the analyst) or technical considerations (such as data homogeneity or robustness of statistical treatment). We argue that data accessibility is a different type of feature of a dataset that should be considered more actively, especially with the rise of cloud technologies.

Data accessibility is a complex and multifaceted topic. The key facets that we believe to be important in the context of scientometric data and analysis are:

1. Timeliness: the extent to which a user can access sufficiently up-to-date data;

2. Scale: the extent to which it is possible to access the whole dataset for calculational purposes;

3. Equality: the extent to which the resources to process the data and perform a calculation are technologically practical;

4. Licence: the legal terms that define the extent to which the data may be used and published.

The example that we use here does not attempt to illustrate or address all these facets. Recent work by [25] focused on timeliness. In the current article we focus on scale and equality. Specifically, we examine classes of calculation for which data access is required at scale and look at how Cloud technologies can facilitate both scale and equality of access to data. Our example will use Digital Science’s Dimensions on BigQuery infrastructure. We note that this paper is specifically designed not to be a comparative study of the accessibility of different data sources, but rather as an opportunity to showcase the types of analysis that can be carried out if technological choices are made that democratise data access.

This paper is organised as follows: In Sec. II we describe the Dimensions on Google BigQuery technical stack, and the specific queries used for the analysis presented in the following section. In Sec. III we show the results of several different calculations of the centre of gravity of research production using the method described in Sec. II and discuss the context of those results. In Sec. IV, we consider the potential of Cloud technologies to meet a broad set of use cases.

II. METHOD

A. Technical Infrastructure
Many Cloud technologies are already used across research, especially in technical subjects requiring large-scale computation or storage, or in those which engage in large-scale collaborations. Indeed, Cloud technologies are becoming more widespread in research as they prove to be highly cost-effective for some types of research activity. Typical use cases involve storage and transfer of data or obtaining computational power on demand.

For those with structured data, the Cloud technologies that allow users not only to store and distribute access to a dataset but also to perform complex calculations with an on-demand infrastructure are now coming of age. Technologies such as Amazon Redshift, Snowflake and Google BigQuery all have the potential to meet the use cases mentioned above [50].

In addition to their technical capabilities, these technologies are opening up new business models through the ability to share secure data in a fine-grained and controlled manner. Any of the technologies mentioned allows a data holder to share data from their Cloud database with others on a permissioned basis, opening up access specifically or generally based on many different criteria. From a business-model perspective, a critical differentiator (not used in the current example) is that two parties can add their data to the cloud completely securely: one can keep their data private while the other can open their data up on some mix of open-access and commercial bases. The second actor’s data can then be used by the first actor, on whatever the appropriate contractual terms are, mixing the data with their private data in a completely secure manner.
The only requirement is that each dataset should have a sufficient overlap in persistent unique identifiers to allow the datasets to be compatible. Hence, this technology is a strong reason for all stakeholders in the community to adopt open identifiers and to ensure that the data that they expose is well decorated with them. For large, frequently updated datasets where there is significant overhead in just storing and updating the data, this new way of working completely changes the basis of engagement.

From the perspective of the current article, the availability of Dimensions data in the Google BigQuery Cloud environment allows users to access and compute directly with the data without having to invest in either building or maintaining local infrastructure. It also allows users to manipulate and calculate with the data across the whole Dimensions dataset. The only technical expertise that is required is an ability to program in SQL.

It is easy to see how the calculation explained below could easily be adapted to calculate other metrics and indicators that require access to a “global” dataset. Such calculations include journal metrics such as Journal Impact Factor [19], EigenFactor [3], SJR [21] or CiteScore [45], as well as the production of journal citation distributions [31], field-based normalisations such as RCR [30], geographical benchmarks, trend analysis and examples of knowledge cartography, such as the example that we have chosen to explore.
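As a minimal illustrative sketch (ours, not part of the Dimensions or BigQuery tooling), joining a private dataset to a shared one reduces to matching records on a common persistent identifier; all record and field names below are hypothetical:

```python
def join_on_identifier(private_rows, shared_rows, key="grid_id"):
    """Join two record sets on a shared persistent identifier.

    Records lacking the identifier cannot be matched and are dropped,
    which is why data that is well decorated with open identifiers
    matters for this way of working.
    """
    shared_by_id = {row[key]: row for row in shared_rows if key in row}
    joined = []
    for row in private_rows:
        match = shared_by_id.get(row.get(key))
        if match is not None:
            # merge the two records; private fields take precedence
            joined.append({**match, **row})
    return joined

# Hypothetical example: private spend data joined to shared location data.
private = [{"grid_id": "grid.1", "spend_usd": 1200}]
shared = [{"grid_id": "grid.1", "latitude": 52.2, "longitude": 0.1},
          {"grid_id": "grid.2", "latitude": 51.8, "longitude": -1.3}]
rows = join_on_identifier(private, shared)
```

In the Cloud warehouse setting the same join is expressed in SQL across the two parties’ tables; the sketch above only illustrates why identifier overlap is the sole compatibility requirement.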
B. Calculation for Example
To illustrate how the new technologies described above may be used, we perform a simple global calculation. As noted above, the word “global” here is not intended to refer to a geographical context, but rather implies that each record in the database will potentially contribute to the calculation.

We calculate the centre of mass of global research output year by year. This calculation has several noteworthy features that demonstrate the capabilities that we have discussed earlier. The calculation: i) involves every publication record in our dataset; ii) makes use of a unique identifier to connect publication outputs to geographical locations (in our case through GRID); iii) makes use of the time-depth of the publication records in the database to give a trend analysis.

Using non-Cloud infrastructure to perform this calculation, such as a standard relational database hosted on physical infrastructure, would make this calculation time consuming and resource intensive. By leveraging Cloud infrastructure we can quickly iterate the detail of this calculation to test different hypotheses. For example, we can easily shift from a centre of mass calculation that focuses on publications to one that focuses on awarded grants, patents or policy documents. We can trivially change the weighting factor from an unweighted calculation to a citation weight in the case of publications, grant size in USD for grants, the funding associated with a publication, the altmetric attention associated with a patent, and so on. We can also easily restrict our analysis to a specific research topic, country or institution, a specific class of grants, a particular type of funding or a larger-scale policy initiative such as open access. To take this even further, one can imagine even subtler weighting schemes that take the CRediT taxonomy [1] into account.

In the examples contained in this paper we focus on publication output and either unweighted or citation-weighted formulations.
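To make the idea of swapping weighting factors concrete, the following is a minimal Python sketch (our illustration, not the production code used for the paper); the records are invented, and the weights could equally be citation counts, grant value in USD or altmetric attention:

```python
def weighted_centre_of_mass(records):
    """Weighted average of (latitude, longitude, weight) triples.

    Setting every weight to 1 recovers the unweighted calculation.
    Note: a simple arithmetic mean of angles is used, mirroring the
    approach in the text; it ignores wrap-around at the antimeridian.
    """
    total = sum(w for _, _, w in records)
    if total == 0:
        raise ValueError("total weight must be positive")
    lat = sum(la * w for la, _, w in records) / total
    lon = sum(lo * w for _, lo, w in records) / total
    return lat, lon

# Two outputs: one weighted 3 (e.g. three citations), one weighted 1.
centre = weighted_centre_of_mass([(50.0, 0.0, 3), (30.0, 20.0, 1)])
# centre is (45.0, 5.0)
```

Changing the weighting scheme then amounts to changing how the third element of each triple is populated, which is exactly the one-line change made in the SQL listing below.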
The core of the centre of mass calculation is a simple weighted average of spatial positions that all students of classical mechanics meet early in their studies; it is equivalently known as a centre of gravity calculation or centroid.

In our example, each “mass” is an affiliated research institution and the location of that mass is the geographical location of the principal campus as recorded in GRID. For each individual paper, there is a centre of mass, the position of which is proportional to the contributions of the affiliations of the researchers who have contributed to the paper. For example, if a paper were to be entirely written by researchers at a single institution then the centre of mass for the paper in our calculation would be the location of the principal campus in GRID. If a paper were to be written by two co-authors, one at the University of Cambridge and the other at the University of Oxford, then the centre of mass would be computed to be midway between the Senate House buildings of the two institutions. To find the centre of mass of global output in any year, we average the spatial locations of all the papers produced in that year. We can think of this position as the “average centre of global research production” or the “centre of mass/gravity of global research output”.

We also introduce a citation-weighted version of this calculation, which may be interpreted as a measure of centrality of global research attention to research output.

Formally, we define the centre of mass of a set of research objects to be the spatial average (or centroid) of the affiliations of the co-creators of the output. On a paper with n co-authors, each co-author is associated with 1/n of the paper. If a given co-author is affiliated with m institutions, then each institution will have a weight of 1/m of that co-author’s part of the paper, and 1/nm of the overall paper. Thus, each author–institution pairing has a weight a_{nm} where

\sum_{n} \sum_{m} a_{nm} = 1 .    (1)
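As a concrete (and purely illustrative) sketch of these weights, with hypothetical GRID identifiers, the 1/nm split can be written as:

```python
from fractions import Fraction

def author_institution_weights(paper_affiliations):
    """Compute the weight a_{nm} for each author-institution pairing.

    paper_affiliations: one entry per co-author, each entry being the
    list of institution identifiers for that co-author. Each of the n
    co-authors carries 1/n of the paper; an author with m affiliations
    splits that share equally, giving 1/(n*m) per pairing.
    """
    n = len(paper_affiliations)
    weights = {}
    for institutions in paper_affiliations:
        m = len(institutions)
        for inst in institutions:
            # weights accumulate naturally when an institution repeats
            weights[inst] = weights.get(inst, Fraction(0)) + Fraction(1, n * m)
    return weights

# Two co-authors: one at a single institution, one with two affiliations.
w = author_institution_weights([["grid.A"], ["grid.B", "grid.C"]])
# grid.A carries 1/2; grid.B and grid.C carry 1/4 each; the total is 1.
```

Exact fractions are used so that the normalisation in Eqn. 1 can be checked without floating-point error; the SQL listing below performs the same split with the `num_orgs` and `authors` divisors.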
We do not need to explicitly sum over authors to get the overall contribution of a specific institution, nor do we need to worry about repetition of institutions since, in our calculation, we reduce an institution to the longitude and latitude of its principal campus. Hence, there is a natural accumulation of weight at a geographical location.

This reduction to longitude and latitude is made possible through the use of GRID. The longitude and latitude of research institutions are not held natively within the Dimensions dataset. However, each institution in Dimensions is associated with a persistent unique identifier that allows us to connect to other resources. In the case of Dimensions the institution identifier is the GRID identifier. GRID not only includes some helpful data about institutions, such as the longitude and latitude that we use here, but also acts as a gateway to resources such as ROR (the Research Organisation Registry) that will in turn facilitate access to other pieces of information.

This means that we can simply calculate the average longitude, \overline{long}, and latitude, \overline{lat}, of a single research output using:

\overline{lat} = \frac{1}{T} \sum_{i} \sum_{j} lat_{ij} ; \qquad \overline{long} = \frac{1}{T} \sum_{i} \sum_{j} long_{ij} ,    (2)

where T is the total number of author–institution pairings on the output, the index i runs over authors and j over each author’s institutions.

We can then extend this to a group of outputs by introducing an index, k, that ranges over each output in the relevant set to create the average longitude \overline{Long} and average latitude \overline{Lat} of the whole set:
\overline{Lat} = \frac{1}{K} \sum_{k} \frac{1}{T_k} \sum_{i} \sum_{j} lat_{kij} ; \qquad \overline{Long} = \frac{1}{K} \sum_{k} \frac{1}{T_k} \sum_{i} \sum_{j} long_{kij} ,    (3)

where T_k is the total number of institutional affiliations on the kth paper and K is the number of papers in the average.

Longitude and latitude are defined as angles on the surface of a sphere, with longitude in the range [−180°, 180°] and latitude in the range [−90°, 90°]. To weight each paper by the citations that it has received, we replace the simple average over papers with a citation-weighted one:

\overline{Lat} = \frac{1}{C} \sum_{k} \frac{C_k}{T_k} \sum_{i} \sum_{j} lat_{kij} ; \qquad \overline{Long} = \frac{1}{C} \sum_{k} \frac{C_k}{T_k} \sum_{i} \sum_{j} long_{kij} ,    (4)

where C_k is the number of citations of the kth paper and C is the sum of all citations across papers in the set. Likewise, if we were interested in the level of non-scholarly attention we might replace citations by some relevant altmetric data.

The code snippet below is the implementation of Eqn. 4 using Google BigQuery’s implementation of SQL on the Dimensions dataset. In addition to the calculation explained above, the code below takes into account cases where creators may miss an affiliation by ensuring that the normalisation is consistent in the case of null data.

WITH pubs_reweighted AS (
  SELECT
    p.id,
    p.year,
    a.first_name,
    a.last_name,
    a.initials,
    /* count the distinct number of organisations per author */
    COUNT(DISTINCT g.id) num_orgs,
    /* list of all the GRIDs per author */
    ARRAY_AGG(grid_id) grids,
    /* count the number of authors on the paper that have
       affiliations in GRID */
    COUNT(p.id) OVER (PARTITION BY p.id) authors
  FROM `dimensions-ai.data_analytics.publications` p
  INNER JOIN UNNEST(authors) a
  INNER JOIN UNNEST(a.affiliations_address) aff
  INNER JOIN `dimensions-ai.data_analytics.grid` g
    ON g.id = aff.grid_id
  GROUP BY p.id, p.year, g.name, a.first_name, a.last_name, a.initials
),
/* get the location for each GRID. Each row that is being summed here
   represents a single author. If they have more than one affiliation
   then the contribution of the author is split equally. */
pub_center_mass AS (
  SELECT
    pr.id,
    SUM((g.address.latitude / pr.num_orgs) / pr.authors) latitude,
    SUM((g.address.longitude / pr.num_orgs) / pr.authors) longitude
  FROM pubs_reweighted pr,
    UNNEST(grids) grid_id
  INNER JOIN `dimensions-ai.data_analytics.grid` g
    ON g.id = grid_id
  GROUP BY pr.id
)
SELECT
  p.year,
  /* sum the centre of mass for all publications / the number of
     publications; replacing p.metrics.times_cited with an explicit
     value of "1" recovers the weighting-free calculation */
  SUM(cm1.latitude * p.metrics.times_cited) / SUM(p.metrics.times_cited) latitude,
  SUM(cm1.longitude * p.metrics.times_cited) / SUM(p.metrics.times_cited) longitude
FROM pub_center_mass cm1
INNER JOIN `dimensions-ai.data_analytics.publications` p
  ON p.id = cm1.id
GROUP BY p.year
HAVING SUM(p.metrics.times_cited) > 0
ORDER BY year
Listing 1: Listing to produce a citation-weighted centre of mass year-by-year using SQL on Google BigQuery with Dimensions data.

One assumption that may not at first appear obvious with the weighted approaches used here is that the sum of all citations over time has been used. As a result, papers in 1671 have had 350 years to garner citations whereas more recent publications have had much less time. Of course, the average in each case is performed on a homogeneous basis (i.e. only publications of the same year are averaged together); however, this does introduce an implicit bias in the analysis in that the citation weighting may have a contemporary skew. A further analysis could be performed that only considered the citations in an n-year window following the date of publication of the paper. Of course, introducing such a parameter also makes a value judgement about the lifetime of a piece of research.

In Sec. III we use this method to showcase three analyses: i) a standard unweighted calculation of the centre of mass of research output from 1671 to the present day; ii) a calculation of the centre of mass of research weighted by citation attention over the same time period; iii) a calculation of the citation-weighted centre of mass of research based just on data from the freely available COVID-19 dataset that is available on the Google BigQuery environment.

C. Data specifics
The details of the high-level data schema in Dimensions, including information about coverage and the treatment of unique identifiers, are described in several recent publications, for example [25, 26].

Once the data were produced from a script such as the one above, they were downloaded from the interface and initially analysed in Mathematica. The graphics shown in Sec. III are produced using Datawrapper.de. At the Mathematica analysis stage, we plotted every year of data from the system. However, this gave an unsatisfactory picture as the data are quite messy. In the early years of the dataset (approximately from 1671 to 1850) the number of publications with a GRID-listed institution numbers in the single digits. A confluence of reasons contributes to this picture: i) the low number of overall publications; ii) the low level of stated academic affiliations of authors in early work; iii) affiliations to institutions that are not part of GRID. Figure 1 shows the number of publications with at least one recognisable (GRID-mapped) affiliation in each year in the Dimensions dataset.

From 1900, the data begin to settle as it begins to be appropriate to treat them in the context of a statistical calculation such as the one outlined in Sec. II B. Between 1900 and 1970, the year-on-year variability of the data decreases, and from the 1970s the data describe a fairly consistent path with few significant deviations. As such, we have denoted points in the figures in grey where they contain “less robust” data and in red where the data are “more robust”.

In the final analysis presented, we focus on the COVID-19 dataset and perform a month-by-month analysis. In this situation, we are again in the realm of relatively small numbers, where we have to be careful about statistical effects. However, the COVID-19 dataset has grown quickly during 2020, with a few hundred papers in January growing to several thousand papers per month in November (see Table I).
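The grey/red split described above amounts to a simple threshold rule on the yearly record counts; a hedged Python sketch follows (our illustration; the threshold value and the counts are invented for the example, not taken from the paper):

```python
def robustness_flags(counts_by_year, min_records=100):
    """Label each year 'red' (more robust) or 'grey' (less robust)
    according to whether enough GRID-mapped records contribute.

    The min_records threshold is an illustrative parameter; in the
    figures the split is made by inspection of the data.
    """
    return {year: ("red" if n >= min_records else "grey")
            for year, n in counts_by_year.items()}

flags = robustness_flags({1671: 1, 1850: 40, 1990: 250000})
```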
FIG. 1: Logarithmic-scale plot of the number of GRID-mapped institutions associated with papers in the Dimensions database by year from 1671 to 2020. The two notable dips in the data in the first half of the 20th Century coincide with the two world wars. The grey background highlights the region between 1671 and 1990 in which the number of contributing records is taken to be too small to give a stable basis for statistical analysis.
Month       Number of publications
January         289
February        751
March         3,140
April         9,999
May          15,502
June         15,377
July         16,706
August       15,645
September    16,191
October      18,304
November     15,170
December     15,153
TABLE I: Number of COVID-19 research publications, including journal articles, preprints, monographs and book chapters, by month during 2020 in the Dimensions database.
III. RESULTS
From a historical perspective, the calculation of a variety of different centres of mass can be revealing. At the least, they may confirm accepted doctrine, but in the best situation they can reveal features that allow us to quantify and understand how aspects of our society are developing in a very relatable manner.

Bibliometric analyses such as those presented here have previously been difficult to undertake due to the challenges of arranging data access, having the capacity to process data into an appropriate format, having the computational capacity to perform calculations and having a good reason to put effort into generating this kind of output. With the arrival of cloud-based technologies the technical challenges are removed. A mere 40 lines of code, with a runtime of significantly less than 1 minute, is required to produce the data that underlie the analysis presented here based on the Dimensions dataset.

By comparison, such plots are relatively more common in other areas of research, such as economics or geography. The recent work of [17] examined the movement of the centre of mass of economic activity in the world from 1 CE to the present day, showing that the economic centre of mass 2 millennia ago lay on a line between Rome and China. During this period, the Silk Road was the commercial axis between the two largest empires in the world: the Roman Empire and the Eastern Han Empire. It is unsurprising that the economic centre of gravity is closely linked to these ancient centres of commerce. The centre of mass was solidly grounded in the same region until at least 1500. However, following the Enlightenment in the 18th Century, science and technology began to transform the economies of Europe, and for a century from 1820 to 1913 the centre of mass of the world’s economy moved rapidly West and North as the Industrial Revolution transformed first the UK and then the wider Western world.
Interestingly, in the McKinsey analysis, despite America’s increasing world status and riches during the 20th Century, the centre of economic mass never quite left the Eurasian continent, reaching its zenith in 1950, just over Iceland, before beginning its journey Eastward and, again, Southward as first Europe emerged from war, Japan developed economically during the 1980s and finally China reached economic preeminence as we entered the Asian Century [40].

Most in academia agree that formal research publication dates from 1665 with the first issue of the Philosophical Transactions of the Royal Society [29]. Hence, the data that we have around research activity only span a few hundred years and do not share the time-depth available in the work of [17]. As a result, from a data perspective, we miss much of the detail around the development of older societies such as those in Egypt and China. Anecdotally, it is particularly interesting that the Chinese did not develop a research community with the associated communication structure despite significant technologies through the Ming and Qing periods. Indeed, many of the principles that led to the Enlightenment in Europe had parallels in Qing China and there is even evidence in European writings that they were aware of Enlightenment-style developments in China [49]. Yet, this does not appear to have resulted in the emergence of a formal research publication culture. Miodownik offers a material scientist’s view in [35] on the relative rate of development of Chinese science; it may be that the development and wide adoption of glass in preference to porcelain is the small change that shaped the development of history for several centuries.

The scholarly communications community has associated today’s digital infrastructure (such as persistent unique identifiers) with pre-digital-era publications, and this gives us an ability to piece together a much fuller picture than would otherwise be the case.
Nevertheless, Figure 1 makes it clear that the data are not sufficient to be treated in a reasonable statistical manner until much more recently. For the purposes of our example, we have chosen to keep the more statistically questionable points on our plot for aesthetic reasons, but have coloured these points in later figures in grey to denote the intrinsic uncertainty and arbitrariness of the choice of the data point.

FIG. 2: Motion of the centre of mass of research production from 1671 to the present day. The centre of mass calculation is unweighted by citations or other measures and is based solely on the outputs of papers by institutions that appear in the GRID database.

Figure 2 shows the motion of the unweighted centre of mass of global publication output between 1671 and the present day. The start point of the path is an easy one to calculate since only one publication in that year is associated with a DOI and a GRID-resolved institution. The paper concerned is a Letter that appeared in the Philosophical Transactions of the Royal Society of London. It is written by “Mr. Isaac Newton, Professor of the Mathematicks in the University of Cambridge; containing his new theory about light and colors”. The path is highly volatile in the years following 1671 as the number of papers is small (those interested in this detail can review the annual calculation in the supplementary material). However, by 1901, there are sufficiently many papers with well-identified institutions that the path settles somewhat.

Many of the great academic institutions in the US had been established in the late 18th Century. Through the 19th Century the “Robber Baron” industrialists such as Mellon, Carnegie and Rockefeller had continued the trend of setting up academic institutions, and by the 20th Century these institutions were pulling the centre of mass of research (erratically at first, but then with increasing speed) away from Europe.
The First and Second World Wars saw significant disruption in Europe, and the wealth that had taken the British Empire a century to accumulate travelled to the US in just four years as Britain underwrote the costs of the First World War between 1914 and 1918. And so, the movement of the centre of mass of research production makes complete sense from 1900 to 1945.

If anything, it is remarkable that 1945, the year that Vannevar Bush wrote his famous Endless Frontier report [9], marks the turning point of the transit of the centre of mass back toward Europe. While the end of disruption in Europe meant that academics could return to their research and publication could begin again, Germany was in ruins and the economy of the UK was in tatters. Despite the success of US-based programs such as the Manhattan Project during the war, research focus had yet to come to the fore in US universities.

Following Bush’s report, the National Science Foundation was created and the formal basis for a period of US-centred scientific pre-eminence was established. In Europe, the reorganisation of research was also under way: the Kaiser Wilhelm Institute was renamed the Max Planck Institute in 1948, and in 1949 the Fraunhofer Institute was established. By the 1960s, the Royal Society of Great Britain would coin the term “Brain Drain” to describe the movement of British scientists from the Old World to the New [2, 13]. In the UK, Wilson’s White Heat of Technology of the 1960s [38] served to help keep the centre of mass moving toward Europe.

Overall, the balance of publication volume remained in Europe’s favour from 1945 until 1970, with a slow drift in the centre of mass of publication toward Europe. During the final decade of this period, US spending on research as a proportion of its discretionary budget reached an all-time high [27], with the result that, between 1970 and 1980, the centre of mass looked as though it might turn around and head back toward the US once more.
The high level of investment in research had begun to pay off and science was riding high in the public psyche in the US in this period.

Yet, despite the payoff from the space race and the beginning of the computer age, spearheaded by Silicon Valley in the US, the path of the centre of mass resumed its trajectory toward Europe in the 1980s. The speed of transit of the centre of mass has remained about the same since the 1990s, but this conceals a complex set of forces behind the motion: the rise of Japan as an industrial and research power; the emergence of the professionalisation of research in the UK; the creation of a Europe-wide research strategy embodied in the creation of the European Research Council and centralised strategic funding from the framework programme grants and the Horizon 2020 programme; and, since 2000, the rise of China as both a major economy and research power. Indeed, in decades to come we are likely to see the centre of mass travel further as China establishes itself further and India scales up its research economy.

An unweighted calculation shows the clear average centre of production, but it is interesting also to think about different types of weighting. This should be done with care, since the interpretation of such weightings is not trivial. Figure 3 shows a similar picture to Figure 2, but this time with each institution's contribution weighted by the fraction of the number of citations associated with the papers written by their affiliated authors. The addition of citation data stabilises the path overall, as there is a bias toward the most established research economies. In this figure, the centre of mass continues to be closest to the US in 1945, but it returns to Europe initially more slowly, and actually turns around, heading back toward the US in the 1980s, before moving once more toward Europe, moving faster than ever, by 2000.

FIG. 3: Motion of the centre of mass of research production from 1671 to the present day.
The centre of mass calculation is weighted by citations to outputs as described by the Code Listing 46 and Eqn. 4.

The speed of movement toward the east has increased significantly over the last 20 years, which is indicative not only of increasing research volumes in China as well as Japan, India, Australia and New Zealand, but also of the increased citations garnered by those publications.

Additionally, while the range of movement of the centre of mass from east to west is significant, its movement to the south, while being monotonic and more limited in range than the longitudinal motion, is notable for its consistency in the latter half of the 20th Century. The majority of the world's large cities, and hence its most abundant research economies, are in the northern hemisphere. Yet the trend is to the south, and tracking this motion is sure to be interesting in the future.

Our third and final narrative is contained in Figure 4, which shows the motion of the citation-weighted centre of mass of COVID-19 research on a monthly basis during 2020. The number of publications that contribute to each point on the plot is shown in Table I.

As news of COVID-19 emerged from Wuhan in China at the beginning of the year, China's researchers quickly turned their attention to studying the disease. The fact that the centre of mass of COVID-19 research in January 2020 is located on the Tibetan plateau (paradoxically, quite near to the centre of mass of global economic output in 1CE as calculated in the McKinsey report that originally inspired the line of work in this paper) rather than closer to China's research centres is a clear indication that research was already taking place in the international community. As the year progressed and the pandemic migrated west, more and more research organisations in the West turned their attention to COVID-19 research. The shift in the centre of mass of global research production, and the speed at which this happened, is easy to see from Fig. 4.

FIG. 4: Motion of the centre of mass of research production month by month for COVID-19 publications from January 2020 to December 2020. The centre of mass calculation is weighted by citations as described by Eqn. 4.

Maps hold a special place in human storytelling and hence are a powerful means by which we can relate to data. The use of such maps does not come without baggage - such visualisations hide many facets. However, they are impactful and we believe that the simplicity of the technology that we have demonstrated in this short article shows great promise as a tool to illustrate trends in academic research.
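For concreteness, the centre-of-mass construction used throughout this section can be summarised as mapping each institution to a unit vector on the sphere and projecting the (optionally citation-weighted) mean vector back to a latitude and longitude. The sketch below is a minimal illustration of that idea, not a reproduction of the paper's Code Listing or Eqn. 4; the input format (a list of `(lat, lon)` pairs with optional per-institution weights) is our assumption.

```python
import math

def centre_of_mass(points, weights=None):
    """Geographic centre of mass of (lat, lon) points given in degrees.

    Each point is converted to a 3D unit vector; the (optionally
    weighted) mean vector is projected back to a latitude/longitude
    pair. With citation counts as weights this gives the
    citation-weighted variant; with no weights, the unweighted one.
    """
    if weights is None:
        weights = [1.0] * len(points)
    x = y = z = 0.0
    for (lat, lon), w in zip(points, weights):
        phi, lam = math.radians(lat), math.radians(lon)
        x += w * math.cos(phi) * math.cos(lam)
        y += w * math.cos(phi) * math.sin(lam)
        z += w * math.sin(phi)
    total = sum(weights)
    x, y, z = x / total, y / total, z / total
    # Project the mean vector back onto the sphere's surface.
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon
```

For example, two equally weighted institutions on the equator at longitudes 0 and 90 give a centre of mass at latitude 0, longitude 45; skewing the weights pulls the centre toward the heavier institution, exactly as citation weighting pulls the path in Figures 3 and 4 toward the most-cited research economies.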
IV. DISCUSSION

A. A new world of analysis
In a recent book, Goldin and Muggah [20] produce a set of compelling maps with associated narratives. We have tried to take the same approach in our Results section in order to showcase how these maps may lead to inquiry and contextual interpretation beyond the standard work of analysts. We have also shown how responsive and immediate these analyses can be - not only adding an interesting thread to historical discourse but allowing us to see emergent trends in real time. We believe that this type of thinking is well understood by many in the scientometric community, as evidenced by the attention received by the work of W. B. Paley (Fig. 5) and others who originally pioneered research cartography. One of the enduring challenges of automated data visualisation is the ability to optimise layout and preserve information. In general, it is not possible to reach the level at which this is done in Paley et al.'s work. However, in making it easier to create visualisations on the fly, while we give up the data transparency that Paley aspires to, we are able to add speed of iteration so that a visualisation can be used in an actionable manner.

It is widely recognised that data visualisation is a powerful tool for contextualisation and interpretation [16, 41, 43]. The analysis presented in this paper aims to make three points: firstly, that data accessibility is a partner to data quality and an important part of how data may be deployed to gain insight; secondly, that certain visualisation styles and approaches have previously been overlooked due to the lack not only of data accessibility but also of the data connectivity that comes through persistent identifiers; thirdly, that tools like these should not be limited only to the most well-funded researchers, and that Cloud infrastructure may be an effective mechanism to democratise access to these types of data, tools and interpretation, and hence be a route to superior strategic decision making across the sector.
B. A new world of data
By introducing the scientific method in his book Novum Organum in 1620, Bacon codified the deep relationship between science and data. The importance of data is not solely limited to the scientific disciplines; rather, data, broadly defined, has always been part of research, regardless of topic. However, until relatively recently in human history, data has been rare. In the last half century we have seen an explosion in the amount of data made available not only by physical and biological experiments, but also by social experiments and the emergence of the digital humanities. We have gone from a poverty of data to an amount of data that cannot be handled by any individual human mind.

As in the wider world of research, scientometrics has seen a rise in data availability over the last twenty to thirty years as the research community has grown and professionalised. The need for metadata that describes not only the outputs of research but also the process by which they are produced - the broad scholarly record - is now widely acknowledged.

In the next few years, we are likely to see the amount of metadata collected about a research output increase manyfold, so that the metadata about an object exceeds the data contained within the object. The ability to scale data systems, to share and manipulate data and to summarise it for human consumption in visualisations is becoming critical, as is understanding the biases that are inherent to different visualisation styles.

In moving forward, we argue that critical consideration needs to be given to data accessibility. Others, such as [37], have argued cogently that investment should be made in research data. We believe that such investment could be helped by introducing a framework such as the one proposed here to support a working definition of data accessibility and good practice. The facets of coverage, structure, nature, context and quality could form the basis of a helpful rubric for making research data more valuable and accessible to the community.
There is already a precedent for gaining cross-community collaboration in projects such as I4OC and I4OA, as well as structures for the use of metrics in DORA and the Leiden Manifesto [24] - is data access another similar area where the community should seek to build principles to ensure the most even playing field?
V. FUTURE EXPLORATIONS
The methods explored in this paper can be extended and applied in many different scenarios. It is easy to see how this analysis could be repeated and customised for a variety of geographies (e.g. specific countries or regions), subject areas (e.g. COVID-19 as shown here, or the Sustainable Development Goals) and timescales. Weighting schemes could include altmetric-based approaches, funding weighting, journal-metric-led weighting or any number of different approaches to suit specific needs. In addition, using Dimensions, parallel analyses could be performed based on grant data, clinical trials data, patent data, policy documents or datasets. As noted previously, equivalent problems that could make use of similar capabilities and technologies include global heatmapping of specific research activities and the creation of specific custom benchmarks or other metrics to specification and on demand.

We have discussed context as a critical part of research analysis in this paper. Thus, it is important to highlight the context of the data used in our analyses. Despite the foundational principles behind Dimensions of not editorialising its data holdings, it is still not a universal dataset. At the current time, not all funding organisations make their data openly available, and the publications associated with some geographies and some fields are not held in the DOI registries that have yet been integrated into Dimensions. As a result, the analysis presented here has flaws and will naturally show an English-language-centred view of the world.

In this paper, we have focused on a particular analysis and visualisation style that we have not seen in the scientometric literature before. We believe that the lack of use of this style is due to the constraints that we have outlined. However, we believe that our underlying argument around data access can also be applied to the production of visualisations such as those offered by VOSviewer, CiteSpace and similar technologies [14, 15]. We close by commenting that, if adopted broadly, we believe the Cloud techniques applied in this article can lead to better decision making across academia, as analysis can become more iterative and more available across the sector.

FIG. 5: One of the first visualisations of research that made use of a full global dataset. While "calculated", a significant amount of manual work was needed to make this beautiful visualisation, which ensures that detailed data is married with a meaningful visualisation. Reproduced with kind permission of W. B. Paley.
CONFLICT OF INTEREST STATEMENT
All authors of this paper are employees of Digital Science, the creator and owner of Dimensions and GRID.
AUTHOR CONTRIBUTIONS
DWH developed the idea for this paper, drafted the manuscript and carried out the visualisation. SJP developed the implementation of the code and determined the business rules and methodology for the data extraction. Both co-authors edited and reviewed the manuscript.

REFERENCES

[1] Allen, L., Scott, J., Brand, A., Hlava, M., and Altman, M. (2014). Publishing: Credit where credit is due. Nature News
[2] Balmer, B., Godwin, M., and Gregory, J. (2009). The Royal Society and the 'brain drain': natural scientists meet social science. Notes and Records: the Royal Society Journal of the History of Science 63, 339–353. doi:10.1098/rsnr.2008.0053
[3] Bergstrom, C. (2007). Eigenfactor: Measuring the value and prestige of scholarly journals. College and Research Libraries News 68. doi:10.5860/crln.68.5.7804
[4] Börner, K. (2010). Atlas of Science: Visualizing What We Know (Cambridge, Mass.: MIT Press), illustrated edn.
[5] Bornmann, L. (2018). Field classification of publications in Dimensions: a first case study testing its reliability and validity. Scientometrics
[6] arXiv:2012.07675 [physics]
[7] Boyack, K. W., Klavans, R., and Börner, K. (2005). Mapping the backbone of science. Scientometrics 64, 351–374. doi:10.1007/s11192-005-0255-6
[8] Boyack, K. W., Klavans, R., Paley, W. B., and Börner, K. (2007). Mapping, illuminating, and interacting with science. In ACM SIGGRAPH 2007 sketches (New York, NY, USA: Association for Computing Machinery), SIGGRAPH '07, 2–es. doi:10.1145/1278780.1278783
[9] Bush, V. (1945). The Endless Frontier, Report to the President on a Program for Postwar Scientific Research. Tech. rep., Office of Scientific Research and Development, Washington DC
[10] Börner, K. (2015). Atlas of Knowledge: Anyone Can Map (Cambridge, Massachusetts: MIT Press), illustrated edn.
[11] Börner, K., Chen, C., and Boyack, K. W. (2003). Visualizing knowledge domains. Annual Review of Information Science and Technology 37, 179–255. doi:10.1002/aris.1440370106
[12] Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., et al. (2012). Design and Update of a Classification System: The UCSD Map of Science. PLOS ONE 7, e39464. doi:10.1371/journal.pone.0039464
[13] Cervantes, M. and Guellec, D. (2002). The brain drain: Old myths, new realities. OECD Observer
[14] Journal of the American Society for Information Science and Technology 57, 359–377. doi:10.1002/asi.20317
[15] Colavizza, G., Costas, R., Traag, V. A., van Eck, N. J., van Leeuwen, T., and Waltman, L. (2021). A scientometric overview of CORD-19. PLOS ONE 16, e0244839. doi:10.1371/journal.pone.0244839
[16] Dick, M. (2020). The Infographic: A History of Data Graphics in News and Communications
[17] Dobbs, R., Remes, J., Manyika, J., Roxburgh, C., Smit, S., and Schaer, F. (2012). Urban world: Cities and the rise of the consuming class. Tech. rep.
[18] García-Pérez, M. A. (2010). Accuracy and completeness of publication and citation records in the Web of Science, PsycINFO, and Google Scholar: A case study for the computation of h indices in Psychology. Journal of the American Society for Information Science and Technology
[19] American Documentation 14, 195–201. doi:10.1002/asi.5090140304
[20] Goldin, I. and Muggah, R. (2020). Terra Incognita: 100 Maps to Survive the Next 100 Years (Century), 1st edn.
[21] González-Pereira, B., Guerrero-Bote, V. P., and Moya-Anegón, F. (2010). A new approach to the metric of journals' scientific prestige: The SJR indicator. Journal of Informetrics 4, 379–391. doi:10.1016/j.joi.2010.03.002
[22] Herzog, C., Hook, D., and Konkiel, S. (2020). Dimensions: Bringing down barriers between scientometricians and data. Quantitative Science Studies 1, 387–395. doi:10.1162/qss_a_00020
[23] Herzog, C. and Lunn, B. K. (2018). Response to the letter 'Field classification of publications in Dimensions: a first case study testing its reliability and validity'. Scientometrics
[24] Nature News
[25] Frontiers in Research Metrics and Analytics 5. doi:10.3389/frma.2020.595299
[26] Hook, D. W., Porter, S. J., and Herzog, C. (2018). Dimensions: Building Context for Search and Evaluation. Frontiers in Research Metrics and Analytics 3. doi:10.3389/frma.2018.00023
[27] [Dataset] White House (2020). Historical Table 9.1 - Total Investment Outlays for Physical Capital, Research and Development, and Education and Training: 1962-2020
[28] Huang, C.-K. K., Neylon, C., Brookes-Kenworthy, C., Hosking, R., Montgomery, L., Wilson, K., et al. (2020). Comparison of bibliographic data sources: Implications for the robustness of university rankings. Quantitative Science Studies 1, 445–478. doi:10.1162/qss_a_00031
[29] Hurst, P. (2010). Trailblazing - 350 years of Royal Society publishing. Notes and Records of the Royal Society
[30] PLOS Biology 14, e1002541. doi:10.1371/journal.pbio.1002541
[31] Larivière, V., Kiermer, V., MacCallum, C. J., McNutt, M., Patterson, M., Pulverer, B., et al. (2016). A simple proposal for the publication of journal citation distributions. bioRxiv, 062109. doi:10.1101/062109
[32] López-Illescas, C., de Moya Anegón, F., and Moed, H. F. (2009). Comparing bibliometric country-by-country rankings derived from the Web of Science and Scopus: the effect of poorly cited journals in oncology. Journal of Information Science 35, 244–256. doi:10.1177/0165551508098603
[33] Martín-Martín, A., Orduna-Malea, E., Thelwall, M., and Delgado López-Cózar, E. (2018). Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories. Journal of Informetrics
[34] Scientometrics. doi:10.1007/s11192-020-03690-4
[35] Miodownik, M. (2014). Stuff Matters: The Strange Stories of the Marvellous Materials that Shape Our Man-made World (London: Penguin), 1st edn.
[36] Mongeon, P. and Paul-Hus, A. (2016). The journal coverage of Web of Science and Scopus: a comparative analysis. Scientometrics
[37] Nature
[38] Physiology 8, 136–140. doi:10.1152/physiologyonline.1993.8.3.136
[39] Powell, K. R. and Peterson, S. R. (2017). Coverage and quality: A comparison of Web of Science and Scopus databases for reporting faculty nursing publication metrics. Nursing Outlook 65, 572–578. doi:10.1016/j.outlook.2017.03.004
[40] Rachman, G. (2017). Easternisation: War and Peace in the Asian Century (Vintage), 1st edn.
[41] Rendgen, S. (2018). The Minard System: The Complete Statistical Graphics of Charles-Joseph Minard (Princeton Architectural Press)
[42] Thelwall, M. (2018). Dimensions: A competitor to Scopus and the Web of Science? Journal of Informetrics 12, 430–435. doi:10.1016/j.joi.2018.03.006
[43] Tufte, E. R. (2001). The Visual Display of Quantitative Information (Cheshire, Conn: Graphics Press), 2nd edn.
[44] van Eck, N. J. and Waltman, L. (2019). Accuracy of citation data in Web of Science and Scopus. arXiv:1906.07011 [cs]
[45] Van Noorden, R. (2016). Controversial impact factor gets a heavyweight rival. Nature News
[46] arXiv:2005.10732 [cs]
[47] [Dataset] Waltman, L. (2020). Open metadata: An essential resource for high-quality research intelligence. doi:10.5281/zenodo.4289982
[48] Wolfram, S. (2010). Making the World's Data Computable
[49] Wood, M. (2020). The Story of China: A portrait of a civilisation and its people (Simon & Schuster UK)
[50] Zukowski, M. (2018). Cloud-based SQL Solutions for Big Data. In