Taking census of physics
Federico Battiston, Federico Musciotto, Dashun Wang, Albert-Laszlo Barabasi, Michael Szell, Roberta Sinatra
Department of Network and Data Science, Central European University, Budapest, 1051, Hungary
Kellogg School of Management, Northwestern University, Evanston, IL 60208, USA
Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL 60208, USA
Network Science Institute, Northeastern University, Boston, MA 02115, USA
Center for Cancer Systems Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
Complexity Science Hub Vienna, Vienna, 1080, Austria
MTA KRTK Agglomeration and Social Networks Lendulet Research Group, Centre for Economic and Regional Studies, Hungarian Academy of Sciences, Budapest, 1094, Hungary
Department of Mathematics, Central European University, Budapest, 1051, Hungary
ISI Foundation, Torino, 10126, Italy
* [email protected]

There was a time when polymaths like Galileo knew all the physics that was there to be known. Over the centuries, however, the body of knowledge spanned by physics exploded, encompassing topics as diverse as gravitational waves, graphene, or network science. As physics expanded in breadth and depth, physicists were forced to specialise, segmenting researchers into their narrow, specialised communities. How many physicists work in each subfield of physics today and how does each subdiscipline evolve? Into which subfield are physicists “born”, and where do they migrate, if at all? Here we take an intellectual census of physicists, their activities and career trajectories, helping us understand the evolution of the field and gaining quantitative insights about several fundamental scientific processes, from resource allocation to the exchange of knowledge. Advances in this direction were limited by the challenge in answering two fundamental questions: 1) Who can be counted as a physicist? 2) How do we survey their activities?
The recent availability of large datasets of scientific publications finally offers opportunities to tackle these questions by exploring the production patterns of the scientific population [2, 3]. Indeed, the close to complete publication records of all physicists allow us to reconstruct their subfields of study and career changes, offering quantitative footprints not just for the field of physics, but also for its intimate relation with the broader scientific community [4, 5].
Combining large-scale data on physics publications and citations with recent data and network science techniques, here we ask: What are the impact and productivity differences between subfields? As a physics student choosing my future specialty, how do I know which subfields are growing? As a funding agency, how do I compare early-career physicists from different subfields? As a journal editor, how many papers should I expect from each subfield and how do I compare their impact?
A census of physics subfields
To offer a data-driven answer to these questions [2, 3], we identify the relevant physics papers and citations within Web of Science (WoS). We start by selecting ∼[…] [5, 6] missing for example those published in interdisciplinary journals like Nature or Science, or papers published in journals of other disciplines but that are of direct relevance for the physics community. To map out the complete physics literature we then set out to detect physics papers by virtue of their patterns of citations among the other ∼47 million papers in WoS. A paper is a potential physics publication if its references and citations to the core physics literature are significantly higher than in a null model in which each paper’s citations are assigned randomly, regardless of a paper’s journal or research area. We identified ∼[…] ∼[…] between 1985 and 2015. We use this dataset to reconstruct the publication profile of 135,877 physicists with a persistent productivity between 1985 and 2015. See Box 1 and SI Section S3 for more details on the dataset curation and validation.

The first step in developing a census is to count the number of physicists working in each subfield. Such counting is, however, not straightforward, as physicists may contribute to publications in different subfields. We therefore associate each physicist with a primary subfield if the number of her publications in the subfield is higher, in a statistically significant manner, than expected for a typical physicist (Box 1 and SI Section S4). The obtained subfield demographics offer us a first summary statistic (Fig. 1a): we find that the largest subfield is CondMat (condensed matter physics) with more than […] physicists, capturing 46% of the entire physicist population. It is followed by General ([…]), HEP (high energy physics, […]), Interdisc (interdisciplinary physics, […]), Classical ([…]), Nuclear ([…]), AMO (atomic and molecular physics, […]) and Astrophysics ([…]). Plasma is the smallest subfield of physics, with fewer than […] researchers.

Given the highly specialised nature of the physics subfields, one might suspect that most physicists work in a single subfield. Yet, we find that highly specialised physicists are the exception rather than the rule: the majority of physicists (63%) are active in two or more subfields (Fig. 1b). This prompts us to ask: Which subfields have particularly low or high rates of specialisation? The differences between subfields are striking, defining two different groups (Fig. 1c): six subfields have less than […] specialised physicists. Among these subfields, Interdisc has less than 1% of specialised physicists, in line with the expectation that interdisciplinary physicists bridge multiple subfields. In contrast, the percentage of specialised physicists in CondMat, HEP and Nuclear is 42%, 34% and 25% respectively, at least an order of magnitude larger than in the other group of subfields. What drives the different levels of specialisation between subfields?

A physicist working on two or more subfields combines the collective know-how of these fields, a process deemed essential for novel discoveries in science.
To understand which of the physics subfields cross-pollinate most significantly, we calculate the co-activities of individual physicists between each pair of subfields. Co-activities are defined by weighted links between subfields, where the weights measure the observed versus expected co-activities based on a randomised null model (SI Section S6). Starting with the highest weighted links, we plot the minimum number of links needed to have a connected network of subfields (Fig. 1d). The network reveals a non-trivial co-activity structure, clustering all physics subfields into three broader areas, 1) Interdisc and CondMat, 2) Classical, AMO, and Plasma, 3) HEP, Astro, and Nuclear, all held together by General. This research space captures the intellectual affinities between subfields, facilitating movements between close subfields, while limiting cross-pollination between distant ones like Interdisc and Nuclear. For example, the diversity of topics within CondMat and Classical and their adaptable approaches, like statistical mechanics applied to multiple systems composed of large numbers of entities, make it easier for those working in these subfields to take their tools to different disciplines. In contrast, more specialised subfields like HEP or Nuclear require their members to acquire familiarity with large-scale, long-term projects. While scientists working in such fields may have deep knowledge and expertise on the subject they specialise in, they face a greater burden that limits their ability to explore other areas. The observed network is similar to the citation network between subfields, showing that the flow of knowledge is captured through multiple metrics, both by paper citations and by the activities of individual physicists.

Birth, growth, and migration
Why are there such considerable differences in the share of specialised physicists between similarly sized subfields, like Nuclear and Interdisc (Fig. 1a,b)? To understand this heterogeneity, we first assess the relative growth rate of each subfield over time, measuring the fraction of physicists entering a subfield every year (Fig. 2a). We find that the growth rates of Interdisc and Astro increased from a few percent in 1985 to over 20% and 27% respectively after 2010, substantially reshaping the physics landscape in recent years. An opposite trend characterises CondMat: while it had the largest share of new physicists in 1985, its share dramatically decreased over time, falling below 5% after 2010. HEP also displayed a receding trend just before 2010, but the spur of new research connected to the activity of the Large Hadron Collider in Geneva injected new forces into the field. In particular, HEP’s sharp peak in 2010 can be attributed to the first ATLAS and CMS publications (SI Section S7).

Figure 2a mixes together physicists who start their careers in a particular subfield with those who make career transitions to other subfields. There are remarkable examples of physicists who never changed their subfield, like Klaus von Klitzing, whose first publication was in CondMat, and who contributed over 500 papers to the subfield, earning him the Nobel Prize in 1985 for the discovery of the quantised Hall effect. In contrast, Rainer Weiss, best known for inventing the laser interferometric technique at the heart of LIGO, which earned him the Nobel Prize in 2017, published his first paper on an unrelated topic in AMO, “Magnetic Moments and Hyperfine-Structure Anomalies of Cs[…], Cs[…] and Cs[…]”. To distinguish such different careers, we next systematically explore career transitions within physics, asking: Where are physicists “born”, and how do they “migrate” between subfields? When do these transitions typically occur?

Figure 2b shows how many physicists began their careers in each subfield (top rectangles). Remarkably, […] of the physicists began their careers by publishing in either CondMat, HEP, or Nuclear ([…] of all physicists start out in CondMat). These three subfields capture “curricular” physics topics, the natural ending points of many undergraduate courses, hence the typical starting point of research careers. General, covering topics of interest to a wide set of physicists, accounts for […] of first publications. In contrast, only […] of physicists started publishing in Interdisc, and as low as […] began in Astro. As Interdisc integrates other disciplines, it might be difficult to start out as an Interdisc physicist; the low percentage of Astro starts may be rooted in the fact that traditionally it has not been a “curricular” subfield.

Box 1
Identifying subfields
We classify papers into 9 subfields, based on the 1-digit Physics and Astronomy Classification Scheme (PACS) by the American Physical Society (APS):
• General: Mathematical Methods, Quantum Mechanics, Relativity, Nonlinear Dynamics, Metrology
• HEP: The Physics of Elementary Particles and Fields
• Nuclear: Nuclear Structure and Reactions
• AMO: Atomic and Molecular Physics
• Classical: Electromagnetism, Optics, Acoustics, Heat Transfer, Classical Mechanics, and Fluid Dynamics
• Plasma: Physics of Gases, Plasmas, and Electric Discharges
• CondMat: Structural, Mechanical, and Thermal Properties; Electronic Structure, Electrical, Magnetic, and Optical Properties
• Interdisc: Interdisciplinary Physics and Related Areas of Science and Technology
• Astro: Astrophysics, Astronomy, and Geophysics

PACS were consistently used in papers published in APS journals between 1985 and 2015 (SI Section S2). Using an algorithm that evaluates the patterns of citations and references between papers, we propagate subfield labels from APS papers to other papers: if the fraction of references and citations between a given paper and papers in a particular subfield is larger than expected by the null model, the paper is assigned to that subfield. A paper may be assigned to multiple subfields, in line with APS papers reporting multiple PACS. In panel a) we show an example of an unclassified paper which references papers in CondMat, Plasma and Astro, and which is cited by papers in CondMat and Astro and by another publication still lacking a PACS. The publication is first assigned to CondMat and then to Astro, but not to Plasma, as it lacks statistically significant links to the subfield. The algorithm is run iteratively until convergence for each subfield, helping us associate at least one subfield to 1,137,670 papers (SI Section S3).
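The propagation step can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the procedure of SI Section S3: the function names (`binomial_tail`, `propagate_subfield`, `build_links`), the pooling of references and citations into undirected links, the one-sided binomial test used as null model, and the threshold `alpha` are all hypothetical choices made here.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of k or more successes in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def propagate_subfield(links, seeds, alpha=0.01):
    """Iteratively extend one subfield label over a citation network.

    links: dict mapping each paper to the set of papers it references or
           is cited by (assumption: both directions pooled together).
    seeds: papers that already carry the label (e.g. through PACS codes).
    """
    labelled = set(seeds)
    total_degree = sum(len(neigh) for neigh in links.values())
    while True:
        # Null expectation (assumption): a random link ends in the subfield
        # with probability equal to the subfield's share of all link ends.
        p = sum(len(links.get(x, ())) for x in labelled) / total_degree
        newly = set()
        for paper, neigh in links.items():
            if paper in labelled or not neigh:
                continue
            k = len(neigh & labelled)
            if k and binomial_tail(k, len(neigh), p) < alpha:
                newly.add(paper)
        if not newly:  # converged: no further paper gains the label
            return labelled
        labelled |= newly


def build_links(edges):
    """Turn an edge list into the symmetric adjacency dict used above."""
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


# Invented toy network: "u" links to three labelled papers; "v" links only
# to "u" and to unlabelled background papers b0..b19 arranged in a ring.
edges = [("u", "s1"), ("u", "s2"), ("u", "s3"), ("v", "u"), ("v", "b0")]
edges += [(f"b{i}", f"b{(i + 1) % 20}") for i in range(20)]
result = propagate_subfield(build_links(edges), seeds={"s1", "s2", "s3"})
print(sorted(result))  # ['s1', 's2', 's3', 'u']
```

In this toy network `u` receives the label because three of its four links point to labelled papers, whereas `v` does not: a single labelled neighbour out of two is not significant under the assumed test, mirroring the Plasma case described in the box.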
Assigning physicists to subfields
We analyse all careers with at least […] labeled papers between 1985 and 2015, capturing the careers of 135,877 physicists. We consider a physicist working in a subfield if her share of publications in the subfield is higher than that of the average physicist. The statistical criterion we used guarantees that each scientist is assigned to at least one subfield, and takes into account the different sizes of subfields. As an example, we show the result of the criterion applied to the career of Stephen Hawking in panel b). In the physics dataset Hawking has 124 papers associated with different subfields. Of these subfields, only General (95 papers) and Astro (77 papers) are assigned to the physicist through the statistical criterion, whereas HEP (23 papers) and Classical (1 paper) are not statistically significant, which is consistent with Hawking being known as a theoretical physicist and cosmologist. For validation and further methods see SI Sections S3, S4, S5.

[Box 1 panels: a, 1) Unclassified paper; 2) Propagation of CondMat (significant); 3) Propagation of Plasma (not significant); 4) Propagation of Astro (significant). b, Subfield assignment for the career of Stephen Hawking.]
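A simplified version of the assignment rule can be sketched as follows. The sketch assumes a plain comparison between a physicist's normalised publication share and the population-average share, which is not the statistical criterion of SI Section S4. The function name `assign_subfields` and the values in `average_share` are hypothetical and chosen only for illustration; the paper counts for Hawking are those quoted in the box.

```python
def assign_subfields(counts, average_share):
    """Return the subfields in which a physicist is over-represented.

    counts:        dict subfield -> number of the physicist's papers there.
    average_share: dict subfield -> share of the average physicist
                   (hypothetical values; should sum to 1).
    """
    total = sum(counts.values())
    # Normalised shares sum to 1, like the average shares, so at least one
    # subfield reaches a ratio of 1 or more (up to rounding).
    ratios = {f: (n / total) / average_share[f] for f, n in counts.items()}
    chosen = {f for f, r in ratios.items() if r >= 1.0}
    if not chosen:  # safeguard against floating-point rounding
        chosen = {max(ratios, key=ratios.get)}
    return chosen


# Hypothetical average shares (invented for this example, not from the data).
average_share = {
    "CondMat": 0.35, "General": 0.15, "HEP": 0.15, "Interdisc": 0.05,
    "Classical": 0.08, "Nuclear": 0.07, "AMO": 0.06, "Astro": 0.05,
    "Plasma": 0.04,
}
# Paper counts per subfield for Stephen Hawking, as reported in Box 1.
hawking = {"General": 95, "Astro": 77, "HEP": 23, "Classical": 1}
print(sorted(assign_subfields(hawking, average_share)))  # ['Astro', 'General']
```

With these invented averages the sketch returns General and Astro, as in the box, but the agreement follows from the chosen numbers and is not a validation of the rule. Because shares are normalised, a larger subfield sets a higher bar, which is one simple way to account for different subfield sizes.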
The links of Fig. 2b capture the significant flows between subfields, linking the subfield where a physicist published her first paper to the subfields that best characterised her later career (SI Section S6). This diagram indicates that CondMat is the starting point for many physicists who later specialised in Interdisc, Classical, and General. HEP and Nuclear tend to swap researchers while feeding talent into Astro, a pattern that may be rooted in the fact that all three subfields study radiation or nuclear and subnuclear processes. We find that most Interdisc physicists did not start their career there, but migrated from CondMat and General, consistent with the hypothesis that one needs to acquire expertise in at least two fields before being able to bring them together. Finally, Plasma and Astro welcome physicists with many different backgrounds, but rarely feed into other subfields. The diversity of the incoming flows to Plasma and Astro suggests their accessibility to physicists with many different backgrounds.

We also measure the average time it takes to transition to a different subfield, captured by the vertical axis of Fig. 2b. Once again, HEP, Nuclear and CondMat top the list: physicists who did not start their career in these subfields tend to transition towards them the earliest, typically by the third or fourth year of their research career. The opposite trend was observed for Interdisc and Astro, which not only have the highest transition rates among subfields, but are also characterised by the longest time to transition. Indeed, on average a physicist publishes her first paper on these two topics […] to […] years into her career, roughly double the transition time towards HEP, Nuclear and CondMat. Interdisc displays a late switch, consistent with the hypothesis that it takes time to gather expertise in multiple fields. Similarly, physicists tend to switch to Astro typically after a relatively long experience in HEP.

The flow diagram of Fig. 2b helps us better understand the research space captured by Fig. 1d. For instance, in the bottom right triple, HEP plays the leading role in producing physicists who transition to its tightly connected subfields, Nuclear and Astro. In the top two nodes of the network, CondMat is the main force feeding Interdisc. The observed widespread career transitions may reflect potential benefits to the whole field, cross-pollinating one physics community with ideas and methods developed by a different subfield [8, 9].
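The two quantities behind the flow diagram, the number of physicists moving from a starting subfield to another one and the average time at which the move happens, can be sketched in Python. This is an illustrative simplification: the function name `career_transitions`, the input layout (one `(year, subfield)` pair per labelled paper) and the toy careers are hypothetical, and the significance filter against a null model that the paper applies to the flows is omitted.

```python
from collections import defaultdict


def career_transitions(careers):
    """Count origin -> destination flows and mean years until the switch.

    careers: dict physicist -> list of (year, subfield) pairs, one per
             labelled paper (assumption: a multi-label paper appears once
             per label; ties in the first year are broken alphabetically).
    """
    flows = defaultdict(int)    # (origin, destination) -> physicists
    delays = defaultdict(list)  # destination -> years into career
    for papers in careers.values():
        papers = sorted(papers)
        start_year, origin = papers[0]
        first_seen = {}
        for year, field in papers:
            first_seen.setdefault(field, year)  # keep earliest year only
        for field, year in first_seen.items():
            if field != origin:
                flows[(origin, field)] += 1
                delays[field].append(year - start_year)
    mean_delay = {f: sum(d) / len(d) for f, d in delays.items()}
    return dict(flows), mean_delay


# Invented toy careers, for illustration only.
careers = {
    "A": [(1990, "CondMat"), (1992, "CondMat"), (1998, "Interdisc")],
    "B": [(2000, "HEP"), (2003, "Nuclear"), (2010, "Astro")],
    "C": [(1995, "CondMat"), (2005, "Interdisc")],
}
flows, mean_delay = career_transitions(careers)
print(flows)
print(mean_delay)
```

In the toy data both CondMat starters later publish in Interdisc, 8 and 10 years into their careers, so the mean delay for Interdisc is 9 years; the numbers are invented and carry no empirical meaning.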
The role of chaperones
The future prosperity of young scholars has often been linked to access to valuable mentorship at the early stages of a scientific career. For example, a surprising fraction of Nobel laureates had a mentor-mentee or a co-authorship relation with another Nobel laureate [18, 19], and scientists who co-author early with an established scientist are more likely to have higher impact and higher chances to publish as lead author than other scientists. Taken together, a senior scientist who acts as “chaperone” during a scientist’s early career might foster the acquisition of skills, passing on experience and knowledge necessary for high achievements later in a career.

To quantify the chaperone effect, we measure how many physicists co-author their first paper in a subfield with a physicist who has published in that subfield before. We find that the chaperone effect is particularly strong for HEP, Nuclear and CondMat, where over […] of physicists wrote their first paper with someone who published before in the same subfield (Fig. 2c and SI Section S8). This large share of chaperoned physicists could have several reasons, like the documented high number of physicists starting their career in these three subfields, or the need to access large facilities, which requires early-career physicists to collaborate with established scientists. Note that the typical large co-authorship patterns of HEP cannot explain the magnitude of the chaperone effect characterising this subfield (SI Section S8).

Other subfields have a lower fraction of chaperoned physicists, especially Interdisc and Astro. These subfields are often explored by more senior physicists who received mentorship at a previous stage of their careers in a different subfield and often decide to explore the new area without close supervision (26% of physicists are not chaperoned in Interdisc and Astro, Fig. 2c). On top of this, applications of computational physics, like computational biophysics or complex systems, classified as Interdisc, require lower financial resources compared to experimental research and could also play a significant role in explaining the low chaperone effect. Taken together, the chaperone effect is strong in physics, with an average rate of 82% chaperoned physicists across subfields. The effect signals a research culture where physicists often get introduced to their future research area by senior colleagues in a collaborative setting, in contrast with disciplines like mathematics, where the majority of scientists start their career by publishing solo-author papers.

Productivity, impact, and team size across subfields

Productivity and impact, capturing the number of papers published and citations received by a physicist, are frequently used metrics in the assessment of scientific careers [22, 23].
These quantities have implications for decisions and policies involving predicting, nurturing, and funding early career scientists. Yet, the proper interpretation of these metrics must account for the highly heterogeneous productivity and citation patterns characterising different subfields and for different team sizes, both of which vary in time.

Team size, i.e. the number of coauthors per paper, has been increasing steadily over the past decades in all fields, capturing an increasing collaboration in science. Are there particular differences in collaborative patterns in the different physics subfields, and what are their implications for productivity and impact? To answer this question, we assess the diversity and evolution of collaboration, productivity, and citation standards in the different subfields of physics. First, the tendency of scientists to work in increasingly large teams has been particularly pronounced in HEP (especially after 2005), Nuclear (especially after 2010) and Astro (especially after 2000) (Fig. 3a). The observed explosive growth in these three subfields is partly rooted in large-scale projects like ATLAS (SI Section S7). These projects also result in an increased productivity: as physicists were involved in more and larger teams, the average number of papers they published each year increased by a factor of 10 for HEP and by a factor of 2 for Nuclear and Astro from 1985 to 2015 (Fig. 3b). However, for the other six subfields productivity has stayed constant over 30 years, and for all subfields productivity has increased at a slower rate than team sizes. These different rates of increase explain why fractional productivity, i.e. the ratio between the number of papers and the average team size, decreased across all subfields (Fig. 3c). The effect is the strongest in HEP, Nuclear, and Astro, where team size grew disproportionately. It is worth noting that in these subfields authors are usually ordered alphabetically due to the large average team size, making the assessment of credit for single authors more problematic. Taken together, we find that the amount of knowledge produced per capita decreases in all subfields despite the increase in the total number of physicists and physics papers.

Given the explosive increase in both team size and the number of papers per physicist in HEP, do HEP physicists today have more or less impact than they had decades earlier? To answer this question we measured the average impact in number of citations after 5 years (Fig. 3d) and the fractional impact (ratio between number of citations and average team size, Fig. 3e) per physicist per subfield. Interestingly, the average impact of
HEP shows a growth of comparable magnitude to the growth in average productivity, leading to an unchanged fractional impact. In other words, large-scale projects like ATLAS produce papers that generate a large number of citations, compensating for the massive numbers of co-authors (hundreds or more).

Given some of the large productivity differences between different subfields, we also expect differences in impact, measured in terms of cumulative citations over a career. For instance, how much impact does it take to be a scientific leader in HEP and how is that different in CondMat? In Fig. 3f and Fig. 3g we show the total number of papers and citations acquired over an average career by the top 5% of physicists in each subfield (in terms of productivity). On both counts, HEP is by far the most rewarding subfield, whose top scientists coauthor 169 papers and accumulate over 7,000 citations. In contrast, top Interdisc physicists coauthor only 18 papers with less than 1,000 citations. The large discrepancy is not explained by paper citation rates [32, 33], which are roughly constant across subfields (SI Section S9), but by the high or low number of papers per author in the respective subfield (Fig. 3b). As a consequence, when physicists with different specialties compete for positions or grants, caution is needed in comparing their profiles using metrics based on citations or productivity, as subfield-dependent differences appear from the very beginning of a career.

What about the rate of top papers in the different subfields? We selected the top 1% of all physics papers (in terms of citations) and assessed into which subfield they fall (Fig. 3h). The majority falls into CondMat, General and HEP; however, this result is trivial, as these fields produce the most papers. To unveil the significant effects we measured the surplus between this top 1% distribution and the distribution of subfields of all physics papers. As Fig. 3i shows, Interdisc papers are 40% more likely to be in the top 1% than expected, while Nuclear and Plasma papers are 40% less likely to be found in the top 1%. The high rate of Interdisc among the top cited papers might be partially explained by the finding that papers which are 15% novel and 85% conventional often have high impact. Interdisc is more likely to achieve this balance, since interdisciplinary research must be novel and, at the same time, must adhere to established principles. Another explanation is that Interdisc is more likely to initiate new topics or emerging subfields. Papers that do open such new avenues are known to acquire a high number of citations as they become milestones, cited by subsequent papers once the field is established [34, 35].
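The surplus of a subfield among top-cited papers can be sketched as the ratio between its share of the top 1% and its share of all papers, minus one. The exact definition used for Fig. 3i is not spelled out here, so this formula is an assumption, as are the function name `top_cited_surplus`, the restriction to one subfield per paper, and the toy data.

```python
from collections import Counter


def top_cited_surplus(papers, top_fraction=0.01):
    """Relative over- or under-representation of subfields among top papers.

    papers: list of (citations, subfield) pairs (assumption: one subfield
            per paper; a multi-label paper would need one pair per label).
    Returns subfield -> surplus, where +0.4 means 40% more top papers than
    expected from the subfield's overall share, and -0.4 means 40% fewer.
    """
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    n_all = len(ranked)
    n_top = max(1, round(n_all * top_fraction))
    all_counts = Counter(field for _, field in ranked)
    top_counts = Counter(field for _, field in ranked[:n_top])
    return {
        field: (top_counts[field] / n_top) / (all_counts[field] / n_all) - 1
        for field in all_counts
    }


# Invented toy corpus of 1,000 papers: 500 CondMat, 100 Interdisc, 400 Nuclear.
papers = (
    [(1000 + i, "CondMat") for i in range(6)]
    + [(900 + i, "Interdisc") for i in range(2)]
    + [(800 + i, "Nuclear") for i in range(2)]
    + [(i % 50, "CondMat") for i in range(494)]
    + [(i % 50, "Interdisc") for i in range(98)]
    + [(i % 50, "Nuclear") for i in range(398)]
)
surplus = top_cited_surplus(papers)
print({field: round(value, 2) for field, value in surplus.items()})
```

In the toy corpus CondMat contributes most of the top papers in absolute terms (6 of 10), yet Interdisc has the largest surplus because it holds 20% of the top papers with only 10% of all papers. This is the size correction the paragraph above describes; the numbers themselves are invented.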
Recognition of physics subfields
Do impact differences affect the way in which the overall scientific community perceives the different subfields of physics? As a rough proxy of this recognition we take the Nobel Prizes awarded from 1985 to the present, highlighting each awarded subfield (Fig. 3j, SI Section S10). Although the Nobel Prize often recognises research undertaken long before the selection year, the timing of Nobel Prize selections could affect the way in which the relative importance of different physics communities is perceived by the committee. As a comparison between Fig. 2a and Fig. 3j shows, Nobel Prizes are not related to the number of physicists flocking into specific physics communities, nor do they show significant temporal clusters. However, the general distribution of awarded subfields reveals interesting tendencies: a large fraction of Nobel Prizes have been awarded to the “curricular” topics, like CondMat, the subfield with the largest number of active researchers, and HEP. Surprisingly, Astro, despite the relatively moderate size of its community, comes in third, with five Nobel Prizes. This success might be linked to the perception of astrophysics as a field that studies the universe on a grand scale, as well as to its strong ties to HEP, a regular recipient of Nobels. Other well established areas with a long history, such as AMO and Classical, have also been recognised. In contrast, since 1985 Plasma and Interdisc have not been awarded a Nobel Prize. The omission of Interdisc likely comes from the charter of the Nobel Prize to award clear-cut categories (e.g. physics, chemistry, medicine/physiology) rooted in the 19th century, discriminating against interdisciplinary discoveries [36, 37].
Conclusions
As one of the oldest scientific disciplines, physics plays a fundamental role in the development of science. As the aperture of physics widens, the focus of individual physicists narrows, leading progressively to the formation of specialised communities and subfields. Here we offered an intellectual census of these subfields, exploring how physicists migrate between them, and how they specialise and collaborate to create impactful research.

We observed that subfields rarely live in isolation but rather tend to overlap, with individual scientists working in multiple subfields and transitioning between fields during their career. Mapping these overlaps reveals a highly non-trivial research space, displaying deep intellectual links between some subfields and large gaps between others.

Physicists who are confronted with heated arguments on the allocation of resources to different subfields and departments often use metrics of productivity or impact to seek priority. However, our research suggests that such arguments should be taken with scepticism. Indeed, there are considerable field-specific differences in the patterns of productivity and impact. Publication rates have exploded in recent years in HEP, Nuclear and Astro, whereas fractional productivity is declining. In some subfields, such as HEP, researchers co-author an exceptionally large number of papers, partly rooted in their unique culture of collaboration. By contrast, interdisciplinary physicists produce papers at a much lower rate but their papers tend to garner a disproportionately higher impact, once we factor in the relative size of the subfield. Understanding these field differences within physics represents the first step towards a deeper understanding of our discipline. As tomorrow’s physicists working on different topics compete for the same positions and resources, these insights may prove pertinent for the sustainable vitality of physics as a discipline.

Our study is based on Web of Science data, lacking the literature that has been exclusively published in preprint servers like arXiv, leading to unavoidable (but small) differences in subfield representation due to diverse publication cultures in different communities. For example, the proportion of HEP and Astro papers in arXiv is higher compared to our dataset and WoS, reflecting the common practice of these communities to communicate findings in preprints rather than journal papers. However, there is a high overlap in the coverage of the physics literature between different databases and a high correlation of the representation of physics subfields (SI Section S3), indicating that our findings should agree if repeated on a different database.

In this study we focused on careers of physicists within physics. However, these days, many scientists with a background in the physical sciences contribute to fields outside of physics, from biology to finance, both in academia and the private sector. For this reason, the investigation of the connection between physics and other scientific disciplines, and the career transitions away from physics, remains fruitful future work. Indeed, such an investigation, possibly aided with data sources that go beyond scientific publications, could shed light on the role of physics and its subfields in the entire ecosystem of science and beyond.
Acknowledgments
This work was supported by the John Templeton Foundation Grant […]
Figure 1.
Taking census of physics subfields. a, Number of physicists per subfield. b, Percentage of physicists working in 1, 2, 3, or 4+ subfields. We call the 37% of physicists who work in only one subfield specialised. c, Fraction of specialised physicists per subfield. Most subfields except for HEP, Nuclear and CondMat have a negligible fraction of specialised physicists. d, The network of co-activity of individual physicists shows the nontrivial connection between subfields. Node size is proportional to the number of physicists in the subfield, link width is proportional to the overlap between subfields, quantified with the ratio between the measured number of physicists working on the two subfields and the expected number based on a randomised null model.

Figure 2.
Evolution of physics subfields and careers. a, Relative growth rate, defined as yearly fraction of physicists who published their first paper in a new subfield. Interdisc and Astro grow, CondMat shrinks considerably. HEP displays a spike in 2010 that can be attributed to large-scale collaborations like ATLAS and CMS (SI Section S7). Relative growth rate is less reliable after 2010 due to early-career physicists accumulating publications at different rates in each subfield, resulting in reaching the 5 publications threshold at different times and distorting the proportion of physicists in favor of more productive and non-specialised subfields. b, Flow diagram of career transitions. The sizes of rectangles on the top are proportional to the number of career first publications in each given subfield. The rectangles at the bottom are proportional to the number of physicists in each subfield who did not start their career by publishing in the area; for example, Astro and AMO have roughly the same number of physicists although Astro starts with 3%, while AMO starts with 5%. The distance from the top reflects the average time at which a career transition towards a subfield occurs. Flows are proportional to the number of physicists who first published in a subfield different from the one in which they worked previously. Only significant flows, i.e. those that are larger than expected in the null model, are shown. The percentages on the bottom rectangles report the contribution of the subfield that is contributing most. c, Fraction of not chaperoned physicists in each subfield. A large majority of physicists starting in HEP, Nuclear, or CondMat co-author their first paper with physicists who have already published in the subfield. Other subfields have a much higher fraction of physicists who are not chaperoned.

Figure 3.
Productivity and impact across physics communities. a, Average team size, defined as the average number of authors per paper, over time. Team sizes grow in all fields, especially in HEP, Nuclear, and Astro, due to large-scale experimental projects. b, Average productivity, defined as the number of papers per author, over time. Productivity grows for HEP, Nuclear, and Astro but stays roughly constant for other subfields. c, Fractional productivity, i.e. number of papers divided by team size, over time. For all subfields productivity grows less than team size, therefore fractional productivity decreases. d, Average impact, defined as the number of citations per author within a 5-year window. Impact increases in all fields, but only HEP shows exceptional growth. e, Fractional impact, i.e. number of paper citations divided by team size, over time. Most subfields show a roughly constant trend until 2005. f, Number of papers of the top physicists by productivity. Due to different collaboration standards, HEP physicists coauthor more papers than physicists in other subfields. Interdisc physicists produce an especially low number of papers. g, Number of citations of the top physicists by productivity. HEP physicists receive more citations because of their high productivity. h, Fraction of top 1% cited papers per subfield and i, subfield surplus with respect to the number expected given the subfield size. Interdisc generates the highest number of high-impact papers relative to its size. j, Nobel Prizes in physics per year across subfields. Plasma and
Interdisc have not received an award.

References

Jones, B. F. The burden of knowledge and the “death of the renaissance man”: Is innovation getting harder? The Review of Economic Studies, 283–317 (2009).
Clauset, A., Larremore, D. B. & Sinatra, R. Data-driven predictions in the science of science. Science, 477–480 (2017).
Fortunato, S. et al. Science of science. Science, eaao0185 (2018).
Deville, P. et al. Career on the move: Geography, stratification, and scientific impact. Scientific Reports (2014).
Sinatra, R., Deville, P., Szell, M., Wang, D. & Barabási, A.-L. A century of physics. Nature Physics, 791 (2015).
Deville, P. Understanding social dynamics through big data (PhD Thesis) (Université Catholique de Louvain, 2015).
PACS 2010 regular edition. https://publishing.aip.org/publishing/pacs/pacs-2010-regular-edition.
Dyson, F. Birds and frogs. Notices of the AMS, 212–223 (2009).
Uzzi, B., Mukherjee, S., Stringer, M. & Jones, B. Atypical combinations and scientific impact. Science, 468–472 (2013).
Foster, J. G., Rzhetsky, A. & Evans, J. A. Tradition and innovation in scientists’ research strategies. American Sociological Review, 875–908 (2015). URL https://doi.org/10.1177/0003122415601618.
Guevara, M. R., Hartmann, D., Aristarán, M., Mendoza, M. & Hidalgo, C. A. The research space: using career paths to predict the evolution of the research output of individuals, institutions, and nations. Scientometrics, 1695–1709 (2016).
ATLAS experiment reports. https://atlas.cern/updates/atlas-news/atlas-experiment-reports-its-first-physics-results-lhc.
Jia, T., Wang, D. & Szymanski, B. K. Quantifying patterns of research-interest evolution. Nature Human Behaviour, 0078 (2017).
Balassa, B. Trade liberalization and ‘revealed’ comparative advantage. Manchester School.
Crosta, P. M. & Packman, I. G. Faculty productivity in supervising doctoral students’ dissertations at cornell university. Economics of Education Review, 55–65 (2005).
Malmgren, R. D., Ottino, J. M. & Amaral, L. A. N. The role of mentorship in protégé performance. Nature, 622 (2010).
Chariker, J. H., Zhang, Y., Pani, J. R. & Rouchka, E. C. Identification of successful mentoring communities using network-based analysis of mentor–mentee relationships across nobel laureates. Scientometrics, 1733–1749 (2017).
Zuckerman, H. Nobel laureates in science: Patterns of productivity, collaboration, and authorship. American Sociological Review.
Ma, Y. & Uzzi, B. The scientific prize network predicts who pushes the boundaries of science. https://arxiv.org/abs/1808.09412 (2018).
Sekara, V. et al. The chaperone effect in science. PNAS, in print (2018).
Szell, M. & Sinatra, R. Research funding goes to rich clubs. Proceedings of the National Academy of Sciences, 14749–14750 (2015).
Sinatra, R., Wang, D., Deville, P., Song, C. & Barabási, A.-L. Quantifying the evolution of individual scientific impact. Science, aaf5239 (2016).
Liu, L. et al. Hot streaks in artistic, cultural, and scientific careers. Nature.
Radicchi, F., Fortunato, S. & Castellano, C. Universality of citation distributions: Toward an objective measure of scientific impact. Proceedings of the National Academy of Sciences, 17268–17272 (2008).
Pavlidis, I., Petersen, A. M. & Semendeferi, I. Together we stand. Nature Physics, 700 (2014).
Wuchty, S., Jones, B. & Uzzi, B. The increasing dominance of teams in production of knowledge. Science, 1036–1039 (2007).
Shen, H.-W. & Barabási, A.-L. Collective credit allocation in science. Proceedings of the National Academy of Sciences, 12325–12330 (2014).
Lehmann, S., Jackson, A. & Lautrup, B. Measures for measures. Nature, 1003–1004 (2006).
Lehmann, S., Jackson, A. & Lautrup, B. A quantitative analysis of indicators of scientific performance. Scientometrics, 369–390 (2008).
Hicks, D., Wouters, P., Waltman, L., Rijcke, S. d. & Rafols, I. Bibliometrics: the Leiden Manifesto for research metrics. Nature (2015).
Waltman, L. A review of the literature on citation impact indicators. Journal of Informetrics, 365–391 (2016).
Lillquist, E. & Green, S. The discipline dependence of citation statistics. Scientometrics, 749–762 (2010).
Radicchi, F. & Castellano, C. Rescaling citations of publications in physics. Physical Review E, 046116 (2011).
Newman, M. The first-mover advantage in scientific publication. EPL (Europhysics Letters), 68001 (2009).
Van Noorden, R. Interdisciplinary research by the numbers. Nature News, 306 (2015).
Szell, M., Ma, Y. & Sinatra, R. Interdisciplinarity: A nobel opportunity. Accepted for publication in Nature Physics (2018).
Bromham, L., Dinnage, R. & Hua, X. Interdisciplinary research has consistently lower funding success. Nature, 684 (2016). URL http://dx.doi.org/10.1038/nature18315.
The arXiv repository. https://arxiv.org.
Martín-Martín, A., Orduna-Malea, E. & Delgado López-Cózar, E. Coverage of highly-cited documents in google scholar, web of science, and scopus: a multidisciplinary comparison. Scientometrics, 2175–2188 (2018).
Farmer, J. D. Physicists attempt to scale the ivory towers of finance. Computing in Science & Engineering, 26–39 (1999).

Supplementary Information

Taking Census of Physics
S1 Defining physics publications in non-physics journals
We identify physics publications in journals that are not explicitly labelled as physics journals by means of a method first used in Refs. 1, 2. This method reconstructs a community in a network when only a small fraction of nodes are explicitly labelled as belonging to it. In our case, the hypothesis is that physics papers can be found not only in conventional physics journals (core physics papers) but also in other venues (interdisciplinary physics papers). Such interdisciplinary papers can be identified if they have a significant number of references to, or citations from, conventional physics venues. In that earlier work the label propagation algorithm was first applied to an older version of the Web of Science (WoS), which covered scientific publications until 2012 and was based on an older database structure. Here we reapply the method to an updated version of WoS purchased from Clarivate Analytics, which covers publications until 2017 and uses a new database structure with, among other things, a different identification system for papers. We obtain a new physics dataset of papers, which we want to characterise further by identifying the physics subfields they belong to. For this reason, papers in the dataset, except those of the American Physical Society (APS) journals, are then considered for assignment to a given subfield and for inclusion in our physics communities analysis. The label propagation method at the subfield level is a modified implementation of the algorithm presented in this Section, and it is illustrated in detail in Section S3.

The label propagation method to construct the physics dataset works in the following way. Let us consider a directed network with $N$ nodes, for instance the citation network described by the WoS dataset, where nodes are scientific publications and a directed link from publication $i$ to publication $j$ exists if paper $i$ cites paper $j$. Each node $i$ has an in-degree $k^{\mathrm{IN}}_i$ (number of citations) and an out-degree $k^{\mathrm{OUT}}_i$ (number of references).
Nodes with $k^{\mathrm{IN}}_i = 0$ and $k^{\mathrm{OUT}}_i = 0$ are publications without references and citations and are isolated nodes in the network. Additionally, in our case each node $i$ is characterised by a variable $t_i$ corresponding to the time of publication of the article. The method is based on an iterative process where at each step $s$ the $N$ nodes are assigned to three sets: the core set $C_s$, the tangent set $T_s$ and the external set $E_s$. The core set $C_s$ includes the nodes that the algorithm considers to be part of the target community at step $s$. In our case, at step $s = 0$, $C_0$ includes all articles published in physics journals. The purpose of this initial core set is to act as a seed to detect other nodes that are part of the community, even if initially they are not classified as such, and that will be iteratively included in $C_s$ at subsequent steps $s > 0$. The second set is the tangent set $T_s$, which contains all the nodes outside the core set $C_s$ that have at least one (ingoing or outgoing) connection to a node within $C_s$. The third set is the external set $E_s$, which corresponds to all nodes outside the core set $C_s$ that share no connection with nodes within $C_s$ and therefore have no chance of being included in the core at the subsequent step $s + 1$. By definition we have $C_s \cup T_s \cup E_s = N$ and $C_s \cap T_s \cap E_s = \emptyset$.

The basic idea of the method is to iteratively extend the target community $C_s$ into $C_{s+1}$ by adding candidate nodes from $T_s$ that are statistically expected to be part of the community based on their connections. In our case this corresponds to identifying as physics all scientific papers which are not published in physics journals, but whose patterns of references and citations are indistinguishable from those published in the traditional physics venues. The purpose of the tangent set $T_s$ is to contain all candidate nodes, i.e. nodes that might subsequently be added to the target community $C_s$ at step $s$ after inspection of their incoming and outgoing links.
To do so, at each step $s$ and for each node $i$ we compute two variables: $r^{\mathrm{IN}}_{i,s}$ and $r^{\mathrm{OUT}}_{i,s}$. These variables quantify the expectation of a particular node to be part of the target community $C_s$ based on its incoming citations and outgoing references.

Let us focus first on incoming citations, evaluated through $r^{\mathrm{IN}}_{i,s}$, where

$$ r^{\mathrm{IN}}_{i,s} = \frac{k^{\mathrm{IN},J}_{i,s}}{\hat{k}^{\mathrm{IN},J}_{i,s}}. \qquad (1) $$

Here $k^{\mathrm{IN},J}_{i,s}$ corresponds to the number of incoming links (citations) to node $i$ originating from nodes in the core $C_s$. $\hat{k}^{\mathrm{IN},J}_{i,s}$, instead, accounts for the expected number of incoming links from the core in a null model where the real number of incoming and outgoing links of each node (citations and references of each paper) in the network is fixed. This last constraint corresponds to considering the directed configuration model ensemble of the original citation network, meaning that we can write

$$ \hat{k}^{\mathrm{IN},J}_{i,s} = k^{\mathrm{IN}}_i \, \frac{\sum_{j \in C_s} k^{\mathrm{OUT}}_j}{\sum_{j \in N} k^{\mathrm{OUT}}_j}, \qquad (2) $$

where $k^{\mathrm{IN}}_i$ denotes the total number of incoming links to node $i$, and the remaining term corresponds to the probability for a link to originate from $C_s$. As an article $i$ can receive a citation from another paper $j$ only if the latter is more recent, i.e. $t_j > t_i$, we eventually set

$$ \hat{k}^{\mathrm{IN},J}_{i,s} = k^{\mathrm{IN}}_i \, \frac{\sum_{j \in C_s \,|\, t_j > t_i} k^{\mathrm{OUT}}_j}{\sum_{j \in N \,|\, t_j > t_i} k^{\mathrm{OUT}}_j}. \qquad (3) $$

Similarly, the share of outgoing references is evaluated through $r^{\mathrm{OUT}}_{i,s}$, where

$$ r^{\mathrm{OUT}}_{i,s} = \frac{k^{\mathrm{OUT},J}_{i,s}}{\hat{k}^{\mathrm{OUT},J}_{i,s}}, \qquad (4) $$

and

$$ \hat{k}^{\mathrm{OUT},J}_{i,s} = k^{\mathrm{OUT}}_i \, \frac{\sum_{j \in C_s \,|\, t_j < t_i} k^{\mathrm{IN}}_j}{\sum_{j \in N \,|\, t_j < t_i} k^{\mathrm{IN}}_j}. \qquad (5) $$

A value $r^{\mathrm{IN}}_{i,s} > 1$ ($r^{\mathrm{OUT}}_{i,s} > 1$) corresponds to a node that is more likely to be cited by (to reference) nodes from the core than would be expected at random. At each step $s$ of the process, we use the variables $r^{\mathrm{IN}}_{i,s}$ and $r^{\mathrm{OUT}}_{i,s}$ associated with nodes in $T_s$ to produce the updated core set $C_{s+1}$. First we add all nodes in $C_s$ to $C_{s+1}$. Then, for each node $i \in T_s$, we add $i$ to $C_{s+1}$ if we have

$$ r^{\mathrm{IN}}_{i,s} > \tau^{\mathrm{IN}} \qquad (6) $$

or

$$ r^{\mathrm{OUT}}_{i,s} > \tau^{\mathrm{OUT}}. \qquad (7) $$

The thresholds $\tau^{\mathrm{IN}}$ and $\tau^{\mathrm{OUT}}$ are fixed based on a parameter $p$, such that they correspond respectively to the $p$-th percentile of the distribution of $r^{\mathrm{IN}}_{i,0}$ and $r^{\mathrm{OUT}}_{i,0}$ values for nodes within the initial core set $C_0$. Once nodes $i \in T_s$ satisfying the condition of Eq. 6 or Eq. 7 are added to the core set $C_{s+1}$, both sets $T_s$ and $E_s$ can be updated to $T_{s+1}$ and $E_{s+1}$ from $C_{s+1}$. The process stops when $C_s$ has converged, i.e. when no nodes from $T_s$ can be added to the core set $C_s$. Note that while the thresholds $\tau^{\mathrm{IN}}$ and $\tau^{\mathrm{OUT}}$ remain constant during the whole process, the values $r^{\mathrm{IN}}_{i,s}$ and $r^{\mathrm{OUT}}_{i,s}$ associated with each node $i$ change at each iteration, because new nodes are incorporated into the set $C_s$ at each iteration step $s$. As shown in previous work, in the case of physics publications in the WoS dataset the algorithm was run iteratively for a number of steps, showing fast convergence.

The parameter $p$ can be considered as a tolerance parameter, in the sense that it defines the minimal attraction needed for a node to be incorporated in the growing core. As described in Refs.
1, 2, in our case it is possible to set the value of $p$ by validating the algorithm on all publications of two interdisciplinary journals for which a subset is explicitly labelled as physics, namely Science (1995-2013) and PNAS (1915-2013). The value of $p$ was chosen as the best trade-off between the true positive and true negative rates. We ran the algorithm on the new version of the WoS dataset, which comprises ∼54 million papers. The non-physics journals in which physics publications were identified include Nature and several materials and chemistry journals (Table S1).

Table S1. Non-physics journals with most physics publications and highest percentage of physics publications identified by means of label propagation. Columns: Rank; Journal (number of papers); Journal (percentage of papers).
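The iterative procedure of this Section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used on the WoS dataset: the function names (`percentile`, `ratios`, `propagate`), the nearest-rank percentile, the convention that a ratio equals 0 when the expected number of links is 0, and the toy citation network are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def percentile(values, p):
    """Nearest-rank p-th percentile (assumed convention; p in (0, 100])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def _expected(k_own, k_other, candidates, core):
    """Configuration-model expectation of links between a node and the core."""
    total = sum(k_other[j] for j in candidates)
    if total == 0:
        return 0.0
    return k_own * sum(k_other[j] for j in candidates if j in core) / total


def ratios(edges, year, core):
    """Return (r_in, r_out) for every node; edges are (citing, cited) pairs."""
    k_in = Counter(cited for _, cited in edges)
    k_out = Counter(citing for citing, _ in edges)
    obs_in = Counter(cited for citing, cited in edges if citing in core)
    obs_out = Counter(citing for citing, cited in edges if cited in core)
    r_in, r_out = {}, {}
    for i in year:
        newer = [j for j in year if year[j] > year[i]]  # papers that can cite i
        older = [j for j in year if year[j] < year[i]]  # papers that i can cite
        exp_in = _expected(k_in[i], k_out, newer, core)   # Eq. (3)
        exp_out = _expected(k_out[i], k_in, older, core)  # Eq. (5)
        # Assumed convention: the ratio is 0 when no link is expected.
        r_in[i] = obs_in[i] / exp_in if exp_in else 0.0
        r_out[i] = obs_out[i] / exp_out if exp_out else 0.0
    return r_in, r_out


def propagate(edges, year, seed, p):
    """Grow the core from `seed` until no tangent node passes a threshold."""
    core = set(seed)
    r_in0, r_out0 = ratios(edges, year, core)
    tau_in = percentile([r_in0[i] for i in core], p)    # fixed once, on C_0
    tau_out = percentile([r_out0[i] for i in core], p)
    while True:
        r_in, r_out = ratios(edges, year, core)
        tangent = {a for a, b in edges if b in core and a not in core}
        tangent |= {b for a, b in edges if a in core and b not in core}
        added = {i for i in tangent if r_in[i] > tau_in or r_out[i] > tau_out}
        if not added:
            return core
        core |= added


# Hypothetical toy network: p* are seed physics papers, b* are papers from
# another field, x cites only physics papers, y cites mostly non-physics ones.
year = {"p1": 2000, "b1": 2000, "p2": 2001, "b2": 2001, "p3": 2002,
        "b3": 2002, "p4": 2003, "x": 2004, "y": 2004}
edges = [("p2", "p1"), ("p3", "p1"), ("p3", "b1"), ("p4", "p2"), ("p4", "p3"),
         ("b2", "b1"), ("b3", "b1"), ("b3", "b2"),
         ("x", "p2"), ("x", "p3"), ("x", "p4"),
         ("y", "b1"), ("y", "b2"), ("y", "b3"), ("y", "p4")]
physics = propagate(edges, year, seed={"p1", "p2", "p3", "p4"}, p=50)
print(sorted(physics))
```

In this toy network the candidate `x`, whose references all point to the core, is absorbed at the first step, whereas `y` and `b1` stay in the tangent set because their ratios remain below the thresholds. The quadratic loops over nodes are kept for readability only; a real citation network would need precomputed cumulative sums over publication years.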
S2 Identifying physics subfields from PACS codes
Although the WoS dataset provides a thorough classification of core physics publications into different subfields (see Section S3), this classification is not detailed enough for our purposes and, most importantly, it fails to associate a subfield with publications outside physics journals. For this reason, in our work we associated publications with subfields according to the Physics and Astronomy Classification Scheme (PACS) of the American Physical Society, a hierarchical classification used for papers in APS journals between 1977 and 2015. The classification uses four digits and an extra identifier. The first digit identifies 10 different physics subfields, namely: General (0), The Physics of Elementary Particles and Fields (shortened as HEP, 1), Nuclear Physics (Nuclear, 2), Atomic and Molecular Physics (AMO, 3), Electromagnetism, Optics, Acoustics, Heat Transfer, Classical Mechanics, and Fluid Dynamics (Classical, 4), Physics of Gases, Plasmas, and Electric Discharges (Plasma, 5), Condensed Matter: Structural, Mechanical and Thermal Properties (6), Condensed Matter: Electronic Structure, Electrical, Magnetic, and Optical Properties (7), Interdisciplinary Physics and Related Areas of Science and Technology (Interdisc, 8), and Geophysics, Astronomy, and Astrophysics (Astro, 9). We merged PACS 6 and 7 into a single category named CondMat, in order to match other common physics classifications, such as that of the arXiv (see Section S3). We stress that the term interdisciplinary physics, used in earlier work to describe physics publications in non-physics journals, is not linked to PACS 8 of the APS scheme. In the following, as well as in the main text, the term Interdisciplinary physics is reserved for publications and authors working in this precise subfield of physics, unlike in that earlier work.

PACS codes can be found in the APS dataset, available from the APS upon request, which covers all publications that appeared in the journals of the American Physical Society until 2015. Although PACS appeared in 1977, only a small fraction of papers were assigned one until they were enforced in 1985. For this reason, we focused our analysis on the years 1985-2015, for which our dataset has 435,722 papers with at least one PACS. A further 5,616 papers have a PACS but were published before 1985. In more detail, between 1985 and 2015 we have 265,549 papers with exactly one 1-digit PACS, 138,176 with two, 29,806 with three, 2,160 with four and 31 with five.

In Fig. S1 we report the distribution of the 9 physics subfields for six well-established journals published by the APS, namely the general-purpose Physical Review Letters and the specialised venues Physical Review A-E. Physical Review B (covering condensed matter and materials physics) and Physical Review C (covering nuclear physics) indeed predominantly publish papers belonging to a single subfield, respectively Condensed Matter and Nuclear Physics. Conversely, Physical Review A (covering atomic, molecular, and optical physics and quantum information), Physical Review D (covering particles, fields, gravitation, and cosmology) and Physical Review E (covering statistical, nonlinear, biological, and soft matter physics) publish across a greater mixture of subfields. As expected, Physical Review Letters, the APS flagship journal, publishes across all domains, although with different frequencies.

Similarly to the identification of physics papers in non-physics venues, we use the papers published in the APS journals as the initial seed to assign subfields to other physics publications by means of label propagation (see Section S3 for details). In this way, we obtain a data-driven subfield classification of physics papers in the WoS dataset.

In Fig. S2a we report the proportions of APS papers belonging to each subfield and compare them to those of our newly created dataset. In Fig. S2b we report the distribution of the number of subfields per paper in the APS between 1985 and 2015, and in Fig. S2c the fraction of papers per subfield over the years.

[Figure S1: six bar-chart panels (Phys. Rev. A, B, C, D, E and Phys. Rev. Lett.), each showing the percentage of papers in General, HEP, Nuclear, Astro, AMO, Classical, Plasma, CondMat and Interdisc.]
Figure S1. Subfield distribution for papers published in APS journals. Different APS journals show different publication patterns across subfields. Physical Review B covers predominantly CondMat, and Physical Review C is similarly focused on Nuclear. In contrast, Physical Review A, Physical Review D and Physical Review E do not cover a single, predominant subfield. Physical Review Letters is the most balanced journal of the APS, publishing across all subfields.
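The mapping from PACS codes to the nine subfields used in this work can be written down explicitly. The sketch below is only an illustration: the dictionary name, the helper `subfields` and the example codes are hypothetical, and codes are assumed to be strings whose first character is the 1-digit PACS class.

```python
# First PACS digit -> subfield label; digits 6 and 7 are merged into CondMat.
SUBFIELD_BY_FIRST_DIGIT = {
    "0": "General",
    "1": "HEP",
    "2": "Nuclear",
    "3": "AMO",
    "4": "Classical",
    "5": "Plasma",
    "6": "CondMat",
    "7": "CondMat",
    "8": "Interdisc",
    "9": "Astro",
}


def subfields(pacs_codes):
    """Return the sorted list of distinct subfields for one paper's PACS codes."""
    found = set()
    for code in pacs_codes:
        code = code.strip()
        # Codes that do not start with a digit are silently ignored (assumption).
        if code and code[0] in SUBFIELD_BY_FIRST_DIGIT:
            found.add(SUBFIELD_BY_FIRST_DIGIT[code[0]])
    return sorted(found)


# Illustrative codes: one paper in two subfields, one paper in a single subfield.
print(subfields(["89.75.Hc", "05.45.-a"]))
print(subfields(["71.10.-w", "64.60.-i"]))
```

Because classes 6 and 7 collapse into the same label, a paper carrying one code from each counts as belonging to a single subfield.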
S3 Assigning Physics subfields to Web of Science publications
We propagate physics subfields to physics publications in the WoS dataset based on relevant patterns of references and citations to the specific subfield(s), adapting the method described in the first section of this SI. For each subfield $\alpha$ we have a different initial core set $C_\alpha$, corresponding to all publications in the APS journals between 1985 and 2015 associated with that subfield. First, we matched the papers of the APS dataset to the Web of Science dataset, either via exact DOI matching or, when the DOI is not available, by using the Levenshtein distance to compute title similarity. In this second case the match was accepted if there was at least 90% string similarity between the titles of the two papers in the datasets and the second-best match had a string similarity at least 5 times worse. In this way we were able to match 90% of all the papers manually assigned to a subfield between 1985 and 2015.

In the original implementation it was possible to set the thresholds $\tau^{\mathrm{IN}}$ and $\tau^{\mathrm{OUT}}$ by evaluating the performance of the algorithm on the ground truth of physics papers published in interdisciplinary journals such as Science and PNAS. Such validation is not possible at the subfield level, so for label propagation at the subfield level we slightly modified the original implementation. We observe that the algorithm may propagate subfields to papers both within and outside the original APS core, which is made of papers that already have a PACS code. For this reason, for each subfield $\alpha$ we selected the threshold $\tau_\alpha$ so that, after the iterations, the number of papers of the subfield grows by no more than 10% within the original APS dataset. For simplicity, we chose $\tau^{\mathrm{IN}}_\alpha = \tau^{\mathrm{OUT}}_\alpha$. Afterwards, we performed label propagation for each subfield $\alpha$ independently. We obtained a total of 1,137,670 papers in WoS published between 1985 and 2015 and classified within one of the subfields of physics. We note that some papers outside the considered time span were also assigned a subfield, but we focused our analysis on the period 1985-2015 to be consistent with the years when PACS were systematically used in publications by the APS. As already mentioned, the PACS corresponding to the two categories associated with Condensed Matter were merged into the same subfield.

It is interesting to compare the classification of papers obtained through label propagation with that of the original APS dataset. Figure S2a compares the fraction of subfields in the original and the propagated datasets. The two datasets have a similar subfield distribution, with a cosine similarity of 0.99. Differences between the two datasets are likely to indicate an under- or over-representation of some areas of physics in the Physical Review series compared to the overall physics world. In Fig. S2b we report the distribution of the number of subfields per paper in the two datasets. Papers in the reconstructed physics dataset tend to be slightly more specialised, i.e. a larger share of them is assigned to a single subfield, than those in the APS dataset. However, overall the two distributions are quite similar. Finally, in Figs. S2c,d we show the evolution of the fraction of papers of different subfields in the APS dataset and in our reconstructed dataset from 1985 to 2015. The two datasets have very similar temporal patterns during the period under investigation.
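The title-matching rule used when no DOI is available can be illustrated as follows. This is a hedged sketch rather than the code used for the matching: the function names are hypothetical, similarity is assumed to be one minus the Levenshtein distance divided by the length of the longer title (after lower-casing), and "5 times worse" is read literally as the runner-up similarity being at most one fifth of the best one.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1] (assumed: 1 - distance / longer length)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def match_title(title, candidates, min_sim=0.9, factor=5):
    """Return the id of the accepted candidate, or None if no safe match exists.

    `candidates` maps a record id to its title.
    """
    scored = sorted(((similarity(title, t), key) for key, t in candidates.items()),
                    reverse=True)
    if not scored or scored[0][0] < min_sim:
        return None
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Assumed reading of "5 times worse": runner-up similarity <= best / factor.
    return scored[0][1] if runner_up * factor <= scored[0][0] else None
```

Other readings of the second criterion are possible, for instance comparing dissimilarities instead of similarities; the `factor` test is isolated in the last line so that it can be swapped without touching the rest.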
Validation: To test the robustness of our findings, we validated our data-driven classification of papers across subfields. As already mentioned, PACS codes were systematically introduced in publications in the APS journals in 1985. As our method classifies papers into subfields according to patterns of references and citations only, our algorithm naturally assigns subfields also to publications in the APS journals before 1985, provided that they are significantly connected to the corresponding core papers for the subfield(s). Five of the six APS journals analysed previously (all except Physical Review E) were founded before 1985. In Fig. S3 we test the robustness of the subfield distributions in these journals, as a way to assess the effectiveness of our data-driven method to classify physics papers across subfields, by comparing the distribution of the subfields manually assigned between 1985 and 2015 in Physical Review A, B, C, D and Physical Review Letters with that obtained by means of label propagation for papers published before 1985 in the same journals. The two distributions are highly correlated for all journals, with cosine similarities ranging from 0.88 to 0.99.

Figure S2. Comparison between the APS dataset and the reconstructed physics dataset. a, Scatterplot of the fraction of subfields appearing in papers of the APS dataset and in the reconstructed physics dataset. b, Distribution of the number of subfields per paper in the two datasets. c, d, Temporal evolution of the fraction of subfields between 1985 and 2015 for the two datasets.

Figure S3. Testing propagated subfields in APS journals before 1985. Scatterplot between the subfield distribution of the papers published in the APS journals after 1985 (APS real subfields) and the propagated subfield distribution for papers published before 1985 in the same journals (APS propagated subfields). The cosine similarities between the distributions before and after 1985 are computed for (i) Physical Review A, (ii) Physical Review B, (iii) Physical Review C, (iv) Physical Review D and (v) Physical Review Letters.

We also tested the robustness of our subfield categorisation by comparing it to additional sources providing alternative physics classifications, namely the classification provided by (i) the WoS dataset (for core physics papers only) and (ii) the arXiv repository, which collects electronic preprints of papers related to physics topics. The cosine similarity between the fraction of papers in our dataset and in each of the two alternative datasets is quite high (Fig. S4). The matching is affected by the fact that categories in the different schemes do not correspond one-to-one: for example, the nonlin category in the arXiv dataset, which we eventually mapped into the General physics subfield, actually contains papers of at least one additional subfield, i.e.
Interdisc. For the same reason, some of the subfields obtained from the PACS scheme do not have a direct counterpart in the other two datasets. We report the full mappings in Table S2.

Another factor that may affect the matching is the presence of biases specific to each of these datasets, which are captured by comparing them with our new data-driven reconstructed physics dataset. For instance, the arXiv, first created as a repository for people working on High Energy Physics, shows a disproportionately high number of HEP and Astro publications, and the repository has been largely used by that community.

Figure S4. Comparison between the distribution of subfields in our reconstructed physics dataset with the WoS and the arXiv physics categories. The two panels plot the WoS original categories and the arXiv categories against the reconstructed physics dataset. Correlation between distributions is high in both cases, as measured by cosine similarity.

In Table S3, we report the five non-APS journals with most papers assigned to each subfield by means of label propagation (number of papers in brackets).

We note that the Astrophysics literature seems to be relatively disconnected from its APS core, compared to results for the other subfields. As an example, we focus on a well-established specialised journal in the area, the
Astrophysical Journal, for which WoS indexes 98,482 papers, only 2,330 of which are labelled. This is because only a small fraction of the 3,724,542 outgoing references from papers published in the Astrophysical Journal are directed towards the Astro core. Similarly, only a small fraction of the 4,896,146 incoming citations towards papers published in the Astrophysical Journal come from the Astro core. As a reference, we compare these numbers with those of Solid State Communications, a specialised journal in the area of Condensed Matter, for which our method assigns a subfield to 16,274 out of 35,781 papers. In this case, the fractions of the journal's 489,625 references and 635,466 citations that link to the CondMat core are roughly five times higher than those for the Astrophysical Journal. As a consequence of this disconnection, it is possible that our method is underestimating the number of (possibly specialised) scientists working in Astrophysics. For both journals the fraction of citations (references) coming from (going towards) the cores associated with the other subfields is negligible.

Table S2. Mapping of physics categories from arXiv categories and WoS physics categories into physics subfields.

WoS category | Subfield | arXiv category
/ | General | /
Fields | HEP | hep-ex, hep-lat, hep-ph, hep-th, math-ph
Nuclear Physics | Nuclear | nucl-ex, nucl-th
Astrophysics | Astro | astro-ph, gr-qc
Atomic, Molecular & Chemical Physics | AMO | quant-ph
/ | Classical | physics, nlin
Fluids & Plasmas Physics | Plasma | /
Condensed Matter Physics | CondMat | cond-mat
Multidisciplinary Physics | Interdisc | /

Finally, in Fig. S5 we report the publication profile across subfields for three leading interdisciplinary journals. Unsurprisingly, most subfields are represented in all three venues. We note that the proportions of the different subfields are similar to those of the APS flagship journal, Physical Review Letters.
Figure S5. Shares of subfields for publications in Nature, Science and PNAS. All three interdisciplinary journals publish across all subfields of physics.

Table S3. Non-APS journals with most publications with propagated subfields. Columns: Rank, followed by one column per subfield (General, HEP, Nuclear, Astro, AMO, Classical, Plasma, CondMat, Interdisc).
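The cosine similarity used throughout this Section to compare subfield distributions is straightforward to compute. The snippet below is a minimal sketch; the function name and the two example vectors are hypothetical and do not reproduce any value reported in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Hypothetical shares of papers over the nine subfields in two datasets
# (order: General, HEP, Nuclear, Astro, AMO, Classical, Plasma, CondMat, Interdisc).
dataset_a = [0.15, 0.12, 0.06, 0.05, 0.10, 0.08, 0.03, 0.33, 0.08]
dataset_b = [0.14, 0.10, 0.05, 0.07, 0.11, 0.09, 0.04, 0.30, 0.10]
print(round(cosine_similarity(dataset_a, dataset_b), 3))
```

Since the measure depends only on the direction of the vectors, it gives the same value whether the inputs are raw paper counts or normalised fractions.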
While papers are directly associated to subfields through label propagation, we still need to assign physicists to their correct research area. Some physicists, in particular those who are extremely productive, are likely to appear over a whole career as the authors of papers belonging to multiple subfields, though some of these might not be significant. As a consequence, when assigning authors to subfields, we applied a statistical filter in order to assign only the subfield(s) in which their engagement is significant. In particular, we consider a physicist as significantly working in a subfield only if her share of publications in it, compared to her production across all subfields, is greater than that of the average scientist. Let us consider the bipartite weighted network $W = \{w_{i\alpha}\}$, where $w_{i\alpha}$ is an integer corresponding to the number of publications of author $i$ in subfield $\alpha$. The previous condition can hence be formalised as

$$\mathrm{RCA}_{i\alpha} = \frac{w_{i\alpha} / \sum_{\alpha} w_{i\alpha}}{\sum_{i} w_{i\alpha} / \sum_{i\alpha} w_{i\alpha}} > 1. \qquad (8)$$

This filter, known as the Revealed Comparative Advantage (RCA) index, was introduced in 1965 in Ref. 5 and has been used previously to filter bipartite networks, as in Ref. 6. Unlike other alternatives, it guarantees that each author is active in at least one field. We limit our analysis to authors with at least N = publications in our reconstructed physics dataset, in order to drop all the authors whose contribution to physics is marginal. This set covers 135,877 authors.

The average distribution $w_{i\alpha}$ of subfields per author is shown in Fig.S6a. In Fig.S6b we show the average fraction of papers in each subfield for authors statistically validated in a given area. This plot is similar to that of Fig.1c of the main text, but reports more fine-grained information about the involvement of physicists in the subfields to which they are assigned. As shown, the share of publications in the assigned subfield is the highest for authors in CondMat, HEP and Nuclear. Last, in Fig.S6c we report the average career length, measured in years, of physicists who started publishing in a given year. As expected, the earlier the starting year, the longer the average time span between the first and last publications of a physicist.
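As an illustration, the RCA filter of Eq. (8) can be sketched in a few lines of Python. This is a minimal sketch, not the code used for the analysis: it assumes the publication counts are available as a dense author-by-subfield array, and the function names and toy counts are hypothetical.

```python
import numpy as np


def rca_matrix(w):
    """Return RCA_{i,alpha} for every (author, subfield) pair.

    w[i, a] is the number of publications of author i in subfield a.
    Assumption: every author (row) and every subfield (column) has at
    least one publication, so no denominator is zero.
    """
    w = np.asarray(w, dtype=float)
    # Share of author i's output that falls in each subfield.
    author_share = w / w.sum(axis=1, keepdims=True)
    # Share of all publications that falls in each subfield.
    field_share = w.sum(axis=0) / w.sum()
    return author_share / field_share


def significant_subfields(w):
    """Boolean mask: True where an author is significantly active (RCA > 1)."""
    return rca_matrix(w) > 1.0


# Hypothetical toy counts: 3 authors (rows) x 3 subfields (columns).
toy_counts = np.array([
    [8, 2, 0],
    [1, 9, 0],
    [3, 3, 4],
])
mask = significant_subfields(toy_counts)
print(mask)
```

In the toy data every row of `mask` contains at least one `True`, consistent with the property stated above that each author is assigned to at least one subfield.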
Figure S6. Basic features of authors in our reconstructed physics dataset. a Average publication shares across subfields of a physicist. b For authors validated in a subfield, average fraction of publications in that subfield. c Average career length measured in years as a function of the starting year of a career.

Validation: To test the robustness of our subfield categorisation at the author level, we compared the number of authors working in each subfield with the number of APS members registered across APS Divisions. In Fig. S7 we report the scatterplot between the two datasets, with a cosine similarity of 0.98. The full mapping between the APS Divisions and our subfield scheme is reported in Table S4.
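The cosine similarity used for this comparison can be computed as in the following minimal Python sketch. The two vectors are hypothetical placeholders, not the actual author or membership counts. Because cosine similarity is invariant to rescaling, it gives the same value whether raw counts or fractions are compared.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Hypothetical placeholder data, one entry per subfield.
authors_per_subfield = [120, 80, 40]
members_per_division = [100, 90, 30]
print(cosine_similarity(authors_per_subfield, members_per_division))
```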
S5 Author disambiguation
A common problem in the analysis of scientific careers is that of author disambiguation. Our census of physics is based on merging paper information on subfields with the author information on publications provided by the WoS. Our analysis has been undertaken on the latest available version of WoS which, unlike the previous one, has a built-in author disambiguation, where authors are not classified by a name but by a specific author ID. A single author ID is associated to a unique author, and can be associated to several author names when the publications authored by the same individual report slightly different name formats. Similarly, two homonyms, i.e. distinct individuals with the same author name, are associated to different author IDs. Nevertheless, we are aware that a perfect disambiguation is a goal which is impossible to achieve. For this reason, we decided to test the robustness of our results by replicating the analysis reported in the main text after excluding a subset of authors with names which are known to be particularly hard to disambiguate. In particular, we focused on the most common 100 Chinese and 200 Korean names, 9, 10 which correspond to 504,538 distinct author IDs in the WoS dataset, 15,982 of which are also present in our subset of physicists. Overall, results were shown to be extremely robust to the elimination of such authors. As an example, we report in Fig.S8 the starting point of our analysis, i.e. the distribution of authors across subfields. The cosine similarity between the distribution across subfields of the full set and the reduced set of physicists, without authors difficult to disambiguate, is 0.99.

It is worth mentioning that highly curated data repositories with very good author disambiguation are available for some subfields. For instance, the well-known HEP-INSPIRE dataset has a highly reliable author disambiguation, especially needed for fields where most publications are produced by large collaborations. However, it is difficult to map the HEP-INSPIRE author disambiguation into the built-in WoS author disambiguation. On top of this, we believe that such a merge would not add validity to our analysis. On the contrary, it would introduce a bias into the dataset, where authors publishing in different subfields are classified according to different disambiguation procedures.

Figure S7. Comparison of the fraction of physicists associated to the different subfields and the members of the APS Divisions. Correlation between the two distributions is high, with a cosine similarity of 0.98. (Axes: APS Divisions versus reconstructed physics dataset.)

Table S4. Mapping of physics categories from the APS Divisions into the physics subfield scheme.

Subfield     APS Divisions
General      Computational Physics, Quantum Information, Gravitation
HEP          Particles & Fields
Nuclear      Nuclear Physics, Physics of Beams
Astro        Astrophysics
AMO          Atomic, Molecular & Optical
Classical    Fluid Dynamics
Plasma       Plasma Physics
CondMat      Condensed Matter Physics, Laser Science, Polymer Physics
Interdisc    Biological Physics, Materials Physics, Chemical Physics

Figure S8. Testing author disambiguation. Number of authors working in each subfield: plain color (reduced set of 15,982 authors difficult to disambiguate), faded color (all other physicists). The cosine similarity between the distribution across subfields of the full set of physicists, and the set without authors hard to disambiguate, is 0.99.
In Fig.1d we map the relation between physics subfields into a network, where nodes represent subfields and weighted links describe significant co-activity between them. Let us consider a set of $N$ physicists, and two subfields $\alpha$ and $\beta$ with respectively $N_{\alpha}$ and $N_{\beta}$ physicists. We define the co-activity $C_{\alpha\beta}$ between the two subfields as the ratio between the number of physicists $N_{\alpha\beta}$ working on both subfields $\alpha$ and $\beta$, and the expected number $\hat{N}_{\alpha\beta} = (N_{\alpha} N_{\beta}) / N$. Starting from the link with the highest weight, we plot the minimum number of links needed to have a connected network. All reported links have $C > 1$, meaning that only edges with co-activity higher than expected at random (given the size of the subfields) are shown.

In Fig.2b we show flows of physicists from the subfield(s) of their first publication to the subfield(s) where their activity is significant (RCA > 1). Let us consider the number of physicists $F_{\alpha|\beta}$ working in subfield $\alpha$ who started their career by publishing in subfield $\beta$, so that $\sum_{\beta} F_{\alpha|\beta} = N_{\alpha}$. Subfield $\beta$ is significantly contributing to subfield $\alpha$ only if $F_{\alpha|\beta} / N_{\alpha}$ is greater than the total fraction of physicists whose first publication is in subfield $\beta$ (reported in the rectangles on the top). Only significant flows are shown.

S7 LHC and the HEP 2010 peak
In Fig.2a we show over the years the relative number of new authors entering each subfield. We notice that HEP is characterised by a large peak in 2010. For this reason we looked at all the first publications of new HEP authors in 2010, and searched for the collaborations responsible for each paper. We found that of the new HEP authors in 2010 have a first publication which is connected to the opening of LHC, either directly through the ATLAS, CMS and LHCb collaborations, or indirectly (Ref. 12 of the ALICE collaboration takes advantage of results by LHC). These new authors also amount to the of the total number of new physicists across subfields, explaining the observed peak for HEP. In Fig. S9 we show the yearly fraction of physicists who published their first paper in a new subfield, after removing all new 2010 HEP authors connected to the activities of LHC. As displayed, the peak at 2010 for HEP disappears.

Figure S9. Relative growth rate of subfields after removing new 2010 HEP authors connected to the activities of LHC. No peak is observed for HEP authors in 2010.
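The check described in this section, counting new entrants per subfield and year with and without a given set of authors, can be sketched as follows. This is a simplified Python illustration under stated assumptions: the record layout `(author_id, subfield, year)`, the function name and the toy records are hypothetical. Only raw counts are returned, since the normalisation used for the relative growth rate is not reproduced here.

```python
from collections import Counter


def new_author_counts(records, excluded_authors=frozenset()):
    """Count authors by the (year, subfield) of their first paper in that subfield.

    records: iterable of (author_id, subfield, year) tuples, one per paper
             and subfield (hypothetical layout).
    excluded_authors: author IDs to drop before counting.
    """
    first_year = {}
    for author, subfield, year in records:
        if author in excluded_authors:
            continue
        key = (author, subfield)
        # Keep the earliest year in which the author appears in the subfield.
        if key not in first_year or year < first_year[key]:
            first_year[key] = year
    return Counter(
        (year, subfield) for (author, subfield), year in first_year.items()
    )


# Hypothetical toy records.
records = [
    ("a1", "HEP", 2009), ("a1", "HEP", 2010),
    ("a2", "HEP", 2010), ("a3", "HEP", 2010),
    ("a4", "CondMat", 2010), ("a4", "HEP", 2011),
]
lhc_authors = {"a2", "a3"}  # stand-in for authors linked to LHC collaborations

full = new_author_counts(records)
filtered = new_author_counts(records, excluded_authors=lhc_authors)
print(full[(2010, "HEP")], filtered[(2010, "HEP")])
```

Comparing `full` and `filtered` year by year shows whether a peak is driven by the excluded authors alone.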
S8 Chaperone effect
In Fig.3c we computed the number of chaperoned authors across subfields. The Chaperone effect was originally investigated in Ref. 13 for scientific venues, measured in terms of scientists making the transition from non-last to last (senior / PI) authors in papers published in a journal. Here, as we are interested in the relations, as well as the migration, between physics subfields, we focused on a simplified version of this chaperone measure, $c$, computed as the fraction of physicists first publishing in a subfield who have as co-author at least one scientist who has already published in the area.

Despite being intuitive and close to the variable used in Ref. 13, this measure might not prove adequate in the case of subfields characterised by publication through large-scale collaborations. For this reason, we tested our results against $\tilde{c}$, a variant of the chaperone index. Given the first publication of a scientist in a subfield, $\tilde{c}$ measures the average fraction of co-authors who have already published in the area. As shown in Fig.S10, in the case of our data $c$ and $\tilde{c}$ are very highly correlated, with a cosine similarity of . .

Figure S10. Comparison between two measures of Chaperone effect. Scatterplot between the original measure $c$ to quantify the number of chaperoned authors, and the fractional measure $\tilde{c}$. The values of the two variables across subfields in our dataset are highly correlated.

S9 Authors impact and citation rates across subfields
Top authors across subfields have very different impact, as shown in Fig.3g. This is mainly a consequence of different productivities, rather than diverse citation patterns across subfields. Indeed, the typical number of papers produced by top authors is very heterogeneous across physics communities (Fig.3f). In contrast, we found that the number of citations per paper is rather constant across subfields: the average is . , with all subfields falling within . standard deviation from this value. For example, papers published in HEP and Interdisc receive on average respectively . and . citations, despite the much larger impact of HEP authors. Similar results are obtained for the medians of paper citations across subfields. The average median across physics communities is 9.0, the standard deviation of the median across subfields is 1.1, and all subfields are at most 1.7 standard deviations away from the global median. The medians of paper citations for HEP and Interdisc are respectively 9 and 11.
S10 The physics Nobel prizes
In Fig.3j we show the distribution of Nobel prizes awarded in physics across subfields. Data on Nobel prizes in physics are available on the Nobel prize website. We report all awards since 1985 in order to be consistent with the rest of our data-driven analysis of careers in physics. All such awards are accompanied by a motivation which allows us to assign the crucial discovery or stream of research that led to the Nobel prize to one or more physics subfields. In the considered time span (1985-2017), 82 scientists were awarded the Nobel prize in physics.

References

1. Sinatra, R., Deville, P., Szell, M., Wang, D. & Barabási, A.-L. A century of physics. Nature Physics, 791 (2015).
2. Deville, P. Understanding social dynamics through big data (PhD Thesis) (Université Catholique de Louvain, 2015).
3. PACS 2010 regular edition. https://publishing.aip.org/publishing/pacs/pacs-2010-regular-edition
4. APS dataset. https://journals.aps.org/datasets
5. Balassa, B. Trade liberalization and 'revealed' comparative advantage. Manchester School
6. Hidalgo, C. A. & Hausmann, R. The building blocks of economic complexity. Proceedings of the National Academy of Sciences, 10570–10575 (2009).
7. APS divisions.
8. Smalheiser, N. R. & Torvik, V. I. Author name disambiguation. Annual Review of Information Science and Technology, 1–43 (2009). URL https://onlinelibrary.wiley.com/doi/abs/10.1002/aris.2009.1440430113
9. Most common Chinese surnames. https://en.wikipedia.org/wiki/List_of_common_Chinese_surnames
10. Most common Korean surnames. https://en.wikipedia.org/wiki/List_of_Korean_surnames
11. Yetkin, T. New physics at ATLAS and CMS experiments with the first data. Nuclear Physics B - Proceedings Supplements, 17–26 (2010). The International Workshop on Beyond the Standard Model Physics and LHC Signatures (BSM-LHC).
12. Aamodt, K. et al. Midrapidity antiproton-to-proton ratio in pp collisions at √s = 0.9 and 7 TeV measured by the ALICE experiment. Phys. Rev. Lett., 072002 (2010). URL https://link.aps.org/doi/10.1103/PhysRevLett.105.072002
13. Sekara, V. et al. The chaperone effect in science. PNAS, in print (2018).
14. Physics Nobel prizes.