A Model of Densifying Collaboration Networks

Abstract

Research collaborations provide the foundation for scientific advances, but we have only recently begun to understand how they form and grow on a global scale. Here we analyze a model of the growth of research collaboration networks to explain three empirical observations: the number of collaborations scales superlinearly with institution size, though at different rates (heterogeneous densification); the number of institutions grows as a power of the number of researchers (Heaps' law); and institution sizes approximate Zipf's law. This model has three mechanisms: (i) researchers are preferentially hired by large institutions, (ii) new institutions trigger more potential institutions, and (iii) researchers collaborate with friends-of-friends. We show agreement between these assumptions and empirical data through analysis of co-authorship networks spanning two centuries. We then develop a theoretical understanding of this model, which reveals emergent heterogeneous scaling such that the number of collaborations between institutions scales with an institution's size.

Keith A. Burghardt, Allon G. Percus, and Kristina Lerman
Information Sciences Institute, University of Southern California, Marina del Rey, USA, 90292
Institute of Mathematical Sciences, Claremont Graduate University, Claremont, USA, 91711
(Dated: January 28, 2021)

I. INTRODUCTION

Science is largely a social endeavor. Research collaborations drive scientific discovery and produce more impactful work: papers with more co-authors garner more citations and appear in more prestigious venues [1, 2]. Collaboration enables researchers to mitigate the deleterious effects of the increasing complexity of knowledge [3] by leveraging the diversity of expertise [4] and different perspectives [5]. Our understanding of the growing collaboration networks, however, is still in its infancy. A recent paper explored the role of research institutions in the growth of scientific collaborations [6], showing that collaborations scale superlinearly with institution size: when an institution doubles in size, this creates roughly 30% more collaborations per person. Crucially, the scaling laws are different for each institution; therefore, larger institutions typically receive more advantage from collaboration than others. Additionally, the paper showed that institutions vary in size by many orders of magnitude, with the distribution approximated by Zipf's law [7], while the number of institutions scales sub-linearly with the number of researchers, in agreement with Heaps' law [8, 9]. The sublinear scaling implies that, even as more institutions appear, each institution gets larger on average, but this average belies an enormous variance.

Burghardt et al. [6] developed a stochastic model to explain these patterns, which we theoretically analyze here. In this model, a researcher appears at each time step and is preferentially hired by larger institutions (e.g., due to their prestige or funding). With a small probability, however, a researcher joins a newly appearing institution. The arrival of this new institution then triggers yet more new institutions to form in the future [10]. Finally, once hired, researchers make connections to other researchers and their collaborators with an independent probability. Despite its simplicity, the model reproduces a range of empirical observations.

The model combines three mechanisms. The first two mechanisms, known as Pólya's urn with triggering, qualitatively reproduce the observed Heaps' law and Zipf's law [10]. These mechanisms are the following: (i) researchers are preferentially hired by large institutions, and (ii) new institutions trigger more potential institutions. The third mechanism is that researchers collaborate with friends-of-friends, which reproduces how collaborations scale with institution size [6, 11, 12]. These model assumptions are tested here using bibliographic data from four fields: computer science, physics, math, and sociology. We show that the data are in broad agreement with the model's assumptions about institution growth and link formation. We then explore the theory behind the model. We discover that the interaction of these mechanisms forms novel emergent properties such as densification of links between emergent groups. Finally, this model shows qualitative agreement with empirical statistics, such as significant community structure, heterogeneous scaling laws of collaborations between institutions, and assortativity.

The rest of the paper is organized as follows. First, we describe comparisons between bibliographic data patterns and assumptions of the model. Next, we show how network statistics qualitatively agree with expectations. Third, we develop a theoretical grounding for the model, and finally we compare these theoretical predictions to simulations.
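As a minimal sketch of mechanisms (i) and (ii), the urn process can be simulated in a few lines of Python. The function name `simulate_urn`, the choice to start from a single ball, and the seed handling are illustrative assumptions rather than part of the model specification; the reinforcement parameter ρ and the triggering parameter ν follow the description in Sec. II.

```python
import random
from collections import Counter


def simulate_urn(n_steps, rho=4, nu=2, seed=0):
    """Polya urn with triggering (illustrative sketch).

    Returns the sequence of institutions (colors) that hire each new
    researcher, and the final urn contents.
    """
    rng = random.Random(seed)
    urn = [0]          # assumption: the urn starts with one ball of color 0
    next_color = 1     # next unused color id
    seen = set()       # institutions that have hired at least once
    sequence = []
    for _ in range(n_steps):
        color = rng.choice(urn)          # ball picked uniformly at random
        sequence.append(color)           # this institution hires the researcher
        urn.extend([color] * rho)        # reinforcement: rho balls of same color
        if color not in seen:            # triggering: first hire by this color
            seen.add(color)
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
    return sequence, urn


def institution_sizes(sequence):
    """Cumulative number of researchers hired by each institution."""
    return Counter(sequence)
```

Counting the distinct colors in the returned sequence as the number of draws grows gives a rough check of Heaps' law, and `institution_sizes` gives the size distribution that can be compared with Zipf's law.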

II. A MODEL OF COLLABORATION GROWTH

Burghardt et al. [6] describe a stochastic growth model of institution formation that captures how institutions and collaborations jointly grow. They model the formation and growth of institutions by combining a Pólya's urn-like model described by Tria et al. (2014) and a model of network densification [11, 12]. Unlike existing densification models [11–13], Burghardt et al.'s model reproduces the heterogeneous densification of internal (within-institution) and external (between-institution) collaborations, and the non-trivial growth of institutions. In the Appendix, we show that realistic variants of this model will also produce qualitatively similar behavior.

FIG. 1. Schematic of our institution growth model. (a) New researchers are hired by an institution following a Pólya's urn-like model [10]. In this model, a new researcher is hired by an institution, denoted by a colored ball, picked uniformly at random from an urn. A new institution, where no researcher has been hired before, triggers $\nu + 1$ new colors to enter the urn, increasing the likelihood that more new institutions hire a researcher. Both new and old institutions experience reinforcement, where $\rho$ balls of the same color enter the urn. This creates a rich-get-richer effect where large institutions are more likely to hire a new researcher. (b) Each institution is composed of both internal collaborators (within each institution, green lines) and external collaborators (between institutions, purple lines). Once a researcher is hired, they choose one random internal and one random external collaborator. New collaborations are formed independently with probability $p_A$ if hired by institution A, and $p_B$ if hired by institution B. These new connections form triangles. Schematic taken from [6].

The model is as follows. Imagine an urn containing balls of different colors, with each color representing a different institution, as shown in Fig. 1a. Balls are picked with replacement, each ball representing a newly hired researcher. The color of the picked ball is recorded in a sequence to denote the institution that hires the researcher. After the ball is picked, $\rho$ new balls of the same color are added to the urn. This step, known as "reinforcement" (left panel of Fig. 1a) [10], represents the additional resources and prestige given to a larger institution. If a ball with a new color that was not previously seen is picked, then $\nu + 1$ uniquely-colored balls are placed into the urn. This step is known as "triggering" (right panel of Fig. 1a) [10]. The new colors represent institutions that are now able to form because of the existence of a new institution. This model predicts Heaps' law with scaling relation $\sim N^{\nu/\rho}$ and Zipf's law with scaling relation $\sim n^{-(1+\nu/\rho)}$ [10]. In our simulations, we chose $\rho = 4$ and $\nu = 2$, which approximate the scaling laws seen in data [6].

FIG. 2. Example network from the simulation. Node sizes are proportional to degree and colors correspond to different institutions. Parameters are $\rho = 4$, $\nu = 2$, $\mu_p = 0.6$, and $\sigma_p = 0.$[…] $N = 1000$ nodes.

Next, we model heterogeneous and superlinear scaling of collaborations through a mechanism of network densification. Building on the work of [11, 12], each new researcher connects to a random researcher within the same institution, as well as an external researcher picked uniformly at random (left panel of Fig. 1b). Next, new collaborators are chosen independently from neighbors of neighbors with probability $p_i$, where $p_i$ is unique to each researcher's institution (right panel of Fig. 1b). We let $p_i$ be a Gaussian distributed random variable with mean $\mu_p = 0.6$ and standard deviation $\sigma_p = 0.$[…] $\rho$ and $\nu$, which control Zipf's law and Heaps' law, and $\mu_p$ and $\sigma_p$, which control densification. In our analysis of the model, we fix $\rho = 4$, $\nu = 2$, $\mu_p = 0.6$, which are in qualitative agreement with the statistics observed in empirical data [6]. While other plausible mechanisms for Zipf's law [14–16], Heaps' law [9], or densification [13] exist, the current model describes these patterns in a cohesive framework.

FIG. 3. Rich-get-richer effect in institutions. The mean increase in institution size the next year, $\Delta n$, as a function of its cumulative size, $n$, in the current year for the fields of computer science, physics, math, and sociology. The model predicts the rate of institution growth is proportional to its size (black line), and the best fit for data follows a power law $\sim n^{\alpha}$ with $\alpha \approx$ […]

III. COMPARING MODEL ASSUMPTIONS TO EMPIRICAL DATA

We test the model against bibliographic data collected by Burghardt et al. (2020), based on data from the Microsoft Academic Graph [17, 18]. Author names and institutional affiliations have been extracted as of when each paper was written, allowing us to reconstruct the co-authorship network and institution size over time. The data Burghardt et al. analyze cover four different fields: computer science, physics, math, and sociology. They show that these results are robust to various assumptions about the data, including whether institution size is defined as the cumulative number of authors affiliated with an institution or, at the other extreme, the number of affiliated authors who have written a paper in a particular year. For simplicity, we define institution size as the cumulative number of affiliated authors. Data parsing is described in greater detail in Burghardt et al.

This model predicts that the rate of institution growth is proportional to institution size (i.e., follows a preferential attachment mechanism), which we show is approximately correct in Fig. 3. In the growth model, the probability an institution hires a researcher is proportional to the number of balls associated with that institution in an urn, and the number of balls is proportional to the institution's size. The probability an institution hires a researcher is therefore proportional to its size, $n$ (black line). When we compare to data, we see a slight deviation with growth proportional to $n^{\alpha}$ (dashed line) for $n >$ […] $\alpha$ is equal to $0.$[…] $\pm\, 0.02$, $0.$[…] $\pm\, 0.02$, and $0.$[…] $\pm\, 0.01$ for computer science, physics, math, and sociology, respectively, based on linear regression. Similar to previous findings for preferential attachment [19, 20], this figure demonstrates that the mechanism approximately captures the relationship between size and growth.

FIG. 4. New links form between local nodes. For each institution we compare the geometric mean distance between nodes just before a link forms to random nodes (null model). Shaded areas are 95% confidence intervals.

Next, the model assumes new connections are formed locally in order for networks to densify [6, 11, 12]. We test this in Fig. 4, in which we compare the geodesic distance between researchers before they form new collaborations (solid markers) with the distance between random researchers (null model, open markers). The model would predict that collaborations form between researchers who are two collaborations from each other. For example, if one researcher was Paul Erdős, then the other researcher would have had an Erdős number of two prior to collaborating [21].

In this figure, new collaborations are defined as those that appear the next year and never appeared in any previous year, and plots were made for data 10 years apart. For example, new links were those that first appeared in 1951, 1961, 1971, etc. We take the harmonic mean of the geodesic distance to account for uncommon cases in which components are disconnected and the geodesic distance is therefore infinite. To determine error bars (shaded regions) in the null model, we use a form of bootstrapping. We repeat the following step $M$ times: we find the mean distance of $m$ random researchers, where $m$ is the number of new links formed the next year. We let $M$ be 100 for computer science and physics, and 300 for math and sociology. Error bars are simply the 95% quantiles of these bootstrapped data. Due to the cost of finding geodesic distances, computing these null model error bars took roughly 50 computer-hours to complete on 3.7 GHz Intel Core i5 processors. Comparing the distances of new research collaborations to this null model, we observe that researchers collaborate locally, often with high statistical significance.
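The bootstrapped null model described above can be sketched as follows. This is a hedged illustration, not the code used for Fig. 4: the function names, the adjacency-dictionary graph format, the sampling of random pairs of distinct researchers, and the index-based 2.5% and 97.5% quantiles are all assumptions made for the example.

```python
import random
from collections import deque


def bfs_distance(adj, source, target):
    """Geodesic distance between two nodes; infinite if disconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj[node]:
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return float("inf")


def harmonic_mean(distances):
    """Harmonic mean; an infinite distance contributes 1/inf = 0."""
    inv_sum = sum(1.0 / d for d in distances)
    return len(distances) / inv_sum if inv_sum > 0 else float("inf")


def null_interval(adj, m, M=100, seed=0):
    """Approximate 95% interval of the null-model mean distance.

    Repeats M times: draw m random pairs of distinct nodes and take the
    harmonic mean of their geodesic distances.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    means = []
    for _ in range(M):
        dists = []
        for _ in range(m):
            u, v = rng.sample(nodes, 2)   # distinct nodes, so distance >= 1
            dists.append(bfs_distance(adj, u, v))
        means.append(harmonic_mean(dists))
    means.sort()
    # simple index-based quantiles (an assumption of this sketch)
    return means[int(0.025 * (M - 1))], means[int(0.975 * (M - 1))]
```

Here `m` plays the role of the number of new links formed the next year and `M` the number of bootstrap repetitions. On large co-authorship networks the repeated breadth-first searches dominate the run time, which is in line with the computational cost reported above.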
In rough agreement with the model, we see that researchers connect to one another when they are two to three collaborations apart, on average.

Finally, the first step of the model, institution growth, has the same set of mechanisms as Tria et al. (2014). Eqs. 4–5 of the Supplementary Material of [10] imply that institutions should grow in proportion to "time" in the model. Time in this case is the cumulative number of researchers hired within any university, $N$; therefore, the size of the institution, $n$, should be proportional to $N$. For example, if there are fifty institutions that have hired 1,000 researchers in total, then once 2,000 researchers have been hired, the number of researchers within each institution should approximately double (assuming that the number of new institutions that appear is small). We test this qualitatively in Fig. 5 for each field. We find that the initial growth is usually much faster than linear (dashed line), and sometimes the asymptotic growth rate is sub-linear (e.g., the largest institution in Fig. 5d). That said, we also often see approximately linear growth. Overall, these results give mixed support for the hypothesis on average, but the variations from linear growth suggest the model, perhaps because of its simplicity, does not fully capture the data.

FIG. 5. Cumulative institution size, $n$, versus total number of researchers hired, $N$. The model predicts $N \sim n$ (dashed lines). Five example institutions are shown for each field: (a) computer science, (b) physics, (c) math, and (d) sociology.

IV. QUALITATIVE STATISTICS

Next, we measure network statistics of model simulations to check whether these statistics are realistic. We find that the community structure, densification, assortativity, and clustering, shown in Fig. 6, are comparable to real networks. First, this model naturally produces community structure if we define "communities" as institutions. The modularity of these communities is nearly as high as that from a greedy modularity maximization method [22], possibly because the institution-based communities resemble the stochastic block model [24]. In the stochastic block model, communities are defined as a collection of nodes that are more likely to connect to each other than to outside nodes. Similarly, institutions in the model have different probabilities of forming connections within and between other institutions.

FIG. 6. Network statistics versus $N$. (a) Modularity based on a greedy modularity maximizing method [22], and similar modularity values with communities defined as institutions, with $\sigma_p = 0.$[…] The panels show (a) modularity, $Q$; (b) mean degree, $\langle k \rangle$, with the degree distribution $\Pr(k)$ inset; (c) assortativity, $r$; and (d) clustering coefficient, $c$, with $\sigma(c)$ inset, for $\sigma = 0.0$ and $\sigma = 0.1$.

Second, we find that degree increases with $N$ as a power law, known as network densification [11, 13] (Fig. 6b). While we designed individual institutions to densify, these results still reproduce previous global analysis demonstrating overall densification of the network. In addition, we see in the inset that the degree distribution is heavy-tailed, much like real networks [25]. Importantly, the model has no explicit degree preferential attachment mechanism; this distribution is an emergent property. The dependence of the degree distribution on $N$ can be seen in Appendix Fig. 11.

Next, we find that assortativity increases with $N$ and is comparable to real social networks, including the co-authorship networks we aim to model [26] (Fig. 6c). Interestingly, however, assortativity begins to decrease again if $\sigma = 0.0$. When $\sigma > 0.0$, this model reproduces the heterogeneous densification seen in empirical data [6], as well as consistent positive values of assortativity. Finally, the local clustering coefficient decreases logarithmically, as shown in Fig. 6d. In contrast, a random network has a clustering coefficient that decreases as $1/n$ [27]. The model's clustering coefficient is comparable to real data of a variety of sizes [28]. Whether the clustering coefficient is stable in real data [29] or decreases with $n$ should be explored in the future. The variance of the local clustering coefficient within each network (inset of Fig. 6d) is also wide and should be compared to empirical data in the future.

FIG. 7. Simulations of the number of collaborations versus institution size, $n$. Internal scaling (within an institution) and external scaling (between institutions) are superlinear and vary between institutions.

V. ANALYSIS OF THE MODEL

Next, we develop a theoretical understanding of the model. We first analyze the scaling properties of internal collaborations. Because the mechanism to form internal collaborations ignores all nodes and links besides those within the institution itself, we can consider the institution's internal collaboration network as an isolated network. The mechanism to make collaborations within this network can therefore be reduced to that of a previous model [11, 12]. The number of internal collaborations, $L_{\rm int}$, increases with institution size, $n$, via the following formula

$$L_{\rm int}(n+1) = L_{\rm int}(n) + 1 + p\,\langle k_{\rm int} \rangle \tag{1}$$

$$= L_{\rm int}(n) + 1 + 2 p L_{\rm int}(n)/n \tag{2}$$

where $\langle k_{\rm int} \rangle$ is the mean number of internal collaborations per researcher, equal to $2 L_{\rm int}(n)/n$. Intuitively, we add an edge by default, plus $p\,\langle k_{\rm int} \rangle$ edges through additional collaborations. Using the results from previous papers [11, 12], we find that

$$L_{\rm int}(n) = \begin{cases} \dfrac{n}{1-2p} & p < 1/2 \\ n \ln(n) & p = 1/2 \\ A(p)\, n^{2p} & p > 1/2 \end{cases} \tag{3}$$

where $A(p) = [(2p-1)\,\Gamma(1+2p)]^{-1}$. The scaling constants and exponents in this theory are taken across all realizations. In practice, however, the theory predicts the exponent well for large institutes and underestimates it for small institutes, most likely because of finite size effects.

FIG. 8. Mean degree of external collaborations, $\langle k_{\rm ext} \rangle$, i.e., with researchers at different institutions, versus the cumulative number of researchers for several simulations. Solid black lines are simulations with $\sigma_p = 0.1$; dashed black lines are simulations with $\sigma_p = 0.0$. The solid red line is finite $\sigma_p = 0.$[…] $\sigma_p = 0.$[…] (The legend lists simulation and theory curves for $\sigma_p = 0.0$ and $\sigma_p = 0.1$.)

External scaling laws are much more nuanced and require significant amounts of new analysis. We have two goals in our analysis. First, we want to show that internal and external collaboration exponents are superlinear. Second, we want to understand why internal and external collaboration exponents are poorly correlated. To this end, we start with a similar equation as before, but this time for external collaborations, $L_{\rm ext}$:

$$L_{\rm ext}(n+1) = L_{\rm ext}(n) + 1 + p\,\langle k_{\rm ext} \rangle \tag{4}$$

Our goal is to first find $\langle k_{\rm ext} \rangle$, the mean number of internal collaborations per researcher at external institutions. This value is surprisingly non-trivial compared to internal collaborations. First, we note that the first researcher is chosen at random among all researchers, meaning there is a preference to attach to researchers in larger institutes. The institution size follows Zipf's law [10],

$$p(n) = \frac{\nu}{\rho}\, n^{-(1+\nu/\rho)}, \tag{5}$$

where we take the discrete size $n$ to be continuous, which works well for large institution sizes. The preference to attach to larger institutes means that we choose an institute of size $n_{\rm ext}$ with probability

$$q(n) = \frac{n\, p(n)}{\langle n \rangle} \tag{6}$$

where

$$\langle n \rangle = \int_1^N dn\; n \left( \frac{\nu}{\rho}\, n^{-(1+\nu/\rho)} \right) \tag{7}$$

$$\sim \frac{\nu}{\rho - \nu}\, N^{1-\nu/\rho} \tag{8}$$

Because $\nu/\rho < 1$, we discover that $\langle n \rangle$ diverges. Therefore, we set a cut-off equal to the total number of researchers, $N$. In full form, $q(n)$ is:

$$q(n) = \frac{(\rho - \nu)\, n^{-\nu/\rho}}{\rho\, N^{1-\nu/\rho}} \tag{9}$$

Moreover, by construction, we let the probability density of $p$, $f(p)$, be Gaussian distributed, with mean $\mu_p$ and variance $\sigma_p^2$. Finally, $k_{\rm ext}$ for an arbitrary institution is $2 L_{\rm int}(n)/n$. Putting all this together, we discover that

$$\langle k_{\rm ext} \rangle = \frac{2(\rho - \nu)}{\sigma_p \sqrt{2\pi}\, \rho\, N^{1-\nu/\rho}} \int_1^N dn \left\{ n^{-\nu/\rho} \int_0^{1/2} dp\, \frac{\exp[-(p-\mu_p)^2/(2\sigma_p^2)]}{1-2p} + \int_{1/2}^{1} dp\, \frac{n^{2p-1-\nu/\rho} \exp[-(p-\mu_p)^2/(2\sigma_p^2)]}{(2p-1)\,\Gamma(1+2p)} \right\} \tag{10}$$

Sadly, this equation is not simple to solve. First, it diverges near $p = 1/2$. At this special point, the scaling law approaches $L_{\rm int}(n) \sim n \ln(n)$, which is why the assumptions around $p \simeq 1/2$ break down. If we integrate from 0 to $1/2 - \epsilon$ and from $1/2 + \epsilon$ to 1, then $\langle k_{\rm ext} \rangle$ becomes a constant proportional to $\ln(1/\epsilon)$. If this value is small compared to $N$, however, then from Eq. 4, $L_{\rm ext}(n) \sim n$, which does not agree with our findings. On the other hand, $\langle k_{\rm ext} \rangle$ (and therefore $\ln(1/\epsilon)$) cannot be larger than $N - 1$. In other words, we can only connect to as many nodes as there are in the network. If we assume $\langle k_{\rm ext} \rangle \sim N$, then from Eq. 4, $L_{\rm ext}(n) \sim n^2$. This demonstrates a breakdown in the assumptions of a naive approximation of Eq. 10.

That being said, we can make perturbative expansions around $\mu_p$ assuming $\sigma_p$ is small. In this limit, $\exp[-(p-\mu_p)^2/(2\sigma_p^2)]$ approaches zero faster than $1/(2p-1)$ approaches infinity, and we can therefore integrate around $\mu_p$. If $\sigma_p$ is small, we can focus on $p > 1/2$ (assuming $\mu_p > 1/2$) and note that $\exp[-(p-\mu_p)^2/(2\sigma_p^2)]$ varies much more than the denominator, which we can approximate as $(2\mu_p - 1)\,\Gamma(1+2\mu_p)$. On the other hand, because $n$ is assumed to be large, a small variation in $p$ could significantly change the numerator; therefore, $n^{2p}$ is not approximately $n^{2\mu_p}$ unless $\sigma \to 0$, in which case the Gaussian distribution becomes a Dirac delta function. In the small $\sigma_p$ limit,

$$\langle k_{\rm ext} \rangle = \frac{2(\rho - \nu)}{\sigma_p \sqrt{2\pi}\, \rho\, N^{1-\nu/\rho}} \int_1^N dn\; n^{-(1+\nu/\rho)} \int_{1/2}^{1} dp\, \frac{\exp[2p \ln(n) - (p-\mu_p)^2/(2\sigma_p^2)]}{(2\mu_p - 1)\,\Gamma(1+2\mu_p)} \tag{11}$$

Because the PDF quickly approaches 0 away from $p = \mu_p$, we can extend the integral over $p$ to $\pm\infty$. Once we integrate, the result becomes

$$\langle k_{\rm ext} \rangle = \frac{2(\rho - \nu)}{(2\mu_p - 1)\,\Gamma(1+2\mu_p)\, \rho\, N^{1-\nu/\rho}} \int_1^N dn\; n^{2\mu_p - (1+\nu/\rho)} \exp[2\sigma_p^2 \ln(n)^2] \tag{12}$$

After integrating over $n$, the result becomes

$$\langle k_{\rm ext} \rangle = \frac{2(\rho - \nu)}{\sqrt{2}\,\sigma_p (2\mu_p - 1)\,\Gamma(1+2\mu_p)\, \rho\, N^{1-\nu/\rho}} \left\{ F\!\left[ \frac{\nu - 2\mu_p \rho}{2\sqrt{2}\,\rho\,\sigma_p} \right] + N^{-\frac{\nu}{\rho} + 2\mu_p + 2\sigma_p^2 \log(N)}\, F\!\left[ \frac{4\rho \log(N)\,\sigma_p^2 - \nu + 2\mu_p \rho}{2\sqrt{2}\,\rho\,\sigma_p} \right] \right\}, \tag{13}$$

where $F$ is the Dawson function [30]. If, on the other hand, $\sigma_p$ is zero, then we replace the Gaussian distribution with a Dirac delta and the equation becomes

$$\langle k_{\rm ext} \rangle_{\sigma_p = 0} = \frac{2(\rho - \nu)}{(2\mu_p - 1)\,\Gamma(1+2\mu_p)\, \rho\, N^{1-\nu/\rho}} \int_1^N dn\; n^{2\mu_p - (1+\nu/\rho)} \tag{14}$$

$$\langle k_{\rm ext} \rangle_{\sigma_p = 0} = \frac{2(\rho - \nu)}{(2\mu_p - 1)\,\Gamma(1+2\mu_p)\, \rho\, (2\mu_p - \nu/\rho)} \left( N^{2\mu_p - 1} - N^{\nu/\rho - 1} \right) \tag{15}$$

We compare this to simulation data in Fig. 8 and find similar scaling behavior, although the values are off by a factor of 10, possibly due to the finite size of most institutions, where the scaling laws assumed above might not hold. To understand the long-term behavior, however, we can take the limit $N \to \infty$:

$$\langle k_{\rm ext} \rangle \approx \begin{cases} C_1\, N^{2\mu_p - 1} & \sigma_p = 0 \ (p = \mu_p) \\[4pt] C_2\, \dfrac{N^{2\mu_p - 1 + 2\sigma_p^2 \ln(N)}}{\ln(N)} & \sigma_p \ll 1 \end{cases} \tag{16}$$

where

$$C_1 = \frac{2(\rho - \nu)}{\rho\,(2\mu_p - \nu/\rho)(2\mu_p - 1)\,\Gamma(2\mu_p + 1)} \tag{17}$$

and

$$C_2 = \frac{\rho - \nu}{2\rho\,\sigma_p^2\,(2\mu_p - 1)\,\Gamma(2\mu_p + 1)} \tag{18}$$

We notice that variance increases the mean degree, but also that, for finite $\sigma_p$, the scaling relation is not a power law. What we are interested in, however, is how $\langle k_{\rm ext} \rangle$ depends on $n$, the institution size. Previous research shows, to first order, that $n = N/N_i$, where $N_i$ is the number of researchers when the institute first formed (cf. Supplementary Material Eq. 4 of [10]). Substituting this into Eq. 16, we get $\langle k_{\rm ext} \rangle$ as a function of $n$. We can finally substitute $\langle k_{\rm ext}(n) \rangle$ into Eq. 4, and notice that $\langle k_{\rm ext} \rangle$ does not depend on $L_{\rm ext}$, in contrast to internal collaborations. Knowing that $L_{\rm ext}(1) = 0$, this iterative equation can be solved in the form of a series:

$$L_{\rm ext}(n) = \sum_{j=1}^{n-1} \left[ 1 + p\,\langle k_{\rm ext} \rangle(j) \right] \tag{19}$$

$$= n - 1 + p \sum_{j=1}^{n-1} \langle k_{\rm ext} \rangle(j) \tag{20}$$

Sadly, there is in general no simple formula for this series, although if $\sigma_p = 0$,

$$L_{\rm ext}(n)_{\sigma_p = 0} \sim p \sum_{j=1}^{n-1} j^{2\mu_p - 1} = p\, H(n-1,\, 1 - 2\mu_p) \tag{21}$$

where $H$ is the generalized harmonic number. The asymptotics of $H$ tell us that $L_{\rm ext}(n) \sim n^{2\mu_p}$; therefore, if $\sigma_p = 0$, the external collaboration scaling is superlinear. Sadly, when $\sigma_p >$ […] $\sigma_p > 0$, scaling is approximately a power law. This theoretical curve is plotted in Fig. 9a. We show that institutions that appear earlier (e.g., $N_i = 10$[…]) have a smaller scaling law than those that appear later (e.g., $N_i = 3 \times$ […]), and finite variance in $p$ creates larger scaling laws than no variance. Because institutions grow linearly with $N_i$, this implies that smaller institutions should have a larger scaling law than larger ones [10]. We can also create a histogram of the scaling exponents in Fig. 9b. Because the cumulative number of institutes grows as $N^{\nu/\rho} = N^{1/2}$, the number of new institutes scales as $N^{-1/2}$; therefore, we sample exponents with this frequency. External collaborations therefore create heterogeneous scaling exponents independent of $p$. The heterogeneous scaling is instead an emergent property.

FIG. 9. Theoretical scaling laws. (a) External degree, $L_{\rm ext}$, normalized by the cumulative number of researchers when the institution first formed, $N_i$, versus the institution size, $n$. Solid lines are theory with $\sigma_p >$ […] $N_i$. Dashed line is Eq. 21. (b) A theoretical histogram of scaling exponents, fit for $n > 10$, after $N = 10$[…]

VI. COMPARING THEORY TO SIMULATIONS

FIG. 10. Simulated and theoretical collaboration scaling. (a) Internal collaborations are expected to scale as $\sim n^{2p}$ (Eq. 3), which agrees well with simulations, especially for large institution sizes, $n$. (b) In contrast, the external collaborations do not strongly correlate with the internal scaling theory ($s = 0.22$, p-value $<$ […]). Data gathered for 15 simulations for institutions with final size $n >$ […]

We first compare theory with simulations for internal collaboration scaling exponents. Equation 3 predicts that, for $p > 1/2$, $L_{\rm int} \sim n^{2p}$; therefore, we should see a significant correlation between the simulation scaling laws and $2p$, especially for large $n$. Figure 10 compares the simulation and theoretical exponents for 15 simulations with $\rho = 4$, $\nu = 2$, $\mu_p = 0.6$, and $\sigma_p = 0.$[…] $n \geq 10$ for institutions with more than 50 researchers at the final time step. In total there were 1582 simulated institutions studied. We find agreement with theory for large $n$ in Fig. 10a, and overall a significant correlation with theory (Spearman rank correlation, $s = 0.85$, p-value $<$ […]). We also find good agreement with simulations in Fig. 12a, which further demonstrates that, as expected, variance in $p$ creates variance in the scaling exponents. Moreover, we can focus on $\sigma_p = 0.$[…] $n$ becomes large, we have better and better agreement with the theory. The broad distribution of scaling exponents for both internal and external collaborations can be seen in the Appendix.

We next compare external collaboration with theory in Fig. 10b. Equations 21 and 16 predict low correlation between exponents and the $p$ parameters, which we also observe in simulations ($s = 0.22$, p-value $<$ […]), showing support for the theory's qualitative distinction between internal and external scaling. We also see qualitative agreement with theory in Fig. 12b. Namely, we see that $\sigma = 0.$[…] $n$. With $\sigma_p = 0.1$, we find that scaling exponents tend to be larger, in agreement with Equations 21 and 16, which show that the scaling exponents increase with $\sigma_p$.

That being said, the theory implies that final institution size is proportional to its age, and therefore we should see a correlation between the final institution size and the scaling exponent (Fig. 9). We find, however, no significant correlation with size (p-value $= 0.$[…]

VII. CONCLUSION

Burghardt et al. (2020) found surprising statistical regularities in the growth of research institutions, and created a model to explain these regularities. We explore this model in greater detail and discover empirical agreement with model assumptions and realistic network properties such as significant community structure. Furthermore, we produce a theoretical grounding for this model and show agreement between theory and simulations. This theory demonstrates that while the internal collaboration exponent is proportional to $p$, the external collaboration scaling parameter is approximately independent of all other parameters.

While these findings ground Burghardt et al.'s model in a stronger empirical and theoretical foundation, there are limitations in what we can explain. First, the growth of institutions is sub-linearly related to institution size ($\Delta n \sim n^{0.}$[…]), while the model predicts a linear relation. Second, while collaborations often form between friends of friends, this is not always the case, as the mean geodesic distance in Fig. 4 is greater than two. Finally, the model does not fully explain how the institution size correlates with the total number of researchers hired (Fig. 5). While the relation is expected to be linear, we see significant deviations, due either to the simplicity of our model or potentially to limitations in data collected prior to 1950 [6]. While agreement between model and data is still close, these deviations suggest that the current model cannot fully describe the data. In addition, the theory predicts a much higher value for $\langle k_{\rm ext} \rangle$ than we see in simulations, as shown in Fig. 8. Similarly, the external collaboration distribution for simulations seen in Fig. 9 is not in agreement with Fig. 13, and it suggests a dependence on time. While we have made great progress, future work is needed to fully understand this model.

VIII. APPENDIX

We first check the robustness of the heavy-tailed degree distribution versus network size and σ_p, which is shown in Fig. 11. In these data, which are averaged over 10 simulations, we find the wide degree distribution is robust to variations in model size and σ_p.

Next, we compare the collaboration scaling exponents for simulations with σ_p = 0 . . σ_p = 0 . n, while for σ_p = 0.1, the variance is high even for n = 10 .

Finally, we explore robustness of results with respect to realistic model variants. The theoretical analysis predicts heterogeneous densification of collaborations within institutions and heterogeneous densification for external collaborations (between institutions). We check each of these qualitative results in Fig. 13, and compare these results for two variants of our model to check the sensitivity of our analysis. In the original model, hired "researchers" initially create one internal and one external link. We look at a variant of this simulation in which the number of initial internal and external collaborators is Poisson distributed, with λ = 1 (i.e., on average one internal and one external collaborator). Importantly, Bhat et al. and Lambiotte et al. show that the number of links over time is not self-averaging in their densification model [11, 12]; therefore initial conditions greatly affect the final number of links and may affect the observed scaling behavior. Figure 13 shows our results. We find that, while there are slightly more outliers in the scaling exponent distribution, results are quantitatively very similar. Overall this suggests that our model robustly creates agreement with the theory.

FIG. 11. Degree distribution versus network size for σ_p = 0 . [Panels (a) and (b); axis labels: Degree, k; Pr(k).]

FIG. 12. Simulations with different scaling laws (σ_p = 0 . σ_p = 0 . ρ = 4, ν = 2, and μ_p = 0 . n > 50 and scaling laws are taken for n ≥ [Panels (a) and (b); axis labels: σ_p; Cumul. Int. Collab. Exponent; Cumul. Ext. Collab. Exponent; legend: Int. Scaling, Ext. Scaling.]

FIG. 13. Internal and external longitudinal collaboration exponents for alternative simulation models. (a) Internal and external exponents for simulations with λ = 1 Poisson distributed numbers of initial collaborators (on average one internal collaborator, and one external collaborator). (b) The same histograms for the current simulation with exactly one internal and one external collaborator. [Axis label: Frequency; legend: Simulation Int., Simulation Ext.]

[1] … , 1036 (2007), https://science.sciencemag.org/content/316/5827/1036.full.pdf.
[2] Y. Dong, H. Ma, J. Tang, and K. Wang, arXiv preprint:1806.03694 (2018).
[3] B. F. Jones, The Review of Economic Studies, 283 (2009).
[4] S. E. Page, The diversity bonus: How great teams pay off in the knowledge economy, Vol. 5 (Princeton University Press, 2019).
[5] A. Yegros-Yegros, I. Rafols, and P. D'Este, PloS one, e0135095 (2015).
[6] K. Burghardt, A. Percus, Z. He, and K. Lerman, arXiv preprint:arXiv:2001.08734 (2020).
[7] G. K. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology (Addison-Wesley Press, Inc., Cambridge, MA, 1949) p. 573.
[8] L. Lü, Z.-K. Zhang, and T. Zhou, PLOS ONE, 1 (2010).
[9] F. Simini and C. James, EPJ Data Science, 24 (2019).
[10] F. Tria, V. Loreto, V. D. P. Servedio, and S. H. Strogatz, Scientific Reports, 5890 EP (2014).
[11] R. Lambiotte, P. L. Krapivsky, U. Bhat, and S. Redner, Phys. Rev. Lett., 218301 (2016).
[12] U. Bhat, P. L. Krapivsky, R. Lambiotte, and S. Redner, Phys. Rev. E, 062302 (2016).
[13] J. Leskovec, J. Kleinberg, and C. Faloutsos, ACM Trans. Knowl. Discov. Data (2007), 10.1145/1217299.1217301.
[14] R. Gibrat, Les inegalites economiques; applications: aux inegalites des richesses, a la concentration des entreprises, aux populations des villes, aux statistiques des familles, etc., d'une loi nouvelle, la loi de l'effet proportionnel (Librairie du Recueil Sirey, Paris, 1931).
[15] … , 1429 (2004).
[16] R. L. Axtell, Science, 1818 (2001), https://science.sciencemag.org/content/293/5536/1818.full.pdf.
[17] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-j. P. Hsu, and K. Wang, in Proceedings of the 24th international conference on world wide web (ACM, 2015) pp. 243–246.
[18] D. Herrmannova and P. Knoth, D-Lib Magazine (2016).
[19] H. Jeong, Z. Néda, and A. L. Barabási, Europhysics Letters (EPL), 567 (2003).
[20] P. Sheridan and T. Onodera, Scientific Reports, 2811 (2018).
[21] R. De Castro and J. W. Grossman, The Mathematical Intelligencer, 51 (1999).
[22] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E, 066111 (2004).
[23] R. Guimerà, M. Sales-Pardo, and L. A. N. Amaral, Phys. Rev. E, 025101 (2004).
[24] E. Abbe, J. Mach. Learn. Res., 6446–6531 (2017).
[25] A.-L. Barabási and R. Albert, Science, 509 (1999), https://science.sciencemag.org/content/286/5439/509.full.pdf.
[26] M. E. J. Newman, Phys. Rev. E, 026126 (2003).
[27] M. E. J. Newman, SIAM Review, 167 (2003).
[28] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Physics Reports, 175 (2006).
[29] L. Ostroumova, A. Ryabchenko, and E. Samosvat, in Algorithms and Models for the Web Graph, edited by A. Bonato, M. Mitzenmacher, and P. Prałat (Springer International Publishing, Cham, 2013) pp. 185–202.
[30] M. Abramowitz and I. A. Stegun,