[PDF] Linguistic evolution driven by network heterogeneity and the Turing mechanism

Abstract

Given the rapidly evolving landscape of linguistic prevalence, whereby a majority of the world's existing languages are dying out in favor of the adoption of a comparatively fewer set of languages, the factors behind this phenomenon has been the subject of vigorous research. The majority of approaches investigate the temporal evolution of two competing languages in the form of differential equations describing their behavior at large scale. In contrast, relatively few consider the spatial dimension of the problem. Furthermore while much attention has focused on the phenomena of language shift---the adoption of majority languages in lieu of minority ones---relatively less light has been shed on linguistic coexistence, where two or more languages persist in a geographically contiguous region. Here, we study the geographical component of language spread on a discrete medium to monitor the dispersal of language species at a microscopic level. Language dynamics is modeled through a reaction-diffusion system that occurs on a heterogeneous network of contacts based on population flows between urban centers. We show that our framework accurately reproduces empirical linguistic trends driven by a combination of the Turing instability, a mechanism for spontaneous pattern-formation applicable to many natural systems, the heterogeneity of the contact network, and the asymmetries in how people perceive the status of a language. We demonstrate the robustness of our formulation on two datasets corresponding to linguistic coexistence in northern Spain and southern Austria.

Full PDF

LLinguistic evolution driven by network heterogeneity and theTuring mechanism

Sayat Mimar , Mariamo Mussa Juane , Jorge Mira , Juyong Park , Alberto P.Mu˜nuzuri , and Gourab Ghoshal Department of Physics & Astronomy, University of Rochester, Rochester, NY 14607,USA Group of Nonlinear Physics. University of Santiago de Compostela, 15782 Santiago deCompostela, Spain Dept. de F´ısica Aplicada. University of Santiago de Compostela, 15782 Santiago deCompostela, Spain Graduate School of Culture Technology, Korea Advanced Institute of Science andTechnology, Daejon, 305-701, Korea

Manuscript 2

Supporting Information 17

S1 Turing instability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17S2 Data normalization and aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 a r X i v : . [ phy s i c s . s o c - ph ] J un inguistic evolution driven by network heterogeneity and theTuring mechanism Sayat Mimar , Mariamo Mussa Juane , Jorge Mira , Juyong Park , Alberto P.Mu˜nuzuri , and Gourab Ghoshal Department of Physics & Astronomy, University of Rochester, Rochester, NY 14607,USA Group of Nonlinear Physics. University of Santiago de Compostela, 15782 Santiago deCompostela, Spain Dept. de F´ısica Aplicada. University of Santiago de Compostela, 15782 Santiago deCompostela, Spain Graduate School of Culture Technology, Korea Advanced Institute of Science andTechnology, Daejon, 305-701, Korea

Abstract

Given the rapidly evolving landscape of linguistic prevalence, whereby a majority of the world’sexisting languages are dying out in favor of the adoption of a comparatively fewer set of languages, thefactors behind this phenomenon has been the subject of vigorous research. The majority of approachesinvestigate the temporal evolution of two competing languages in the form of diﬀerential equationsdescribing their behavior at large scale. In contrast, relatively few consider the spatial dimension ofthe problem. Furthermore while much attention has focused on the phenomena of language shift—theadoption of majority languages in lieu of minority ones—relatively less light has been shed on linguisticcoexistence, where two or more languages persist in a geographically contiguous region. Here, westudy the geographical component of language spread on a discrete medium to monitor the dispersalof language species at a microscopic level. Language dynamics is modeled through a reaction-diﬀusionsystem that occurs on a heterogeneous network of contacts based on population ﬂows between urbancenters. We show that our framework accurately reproduces empirical linguistic trends driven by acombination of the Turing instability, a mechanism for spontaneous pattern-formation applicable tomany natural systems, the heterogeneity of the contact network, and the asymmetries in how peopleperceive the status of a language. We demonstrate the robustness of our formulation on two datasetscorresponding to linguistic coexistence in northern Spain and southern Austria. Introduction

Language is the center of human activity and has served as the fundamental mode of communication sincethe dawn of human civilization. While there currently exists roughly 6000 languages diﬀering structurallyin terms of grammar and vocabulary [1], they all evolve dynamically through human interactions, thatare shaped by economic, political, geographic and cultural factors [2, 3, 4, 5].A rather unfortunate outcome of such evolution is the replacement of vernacular tongues, spoken by aminority of the population, with that spoken by the dominant majority. Indeed it is estimated that 90%of existing languages will go extinct by the end of the century [1, 6], leading to a huge loss in culturaldiversity, given the inextricable links between speech and customs. One of the ﬁrst mathematical modelsthat accurately reproduced such “language-death” was proposed by Abrams and Strogatz (AS) [7]. Intheir formulation, two languages compete, with the attractiveness of each of the species being determinedby its perceived status amongst the population. As long as the symmetry between the perceived statusis broken, the model necessarily predicts a single hegemonic language adopted by the entire population.Indeed, the model successfully accounts for the decline of 42 real-world minority languages in contactwith hegemonic counterparts. The formulation however fails to account for those cases where languagescoexist in a geographically contiguous region.To account for this limitation, reﬁnements were made to the AS model by Mira and Paredes [8, 9]incorporating bilingualism by introducing an inter-linguistic similarity parameter. An important exampleof this occurs in the northwestern part of Spain in Galicia, where both Galician and Spanish are spoken.Their model analyzes the temporal evolution of these languages demonstrating the existence of a stablecoexistence given enough similarity between the languages.Both formulations and other related ones [10, 11, 12] focus on the temporal aspect of language evolu-tion at a macroscopic scale, while ignoring local dynamics on the space where subpopulations of competingtongues reside. To address this, other approaches incorporate the geographic component into reaction-diﬀusion equations [13, 14, 15] simulating the dispersal of speakers in a continuous domain [16, 17, 18, 19].While the approach is reasonable, it fails when geographic regions representing speakers of a commonlanguage are no longer contiguous, and thus there is no meaningful diﬀusive front. Furthermore, theapproach is unable to provide spatially detailed description of language spread and retreat.An agent-based probabilistic model proposed in [20], supported with detailed empirical data fromsouthern Austria shed light on the constituents of language dynamics at a microscopic level. The region,where speakers of German and Slovenian live, is partitioned into quadratic grids where each cell representsan area of one square kilometer. To determine the probability of speaking one of the languages in a givenyear, the model uses the number of speakers of each language in the preceding year for every cell andtheir interaction with speakers with surrounding cells, hence accounting only for short-range connections.Although this ﬁne-grained model is able to successfully determine the temporal evolution of the twolanguages and generate satisfactory results for geographic distribution of the subpopulations, it haslimitations in terms of generalization. The historic and elaborate dataset for the number of speakerscovering the entire region are from the periods 1880-1910 and 1971-2001, and such detailed records arenot so easily available for other parts of the world.A recent work [21] proposed a district level mean ﬁeld network with uniform weights for Galicia where3wo competing languages are spread across 20 districts of the region with diﬀerent prestige values, tocombine internal complexity of each location with inﬂuence produced by its neighbors. This approachexplains how the interplay between urban and rural dynamics leads to competition in language shift. Theframework however requires ﬁne-grained detail on a plethora of parameters to study the sociolinguisticdynamics across the region.While the models described thus far, reproduce, to varying degrees of accuracy, the evolution of theobserved linguistic trends (to the extent that such data is available) they are formulated in a fashionthat makes it diﬃcult to disentangle the eﬀects of the various mechanisms governing linguistic evolution.Additionally while focusing either on short range interactions or at a macroscopic scale, none of the modelsconsider the eﬀect of human mobility, a rather important ingredient in understanding the dynamics ofsocioeconomic systems [22, 23, 24]. Here we propose a coarse-grained model of language dynamics thatseeks to uncover the minimal mechanisms that reproduce the observed linguistic trends. We set up ourmodel in such a way that we can interrogate the eﬀect of each of the constituent mechanisms. It isimportant to note that our goal is not to reproduce exactly the number of speakers of a given language,but rather, sacriﬁcing speciﬁcity and erring on the side of generalizability, we focus on the qualitativetrends.Our model consists of primarily three ingredients. We ﬁrst discretize the space on which linguisticinteractions occur by representing towns as nodes and connections between them as edges, incorporatingpopulation ﬂows at microscopic level. This geographic network extends the work of [20] by combin-ing both short-range and long-range connections that are not present in the agent-based model andare essential to describe global population interactions. The edges are weighted by a gravity-like rela-tion [24, 25, 26, 27] which is the simplest parameter free model to calculate mobility ﬂows between twocommunities, while accounting for their geographic separation. Second, the evolution of each language ischaracterized by reaction-diﬀusion equations for two competing languages whose dynamics is describedby the Lotka-Volterra model, previously used to model linguistic coexistence [28, 29]. As opposed towavefront propagation in continuous space which requires a contiguous region, reaction diﬀusion on net-works prevents the isolation of language islands. The contact network enables interactions along weightededges such that each node communicates with all of its neighbors. Spatial linguistic patterns emerge onthe geographical network through the Turing Mechanism, an exemplar of pattern formation [30, 31, 32]that relates to many observed natural phenomena in continuous [33] and discrete media [34, 35]. Theﬁnal ingredient in our model is the status perception of each language and its corresponding degree ofspatial correlation.We test our model in two diﬀerent regions of the world with linguistic coexistence, Galicia, in north-western Spain, where Galician and Spanish are spoken, as well as Carinthia in southern Austria, whereone ﬁnds both Slovenian and German speakers. In both cases, we ﬁnd excellent agreement with qual-itative trends, demonstrating that in addition to the speciﬁc model of linguistic competition one mustalso account for the geographic network on which the dynamics take place, as well as asymmetries inlinguistic status perception. 4 a) s P ( s ) (b) (c) s P ( s ) (d) Figure 1: (a)

20 districts of Galicia, each represented by a a diﬀerent color. The points (nodes) correspond tothe 550 cities and towns in the region. (c)

The corresponding map for Southern Carinthia comprising of 9districts and 112 cities. (b) and (d)

Distribution of edge-weights s i = (cid:80) j W ij , for the geographic networks inGalicia and Southern Carinthia with logarithmic binning. The tails of both distributions has the form P ( s ) ∼ s − β . The exponents are extracted using maximum likelihood estimation and are β G ≈ .

72 in (b) and β C ≈ .

35 in (d) . In our analysis we make use of datasets corresponding to two independent parts of Europe where popu-lations speak at least two languages. The ﬁrst comes from the Autonomous Region of Galicia in north-western Spain, where Galician (a Romance language similar to Portuguese) and Castillian (Spanish) areco-oﬃcial. The dataset includes information about the fraction of Galician and Spanish speakers in 20districts of the region consisting of 550 cities [36]. In Fig. 1 a , we illustrate the diﬀerent districts withdistinct colors for each region. The cities are represented as black points. The linguistic distributionsaccording to the census data are shown in Fig. 2 a,b for Galician and Spanish, where the colors repre-sent the fraction of speakers of a particular language in each district. The ﬁgure indicates that the twolanguages are geographically distributed in a complementary fashion; regions with the most abundantGalician speakers correspond to the lowest amount of Spanish speakers and vice versa.5 igure 2: Spatial distribution of linguistic prevalence for Galician (a) and Spanish (b) speakers in the regionof Galicia. Regions are colored according to the fraction of speakers in that particular district. The same mapfor German (c) and Slovenian (d) speakers in Southern Carinthia. In both regions, languages are geographicallydistributed in a complementary fashion; regions with the most abundant speakers of a particular languagecorrespond to the lowest amount of speakers of the other language.

Our second dataset corresponds to Slovenian and German speakers in Southern Carinthia, Austria inthe year 1910 across 9 districts and 112 cities. The data available in digitized form [37] consists of thefraction of the population speaking either language at the level of cities. The region with each districtcorresponding to a diﬀerent color, with the cities represented as points is shown in Fig. 1 c . In Fig. 2 b we plot the spatial distribution of prevalence for the year 1910, ﬁnding a similar trend to that seen forGalicia; regions consisting of large number of Slovenian speakers (primarily near the Austrian-Slovenianborder) consist of few German speakers, with the opposite also being true.6 Model

We model the language competition with the following set of Lotka-Voltera type diﬀerential equations,ﬁrst proposed in [28, 29]: dudt = cuv + α u u (1 − uS u ) ,dvdt = − cuv + α v v (1 − vS v ) . (1)Here u ( (cid:126)x, t ) and v ( (cid:126)x, t ) correspond to the frequency of the population speaking each language, whereasthe cross-term uv represents the competition between them with a strength c , interpreted as the status ofthe language; c > u has a higher status or attractiveness than language v , withthe opposite being true for c <

0. In the absence of competition, the model reduces to logistic growthwhere S u and S v represent the respective carrying capacities of the languages and α u , α v their natalityand mortality rates. For the purposes of our analysis these constants are redundant and without anyloss of generality set to 1 except for c , which is constrained to | c | <

1. In this setting, the ﬁxed points ofEq. (1) are ( u , v ) = ( c c , − c c ).In continuous media, the spatial component representing the diﬀusion of species is introduced via asecond-order diﬀusion coeﬃcient [17, 29] proportional to ∇ u . It’s counterpart in discrete media, suchas networks, is the Laplacian matrix L ij [34, 38], deﬁned as: L ij = W ij − s i δ ij , (2)where W ij is a symmetric weighted adjacency matrix and s i = (cid:80) j W ij is the weighted degree of node i [39]. To include diﬀusion on the underlying network one would then add a term of the form (cid:80) Nj =1 L ij u j ,and a corresponding one for v , in a network of N nodes ( i = 1 . . . N ). If one were to consider ordinarydiﬀusion then one need only include a diﬀusion coeﬃcient d u , d v for each of the populations, whichrepresents the diﬀusing away from populations of higher density to lower density regions. Yet, in thecontext of competition, one must also consider eﬀects where a minority population under threat from amajority population diﬀuses away to a diﬀerent region to avoid extinction [40, 41]. Such cross-diﬀusioncan be introduced by corresponding coeﬃcients a uv , a vu ≥ uv . With thesereﬁnements, Eq. (1) can be recast thus, du i dt = cu i v i + u i (1 − u i ) + N (cid:88) j =1 L ij [( d u + a uv v j ) u j ] ,dv i dt = − cu i v j + v i (1 − v j ) + N (cid:88) j =1 L ij [( d v + a vu u j ) v j ] . (3)Note, that in districts where c >

0, the higher status tongue corresponds to the population u , and v tends to move away at a rate a uv = γ whereas u remains in the district so a vu = 0. Similarly, in districtswhere c < u diﬀuses away and we have a uv = 0, a vu = γ .7 .2 Choice of network and Turing instability Next, we consider the choice of network that best represents the interactions between the populations indiﬀerent centers. Since we are interested in capturing the eﬀects of population movement, in principle,all cities are accessible to each other through a transportation network, such that all nodes are connectedto each other as in a complete graph. Yet the extent of ﬂows between two cities i, j depends on theirrespective populations p i , p j and the distance d ij . The simplest choice for the coupling between cities isthe gravity model [24, 25] with weights W ij = p j p j d ij . (4)Here W ij represents the i th row and j th column of the weight matrix W ; p i and p j are the populations ofthe corresponding cities and d ij is the geographical distance separating them. Thus greater ﬂows occureither between high population centers or those proximate to each other. In principle the gravity modelcan be generalized to more complex dependencies on the population and distance [42] and one can considerrelated models of mobility such as those based on intervening opportunities [26, 43], however the versionconsidered here has been used to accurately described mobility patterns in diﬀerent contexts [44, 45],and our results are not overly sensitive to the precise form of Eq. (4) as long as the networks are heavy-tailed [35]. In Fig. 1 b,d we show the strength distribution P ( s ) of the Galician and Southern Carinthiangeographical networks indicating a right-skewed distribution in both cases. A maximum likelihood ﬁt tothe tails of the distribution yields P ( s ) ∼ s − β with β G ≈ .

72 and β C ≈ .

35. The average weights are (cid:104) s (cid:105) G = 6 .

33 and (cid:104) s (cid:105) C = 1 .

54 whereas the variance σ s = (cid:112) (cid:104) s (cid:105) − (cid:104) s (cid:105) for each network are σ Gs = 29 . σ Cs = 5 . φ ( α ) of the Laplacian matrix (withassociated eigenvalue Λ α ), where α = 1 , . . . N corresponds to the mode [38]. The eigenvalues Λ α aresorted in decreasing order Λ > Λ · · · > Λ N and the ﬁrst eigenvalue is always zero (Λ = 0). Introducingsmall perturbations ( δu i , δv i ), substituting into Eq. (3), and expanding over the set of the Laplacianeigenvectors, the linear growth rate λ α for each node is calculated from a polynomial equation of theform λ α + b (Λ α ) λ α + c (Λ α ) = 0 , (5)where b (Λ α ) , c (Λ α ) are functions of the (cross)-diﬀusion coeﬃcients, the competition terms uv , and thestatus c [34, 35, 40]. The set of solutions to Eq. (5) for all modes α , correspond to a dispersion relation λ (Λ α ). The Turing instability occurs when at least one of the modes become unstable, indicated by Re ( λ α ) > c (Λ α ) <

0, and the corresponding mode is denoted α c [34, 35]. The fulldetails of the calculation are shown in Supplementary Section S1 and Fig. S1.8 igure 3: Testing the eﬀect on linguistic prevalence in Galicia for each ingredient in our model. In (a), (d) wetest the eﬀect of network topology alone, by assigning equal status to every district ( c = 0 . b , with the eﬀect of removing the spatial nature of the network.In (b), (e) we restore the spatial nature of the network, while maintaining a ﬁxed status for every district.Finally, in (c), (f ) , we use the spatial network in combination with the empirical geographic distribution ofstatus for each language. Pearson correlation-coeﬃcients in each panel indicate the comparison between thesimulated and empirical concentration of speakers. After the onset of the Turing instability, the system reaches a steady-state concentration of speakersfor each node (city) which is normalized according to ˜ u = (cid:104) u (cid:105)(cid:104) u (cid:105) + (cid:104) v (cid:105) , and similarly for v . Here the (cid:104) . . . (cid:105) denotes averaging over multiple realizations of the simulation corresponding to diﬀerent initial conditions.The concentration of speakers at the district level is then a population weighted-sum over all constituentnodes. The full details of the normalization and data aggregation procedure is described in SupplementarySection S2 and Fig. S2. In what is to follow, for the sake of simplicity, we assume that both competinglanguages (cross)-diﬀuse at the same rate. We begin our analysis with the case of Galicia. We seek to uncover the role of each underlying mechanismin the observed empirical trends, and therefore systematically probe the eﬀect of the model constituents,starting with the underlying network mediating the population-interactions. Recent results suggest that9uring patterns in networks are inﬂuenced and stabilized primarily by network topology provided thedistribution of links is heavy-tailed [35].While the mobility networks we consider are spatial, we ﬁrst check the extent to which the linguisticpatterns can be explained solely by the heavy-tailed nature of the network. To do so we randomlyassign each node a weight s sampled from the empirical distribution P ( s ) for Galicia (Fig. 1 b ), in eﬀectremoving any information about the spatial location of the nodes, and treating the network as a purelytopological graph. In addition we set the value of the status c = 0 . d u , d v = 0 .

01 and γ = 2 . a,d we show the results ofthe simulation averaged over 100 realizations of the process, compared to the empirical data, as a scatterplot of the districts according to the preponderance of Galician and Spanish, and in Fig. 3 a,d we showthe spatial linguistic distributions.The scatter plots indicate relatively few nodes diﬀerentiate from their initial ﬁxed-points for bothGalician and Spanish. The agreement with the empirical data is rather poor with a Pearson correlation-coeﬃcient of ρ Gp = 0 .

25 for Galician and ρ Sp = 0 .

26 for Spanish. One can also check the relative prevalenceof linguistic speakers in each region by ranking districts by the concentration of speakers for each language,and then compute the rank correlation-coeﬃcient. In Fig. S4 a,d we show the scatter plot of the simulatedand empirical data in terms of the rank of each district. The Spearman correlation coeﬃcient for bothGalican and Spanish is ρ Gs = ρ Ss = 0 .

07. The results indicate that network topology by itself is a poorindicator of the observed linguistic prevalence.Next, we restore the spatial nature of the network maintaining both P ( s ) as well as the geographicposition of the nodes, i.e links between nodes are established according to Eq. (4), and re-run the simu-lation with the same parameters. We show the results in Fig. 3 b,e where we plot the spatial distributionand in Fig.S3 b,e which shows the scatter plot of concentrations. We ﬁnd improved correspondence forboth Galician ( ρ Gp = 0 .

52) and Spanish with ( ρ Sp = 0 . c in every district, which biases the result towardsfavoring Galician speakers. A choice of negative c for each district would reverse the trends. This isalso reﬂected in Fig. S4 b,e for the rank scatter plots, where we ﬁnd ρ Gs = ρ Ss = 0 .

63. Nevertheless, thereasonable agreement between simulation and data for a majority of districts points to an important roleplayed by the geographic networks in linguistic evolution. By itself, however, it is not enough to explainthe full picture.Next, we incorporate the geographical distribution of the status parameter c into our framework.Surveys and polls conducted in Galicia reveal 12 districts where Galician is perceived to have higherstatus ( c > c <

0) [36]. For those districts where residentsreport a higher status for Galician we set c = 0 .

5, and for those that prefer Spanish we set c = − . c,f and Fig. S3 c,f for the spatial distribution and scatter plots respectively. We now ﬁndsigniﬁcantly better agreement with the empirical data for both types of languages, with ρ Gp = ρ Sp = 0 . c,f with ρ Gs = ρ Ss = 0 .

85. Taken together, the results indicate that the heavy-tailed nature of the10 e r m a n Fixed Status withShuffled Weights(a)

Gerp = 0.13

Fixed Status withEmpirical Weights(b)

Gerp = 0.45

Geographic Status withEmpirical network(c)

Gerp = 0.90 S l o ve n i a n (d) Slop = 0.13 (e)

Slop = 0.45 (f)

Slop = 0.90 0.00.51.00.00.51.0

Figure 4:

Testing the eﬀect on linguistic prevalence in Southern Carinthia for each ingredient in our model. In (a), (d) we test the eﬀect of network topology alone, by assigning equal status to every district ( c = 0 . b , with the eﬀect of removing the spatial nature ofthe network. In (b), (e) we restore the spatial nature of the network, while maintaining a ﬁxed status for everydistrict. Finally, in (c), (f ) , we use the spatial network in combination with the empirical geographicdistribution of status for each language. Pearson correlation-coeﬃcients in each panel indicate the comparisonbetween the simulated and empirical concentration of speakers. geographical mobility network coupled with the spatial correlation of the status parameters (along withtheir asymmetry) is a good predictor of the linguistic prevalence in Galicia. To check whether our results are unique to Galica, or generalizable to other regions, we now repeatthe analysis for Southern Carinthia. We adjust the (cross)-diﬀusion constants to generate an instabilityrange coinciding with the eigenvalue distribution of the empirical network Laplacian; now d u = d v = 0 . γ = 21. We once again, start by randomly assigning weights to nodes sampled from the empiricaldistribution P ( s ) seen in Fig. 1 d , assign the same value of the status c = 0 . ρ Gerp = 0 . , ρ Slop = 0 .

13, Fig. 4 a,d and S5 a,d ) as well as theirrelative abundance ( ρ Gers = 0 . , ρ Slos = 0 .

2, Fig. S6 a,d ), indicating that here too, the topological natureof the network by itself is a poor predictor of linguistic prevalence.Next, we consider the geographical network with weights assigned according to the nodes’ positions(Eq. (4)). The results are shown in Fig. 4 b,e and Fig. S5 b,e . Unlike in Galicia, here we ﬁnd poorcorrespondence with the data. The Pearson correlation coeﬃcients are now ρ Gerp = ρ Slop = − .

45 andthe Spearman correlation coeﬃcients are ρ Gers = ρ Slos = − .

35. The negative values for the correlationstem from only a few regions, and it is more accurate to say that there appears to be no correlationbetween the simulated and empirical data. The contrast with Galicia is striking, and is potentially due11o the network in Southern Carinthia being an order of magnitude smaller in size. In this relatively smallsetting, the geographical mobility network has minimal-to-no-role in predicting the linguistic patterns.Additionally, we do not have access to how residents perceive the status of each language given thatthere are no (known) surveys or polls. A reasonable choice in determining the status, however is toinfer it from the proportion of speakers. That is, in those regions where German is the majority tonguewe assume German has the higher status, and similarly regions with majority Slovenian speakers areassigned a higher status for Slovenian. Correspondingly in German majority districts we set c = 0 . c = − . c,f and Fig. S5 c,f for the spatial distribution andscatter plots respectively. In this case, we ﬁnd even better agreement with the data as compared toGalicia, with ρ Gerp = ρ Slop = 0 .

9. A similar trend is seen for the relative abundance (Fig. S6 c,f )with ρ Gers = ρ Slos = 0 .

9. Thus, in this case, while the network plays a limited role, accounting for theasymmetry and spatial correlation of the status, the Turing mechanism produces good agreement withthe empirical linguistic distributions in in Carinthia.

In this manuscript we have presented a minimal formulation to explain the observed linguistic trends intwo regions of Europe where languages co-exist. Our model, based on the Turing mechanism has as itsprimary ingredients, a reaction-diﬀusion model where language species spread and retreat in the samefashion as it occurs in predator-prey dynamics, the mobility network between locations based on thegravity model, coupled with the asymmetries and the geographical distribution in how speakers perceivea given language. Unlike in other descriptions of linguistic evolution, the model constituents are set up ina way, such that we can tease out the eﬀects of each component. Another advantage of our framework ascompared to existing formulations is the need for minimal empirical input, as well as its generalizabilityto multiple settings. Given that the language dynamics occurs on a discrete network we are able tosimultaneously capture microscopic and macroscopic dynamics without the rise of pathologies such as“language islands” due to the lack of diﬀusive fronts in non-contiguous regions.While patterns have been known to be stabilized by heterogenous network topologies in other settings,considering just the network topology by itself without considering its spatial nature, leads to pooragreement with our results and empirical trends. Once one accounts for the spatial location of nodes,the model gets about half the districts right in Galicia. Note that this occurs despite assigning bothlanguages, Galician and Spanish, equal status among residents. In our version, we assigned higher statusto Galican and despite this, the model was able to accurately reproduce some of the districts whereSpanish is the majority language. This points to strong evidence for the spatial mobility network ofcontacts playing an important role in language interaction and diﬀusion. Similar results were seen forthe relative abundance of languages, that is, ranking districts based on the concentration of speakers ofeach language.Interestingly enough, the same was not seen for Southern Carinthia, where the spatial network ap-peared to have little-to-no predictive power in terms of the concentration of German and Slovenianspeakers. Nevertheless, when coupled with a bimodal distribution for the status parameter (reﬂecting12symmetries in how languages are perceived), we got very good agreement in Galicia as well as SouthernCarinthia, both in terms of the concentration of speakers and the relative abundance of the languages.Note that, in the latter case, we were able to produce good agreement with the empirical values, despitenot knowing the actual values of the status parameters for each language.Our results are notable, given our minimal set of assumptions as well as little recourse to empiricalparameters. Of course, to go beyond this one would need tailored models with more granular data and theintroduction of more region-speciﬁc parameters. Additionally, we do not consider more complex facetsof linguistic prevalence such as bilingualism [9, 21], however such features can be in principle introducedthrough an additional term in Eq. (3). Nevertheless, our objective here was to focus on uncovering the(potential) basic mechanisms of linguistic evolution and compare it against empirical trends, and notnecessarily attempt to reproduce exactly the concentration of speakers in a region. We anticipate ourformulation will be quite useful in understanding linguistic prevalence in those regions with scarce dataon the relevant parameters.

Acknowledgments

SM, GG and JP acknowledge support from the Global Research Network program through the Min-istry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2016S1A2A2911945). APM and MMJ are supported by the Spanish Ministerio de Econom´ıa y Com-petitividad and European Regional Development Fund, contract RTI2018-097063-B-I00 AEI/FEDER,UE; by Xunta de Galicia, Research Grant No. 2018-PG082, and the CRETUS Strategic Partnership,AGRUP2015/02, supported by Xunta de Galicia. All these programs are co-funded by FEDER (UE).GG and SM also thank the University of Rochester for ﬁnancial support.13 ibliography [1] Kandler, A. & Steele, J. Modeling language shift.

Proceedings of the National Academy of Sciences , 4851–4853 (2017).[2] Tsunoda, T.

Language Endangerment and Language Revitalization : An Introduction (De GruyterMouton, 2006).[3] Nettle, D. Explaining Global Patterns of Language Diversity.

Journal of Anthropological Archaeology , 354–374 (1998).[4] Castellano, C., Fortunato, S. & Loreto, V. Statistical physics of social dynamics. Reviews of ModernPhysics , 591–646 (2009).[5] Ball, P. The physical modelling of society: A historical perspective. Physica A: Statistical Mechanicsand its Applications , 1–14 (2002).[6] UNESCO Ad Hoc Expert Group on Endangered Languages. Language vitality and endangerment. unesdoc.unesco.org/images/0018/001836/183699E.pdf (2003).[7] Abrams, D. M. & Strogatz, S. H. Modelling the dynamics of language death.

Nature , 900(2003).[8] Mira, J. & Paredes, ´A. Interlinguistic similarity and language death dynamics.

Europhysics Letters , 1031 (2005).[9] Seoane, L. F. & Mira, J. Modeling the life and death of competing languages from a physical andmathematical perspective. Bilingualism and Minority Languages in Europe: Current trends anddevelopment. , 1–22 (2017).[10] Mira, J., Seoane, L. F. & Nieto, J. J. The importance of interlinguistic similarity and stable bilin-gualism when two languages compete. New Journal of Physics , 033007 (2011).[11] Zhang, M. & Gong, T. Principles of parametric estimation in modeling language competition. Proceedings of the National Academy of Sciences , 9698–9703 (2013).[12] Isern, N. & Fort, J. Language extinction and linguistic fronts.

Journal of the Royal Society Interface , 2–10 (2014). 1413] Cooper, F., Ghoshal, G. & P´erez-Mercader, J. Composite bound states and broken u (1) symmetryin the chemical-master-equation derivation of the gray-scott model. Physical Review E , 042926(2013).[14] Cooper, F., Ghoshal, G., Pawling, A. & P´erez-Mercader, J. Internal composite bound states indeterministic reaction diﬀusion models. Physical Review Letters , 044101 (2013).[15] Ghoshal, G., Mu˜nuzuri, A. P. & P´erez-Mercader, J. Emergence of a super-synchronized mobbingstate in a large population of coupled chemical oscillators.

Scientiﬁc Reports , 19186 (2016).[16] Kandler, A. Demography and Language Competition. Human Biology , 181–210 (2009).[17] Yun, J., Shang, S. C., Wei, X. D., Liu, S. & Li, Z. J. The possibility of coexistence and co-developmentin language competition: ecology–society computational model and simulation. SpringerPlus , 855(2016).[18] Kandler, A., Unger, R. & Steele, J. Language shift, bilingualism and the future of Britain’s Celticlanguages. Philosophical Transactions of the Royal Society B: Biological Sciences , 3855–3864(2010).[19] Patriarca, M. & Heinsalu, E. Inﬂuence of geography on language competition.

Physica A: StatisticalMechanics and its Applications , 174–186 (2009).[20] Prochazka, K. & Vogl, G. Quantifying the driving factors for language shift in a bilingual region.

Proceedings of the National Academy of Sciences , 4365–4369 (2017).[21] Mussa Juane, M., Seoane, L. F., Mu˜nuzuri, A. P. & Mira, J. Urbanity and the dynamics of languageshift in Galicia.

Nature Communications , 1–9 (2019).[22] Lee, M., Barbosa, H., Youn, H., Holme, P. & Ghoshal, G. Morphology of travel routes and theorganization of cities. Nature Communications , 2229 (2017).[23] Kirkley, A., Barbosa, H., Barthelemy, M. & Ghoshal, G. From the betweenness centrality in streetnetworks to structural invariants in random planar graphs. Nature Communications , 2501 (2018).[24] Barbosa, H. et al. Human mobility: Models and applications.

Physics Reports , 1–74 (2018).[25] Zipf, G. K. The p1 p2 / d hypothesis : On the intercity movement of persons.

American SociologicalReview , 677–686 (1946).[26] Simini, F., Gonz´alez, M. C., Maritan, A. & Barab´asi, A.-L. A universal model for mobility andmigration patterns. Nature , 96 (2012).[27] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M. & Pentland, A. Urban characteristics attributableto density-driven tie formation.

Nature Communications , 1961 (2013).[28] Pinasco, J. P. & Romanelli, L. Coexistence of Languages is possible. Physica A: Statistical Mechanicsand its Applications , 355–360 (2006). 1529] Kandler, A. & Steele, J. Ecological Models of Language Competition.

Biological Theory , 164–173(2008).[30] Castets, V., Dulos, E., Boissonade, J. & De Kepper, P. Experimental evidence of a sustained standingTuring-type nonequilibrium chemical pattern. Physical Review Letters , 2953–2956 (1990).[31] Ouyang, Q. & Swinney, H. L. Transition from a uniform state to hexagonal and striped Turingpatterns. Nature , 610–612 (1991).[32] Sayama, H., Kaufman, L. & Bar-Yam, Y. Spontaneous pattern formation and genetic diversity inhabitats with irregular geographical features.

Conservation Biology , 893–900 (2003).[33] Turing, A. M. The chemical basis of morphogenesis. Philosophical Transactions of the Royal Societyof London. Series B, Biological Sciences , 37–72 (1952).[34] Nakao, H. & Mikhailov, A. S. Turing patterns in network-organized activator–inhibitor systems.

Nature Physics , 544–550 (2010).[35] Mimar, S., Juane, M. M., Park, J., Mu˜nuzuri, A. P. & Ghoshal, G. Turing patterns mediated bynetwork topology in homogeneous active systems. Physical Review E , 062303 (2019).[36] Instituto Nacional de Estad´ıstica (Spanish Statistical Oﬃce). .[37] Census data on language use for southern Carinthia from 1880-1910. https://figshare.com/articles/Language_use_in_Carinthia/4535399 .[38] Othmer, H. G. & Scriven, L. E. Instability and dynamic pattern in cellular networks. Journal ofTheoretical Biology , 507–537 (1971).[39] Ioannis, A. & Eleni, T. Statistical analysis of weighted networks. Discrete Dynamics in Nature andSociety , 375452 (2008).[40] Vidal-Franco, I., Guiu-Souto, J. & Mu˜nuzuri, A. P. Social media enhances languages diﬀerentiation:a mathematical description.

Royal Society Open Science , 170094 (2017).[41] Vanag, V. K. & Epstein, I. R. Cross-diﬀusion and pattern formation in reaction-diﬀusion systems. Physical Chemistry Chemical Physics , 897–912 (2009).[42] Schneider, M. Gravity models and trip distribution theory. Papers of the regional science association , 51–58 (1959).[43] de Dios Ort´uzar, J. & Willumsen, L. Modeling Transport (John Wiley and Sons Ltd, New York,2011).[44] Jung, W. S., Wang, F. & Stanley, H. E. Gravity model in the Korean highway.

Europhysics Letters , 4 (2008).[45] Noulas, A., Scellato, S., Lambiotte, R., Pontil, M. & Mascolo, C. A tale of many cities: Universalpatterns in human urban mobility. PLoS ONE , 9 (2012).16 upplementary Information S1 Turing instability

Expanding the functions f ( u i , v i ) = cu i v i + u i (1 − u i ) and g ( u i , v i ) = − cu i v i + v i (1 − v i ) to ﬁrst-orderaround the ﬁxed points u , v via perturbations δu i , δv i , Eq. (3) can be written in linearized form as: (cid:32) du i dtdv i dt (cid:33) = (cid:32) f u f v g u g v (cid:33) (cid:32) u i − u v i − v (cid:33) + N (cid:88) j =1 L ij (cid:32) d u + a uv v a uv u a vu v d v + a vu u (cid:33) (cid:32) u j v j (cid:33) , (S1)with the Jacobian and diﬀusion matrix are respectively deﬁned as: J = J | u ,v = (cid:32) f u f v g u g v (cid:33) and D = D | u ,v = (cid:32) d u + a uv v a uv u a vu v d v + a vu u (cid:33) = (cid:32) D uu D uv D vu D vv (cid:33) . (S2)The eigenvalue equation for the Laplacian matrix is: (cid:80) Nj =1 L ij φ ( α ) j = Λ α φ ( α ) i , α = 1 · · · N . In terms ofsmall perturbations, Eq. (S1) becomes: dδu i dt = f u δu i + f v δv i + N (cid:88) j =1 L ij D δu i ,dδv i dt = g u δu i + g v δv i + N (cid:88) j =1 L ij D δv i . (S3)The perturbations can be expanded over the set of Laplacian eigenvectors as δu i ( t ) = (cid:80) Nα =1 B ( α ) u exp [ λ α t ] φ ( α ) i and δv i ( t ) = (cid:80) Nα =1 B ( α ) v exp [ λ α t ] φ ( α ) i . Substituting these into Eq. (S3) we obtain the following eigenvalueequation: λ α (cid:32) B ( α ) u B ( α ) v (cid:33) = (cid:32) f u + D uu Λ α f v + D uv Λ α g u + D vu Λ α g v + D vv Λ α (cid:33) (cid:32) B ( α ) u B ( α ) v (cid:33) (S4)The characteristic equation of this system is given by: λ α + b (Λ α ) λ α + c (Λ α ) = 0 (S5)where: b (Λ α ) = − [ T r ( J ) + T r ( D )Λ α ] ,c (Λ α ) = Det ( D )Λ α + [ D uu g v + f u D vv − f v D vu − D uv g u ]Λ α + Det ( J ) . The solutions to Eq. (S5) are then: λ α = − b (Λ α )+ √ b (Λ α ) − c (Λ α )2 and λ α = − b (Λ α ) − √ b (Λ α ) − c (Λ α )2 .When diﬀusion starts, the only solution with positive real part is λ α . Thus, we deﬁne the dispersionrelation in terms of Laplacian eigenvalues as Re( λ (Λ α )) with roots Λ α and Λ α , deﬁning the instabilityrange.The Turing instability is triggered when the eigenvalues Λ( α ) become unstable, which indicates thatthe corresponding growth factors in the dispersion relation Re( λ (Λ α )) become positive. In Fig. S1 weshow the eigenvalues distribution for the geographic networks of Galicia (a) and Carinthia (b) along thecurve Re( λ (Λ α )). The unstable modes are the eigenvalues that lie in the range [Λ Gα , Λ Gα ] and [Λ Cα , Λ Cα ]respectively. 17 R e () (a) G G R e () (b) C C Figure S1: Eigenvalue distributions of empirical networks of (a)

Galicia and (b)

Carinthia along thedispersion curve Re( λ (Λ α )). The instability ranges are marked by [Λ Gα , Λ Gα ] and [Λ Cα , Λ Cα ] for Galicia andCarinthia respectively. Diﬀerentiation of nodes are triggered by the instable growth factors Re( λ (Λ α )) > S2 Data normalization and aggregation

Fig. S2 shows the results of a typical simulation for the region of Galicia using the set of parameters usedto generate Fig. 3 in the main manuscript. Galician and Spanish speakers start with the initial ﬁxedpoints u = 1 . v = 0 . c = 0 . u = 0 . v = 1 . c = − .

5. The concentration of speakers in the new stationary state is shown in Fig. S2 (a) and (c) . Theresults are averaged over multiple realizations with diﬀerent initial random perturbations to the ﬁxedpoints. The points correspond to the average over multiple realizations and ﬂuctuations are shown aserror bars. In Fig. S2 (b) and (d) we show the normalized concentrations, where nodes are rescaled withthe total average concentration per node ( ˜ u = (cid:104) u i (cid:105)(cid:104) u i (cid:105) + (cid:104) v i (cid:105) and ˜ v = (cid:104) v i (cid:105)(cid:104) u i (cid:105) + (cid:104) v i (cid:105) ). The normalized values ofthe initial ﬁxed points correspond to ˜ u = 0 .

75; ˜ v = 0 . p i of the node that they belong to. In other words, Galician and Spanishspeakers of district j is calculated by (cid:104) u j (cid:105) = (cid:80) i ∈ j ˜ u i p i (cid:80) i ∈ j p i and (cid:104) v j (cid:105) = (cid:80) i ∈ j ˜ v i p i (cid:80) i ∈ j p i . The same procedure is usedin Carinthia. 18

100 200 300 400 500 node index i u i (a) node index i u i (b) node index i v i (c) node index i v i (d) Figure S2: Averaged simulation results after multiple realization with diﬀerent random initial perturba-tions. Panels (a) and (c) are the concentration of speakers of Galician and Spanish respectively for eachnode. Panels (b) and (d) are the normalized fractions of speakers.19 .0 0.2 0.4 0.6 0.8 1.0Simulation Concentration0.00.20.40.60.81.0 R e a l C o n c e n t r a t i o n Gp = 0.25 (a) R e a l C o n c e n t r a t i o n Gp = 0.52 (b) R e a l C o n c e n t r a t i o n Gp = 0.84 (c) R e a l C o n c e n t r a t i o n Sp = 0.26 (d) R e a l C o n c e n t r a t i o n Sp = 0.51 (e) R e a l C o n c e n t r a t i o n Sp = 0.84 (f) Figure S3: Comparison of simulation results with empirical concentrations in Galicia, illustrated in thesame order as in Fig. 3 a, b, c for Galician and d, e, f for Spanish speakers. The Pearson correlation-coeﬃcient ( ρ p ) is reported in each panel. Simulation Rank10 R e a l R a n k Gs = 0.07 (a) Simulation Rank10 R e a l R a n k Gs = 0.63 (b) Simulation Rank10 R e a l R a n k Gs = 0.85 (c) Simulation Rank10 R e a l R a n k Ss = 0.07 (d) Simulation Rank10 R e a l R a n k Ss = 0.63 (e) Simulation Rank10 R e a l R a n k Ss = 0.85 (f) Figure S4: Comparison of simulation results with empirical rank-ordering of districts in terms of con-centrations, in the same order as in Fig. 3 a, b, c for Galician and d, e, f for Spanish speakers. TheSpearman correlation-coeﬃcient ( ρ s ) is reported in each panel.20 .0 0.2 0.4 0.6 0.8 1.0Simulation Concentration0.00.20.40.60.81.0 R e a l C o n c e n t r a t i o n Gerp = 0.13 (a) R e a l C o n c e n t r a t i o n Gerp = 0.45 (b) R e a l C o n c e n t r a t i o n Gerp = 0.90 (c) R e a l C o n c e n t r a t i o n Slop = 0.13 (d) R e a l C o n c e n t r a t i o n Slop = 0.45 (e) R e a l C o n c e n t r a t i o n Slop = 0.90 (f)

Figure S5: Comparison of simulation results with empirical concentrations in Carinthia, in the same orderas in Fig. 4 a, b, c for German and d, e, f for Slovenian speakers. The Pearson correlation-coeﬃcient( ρ p ) is reported in each panel. Simulation Rank10 R e a l r a n k Gers = 0.20 (a) Simulation Rank10 R e a l R a n k Gers = 0.35 (b) Simulation Rank10 R e a l R a n k Gers = 0.90 (c) Simulation Rank10 R e a l R a n k Slos = 0.20 (d) Simulation Rank10 R e a l R a n k Slos = 0.35 (e) Simulation Rank10 R e a l R a n k Slos = 0.90 (f)

Figure S6: Comparison of simulation results with empirical rank-ordering of districts in terms of con-centrations, in the same order as in Fig. 4 a, b, c for German and d, e, f for Slovenian speakers. TheSpearman correlation-coeﬃcient ( ρ ss