[PDF] The agglomeration and dispersion dichotomy of human settlements on Earth

Abstract

Human settlements on Earth are scattered in a multitude of shapes, sizes and spatial arrangements. These patterns are often not random but a result of complex geographical, cultural, economic and historical processes that have profound human and ecological impacts. However, little is known about the global distribution of these patterns and the spatial forces that creates them. This study analyses human settlements from high-resolution satellite imagery and provides a global classification of spatial patterns. We find two emerging classes, namely agglomeration and dispersion. In the former, settlements are fewer than expected based on the predictions of scaling theory, while an unexpectedly high number of settlements characterizes the latter. Our global classification of spatial patterns correlates with some urban outcomes, such as the amount of CO2 emitted for transportation, providing insights into the relationship between land use patterns and socio-economic and environmental indicators. To explain the observed spatial patterns, we also propose a model that combines two agglomeration forces and simulates human settlements' historical growth. Our results show that our model accurately matches the observed global classification (F1: 0.73), helps to understand and estimate the growth of human settlements and, in turn, the distribution and physical dynamics of all human settlements on Earth, from small villages to cities.

Full PDF

PPrecise mapping, spatial structure and classiﬁcation of all the humansettlements on Earth

Emanuele Strano,

1, 2, ∗ Filippo Simini, Marco De Nadai,

4, 5

Thomas Esch, and Mattia Marconcini MindEarth, 2502 Biel/Bienne, CH German Aerospace Center (DLR), 82234 Wessling, Germany Department of Engineering Mathematics, University of Bristol, UK Department of Information Engineering and Computer Science, University of Trento, Trento, Italy Mobs Lab, Fondazione Bruno Kessler, Trento, Italy (Dated: June 12, 2020)

Human settlements (HSs) on Earth have a direct impact on all natural and societal systems butdetailed and quantitative measurements of the locations and spatial structures of all HSs on Earthare still under debate. We provide here the World Settlement Footprint 2015, an unprecedented 10m resolution global spatial inventory of HSs and a precise quantitative analysis and spatial model oftheir coverage, geography and morphology. HSs are estimated to cover 1.47% of the habitable globaldry-land surface and can be classiﬁed, by means of their deviation from scaling structure, into fourmain pattern typologies. A minimal spatial model, based on dynamic interactions between dispersaland centralized urbanization, is able to reproduce all the settlement patterns across regions. Ourdataset and settlement model can be used to improve the modelling of global land use changes andhuman land use cycles and interactions and can ultimately advance our understanding of globalanthropization processes and human-induced environmental changes.

INTRODUCTION

In this study, we provide an unprecedented estimation of the global cover and geography of humansettlements (HSs), and a quantitative analysis of their location, extent and spatial distribution bymeans of scaling analysis. The growth and expansion of cities on Earth are drivers of all global social,economic and environmental systems [1–4]. Considerable evidence exists of the impact of cities onwater and ecological systems, land use competition, food production, biodiversity, climate changeand human health [5–9], while the trade oﬀ between beneﬁts and challenges for global urbanizationhas been extensively debated in the literature [10–13]. While literature have been focused on citiesand urbanization the real dimension, distribution and explanation of HSs at the global scale are not ∗ Correspondence and material requests can be addressed to: [email protected]. a r X i v : . [ phy s i c s . s o c - ph ] J un yet fully understood, particularly with respect to the precise knowledge, global spatial distributionand spatial patterns of all sizes of settlements ranging from very large metropolitan areas to smalland scattered rural settlements.Several factors limit the global description and analysis of HSs: on the one hand, most earlystudies relied on low- or medium-resolution satellite data that range from 0 . urban land cover and thus excluding from the analysis the vast majorityof non-urban settlements. Although higher resolution global HSs inventories have recently beenproposed [18–20], considerable variation exists in their spatial metrics [20], probably due to thediﬃculties in exploiting global high-resolution imagery and the lack of consistent and cross-validatedtests in all the datasets. Consequently, quantitative analyses of global HSs that aim to measure theirextent and model their structure suﬀer from certain biases because of inaccurate data. Quantitativeanalyses of HSs patterns rely on traditional spatial metrics largely used in urban geography [21],such as dispersion/compactness metrics [22–25], and statistical analysis derived from complexsystems, such as scaling analysis and fractals [26]. Scaling analysis, and ﬁnite-size scaling inparticular, is a mathematical framework originally developed to provide a uniﬁed description of theuniversal properties of thermodynamic systems close to a phase transition, which are characterizedby scale-free relationships between the system’s variables and their distributions [27, 28]. In thecontext of urbanization and HSs patterns analyses, some invariant spatial proprieties of HSs [26]and transportation networks [29] have been found to follow scale-free relationships. The strongestempirical evidence of a power-law relationship in urban science is the scale-free distribution ofsettlement sizes: the probability of observing a settlement with an area larger than A followsa power law, P ( A ) ∼ A − α . Empirical values of the power-law exponent have been found in anarrow range around α = 1 for urban areas in regions of various sizes and in diﬀerent continents,suggesting that the distributions of areas of urban settlements universally follow Zipf’s law [30–33].However, the descriptive power of a scaling analysis based on Zipf’s law may have some limitations.because Zipf’s law has usually been observed at large spatial scales, for example, considering thesizes of all cities in a country or continent. Deviations from a pure power-law distribution areexpected in smaller regions, such as provinces or municipalities, due to historical, socio-economicand geographical factors. The presence and consequences of deviations from Zipf’s law at smallerspatial scales have never been investigated at a global scale.We propose a study of HSs on Earth that overcomes these major drawbacks. In particular, thisstudy i) introduces a novel global inventory of HSs, namely the World Settlement Footprint 2015(WSF2015), which has been generated by using both optical and radar satellite imagery and provedsensibly more accurate than any other existing similar layer, ii) provides a comprehensive globalanalysis of the location, extent and density of all size HSs, iii) classiﬁes settlement patterns based onsigniﬁcant deviations from prediction of models based on scaling, and iv) builds a minimal spatiallyexplicit model that is able to reproduce all observed settlement patterns on Earth and to explainhow agglomeration forces aﬀect settlement distributions. RESULTS AND DISCUSSIONPrecise and global estimation of human settlement cover in 2015

In this study, we produced the WSF2015, a novel 10 m resolution binary mask outlining HSsglobally, which has been derived for the ﬁrst time by jointly exploiting multi-temporal radar(Sentinel-1) and optical (Landsat-8) satellite imagery. The WSF2015 has been validated withthe support of Google against 900,000 ground-truth samples labelled by crowd-sourcing photo-interpretation of very high resolution (VHR) Google Earth imagery (see SI Section IV). As a resultof this comprehensive exercise, it emerged that the layer mostly categorizes as HS the combinationof buildings and building lots, where: i) building is considered as any structure having a roofsupported by columns or walls and intended for the shelter, housing, or enclosure of any individual,animal, process, equipment, goods, or materials of any kind; and ii) building lot is considered asthe area contained within an enclosure (e.g., wall, fence, hedge) surrounding a building or a groupof buildings. Furthermore, accuracy ﬁgures assess the high quality and reliability of the WSF2015,which outperforms all other similar existing layers; in particular, it considerably improves thedetection of very small settlements in rural regions and better outlines scattered suburban areas.Details related to the methodology behind the WSF2015 and its validation protocol are given inthe Methods and Supporting Information (SI) sections, respectively, while Figure 1 provides anoverview of the layer at diﬀerent scales for diﬀerent test sites. Such an accurate inventory of humanpresence on Earth allows us to perform an unprecedented analysis on the real magnitude, geographyand spatial structure of HSs at the global level.From the WSF2015, the total number of HSs is approximately 77 million and the correspondingarea amounts to 1,107,682 km (i.e., about 1 .

04% of the global land surface area estimated in131,331,424 km excluding the Arctic and Antarctic regions). However, since not all dry-landsurfaces can be settled, we deﬁned a relief mask excluding areas with complex topography notsuitable for hosting HSs based on the analysis of the local roughness and shaded relief computed FIG. 1. Examples referring to the Igboland (Nigeria), New Delhi (India) and Jinan (China) regions: HSsas outlined from the WSF2015 are shown in black, the relief mask in light brown, the water mask in blue,and remaining open areas (including natural and cropland areas) in light green. In the bottom row, thebackground shows HR optical satellite imagery. from global digital elevation models (DEMs) (see the Methods and SI sections for details). Afterexcluding internal freshwater surfaces as well, the remaining area of habitable land amounts to106,445,525 km ; out of this, we estimate HSs to cover 1 . . ◦ × . ◦ (approximately55 ×

55 km at the equator) and measured the percentage of occupied HS area in each tile δ HS asthe ratio between the tile’s HS area and its total surface area minus the exclusion mask (deﬁned asthe combination of the relief and frashwater masks). The spatial distribution of δ HS is shown inFigure 2, where we also show the cumulative frequency of δ HS for each of the 16 macro areas [34].In the plotted cumulative frequency at the global scale (bottom-left inset of Figure 2) we ﬁxed onthe x -axis seven HS percentage thresholds and we show the long-tail distribution of HS areas pertile, which means that a small number of tiles contains the majority of the settlements.Altogether, Figure 1 and Figure 2 show the high degree of variation of HS patterns both withina tile and between macro areas. − Central America -4 -2 − Southeastern Asia -4 -2 -4 -2 Western Europe C D F − −

10 600.01 0.0001 1.47 30 964 G l o b a l a v e r a g e Global

Southern Asia -4 -2 Eastern Asia -4 -2 South America -4 -2 Middle Africa -4 -2 Southern Africa -4 -2 Eastern Africa -4 -2 Western Asia -4 -2 Australia and N.Zealand -4 -2 Northern Africa -4 -2 Eastern Europe -4 -2 North America -4 -2 Western Africa -4 -2 Human Settlements areas (%)

FIG. 2. A global overview of HSs density in 2015. In both the map and the plot, those areas with a densityless than the global average density (1 . .

47% and 100% arerepresented by a red gradient. The plot on the bottom-left part of the Figure shows the long-tail distributionof the cumulative frequency of the percentage of occupied HS area within each tile, δ HS . A small number oftiles contains the majority of the settlements. Density-independent classes of human settlements’ patterns

The spatial distribution of density alone does not explain the observable complexity of settlementpatterns as observable in Figure 1. Such a variety of patterns may arise from the very well-knownphenomenon of the spatial interpenetration of rural and urban settlements [35], which achieves acomplexity and a size that no longer ﬁt those classes. This phenomenon has been qualitativelyobserved in classical urban geography narratives through the notions of megalopolises [35], urbansprawl [36] and horizontal metropolises [37]. However, this gradual symbiosis of diﬀerent systems ofurbanization forces has never been quantitatively deﬁned and tested.To overcome this limitation, we propose a quantitative classiﬁcation of settlement patternsshowing the existence of large regions where HSs are unexpectedly numerous and highly dispersed,with signiﬁcant deviations from the predictions of scaling theory modelling. For each HS i in a0 . ◦ × . ◦ tile, we measure its area A i and the total settlement area of a tile A totHS = (cid:80) Ni =1 A i ,where N is the number of HSs. The assumption of scaling theory is that the areas of the HSs in thetile and those in its corresponding macro area m are sampled from the same empirical distribution,i.e., P m ( A ), which is well approximated by Zipf’s law (see SI, Figure S1). Based on this assumption,we can estimate the distribution P m ( N | A totHS ) of the number of HSs, N , we expect to ﬁnd in a tilewith a total settlement area A totHS in macro area m according to the theoretical estimates [38] viathe following process. We sample the HS areas from P m ( A ) until the sum of the sampled areas isequal to A totHS ; i.e., the number of HSs N is given by the number of areas sampled. We repeat thisprocess 500 times to numerically estimate the theoretical distribution of the number of HSs in thetile, P ( N | A totHS ) (see Methods). Note that the expected number of HSs increases with the targettotal area A totHS . If the observed values of the number of HSs, N , are distributed according to thetheoretical distribution P m ( N | A totHS ), then the corresponding quantiles Q ( N ) = F ( N | A totHS ) shouldbe distributed uniformly between 0 and 1, where F is the cumulative distribution of N .Thus, based on the theoretical quantiles, we deﬁne two extreme classes of settlement patterns:a dispersion class (0 . ≤ Q ( N ) <

1, 10th decile, blue areas in Figure 3), corresponding to tileswith a large number of HSs according to the theoretical expectations; and an agglomeration class (0 . ≤ Q ( N ) < .

1, 1st decile, orange areas in Figure 3), corresponding to tiles with a small numberof HSs according to the theoretical expectations. In between these two extreme classes, we deﬁnethe balanced class (0 . ≤ Q ( N ) < .

9, 2nd-9th deciles, divided into two sub-groups in green andyellow in Figure 3), corresponding to the tiles with a number of settlements close to the theoreticalaverage number. Figure 3a shows the spatial distribution of the classiﬁed tiles at a global scale. Weobserve that tiles belonging to the various classes are not randomly spatially distributed, but theytend to form clusters that are spatially compact; see, for example, in Figure 3c the blue cluster inthe dispersion class in southern China and the orange cluster in the agglomeration class in northernChina; both areas are of considerable extension. The fact that the classes of settlement patterns arenot randomly distributed in space shows that the proposed classiﬁcation scheme captures diﬀerentpatterns characterizing large geographical regions and possibly large urban corridors.

ClassesPlots legend

Null modelexpected behaviourReal dataemphirical curve abg h ic e fd

Bafoussam

AgglomerationDispersion Balanced Balanced

Human Settlements areas % Q u a n t i l e ( N ) Q u a n t i l e ( N ) Q u a n t i l e ( N ) Hengshui Atlanta Douala Lu’an

FIG. 3. Global classiﬁcation of the HS patterns by their deviation from the scaling null hypothesis. Here, HSsare deﬁned as all areas containing: i) buildings; ii) buildings or building lots; and iii) buildings, building lotsor roads/paved-surfaces and the black areas represent the HS areas. The colour range is consistent across allpanels. The blue plots (b, d, and f) show the dispersion class, which represents locations with more patchesthan expected. b) The metropolitan area of Atlanta (USA), where urban dispersion is due to extensivepatterns of single housing. d) Urban-rural agglomeration in the area around Bafoussam (Cameroon), wheresettlements are mostly composed of informal single housing/agricultural units. f) Area around Lu’an (China),where there is an abundance of low-density sparse and small settlements. The orange plots (c and e) show the agglomeration class, in which we found fewer settlements than expected. c) The city of Douala (Cameroon),where urbanization occurs tightly around the existing urban core. e) The area around the city of Hengshui(China), where the many small agricultural settlements are most likely due to agricultural land preservationstrategies that have been extensively developed with a focus on compactness. g-h-i) These plots show therelation between the percentage of the tile’s HS area (x-axis) and the theoretical quantile of the total numberof settlements observed in the tile (y-axis) for North America, Middle Africa and Eastern Asia, respectively.Each point corresponds to a tile in these macro-areas. The histograms on the right of each scatter plotindicate the number of tiles in each of the classes. The dashed red lines denote the expected distributionaccording to scaling theory.

In North America, more precisely in the Southern United States (US) we notice a large numberof tiles in the dispersion class (blue tiles in Figure 3a, Figure 3b, and Figure 3g), whereas the restof the tiles in the US are mostly within the balanced class (light yellow and green tiles), exceptfor a few large urban agglomerations in the agglomeration class (orange tiles). This picture is inagreement with the measurements of urban sprawl in US metropolitan areas and counties, whichwas evaluated using factors such as development density, land use mix, activity centring, and streetaccessibility [23]. We found, for example, that highly compact cities, such as Douala, Cameroon(Figure 3c), belong to the same class of highly saturated areas as the city of Hengshui, China(Figure 3e). These areas may appear diﬀerent; however, they are intrinsically similar, as in bothcases, the settlements are compact, regardless of their spatial distribution. This classiﬁcation iscorroborated by a qualitative understanding of these two areas: Douala probably attracted allnew settlers around the urban core as it is a port town and the richest and most industrializedtown in Cameroon, whereas near Hengshui, the over-abundance of compactly developed settlementsis due to avoidance of excessive erosion of productive agricultural land. By contrast, the Lu’anregion (Figure 3f), which is also an agricultural area, belongs to the dispersion class probablybecause it has not been regulated by agricultural land erosion protection policies and thus presentsa highly dispersed pattern of settlements. The same highly sprawled pattern appears in severalmega-settlement agglomerations in sub-Saharan Africa, where large sub-urban areas are dominatedby single-plot housing as in the area of Bafoussam, Cameroon (Figure 3d). We can then identifydeviations from the predictions of scaling theory in any of the three classes by comparing thedistribution of the empirical quantiles to the expected uniform distribution. Figure 3g, Figure 3h,and Figure 3i show the relation between the theoretical quantiles Q ( N ) and the percentage of HSareas for the empirical tiles in three example macro areas. In the histograms on the right, we note asigniﬁcant excess of tiles in the dispersion class (10th decile, in blue), corresponding to tiles with alarge number of settlements relative to theoretical expectations. This ﬁnding means that real tilestend to have more settlements than what is predicted by scaling theory. The scatter plots showthat the excess of tiles in the dispersion class is observed across all values of the percentage of HSareas, A totHS , indicating that an over-abundance of HSs is not speciﬁc to lowly or highly urbanizedregions and is thus independent of urban density. Similar results are observed in most macro areas(see SI, Figure S2). Note that the observed excess of real tiles in the dispersion class cannot beexplained by and is not a simple by-product of a diﬀerent distribution of HS sizes for those tiles.The distributions of HS sizes for tiles in the dispersion class are indeed not systematically diﬀerentfrom the distributions of the tiles in the balanced class (see SI, Figure S3). A spatial model for human settlements

The deviation from the prediction by scaling theory provides a variety of HSs patterns but howsuch a variety has been achieved can be only explained by mean of controlled spatial simulations.We start from the hypothesis that HSs evolution cannot be attributed to agglomerating forces alonebut rather to more complicated systems of spatial forces. To est this hypothesis We propose herean extension of random percolation models [33, 39] to simulate and reproduce such a system offorces and explain the macro-dynamics in action during settlement evolution. Our proposed modelworks in a two-dimensional lattice w of size L × L , where L = 1000, whose sites (cells) can beeither occupied ( w i,j = 1, human settlement (HS)) or empty ( w i,j = 0, undeveloped). Without lossof generality, the initial conﬁguration has only the central cell occupied ( w N/ ,N/ = 1), and allother cells are empty ( w i,j = 0 ∀ i, j (cid:54) = N/ q i,j = C ˆ q i,j = C (cid:80) Lk (cid:80) Lz w k,z d − γij,kz (cid:80) Lk (cid:80) Lz d − γij,kz where C = 1 / max i,j (ˆ q i,j ) is a normalization constant and d ij,kz is the Euclidean distance betweensite w i,j and site w k,z .As in its traditional form [33, 39], the parameter γ regulates the strength of attraction of a HScell on a new cell; γ = 0 implies a sparse and randomly located new occupied cell, while a larger γ attracts new cells close to old cells, thus producing mono-centric and compact patterns (see SI,Figure S6). To simulate the diﬀerent forces in action, the exponent γ can take one of two valuesduring the simulation: γ and γ . When the fraction of the occupied cells is less than a given value s (i.e., (cid:80) i,j w i,j /L ≤ s ), the exponent γ characterizes settlements’ expansion during the initialstages of the simulation (see SI, Figure S6). The exponent γ characterizes settlement expansionfor the rest of the simulation when the fraction of HS areas is greater than s : γ =  γ , if (cid:80) i,j w i,j /L ≤ sγ , otherwiseWe grow the simulation to up to 60% of urbanization without loss of generality. The model, whichwe call a multi-parameter model, has three parameters: γ , γ and s . When γ = γ , it becomesequivalent to the single-parameter model [33].We found that the multi-parameter model accurately replicates real settlement patterns andtheir classiﬁcations. Due to the lack of time-varying data, ﬁnding the parameters from historical0 BalancedDispersion ClassClass E n e r g y d i s t a n c e Real patterns Simulated patternsMulti parameter Simulated patternsSingle parameter

Multi parameterSingle parameter

Ghoraghat Upazila (BD) γ= 1.8, γ =3.2, s=0.008 γ= 3.2 γ= 4.0 γ= 1.8, γ =4.0, s=0.006 H a i Dương (VN) F - s c o r e AgglomerationBalanced ad be cf gh

FIG. 4. Qualitative results of two randomly selected tiles in the (a) dispersion and (d) agglomeration classes.The black areas represent the HSs in the tiles. a) Area near Ghoraghat Upazila, Bangladesh, having ansettlement pattern in the dispersion class; b) the most similar simulation obtained with our model; c) themost similar simulation obtained when the single-parameter model falls into the wrong class. d) Area nearHai Duong, Vietnam, having an settlement pattern in the agglomeration class; e) the most similar simulationobtained with our model; f) although the most similar simulation obtained with the single-parameter modellies in the same class as the real tile, it is very diﬀerent. g) This box plot shows the energy distance betweenthe real and simulated tiles computed for each class. It shows that our model always generates settlementpatterns that are consistently better than those generated by the single-parameter model. h) The F1-scoreobtained from the urbanization class of the real tile and the class of its most similar simulation for both ourmodel and the single-parameter model. Our model outperforms the single-parameter model and generatessettlement patterns that are compatible with the classes observed in the real world. empirical data was not possible. Thus, we followed a simulation approach in which we ﬁnd theparameters that best represent the spatial process that might have generated the pattern observedin the real tiles. First, we generate approximately 1,000,000 simulations using a broad range ofparameter values (see SI, Table I); for each real tile i , we ﬁnd the most similar simulation bycomparing the cumulative distributions of HSs and selecting the simulated tile with the smallestenergy distance [40] to the distribution of the real tile (see Methods). We denote this distance by D E ( i ). Finally, for each simulated tile, we also ﬁnd the class of settlement patterns it belongs to1by the above-mentioned procedure based on the quantiles from scaling theory. Figure 4a shows arandomly chosen tile in Ghoraghat Upazila, Bangladesh, while Figure 4b shows its most similarsimulation with parameters γ = 1 . γ = 3 . s = 0 . γ = 1 . γ = 4 . s = 0 . D E across all the tiles from the multi-parameter modeland the single-parameter model (see ﬁg. 4 (g)). The two-sided Kolmogorov-Smirnov (KS) test [41]shows that the multi-parameter model has a signiﬁcantly smaller distance for all urbanizationclasses (see SI, Table SIII), with 47% smaller median distances, on average. This result is robustagainst diﬀerent distance metrics (see SI, Section I). Second, we compare the urbanization class ofeach empirical tile with that of the most similar simulation. We observe that the single-parametermodel overestimates the number of tiles in the agglomeration class and underestimates those in thedispersion class, while the multi-parameter model better captures the distribution of urbanizationclasses. We use the F1-score to quantify the agreement between the urbanization classes of the realand simulated tiles. We ﬁnd that the multi-parameter model achieves a 111% higher performancethan that of the single-parameter model (see SI, Table SII). Figure 4 h shows that this increase inperformance is particularly evident for the balanced and dispersion classes. This ﬁnding indicatesthat the multi-parameter model produces tiles with a distribution of urbanization classes closerto the empirical model and alleviates the over-representation of tiles in the agglomeration classobserved in the single-parameter model (as shown in SI, Figure 7). CONCLUSION

Due to global population growth, HSs are expected to increase accordingly. For this reason, thescientiﬁc understanding the spatial for of HSs is of paramount importance for planning, managingand eventually forecasting HSs and their consequences.In this paper, we provided an unprecedented description of the geography and the spatial2structure of all HSs on Earth. First, we created a new global dataset to reliably measure thelocation and distribution of all the land and clusters occupied by HSs. Then, we measured howmuch land all HSs occupy. As such, a precise estimation of global HS cover can be of relevance fora better estimation of the current state of global land use [42] and of settlement-driven land-usechange as, for example, related to a reexamination of land-use competition dynamics [43]. We alsofound that the density of HSs areas on Earth has a long-tail distribution: very few zones on Earthare occupied by highly dense areas, while the vast majority of Earth is occupied by low-densityscattered settlements composed of less than <

2% of HSs area. These low-density and scatteredpatterns are not only the result of the expansion of metropolitan areas, they also depend on adiﬀerent process that goes beyond the arbitrary rural-urban dichotomy.While cities are undoubtedly crucial for economic growth, other emerging forms of land occupa-tion, that occupy approximately 50% of the global surface, may deserve more profound attentionfrom the scientiﬁc community.We also showed that settlement density alone does not explain the great variability of HSpatterns at the global scale. Thus, we proposed a novel global classiﬁcation of areas based on theobserved deviations from the predictions of the scaling analysis regarding the number of settlementsexpected to be found in a region with a given fraction of HS area. We identiﬁed two main classes ofsettlement patterns: the Dispersion class contains regions having the largest number of settlementswith respect to their HS area, according to the predictions of the scaling analysis; conversely, theAgglomeration class contains regions having the smallest number of settlements with respect to theirHS area, according to the predictions of the scaling analysis. We observed large deviations fromscaling theory that, again, conﬁrm that vast areas are dominated by small and scattered settlementclusters. One can speculate that Zipf’s law is not fully capable of describing urban patterns. Weinstead argue that the observed deviations appear where urban patterns, as directly related tothe presence of a city or a town, are absent. There is indeed a pattern of land occupation that isnot urban nor rural, which however represent a newly observed dominant type of land cover andthat, apparently, does not follow scaling laws. Moreover, it remains an open question whether thispattern should be classiﬁed as city, urban or just very dense rural area and even if such classiﬁcationprovides useful insights. Although for narrative reasons it might be convenient to divide HSspatterns into pre-established classes, we did not empirically observe any sharp classiﬁcation. Thisﬁnding may present a new perspective for the long-standing question on how many people live incities as opposed to rural areas. The proposed spatial explicit model, which accurately describesall observed patterns, explains such variability. In particular, we found that the spatial dynamical3process that regulates attractive and dispersal forces while settlements grow may be subject torandom processes and that their combinations are certainly subject to local and speciﬁc conditions.As such, local and regional conditions must be taken in account when studying and modelling urbanphenomena. In our view, the proposed analysis and model represent a fundamental tool to provideinsights about the structure and the evolution of HSs on Earth and, in turn, of their impact onother human and natural global systems. In the future, the observation of the Earth surface willexperience tremendous improvement, providing more data that are more accurate and denser intime. We hope that our framework will pave the way to new research to understand the extent ofHSs and their impact on the environment and life on Earth.

METHODS

In this section we ﬁrst describe how we delineate the HSs from satellite data, then we explainthe relief mask and the segmentation process. Finally, we describe how we use the data to do thescaling analysis, the simulations and the comparisons with the real data.

Global HSs delineation from satellite imagery

To reliably and accurately outline HSs globally, a novel and robust methodology has beenimplemented to jointly exploit multi-temporal optical and radar satellite data. In particular, therationale is that the temporal dynamics of human settlements over time are considerably diﬀerentthan those of all other information classes. Given all the images acquired over a region of interestin the target period during which sensible changes are not expected to occur (e.g., 1-2 years), keytemporal statistics (i.e., temporal mean, minimum, maximum, variance, etc.) are extracted for i)the original backscattering value in the case of radar data and ii) diﬀerent spectral indices (e.g.,vegetation index, built-up index, etc.) derived after performing cloud/cloud-shadow masking in thecase of optical imagery. Next, candidate training samples for the settlement and non-settlementclass are extracted automatically. In particular, this process is performed on the basis of speciﬁcthresholds for some of the temporal features determined above - based on extensive empiricalanalysis - for each of the 30 diﬀerent climate regions of the K¨oppen Geiger scheme. Classiﬁcation isthen performed for the optical- and radar-based temporal features separately by means of supportvector machines (SVMs); ﬁnally, the two outputs are properly combined. Once assessed and shownto have high robustness, the method has been applied to generate the WSF2015 by means of42014-2015 multi-temporal Sentinel-1 radar and Landsat-8 optical imagery (of which approximately107,000 and 217,000 scenes were processed, respectively). The layer proved extremely accurate andreliable, outclassing all other existing similar layers. This result has been quantitatively assessedthrough an unprecedented validation exercise conducted in close collaboration with Google fora collection of 50 globally distributed test sites (tiles of 1 × × Relief mask

Physical environmental conditions play a major role in HSs development; among these, terrainsteepness is one of the most critical. Accordingly, to exclude from our analysis relief areas that areunfavourable for settlement, we generated – based on extensive empirical analysis - a binary maskusing the Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM) availablebetween -60 ° and +60 ° and the Advanced Spaceborne Thermal Emission and Reﬂection Radiometer(ASTER) DEM elsewhere. Speciﬁcally, the mask is positive where the shaded relief (depicting howthe three-dimensional surface would be illuminated from a point light source) is greater than 212 orthe roughness (deﬁned as the largest inter-cell diﬀerence of a central pixel and its surrounding 8cells) is greater than 15.5 Segmentation process: from raster to vector data

Global urbanization is measured by taking into account HSs, water, and impervious areas. Tofacilitate the analysis at the global scale, the globe has been divided into a grid of 0 . × . × . × . Scaling analysis

To numerically estimate the theoretical conﬁdence intervals for the number of settlements N predicted by scaling theory, we proceed as follows. We evaluate the theoretical conditionaldistribution of the number of settlements in a tile of total HS area A totHS , P m ( N | A totHS ), by samplingwith replacement from the list of settlement areas belonging to the tile’s macro area m until thetotal HS area (i.e., the sum of the sampled areas) is equal to the target value A totHS . The number ofsamples N needed to reach A totHS can be considered to be a number sampled from P m ( N | A totHS ). Byrepeating the sampling process 1000 times, we can evaluate the 1st and 9th deciles, correspondingto the boundaries of the agglomeration and dispersion classes, respectively. Evaluation of the multi-parameter model

Estimating the urbanization process would require temporal data, which are not easy to obtain.Moreover, a model ﬁt on temporal data, where each pixel value is related to all the other pixelsthrough a distance matrix, would be very computationally expensive. Indeed, each tile contains n = 5567 × O ( n ) memory. For this reason, weevaluate our model through simulations.First, we simulate 1000 × γ , γ , s (see SI, Table III). The set of all the simulationtiles is denoted by S . Next, we compare the resulting simulations with the global ( real ) tiles. Foreach tile with an urbanization percentage U r > U s ∈ [ U r − . , U r + 1 . X , Y as: D ( X, Y ) = 2 E | X − Y | − E (cid:12)(cid:12) X − X (cid:48) (cid:12)(cid:12) − E (cid:12)(cid:12) Y − Y (cid:48) (cid:12)(cid:12) (1)where E | X | < ∞ , E | Y | < ∞ , X (cid:48) is an iid copy of X and Y (cid:48) is an iid copy of Y . We denotethe distance of a tile i to its most similar simulated tile by DE ( i ) = min j ∈S D ( X i , X j ). We alsotested other distance measures but did not ﬁnd signiﬁcant diﬀerences (see SI, Section I). As thesize of the simulations is 1000 × r, s ), where r is the real 1000 × s is the simulated 1000 × r and s and frame it as a classiﬁcation problem. Wecompute the F1-score between the ground truth (the classes of the real tiles) and the predictedclasses (the classes of the simulated tiles). The F1-score for all the classes is weighted to accountfor the unbalanced number of tiles in each class. DATA AND SOFTWARE AVAILABILITY

All source codes and data are open and public available. WSF data are available at All sorcecodes are available at https://github.com/denadai2/precise-mapping-human-settlements . [1] UN. Cities in a globalizing world: global report on human settlements 2001 (Earthscan, 2001).[2] UN.

The state of the world cities 2004/5- Globalization and Urban Culture. (Routledge, 2004).[3] UN.

The state of the world cities 2006/7- The Millennium Development Goals and Urban Sustainability. (Routledge, 2006).[4] Birch, E. L. & S.M.Wachter.

Global urbanization (Pennsylvania Press, 2011).[5] Moore, M., Gould, P. & Keary, B. S. Global urbanization and impact on health.

International Journalof Hygiene and Environmental Health , 269 – 278 (2003). [6] Zhou, L. et al. Evidence for a signiﬁcant urbanization eﬀect on climate in china.

Proceedings of theNational Academy of Sciences , 9540–9544 (2004).[7] Kaufmann, R. K. et al.

Climate response to rapid urban growth: Evidence of a human-inducedprecipitation deﬁcit.

Journal of Climate , 2299–2306 (2007).[8] Grimm, N. B. et al. Global change and the ecology of cities.

Science , 756–760 (2008).[9] Tilman, D., Balzer, C., Hill, J. & Befort, B. L. Global food demand and the sustainable intensiﬁcationof agriculture.

Proceedings of the National Academy of Sciences , 20260–20264 (2011).[10] Daily, G. C. & Ehrlich, P. R. Population, sustainability, and earth’s carrying capacity.

BioScience ,761–771 (1992).[11] Johnson, M. P. Environmental impacts of urban sprawl: a survey of the literature and proposed researchagenda. Environment and Planning A , 717–735 (2001).[12] Dye, C. Health and urban living. Science , 766–769 (2008).[13] Seto, K. C., Fragkias, M., G¨uneralp, B. & Reilly, M. K. A meta-analysis of global urban land expansion.

PLOS ONE , 1–9 (2011).[14] Potere, D. & Schneider, A. A critical look at representations of urban areas in global maps. GeoJournal , 55–80 (2007).[15] Gamba, P. & Herold, M. Global mapping of human settlement e Experiences, datasets, and prospects. (CRC Press, 2009).[16] Grekousis, G., Mountrakis, G. & Kavouras, M. An overview of 21 global and 43 regional land-covermapping products.

International Journal of Remote Sensing , 5309–5335 (2015).[17] Angel, S., Parent, J., Civco, D. L., Blei, A. & Potere, D. The dimensions of global urban expansion:Estimates and projections for all countries, 2000–2050. Progress in Planning , 53–107 (2011).[18] Pesaresi, M. et al. Operating procedure for the production of the global human settlement layer fromlandsat data of the epochs 1975, 1990, 2000, and 2014. Tech. Rep., European Join Research Center(2016).[19] Chen, J. et al.

Geomatics World , 1–8 (2017).[20] Esch, T. et al. Breaking new ground in mapping human settlements from space–the global urbanfootprint.

ISPRS Journal of Photogrammetry and Remote Sensing , 30–42 (2017).[21] Herold, M., Scepan, J. & Clarke, K. C. The use of remote sensing and landscape metrics to describestructures and changes in urban land uses.

Environment and Planning A , 1443–1458 (2002).[22] Barrington-Leigh, C. & Millard-Ball, A. A century of sprawl in the united states. Proceedings of theNational Academy of Sciences , 8244–8249 (2015).[23] Hamidi, S. & Ewing, R. A longitudinal study of changes in urban sprawl between 2000 and 2010 in theunited states.

Landscape and Urban Planning , 72–82 (2014).[24] Huang, J., Lu, X. X. & Sellers, J. M. A global comparative analysis of urban form: Applying spatialmetrics and remote sensing.

Landscape and urban planning , 184–197 (2007). [25] Poelmans, L. & Van Rompaey, A. Detecting and modelling spatial patterns of urban sprawl in highlyfragmented areas: A case study in the ﬂanders–brussels region. Landscape and urban planning , 10–19(2009).[26] Batty, M. The size, scale, and shape of cities. Science , 769 (2008).[27] Christensen, K. & Moloney, N. R.

Complexity and criticality , vol. 1 (World Scientiﬁc PublishingCompany, 2005).[28] Stanley, H. E. Scaling, universality, and renormalization: Three pillars of modern critical phenomena.

Reviews of Modern Physics , 358–366 (1999).[29] Strano, E., Nicosia, V., Latora, V., Porta, S. & Barth´elemy, M. Elementary processes governing theevolution of road networks. Scientiﬁc reports , 296 (2012).[30] GK, Z. Human Behavior and the Principle of Least Eﬀort. (Addison-Wesley. (Cambridge, Massachusetts),1949).[31] Rozenfeld, H. D. et al.

Laws of population growth.

Proceedings of the National Academy of Sciences , 18702 (2008).[32] Gabaix, X. & Ioannides, Y. M. The evolution of city size distributions.

Handbook of regional and urbaneconomics , 2341–2378 (2004).[33] Rybski, D., Ros, A. G. C. & Kropp, J. P. Distance-weighted city growth. Physical Review E , 042114(2013).[34] Oﬃce, U. N. S. Standard country or area codes for statistical use , vol. 42 (UN, 1982).[35] Gottmann, J. Megalopolis or the urbanization of the northeastern seaboard.

Economic geography ,189–200 (1957).[36] Indovina, F., Matassoni, F. & Savino, M. La citt`a diﬀusa (Daest Venezia, Italy, 1990).[37] Vigan`o, P., Arnsperger, C., Lanza, E. C., Corte, M. B. & Cavalieri, C. Rethinking urban form:Switzerland as a “horizontal metropolis”.

Urban Planning , 88 (2017).[38] Simini, F. & James, C. Testing heaps’ law for cities using administrative and gridded population datasets. EPJ Data Science , 24 (2019).[39] Makse, H. A., Andrade, J. S., Batty, M., Havlin, S. & Stanley, H. E. Modeling urban growth patternswith correlated percolation. Physical Review E , 7054 (1998).[40] Sz´ekely, G. J. & Rizzo, M. L. Energy statistics: A class of statistics based on distances. Journal ofStatistical Planning and Inference , 1249 – 1272 (2013).[41] Kendall, M. G. A new measure of rank correlation.

Biometrika , 81–93 (1938).[42] Song, X.-P. et al. Global land change from 1982 to 2016.

Nature , 639 (2018).[43] Lambin, E. & Meyfroidt, P. Trends in global land use competition. In C. Seto, K. & Reenberg, A. (eds.)

Rethinking Global Land Use in an Urban Era , chap. 2 (The MIT Press, 2014).[44] Marconcini, M. et al.

Outlining where humans live–the world settlement footprint 2015. arXiv preprintarXiv:1910.12707 (2019). [45] Olofsson, P., Foody, G. M., Stehman, S. V. & Woodcock, C. E. Making better use of accuracy datain land change studies: Estimating accuracy and area and quantifying uncertainty using stratiﬁedestimation. Remote Sensing of Environment , 122–131 (2013).[46] Olofsson, P. et al.

Good practices for estimating area and assessing accuracy of land change.

RemoteSensing of Environment , 42–57 (2014).[47] Cohen, J. A coeﬃcient of agreement for nominal scales.

Educational and psychological measurement ,37–46 (1960).[48] Congalton, R. G., Oderwald, R. G. & Mead, R. A. Assessing landsat classiﬁcation accuracy usingdiscrete multivariate analysis statistical techniques. Photogrammetric engineering and remote sensing , 1671–1678 (1983).[49] Corbane, C. et al. Automated global delineation of human settlements from 40 years of landsat satellitedata archives.

Big Earth Data , 140–169 (2019).[50] Chen, J. et al. Global land cover mapping at 30 m resolution: A pok-based operational approach. { ISPRS } Journal of Photogrammetry and Remote Sensing , 7 – 27 (2015). Global Land CoverMapping and Monitoring.

ACKNOWLEDGEMENTS

This work was partially supported by the Microsoft Azure Research Award. FS is supported bythe EPSRC First Grant EP/P012906/1. Parts of the spatial metrics assessment and applicationsfor this study were funded by the European Space Agency (ESA) under the Urban ThematicExploitation Platform project (TEP Urban,ESRIN/Contract No. 4000113707/15/I-NB). ES thanksEnrico Bertuzzo, Marta Gonzalez, Andrea Rinaldo. MD and ES thanks Nicu Sebe and Bruno Lepri.

AUTHOR CONTRIBUTIONS STATEMENT

All authors conceived the study. M.M. produced the World Settlements Footprint. F.S., M.D.N.and E.S. conducted the analyses and analysed the results. F.S., M.D.N. and E.S. wrote the paper.All authors reviewed the manuscript.

COMPETING INTERESTS

All authors declare no competing interests.0

MATERIALS & CORRESPONDENCE

Correspondence and material requests can be addressed to: [email protected]

SUPPLEMENTARY INFORMATION

FIG. S1. Counter cumulative distribution function (CCDF) of the HS areas for the 16 macro regionsconsidered. FIG. S2. Quantile of the number of HSs in a tile according to the theoretical distribution P ( N | A totHS ). FIG. S3. Counter cumulative distribution function (CCDF) of the HS areas separately for each class ofsettlement patterns: Dispersion (blue), Balanced (green and yellow) and Agglomeration (orange). Notsurprisingly, the tiles in the Agglomeration class contain a higher number of large clusters. The cluster sizedistributions of the tiles in the Dispersion class are not consistently diﬀerent from those in the Balanced class. Appendix A: Robustness tests

We run several tests to verify the sensitivity of the predictive model to: • Alternative metrics for the matches.

The match between real and simulated tiles might besensitive to the choice of the similarity metric. Therefore, in Figure S4, we show the resultswith the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence, and theEnergy function. Table I shows the same result broken down per class. The KL divergence isdeﬁned as: D KL ( P (cid:107) Q ) = − (cid:88) x ∈X P ( x ) log (cid:18) Q ( x ) P ( x ) (cid:19) where P and Q are discrete probability distributions. Note that this metric is non-symmetric.The JS divergence is the symmetric version of the KL divergence and is deﬁned as: D JS ( P (cid:107) Q ) = 12 D KL ( P (cid:107) M ) + 12 D KL ( Q (cid:107) M )where M = ( P + Q ). • The multi-prob parameter model.

We also test for a diﬀerent formalization of the multi-parameter model where the exponent γ is not changed only one time but is instead chosenat random with a speciﬁed probability in each stage of the simulation process. We simulatethe growth in urban area through a two-dimensional N × N lattice whose sites w i,j canbe either occupied ( w i,j = 1) or empty ( w i,j = 0). Without loss of generality, we set theinitial conﬁguration with w N/ ,N/ = 1 and all other pixels are zeros. Then, we simulate anevolution process where in each step, the probability that each empty site will be occupied is: q i,j = C (cid:80) Nk (cid:80) Nz w k,z d − Γ k,z (cid:80) Nk (cid:80) Nz d − Γ k,z where C = 1 /max i,j ( q i,j ) is a normalization constant for each step and d k,z is the Euclideandistance between site w i,j and site w k,z . Γ is chosen based on a number p that is randomlychosen in each step: Γ =  γ , if p < sγ , otherwisewhere s is a chosen probability threshold of the simulation and γ , γ are selected growthparameters of the simulation. In each step, w i,j = 1 iﬀ q i,j > .

5. We stop the growth of4urban areas when N (cid:80) Ni,j w i,j ≥ .

5. Since we choose the Γ parameter in each step, we callthis the multi-prob parameter model, whereas the other one is called multi-parameter model.Table III shows the simulated parameters. Table II shows that the alternative formulationhas comparable results of the presented mode, in terms of the F1-score between classes.Together, these results conﬁrm the robustness of our models and methods. D i s t a n c e Energy distance

Model

Multi parameterSingle parameter 1 2 3 4Class0.00.51.01.52.02.53.0 D i s t a n c e Kullback Leibler divergence

Model

Multi parameterSingle parameter 1 2 3 4Class0.000.020.040.060.080.100.12 D i s t a n c e Jensen–Shannon divergence

Model

Multi parameterSingle parameter

FIG. S4. Distance between each real tile and its best simulation for the multi-parameter and single-parametermodels. Our model consistently achieves shorter distances between simulated and real tiles, even when usingalternative metrics such as the Kullback-Leibler divergence and the Jensen–Shannon divergence.

Method Agglomeration Balanced Dispersion All

Multi parameter 0 . ∗∗ . ∗∗ . ∗∗ . ∗∗ . ∗∗ Multi-prob parameter 0 . ∗∗ . ∗∗ . ∗∗ . ∗∗ . ∗∗ TABLE I. 2-way Kolmogorov-Smirnov test of the empirical distribution of the distances between real tilesand simulated tiles for the various models and the single-parameter model. The two models we proposeachieve better performance than the single-parameter model in all classes. (**) indicates a p -value < . Method Agglomeration Balanced Dispersion All

Single parameter 0.98 0.39 0.09 0.24 0.31

Multi parameter

Appendix B: Additional tables

In Table III we show all the tested parameters (and their combinations) for the Multi-parameterand the Multi-prob parameter models.

Parameters ValuesMulti-parameter model γ { , . , . , , . , . , . , . , , . , . , . , . , , , , , , } γ { , . , . , , . , . , . , . , , . , . , . , . , , , , , , } s { . , . , . , . , . , . , . , . , . , . , . , . , . , . ,. , . , . , . , . } Number of tiles 6859

Multi-prob parameter model γ { , . , . , , . , . , . , . , , . , . , . , . , , , , , , } γ { , . , . , , . , . , . , . , , . , . , . , . , , , , , , } s { . , . , . , . , . , . , . , . , . , . , . , . , . , . , . ,. , . , . , . , . , . } Number of tiles 6804TABLE III. Parameters used to perform the simulations. The simulations are created from the Cartesianproduct of these parameters. In the multi-prob parameter model for s = 0 .

5, we computed only thosecombinations where y < y , as the probability to choose one gamma is 0 . Appendix C: HSs Density

In Table IV we show the exact number of the density of HSs in all the macro-areas.6

Method Lower bounds of the bins

Australia N. Zealand 0 .

753 0 .

980 0 .

992 0 .

995 0 .

998 0 . .

401 0 .

965 0 .

991 0 .

998 0 .

999 1 . .

506 0 .

986 0 .

996 0 .

998 0 .

999 0 . .

733 0 .

964 0 .

986 0 .

996 0 .

999 0 . .

489 0 .

926 0 .

969 0 .

993 0 .

998 0 . .

467 0 .

932 0 .

979 0 .

993 0 .

999 0 . .

147 0 .

765 0 .

919 0 .

977 0 .

998 0 . .

491 0 .

871 0 .

944 0 .

978 0 .

995 0 . .

458 0 .

949 0 .

983 0 .

994 0 .

998 0 . .

521 0 .

934 0 .

967 0 .

989 0 .

999 0 . .

412 0 .

680 0 .

753 0 .

853 0 .

975 0 . .

129 0 .

555 0 .

795 0 .

930 0 .

990 0 . .

101 0 .

579 0 .

860 0 .

968 0 .

998 0 . .

415 0 .

822 0 .

936 0 .

979 0 .

997 0 . .

589 0 .

885 0 .

958 0 .

995 0 .

999 0 . .

104 0 .

405 0 .

611 0 .

858 0 .

992 0 . Appendix D: Global classiﬁcation of the HSs patterns

In Figure S5 we show the classiﬁcation of all the tiles with more than 1% urbanization. In thisﬁgure we compare the real tiles with the simulated ones.7

FIG. S5. Global classiﬁcation of all the tiles with more than 1% urbanization for the real tiles (ﬁrst row) andthe simulated ones with the Single-parameter model (second row) and the Multi-parameter model (thirdrow). As it can be seen, the Single-parameter model overestimates the number of tiles in the Dispersion andAgglomeration classes. The multi-parameter model mitigates this problem and it is very similar to the realclassiﬁcation of the tiles (as shown by the F1-score accuracy over the diﬀerent classes). This means that theMulti-parameter model reliably simulates the global pattern of urbanization. Appendix E: Some examples of simulations

FIG. S6. Diﬀerent simulations of the single-parameter model with 30% of total BUC at diﬀerent γ values.Low values of γ generate dispersed settlement patterns, whereas high values of γ generate compact patterns.It can be seen that in A) the model generates a random noise pattern, as the urban areas are created withoutcaring on the existing urban areas ( γ = 0 . FIG. S7. Diﬀerent simulations of the Multi-parameter model with 30% of total BUC for the same γ and γ but diﬀerent s values. It can be seen that in A) the model starts with a sparse pattern ( γ = 0 .

1) andthen switches to the dense one ( γ = 10 .

0) until 30% of urbanization. The resulting pattern is clustered incircles. Contrarily, in F) a random noise pattern is created, as the urban areas are created without caring onthe existing urban areas ( γ = 0 . γ = 0 . FIG. S8. Some examples of the best simulations for four tiles in the Agglomeration class. The left columnshows the real tile, the central column shows the most similar tile generated with the multi-parameter model,and the right column shows the single-parameter model simulation that is most similar to the real tile. FIG. S9. Some more examples of the best simulations for four tiles in the Agglomeration class. The leftcolumn shows the real tile, the central column shows the most similar tile generated with the multi-parametermodel, and the right column shows the single-parameter model simulation that is most similar to the real tile. FIG. S10. Some examples of the best simulations for four tiles in the Balanced class (yellow group). The leftcolumn shows the real tile, the central column shows the most similar tile generated with the multi-parametermodel, and the right column shows the single-parameter model simulation that is most similar to the real tile. FIG. S11. Some more examples of the best simulations for four tiles in the Balanced class (yellow group).The left column shows the real tile, the central column shows the most similar tile generated with themulti-parameter model, and the right column shows the single-parameter model simulation that is mostsimilar to the real tile. FIG. S12. Some examples of the best simulations for four tiles in the Balanced class (green group). The leftcolumn shows the real tile, the central column shows the most similar tile generated with the multi-parametermodel, and the right column shows the single-parameter model simulation that is most similar to the real tile. FIG. S13. Some more examples of the best simulations for four tiles in the Balanced class (green group).The left column shows the real tile, the central column shows the most similar tile generated with themulti-parameter model, and the right column shows the single-parameter model simulation that is mostsimilar to the real tile. FIG. S14. Some examples of the best simulations for four tiles in the Dispersion class. The left column showsthe real tile, the central column shows the most similar tile generated with the multi-parameter model, andthe right column shows the single-parameter model simulation that is most similar to the real tile. FIG. S15. Some more examples of the best simulations for four tiles in the Dispersion class. The left columnshows the real tile, the central column shows the most similar tile generated with the multi-parameter model,and the right column shows the single-parameter model simulation that is most similar to the real tile. Appendix F: Technical validation

As suggested by state-of-the-art good practices for assessing land-cover map accuracy [45, 46],we validated the data through 50 stratiﬁed random sampled 1 × × ,

000 + 1 , × ×

50 = 900 ,

000 cells labelled by the crowd. To our knowledge, thisoutnumbers any other similar exercise presented so far in the literature. Additional details areexplained in [44].To ﬁnally assess the accuracy of the WSF2015, we considered a series of measures commonlyemployed in the remote sensing community39, namely: • Kappa coeﬃcient [47, 48], which jointly takes into account omission (i.e., underestimation)and commission (i.e., overestimation) errors, as well as the possibility of chance agreementbetween classiﬁcation and reference maps. Kappa assumes values between -1 and 1 and acommon rule-of-thumb for its interpretation is the following45: ¡0 no agreement; 0 - 0.20slight; 0.21 - 0.40 fair; 0.41 - 0.60 moderate; 0.61 - 0.80 substantial; 0.81 - 1.0 perfect; • Percent producer’s accuracy PAS% and PANS% of the settlement and non-settlementclass, respectively. Speciﬁcally, they denote the portion of assessment units (i.e., cells or blocks)categorized as settlement/non-settlement according to the collected reference informationwhich are correctly categorized as settlement/non-settlement in the classiﬁcation map. Itscomplementary measure (100 – PA%) corresponds to the percent omission error; • Percent user’s accuracy UAS% and UANS% of the settlement and non-settlementclass, respectively. Speciﬁcally, they denote the proportion of all assessment units (i.e.,cells or blocks) categorized as settlement/non-settlement in the classiﬁcation map which arecategorized as settlement/non-settlement also according to the collected reference information.Its complementary measure (100 – UA%) corresponds to the percent commission error; • Percent average accuracy AA% that is obtained as the mean between PAS% and PANS%and represent a balanced measure of correct settlement and non-settlement detection.9

1. Validation results

Figure S16 and Figure S17 report the accuracy over the 900,000 collected reference samplescomputed for the WSF2015 and, concurrently, the GUF, GHSL and GLC30 layers for comparison.In particular, results are given for all combinations (overall 12) of three considered settlementdeﬁnitions and four assessment criteria. Due to the diﬀerent spatial resolution of the GUF (12m)and both the GHSL and GLC30 (30m), while assessing their quality, each 10x10m cell of theconsidered block spatial assessment unit is tagged as settlement only if the intersection with thespeciﬁc layer is positive. Noticeably, in all experiments the WSF2015 exhibited the best AA%,with a remarkable average of 86.37 and a mean increase with respect to GUF, GHSL and GLC30of +6.24, +15.28 and +18.58, respectively. Alongside, it resulted in an average Kappa of 0.6885with a mean increase of +0.0754 with respect to the GUF and, especially, +0.2338 and +0.2975with respect to GHSL and GLC30, respectively. By analyzing the numbers into detail, one cannotice a noteworthy increase of the WSF2015 Kappa coeﬃcient for assessment criteria 3 and 4(0.7646 on average) with respect to criteria 1 and 2 (0.6123 on average). This is due to the fact that30m resolution Landsat imagery has been employed to generate the product. Hence, even if just aportion of the Landsat pixel on the ground intersects any building, building lot or paved surface,this mostly has a considerable eﬀect in the corresponding spectral signature and the pixel tendsto be ﬁnally categorized as settlement. This is taken into account by assessment criteria 3 and 4,since the entire 30x30m reference block spatial assessment unit is labelled as settlement even if itcontains just one cell marked as settlement. Assessment criteria 1 and 2 should be then consideredmore suitable for a fair comparison against the GUF given its 12m spatial resolution. In this case,one can appreciate how the AA% and Kappa reported for the WSF2015 are in line with thoseexhibited by the GUF, which has been generated from highly expensive 3m resolution commercialTerraSAR-X/TanDEM-X imagery. Instead, assessment criteria 3 and 4 allow a fair comparisonagainst GHSL and GLC30 as they are both derived from Landsat data. Here, the WSF2015 exhibitsnotable AA% and Kappa up to 89.33 and 0.7822, respectively, outperforming both GHSL andGLC30 (with an increase always higher than 17 and 0.32, respectively). From Figure 4, one canalso notice that on average results do not signiﬁcantly vary across the three considered deﬁnitionsof settlement; however, a proper analysis allows to better understand which one ﬁts best with thediﬀerent layers. Concerning the WSF2015, the highest accuracy mostly occur when considering assettlement the combination of buildings and building lots. Only for assessment criteria 1 and 2Kappa is higher when also roads / paved surfaces are included. Indeed, despite generally associated0with very low S1 backscattering values, most of these are not masked out given their ﬁne scalewithin urban areas. As regards the GUF, highest AA% and Kappa occur partly when only buildingsand partly when buildings and building lots are considered as settlement. This is in line withthe theory, since the layer has been generated from radar imagery which is sensitive to verticalstructures (these comprise both buildings, as well as main elements delimiting building lots likewalls, fences, hedges, etc.). In the case of GHSL and GLC30, the two layers show a similar behaviorand provide on average a slightly higher Kappa when settlements are deﬁned as combination ofbuildings and building lots. Giving a closer look to producer’s and user’s accuracies it is possible tobetter understand the nature of the diﬀerent performances. All GUF, GHSL and GLC30 generallyshow very high PANS% (i.e., greater than 85), but mostly exhibit consistently lower PAS%, withvalues never greater than 75.80, 52.39 and 44.26, respectively. On the contrary, the WSF2015scores overall remarkably high PAS% and PANS% (on average 88.71 and 84.04, respectively) and,concurrently, it always shows the best UANS% in front of a UANS% only marginally lower thanthat of the other layers (on average 92.15 and 75.95, respectively). This quantitatively assesses thecapability of the WSF2015 to eﬀectively detect the presence of a considerable number of settlementsactually unseen in the other global products. This occurs at the price of a minor settlement extentoverestimation mostly due to the employment of the COV texture features; speciﬁcally, these allowa more accurate detection in rural and suburban areas, but sometimes result in an overestimationof 1-2 pixels around the actual settlement.The improved detection performances of the WSF2015 can be qualitatively appreciated inFigure S18, where a cross-comparison against GUF, GHSL and GLC30 is reported for threerepresentative regions including the Igboland (i.e., a cultural and common linguistic region locatedin south-eastern Nigeria), Kampala (i.e., the capital and largest city of Uganda) and Bangalore(i.e., the capital of the Indian state of Karnataka). Despite the rather diﬀerent settlement patterns,all three sites are characterized by the presence of medium and large size cities surrounded by anumber of very small settlements. As one can notice, the WSF2015 proves extremely eﬀective in allthree cases, outperforming all other layers; speciﬁcally, it is capable of detecting a higher amount ofsmall villages and better outlining the fringes of major urban areas. The GUF performs equallygood only in the Igboland region, but detects considerably less settlements in the Bangalore and,especially, the Kampala case studies. Both GHSL and GL30 exhibit severe underestimation in allthree test sites.1