Creating a surrogate commuter network from Australian Bureau of Statistics census data
CCreating a surrogate commuter network fromAustralian Bureau of Statistics census data
Kristopher M. Fair* , Cameron Zachreson , Mikhail Prokopenko March 21, 2019
1. Complex Systems Research Group, School of Civil Engineering, Faculty of Engineeringand IT, The University of Sydney, Sydney, NSW 2006, Australia.2. Marie Bashir Institute for Infectious Diseases and Biosecurity, The University ofSydney, Westmead, NSW 2145, Australia.*corresponding author: Kristopher M. Fair ([email protected])
Abstract
Between the 2011 and 2016 national censuses, the Australian Bureau of Statisticschanged its anonymity policy compliance system for the distribution of census data. Thenew method has resulted in dramatic inconsistencies when comparing low-resolutiondata to aggregated high-resolution data. Hence, aggregated totals do not match truetotals, and the mismatch gets worse as the data resolution gets finer. Here, we addressseveral aspects of this inconsistency with respect to the 2016 usual-residence to place-of-work travel data. We introduce a re-sampling system that rectifies many of theartifacts introduced by the new ABS protocol, ensuring a higher level of consistencyacross partition sizes. We offer a surrogate high-resolution 2016 commuter dataset thatreduces the difference between aggregated and true commuter totals from ∼ toonly ∼ , which is on the order of the discrepancy across partition resolutions in datafrom earlier years. Background & Summary
High-resolution commuter network information, as well as general information describingpopulation distributions [1], is a major factor in computational modelling of diffusion phe-nomena in various contexts: demographic [2], epidemiological [3, 4, 5, 7], economic [8], eco-logical [9] and so on. However, privacy constraints on released Census data, in the presence of a r X i v : . [ c s . D B ] M a r ntricate dependencies between population and employment distributions in relatively small,highly urbanized, but spatially spread countries, such as Australia, coupled with changes indata protocols across census years, present specific challenges in reconstructing commuter(travel-to-work) networks with sufficiently high fidelity [10, 11, 12, 13, 14].These challenges manifest in two ways. The first of these pertains to individual micro-data, which is organised by household to capture information about both the individual andhousing unit. While the collective microdata is a powerful resource, variations in questionsasked, possible responses, and record structure often present difficulties in comparing resultsacross years [15]. The second challenge relates to the specific methods used by the agenciesthat gather and report census data, in protecting the anonymity of individuals. While it isnecessary for these methods to introduce perturbations, the details of how such perturbationsare applied can result in unintended consequences when high-resolution data is aggregated.This is because biases introduced by the perturbation protocol are magnified by aggregation.In the recent Australian census datasets [16], these challenges manifest themselves as lossof accuracy in very finely partitioned data, where individual population counts can be on theorder 1 to 10 individuals. An important example of such a data set is the commuter network,describing the normal work travel behaviour of the population. The loss of accuracy in suchdata is primarily due to the specific noise-inducing protocols that the Australian Bureau ofStatistics (ABS) employs to ensure the anonymity of census participants. At the same time,this loss in accuracy severely diminishes the usefulness of the commuter networks in mod-elling contagion phenomena, such as epidemics. In such models, work mobility is a primarydriver of contagious diffusion. As such, the accuracy of the commuter network is crucial forrealistic outputs regarding aggregate demographic and epidemiological characteristics, suchas community and national attack rates. Furthermore, without trustworthy inputs, suchmodels cannot accurately identify salient routes of contagion spread, or analyze mitigationstrategies based on network theory.Similar challenges from noise-inducing protocols, which may also differ across censusears, occur in other scenarios in which there is a need to estimate demographic and phe-nomenological dynamics. This is relevant not only to network-centric studies, but also tomore general agent-based simulations, or any study aimed at fine-grained reconstruction ofspatio-temporal dynamics [17]. Thus, the goal of the present work is not only to recon-struct specific commuter networks of Australia between 2011 and 2016, but also to presenta method of microdata reconstruction. The method aims to correct discrepancies that mayarise due to Census noise protocols, improving consistency across partition scales while pre-serving anonymity. The secondary aim is to increase interoperability of Census datasets, inline with the Integrated Public Use Microdata Series (IPUMS) approach [15].To further these ambitions we first formalize the network structure and identify dis-crepancies between different scales of spatial partitioning. We then describe the technicaldetails for constructing our re-sampled network using additional datasets. Finally, we showseveral comparisons between the ABS provided and re-sampled data that demonstrate thedistinction and validity of the resulting dataset.The ABS provides access to most census data through the on-line system Census Table-Builder, free of charge, for the 2006 census onward. A subset of the available data isthe accumulated microdata of usual-residence (UR) to place-of-work (POW) which con-stitutes the commuter mobility network (we will refer to this as the TTW, or, travel-to-workdataset). Each census has undergone some re-partitioning of residential and work areaswith the latest hierarchical structure divided into four levels of statistical areas for UR(UR = [ SA1 , SA2 , SA3 , SA4 ] ), and POW (POW = [ DZN , SA2 , SA3 , SA4 ] ), respec-tively. This system is defined by the Australian Statistical Geography Standard [18]. Thesmallest of these residential partitions, SA1, is designed to house a population of about 200to 800 people. Maps of SA2, SA1, and DZN partitions for the Greater Sydney region aredisplayed in Figure 1. SA1 and DZN partitions accumulate to exact partitions on the SA2scale, this is displayed for SA1 partitions in Fig. 1a, and for DZN partitions in Fig. 1b.Note that the uneven distribution of employment centres in Australia’s cities produces aorresponding non-uniformity in DZN partition density, as displayed in Fig. 1b. a) b) SA1
DZN
Figure 1: Maps of the Greater Sydney region illustrating the distribution of populationpartitions. (a) A map of the Greater Sydney region showing SA2 (black) and SA1 (red)population partitions. (b) A map of the same area showing SA2 (black) and DZN (red)partitions. The inset in (b) zooms in on the Sydney central business district to illustrate themuch denser packing of DZN partitions in that area.This partitioned commuter data translates to a bipartite network G [ UR → POW ] = ( V G , E G ) where V G is a set of vertices (nodes) of two types V G = X ∪ Y , where X = { x , x , ..., x n } represent the n partitioned UR locations, and Y = { y , y , ..., y k } represent the k partitionedPOW locations. The set of edges E G = { ( x i , y j ) , ( x i , y j ) , ... ( x i | EG | , y j | EG | ) } , (1)defines the unique connections between these vertices. For example, UR x i and POW y j may be connected by an edge e ij = ( x i , y j ) . Each subset of edges has a corresponding set ofweights, defined by the function: w ij ( { e ij } , G ) , (2)which gives a set of commuter numbers indexed to the corresponding location pairs in { e ij } ,over the network G . The use of the argument G is necessary, as the same location pairs mayhave different numbers of commuters in different networks. For brevity, we will omit thesubscripts i and j in cases where they are not required for specificity. We will use similarotation to refer to sets of UR and POW locations associated with edges as x ( { e } ) and y ( { e } ) , as well [Note: the second argument is not necessary here, as the required informationis contained in the set { e } , and does not vary between networks with the same sets of nodes].As mentioned above, these data sets are subject to a perturbation protocol to preventcross referencing different variables that may allow the identification of specific individu-als [19] even with the application of safeguards [20, 21]. Not doing so would violate theAustralian Census and Statistics Act 1905 to preserve the anonymity of individuals. Thisperturbation process is outlined in ABS publications [22, 23, 24].The sizes of UR and POW population partitions affect the magnitudes of the populationsmoving between them. Relative to these magnitudes, different levels of noise are required topreserve the anonymity of individuals. For small commuting populations, the perturbationmagnitudes will be on the order of the unperturbed values. Furthermore, for the 2016 census,the ABS changed their perturbation protocol by removing a step designed to conserve thetotal population across different spatial partitions, a property they refer to as ‘additivity’.Some major practical consequences of removing the additivity-ensuring step are observablediscrepancies in the total number of commuters, N G = (cid:80) w ( E G , G ) , accounted for by thenetwork G on different partition scales.Edge weight distributions, and cumulative population distributions as a function of edgeweight for the SA2 → SA2 and SA1 → DZN commuter networks of 2011 and 2016 are displayedin Figure 2. Lower-resolution TTW networks such as those representing connections on theSA2 scale display relatively consistent weight distributions between censuses. Comparisonacross years shows moderate increases in the numbers of edges across the weight range ascould be expected for an increasing employed population between 2011 and 2016 (Fig. 2a and2b). The corresponding distribution of this increased population across the edge weight rangeis illustrated in Fig. 2c, which does not show any alarming trends or obvious artifacts in thedata. Unfortunately, this consistency does not hold for the fine-grained SA1 → DZN network.The weight distributions for this network shown in Fig. 2d and 2e indicate a counter-intuitiverop in the numbers of small edges between 2011 and 2016, which corresponds to a dramaticdecrease in the total commuting population accounted for by the network. The distributionof the commuting population across the edge weight range (Fig. 2f) confirms that majordiscrepancies exist between partition schemes, likely due to a significant drop in the numberof small edges included in the network. nu m be r o f edge s | E G | edge weight w nu m be r o f edge s | E G | edge weight w c u m u l a t i v e popu l a t i on edge weight w nu m be r o f edge s | E G | edge weight w nu m be r o f edge s | E G | edge weight w c u m u l a t i v epopu l a t i on edge weight w a) b) c)d) e) f) Figure 2: Weight distributions and cumulative population distributions for TTW net-works from different census years and partition schemes. (a) Distributions of edge weights( w < ) for the SA2 → SA2 networks for 2011 and 2016, plotted on a linear scale. (b)Distributions of all edge weights for the SA2 → SA2 network from 2011 and 2016 plotted ona log scale. (c) Cumulative population distributions for the SA2 → SA2 network from 2011and 2016. (d) Distributions of edge weights ( w < ) for the SA1 → DZN networks for 2011and 2016, plotted on a linear scale. (e) Distributions of all edge weights for the SA1 → DZNnetwork from 2011 and 2016 plotted on a log scale. (f) Cumulative population distributionsfor the SA1 → DZN network from 2011 and 2016. The distributions in (a - c) have bin widthof 10, while (c - d) have bin width 1, with a minimum value of 3, artificially introduced bythe ABS protocol. The plots in (a) and (d) show only a subset of the weight range, zoomingin on the low end of the distribution where the largest discrepancies exist between years.As the partitions that comprise the vertices V G are increasingly subdivided, the weightsof the edges connecting them get smaller. The new perturbation protocol appears to dra-atically reduce the number of small edges included in the network, particularly around theminimum value of w = 3 . This adversely effects the network both quantitatively, by lower-ing the commuter populations throughout the network, and structurally, by removing edgesfrom E G , altering the binary structure of the network. In the case of the high-resolutionSA1 → DZN network, small edges are a crucial aspect of the network structure, and carry alarge portion of the total edge strength.The need for a method to ensure consistency in commuter numbers across partitionscales is further exemplified in Figure 3a, which plots the total working population ( N G )in networks built by distributing commuters from SA1 partitions into each of the possiblePOW partition schemes. As the sizes of the POW partitions decrease from the entire nationdown to individual destination zones, the total number of commuters drops by whilethe total number of edges increases by four orders of magnitude. w e i gh t d i ff e r en c e ∆ w ij
10 100 1000edge weight w b) t o t a l c o mm u t e r s N G number of edges |E G | a) [SA1-SA2][SA1-AUS][SA1-State][SA1-SA4] [SA1-SA3][SA1-DZN] no . o f m i ss i ng edge s w c) Figure 3: Discrepancies in total population and commuter distribution related to partitionaggregation behavior. (a) The total number of commuters N G in ABS data for networksof varied size. Each point corresponds to a network between SA1 partitions and a differentscale of POW partition (national, state, SA4, SA3, SA2, DZN). (b) The discrepancy be-tween commuter numbers, ∆ w ij , on each edge w ( E AB , A ) and w ( E AB , B ) plotted against w ( E AB , B ) . (c) The frequency distribution as a function of edge weight for edges present inthe ABS-provided SA2 → SA2 network ( B ) but not the aggregated SA1 → DZN network ( A ).The structural inconsistency across partition scales that this problem introduces can beunderstood by amalgamating the vertices of network G [ SA1 → DZN ] into corresponding SA2partitions. By doing so, we create network A [ SA2 → SA2 ] = ( V A , E A ) , that can be comparedo the network constructed from ABS data on the SA2 scale [which we will label network B [ SA2 → SA2 ] = ( V B , E B ) ]. Network B is missing only of the total commuter populationbecause the edges are composed of more commuters and therefor receive relatively less per-turbation from the ABS protocol. This smaller discrepancy is comparable with that ofprevious years for which the additivity-ensuring step was still included.Figure 3b illustrates the discrepancies between edge weights (commuter numbers betweena given pair of locations) for edges appearing in both networks A and B . To compute thesediscrepancies, we define the set of edges appearing in both E A and E B as the intersection E AB = E B ∩ E A , the weights of these edges for networks B and A , respectively, as w B = w ( E AB , B ) , and w A = w ( E AB , A ) , and the discrepancies ∆ w between the weights of edgesexisting in both sets ∆ w ij = [ w ij ∈ w B ] − [ w ij ∈ w A ] . (3)Using this notation, Fig. 3b plots ∆ w ij as a function of w ∈ w B , and demonstratesthat the perturbations to small edges in the SA1 → DZN network produce large negativediscrepancies in edge weight when the data is aggregated to the SA2 → SA2 scale.To understand this result in more detail, it is helpful to note that the spatial distributionof the working population is very heterogeneous, with an exponentially larger fraction ofthe working population employed within central business districts of major cities. However,only the DZN partitions are designed to accommodate this heterogeneity, as they are delin-eated based on employee population (number of people who work in a region), rather thanresidential population. On the other hand, SA2 partitions are designed based on residentialpopulation which results in a few SA2 business hubs containing many DZN partitions (seeFigure 1b). In some cases, this results in over component SA1 → DZN edges amalgamat-ing to single, larger SA2 → SA2 edges.It is clear that many SA1 → DZN edges are being removed entirely (their weight set tozero) because there are 97,881 edges appearing in the as-provided SA2 → SA2 network B that do not appear after aggregating the SA1 → DZN edges to produce network A . Thisives | E A | ≈ . | E B | for the SA2-level networks. The frequency distribution for the weightsof missing edges, w ( { E B \ E A } , B ) (where the symbol \ denotes the set complement), isshown in Figure 3c which indicates an exponential decrease in removal frequency as a func-tion of edge weight. The data in Figure 2 and Figure 3 indicate conclusively that manysmall perturbations on the SA1 → DZN scale accumulate, producing the large discrepanciesobserved when they are aggregated.In this work, we develop and apply a method to restore lost network structure andimprove quantitative consistency across commuter networks on different partition scales.The result is a surrogate network S [ SA1 → DZN ] = ( V S , E S ) , on the resolution of SA1 to DZN.This reconstructed commuter network will serve as a platform for ongoing research effortsthat utilize Australian travel networks, such as agent-based epidemiological modeling [5, 6]. Methods
Our method is essentially a re-sampling process that we use to introduce new edges into theSA1 → DZN network to improve quantitative consistency upon aggregation to the SA2 scale.The method does not introduce any new edges to the SA2 → SA2 network upon aggregation,and therefore cannot correct for the missing edges distributed as shown in Fig. 3c. However,most of the missing commuters are accounted for by correcting the discrepancies shown inFigure 3b, and our method emphasizes this aspect of the problem.Before commencing our procedure, all data provided by the ABS was pre-processed toremove edges that link to non-geographic regions such as "Migratory/ offshore/ shipping"and "No usual address". For the 2016 SA1 → DZN network this accounts for 53,135 edgesand 469,854 commuters.In addition to the original, perturbed SA1 → DZN network, the method requires thefollowing sets and quantities that we obtained from independent ABS databases: • N X = { N x , N x , ...N x n } and N Y = { N y , N y , ...N y n } , the set of local worker popula-tions for SA1 and DZN partitions, respectively. The SA2 → SA2 commuter numbers from the ABS-provided SA2 → SA2 network ( B ). • The set of (unweighted) SA2 → DZN edges found by creating a mixed-partition network. • P ( w | N x ) , the normalized distribution of edge weights w given residential population N x .The last item refers to the relationship between local distributions of edge weights andpopulation of the associated SA1, as calculated from 2011 census data obtained without theupdated privacy policy compliance protocol.Our method can be summarized as a two-step process:1. Produce a set of q candidate out-edges M = { m , m , ...m q } = { ( x i , w ) , ( x i , w ) ... ( x i q , w q ) } , specifying SA1 ( x ) and num-ber of commuters ( w ). This set accounts for the missing workers from each SA1 whilemaintaining a realistic dependence of weight distribution on UR population P ( w | N x ) .2. Build network S : add the candidate edges in M into the SA1 → DZN network by spec-ifying DZN ( y ) without violating the topology of the SA2 → DZN network, exceedingthe population of the DZN, adding edges that are not present in the SA2 → SA2 net-work, or exceeding known commuter populations between locations in the SA2 → SA2network.In addition to networks A , B , and S defined above, we will refer to several distinctnetwork sets that are important for the explicit description of our process. For clarity, wewill summarize these here and give a brief description of their role in our method:Network R is the ABS-provided SA1 → DZN network (referred to above as G [ SA → DZN ] ),which was released by the ABS subject to the perturbations this work is intended to correct.Network A is the SA2 → SA2 network aggregated from R . Network B is the ABS-providedSA2 → SA2 network that exhibits relatively consistent aggregation behavior (that is, thetotal number of commuters it accounts for is roughly of the known total). We useetwork B as a quantitative ground-truth while generating the surrogate network. Network H is the ABS-provided SA1 → DZN network from the 2011 census, which exhibits acceptableaggregation behavior. We use network H to build up the set of probability distributionsdescribing P ( w | N x ) . A key assumption of our method is that this relationship betweenlocal population and out-edge weight distribution is relatively invariant across census years.Network Γ is the ABS-provided SA2 → DZN network which we use as a topological constraintwhile assigning the candidate edges from each residential zone to appropriate destinationzones. That is, we only incorporate SA1 → DZN edges into S that have a correspondingSA2 → DZN pair existing in Γ . Finally, network S is the surrogate SA1 → DZN network thatis the final output of our method and network C is the SA2 → SA2 network aggregated fromnetwork S . We compare networks B and C when evaluating the aggregation behaviour of S . Some quantitative features of these networks are summarized in Table 1.Table 1: Commuter networks and selected characteristics.Network Partition(UR → POW) | E | (cid:80) w Source R = ( V R , E R ) SA1 → DZN 1,184,946 7,023,571 ABS 2016 A = ( V A , E A ) SA2 → SA2 118,167 7,023,571 Accumulated from RB = ( V B , E B ) SA2 → SA2 212,805 10,073,246 ABS 2016
Γ = ( V Γ , E Γ ) SA2 → DZN 515,250 9,853,543 ABS 2016 H = ( V H , E H ) SA1 → DZN 2,046,094 10,058,331 ABS 2011 S = ( V S , E S ) SA1 → DZN 1,731,938 9,336,333 Constructed C = ( V C , E C ) SA2 → SA2 118,167 9,336,333 Accumulated from S The following two sections describe our method in detail. The first describes the processof generating the list of ( SA1 , w ) pairs which we refer to as “candidate edges”. The seconddescribes the process of assigning these candidates edges to DZN partitions subject to ourselected constraints. A1 candidate edges
We observed the behavior of P ( w ) as a function of N x to be similar across 2006 and 2011censuses. This dependence appears to reflect a consistent feature of the commuter mobilitynetwork. Although the underlying mechanism producing this set of conditional distributionsis not in the scope of this report, it is a subtle aspect of the network structure that should betaken into account. Network H [ SA1 → DZN ] = ( V H , E H ) , derived directly from the 2011 ABScensus, along with the 2011 worker populations, gives the distribution of commuter edgeweights as a function of the local SA1 population P ( w | N x ) (shown in Figure 4). Whilethe method we used to generate these distributions is case-specific, a similar process couldbe applied in any situation where there is some confidence in the separation of time-scalesbetween real network evolution and artifact introduction due to institutional data processingprotocols. Indeed, a more general approach to this aspect of the problem may be needed incases where true network dynamics are more difficult to distinguish from artifacts. This isan ongoing question that we will continue to address in future work. One promising futuredirection is to derive a maximum entropy distribution for the weights of the edges leaving eachlocation, constrained by the known numbers of commuters and the worker populations inthe destination zones allowed by the topology and SA2 → DZN edge weights of network Γ . Ingeneral, the maximum entropy principle determines the least biased probability distributions,consistent with specific constraints on the average values of measurable quantities [25]. Otherapproaches are possible as well, for example, Shannon information could be computed forfragments of the network that exhibit acceptable aggregation behavior, and local weightdistributions defined so that sampling from them explicitly addresses information loss inparts of the network adversely affected by the removal of data from the original travel-to-workmatrix. Techniques for doing so could be adapted from existing methods where networksare iteratively grown from fragments based on node assortativity constraints, leveragingthe relationships between node assortativity and mutual information of the target network[26, 27].igure 4: Edge weight frequency distributions as functions of local population. (a) Colorplot showing P ( w ) ( y axis) as a function of N x ( x axis) for the 2011 SA1 → DZN commuternetwork. (b) The frequency distribution of edges as a function of SA1 → DZN commuternetwork edge weight, where each curve represents the weight frequency distribution for aspecific range of SA1 populations.Once these conditional distributions are established, we sample from them to accountfor the number of missing commuters from each SA1. The number of missing commutersassociated with a given SA1 partition x ∗ is computed as the discrepancy between the knownworking population ( N x i ) and the sum (cid:80) kj =1 w ( { ( x ∗ , y j ) } , R ) , which is the total out-weightassociated with the partition x ∗ . The set of these accumulated populations gives N X R : N X R = (cid:40) k (cid:88) j =1 w ( { ( x , y j ) } , R ) , ... k (cid:88) j =1 w ( { ( x n , y j ) } , R ) (cid:41) = { N Rx , N Rx , ... N Rx n } , (4)which allows us to calculate the discrepancy in local worker population for each SA1: ∆ N X = { [ N x − N Rx ] , [ N x − N Rx ] , ... [ N x n − N Rx n ] } = { ∆ N x , ∆ N x , ... ∆ N x n } . (5)The algorithm then generates M as follows: for each SA1 partition x i , individual weights w (cid:48) are iteratively sampled from P ( w | N x i ) to produce candidate edges m (cid:48) = ( x i , w (cid:48) ) whichre included in M under the condition that ∆ N x i > w (cid:48) + (cid:88) m j ∈ M w j × δ ( x i j , x i ) , (6)where δ ( x i j , x i ) is equal to 1 if x i j = x i and equal to 0 otherwise. If the condition above isnot met the candidate edge m (cid:48) is rejected. The sampling process is repeated until the dis-crepancies ∆ N X n are all less than three, the smallest edge size. That is, candidate edges aregenerated to precisely account for the number of workers missing from each SA1. Quantita-tive features for an instance of the candidate edge set M , and the local populations used toconstrain its construction ( N X ) and assignment ( N Y ) are shown in Table 2. The algorithmicprocess for creating the set of candidate edges is outlined by the pseudocode in Box 1. Thefollowing section describes the process of assigning candidate edges to destination zones.Table 2: independent data sets and selected characteristics.Set Contents Set size Total population Source M = { m , m , ...m q } = { ( x i , w ) , ( x i , w ) , ... ( x i q , w q ) } SA1candidateedges 683,239 2,572,117 Constructed N X = { N x , N x , ...N x n } SA1employedresidents 57,523 10,113,273 ABS 2016 N Y = { N y , N y , ...N y k } DZNemployees 9,151 10,677,111 ABS 2016
Assigning edges
Once the set of candidate edges is generated, each specifying an edge weight and SA1 originvertex, all that remains is to assign them DZN vertices. Then, the new edges can be includedin network R to create the surrogate network S . The procedure we used for these assignmentsis described in this section and outlined in Box 2.We assign candidate edges from M to reasonable DZN partitions by employing Γ [ SA2 → DZN ] , B [ SA2 → SA2 ] , E AB , and N Y to conditionally restrict the connections that can be added in or-er to maintain the lower-resolution topology and worker populations at destination zones.The networks Γ and E AB are used as binary topological constraints, restricting the possibleset of {SA2, DZN} and {SA2, SA2} location pairs that are compatible with the topologyof the new network E S . We use Γ as a topological constraint because it represents a goodcompromise between resolution and quantitative consistency. Because of the larger parti-tioning of the residential zones X Γ , the network loses approximately % of total commutersdue to ABS perturbations, which is much better aggregation behavior than we observe onthe SA1 → DZN scale, but worse than the SA2-level network on these terms. On the otherhand, it explicitly accounts for connectivity between SA2 residential partitions and DZNs,making it a stronger constraint than the SA2 → SA2 network. We use the overlapping edgeset E AB as a topological constraint because it restricts our procedure to those parts of thenetwork in which we have the most confidence. We take this conservative approach in orderto avoid introducing edges to the network that could artificially increase connectivity acrossdisparate regions. The local worker populations at each DZN ( N Y ) are used as quantitativeconstraints, ensuring that local populations are not exceeded due to the addition of newedges. Similarly, w ( E AB , B ) , the number of commuters between SA2(UR) and SA2(POW)in the portions of network B that overlap with A , constrains the number of commuters thatcan be added to particular edges in S .To select SA1 vertices for the candidate edges M , we iterate through the DZN partitionsand perform the following procedure:For each DZN destination vertex y i we use Γ and E AB to determine the set of possibleSA1 origin vertices. These define the subset M (cid:48) ⊆ M compatible with both the SA2 → DZNand SA2 → SA2 topologies. We then sample M (cid:48) uniformly at random, combining the samplewith the current destination zone y i to produce a new edge. The new edge is added to thesurrogate network under the condition that doing so does not exceed the known number ofcommuters between SA2 partitions when the surrogate network is aggregated.o be precise, Γ , E AB , and y i define the subset of SA2 → DZN edges E (cid:48) Γ = { e ∈ E Γ | y ( { e } ) = y i , ( x ( { e } ) , Υ y i ) ∈ E AB } , (7)where Υ y i is the SA2 partition containing the DZN y i . In words, E (cid:48) Γ is the set of SA2 → DZNedges that point to the destination zone y i and are consistent with the SA2 → SA2 topology E AB . These define the SA2 partitions Φ i = x ( E (cid:48) Γ ) and the subset of SA1 partitions con-tained by them which we will call X Φ i . From these, the subset of candidate edges is simplydetermined by selecting only those that contain an element of X Φ i as origin vertex: M (cid:48) = { m j ∈ M | x i j ∈ X Φ i } . (8)Once M (cid:48) is defined, we randomly select a candidate m ∗ ∈ M (cid:48) = ( x ∗ , w ∗ ) with uniform prob-ability, producing a potential new edge e ∗ = ( x ∗ , y i ) with weight w ( e ∗ ) = w ∗ . The newSA1 → DZN edge e ∗ aggregates into the SA2 → SA2 edge e B = { e ∈ E B | X x ⊇ x ( { e ∗ } ) , Y y ⊇ y i } = ( x B , y B ) , (9)where X x and Y y are the sets of SA1 and DZN zones contained (respectively) by the SA2(UR)and SA2(POW) partitions in each element of E B .To check whether or not the new edge e ∗ should be added to the surrogate network, weaggregate E S over the SA1 and DZN vertices contained by the SA2 partitions x B and y B ,and determine whether adding the new edge will exceed the known number of commutersbetween SA2 zones. That is, the edge e ∗ is added to E S under the condition that w ( { e B } ) ≥ w ( { e ∗ } ) + (cid:88) e ij ∈ E S w ( { e ij } , S ) × δ ( e ij , X x B , Y y B ) , (10)where X x B and Y y B are the sets of SA1 and DZN partitions contained by the SA2(UR) andSA2(POW) zones specified by x B and y B , respectively, and δ ( e ij , X x B , Y y B ) = , if x i ∈ X x B AND y j ∈ Y y B , otherwise (11)o summarize, the algorithm allows addition of e ∗ to E S if aggregation of E S to largerpartitions only produces edges that already exist in E Γ and E AB , these topological constraintsare illustrated in figure 5. Aggregated edge weights are constrained as well, so that addition of w ( { e ∗ } ) does exceed the value given by w ( { e B } , B ) upon aggregation of E S to the SA2 → SA2scale. After successful assignment of edge e ∗ into E S , the candidate edge m ∗ is removed from M and the process is repeated until edges meeting this condition cannot be found. SA2(UR)SA1 SA2(POW)DZN
Figure 5: Schematic of topological constraints applied when adding new edges to the surro-gate network. The black lines represent the known SA2 → SA2 and SA2 → DZN connectionsgiven by networks B and Γ . The green lines are allowed surrogate SA1 → DZN edges, asthey are consistent with the known larger-scale topology. The red lines represent edges thatare not allowed, as their inclusion would violate our constraints after aggregation of thesurrogate to larger partition schemes.In principle, the above criterion is sufficient to ensure self-consistency across differently-partitioned data sets, however, the criteria must still account for the effect of the privacypolicy compliance perturbations. To account for possible mismatch between employee num-bers, we added the additional criterion that the number of workers assigned to destination y i must not exceed local worker population N y i ∈ N Y . Therefore, the condition N y i ≥ w ( { e ∗ } ) + (cid:88) e ij ∈ E S w ( { e ij } , S ) × δ ( y ( { e ij } ) , y i ) , (12)must be met, or the edge is not added to E S . Here, δ ( y ( { e ij } ) , y i ) is equal to 1 if y ( { e ij } ) = y i ,and equals 0 otherwise. ox 1 : Candidate edge set algorithm. Pseudocode for the algorithm that produces a list ofcandidate edges from each SA1 that match the local commuter populations and dependenceof edge weight distribution on worker population as determined by the 2011 census. procedure Generate candidate edges input: N X R , the number of SA1 employees aggregated from RN X , the number of SA1 employees reported by ABS P ( w | N x ) , the 2011 edge weight distribution conditional on local population for x i in X R : N Rx i = (cid:80) j =1 m w ( { ( w i , y j ) } , R )∆ N x i = N x i − N Rx i , the number of employees remaining unassigned from x i while ∆ N x i > do: w (cid:48) = sample w with probability P ( w | N x i ) if: ∆ N x i ≥ w (cid:48) m (cid:48) = ( x i , w (cid:48) ) append m (cid:48) to M ∆ N x i − = w (cid:48) , subtract w (cid:48) from ∆ N x i end ifend whileend forox 2 : Destination assignment algorithm. Pseudocode for the algorithm that links thecandidate edges to DZN partitions, producing the surrogate network S . procedure Assigning candidate edges input: E B , the SA2(UR) → SA2(POW) network reported by ABS Γ , the SA2 → DZN network M , the candidate edges produced by Algorithm 1 R , the SA1 → DZN network reported by ABS N Y = { N y , N y , ...N y k } , the DZN employee populationinitialize S = R initialize { ∆ w } , the discrepancies in aggregated commuter numbers (see equation 3) while | M | > for y i in Y R : E (cid:48) Γ = { e ∈ E Γ | y ( { e } ) = y i , ( x ( { e } ) , Υ y i ) ∈ E AB } (equation 7) Φ i = x ( E (cid:48) Γ ) , the SA1 partitions contained by the SA2(UR) partitions of E (cid:48) Γ M (cid:48) = { m j ∈ M | x i j ∈ X Φ i } , subset of M such that Φ i contains x i j sample m ∗ = ( x ∗ , w ∗ ) f rom M (cid:48) unif ormly at randome ∗ = ( x ∗ , y i ) , w ( { e ∗ } ) = w ∗ , the potential new SA1 → DZN edge e B = { e ∈ E B | X x ⊇ x ( { e ∗ } ) , Y y ⊇ y i } = ( x B , y B ) (equation 9) if: w ( { e ∗ } ) > ∆ w ( { e B } ) AND N y i ≥ w ( { e ∗ } ) + (cid:80) np =1 w ( { ( x p , y i ) } , S ) append e ∗ to E S ∆ w ( { e B } ) − = w ( { e ∗ } ) end ifend forend while f the 2,572,117 commuters accounted for by the full set of 683,239 candidate edges M , there were 729,209 commuters comprising 61,855 edges remaining unassigned when ourprocess terminated due to an inability to assign edges under the above criteria. Two factorsare responsible for the inability of the algorithm to assign these edges. The first is thatthe privacy protocol, by design, ensures cross referencing totals do not match in perturbeddata released by the ABS. The second is that our ground-truth topology omits the non-overlapping set w ( { E B \ E A } , B ) , therefore, the 612,215 missing commuters tabulated inFigure 3c cannot be accounted for by our re-sampling procedure.This surrogate network has an additional 546,992 SA1 → DZN edges, a increase ascompared to network R , with a total number of commuters N ( S ) comparable to that of theSA2 → SA2 network, N ( B ) . The total number of commuters in the as-provided SA1 → DZNnetwork N ( G ) is 7,023,571 the total for the surrogate network N ( S ) is 9,336,333 and ourquantitative ground-truth N ( B ) is 10,073,246. Code availability
The custom code used to generate the surrogate network via the method outlined in thistext was run on MATLAB version R2017b. The script and required inputs can be accessedon the online repository [28], along with usage notes and descriptions of relevant parameters.
Data Records
We have made an instance of the reconstructed surrogate commuter network publicly avail-able [28]. All of the data sets used including the original SA1 → DZN commuter mobility net-work, SA2 → DZN network, SA2 → SA2 mobility network, number of employees in each SA1( N X ), number of employees in each DZN ( N Y → DZN network ( H ) is noonger publicly available with the additivity-including privacy policy compliance protocolso we provide the version we used along with our surrogate network. The stability of thefiles available through ABS may vary with time, as evident in the removal of the additivity-ensuring step from the perturbation protocol used for all presently distributed data. Technical Validation
To quantitatively assess the aggregation behavior of the surrogate network S , we first ac-cumulated its component edges into the corresponding SA2 → SA2 topology (which we willrefer to as network C ). This new aggregated surrogate network was then compared to boththe ABS-provided SA2 → SA2 network and the aggregate of the original SA1 → DZN network( A ), by several different metrics. To assess the overall agreement between the three net-works, we first translated their edge lists and weights into adjacency matrices (Figure 6a),and computed the 2D correlation coefficient between each pair: r ( α, β ) = Σ m Σ n ( α mn − ¯ α )( β mn − ¯ β ) (cid:113) m Σ n ( α mn − ¯ α ) Σ m Σ n ( β mn − ¯ β ) , (13)where α and β represent each of the two adjacency matrices being compared. This compar-ison demonstrates a high degree of similarity between all three networks, with a significantimprovement in correlation between the ABS-provided SA2 → SA2 network and the accumu-lated surrogate (Table 3).Table 3: 2D correlation coefficients computed according to Eq. 13, between aggregated andABS-provided SA2 → SA2 networks. Network C is the aggregated SA1 → DZN surrogate,network A is the aggregate of the SA1 → DZN network provided by ABS, and network B isthe SA2 → SA2 network provided by ABS.Network pair
B, A B, C A, C
2D correlation ( r ) 0.9821 0.9996 0.9828Plotting the frequency distribution of edge weights for the ABS-provided SA1 → DZNcommuter networks of 2016 and 2011, along with the corresponding distribution for theurrogate network (Figure 6b) indicates a partial repair of the discrepancy in low-weight( w < ) edge numbers observed between 2011 and 2016 (Fig. 2d).The discrepancies in edge weights between the amalgamated surrogate network ( C ) andthe ABS-provided SA2 → SA2 network ( B ) are plotted in Figure 6c as a function of the edgeweight from network B . Comparison of these discrepancies to those plotted in Figure 3bindicates a dramatic improvement, comparable to the corresponding discrepancies computedfor the 2011 commuter network. To further demonstrate the structural repair imparted to thesurrogate network, we computed the distributions of weighted degree (the sum of all edgeweights incident on each node), for networks A , B , and C (Figure 6d). The distributioncorresponding to the aggregated surrogate network more closely matches that of the rawSA2 → SA2 network.We further quantify the similarity between our amalgamated surrogate ( C ) and theground-truth network (the edges in network B that also exist in network A ), by calcu-lating the mean-squared error (MSE) in the weights over all UR → POW pairs in E AB . Here,we compute the MSE over the edge weight sets α = w ( E AB , B ) , (14)and β = w ( E AB , C ) or, β = w ( E AB , A ) , (15)as MSE ( α, β ) = 1 | E AB | (cid:88) e ij ∈ E AB [ α ij − β ij )] , (16)where subscripts ij indicate specific UR → POW pairs. This quantity provides an estimate ofhow much our algorithm rectified discrepancies between SA2 → SA2 edges, given our conser-vative choice not to add edges to the overlapping set E AB . The results are shown in Table 4below, and indicate a significant quantitative improvement, as expected from comparisonbetween Figure 3b and Figure 6c. etwork A a) nu m be r o f edge s | E G | w b) SA1-DZN network ABS 2011ABS 2016Surrogate 2016 250200150100500 nu m be r o f node s d) NetworkABC150100500-50 w e i gh t d i ff e r en c e ∆ w ij
10 100 1000edge weight w c) Figure 6: Validation of the surrogate network. (a) Color plots of the SA2 → SA2 adjacencymatrix from the aggregate of the original SA1 → DZN network A , aggregated surrogate C ,and ABS-provided SA2 → SA2 network B . The SA2 regions are somewhat spatially orderedsuch that the different states, in particular the larger urban areas, are clustered around thediagonal. (b) Weight distributions for the networks R , H and S . (c) Weight differences, ∆ w ij , as a function of w ( E AB , B ) , demonstrate improved quantitative agreement (compareto Fig. 3b). (d) Distributions of node degree strength (total incident edge weight) fornetworks A , B , and C .To evaluate the improvement in structural properties of the surrogate relative to the as-provided network we analysed two key network measures for the common components of thenetworks A and B . The first is simply the average shortest path between nodes, as computedby applying Dijkstra’s shortest-path algorithm to the weighted networks, interpreting edgeweight as inverse distance. The second is a version of the clustering coefficient adaptedto weighted networks [29] that defines the weighted clustering coefficient for a node i byevaluating the fraction of its neighbors j and k that share connections, weighted based onable 4: MSE between overlapping portions of the aggregated and ABS-provided SA2 → SA2networks computed according to equation 16.Network pair
B, A B, C A, C
MSE 62.51 0.27 60.93the relative weights of the edges connecting the triangle, as C i = 2 k i ( k i − (cid:88) j,k ( ˆ w ij ˆ w jk ˆ w ki ) / , (17)and reports the average of this quantity over all nodes in the network. Here, the weightsof nodes in a triangular cluster are scaled by the largest weight in the network ˆ w ij = w ( { e ij } ) /max ( w ( E )) , and k v is the degree of node v .Table 5: Average weighted network statistics. The networks marked with an asterisk ( ∗ )contain only edges appearing in E AB that is, they represent the overlapping portions of thenetworks. [Note: inclusion of the edges unique to network B quantified in Fig. 3c, producesa dramatic reduction in the network’s clustering coefficient, which is intuitive given therelatively low weights of these edges and our definition of the weighted clustering coefficient(Eq. 17).] Network A ∗ C ∗ B ∗ B Shortest path 0.157 0.118 0.099 0.095Clustering coefficient ( × − ) 1.97 2.95 3.11 1.51These network statistics are shown in Table 5 and indicate improved correspondencebetween the network properties of the overlapping sets w ( E AB , C ) and w ( E AB , B ) , as com-pared to the aggregate of the original network w ( E AB , A ) .The number of commuters in the surrogate network is 9,336,333 constituting a 25% in-crease in the commuter population as compared to the aggregated ABS-provided SA1 → DZNnetwork. Our procedure added nearly half a million new SA1 to DZN edges. The increase incorrelation and closer network statistics at the SA2 scale, as well as the edge-wise decreasein mean-squared error indicates both a quantitative and structural improvement over theoriginal dataset provided by the ABS.he surrogate network proffered here represents a significant improvement over the origi-nal SA1 partitioned commuter mobility network. It reconstructs the population and networkstatistics of the less perturbed SA2 level network by adding additional SA1 → DZN connec-tions that have been lost to the ABS privacy protocol. Access to the surrogate networkand the availability of a method for recovering high fidelity data on such high resolutionnetworks is of broad significance to the computational modeling of diffusion phenomena invarious disciplines. The redistribution of ABS data is protected under Creative Commonslicensing.
Network statistics for different instantiations
The process of generating the surrogate networks is stochastic. However, the constraintsplaced on the new edge generation leads to very consistent surrogate network statisticsacross instantiations. This is evident in comparing the network statistics of the surrogatenetwork analysed here, with several additional instantiations. These are shown in Table 6.Table 6: The weighted network statistics for additional surrogate data sets.Network
C C C C Shortest path 0.119 0.113 0.116 0.119Clustering coefficient 2.70 2.70 2.65 2.70Likewise the MSE and 2D correlation demonstrate an excellent agreement between thespecific surrogate network analysed produced by our study, and additional generated surro-gates. These are shown in Table 7.Table 7: The MSE and 2D correlation between the chosen surrogate, C, and additionalgenerate surrogate networks aggregated to SA2 → SA2.Network C C C MSE 0.142 0.153 0.1482D correlation 1.000 1.000 1.000 onvergence
The process of building the new edges e ∗ from the sample edge distributions is the mosttime consuming part of creating the surrogate networks. Each run generating a surrogatenetwork was given 100 hours to reach the end-point criteria, however a small proportion ofcommuters remain impossible to assign, as larger candidate edges become disallowed by thealgorithm’s constraints. Figure 7 shows the number of unassigned commuters as a functionof time when placing the new edges. As edges are added the constraints of SA1 population,DZN population and SA2-SA2 edge sizes reduce the likelihood of a suitable sample edgefitting. This results in convergence on a non-zero number of unassigned commuters. una ss i gned c o mm u t e r s Figure 7: Algorithm convergence. The number of unassigned commuters as a function oftime while assigning commuter weights to new SA1-DZN edges, running the script ‘cre-ate_surrogate.m’ [28] for 100 hours.
Usage Notes
The MATLAB script ‘creating_surrogate.m’, available in the online repository [28] imple-ments the method outlined in this paper. The input required for this script is located in therepository file ‘inputs.mat’. This workspace includes: • R ), H ), • • N X ), • N Y ), • Γ ), • SA2-SA2 network accumulated from R ( A ), • SA2-SA2 ABS network ( B ).Using this script first produces the commuter residential distribution based on the 2011census data, then a list of possible SA1 edges ( M ) using the residential distribution andfinally assigns them to DZNs, creating e ∗ . This is then combined with the existing edges ofnetwork R to create the surrogate network S . A complete description of each network and thefile header information is located in the corresponding ’README.txt’. The data format issimply a table of edges, the first column corresponding to the SA1 code, the second columncorresponding to the DZN code, and the third column giving the number of commutersassigned to the pair. Acknowledgments
We acknowledge the Australian Bureau of Statistic (ABS) for providing all of the raw data aswell as general advice in regards to the nature of their perturbation procedures. The Authorswere supported through the Australian Research Council Discovery Project DP160102742.
Author Contributions
KF, CZ, and MP designed the research; KF and CZ designed the algorithm; KF implementedthe algorithm code; CZ and KF designed the validation strategy; KF performed data analysisfor validation; CZ, KF, and MP composed the manuscript. ompeting Interests
The authors declare no competing interests.
References [1] Yu, F. & James, W. J. High-resolution reconstruction of the United States humanpopulation distribution, 1790 to 2010.
Sci. Data , 180067 (2018).[2] Eubank, S., et al. Modelling disease outbreaks in realistic urban social networks.
Nature , 180-184 (2004).[3] Longini, I.M., et al.
Containing Pandemic Influenza at the Source.
Science , 1083-1087 (2005).[4] Germann, T. C., Kadau, K., Longini, I. M. & Macken, C. A. Mitigation strategies forpandemic influenza in the United States.
PNAS , 5935-5940 (2006).[5] Cliff, O., et al.
Investigating spatiotemporal dynamics and synchrony of influenza epi-demics in Australia: an agent-based modelling approach.
Simulat. Model. Pract. Theor. , 412-431 (2018).[6] Zachreson, C., et al. Urbanization affects peak timing, prevalence, and bimodality ofinfluenza pandemics in Australia: Results of a census-calibrated model
Science Advances , (2018).[7] Wang, Z., et al. Statistical physics of vaccination.
Phys. Rep. , 1-113 (2016).[8] Farmer, D. J. & Foley, D. The economy needs agent-based modelling.
Nature ,685–686 (2009).[9] D’Alelio, D, Libralato, S., Wyatt, T. & d’Alcalà M. R. Ecological-network models linkdiversity, structure and function in the plankton food-web.
Sci. Rep , 21806 (2016).10] Fang, Y. & Jawitz, J. W. High-resolution reconstruction of the United States humanpopulation distribution, 1790 to 2010. Sci. Data , 180067 (2018).[11] Einav, L. & Levin, J. Economics in the age of big data. Science , 1243089 (2014).[12] Lee, J. Y. L., Brown, J. J. & Ryan, L. M. Sufficiency revisited: rethinking statisticalalgorithms in the big data era.
Am. Stat. , 202-208 (2017).[13] Coull, S.E., Monrose, F. , Reiter, M.K. & Bailey, M. The challenges of effectivelyanonymizing network data. in Australian Bureau of Statistics MethodologyReport
ABS Catalogue No. 1352.0.55.072 (2007).[15] Kugler, T. A. & Fitch, C. A. Interoperable and accessible census and survey data fromIPUMS.
Sci. Data , 180007 (2018).[16] Australian Bureau of Statistics TableBuilder . (2018)[17] Rogers, D. J. & Cegielski, W.H. Opinion: Building a better past with the help ofagent-based modeling. PNAS , 12841-12844 (2017).[18] Australian Bureau of Statistics
Australian Statistical Geography Standard (ASGS):Correspondences, July 2011
ABS Catalogue No. 1270.0.55.006. (2013).[19] Coull, S. E., Narayanan, A. & Shmatikov, V. Robust De-anonymization of Large SparseDatasets. In
Int. J. Uncaertain. Fuzz. , 557-570 (2002).21] Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highlycomplex mixtures using high-density SNP genotyping microarrays.
PLoS Genet. ,1000167 (2008).[22] Fraser, B. & Wooten, J. A proposed method for confidentialising tabular output to pro-tect against differencing. Monographs of Official Statistics: Work Session on StatisticalData Confidentiality
Joint ECE/Eurostat Worksession on Statistical Confidentiality in Bilbao, December2009 (2009).[24] Wooton, J. Measuring and Correcting for Information Loss in Confidentialised CensusCounts.
Australian Bureau of Statistics Research Paper
ABS Catalogue No. 1352.0.55.083(2007).[25] Harding, N., Nigmatullin, R. & Prokopenko, M. Thermodynamic efficiency of conta-gions: a statistical mechanical analysis of the SIS epidemic model.
Interface Focus ,20180036 (2018).[26] Piraveenan, M., Prokopenko, M. & Zomaya, A. Y. Information-Cloning of Scale-FreeNetworks. Advances in Artificial Life
The European Physical Journal B , 291–300 (2009).[28] Fair, K. M., Zachreson, C. & Prokopenko, M. Creating a surrogate commuter networkfrom Australian Bureau of Statistics census data. Zenodo http://doi.org/10.5281/zenodo.2578459 (2018).[29] Onnela, J.P., Saramäki, J., Kertész, J. & Kaski, K. Intensity and coherence of motifsin weighted complex networks.
Phys. Rev. E71