Bayesian estimate of position in mobile phone network

Abstract

The traditional approach to mobile phone positioning is based on the assumption that the geographical location of a cell tower recorded in a call details record (CDR) is a proxy for a device's location. A Voronoi tessellation is then constructed based on the entire network of cell towers and this tessellation is considered as a coordinate system, with the device located in the Voronoi polygon of the cell tower that is recorded in the CDR. If Voronoi-based positioning is correct, the uniqueness of the device trajectory is very high, and the device can be identified based on 3-4 of its recorded locations. We propose and investigate a probabilistic approach to device positioning that is based on knowledge of each antenna's parameters and number of connections, as dependent on the distance to the antenna. The critical difference between the Voronoi-based and the real-world layout is the essential overlap of the antennas' service areas: a device that is located in a cell tower's polygon can be served by a more distant antenna that is chosen by the network system to balance the network load. This overlap is too significant to be ignored. Combining data on the distance distribution of the number of connections available for each antenna in the network, we resolve the overlap problem by applying Bayesian inference and construct a realistic distribution of the device location. Probabilistic device positioning demands a full revision of mobile phone data analysis, which we discuss with a focus on privacy risk estimates.


Bayesian estimate of position in mobile phone network

Aleksey Ogulenko, Itzhak Benenson, Itzhak Omer, Barak Alon

Department of Geography and Human Environment, Porter School of the Environmental and Earth Science, Tel Aviv University, Israel

Partner Communications Company LTD


Keywords:

Mobile phone positioning, Bayesian inference, Call Details Record, Location privacy

* Corresponding author

Individual’s positioning based on the mobile phone data

Mobile phone data, as a source of information on individual activities in time and space, have great potential for advancing all fields of Geographic Information Science and, specifically, for a deeper understanding of population mobility and for enhancing transportation and spatial planning (Berlingerio et al. 2013; Pinelli et al. 2016; Markovic et al. 2017). In high-resolution studies, these data are used for estimating and investigating individuals’ daily mobility patterns (Gonzalez et al. 2008; Louail et al. 2014), travel patterns between the city core and periphery (Givoni, 2017), and mode-dependent commuting, such as that of bikers and pedestrians (Xu et al. 2016; Kung et al. 2016; Bachir et al. 2019; Huang et al. 2019). When aggregated, mobile phone data assist general studies of urban and metropolitan dynamics (Calabrese et al. 2011; Razin and Charney, 2015), the socioeconomic organization of cities (Cottineau and Vanhoof, 2019) and urban land use planning (Pei et al. 2014), as well as the monitoring of urban activities (Reades et al. 2009; Wu et al. 2020). While different applications have different requirements with regard to the quality, resolution and level of data aggregation, the general tendency is to seek the highest possible resolution of individual activities, in both space and time. High-resolution data are especially useful for transportation planning and management, such as the analysis of public transport effectiveness or the establishment of cycling lanes and pedestrian-only streets (Pinelli et al. 2016; Xu et al. 2016; Bachir et al. 2019). Critical in this regard is the ability to estimate the spatial location of mobile phone users. In the vast majority of studies, determining the location of a mobile device is based on the locations of the cell towers (base stations) of the mobile phone network.
Namely, a Voronoi coverage is constructed using the towers’ coordinates, and the polygons of this coverage then serve as the basic units of the coordinate system: if, according to the network’s data record, the device d at time moment t is connected to an antenna of the base station A, then the device is located within the Voronoi polygon $V_A$ of A. Typically, the exact position of d within $V_A$ is not specified, although sometimes the position of the base station itself is considered as a proxy for the position of d. Voronoi-based positioning of the mobile device is exploited in (Williams et al. 2015; Järv et al. 2017; Zufiria and Hernandez-Medina, 2018, 2019; Bonnetain et al. 2019; Cottineau and Vanhoof, 2019; Sotomayor-Gomez and Samaniego, 2020), and very few attempts have been made to apply different approaches, such as positioning the mobile device based on the frequency of the owner’s visits (Wu et al. 2020). Some studies aim at improving Voronoi-based positioning by applying a Kalman filter and incorporating GPS data into the positioning algorithm (Hadachi and Lind, 2019) and, recently, by considering the Voronoi polygons of the towers’ antennas instead of the polygons built for the towers (Bachir et al. 2019). Namely, a cell tower serves devices by means of three antennas, each covering a 120° sector of the surrounding space, and the Voronoi coverage is constructed based on the barycentres of the service areas of the antennas instead of the towers.

Voronoi-based positioning relies, by definition, on a coverage of non-overlapping polygons. A solid criticism of this assumption was posed in several publications and conference presentations of Ricciato and co-authors (2017, 2020), who suggest that a solution for mobile device positioning must account for the overlap between the antennas’ areas of service (Tennekes, 2018). However, this view still remains on the margins of research attention.
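The Voronoi-based rule described above can be illustrated without building any polygons, because a point lies in the Voronoi polygon of a tower exactly when that tower is the nearest one. The following is a minimal sketch under that observation; the tower IDs and coordinates are hypothetical and planar coordinates in metres are assumed.

```python
import math

def nearest_tower(point, towers):
    """Return the ID of the tower closest to `point`.

    A point belongs to the Voronoi polygon of tower A exactly when A is
    its nearest tower, so no explicit polygon geometry is needed here.
    `towers` maps a tower ID to planar (x, y) coordinates in metres.
    """
    return min(towers, key=lambda tid: math.dist(point, towers[tid]))

# Hypothetical tower layout (metres); not taken from the paper's data.
towers = {"A": (0.0, 0.0), "B": (1000.0, 0.0), "C": (0.0, 1500.0)}

print(nearest_tower((300.0, 200.0), towers))  # this point lies in the polygon of A
print(nearest_tower((800.0, 100.0), towers))  # this point lies in the polygon of B
```

Under Voronoi-based positioning, a CDR that names tower A is read as "the device is somewhere in the set of points for which `nearest_tower` returns A".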
The goal of our study is to propose a framework and software for mobile device positioning that accounts for the overlap between the service areas of the mobile phone network antennas. The information about the overlap of the antennas’ service areas is hidden in the distribution of the number of devices served by the antenna at different distances during a predefined period of time, which we call below the PRACH curve (see Section 2). Based on this information, we propose a Bayesian estimate of the device position that does account for the antennas’ overlap. As we demonstrate, traditional Voronoi-based positioning results in significantly biased and unrealistically precise estimates, and the latter has important consequences for the privacy-related aspects of mobile phone data. Section 2 of the paper presents the details of a mobile phone network that are essential for our study. Section 3 presents Bayesian estimates of the mobile phone position. Section 4 presents a comparison between the Voronoi-based and Bayesian estimates. In Section 5, we consider the consequences of the proposed positioning methodology, which brings significant additional uncertainty to our knowledge about mobile device positioning. The software developed in this project is available for free download at https://github.com/grauwelf/mob-bayes-clouds. It is important to note that our approach is based on the standard aggregate data collected by every mobile network operator for the purposes of network maintenance.

Mobile phone network data description

The most abstract representation of a mobile phone network (MPN) serving individual devices of the third and fourth generations is as follows: an MPN consists of two sets of antennas, indoor and outdoor. Outdoor antennas have a constant orientation and sector of service, and are located on cell towers, each bearing several outdoor antennas. The capacity of an outdoor antenna, that is, the number of devices that it can serve simultaneously, is constant, while different antennas may have different capacities. Outdoor antennas are capable of serving mobile phones of all generations. Typically, there are 3 antennas on a tower, each covering a 120° sector. The technical characteristics of outdoor antennas make it possible to serve mobile devices at a distance of up to ~30 km, and Figure 1 presents examples of antenna service areas. Antennas’ sectors of service are divided into rings, called Trip-Time Bands (TTBs), that are explained later in the paper. Indoor antennas are located inside buildings and aim at serving devices at a distance of hundreds of meters at most. Their capacity is much lower than that of outdoor antennas.

The major goal of the MPN is to supply the maximum possible quality of service to customers’ devices. The number and location of cell towers is thus the outcome of a compromise between complicated technical and legal limitations. A real-world MPN is never steady and its configuration constantly changes. At the macro-scale, the network is always undergoing maintenance and repair. At the micro-scale, the network is constantly adapting to the state of the propagation medium, fluctuations of demand and load, interference between antennas of the same or another MPN, and so on. The related algorithms of the telecommunication system are inherently non-deterministic and, for example, it is impossible to predict what frequency band will be chosen for the connection between an antenna and a mobile phone.
Figure 1: Main map shows overlapping sectors of service for two antennas located a large distance apart. Inset: The view of the overlap of sectors of service for three antennas located close by.

The maximum possible quality of service to the customers’ devices is achieved by instantaneous balancing between the devices’ requests and the antennas’ load. Upon a device request, the MPN software considers several surrounding antennas, including nearby indoor antennas, as candidates for serving the request. The antenna that is chosen by the MPN for service is often not the closest one. Moreover, different antennas can service requests from a stationary device, and the antenna can even be switched during the same phone call. Antennas’ service areas with their TTBs are presented in Figure 1.

To balance the antennas’ loads, the MPN is able to estimate the distance between the antennas and a device. The distance is estimated based on the signal round-trip-time (RTT) or signal strength, and may be recorded in the device’s call detail record (CDR). The estimate of the distance is imprecise and is expressed in terms of the trip-time bands (TTBs) of the antenna’s service sector. As shown in Figure 1, the width of the TTB increases with the distance from the antenna. The width of TTB rings is not a round number. The width of the ring closest to an antenna is about 200 m, the next closest ones are ~400 m, after which the width increases to ~700 m, and the most distant are ~1300 m. TTBs for each MPN antenna are known, do not change in time, and the increase in their width with distance is similar for the majority of them.

There are several types of connection sessions (voice calls, SMS, WiFi, sighting). Let us consider the simplest example of a voice call. The CDRs of a voice call contain the time of connection, the ID of the antenna, and the cell tower ID. Usually, the connection start and end are recorded, and, typically, more than two CDRs are recorded during the call.
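The trip-time bands described above quantize the antenna-to-device distance into rings of increasing width. The sketch below maps a distance to a ring index. The approximate widths come from the text, but the number of rings of each width is not given there, so `RING_WIDTHS` is a hypothetical layout chosen only for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical ring widths in metres, nearest ring first. The text gives
# approximate widths (~200, ~400, ~700, ~1300 m) but not how many rings of
# each width an antenna has, so the counts below are assumptions.
RING_WIDTHS = [200] + [400] * 3 + [700] * 4 + [1300] * 4

# Outer radius of every ring: 200, 600, 1000, 1400, 2100, ...
OUTER_RADII = list(accumulate(RING_WIDTHS))

def ttb_index(distance_m):
    """Return the 0-based index of the TTB containing `distance_m`,
    or None when the distance lies beyond the last ring."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    i = bisect_right(OUTER_RADII, distance_m)
    return i if i < len(OUTER_RADII) else None

print(ttb_index(150))    # innermost ring
print(ttb_index(1500))   # a ring of the ~700 m class in this layout
print(ttb_index(10000))  # beyond the last ring of this toy layout
```

A real antenna would use its own, operator-supplied band edges in place of `OUTER_RADII`.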
If the call was managed by several antennas, the antenna IDs and cell tower IDs of all antennas involved, plus the moments of re-connection, are recorded. Ideally, the spatial components of the CDR contain, besides the antenna and cell tower IDs, the sequential number of the distance ring (Table 1).

Table 1: Schematic representation of the CDRs of a short voice call

Device ID   Start timestamp           End timestamp             Tower ID   Antenna ID   …
…D86BA7     2020-01-22T17:41:42.000   2020-01-22T17:43:30.000   …7DC5      …E002
…D86BA7     2020-01-22T17:43:30.000   2020-01-22T17:43:52.000   …4EBD      …D26B
…D86BA7     2020-01-22T17:43:52.000   2020-01-22T17:43:57.000   …7DC5      …E002
…D86BA7     2020-01-22T17:43:57.000   2020-01-22T17:44:54.000   …7DC5      …E002
…D86BA7     2020-01-22T17:44:54.000   2020-01-22T17:44:55.000   …BC1A      …640B
…D86BA7     2020-01-22T17:44:55.000   2020-01-22T17:49:09.000   …BC1A      …8215
…D86BA7     2020-01-22T17:49:09.000   2020-01-22T17:49:46.000   …BC1A      …8215
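To make the structure of Table 1 concrete, the sketch below summarizes the seven records of the call: total duration, number of antenna switches between consecutive records, and number of distinct towers involved. Only the visible ID suffixes from the table are used, and the function name `summarize_call` is illustrative rather than part of the authors' software.

```python
from datetime import datetime

# Rows of Table 1: (start, end, tower ID suffix, antenna ID suffix).
CDRS = [
    ("2020-01-22T17:41:42.000", "2020-01-22T17:43:30.000", "7DC5", "E002"),
    ("2020-01-22T17:43:30.000", "2020-01-22T17:43:52.000", "4EBD", "D26B"),
    ("2020-01-22T17:43:52.000", "2020-01-22T17:43:57.000", "7DC5", "E002"),
    ("2020-01-22T17:43:57.000", "2020-01-22T17:44:54.000", "7DC5", "E002"),
    ("2020-01-22T17:44:54.000", "2020-01-22T17:44:55.000", "BC1A", "640B"),
    ("2020-01-22T17:44:55.000", "2020-01-22T17:49:09.000", "BC1A", "8215"),
    ("2020-01-22T17:49:09.000", "2020-01-22T17:49:46.000", "BC1A", "8215"),
]

def summarize_call(cdrs):
    """Return (total seconds, antenna switches, distinct towers)."""
    total_seconds = 0.0
    switches = 0
    for i, (start, end, _tower, antenna) in enumerate(cdrs):
        t0 = datetime.fromisoformat(start)
        t1 = datetime.fromisoformat(end)
        total_seconds += (t1 - t0).total_seconds()
        # A switch occurs when consecutive records name different antennas.
        if i > 0 and antenna != cdrs[i - 1][3]:
            switches += 1
    distinct_towers = len({row[2] for row in cdrs})
    return total_seconds, switches, distinct_towers

print(summarize_call(CDRS))  # (484.0, 4, 3)
```

In this eight-minute call the serving antenna changes four times across three towers, which illustrates why a single tower location is a weak proxy for the device position.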

Usually, the CDR data that are available to researchers are aggregated or censored in some way: typically, the CDR data contain the cell tower IDs but not the antenna IDs, and the time of the connection is rounded. The major disadvantage of the aggregate/censored data is the lack of information on the distance to the device. The latter seems to be the major reason for the broad view that a Voronoi tessellation, based on the locations of the MPN cell towers, may serve as a proxy for the device location. Namely, it is assumed that at the time moment recorded in the CDR, the device is located within the tessellation polygon of the tower that is recorded in this CDR (Candia et al. 2008; Song et al. 2010; Csáji et al. 2012; De Montjoye et al. 2013; Bonnel et al. 2015; Kalatian and Shafahi, 2016). This assumption has far-reaching consequences, especially with regard to user location privacy: if it is true, then only 3 – 4 locations of a device are sufficient to identify the specific device with close to a 99% probability (De Montjoye et al. 2013).

Voronoi-based positioning is extensively used and well-studied. It is unambiguous, computationally effective, and clearly intuitive. It accepts as a self-evident fact that the tower that generates the polygon absolutely dominates the whole area inside the polygon. However, a Voronoi tessellation does not account for the overlap of the service areas of multiple antennas caused by the basic physical aspects of the MPN structure and workflow (Zhang, 2017):

● Cross-slot interference: Communication between a user device and a base station consists of repeated sequences of periods: downlink period (base station → user device), silent guard period, and uplink period (user device → base station). For the given climate and environment, the downlink signal from a distant base station/antenna may arrive with very low propagation loss yet with a significant delay, falling into the uplink period of the target base station. Typical scenarios include base stations on top of hills around a large city, base stations in different cities, or those separated by a large water body. While normal communications are performed over distances of up to 30 km, cross-slot interference can result in communication with antennas at distances of up to 200-300 km.

● Uplink interference: Uplink from a user device at a location that can be served by several towers, such as on the border between cells, will cause interference to adjacent towers. Devices with bad radio-frequency conditions that unsuccessfully try to get access to the chosen station and transmit high-power signals can cause high noise conditions. The latter leads to access failures and results in heterogeneity of communication quality inside the coverage area.

● Doppler shifts: A fast-moving user device causes Doppler shifts (offsets of the radio frequency) in the uplink signal received by a base station. The user device then synchronizes to a shifted downlink signal, and its next uplink will be shifted more, and so on. In this way, Doppler shifts cause fundamental performance degradation. Typical scenarios refer to high-speed trains or highways with base stations installed along the road. To manage Doppler shifts, the MPN has to be tuned in order to compensate for the high frequency offsets, and this also leads to heterogeneity of communication quality inside the coverage area.

Recent doubts about the adequacy of the tower-based Voronoi partition resulted in essential modifications. Two important examples are the sectors Voronoi partition (Bachir et al. 2019), which is constructed based on the barycentres of the centroids of cells within the antenna’s sector, and, especially, the section tessellation technique (Ricciato et al. 2017), which uses service coverage maps to account for the overlap of the cell towers’ coverage areas and shows a significant gain in spatial accuracy in simulated scenarios with a synthetic population and MPN coverage.
Bayesian inference of the mobile phone position

Bayesian estimate of antenna’s service area

To account for the probabilistic nature of the MPN service, we apply the Bayesian approach to the device’s location. To establish the model, we assume that the following components are fully defined:

1) Network layout: the location of every cell tower, the location of antennas on the towers, and the azimuth and TTBs of the antennas.

2) The a posteriori distribution, for each antenna, of the number of connections by the antenna’s TTBs (the PRACH curves), constructed over a period of time that is long enough to obtain stable estimates. In what follows, we consider PRACH curves constructed for one month.

Let us consider a device D located inside the coverage area of a set of antennas $\{A_1, \dots, A_n\}$ and estimate the probability that D is located at a given point $\bar{x}$, given that its current connection is carried by the antenna $A_k$. Let the TTBs of $A_k$, in order of their distance from the antenna, be $S(A_k) = \{S_k^1, \dots, S_k^l\}$, where l is the total number of TTBs for $A_k$. Each antenna is characterized by the monthly number of connections with the devices located within each TTB. This statistic is called below a PRACH curve, from the Physical Random Access Channel procedure used by a device to initiate contact with a base station (Korhonen, 2003, p. 340). A PRACH curve is defined by two factors: the spatial distribution of the population’s communication activity and the overlap between the antennas’ areas of service. In this study we use the PRACH curves supplied to the mobile phone operator by a third-party company. According to the industry standards, the curves are estimated up to a distance of 32 km, and more distant connections are not included. Figure 2 presents PRACH curves for three antennas that represent typical but very different kinds of this curve.

Figure 2: Monthly PRACH curves for three antennas. The marked radii are the circumcircle radii of the Voronoi polygons (see below) of the towers of these antennas.

As can be seen, the PRACH curves in Figure 2 are very different. […]

We discretize the space into grid units $\{G_i\}$ and estimate the probability that the device served by the antenna $A_k$ is located in a grid unit $G_i$ by applying Bayes’ theorem:

$$P(G_i \mid A_k) = \frac{P(A_k \mid G_i)\,P(G_i)}{\sum_j P(A_k \mid G_j)\,P(G_j)}. \qquad (1)$$

Here $P(A_k \mid G_i)$ is the probability that the device is served by antenna $A_k$ given that the device is located in the grid element $G_i$, and $P(G_i)$ is the a priori probability of $G_i$, proportional to the a priori number of devices in $G_i$. We estimate $P(A_k \mid G_i)$ based on the overlap between the grid elements and the PRACH curves of all antennas whose TTBs overlap $G_i$. At a given time moment t, a device that is located in $G_i$ can be connected to one antenna only. Let us consider the position of $G_i$ with respect to $S(A_k)$ and let $S_k^j$ be one of the TTBs for which $S_k^j \cap G_i \neq \emptyset$. In what follows we assume that the probability that the connection was established from $G_i$ to $A_k$ is proportional 1) to the area of the intersection $S_k^j \cap G_i$ and 2) to the fraction of connections carried by $S_k^j$ among all connections established from $G_i$ to all antennas whose TTBs cover $G_i$. The latter is estimated using the a posteriori distribution of network activity. Decomposing $P(A_k \mid G_i)$ into a sum over all TTBs from $S(A_k)$, we obtain:

$$P(A_k \mid G_i) = \frac{\sum_{j} \mathrm{area}(S_k^j \cap G_i)\, N(S_k^j)}{\sum_{m,q} \mathrm{area}(S_m^q \cap G_i)\, N(S_m^q)}, \qquad (2)$$

where $N(S)$ is the number of connections that the PRACH curve assigns to the TTB $S$, and the sum in the denominator runs over the TTBs of all antennas that cover $G_i$.

An a priori distribution of devices’ locations can be considered, with respect to the prior information, in two ways (Williamson, 2010):

From the “objective” point of view, we should avoid any prior assumptions about $P(G_i)$ and thus assume that the prior distribution is uniform. That is, we can cancel the terms $P(G_i)$ and $P(G_j)$ in (1).

“Subjective” estimation assumes that we have some independent knowledge about the prior location distribution. For example, we can assume that the priors are proportional to the population of the grid cells and that the population information is available from the census. In this case, we would ignore information on the devices that are located in the grid cells temporarily, and exclude unpopulated grid cells from further calculations. On this basis, we prefer the “objective” view.
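The two-step estimate above, a likelihood built from TTB-cell overlaps followed by Bayes' theorem over grid cells, can be sketched on toy data. This is a minimal illustration, not the authors' PostGIS implementation: the weight of each overlap is taken as intersection area times the TTB's connection count, which is one reading of the two proportionality assumptions in the text, and all antenna names, cell names, and numbers are hypothetical.

```python
from collections import defaultdict

def antenna_likelihoods(overlaps):
    """Estimate P(antenna | cell) from (antenna, cell, area, connections)
    tuples, one tuple per intersecting (TTB, grid cell) pair.

    Each overlap gets the weight area * connections (an assumed reading of
    the text); weights are normalised over all antennas covering the cell.
    """
    weight = defaultdict(lambda: defaultdict(float))
    cell_total = defaultdict(float)
    for antenna, cell, area, connections in overlaps:
        w = area * connections
        weight[antenna][cell] += w
        cell_total[cell] += w
    return {
        antenna: {c: w / cell_total[c] for c, w in cells.items() if cell_total[c] > 0}
        for antenna, cells in weight.items()
    }

def posterior(likelihood, prior=None):
    """Bayes' theorem over grid cells for one antenna.

    With prior=None a uniform ("objective") prior is used, so the prior
    terms cancel and the likelihoods are simply renormalised.
    """
    if prior is None:
        prior = {c: 1.0 for c in likelihood}
    joint = {c: p * prior[c] for c, p in likelihood.items()}
    z = sum(joint.values())
    return {c: v / z for c, v in joint.items()}

# Hypothetical overlaps: areas in arbitrary units, monthly connection counts.
overlaps = [
    ("A1", "g1", 1.0, 300), ("A1", "g2", 0.5, 100),
    ("A2", "g2", 0.5, 300), ("A2", "g3", 1.0, 200),
]
like = antenna_likelihoods(overlaps)
print(posterior(like["A1"]))  # g1 receives most of the mass for A1
```

Passing a census-based dictionary as `prior` would correspond to the "subjective" view; cells missing from the likelihood simply do not enter the cloud.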

Examples of Bayesian estimates of antenna’s service area

Applying the Bayesian estimates (1) – (2), we obtain a set of grid cells, a “cloud” of possible locations of a device that is registered at a given antenna. This study is based on information on the 22007 antennas of the Partner Communications Company LTD (“Partner”) MPN that serves the entire area of Israel. Each antenna of this MPN is technically able to serve devices up to a distance of 32 km, and its PRACH curve is represented by 40 TTBs. For each antenna we know its location (via the cell tower location), azimuth and monthly PRACH curves and, based on that, have estimated the a posteriori distributions of each antenna’s connections, using the “objective” view of the priors. The MPN’s coverage area was discretized into a grid of 250 × 250 m cells, and the total number of these cells for Israel is 360139. To reduce computational cost, we consider only those TTBs whose average monthly density of connections is at least 10 per 250 × 250 m grid cell per month, that is, 160 connections per 1 km² per month. This limitation resulted in excluding 0.2% of all connections, and the remaining 99.8% of the overall 42×10 connections, estimated as a sum of connections over all PRACH curves, were used for the Bayesian inference. According to this criterion, 335195 (93.1%) of the grid cells are included in at least one probabilistic cloud. In what follows, we consider the outcomes that are based on these 99.8% of the observations as 100%. All calculations were performed in a PostgreSQL database with the use of the PostGIS extension for performing spatial operations.
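The density threshold and the share of retained grid cells quoted above follow from simple arithmetic on the stated figures, shown here as a short worked check:

```latex
\frac{10\ \text{connections}}{0.25\,\text{km} \times 0.25\,\text{km}}
  = \frac{10}{0.0625\,\text{km}^2}
  = 160\ \text{connections per km}^2,
\qquad
\frac{335195}{360139} \approx 0.931 .
```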

Figure 3 shows probabilistic “clouds” for eight antennas. To present the cloud for antenna $A_k$ for a given cumulative probability p (the p-cloud of the antenna), we sort all grid cells for which the probability $P(G_i \mid A_k)$ estimated in (1) is positive, and consider the minimal set of cells with the highest $P(G_i \mid A_k)$ that comprise a total share greater than or equal to p. Figure 3 illustrates the important fact that antennas differ essentially with regard to the certainty of device positioning.

Figure 3: 99% location clouds for eight antennas A – H that are positioned on different cell towers in the Tel Aviv Metropolitan Area, at the resolution of a 250 × 250 m grid. (a) General view; (b) Very small cloud H, cloud of an average size F, discontinuous cloud G; (c) Three clouds of antennas located at adjacent towers with very small overlap; (d) Cloud of antenna E covers the core part of the discontinuous cloud of antenna A.

Let us now extend the Bayesian approach towards the location cloud of the cell tower and compare the probabilistic location to the Voronoi-based one.
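The p-cloud construction described above (sort cells by probability, keep the smallest top set whose mass reaches p) can be sketched as follows. The function name `p_cloud` and the toy probabilities are illustrative assumptions; a small tolerance guards against floating-point round-off when the accumulated mass equals p exactly.

```python
def p_cloud(cell_probs, p):
    """Return the smallest list of highest-probability cells whose total
    probability is at least `p` (cells sorted by decreasing probability)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    cloud, mass = [], 0.0
    ranked = sorted(cell_probs.items(), key=lambda kv: kv[1], reverse=True)
    for cell, prob in ranked:
        if prob <= 0:
            break  # only cells with positive probability belong to a cloud
        cloud.append(cell)
        mass += prob
        if mass >= p - 1e-12:  # tolerance for floating-point round-off
            break
    return cloud

# Hypothetical posterior over four grid cells.
probs = {"g1": 0.5, "g2": 0.3, "g3": 0.15, "g4": 0.05}
for p in (0.5, 0.75, 0.95, 0.99):
    print(p, p_cloud(probs, p))
```

With these toy values the 50%-cloud is a single cell while the 99%-cloud needs all four, mirroring how cloud size grows with p.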

Bayesian estimate of the device location based on the cell tower data

We define a probabilistic location cloud for a tower as a union of the probabilistic location clouds of the tower’s antennas. Let us consider a cell tower $T$ that is equipped with antennas $A_1, \dots, A_m$. Applying Bayes’ theorem in the same way as in (1) – (2), we obtain

$$P(G_i \mid T) = \frac{\sum_{k=1}^{m} P(A_k \mid G_i)\,P(G_i)}{\sum_j \sum_{k=1}^{m} P(A_k \mid G_j)\,P(G_j)}. \qquad (3)$$

The sum $P(T \mid G_i) = \sum_{k=1}^{m} P(A_k \mid G_i)$ denotes the probability that the device D, located at the grid element $G_i$, will be served by some antenna of the cell tower $T$. Since, at a given time moment, the device can be connected to one antenna only, we can consider this probability as a sum of a posteriori probabilities over all the tower’s antennas. In this way, we construct an estimate of the PRACH curve for the tower as a combination of the PRACH curves of the tower’s antennas.

Figure 4: The p-clouds for cell tower M on a background of the MPN Voronoi coverage

We define the p-cloud for a tower as the minimal set of grid elements with the highest values of the probabilities $P(G_i \mid T)$ that comprise a total share greater than or equal to p. Figure 4 presents the p-clouds for p = 0.5, 0.75, 0.95 and 0.99 for the tower M, which serves the densely populated area of the Tel Aviv University campus, on a background of the tower-based Voronoi partition. As can be seen, the tower’s 75%-cloud and larger clouds are much larger than the tower’s Voronoi polygon. At the same time, the top part of M’s Voronoi polygon is hardly served by the antennas of this tower.

Comparison between the Bayesian and Voronoi-based positioning
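The tower-level estimate (3) above sums, for each grid cell, the per-antenna probabilities of being served and then normalises over cells. A minimal sketch under that reading is given below; the function name `tower_posterior`, the antenna names, and the numbers are hypothetical, and a uniform prior is assumed when none is supplied.

```python
def tower_posterior(antenna_likelihoods, prior=None):
    """Posterior over grid cells for a tower.

    `antenna_likelihoods` maps each antenna of the tower to a dictionary
    {cell: P(antenna | cell)}. Because a device is connected to one antenna
    at a time, P(tower | cell) is the sum over the tower's antennas.
    """
    tower_like = {}
    for cells in antenna_likelihoods.values():
        for cell, p in cells.items():
            tower_like[cell] = tower_like.get(cell, 0.0) + p
    if prior is None:
        prior = {c: 1.0 for c in tower_like}  # uniform ("objective") prior
    joint = {c: p * prior[c] for c, p in tower_like.items()}
    z = sum(joint.values())
    return {c: v / z for c, v in joint.items()}

# Hypothetical tower with two antennas whose clouds share cell g2.
tower = {
    "A1": {"g1": 0.6, "g2": 0.2},
    "A2": {"g2": 0.3, "g3": 0.5},
}
print(tower_posterior(tower))
```

The resulting dictionary can be fed to the same p-cloud selection that is used for a single antenna.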

In what follows we estimate the discrepancy between the Voronoi-based and the Bayesian estimation of position for Partner’s MPN. The MPN considered in our study consists of 2851 cell towers equipped with 22007 outdoor 3G antennas in total. Most of the antennas cover a 120° sector. To compare the deterministic Voronoi and the proposed Bayesian locations, let us estimate some properties of the Voronoi tessellation of Partner’s MPN.

Statistics of the Voronoi tessellation

Figure 5 shows the distribution of the distance between a cell tower and its 4 closest neighboring towers in the MPN. As could be expected, the distribution of the nearest neighbor distance is essentially asymmetric.

Figure 5: Distance to the nearest, 2nd, 3rd and 4th neighboring tower. (a) Distribution density for distances below 3 km; (b) Cumulative distribution for distances below 9 km. Vertical lines mark the median distance.

The size distribution of the Voronoi polygons has a very long tail (Figure 6). While the average polygon size is 9.89 km², only 20% of polygons are larger than this average, and half of the polygons have an area of less than 1.2 km².

Figure 6: The PDF of the Voronoi polygons’ size. $Q_1$, $Q_2$ and $Q_3$ denote the 25th, 50th, and 75th percentiles, respectively. The inset shows the full PDF; the size of the largest polygon is ~300 km².

Despite the high variation of the polygons’ size, the MPN Voronoi partition remains close to a honeycomb: as can be seen in Figure 7, the number of neighbors for most of the polygons is between 5 and 7, similar to the Voronoi coverage constructed for a random Poisson point process. We thus conclude that an arbitrary point inside the tessellation is more likely to fall into a large Voronoi polygon than into a small one (Haenggi, 2013). By design, such a tower pattern enables effective reuse of radio frequency bands, and increases the network’s capacity and coverage area.

Figure 7: PDF of the number of neighboring polygons for Partner’s MPN.

Statistics of the probabilistic clouds

Using the Bayesian model (1) – (3) we built the probabilistic cloud for every antenna: each cell of the square grid is characterized, for each antenna that covers it, by the probability that a device located in this cell is served by that antenna. The question is how large the full probabilistic cloud and its p-clouds are. In what follows, we consider p-clouds for p = 0.95, 0.75 and 0.50. Cloud size and shape depend on the population pattern and topography, and we compare the clouds’ statistics for two regions of Israel: the densely populated lowland of the Tel Aviv Metropolitan Area and the moderately populated and hilly Haifa and Northern District (Haifa/North). According to the 2016 census, the Tel Aviv Metropolitan Area (TAMA) has a population of 3.85 million people and its area is 1198 km², while the combined population of the Haifa/North region is 2.4 million and its area is 5494 km². These two regions also differ in shape. TAMA is located along the coastline and resembles a 50 × 20 km rectangle oriented south-north. The region of Haifa/North is close to a square and has a complicated landscape varying from the coast to the hilly highlands. As above, we consider antennas’ TTBs with 10 or more connections per 250 × 250 m grid cell per month.

The variety of the overlap states between the Voronoi and Bayesian coverages is very high. The average number of antennas serving the same grid cell according to the Bayesian clouds (Figure 8) is close to 32 for the 99%-cloud, 12 for the 95%-cloud, 5 for the 75%-cloud and 2.5 for the 50%-cloud. The corresponding estimates for towers are roughly half as large (Table 2).

Figure 8: (a) Distribution of the number of antennas (up to 23) serving the same grid element for the p-clouds, p = 50, 75, 95, 99%. (b) The same distribution for the number of towers (up to 11)

Table 2. Number of antennas’ and towers’ p-clouds serving the same grid cell

           Antennas’ p-clouds                Towers’ p-clouds
p          50%     75%     95%     99%       50%     75%     95%     99%
Mean       2.45    4.78    11.99   32.38     1.38    2.37    5.47    13.68
SD         1.47    3.47    9.88    32.14     0.74    1.79    4.64    13.13
Median     2       4       9       22        1       2       4       10
IQR        2       4       14      42        1       2       6       18
Max        18      33      85      343       9       18      38      125

The median size of the 95%-cloud is 2.5 km² (Figure 9a), about twice the median size of the tower’s Voronoi polygon (1.2 km², Figure 6), while the average size of the 75%-cloud is close to the average Voronoi polygon size. The cumulative probability chart (Figure 9b) shows the growth of the cloud size from 0.5 km² for the 50%-clouds to 4.5 km² for the 99%-clouds.

Figure 9: Distribution of the clouds’ size (a) and of the probabilities within the cloud (b). The shaded area in (b) depicts the interquartile range.

The size of the towers’ Bayesian clouds relative to the size of the Voronoi polygons

Figure 10 presents the ratio between the area of a tower’s p-cloud and the area of the tower’s Voronoi polygon. As can be seen, this ratio is similar in both regions, and the size of the Voronoi polygon is, on average, similar to the size of the p-cloud with p = 55-60%.

Figure 10: The ratio between the area of a tower’s p-cloud and the area of the tower’s Voronoi polygon for Israel (a) and two selected regions (b, c)

The overlap between the towers’ Bayesian clouds and Voronoi polygons

We represent the overlap by the probability that a device, registered by the antenna A which belongs to the tower M, is located in M’s Voronoi polygon. To estimate this probability, we cut out from the 100%-cloud of M the part that is covered by the Voronoi polygon, and sum up the probabilities that characterize the grid elements inside it. As Figure 11 shows, the overlap covers the entire spectrum of options, from polygons that do not intersect the cloud at all to polygons that cover the entire cloud and even exceed it. The overlap differs between the two regions: for example, the polygons that contain the entire Bayesian cloud comprise 3.6% in Haifa/North, compared to 2% in TAMA. This can be explained by the larger distance between settlements in the North and its less dense transport network: actively served areas may thus be smaller than the corresponding Voronoi polygons. Polygons that do not overlap with probabilistic clouds at all comprise 2.4% for the Haifa/North region and 2.5% for the TAMA region. Except for the extremes, a Voronoi polygon is most likely to cover between 25 and 50% of the Bayesian cloud.

Figure 11: The cumulative probability within the part of the tower’s Bayesian cloud that falls inside the tower’s Voronoi polygon. (a) An example of a Voronoi polygon that accumulates 35% of the overall tower’s cloud probability; (b) Entire Israel; (c) TAMA; (d) Haifa/North

Consequences of probabilistic positioning for the location privacy
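The overlap probability defined above, the share of a tower's cloud that falls inside the tower's own Voronoi polygon, can be approximated on a grid. The sketch below assigns each cell to the polygon of the tower nearest to the cell centre, which is a simplification of the exact polygon clipping described in the text; tower IDs, coordinates, and probabilities are hypothetical.

```python
import math

def mass_inside_voronoi(cloud, tower_id, towers, centres):
    """Sum the cloud probabilities of the cells whose centre lies in the
    Voronoi polygon of `tower_id`, i.e. whose nearest tower is `tower_id`.

    Using cell centres instead of clipping cells by the polygon boundary is
    an approximation assumed for this sketch.
    """
    total = 0.0
    for cell, prob in cloud.items():
        nearest = min(towers, key=lambda t: math.dist(centres[cell], towers[t]))
        if nearest == tower_id:
            total += prob
    return total

# Hypothetical layout in metres: two towers and a four-cell cloud of tower M.
towers = {"M": (0.0, 0.0), "N": (1000.0, 0.0)}
centres = {"g1": (100.0, 0.0), "g2": (400.0, 100.0),
           "g3": (700.0, 0.0), "g4": (900.0, 200.0)}
cloud_m = {"g1": 0.2, "g2": 0.15, "g3": 0.4, "g4": 0.25}

print(round(mass_inside_voronoi(cloud_m, "M", towers, centres), 2))  # 0.35
```

In this toy layout only 35% of the cloud's probability lies inside the tower's own polygon, so a Voronoi-based reading would misplace the device in most cases.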

A shift from the traditional Voronoi-based technique to probabilistic Bayesian location inference has significant consequences for the assessment of the privacy of mobile device users. Namely, the overlap between the service areas of antennas results in an essentially lower certainty of locating a device than the Voronoi-based view of location implies.

Let us consider a typical scenario of an attack on mobility privacy: an adversary intends to reveal the mobility history of a target person p who uses the device D, using both surveillance and access to the anonymized CDR dataset A. Anonymization in this particular case means encoding or removing from A any personal information that can be known by the adversary from other sources. We assume that the adversary (1) can identify the location of the target by direct observation, with finite accuracy ε, and (2) possesses full knowledge of the MPN towers and antennas and their mapping into the records of A.

The exact format of the dataset A depends on the data processing workflow designed to fulfil the specific business goals of the cellular company. It is usually assumed that A consists of CDRs that contain the device ID, the time of connection establishment, the time of connection termination, the cell tower ID, and the antenna ID. Typically, more than one CDR is recorded during a voice call: if the call is managed by several antennas, a new CDR is created for every segment of the call (communication session).

Traditionally, the location of D is based on the Voronoi coverage constructed from the cell towers’ locations. Namely, if D’s CDR is recorded during a communication session, then it contains the IDs of the antenna and of the cell tower C, and D can be located within the Voronoi polygon $V_C$ of C (Figure 12; De Montjoye et al. 2013). Often, the location of C itself is considered as a proxy for D’s location.

Figure 12: (a) Example of an individual trace; the dots represent the locations during communication sessions. (b) Voronoi-based view of the same trace (De Montjoye et al. 2013).

To recognize D’s records in A, the adversary starts by locating the target by direct observation. Knowing the target’s position at a moment $t_1$, the adversary, based on knowledge of the MPN, determines the corresponding Voronoi polygon and then the cell tower C. Further, the adversary queries A for all CDRs of the communications performed through C within the time interval $[t_1 - \Delta t, t_1 + \Delta t]$, where the uncertainty $\Delta t$ in the time condition is defined by the known inaccuracy of the time measurement and the delay in connection processing. Being sure of Voronoi-based positioning, the adversary believes that D has to be among the query’s results. If we denote this result as $R_1$ and the set of candidate devices’ IDs by $S_1$, the uniqueness of the target’s device can be estimated as $u_1 = 1/|S_1|$. If $|S_1| = 1$, the attack is successful. Otherwise, the adversary continues the surveillance and, at the moment $t_2$, determines the Voronoi cell $V_2$ and queries A again. The adversary believes that the target’s phone ID has to be in $S_1 \cap S_2$, and the target’s uniqueness is $u_2 = 1/|S_1 \cap S_2|$. The adversary can proceed, obtaining candidate sets $S_1 \cap \dots \cap S_k$ and the target’s uniqueness estimates $u_k = 1/|S_1 \cap \dots \cap S_k|$. Since $S_1 \cap \dots \cap S_{k+1} \subseteq S_1 \cap \dots \cap S_k$, we have $u_{k+1} \ge u_k$.

Empirical tests of the uniqueness values for real devices demonstrate that a sequence of 3-5 records with different towers, even without the time tags, is sufficient to identify almost 99% of the devices (De Montjoye et al. 2013).

The above line of argument is based on the assumption that a device D whose communication session was performed via cell tower T is located within $V_T$. As we demonstrate, the devices that were served by T are located in an area that is several times larger than $V_T$, and this area is simultaneously served by many other antennas located on many towers.
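The Voronoi-based attack described above amounts to intersecting the sets of device IDs returned by successive queries and taking the reciprocal of the size of the intersection as the uniqueness estimate. A minimal sketch under stated assumptions follows: the `CDR` record layout, the function names, the time tolerance `dt`, and all data are hypothetical, and each observation is assumed to be already mapped to the tower whose Voronoi polygon contains the observed position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDR:
    device_id: str
    tower_id: str
    start: float  # connection establishment time, seconds
    end: float    # connection termination time, seconds


def candidates(cdrs, tower_id, t, dt):
    """Device IDs with a session via tower_id that overlaps [t - dt, t + dt]."""
    return {r.device_id for r in cdrs
            if r.tower_id == tower_id and r.start <= t + dt and r.end >= t - dt}


def uniqueness_sequence(cdrs, observations, dt):
    """Intersect candidate sets over observations; return the final set and estimates."""
    remaining, estimates = None, []
    for tower_id, t in observations:
        found = candidates(cdrs, tower_id, t, dt)
        remaining = found if remaining is None else remaining & found
        # An empty intersection means the Voronoi assumption failed for some record.
        estimates.append(1 / len(remaining) if remaining else 0.0)
    return remaining, estimates


# Hypothetical anonymized dataset with three devices.
dataset = [
    CDR("d1", "T1", 0, 60), CDR("d2", "T1", 10, 50), CDR("d3", "T1", 20, 40),
    CDR("d1", "T2", 100, 160), CDR("d2", "T2", 110, 150), CDR("d3", "T3", 100, 130),
    CDR("d1", "T3", 200, 260),
]

# Observed (tower, time) pairs obtained by surveillance of the target.
observed = [("T1", 30), ("T2", 120), ("T3", 230)]

remaining, estimates = uniqueness_sequence(dataset, observed, dt=5)
```

In this toy dataset the candidate set shrinks from three devices to one over three observations, so the estimates never decrease. If the target were in fact served by a tower other than the one whose Voronoi polygon contains the observed position, its ID would be missing from a query result and the intersection could become empty or point to a wrong device.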
To sum up, the Bayesian estimates of a device’s location proposed in this paper result in a significant increase in the possible area of the device’s position, and the probability that the device is located beyond the tower’s Voronoi polygon is high. We thus conclude that the traditional Voronoi-based approach to the location privacy of mobile devices is essentially overcautious, and we will further investigate methods for locating mobile devices in a subsequent paper.

References

Bachir, D., Khodabandelou, G., Gauthier, V., El Yacoubi, M., Puchinger, J. (2019). Inferring dynamic origin-destination flows by transport mode using mobile phone data. Transportation Research Part C: Emerging Technologies, [...]
[...] cellphone data. In Blockeel, H., Kersting, K., Nijssen, S., and Železný, F. (Eds.): Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, September 23-27, 2013, Proceedings, Part 3, pp. 663–666.
Bonnel, P., Hombourger, E., Olteanu-Raimond, A.-M., Smoreda, Z. (2015). Passive mobile phone dataset to construct origin-destination matrix: potentials and limitations. Transportation Research Procedia, 11, 381–398. https://doi.org/10.1016/j.trpro.2015.12.032
Bonnetain, L., Furno, A., Krug, J., and Faouzi, N.-E. E. (2019). Can We Map-Match Individual Cellular Network Signaling Trajectories in Urban Environments? Data-Driven Study. Transportation Research Record, 2674(7), pp. 74-88.
Calabrese, F., Colonna, M., Lovisolo, P., Parata, D. and Ratti, C. (2011). Real-Time Urban Monitoring Using Cell Phones: A Case Study in Rome. IEEE Transactions on Intelligent Transportation Systems, 12(1), 141-151.
Candia, J., Gonzalez, M.C., Wang, P., Schoenharl, T., Madey, G., and Barabási, A.L. (2008). Uncovering individual and collective human dynamics from mobile phone records. Journal of Physics A: Mathematical and Theoretical, 41, 1–16. https://doi.org/10.1088/1751-8113/41/22/224015
Cottineau, C., and Vanhoof, M. (2019). Mobile Phone Indicators and their Relation to the Socioeconomic Organization of Cities. ISPRS International Journal of Geo-Information, 8(1), pp. 1-19.
Csáji, B.C., Browet, A., Traag, V.A., Delvenne, J.C., Huens, E., Van Dooren, P., Smoreda, Z. and Blondel, V.D. (2012). Exploring the mobility of mobile phone users. Physica A: Statistical Mechanics and its Applications [...]
[...] Research in Transportation Economics, 63, 73-85.
González, M.C., Hidalgo, C.A. and Barabási, A.L. (2008). Understanding individual human mobility patterns. Nature, 453, 779–782.
Hadachi, A., and Lind, A. (2019). Exploring a New Model for Mobile Positioning Based on CDR Data of The Cellular Networks. arXiv preprint arXiv:1902.09399, 1-13.
Haenggi, M. (2013). Stochastic geometry for wireless networks. Cambridge University Press, Cambridge.
Huang, H., Cheng, Y., and Weibel, R. (2019). Transport mode detection based on mobile phone network data: A systematic review. Transportation Research Part C: Emerging Technologies, 101, pp. 297-312.
Järv, O., Tenkanen, H., and Toivonen, T. (2017). Enhancing spatial accuracy of mobile phone data using multi-temporal dasymetric interpolation. International Journal of Geographical Information Science, 31(8), pp. 1630-1651.
Kalatian, A., and Shafahi, Y. (2016). Travel mode detection exploiting cellular network data. MATEC Web Conf., 81, 03008. https://doi.org/10.1051/matecconf/20168103008
Korhonen, J. (2003). Introduction to 3G mobile communications, 2nd ed., Artech House mobile communications series. Artech House, Boston, MA.
Kung, K.S., Greco, K., Sobolevsky, S. and Ratti, C. (2014). Exploring Universal Patterns in Human Home-Work Commuting from Mobile Phone Data. PLoS ONE, 9(6), e96180.
Louail, T., Lenormand, M., Ros, O.G.C., Picornell, M., Herranz, R., Frias-Martinez, E., Ramasco, J.J. and Barthelemy, M. (2014). From mobile phone data to the spatial structure of cities. Scientific Reports, 4, 5276, 1-12.
Markovic, N., Sekula, P., Vander Laan, Z., Andrienko, G. and Andrienko, N. (2017). Applications of Trajectory Data in Transportation: Literature Review and Maryland Case Study. Available at: http://arXiv:1708.07193v1 [stat.ML] [Accessed: 31 May 2018].
Pei, T., Sobolevsky, S., Ratti, C., Shaw, S.-L., Li, T., and Zhou, C. (2014). A new insight into land use classification based on aggregated mobile phone data. International Journal of Geographical Information Science, 28(9), 1–20.
Pinelli, F., Nair, R., Calabrese, F., Berlingerio, M., Di Lorenzo, G. and Sbodio, M.L. (2016). Data-Driven Transit Network Design from Mobile Phone Trajectories. IEEE Transactions on Intelligent Transportation Systems [...]
[...] Urban Geography, 36(8), 1131–1148.
Reades, J., Calabrese, F., and Ratti, C. (2009). Eigenplaces: Analyzing cities using the space-time structure of the mobile phone network. Environment and Planning B: Planning and Design, [...]
[...] IUSSP Research Workshop on Digital Demography in the Era of Big Data, Seville, pp. 1-33.
Ricciato, F., Widhalm, P., Pantisano, F., and Craglia, M. (2017). Beyond the “single-operator, CDR-only” paradigm: An interoperable framework for mobile phone network data analyses and population density estimation. Pervasive Mob Comput, 35, 65–82. https://doi.org/10.1016/j.pmcj.2016.04.009
Song, C., Qu, Z., Blumm, N., and Barabási, A.-L. (2010). Limits of Predictability in Human Mobility. Science [...]
[...] City limits in the age of smartphones and urban scaling. Computers, Environment and Urban Systems, 79, 101423.
Tennekes, M. (2018). Statistical Inference on Mobile Phone Network Data. Presentation at the European Forum for Geography and Statistics (EFGS 2018), Helsinki, October 16-18, 2018.
Williams, N. E., Thomas, T. A., Dunbar, M., Eagle, N., and Dobra, A. (2015). Measures of Human Mobility Using Mobile Phone Records Enhanced with GIS Data. PLoS ONE, 10(7), pp. 1-16.
Williamson, J. (2010). In defense of objective Bayesianism. Oxford University Press, Oxford, New York.
Wu, Y., Wang, L., Fan, L., Yang, M., Zhang, Y., and Feng, Y. (2020). Comparison of the spatiotemporal mobility patterns among typical subgroups of the actual population with mobile phone data: A case study of Beijing. Cities, 100, 102670.
Xu, Y., Shaw, S.L., Fang, Z. and Yin, L. (2016). Estimating Potential Demand of Bicycle Trips from Mobile Phone Data: An Anchor-Point Based Approach. ISPRS Int. J. Geo-Inf., 5, 131, 1-23.
Zhang, X. (2017). LTE optimization engineering handbook. Wiley, Hoboken, NJ, USA.
Zufiria, P. J., and Hernández-Medina, M. A. (2018). Characterizing the Spatial Distribution of Geolocated Categorical Values. Journal of Applied Physics and Mathematics, 9, pp. 47-53.
Zufiria, P. J., and Hernández-Medina, M. A. (2019). A New Technique Based on Voronoi Tessellation to Assess the Space-Dependence of Categorical Variables. Entropy, 21(8), pp. 1-14.

Acknowledgements