Collisions of uniformly distributed identifiers with an application to MAC address anonymization
Jean-François Determe, Sophia Azzagnuni, Utkarsh Singh, François Horlin, Philippe De Doncker
September 22, 2020
Disclaimer: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Abstract
The main contribution of this paper consists in theoretical approximations of the collision rate of $n$ random identifiers uniformly distributed in $m$ ($> n$) buckets, along with bounds on the approximation errors. A secondary contribution is a decentralized anonymization system of media access control (MAC) addresses with a low collision rate. The main contribution supports the secondary one in that it quantifies its collision rate, thereby allowing designers to minimize $m$ while attaining specific collision rates. Recent works in crowd monitoring based on WiFi probe requests, for which collected MAC addresses should be anonymized, have inspired this research.
I Introduction
Widely used structures in computer science associate inputs with outputs that are approximately uniformly distributed in the set of all possible outputs. Hash tables [1, Chap. 11], cryptographic hash functions and token generators (used for anonymization or security purposes) are examples of such structures [2, Sec. 9.7.1]. An issue is that these structures may generate collisions, that is, two different inputs being mapped onto the same output [3, Sec. 9].

The main contribution of this paper is a set of numerically stable estimates of the collision rate (the average number of collisions divided by the number of inputs) in a hash-based or token-based system; our estimates assume that the hash function or token generator yields uniformly distributed outputs. We also derive bounds on the error of these estimates.

The main contribution supports our secondary one, a decentralized anonymization procedure for media access control (MAC) addresses. Specifically, our main contribution quantifies the collision rate of our secondary one, thereby making it possible to tune the parameters of the anonymization procedure so as to attain arbitrarily small collision rates (at the cost of higher bandwidth and storage requirements).

We present our secondary contribution through the lens of a crowd monitoring system based on WiFi signals. Anonymizing MAC addresses from WiFi probe requests (PRs) [4, Fig. 4-52] is indeed required in networks of sensors used for crowd monitoring [5–7], [8, Sec. 7]. Our scheme prevents user tracking, and time synchronization accuracy is no issue on modern networks. The main contribution is general and could be of interest to researchers and engineers pursuing endeavors other than our secondary contribution.

∗ All authors are with the OPERA Wireless Communications Group, Université libre de Bruxelles, 1050 Brussels, Belgium. Corresponding e-mail: [email protected]. Innoviris funded Jean-François Determe and Utkarsh Singh.

The authors of [9, Sec. 5] succinctly mentioned using random binary sequences appended to the MAC addresses prior to hashing (or to replace MAC addresses with tokens, more specifically, universally unique identifiers (UUIDs) [10]). Our secondary contribution uses a similar idea, except that we prepend random sequences that a central server partially generates and then shares with time-synchronized sensors. Each sequence is used simultaneously by all sensors during one minute, a time after which the server and the sensors erase it. Thus, brute force attacks consist in recovering a pepper of high entropy instead of MAC addresses, whose entropy is too low to withstand such attacks [9, 11, 12]. We also split peppers into two parts, with one unknown to the server.

Section II focuses on the main contribution, which lays the foundations for a theoretical validation of our secondary contribution, which Section III presents.
II Expected number of collisions with uniformly distributed identifiers
Variable $m$ denotes a number of possible outputs, such that $\log_2(m) \in \mathbb{N}$, and $\{0,1\}^{\gamma}$ denotes the set of all binary sequences of $\gamma$ bits. We consider a function $h : \mathcal{X} \to \{0,1\}^{\log_2(m)}$ (with $n := \operatorname{card}(\mathcal{X})$). Hereafter, $h$ is either a hash function or a token generator, whose output is approximately uniformly distributed in $\{0,1\}^{\log_2(m)}$ [2, Sec. 9.7.1].

Following standard terminology in the study of hash tables, we refer to $m$ and $n$ as the number of buckets and the number of inserts, respectively. Similarly, $\alpha := n/m$ is called the load factor. Finally, $Y^{(n,m)}$ denotes the (random) number of collisions when inserting $n$ values into $m$ buckets (with the uniform distribution assumption). Theorem 1 provides the exact, yet numerically unstable, formula of $\mathbb{E}[Y^{(n,m)}]$. The numerical instability appears for sufficiently high values of $m$.
Theorem 1.
For $n$ inserts into $m$ buckets, the collision rate, $\mathbb{E}[Y^{(n,m)}]/n$, is
$$\frac{\mathbb{E}\left[Y^{(n,m)}\right]}{n} = 1 - \frac{m}{n}\left(1 - \left(\frac{m-1}{m}\right)^{n}\right), \tag{1}$$
where the uniform distribution assumption has been used.
Proof. See the Appendix.
Theorem 2 proposes three approximations of $\mathbb{E}[Y^{(n,m)}]/n$.
Theorem 2.
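For a degree of approximation $K \ge 1$, a number of inserts $n \ge 2$, and a load factor $\alpha \le 1$, there exist error terms $\delta(\alpha, n)$ and $R_{K-1}(\alpha)$ such that
$$\frac{\mathbb{E}\left[Y^{(n,m)}\right]}{n} = 1 - \alpha^{-1}\left(1 - \exp(-\alpha)\right) + \delta(\alpha, n) \tag{2}$$
$$= \sum_{k=1}^{K-1} \alpha^{k}\,\frac{(-1)^{k+1}}{(k+1)!} + \delta(\alpha, n) + R_{K-1}(\alpha) \tag{3}$$
$$= \frac{\alpha}{2} + \delta(\alpha, n) + R_{1}(\alpha), \tag{4}$$
where
$$-\sqrt{\frac{\alpha^{2}}{n^{2}-\alpha^{2}}\left(\frac{\pi^{2}}{6}-1\right)} \le \delta(\alpha, n) \le 0, \tag{5}$$
$$\left|R_{K-1}(\alpha)\right| \le \frac{\alpha^{K}}{(K+1)!}, \tag{6}$$
and, in particular,
$$\frac{\left|R_{1}(\alpha)\right|}{\alpha/2} \le \frac{\alpha}{3}. \tag{7}$$
Proof.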
See the Appendix.

Equation (2) yields a first approximation that is not numerically stable for low values of $\alpha$. Equation (3) provides a numerically stable approximation whose precision is controlled through $K$. The error term $\delta(\alpha, n)$ quantifies to what extent $(1 - \alpha/n)^{n}$ accurately approximates $\exp(-\alpha)$ (in particular, $\lim_{n \to \infty} (1 - \alpha/n)^{n} = \exp(-\alpha)$). The term $R_{K-1}(\alpha)$ is the error tied to approximating $\exp(-\alpha)$ using its $K$th-order Taylor polynomial.

For low values of $\alpha$ (e.g., $\alpha \le 10^{-3}$), (4) is an accurate approximation because $|R_{1}(10^{-3})| / (10^{-3}/2) \le 10^{-3}/3$, that is, $|R_{1}(\alpha)|$ is less than 0.1 % of the approximated value $\alpha/2$. For $\alpha \le 1$ and $n$ high enough, $\alpha^{2}/(n^{2} - \alpha^{2}) \simeq \alpha^{2}/n^{2} = 1/m^{2}$. Thus, by (5), $|\delta(\alpha, n)| \le 0.81\, m^{-1}$, which is below $10^{-19}$ for $m \ge 2^{64}$.
III The probe request anonymization procedure
III.A System overview
We now turn to the anonymization procedure of MAC addresses, whose theoretical validation relies on Theorem 2. As depicted in Figure 1, we designed a system comprising i) several time-synchronized WiFi sensors collecting PRs in their respective locations and ii) a central server collecting as well as processing PRs. The sensors may have overlapping ranges; thus, in order to detect identical PRs, sensors must generate source address (SA) identifiers that are identical for a given MAC address and time instant. Within the framework of crowd monitoring, the central server computes the rate at which PRs are sent (over time frames of one minute) and then derives an estimate of the number of people in the area covered [6].

There are four requirements our system should meet; SA identifiers should i) be identical across all sensors at any time instant, ii) not allow anyone to recover the original MAC address from the corresponding identifier alone, iii) not allow tracking for more than one minute, and iv) have a collision rate of less than $10^{-12}$ for $10^{7}$ MAC addresses per time frame. (A collision is defined as two SAs being mapped onto the same SA identifier.) The fourth point means that the collision rate remains negligible for up to $10^{7}$ WiFi devices.

Requirements ii) and iii) guarantee privacy. Requirements i) and iv) enable the central server to compute accurate attendee counts. Should Requirement i) not be met, sensors would return different SA identifiers for identical devices simultaneously detected (because of overlapping detection ranges), thereby inducing a positive counting bias. Requirement iv) ensures a negligible probability of two devices being identified as a single one (which creates a negative counting bias).

We use the SHA-256 hash function in conjunction with a pepper and truncate its output to 64 bits. Thus, $h : \mathcal{X} \to \{0,1\}^{64}$ is a truncated SHA-256 hash function whose inputs are 48-bit MAC addresses ($\mathcal{X} = \{0,1\}^{48}$). On the server, we could also generate uniformly distributed identifiers from the hashed identifiers; in this case, the server waits a while until all PRs for a given time frame have been transmitted.

We prepend a time-varying pepper to every MAC address before hashing it. With + denoting the concatenation operation, and mac_address and global_pepper representing respectively the MAC address to be anonymized and the pepper prepended, h(global_pepper + mac_address) generates the SA identifier. As shown in Figure 2, sensors collect a timestamp, a received signal strength indicator (RSSI), and a source address (the MAC address).

Figure 1: Scheme of the sensing procedure. Three WiFi sensors with overlapping ranges detect WiFi probe requests emitted by the smartphones of individuals. The shaded ellipses and the associated cones depict sensor detection ranges. Each sensor uses HTTPS links to periodically retrieve server peppers from the central server and uses another HTTPS link to upload anonymized source address identifiers. Time synchronization is achieved by calibration with network time protocol (NTP) servers. Communication links are depicted for only one sensor, to avoid clutter.

The pepper consists in a concatenation of a fixed 128-bit sensor pepper and a time-varying 128-bit server pepper. The central server maintains an up-to-date array of 20 server peppers for a duration of 20 minutes that sensors periodically fetch using an HTTPS link with Transport Layer Security (TLS). Sensors use each server pepper for a specific one-minute time frame. Server peppers are generated using a pseudorandom number generator (PRNG) (e.g., /dev/urandom or /dev/random on Linux). If this PRNG is deemed not secure (see [13]), hardware random number generators are alternatives too [14, 15]. We can also generate a specific set of peppers for each cluster of sensors.

The server and the sensors delete server peppers once they become outdated; in particular, the sensors erase the volatile memory chunk storing server peppers before updating it with new peppers retrieved from the server.

The fixed sensor pepper forms a last line of defense in case the server peppers get compromised. It is written in a file or in the codebase of the sniffer, and it is never stored on the server. We proposed a fixed sensor pepper, but storing in advance sensor peppers for time frames of one minute is possible too; for instance, it represents 42 MB of data for five years.

Let us now prove that our four requirements are met.
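[Figure 2 diagram: the WiFi device provides a timestamp, an RSSI, and a source address (MAC) to the sensor computer. The sensor copies the timestamp and the RSSI, and replaces the source address with an SA identifier obtained through a peppered SHA-256 hash. The 256-bit pepper combines the sensor pepper (128 bits, fixed) and the server pepper (128 bits, updated every minute).]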
Figure 2: Scheme of the anonymization procedure executed by sensors.
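As an illustration of the procedure depicted in Figure 2, the following Python sketch computes an SA identifier from a MAC address and the two peppers. It is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `sa_identifier`, the ordering of the two peppers inside the concatenation, the choice of keeping the first 8 bytes of the digest, and the big-endian integer conversion are all hypothetical.

```python
import hashlib

SENSOR_PEPPER_BYTES = 16  # fixed 128-bit sensor pepper
SERVER_PEPPER_BYTES = 16  # 128-bit server pepper, renewed every minute
ID_BYTES = 8              # SHA-256 output truncated to 64 bits


def sa_identifier(mac: bytes, sensor_pepper: bytes, server_pepper: bytes) -> int:
    """Return a 64-bit SA identifier for a 48-bit MAC address."""
    if len(mac) != 6:
        raise ValueError("a MAC address is 6 bytes long")
    if len(sensor_pepper) != SENSOR_PEPPER_BYTES:
        raise ValueError("the sensor pepper must be 128 bits long")
    if len(server_pepper) != SERVER_PEPPER_BYTES:
        raise ValueError("the server pepper must be 128 bits long")
    # Assumption: the sensor pepper comes first in the 256-bit global pepper.
    global_pepper = sensor_pepper + server_pepper
    # The pepper is prepended to the MAC address before hashing.
    digest = hashlib.sha256(global_pepper + mac).digest()
    # Assumption: truncation keeps the first 8 bytes of the digest.
    return int.from_bytes(digest[:ID_BYTES], "big")
```

Two sensors holding the same peppers obtain the same identifier for the same MAC address, which is what Requirement i) relies on.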
III.B Requirement 1: peppers are identical across all sensors at a given time instant
This requirement depends on the accuracy of time synchronization. We propose to use the network time protocol (NTP), which provides accurate time synchronization on low-latency networks (e.g., 4G networks with timing errors lower than 10 ms [16]). There could be synchronization-related mismatches at the boundaries of consecutive one-minute time frames, but only for 20 ms / 60000 ms = 0.033 % of their duration.
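The arithmetic above can be sketched in Python as follows. This is an illustrative sketch only: the names `frame_index` and `boundary_mismatch_fraction` are hypothetical, and reading the 20 ms window as twice the 10 ms timing error (one on each side of a frame boundary) is an assumption.

```python
FRAME_SECONDS = 60  # one-minute time frames


def frame_index(unix_time_s: float) -> int:
    """Index of the one-minute time frame containing a timestamp."""
    return int(unix_time_s // FRAME_SECONDS)


def boundary_mismatch_fraction(max_clock_error_s: float) -> float:
    """Fraction of a frame during which two sensors may select different frames.

    Assumption: each clock deviates by at most max_clock_error_s, so that
    sensors may disagree within a window of twice that error around each
    frame boundary.
    """
    return 2 * max_clock_error_s / FRAME_SECONDS


if __name__ == "__main__":
    # 10 ms timing errors: 20 ms / 60000 ms, i.e., about 0.033 %
    print(f"{100 * boundary_mismatch_fraction(0.010):.3f} %")
```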
III.C Requirement 2: impossibility to recover the original MAC address from anonymous identifiers
Cryptographic hash functions like SHA-256 cannot be directly reversed; in practice, reversing consists in trying many of the possible inputs until finding one whose hash is the output to be reversed. Assuming an attacker knows the input MAC address of a particular entry in the list of PRs, brute forcing the pepper entails testing many of the 256-bit sequences that exist (on average, half of them should be tested). For example, 1 million Nvidia RTX 2080 SUPER Founders Edition graphics cards can compute roughly 5700 SHA-256 TeraHashes per second [17]; this implies that testing all 256-bit peppers (approximately $1.16 \cdot 10^{65}$ TeraHashes) takes $2.04 \cdot 10^{61}$ seconds, i.e., $6.47 \cdot 10^{53}$ years. Should one of the two 128-bit peppers be known to an attacker, testing all 128-bit sequences still takes roughly $1.90 \cdot 10^{15}$ years. We point out that relying on a regular SHA-256 hash function without peppers is not safe (see [9, 12] and [11, Sec. VI]) as the entropy of MAC addresses is too low to resist brute force attacks. Moreover, using computationally intensive hashes like bcrypt [18] and Argon2 [19] would imply unreasonable computational requirements for sensors (see also [9, Sec. 5]).
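The orders of magnitude above can be checked with a few lines of Python. This is a hedged sketch: the function name `exhaustive_search_years` is hypothetical, the hash rate is the 5700 TeraHashes per second quoted from [17] for one million cards, and a 365-day year is assumed.

```python
HASHES_PER_SECOND = 5700e12          # 5700 SHA-256 TeraHashes per second [17]
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: 365-day year


def exhaustive_search_years(pepper_bits: int) -> float:
    """Years needed to hash every possible pepper of the given bit length."""
    return 2**pepper_bits / HASHES_PER_SECOND / SECONDS_PER_YEAR


if __name__ == "__main__":
    for bits in (256, 128):
        print(f"{bits}-bit pepper: {exhaustive_search_years(bits):.2e} years")
```

The results are of the order of $10^{53}$ years for 256 bits and $10^{15}$ years for 128 bits; small differences in the leading digits come from rounding intermediate values and from the length assumed for a year.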
III.D Requirement 3: preventing tracking for more than one minute
This requirement is linked to server peppers being updated between consecutive time frames of one minute (as mentioned in Section III.A). In particular, the avalanche effect of SHA-256 hash functions makes hashing with different peppers return incomparable SA identifiers for a given MAC address.
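The avalanche effect can be observed empirically with the short Python sketch below, which hashes one MAC address with two independent peppers and counts the bits in which the two 64-bit identifiers differ. The helper names `truncated_hash` and `differing_bits` are hypothetical, and the peppers are drawn with the standard `secrets` module purely for illustration.

```python
import hashlib
import secrets


def truncated_hash(pepper: bytes, mac: bytes) -> int:
    """64-bit identifier: first 8 bytes of SHA-256(pepper + mac)."""
    return int.from_bytes(hashlib.sha256(pepper + mac).digest()[:8], "big")


def differing_bits(a: int, b: int) -> int:
    """Number of bit positions in which two identifiers differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    mac = bytes.fromhex("a1b2c3d4e5f6")
    pepper_frame_0 = secrets.token_bytes(32)  # pepper of one time frame
    pepper_frame_1 = secrets.token_bytes(32)  # pepper of the next time frame
    id_0 = truncated_hash(pepper_frame_0, mac)
    id_1 = truncated_hash(pepper_frame_1, mac)
    # For a well-behaved hash, about half of the 64 bits are expected to differ.
    print(differing_bits(id_0, id_1))
```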
III.E Requirement 4: a collision rate of less than $10^{-12}$ for $10^{7}$ MAC addresses
We have $m = 2^{64} \simeq 1.84 \cdot 10^{19}$, which means that we truncate SHA-256 hashes to 64 bits. This corresponds to a load factor $\alpha = 10^{7} (1.84 \cdot 10^{19})^{-1} \simeq 10^{-12}$ for $n = 10^{7}$ MAC addresses. As shown in the paragraph below, it yields a collision rate of about $10^{-12.5}$, and it makes computer implementations of the system straightforward (most of the databases on the market support 64-bit integer/BIGINT fields).

Figure 3 shows that the collision rate is approximately equal to $10^{-12.5}$. For $\alpha$ sufficiently low (e.g., $\alpha \le 10^{-3}$), the approximation becomes (4), which explains why the level sets in Figure 3, whose axes and levels are in $\log_{10}$ scale, appear to be straight lines.

Note that approximation errors are negligible. Our load factor $\alpha \simeq 10^{-12}$ implies (for $K \ge 2$) $|R_{K-1}(\alpha)| \le \alpha^{2}/3! \le 10^{-24}$. Moreover, as already pointed out in our comment of Theorem 2, for $m \ge 2^{64}$, $|\delta(\alpha, n)| \le 0.81\, m^{-1} \le 10^{-19}$.
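The numerical behavior discussed in Section II can be reproduced with the Python sketch below, which compares a direct double-precision evaluation of (1), an exact rational evaluation of (1), and the series (3) without its error terms. The function names are hypothetical, and $n = 10^{7}$ with $m = 2^{64}$ is used here as an illustrative operating point.

```python
import math
from fractions import Fraction


def collision_rate_naive(n: int, m: int) -> float:
    """Equation (1) evaluated directly in double precision."""
    return 1.0 - (m / n) * (1.0 - ((m - 1) / m) ** n)


def collision_rate_exact(n: int, m: int) -> float:
    """Equation (1) in exact rational arithmetic (practical for small n only)."""
    rate = 1 - Fraction(m, n) * (1 - Fraction(m - 1, m) ** n)
    return float(rate)


def collision_rate_series(n: int, m: int, K: int = 4) -> float:
    """Equation (3) without the error terms delta and R_{K-1}."""
    alpha = n / m
    return sum(
        (-1) ** (k + 1) * alpha**k / math.factorial(k + 1) for k in range(1, K)
    )


def delta_lower_bound(n: int, m: int) -> float:
    """Lower bound on delta(alpha, n) given by (5)."""
    alpha = n / m
    return -math.sqrt(alpha**2 / (n**2 - alpha**2) * (math.pi**2 / 6 - 1))


if __name__ == "__main__":
    n, m = 10**7, 2**64
    print(collision_rate_naive(n, m))               # unstable evaluation
    print(math.log10(collision_rate_series(n, m)))  # stable evaluation
```

In double precision, $(m-1)/m$ rounds to 1 for $m = 2^{64}$, so the direct evaluation returns 1.0 instead of a small rate, whereas the series gives a rate close to $\alpha/2$, i.e., about $10^{-12.57}$ for these values.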
Figure 3: Level sets of the approximation (3) of the collision rate as a function of the number of inserts $n$ and the number of buckets $m$.
Acknowledgments
We thank Innoviris for funding this research through the MUFINS project. We also thank the Icity.Brussels project and FEDER/EFRO grant for the support they provided.
Appendix: Proofs
A Proof of Theorem 1
Let $p_j$ denote the probability that the $j$th ($1 \le j \le m$) bucket is empty after $n$ inserts. All inserts have equal probabilities to fall within each bucket, and whether an insert ends up in one bucket is independent of which buckets are already occupied. As a result, we have $p_j = ((m-1)/m)^{n}$ ($n$ inserts and, for each insert, a probability $(m-1)/m$ that it ends up in any bucket except the $j$th one). The expectation of the number of empty buckets is equal to $\sum_{j=1}^{m} \mathbb{E}[A_j] = m((m-1)/m)^{n}$, where $A_j = 1$ if the $j$th bucket is empty and equals 0 otherwise. Hence, the expectation of the number of occupied buckets is $m - m((m-1)/m)^{n}$. As the number of collisions is equal to $n$ minus the number of occupied buckets, the proof is complete.
B Lemmas for Theorem 2
To prove Theorem 2, we shall first derive two lemmas. Lemma 1 quantifies to what extent $\left(1 - \frac{\alpha}{n}\right)^{n}$ is a good approximation of $\exp(-\alpha)$.
Lemma 1.
For $n \ge 2$ and $\alpha < n$,
$$\left(1 - \frac{\alpha}{n}\right)^{n} = \exp(-\alpha)\, F(\alpha, n),$$
where
$$\exp\left(-\alpha^{2} \sqrt{\frac{1}{n^{2}-\alpha^{2}}\left(\frac{\pi^{2}}{6}-1\right)}\right) \le F(\alpha, n) \le 1.$$
Proof.
For $0 \le \alpha/n < 1$,
$$\left(1 - \frac{\alpha}{n}\right)^{n} = \exp\left(n \log\left(1 - \frac{\alpha}{n}\right)\right) = \exp\left(-n \sum_{k=1}^{\infty} \frac{(\alpha/n)^{k}}{k}\right) = \exp\left(-\alpha\left(1 + \sum_{k=1}^{\infty} \frac{(\alpha/n)^{k}}{k+1}\right)\right). \tag{8}$$
Defining $f^{(K)}(\alpha, n) := \sum_{k=1}^{K} (\alpha/n)^{k}/(k+1)$, we have $0 < f^{(1)}(\alpha, n) < f^{(2)}(\alpha, n) < \cdots$ so that if, for all $K$, $f^{(K)}(\alpha, n) \le \xi(\alpha, n)$, then $\sum_{k=1}^{\infty} (\alpha/n)^{k}/(k+1) \le \xi(\alpha, n)$. The sum in $f^{(K)}(\alpha, n)$ is the inner product between vectors $((\alpha/n)^{k})_{1 \le k \le K}$ and $(1/(k+1))_{1 \le k \le K}$. The Cauchy-Schwarz inequality yields:
$$f^{(K)}(\alpha, n) \le \sqrt{\left\| \left(\frac{\alpha^{k}}{n^{k}}\right)_{1 \le k \le K} \right\|_{2}^{2} \left\| \left(\frac{1}{k+1}\right)_{1 \le k \le K} \right\|_{2}^{2}}.$$
We have, using the limit of the geometric series,
$$\left\| \left(\frac{\alpha^{k}}{n^{k}}\right)_{1 \le k \le K} \right\|_{2}^{2} = \sum_{k=1}^{K} \left(\left(\frac{\alpha}{n}\right)^{k}\right)^{2} = \sum_{k=0}^{K} \left(\left(\frac{\alpha}{n}\right)^{2}\right)^{k} - 1 \le \sum_{k=0}^{\infty} \left(\left(\frac{\alpha}{n}\right)^{2}\right)^{k} - 1 = \frac{1}{1 - \alpha^{2}/n^{2}} - 1 = \frac{\alpha^{2}}{n^{2} - \alpha^{2}}.$$
Moreover,
$$\left\| \left(\frac{1}{k+1}\right)_{1 \le k \le K} \right\|_{2}^{2} = \sum_{k=1}^{K+1} \frac{1}{k^{2}} - 1 \le \sum_{k=1}^{\infty} \frac{1}{k^{2}} - 1 = \zeta(2) - 1,$$
where $\zeta(2)$ is the Riemann zeta function evaluated at 2, which is equal to $\pi^{2}/6$. Therefore, the upper bound $\xi(\alpha, n)$ may be
$$\xi(\alpha, n) := \alpha \sqrt{\frac{1}{n^{2} - \alpha^{2}}} \sqrt{\frac{\pi^{2}}{6} - 1}. \tag{9}$$
Injecting these results in (8), as well as noticing that $\sum_{k=1}^{\infty} (\alpha/n)^{k}/(k+1) \ge 0$, concludes the proof.
Lemma 2 deals with the Taylor approximation of $\alpha^{-1}(1 - \exp(-\alpha))$.
Lemma 2.
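For $0 < \alpha < 1$, $K \ge 1$ and $g : [0, 1] \subset \mathbb{R} \to [0, \infty) : \alpha \mapsto g(\alpha) = \alpha^{-1}(1 - \exp(-\alpha))$,
$$g(\alpha) = \sum_{k=0}^{K-1} \frac{\alpha^{k}}{(k+1)!} (-1)^{k} + R_{K-1}(\alpha),$$
where
$$\left|R_{K-1}(\alpha)\right| \le \frac{\alpha^{K}}{(K+1)!}.$$
Proof.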
With $\ell(\alpha) := 1 - \exp(-\alpha)$, it is easy to compute that
$$\frac{\mathrm{d}^{k} \ell}{\mathrm{d} \alpha^{k}}(x) = (-1)^{k+1} \exp(-x).$$
Thus,
$$\max_{x \in [0, 1]} \left| \frac{\mathrm{d}^{k} \ell}{\mathrm{d} \alpha^{k}}(x) \right| = 1.$$
Hence, the $K$th-order Taylor polynomial of $\ell(\alpha)$ around zero has a remainder $R'_{K}(\alpha)$, for which $|R'_{K}(\alpha)| \le \alpha^{K+1}/(K+1)!$. The desired $(K-1)$th-order polynomial approximating $\alpha^{-1}(1 - \exp(-\alpha))$ is
$$\alpha^{-1} \left(1 - \sum_{k=0}^{K} \frac{\alpha^{k}}{k!} (-1)^{k}\right) = \sum_{k=0}^{K-1} \frac{\alpha^{k}}{(k+1)!} (-1)^{k},$$
and the $(K-1)$th-order remainder is $R_{K-1}(\alpha) = \alpha^{-1} R'_{K}(\alpha)$ and satisfies $|R_{K-1}(\alpha)| \le \alpha^{K}/(K+1)!$.
C Proof of Theorem 2
With $\alpha = n/m$, Theorem 1 and Lemma 1, we derive
$$\frac{\mathbb{E}\left[Y^{(n,m)}\right]}{n} = 1 - \frac{m}{n}\left(1 - \left(\frac{m-1}{m}\right)^{n}\right) = 1 - \alpha^{-1}\left(1 - (1 - \alpha/n)^{n}\right) = 1 - \alpha^{-1}\left(1 - \exp(-\alpha) F(\alpha, n)\right).$$
For $n \ge 2$ and $\alpha < 1$, $\alpha^{2} \sqrt{\frac{1}{n^{2}-\alpha^{2}}\left(\frac{\pi^{2}}{6}-1\right)}$ is monotonically decreasing with $n$ and monotonically increasing with $\alpha$, and it is approximately equal to $0.46 < 1$ for $n = 2$ and $\alpha = 1$. Hence, for $n \ge 2$, $1 - x \le \exp(-x)$ (for $x < 1$) and
$$1 - \alpha^{2} \sqrt{\frac{1}{n^{2}-\alpha^{2}}\left(\frac{\pi^{2}}{6}-1\right)} \le F(\alpha, n) \le 1.$$
Therefore,
$$\theta(\alpha, n) \le \frac{\mathbb{E}\left[Y^{(n,m)}\right]}{n} - \left(1 - \alpha^{-1}(1 - \exp(-\alpha))\right) \le 0,$$
where, for $\alpha \in [0, 1]$,
$$\theta(\alpha, n) := -\alpha^{-1} \exp(-\alpha)\, \alpha^{2} \sqrt{\frac{1}{n^{2}-\alpha^{2}}\left(\frac{\pi^{2}}{6}-1\right)} \ge -\sqrt{\frac{\alpha^{2}}{n^{2}-\alpha^{2}}\left(\frac{\pi^{2}}{6}-1\right)}$$
because $-\exp(-\alpha) \ge -\exp(0) = -1$ for $\alpha \in [0, 1]$. This proves (2) and (5). Finally, Lemma 2 yields
$$1 - \alpha^{-1}(1 - \exp(-\alpha)) = 1 - \sum_{k=0}^{K-1} \frac{\alpha^{k}}{(k+1)!} (-1)^{k} + R_{K-1}(\alpha) = \sum_{k=1}^{K-1} \frac{\alpha^{k}}{(k+1)!} (-1)^{k+1} + R_{K-1}(\alpha),$$
which proves (3). Deriving (4) and (7) is straightforward.
References
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2009.
[2] A. J. Menezes, J. Katz, P. C. Van Oorschot, and S. A. Vanstone,
Handbook of Applied Cryptography. CRC Press, 1996.
[3] M. J. Wiener, “The full cost of cryptanalytic attacks,” Journal of Cryptology, vol. 17, no. 2, pp. 105–124, 2004.
[4] M. Gast, . O’Reilly Media, Inc., 2005.
[5] M. Uras, R. Cossu, E. Ferrara, A. Liotta, and L. Atzori, “PmA: A real-world system for people mobility monitoring and analysis based on Wi-Fi probes,” Journal of Cleaner Production, p. 122084, 2020.
[6] J.-F. Determe, U. Singh, F. Horlin, and P. De Doncker, “Forecasting Crowd Counts With Wi-Fi Systems: Univariate, Non-Seasonal Models,” IEEE Transactions on Intelligent Transportation Systems, 2020.
[7] U. Singh, J.-F. Determe, F. Horlin, and P. De Doncker, “Crowd Forecasting based on WiFi Sensors and LSTM Neural Networks,” IEEE Transactions on Instrumentation and Measurement, 2020.
[8] U. Singh, J.-F. Determe, F. Horlin, and P. D. Doncker, “Crowd monitoring: State-of-the-art and future directions,” IETE Technical Review, 2020. [Online]. Available: https://doi.org/10.1080/02564602.2020.1803152
[9] L. Demir, M. Cunche, and C. Lauradoux, “Analysing the privacy policies of Wi-Fi trackers,” in Proceedings of the 2014 workshop on physical analytics, 2014, pp. 39–44.
[10] P. Leach, M. Mealling, and R. Salz, “A universally unique identifier (UUID) URN namespace,” 2005.
[11] L. Demir, A. Kumar, M. Cunche, and C. Lauradoux, “The pitfalls of hashing for privacy,” IEEE Communications Surveys & Tutorials, vol. 20, no. 1, pp. 551–565, 2017.
[12] M. Marx, E. Zimmer, T. Mueller, M. Blochberger, and H. Federrath, “Hashing of personally identifiable information is not sufficient,” SICHERHEIT 2018, 2018.
[13] Y. Dodis, D. Pointcheval, S. Ruhault, D. Vergniaud, and D. Wichs, “Security analysis of pseudo-random number generators with input: /dev/random is not robust,” in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, 2013, pp. 647–658.
[14] M. Stipčević and B. M. Rogina, “Quantum random number generator based on photonic emission in semiconductors,” Review of Scientific Instruments, vol. 78, no. 4, p. 045104, 2007.
[15] Z. Zheng, Y. Zhang, W. Huang, S. Yu, and H. Guo, “6 Gbps real-time optical quantum random number generator based on vacuum fluctuation,” Review of Scientific Instruments, vol. 90, no. 4, p. 043105, 2019.
[16] R. Miškinis, D. Jokubauskis, D. Smirnov, E. Urba, B. Malyško, B. Dzindzelėta, and K. Svirskas, “Timing over a 4G (LTE) mobile network,” in . IEEE, 2014, pp. 491–493.
[17] “Nvidia RTX 2080 SUPER FE Hashcat Benchmarks,” https://gist.github.com/epixoip/47098d25f171ec1808b519615be1b90d, accessed: 2020-08-13.
[18] N. Provos and D. Mazieres, “A future-adaptable password scheme.” in