Approach for GDPR Compliant Detection of COVID-19 Infection Chains

Abstract

While the prospect of tracking mobile device users is widely discussed across European countries as a means to counteract the propagation of COVID-19, we propose a Bloom filter based construction that provides location privacy for users and prevents mass surveillance. We apply a solution based on the Bloom filter data structure that allows a third party, a government agency, to compute privacy-preserving set relations on a mobile telco's access logfile. By computing set relations, the government agency, given knowledge of two identified persons, obtains an instrument that provides a (possible) infection chain from the initial to the final infected user, no matter where in the world they are located. The benefit of our approach is that intermediate, possibly infected users can be identified and subsequently contacted by the agency. With this approach, we claim that only the identities of possibly infected users are revealed and that the location privacy of all others is preserved. To this extent, it meets General Data Protection Regulation (GDPR) requirements in this area.


Louis Tajan Dirk Westhoff

Hochschule Offenburg University of Applied Sciences, Offenburg, Germany
{louis.tajan, dirk.westhoff}@hs-offenburg.de

Index Terms: mobile user tracking, Bloom filters, set relations, geo-location harvesting, virus propagation

I. INTRODUCTION

Cases of COVID-19 have been reported in more than 190 countries, and the World Health Organization characterized its spread as a pandemic on 11.03.2020. One of its many side effects is that European democracies are being challenged. Indeed, several countries are collecting location-based data from their own citizens. A state of emergency for health reasons has been declared in countries such as Spain, Portugal, France and Switzerland. Such a situation empowers a government to take actions that it would normally not be allowed to undertake. For instance, in Milan, Italy, mobile network operators are providing information on users' traffic to public authorities. In Germany, how and for which purposes location-based information may be processed is among the most discussed issues. Efforts in Germany to provide digital support for detecting infection chains are twofold.

First, the Corona-Warn-App has been deployed and downloaded more than 15 million times in Germany (population of approx. 83 million). It is a tracking app based on Bluetooth in which the smartphone of an infected user subsequently informs all devices that have been in proximity (within beacon reception range at some point in the past). Such an approach is very vulnerable because it requires Bluetooth to be continuously activated. The recently published BIAS [1] and BlueBorne [2] families of attacks have shown that mobile devices with activated Bluetooth can easily be subjected to remote code execution, e.g. CVE-2017-078, CVE-2017-0782 or CVE-2017-14315, which are classified as a severe risk. Moreover, it has been pointed out that harvesting contacts via Bluetooth with a tracking app only works properly if the app is continuously active in the foreground and, furthermore, that at least [...] of the smartphone users need to download and continuously use it for it to have an impact on the identification of infection chains.

Second, telco operators would provide access logfiles of mobile network base stations to the RKI (Robert Koch Institute) to support inferring infection chains.

In contrast, the government of the Netherlands decided not to approve a general confinement, on the grounds that it is incompatible with individual freedom.

For these reasons, in the work at hand we propose a construction that combines efficiency in helping public authorities contain the spread of the virus with privacy for citizens. We therefore concentrate on providing a privacy-preserving solution for the second effort currently under way in Germany. Our proposed solution makes use of our previous works [3], [4], which allow a non-trusted third party to privately compute operations and relations on sets using the Bloom filter data structure. This data structure represents a large set of elements in a simple array of bits, which can provide obfuscation and privacy for the set.

We recall that the two main objectives of the GDPR are, first, to enhance the protection of personal data when they are processed and, second, to empower the companies in charge of this processing. Even if this regulation does not apply to fields such as public health or national security [5], weaving the proposed Bloom filter based private protocols into the investigation of infection chains would limit government agencies to identifying only users with a high probability of being infected, instead of performing a massive data analysis of all mobile users.

A. Related Work

Several approaches from related work make it possible to perform computations on pseudonymized, obfuscated or even encrypted data without needing to read them in the clear. Homomorphic encryption [6], [7] and multi-party computation [8], [9] are the most investigated techniques. In [10], we applied our Bloom filter based construction to several use cases of post-mortem mobile device tracking. In our former work [4], we have shown that this alternative approach based on Bloom filters can be used to secure data while preserving the ability to perform relevant tests or computations on the private data. Bloom filters have been used in many different scenarios, as presented in [11]. For instance, Kerschbaum directly encrypts the Bloom filter with homomorphic encryption [7]. In [12], the authors applied Bloom filters to key exchange mechanisms in wireless sensor network (WSN) environments, while in [13] the authors optimize sensor node broadcasting with Bloom filters.

Regarding privacy-preserving location tracing solutions in the context of the spread of COVID-19, we mention the work of the PEPP-PT consortium [14]. This European team provides standards, technology and services to countries and developers with the objective of helping to stop the spread of COVID-19.

II. SCENARIO

A government agency, whose role is to reduce the spread of the SARS-CoV-2 virus in its country, knows different pairs of infected persons (A, B). Its objective is to identify all possible paths that link user A to user B, considering the case where the infection of user B is a consequence of user A's infection. By retrieving all possible paths (it may of course turn out that no path exists and that the infections of users A and B are unrelated), the agency can identify all users on these paths who may also be infected and try to contact them. Indeed, users of different mobile devices who are close to the same mobile base station at the same time could potentially spread the virus if one of them is infected. To do so, the agency analyzes connection data provided by a telco company. The connection logs are collected at the base stations that provide network access to the users' mobile devices.

A. Parties Involved

Four parties are involved in the scenario:

Users: may be infected by the SARS-CoV-2 virus. They connect to the base stations to access the mobile network.

Telco company: provides network access to the users via several base stations. It also provides log data on the network connections to government agencies.

Base stations: are distributed over several countries, provide network access to the users' mobile devices and collect connection data.

Government agency: aims to identify "infection chains" in order to contact the possibly infected users and counteract the COVID-19 pandemic.

B. Collecting Connection Data

For each base station $j$, the telco company first generates and initializes a fresh Bloom filter $BF_j$, represented by an array of bits all set to 0. Any time a user connects to the mobile network via base station $j$, the following connection information is aggregated and added to $BF_j$:

$(id_i, t_{i_1}, t_{i_2})$

with $id_i$ the user's credentials and $t_{i_1}$ and $t_{i_2}$ respectively the starting and ending times of the connection to the access point. Such connection data should be considered sensitive with regard to the users' location privacy. As will be presented, we consider a Bloom filter based approach that brings privacy to the stored data. Indeed, on the one hand the base stations use usernames to identify the users, and on the other hand only the telco company can generate and access the connection information from the base stations.

C. Proximity Chain - Infection Chain

As a notation rule, we use $\langle\,\rangle$ to denote proximity chains and $[\,]$ to denote infection chains.

A proximity chain is a list of users in which any two successive users have been at the same location at the same time. To establish a proximity chain, these contact times must be ordered. In other words, in the proximity chain $\langle A, D, F, E, B \rangle$, the time at which users A and D were at the same location must precede the time at which users D and F were (i.e. $[t_{A_1}; t_{A_2}] \cap [t_{D_1}; t_{D_2}] < [t_{D_1}; t_{D_2}] \cap [t_{F_1}; t_{F_2}]$).

In addition to being a proximity chain, the list may also represent an infection chain. In this case, every user in the chain must have a probability of being infected $Pr(X_i)$ greater than a certain threshold $Tr$. More concretely, an infection chain $[A, X_1, \dots, X_n, B]$ is a proximity chain for which it holds that

$\forall X_i : Pr(X_i) > Tr$,

otherwise it is solely a proximity chain. Therefore, an infection chain $[A, X_1, \dots, X_n, B]$ represents how the SARS-CoV-2 virus may have spread from an initially infected user A to a subsequently infected user B.

It may happen that one or several sub-chains of a proximity chain $\langle A, X_1, \dots, X_n, B \rangle$ are considered infection chains, e.g. $[A, X_1, \dots, X_i]$ and/or $[X_j, \dots, B]$.

D. Adversary Model

We consider the government agency to be the principal threat to the application's users. As stated previously, even if the GDPR does not apply to public health security matters, we aim to impose limitations on government agencies, such that the agencies can only identify users with a high probability of being infected instead of performing a massive data analysis of all mobile users. As we will show in the following sections, a telco company colluding with the government would allow the agency to access the personal data of all users; we therefore do not consider this case.

Even without any collusion, we note that users do not trust the telco company. Indeed, they seek to limit the collection of personal data through their mobile devices as much as possible.

We also consider that users do not trust any approach that requires keeping Bluetooth continuously on, since multiple types of attacks could occur, for example remote code execution through the Bleedingbit vulnerabilities [15].

III. BLOOM FILTERS-BASED APPROACH

As recently proposed in [4], the Bloom filter data structure makes it possible to represent sets of elements privately while still enabling efficient computation on them. Precisely because of this low-overhead privacy extension, we argue that our approach is also suited to massive data sets such as mobile access logfiles. Next, we give some background on Bloom filters and the relevant set relation, and recall the functions of the basic protocol.

A. Bloom Filters

A Bloom filter is a data structure introduced by Burton Howard Bloom in 1970 [16]. It is used to represent a set of elements: given the Bloom filter of a set, one can test whether an element is a member of that set. The data structure consists of an array of $m$ bits associated with $k$ public hash functions. Initially, all $m$ bits are set to 0. To add an element to the Bloom filter, one computes the hash of this element under each of the $k$ hash functions and sets to 1 the bit at each position given by a hash value. To test whether an element is included in the Bloom filter, one similarly computes the hash values of this element and checks whether the corresponding bits are set to 1. If at least one of these bits is 0, the tested element is certainly not a member of the set represented by the Bloom filter (i.e. no false negative can occur when testing an element). Conversely, with some (small) probability, the test may return a false positive: even if all the checked bits are set to 1, the tested element may not be part of the set represented by the Bloom filter.

B. Set Relations

Multiple types of operations can be performed on sets. For privacy reasons, it can be of interest to reveal only the cardinality of the resulting set instead of its content. We therefore propose a solution on adapted Bloom filters (see III-C) that uses one kind of set relation, namely inclusiveness, defined as follows:

Definition 1 (Inclusiveness): Let $A$ and $B$ be finite sets. We consider $A$ to be included in $B$, i.e. $A \subseteq B$, iff all elements of $A$ are elements of $B$: $\forall a \in A : a \in B$.

C. Private Protocols

To guarantee full privacy of the sets' content as well as of their cardinality, we proposed in [4] to modify the Bloom filter approach in two respects. First, instead of using $k$ public hash functions, we use a single HMAC function with $k$ secret keys. Second, the exact value of $k$ is kept secret and is generated privately and at random between two publicly known bounds. We specify below the functions for the initialization phase and for the inclusiveness protocol.

1) Initialization:

$h, k, m, K \leftarrow$ Setup: The telco company first chooses and generates the Bloom filter parameters: the dimension $m$, the HMAC function $h$, the number of keys $k$ and the set of keys $K = \{\kappa_1, \dots, \kappa_k\}$.

$BF_A \leftarrow$ Create($h, m, K, A$): Generates the Bloom filter of the data set $A = \{a_1, \dots, a_{n_A}\}$.

2) Inclusiveness Protocol:

This operator makes it possible to verify whether one set is included in another. It operates directly on the Bloom filters of the respective sets and is defined as:

$BF_{A \subseteq B} \leftarrow \mathrm{INC}(BF_A, BF_B)$: For each index, we set 0 if at that index $BF_A$ holds 1 and $BF_B$ holds 0, and we set 1 otherwise.

This operator is equivalent to the following combination of bitwise operators:

$\mathrm{INC}(BF_A, BF_B) \equiv \neg(BF_A) \;\mathrm{OR}\; BF_B$ (1)

We then count the number of bits set to 1 in the resulting Bloom filter. If it is equal to $m$, we can conclude that $A \subseteq B$, provided no false positive occurred. Otherwise we get $A \nsubseteq B$ with certainty.

For an evaluation of the correctness and security of this protocol, we refer the reader to [3]. It is shown there that a proper selection of the parameters $m$ and $k$ with respect to the number of elements to be inserted limits the number of overlapping bits in the resulting Bloom filter and enables the third party to conclude correctly on the inclusiveness of the two sets. Indeed, too many overlapping bits in the resulting Bloom filter would lead to a false positive.

IV. PROPOSED SOLUTION
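The INC operator of Equation (1), on which the proposed solution relies, can be sketched in a few lines of Python. The sketch assumes Bloom filters stored as equal-length lists of 0/1 integers; the function names `inc` and `is_included` are illustrative and not part of the protocol specification in [3], [4].

```python
def inc(bf_a, bf_b):
    """Bitwise NOT(bf_a) OR bf_b, computed index by index as in Equation (1)."""
    if len(bf_a) != len(bf_b):
        raise ValueError("Bloom filters must have the same dimension m")
    return [(1 - a) | b for a, b in zip(bf_a, bf_b)]


def is_included(bf_a, bf_b) -> bool:
    """A is reported as included in B iff all m bits of INC are set to 1."""
    return sum(inc(bf_a, bf_b)) == len(bf_a)


bf_a = [0, 1, 0, 1]
bf_b = [1, 1, 0, 1]
print(inc(bf_a, bf_b))          # [1, 1, 1, 1]
print(is_included(bf_a, bf_b))  # True
print(is_included(bf_b, bf_a))  # False: bit 0 is set in bf_b but not in bf_a
```

A `False` result is conclusive, whereas a `True` result holds only up to the false-positive probability of the filters.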

Given any two infected users A and B, the government agency first aims to identify all proximity chains $\langle A, id_{X_1}, \dots, id_{X_n}, B \rangle$. We recall that in our protocol the telco company provides all relevant Bloom filters to the government agency. We distinguish three cases:

• CASE 1: the smallest possible proximity chain $\langle A, B \rangle$: there is a base station $BS_j$ and a Bloom filter $BF_{A,B}$ for the set $\{A, B\}$ such that $\mathrm{INC}(BF_j, BF_{A,B}) = true$. Since both users A and B are indeed infected, the proximity chain $\langle A, B \rangle$ is also an infection chain $[A, B]$.

• CASE 2: a proximity chain with one intermediate user X, $\langle A, id_X, B \rangle$: there is a base station $BS_{j_1}$ and a Bloom filter $BF_{A,id_X}$ for the set $\{A, id_X\}$ where $\mathrm{INC}(BF_{j_1}, BF_{A,id_X}) = true$ and, in addition, there is a base station $BS_{j_2}$ and a Bloom filter $BF_{id_X,B}$ for the set $\{id_X, B\}$ where $\mathrm{INC}(BF_{j_2}, BF_{id_X,B}) = true$. We remark that we know users A and B but know neither user X nor his access credential $id_X$, so the government agency has to search across all base stations for all $X_j$ for which the two inclusiveness tests above hold. If $Pr(id_X) > Tr$, we can write $[A, id_X, B]$.

• CASE 3: the general case $\langle A, id_{X_1}, \dots, id_{X_n}, B \rangle$: we have $\mathrm{INC}(BF_{j_1}, BF_{A,id_{X_1}}) = true \wedge \dots \wedge \mathrm{INC}(BF_{j_n}, BF_{id_{X_n},B}) = true$.

Our solution consists of having the government agency build a tree data structure representing all the proximity chains starting from user A. From this tree, the agency can easily identify the proximity chains from user A to user B. In the next step of the protocol, the government agency has to evaluate each chain to determine how plausible it is that it is actually an infection chain. We outline this step but leave its evaluation function to epidemiologists.

We emphasize that at this point, the proximity or infection chains only reveal the usernames of users $X_1, \dots, X_n$ and not their real identities. At the very end of the protocol, the government agency requests from the telco company the identities of the intermediate infected users.

A. Generating the Proximity Tree

To obtain a proximity tree, the government agency starts by creating an empty tree $T$ with user A as root. It then runs the recursive algorithm prox_tree($A, A, B, t'$) presented in Algorithm 1, with $t'$ the time from which user A could have started the infection process. The recursive algorithm works as follows. First, it generates the list $BS_N$ of base stations to which the current node $N$ has been connected at a time later than $t$. To test whether a user $N$ has been connected to a base station $j$ (i.e. whether $(id_N, t_{j_1}, t_{j_2}) \in BF_j$), the government agency receives from the telco company all the Bloom filters built from each of the 3-tuples $(id_N, t_{j_1}, t_{j_2})$. The government agency then performs the inclusiveness test between the received Bloom filters and $BF_j$, the Bloom filter corresponding to the connection logfile of $BS_j$, as $\mathrm{INC}(BF_{N,j}, BF_j)$. The next step of the algorithm identifies all users that visited the base stations in $BS_N$ at the same time as user $N$. As before, the telco company generates Bloom filters from the 3-tuples $(id_l, t_{l_1}, t_{l_2})$ for all users $l$ and all time ranges $[t_{l_1}; t_{l_2}]$ that overlap the connection time of user $N$. To determine which users should be listed, the government agency applies the inclusiveness operator between these Bloom filters and $BF_N$, the filter composed of the elements from $BS_N$. Finally, every identified user is added to the proximity tree $T$ as a leaf of the current node $N$, and Algorithm 1 is then run recursively on the leaves.

An additional aspect to take into account while running the algorithm recursively is the set of ancestors of the current node in the proximity tree. Indeed, we would like to avoid creating loops in the tree, which are irrelevant when dealing with infections: if user A infected user C, it makes no sense to consider user C infecting user A within a short period of time.
The algorithm should therefore exclude all users that already appear as ancestors in the tree. Regarding the tree construction, if user C has been in proximity of user A and $id_C$ is added as a leaf of root A, user A should no longer be considered a potential leaf of node $id_C$, and so on.

In Figure 1 we give a toy example of our recursive algorithm with seven users A, B, C, D, E, F, G, three base stations $BS_{j_1}, BS_{j_2}, BS_{j_3}$ and times as integers in $[0; 24]$. We show the content of the connection logfiles of the three base stations and the proximity tree from user A to user B generated by computing prox_tree(A, A, B, [...]). We observe in Figure 1 that two users may be in contact around different base stations. Indeed, the resulting proximity chains are $\langle A, C, G, B \rangle$, $\langle A, G, B \rangle$ (with users A and G in proximity around $BS_{j_1}$), $\langle A, G, B \rangle$ (with users A and G in proximity around $BS_{j_3}$) and $\langle A, B \rangle$. If they are evaluated as infection chains, users C and G might also be infected.

Algorithm 1 prox_tree($N, A, B, t$)
Require: a node $N$ from a tree $T$, users $A$ and $B$, a time $t$
Ensure: a tree $T$
  if $N = B$ then
    break
  end if
  for all $BF_j$ do
    for all $t_{j_1}, t_{j_2} > t$ do
      if $(id_N, t_{j_1}, t_{j_2}) \in BF_j$ then
        $BS_N$.add($(BS_j, t_{j_1}, t_{j_2})$)
      end if
    end for
  end for
  for all $(BS_{N_k}, t_{Nk_1}, t_{Nk_2}) \in BS_N$ do
    for all $id_l$ do
      for all $(t_{l_1}, t_{l_2})$ such that $[t_{l_1}; t_{l_2}]$ overlaps $[t_{Nk_1}; t_{Nk_2}]$ do
        if $(id_l, t_{l_1}, t_{l_2}) \in BF_{N_k}$ then
          createLeaf($id_l$)
          prox_tree($id_l, A, B, \max(t_{Nk_1}, t_{l_1})$)
        end if
      end for
    end for
  end for
  if $N$.leaf $= \emptyset$ then
    break
  end if
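To make the recursion concrete, the following Python sketch enumerates proximity chains on the toy logfiles of Figure 1. It is a simplification under stated assumptions: it works on plaintext logs, whereas in the protocol each membership test would be an INC test on Bloom filters; a starting time of 0 is assumed; two connections count as a contact when their closed time intervals overlap; and the names `LOGS` and `prox_chains` are hypothetical.

```python
# Toy connection logs of Figure 1: station -> list of (user, start, end).
LOGS = {
    "j1": [("A", 0, 6), ("C", 2, 9), ("G", 3, 5), ("D", 7, 10)],
    "j2": [("A", 8, 17), ("D", 15, 18)],
    "j3": [("F", 2, 11), ("E", 6, 15), ("G", 8, 24), ("A", 18, 24), ("B", 18, 20)],
}


def prox_chains(node, target, t, path=None):
    """Yield each proximity chain from node to target with contacts at or after t."""
    path = (path or ()) + (node,)
    if node == target:
        yield path              # corresponds to the 'N = B' break of Algorithm 1
        return
    for entries in LOGS.values():
        for user, start, end in entries:
            if user != node or end < t:
                continue        # not this node, or connection ended before t
            window_start = max(start, t)
            for other, o_start, o_end in entries:
                if other in path:
                    continue    # skip the node itself and its ancestors (no loops)
                contact_start = max(window_start, o_start)
                if contact_start <= min(end, o_end):   # intervals overlap
                    yield from prox_chains(other, target, contact_start, path)


for chain in prox_chains("A", "B", 0):
    print(chain)
# ('A', 'C', 'G', 'B')
# ('A', 'G', 'B')
# ('A', 'G', 'B')
# ('A', 'B')
```

Under these assumptions the sketch returns $\langle A, C, G, B \rangle$, $\langle A, G, B \rangle$ twice (once for each base station where A and G meet) and $\langle A, B \rangle$, in line with the chains listed for Figure 1. It enumerates chains directly rather than materializing the tree, so it does not reproduce the node counts of the figures.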
$BF_{j_1} = \{(id_A, 0, 6), (id_C, 2, 9), (id_G, 3, 5), (id_D, 7, 10)\}$
$BF_{j_2} = \{(id_A, 8, 17), (id_D, 15, 18)\}$
$BF_{j_3} = \{(id_F, 2, 11), (id_E, 6, 15), (id_G, 8, 24), (id_A, 18, 24), (id_B, 18, 20)\}$

Figure 1. Example of connection logfiles from three base stations and the respective proximity tree obtained from prox_tree(A, A, B, [...]). It yields three different proximity chains $\langle A, C, G, B \rangle$, $\langle A, G, B \rangle$ and $\langle A, B \rangle$.

a) Algorithm optimization: With respect to performance, one could run the algorithm the opposite way, namely with input B as root. To do so, the algorithm has to be modified so that time is considered backwards. It starts at the ending time (24 in our toy example) and we build the proximity tree by going back in time. We call this reverse recursive algorithm reverse_prox_tree. In Figure 2 we show the proximity tree obtained by computing reverse_prox_tree(B, A, B, [...]) from user B with time running backwards. As expected, the resulting proximity chains are the same as in Figure 1, but the resulting tree is smaller than the one in Figure 1. In this specific toy example, obtaining the proximity tree was faster with the reversed algorithm.

Figure 2. Example of a proximity tree obtained from reverse_prox_tree(B, A, B, [...]) with the same toy example as in Figure 1. It yields three different proximity chains $\langle A, C, G, B \rangle$, $\langle A, G, B \rangle$ and $\langle A, B \rangle$.

Another aspect to consider when comparing the two resulting trees is that the order in which the tree is built and the proximity chains are obtained is also reversed. Indeed, in Figure 1 we first obtain $\langle A, C, G, B \rangle$, then $\langle A, G, B \rangle$ (via $j_1$), $\langle A, G, B \rangle$ (via $j_3$) and finally $\langle A, B \rangle$. In Figure 2 we see that reverse_prox_tree yields the chains in exactly the opposite order.

Still aiming to optimize the computation time of our algorithm, in particular when dealing with large numbers of users and base stations, one could simultaneously start the tree generation with the algorithm and with its reversed version. In both cases the tree grows, and every time we find a proximity chain in the tree (meaning $N = B$, or $N = A$ for reverse_prox_tree) we store the chain in a set $S$ (or $S'$ for reverse_prox_tree). Then in each round (i.e. each iteration) we test whether the two sets have a common element. If not, we continue. If they have a proximity chain in common, we can stop both algorithms, and the complete set of proximity chains from user A to user B is the union of the sets $S$ and $S'$.

To illustrate why computing both versions at the same time gains performance, consider the following analogy:
• If you throw one stone into the water and want the resulting waves to reach a point at a distance of $r$ meters, the circle will in the end cover many square meters.
• If you throw two stones into the water (one at the original position, the other at the position you want to reach), the resulting wave fronts will meet at a distance of approximately $r/2$ meters.
• The combined area of these two circles is much smaller than the area of the circle obtained with one stone.
For example, with $A = \pi \times r^2$ and $r = 10$ we get $A \approx 314$, whereas with $r = 5$ the area of the two circles together is approximately 160.

Another level of optimization could be considered in order to identify some of the proximity chains faster, for instance to support starting a localized quarantine immediately.
Instead of storing the chains in $S$ and $S'$, in each propagation round we inspect the chains while they are being built, and we stop both algorithms when:
• prox_tree has built a path $\langle A, X_1, \dots, X_i \rangle$,
• reverse_prox_tree has built a path $\langle B, X_n, \dots, X_j \rangle$,
• and it holds that $X_i == X_j$.
The two parts of the proximity chain can then be concatenated to create the proximity chain $\langle A, X_1, \dots, X_i == X_j, \dots, X_n, B \rangle$.

Table I shows that if we run both algorithms at the same time on the toy example configuration, we retrieve the proximity chain $\langle A, C, G, B \rangle$ faster with this second level of optimization.

In Table I we can observe in detail how the proximity chains are retrieved using the two versions of Algorithm 1 and the optimization on the toy example configuration. As stated previously, reverse_prox_tree(B, A, B, [...]) was executed much faster than prox_tree(A, A, B, [...]). Indeed, the original algorithm ended after 18 rounds while the reverse one stopped after round [...]. Since it is not possible to predict which of the two will finish first, computing both in parallel optimizes the retrieval. As for the second level of optimization, concatenating two parts of proximity chains makes it possible to retrieve $\langle A, C, G, B \rangle$ at round 2, whereas it is discovered at round 6 with prox_tree and at round 9 with reverse_prox_tree. This is especially valuable when proximity chains are composed of a large number of intermediate users.

The performance gain obtained with our two levels of optimization is understated by the very small logfiles of our toy example. One can easily imagine that, applied to real-life scenarios and big data, these optimizations save considerable computation. For example, in another scenario dealing with mobile connection logfiles [10], the authors process logfiles, and therefore Bloom filters, of up to [...] elements.

b) Algorithm decentralization: The European PEPP-PT consortium advocates a decentralized approach, as do the DP3T protocol [17], which relies on Bluetooth, and [18], where decentralization has been investigated. With our presented optimization, we could integrate such a construction by introducing two additional parties besides the ones already presented.
We specify that these two additional parties should be extremely powerful in terms of computation and able to perform parallel computing, such as server farms or clusters:
• Computing party 1, which runs prox_tree
• Computing party 2, which runs reverse_prox_tree

This way, in each round the agency only receives the values $X_i$ (from computing party 1) and $X_j$ (from computing party 2) and checks whether $X_i == X_j$. Only if $X_i == X_j$ does computing party 1 send $\langle A, X_1, \dots, X_i \rangle$ and computing party 2 send $\langle X_j, \dots, X_n, B \rangle$ to the agency. With such a construction, multiple parties are involved in the computation and the whole effort does not rest on the government agency.

c) Algorithm complexity: Analyzing the results in Figures 1 and 2 readily shows that the size of the resulting tree depends on the size of the base stations' logfiles. These logfiles naturally depend on the number of users, and thus of connections, during the period considered. The more base stations and users there are, the more numerous and the fuller the logfiles will be. In our toy example, we have 11 connection entries across all base stations, as displayed in Figure 1. They result in trees with 19 and 10 nodes when computing prox_tree and reverse_prox_tree, respectively. We also recall that when the final user of the wanted infection chain (user B in our example) is found in the tree, the algorithm reaches a break instruction and the respective sub-tree is no longer explored. A high activity of this particular user can thus reduce the growth of the tree. As seen previously, one of the two algorithms will execute faster, without it being possible to predict which one, and applying the presented optimization can reduce the complexity to that of the faster one.

B. Proximity Chain Evaluation

For all the proximity chains $\langle A, id_{X_1}, \dots, id_{X_n}, B \rangle$ obtained by performing the aforementioned protocol, the government agency has to determine whether the users $X_i$ might also be infected. To do so, the agency can estimate each user's probability of being infected and compare it to a threshold (i.e. $Pr(X_i) > Tr$). Such a probability obviously depends, among other things, on the respective neighbors within the chain. We consider the probability value to be computed by a function

infection(previous_node, contact_time, contact_distance, reproduction_number, saturation)

where saturation denotes the percentage of infected persons within the human population of a region, which obviously changes over time.

More precisely, in Germany the reproduction number $R$, which is defined as the mean number of people infected by one case, was 3 at the beginning of the COVID-19 crisis and by 17.04.2020 could be reduced to [...] (and meanwhile $R = 1.[...]$). Clearly this number is only an average, but it still indicates that inference from a proximity chain to an infection chain depends very much on the concrete time and location at which the entities met during the pandemic wave. Similar numbers also exist for other countries, for instance $R = 0.[...]$ for Belgium on 17.04.2020. Another important observation is that, since a proximity chain can easily build up over a period of weeks, $Pr(X_i)$ may vary significantly. But only if all probabilities are larger than $Tr$ can the agency at least argue that it has identified a possible infection chain.

It goes without saying that determining the infection function is out of scope.
On the one hand, specialists emphasize the high contagiousness of the virus; on the other hand, two users connecting to the same base station at the same time does not necessarily imply any physical contact between them.

Without being able to determine the exact probability of a user being infected by another one, we can still propose a model to evaluate the probability of a proximity chain being an infection chain. We know that users A and B are infected, and we would like to determine whether user B has been infected due to user A or via another chain and other infection events. Applying probability theory to such a problem is therefore relevant and reflects its chain characteristic.

We define $Pr(X_i)$ as the conditional probability $P(X_i \mid X_{i-1})$ of the event "$X_{i-1}$ has infected $X_i$, given that $X_{i-1}$ is already infected". It holds that

$Pr(X_1 \cap \dots \cap X_n) = \prod_{i=1}^{n} Pr(X_i)$.

Considering a proximity chain $\langle A, X_1, \dots, X_n, B \rangle$, there is a clear tendency for the overall probability that user B was infected due to user A to decrease as the proximity chain gets longer, since every additional hop contributes another factor $Pr(X_i) \leq 1$ to the product. We propose the following probability model for evaluating a proximity chain:

For each $\langle A, X_1, \dots, X_n, B \rangle$: if $\forall i \in [1; n],\ Pr(X_i) \geq T_r$, then $[A, X_1, \dots, X_n, B]$. (2)

The proximity tree obtained at the previous stage of the protocol contains nodes with users' credentials, and only these usernames are revealed. Only when a proximity chain turns out to be an infection chain will the agency request from the telco company the real identities of the users composing the chain. Users' identities are therefore revealed solely if the outcome of the infection function says so. Moreover, we recall that during the overall process no additional location information of other users listed in the mobile operator's logfile is revealed to the agency.
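The threshold test of equation (2) and the product rule for the joint probability can be illustrated with a minimal Python sketch. Since the infection function is out of scope, the per-hop values $Pr(X_i)$ are treated here as given inputs; the function names, the example probabilities, and the threshold value are hypothetical and chosen only for illustration.

```python
from math import prod
from typing import Sequence


def chain_probability(hop_probs: Sequence[float]) -> float:
    """Joint probability Pr(X_1 ∩ ... ∩ X_n) as the product of the Pr(X_i)."""
    return prod(hop_probs)


def is_infection_chain(hop_probs: Sequence[float], threshold: float) -> bool:
    """Test of equation (2): every Pr(X_i) must be at least the threshold T_r."""
    return all(p >= threshold for p in hop_probs)


# Hypothetical values: T_r and the per-hop probabilities are assumptions.
t_r = 0.75
short_chain = [0.9, 0.8]            # Pr(X_1), Pr(X_2)
long_chain = [0.9, 0.8, 0.9, 0.6]   # one hop falls below T_r

print(is_infection_chain(short_chain, t_r))   # True
print(is_infection_chain(long_chain, t_r))    # False
print(chain_probability(long_chain) < chain_probability(short_chain))  # True
```

Note that for a direct chain $\langle A, B \rangle$ there is no intermediate user, so the test over $i \in [1; n]$ is vacuously satisfied in this sketch.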
Table I. Construction of the proximity tree round by round with prox_tree and reverse_prox_tree, and how the optimization could be applied.

Round | prox_tree | reverse_prox_tree | With optimization
1 | C | A, ⟨A, B⟩ | ⟨A, B⟩ from reverse_prox_tree
2 | G | G | ⟨A, C, G, B⟩ from concatenation of ⟨A, C, G⟩ from prox_tree and ⟨G, B⟩ from reverse_prox_tree
3 | F | A, ⟨A, G, B⟩ | ⟨A, G, B⟩ from reverse_prox_tree
4 E E F F B, ⟨A, C, G, B⟩ F D A, ⟨A, G, B⟩ ⟨A, G, B⟩ from reverse_prox_tree. 8 G C C A, ⟨A, C, G, B⟩ . . . . . . . . . B, ⟨A, G, B⟩ - . . . . . . . . . B, ⟨A, G, B⟩ - 18 B, ⟨A, B⟩ -

Another way to tune prox_tree and make the overall computation more scalable could be, during the computation of prox_tree and reverse_prox_tree, to follow a path in the proximity tree only as long as it still fits the criterion to also be an infection chain. This could consist of placing the test from equation (2) at line 20 of Algorithm 1, followed by a break instruction in case the test is not fulfilled.
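The round-by-round optimization shown in Table I can be sketched as follows. This is a simplified illustration, not the authors' Algorithm 1: both partial paths are handed to the comparison directly, whereas in the protocol the agency first sees only $X_i$ and $X_j$. The names `match_round` and `agency_rounds` are hypothetical, and the forward path of round 1 is assumed to be ⟨A, C⟩.

```python
from typing import List, Optional

Path = List[str]


def match_round(forward: Path, backward: Path) -> Optional[Path]:
    """Return a complete chain if this round closes one, otherwise None.

    forward  is <A, ..., X_i> as produced by prox_tree,
    backward is <X_j, ..., B> as produced by reverse_prox_tree.
    """
    initial, final = forward[0], backward[-1]
    if backward[0] == initial:       # reverse tree reached user A itself
        return backward
    if forward[-1] == final:         # forward tree reached user B itself
        return forward
    if forward[-1] == backward[0]:   # X_i == X_j: concatenate both halves
        return forward + backward[1:]
    return None


def agency_rounds(forward_rounds: List[Path], backward_rounds: List[Path]) -> List[Path]:
    """Compare the two parties' outputs round by round and collect chains."""
    chains = []
    for fwd, bwd in zip(forward_rounds, backward_rounds):
        chain = match_round(fwd, bwd)
        if chain is not None:
            chains.append(chain)
    return chains


# Rounds 1 and 2 of Table I (forward path of round 1 is an assumption).
forward_rounds = [["A", "C"], ["A", "C", "G"]]
backward_rounds = [["A", "B"], ["G", "B"]]
print(agency_rounds(forward_rounds, backward_rounds))
# [['A', 'B'], ['A', 'C', 'G', 'B']]
```

Round 1 yields ⟨A, B⟩ from the reverse tree alone, and round 2 yields ⟨A, C, G, B⟩ by concatenation at the common user G, matching the first two rows of the table.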

C. Recursivity of the Infection Detection

One may notice that a trivial optimization would be to switch users A and B, in the sense that "the infection of user A comes from user B". Figure 3 shows the proximity tree obtained from our algorithm by computing prox_tree(B, B, A, …) with our toy example logfiles. It is a very different tree from the one in Figure 1 obtained by prox_tree(A, A, B, …). If the government agency holds some information on the infection times of users A and B, for example that user A was infected before user B, only one direction should be considered by the agency.

Figure 3. Example of a proximity tree from user B to user A obtained from prox_tree(B, B, A, …). It results in two different proximity chains $\langle B, A \rangle$ and $\langle B, G, A \rangle$.

To be most efficient, the government agency should perform a final step in the protocol. All the users identified as infected at the previous stage (i.e. all $X_i$ where $Pr(X_i) > T_r$) should be considered as new users A and, respectively, B in the proposed solution. Indeed, our protocol is initiated with user tuples $(A, B)$ already identified as infected by the agency. The freshly identified users thus extend the list of known infected persons, and the protocol should be applied to them as well to optimize the search. In this way, as many infected users as possible could be identified and contacted.

D. Discussion on Location Privacy

We argue that the proposed solution provides privacy for the users by three different means. The first is the use of personal credentials only, in the form of usernames. As explained previously, the real identities of users are neither provided nor stored in the Bloom filters or the logfiles. The telco company uses usernames to distinguish users, and the private mapping is provided to the government agency solely on demand, when a user is identified as being part of an infection chain.

The second aspect of location privacy is given by the Bloom filter based approach from [4], with its obfuscation feature, which allows relations among logfiles to be computed while keeping these data sets private. We recall that this approach uses an HMAC function instead of a set of public hash functions, so that only the telco company, and no other party, can create the Bloom filters. Consequently, the government agency cannot try to retrieve the locations of a specific user by generating a Bloom filter with a single element and performing the inclusiveness relation between this Bloom filter and the ones from the base stations. For that reason, using secret keys to generate a valid Bloom filter enhances the privacy aspect of the protocol. Finally, we recall that the secret keys are generated and stored only on the telco company's side and are not required by the government agency to perform our protocol.

The third aspect of location privacy is that no party other than the provider itself (which has this information anyhow) obtains the location data of the users. This can easily be achieved by not revealing which $BF_i$ comes from which $BS_i$. This way, the only information revealed to the authority is the contact information of users having entered the same cell during the same time interval. Providing the concrete location of this cell is totally irrelevant for the authority to compute the proximity, respectively infection, chain.

V. CONCLUSION

In this work we proposed to use the Bloom filter approach from [4] for a real-life use case, similarly to [10], where we applied it to a post-mortem mobile device tracking scenario. Our detailed protocol supports a government agency in tracking possible COVID-19 infection chains and thereby identifying plausibly infected mobile users. Throughout the entire protocol, the agency only handles usernames that do not allow the users' identities to be retrieved, so their privacy is preserved. Solely in the case of a possible infection by the life-threatening SARS-CoV-2 virus are real identities revealed to the agency, which is then able to contact the users and provide medical support. In this way, the telco companies act in a GDPR compliant manner and can still guarantee a certain level of location privacy to their clients.

We stress that when 'in proximity' data stem from the mobile telco's logfile, it only means that two devices have been within the transmission range of the same base station. In the worst case they can still have a × r distance (easily 500 m or more). However, if the same approach can be applied to data collected by the RSSI based Swarm-mapping approach for Android or iOS, then 'in proximity' has a much better accuracy [19]. In particular, the WiFiLocationHarvest file of each mobile device contains timestamp, latitude, longitude, trip-id, speed, and course at an accuracy that comes close to the one required to check whether two devices got nearer than 2 m (infection distance).
Moreover, compared to the app-based approach with Bluetooth promoted by Germany's Fraunhofer Institutes and others, in the RSSI based approach the mobile's WLAN and Bluetooth can be off and yet, simply due to the RSSI measured from the access point, the approach provides the location data of the devices equipped with such modern mobile operating systems.

To conclude, our approach may be a good starting point for debating a reasonable GDPR compliant detection of COVID-19 infection chains, since we argue it does not leak additional private information to parties other than those who already have knowledge of our location data.

REFERENCES

[1] Daniele Antonioli, Nils Ole Tippenhauer, and Kasper Rasmussen. BIAS: Bluetooth Impersonation AttackS. In Proceedings of the IEEE Symposium on Security and Privacy (S&P), May 2020.
[2] Muder Almiani, Abdul Razaque, Liu Yimu, Meer Jaro Khan, Tang Minjie, Mohammed Alweshah, and Saleh Atiewi. Bluetooth application-layer packet-filtering for blueborne attack defending. In Fourth International Conference on Fog and Mobile Edge Computing, FMEC 2019, Rome, Italy, June 10-13, 2019, pages 142–148. IEEE, 2019.
[3] Louis Tajan, Dirk Westhoff, and Frederik Armknecht. Private set relations with bloom filters for outsourced SLA validation. IACR Cryptology ePrint Archive, 2019:993, 2019.
[4] Louis Tajan, Dirk Westhoff, and Frederik Armknecht. Solving set relations with secure bloom filters keeping cardinality private. In Proceedings of the 17th International Joint Conference on e-Business and Telecommunications - Volume 3: SECRYPT, pages 443–450. INSTICC, SciTePress, 2020.
[5] Chris Hoofnagle, Bart Sloot, and Frederik Borgesius. The european union general data protection regulation: What it is and what it means. Information & Communications Technology Law, 28:1–34, 02 2019.
[6] Michael J. Freedman, Kobbi Nissim, and Benny Pinkas. Efficient private matching and set intersection. In Christian Cachin and Jan Camenisch, editors, Advances in Cryptology - EUROCRYPT 2004, International Conference on the Theory and Applications of Cryptographic Techniques, Interlaken, Switzerland, May 2-6, 2004, Proceedings, volume 3027 of Lecture Notes in Computer Science, pages 1–19. Springer, 2004.
[7] Florian Kerschbaum. Outsourced private set intersection using homomorphic encryption. In Heung Youl Youm and Yoojae Won, editors, , pages 85–86. ACM, 2012.
[8] Lea Kissner and Dawn Xiaodong Song. Privacy-preserving set operations. In Victor Shoup, editor, Advances in Cryptology - CRYPTO 2005: 25th Annual International Cryptology Conference, Santa Barbara, California, USA, August 14-18, 2005, Proceedings, volume 3621 of Lecture Notes in Computer Science, pages 241–257. Springer, 2005.
[9] Pierre K. Y. Lai, Siu-Ming Yiu, K. P. Chow, C. F. Chong, and Lucas Chi Kwong Hui. An efficient bloom filter based solution for multiparty private matching. In Hamid R. Arabnia and Selim Aissi, editors, Proceedings of the 2006 International Conference on Security & Management, SAM 2006, Las Vegas, Nevada, USA, June 26-29, 2006, pages 286–292. CSREA Press, 2006.
[10] Louis Tajan and Dirk Westhoff. Retrospective tracking of suspects in gdpr conform mobile access networks datasets. In Proceedings of the Central European Cybersecurity Conference 2019, CECC 2019, Munich, Germany, November 14-15, 2019, pages 5:1–5:6. ACM, 2019.
[11] Lailong Luo, Deke Guo, Richard T. B. Ma, Ori Rottenstreich, and Xueshan Luo. Optimizing bloom filter: Challenges, solutions, and comparisons. IEEE Commun. Surv. Tutorials, 21(2):1912–1949, 2019.
[12] Anup Kumar Maurya and V. N. Sastry. Secure and efficient authenticated key exchange mechanism for wireless sensor networks and internet of things using bloom filter. In , pages 173–180. IEEE Computer Society, 2017.
[13] Anum Talpur, Thomas Newe, Faisal Karim Shaikh, Adil Amjad Sheikh, Emad A. Felemban, and Abdelmajid Khelil. Bloom filter based data collection algorithm for wireless sensor networks. In
Commun. ACM, 13(7):422–426, July 1970.
[17] Carmela Troncoso, Mathias Payer, Jean-Pierre Hubaux, Marcel Salathé, James Larus, Edouard Bugnion, Wouter Lueks, Theresa Stadler, Apostolos Pyrgelis, Daniele Antonioli, Ludovic Barman, Sylvain Chatel, Kenneth Paterson, Srdjan Čapkun, David Basin, Jan Beutel, Dennis Jackson, Marc Roeschlin, Patrick Leu, Bart Preneel, Nigel Smart, Aysajan Abidin, Seda Gürses, Michael Veale, Cas Cremers, Michael Backes, Nils Ole Tippenhauer, Reuben Binns, Ciro Cattuto, Alain Barrat, Dario Fiore, Manuel Barbosa, Rui Oliveira, and José Pereira. Decentralized privacy-preserving proximity tracing, 2020.
[18] Serge Vaudenay. Centralized or decentralized? the contact tracing dilemma. Cryptology ePrint Archive, Report 2020/531, 2020. https://eprint.iacr.org/2020/531.
[19] Andreas Dhein and Rüdiger Grimm. Standortlokalisierung in modernen smartphones - grundlagen und aktuelle entwicklungen.