Exposures Exposed: A Measurement and User Study to Assess Mobile Data Privacy in Context
Evita Bakopoulou, ebakopou@uci.edu, University of California, Irvine, USA
Anastasia Shuba†, ashuba@uci.edu, Broadcom Inc., USA
Athina Markopoulou, athina@uci.edu, University of California, Irvine, USA
ABSTRACT
Mobile devices have access to personal, potentially sensitive data, and there is a large number of mobile applications and third-party libraries that transmit this information over the network to remote servers (including app developer servers and third-party servers). In this paper, we are interested in better understanding not just the extent of personally identifiable information (PII) exposure, but also its context (i.e., functionality of the app, destination server, encryption used, etc.) and the risk perceived by mobile users today. To that end, we take two steps. First, we perform a measurement study: we collect a new dataset via manual and automatic testing and capture the exposure of 16 PII types from the 400 most popular Android apps. We analyze these exposures and provide insights into the extent and patterns of mobile apps sharing PII, which can later be used for prediction and prevention. Second, we perform a user study with 220 participants on Amazon Mechanical Turk: we summarize the results of the measurement study in categories, present them in a realistic context, and assess users' understanding, concern, and willingness to take action. To the best of our knowledge, our user study is the first to collect and analyze user input at such fine granularity and on actual (not just potential or permitted) privacy exposures on mobile devices. Although many users did not initially understand the full implications of their PII being exposed, after being better informed through the study, they became appreciative of and interested in better privacy practices.

CCS CONCEPTS

• Security and privacy → Privacy protections; • Human-centered computing → Empirical studies in ubiquitous and mobile computing; • Networks → Network monitoring;
KEYWORDS
Privacy, Mobile Apps, Personally Identifiable Information (PII), Data Analysis, User Survey
1 INTRODUCTION

Mobile devices have access to a wealth of personal, potentially sensitive information, and there is a growing number of applications that access, process, and transmit some of this information over the network. Sometimes this is necessary for the intended operation of the applications (e.g., location is needed by
Google Maps) and controllable (e.g., by the user through permissions), but for the most part, users are not in control of their data today. Applications and third-party libraries routinely transmit user data to remote servers, including application servers but also ad servers and trackers, and users typically have limited visibility into, and understanding of, what part of their personal data is shared, with whom, and for what purpose.

With the increased interest in online privacy, there are several bodies of related work. On one hand, a number of systems have been proposed that improve data transparency and protect personally identifiable information (PII). In general, these systems fall into three categories: (i) static analysis and application re-writing [15, 18, 26], (ii) dynamic analysis with a modified or rooted OS [7, 16, 17, 41, 46], and (iii) VPN-based network monitoring [27, 30, 33, 35–37, 39]. While these tools provide more fine-grained control over sensitive data (as opposed to just permissions), the way users engage with these privacy-preserving systems is less well studied. On the other hand, in the human-computer interaction (HCI) community, researchers have extensively studied how different designs for app permissions affect users' decisions on which apps to install and which permission requests are considered legitimate by users [10, 19, 20, 22, 44]. A detailed review of related work is provided in Section 2.

In this paper, we are interested in understanding privacy exposures, which we define as PII transmitted by a mobile app (or a third-party library used by the app) on the device, over the network interface, towards a remote server. Our goal is to understand not only the extent and mechanisms of PII exposure, but also its context (i.e., functionality of the app, destination server, encryption used, frequency, etc.) and the risk perceived by mobile users today.
For example, location needs to be shared for a navigation app to perform its intended and legitimate function, and should not be of concern to the user. In contrast, if the same navigation app uses a library that shares device IDs with a third-party server, this is more likely a privacy leak and should be of concern to the user. We are also interested in PII actually exposed in real network traffic, as opposed to potential privacy exposures as captured by permissions. To that end, we make the following two contributions.

Measurement Study.
First, we utilize a state-of-the-art mobile network monitoring tool,
AntMonitor [36], to intercept and inspect outgoing packets transmitted over the mobile device's network interface. Using
AntMonitor, we conduct two extensive and systematic experiments (one manual and one automated) on Android phones, where we test the 400 most popular Android apps (as of March 2017) and collect 47,076 outgoing packets. We identify whether these packets contain any of 16 predefined types of PII (defined in Section 3.1), together with related information, which we collectively refer to as context, including: the destination server/domain (i.e., whether it is an App Developer server or a third-party Advertisers & Analytics server), the app category (games, shopping, navigation, etc.), which reveals the intended functionality of the app, whether the PII is exposed in clear text or is encrypted, and whether the app runs in the background or foreground. Our datasets partly confirm findings of previous measurement studies of mobile devices but are richer: e.g., they contain previously unseen exposures over plain TCP and UDP, exposures while the app is in the background, and malicious scanning for rooted devices. We analyze our datasets and provide insights into the extent and nature of how PII is exposed today. We also identify behavioral patterns, such as communities of domains and mobile apps involved in exposing private information. These patterns can be used in the future to design automated prediction and prevention methods.

Footnote 1: Most prior work [29, 33, 37] refers to PII found in outgoing packets as a "privacy leak," because PII is by definition private information and an outgoing packet indicates exfiltration or a "leak." However, we purposely distinguish between a "privacy exposure" (a PII contained in an outgoing packet) and a "privacy leak" (an exposure that is not necessary for the intended functionality of the app, and/or goes to a third-party server, or happens in clear text). This distinction, between a PII exposure and an actual leak, can only be made based on the context, which is one of the main aspects we investigate in this paper.
We plan to make the datasets available to the community.

User Study.
Second, we perform a user study on Amazon Mechanical Turk (MTurk) with 220 users. We summarize the results of the measurement study in categories, present participants with real-world scenarios of private information exposure in context (type of PII, whether it is shared with the application or a third party, use of encryption, etc.), and ask them to assess the legitimacy (i.e., whether the information is needed for the app's functionality) and the privacy risk posed. We also educate the participants on how a single piece of PII can lead to even more information being discovered when combined with data fetched from a data broker. Finally, we ask users before and after the survey what actions they would be willing to take to protect their privacy, including using free/paid privacy-enhancing tools and contributing their data to crowdsourcing. To the best of our knowledge, our user study is the first to collect and analyze user input at such fine granularity (i.e., taking context into account) and on actual (not just potential or permitted) privacy exposures from mobile devices. We found that (i) many users did not initially understand the full implications of their PII being exposed, but (ii) after being better informed through the study, they became appreciative of and interested in better privacy practices. The insights gained by the study can inform the design of fine-grained data transparency and privacy-preserving tools such as AntMonitor [2].

The structure of the rest of the paper is as follows. Section 2 reviews related work. Section 3 describes the measurement study, including the data collection, summary, and analysis. Section 4 presents the Amazon Mechanical Turk study, based on our datasets. Section 5 concludes the paper and outlines directions for future solutions based on the findings of this study.
2 RELATED WORK

Privacy-Preserving Tools.
There are a number of complementary frameworks, built by different communities, that can enhance data transparency and privacy protection on mobile devices.
Permissions are the first line of defense against unwanted access to sensitive resources. However, they are insufficient and too coarse-grained: (i) users typically accept permission requests by default when installing apps; (ii) permissions do not capture run-time behavior; (iii) they do not protect against inter-app communication and poorly documented system calls; and (iv) permissions signify access to information, which is less of a concern than sharing that information over the network.
Static analysis and application re-writing, such as
PiOS [15],
AndroidLeaks [18], and [26], suffer from the inherent imprecision of decompilation. Furthermore, static analysis does not capture representative run-time behavior and often fails to deal with native or dynamically loaded code.
Dynamic analysis approaches with a modified or rooted OS include
Protect-MyPrivacy [13],
TaintDroid [16], and others [7, 10, 17, 41, 43, 46]. Such tools are powerful but not suitable for mass adoption, since rooting a phone or installing a custom OS is not only a daunting task for the average user, but is also strongly discouraged by wireless providers and phone manufacturers.
VPN-based network monitoring tools use the TUN interface to capture network packets on the device and detect whenever sensitive information is sent over the network. Over the years, their implementation has evolved from a client-server implementation [27, 33, 37] to a mobile-only implementation [30, 36, 39]. State-of-the-art mobile-only network monitoring tools include
AntMonitor [36],
Recon [33],
Lumen [31], and
Privacy Guard [38], and are amenable to crowdsourcing thanks to their implementation as mobile-only user-space apps. However, to the best of our knowledge, they have so far been used only to collect packet traces and analyze "privacy leaks" found therein, not for user studies.
Privacy “Leak” Datasets.
The aforementioned tools have been used before to collect datasets containing privacy "leaks" (see Footnote 1 for terminology) from mobile devices.
Recon [33] collected packet traces to train machine learning classifiers for predicting PII exposures. Razaghpanah et al. [29] collected cases of PII exposures from thousands of users using the
Lumen app and used that data to find new advertising and tracking services. Finally, the longitudinal study of PII exposures in [32] demonstrates how privacy evolves across different app versions in Android.
User Studies.
Several experimental studies have analyzed mobile app behavior, and several user studies have analyzed user interactions and mobile usage (e.g.,
Mehrotra et al. [25], Tian et al. [40],
EarlyBird [45] and Xu et al. [47]). Most closely related to this paper are user studies that focus specifically on privacy.

Permissions and how users interact with them have been extensively studied in [10, 19, 20, 22, 23, 44]. Almuhimedi et al. [10] studied how sending users privacy nudges affected their permission settings. Wang et al. [44] studied user decisions when presented with permission settings that are separated between apps and ad libraries. More recently, Ismail et al. [19] showed that it is possible to maintain app usability even when disallowing certain permissions. Chitkara et al. proposed a retrofitted Android system,
ProtectMyPrivacy [13], which allows users to make fewer privacy decisions by setting permissions based on third-party libraries instead of applications. Taking a different approach,
PrivacyStreams [21] proposes and evaluates (with a user study) a tool for developers to write code in a more privacy-preserving way. In the web ecosystem, the work in [28] studies online privacy on websites to identify mismatched user expectations and the factors that impact these mismatches. The work by Kleek et al. [42] is closest to our work in that they use information captured from network monitoring to see if it influences users' decisions to install apps. Unlike their work, however, we are interested in learning what users would do if given more fine-grained control over their data.

Our work in perspective.
In this paper, we use one of the aforementioned state-of-the-art VPN-based monitoring tools,
Ant-Monitor [2, 36], to collect and analyze packet traces and the privacyexposures found therein. This approach has the advantage that itcaptures actual real-world privacy exposures, as opposed to e.g., po-tential exposures described by permissions. In addition, we compilethe large volume of information from the packet traces and presentit to users in a way that they can process and assess. The combina-tion of a measurement study (volume and coverage of packet tracesobtained through extensive and systematic experiments) with a userstudy (summarizing the information into categories, defining context,and obtaining fine-granularity feedback from users) is one of thecontributions of this paper, in addition to the detailed findings ofboth studies as outlined in the introduction.
3 MEASUREMENT STUDY
We are interested in collecting real-world cases of PII exposure, as captured in packet traces of outgoing packets (as opposed to potential exposures, e.g., indicated by permissions). We focus on PII that have been previously defined in related work (e.g., [30, 33]). Specifically, we are interested in detecting PII that belong to the following list:

• Device identifiers: IMEI, IMSI, Android ID, phone number, serial number, ICCID, MAC address; available through Android APIs.
• User identifiers: usernames and passwords used to log in to various apps (unavailable through Android APIs); Advertiser ID and email (available through Android APIs).
• User demographics: first and last name, gender, zipcode, city, etc.; unavailable through Android APIs.
• Location: latitude and longitude coordinates; available through Android APIs.

To collect real cases of when the PII defined above are exposed, we build on
AntMonitor, an app that intercepts all network traffic from the device without requiring rooting. We pick
AntMonitor since it is a representative VPN-based tool for privacy protection (see Sec. 2) and it is easy to extend for data collection. The traffic interception, along with several utility functions of
AntMonitor, have been made available as a library, and we will refer to it as the
AntMonitor Library [2, 36]. We will refer to our data collection app that extends the
AntMonitor Library , as
AntShield [2, 34, 35]. As shown in Fig. 1(a),
AntShield receives outgoing packets via the
PacketFilter interface provided by the
AntMonitor Library. Note that the
AntMonitor Library also implements a TLS proxy, which allows it to decrypt the SSL/TLS traffic of applications that do not use certificate pinning (see [36] for details).

Footnote: The design and performance evaluation of the AntShield system is a contribution on its own; however, we consider it out of the scope of this paper. Here, we only use AntShield as a tool to collect the packet traces that are the starting point of our measurement and user studies.

Figure 1: The AntShield system used for data collection: architecture and screenshot. (a) Architecture: each packet is intercepted by the AntMonitor Library, searched for any PII, and mapped to an app. (b) Screenshot: PII to search for, including those manually entered (e.g., name).

These decrypted packets are also passed to
AntShield via the
PacketFilter interface. Each intercepted packet (unencrypted or decrypted) is searched for PII using the
AntMonitor Library's Deep Packet Inspection (DPI) module. This module implements the Aho-Corasick search algorithm to find multiple strings in one pass over a given packet. Some of the PII defined above are available to all apps via Android APIs and are thus easy to find in packets. To find PII that are unavailable through APIs, we add them to the list of strings to search for using
AntShield's GUI (see Fig. 1(b)). Note that this methodology may miss PII that are obfuscated by applications prior to transmission, but as shown in [14] such behavior is rare. After DPI, we use the AntMonitor Library's mapPacket API call to note which app was responsible for generating the outgoing packet in question. Finally, we break the packet into any relevant fields (destination IP address/port, HTTP method if applicable, etc.) and save it in JSON format. Any PII found is redacted before saving the packet, to allow us to share the data with the community. Any contextual information (the PII found, along with the application name and whether or not this application was in the foreground when it generated the packet) is saved in separate JSON fields. Note that although the entire packet is not necessary for the purposes of this paper, the data could be useful later, for instance to train classifiers that predict PII exposures as in [33] and [35].

In summary, using
AntShield to capture packets on the device has several advantages compared to previous datasets collected in the middle of the network: (1) we are able to accurately map each packet to the app that generated it; (2) we can keep track of foreground vs. background apps, to see what kind of data apps send while in the background; (3) we gain insight into TLS, UDP, and regular TCP traffic, in addition to HTTP and HTTPS; (4) scrubbing PII and labeling packets with the type of PII they contain is fully automated. Due to these advantages, we were able to collect comprehensive and realistic cases of PII exposures, as described in the next section.
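To make the per-packet pipeline concrete, the sketch below shows a minimal version of the two steps described above: an Aho-Corasick multi-pattern search over the packet payload, followed by redaction of the PII found into a labeled JSON-style record. This is an illustration of the technique, not AntShield's actual code; the PII values, app name, and record field names are assumptions.

```python
from collections import deque

class AhoCorasick:
    """Multi-pattern search: finds all PII strings in one pass over a payload."""
    def __init__(self, patterns):  # patterns: {label: pii_string}
        self.goto = [{}]           # trie as a list of char -> state dicts
        self.fail = [0]            # failure links
        self.out = [set()]         # (label, pattern) matches ending at each state
        for label, pat in patterns.items():
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append(set())
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].add((label, pat))
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:                          # BFS to build failure links
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] |= self.out[self.fail[nxt]]

    def search(self, text):
        state, hits = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            hits |= self.out[state]
        return hits

def redact_packet(payload, pii_values, app_name, foreground):
    """Search one outgoing packet for PII, scrub it, and build a JSON-style record."""
    hits = AhoCorasick(pii_values).search(payload)
    for label, value in hits:
        payload = payload.replace(value, "<" + label + ">")
    return {
        "app_name": app_name,       # from the packet-to-app mapping
        "foreground": foreground,   # contextual info saved alongside
        "pii_types": sorted(label for label, _ in hits),
        "payload": payload,         # PII redacted before storage
    }
```

Aho-Corasick matches all patterns in a single pass over the payload, which is why a DPI module would prefer it: the naive alternative of one substring search per PII string scales poorly as the list of strings to search for grows.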
Using
AntShield's packet capturing ability, we interacted with and collected packet traces from the 400 most popular free Android apps, based on rankings in
AppAnnie [3]. We used a Nexus 6 device for our data collection, and collected two different datasets, depending on how we interacted with apps, as described next.

Manual Testing.
First, in order to capture PII exposures during typical user behavior, we tested the top 100 apps in batches: we installed 5 apps on the test device and then used
AntShield to intercept and log packets while interacting with each app for 5 min. After all apps in the batch were tested, we switched off the screen and waited 5 min to catch any packets sent in the background. Next, we uninstalled each app and, finally, turned off
AntShield.

Automatic Testing.
We also used the
UI/Application Exerciser Monkey [9] to automatically interact with apps. This does not capture typical user behavior but enables extensive stress testing of more apps. We installed 4 batches of 100 applications each, and had
Monkey perform 1,000 random actions in each tested app while
Ant-Shield logged the generated traffic. At the end of each batch, weswitched off the screen of the test device and waited for 10min tocatch additional exposures sent in the background.
Summary.
Since the two (Automatic and Manual)
AntShield
Datasets capture different behaviors, we describe and analyze them separately. The
AntShield datasets are summarized in Table 1. Other state-of-the-art datasets include
Recon [33] and [32]. Our datasets confirm and extend the findings of previous work, as outlined in the next section. Thus, in addition to being used to generate survey questions in our user study (Sec. 4), our datasets provide valuable insights on their own (Sec. 3.3). These datasets are publicly available at [2].
| Auto | Manual
| 867 | 2264
| 38 | 7
| 496 | 1174
| 17 | 12

Table 1: Summary of the Manual and Auto AntShield datasets collected on the device.
Our datasets provide us with insights into the current state of privacy exposures in the Android ecosystem. Some of the captured patterns were previously unknown and are revealed for the first time here. For example, we were able to detect exposures happening in the background, exposures in plain TCP and UDP (not belonging to HTTP(S) flows), and malicious scanning for rooted devices.
Background Exposures. AntShield is in a unique position to capture exposures that happen in the background vs. the foreground, and other contextual information that is only available on the device. Table 1 shows that there is a substantial number of background exposures (e.g., half of all exposures in the automatic dataset) that should be brought to users' attention. Digging deeper, we found several interesting patterns in apps that expose PII both in the background and in the foreground. Fig. 2(a) shows how
Flow Free, a puzzle game, behaves differently in the background vs. the foreground: in the foreground, several device identifiers are sent to ad and analytics servers, and in the background, one of the ad servers (mopub.com) also collects the user's city. Perhaps this information is needed to serve personalized ads based on the user's location, but it is unclear why it is needed when the application is in the background and no ads are being shown. Another example is
MeetMe, an app for meeting people online, whose behavior is shown in Fig. 2(b). In this case, the app collects less PII in the background, but contacts more ad servers. Such findings are concerning, since apps are consuming users' data and posing privacy risks even when the user is not interacting with the app.
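Comparisons of this kind reduce to a simple aggregation over the JSON records produced by the data collector; a sketch, with the record field names assumed:

```python
from collections import Counter

def tally_exposures(records):
    """Count exposures per (domain, PII type, app state), so that an app's
    background behavior can be compared against its foreground behavior.
    Each record is one captured exposure with the contextual fields
    (domain, foreground flag, PII types) described in Sec. 3.1."""
    counts = Counter()
    for rec in records:
        state = "foreground" if rec["foreground"] else "background"
        for pii in rec["pii_types"]:
            counts[(rec["domain"], pii, state)] += 1
    return counts
```

Grouping by the foreground flag is what surfaces cases like Flow Free above, where a PII type appears only in the background tallies.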
Non-HTTP Exposures. Prior state-of-the-art datasets [32, 33] reported only HTTP(S) exposures. Table 1 reports, for the first time, exposures in non-HTTP(S) traffic, including plain TCP or UDP packets. Our dataset contains 29 UDP exposures, all of which were exposing the Advertiser ID and Location. As shown in Table 2, we also found some apps (mostly games and photo-editing apps) that exposed the Android ID over non-standard (i.e., other than 80, 443, 53) TCP ports, such as 8080 or 10086 (a port known to be used by trojans, Syphillis and other threats [8]). The destination IPs could not be resolved by DNS, indicating that the application may have hard-coded those IPs.
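A simple filter suffices to surface these cases in the dataset; a sketch, with the standard-port set taken from the table caption and the record field names assumed:

```python
STANDARD_PORTS = {80, 443, 53}  # HTTP, HTTPS, DNS

def flag_nonstandard_pii(records):
    """Return PII exposures sent over plain TCP/UDP to ports other than
    80/443/53, e.g., the Android ID sent to port 8080 or 10086."""
    return [
        rec for rec in records
        if rec["protocol"] in ("TCP", "UDP")
        and rec["port"] not in STANDARD_PORTS
        and rec["pii_types"]
    ]
```

Combined with a check of whether the destination IP ever appeared in DNS responses, this flags the hard-coded-IP cases described above.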
HTTPS Exposures. Since the usage of encryption is increasing, we also collected and analyzed PII sent over HTTPS. Table 3 summarizes the exposures we discovered in HTTPS traffic. The top app, com.ss.android.article.master, is a news app, thus it makes sense for it to query the user's city, perhaps to fetch localized news. However, it is unclear why the app needs the user's IMEI (when it already has the Advertiser ID) and the specific longitude and latitude coordinates of the user. In this case, the city is needed by the app, but the IMEI and location coordinates are potentially privacy exposures. Another example is com.cmcm.live: it exposes 5 different device identifiers for no apparent reason. Hence, although well-behaving apps should use HTTPS, they should also be inspected for potential privacy exposures, as not all information that they gather is necessary for their functionality. We also found that the majority of top domains receiving PII over HTTPS were ad-related. Although it is expected for ad domains to receive the Advertiser ID, other PII should not be collected. These findings motivated us to conduct the user study in Sec. 4 to crowdsource answers to the question of when a privacy exposure becomes a privacy leak.

App Name | Leak Types | Port
System | IMEI, IMSI, AndroidId | 8080
com.jb.gosms | AndroidId | 10086
com.jiubang.go.music | AndroidId | 10086
air.com.hypah.io.slither | Username | 10086
com.jb.emoji.gokeyboard | AndroidId | 10086
com.gau.go.launcherex | AndroidId | 10086
com.steam.photoeditor | AndroidId | 10086
com.jb.zcamera | AndroidId | 10086
com.flashlight.brightestflashlightpro | AndroidId | 10086

Table 2: TCP packets (non-HTTP/S) sending PII over ports other than 80, 443, 53.

App Name | PII Types | # Exposures
com.ss.android.article.master | City, Adid, Location, AndroidId, IMEI | 752
com.cleanmaster.security | Adid, AndroidId | 174
com.paypal.android.p2pmobile | City, FirstName, LastName, Zipcode, Adid, SerialNumber, AndroidId, Password, Email | 131
com.offerup | Adid, Username, FirstName, Location, Zipcode, AndroidId | 114
com.cmcm.live | Adid, AndroidId, Location, IMEI, SerialNumber, IMSI | 114
me.lyft.android | City, FirstName, LastName, SerialNumber, Zipcode, PhoneNumber, Location, AndroidId | 112
com.pinterest | Adid, AndroidId | 111
com.weather.Weather | Adid, Location | 110
com.qisiemoji.inputmethod | Adid, IMEI, AndroidId | 83
… | … | …
All | All | 3039

Domain Name | Leak Types | # Exposures
mopub.com | Adid | 2380
isnssdk | AndroidId, IMEI | 805
roblox.com | Location | 679
applovin.com | Adid | 566
rbxcdn.com | Location | 561
appsflyer.com | Adid | 549
facebook.com | Adid | 391
bitmango.com | Adid | 371
goforandroid.com | AndroidId | 262
ihrhls.com | Adid | 219
pocketgems.com | AndroidId | 211
ksmobile.net | SerialNumber, Location, AndroidId | 159
tapjoy.com | Adid, AndroidId | 151
tapjoyads.com | IMEI, AndroidId | 147
wish.com | Adid | 139
paypal.com | AndroidId | 131
… | … | …
All | All | 3039

Table 3: Summary of applications and domain names with HTTPS exposures in our dataset (manual and auto).

App Name | Domain | PII Types
com.bitstrips.imoji 10.2.32, 10.3.76 | pushwoosh.com | AndroidId
com.nianticlabs.pokemongo 0.57.4 | upsight-api.com | Location, AndroidId
com.psafe.msuite 3.11.6, 3.11.8 | upsight-api.com | AndroidId
com.yelp.android 9.5.1 | bugsnag.com | AndroidId
com.zeptolab.ctr.ads 2.8.0 | onesignal.com | AndroidId
com.namcobandaigames.pacmantournaments 6.3.0 | namcowireless.com | AndroidId
com.huuuge.casino.slots 2.3.185 | upsight-api.com | AndroidId
com.cmplay.dancingline 1.1.1 | pushwoosh.com | AndroidId

Table 4: Applications exposing the "jailbroken" field.
Checking for Rooted Devices. We noticed a suspicious flag called "jailbroken" or "device.jailbroken" exposed by several apps (e.g., com.bitstrips.imoji, com.yelp.android, com.zeptolab.ctr.ads, etc.). This flag was found in the URI content or in the body of a POST method in the packets, and it was set to 1 if the device was rooted, or to 0 otherwise. In Table 4, we show the applications that contain this field in our dataset and the domain to which the "jailbroken" flag is being sent. We also show other types of exposures that the particular domain collects. From the table, we see that the flag is usually accompanied by a device identifier. Several apps send this flag to the same domain (upsight-api.com, an ad network), which indicates that an ad library is probably exposing this information, rather than the app itself.
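The flag can be pulled out of a URI or POST body with a short pattern match; a sketch (the surrounding packet text below is fabricated for illustration):

```python
import re

# Matches "jailbroken" or "device.jailbroken" followed by '=' or ':' and 0/1,
# covering both URI query strings and JSON-style POST bodies.
_JAILBROKEN = re.compile(r'(?:device\.)?jailbroken["\']?\s*[=:]\s*(0|1)')

def jailbroken_flag(packet_text):
    """Extract the 'jailbroken' / 'device.jailbroken' flag from a packet's
    URI or POST body: True if the device is rooted, False if not,
    None if the flag is absent."""
    match = _JAILBROKEN.search(packet_text)
    if match is None:
        return None
    return match.group(1) == "1"
```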
Behavioral Analysis of PII Exposures.
An interesting direction for analyzing the
AntShield dataset is via behavioral analysis. For instance, we can ask: (i) what can the communication between mobile apps and destination domains reveal about tracking and advertising? and (ii) what type of information is exposed to which domains, and how can we define the similarity of apps or domains with respect to exposures? Fig. 3 showcases one graph that visualizes similar destination domains with respect to the exposures they received, as captured in the AntShield dataset. We define two domains to be similar if they are contacted by the same set of applications (see the box on the right inside Fig. 3). For example, domains A and B are similar because they are contacted by the same two apps (app1, app2). We depict the similarity of domains A and B as an edge on the graph of domains, at the bottom of the box. This data can be readily extracted from our trace, together with the type of information that was transmitted from apps to domains.

Footnote: This figure was first presented in Fig. 3 of [35] and is repeated here for completeness.

Figure 2: Application behavior exposing PII while running in the background vs. the foreground. (a) Flow Free (com.bigduckgames.flow 3.6), a puzzle game, exposes the user's city to an advertising server when the application is in the background. (b) MeetMe (com.myyearbook.m 11.8.0.681), an app for meeting people online, contacts different domains with different PII types depending on whether it is in the background or the foreground.

The graph depicted on the left side of Fig. 3 shows a projection of the underlying bipartite graph (middle step in the box) on domains (last step in the box); the graph is plotted and analyzed using Gephi [12]. Nodes in this graph represent domains; the edges indicate similar nodes as per the above definition; the width of an edge indicates the number of common applications; and the domain color corresponds to the type of exposed PII. The clusters of domains in the graph are the output of a community detection algorithm, which is a heuristic that tries to optimize modularity.

Interesting patterns are revealed in Fig. 3. First, advertising is the result of coordinated behavior. For example, it is easy to identify ad exchanges: mopub.com is in the center of all communication; and inner-active.mobi and nexage.com are also clearly shown as hubs. All three large communities on the bottom and left of the graph correspond to ad networks. Second, on the top left, there is a community of domains that belong mostly to Google and Facebook, and two domains (pof.com and plentyoffish.com) of a dating service. The latter could be because the dating app also sends statistics (e.g., for advertising purposes) to Google and Facebook, in addition to its own servers, as suggested by the type of PII being sent (gender and device ID, represented by the yellow color). Third, not all domains belong to a community: some are well-behaved and are contacted only by their own app. For instance, the white-colored domain zillow.com towards the bottom center of the graph is an isolated node and only receives information about the user's location, which makes sense since it provides a real-estate service.
Another example is the blue-colored domain hbonow.com: it is only contacted by its own app and only receives the Advertiser ID to serve ads. Another observation from Figure 3 is that most domains in the same community receive the same type of PII (as indicated by the domain color). This can be explained by the common ad libraries shared among different apps, which fetch the same PII. In general, the similarity of apps and domains based on the PII exposures found in their network activity can be exploited to detect and prevent abusive behavior (e.g., advertising, tracking, or malware) in mobile traffic. This is one promising direction for future work.
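The domain-similarity projection described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' pipeline: the exposure records and the app name "SomeGame" below are placeholders, and the edge-weight definition (number of common apps exposing PII to both domains) follows the description of Fig. 3.

```python
from collections import defaultdict
from itertools import combinations

# Placeholder (app, domain, PII type) exposure records, not the real dataset.
exposures = [
    ("MeetMe", "mopub.com", "Advertiser ID"),
    ("MeetMe", "nexage.com", "Advertiser ID"),
    ("SomeGame", "mopub.com", "Android ID"),   # hypothetical app
    ("SomeGame", "nexage.com", "Android ID"),
    ("Zillow", "zillow.com", "Location"),
]

# Project the bipartite app-domain graph onto domains.
apps_per_domain = defaultdict(set)
for app, domain, _pii in exposures:
    apps_per_domain[domain].add(app)

# Edge weight between two domains = number of apps exposing PII to both;
# domains sharing no apps with others (e.g., zillow.com) stay isolated.
edges = {}
for d1, d2 in combinations(sorted(apps_per_domain), 2):
    common = apps_per_domain[d1] & apps_per_domain[d2]
    if common:
        edges[(d1, d2)] = len(common)

print(edges)  # {('mopub.com', 'nexage.com'): 2}
```

The resulting weighted edge list could then be exported (e.g., as CSV) and loaded into Gephi for the community detection and visualization step.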
Footnote: The main idea is that, for a specific node i, the algorithm tries assigning to i the community of each of its neighbors j and computes the resulting gain in modularity for the whole network. The community that maximizes modularity is selected; if the gain in modularity is negative or zero, i keeps its current community. This process is repeated iteratively over all nodes. The algorithm is implemented in the Gephi software [12] and also works with weighted graphs.

Figure 3: Understanding the behavior of apps that expose PII through graph analysis of the AntShield dataset. The graph consists of nodes corresponding to destination domains and edges representing the similarity of two domains. Two domains are similar if there are common apps that send packets with PII exposures to both domains; the more common apps expose PII to these domains, the more similar they are, and the larger the width of the edge between them. The color of a domain node indicates the types of PII it receives. One can observe from the graph structure that domains form communities that capture interesting patterns: (1) the large communities on the left and bottom consist mostly of ad networks; ad exchanges are nodes in between ad communities; (2) Facebook/Google domains form a separate community of their own, on the top left; (3) small apps contact only their own domain, leading to isolated domain nodes; (4) domains in the same community receive the same type of PII (as indicated by the color of the nodes).

In this section, we design a user study on Amazon Mechanical Turk (MTurk) in order to assess users' awareness and understanding of mobile data exposure, as well as their level of concern and potential for adopting solutions. We use the datasets collected in the measurement study of the previous section to present participants with real-world scenarios of private information that was actually exposed by mobile apps in our experiments. We present the user with information about the types of PII exposed, as well as information about the context in which this exposure occurred, i.e., whether the PII is shared with the application or a third-party server; what the app category/intended functionality was; whether it is shared in clear or encrypted text; etc. We then ask the users to assess the legitimacy (i.e., whether the information is needed for the app's functionality) and the risk posed by the particular PII type exposed in that particular context. We also educate the participants on how a single piece of PII can lead to even more information being discovered when combined with data fetched from a data broker. Finally, we ask users before and after the survey whether they would use privacy-enhancing tools. To the best of our knowledge, this user study is the first one that collects and analyzes user input in such fine granularity (context) and on actual (not just potential or permitted) privacy exposures at large scale (as found in the packet traces of the measurement study). Section 4.1 presents the design rationale and questions asked in the MTurk study. Section 4.2 summarizes and analyzes the responses from 220 participants.
MTurk Setup. We designed a Human Intelligence Task (HIT) on Amazon Mechanical Turk (MTurk) [1] and restricted it to workers who are based in the U.S., are at least 18 years old, have completed at least High School (in the US), and own a smartphone or a tablet device. Our study was approved by the Institutional Review Board (IRB) of our institution (details are omitted from this double-blind version). The workers were rewarded at a rate of $0.10 per minute of their time – a standard followed in other studies [19, 22, 24]. We allotted 30 minutes for the completion of our HIT, but the majority of workers completed it within approximately 13 minutes. The participants had to pass at least one of three attention-check questions in order to have their HIT approved and to receive the $3.00 payment. (Footnote: We went through the IRB process in our institution and obtained exempt research registration HS.)
The HIT was open for 9 days in early May 2018. In the end, we analyzed the responses of the 220 workers who passed the attention checks.
Demographic questions.
First, we asked a set of demographic questions, such as educational level, age, and employment sector (tech vs. non-tech). We also asked what kind of mobile OS they use and how many different apps they use daily. In addition to these questions, we added three attention-check questions to prevent workers from gaming the system by providing random answers. We discarded answers from participants who did not answer any of the three attention-check questions correctly.
Categorization questions.
The main goal of our study is to learn how concerned users are about privacy exposures in different contexts, where context is defined by: the type of PII exposed; the app category; whether the information was shared with a relevant application server (and thus may be useful for the functionality of the app) or with third-party advertiser and analytics servers; and whether it is shared in plain text or encrypted. To that end, we first defined these terms and categories as shown in Fig. 4. First, we asked the participants how comfortable they are with sharing various
PII types with different types of remote servers, as depicted in Fig. 5. The different PII types include: various device ids (such as phone number, IMEI, IMSI, ICCID, Android ID, etc.), user ids (e.g., email, Advertiser ID, username, and password), location (GPS coordinates), and demographic information (e.g., gender, city, zipcode, first and last name).

Figure 4: Terms defined before being used in the categorization tasks. (a) Definitions used to categorize a PII exposure. (b) Definitions used to describe additional context of a PII exposure.

Destination servers are roughly divided into two categories: app developers vs. ad & analytics servers. The rationale is that the application servers may need the PII to perform their functionality (e.g., Google Maps clearly needs location) while third-party servers do not (thus causing more of a privacy leak rather than a mere exposure). For each pair of (PII type, destination type), we asked the participants to rate their comfort level with sharing that PII type with that remote server on a scale from 0 to 3, where 0 represents the least concern and 3 represents maximum concern (and willingness to pay for a privacy-preserving solution); see Fig. 5 for details. Second, we asked participants to rate their concern in real-case scenarios of PII exposures from our dataset (Sec. 3). For each packet that contained a PII in our dataset, we considered a broader context beyond just
PII type and destination server type (i.e., application server or ad/analytics server). We also considered the category of the app (e.g., game vs. navigation), whether the PII was encrypted or sent in plain text, and the frequency of this PII being exposed by this app category. The rationale is that the same PII type exposed may be more or less concerning to users depending on the context. For example, location exposed by a navigation app to that app's server in an encrypted packet is probably needed for the app to function, while sending a user id to an unrelated third-party (e.g., advertiser) server, frequently and/or in plain text, is indeed a privacy leak. A side benefit of the aforementioned categorization is that it helped reduce the number of cases to be evaluated by users. Out of 8,579 total exposures (Table 1), 1,726 are unique when considering the application responsible, the type of PII, the destination host, and the level of encryption. To further reduce the number of cases to label, we grouped the applications based on their Google Play Store [5] category and destination type (ad & analytics or not). To find ad & analytics domains, we used the hpHosts [6] list, as it was found to be the most comprehensive list for the mobile ecosystem to date [29]. This grouping reduced the total number of unique combinations to 256, which contained 23 unique
Google Play Store categories (out of 36 total). We split these combinations into five batches of HITs, where each HIT contains five (or three, for the last batch) categories of apps to be labeled. To prepare the participants for the labeling task, we first showed a “warm up” question (Fig. 6(a)) with a hypothetical scenario of the
Roblox app exposing certain PII and asked them to assess the risk (Fig. 4(a)). Next, we asked the participants to label the exposure scenarios for each of the five categories in their HIT – an example is shown in Fig. 6(b). We also provide an example app out of each category (from our actual dataset) along with a link to the app's
Google Play Store page.
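The two-stage reduction just described (raw exposures → unique tuples → grouped contexts) can be sketched as follows. This is a toy illustration under assumed inputs: the records, the play_category lookup (standing in for Google Play Store metadata), and the ad_domains set (standing in for the hpHosts list) are all made up for the example.

```python
# (app, PII type, destination host, encrypted?) records; illustrative only.
exposures = [
    ("Roblox", "Location", "ads.adnet-example.com", False),
    ("Roblox", "Location", "ads.adnet-example.com", False),  # exact duplicate
    ("Roblox", "Advertiser ID", "api.example-dev.com", True),
    ("Zillow", "Location", "zillow.com", True),
]

# Hypothetical stand-ins for Play Store categories and the hpHosts domain list.
play_category = {"Roblox": "Game", "Zillow": "House & Home"}
ad_domains = {"ads.adnet-example.com"}

# Stage 1: keep only unique (app, PII, host, encryption) combinations.
unique = set(exposures)

# Stage 2: replace the app with its Play Store category and the host with its
# destination type (ad & analytics vs. other), then deduplicate again.
grouped = {
    (play_category[app], "A&A" if host in ad_domains else "other", pii, enc)
    for app, pii, host, enc in unique
}

print(len(exposures), len(unique), len(grouped))  # 4 3 3
```

On the real dataset, the same two steps produce the reduction reported above (8,579 exposures to 1,726 unique tuples to 256 grouped combinations).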
0. Now consider different types of information available on your phone and shared by mobile apps with various servers. For each type of private information below, please indicate how much you are concerned about it being shared (on a scale of 0-3), and what measure you would be willing to take in order to protect your data. Enter:
• 0: if you are not concerned about sharing that private information with a remote server.
• 1: if you are concerned, but you wouldn't take any action to protect it from being shared with a remote server.
• 2: if you are concerned, and you would be interested in a free solution that would prevent your private information from being shared without your consent.
• 3: if you are concerned, and you would pay for a solution (if a free solution is not available) that would prevent your private information from being shared without your consent.

Figure 5: Task to assess how comfortable users are with sharing a certain PII type with a certain type of remote server.

Assessing User Concern and Possible Actions.
In order to assess participants' privacy awareness and understanding, we asked another set of questions, shown in Fig. 7. First, we asked the participants how much they care about information being shared by their mobile device (Fig. 7(a)) and what they would do in order to better protect their information. Next, we asked if they would use an app (e.g., AntMonitor) that can prevent privacy exposures and how much they would pay for such a service (Fig. 7(b)). It was our hope that the categorization questions described previously would educate users about mobile privacy and make them more concerned towards the end of the survey. To educate them further, we showed them that a single PII in the hands of a data broker can help create or look up a user profile and can reveal much more information (Fig. 7(d)); the question was based on a real scenario in which we fetched data from a data broker based on a person's email address. We then asked them again if they would use a privacy app (and if they would pay for it) to protect their privacy. Finally, we asked them if they would contribute their mobile data to an app that crowdsources information, trains machine learning models, and prevents privacy leaks (e.g., as in Recon [33] and in [35]). In order to assess our hypothesis that users became more privacy-aware after the categorization questions and the data broker example, the same questions appeared twice during the survey: once in the middle of the survey and once at the end.
In this section, we summarize and analyze the main results of our user study. The main observation is that users initially seem confused about the severity of PII shared in different contexts, but they are interested in and capable of being trained. Our main findings are as follows:
• Users do not seem to understand how severe it is to share certain PII types (especially device identifiers) with either Advertisers or Developers. This may be because some of these ids, such as Android ID, IMEI, IMSI, and ICCID, are difficult to understand or relate to; yet they uniquely identify the device and/or the user and hence should not be shared with remote servers.
• As expected, users trust Application Developers more than Advertisers & Analytics. However, some comments described Developers as “a bunch of hackers behind servers,” indicating possible confusion.
• Users also seem confused about sharing PII in plain text vs. using encryption. Sharing PII in plain text is a bad practice, since it exposes private information not only to the destination servers but also to anyone sniffing the network.
• Towards the end of the study, most users seemed to obtain a much better understanding of mobile privacy. For example, several comments state that users are grateful for our short tutorial on mobile privacy. They are willing to educate themselves more in the future and to adopt data transparency and privacy tools. These are encouraging results for future work on developing privacy-enhancing technologies.

Consider that you run the ROBLOX app on your smartphone. ROBLOX belongs to app category GAME in Google Play. In our experiments, we noticed that this app sends your phone's GPS location, in plain text and with high frequency, to both remote ROBLOX servers (“App Developer”) and to third-party Advertising and Analytics servers. This is summarized in the following table.
a) Do you consider the sharing of location with Advertising & Analytics servers to be... (please pick one option)
b) Do you consider the sharing of location with App Developer to be... (please pick one option)
Options for each: o Needed by the App; o Not Needed by the App, but not Harmful; o Not Needed by the App, and maybe Harmful; o I don't know/care.
(a) Warm-up question with a hypothetical scenario of a particular example application – the game Roblox.
14. Consider the Category: Game. Some example apps that belong to this category are: Baseball Boy!, Partymasters - Fun Idle Game, Snake VS Block. Please evaluate each information type that is shared by the Game category in the following table.
(b) Entire game category with real cases of PII types exposed and their broader context (remote server type, but also app category, encrypted/plain text, and frequency).
Figure 6: Task to assess user concern about privacy exposures in context.
We collected 223 responses in total on our MTurk survey. Since we posted multiple HITs, each with different categorization questions (see Sec. 4.1), some users completed more than one HIT, and there were 151 unique participants in total. Two responses were discarded for failing the attention-check questions, and one worker was discarded due to incomplete answers. This leaves a total of 220 valid responses, which we analyze in the rest of the section. The majority of the participants were between 25 and 34 years old, held a Bachelor's degree, and were employed in a non-tech sector. 61.8% of the workers were Android users and 37.7% were iOS users. 118 of our workers use six to ten different apps every day, 53 use between 11 and 20 apps, 41 use fewer than five apps, and only 8 use more than 20 apps.
In this subsection, we provide the results obtained by processing user answers to the categorization questions (see Figures 5 and 6). We asked participants to rank the severity of PII exposures in different contexts, since whether or not an exposure is considered a privacy leak depends on the context. There are four main dimensions in each exposure: (i) its destination (app developers or advertisers), (ii) the category of the app responsible, (iii) the level of encryption used, and (iv) the PII type. These dimensions play an important role in distinguishing privacy leaks from mere exposures.

Figure 7: Assessing users' concern and potential actions. (a) How much do users care about privacy? (b) How much would users pay to protect their privacy? (c) Would users contribute their data to a privacy app? (Options: Yes, I would use it for free. / Yes, I would pay to use it. / No, I would not use it. / I cannot decide/need more information.) (d) Do users care more after being educated about data brokers?

Figure 8(a) shows the results when we asked users how concerned they are with sharing a certain information type with a particular destination (Advertisers & Analytics or App Developer servers), as shown previously in Fig. 5. In this task, we observe that sharing any type of PII is concerning to our participants, regardless of its destination, and that they are willing to use a free app to protect themselves. Furthermore, they would also pay for a tool that protects their phone number and password from leaking to advertisers. As expected, participants are more concerned about their precise longitude and latitude coordinates being shared than about their zipcode and city. Overall, our participants seem to trust developers more than advertisers, which is not surprising.

Figure 8: How comfortable are users with sharing PII with advertisers and app developers? (a) Average rating of user responses to the categorization task of Fig. 5. (b) Recommended rating by us (“experts”). Legend: Concerned and would pay to protect it / Concerned but would do nothing to protect it / Concerned and would use a free app to stop it / Not concerned.

In Fig. 8(b), we also provide our own “expert” rating for comparison. In particular, we would like to protect device identifiers regardless of their destination, using a paid solution if a free version is not available. In contrast, our participants chose to protect their device identifiers only with a free solution and are not willing to pay to protect them. Moreover, they give similar ratings to their location data regardless of the destination, although they should be more careful with advertisers than with developers, as there are apps that need location in order to function. On the other hand, the Advertiser ID does not need to be protected as strongly as the other identifiers, since this id is known to advertisers anyway.
Understanding whether PII is needed by the app.
We ask users to assess whether sharing a particular PII is legitimate and needed for the functionality of the app (or, more broadly, by apps in the same category), or whether it is unnecessary and potentially harmful. For example, Google Maps is an app in the Navigation category and needs to share location to provide its service. Fig. 9 presents a heatmap of the perceived severity of different PII types per application category. The x-axis shows the different PII types and the y-axis contains the categories of apps responsible for sending that PII over the network. The values and colors represent the level of concern the participants have regarding each (category, PII type) pair: “not needed by the app and maybe harmful” (value 4 – dark color), “not needed by the app and not harmful” (value 3), “needed by the app” (value 2), “don't care” (value 1 – light color). White color (value 0) represents missing combinations of PII and category that were not present in our datasets. In order to produce the heatmap, we take the mean values of all participants' answers. For comparison, we also show a heatmap labeled by us to express our “expert” opinion. First, location information (latitude and longitude coordinates, zipcode, and city) should be available to the system, maps & navigation, weather, and travel & local categories only. All other categories do not need this kind of information for their functionality, and hence when their apps transmit such data, it may be considered harmful. Second, user identifiers, such as email, password, username, gender, etc., may be needed in some app categories (e.g., social, games, communication), but not in others (e.g., photography, personalization, tools). Finally, device identifiers, such as IMEI, IMSI, Serial Number, ICCID, and Android ID, should not be used by any app category, except perhaps by the System. These identifiers are unique and can be used to track users across different apps and build profiles, such as the one we showed in Fig. 7(d).
Furthermore, Google explicitly discourages app developers from using these identifiers and asks them to instead create and use their own unique identifiers within their app [4]. Comparing the two heatmaps in Fig. 9 reveals that users are not as concerned about device identifiers as they should be: 32.6% of the device identifiers per app category were categorized by workers as “Not needed by the app and not harmful,” although they should only be accessed by the System. However, other responses make sense: app categories that should not require any information to function are trusted the least (games, lifestyle, music & audio, personalization, and tools), and categories such as weather and travel & local are trusted with location information.
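As a concrete sketch of how such a heatmap cell could be computed (this is our illustration, not the authors' analysis code): each cell is the mean of the participants' ratings on the 1-4 scale above for that (app category, PII type) pair, with 0 filled in for combinations absent from the dataset. The responses below are invented for the example.

```python
from collections import defaultdict

# (app category, PII type, rating) with 1: "don't care" ... 4: "maybe harmful".
responses = [
    ("Game", "Location", 4),
    ("Game", "Location", 3),
    ("Weather", "Location", 2),
]

totals = defaultdict(lambda: [0, 0])  # (category, PII) -> [sum, count]
for category, pii, rating in responses:
    totals[(category, pii)][0] += rating
    totals[(category, pii)][1] += 1

categories = ["Game", "Weather"]
pii_types = ["Location", "IMEI"]

# Mean rating per cell; 0.0 marks (category, PII) pairs missing from the data.
heatmap = {
    (c, p): totals[(c, p)][0] / totals[(c, p)][1] if totals[(c, p)][1] else 0.0
    for c in categories
    for p in pii_types
}

print(heatmap[("Game", "Location")])  # 3.5
print(heatmap[("Weather", "IMEI")])   # 0.0
```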
Understanding Encryption.
We further split the heatmap in Fig. 9 into Fig. 10(a) and Fig. 10(b), depending on whether the packet containing the PII was unencrypted or encrypted, respectively. Our heatmaps not only reveal participants' opinions about exposures, but also show behavioral patterns of app categories. From Fig. 10(a), we see that the weather category does not use encryption and sends the Advertiser ID, Location, and City to remote servers in plain text. Similarly, the maps & navigation category sends Serial Number, Location, Zipcode, and City information in plain text, but encrypts the Android ID and Advertiser ID. Sending PII in plain text is more harmful, since this traffic can be sniffed. Unfortunately, our MTurk participants did not seem to understand the implications of transmitting data in plain text.
Understanding Destinations.
We also split our heatmaps based on whether the packet is going to App Developers (Fig. 11(b)) or to Advertiser & Analytics servers (Fig. 11(a)). As expected, developers require more types of information than advertisers. On the other hand, advertisers should ideally require only the Advertiser ID, which is indeed fetched by almost every app category. However, most of our participants indicated that no PII needs to be sent to advertisers for the app's functionality (only 2 out of 31 values in Fig. 11(a) are below 2.5). This indicates that perhaps users do not consider the serving of ads to be part of the app's functionality. In contrast, participants indicated more trust towards app developers (Fig. 11(b)), which is expected, as certain app categories require PII to function correctly. For example, the following pairs of PII and categories are expected: username for applications with logins (communication, games, and social), email for communication, and location information for travel & local and maps & navigation. However, device identifiers, such as IMEI, IMSI, Serial Number, and Android ID, are not needed by any app category and should not be shared with advertisers or developers. Our participants do not seem to understand this point and labeled most of these exposures as “Not needed and not harmful,” while in fact these ids are most likely used for tracking. One interesting finding is that participants showed more concern over their IMEI than over the other device identifiers. This indicates that they may need more education about the other device identifiers.

Figure 9: Heatmap of the severity of (PII type, app category) pairs. Legend: 4: Not needed and maybe harmful; 3: Not needed and not harmful; 2: Needed by the App; 1: I don't know/care; 0: Missing values. The darkness of the color indicates the perceived severity of the PII exposure: the darkest corresponds to 4.0 (“Not needed by the app and maybe harmful”), while the lightest indicates 1.0 (“I don't care”). Zeros represent missing values for combinations we did not have in our datasets. We compare the average ratings among users who answered that question (left column) vs. the labeling recommended by us (right column).

Figure 10: Heatmaps assessing the severity of exposures in encrypted vs. unencrypted packets, as assessed by users vs. us (“experts”). (a) Exposures in plain text: average user rating vs. our recommended rating. (b) Exposures in encrypted traffic: average user rating.
Fig. 12(a) shows the distribution of answers to the questions of Fig. 7, which essentially ask how much users care about privacy exposures and what they are willing to do about them. The overwhelming majority of users would like more control over their information being shared, if a free option is available. Fig. 12(b) shows which payment options users would prefer for a privacy app: the majority would prefer a one-time fee between $1 and $5 over a subscription model. These results show that users are not only interested in using privacy-preserving tools but are also willing to pay for them. Towards the end of the study, after being educated about the extent and severity of PII sharing, more users were willing to pay for a privacy app, e.g., compare Fig. 12(a) to Fig. 12(c). This demonstrates that although users may originally not be aware of or understand the risks of privacy exposures, they become more wary after being educated about the occurrence and risks of sharing PII (especially after learning the power of data brokers). Finally, Fig. 12(d) shows the users' willingness to contribute data to privacy-preserving tools that crowdsource information (such as Lumen, Recon, AntMonitor) before and after completing our survey. Once again, completing the study made users more open towards using and helping a privacy-preserving app.
Figure 11: Heatmaps assessing the severity of PII sent to different destinations: Third-party (Ad Servers) vs. App Developers, as assessed by users (average user rating shown; the darker the color, the more concerned the users). (a) PII sent to Ad Servers. (b) PII sent to Developer Servers. Legend: 4: Not needed and maybe harmful; 3: Not needed and not harmful; 2: Needed by the App; 1: I don't know/care; 0: Missing values.

At the end of the study, we asked participants for their comments and suggestions; 99 out of 220 users provided feedback. In Fig. 13, we summarize the comments of all workers in the form of a word cloud. Overall, our MTurk workers seemed satisfied with the survey and stated that it was an educational experience. Most of the participants are thankful for our short tutorial on mobile privacy, as they gained more knowledge about privacy and how different apps share their information with various destinations. At the end of the survey, they seem more concerned about privacy and interested in a solution, including a privacy app, free or paid; see Fig. 12. Several participants mentioned the recent Facebook and Cambridge Analytica scandal and wondered if our user survey was inspired by it (it was not!). Below, we provide a sample of representative user comments, out of 99 total, grouped by two recurrent topics.
Figure 12: Answers to the questions of Fig. 7: how much do users care about mobile privacy and what are they willing to do to protect it? (a) How much users care about privacy exposures. (b) Monetization question at the beginning of the study. (c) Monetization question at the end of the study. (d) Data contribution at the beginning vs. at the end of the study. Some questions are purposely repeated at the beginning and at the end of the study.

They learned a lot:
“I liked this. Was this all due to what is going on with Facebook and other privacy concerns?”
“I enjoyed taking this survey. I would not share my information with anyone that would use the information in anyway that would be damaging to my life. Right now I am getting phone calls I don't want. I would pay to have those stopped.”
“I definitely became more concerned about how much data is taken. You don't realize how much of your personal data is sent to advertisers and it makes you more weary of downloading and using certain apps.”
“I thought it was quite enlightening. I will certainly be paying more attention to what and how an app uses my data in the future.”
“I definitely became more concerned about my privacy. Thank you for the wake up call.”
“Good Study. Made me really think about how much of everybody's information is really out there.”
“I am surprised to learn how much information could be linked to my email address. It is like telemarketing on steroids and I would be willing to keep that info private I think we should always have the option to keep our private info private.”
“I have become a little more concerned about my privacy as some of this was new information to me. I will definitely be doing more research on this subject in the near future. Thank you for the informative study! Everything was very clear and straight forward, I appreciate the opportunity to participate.”

Figure 13: Overview of the main keywords extracted from 99 MTurk workers' comments.

They are interested in using a privacy-enhancing solution:
“I did not know all of that and if you developed a product I want it.”
“Yes - I have been concerned, but it has been difficult to know where to start with securing my data. I live in the US, so there is no GDPR to protect me, however I feel I can review application requests for information in a more informed manner. I would also love to find the service you mentioned about protecting privacy of data.”
“Thank you. I learned a lot from this study. I hope you are able to develop something that can help protect consumers and still allow developers to flurish.”
“I would love a reliable app that limited/reduced data collection and increased privacy. Problem is that it's not possible to know which privacy apps are reliable and effective, while still allowing a service or app to be used. Even a reliable, mainstream email alternative would be good!”
We provided a combination of a measurement study and a user study of actual PII exposed by mobile apps. We also defined and analyzed the context (PII type, destination domain, app category/functionality, background/foreground, use of encryption vs. plain text) in which these PII exposures occur, and we distinguished between a PII exposure and a PII leak (which is more likely to be harmful) depending on the context. In the measurement study, we collected and analyzed a new, richer dataset, which reveals interesting PII exposures and patterns, some of which were previously unknown. Preliminary graph analysis revealed interesting patterns of apps and domains colluding to expose private information. In the user study, we compiled the large amount of information from the measurement study into a smaller number of categories (contexts) and asked users to assess the privacy exposures in the actual context in which they appeared, w.r.t. their legitimacy and the risk they pose. Most users were initially unaware of the severity and potential implications of PII exposures: they could not identify the most critical PII or context (e.g., that it may be acceptable to share information with the application server but not with third parties, such as advertisers and trackers). However, they seemed to appreciate the information they received through the study, which made them more willing to adopt privacy-enhancing tools. Our analysis combines the scale and coverage of the network-based measurement study with the fine-granularity user input assessing privacy exposures in context. There are several directions for future work, building on the observations of this paper. First, the behavioral analysis of PII leaks at the end of Section 3 revealed similarities in the way apps and destination domains extract PII from mobile devices. Those
Those observations can be further exploited to design machine learning approaches that detect packets with potential privacy exposures [11, 34], inspect them further, and eventually prevent an actual leak (e.g., by using real-time tools like AntMonitor to block a packet or obfuscate a PII). Second, the user study showed that users are interested in, and capable of, being educated about their data, and that they want to adopt better privacy practices and new tools (such as AntMonitor, ReCon, or Lumen) to enhance data transparency and privacy control.
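The detection step that such tools build on can be sketched with a simple per-device matcher: given the device's own identifiers, flag outgoing payloads that carry them in plain or hashed form. The identifiers and payload below are hypothetical, and a deployed system would combine string matching with a learned classifier rather than rely on matching alone.

```python
# Sketch of a per-device PII exposure matcher (hypothetical identifiers).
# A real tool would run this on decrypted outgoing traffic and combine
# it with a trained packet classifier, as discussed in the text.
import hashlib

# Hypothetical identifiers of this device, keyed by PII type.
device_ids = {
    "imei": "356938035643809",
    "ad_id": "38400000-8cf0-11bd-b23e-10b96e40000d",
}

def search_strings(value: str) -> set:
    """Plaintext plus common hash encodings of one identifier."""
    enc = value.encode()
    return {
        value,
        hashlib.md5(enc).hexdigest(),
        hashlib.sha1(enc).hexdigest(),
        hashlib.sha256(enc).hexdigest(),
    }

def find_exposures(payload: str) -> list:
    """Return the PII types whose identifier appears in the payload."""
    found = []
    for pii_type, value in device_ids.items():
        if any(s in payload for s in search_strings(value)):
            found.append(pii_type)
    return found

payload = "GET /track?aid=38400000-8cf0-11bd-b23e-10b96e40000d HTTP/1.1"
print(find_exposures(payload))  # ['ad_id']
```

Once a packet is flagged, a real-time interception tool can block it outright or rewrite the matched bytes with a placeholder, which is the obfuscation option mentioned above.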
ACKNOWLEDGMENTS
We would like to thank Milad Asgari Mehrabadi for providing Figures 2 and 3 for this paper. This work has been supported by NSF Awards 1649372, 1815666, and 1900654, and by a UCI Proof-of-Concept Award in 2017. E. Bakopoulou has also been supported by the Broadcom Foundation and Henry Samueli fellowships. A. Shuba has also been supported by an ARCS Fellowship.
REFERENCES
[1] [n. d.]. Amazon Mechanical Turk. https://www.mturk.com.
[2] [n. d.]. AntMonitor Project. https://athinagroup.eng.uci.edu/projects/antmonitor/.
[3] [n. d.]. App Annie. https://www.appannie.com.
[4] [n. d.]. Best Practices for Unique Identifiers. https://developer.android.com/training/articles/user-data-ids.
[5] [n. d.]. Google Play. https://play.google.com/store?hl=en.
[6] [n. d.]. hpHosts - Ad and Tracking Servers Only. https://hosts-file.net/ad_servers.txt.
[7] [n. d.]. PhoneLab, University at Buffalo. https://www.phone-lab.org/.
[8] [n. d.]. Speed Guide: Ports Database. https://www.speedguide.net/ports.php.
[9] [n. d.]. UI/Application Exerciser Monkey. https://developer.android.com/studio/test/monkey.html.
[10] Hazim Almuhimedi, Florian Schaub, Norman Sadeh, Idris Adjerid, Alessandro Acquisti, Joshua Gluck, Lorrie Faith Cranor, and Yuvraj Agarwal. 2015. Your Location has been Shared 5,398 Times!: A Field Study on Mobile App Privacy Nudging. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 787–796.
[11] Evita Bakopoulou, Balint Tillman, and Athina Markopoulou. 2019. A Federated Learning Approach for Mobile Packet Classification. arXiv:cs.LG/1907.13113.
[12] M. Bastian, S. Heymann, M. Jacomy, et al. 2009. Gephi: An Open Source Software for Exploring and Manipulating Networks. In ICWSM.
[13] Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 3 (2017), 42.
[14] Andrea Continella, Yanick Fratantonio, Martina Lindorfer, Alessandro Puccetti, Ali Zand, Christopher Kruegel, and Giovanni Vigna. 2017. Obfuscation-Resilient Privacy Leak Detection for Mobile Apps Through Differential Analysis. (2017).
[15] M. Egele, C. Kruegel, E. Kirda, and G. Vigna. 2011. PiOS: Detecting Privacy Leaks in iOS Applications. In NDSS.
[16] W. Enck, P. Gilbert, B. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. 2014. TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones. ACM TOCS (2014).
[17] H. Falaki, D. Lymberopoulos, R. Mahajan, S. Kandula, and D. Estrin. 2010. A First Look at Traffic on Smartphones. In Proc. of the 10th ACM SIGCOMM Conf. on Internet Measurement. Melbourne, Australia.
[18] C. Gibler, J. Crussell, J. Erickson, and H. Chen. 2012. AndroidLeaks: Automatically Detecting Potential Privacy Leaks in Android Applications on a Large Scale. In TRUST.
[19] Qatrunnada Ismail, Tousif Ahmed, Kelly Caine, Apu Kapadia, and Michael Reiter. 2017. To Permit or Not to Permit, That is the Usability Question: Crowdsourcing Mobile Apps' Privacy Permission Settings. Proceedings on Privacy Enhancing Technologies 4, 4 (2017), 118–136.
[20] Zach Jorgensen, Jing Chen, Christopher S. Gates, Ninghui Li, Robert W. Proctor, and Ting Yu. 2015. Dimensions of Risk in Mobile Applications. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy (CODASPY '15), 49–60.
[21] Yuanchun Li, Fanglin Chen, Toby Jia-Jun Li, Yao Guo, Gang Huang, Matthew Fredrikson, Yuvraj Agarwal, and Jason I. Hong. 2017. PrivacyStreams: Enabling Transparency in Personal Data Processing for Mobile Apps. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 3, Article 76 (Sept. 2017), 26 pages.
[22] Jialiu Lin, Norman Sadeh, Shahriyar Amini, Janne Lindqvist, Jason I. Hong, and Joy Zhang. 2012. Expectation and Purpose: Understanding Users' Mental Models of Mobile App Privacy through Crowdsourcing. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (UbiComp '12), 501.
[23] Bin Liu, Mads Schaarup Andersen, Florian Schaub, Hazim Almuhimedi, Shikun (Aerin) Zhang, Norman Sadeh, Yuvraj Agarwal, and Alessandro Acquisti. 2016. Follow My Recommendations: A Personalized Privacy Assistant for Mobile App Permissions. In Twelfth Symposium on Usable Privacy and Security (SOUPS 2016), 27–41.
[24] Winter Mason and Siddharth Suri. 2010. Conducting Behavioral Research on Amazon's Mechanical Turk. Behavior Research Methods 5, 5 (2010), 1–23.
[25] Abhinav Mehrotra, Sandrine R. Müller, Gabriella M. Harari, Samuel D. Gosling, Cecilia Mascolo, Mirco Musolesi, and Peter J. Rentfrow. 2017. Understanding the Role of Places and Activities on Mobile Phone Interaction and Usage Patterns. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 3 (2017), 1–22. arXiv:1603.09436.
[26] Yuhong Nan, Zhemin Yang, Xiaofeng Wang, Yuan Zhang, Donglai Zhu, and Min Yang. 2018. Finding Clues for Your Secrets: Semantics-Driven, Learning-Based Privacy Discovery in Mobile Apps. (2018).
[27] A. Rao, A. Molavi Kakhki, A. Razaghpanah, A. Tang, S. Wang, J. Sherry, P. Gill, A. Krishnamurthy, A. Legout, A. Mislove, and D. Choffnes. 2013. Using the Middle to Meddle with Mobile. Technical Report. Northeastern University.
[28] Ashwini Rao, Florian Schaub, Norman Sadeh, Alessandro Acquisti, and Ruogu Kang. 2016. Expecting the Unexpected: Understanding Mismatched Privacy Expectations Online. In Proceedings of the Twelfth Symposium on Usable Privacy and Security (SOUPS 2016), 77–96.
[29] Abbas Razaghpanah, Rishab Nithyanand, Narseo Vallina-Rodriguez, Srikanth Sundaresan, Mark Allman, Christian Kreibich, and Phillipa Gill. 2018. Apps, Trackers, Privacy, and Regulators: A Global Study of the Mobile Tracking Ecosystem. (2018).
[30] A. Razaghpanah, N. Vallina-Rodriguez, S. Sundaresan, C. Kreibich, P. Gill, M. Allman, and V. Paxson. 2015. Haystack: In Situ Mobile Traffic Analysis in User Space. arXiv:1510.01419 (Oct. 2015).
[31] A. Razaghpanah, N. Vallina-Rodriguez, S. Sundaresan, C. Kreibich, P. Gill, M. Allman, and V. Paxson. 2016. Haystack: A Multi-Purpose Mobile Vantage Point in User Space. arXiv:1510.01419v3 (Oct. 2016).
[32] Jingjing Ren, Martina Lindorfer, Daniel J. Dubois, Ashwin Rao, David Choffnes, and Narseo Vallina-Rodriguez. 2018. Bug Fixes, Improvements, ... and Privacy Leaks: A Longitudinal Study of PII Leaks Across Android App Versions. (February 2018).
[33] J. Ren, A. Rao, M. Lindorfer, A. Legout, and D. Choffnes. 2016. ReCon: Revealing and Controlling PII Leaks in Mobile Network Traffic. In ACM MobiSys.
[34] A. Shuba, E. Bakopoulou, and A. Markopoulou. 2018. Privacy Leak Classification on Mobile Devices. 1–5.
[35] Anastasia Shuba, Evita Bakopoulou, Milad Asgari Mehrabadi, Hieu Le, David Choffnes, and Athina Markopoulou. 2018. AntShield: On-Device Detection of Personal Information Exposure. arXiv preprint arXiv:1803.01261 (2018).
[36] A. Shuba, A. Le, E. Alimpertis, M. Gjoka, and A. Markopoulou. 2016. AntMonitor: System and Applications. arXiv:1611.04268 (2016).
[37] A. Shuba, A. Le, M. Gjoka, J. Varmarken, S. Langhoff, and A. Markopoulou. 2015. AntMonitor: Network Traffic Monitoring and Real-Time Prevention of Privacy Leaks in Mobile Devices. Poster presentation. In ACM MobiCom Demo and Short Paper (and best demo in S3).
[38] Yihang Song. 2015. PrivacyGuard: A VPN-Based Approach to Detect Privacy Leakages on Android Devices. (2015).
[39] Y. Song and U. Hengartner. 2015. PrivacyGuard: A VPN-based Platform to Detect Information Leakage on Android Devices. In Proc. of the 5th Annual ACM CCS Workshop on Security and Privacy in Smartphones and Mobile Devices.
[40] Lei Tian, Shaosong Li, Junho Ahn, David Chu, Richard Han, Qin Lv, and Shivakant Mishra. 2013. Understanding User Behavior at Scale in a Mobile Video Chat Application. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp '13), 647.
[41] N. Vallina-Rodriguez, A. Aucinas, M. Almeida, Y. Grunenberger, K. Papagiannaki, and J. Crowcroft. 2013. RILAnalyzer: A Comprehensive 3G Monitor on Your Phone. In Proc. of IMC. Barcelona, Spain.
[42] Max Van Kleek, Ilaria Liccardi, Reuben Binns, Jun Zhao, Daniel J. Weitzner, and Nigel Shadbolt. 2017. Better the Devil You Know: Exposing the Data Sharing Practices of Smartphone Apps. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 5208–5220.
[43] Haoyu Wang, Yuanchun Li, Yao Guo, and Yuvraj Agarwal. 2017. Understanding the Purpose of Permission Use in Mobile Apps. ACM Trans. Inf. Syst. (2017).
[44] Na Wang, Bo Zhang, Bin Liu, and Hongxia Jin. 2015. Investigating Effects of Control and Ads Awareness on Android Users' Privacy Behaviors and Perceptions. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services. ACM, 373–382.
[45] Yichuan Wang and David Chu. 2015. EarlyBird: Mobile Prefetching of Social Network Feeds via Content Preference Mining and Usage Pattern Analysis. In MobiHoc (2015), 67–76.
[46] X. Wei, L. Gomez, I. Neamtiu, and M. Faloutsos. 2012. ProfileDroid: Multi-Layer Profiling of Android Applications. In ACM MobiCom.
[47] Qiang Xu, Jeffrey Erman, Alexandre Gerber, Zhuoqing Mao, Jeffrey Pang, and Shobha Venkataraman. 2011. Identifying Diverse Usage Behaviors of Smartphone Apps. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC '11). ACM, New York, NY, USA, 329–344.