Understanding Worldwide Private Information Collection on Android
Yun Shen
NortonLifeLock Research
[email protected]

Pierre-Antoine Vervier
NortonLifeLock Research
[email protected]

Gianluca Stringhini
Boston University
[email protected]
Abstract—Mobile phones enable the collection of a wealth of private information, from unique identifiers (e.g., email addresses), to a user's location, to their text messages. This information can be harvested by apps and sent to third parties, which can use it for a variety of purposes. In this paper we perform the largest study of private information collection (PIC) on Android to date. Leveraging an anonymized dataset collected from the customers of a popular mobile security product, we analyze the flows of sensitive information generated by 2.1M unique apps installed by 17.3M users over a period of 21 months between 2018 and 2019. We find that 87.2% of all devices send private information to at least five different domains, and that actors active in different regions (e.g., Asia compared to Europe) are interested in collecting different types of information. The United States (62% of total flows) and China (7% of total flows) are the two countries that collect the most private information. Our findings raise issues regarding data regulation, and should encourage policymakers to further regulate how private information is used by and shared among companies, and how accountability can be truly guaranteed.
I. INTRODUCTION
Data has become the commodity that sustains much of the Web. In recent years, the research community has raised awareness of the threats linked to sensitive user data collection by third parties. For example, specialized companies collect information from Web users to uniquely identify them across websites, potentially to provide them with more tailored advertisements [3], [31], [34], [43], [61]. In some cases, rogue browser extensions collect information that is supposed to remain private, such as a user's browsing history [14], [75]. As mobile devices become more central in the computing experience of users, the threats linked to private information collection increase. Mobile devices can provide a wealth of sensitive information [35], [50] that goes beyond identifiers to uniquely fingerprint users [54], including location information [21], [42], [71], call logs, text messages, and even information on which applications are installed on a device [84]. This information can be used by third parties to deliver targeted advertisements [47] as well as for nefarious reasons, from stalking a victim by monitoring her location [13] to defeating two-factor authentication by stealing text messages [33]. There exist insightful research efforts [11], [15], [16], [26], [31], [32], [40], [54], [57], [60], [62], [70] to understand the impact and the threats posed by information collection on mobile devices. It remains however very challenging to
This paper appeared in the 2021 ISOC Network and Distributed System Security Symposium.

obtain a comprehensive view of the information collected by mobile apps, given the wealth of potential information collected, the software diversity of mobile platforms, and the geographic diversity of mobile users and of the actors that they interact with. To shed light on the problem, previous research resorted to running apps in a sandbox environment [15] or analyzing network traffic [11], [57], [60] to monitor the information that they leaked. While this approach can be useful to identify trackers, it has two limitations: first, by running apps from a single vantage point it is challenging to replicate the geographic diversity of real users; second, apps could detect sandboxed environments and act in a different way than they would on real devices (for example by not leaking any sensitive information), and this could bias the results [45], [58], [72]. Alternatively, previous work collected network data from an ISP, looking for information leaks [31], [32], [70]. While this approach solves the sandbox detection problem, it still has a geographic bias, since different users around the world might be using different apps and might be subject to different types of sensitive data collection. As a third approach, researchers recruited participants to install an app on their mobile phones; the app would then monitor the device for information leaks [54]. This approach solves the issues mentioned above and offers insightful findings, but it remains a challenging task to attract a population of users that is large and diverse enough to represent worldwide trends at scale.
Additionally, previous research [54] either mainly looked at information collection that can be used to identify a device (e.g., IMEI numbers or SIM card information) or considered limited types of sensitive information that can be collected by third parties [35], [49], [57], such as birthday, username/passwords, contacts, media files, etc.

In this paper, we provide the most comprehensive view of private data collection by Android apps to date. To achieve this, we tap into the analysis infrastructure of a popular mobile security product. The company behind this product runs Android apps in its backend infrastructure and identifies dangerous information flows by performing static and dynamic analysis. It then builds signatures of method calls that are indicative of privacy-invasive activity and pushes them to the mobile devices that installed the security product, which use them to identify privacy-invasive and malicious apps that have been installed. This infrastructure allows us to monitor the information collected by apps for a population of 17.3M devices daily for 21 months between 2018 and 2019. This is three orders of magnitude more devices than what previous work analyzed [54]. Compared to previous work, we go beyond tracking, contact, and credential information, and trace 22 categories of private information, 13 of which were not considered by previous work [11], [49], [54], [57] (see Section II). This allows us to paint an unprecedented picture of the state of sensitive information collection on Android in the wild, identifying the big players in this space (both legitimate companies and malicious actors), together with geographic trends.

TABLE I: Summary of datasets used.

Dataset                                   Data                          Count
Mobile app activity log                   Total records                 6B
(01/2018 - 09/2019)                       Days                          634
                                          Countries and regions         201
                                          Devices                       17.3M
                                          Distinct app names            2.13M
                                          Distinct app SHA2s            6.5M
                                          Distinct PIC FQDNs            76,451
                                          Distinct PIC domains          40,851
Mobile app reputation log                 Low reputation SHA2s          3.4M
VT                                        Total reports                 6.5M
                                          PHA SHA2s (detections ≥ )     3.5M
                                          Benign SHA2s (no detection)   2.3M
                                          Not found SHA2s               401K
Domain to owner org. (01/2018 - 09/2019)  Domains                       10,736
                                          Organizations                 9,593
Blacklists (01/2018 - 09/2019)            Domains/IPs                   7,670
Geolocation (01/2018 - 09/2019)           Domains/IPs                   40,851

Among others, this paper makes the following findings:

• Private information collection is widespread on Android, with 87.2% of all devices in our dataset sending information to at least five distinct domains. While most PIC domains collect identifiers to track a user or a device (e.g., device information or email addresses), an alarmingly high number of domains collect other types of private information, such as a device's location or a user's contacts.

• Looking at the destinations where private information is sent, we find that most information flows terminate in the United States. We do, however, find that China, trailing the US in second place, collects 7% of all data flows. This is three times higher than what was reported in previous work [54]. We also find that there was no significant difference in the number of information flows leaving the European Union after the implementation of GDPR. These findings highlight the challenges involved in implementing data protection regulations.
• We find that potentially harmful applications (PHAs) [27] are more aggressive in collecting private information than benign apps, especially when it comes to information related to the apps installed and running on a device. We also find that a small number of devices (4k) had apps installed that steal the user's text messages, potentially enabling the circumvention of two-factor authentication.

Our findings highlight a number of challenges faced by the research community when studying private information collection on Android. We show that looking at device penetration is critical to observe the distribution of information collection actors in the wild, and that looking at application penetration only can provide a biased view. We also highlight how looking at users located in different regions is important to get a comprehensive view, since actors operating in different countries are interested in different types of information.

Fig. 1: Workflow of our measurement study.

II. DATASETS
This section details the approach that we follow for data collection (summarized in Figure 1) and summarizes the datasets used in this study (see Table I).
Workflow.
The overall workflow of our measurement study is as follows (see Figure 1). We use mobile app activity data (①) to identify the private information collection activities from 2.13M apps (6.5M SHA2s) installed on 17.3M devices across 200 countries and regions. We then augment this data by using mobile app reputation data (②) and VirusTotal (VT) reports (③) to identify the potentially harmful apps (PHAs). Finally, we use domain and IP Whois and passive DNS (④) to extract domain ownership information (e.g., parent company, business category, etc.), IP and domain geolocation (⑤) to identify the country where apps send data, and IP and domain blacklists (⑥) to identify domains associated with malicious activity. We provide details of each step in the rest of this section.

Telemetry data collection.
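Conceptually, the augmentation steps above (reputation, VT reports, ownership, geolocation, and blacklists) amount to a sequence of lookups joined onto each activity record. The sketch below is a minimal, hypothetical illustration of that join: all datasets, field names, sample values, and the PHA detection threshold are invented for illustration and are not the vendor's actual schema.

```python
# Illustrative augmentation of raw activity records with reputation,
# ownership, geolocation, and blacklist lookups. All names/values are
# hypothetical placeholders for the proprietary datasets.

reputation = {"sha2_a": "low"}                    # app reputation data
vt_detections = {"sha2_a": 7}                     # VirusTotal detection counts
owner = {"tracker.example.com": "ExampleCorp"}    # Whois / passive DNS ownership
geo = {"tracker.example.com": "US"}               # IP/domain geolocation
blacklist = {"evil.example.net"}                  # domain/IP blacklists

PHA_THRESHOLD = 4  # assumed VT detection cut-off; the paper elides the value

def augment(record):
    """Attach contextual labels to one (device, app SHA2, PIC domain) record."""
    device, sha2, domain = record
    return {
        "device": device,
        "sha2": sha2,
        "domain": domain,
        "is_pha": vt_detections.get(sha2, 0) >= PHA_THRESHOLD,
        "low_reputation": reputation.get(sha2) == "low",
        "owner_org": owner.get(domain, "unknown"),
        "country": geo.get(domain, "unknown"),
        "blacklisted": domain in blacklist,
    }

row = augment(("device_1", "sha2_a", "tracker.example.com"))
```

Each record is enriched independently, so the same logic scales to the billions of telemetry rows described above via any batch-processing framework.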
In this paper, we use mobile telemetry data collected by the security company's mobile security product, which has been installed on millions of mobile devices. This company has a dedicated infrastructure to collect apks (one app may have multiple apks) from popular Android markets and various intelligence sources. These apks are then analyzed by a sophisticated infrastructure with both

TABLE II: The 22 types of private information monitored by the security app.
Group                          Category                Description                                                                Previously studied or novel
Tracking                       Phone number            Phone number                                                               [35], [54]
                               Device info             IMEI, OS/kernel version, phone producer, phone model                       [35], [54], [57]
                               SIM card info           Information about SIM serial number, IMSI, voicemail number                [54]
                               Location info           GPS or cell tower coordinates                                              [35], [57]
                               Operator info           Information about the network operator                                     ✓ (novel)
                               Setting info            Information about the device configurations                                ✓ (novel)
Activity and social profiling  Account info            Details about the configured accounts can be exported
                                                       (including user names of entries under Settings/Accounts)                  [35], [57] (partially)
                               Email info              Details about the email address such as Gmail address can be exported      [35]
                               Contact info            Contact list can be exported                                               [57]
                               Social network account  Details about the social network accounts such as Facebook account
                                                       can be exported                                                            ✓ (novel)
                               Voice mail account      Details about the voice mail accounts can be exported                      ✓ (novel)
                               Call log                Call log can be exported                                                   ✓ (novel)
                               SMS info                App can send the content or sender/recipient details from
                                                       SMS/MMS messages                                                           ✓ (novel)
                               Calendar info           Calendar can be exported                                                   ✓ (novel)
Usage preference               Installed app info      Details about apps installed on the phone are/can be exported
                                                       (full or partial list of installed package names, or app titles)           ✓ (novel)
                               Running app info        Details about apps running at a certain time are/can be exported           ✓ (novel)
                               Browser history info    Browser history can be exported                                            ✓ (novel)
                               Browser bookmark info   Browser bookmarks can be exported                                          ✓ (novel)
Audio/Video                    Audio info              Recorded audio clips can be exported
                                                       (e.g., recorded by the app, or picked from saved)                          [49]
                               Photo info              Photo can be exported                                                      ✓ (novel)
                               Video info              Video can be exported                                                      [49]
                               Camera info             App can take pictures or picks them from gallery and exports them          ✓ (novel)

static and dynamic analysis pipelines. For instance, the static analysis pipeline can identify if an apk directly invokes any suspicious and sensitive API (including reflection [1], dynamic code loading [52], native code [37], etc.), requests permissions not related to its advertised description [53], as well as perform fine-grained permission analysis [59], [22], flow- and context-sensitive taint analysis [6], etc. Third-party libraries/SDKs used by the apps are also analyzed using the same procedure stated above. Following the static analysis, the backend can build an initial report on control-flow, data-flow, and permissions related to an apk. In addition to static analysis, the security company also performs dynamic analysis by running an apk in a sandbox environment with various Android OS versions. Through network and system instrumentation, the dynamic analysis pipeline runs an apk in different conditions (e.g., UI automation [28], input generation [76], apk fuzzing [79], etc.) with varied execution time to capture its activities under different contexts. For example, the dynamic analysis pipeline reports if advertisements appear outside of an app in unexpected places (e.g., notification bar, shortcut, etc.) or exhibit unusual behaviors (e.g., changing the user's home page). State-of-the-art commercial products are also employed by the security company to deal with challenges such as emulator/motion evasion, obfuscated code/libraries, etc., to assist the aforementioned analysis pipelines. At the same time, several machine learning models are built using the features generated from the pipelines to enable the backend to detect sophisticated PHAs. This way, the mobile security product can fingerprint activities with high accuracy and minimize false positives, which could otherwise lead to an undesirably high customer churn rate.
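The combined static and dynamic analysis yields traces of method calls, which are later condensed into signatures expressed as ordered sequences of method signatures. A minimal sketch of how such a sequence signature could be matched against an observed call trace is shown below; the signature contents, method names, and trace are invented for illustration and do not reflect the vendor's actual signature format.

```python
# Sketch of sequence-signature matching: a signature matches when its
# method calls appear in order (not necessarily contiguously) in the trace.

def matches(signature, call_trace):
    """Return True if `signature` is an ordered subsequence of `call_trace`."""
    it = iter(call_trace)
    # `call in it` consumes the iterator up to the first match, so each
    # subsequent signature element must occur later in the trace.
    return all(call in it for call in signature)

# Hypothetical signature for SMS exfiltration (illustrative method names).
SMS_EXFIL = [
    "android.telephony.SmsManager.<init>",
    "ContentResolver.query(content://sms)",
    "java.net.HttpURLConnection.connect",
]

# Hypothetical call trace observed for an app.
trace = [
    "Activity.onCreate",
    "android.telephony.SmsManager.<init>",
    "ContentResolver.query(content://sms)",
    "org.json.JSONObject.put",
    "java.net.HttpURLConnection.connect",
]

print(matches(SMS_EXFIL, trace))  # prints True
```

Matching on ordered subsequences rather than exact traces makes the signature robust to interleaved, unrelated calls, at the cost of potential over-matching; real detection logic would be considerably more constrained.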
Note that the infrastructure continuously inspects apks. An apk that has been analyzed before may also be subject to regular reinspection. By combining the results of the static and dynamic analysis, the security company can rigorously fingerprint traces of app activity, including the types of private information that the app is collecting. These traces are later used to develop signatures for the apps collecting private information, in the form of sequences of method signatures. These signatures are then deployed in the security product on the mobile devices to identify installed apps that have been linked to private information collection. If the user permits telemetry data collection, meta-information related to the app detections is sent to the telemetry data collection infrastructure and used to improve the app security features and its privacy leakage detection capability. The collected data is safeguarded by the global privacy policy of this security company. Devices are identified by a unique anonymized identifier, but it is not possible to link such an identifier back to the device. The mobile security app only collects detection metadata, and it cannot inspect network traffic data; hence the company does not collect any actual communication/user data, or other types of PII. We provide a detailed discussion about ethics and data privacy at the end of this section.

III. LANDSCAPE OF PRIVATE INFORMATION COLLECTION (PIC) IN MOBILE ECOSYSTEMS
In this section, we study the landscape of private information collection (PIC) in mobile apps. First, we look at the pervasiveness of PIC apps installed on the user base of the security vendor. We then focus on the app presence rate (i.e., the number of PIC domains in apps) to identify the global/regional top players. We later focus on the device penetration rate (i.e., the number of devices that a PIC domain collects information from) to uncover the most pervasive PIC domains globally as well as important regional players, understand what types of private information are collected by these PIC domains, and examine whether we can observe behavioral differences in different regions regarding private information collection.
A. Pervasiveness of PIC in Mobile Apps
In this section, we demonstrate the pervasiveness of private information collection in mobile apps. The results are shown in Figure 2. Apps send collected private information to 2 unique PIC domains on average. As we can observe in Figure 2a, over 175K apps (approximately 8.2% of total apps) send collected data to at least 5 unique PIC domains. These apps are installed on 15.1M devices in our dataset (87.2% of all devices). At the same time, as we can observe in Figure 2b, over 156K apps collect at least 5 unique categories of private information (see Table II). This covers 13M devices (74.9% of all devices). The 57.6K apps overlapping between the aforementioned two categories of apps cover 12.8M devices (73.8% of all devices). In other words, 73.8% of all devices in our dataset have at least one app collecting at least 5 unique categories of private information and sending them to at least 5 unique PIC domains. Our findings show that private information collection in mobile apps is at once universal and diversified.

Fig. 2: Complementary cumulative distribution function (CCDF, log-scale) of mobile apps in terms of unique PIC domains (a) and unique categories of private information collected (b).

Fig. 3: Global top 20 PIC domains ranked by app presence. Domain's primary function - M: Metrics/Analytics, A: Advertising, and D: Development.

B. PIC Domains: App Presence Study
PIC organizations generally benefit from collecting data about more users. To reach this goal, one of the strategies adopted by these organizations is increasing their presence in mobile apps to reach more users. For example, an ad library will entice developers into including it in their apps. Figure 3 shows the 20 PIC domains with the largest app presence globally (i.e., the domains that were contacted by the largest number of apps). Based on the information we collect from Crunchbase and the company websites, we attribute these PIC domains to three functions - Metrics/Analytics (M), Advertising (A), and Development (D). As we can see in Figure 3, the majority of these PIC domains (15 out of 20) offer advertising services. For example, in addition to the PIC domains owned by Google and Facebook, several known PIC domains operated by online advertisement companies (e.g., api.airpush.com, android.revmob.com, e.admob.com, ads.mopub.com) have considerable global app presence, being contacted by 10K apps or more. Additionally, 8 out of the top 20 PIC domains offer metrics/analytics services. One noticeable finding from our study is alog.umeng.com (part of Alibaba Group). This domain has the largest global app presence and is contacted by 79,402 apps (3% of total apps). This domain was not reported by previous measurement studies [54], which might be because our dataset contains three orders of magnitude more devices, distributed across the globe (recall that over 7M users are located in Asia). Note that a high app penetration rate does not necessarily lead to a high device penetration rate, while the latter is directly proportional to the real amount of data collection. We will discuss this aspect in Section III-C. We then investigate the diversification of private information collection by these PIC domains from a global perspective. Previous literature focused on unique hardware and user identifiers (UIDs) to study Advertising and Tracking Services (ATS) [54].
It remains an open question whether these PIC domains only collect UIDs, given their wide presence in mobile apps. In this study, we move beyond UIDs and leverage the 22 categories of private information monitored by the security company to show a holistic picture of private information collection in the mobile ecosystem. We summarize our findings in Table III. As can be seen, the top 20 domains collect a wide spectrum of private information (e.g., 14 out of 20 collect call log information, 13 out of 20 collect SMS information, etc.). We can also observe that the top 10,000 PIC domains converge to collecting three types of private information - device (9,866 PIC domains), SIM card (7,448 PIC domains), and location information (5,415 PIC domains) - which enable them to uniquely identify and track the end users for potential targeted advertising purposes [67], [36]. In contrast, the top 100 PIC domains focus on collecting more types of private information (i.e., on average, the top 100 PIC domains collect over 8 types of private information) and build a holistic profile of users (e.g., 61 out of the top 100 PIC domains collect social network account information from the end users, in contrast to only 741 out of the top 10,000 PIC domains collecting such information).
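The app presence metric above reduces to counting, for each PIC domain, the number of distinct apps observed contacting it. A minimal sketch of that computation over (app, domain) observations is shown below; the flow records are hypothetical, though the domain names are real examples discussed in this section.

```python
from collections import defaultdict

# App presence of a PIC domain = number of distinct apps contacting it.
# The flows below are hypothetical (app_name, pic_domain) observations.
flows = [
    ("app_a", "alog.umeng.com"),
    ("app_a", "graph.facebook.com"),
    ("app_b", "alog.umeng.com"),
    ("app_b", "alog.umeng.com"),   # duplicate observation, counted once
    ("app_c", "ads.mopub.com"),
]

apps_per_domain = defaultdict(set)
for app, domain in flows:
    apps_per_domain[domain].add(app)   # sets deduplicate repeated contacts

presence = sorted(
    ((domain, len(apps)) for domain, apps in apps_per_domain.items()),
    key=lambda kv: kv[1],
    reverse=True,
)
print(presence[0])  # ('alog.umeng.com', 2)
```

Using sets rather than raw counters is what distinguishes presence (distinct apps) from raw contact volume, mirroring the distinction the section draws between app presence and the amount of data actually collected.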
Geographic differences in PIC domains.
Figures 4a, 4c and 4e show the top 20 PIC domains with the largest regional app presence in North America, Europe and Asia, respectively. In addition to the top global PIC domains, we uncover that certain PIC domains have a high regional app presence and were not previously reported. For example, poseidon.mobilecore.com (7,046 apps, 91% of its global presence) and seattleclouds.com (89% of its global presence) have high app presence in North America, while Russia-based startup.mobile.yandex.net (1,832 apps, 72.5% of its global presence) and mysearch-online.com (2,194 apps, 70% of its global presence) have high app presence in Europe and Asia, respectively. Regarding this regional presence phenomenon, we can only speculate that it is due to the business models adopted by these companies, which focus on serving regional markets.

Fig. 4: (left column) Regional top 20 PIC domains ranked by app presence. Domain's primary function - M: Metrics/Analytics, A: Advertising, and D: Development. (right column) Heatmap illustration of the top 12 categories of private information collected by these PIC domains. Each row is normalized to [0, 1] by a PIC domain's total app presence. The darker the red, the more apps from which a PIC domain collects information.
At the regional level, we find that the top 20 PIC domains contacted by apps installed on devices in different geographical regions collect different categories of private information. Note that we consider that a PIC domain notably collects a certain kind of private information if 20% of the apps with its presence collect such information. In North America, we observe that the top 20 PIC domains (Figure 4b) mainly collect device information and SIM card information, and only 3 PIC domains (api.airpush.com, data.flurry.com and ads.mopub.com) collect location information. In contrast, top PIC domains in Europe (Figure 4d) and Asia (Figure 4f) collect more diversified categories of private information. For example, 8 out of the top 20 PIC domains collect location and settings information in both Europe and Asia, and 4 out of the top 20 PIC domains in Asia prevalently collect installed app information, with mysearch-online.com exclusively gathering such data.

TABLE III: Top 12 collected categories.

Rank  Top 100                          Top 1k                           Overall
1     device info             99       device info             993      device info             189723
2     location info           95       sim card info           891      sim card info           75135
3     settings info           92       location info           815      location info           38695
4     email address           87       phone number            646      phone number            17378
5     sim card info           85       settings info           593      settings info           15225
6     phone number            85       email address           474      email address           12722
7     social network account  68       social network account  280      sms info                3375
8     account info            60       account info            246      social network account  3091
9     call log                46       call log                177      account info            2698
10    contact info            40       installed app info      163      contact info            2588
11    sms info                38       contact info            138      installed app info      2220
12    installed app info      32       sms info                128      call log                1442
C. PIC Domains: Device Penetration Study
In this section, we investigate the top PIC domains from the mobile device penetration rate perspective. We show that looking at device penetration provides different results than looking at app presence only. In fact, some of the actors who manage to get their libraries installed in many apps do not manage to have a large number of users running them.
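The contrast between the two metrics can be made concrete: device penetration counts distinct devices per domain, regardless of how many apps embed the domain's library. The sketch below uses hypothetical (device, app, domain) flows and invented domain names to show how a domain can score high on one metric and low on the other.

```python
from collections import defaultdict

# Device penetration of a PIC domain = number of distinct devices sending it
# information. Hypothetical flow records: (device_id, app_name, pic_domain).
flows = [
    ("dev1", "app_a", "widely.embedded.example"),
    ("dev1", "app_b", "widely.embedded.example"),
    ("dev1", "app_c", "widely.embedded.example"),   # 3 apps, but 1 device
    ("dev1", "app_d", "popular.example"),
    ("dev2", "app_d", "popular.example"),
    ("dev3", "app_d", "popular.example"),           # 1 app, but 3 devices
]

apps = defaultdict(set)      # app presence per domain
devices = defaultdict(set)   # device penetration per domain
for dev, app, domain in flows:
    apps[domain].add(app)
    devices[domain].add(dev)

# High app presence does not imply high device penetration, and vice versa:
print(len(apps["widely.embedded.example"]),
      len(devices["widely.embedded.example"]))   # 3 1
print(len(apps["popular.example"]),
      len(devices["popular.example"]))           # 1 3
```

Since device penetration, not app presence, tracks the real amount of data collection, the rest of this section ranks domains by the second metric.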
Top PIC domains by device penetration rates.
In reality, a high app presence does not necessarily lead to a high device penetration rate (i.e., the number of users sending information to PIC domains), whereas the latter is directly proportional to the real amount of information collection. In the rest of the section, we focus on the PIC domains that have high device penetration rates to uncover their private information collection dynamics in the real world. Figure 5a shows the 20 PIC domains with the largest device penetration rate globally. As we can see in Figure 5a, the top 3 PIC domains (settings.crashlytics.com, graph.facebook.com, and ssl.google-analytics.com) cover 8.03M, 7.8M, and 4.5M devices respectively, which is proportional to their app presence (see Figure 3). alog.umeng.com's high app presence strategy also pays off, covering roughly 3M devices. However, quite a few PIC domains with high app presence failed to gain high device penetration rates. For example, api.airpush.com only covers 68K devices despite its high app presence (21K apps, Figure 3). Besides, PIC domains controlled by revmob.com, seattleclouds.com and mobilecore.com also did not manage to achieve a high prevalence on devices.

Geographic differences in PICs.
We also discover that different regions present different dominant PIC domains. For example, *.urbanairship.com and mads.amazon-adsystem.com have a high device penetration rate in North America. config.ioam.de (99.4% of its global presence) operates solely in Europe. cm.ushareit.com (92.4% of its global presence), api.mobula.sdk.duapps.com (88% of its global presence), ads.pdbarea.com (825K devices, 95.4% of its global presence) and adbsc.krmobi.com (95.4% of its global presence) are almost exclusively contacted by devices located in Asia. We further investigate the types of private information collected by PIC domains with high device penetration rates, to check if different players active in different regions are interested in different types of private information. Our findings are summarized in Figure 6. Each row is normalized by a PIC domain's total device penetration rate. The heatmaps illustrate the main types of information collected by the PIC domains. First, it is interesting to see in Figure 6 that the top 20 global and regional PIC domains with high device penetration rates focus on collecting four types of private information from the end users - device, SIM card, location and settings information. For example, all of the top 20 PIC domains in Figure 6 collect device information. The only exception is logger.cloudmobi.net, a prominent PIC domain active in Asia (see Figure 6d), which predominantly collects device settings information. Approximately 50% of the top PIC domains collect SIM card, location, and settings information at both global and regional levels. Our findings also show that certain PIC domains consistently collect multiple types of private information from devices, potentially enabling them to track the end users more systematically. For example, in Europe, events.appsflyer.com (1.16M global device penetration rate) collects device information from all devices that connected to it, and SIM card information (and settings information) from 95% of them (see Figure 6c).
Similarly, ads.mopub.com, with a 1.8M global device penetration rate (see Figures 6a, 6b and 6c), exhibits a similar behavior, i.e., it collects location, device, and SIM card information from over 80% of the devices that connected to it. In Section III-B, we showed that these two behavior patterns are different from the ones observed when looking at the top PIC domains ranked by app presence, where the intention is to collect more diversified private information (see Table III). Our findings can be treated as profiles of the PIC domains, and can help the community understand their behavior at a fine granularity (e.g., understanding the correlation between domain naming conventions and the types of private information collected).

Summary of findings.
We found that looking at app presence alone can provide misleading results. In fact, some of the actors who managed to get their libraries installed in many apps failed to have many users running them. Further information can be found in Section III-B. We also found that certain PIC domains consistently collect multiple types of private information from the devices and are capable of tracking the end users more systematically. We observed different regional players targeting users in different continents, and collecting different types of private information. Following these observations, we will further discuss the data controllers behind these PIC domains and the implications of data protection in Section V.

Fig. 5: Top 20 PIC domains ranked by device penetration rate. The number next to a PIC domain represents its ranking by app presence.

IV. PRIVATE INFORMATION DESTINATIONS
In the previous section, we focused on end user devices, looking at the top PIC domains that collected private information from them. In this section, we focus on the destination of those private information flows, aiming to understand the countries where these flows terminate.
Geolocation of PIC Domains.
We leverage the technique detailed in Section II to uncover the geolocation of the PIC domains and summarize our findings in Figure 7. Our analysis reveals that the United States and China are the two countries hosting the most PIC domains. The United States hosts 44% of PIC domains, which is in line with the previous literature [54], and China hosts 26.1% of PIC domains. This figure is three times higher than previously reported [54]. PIC domains hosted in the US and China collect private information from 14M devices (80.9% of global devices) and 4.6M devices (26.5% of global devices), respectively. Other countries host significantly fewer PIC domains compared to the United States and China (e.g., South Korea, ranked 3rd in the list, hosts merely 2.6% of the PIC domains). Note that the geolocation of 6.4% of PIC domains could not be identified because our approach cannot trace their historical domain records.
Global private information flow. As we saw in Section III, a PIC domain can collect multiple types of private information from the end user. We further investigate the global private information flow from the mobile devices to the PIC domains. The result is shown in Figure 8a. PIC domains hosted in the United States collect 62% of global private information flows (of which 42.3% come from outside the country). PIC domains hosted in China collect 7% of private information flows from 4.59M devices globally. This figure is almost four times more than previously reported [54]. At the same time, PIC domains hosted in Singapore collect 6.53% of global private information flows (mainly from India). The rest of the countries shown in Figure 8a collect notably less private information compared to these three countries.
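The per-country flow shares above follow from a simple aggregation once each flow has been resolved to a destination country via domain/IP geolocation. The sketch below illustrates that aggregation; all records, domain names, and resulting percentages are hypothetical and unrelated to the paper's measured values.

```python
from collections import Counter, defaultdict

# Aggregating private information flows by destination country, assuming
# each flow has already been geolocated. Illustrative records:
# (device_id, pic_domain, destination_country).
flows = [
    ("dev1", "tracker1.example", "US"),
    ("dev1", "tracker2.example", "US"),
    ("dev2", "tracker1.example", "US"),
    ("dev2", "tracker3.example", "CN"),
    ("dev3", "tracker4.example", "SG"),
]

flow_share = Counter(country for _, _, country in flows)
devices_per_country = defaultdict(set)
for dev, _, country in flows:
    devices_per_country[country].add(dev)

total = sum(flow_share.values())
for country, n in flow_share.most_common():
    print(country, f"{100 * n / total:.0f}% of flows,",
          len(devices_per_country[country]), "devices")
```

Tracking both the flow share and the distinct-device reach per country is what lets the analysis report, for each destination, both the fraction of global flows and the number of devices affected.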
European private information flow and the effect of GDPR [73].
The European Union's (EU) General Data Protection Regulation (GDPR) entered into effect on May 25th, 2018. It imposes obligations onto organizations in any country so long as they target or collect data related to people in EU countries (EU28).

Fig. 6: Heatmap illustration of the top 12 types of private information (e.g., device info, sim card info, phone number, location info, settings info, account info, email address, contact info, SNS account info, call log, SMS info, installed app info) collected by the global (a) and regional — North America (b), Europe (c), Asia (d) — top 20 PIC domains. Each row is normalized to [0, 1] by a PIC domain's total device penetration rate; darker red indicates that a PIC domain collects information from more devices.
If data is being transferred to a third party and/or outside the EU28, GDPR requires that data subjects must be clearly informed about the extent of data collection, the legal basis for the processing of personal data, and how long data is retained. In light of this legislation, we measure the private information flows originating from EU countries before (January 5th, 2018 - May 24th, 2018) and after (May 26th, 2018 - September 30th, 2019) the GDPR effective date, and check whether GDPR has had a real-world impact on private information collection. Our findings are shown in Figures 8b and 8c. As we can see, private information confinement within the EU is low. PIC domains hosted in the United States dominate private information collection in the EU, collecting 68% and 66% of European private information flows before and after the GDPR respectively. This figure is 30% lower than the previously reported 89.2% [54]. At the same time, Germany and Ireland are the only two European countries that host a reasonable portion of PIC domains and where good control of private information can be applied, while the other European countries host a very small fraction of PIC domains and the US remains the largest hosting country. Notably, we uncover that approximately 4.4% and 1.7% of private information flows are collected by PIC domains hosted in Russia and China respectively [80], [81]. It is also interesting to see that private information collection in Europe is not affected by GDPR in general. As we can see in Figures 8b and 8c, the fractions of private information collected by these PIC domains (and consequently by the countries hosting them) remain stable regardless of the implementation of GDPR. Our results show that GDPR has not stopped companies from collecting private information from end users as long as their services are GDPR-compliant, partially because the GDPR treats first-party data uses more leniently [29].
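The before/after comparison above amounts to partitioning timestamped flow records at the GDPR effective date and recomputing destination fractions in each window. The sketch below illustrates this (names and record formats are hypothetical, not our production code); the effective date itself is excluded, matching the measurement windows stated above.

```python
from collections import Counter
from datetime import date

GDPR_EFFECTIVE = date(2018, 5, 25)  # date GDPR entered into effect

def split_flows_by_gdpr(flows):
    """Partition (day, dst_country) EU-originated flow records into
    pre-GDPR (before May 25th, 2018) and post-GDPR (after May 25th,
    2018) windows; the effective date itself is excluded."""
    pre = [f for f in flows if f[0] < GDPR_EFFECTIVE]
    post = [f for f in flows if f[0] > GDPR_EFFECTIVE]
    return pre, post

def destination_fractions(flows):
    """Fraction of flows terminating in each destination country."""
    counts = Counter(dst for _, dst in flows)
    n = sum(counts.values())
    return {dst: c / n for dst, c in counts.items()}
```

Comparing `destination_fractions(pre)` against `destination_fractions(post)` per destination country is what Figures 8b and 8c visualize.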
However, it remains an unanswered question, especially for consumers, how to trace their private information after sharing it with GDPR-compliant companies, and how accountability can be truly guaranteed [74], [44], [46]. For instance, which company should be held accountable if a device identifier is abused (e.g., for targeted advertising) while the majority of apps on mobile devices collect device identification information, as shown in Section III-C? We aim at studying this question in Section V.

Fig. 7: Global top 20 countries ranked by the number of PIC domains hosted.

V. DATA PROCESSORS AND CONTROLLERS
In the previous sections, we provided an overview of the landscape of private data collection from mobile devices (Section III) and of the countries where private information is sent (Section IV). In this section, we aim at understanding the characteristics of the data processors and controllers who ultimately obtain and process the private information, and the implications of their privacy policies for end users.
Overview of top data processors and controllers.
We select the top 10K PIC domains covering all the devices in this study, and use the technique detailed in Section II to uncover the ownership of the PIC domains. The top 25 data processors and controllers (ranked by the fraction of devices they collect private information from) are shown in Figure 9. In total, these 25 data processors and controllers collect private information from 13.9M devices (80.2% of all devices used in this study). Facebook and Alphabet are the two dominant data controllers, collecting private information from 9.3M and 9.1M devices respectively. AppsFlyer is the third largest data processor/controller, collecting information from 3.4M devices. It is worth noting that there are six Chinese companies among the global top 25 data processors and controllers: Alibaba (3.1M), Baidu (1.6M), CloudMobi (1.0M), MobVista (880K), Tencent (650K), and Intsig(Shanghai) (646K). In total, these six companies collect private information from 4.55M devices (i.e., 26% of total devices).
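The ranking by device coverage described above reduces to counting, per owner organization, the unique devices that contacted any of its PIC domains. A minimal sketch (function name, inputs, and example domains are hypothetical), assuming the domain-to-owner mapping from Section II:

```python
from collections import defaultdict

def rank_controllers(device_domains, domain_owner):
    """Rank data processors/controllers by the number of unique devices
    they collect private information from. device_domains maps a device
    ID to the set of PIC domains it contacted; domain_owner maps a PIC
    domain to its owning organization (unmapped domains are skipped)."""
    devices = defaultdict(set)
    for device, domains in device_domains.items():
        for domain in domains:
            owner = domain_owner.get(domain)
            if owner is not None:
                devices[owner].add(device)
    # Highest device coverage first, as in Figure 9.
    return sorted(((len(devs), org) for org, devs in devices.items()),
                  reverse=True)
```

Note that devices are deduplicated per organization, so a device contacting several of an organization's domains is counted once.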
Operation of top data processors and controllers.
We analyze the domain distribution of these top 25 data processors and controllers to understand more details about their infrastructure and operational strategies. Our findings are summarized in Figure 10. In total, 16 out of the top 25 data processors and controllers have no more than 21 PIC domains. For example, AppsFlyer, the third largest data processor/controller, has only 11 PIC domains in our dataset. It is evident that the majority of the data controllers prefer to control data flows via a few API gateways. At the same time, Baidu (425 PIC domains), Tencent (531 PIC domains), and Adobe (374 PIC domains) prefer to use many loosely coupled services to collect data, since their operational strategies rely on Cloud infrastructure. For example, DU Ad Platform (*.duapp.com, part of Baidu) runs almost exclusively on AWS infrastructure, the QQ platform (*.qq.com, part of Tencent) operates in Tencent-owned Cloud infrastructure, and 2o7 (*.2o7.net) is part of Adobe Marketing Cloud. Note that previous literature [54] found "292 parent organizations that own nearly 2,000 ATS and ATS-C domains." Our findings, however, indicate that these data controllers may own more PIC domains than previously thought.
Cross-border transfers, non-EU data processors and controllers, and implications for data protection.
Based on our factual findings, we use Chinese companies as a case study to quantitatively and objectively understand the implications of users' private information collection involving cross-border data transfers [74], [44], [46], [68], and how it becomes more difficult to trace how this data flows. As mentioned before, the top six Chinese companies collect private information from 4.55M devices. Superficially, such coverage seems in line with our findings in Section IV, where we found that 7% of private information flows, from 4.59M devices globally, flow to China. We further investigate the geolocation of the PIC domains controlled by these companies to see if these domains are hosted in China, using the technique detailed in Section II. For example, Baidu has 210 PIC domains hosted outside China, mainly because the *.duapp.com PIC domains (owned by its subsidiary DU Ad Platform) are hosted in AWS (USA). Besides, *.mobvista.com is hosted in Amazon Web Services (AWS), and *.cloudmobi.net has a mixture of hosting environments in the US and Singapore. We report more details about the country distribution of Chinese data controllers in Figure 11. The figure confirms that many PIC domains owned by these companies are not hosted in China. Such an operational strategy employed by these data controllers, however, leads to undesirable implications for data protection. For example, the DU Ad Platform (partnering with Facebook, Alphabet, appnext, etc.) states in its privacy policy (http://ad.duapps.com/gdpr/index.html) that personal information could be "shared with any organization part of Baidu Group" and "may be transferred to countries which provide an adequate level of protection." In this case, even though private information flows terminate in the AWS Cloud, such data could still be transferred to third countries. Moreover, Mobvista (partnering with Baidu, TikTok, etc.) explicitly claims in its privacy policy that private information would be "transferred to recipients in countries located outside the EEA (including in Singapore where the Site is hosted) which do not provide a similar or adequate level of protection to that provided by countries in the EEA." We acknowledge that without knowing more about the actual underlying contractual relationships, it is difficult to draw conclusions on how data is further processed by those entities. Nevertheless, it remains an open yet important question how to protect and audit the usage of such data flows terminating at the PIC domains owned by these companies, with data transfers to third countries explicitly stated in their privacy policies. We hope that our findings will motivate lawmakers to consider how to address such issues in future legislation and, more importantly, encourage the commercial partners of these companies to design rigorous policies to protect users' private information when sharing data cross-border.

Fig. 8: Sankey diagrams illustrating 1) global private information flows between the top 15 countries (ranked by the number of devices) and the top 20 PIC domain locations (a), and 2) private information flows between EU28 countries and the top 20 PIC domain locations before (b) and after (c) GDPR. The left side of the diagrams represents the origin of information flows and the right side represents where the information flows terminate. We add a postfix "in" to the country code on the right hand side when private information flows originate and terminate in the same country.

Fig. 9: Global top 25 data controllers ranked by the fraction of devices they collect private information from. These 25 data controllers collect private information from a total of 13.9M devices, covering 80.2% of all devices used in this study.
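The inside/outside-China split reported in Figure 11 can be approximated with the following sketch (the function name and example domains are hypothetical, not part of our pipeline), assuming the per-domain geolocation results from Section II:

```python
def hosting_split(controller_domains, domain_country, home="CN"):
    """For each data controller, count its PIC domains hosted inside vs.
    outside a home country; domains with unknown geolocation are skipped."""
    split = {}
    for org, domains in controller_domains.items():
        known = [domain_country[d] for d in domains if d in domain_country]
        inside = sum(1 for c in known if c == home)
        split[org] = (inside, len(known) - inside)
    return split
```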
Summary of findings.
We found that the top 25 data processors and controllers can collect private information from an overwhelming 80.2% of all devices. The six top Chinese data controllers provided privacy policies and hosted part of their infrastructure in countries with rigorous data protection laws. However, they also allow data transfers to third countries, which may incur technical and legal complications for further protecting private information [80], [81], [74], [44], [46].

Fig. 10: Domain distribution of the top 25 data controllers.

VI. CHARACTERIZATION OF PHA PRIVATE INFORMATION COLLECTION
Potentially harmful applications (PHAs) [27] are apps that could put users, user data, or devices at risk (e.g., trojans, spyware, etc.). Some of them are not strictly malware but are harmful to the software ecosystem (e.g., by impersonating other apps). These PHAs have been substantially discussed and studied in the previous literature [39], [83], [23], [20], [13]. In this section, we focus on understanding what private information is collected by PHAs. In particular, we aim to understand whether the types of information collected by PHAs differ from those collected by regular apps.

Fig. 11: Domain distribution of the top 6 Chinese data controllers (Mobvista, CloudMobi, Baidu, intsig(ShangHai), Alibaba, Tencent), split into PIC domains hosted inside and outside China.

TABLE IV: Top 10 types of private information collected by PHAs on a global scale (recoverable rows: operator info, 116K PHAs / 393K devices; running app info, 63K PHAs / 280K devices).
Private information collection by PHAs.
We consider a SHA2 as potentially harmful if it is flagged by at least 6 AV companies in VirusTotal (see Section II). Together with mobile app reputation data, we identify 3.5M SHA2s associated with 1.2M unique PHA app names that were installed on 3.8M devices. Following the analytical process used in Section III, we uncover the top 10 types of private information collected by PHAs and summarize our findings in Table IV. We can see that PHAs mainly collect tracking information, e.g., device info, sim card info, location, etc. Besides, 116K PHAs (covering 393K devices) collect operator information and 63K PHAs (covering 280K devices) also collect running app information on a global scale. This is more aggressive than the behavior of benign apps, of which only 43K and 42K respectively collect such information. As we can see in Figure 12, the majority of these aggressive PHAs are installed on devices in North America. Note that such aggressive private information collection enables adversaries to better profile end users and may lead to intrusive monetization actions. For example, we uncover that 590K devices with PHAs present are affected by notification bar ads (i.e., ads displayed as app notifications) and 317K devices suffer from shortcut ads (i.e., targeted ads placed on the home screen). Yet, only 230K devices with PHA installations exhibit in-context ads behavior (i.e., normal behavior where ads are displayed inside an app). However, due to the limitations of our system, we are not able to measure the content correlation between the private information collected by PHAs and the subject of advertisements displayed as shortcuts on the devices. We also identify 1,549 PHAs (4,930 SHA2s) that read/sent SMS messages from 4,461 devices. Even though such SMS leakage is minor in terms of device prevalence, in light of the recent discussion of the limitations of SMS-based 2FA authentication (https://krebsonsecurity.com/2018/08/reddit-breach-highlights-limits-of-sms-based-authentication/), our findings show that the possibility of such breaches still exists in the wild.

Fig. 12: Heatmap illustration of regional (North America, Europe, Asia) private information collection by PHAs.

Fig. 13: Top 20 malicious IPs/domains ranked by the number of PHAs (log-scale).

Communications with malicious domains.
We compile a blacklist of IPs/domains that have been involved in malicious activities from various sources (see Section II), and aim at understanding whether PHAs send private information collected from these devices to malicious domains. Figure 13 shows the 20 malicious domains with the largest app presence and the fraction of devices connecting to them. As we can see in Figure 13, the top-ranked domain has the largest app presence and was contacted by 550 PHAs collecting data from 686 devices. Another domain and yy.yamahafree.com have higher device penetration rates, showing communications with 3,789 and 6,455 devices respectively. In general, we find that only a small portion of PHAs communicate with known malicious hosts and domains, and such domains have limited device coverage. This is different from PC malware, where a considerable fraction of malware connects to malicious domains and is part of botnets [5], [51], [30].

Summary of findings.
We found that PHAs are more aggressive than benign apps in their private information collection behavior, leading to intrusive monetization actions. However, communications with malicious domains are less pervasive than for desktop applications.

VII. DISCUSSION AND LIMITATIONS
Implications for the research community.
Our study shows that looking at device penetration provides different results than looking at apps only. Designing measurement studies focused on executing apps could lead to conclusions that are biased and do not reflect real malicious activity in the wild. In fact, some of the actors who manage to get their libraries installed in many apps do not manage to have many users running them. In light of this, we hope that our study can inspire security researchers to design measurement studies that are as representative of the real world as possible.
Implications for policymakers.
We observe that private information confinement within the EU is low. GDPR has not stopped companies from collecting private information from end users as long as their services are GDPR-compliant. In light of these findings, we hope that our study will encourage policymakers to further regulate how private information is used by and shared among companies, and how accountability can be truly guaranteed (e.g., which company should be held accountable if a device identifier is abused for targeted advertising while the majority of apps on that mobile device collect device identification information).
Study limitations.
Our study relies on static and dynamic analysis, and on layered security engines at the backend, to identify and fingerprint which API calls lead to specific private information leakage. This prevents us from capturing private information collection activities that happen at runtime but are not captured by the company's analytical infrastructure, on which the on-device security engine relies. Therefore, our work covers a lower bound of global private information collection activities. Nevertheless, despite such limitations, our study provides the most comprehensive view of private data collection by Android apps to date, together with actionable insights. While this paper is based on measurements collected from a user base that is three orders of magnitude larger than previous work, our dataset is biased towards the end users of a single mobile security product, and therefore still presents some biases. The distribution of devices used in this study is not heavily skewed towards any specific region. However, the device distribution in Asia is skewed towards India and Japan and does not include as many devices in China, which is one of the top countries/markets in terms of mobile users. In terms of the representativeness of the analyzed apps, it is challenging to ascertain the coverage of our study, since it is infeasible to determine the total number of all Android apps, given such a fragmented ecosystem and many alternative markets. Still, by analyzing 2.1M apps, this study covers one of the largest sets of apps to date and is in line with the largest datasets collected by the academic community [4]. Our analysis of data controllers presented in Section V relies on the identification of the organizations behind PIC domains. As detailed in Section II, the mapping of domains to their owner organizations relies on multiple data sources providing connections between the domains, the networks that host them, as well as the organizations supposedly maintaining these resources.
Such connections are cross-checked across the different data sources to compensate for inaccuracies in each of the sources. We also take a conservative approach and automatically discard all connections that are not seen in all data sources. This naturally reduces the number of domains to which we can map an organization; however, we favor the accuracy of the domain-to-owner-organization mapping over its coverage. It is also important to note that some apps communicate with raw IP addresses instead of relying on domains. Moreover, we have seen that more than 97% of these IP addresses refer to CDNs or hosting or cloud providers, which hinders the identification of their owner organizations. Finally, while in this paper we studied how private information is collected from devices, and where this information flows to, our study does not allow us to understand how this information is acted upon by data controllers (i.e., whether and how it is used to track users). This remains an open question for the research community.

VIII. RELATED WORK
In this section, we selectively review previous studies on PHA characterization, the Android permission system, private data leakages and prevention, and third-party advertising and tracking services. We refer readers to [66], [22], [19], [48], [77], [63] for in-depth studies and surveys on securing Android devices in general.
PHA characterization.
Previous studies mainly focused on analyzing PHAs and systematically characterizing them from various aspects, such as evasion mechanisms [20], installation methods [83], malicious payloads [83], repackaging mechanisms [82], [64], [39], behaviors [38], [78], monetization [23], etc. These efforts shed light on how Android PHAs operate in the wild [83], the main incentives of mobile malware [23], [39], the weaknesses of some of the popular mitigation solutions [20], etc. However, they did not discuss potential threats posed by information collection on mobile devices, as these efforts center on app analysis and offer a less comprehensive view of real device prevalence.
Android permission system.
The Android permission system has been extensively covered in the previous literature [10], [48], [22], [7], [9]. These studies on Android permissions have mainly leveraged static analysis techniques to understand the role of a given permission [7], [22], [25], potential privacy violations incurred by overprivileged apps [22], [59], permission circumvention [55], description-to-permission fidelity [53], and to improve the mapping of Android permissions to framework/SDK API methods [8], [2]. Some recent research efforts also utilize dynamic analysis systems to distinguish and trace the permissions requested by apps at runtime from those requested by the app's core functionality [17], and to generate a more precise call graph enabling the system to extract the permission specification and improve the mapping [41]. Our study complements these studies by showing the scale and prevalence of private information collection on real-world devices.
Third-party advertising and tracking services (ATSes).
There are two main approaches to studying third-party advertising and tracking services (ATSes). One approach leverages static tools to decompile apps, identify the embedded trackers from API calls, and quantify various aspects of trackers [60], [11]. These methods offer a view of tracker behavior and prevalence from an app perspective. Another approach leverages network traffic, either captured on device or by ISPs, to provide insights into the mobile advertising and tracking ecosystem from an information flow perspective [32], [70], [54], [31].
PII leakage detection and protection.
The root cause of PII leakage is that end users are presented with the set of permissions required by apps, but not with how the apps handle the data after permissions are granted. Previous studies showed that mobile apps leak more private information than their web counterparts [50], [35]. To this end, research efforts have mainly focused on monitoring private information flows [18], [65], detecting potential privacy leaks by apps [26], [56], [55], sensitive data leakage via third-party libraries [62], [16], [40], privacy implications caused by targeted advertising in apps [12], private data leakage via network traffic analysis [57], [15], the impact of GDPR notices [69], and privacy implications incurred by pre-installed apps [24]. Our work complements the previous work and shows a holistic picture of the state of sensitive information collection on Android in the wild, identifying the big players in this space (both legitimate companies and malicious actors), together with geographic trends.

IX. CONCLUSION
In this paper, we presented the most comprehensive measurement study of private information collection on Android to date. We showed that PIC is widespread on Android, and that various types of information are collected, with actors operating in different geographic areas interested in different types of information. While most information flows terminate in the US, 7% of the flows that we observe are directed to China. We also find that data regulation laws like GDPR have not been effective in limiting the amount of personal information that flows to third countries outside the EU.

ACKNOWLEDGMENTS
We wish to thank the anonymous reviewers for their feedback and our shepherd Adwait Nadkarni for his help in improving this paper.

REFERENCES

[1] Y. Aafer, W. Du, and H. Yin. DroidAPIMiner: Mining API-level features for robust malware detection in Android. In SecureComm, 2013.
[2] Y. Aafer, G. Tao, J. Huang, X. Zhang, and N. Li. Precise Android API protection mapping derivation and reasoning. In ACM CCS, 2018.
[3] G. Acar, M. Juarez, N. Nikiforakis, C. Diaz, S. Gürses, F. Piessens, and B. Preneel. FPDetective: Dusting the web for fingerprinters. In ACM CCS, 2013.
[4] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon. AndroZoo: Collecting millions of Android apps for the research community. In MSR, 2016.
[5] M. Antonakakis, R. Perdisci, W. Lee, N. Vasiloglou, and D. Dagon. Detecting malware domains at the upper DNS hierarchy. In USENIX Security, 2011.
[6] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. ACM SIGPLAN Notices, 2014.
[7] K. W. Y. Au, Y. F. Zhou, Z. Huang, and D. Lie. PScout: Analyzing the Android permission specification. In ACM CCS, 2012.
[8] M. Backes, S. Bugiel, E. Derr, P. McDaniel, D. Octeau, and S. Weisgerber. On demystifying the Android application framework: Re-visiting Android permission specification analysis. In USENIX Security, 2016.
[9] M. Backes, S. Bugiel, O. Schranz, P. von Styp-Rekowsky, and S. Weisgerber. ARTist: The Android runtime instrumentation and security toolkit. In EuroS&P, 2017.
[10] D. Barrera, H. G. Kayacik, P. C. Van Oorschot, and A. Somayaji. A methodology for empirical analysis of permission-based security models and its application to Android. In ACM CCS, 2010.
[11] R. Binns, U. Lyngs, M. Van Kleek, J. Zhao, T. Libert, and N. Shadbolt. Third party tracking in the mobile ecosystem. In ACM WebSci, 2018.
[12] T. Book and D. S. Wallach. An empirical study of mobile ad targeting. arXiv preprint arXiv:1502.06577, 2015.
[13] R. Chatterjee, P. Doerfler, H. Orgad, S. Havron, J. Palmer, D. Freed, K. Levy, N. Dell, D. McCoy, and T. Ristenpart. The spyware used in intimate partner violence. In IEEE S&P, 2018.
[14] Q. Chen and A. Kapravelos. Mystique: Uncovering information leakage from browser extensions. In ACM CCS, 2018.
[15] A. Continella, Y. Fratantonio, M. Lindorfer, A. Puccetti, A. Zand, C. Kruegel, and G. Vigna. Obfuscation-resilient privacy leak detection for mobile apps through differential analysis. In NDSS, 2017.
[16] S. Demetriou, W. Merrill, W. Yang, A. Zhang, and C. A. Gunter. Free for all! Assessing user data exposure to advertising libraries on Android. In NDSS, 2016.
[17] M. Diamantaris, E. P. Papadopoulos, E. P. Markatos, S. Ioannidis, and J. Polakis. REAPER: Real-time app analysis for augmenting the Android permission system. In CODASPY, 2019.
[18] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems (TOCS), 2014.
[19] Z. Fang, W. Han, and Y. Li. Permission based Android security: Issues and countermeasures. Computers & Security, 43, 2014.
[20] P. Faruki, A. Bharmal, V. Laxmi, V. Ganmoor, M. S. Gaur, M. Conti, and M. Rajarajan. Android security: A survey of issues, malware penetration, and defenses. IEEE Communications Surveys & Tutorials, 17(2), 2014.
[21] K. Fawaz and K. G. Shin. Location privacy protection for smartphone users. In ACM CCS, 2014.
[22] A. P. Felt, E. Chin, S. Hanna, D. Song, and D. Wagner. Android permissions demystified. In ACM CCS, 2011.
[23] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner. A survey of mobile malware in the wild. In SPSM, 2011.
[24] J. Gamba, M. Rashed, A. Razaghpanah, J. Tapiador, and N. Vallina-Rodriguez. An analysis of pre-installed Android software. In IEEE S&P, 2020.
[25] X. Gao, D. Liu, H. Wang, and K. Sun. PmDroid: Permission supervision for Android advertising. In SRDS, 2015.
[26] C. Gibler, J. Crussell, J. Erickson, and H. Chen. AndroidLeaks: Automatically detecting potential privacy leaks in Android applications on a large scale. In TRUST, 2012.
[27] Google. Android Security & Privacy 2018 Year In Review. 2019.
[28] S. Hao, B. Liu, S. Nath, W. G. Halfond, and R. Govindan. PUMA: Programmable UI-automation for large-scale dynamic analysis of mobile apps. In MobiSys, 2014.
[29] C. J. Hoofnagle, B. van der Sloot, and F. Z. Borgesius. The European Union General Data Protection Regulation: What it is and what it means. Information & Communications Technology Law, 28(1), 2019.
[30] C. C. Ife, Y. Shen, S. J. Murdoch, and G. Stringhini. Waves of malice: A longitudinal measurement of the malicious file delivery ecosystem on the web. In AsiaCCS, 2019.
[31] C. Iordanou, G. Smaragdakis, I. Poese, and N. Laoutaris. Tracing cross border web tracking. In IMC, 2018.
[32] C. Joe-Wong, S. Ha, and M. Chiang. Sponsoring mobile data: An economic analysis of the impact on users and content providers. In INFOCOM
USENIX Security, 2016.
[35] C. Leung, J. Ren, D. Choffnes, and C. Wilson. Should you use the app for that? Comparing the privacy implications of app- and web-based online services. In IMC, 2016.
[36] K. Li and T. C. Du. Building a targeted mobile advertising system for location-based services. Decision Support Systems, 54(1), 2012.
[37] L. Li, T. F. Bissyandé, M. Papadakis, S. Rasthofer, A. Bartel, D. Octeau, J. Klein, and L. Traon. Static analysis of Android apps: A systematic literature review. Information and Software Technology, 2017.
[38] M. Lindorfer, M. Neugschwandtner, L. Weichselbaum, Y. Fratantonio, V. Van Der Veen, and C. Platzer. Andrubis - 1,000,000 apps later: A view on current Android malware behaviors. In BADGERS, 2014.
[39] M. Lindorfer, S. Volanis, A. Sisto, M. Neugschwandtner, E. Athanasopoulos, F. Maggi, C. Platzer, S. Zanero, and S. Ioannidis. AndRadar: Fast discovery of Android applications in alternative markets. In DIMVA, 2014.
[40] X. Liu, J. Liu, S. Zhu, W. Wang, and X. Zhang. Privacy risk analysis and mitigation of analytics libraries in the Android ecosystem. IEEE Transactions on Mobile Computing, 2019.
[41] L. Luo. Heap memory snapshot assisted program analysis for Android permission specification. In SANER, 2020.
[42] S. Ma, Z. Tang, Q. Xiao, J. Liu, T. T. Duong, X. Lin, and H. Zhu. Detecting GPS information leakage in Android applications. In IEEE GLOBECOM, 2013.
[43] J. R. Mayer and J. C. Mitchell. Third-party web tracking: Policy and technology. In IEEE S&P, 2012.
[44] T. Minssen, C. Seitz, M. Aboy, and M. C. Compagnucci. The EU-US Privacy Shield regime for cross-border transfers of personal data under the GDPR. European Pharmaceutical Law Review, 4(1), 2020.
[45] N. Miramirkhani, M. P. Appini, N. Nikiforakis, and M. Polychronakis. Spotless sandboxes: Evading malware analysis systems using wear-and-tear artifacts. In IEEE S&P, 2017.
[46] T. Mulder and M. Tudorica. Privacy policies, cross-border health data and the GDPR. Information & Communications Technology Law, 28(3), 2019.
[47] S. Nath. MAdScope: Characterizing mobile in-app targeted ads. In MobiSys, 2015.
[48] M. Nauman, S. Khan, and X. Zhang. Apex: Extending Android permission model and enforcement with user-defined runtime constraints. In ASIACCS, 2010.
[49] E. Pan, J. Ren, M. Lindorfer, C. Wilson, and D. Choffnes. Panoptispy: Characterizing audio and video exfiltration from Android applications. PETS, 2018(4), 2018.
[50] E. P. Papadopoulos, M. Diamantaris, P. Papadopoulos, T. Petsas, S. Ioannidis, and E. P. Markatos. The long-standing privacy debate: Mobile websites vs mobile apps. In WWW, 2017.
[51] D. Plohmann, K. Yakdan, M. Klatt, J. Bader, and E. Gerhards-Padilla. A comprehensive measurement study of domain generating malware. In USENIX Security, 2016.
[52] S. Poeplau, Y. Fratantonio, A. Bianchi, C. Kruegel, and G. Vigna. Execute this! Analyzing unsafe and malicious dynamic code loading in Android applications. In NDSS, 2014.
[53] Z. Qu, V. Rastogi, X. Zhang, Y. Chen, T. Zhu, and Z. Chen. AutoCog: Measuring the description-to-permission fidelity in Android applications. In ACM CCS, 2014.
[54] A. Razaghpanah, R. Nithyanand, N. Vallina-Rodriguez, S. Sundaresan, M. Allman, C. Kreibich, and P. Gill. Apps, trackers, privacy, and regulators: A global study of the mobile tracking ecosystem. In NDSS, 2018.
[55] J. Reardon, Á. Feal, P. Wijesekera, A. E. B. On, N. Vallina-Rodriguez, and S. Egelman. 50 ways to leak your data: An exploration of apps' circumvention of the Android permissions system. In
USENIX Security ,2019.[56] J. Ren, M. Lindorfer, D. J. Dubois, A. Rao, D. Choffnes, and N. Vallina-Rodriguez. Bug fixes, improvements,... and privacy leaks: A longitudi-nal study of pii leaks across android app versions. In
NDSS , 2018.[57] J. Ren, A. Rao, M. Lindorfer, A. Legout, and D. Choffnes. Recon:Revealing and controlling pii leaks in mobile network traffic. In
MobiSys , 2016.[58] C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich, V. Paxson,N. Pohlmann, H. Bos, and M. Van Steen. Prudent practices for designingmalware experiments: Status quo and outlook. In
IEEE S&P , 2012.[59] B. P. Sarma, N. Li, C. Gates, R. Potharaju, C. Nita-Rotaru, andI. Molloy. Android permissions: a perspective combining risks andbenefits. In
SACMAT , 2012.[60] S. Seneviratne, H. Kolamunna, and A. Seneviratne. A measurementstudy of tracking in paid mobile applications. In
WiSec , 2015.[61] O. Starov and N. Nikiforakis. Extended tracking powers: Measuringthe privacy diffusion enabled by browser extensions. In
WWW , 2017.[62] R. Stevens, C. Gibler, J. Crussell, J. Erickson, and H. Chen. Investi-gating user privacy in android ad libraries. In
MoST , 2012.[63] G. Suarez-Tangil and G. Stringhini. Eight years of rider measurementin the android malware ecosystem.
IEEE Transactions on Dependableand Secure Computing , 2020.[64] G. Suarez-Tangil and G. Stringhini. Eight years of rider measurementin the android malware ecosystem: evolution and lessons learned.
IEEETransactions on Dependable and Secure Computing , 2020.[65] M. Sun, T. Wei, and J. C. Lui. Taintart: A practical multi-levelinformation-flow tracking system for android runtime. In
ACM CCS ,2016.[66] D. J. Tan, T.-W. Chua, V. L. Thing, et al. Securing android: a survey,taxonomy, and challenges.
ACM Computing Surveys (CSUR) , 47(4),2015.[67] R. Unni and R. Harmon. Perceived effectiveness of push vs. pull mobilelocation based advertising.
Journal of Interactive advertising , 7(2),2007.[68] U.S. Department of Commerce.
The Privacy Shield Framework ,Accessed July 23, 2020.[69] C. Utz, M. Degeling, S. Fahl, F. Schaub, and T. Holz. (un) informedconsent: Studying gdpr consent notices in the field. In
ACM CCS , 2019.[70] N. Vallina-Rodriguez, J. Shah, A. Finamore, Y. Grunenberger, K. Pa-pagiannaki, H. Haddadi, and J. Crowcroft. Breaking for commercials:characterizing mobile advertising. In
IMC , 2012.[71] E. Vanrykel, G. Acar, M. Herrmann, and C. Diaz. Leaky birds:Exploiting mobile application traffic for surveillance. In
InternationalConference on Financial Cryptography and Data Security , 2016.[72] T. Vidas and N. Christin. Evading android runtime analysis via sandboxdetection. In
ASIACCS , 2014.[73] P. Voigt and A. Von dem Bussche. The eu general data protection regu-lation (gdpr).
A Practical Guide, 1st Ed., Cham: Springer InternationalPublishing , 2017.[74] W. G. Voss and K. A. Houser. Personal data and the gdpr: Providinga competitive advantage for us companies.
American Business LawJournal , 56(2), 2019.[75] M. Weissbacher, E. Mariconti, G. Suarez-Tangil, G. Stringhini,W. Robertson, and E. Kirda. Ex-ray: Detection of history-leakingbrowser extensions. In
ACSAC , 2017.[76] M. Y. Wong and D. Lie. Intellidroid: A targeted input generator for thedynamic analysis of android malware. In
NDSS , 2016.[77] M. Xu, C. Song, Y. Ji, M.-W. Shih, K. Lu, C. Zheng, R. Duan, Y. Jang,B. Lee, C. Qian, et al. Toward engineering a secure android ecosystem:A survey of existing techniques.
ACM Computing Surveys (CSUR) ,49(2), 2016.78] C. Yang, Z. Xu, G. Gu, V. Yegneswaran, and P. Porras. Droidminer:Automated mining and characterization of fine-grained malicious be-haviors in android applications. In
ESORICS , 2014.[79] H. Ye, S. Cheng, L. Zhang, and F. Jiang. Droidfuzzer: Fuzzing theandroid apps with intent-filter tag. In
MoMM , 2013.[80] B. Zhao and W. Chen. Data protection as a fundamental right:The european general data protection regulation and its exterritorialapplication in china.
US-China Law Review , 16(3), 2019.[81] B. Zhao and G. Mifsud Bonnici. Protecting eu citizens’ personal datain china: a reality or a fantasy?
International Journal of Law and Information Technology , 24(2), 2016.[82] W. Zhou, Y. Zhou, X. Jiang, and P. Ning. Detecting repackaged smart-phone applications in third-party android marketplaces. In
CODASPY ,2012.[83] Y. Zhou and X. Jiang. Dissecting android malware: Characterizationand evolution. In
IEEE S&P , 2012.[84] Y. Zhou, X. Zhang, X. Jiang, and V. W. Freeh. Taming information-stealing smartphone applications (on android). In