ATVHunter: Reliable Version Detection of Third-Party Libraries for Vulnerability Identification in Android Applications
Xian Zhan, Lingling Fan, Sen Chen, Feng Wu, Tianming Liu, Xiapu Luo, Yang Liu
∗Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
†College of Cyber Science, Nankai University, China
‡College of Intelligence and Computing, Tianjin University, China
§School of Computer Science and Engineering, Nanyang Technological University, Singapore
¶Faculty of Information Technology, Monash University, Australia
Abstract—Third-party libraries (TPLs), as essential parts of the mobile ecosystem, have become one of the most significant contributors to the huge success of Android, facilitating the fast development of Android applications. Detecting TPLs in Android apps is also important for downstream tasks such as malware and repackaged-app identification. To identify in-app TPLs, we need to solve several challenges, such as TPL dependency, code obfuscation, and precise version representation. Unfortunately, existing TPL detection tools have been shown not to handle these challenges well, let alone pinpoint exact TPL versions. To this end, we propose a system, named ATVHUNTER, which can pinpoint the precise vulnerable in-app TPL versions and provide detailed information about the vulnerabilities and TPLs. We propose a two-phase detection approach to identify specific TPL versions. Specifically, we extract Control Flow Graphs (CFGs) as the coarse-grained feature to match potential TPLs in a pre-defined TPL database, and then extract the opcode in each basic block of the CFG as the fine-grained feature to identify exact TPL versions. We build a comprehensive TPL database (189,545 unique TPLs with 3,006,676 versions) as the reference database. Meanwhile, to identify vulnerable in-app TPL versions, we also construct a comprehensive database of known vulnerable TPLs containing 1,180 CVEs and 224 security bugs. Experimental results show that ATVHUNTER outperforms state-of-the-art TPL detection tools, achieving 90.55% precision and 88.79% recall with high efficiency, and is also resilient to widely-used obfuscation techniques and scalable for large-scale TPL detection. Furthermore, to investigate the ecosystem of vulnerable TPLs used by apps, we exploit ATVHUNTER to conduct a large-scale analysis of 104,446 apps and find that 9,050 apps include vulnerable TPL versions with 53,337 vulnerabilities and 7,480 security bugs, most of which carry high risks and are not recognized by app developers.
I. INTRODUCTION
Nowadays, over 3 million Android applications (apps) are available in the official Google Play Store [1]. One reason contributing to the huge success of Android could be the massive presence of third-party libraries (TPLs), which provide reusable functionalities that developers can leverage to facilitate app development (avoiding reinventing the wheel). However, extensive TPL usage attracts attackers to exploit vulnerabilities in, or inject backdoors into, popular TPLs, which poses severe security threats to app users [2–4]. Previous research [5, 6] pointed out that many apps contain vulnerable TPLs, and some of them have been reported with severe vulnerabilities (e.g., the Facebook SDK) that can be exploited by adversaries [7, 8]. Attackers can exploit the vulnerabilities in some ad libraries (e.g., Airpush [9], MoPub [10]) to obtain privacy-sensitive information from the infected devices [11]. Even worse, various TPLs are scattered across different apps, but the information about TPL components in apps is not transparent. Many developers may not know how many and which TPLs are used in their apps, due to many direct and transitive dependencies. Additionally, about 78% of vulnerabilities are detected in indirect dependencies, making the potential risks hard to spot [12]. Thus, vulnerable TPL identification has become an urgent and high-demand task, and TPL version detection has become a standard industry product called Software Composition Analysis (SCA) [12, 13].

Existing TPL detection techniques use either clustering-based methods (e.g., LibRadar [14], LibD [15, 16]) or similarity comparison methods (e.g., LibID [17], LibScout [5]) to identify TPLs used by apps. However, according to our analysis and a previous study [18], we conclude the following deficiencies in existing approaches: 1) Low recall. Clustering-based methods can only identify commonly-used TPLs and may miss niche and new TPLs; their recall depends on the number of input apps and the reuse rate of TPLs. Besides, the code similarity across different versions and TPLs can vary widely, which makes it difficult to choose appropriate clustering parameters that perfectly distinguish different TPLs, let alone versions. Verifying the clustering results is also labor-intensive and error-prone. Similarity comparison methods construct a predefined TPL database as the reference database. However, the currently published TPL databases are far smaller than the number of TPLs in the actual market and thus cannot be used to identify a complete set of in-app TPLs. Apart from that, existing techniques more or less depend on the package structure, especially when constructing in-app library candidates. However, the package structure/name of the same TPL in different versions can mutate or be easily obfuscated. Therefore, using packages as a supplementary feature to generate TPL signatures is also unreliable [18]. 2) Inability to identify precise versions.
To find the vulnerabilities of in-app TPLs, we need to precisely pinpoint the exact TPL versions, because not all TPL versions are vulnerable. Even though there are many TPL detection tools, none of them meets our requirements. AdDetect [19] can only distinguish ad from non-ad libraries. ORLIS [20] only provides the matched classes. Clustering-based tools (e.g., LibRadar [14], LibD [15, 16]) do not claim that they can pinpoint exact TPL versions. Besides, current tools [5, 7, 17, 21] usually report many false positives at version-level identification [18]. Thus, existing tools are not suitable for vulnerable TPL detection.

Apart from the aforementioned weaknesses of existing tools, we still face several challenges in this research direction: 1) Lack of a vulnerable TPL version dataset. To enable vulnerable TPL version (TPL-V) identification, we need a comprehensive set of known vulnerable TPL-Vs. Ideally, each entry should include the TPL name, versions, type, vulnerability severity, etc. However, to the best of our knowledge, no such dataset is publicly available. 2) Precise version representation. We need to distinguish TPLs at the version level; however, it is challenging to extract appropriate code features to represent different versions of the same TPL, especially when the code difference between versions is tiny. 3) Interference from code obfuscation. Many code obfuscation tools (e.g., DashO [22], Proguard [23], and Allatori [24]) can be used to obfuscate apps and TPLs. For example, dead code removal can delete code that is never invoked by the host app. These techniques change the code similarity between in-app TPLs and the original TPLs. Undoubtedly, obfuscation techniques increase the difficulty of TPL identification.

To fill the aforementioned research gap, we propose a system named ATVHUNTER (Android in-app Third-party library Vulnerability Hunter), an obfuscation-resilient TPL-V detection tool that reports detailed information about the vulnerabilities of in-app TPLs. ATVHUNTER first uses class dependency relations to split independent candidate TPL modules from the host app and adopts a two-phase strategy to identify in-app TPLs. It extracts CFGs as the coarse-grained features to locate potential TPLs in the feature database with high efficiency. It then extracts the opcode sequence in each basic block of the CFG as the fine-grained feature to identify the precise version by similarity comparison. To ensure recall, we constructed our TPL feature database by collecting comprehensive and large-scale Java libraries from the Maven repository [25]. We use a fuzzy hashing method to generate signatures, which alleviates the effects of code obfuscation. Compared with previous methods, ATVHUNTER does not depend on the package structure. The main contributions of this work are as follows:

• An effective TPL version detection tool.
We propose ATVHUNTER, an obfuscation-resilient TPL-V detection tool with high accuracy that can find vulnerable in-app TPL-Vs and provide detailed vulnerability and component reports. With the help of our industry collaborator, ATVHUNTER was integrated as a branch of an online service to help users identify vulnerable Android TPLs.

• Comprehensive datasets.
We have constructed a comprehensive and large-scale TPL feature database, which includes 189,545 TPLs with their corresponding 3,006,676 versions, to identify in-app TPLs. We are the first to construct a comprehensive vulnerable TPL-V database for Android apps, including 1,180 CVEs from 957 TPLs with 38,243 vulnerable versions and 224 security bugs from 152 open-source TPLs with 4,533 affected versions.

• Thorough comparisons. We conduct systematic and thorough comparisons between ATVHUNTER and the state-of-the-art tools from different perspectives. The evaluation results demonstrate that ATVHUNTER is resilient to widely-used obfuscation techniques and outperforms the state-of-the-art TPL-V detection tools, achieving high precision (90.55%) and recall (88.79%) at version-level identification. We published the related dataset on our website [26].

• Large-scale analysis.
We leverage ATVHUNTER to conduct a large-scale study on 73,110 apps using TPLs and find that 9,050 apps contain 10,616 vulnerable TPLs. These vulnerable TPLs include 53,337 known vulnerabilities and 7,480 security bugs. Most of these apps use TPLs containing severe vulnerabilities.

II. RELATED WORK
Library Detection.
AdDetect [19] and PEDAL [27] use features such as permissions and APIs to train a classifier to distinguish ad libraries from non-ad libraries. However, these studies fail to identify other types of libraries, such as development aids and UI plugins. Currently, there are three TPL detection tools based on clustering algorithms: LibRadar, LibD, and LibExtractor. LibRadar [14] extracts the Android API calls, the total number of API calls, and the total kinds of API calls as code features, and chooses a multi-level clustering method to identify potential TPLs. LibD [15, 16] extracts the opcode in each CFG block as the code feature. LibExtractor [28] exploits a clustering-based method to find potential malicious libraries. In general, clustering-based approaches have three common weaknesses: 1) they require a considerable number of apps as input to generate enough TPL signatures, and it is difficult for them to find emerging or niche TPLs; they can also import impurities, e.g., if an app is repackaged many times, clustering methods may consider the repackaged host app a TPL. 2) Clustering-based methods may find incomplete TPLs: some TPLs depend on other TPLs, but clustering can separate them into several parts. 3) The above clustering-based approaches more or less rely on package names and package structures, which can be easily obfuscated by existing obfuscators [22–24]. LibD claims to be resilient to package name obfuscation and package structure mutation, but the package flattening technique can remove the whole package structure and change the internal package structure. LibSift [29] constructs a package dependency graph (PDG) to split independent TPL candidates; it does not identify specific libraries, only decouples TPLs from the host app into different parts. Han et al. [30] aim to measure behavioral differences by comparing benign TPLs and malicious TPLs. Their approach extracts the opcode and Android type tags as features, hashes all features in each method, and then compares them with the ground-truth libraries to identify libraries. LibScout [5] is a similarity-based library detection tool which uses a Merkle tree [31] to generate each library instance signature. LibScout chooses fuzzy method signatures as the code feature, replacing non-system identifiers (in the method signature) with the placeholder "X". ORLIS [20] uses the same code feature as LibScout [5] but a different feature generation approach. LibScout and ORLIS are resilient to identifier renaming. However, the code feature of LibScout is too coarse, which affects detection performance; besides, ORLIS can only provide the matched classes to users, which is not user-friendly. Thus, they are not good choices for off-the-shelf TPL detection. LibPecker [7] is also a matching-based library identification tool; it exploits class dependencies as the code features and hashes them as the fingerprint to find TPLs. LibPecker then uses fuzzy class matching to compare a candidate against the libraries in the database. However, the comparison process is time-consuming. Moreover, LibPecker assumes the package hierarchy does not change when a TPL is imported into an app, which affects recall. LibID [17] is also a TPL version detection tool, but it chooses dex2jar [32] as the decompilation tool; the reverse-engineering capability of dex2jar directly limits LibID's detection ability. More details are clarified in § IV.

Vulnerable TPL/App Identification.
Yasumatsu et al. [6] attempt to understand how app developers respond to TPL updates. They studied vulnerable versions of seven TPLs and the corresponding apps; by comparing the evolution time between different TPL-Vs and app versions, they measured the reaction of app developers to these vulnerable TPL versions. However, the number of vulnerable TPLs in their dataset is too small to show the full picture of the infected apps and vulnerable TPLs. OSSPolice [21] is an automated tool for identifying free-software license violations and vulnerable versions of open-source third-party libraries, including both native libraries and Java libraries. It extracts the fuzzy method signature as the library feature and the function centroid [33] as the version feature to identify TPL-Vs. However, generating centroids is substantial in terms of resource consumption.

III. ARCHITECTURE
We design a system, ATVHUNTER, which takes an Android app as input and automatically identifies the used vulnerable TPL-Vs (if any) according to the constructed database. Fig. 1 shows the system design, which is divided into two parts: (1) TPL-V detection, which identifies the specific versions of TPLs used by apps; and (2) vulnerable TPL-V identification, which identifies the vulnerable in-app TPL-Vs based on our collected known vulnerabilities from NVD [34] and GitHub [35]. Based on the database, we also conduct a large-scale study to assess the ecosystem of Android apps in terms of the usage of vulnerable TPLs. Details are introduced as follows.
A. TPL Detection
The TPL detection part of ATVHUNTER includes four key phases: (1) Preprocessing, (2) Module decoupling, (3) Feature generation, and (4) TPL identification.
1) Preprocessing: ATVHUNTER primarily conducts two tasks in this phase. The first task is to decompile the input app and transform the bytecode into appropriate intermediate representations (IRs). The second task is to find the primary module in the app and delete it, to eliminate interference from the host app. If an app includes TPLs, we call the code of the host app the "primary" module, while the in-app TPLs constitute the "non-primary" module. ATVHUNTER first parses the AndroidManifest.xml file and gets the host app packages. Sometimes the code of the host app may belong to several different namespaces; therefore, we extract the app packages, the application namespace, and the package namespace containing the Main Activity (i.e., the launcher Activity), and delete the files under these host namespaces. However, this approach has the following side effects: 1) part of the host code may suffer from package flattening or renaming obfuscation and cannot be deleted; 2) part of the host code cannot be deleted due to special package names; 3) if the host app and TPLs share the same package namespace, the method may delete these TPLs, leading to false negatives. As for cases 1) and 2), if the host code and TPLs have no dependencies, this does not affect the accuracy of TPL identification; if the undeleted host parts include TPLs, we eliminate the interference in the comparison stage.
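As an illustration of this preprocessing step, the host package namespaces can be pulled from a decoded manifest roughly as follows. This is a minimal sketch under our own assumptions (an apktool-decoded, plain-text AndroidManifest.xml; the function name is ours), not ATVHUNTER's actual code:

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def primary_namespaces(manifest_path):
    """Return package namespaces belonging to the host app: the manifest
    package plus the package of the launcher (Main) Activity."""
    root = ET.parse(manifest_path).getroot()
    namespaces = {root.get("package", "")}
    for activity in root.iter("activity"):
        for intent in activity.iter("intent-filter"):
            cats = {c.get(ANDROID_NS + "name") for c in intent.iter("category")}
            if "android.intent.category.LAUNCHER" in cats:
                name = activity.get(ANDROID_NS + "name", "")
                # relative names like ".MainActivity" resolve against the package
                if name.startswith("."):
                    name = root.get("package", "") + name
                namespaces.add(name.rsplit(".", 1)[0])
    return {ns for ns in namespaces if ns}
```

Classes under the returned namespaces would then be removed before module decoupling, subject to the caveats above.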
2) Module Decoupling: The purpose of module decoupling is to split the non-primary module of an app into different independent library candidates. Previous research adopts different features for module decoupling, such as the package structure, the homogeny graph [15], and the package dependency graph (PDG); however, these more or less depend on the package structure of apps. Using the package name or the independent package structure to split in-app TPLs is error-prone and has two obvious disadvantages: 1) low resiliency to package flattening [36]; 2) inaccurate TPL instance construction. Many different TPLs share the same root package. For instance, "com.android.support.appcompat-v7" [37] and "com.android.support.design" [38] are two different TPLs, but they share the same root package com/android/support. Besides, one TPL may have multiple parallel package structures; as can be seen in the example in Fig. 2, this TPL [39] depends on other TPLs to build itself, and the developer deploys the "fat" jar mode to package the project, so the host TPL together with all invoked TPLs constitutes one complete TPL. TPL dependencies are very common: about 47.3% of Android TPLs in the Maven repository depend on others, based on our rough statistics. To overcome this, we adopt the Class Dependency Graph (CDG) as the feature to split up TPL candidates, because the CDG does not depend on the package structure and is resilient to package flattening.
Fig. 1: Workflow of ATVHUNTER. (Figure omitted; it shows the online pipeline of preprocessing, module decoupling, feature generation, and library identification; the offline construction of the TPL feature database (coarse-grained CFG features and fine-grained opcode features) and the vulnerable TPL database (collected vulnerabilities and security bugs); and the final mapping of identified TPL versions to vulnerability information such as TPL name/version and vulnerability type and CVSS score.)
Fig. 2: An example of a TPL's package structure (figure omitted).

The class dependency relationships we consider include: 1) class inheritance (we do not consider interface relationships because they can be deleted by obfuscation), 2) method call relationships, and 3) field reference relationships. We use CDGs to find all related class files, and each CDG is considered a TPL candidate in the general case. Using CDGs avoids the aforementioned situations and package mutation, and is also resilient to package flattening.

In ATVHUNTER, we use a similarity-based method to identify TPL-Vs, and we generate the TPL feature database from the complete TPL files that we downloaded from the Maven repository. Therefore, we need to pay attention to the packaging techniques of Java projects. To facilitate maintenance, most developers adopt the "skinny" mode to package a TPL, which means the released artifact contains only the code written by the TPL developers, without any dependency TPLs; the dependency TPLs are loaded during compilation. To handle this situation, we crawl the metadata of each TPL and record its dependency TPLs and packaging technique [40] by reading the "pom.xml" file. If the "pom.xml" declares "jar-with-dependencies", the artifact includes all dependency TPLs; otherwise, it includes only the host TPL code. If we find a skinny jar, we also need to split out its dependency TPLs by using their package namespaces so that we can match the correct version in the TPL database.
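The decoupling step above can be sketched as a connected-components pass over class-dependency edges. This is our own minimal illustration (a union-find over edge pairs); in ATVHUNTER the edges would come from Androguard's inheritance, method-call, and field-reference analysis:

```python
from collections import defaultdict

def tpl_candidates(edges):
    """Group classes into candidate TPL modules via connected components
    of the class-dependency graph. `edges` is an iterable of
    (class_a, class_b) dependency pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in edges:
        union(a, b)

    groups = defaultdict(set)
    for cls in parent:
        groups[find(cls)].add(cls)
    return list(groups.values())
```

Because grouping follows dependency edges rather than package prefixes, classes scattered by package flattening still land in the same candidate as long as they reference each other.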
3) Feature Generation: After splitting the candidate libraries, we aim to extract features and generate a fingerprint (a.k.a. signature) to represent each TPL file. To ensure scalability and accuracy, we choose features of two granularities. The coarse-grained feature is used to quickly locate potential TPLs in the database; the fine-grained feature is used to identify the TPL-V precisely. (1) For the coarse-grained feature, we extract the Control Flow Graph (CFG) to represent the TPL, since the CFG is relatively stable [41] and keeps semantic information that ensures accuracy to some extent [42]. (2) For the fine-grained feature, we extract the opcode in each basic block of the CFG for exact version identification.
Coarse-grained Feature Extraction.
We first extract the CFG for each method in the candidate TPLs and traverse the CFG to assign each node a unique serial number (starting from 0) according to the execution order. For a branch node with sequence number n, the child with more outgoing edges is given sequence number n + 1 and the other child n + 2. If two child nodes have the same number of outgoing edges, we give n + 1 to the child with more statements in its basic block. We then convert the CFGs into signatures based on the assigned serial numbers, in the form [node count, edge adjacency list], where the adjacency list is represented as [parent -> (child, child, ...), parent -> ...]. We then hash the adjacency list of the CFG as a method signature. To improve search efficiency, we sort these hash values in ascending order and then hash the concatenated values as one of the coarse-grained TPL features (T1). Meanwhile, we also keep the series of CFG signatures in our database to represent each TPL in the feature database.

Fine-grained Feature Extraction.
Based on our analysis, we find that the code similarity between different versions of the same TPL can be diverse, ranging from about 0% to nearly 100%. The coarse-grained features (i.e., CFGs) are likely to generate the same signature for different versions with minor changes, such as inserting/deleting/modifying a statement in a basic block. Therefore, we propose a finer-grained feature, i.e., the opcode in each basic block of the CFG, to represent each version file. However, extracting finer-grained features increases the computational complexity and the cost of computing resources. To ensure the scalability of ATVHUNTER, a common way to achieve this is through hashing [43]. However, hash-based methods have an obvious drawback when determining whether two objects (e.g., TPLs, methods) are similar, because a minor modification can lead to a dramatic change of the hash value. Thus, we adopt a fuzzy hashing technique [44] instead of a traditional hash algorithm to generate the code signature for each method.

Fig. 3: Fuzzy hashing for method feature generation as the version feature (figure omitted).

Fig. 3 shows the feature generation process for TPL-Vs. Specifically, we first extract all the opcode sequences inside each basic block and concatenate them together. We do not consider the operands (e.g., identifier names or hard-coded URLs), which are not robust against some simple obfuscation techniques such as renaming and string encryption [43, 45]. We then concatenate the opcode sequences of all basic blocks according to the adjacency list of the CFG. In this step, our method is somewhat similar to LibD [15] with respect to the code feature: we also adopt the opcode in each basic block of the CFG. However, there are many differences. LibD uses a package-level hash value as the final signature and a clustering algorithm to detect TPLs, while ATVHUNTER, to defend against code obfuscation and TPL customization [7], applies a fuzzy hash to each method-level feature and uses similarity comparison to find similar methods. We first use a sliding window (a.k.a. rolling hash [44]) to cut the opcode sequence into small pieces. Each piece makes an independent contribution to the final fingerprint: if one part of the feature changes due to code obfuscation, it does not cause a big difference in the final fingerprint. We then hash each piece and combine the hashes as the final fine-grained fingerprint of each method. The fingerprints of all methods in a version together represent a TPL-V.
TPL Database Construction.
We crawled all Java TPLs from the Maven Repository [25] (189,545 unique TPLs with their 3,006,676 versions) to build our TPL database. We use the above-mentioned methods to obtain the signature of each TPL. For each version of a TPL, we store both coarse-grained and fine-grained features in a MongoDB [46] database. The size of the entire database is 300 GB. We spent more than one month collecting all the TPLs and another two months generating the TPL feature database.
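For illustration, the coarse-grained signature scheme described above (a per-method hash over the CFG's [node count, edge adjacency list], then a sorted, concatenated TPL-level hash T1) might look like the following sketch; the function names and the canonical string encoding are our assumptions:

```python
import hashlib

def method_signature(node_count, adjacency):
    """Per-method coarse signature: hash of the CFG encoded as
    [node count, edge adjacency list], with nodes already renumbered
    by the canonical traversal described above."""
    canon = "%d|%s" % (node_count,
                       ";".join("%d->%s" % (p, ",".join(map(str, cs)))
                                for p, cs in sorted(adjacency.items())))
    return hashlib.sha256(canon.encode()).hexdigest()

def tpl_coarse_feature(method_sigs):
    """T1: sort the per-method hashes ascending and hash the concatenation,
    giving one order-independent fingerprint for the whole TPL."""
    return hashlib.sha256("".join(sorted(method_sigs)).encode()).hexdigest()
```

Sorting before the final hash makes T1 insensitive to the order in which methods are extracted, so the same TPL always maps to the same database key.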
4) Library Identification:
This step aims to identify the used TPL-Vs in a given app. To achieve this efficiently, we propose a two-stage identification method: 1) potential TPL identification; 2) version identification.
1) Potential TPL Identification.
Since there are over 3 million TPL files in our database to be compared against each candidate library, we search the database in the following order to speed up the entire detection process: a) Search by package names. For each library candidate, we first use its package namespace (if not obfuscated) to narrow down the search space in our database. Note that we cannot directly use the package name to determine a TPL, because the same package namespace can include different third-party libraries. For example, the Android support group [47] includes 99 different TPLs; these TPLs have the same group ID "com.android.support" and the same package name prefix "android/support/". If the package name has been obfuscated or a candidate TPL module has no package name, we move to the next filtering strategy. Note that, even though deciding whether a package name is obfuscated is a non-trivial problem, in our work the package name is only used as supplementary information to speed up the search: whether or not a candidate TPL finds a match in the TPL database by package name, we still continue to search the database via other features. Thus, we apply only a simple rule: if a package name is a hash value or a single letter, we consider it obfuscated. b) Search by the number of classes. We assume two TPLs are unlikely to be the same if the numbers of classes within the two TPLs differ greatly [48]. If the number of classes in a TPL accounts for less than 40% of that in another TPL in the database, we do not compare them further, which speeds up the identification process. c) Search by coarse-grained features. To speed up, we first search for the coarse-grained feature T1 in the TPL database; if we find the same one, ATVHUNTER reports this TPL and stops the search. Otherwise, ATVHUNTER compares the candidate TPL with TPLs in the database: if all the coarse-grained features are the same, we consider the TPL found and the search stops; if over 70% of the coarse-grained features are the same (following previous research [33, 43, 48, 49]), we consider it a potential TPL. Once we find a potential TPL, we identify its exact version.
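The staged search above can be sketched as follows. The dictionary schema (`t1`, `n_classes`, `cfg_sigs`) and function names are our illustration, not ATVHUNTER's actual storage format; only the 40% class-count and 70% coarse-feature thresholds come from the text:

```python
import re

def looks_obfuscated(pkg):
    """Heuristic from the paper: a hash-like or single-letter
    package segment is treated as obfuscated."""
    last = pkg.rsplit(".", 1)[-1]
    return len(last) == 1 or re.fullmatch(r"[0-9a-f]{8,}", last) is not None

def candidate_matches(cand, db, t1_index, overlap=0.70, size_ratio=0.40):
    """Staged potential-TPL search: exact T1 hit first, then class-count
    pruning, then per-method CFG-signature overlap."""
    if cand["t1"] in t1_index:          # exact coarse match: done
        return [t1_index[cand["t1"]]]
    hits = []
    for tpl in db:
        lo, hi = sorted((cand["n_classes"], tpl["n_classes"]))
        if hi == 0 or lo / hi < size_ratio:   # class counts too different
            continue
        shared = len(cand["cfg_sigs"] & tpl["cfg_sigs"])
        if shared / max(len(cand["cfg_sigs"]), 1) >= overlap:
            hits.append(tpl)
    return hits
```

Each stage is strictly cheaper than the next, so the expensive per-method overlap runs only for the few database entries that survive the earlier filters.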
2) Version Identification.
To identify the specific versions of the used TPLs, we utilize the fine-grained features and calculate the similarity ratio of two TPLs as the evaluation metric. To ensure efficiency, we do not re-compare the methods already matched in the previous stage: ATVHUNTER records the matched method pairs, so we only need to compare fewer than 30% of the methods in this phase. Since some code obfuscation techniques (e.g., junk code insertion) can change the fingerprints of methods, causing two methods that were initially identical to differ, we need to compare method similarity and consider two methods matched only when their similarity exceeds a threshold. Based on the number of matched methods, we then compute the TPL similarity; when the number of matched methods exceeds the threshold, we consider the correct TPL with its version found.

• Method Similarity Comparison. We employ the edit distance [43, 50] to measure the similarity between two method fingerprints. The edit distance of two fingerprints is defined as the minimum number of edit operations (i.e., insertion, deletion, and substitution) required to transform one fingerprint into the other. Based on the edit distance of two signatures, we compute the Method Similarity Score (MSS) between two methods m_a and m_b as:

MSS(m_a, m_b) = 1 − d[m_a, m_b] / max(m, n)    (1)

where m and n are the signature lengths of the two methods and d[m_a, m_b] is the edit distance of the two method signatures. If MSS exceeds a certain threshold θ, we consider the two methods matched. Based on our experimental result in § IV-A, we choose θ = 0. as the threshold.

• TPL Similarity Comparison. Based on the number of matched methods, the similarity of two TPLs t and t′ is defined as:

TSS(t, t′) = M_{|t ∩ t′|} / M_{|t|}    (2)

where t is a TPL candidate from the test app and t′ is a TPL from the database; M_{|t|} is the number of methods in t, and M_{|t ∩ t′|} is the number of matched methods between t and t′, which must satisfy two conditions: (a) for every matched pair (m_i, m_j), where m_i is a method of t and m_j is a method of t′, MSS(m_i, m_j) ≥ θ; and (b) there exists at least one pair with MSS(m_i, m_j) = 1, i.e., we only compare two TPLs that have at least one exactly matched method, in order to speed up the identification process. For a TPL candidate t, we consider a potentially matched TPL-V t′ found in the database when TSS(t, t′) ≥ δ, where δ is the similarity threshold, and we select the TPL-V with the largest similarity score as the final result for t, reporting the identified TPL's group id, artifact id, and version number. We set the threshold δ = 0. based on our experimental result in § IV-A.
B. Vulnerable TPL-V Identification

We first build a vulnerable TPL-V database, based on which we identify the vulnerable TPL-Vs used by the apps.
1) Database Construction:
The vulnerable TPL-V database construction process includes the collection of known vulnerabilities in Android TPLs and of security bugs from open-source software.
Known TPL Vulnerability Collection.
To collect the vulnerable TPL versions, we convert the names of all TPL files (3,006,676 in total) in our feature database into the Common Platform Enumeration (CPE) format [51] and exploit cve-search [52], a professional CVE search tool, to query the vulnerable TPLs in the public CVE (Common Vulnerabilities and Exposures) database by mapping the transformed TPL names. In this way, we obtain the known vulnerabilities of TPL-Vs and their detailed information, including the CVE id, vulnerability type, description, severity score from the Common Vulnerability Scoring System (CVSS) [53], vulnerable versions, etc. We use CVSS v3.0 to indicate the severity of the collected vulnerabilities in this paper. Finally, we collected 1,180 CVEs from 957 unique TPLs with 38,243 affected versions.
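A Maven-coordinate-to-CPE conversion might look like the following sketch. The vendor/product mapping here (last group segment as vendor, hyphens normalized) is a naive heuristic of ours; real matching against cve-search requires curation, since CPE names rarely align one-to-one with Maven coordinates:

```python
def to_cpe23(group_id, artifact_id, version):
    """Map a Maven coordinate to a CPE 2.3 formatted string for querying
    a CVE database. Vendor/product derivation is a rough guess."""
    vendor = group_id.rsplit(".", 1)[-1].lower()
    product = artifact_id.lower().replace("-", "_")
    return "cpe:2.3:a:%s:%s:%s:*:*:*:*:*:*:*" % (vendor, product, version)
```

The resulting strings can be fed to a CPE-indexed CVE lookup, with unmatched or ambiguous coordinates reviewed manually.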
Security Bug Collection.
Since ATVHUNTER is able to identify the specific versions of TPLs used by apps, besides the known vulnerabilities we also obtained 224 security bugs from GitHub [35] and Bitbucket [54], owing to the collaboration with our anonymous industrial collaborators. These bugs come from 152 open-source TPLs with their corresponding 4,533 versions. All of these security bugs have been cross-validated by security experts in industry.
2) Vulnerable TPL-V Identification:
When ATVHunter identifies the used TPL-Vs in an app, it searches the vulnerable TPL database to check whether these identified TPL-Vs are vulnerable. If ATVHunter finds vulnerable TPL-Vs, it generates a detailed vulnerability report for users. We believe ATVHunter can serve as an extension of Google's ASI Program [11]. Previous research [6] reported that vulnerabilities listed on the ASI program draw more attention from developers. However, the vulnerabilities reported by the ASI program are limited; our comprehensive dataset can serve as a supplement to it.
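The lookup step amounts to a simple join between the identified TPL-Vs and the vulnerability database. A minimal sketch, using the PrimeFaces CVE mentioned in § V as an illustrative entry (the version key and schema are hypothetical, not the real ATVHunter database):

```python
# Minimal sketch of the vulnerable TPL-V lookup. The database entry and
# its version key are illustrative, not the real ATVHunter database.
VULN_DB = {
    ("org.primefaces:primefaces", "5.3"): ["CVE-2017-1000486"],
}

def check(identified):
    """identified: list of (library, version) pairs found in an app.
    Returns a report mapping each vulnerable TPL-V to its CVE ids."""
    report = {}
    for lib, ver in identified:
        cves = VULN_DB.get((lib, ver))
        if cves:
            report[(lib, ver)] = cves
    return report

print(check([("org.primefaces:primefaces", "5.3"),
             ("com.squareup.okio:okio", "2.4.3")]))
# -> {('org.primefaces:primefaces', '5.3'): ['CVE-2017-1000486']}
```

In practice the report would also carry the CVE metadata (type, description, CVSS score) collected during database construction.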
C. Implementation
ATVHunter is implemented in 2k+ lines of Python code. We employ Apktool [55], a reverse engineering tool commonly used by much previous work [56–59], to decompile the Android apps, and exploit Androguard [60] to obtain the class dependency relations in order to get the independent TPL candidates. We then employ Soot [61] to generate CFGs and also build on Soot to get the opcode sequence in each basic block of a CFG. We use ssdeep [62] to conduct the fuzzy hashing that generates the code feature, and employ the edit distance [50] algorithm to find the in-app TPLs. Our approach can pinpoint the specific TPL versions. We maintain a library database containing more than 3 million TPL files and construct a vulnerable TPL database that includes 224 security bugs from open-source Java software on GitHub and 1,180 CVEs from 910 Android TPLs in public CVE databases.

IV. EVALUATION
In this section, we first construct our ground truth and choose appropriate thresholds for MSS and TSS in § IV-A. Based on the thresholds, we further evaluate ATVHunter in terms of effectiveness (RQ1), scalability (RQ2), and the capability of code obfuscation resilience (RQ3). All the experiments were conducted on a commercial cloud service running Ubuntu 16.04 LTS with an 8-core Intel(R) Xeon(R) Gold 6151 processor (3.00 GHz) and 128 GB memory.

A. Preparation

• Ground-truth Dataset Construction.
We build this dataset for three primary purposes: 1) to verify the effectiveness of ATVHunter; 2) to compare its performance with the state-of-the-art tools; 3) to release the dataset to the community to promote follow-up research. Since it is difficult to know the specific TPL-Vs in commercial apps, we choose open-source apps to compare ATVHunter with existing tools. We first collect the latest versions of 500 open-source apps from F-Droid [63], the largest repository of open-source Android apps. We choose open-source apps as subjects since we can get the specific TPL information (including the version) from the configuration files and source code of the apps; such a mapping relation between apps and TPLs is used as the ground truth for performance evaluation. These apps come from 17 different categories with various sizes. For each app, we manually analyze it to get the in-app TPLs with their specific versions. According to our analysis, these apps contain from 2 to 37 TPLs, and these TPLs have different functions and diverse sizes. We then download these TPLs with their versions from the Maven repository [25]. To make the evaluation results more reliable, we collect the complete versions of each TPL; we filter out 144 apps due to incomplete versions of their TPLs in the Maven repository. Note that, based on our analysis, previously published datasets have some biases: the TPLs used by LibScout and LibID are mostly independent ones. We therefore add some TPLs that depend on other TPLs to our dataset (e.g., "Retrofit" depends on "Guava") to reveal the library identification capability of different tools. Finally, we choose 356 apps and 189 unique TPLs with the complete 6,819 version files in these apps as the ground truth.

Fig. 4: Similarity threshold selection. (a) Method-level; (b) TPL-level

• Threshold Selection.
To avoid bias, we randomly select three groups of apps, excluding the aforementioned dataset, to decide appropriate thresholds for the method similarity score θ and the TPL similarity score δ. We use the method-level false positive rate (FPR) and false negative rate (FNR), and the TPL-level FPR and FNR, as the metrics to decide the similarity thresholds by varying θ and δ. We run the same experiment on the three groups of apps three times and then decide the optimal thresholds. Fig. 4a shows the method-level FPR and FNR at different similarity thresholds. We find that when the threshold θ is around 0.85, both the FPR and FNR are relatively low. Therefore, we choose θ = 0.85 as the MSS threshold, where the FPR is less than 1% and the FNR is less than 0.5%, which achieves a good trade-off. Fig. 4b shows the TPL-level FPR and FNR at different thresholds. According to the results, as the threshold approaches 0.8, many false positives appear because the same TPL has versions with only minor changes; as the threshold approaches 1, the number of false negatives increases. From Fig. 4, the FPR and FNR achieve a good trade-off when the threshold is around 0.95, so we choose 0.95 as the TSS threshold δ. In summary, we employ θ = 0.85 and δ = 0.95 in the following experiments.

B. RQ1: Effectiveness Evaluation
Experimental Setup.
For the effectiveness evaluation, we compare ATVHunter with the state-of-the-art publicly available TPL detection tools that can specify the used TPL versions (i.e., LibID, LibScout, OSSPoLICE, and LibPecker), using our ground-truth dataset (§ IV-A). We employ three evaluation metrics, i.e., precision (TP / (TP + FP)), recall (TP / (TP + FN)), and F1 score (2 × Precision × Recall / (Precision + Recall)), to evaluate the detection accuracy at both the TPL level and the version level. TPL-level identification indicates the ability to identify the in-app TPLs correctly (without specifying the versions), and version-level identification indicates the ability to find both the correct TPLs and the correct versions. For example, if a tool reports "okio-2.0.0, okio-2.3.0" for an app whose ground truth is "okio-2.4.3", then at the TPL level the tool finds the correct TPL, while at the version level there are two false positives and one false negative.
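The two scoring granularities can be illustrated with a small sketch (not the authors' evaluation script; encoding the reported TPL-Vs as a set of "library-version" strings is our assumption):

```python
# Minimal sketch of TPL-level vs. version-level scoring (illustrative only).
def score(reported, ground_truth):
    """reported / ground_truth: sets of 'library-version' strings."""
    lib = lambda s: s.rsplit("-", 1)[0]            # 'okio-2.0.0' -> 'okio'
    # Version level: only exact 'library-version' matches count.
    v_tp = len(reported & ground_truth)
    v_fp = len(reported - ground_truth)
    v_fn = len(ground_truth - reported)
    # TPL level: a library counts as found if any version of it is reported.
    rep_libs = {lib(s) for s in reported}
    gt_libs = {lib(s) for s in ground_truth}
    l_tp = len(rep_libs & gt_libs)
    l_fp = len(rep_libs - gt_libs)
    l_fn = len(gt_libs - rep_libs)
    return {"version": (v_tp, v_fp, v_fn), "library": (l_tp, l_fp, l_fn)}

# The okio example from the text: two reported versions, neither exact.
print(score({"okio-2.0.0", "okio-2.3.0"}, {"okio-2.4.3"}))
# -> {'version': (0, 2, 1), 'library': (1, 0, 0)}
```

Precision, recall, and F1 then follow directly from the (TP, FP, FN) triple at each level.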
Result.
Table I shows the comparison results of ATVHunter and the other state-of-the-art tools. Considering the overall performance, ATVHunter outperforms the other tools on all metrics; its F1 scores at the library level and version level reach 93.43% and 88.82%, respectively. For library-level identification, all tools achieve high precision, but the recall of the current state-of-the-art tools is mediocre. In contrast, the recall of ATVHunter is 88.79%, which is far better than the others. For version-level identification, the precision (90.55%) and recall (87.16%) of ATVHunter are much higher than those of the other tools. Compared with the library-level precision, the precision of each tool at the version level decreases a lot, which means most tools can identify the TPL but cannot pinpoint the exact version. We elaborate on the reasons for the false positives and false negatives of ATVHunter and the other state-of-the-art tools as follows.
FP Analysis.
The reasons for the false positives of ATVHunter can be summarized in three points. (1) Reuse of open-source components. We find that some TPLs are re-developed based on other TPLs with only small code changes; if their similarity exceeds the defined threshold, ATVHunter reports the reused ones at the same time, which are false positives. (2) Artifact id or group id changes. We identify a TPL by its group id, artifact id, and version number. However, some old TPLs have migrated to new ones with their group id or artifact id changed while their code differs little. Take "EventBus" as an example: "org.greenrobot:eventbus" [64] is the upgraded version of "de.greenrobot:eventbus" [65]. The code of these two TPLs is highly similar but their group ids differ; ATVHunter matches both of them and considers them different TPLs. (3) Different versions with high similarity. Some versions of the same TPL have little or no difference in their code. For example, "ACRA_4.8.3" only modifies a few statements in one method of "ACRA_4.8.2", and ATVHunter
would report both versions of the TPL at the same time, one of which is regarded as a false positive. In our database, we even find some versions of the same TPL that have the same Java code but different resource files, configuration files, or native code (C/C++); this situation, however, does not affect the vulnerable TPL identification process. As for the false positives of the other tools, the code feature of LibScout (i.e., the fuzzy method signature) is too coarse, which makes it generate the same signature for different versions if the two versions have only minor differences. As in the aforementioned "ACRA" example, all existing tools cannot distinguish the two versions because they generate the same signature for both. Besides, if the methods are very simple, the signatures generated by LibScout and OSSPoLICE can also collide, which likewise leads to false positives. LibPecker depends on the package structure as a supplementary feature to identify different TPLs, so it may report a TPL that depends on other TPLs several times. For instance, if an app uses library C, which is built on libraries A and B, and libraries A and B are also in the TPL feature database, LibPecker could report library C as libraries A and B, leading to false positives.

TABLE I: Library and Version Detection Comparison

Tools       Library-level (Precision / Recall / F1)   Version-level (Precision / Recall / F1)
ATVHunter   98.58% / 88.79% / 93.43%                  90.55% / 87.16% / 88.82%
LibID
LibScout
OSSPoLICE
LibPecker
FN Analysis.
ATVHunter aims to find TPL versions with high precision; thus, we sacrificed part of the recall when selecting the similarity threshold. The reasons for the false negatives of ATVHunter are as follows. (1) When compiling an app, developers may apply optimizations to reduce the app size. The compiler automatically removes functions of TPLs that are not called by the host app, which causes the in-app TPLs to differ from the original TPLs, leading to false negatives. (2) Some TPLs are integrated into the same package namespace as the host app and may be deleted at the pre-processing stage, leading to false negatives. For example, some companies and organizations develop their own ad SDKs whose package name is the same as that of the host app. The code under the package structure of the host app is deleted at the pre-processing stage, i.e., the ad library is deleted without further consideration, causing false negatives. (3) Some apps use rarely-used open-source TPLs hosted on open-source platforms (e.g., GitHub or Bitbucket) that are not in our TPL database (with over 3 million TPLs), leading to false negatives. For example, the TPLs "com.github.DASAR.ShiftColorPicker", "android-retention-magic-1.2.2", and "android-json-rpc-0.3.4" are developed and hosted on GitHub and are not in our dataset; therefore, ATVHunter cannot find them. Since the other tools also use similarity comparison to find in-app TPLs, this situation may affect their recall as well.

As for the false negatives of the other TPL detection tools, they more or less use the package structure to generate TPL features. However, the package structure is not stable and can easily be changed by package-flattening obfuscation. We find that the package structures of many real-world in-app TPLs are more or less obfuscated, and some TPLs even have no package structure; current tools cannot handle such cases, leading to false negatives. Besides, it is difficult to use the package structure and package name to determine the TPL candidates, as demonstrated in § III-A4. Many different TPLs may have the same package name, and one independent package tree could include several TPLs; therefore, existing tools may generate incorrect code features for these TPLs, which can also lead to false negatives. LibID uses Dex2jar [32] to decompile apps, which does not work on all apps, discounting the recall of LibID. Besides, LibScout and OSSPoLICE are sensitive to CFG structure modification. Compared with them, our CFG adjacency list is less sensitive to such modification. We consider both syntax and semantic information, and our method adopts fuzzy hashing to generate the TPL fingerprints. Thus, code statement modification can only affect part of a fingerprint, which is more robust to different code obfuscations. Based on the above analysis, the strategies of feature selection, extraction, and generation are essential and directly affect the performance of the system.
Conclusion: ATVHunter outperforms state-of-the-art TPL detection tools, achieving 98.58% precision and 88.79% recall at the library level, and 90.55% precision and 87.16% recall at the version level.
C. RQ2: Efficiency Evaluation
In this section, we investigate the detection time of ATVHunter and compare it with the state-of-the-art tools to verify its efficiency, employing the dataset collected in § IV-A. All tools construct their own TPL databases from the same dataset (6,819 TPL versions). All compared tools use similarity comparison to find in-app TPLs; thus, the detection time mainly depends on the number of in-app TPLs and the number of TPL features in the database. The detection time is the time taken to find all TPL-Vs in a test app; it does not include the database construction time.
Result:
Table II shows the comparison of detection time. We present four metrics (i.e., Q1, mean, median, Q3) to evaluate the efficiency of each tool. The efficiency of ATVHunter also outperforms the state-of-the-art tools (66.24 s per app on average). The second fastest is LibScout, whose average detection time is about 83 s. LibID and LibPecker are relatively time-consuming; their average detection time reaches about 16.56 h and 4.5 h per app, respectively. ATVHunter is more efficient than the others because our method only needs a direct search to find the matching pairs in most situations, which dramatically decreases the detection time. ATVHunter employs a two-stage identification method (i.e., filter the potential TPLs first and then identify the exact TPL with its specific version) to find the matched libraries in the database, which avoids directly comparing against the whole database with fine-grained features and largely reduces both the comparison time and the whole detection time. In contrast, in the similarity feature comparison stage, LibScout needs to use the class dependency to filter out impossible pairs, and this step is also time-consuming. Besides, LibScout regards the code of the host app as one of the candidate TPLs, which also costs extra time. OSSPoLICE exploits the fuzzy method signature (the same feature as LibScout) [5] as the TPL code feature and the function centroid [42] as the version code feature. The feature granularity of OSSPoLICE is much finer than that of LibScout; thus, its computational complexity is also greater. Besides, calculating the centroid is heavy in runtime overhead and computing resource consumption, especially for the third element (loop depth) of the centroid: finding all the loops takes O((n + e)(c + 1)) time and O(n + e) space, where the graph has n nodes, e edges, and c elementary circuits. For LibPecker, finding a similar class requires three comparisons while our method needs only one; LibPecker also needs to compare the package hierarchy structure and then calculate the similarity score, which adds extra time. LibID chooses finer-granularity features to identify TPLs; its class dependency analysis, CFG construction, and class matching are also time-consuming.

TABLE II: Comparison Results of Detection Time (per app)

Tool     ATVHunter   LibID   LibScout   OSSPoLICE   LibPecker
Q1
Mean
Median
Q3

Conclusion:
Compared with the other tools, ATVHunter can identify exact TPL-Vs with high efficiency, taking less time for TPL detection on the ground-truth TPL database.
D. RQ3: Obfuscation-resilient Capability
The obfuscation-resilient capability is an important index for measuring the performance of a TPL detection tool, since obfuscation techniques can degrade detection performance.
Experimental Setup.
To evaluate the obfuscation-resilient capability of ATVHunter against different obfuscation techniques, we select 100 apps of multiple categories from the public dataset [66] and use a popular obfuscation tool, DashO [22], to obfuscate these apks with four widely-used obfuscation techniques (i.e., renaming obfuscation, control flow randomization, package flattening, and dead code removal). Obfuscation is a time-consuming task that requires the obfuscation tool to analyze the code logic; it took us about half a month to obfuscate all the apps. Finally, we obtain one group (100 apps) of original apps and four groups (100 apps each) of obfuscated apps. Based on these groups, we compare ATVHunter with the other tools in terms of the detection rate (|TP| / |GT|) at the version level.

TABLE III: Comparison on Code Obfuscation Techniques

Tool        No Obfuscation   Renaming   CFR      PKG FLT   Code RMV
ATVHunter   99.26%           99.26%     90.13%   99.26%
LibID
LibScout
OSSPoLICE
LibPecker

Renaming: renaming obfuscation; CFR: Control Flow Randomization; PKG FLT: Package Flattening; Code RMV: Dead Code Removal
Result:
The detection results are presented in Table III; the second column is the detection rate of each tool on apps without obfuscation. ATVHunter achieves the highest detection rate (99.26%), followed by LibPecker (98.79%). Besides, the detection rate of LibID is only 12.93%, a big gap from its result in RQ1. The main cause of this gap is the inability of the decompilation component dex2jar used by LibID: many apps in this dataset cannot be decompiled successfully by dex2jar because of TPL compatibility issues, type errors, and anti-decompilation settings, so LibID cannot generate the in-app TPL signatures, leading to the low detection rate. As for the capability of the tools on obfuscated apps, all tools are resilient to renaming obfuscation, since their detection rate on renamed apps is the same as on apps without obfuscation. ATVHunter is the least affected by these code obfuscation techniques. Dead code removal has the greatest impact on ATVHunter, whose detection rate drops by about 24%; under the other obfuscation techniques, the detection rate remains over 90%, demonstrating the resilience of ATVHunter to commonly-used code obfuscation techniques. Moreover, the recall on apps obfuscated by package flattening is the same as on apps without obfuscation, showing that our method is completely resilient to package flattening. In contrast, apart from renaming obfuscation, the detection rates of the other tools are affected by obfuscation to varying degrees. Especially for LibScout, the performance drops by more than 70%: LibScout can only correctly identify 17.69% of in-app TPLs obfuscated by package flattening or dead code removal, and 18.24% of in-app TPLs with control flow randomization. Except for ATVHunter, LibPecker achieves the best performance.

As for control flow randomization (CFR), LibScout and OSSPoLICE use the fuzzy method signature as the code feature, which keeps the syntax information but not the semantic information; thus it is difficult for them to defend against CFR. Besides, OSSPoLICE employs the CFG centroid [42] as the version-level code feature. The CFG centroid is a three-dimensional vector whose dimensions indicate the in-degree, out-degree, and loop count, respectively. The CFG centroid is sensitive to CFG structure modification; hence the detection rate of OSSPoLICE drops a lot on apps with CFR. LibPecker and LibID show good resilience to CFR because both of them select class dependencies as the code feature, which is not easily changed by CFR. ATVHunter extracts the CFG as the coarse-grained feature and the opcode in each basic block of the CFG as the fine-grained feature. We keep the semantic information and remove the operands, so our method is resilient to identifier renaming. We split the opcode sequence into small pieces and exploit fuzzy hashing to generate the code feature; although dead code removal and control flow obfuscation can affect part of the code features, this strategy effectively reduces the interference, making the detection rate decline only slightly.

Regarding the package flattening technique, existing tools more or less depend on the package structure to generate TPL signatures, which without a doubt affects their performance. More specifically, LibScout depends on the package structure/name to split TPLs. Firstly, many TPLs belonging to the same group may have the same package name, and it is difficult to split such TPLs correctly. Secondly, package flattening can easily change the package hierarchy or even remove the whole package tree, so LibScout generates incorrect TPL signatures, or no signatures at all for TPLs without a package structure. OSSPoLICE is built on LibScout and hence inherits its limitations. LibPecker assumes the package structure is preserved during obfuscation, but this does not always hold for real-world apps; this strong assumption directly restricts its performance. In contrast, ATVHunter uses the class dependency relation to split the TPL candidates (on the basis of high cohesion and low coupling among different TPLs), which does not depend on the package structure at all; thus, ATVHunter is resilient to package flattening/renaming.

As for dead code removal, this obfuscation technique deletes code that is not invoked by the host app, making the code features of in-app TPLs differ from the original TPLs. This obfuscation affects all TPL detection tools. LibPecker chooses class dependency as the code feature, which keeps the method call relationship, while we adopt the CFG as the code feature, which does not include the method dependency; our method may include methods and classes without invocations. The signature of LibPecker stores more semantic information than ours, so LibPecker achieves better performance under dead code removal.
Conclusion: ATVHunter offers better resilience to code obfuscation than existing tools, especially for identifier renaming, package flattening, and control flow randomization.

V. LARGE-SCALE ANALYSIS
By leveraging ATVHunter, we further conducted a large-scale study on Google Play apps to reveal the threats of vulnerable TPL-Vs in the real world.
Dataset Collection.
We collected commercial Android apps from Google Play based on the number of installations. For each installation range, we crawled the latest versions of apps from Aug. 2019 to Feb. 2020 for this large-scale experiment. We only consider popular apps whose installations range from 10,000 to 5 billion, because vulnerabilities in apps with large installation counts can affect more devices and users. Note that the number of apps in each installation range is unequal; in general, the number of apps with higher installations is relatively smaller. We finally collected 104,446 apps across 33 different categories as the study subjects. From our preliminary study of these apps, we found 72% of them (73,110/104,446) use TPLs to facilitate their development. We thus focus on these 73,110 apps in the following analysis.
A. Vulnerable TPL Landscape
Before conducting the impact analysis of vulnerable TPLs, we first present some essential information about these vulnerable TPL-Vs to give readers a clear understanding of the threats in TPLs. We use the CVSS v3.0 security metrics [53] to indicate the severity (i.e., low, medium, high, and critical) of vulnerabilities. A score greater than 7.0 indicates a vulnerability of high or critical severity; such vulnerabilities account for 21.35% of all the vulnerabilities in our dataset. Even worse, we find that many of these vulnerable TPLs are widely used by other TPLs. For example, the library "org.scala-lang:scala-library", with a severe security risk (CVSS = 9.8) that allows local users to write arbitrary class files, has been used 24,112 times by other TPLs, and most of the vulnerable versions of this TPL have been used more than 2,000 times. Without a doubt, such cases expand the spread of vulnerabilities and add more security risks for app users. These severe vulnerabilities usually involve remote code execution, sensitive data leakage [67, 68], server-side request forgery (SSRF) attacks, malicious code or SQL injection, certificate/authentication bypass, etc. Such behaviors bring unpredictable risks to users' privacy and property security. We found that most of these vulnerable TPLs (98.7%) belong to the utility category.
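The severity buckets referred to above follow the CVSS v3.0 qualitative rating scale, which can be sketched as:

```python
# CVSS v3.0 qualitative severity buckets (ranges per the CVSS v3.0
# specification: Low 0.1-3.9, Medium 4.0-6.9, High 7.0-8.9, Critical 9.0-10.0).
def severity(score: float) -> str:
    if score == 0.0:
        return "none"
    if score <= 3.9:
        return "low"
    if score <= 6.9:
        return "medium"
    if score <= 8.9:
        return "high"
    return "critical"

# The scala-library flaw above scores 9.8, i.e. within the > 7.0 band
# (high/critical) that covers 21.35% of the vulnerabilities in our dataset.
print(severity(9.8))   # -> critical
```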
B. Impact Analysis of Vulnerable TPLs
In our dataset, we find that about 12.37% (9,050/73,110) of the apps include vulnerable TPL-Vs, involving 53,337 known vulnerabilities and 7,480 security bugs from open-source TPLs. The known vulnerabilities come from 166 different vulnerable TPLs with 10,362 corresponding versions, and the security bugs come from 27 vulnerable TPLs with 284 different versions. These vulnerable apps use a total of 58,330 TPLs, approximately 18.2% of which are vulnerable. Among the 9,050 vulnerable apps, 329 apps use TPLs containing both vulnerabilities and security bugs. There are 778 apps containing TPLs with security bugs, and each such app contains about 2.45 security bugs in its TPLs. Furthermore, we find that many education and financial apps use the popular UI library "PrimeFaces" [69], which includes a severe vulnerability (CVE-2017-1000486): PrimeFaces 5.x is vulnerable to a weak encryption flaw resulting in remote code execution. For more analysis results, please refer to our website [26].
C. Lessons Learned
Based on our analysis, we found that many apps include vulnerable TPLs, leading to privacy leakage and financial loss. However, developers seem unaware of the security risks of TPLs. We explore the reasons from the following points:
For TPL developers, according to our results in § V-A, the reuse rate of vulnerable TPLs is pretty high. Many TPL developers develop their own TPLs based on existing ones, especially popular ones, but seldom seem to check the reused components for known vulnerabilities. Even worse, we find 210,727 TPLs use vulnerable TPL versions, indicating that many TPL developers may not track the vulnerability fixes in these open-source products. Although some TPL developers have patched the vulnerabilities in later versions, many affected apps still use the old vulnerable versions, which indirectly expands the threat of the vulnerabilities in TPLs. The lack of centralized control of these open-source TPLs also poses attack surfaces for hackers.

For app developers, we reported some TPL versions with severe vulnerabilities to the corresponding app developers via email. We wrote 50 emails to these app developers or companies and received 5 replies within 2 months. Based on their feedback, we find that: 1) Most developers only care about the functionalities provided by the TPLs and are unaware of the security problems in them. This is understandable, since analyzing every library before using it would eliminate the convenience of using these components. However, based on our analysis, some commonly-used TPLs contain severe vulnerabilities; we suggest that app developers be aware of vulnerabilities in TPLs, and ATVHunter could help them detect vulnerable TPL versions. 2) Some app developers or companies do not know how to conduct security assessments of the imported TPLs. They also hope "our team can help them conduct the security assessment of the used TPLs or tell them the specific analysis processes." 3) Some app developers did not know that some vulnerable TPLs had been updated or patched and still used the old TPL versions. Even when they noticed the upgraded versions, some were reluctant to change due to the extra cost. They said that "If a TPL adds many new functions, they have to spend much time understanding these new features and change too much of their own code. Thus, they prefer to keep old TPL-Vs."
For app markets, we found that many app markets do not have a security assessment mechanism to warn developers about the potential security risks in their apps. As far as we know, only Google provides such a service, the App Security Improvement (ASI) program, which gives tips to help app developers on Google Play improve the security of their apps. Previous research [6] reported that vulnerabilities listed in the ASI program could draw more attention from developers. However, the vulnerabilities reported by the ASI program are limited due to the lack of a comprehensive vulnerability database and of a vulnerable TPL detection tool like ATVHunter.

VI. DISCUSSION
Limitations. (1) If the Java code of several versions is identical, ATVHunter provides several candidates instead of a specific one, leading to some false positives. (2) ATVHunter may eliminate some TPLs by mistakenly regarding them as part of the primary module if such TPLs are imported into the package structure of the host app, causing some false negatives. (3) We only focus on Java libraries and do not consider native libraries. In fact, native libraries are also an essential part of Android apps, and vulnerabilities inside them could cause more severe consequences; detecting vulnerable native libraries is left for future work. (4) ATVHunter adopts static analysis to find the TPLs; therefore, we may miss libraries that are loaded dynamically. Besides, some TPLs have dynamic behaviors, such as reflection and dynamic class loading; our approach may miss such dynamic features, which affects our detection performance. (5) We crawled about 3 million TPLs from Maven to build our feature database. Although this database is large and comprehensive, which guarantees the detection rate of ATVHunter, our method still has limitations: third-party libraries are constantly updated, and ATVHunter cannot find newly emerging TPLs. How to find these newly emerging TPLs and dynamically maintain our database is left for future work.
Threats to Validity. (1) The first threat comes from the similarity threshold: it inevitably induces some false negatives and false positives for some apps due to minor differences between TPLs. To minimize this threat, we selected the similarity threshold through a careful experimental design. (2) Another threat comes from analyzing only free apps. We believe it is meaningful to study the vulnerable TPLs used by both free and paid apps, which is left for future work.

VII. CONCLUSION
In this paper, we proposed ATVHunter, a TPL detection system that can precisely pinpoint TPL versions and find the vulnerable TPLs used by apps. Evaluation results show that ATVHunter can effectively and efficiently find in-app TPLs and is resilient to state-of-the-art obfuscation techniques. Meanwhile, we constructed a comprehensive and large vulnerable TPL version database containing 224 security bugs and 1,180 CVEs. ATVHunter can find the vulnerable TPLs in apps and reveals the threat of vulnerable TPLs, which can help improve the quality of apps and has a profound impact on the Android ecosystem.

VIII. ACKNOWLEDGMENT
We thank the anonymous reviewers for their helpful comments. This work is partly supported by the National Research Foundation, Prime Minister's Office, Singapore under its National Cybersecurity R&D Program (Award No. NRF2018NCR-NCR005-0001), the Singapore National Research Foundation under NCR Award Numbers NRF2018NCR-NSOE003-0001 and NRF2018NCR-NSOE004-0001, NRF Investigatorship NRFI06-2020-0022, the Hong Kong PhD Fellowship Scheme, and Hong Kong RGC Projects (No. 152223/17E, 152239/18E, CityU C1008-16G).