ATVHunter: Reliable Version Detection of Third-Party Libraries for Vulnerability Identification in Android Applications
Xian Zhan, Lingling Fan, Sen Chen, Feng Wu, Tianming Liu, Xiapu Luo, Yang Liu
∗Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
†College of Cyber Science, Nankai University, China
‡College of Intelligence and Computing, Tianjin University, China
§School of Computer Science and Engineering, Nanyang Technological University, Singapore
¶Faculty of Information Technology, Monash University, Australia
Abstract—Third-party libraries (TPLs), as essential parts of the mobile ecosystem, have become one of the most significant contributors to the huge success of Android, facilitating the fast development of Android applications. Detecting TPLs in Android apps is also important for downstream tasks such as malware and repackaged-app identification. To identify in-app TPLs, we need to solve several challenges, such as TPL dependency, code obfuscation, and precise version representation. Unfortunately, existing TPL detection tools have been shown not to handle these challenges well, let alone pinpoint exact TPL versions. To this end, we propose a system, named ATVHUNTER, which can pinpoint the precise vulnerable in-app TPL versions and provide detailed information about the vulnerabilities and TPLs. We propose a two-phase detection approach to identify specific TPL versions. Specifically, we extract Control Flow Graphs (CFGs) as the coarse-grained feature to match potential TPLs in a pre-defined TPL database, and then extract the opcode in each basic block of the CFG as the fine-grained feature to identify exact TPL versions. We build a comprehensive TPL database (189,545 unique TPLs with 3,006,676 versions) as the reference database. Meanwhile, to identify vulnerable in-app TPL versions, we also construct a comprehensive database of known vulnerable TPLs containing 1,180 CVEs and 224 security bugs. Experimental results show that ATVHUNTER outperforms state-of-the-art TPL detection tools, achieving 90.55% precision and 88.79% recall with high efficiency, and is also resilient to widely-used obfuscation techniques and scalable for large-scale TPL detection. Furthermore, to investigate the ecosystem of vulnerable TPLs used by apps, we exploit ATVHUNTER to conduct a large-scale analysis of 104,446 apps and find that 9,050 apps include vulnerable TPL versions with 53,337 vulnerabilities and 7,480 security bugs, most of which carry high risks and are not recognized by app developers.
I. INTRODUCTION
Nowadays, over 3 million Android applications (apps) are available in the official Google Play Store [1]. One reason contributing to the huge success of Android could be the massive presence of third-party libraries (TPLs), which provide reusable functionalities that developers can leverage to facilitate app development (avoiding reinventing the wheel). However, extensive TPL usage attracts attackers to exploit vulnerabilities in, or inject backdoors into, popular TPLs, which poses severe security threats to app users [2–4]. Previous research [5, 6] pointed out that many apps contain vulnerable TPLs, and some of them have been reported with severe vulnerabilities (e.g., the Facebook SDK) that can be exploited by adversaries [7, 8]. Attackers can exploit the vulnerabilities in some ad libraries (e.g., Airpush [9], MoPub [10]) to obtain privacy-sensitive information from the infected devices [11]. Even worse, various TPLs are scattered across different apps, but the information about TPL components in apps is not transparent. Many developers may not know how many and which TPLs are used in their apps, due to many direct and transitive dependencies. Additionally, about 78% of vulnerabilities are detected in indirect dependencies, making the potential risks hard to spot [12]. Thus, vulnerable TPL identification has become an urgent and high-demand task, and TPL version detection has become a standard industry product called Software Composition Analysis (SCA) [12, 13].

Existing TPL detection techniques use either clustering-based methods (e.g., LibRadar [14], LibD [15, 16]) or similarity comparison methods (e.g., LibID [17], LibScout [5]) to identify TPLs used by apps. However, according to our analysis and a previous study [18], we conclude the following deficiencies in existing approaches: 1) Low recall. Clustering-based methods can only identify commonly-used TPLs and may miss niche and new TPLs; their recall depends on the number of input apps and the reuse rate of TPLs. Besides, the code similarity across different versions and TPLs can vary widely, which makes it difficult to choose appropriate clustering parameters that perfectly distinguish different TPLs, let alone versions. Verifying the clustering results is also labor-intensive and error-prone. Similarity comparison methods construct a predefined TPL database as the reference database. However, the currently published TPL databases are far smaller than the number of TPLs in the actual market and thus cannot be used to identify a complete set of in-app TPLs. Apart from that, existing techniques more or less depend on the package structure, especially when constructing in-app library candidates. However, the package structure/name of the same TPL in different versions can mutate or be easily obfuscated. Therefore, using packages as a supplementary feature to generate TPL signatures is also unreliable [18]. 2) Inability to identify precise versions.
To find the vulnerabilities of in-app TPLs, we need to precisely pinpoint the exact TPL versions, because not all TPL versions are vulnerable. Even though there are many TPL detection tools, none of them meets our requirements. AdDetect [19] can only distinguish ad from non-ad libraries. ORLIS [20] only provides the matched classes. Clustering-based tools (e.g., LibRadar [14], LibD [15, 16]) do not claim that they can pinpoint exact TPL versions. Besides, current tools [5, 7, 17, 21] usually report many false positives at version-level identification [18]. Thus, existing tools are not suitable for vulnerable TPL detection.

Apart from the aforementioned weaknesses of existing tools, we still face several challenges in this research direction: 1) Lack of a vulnerable TPL version dataset. To enable vulnerable TPL version (TPL-V) identification, we need a comprehensive set of known vulnerable TPL-Vs. Ideally, each entry should include the TPL name, versions, type, vulnerability severity, etc. However, to the best of our knowledge, no such dataset is publicly available. 2) Precise version representation. We need to distinguish TPLs at the version level; however, it is challenging to extract appropriate code features to represent different versions of the same TPL, especially when the code difference between versions is tiny. 3) Interference from code obfuscation. Many code obfuscation tools (e.g., DashO [22], Proguard [23], and Allatori [24]) can be used to obfuscate apps and TPLs. For example, dead code removal can delete code that is never invoked by the host app. These techniques change the code similarity between in-app TPLs and the original TPLs. Undoubtedly, obfuscation techniques increase the difficulty of TPL identification.

To fill the aforementioned research gap, we propose a system named ATVHUNTER (Android in-app Third-party library Vulnerability Hunter), an obfuscation-resilient TPL-V detection tool that reports detailed information about the vulnerabilities of in-app TPLs. ATVHUNTER first uses class dependency relations to split independent candidate TPL modules from the host app and adopts a two-phase strategy to identify in-app TPLs. It extracts CFGs as the coarse-grained features to locate potential TPLs in the feature database with high efficiency. It then extracts the opcode sequence in each basic block of the CFG as the fine-grained feature to identify the precise version by similarity comparison. To ensure recall, we constructed our TPL feature database by collecting comprehensive and large-scale Java libraries from the Maven repository [25]. We use a fuzzy hashing method to generate signatures, which alleviates the effects of code obfuscation. Compared with previous methods, ATVHUNTER does not depend on the package structure. The main contributions of this work are as follows:

• An effective TPL version detection tool.
We propose ATVHUNTER, an obfuscation-resilient TPL-V detection tool with high accuracy that can find vulnerable in-app TPL-Vs and provide detailed vulnerability and component reports. With the help of our industry collaborator, ATVHUNTER was integrated as a branch of an online service to help users identify vulnerable Android TPLs.

• Comprehensive datasets.
We have constructed a comprehensive and large-scale TPL feature database, which includes 189,545 TPLs with their corresponding 3,006,676 versions, to identify in-app TPLs. We are the first to construct a comprehensive vulnerable TPL-V database for Android apps, including 1,180 CVEs from 957 TPLs with 38,243 vulnerable versions and 224 security bugs from 152 open-source TPLs with 4,533 affected versions.

• Thorough comparisons. We conduct systematic and thorough comparisons between ATVHUNTER and the state-of-the-art tools from different perspectives. The evaluation results demonstrate that ATVHUNTER is resilient to widely-used obfuscation techniques and outperforms the state-of-the-art TPL-V detection tools, achieving high precision (90.55%) and recall (88.79%) at version-level identification. We published the related dataset on our website [26].

• Large-scale analysis.
We leverage ATVHUNTER to conduct a large-scale study on 73,110 apps using TPLs and find that 9,050 apps contain 10,616 vulnerable TPLs. These vulnerable TPLs include 53,337 known vulnerabilities and 7,480 security bugs. Most of these apps use TPLs containing severe vulnerabilities.

II. RELATED WORK
Library Detection.
AdDetect [19] and PEDAL [27] use features such as permissions and APIs to train a classifier to distinguish ad libraries from non-ad libraries. However, these studies fail to identify other types of libraries, such as development aids and UI plugins. Currently, there are three TPL detection tools based on clustering algorithms: LibRadar, LibD, and LibExtractor. LibRadar [14] extracts the Android API calls, the total number of API calls, and the total kinds of API calls as code features, and chooses a multi-level clustering method to identify potential TPLs. LibD [15, 16] extracts the opcode in each CFG block as the code feature. LibExtractor [28] exploits a clustering-based method to find potential malicious libraries. In general, clustering-based approaches have three common weaknesses: 1) they require a considerable number of apps as input to generate enough TPL signatures, and it is difficult for them to find emerging or niche TPLs; they can also import impurities, e.g., if an app is repackaged many times, clustering methods may consider the repackaged host app a TPL. 2) Clustering-based methods may find incomplete TPLs: some TPLs depend on other TPLs, but clustering can separate them into several parts. 3) The above clustering-based approaches more or less rely on package names and package structures, which can be easily obfuscated by existing obfuscators [22–24]. LibD claims to be resilient to package name obfuscation and package structure mutation, but the package flattening technique can remove the whole package structure and change the internal package structure. LibSift [29] constructs a package dependency graph (PDG) to split independent TPL candidates; it does not identify specific libraries, only decouples TPLs from the host app into different parts. Han et al. [30] aim to measure behavioral differences by comparing benign TPLs and malicious TPLs. Their approach extracts the opcode and Android type tags as features, hashes all features in each method, and then compares them with the ground-truth libraries to identify libraries. LibScout [5] is a similarity-based library detection tool which uses a Merkle tree [31] to generate each library instance signature. LibScout chooses fuzzy method signatures as the code feature, replacing non-system identifiers (in the method signature) with the placeholder "X". ORLIS [20] uses the same code feature as LibScout [5] but a different feature generation approach. LibScout and ORLIS are resilient to identifier renaming. However, the code feature of LibScout is too coarse, which affects detection performance; besides, ORLIS can only provide the matched classes to users, which is not user-friendly. Thus, they are not good choices for off-the-shelf TPL detection. LibPecker [7] is also a matching-based library identification tool; it exploits class dependencies as the code features and hashes them as the fingerprint to find TPLs. LibPecker then uses fuzzy class matching to compare a candidate against the libraries in the database. However, the comparison process is time-consuming. Moreover, LibPecker assumes the package hierarchy does not change when a TPL is imported into an app, which affects recall. LibID [17] is also a TPL version detection tool, but it chooses dex2jar [32] as the decompilation tool; the reverse-engineering capability of dex2jar directly limits LibID's detection ability. More details are clarified in § IV.

Vulnerable TPL/App Identification.
Yasumatsu et al. [6] attempt to understand how app developers respond to TPL updates. They studied vulnerable versions of seven TPLs and the corresponding apps; by comparing the evolution time between different TPL-Vs and app versions, they measured the reaction of app developers to these vulnerable TPL versions. However, the number of vulnerable TPLs in their dataset is too small to show the full picture of the infected apps and vulnerable TPLs. OSSPolice [21] is an automated tool for identifying free-software license violations and vulnerable versions of open-source third-party libraries, including both native libraries and Java libraries. It extracts the fuzzy method signature as the library feature and the function centroid [33] as the version feature to identify TPL-Vs. However, generating centroids is substantial in terms of resource consumption.

III. ARCHITECTURE
We design a system, ATVHUNTER, which takes an Android app as input and automatically identifies the used vulnerable TPL-Vs (if any) according to the constructed database. Fig. 1 shows the system design, which is divided into two parts: (1) TPL-V detection, which identifies the specific versions of TPLs used by apps; and (2) vulnerable TPL-V identification, which identifies the vulnerable in-app TPL-Vs based on our collected known vulnerabilities from NVD [34] and GitHub [35]. Based on the database, we also conduct a large-scale study to assess the ecosystem of Android apps in terms of the usage of vulnerable TPLs. Details are introduced as follows.
A. TPL Detection
The TPL detection part of ATVHUNTER includes four key phases: (1) Preprocessing, (2) Module decoupling, (3) Feature generation, and (4) TPL identification.
1) Preprocessing: ATVHUNTER primarily conducts two tasks in this phase. The first task is to decompile the input app and transform the bytecode into appropriate intermediate representations (IRs). The second task is to find the primary module in the app and delete it, to eliminate interference from the host app. If an app includes TPLs, we call the code of the host app the "primary" module, while the in-app TPLs constitute the "non-primary" module. ATVHUNTER first parses the AndroidManifest.xml file and gets the host app packages. Sometimes the code of the host app may belong to several different namespaces; therefore, we extract the app packages, the application namespace, and the package namespace containing the Main Activity (i.e., the launcher Activity), and delete the files under these host namespaces. However, this approach has the following side effects: 1) part of the host code may suffer from package flattening or renaming obfuscation and cannot be deleted; 2) part of the host code cannot be deleted due to special package names; 3) if the host app and TPLs share the same package namespace, the method may delete these TPLs, leading to false negatives. As for cases 1) and 2), if the host code and TPLs have no dependencies, this does not affect the accuracy of TPL identification; if the undeleted host parts include TPLs, we eliminate the interference in the comparison stage.
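As an illustration of this preprocessing step, the host package namespaces can be pulled from a decoded manifest roughly as follows. This is a minimal sketch under our own assumptions (an apktool-decoded, plain-text AndroidManifest.xml; the function name is ours), not ATVHUNTER's actual code:

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def primary_namespaces(manifest_path):
    """Return package namespaces belonging to the host app: the manifest
    package plus the package of the launcher (Main) Activity."""
    root = ET.parse(manifest_path).getroot()
    namespaces = {root.get("package", "")}
    for activity in root.iter("activity"):
        for intent in activity.iter("intent-filter"):
            cats = {c.get(ANDROID_NS + "name") for c in intent.iter("category")}
            if "android.intent.category.LAUNCHER" in cats:
                name = activity.get(ANDROID_NS + "name", "")
                # relative names like ".MainActivity" resolve against the package
                if name.startswith("."):
                    name = root.get("package", "") + name
                namespaces.add(name.rsplit(".", 1)[0])
    return {ns for ns in namespaces if ns}
```

Classes under the returned namespaces would then be removed before module decoupling, subject to the caveats above.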
2) Module Decoupling: The purpose of module decoupling is to split the non-primary module of an app into different independent library candidates. Previous research adopts different features for module decoupling, such as the package structure, the homogeny graph [15], and the package dependency graph (PDG); however, these more or less depend on the package structure of apps. Using the package name or the independent package structure to split in-app TPLs is error-prone and has two obvious disadvantages: 1) low resiliency to package flattening [36]; 2) inaccurate TPL instance construction. Many different TPLs share the same root package. For instance, "com.android.support.appcompat-v7" [37] and "com.android.support.design" [38] are two different TPLs, but they share the same root package com/android/support. Besides, one TPL may have multiple parallel package structures; as can be seen in the example in Fig. 2, this TPL [39] depends on other TPLs to build itself, and the developer deploys the "fat" jar mode to package the project, so the host TPL together with all invoked TPLs constitutes one complete TPL. TPL dependencies are very common: about 47.3% of Android TPLs in the Maven repository depend on others, based on our rough statistics. To overcome this, we adopt the Class Dependency Graph (CDG) as the feature to split up TPL candidates, because the CDG does not depend on the package structure and is resilient to package flattening.
Fig. 1: Workflow of ATVHUNTER. (Figure omitted; it shows the online pipeline of preprocessing, module decoupling, feature generation, and library identification; the offline construction of the TPL feature database (coarse-grained CFG features and fine-grained opcode features) and the vulnerable TPL database (collected vulnerabilities and security bugs); and the final mapping of identified TPL versions to vulnerability information such as TPL name/version and vulnerability type and CVSS score.)
Fig. 2: An example of a TPL's package structure (figure omitted).

The class dependency relationships we consider include: 1) class inheritance (we do not consider interface relationships because they can be deleted by obfuscation), 2) method call relationships, and 3) field reference relationships. We use CDGs to find all related class files, and each CDG is considered a TPL candidate in the general case. Using CDGs avoids the aforementioned situations and package mutation, and is also resilient to package flattening.

In ATVHUNTER, we use a similarity-based method to identify TPL-Vs, and we generate the TPL feature database from the complete TPL files that we downloaded from the Maven repository. Therefore, we need to pay attention to the packaging techniques of Java projects. To facilitate maintenance, most developers adopt the "skinny" mode to package a TPL, which means the released artifact contains only the code written by the TPL developers, without any dependency TPLs; the dependency TPLs are loaded during compilation. To handle this situation, we crawl the metadata of each TPL and record its dependency TPLs and packaging technique [40] by reading the "pom.xml" file. If the "pom.xml" declares "jar-with-dependencies", the artifact includes all dependency TPLs; otherwise, it includes only the host TPL code. If we find a skinny jar, we also need to split out its dependency TPLs by using their package namespaces so that we can match the correct version in the TPL database.
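The decoupling step above can be sketched as a connected-components pass over class-dependency edges. This is our own minimal illustration (a union-find over edge pairs); in ATVHUNTER the edges would come from Androguard's inheritance, method-call, and field-reference analysis:

```python
from collections import defaultdict

def tpl_candidates(edges):
    """Group classes into candidate TPL modules via connected components
    of the class-dependency graph. `edges` is an iterable of
    (class_a, class_b) dependency pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in edges:
        union(a, b)

    groups = defaultdict(set)
    for cls in parent:
        groups[find(cls)].add(cls)
    return list(groups.values())
```

Because grouping follows dependency edges rather than package prefixes, classes scattered by package flattening still land in the same candidate as long as they reference each other.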
3) Feature Generation: After splitting the candidate libraries, we aim to extract features and generate a fingerprint (a.k.a. signature) to represent each TPL file. To ensure scalability and accuracy, we choose features of two granularities. The coarse-grained feature is used to quickly locate potential TPLs in the database; the fine-grained feature is used to identify the TPL-V precisely. (1) For the coarse-grained feature, we extract the Control Flow Graph (CFG) to represent the TPL, since the CFG is relatively stable [41] and keeps semantic information that ensures accuracy to some extent [42]. (2) For the fine-grained feature, we extract the opcode in each basic block of the CFG for exact version identification.
Coarse-grained Feature Extraction.
We first extract the CFG for each method in the candidate TPLs and traverse the CFG to assign each node a unique serial number (starting from 0) according to the execution order. For a branch node with sequence number n, the child with more outgoing edges is given sequence number n + 1 and the other child n + 2. If two child nodes have the same number of outgoing edges, we give n + 1 to the child with more statements in its basic block. We then convert the CFGs into signatures based on the assigned serial numbers, in the form [node count, edge adjacency list], where the adjacency list is represented as [parent -> (child, child, ...), parent -> ...]. We then hash the adjacency list of the CFG as a method signature. To improve search efficiency, we sort these hash values in ascending order and then hash the concatenated values as one of the coarse-grained TPL features (T1). Meanwhile, we also keep the series of CFG signatures in our database to represent each TPL in the feature database.

Fine-grained Feature Extraction.
Based on our analysis, we find that the code similarity between different versions of the same TPL can be diverse, ranging from about 0% to nearly 100%. The coarse-grained features (i.e., CFGs) are likely to generate the same signature for different versions with minor changes, such as inserting/deleting/modifying a statement in a basic block. Therefore, we propose a finer-grained feature, i.e., the opcode in each basic block of the CFG, to represent each version file. However, extracting finer-grained features increases the computational complexity and the cost of computing resources. To ensure the scalability of ATVHUNTER, a common way to achieve this is through hashing [43]. However, hash-based methods have an obvious drawback when determining whether two objects (e.g., TPLs, methods) are similar, because a minor modification can lead to a dramatic change of the hash value. Thus, we adopt a fuzzy hashing technique [44] instead of a traditional hash algorithm to generate the code signature for each method.

Fig. 3: Fuzzy hashing for method feature generation as the version feature (figure omitted).

Fig. 3 shows the feature generation process for TPL-Vs. Specifically, we first extract all the opcode sequences inside each basic block and concatenate them together. We do not consider the operands (e.g., identifier names or hard-coded URLs), which are not robust against some simple obfuscation techniques such as renaming and string encryption [43, 45]. We then concatenate the opcode sequences of all basic blocks according to the adjacency list of the CFG. In this step, our method is somewhat similar to LibD [15] with respect to the code feature: we also adopt the opcode in each basic block of the CFG. However, there are many differences. LibD uses a package-level hash value as the final signature and a clustering algorithm to detect TPLs, while ATVHUNTER, to defend against code obfuscation and TPL customization [7], applies a fuzzy hash to each method-level feature and uses similarity comparison to find similar methods. We first use a sliding window (a.k.a. rolling hash [44]) to cut the opcode sequence into small pieces. Each piece makes an independent contribution to the final fingerprint: if one part of the feature changes due to code obfuscation, it does not cause a big difference in the final fingerprint. We then hash each piece and combine the hashes as the final fine-grained fingerprint of each method. The fingerprints of all methods in a version together represent a TPL-V.
TPL Database Construction.
We crawled all Java TPLs from the Maven Repository [25] (189,545 unique TPLs with their 3,006,676 versions) to build our TPL database. We use the above-mentioned methods to obtain the signature of each TPL. For each version of a TPL, we store both coarse-grained and fine-grained features in a MongoDB [46] database. The size of the entire database is 300 GB. We spent more than one month collecting all the TPLs and another two months generating the TPL feature database.
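For illustration, the coarse-grained signature scheme described above (a per-method hash over the CFG's [node count, edge adjacency list], then a sorted, concatenated TPL-level hash T1) might look like the following sketch; the function names and the canonical string encoding are our assumptions:

```python
import hashlib

def method_signature(node_count, adjacency):
    """Per-method coarse signature: hash of the CFG encoded as
    [node count, edge adjacency list], with nodes already renumbered
    by the canonical traversal described above."""
    canon = "%d|%s" % (node_count,
                       ";".join("%d->%s" % (p, ",".join(map(str, cs)))
                                for p, cs in sorted(adjacency.items())))
    return hashlib.sha256(canon.encode()).hexdigest()

def tpl_coarse_feature(method_sigs):
    """T1: sort the per-method hashes ascending and hash the concatenation,
    giving one order-independent fingerprint for the whole TPL."""
    return hashlib.sha256("".join(sorted(method_sigs)).encode()).hexdigest()
```

Sorting before the final hash makes T1 insensitive to the order in which methods are extracted, so the same TPL always maps to the same database key.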
4) Library Identification:
This step aims to identify the used TPL-Vs in a given app. To achieve this efficiently, we propose a two-stage identification method: 1) potential TPL identification; 2) version identification.
1) Potential TPL Identification.
Since there are over 3 million TPL files in our database to be compared against each candidate library, we search the database in the following order to speed up the entire detection process: a) Search by package names. For each library candidate, we first use its package namespace (if not obfuscated) to narrow down the search space in our database. Note that we cannot directly use the package name to determine a TPL, because the same package namespace can include different third-party libraries. For example, the Android support group [47] includes 99 different TPLs; these TPLs have the same group ID "com.android.support" and the same package name prefix "android/support/". If the package name has been obfuscated or a candidate TPL module has no package name, we move to the next filtering strategy. Note that, even though deciding whether a package name is obfuscated is a non-trivial problem, in our work the package name is only used as supplementary information to speed up the search: whether or not a candidate TPL finds a match in the TPL database by package name, we still continue to search the database via other features. Thus, we apply only a simple rule: if a package name is a hash value or a single letter, we consider it obfuscated. b) Search by the number of classes. We assume two TPLs are unlikely to be the same if the numbers of classes within the two TPLs differ greatly [48]. If the number of classes in a TPL accounts for less than 40% of that in another TPL in the database, we do not compare them further, which speeds up the identification process. c) Search by coarse-grained features. To speed up, we first search for the coarse-grained feature T1 in the TPL database; if we find the same one, ATVHUNTER reports this TPL and stops the search. Otherwise, ATVHUNTER compares the candidate TPL with TPLs in the database: if all the coarse-grained features are the same, we consider the TPL found and the search stops; if over 70% of the coarse-grained features are the same (following previous research [33, 43, 48, 49]), we consider it a potential TPL. Once we find a potential TPL, we identify its exact version.
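The staged search above can be sketched as follows. The dictionary schema (`t1`, `n_classes`, `cfg_sigs`) and function names are our illustration, not ATVHUNTER's actual storage format; only the 40% class-count and 70% coarse-feature thresholds come from the text:

```python
import re

def looks_obfuscated(pkg):
    """Heuristic from the paper: a hash-like or single-letter
    package segment is treated as obfuscated."""
    last = pkg.rsplit(".", 1)[-1]
    return len(last) == 1 or re.fullmatch(r"[0-9a-f]{8,}", last) is not None

def candidate_matches(cand, db, t1_index, overlap=0.70, size_ratio=0.40):
    """Staged potential-TPL search: exact T1 hit first, then class-count
    pruning, then per-method CFG-signature overlap."""
    if cand["t1"] in t1_index:          # exact coarse match: done
        return [t1_index[cand["t1"]]]
    hits = []
    for tpl in db:
        lo, hi = sorted((cand["n_classes"], tpl["n_classes"]))
        if hi == 0 or lo / hi < size_ratio:   # class counts too different
            continue
        shared = len(cand["cfg_sigs"] & tpl["cfg_sigs"])
        if shared / max(len(cand["cfg_sigs"]), 1) >= overlap:
            hits.append(tpl)
    return hits
```

Each stage is strictly cheaper than the next, so the expensive per-method overlap runs only for the few database entries that survive the earlier filters.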
2) Version Identification.
To identify the specific versions of the used TPLs, we utilize the fine-grained features and calculate the similarity ratio of two TPLs as the evaluation metric. To ensure efficiency, we do not re-compare the methods already matched in the previous stage: ATVHUNTER records the matched method pairs, so we only need to compare fewer than 30% of the methods in this phase. Since some code obfuscation techniques (e.g., junk code insertion) can change the fingerprints of methods, causing two methods that were initially identical to differ, we need to compare method similarity and consider two methods matched only when their similarity exceeds a threshold. Based on the number of matched methods, we then compute the TPL similarity; when the number of matched methods exceeds the threshold, we consider the correct TPL with its version found.

• Method Similarity Comparison. We employ the edit distance [43, 50] to measure the similarity between two method fingerprints. The edit distance of two fingerprints is defined as the minimum number of edit operations (i.e., insertion, deletion, and substitution) required to transform one fingerprint into the other. Based on the edit distance of two signatures, we compute the Method Similarity Score (MSS) between two methods m_a and m_b as:

MSS(m_a, m_b) = 1 − d[m_a, m_b] / max(m, n)    (1)

where m and n are the signature lengths of the two methods and d[m_a, m_b] is the edit distance of the two method signatures. If MSS exceeds a certain threshold θ, we consider the two methods matched. Based on our experimental result in § IV-A, we choose θ = 0. as the threshold.

• TPL Similarity Comparison. Based on the number of matched methods, the similarity of two TPLs t and t′ is defined as:

TSS(t, t′) = M_{|t ∩ t′|} / M_{|t|}    (2)

where t is a TPL candidate from the test app and t′ is a TPL from the database; M_{|t|} is the number of methods in t, and M_{|t ∩ t′|} is the number of matched methods between t and t′, which must satisfy two conditions: (a) for every matched pair (m_i, m_j), where m_i is a method of t and m_j is a method of t′, MSS(m_i, m_j) ≥ θ; and (b) there exists at least one pair with MSS(m_i, m_j) = 1, i.e., we only compare two TPLs that have at least one exactly matched method, in order to speed up the identification process. For a TPL candidate t, we consider a potentially matched TPL-V t′ found in the database when TSS(t, t′) ≥ δ, where δ is the similarity threshold, and we select the TPL-V with the largest similarity score as the final result for t, reporting the identified TPL's group id, artifact id, and version number. We set the threshold δ = 0. based on our experimental result in § IV-A.
B. Vulnerable TPL-V Identification

We first build a vulnerable TPL-V database, based on which we identify the vulnerable TPL-Vs used by the apps.
1) Database Construction:
The vulnerable TPL-V database construction process includes the collection of known vulnerabilities in Android TPLs and of security bugs from open-source software.
Known TPL Vulnerability Collection.
To collect the vulnerable TPL versions, we convert the names of all TPL files (3,006,676 in total) in our feature database into the Common Platform Enumeration (CPE) format [51] and exploit cve-search [52], a professional CVE search tool, to query the vulnerable TPLs in the public CVE (Common Vulnerabilities and Exposures) database by mapping the transformed TPL names. In this way, we obtain the known vulnerabilities of TPL-Vs and their detailed information, including the CVE id, vulnerability type, description, severity score from the Common Vulnerability Scoring System (CVSS) [53], vulnerable versions, etc. We use CVSS v3.0 to indicate the severity of the collected vulnerabilities in this paper. Finally, we collected 1,180 CVEs from 957 unique TPLs with 38,243 affected versions.
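A Maven-coordinate-to-CPE conversion might look like the following sketch. The vendor/product mapping here (last group segment as vendor, hyphens normalized) is a naive heuristic of ours; real matching against cve-search requires curation, since CPE names rarely align one-to-one with Maven coordinates:

```python
def to_cpe23(group_id, artifact_id, version):
    """Map a Maven coordinate to a CPE 2.3 formatted string for querying
    a CVE database. Vendor/product derivation is a rough guess."""
    vendor = group_id.rsplit(".", 1)[-1].lower()
    product = artifact_id.lower().replace("-", "_")
    return "cpe:2.3:a:%s:%s:%s:*:*:*:*:*:*:*" % (vendor, product, version)
```

The resulting strings can be fed to a CPE-indexed CVE lookup, with unmatched or ambiguous coordinates reviewed manually.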
Security Bug Collection.
Since ATVHUNTER is able to identify the specific versions of TPLs used by apps, besides the known vulnerabilities we also obtained 224 security bugs from GitHub [35] and Bitbucket [54], owing to the collaboration with our anonymous industrial collaborators. These bugs come from 152 open-source TPLs with their corresponding 4,533 versions. All of these security bugs have been cross-validated by security experts in industry.
2) Vulnerable TPL-V Identification:
When ATVHunter identifies the used TPL-Vs in an app, it searches the vulnerable TPL database to check whether these identified TPL-Vs are vulnerable. If ATVHunter finds vulnerable TPL-Vs, it generates a detailed vulnerability report for users. We believe ATVHunter can serve as an extension of Google's ASI Program [11]. Previous research [6] reported that vulnerabilities listed on the ASI program draw more attention from developers. However, the vulnerabilities reported by the ASI program are limited; our comprehensive dataset can serve as a supplement to it.
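The lookup step amounts to a simple join between the identified TPL-Vs and the vulnerability database. A minimal sketch, using the PrimeFaces CVE mentioned in § V as an illustrative entry (the version key and schema are hypothetical, not the real ATVHunter database):

```python
# Minimal sketch of the vulnerable TPL-V lookup. The database entry and
# its version key are illustrative, not the real ATVHunter database.
VULN_DB = {
    ("org.primefaces:primefaces", "5.3"): ["CVE-2017-1000486"],
}

def check(identified):
    """identified: list of (library, version) pairs found in an app.
    Returns a report mapping each vulnerable TPL-V to its CVE ids."""
    report = {}
    for lib, ver in identified:
        cves = VULN_DB.get((lib, ver))
        if cves:
            report[(lib, ver)] = cves
    return report

print(check([("org.primefaces:primefaces", "5.3"),
             ("com.squareup.okio:okio", "2.4.3")]))
# -> {('org.primefaces:primefaces', '5.3'): ['CVE-2017-1000486']}
```

In practice the report would also carry the CVE metadata (type, description, CVSS score) collected during database construction.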
C. Implementation
ATVHunter is implemented in 2k+ lines of Python code. We employ Apktool [55], a reverse engineering tool commonly used by much previous work [56–59], to decompile the Android apps, and exploit Androguard [60] to obtain the class dependency relations in order to get the independent TPL candidates. We then employ Soot [61] to generate CFGs and also build on Soot to get the opcode sequence in each basic block of a CFG. We use ssdeep [62] to conduct the fuzzy hashing that generates the code feature, and employ the edit distance [50] algorithm to find the in-app TPLs. Our approach can pinpoint the specific TPL versions. We maintain a library database containing more than 3 million TPL files and construct a vulnerable TPL database that includes 224 security bugs from open-source Java software on GitHub and 1,180 CVEs from 910 Android TPLs in public CVE databases.

IV. EVALUATION
In this section, we first construct our ground truth and choose appropriate thresholds for MSS and TSS in § IV-A. Based on the thresholds, we further evaluate ATVHunter in terms of effectiveness (RQ1), scalability (RQ2), and the capability of code obfuscation resilience (RQ3). All the experiments were conducted on a commercial cloud service running Ubuntu 16.04 LTS with an 8-core Intel(R) Xeon(R) Gold 6151 processor (3.00 GHz) and 128 GB memory.

A. Preparation

• Ground-truth Dataset Construction.
We build this dataset for three primary purposes: 1) to verify the effectiveness of ATVHunter; 2) to compare its performance with the state-of-the-art tools; 3) to release the dataset to the community to promote follow-up research. Since it is difficult to know the specific TPL-Vs in commercial apps, we choose open-source apps to compare ATVHunter with existing tools. We first collect the latest versions of 500 open-source apps from F-Droid [63], the largest repository of open-source Android apps. We choose open-source apps as subjects since we can get the specific TPL information (including the version) from the configuration files and source code of the apps; such a mapping relation between apps and TPLs is used as the ground truth for performance evaluation. These apps come from 17 different categories with various sizes. For each app, we manually analyze it to get the in-app TPLs with their specific versions. According to our analysis, these apps contain from 2 to 37 TPLs, and these TPLs have different functions and diverse sizes. We then download these TPLs with their versions from the Maven repository [25]. To make the evaluation results more reliable, we collect the complete versions of each TPL; we filter out 144 apps due to incomplete versions of their TPLs in the Maven repository. Note that, based on our analysis, previously published datasets have some biases: the TPLs used by LibScout and LibID are mostly independent ones. We therefore add some TPLs that depend on other TPLs to our dataset (e.g., "Retrofit" depends on "Guava") to reveal the library identification capability of different tools. Finally, we choose 356 apps and 189 unique TPLs with the complete 6,819 version files in these apps as the ground truth.

Fig. 4: Similarity threshold selection. (a) Method-level; (b) TPL-level

• Threshold Selection.
To avoid bias, we randomly select three groups of apps, excluding the aforementioned dataset, to decide appropriate thresholds for the method similarity score θ and the TPL similarity score δ. We use the method-level false positive rate (FPR) and false negative rate (FNR), and the TPL-level FPR and FNR, as the metrics to decide the similarity thresholds by varying θ and δ. We run the same experiment on the three groups of apps three times and then decide the optimal thresholds. Fig. 4a shows the method-level FPR and FNR at different similarity thresholds. We find that when the threshold θ is around 0.85, both the FPR and FNR are relatively low. Therefore, we choose θ = 0.85 as the MSS threshold, where the FPR is less than 1% and the FNR is less than 0.5%, which achieves a good trade-off. Fig. 4b shows the TPL-level FPR and FNR at different thresholds. According to the results, as the threshold approaches 0.8, many false positives appear because the same TPL has versions with only minor changes; as the threshold approaches 1, the number of false negatives increases. From Fig. 4, the FPR and FNR achieve a good trade-off when the threshold is around 0.95, so we choose 0.95 as the TSS threshold δ. In summary, we employ θ = 0.85 and δ = 0.95 in the following experiments.

B. RQ1: Effectiveness Evaluation
Experimental Setup.
For the effectiveness evaluation, we compare ATVHunter with the state-of-the-art publicly available TPL detection tools that can specify the used TPL versions (i.e., LibID, LibScout, OSSPoLICE, and LibPecker), using our ground-truth dataset (§ IV-A). We employ three evaluation metrics, i.e., precision (TP / (TP + FP)), recall (TP / (TP + FN)), and F1 score (2 × Precision × Recall / (Precision + Recall)), to evaluate the detection accuracy at both the TPL level and the version level. TPL-level identification indicates the ability to identify the in-app TPLs correctly (without specifying the versions), and version-level identification indicates the ability to find both the correct TPLs and the correct versions. For example, if a tool reports "okio-2.0.0, okio-2.3.0" for an app whose ground truth is "okio-2.4.3", then at the TPL level the tool finds the correct TPL, while at the version level there are two false positives and one false negative.
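The two scoring granularities can be illustrated with a small sketch (not the authors' evaluation script; encoding the reported TPL-Vs as a set of "library-version" strings is our assumption):

```python
# Minimal sketch of TPL-level vs. version-level scoring (illustrative only).
def score(reported, ground_truth):
    """reported / ground_truth: sets of 'library-version' strings."""
    lib = lambda s: s.rsplit("-", 1)[0]            # 'okio-2.0.0' -> 'okio'
    # Version level: only exact 'library-version' matches count.
    v_tp = len(reported & ground_truth)
    v_fp = len(reported - ground_truth)
    v_fn = len(ground_truth - reported)
    # TPL level: a library counts as found if any version of it is reported.
    rep_libs = {lib(s) for s in reported}
    gt_libs = {lib(s) for s in ground_truth}
    l_tp = len(rep_libs & gt_libs)
    l_fp = len(rep_libs - gt_libs)
    l_fn = len(gt_libs - rep_libs)
    return {"version": (v_tp, v_fp, v_fn), "library": (l_tp, l_fp, l_fn)}

# The okio example from the text: two reported versions, neither exact.
print(score({"okio-2.0.0", "okio-2.3.0"}, {"okio-2.4.3"}))
# -> {'version': (0, 2, 1), 'library': (1, 0, 0)}
```

Precision, recall, and F1 then follow directly from the (TP, FP, FN) triple at each level.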
Result.
Table I shows the comparison results of ATVHunter and the other state-of-the-art tools. Considering the overall performance, ATVHunter outperforms the other tools on all metrics; its F1 scores at the library level and version level reach 93.43% and 88.82%, respectively. For library-level identification, all tools achieve high precision, but the recall of the current state-of-the-art tools is mediocre. In contrast, the recall of ATVHunter is 88.79%, which is far better than the others. For version-level identification, the precision (90.55%) and recall (87.16%) of ATVHunter are much higher than those of the other tools. Compared with the library-level precision, the precision of each tool at the version level decreases a lot, which means most tools can identify the TPL but cannot pinpoint the exact version. We elaborate on the reasons for the false positives and false negatives of ATVHunter and the other state-of-the-art tools as follows.
FP Analysis.
The reasons for the false positives of ATVHunter can be summarized in three points. (1) Reuse of open-source components. We find that some TPLs are re-developed based on other TPLs with only small code changes; if their similarity exceeds the defined threshold, ATVHunter reports the reused ones at the same time, which are false positives. (2) Artifact id or group id changes. We identify a TPL by its group id, artifact id, and version number. However, some old TPLs have migrated to new ones with their group id or artifact id changed while their code differs little. Take "EventBus" as an example: "org.greenrobot:eventbus" [64] is the upgraded version of "de.greenrobot:eventbus" [65]. The code of these two TPLs is highly similar but their group ids differ; ATVHunter matches both of them and considers them different TPLs. (3) Different versions with high similarity. Some versions of the same TPL have little or no difference in their code. For example, "ACRA_4.8.3" only modifies a few statements in one method of "ACRA_4.8.2", and ATVHunter
would report both versions of the TPL at the same time, one of which is regarded as a false positive. In our database, we even find some versions of the same TPL that have the same Java code but different resource files, configuration files, or native code (C/C++); this situation, however, does not affect the vulnerable TPL identification process. As for the false positives of the other tools, the code feature of LibScout (i.e., the fuzzy method signature) is too coarse, which makes it generate the same signature for different versions if the two versions have only minor differences. As in the aforementioned "ACRA" example, all existing tools cannot distinguish the two versions because they generate the same signature for both. Besides, if the methods are very simple, the signatures generated by LibScout and OSSPoLICE can also collide, which likewise leads to false positives. LibPecker depends on the package structure as a supplementary feature to identify different TPLs, so it may report a TPL that depends on other TPLs several times. For instance, if an app uses library C, which is built on libraries A and B, and libraries A and B are also in the TPL feature database, LibPecker could report library C as libraries A and B, leading to false positives.

TABLE I: Library and Version Detection Comparison

Tools       Library-level (Precision / Recall / F1)   Version-level (Precision / Recall / F1)
ATVHunter   98.58% / 88.79% / 93.43%                  90.55% / 87.16% / 88.82%
LibID
LibScout
OSSPoLICE
LibPecker
FN Analysis.
ATVHunter aims to find TPL versions with high precision; thus, we sacrificed part of the recall when selecting the similarity threshold. The reasons for the false negatives of ATVHunter are as follows. (1) When compiling an app, developers may apply optimizations to reduce the app size. The compiler automatically removes functions of TPLs that are not called by the host app, which causes the in-app TPLs to differ from the original TPLs, leading to false negatives. (2) Some TPLs are integrated into the same package namespace as the host app and may be deleted at the pre-processing stage, leading to false negatives. For example, some companies and organizations develop their own ad SDKs whose package name is the same as that of the host app. The code under the package structure of the host app is deleted at the pre-processing stage, i.e., the ad library is deleted without further consideration, causing false negatives. (3) Some apps use rarely-used open-source TPLs hosted on open-source platforms (e.g., GitHub or Bitbucket) that are not in our TPL database (with over 3 million TPLs), leading to false negatives. For example, the TPLs "com.github.DASAR.ShiftColorPicker", "android-retention-magic-1.2.2", and "android-json-rpc-0.3.4" are developed and hosted on GitHub and are not in our dataset; therefore, ATVHunter cannot find them. Since the other tools also use similarity comparison to find in-app TPLs, this situation may affect their recall as well.

As for the false negatives of the other TPL detection tools, they more or less use the package structure to generate TPL features. However, the package structure is not stable and can easily be changed by package-flattening obfuscation. We find that the package structures of many real-world in-app TPLs are more or less obfuscated, and some TPLs even have no package structure; current tools cannot handle such cases, leading to false negatives. Besides, it is difficult to use the package structure and package name to determine the TPL candidates, as demonstrated in § III-A4. Many different TPLs may have the same package name, and one independent package tree could include several TPLs; therefore, existing tools may generate incorrect code features for these TPLs, which can also lead to false negatives. LibID uses Dex2jar [32] to decompile apps, which does not work on all apps, discounting the recall of LibID. Besides, LibScout and OSSPoLICE are sensitive to CFG structure modification. Compared with them, our CFG adjacency list is less sensitive to such modification. We consider both syntax and semantic information, and our method adopts fuzzy hashing to generate the TPL fingerprints. Thus, code statement modification can only affect part of a fingerprint, which is more robust to different code obfuscations. Based on the above analysis, the strategies of feature selection, extraction, and generation are essential and directly affect the performance of the system.
Conclusion: ATVHunter outperforms state-of-the-art TPL detection tools, achieving 98.58% precision and 88.79% recall at the library level, and 90.55% precision and 87.16% recall at the version level.
C. RQ2: Efficiency Evaluation
In this section, we investigate the detection time of ATVHunter and compare it with the state-of-the-art tools to verify its efficiency, employing the dataset collected in § IV-A. All tools construct their own TPL databases from the same dataset (6,819 TPL versions). All compared tools use similarity comparison to find in-app TPLs; thus, the detection time mainly depends on the number of in-app TPLs and the number of TPL features in the database. The detection time is the time taken to find all TPL-Vs in a test app; it does not include the database construction time.
Result:
Table II shows the comparison of detection time. We present four metrics (i.e., Q1, mean, median, Q3) to evaluate the efficiency of each tool. The efficiency of ATVHunter also outperforms the state-of-the-art tools (66.24 s per app on average). The second fastest is LibScout, whose average detection time is about 83 s. LibID and LibPecker are relatively time-consuming; their average detection time reaches about 16.56 h and 4.5 h per app, respectively. ATVHunter is more efficient than the others because our method only needs a direct search to find the matching pairs in most situations, which dramatically decreases the detection time. ATVHunter employs a two-stage identification method (i.e., filter the potential TPLs first and then identify the exact TPL with its specific version) to find the matched libraries in the database, which avoids directly comparing against the whole database with fine-grained features and largely reduces both the comparison time and the whole detection time. In contrast, in the similarity feature comparison stage, LibScout needs to use the class dependency to filter out impossible pairs, and this step is also time-consuming. Besides, LibScout regards the code of the host app as one of the candidate TPLs, which also costs extra time. OSSPoLICE exploits the fuzzy method signature (the same feature as LibScout) [5] as the TPL code feature and the function centroid [42] as the version code feature. The feature granularity of OSSPoLICE is much finer than that of LibScout; thus, its computational complexity is also greater. Besides, calculating the centroid is heavy in runtime overhead and computing resource consumption, especially for the third element (loop depth) of the centroid: finding all the loops takes O((n + e)(c + 1)) time and O(n + e) space, where the graph has n nodes, e edges, and c elementary circuits. For LibPecker, finding a similar class requires three comparisons while our method needs only one; LibPecker also needs to compare the package hierarchy structure and then calculate the similarity score, which adds extra time. LibID chooses finer-granularity features to identify TPLs; its class dependency analysis, CFG construction, and class matching are also time-consuming.

TABLE II: Comparison Results of Detection Time (per app)

Tool     ATVHunter   LibID   LibScout   OSSPoLICE   LibPecker
Q1
Mean
Median
Q3

Conclusion:
Compared with the other tools, ATVHunter can identify exact TPL-Vs with high efficiency, taking less time for TPL detection on the ground-truth TPL database.
D. RQ3: Obfuscation-resilient Capability
The obfuscation-resilient capability is an important index for measuring the performance of a TPL detection tool, since obfuscation techniques can degrade detection performance.
Experimental Setup.
To evaluate the obfuscation-resilient capability of ATVHunter against different obfuscation techniques, we select 100 apps of multiple categories from the public dataset [66] and use a popular obfuscation tool, DashO [22], to obfuscate these apks with four widely-used obfuscation techniques (i.e., renaming obfuscation, control flow randomization, package flattening, and dead code removal). Obfuscation is a time-consuming task that requires the obfuscation tool to analyze the code logic; it took us about half a month to obfuscate all the apps. Finally, we obtain one group (100 apps) of original apps and four groups (100 apps each) of obfuscated apps. Based on these groups, we compare ATVHunter with the other tools in terms of the detection rate (|TP| / |GT|) at the version level.

TABLE III: Comparison on Code Obfuscation Techniques

Tool        No Obfuscation   Renaming   CFR      PKG FLT   Code RMV
ATVHunter   99.26%           99.26%     90.13%   99.26%
LibID
LibScout
OSSPoLICE
LibPecker

Renaming: renaming obfuscation; CFR: Control Flow Randomization; PKG FLT: Package Flattening; Code RMV: Dead Code Removal
Result:
The detection results are presented in Table III; the second column is the detection rate of each tool on apps without obfuscation. ATVHunter achieves the highest detection rate (99.26%), followed by LibPecker (98.79%). Besides, the detection rate of LibID is only 12.93%, a big gap from its result in RQ1. The main cause of this gap is the inability of the decompilation component dex2jar used by LibID: many apps in this dataset cannot be decompiled successfully by dex2jar because of TPL compatibility issues, type errors, and anti-decompilation settings, so LibID cannot generate the in-app TPL signatures, leading to the low detection rate. As for the capability of the tools on obfuscated apps, all tools are resilient to renaming obfuscation, since their detection rate on renamed apps is the same as on apps without obfuscation. ATVHunter is the least affected by these code obfuscation techniques. Dead code removal has the greatest impact on ATVHunter, whose detection rate drops by about 24%; under the other obfuscation techniques, the detection rate remains over 90%, demonstrating the resilience of ATVHunter to commonly-used code obfuscation techniques. Moreover, the recall on apps obfuscated by package flattening is the same as on apps without obfuscation, showing that our method is completely resilient to package flattening. In contrast, apart from renaming obfuscation, the detection rates of the other tools are affected by obfuscation to varying degrees. Especially for LibScout, the performance drops by more than 70%: LibScout can only correctly identify 17.69% of in-app TPLs obfuscated by package flattening or dead code removal, and 18.24% of in-app TPLs with control flow randomization. Except for ATVHunter, LibPecker achieves the best performance.

As for control flow randomization (CFR), LibScout and OSSPoLICE use the fuzzy method signature as the code feature, which keeps the syntax information but not the semantic information; thus it is difficult for them to defend against CFR. Besides, OSSPoLICE employs the CFG centroid [42] as the version-level code feature. The CFG centroid is a three-dimensional vector whose dimensions indicate the in-degree, out-degree, and loop count, respectively. The CFG centroid is sensitive to CFG structure modification; hence the detection rate of OSSPoLICE drops a lot on apps with CFR. LibPecker and LibID show good resilience to CFR because both of them select class dependencies as the code feature, which is not easily changed by CFR. ATVHunter extracts the CFG as the coarse-grained feature and the opcode in each basic block of the CFG as the fine-grained feature. We keep the semantic information and remove the operands, so our method is resilient to identifier renaming. We split the opcode sequence into small pieces and exploit fuzzy hashing to generate the code feature; although dead code removal and control flow obfuscation can affect part of the code features, this strategy effectively reduces the interference, making the detection rate decline only slightly.

Regarding the package flattening technique, existing tools more or less depend on the package structure to generate TPL signatures, which without a doubt affects their performance. More specifically, LibScout depends on the package structure/name to split TPLs. Firstly, many TPLs belonging to the same group may have the same package name, and it is difficult to split such TPLs correctly. Secondly, package flattening can easily change the package hierarchy or even remove the whole package tree, so LibScout generates incorrect TPL signatures, or no signatures at all for TPLs without a package structure. OSSPoLICE is built on LibScout and hence inherits its limitations. LibPecker assumes the package structure is preserved during obfuscation, but this does not always hold for real-world apps; this strong assumption directly restricts its performance. In contrast, ATVHunter uses the class dependency relation to split the TPL candidates (on the basis of high cohesion and low coupling among different TPLs), which does not depend on the package structure at all; thus, ATVHunter is resilient to package flattening/renaming.

As for dead code removal, this obfuscation technique deletes code that is not invoked by the host app, making the code features of in-app TPLs differ from the original TPLs. This obfuscation affects all TPL detection tools. LibPecker chooses class dependency as the code feature, which keeps the method call relationship, while we adopt the CFG as the code feature, which does not include the method dependency; our method may include methods and classes without invocations. The signature of LibPecker stores more semantic information than ours, so LibPecker achieves better performance under dead code removal.
Conclusion: ATVHunter offers better resilience to code obfuscation than existing tools, especially for identifier renaming, package flattening, and control flow randomization.

V. LARGE-SCALE ANALYSIS
By leveraging ATVHunter, we further conducted a large-scale study on Google Play apps to reveal the threats of vulnerable TPL-Vs in the real world.
Dataset Collection.
We collected commercial Android apps from Google Play based on the number of installations. For each installation range, we crawled the latest versions of apps from Aug. 2019 to Feb. 2020 for this large-scale experiment. We only consider popular apps whose installations range from 10,000 to 5 billion, because vulnerabilities in apps with large installation counts can affect more devices and users. Note that the number of apps in each installation range is unequal; in general, the number of apps with higher installations is relatively smaller. We finally collected 104,446 apps across 33 different categories as the study subjects. From our preliminary study of these apps, we found 72% of them (73,110/104,446) use TPLs to facilitate their development. We thus focus on these 73,110 apps in the following analysis.
A. Vulnerable TPL Landscape
Before conducting the impact analysis of vulnerable TPLs, we first present some essential information about these vulnerable TPL-Vs to give readers a clear understanding of the threats in TPLs. We use the CVSS v3.0 security metrics [53] to indicate the severity (i.e., low, medium, high, and critical) of vulnerabilities. A score greater than 7.0 indicates a vulnerability of high or critical severity; such vulnerabilities account for 21.35% of all the vulnerabilities in our dataset. Even worse, we find that many of these vulnerable TPLs are widely used by other TPLs. For example, the library "org.scala-lang:scala-library", with a severe security risk (CVSS = 9.8) that allows local users to write arbitrary class files, has been used 24,112 times by other TPLs, and most of the vulnerable versions of this TPL have been used more than 2,000 times. Without a doubt, such cases expand the spread of vulnerabilities and add more security risks for app users. These severe vulnerabilities usually involve remote code execution, sensitive data leakage [67, 68], server-side request forgery (SSRF) attacks, malicious code or SQL injection, certificate/authentication bypass, etc. Such behaviors bring unpredictable risks to users' privacy and property security. We found that most of these vulnerable TPLs (98.7%) belong to the utility category.
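The severity buckets referred to above follow the CVSS v3.0 qualitative rating scale, which can be sketched as:

```python
# CVSS v3.0 qualitative severity buckets (ranges per the CVSS v3.0
# specification: Low 0.1-3.9, Medium 4.0-6.9, High 7.0-8.9, Critical 9.0-10.0).
def severity(score: float) -> str:
    if score == 0.0:
        return "none"
    if score <= 3.9:
        return "low"
    if score <= 6.9:
        return "medium"
    if score <= 8.9:
        return "high"
    return "critical"

# The scala-library flaw above scores 9.8, i.e. within the > 7.0 band
# (high/critical) that covers 21.35% of the vulnerabilities in our dataset.
print(severity(9.8))   # -> critical
```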
B. Impact Analysis of Vulnerable TPLs
In our dataset, we find that about 12.37% (9,050/73,110) of the apps include vulnerable TPL-Vs, involving 53,337 known vulnerabilities and 7,480 security bugs from open-source TPLs. The known vulnerabilities come from 166 different vulnerable TPLs with 10,362 corresponding versions, and the security bugs come from 27 vulnerable TPLs with 284 different versions. These vulnerable apps use a total of 58,330 TPLs, approximately 18.2% of which are vulnerable. Among the 9,050 vulnerable apps, 329 apps use TPLs containing both vulnerabilities and security bugs. There are 778 apps containing TPLs with security bugs, and each such app contains about 2.45 security bugs in its TPLs. Furthermore, we find that many education and financial apps use the popular UI library "PrimeFaces" [69], which includes a severe vulnerability (CVE-2017-1000486): PrimeFaces 5.x is vulnerable to a weak encryption flaw resulting in remote code execution. For more analysis results, please refer to our website [26].
C. Lessons Learned
Based on our analysis, we found that many apps include vulnerable TPLs, leading to privacy leakage and financial loss. However, developers seem unaware of the security risks of TPLs. We explore the reasons from the following points:
For TPL developers, according to our results in § V-A, the reuse rate of vulnerable TPLs is pretty high. Many TPL developers develop their own TPLs based on existing ones, especially popular ones, but seldom seem to check the reused components for known vulnerabilities. Even worse, we find 210,727 TPLs use vulnerable TPL versions, indicating that many TPL developers may not track the vulnerability fixes in these open-source products. Although some TPL developers have patched the vulnerabilities in later versions, many affected apps still use the old vulnerable versions, which indirectly expands the threat of the vulnerabilities in TPLs. The lack of centralized control of these open-source TPLs also poses attack surfaces for hackers.

For app developers, we reported some TPL versions with severe vulnerabilities to the corresponding app developers via email. We wrote 50 emails to these app developers or companies and received 5 replies within 2 months. Based on their feedback, we find that: 1) Most developers only care about the functionalities provided by the TPLs and are unaware of the security problems in them. This is understandable, since analyzing every library before using it would eliminate the convenience of using these components. However, based on our analysis, some commonly-used TPLs contain severe vulnerabilities; we suggest that app developers be aware of vulnerabilities in TPLs, and ATVHunter could help them detect vulnerable TPL versions. 2) Some app developers or companies do not know how to conduct security assessments of the imported TPLs. They also hope "our team can help them conduct the security assessment of the used TPLs or tell them the specific analysis processes." 3) Some app developers did not know that some vulnerable TPLs had been updated or patched and still used the old TPL versions. Even when they noticed the upgraded versions, some were reluctant to change due to the extra cost. They said that "If a TPL adds many new functions, they have to spend much time understanding these new features and change too much of their own code. Thus, they prefer to keep old TPL-Vs."
For app markets, we found that many app markets do not have a security assessment mechanism to warn developers about the potential security risks in their apps. As far as we know, only Google provides such a service, the App Security Improvement (ASI) program, which gives tips to help app developers on Google Play improve the security of their apps. Previous research [6] reported that vulnerabilities listed in the ASI program could draw more attention from developers. However, the vulnerabilities reported by the ASI program are limited due to the lack of a comprehensive vulnerability database and of a vulnerable TPL detection tool like ATVHunter.

VI. DISCUSSION
Limitations. (1) If the Java code of several versions is identical, ATVHunter provides several candidates instead of a specific one, leading to some false positives. (2) ATVHunter may eliminate some TPLs by mistakenly regarding them as part of the primary module if such TPLs are imported into the package structure of the host app, causing some false negatives. (3) We only focus on Java libraries and do not consider native libraries. In fact, native libraries are also an essential part of Android apps, and vulnerabilities inside them could cause more severe consequences; detecting vulnerable native libraries is left for future work. (4) ATVHunter adopts static analysis to find the TPLs; therefore, we may miss libraries that are loaded dynamically. Besides, some TPLs have dynamic behaviors, such as reflection and dynamic class loading; our approach may miss such dynamic features, which affects our detection performance. (5) We crawled about 3 million TPLs from Maven to build our feature database. Although this database is large and comprehensive, which guarantees the detection rate of ATVHunter, our method still has limitations: third-party libraries are constantly updated, and ATVHunter cannot find newly emerging TPLs. How to find these newly emerging TPLs and dynamically maintain our database is left for future work.
Threats to Validity. (1) The first threat comes from the similarity threshold: it inevitably induces some false negatives and false positives for some apps due to minor differences between TPLs. To minimize this threat, we selected the similarity threshold through a careful experimental design. (2) Another threat comes from analyzing only free apps. We believe it is meaningful to study the vulnerable TPLs used by both free and paid apps, which is left for future work.

VII. CONCLUSION
In this paper, we proposed ATVHunter, a TPL detection system that can precisely pinpoint TPL versions and find the vulnerable TPLs used by apps. Evaluation results show that ATVHunter can effectively and efficiently find in-app TPLs and is resilient to state-of-the-art obfuscation techniques. Meanwhile, we constructed a comprehensive and large vulnerable TPL version database containing 224 security bugs and 1,180 CVEs. ATVHunter can find the vulnerable TPLs in apps and reveals the threat of vulnerable TPLs, which can help improve the quality of apps and has a profound impact on the Android ecosystem.

VIII. ACKNOWLEDGMENT
We thank the anonymous reviewers for their helpful comments. This work is partly supported by the National Research Foundation, Prime Minister's Office, Singapore under its National Cybersecurity R&D Program (Award No. NRF2018NCR-NCR005-0001), the Singapore National Research Foundation under NCR Award Numbers NRF2018NCR-NSOE003-0001 and NRF2018NCR-NSOE004-0001, NRF Investigatorship NRFI06-2020-0022, the Hong Kong PhD Fellowship Scheme, and Hong Kong RGC Projects (No. 152223/17E, 152239/18E, CityU C1008-16G).