FairPlay: Fraud and Malware Detection in Google Play
Mahmudur Rahman, Mizanur Rahman, Bogdan Carbunar, Duen Horng Chau
Mahmudur Rahman, Florida Int'l Univ., mrahm004@fiu.edu
Mizanur Rahman, Florida Int'l Univ., mrahm031@fiu.edu
Bogdan Carbunar, Florida Int'l Univ., [email protected]
Duen Horng Chau, Georgia Tech, [email protected]
Abstract
Fraudulent behaviors in Google's Android app market fuel search rank abuse and malware proliferation. We present FairPlay, a novel system that uncovers both malware and search rank fraud apps, by picking out trails that fraudsters leave behind. To identify suspicious apps, FairPlay's PCF algorithm correlates review activities and uniquely combines detected review relations with linguistic and behavioral signals gleaned from longitudinal Google Play app data. We contribute a new longitudinal app dataset to the community, which consists of over 87K apps, 2.9M reviews, and 2.4M reviewers, collected over half a year. FairPlay achieves over 95% accuracy in classifying gold standard datasets of malware, fraudulent and legitimate apps. We show that 75% of the identified malware apps engage in search rank fraud. FairPlay discovers hundreds of fraudulent apps that currently evade Google Bouncer's detection technology, and reveals a new type of attack campaign, where users are harassed into writing positive reviews, and install and review other apps.
The commercial success of Android app markets such as Google Play [1] has made them a lucrative medium for committing fraud and malice. Some fraudulent developers deceptively boost the search ranks and popularity of their apps (e.g., through fake reviews and bogus installation counts) [2], while malicious developers use app markets as a launch pad for their malware [3, 4, 5, 6]. Existing mobile malware detection solutions have limitations. For instance, while Google Play uses the Bouncer system [7] to remove malware, of the 7,756 apps that we analyzed with anti-virus tools, 523 were flagged by at least 3 tools, yet none had been filtered by Bouncer (see the malware dataset below).

Contributions and Results. We propose FairPlay, a system that leverages the above observations to efficiently detect Google Play fraud and malware (see Figure 1). Our major contributions are:

• A unified relational, linguistic and behavioral approach. We formulate the notion of co-review graphs to model reviewing relations between users. We develop PCF, an efficient algorithm to identify temporally constrained, co-review pseudo cliques, formed by reviewers with substantially overlapping co-reviewing activities across short time windows. We use linguistic and behavioral information to (i) detect genuine reviews from which we then (ii) extract user-identified fraud and malware indicators. In addition, we detect apps with (i) permission request ramps, (ii) "unbalanced" review, rating and install counts, and (iii) suspicious review spikes. We generate 28 features, and use them to train supervised learning algorithms.

• Novel longitudinal and gold standard datasets. We contributed a longitudinal dataset of 87,223 freshly posted Google Play apps (along with their 2.9M reviews, from 2.3M reviewers) collected between October 2014 and May 2015. We have leveraged search rank fraud expert contacts in Freelancer [16], anti-virus tools and manual verifications to collect gold standard datasets of hundreds of fraudulent, malware and benign apps. We will publish these datasets alongside this work.

• High Accuracy. FairPlay achieves over 97% accuracy in classifying fraudulent and benign apps, and over 95% accuracy in classifying malware and benign apps. FairPlay significantly outperforms the malware indicators of Sarma et al. [12]. Furthermore, we show that malware often engages in search rank fraud as well: when trained on fraudulent and benign apps, FairPlay flagged as fraudulent more than 75% of the gold standard malware apps.

• Real-world Impact: Uncover Fraud & Attacks. FairPlay discovers hundreds of fraudulent apps that currently evade Google Bouncer's detection technology. We show that these apps are indeed suspicious: the reviewers of 93.3% of them form at least 1 pseudo clique, and 55% of these apps have at least 33% of their reviewers involved in a pseudo clique. In addition, FairPlay enabled us to discover a novel, coercive campaign attack type, where app users are harassed into writing a positive review for the app, and install and review other apps.

Background. We focus on the Android app market ecosystem of Google Play. The participants, consisting of users and developers, have Google accounts. Developers create and upload apps, which consist of executables (i.e., "apks"), a set of required permissions, and a description. The app market publishes this information, along with the app's received reviews (1-5 stars rating & text), ratings (1-5 stars, no text), aggregate rating (over both reviews and ratings), install count range (predefined buckets, e.g., 50-100, 100-500), size, version number, price, time of last update, and a list of "similar" apps.

Adversarial model. We consider not only malicious developers, who upload malware, but also rational fraudulent developers. Fraudulent developers attempt to tamper with the search rank of their apps. While Google keeps secret the criteria used to rank apps, the reviews, ratings and install counts are known to play a fundamental part (see, e.g., [17]). Fraudulent developers often rely on crowdsourcing sites [16, 18, 19] to hire teams of workers to commit fraud collectively.

To review or rate an app, a user needs to have a Google account, register a mobile device with that account, and install the app on the device. This process complicates the job of fraudsters, who are thus more likely to reuse accounts across review writing jobs.
Related Work. Burguera et al. [9] used crowdsourcing to collect system call traces from real users, then used a "partitional" clustering algorithm to classify benign and malicious apps. Shabtai et al. [10] extracted features from monitored apps (e.g., CPU consumption, packets sent, running processes) and used machine learning to identify malicious apps. Grace et al. [11] used static analysis to efficiently identify high and medium risk apps.

Previous work has also used app permissions to pinpoint malware [12, 13, 14]. Sarma et al. [12] use risk signals extracted from app permissions, e.g., rare critical permissions (RCP) and rare pairs of critical permissions (RPCP), to train SVM and inform users of the risks vs. benefits tradeoffs of apps.

Research on Graph Based Opinion Spam Detection. Graph based approaches have been proposed to tackle opinion spam [20, 21]. Ye and Akoglu [20] quantify the chance of a product to be a spam campaign target, then cluster spammers on a 2-hop subgraph induced by the products with the highest chance values. Akoglu et al. [21] frame fraud detection as a signed network classification problem and classify users and products, which form a bipartite network, using a propagation-based algorithm.

FairPlay's relational approach differs as it identifies apps reviewed in a contiguous time interval, by groups of users with a history of reviewing apps in common. FairPlay combines the results of this approach with behavioral and linguistic clues, extracted from longitudinal app data, to detect both search rank fraud and malware apps. We emphasize that search rank fraud goes beyond opinion spam, as it implies fabricating not only reviews, but also user app install events and ratings.
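To make the permission-based baseline concrete, the following is a hypothetical sketch of a "rare critical permission" (RCP) style signal in the spirit of Sarma et al. [12]: flag an app's critical permissions that are requested by fewer than some threshold percentage of a reference corpus. The function names, the critical-permission set, and the threshold are our illustrative assumptions, not Sarma et al.'s exact formulation.

```python
from collections import Counter

# Illustrative stand-in for a critical-permission set (assumption, not the
# authoritative Android list).
CRITICAL = {"READ_PHONE_STATE", "WRITE_EXTERNAL_STORAGE", "GET_ACCOUNTS"}

def rare_critical_permissions(app_perms, corpus_perms, rare_pct=5.0):
    """Return the critical permissions of `app_perms` requested by fewer
    than `rare_pct` percent of the apps in `corpus_perms` (a list of sets)."""
    n = len(corpus_perms)
    counts = Counter(p for perms in corpus_perms for p in perms)
    return {p for p in app_perms
            if p in CRITICAL and 100.0 * counts[p] / n < rare_pct}

# Toy corpus: GET_ACCOUNTS appears in exactly 5% of apps, so it is not rare.
corpus = [{"INTERNET"}] * 95 + [{"INTERNET", "GET_ACCOUNTS"}] * 5
app = {"INTERNET", "GET_ACCOUNTS", "READ_PHONE_STATE"}
print(rare_critical_permissions(app, corpus))
```

In a real deployment the corpus would be a large set of benign apps, and the resulting indicator set would feed an SVM, as the baseline does.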
We have collected longitudinal data from 87K+ newly released apps over more than 6 months, and identified gold standard app market behaviors. In the following, we briefly describe the tools we developed, then detail the data collection effort and the resulting datasets.
Data collection tools. We have developed the Google Play Crawler (GPCrawler) tool, to automatically collect data published by Google Play for apps, users and reviews. Google Play shows only 20 apps on a user page by default. GPCrawler overcomes this limitation by using a Firefox add-on and a Python script. The add-on interacts with Google Play to extend the user page with a "scroll down" button and enables the script to automatically navigate and collect all the information from the user page.

We have also developed the Google Play App Downloader (GPad), a Java tool to automatically download apks of free apps on a PC, using the open-source Android Market API [22]. GPad scans each app apk using VirusTotal [8], an online malware detector provider, to find out the number of anti-malware tools (out of 57: AVG, McAfee, Symantec, Kaspersky, Malwarebytes, F-Secure, etc.) that identify the apk as suspicious. We used 4 servers (PowerEdge R620, Intel Xeon E-26XX v2 CPUs) to collect our datasets, which we describe next.
Longitudinal app dataset. In order to detect app attribute changes that occur early in the lifetime of apps, we used the "New Releases" link to identify apps with a short history on Google Play. We approximate the first upload date of an app using the day of its first review. We started collecting new releases in July 2014 and by October 2014 we had a set of 87,223 apps, whose first upload time was under 40 days prior to our first collection time, when they had at most 100 reviews.

We have collected longitudinal data from these 87,223 apps between October 24, 2014 and May 5, 2015. Specifically, for each app we captured "snapshots" of its Google Play metadata, twice a week. An app snapshot consists of values for all its time varying variables, e.g., the reviews, the rating and install counts, and the set of requested permissions. For each of the nearly 2.9M reviews we have collected from the 87,223 apps, we recorded the reviewer's name and id (2.4M unique ids), date of review, review title, text, and rating.

Malware apps. We used GPad to collect the apks of 7,756 randomly selected apps from the longitudinal set. Figure 3a shows the number of anti-virus tools, through VirusTotal [8], that flagged these apks as suspicious. None of these apps had been filtered by Bouncer [7]! From the 523 apps that were flagged by at least 3 tools, we selected those that had at least 10 reviews, to form our "malware app" dataset, for a total of 212 apps.
Fraudulent apps. We used contacts established among Freelancer [16]'s search rank fraud community, to obtain the identities of 15 Google Play accounts that were used to write fraudulent reviews. We call these "seed fraud accounts". These accounts were used to review 201 unique apps. We call these the "seed fraud apps", and we use them to evaluate FairPlay.
Fraudulent reviews. We have collected all the 53,625 reviews received by the 201 seed fraud apps. The 15 seed fraud accounts were responsible for 1,969 of these reviews. We used the 53,625 reviews to identify 188 accounts, such that each account was used to review at least 10 of the 201 seed fraud apps (for a total of 6,488 reviews). We call these guilt by association (GbA) accounts. To reduce feature duplication, we have used the 1,969 fraudulent reviews written by the 15 seed accounts and the 6,488 fraudulent reviews written by the 188 GbA accounts for the 201 seed fraud apps, to extract a balanced set of fraudulent reviews. Specifically, we drew from this set of 8,457 (= 1,969 + 6,488) fraudulent reviews.

Benign apps. We have selected 925 candidate apps from the longitudinal app set, that have been developed by Google designated "top developers". We have used GPad to filter out those flagged by VirusTotal. We have manually investigated 601 of the remaining apps, and selected a set of 200 apps that (i) have more than 10 reviews and (ii) were developed by reputable media outlets (e.g., NBC, PBS) or have an associated business model (e.g., fitness trackers).

Table 1: FairPlay's most important features, organized by their extracting module.

  Notation             Definition
  CoReG Module:
    nCliques           number of pseudo cliques with ρ ≥ θ
    stats(ρ)           pseudo clique density: max, median, SD
    stats(cliqueSize)  pseudo clique size: max, median, SD
    inCliqueSize       % of nodes involved in pseudo cliques
  RF Module:
    malW               % of reviews with malware indicator words
    fraudW, goodW      % of reviews with fraud/benign indicator words
    FRI                fraud review impact on app rating
  IRR Module:
    stats(spikes)      days with spikes & spike amplitude
    I1/Rt1, I2/Rt2     install to rating ratios
    I1/Rv1, I2/Rv2     install to review ratios
  JH Module:
    permCt, dangerCt   total and dangerous permission counts
    rampCt, dangerRamp dangerous permission ramps; dangerous permissions added over ramps
Genuine reviews. We have manually collected a gold standard set of 315 genuine reviews, as follows. First, we have collected the reviews written for apps installed on the Android smartphones of the authors. We then used Google's text and reverse image search tools to identify and filter those that plagiarized other reviews or were written from accounts with generic photos. We have then manually selected reviews that mirror the authors' experience, have at least 150 characters, and are informative (e.g., provide information about bugs, crash scenario, version update impact, recent changes).
FairPlay organizes the analysis of longitudinal app data into the following 4 modules, illustrated in Figure 1. The Co-Review Graph (CoReG) module identifies apps reviewed in a contiguous time window by groups of users with significantly overlapping review histories. The Review Feedback (RF) module exploits feedback left by genuine reviewers, while the Inter Review Relation (IRR) module leverages relations between reviews, ratings and install counts. The Jekyll-Hyde (JH) module monitors app permissions, with a focus on dangerous ones, to identify apps that convert from benign to malware. Each module produces several features that are used to train an app classifier. FairPlay also uses general features such as the app's average rating, total number of reviews, ratings and installs, for a total of 28 features. Table 1 summarizes the most important features. In the following, we detail each module and the features it extracts.

Figure 2: Example pseudo cliques and PCF output. Nodes are users and edge weights denote the number of apps reviewed in common by the edge's end users. Review timestamps have a 1-day granularity. (a) The entire co-review graph, detected as pseudo clique by PCF when θ is 6. When θ is 7, PCF detects the subgraphs of (b) the first two days and (c) the last two days.

The CoReG module. Let the co-review graph of an app, see Figure 2, be a graph where nodes correspond to users who reviewed the app, and undirected edges have a weight that indicates the number of apps reviewed in common by the edge's endpoint users. We seek to identify cliques in the co-review graph; Figure 5a shows the co-review clique of one of the seed fraud apps. To capture near-cliques as well, we define the weighted density of a graph as ρ = Σ_{e∈E} w(e) / C(n, 2), where E denotes the graph's edges, w(e) the weight of edge e, and n its number of nodes (reviews).
We are interested then in subgraphs of the co-review graph whose weighted density exceeds a threshold value θ.

We present the Pseudo Clique Finder (PCF) algorithm (see Algorithm 1), which takes as input the set of the reviews of an app, organized by days, and a threshold value θ. PCF outputs a set of identified pseudo cliques with ρ ≥ θ that were formed during contiguous time frames. In Section 5.3 we discuss the choice of θ. For each day when the app has received a review (line 1), PCF finds the day's most promising pseudo clique (line 3), then tries to extend it with the most promising reviews of each subsequent day, for as long as it keeps growing (lines 5-8).

Algorithm 1 PCF algorithm pseudo-code.

Input: days, an array of daily reviews, and θ, the weighted threshold density
Output: allCliques, set of all detected pseudo cliques

 1: for d := 0; d < days.size(); d++
 2:   Graph PC := new Graph();
 3:   bestNearClique(PC, days[d]);
 4:   c := 1; n := PC.size();
 5:   for nd := d+1; nd < days.size() & c = 1; nd++
 6:     bestNearClique(PC, days[nd]);
 7:     c := (PC.size() > n); n := PC.size();
 8:   endfor
 9:   if (PC.size() > 1)
10:     allCliques := allCliques.add(PC);
11:   fi
12: endfor
13: return allCliques

15: function bestNearClique(Graph PC, Set revs)
16:   if (PC.size() = 0)
17:     maxRho := 0;
18:     for root := 0; root < revs.size(); root++
19:       Graph candClique := new Graph();
20:       candClique.addNode(root.getUser());
21:       do
22:         candNode := getMaxDensityGain(revs);
23:         if (density(candClique ∪ {candNode}) ≥ θ)
24:           candClique.addNode(candNode);
25:         fi
26:       while (candNode != null);
27:       if (candClique.density() > maxRho)
28:         maxRho := candClique.density();
29:         PC := candClique;
30:       fi
31:     endfor
32:   else
33:     do
34:       candNode := getMaxDensityGain(revs);
35:       if (density(PC ∪ {candNode}) ≥ θ)
36:         PC.addNode(candNode);
37:       fi
38:     while (candNode != null);
39:   fi

With an empty PC, bestNearClique tries each review of the day as a root (lines 18-31). The function getMaxDensityGain (not depicted in Algorithm 1) picks the review not yet in the work-in-progress pseudo clique whose writer has written the most apps in common with reviewers already in the pseudo clique; the candidate is retained only if the density of the resulting clique equals or exceeds θ (lines 23 and 35). Figure 2 illustrates the output of PCF for several θ values.

If d is the number of days over which the app has received reviews and r is the maximum number of reviews received in a day, PCF's complexity is O(dr(r + d)).

CoReG features. CoReG extracts the following features from the output of PCF (see Table 1): (i) the number of cliques whose density equals or exceeds θ, (ii) the maximum, median and standard deviation of the densities of identified pseudo cliques, (iii) the maximum, median and standard deviation of the node count of identified pseudo cliques, normalized by n (the app's review count), and (iv) the total number of nodes of the co-review graph that belong to at least one pseudo clique, normalized by n.
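The pseudocode above can be rendered as a compact, runnable Python sketch. The names mirror Algorithm 1, but the greedy getMaxDensityGain step is folded into best_near_clique, and edge weights come from a caller-supplied function built over the reviewers' co-review histories; both simplifications are ours.

```python
from itertools import combinations

def density(nodes, weight):
    """Weighted density ρ of the subgraph induced by `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    total = sum(weight(u, v) for u, v in combinations(nodes, 2))
    return total / (n * (n - 1) / 2)

def best_near_clique(pc, revs, weight, theta):
    """Grow `pc` (a set of users) with reviewers from `revs`, keeping the
    density >= theta; with an empty pc, try every root and keep the densest."""
    def grow(clique):
        candidates = set(revs) - clique
        while candidates:
            # candidate with the maximum density gain
            best = max(candidates, key=lambda c: density(clique | {c}, weight))
            if density(clique | {best}, weight) < theta:
                break
            clique = clique | {best}
            candidates.discard(best)
        return clique

    if pc:
        pc |= grow(set(pc))
    else:
        best_c = set()
        for root in revs:
            cand = grow({root})
            if density(cand, weight) > density(best_c, weight):
                best_c = cand
        pc |= best_c

def pcf(days, weight, theta):
    """days: list of daily reviewer lists; returns the detected pseudo cliques."""
    all_cliques = []
    for d in range(len(days)):
        pc = set()
        best_near_clique(pc, days[d], weight, theta)
        growing, n, nd = True, len(pc), d + 1
        while nd < len(days) and growing:
            best_near_clique(pc, days[nd], weight, theta)
            growing, n = len(pc) > n, len(pc)
            nd += 1
        if len(pc) > 1:
            all_cliques.append(pc)
    return all_cliques

# Toy example: u1, u2, u3 reviewed 3 apps in common; x is unrelated.
histories = {"u1": {"a", "b", "c"}, "u2": {"a", "b", "c"},
             "u3": {"a", "b", "c"}, "x": {"z"}}
w = lambda u, v: len(histories[u] & histories[v])
print(pcf([["u1", "u2", "x"], ["u3"]], w, theta=3))
```

With θ = 3, the day-0 clique {u1, u2} is grown with u3 on day 1, while the unrelated reviewer x is never added.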
The RF module. Reviews written by genuine users of malware and fraudulent apps may describe negative experiences. The RF module exploits this observation through a two step approach: (i) detect and filter out fraudulent reviews, then (ii) identify malware and fraud indicative feedback from the remaining reviews.
Step RF.1: Fraudulent review filter. We posit that users who have higher expertise on the apps they review, have written fewer reviews for apps developed by the same developer, and have reviewed more paid apps, are more likely to be genuine. We exploit this conjecture to use supervised learning algorithms trained on the following features, defined for a review R written by user U for an app A:

• Reviewer based features. The expertise of U for app A, defined as the number of reviews U wrote for apps that are "similar" to A, as listed by Google Play. The bias of U towards A: the number of reviews written by U for other apps developed by A's developer. In addition, we extract the total money paid by U on apps it has reviewed, the number of apps that U has liked, and the number of Google+ followers of U.

• Text based features. We used the NLTK library [24] and the Naive Bayes classifier, trained on two datasets: (i) 1,041 sentences extracted from randomly selected 350 positive and 410 negative Google Play reviews, and (ii) 10,663 sentences extracted from 700 positive and 700 negative IMDB movie reviews [25]. 10-fold cross validation of the Naive Bayes classifier over these datasets reveals a FNR of 16.1% and a FPR of about 19%. We used the trained classifier to identify the statements of R that encode positive and negative sentiments, and extracted the following features: (i) the percentage of statements in R that encode positive and negative sentiments respectively, and (ii) the rating of R and its percentile among the reviews written by U.

Step RF.2: Reviewer feedback extraction. We conjecture that (i) since no app is perfect, a "balanced" review that contains both app positive and negative sentiments is more likely to be genuine, and (ii) there should exist a relation between the review's dominating sentiment and its rating. Thus, after filtering out fraudulent reviews, we extract feedback from the remaining reviews. For this, we have used NLTK to extract 5,106 verbs, 7,260 nouns and 13,128 adjectives from the 97,071 reviews we collected from the 613 gold standard apps, and used them to identify word lists indicative of malware, fraud and benign behaviors. The malware indicator word list contains 31 words (e.g., risk, hack, corrupt, spam, malware, fake, fraud, blacklist, ads). The fraud indicator word list contains 112 words (e.g., cheat, hideous, complain, wasted, crash) and the benign indicator word list contains 105 words.

Figure 3: (a) Apks detected as suspicious (y axis) by multiple anti-virus tools (x axis), through VirusTotal [8], from a set of 7,756 downloaded apks. (b) Distribution of the number of "dangerous" permissions requested by malware, fraudulent and benign apps. (c) Dangerous permission ramp during version updates for a sample app, "com.battery.plusfree". Originally (Nov 8, 2014) the app requested no dangerous permissions; updates added a Google Play license check (Nov 21, 2014), then "read phone status & identity", "modify & delete USB storage contents" and "test access to protected storage" (Dec 25, 2014), and finally "find accounts on the device" and "use accounts on the device" permissions (Jan 17, 2015).
RF features. We extract 3 features (see Table 1), denoting the percentage of genuine reviews that contain malware, fraud, and benign indicator words respectively. We also extract the impact of detected fraudulent reviews on the overall rating of the app: the absolute difference between the app's average rating and its average rating when ignoring all the fraudulent reviews.
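The RF features above can be sketched as follows. This is a hypothetical illustration: the function name is ours, and the tiny word sets stand in for the 31/112/105-word indicator lists described in Step RF.2.

```python
# Toy stand-ins for the malware/fraud/benign indicator word lists (assumption).
MALWARE_W = {"malware", "virus", "hack"}
FRAUD_W = {"cheat", "crash", "wasted"}
BENIGN_W = {"useful", "smooth"}

def rf_features(genuine, fraudulent):
    """genuine/fraudulent: lists of (text, rating) pairs for one app.
    Returns (% malware-word reviews, % fraud-word reviews, % benign-word
    reviews, fraud review impact on the average rating)."""
    def pct(words):
        hits = sum(1 for text, _ in genuine
                   if words & set(text.lower().split()))
        return 100.0 * hits / len(genuine)

    all_ratings = [r for _, r in genuine + fraudulent]
    avg_all = sum(all_ratings) / len(all_ratings)
    avg_genuine = sum(r for _, r in genuine) / len(genuine)
    fri = abs(avg_all - avg_genuine)  # fraud review impact (FRI)
    return pct(MALWARE_W), pct(FRAUD_W), pct(BENIGN_W), fri

genuine = [("this app is malware it keeps crashing", 1),
           ("useful and smooth", 4)]
fraudulent = [("best app ever", 5), ("amazing", 5)]
print(rf_features(genuine, fraudulent))
```

Note that the exact-word match misses inflections ("crashing" vs. "crash"); a real implementation would presumably stem or lemmatize, as NLTK supports.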
The IRR module. This module leverages temporal relations between reviews, as well as relations between the review, rating and install counts of apps, to identify suspicious behaviors.
Temporal relations. We detect outliers in the number of daily reviews received by an app. We identify days with spikes of positive reviews as those whose number of positive reviews exceeds the upper outer fence of the box-and-whisker plot built over the app's numbers of daily positive reviews.
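The spike detector above can be sketched in a few lines, assuming the standard Tukey box-and-whisker convention where the upper outer fence is Q3 + 3·IQR. Quartile conventions vary; this sketch uses simple linear interpolation, and the function names are ours.

```python
def quartiles(xs):
    """First and third quartile via linear interpolation over sorted data."""
    s = sorted(xs)
    def q(p):
        i = p * (len(s) - 1)
        lo, hi = int(i), min(int(i) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (i - lo)
    return q(0.25), q(0.75)

def spike_days(daily_pos_reviews):
    """Return (day_index, count) pairs exceeding the upper outer fence."""
    q1, q3 = quartiles(daily_pos_reviews)
    fence = q3 + 3 * (q3 - q1)  # upper outer fence of the box plot
    return [(d, c) for d, c in enumerate(daily_pos_reviews) if c > fence]

daily = [2, 3, 2, 4, 3, 2, 40, 3]
print(spike_days(daily))  # day 6, with 40 positive reviews, is a spike
```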
Reviews, ratings and install counts. We used the Pearson's χ² test to investigate relationships between the install and rating counts of the 87K new apps, at the end of the collection interval. We grouped the rating count in buckets of the same size as Google Play's install count buckets. Figure 4 shows the mosaic plot of the relationships between rating and install counts; the test confirms a dependency between the two.

Figure 4: Mosaic plot of install vs. rating count relations of the 87K apps. Larger rectangles signify that more apps have the corresponding rating and install count range; dotted lines mean no apps in a certain install/rating category. The standardized residuals identify the cells that contribute the most to the χ² test. The most significant rating:install ratio is 1:100.

IRR features. We extract temporal features (see Table 1): the number of days with detected spikes and the maximum amplitude of a spike. We also extract (i) the ratio of installs to ratings as two features, I1/Rt1 and I2/Rt2, and (ii) the ratio of installs to reviews, as I1/Rv1 and I2/Rv2. Here, (I1, I2] denotes the install count interval of an app, (Rt1, Rt2] its rating interval and (Rv1, Rv2] its (genuine) review interval.

The JH module. Android's API level 22 labels 47 permissions as "dangerous". Figure 3b compares the distributions of the number of dangerous permissions requested by the gold standard malware, fraudulent and benign apps. The most popular dangerous permissions among these apps are "modify or delete the contents of the USB storage", "read phone status and identity", "find accounts on the device", and "access precise location". Most benign apps request at most 5 such permissions; some malware and fraudulent apps request more than 10.

Table 2: Review classification results (10-fold cross-validation) of gold standard fraudulent (positive) and genuine (negative) reviews, for DT (Decision Tree), MLP (Multi-layer Perceptron) and RF (Random Forest) classifiers. MLP achieves the lowest false positive rate (FPR), of 1.47%.

Jekyll-Hyde apps. Figure 3c shows the dangerous permissions added during different version updates of one gold standard malware app.
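The install vs. rating dependency check described above can be sketched with a plain Pearson χ² computation over a contingency table; the standardized (Pearson) residual of each cell, (O − E)/√E, identifies the cells contributing most, as in the mosaic plot of Figure 4. The table below uses toy counts, not the paper's 87K-app data.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic and per-cell standardized residuals
    for a 2D contingency table (list of rows of observed counts)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    chi2, resid = 0.0, []
    for i, row in enumerate(table):
        rr = []
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total   # expected count under independence
            r = (obs - exp) / math.sqrt(exp)  # standardized residual
            chi2 += r * r
            rr.append(round(r, 2))
        resid.append(rr)
    return round(chi2, 2), resid

# rows: two install-count buckets; columns: two rating-count buckets (toy data)
table = [[30, 5],
         [10, 55]]
print(chi_squared(table))
```

Large residual magnitudes (here, the diagonal cells) mark the install/rating combinations that drive the dependency.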
JH features. We extract the following features (see Table 1): (i) the total number of permissions requested by the app, (ii) its number of dangerous permissions, (iii) the app's number of dangerous permission ramps, and (iv) its total number of dangerous permissions added over all the ramps.
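The JH features can be sketched as follows, assuming app snapshots are chronological lists of requested-permission sets and that a "ramp" is any snapshot-to-snapshot transition that adds at least one dangerous permission. The function name and the small dangerous-permission set are illustrative assumptions.

```python
# Stand-in subset of Android's "dangerous" permissions (assumption).
DANGEROUS = {"READ_PHONE_STATE", "WRITE_EXTERNAL_STORAGE",
             "GET_ACCOUNTS", "ACCESS_FINE_LOCATION"}

def jh_features(snapshots):
    """snapshots: chronological list of permission sets from app updates.
    Returns (permCt, dangerCt, rampCt, dangerRamp), cf. Table 1."""
    last = snapshots[-1]
    ramp_ct, danger_added = 0, 0
    for prev, cur in zip(snapshots, snapshots[1:]):
        added = (cur - prev) & DANGEROUS
        if added:                      # this update is a dangerous ramp
            ramp_ct += 1
            danger_added += len(added)
    return (len(last), len(last & DANGEROUS), ramp_ct, danger_added)

# Toy trajectory echoing Figure 3c: dangerous permissions appear over updates.
snapshots = [
    {"INTERNET"},
    {"INTERNET", "READ_PHONE_STATE", "WRITE_EXTERNAL_STORAGE"},
    {"INTERNET", "READ_PHONE_STATE", "WRITE_EXTERNAL_STORAGE", "GET_ACCOUNTS"},
]
print(jh_features(snapshots))
```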
We have implemented FairPlay using Python to extract data from parsed pages and compute the features, and the R tool to classify reviews and apps. We have set the threshold density value θ to 3, to detect even the smaller pseudo cliques. We have used the Weka data mining suite [26] to perform the experiments, with default settings. We experimented with multiple supervised learning algorithms; due to space constraints, we report results for the best performers: MultiLayer Perceptron (MLP) [27], Decision Trees (DT) (C4.5) and Random Forest (RF) [28], using 10-fold cross-validation [29]. We use the term "positive" to denote a fraudulent review, or a fraudulent or malware app; FPR means false positive rate. Similarly, "negative" denotes a genuine review or benign app; FNR means false negative rate.

Review classification. To evaluate the accuracy of FairPlay's fraudulent review detection component (RF module), we used the gold standard datasets of fraudulent and genuine reviews, together with the other reviews written by their authors: the reviews written by the fraudulent reviewers (covering 2,284 apps) and by the 315 genuine reviewers (9,468 reviews for over 7,000 apps). Table 2 shows the results of the 10-fold cross validation of algorithms classifying reviews as genuine or fraudulent. To minimize wrongful accusations, we seek to minimize the FPR [30]. MLP simultaneously achieves the highest accuracy of 96.26% and the lowest FPR of 1.47% (at 6.67% FNR). Thus, in the following experiments, we use MLP to filter out fraudulent reviews in the RF.1 step.

Table 3: FairPlay classification results (10-fold cross validation) of gold standard fraudulent (positive) and benign apps. RF has the lowest FPR, thus desirable [30].

  Strategy       FPR %   FNR %   Accuracy %
  FairPlay/DT    3.01    3.01    96.98
  FairPlay/MLP   1.51    3.01    97.74
  FairPlay/RF

Table 4: FairPlay classification results (10-fold cross validation) of gold standard malware (positive) and benign apps, significantly outperforming Sarma et al. [12]. FairPlay's RF achieves 96.11% accuracy at 1.51% FPR.

  Strategy                FPR %   FNR %   Accuracy %
  FairPlay/DT             4.02    4.25    95.86
  FairPlay/MLP            4.52    4.72    95.37
  FairPlay/RF
  Sarma et al. [12]/SVM   65.32   24.47   55.23
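The evaluation terms used throughout these experiments reduce to simple confusion-matrix arithmetic: with fraudulent/malware items as the "positive" class, FPR = FP/(FP + TN), FNR = FN/(FN + TP), and accuracy is the fraction of correct predictions. A small helper (ours, for illustration):

```python
def metrics(tp, fp, tn, fn):
    """Return (FPR %, FNR %, accuracy %) from confusion-matrix counts."""
    fpr = 100.0 * fp / (fp + tn)
    fnr = 100.0 * fn / (fn + tp)
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    return round(fpr, 2), round(fnr, 2), round(acc, 2)

# e.g., 100 positives and 100 negatives, with 3 errors of each kind:
print(metrics(tp=97, fp=3, tn=97, fn=3))  # (3.0, 3.0, 97.0)
```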
To evaluate FairPlay, we have collected all the 97,071 reviews of the 613 gold standard apps and the users who wrote them, as well as the 890,139 apps rated or played by these users.
Fraud Detection Accuracy. Table 3 shows 10-fold cross validation results of FairPlay on the gold standard fraudulent and benign apps: FairPlay achieves up to 97.74% accuracy, with an FPR as low as 1.51%.

Malware Detection Accuracy. We have used Sarma et al. [12]'s solution as a baseline to evaluate the ability of FairPlay to accurately detect malware. We computed Sarma et al. [12]'s RCP and RPCP indicators and used them to train an SVM on the gold standard malware and benign apps. Table 4 shows that FairPlay significantly outperforms this baseline: FairPlay's RF achieves 96.11% accuracy at 1.51% FPR, versus 55.23% accuracy for Sarma et al.'s SVM. Thus, FairPlay's features, designed to detect search rank fraud, can also accurately identify malware.

Figure 5: (a) Clique flagged by PCF for "Tiempo - Clima gratis", one of the 201 seed fraud apps. (b) Distribution of the per app number of discovered pseudo cliques, over the 372 apps flagged among the 1,600 investigated: 93.3% of the 372 apps have at least 1 pseudo clique of θ ≥ 3. (c) Percentage of app reviewers (nodes) that belong to the largest clique, or to any pseudo clique.
Is Malware Involved in Fraud? We conjectured that the above result is due in part to malware apps being involved in search rank fraud. To verify this, we have trained FairPlay on the gold standard benign and fraudulent app datasets, then we have tested it on the gold standard malware dataset. MLP is the most conservative algorithm, discovering 60.85% of malware as fraud participants. Random Forest discovers more than 72%, and the best performing algorithm flags 75.94% of the malware as fraudulent. This result confirms our conjecture and shows that search rank fraud detection can be an important addition to mobile malware detection efforts.
We have also evaluated FairPlay on non "gold standard" apps. For this, we have collected a set of apps, as follows. First, we selected 8 app categories: Arcade, Entertainment, Photography, Simulation, Racing, Sports, Lifestyle, Casual. We have selected the 6,300 apps from the longitudinal dataset of the 87K apps that belong to one of these 8 categories and that have more than 10 reviews. From these, we then selected 1,600 apps. We have then collected the data of all their 50,643 reviewers (not unique), including the ids of all the 166,407 apps they reviewed.

We trained FairPlay with Random Forest (the best performer in previous experiments) on all the gold standard benign and fraudulent apps. We have then run FairPlay on the 1,600 apps, and identified 372 apps (23%) as fraudulent. The Racing and Arcade categories have the highest fraud densities: 34% and 36% of their apps were flagged as fraudulent.
Intuition. During the 10-fold cross validation of FairPlay for the gold standard fraudulent and benign sets, the top most impactful features for the Decision Tree classifier were (i) the percentage of nodes that belong to the largest pseudo clique, (ii) the percentage of nodes that belong to at least one pseudo clique, (iii) the percentage of reviews that contain fraud indicator words, and (iv) the number of pseudo cliques with θ ≥ 3. We thus examined the distributions of these features over the 372 apps flagged among the 1,600 apps. Figure 5b shows that 93.3% of the 372 apps have at least 1 pseudo clique of θ ≥ 3, nearly 71% have at least 3 pseudo cliques, and a single app can have up to 23 pseudo cliques. Figure 5c shows that the pseudo cliques are large and encompass many of the reviews of the apps: 55% of the 372 apps have at least 33% of their reviewers involved in a pseudo clique, while nearly 51% of the apps have a single pseudo clique containing 33% of their reviewers. While not plotted here due to space constraints, we note that around 75% of the 372 fraudulent apps have at least 20 fraud indicator words in their reviews.
Coercive campaign apps. Upon close inspection of apps flagged as fraudulent by FairPlay, we identified apps perpetrating a new attack type. The apps, which we call coercive campaign apps, harass the user to either (i) write a positive review for the app, or (ii) install and write a positive review for other apps (often of the same developer). In return, the app rewards the user by, e.g., removing ads, providing more features, unlocking the next game level, boosting the user's game level or awarding game points.

We found evidence of coercive campaign apps from users complaining through reviews, e.g., "I only rated it because i didn't want it to pop up while i am playing", or "Could not even play one level before i had to rate it [...] they actually are telling me to rate the app 5 stars". We leveraged this evidence to identify more coercive campaign apps from the longitudinal app set. Specifically, we have first manually selected a list of potential keywords indicating coercive apps (e.g., "rate", "download", "ads"). We then searched all the nearly 2.9M reviews of the 87K apps and found around 82K reviews that contain at least one of these keywords. Due to time constraints, we then randomly selected 3,000 reviews from this set, that are not flagged as fraudulent by FairPlay's RF module. Upon manual inspection, we identified 118 reviews that report coercive apps, and 48 apps that have received at least 2 such reviews. We leave a more thorough investigation of this phenomenon for future work.
We have introduced FairPlay, a system to detect both fraudulent and malware Google Play apps. Our experiments on a newly contributed longitudinal app dataset have shown that a high percentage of malware is involved in search rank fraud; both are accurately identified by FairPlay. In addition, we showed FairPlay's ability to discover hundreds of apps that evade Google Play's detection technology, including a new type of coercive fraud attack.
This research was supported in part by NSF grants1527153 and 1526254, and DoD W911NF-13-1-0142.
References

[1] Google Play. https://play.google.com/.
[2] Ezra Siegel. Fake Reviews in Google Play and Apple App Store. Appentive, 2014.
[3] Zach Miners. Report: Malware-infected Android apps spike in the Google Play store. PCWorld, 2014.
[4] Stephanie Mlot. Top Android App a Scam, Pulled From Google Play. PCMag, 2014.
[5] Daniel Roberts. How to spot fake apps on the Google Play store. Fortune, 2015.
[6] Andy Greenberg. Malware Apps Spoof Android Market To Infect Phones. Forbes Security, 2014.
[7] Jon Oberheide and Charlie Miller. Dissecting the Android Bouncer. SummerCon 2012, New York, 2012.
[8] VirusTotal - Free Online Virus, Malware and URL Scanner. Last accessed on May 2015.
[9] Iker Burguera, Urko Zurutuza, and Simin Nadjm-Tehrani. Crowdroid: Behavior-Based Malware Detection System for Android. In Proceedings of ACM SPSM, pages 15-26. ACM, 2011.
[10] Asaf Shabtai, Uri Kanonov, Yuval Elovici, Chanan Glezer, and Yael Weiss. Andromaly: a Behavioral Malware Detection Framework for Android Devices. Intelligent Information Systems, 38(1):161-190, 2012.
[11] Michael Grace, Yajin Zhou, Qiang Zhang, Shihong Zou, and Xuxian Jiang. RiskRanker: Scalable and Accurate Zero-day Android Malware Detection. In Proceedings of ACM MobiSys, 2012.
[12] Bhaskar Pratim Sarma, Ninghui Li, Chris Gates, Rahul Potharaju, Cristina Nita-Rotaru, and Ian Molloy. Android Permissions: a Perspective Combining Risks and Benefits. In Proceedings of ACM SACMAT, 2012.
[13] Hao Peng, Chris Gates, Bhaskar Sarma, Ninghui Li, Yuan Qi, Rahul Potharaju, Cristina Nita-Rotaru, and Ian Molloy. Using Probabilistic Generative Models for Ranking Risks of Android Apps. In Proceedings of ACM CCS, 2012.
[14] S.Y. Yerima, S. Sezer, and I. Muttik. Android Malware Detection Using Parallel Machine Learning Classifiers. In Proceedings of NGMAST, Sept 2014.
[15] Yajin Zhou and Xuxian Jiang. Dissecting Android Malware: Characterization and Evolution. In Proceedings of the IEEE S&P, pages 95-109. IEEE, 2012.
[16] Freelancer.
[17] Google I/O 2013 - Getting Discovered on Google Play. 2013.
[18] Fiverr.
[19] BestAppPromotion.
[20] Junting Ye and Leman Akoglu. Discovering opinion spammer groups by network footprints. In Machine Learning and Knowledge Discovery in Databases, pages 267-282. Springer, 2015.
[21] Leman Akoglu, Rishi Chandy, and Christos Faloutsos. Opinion Fraud Detection in Online Reviews by Network Effects. In Proceedings of ICWSM, 2013.
[22] Android Market API. https://code.google.com/p/android-market-api/, 2011.
[23] Takeaki Uno. An efficient algorithm for enumerating pseudo cliques. In Proceedings of ISAAC, 2007.
[24] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly, 2009.
[25] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs Up? Sentiment Classification Using Machine Learning Techniques. In Proceedings of EMNLP, 2002.
[26] Weka.
[27] S. I. Gallant. Perceptron-based learning algorithms. Trans. Neur. Netw., 1(2):179-191, June 1990.
[28] Leo Breiman. Random Forests. Machine Learning, 45:5-32, 2001.
[29] Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of IJCAI, 1995.
[30] D. H. Chau, C. Nachenberg, J. Wilhelm, A. Wright, and C. Faloutsos. Polonium: Tera-scale graph mining and inference for malware detection. In Proceedings of SIAM SDM, 2011.