CARONTE: Crawling Adversarial Resources Over Non-Trusted, High-Profile Environments
Michele Campobasso
University of Bologna, IT [email protected]
Pavlo Burda
Eindhoven University of Technology, NL [email protected]
Luca Allodi
Eindhoven University of Technology, NL [email protected]
Abstract—The monitoring of underground criminal activities is often automated to maximize the data collection and to train ML models to automatically adapt data collection tools to different communities. On the other hand, sophisticated adversaries may adopt crawling-detection capabilities that may significantly jeopardize researchers' opportunities to perform the data collection, for example by putting their accounts under the spotlight and leading to their expulsion from the community. This is particularly undesirable in prominent and high-profile criminal communities where entry costs are significant (either monetarily or, for example, for background checking or other trust-building mechanisms). This paper presents CARONTE, a tool to semi-automatically learn virtually any forum structure for parsing and data extraction, while maintaining a low profile for the data collection and avoiding the need to collect massive datasets to maintain tool scalability. We showcase the tool against four underground forums, and compare the network traffic it generates (as seen from the adversary's position, i.e. the underground community's server) against state-of-the-art tools for web-crawling as well as human users.
Index Terms—underground, stealth monitoring, data collection, high-profile communities
I. INTRODUCTION
The monitoring of underground activities is a core capability enabling law enforcement actions, (academic) research, malware and criminal profiling, among other activities. Currently, monitoring activities focus on the rapid collection of massive amounts of data [36], which can then be used to train machine learning (ML) models to, for example, extend available parsing capabilities to different forums or underground communities. Indeed, the proliferation of underground criminal communities makes the scalability of monitoring capabilities an essential aspect of an effective, and extensive, data collection, and ML has been the clear go-to solution to enable this. However, this comes at the high price of having to collect large volumes of data for training, raising the visibility of the researcher's activity and interest in the criminal community.

Indeed, the scientific literature showed that not all communities are born the same [22]; on the contrary, the majority of underground communities appear largely uninteresting (even when generating massive amounts of data about alleged artifacts [38]), both in terms of economic and social aspects [7], [40], as well as in terms of (negative) externalities for society at large [30], [36]. Whereas there are only a limited number of 'interesting' communities to monitor, gaining access to these may be less than trivial in many cases, particularly for forum-based communities and markets [6], [40]: high entry costs in terms of entry fees, background checks, interviews, or pull-in mechanisms are becoming more and more adopted in the underground as a means to control or limit the influence of 'untrusted' players in the community [6], [40]. Under these circumstances, researchers and LE infiltrating underground communities may face significant opportunity costs whereby increasing monitoring activity may also jeopardize their ability to monitor the very community(-ies) in which they wish to remain undercover: network logs and navigation patterns of crawling tools (authenticated in the communities using the researcher's credentials) can put the real nature of that user's visits under the spotlight, and lead to blacklisting or banning from the community. This is particularly undesirable in high-profile communities where the cost of re-entry can be high. Anecdotal evidence shows that monitoring incoming traffic, for example for robot detection or source-IP checking, is a countermeasure that underground communities may employ to limit undesired behaviour. Some communities explicitly acknowledge the adopted countermeasures (see for example Figure 1), others explicitly state that they are aware of the monitoring operations of LE and other 'undesirable' users; for example, the administrator of one prominent underground forum for malware and cyber-attacks that the authors are monitoring states explicitly: "Forums like this are being parsed by special services and automatically transfer requests to social network accounts and e-mails."

Fig. 1. Example of inbound traffic monitoring from criminal communities.
This significantly inhibits researchers' ability to build scalable, reusable parsing modules, as the collection of large amounts of data to train the associated ML algorithms may be slow or carry significant risks of exclusion from the monitored communities. Pastrana et al. [32] lead the way in identifying stealthiness as a requirement for systematic underground resource crawlers, with many recent works not explicitly mentioning these aspects [27], [33]. In this work we present CARONTE, a tool to monitor underground forums that: (1) can be configured to semi-automatically learn virtually any forum structure, without the need of writing ad-hoc parsers or collecting and manually classifying large volumes of data; (2) implements a simple user model to mimic human behaviour on a webpage, to maintain a low profile while performing the data collection. We showcase the tool against 4 underground forums, and compare the network traffic it generates (as seen from the adversary's position, i.e. the underground community's server) against state-of-the-art tools for web-crawling. Our results clearly show that both
CARONTE's request patterns and the completeness of the downloaded resource pages are significantly closer to humans when compared to other SoA crawling tools.

This paper proceeds as follows. In Section II we discuss relevant background and related work; Section III presents the tool, whereas Section IV presents the experimental validation and results. Section V discusses the impact and limitations of CARONTE, and Section VI concludes the paper.

II. BACKGROUND
Cybercrime monitoring has mainly been implemented through ad-hoc tools to scrape adversarial platforms, which do not scale with the number of sources and the variety of the content; nonetheless, most previous works are concerned with developing techniques that enable underground economy discovery, key hacker identification [5], and threat detection [11], disregarding stealthiness in favour of parsing volumes [27], [33], with few notable exceptions [32]. In this section we discuss the changing threat model against which these solutions are used, and the technical means by which their employment can be detected by adversaries.
A. A changing threat model for cybercrime monitoring
Cybercriminals have demonstrated to be increasingly aware of the mounting interest from scientific and nation-state sponsored investigations, pushing them to start developing techniques to keep unwanted actors and data gathering out of their communities [31]. Retaliation activities on the side of the cybercrooks have made the news in the past [6], [7], [40], including threats against journalists and academic researchers [23]. The attention paid by cybercriminals to the public sphere is also reflected in the technology and administrative procedures they employ to detect or stop possible 'intrusions' in their communities. Part of these techniques are centered on the evaluation of a prospective member at the act of registration on these platforms [7], [40], or on the continuous monitoring of community participation by each member [6]. Similarly, recent evidence suggests cybercriminals may be monitoring and auditing the traffic on their web servers, both to prevent access from undesired IP ranges (e.g. see Figure 1), and to mitigate external threats such as denial of service attacks. These technologies identify patterns and anomalies in network traffic to detect undesired activities and requests generated by robots and crawlers and, if necessary, take action to limit those [18], [34]. Whereas researchers can build profiles to go undercover in certain communities, thereby passing the first access filter enforced by cybercrime community entry regulations [6], network monitoring operations remain an unmitigated threat to the automation of data collection once inside the forum, particularly in the face of an evolving adversary (i.e. the cybercrooks behind the platform).

B. Robot and crawler detection
With the birth of the big data society, crawling has become a conspicuous portion of Internet traffic [9] and an unwanted practice for website owners, both due to network resource consumption and due to the lack of explicit permission for a third party to massively download all the website content for unknown goals, often resulting in a privacy violation [41], [19]. Several anti-crawling techniques have been developed; a great number of these are based on minimal traffic pattern analysis, often focused on monitoring characteristics of HTTP traffic observable from logs. The monitored characteristics include the rate of requests, the length of browsing sessions, the lack of cookie management, the presence of bogus user agents, JavaScript execution, access to the robots.txt file, the usage of HEAD HTTP requests, the cherry-picking of requested resources, and the lack of a referrer in HTTP requests [41], [35], [15]. These strategies are quite simplistic and do not provide consistent and reliable crawler detection, ignoring the chance that a focused and stealthier crawler can act in disguise, tampering with the information in its requests. Since our adversary may legitimately have advanced technical skills, these requisites remain relevant, but are not sufficient to keep a robot undercover. Additional efforts have been made to create more reliable methods; the state-of-the-art techniques for detecting automated activity on a website include pattern recognition, such as loophole detection and breadth-first or depth-first strategies, JavaScript fingerprinting and tracking, and Turing tests, on top of other strategies [26], [14], [39]. Turing tests such as CAPTCHAs can be outsourced at extremely low prices [2] or solved via OCR [25], and the production of unsuspicious traffic can be obtained with a focused crawler that acts with some precautions. Moreover, the context of our studies brings us to platforms in the dark web, where Turing tests requiring JavaScript to be enabled are almost nonexistent, due to the associated risks (e.g. having allowed, in the past, the complete bypass of Tor's anonymization [1], [3]).

Zhang et al. [41] propose a dynamic blocker for crawlers that analyzes traffic patterns such as the complete exploration of the resources linked to a page (attachments, links, ...), the non-acceptance of cookies, bogus user agents in HTTP requests, and high fetch rates. Stevanovic et al. introduce some additional checks compared to the previously analyzed works, such as the HTML/image ratio (the number of HTML page requests over the number of image file (JPEG and PNG) requests sent in a single session), which tends to be very high for crawlers, the number of PDF/PS file requests, and the percentage of answers with 4xx error codes and of unassigned referrers over the total requests, which show high scores for robots [37]. Doran et al. propose to recognize crawlers in real time [14]; in particular, their work provides a 4-tier analysis based on syntactical log analysis, traffic pattern analysis, analytical learning techniques, and Turing test systems. For what concerns behavioral patterns, Kwon et al. have studied how crawlers generally have a monotonous behavior in the type of requested resources; their attempt, therefore, is to classify crawlers based on the "switching factor" between text and multimedia contents [26]. Other studies regarding behavioral patterns are by Balla et al., who analyze the time between one request and another and when these are issued (e.g. during night time, making them more suspicious) [10].

Crawler detection patterns aside, gathering data that is ready to use and well structured is not an easy task to accomplish. Forum structures may vary a lot, depending on the forum platform adopted and its configuration. In particular, the goal is not to scrape entire pages and dump them to disk, but to extract and structure data for further analysis; for this reason, the crawler needs to be instructed on what resources are to be collected, how to reach them, and what they mean. Therefore, a knowledge base should be created for the crawler in a reliable way, enabling forum traversal in the required areas through the identification of the existing resources of each page of interest.
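To make the log-based detection signals discussed above concrete, the following minimal sketch (ours, not taken from any of the cited systems) computes a few per-session features commonly used for crawler detection; the log format and field names are assumptions for illustration only.

    from collections import defaultdict

    def session_features(log_entries):
        # log_entries: iterable of dicts with keys 'session', 'timestamp' (seconds),
        # 'path', 'status', 'referrer' -- an assumed, simplified log format.
        sessions = defaultdict(list)
        for e in log_entries:
            sessions[e["session"]].append(e)
        features = {}
        for sid, entries in sessions.items():
            entries.sort(key=lambda e: e["timestamp"])
            duration = max(entries[-1]["timestamp"] - entries[0]["timestamp"], 1)
            html_reqs = sum(1 for e in entries if e["path"].endswith((".html", "/")))
            img_reqs = sum(1 for e in entries if e["path"].endswith((".jpg", ".jpeg", ".png")))
            features[sid] = {
                "req_per_sec": len(entries) / duration,                      # fetch rate
                "html_image_ratio": html_reqs / max(img_reqs, 1),            # high for crawlers
                "err_4xx_share": sum(400 <= e["status"] < 500 for e in entries) / len(entries),
                "no_referrer_share": sum(not e["referrer"] for e in entries) / len(entries),
            }
        return features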
C. Modeling ‘regular’ user behaviour
Studies on user browsing behaviour broadly distinguish between click patterns and time patterns.

Click patterns. Click models are used to evaluate user decisions in considering a topic or hyperlink relevant to the specific purpose of their navigation or query [17]. Derived approaches consider single-browsing and multi-browsing models to infer user behaviour as a function of the purpose of the navigation, in particular distinguishing between navigational and informational queries, whereby the user wants to reach a specific resource (likely producing one click at a time), or is interested in exploring new information (likely producing multiple clicks at a time) [17], [20]. These models show that past behaviour or user interest are useful predictors of which clicks will happen in the future [17]. In our context, forum browsing clearly covers both dimensions, depending on whether the user aims at retrieving specific information (e.g. updates in a thread of previous interest to the user), or at exploring the content of a forum section.
Time patterns. More broadly, these dynamics are explained in the information retrieval literature as dependent on the user's task [8]. The decision of a user to click on a specific resource depends on its perceived and intrinsic relevance w.r.t. the user's goal, and is bounded by how many topics need to be opened to find the information the user is interested in [16]. Post-click user behaviour (i.e. what the user does once he or she reaches the clicked resource) has been shown to be directly related to the relevance of the document [21]. Post-click behaviour includes variables associated with mouse movements, scrolling, and eye-tracking [17], [21], clearly showing that what the user does, and how much time the user spends on a webpage, varies as a function of the relevance of the webpage. Indeed, a user's inaction on a webpage has been shown to be relevant to model the quality of dynamic systems such as recommendation systems [43]. Part of that behaviour can be quantified by considering how quickly users can be expected to process the relevant information [29]. Data on this subject is scarce and quite diverse; some sources report the average reading speed to be around 200-250 WPM (Words Per Minute) with a comprehension rate of 50-60% [29], while others report that reading technical content with good comprehension can be as slow as 50-60 WPM.

III. CARONTE
A. Design
From the literature analysis in the previous section, we derive a set of desiderata for CARONTE.
1) Functional and behavioural requirements:
First and foremost, CARONTE must be able to semi-automatically learn forum structures without the need for extensive pre-collected datasets on which to train automated models [33]. This should be a one-time-only process, employed for each new forum structure that has not already been learned. Further, CARONTE must have the ability to diverge from crawler behavior and, where possible, to mimic human behavior. In this regard, as emerged from the time patterns paragraph, and keeping in mind that one significant aspect of crawlers is their greed for resources, CARONTE should not exhibit high fetch rates and should mimic as much as possible a human's time to browse and read resources, whether the content is appealing or not. Therefore, we model CARONTE to mimic interest in a specified subset of the forum, exploring only certain sections of it, according to the hypothetical goals of our modelled actor. The topics and sections of interest will be taught during the learning phase. Then, CARONTE will be able to receive instructions about which areas are valuable to crawl and which to skip. The forum contents will be explored both through navigational and informational queries; in particular, CARONTE will quickly access resources such as posts in threads already read and the resources along the path traversed from the landing page to the section of interest, while it will take more time and produce less frequent clicks while staying on pages with new content from the section of interest. To improve stealthiness in this respect, we design a navigation schedule on a forum like that of an actual human being, keeping in mind variables such as time of day and stochastic interruptions.
2) Technical requirements:
To avoid detection at the network level, CARONTE will have to act indistinguishably from a regular browser in terms of generated traffic, and differently from regular crawlers. The primary aspect is to produce unsuspicious HTTP requests against the webserver; crawlers' traffic is characterized by the adoption of HEAD HTTP requests to determine whether the resource to download is of interest or not, by the non-filling of the referrer field in requests, and by the usage of a bogus user agent [10], [14], [24], [35], [37]. Further, crawlers might be interested in fetching only text content, refusing to download styles, images [37], and JavaScript (e.g. to minimize network footprint), or will not actively execute client-side code such as JavaScript, handle sessions, and manage the cache as a 'regular' web browser would do. In our case, we consider it legitimate to have JavaScript completely disabled to increase the anonymity of our tool; this countermeasure is quite common inside the Dark Web and should not raise any suspicion. With this in mind, we identify the need for a fully functional browser that by design covers all these aspects coherently with a legitimate one, but that can be maneuvered programmatically. Table I provides an overview of the identified requirements for CARONTE.

Fig. 2. CARONTE trainer module structure.

B. Architecture and implementation
CARONTE adopts a two-tier architecture, separating the training from the crawling operations.
1) Trainer module:
a) Base mechanism:
The trainer module has the task of building a knowledge base for traversing the forum structure (Figure 2). For each page where relevant content or fields are present, the trainer will load, save, and render a modified copy of it to the user. For each of them, the operator will be asked to click on the desired resources inside the rendered page. Before being rendered, pages are preprocessed; in particular, we inject JavaScript scripts to allow CARONTE to gather the events triggered by the human operator. With different combinations of onclick() and addEventListener(), we control these interactions and generate AJAX requests against CARONTE's backend. The payload of these requests is a resource identifier (see the "Resource identifiers" paragraph in this section) that will allow the crawler module to access the required information or interact with it, where necessary. Subsequently, the tool renders the saved page again, highlighting the previously identified content, allowing the user to confirm whether the identifiers for the resources have been inferred correctly by the tool or not (Figure 4). In some cases user-generated clicks are not possible, or we aim to identify a group of resources. For example, this is the case when identifying multiple posts inside a thread; for this kind of resource, our goal is to infer a resource identifier that can operate like a regular expression, enabling the tool to resolve all the required elements on the page. Our strategy here is based on the collection of multiple snippets of text contained in each of these resources (Figure 5). For each of the received fragments, CARONTE will query the JavaScript engine embedded in the browser handled by Selenium in order to resolve their identifiers and, through syntactical similarity, generate a matching one. Text content will be gathered with the help of the human operator in a special page (here referred to as the content collector page) that is presented to the user together with the original page.

Fig. 3. CARONTE crawler module structure.
Fig. 4. Validation of identifiers inferred.
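As an illustration of the event-capture idea described above, the sketch below injects a click listener into the rendered copy of a page through Selenium; the backend endpoint, port, and the (very rough) identifier heuristic are hypothetical placeholders and stand in for CARONTE's actual heuristics.

    # Hypothetical sketch: capture operator clicks on the saved page and report a
    # rough identifier of the clicked element to a local trainer backend via AJAX.
    CAPTURE_JS = """
    document.addEventListener('click', function (ev) {
        ev.preventDefault();
        var el = ev.target;
        // crude identifier: tag name plus first CSS class (illustrative only)
        var ident = el.tagName + (el.className ? '.' + String(el.className).split(' ')[0] : '');
        var xhr = new XMLHttpRequest();
        xhr.open('POST', 'http://127.0.0.1:5000/identify', true);   // hypothetical endpoint
        xhr.setRequestHeader('Content-Type', 'application/json');
        xhr.send(JSON.stringify({identifier: ident}));
    }, true);
    """

    def arm_page_for_training(driver):
        # driver: a Selenium WebDriver rendering the saved copy of the forum page
        driver.execute_script(CAPTURE_JS)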
b) Resource identifiers: The desired resources can be identified through two different approaches: XPath or common HTML classes. XPath is a standardized query language that identifies elements inside an XML-like document; it supports regular expressions for matching several elements. HTML classes, instead, are attributes assigned to nodes of an HTML file to which different styles are assigned. Even though XPath is an ad-hoc technique for identifying elements in an HTML page, sometimes inferring HTML classes is easier than XPath.
TABLE I
SUMMARY OF IDENTIFIED REQUIREMENTS FOR CARONTE

Requirement | Description | Implementation
Learning forum structures | Understanding forum structure, how to browse it, and where valuable information is | Creation of a human-supervised learning module that identifies needed resources
Regular browser behaviour | Realistic user agents, caching behaviour, referral handling | Exploration of required sections only, throttling requests according to the text volume of the page, mimicking reading time; confining crawling activity to semi-random time slots during the day and suspending it for random amounts of time during the day
Realistic browser configuration | Addon installation and page download feature | Install NoScript and Page Save WE; preparation of the browser to support shortcuts for downloading a page
Anonymity | Browsing session needs to be anonymous | Tor Browser adoption, JavaScript disabled at browser level and defaults changed to refuse JavaScript and active content
Fig. 5. Gathering of text snippets from the saved page (in next tab).

During the training phase, when a resource is clicked, the loaded page will identify the associated identifier through a series of heuristics and send it to the backend. If there are multiple resources, the content-collector page will be rendered alongside the downloaded page and the user will fill in the fields with the required data. The process to identify the most likely resource identifier depends on the data structure and the number of classes associated with that resource.
CARONTE supports the following four cases:

• Technique 1. Extract the XPath of the exact resource. If there are multiple resources, the most frequent XPath will be the candidate;
• Technique 2. Extract the XPath of the exact resource, but truncate the last node. The exact-XPath approach may fail due to the presence of extra HTML tags (e.g. due to text formatting), which can then be disregarded. If there are multiple resources, the most frequent XPath will be the candidate after removing the last node;
• Technique 3. The class of the exact resource. If there are multiple resources, the most frequent class will be the candidate. This approach solves the problem of calculating an XPath in a page where the content is dynamic, which would result in a non-predictable XPath for a certain resource depending on the content loaded in the page. If the resources are not assigned to a class, the element will be replaced with its parent, which will act as a wrapper;
• Technique 4. Two classes of the exact resource. If there are multiple resources, the two most frequent classes will be the candidates. This approach is adopted to handle elements in a page that exhibit the same class as the desired content, which would result in a misclassification; this strategy therefore allows a stricter condition on the search criteria for the required resource. If the resource(s) has no class, the element will be replaced with its parent, which will act as a wrapper.
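The sketch below approximates the four strategies with lxml; it is an illustration of the idea rather than CARONTE's implementation, and the snippet-matching step is deliberately simplified.

    from lxml import html

    def candidate_identifiers(page_source, snippet):
        # Locate the element containing the operator-provided text snippet and
        # derive one candidate identifier per technique.
        tree = html.fromstring(page_source)
        el = next((e for e in tree.iter()
                   if isinstance(e.tag, str) and e.text and snippet in e.text), None)
        if el is None:
            return {}
        root = tree.getroottree()
        classes = (el.get("class") or "").split()
        return {
            "exact_xpath": root.getpath(el),                               # technique 1
            "parent_xpath": root.getpath(el.getparent()),                  # technique 2
            "single_class": classes[0] if classes else None,               # technique 3
            "double_class": classes[:2] if len(classes) >= 2 else None,    # technique 4
        }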
2) Crawler module:
Based on the structural details collected with the trainer module, the crawler module will traverse the forum to reach the required resources, explore threads, and collect all the required data. The crawler also embodies the requirement of being compliant with the traffic generated by a regular browser while camouflaging its nature by adopting low fetch rates for pages. How time is calculated before accessing the next resource is detailed in the Reading Time paragraph below. CARONTE further keeps track of updated threads and selects those opportunistically for visiting; threads that have not been updated are not traversed a second time.
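A minimal sketch of this revisit policy is shown below; the bookkeeping structure and field names are assumptions, not the tool's actual data model.

    # Remember the reply count seen at the last visit and only revisit threads
    # that show new replies since then.
    seen_replies = {}  # thread_id -> reply count at last visit

    def should_visit(thread_id, current_replies):
        last = seen_replies.get(thread_id)
        if last is not None and current_replies <= last:
            return False                     # nothing new: skip the thread
        seen_replies[thread_id] = current_replies
        return True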
C. Behavioral aspects
1) Legitimate browser traffic - Browser:
To implement CARONTE's browser functionalities we adopt Tor Browser Selenium, or tbselenium for short. tbselenium drives geckodriver, Mozilla's browser automation driver, which allows maneuvering the browser's behavior and UI. Moreover, tbselenium exposes an interface for customizing the environment and, finally, produces traffic identical to that of Tor Browser.
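A minimal usage sketch is shown below, assuming a locally installed Tor Browser bundle and a reachable Tor process; parameter names (e.g. pref_dict) may vary slightly across tbselenium versions, and the class name used for the lookup is hypothetical.

    from tbselenium.tbdriver import TorBrowserDriver
    from selenium.webdriver.common.by import By

    TBB_PATH = "/opt/tor-browser_en-US/"   # local Tor Browser bundle path (assumption)

    # Disable JavaScript at the browser level, coherently with the requirements in Table I.
    with TorBrowserDriver(TBB_PATH, pref_dict={"javascript.enabled": False}) as driver:
        driver.load_url("http://forum.example.onion/", wait_for_page_body=True)
        # 'thread-title' stands in for a class identifier learned during training
        threads = driver.find_elements(By.CLASS_NAME, "thread-title")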
2) Mimicking legitimate human traffic:
a) Work schedule:
CARONTE can be configured to work within pre-defined timeslots during the week and on weekends: late afternoons and evenings during the week, and all three sessions on weekends. Between sessions, a randomized time of inactivity simulates short pauses (between 5 minutes and half an hour) and longer ones at pre-defined times (e.g. 2 hours around dinner time). These can be configured. Each session has a start time and an end time, each of which can randomly vary by up to 25% of the total duration of the crawling session. Each session has a 20% chance of being skipped. Nonetheless, we avoid 24 hours of complete inactivity: if no session is scheduled in the next 24 hours, one compatible with the default schedule will be executed. Start and end times are shifted according to the timezone of the geographical location of our forum user profile.
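The following sketch illustrates, under our own simplifying assumptions, how such a schedule can be randomized: each session may be skipped with 20% probability, its boundaries are jittered by up to 25% of its duration, and short pauses of 5 to 30 minutes are drawn within a session.

    import random

    def jittered_session(start_h, end_h):
        # Return a (start, end) pair in hours, or None if the session is skipped.
        if random.random() < 0.20:                    # 20% chance to skip the session
            return None
        duration = end_h - start_h
        jitter = 0.25 * duration                      # up to 25% of the session duration
        start = start_h + random.uniform(-jitter, jitter)
        end = end_h + random.uniform(-jitter, jitter)
        return start, max(end, start + 0.1)

    def short_pause_seconds():
        # Short pause between 5 and 30 minutes, simulating breaks within a session.
        return random.uniform(5 * 60, 30 * 60)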
b) Reading time: The time spent between two requests is calculated according to two main criteria:

• If the current page does not show significant content to be read (e.g. pressing the login button, reaching the section of interest of a forum, moving to page 2 of a forum section, ...) or the content has already been read (a thread may contain new messages, so old ones will be skipped), the time spent before going to the next page is a random number of seconds between 3 and 7. This decision is based on the fact that the information on such pages is more essential and visual; it enables our fake actor to skim rapidly and choose what to read, fulfilling the expectation of a navigational query pattern;
• If the current page is the body of a thread, the tool will wait, for each unread post, an amount of seconds calculated as the time to read the post at a speed in the range of 120-180 WPM. This behavior matches the expectation of producing informational queries.
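A sketch of the two criteria follows (written by us for illustration; the exact implementation may differ): pages without new readable content get a 3-7 second delay, while thread pages are delayed proportionally to the unread text at a randomized 120-180 WPM reading speed.

    import random

    def delay_for_page(unread_posts):
        # unread_posts: list of post texts not read before; empty for navigational
        # pages (login, section listing, already-read threads).
        if not unread_posts:
            return random.uniform(3, 7)               # quick, navigational-style click
        total = 0.0
        for post in unread_posts:
            words = len(post.split())
            wpm = random.uniform(120, 180)            # randomized reading speed
            total += 60.0 * words / wpm               # seconds to read this post
        return total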
c) User event generation: CARONTE's modeled user goal is to reach the threads of interest and iterate over them to extract their content. When starting the crawling process, CARONTE loads the forum homepage, as if it were typed in the address bar, then reaches the login page. Once logged in, it reaches one of the sections of interest expressed during the training and opens one thread at a time (if it has never been read or has new replies). For each thread, it browses each page until the thread has been read in full. The click patterns generated match the purpose of our fake user, who considers relevant the content of pages with a significant quantity of text, such as a thread, rather than a login page.

IV. EXPERIMENTAL VALIDATION
A. Forum selection
To demonstrate CARONTE's capabilities against different forums, we selected four real-world criminal forums built on top of different platforms. The candidates (Table II) are a consistent representation of the most common forum platforms widely adopted on the Web [5], [11], [33], [36], [4]. We first reproduced four live hacker forums by scraping them and hosting their content on a local server at our Institution. Before reproducing the content on our systems, we inspected the source code and scanned it with VirusTotal.com to assure that no malicious links or code were present. Forum mirrors include multimedia content, styles, and JavaScript. To avoid causing misservice on the server side while scraping the forums, we avoided aggressive scraping. As our interest is to have an appropriate test-bed to evaluate CARONTE's overall performance, the nature (or quality) of the content of the forums is irrelevant for our purposes.
B. State-of-the-art tools selection
To provide a comparison of CARONTE's capabilities against other tools, we select three among the available ones:

• A1 Website Download: a shareware crawler specialized in downloading forum content. Through a fine-grained customization wizard, it is possible to use configuration presets that better fit the crawling process against a certain forum software, optimizing its performance;
• HTTrack: probably the most famous tool for downloading websites, HTTrack provides several tweaking features through regular expressions for downloading a website;
• grab-site: fully open-source, grab-site is a regular crawler for downloading large portions of the web, powered by the Archive Team.

C. Training phase
The approach adopted by CARONTE to discover the structure of a forum has proven effective in our tests. To learn the structure of a forum, we rely on the predictability of that structure over time in terms of XPath and HTML classes. This holds in the majority of cases; from the literature analysis and empirical evaluations of the most common forum structures [13], [28], [42], we found no evidence of dynamically-loaded forum structures that would alter the DOM structure at each visit or while staying on a page. This seems well in line with environments like the Dark Web, where platform simplicity and functionality, as well as predictability, are desirable [40]. During this phase, all the countermeasures that disable the download and execution of active content and JavaScript are in use as well.
1) Problems and solutions: Post details mismatch avoidance.
We found that, occasionally, post details such as authors and dates have different structural identifiers or are displaced differently, causing the crawler module to miss them and to associate one post's details with another, due to the different cardinality of the identified elements (from the crawled posts in our database, we notice that only 8 author names have been missed out of 12854 posts for IP Board 3.4.4). Even though CARONTE is not able to recover them, we model the post as a unique container (which we call the post wrapper) to which details are anchored. By doing so, the identifiers are calculated relative to the post, and we thus avoid accidentally assigning post details to the wrong posts.
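The sketch below illustrates the post-wrapper idea with lxml: wrappers are resolved first and the details are looked up relative to each wrapper, so a missing field cannot shift details onto the wrong post. The class names are hypothetical placeholders for the identifiers learned during training.

    from lxml import html

    def extract_posts(page_source):
        tree = html.fromstring(page_source)
        posts = []
        for wrapper in tree.find_class("post-wrapper"):        # learned wrapper identifier
            author = wrapper.find_class("post-author")
            date = wrapper.find_class("post-date")
            body = wrapper.find_class("post-body")
            posts.append({
                "author": author[0].text_content().strip() if author else None,
                "date": date[0].text_content().strip() if date else None,
                "body": body[0].text_content().strip() if body else None,
            })
        return posts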
Inconsistent reference to navigation button in forums.
Depending on the forum platform and on the adopted configuration (e.g. forum skins or themes), HTML tags may have different names and usages. For example, vBulletin 4.2.1 and XenForo 1.5 adopt the same HTML tag id or class for both the forward and back navigation buttons inside a thread or forum section. During the training stage of CARONTE, inferring the class of this element leads to the unwanted result of identifying both buttons with the same rule (Figure 6). This would result in moving back and forth between the first and the second page. CARONTE manages this issue in the training phase by loading the next page and asking the user whether the highlighted part of the DOM is one element only, or more.
TABLE II
SCRAPED FORUMS FOR OUR TESTBED

Forum | Time span | Forum software | Obtained with
https://nulled.io | 14 Jan 2015 - 06 May 2016 | IP Board 3.4.4 | Online dump
http://offensivecommunity.net | Jun 2012 - 6 Feb 2019 | MyBB (unknown version) | HTTrack 3.49.2
http://darkwebmafias.net | Jun 2017 - 7 Feb 2019 | XenForo 1.5 | A1 Website Downloader 9
http://garage4hackers.com | Jul 2010 - 4 Feb 2019 | vBulletin 4.2.1 | A1 Website Downloader 9
Fig. 6. Resource name collision.

TABLE III
IDENTIFICATION TECHNIQUES USED PER FORUM

Forum | Exact XPath | Parent XPath | Single Class | Double Class
nulled.io | 10 | 1 | 2 | 0
offensivecommunity.net | 9 | 2 | 2 | 0
darkwebmafias.net | 9 | 2 | 2 | 0
garage4hackers.com | 8 | 1 | 3 | 0
In the second case, CARONTE keeps a record of the conflict and accesses the second retrieved element when this case occurs.
2) Training evaluation:
Depending on the peculiarities of the forum against which CARONTE has been trained, different strategies have been adopted to determine the resource identifiers (see the "Resource identifiers" paragraph). A summary of the identification strategies used per forum is available in Table III. In greater detail, for OffensiveCommunity.net and DarkWebMafias.net, the first post of a thread has some differences in its HTML structure compared to the other posts. In particular, the field of the post author is wrapped in some extra nodes that provide a special style to it. With the adoption of the parent XPath (technique 2), it has been possible to infer a rule that works for every post's author. For Garage4Hackers.com the XPath regex was not a sufficient approach to find all the post wrappers in a page. This is due to the fact that, when multiple resources are meant to be identified through an XPath regex, the //*[starts-with(@id, 'something')] selector is used, which returns an array of nodes. Specifically, we were interested in nodes with id post_XXXXX, but the same page also contained nodes with id post_message_XXXXX, which caused the resolution of both content types. To overcome that, identifying the required resources through a class (technique 3) was sufficient to solve the problem. For all the forums, identifying the regex for the post wrapper is a special process that uses a variant of the techniques above, where the container is identified by incremental steps: starting from snippets of text from different posts of the page, a first XPath is calculated; subsequently, with user interaction, it is refined by removing unnecessary children until the whole post is correctly classified with the calculated XPath. Moreover, for all the forums, inferring a stable XPath identifier for the next-page buttons is not possible. This depends on the number of buttons inside the navigation wrapper, which changes depending on the number of available pages or even when moving to the second page. To circumvent this problem, again a class comes to help (technique 3). The double class (technique 4) was never needed for the forums in our testbed (Table III).
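The id-prefix collision can be reproduced with the short sketch below (ours, for illustration): the starts-with() selector matches both post_XXXXX and post_message_XXXXX nodes, while a class-based lookup as in technique 3 returns only the intended wrappers (the class name used here is hypothetical).

    from lxml import html

    def post_wrappers(page_source):
        tree = html.fromstring(page_source)
        by_xpath = tree.xpath("//*[starts-with(@id, 'post_')]")   # over-matches message nodes too
        by_class = tree.find_class("postcontainer")               # hypothetical wrapper class
        return by_xpath, by_class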
D. Network patterns and behaviour
In order to evaluate how the network traffic generated by CARONTE compares with the network traffic generated by humans (i.e. legitimate users) and by state-of-the-art crawlers, we performed an experiment employing the Amazon Mechanical Turk platform. This enables us to compare CARONTE against both 'undesirable' and 'desirable' traffic from the perspective of the criminal forum administrator.
1) Experiment methodology:
a) Human navigation experiment:
To generate human traffic towards our forums, we rely on Amazon Mechanical Turk (MTurk). From the literature review, we identify three main experimental variables characterizing the habits of a regular user on the Internet:

• Var1: The interest raised in the reader by the content may lead them to read carefully all the content of a certain thread or not, the latter resulting in skimming and moving quickly to the next resource [16], [21];
• Var2: The desire for privacy of the user, which may be high or low, resulting (or not) in the adoption of solutions that prevent JavaScript from being executed, to avoid fingerprinting techniques [1], [6], [32];
• Var3: The propensity of a user to open several resources in parallel before actually browsing them, or instead to open them one at a time, reading their content before moving to the next resource [17].

To control for possible interdependencies between these dimensions, we create a 2^(3-1) fractional factorial experimental design, which allows us to reduce the number of experimental conditions from eight to four [12]. The experimental treatments and design are reported in Tables IV and V, respectively.
2) Experimental design and setup:
An overview of the experimental setup is shown in Figure 7. The setup implementation has been carried out in three stages. First, the selected web forums (ref. Table II) are hosted on an IIS web server (vers. 10) where access logging is enabled. Second, we prepare an Amazon Mechanical Turk task reflecting the experimental design (ref. Table V). The task includes eight questions based on the content of the forum webpages (two multiple-choice questions per forum), and detailed step-by-step instructions that respondents had to follow.

TABLE IV
EXPERIMENTAL FEATURES AND TREATMENTS

Variable | Description | Treatment A | Treatment B
Var1 | The reader is interested in the content or skims a few posts | Read all the content inside of the thread | Skim thread or read first post
Var2 | The user enables or disables JavaScript on Tor Browser | Enabled | Disabled
Var3 | Opening resources in parallel or sequentially | Sequential | Parallel

TABLE V
TREATMENT COMBINATION AND EXPERIMENTS

     | Exp1 | Exp2 | Exp3 | Exp4
     | A  B | A  B | A  B | A  B
Var1 | -  + | -  + | +  - | +  -
Var2 | +  - | -  + | +  - | -  +
Var3 | +  - | -  + | -  + | +  -
Such instructions serve the purpose of enforcing the treatment in the experiment; for example, Exp3 requires users to read all content of a thread (Var1, A), have JavaScript enabled (Var2, A), and open forum tabs in parallel (Var3, B):

[...] open in separate tabs all threads you think are relevant to those two topics (Var3, B). While reading the forum threads, please also skim through to at least the second thread page (Var1, A), if present, and even if you already found the answer.

Notice that, as JavaScript is enabled by default in Tor Browser, there is no instruction for Var2, A. When a respondent accesses the task on AMT, he or she is randomly assigned to an experimental condition. Further, each instance of the experiment randomizes the forum order to minimize cross-over effects. The last step consists of enabling us to collect the generated network requests. To avoid limitations imposed by the Tor circuit refresh mechanism, which may change the IP address of users every 10 minutes, we set a cookie on the user's browser with a unique session ID. We use the same strategy to track the experimental condition to which the user has been randomly assigned at access time.
3) Results:
Figure 8 reports the network analysis for CARONTE compared to the state-of-the-art tools and the MTurks. As emerged from the requirements, CARONTE should not exhibit greed in resource fetching, but should access resources with lower frequency, according to the resource content. On the one hand, if it is true that avoiding request bursts is probably more than enough against an automated monitoring tool, on the other hand this guarantees an extra layer of stealthiness in case of human verification and could fool ML-based robot-detection systems. Therefore, we compare this traffic-throttling model with humans and other tools by monitoring the number of requests per thread and the time between them (a request refers to all the calls to a page of a thread, without considering all the linked content downloaded; requests per thread refer to the set of all the requests fired by an actor inside each thread). From the comparison it emerges that the time elapsed between two different requests produced by humans is comparable to CARONTE's and HTTrack's, while the other tools perform more aggressively. Concerning the media/text ratio of the sessions, CARONTE, together with grab-site, performs quite close to humans. Finally, we compared the number of requests issued per thread: CARONTE and grab-site again perform closer to humans than the other two tools, although their behavior slightly differs from the MTurks'. Overall, we observe that CARONTE's network trace is consistently very similar to human-generated network traffic, whereas other tools clearly differ over one or more dimensions.

Fig. 7. Experimental setup. The forums are deployed on an internal system at the University. Resources are accessed by industry-standard automated tools (scrapers), CARONTE, and MTurks. All tools access the local resources through the Tor network. Each MTurk is randomly assigned to an experiment setup with different conditions (see Table V). Internal network logs allow us to backtrack user requests to specific experimental setups.

V. DISCUSSION
CARONTE's training module proved effective in flexibly learning diverse forum structures. Differently from ML-based systems, the adopted semi-automated procedure allows the tool to reliably identify relevant structures in the DOM of a page, while entirely avoiding the need to collect massive amounts of pre-existing data (for training and validation) that might jeopardize the researcher's activity. Whereas this does come at the price of additional human-sourced work w.r.t. fully-automated procedures, CARONTE is meant to be employed over (the few) highly-prominent underground communities where the threat model CARONTE addresses is realistic. The presented proof-of-concept has been tested over four diverse forum structures, and can be expanded in future work beyond the 'forum' domain (e.g. e-commerce criminal websites).

Fig. 8. Evaluation of CARONTE against state-of-the-art tools and MTurks.

From the network analysis it emerges that CARONTE coherently reproduces the three investigated features when compared to humans and performs better, on average, than the other tools. Our tool produces the multimedia traffic of a regular human actor, together with grab-site, while the other two tools diverge from this behavior; we suspect that this is to be traced back to some optimization mechanisms which avoid reissuing requests for the same resource, without even issuing a HEAD HTTP request. A regular browser, instead, will always reissue the request while loading another page, unless explicitly instructed otherwise by a server-side caching policy. Nonetheless, we have found no confirmation of this in the documentation of these tools. With regard to the number of requests generated per thread, there is a noticeable difference when compared to humans. This is probably caused by MTurks skipping some pages in the threads. In fact, in multiple cases, the downloaded forums have plenty of 'useless' replies to threads, which may result in decreased interest from the reader, possibly leading to skipping the following pages. The observed difference in generated requests per thread between
CARONTE and grab-site, on the one hand, and the other two tools, on the other, is caused by the fact that the latter also follow non-relevant links, such as content re-displacement within the page. This last behavior, in particular, represents a well-known traffic feature of a crawler. To improve on this, it would be possible to instruct our tool to ignore threads whose content is redundant and extremely short.

As mentioned in the network patterns and behavior subsection, we have monitored some extra features that may represent a red flag in crawler detection. Nonetheless, they are not part of the experiment, since they are conditions enforced by design in our tool (like the filling of the referrer field). These are shown, for reference, in Table VI: CARONTE explores threads one at a time, sequentially, while other crawlers tend to open multiple resources in parallel. A1 Website Download has never filled the referrer URL in its HTTP requests, highlighting the fact that these requests have not been sent from a legitimate browser. In the last analysis, for these monitored aspects, we can say that HTTrack performs better than the others in terms of browser features exhibited.

Currently, CARONTE is limited to the circumvention of passive traffic monitoring and intrusion detection tools; a more sophisticated or active adversary could introduce new techniques to recognize whether the connected user is a human (live chat, custom-made CAPTCHAs, ...) or could adopt refined anti-crawling mechanisms that leverage ML.
TABLE VI
EXTRA FEATURES MONITORED

Tool | JS | Styles | Cache | Seq./Par. | Referrals
CARONTE | ✗ | ✓ | ✓ | Seq. | ✓
grab-site | ✓ | ✓ | ✗ | Par. | ✓
HTTrack | ✓ | ✓ | ✓ | Par. | ✓
A1 Website | ✓ | ✓ | ✗ | Par. | ✗

VI. CONCLUSIONS, ETHICAL ASPECTS AND FUTURE WORK
Automated tools that gather data in a stealthy way from high-profile forums are a growing need in our society, due to the increase of cyberattacks and the amount of data in these platforms. CARONTE has proven to be a flexible and suitable solution for stealthy information gathering from cybercriminal forums, and it can be adopted regardless of the content of the platform explored. This could allow collecting data from virtually any forum for several purposes, such as bullying monitoring, suicide prevention, and terrorism monitoring, with the aid of NLP modules. Nonetheless, it may become an instrument for investigations driven by nations that apply strong censorship and violent oppression, jeopardizing the freedom of whistle-blowing reporters. As future work, we plan to introduce support for solving CAPTCHAs, which represent a barrier for CARONTE, as well as to consider website fingerprinting to infer browser-based monitoring capabilities.
CARONTE will be made available to interested researchers.

REFERENCES

[1] Firefox zero-day exploit used by FBI to shut down child porn on Tor network hosting; Tor Mail compromised, Aug 2013.
[2] Top 10 CAPTCHA solving services compared, 2018.
[3] Zerodium discloses flaw that allows code execution in Tor Browser, Sep 2018.
[4] BlackTDS: an infrastructure for massive-scale malware and phishing distribution on demand. Forums linked in the page: https://blacktds.com/.
[5] Abbasi, A., Li, W., Benjamin, V., Hu, S., and Chen, H. Descriptive analytics: Examining expert hackers in web forums. In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint (2014), IEEE, pp. 56–63.
[6] Allodi, L. Economic factors of vulnerability trade and exploitation. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), ACM, pp. 1483–1499.
[7] Allodi, L., Corradin, M., and Massacci, F. Then and now: On the maturity of the cybercrime markets. IEEE Transactions on Emerging Topics in Computing (2015).
[8] Baeza-Yates, R., Ribeiro, B. d. A. N., et al. Modern Information Retrieval. New York: ACM Press; Harlow, England: Addison-Wesley, 2011.
[9] Bai, Q., Xiong, G., Zhao, Y., and He, L. Analysis and detection of bogus behavior in web crawler measurement. Procedia Computer Science 31 (2014), 1084–1091.
[10] Balla, A., Stassopoulou, A., and Dikaiakos, M. D. Real-time web crawler detection. In Telecommunications (ICT), 2011 18th International Conference on (2011), IEEE, pp. 428–432.
[11] Benjamin, V., Li, W., Holt, T., and Chen, H. Exploring threats and vulnerabilities in hacker web: Forums, IRC and carding shops. In Intelligence and Security Informatics (ISI), 2015 IEEE International Conference on (2015), IEEE, pp. 85–90.
[12] Box, G. E., and Hunter, J. S. The 2^(k-p) fractional factorial designs. Technometrics 3, 3 (1961), 311–351.
[13] Cai, R., Yang, J.-M., Lai, W., Wang, Y., and Zhang, L. iRobot: An intelligent crawler for web forums. In Proceedings of the 17th International Conference on World Wide Web (2008), ACM, pp. 447–456.
[14] Doran, D., and Gokhale, S. S. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery 22, 1-2 (2011), 183–210.
[15] Doran, D., Morillo, K., and Gokhale, S. S. A comparison of web robot and human requests. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (2013), ACM, pp. 1374–1380.
[16] Dupret, G., and Liao, C. A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (2010), ACM, pp. 181–190.
[17] Dupret, G. E., and Piwowarski, B. A user browsing model to predict search engine click data from past observations. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2008), ACM, pp. 331–338.
[18] Fallmann, H., Wondracek, G., and Platzer, C. Covertly probing underground economy marketplaces. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (2010), Springer, pp. 101–110.
[19] Ford, R., and Ray, H. Googling for gold: Web crawlers, hacking and defense explained. Network Security 2004, 1 (2004), 10–13.
[20] Guo, F., Liu, C., and Wang, Y. M. Efficient multiple-click models in web search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (2009), ACM, pp. 124–131.
[21] Guo, Q., and Agichtein, E. Beyond dwell time: estimating document relevance from cursor movements and other post-click searcher behavior. In Proceedings of the 21st International Conference on World Wide Web (2012), ACM, pp. 569–578.
[22] Herley, C., and Florencio, D. Nobody sells gold for the price of silver: Dishonesty, uncertainty and the underground economy. Springer Economics of Information Security and Privacy (2010).
[23] Hine, G. E., Onaolapo, J., De Cristofaro, E., Kourtellis, N., Leontiadis, I., Samaras, R., Stringhini, G., and Blackburn, J. Kek, cucks, and god emperor Trump: A measurement study of 4chan's politically incorrect forum and its effects on the web. In Eleventh International AAAI Conference on Web and Social Media (2017).
[24] Jacob, G., Kirda, E., Kruegel, C., and Vigna, G. PubCrawl: Protecting users and businesses from crawlers. In USENIX Security Symposium (2012), pp. 507–522.
[25] Korakakis, M., Magkos, E., and Mylonas, P. Automated CAPTCHA solving: An empirical comparison of selected techniques. In SMAP (2014), pp. 44–47.
[26] Kwon, S., Kim, Y.-G., and Cha, S. Web robot detection based on pattern-matching technique. Journal of Information Science 38, 2 (2012), 118–126.
[27] Lai, Y.-M., Zheng, X., Chow, K., Hui, L. C., and Yiu, S.-M. Automatic online monitoring and data-mining internet forums. In Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), 2011 Seventh International Conference on (2011), IEEE, pp. 384–387.
[28] Lim, W.-Y., Raja, V., and Thing, V. L. Generalized and lightweight algorithms for automated web forum content extraction. In (2013), IEEE, pp. 1–8.
[29] McNair, J. What is the average reading speed and the best rate of reading? http://ezinearticles.com/?What-is-the-Average-Reading-Speed-and-the-Best-Rate-of-Reading?&id=2298503 (2009).
[30] Nayak, K., Marino, D., Efstathopoulos, P., and Dumitraş, T. Some vulnerabilities are different than others. In Proc. of RAID'14. Springer, 2014, pp. 426–446.
[31] Oerlemans, J.-J., et al. Investigating Cybercrime. PhD thesis, 2017.
[32] Pastrana, S., Thomas, D. R., Hutchings, A., and Clayton, R. CrimeBB: Enabling cybercrime research on underground forums at scale. In Proceedings of the 2018 World Wide Web Conference on World Wide Web (2018), International World Wide Web Conferences Steering Committee, pp. 1845–1854.
[33] Portnoff, R. S., Afroz, S., Durrett, G., Kummerfeld, J. K., Berg-Kirkpatrick, T., McCoy, D., Levchenko, K., and Paxson, V. Tools for automated analysis of cybercriminal markets. In Proceedings of the 26th International Conference on World Wide Web (2017), International World Wide Web Conferences Steering Committee, pp. 657–666.
[34] Qassrawi, M. T., and Zhang, H. Client honeypots: Approaches and challenges. In New Trends in Information Science and Service Science (NISS), 2010 4th International Conference on (2010), IEEE, pp. 19–25.
[35] Sardar, T. H., and Ansari, Z. Detection and confirmation of web robot requests for cleaning the voluminous web log data. In IMpact of E-Technology on US (IMPETUS), 2014 International Conference on the (2014), IEEE, pp. 13–19.
[36] Soska, K., and Christin, N. Measuring the longitudinal evolution of the online anonymous marketplace ecosystem. In USENIX Security Symposium (2015), pp. 33–48.
[37] Stevanovic, D., An, A., and Vlajic, N. Feature evaluation for web crawler detection with data mining techniques. Expert Systems with Applications 39, 10 (2012), 8707–8717.
[38] Van Wegberg, R., Tajalizadehkhoob, S., Soska, K., Akyazi, U., Ganan, C. H., Klievink, B., Christin, N., and Van Eeten, M. Plug and prey? Measuring the commoditization of cybercrime via online anonymous markets. In USENIX Security Symposium (USENIX Security 18) (2018), pp. 1009–1026.
[39] Von Ahn, L., Blum, M., Hopper, N. J., and Langford, J. CAPTCHA: Using hard AI problems for security. In International Conference on the Theory and Applications of Cryptographic Techniques (2003), Springer, pp. 294–311.
[40] Yip, M., Shadbolt, N., and Webber, C. Why forums? An empirical analysis into the facilitating factors of carding forums.
[41] Zhang, D., Zhang, D., and Liu, X. A novel malicious web crawler detector: Performance and evaluation. International Journal of Computer Science Issues (IJCSI) 10, 1 (2013), 121.
[42] Zhang, Y., Fan, Y., Hou, S., Liu, J., Ye, Y., and Bourlai, T. iDetector: Automate underground forum analysis based on heterogeneous information network. In (2018), IEEE, pp. 1071–1078.
[43] Zhao, Q., Willemsen, M. C., Adomavicius, G., Harper, F. M., and Konstan, J. A. Interpreting user inaction in recommender systems. In