Essential Characteristics of Approximate Matching Algorithms: A Survey of Practitioners' Opinions and Requirements Regarding Approximate Matching
Monika Singh
Indraprastha Institute of Information Technology, Delhi (IIIT-D), Delhi, India
ABSTRACT
Digital forensic investigation has become more challenging due to the rapid growth in the volume of encountered data. It is difficult for an investigator to examine the entire volume of encountered data manually. Approximate matching algorithms are used to serve this purpose by automatically filtering correlated and relevant data that an investigator needs to examine manually. Presently, several prominent approximate matching tools and techniques are used to assist the critical investigation process. However, to measure the guarantees of a tool, it is important to understand the exact requirements of an investigator regarding these algorithms. This paper presents the findings of a closed survey conducted among a highly experienced group of federal, state, and local law enforcement practitioners and researchers, aimed at understanding the practitioners' and researchers' opinions regarding approximate matching algorithms. The study provides the baseline attributes that an approximate matching scheme should possess to meet the real requirements of an investigator.
Keywords: Digital Forensics, Approximate Matching Algorithm, Survey, Tools, Similarity Hashing
1. INTRODUCTION
In today's digital era, we are surrounded by an enormous amount of digital data: computer hard disks, external hard drives, pen drives, mobile storage, flash drives, tablets, etc. Hence, at a crime scene, an investigator is confronted with several terabytes of digital data, an enormous volume for an investigator to examine manually. A major requirement of today's forensic investigation process is the capability to filter the relevant data out of the total volume obtained at a crime scene, so that it can be examined manually in a reasonable period of time. Filtering is typically done by matching the case files against the NIST reference data set. The filtering process can be performed in the following two ways: 1) blacklisting, 2) whitelisting.
Blacklisting is the process of filtering data by matching it against a set of known-to-be-bad files (as determined by the investigator). The files matched by this process are the ones an investigator needs to examine closely.
Whitelisting is the process of filtering by matching files against a set of known-to-be-good files. Files passing this process need not be examined by the investigator.

Nowadays, approximate matching algorithms are used to perform the filtering. Traditional cryptographic hash functions cannot serve this purpose, as even a single-bit change in the file content is expected to modify the entire digest randomly; this is useful for finding exact duplicates, but here the requirement is to find similarity.

'Approximate matching' is a generic technique for finding similarity among given files, typically by assigning a 'similarity score.' An approximate matching technique can be characterized into one of the following categories: bytewise matching, syntactic matching, and semantic matching. Bytewise matching measures similarity at the byte level of the digital object without considering the internal structure of the data object; these techniques are known as fuzzy hashing or similarity hashing. Syntactic matching defines similarity based on the internal structure of the data object. Semantic matching measures similarity based on the contextual attributes of the digital objects; it is also known as perceptual hashing or robust hashing.

The most prominent and commonly used approximate matching schemes include ssdeep, sdhash, mrsh, etc. However, there are several cases in which most existing schemes fail to find similarity: for example, similarity between a text document and a PDF document with the same content, a colored and a grayscale version of the same image, or different format (jpg, bmp, etc.) representations of the same image. The purpose of these algorithms is to assist the critical forensic investigation process by filtering out the relevant documents.
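The avalanche behavior of cryptographic hashes mentioned above is easy to demonstrate. The following sketch (using Python's standard hashlib, not any of the surveyed tools) flips a single bit of an input and counts how many SHA-256 digest bits change; roughly half the bits differ, which is why such digests can only confirm exact duplicates, never gradual similarity:

```python
import hashlib

# Two inputs differing in exactly one bit.
original = b"The quick brown fox jumps over the lazy dog"
modified = bytearray(original)
modified[0] ^= 0x01  # flip the lowest bit of the first byte

digest_a = hashlib.sha256(bytes(original)).digest()
digest_b = hashlib.sha256(bytes(modified)).digest()

# Count how many of the 256 digest bits differ between the two hashes.
diff_bits = sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))
print(f"{diff_bits} of 256 digest bits differ")  # roughly half
```

A similarity digest, by contrast, must change only locally when the input changes locally, so that a distance or score can still be computed between the two digests.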
Hence, it is important to understand the properties an algorithm really needs to possess in order to filter out the actually relevant documents. We therefore conducted a survey within a closed group of highly experienced and trained digital investigators, practitioners, and researchers to understand their requirements regarding these algorithms.
2. CURRENT TRENDS OF APPROXIMATE MATCHING SCHEMES
Approximate matching algorithms have been used since the year 2002. The first scheme, called dcfldd, was proposed by Nicholas Harbour (Harbour, 2002); it is a block-based hashing scheme. Later, an improvement of dcfldd known as Context Triggered Piecewise Hashing (CTPH) was proposed by Kornblum (Kornblum, 2006). The CTPH scheme is based on a spam email detection algorithm called spamsum, proposed by Tridgell (Tridgell, 2002). Instead of hashing fixed-size blocks, CTPH divides the data into variable-size blocks and hashes each block using a non-cryptographic hash function called FNV. The CTPH tool is known as ssdeep. Breitinger et al. (Breitinger, Baier, & Beckingham, 2012) presented a thorough analysis of ssdeep and showed that ssdeep does not withstand an active adversary for blacklisting and whitelisting.

Roussev (Roussev, 2010) proposed a new scheme called sdhash in the year 2010. The basic idea of sdhash is to generate the final hash using only statistically improbable features of the document. A detailed security and implementation analysis of sdhash is presented by Breitinger et al. (Breitinger et al., 2012). This work uncovered several implementation and security issues and showed that it is possible to beat the similarity score by tampering with a given file without changing its perceptual behavior (e.g., the image files look almost the same despite the tampering). The claims of (Breitinger et al., 2012) were verified again by Chang et al. (Chang, Sanadhya, Singh, & Verma, 2015). That paper also shows an attack method which can mislead the investigator with many forged similar files. Furthermore, Roussev (Roussev, 2011) has shown that sdhash outperforms ssdeep in terms of both accuracy and scalability.

Another scheme, known as bbHash (Breitinger & Baier, 2012), was proposed by Breitinger et al. in the year 2012. However, because of its high runtime, bbHash is not practically usable. Breitinger et al. proposed a further scheme, the mvHash-B similarity preserving hashing (Breitinger, Astebol, Baier, & Busch, 2013), in the year 2013. The scheme works in three phases: first, it compresses the input data using majority voting, then performs run-length encoding, and finally stores the fingerprint in Bloom filters. The 'B' in mvHash-B denotes the Bloom filter representation of the similarity digest. In terms of performance, mvHash-B is one of the most efficient schemes among all existing schemes, with the lowest run-time complexity and a small digest size. A thorough analysis of mvHash-B is presented by Chang et al. (Chang, Sanadhya, & Singh, 2016). The paper uncovers weaknesses of the mvHash-B scheme, shows that mvHash-B does not withstand an active adversary against the blacklist, and also proposes an improvement to the mvHash-B design to overcome the weakness.

Apart from the above-mentioned schemes, several forensic tools use different filtering techniques. FTK performs cluster analysis to find related documents and near duplicates. EnCase uses the Entropy Near Match Analyzer to discover similar files. X-Ways implemented a new technology called FuzZyDoc to identify known documents.
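The context-triggered idea behind CTPH can be illustrated as follows. This is a simplified sketch, not ssdeep's actual implementation: the window size, trigger modulus, and the toy rolling hash are illustrative assumptions; only the FNV-1a constants and the content-defined-boundary idea come from the scheme described above.

```python
WINDOW = 7    # rolling-hash window size (illustrative choice)
TRIGGER = 16  # boundary whenever rolling hash % TRIGGER == TRIGGER - 1

def fnv1a(data: bytes) -> int:
    """64-bit FNV-1a hash of one chunk (non-cryptographic)."""
    h = 0xcbf29ce484222325
    for byte in data:
        h = ((h ^ byte) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def ctph_sketch(data: bytes) -> list:
    """Split data at content-defined boundaries and FNV-hash each chunk."""
    digests, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1):i + 1]
        if sum(window) % TRIGGER == TRIGGER - 1:  # toy rolling hash
            digests.append(fnv1a(data[start:i + 1]))
            start = i + 1
    digests.append(fnv1a(data[start:]))  # trailing chunk
    return digests

# Because boundaries depend on local content rather than fixed offsets,
# a single local edit disturbs only the chunks around it, and most
# chunk hashes remain shared between the two inputs.
data_a = bytes(range(256)) * 4
data_b = bytearray(data_a)
data_b[512] ^= 0x01  # one-bit local edit in the middle
chunks_a = ctph_sketch(data_a)
chunks_b = ctph_sketch(bytes(data_b))
shared = len(set(chunks_a) & set(chunks_b))
print(f"{shared} shared chunk hashes ({len(chunks_a)} vs {len(chunks_b)})")
```

The real ssdeep additionally encodes each chunk hash as a single base64 character and adapts the trigger block size to the input length, so that two digests can be compared with an edit-distance measure.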
3. SURVEY METHODOLOGY
The baseline definition and terminology of approximate matching algorithms are already given by NIST Special Publication 800-168 (Breitinger, Guttman, McCarrin, & Roussev, 2014), which defines the properties at a general and broad level. However, the similarity definition, as well as the requirements, vary for different data object types. For example, files with similar perceived text content may have entirely different structures; if an inappropriate algorithm is applied, the files would appear entirely different and the actual similarities would remain unnoticed. A color and a grayscale version of the same image would be reported as completely different by most existing schemes. Hence there is a strong need to define the properties based on practical requirements.

The aim of the survey was to establish the key characteristics of approximate matching algorithms based on the requirements of digital forensics practitioners and researchers, with an understanding of their different perspectives towards these algorithms. Using the results of the survey, our aim for future work is to build a real data set with known similarity and to evaluate the existing schemes against the key characteristics using that data set.
The survey contains a brief introduction to the topic of the study and the purpose of the survey, followed by 10 questions consisting of:
•
•
•
•
The results of the survey are presented in this section based on all the received responses. Where appropriate, further analysis based upon the perspectives of researchers and practitioners is also presented.
Demographic information of the survey is represented in Figure 1. More than 75% of the participants had more than 10 years of experience, as shown in Figure 1. All of the participants were from the United States of America and were working as federal, state, or local law enforcement practitioners or researchers.

Figure 1: Demographic Data
Results show that around 65% of the total participants were aware of and are using approximate matching algorithms during the investigation process. Among them, 45% were aware that their tools perform approximate filtering but were unaware of the technique/scheme used by the tool, whereas the remaining participants were well aware of the techniques used by their tools or of the specific technique they use. The following schemes were mentioned by the participants as being used by practitioners these days: cluster analysis of FTK, FuzZyDoc of X-Ways, md5deep, CodeSuite, PhotoDNA, sdhash, and ssdeep; some practitioners are also using self-constructed tf-idf based schemes. This shows that most practitioners find approximate matching algorithms useful and are willing to use them to filter out relevant data. However, since there is no consensus on one standard scheme, all of them use techniques that are either used by default by their tools or that they find most useful for a particular case based on personal experience.
Participants were asked to scale the uses of approximate matching algorithms based on their professional experience. The list of applications, shown in Figure 2, was derived by examining the existing literature. Responders were also allowed to add more uses to the list.

Figure 2: Applications of Approximate Matching Algorithms

• Based on all of the received responses, approximate matching algorithms are most frequently used to identify related documents.
• Fragment identification was ranked second in the list. Fragment identification is the identification of a document based on a piece of its data.
• The following are the other important uses, in order of their ranking:
  – Network correlation (data packet reconstruction from fragmented files over the network)
  – Embedded object identification (e.g., a jpeg within a Word document)
  – Identification of the code version (identification of a patched or upgraded version of software)

However, approximate matching algorithms do not work well for network correlation if the traffic is encrypted.

On further examination of the type of data object that needs to be filtered in the most frequent cases, the participants were asked to rate among text, image, executable file, and multimedia. All of the received responses show that text and images are the most commonly appearing data types that need to be filtered. Executable and multimedia files also need to be filtered in some cases. Hence we can state that one of the important characteristics of a scheme is its ability to perform related-document correlation and fragment identification for the text and image data types.
Participants were asked to choose, among existing measures, the one that in their opinion represents the similarity of two textual documents most closely. The proposed measures were 'edit distance,' 'length of the longest common substring,' and 'length of the longest common subsequence.' Edit distance is a way of calculating the dissimilarity between two text sequences by counting the minimum number of operations required to transform one sequence into the other. The length of the longest common substring denotes the length of the common contiguous substring of maximal length. The length of the longest common subsequence of two text sequences signifies the length of the common subsequence of maximal length, where the subsequence might not appear in a contiguous fashion but preserves the ordering of characters. All of these measures were taken from the current literature; participants were also allowed to add new measures. The purpose of this question was to find the key measure of real similarity between two text documents. Figure 3 shows that, based on the experience of the responders, 'length of the longest common substring' defines the similarity between two documents best. Hence this measure should be used to build the ground truth for a text data set with known similarity. However, on further analysis of the responses, we realized this should be studied further to reach a persuasive conclusion.

Figure 3: Key Measure to Identify the Ground Truth
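The three candidate measures above have standard dynamic-programming definitions; the following is a minimal reference sketch of each, useful when comparing them on concrete document pairs:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum insertions/deletions/substitutions to turn a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j  # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
    return d[m][n]

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1  # extend the current run
                best = max(best, t[i][j])
    return best

def longest_common_subsequence(a: str, b: str) -> int:
    """Length of the longest ordered, not necessarily contiguous, match."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))                 # 3
print(longest_common_substring("forensic", "forensics"))  # 8
print(longest_common_subsequence("ABCBDAB", "BDCABA"))    # 4
```

Note the contrast the survey question probes: a single insertion in the middle of a document barely affects the longest common subsequence, but can cut the longest common substring in half, which is one reason the responses deserve the further analysis mentioned above.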
Participants were asked to rank the kinds of similar images that they need to filter out during real case investigations. Figure 4 shows that all of the listed image similarities are almost equally important from the practitioner's point of view. However, the ability of a scheme to find similarity between different formats is ranked highest. Hence it would be useful if a technique could filter out all of the listed image similarities.

Figure 4: Image Similarity
The purpose of this question was to understand the similarity definition for a program file based on the requirements of an investigator. Figure 5 shows the results. All of the listed similarity characteristics were chosen by examining the current literature, and participants were allowed to add missing properties. From the received responses, we can say that one of the essential abilities of a tool for program-file similarity is to recognize the same program with different variable names and looping constructs. According to practitioners, most of the existing tools produce high false-positive rates because they flag reusable components such as DLLs, icons, and other resources obtained from software libraries.

Figure 5: Executable Program File Similarity
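One way to meet the "same program, different variable names" requirement is to canonicalize identifiers before hashing. The following is a toy sketch under simplifying assumptions (regex-based tokenization, Python keywords only); it is not how any of the surveyed tools work, but it shows why byte-level matching fails on renamed programs and how a syntactic normalization step recovers the match:

```python
import hashlib
import keyword
import re

def canonical_digest(src: str) -> str:
    """Rename identifiers to v0, v1, ... in order of first appearance,
    then hash the normalized source. Toy sketch: a real tool would use a
    proper lexer rather than a regex over the raw text."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if keyword.iskeyword(name):
            return name  # keep language keywords as-is
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    normalized = re.sub(r"\b[A-Za-z_]\w*\b", rename, src)
    return hashlib.sha256(normalized.encode()).hexdigest()

# Two programs identical up to variable renaming: their bytes differ,
# but their canonical digests agree.
prog_a = "total = 0\nfor x in items:\n    total = total + x\n"
prog_b = "acc = 0\nfor item in values:\n    acc = acc + item\n"
print(canonical_digest(prog_a) == canonical_digest(prog_b))  # True
```

The same idea does not help with the false-positive problem mentioned above; filtering out known reusable components (DLLs, icons, library resources) requires a whitelist of their digests instead.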
Participants were asked whether the volume of data obtained from crime scenes that contains file-structure information, such as the MFT (Master File Table) for Windows files, inodes, etc., has increased. From the results, we can say that in more than 70% of cases the file-structure information is available; hence this information can be used to improve the filtering abilities of the tools. We also found that the amount of application-specific data has increased, such as SQLite data on mobile devices. Hence a major challenge for approximate matching algorithms is the ability to find similarity across device-specific data.
4. CONCLUSIONS AND FUTURE WORK
Automated filtering has become one of the essential requirements of today's digital investigation process, and approximate matching algorithms are being used to perform the filtering. Approximate matching is a relatively new area of digital forensics, which is still evolving and needs to be defined more formally. The aim of the survey was to gain insight into practitioners' and researchers' opinions regarding this new class of algorithms.

Figure 6: File System Information

From the results of the study, we found that practitioners and researchers acknowledge the utility of approximate matching algorithms and are willing to use them to speed up the investigation process. However, since there is no standard way to measure the guarantees of the algorithms, it is difficult to find the appropriate tools. Hence this paper presents the practitioners' and researchers' requirements and perspectives towards approximate matching algorithms. Future work must focus on formalizing the properties of approximate matching algorithms and providing a well-defined evaluation framework to evaluate existing and future approximate matching schemes.
REFERENCES
Breitinger, F., Astebol, K. P., Baier, H., & Busch, C. (2013). mvHash-B: A new approach for similarity preserving hashing. In Seventh International Conference on IT Security Incident Management and IT Forensics, IMF 2013, Nuremberg, Germany, March 12-14, 2013 (pp. 33-44).

Breitinger, F., & Baier, H. (2012). A fuzzy hashing approach based on random sequences and Hamming distance. In Proceedings of the Conference on Digital Forensics, Security and Law (pp. 89-100).

Breitinger, F., Baier, H., & Beckingham, J. (2012). Security and implementation analysis of the similarity digest sdhash. In First International Baltic Conference on Network Security & Forensics (NeSeFo).

Breitinger, F., Guttman, B., McCarrin, M., & Roussev, V. (2014). Approximate matching: Definition and terminology. NIST Special Publication 800-168. http://csrc.nist.gov/publications/drafts/800-168/sp800_168_draft.pdf

Chang, D., Sanadhya, S. K., & Singh, M. (2016). Security analysis of mvhash-b similarity hashing. The Journal of Digital Forensics, Security and Law: JDFSL, (2), 21.

Chang, D., Sanadhya, S. K., Singh, M., & Verma, R. (2015). A collision attack on sdhash similarity hashing. In Proceedings of the 10th Intl. Conference on Systematic Approaches to Digital Forensic Engineering (pp. 36-46).

Harbour, N. (2002). dcfldd. Defense Computer Forensics Lab.

Kornblum, J. D. (2006). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, (Supplement-1), 91-97. doi: 10.1016/j.diin.2006.06.015

Roussev, V. (2010). Data fingerprinting with similarity digests. In K. Chow & S. Shenoi (Eds.), Advances in Digital Forensics VI: Sixth IFIP WG 11.9 International Conference on Digital Forensics, Hong Kong, China, January 4-6, 2010, Revised Selected Papers (Vol. 337, pp. 207-226). Springer. doi: 10.1007/978-3-642-15506-2_15

Roussev, V. (2011, August). An evaluation of forensic similarity hashes. Digital Investigation, S34-S41. doi: 10.1016/j.diin.2011.05.005

Tridgell, A. (2002). Spamsum README.