Essential Characteristics of Approximate Matching Algorithms: A Survey of Practitioners' Opinions and Requirements Regarding Approximate Matching
Monika Singh
Indraprastha Institute of Information Technology, Delhi (IIIT-D), Delhi, India
ABSTRACT
Digital forensic investigation has become more challenging due to the rapid growth in the volume of encountered data. It is difficult for an investigator to examine the entire volume of encountered data manually. Approximate matching algorithms are used to serve this purpose by automatically filtering correlated and relevant data that an investigator needs to examine manually. Presently, several prominent approximate matching tools and techniques are used to assist the critical investigation process. However, to measure the guarantees of a tool, it is important to understand the exact requirements of an investigator regarding these algorithms. This paper presents the findings of a closed survey conducted among a highly experienced group of federal, state, and local law enforcement practitioners and researchers, aimed at understanding the practitioners' and researchers' opinions regarding approximate matching algorithms. The study provides the baseline attributes that an approximate matching scheme should possess to meet the real requirements of an investigator.
Keywords: Digital Forensics, Approximate Matching Algorithm, Survey, Tools, Similarity Hashing
1. INTRODUCTION
In today's digital era, we are surrounded by an enormous amount of digital data: computer hard disks, external hard drives, pen drives, mobile storage, flash drives, tablets, etc. Hence, at a crime scene, an investigator is confronted with several terabytes of digital data, an enormous volume for an investigator to examine manually. A major requirement of today's forensic investigation process is the capability to filter the relevant data out of the total volume obtained at a crime scene, so that it can be examined manually in a reasonable period of time. Filtering is typically done by matching the case files against the NIST reference data set. The filtering process can be performed in the following two ways: 1) blacklisting, 2) whitelisting.
Blacklisting is the process of filtering data by matching it against a set of known-to-be-bad files (as determined by the investigator). The files matched by this process are the ones an investigator needs to examine closely.
Whitelisting is the process of filtering by matching files against a set of known-to-be-good files. Files passing this process need not be examined by the investigator.

Nowadays, approximate matching algorithms are used to perform the filtering. Traditional cryptographic hash functions cannot serve this purpose, as even a single-bit change in the file content is expected to modify the entire digest randomly; this is useful for finding exact duplicates, but here the requirement is to find similarity.

'Approximate matching' is a generic technique for finding similarity among given files, typically by assigning a 'similarity score.' An approximate matching technique can be characterized into one of the following categories: bytewise matching, syntactic matching, and semantic matching. Bytewise matching measures similarity at the byte level of the digital object without considering the internal structure of the data object; these techniques are known as fuzzy hashing or similarity hashing. Syntactic matching defines similarity based on the internal structure of the data object. Semantic matching measures similarity based on the contextual attributes of the digital objects; it is also known as perceptual hashing or robust hashing.

The most prominent and commonly used approximate matching schemes include ssdeep, sdhash, mrsh, etc. However, there are several cases in which most existing schemes fail to find similarity: for example, similarity between a text document and a PDF document with the same content, a colored and a grayscale version of the same image, or different format (jpg, bmp, etc.) representations of the same image. The purpose of these algorithms is to assist the critical forensic investigation process by filtering out the relevant documents.
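The avalanche behavior of cryptographic hashes mentioned above is easy to demonstrate. The following sketch (using Python's standard hashlib, not any of the surveyed tools) flips a single bit of an input and counts how many SHA-256 digest bits change; roughly half the bits differ, which is why such digests can only confirm exact duplicates, never gradual similarity:

```python
import hashlib

# Two inputs differing in exactly one bit.
original = b"The quick brown fox jumps over the lazy dog"
modified = bytearray(original)
modified[0] ^= 0x01  # flip the lowest bit of the first byte

digest_a = hashlib.sha256(bytes(original)).digest()
digest_b = hashlib.sha256(bytes(modified)).digest()

# Count how many of the 256 digest bits differ between the two hashes.
diff_bits = sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))
print(f"{diff_bits} of 256 digest bits differ")  # roughly half
```

A similarity digest, by contrast, must change only locally when the input changes locally, so that a distance or score can still be computed between the two digests.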
Hence, it is important to understand the properties an algorithm really needs to possess in order to filter out the actually relevant documents. We therefore conducted a survey within a closed group of highly experienced and trained digital investigators, practitioners, and researchers to understand their requirements regarding these algorithms.
2. CURRENT TRENDS OF APPROXIMATE MATCHING SCHEMES
Approximate matching algorithms have been used since the year 2002. The first scheme, called dcfldd, was proposed by Nicholas Harbour (Harbour, 2002); it is a block-based hashing scheme. Later, an improvement of dcfldd known as Context Triggered Piecewise Hashing (CTPH) was proposed by Kornblum (Kornblum, 2006). The CTPH scheme is based on a spam email detection algorithm called spamsum, proposed by Tridgell (Tridgell, 2002). Instead of hashing fixed-size blocks, CTPH divides the data into variable-size blocks and hashes each block using a non-cryptographic hash function called FNV. The CTPH tool is known as ssdeep. Breitinger et al. (Breitinger, Baier, & Beckingham, 2012) presented a thorough analysis of ssdeep and showed that ssdeep does not withstand an active adversary for blacklisting and whitelisting.

Roussev (Roussev, 2010) proposed a new scheme called sdhash in the year 2010. The basic idea of sdhash is to generate the final hash using only statistically improbable features of the document. A detailed security and implementation analysis of sdhash is presented by Breitinger et al. (Breitinger et al., 2012). This work uncovered several implementation and security issues and showed that it is possible to beat the similarity score by tampering with a given file without changing its perceptual behavior (e.g., the image files look almost the same despite the tampering). The claims of (Breitinger et al., 2012) were verified again by Chang et al. (Chang, Sanadhya, Singh, & Verma, 2015). That paper also shows an attack method which can mislead the investigator with many forged similar files. Furthermore, Roussev (Roussev, 2011) has shown that sdhash outperforms ssdeep in terms of both accuracy and scalability.

Another scheme, known as bbHash (Breitinger & Baier, 2012), was proposed by Breitinger et al. in the year 2012. However, because of its high runtime, bbHash is not practically usable. Breitinger et al. proposed a further scheme, the mvHash-B similarity preserving hashing (Breitinger, Astebol, Baier, & Busch, 2013), in the year 2013. The scheme works in three phases: first, it compresses the input data using majority voting, then performs run-length encoding, and finally stores the fingerprint in Bloom filters. The 'B' in mvHash-B denotes the Bloom filter representation of the similarity digest. In terms of performance, mvHash-B is one of the most efficient schemes among all existing schemes, with the lowest run-time complexity and a small digest size. A thorough analysis of mvHash-B is presented by Chang et al. (Chang, Sanadhya, & Singh, 2016). The paper uncovers weaknesses of the mvHash-B scheme, shows that mvHash-B does not withstand an active adversary against the blacklist, and also proposes an improvement to the mvHash-B design to overcome the weakness.

Apart from the above-mentioned schemes, several forensic tools use different filtering techniques. FTK performs cluster analysis to find related documents and near duplicates. EnCase uses the Entropy Near Match Analyzer to discover similar files. X-Ways implemented a new technology called FuzZyDoc to identify known documents.
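The context-triggered idea behind CTPH can be illustrated as follows. This is a simplified sketch, not ssdeep's actual implementation: the window size, trigger modulus, and the toy rolling hash are illustrative assumptions; only the FNV-1a constants and the content-defined-boundary idea come from the scheme described above.

```python
WINDOW = 7    # rolling-hash window size (illustrative choice)
TRIGGER = 16  # boundary whenever rolling hash % TRIGGER == TRIGGER - 1

def fnv1a(data: bytes) -> int:
    """64-bit FNV-1a hash of one chunk (non-cryptographic)."""
    h = 0xcbf29ce484222325
    for byte in data:
        h = ((h ^ byte) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def ctph_sketch(data: bytes) -> list:
    """Split data at content-defined boundaries and FNV-hash each chunk."""
    digests, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1):i + 1]
        if sum(window) % TRIGGER == TRIGGER - 1:  # toy rolling hash
            digests.append(fnv1a(data[start:i + 1]))
            start = i + 1
    digests.append(fnv1a(data[start:]))  # trailing chunk
    return digests

# Because boundaries depend on local content rather than fixed offsets,
# a single local edit disturbs only the chunks around it, and most
# chunk hashes remain shared between the two inputs.
data_a = bytes(range(256)) * 4
data_b = bytearray(data_a)
data_b[512] ^= 0x01  # one-bit local edit in the middle
chunks_a = ctph_sketch(data_a)
chunks_b = ctph_sketch(bytes(data_b))
shared = len(set(chunks_a) & set(chunks_b))
print(f"{shared} shared chunk hashes ({len(chunks_a)} vs {len(chunks_b)})")
```

The real ssdeep additionally encodes each chunk hash as a single base64 character and adapts the trigger block size to the input length, so that two digests can be compared with an edit-distance measure.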
3. SURVEY METHODOLOGY
The baseline definition and terminology of approximate matching algorithms are already given by NIST Special Publication 800-168 (Breitinger, Guttman, McCarrin, & Roussev, 2014), which defines the properties at a general and broad level. However, the similarity definition, as well as the requirements, vary for different data object types. For example, files with similar perceived text content may have entirely different structures; if an inappropriate algorithm is applied, the files would appear entirely different and the actual similarities would remain unnoticed. A color and a grayscale version of the same image would be reported as completely different by most existing schemes. Hence there is a strong need to define the properties based on practical requirements.

The aim of the survey was to establish the key characteristics of approximate matching algorithms based on the requirements of digital forensics practitioners and researchers, with an understanding of their different perspectives towards these algorithms. Using the results of the survey, our aim for future work is to build a real data set with known similarity and to evaluate the existing schemes against the key characteristics using that data set.
The survey contains a brief introduction to the topic of the study and the purpose of the survey, followed by 10 questions consisting of:
•
•
•
•
The results of the survey are presented in this section based on all the received responses. Where appropriate, further analysis based upon the perspectives of researchers and practitioners is also presented.
Demographic information of the survey is represented in Figure 1. More than 75% of the participants had more than 10 years of experience, as shown in Figure 1. All of the participants were from the United States of America and were working as federal, state, or local law enforcement practitioners or researchers.

Figure 1: Demographic Data
Results show that around 65% of the total participants were aware of and are using approximate matching algorithms during the investigation process. Among them, 45% were aware that their tools perform approximate filtering but were unaware of the technique/scheme used by the tool, whereas the remaining participants were well aware of the techniques used by their tools or of the specific technique they use. The following schemes were mentioned by the participants as being used by practitioners these days: cluster analysis of FTK, FuzZyDoc of X-Ways, md5deep, CodeSuite, PhotoDNA, sdhash, and ssdeep; some practitioners are also using self-constructed tf-idf based schemes. This shows that most practitioners find approximate matching algorithms useful and are willing to use them to filter out relevant data. However, since there is no consensus on one standard scheme, all of them use techniques that are either used by default by their tools or that they find most useful for a particular case based on personal experience.
Participants were asked to scale the uses of approximate matching algorithms based on their professional experience. The list of applications, shown in Figure 2, was derived by examining the existing literature. Responders were also allowed to add more uses to the list.

Figure 2: Applications of Approximate Matching Algorithms

• Based on all of the received responses, approximate matching algorithms are most frequently used to identify related documents.
• Fragment identification was ranked second in the list. Fragment identification is the identification of a document based on a piece of its data.
• The following are the other important uses, in order of their ranking:
  – Network correlation (data packet reconstruction from fragmented files over the network)
  – Embedded object identification (e.g., a jpeg within a Word document)
  – Identification of the code version (identification of a patched or upgraded version of software)

However, approximate matching algorithms do not work well for network correlation if the traffic is encrypted.

On further examination of the type of data object that needs to be filtered in the most frequent cases, the participants were asked to rate among text, image, executable file, and multimedia. All of the received responses show that text and images are the most commonly appearing data types that need to be filtered. Executable and multimedia files also need to be filtered in some cases. Hence we can state that one of the important characteristics of a scheme is its ability to perform related-document correlation and fragment identification for the text and image data types.
Participants were asked to choose, among existing measures, the one that in their opinion represents the similarity of two textual documents most closely. The proposed measures were 'edit distance,' 'length of the longest common substring,' and 'length of the longest common subsequence.' Edit distance is a way of calculating the dissimilarity between two text sequences by counting the minimum number of operations required to transform one sequence into the other. The length of the longest common substring denotes the length of the common contiguous substring of maximal length. The length of the longest common subsequence of two text sequences signifies the length of the common subsequence of maximal length, where the subsequence might not appear in a contiguous fashion but preserves the ordering of characters. All of these measures were taken from the current literature; participants were also allowed to add new measures. The purpose of this question was to find the key measure of real similarity between two text documents. Figure 3 shows that, based on the experience of the responders, 'length of the longest common substring' defines the similarity between two documents best. Hence this measure should be used to build the ground truth for a text data set with known similarity. However, on further analysis of the responses, we realized this should be studied further to reach a persuasive conclusion.

Figure 3: Key Measure to Identify the Ground Truth
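The three candidate measures above have standard dynamic-programming definitions; the following is a minimal reference sketch of each, useful when comparing them on concrete document pairs:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum insertions/deletions/substitutions to turn a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j  # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
    return d[m][n]

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1  # extend the current run
                best = max(best, t[i][j])
    return best

def longest_common_subsequence(a: str, b: str) -> int:
    """Length of the longest ordered, not necessarily contiguous, match."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))                 # 3
print(longest_common_substring("forensic", "forensics"))  # 8
print(longest_common_subsequence("ABCBDAB", "BDCABA"))    # 4
```

Note the contrast the survey question probes: a single insertion in the middle of a document barely affects the longest common subsequence, but can cut the longest common substring in half, which is one reason the responses deserve the further analysis mentioned above.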
Participants were asked to rank the kinds of similar images that they need to filter out during real case investigations. Figure 4 shows that all of the listed image similarities are almost equally important from the practitioner's point of view. However, the ability of a scheme to find similarity between different formats is ranked highest. Hence it would be useful if a technique could filter out all of the listed image similarities.

Figure 4: Image Similarity
The purpose of this question was to understand the similarity definition for a program file based on the requirements of an investigator. Figure 5 shows the results. All of the listed similarity characteristics were chosen by examining the current literature, and participants were allowed to add missing properties. From the received responses, we can say that one of the essential abilities of a tool for program-file similarity is to recognize the same program with different variable names and looping constructs. According to practitioners, most of the existing tools produce high false-positive rates because they flag reusable components such as DLLs, icons, and other resources obtained from software libraries.

Figure 5: Executable Program File Similarity
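One way to meet the "same program, different variable names" requirement is to canonicalize identifiers before hashing. The following is a toy sketch under simplifying assumptions (regex-based tokenization, Python keywords only); it is not how any of the surveyed tools work, but it shows why byte-level matching fails on renamed programs and how a syntactic normalization step recovers the match:

```python
import hashlib
import keyword
import re

def canonical_digest(src: str) -> str:
    """Rename identifiers to v0, v1, ... in order of first appearance,
    then hash the normalized source. Toy sketch: a real tool would use a
    proper lexer rather than a regex over the raw text."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if keyword.iskeyword(name):
            return name  # keep language keywords as-is
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    normalized = re.sub(r"\b[A-Za-z_]\w*\b", rename, src)
    return hashlib.sha256(normalized.encode()).hexdigest()

# Two programs identical up to variable renaming: their bytes differ,
# but their canonical digests agree.
prog_a = "total = 0\nfor x in items:\n    total = total + x\n"
prog_b = "acc = 0\nfor item in values:\n    acc = acc + item\n"
print(canonical_digest(prog_a) == canonical_digest(prog_b))  # True
```

The same idea does not help with the false-positive problem mentioned above; filtering out known reusable components (DLLs, icons, library resources) requires a whitelist of their digests instead.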
Participants were asked whether the volume of data obtained from crime scenes that contains file-structure information, such as the MFT (Master File Table) for Windows files, inodes, etc., has increased. From the results, we can say that in more than 70% of cases the file-structure information is available; hence this information can be used to improve the filtering abilities of the tools. We also found that the amount of application-specific data has increased, such as SQLite data on mobile devices. Hence a major challenge for approximate matching algorithms is the ability to find similarity across device-specific data.
4. CONCLUSIONS AND FUTURE WORK
Automated filtering has become one of the essential requirements of today's digital investigation process, and approximate matching algorithms are being used to perform the filtering. Approximate matching is a relatively new area of digital forensics, which is still evolving and needs to be defined more formally. The aim of the survey was to gain insight into practitioners' and researchers' opinions regarding this new class of algorithms.

Figure 6: File System Information

From the results of the study, we found that practitioners and researchers acknowledge the utility of approximate matching algorithms and are willing to use them to speed up the investigation process. However, since there is no standard way to measure the guarantees of the algorithms, it is difficult to find the appropriate tools. Hence this paper presents the practitioners' and researchers' requirements and perspectives towards approximate matching algorithms. Future work must focus on formalizing the properties of approximate matching algorithms and providing a well-defined evaluation framework to evaluate existing and future approximate matching schemes.
REFERENCES
Breitinger, F., Astebol, K. P., Baier, H., & Busch, C. (2013). mvHash-B: A new approach for similarity preserving hashing. In Seventh International Conference on IT Security Incident Management and IT Forensics, IMF 2013, Nuremberg, Germany, March 12-14, 2013 (pp. 33-44).

Breitinger, F., & Baier, H. (2012). A fuzzy hashing approach based on random sequences and Hamming distance. In Proceedings of the Conference on Digital Forensics, Security and Law (pp. 89-100).

Breitinger, F., Baier, H., & Beckingham, J. (2012). Security and implementation analysis of the similarity digest sdhash. In First International Baltic Conference on Network Security & Forensics (NeSeFo).

Breitinger, F., Guttman, B., McCarrin, M., & Roussev, V. (2014). Approximate matching: Definition and terminology. NIST Special Publication 800-168. http://csrc.nist.gov/publications/drafts/800-168/sp800_168_draft.pdf

Chang, D., Sanadhya, S. K., & Singh, M. (2016). Security analysis of mvhash-b similarity hashing. The Journal of Digital Forensics, Security and Law: JDFSL, (2), 21.

Chang, D., Sanadhya, S. K., Singh, M., & Verma, R. (2015). A collision attack on sdhash similarity hashing. In Proceedings of the 10th Intl. Conference on Systematic Approaches to Digital Forensic Engineering (pp. 36-46).

Harbour, N. (2002). dcfldd. Defense Computer Forensics Lab.

Kornblum, J. D. (2006). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, (Supplement-1), 91-97. doi: 10.1016/j.diin.2006.06.015

Roussev, V. (2010). Data fingerprinting with similarity digests. In K. Chow & S. Shenoi (Eds.), Advances in Digital Forensics VI: Sixth IFIP WG 11.9 International Conference on Digital Forensics, Hong Kong, China, January 4-6, 2010, Revised Selected Papers (Vol. 337, pp. 207-226). Springer. doi: 10.1007/978-3-642-15506-2_15

Roussev, V. (2011, August). An evaluation of forensic similarity hashes. Digital Investigation, S34-S41. doi: 10.1016/j.diin.2011.05.005

Tridgell, A. (2002). Spamsum README.