Looking for non-compliant documents using error messages from multiple parsers
MICHAEL ROBINSON
Abstract.
Whether a file is accepted by a single parser is not a reliable indication of whether a file complies with its stated format. Bugs within both the parser and the format specification mean that a compliant file may fail to parse, or that a non-compliant file might be read without any apparent trouble. The latter situation presents a significant security risk, and should be avoided. This article suggests that a better way to assess format specification compliance is to examine the set of error messages produced by a set of parsers rather than a single parser. If both a sample of compliant files and a sample of non-compliant files are available, then we show how a statistical test based on a pseudo-likelihood ratio can be very effective at determining a file's compliance. Our method is format agnostic, and does not directly rely upon a formal specification of the format. Although this article focuses upon the case of the PDF format (ISO 32000-2), we make no attempt to use any specific details of the format. Furthermore, we show how principal components analysis can be useful for a format specification designer to assess the quality and structure of these samples of files and parsers. While these tests are absolutely rudimentary, it appears that their use to measure file format variability and to identify non-compliant files is both novel and surprisingly effective.

1. Introduction
Modern file formats are often quite complex, yet the formal specifications for some common formats can be ambiguous or confusing. A single parser is therefore not a reliable arbiter of file format compliance: it may incorrectly deem a compliant file as non-compliant, or conversely it may parse a non-compliant file (perhaps with disastrous consequences). For widely used file formats, there are usually several readily available parsers. It is natural to ask if a statistical approach that leverages multiple existing parsers – but is otherwise format agnostic – might suffice to discriminate between compliant and non-compliant files.

This article describes an exploratory technique and a statistical test for identifying files whose parser behavior is unusual. The techniques presented perform no direct inspection of the contents of any file. Certainly the content of a given file plays an important role in its usage, but the techniques of this article only “see the content” through the lens of an ensemble of parsers. Our techniques are therefore also well-suited to assessing the background variability of parser behavior on different classes of files. Since our approach aims to leverage existing parsers in their unmodified, uninstrumented state, the statistical techniques could be used on many different file formats without much alteration.

For the purpose of this article, a file format consists of a set of compliant files and a set of non-compliant files. Formal specifications specify properties that compliant files must have, but formal specifications need not be present for there to be an agreed-upon file format. This article presents a new statistical test, which we call the Bernoulli misclassification test, that determines whether a given file is more representative of the compliant files or of the non-compliant files. In order to perform such a test, we require samples of both sets: namely a sample of compliant and a sample of non-compliant files. Because realistic samples of files are large and difficult to curate, the sample of compliant files may be contaminated with files that should not be considered as compliant. Conversely, there may be some files in our sample that are erroneously marked as non-compliant. Our statistical test is designed to identify these misclassified files.

The foundation of any statistical approach necessarily relies on both data coverage and sufficient sampling to ensure good estimation of the relevant governing parameters. Our approach here is no different, as the basis of the statistical test relies upon parameters estimated from the data in order to be effective. Given that our approach is format agnostic, it is reliant upon not only a good sample of files but also a good sample of parsers.

To test our approach, this article presents a case study using the PDF specification (ISO 32000-2), because there are many extant open source parsers with distinct underlying codebases. The test data presented in this article were developed by an independent test and evaluation team in support of the “DARPA SafeDocs Evaluation 2” exercise. The data consist of two datasets corresponding to the samples explained above: a sample of largely compliant files and a sample of largely non-compliant files. Each of the files (in both samples) was manually tested for compliance, so that the performance of our statistical test could be determined. While the fraction of misclassified files in the two datasets differs, the two datasets were sufficiently clean so as to allow reliable enough parameter estimation for our statistical test.

While the statistical methods discussed in this article are absolutely rudimentary, they did locate files that were truly misclassified with surprising effectiveness. Although these statistical methods surely do not suffice on their own for all purposes, they are easy to deploy and understand. We suggest that they ought to be part of the format hacker's toolbox.
2. Historical context
There appears to be very little work in analyzing file format compliance using statistical tools. In contrast to what we present here, most format compliance assessment that the author is aware of is performed using techniques common in compiler theory. For instance, [1; 11] explain the typical approaches.

The closest connections to this article appear to be various techniques for identifying malware using the structure of file contents rather than responses to those contents. For instance, [6] looked for statistical features characteristic of malware present in headers of executable files. Using the structure of file contents, ransomware can be classified statistically [2; 12]. Similar to our use of error messages on files, the distribution of API calls can be useful in identifying malware as it executes [3]. Other behavioral indicators, such as performance counters [7], can be useful as well. However, it appears that the use of statistical tools to determine file format compliance is completely unanticipated and novel.

In a few cases, statistical methods are useful in identifying file formats that might be difficult to archive or curate [8; 10; 9].
Table 1. Counts of files in the internet sourced and dangerous datasets

    Dataset             Valid    Rejected    Total
    internet sourced                          9000
    dangerous             488         516     1000
    Totals               7694        2306    10000
3. Data description
This article focuses on the analysis of PDF files, whose format is determined by the PDF specification (ISO 32000-2). It is recognized that the specification is ambiguous in places, and that there are many proprietary extensions to the specification. Because of this, many “PDF files” may not be completely compliant, or may be close enough to compliance to parse correctly. On the other hand, bugs within the parsers may cause them to accept non-compliant files. To explore these issues, the test and evaluation team collected a corpus of “PDF files” into two datasets, internet sourced and dangerous, comprising a total of 10000 files.

Within each dataset, the files were manually determined to be either “valid”, that is that they are compliant with the PDF specification, or “rejected”, which means that they fail to comply with the specification. In this article, we treat the “valid” or “rejected” determinations as experimental ground truth for the files. Since our Bernoulli misclassification test did not use these determinations, we were able to use them to estimate the test's accuracy (discussed in Section 5). The overall statistics of both datasets are shown in Table 1. A standard χ² test reveals that the difference in compliance between the two datasets is statistically significant (p < …). We took the internet sourced set to be our sample of predominantly compliant files, and we took the dangerous set to be our sample of predominantly non-compliant files. The significant difference between the two samples is precisely what our misclassification test leverages in order to find misclassified files.

Rather than looking at the contents of each file, we reasoned that there are already many extant parsers that do just that. Therefore, we ran each file through each of a large collection of parsers shown in Table 2. We selected these parsers based on their easy availability and with the understanding that many of them do not share code. This latter fact ensures that places within the PDF specification that are ambiguous may receive several interpretations by different parsers. The output to stderr from each parser was collected for each file, and a set of 955 regular expressions was used to identify which error messages had occurred for each parser-file pair (see Table 2).

As an example, the 1000-th file in the internet sourced dataset was considered “valid”, yet produced 7 distinct messages: caradoc extract: Type error : Invalid variant type; caradoc stats: Type error : Invalid variant type; caradoc stats strict: PDF error : Syntax error; hammer: uncategorized error; pdfium: uncategorized error; xpdf pdfinfo: uncategorized error; and xpdf pdftops: uncategorized error.
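To make the data collection concrete, the following is a minimal sketch of gathering stderr output and matching it against message regexes. The parser invocations and patterns shown here are illustrative placeholders of our own; they are not the actual command lines or the 955 regexes used by the test and evaluation team.

    import re
    import subprocess
    from pathlib import Path

    # Illustrative placeholders, not the study's actual configuration.
    PARSERS = {
        "qpdf": ["qpdf", "--check"],
        "mutool_clean": ["mutool", "clean"],
    }
    MESSAGE_REGEXES = [
        re.compile(r"Syntax error"),
        re.compile(r"Invalid variant type"),
    ]

    def messages_for_file(path: Path) -> set:
        """Return the set of (parser, regex index) pairs that fire on one file."""
        fired = set()
        for parser, argv in PARSERS.items():
            proc = subprocess.run(argv + [str(path)],
                                  capture_output=True, text=True)
            for k, rx in enumerate(MESSAGE_REGEXES):
                if rx.search(proc.stderr):
                    fired.add((parser, k))
        return fired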
Figure 1. The relation matrix for (a) internet sourced, (b) dangerous. Rows correspond to the error messages listed in Table 2. Columns correspond to files. In each matrix entry, white indicates no error, black indicates error.

It is important to recognize that our method makes no attempt to interpret the semantic meanings of these error messages. Instead, we are merely interested in the statistics of the co-occurrence of these messages.

The data can be tabulated as an integer relation matrix, in which each row corresponds to a particular regular expression (an error message in what follows) and each column corresponds to a particular file. Each entry records the number of times a given error message occurred for each file. Return values to the operating system were not collected, though if desired these could have simply been added as additional “messages” as rows in the relation matrix.

We constructed two relation matrices, one for internet sourced (Figure 1(a)) and one for dangerous (Figure 1(b)). Each relation matrix has the same rows (955 distinct error messages, as shown in Table 2) but different numbers of columns (9000 for internet sourced and 1000 for dangerous).

Continuing our example of the 1000-th file in internet sourced, this file corresponds to the 1000-th column of the relation matrix in Figure 1(a), and has 1s in rows 58, 254, 393, 589, 683, 910 and 911, because each of these messages occurred exactly once. It has 0s in all other entries in that column.

The horizontal dark bands present in both relation matrices shown in Figure 1 correspond to error messages that could not be categorized easily: not syntax errors or other specific malformations. Some parsers produce these kinds of messages more frequently than others, which explains why some portions of the matrices show a greater prevalence of dark horizontal bands than others.

Although there is some apparent structure visible in Figure 1, namely the dark horizontal bands, it is difficult to discriminate any column-by-column differences.
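A minimal sketch of assembling the relation matrix follows. It builds the binary (0/1) version used by the Bernoulli test of Section 5, in which duplicate occurrences of a message are ignored; the function and variable names are ours, not from the original tooling.

    import numpy as np

    def relation_matrix(file_message_sets, n_messages=955):
        """Binary relation matrix: rows are message regexes, columns are files.

        file_message_sets: one set of fired message row indices per file.
        """
        R = np.zeros((n_messages, len(file_message_sets)), dtype=np.uint8)
        for j, fired in enumerate(file_message_sets):
            for k in fired:
                R[k, j] = 1  # message k occurred at least once for file j
        return R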
Table 2. Error message counts and rows per parser

    Parser                  First row   Last row   Total message regexes
    caradoc extract                 1        196                     196
    caradoc stats                 197        392                     196
    caradoc stats strict          393        588                     196
    hammer                        589        589                       1
    mutool show                   590        635                      46
    mutool clean                  636        681                      46
    origami pdfcop                682        682                       1
    pdfium                        683        683                       1
    pdfminer dumppdf              684        703                      20
    pdfminer pdf2txt              704        723                      20
    pdftk server                  724        724                       1
    pdftools pdfid                725        729                       5
    pdftools pdfparser            730        734                       5
    peepdf                        735        735                       1
    poppler pdfinfo               736        792                      57
    poppler pdftocairo            793        849                      57
    poppler pdftops               850        906                      57
    qpdf                          907        907                       1
    verapdf greenfield            908        908                       1
    verapdf pdfbox                909        909                       1
    xpdf pdfinfo                  910        910                       1
    xpdf pdftops                  911        911                       1
    caradoc extract               912        913                       2
    caradoc stats                 914        915                       2
    caradoc stats strict          916        917                       2
    hammer                        918        919                       2
    mutool clean                  920        921                       2
    mutool show                   922        923                       2
    origami pdfcop                924        925                       2
    pdfium                        926        927                       2
    pdfminer dumppdf              928        929                       2
    pdfminer pdf2txt              930        931                       2
    pdftk server                  932        933                       2
    pdftools pdfid                934        935                       2
    pdftools pdfparser            936        937                       2
    peepdf                        938        939                       2
    poppler pdfinfo               940        941                       2
    poppler pdftocairo            942        943                       2
    poppler pdftops               944        945                       2
    qpdf                          946        947                       2
    verapdf greenfield            948        949                       2
    verapdf pdfbox                950        951                       2
    xpdf pdfinfo                  952        953                       2
    xpdf pdftops                  954        955                       2
    Total                                                            955
Figure 2. Principal components plots for (a) internet sourced and (b) dangerous. Black indicates “valid”, and gray indicates “rejected”. The axes correspond to the three principal vectors, and so are not plotted on the same scale.

These differences are indeed present, but require more sophistication to extract. That statistical analysis forms the basis of most of this article.
4. Principal components analysis
To build some intuition about the structure of the relation matrices, let us develop a dimension-reduced visualization of the columns (files) of both relation matrices shown in Figure 1. While there are many possible techniques for dimension reduction, principal components analysis is generally the easiest to construct and to interpret. To better understand the structure of the data, we will incorporate the ground truth as part of the visualization. This will help explain the performance of the Bernoulli misclassification test statistic in the next section.

Principal components analysis is a way to render a high dimensional data set that shows the “most important” dimensions and suppresses the rest. It is therefore a convenient way to visualize data that are formatted as a set of points in R^n where n is large. The output of principal components analysis is a scatter plot in which the axes are chosen as the linear combinations of rows yielding the largest variance. These axes are called the principal vectors, and are sorted from largest variance to least variance. In our analysis, the largest three principal vectors were used because they represented the majority of the variance in the data.

To apply principal components analysis, we reinterpret our tabular data as a discrete subset of R^n (a point cloud) in which columns (files) are points, and rows (messages) are components of the coordinates for each point. In both datasets, there are n = 955 messages. Because a file either exhibits an error or not, the components are all either 0 or 1. Although one might argue that this could result in quantization error, many interesting features are nevertheless visible in the two datasets.

Figure 2 shows the principal components analysis plots for both datasets. Points in both plots are colored according to the ground truth so that a point corresponding to a “valid” file is black, and a point corresponding to a “rejected” file is gray.

The most striking difference in the principal components analysis plots is that the internet sourced dataset is apparently much more “clumpy” than the dangerous dataset. The three dense clusters in Figure 2(a) consist entirely of “valid” files, with most of the “rejected” files in the internet sourced data appearing as a “haze” of files outside of those clusters. Although the cause of these three dense clusters of “valid” files cannot be determined solely from the relation matrix – which files are accepted by which parsers – we hypothesize that these clusters correspond to popular tool chains for creating PDF files.

In stark contrast, the dangerous set shown in Figure 2(b) contains two loose clusters that are mixed “valid” and “rejected” files. Intuitively, if one is looking to identify “valid” files, one would have a much harder time doing this with the dangerous set, so we might argue that the apparent signal-to-noise ratio is much lower in the dangerous set.

Principal components analysis can be misleading if only a small fraction of the overall variance in the data is explained by the first few principal vectors. It is easy to determine if this problem is occurring – simply plot the variance in the data explained by each principal vector. This is called a Scree plot [13], and is shown in Figure 3. In both datasets, the Scree plots decrease quite rapidly over the first few principal vectors. This shows that principal components analysis reliably represents the data.

Figure 3. Scree plots for (a) internet sourced and (b) dangerous.

We can conclude that if one is generally working with files that are naturally occurring (like the internet sourced set), one probably does not need to dig too deeply to determine whether a given file is valid or not. The rest of this article buttresses this claim by providing a statistical test that does just that. On the other hand, if one is routinely handling files that test the limits of their format (like the dangerous set), statistical analysis alone will likely be insufficient to determine which files are valid. A deeper format-aware analysis would be necessary in that case.
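As a sketch of the computation, assuming the relation matrix is held as a NumPy array R with messages as rows and files as columns, the embedding and the explained-variance ratios for a Scree plot can be produced with scikit-learn as follows; this is our rendering, not the original tooling.

    import numpy as np
    from sklearn.decomposition import PCA

    def pca_embedding(R, n_components=3):
        """Embed files (columns of R) into the top principal components."""
        X = R.T.astype(float)   # one point in R^955 per file
        pca = PCA()             # keep all components so the Scree plot is complete
        coords = pca.fit_transform(X)[:, :n_components]
        return coords, pca.explained_variance_ratio_

    # Scree plot: explained variance per principal vector, e.g.
    #   import matplotlib.pyplot as plt
    #   _, ratios = pca_embedding(R)
    #   plt.plot(ratios); plt.show()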
5. Bernoulli likelihood ratio misclassification test statistic
In order to determine file validity implicitly – by consulting parser behavior rather than the specification itself – we need exemplars of parser behavior on typical compliant and non-compliant files. We propose to use the fact that the two datasets have rather different statistics (Table 1), with the majority of files in internet sourced being “valid” while a (slim) majority of files in dangerous are “rejected”. We can attempt to identify files in internet sourced that exemplify the parser behavior present in the dangerous dataset – these files probably should be treated with suspicion, and are perhaps not “valid”. Conversely, files within the dangerous set that behave more like those in internet sourced may well be “valid”.

A file is misclassified if the set of messages it produces is more characteristic of the other dataset but not the one in which it is presently found. We suspect that such a misclassified file will have a collection of error messages that is unusual when compared to the others in that dataset. We estimate the probability that each error message occurs in each dataset, and estimate a likelihood for each file assuming the error messages are independent. Since both datasets have the same error messages, we can also estimate the likelihood that the file came from the other dataset. A likelihood ratio statistic can thereby detect when a file is misclassified, because it is more likely to belong to the other dataset.

Since we cannot assume that the occurrence of any given set of error messages is statistically independent, it is difficult to write a proper likelihood function. To that end, we use a pseudo-likelihood, which makes the incorrect assumption that error messages are statistically independent [5]. On the other hand, this assumption merely reduces the sensitivity of the test without producing extra outliers. The pseudo-likelihood assumption trades recall to get better precision. As such, pseudo-likelihoods are useful for classification but not for uncertainty quantification.

For the Bernoulli misclassification test, we assume each error message is governed by a Bernoulli distribution, which means that it either occurs or does not occur. Multiple instances of the same error are ignored. The test assumes that each error message occurs with a probability that depends on the dataset (either internet sourced or dangerous). When a parser processes a given file, sometimes it produces multiple copies of the same error message. This can happen if the parser attempts to repair a slightly non-compliant file as it proceeds, for instance. If this happens, then the Bernoulli distribution is no longer valid because it assumes at most one instance of a given error message. Ultimately, the performance of the Bernoulli misclassification test was good even though we ignored duplicate error messages.

Let us consider the internet sourced dataset first. It is straightforward to compute the probability p_k of error message k occurring from the relation matrix: simply take the average value of row k in the relation matrix shown in Figure 1(a). The resulting probabilities for both datasets are shown in Figure 4. Said another way, if file f is drawn from the internet sourced dataset, then the probability that f produces error message k is p_k. Conversely, the probability that f does not produce this error message is (1 − p_k). If we let f_k = 0 if the file f did not produce error k and f_k = 1 if the file did produce error k, then the probability that f is from the internet sourced dataset is

    p_k f_k + (1 − p_k)(1 − f_k),

if we only consider error message k.

Since we have many error messages available for analysis, the pseudo-likelihood that file f (column in the relation matrix) is correctly classified is simply the product of each of these individual probabilities, namely

    L_{internet sourced}(f) = ∏_{k=1}^{955} ( p_k f_k + (1 − p_k)(1 − f_k) ).
Figure 4. Error probability for (a) internet sourced and (b) dangerous.

We define L_{dangerous}(f) similarly using the error message probabilities p′_k from the dangerous set instead,

    L_{dangerous}(f) = ∏_{k=1}^{955} ( p′_k f_k + (1 − p′_k)(1 − f_k) ).
We define the Bernoulli misclassification test statistic to be the ratio of these two pseudo-likelihoods,

    λ_{internet sourced}(f) = L_{dangerous}(f) / L_{internet sourced}(f).

If f is a file drawn from the internet sourced dataset, then we generally expect that L_{internet sourced}(f) will be larger than L_{dangerous}(f), which implies that λ_{internet sourced}(f) < 1. Conversely, if a file is drawn from the dangerous dataset, which means it is a misclassification if it actually is present in internet sourced, we would expect that λ_{internet sourced}(f) > 1. The intuition is that since files in the dangerous set are likely to be invalid, a high value of λ_{internet sourced}(f) is an indication that the file f is not compliant. A histogram of λ_{internet sourced} is shown in Figure 5(a).

Conversely, we can define

    λ_{dangerous}(g) = L_{internet sourced}(g) / L_{dangerous}(g)

for each file g in the dangerous set. The intuition in this case is that a high value of λ_{dangerous}(g) is an indication that g is compliant, since it is likely to be a misclassification. The histogram of values of λ_{dangerous}(g) is shown in Figure 5(b).

Since there is some variability (or noise) present within the data, we should not use the value of λ = 1 as the cutoff for detecting misclassifications. Although the intersections between the histogram curves and the red lines λ = 1 in both frames of Figure 5 are close to the true fraction of misclassifications in both datasets (80% for internet sourced and 48% for dangerous), they are not exactly correct. This suggests using a different threshold T instead, so that all files whose statistic λ is greater than T will be deemed to be misclassified.
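A sketch of the statistic in code form follows; it works in log space to avoid underflow from a product of 955 probabilities. This is our rendering of the formulas above, not code from the original study.

    import numpy as np

    def message_probabilities(R):
        """p_k for one dataset: the mean of row k of its relation matrix."""
        return R.mean(axis=1)

    def log_pseudo_likelihood(f, p):
        """log L(f) under the independent-Bernoulli assumption.

        f: binary message vector for one file; p: per-message probabilities.
        A message never seen in a sample makes the likelihood exactly zero
        (log of 0 is -inf); as noted below, these zeros proved informative.
        """
        with np.errstate(divide="ignore"):
            return np.sum(np.log(p * f + (1.0 - p) * (1.0 - f)))

    def lam_internet_sourced(f, p_int, p_dng):
        """Bernoulli misclassification statistic; values above a threshold T
        flag the file as a suspected misclassification."""
        return np.exp(log_pseudo_likelihood(f, p_dng)
                      - log_pseudo_likelihood(f, p_int))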
Figure 5. Histogram of Bernoulli misclassification test statistic for (a) internet sourced and (b) dangerous. The horizontal line marks a value of λ = 1.

Let us now use the ground truth, which specifies whether a given file is “valid” or “rejected”, to determine the performance of the Bernoulli misclassification test statistic. A misclassified file in internet sourced is one that is “rejected”, while a misclassified file in dangerous is one that is “valid”.

In the case of either dataset, for a given threshold T, the probability of detection is the probability that a truly misclassified file f will have a statistic λ(f) > T. On the other hand, the probability of false alarm is the probability that a correctly classified file f will have a statistic λ(f) > T. Ideally, the probability of detection will be close to 1, while simultaneously the probability of false alarm will be close to 0.

Figure 6. Receiver operating curves for Bernoulli likelihood ratio misclassification statistic for (a) internet sourced and (b) dangerous.

The plot of probability of detection versus false alarm for all thresholds is shown in Figure 6. Better misclassification detectors have plots further to the upper left, away from the diagonal. Since the curve plotted is far above the diagonal for internet sourced in Figure 6(a), we conclude that the Bernoulli likelihood ratio misclassification statistic is a very accurate detector of misclassified files in internet sourced. Additionally, since the plot in Figure 6(b) is above the diagonal, this indicates that the Bernoulli likelihood ratio misclassification statistic also performs well on the dangerous dataset. The sharp plateau in Figure 6(b) is due to a number of instances in which L_{internet sourced} took the value 0 on some files in dangerous. These happen to be truly non-compliant files! These qualitative assessments are confirmed by the areas under the curves in Figure 6. A perfect misclassification detector will have an area of 1 under the curve, while a detector that randomly decides misclassifications would have an area of 0.5. We obtained 0.95 for λ_{internet sourced} and 0.80 for λ_{dangerous} for areas under the curve, which we deem to be surprisingly good given the fact that the file contents were not directly inspected by our method.
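The receiver operating curves and their areas can be computed directly from the statistic and the ground truth; a sketch using scikit-learn (our choice of library, not necessarily the original tooling) follows. Infinite values of λ, which arise when a pseudo-likelihood is exactly zero, are clipped to a large finite score first.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    def roc_summary(is_misclassified, lam):
        """Probability of detection vs. false alarm over all thresholds T.

        is_misclassified: 1 where the ground truth disagrees with the
        dataset's label, 0 otherwise; lam: the statistic for each file.
        """
        scores = np.nan_to_num(np.asarray(lam), posinf=np.finfo(float).max)
        pfa, pd, thresholds = roc_curve(is_misclassified, scores)
        return pfa, pd, thresholds, roc_auc_score(is_misclassified, scores)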
6. Parser redundancy analysis

The Bernoulli misclassification statistic relies upon the diversity of responses to each file in order to make reliable decisions. It is impractical to use quite so many parsers as we used, so it is useful to assess whether we could get good performance with fewer parsers. The easiest way to do this is to measure the correlation between the typical error message distributions produced by different parsers across all files. In the following analysis, we combined both internet sourced and dangerous into a single dataset, whose relation matrix consists of the horizontal concatenation of both matrices shown in Figure 1. That is, the rows are the same as before, but the columns consist of the union of the columns from both matrices. Error messages (rows) that never occur were removed from the analysis, since they contribute no information.

The data can also be visualized using principal components analysis, much as was done in Section 4, but instead we use the error counts across files as coordinates for each parser. Since specific error messages cannot be enabled or disabled individually, it makes sense to aggregate the error messages into parsers by taking the total number of error messages for each parser on a given file. Principal components analysis places the parsers according to the plot shown in Figure 7. There is one large cluster (in the lower left of Figure 7) of parsers which have similar behaviors. There are quite a few outliers, most notably caradoc extract and poppler pdfinfo. The reader is cautioned that the apparent density of the large cluster is a bit misleading, because the ranges of the axes are quite different. In any event, the wide spread of parsers across the plot indicates that having a large number of parsers is quite valuable.

Although principal components analysis is useful, we can obtain a ranking of parsers by their redundancy, so that we can prioritize more informative parsers. We computed Pearson's correlation coefficient between all pairs of error messages (rows), to form a fairly large correlation matrix (not shown). Different error messages that occur on exactly the same set of files have a correlation coefficient of 1. In that case, one of those error messages is redundant. Errors were then grouped by parser, and for each pair of parsers we stored the median correlation from each of their pairwise message correlations; a sketch of this aggregation appears below. The result of this aggregation is the matrix shown in Figure 8. Whiter colors indicate higher correlation – more redundancy – while darker colors indicate lower correlation. Most of the matrix is fairly dark, which indicates low redundancy overall. The bright off-diagonal entries indicate trade-offs: one needs only run one of the two parsers indicated. The bright bands for pdftools pdfid occurred because that parser did not correlate with any of the others at all.
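The aggregation, assuming the combined relation matrix R (with zero rows removed) and a list giving the owning parser of each row, might look like the following sketch; the names are ours.

    import numpy as np

    def parser_correlation(R, row_parser):
        """Median pairwise message correlation for each pair of parsers.

        row_parser: parser name for each row of R.  Rows that never occur
        should already be removed, since constant rows have no correlation.
        """
        C = np.corrcoef(R.astype(float))   # message-by-message Pearson correlation
        parsers = sorted(set(row_parser))
        rows = {p: [i for i, q in enumerate(row_parser) if q == p]
                for p in parsers}
        M = np.empty((len(parsers), len(parsers)))
        for a, pa in enumerate(parsers):
            for b, pb in enumerate(parsers):
                # median over the block of correlations between the two parsers
                M[a, b] = np.median(C[np.ix_(rows[pa], rows[pb])])
        return parsers, M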
Figure 7. Parsers placed in three dimensions according to principal components analysis of the union of internet sourced and dangerous.

The clusters visible in Figure 7 can be confirmed by sorting the parsers based on their median correlation. The rows and columns shown in Figure 8 are sorted in this way. This indicates which parsers are individually most informative, because they form an approximate spanning set for the data. The obvious block in the lower right suggests that parsers should be grouped into two categories based on their informativeness: high and low. The rows and columns of the block in Figure 8 suggest that the boundary between the high and low categories appears to be fairly sharp, with all parsers to the left of origami pdfcop being highly informative. Based on the approximate spanning property of the highly informative category, which covers most of the variability visible in Figure 7, one should always run the parsers in the high category, with the other parsers in the low informative category treated as increasingly optional as one moves to the lower right of Figure 8. Notice that the parsers near the lower right of Figure 8 all come from the cluster in the lower left of Figure 7. Although it is difficult to see from the projection shown in Figure 7, the least informative parsers are on the inside of that cluster, while the parsers in the same cluster that are on its outer edges are more informative.
Figure 8. Correlation matrix between parser behaviors. Rows and columns are sorted according to median message correlation.
7. Conclusion and recommendations
This article has demonstrated that a given file's compliance with a format specification can be determined from the following three samples: (1) a diverse collection of parsers for the file format, (2) a sample of compliant files, and (3) a sample of non-compliant files. Our methodology is format agnostic, and so could work if a file format is not formally specified. Furthermore, our Bernoulli test for compliance is statistical, so it is robust to errors in the three samples, yet benefits from richer samples should they be available. We note that the use of the three samples means that this approach is a supervised approach. Under appropriate circumstances, it might be possible to use an unsupervised bootstrapping approach to extract the two samples of files from a single aggregated sample. This requires further investigation.

Principal components analysis is helpful in understanding the variability within a sample of files or parsers, but it holds somewhat less value as an automated analytic tool. Clusters visible within the principal components plots appear to reflect specific tool chains used in the creation of files, with valid files forming several dense clusters.

The Bernoulli likelihood ratio statistic detects non-compliant files by comparing error message prevalence aggregated across all parsers and both samples of files. It relies inherently upon both the coverage and depth of these samples, but we have shown that it can be very effective at its job when these samples are adequate.

Overall, we found that there is not much redundancy present in the behaviors of parsers for the PDF specification. That is to say that the relation matrix contains by far the most information if all parsers are considered, so one ought to use all parsers if possible. Our analysis determined which parsers are individually the most informative, based on pairwise comparisons. While it is entirely reasonable to study different subsets of parsers, we did not perform that analysis here. We refer the interested reader to [4] where such an analysis was performed.

If resources are tight, we found that it is probably not necessary to rerun a given parser with different options. For instance, running only one of xpdf pdftops and xpdf pdfinfo probably will not change the results too much. Roughly speaking, it appears that the best strategy is “more programmers contributing different code instead of one programmer's code run in many different ways.”

Our two statistical techniques for analyzing file format compliance are admittedly simple but standard statistical tools, though they are apparently not in wide use. They are easy to deploy, easy to interpret, and require little maintenance other than the selection of a single detection threshold. We therefore encourage deeper exploration into and experimentation with statistical methods in the study of file format compliance.
Acknowledgments
The author would like to thank the SafeDocs test and evaluation team, including the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology, and the PDF Association, Inc., for providing the test data. The author would like to thank Kris Ambrose for his heroic processing of all of the relevant files through all of the parsers, and for subsequently packaging the results in a very convenient format. The author would also like to thank Peter Wyatt for his detailed and insightful investigation into the files that were deemed outliers and for closely reading an earlier draft of this article.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) SafeDocs program under contract HR001119C0072. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of DARPA.
References

[1] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[2] Bander Ali Saleh Al-rimy, Mohd Aizaini Maarof, and Syed Zainudeen Mohd Shaid. Ransomware threat success factors, taxonomy, and countermeasures: A survey and research directions. Computers & Security, 74:144–166, 2018.
[3] Mamoun Alazab. Profiling and classifying the behavior of malicious codes. Journal of Systems and Software, 100:91–102, 2015. ISSN 0164-1212. doi: 10.1016/j.jss.2014.10.031.
[4] Kristopher Ambrose, Steve Huntsman, Michael Robinson, and Matvey Yutin. Topological differential testing. arXiv preprint arXiv:2003.00976, 2020.
[5] Barry C. Arnold and David Strauss. Pseudolikelihood estimation: some examples. Sankhyā: The Indian Journal of Statistics, Series B, pages 233–243, 1991.
[6] Mohamed Belaoued and Smaine Mazouzi. A real-time PE-malware detection system based on chi-square test and PE-file features. In IFIP International Conference on Computer Science and its Applications, pages 416–425. Springer, 2015.
[7] John Demme, Matthew Maycock, Jared Schmitz, Adrian Tang, Adam Waksman, Simha Sethumadhavan, and Salvatore Stolfo. On the feasibility of online malware detection with performance counters. ACM SIGARCH Computer Architecture News, 41(3):559–570, 2013.
[8] Roman Graf and Sergiu Gordea. A risk analysis of file formats for preservation planning. In Proceedings of the 10th International Conference on Preservation of Digital Objects (iPres2013), pages 177–186, 2013.
[9] Gregory W. Lawrence, William R. Kehoe, Oya Y. Rieger, William H. Walters, and Anne R. Kenney. Risk Management of Digital Information: A File Format Investigation. ERIC, 2000.
[10] D. Pearson and C. Webb. Defining file format obsolescence: A risky journey. The International Journal of Digital Curation, 3(1):89–106, July 2008.
[11] Maksym Schipka. Detection of exploits in files, January 8, 2009. US Patent App. 11/822,533.
[12] S. D. S.L and J. CD. Windows malware detector using convolutional neural network based on visualization images. IEEE Transactions on Emerging Topics in Computing, pages 1–1, 2019. doi: 10.1109/TETC.2019.2910086.
[13] An Gie Yong, Sean Pearce, et al. A beginner's guide to factor analysis: Focusing on exploratory factor analysis. Tutorials in Quantitative Methods for Psychology, 9(2):79–94, 2013.
Mathematics and Statistics, American University, Washington, DC, USA
Email address: