Looking for non-compliant documents using error messages from multiple parsers
MICHAEL ROBINSON
Abstract.
Whether a file is accepted by a single parser is not a reliable indication of whether a file complies with its stated format. Bugs within both the parser and the format specification mean that a compliant file may fail to parse, or that a non-compliant file might be read without any apparent trouble. The latter situation presents a significant security risk, and should be avoided. This article suggests that a better way to assess format specification compliance is to examine the set of error messages produced by a set of parsers rather than a single parser. If both a sample of compliant files and a sample of non-compliant files are available, then we show how a statistical test based on a pseudo-likelihood ratio can be very effective at determining a file's compliance. Our method is format agnostic, and does not directly rely upon a formal specification of the format. Although this article focuses upon the case of the PDF format (ISO 32000-2), we make no attempt to use any specific details of the format. Furthermore, we show how principal components analysis can be useful for a format specification designer to assess the quality and structure of these samples of files and parsers. While these tests are absolutely rudimentary, it appears that their use to measure file format variability and to identify non-compliant files is both novel and surprisingly effective.

1. Introduction
Modern file formats are often quite complex, yet the formal specifications for some common formats can be ambiguous or confusing. A single parser is therefore not a reliable arbiter of file format compliance: it may incorrectly deem a compliant file as non-compliant, or conversely it may parse a non-compliant file (perhaps with disastrous consequences). For widely used file formats, there are usually several readily available parsers. It is natural to ask if a statistical approach that leverages multiple existing parsers – but is otherwise format agnostic – might suffice to discriminate between compliant and non-compliant files.

This article describes an exploratory technique and a statistical test for identifying files whose parser behavior is unusual. The techniques presented perform no direct inspection of the contents of any file. Certainly the content of a given file plays an important role in its usage, but the techniques of this article only “see the content” through the lens of an ensemble of parsers. Our techniques are therefore also well-suited to assessing the background variability of parser behavior on different classes of files. Since our approach aims to leverage existing parsers in their unmodified, uninstrumented state, the statistical techniques could be used on many different file formats without much alteration.

For the purpose of this article, a file format consists of a set of compliant files and a set of non-compliant files. Formal specifications specify properties that compliant files must have, but formal specifications need not be present for there to be an agreed-upon file format. This article presents a new statistical test, which we call the Bernoulli misclassification test, that determines whether a given file is more representative of the compliant files or of the non-compliant files. In order to perform such a test, we require samples of both sets: namely a sample of compliant and a sample of non-compliant files. Because realistic samples of files are large and difficult to curate, the sample of compliant files may be contaminated with files that should not be considered as compliant. Conversely, there may be some files in our sample that are erroneously marked as non-compliant. Our statistical test is designed to identify these misclassified files.

The foundation of any statistical approach necessarily relies on both data coverage and sufficient sampling to ensure good estimation of the relevant governing parameters. Our approach here is no different, as the basis of the statistical test relies upon parameters estimated from the data in order to be effective. Given that our approach is format agnostic, it is reliant upon not only a good sample of files but also a good sample of parsers.

To test our approach, this article presents a case study using the PDF specification (ISO 32000-2), because there are many extant open source parsers with distinct underlying codebases. The test data presented in this article were developed by an independent test and evaluation team in support of the “DARPA SafeDocs Evaluation 2” exercise. The data consist of two datasets corresponding to the samples explained above: a sample of largely compliant files and a sample of largely non-compliant files. Each of the files (in both samples) was manually tested for compliance, so that the performance of our statistical test could be determined. While the fraction of misclassified files in the two datasets differs, the two datasets were sufficiently clean so as to allow reliable enough parameter estimation for our statistical test.

While the statistical methods discussed in this article are absolutely rudimentary, they did locate files that were truly misclassified with surprising effectiveness. Although these statistical methods surely do not suffice on their own for all purposes, they are easy to deploy and understand. We suggest that they ought to be part of the format hacker's toolbox.
2. Historical context
There appears to be very little work in analyzing file format compliance using statistical tools. In contrast to what we present here, most format compliance assessment that the author is aware of is performed using techniques common in compiler theory. For instance, [1; 11] explain the typical approaches.

The closest connections to this article appear to be various techniques for identifying malware using the structure of file contents rather than responses to those contents. For instance, [6] looked for statistical features characteristic of malware present in headers of executable files. Using the structure of file contents, ransomware can be classified statistically [2; 12]. Similar to our use of error messages on files, the distribution of API calls can be useful in identifying malware as it executes [3]. Other behavioral indicators, such as performance counters [7], can be useful as well. However, it appears that the use of statistical tools to determine file format compliance is completely unanticipated and novel.

In a few cases, statistical methods are useful in identifying file formats that might be difficult to archive or curate [8; 10; 9].
Table 1. Counts of files in the internet sourced and dangerous datasets

    Dataset             Valid    Rejected    Total
    internet sourced                          9000
    dangerous             488         516     1000
    Totals               7694        2306    10000
3. Data description
This article focuses on the analysis of PDF files, whose format is determined by the PDF specification (ISO 32000-2). It is recognized that the specification is ambiguous in places, and that there are many proprietary extensions to the specification. Because of this, many “PDF files” may not be completely compliant, or may be close enough to compliance to parse correctly. On the other hand, bugs within the parsers may cause them to accept non-compliant files. To explore these issues, the test and evaluation team collected a corpus of “PDF files” into two datasets, internet sourced and dangerous, comprising a total of 10000 files.

Within each dataset, the files were manually determined to be either “valid”, that is that they are compliant with the PDF specification, or “rejected”, which means that they fail to comply with the specification. In this article, we treat the “valid” or “rejected” determinations as experimental ground truth for the files. Since our Bernoulli misclassification test did not use these determinations, we were able to use them to estimate the test's accuracy (discussed in Section 5). The overall statistics of both datasets are shown in Table 1. A standard χ² test reveals that the difference in compliance between the two datasets is statistically significant (p < …). We took the internet sourced set to be our sample of predominantly compliant files, and we took the dangerous set to be our sample of predominantly non-compliant files. The significant difference between the two samples is precisely what our misclassification test leverages in order to find misclassified files.

Rather than looking at the contents of each file, we reasoned that there are already many extant parsers that do just that. Therefore, we ran each file through each of a large collection of parsers shown in Table 2. We selected these parsers based on their easy availability and with the understanding that many of them do not share code. This latter fact ensures that places within the PDF specification that are ambiguous may receive several interpretations by different parsers. The output to stderr from each parser was collected for each file, and a set of 955 regular expressions was used to identify which error messages had occurred for each parser-file pair (see Table 2).

As an example, the 1000-th file in the internet sourced dataset was considered “valid”, yet produced 7 distinct messages: caradoc extract: Type error : Invalid variant type; caradoc stats: Type error : Invalid variant type; caradoc stats strict: PDF error : Syntax error; hammer: uncategorized error; pdfium: uncategorized error; xpdf pdfinfo: uncategorized error; and xpdf pdftops: uncategorized error.
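To make the data collection concrete, the following is a minimal sketch of gathering stderr output and matching it against message regexes. The parser invocations and patterns shown here are illustrative placeholders of our own; they are not the actual command lines or the 955 regexes used by the test and evaluation team.

    import re
    import subprocess
    from pathlib import Path

    # Illustrative placeholders, not the study's actual configuration.
    PARSERS = {
        "qpdf": ["qpdf", "--check"],
        "mutool_clean": ["mutool", "clean"],
    }
    MESSAGE_REGEXES = [
        re.compile(r"Syntax error"),
        re.compile(r"Invalid variant type"),
    ]

    def messages_for_file(path: Path) -> set:
        """Return the set of (parser, regex index) pairs that fire on one file."""
        fired = set()
        for parser, argv in PARSERS.items():
            proc = subprocess.run(argv + [str(path)],
                                  capture_output=True, text=True)
            for k, rx in enumerate(MESSAGE_REGEXES):
                if rx.search(proc.stderr):
                    fired.add((parser, k))
        return fired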
Figure 1. The relation matrix for (a) internet sourced, (b) dangerous. Rows correspond to the error messages listed in Table 2. Columns correspond to files. In each matrix entry, white indicates no error, black indicates error.

It is important to recognize that our method makes no attempt to interpret the semantic meanings of these error messages. Instead, we are merely interested in the statistics of the co-occurrence of these messages.

The data can be tabulated as an integer relation matrix, in which each row corresponds to a particular regular expression (an error message in what follows) and each column corresponds to a particular file. Each entry records the number of times a given error message occurred for each file. Return values to the operating system were not collected, though if desired these could have simply been added as additional “messages” as rows in the relation matrix.

We constructed two relation matrices, one for internet sourced (Figure 1(a)) and one for dangerous (Figure 1(b)). Each relation matrix has the same rows (955 distinct error messages, as shown in Table 2) but different numbers of columns (9000 for internet sourced and 1000 for dangerous).

Continuing our example of the 1000-th file in internet sourced, this file corresponds to the 1000-th column of the relation matrix in Figure 1(a), and has 1s in rows 58, 254, 393, 589, 683, 910 and 911, because each of these messages occurred exactly once. It has 0s in all other entries in that column.

The horizontal dark bands present in both relation matrices shown in Figure 1 correspond to error messages that could not be categorized easily: not syntax errors or other specific malformations. Some parsers produce these kinds of messages more frequently than others, which explains why some portions of the matrices show a greater prevalence of dark horizontal bands than others.

Although there is some apparent structure visible in Figure 1, namely the dark horizontal bands, it is difficult to discriminate any column-by-column differences.
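A minimal sketch of assembling the relation matrix follows. It builds the binary (0/1) version used by the Bernoulli test of Section 5, in which duplicate occurrences of a message are ignored; the function and variable names are ours, not from the original tooling.

    import numpy as np

    def relation_matrix(file_message_sets, n_messages=955):
        """Binary relation matrix: rows are message regexes, columns are files.

        file_message_sets: one set of fired message row indices per file.
        """
        R = np.zeros((n_messages, len(file_message_sets)), dtype=np.uint8)
        for j, fired in enumerate(file_message_sets):
            for k in fired:
                R[k, j] = 1  # message k occurred at least once for file j
        return R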
Table 2. Error message counts and rows per parser

    Parser                  First row   Last row   Total message regexes
    caradoc extract                 1        196                     196
    caradoc stats                 197        392                     196
    caradoc stats strict          393        588                     196
    hammer                        589        589                       1
    mutool show                   590        635                      46
    mutool clean                  636        681                      46
    origami pdfcop                682        682                       1
    pdfium                        683        683                       1
    pdfminer dumppdf              684        703                      20
    pdfminer pdf2txt              704        723                      20
    pdftk server                  724        724                       1
    pdftools pdfid                725        729                       5
    pdftools pdfparser            730        734                       5
    peepdf                        735        735                       1
    poppler pdfinfo               736        792                      57
    poppler pdftocairo            793        849                      57
    poppler pdftops               850        906                      57
    qpdf                          907        907                       1
    verapdf greenfield            908        908                       1
    verapdf pdfbox                909        909                       1
    xpdf pdfinfo                  910        910                       1
    xpdf pdftops                  911        911                       1
    caradoc extract               912        913                       2
    caradoc stats                 914        915                       2
    caradoc stats strict          916        917                       2
    hammer                        918        919                       2
    mutool clean                  920        921                       2
    mutool show                   922        923                       2
    origami pdfcop                924        925                       2
    pdfium                        926        927                       2
    pdfminer dumppdf              928        929                       2
    pdfminer pdf2txt              930        931                       2
    pdftk server                  932        933                       2
    pdftools pdfid                934        935                       2
    pdftools pdfparser            936        937                       2
    peepdf                        938        939                       2
    poppler pdfinfo               940        941                       2
    poppler pdftocairo            942        943                       2
    poppler pdftops               944        945                       2
    qpdf                          946        947                       2
    verapdf greenfield            948        949                       2
    verapdf pdfbox                950        951                       2
    xpdf pdfinfo                  952        953                       2
    xpdf pdftops                  954        955                       2
    Total                                                            955
Figure 2. Principal components plots for (a) internet sourced and (b) dangerous. Black indicates “valid”, and gray indicates “rejected”. The axes correspond to the three principal vectors, and so are not plotted on the same scale.

These differences are indeed present, but require more sophistication to extract. That statistical analysis forms the basis of most of this article.
4. Principal components analysis
To build some intuition about the structure of the relation matrices, let us develop a dimension-reduced visualization of the columns (files) of both relation matrices shown in Figure 1. While there are many possible techniques for dimension reduction, principal components analysis is generally the easiest to construct and to interpret. To better understand the structure of the data, we will incorporate the ground truth as part of the visualization. This will help explain the performance of the Bernoulli misclassification test statistic in the next section.

Principal components analysis is a way to render a high dimensional data set that shows the “most important” dimensions and suppresses the rest. It is therefore a convenient way to visualize data that are formatted as a set of points in R^n where n is large. The output of principal components analysis is a scatter plot in which the axes are chosen as the linear combinations of rows yielding the largest variance. These axes are called the principal vectors, and are sorted from largest variance to least variance. In our analysis, the largest three principal vectors were used because they represented the majority of the variance in the data.

To apply principal components analysis, we reinterpret our tabular data as a discrete subset of R^n (a point cloud) in which columns (files) are points, and rows (messages) are components of the coordinates for each point. In both datasets, there are n = 955 messages. Because a file either exhibits an error or not, the components are all either 0 or 1. Although one might argue that this could result in quantization error, many interesting features are nevertheless visible in the two datasets.

Figure 2 shows the principal components analysis plots for both datasets. Points in both plots are colored according to the ground truth so that a point corresponding to a “valid” file is black, and a point corresponding to a “rejected” file is gray.

The most striking difference in the principal components analysis plots is that the internet sourced dataset is apparently much more “clumpy” than the dangerous dataset. The three dense clusters in Figure 2(a) consist entirely of “valid” files, with most of the “rejected” files in the internet sourced data appearing as a “haze” of files outside of those clusters. Although the cause of these three dense clusters of “valid” files cannot be determined solely from the relation matrix – which files are accepted by which parsers – we hypothesize that these clusters correspond to popular tool chains for creating PDF files.

In stark contrast, the dangerous set shown in Figure 2(b) contains two loose clusters that are mixed “valid” and “rejected” files. Intuitively, if one is looking to identify “valid” files, one would have a much harder time doing this with the dangerous set, so we might argue that the apparent signal-to-noise ratio is much lower in the dangerous set.

Principal components analysis can be misleading if only a small fraction of the overall variance in the data is explained by the first few principal vectors. It is easy to determine if this problem is occurring – simply plot the variance in the data explained by each principal vector. This is called a Scree plot [13], and is shown in Figure 3. In both datasets, the Scree plots decrease quite rapidly over the first few principal vectors. This shows that principal components analysis reliably represents the data.

Figure 3. Scree plots for (a) internet sourced and (b) dangerous.

We can conclude that if one is generally working with files that are naturally occurring (like the internet sourced set), one probably does not need to dig too deeply to determine whether a given file is valid or not. The rest of this article buttresses this claim by providing a statistical test that does just that. On the other hand, if one is routinely handling files that test the limits of their format (like the dangerous set), statistical analysis alone will likely be insufficient to determine which files are valid. A deeper format-aware analysis would be necessary in that case.
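As a sketch of the computation, assuming the relation matrix is held as a NumPy array R with messages as rows and files as columns, the embedding and the explained-variance ratios for a Scree plot can be produced with scikit-learn as follows; this is our rendering, not the original tooling.

    import numpy as np
    from sklearn.decomposition import PCA

    def pca_embedding(R, n_components=3):
        """Embed files (columns of R) into the top principal components."""
        X = R.T.astype(float)   # one point in R^955 per file
        pca = PCA()             # keep all components so the Scree plot is complete
        coords = pca.fit_transform(X)[:, :n_components]
        return coords, pca.explained_variance_ratio_

    # Scree plot: explained variance per principal vector, e.g.
    #   import matplotlib.pyplot as plt
    #   _, ratios = pca_embedding(R)
    #   plt.plot(ratios); plt.show()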
5. Bernoulli likelihood ratio misclassification test statistic
In order to determine file validity implicitly – by consulting parser behavior rather than the specification itself – we need exemplars of parser behavior on typical compliant and non-compliant files. We propose to use the fact that the two datasets have rather different statistics (Table 1), with the majority of files in internet sourced being “valid” while a (slim) majority of files in dangerous are “rejected”. We can attempt to identify files in internet sourced that exemplify the parser behavior present in the dangerous dataset – these files probably should be treated with suspicion, and are perhaps not “valid”. Conversely, files within the dangerous set that behave more like those in internet sourced may well be “valid”.

A file is misclassified if the set of messages it produces is more characteristic of the other dataset but not the one in which it is presently found. We suspect that such a misclassified file will have a collection of error messages that is unusual when compared to the others in that dataset. We estimate the probability that each error message occurs in each dataset, and estimate a likelihood for each file assuming the error messages are independent. Since both datasets have the same error messages, we can also estimate the likelihood that the file came from the other dataset. A likelihood ratio statistic can thereby detect when a file is misclassified, because it is more likely to belong to the other dataset.

Since we cannot assume that the occurrence of any given set of error messages is statistically independent, it is difficult to write a proper likelihood function. To that end, we use a pseudo-likelihood, which makes the incorrect assumption that error messages are statistically independent [5]. On the other hand, this assumption merely reduces the sensitivity of the test without producing extra outliers. The pseudo-likelihood assumption trades recall to get better precision. As such, pseudo-likelihoods are useful for classification but not for uncertainty quantification.

For the Bernoulli misclassification test, we assume each error message is governed by a Bernoulli distribution, which means that it either occurs or does not occur. Multiple instances of the same error are ignored. The test assumes that each error message occurs with a probability that depends on the dataset (either internet sourced or dangerous). When a parser processes a given file, sometimes it produces multiple copies of the same error message. This can happen if the parser attempts to repair a slightly non-compliant file as it proceeds, for instance. If this happens, then the Bernoulli distribution is no longer valid because it assumes at most one instance of a given error message. Ultimately, the performance of the Bernoulli misclassification test was good even though we ignored duplicate error messages.

Let us consider the internet sourced dataset first. It is straightforward to compute the probability p_k of error message k occurring from the relation matrix: simply take the average value of row k in the relation matrix shown in Figure 1(a). The resulting probabilities for both datasets are shown in Figure 4. Said another way, if file f is drawn from the internet sourced dataset, then the probability that f produces error message k is p_k. Conversely, the probability that f does not produce this error message is (1 − p_k). If we let f_k = 0 if the file f did not produce error k and f_k = 1 if the file did produce error k, then the probability that f is from the internet sourced dataset is

    p_k f_k + (1 − p_k)(1 − f_k),

if we only consider error message k.

Since we have many error messages available for analysis, the pseudo-likelihood that file f (column in the relation matrix) is correctly classified is simply the product of each of these individual probabilities, namely

    L_{internet sourced}(f) = ∏_{k=1}^{955} ( p_k f_k + (1 − p_k)(1 − f_k) ).
Figure 4. Error probability for (a) internet sourced and (b) dangerous.

We define L_{dangerous}(f) similarly using the error message probabilities p′_k from the dangerous set instead,

    L_{dangerous}(f) = ∏_{k=1}^{955} ( p′_k f_k + (1 − p′_k)(1 − f_k) ).
We define the Bernoulli misclassification test statistic to be the ratio of these two pseudo-likelihoods,

    λ_{internet sourced}(f) = L_{dangerous}(f) / L_{internet sourced}(f).

If f is a file drawn from the internet sourced dataset, then we generally expect that L_{internet sourced}(f) will be larger than L_{dangerous}(f), which implies that λ_{internet sourced}(f) < 1. Conversely, if a file is drawn from the dangerous dataset, which means it is a misclassification if it actually is present in internet sourced, we would expect that λ_{internet sourced}(f) > 1. The intuition is that since files in the dangerous set are likely to be invalid, a high value of λ_{internet sourced}(f) is an indication that the file f is not compliant. A histogram of λ_{internet sourced} is shown in Figure 5(a).

Conversely, we can define

    λ_{dangerous}(g) = L_{internet sourced}(g) / L_{dangerous}(g)

for each file g in the dangerous set. The intuition in this case is that a high value of λ_{dangerous}(g) is an indication that g is compliant, since it is likely to be a misclassification. The histogram of values of λ_{dangerous}(g) is shown in Figure 5(b).

Since there is some variability (or noise) present within the data, we should not use the value of λ = 1 as the cutoff for detecting misclassifications. Although the intersections between the histogram curves and the red lines λ = 1 in both frames of Figure 5 are close to the true fraction of misclassifications in both datasets (80% for internet sourced and 48% for dangerous), they are not exactly correct. This suggests using a different threshold T instead, so that all files whose statistic λ is greater than T will be deemed to be misclassified.
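A sketch of the statistic in code form follows; it works in log space to avoid underflow from a product of 955 probabilities. This is our rendering of the formulas above, not code from the original study.

    import numpy as np

    def message_probabilities(R):
        """p_k for one dataset: the mean of row k of its relation matrix."""
        return R.mean(axis=1)

    def log_pseudo_likelihood(f, p):
        """log L(f) under the independent-Bernoulli assumption.

        f: binary message vector for one file; p: per-message probabilities.
        A message never seen in a sample makes the likelihood exactly zero
        (log of 0 is -inf); as noted below, these zeros proved informative.
        """
        with np.errstate(divide="ignore"):
            return np.sum(np.log(p * f + (1.0 - p) * (1.0 - f)))

    def lam_internet_sourced(f, p_int, p_dng):
        """Bernoulli misclassification statistic; values above a threshold T
        flag the file as a suspected misclassification."""
        return np.exp(log_pseudo_likelihood(f, p_dng)
                      - log_pseudo_likelihood(f, p_int))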
Figure 5. Histogram of Bernoulli misclassification test statistic for (a) internet sourced and (b) dangerous. The horizontal line marks a value of λ = 1.

Let us now use the ground truth, which specifies whether a given file is “valid” or “rejected”, to determine the performance of the Bernoulli misclassification test statistic. A misclassified file in internet sourced is one that is “rejected”, while a misclassified file in dangerous is one that is “valid”.

In the case of either dataset, for a given threshold T, the probability of detection is the probability that a truly misclassified file f will have a statistic λ(f) > T. On the other hand, the probability of false alarm is the probability that a correctly classified file f will have a statistic λ(f) > T. Ideally, the probability of detection will be close to 1, while simultaneously the probability of false alarm will be close to 0.

Figure 6. Receiver operating curves for Bernoulli likelihood ratio misclassification statistic for (a) internet sourced and (b) dangerous.

The plot of probability of detection versus false alarm for all thresholds is shown in Figure 6. Better misclassification detectors have plots further to the upper left, away from the diagonal. Since the curve plotted is far above the diagonal for internet sourced in Figure 6(a), we conclude that the Bernoulli likelihood ratio misclassification statistic is a very accurate detector of misclassified files in internet sourced. Additionally, since the plot in Figure 6(b) is above the diagonal, this indicates that the Bernoulli likelihood ratio misclassification statistic also performs well on the dangerous dataset. The sharp plateau in Figure 6(b) is due to a number of instances in which L_{internet sourced} took the value 0 on some files in dangerous. These happen to be truly non-compliant files! These qualitative assessments are confirmed by the areas under the curves in Figure 6. A perfect misclassification detector will have an area of 1 under the curve, while a detector that randomly decides misclassifications would have an area of 0.5. We obtained 0.95 for λ_{internet sourced} and 0.80 for λ_{dangerous} for areas under the curve, which we deem to be surprisingly good given the fact that the file contents were not directly inspected by our method.
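The receiver operating curves and their areas can be computed directly from the statistic and the ground truth; a sketch using scikit-learn (our choice of library, not necessarily the original tooling) follows. Infinite values of λ, which arise when a pseudo-likelihood is exactly zero, are clipped to a large finite score first.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    def roc_summary(is_misclassified, lam):
        """Probability of detection vs. false alarm over all thresholds T.

        is_misclassified: 1 where the ground truth disagrees with the
        dataset's label, 0 otherwise; lam: the statistic for each file.
        """
        scores = np.nan_to_num(np.asarray(lam), posinf=np.finfo(float).max)
        pfa, pd, thresholds = roc_curve(is_misclassified, scores)
        return pfa, pd, thresholds, roc_auc_score(is_misclassified, scores)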
6. Parser redundancy analysis

The Bernoulli misclassification statistic relies upon the diversity of responses to each file in order to make reliable decisions. It is impractical to use quite so many parsers as we used, so it is useful to assess whether we could get good performance with fewer parsers. The easiest way to do this is to measure the correlation between the typical error message distributions produced by different parsers across all files. In the following analysis, we combined both internet sourced and dangerous into a single dataset, whose relation matrix consists of the horizontal concatenation of both matrices shown in Figure 1. That is, the rows are the same as before, but the columns consist of the union of the columns from both matrices. Error messages (rows) that never occur were removed from the analysis, since they contribute no information.

The data can also be visualized using principal components analysis, much as was done in Section 4, but instead we use the error counts across files as coordinates for each parser. Since specific error messages cannot be enabled or disabled individually, it makes sense to aggregate the error messages into parsers by taking the total number of error messages for each parser on a given file. Principal components analysis places the parsers according to the plot shown in Figure 7. There is one large cluster (in the lower left of Figure 7) of parsers which have similar behaviors. There are quite a few outliers, most notably caradoc extract and poppler pdfinfo. The reader is cautioned that the apparent density of the large cluster is a bit misleading, because the ranges of the axes are quite different. In any event, the wide spread of parsers across the plot indicates that having a large number of parsers is quite valuable.

Although principal components analysis is useful, we can obtain a ranking of parsers by their redundancy, so that we can prioritize more informative parsers. We computed Pearson's correlation coefficient between all pairs of error messages (rows), to form a fairly large correlation matrix (not shown). Different error messages that occur on exactly the same set of files have a correlation coefficient of 1. In that case, one of those error messages is redundant. Errors were then grouped by parser, and for each pair of parsers we stored the median correlation from each of their pairwise message correlations; a sketch of this aggregation appears below. The result of this aggregation is the matrix shown in Figure 8. Whiter colors indicate higher correlation – more redundancy – while darker colors indicate lower correlation. Most of the matrix is fairly dark, which indicates low redundancy overall. The bright off-diagonal entries indicate trade-offs: one needs only run one of the two parsers indicated. The bright bands for pdftools pdfid occurred because that parser did not correlate with any of the others at all.
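The aggregation, assuming the combined relation matrix R (with zero rows removed) and a list giving the owning parser of each row, might look like the following sketch; the names are ours.

    import numpy as np

    def parser_correlation(R, row_parser):
        """Median pairwise message correlation for each pair of parsers.

        row_parser: parser name for each row of R.  Rows that never occur
        should already be removed, since constant rows have no correlation.
        """
        C = np.corrcoef(R.astype(float))   # message-by-message Pearson correlation
        parsers = sorted(set(row_parser))
        rows = {p: [i for i, q in enumerate(row_parser) if q == p]
                for p in parsers}
        M = np.empty((len(parsers), len(parsers)))
        for a, pa in enumerate(parsers):
            for b, pb in enumerate(parsers):
                # median over the block of correlations between the two parsers
                M[a, b] = np.median(C[np.ix_(rows[pa], rows[pb])])
        return parsers, M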
Figure 7. Parsers placed in three dimensions according to principal components analysis of the union of internet sourced and dangerous.

The clusters visible in Figure 7 can be confirmed by sorting the parsers based on their median correlation. The rows and columns shown in Figure 8 are sorted in this way. This indicates which parsers are individually most informative, because they form an approximate spanning set for the data. The obvious block in the lower right suggests that parsers should be grouped into two categories based on their informativeness: high and low. The rows and columns of the block in Figure 8 suggest that the boundary between the high and low categories appears to be fairly sharp, with all parsers to the left of origami pdfcop being highly informative. Based on the approximate spanning property of the highly informative category, which covers most of the variability visible in Figure 7, one should always run the parsers in the high category, with the other parsers in the low informative category treated as increasingly optional as one moves to the lower right of Figure 8. Notice that the parsers near the lower right of Figure 8 all come from the cluster in the lower left of Figure 7. Although it is difficult to see from the projection shown in Figure 7, the least informative parsers are on the inside of that cluster, while the parsers in the same cluster that are on its outer edges are more informative.
Figure 8. Correlation matrix between parser behaviors. Rows and columns are sorted according to median message correlation.
7. Conclusion and recommendations
This article has demonstrated that a given file's compliance with a format specification can be determined from the following three samples: (1) a diverse collection of parsers for the file format, (2) a sample of compliant files, and (3) a sample of non-compliant files. Our methodology is format agnostic, and so could work if a file format is not formally specified. Furthermore, our Bernoulli test for compliance is statistical, so it is robust to errors in the three samples, yet benefits from richer samples should they be available. We note that the use of the three samples means that this approach is a supervised approach. Under appropriate circumstances, it might be possible to use an unsupervised bootstrapping approach to extract the two samples of files from a single aggregated sample. This requires further investigation.

Principal components analysis is helpful in understanding the variability within a sample of files or parsers, but it holds somewhat less value as an automated analytic tool. Clusters visible within the principal components plots appear to reflect specific tool chains used in the creation of files, with valid files forming several dense clusters.

The Bernoulli likelihood ratio statistic detects non-compliant files by comparing error message prevalence aggregated across all parsers and both samples of files. It relies inherently upon both the coverage and depth of these samples, but we have shown that it can be very effective at its job when these samples are adequate.

Overall, we found that there is not much redundancy present in the behaviors of parsers for the PDF specification. That is to say that the relation matrix contains by far the most information if all parsers are considered, so one ought to use all parsers if possible. Our analysis determined which parsers are individually the most informative, based on pairwise comparisons. While it is entirely reasonable to study different subsets of parsers, we did not perform that analysis here. We refer the interested reader to [4] where such an analysis was performed.

If resources are tight, we found that it is probably not necessary to rerun a given parser with different options. For instance, running only one of xpdf pdftops and xpdf pdfinfo probably will not change the results too much. Roughly speaking, it appears that the best strategy is “more programmers contributing different code instead of one programmer's code run in many different ways.”

Our two statistical techniques for analyzing file format compliance are admittedly simple but standard statistical tools, though they are apparently not in wide use. They are easy to deploy, easy to interpret, and require little maintenance other than the selection of a single detection threshold. We therefore encourage deeper exploration into and experimentation with statistical methods in the study of file format compliance.
Acknowledgments
The author would like to thank the SafeDocs test and evaluation team, including the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology, and the PDF Association, Inc., for providing the test data. The author would like to thank Kris Ambrose for his heroic processing of all of the relevant files through all of the parsers, and for subsequently packaging the results in a very convenient format. The author would also like to thank Peter Wyatt for his detailed and insightful investigation into the files that were deemed outliers and for closely reading an earlier draft of this article.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) SafeDocs program under contract HR001119C0072. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of DARPA.
References

[1] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[2] Bander Ali Saleh Al-rimy, Mohd Aizaini Maarof, and Syed Zainudeen Mohd Shaid. Ransomware threat success factors, taxonomy, and countermeasures: A survey and research directions. Computers & Security, 74:144–166, 2018.
[3] Mamoun Alazab. Profiling and classifying the behavior of malicious codes. Journal of Systems and Software, 100:91–102, 2015. ISSN 0164-1212. doi: 10.1016/j.jss.2014.10.031.
[4] Kristopher Ambrose, Steve Huntsman, Michael Robinson, and Matvey Yutin. Topological differential testing. arXiv preprint arXiv:2003.00976, 2020.
[5] Barry C. Arnold and David Strauss. Pseudolikelihood estimation: some examples. Sankhyā: The Indian Journal of Statistics, Series B, pages 233–243, 1991.
[6] Mohamed Belaoued and Smaine Mazouzi. A real-time PE-malware detection system based on chi-square test and PE-file features. In IFIP International Conference on Computer Science and its Applications, pages 416–425. Springer, 2015.
[7] John Demme, Matthew Maycock, Jared Schmitz, Adrian Tang, Adam Waksman, Simha Sethumadhavan, and Salvatore Stolfo. On the feasibility of online malware detection with performance counters. ACM SIGARCH Computer Architecture News, 41(3):559–570, 2013.
[8] Roman Graf and Sergiu Gordea. A risk analysis of file formats for preservation planning. In Proceedings of the 10th International Conference on Preservation of Digital Objects (iPres2013), pages 177–186, 2013.
[9] Gregory W. Lawrence, William R. Kehoe, Oya Y. Rieger, William H. Walters, and Anne R. Kenney. Risk Management of Digital Information: A File Format Investigation. ERIC, 2000.
[10] D. Pearson and C. Webb. Defining file format obsolescence: A risky journey. The International Journal of Digital Curation, 3(1):89–106, July 2008.
[11] Maksym Schipka. Detection of exploits in files, January 8, 2009. US Patent App. 11/822,533.
[12] S. D. S.L and J. CD. Windows malware detector using convolutional neural network based on visualization images. IEEE Transactions on Emerging Topics in Computing, pages 1–1, 2019. doi: 10.1109/TETC.2019.2910086.
[13] An Gie Yong, Sean Pearce, et al. A beginner's guide to factor analysis: Focusing on exploratory factor analysis. Tutorials in Quantitative Methods for Psychology, 9(2):79–94, 2013.
Mathematics and Statistics, American University, Washington, DC, USA
Email address: