Operation is the hardest teacher: estimating DNN accuracy looking for mispredictions
Antonio Guerriero, Roberto Pietrantuono, Stefano Russo
DIETI, Università degli Studi di Napoli Federico II
Via Claudio 21, 80125 - Napoli, Italy
{antonio.guerriero, roberto.pietrantuono, stefano.russo}@unina.it

Abstract—Deep Neural Networks (DNN) are typically tested for accuracy relying on a set of unlabelled real-world data (operational dataset), from which a subset is selected, manually labelled and used as test suite. This subset is required to be small (due to manual labelling cost) yet to faithfully represent the operational context, with the resulting test suite containing roughly the same proportion of examples causing misprediction (i.e., failing test cases) as the operational dataset. However, while testing to estimate accuracy, it is desirable to also learn as much as possible from the failing tests in the operational dataset, since they inform about possible bugs of the DNN. A smart sampling strategy may allow to intentionally include in the test suite many examples causing misprediction, thus providing more valuable inputs for DNN improvement while preserving the ability to get trustworthy unbiased estimates. This paper presents a test selection technique (DeepEST) that actively looks for failing test cases in the operational dataset of a DNN, with the goal of assessing the DNN expected accuracy by a small and "informative" test suite (namely, with a high number of mispredictions) for subsequent DNN improvement. Experiments with five subjects, combining four DNN models and three datasets, are described. The results show that DeepEST provides DNN accuracy estimates with precision close to (and often better than) those of existing sampling-based DNN testing techniques, while detecting from 5 to 30 times more mispredictions, with the same test suite size.
Index Terms—Software testing, artificial neural networks
I. INTRODUCTION
Deep Neural Networks (DNN) are today an integral part of many applications, including safety-critical software systems such as in the medical [1] and autonomous driving domains [2]. Testing is a crucial activity in the development of such systems, both for quality/safety reasons, to avoid DNN-caused catastrophic failures [3], [4], and for cost as well – very large samples may be needed to reliably test a DNN [5], [6]. A significant research effort is currently being put on DNN testing. A primary goal is to find adversarial examples causing mispredictions, namely to expose as many failing behaviours as possible [7]–[11]. Several structural coverage criteria have been proposed to drive the automated generation of test inputs and assess the test adequacy – neuron coverage [7], k-multisection neuron coverage, neuron boundary coverage [8], combinatorial coverage [10]. It has been argued that these criteria may be misleading, because of the low correlation between the number of misclassified inputs in a test set and its coverage [12].
This work has been supported by the project COSMIC of UNINA DIETI.

Wu et al. [13] and Kim et al. [14] recently considered discrepancy measures between the training/validation data and test data, to improve fault detection and to have coverage criteria better correlated to failure-inducing inputs. The output of this type of failure-finding testing (and then debugging) process is an improved DNN, with higher accuracy. This resembles what is called debug testing in the traditional testing literature [15]. Beside differences in testing DNN-based and conventional software (e.g., the oracle definition), which make DNN testing problematic, a further issue is that the so-obtained testing results are not necessarily related to the accuracy actually experienced in operation. In fact, testing data may not be representative of the actual operational context. This may happen when test data are generated artificially (like in adversarial examples generation) or when they differ significantly from inputs observed in the field. The number of mispredictions and/or the coverage achieved give only an "indirect" (and, for what said, inaccurate) measure of the expected accuracy in operation, and ultimately of the confidence that can be placed in the DNN.

A less investigated research path is testing a DNN with the explicit goal of providing a statistical estimate of its expected accuracy in the operational context. In software reliability engineering, this is a well established practice known as operational testing [16]. The objective is to assess how well a DNN will perform in the intended context using a small yet effective amount of test data. With a cost-effective accuracy estimate, testers can establish a threshold-based release criterion and correct or tune their artificial networks (e.g., adjust the DNN structure or hyper-parameters) until the criterion is met. The reference scenario is the following: a DNN model is trained with a set of data (training dataset) and is meant to operate in a given context (operational context). To test the model, an arbitrarily large set of operational data is available (operational dataset), containing examples whose correct label is unknown: the goal of the tester is to select a small subset of the operational data, to be manually labelled and used as test cases to estimate the accuracy of the model in operation. Due to manual labelling, testers typically look for a trade-off between a good estimate and the labelling cost. This problem has been recently faced by Li et al. [17]. Their selection criterion is to minimise cross-entropy between selected test data and the whole set of unlabelled operational data, so as to have a sample of tests representative of the operational context. As this approach does not explicitly look for failing examples, it does not give much room for improving the DNN accuracy. From this viewpoint, many tests selected are useless, as they are non-failing examples that serve only the purpose of having highly-confident estimates.

Given the above, we either have a test suite exposing failures but not representative (adversarial-like testing), or representative tests but few failures exposed (operational testing). This paper targets the drawbacks of both strategies together.
We present DeepEST (Deep neural networks Enhanced Sampler for operational Testing), an operational testing method for DNN that builds test suites from an operational dataset so as to provide a close and efficient estimate of the expected accuracy and to find, at the same time, a high number of failing examples. Paraphrasing a famous aphorism by O. Wilde ("Experience is the hardest kind of teacher. It gives you the test first and the lesson afterward."), DeepEST aims to sample, from operational data, many failing tests to learn from, while building a set of tests suited for the DNN accuracy estimate.

With respect to pure operational testing, DeepEST is expected to provide DNN accuracy estimates that are more precise and more efficient (for number of tests required), plus the practical advantage of enabling debug-testing-like scenarios (improving accuracy through debugging/re-training). With respect to adversarial-like testing, DeepEST is able to provide accuracy estimates, enabling evaluation of alternative designs or DNN fine tuning, and establishing a release criterion directly related to what will be observed in operation.

Experiments with various DNN and datasets are presented, which assess DeepEST effectiveness, sensitivity to the number of tests to select from the operational dataset, and dependency on the dataset. As DeepEST can exploit various types of auxiliary information to select tests, the experimental study considers four variants. The results show that all the variants produce accuracy estimates similar to those of the state-of-the-art technique, while detecting from 5 to 30 times more mispredictions, with the same sample size.

The paper is structured as follows. Section II introduces sampling-based operational testing concepts used by DeepEST. Section III describes the DeepEST algorithm. Sections IV and V report the experimentation. Section VI discusses related work. Section VII concludes the paper.

II. SAMPLING-BASED OPERATIONAL TESTING OF DNNS
A. Operational testing of DNNs
In the traditional testing literature, operational testing refers to the family of techniques that use an operational profile to test a system to estimate its expected reliability (i.e., probability of not failing) in operation. Likewise, the primary goal of DNN operational testing is to estimate the expected accuracy (i.e., probability of not having mispredictions) in a given operational context [17]. Two main challenges arise.
Data skew. The idea of operational testing is that testing should not just care about exposing possible failures (we use the terms misprediction and failure interchangeably for DNN), but should be able to spot those failure-causing inputs that are more likely to occur in operation. In case of a relevant mismatch between the pre-release test data and the post-release context, the system could be stimulated in operation with inputs never seen during testing, with unexpected failures. Data skew is a concern for DNN, more than for traditional software systems. The latter are expected to work on a range of input data given to functionalities, and can be tested on a small carefully-selected sample of input data (e.g., obtained by input space partitioning). DNN are data-driven by nature; they are constructed around a training dataset, and generalising beyond what was observed during training is hard. This data-driven approach is governed by a statistical process and, due also to the black-box nature of DNN, it is tricky to identify classes of inputs that homogeneously represent the expected behaviour in operation. For instance, white-box partitioning has been shown to be not clearly correlated to the failing behaviour [12]. Thus, a drift of the post-release operational context from the pre-release testing context is more likely to cause unexpected failures compared to traditional software.
The imitation bias of operational testing.
The operational context drift is just a triggering condition for unexpected failures at runtime. The problem occurs because of the way in which operational testing is conducted. Operational testing selects a small data sample that can accurately represent the population; however, the mere imitation of the expected input can be inefficient, especially in highly reliable systems, because many failure-free tests are executed to get an acceptable estimate. A representative sample would roughly contain the same proportion of examples causing misprediction as the operational dataset. There are two problems with this: first, in highly accurate DNNs, the number of examples causing mispredictions is low, thus requiring other types of testing activities dedicated to detecting mispredictions (e.g., through adversarial-like techniques) to possibly improve the accuracy. Second, just mimicking the expected usage is fine from the estimation point of view as long as the imitation is faithful. If this is not the case, the risk of overestimation increases: if we only aim at having a representative sample of tests, the actual experienced accuracy may be significantly smaller than the estimated one if the operational context drifts, because of the occurrence of unexpected mispredictions.

Thus, a conventional approach for DNN operational testing allows to obtain the desired accuracy estimate but, since it reveals few mispredictions, can be ineffective (leading to wrong estimates) and inefficient (requiring further separate testing activities to detect mispredictions). Recent results in operational testing for traditional software show that exposing failures and estimating reliability are not contrasting objectives [18]. A strategy that actively looks for failures (rather than just mimicking the expected usage) can lead to accurate and stable estimates and, at the same time, expose many failures. Our aim is exactly to spot failing examples in a DNN operational dataset, while preserving the ability to yield effective (small error) and efficient (low variance) estimates.

B. Sampling-based testing
DeepEST uses statistical sampling, a natural way to cope with estimation problems: it serves to design sampling plans tailored for a population under study, providing effective and efficient estimators. In sampling-based testing [19], the sample is the set of $n$ test cases $T = \{t_1, \dots, t_n\}$, having a binary outcome (pass/fail). Test outcomes are a series of independent Bernoulli random variables $z_{t_i}$ such that $z_{t_i} = 1$ if the DNN predicts the correct label for $t_i$, and $z_{t_i} = 0$ otherwise. The parameter of interest is the DNN accuracy $\theta$; we aim at an estimate $\hat{\theta}$ with two desirable properties: unbiasedness – i.e., the expectation of the estimate $E[\hat{\theta}]$ should be equal to the true value $\theta$ – and efficiency – for a given sample size, the variance of $\hat{\theta}$ should be as low as possible (for a highly confident, stable estimate). The probability that $z_{t_i} = 1$ corresponds to the true (unknown) proportion $\theta = \frac{\sum_{t=1}^{N} z_t}{N}$, with $N$ being the population size (i.e., the size of the operational dataset).

Simple random sampling with replacement (SRSWR) is the baseline approach: an unbiased estimator of $\theta$ is the observed proportion of correct predictions over the number of trials $n$:

$\hat{\theta}_{SRSWR} = \frac{\sum_{i=1}^{n} z_{t_i}}{n}$   (1)

Having assumed independent variables, the variance of $\hat{\theta}$ is:

$V(\hat{\theta}_{SRSWR}) = \frac{\theta(1-\theta)}{n}$   (2)

An improvement is represented by simple random sampling without replacement (SRSWOR), namely, the same test case is not selected twice: this reduces the variance to:

$V(\hat{\theta}_{SRSWOR}) = \frac{N-n}{N-1} \cdot \frac{\theta(1-\theta)}{n}$   (3)

While SRS keeps the mathematical treatment simple, it is unable to exploit additional information a tester might have. Exploiting auxiliary information to modify the sampling scheme is what is done in sampling theory to get more efficient estimators [20]. The sampling is made proportionally to an auxiliary observable variable assumed to be related to the (unknown) quantity to estimate; the estimator is then adjusted to account for the non-uniform sampling, so as to preserve unbiasedness. For instance, stratified sampling is a strategy which uses knowledge about which sample units are expected to have homogeneous values, and selects units contributing more to lower the estimate's variance.
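To make the baseline concrete, the following minimal Python sketch (ours, for illustration only; the function and variable names are not from the paper) estimates the accuracy by simple random sampling without replacement and reports the theoretical variance of Eq. (3), assuming an oracle function that stands in for the manual labelling and test execution step.

```python
import random

def srswor_accuracy_estimate(operational_dataset, n, oracle):
    """Estimate DNN accuracy from n tests drawn without replacement.

    operational_dataset: list of unlabelled examples
    oracle: function returning 1 if the DNN prediction is correct, 0 otherwise
            (in practice, manual labelling followed by test execution)
    """
    sample = random.sample(operational_dataset, n)        # SRSWOR draw
    outcomes = [oracle(x) for x in sample]                 # the z_{t_i} values
    theta_hat = sum(outcomes) / n                          # Eq. (1)
    N = len(operational_dataset)
    # Theoretical variance with finite population correction, Eq. (3),
    # using theta_hat as a plug-in for the unknown true accuracy theta.
    var_hat = (N - n) / (N - 1) * theta_hat * (1 - theta_hat) / n
    return theta_hat, var_hat
```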
Li et al. [17] present two sampling strategies for DNN, which are, to our knowledge, the only attempt at DNN operational testing: Confidence-based Stratified Sampling (CSS) and Cross Entropy-based Sampling (CES) – the latter being the authors' proposal. Both strategies exploit auxiliary information to drive the sampling task. In CSS, sampling is proportional to the confidence value provided by classifiers when predicting a label: examples with higher confidence are more likely to be selected as part of the test suite. This works well when the classifier is reliable, namely if examples with higher confidence are actually those for which the prediction is more likely to be correct (in other words, the model is perfectly trained for the operational context). Whenever the operational context drifts from the training one, CSS exhibits poor performance. CES attempts to overcome this limitation by using the output of the $m$ neurons in the last hidden layer, assumed to be more robust to the operational context drift. It builds the test suite trying to minimize the average cross-entropy between the probability distribution of the $m$-dimensional representation of the output of those neurons computed on the operational dataset and on the selected tests. To pursue the double objective of sampling cases causing mispredictions and estimating accuracy efficiently, DeepEST adopts an adaptive and without-replacement sampling algorithm, described in Section III.
C. Auxiliary information

To look for mispredictions, the auxiliary information leveraged by DeepEST adaptive sampling should represent the belief about some factor(s) related to the model's (in)accuracy. We consider two auxiliary variables, related to somewhat opposite sources of information: the confidence value and the distance between the operational dataset example and the training dataset. The latter is based on the result by Kim et al. [14] that inputs more "distant" from training data are more likely to cause misprediction – they show that distance is correlated to mispredictions, which is what we look for. We borrow their distance metrics Distance-based Surprise Adequacy (DSA) and Likelihood-based Surprise Adequacy (LSA). These are computed from the Activation Trace (AT), namely a vector of the activation values of each neuron of a certain layer corresponding to a certain input; we compute the AT with reference to the last DNN activation layer. LSA is defined as the negative log of density (computed via Kernel Density Estimation). DSA is defined as the Euclidean distance between the AT of a new input and the ATs observed during training. The variant using confidence is DeepEST_CS; the variants using the distance metrics are DeepEST_DSA and DeepEST_LSA.
Performance of an auxiliary variable can strongly depend on the DNN and on the training/operational dataset: for instance, for a distance metric, the belief that examples far from those in the training set are more likely to cause a misprediction is not an absolute truth. In particular, if an example is very similar to many others in the training set according to the distance metric, but has a different label, distance will not be a good metric to select it. Similarly, the confidence is a good proxy if the DNN is well trained for the operational context. In general, relying on a single auxiliary variable may work well in some settings and badly in others. Combining multiple beliefs is a choice that is expected to improve the stability of results across multiple settings. Based on this, we define a further variant of our algorithm, named DeepEST_C, considering as auxiliary variable the combination of confidence and distance:

$P = P_c \times (1 - P_d)$   (4)

where $P_c$ is the confidence value, $P_d$ is the DSA value normalized in [0, 1], and $P$ is the probability of correct prediction. The intuition behind this is that the probability of a correct prediction is related to both the confidence of the DNN and the drift of the example from what was seen during training. In fact, $P_c$ is related to the probability that the DNN makes a correct prediction according to what was seen during training (in other words: it is the probability of correct prediction with perfect training); $P_d$ is a proxy for the probability of a wrong prediction related to how far the example is from what was seen during training – hence due to the imperfection of training. If the confidence $P_c$ is high and the example is close to the training dataset (i.e., $P_d$ is small), there is a high chance of correct prediction.
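The sketch below (ours, not the authors' code) illustrates Eq. (4): it combines the confidence with a normalized distance-to-training proxy. The actual DSA/LSA values are computed with the scripts of Kim et al. [14]; here a simple nearest-neighbour Euclidean distance on activation traces is used only as a stand-in.

```python
import numpy as np

def combined_belief(confidence, train_ats, op_ats):
    """Combine confidence and a distance-to-training proxy as in Eq. (4).

    confidence: array (n_op,) with the softmax confidence per operational example
    train_ats:  array (n_train, m) with activation traces of the training data
    op_ats:     array (n_op, m) with activation traces of the operational data
    Returns P, the belief that each prediction is correct.
    """
    # Simplified distance proxy: Euclidean distance to the nearest training AT
    # (the paper uses DSA/LSA from Kim et al. [14]; this is only a stand-in).
    dists = np.array([np.min(np.linalg.norm(train_ats - at, axis=1)) for at in op_ats])
    # Normalize the distance into [0, 1] to obtain P_d.
    p_d = (dists - dists.min()) / (dists.max() - dists.min() + 1e-12)
    return confidence * (1.0 - p_d)          # Eq. (4): P = P_c * (1 - P_d)
```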
III. THE DEEPEST ALGORITHM
Both CES and CSS select a sample of tests representative of the operational context. DeepEST takes a different approach, favouring the sampling of failing (namely, misprediction-causing) tests, then used to estimate the DNN accuracy. While the estimation ability demands a probabilistic (hence "sampling-based") selection of examples, the very idea of DeepEST is that, since mispredictions are usually rare compared to correct examples, looking for failing tests is well handled by adaptive sampling [21]: the examples progressively selected may give hints about the probability of finding other failing examples, so as to adapt the sampling process accordingly. Given a sample size, adaptive sampling implicitly assumes that inputs of interest are not uniformly spread across the input space, and adopts a disproportional selection to spot them, counterbalanced by the estimator to preserve unbiasedness. In software testing, this is expected to give smaller variance compared to conventional sampling (e.g., SRS), since the failing inputs are usually not uniformly distributed. The DeepEST sampling algorithm is inspired by Adaptive Web Sampling [22], a flexible design for sampling rare populations.

The above auxiliary variables are used to define a weight $w_{i,j}$ between any pair of examples $i$ and $j$ of the operational dataset, used to explore the example space adaptively. For instance, let us assume we use the normalised DSA distance: if the example $i$ has distance $P_{d_i}$ (representing the belief that $i$ causes a misprediction), and a pre-defined threshold $\tau$ is exceeded (i.e., $i$ has a sufficiently high distance compared to the others), then all the $w_{i,j}$ values ($\forall j$ of the operational dataset) are set to their distance $P_{d_j}$; otherwise $w_{i,j} = 0$. This way, a strong-enough belief about example $i$ causing misprediction entails the activation of all the weights toward $i$. The latter are used, as explained hereafter, for sampling, and make the algorithm follow the distance criterion to spot potential clusters of failing examples. (In DeepEST_C, DSA is preferred to LSA since it has been shown to have better performance for the deeper layers [14].)

The DeepEST sampling strategy is sketched in Algorithm 1. Assuming $n$ examples are to be selected from the operational dataset as test cases, the algorithm selects and executes one test case per step. The first input is selected by simple random sampling, namely, initially all examples have equal probability of being selected. Then, one of two sampling schemes is used to select the next test: weight-based sampling (WBS) or simple random sampling (SRS).

Algorithm 1: DeepEST adaptive sampling
Data: OD: operational dataset; i: current sample; T: test suite; n: number of tests; r: probability of WBS
  T = ∅;
  i = SRS_sampling(OD);                       // first sample by SRS
  OD = OD \ i;                                // remove sample from dataset
  T = T ∪ i;                                  // add to suite
  y_1 = labelling_and_test_execution(i);
  for k = 2; k <= n; k++ do
      rs = random(0, 1);
      if rs < r then
          i = WBS_sampling(OD); OD = OD \ i;
      else
          i = SRS_sampling(OD); OD = OD \ i;
      T = T ∪ i;
      y_i = labelling_and_test_execution(i);
      z_k = Equation (6)
  θ̂ = Equation (7)                           // compute final estimate
Example $i$ is selected from the operational dataset at step $k$ with probability $q_{k,i}$ given by:

$q_{k,i} = r \cdot \frac{\sum_{j \in s_k} w_{i,j}}{\sum_{h \notin s_k, j \in s_k} w_{h,j}} + (1 - r) \cdot \frac{1}{N - n_{s_k}}$   (5)

where:
• $r$: probability of using WBS (hence, the probability of using SRS is $1 - r$; $r$ is set to 0.8 in our implementation);
• $s_k$: current sample (all examples selected up to step $k$);
• $w_{i,j}$: weight relating example $j$ in $s_k$ to example $i$;
• $n_{s_k}$: size of the current sample $s_k$;
• $N$: size of the operational dataset.

For both WBS and SRS, the selection is without replacement (without-replacement sampling schemes generally give a smaller variance than their with-replacement counterparts for the same sample size [20]). WBS selects an example $i$ proportionally to the sum of the weights $w_{i,j}$ of already selected examples toward $i$ – the chance of taking $i$ depends on the current sample, favouring the identification of clusters of failing examples if the auxiliary variable (hence, $w_{i,j}$) is well-correlated with mispredictions. As this is not always the case, the WBS "depth" exploration is balanced by SRS, chosen with probability $1 - r$, for a breadth exploration of the example space. This diversification in the search is useful to escape from unproductive cluster searches. The steps are repeated until the testing budget $n \leq N$ is over. At step 1, the probability that a randomly selected example will cause a misprediction is estimated as the outcome $y_1$ of the first test (1 in case of misprediction, 0 otherwise).
TABLE I: Experimental DNN and datasets

Subject | DNN    | Dataset  | Classes | Training set size | Test set size | True accuracy
S1      | CN5    | MNIST    | 10      | 60,000            | 10,000        | 0.9905
S2      | LeNet5 | MNIST    | 10      | 60,000            | 10,000        | 0.9868
S3      | CN12   | CIFAR10  | 10      | 50,000            | 10,000        | 0.8066
S4      | VGG16  | CIFAR10  | 10      | 50,000            | 10,000        | 0.9359
S5      | VGG16  | CIFAR100 | 100     | 50,000            | 10,000        | 0.7048
At step $k > 1$, example $i$, whose outcome is $y_i$, is selected with probability $q_{k,i}$ according to Eq. (5), and the estimator of the misprediction probability is that by Hansen-Hurwitz [23]:

$z_k = \frac{1}{N} \left( \sum_{j \in s_k} y_j + \frac{y_i}{q_{k,i}} \right)$   (6)

where the $y_j$ values are the outcomes of the tests already selected. $z_k$ is an unbiased estimator of the expected misprediction probability at step $k$; the final estimator of the expected accuracy of the DNN is 1 minus the average of the $z_k$ values:

$\hat{\theta} = 1 - \frac{1}{n} \left( y_1 + \sum_{k=2}^{n} z_k \right)$   (7)

where $n$ is the number of tests run.
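For illustration, the following Python sketch puts Algorithm 1 and Eqs. (5)–(7) together under simplifying assumptions: a precomputed weight dictionary stands for the threshold-based weight activation, and an oracle function stands for manual labelling and test execution. It is a sketch in the spirit of the algorithm, not the authors' Java implementation.

```python
import random

def deepest_like_sampling(op_data, weights, n, oracle, r=0.8, seed=0):
    """Adaptive, without-replacement sampling with a Hansen-Hurwitz-style estimate.

    op_data: list of example identifiers (the operational dataset)
    weights: dict mapping (i, j) -> w_{i,j}, pairwise weights from the auxiliary
             variable (0 when the threshold tau is not exceeded)
    oracle:  function(example) -> 1 if the DNN mispredicts it, 0 otherwise
    """
    rng = random.Random(seed)
    remaining = list(op_data)
    N = len(remaining)

    first = remaining.pop(rng.randrange(len(remaining)))    # step 1: SRS
    sample = [first]
    y = {first: oracle(first)}
    z = []

    for _ in range(2, n + 1):
        # Selection probability q_{k,i} for every not-yet-selected example (Eq. 5)
        tot_w = sum(weights.get((h, j), 0.0) for h in remaining for j in sample)
        q = {}
        for i in remaining:
            wbs = (sum(weights.get((i, j), 0.0) for j in sample) / tot_w) if tot_w > 0 else 0.0
            q[i] = r * wbs + (1 - r) / len(remaining)
        # Draw the next test according to q, without replacement
        i_sel = rng.choices(remaining, weights=[q[i] for i in remaining], k=1)[0]
        remaining.remove(i_sel)
        y[i_sel] = oracle(i_sel)
        # Step estimate of the misprediction probability (Eq. 6)
        z.append((sum(y[j] for j in sample) + y[i_sel] / q[i_sel]) / N)
        sample.append(i_sel)

    # Final accuracy estimate (Eq. 7)
    return 1 - (y[first] + sum(z)) / n
```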
IV. EVALUATION

A. Experimental subjects
The four variants of DeepEST are evaluated against the SRSWR scheme as baseline and the mentioned state-of-the-art technique CES. Five experiments are conducted with four DNN models and three popular datasets. The datasets are MNIST, a dataset of handwritten digits [24]; CIFAR10, for image processing systems; and CIFAR100, similar to the previous one but with 100 classes [25]. The chosen DNN models are ConvNet5 (here simply CN5) and LeNet5 for MNIST classification; ConvNet12 (simply, CN12) and VGG16 for CIFAR10 classification; and VGG16 for CIFAR100. Table I lists the five subjects (DNN-dataset pairs); the true accuracy is in the last column.
B. Research questions and experiment design
The evaluation answers the following research questions.
RQ1: Effectiveness. How does DeepEST perform in finding inputs causing misprediction (i.e., failing examples) and simultaneously estimating a DNN's operational accuracy?
To gauge DeepEST's ability to provide effective DNN accuracy estimates with few examples, while spotting a high number of failing inputs, we set to 200 the number of tests to select (then varied to answer RQ2) and repeat 30 times the execution of the 6 compared techniques on the 5 subjects. (CN5 and CN12 are calibrated in the same way as Li et al. [17]; LeNet5 is calibrated as in Kim et al. [14]; for the VGG16 network we considered the weights at https://github.com/geifmany/cifar-vgg.) As evaluation metrics, we compute:

• The accuracy estimate $\hat{\theta}_i$ at the $i$-th repetition, and then the Mean Squared Error (MSE) as $MSE(\hat{\theta}) = \frac{1}{30}\sum_{i=1}^{30} (\hat{\theta}_i - \theta)^2$, where $\theta$ is the true operational accuracy. Note that for unbiased estimators, MSE and variance can be considered indistinguishable. In fact, $MSE = Variance + Bias^2$ and $Bias(\hat{\theta}) = E[\hat{\theta}] - \theta = 0$. The precision of the estimator is $\pi(\hat{\theta}) = \frac{1}{MSE(\hat{\theta})}$, and the relative precision (or relative efficiency) of estimator $A$ with respect to $B$ is $\pi_{A,B} = \frac{MSE(\hat{\theta}_B)}{MSE(\hat{\theta}_A)}$ ($\pi_{A,B} > 1$ means that $A$ is better than $B$).

• The average number of failures ($\phi = Mean(\phi_i)$), with $\phi_i$ being the number of failures in repetition $i$. For comparison purposes, we consider the relative number of failures of technique $A$ with respect to $B$: $\rho_{A,B} = \frac{\phi_A}{\phi_B}$ ($\rho_{A,B} > 1$ means that $A$ is better than $B$).
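As a concrete illustration of these metrics (a sketch with names of our own choosing, not part of the original study), the comparison between two techniques can be computed as follows.

```python
import numpy as np

def compare_techniques(estimates_a, estimates_b, failures_a, failures_b, true_accuracy):
    """Compute the comparison metrics used in the evaluation.

    estimates_*: per-repetition accuracy estimates of techniques A and B
    failures_*:  per-repetition counts of detected failing examples
    """
    mse_a = np.mean((np.asarray(estimates_a) - true_accuracy) ** 2)
    mse_b = np.mean((np.asarray(estimates_b) - true_accuracy) ** 2)
    pi_ab = mse_b / mse_a                                   # relative precision: >1 means A better
    rho_ab = np.mean(failures_a) / np.mean(failures_b)      # relative number of detected failures
    return mse_a, mse_b, pi_ab, rho_ab
```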
RQ2: Sensitivity to sample size. How does the performance of DeepEST vary with the sample size?

It is important to figure out how performance varies with the number of test cases to select from the operational dataset, namely with the sample size. Indeed, DeepEST aims to perform well especially with a small sample size, so as to yield precise estimates with relatively few examples to be manually labelled.
RQ3: Dataset influence. How is DeepEST's performance affected by the datasets?
In DNN testing, results are often heavily dependent on the (training and operational) datasets. This RQ aims to figure out how these may influence the ability of the auxiliary variables (confidence, DSA, LSA) to discriminate failing examples, affecting the performance of DeepEST. To answer RQ3, the test set is completely labelled, so as to identify all failures.
C. Implementation
DeepEST is implemented mostly in Java. The implementation of the distance metrics in DeepEST_DSA and DeepEST_LSA is the same used by Kim et al. [14]; we used their Python scripts to compute the DSA and LSA values. These are computed considering the last activation layer of each DNN. The threshold $\tau$ needed for the weights definition is set as follows:

• DeepEST_DSA: $\tau = mean(DSA) + 2 \times Std(DSA)$;
• DeepEST_LSA: $\tau = mean(LSA) + Var(LSA)$.

The threshold for confidence, used by DeepEST_CS and DeepEST_C, assumes that lower confidence values are more related to mispredictions (i.e., the weights are activated when the confidence is less than $\tau$).

In CES, the selection exploits the output of the last hidden layer. We use the same configuration as the original article [17]: an initial sample is enlarged by a group $Q^*$ of examples at each step, and the number of random groups from which $Q^*$ is selected is $L = 300$. For CES and SRS, we used the Python scripts provided by Li et al. [17].
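As a small concretization (a sketch, not the authors' Java code), the distance thresholds above can be derived from the score arrays as follows, assuming the DSA and LSA values have already been computed with the scripts of Kim et al. [14].

```python
import numpy as np

def weight_thresholds(dsa_values, lsa_values):
    """Thresholds used to activate the pairwise weights, as described above."""
    tau_dsa = np.mean(dsa_values) + 2 * np.std(dsa_values)   # DeepEST_DSA
    tau_lsa = np.mean(lsa_values) + np.var(lsa_values)       # DeepEST_LSA
    return tau_dsa, tau_lsa
```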
(Fig. 1: RQ1 (effectiveness): Mean Squared Error of estimates – panels (a) S1 (CN5, MNIST), (b) S2 (LeNet5, MNIST), (c) S3 (CN12, CIFAR10), (d) S4 (VGG16, CIFAR10), (e) S5 (VGG16, CIFAR100).)

TABLE II: RQ1 (effectiveness): Mean and standard deviation (σ) of the number of failing examples detected

Subject              | SRS          | CES          | DeepEST_CS    | DeepEST_DSA  | DeepEST_LSA  | DeepEST_C     | Total failing examples
                     | Mean    σ    | Mean    σ    | Mean     σ    | Mean    σ    | Mean    σ    | Mean     σ    |
S1 (CN5, MNIST)      | 1.90    1.52 | 2.23    1.57 | 44.57    0.68 | 30.17   4.03 | 22.43   3.29 | 11.93    3.06 | 95
S2 (LeNet5, MNIST)   | 2.77    2.06 | 2.03    1.16 | 65.13    0.86 | 37.57   5.12 | 24.10   3.74 | 24.07    4.15 | 132
S3 (CN12, CIFAR10)   | 37.93   6.47 | 33.77   3.92 | 106.10   7.26 | 98.33   4.62 | 38.07   6.61 | 70.77    7.09 | 1,934
S4 (VGG16, CIFAR10)  | 12.60   3.33 | 13.03   3.49 | 85.00    4.86 | 16.80   3.02 | 14.77   3.14 | 24.90    4.97 | 641
S5 (VGG16, CIFAR100) | 57.33   6.53 | 55.77   6.62 | 131.17   7.35 | 78.20   6.62 | 67.47   3.68 | 108.70   4.45 | 2,952
V. RESULTS
A. RQ1: effectiveness
Figure 1 plots the MSE of the estimated accuracy. The techniques exhibit comparable performance, with DeepEST_C being the best one for 3 of the 5 subjects. CES has good performance in terms of MSE; it is the second best technique in 3 cases. Considering the single variables, confidence, DSA or LSA lead to results slightly more variable over the subjects – an aspect explored in RQ3. The SRS case is interesting, too: it is never the worst approach and is the best in one case.

Table II reports the average number and the standard deviation of the failing examples detected. All variants of DeepEST identify many more failing examples than SRS and CES, even up to a factor of 30x (DeepEST_CS vs CES for subject S2), and reaching in some cases (S1 and S2, DeepEST_CS) almost 50% of the total number of failing examples in the datasets (last column). (A replication package is at: https://github.com/dessertlab/DeepEST.) The DeepEST algorithm leverages the adaptive sampling to spot clusters of failing examples with relatively few tests (set to 200 for RQ1). Its performance varies depending on the auxiliary information used, but it is always remarkably better than SRS and CES.

Among the DeepEST variants, confidence (DeepEST_CS) turns out to be the most effective auxiliary variable in detecting failures, showing the best performance for all datasets and models, followed by DSA. CES and SRS select the lowest number of mispredicted examples, and are close to each other. Considering both the failure detection ability and the estimate accuracy, DeepEST_C – which combines confidence and DSA, the two best auxiliary variables for failing example detection – gives a good trade-off, since it provides stable (across subjects) and close-to-true estimates of the accuracy, with many more detected failing examples than CES and SRS.
(TABLE III: Pairwise comparison of techniques. A value of π_{R,C} > 1 means the technique on the row has a greater precision than that on the column; similarly, ρ_{R,C} > 1 means the technique on the row detects more failing examples. Subtables (a)–(e) report the pairwise comparisons for subjects S1–S5; subtable (f) summarizes the wins and losses over all pairwise comparisons.)
Tables III(a)–(e) show the results of the pairwise comparison of the techniques. For the four DeepEST variants, the row and column headings list the name of the auxiliary variable. The evaluation metrics are the ratio ρ of the failing examples and the relative precision π of the estimators. Values of ρ or π greater (lower) than 1 mean the technique on the row (column) has better performance. If a technique is better than the other in a pair for both metrics, we say that it wins.

Table III(f) summarizes the number of wins (and losses). CES never wins over other approaches, while DeepEST_C wins against CES 3 out of 5 times. SRS never wins over DeepEST, while it wins vs CES considering VGG16 on CIFAR100. DeepEST_C never loses and collects the highest number of wins (10), showing the best trade-off between accuracy of the estimation and number of detected failing examples. The choice of the DeepEST variant may be determined by which auxiliary information can be collected. We see that DeepEST_C, exploiting a combination of two variables, has good and more stable results in terms of MSE than the other variants, at the expense of a slight decrease in detected failures. Single auxiliary variables are more sensitive to the specific dataset/model pair (e.g., confidence works well if the DNN is reliable), but expose more mispredictions. Confidence has the advantage of not requiring knowledge of the hidden layers and is easier to compute. When no information is available or easily computable, SRS could be a good low-cost solution.

B. RQ2: sensitivity to sample size
To answer this RQ, experiments are run with sample sizes of 50, 100, 200, 400, and 800, considering the subject with the highest accuracy (S1) and the one with the lowest accuracy (S5), so as to analyze how DeepEST performs when there are very few and many failing examples in the dataset, respectively. Figure 2 shows the mean values of the estimated accuracy over the repetitions. Figures 3 and 4 plot the MSE and the mean number of detected failures, respectively. Expectedly, increasing the sample size, all techniques exhibit a decreasing trend in MSE and an increasing trend in failing examples.

(Fig. 2: RQ2 (sensitivity to sample size): Accuracy for the most accurate subject, (a) S1 (CN5, MNIST), and the least accurate subject, (b) S5 (VGG16, CIFAR100).)
(Fig. 3: RQ2 (sensitivity to sample size): MSE for the most (a) and the least (b) accurate subjects.)
(Fig. 4: RQ2 (sensitivity to sample size): Number of failing examples for the most (a) and the least (b) accurate subjects.)

For the subject with the highest accuracy (S1), we observe the following. DeepEST_C shows very good performance for MSE (Fig. 3(a), Fig. 1), and it detects on average about six times the failures of SRS and CES (Table II). The advantage for MSE is more pronounced with the smallest sample sizes, which makes DeepEST_C particularly suited when the number of examples to select and label is very small. For larger sample sizes, the MSE is similar, but the advantage of DeepEST_C over SRS and CES is very pronounced for detected failures (Fig. 4(a)). DeepEST_CS is the best among all techniques at detecting failures for small sample sizes; for the larger sizes, the best is DeepEST_DSA (Fig. 4(a)). Although the estimates are unbiased (hence, they tend to the true value), if we look at the mean estimates over the repetitions (Figure 2(a)), CES shows poor performance for the smaller sample sizes. SRS works well with a low budget; its good results may be influenced by the very low number of failures: 18/30 repetitions show 0 failures and an estimated accuracy of 1. It is interesting to note that in most cases CES and SRS overestimate the true accuracy – an undesired property, especially for critical systems. This is related to the low number of failures detected, as discussed in Section II.

For the subject with the lowest accuracy (S5), CES and SRS outperform DeepEST_C for the smallest sample sizes; for larger sizes, the estimation by CES starts diverging, while SRS and DeepEST keep good performance. The tendency to overestimate the accuracy by CES and SRS is confirmed. Performance in failing example detection is always clearly in favour of all DeepEST variants. Performance in estimation accuracy is almost specular to what is observed with the most accurate model. A reason is that DeepEST is a sampling technique particularly suited for rare populations, which is not the case for S5. As for the ability to detect failing examples, confidence is the best auxiliary variable for DeepEST for subject S5: it presents the best values in all configurations.

C. RQ3: dataset influence
We have seen in the experiments for RQ1 and RQ2 that no single auxiliary variable performs best in all situations. For instance, we can consider the confidence good auxiliary information for subject S1, and a bad choice for S5. This may depend on several factors: assuming a perfect training phase, the confidence could be affected by a bias in the training set; or, with a perfect training set, a wrong training phase (e.g., due to overfitting) could generate mispredictions with high confidence. In some cases the operational dataset could contain examples very similar (i.e., at small distance) to those in the training set but with a different label, affecting the discriminative power of the DSA and LSA metrics. To analyze how the three datasets influence the ability of auxiliary variables to discriminate failing examples (impacting the performance observed in RQ1-RQ2), we consider the subjects S1 and S5, as for RQ2, plus the VGG16 DNN for CIFAR10.

Figures 5, 6, and 7 show the logistic regression for the three datasets. The curves fit the probabilities of the fail/pass outcome to the three predictors: confidence, DSA, and LSA. Consider MNIST, for which CN5 reaches the highest accuracy among all subjects: the probability for a test to fail is very low for values of confidence between 0.7 and 1 (Figure 5). This is clearly not the case for CIFAR10 (Figure 6) and, especially, CIFAR100 (Figure 7). In the latter, there is a high chance of misprediction even with high values of confidence; this could be due to a high skew between training and test data.

The discriminating power of DSA and LSA is clearly greater for MNIST, as the high slope of the S-shaped curves in Figures 5(b)-(c) suggests, compared to the corresponding ones in Figures 6 and 7. In this case, it actually happens that the farthest examples have the highest probability of being related to a failure, with a sharp increase beyond moderate DSA and LSA values. This means that if in operation there are (a lot of) examples far from what was observed in training, re-training with a more representative dataset can be useful to improve the accuracy. This behaviour is not observed for CIFAR10 and CIFAR100, where DSA and LSA do not seem to be effective in dividing the two sets of examples. In CIFAR10, the DSA line is more horizontal, meaning that the DSA value does not reflect well the failure probability. The scarce discriminative power of the auxiliary variables on CIFAR10 and CIFAR100 partially explains the smaller gain of DeepEST over CES and SRS (especially in terms of MSE); nevertheless, its adaptivity allows spotting many failing examples even in these conditions.

Finally, it is interesting to highlight the performance of DeepEST_CS (based on confidence) on MNIST in RQ2. The saturation in its failure detection ability (Figure 4(a)) can be explained by observing that, after a number of tests able to spot failures by looking at low-confidence examples, the few remaining ones with high confidence are selected with low probability; the sharper discrimination made by DSA and LSA determines a high detection ability even when few failing examples remain. In summary, whenever a tester has good belief/evidence about the appropriateness of one of the above auxiliary variables, it is a good choice to select the specific DeepEST variant; if not, the combined variant DeepEST_C has shown to give the best trade-offs in all five experimented cases.
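For completeness, a univariate fit of the fail/pass outcome against one auxiliary variable, in the spirit of the curves shown in Figures 5–7, can be obtained as sketched below. This is our illustrative sketch (using scikit-learn); the paper does not specify the tooling used to produce the figures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_failure_curve(aux_values, mispredicted):
    """Fit a logistic regression of the fail/pass outcome on one auxiliary variable.

    aux_values:   array (n,) with the auxiliary variable (confidence, DSA, or LSA)
    mispredicted: array (n,) with 1 for failing examples, 0 otherwise
    Returns a function mapping auxiliary values to estimated failure probability.
    """
    model = LogisticRegression()
    model.fit(np.asarray(aux_values).reshape(-1, 1), np.asarray(mispredicted))
    return lambda x: model.predict_proba(np.asarray(x).reshape(-1, 1))[:, 1]
```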
D. Threats to validity

A threat to internal validity comes from the selection of the experimental subjects. To favour the repeatability of the experiments under different possibly influencing factors, we have used publicly available networks and pre-trained models, to avoid incorrect implementations. The configuration of the parameter r in DeepEST and a different setting of the thresholds may also affect the results (in terms of efficiency of the estimator), hence a fine-tuning is suggested before applying the method to other dataset-DNN pairs. Although the code developed was carefully inspected, a common threat is the correctness of the scripts to collect data and compute the results.

The choice of the sample size influences the effectiveness too. We ran a sensitivity analysis to show that DeepEST is still effective (compared to both CES and SRS) under five values of the sample size (from 50 to 800), but different values could yield different results. Threats to external validity depend on both the number of models and datasets considered for experimentation. We strived to control this threat by considering different widely used DNN and datasets. Although the results may change with different subjects, the diversity and significance of the chosen subjects give confidence in the general considerations. Replicability of the experiments on other subjects can further mitigate this threat.

VI. RELATED WORK
Testing of DNN is a hot research topic. Zhang et al. [26] present a survey on Machine Learning testing, identifying three main families of techniques: mutation testing [27], metamorphic testing [9], [28], and cross referencing [7], [29]. The former two are for test generation: they generate adversarial examples causing mispredictions. Cross-referencing can highlight the most interesting test cases when different implementations of a system disagree. Most of these techniques are meant for fault detection, rather than for accuracy estimation.
(Fig. 5: RQ3 (dataset influence): MNIST samples distribution – (a) Confidence, (b) DSA, (c) LSA.)
(Fig. 6: RQ3 (dataset influence): CIFAR10 samples distribution – (a) Confidence, (b) DSA, (c) LSA.)
(Fig. 7: RQ3 (dataset influence): CIFAR100 samples distribution – (a) Confidence, (b) DSA, (c) LSA.)

Testing DNN for accuracy is the focus of Li et al. [17], who presented CES – which DeepEST has been compared against – as the first approach using operational testing to this aim. DeepEST differs from CES in several aspects, the key ones being the sampling algorithm and the auxiliary variables used, which are conceived to improve failing example detection while preserving the accuracy estimation power.

Operational testing, where tests are derived according to the operational profile, is an established practice to estimate software reliability [30]. It was the core technique of Cleanroom software engineering [31]–[34] and of the Software Reliability Engineering Test process [30].
Cai et al. developed Adaptive Testing, still based on the operational profile, but foreseeing adaptation in assigning tests to partitions [35]–[38]. Recently, Pietrantuono et al. stressed the use of unequal probability sampling to improve the estimation efficiency [39], to this aim formalizing several sampling schemes [19]. Our proposal goes along this direction: we do not sample tests representative of the operational dataset, but we "alter" the selection and counterbalance the uneven selection in the estimator. Unequal probability, without-replacement, and adaptive sampling are the key concepts we borrowed for operational testing of DNNs.

VII. CONCLUSION
Testing the accuracy of a DNN with operational data aims at precise estimates with small test suites, due to the cost of manually labelling the selected test cases. This effort motivates pursuing also the goal of exposing many mispredictions with the same test suite, so as to improve the DNN after testing. With these two concurrent goals, we presented DeepEST, a technique to select failing tests from an operational dataset, while ultimately yielding faithful estimates of the accuracy of a DNN under test.

We evaluated experimentally four variants of DeepEST, based on various types of auxiliary information that its adaptive sampling strategy can leverage. The results with four DNN models and three popular datasets show that all DeepEST variants provide accurate estimates, compared to existing sampling-based DNN testing techniques, while generally much outperforming them in exposing mispredictions. Practitioners may choose the appropriate variant, depending on the characteristics of their operational dataset and on which auxiliary information is available or they can collect.
REFERENCES
[1] Ziad Obermeyer and Ezekiel J. Emanuel. Predicting the future — big data, machine learning, and clinical medicine. New England Journal of Medicine, 375(13):1216–1219, 2016. PMID: 27682033.
[2] Mariusz Bojarski et al. End to End Learning for Self-Driving Cars. arXiv:1604.07316, 2016.
[3] Amir Efrati. Uber finds deadly accident likely caused by software set to ignore objects on road. The Information.
[4], [5] ICSE, pages 303–314. ACM, 2018.
[6] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 1701–1708. IEEE, 2014.
[7] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. DeepXplore: Automated whitebox testing of deep learning systems. Communications of the ACM, 62(11):137–145, 2019.
[8] Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. DeepGauge: Multi-granularity testing criteria for deep learning systems. In ASE, pages 120–131. ACM, 2018.
[9] Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In ASE, pages 132–142. ACM, 2018.
[10] Lei Ma, Fuyuan Zhang, Minhui Xue, Bo Li, Yang Liu, Jianjun Zhao, and Yadong Wang. Combinatorial testing for deep learning systems. arXiv:1806.07723, 2018.
[11] Augustus Odena and Ian Goodfellow. TensorFuzz: Debugging neural networks with coverage-guided fuzzing. In Proceedings of Machine Learning Research, volume 97, 2019.
[12] Zenan Li, Xiaoxing Ma, Chang Xu, and Chun Cao. Structural coverage criteria for neural networks could be misleading. In ICSE-NIER, pages 89–92. IEEE, 2019.
[13] Weibin Wu, Hui Xu, Sanqiang Zhong, Michael R. Lyu, and Irwin King. Deep Validation: Toward detecting real-world corner cases for deep neural networks. In DSN, pages 125–137. IEEE, 2019.
[14] Jinhan Kim, Robert Feldt, and Shin Yoo. Guiding deep learning system testing using surprise adequacy. In ICSE, pages 1039–1049. IEEE, 2019.
[15] Phyllis G. Frankl, Richard G. Hamlet, Bev Littlewood, and Lorenzo Strigini. Evaluating testing methods by delivered reliability. IEEE Transactions on Software Engineering, 24(8):586–601, 1998.
[16] Michael R. Lyu, editor. Handbook of Software Reliability Engineering. McGraw-Hill, Inc., Hightstown, NJ, USA, 1996.
[17] Zenan Li, Xiaoxing Ma, Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü. Boosting operational DNN testing efficiency through conditioning. In Proc. of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE, pages 499–509. ACM, 2019.
[18] Domenico Cotroneo, Roberto Pietrantuono, and Stefano Russo. RELAI testing: a technique to assess and improve software reliability. IEEE Transactions on Software Engineering, 42(5):452–475, 2016.
[19] Roberto Pietrantuono and Stefano Russo. On adaptive sampling-based testing for software reliability assessment. In ISSRE, pages 1–11. IEEE, 2016.
[20] Sharon L. Lohr. Sampling: Design and Analysis. Duxbury Press, 2009.
[21] Steven K. Thompson. Sampling, Third Edition. John Wiley & Sons, Inc., 2012.
[22] Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
[23] Morris H. Hansen and William N. Hurwitz. On the theory of sampling from finite populations. The Annals of Mathematical Statistics, 14(4):333–362, 1943.
[24] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[25] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, 2009.
[26] Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering, pages 1–37, 2020.
[27] Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, et al. DeepMutation: Mutation testing of deep learning systems. In ISSRE, pages 100–111. IEEE, 2018.
[28] Xiaoyuan Xie, Joshua W. K. Ho, Christian Murphy, Gail Kaiser, Baowen Xu, and Tsong Yueh Chen. Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software, 84(4):544–558, 2011.
[29] Siwakorn Srisakaokul, Zhengkai Wu, Angello Astorga, Oreoluwa Alebiosu, and Tao Xie. Multiple-implementation testing of supervised learning software. In AAAI Workshops. Association for the Advancement of Artificial Intelligence, 2018.
[30] John D. Musa. Software reliability-engineered testing. Computer, 29(11):61–68, 1996.
[31] Harlan D. Mills, Michael Dyer, and Richard C. Linger. Cleanroom software engineering. IEEE Software, 4(5):19–24, 1987.
[32] P. Allen Currit, Michael Dyer, and Harlan D. Mills. Certifying the reliability of software. IEEE Transactions on Software Engineering, SE-12(1):3–11, 1986.
[33] Richard H. Cobb and Harlan D. Mills. Engineering software under statistical quality control. IEEE Software, 7(6):45–54, 1990.
[34] Richard C. Linger and Harlan D. Mills. A case study in Cleanroom software engineering: the IBM COBOL Structuring Facility. In COMPSAC, pages 10–17. IEEE, 1988.
[35] Junpeng Lv, Bei-Bei Yin, and Kai-Yuan Cai. On the asymptotic behavior of adaptive testing strategy for software reliability assessment. IEEE Transactions on Software Engineering, 40(4):396–412, 2014.
[36] Junpeng Lv, Bei-Bei Yin, and Kai-Yuan Cai. Estimating confidence interval of software reliability with adaptive testing strategy. Journal of Systems and Software, 97:192–206, 2014.
[37] Kai-Yuan Cai, Chang-Hai Jiang, Hai Hu, and Cheng-Gang Bai. An experimental study of adaptive testing for software reliability assessment. Journal of Systems and Software, 81(8):1406–1429, 2008.
[38] Kai-Yuan Cai, Yong-Chao Li, and Ke Liu. Optimal and adaptive testing for software reliability assessment. Information and Software Technology, 46(15):989–1000, 2004.
[39] Roberto Pietrantuono and Stefano Russo. Probabilistic sampling-based testing for accelerated reliability assessment.