Distribution-Aware Testing of Neural Networks Using Generative Models
Swaroopa Dola
Department of Computer Engineering, University of Virginia
Charlottesville
[email protected]

Matthew B. Dwyer
Department of Computer Science, University of Virginia
Charlottesville
[email protected]

Mary Lou Soffa
Department of Computer Science, University of Virginia
Charlottesville
[email protected]
Abstract—The reliability of software that has a Deep Neural Network (DNN) as a component is urgently important today given the increasing number of critical applications being deployed with DNNs. The need for reliability raises a need for rigorous testing of the safety and trustworthiness of these systems. In the last few years, there have been a number of research efforts focused on testing DNNs. However, the test generation techniques proposed so far lack a check to determine whether the test inputs they are generating are valid, and thus invalid inputs are produced. To illustrate this situation, we explored three recent DNN testing techniques. Using deep generative model based input validation, we show that all three techniques generate a significant number of invalid test inputs. We further analyzed the test coverage achieved by the test inputs generated by the DNN testing techniques and showed how invalid test inputs can falsely inflate test coverage metrics.
To overcome the inclusion of invalid inputs in testing, we propose a technique to incorporate the valid input space of the DNN model under test in the test generation process. Our technique uses a deep generative model-based algorithm to generate only valid inputs. Results of our empirical studies show that our technique is effective in eliminating invalid tests and boosting the number of valid test inputs generated.
Index Terms—deep neural networks, deep learning, input validation, test generation, test coverage
I. INTRODUCTION
Deep Neural Network (DNN) components are increasingly being deployed in mission and safety critical systems, e.g., [1], [2], [3], [4]. Similar to traditional programmed software components, these learned
DNN components require significant testing to ensure that they are reliable and thus fit for deployment. Yet DNNs differ from programmed software components in a variety of ways. (1) They generally do not have well-defined specifications and instead rely on a set of examples that represent intended component behavior. (2) These examples are used to train the parameters of a fixed implementation architecture, resulting in implementation behavior encoded as values of the learned parameters. (3) The training process continues until the learned function is an accurate approximation of the intended behavior. Finally, (4) the accuracy of the learned function is intended to generalize to the set of valid inputs comprised of the data distribution of which the training examples are representative.

The above characteristics of DNNs present challenges for applying existing software testing methods to DNNs. For example, the lack of specifications makes it most challenging to develop a rich test oracle, and the fact that parameter values encode behavior renders traditional structural code coverage ineffective. The growing body of research on DNN testing has begun to address some of these characteristics. While structural code coverage metrics are ineffective for DNNs, methods that cover combinations of computed DNN neuron values have been developed to assess and drive DNN testing [5], [6], [7]. Also, variations of metamorphic testing have been developed to check critical continuity properties across the learned function approximations, helping to fill the oracle gap [8], [9], [10].

In this paper, we focus on the challenges that DNN generalization presents to testing, and in particular how current DNN testing techniques treat valid and invalid inputs. To understand these challenges, consider the implementation of a traditional software component C, which is developed to meet a specification S : R^n → R^m ∪ {e}, where e denotes the error behavior intended for invalid inputs. In this setting, the input domain R^n is partitioned into valid inputs, V, and invalid inputs, V̄ = R^n − V, which should yield e. The testing of C selects a test set T ⊂ R^n and assesses whether ∀ t ∈ T : C(t) = S(t). As sketched in Fig. 1a, typically C is comprised of input validation, which determines if an input value lies in V, and then executes either functional logic, which realizes the behavior of S on V, or error processing for invalid data. Developers have come to rely on several intuitions about such software. First, input validation logic is distinct from functional logic, demanding testing approaches that exploit its properties [11], [12], [13], [14] to effectively support it [15], [16], [17]. Second, test suites that achieve higher coverage are better in that they exercise more of the validation, functional, and error logic.

C(i) { if (valid(i)) return logic(i) else return error(i) }   (a) Code
N(i) { ... return ... }   (b) DNN
Fig. 1: Structure of code and DNN components C and N.

Fig. 2: Cumulative neuron coverage of LeNet-1 on the first 100 valid and invalid inputs generated by DLFuzz (top) and DeepXplore (bottom); coverage vectors (left) and ratios (right) for each set are shown along with the cumulative ratio (in parentheses).

Now, consider a DNN, N : R^n → R^m, which is trained to accurately approximate the, possibly unavailable, specification S. As sketched in Fig.
1b, N is comprised of layers of neurons that are cross-coupled by connections labeled with learned parameters. When the learned parameters for N are such that Pr(N(i) = S(i) | i ∈ V) ≥ 1 − ε, for a desired error ε, the network is expected to generalize to the valid input distribution, V. Even if N were trained to detect invalid data and respond appropriately, its structure does not force a distinction between input validation, functional logic, or error processing. In practice, this distinction is uncommon, and in this case N does not even have an analog for e in its output domain. Because of the lack of this distinction, whether an input lies in V or V̄, the computation performed by N overlaps to a large degree, e.g., common sets of neurons are activated.

Not distinguishing between valid and invalid input can be problematic for DNN testing in at least three ways. (1) Testing techniques that generate invalid inputs increase cost with little value added for testing the functional logic of N. Fig. 3 depicts valid test inputs and selected invalid test inputs from two recently proposed DNN test generation techniques [5], [18]. As we show in §IV, across a range of testing approaches for DNNs [19], [5], [18], on average 42% of the generated tests are invalid, and in the worst case all tests generated by a given technique are invalid. (2) When a test case fails, developer time is required to triage the failure. With high numbers of invalid test inputs, developers may be forced to look through large numbers of test inputs, similar to those depicted in Fig. 3, to make judgements about test validity. The high rate of invalid inputs runs the risk that developers will avoid the use of these techniques, thereby negating their purported value. (3) Whereas for traditional software the coverage produced by invalid inputs is confined to the validation and error logic, for DNNs an analogous separation of coverage is not guaranteed. As depicted at the top of Fig. 2, the cumulative coverage from valid and invalid test sets can be almost identical, differing by as few as 1 of 52 neurons. Worse yet, as depicted in the bottom of Fig. 2, invalid tests can artificially boost coverage significantly beyond what is achieved by valid tests, from 0.692 to 0.808. This increase in coverage suggests that, unlike for traditional software, DNN test suites that achieve higher coverage are not necessarily better!

Fig. 3: Valid tests vs. invalid tests. Top row: valid tests from the MNIST training dataset. Middle row: invalid tests from DeepConcolic. Bottom row: invalid tests from DeepXplore.

In this paper, we study the effects of DNN test generation techniques not distinguishing between valid and invalid data and characterize the potential impact of the issues identified above. Our approach is to leverage a growing body of research from the Machine Learning (ML) community that learns models of the training distribution, V, from which the training data is drawn [20], [21], [22], [23]. While there are many such models, in this paper we employ the variational autoencoder (VAE), leaving the study of alternative models to future work.

Leveraging VAE models allows us to study techniques representative of the current state of DNN testing research and to make two important observations. First, we demonstrate that existing DNN testing techniques, such as DeepXplore [5], DLFuzz [19], and DeepConcolic [18], produce large numbers of test cases with invalid inputs, which increases test cost without a clear benefit.
Second, we demonstrate that existing DNN test coverage metrics, e.g., [5], [6], are unable to distinguish valid and invalid test cases, which risks biasing test suites toward including more invalid inputs in pursuit of higher coverage.

Building on these observations, we present a novel approach that combines a VAE model with existing test generation techniques to produce test cases with only valid inputs. More specifically, we formulate the joint optimization of the probability density of valid inputs and the objective of existing DNN test generation techniques, and use gradient ascent to generate valid tests. An experimental analysis on datasets used in the DNN testing literature [24], [25] shows the cost-effectiveness of the proposed approach.

The primary contributions of this work lie in: (a) the identification of limitations in existing DNN test generation and coverage criteria in their treatment of invalid input data; (b) the development of a technique for incorporating an explicit model of the valid input space of a DNN into test generation to address those limitations; and (c) experimental evaluation that demonstrates the extent of the limitations and the effectiveness of our technique in mitigating them.

The remainder of this article is organized as follows. The following section, §II, describes the concepts that are used in this paper and related work. Our approach is detailed in §III. Experimental strategy and results are described in §IV. §V discusses the threats to validity of our study and §VI concludes.

II. BACKGROUND AND RELATED RESEARCH
A. Deep Neural Networks
Deep Neural Networks (DNNs) are a class of Machine Learning models that can extract high level features from raw input. Similar to the human brain, DNNs contain a large number of inter-connected elements called neurons. DNNs have multiple layers, and each layer contains a number of neurons. A typical DNN consists of an input layer and one or more hidden layers, followed by an output layer. Connections between neurons are called edges, and their associated weights are referred to as the model parameters. A neuron receives its input as a weighted sum over outputs of neurons from the previous layer. The neuron then applies a non-linear activation function on this input to generate its output. Overall, a DNN is a mathematical function over the model parameters for transforming inputs into outputs. The model learns its parameters by training on known input data called the training data. The objective of DNN training is to learn the model parameters in order to make accurate predictions on unseen data during deployment.
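As a concrete illustration of the neuron computation described above, the following minimal NumPy sketch (our own illustration, not code from any of the cited tools) computes one dense layer's outputs as a non-linear activation applied to weighted sums of the previous layer's outputs.

```python
import numpy as np

def relu(x):
    # Non-linear activation function applied element-wise.
    return np.maximum(0.0, x)

def dense_layer(prev_outputs, weights, biases):
    # Each neuron receives a weighted sum over the previous layer's
    # outputs (plus a bias) and applies a non-linear activation.
    return relu(weights @ prev_outputs + biases)

# Toy example: 3 inputs feeding a layer of 2 neurons.
x = np.array([0.5, -1.0, 2.0])           # outputs of the previous layer
W = np.array([[0.1, 0.3, -0.2],
              [0.4, -0.5, 0.6]])         # learned edge weights (model parameters)
b = np.array([0.05, -0.1])               # learned biases
print(dense_layer(x, W, b))              # outputs of the 2 neurons
```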
B. DNN Testing Techniques
DNN testing is an active research area with a number of testing techniques developed to address the challenges of testing these systems [26], [10] in terms of test coverage criteria, test generation and test oracles.

After training, DNN testing techniques use either natural inputs or adversarial inputs for testing. Adversarial inputs are test inputs that are generated by applying tiny perturbations on the original inputs, which cause the model to make false predictions [27]. There is another line of research that focuses on generating adversarial examples for exposing vulnerabilities of DNN models [27], [28], [29] without addressing test adequacy. However, our work differs by focusing on coverage-guided DNN testing techniques from the software engineering literature.
1) Coverage Criteria:
In traditional software testing, coverage criteria are used to measure how thoroughly software is tested. Most practical coverage criteria, e.g., [30], use the structure of the software system to make this assessment, e.g., the percentage of statements or branch outcomes covered by a test suite. Similar to structural software coverage criteria, coverage criteria for DNNs have been proposed by various research efforts, as follows.

Pei et al. [5] proposed neuron coverage (NC) as a test coverage criterion. For a given test suite, neuron coverage is measured as the ratio of the number of unique neurons whose output exceeds a specified threshold value to the total number of neurons present in the DNN.

Ma et al. [6] proposed a range of coverage criteria including: k-multisection neuron coverage (KMNC), neuron boundary coverage (NBC), and strong neuron activation coverage (SNAC). These coverage criteria can be used to determine whether a test case falls in the major functional region or corner case region of a DNN. Activation traces of all neurons are captured for the training data, and lower and upper bounds of activations are measured for each of the neurons.

K-multisection coverage is calculated by dividing the interval between the lower and upper bounds into k bins and measuring the number of bins activated by the test inputs. For a test suite, k-multisection coverage is the ratio of the uniquely covered bins to the total number of bins in the model.

Neuron activations above the upper bound or below the lower bound are considered to be in corner case regions. Neuron boundary coverage is measured as the ratio of the number of covered upper and lower corner case regions to the total number of corner case regions of the model. Strong neuron activation coverage is the ratio of the number of covered upper corner case regions to the total number of upper corner case regions in the DNN. Top-k neuron coverage and top-k neuron patterns are based on top hyper-activated neurons and their combinations.

Modified Condition/Decision Coverage (MC/DC) variants for DNNs are proposed by Sun et al. [7]. These metrics are based on sign and value changes of a neuron's activation to capture the causal changes in the test inputs. Ma et al. [31] proposed combinatorial test coverage to measure the combinations of neuron activations and deactivations covered by a test suite.

In our work, we focus on the NC, KMNC, NBC, and SNAC criteria, and we show that these metrics cannot differentiate between valid and invalid test inputs generated by existing DNN test generation techniques. We leave the analysis of other coverage metrics for future work.
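To make the criteria above concrete, the following NumPy sketch (our own illustration; the simple array-based formulation is an assumption, not the implementation used in the cited works) computes neuron coverage and k-multisection neuron coverage from per-neuron activation values.

```python
import numpy as np

def neuron_coverage(activations, threshold=0.25):
    # activations: array of shape (num_tests, num_neurons) with neuron outputs.
    # A neuron counts as covered if any test drives its output above the threshold.
    covered = (activations > threshold).any(axis=0)
    return covered.sum() / activations.shape[1]

def kmnc(activations, low, high, k=100):
    # low/high: per-neuron activation bounds observed on the training data.
    # Each neuron's [low, high] interval is split into k bins; KMNC is the
    # fraction of (neuron, bin) pairs hit by at least one test input.
    num_tests, num_neurons = activations.shape
    hit = np.zeros((num_neurons, k), dtype=bool)
    for n in range(num_neurons):
        in_range = (activations[:, n] >= low[n]) & (activations[:, n] <= high[n])
        width = (high[n] - low[n]) / k or 1.0
        bins = np.clip(((activations[in_range, n] - low[n]) / width).astype(int), 0, k - 1)
        hit[n, bins] = True
    return hit.sum() / (num_neurons * k)
```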
2) DNN Test Generation:
Research on DNN test generation is largely inspired by traditional software testing techniques such as metamorphic testing, fuzz based testing and symbolic execution. Below, we discuss the state of DNN test generation research.

DeepXplore [5] is a white-box differential test generation technique that uses domain specific constraints on inputs. This technique requires multiple DNN models trained on the same dataset as cross-referencing oracles. The objective of DeepXplore is a joint optimization of neuron coverage and differences in the predictions of DNN models. Maximizing the objective generates tests that achieve high neuron coverage while simultaneously achieving erroneous predictions by the DNN model. DeepXplore uses gradient ascent to solve the joint optimization. DeepTest [9] is another testing technique that generates test inputs by applying domain specific constraints on seed inputs. The major focus of DeepTest is to generate test inputs for testing autonomous vehicles. It uses greedy search driven by the neuron coverage criterion.

Fuzzing is another traditional software testing technique that has been adapted for DNN test generation, including DLFuzz [19] and TensorFuzz [32]. DLFuzz is an adversarial input test generation technique. It uses neuron coverage driven test generation similar to DeepXplore. However, unlike DeepXplore, it does not require multiple DNN models. It also uses a constraint to keep the newly generated test inputs close to the original inputs. TensorFuzz is a coverage guided testing method for finding numerical issues in trained neural networks and disagreements between neural networks and their quantized versions.

DeepConcolic [18] uses the concolic testing approach for generating adversarial test inputs for DNN testing. Concolic execution is a coverage-guided testing technique that combines symbolic execution and path information from concrete execution for generating tests satisfying a coverage criterion. DeepConcolic supports neuron coverage and MC/DC variants for DNNs.

None of these DNN testing techniques check whether the test inputs they are generating follow the training distribution. They generate a significant number of invalid inputs that are outside the model's training distribution, as shown in our evaluation in Section IV.
C. Out-of-Distribution Input Detection
Out-of-distribution input detection (OOD), also referred to as outlier or anomaly detection, is a well-studied problem in the ML field [20], [21], [22], [33]. A recent survey [23] describes the state of deep learning based outlier detection research and classifies deep learning based outlier detection techniques into supervised, semi-supervised, and unsupervised categories. Unsupervised models are preferred as labeling is expensive. We use an unsupervised generative model based approach for our work.

A generative model learns the distribution of the data and can predict how likely a test input is with respect to the training distribution. This prediction can be used to identify invalid test inputs. A DNN classifier learns the conditional distribution of target variables with respect to observable variables. Even though such a classifier has high accuracy on data sampled from the training distribution, its accuracy on samples outside the training distribution cannot be guaranteed [34]. By training a generative model with the same data, its density predictions can be used to reject inputs with low densities. When a test input has low density, it implies that the DNN classifier did not have enough samples around the test input region in the training dataset.

Examples of generative models are autoencoders, variational autoencoders [35], generative adversarial networks (GAN) [36], and autoregressive models such as PixelCNN [37] and PixelCNN++ [38]. We primarily use the variational autoencoder based out-of-distribution detection technique in our work. Also, we repeat our experiments to identify invalid inputs generated by test generation techniques using a PixelCNN++ based validation approach. The study is described in Section IV-B to show how sensitive invalid input identification is with respect to the out-of-distribution detection mechanism used.

Fig. 4: Technique for identifying invalid test inputs.
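In our setting, this density-based rejection reduces to a threshold test. A minimal sketch, assuming a hypothetical density_model object that exposes a per-input (log-)density score and a pre-computed threshold:

```python
def is_valid(test_input, density_model, threshold):
    # density_model.score(x) is assumed to return the (log-)density the
    # generative model assigns to x under the training distribution.
    return density_model.score(test_input) >= threshold

def split_tests(test_inputs, density_model, threshold):
    # Partition generated tests into valid and invalid sets.
    valid = [x for x in test_inputs if is_valid(x, density_model, threshold)]
    invalid = [x for x in test_inputs if not is_valid(x, density_model, threshold)]
    return valid, invalid
```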
D. Variational Autoencoder
A variational autoencoder is a generative model that represents the latent space as a probability distribution. It has an encoder, a code layer and a decoder [35]. The encoder is responsible for mapping inputs to a lower dimensional latent space, and the decoder generates new inputs by sampling from the latent space. The latent space is modeled by a code layer, and it is generated from a prior distribution, e.g., a standard Gaussian distribution. The encoder's objective is to learn the posterior distribution, and the decoder's objective is to learn the likelihood of the original input reconstructed by the decoder. A VAE model is trained by minimizing the difference between the posterior and latent prior distributions and maximizing the likelihood estimation of the input. A trained VAE model will generate high probability density estimates for data belonging to the training data distribution when compared to out-of-distribution inputs. This key insight is used for validating test inputs generated by DNN test generation techniques in our research.
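The following sketch shows how a reconstruction probability in the style of An and Cho [20] can be estimated with a trained probabilistic encoder and decoder. The encoder and decoder callables and their return conventions are illustrative assumptions, not the exact models used in our experiments.

```python
import numpy as np

def reconstruction_log_prob(x, encoder, decoder, num_samples=10):
    # encoder(x) is assumed to return the mean and std of q(z|x);
    # decoder(z) is assumed to return the mean and std of p(x|z).
    mu_z, sigma_z = encoder(x)
    log_probs = []
    for _ in range(num_samples):
        z = mu_z + sigma_z * np.random.randn(*np.shape(mu_z))   # sample z ~ q(z|x)
        mu_x, sigma_x = decoder(z)
        # Log-likelihood of the original input under a diagonal Gaussian
        # parameterized by the decoder's outputs.
        ll = -0.5 * np.sum(np.log(2 * np.pi * sigma_x ** 2)
                           + ((x - mu_x) ** 2) / (sigma_x ** 2))
        log_probs.append(ll)
    # Average over latent samples; higher values indicate in-distribution inputs.
    return np.mean(log_probs)
```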
III. APPROACH
In this section, we describe our approach to (1) identifying limitations of existing DNN test generation techniques, and (2) generating valid test inputs for testing DNNs.
A. Analysis of Existing DNN Test Generation Techniques
The methodology for analysing test inputs generated by existing test generation techniques is depicted in Fig. 4. The DNN(s) under test and the deep generative model are trained on the same dataset. Test inputs generated by existing DNN test generation techniques for the DNN(s) under test are passed as inputs to the deep generative model, which estimates their densities. These densities are used by the decision logic to classify inputs as valid or invalid.

For our experiments, we use a VAE for expressing the deep generative model logic, and in particular the model proposed by An and Cho [20], where the decoder of a VAE outputs distribution parameters for the samples generated by the encoder. The probability of generating the original test input from a latent variable is calculated using these distribution parameters. This probability is referred to as the reconstruction probability. Valid inputs have higher reconstruction probability when compared to invalid inputs.

For a dataset under test, which we call the valid dataset, we identify another dataset which has a different distribution. The inputs from this dataset are considered invalid inputs. Invalid dataset selection is guided by two factors: (1) the dataset should have the same input dimensions as the valid dataset, and (2) the invalid and valid datasets should model disjoint data categories.

After identifying an invalid dataset, we compute the reconstruction probability threshold for identifying invalid inputs. Reconstruction probabilities are calculated for inputs from both valid and invalid datasets. We generate a range of thresholds from the combined reconstruction probability values of valid and invalid inputs. We compute the F-measure, which is a measure of a test's accuracy, for these threshold values. The F-measure is the harmonic mean of precision and recall. A good F-measure balances precision and recall and results in a smaller number of both false positives and false negatives. In our case, this means fewer valid inputs are falsely classified as invalid and fewer invalid inputs are falsely classified as valid. The threshold value with the highest F-measure is selected for our experiments. When classifying test inputs generated by DNN test generation techniques, test inputs with reconstruction probability less than the selected threshold are classified as invalid by the VAE classifier.

We measure the percentage of invalid inputs generated by multiple test generation techniques and the coverage of both valid and invalid tests. The results of the experiments are used to answer the research questions related to the limitations of existing techniques presented in Section IV.
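A minimal sketch of the threshold selection described above, assuming arrays of reconstruction probabilities computed for the valid and invalid calibration datasets (invalid inputs are treated as the positive class to be detected):

```python
import numpy as np

def select_threshold(valid_probs, invalid_probs):
    # Candidate thresholds are drawn from the combined probability values.
    candidates = np.unique(np.concatenate([valid_probs, invalid_probs]))
    best_t, best_f = None, -1.0
    for t in candidates:
        tp = np.sum(invalid_probs < t)       # invalid inputs correctly flagged
        fp = np.sum(valid_probs < t)         # valid inputs falsely flagged
        fn = np.sum(invalid_probs >= t)      # invalid inputs missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```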
B. Our Test Generation Technique
We present a technique to generate valid test inputs in this section. Our workflow is described in Fig. 5. Our approach leverages the objective formulation of existing gradient ascent based test generation techniques. The objective of existing test generation techniques is modeled to increase test coverage and produce inputs that cause the model to make incorrect predictions. We augment this objective with the probability density estimated by a generative model. Gradient ascent is used to solve the joint optimization. Maximizing the joint optimization will result in inputs that follow the distribution of the training data of the DNN under test while satisfying the objective of the baseline testing technique.

Fig. 5: Technique for generating valid test inputs.

We provide a detailed description of our test generation algorithm using a VAE as the generative model in Algorithm 1. The decoder of the VAE outputs the distribution parameters (μ_x̂, σ_x̂) for the samples generated by the encoder, as per the OOD detection algorithm proposed in [20]. The algorithm requires a DNN under test, an objective function obj of a baseline gradient ascent based test generation technique, and a probabilistic encoder and decoder as inputs, and produces both a test suite of valid inputs and their test coverage as output. For every input of the seed set, the probabilistic encoder generates parameters in latent space, as shown in line 4 of Algorithm 1. In lines 5-7, a sample from the latent space is used by the decoder to calculate the reconstruction probability of the input. The objective is modeled as a weighted sum of obj and the reconstruction probability in line 8. Lines 9-11 show the gradient ascent: the gradient is calculated for the objective, domain constraints, if any, are applied to the gradient, and a new test input is generated. In lines 12-13, the generated test is checked for validity. If this test input causes the model to mispredict and has a reconstruction probability higher than the threshold, then on lines 14-15 the coverage is updated and the input is added to the generated test suite. The procedure continues until all seeds are processed. We evaluate this technique using DeepXplore as a baseline test generation technique in Section IV.
Algorithm 1: Valid test input generation using VAE
Input:
  X ← Seed inputs
  DNN ← DNN under test
  obj1 ← Objective function of test generation technique
  s ← Step size for gradient ascent
  max_iterations ← Maximum iterations for gradient ascent
  f_θ, g_φ ← Trained probabilistic encoder and decoder
  λ ← Hyperparameter for balancing the two goals
  α ← Reconstruction probability threshold
Output:
  Set of test inputs, coverage

 1: gen_test = {}
 2: for x in X do
 3:   for i = 1 to max_iterations do
 4:     μ_z, σ_z = f_θ(z|x)
 5:     draw sample z ∼ N(μ_z, σ_z)
 6:     μ_x̂, σ_x̂ = g_φ(x|z)
 7:     obj2 = p_θ(x | μ_x̂, σ_x̂)
 8:     obj = obj1 + λ × obj2
 9:     gradient = ∂obj/∂x
10:     gradient = Constraints(gradient)
11:     x = x + s × gradient
12:     p = Reconstruction_Probability(x, f_θ, g_φ)
13:     if Counter_Example(DNN, x) and p ≥ α then
14:       gen_test.add(x)
15:       update coverage
16:       break
17:     end if
18:   end for
19: end for
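As a concrete illustration of the joint objective and gradient step in lines 8-11 of Algorithm 1, the TensorFlow 2 sketch below maximizes a weighted sum of a baseline test generation objective and the VAE reconstruction probability with respect to the input. The baseline_objective and reconstruction_log_prob callables are assumed placeholders rather than our released implementation.

```python
import tensorflow as tf

def gradient_ascent_step(x, baseline_objective, reconstruction_log_prob,
                         lam=1.0, step_size=0.01):
    # x: current test input as a tensor (e.g., an image batch of shape (1, H, W, C)).
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        obj1 = baseline_objective(x)            # coverage + misprediction goal
        obj2 = reconstruction_log_prob(x)       # density estimated by the VAE
        obj = obj1 + lam * obj2                 # joint objective (line 8)
    grad = tape.gradient(obj, x)                # gradient w.r.t. the input (line 9)
    # Domain constraints, if any, would be applied to grad here (line 10).
    return x + step_size * grad                 # gradient ascent update (line 11)
```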
Dataset  Name   Architecture     Layers:Neurons:Params  Accuracy
MNIST    MNI-1  LeNet-1 [39]     3:52:7206              98.66%
MNIST    MNI-2  LeNet-4 [39]     4:148:69362            99.03%
MNIST    MNI-3  LeNet-5 [39]     5:268:107786           99.08%
MNIST    MNI-4  Custom [18]      7:1300:312202          99.03%
SVHN     SVH-1  ALL-CNN-A [40]   7:2248:1.2M            96%
SVHN     SVH-2  ALL-CNN-B [40]   9:2824:1.3M            95.67%
SVHN     SVH-3  ALL-CNN-C [40]   9:2824:1.3M            95.98%
SVHN     SVH-4  VGG19 [41]       19:28884:38M           94.69%
TABLE I: Models used in our studies with number of layers, neurons, and parameters (Layers:Neurons:Params) and test accuracy.
IV. EVALUATION
The design and evaluation of experiments for studying existing techniques and demonstrating the effectiveness of our approach are described in this section. We answer the following research questions:
RQ1: Do existing test generation techniques produce invalid inputs?
RQ2: Existing test generation techniques are guided by test coverage criteria. How do invalid inputs affect test coverage metrics?
RQ3: VAE based input validation can be incorporated into test generation techniques. How effective is this technique in generating valid inputs and what is the overhead?
RQ4: Is the determination of invalid inputs sensitive to the generative model used?
A. Evaluation Setup
All experiments are conducted on servers with one Intel(R) Xeon(R) CPU E5-2620 v4 2.10GHz processor with 32 cores, 62GB of memory, and 4 NVIDIA TITAN Xp GPUs. The software that supports our evaluation, as well as all of the data described below, is available at https://github.com/swa112003/DistributionAwareDNNTesting.
1) Test Generation Frameworks:
We study three state-of-the-art test generation techniques: DeepXplore [5], DLFuzz [19], and DeepConcolic [18] to demonstrate the limitations of existing techniques in terms of generating valid test inputs and satisfying test coverage criteria. The choice of these frameworks is guided by the categorization of test input generation techniques presented in a recent survey [26] and the availability of open source code. The survey categorizes test generation frameworks into three algorithmic families; we choose one technique from each family. DeepXplore is selected from the domain-specific test input synthesis category, DLFuzz from fuzz and search based test input generation, and DeepConcolic from symbolic execution based test input generation.
2) Test Coverage Criteria:
DeepXplore and DLFuzz use neuron coverage [5] as the test adequacy criterion, whereas DeepConcolic can be used with neuron coverage [5], neuron boundary coverage [6] and MC/DC coverage criteria for DNNs [7]. We use neuron coverage as the test adequacy criterion for generating tests using all three frameworks. Resulting test inputs from test generation are analyzed using neuron coverage and extended neuron coverage metrics, i.e., k-multisection neuron coverage, neuron boundary coverage and strong neuron activation coverage. We leave the remaining coverage criteria discussed in these works [6], [7] for future study.
3) Datasets and DNN Models:
We use two popular datasets, MNIST [24] and SVHN [25], for the experiments. Generative models can assign higher densities to datasets whose distributions are different from their training datasets in some cases [42]. For example, a VAE trained on CIFAR10 [43] can assign higher densities to inputs from the SVHN dataset. When such a model is used for invalid input identification, it might result in high densities being assigned to invalid inputs, which will result in false negatives. Also, selecting the threshold density for deciding invalid inputs becomes challenging in such scenarios. This problem is actively being addressed by the ML research community [44]. Generative models trained on MNIST and SVHN do not have this issue [42], so we selected these two datasets for our research.
MNIST is a collection of grayscale images of handwritten digits with 60000 training images and 10000 test images. All three frameworks that we are studying support test generation for the MNIST dataset. Similar to DeepXplore, we use the LeNet-1, LeNet-4 and LeNet-5 networks from the LeNet family [39] and a custom architecture used in the DeepConcolic work [18] for MNIST classification. All four models are convolutional networks with max-pooling layers, with the number of layers ranging from 3 to 7.
SVHN contains color images of digits in natural scenes, and the dataset has 73257 training images and 26032 test images. We implemented SVHN support for all three frameworks. We trained SVHN classification models with the ALL-CNN-A, ALL-CNN-B and ALL-CNN-C network architectures proposed in [40] and VGG19 [41] for our experiments. These models are convolutional networks with dropout and either global average pooling or max-pooling layers, and the number of layers ranges from 7 to 19. The models are summarized in Table I, where we report measures of their architecture and test accuracy.
4) VAE Models:
For MNIST, we trained the VAE that outputs distribution parameters using the model architecture described in [20]. The FashionMNIST dataset [45] is similar to MNIST and contains 28x28 grayscale images. However, the distribution is different from that of MNIST, as FashionMNIST contains clothing images. We use FashionMNIST as the invalid input space for calculating the reconstruction probability threshold. Since the VAE is not trained on the FashionMNIST distribution and FashionMNIST clothing inputs are semantically unrelated to MNIST digit inputs, the VAE should output lower reconstruction probabilities for test inputs from the FashionMNIST dataset.

We experimented with different variations of the generator architecture used in [46] for selecting a VAE network for the SVHN dataset. For each of the variants, the encoder is created by transposing the generator network as suggested in [46]. The network that achieved the highest F-measure for identifying invalid inputs is selected for our experiments. CIFAR10 [43] is used as the invalid input dataset for calculating the reconstruction probability threshold of the VAE trained on SVHN. F-measure values and the percentage of false positives and false negatives for the MNIST and SVHN test datasets are given in Table II.
Dataset          MNIST               SVHN
Valid            MNIST Test          SVHN Test
Invalid          FashionMNIST Test   CIFAR10 Test
F-measure        -                   -
False Positives  -                   -
False Negatives  -                   -
TABLE II: F-measure and percentage of false positives and false negatives for VAE based input validation model.
TABLE III: Neuron Coverage (Valid %, Invalid %, Total %) of test inputs generated by DeepXplore, DLFuzz and DeepConcolic for MNIST classifiers.
B. Results and Research Questions
In this section, we present the results of the experiments used to answer the research questions.
RQ1. Do existing test generation techniques produce invalid inputs?
We generated test inputs for the MNIST and SVHN classifiers using the DeepXplore, DLFuzz and DeepConcolic techniques. The DeepXplore framework supports three types of input transformations: lightening, occlusion and blackout. We generated tests for all three transformations to answer RQ1. We randomly sampled 500 seed inputs from each MNIST and SVHN test dataset for DeepXplore and DLFuzz. DeepXplore and DLFuzz use gradient ascent for test generation, and we used the hyperparameters reported in their respective works [5], [19] for our study. Similarly, we selected the neuron coverage threshold of 0.25, as it is commonly used in the DeepXplore and DLFuzz experiments in their original work. The DeepConcolic tool uses a single seed input for test generation for neuron coverage, and a timeout of 12 hours is used for test generation in the primary work [18]. We used the same strategy, and the framework is run with the global optimisation approach. Generated tests are classified as valid or invalid by using the reconstruction probability metric of the VAE. The top row of Fig. 6 shows the percentage of invalid test inputs generated by these frameworks for the MNIST and SVHN DNN models.

The percentage of invalid tests generated by DeepXplore varies depending on the constraint used. For all four MNIST classifiers, the occlusion constraint produced a high percentage of
invalid test inputs, i.e., greater than 90%, while the blackout constraint generated less than 1% invalid inputs. The lightening constraint generated 94% and 63% invalid inputs for models MNI-1 and MNI-3 and less than 1% for the other two. DLFuzz generated invalid inputs in the range 36% to 46% for the MNI-1, MNI-2 and MNI-3 classifiers, while less than 1% for MNI-4. For the SVHN classifiers, the occlusion and blackout constraints generated a higher number of invalid tests when compared to the lightening constraint on average. DLFuzz generated invalid inputs in the range 9% to 20% for the SVHN classifiers. All the test inputs generated by the DeepConcolic framework for both MNIST and SVHN classifiers are classified as invalid by the VAE model.

TABLE IV: Neuron Coverage (Valid %, Invalid %, Total %) of test inputs generated by DeepXplore, DLFuzz and DeepConcolic for SVHN classifiers.
Result for RQ1: All three testing techniques studied produced significant numbers of invalid tests; 42% on average and ranging from 73-100% in the worst case.

RQ2. Existing test generation techniques are guided by test coverage criteria. How do invalid inputs affect test coverage metrics?
We measured neuron coverage (NC) and the multi-granularity coverage criteria, i.e., k-multisection neuron coverage (KMNC), neuron boundary coverage (NBC) and strong neuron activation coverage (SNAC), of both valid and invalid tests generated by the three frameworks. A k-value of 100 is used for measuring KMNC coverage. We also measured the cumulative neuron coverage of valid and invalid test inputs. Results are presented in Tables III and IV for the neuron coverage metric, and Tables V and VI have data for the multi-granularity coverage criteria.

Across 8 DNNs, 3 test generation techniques, and 4 coverage criteria, 72% of the time invalid tests achieved coverage greater than or equal to that achieved by valid tests. The entries in Tables III, IV, V and VI corresponding to this insight are highlighted in bold. 25% of the time invalid tests outperform valid tests for coverage, and 25% of the time invalid coverage boosts overall coverage by more than 10%.
Result for RQ2: Invalid inputs yield high coverage for a variety of coverage criteria when compared to valid inputs, and they frequently increase coverage beyond that which would be achieved with valid inputs alone.
RQ3. VAE based input validation can be incorporated into test generation techniques. How effective is this technique in generating valid inputs and what is the overhead?
To answer this question, we generated test inputs by using VAE based input validation along with a gradient ascent based test generation technique, as described in Algorithm 1. We selected DeepXplore as the baseline test generation technique, and the density estimated by the VAE is incorporated as a goal into its objective to formulate a joint optimization. The result of a joint optimization is sensitive to the weights of the different goals used in the objective function. To address this, we fixed the weights of the goals of the baseline's objective and performed a sweep over a range of density weights to find the best configuration. We used gradient ascent to generate test inputs for the MNIST and SVHN models. We randomly identified 200 seed inputs from each of the two datasets and used the same seed set and gradient ascent parameters, i.e., step size and maximum iterations, for the baseline and our technique. The experiments are repeated three times and average results are presented in this section.

We measured the number of valid tests generated, along with their neuron coverage, for our technique and the baseline to demonstrate the effectiveness of our technique. The validity of the inputs is measured with respect to the OOD detection algorithm used, i.e., the VAE in this case. Our technique generates only valid test inputs. Since the baseline generates both valid and invalid test inputs, we added the input validation module to the baseline to capture only the valid test inputs. The neuron coverage achieved by the baseline technique and our technique is presented in Figures 7 and 8 for the MNIST and SVHN classifiers, respectively. The plots show the coverage over a range of 200 seed inputs. Our technique achieved neuron coverage greater than or equivalent to that of the DeepXplore baseline for all 8 DNN models. For the scenarios where the baseline is able to achieve neuron coverage comparable to ours, our technique outperformed the baseline in terms of the number of valid inputs generated. Fig. 9 contains a comparison of the number of valid inputs generated by the baseline and our technique for the MNIST and SVHN classifiers. The total valid inputs generated by our technique for the MNIST models are 5.6 times the valid inputs generated by the baseline. For the SVHN dataset, our technique generated 1.6 times more valid inputs when compared to the baseline. Hence, having the VAE in the test objective effectively guides gradient ascent in searching for valid inputs.

Table VII shows the performance data of the DeepXplore+VAE and DeepXplore algorithms for 200 seed inputs. Every iteration of these algorithms has two components: (1) gradient ascent, and (2) input validation.
For each seed input, gradient ascent is performed until it finds a valid test input or for a maximum of 30 iterations, whichever happens first. Input validation is performed only when the differential oracle fails the generated test input in that iteration. In all cases, DeepXplore+VAE ran for fewer iterations and input validations when compared to the baseline. For the scenarios where the difference between DeepXplore+VAE and the baseline's number of iterations and input validations is high, DeepXplore+VAE is faster because the baseline spends more time generating invalid inputs, which are then rejected by the input validation module. When this difference is small, the baseline has better overall run-time, but DeepXplore+VAE generates more valid inputs and has a lower cost per valid input when compared to the baseline. We note that due to DeepXplore+VAE's improved effectiveness in generating valid tests, it improves on the baseline's "time to produce a valid test", reducing it from 4.7 to 1.7 minutes on average, measured across three runs.

TABLE V: Multi-granularity neuron coverage (KMNC, NBC, SNAC; Valid %, Invalid %, Total %) of test inputs generated by DeepXplore, DLFuzz and DeepConcolic for MNIST classifiers.
Result for RQ3: Incorporating a VAE into test generation eliminates the generation of invalid test inputs, significantly increases the generation of valid inputs, reduces the time to generate valid tests, and increases coverage achieved on generated valid tests.
TABLE VI: Multi-granularity neuron coverage (KMNC, NBC, SNAC; Valid %, Invalid %, Total %) of test inputs generated by DeepXplore, DLFuzz and DeepConcolic for SVHN classifiers.
RQ4. Is the determination of invalid inputs sensitive to the generative model used?
To answer RQ4, we use a PixelCNN++ based input validation technique. PixelCNN++ is an autoregressive deep generative model [38]. The advantage of using this model for out-of-distribution detection is that the model outputs the probability density explicitly. We trained PixelCNN++ models for the MNIST and SVHN datasets. For each dataset, we find the threshold for identifying invalid inputs by using an invalid dataset and F-measure analysis, similar to the VAE based detection technique described in Section III-A. The F-measure, precision and recall of the selected thresholds for both datasets are presented in Table VIII.

The percentages of test inputs generated by DeepXplore, DLFuzz and DeepConcolic for the MNIST and SVHN classification models that are classified as invalid by the PixelCNN++ based input classifier are presented in the bottom row of Fig. 6. For the MNIST models, PixelCNN++ classified a high percentage of the test inputs generated by DeepXplore's light and occlusion constraints as invalid and classified all test inputs as valid for the blackout constraint. For the SVHN classifiers, the occlusion and blackout constraints result in a higher number of invalid inputs when compared to the light constraint.

Fig. 7: Neuron coverage of valid inputs generated by DeepXplore and DeepXplore extended with VAE for MNIST models.
Fig. 8: Neuron coverage of valid inputs generated by DeepXplore and DeepXplore extended with VAE for SVHN models.
DNN   | DeepXplore+VAE: run-time (mins), valid inputs, iterations, input validations | DeepXplore: run-time (mins), valid inputs, iterations, input validations | Iterations (DeepXplore+VAE − DeepXplore) | Input validations (DeepXplore+VAE − DeepXplore)
MNI-1 | 96.74, 29, 5413, 882  | 103.82, 1, 5972, 1832  | -559  | -950
MNI-2 | 73.5, 54, 4910, 413   | 103, 3, 5913, 1812     | -1003 | -1399
MNI-3 | 60.39, 56, 4863, 200  | 96.66, 3, 5917, 1587   | -1054 | -1387
MNI-4 | 54.97, 52, 4736, 52   | 46.57, 29, 5199, 375   | -463  | -323
SVH-1 | 97.12, 17, 5637, 27   | 64.7, 12, 5737, 47     | -100  | -20
SVH-2 | 97.96, 20, 5578, 28   | 66.83, 9, 5798, 60     | -220  | -32
SVH-3 | 90.34, 21, 5539, 29   | 69.83, 11, 5703, 80    | -164  | -51
SVH-4 | 143.81, 83, 4126, 219 | 130.57, 53, 4547, 446  | -421  | -227
TABLE VII: Run-time analysis of the test generation algorithms of DeepXplore+VAE and DeepXplore for MNIST and SVHN classifiers.

Fig. 9: Number of valid inputs generated by DeepXplore and DeepXplore extended with VAE for MNIST and SVHN models.

PixelCNN++ classified all test inputs generated by DLFuzz as invalid for the MNIST models and more than 60% of test inputs as invalid for the SVHN models. All inputs generated by DeepConcolic are identified as invalid for both models. The results follow the same trend as observed with the VAE based classifier. However, the percentage of test inputs classified as invalid by PixelCNN++ is lower when compared to that of the VAE for DeepXplore-generated tests.
Dataset          MNIST               SVHN
Valid            MNIST Test          SVHN Test
Invalid          FashionMNIST Test   CIFAR10 Test
F-measure        -                   -
False Positives  -                   -
False Negatives  -                   -
TABLE VIII: F-measure and percentage of false positives and false negatives for PixelCNN++ based input validation model.

For DLFuzz, the PixelCNN++ approach resulted in more invalid tests when compared to the VAE based classifier. Both the VAE and PixelCNN++ based techniques classified all test inputs generated by DeepConcolic as invalid.
Result for RQ4: Test generators are judged to produce invalid tests with different OOD techniques, but the number of invalid tests is sensitive to the deep generative model architecture used.
V. THREATS TO VALIDITY
We designed our study to provide a degree of generalizability by spanning all of the algorithmic families of DNN test generation approaches that have been developed to date, as well as 2 datasets, 8 models, 4 coverage criteria, and 2 approaches to out-of-distribution detection. Moreover, the datasets and models that we have chosen are those that have been used in prior research, which was both a convenience choice and a means of promoting comparison among methods, e.g., against baselines. Despite these measures, our findings may be dependent on these choices.

Further study, especially with additional OOD techniques beyond VAE and PixelCNN++, is warranted to understand the generalizability of our findings as relates to the rate at which invalid inputs are generated and the degree of coverage achieved by those inputs. Our study on adapting test generation with OOD is more limited, using a single model, a VAE, and a single test generation approach, DeepXplore, which is representative of the class of optimization-based test generation approaches. It is not a simple matter to extend this study to other families of test generation methods, but that will be necessary to understand the extent to which the benefit of integrating OOD methods with DNN test generation techniques broadly generalizes.

We ran all of our experiments multiple times and cross-checked them with prior work, e.g., that we achieved the same level of coverage for baseline techniques as was reported in prior work. We took these measures to assure the quality of the data reported here, and we made the code available on GitHub for transparency and replicability.
VI. CONCLUSIONS
This paper demonstrates that existing DNN test generation and test coverage techniques do not consider the valid input space, which can have several deleterious effects. It can lead DNN test methods to generate large numbers of invalid inputs, those that lie off the training distribution as judged by state-of-the-art techniques, thereby reducing the efficiency of the test generation process and, even worse, producing large numbers of tests that might be rejected as invalid during fault triage processes. It can lead test coverage techniques to value invalid tests inappropriately by achieving or improving on coverage from valid tests; this has the potential to bias test generation results.

N_defensive(i) { if (!OOD(i)) return N(i) else return error(i) }
Fig. 10: Defensive DNN.

We demonstrate that existing out-of-distribution detection techniques can be coupled with test generation algorithms to address this problem. In this work, we focused on VAE-based OOD detection and incorporating such models into optimization-based test generation. Our study shows this to be effective in significantly boosting the number of valid test inputs generated and in eliminating invalid tests. While promising, more work is needed to explore the potential for other OOD models to inform test generation and to incorporate such models into constraint-based and fuzzing test generators.

Finally, we plan to explore how the well-understood concept of defensive programming for traditional programs, as sketched in Fig. 1a, can be adapted to DNNs. Fig. 10 sketches a possibility suggested by the findings in this paper, where the role of input validation is played by an OOD detector. In such an architecture, testing of N should be restricted to inputs that are not out of distribution, but testing of the OOD detector itself must be conducted over a broader input space, as is the case with prior work on input validation testing [15], [16], [17]. With such an architecture, DNN test suites that achieve higher coverage of the OOD detector and N are better, thereby reestablishing the long-held intuitions about test coverage for traditional software.

ACKNOWLEDGEMENTS
This material is based in part upon work supported by National Science Foundation awards 1900676 and 2019239.
REFERENCES
[1] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," CoRR, vol. abs/1604.07316, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316
[2] S. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. Eng, D. Rus, and M. Ang, "Perception, planning, control, and coordination for autonomous vehicles," Machines, vol. 5, no. 1, p. 6, 2017.
[3] N. Smolyanskiy, A. Kamenev, J. Smith, and S. Birchfield, "Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness," Sep. 2017, pp. 4241–4247.
[4] A. Loquercio, A. I. Maqueda, C. R. D. Blanco, and D. Scaramuzza, "DroNet: Learning to fly by driving," IEEE Robotics and Automation Letters, 2018.
[5] K. Pei, Y. Cao, J. Yang, and S. Jana, "DeepXplore: Automated whitebox testing of deep learning systems," in Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 1–18.
[6] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu et al., "DeepGauge: Multi-granularity testing criteria for deep learning systems," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 120–131.
[7] Y. Sun, X. Huang, D. Kroening, J. Sharp, M. Hill, and R. Ashmore, "Testing deep neural networks," arXiv preprint arXiv:1803.04792, 2018.
[8] X. Xie, J. W. K. Ho, C. Murphy, G. E. Kaiser, B. Xu, and T. Y. Chen, "Testing and validating machine learning classifiers by metamorphic testing," J. Syst. Softw., vol. 84, no. 4, pp. 544–558, 2011. [Online]. Available: https://doi.org/10.1016/j.jss.2010.11.920
[9] Y. Tian, K. Pei, S. Jana, and B. Ray, "DeepTest: Automated testing of deep-neural-network-driven autonomous cars," in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 303–314.
[10] X. Huang, D. Kroening, W. Ruan, J. Sharp, Y. Sun, E. Thamo, M. Wu, and X. Yi, "A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability," Computer Science Review, vol. 37, p. 100270, 2020.
[11] J. H. Hayes and J. Offutt, "Input validation analysis and testing," Empirical Software Engineering, vol. 11, no. 4, pp. 493–522, 2006.
[12] N. Li, T. Xie, M. Jin, and C. Liu, "Perturbation-based user-input-validation testing of web applications," Journal of Systems and Software, vol. 83, no. 11, pp. 2263–2274, 2010.
[13] H. Liu and H. B. K. Tan, "Covering code behavior on input validation in functional testing," Information and Software Technology, vol. 51, no. 2, pp. 546–553, 2009.
[14] K. Taneja, N. Li, M. R. Marri, T. Xie, and N. Tillmann, "MiTV: Multiple-implementation testing of user-input validators for web applications," in Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, 2010, pp. 131–134.
[15] S. Sinha and M. J. Harrold, "Analysis and testing of programs with exception handling constructs," IEEE Transactions on Software Engineering, vol. 26, no. 9, pp. 849–871, 2000.
[16] P. Zhang and S. Elbaum, "Amplifying tests to validate exception handling code: An extended study in the mobile application domain," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 23, no. 4, pp. 1–28, 2014.
[17] A. Goffi, A. Gorla, M. D. Ernst, and M. Pezzè, "Automatic generation of oracles for exceptional behaviors," in Proceedings of the 25th International Symposium on Software Testing and Analysis, 2016, pp. 213–224.
[18] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening, "Concolic testing for deep neural networks," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 109–119.
[19] J. Guo, Y. Jiang, Y. Zhao, Q. Chen, and J. Sun, "DLFuzz: Differential fuzzing testing of deep learning systems," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 739–743.
[20] J. An and S. Cho, "Variational autoencoder based anomaly detection using reconstruction probability," Special Lecture on IE, vol. 2, no. 1, 2015.
[21] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng et al., "Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications," in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 187–196.
[22] H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar, "Efficient GAN-based anomaly detection," arXiv preprint arXiv:1802.06222, 2018.
[23] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," arXiv preprint arXiv:1901.03407, 2019.
[24] Y. LeCun, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[25] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," 2011.
[26] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, "Machine learning testing: Survey, landscapes and horizons," IEEE Transactions on Software Engineering, 2020.
[27] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[28] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," arXiv preprint arXiv:1607.02533, 2016.
[29] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," IEEE, 2017, pp. 39–57.
[30] E. J. Weyuker, "The evaluation of program-based software test data adequacy criteria," Communications of the ACM, vol. 31, no. 6, pp. 668–675, 1988.
[31] L. Ma, F. Zhang, M. Xue, B. Li, Y. Liu, J. Zhao, and Y. Wang, "Combinatorial testing for deep learning systems," arXiv preprint arXiv:1806.07723, 2018.
[32] A. Odena, C. Olsson, D. Andersen, and I. Goodfellow, "TensorFuzz: Debugging neural networks with coverage-guided fuzzing," in International Conference on Machine Learning, 2019, pp. 4901–4911.
[33] D. Hendrycks, M. Mazeika, and T. Dietterich, "Deep anomaly detection with outlier exposure," arXiv preprint arXiv:1812.04606, 2018.
[34] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
[35] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[37] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., "Conditional image generation with PixelCNN decoders," in Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
[38] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," arXiv preprint arXiv:1701.05517, 2017.
[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[40] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[41] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[42] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan, "Do deep generative models know what they don't know?" arXiv preprint arXiv:1810.09136, 2018.
[43] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," 2009.
[44] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan, "Likelihood ratios for out-of-distribution detection," in Advances in Neural Information Processing Systems, 2019, pp. 14707–14718.
[45] H. Xiao, K. Rasul, and R. Vollgraf. (2017) Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms.
[46] M. Rosca, B. Lakshminarayanan, and S. Mohamed, "Distribution matching in variational inference," arXiv preprint arXiv:1802.06847, 2018.