Automatic Test Suite Generation for Key-Points Detection DNNs using Many-Objective Search (Experience Paper)
Fitash Ul Haq
University of Luxembourg, Luxembourg

Donghwan Shin
University of Luxembourg, Luxembourg

Lionel C. Briand
University of Luxembourg, Luxembourg
University of Ottawa, Ottawa, Canada

Thomas Stifter
IEE S.A., Luxembourg

Jun Wang
Post Luxembourg, Luxembourg
ABSTRACT
Automatically detecting the positions of key-points (e.g., facial key-points or finger key-points) in an image is an essential problem in many applications, such as driver's gaze detection and drowsiness detection in automated driving systems. With the recent advances of Deep Neural Networks (DNNs), Key-Points detection DNNs (KP-DNNs) have been increasingly employed for that purpose. Nevertheless, KP-DNN testing and validation have remained a challenging problem because KP-DNNs predict many independent key-points at the same time—where each individual key-point may be critical in the targeted application—and images can vary a great deal according to many factors.

In this paper, we present an approach to automatically generate test data for KP-DNNs using many-objective search. In our experiments, focused on facial key-points detection DNNs developed for an industrial automotive application, we show that our approach can generate test suites that severely mispredict, on average, more than 93% of all key-points. In comparison, random search-based test data generation can only severely mispredict 41% of them. Many of these mispredictions, however, are not avoidable and should not therefore be considered failures. We also empirically compare state-of-the-art, many-objective search algorithms and their variants, tailored for test suite generation. Furthermore, we investigate and demonstrate how to learn specific conditions, based on image characteristics (e.g., head posture and skin color), that lead to severe mispredictions. Such conditions serve as a basis for risk analysis or DNN retraining.
CCS CONCEPTS
• Software and its engineering → Empirical software validation.

KEYWORDS
Key-point detection, deep neural network, software testing, many-objective search algorithm
1 INTRODUCTION

Automatically detecting key-points (e.g., facial key-points or finger key-points) in an image or a video is a fundamental step for many applications, such as face recognition [53], facial expression recognition [26], and drowsiness detection [23]. With the recent advances in Deep Neural Networks (DNNs), Key-Points detection DNNs (KP-DNNs) have been widely studied [21–23, 42].

To ensure the reliability of KP-DNNs, it is essential to check how accurate the DNNs are when applied to various test data. Nevertheless, testing KP-DNNs is still often focused on pre-recorded test data, such as publicly available datasets [4, 40, 49] that are typically not collected or generated according to any systematic strategy, and that therefore provide limited test results and confidence. Further, since third parties with ML expertise often provide these KP-DNNs, their internal information is usually not available to the engineers who integrate them into their application systems [28, 36]. As a result, black-box testing approaches should be favored, preferably one that is dedicated and adapted to KP-DNNs to maximize test effectiveness within acceptable time constraints, which is the objective of this paper.

To automatically generate new and diverse test data, one may opt to use advanced DNN testing approaches that have been recently proposed by several researchers. For example, Tian et al. [43] presented DeepTest, an approach that generates new test images from training images by applying simple image transformations (e.g., rotate, scale, and fog and rain effects). Wicker et al. [47] generated adversarial examples, i.e., small perturbations that are almost imperceptible by humans but cause DNN misclassifications, using feature extraction from images. Gambi et al. [17] presented AsFault, an approach that generates virtual road networks in computer simulations using search-based testing for testing a DNN-based automated driving system. However, none of the existing DNN testing approaches take into account the following testing challenges that are specific to KP-DNNs predicting multiple key-points at the same time. First, accuracy for individual key-points is of great importance, and therefore one should not simply consider average prediction errors across all key-points. Second, the number of key-points is typically large, e.g., our evaluation uses a facial KP-DNN that detects 27 key-points in an image. Third, depending on the performance of the KP-DNN under test, it may be infeasible to generate a test image causing a significant misprediction for some particular key-points. Therefore, if such infeasibility is observed at run time, it is essential to dynamically and efficiently distribute the computational resources dedicated to testing to the other key-points. This implies that using one of the existing DNN testing approaches is not an option.

To address the above challenges, we recast the problem of KP-DNN testing as a many-objective optimization problem. Since the severe misprediction of each individual key-point becomes one objective, many-objective optimization algorithms can potentially be effective and efficient at generating test data that causes the KP-DNN to mispredict individual key-points.
Though our approach is applicable to any KP-DNN, we focus our experiments on an industrial Facial KP-DNN (FKP-DNN), a typical example of KP-DNNs. Our empirical investigation shows that our approach is effective at generating test data that cause most of the key-points (93% on average) to be severely mispredicted. Though many of these mispredictions are not avoidable, because a key-point can be invisible due to head posture or a shadow, characterizing them is important to analyze risks, and a large number of them can still be addressed by retraining or other means. We further investigate (1) the degree of mispredictions caused by test data generated by alternative search algorithms and (2) how to learn specific conditions (e.g., head posture) leading to severe mispredictions of individual key-points. Our ultimate goal is to provide practical recommendations on how to test KP-DNNs and how to analyze test results to support risk analysis and retraining.

The contributions of this paper are summarized as follows:
1) We formalize the problem definition of KP-DNN testing. Our formalism specifies KP-DNN input and output variables, as well as the notion of severe misprediction for a key-point, and characterises the specific testing challenges in this context.
2) We propose a way to automatically generate test data for KP-DNNs using many-objective search algorithms and simulation.
3) We investigate how automatically generated test data and results can be effectively used to reveal and possibly address the root causes of inaccuracy in FKP-DNNs under test.
4) We present empirical results and lessons learned drawn from our experience in applying the approach in an industrial context.

The rest of the paper is organized as follows. Section 2 provides background information on KP-DNNs and search-based testing with simulation. Section 3 formalises the problem of KP-DNN testing. Section 4 positions our work with respect to related work. Section 5 describes our approach. Section 6 evaluates our approach with an industrial case study and Section 7 concludes the paper.

2 BACKGROUND

2.1 Search-based Software Testing

Search-based software testing (SBST) [30] uses meta-heuristic algorithms to automate software testing tasks, such as test case generation [16] and prioritization [27], for a specific system under test. The key idea is to formulate a software testing problem as an optimization problem by defining proper fitness functions. For example, EvoSuite [16] uses a search-based approach to automatically generate unit test cases for a Java program to satisfy a coverage criterion, such as branch coverage. In this case, the fitness function is defined based on the coverage achieved by unit test cases.

Recently, SBST has also been used for DNN testing. For example, AsFault [17] automatically generates virtual roads to make vision-based DNNs go out of lane, and DeepJanus [38] uses multi-objective search to generate a pair of similar test inputs that cause the DNN under test to misbehave for one test input but not for the other.

However, when the number of objectives is above three, multi-objective search algorithms, such as NSGA-II [13], do not scale well [9, 24]. This is where many-objective search algorithms come into play. For example, NSGA-III [12] is a generic many-objective search algorithm that extends NSGA-II with the idea of virtual reference points to increase the diversity of optimal solutions even when there are more than three objectives. Panichella et al.
[35] proposed a Many-Objective Sorting Algorithm (MOSA), another extension of NSGA-II, that is tailored for test suite generation. In contrast to NSGA-III, MOSA aims to efficiently achieve each objective individually. To do this, MOSA has three main features: (1) it focuses the search towards uncovered objectives, (2) it uses a novel preference criterion to rank solutions rather than diversifying them, and (3) it saves the best test case for each objective in an archive. MOSA has been shown to outperform alternative search algorithms in the context of coverage-based unit testing for traditional software programs [34, 35]. Recently, Abdessalem et al. [2] proposed FITEST, an extension of MOSA, to further improve the efficiency of many-objective search. The idea is to dynamically reduce the number of candidates (i.e., the population size) to be considered by focusing on the uncovered objectives only. Hence, FITEST's population size decreases as more objectives are achieved, whereas MOSA's population size is fixed throughout the search.

Notice that MOSA and FITEST are carefully designed for search-based test suite generation when the number of objectives is above three. Therefore, they are a priori suitable to address the problem of test suite generation for KP-DNNs, as detailed in Section 3.
2.2 Simulation-based Testing

Testing cyber-physical, safety-critical systems can be done in two ways: (1) in a real-world environment and (2) in a virtual environment, relying on simulators. In the former case, software is deployed in the real-world environment, a form of testing which is often expensive and dangerous. The latter case, referred to as simulation-based testing, can raise concerns about simulation fidelity but has two major benefits. First, simulation-based testing offers controllability, meaning that test drivers can control static and dynamic features of the simulation (e.g., light, face features), and can thus automatically cover a wide range of scenarios. Second, because we can control simulation parameters and can get image information from the simulator, we know the ground truth (actual FKPs) and can thus apply search-based solutions to perform safe and fully automated testing of safety-critical software [1, 3, 5].

In many cyber-physical fields, simulation models are developed before implementing actual products. These simulation models help engineers in various activities, such as early verification and validation of the system under test [19, 31]. In such contexts, simulation-based testing is highly recommended as it is economical, faster, safer, and flexible [19].
3 PROBLEM DEFINITION

In this section, we provide a general but precise problem description regarding test data generation for FKP-DNNs. Though we use a specific application domain for the purpose of exemplification, this description can easily be generalized to all situations where key-points must be detected in an image, such as hand key-point detection [41] and human pose key-points detection [33], regardless of the content of that image.

An FKP-DNN takes as input a facial image (real or simulated) and returns the predicted positions of its key-points. An image can be defined by various factors, such as face size, skin color, and head posture. Using a labeled image that contains the actual positions of key-points, one can measure the degree of prediction errors for an FKP-DNN under test. The goal of test suite generation for FKP-DNNs is to generate labeled images that cause an FKP-DNN under test to inaccurately predict the positions of key-points, to an extent which affects the safety of the system relying on such predictions.

More specifically, let t be a labeled test image composed of a tuple t = (ic_t, p_t) where ic_t represents image characteristics, such as face size, skin color, and head posture, and p_t represents the actual positions of key-points in t. p_t can be further decomposed as p_t = ⟨p_{t,1}, ..., p_{t,k}⟩ where p_{t,i} for i = 1, ..., k is the actual position of the i-th key-point in t and k is the total number of key-points. Depending on ic_t, some key-points may not actually be visible in t. For example, if the head is turned 90° to the right, the right eye key-point will be invisible in the image taken from the front camera. In this case, the actual positions of invisible key-points are null. V(t) is a set of indices for visible key-points in t, and |V(t)| is the total number of visible key-points in t. An FKP-DNN d can be considered as a function d(t) = ⟨p̂_{t,1}, ..., p̂_{t,k}⟩ where p̂_{t,i} is the predicted position of the i-th key-point in t. The Normalized Mean Error (NME) of d for all key-points in t can be defined as follows:

NME(d, t) = \frac{\sum_{i \in V(t)} NE(p_{t,i}, \hat{p}_{t,i})}{|V(t)|}

where NE(p_{t,i}, p̂_{t,i}) is the Normalized Error of the predicted position p̂_{t,i} with respect to the actual position p_{t,i}, ranging between 0 and 1. For example, in our case study, if one measures the error using the Euclidean distance between the actual and predicted positions in an image, then the normalization can be done by dividing the distance by the height or width of the face, whichever is larger (both are easily calculated from the actual positions of key-points). Such a normalization allows error values to be compared for faces of different sizes.

Generating a critical test image t for d such that it maximizes NME(d, t) could be the goal of test case generation. However, it does not properly account for the fact that accuracy for individual key-points is of great importance, and therefore that averaging prediction errors across key-points may be misleading. This is, indeed, a major concern for IEE, who is working on gaze detection systems that either warn the driver or take control of the vehicle when the driver is apparently not paying attention while driving. By considering individual key-points, we may observe that certain key-points' predictions are particularly inaccurate. Such information can be used by IEE to (1) focus the retraining of d on these key-points for improving its accuracy or (2) select accurate key-points to be primarily relied upon—when there are alternative choices—by the applications using them to make decisions.
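To make these definitions concrete, the following is a minimal sketch of NE and NME in Python, assuming key-points are (x, y) pixel coordinates, invisible key-points are encoded as None, and face_size is the larger of the face's height and width; the function names are ours, for illustration only.

```python
import math

def normalized_error(actual, predicted, face_size):
    """NE: Euclidean distance between the actual and predicted positions,
    normalized by the larger of the face's height and width."""
    distance = math.dist(actual, predicted)
    return min(distance / face_size, 1.0)  # NE ranges between 0 and 1

def nme(actual_kps, predicted_kps, face_size):
    """NME: average NE over the visible key-points only; invisible
    key-points are encoded as None and excluded, as in V(t)."""
    pairs = [(a, p) for a, p in zip(actual_kps, predicted_kps) if a is not None]
    return sum(normalized_error(a, p, face_size) for a, p in pairs) / len(pairs)
```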
To verify if d correctly predicts individual key-points for t, we need to check if NE(p_{t,i}, p̂_{t,i}) is less than a small threshold ε for all key-points. Therefore, though that may not be possible, test suite generation attempts to generate a minimal set of test cases TS that satisfies NE(p_{t,i}, p̂_{t,i}) ≥ ε for all i = 1, ..., k for some t ∈ TS. Note that the value of ε is usually application specific; for example, IEE uses ε = 0.05 because NME > 0.05 is considered critical in their applications.

Test suite generation for FKP-DNNs entails several challenges. First, the test input space is too large to be exhaustively explored. For IEE, this is clear when considering the features that characterize a facial image, including head posture and facial characteristics, such as different shapes of eyes, noses, and mouths. Second, the number of key-points is typically large (e.g., 27 key-points for IEE's DNN) and, increasing the complexity of the problem, individual key-points are independent, as people may exhibit different relative positions for the same key-points. Third, depending on the accuracy of the DNN under test, it may be infeasible to find an image causing the normalized error to exceed the threshold for some key-points. Considering a limited test time budget, if finding a critical image for a key-point seems infeasible, it is essential to dynamically and efficiently distribute computation time at run time to find critical images for the other key-points. Fourth, it is not easy to manually label the actual positions of key-points for a large number of test images. While there are publicly available real-world datasets for various key-point detection problems [4, 40, 49], such datasets limit the exploration of the input space to what data is available. Last but not least, FKP-DNNs, like other DNNs in practice, are often provided by third parties with expertise in ML, who typically do not provide access to the trained DNN's internal information. Therefore, test suite generation for FKP-DNNs should be black-box, in the sense that it should not rely on DNN internal information.

To address the above challenges, as described in detail in Section 5, we suggest applying many-objective search algorithms, which can be effective and efficient for achieving many independent objectives within a limited time budget. These algorithms are a priori a good match for our problem since requirements for individual key-points (i.e., NE(p_{t,i}, p̂_{t,i}) ≥ ε for t ∈ TS) can be translated into objective functions. Furthermore, this approach is DNN-agnostic as it considers the DNN under test as a black box.

4 RELATED WORK

Many automated techniques are available in the literature for testing DNNs. One type of automatic test data generation technique is to generate adversarial examples [51] that are imperceptible for humans but cause the DNN under test to misbehave. Guo et al. [18] proposed DLFuzz, an approach that iteratively applies small perturbations to original images to maximize neuron coverage (i.e., the rate of activated neurons) and prediction differences between the original and synthesized images. Zhou et al. [54] and Kong et al. [25] proposed DeepBillboard and PhysGAN, respectively, two similar approaches that generate adversarial images that can be placed on drive-by billboards, both digitally and physically, to induce failures in DNN-based automated driving systems. Wicker et al. [47]
presented an approach to generate adversarial images using feature extraction techniques, such as the Scale Invariant Feature Transform (SIFT), to extract features from original images. Rozsa et al. [39] introduced a Fast Flipping Attribute (FFA) technique to effectively generate adversarial images for DNNs detecting facial attributes (e.g., male or female). However, in contrast to our objective of verifying the safety of DNN predictions for a large variety of plausible test inputs, adversarial examples mainly target security attacks where an attacker intentionally introduces human-imperceptible changes (i.e., attacks) to cause the DNN to generate incorrect predictions.

Another important line of work is to generate new test data from already labeled data (e.g., training data) by applying label-preserving changes to avoid labeling problems for newly generated test inputs. Pei et al. [37] proposed DeepXplore, an approach that generates label-preserving test images to maximize both neuron coverage and differential behaviors of multiple DNNs for the generated images. Tian et al. [43] introduced DeepTest, an approach that synthesizes label-preserving test images by applying affine transformations and effect filters, such as rain and snow, to original images in order to maximize neuron coverage. Zhang et al. [52] presented DeepRoad, an approach that uses Generative Adversarial Networks (GANs), instead of simple transformations and effect filters, to generate more realistic images. Du et al. [15] proposed DeepCruiser, an approach that generates label-preserving sequences of test data to test stateful deep learning systems based on Recurrent Neural Networks (RNNs), using special coverage criteria for RNNs converted to a Markov Decision Process (MDP). Recently, Xie et al. [50] presented DeepHunter, an extensible coverage-guided testing framework extending DeepTest to further utilize multiple coverage criteria, with more label-preserving transformation strategies. While these works enable engineers to generate more realistic test data from existing data, generating label-preserving test images inherently limits the search space when the objective is to identify critical situations in the most comprehensive way possible. Furthermore, when relying on a simulator, labeling is not a critical issue.

To overcome the limitation of label-preserving test data generation, simulation-based testing is increasingly used for testing DNNs, especially in the context of DNN-based automated driving systems. Gambi et al. [17] presented AsFault, a search-based approach that generates different types of road topologies in a simulated environment to test the lane-keeping functionality of self-driving DNNs. Tuncali et al. [44] introduced Sim-ATAV, a framework for testing closed-loop behaviors of DNN-based systems in simulated environments using requirement falsification methods. Riccio and Tonella [38] presented DeepJanus, an approach to generate pairs of similar test inputs, using search-based testing with simulations, that cause the DNN under test to misbehave for one test input but not for the other. Although the solutions in this category successfully used simulation to generate critical test data, they are inadequate for the test suite generation of FKP-DNNs, because they do not consider many independent outputs individually.

In summary, existing work does not address the fundamental and specific challenges of test suite generation for KP-DNNs, as described in Section 3. To this end, we propose a solution, based on many-objective optimization, that automatically generates test suites whose objective is to cause KP-DNNs to severely mispredict the positions of individual key-points, and then use machine learning to explain such mispredictions.
5 APPROACH

This section provides a solution to the problem of test suite generation for KP-DNNs, described in Section 3, by applying many-objective search algorithms. Similar to Section 3, although our approach is not specific to the domain of facial key-points, to root the presentation into concrete examples, we describe the approach in the context of test suite generation for FKP-DNNs.

Recall that MOSA [35] and FITEST [2] take as input (1) a set of objectives, (2) a set of fitness functions indicating the degree to which individual objectives have been achieved, and (3) a time budget; they then return a set of solutions that maximally achieve each objective individually within the given time budget. Therefore, we can apply the algorithms to our problem by carefully defining a set of corresponding objectives and fitness functions.

Based on the problem definition in Section 3, we can define that an objective for each key-point is to find a test image that makes an FKP-DNN under test severely mispredict the key-point position. Specifically, for an FKP-DNN d, the set of objectives to be achieved by a many-objective search algorithm is NE(p_{t,i}, p̂_{t,i}) ≥ ε for all i = 1, ..., k, where NE(p_{t,i}, p̂_{t,i}) is the normalized error of the predicted position p̂_{t,i} with respect to the actual position p_{t,i} for the i-th key-point and ε is a small threshold pre-defined by domain experts. Naturally, the set of fitness functions corresponding to the objectives is NE(p_{t,i}, p̂_{t,i}) for all i = 1, ..., k.

While it seems intuitive to define the sets of objectives and fitness functions for many-objective search algorithms, the issue in practice is how to determine the actual positions of key-points in an automated manner; a simulator plays a key role here.

Figure 1 shows the overview of the search-based test suite generation process for an FKP-DNN using a simulator. It is composed of four main components: a search engine, a simulator, the DNN under test, and a fitness calculator. The process begins with the search engine generating a set of new input values for image characteristics ic (e.g., head posture and skin color). The input values are used by the simulator to generate a new test image t as well as the actual positions of key-points p_t = ⟨p_{t,1}, ..., p_{t,k}⟩. The FKP-DNN under test d takes t as input and returns the predicted positions of the key-points d(t) = ⟨p̂_{t,1}, ..., p̂_{t,k}⟩. Then, the fitness calculator computes the fitness score of t using NE(p_{t,i}, p̂_{t,i}) for each i = 1, ..., k. The fitness score is fed back to the search engine to generate new and better input values over the next iterations. This process continues until either a given time budget runs out or all the objectives are achieved. The process ends by returning a test suite TS such that each t ∈ TS satisfies NE(p_{t,i}, p̂_{t,i}) ≥ ε for some i = 1, ..., k. In the following subsections, we explain each of the approach components in more detail.

5.1 Search Engine

The search engine drives the whole process based on a meta-heuristic search algorithm. For a given set of objectives with fitness functions and a time budget, it iteratively generates new test inputs for ic to ultimately provide engineers with the most effective test suite.
Figure 1: Overview of Automatic Test Suite Generation for FKP-DNNs
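The loop in Figure 1 can be summarized in a few lines. This is a schematic sketch only: the simulator and DNN interfaces (render_labeled_image and predict) are hypothetical placeholders rather than IEE's actual APIs, normalized_error is the sketch from Section 3, and the zero-fitness convention for invisible key-points is one possible choice that the paper does not prescribe.

```python
def fitness_vector(ic, simulator, dnn, face_size_of):
    """One pass through the Figure 1 pipeline: image characteristics ic
    -> labeled test image -> DNN predictions -> one NE value per key-point."""
    image, actual_kps = simulator.render_labeled_image(ic)  # t and p_t
    predicted_kps = dnn.predict(image)                      # d(t)
    face_size = face_size_of(actual_kps)
    # Invisible key-points get fitness 0 here so they are never reported
    # as covered objectives (an assumption, not the paper's definition).
    return [normalized_error(a, p, face_size) if a is not None else 0.0
            for a, p in zip(actual_kps, predicted_kps)]
```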
Search algorithms, such as Random Search (RS), MOSA [35], and FITEST [2], vary in the way they generate new test inputs based on the fitness results from previous iterations.

For example, MOSA starts with an initial set of randomly generated test cases that forms an initial population; then, it creates new test cases using crossover and mutation operators that are typically used in Genetic Algorithms (GA). Unlike GA, the selection in MOSA is performed by considering both the non-dominance relation and the uncovered objectives. Specifically, for each uncovered objective u, a test case t will have a higher chance of remaining in the next generation if t is non-dominated by the others and is the closest to covering u. To form the final test suite, MOSA uses an archive that keeps track of the best test cases that cover individual objectives. FITEST is basically the same as MOSA, except that it dynamically reduces the population size with each iteration, according to the number of uncovered objectives.

More specifically, Algorithm 1 presents the pseudo-code of MOSA and FITEST. It takes as input a set of objectives O and returns an archive A (i.e., a test suite) that aims to maximally achieve individual objectives in O. To do that, the algorithm begins by initializing A, the set of uncovered objectives U, and the population P (lines 1–3). The algorithm then computes the set of covered objectives C ⊆ O by calculating the fitness scores of P for O (line 4), updates A to include the test cases from P that are best at achieving C (line 5), and updates U to remove the covered objectives in C (line 6). Until the stopping criterion is met, the algorithm repeats the following: (1) generating a set of new test cases Q from P using crossover and mutation (line 8), (2) updating C, A, and U for Q, as done for P (lines 9–11), and (3) generating the next generation P from the current P and Q considering U using selection (line 12). The algorithm ends by returning A. The main difference between MOSA and FITEST is in the getNextGeneration function (line 12): MOSA keeps |P| = |O|, whereas FITEST reduces |P| as |U| decreases.

In practice, using an effective and efficient search algorithm for a given problem is critical for testing at scale and, therefore, for applicability in industrial contexts. This is also the case here for generating better test suites for FKP-DNNs.
Algorithm 1: Pseudo-code of MOSA and FITEST

Input: Set of objectives O
Output: Archive (test cases) A

1:  Archive A ← ∅
2:  Set of uncovered objectives U ← O
3:  Set of test cases P ← initialPopulation(|O|)
4:  Set of covered objectives C ← calculateObjs(O, P)
5:  A ← updateArchive(A, P, C)
6:  U ← updateUncoveredObjs(U, C)
7:  while not (stopping_condition) do
8:      Set of test cases Q ← generateOffspring(P)
9:      C ← calculateObjs(O, Q)
10:     A ← updateArchive(A, Q, C)
11:     U ← updateUncoveredObjs(U, C)
12:     P ← getNextGeneration(P ∪ Q, U)
13: end
14: return A
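For readers who prefer executable code over pseudo-code, the sketch below mirrors Algorithm 1's control flow in Python. The input encoding (roll, pitch, yaw only), the crossover and mutation operators, and the simplified selection on line 12 are our own placeholders, so this illustrates the loop structure rather than faithfully reimplementing MOSA or FITEST.

```python
import random
import time

EPSILON = 0.05            # severity threshold; IEE's application-specific value
BOUNDS = [(-30, 30)] * 3  # illustrative encoding: roll, pitch, yaw only

def random_input():
    return [random.uniform(lo, hi) for lo, hi in BOUNDS]

def crossover(p1, p2):
    # Whole-arithmetic crossover as a simple stand-in for SBX.
    a = random.random()
    return [a * x + (1 - a) * y for x, y in zip(p1, p2)]

def mutate(t):
    i = random.randrange(len(t))
    lo, hi = BOUNDS[i]
    t = list(t)
    t[i] = min(max(t[i] + random.gauss(0, 3.0), lo), hi)
    return t

def search(objectives, fitness_of, budget_s=7200, shrink=True):
    """Algorithm 1 control flow; shrink=True mimics FITEST's population
    reduction, shrink=False keeps |P| = |O| as MOSA does."""
    deadline = time.time() + budget_s
    archive, uncovered = {}, set(objectives)            # lines 1-2
    population = [random_input() for _ in objectives]   # line 3

    def update(tests):  # lines 4-6 and 9-11
        for t in tests:
            fit = fitness_of(t)  # dict mapping each objective to its NE value
            for o in [o for o in uncovered if fit[o] >= EPSILON]:
                archive[o] = t   # t severely mispredicts key-point o
                uncovered.discard(o)

    update(population)
    while time.time() < deadline and uncovered:         # line 7
        offspring = [mutate(crossover(*random.sample(population, 2)))
                     for _ in population]               # line 8
        update(offspring)                               # lines 9-11
        if not uncovered:
            break
        # Line 12, heavily simplified: keep the inputs closest to covering
        # some uncovered objective (real MOSA also ranks by non-dominance;
        # recomputing fitness here is wasteful and would be cached in practice).
        size = max(len(uncovered), 2) if shrink else len(objectives)
        population = sorted(population + offspring,
                            key=lambda t: max(fitness_of(t)[o] for o in uncovered),
                            reverse=True)[:size]
    return list(archive.values())                       # line 14
```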
5.2 Simulator

The simulator is one of the key enablers of our approach because it takes as input ic_t and generates t labeled with p_t. It should be able to generate diverse test images by manipulating various ics, such that there is variation in head posture, gender, skin color, and mouth size. At the same time, it should be able to generate individual test images within reasonable time, since this strongly affects the efficiency of the entire search-based process.

To achieve this, one can use 3D modeling tools, such as MakeHuman [11] and Blender [10]. MakeHuman can be used to create parameterized 3D face models with detailed morphological characteristics, such as skin color and hair style. Blender, on the other hand, can be used to manipulate the 3D models according to selected ic values [7]. Since a 3D model is basically a textured polygonal mesh (i.e., a collection of nodes and edges in 3D) that defines the shape of a polyhedral object, an engineer can mark the actual positions of key-points in the model by selecting the corresponding nodes in the textured mesh. Then, using the simulator, it is easy to generate a 2D image capturing the face labeled with the actual key-point positions.

5.3 Fitness Calculator

The fitness calculator takes p_{t,i} and p̂_{t,i} and returns NE(p_{t,i}, p̂_{t,i}) for each i = 1, ..., k. As mentioned in Section 3, one way to calculate NE(p_{t,i}, p̂_{t,i}) is to compute the Euclidean distance between p_{t,i} and p̂_{t,i} and then normalize it using the height or width of the actual face, whichever is larger. Normalization enables the comparison of the error values of different 3D models having different face sizes.

6 EMPIRICAL EVALUATION

In this section, we report on the empirical evaluation of applying our many-objective, search-based test suite generation approach to an FKP-DNN developed by IEE. We aim to answer the following research questions.
RQ1: How do alternative many-objective search algorithms fare in terms of test effectiveness?
RQ1 aims to check whether using many-objective search algorithms, such as MOSA and FITEST, is indeed a suitable solution for the problem of test suite generation for FKP-DNNs. To answer this, we investigate how many key-points are severely mispredicted (according to our earlier definition) by test suites that are automatically generated by the search algorithms, and how alternative algorithms compare to each other and to a simple random search.
RQ2: Can we further distinguish search algorithms using the degree of mispredictions caused by the test suites they generate?
Figure 2: The actual positions of 27 facial key-points
Since RQ1 only considers the number of severely mispredicted key-points, differences in effectiveness across search algorithms may not appear clearly and completely. For example, two test suites generated by different algorithms may cause the same number of severely mispredicted key-points. However, since the level of risk entailed by mispredictions may be proportional to the distance between the actual and predicted key-points, a test suite with a higher degree of mispredictions may be more amenable to risk analysis, depending on the application context. Based on this, RQ2 compares how severely key-points are mispredicted by test suites generated across different search algorithms.
RQ3: Can we explain individual key-point mispredictions in terms of image characteristics?
In addition to the cost-effective, automated testing of FKP-DNNs, engineers need to understand why severe mispredictions occur for individual key-points. This is critical for engineers to analyze and address the root causes of mispredictions or, when the latter is not possible, at the very least assess the risks. For example, if engineers know that a certain key-point is significantly mispredicted for a specific head posture range, they can generate more test images in that range and retrain the DNN to improve its prediction accuracy. To this end, RQ3 aims to investigate whether it is possible to provide accurate and interpretable explanations of mispredictions based on the image characteristics used by the simulator to generate test images.
6.1 Case Study

We use a proprietary FKP-DNN, denoted IEE-DNN (described in detail in § 6.1.1), developed by IEE to build a driver's gaze detection component for autonomous vehicles. To enable simulation-based testing, we also use an in-house simulator, namely IEE-SIM (described in detail in § 6.1.2), developed by IEE to generate large numbers of labeled facial images to train the IEE-DNN.
6.1.1 IEE-DNN. The IEE-DNN takes as input a 256x256 pixel image; it returns the predicted positions of 27 facial key-points in the input image. Figure 2 depicts the actual positions of the 27 key-points for a sample image.

The IEE-DNN is based on the stacked hourglass architecture [33] with an adaptive wing loss function [46]. Specifically, it consists of two hourglass modules, each of which contains four residual modules for downsampling and another four for upsampling. It is trained on 18,120 synthetic images generated by the IEE-SIM. Against an independent set of 2,738 synthetic test images, the IEE-DNN achieved a low NME value of 0.018. Since IEE uses ε = 0.05, such an NME value implies that, if we just consider the average prediction error for all key-points, the IEE-DNN is sufficiently accurate according to the test set. Note that our approach is independent from any specific DNN architecture as it does not require any DNN internal information.
6.1.2 IEE-SIM. The IEE-SIM takes as input various image characteristics, such as head posture, light intensity, and the position of a (virtual) camera, and returns a corresponding facial image with the actual positions of the 27 key-points.

The IEE-SIM is based on MakeHuman [11] and Blender [10]. To decrease its execution time, IEE engineers carefully designed many 3D face models in advance, by considering the diversity of skin colors, face sizes, and so on. By doing this, we can quickly generate an image by specifying a pre-defined 3D model ID to use, instead of dynamically generating the 3D models from scratch at run time. As a result, the average execution time for generating one labeled image is around 6 seconds on an iMac (3GHz 6-Core Intel i5 CPU, 40GB memory, and Radeon Pro 570X 4GB graphics card).

The main concern of IEE is to verify the accuracy of the IEE-DNN for diverse head postures and drivers. Therefore, we manipulate four feature values when varying image characteristics: roll, pitch, and yaw values for controlling head posture, and the 3D model ID for indirectly controlling the face features to be used. The ranges of the roll, pitch, and yaw values are limited between -30° and +30° to mimic realistic ways in which drivers position their head. For 3D models, we use the subset of 10 different models that IEE considered to be of main interest.
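Concretely, a test input in this study can therefore be encoded as a four-gene chromosome; the sketch below shows one possible representation and a random sampler for it. The value ranges come from the text above, while the class itself is our illustration, not IEE's code.

```python
import random
from dataclasses import dataclass

MODEL_IDS = list(range(10))  # the 10 pre-defined 3D face models
ANGLE_RANGE = (-30.0, 30.0)  # roll, pitch, and yaw limits in degrees

@dataclass
class ImageCharacteristics:
    """One test input ic = (roll, pitch, yaw, model_id) for the IEE-SIM."""
    roll: float
    pitch: float
    yaw: float
    model_id: int

def random_ic() -> ImageCharacteristics:
    lo, hi = ANGLE_RANGE
    return ImageCharacteristics(roll=random.uniform(lo, hi),
                                pitch=random.uniform(lo, hi),
                                yaw=random.uniform(lo, hi),
                                model_id=random.choice(MODEL_IDS))
```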
6.2 RQ1: Test Effectiveness

To answer RQ1, we generate a test suite using our approach for a fixed time budget and measure the Effectiveness Score (ES) of the test suite, defined as the number of key-points that are severely mispredicted by the IEE-DNN according to the test suite over the total number of key-points. ES ranges between 0 and 1, where higher values, when possible, are desirable.

To better understand how ES varies across different search algorithms, we compare MOSA [35], FITEST [2], as well as their variants, namely MOSA+ and FITEST+. The variants are identical to their originals, but the crossover strategy is changed to consider fitness scores for uncovered (i.e., not-yet-achieved) objectives to better guide candidates towards uncovered objectives. In other words, the higher the fitness score for uncovered objectives, the more similar children are to parents. This is done by dynamically updating the distribution index of the Simulated Binary Crossover (SBX), based on the fitness score of parents, for uncovered objectives.

Furthermore, we use Random Search (RS) as a baseline. RS randomly generates a test suite for each iteration and keeps the best until the search process ends; RS provides insights into how easy the search problem is and helps us assess if other, more complex search algorithms are indeed necessary.

Figure 3: Test effectiveness (ES) for different search algorithms

One might consider using NSGA-III as another baseline that tries to achieve many objectives collectively. However, it is not readily tailored to test suite generation, as explained in Section 2.1, and therefore requires addressing the problem of how to encode test suites into chromosomes, which is a research subject that is beyond the scope of our paper.

For MOSA, FITEST, and their variants, the initial population size is the number of objectives. To be consistent, we set the number of randomly generated test cases at each iteration of RS to be the number of objectives. For the other parameters, such as mutation and crossover rates in MOSA, FITEST, and their variants, we use the default values recommended in the original studies.

To account for randomness in search algorithms, we repeat the experiment 20 times. For each run, we use the same time budget of two hours for all search algorithms, based on our preliminary evaluation showing that two hours are enough to converge. We apply the non-parametric Mann–Whitney U test [29] to assess the statistical significance of the difference in ES between algorithms. We also measure Vargha and Delaney's Â_AB [45] to capture the effect size of the difference.

Figure 3 depicts the differences in ES across search algorithms. RS only achieves ES = 0.41 on average across 20 runs. Though such an ES value may seem a priori high, one must recall that, in many situations, some key-points cannot be correctly predicted, for example when they are invisible or hidden by a shadow. In contrast to RS, MOSA, MOSA+, and FITEST+ reach ES > 0.93 on average, implying that our approach seems indeed effective at generating test suites that cause severe mispredictions for many key-points.
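The comparisons reported next rest on two standard statistics; here is a self-contained sketch of both, where the two samples of 20 ES values are hypothetical stand-ins for per-run results (we use SciPy's Mann–Whitney U test and compute Vargha and Delaney's Â_AB directly from its definition).

```python
from scipy.stats import mannwhitneyu

def vargha_delaney_a(a, b):
    """A_AB: probability that a random value from sample a exceeds a
    random value from sample b, counting ties as one half."""
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return wins / (len(a) * len(b))

# Hypothetical ES values from 20 runs of two algorithms:
es_mosa = [0.93, 0.96, 0.93, 0.96, 0.93] * 4
es_rs = [0.41, 0.37, 0.44, 0.41, 0.41] * 4
stat, p = mannwhitneyu(es_mosa, es_rs, alternative="two-sided")
print(f"p-value: {p:.3f}, A_AB: {vargha_delaney_a(es_mosa, es_rs):.2f}")
```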
Table 1 shows the results of statistical comparisons between different search algorithms. Under the Pair column, sub-columns A and B indicate the two search algorithms being compared. Under the ES column, sub-columns p-value and Â_AB indicate the statistical significance and the effect size, respectively, when comparing ES distributions between A and B.

Table 1: Statistical Analysis Results

Pair                 | ES              | MS
A        B           | p-value   Â_AB  | p-value   Â_AB
MOSA     RS          | 0.000     1.00  | 0.000     –
MOSA+    RS          | 0.000     1.00  | 0.000     –
FITEST   RS          | 0.000     1.00  | 0.000     –
FITEST+  RS          | 0.000     1.00  | 0.000     –
MOSA     FITEST      | 0.000     0.84  | 0.570     –
MOSA     FITEST+     | 0.000     0.95  | 0.594     –
MOSA     MOSA+       | 0.750     0.53  | 0.052     –
MOSA+    FITEST      | 0.005     0.76  | 0.177     –
MOSA+    FITEST+     | 0.000     0.86  | 0.009     0.56
FITEST   FITEST+     | 0.109     0.64  | 0.780     –

At a level of significance α = 0.01, we can see that MOSA is significantly better than FITEST, with a large effect size (0.84). Comparing MOSA+ and FITEST+ shows the same result, with an even larger effect size (0.86). The differences between MOSA and MOSA+, and between FITEST and FITEST+, in terms of ES are statistically insignificant, implying that dynamically controlling the similarity between parents and children in crossover does not significantly improve effectiveness. In RQ2, we will further compare the many-objective search algorithms by considering the degree of misprediction severity across the test suites they generate.

To summarize, the answer to RQ1 is that our approach is effective in generating test suites that cause the IEE-DNN
—which is already rather accurate (NME = 0.018 for the test dataset) as one might expect from an industrial model—to severely mispredict more than 93% of all key-points on average. This number is also much higher than that obtained with random search (41%). While MOSA and MOSA+ are significantly better than FITEST and FITEST+ in terms of ES, there is no significant difference between MOSA and MOSA+, which will be investigated further in RQ2.

6.3 RQ2: Misprediction Severity

To answer RQ2, following the same procedure as for RQ1, we measure the Misprediction Severity (MS) of a test suite for each key-point, defined as the maximum NE value observed when running the test suite. As for NE, MS ranges between 0 and 1, where 1 implies the maximum prediction error.

Since RQ2 aims to further distinguish the search algorithms by considering the MS values of the test suites they generated in RQ1, we use the test suites from RQ1 and report the average MS of individual key-points for each algorithm. We apply the non-parametric Wilcoxon signed-rank test to statistically compare MS distributions for each key-point between search algorithms, and measure Vargha and Delaney's Â_AB to capture the effect size of the difference.

Figure 4 shows the average MS values of the test suites generated by different search algorithms, for individual key-points. For example, we can see that the average MS value with MOSA+ for the 26th key-point (i.e., KP26) is around 0.8.
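Given the NE values already recorded while a test suite runs, MS is a one-line reduction; a small sketch with NumPy, where ne_records is a hypothetical (test cases × key-points) matrix of NE values.

```python
import numpy as np

def misprediction_severity(ne_records: np.ndarray) -> np.ndarray:
    """MS per key-point: the maximum NE value observed across all test
    cases in the suite (one value in [0, 1] per key-point)."""
    return ne_records.max(axis=0)

# Example: a suite of 50 test cases against a 27-key-point DNN.
rng = np.random.default_rng(0)
ne_records = rng.uniform(0.0, 1.0, size=(50, 27))  # placeholder NE values
print(misprediction_severity(ne_records).round(2))
```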
Figure 4: Misprediction Severity (MS) for individual key-points for different search algorithms

Comparing MOSA, FITEST, and their variants, we can see that the overall patterns shown in the radar chart are similar. This implies that the many-objective test suite generation algorithms yield similar patterns in terms of which individual key-points tend to get higher MS values. To further assess the difference between them, we need to check the results of the statistical tests.

Table 1 shows the results of statistical comparisons between different search algorithms. Under the MS column, sub-columns p-value and Â_AB indicate the statistical significance and the effect size, respectively, when comparing the MS distributions of the two algorithms under columns A and B.

In Table 1, with α = 0.01, there is no statistically significant difference in MS between MOSA and MOSA+, and between FITEST and FITEST+. This implies that, consistent with RQ1, dynamically adjusting the distribution index in crossover does not increase misprediction severity for individual key-points.

As for MOSA and FITEST, which were significantly different with respect to ES in RQ1, they are no longer significantly different regarding MS. This inconsistency between RQ1 and RQ2 happens because, even though MOSA (and MOSA+) is better than FITEST (and FITEST+) in making the NE values of more key-points exceed ε = 0.05, FITEST (and FITEST+) yields higher NE values than MOSA (and MOSA+) for some key-points. When it comes to MOSA+ and FITEST+, MOSA+ is significantly better than FITEST+ with a small effect size (0.56). The results between MOSA and FITEST, and MOSA+ and FITEST+, imply that dynamically reducing the population size during the search increases the degree of mispredictions for some key-points, but the overall impact is limited.

Interestingly, Figure 4 shows that some key-points (i.e., KP7, KP24, KP25, KP26, and KP27) are more severely mispredicted than the others. A detailed analysis found two distinct reasons that together affect the accuracy of the DNN: the under-representation of some key-points in the training data and a large variation in the shape and size of the mouth across different 3D models. For KP7, we found that it is only included in 79% of the training data. One possible explanation is that, as shown in Figure 2, KP7 can easily become invisible depending on the head posture. Interestingly, KP9, which has a symmetrical position to KP7 in the face, is included in a larger proportion of the training data (84%). This implies that key-points that are under-represented in the training data, because they happen not to be visible on numerous occasions, are more likely to be severely mispredicted. On the other hand, KP24, KP25, KP26, and KP27 are more severely mispredicted than the others, even though they are included in most of the training images (more than 92%). These four key-points are severely mispredicted because they are related to the mouth, which shows the largest variation among face features; the mouth can be opened and closed, and is larger than the eyes. Therefore, the positions of the mouth key-points vary more than other key-points in the face, even when the head posture is fixed. As a result, learning the position of the mouth key-points is more difficult than for other key-points. This implies that, during training, we need to focus more on certain key-points whose actual positions can vary more than the others across images.

To summarize, the answer to RQ2 is that, by additionally considering MS, we cannot further distinguish MOSA and FITEST, and their variants. Instead, through a detailed analysis of the most severely mispredicted key-points, we identified important insights to prevent severe mispredictions: (1) since some key-points may not actually be visible in an image, depending on image characteristics (e.g., head posture), it is important to ensure, for each key-point, that the training data contains enough images where the key-point is visible; (2) more training is required for certain key-points whose positions tend to vary significantly more across the face and which are harder to predict than the others. Such observations can obviously generalize to other types of images and key-points.

6.4 RQ3: Explaining Mispredictions

To answer RQ3, decision trees [6] are learnt to infer how the NE (normalized prediction error) of the FKP-DNN for individual key-points relates to the image characteristics used by the simulator to generate test images. We use decision trees because they are easy to interpret due to their hierarchical decision-making process [32], though alternative forms of rule-based learning could be considered as well. Interpretability is essential for engineers to assess the risks associated with a DNN in the context of a given application and to devise ways to improve the DNN, for example, through additional training data.
Specifically, we build a decision tree for each key-point to identify conditions describing how NE varies according to the input variables, i.e., roll, pitch, yaw, and 3D model ID. Since our test suite generation approach generates many test images specified by roll, pitch, yaw, and 3D model, and given that NE values are calculated for individual key-points during the search process, we can use this information as a set of observations for building decision trees and explaining mispredictions. To mimic a practical scenario in which engineers use our approach overnight (e.g., ten hours), we use the information collected from five random runs (i.e., equivalent to ten hours) of MOSA+ (i.e., the most effective search algorithm according to RQ1 and RQ2). Since the target variable (i.e., NE) is continuous, we use regression trees to explain and predict the degree of mispredictions. Specifically, we use REPTree (i.e., a fast tree learner using information gain and reduced-error pruning) implemented in Weka [48].

Figure 5: Regression tree accuracy and size for all key-points
To evaluate the (predictive) accuracy of the generated regression trees, we measure the Mean Absolute Error (MAE) using 10-fold cross validation. Note that, to provide interpretable explanations, the simplicity of the resulting regression trees is just as important as their accuracy. In particular, as the number of nodes in a regression tree increases, it becomes increasingly difficult for engineers to interpret the results. As a trade-off between accuracy and simplicity, we set the minimum number of observations per leaf node to 40, based on preliminary experiments. For the other parameters of REPTree, we use the default values provided by Weka.
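The study uses Weka's REPTree; for readers who prefer Python, the sketch below mirrors the same setup with scikit-learn's DecisionTreeRegressor as an analogous (but not identical) learner, including the 40-observation leaf limit and the 10-fold MAE estimate. The feature names and the randomly generated observations are placeholders for the data collected during the search.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor, export_text

# Placeholder observations: roll, pitch, yaw, model_id -> NE of one key-point.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(-30, 30, 3854),
                     rng.uniform(-30, 30, 3854),
                     rng.uniform(-30, 30, 3854),
                     rng.integers(0, 10, 3854)])
y = rng.uniform(0.0, 1.0, 3854)  # placeholder NE values

tree = DecisionTreeRegressor(min_samples_leaf=40)  # accuracy/simplicity trade-off
mae = -cross_val_score(tree, X, y, cv=10,
                       scoring="neg_mean_absolute_error").mean()
tree.fit(X, y)
print(f"10-fold MAE: {mae:.3f}")
print(export_text(tree, feature_names=["roll", "pitch", "yaw", "model_id"]))
```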
Figure 5 shows the MAE and size (i.e., number of nodes) distributions for all 27 trees built based on the 3,854 test images obtained from five runs of MOSA+. The average MAE is 0.01 and the average size is 25.7. We can see that the trees are accurate, as there is an average difference of 0.01 between the actual and predicted NE values, which is far below the threshold that IEE finds acceptable (0.05). The trees are also of reasonable size and therefore easy to interpret, as they lead to simple rules, as further shown below.

Table 2 shows some representative rules derived from the tree generated for KP26, the most severely mispredicted key-point in RQ2. All the remaining rules and the trees for the other key-points are available in the supporting material (http://tiny.cc/trees-weka). In Table 2, column ic-condition shows conditions on values of the image characteristics, i.e., Roll (R), Pitch (P), Yaw (Y), and 3D Model ID (M); column NE shows the average NE value for all test images satisfying the condition. For example, the first row of the table means that, for test images such that the 3D model ID is 9 and the pitch is less than 18.41, the average NE is 0.04. Recall that each 3D model ID corresponds to implicit facial characteristics. For example, for model ID 9, skin color is brown, face structure is broad, nose type is aquiline, and mouth structure is uneven.

Using such conditions, engineers can easily identify when the FKP-DNN leads to severe mispredictions for individual key-points. For example, by comparing the first and third conditions in Table 2, we can see that, for the same 3D model, changing the head posture toward specific ranges leads to a significant increase in the prediction error of the IEE-DNN for KP26. To better visualize the example above, we present two test images, in Figures 6a and 6b, satisfying the first and third conditions, respectively. In each image, the green dots and the red triangle indicate the actual and predicted positions of KP26.

Table 2: Representative rules derived from the decision tree for KP26 (M: Model-ID, P: Pitch, R: Roll, Y: Yaw)

ic-condition                                      | NE
M = 9 ∧ P < 18.41                                 | 0.04
M = 9 ∧ P ≥ 18.41 ∧ R < −….31 ∧ Y < ….06          | 0.…
M = 9 ∧ P ≥ 18.41 ∧ R < −….31 ∧ ….06 ≤ Y < 19     | 0.89
M = 9 ∧ P ≥ 18.41 ∧ R < −….31 ∧ Y ≥ 19            | 0.…
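Rules like those in Table 2 translate directly into samplers for new, targeted test images; the sketch below draws inputs satisfying one such rule, with illustrative bounds wherever Table 2's exact cell values are application data (sample_rule is our own helper, not part of the approach's tooling).

```python
import random

def sample_rule(n, model_id, pitch_min, roll_max, lo=-30.0, hi=30.0):
    """Draw n (roll, pitch, yaw, model_id) tuples satisfying a Table 2-style
    rule: M = model_id AND P >= pitch_min AND R < roll_max."""
    return [(random.uniform(lo, roll_max),   # roll below the rule's bound
             random.uniform(pitch_min, hi),  # pitch above the rule's bound
             random.uniform(lo, hi),         # yaw is unconstrained by this rule
             model_id)
            for _ in range(n)]

# Images expected to trigger a large NE for KP26: model 9 and pitch >= 18.41
# come from Table 2; the roll bound is illustrative.
targeted = sample_rule(100, model_id=9, pitch_min=18.41, roll_max=-5.0)
```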
Figure 6: Test images satisfying two different conditions in the regression tree for KP26. Actual and predicted positions are shown by green-circle and red-triangle dots, respectively.

While there is only a small prediction error in Figure 6a, the error becomes very large (NE = 0.89) when turning the head further down and to the right, as shown in Figure 6b. Engineers can further investigate the root causes of severe mispredictions using targeted additional images. For example, based on Figure 6b, we generated a targeted image that differs only in the direction of the shadow (by changing the position of the light source in the 3D model) and could confirm that the shadow was indeed the root cause of the large NE value.

Knowing under what conditions severe mispredictions occur can help engineers in two ways. First, it helps to assess the risks associated with individual key-points for specific conditions, in the context of a specific application. For example, if the head posture can be constrained to P < 18.41 in the context of a certain application of the FKP-DNN, the risk of severe misprediction for KP26 is reduced according to the conditions in Table 2. Second, it enables the generation of specific test images, using the simulator, that are expected to cause particularly severe mispredictions and can be used for retraining the DNN. For example, from the conditions in Table 2, we can generate new images satisfying M = 9 ∧ P ≥ 18.41 ∧ R < −….31, which will likely lead to high NE values, and can be used to help retrain and improve the DNN by augmenting its training data set.

To summarize, the answer to RQ3 is that, by building regression trees using the information that is produced by executing our test generation approach, we can largely explain the variance in individual key-point mispredictions in terms of image characteristics that are controllable in the simulator. These conditions can help engineers assess risks for individual key-points and generate specific images for retraining of the DNN in a way that is more likely to have an impact on its accuracy.

6.5 Threats to Validity

In RQ1, we have used a threshold (ε = 0.05) for deciding if a key-point is severely mispredicted or not. This may affect the results. However, this threshold value was set by IEE domain experts, accounting for the specific application of their DNN. Furthermore, we checked that using other small threshold values does not severely affect overall trends but only slightly affects ES values.

We have used 3D face models with various facial characteristics shared by IEE. IEE engineers have prepared the 3D models by marking key-points' actual positions manually. The manual marking can be erroneous, and such errors can affect the quality of the data generated for training and testing the DNN. To mitigate such threats, IEE engineers have done extensive testing for each 3D model and carefully validated the generated data.

Though we focused, in our case study, on one FKP-DNN developed by IEE, it is representative of what one can find in industry [8, 22, 23], in terms of prediction accuracy, inputs and outputs, and training procedure. Additional industrial case studies are however required to improve the generalizability of our results.
6.6 Lessons Learned

Lesson 1: Our approach is indeed useful in practice.
Although we have presented here the evaluation results using the latest IEE-DNN version, our approach has been continuously applied to previous versions during its development process. Indeed, test suites automatically generated by our approach and the analysis of testing results for mispredicted key-points have helped assess and improve IEE-DNNs. In particular, our approach led IEE to augment their training dataset (e.g., diversity and sample size) and thus improve the generalization of the IEE-DNN against various inputs. In summary, based on the results, IEE has been (1) continuously enriching their dataset by adding more training images from diverse 3D face models and (2) improving the IEE-DNN's architecture by doubling the number of hidden layers to drastically increase its accuracy.

Our approach has also been used for improving the simulator. For example, during a detailed analysis of the testing results, we revealed a critical issue: the labeled key-point positions on test images generated by the IEE-SIM were not accurate and, after analysis, engineers realized there was a misalignment between the texture and nodes of the 3D mesh that are used to define the actual positions of key-points.

IEE also plans to use our approach in various ways. Since the test suites generated by our approach cause severe mispredictions for many key-points, retraining the DNN using these test suites—given that the DNN architecture is now adequate—is expected to help improve the accuracy of the DNN. Our approach can also help demonstrate the robustness of the DNN under certain conditions by, for example, showing the ranges of image characteristic values (i.e., roll, pitch, yaw, and 3D model ID) which, for generated test images, do not lead to severe mispredictions. If those conditions match the conditions of the intended application, then engineers can conclude that the DNN is likely to be safe for use.
Lesson 2: Simulation-based testing brings key benefits.
As discussed in Section 5.2, a simulator is one of the key enablers of our approach as it generates labeled test images in a controlled manner. However, the simulator can also be a limiting factor, as the test suites generated by our approach depend on what the simulator can generate. Ideally, one would like access to a high-fidelity and configurable simulator that is able to generate diverse images, accounting for all major factors that affect the behavior of the DNN under test. For example, the IEE-SIM has been developed to support the configuration of various image characteristics, such as head posture, face size, and skin color, that are essential because of their effect on facial key-point detection. Thanks to the simulator, we can generate as many different test images as we need, with known key-points, and drive effectively automated test generation. Given the usefulness of our approach, the cost of having a simulator with sufficient fidelity and configurability benefits the development and testing of KP-DNNs by enabling efficient automation.
CONCLUSION
In this paper, we formalize the problem of KP-DNN testing and present an approach to automatically generate test data for KP-DNNs with many independent outputs, a common situation in many applications. We empirically compare state-of-the-art, many-objective search algorithms and their variants tailored for test suite generation. We find MOSA+ to be significantly more effective than random search (the baseline) and other many-objective search algorithms, e.g., FITEST, with large effect sizes. We also observe that our approach can generate test suites that severely mispredict more than 93% of all key-points on average, whereas random search can do so for only 41% of them. We further investigate and demonstrate a way, based on regression trees, to learn the conditions, in terms of image characteristics, that cause severe mispredictions for individual key-points (see the sketch below). These conditions are essential for engineers to assess the risks associated with using a DNN and to generate new images for DNN retraining, when possible.

As part of future work, we plan to devise effective retraining strategies and assess to what extent the accuracy of FKP-DNNs can be improved through retraining. We also plan to tailor NSGA-III, a generic many-objective search algorithm, for test suite generation and empirically evaluate its performance.
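As an illustration of how such conditions can be learned, the sketch below fits a shallow regression tree [6] on the image characteristics of generated test images, using one key-point's prediction error as the target; each root-to-leaf path then reads as an interpretable condition. The use of scikit-learn, the column names, and the toy data are assumptions, not the exact setup used in our experiments.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical testing results for one key-point: the characteristics of
# each generated test image and the key-point's normalized prediction error.
data = pd.DataFrame({
    "roll":  [-20, -5, 0, 10, 25, 30, -15, 5],
    "pitch": [0, 10, -10, 20, 5, -5, 15, 0],
    "yaw":   [5, -50, 10, 55, -60, 0, 45, -10],
    "error": [0.01, 0.09, 0.02, 0.11, 0.12, 0.01, 0.08, 0.02],
})

# A shallow tree keeps the learned conditions human-readable.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(data[["roll", "pitch", "yaw"]], data["error"])

# Leaves whose predicted error exceeds the severity threshold (e.g., 0.05)
# describe regions of the characteristic space prone to severe mispredictions.
print(export_text(tree, feature_names=["roll", "pitch", "yaw"]))
```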
ACKNOWLEDGMENTS
This work has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694277), IEE S.A. Luxembourg, and the Canada Research Chair and Discovery Grant programmes. Donghwan Shin was partially supported by the Basic Science Research Programme through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2019R1A6A3A03033444). We would like to thank Raja Ben Abdessalem and Shiva Nejati for early discussions on this paper.
REFERENCES
[1] Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2018. Testing vision-based control systems using learnable evolutionary algorithms. In Proceedings of the 40th International Conference on Software Engineering (ICSE). IEEE, 1016–1026.
[2] Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2018. Testing autonomous cars for feature interaction failures using many-objective search. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE). IEEE, 143–154.
[3] Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2018. Testing autonomous cars for feature interaction failures using many-objective search. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE). IEEE, 143–154.
[4] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2014. 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3686–3693.
[5] Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2016. Testing advanced driver assistance systems using multi-objective search and neural networks. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). 63–74.
[6] Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. 1984. Classification and Regression Trees. CRC Press.
[7] Leyde Briceno and Gunther Paul. 2019. MakeHuman: A Review of the Modelling Framework. In Proceedings of the 20th Congress of the International Ergonomics Association (IEA 2018), Sebastiano Bagnara, Riccardo Tartaglia, Sara Albolino, Thomas Alexander, and Yushi Fujita (Eds.). Springer International Publishing, Cham, 224–232.
[8] Yu Chen, Jian Yang, and Jianjun Qian. 2017. Recurrent neural network for facial landmark detection. Neurocomputing 219 (2017), 26–38.
[9] Guillermo Campos Ciro, Frédéric Dugardin, Farouk Yalaoui, and Russell Kelly. 2016. A NSGA-II and NSGA-III comparison for solving an open shop scheduling problem with resource constraints. IFAC-PapersOnLine 49, 12 (2016), 1272–1277.
[10] Blender Online Community. 2018. Blender - a 3D modelling and rendering package. Stichting Blender Foundation, Amsterdam. http://www.blender.org
[11] MakeHuman Community. Makehumancommunity.org. http://www.makehumancommunity.org
[12] Kalyanmoy Deb and Himanshu Jain. 2013. An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: Solving problems with box constraints. IEEE Transactions on Evolutionary Computation 18, 4 (2013), 577–601.
[13] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. A. M. T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6, 2 (2002), 182–197.
[14] Changxing Ding and Dacheng Tao. 2015. Robust face recognition via multimodal deep face representation. IEEE Transactions on Multimedia 17, 11 (2015), 2049–2058.
[15] Xiaoning Du, Xiaofei Xie, Yi Li, Lei Ma, Jianjun Zhao, and Yang Liu. 2018. DeepCruiser: Automated guided testing for stateful deep learning systems. arXiv preprint arXiv:1812.05339 (2018).
[16] Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. 416–419.
[17] Alessio Gambi, Marc Mueller, and Gordon Fraser. 2019. Automatically testing self-driving cars with search-based procedural content generation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 318–328.
[18] Jianmin Guo, Yu Jiang, Yue Zhao, Quan Chen, and Jiaguang Sun. 2018. DLFuzz: Differential fuzzing testing of deep learning systems. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 739–743.
[19] Yunfei Hou, Yunjie Zhao, Aditya Wagh, Longfei Zhang, Chunming Qiao, Kevin F. Hulme, Changxu Wu, Adel W. Sadek, and Xuejie Liu. 2015. Simulation-based testing and evaluation tools for transportation cyber–physical systems. IEEE Transactions on Vehicular Technology 65, 3 (2015), 1098–1108.
[20] Xiao Hu, Shaohu Peng, Li Wang, Zhao Yang, and Zhaowen Li. 2017. Surveillance video face recognition with single sample per person based on 3D modeling and blurring. Neurocomputing 235 (2017), 46–58.
[21] Di Huang, Renke Zhang, Yuan Yin, Yiding Wang, and Yunhong Wang. 2017. Local feature approach to dorsal hand vein recognition by centroid-based circular key-point grid and fine-grained matching. Image and Vision Computing 58 (2017), 266–277.
[22] Yichao Huang, Xiaorui Liu, Lianwen Jin, and Xin Zhang. 2015. DeepFinger: A cascade convolutional neuron network approach to finger key point detection in egocentric vision with mobile camera. In 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2944–2949.
[23] Rateb Jabbar, Khalifa Al-Khalifa, Mohamed Kharbeche, Wael Alhajyaseen, Mohsen Jafari, and Shan Jiang. 2018. Real-time driver drowsiness detection for Android application using deep neural networks techniques. Procedia Computer Science 130 (2018), 400–407.
[24] Joshua Knowles and David Corne. 2007. Quantifying the effects of objective space dimension in evolutionary multiobjective optimization. In International Conference on Evolutionary Multi-Criterion Optimization. Springer, 757–771.
[25] Zelun Kong, Junfeng Guo, Ang Li, and Cong Liu. 2020. PhysGAN: Generating physical-world-resilient adversarial examples for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14254–14263.
[26] Junnan Li and Edmund Y. Lam. 2015. Facial expression recognition using deep neural networks. IEEE, 1–6.
[27] Zheng Li, Mark Harman, and Robert M. Hierons. 2007. Search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering 33, 4 (2007), 225–237.
[28] Journal of Systems and Software 170 (2020), 110767.
[29] Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18, 1 (1947), 50–60.
[30] Phil McMinn. 2011. Search-based software testing: Past, present and future. In 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops. IEEE, 153–163.
[31] Claudio Menghi, Shiva Nejati, Lionel Briand, and Yago Isasi Parache. 2020. Approximation-refinement testing of compute-intensive cyber-physical models: An approach based on system identification. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 372–384.
[32] W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. 2019. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences 116, 44 (2019), 22071–22080.
[33] Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision. Springer, 483–499.
[34] Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. 2015. Reformulating branch coverage as a many-objective optimization problem. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST). 1–10. https://doi.org/10.1109/ICST.2015.7102604
[35] Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. 2017. Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets. IEEE Transactions on Software Engineering 44, 2 (2017), 122–158.
[36] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. 506–519.
[37] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles. 1–18.
[38] Vincenzo Riccio and Paolo Tonella. 2020. Model-based exploration of the frontier of behaviours for deep learning system testing. arXiv preprint arXiv:2007.02787 (2020).
[39] Andras Rozsa, Manuel Günther, Ethan M. Rudd, and Terrance E. Boult. 2019. Facial attributes: Accuracy and adversarial robustness. Pattern Recognition Letters (2019).
[40] Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3674–3681.
[41] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. 2017. Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1145–1153.
[42] Guanglu Song, Yu Liu, Yuhang Zang, Xiaogang Wang, Biao Leng, and Qingsheng Yuan. 2020. KPNet: Towards minimal face detector. arXiv:2003.07543 [cs.CV]
[43] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering. 303–314.
[44] Cumhur Erkan Tuncali, Georgios Fainekos, Hisahiro Ito, and James Kapinski. 2018. Simulation-based adversarial test generation for autonomous vehicles with machine learning components. In 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 1555–1562.
[45] András Vargha and Harold D. Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101–132. https://doi.org/10.3102/10769986025002101
[46] Xinyao Wang, Liefeng Bo, and Li Fuxin. 2019. Adaptive wing loss for robust face alignment via heatmap regression. In Proceedings of the IEEE International Conference on Computer Vision. 6971–6981.
[47] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-guided black-box safety testing of deep neural networks. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer.
[48] Ian H. Witten and Eibe Frank. 2002. Data mining: Practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record 31, 1 (2002), 76–77.
[49] Lior Wolf, Tal Hassner, and Itay Maoz. 2011. Face recognition in unconstrained videos with matched background similarity. In CVPR 2011. IEEE, 529–534.
[50] Xiaofei Xie, Lei Ma, Felix Juefei-Xu, Minhui Xue, Hongxu Chen, Yang Liu, Jianjun Zhao, Bo Li, Jianxiong Yin, and Simon See. 2019. DeepHunter: A coverage-guided fuzz testing framework for deep neural networks. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 146–157.
[51] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. 2019. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems 30, 9 (2019), 2805–2824.
[52] Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. 2018. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE). IEEE, 132–142.
[53] Shutong Zhang and Chenyue Meng. 2016. Facial keypoints detection using neural network. Stanford Report (2016), 1.
[54] Husheng Zhou, Wei Li, Zelun Kong, Junfeng Guo, Yuqun Zhang, Bei Yu, Lingming Zhang, and Cong Liu. 2020. DeepBillboard: Systematic physical-world testing of autonomous driving systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering.