MFPP: Morphological Fragmental Perturbation Pyramid for Black-Box Model Explanations

Qing Yang∗, Xia Zhu†, Jong-Kae Fwu†, Yun Ye∗, Ganmei You∗ and Yuan Zhu∗
∗ Intel Corporation, China. Email: {qing.y.yang, yun.ye, ganmei.you, yuan.y.zhu}@intel.com
† Intel Corporation, USA. Email: {xia.zhu, jong-kae.fwu}@intel.com

Abstract—Deep neural networks (DNNs) have recently been applied to many advanced and diverse tasks, such as medical diagnosis and automatic driving. Due to the lack of transparency of deep models, DNNs are often criticized because their predictions cannot be explained by humans. In this paper, we propose a novel Morphological Fragmental Perturbation Pyramid (MFPP) method to address the explainable-AI problem. In particular, we focus on the black-box scheme, which can identify the input area responsible for the output of a DNN without requiring knowledge of its internal architecture. In the MFPP method, we divide the input image into multi-scale fragments and randomly mask out fragments as perturbations to generate a saliency map, which indicates the significance of each pixel for the prediction of the black-box model. Compared with the existing input-sampling perturbation method, the pyramid-structured fragments prove more effective: they better exploit the morphological information of the input image to match its semantic information, and they require no values from inside the DNN. We qualitatively and quantitatively show that MFPP matches or exceeds the performance of state-of-the-art (SOTA) black-box interpretation methods on multiple DNN models and datasets.
I. INTRODUCTION
In the past decade, deep neural networks (DNNs) have made breakthroughs in various AI tasks and greatly changed many fields, such as computer vision and natural language processing. However, the lack of transparency of DNN models has led to serious concerns about the widespread deployment of machine learning technologies, especially when these models are given decision-making power in critical applications such as medical diagnosis, autonomous driving, intelligent surveillance, and financial authentication [Yang et al., 2020].

Many methods for model interpretation have been proposed, but they often produce unsatisfactory results. Some existing methods [Zhou et al., 2016], [Selvaraju et al., 2017], [Chattopadhay et al., 2018] generate saliency maps based on intermediate information. For example, CAM (Class Activation Mapping) [Zhou et al., 2016] requires adding or reusing an existing global average pooling layer before the last fully-connected layer to generate a saliency map; Grad-CAM [Selvaraju et al., 2017] needs weights and feature-map values inside a DNN to compute a weighted summation. Besides requiring DNN internal structure information, the results have low resolution and look visually coarse due to the up-scaling operations in the saliency-map generation process. As shown in Figure 1, Grad-CAM can locate objects such as bananas and skiers, but the generated map is larger than the object itself, which reduces its reliability and effectiveness. The black-box methods [Ribeiro et al., 2016], [Fong and Vedaldi, 2017], [Petsiuk et al., 2018], [Wagner et al., 2019] do not require the internal structure and values of DNNs, but they still have problems with accuracy and granularity. LIME [Ribeiro et al., 2016] proposes a model-agnostic method built on traditional machine-learning ideas, interpreting the predictions of more complex models by fitting small local linear models. However, a linear model cannot fully handle the weights of millions of data points, which may lead to underfitting.
Therefore, they simplify the input sampling method (from pixel mode to superpixel mode), resulting in a coarse-grained and less accurate saliency map. For objects such as the banana and wall clock in Figure 1, the saliency map from LIME covers part of the target object, but also covers other regions. Moreover, its non-adaptive probability threshold requires the user's manual input. BBMP [Fong and Vedaldi, 2017] uses the Adam optimizer to iteratively optimize the saliency map. However, when the background is complex and deceptive, it is difficult to converge and correctly locate the target; as a result, BBMP gives a poor explanation in the first three samples of Figure 1. RISE [Petsiuk et al., 2018] can generate a saliency map for each pixel, and it works well on rectangular objects (such as the goldfish sample in the fifth row). When non-rectangular targets or multiple tiny objects belong to the same category, grid-based sampling methods such as RISE suffer performance degradation. The recent Extremal Perturbation (EP) [Fong et al., 2019] is the best prior method; it identifies saliency maps through extremal perturbation and SGD. Our evaluation results in Sec. IV-D show that EP is time-consuming and provides unbalanced accuracy.

In summary, the main contributions of this paper are as follows: (1) A novel Morphological Fragmental Perturbation Pyramid (MFPP) design is proposed for black-box model interpretation, which perturbs morphological fragments at different scales and makes full use of input semantic information. (2) The proposed MFPP is applied to the random input sampling method and significantly improves its interpretation accuracy. (3) Qualitative and quantitative evaluations on multiple datasets and models are performed, showing that MFPP provides better interpretability of black-box DNN predictions, higher accuracy, and an order of magnitude faster processing than the latest methods.

Fig. 1: Visual comparison of the proposed MFPP, the proposed Fast-MFPP, and other methods, from left to right: EP [Fong et al., 2019], RISE [Petsiuk et al., 2018], BBMP [Fong and Vedaldi, 2017], LIME [Ribeiro et al., 2016] and Grad-CAM [Selvaraju et al., 2017], on VGG16. The rows show clock, banana, baseball, goldfish, and ski samples; the goldfish sample is a public image [Gan, 2019] and the remaining samples are from MS COCO 2014 [Lin et al., 2014].

The rest of the paper is organized as follows: Section II briefly reviews related work in the model-explanation area. Section III proposes the black-box model explanation method MFPP. Section IV presents the intuitive and quantitative experiments and the corresponding analysis. Finally, a summary of the paper is provided in Section V.

II. RELATED WORK
In the past few years, many methods [Zhou et al., 2016], [Selvaraju et al., 2017], [Chattopadhay et al., 2018], [Ribeiro et al., 2016], [Petsiuk et al., 2018], [Wagner et al., 2019] have been proposed to explain and visualize the predictions of deep CNN classifiers, which has greatly promoted research on model interpretability and design optimization. Survey papers [Zhang and Zhu, 2018], [Guidotti et al., 2019], [Du et al., 2019] summarize these methods thoroughly. In this section, we provide criteria to classify the previous visual interpretation methods along multiple dimensions, allowing researchers to view the differences between the categories from several angles.
A. Model-dependent vs. Model-agnostic
Since the ultimate goal of model interpretation is to diagnose and analyze how the model works, the model itself is the protagonist. Whether the model under interpretation is a black box or a white box divides methods into two main camps: model-dependent methods (MDM) and model-agnostic methods (MAM). MDM leverage internal model information to generate an explanation. For example, CAM generates an importance map of the input image by using class-specific gradient information that flows into the final GAP (Global Average Pooling [Lin et al., 2013]) layer of a CNN model. Grad-CAM requires activation values and a specified feature map. Grad-CAM++ [Chattopadhay et al., 2018] even needs a smooth prediction function because it utilizes third derivatives. In short, model-dependent methods usually have strict restrictions on their use and often suffer from low-resolution results.

On the contrary, model-agnostic methods (such as LIME [Ribeiro et al., 2016], BBMP [Fong and Vedaldi, 2017], RISE [Petsiuk et al., 2018], and FGVis [Wagner et al., 2019]) have no such restrictions, as they do not use internal model information. MAM only leverage input samples and the corresponding outputs to visually explain the model's predictions. MAM can explain the predictions of almost all
classifiers, including CNN models and classical machine-learning models (e.g., support vector machines, decision trees, and random forests). Model-agnostic methods therefore have broad practicality.

Fig. 2: Overview of the MFPP workflow: the input image I is sent to a segmentation algorithm Ψ to generate multi-scale segmentation results. For each segmentation scale, masks Mi are randomly generated and element-wise multiplied with the input I to obtain masked images. These masked images are fed to the black-box DNN model to obtain the saliency scores of the target classes. The scores are weighted-summed across all masks, and the final output is the saliency map.

B. Patch-wise vs. Pixel-wise perturbation
Input perturbation is a general technique: we measure the output change after modifying the input by removing or inserting information (e.g., masking, blurring, or replacing regions of the image). It can also be used to generate volumes of prediction data for random sampling or local model fitting. BBMP evaluates and selects the most meaningful perturbation method by comparing area blurring, fixed-value replacement, and added-noise perturbations. Patch-wise and pixel-wise perturbation lead to different visual results.

Pixel-wise perturbation is adopted by FGVis and real-time saliency [Dabkowski and Gal, 2017], while LIME, Anchors [Ribeiro et al., 2018], and Regional [Seo et al., 2019] are based on patch-wise perturbation. The perturbation method affects the granularity of the output saliency map. In general, saliency maps from pixel-wise methods are more accurate in terms of location, at the expense of fewer semantic details, because their results are spatially discrete. Results from patch-wise methods are more visually pleasing and closer to model-dependent results, since patch boundaries can better fit object boundaries.
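The removal-based perturbations discussed above can be sketched as simple image operations. The helper names below are illustrative, not taken from any of the cited methods, and the channel-wise mean replacement is only a crude stand-in for the area blurring that BBMP compares:

```python
import numpy as np

def mask_patch(img, y0, y1, x0, x1, value=0.0):
    """Patch-wise perturbation: replace a rectangular region with a constant."""
    out = img.copy()
    out[y0:y1, x0:x1] = value
    return out

def mean_patch(img, y0, y1, x0, x1):
    """Patch-wise perturbation: replace a region with its channel-wise mean
    (a crude stand-in for area blurring)."""
    out = img.copy()
    out[y0:y1, x0:x1] = img[y0:y1, x0:x1].mean(axis=(0, 1), keepdims=True)
    return out

def mask_pixels(img, pixel_mask, value=0.0):
    """Pixel-wise perturbation: remove individual pixels selected by a boolean mask."""
    out = img.copy()
    out[pixel_mask] = value
    return out
```

Patch-wise operators act on a region at once, so the resulting saliency follows region boundaries; the pixel-wise operator can remove arbitrary scattered pixels, which is what makes its maps spatially discrete.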
C. External model fitting vs. Statistical method
Depending on how the model outputs from the perturbed inputs are handled, methods are further divided into two camps, leading to different results and processing times. LIME locally fits another linear model with ridge regression to facilitate interpretation. The Anchors [Ribeiro et al., 2018] method expands this idea to locally fit decision trees in order to provide better interpretability. Both methods require manually selecting the number of iterations for local model fitting, and manual configuration does not guarantee that the local model fits well. Both BBMP and EP use gradient-based optimization (Adam [Kingma and Ba, 2014]) to iteratively optimize the saliency map. Our experimental results show that the number of iterations significantly affects the accuracy and appearance of the final saliency map; however, the authors set the iteration number to a constant value based on experience. In contrast, RISE uses a statistical method that performs a weighted summation over all sampled outputs without additional model fitting or optimizer iterations. As shown in Table II, the statistical method requires less processing time.

III. PROPOSED METHOD
Input perturbation is commonly used in existing methods (LIME, Anchors, BBMP, RISE, FGVis, EP). In RISE [Petsiuk et al., 2018], Petsiuk et al. proposed a random input sampling method: they randomly mask half of the cells of a 7×7 grid and additionally apply a random translation shift to generate masks. This method ignores the morphological characteristics of the object, making the visual interpretation results unsatisfactory, as shown in Figure 1. In LIME [Ribeiro et al., 2016], coarse-grained superpixel perturbation makes the interpretation coarse-grained and low-precision, as shown in Figure 1. In Sec. III-A, we show the importance of morphology to visual interpretation results and analyze the relationship between the two. In Sec. III-B, we introduce a new input perturbation method, bridging multi-scale morphological information with the input perturbation approach; the granularity of the interpretation results is greatly improved with the proposed method. Sec. III-C shows the overall flow of the proposed method.
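RISE's grid sampling scheme described above can be sketched as follows. This is a minimal re-implementation under stated assumptions, not the authors' code: nearest-neighbor repetition stands in for the bilinear upsampling used in the actual RISE implementation, and the random shift is realized by cropping an oversized mask:

```python
import numpy as np

def rise_style_mask(h, w, cells=7, p=0.5, rng=None):
    """One RISE-style random mask: sample a coarse grid with keep-probability p,
    upscale it, and apply a random shift by cropping a (h, w) window."""
    rng = rng or np.random.default_rng()
    ch, cw = -(-h // cells), -(-w // cells)            # cell size (ceil division)
    grid = (rng.random((cells + 1, cells + 1)) < p).astype(float)
    big = np.kron(grid, np.ones((ch, cw)))             # nearest-neighbor upscaling
    dy, dx = rng.integers(0, ch), rng.integers(0, cw)  # random shift offsets
    return big[dy:dy + h, dx:dx + w]
```

Because every mask is built from axis-aligned grid cells, the mask boundaries never follow object contours; this is exactly the morphological information that MFPP's fragment-based masks add.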
Fig. 3: Top-left: the superpixel result for the "bird-and-cat" picture. Top-right: perturb the superpixels and statistically measure the "bird" score of each segment. Bottom: visualization of the probability distribution of bird fragments from high to low.
A. Morphological analysis on visual explanation
Objects and their surrounding areas are the main basis for a classification model's decisions. For each prediction, the content composing the objects is very important for the visual interpretation of the result. According to morphological theory, objects consist of shapes, textures, and colors. Zeiler et al. [Zeiler and Fergus, 2014] designed DeconvNet to visualize and understand convolutional networks. Their results show that the lower layers respond to edges, corners, and colors; the middle layers focus on textures and parts of shapes; and the high layers recognize entire objects and tend to extract the semantic information of complete shapes. Inspired by this work, we designed our method to make full use of morphological information when perturbing the input.

We started with an experiment. As shown in Figure 3, first, the image is divided into different segments, also known as superpixels. Then, we perturb the superpixels and send the perturbed image to the model f for prediction. Third, we statistically measure the significance score of the "bird" category in each prediction. Finally, we lower the threshold of the interpretation score to visualize the probability distribution of the segments that make up the target object.

Based on this method, we can evaluate the contribution of each patch to a specific category. For the "bird" prediction, as shown in the upper right of Figure 3, fragments with different scores are shown in different colors: the brighter the color, the higher the score. The high-scoring patches cluster where the bird is, and, as shown in Figure 3, the score gradually decreases away from the center of the bird.

Fig. 4: Different boundary areas generated using edge-based and grid-based boundaries. Top: edge-based boundaries with sigma 1, 3, and 7 from left to right. Bottom: grid-based boundaries with sigma 10 and 20 from left to right; the bottom-right corner is the 7×7 grid boundary used by RISE.
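The fragment-scoring experiment above can be sketched as follows. This is a toy illustration, not the paper's implementation: a grid label map and a dummy scoring function stand in for the real superpixel segmentation and classifier, and the per-segment score is estimated as the average model score over the random trials in which that segment was kept:

```python
import numpy as np

def fragment_scores(image, model, seg, n_trials=200, p=0.5, seed=0):
    """Estimate each segment's contribution to the model score.
    `model` maps an (H, W, C) image to a scalar class score;
    `seg` is an (H, W) integer label map (e.g., a superpixel result)."""
    rng = np.random.default_rng(seed)
    labels = np.unique(seg)                      # sorted segment ids
    score_sum = np.zeros(labels.size)
    kept = np.zeros(labels.size)
    for _ in range(n_trials):
        keep = rng.random(labels.size) < p       # drop segments at random
        mask = keep[np.searchsorted(labels, seg)]
        s = model(image * mask[..., None])       # score of the perturbed image
        score_sum += s * keep                    # credit segments that were kept
        kept += keep
    return labels, score_sum / np.maximum(kept, 1)
```

Segments whose removal consistently lowers the class score receive high averages, reproducing the bright-cluster effect around the bird in Figure 3.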
This observation inspired us to utilize morphology-based masks to perturb the model input in accordance with its semantic distribution. Since segmentation is the basis for generating masks in our method, SLIC [Achanta et al., 2012], a fast and effective superpixel algorithm with O(N) time complexity, is adopted. In Figure 4, different sigma values are used in SLIC to generate different segmentation results. The sigma value controls the smoothness of the segment boundaries, i.e., the morphological degree, and it influences the explanation performance as shown in sub-figure (a) of Figure 6.

B. Fragments Perturbation Pyramid
Fig. 5: Different segmentation densities/scales applied to the input image.
Lin et al. proposed the Feature Pyramid Network (FPN) [Lin et al., 2017] to take advantage of the inherent multi-scale pyramid hierarchy of deep convolutional networks, which shows significant improvements in certain applications. In the field of object detection, from sliding windows to multiple feature maps as input to the classification module, people have always looked for ways to enable the predictor to detect and recognize objects of varying sizes. In Faster R-CNN [Ren et al., 2017], anchors are designed to expand the receptive field to different scales, thereby significantly improving accuracy on objects of different sizes. In addition, the YOLO series [Redmon et al., 2015], [Redmon and Farhadi, 2017], [Redmon and Ali, 2018] and other one-stage object detection networks continue to use feature maps at different scales in the evaluation phase. Based on this observation, we transfer this classic idea from object detection to model explanation.

As shown in Figure 5, by changing the segment-number value in the segmentation algorithm, the input image I is segmented into fragments of different granularities. These fragments are used for randomized mask generation. This allows the model interpreter to view multi-scale objects in the input image, which helps enrich the explanation sources.

C. MFPP
Figure 2 shows the overall structure of MFPP and the data flow used to explain the prediction of a black-box model. The input image I is sent to the segmentation algorithm Ψ to generate multiple segments. Masks M_i are generated by converting randomly selected fragments to zero grayscale. We multiply these masks with the input I element-wise to get the masked images. The masked images are then fed to the black-box model to get the prediction scores of the target class, which are weighted-summed to generate the final saliency map.

Let I : Ω → R^3 be a color image of size H × W, where Ω = {1, ..., H} × {1, ..., W} is a discrete lattice. Let Φ be a black-box machine learning model that maps the image to a scalar output Φ(I) ∈ R; the output could be an activation or a class prediction score. Next, we investigate what part of the input I strongly activates the category, resulting in a large response Φ(I). In particular, we would like to find a mask M that assigns a value M(µ) ∈ {0, 1} to each pixel µ ∈ Ω, where M(µ) = 1 means that the pixel contributes strongly to the output and M(µ) = 0 means it does not. Based on the Monte Carlo method, we can estimate

  S_{I,Φ}(µ) ≈ (1 / (E[M] · N)) · Σ_{i=1}^{N} Φ(I ⊙ M_i) · M_i(µ)   (1)

where Ψ is the image segmentation operation and F_l is its output for image I at scale l:

  F_l = Ψ(I)   (2)

Here N is the total number of masks over all segmentation scales, g(F_l) is the number of masks generated from segmentation F_l, and L is the total number of scales (pyramid layers):

  N = Σ_{l=1}^{L} g(F_l)   (3)

Substituting N from (3) into (1):

  S_{I,Φ}(µ) = (1 / (E[M] · Σ_{l=1}^{L} g(F_l))) · Σ_{l=1}^{L} Σ_{i=1}^{g(F_l)} Φ(I ⊙ M_{l,i}) · M_{l,i}(µ)   (4)

Note that our method does not use any internal information of the model, so it is suitable for interpreting any black-box model.

IV. EXPERIMENTS AND RESULTS
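The estimator in Eq. (4) can be sketched in a few lines. This is a minimal Monte Carlo illustration, not the authors' implementation: `segmentations` is assumed to be a list of integer label maps F_l (a real implementation would obtain them from SLIC at several segment counts), and E[M] equals the keep-probability p because each fragment is kept independently:

```python
import numpy as np

def mfpp_saliency(image, model, segmentations, masks_per_scale=100, p=0.5, seed=0):
    """Monte Carlo sketch of Eq. (4): for each segmentation scale F_l, draw
    random fragment masks M_{l,i}, weight each mask by the black-box score
    Phi(I * M), and normalize the weighted sum by E[M] * N."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    acc = np.zeros((h, w))
    n = 0
    for seg in segmentations:                          # pyramid layers F_l
        labels = np.unique(seg)
        for _ in range(masks_per_scale):               # masks M_{l,i}
            keep = rng.random(labels.size) < p         # random fragment selection
            mask = keep[np.searchsorted(labels, seg)].astype(float)
            acc += model(image * mask[..., None]) * mask
            n += 1
    return acc / (p * n)                               # normalize by E[M] * N
```

Only forward evaluations of `model` are used, which matches the paper's claim that no internal model information is required.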
A. Datasets and base models

Our algorithm is implemented and evaluated with PyTorch 1.2.0. The experiments are run on a single Nvidia P6000 GPU. The size of the input image I to the black-box model is 224×224.

In the experiments, the four SOTA methods and our proposed MFPP method were visually compared on typical samples of the MS COCO 2014 dataset [Lin et al., 2014]. Quantitative experiments were conducted on the entire test subset of PASCAL VOC [Everingham et al., 2010] and the MS COCO 2014 minival set. The models used in the experiments are VGG16 [Simonyan and Zisserman, 2014] and ResNet50 [He et al., 2016].

B. Intuitive results
In Figure 1, we compare the model explanation results of all methods on the same input images (the first column), including the proposed MFPP and its fast version. On intuitive visual evaluation, MFPP provides a more accurate and fine-grained saliency map than the other competitive methods.

Grad-CAM's outputs look coarse and low-resolution due to the upscaling operation in saliency-map generation. As shown in Figure 1, Grad-CAM can locate objects such as the bananas in the first row and the skiers in the fourth row; however, the area it covers is too large and extends far beyond the object borders, which reduces reliability. This weakness of Grad-CAM is clearly shown in the goldfish example (the fifth row), where there are multiple tiny objects of the same category.

Some existing model-agnostic methods (such as LIME, BBMP, RISE, and FGVis) have accuracy and/or granularity issues. For example, the BBMP column shows a poor explanation of the first three samples, because when the background is complex and deceptive, BBMP cannot correctly locate the target.

In the first samples of EP's output (7th column), its saliency map locates only a few points of the target, and it fails entirely in the third example due to the complex background behind the baseball player. It has a similar weakness in handling multiple tiny objects of the same category, since it easily loses focus on non-extreme points, as shown in the goldfish sample.

RISE can generate a pixel-wise heat map, but its grid-based sampling method leads to two problems: first, performance on non-rectangular targets is poor; second, it loses granularity, as shown in the baseball-player sample (the third row).

The second and third columns show the performance of the proposed Fast-MFPP and MFPP. In the first two and last rows, the proposed methods outperform all other methods. In the goldfish case, MFPP finds most instances correctly without error. In the baseball case, the proposed designs are robust even with a complex background and are able to identify the exact body-shape area of the target object. In the overall comparison, MFPP generates the most accurate and cleanest importance maps, with minimal errors and noise.

C. Ablation Studies
In this section, we perform ablation studies on the effects of various parameters used in our method. Pointing-game accuracy on ResNet50 with the VOC07 test set is used as the performance metric.

Figure 6 sub-figure (a) depicts the sensitivity of MFPP to the fragmentation morphological degree (the sigma value of the SLIC algorithm). The higher the sigma value, the smoother the boundaries. When sigma is small, MFPP is not very sensitive to it; when sigma is high enough, MFPP degenerates into a multi-layer RISE. The same effect is shown in the lower-right sub-picture of Figure 4. Sub-figure (b) shows the effect of the number of pyramid layers (i.e., the number of fragmentation scales): at first, more layers improve performance, but when the number of layers becomes too high, performance degrades. During mask generation, masks are first generated at an upscaled size and then randomly cropped to the target size; sub-figure (c) shows the sensitivity to the mask upscaling offset. Sub-figure (d) shows the impact of the number of fragments; overly dense or sparse fragments reduce the final performance.

For the experiments in Figure 1 and the remaining experiments, the configurations of MFPP and Fast-MFPP are as follows: SLIC sigma is 1 and compactness is 10, the upscale offset is 2.2, and the numbers of fragments for the 5 pyramid layers start at 50.

D. Main results
In this section, we quantitatively evaluate explanation performance and processing time with the pointing-game experiment [Zhang et al., 2018]. The pointing game extracts the maximum point from the saliency map and measures whether it falls into the ground-truth bounding boxes. The accuracy score is defined as

  Acc = Hits / (Hits + Misses).

The experiment was conducted on the PASCAL VOC07 test set (4,952 images and 20 categories of ground-truth labels) and the COCO 2014 minival set (excluding images with the attribute "iscrowd = 1"). We repeated each experiment 3 times and took the average. The results are compared with reference results taken from [Fong et al., 2019].

The evaluation includes two versions of MFPP, which differ in their number of masks: Fast-MFPP uses fewer masks than MFPP, and both use the perturbation pyramid. EP is the existing SOTA with multiple hyper-parameters, which we configure with its default values. The RISE configuration matches the RISE paper: separate mask counts for VGG16 and ResNet50, a 7×7 grid of cells, and the default masking probability.

TABLE I: The result of the pointing game [Zhang et al., 2018] on the VOC2007 test and COCO2014 minival datasets. Methods in grey are for black-box models. EP's result on VOC07 is taken from [Fong et al., 2019].
TABLE II: The benchmark of average processing time for single-sample explanation on the VOC07 test dataset.
As shown in Table I, in terms of localization accuracy on the VOC07 test set, MFPP has the highest score on ResNet50, higher than the current SOTA method EP [Fong et al., 2019] (88.9%); EP has the best accuracy on VGG16, at 88.0%. On the COCO14 minival set, MFPP has the highest scores on both VGG16 and ResNet50, also higher than EP (51.5% and 56.1%). Methods in grey rows apply to black-box models, while methods in white rows apply to white-box models. Overall, the two versions of MFPP are more accurate than most existing methods.

For the processing-time benchmark, we measure the total execution time from input-image preprocessing to saliency-map generation, and report both the average and the standard deviation. The hardware tested is a single Nvidia P6000 GPU. As shown in Table II, on the VOC07 test set, MFPP takes 32.63 seconds per explanation on ResNet50, which is 2.2 times faster than EP's 72.09 seconds; it is also 1.47 times faster on VGG16. In particular, Fast-MFPP is 10.7 times faster than EP on ResNet50 and 7.3 times faster on VGG16. The experiments on the COCO14 minival set show similar results. Fast-MFPP is the fastest of the compared black-box explanation methods on both the VOC07 test and COCO14 minival sets.

In Figure 7, (a) and (b) show the impact of the number of masks on pointing-game accuracy for VGG16 and ResNet50. Accuracy increases slowly as the number of masks increases; with enough masks on ResNet50, MFPP exceeds the SOTA and achieves the highest score.
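The pointing-game metric behind these accuracy numbers can be sketched as follows. This is a minimal re-implementation of the published protocol [Zhang et al., 2018], not the authors' evaluation code; boxes are assumed to be inclusive (y0, x0, y1, x1) pixel coordinates:

```python
import numpy as np

def pointing_game(saliency_maps, gt_boxes):
    """Pointing-game accuracy, Acc = Hits / (Hits + Misses): take the argmax
    of each saliency map and count a hit when it falls inside any ground-truth
    box of the target class."""
    hits = 0
    for sal, boxes in zip(saliency_maps, gt_boxes):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)  # maximum point
        if any(y0 <= y <= y1 and x0 <= x <= x1 for y0, x0, y1, x1 in boxes):
            hits += 1
    return hits / len(saliency_maps)
```

Because only the single maximum point is tested, the metric rewards correct localization rather than tight coverage, which is why methods with over-large maps (e.g., Grad-CAM) can still score reasonably here.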
Fig. 6: Pointing-game accuracy under different MFPP configurations: (a) the effect of the morphological degree (i.e., the sigma value in SLIC; the smaller, the more morphological); (b) the effect of pyramid layers (number of fragmentation scales); (c) the effect of the mask upscaling offset (masks are first generated at an upscaled size and then randomly cropped to the target size); (d) the effect of the number of fragments with one pyramid layer. All experiments are conducted on ResNet50 and the VOC07 test set.
Fig. 7: Pointing-game accuracy of different model explanation methods versus the number of masks: (a) accuracy on VGG16; (b) accuracy on ResNet50. EP is shown as a fixed value, since the number of masks is not one of its parameters.
In Figure 8, (a) and (b) show the corresponding processing times for different numbers of masks. The processing time of both MFPP and RISE increases linearly with the number of masks, and both are faster than EP.

In summary, the pointing-game experiments conducted on the PASCAL VOC07 test set and the COCO minival set show that our proposed MFPP method benefits from morphological fragmentation and multiple perturbation layers. In terms of accuracy, it meets or exceeds the performance of existing
Fig. 8: Processing time of the pointing-game experiment for different model explanation methods: (a) processing time on VGG16; (b) processing time on ResNet50. EP is shown as a fixed value, since the number of masks is not one of its parameters.
SOTA black-box interpretation methods. At the same time, in average per-sample processing time, MFPP is at least twice as fast as the existing SOTA method. In particular, Fast-MFPP is more than 10 times faster than the existing SOTA method on ResNet50, with the same accuracy level (within 0.5%).

V. CONCLUSION
This paper proposes a novel MFPP method to explain the predictions of black-box models with multi-scale morphological fragment perturbation. First, we show that morphological fragmentation is a more effective perturbation method than random input sampling in model interpretation tasks. Second, MFPP generates finer-grained, intuitively convincing interpretations of complex-shaped objects. Third, quantitative experiments show that MFPP matches or exceeds the SOTA black-box interpretation method EP [Fong et al., 2019] on the classic pointing-game accuracy score, while being at least twice as fast; in particular, the fast version of MFPP is an order of magnitude faster than EP on ResNet50 while achieving the same level of accuracy. Since MFPP outperforms the latest methods in terms of both accuracy and speed, we believe that MFPP can be a promising interpretation method in the field of deep neural network diagnosis.

REFERENCES

[Achanta et al., 2012] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282.
[Chattopadhay et al., 2018] Chattopadhay, A., Sarkar, A., Howlader, P., and Balasubramanian, V. N. (2018). Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847. IEEE.
[Dabkowski and Gal, 2017] Dabkowski, P. and Gal, Y. (2017). Real time image saliency for black box classifiers. In Advances in NIPS, pages 6967–6976.
[Du et al., 2019] Du, M., Liu, N., and Hu, X. (2019). Techniques for interpretable machine learning. Communications of the ACM, 63(1):68–77.
[Everingham et al., 2010] Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.
[Fong et al., 2019] Fong, R., Patrick, M., and Vedaldi, A. (2019). Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2950–2958.
[Fong and Vedaldi, 2017] Fong, R. C. and Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3429–3437.
[Gan, 2019] Gan, A. (2019). Goldfish sample. https://github.com/WindQingYang/Sample_Pictures/blob/master/goldfish.jpg.
[Guidotti et al., 2019] Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., and Pedreschi, D. (2019). A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93.
[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Lin et al., 2013] Lin, M., Chen, Q., and Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
[Lin et al., 2017] Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125.
[Lin et al., 2014] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Lecture Notes in Computer Science, pages 740–755.
[Petsiuk et al., 2018] Petsiuk, V., Das, A., and Saenko, K. (2018). RISE: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421.
[Redmon and Ali, 2018] Redmon, J. and Ali, F. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
[Redmon et al., 2015] Redmon, J., Divvala, S. K., Girshick, R. B., and Farhadi, A. (2015). You only look once: Unified, real-time object detection. CoRR, abs/1506.02640.
[Redmon and Farhadi, 2017] Redmon, J. and Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[Ren et al., 2017] Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.
[Ribeiro et al., 2016] Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.
[Ribeiro et al., 2018] Ribeiro, M. T., Singh, S., and Guestrin, C. (2018). Anchors: High-precision model-agnostic explanations. In AAAI.
[Russakovsky et al., 2015] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
International Journal of Computer Vision , 115(3):211–252.[Selvaraju et al., 2017] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam,R., Parikh, D., and Batra, D. (2017). Grad-cam: Visual explanations fromeep networks via gradient-based localization. In
Proceedings of the IEEEinternational conference on computer vision , pages 618–626.[Seo et al., 2019] Seo, D., Oh, K., and Oh, I.-S. (2019). Regional multi-scaleapproach for visually pleasing explanations of deep neural networks.
IEEEAccess , 8:8572–8582.[Simonyan et al., 2013] Simonyan, K., Vedaldi, A., and Zisserman, A.(2013). Deep inside convolutional networks: Visualising image classifi-cation models and saliency maps.[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014).Very deep convolutional networks for large-scale image recognition. arXivpreprint arXiv:1409.1556 .[Wagner et al., 2019] Wagner, J., Kohler, J. M., Gindele, T., Hetzel, L.,Wiedemer, J. T., and Behnke, S. (2019). Interpretable and fine-grainedvisual explanations for convolutional neural networks. In
Proceedings ofthe IEEE Conference on CVPR , pages 9097–9107.[Yang et al., 2020] Yang, Q., Zhu, X., Fwu, J.-K., Ye, Y., You, G., and Zhu,Y. (2020). Pipenet: Selective modal pipeline of fusion network for multi-modal face anti-spoofing. In
Proceedings of the IEEE/CVF Conference onComputer Vision and Pattern Recognition Workshops , pages 644–645.[Zeiler et al., 2011] Zeiler, M., Taylor, G., and Fergus, R. (2011). Adap-tive deconvolutional networks for mid and high level feature learning.
Proceedings of the IEEE International Conference on Computer Vision ,2011:2018–2025.[Zeiler and Fergus, 2014] Zeiler, M. D. and Fergus, R. (2014). Visualizingand understanding convolutional networks. In
European conference oncomputer vision , pages 818–833. Springer.[Zhang et al., 2018] Zhang, J., Bargal, S. A., Lin, Z., Brandt, J., Shen, X.,and Sclaroff, S. (2018). Top-down neural attention by excitation backprop.
International Journal of Computer Vision , 126(10):1084–1102.[Zhang and Zhu, 2018] Zhang, Q.-s. and Zhu, S.-C. (2018). Visual inter-pretability for deep learning: a survey.
Frontiers of Information Technology& Electronic Engineering , 19(1):27–39.[Zhou et al., 2014] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., andTorralba, A. (2014). Object detectors emerge in deep scene cnns. arXivpreprint arXiv:1412.6856 .[Zhou et al., 2016] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., andTorralba, A. (2016). Learning deep features for discriminative localization.In