Visualizing the Decision-making Process in Deep Neural Decision Forest
Shichao Li and Kwang-Ting Cheng
Hong Kong University of Science and Technology
Kowloon, Hong Kong
[email protected], [email protected]
Abstract
Deep neural decision forest (NDF) achieved remarkable performance on various vision tasks by combining decision trees and deep representation learning. In this work, we first trace the decision-making process of this model and visualize saliency maps to understand which portion of the input influences it more, for both classification and regression problems. We then apply NDF on a multi-task coordinate regression problem and demonstrate the distribution of routing probabilities, which is vital for interpreting NDF yet has not been shown for regression problems. The pre-trained model and code for visualization will be available at https://github.com/Nicholasli1995/VisualizingNDF
1. Introduction
Traditional decision trees [5, 3] are interpretable since they conduct inference by making decisions: an input is routed by a series of splitting nodes and the conclusion is drawn at one leaf node. Training these models follows a local greedy heuristic [5, 3], where a purity metric such as entropy is adopted to select the best splitting function from a candidate set at each splitting node. Hand-crafted features were usually used, which limits the model's representation learning ability.

Deep neural decision forest (NDF) [4] and its later regression version [6] formulated a probabilistic routing framework for decision trees. As a result, the loss function is differentiable with respect to the parameters used in the splitting functions, enabling gradient-based optimization in a global way. Despite the success of NDF, little effort has been devoted to visualizing its decision-making process. In addition, the deep representation learning ability brought by the soft-routing framework comes at the price of visiting every leaf node in the tree. The model would be more similar to a traditional decision tree, and more interpretable, if only a few leaf nodes contributed to the final prediction. Fortunately, this desired property was demonstrated by the distribution of routing probabilities in [4] for an image classification problem. To the best of our knowledge, this property has not yet been validated for any regression problem.

Figure 1: Illustration of the decision-making process in deep neural decision forest. Input images are routed (red arrows) by splitting nodes and arrive at the prediction given at leaf nodes. The feature extractor computes a deep representation from the input and sends it (blue arrows) to each splitting node for decision making. Best viewed in color.

In this paper, we trace the routing of input images and apply a gradient-based technique to visualize the important portions of the input that affect NDF's decision-making process. We also apply NDF to a new multi-task regression problem and visualize the distribution of routing probabilities to fill this knowledge gap. In summary, our contributions are:

1. We trace the decision-making process of NDF and compute saliency maps to visualize which portion of the input influences it more.

2. We utilize NDF on a new regression problem and visualize the distribution of routing probabilities to validate its interpretability.

2. Related works

Traditional classification and regression trees make predictions by decision making, where hand-crafted features [5, 3] were computed to split the feature space and route the input. Deep neural decision forest (NDF) [4] and its regression variant [6] were proposed to equip traditional decision trees with deep feature learning ability. A gradient-based method [7] was adopted to understand the predictions made by traditional deep convolutional neural networks (CNNs). However, this visualization technique has not yet been applied to NDF. Another orthogonal line of research attempts to learn more interpretable representations [9] and organize the inference process into a decision tree [10]. Our work is different from them since it is more of a visualization-based model diagnosis, and no extra loss function is used in the training phase to drive semantically meaningful feature learning as in [9].
3. Methodology
A deep neural decision forest (NDF) is an ensemble of deep neural decision trees. Each tree consists of splitting nodes and leaf nodes. In general each tree can have an unconstrained topology, but here we specify every tree as a full binary tree for simplicity. We index the nodes sequentially with an integer $i$ as shown in Figure 1.

A splitting node $\mathcal{S}_i$ is associated with a recommendation (splitting) function $\mathcal{R}_i$ that extracts deep features from the input $\mathbf{x}$ and gives the recommendation score (routing probability) $s_i = \mathcal{R}_i(\mathbf{x})$ that the input is recommended (routed) to its left sub-tree.

We denote the unique path from the root node to a leaf node $\mathcal{L}_i$ a computation path $\mathcal{P}_i$. Each leaf node stores one function $\mathcal{M}_i$ that maps the input into a prediction vector $\mathbf{p}_i = \mathcal{M}_i(\mathbf{x})$. To get the final prediction $\mathbf{P}$, each leaf node contributes its prediction vector weighted by the probability of taking its computation path:

$$\mathbf{P} = \sum_{i \in \mathcal{N}_l} w_i \mathbf{p}_i \tag{1}$$

where $\mathcal{N}_l$ is the set of all leaf nodes. The weight can be obtained by multiplying all the recommendation scores given by the splitting nodes along the path. Assume the path $\mathcal{P}_i$ consists of a sequence of $q$ splitting nodes followed by one leaf node, $\{\mathcal{S}_{i_1}^{j_{i_1}}, \mathcal{S}_{i_2}^{j_{i_2}}, \dots, \mathcal{S}_{i_q}^{j_{i_q}}, \mathcal{L}_i\}$, where the superscript of a splitting node denotes to which child node the input is routed: $j_{i_m} = 0$ means the input is routed to the left child and $j_{i_m} = 1$ otherwise. Then the weight can be expressed as

$$w_i = \prod_{m=1}^{q} (s_{i_m})^{\mathbb{1}(j_{i_m}=0)} (1 - s_{i_m})^{\mathbb{1}(j_{i_m}=1)} \tag{2}$$

Note that the weights of all leaf nodes sum to 1, and the final prediction is hence a convex combination of all the prediction vectors of the leaf nodes. In addition, we assume the recommendation and mapping functions mentioned above are differentiable and parametrized by $\theta_i$ at node $i$. The final prediction is then a differentiable function with respect to all the parameters, which we omitted above for clarity. A loss function defined upon the final prediction can hence be minimized with the back-propagation algorithm.

Note that here all computation paths contribute to the final prediction of this model, unlike in a traditional decision tree where only one path is taken for each input. We believe the model is more interpretable and similar to traditional decision trees when only a few computation paths contribute to the final prediction. This has been shown to be the case for a classification problem in [4]. Here we also demonstrate the distribution of routing probabilities for a regression problem.

To understand how the input can influence the decision-making of this model, we take the gradient of the routing probability with respect to the input and name it the decision saliency map (DSM):

$$\mathrm{DSM} = \frac{\partial s_i}{\partial \mathbf{x}} \tag{3}$$

For a classification problem, the prediction vector $\mathbf{p}_i$ for each leaf node $\mathcal{L}_i$ is a discrete probability distribution vector whose length equals the number of classes. The $y$th entry $\mathbf{p}_i(y)$ gives the probability $P(y|\mathbf{x})$ that the input $\mathbf{x}$ belongs to class $y$. For regression problems, $\mathbf{p}_i$ is also a real-valued vector, but the entries do not necessarily sum to 1. The optimization target for classification problems is to minimize the negative log-likelihood loss over the whole training set containing $N$ instances $\mathcal{D} = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$: $L(\mathcal{D}) = -\sum_{i=1}^{N} \log(P(y_i|\mathbf{x}_i))$.
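To make the soft routing concrete, the following is a minimal PyTorch sketch of one tree implementing Eq. (1) and Eq. (2). The `SoftBinaryTree` class, its level-order tensor layout and the linear splitting functions are our illustrative assumptions, not the exact architecture used in the paper (see the GitHub repository for the actual implementation).

```python
import torch
import torch.nn as nn

class SoftBinaryTree(nn.Module):
    """One soft decision tree (illustrative sketch, not the paper's exact code).

    Splitting node i produces a routing probability s_i = sigmoid(w_i^T f(x));
    every leaf stores a prediction vector p_i, and the output is the convex
    combination of Eq. (1), with the path weights w_i of Eq. (2).
    """

    def __init__(self, depth, feat_dim, out_dim):
        super().__init__()
        self.depth = depth
        self.n_leaf = 2 ** depth
        # One linear splitting function per splitting node (2^depth - 1 nodes),
        # stored in level order: index 0 is the root, 1-2 its children, etc.
        self.split = nn.Linear(feat_dim, self.n_leaf - 1)
        # Leaf prediction vectors (regression outputs or class logits).
        self.leaf = nn.Parameter(torch.randn(self.n_leaf, out_dim))

    def forward(self, feat):
        s = torch.sigmoid(self.split(feat))      # (B, n_leaf - 1)
        mu = feat.new_ones(feat.size(0), 1)      # prob. of reaching the root = 1
        begin = 0
        for d in range(self.depth):
            n = 2 ** d                           # nodes at depth d
            s_d = s[:, begin:begin + n]          # their routing probabilities
            # Route mu * s_d to the left child and mu * (1 - s_d) to the right,
            # i.e. multiply the scores along every path as in Eq. (2).
            mu = torch.stack([mu * s_d, mu * (1 - s_d)], dim=2)
            mu = mu.reshape(feat.size(0), 2 * n)
            begin += n
        # mu now holds the leaf weights w_i; each row sums to 1.
        return mu @ self.leaf                    # Eq. (1)
```

For classification, one simple differentiable choice is to store leaf logits and apply a softmax so that each $\mathbf{p}_i$ is a distribution; the original NDF [4] instead stores leaf distributions directly and updates them with a derivative-free rule.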
For a multi-task regression problem with $N$ instances $\mathcal{D} = \{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{N}$, we directly use the squared loss function $L(\mathcal{D}) = \sum_{i=1}^{N} \|\mathbf{P}_i - \mathbf{y}_i\|^2$.

In the experiments, we use a deep CNN to extract features from the input and a sigmoid function to compute the recommendation scores from the features. The network parameters and the leaf node prediction vectors are optimized alternately by back-propagation and an update rule, respectively. Details about the network architectures, training algorithm and hyper-parameter settings can be found in our supplementary materials (included in the GitHub repository).
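Given such a differentiable model, Eq. (3) is one autograd call away: mark the input as requiring gradients and back-propagate from a single routing probability. Below is a hedged sketch reusing the hypothetical `SoftBinaryTree` above; the node indexing is illustrative.

```python
import torch

def decision_saliency_map(tree, extractor, x, node_idx):
    """DSM = ds_i / dx (Eq. 3) for the splitting node node_idx (sketch)."""
    x = x.clone().detach().requires_grad_(True)
    s = torch.sigmoid(tree.split(extractor(x)))   # all routing probabilities
    # Back-propagate from the chosen node's routing probability alone;
    # samples in the batch are independent, so summing is harmless.
    s[:, node_idx].sum().backward()
    return x.grad.abs()                           # magnitude per input pixel
```

Taking the absolute value is a common display convention for saliency maps [7]; the raw signed gradient can be kept for analysis.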
4. Experiments
Standard datasets provided by PyTorch (https://pytorch.org/docs/0.4.0/_modules/torchvision/datasets) are used. We use one full binary tree of depth 9 for both datasets, but the complexity of the feature extractor for CIFAR-10 is higher. The Adam optimizer is used with a learning rate of 0.001. Test accuracies for the different datasets and feature extractors are shown in Table 1.

Dataset      Feature extractor   Accuracy
MNIST        Shallow CNN         99.3%
CIFAR-10     VGG16 [8]           92.4%
CIFAR-10     ResNet50 [1]        93.4%

Table 1: Accuracies for the classification experiments with different feature extractors.

We record the computation path taken with the largest probability for each test image and compute DSMs for some random samples, as shown in Fig. 2 and Fig. 3. The tree is very decisive, as indicated by the probability of arriving at each splitting node. In addition, the foreground usually affects the decision more, as expected and similar to [7]. Interestingly, the highlighted regions (yellow dots) of the DSMs along a computation path vary considerably for some examples, which means the network looks at different regions of the input while deciding how to route it. Another interesting observation is that the model mis-classifies a dog as a bird when it is not certain about its decision.

Figure 2: Decision saliency maps for the MNIST test set. Each row gives the decision-making process of one image, where the left-most image is the input and the others are DSMs along the computation path of the input. Each DSM is computed by taking the derivative of the routing probability with respect to the input image. The model prediction is given above the input image, and (Na, Pb) means the input arrives at splitting node a with probability b during the decision-making process.

Figure 3: Decision saliency maps for the CIFAR-10 test set, using the same annotation style as Figure 2.
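The (Na, Pb) sequences annotated in Fig. 2 and Fig. 3 can be produced by greedily following, at every splitting node, the child taken with probability max(s_i, 1 - s_i). A sketch under the hypothetical heap-ordered layout of `SoftBinaryTree` above (the figures use 1-based node numbers, so N1 corresponds to index 0 here):

```python
import torch

def trace_path(tree, extractor, x):
    """Greedily follow the most probable route for one input (batch size 1).

    Returns the visited splitting nodes with their arrival probabilities,
    matching the (Na, Pb) annotations of Fig. 2 and Fig. 3, plus the index
    of the leaf reached.
    """
    s = torch.sigmoid(tree.split(extractor(x)))[0]
    node, prob, path = 0, 1.0, []
    while node < tree.n_leaf - 1:                 # still at a splitting node
        path.append((node, prob))
        go_left = s[node].item() >= 0.5
        prob *= s[node].item() if go_left else 1.0 - s[node].item()
        node = 2 * node + 1 if go_left else 2 * node + 2
    return path, node - (tree.n_leaf - 1)         # leaf index in [0, n_leaf)
```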
Here we study the decision-making process for a more complex multi-coordinate regression problem on the 3DFAW dataset [2]. To the best of our knowledge, this is the first time NDF has been boosted and applied to a multi-task regression problem. For an input image $\mathbf{x}_i$, the goal is to predict the positions of 66 facial landmarks as a vector $\mathbf{y}_i$. We start with an initialized shape $\hat{\mathbf{y}}^0$ and use a cascade of NDFs to update the estimated facial shape stage by stage. The final prediction is $\hat{\mathbf{y}} = \hat{\mathbf{y}}^0 + \sum_{t=1}^{K} \Delta\mathbf{y}^t$, where $K$ is the total number of stages and $\Delta\mathbf{y}^t$ is the shape update (model prediction) at stage $t$. We concatenate 66 local patches cropped around the currently estimated facial landmarks as the input, and every leaf node stores a vector as the shape update. We use a cascade length of 10, and in each stage an ensemble of 3 trees is used, where each tree has a depth of 5. The model prediction is shown in Fig. 4.

Figure 4: Face alignment using a cascade of NDFs. A coarse shape initialization can be updated to well fit the ground truth after 10 stages. Best viewed in color.

The distribution of recommendation scores for this regression problem is shown in Fig. 5, which is consistent with the results for classification in [4]. This means NDF is also decisive for a regression problem, and the model can approximate the decision-making process of traditional regression trees. The input patches to the model and their corresponding DSMs for a randomly chosen splitting node are shown in Fig. 6. From these maps we can tell which parts of the face influence the decision more during the routing of the input.

Figure 5: Distribution of recommendation scores for boosted regression with NDF. Three stages are visualized, and the model is very decisive as the distribution is peaked around 0 and 1.

Figure 6: Input patches to NDF for regression and their corresponding DSMs.
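A minimal sketch of the cascaded inference described above is given below; `crop_patches` and the per-stage forest interface are hypothetical stand-ins for the repository code.

```python
def cascade_inference(forests, image, init_shape, crop_patches):
    """Boosted shape regression with a cascade of NDFs (sketch).

    forests      -- list of K = 10 stage regressors (each an NDF of 3 trees)
    image        -- the input face image (tensor)
    init_shape   -- (66, 2) tensor of initial landmark coordinates
    crop_patches -- hypothetical helper that crops one local patch around
                    each current landmark and concatenates them
    """
    shape = init_shape.clone()
    for stage in forests:
        patches = crop_patches(image, shape)   # input depends on current shape
        delta = stage(patches)                 # shape update Delta-y_t
        shape = shape + delta.view_as(shape)   # y_hat accumulates the updates
    return shape
```

Because each stage's input patches are re-cropped around the updated landmarks, later stages see increasingly well-aligned local evidence, which is the usual motivation for cascaded shape regression [3].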
5. Conclusion
We visualize saliency maps during the decision-making process of NDF for both classification and regression problems to understand which part of the input has a larger impact on the model's decision. We also apply NDF on a facial landmark regression problem and obtain the distribution of routing probabilities for the first time. The distribution is consistent with the previous classification work and indicates a decisive behavior.
Acknowledgement. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References

[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[2] L. A. Jeni, S. Tulyakov, L. Yin, N. Sebe, and J. F. Cohn. The first 3D face alignment in the wild (3DFAW) challenge. In European Conference on Computer Vision, pages 511–520. Springer, 2016.

[3] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1867–1874, June 2014.

[4] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulò. Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1467–1475, Dec 2015.

[5] S. Liao, A. K. Jain, and S. Z. Li. A fast and accurate unconstrained face detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):211–223, Feb 2016.

[6] W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, and A. L. Yuille. Deep regression forests for age estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[7] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.

[8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[9] Q. Zhang, Y. N. Wu, and S. Zhu. Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8827–8836, June 2018.

[10] Q. Zhang, Y. Yang, Y. N. Wu, and S. Zhu. Interpreting CNNs via decision trees.