How You See Me
Rohit Gandikota and Deepak Mishra

Abstract — Convolutional Neural Networks (CNNs) are among the most powerful tools in the present era of science. Much research has been done to improve their performance and robustness, while their internal working has been left largely unexplored. They are often described as black boxes that can map non-linear data effectively. This paper shows how CNNs have been taught to look at an image; visual results are shown to explain what a CNN is looking at in an image. The proposed algorithm exploits the basic mathematics behind a CNN to backtrack the important pixels. It is a generic approach that can be applied to any network up to VGG, and it requires no additional training or architectural changes. In the literature, a few attempts have been made to explain how learning happens inside a CNN by exploiting the convolution filter maps. The proposed algorithm is simple, as it involves no cost functions, filter exploitation, gradient calculations, or probability scores. Further, we demonstrate that the proposed scheme can be used in some important computer vision tasks.
I. INTRODUCTION

Convolutional Neural Networks have been revolutionizing the area of computer vision with their outstanding performance in vision tasks. They are non-linear functions designed to model the human eye. Acknowledging their importance, research has been done to improve their performance. Over time, the accuracy of these systems increased, and so did the complexity behind their working. Increasingly complex architectures have been introduced to improve performance: starting from AlexNet [1] with 8 layers, then came ZFNet [2] and VGGNet [3]; the present ResNet [4] has hundreds of layers with 50 times fewer parameters than AlexNet. These models offer little evidence of how they work internally and achieve such extraordinary results.

One approach to understanding a CNN is to exploit the feature activation maps of the filters in the network. Another is to propose the regions of the image that the CNN is looking at. The common motivation behind these methods is to propose the regions in the image that correspond to the CNN's output of recognizing objects in the image. Understanding the working of CNNs is important because:

• We can guide the training of the network more accurately when we can visualize how it learns from each epoch of training.

• Visual evidence can be given to explain any alternate identification (i.e., a classification different from the ground truth). Sometimes a network identifies, or moves its attention towards, another, less important object in an image; this can also be visualized.

• Developments in the area of neurology can follow from understanding the working of a CNN, which is essentially a model of the human eye.

Rohit Gandikota is a graduate student in the Avionics Department, Indian Institute of Space Science and Technology, Kerala, India. gandikota.SC15B088 at ug.iist.ac.in

Deepak Mishra is with the Faculty of Avionics, Indian Institute of Space Science and Technology, Kerala, India. deepak.mishra at iist.ac.in
• Many other applications can be improved and simplified, such as computing image saliency and improving detection methods. Self-driving cars can use this, as they can understand what a human sees on the road before taking certain decisions.
Fig. 1. Results from the MSRA-B dataset. The first column shows the important pixels, the second column shows the attention regions, and the final column depicts the saliency regions in the images.
The proposed algorithm exploits the basic working of a CNN to backtrack and find the important pixels in the image, i.e., those the CNN is looking at to make a decision. We unroll the forward pass of the CNN after passing an image through it: given a node, we find the nodes that activate it the most. This is done from the final output layer down to the input layer, to obtain the information at the pixel level. So, given an image, we pass it through a CNN to get the recognition output and then simply backtrack the nodes in a tree fashion from the output layer to the input layer.

The rest of the paper is organized as follows: Section II discusses the existing works relevant to this work, Section III presents the proposed method in detail, Section IV empirically demonstrates the potency of our method on saliency detection, Section V discusses the future work and improvements we are planning, and Section VI concludes the work presented.

II. PREVIOUS WORK

In this section, previous works in this area of research are discussed. This work is mainly motivated by a classic dynamic-programming algorithm, the Viterbi algorithm [11], which finds the optimal sequence of hidden states. Given an observation sequence and a Hidden Markov Model (HMM), it returns the state path through the HMM that has the maximum likelihood for the observations. The Viterbi algorithm is similar to the forward algorithm, except for one component: back-pointers. While the forward algorithm only needs to produce an observation likelihood, the Viterbi algorithm must produce both a probability and the most likely state sequence. It computes this sequence by keeping track of the path of hidden states that led to each state and, at the end, backtracking the best path to the beginning (the Viterbi backtrace).

A few attempts have previously been made to understand CNNs; most of them are gradient based.
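As a concrete illustration of the back-pointer idea borrowed from Viterbi decoding, the following toy implementation decodes a small HMM. All probabilities and names here are made up for illustration; only the back-pointer/backtrack structure matters for the analogy.

```python
import numpy as np

def viterbi(obs, trans, emit, init):
    """Viterbi decoding with back-pointers: a forward pass that records,
    for every state, which previous state led to it, then backtracks
    the best path from the end to the beginning."""
    T, S = len(obs), trans.shape[0]
    prob = np.zeros((T, S))              # best log-probability per state
    back = np.zeros((T, S), dtype=int)   # back-pointers
    prob[0] = np.log(init) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        for s in range(S):
            scores = prob[t - 1] + np.log(trans[:, s]) + np.log(emit[s, obs[t]])
            back[t, s] = int(np.argmax(scores))
            prob[t, s] = scores[back[t, s]]
    # backtrack the best path (the Viterbi backtrace)
    path = [int(np.argmax(prob[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The proposed method backtracks a CNN in the same spirit: rather than hidden states, it follows the nodes most responsible for each activation, from the output layer back to the input pixels.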
These gradient-based methods find the image regions that can improve the probability score for an output class. The work presented in [5] measures the sensitivity of the classification score for a class by tweaking the pixel values: it computes the partial derivative of the score in the pixel space and visualizes it. Another approach is visualization using deconvolution, as shown in [6]. The deconvolution approach visualizes the activation maps (filter maps) learned by the network. Some other works, such as [7], [8], [10], compute a relevance score for each feature with respect to a class. The idea is to see how the output is affected if a feature is dropped; the importance of the feature is based on the change in the score.

III. PROPOSED METHOD

This section discusses the proposed method and its algorithm in detail. Our work is based on the convolutional neural network, and hence understanding it is vital. Every CNN is made of common building blocks in the form of convolution, pooling, and fully connected layers. In this section we explain how we backtracked through these layers to determine discriminative image regions in the form of important pixels. No extra training is needed: we used a VGG19 network pre-trained on the ImageNet dataset, and our algorithm exploits the CNN's working at test time. Backtracking fully connected, convolution, and pooling layers is described in detail below.
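The three per-layer backtracking rules can be sketched as small NumPy helpers. All shapes, function names, and the top-n selection rule are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def backtrack_fc(W, A, node, n):
    """Fully connected layer: return the top-n present-layer nodes whose
    weighted activations contribute most to `node` in the layer above.
    W: (n_present, n_higher) weight matrix, A: present-layer activations."""
    contrib = W[:, node] * A                # element-wise contributions
    return np.argsort(contrib)[-n:].tolist()

def backtrack_conv(kernel, patch):
    """Convolution layer: kernel and patch are both (C, k, k). Pick the
    channel whose summed contribution is largest, then the strongest
    node inside that channel of the receptive field."""
    contrib = kernel * patch
    c = int(np.argmax(contrib.sum(axis=(1, 2))))
    x, y = np.unravel_index(int(np.argmax(contrib[c])), contrib[c].shape)
    return c, int(x), int(y)

def backtrack_pool(fmap, node, p=2):
    """Max pooling layer: fmap is (C, H, W); node = (channel, x, y) in
    the pooled layer. Return the strongest node inside the p-by-p
    receptive field in the previous layer."""
    c, x, y = node
    rf = fmap[c, x * p:(x + 1) * p, y * p:(y + 1) * p]
    dx, dy = np.unravel_index(int(np.argmax(rf)), rf.shape)
    return c, x * p + int(dx), y * p + int(dy)
```

Applied repeatedly from the predicted label back to the input, these rules yield the tree of important nodes that ends in pixel locations.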
A. Backtracking Fully Connected Layers
Typically, for recognition, the final layer is fully connected with a soft-max activation. We maintain two memory vectors: M1, which stores the locations of the important nodes in the succeeding layer, and M2, which stores the locations of the important nodes in the present layer. For every node N in the memory vector M1, we track the nodes in the present layer that are most responsible for activating it. For example, in the last layer, since we are interested in the predicted label, the memory vector M1 has only one node, corresponding to that label. We then store all the nodes in the fc2 layer of VGG19 that are most responsible (i.e., the top n nodes by contribution) for the activation of the node in M1. Fig. 3 shows the process of backtracking through the fully connected layers.

Fig. 2. After backtracking through all the fully connected layers, we end up with nodes in a 7x7x512-dimensional space. Projecting the top responsible node in that space onto the input image of size 224x224 results in the above visualization. The image is split into equal 7x7 boxes, and the box with the same coordinates as the node is selected to visualize the important region.

For ease of understanding Algorithm 1, let us assume that the memory vector of the succeeding layer has m nodes. Since the fc layers are one-dimensional, we store each location as a single number. But for the first fc layer, whose previous layer is a spatial layer, we need to store the location as a 3-tuple [filter, x, y], as shown in Fig. 3. This tuple can be visualized over an image by plotting the top responsible node on the image as a box. Further, we backtrack through the convolution and max-pooling layers up to the input image to plot the important pixels on the image.

Algorithm 1:
Backtracking through fully connected layers

M1: memory vector holding node locations from the higher layer
M2: empty memory vector to store locations of the present layer
m: number of nodes in M1

for i = 1 : m do
    W, b = weights and bias from this layer to node M1[i]
    A = activations in the present layer
    array = W * A + b
    M2.append(indices of the top n values in array)
end
M1 <- M2

B. Backtracking through Convolution Layers
As discussed above, upon reaching the first fc layer while backtracking, the next layer is a convolution or pooling layer. This subsection describes how we backtracked through these layers. Note that from here on, all the layers have 2D (in the case of pooling) or 3D (in the case of convolution) receptive fields. Also, we have stored our nodes in the first fc layer as 3-tuples.

As shown in Fig. 4, for each important node in the present layer l, we extract its receptive field from the previous layer l − 1. We then compute the product of the weights with the receptive field and check for the channel in layer l − 1 that has the maximum activation. To get the x and y coordinates, we take the node with the maximum activation within the channel extracted earlier. Note that once the product is computed, the result has the same shape as the receptive field in the previous layer. We sum over the x and y axes to find the maximum-activating channel; then, within that channel, we simply take the maximum-activating node. This location is stored in a memory vector that is sent back to the layers above. Algorithm 2 explains the above process.

Algorithm 2:
Backtracking through convolution layers

M1: memory vector holding node locations from the higher layer
M2: empty memory vector to store locations of the present layer

for i = 1 : m do
    W, b = weights and bias from this layer to node M1[i]
    A = activations in the present layer
    array = W * A + b
    C = sum(array, axes = x, y)
    channel = argmax(C)
    x, y = unravel_index(argmax(array[channel]))
    M2.append([channel, x, y])
end

C. Backtracking through Max Pooling Layers
As discussed earlier, max pooling has a 2D receptive field. As explained in Algorithm 3 and visualized in Fig. 5, for every important node in the present layer, we extract the receptive field in the previous layer and find the node with the maximum activation.

Note that we have essentially unwrapped the working of all the layers to backtrack the important pixels in the image. After passing the image through the CNN, we extract the activation maps and go backwards, unwrapping all the layer functions, until we finally reach the input layer. Visualizing the result gives the important pixels in the image, as shown in Fig. 9.

IV. APPLICATIONS

There are many applications of this method. For example, it can be used as a visual tool for the attention regions of a CNN in an image. It can be used for object detection, by drawing bounding boxes around the extrema of the important pixels, and for detecting saliency regions in an image. This can also be used for better training
Algorithm 3:
Backtracking through max pooling layers

rf(n): function that extracts the receptive field of node n
M1: memory vector holding node locations from the higher layer
M2: empty memory vector to store locations of the present layer

for i = 1 : m do
    A = activations in the present layer
    array = rf(M1[i])
    C = argmax(array)
    channel = M1[i][0]
    x, y = unravel_index(C)
    M2.append([channel, x, y])
end

Fig. 3. Backtracking through fully connected layers. The turquoise-colored nodes in layer L − 1 positively activate the node in layer L. These are the nodes that are stored in the memory M2.

Fig. 4. Backtracking through convolution layers. The turquoise-colored channel in layer L − 1 is the maximum-contributing channel for the activation of the turquoise-colored node in layer L. Further, the red-colored node is selected, as it is the maximum-contributing node from the selected channel. This channel and the coordinates of the red node are stored in the memory M2.

Fig. 5. Backtracking through max pooling layers. The turquoise-colored 2x2 nodal region in layer L − 1 is the receptive field of the max pooling operation for the node highlighted in layer L. Further, the red-colored node is the maximum-contributing node from the field. This channel and the coordinates of the red node are stored in the memory M2.

and understanding of CNNs. Note that the simplicity of the method makes the computational complexity very low for all the above-mentioned applications. We have done some detailed analysis on saliency detection.

A. Saliency Detection
After obtaining the important pixels in the image, as shown in Fig. 6, we draw a Gaussian around each pixel and threshold the result to obtain the saliency maps shown in Fig. 7. We ran our algorithm on the MSRA-B dataset; the results are shown in Table I. The attention region can also be visualized by not thresholding the Gaussians. The standard deviation of the Gaussian and the threshold value were chosen as the best among several random candidates. Some interesting observations were made:

• Even though the class of the object present in the image is not known to the CNN, it looks at the object at meaningful regions.

• The proposed method also works well on blurred images, as shown in Fig. 10(a).

• The attention region is not just the recognized object but also some background, as shown in Fig. 10(b). This explains why the precision values are lower than the recall values.
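The Gaussian-and-threshold step above can be sketched as follows. The `sigma` and `thresh` values here are placeholders, not the tuned values chosen in the paper:

```python
import numpy as np

def saliency_map(pixels, shape, sigma=10.0, thresh=0.5):
    """Place a Gaussian around every important pixel, sum them into a
    heat map (the attention region), normalize, and threshold to get a
    binary saliency map. `pixels` are (row, col) tuples."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]          # row and column index grids
    heat = np.zeros(shape, dtype=float)
    for px, py in pixels:
        heat += np.exp(-((xx - py) ** 2 + (yy - px) ** 2) / (2 * sigma ** 2))
    heat /= heat.max() + 1e-12           # normalize to [0, 1]
    return (heat >= thresh).astype(np.uint8)
```

Skipping the final threshold and visualizing `heat` directly gives the attention-region view shown in Fig. 8.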
TABLE I
RESULTS ON THE MSRA-B DATASET

Accuracy | Precision | Recall | F-Score | IOU
0.98     | 0.5       | 0.8    | 0.7     | 0.6
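For reference, the quantities reported in Table I can be computed from a predicted binary saliency map and the ground-truth mask in the standard way. This is our reading of the reported metrics, not the authors' evaluation code:

```python
import numpy as np

def saliency_metrics(pred, gt):
    """Accuracy, precision, recall, F-score, and IoU between a predicted
    binary saliency map `pred` and a ground-truth mask `gt`."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()          # true-positive pixels
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    f = 2 * precision * recall / max(precision + recall, 1e-12)
    iou = tp / max(np.logical_or(pred, gt).sum(), 1)
    accuracy = (pred == gt).mean()
    return accuracy, precision, recall, f, iou
```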
V. FUTURE WORK

In the future we would like to extend this algorithm to DenseNet [12] and ResNet [4]. We would also like to build an extension to architectures that store the best activations
Fig. 6. The important pixels tracked back after a forward pass through a VGG19 network.

Fig. 7. Saliency map derived for the picture shown above.

Fig. 8. Attention region of the image based on the important-pixel density, red being the region with the most attention.

Fig. 9. Results for the dragonfly. There is a pattern in where the CNN directs most of its attention: the head and the tail of the dragonfly.

Fig. 10. (a) Results on a blurred image of a car. (b) The background is also given some attention besides the class object, the car.

directly and give a quick response. We would also like to explore different applications, since the method's low complexity could benefit them.

VI. CONCLUSIONS

We present a simple method to visualize and understand how a CNN looks at an image, by backtracking all the operations of the CNN on an image. We have also shown that the salient parts of the image can be identified using this method. As shown in Fig. 1, we have also visualized the attention region in the image. We will explore other applications in the future and would like to make this algorithm faster and more generic. We have succeeded in understanding a magnificent tool's working in a simpler way and from a different point of view.

ACKNOWLEDGMENT

I would like to express my special gratitude to my guide, Dr. Deepak Mishra, as well as Dr. Vineeth B. S., who gave me the technical and moral support to carry out this project on the topic of computer vision, which also helped me gain a lot of knowledge and learn many new things; I am really thankful to them. I would also like to thank my parents and friends, who helped me a lot in finalizing this project within the limited time frame.