Behavior and performance of the deep belief networks on image classification

Abstract

We apply deep belief networks of restricted Boltzmann machines to bags of words of SIFT features obtained from the 13 Scenes, 15 Scenes and Caltech 256 databases and study experimentally their behavior and performance. We find that the final performance in the supervised phase is reached much faster if the system is pre-trained. Pre-training the system on a larger dataset while keeping the supervised dataset fixed improves the performance (for the 13 Scenes case). After the unsupervised pre-training, neurons arise that form approximate explicit representations for several categories (meaning they are mostly active for this category). The last three facts suggest that unsupervised training really discovers structure in these data. Pre-training can be done on a completely different dataset (we use the Corel dataset), and we find that the supervised phase performs just as well (on the 15 Scenes dataset). This leads us to conjecture that one can pre-train the system once (e.g. in a factory) and subsequently apply it to many supervised problems, which then learn much faster. The best performance is obtained with a single hidden layer system, suggesting that the histogram of SIFT features doesn't have much high level structure. The overall performance is almost equal to, but slightly worse than, that of the support vector machine and the spatial pyramidal matching.


arXiv [cs.NE], Dec

Karol Gregor
New York University, New York, NY, 10003 & California Institute of Technology, Pasadena, CA, 91125

Gregory Griffin
California Institute of Technology, Pasadena, CA, 91125


Deep belief networks (DBN) of restricted Boltzmann machines (RBM) [1], along with the back-propagation algorithm and the recently discovered layer-by-layer unsupervised learning algorithm of contrastive divergence [2], have recently become a popular topic of research in supervised and unsupervised classification because they promise several advantages. They have fast inference and fast unsupervised learning, are rather flexible, and are deep. One last advantage is the ability to encode richer, higher order structures [3], as in the hierarchical structure of the mammalian cerebral cortex.

The goal of this paper is to study the applicability of deep belief networks to visual image classification. To this effect we use popular image datasets: the 13 Scenes dataset, the newer 15 Scenes dataset, Caltech 256 and the Corel dataset. The DBNs are not applied directly to images. Instead, they are applied to histograms of frequencies of visual words, which are obtained by vector quantization of SIFT features [4, 5]. More generally, the image is broken up into 2x2 or 4x4 regions, and the words in each region are treated as different (with some smoothing function) to capture some spatial information; a histogram is then made of these. The code used for implementing and testing hierarchical restricted Boltzmann machines is that on Geoffrey Hinton's website, with appropriate modifications.

The architecture and the training are those described in [2]. The values in the input layer are real-valued normalized histograms of the visual words [4]. These are connected to 0, 1, 2 or 3 hidden layers. Finally the top layer containing the labels is added. The network is first trained in an unsupervised way (except for the 0 layer system), which is termed pre-training, using contrastive divergence. Subsequently the layer of labels is added and the network is trained using back-propagation.

We look at the 13 Scenes, 15 Scenes and Caltech 256 databases in turn, analyzing the 13 Scenes database most extensively. Apart from the dependence on a few parameters and the overall performance, we are especially interested in understanding the effect of the unsupervised phase. We discuss how fast the network learns during the supervised phase compared to the case without pre-training, whether it can take advantage of a large amount of unsupervised data, whether these data can be of a different nature than the labeled data, and what some of the actual neurons represent.
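To make the pre-training step concrete, here is a minimal sketch of one-step contrastive divergence (CD-1) for a single RBM layer, in the spirit of [2]. It is not the code used for the experiments (which was based on the code from Geoffrey Hinton's website). The function names, the learning rate, the full-batch updates and the treatment of the real-valued inputs in [0, 1] as visible-unit probabilities are all assumptions made for illustration; weight decay and momentum are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One CD-1 update on a batch v0 of shape (n_samples, n_visible)."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)
    # Negative phase: one Gibbs step down to the visibles and back up.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Update: data correlations minus reconstruction correlations.
    n = v0.shape[0]
    W = W + lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    b_vis = b_vis + lr * (v0 - v1_prob).mean(axis=0)
    b_hid = b_hid + lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid

def pretrain_layer(data, n_hidden, n_sweeps=5, seed=0):
    """Pre-train one layer (hypothetical helper); returns parameters and
    the hidden activations that would feed the next layer."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    b_vis = np.zeros(data.shape[1])
    b_hid = np.zeros(n_hidden)
    for _ in range(n_sweeps):  # one sweep = one pass over the sample
        W, b_vis, b_hid = cd1_step(data, W, b_vis, b_hid, rng)
    return W, b_vis, b_hid, sigmoid(data @ W + b_hid)
```

Stacking is then a matter of calling `pretrain_layer` again on the returned hidden activations, one layer at a time, before the label layer is added for back-propagation.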

In this section we use the database of 3859 images of 13 Scenes [5]. Every image is first processed in the following way. One extracts patches of some size N using a regular grid of spacing l. One then sub-samples the image by a factor of two and repeats. This is repeated until the image is small enough that only one patch can be extracted. The resulting patches are transformed into SIFT descriptors and are clustered using the K-means algorithm into 1000 visual words. One more word is added that specifies how many of the patches have overall variation below a certain threshold and were therefore excluded from the above clustering. Thus we start with a bag of 1001 words. The architecture and training are as described in the introduction, with the layer of labels containing 13 neurons.
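The construction of the 1001-word input described above can be sketched as follows. This is an illustration only: the function name and arguments are hypothetical, the cluster centres are assumed to come from a K-means run done elsewhere, and normalizing the extra "low variation" count together with the other bins is an assumption, since the text does not say how that word is scaled.

```python
import numpy as np

def bag_of_words(descriptors, centroids, n_low_variation):
    """Normalized histogram of visual words plus one extra bin.

    descriptors: (n_patches, dim) SIFT-like descriptors of the kept patches.
    centroids: (n_words, dim) cluster centres (1000 words in the text).
    n_low_variation: number of patches excluded for low variation.
    """
    # Squared Euclidean distance from every descriptor to every centre.
    # This builds an (n_patches, n_words, dim) array, so for many patches
    # it should be done in chunks.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest visual word
    counts = np.bincount(words, minlength=len(centroids)).astype(float)
    # Extra word: count of excluded low-variation patches (assumed scaling).
    counts = np.append(counts, float(n_low_variation))
    total = counts.sum()
    return counts / total if total > 0 else counts
```

With 1000 centres the returned vector has 1001 entries, matching the input layer of the 1001-500-13 network.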

[Figure 1 plot. Title: 13 Scenes, Ntrain=100. Vertical axis: test error. Curves: 1001-13; 1001-500-13; 1001-500-500-13; 1001-500-500-500-13; 1000-1000-13, rem; 1001-500-13, rand; 1001-500-500-13, rand; 1001-500-500-500-13, rand.]

Figure 1: Performance on the 13 Scenes database. The notation e.g. 1001-500-13 means: input of 1001 units - hidden layer 1 of 500 units - output of 13 units. The networks containing hidden layers were trained layer by layer using contrastive divergence for 800 iterations (sweeps through the training set). Then the networks were trained using back-propagation. The final performances of the networks are similar, but pre-trained networks have a much shorter supervised phase. The word rem means that the special visual word was removed. Also shown are the cases of hidden layer systems with no pre-training, starting with small random initial weights (rand).

The log-log plot of the dependence of performance on the number of iterations (number of sweeps through the training set) for various architectures is in Fig. 1. For the network with zero hidden layers (1001-13) there is no unsupervised phase, and for the networks with a positive number of hidden layers, each unsupervised pre-training consisted of 800 sweeps through the entire sample. The dashed lines show the performance of networks with hidden units without pre-training, starting with small random initial conditions.

First, we observe that the final performance is better for the deep network than for the zero hidden layer one, but only slightly. Next, we observe that once the system with hidden layers is pre-trained in the unsupervised way, it is much faster to obtain a good performance. We see that after a single iteration through the entire sample one gets around 25% error, something which took more than 500 sweeps to achieve with the zero layer network.

Let us look at the actual run times. The zero layer system saturates after about 1000-2000 iterations; 1000 iterations took about 43 minutes to run. In the 1001-500-13 network, to get as good a performance we need about 200 pre-training periods (see below). This took about 28 minutes. For the back-prop we need about 10 iterations, which took about 6 minutes (full back-prop for a single hidden layer). So the single hidden layer system took about 34 minutes to train.

Let us note that there are two basic ways of running the back-propagation: either on the entire network or only on the top layer. The latter is much faster (as fast per iteration as the zero layer system if the number of units in the final layer is the same as in the input). We find that for the system with one hidden layer, the performance of both methods is roughly the same as a function of iteration, but for more hidden layers the performance of the latter gets worse.

The network has a large number of parameters and overtrains very quickly. It reaches 100% performance on the training set after 25, 10 and 8 iterations for the networks with 1, 2 and 3 hidden layers respectively. Despite this, it performs well on the test set, and the test error does not start to increase over time (probably due to weight decay in the algorithm).

Next, it is interesting to look at the performance as a function of the number of pre-training iterations. This is shown in Fig. 2. We see that increasing this number helps, up to a certain point.
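The "top layer only" variant of back-propagation mentioned above amounts to training a linear classifier on the frozen activations of the last hidden layer. The sketch below shows this under assumptions that are not taken from the paper: a softmax output with cross-entropy loss, plain full-batch gradient descent, and illustrative values for the learning rate and weight decay. The function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_top_layer(features, labels, n_classes, n_sweeps=100,
                    lr=0.5, weight_decay=1e-4):
    """Train only the label layer on fixed hidden-layer features.

    features: (n_samples, n_hidden) activations of the last hidden layer.
    labels: (n_samples,) integer class indices.
    """
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    targets = np.eye(n_classes)[labels]  # one-hot targets
    for _ in range(n_sweeps):
        probs = softmax(features @ W + b)
        grad = probs - targets  # gradient of cross-entropy w.r.t. the logits
        W -= lr * (features.T @ grad / n + weight_decay * W)
        b -= lr * grad.mean(axis=0)
    return W, b
```

Because the lower layers are never touched, each sweep costs about as much as a sweep of a zero hidden layer network whose input size equals the width of the last hidden layer.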

[Figure 2 plot. Title: 13 Scenes, Ntrain=100. Vertical axis: error. Curves: 1001-500-13; 1001-500-500-13; 1001-500-500-500-13; 1001-500-500-2000-13.]

Figure 2: Performance as a function of the number of pre-training epochs. Increasing the latter first improves the performance, which then saturates.

We want a good learning system to learn about the world in an unsupervised way by discovering its structure, and we should provide supervision only part of the time. To test this principle, we pre-train the network on 100 examples from each category for 800 iterations. Then we run the supervised phase with a subset of this training set containing 1, 2, 4, 8, 16, 32 or 64 examples and test on the testing set of images not contained in the 100 examples per category used for pre-training. The results are shown in Fig. 3 (along with other Scenes results). We see that we indeed get better performance in this case, though the improvement is not impressive.

[Figure 3 plot. Title: Scene Database Performance Comparison. Vertical axis: error. Curves: 13S: SVM, 1000 words; 13S: DBN 1000-1000-13; 13S: DBN 1000-1000-13, pretr 100; 15S: SPM, 200 words; 15S: 2x2 DBN 4000-4000-15; 15S: 1x1 DBN 1000-1000-15.]

Figure 3: Results for the 13 and 15 Scenes databases. Notation: 13S (13 Scenes database), 15S (15 Scenes database), DBN (Deep belief network), SVM (Support vector machine), SPM (Spatial pyramidal matching), pretr 100 - network was pre-trained on 100 examples per category. The overall performance of the DBN was close to that of the other methods, just a little lower. Notice that pre-training on 100 examples improved the performance.

In the brain, the implicit information in the image is transformed into an explicit representation in the higher levels of the brain, meaning that, speaking in idealized terms, there are neurons which fire if and only if a given concept is present. In our example, this would correspond to finding neurons that are active if and only if a picture from a given category is present. We now look into whether, or how much, this is true.

We train the network in an unsupervised way as before. We consider a given layer, namely layer three. We choose a category. We present the images one by one to the network and look at the activities of each neuron. For every neuron we look at how well the activity of that neuron is able to distinguish the chosen category from the other categories. We look at the following quantity. Take a threshold, a real number between zero and one. For all the images from the chosen category we calculate the percentage of cases the neuron classifies as being from that category. For all the images not from the chosen category we calculate the percentage it classifies as not being from that category. We average the two. Then we maximize over the thresholds. We call this the performance parameter. Then we choose the neuron with the best performance parameter. We do this for each category. The results are in Fig. 4, with the description in the caption. Actually, for a few categories we picked the neurons by hand if the activities looked sharper, but the results are similar.
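The performance parameter defined above can be computed directly. The following sketch is one possible reading of the definition: it assumes that an image is classified as belonging to the category when the neuron's activity is above the threshold, and it searches only thresholds lying between consecutive observed activity values (plus 0 and 1), which is enough because the score changes only there. The function names are hypothetical, and both the category and its complement are assumed to be non-empty.

```python
import numpy as np

def performance_parameter(activity, in_category):
    """Return (best score, best threshold) for one neuron.

    activity: activities in [0, 1], one per image.
    in_category: booleans, True where the image is from the chosen category.
    """
    activity = np.asarray(activity, dtype=float)
    in_category = np.asarray(in_category, dtype=bool)
    values = np.unique(activity)  # sorted distinct activities
    candidates = np.concatenate(([0.0], (values[:-1] + values[1:]) / 2.0, [1.0]))
    best_score, best_threshold = -1.0, None
    for t in candidates:
        predicted = activity > t  # assumed convention: active means "in category"
        hit_rate = predicted[in_category].mean()
        reject_rate = (~predicted[~in_category]).mean()
        score = 0.5 * (hit_rate + reject_rate)  # average of the two percentages
        if score > best_score:
            best_score, best_threshold = score, t
    return best_score, best_threshold

def best_neuron(activities, in_category):
    """Index and score of the best neuron; activities is (n_images, n_neurons)."""
    scores = [performance_parameter(activities[:, j], in_category)[0]
              for j in range(activities.shape[1])]
    return int(np.argmax(scores)), max(scores)
```

A score of 0.5 corresponds to a neuron that carries no information about the category, and 1.0 to a perfect explicit representation.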

[Figure 4 plot. Title: 13 Scenes, Selected Neurons of the third layer. Vertical axis: Activation (Firing Rate).]

Figure 4: Neurons closest to explicitly representing a given category. Each pair of axes shows the activity of a particular neuron. They were chosen by hand (but close to the prescription in the text) to best represent each category (in order from left to right, from top to bottom). The blue curve is the average activity over all images in a given category (on the x-axis). The number is the performance parameter as defined in the text. The red line is the best threshold, as defined in the text. Each green dot is the activity for a particular image (spread randomly in the interval (-0.1, 0.1) around each point to give an idea of the density). The neuron for category 13 is not shown.

We see that neurons [...] are relatively nice explicit representations. Neuron [...] is not able to distinguish categories [...] but distinguishes three categories from the rest. Similarly, the neurons [...] are not able to distinguish their categories from one another but distinguish them from the rest. This is reasonable since they are all indoor rooms and are more similar to each other than to the various outdoor scenes. The best activities at other layers were somewhat similar, but the third layer happens to give the best results.

A priori the network is symmetric under flipping the activity of a neuron when we flip the appropriate connections. For this reason we flipped neurons with average activity more than [...]. However, the problem is not fully symmetric due to weight decay and initial conditions, and the actual percentage of such neurons as a function of layer number is [...].

The explicit representations emerging in the network are nice only if the same cannot be said about the data itself. Now the data are small (∼ [...]) real numbers, so we cannot look at average activities because they can be swamped by a few images having large values. However, we can look at the threshold defined above, which lets us distinguish categories. The comparison is shown in Figure 5. We see that the performance as measured by this parameter is somewhat better for the third layer. In either case, it is nice that there are neurons from which we can read off whether the picture belongs to a given category or not by just looking at the overall activity. One also naturally asks, since we are picking the best neurons, whether random connections wouldn't also produce something that looks like the above graph. We checked that this is not so: the graphs are much more uniform.

[Figure 5 plot. Vertical axis: performance parameter. Series: layer 0; layer 3.]

Figure 5: The performance parameter as defined in the text. The third layer neurons distinguish categories better than the input units do.

Figure 3 shows a comparison between the performance obtained here and that of the support vector machine (SVM). We see that the DBN performance is close but not quite equal to this. The 15 Scenes results are explained in the next section.

The data set in this section is the same as above with two scenes added. There are only historical reasons for having these two. There is also some difference in extracting the visual words. The patches were chosen using a different grid and they were clustered using a mixture of Gaussians rather than K-means. We study the system for 200 and 1000 visual words (clusters). We also break up the image into 4 regions on a 2x2 grid and treat the words in each region as different (we also use some smoothing function). The results are shown in Figure 6.

[Figure 6 plot. Title: 15 Scenes, Ntrain=100. Vertical axis: error. Curves: 1000-15; 200-15; 1000-500-15; 200-500-15; 1000-500-500-15; 200-500-500-15; 1000-500-500-500-15; 200-500-500-500-15; 1000-1000-15; 200-1000-15; 2x2: 4000-4000-15 (top); 200-1000-15 (rand init).]

Figure 6: The performance on the 15 Scenes database. For a description see the caption of Fig. 1.

We see that the 1000 words cases do better than the 200 ones and the 2x2 grid case does better than the rest. For the simple bag of words, the one hidden layer, 1000-1000-15 system does the best, and adding more layers worsens the performance. Again we see that after pre-training, as in the 13 Scenes case, a good performance is reached after many fewer iterations than without pre-training, though longer than in that case.

Next we compare the performance of the system for various methods and various sizes of the training set. We calculate the performance for the case when the same training set is used for both pre-training and back-prop; then for the case where pre-training is done on 100 images per category and back-prop on a smaller number of them; then for the case where we use back-prop only, starting with small initial conditions. Finally, we test whether the network can learn in an unsupervised way from one dataset and then train in a supervised way using another dataset. For this purpose, the Corel dataset and the 15 Scenes dataset were used respectively.

The results for these four ways are shown in Fig. 7. We see that in this case, no matter which method is used, the final result is the same. For the performance as a function of the number of iterations, all the methods with pre-training behaved roughly the same way. The case without pre-training was much slower. To find a case in between, we also used [...] of Corel images, and in this case the performance increased a bit more slowly than in the previous pre-trained cases. The final results for the second and third layer are similar, but slightly worse. The Corel pre-training is particularly interesting because it suggests that we can pre-train the system once on one dataset (e.g. one can buy a factory pre-trained product) and then use fast supervised training for various different problems. However, more tests are needed to establish this conjecture.

[Figure 7 plot. Title: 15 Scenes. Vertical axis: error. Curves: 1-layer; 1-layer, pretrained on 100; 1-layer, pretrained on Corel; 1-layer, small initial rand weights.]

Figure 7: Comparison of different methods for the one hidden layer system on 200 words. In the first set the system is pre-trained on the same set as used for back-prop, in the second it was pre-trained on 100 images per category, in the third on the Corel database, and in the fourth there was no pre-training (note: the last case for sizes 64 and 100 was obtained after 500 and 5000 iterations resp.).

It is not quite clear why for the 15 Scenes, as opposed to the 13 Scenes, all these ways of running the problem give the same final performance. One reason might be that these experiments are done on 200 words instead of 1000. Another might be that somehow there is not as much structure in these data for unsupervised training to discover, and one really needs the labels to separate the data.

Finally we compare the performance to the spatial pyramidal matching (SPM) [6], Fig. 3. The SPM takes advantage of spatial information. We see it performs better than the simple bag of words, but roughly the same as the 2x2 system. On the other hand, the 2x2 system used 1000 visual words (quantization points) and SPM only 200. It would be interesting to see whether 1000 words would push the performance of SPM higher, considering the fact that increasing the number of words for the DBN helps.

We performed the basic analysis also on Caltech 256. The features were extracted the same way as for the 15 Scenes database. We applied the RBMs to a bag of 200 words and also to a bag of 3200 words obtained by breaking the image up into a 4x4 grid. To save space, we only show the latter, Fig. 8. The one hidden layer system does the best, and adding more hidden layers worsens the performance.

[Figure 8 plot. Title: Caltech 256, 4x4 200 words, Ntrain=64. Vertical axis: error. Curves: 3200-256; 3200-500-256; 3200-1000-256; 3200-1000-256, top layer only; 3200-3200-256; 3200-3200-256, top layer only; 3200-3200-3200-256, top layer only.]

Figure 8: The performance on the Caltech 256 bag of words obtained by breaking up every image into 4x4 pieces and treating the same words in different regions as different.

Let us discuss the running times, starting with the 1x1 case, selecting the appropriate numbers. The 3000 iterations of the zero layer system took about 32 hours. 400 pre-training iterations for the one layer system took about 2 hours and 300 back-propagation periods about 15 hours. For the 4x4 case, the 8000 iterations of the zero layer system took 24 days. 200 pre-training iterations for the one layer system took 21 hours and 300 back-propagation periods of the top layer took about 29 hours.

Finally we compare the performance to the other methods. This is shown in Fig. 9. We see that for the bag of words the SVM gives about the same results, maybe slightly better than ours. For the 4x4 case, our result is not very good. However, we didn't spend much time improving or analyzing the performance in this case. The graph also shows the much more impressive result by Varma (this is not applied to just the simple bag of words that we use).
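The 4x4 bag of words used in this section (200 words in each of 16 cells, giving 3200 inputs) can be sketched as below; the same construction with a 2x2 grid and 1000 words gives the 4000 inputs of the 15 Scenes case. The function and argument names are hypothetical, patch positions are assumed to be given as pixel coordinates of the patch centres, and the smoothing function mentioned in the text is not specified there and is left out.

```python
import numpy as np

def spatial_bag_of_words(words, xs, ys, width, height, n_words, grid=4):
    """Concatenate one visual-word histogram per grid cell (no smoothing).

    words: integer visual-word index of each patch.
    xs, ys: patch centre coordinates, 0 <= x < width and 0 <= y < height.
    """
    words = np.asarray(words)
    # Grid cell of each patch; clip so points on the far edge stay inside.
    col = np.minimum((np.asarray(xs) * grid / width).astype(int), grid - 1)
    row = np.minimum((np.asarray(ys) * grid / height).astype(int), grid - 1)
    # The same word in a different cell counts as a different word.
    flat = (row * grid + col) * n_words + words
    hist = np.bincount(flat, minlength=grid * grid * n_words).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

The input dimension grows as `grid * grid * n_words`, which is consistent with the much longer running times reported above for the 4x4 case.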

[Figure 9 plot. Title: Caltech 256 Performance Comparison. Horizontal axis: Ntrain (10 to 64). Vertical axis: error (0.3 to 0.9). Curves: SVM 1x1; SVM 4x4; SPM (2006); M. Varma (2007); DBN 1x1; DBN 4x4.]

Figure 9: Comparison of performance for different methods. For notation see caption of Fig. 3.

Unsupervised learning discovers structure in images. This is strongly suggested by our experimental observations of a short supervised training phase, the advantage of having a lot of unsupervised data, the ability to learn from a different dataset than the one with labels, and the approximate emergence of neurons explicitly representing certain categories. However, for these data the effects are rather weak. Adding hidden layers doesn't improve the performance. This is probably due to the fact that histograms of quantized SIFT features either don't have very high level structure (which doesn't sound surprising) or don't have the kind of structure these networks can discover. This suggests that further research should be devoted to developing richer representations of image content, of a form that may be processed by an RBM-like network. Finally, as a classifier, the DBNs do quite well, but in this case slightly worse than the standard methods of SVM or SPM. Since SVM has been pushed to perfection among shallow architectures, this should not be discouraging, and one should focus on problems where high level structure is needed.

We thank Pietro Perona and Yoshua Bengio for useful discussions, Yevgeni Bart for the 13 Scenes data and the Caltech computer vision cluster for computing resources.

References

[1] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Volume 1: Foundations, pages 194-281. MIT Press, Cambridge, MA.
[2] Geoffrey E. Hinton, Simon Osindero and Yee-Whye Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation 18, 1527-1554 (2006).
[3] Yoshua Bengio, Learning deep architectures for AI, technical report number 1312.
[4] Gyuri Dorko, Cordelia Schmid: Selection of Scale-Invariant Parts for Object Class Recognition. ICCV 2003: 634-640.
[5] Fei-Fei Li, Pietro Perona, California Institute of Technology: A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005: 524-531.
[6] S. Lazebnik, C. Schmid, and J. Ponce: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, June 2006, vol. II, pp. 2169-2178.