Object-Scene Convolutional Neural Networks for Event Recognition in Images
Limin Wang, Zhe Wang, Wenbin Du, Yu Qiao
Department of Information Engineering, The Chinese University of Hong Kong
Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, CAS, China
Abstract
Event recognition from still images is of great importance for image understanding. However, compared with event recognition in videos, much less research has addressed event recognition in images. This paper addresses the problem of event recognition from still images and proposes an effective method based on deep neural networks. Specifically, we design a new architecture, called the Object-Scene Convolutional Neural Network (OS-CNN). This architecture is decomposed into an object net and a scene net, which extract useful information for event understanding from the perspectives of objects and scene context, respectively. Meanwhile, we investigate different network architectures for the OS-CNN design, and adapt deep (AlexNet) and very-deep (GoogLeNet) networks to the task of event recognition. Furthermore, we find that the deep and very-deep networks are complementary to each other. Finally, based on the proposed OS-CNN and a comparative study of different network architectures, we come up with a five-stream CNN solution for the cultural event recognition track of the ChaLearn Looking at People (LAP) challenge 2015. Our method achieves the best performance and ranks first in this challenge.
1. Introduction
Event recognition from still images is one of the challenging problems in computer vision research. While many efforts have been devoted to the problem of video-based event and action recognition [8, 10, 12, 13, 14, 15], much less research has addressed image-based event recognition [6, 17]. Compared with images, videos provide more useful information for event understanding, such as motion, in addition to static appearance. Therefore, event recognition from still images poses more challenges than from videos. Meanwhile, the concept of an event itself is extremely complex, as the characterization of an event is related to many factors, including objects, human poses, human garments, scene categories, and other context. Therefore, event recognition is highly related to other high-level computer vision problems, such as object recognition [5] and scene recognition [19]. In this paper, we propose an effective method for the cultural event recognition track of the ChaLearn Looking at People (LAP) challenge 2015 [1], which achieves the best performance and ranks first in this challenge.

Specifically, we propose a new architecture for event recognition, called the Object-Scene Convolutional Neural Network (OS-CNN), which extracts the important visual cues of both objects and scene for event understanding. The OS-CNN is decomposed into two separate nets, namely an object net and a scene net. The object net aggregates important information for recognizing events from the perspective of objects, while the scene net performs event recognition with the help of scene context. The cues of the contained objects and the scene context provide complementary information for event understanding from still images. The recognition results of the object net and the scene net are combined by late fusion.
Decoupling the object and scene nets also allows us to exploit the availability of large amounts of annotated image data by pre-training the object net on the ImageNet challenge dataset [3] and the scene net on the Places dataset [19].

Meanwhile, there are many famous and successful network architectures for CNNs, such as AlexNet [5], ClarifaiNet [18], GoogLeNet [11], and VGGNet [9]. These architectures have proved to be effective for object and scene recognition, and have obtained state-of-the-art performance on the ImageNet and Places datasets [7, 19]. However, their performance on event recognition and the complementarity among them have not been fully explored before. In our proposed OS-CNN, we exploit these successful deep architectures for event recognition, and further boost the recognition performance by using an ensemble of them. Finally, based on our OS-CNN and a comparative study of different network architectures, we come up with a five-stream CNN solution for the ChaLearn LAP challenge 2015.

The rest of this paper is organized as follows. In Section 2, we describe the technical details of our OS-CNN. We then provide the implementation details and experimental results in Section 3. Finally, we conclude our method and present future work in Section 4.

Figure 1. The architecture of the Object-Scene Convolutional Neural Network (OS-CNN) for event recognition. We pre-train the object CNN on the ImageNet dataset and the scene CNN on the Places dataset. Note that we choose the ClarifaiNet architecture for the CNN in this illustration; in practice, we may choose other architectures for both the object and scene CNN and even fuse multiple different architectures.
2. Object-Scene CNNs
Event understanding is highly related to two other high-level computer vision problems: object and scene recognition. Therefore, we utilize two separate components for event recognition. The object stream, pre-trained on a large object dataset (ImageNet), carries information about the objects depicted in the image. The scene stream, pre-trained on a large scene dataset (Places), captures patterns in the scene context of the image. We design our event recognition architecture accordingly and propose a new network architecture, called the Object-Scene CNN (OS-CNN), as shown in Figure 1. Each CNN is pre-trained on its own dataset and fine-tuned for event recognition on the target dataset. We use late fusion to combine the scores of the two separate CNNs.
We hope that the object net is able to capture useful information about objects to help event recognition. As the object net essentially deals with object cues, we build it with the help of recent advances in large-scale image recognition [5], and pre-train the network on a large image classification dataset, such as the ImageNet dataset [3]. Specifically, we first choose the ClarifaiNet architecture [18] and use the pre-trained model from [2]. Then, we fine-tune the model parameters for the task of event recognition on the training dataset provided by the challenge organizers. The details of the network architecture can be found in the original paper [18], and the details of fine-tuning the network parameters can be found in Section 3. Next, we describe the scene net, which exploits scene information for event recognition.
The scene net is expected to extract scene information from the image to help conduct event recognition. Hence, the scene net is designed to handle scene context, and we resort to recent advances on the problem of scene recognition. The Places dataset [19] (http://places.csail.mit.edu/) is a recent large dataset that contains a large number of scene categories with millions of images. Specifically, we first use the pre-trained model from [19], which adopts the famous AlexNet architecture [5]. Similar to the object net, we then fine-tune the model parameters on the training dataset of the cultural event recognition challenge. The details of the network architecture can be found in the original paper [5], and the details of fine-tuning the network parameters can be found in Section 3.

Based on the above analysis, the recognition of events is highly related to the concepts of object and scene. Therefore, we expect that the prediction results of the object and scene nets are complementary to each other, and combine them using late fusion as follows:

s(I) = α_o s_o(I) + α_s s_s(I),   (1)

where I is the input image, s_o(I) and s_s(I) are the prediction scores of the object and scene nets, and α_o and α_s are the fusion weights of the object and scene nets. In the current implementation, these two fusion weights are equal.

In the past several years, several successful deep CNN architectures have been designed for the task of object recognition at the ImageNet Large Scale Visual Recognition Challenge [7]. These architectures can be roughly classified into two categories: (i) deep CNNs, including AlexNet [5] and ClarifaiNet [18], and (ii) very-deep CNNs, including GoogLeNet [11] and VGGNet [9]. The deep CNN architectures usually contain convolutional layers and fully-connected layers, as shown in Figure 1. The very-deep CNN architectures resort to extremely deep structures with smaller initial filter sizes or a newly designed Inception module in a network-in-network manner.
Previous studies show that deeper networks obtain better performance on the task of object recognition. However, their performance on event recognition remains unknown. In this subsection, we exploit these very-deep networks in our proposed Object-Scene CNN architecture and aim to verify the superior performance of deeper structures. Specifically, we choose the GoogLeNet architecture for both the object and scene nets. GoogLeNet is a 22-layer very-deep network based on a newly designed module, codenamed Inception. To optimize performance, its architectural decisions are based on the Hebbian principle and the intuition of multi-scale processing. The details of the GoogLeNet architecture can be found in [11]. We use the GoogLeNet model released in the Caffe Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo) to initialize the object net. For the scene CNN, we utilize the pre-trained model released with the technical report [16] (http://vision.princeton.edu/pvt/GoogLeNet/).

We also study the complementarity of convolutional neural networks with different architectures. We combine the prediction results of the deep OS-CNNs with those of the very-deep OS-CNNs as follows:

s_x(I) = β_d s_x^d(I) + β_{v-d} s_x^{v-d}(I),   (2)

where x ∈ {o, s} denotes the object net or the scene net, s_x^d(I) and s_x^{v-d}(I) are the scores of the deep and very-deep CNNs, and β_d and β_{v-d} are their fusion weights. Although the very-deep OS-CNN outperforms the deep OS-CNN, the combination of the two is still able to further boost the recognition performance.
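The two fusion steps in Eq. (2) and Eq. (1) can be sketched in a few lines of NumPy. The score vectors, class count, and the β weights below are illustrative assumptions (the paper's exact β values are not reproduced here); Eq. (2) merges the deep and very-deep scores within each stream, and Eq. (1) then merges the two streams with equal weights, as in the paper's implementation.

```python
import numpy as np

def fuse_pair(a, b, w_a, w_b):
    """Weighted score fusion, the operation shared by Eq. (1) and Eq. (2)."""
    return w_a * np.asarray(a) + w_b * np.asarray(b)

# Hypothetical per-architecture scores over 3 event classes.
scores = {
    ("object", "deep"):      np.array([0.2, 0.5, 0.3]),
    ("object", "very_deep"): np.array([0.1, 0.7, 0.2]),
    ("scene", "deep"):       np.array([0.3, 0.4, 0.3]),
    ("scene", "very_deep"):  np.array([0.2, 0.5, 0.3]),
}

beta_d, beta_vd = 0.4, 0.6  # example Eq. (2) weights (assumed, not from the paper)

# Eq. (2): per-stream fusion of deep and very-deep scores.
s_object = fuse_pair(scores[("object", "deep")], scores[("object", "very_deep")], beta_d, beta_vd)
s_scene = fuse_pair(scores[("scene", "deep")], scores[("scene", "very_deep")], beta_d, beta_vd)

# Eq. (1): object/scene fusion with equal weights.
s_final = fuse_pair(s_object, s_scene, 0.5, 0.5)
prediction = int(np.argmax(s_final))
```

Because all the weights at each level sum to one, the fused scores stay on the same scale as the per-network scores.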
3. Experiments
In this section, we first describe the dataset of the cultural event recognition track at the ChaLearn LAP challenge 2015. Then we give a detailed description of the implementation details of training OS-CNNs on the event recognition dataset provided by the challenge organizers. Finally, we present and analyze the experimental results of the proposed OS-CNNs on the dataset of the ChaLearn LAP challenge 2015.
Cultural event recognition is a new task at the ChaLearn LAP challenge 2015. This task provides an event recognition dataset composed of images collected from two image search engines (Google Images and Bing Images). The dataset covers important cultural events from around the world, and some sample images are shown in Figure 2. From these images, we see that garments, human poses, objects, and scene context constitute the possible cues to be exploited for recognizing the events. The dataset is divided into three parts: development data, validation data, and evaluation data. During the development phase, we train our model on the development data and verify its performance on the validation data. For the final evaluation, we merge the development and validation data into a single training set and re-train our model. Our final submission results to the challenge are obtained with the re-trained model. The principal quantitative measure is based on the precision/recall curve: the area under this curve, computed by numerical integration, gives the average precision (AP).

The training procedure of the OS-CNNs is implemented using the well-known Caffe toolbox [4]. Although the cultural event recognition dataset provides a substantial number of training images, its size is relatively small compared with the ImageNet dataset [3]. Therefore, we choose to pre-train our models on two large datasets: the ImageNet dataset for the object net and the Places dataset for the scene net, as described in Section 2. In order to make the deep-learned features more discriminative for the task of event recognition, we then fine-tune the network parameters on the cultural event recognition dataset. The network weights are learned using mini-batch stochastic gradient descent with momentum (set to 0.9). At each iteration, a mini-batch of samples is constructed by random sampling.
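The challenge's average precision measure (area under the precision/recall curve, computed by numerical integration) can be sketched as below. This is a generic trapezoidal-integration implementation, not the organizers' exact evaluation script, and the label/score arrays are illustrative.

```python
import numpy as np

def average_precision(labels, scores):
    """Area under the precision/recall curve via trapezoidal integration."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by score, descending
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                                # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / max(labels.sum(), 1)
    # Prepend the curve start at recall 0 and integrate numerically.
    x = np.r_[0.0, recall]
    y = np.r_[precision[:1], precision]
    return float(np.sum(0.5 * (x[1:] - x[:-1]) * (y[1:] + y[:-1])))
```

A perfect ranking, with every positive image scored above every negative one, yields an AP of 1.0; interleaving positives and negatives lowers the area under the curve.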
During the training phase, all images are resized to a fixed size, and a sub-image is randomly cropped from each image, with the crop size depending on the network architecture. The crops are then subjected to random horizontal flipping. Dropout is applied to the fully-connected layers. To overcome over-fitting, we set the learning rate of the hidden layers to a fraction of that of the final layer. The learning rate is decreased according to a fixed schedule: it is reduced after 1.4K iterations and again after 2.8K iterations, and training stops at 4.2K iterations.

During the testing phase, we resort to a multi-view voting method [5] to classify each image. As in the training procedure, we resize each testing image to a fixed size. For each CNN, we obtain the inputs by cropping the four corners and the center of the image and horizontally flipping these crops (10 views in total). The score of a CNN for an image is obtained by averaging the scores across these crops. The scores from the multiple object and scene nets are combined using late fusion.

Figure 2. Samples from the cultural event recognition dataset at the ChaLearn LAP challenge 2015. The dataset covers important cultural events in the world, including the Annual Buffalo Roundup (USA), the Battle of the Oranges (Italy), Chinese New Year (China), the Notting Hill Carnival (UK), Obon (Japan), and more. All images were collected from the Internet using the Google and Bing search engines. These images exhibit large intra-class variations and are very challenging for event recognition.

Effectiveness of OS-CNN.
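The multi-view voting scheme above can be sketched as follows. The 256×256 input size and 224×224 crop size are illustrative choices (the paper's exact sizes are not reproduced here), and the scoring function is a stand-in for a real CNN forward pass.

```python
import numpy as np

def ten_crop(image, crop=224):
    """Return the four corner crops, the center crop, and their horizontal flips."""
    h, w = image.shape[:2]
    c = crop
    corners = [(0, 0), (0, w - c), (h - c, 0), (h - c, w - c),
               ((h - c) // 2, (w - c) // 2)]          # 4 corners + center
    views = [image[y:y + c, x:x + c] for y, x in corners]
    views += [v[:, ::-1] for v in views]              # horizontal flips
    return views

def multi_view_score(image, score_fn, crop=224):
    """Average a CNN's class scores over the 10 views of one image."""
    views = ten_crop(image, crop)
    return np.mean([score_fn(v) for v in views], axis=0)

# Toy usage: a dummy "CNN" that scores by mean intensity per channel.
img = np.random.rand(256, 256, 3)
dummy_cnn = lambda v: v.mean(axis=(0, 1))             # 3 "class" scores
scores = multi_view_score(img, dummy_cnn)
```

Averaging over the ten views makes the final score less sensitive to where the event cues fall inside the frame.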
First, we measure the performance of the separate object and scene nets. Three scenarios are considered: (i) using only the object net, (ii) using only the scene net, and (iii) using the full OS-CNN. For each setting, we use the deep network architectures: ClarifaiNet for the object net and AlexNet for the scene net. The results are shown in Figure 4. From these results, the object net outperforms the scene net for the task of event recognition. It is also clear that the fusion of the object and scene nets further improves the performance. This result indicates that complementary properties exist between the object and scene nets for event recognition.

In order to further investigate this complementarity, we visualize the filters of the first convolutional layer of the object and scene nets in Figure 3. There are 96 filters in the first convolutional layer of each net. We observe that both nets learn some common filters, indicated by the blue box. Meanwhile, some filters, indicated by the red boxes, are learned by only a single net. Therefore, the object net and the scene net may capture common patterns such as edges, but they also extract complementary information with different filters.

Evaluation of different architectures.
Second, we investigate the performance of CNNs with different architectures and design three settings: (i) a CNN with a deep architecture (AlexNet or ClarifaiNet), (ii) a CNN with a very-deep architecture (GoogLeNet), and (iii) a combination of the deep and very-deep architectures. We conduct this comparative study for both the object and scene nets. The results of the object net and the scene net are shown in Figure 5 and Figure 6, respectively. From these results, it is clear that the deeper architecture obtains better performance for event recognition, for both the object net and the scene net, which agrees with the findings in object recognition. The very-deep architecture outperforms the deep architecture. At the same time, we observe that the fusion of the different architectures helps to further boost the recognition performance.

Table 1. Comparison of the performance of our five-stream CNN with that of the other teams. Our result is significantly better than the others.

Rank  Team
1     MMLAB (Ours)

Challenge approach and results.
Based on the numerical evaluation and analysis above, we conclude that (i) the object net is better than the scene net, (ii) the very-deep architecture outperforms the deep architecture, and (iii) the fusion of multiple CNNs exploiting different visual cues (object and scene) with different architectures (deep and very-deep) contributes to performance improvement. Hence, we introduce another object net with a very-deep architecture into our OS-CNN framework: we pre-train a 19-layer VGGNet on the ImageNet dataset and fine-tune the network weights on the training dataset of cultural event recognition. In total, our challenge solution is composed of five-stream CNNs pre-trained on different datasets (ImageNet or Places) and equipped with different network architectures. The challenge results are shown in Table 1. We see that our method obtains the best performance and significantly outperforms the second place.
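The final five-stream prediction can be sketched as a weighted average of the per-stream scores. The stream names, class count, and equal weights below are illustrative assumptions; the paper's exact fusion weights are not reproduced here.

```python
import numpy as np

# Hypothetical per-stream scores over 3 event classes.
streams = {
    "object_clarifainet": np.array([0.2, 0.5, 0.3]),
    "object_googlenet":   np.array([0.1, 0.6, 0.3]),
    "object_vggnet":      np.array([0.2, 0.6, 0.2]),
    "scene_alexnet":      np.array([0.3, 0.4, 0.3]),
    "scene_googlenet":    np.array([0.2, 0.5, 0.3]),
}

def fuse_streams(stream_scores, weights=None):
    """Late-fuse the stream scores with per-stream weights (equal by default)."""
    names = sorted(stream_scores)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}
    return sum(weights[n] * stream_scores[n] for n in names)

final = fuse_streams(streams)
prediction = int(np.argmax(final))
```

Keeping the fusion as a plain weighted sum makes it easy to drop in further streams, or to re-tune the weights on validation data.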
4. Conclusions
This paper has presented an effective method for cultural event recognition from still images. We utilize deep CNNs for this task and propose a new architecture, called the Object-Scene Convolutional Neural Network (OS-CNN). This architecture is decomposed into an object net and a scene net, which extract useful information for event understanding from the perspectives of objects and scene context, respectively. Meanwhile, we consider different network structures for the OS-CNN and conduct a comparative study of deep and very-deep CNNs for event recognition. We show that a deeper architecture is also helpful for the task of event recognition from still images, and that the combination of different architectures is able to boost performance. In practice, based on our proposed OS-CNN and comparative study, we design a five-stream CNN for the cultural event recognition track of the ChaLearn LAP challenge 2015. In the future, we may consider jointly optimizing the object and scene nets and incorporating more visual cues for event understanding.
Acknowledgement
This work is supported by a donation of a Tesla K40 GPU from the NVIDIA Corporation. Limin Wang is supported by the Hong Kong PhD Fellowship. Yu Qiao is supported by the National Natural Science Foundation of China (91320101, 61472410), the Shenzhen Basic Research Program (JCYJ20120903092050890, JCYJ20120617114614438, JCYJ20130402113127496), the 100 Talents Program of CAS, and the Guangdong Innovative Research Team Program (No. 201001D0104648280).
References

[1] X. Baró, J. González, J. Fabian, M. A. Bautista, M. Oliu, I. Guyon, H. J. Escalante, and S. Escalera. ChaLearn Looking at People 2015 CVPR challenges and results: action spotting and cultural event recognition. In CVPR, ChaLearn Looking at People workshop, 2015.
[2] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, pages 1–12, 2014.
[3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[6] L. Li and F. Li. What, where and who? Classifying events by scene and object recognition. In ICCV, pages 1–8, 2007.
[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge, 2014.
[8] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
[9] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[10] C. Sun and R. Nevatia. ACTIVE: activity concept transitions in video event classification. In ICCV, pages 913–920, 2013.
[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[12] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, pages 3551–3558, 2013.
[13] L. Wang, Y. Qiao, and X. Tang. Mining motion atoms and phrases for complex action recognition. In ICCV, pages 2680–2687, 2013.
[14] L. Wang, Y. Qiao, and X. Tang. Motionlets: Mid-level 3D parts for human motion recognition. In CVPR, pages 2674–2681, 2013.
[15] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 1–10, 2015.
[16] Z. Wu, Y. Zhang, F. Yu, and J. Xiao. A GPU implementation of GoogLeNet. Technical report, 2014.
[17] Y. Xiong, K. Zhu, D. Lin, and X. Tang. Recognize complex events from static images by fusing deep channels. In CVPR, pages 1–10, 2015.
[18] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[19] B. Zhou, À. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In NIPS, pages 487–495, 2014.

Figure 3. The filters learned in the first layer of the object net and the scene net. The blue box indicates similar filters shared by the two CNNs, and the red boxes denote filters learned by only a single CNN.
Figure 4. Results of the object net (o-cnn), the scene net (s-cnn), and the OS-CNN. We plot the average precision (AP) values for the 50 classes, and the last column indicates the mean AP (mAP) over these classes.

Figure 5. Results of the object net using different architectures (deep CNN, very-deep CNN, and their combination). We plot the average precision (AP) values for the 50 classes, and the last column indicates the mean AP (mAP) over these classes.

Figure 6. Results of the scene net using different architectures (deep CNN, very-deep CNN, and their combination). We plot the average precision (AP) values for the 50 classes, and the last column indicates the mean AP (mAP) over these classes.