Kernelized Deep Convolutional Neural Network for Describing Complex Images
Zhen Liu
University of Science and Technology of China
[email protected]
Abstract
With the impressive capability to capture visual content, deep convolutional neural networks (CNN) have demonstrated promising performance in various vision-based applications, such as classification, recognition, and object detection. However, due to the intrinsic structure design of CNN, for images with complex content it achieves only limited invariance to translation, rotation, and re-sizing changes, which is strongly emphasized in the scenario of content-based image retrieval. In this paper, to address this problem, we propose a new kernelized deep convolutional neural network. We first discuss our motivation with an experimental study that demonstrates the sensitivity of the global CNN feature to basic geometric transformations. Then, we propose to represent visual content with approximate invariance to the above geometric transformations from a kernelized perspective. We extract CNN features on the detected object-like patches and aggregate these patch-level CNN features to form a vectorial representation with the Fisher vector model. The effectiveness of our proposed algorithm is demonstrated on the image search application with three benchmark datasets.
1. Introduction
Vectorial image representation is a fundamental problem in the computer vision field. In many visual analysis systems, the visual content of an image is usually represented as a fixed-size vector for the convenience of subsequent processing. In recent years, a lot of effort has been made on first designing handcrafted visual features [27, 10, 5] and then aggregating the visual features into a single vector [38, 39, 32, 20, 22].

The bag-of-visual-words (BoVW) model is one of the most famous methods to construct an image representation. In the BoVW model, a set of local invariant visual features is first extracted on detected image patches or densely sampled grids. An image is then represented as a visual word histogram based on the quantization results of the local features with an off-line trained visual vocabulary. The visual vocabulary is usually trained with an unsupervised clustering algorithm, such as standard k-means, hierarchical k-means [29], or approximate k-means [34]. Usually the quantization is performed by the nearest neighbor or the approximate nearest neighbor method; namely, each local invariant visual feature is quantized to its nearest or approximately nearest visual word in the vocabulary, which is a kind of hard vector quantization. Instead of hard vector quantization, Wang et al. [39] proposed a locality-constrained linear coding approach to quantize each local visual feature.

The kernel method is another alternative to transform a set of features into a vectorial representation, with examples including the Fisher kernel [32] and the democratic kernel [22]. The Fisher kernel models the joint probability distribution of the visual features detected in an image. The vectorial representation is constructed based on the derivatives in the parameter space. Besides the quantization results in the BoVW model, the Fisher kernel also includes the residual information between the local visual features and their visual words [21]. The Fisher kernel is demonstrated to be more effective than the BoVW model in image classification and image search applications [32, 22, 20, 18, 33]. One non-probabilistic version of the Fisher kernel, named the vector of locally aggregated descriptors (VLAD), is carefully investigated in [20, 21].

Instead of designing handcrafted visual features such as SIFT [27], SURF [5], and HOG [10], the deep convolutional neural network (CNN) [24] learns a non-linear transformation model from a large-scale, well-organized semantic dataset, namely ImageNet [12]. With the learned non-linear transformation model, each image can be transformed into a feature vector [23]. With deep nets learned from a large-scale dataset, the CNN model can well discriminate diverse visual content, which is desired in many visual information processing systems. With breakthroughs in many computer vision tasks, the CNN model has marked a milestone in visual representation and become a new benchmark baseline [36].

A lot of effort has been made to understand the representation ability of the convolutional neural network [15, 40, 25, 9, 26, 35]. In [15], Goodfellow et al. test the invariance of deep networks with a natural video dataset and find that the "deep" structure can obtain more invariance than the "shallow" ones.

Figure 1. The illustration of our motivation to propose the kernelized convolutional neural network. (We refer the CNN details to the Caffe implementation; there should not be substantial differences from the original CNN model in [24].) (a) A simple image with a single object localized at the center (roughly aligned). (b) A complex image with several objects. (c) The proposed kernelized convolutional neural network algorithm.
In [40], Zeiler and Fergus try to understand why the deep convolutional neural network works so well. They propose to visualize the patterns activated by the intermediate layers with a deconvolutional network. It is revealed that some complex patterns can be captured by the top layers, which is quite remarkable. In [25], Lenc and Vedaldi study the mathematical properties of equivariance, invariance, and equivalence of image representations such as SIFT or CNN from a theoretical perspective. In [9], Cimpoi et al. conduct a range of experiments on material and texture attribute recognition and find that CNN can also obtain excellent results on this topic. In [26], Long et al. study the correspondence learned by CNN at a fine level and reveal that good keypoint prediction can be obtained with the learned intermediate CNN features. More specifically, in [35], Razavian et al. demonstrate that local spatial information of an image is also conveyed by CNN and that this local information can be used to perform facial landmark prediction, semantic segmentation, and object keypoint detection.

However, CNN is suitable for describing images with a single object localized at the center, namely those roughly aligned images as shown in Fig. 1(a). For a complex image with multiple objects, it is unsuitable to extract a single global CNN feature as shown in Fig. 1(b), because there may exist geometric transformations on these objects. As a more reasonable alternative, we can first align the content of the image and then construct the global vectorial representation. Hence, inspired by invariant representation via pooling local features, in this paper we propose to represent an image with local CNN features to address the translation and re-sizing invariance issues, and to pool the transformed CNN features to achieve a fixed-size, rotation-invariant representation, which we call the kernelized convolutional neural network (KCNN) in the following. Specifically, we first detect some object-like patches in the given image. Then, for each detected object-like patch, we extract a CNN feature to describe the object in it. Finally, to form a vectorial representation of the whole image, we aggregate these object-level CNN features with a kernel function, as shown in Fig. 1(c).

We organize the rest of the paper as follows. In Section 2, we present some studies on the sensitivity of the global CNN feature to three specific transformations. In Section 3, we introduce our algorithm in detail. The experimental results are presented in Section 4. Finally, we draw conclusions in Section 5.
2. Sensitivity of Global CNN Feature
In this section, we study in detail the sensitivity of the global CNN feature to geometric transformations, i.e., translation, scaling, and rotation. The study is made on the Holidays [20] dataset, which is a benchmark dataset for image search with 1491 high-resolution images. We use the Caffe-based CNN implementation [23] to extract our CNN feature. In the following, given an image I, we use f(·) to denote its extracted CNN feature in the "fc7" layer and use m(·) to denote the cosine similarity between two CNN features. All CNN features are L2-normalized by default.

To reveal the impact of each geometric transformation on the global CNN feature independently, we design the following experiments to make sure that each image undergoes only one kind of geometric transformation.
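For concreteness, the consistency score m(·, ·) used in the experiments below can be computed as in the following minimal Python sketch. This is only an illustration of the protocol, not the code used in the paper; since the features are L2-normalized, cosine similarity reduces to a plain inner product.

import numpy as np

def l2_normalize(v, eps=1e-12):
    # L2-normalize a feature vector; for unit-norm vectors, cosine
    # similarity reduces to a plain inner product.
    v = np.asarray(v, dtype=np.float64)
    return v / (np.linalg.norm(v) + eps)

def similarity(f_a, f_b):
    # m(., .): cosine similarity between two CNN feature vectors.
    return float(np.dot(l2_normalize(f_a), l2_normalize(f_b)))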
Translation. Generally, a translation can be made in the vertical and horizontal directions. To simplify the study, we consider only the translation in the horizontal direction, as shown in Fig. 2; the extension to the general translation is straightforward.
Figure 2. The experiment to study the translation property of the global CNN feature. (a) The illustration of image translation. (b) Two examples of the similarities of the global CNN feature before and after the translation transformation. (c) The mean and standard deviation of the similarities of the global CNN features with respect to the translation transformation.

Given an image I of size M by N, we generate a larger image of size M by 2N, as shown in Fig. 2(a), whose left half is the image I and whose right half is filled by the border extrapolation method. Then we circularly translate I by t pixels to the left to construct its transformed version I(t) and extract the global CNN feature f(I(t)). We measure the consistency score between the global CNN features of I(t = 0) and I(t) with their cosine similarity, as shown by the following equation:

m(I(t)) = ⟨f(I(t = 0)), f(I(t))⟩,    (1)

in which ⟨·, ·⟩ denotes the inner product operation.

In Fig. 2(b), we illustrate two examples of the similarity between the global CNN features before and after the translation transformation. It can be seen that with the increase of horizontal translation, the similarity first declines and then grows after it reaches a valley. The decrease in similarity reflects the fact that the global CNN feature is sensitive to the translation transformation. On the other hand, the increase of the similarity after the valley point demonstrates the effect of the flipping operation that is made during the training stage of the CNN model. A similar phenomenon is also demonstrated by the statistical results shown in Fig. 2(c). The difference in the trends of the similarity curves reflects that the tolerance of the global CNN feature to the translation transformation is also related to the content of the image.
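A sketch of this translation protocol, reusing the l2_normalize helper above; extract_fc7 is a hypothetical stand-in for a Caffe fc7 forward pass, and the border extrapolation is realized here as simple edge replication, one plausible reading of the padding step.

import numpy as np

def translation_consistency(image, extract_fc7, step=16):
    # Build an M x 2N canvas: the left half is the image, the right
    # half is border extrapolation (edge replication here). Then
    # circularly shift the canvas left by t pixels and compare fc7
    # features against the untranslated version, as in Eq. (1).
    h, w = image.shape[:2]
    right_pad = np.repeat(image[:, -1:], w, axis=1)  # replicate last column
    canvas = np.concatenate([image, right_pad], axis=1)
    f0 = l2_normalize(extract_fc7(canvas))
    scores = []
    for t in range(0, w + 1, step):
        shifted = np.roll(canvas, -t, axis=1)        # circular left shift by t
        scores.append((t, float(np.dot(f0, l2_normalize(extract_fc7(shifted))))))
    return scores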
Scaling.
Figure 3. The experiment to study the scaling property of the global CNN feature. (a) The illustration of image scaling. (b) Two examples of the similarities of the global CNN feature before and after the scaling transformation. (c) The mean and standard deviation of the similarities of the global CNN features with respect to the scaling transformation.

In Fig. 3, we show our experiment to study the scaling property of the global CNN feature. The similarity used to measure the image scaling transformation is defined as

m(I(s)) = ⟨f(I(s = 1)), f(I(s))⟩,    (2)

where I(s) denotes the new image re-sized from the original image I with the width and height being s times those of I. To keep the image I(s) at the same size, we pad the region beyond the image boundary by the border extrapolation method. Another choice is to crop sub-images of different sizes at the same location; however, there should not be a substantial difference between these two methods of constructing I(s). Then we extract the global CNN feature f(I(s)).

In Fig. 3(b), we illustrate two examples of the similarity of the global CNN features before and after the scaling transformation. It can be seen that the similarity score decreases as the image is scaled with different ratios, which means the global CNN feature is not invariant to the scaling transformation. A similar phenomenon is also demonstrated by the statistical results shown in Fig. 3(c).
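A corresponding sketch for the scaling protocol of Eq. (2), again with the hypothetical extract_fc7 and the l2_normalize helper from above; the fixed canvas is kept by replicating borders for s < 1 and center-cropping for s > 1, following the padding/cropping discussion.

import cv2
import numpy as np

def scaling_consistency(image, extract_fc7, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    # Re-size the image by a factor s while keeping the canvas size
    # fixed: pad by border replication for s < 1 and center-crop for
    # s > 1, then compare fc7 features against s = 1, as in Eq. (2).
    h, w = image.shape[:2]
    f1 = l2_normalize(extract_fc7(image))
    scores = []
    for s in scales:
        rw, rh = max(1, int(round(w * s))), max(1, int(round(h * s)))
        resized = cv2.resize(image, (rw, rh))
        if s <= 1.0:
            top, left = (h - rh) // 2, (w - rw) // 2
            canvas = cv2.copyMakeBorder(resized, top, h - rh - top,
                                        left, w - rw - left,
                                        cv2.BORDER_REPLICATE)
        else:
            top, left = (rh - h) // 2, (rw - w) // 2
            canvas = resized[top:top + h, left:left + w]
        scores.append((s, float(np.dot(f1, l2_normalize(extract_fc7(canvas))))))
    return scores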
Rotation. In Fig. 4, we show our experiment to study the rotation property of the global CNN feature. We measure the consistency score of the CNN feature to the rotation transformation as
m(I(θ)) = ⟨f(I(θ = 0°)), f(I(θ))⟩,    (3)

where I(θ) denotes the new image after the image I is rotated by θ degrees, as shown in Fig. 4(a). Please note that the image size changes after rotation, as shown by comparing the figure for θ = 0° and the figure for θ = 45° in Fig. 4(a). To study the property of the global CNN feature when only the rotation transformation exists, we extract the CNN feature on the sub-image located at the center of I(θ), as illustrated by the blue square inside the red inscribed circle in Fig. 4(a).

Figure 4. The experiment to study the rotation property of the global CNN feature. (a) The illustration of image rotation. (b) Two examples of the similarities of the global CNN feature before and after the rotation transformation. (c) The mean and standard deviation of the similarities of the global CNN features with respect to the rotation transformation.

In Fig. 4(b), we illustrate two examples of the similarity of the global CNN features before and after the rotation transformation. It can be seen that the similarity varies as the image is rotated by different degrees, and the similarity curves of these two examples have different trends. A similar phenomenon is also demonstrated by the statistical results shown in Fig. 4(c), which shows that the global CNN feature is sensitive to the rotation transformation. That the similarity curves have different trends means that the tolerance of the global CNN feature to the rotation transformation is also related to the content of the image.
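A sketch of the rotation protocol of Eq. (3): the image is rotated about its center and the feature is extracted on a central square inscribed in the red circle of Fig. 4(a), so the compared content is identical up to rotation. The geometry below (a square of half-side r/√2 inside a circle of radius r) is one straightforward realization; extract_fc7 and l2_normalize are as above.

import cv2
import numpy as np

def rotation_consistency(image, extract_fc7, angles=range(0, 360, 15)):
    # Rotate the image about its center by theta degrees and extract
    # the feature on a central square inscribed in the circle of
    # radius r = min(h, w) / 2, so only rotated content is compared,
    # as in Eq. (3).
    h, w = image.shape[:2]
    center = (w / 2.0, h / 2.0)
    half = int(min(h, w) / 2.0 / np.sqrt(2))  # half-side of inscribed square
    y0, x0 = int(center[1] - half), int(center[0] - half)
    f0, scores = None, []
    for theta in angles:
        M = cv2.getRotationMatrix2D(center, theta, 1.0)
        rotated = cv2.warpAffine(image, M, (w, h))
        crop = rotated[y0:y0 + 2 * half, x0:x0 + 2 * half]
        f = l2_normalize(extract_fc7(crop))
        if f0 is None:
            f0 = f                            # theta = 0 reference
        scores.append((theta, float(np.dot(f0, f))))
    return scores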
Discussion. From the experiments above, it can be observed that the similarity m of the global CNN features before and after transformation is sensitive to translation, rotation, and scaling. This comes from the architecture of the CNN model, in which the neurons are highly related to the spatial positions of the image pixels in the local perception field. When the image is transformed, the spatial positions of those pixels are changed, which results in inconsistent CNN features and limits the robustness of the CNN feature to geometric transformations such as translation, scaling, and rotation. To address this problem, we propose to first align the image content at the patch level before extracting the CNN feature. Such a strategy makes the feature robust to translation and scaling changes. Moreover, to enhance the robustness to rotation changes, each image patch is rotated circularly 8 times. Then, to build a vectorial image-level representation, we aggregate the extracted patch-level CNN features with kernel functions.
3. Kernelized Convolutional Neural Network
In this section, we introduce in detail our algorithm to construct vectorial representations of roughly content-aligned images with the kernel method and the deep convolutional neural network.

Given two sets of image patches X and Y with card(X) = n and card(Y) = m, let us consider using a match kernel K(·, ·) [17, 7, 22] to measure the similarity between X and Y:

K(X, Y) = Σ_{x∈X} Σ_{y∈Y} k(x, y),    (4)

where k(·, ·) measures the similarity between two feature descriptors, x stands for an image patch, and y has the similar meaning.

To construct a vectorial image representation for each image, we consider separable kernel functions, namely those for which the similarity between two feature descriptors k(x, y) can be computed by an inner product operation, as shown by the following equation:

K(X, Y) = Σ_{x∈X} Σ_{y∈Y} k(x, y)
        = Σ_{x∈X} Σ_{y∈Y} ⟨φ(x), φ(y)⟩
        = ⟨Σ_{x∈X} φ(x), Σ_{y∈Y} φ(y)⟩
        = ⟨Ψ(X), Ψ(Y)⟩,    (5)

where φ(·) denotes a linear or nonlinear transformation and Ψ(X) is the final image-level vectorial representation we need.

In Eq. 5, the key issue is how to define the function φ(·). Firstly, as the size of x is not fixed, we need a function to transform x into a fixed-dimensional vectorial representation, which can be denoted by γ(·). Secondly, to aggregate these patch-level vectorial representations γ(x) into the final image-level vectorial representation Ψ(X), we need a function to map γ(x) into another space; this step can be denoted by β(·). Such that we have the form

φ(x) = β(γ(x)).    (6)

In the following, we discuss how to design the functions γ(·) and β(·).

γ(·): In computer vision, it is a fundamental problem to describe an image patch of various sizes with a fixed-length feature vector, and there are many classic works on it [27, 10, 5]. For example, in the SIFT [27] algorithm, a spatially constrained gradient histogram is used to represent the image patch. With the development of the technology, some researchers have turned to large-scale machine learning techniques. Recent research works have revealed that the deep convolutional neural network (CNN) is very powerful for many computer vision tasks [36]. The CNN model is learned from a million-scale database, ImageNet. With the advantage of non-linearity and a large number of parameters, CNN can easily handle the immense variants of vision tasks. In this paper, we adopt the CNN model [23] to transform the image patch into its vectorial representation. In [23], a pre-trained CNN model and well-organized code are provided publicly for academic use; we adopt this CNN model to obtain the vectorial representation of each image patch.

β(·): After the image patches are transformed into vectorial representations, we adopt separable kernel methods to aggregate them together to represent the image. There are a lot of works devoted to kernel methods [17, 7, 22, 33, 6, 32]. One classic separable kernel is the Fisher kernel, which models the joint probability distribution of a set of features [33, 32, 31, 20]. Perronnin et al. [31, 33] applied the Fisher kernel to image classification and image retrieval applications. They model the features' joint probability distribution with a Gaussian mixture model (GMM).
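The separability in Eq. (5) is what makes a single-vector representation possible: the double sum of pairwise inner products collapses into one inner product between aggregated embeddings. A tiny numerical check, with random stand-in embeddings φ(x):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))   # five patch embeddings phi(x) for image X
Y = rng.normal(size=(7, 64))   # seven patch embeddings phi(y) for image Y

# Left-hand side of Eq. (5): sum of all pairwise inner products.
lhs = sum(float(np.dot(x, y)) for x in X for y in Y)

# Right-hand side: one inner product between the aggregated vectors
# Psi(X) = sum_x phi(x) and Psi(Y) = sum_y phi(y).
rhs = float(np.dot(X.sum(axis=0), Y.sum(axis=0)))

assert np.isclose(lhs, rhs)    # the match kernel is separable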
In the Fisher kernel, the mapping function β(·) corresponds to the gradient of the features' joint probability distribution with respect to the parameters of this distribution, scaled by the inverse square root of the Fisher information matrix. It gives the direction in parameter space in which the learned distribution should be modified to better fit the observed data. In comparison with the BoVW model, the Fisher kernel model can obtain higher accuracy. Hence, given a set of features, we adopt the Fisher kernel to construct their vectorial representation.

x: To analyze the visual content of a given image, researchers usually extract some interesting patches from it. The word "interesting" implies some clearly defined rules, which can make the detected patches have the desired properties. For example, in the SIFT algorithm [27], the image patches are detected with the difference of Gaussians (DoG) method to obtain the scale-invariant property. Then, to obtain the rotation-invariant property, the detected patches are aligned with the dominant orientation of their gradients. In this paper, we use an object detector [8, 2, 1, 13] to extract object-like patches from the image. After the object-like patches are detected, they are spatially aligned, which provides invariance to translation and scaling transformations. In a recently published work named the BING object detector [8], Cheng et al. proposed a very efficient algorithm to detect object-like image patches with a quite high detection rate, which can process 300 frames per second on a single CPU. The BING object detection algorithm [8] outputs a real value for each patch to indicate how likely the detected image patch is to be an object; with this real value, we can control the number of image patches we want. Considering the excellent speed of the BING algorithm, we adopt it [8] to extract our image patches. To achieve the rotation-invariant property of the extracted image patches, we rotate each image patch x by 8 discrete angles: 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°. Intuitively, a dominant angle for each object patch could be estimated in a similar way as in SIFT. However, our study reveals that such a strategy yields low performance, due to the unreliability of the dominant angle estimation at the object-patch level. Some examples of object-like patches detected with the BING algorithm are shown in Fig. 5.

Figure 5. Two example images with detected object-like patches. Only several top-ranked patches are shown.

Time cost: Besides the time cost to extract object-like patches with the BING detector and the aggregation cost with the Fisher kernel, the time to extract KCNN will be 8 × N times that of the regular CNN, where N is the number of detected objects; this can, however, be accelerated with GPU clusters. Since our paper focuses on addressing the sensitivity of the regular CNN, in our implementation we use the CPU mode of the Caffe library. To fairly show the effectiveness of KCNN, we use the linear search method to search the database, with the inner product operation computing the similarity of two images. Therefore the complexity depends on the dimension of the image vectorial representation.
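Putting the pieces together, the KCNN pipeline can be sketched as follows. All helpers are hypothetical stand-ins rather than actual library calls: detect_patches for the BING detector (boxes ranked by objectness), extract_fc7 for the Caffe fc7 forward pass, pca for the learned 4096 → D projection, and fisher_vector for Fisher vector aggregation against an off-line trained GMM.

import cv2
import numpy as np

ANGLES = (0, 45, 90, 135, 180, 225, 270, 315)  # the 8 discrete rotations

def rotate(patch, theta):
    # Rotate a patch about its center, keeping the canvas size.
    h, w = patch.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), theta, 1.0)
    return cv2.warpAffine(patch, M, (w, h))

def kcnn_descriptor(image, detect_patches, extract_fc7, pca, fisher_vector,
                    num_patches=127):
    # Detect object-like patches, describe each patch (and its 8
    # rotations) with a PCA-reduced fc7 feature, and aggregate all
    # patch-level features into one image-level vector Psi(X).
    feats = []
    for (x0, y0, x1, y1) in detect_patches(image)[:num_patches]:
        patch = image[y0:y1, x0:x1]
        for theta in ANGLES:
            feats.append(pca(extract_fc7(rotate(patch, theta))))
    psi = fisher_vector(np.asarray(feats))
    return psi / (np.linalg.norm(psi) + 1e-12)  # L2-normalize

def linear_search(query_psi, database_psis):
    # Exhaustive inner-product search over the database, as used in
    # the experiments; the cost is linear in the database size and
    # the descriptor dimension.
    return np.argsort(-(database_psis @ query_psi))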
4. Experimental Results
In this section, we evaluate our algorithm on the image retrieval application. We adopt three publicly available benchmark datasets, i.e., Holidays [20], UKBench [29], and Oxford Building [3], to demonstrate the impact of the parameters in our algorithm. We also compare our algorithm with other methods for the image retrieval application.

The Holidays dataset [20] contains 1491 high-resolution images of different scenes and objects with 500 queries. To evaluate the performance, we use the average precision measure, computed as the area under the precision-recall curve of a query. We compute the mean of the average precision over all queries to obtain a mean Average Precision (mAP) score, which is used to evaluate the overall performance [34].

The UKBench dataset [29] contains 2550 objects or scenes, each with four images taken under different views or imaging conditions, resulting in 10200 images in total. In terms of accuracy measurement, the top-4 accuracy [29] is used as the evaluation metric: for each query, the retrieval accuracy is measured by counting the number of correct images in the top-4 returned results, and the retrieval performance is then averaged over all test queries.

The Oxford Building dataset [3, 34] consists of 5062 images of buildings and 55 query images corresponding to 11 distinct buildings in Oxford. Images are annotated as either relevant, not relevant, or junk, the last indicating that it is unclear whether a user would consider the image relevant or not. Following the recommended protocol, the junk images are removed from the ranking results. The retrieval performance is also measured by the mean Average Precision (mAP) computed over the 55 queries.

Our experiments are run on a server with 32GB memory and a 2.4GHz Intel Xeon CPU.
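Both evaluation metrics are simple to state in code; a minimal sketch, where ranked_relevance is the 0/1 relevance of each returned image in rank order (with junk images already removed for Oxford Building):

import numpy as np

def average_precision(ranked_relevance):
    # Area under the precision-recall curve for one query, given the
    # 0/1 relevance of each returned image in rank order.
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def top4_score(ranked_relevance):
    # UKBench metric: number of correct images among the top-4 results;
    # each query has exactly 4 relevant images, so the score lies in [0, 4].
    return int(np.asarray(ranked_relevance)[:4].sum())

# mAP is the mean of average_precision over all queries; the UKBench
# score is the mean of top4_score over all queries.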
We first study the impact of the parameters. There are three parameters in our algorithm. The first one is the number of image patches x detected by the BING detector [8], denoted by N. The second one is the dimension of the vectorial representation of an image patch, γ(x); we adopt the CNN model to construct the vectorial representation of x, resulting in a 4096-D γ(x) [23], and, for convenience and without loss of generality, we perform principal components analysis (PCA) to reduce the 4096-D γ(x) to D dimensions. The last parameter is the visual vocabulary size used in the Fisher vector [32], corresponding to β(·) in Eq. 6, denoted by V.

Figure 6. The illustration of the impact of the parameters in the proposed kernelized convolutional neural network (KCNN) algorithm on the Holidays dataset. (a) D = 32, (b) D = 64, (c) D = 128 are the results when the image patch x is not rotated; (d) D = 32, (e) D = 64, (f) D = 128 are the corresponding results with x rotated. N is the number of objects detected with the BING detector [8]. V is the number of Gaussian functions used in the Fisher vector model [32]. D is the dimension of the CNN features [23] after PCA dimension reduction.

The results are demonstrated in Fig. 6. It can be seen that better accuracy is obtained when more patches (larger N) are used. Similarly, with larger D and V we can obtain higher accuracy; however, the impacts of D and V are smaller than that of N. In Table 1, we demonstrate the performance of the proposed KCNN algorithm on the Holidays, UKBench, and Oxford Building datasets. We can see that it is beneficial to perform the rotation operation on the image patch x for the Holidays and UKBench datasets. Especially for the UKBench dataset, the accuracy of the CNN feature is 3.41 and is improved to 3.51 (+2.9%) with our KCNN algorithm without rotating x; after performing the rotation on x, the accuracy is further improved from 3.51 to 3.74 (+6.6%). This is because there are many rotation transformations in the UKBench dataset, as shown in Fig. 7(a). However, the rotation operation on the image patch x is harmful on the Oxford Building dataset for our KCNN algorithm. A similar result has also been observed when SIFT features are used to perform retrieval on this dataset [30] [21]: in the construction of the SIFT descriptor, better retrieval performance is obtained with the orientation set to the gravity orientation instead of the traditional dominant gradient orientation [27] [21], since there are very few rotation transformations among the building images, as demonstrated in Fig. 7(b).

Table 1. The performance of the proposed kernelized convolutional neural network (KCNN) algorithm on three benchmark datasets, namely Holidays [19], Oxford Building [34], and UKBench [29]. D = 128 and N = 127 are used here.

Dataset                 CNN    KCNN, non-rotated x               KCNN, rotated x
                               V = 64          V = 128           V = 64          V = 128
Holidays (mAP)          0.68   0.793 (+17.7%)  0.801 (+17.8%)    0.823 (+21%)    0.829 (+21.9%)
UKBench (top-4)         3.41   3.46 (+1.5%)    3.51 (+2.9%)      3.72 (+9.1%)    3.74 (+9.7%)
Oxford Building (mAP)   0.38   0.48 (+26.3%)   0.51 (+34.2%)     0.42 (+10.5%)   0.45 (+18.4%)

Figure 7. Examples of the transformations between images. (a) UKBench dataset. (b) Oxford Building dataset.

To further demonstrate the performance of the proposed kernelized convolutional neural network (KCNN) algorithm, we show the Average Precision (AP) of each query of the Oxford Building dataset in Table 2. It can be seen that the proposed KCNN algorithm obtains better retrieval performance than the original convolutional neural network (CNN) for most queries.
There are 38 queries out of the total of 55 queries (69.1%) whose retrieval performance is improved. The highest improvement comes from the query "ashmolean 2", whose retrieval performance is improved by 355.3%, from 0.0987 to 0.4495. Some examples on the Holidays dataset are shown in Fig. 8, in which we give their rank numbers with the CNN representation and with our KCNN representation. From Fig. 8(a) and Fig. 8(b), it can be seen that our KCNN well addresses the sensitivity to the rotation transformation, which may fail the global CNN feature. From the first result of Fig. 8(a) and the second result of Fig. 8(c), it can be seen that the global CNN can also tolerate a slight scaling transformation, while our KCNN does much better, as shown in the first result of Fig. 8(c).

We now compare with the results reported in other research works. As shown in Table 3, the proposed KCNN method obtains the best results on both the Holidays and UKBench datasets. However, on the Oxford Building dataset, SIFT [27] based methods, namely [32] [3] [22], get better results. The reason is that the Oxford Building dataset consists of building images, and retrieval on this dataset is more like a fine-grained problem [28].
Figure 8. Some examples of search results on the Holidays dataset. The rank numbers with the CNN feature and with the KCNN feature are given in the blue boxes.

On the other hand, the deep convolutional neural network is designed to tackle the generic classification problem [11, 24], and fine-tuning is usually required for fine-grained vision tasks.

There also exist some works on performing image search with CNN; however, our work differs substantially from them. Compared with [36], our goal is totally different: our goal is to construct a vectorial representation of an image, while [36] uses a spatial search that is not a vectorial image representation. The spatial search exhaustively searches all the sub-patches extracted on grids at several levels; the search complexity is therefore O(N), where N is the number of extracted sub-patches. Compared with [14], on Holidays we get 0.829 mAP while [14] gets 0.802 mAP; besides the higher accuracy, we also address the rotation transformation, while [14] does not. Compared with [4], they focus on constructing compressed codes of the image representation with a retrained regular CNN, while we focus on addressing the object transformations in the vectorial representation of complex images without retraining.
5. Conclusion
In this paper, we have analyzed the sensitivity of the global CNN feature to geometric transformations of the image, such as translation, scaling, and rotation. Based on our analysis, and inspired by the well-studied local feature based image representation methods, we proposed our kernelized convolutional neural network (KCNN) algorithm to describe the content of complex images. With our KCNN method, we can obtain a more robust vectorial representation. Besides the CNN structure implemented in the Caffe library, there are also other emerging CNN structures [37, 16]. In the future, we would like to investigate the potential of these different CNN models for image retrieval and to investigate the performance of our KCNN model integrated with these CNN models.

Table 2. The detailed results (Average Precision for queries 1-5 of each landmark) on the Oxford Building dataset.
Average Precision        1        2        3        4        5
all souls
  CNN                  0.113    0.21     0.274    0.546    0.156
  KCNN                 0.32     0.416    0.447    0.748    0.491
  Improved             +184%    +98%     +63%     +37%     +215%
ashmolean
  CNN                  0.366    0.099    0.068    0.229    0.208
  KCNN                 0.733    0.449    0.293    0.717    0.272
  Improved             +100%    +355%    +334%    +213%    +31%
balliol
  CNN                  0.134    0.121    0.206    0.322    0.302
  KCNN                 0.581    0.414    0.388    0.603    0.447
  Improved             +333%    +242%    +89%     +87%     +48%
bodleian
  CNN                  0.207    0.262    0.421    0.463    0.431
  KCNN                 0.502    0.592    0.535    0.598    0.63
  Improved             +142%    +126%    +27%     +29%     +46%
christ church
  CNN                  0.442    0.487    0.363    0.213    0.144
  KCNN                 0.472    0.56     0.526    0.399    0.134
  Improved             +6.8%    +15%     +45%     +87%     -6.8%
cornmarket
  CNN                  0.594    0.223    0.133    0.139    0.559
  KCNN                 0.426    0.4      0.307    0.515    0.578
  Improved             -28%     +80%     +130%    +271%    +3.5%
hertford
  CNN                  0.6      0.617    0.588    0.636    0.64
  KCNN                 0.605    0.604    0.586    0.609    0.471
  Improved             +0.8%    -2.2%    -0.4%    -4.3%    -27%
keble
  CNN                  0.383    0.482    0.551    0.209    0.336
  KCNN                 0.696    0.747    0.638    0.547    0.226
  Improved             +82%     +55%     +16%     +162%    -33%
magdalen
  CNN                  0.104    0.114    0.087    0.143    0.081
  KCNN                 0.09     0.08     0.078    0.1      0.08
  Improved             -13%     -30%     -11%     -30%     -2%
pitt rivers
  CNN                  0.563    0.678    0.45     0.472    0.519
  KCNN                 0.651    0.846    0.662    0.381    0.621
  Improved             +15.6%   +24.8%   +47%     -19.5%   +19.7%
radcliffe camera
  CNN                  0.876    0.788    0.89     0.889    0.889
  KCNN                 0.743    0.842    0.842    0.876    0.709
  Improved             -15.2%   +6.9%    -5.3%    -1.5%    -20.2%
Table 3. The performance comparisons with other reported research works based on global image representation.
Dataset                   [32]     [3]      [22]     CNN      KCNN
Holidays (mAP)            0.735    0.646    0.771    0.68     0.829
UKBench (top-4)           3.50     N/A      3.53     3.41     3.74
Oxford Building (mAP)     N/A      0.555    0.676    0.38     0.51
References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 73–80, 2010.
[2] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
[3] R. Arandjelovic and A. Zisserman. All about VLAD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[4] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In Proceedings of the European Conference on Computer Vision, pages 584–599, 2014.
[5] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In Proceedings of the European Conference on Computer Vision, pages 404–417, 2006.
[6] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2011.
[7] L. Bo and C. Sminchisescu. Efficient match kernels between sets of features for visual recognition. In Proceedings of Advances in Neural Information Processing Systems, December 2009.
[8] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[9] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828–3836, 2015.
[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 886–893, 2005.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[13] I. Endres and D. Hoiem. Category independent object proposals. In Proceedings of the European Conference on Computer Vision, pages 575–588, 2010.
[14] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proceedings of the European Conference on Computer Vision, pages 392–407, 2014.
[15] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng. Measuring invariances in deep networks. In Proceedings of Advances in Neural Information Processing Systems, pages 646–654, 2009.
[16] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer. DenseNet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014.
[17] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Proceedings of Advances in Neural Information Processing Systems, 1998.
[18] H. Jégou and O. Chum. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In Proceedings of the European Conference on Computer Vision, pages 774–787, 2012.
[19] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the European Conference on Computer Vision, pages 304–317, 2008.
[20] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3304–3311, 2010.
[21] H. Jégou, F. Perronnin, M. Douze, C. Schmid, et al. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012.
[22] H. Jégou and A. Zisserman. Triangulation embedding and democratic aggregation for image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, United States, June 2014.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[25] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. arXiv preprint arXiv:1411.5908, 2014.
[26] J. L. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In Proceedings of Advances in Neural Information Processing Systems, pages 1601–1609, 2014.
[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[28] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Sixth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP), pages 722–729, 2008.
[29] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2161–2168, 2006.
[30] M. Perd'och, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9–16, 2009.
[31] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[32] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3384–3391, 2010.
[33] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In Proceedings of the European Conference on Computer Vision, pages 143–156, 2010.
[34] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
[35] A. S. Razavian, H. Azizpour, A. Maki, J. Sullivan, C. H. Ek, and S. Carlsson. Persistent evidence of local image properties in generic convnets. In Image Analysis, pages 249–262. Springer, 2015.
[36] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 512–519, 2014.
[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 1470–1477, 2003.
[39] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3360–3367, 2010.
[40] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, pages 818–833, 2014.