Second-order Convolutional Neural Networks*

Kaicheng Yu, Mathieu Salzmann
CVLab, EPFL, 1015 Lausanne, Switzerland
{kaicheng.yu, mathieu.salzmann}@epfl.ch

Abstract
Convolutional Neural Networks (CNNs) have been successfully applied to many computer vision tasks, such as image classification. By performing linear combinations and element-wise nonlinear operations, these networks can be thought of as extracting solely first-order information from an input image. In the past, however, second-order statistics computed from handcrafted features, e.g., covariances, have proven highly effective in diverse recognition tasks. In this paper, we introduce a novel class of CNNs that exploit second-order statistics. To this end, we design a series of new layers that (i) extract a covariance matrix from convolutional activations, (ii) compute a parametric, second-order transformation of a matrix, and (iii) perform a parametric vectorization of a matrix. These operations can be assembled to form a Covariance Descriptor Unit (CDU), which replaces the fully-connected layers of standard CNNs. Our experiments demonstrate the benefits of our new architecture, which outperforms first-order CNNs while relying on up to 90% fewer parameters.
1. Introduction
Image classification, e.g., recognizing objects and people in images, has been one of the fundamental goals of computer vision since its inception. In the past few years, Convolutional Neural Networks (CNNs), which jointly learn the features and the classifier, have proven highly effective at tackling such classification tasks [2, 16, 34], and have thus dramatically accelerated the advances in recognition. In essence, CNNs stack multiple layers, convolutional and fully-connected ones, with the parameters of each layer acting as filters on the output of the preceding one. By computing such linear combinations, even when followed by element-wise nonlinearities and pooling, traditional CNNs can be thought of as extracting only first-order statistics from the input images. In other words, such networks cannot extract second-order statistics, such as covariances.

* This research is funded by the Swiss National Science Foundation.

Psychophysics research, however, has shown that
second-order statistics play an important role in the human visual recognition process [22]. This has been exploited in the past in computer vision via the development of Region Covariance Descriptors (RCDs) [38], which encode covariance matrices computed from local image features. In fact, these descriptors have been shown to typically outperform first-order features for visual recognition tasks such as material recognition and people re-identification [13, 14, 21]. However, to date, RCDs have been mostly confined to exploiting handcrafted features, and have thus been unable to match the performance of deep networks.

Figure 1. Comparison of traditional first-order (FO-)CNNs (top) with our second-order (SO-)CNNs (bottom). While, by performing linear combinations, traditional CNNs extract first-order information, our new architectures compute second-order statistics.

In this paper, we introduce a new class of CNN architectures that exploit second-order statistics for visual recognition. To this end, we develop three new types of layers. The first one extracts a covariance matrix from convolutional activations. The second one computes a parametric second-order transformation of an input matrix, such as a covariance matrix. Finally, the last one performs a parametric vectorization of an input matrix. These different types of layers can be stacked into a Covariance Descriptor Unit (CDU), which, as shown in Fig. 1, replaces the fully-connected layers of a traditional CNN. Altogether, this provides us with second-order CNNs (SO-CNNs) that can be trained in an end-to-end manner.

To the best of our knowledge, only very few works have considered the use of RCDs in conjunction with CNNs. In particular, [40] extracted RCDs from features pre-computed using a CNN, but without proposing an end-to-end learning framework. By contrast, [20] briefly studied the use of the matrix outer product, which corresponds to a second-order operation, within a deep network as an application of their matrix backpropagation algorithm.
While interesting, this work did not focus on extracting second-order statistics and thus remains preliminary in that respect. Here, we study this problem more thoroughly and introduce new layer types that were not considered in [20] and that, as evidenced by our experiments, are key to the success of second-order CNNs.

We demonstrate the benefits of our second-order CNNs on the tasks of object recognition, using the CIFAR-10 dataset [24], and material recognition, using the challenging Materials in Context Database (MINC) [2]. Our experiments demonstrate the generality of our approach by implementing it within different basic network architectures, such as FitNet [30], VGG16 [34] and ResNet [16]. In all cases, we show that our second-order CNNs outperform the corresponding first-order ones, while relying on up to 90% fewer parameters for networks having large fully-connected layers. Furthermore, our method also outperforms the covariance learning framework of [17], which uses pre-computed deep features, and the single covariance network of [20]. We believe that this clearly evidences the potential of our second-order CNNs and that, by making our code publicly available, it will motivate other researchers to explore going beyond first-order statistics within deep learning.
2. Related Work
Visual recognition is one of the core problems of computer vision, and has thus received a huge amount of attention. Below, we briefly review the recent advances that are most closely related to this work, which brings together the notions of deep learning and second-order statistics, such as covariance matrices.
CNNs for Visual Recognition.
While, in the past, the problems of feature extraction and classifier training were typically decoupled [3, 27, 32], the impressive results achieved 5 years ago by AlexNet [25] on the ImageNet recognition challenge have put deep learning at the center of visual recognition. Recent years have seen great progress in this context, with increasingly deeper networks [16, 25, 34], and novel normalization [19, 31] and optimization [7, 23, 37, 42] strategies. All these networks, however, follow the same general strategy of stacking multiple layers, convolutional and fully-connected ones, each of which computes linear combinations of the output of the previous one. Despite the use of nonlinearities and pooling strategies, the resulting operations therefore still essentially extract first-order information, in the sense that they cannot compute higher-order statistics, such as covariances.
Covariance Descriptors for Visual Recognition.
In the era of handcrafted features, however, second-order statistics, and particularly Region Covariance Descriptors (RCDs) [39], have proven effective at addressing visual recognition tasks. Several metrics have been proposed to compare RCDs [1, 28, 29, 35], and they have been used in various classification frameworks, such as boosting [39], kernel Support Vector Machines [21], sparse coding [5, 9] and dictionary learning [12, 15, 26, 36]. In all these works, however, while the classifier was trained, no learning component was involved in the computation of the RCDs.
Covariance Descriptors and Learning.
To the best of our knowledge, [11], and its log-Euclidean metric learning extension [18], can be thought of as the first attempts to learn RCDs. This, however, was achieved by reducing the dimensionality of input RCDs, and thus has limited learning power. In a work concurrent to ours [17], the framework of [11] was extended to learning multiple transformations of input RCDs. This approach, however, still relied on RCDs as input. By contrast, here, we introduce an end-to-end learning strategy. As discussed later, this requires special care to transition from the convolutional activations to the covariance matrix, and, as evidenced by our experiments, significantly outperforms the approach of [17].

Only very few works have considered using RCDs in conjunction with deep learning. In particular, [41] designed a CNN taking RCDs as input for the task of saliency computation. The focus of this work, however, differs fundamentally from ours, as it rather aims to process pre-computed RCDs, whereas we seek to learn second-order statistics from images. More closely related to our work, [40] computed RCDs from features extracted using a pre-trained CNN. Nevertheless, this work is limited to computing a standard covariance, and did not propose any end-to-end learning strategy. By contrast, [20] briefly discussed the idea of computing a covariance matrix within a CNN, which was then flattened after a logarithmic map. Second-order statistics, however, were not the focus of this work, which rather aimed to develop a general matrix backpropagation algorithm. As a consequence, it did not consider practical problems such as the parameter explosion arising from appending a fully-connected layer to a large, flattened covariance matrix, and the resulting method would therefore not be applicable to networks with high-dimensional feature maps, such as the VGG or ResNet.
Here, we not only take this into account, but also introduce new types of layers, thus truly developing a new class of deep architectures that exploit second-order statistics. Our experiments demonstrate that our second-order CNNs outperform not only the first-order ones, but also the state-of-the-art covariance-based approaches of [20] and [17].
Figure 2. Our Covariance Descriptor Unit (CDU). Deep convolutional feature maps (W × H × D) pass through a Cov layer (D × D), one or more O2T layers (D′ × D′), and a Parametric Vectorization (PV) layer (D′′ × 1) to produce the output.
3. Our Approach
In this section, we first introduce the basic architecture of our second-order CNNs (SO-CNNs), including our new layer types. We then address practical issues arising when starting from pre-trained convolutional layers and when dealing with high-dimensional convolutional feature maps.
As illustrated by Fig. 1, an SO-CNN consists of a series of convolutions, followed by new second-order layers of different types, ending in a mapping to vector space, which then lets us predict a class label probability via a fully-connected layer and a softmax. The convolutional layers in our new SO-CNN architecture are standard ones, and we therefore focus the discussion on the new layer types that model second-order statistics. In particular, as illustrated by Fig. 2, we introduce three such new layer types: Cov layers, which compute a covariance matrix from convolutional activations; O2T layers, which compute a parametric second-order transformation of an input matrix; and PV layers, which perform a parametric mapping of an input matrix to vector space. Below, we discuss these different layer types in more detail.
Cov Layer.
As suggested by its name, a Cov layer computes a covariance matrix. In particular, this type of layer typically follows a convolutional layer, and thus acts on convolutional activations.

Specifically, let X be the (W × H × D) tensor corresponding to a convolutional activation map. This tensor can be reshaped into an (N × D) matrix X = [x_1, x_2, ..., x_N]^T, with x_k ∈ R^D and N = W · H. The (D × D) covariance matrix of such features can then be expressed as

Σ = (1/N) ∑_{k=1}^{N} (x_k − μ)(x_k − μ)^T ,   (1)

where μ = (1/N) ∑_{k=1}^{N} x_k is the mean of the feature vectors.

While Σ encodes second-order statistics, it completely discards the first-order ones, which may nonetheless carry valuable information. To keep the first-order information, we propose to define the output of our Cov layer as

C = [ Σ + β² μ μ^T    β μ ]
    [ β μ^T            1  ] ,   (2)

which incorporates the mean of the features via a parameter β, set to a fixed small value in our experiments.

A key ingredient for end-to-end learning is that the operation performed by each layer is differentiable. Being continuous algebraic operations, the covariance matrix in Eq. 1 and the mean vector μ clearly are differentiable with respect to their input X. This therefore makes our Cov layer differentiable, and enables its use in an end-to-end learning framework.

O2T Layer.
The Cov layer described above is non-parametric. As a consequence, it may decrease the network capacity compared to the traditional way of exploiting the convolutional activations, which passes them through a parametric fully-connected layer, and thus yield a less expressive model despite its use of second-order information. To overcome this, we introduce a parametric second-order transformation layer, which not only increases the model capacity via additional parameters, but also allows us to handle large convolutional feature maps.

More specifically, given a (D × D) matrix M as input, our O2T layer performs a second-order transformation of the form

Y = W M W^T ,   (3)

whose parameters W ∈ R^{D′ × D} are trainable. Note that the value D′ controls the size of the output matrix, and thus gives more flexibility to the network than the previous Cov layer. Clearly, this second-order operation is differentiable, and can therefore be integrated in an end-to-end learning framework.

The O2T layer can be applied either to a covariance matrix computed by a Cov layer, or recursively to the output of another O2T layer. Note that, since covariance matrices are symmetric positive (semi)definite (SPD) matrices, our formulation guarantees that the output obtained by applying one or multiple recursive O2T layers also is. To prevent degeneracies and guarantee that the rank of the original covariance matrix is preserved, additional orthonormality constraints can be enforced on the parameters W. To this end, we make use of the optimization method on the Stiefel manifold employed in [10]. Empirically, we found these constraints to have varying but in general limited influence on the results. Altogether, our parametric O2T layers increase the capacity of the network while still modeling second-order information.

PV Layer.
Since our ultimate goal is classification, we eventually need to map our second-order, matrix-based representation to a vector form, which can in turn be mapped to a class probability estimate via a fully-connected layer with a softmax activation. In [17, 20], such a vectorization was achieved by simply flattening the matrix after applying a logarithmic map. When working with large matrices (large D), however, this may lead to an intractable number of parameters to map the resulting O(D²)-dimensional vector to the vector of class probability estimates. Here, instead of direct flattening, we introduce a parametric vectorization of the second-order representation.

Specifically, given an input matrix Y ∈ R^{D′ × D′}, we compute a vector v ∈ R^{D′′}, whose j-th element is defined as

[v]_j = ([W]_{:,j})^T Y [W]_{:,j} = ∑_{i=1}^{D′} [W ⊙ (Y W)]_{i,j} ,   (4)

where W ∈ R^{D′ × D′′} are trainable parameters, [A]_{i,j} denotes the entry in the i-th row and j-th column of matrix A, and [A]_{:,j} its complete j-th column. Note that, while both formulations in Eq. 4 are equivalent, the first one is easier to interpret, whereas the second one is better suited to an efficient implementation with matrix operations.

Due to its formulation, this vectorization can, in essence, still be thought of as a second-order transformation. More importantly, being parametric, it increases the flexibility of the model, while preventing the number of parameters in the following fully-connected layer from becoming intractable. As for our other layers, this operation is differentiable, and can thus be integrated into an end-to-end learning formalism.

General SO-CNN Architecture.
We dub Covariance Descriptor Unit (CDU) a sub-network obtained by stacking our new layer types. In short, and as illustrated in Fig. 2, a CDU takes as input the activations of a convolutional layer and first computes a covariance matrix according to Eq. 2. The resulting matrix is passed through a number of O2T layers (Eq. 3), possibly none, whose output is then mapped back to a vector via a PV layer. Each of these layers can be followed by an element-wise nonlinearity. In particular, we make use of ReLUs, which have the property of maintaining the positive definiteness of SPD matrices. Importantly, the resulting CDUs are generic and can be integrated in any state-of-the-art CNN architecture.

As such, our framework makes it possible to transform any traditional first-order CNN architecture into a second-order one for image classification. To this end, one can simply remove the fully-connected layers of the first-order CNN and connect the resulting output to a CDU. The output of the CDU being a vector, one can then simply pass it to a fully-connected layer, which, after a softmax activation, produces class probabilities. Since, as discussed above, all our new layers are differentiable, the resulting network can be trained in an end-to-end manner.
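As an illustration, the three layer types can be sketched in NumPy as follows. This is a minimal forward pass under the definitions of Eqs. 1-4 only; the function names and the value β = 0.3 are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def cov_layer(X, beta=0.3):
    """Eqs. 1-2: covariance of the N = W*H feature vectors (rows of
    X), augmented with the mean through the parameter beta."""
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    sigma = Xc.T @ Xc / N                        # Eq. 1, (D x D)
    C = np.empty((D + 1, D + 1))
    C[:D, :D] = sigma + beta**2 * np.outer(mu, mu)
    C[:D, D] = beta * mu
    C[D, :D] = beta * mu
    C[D, D] = 1.0
    return C                                     # Eq. 2, (D+1, D+1)

def o2t_layer(M, W):
    """Eq. 3: Y = W M W^T with trainable W of shape (D_out, D_in);
    Y stays symmetric positive semidefinite whenever M is."""
    return W @ M @ W.T

def pv_layer(Y, W):
    """Eq. 4: v_j = w_j^T Y w_j for each column w_j of W, computed
    via the equivalent matrix form (column sums of W * (Y @ W))."""
    return np.sum(W * (Y @ W), axis=0)

# Toy CDU on a 7x7x8 activation map: Cov -> O2T -> ReLU -> PV.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 8))                     # N = 7*7, D = 8
C = cov_layer(X)                                 # (9, 9) SPD matrix
Y = np.maximum(o2t_layer(C, rng.normal(size=(4, 9))), 0.0)
v = pv_layer(Y, rng.normal(size=(4, 5)))         # 5-dim descriptor
```

Note that C is the sum of the positive semidefinite matrices diag(Σ, 0) and [βμ; 1][βμ; 1]^T, so it is itself positive semidefinite, and Eq. 3 preserves this property.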
The basic SO-CNN architecture described above can be trained from scratch, as we will show in our experiments. To speed up training, however, one might want to leverage the availability of pre-trained first-order CNNs. To do so, we propose to first freeze the pre-trained convolutional layers to train the second part of the SO-CNN, and then fine-tune the entire network. We observed empirically that, while we could train the second part of the network, fine-tuning did not converge. This, we believe, is due to the fact that there is no connection between first- and second-order features in the first stage, and thus the gradient of the second part is too different from that of the first one at the beginning of the fine-tuning process.

To address this, we therefore propose to introduce an additional transition layer, which facilitates training and gives more flexibility to the model by allowing it to modify the pre-trained convolutional feature maps. To this end, we apply a linear mapping to each feature vector independently. Specifically, let x_k be an original convolutional feature vector. We then learn a mapping of the form

h(x_k) = W x_k + b ,   (5)

where W ∈ R^{D̃ × D} is a trainable weight matrix, and b ∈ R^{D̃} a trainable bias. Since the weight matrix and bias are constrained to be the same for all the feature vectors, this is equivalent to a 1 × 1 convolutional layer with a linear activation function. The parameter D̃ gives rise to a range of different models, with adapted features ranging from lower to higher dimensionalities than the original ones. As shown in our experiments, this strategy allows us to effectively exploit pre-trained convolutions in our SO-CNNs, while still learning the entire model in an end-to-end manner by unfreezing the convolutions in a second learning phase.

In our basic SO-CNNs, a CDU directly follows a convolutional layer. While this transition can, in principle,
be achieved seamlessly, the rapid growth in the dimensionality of the convolutional feature maps computed by modern architectures makes this problem more challenging. Indeed, with a basic architecture derived from, e.g., the ResNet [16], whose last convolutional activation map has size (7 × 7 × 2048) for a (224 × 224) input, the resulting covariance matrix would be very high-dimensional (2048 × 2048), but have a low rank (at most 48). In practice, this would translate into instabilities in the learning process due to many 0 eigenvalues. While, in principle, this could be handled by using the strategy of Section 3.2 with a small D̃, this would incur a loss of information that reduces the network capacity too severely. Below, we study two strategies to overcome this problem, which define our complete SO-CNN architecture.

Figure 3. Using multiple CDUs. (Left) Example of an SO-CNN with multiple CDUs, built on VGG16. (Right) Two methods to fuse information between multiple CDUs: fusion occurs after the PV layers in the top figure, and before in the bottom one. Fusion strategies include concatenation, summation and averaging. Note that black arrows indicate mathematical operations, whereas white ones correspond to an identity mapping.

Robust Covariance Estimation.
As a first solution to overcome the low-rank problem, we make use of the robust covariance approximation introduced in [40] in the context of RCDs. Specifically, let Σ = U S U^T be the eigenvalue decomposition of the covariance matrix. A robust estimate of Σ can be written as

Σ̂ = U f(S) U^T ,   (6)

where f(·) is applied element-wise to the values of the diagonal matrix S, and is defined as

f(x) = √( ((1 − α)/(2α))² + x/α ) − (1 − α)/(2α) ,   (7)

with the parameter α set to a fixed value in practice. The resulting estimate Σ̂ can then replace Σ in Eq. 2.

Thanks to the matrix backpropagation framework of [20], which handles eigenvalue decomposition, this robust estimate can also be differentiated, and thus incorporated in an end-to-end learning framework.

Multiple CDUs.
Our second strategy for handling high-dimensional feature maps, illustrated by Fig. 3 (left), consists of splitting the feature maps into n separate groups of equal sizes. Each group then acts as input to a different CDU, whose covariance matrix will have fewer 0 eigenvalues than a covariance obtained from all the features. For example, with a ResNet, instead of computing a covariance descriptor of size 2048 × 2048, we create 4 groups of 512 features, and use them to compute 4 different covariance descriptors, followed by separate O2T and PV layers. In essence, this strategy still makes use of all the features, but does not consider all the possible pairwise covariances. However, since the features are learned, the network can automatically determine which pairwise covariances are important. Note that the robust covariance estimate discussed above can be applied to the covariance matrix of each group.

Ultimately, the information contained in the multiple CDUs needs to be fused into a single image representation. We propose two strategies to do so, illustrated in Fig. 3 (right). The first one consists of combining the CDUs' output vectors by an operation such as summing, averaging or concatenation. The second one fuses the multiple branches before vectorization, which can again be achieved by summing or averaging the respective matrices, or by concatenating them into a larger block-diagonal matrix. This is then followed by a PV layer.
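Both strategies are straightforward to sketch. The snippet below is our own illustration, not the authors' code; α = 0.75 and the group count are placeholder values. It shows the eigenvalue rescaling of Eqs. 6-7 and the channel grouping with block-diagonal (descriptor-level) fusion.

```python
import numpy as np

def robust_cov(sigma, alpha=0.75):
    """Eqs. 6-7: rescale the eigenvalues of sigma with
    f(x) = sqrt(((1-alpha)/(2*alpha))**2 + x/alpha) - (1-alpha)/(2*alpha),
    i.e. f solves alpha*f**2 + (1-alpha)*f = x, so f(0) = 0 and f
    grows like sqrt(x/alpha) for large eigenvalues."""
    w, U = np.linalg.eigh(sigma)
    c = (1.0 - alpha) / (2.0 * alpha)
    f = np.sqrt(c**2 + np.maximum(w, 0.0) / alpha) - c
    return (U * f) @ U.T

def group_covariances(X, n_groups):
    """Split the D channels of the (N, D) activations into n_groups
    equal groups and compute one covariance per group; covariances
    across different groups are deliberately dropped."""
    covs = []
    for G in np.split(X, n_groups, axis=1):
        Gc = G - G.mean(axis=0)
        covs.append(Gc.T @ Gc / G.shape[0])
    return covs

def fuse_block_diag(covs):
    """D-concat fusion: stack the per-group descriptors into one
    block-diagonal matrix, to be followed by a single PV layer."""
    sizes = [c.shape[0] for c in covs]
    out = np.zeros((sum(sizes), sum(sizes)))
    i = 0
    for c, d in zip(covs, sizes):
        out[i:i + d, i:i + d] = c
        i += d
    return out
```

Because f is monotonically increasing with f(0) = 0, the rescaling keeps the eigenvector basis and the eigenvalue ordering, while damping the spread of large eigenvalues.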
4. Experiments
In this section, we first present results obtained with our basic SO-CNN, introduced in Section 3.1, on CIFAR-10. We then turn to evaluating our complete SO-CNN architecture, with the different strategies introduced in Sections 3.2 and 3.3, on the larger, more challenging MINC dataset.
CIFAR-10 [24] is an object recognition dataset containing 50,000 training and 10,000 testing (32 × 32) RGB images depicting 10 classes of objects. In the following experiments, we augmented the data by flipping the training images for all models and baselines. Because of the relatively small scale of this dataset, we can directly apply our basic SO-CNN to it. We therefore make use of this dataset
Figure 4.
Joint influence of the PV output dimension and the second-order dimension. With no O2T layers, learning is unstable once the PV dimension becomes significantly larger than the covariance dimension (64). With one O2T layer, learning is more stable, particularly when the PV dimension is not significantly larger than the O2T dimension. This suggests one should use a PV dimension similar to that of the last O2T layer.

to evaluate different architecture designs within our basic SO-CNN framework. Furthermore, we compare our basic SO-CNN to the corresponding first-order CNN, to the matrix backpropagation model of [20] (MatBP) and to the SPD-net of [17].
Model Setup.
We use the FitNet-v1 model of [30] as our base first-order architecture. FitNet has 3 convolutional blocks, each of which contains 3 convolutional layers, with no dropout. The filters are of size (3 × 3) for all layers, and one max-pooling layer is attached after each block. In the first-order model, the last convolutions are followed by one fully-connected (FC) layer of size 500. In our basic SO-CNNs, we replace this layer with a CDU. Since the last convolutional feature map is of dimension 64, the resulting covariance matrix is sufficiently small not to require a robust estimate or multiple CDUs. Both FitNet and our SO-CNN then have a final FC layer to produce a 10-dimensional vector of class probabilities via a softmax activation. Below, we evaluate different architectures of our SO-CNN model, corresponding to varying the output dimensionality of the PV layer, and the number and dimensionalities of the O2T layers. For all models (first- and second-order), all the weights were initialized using the method of [8]. We used stochastic gradient descent with an initial learning rate of 0.01, reduced by a factor of 10 when the validation loss did not decrease for 8 epochs.

PV Output Dimension vs. Second-order Dimension.
Intuitively, the output dimensionality of the PV layer should be similar to that of the second-order descriptor, whether the last O2T layer or directly the covariance when no O2T layer is used (e.g., a much smaller dimension would result in information loss). In a first experiment, we therefore evaluate the joint influence of these two dimensionalities. To this end, we make use of either no O2T layer, or one such layer. We vary the PV output dimensionality from 10 to 200 with a step size of 10, together with the dimensionality of the O2T layer, denoted by O2T(m) for dimension m. In Fig. 4, we plot the accuracy of the resulting models as a function of the PV dimensionality. We can observe that a small m should be used in conjunction with a small PV dimension, whereas a large m yields slightly higher accuracy with a high PV dimension. Furthermore, training seems to be less stable if the PV dimension is significantly larger than the second-order one. We can also see that, as expected, adding one O2T layer brings more flexibility to the model, and thus yields higher accuracy.

SETTING   SO-CNN-2   SO-CNN-3   SO-CNN-4   SO-CNN-5
SAME      .90%       83.68%     83.18%     84.
÷2        .86%       84.45%     83.69%     83.
×2        .35%       84.

Table 1. Influence of O2T layer number and dimension. SAME indicates that the dimension is the same (64) in all layers, and ÷2 or ×2 that the dimension is divided or multiplied by 2 from one layer to the next. The PV layer has the same dimension as the last O2T layer. For example, CDU-3 with ÷2 corresponds to O2T(200) - O2T(100) - O2T(50) - PV(50).

Number and Dimensions of O2T Layers.
As a second experiment, we evaluate the influence of the number and dimensions of O2T layers in our SO-CNN framework. To this end, we vary the number of O2T layers from 2 to 5 (we also tested with 1, but omit it here due to a consistently slightly lower accuracy), denoted by SO-CNN-{2, ..., 5}, and follow three strategies regarding their dimensionalities: (i) we keep the dimension constant across the different O2T layers; (ii) we increase the dimensionality from 50 by a factor 2 in successive O2T layers; (iii) we decrease the dimensionality by half in each layer to reach a final dimension of 50. In all these settings, following the results of the previous experiment, we set the PV output dimensionality to that of the last O2T layer. The results of this experiment are provided in Table 1. They show that (i) adding more O2T layers indeed increases the capacity of the network, but may lead to overfitting if too many such layers are employed; and (ii) the most effective strategy to set the dimensionalities of the O2T layers consists of increasing them in successive layers.

Comparison to the Baselines.
Following the previous analysis, in Table 2, we compare our SO-CNN-4 model, with increasing O2T layer dimensions and a PV output dimension matching that of the last O2T layer, with the first-order FitNet CNN, and the MatBP [20] and SPD-net [17] baselines. For the comparison to be fair, for MatBP, we made use of the same FitNet-based architecture as us. For SPD-net, which relies on a covariance matrix as input, we exploited RCDs obtained from the last convolutional layer of the first-order FitNet. Note that we were unable to train these two baselines from scratch, as opposed
CLASSIFIER     SETTINGS   PARAMS   ACC
FitNet [30]    500        620K     .
MatBP [20]     -          131K     .
SPD-net [17]   70,50,30   55K      .
SO-CNN-4       ×2         .

Table 2. Baseline comparison on CIFAR-10 with FitNet architectures.
Note that we outperform all baselines, while relying on roughly 40% fewer parameters than the first-order CNN, which is closest to us in accuracy.

to our SO-CNNs, and therefore fine-tuned them from the pre-trained FitNet. The hyper-parameters of SPD-net were set according to the recommendations in [17]. As can be seen from the table, our model outperforms MatBP and SPD-net by a significant margin, thus showing the benefits of our end-to-end learning strategy over using a single covariance flattened after a log-map (MatBP) and over a two-stage strategy consisting of using a pre-defined covariance matrix as input (SPD-net). Note also that our model outperforms the first-order one, thus showing the importance of leveraging second-order information. As can be verified from the results of our previous experiments, other versions of our SO-CNN also outperform the first-order one, confirming the benefits of our approach. Altogether, we believe that these results clearly demonstrate the potential of our basic SO-CNN architecture.
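The parameter savings can be made concrete by counting the trainable weights of Eqs. 3 and 4 directly. The sketch below is our own bookkeeping: it covers only the CDU itself (not the convolutional layers or the final FC classifier), and assumes the (D+1)-dimensional Cov output of Eq. 2 for FitNet's 64 feature channels.

```python
def cdu_param_count(d_in, o2t_dims, pv_dim):
    """Trainable parameters of a CDU: each O2T layer maps a (d x d)
    matrix to (d' x d') through a (d' x d) weight (Eq. 3), and the
    PV layer uses a (d x pv_dim) weight (Eq. 4)."""
    n, d = 0, d_in
    for d_out in o2t_dims:
        n += d_out * d          # O2T weight of Eq. 3
        d = d_out
    return n + d * pv_dim       # plus the PV weight of Eq. 4

# The O2T(200)-O2T(100)-O2T(50)-PV(50) example of Table 1, on the
# 65-dimensional Cov output (64 channels + the mean row/column):
print(cdu_param_count(65, [200, 100, 50], 50))   # 40500

# For comparison, a plain FC layer on the flattened (65 x 65)
# covariance needs 65 * 65 weights per output unit:
print(65 * 65 * 10)                              # 42250 for 10 outputs
```

Even this small CDU thus costs less than a single flatten-plus-FC mapping to 10 classes, and the gap widens quickly with the feature dimension, since the flattened input grows as O(D²).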
We now evaluate our complete SO-CNNs, including the strategies introduced in Sections 3.2 and 3.3, on the large-scale MINC material recognition dataset. This choice was motivated by the fact that traditional second-order descriptors have proven particularly effective for tasks such as material or texture recognition [13, 14, 21]. Below, we briefly describe this dataset and the architectures we used, and then evaluate different versions of our approach and compare it to the state-of-the-art.
MINC-2500 is a large-scale material recognition dataset containing 23 classes of different materials, some of which are shown in Fig. 5. For each class, there are 2500 (362 × 362) RGB images. We split the dataset into training, validation and test samples. Unlike other small-scale material databases [4, 6, 33], the images contain not only the material but also its surrounding environment, thus making the dataset more challenging. To augment the data, we used horizontal flips, and random cropping to (224 × 224) patches, thus matching the standard input size of our base CNN architectures described below.
The size of this dataset makes it well-suited to recent deeper architectures, such as the VGG [34] and the ResNet [16]. In particular, we use the VGG16 model (configuration C in [34]). For ResNet, we employ the ResNet50, with 50 convolutional layers. To compose our second-order networks, we replace the fully-connected layers and the last average pooling layer with our CDUs. We then attach one fully-connected layer of dimension 23 with softmax activation to obtain the final class probabilities. For both VGG and ResNet, to reduce over-fitting, we constrain the weights of the O2T layers to be orthonormal. For the comparison to be fair, and following [2], the weights of the common convolutional layers of both first- and second-order models are initialized with weights pre-trained on ImageNet. In the following experiments, the CDUs all have 3 O2T layers, with dimensions set to D, D and D, where D is the dimension of the covariance matrix. Note that this does not match the best strategy of Section 4.1, which consisted of doubling the dimension; however, applying that strategy here would result in a final dimension of 8D, which would significantly increase the computational cost. The PV output dimension is the same as that of the last O2T layer.

Figure 5. Samples from the MINC-2500 dataset.

Learning Strategy.

To train our SO-CNNs (SO-VGG16 and SO-ResNet50), we first freeze the convolutional layers and train the second part of the networks for a few (2-4) epochs, and then fine-tune the whole network in an end-to-end manner. For both SO-VGG16 and SO-ResNet50, we set separate initial learning rates for second-order training and for fine-tuning, and reduce them by a factor of 4 when learning plateaus. As mentioned in Section 3.2, we observed empirically that, during our two-stage learning strategy, we could successfully train the second-order part of the network, but fine-tuning the entire network failed. This, we believe, is due to the fact that, in the first phase, no gradient is backpropagated between the first- and second-order parts of the network. To overcome this, we therefore introduce the additional 1 × 1 convolutional transition layer of Section 3.2.
To train our SO-CNNs (SO-VGG16 and SO-ResNet50), we first freeze the convolutional layers and train the second-order part of the networks for a few (2-4) epochs, and then fine-tune the whole network in an end-to-end manner. For both SO-VGG16 and SO-ResNet50, the initial learning rates for second-order training and for fine-tuning are set separately, and are reduced by a factor of 4 when learning plateaus. As mentioned in Section 3.2, we observed empirically that, with this two-stage learning strategy, we could successfully train the second-order part of the network, but fine-tuning the entire network failed. We believe this is due to the fact that, in the first phase, no gradient is backpropagated between the first- and second-order parts of the network. To overcome this, we therefore introduce an additional 1 × 1 convolutional layer.

Models              Fusion     Accuracy
n × CDU             V-sum
n × CDU             V-avg
n × CDU             V-concat
n × CDU             V-concat
n × CDU             V-concat
n × CDU             D-sum
n × CDU             D-avg
n × CDU             D-concat
n × CDU + Robust    -
n × CDU + Robust    V-concat
n × CDU + Robust    D-concat

Table 3. Comparison of different SO-VGG16 designs. Robust indicates the use of a robust covariance estimate; n × CDU indicates that the convolutional feature maps are split into n groups with one CDU each; V- indicates that fusion occurs in vector space, while D- stands for descriptor space.
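The two-stage schedule and the plateau-based learning-rate reduction described above can be sketched in plain Python. This is a schematic sketch: the parameter names, the patience and the improvement threshold are our own assumptions; only the reduction factor of 4 and the few warm-up epochs come from the text.

```python
def reduce_on_plateau(lr, val_losses, factor=4.0, patience=3, min_delta=1e-4):
    """Divide lr by `factor` if the last `patience` epochs brought no
    improvement of at least `min_delta` over the best earlier loss."""
    if len(val_losses) <= patience:
        return lr
    best_before = min(val_losses[:-patience])
    if min(val_losses[-patience:]) > best_before - min_delta:
        return lr / factor
    return lr

def two_stage_schedule(params, warmup_epochs=3):
    """Yield, per epoch, the parameter names to update: stage 1 freezes the
    convolutional layers, stage 2 fine-tunes everything end-to-end."""
    second_order = [n for n in params if not n.startswith("conv")]
    for _ in range(warmup_epochs):
        yield second_order              # stage 1: second-order part only
    while True:
        yield list(params)              # stage 2: whole network

# illustrative parameter names for an SO-CNN
params = ["conv1", "conv2", "cov", "o2t1", "o2t2", "o2t3", "pv", "fc"]
sched = two_stage_schedule(params, warmup_epochs=2)
assert next(sched) == ["cov", "o2t1", "o2t2", "o2t3", "pv", "fc"]

# the learning rate is divided by 4 once the validation loss plateaus
lr = 1e-3
for losses in ([0.9], [0.9, 0.8], [0.9, 0.8, 0.81, 0.82, 0.80]):
    lr = reduce_on_plateau(lr, losses)
assert abs(lr - 2.5e-4) < 1e-12
```

In a real framework, stage 1 would correspond to disabling gradient updates for the convolutional weights rather than selecting parameter names by hand.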
Results on the VGG16-based models

Settings        Params     Accuracy
VGG16 [34]      237M
1 × 1 + FCs     237.64M
MatBP [20]      20.77M
SPD-net [17]    0.253M
Our best        15.21M

Table 4. Baseline comparison on MINC-2500 for the VGG16-based models. We outperform all the baselines significantly, and rely on roughly 90% fewer parameters than the first-order CNN.
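As a concrete illustration of the three operations inside a CDU (covariance extraction from convolutional activations, the parametric second-order transform O2T, and the parametric vectorization PV), here is a minimal NumPy sketch. The function names are ours, the PV layer is simplified to a single hypothetical projection, and the real layers additionally require SPD-aware backpropagation; the sketch only checks that an orthonormal O2T, as used in our models, preserves the eigenvalues of the covariance matrix.

```python
import numpy as np

def cov_layer(X):
    """Covariance of convolutional activations.
    X: (N, D) matrix of N spatial locations with D channels."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc.T @ Xc / (X.shape[0] - 1)          # (D, D), symmetric PSD

def o2t_layer(S, W):
    """Parametric second-order transform of an SPD matrix: W^T S W."""
    return W.T @ S @ W

def pv_layer(S, w):
    """Hypothetical simplified parametric vectorization: project S onto w."""
    return S @ w                                  # (D,) vector

rng = np.random.default_rng(0)
X = rng.standard_normal((49, 8))                  # e.g. a 7x7 map with 8 channels
S = cov_layer(X)

# orthonormal O2T weights (via QR), dimension kept at D as in the text
W, _ = np.linalg.qr(rng.standard_normal((8, 8)))
S2 = o2t_layer(S, W)
v = pv_layer(S2, rng.standard_normal(8))

# an orthonormal, square O2T preserves symmetry and eigenvalues
assert np.allclose(S2, S2.T)
assert np.allclose(np.linalg.eigvalsh(S2), np.linalg.eigvalsh(S))
assert v.shape == (8,)
```

In the actual networks the O2T weights are learned, with orthonormality enforced during optimization, and several O2T layers are stacked before the PV layer.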
Robust Estimation & Fusion of CDUs.
In Section 3.3, we introduced two strategies to handle high-dimensional feature maps within our SO-CNNs: making use of robust covariance estimates and exploiting multiple CDUs. In the latter case, we also proposed several ways to fuse the multiple CDUs into a single representation, consisting of summing, averaging or concatenating either the vectors output by the CDUs or the final second-order descriptors. We denote these fusion strategies by {V,D}-sum, {V,D}-avg and {V,D}-concat, respectively, for the vector (V) and descriptor (D) cases. We report the results of these different strategies in Table 3. These results show that (i) making use of multiple CDUs is typically more effective than relying on a robust covariance estimate; (ii) using more than 2 CDUs has little impact; and (iii) fusing at the level of second-order descriptors (D) is more effective than at the level of vectors, particularly via concatenation.

Comparison to the Baselines.
In Tables 4 and 5, we compare the results of our best SO-VGG16 and the corresponding SO-ResNet50 to the first-order CNNs and to the MatBP [20] and SPD-net [17] baselines. Since the SPD-net and MatBP models do not implement any robust covariance estimation, we reduced the dimensionality of their feature maps to 512 using two 1 × 1 convolutional layers. As in the CIFAR-10 case, our end-to-end approach significantly outperforms MatBP and SPD-net, thus showing the benefits of our framework over simpler second-order-based approaches. Our best SO-VGG16 model also outperforms the first-order VGG16 by a significant margin, while relying on far fewer parameters. Note that this also holds for most of the architectures tested in the previous experiment.

Results on the ResNet-based models

Settings        Params     Accuracy
ResNet50 [16]   23.63M
1 × 1 + FCs     26.17M
MatBP [20]      32.26M
SPD-net [17]    2.97M
Our best        26.00M

Table 5. Baseline comparison on MINC-2500 for the ResNet50-based models. Note that our SO-ResNet50 again outperforms the second-order-based baselines and the first-order one, although by a smaller margin. We believe that investigating a residual second-order strategy could further improve our results.
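To make the fusion variants of Table 3 concrete, here is a minimal NumPy sketch. The two CDU outputs are random stand-ins, and the block-diagonal realization of D-concat is our own assumption (one way to keep the fused descriptor a square matrix), not necessarily the construction used in the networks.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                   # toy descriptor dimension
# stand-ins for the outputs of two CDUs: second-order descriptors (D x D)
# and their vectorized forms after the PV layer
descs = [np.cov(rng.standard_normal((30, D)), rowvar=False) for _ in range(2)]
vecs = [rng.standard_normal(D) for _ in range(2)]

# vector-space fusion: V-sum, V-avg, V-concat
v_sum = np.sum(vecs, axis=0)
v_avg = np.mean(vecs, axis=0)
v_concat = np.concatenate(vecs)         # dimension grows to 2D

# descriptor-space fusion: D-sum, D-avg, D-concat (block-diagonal here)
d_sum = np.sum(descs, axis=0)
d_avg = np.mean(descs, axis=0)
d_concat = np.block([[descs[0], np.zeros((D, D))],
                     [np.zeros((D, D)), descs[1]]])

assert v_concat.shape == (2 * D,)
assert d_concat.shape == (2 * D, 2 * D)
assert np.allclose(d_sum, 2 * d_avg)
```

Per Table 3, descriptor-space concatenation is the most effective of these variants.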
5. Conclusion
In this paper, we have introduced an end-to-end learning framework that integrates second-order information for image recognition. To this end, we have developed new layer types and addressed the practical difficulties that arise when dealing with covariance matrices. Our experiments have demonstrated that our framework can outperform first-order networks and other second-order-based baselines. In the future, we will explore alternative learning strategies for this type of architecture. We hope that our research will inspire others to investigate architectures that go beyond the standard first-order ones.

References

[1] Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine, 2006.
[2] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the Materials in Context Database. In CVPR, 2015.
[3] Michael Calonder, Vincent Lepetit, Mustafa Özuysal, Tomasz Trzcinski, Christoph Strecha, and Pascal Fua. BRIEF: Computing a Local Binary Descriptor Very Fast. IEEE TPAMI, 2012.
[4] B. Caputo, E. Hayman, and P. Mallikarjuna. Class-specific material categorisation. In ICCV, 2005.
[5] Anoop Cherian and Suvrit Sra. Riemannian Sparse Coding for Positive Definite Matrices. In ECCV, 2014.
[6] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing Textures in the Wild. In CVPR, 2014.
[7] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 2011.
[8] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[9] Kai Guo, Prakash Ishwar, and Janusz Konrad. Action Recognition Using Sparse Representation on Covariance Manifolds of Optical Flow. In IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2010.
[10] M. Harandi and B. Fernando. Generalized BackPropagation, Étude De Cas: Orthogonality. arXiv preprint, 2016.
[11] M. T. Harandi, M. Salzmann, and R. Hartley. From manifold to manifold: Geometry-aware dimensionality reduction for SPD matrices. In ECCV, 2014.
[12] Mehrtash Harandi and Mathieu Salzmann. Riemannian coding and dictionary learning: Kernels to the rescue. In CVPR, 2015.
[13] Mehrtash Harandi, Mathieu Salzmann, and Fatih Porikli. Bregman Divergences for Infinite Dimensional Covariance Matrices. In CVPR, 2014.
[14] Mehrtash Harandi, Mathieu Salzmann, and Fatih Porikli. Bregman Divergences for Infinite Dimensional Covariance Matrices. In CVPR, 2014.
[15] Mehrtash Tafazzoli Harandi, Conrad Sanderson, Richard I. Hartley, and Brian C. Lovell. Sparse Coding and Dictionary Learning for Symmetric Positive Definite Matrices: A Kernel Approach. In ECCV, 2012.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[17] Zhiwu Huang and Luc J. Van Gool. A Riemannian Network for SPD Matrix Learning. In AAAI, 2017.
[18] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xianqiu Li, and Xilin Chen. Log-Euclidean Metric Learning on Symmetric Positive Definite Manifold with Application to Image Set Classification. In ICML, 2015.
[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.
[20] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix Backpropagation for Deep Networks with Structured Layers. In ICCV, 2015.
[21] Sadeep Jayasumana, Richard I. Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Tafazzoli Harandi. Kernel Methods on the Riemannian Manifold of Symmetric Positive Definite Matrices. In CVPR, 2013.
[22] B. Julesz, E. N. Gilbert, L. A. Shepp, and H. L. Frisch. Inability of Humans to Discriminate between Visual Textures That Agree in Second-Order Statistics—Revisited. Perception, 2(4):391–405, 1973.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[24] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[26] Peihua Li, Qilong Wang, Wangmeng Zuo, and Lei Zhang. Log-Euclidean Kernels for Sparse Representation and Dictionary Learning. In ICCV, 2013.
[27] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[28] Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian Framework for Tensor Computing. IJCV, 2006.
[29] Minh Ha Quang, Marco San-Biagio, and Vittorio Murino. Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces. In NIPS, 2014.
[30] Adriana Romero, Nicolas Ballas, Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[31] Tim Salimans and Diederik P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In NIPS, 2016.
[32] Bernt Schiele and James L. Crowley. Recognition without Correspondence using Multidimensional Receptive Field Histograms. IJCV, 2000.
[33] L. Sharan, R. Rosenholtz, and E. Adelson. Material perception: What can you see in a brief glance? Journal of Vision, 2009.
[34] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
[35] Suvrit Sra. A new metric on the manifold of kernel matrices with application to matrix geometric means. In NIPS, 2012.
[36] Suvrit Sra and Anoop Cherian. Generalized Dictionary Learning for Symmetric Positive Definite Matrices with Application to Nearest Neighbor Retrieval. In Machine Learning and Knowledge Discovery in Databases. Springer, 2011.
[37] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
[38] Oncel Tuzel, Fatih Porikli, and Peter Meer. Region Covariance: A Fast Descriptor for Detection and Classification. In ECCV, 2006.
[39] Oncel Tuzel, Fatih Porikli, and Peter Meer. Pedestrian Detection via Classification on Riemannian Manifolds. IEEE TPAMI, 2008.
[40] Qilong Wang, Peihua Li, Wangmeng Zuo, and Lei Zhang. RAID-G: Robust Estimation of Approximate Infinite Dimensional Gaussian with Application to Material Recognition. In CVPR, 2016.
[41] Xin Xu, Nan Mu, Xiaolong Zhang, and Bo Li. Covariance descriptor based convolution neural network for saliency computation in low contrast images. In IJCNN, 2016.
[42] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint, 2012.