Learning Semantically Enhanced Feature for Fine-Grained Image Classification
IEEE SIGNAL PROCESSING LETTERS, VOL. 27, 2020
Wei Luo*, Member, IEEE, Hengmin Zhang*, Jun Li, and Xiu-Shen Wei
Abstract—We aim to provide a computationally cheap yet effective approach for fine-grained image classification (FGIC) in this letter. Unlike previous methods that rely on complex part localization modules, our approach learns fine-grained features by enhancing the semantics of sub-features of a global feature. Specifically, we first achieve sub-feature semantics by arranging the feature channels of a CNN into different groups through channel permutation. Meanwhile, to enhance the discriminability of sub-features, the groups are guided to be activated on object parts with strong discriminability by a weighted-combination regularization. Our approach is parameter parsimonious and can be easily integrated into the backbone model as a plug-and-play module for end-to-end training with only image-level supervision. Experiments verify the effectiveness of our approach and validate its comparable performance to state-of-the-art methods. Code is available at https://github.com/cswluo/SEF
Index Terms—Image classification, visual categorization, feature learning
I. INTRODUCTION

FINE-grained image classification (FGIC) concerns the task of distinguishing subordinate categories of some base classes, such as dogs [1], birds [2], cars [3], and aircraft [4]. Due to large intra-class pose variation and high inter-class appearance similarity, as well as the scarcity of annotated data, it is challenging to solve the FGIC problem efficiently.

Recent studies have shown great interest in tackling the FGIC problem by unifying part localization and feature learning in an end-to-end CNN [5], [6], [7], [8], [9], [10], [11]. For example, [12], [13], and [5] first crop object parts from input images and feed them into models for feature extraction, where the part locations are obtained from the image-level features extracted by a convolutional network. To ease model optimization, [8] and [14] produce part features by weighting feature channels with soft attentions. Another line of research divides feature channels into several groups, each corresponding to a semantic part of the input image [6], [9]. However, these methods usually make model optimization more difficult, since they either rely on complex part localization modules, introduce a large number of parameters into their backbone models, or require
Submitted date: 07/05/2020. This work was supported in part by NSFC under grant 61702197, and in part by NSFGD under grants 2020A151501813 and 2017A030310261. Wei Luo is with South China Agricultural University, Guangzhou, 510000, China (email: [email protected]). Hengmin Zhang is with East China University of Science and Technology, Shanghai, 200237, China (email: [email protected]). Jun Li and Xiu-Shen Wei are with Nanjing University of Science and Technology, Nanjing, 210094, China (email: {junli,weixs}@njust.edu.cn). * indicates equal contribution. Wei Luo is the corresponding author.

a separate module to guide the learning of the feature channel grouping.

In this letter, we propose a computationally cheap yet effective approach that learns fine-grained features by improving the semantics and discriminability of sub-features. It includes two core components: a semantic grouping module that arranges feature channels with similar properties into the same group to represent a semantic part of the input image, and a feature enhancement module that improves the discriminability of the grouped features by guiding them to be activated on object parts with strong discriminability. The two components can be easily implemented as a plug-and-play module in modern CNNs, without guided initialization or modification of the backbone network structure. By coupling the two components, our approach generates powerful and distinguishable fine-grained features without requiring complex part localization modules.

Concretely, we construct semantic groups in the last convolutional layer of a CNN by arranging feature channels through a permutation matrix, which is learned implicitly by regularizing the relationships between feature channels, i.e., maximizing the correlations between feature channels in the same predefined group and decorrelating those in different predefined groups.
Thus, our feature channel grouping does not introduce any additional parameters into the backbone model. Compared to [9] and [6], our strategy groups feature channels more consistently and requires no guided initialization. To guide the semantic groups to be activated on object parts with strong discriminability, we introduce a regularization method that employs a weighted combination of maximum entropy learning and knowledge distillation. The weighted combination is derived from matching the prediction distributions of the global feature and its group-wise sub-features. This regularization introduces only a small number of parameters into the backbone model, needed to output the prediction distributions of the sub-features. Overall, by coupling the two components, our approach efficiently obtains fine-grained features with strong discriminability. Besides, it can be easily integrated into the backbone model as a plug-and-play module for end-to-end training with only image-level supervision. Our contributions are summarized as follows:

• We propose a computationally cheap FGIC approach that achieves comparable performance to the state-of-the-art methods with only . % more parameters than its ResNet-50 backbone on the Birds dataset.

• We propose to achieve part localization by learning semantic groups of feature channels, which requires neither guided initialization nor extra parameters.

• We propose to enhance feature discriminability by guiding its sub-features to be extracted from object parts with strong discriminability.

Fig. 1. Overview of our approach. The last-layer convolutional feature channels (depicted by the mixed color block) of the CNN are arranged into different groups (represented by different colors) by our semantic grouping module. The global feature and its sub-features (group-wise features) are obtained from the arranged feature channels by average pooling. The light yellow block in the gray block denotes the predicted class distributions from the corresponding sub-features, which are regularized by the output of the global feature through knowledge distillation. All gray blocks are effective only in the training stage and are removed in the testing stage. The details of the CNN are omitted for clarity. Best viewed in color.

II. PROPOSED APPROACH
Our approach involves two main components: 1) a semantic grouping module that arranges feature channels with different properties into different groups; 2) a feature enhancement module that elevates feature performance by improving the discriminability of its sub-features. Fig. 1 gives an overview of our approach.
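As an illustration of this plug-and-play design, the following is a minimal PyTorch-style sketch of a head with one global classifier and G group-wise classifiers over equal slices of the pooled feature. The class name, channel count, class count, and group count are all illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SEFHead(nn.Module):
    """Sketch: one global classifier plus G group-wise classifiers
    over equal-size slices of the average-pooled feature."""
    def __init__(self, channels=2048, num_classes=200, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.global_fc = nn.Linear(channels, num_classes)
        self.group_fcs = nn.ModuleList(
            nn.Linear(channels // groups, num_classes) for _ in range(groups)
        )

    def forward(self, feat):  # feat: (B, C, H, W) last-layer feature map
        pooled = self.pool(feat).flatten(1)        # (B, C) global feature
        global_logits = self.global_fc(pooled)     # used for prediction at test time
        chunks = pooled.chunk(self.groups, dim=1)  # group-wise sub-features
        group_logits = [fc(c) for fc, c in zip(self.group_fcs, chunks)]
        return global_logits, group_logits

head = SEFHead(channels=64, num_classes=10, groups=4)
g, gs = head(torch.randn(2, 64, 7, 7))
```

At test time only `global_logits` would be used, matching the description that all auxiliary branches are removed after training.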
A. Semantic Grouping
Previous work [15] has verified that bunches of filters in the high-level layers of CNNs are required to represent a semantic concept. We therefore develop a regularization method that arranges filters with different properties into different groups to capture semantic concepts. Specifically, given a convolutional feature X^L ∈ R^{C×WH} of layer L, where a single feature channel is X_i^L ∈ R^{WH}, i ∈ [1, ..., C], we first arrange the feature channels through a permutation operation X^{L'} = A X^L, where A ∈ R^{C×C} is a permutation matrix, and then divide the channels into G groups. Since X^L is obtained by convolving the filters of layer L with the features of layer L−1, the convolution operation can be formulated as

X^L = B X^{L−1},   (1)

where B ∈ R^{C×Ω} and X^{L−1} ∈ R^{Ω×Ψ} are, respectively, the reshaped filters of layer L and the reshaped feature of layer L−1. Thus X^{L'} can be rewritten as

X^{L'} = A X^L = A B X^{L−1} = W X^{L−1},   (2)

where W is a permutation of B. To achieve groups with semantic meaning, A should be learned to discover the similarities between the filters (rows) of B. It is, however, nontrivial to learn the permutation matrix directly. We therefore instead learn W by constraining the relationships between the feature channels of X^{L'}, thus circumventing the difficulty of learning A. To this end, we maximize the correlation between feature channels in the same group while decorrelating those in different groups. Concretely, let X̃_i^{L'} ← X_i^{L'} / ||X_i^{L'}|| be a normalized channel. The correlation between channels is then defined as

d_ij = X̃_i^{L'T} X̃_j^{L'},   (3)

where T denotes transposition. Let D ∈ R^{G×G} be the correlation matrix with element D_mn = (1 / (C_m C_n)) Σ_{i∈m, j∈n} d_ij, the average correlation of feature channels from groups m and n, where m, n ∈ {1, ..., G} and C_m is the number of channels in group m.
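The channel correlations d_ij and the group correlation matrix D above can be computed as in the following NumPy sketch. The equal-size grouping, the small epsilon for normalization, and the loss form (sum of squared cross-group correlations, one plausible reading of the grouping regularizer) are assumptions for illustration.

```python
import numpy as np

def group_correlation(feat, groups):
    """feat: (C, H*W) last-layer channels; returns the G x G matrix D of
    average pairwise channel correlations between (and within) groups."""
    C, _ = feat.shape
    norm = feat / (np.linalg.norm(feat, axis=1, keepdims=True) + 1e-8)
    d = norm @ norm.T              # d_ij: normalized channel correlations
    size = C // groups             # assume equal-size groups
    D = np.zeros((groups, groups))
    for m in range(groups):
        for n in range(groups):
            block = d[m * size:(m + 1) * size, n * size:(n + 1) * size]
            D[m, n] = block.mean() # D_mn: average correlation of groups m, n
    return D

def group_loss(D):
    """Penalize cross-group correlation while leaving within-group
    (diagonal) terms free -- an illustrative reading of the grouping loss."""
    return (np.linalg.norm(D, 'fro') ** 2 - np.linalg.norm(np.diag(D)) ** 2) / 2

feat = np.random.randn(8, 49)      # 8 channels over a 7x7 map, toy data
D = group_correlation(feat, groups=4)
```

Since the penalized quantity is a sum of squared off-diagonal entries, `group_loss` is always non-negative and vanishes exactly when groups are fully decorrelated.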
The semantic groups can then be achieved by minimizing

L_group = (1/2) ( ||D||_F^2 − ||diag(D)||_2^2 ),   (4)

where diag(·) extracts the main diagonal of a matrix. In practice, Eq. 2 can be implemented as a convolutional operation in CNNs, and Eq. 4 can be imposed on feature channels without introducing any additional parameters. Different from [11], in which a semantic mapping matrix is explicitly learned to reduce the redundancy of bilinear features, our strategy focuses on improving the semantics of sub-features and does not modify the backbone network structure.

B. Feature Enhancement
Semantic grouping drives the features of different groups to be activated on different semantic (object) parts. However, the discriminability of those parts is not guaranteed. We therefore need to guide the semantic groups to be activated on object parts with strong discriminability. A simple way to achieve this is to match the prediction distributions between the object and its parts, as implemented in [14]. However, it is unclear why matching distributions improves performance. Here, we provide an analysis to understand its principles and make improvements to achieve better performance.

Let P_w and P_a be the prediction distributions of an object and its part, respectively. Matching their distributions can be achieved by minimizing the KL divergence between them [14],

L_KL(P_w || P_a) = −H(P_w) + H(P_w, P_a),   (5)

where H(P_w) = −Σ P_w log P_w and H(P_w, P_a) = −Σ P_w log P_a. Thus, the optimization objective of a classification task can generally be written as

L = L_cr + λ L_KL,   (6)

where L_cr is the cross-entropy loss and λ is the balance weight. Substituting Eq. 5 into Eq. 6, we have

L = L_cr − λ H(P_w) + λ H(P_w, P_a).   (7)

Eq. 7 implies that matching prediction distributions decomposes into a maximum entropy term and a knowledge distillation term, both of which are powerful regularization methods [16], [17]. In FGIC, maximum entropy learning can effectively reduce the confidence of classifiers, leading to better generalization in low data-diversity scenarios [18]. The last term in Eq. 7 distills knowledge from the global feature to the local feature, thus enhancing the discriminability of local features. We can therefore regulate the importance of the two terms separately for better performance.
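The decomposition of the KL matching loss into an entropy term and a cross-entropy (distillation) term can be checked numerically. The distributions and the weights `lam`, `gam` below are illustrative values, not those used in the paper.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q + 1e-12))

p_w = np.array([0.7, 0.2, 0.1])  # global-feature (object) prediction
p_a = np.array([0.5, 0.3, 0.2])  # sub-feature (part) prediction

# KL(P_w || P_a) = -H(P_w) + H(P_w, P_a), i.e. Eq. (5)
kl = -entropy(p_w) + cross_entropy(p_w, p_a)

# Decoupled form of Eq. (7): regulate the two terms separately
lam, gam = 1.0, 0.5  # illustrative weights
reg = -lam * entropy(p_w) + gam * cross_entropy(p_w, p_a)
```

The identity holds because Σ p log(p/q) = Σ p log p − Σ p log q, so the single KL weight λ in Eq. 6 is simply split into two independent weights.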
Putting everything together, our optimization objective can be formulated as

L = E_x ( L_cr − λ H(P_w) + (γ/G) Σ_{g=1}^{G} H(P_w, P_a^g) + φ L_group ).   (8)

Here, λ, γ, φ, and G are hyper-parameters, and we omit the dependence on x for clarity (see Table III for evaluation). In our implementation, the last-layer feature channels are first average-pooled and then fed simultaneously into G+1 neural networks (NNs) that output class distributions, one taking the global feature as input and the others taking only the features of the corresponding groups. Only the NN taking the global feature is used for prediction (see Fig. 1).

III. EXPERIMENTS
A. Experimental Setup
We employ ResNet-50 [19] as the backbone of our approach in PyTorch and experiment on the CUB-Birds [2], Stanford Cars [3], Stanford Dogs [1], and FGVC-Aircraft [4] datasets (see the supplementary materials (SMs) for data statistics). We initialize our model with weights pretrained on ImageNet [20] and fine-tune all layers on batches of 32 images of size × by SGD [21] with momentum 0.9. Random flipping is employed only in training. The G+1 NNs are all single-layer networks. The initial learning rate is . , except on Dogs where . is used, and it decays by . every  epochs, with a total of  training epochs. λ, γ, and φ are validated on the Birds validation set, which contains  of the training samples, and are correspondingly set to , . , and  across all datasets. G is determined by the performance of models with different values and is respectively set to , , and  on Aircraft, Birds, and the other two datasets in this letter.

Metrics. Besides classification accuracy, we also employ scores to rank the overall performance of a method across datasets. Given the performance of method m on L datasets, the score of method m is S_m = (1/L) Σ_{l=1}^{L} R_ml, where R_ml is the rank of method m on the l-th dataset based on its classification accuracy. The closer the score is to 1, the better.

B. Comparison with the State-of-the-Art
To be fair, we only compare with weakly-supervised methods employing the ResNet-50 backbone, owing to its popularity and state-of-the-art performance in recent FGIC work. Note that we are not attempting to achieve the best performance, but to emphasize the advantage brought by our simple construction.

Complexity analysis.
Our approach introduces additional parameters into the backbone network only for knowledge distillation, where the total number of additional parameters is bounded by the product of the feature dimensionality and the number of classes. For example, , new parameters are introduced on the Birds dataset, which accounts for only . % of the parameters of ResNet-50. Since the additional parameters are negligible, the network is efficient to train. Compared with computation-intensive methods such as S3N [10], TASN [9], API-Net [22], and DCL [23] (which require 60, 90, 100, and 180 training epochs, respectively), our approach can be optimized in 50 epochs. During testing, only the backbone network is active, with all additional modules removed. Compared with its ResNet-50 backbone, our approach boosts performance by + . % on average with the same inference-time cost, which indicates the practical value of our approach.

TABLE I: COMPARISON WITH STATE-OF-THE-ART METHODS (%).

Method                 Birds  Cars  Dogs  Aircraft  Scores
Kernel-Pooling⋆ [24]    .      .     −     .         .
MAMC-CNN [8]            .      .     .     −         .
DFB-CNN⋆ [7]            .      .     −     .         .
NTS-Net† [13]           .      .     −     .         .
S3N† [10]               .      .     −     .         .
API-Net [22]            .      .     .     .         .
DCL [23]                .      .     −     .         .
TASN† [9]               .      .     −     −         .
Cross-X [14]            .      .     .     .         .
ResNet-50‡ [19]         .      .     .     .         .
MaxEnt-CNN‡ [18]        .      .     .     .         .
DBT-Net [11]            .      .     −     .         .
SEF (ours)              .      .     .     .         .

⋆, † and ‡ represent methods with separated initialization, multi-cropping operations, and results from our re-implementation, respectively. The closer the score is to 1, the better.

Performance comparison.
Table I shows that no single method achieves the best performance on all datasets. Our approach (SEF) attains a score of . , ranking th in overall performance, which is comparable to the state-of-the-art methods, especially considering its simple construction. The methods in the second group are closely related to ours. Compared to ResNet-50 and MaxEnt-CNN, SEF boosts performance on all datasets. Besides, SEF achieves more robust performance than DBT-Net. Among the other methods, S3N, API-Net, DCL, and Cross-X rank before ours; however, they are more expensive, e.g., Cross-X uses twice as many parameters as ResNet-50.

Table II shows the performance of our approach on different backbones. It is worth noting that we did not cross-validate the hyper-parameters for these new backbones; instead, we reused those validated for ResNet-50 on Birds and applied them to all datasets without modification. Thus, better results than those reported here can be expected. The results demonstrate that our approach generalizes robustly to different backbones and datasets (please refer to the SMs for more details).

TABLE II: PERFORMANCE ON DIFFERENT BACKBONES (%).

           VGG16 (+SEF)   ResNet-50 (+SEF)   ResNeXt-50 (+SEF)
Birds      77.7 (81.1)    84.5 (87.3)        86.8 (87.8)
Cars       87.3 (88.3)    92.9 (94.0)        93.7 (94.2)
Dogs       71.1 (75.4)    88.1 (88.8)        89.8 (90.8)
Aircraft   87.0 (88.5)    90.3 (92.1)        92.0 (92.6)

Fig. 2. From left to right: correlation matrices of feature channels of models with 3, 5, and 7 groups, averaged over 64 images randomly selected from the Birds testing set.
C. Ablation Studies
For simplicity, the remaining experiments and analyses are performed on ResNet-18 unless otherwise specified.
Effectiveness of individual modules.
Table III shows the results of our approach under different configurations. It reveals that learning semantic groups (row 2) or matching distributions (row 3) independently improves performance slightly. Combining both directly (row 4) brings some benefit, but the improvement is not systematically consistent. The performance, however, can be significantly improved by decomposing matching distributions into separate regularizers (row 5), which indicates the effectiveness of feature enhancement, i.e., guiding semantic groups to be activated on object parts with strong discriminability.
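The configurations in Table III differ only in how the terms of Eq. 8 are weighted, which can be sketched per sample as follows. The probability vectors, the cross-entropy value, the grouping-loss value, and the hyper-parameter values are all illustrative, not the validated settings from the paper.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def xent(p, q):
    return -np.sum(p * np.log(q + 1e-12))

def sef_objective(ce, p_w, p_subs, l_group, lam, gam, phi):
    """Eq. (8) for one sample: cross entropy, maximum-entropy term on the
    global prediction, distillation to each sub-feature, grouping loss."""
    kd = np.mean([xent(p_w, p_g) for p_g in p_subs])  # (gamma/G) * sum over g
    return ce - lam * entropy(p_w) + gam * kd + phi * l_group

p_w = np.array([0.8, 0.15, 0.05])                       # global prediction
p_subs = [np.array([0.6, 0.3, 0.1]),                    # sub-feature predictions
          np.array([0.7, 0.2, 0.1])]
# illustrative values; setting lam = gam = phi = 0 recovers plain cross entropy
loss = sef_objective(1.2, p_w, p_subs, l_group=0.05,
                     lam=1.0, gam=0.5, phi=1.0)
```

Rows 2–5 of Table III correspond to zeroing different subsets of `lam`, `gam`, and `phi` in this objective.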
TABLE III: PERFORMANCE WITH DIFFERENT OPTIONS (%).

Config                        Birds  Cars  Dogs  Aircraft
ResNet-18 [19]                 .      .     .     .
λ = 0, γ = 0, φ = 1            82.    .     .     .
λ = 0. , γ = 0. , φ = 0        82.    .     .     .
λ = 0. , γ = 0. , φ = 1        83.    .     .     .
λ = 1, γ = 0. , φ = 1          .      .     .     .

Number of semantic groups.
The semantics of a group is strongly correlated with its discriminability. However, too many groups may break the correlation, resulting in weak group features. Fig. 2 shows the correlations between feature channels of the last convolutional layer, and Fig. 3 depicts the discriminability of the group features of models with varying numbers of groups. They illustrate that our semantic grouping module effectively groups correlated feature channels and separates uncorrelated ones (Fig. 2), and that feature channels should not be divided into too many groups, which on the one hand makes optimization difficult (Fig. 2) and, on the other hand, reduces the semantics as well as the discriminability of each group (Fig. 3).

Fig. 3. Discriminability of the 1st group features of models at different learning epochs, tested on the Birds validation set. ng denotes the number of groups used in the model.

Fig. 4. Group-wise activation maps of models with 2 semantic groups, superimposed on the original images. The 1st, 2nd, and 3rd rows are, respectively, the original images and the activation maps of the 1st and 2nd semantic groups.

D. Visualization
The discriminability of semantic groups (sub-features) can be visualized by highlighting the corresponding areas on input images. Ideally, every group should be activated by a set of proximate pixels, owing to the similar properties of the neurons in the group. Fig. 4 shows that different groups are activated by different semantic parts of images, which are highly identifiable, such as the wing and engine nacelles of the aircraft. This signifies the success of our approach in enhancing features semantically.

IV. CONCLUSION
In this letter, we proposed a computationally cheap yet effective approach for FGIC that comprises a semantic grouping module and a feature enhancement module. We empirically studied the effectiveness of each individual module and their combined effect through ablation studies, as well as the relationship between the number of groups and the semantic integrity of each group. Comparable performance to the state-of-the-art methods and low computational cost make our approach readily applicable in FGIC applications.
REFERENCES

[1] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, "Novel dataset for fine-grained image categorization," in First Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR, 2011.
[2] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," California Institute of Technology, Tech. Rep., 2011.
[3] J. Krause, M. Stark, J. Deng, and F.-F. Li, "3D object representations for fine-grained categorization," in , 2013.
[4] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, "Fine-grained visual classification of aircraft," arXiv preprint arXiv:1306.5151, 2013.
[5] J. Fu, H. Zheng, and T. Mei, "Recurrent attention convolutional neural network for fine-grained image recognition," in CVPR, 2017.
[6] H. Zheng, J. Fu, T. Mei, and J. Luo, "Learning multi-attention convolutional neural network for fine-grained image recognition," in ICCV, 2017.
[7] Y. Wang, V. I. Morariu, and L. S. Davis, "Learning a discriminative filter bank within a CNN for fine-grained recognition," in CVPR, 2018.
[8] M. Sun, Y. Yuan, F. Zhou, and E. Ding, "Multi-attention multi-class constraint for fine-grained image recognition," in ECCV, 2018.
[9] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, "Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition," in CVPR, 2019.
[10] Y. Ding, Y. Zhou, Y. Zhu, Q. Ye, and J. Jiao, "Selective sparse sampling for fine-grained image recognition," in ICCV, 2019.
[11] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, "Learning deep bilinear transformation for fine-grained image representation," in NIPS, 2019.
[12] X. Liu, T. Xia, J. Wang, and Y. Lin, "Fully convolutional attention localization networks," arXiv preprint arXiv:1603.06765, 2016.
[13] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang, "Learning to navigate for fine-grained classification," in ECCV, 2018.
[14] W. Luo, X. Yang, X. Mo, Y. Lu, L. S. Davis, J. Li, J. Yang, and S.-N. Lim, "Cross-X learning for fine-grained visual categorization," in ICCV, 2019.
[15] R. Fong and A. Vedaldi, "Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks," in CVPR, 2018.
[16] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, pp. 39–71, 1996.
[17] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[18] A. Dubey, O. Gupta, R. Raskar, and N. Naik, "Maximum entropy fine-grained classification," in NIPS, 2018.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[22] P. Zhuang, Y. Wang, and Y. Qiao, "Learning attentive pairwise interaction for fine-grained classification," in AAAI, 2020.
[23] Y. Chen, Y. Bai, W. Zhang, and T. Mei, "Destruction and construction learning for fine-grained image recognition," in CVPR, 2019.
[24] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie, "Kernel pooling for convolutional neural networks," in CVPR, 2017.
[25] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[26] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in CVPR, 2017.

APPENDIX
A: DATA STATISTICS

We validate the effectiveness of our approach on 4 fine-grained benchmark datasets. Detailed descriptions of these datasets can be found on their respective homepages. The data statistics are presented in the following table.
TABLE IV: THE STATISTICS OF FINE-GRAINED DATASETS IN THIS LETTER.

Datasets             Categories  Training  Testing
CUB-Birds [2]        200         5,994     5,794
Stanford Cars [3]    196         8,144     8,041
Stanford Dogs [1]    120         12,000    8,580
FGVC-Aircraft [4]    100         6,667     3,333

APPENDIX
B: STRUCTURES OF BACKBONE NETWORKS AND TRAINING DETAILS

VGG16 [25]. Due to the huge number of parameters in VGG16, we correspondingly modified its structure to make it suitable for our validation. To this end, we remove all the fully connected layers, apply global average pooling to the last convolutional feature maps, and then directly feed the pooled features into the output layer. This modification removes about % of the parameters of the vanilla VGG16 model. Our approach introduces about . % additional parameters to this modified VGG16 on the Birds dataset.

ResNeXt-50 [26].
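The VGG16 modification described in Appendix B (drop all fully connected layers, global average pooling over the last conv maps, then a single output layer) can be sketched as follows. The stand-in trunk, class count, and input size below are illustrative; the real trunk would come from a pretrained VGG16.

```python
import torch
import torch.nn as nn

class SlimVGGHead(nn.Module):
    """Sketch of the slimmed backbone: keep only the convolutional
    features, then GAP and a single output layer (no FC stack)."""
    def __init__(self, features, channels=512, num_classes=200):
        super().__init__()
        self.features = features               # convolutional trunk
        self.pool = nn.AdaptiveAvgPool2d(1)    # global average pooling
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(x)

# Tiny stand-in for the 512-channel VGG16 conv trunk (illustrative only)
trunk = nn.Sequential(nn.Conv2d(3, 512, 3, padding=1), nn.ReLU())
model = SlimVGGHead(trunk, channels=512, num_classes=10)
out = model(torch.randn(1, 3, 32, 32))
```

Because the pooled feature goes straight into one linear layer, the parameter count is dominated by the conv trunk, consistent with the large reduction over vanilla VGG16 noted above.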