Sill-Net: Feature Augmentation with Separated Illumination Representation
Haipeng Zhang, Zhong Cao, Ziang Yan, Changshui Zhang
Department of Automation, Tsinghua University, Beijing, China. Correspondence to: Changshui Zhang.

Abstract
For visual object recognition tasks, illumination variations can cause distinct changes in object appearance and thus confuse deep neural network based recognition models. Especially for some rare illumination conditions, collecting sufficient training samples could be time-consuming and expensive. To solve this problem, in this paper we propose a novel neural network architecture called Separating-Illumination Network (Sill-Net). Sill-Net learns to separate illumination features from images, and during training we augment training samples with these separated illumination features in the feature space. Experimental results demonstrate that our approach outperforms current state-of-the-art methods on several object classification benchmarks.
1. Introduction
Although deep neural network based models have achieved remarkable successes in various computer vision tasks (Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; Russakovsky et al., 2015; He et al., 2016), vast amounts of annotated training data are usually required for superior performance in many visual tasks. For the object classification task, the requirement for a large training set can be partially explained by the fact that many latent variables (e.g., positions/postures of the objects, the brightness/contrast of the image, and the illumination conditions) can cause significant changes in the appearance of objects. Although collecting a large training set to cover all possible values of these latent variables could improve recognition performance, for rare latent values such as extreme illumination conditions it could be prohibitively time-consuming and expensive to collect enough training images.

In this paper we restrict our attention to illumination conditions. For many real-world computer vision applications (e.g., autonomous driving and video surveillance) it is essential to recognize objects under extreme illumination conditions such as backlighting, overexposure and other complicated cast shadows. Thus, we reckon it is desirable to improve recognition models' generalization ability under different illumination conditions in order to deploy robust models in real-world applications.

Figure 1. Illustration of the key idea of our approach. The semantic and illumination representations are separated from the training image (mandatory straight). The illumination representation is used to augment the support sample (deer crossing).

We propose a novel neural network architecture named Separating-Illumination Network (Sill-Net) to deal with such problems. The key idea of our approach is to separate the illumination features from the semantic features in images, and then augment the separated illumination features onto other training samples (hereinafter we name these samples "support samples") to construct a more extensive feature set for subsequent training (see Figure 1). Specifically, our approach consists of three steps. In the first step, we separate the illumination and semantic features for all images in the existing dataset via a disentanglement method, and use the separated illumination features to build an illumination repository. Then, we transplant the illumination repository to the support samples to construct an augmented training set and use it to train a recognition model. Finally, test images are fed into the trained model for classification. Our proposed approach could improve the robustness to illumination conditions since the support samples used for training are blended with many different illumination features. Thus, after training, the obtained model would naturally generalize better under various illumination conditions.
[Figure 2 graphic: Exchange Mechanism; Separation Module (Extractor, Separator, Spatial Transformer); Matching and Reconstruction Module (Matching, Reconstructor, Templates); Illumination Repository; Augmentation Module (Classifier); inputs: training images, templates, and support samples; outputs: features and augmented features.]
Figure 2. Illustration of the architecture of Sill-Net. Sill-Net consists of three main modules: the separation module, the matching and reconstruction module, and the augmentation module. The semantic and illumination features are separated by the exchange mechanism in the first module. The semantic features are constrained to be informative by the matching and reconstruction module. The illumination features are stored in a repository. In the augmentation module, we use the illumination features in the repository to augment the support samples (e.g., template images) for training a generalizable model.
Our contributions are summarized as follows:

1) We develop an algorithm to separate the illumination features from the semantic features of natural images. The separated illumination features can be used to construct an illumination feature repository.

2) We propose an augmentation method to blend support samples with the illumination feature repository, which can effortlessly enhance the illumination variety of the training set and thus improve the illumination robustness of the trained deep model.

3) We evaluate Sill-Net on several object classification benchmarks, i.e., two traffic sign datasets (GTSRB and TT100K) and three logo datasets (BelgaLogos, FlickrLogos-32, and TopLogo-10). Sill-Net outperforms the state-of-the-art (SOTA) methods by a large margin.
2. Proposed Method
In this section, we introduce our Separating-Illumination Network (Sill-Net). Sill-Net first learns to separate the semantic and illumination features of training images. Then the illumination features are blended with the semantic feature of each support sample to construct an augmented feature set. Finally, we train again on the illumination-augmented feature set for classification. The architecture of our method is illustrated in Figure 2.

Sill-Net mainly consists of the following modules: the separation module, the matching and reconstruction module, and the augmentation module. In detail, we implement the method in three steps:

1) The separation module is trained to separate the features of all training images into semantic parts and illumination parts. The matching and reconstruction module promotes the learning of a better semantic feature representation. The learned illumination features are stored in an illumination repository. The details are illustrated in Section 2.1.

2) The semantic feature of each support image is combined with all illumination features in the repository to build an augmented feature set to train the classifier. The augmentation module is illustrated in Section 2.2.

3) Test images are fed into the well-trained model to be predicted in an end-to-end manner, as described in Section 2.3.

This approach assumes that the illumination distribution learned from the training data is similar to that of the test data; thus the illumination features can be used as feature augmentation for sufficient training.

We can choose different support samples for different visual tasks. For instance, in conventional classification tasks, we use the real training images as support samples; in one-shot classification tasks, we construct the support set with template images (i.e., graphic symbols visually and abstractly representing semantic information).
Let $\mathcal{X} = \{(x_i, y_i, t_i)\}_{i=1}^{N}$ represent the labeled dataset of training classes with $N$ images, where $x_i$ denotes the $i$-th training image, $y_i$ is the one-hot label, and $t_i$ denotes the corresponding template image (or any image of the object without much deformation).

A feature extractor denoted by $E(z|x)$ learns the separated features $z$ from images $x$, where $z$ can be split along channels: $z = [z_{sem}, z_{illu}]$. Here, $z_{sem}$ is called the semantic feature, which represents the consistent information of the same category, while $z_{illu}$ is called the illumination feature.

Here we specify what illumination represents in our paper. The narrow meaning of illumination is one of the environmental impacts that cause appearance changes but no label changes. We call the features related to all environmental impacts that are not category-specific illumination features. Technically, we divide the object feature into different channels, one half determining the category label (defined as the semantic feature) and the other half unrelated to the category label (defined as the illumination feature). Thus, the following three conditions should be satisfied:

1) The semantic feature is informative enough to reconstruct the corresponding template image.

2) The semantic feature can predict the label while the illumination feature cannot.

3) The illumination feature should not contain the semantic information.

To satisfy the above conditions, we build the following modules.
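To make the channel split concrete, below is a minimal PyTorch sketch of an extractor that keeps spatial resolution and halves its output channels into semantic and illumination parts; the class name, layer sizes, and channel count are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SeparationExtractor(nn.Module):
    """Toy extractor E(z|x) that preserves spatial resolution and splits the
    output channels into a semantic half and an illumination half.
    The layer sizes here are illustrative placeholders."""
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        # no downsampling, as described in the text
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        z = self.backbone(x)
        # split along channels: first half semantic, second half illumination
        z_sem, z_illu = torch.chunk(z, 2, dim=1)
        return z_sem, z_illu
```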
Matching and reconstruction module.

We first construct a matching module (as shown in Figure 2) to make the semantic feature informative as required by the first condition. Since we design the extractor without downsampling operations to maintain the spatial information of the object, the semantic feature of a real image should be similar to that of its corresponding template image. However, the real image is usually deformed compared to the regular template image. Therefore, we use a spatial transformer (Jaderberg et al., 2015) $\mathcal{T}$ to correct the deformation. We constrain the transformed semantic feature $\mathcal{T}(z_{sem(i)}|x_i)$ to be consistent with the template feature $z_{sem(i)}|t_i$ by the mean square error (MSE):

$$\mathcal{L}_{match} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathcal{T}(z_{sem(i)}|x_i) - z_{sem(i)}|t_i \right)^2. \quad (1)$$

Besides, we design a reconstructor $R(t\,|\,\mathcal{T}(z_{sem}))$ (as shown in Figure 2) to retrieve the template image $t$ from the semantic feature $z_{sem}$ to ensure that it is informative enough. We constrain the reconstructed template image $\hat{t}_i$ by the binary cross-entropy (BCE) loss:

$$\mathcal{L}_{recon} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{|t_i|} -t_{ij} \log \hat{t}_{ij} - (1 - t_{ij}) \log\left(1 - \hat{t}_{ij}\right), \quad (2)$$

where $t_{ij}$ represents the $j$-th pixel of the $i$-th template image $t_i$. Since the template images are composed of primary colors within the range of $[0, 1]$, the binary cross-entropy (BCE) loss is sufficiently efficient for the retrieval (Kim et al., 2019). So far, the semantic feature is constrained to be consistent with the template feature and informative enough to be reconstructed into its template image.
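The two losses can be written compactly as below. This is a hedged sketch: `spatial_transformer` stands for the STN $\mathcal{T}$ and `reconstructor` for $R$, both assumed to be defined elsewhere, and the reconstructor is assumed to end with a sigmoid so its output lies in $[0, 1]$.

```python
import torch.nn.functional as F

def matching_loss(z_sem_img, z_sem_tpl, spatial_transformer):
    """Eq. (1): MSE between the STN-corrected semantic feature of a real image
    and the semantic feature of its template."""
    return F.mse_loss(spatial_transformer(z_sem_img), z_sem_tpl)

def reconstruction_loss(z_sem_img, template, spatial_transformer, reconstructor):
    """Eq. (2): pixel-wise BCE between the reconstructed and the true template;
    assumes template pixels and the reconstructor output are both in [0, 1]."""
    t_hat = reconstructor(spatial_transformer(z_sem_img))
    return F.binary_cross_entropy(t_hat, template)
```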
Exchange mechanism.

To ensure that the semantic feature can predict the label while the illumination feature cannot, we utilize a feature exchange mechanism enlightened by (Xiao et al., 2018) to separate the features. As shown in Figure 3, the semantic feature $z_{sem(i)}$ of one image $x_i$ is blended with the illumination feature $z_{illu(j)}$ of another image $x_j$ to form a new feature through feature mixup (Zhang et al., 2017):

$$z = r\, z_{sem(i)} + (1 - r)\, z_{illu(j)}, \quad (3)$$

where the proportion $r \in [0, 1]$. As required by the condition, the blended feature $z$ retains the same label $y_i$ as the semantic feature. Hence, through training, the semantic feature would learn information to predict the label while the illumination feature would not.
Figure 3. Illustration of the exchange mechanism. The semantic and illumination features are exchanged between randomly paired images with labels $y_i$ and $y_j$. Then we obtain cross-combined features labeled the same as the images corresponding to the semantic features. These features are then classified with the specified labels.

We implement the exchange process for random pairs of images, building a new exchanged feature set:

$$\mathcal{Z}_{ex} = \left\{ \left( r\, z_{sem(i)} + (1 - r)\, z_{illu(j)},\; y_i \right) \,\middle|\, i, j = 1, \cdots, N \right\}. \quad (4)$$

The mixed features are then input into a classifier $P$ for label prediction. We denote the distribution of the predicted label $y$ given the mixed feature $z$ by $P(y|z)$. Then we minimize the cross-entropy loss:

$$\mathcal{L}_{class} = -\frac{1}{N_{ex}} \sum_{i=1}^{N_{ex}} \sum_{c=1}^{M} y_{ic} \log P(y_{ic} \,|\, z_i \in \mathcal{Z}_{ex}), \quad (5)$$

where $N_{ex} = |\mathcal{Z}_{ex}|$ denotes the number of recombined features in the exchanged feature set, $M$ represents the number of classes of all images for training and test, and $y_{ic}$ is the $c$-th element of the one-hot label $y_i$.

The semantic feature retains the information to predict the label after training on the exchanged feature set. Besides, the semantic information in the illumination feature would be reduced, because otherwise it would impair the prediction when blended with the semantic features of other images.
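A minimal sketch of the exchange step (Eqs. 3-5), assuming integer class labels rather than one-hot vectors and an illustrative mixing proportion r; the random pairing is realized here as a permutation of the batch.

```python
import torch
import torch.nn.functional as F

def exchange_and_classify(z_sem, z_illu, labels, classifier, r=0.5):
    """Feature exchange mechanism, sketched: pair each semantic feature with the
    illumination feature of a randomly chosen other image, mix them with
    proportion r (Eq. 3), and classify under the semantic label (Eq. 5).
    r=0.5 is an illustrative default, not the paper's setting."""
    perm = torch.randperm(z_sem.size(0))          # random pairing (i, j)
    z_mix = r * z_sem + (1.0 - r) * z_illu[perm]  # Eq. (3)
    logits = classifier(z_mix)
    # labels are integer class indices here; the label follows the semantic part
    return F.cross_entropy(logits, labels)
```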
Constraints on illumination features.
As required by the third condition, it is essential to impose additional constraints on illumination features to reduce the semantic information. However, it is difficult to find suitable restrictions since the generally used datasets have no illumination labels. Enlightened by the disentanglement metric proposed in (Suter et al., 2019), we design a constraint on illumination features as the negative Post Interventional Disagreement (PIDA). Given a subset $\mathcal{X}^c = \{(x_i^c, y^c)\}_{i=1}^{N_c}$ including $N_c$ images of the same label $y^c$, we write the loss as follows:

$$\mathcal{L}_{illu} = -\mathrm{PIDA} = -\sum_{c=1}^{M} \sum_{i=1}^{N_c} D\!\left( \mathbb{E}(z_{illu} \,|\, x^c, y^c),\; z_{illu} \,|\, x_i^c, y^c \right), \quad (6)$$

where $D$ is a proper distance function (e.g., an $\ell$-norm), $z_{illu}|x_i^c, y^c$ is the illumination feature of image $x_i^c$ with the same label $y^c$, $\mathbb{E}$ is the expectation, and $N_c$ is the number of images in class $c$.

According to Eq. (6), PIDA quantifies the distances between the illumination feature of each same-labeled image $z_{illu}|x_i^c, y^c$ and their expectation $\mathbb{E}(z_{illu}|x^c, y^c)$ when the illumination conditions are changed. In the subset $\mathcal{X}^c$, the semantic information of each image is similar while the illumination information is different. Suppose an undesirable situation in which the illumination features capture much semantic information rather than illumination information. The expectation would strengthen the common semantic component and weaken the distinct illumination components, and thus PIDA would be small. This means that the smaller PIDA is, the more semantic information the illumination feature captures compared to illumination information. By maximizing PIDA (i.e., minimizing $\mathcal{L}_{illu}$), we can effectively reduce the common semantic information remaining in the illumination features.

In summary, the overall loss function in the training phase can be written as:

$$\mathcal{L} = \mathcal{L}_{match} + \mathcal{L}_{recon} + \mathcal{L}_{class} + \mathcal{L}_{illu}. \quad (7)$$

Through the above training, the model learns to split the features into semantic and illumination features.

After the first training step, the illumination feature of each image can be separated. These features are collected to construct an illumination repository, expressed as follows:

$$\mathcal{Z}_{illu} = \left\{ z_{illu(i)} \right\}_{i=1}^{N}. \quad (8)$$

We then use the illumination features to augment the support samples by a multiple of the repository size $N$. Consider $\mathcal{X}^t = \{(x_i^t, y_i^t, t_i^t)\}_{i=1}^{N_t}$ with $N_t$ images of label $y^t$; here we assume that the template images $t_i^t$ constitute the support set. We combine all illumination features in the repository with the semantic feature of each template $z_{sem(i)}|t_i^t$ by feature mixup, building an augmented feature set as follows:

$$\mathcal{Z}_{aug} = \left\{ \left( r\, z_{sem(i)}^t + (1 - r)\, z_{illu(j)},\; y_i^t \right) \,\middle|\, i = 1, \cdots, N_t \right\}, \quad (9)$$

where $z_{illu(j)} \in \mathcal{Z}_{illu}$.

We train the model again on the feature set $\mathcal{Z}_{aug}$. Thus, even if only a few support samples are provided, the model can be trained on the augmented feature set blended with real illumination features, making it generalizable to the test data. The classification loss of augmented training is expressed as follows:

$$\mathcal{L}_{aug} = -\frac{1}{N_{aug}} \sum_{i=1}^{N_{aug}} \sum_{c=1}^{M} y_{ic} \log P(y_{ic} \,|\, z_i \in \mathcal{Z}_{aug}), \quad (10)$$

where $N_{aug} = |\mathcal{Z}_{aug}|$ denotes the number of all recombined features in the augmented feature set.
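The illumination constraint of Eq. (6) and the repository-based augmentation of Eq. (9) can be sketched as follows; the $\ell_2$ distance, the value of r, and the helper names are assumptions made for illustration.

```python
import torch

def illumination_constraint(z_illu, labels):
    """Negative PIDA (Eq. 6), sketched with an l2 distance (an assumption):
    maximize the spread of same-class illumination features around their
    class-wise mean, discouraging residual semantic content in z_illu."""
    loss = torch.zeros((), device=z_illu.device)
    for c in labels.unique():
        z_c = z_illu[labels == c].flatten(1)       # illumination features of class c
        center = z_c.mean(dim=0, keepdim=True)     # expectation over the class
        loss = loss - (z_c - center).norm(dim=1).sum()   # L_illu = -PIDA
    return loss

def augment_with_repository(z_sem_templates, template_labels, repository, r=0.5):
    """Eq. (9), sketched: blend each template's semantic feature with every
    illumination feature stored in the repository, keeping the template label."""
    feats, labels = [], []
    for z_sem, y in zip(z_sem_templates, template_labels):
        for z_illu in repository:
            feats.append(r * z_sem + (1.0 - r) * z_illu)
            labels.append(int(y))
    return torch.stack(feats), torch.tensor(labels)
```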
Now, the model has been trained to be generalizable for the test. The feature extractor and classifier have been fully trained after the first two phases. Given the $i$-th test image, the feature extractor first splits the semantic and illumination features. Subsequently, the features are blended, and then the classifier outputs the category label $\hat{c}$, formulated as:

$$\hat{c}_i = \arg\max_{c} P(y_{ic} \,|\, z_i). \quad (11)$$

The inference is achieved in an end-to-end manner.
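For completeness, here is a sketch of the end-to-end inference described above; re-blending the test image's own semantic and illumination features with the same proportion r is an assumption mirroring the training-time mixup.

```python
import torch

@torch.no_grad()
def predict(image, extractor, classifier, r=0.5):
    """Eq. (11), sketched: split the test image's features, blend them,
    and return the arg-max class index."""
    z_sem, z_illu = extractor(image.unsqueeze(0))   # add a batch dimension
    z = r * z_sem + (1.0 - r) * z_illu
    logits = classifier(z)
    return logits.argmax(dim=1).item()              # predicted label c_hat
```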
3. Experiments
We validate the effectiveness of our method on two traffic sign datasets, GTSRB (Stallkamp et al., 2012) and Tsinghua-Tencent 100K (TT100K) (Zhu et al., 2016), and three logo datasets, BelgaLogos (Joly & Buisson, 2009; Letessier et al., 2012), FlickrLogos-32 (Romberg et al., 2011) and TopLogo-10 (Su et al., 2017), because these datasets contain varied illumination conditions. Table 1 shows the size and number of classes of each dataset (we use the datasets provided in (Kim et al., 2019)). More details about the datasets are described in the supplementary material.
Table 1. Dataset specifications.

Dataset   GTSRB   TT100K   BelgaLogos   FlickrLogos-32   TopLogo-10
Size      51839   11988    9585         3404             848
Classes   43      36       37           32               11
Evaluation tasks.
Generally, we evaluate our model by the following steps: 1) utilize the training dataset (or a subset) to separate out the illumination features; 2) augment the support samples with the illumination features to form an augmented feature set; 3) train a classifier on the augmented feature set; 4) predict on the test dataset.

We validate our model on the following classification tasks.

1) One-shot classification. In this type of task, the training phase requires no real images of the test classes but one template image for each category. This task is similar to one-shot classification. We set up two scenarios for traffic sign classification. In the first scenario, we split GTSRB into a training subset with 22 classes and a test subset with the other 21 classes, where the template images constitute the support set. In the second scenario, we train on GTSRB and test on TT100K for cross-dataset evaluation. We exclude from the test set the four common classes shared by GTSRB and TT100K. For convenience, we denote the first scenario by GTSRB → GTSRB and the second by GTSRB → TT100K, where the training set is on the left side of the arrow and the test set on the right. For logo classification, we use the largest dataset, BelgaLogos, as the training set and the remaining two as test sets, denoted by Belga → Flickr32 and Belga → Toplogos respectively. As above, we remove the four classes in FlickrLogos-32 and the five classes in TopLogo-10 shared with BelgaLogos.

2) Cross-domain one-shot classification. To further validate the generalization of our method, we perform a cross-domain one-shot evaluation with another two experiments, where the model is trained on traffic sign datasets and tested on logo datasets. Specifically, we train the model on GTSRB and test on FlickrLogos-32 and TopLogo-10. We denote these two scenarios as GTSRB → Flickr32 and GTSRB → Toplogos. This setup is more challenging than the previous scenarios, since we train the model in the domain of traffic signs while we test it in the entirely different domain of logos.
Architecture and parameter settings.
We construct the extractor with six convolution layers to separate the semantic and illumination features (see the separation module in Figure 2). The reconstructor is built with the layers of the extractor in inverse order. The classifiers (see Figure 2 and Figure 3) are built with six convolution layers and three pooling layers. Due to space limitations, more details of the architecture are described in the supplementary material.

The networks are trained using the ADAM optimizer with a learning rate of $10^{-\cdot}$, $\beta = (0.\cdot, 0.\cdot)$ and $\epsilon = 10^{-\cdot}$. The mixup proportion $r$ is set to $0.\cdot$ throughout the experiments. Limited by graphics card memory, we choose a mini-batch size of 16, which can be larger if conditions permit.

The matching and reconstruction loss functions are weighted by proportionality coefficients for optimal results. The weighted overall loss function is expressed as follows:

$$\mathcal{L} = \alpha \mathcal{L}_{match} + \gamma \mathcal{L}_{recon} + \mathcal{L}_{class} + \mathcal{L}_{illu}. \quad (12)$$

We can choose $\alpha$ and $\gamma$ in the range of $[10^{-\cdot}, 10^{-\cdot}]$. When they are too large, the model tends to learn false features with values close to zero, while when they are too small, the model is not able to learn informative semantic features. In our method, $\alpha$ is set to $10^{-\cdot}$ and $\gamma$ is set to $10^{-\cdot}$.
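For reference, the weighted overall loss of Eq. (12) in code form; the default alpha and gamma below are placeholders only, since the exact values are not reproduced here.

```python
def total_loss(l_match, l_recon, l_class, l_illu, alpha=1e-2, gamma=1e-2):
    """Eq. (12): weight the matching and reconstruction terms by alpha and gamma.
    The defaults are illustrative placeholders, not the paper's tuned values."""
    return alpha * l_match + gamma * l_recon + l_class + l_illu
```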
Template image processing.

Previous studies (Tabelini et al., 2020) have shown that basic image processing on template images (as support samples) helps the network's generalization. In our experiments, we diversify the template images themselves using the following methods: geometric transformations, image enhancement (including brightness, color, contrast and sharpness adjustment), and blur. The template images are thus diversified and allow the model to learn more generalizable features. We observe that basic processing on template images improves model performance.
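A sketch of such basic template processing using torchvision transforms; the specific parameter ranges are illustrative choices under the assumptions above, not the exact settings used in the experiments.

```python
from torchvision import transforms

# Basic diversification of template images: geometric transformations,
# brightness/color/contrast/sharpness adjustment, and blur.
# The parameter ranges below are illustrative, not the paper's settings.
template_processing = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```

The pipeline would be applied to each PIL template image before feature extraction, producing a diversified synthetic support set.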
We compare our method with Quadruplet networks (QuadNet) (Kim et al., 2017) and the Variational Prototyping-Encoder (VPE) (Kim et al., 2019) for one-shot classification, reported in Tables 2 and 3. We quote the accuracies of the compared methods under their optimal settings, that is, VPE is implemented with augmentation and a spatial transformer (the VPE+aug+stn version) and QuadNet is implemented without augmentation. As shown in the tables, our method outperforms the compared methods in all scenarios.
Table 2. One-shot classification accuracy (%) on traffic sign datasets. The results of other methods are cited from (Kim et al., 2017; 2019). The best results are marked in blue.

Method             GTSRB → GTSRB   GTSRB → TT100K
No. support set    (22+21)-way     36-way
QuadNet            45.2            N/A
VPE                83.79           71.80
Sill-Net           97.60           95.59
Sill-Net w/o aug   46.25           45.94
In traffic sign classification, Sill-Net outperforms the second-best method, VPE, by large margins of 13.81% (accuracy improved from 83.79% to 97.60%) and 23.79% (accuracy improved from 71.80% to 95.59%) in the two scenarios respectively (see Table 2). This indicates that training on features augmented with illumination information does help real-world classification, even though only one template image is provided. It is notable that in the cross-dataset scenario GTSRB → TT100K, Sill-Net achieves performance comparable to the intra-dataset scenario GTSRB → GTSRB, while VPE performs much worse in the cross-dataset scenario. We surmise this is because VPE learns latent embeddings generalizable to test classes in the same domain (GTSRB), but the generalization might be discounted when the target domain is slightly shifted (from GTSRB to TT100K). We observe that the illumination conditions in GTSRB are quite similar to those in TT100K; therefore Sill-Net shows better generalization performance by making full use of the illumination information in GTSRB.
Table 3. One-shot classification accuracy (%) on logo datasets. The results of other methods are cited from (Kim et al., 2017; 2019). The best results are marked in blue.

Method             Belga → Flickr32   Belga → Toplogos
No. support set    32-way             11-way
QuadNet            37.72              36.62
VPE                53.53              57.75
Sill-Net           65.21              84.43
Sill-Net w/o aug   52.38              47.95
In logo classification, Sill-Net improves the performance by 11.68% (from 53.53% to 65.21%) and 26.68% (from 57.75% to 84.43%) compared to VPE in the two scenarios respectively (see Table 3). The improvement in logo classification is not as large as that in traffic sign classification, which might be due to the less favorable quality of the training logo dataset. GTSRB is the largest dataset with various illumination conditions, and its traffic signs are always complete and well localized in the images, so illumination features can be separated more easily. In contrast, the separation is harder for the logo datasets due to incomplete logos, color variations, and non-rigid deformations (e.g., logos on bottles).

We also compare our method to an ordinary convolutional model consisting of a feature extractor and a classifier without feature augmentation, denoted as Sill-Net w/o aug (see the last row of Tables 2 and 3). For a fair comparison, its feature extractor and classifier share the same number of convolutional layers with Sill-Net. We train it on a synthetic dataset composed of the template images after basic processing (i.e., geometric transformations, image enhancement, and blur). The number of training samples is set to be the same as for the other methods. The results are reported as a reference to show how Sill-Net performs without illumination feature augmentation. The unsatisfactory results show that the illumination feature augmentation does enhance the recognition ability of the model in one-shot classification.

Sill-Net achieves the best results among all methods in the cross-domain one-shot classification tasks, as shown in Table 4. It outperforms VPE by large margins of 23.63% (69.75% compared to 46.12%) in GTSRB → Flickr32 and 39.86% (69.46% compared to 29.60%) in GTSRB → Toplogos.
Table 4. Cross-domain one-shot classification accuracy (%). The models are trained on the traffic sign dataset (GTSRB) and tested on the logo datasets. The best results are marked in blue.

Method             GTSRB → Flickr32   GTSRB → Toplogos
No. support set    32-way             11-way
QuadNet            28.41              25.38
VPE                46.12              29.60
Sill-Net           69.75              69.46
Sill-Net w/o aug   53.94              47.54
The results illustrate that our method remains generalizable when the domain is transferred from traffic signs to logos. The unsatisfactory results of VPE are predictable: VPE learns a generalizable similarity embedding space of the semantic information within the same or a similar domain (i.e., from traffic signs to traffic signs or from logos to logos). However, the embeddings learned from traffic signs are difficult to generalize to logos. In contrast, our method learns well-separated semantic and illumination representations and augments the template images from novel domains with the illumination features to generalize the model.
In this section, we delve into the contribution of each component of our method. The components under evaluation include the exchange mechanism, the matching and reconstruction module, the illumination constraint, and template image processing, as shown in Table 5. We disable one component at a time and then record the performance to assess its importance. The experiments are implemented in the one-shot classification scenario GTSRB → GTSRB.

The results demonstrate that the exchange mechanism and the matching module are the core components of our method. The accuracy of the model drops to 48.10% without the exchange mechanism. This is because the semantic and illumination features cannot be well separated without the exchange mechanism. The remaining semantic information in the illumination features is useless, or would even interfere with recognition when combined with the semantic features of other objects during feature augmentation, hurting the performance of the model.

Meanwhile, the matching module, cooperating with the separation module, further separates the semantic and illumination features. The matching module corrects the deformation of the object features. It retains the concrete semantic information (e.g., the outline of the object and the semantic details of the object contents) under the supervision of template images. Without the matching module, the semantic features would not be informative enough, so the separation module would have difficulty separating the illumination features from the semantic features. Therefore, the accuracy of the model drops to 54.27% when the matching module is removed.
Table 5. Ablation study results (%) in the one-shot classification scenario GTSRB → GTSRB. We disable one component at a time and record the performance of Sill-Net.

Factor                        Accuracy (decrement)
w/o exchange mechanism        48.10 (-49.50)
w/o matching module           54.27 (-43.33)
w/o reconstruction module     80.74 (-16.86)
w/o illumination constraint   90.73 (-6.87)
w/o template processing       80.19 (-17.41)
full method                   97.60
The accuracy of the model decreases by 16.86% without the reconstruction module. The reconstruction module also strives to make semantic features more informative: the matching module helps the model capture some level of concrete semantic information, while the reconstruction module prompts it to retain more delicate details of the object.

The illumination constraint increases the model performance by 6.87%. Intuitively, the constraint reduces the semantic information in illumination features and thus enhances their quality. Higher-quality illumination representations can improve the effectiveness of our feature augmentation method, which is consistent with the results.

Furthermore, template image processing improves model performance as expected. The processing methods (i.e., geometric transformations, image enhancement, and blur as introduced before) diversify the template images so that the trained model is more generalizable. Under the combined effect of the proposed illumination augmentation in the feature space and the variation of template images, the full model achieves the best results among the existing methods.
Figure 4. Visualization of the separated features and the reconstructed template images from training and test classes. The first two rows show the input images and their corresponding template images. The third and fourth rows show the semantic and illumination features of the input images separated by our model. The last row shows the template images reconstructed from the semantic features. More visualization results are shown in the supplementary material.

Figure 4 shows the separated semantic and illumination features of images from training and test classes in GTSRB, visualized in the third and fourth rows. Note that the training and test datasets share no common classes. As shown in the figure, the semantic features delicately retain information consistent with the template images for both training and test classes. This is due to three aspects. First, the extractor maintains the size and spatial information of the features. Second, although objects in the input images vary in size and position, the features are corrected to the normal situation corresponding to the template images via the spatial transformer in the matching module. Third, the reconstruction module prompts the semantic feature to retain the details of the objects.

In contrast, the semantic information is effectively reduced in the illumination features. These features reflect the illumination conditions in the original images to a certain extent. Intuitively, the pink parts in the features represent bright illumination while the green parts represent dark illumination. Such well-separated representations lay the foundation for the good performance of our model.
While the reconstructor serves to obtain informative semantic features during training, it can also retrieve the template images in the inference phase. As shown in the last row of Figure 4, the reconstructor robustly generates the template images of both the training and test samples, regardless of illumination variance, object deformation, blur, and low resolution of the images. Not only the outlines of the symbol contents but also fine details are well restored in the generated template images, which improves on the reconstruction results of VPE. Our results further demonstrate that the proposed model has learned good representations of semantic information for both classification and reconstruction.
4. Discussions
So far, our studies have validated the feasibility and effectiveness of illumination-based feature augmentation. The idea of learning good semantic and illumination features before training a classifier is consistent with the thinking of decoupling representation and classifier (Zhang & Yao, 2020). Compared to the existing approaches (Kim et al., 2017; 2019), our method not only achieves the best results on traffic sign and logo classification, but also learns intuitively interpretable semantic and illumination representations and performs better reconstructions.

Our method can be widely applied to a series of training scenarios. In the case that training samples with certain illumination conditions are limited in the dataset, we can augment these samples with that type of illumination features separated from other images (or simply use the illumination features in our repository). Besides, we can utilize the method to expand a few support samples, or even only one (e.g., template images), to form a large training dataset, solving the problem of lacking annotated real data. Overall, imbalance in both the size and the illumination conditions of the dataset can be alleviated, since we can transplant illumination information to training samples with a limited number and limited illumination diversity.

A natural question is why we do not classify a test sample by its semantic feature directly after the separation. In fact, we have to train the classifier with the augmented samples because generally there are not many support samples in few-shot or one-shot scenarios. If we trained the classifier with only a few support samples, it would generalize poorly due to the memorization behavior of deep networks (Arpit et al., 2017). In contrast, when we extend the volume and diversity of the feature set by illumination augmentation, the trained model becomes more generalizable.

Our work can be improved in the following aspects. First, it should be noted that the illumination features learned by our model seem to reflect relative illumination intensity rather than fine details, limited by the lack of illumination supervision. The constraint used in our method improves the quality of illumination features to some extent and thus enhances the model performance. However, alternative disentanglement methods with more stringent constraints, or pretraining on illumination-supervised data, could be applied to obtain refined illumination representations.

Second, the spatial transformer network (STN) can be substituted by other networks in the matching module. In traffic sign classification, the STN can well correct the semantic features to be consistent with those of the templates. However, it sometimes has difficulty dealing with the non-rigid deformations in logo datasets. Furthermore, general objects might differ from the templates in many aspects, such as color variation and changes in visual angle. Two ways can be considered. First, we can choose several different templates for different types of variation. Second, we can develop general networks to deal with such transformations. For instance, we can translate the objects along directions (e.g., color and visual angle) in the feature space toward the templates via semantic transformations (Wang et al., 2020).
5. Related Works
Data augmentation. Data augmentation is an effective data-space solution to the problem of limited data (Shorten & Khoshgoftaar, 2019). Augmentations based on data warping transform existing images while preserving the original labels (LeCun et al., 1998; Zheng et al., 2019). Oversampling augmentations enhance the datasets by generating synthetic training samples (Inoue, 2018; Bowles et al., 2018).

In this work, we propose a method of feature-space augmentation. This kind of augmentation implements the transformation in a learned feature space rather than the input space (DeVries & Taylor, 2017). Recently, augmentation methods on the semantic feature space have been proposed to regularize deep networks (Wang et al., 2020; Bai et al., 2020). Unlike these methods, we augment the samples with interpretable illumination representations in an easier way.
Few-shot learning.
Early efforts in few-shot learning were based on generative models that sought to build a Bayesian probabilistic framework (Fei-Fei et al., 2006). Recently, more attention has been paid to meta-learning, which can be generally summarized into five sub-categories: learn-to-measure (e.g., MatchNets (Vinyals et al., 2016), ProtoNets (Snell et al., 2017)), learn-to-finetune (e.g., MAML (Finn et al., 2017)), learn-to-remember (e.g., SNAIL (Mishra et al., 2018)), learn-to-adjust (e.g., MetaNets (Munkhdalai & Yu, 2017)) and learn-to-parameterize (e.g., DynamicNets (Gidaris & Komodakis, 2018)). In this work, we used tasks similar to one-shot learning to evaluate our method.
6. Conclusion
In this paper, we develop a novel neural network architecture named Separating-Illumination Network (Sill-Net). The illumination features can be well separated from training images by Sill-Net, and these features can then be used to augment the support samples. Our method outperforms the state-of-the-art (SOTA) methods by a large margin on several benchmarks. In addition to these improvements in visual applications, the results demonstrate the feasibility of illumination-based augmentation in the feature space for object recognition, which points to a promising research direction in data augmentation.
References
Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242. PMLR, 2017.

Bai, H., Sun, R., Hong, L., Zhou, F., Ye, N., Ye, H.-J., Chan, S.-H. G., and Li, Z. DecAug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. arXiv preprint arXiv:2012.09382, 2020.

Bowles, C., Chen, L., Guerrero, R., Bentley, P., Gunn, R., Hammers, A., Dickie, D. A., Hernández, M. V., Wardlaw, J., and Rueckert, D. GAN augmentation: Augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863, 2018.

DeVries, T. and Taylor, G. W. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.

Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. TPAMI, 28(4):594–611, 2006.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

Gidaris, S. and Komodakis, N. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Inoue, H. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.

Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025, 2015.

Joly, A. and Buisson, O. Logo retrieval with a contrario visual query expansion. In Proceedings of the 17th ACM International Conference on Multimedia, pp. 581–584, 2009.

Kim, J., Lee, S., Oh, T.-H., and Kweon, I. S. Co-domain embedding using deep quadruplet networks for unseen traffic sign recognition. arXiv preprint arXiv:1712.01907, 2017.

Kim, J., Oh, T.-H., Lee, S., Pan, F., and Kweon, I. S. Variational prototyping-encoder: One-shot learning with prototypical images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9462–9470, 2019.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Letessier, P., Buisson, O., and Joly, A. Scalable mining of small visual objects. In Proceedings of the 20th ACM International Conference on Multimedia, pp. 599–608, 2012.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In ICLR, 2018.

Munkhdalai, T. and Yu, H. Meta networks. In ICML, 2017.

Romberg, S., Pueyo, L. G., Lienhart, R., and Van Zwol, R. Scalable logo recognition in real-world images. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, pp. 1–8, 2011.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In NIPS, 2017.

Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.

Su, H., Zhu, X., and Gong, S. Deep learning logo detection with data expansion by synthesising context. pp. 530–539. IEEE, 2017.

Suter, R., Miladinovic, D., Schölkopf, B., and Bauer, S. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In International Conference on Machine Learning, pp. 6056–6065. PMLR, 2019.

Tabelini, L., Berriel, R., Paixão, T. M., De Souza, A. F., Badue, C., Sebe, N., and Oliveira-Santos, T. Deep traffic sign detection and recognition without target domain real images. arXiv preprint arXiv:2008.00962, 2020.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In NIPS, 2016.

Wang, Y., Huang, G., Song, S., Pan, X., Xia, Y., and Wu, C. Regularizing deep networks with semantic data augmentation. arXiv preprint arXiv:2007.10538, 2020.

Xiao, T., Hong, J., and Ma, J. ELEGANT: Exchanging latent encodings with GAN for transferring multiple face attributes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 168–184, 2018.

Zhang, H. and Yao, Q. Decoupling representation and classifier for noisy label learning. arXiv preprint arXiv:2011.08145, 2020.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Zheng, X., Chalasani, T., Ghosal, K., Lutz, S., and Smolic, A. STaDA: Style transfer as data augmentation. arXiv preprint arXiv:1909.01056, 2019.

Zhu, Z., Liang, D., Zhang, S., Huang, X., Li, B., and Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.