Feature Space Transfer for Data Augmentation
Bo Liu, Xudong Wang, Mandar Dixit, Roland Kwitt, Nuno Vasconcelos
Bo Liu, University of California, San Diego, [email protected]
Xudong Wang, University of California, San Diego, [email protected]
Mandar Dixit, Microsoft, [email protected]
Roland Kwitt, University of Salzburg, Austria, [email protected]
Nuno Vasconcelos, University of California, San Diego, [email protected]
Abstract
The problem of data augmentation in feature space is considered. A new architecture, denoted the FeATure TransfEr Network (FATTEN), is proposed for the modeling of feature trajectories induced by variations of object pose. This architecture exploits a parametrization of the pose manifold in terms of pose and appearance. This leads to a deep encoder/decoder network architecture, where the encoder factors into an appearance and a pose predictor. Unlike previous attempts at trajectory transfer, FATTEN can be efficiently trained end-to-end, with no need to train separate feature transfer functions. This is realized by supplying the decoder with information about a target pose and the use of a multi-task loss that penalizes category- and pose-mismatches. As a result, FATTEN discourages discontinuous or non-smooth trajectories that fail to capture the structure of the pose manifold, and generalizes well on object recognition tasks involving large pose variation. Experimental results on the artificial ModelNet database show that it can successfully learn to map source features to target features of a desired pose, while preserving class identity. Most notably, by using feature space transfer for data augmentation (w.r.t. pose and depth) on SUN-RGBD objects, we demonstrate considerable performance improvements on one/few-shot object recognition in a transfer learning setup, compared to current state-of-the-art methods.
1. Introduction
Convolutional neural networks (CNNs) trained on large datasets, such as ImageNet [2], have enabled tremendous gains in problems like object recognition over the last few years. These models not only achieve human-level performance in recognition challenges, but are also easily transferable to other tasks, by fine-tuning. Many recent works have shown that ImageNet-trained CNNs, like AlexNet [14], VGG [27], GoogLeNet [32], or ResNet [9], can be used as feature extractors for the solution of problems as diverse as object detection [6, 23] or generating image captions [12, 35].

Figure 1. Schematic illustration of feature space transfer for variations in pose. The input feature x and transferred feature x̂ are projected to the same point in appearance space, but have different mapping points in pose space.

Nevertheless, there are still challenges to CNN-based recognition. One limitation is that existing CNNs still have limited ability to handle pose variability. This is, in part, due to limitations of existing datasets, which are usually collected on the web and are biased towards a certain type of images. For example, objects that have a well defined "frontal view," such as "couch" or "clock," are rarely available from viewing angles that differ significantly from frontal. This is problematic for applications like robotics, where a robot might have to navigate around or manipulate such objects. When implemented in real time, current CNNs tend to produce object labels that are unstable with respect to viewing angle. The resulting object recognition can vary from nearly perfect under some views to much weaker for neighboring, and very similar, views. One potential solution to the problem is to rely on larger datasets, with a much more dense sampling of the viewing sphere. This, however, is not trivial to accomplish for a number of reasons. First, for many classes, such images are not easy to find on the web in large enough quantities.
Second, because existing recognition methods are weakest at recognizing "off-view" images, the process cannot be easily automated.
Third, the alternative of collecting these images in the lab is quite daunting. While this has been done in the past, e.g., for the COIL [17], NORB [16], or Yale face datasets, these datasets are too small by modern standards. The set-ups used to collect them, by either using a robotic table and several cameras, or building a camera dome, can also not be easily replicated and do not lend themselves to distributed dataset creation efforts, such as crowd-sourcing. Finally, even if feasible to assemble, such datasets would be massive and thus difficult to process. For example, the NORB recommendation of collecting 9 elevations, 18 azimuths, and 6 lighting conditions per object results in 972 images per object. Applying this standard to ImageNet would result in a dataset of over a billion images!

Some of these problems can be addressed by resorting to computer-generated images. This has indeed become an established practice to address problems that require multiple object views, such as shape recognition, where synthetic image datasets [19, 31] are routinely used. However, the application of networks trained on synthetic data to real images raises a problem of transfer learning. While there is a vast literature on this topic [28, 15, 24, 33, 36, 26, 22], these methods are usually not tailored for the transfer of object poses. In particular, they do not explicitly account for the fact that, as illustrated in Fig. 1, objects subject to pose variation span low-dimensional manifolds of image space, or corresponding spaces of CNN features. This has recently been addressed by [3], who have proposed an attribute-guided augmentation (AGA) method to transfer object trajectories along the pose manifold.

Besides learning a classifier that generalizes on target data, the AGA transfer learning system also includes a module that predicts the responses of the model across views. More precisely, given a view of an unseen object, it predicts the model responses to a set of other views of this object. These can then be used to augment the training set of a one-shot classifier, i.e., a classifier that requires a single image per object for training. While this was shown to improve on generic transfer learning methods, AGA has some limitations. For example, it discretizes the pose angle into several bins and learns an independent trajectory transfer function between each possible pair of them. While this simplifies learning, the trajectories are not guaranteed to be continuous. Hence, the modeling fails to capture some of the core properties of the pose manifold, such as continuity and smoothness. In fact, a 360° walk around the viewing sphere is not guaranteed to have identical start and finishing feature responses. In our experience, these choices compromise the effectiveness of the transfer.

Contribution.
In this work, we propose an alternative,
the FeATure TransfEr Network (FATTEN), that addresses these problems. Essentially, this is an encoder-decoder architecture, inspired by Fig. 1. We exploit a parametrization of pose trajectories in terms of an appearance map, which captures properties such as object color and texture and is constant for each object, and a pose map, which is pose dependent. The encoder maps the feature responses x of a CNN for an object image into a pair of appearance A(x) and pose P(x) parameters. The decoder then takes these parameters plus a target pose t = P(x̂) and produces the corresponding feature vector x̂. The network is trained end-to-end, using a multi-task loss that accounts for both classification errors and the accuracy of feature transfer across views.

The performance of FATTEN is investigated on two tasks. The first is a multi-view retrieval task, where synthesized feature vectors are used to retrieve images by object class and pose. These experiments are conducted on the popular ModelNet [37] shape dataset and show that FATTEN generates features of good quality for applications involving computer graphics imagery. This could be of use for a now large 3D shape classification literature [37, 20, 30, 21], where such datasets are predominant. The second task is transfer learning. We compare the performance of the proposed architecture against both general-purpose transfer learning algorithms and the AGA procedure. Our results show that there are significant benefits in developing methods explicitly for trajectory transfer, and in forcing these methods to learn continuous trajectories in the pose manifold. The FATTEN architecture is shown to achieve state-of-the-art performance for pose transfer.

Organization.
In Sect. 2, we review related work; Sect. 3 introduces the proposed FATTEN architecture. Sect. 4 presents experimental results on ModelNet and SUN-RGBD, and Sect. 5 concludes the paper with a discussion of the main points and an outlook on open issues.
2. Related Work
Since objects describe smooth trajectories in image space, as a function of viewing angle, it has long been known that such trajectories span a 3D manifold in image space, parameterized by the viewing angle. Hence, many of the manifold modeling methods proposed in the literature [25, 1, 34] could, in principle, be used to develop trajectory transfer algorithms. However, many of these methods are transductive, i.e., they do not produce a function that can make predictions for images outside of the training set, and do not leverage recent advances in deep learning. While deep learning could be used to explicitly model pose manifolds, it is difficult to rely on CNNs pre-trained on ImageNet for this purpose. This is because these networks attempt to collapse the manifold into a space where class discrimination is linear. On the other hand, the feature trajectories in response to pose variability are readily available. These trajectories are also much easier to model. For example, if the CNN is successful in mapping the pose manifold of a given object into a single point, i.e., exhibits total pose invariance for that object, the problem is already solved and trajectory learning is trivial for that object.

One of the main goals of trajectory transfer is to "fatten" a feature space, by augmenting a dataset with feature responses of unseen object poses. In this sense, the problem is related to extensive recent literature on GANs [7], which have been successfully used to generate images, image-to-image translations [10], inpainting [18] or style-transfer [5]. While our work uses an encoder-decoder architecture, which is fairly common in the GAN-based image generation literature, we aim for a different goal of generating CNN feature responses. This prevents access to a dataset of "real" feature responses across the pose manifold, since these are generally unknown. While an ImageNet CNN could be used to produce some features, the problem that we are trying to solve is exactly the fact that ImageNet CNNs do not effectively model the pose manifold. Hence, the GAN formalism of learning to match a "real" distribution is not easily applicable to trajectory transfer.

Instead, trajectory transfer is more closely related to the topic of transfer learning, where, now, there is extensive work on problems such as zero-shot [28, 15, 24] or n-shot [33, 36, 26, 22] learning. However, these methods tend to be of general purpose. In some cases, they exploit generic semantic properties, such as attributes or affordances [15, 24]; in others they simply rely on generic machine learning for domain adaptation [28], transfer learning [36] or, more recently, meta-learning [26, 4, 22]. None of these methods exploits specific properties of the pose manifold, such as the parametrizations of Figure 1. The introduction of networks that enforce such parameterizations is a form of regularization that improves on the transfer performance of generic procedures. This was shown in the AGA work [3] and is confirmed by our results, which show even larger gains over very recent generic methods, such as the feature hallucination approach proposed in [8].

Finally, trajectory transfer is of interest for problems involving multi-view recognition. Due to the increased cost of multi-view imaging, these problems frequently include some degree of learning from computer generated images. This is, for example, an established practice in the shape recognition literature, where synthetic image datasets [19, 31] are routinely used.
The emergence of these artificial datasets has enabled a rich literature in shape recognition methods [13, 37, 30, 20, 21, 11] and already produced some interesting conclusions. For example, while many representations have been proposed, there is some evidence that the problem could be solved as one of multi-view recognition, using simple multi-view extensions of current CNNs [30]. It is not clear, however, how these methods or conclusions generalize to real world images. Our results show that feature trajectory transfer models, such as FATTEN, learned on synthetic datasets, such as ModelNet [37], can be successfully transferred to real image datasets, such as SUN-RGBD [29].
3. The FATTEN architecture
In this section, we describe the proposed architecture for feature space transfer. In this work, we assume the availability of a training set with pose annotations, i.e., {(x_n, p_n, y_n)}_n, where x_n ∈ R^D is the feature vector (e.g., a CNN activation at some layer) extracted from an image, p_n is the corresponding pose value and y_n a category label. The pose value could be a scalar p_n, e.g., the azimuth angle on the viewing sphere, but is more generally a vector, e.g., also encoding an elevation angle or even the distance to the object (object depth). The problem is to learn the feature transfer function F(x_n, p) that maps the source feature vector x_n to a target feature vector x̂_n corresponding to a new pose p.

The FATTEN architecture is inspired by Fig. 1, which depicts the manifold spanned by an object under pose variation. The manifold M is embedded in R^D and is parameterized by two variables. The first is an appearance descriptor a ∈ R^A that captures object properties such as color or texture. This parameter is pose invariant, i.e., it has the same value for all points on the manifold. It can be thought of as an object identifier that distinguishes the manifold spanned by one object from those spanned by others. The second is a pose descriptor p ∈ R^N that characterizes the point x on the manifold that corresponds to a particular pose p. Conceptually, feature points x could be thought of as the realization of a mapping

  φ(a, p) ↦ x ∈ M.    (1)

The FATTEN architecture models the relationship between the feature vectors extracted from object images and the associated appearance and pose parameters. As shown in Fig. 2, it is an encoder/decoder architecture. The encoder essentially aims to invert the mapping of (1). Given a feature vector x, it produces an estimate of the appearance a and pose p parameters.

Figure 2. The FATTEN architecture. Here, id denotes the identity shortcut connection, D the dimensionality of the input feature space, A the dimensionality of the appearance space, and P^{N-1} the (N-1)-probability simplex.

This is complemented with a target pose parameter t, which specifies the pose associated with a desired feature vector x̂. This feature is then generated by a decoder that operates on the concatenation of a, p and t, i.e., [a, p, t]. While, in principle, it would suffice to rely on x̂ = φ(a, t), i.e., to use the inverse of the encoder as a decoder, we have obtained best results with the following modifications.

First, to discourage the encoder/decoder pair from learning a mapping that simply "matches" feature pairs, FATTEN implements the residual learning paradigm of [9]. In particular, the encoder-decoder is only used to learn the residual

  F(x) = x̂ - x    (2)

between the target and source feature vectors. Second, two mappings that explicitly recover the appearance a and pose p are used instead of a single monolithic encoder. This facilitates learning, since the pose predictor can be learned with full supervision. Third, a vector encoding is used for the source p and target t parameters, instead of continuous values. This makes the dimensionality of the pose parameters closer to that of the appearance parameter, enabling a more balanced learning problem. We have found that, otherwise, the learning algorithm can have a tendency to ignore the pose parameters and produce a smaller diversity of target feature vectors. Finally, rather than a function of a and t alone, the decoder is a function of a, p, and t. This again guarantees that the intermediate representation is higher dimensional and facilitates the learning of the decoder. We next discuss the details of the various network modules.
Encoder. The encoder consists of a pose and an appearance predictor. The pose predictor implements the mapping p = P(x) from feature vectors x to pose parameters. The poses are first internally mapped into a code vector c ∈ R^N of dimensionality comparable to that of the appearance vector a. In the current implementation of FATTEN this is achieved in three steps. First, the pose space is quantized into N cells of centroids m_i. Each pose is then assigned to the cell of the nearest representative m* and represented by an N-dimensional one-hot encoding that identifies m*. The pose mapping P is finally implemented with a classifier that maps x into a vector of posterior probabilities

  p = [p(m_1|x), ..., p(m_N|x)]    (3)

on the (N-1)-probability simplex P^{N-1}. This is implemented with a two-layer neural network, composed of a fully-connected layer, batch normalization, and a ReLU, followed by a softmax layer.

The appearance predictor implements the mapping a = A(x) from feature vectors x to appearance descriptors a. This is realized with a two-layer network, where each layer consists of a fully-connected layer, batch normalization, and an ELU layer. The outputs of the pose and appearance predictors are concatenated with a one-hot encoding of the target pose. Assuming that this pose belongs to the cell of centroid m_j, this is t = e_j, where e_j is a vector of all zeros with a 1 at position j.
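To make this concrete, the following is a minimal PyTorch sketch of the two predictors, under the assumption of fc7 features (D = 4096); the hidden and appearance dimensions (512) are illustrative choices, not values specified in the paper.

```python
import torch.nn as nn

class PosePredictor(nn.Module):
    """p = P(x): posterior probabilities over N pose cells, Eq. (3).
    One FC + BN + ReLU layer, followed by a softmax (classification) layer."""
    def __init__(self, feat_dim=4096, hidden_dim=512, num_bins=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_bins),  # logits; softmax applied downstream
        )

    def forward(self, x):
        return self.net(x)

class AppearancePredictor(nn.Module):
    """a = A(x): pose-invariant appearance descriptor.
    Two layers, each FC + BN + ELU."""
    def __init__(self, feat_dim=4096, app_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, app_dim), nn.BatchNorm1d(app_dim), nn.ELU(),
            nn.Linear(app_dim, app_dim), nn.BatchNorm1d(app_dim), nn.ELU(),
        )

    def forward(self, x):
        return self.net(x)
```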
Decoder. The decoder maps the vector of concatenated appearance and pose parameters

  [a ⊕ p ⊕ t]    (4)

into the residual x̂ - x (here, ⊕ denotes vector concatenation). It is implemented with a two-layer network, where the first layer contains a sequence of fully-connected layer, batch normalization, and ELU, and the second is a fully-connected layer. The decoder output is finally summed to the input feature vector x to produce the target feature vector x̂.

Figure 3. Exemplary ModelNet [37] views: (a) Different views from one object (airplane); (b)-(c) Symmetric objects (bowl, plant) in different views; (d)-(e) Four views (bookshelf, desk) with 90 degrees difference.

The network is trained end-to-end, so as to optimize a multi-task loss that accounts for two goals. The first goal is that the generated feature vector x̂ indeed corresponds to the desired pose t. This is measured by the pose loss, which is the cross-entropy loss commonly used for classification, i.e.,

  L_p(x̂, t) = -log ρ_j(P(x̂)),    (5)

where ρ_j(v) = e^{v_j} / Σ_k e^{v_k} is the softmax function and j is the non-zero element of the one-hot vector t = e_j. Note that, as shown in Fig. 2, this requires passing the target feature vector x̂ through the pose predictor P. It should be emphasized that this is only needed during training, albeit the loss of (5) can also be measured during inference, since the target pose t is known. This can serve as a diagnostic of the performance of FATTEN.

The second goal is that the generated feature vector x̂ is assigned the same class label y as the source vector x. This encourages the generation of features with high recognition accuracy on the original object recognition problem. Recognition accuracy depends on the network used to extract the feature vectors, denoted as CNN in Fig. 2. Note that this network can be fine-tuned for operation with the FATTEN module in an end-to-end manner. While FATTEN can, in principle, be applied to any such network, our implementation is based on the VGG16 model of [27]. More specifically, we rely on the fc7 activations of a fine-tuned VGG16 network as source and target features. The category predictor of Fig. 2 is then the fc8 layer of this network. The accuracy of this predictor is measured with a cross-entropy loss

  L_c(x̂, y) = -log ρ_y(x̂),    (6)

where ρ(v) is the softmax output of this network. The multi-task loss is then defined as

  L(x̂, t, y) = L_p(x̂, t) + L_c(x̂, y).    (7)

In general, it is beneficial to pre-train the pose predictor P(x) and embed it into the encoder-decoder structure. This reduces the number of degrees of freedom of the network, and minimizes the ambiguity inherent to the fact that a given feature vector could be consistent with multiple pairs of pose and appearance parameters. For example, while all feature vectors x extracted from views of the same object should be constrained to map into the same appearance parameter value a, we have so far felt no need to enforce such a constraint. This endows the network with robustness to small variations of the appearance descriptor, due to occlusions, etc. Furthermore, when a pre-trained pose predictor is used, only the weights of the encoder/decoder need to be learned. The weights of the sub-networks used by the loss function(s) are fixed. This minimizes the chance that the FATTEN structure will over-fit to specific pose values or object categories.
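Putting the pieces together, a minimal sketch of the residual transfer of Eq. (2) and the multi-task loss of Eq. (7) could look as follows; `category_head` stands for the (fixed) fc8 classifier, and all dimensions are again illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FATTEN(nn.Module):
    """x_hat = x + Decoder([a, p, t]), with the encoder factored into
    the appearance and pose predictors sketched above."""
    def __init__(self, feat_dim=4096, app_dim=512, num_bins=12, hidden_dim=2048):
        super().__init__()
        self.pose = PosePredictor(feat_dim, 512, num_bins)
        self.appearance = AppearancePredictor(feat_dim, app_dim)
        self.decoder = nn.Sequential(
            nn.Linear(app_dim + 2 * num_bins, hidden_dim),  # input: [a, p, t]
            nn.BatchNorm1d(hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, feat_dim),                # second layer: plain FC
        )

    def forward(self, x, t_onehot):
        p = F.softmax(self.pose(x), dim=1)                  # source pose posterior, Eq. (3)
        a = self.appearance(x)                              # appearance descriptor
        residual = self.decoder(torch.cat([a, p, t_onehot], dim=1))
        return x + residual                                 # identity shortcut, Eq. (2)

def multitask_loss(model, category_head, x_hat, target_bin, label):
    """L = L_p + L_c, Eqs. (5)-(7); pose predictor and category head are frozen."""
    loss_p = F.cross_entropy(model.pose(x_hat), target_bin)  # pose loss, Eq. (5)
    loss_c = F.cross_entropy(category_head(x_hat), label)    # category loss, Eq. (6)
    return loss_p + loss_c                                   # Eq. (7)
```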
4. Experiments
We first train and evaluate the FATTEN model on the artificial ModelNet [37] dataset (Sec. 4.1), and then assess its feature augmentation performance on the one-shot object recognition task introduced in [3] (Sec. 4.2).
4.1. ModelNet

Dataset.
ModelNet [37] is an artificial dataset of 3D shapes, represented as 3D voxel grids. It contains shapes from 40 object categories. Given a 3D shape, it is possible to render 2D images from any pose. In our experiments, we follow the rendering strategy of [30]: 12 virtual cameras are placed around the object, in increments of 30 degrees along the z-axis, and 30 degrees above the ground. Several rendered views are shown in Fig. 3. The training and testing division is the same as in the ModelNet benchmark, using 80 objects per category for training and 20 for testing. However, the dataset contains some categories of symmetric objects, such as 'bowl', which produce identical images from all views (see Fig. 3(b)), and some that lack any distinctive information across views, such as 'plant' (see Fig. 3(c)). For training, these objects are eliminated and the remaining object categories are used.

Implementation.
All feature vectors x are collected from the fc7 activations of a fine-tuned VGG16 network. The pose predictor is trained first, and evaluated on the testing corpus. The complete FATTEN model is then trained end-to-end, with the pre-trained pose predictor in place. The angle range of 0°-360° is divided into 12 non-overlapping intervals of size 30° each, which are labeled as 0-11. Any given angle value is then converted to a classification label based on the interval it belongs to.
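For concreteness, this quantization and the one-hot target encoding can be sketched as follows (plain NumPy; the function name is ours):

```python
import numpy as np

def angle_to_onehot(angle_deg, num_bins=12):
    """Quantize an azimuth in [0, 360) into one of 12 non-overlapping
    30-degree intervals and return the one-hot code t = e_j."""
    j = int((angle_deg % 360.0) // (360.0 / num_bins))
    t = np.zeros(num_bins, dtype=np.float32)
    t[j] = 1.0
    return t

# Transferring one source feature to all 12 views (including the identity mapping):
targets = [angle_to_onehot(30.0 * k) for k in range(12)]
```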
4.1.1. Feature transfer

The feature transfer performance of FATTEN is assessed in two steps. The accuracy of the pose predictor is evaluated first, with the results listed in Table 1. The large majority of the errors have magnitude of 180°. This is not surprising, since ModelNet images have no texture. As shown in Fig. 3(d)-(e), object views that differ by 180° can be similar or even identical for some objects. However, this is not a substantial problem for transfer: since two feature vectors corresponding to the 180° difference are close to each other in feature space, to the point where the loss cannot distinguish them clearly, FATTEN will generate target features close to the source, which is the goal anyway. If these errors are disregarded, the pose prediction accuracy is substantially higher.

Table 1. Pose prediction error on ModelNet. Err. [deg] denotes the error magnitude and Perc. the percentage of error cases.

The second evaluation step measures the feature transfer performance of the whole network, given the pre-trained pose predictor. During training, each feature in the training set is transferred to all 12 views (including the identity mapping). During testing, this is repeated for each test feature. The accuracy of the pose and category prediction of the features, generated on the test corpus, is listed in Table 2. Note that, here, category refers to object category or class. It is clear that on a large synthetic dataset, such as ModelNet, FATTEN can generate features of good quality, as indicated by the pose prediction accuracy of 96.20% and the category prediction accuracy of 83.65%.

               Pose    Object category
Accuracy [%]   96.20   83.65

Table 2. Pose and category accuracy (in %) of generated features, on ModelNet.

4.1.2. Retrieval

A set of retrieval experiments is performed on ModelNet to further assess the effectiveness of FATTEN-generated features. These experiments address the question of whether the latter can be used to retrieve instances of (1) the same class or (2) the same pose.
Since all features are extracted from the VGG16 fc7 layer, the Euclidean distance

  d(x, y) = ||x - y||_2    (8)

is a sensible measure of similarity between x and y for the purpose of retrieving images of the same object category. This is because the model is trained to map features with equal category labels to the same partitions of the feature space (enforced by the category loss L_c). However, d is inadequate for pose retrieval. Instead, retrieval is based on the activation of the second fully-connected layer of the pose predictor P, which is denoted γ(x). The pose distance function is then defined as

  d_P(x, y) = ||γ(x) - γ(y)||_2.    (9)

Finally, the performance of joint category & pose retrieval is measured with a combined distance

  d_c(x, y) = d(x, y) + λ d_P(x, y).    (10)

All queries and instances to be retrieved are based on generated features from the testing corpus of ModelNet. For each generated feature, three queries are performed: (1) Category, (2) Pose, and (3) Category & Pose. This is compared to the performance, on the same experiment, of the real features extracted from the testing corpus.

Table 3. Retrieval performance in mAP [%] of real and generated features, on the testing portion of ModelNet, for distance functions d, d_P and d_c (see Sec. 4.1.2).

Retrieval results are listed in Table 3 and some retrieval examples are shown in Fig. 4. The generated features enable a very high mAP for pose retrieval, even higher than the mAP of real features. This is strong evidence that FATTEN successfully encodes pose information in the transferred features. The mAP of the generated features on category retrieval and the combination of both is comparatively low. However, the performance of real features is also weak on these tasks. This could be due to a failure of mapping features from the same category into well defined neighborhoods, or to the distance metric used for retrieval. While retrieval performs a nearest neighbor search under these metrics, the network optimizes the cross-entropy loss on the softmax output(s) of both output branches of Fig. 2. The distance of (10) may be a particularly poor way to assess joint category and pose distances. In the following section, we will see that using a strong classifier (e.g., an SVM) on the generated features produces significantly better results.
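A minimal sketch of the three distances of Eqs. (8)-(10); here gamma_x and gamma_y denote the pose-predictor embeddings γ(·), and the mixing weight lam, whose value is not given here, is an assumption:

```python
import numpy as np

def d_category(x, y):
    """Eq. (8): Euclidean distance between fc7 features (category retrieval)."""
    return np.linalg.norm(x - y)

def d_pose(gamma_x, gamma_y):
    """Eq. (9): Euclidean distance between pose-predictor embeddings gamma(.)."""
    return np.linalg.norm(gamma_x - gamma_y)

def d_combined(x, y, gamma_x, gamma_y, lam=1.0):
    """Eq. (10): weighted combination for joint category & pose retrieval."""
    return d_category(x, y) + lam * d_pose(gamma_x, gamma_y)
```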
Figure 4. Some retrieval results for the experiments of Sec. 4.1.2. The first two lines refer to category and pose retrieval, lines 3-4 to category retrieval, and lines 5-6 to pose retrieval. Errors are highlighted in red. For each pair of images in the query part, the left one is the original image, while the right one is the real image corresponding to the generated feature.
4.2. One-shot object recognition

The experiments above provide no insight on whether FATTEN generates meaningful features for tasks involving real-world datasets. In this section, we assess feature transfer performance on a one-shot object recognition problem. In this task, feature transfer is used for feature space "fattening", or data augmentation. The dataset and benchmark are collected from SUN-RGBD [29], following the setup of [3].
Dataset.
The whole SUN-RGBD dataset contains 10,335 images and their corresponding depth maps. Additionally, 2D and 3D bounding boxes are available as ground truth for object detection. Depth (distance from the camera plane) and Pose (rotation around the vertical axis of the 3D coordinate system) are used as pose parameters in this task. The depth range of [0, 5]m is broken into non-overlapping intervals of size 0.5m. An additional interval [5, +∞) is included for larger depth values. For pose, the angular range of 0°-360° is divided into non-overlapping intervals of equal size. These intervals are used for one-hot encoding and system training. To allow a fair comparison with AGA, however, during testing, we restrict the desired pose t to take the values prescribed in [3]. This is mainly to ensure that our system generates synthetic points along the Depth trajectory and along the Pose trajectory similar to theirs.

The first portion of SUN-RGBD is used for training and the remaining images for testing. However, if only ground truth bounding boxes are used for object extraction, the instances are neither balanced w.r.t. categories, nor w.r.t. pose/depth values. To remedy this issue, a Fast R-CNN [6] object detector is fine-tuned on the dataset and the selective search proposals with high IoU (to ground truth boxes) and high detection scores are used to extract object images for training. As this strategy produces a sufficient amount of data, the training set can be easily balanced per category, as well as pose and depth. In the testing set, only ground truth bounding boxes are used to extract objects. All source features are extracted from the penultimate (i.e., fc7) layer of the fine-tuned Fast R-CNN detector for all instances from both training and testing sets.

Evaluation is based on the source and target object classes defined in [3]. We denote S as the source dataset, and let T1 and T2 denote two different (disjoint) target datasets; further, T = T1 ∪ T2 denotes a third dataset that is the union of the first two. Table 4 lists all the object categories in each set. The instances in S are collected from the training portion of SUN-RGBD only, while those in T1 and T2 are collected from the testing set. Further, S does not overlap with any T_i, which ensures that FATTEN has no access to shared knowledge between training/testing images or classes.

Implementation.
The attribute predictors for pose and depth are trained first. The feature transfer network is then fine-tuned, starting from the weights obtained from the ModelNet experiment of Sec. 4.1. The classification problems on T1 and T2 are 10-class problems, whereas T is a 20-class problem. As a baseline for one-shot learning, we train a linear SVM using only a single instance per class. We then feed those same instances into the feature transfer network to generate artificial features for different values of depth and pose; specifically, several target values are used for depth and for pose. After feature synthesis, a linear SVM is trained with the same parameters on the now augmented ("fattened") feature set (source and target features).
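The augmentation protocol itself reduces to a few lines; the sketch below (scikit-learn) assumes a trained transfer function `fatten_transfer(x, t)` returning one synthetic feature per one-hot pose/depth target, and an illustrative `target_grid` of such targets. The same routine, seeded with five instances per class, yields five-shot results.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_augmented_svm(x_seed, y_seed, fatten_transfer, target_grid, C=1.0):
    """Train a linear SVM on one real feature per class plus the synthetic
    features generated for every target pose/depth value."""
    feats = [x_seed]                       # (n_classes, D) real seed features
    labels = [np.asarray(y_seed)]
    for x, y in zip(x_seed, y_seed):
        for t in target_grid:              # one-hot pose/depth targets
            feats.append(fatten_transfer(x, t)[None, :])
            labels.append(np.array([y]))
    X = np.concatenate(feats, axis=0)
    Y = np.concatenate(labels, axis=0)
    return LinearSVC(C=C).fit(X, Y)        # "fattened" one-shot classifier
```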
First , when com-pared to the SVM baseline, FATTEN achieves a remark-able and consistent improvement of around percentagepoints on all evaluation sets. This indicates that FATTENcan actually embed the pose information into features andeffectively “fatten” the data used to train the linear SVMclassifier. Second , and most notably, FATTEN achieves asignificant improvement (about percentage points) overAGA, and an even larger improvement over the feature hal-lucination approach of [8]. The improved performances ofFATTEN over AGA and AGA over hallucination show thatit is important 1) to exploit the structure of the pose man-ifold (which only FATTEN and AGA do), and 2) to relyon models that can capture defining properties of this mani-fold, such as continuity and smoothness of feature trajecto-ries (which AGA does not).While the feature hallucination strategy works remark-ably well in the ImageNet1k low-shot setup used in [8],Table 5 only shows marginal gains over the baseline (es-pecially in the one-shot case). There may be several rea-sons as to why it fails in this setup. First, the number ofexamples per category ( k in the notation of [8]) is a hyper-parameter set through cross-validation. To make the com-parison fair, we chose to use the same value in all methods,which is k = 19 . This may not be the optimal setting for[8]. Second, we adopt the same number of clusters as usedby the authors when training the generator. However, thebest value may depend on the dataset (ImageNet1k in [8] vs . SUN-RGBD here). Without clear guidelines of how toset this parameter, it seems challenging to adjust it appropri- Baseline Hal. [8]
AGA [3]
FATTEN
One-shot T (10) 33.74 35.43 39.10 44.99 T (10) 23.76 21.12 30.12 34.70 T (20) 22.84 21.67 26.67 32.20 Five-shot T (10) 50.03 50.31 56.92 58.82 T (10) 36.76 38.07 47.04 50.69 T (20) 37.37 38.24 42.87 47.07 Table 5. One-shot and five-shot recognition accuracy for three dif-ferent few-shot recognition problems, constructed from the SUN-RGBD dataset. The recognition accuracies (in %) are averagedover 500 random repetitions of the experiment. The
Baseline de-notes the recognition accuracy achieved by a linear SVM, trainedon single instances of each class only. ately. Third, all results of [8] list the top-5 accuracy, whilewe use top-1 accuracy. Finally, FATTEN takes advantageof pose and depth to generate more features, while the hal-lucination feature generator is non-parametric and does notexplicitly use this information for synthesis.The improvement of FATTEN over AGA can most likelybe attributed to 1) the fact that AGA uses separate synthesisfunctions (trained independently) and 2) failure cases of thepose/depth predictor that determines which particular syn-thesis function is used. In case of the latter, generated fea-tures are likely to be less informative, or might even con-found any subsequent classifier.
5. Discussion
The proposed architecture for data augmentation in feature space, FATTEN, aims to learn trajectories of feature responses, induced by variations in image properties (such as pose). These trajectories can then be easily traversed via one learned mapping function which, when applied to instances of novel classes, effectively enriches the feature space with additional samples corresponding to a desired change, e.g., in pose. This "fattening" of the feature space is highly beneficial in situations where the collection of large amounts of adequate training data to cover these variations would be time-consuming, if not impossible. In principle, FATTEN can be used for any kind of desired (continuous) variation, so long as the trajectories can be learned from an external dataset. By discretizing the space of variations, e.g., the rotation angle in case of pose, we also effectively reduce the dimensionality of the learning problem and ensure that the approach scales favorably w.r.t. different resolutions of desired changes. Finally, it is worth pointing out that feature space transfer via FATTEN is not limited to object images; rather, it is a generic architecture in the sense that any variation could, in principle, be learned and transferred.
References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[3] M. Dixit, R. Kwitt, M. Niethammer, and N. Vasconcelos. AGA: Attribute-guided augmentation. In CVPR, 2017.
[4] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, abs/1703.03400, 2017.
[5] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
[6] R. Girshick. Fast R-CNN. In ICCV, 2015.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[8] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. CoRR, abs/1606.02819, 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[10] P. Isola, J. Zhu, T. Zhou, and A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[11] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3D shape segmentation with projective convolutional networks. CoRR, abs/1612.02808, 2016.
[12] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[13] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool. Hough transform and 3D SURF for robust three dimensional classification. In ECCV, pages 589-602, 2010.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[15] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453-465, 2014.
[16] Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
[17] S. Nene, S. Nayar, and H. Murase. Columbia object image library. Technical Report CUCS-006-96, Columbia University, 1996.
[18] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[19] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3D models. In ICCV, 2015.
[20] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. CoRR, abs/1612.00593, 2016.
[21] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. CoRR, abs/1604.03265, 2016.
[22] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection. In NIPS, 2015.
[24] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
[25] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
[26] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[28] R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[29] S. Song, S. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, 2015.
[30] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In ICCV, 2015.
[31] H. Su, C. Qi, Y. Li, and L. Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In ICCV, 2015.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[33] K. Tang, M. Tappen, R. Sukthankar, and C. Lampert. Optimizing one-shot recognition with micro-set learning. In CVPR, 2010.
[34] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[35] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[36] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
[37] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shape modeling. In CVPR, 2015.