Low-shot Visual Recognition by Shrinking and Hallucinating Features
Bharath Hariharan and Ross Girshick
Facebook AI Research (FAIR)
Figure 1: Our low-shot learning benchmark has two phases: representation learning and low-shot learning. A feature extractor is trained on the base classes (many training examples); a classifier over both base and novel categories is then learnt from the novel classes (few training examples). Modern recognition models use large labeled datasets like ImageNet to build good visual representations and train strong classifiers (representation learning). However, these datasets contain only a fixed set of classes. In many realistic scenarios, once deployed, the model might encounter novel classes that it also needs to recognize, but with very few training examples available (low-shot learning). We present two ways of significantly improving performance in this scenario: (1) a novel loss function for representation learning that leads to better visual representations that generalize well, and (2) a method for hallucinating additional examples for the data-starved novel classes.
Abstract
Low-shot visual learning, the ability to recognize novel object categories from very few examples, is a hallmark of human visual intelligence. Existing machine learning approaches fail to generalize in the same way. To make progress on this foundational problem, we present a low-shot learning benchmark on complex images that mimics challenges faced by recognition systems in the wild. We then propose (1) representation regularization techniques, and (2) techniques to hallucinate additional training examples for data-starved classes. Together, our methods improve the effectiveness of convolutional networks in low-shot learning, improving the one-shot accuracy on novel classes by 2.3x on the challenging ImageNet dataset.
1. Introduction
Recently, error rates on benchmarks like ImageNet [8] have been halved, and then halved again. These gains come from deep convolutional networks (ConvNets) that learn rich feature representations [24]. It is now clear that if an application has an a priori fixed set of visual concepts and thousands of examples per concept, an effective way to build an object recognition system is to train a deep ConvNet. But what if these assumptions are not satisfied and the network must learn novel categories from very few examples?

The ability to perform low-shot learning, that is, learning novel concepts from very few examples, is a hallmark of the human visual system. We are able to do this not only for natural object categories such as different kinds of animals, but also for synthetic objects that are unlike anything we've seen before [40]. In contrast, in spite of significant improvements in recognition performance, computational recognition approaches fail to generalize well from few examples [25]. Our goal in this paper is to make progress towards imparting this human ability to modern recognition systems.

Our first contribution is a low-shot learning benchmark based on the challenging ImageNet1k dataset. As shown in Figure 1, our benchmark is implemented in two phases. In the representation learning phase, the learner tunes its feature representation on a set of base classes that have many training instances. In the low-shot learning phase, the learner is exposed to a set of novel classes with only a few examples per class and must learn a classifier over the joint label space of base and novel classes. This benchmark simulates a scenario in which the learner is deployed in the wild and has to quickly learn novel concepts it encounters from very little training data. Unlike previous low-shot learning tests (e.g., [13, 25]), we measure the learner's accuracy on both the base and novel classes. This provides a sanity check that accuracy gains on novel classes do not come at the expense of a large loss in performance on the base classes. This evaluation protocol follows the standard way that image classifiers are evaluated on popular benchmarks like ImageNet, thus easing the comparison of progress on low-shot learning to the typical data-rich scenario.

Next, we investigate how to improve the learner's performance on the benchmark. We build on the intuition that certain modes of intra-class variation generalize across categories (e.g., pose transformations). We present a way of "hallucinating" additional examples for novel classes by transferring modes of variation from the base classes. These additional examples improve the one-shot top-5 accuracy on novel classes by
15 points (absolute) while also maintaining accuracy on the base classes.

Finally, we show that the feature representation learnt in the first phase has a large impact on low-shot generalization ability. Specifically, we formulate a loss function that penalizes the difference between classifiers learnt on large and small datasets, and then draw connections between this loss and regularization of feature activations. We show that simply regularizing feature activations can increase one-shot, top-5 accuracy on novel classes by up to 15 points (absolute) without harming base class performance. Combining this better representation with the hallucination strategy pushes our improvement up to 18 points above the baseline.
2. Related work
One-shot and low-shot learning.
One class of approaches to one-shot learning uses generative models of appearance that tap into a global [13] or a supercategory-level [39] prior. Generative models based on strokes [26] or parts [48] have shown promise in restricted domains such as hand-written characters [28, 25]. They also work well in datasets without much intra-class variation or clutter, such as Caltech 101 [13]. Dixit et al. [9] leverage a corpus with attribute annotations to generate additional examples by varying attributes. We also propose a way to generate additional examples, but our model does not use such supervision. A similar approach to synthesizing additional examples by transforming existing ones is presented in early work by Miller et al. [31]. Our approach generalizes this to realistic, generic image categories and is non-parametric.

Jia et al. [22] present a promising alternative to generation, using Bayesian reasoning to infer an object category from a few examples; however, in [22] the full, large-scale training set is available during training.

Among discriminative approaches, early work attempted to use a single image of the novel class to adapt classifiers from similar base classes [3, 33] using simple hand-crafted features. Bertinetto et al. [4] regress from single examples to classifiers, while Wang and Hebert [47] regress from classifiers trained on small datasets to classifiers trained on large datasets. Recent "meta-learning" techniques learn to directly map training sets and test examples to classification outputs [46, 15, 36]. We compare favorably with these approaches in our experiments.

Among representation learning approaches, metric learning, such as the triplet loss [44, 41, 14] or siamese networks [23, 18], has been used to automatically learn feature representations in which objects of the same class are closer together. Such approaches have shown benefits in face identification [44]. On benchmarks involving more general Internet imagery, such as ImageNet [8], these methods perform worse than simple classification baselines [37], and it is unclear if they can benefit low-shot learning.
Zero-shot learning.
Zero-shot recognition uses textual or attribute-level descriptions of object classes to train classifiers. While this problem is different from ours, the motivation is the same: to reduce the amount of data required to learn classifiers. One line of work uses hand-designed attribute descriptions that are provided to the system for the novel categories [38, 27, 12]. Another class of approaches embeds images into word embedding spaces learnt using large text corpora, so that classifiers for novel concepts can be obtained simply from the word embedding of the concept [16, 43, 32, 49]. A final class of approaches attempts to directly regress to image classifiers from textual descriptions [11, 29] or from prototypical images of the category [21]. Similar to our benchmark, Chao et al. [5] propose that zero-shot learning evaluation should also include the training categories that do have examples. We believe this evaluation style is good for both zero- and low-shot learning.
Transfer learning.
The ability to learn novel classes quickly is one of the main motivations for multitask and transfer learning. Thrun's classic paper convincingly argues that "learning the n-th task should be easier than learning the first," with ease referring to sample complexity [45]. However, recent transfer learning research has mostly focused on the scenario where large amounts of training data are available for novel classes. For that situation, the efficacy of pre-trained ConvNets for extracting features is well known [10, 34, 42]. There is also some analysis of what aspects of ImageNet training aid this transfer [1, 2]. Taigman et al. [44] find that low-dimensional feature representations transfer better on faces, and Galanti et al. [17] provide some theoretical justification for this finding. This work hints at a link between the complexity of the feature representation and its generalizability, a link which we also observe in this paper. We find that stronger base classifiers generalize better than weaker classifiers (e.g., comparing ResNet-10 to ResNet-50 [19]). There have also been novel losses proposed explicitly to aid transfer, such as the multiverse loss of Littwin and Wolf [30]. Our paper also proposes novel losses designed specifically for low-shot learning.

3. A low-shot learning benchmark

Our goal is to build a benchmark for low-shot learning that mimics situations that arise in practice. Current recognition systems require days or even weeks of training on expensive hardware to develop good feature representations. The trained recognition systems may then be deployed as a service to be used by downstream applications. These downstream applications may need the ability to recognize novel categories, but they may have neither the training data required, nor the infrastructure needed to retrain the models. Thus, there are two natural phases: in the first phase, we have the data and resources to train sophisticated feature extractors on large labelled datasets, and in the second phase, we want to add additional categories to our repertoire at minimal computational and data cost.

Our low-shot learning benchmark implements a similar setup. It employs a learner, two training phases, and one testing phase. The learner is assumed to be composed of a feature extractor and a multi-class classifier. The benchmark is agnostic to the specific form of each component.

During representation learning (training phase one), the learner receives a fixed set of base categories C_base, and a dataset D containing a large number of examples for each category in C_base. The learner uses D to set the parameters of its feature extractor.

In the second phase, which we call low-shot learning, the learner is given a set of categories C_l that it must learn to distinguish. C_l = C_base ∪ C_novel is a mix of base categories C_base and unseen novel categories C_novel. For each novel category, the learner has access to only n positive examples, where n ∈ {1, 2, 5, 10, 20}. For the base categories, the learner still has access to D. The learner may then use these examples and its feature extractor to set the parameters of its multi-class classifier, while also optionally modifying the feature extractor.

In the testing phase, the learnt model predicts labels from the combined label space C_base ∪ C_novel on a set of previously unseen test images. To measure the variability in low-shot learning accuracy, we repeat the low-shot learning and testing phases for 5 trials, each time with a random draw of examples for the novel classes.
We report the mean accuracy and the standard deviation over these trials.

The simplest, and commonly used, baseline approach is to train a ConvNet with label cross-entropy loss in the representation learning phase and then train a new linear classifier head in the low-shot learning phase. We now show significant improvements over this baseline, first by a novel strategy of hallucinating additional training examples (Section 4) and then by improving the representation itself (Section 5).
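To make the protocol concrete, the sketch below runs one low-shot trial on pre-computed features: sample n examples per novel class, train a classifier over the joint label space, and evaluate on held-out data. This is a minimal illustration of the benchmark loop, not the released evaluation code; the synthetic data, feature dimensionality, and use of scikit-learn's logistic regression are all assumptions made for the sake of a runnable example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def low_shot_trial(base_X, base_y, novel_X, novel_y, test_X, test_y, n, rng):
    # Low-shot phase: keep all base examples, sample n per novel class.
    X, y = [base_X], [base_y]
    for c in np.unique(novel_y):
        idx = rng.choice(np.flatnonzero(novel_y == c), size=n, replace=False)
        X.append(novel_X[idx]); y.append(novel_y[idx])
    clf = LogisticRegression(max_iter=500)
    clf.fit(np.concatenate(X), np.concatenate(y))
    # Testing phase: accuracy over the joint label space
    # (the paper reports top-5; top-1 keeps this sketch short).
    return (clf.predict(test_X) == test_y).mean()

# Tiny synthetic stand-in for pre-computed features (4 base + 2 novel classes).
rng = np.random.default_rng(0)
centers = rng.normal(size=(6, 16)) * 3.0
def make(classes, m):
    X = np.concatenate([centers[c] + rng.normal(size=(m, 16)) for c in classes])
    return X, np.repeat(list(classes), m)
base_X, base_y = make(range(4), 100)
novel_X, novel_y = make(range(4, 6), 20)
test_X, test_y = make(range(6), 30)

# Repeat over 5 random draws of novel examples; report mean and std.
accs = [low_shot_trial(base_X, base_y, novel_X, novel_y,
                       test_X, test_y, n=2, rng=rng) for _ in range(5)]
print(f"top-1 acc: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
```

The actual benchmark uses ConvNet features on ImageNet1k, runs 5 trials for each n ∈ {1, 2, 5, 10, 20}, and reports top-5 accuracy separately on the novel classes and on all classes.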
4. Better low-shot learning through generation
In the low-shot learning phase, our goal is to train good classifiers for novel categories from only a few examples. Intuitively, the challenge is that these examples capture very little of the category's intra-class variation. For instance, if the category is a particular bird species, then we may only have examples of the bird perched on a branch, and none of it in flight. The classifier might then erroneously conclude that this novel category only consists of perched birds.

However, this mode of variation is common to many bird species, including those we have encountered in the base classes. From the many base class examples we have seen, we can understand the transformation that relates perched bird images to the image of the corresponding bird in flight, and then use this transformation to "hallucinate" additional examples for our novel bird category. If we were given the set of all such category-independent transformations, then we could hallucinate as many new examples for each novel category example as there are transformations.

However, we do not have a pre-defined set of transformations that we can apply. But we can take a non-parametric approach. Any two examples z_1 and z_2 belonging to the same category represent a plausible transformation. Then, given a novel category example x, we want to apply to x the transformation that sent z_1 to z_2. That is, we want to complete the transformation "analogy" z_1 : z_2 :: x : ?.

We do this by training a function G that takes as input the concatenated feature vectors of the three examples [φ(x), φ(z_1), φ(z_2)]. It produces as output a "hallucinated" feature vector (of the same dimensionality as φ), which corresponds to applying the z_1 → z_2 transformation to x. We use an MLP with three fully connected layers for G.

We first describe how we train G, and then show how we use the generated examples in the low-shot learning phase.

To train G, we first collect a dataset of completed analogies from our base classes. To do this, we first cluster the feature vectors of the examples in each base category into a fixed number of clusters (100); this keeps the computational complexity manageable. Next, for each pair of centroids c_1^a, c_2^a in one category a, we search for another pair of centroids c_1^b, c_2^b from another category b, such that the cosine distance between c_2^a − c_1^a and c_2^b − c_1^b is minimized. We collect all such quadruplets (c_1^a, c_2^a, c_1^b, c_2^b) with cosine similarity greater than zero into a dataset D_G. See Figure 2 for example transformation analogies.

We now use the dataset D_G to train G. For each quadruplet (c_1^a, c_2^a, c_1^b, c_2^b), we feed (c_1^a, c_1^b, c_2^b) to the generator. Let ĉ_2^a = G([c_1^a, c_1^b, c_2^b]) be the output of the generator. We then minimize λ L_mse(ĉ_2^a, c_2^a) + L_cls(W, ĉ_2^a, a), where:

1. L_mse(ĉ_2^a, c_2^a) is the mean squared error between the generator's output and the true target of the analogy c_2^a.

2. L_cls(W, ĉ_2^a, a) is the classification loss, where W is the fixed linear classifier on the base classes learnt during representation learning, and L_cls(W, x, y) is the log loss of the classifier W on the example (x, y).

Figure 2: Example mined analogies. Each row shows the four image clusters that form the four elements in the analogy. Row 1: birds with a sky backdrop vs. birds with greenery in the background. Row 2: whole fruits vs. cut fruit. Row 3: machines (printer, coffee maker) in isolation vs. the same machine operated by a human.

Our generated examples are unlikely to be as good as real examples, but should provide a useful bias to the classifier when only a few real examples are present. Therefore we want to rely on generated examples only when the number of real examples is low.

Concretely, we have a hyperparameter k (set through cross-validation), which is the minimum number of examples per novel category that we want to have. If the actual number of real examples for a novel category, n, is less than k, then we additionally generate k − n hallucinated examples. To generate a synthetic example for a novel category l, we sample the feature vector of a "seed" example φ(x) from one of the n real examples for this category, and a pair of cluster centroids c_1^a, c_2^a from a base category a chosen uniformly at random. We then pass this triplet through G, and add the hallucinated feature vector G([φ(x), c_1^a, c_2^a]) to our training set with label l. We then train the logistic regression classifier on this mix of real and generated data in the usual manner.
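The sketch below shows the two pieces just described in roughly the form the text suggests: the generator G as a three-layer MLP over concatenated feature vectors, and the hallucination step that tops a novel class up to k examples. Feature dimensionality, hidden sizes, and the data structures holding centroids are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    """G: maps [phi(x), c1, c2] to a hallucinated feature vector, i.e. the
    result of applying the c1 -> c2 transformation to x (same dim as phi)."""
    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim), nn.ReLU())  # final ReLU: phi(x) >= 0
    def forward(self, x, c1, c2):
        return self.mlp(torch.cat([x, c1, c2], dim=-1))

def hallucinate(G, real_feats, centroid_pairs, k):
    """If a novel class has n < k real features, generate k - n extras by
    pairing random seed features with random base-class centroid pairs."""
    n = len(real_feats)
    extras = []
    with torch.no_grad():
        for _ in range(max(k - n, 0)):
            x = real_feats[torch.randint(n, (1,)).item()]
            c1, c2 = centroid_pairs[torch.randint(len(centroid_pairs), (1,)).item()]
            extras.append(G(x, c1, c2))
    return torch.stack(extras) if extras else real_feats[:0]
```

Training G on the mined quadruplets then amounts to minimizing λ · MSE(G([c_1^a, c_1^b, c_2^b]), c_2^a) plus the log loss of the frozen base classifier W on the generated vector, as described above.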
5. Better representations for low-shot learning
We now turn to the question of improving representation learning so as to enable better low-shot learning. As described above, the learner consists of a feature extractor φ and a classifier W. The goal of representation learning is a good feature extractor: one that enables learning of effective classifiers from few examples. Intuitively, our goal is to reduce the difference between classifiers trained on large datasets and classifiers trained on small datasets, so that those trained on small datasets generalize better.

We first describe a proposal that encodes this goal in a loss that can be minimized during representation learning. Then, we draw connections to several alternatives.

We assume that the classifier W is linear, e.g., the last layer of a ConvNet. Let D denote a large labeled dataset of base class images. Typically, training the feature extractor φ and the classifier W on D involves minimizing a classification objective with respect to φ and W:

\min_{W,\phi} L_D(\phi, W) = \min_{W,\phi} \frac{1}{|D|} \sum_{(x,y) \in D} L_{cls}(W, \phi(x), y) \quad (1)

where L_cls(W, x, y) is the multiclass logistic loss on an example x with label y for a linear classifier W:

L_{cls}(W, x, y) = -\log p_y(W, x) \quad (2)

p_k(W, x) = \frac{\exp(w_k^T x)}{\sum_j \exp(w_j^T x)} \quad (3)

We modify this training procedure as follows. We simulate low-shot learning experiments on the base classes by considering several tiny training sets S ⊂ D, |S| ≪ |D|. We then want to reduce the difference between classifiers trained on the large dataset D (using the feature extractor φ) and classifiers trained on these small datasets S.

The classifier trained on D is just W. Training a classifier on S involves solving a minimization problem:

\min_V L_S(\phi, V) = \min_V \frac{1}{|S|} \sum_{(x,y) \in S} L_{cls}(V, \phi(x), y) \quad (4)

We want the minimizer of this objective to match W. In other words, we want W to minimize L_S(φ, V). L_S(φ, V) is convex in V (Figure 3), so a necessary and sufficient condition for this is that the gradient of L_S(φ, V) at V = W, denoted by ∇_V L_S(φ, V)|_{V=W}, is 0. More generally, the closer W is to the global minimum of L_S(φ, V), the lower the magnitude of this gradient. Thus, we want to minimize:

\tilde{L}_S(\phi, W) = \| \nabla_V L_S(\phi, V)|_{V=W} \|^2 \quad (5)

Figure 3: Motivation for the SGM loss. We want to learn a representation φ such that the arg min of the small-set training objective L_S(φ, V) matches W, the classifier trained on a large dataset D. (The plot sketches the convex curve L_S(φ, V), with a bad W far from its minimum and a good W at it.)

The gradient ∇_V L_S(φ, V) has a simple analytical form (see the supplemental material¹ for details):

\nabla_V L_S(\phi, V) = [g_1(S, V), \ldots, g_K(S, V)] \quad (6)

g_k(S, V) = \frac{1}{|S|} \sum_{(x,y) \in S} (p_k(V, \phi(x)) - \delta_{yk}) \phi(x) \quad (7)

where K is the number of classes, δ_yk is 1 when y = k and 0 otherwise, and p_k is as defined in equation (3).

¹Supplemental material is available at http://home.bharathh.info/lowshotsupp.pdf

This leads to an analytical form for the function ˜L_S(φ, W):

\tilde{L}_S(\phi, W) = \frac{1}{|S|^2} \sum_{k=1}^{K} \Big\| \sum_{(x,y) \in S} (p_k(W, \phi(x)) - \delta_{yk}) \phi(x) \Big\|^2

We use this analytical function of W and φ as a loss.

We consider an extreme version of this loss where S is a single example (x, y). In this case:

\tilde{L}_S(\phi, W) = \sum_{k=1}^{K} (p_k(W, \phi(x)) - \delta_{yk})^2 \|\phi(x)\|^2 \quad (8)

= \alpha(W, \phi(x), y) \|\phi(x)\|^2 \quad (9)

where α(W, φ(x), y) = Σ_k (p_k(W, φ(x)) − δ_yk)² is a per-example weight that is higher for data points that are misclassified. Thus the loss becomes a weighted L2 regularization on the feature activations.

Our final loss, which we call SGM for Squared Gradient Magnitude, averages this over all examples in D:

L^{SGM}_D(\phi, W) = \frac{1}{|D|} \sum_{(x,y) \in D} \alpha(W, \phi(x), y) \|\phi(x)\|^2 \quad (10)

We train our feature representation by minimizing a straightforward linear combination of the SGM loss and the original classification objective:

\min_{W,\phi} L_D(\phi, W) + \lambda L^{SGM}_D(\phi, W) \quad (11)

λ is obtained through cross-validation.
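Because Eq. (10) is just a per-example weight times the squared feature norm, the SGM loss drops into a standard training loop in a few lines. A minimal PyTorch sketch under that reading (the network producing `features` and `logits`, the batch, and λ are placeholders):

```python
import torch
import torch.nn.functional as F

def sgm_loss(features, logits, labels):
    """Eq. (10): mean over the batch of alpha(W, phi(x), y) * ||phi(x)||^2,
    where alpha = sum_k (p_k - delta_yk)^2 upweights misclassified examples."""
    p = logits.softmax(dim=1)                                  # p_k(W, phi(x))
    delta = F.one_hot(labels, num_classes=p.shape[1]).float()  # delta_yk
    alpha = ((p - delta) ** 2).sum(dim=1)                      # per-example weight
    return (alpha * features.pow(2).sum(dim=1)).mean()

# Eq. (11): total objective = classification loss + lambda * SGM loss.
# loss = F.cross_entropy(logits, labels) + lam * sgm_loss(features, logits, labels)
```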
Batch SGM. Above, we used singleton sets as our tiny training sets S. An alternative is to consider every mini-batch of examples B that we see during SGD as S. Hence, we penalize the squared gradient magnitude of the average loss over B, yielding the loss term λ ˜L_B(φ, W). In each SGD iteration, our total loss is thus the sum of this loss term and the standard classification loss. Note that because this loss is defined on mini-batches, the number of examples per class in each mini-batch is a random variable. Thus this loss, which we call "batch SGM", optimizes for an expected loss over a distribution of possible low-shot values n.

In Eq. (9), it can be shown that α(W, φ(x), y) ∈ [0, 2] (see supplementary). Thus, in practice, the SGM loss is dominated by ‖φ(x)‖², which is much larger. This suggests a simple squared L2 norm as a loss:

\min_{W,\phi} L_D(\phi, W) + \frac{\lambda}{|D|} \sum_{(x,y) \in D} \|\phi(x)\|^2 \quad (12)

While L2 regularization is a common technique, note that here we are regularizing the feature representation, as opposed to regularizing the weight vector. Regularizing the feature vector norm has been a staple of unsupervised learning approaches to prevent degenerate solutions [35], but to the best of our knowledge it hasn't been considered in supervised classification.

We can also consider other ways of regularizing the representation, such as an L1 regularization:

\min_{W,\phi} L_D(\phi, W) + \frac{\lambda}{|D|} \sum_{(x,y) \in D} \|\phi(x)\|_1 \quad (13)

We also evaluate other forms of feature regularization that have been proposed in the literature. The first of these is dropout [20], which was used in earlier ConvNet architectures [24] but has been eschewed by recent architectures such as ResNets [19]. Another form of feature regularization involves minimizing the correlation between the features [6, 7]. We also compare to the multiverse loss [30], which was shown to improve transfer learning performance.
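Eqs. (12) and (13) are even simpler to add to a training loop; a sketch of both penalties follows (λ again being a cross-validated placeholder). Note that these act on the feature activations, not on the classifier weights:

```python
def l2_feature_penalty(features):
    # Eq. (12): mean squared L2 norm of the feature activations.
    return features.pow(2).sum(dim=1).mean()

def l1_feature_penalty(features):
    # Eq. (13): mean L1 norm of the feature activations.
    return features.abs().sum(dim=1).mean()

# loss = F.cross_entropy(logits, labels) + lam * l2_feature_penalty(features)
```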
Why should feature regularization help? When learning the classifier and feature extractor jointly, the feature extractor can choose to encode less discriminative information in the feature vector because the classifier can learn to ignore this information. However, when learning new classifiers in the low-shot phase, the learner will not have enough data to identify discriminative features for the unseen classes from its representation. Minimizing the norm of the feature activations might limit what the learner can encode into the features, and thus force it to only encode useful information.
A common approach to one-shot learning is to learn a good distance metric that generalizes to unseen classes. We train a ConvNet with the triplet loss as a representative baseline method. The triplet loss takes as input a triplet of examples (x, x⁺, x⁻), where x and x⁺ belong to the same category while x⁻ does not:

L_{triplet}(\phi(x), \phi(x^+), \phi(x^-)) = \max(\|\phi(x^+) - \phi(x)\| - \|\phi(x^-) - \phi(x)\| + \gamma, 0) \quad (14)

The loss encourages x⁻ to be at least γ farther away from x than x⁺ is.
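A direct transcription of Eq. (14) is shown below; PyTorch's built-in nn.TripletMarginLoss implements essentially the same max-margin objective, but the explicit form mirrors the equation. The margin value γ here is a placeholder, not a value from the paper:

```python
import torch

def triplet_loss(f_x, f_pos, f_neg, gamma=0.2):
    # Eq. (14): push the negative at least gamma farther from x than the positive.
    d_pos = (f_pos - f_x).norm(dim=1)
    d_neg = (f_neg - f_x).norm(dim=1)
    return torch.clamp(d_pos - d_neg + gamma, min=0).mean()
```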
6. Experiments and discussion
We use the ImageNet1k challenge dataset for experiments because it has a wide array of classes with significant intra-class variation. We divided the 1000 ImageNet categories randomly into 389 base categories and 611 novel categories (listed in the supplementary material).

Many of the methods we evaluate have hyperparameters that need to be cross-validated. Since we are interested in generalization to novel classes, we did not want to cross-validate on the same set of classes that we test on. We therefore constructed two disjoint sets of classes by dividing the base categories into two subsets C_base^1 (193 classes) and C_base^2 (196 classes) and the novel categories into C_novel^1 (300 classes) and C_novel^2 (311 classes). Then, for cross-validating hyperparameters, we provided the learner with C_cv = C_base^1 ∪ C_novel^1 in the low-shot learning and testing phases, and evaluated its top-5 accuracy on the combined label set C_cv. The hyperparameter setting that gave the highest top-5 accuracy was then frozen. We then conducted our final experiments using these hyperparameter settings by providing the learner with C_fin = C_base^2 ∪ C_novel^2. All reported numbers in this paper are on C_fin.

Our test images are a subset of the ImageNet1k validation set: we simply restricted it to only include examples from the classes of interest (C_cv or C_fin). Performance is measured by top-1 and top-5 accuracy on the test images for each value of n (the number of novel examples per category). We report the mean and standard deviation from 5 runs, each using a different random sample of novel examples during the low-shot training phase.

To break down the final performance metrics, we report separately the average accuracy on the test samples from the novel classes and on all test samples. While our focus is on the novel classes, we nevertheless need to ensure that good performance on novel classes doesn't come at the cost of lower accuracy on the base classes.

For most of our experiments, we use a small ten-layer ResNet architecture [19] as our feature extractor φ (details in the supplementary material). When trained on all 1000 categories of ImageNet, it gives a validation top-5 error rate of 16.7% (center crop), making it similar to AlexNet [24]. We use this architecture because it is relatively fast to train (2 days on 4 GPUs) and resembles state-of-the-art architectures. Note that ResNet architectures, as described in [19], do not use dropout. Later, we show some experiments using the larger and deeper ResNet-50 architecture.

For all experiments on representation learning, except the triplet embedding, the networks are trained from scratch for 90 epochs on the base classes. The learning rate starts at 0.1 and is divided by 10 every 30 epochs. The weight decay is fixed at 0.0001. For the triplet embedding, we first pretrain the network using a softmax classifier and log loss for 90 epochs, and then train the network further using the triplet loss, starting with a learning rate of 0.001. We stop training when the loss stops decreasing (55 epochs). This schedule is used because, as described in [37], triplet networks train slowly from scratch.

For methods that introduce a new loss, there is a hyperparameter that controls how much weight we give the new loss. Dropout also has a similar hyperparameter that governs what fraction of activations are dropped. We set these hyperparameters by cross-validation.

For our generator G, we use a three-layer MLP with ReLU as the activation function. We also add a ReLU at the end, since φ is known to be non-negative. All hidden layers have a dimensionality of 512.

In the low-shot learning phase, we train the linear classifier using SGD for 10000 iterations with a mini-batch size of 1000. We cross-validate for the learning rate.

The low-shot benchmark creates a heavily imbalanced classification problem. During low-shot learning, the base classes may have thousands of examples, while each novel class has only a few. We use two simple strategies to mitigate this issue. First, we oversample the novel classes when training the classifier by sampling uniformly over classes and then uniformly within each chosen class (see the sketch below). Second, we L2-regularize the multi-class logistic classifier's weights by adding weight decay during low-shot learning. We find that the weight of the classifier's L2 regularization term has a large impact and needs to be cross-validated.
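The oversampling strategy can be implemented by sampling the class first and the example second, which makes every class equally likely regardless of its size. A short sketch, with `examples_by_class` as a hypothetical mapping from class label to its list of examples:

```python
import random

def balanced_batch(examples_by_class, batch_size):
    """Sample uniformly over classes, then uniformly within the chosen class,
    so novel classes with n examples are drawn as often as base classes
    with thousands of examples."""
    classes = list(examples_by_class)
    return [random.choice(examples_by_class[random.choice(classes)])
            for _ in range(batch_size)]
```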
Impact of representation learning. We plot a subset of the methods in Figure 4 and show the full set of numbers in Tables 1 and 2.² The plots show the mean top-5 accuracy, averaged over 5 low-shot learning trials, for the novel classes and over the combined set of novel and base classes. The standard deviations are low (generally less than 0.5%, see supplementary material) and are too small to display clearly as error bars. Top-1 accuracy and numerical values are in the supplementary material. We observe that:

• When tested just on base classes, many methods perform similarly (not shown), but their performance differs drastically in the low-shot scenarios, especially for small n. Thus, accuracy on base classes does not generalize to novel classes, especially when novel classes have very few training examples.

• Batch SGM, SGM, and L2 are top performers overall, with L2 being better for small n. They improve novel class accuracy by more than 10 points for small n (1 or 2) and more than 3 points for larger n. L1 also improves low-shot performance, but the gains are much smaller.

• Dropout is on par with SGM for small n, but ends up similar to or worse than the baseline for larger n in terms of all-class accuracy. Empirically, dropout also reduces the feature norm, suggesting that implicit L2 feature regularization might explain some of these gains.

• Triplet loss improves accuracy for small n but is 5 points worse than the baseline for n = 20 in terms of all-class accuracy. While more sophisticated variants of the triplet loss may improve performance [37], feature regularization is both effective and much simpler.

• The decov loss [7] provides marginal gains for higher values of n but is outperformed by the feature regularization alternatives.

As an additional experiment, we also attempted to finetune the baseline representation on all the base class examples and the small set of novel class examples. We found that this did not improve performance over the frozen representation (see Baseline-ft in Tables 1 and 2). This indicates that finetuning the representation is not only expensive, but also does not help in the low-shot learning scenario.

²The subset reduces clutter, making the plots more readable. We omit results for Batch SGM, Dropout, and L1 because Batch SGM performs similarly to SGM and L2, while L1 and Dropout perform worse.

Representation   Low-shot phase              n=1    2      5      10     20
ResNet-10
Baseline         Classifier                  14.1   33.3   56.2   66.2   71.5
Baseline         Generation* + Classifier    29.7   42.2   56.1   64.5   70.0
SGM*             Classifier                  23.1   42.4   –      –      –
SGM*             Generation* + Classifier    –      –      –      –      –
Batch SGM*       Classifier                  23.0   42.4   61.9   –      –
L1*              Classifier                  20.8   40.8   59.8   67.5   71.6
L2*              Classifier                  29.1   –      –      –      –
ResNet-50
Baseline         Classifier                  28.2   51.0   71.0   –      –
Baseline         Generation* + Classifier    –      –      –      –      –
SGM*             Classifier                  37.8   57.1   –      –      –
SGM*             Generation* + Classifier    –      –      –      –      –

Table 1: Top-5 accuracy on only novel classes. Best are bolded and blue; the second best are italicized and red. *Our methods.
Impact of generation. Figure 5 shows the top-5 accuracies on novel classes and on base+novel classes for our generation method applied on top of the baseline representation and the SGM feature representation. The numbers are in Tables 1 and 2. Note that we only generate examples when n < k, with k = 20 for baseline representations and k = 5 for SGM (see Section 4). We observe that the generated examples provide a large gain of over 9 points for n = 1, 2 on the novel classes for the baseline representation. When using the SGM representation, the gains are smaller, but significant.

We also compared our generation strategy to common forms of data augmentation (aspect ratio and scale jitter, horizontal flips, and brightness, contrast, and saturation changes). Data augmentation only provides small improvements (about 1 percentage point). This confirms that our generation strategy produces more diverse and useful training examples than simple data augmentation.

Representation   Low-shot phase              n=1    2      5      10     20
ResNet-10
Baseline         Classifier                  43.0   54.3   67.2   72.8   75.9
Baseline         Generation* + Classifier    52.4   59.4   67.5   72.6   76.9
SGM*             Classifier                  49.4   60.5   –      –      –
SGM*             Generation* + Classifier    –      –      –      –      –
Batch SGM*       Classifier                  49.3   60.5   71.4   –      –
L1*              Classifier                  47.1   58.5   69.2   73.7   76.1
L2*              Classifier                  52.7   –      –      –      –
ResNet-50
Baseline         Classifier                  54.1   67.7   –      –      –
Baseline         Generation* + Classifier    –      –      –      –      –
SGM*             Classifier                  60.0   –      –      –      –
SGM*             Generation* + Classifier    –      –      –      –      –

Table 2: Top-5 accuracy on base and novel classes. Best are bolded and blue; the second best are italicized and red. *Our methods.
Comparison to other low-shot methods. We also compared to two recently proposed low-shot learning methods: matching networks [46] and model regression [47]. Model regression trains a small MLP to regress from the classifier trained on a small dataset to the classifier trained on the full dataset. It then uses the output of this regressor to regularize the classifier learnt in the low-shot learning phase. Matching networks propose a nearest-neighbor approach that trains embeddings end-to-end for the task of low-shot learning. We apply both techniques on our baseline representation.

For both methods, the respective papers evaluated on the novel classes only. In contrast, real-world recognition systems will need to discriminate between data-starved novel concepts and base classes with lots of data. We adapt these methods to work with both base and novel classes as follows. For model regression, we only use the model-regressor-based regularization on the novel classes, with the other classifiers regularized using standard weight decay. We use one-vs-all classifiers to match the original work. Matching networks require the training dataset to be kept in memory during test time; to make this tractable, we use 100 examples per class, with the novel classes correspondingly oversampled.

Comparisons between these methods and our approach are shown in Figure 6. We find that model regression improves significantly over the baseline, but our generation strategy works better for low n. Model regression also hurts overall accuracy for high n. Matching networks work very well on novel classes. In terms of overall performance, they perform better than our generation approach on top of the baseline representation, but worse than generation combined with the SGM representation, especially for larger n. Further, matching networks are based on nearest neighbors and keep the entire training set in memory, making them much more expensive at test time than our simple linear classifiers.
Figure 4: Representation learning comparison. Top-5 accuracy on ImageNet1k val, plotted against the number of examples per novel class (n), for the baseline, triplets, decov, L2, and SGM representations. Panels: novel classes; all classes; all classes (zoom, n = 1, 2). Top-performing feature regularization methods reduce the training samples needed to match the baseline accuracy by 2x. Note the different Y-axis scales.
Figure 5: Comparisons with and without example generation. Top-5 accuracy on ImageNet1k val, plotted against the number of examples per novel class (n), for baseline, baseline+gen, SGM, and SGM+gen. Panels: novel classes; all classes; all classes (zoom, n = 1, 2). Note the different Y-axis scales.
Figure 6: Comparison to recently proposed methods. Top-5 accuracy on ImageNet1k val, plotted against the number of examples per novel class (n), for baseline, baseline+gen, SGM+gen, matching networks, and model regression. Panels: novel classes; all classes; all classes (zoom, n = 1, 2). Note the different Y-axis scales.
Deeper networks. We also evaluated our approach on the ResNet-50 network architecture to test whether our conclusions extend to the deeper ConvNets now in use (Tables 1 and 2). First, even with the baseline representation and without any generation, we find that the deeper architecture also leads to improved performance in all low-shot scenarios. However, our SGM loss and our generation strategy further improve this performance. Our final top-5 accuracy on novel classes is still more than 8 points higher for n = 1, 2, and our overall accuracy is about 3 points higher, indicating that our contributions generalize to deeper and better models.
7. Conclusion
This paper proposes (1) a low-shot recognition benchmark of realistic complexity, (2) the squared gradient magnitude (SGM) loss that encodes the end-goal of low-shot learning, and (3) a novel way of transferring modes of variation from base classes to data-starved ones. Source code and models are available at: https://github.com/facebookresearch/low-shot-shrink-hallucinate

References

[1] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In ECCV, 2014.
[2] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson. From generic to specific deep representations for visual recognition. In CVPR, 2015.
[3] E. Bart and S. Ullman. Cross-generalization: Learning novel classes from a single example by feature replacement. In CVPR, 2005.
[4] L. Bertinetto, J. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
[5] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.
[6] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen. Discovering hidden factors of variation in deep networks. In ICLR, 2015.
[7] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra. Reducing overfitting in deep networks by decorrelating representations. In ICLR, 2016.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] M. Dixit, R. Kwitt, M. Niethammer, and N. Vasconcelos. AGA: Attribute-Guided Augmentation. In CVPR, 2017.
[10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[11] M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, 2013.
[12] A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In CVPR, 2010.
[13] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 2006.
[14] M. Fink. Object classification from a single example utilizing class relevance metrics. In NIPS, 2005.
[15] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[16] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[17] T. Galanti, L. Wolf, and T. Hazan. A theoretical framework for deep transfer learning. Information and Inference, 2016.
[18] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[20] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[21] S. Jetley, B. Romera-Paredes, S. Jayasumana, and P. Torr. Prototypical priors: From improving classification to zero-shot learning. In BMVC, 2015.
[22] Y. Jia and T. Darrell. Latent task adaptation with large-scale hierarchies. In ICCV, 2013.
[23] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.
[24] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[25] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
[26] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-shot learning by inverting a compositional causal process. In NIPS, 2013.
[27] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 2014.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 1998.
[29] J. Lei Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
[30] E. Littwin and L. Wolf. The multiverse loss for robust transfer learning. arXiv preprint arXiv:1511.09033, 2015.
[31] E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning from one example through shared densities on transforms. In CVPR, 2000.
[32] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
[33] A. Opelt, A. Pinz, and A. Zisserman. Incremental learning of object detectors using a visual shape alphabet. In CVPR, 2006.
[34] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
[35] M. Ranzato. Unsupervised learning of feature hierarchies. PhD thesis, New York University, 2009.
[36] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[37] O. Rippel, M. Paluri, P. Dollar, and L. Bourdev. Metric learning with adaptive density discrimination. In ICLR, 2016.
[38] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
[39] R. Salakhutdinov, J. Tenenbaum, and A. Torralba. One-shot learning with a hierarchical nonparametric bayesian model. Unsupervised and Transfer Learning Challenges in Machine Learning, 2012.
[40] L. A. Schmidt. Meaning and compositionality as statistical induction of categories and constraints. PhD thesis, Massachusetts Institute of Technology, 2009.
[41] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[42] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 2014.
[43] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[44] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In CVPR, 2015.
[45] S. Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, 1996.
[46] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. CoRR, abs/1606.04080, 2016.
[47] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
[48] A. Wong and A. L. Yuille. One shot learning via compositions of meaningful patches. In ICCV, 2015.
[49] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.