Generating Visual Representations for Zero-Shot Classification
Maxime Bucher, Stéphane Herbin
ONERA - The French Aerospace Lab, Palaiseau, France
[email protected], [email protected]
Frédéric Jurie
Normandie Univ, UNICAEN, ENSICAEN, CNRS, Caen, France
[email protected]
Abstract
This paper addresses the task of learning an image classifier when some categories are defined by semantic descriptions only (e.g., visual attributes) while the others are defined by exemplar images as well. This task is often referred to as the Zero-Shot Classification task (ZSC). Most of the previous methods rely on learning a common embedding space in which visual features of unknown categories can be compared with semantic descriptions. This paper argues that these approaches are limited in that i) efficient discriminative classifiers cannot be used, and ii) classification tasks with seen and unseen categories (Generalized Zero-Shot Classification, or GZSC) cannot be addressed efficiently. In contrast, this paper suggests addressing ZSC and GZSC by i) learning a conditional generator using the seen classes and ii) generating artificial training examples for the categories without exemplars. ZSC is then turned into a standard supervised learning problem. Experiments with 4 generative models and 5 datasets experimentally validate the approach, giving state-of-the-art results on both ZSC and GZSC.
1. Introduction and related works
Zero-Shot Classification (ZSC) [22] addresses classification problems where not all the classes are represented in the training examples. ZSC can be made possible by defining a high-level description of the categories, relating the new classes (the unseen classes) to classes for which training examples are available (the seen classes). Learning is usually done by leveraging an intermediate level of representation, the attributes, which provide semantic information about the categories to classify. As pointed out by [32], this paradigm can be compared to how humans identify a new object from a description of it, leveraging similarities between the description and previously learned concepts.

Figure 1: Our method consists in i) learning an image feature generator capable of generating artificial image representations from given attributes, and ii) learning a discriminative classifier from the artificially generated training data.

Recent ZSC algorithms (e.g., [1, 5]) do the classification by defining a zero-shot prediction function that outputs the class $y$ having the maximum compatibility score with the image $x$: $f(x) = \arg\max_y S(x, y)$. The compatibility function, for its part, is often defined as $S(x, y; W) = \theta(x)^T W \phi(y)$, where $\theta$ and $\phi$ are two projections and $W$ is a bilinear function relating the two in a common embedding. There are different variants in the recent literature on how the projections or the similarity measure are computed [11, 8, 15, 29, 32, 40, 41, 43], but in all cases the class is chosen as the one maximizing the compatibility score. This embedding and maximal-compatibility approach, however, does not exploit, in the learning phase, the information potentially contained in the semantic representation of the unseen categories. The only step where a discriminating capability is exploited is the final label selection, which uses an $\arg\max_y$ decision scheme, but not the setting of the compatibility score itself.
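For concreteness, this prediction rule can be written in a few lines. The following is a minimal NumPy sketch with random placeholders for the learned quantities; all dimensions and variable names are illustrative assumptions, not those of any particular cited method:

```python
import numpy as np

# Embedding-based zero-shot prediction: f(x) = argmax_y S(x, y; W)
# with S(x, y; W) = theta(x)^T W phi(y).
rng = np.random.default_rng(0)
d_visual, d_attr, n_classes = 1024, 85, 10      # assumed dimensions

theta_x = rng.normal(size=d_visual)             # visual features of one test image
phi = rng.normal(size=(n_classes, d_attr))      # one attribute vector per class
W = rng.normal(size=(d_visual, d_attr))         # learned bilinear compatibility

scores = theta_x @ W @ phi.T                    # S(x, y; W) for every class y
predicted_class = int(np.argmax(scores))        # label selection by argmax
print(predicted_class)
```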
A parallel can easily be drawn between the aforementioned approaches and generative models as defined in the machine learning community. Generative models estimate the joint distribution $p(y, x)$ of images and classes, often by learning the class prior probability $p(y)$ and the class-conditional density $p(x \mid y)$ separately. However, as has been observed for a long time [37], discriminative approaches trained to predict the class label directly have better performance than model-based approaches, as long as the learning database reliably samples the target distribution.

Although discriminative methods can be expected to give better performance [37], they cannot be used directly in the case of ZSC for an obvious reason: as no images are available for some categories, discriminative classifiers cannot be learned out of the box.

This paper proposes to overcome this difficulty by generating training features for the unseen classes, in such a way that standard discriminative classifiers can be learned (Fig. 1). Generating data for machine learning tasks has been studied in the literature, e.g., [18] or [3], to compensate for imbalanced training sets. Generating novel training examples from the existing ones is also at the heart of the technique called Data Augmentation, frequently used for training deep neural networks [23]. When there is no training data at all for some categories, some underlying parametric representation can be used to generate the missing training data, assuming a mapping from the underlying representation to the image space. [12] generated images by applying warping and other geometric/photometric transformations to prototypical logo exemplars. A similar idea was also presented in [19] for text spotting in images. [7] capture what they call
the Gist of a Gesture by recording human gestures, representing them with a model, and using this model to generate a large set of realistic gestures.

We build in this direction, in the context of ZSC: the underlying representation is some attribute- or text-based description of the unseen categories, and the transformation from attributes to image features is learned from the examples of the seen classes. A relevant way to learn this transformation is to use generative models such as denoising auto-encoders [4] and generative adversarial nets (GANs) [16] or their variants [10, 26]. GANs estimate generative models via an adversarial process simultaneously learning two models: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data rather than from the generator. The
Conditional Generative Adversarial Nets of [28] are a very relevant variant adapted to our problem.

In addition to the advantage of using discriminative classifiers (which are expected to give better performance), our approach, by nature, can address the more realistic task of Generalized Zero-Shot Classification (GZSC). This problem, introduced in [9], assumes that both seen and unseen categories are present at test time, which makes traditional approaches suffer from biased decisions. In contrast, the proposed approach uses (artificial) training examples of both seen and unseen classes during training, avoiding the aforementioned issue.

Another reason to perform classification inference directly in the visual feature space, rather than in an abstract attribute or embedding space, is that data are usually more easily separated in the former, especially when using the discriminant deep features that are now commonly available.

This paper experimentally validates the proposed strategy on 4 standard Zero-Shot classification datasets (Animals with Attributes (AwA) [22], SUN attributes (SUN) [31], aPascal&aYahoo (aP&Y) [14] and Caltech-UCSD Birds-200-2011 (CUB) [38]), and gives insight into how the approach scales on large datasets such as ImageNet [11]. It shows state-of-the-art performance on all datasets for both ZSC and GZSC.
2. Approach
As motivated in the introduction, we address in this paper the problem of learning a classifier capable of discriminating between a given set of classes when empirical data is only available for a subset of them, the so-called seen classes. In the vocabulary of zero-shot classification, the problem is usually qualified as inductive (we do not have access to any data from the unseen classes), as opposed to transductive (the unseen data is available, but not the associated labels). We do not address the transductive setting in this paper, considering that the availability of target data is a big constraint in practice.

The learning dataset $\mathcal{D}^s$ is defined by a series of triplets $\{x_i^s, a_i^s, y_i^s\}_{i=1}^{N_s}$, where $x_i^s \in \mathcal{X}$ is the raw data (image or features), $y_i^s \in \mathcal{Y}^s$ is the associated class label and $a_i^s \in \mathcal{A}^s$ is a rich semantic representation of the class (attributes, word vector or text). This semantic representation is expected to i) contain enough information to discriminate between classes by itself, ii) be predictable from raw data, and iii) infer unambiguously the class label $y = l(a)$.

In an inductive ZSC problem, all that is known regarding the new target domain is the set of semantic class representations $\mathcal{A}^u$ of the unseen classes. The goal is to use this information and the structure of the semantic representation space to design a classification function $f$ able to predict the class label $\hat{y} = f(x; \mathcal{A}^u, \mathcal{D}^s)$. The classification function $f$ is usually parametric and settled by the optimization of an empirical learning criterion.

In ZSC, the main problem is precisely the fact that no data is available for the unseen classes. The approach taken in this paper is to artificially generate data for the unseen classes, given that the seen classes and their semantic representations provide enough information to do so, and then to apply a discriminative approach to learn the class predictor.

The availability of data for the unseen classes has two main advantages: it makes the classification of seen and unseen classes a single homogeneous process, allowing Generalized Zero-Shot Classification to be addressed as a single supervised classification problem; and it potentially allows a larger number of unseen classes, which is for instance required for datasets such as ImageNet [11].

Let $\widehat{\mathcal{D}}^u = \{\hat{x}_i^u, a_i^u, y_i^u\}_{i=1}^{N_u}$ be a database generated to account for the unseen semantic class representations $a^u \in \mathcal{A}^u$. The ZSC classification function becomes $\hat{y} = f_D(x; \widehat{\mathcal{D}}^u, \mathcal{D}^s)$ and can be used, in association with the seen data $\mathcal{D}^s$, to learn a homogeneous supervised problem.

Our generators of unseen data build on the recently proposed approaches for conditional data generation presented in Section 1. The idea is to learn globally a parametric random generative process $G$ using a differentiable criterion able to compare, as a whole, a target data distribution and a generated one.

Given $z$, a random sample from a fixed multivariate prior distribution (typically uniform or Gaussian), and $w$, the set of parameters, new sample data consistent with the semantic description $a$ are generated by applying the function $\hat{x} = G(a, z; w)$. A simple way to generate conditional $\hat{x}$ data is to concatenate the semantic representation $a$ and the random prior $z$ as the input of a multi-layer network, as shown in Fig. 2.
We now present 4 different strategies to design such a conditional data generator, the functional structure of the generator being common to all the described approaches; a minimal sketch of this shared structure is given below.
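As a concrete illustration, here is a small sketch of this shared structure in TensorFlow/Keras (the paper's experiments use TensorFlow); the layer sizes and the noise dimension are illustrative assumptions, not the cross-validated values:

```python
import tensorflow as tf

d_attr, d_noise, d_feature = 85, 100, 1024   # illustrative dimensions

a = tf.keras.Input(shape=(d_attr,), name="class_description")
z = tf.keras.Input(shape=(d_noise,), name="noise")
h = tf.keras.layers.Concatenate()([a, z])    # condition by concatenation
h = tf.keras.layers.Dense(1000)(h)           # FC hidden layer
h = tf.keras.layers.LeakyReLU(0.2)(h)        # leaky-relu, leakage 0.2
x_hat = tf.keras.layers.Dense(d_feature)(h)  # generated image features
generator = tf.keras.Model([a, z], x_hat, name="G")

# Sampling 500 artificial feature vectors for one class description:
attr = tf.random.uniform((500, d_attr))      # placeholder attributes
noise = tf.random.normal((500, d_noise))     # z drawn from a Gaussian prior
fake_features = generator([attr, noise])
```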
Generative Moment Matching Network
A first approach is to adapt the Generative Moment Matching Network (GMMN) proposed in [24] to conditioning. The generative process is considered good if, for each semantic description $a$, two random populations $X(a)$ from $\mathcal{D}^s$ and $\widehat{X}(a; w)$ sampled from the generator have a low maximum mean discrepancy (MMD), a probability divergence measure between two distributions. This divergence can be approximated using a Hilbert kernel based statistic [17], typically a linear combination of Gaussian functions with various widths, which has the big advantage of being differentiable and may thus be exploited as a machine learning cost. Network parameters $w$ are then obtained by optimizing the differentiable statistic by stochastic gradient descent, using batches of generated and real data conditioned by the semantic description $a$ (a sketch of this statistic is given after Fig. 2).
Figure 2: Architecture of the different generative models studied: (1) GMMN, (2) AC-GAN, (3) Denoising Auto-Encoder, (4) Adversarial Auto-Encoder. Hidden layers are fully connected (FC) with leaky-relu non-linearity.
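To make the GMMN training criterion concrete, the following is a minimal NumPy sketch of the (biased) squared MMD with a sum of Gaussian kernels, computed between a batch of real features and a batch of generated ones; the bandwidths and array shapes are illustrative assumptions:

```python
import numpy as np

def mmd2(X, Y, bandwidths=(1.0, 5.0, 10.0)):
    """Biased squared maximum mean discrepancy between samples X and Y.

    X: (n, d) real feature vectors for one class description a.
    Y: (m, d) generated feature vectors for the same description.
    The kernel is a sum of Gaussians with the given (assumed) widths.
    """
    def k(A, B):
        # squared Euclidean distances between all rows of A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sum(np.exp(-d2 / (2.0 * s ** 2)) for s in bandwidths)

    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(64, 1024))
fake = rng.normal(loc=0.5, size=(64, 1024))
print(mmd2(real, fake))   # the generator is trained to drive this down
```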
Conditional Generative Adversarial Models
Our second model builds on the principle of generative adversarial networks (GANs), which is to learn a discrepancy measure between a true and a generated distribution (the discriminator) simultaneously with the data generator. One extension allowing the production of conditional distributions is the AC-GAN [30] (Fig. 2), where the generated and the true distributions are compared using a binary classifier, and the quality of the conditional generation is controlled by the performance of an auxiliary classification task.

This model bears similarities with the GMMN model, the key difference being that in the GMMN the distributions of true and generated data are compared using the kernel-based empirical statistic, while in the AC-GAN case the discrepancy is measured by a learned discriminative parametric model; a sketch of the two AC-GAN criteria is given below.
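For illustration, the two AC-GAN loss terms (real/fake discrimination plus auxiliary classification) can be sketched as follows, assuming a two-head discriminator that outputs a source logit and class logits; the function and variable names are hypothetical:

```python
import tensorflow as tf

# Sketch of the AC-GAN criterion: a source head (real vs. generated)
# plus an auxiliary class head; the quality of the conditional
# generation is controlled by the auxiliary task.
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def discriminator_loss(src_logits_real, src_logits_fake,
                       cls_logits_real, cls_logits_fake, labels):
    # source loss: call real samples real, generated samples fake
    source = bce(tf.ones_like(src_logits_real), src_logits_real) + \
             bce(tf.zeros_like(src_logits_fake), src_logits_fake)
    # auxiliary loss: recover the conditioning class on both populations
    aux = cce(labels, cls_logits_real) + cce(labels, cls_logits_fake)
    return source + aux

def generator_loss(src_logits_fake, cls_logits_fake, labels):
    # the generator tries to fool the source head while keeping its
    # samples classifiable as the conditioning class
    return bce(tf.ones_like(src_logits_fake), src_logits_fake) + \
           cce(labels, cls_logits_fake)
```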
Denoising Auto-Encoder
Our third generator relies on the work presented in [4], where an encoder/decoder structure is proposed to design a data generator, the latent code playing the role of the random prior $z$ used to generate the data. A simple extension able to introduce conditional data generation control is developed by concatenating the semantic representation $a$ to the code that is fed to the decoder (Fig. 2).

In practice, this model is learned as a standard auto-encoder, except that i) some noise is added to the input and ii) the semantic representation $a$ is concatenated to the code in the hidden layer. For generating novel examples, only the decoder part, i.e., the head of the network using $z$ and $a$ as input to produce $\hat{x}$, is used, as in the sketch below.
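As a sketch (the layer sizes and noise level are illustrative assumptions), the conditional denoising auto-encoder can be written as follows; at generation time only the decoder half is kept:

```python
import tensorflow as tf

# Conditional denoising auto-encoder, sketched: corrupt the input
# feature vector, encode it to a code, concatenate the class
# description a, and decode back to the feature space with an L2
# reconstruction loss. Dimensions below are illustrative.
d_feature, d_attr, d_code = 1024, 85, 100

x = tf.keras.Input(shape=(d_feature,))
a = tf.keras.Input(shape=(d_attr,))

noisy_x = tf.keras.layers.GaussianNoise(0.1)(x)       # i) corrupt the input
code = tf.keras.layers.Dense(d_code)(noisy_x)
code = tf.keras.layers.LeakyReLU(0.2)(code)
h = tf.keras.layers.Concatenate()([code, a])          # ii) condition the decoder
h = tf.keras.layers.Dense(1000)(h)
h = tf.keras.layers.LeakyReLU(0.2)(h)
x_rec = tf.keras.layers.Dense(d_feature)(h)

autoencoder = tf.keras.Model([x, a], x_rec)
autoencoder.compile(optimizer="adam", loss="mse")     # L2 reconstruction

# At generation time, only the decoder head is used, fed with a random
# code z (in place of the encoder output) and the attributes a.
```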
Adversarial Auto-Encoder
Our fourth generator is inspired by [26], which is an extension of the denoising auto-encoder. It introduces an adversarial criterion to control the latent code produced by the encoder part, so that the code distribution matches a fixed prior distribution. This extra constraint is expected to ensure that all parts of the sampling prior space produce meaningful data.

During training, both the auto-encoder and the discriminator are learned simultaneously. For generating novel examples, as with the denoising auto-encoder, only the decoder part is used. The latent-matching criterion is sketched below.
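The extra adversarial constraint can be sketched as a small discriminator applied to codes; this is a hypothetical illustration (names, sizes and the prior are assumptions), not the exact architecture of [26]:

```python
import tensorflow as tf

# Sketch of the adversarial auto-encoder constraint: a discriminator is
# trained to tell encoder codes from samples of a fixed prior, and the
# encoder is trained to fool it, so the code distribution matches the prior.
code_discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation=tf.nn.leaky_relu),
    tf.keras.layers.Dense(1),   # logit: prior sample vs. encoder code
])
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def latent_adversarial_losses(codes_from_encoder):
    prior_codes = tf.random.normal(tf.shape(codes_from_encoder))  # p(z)
    real_logits = code_discriminator(prior_codes)
    fake_logits = code_discriminator(codes_from_encoder)
    d_loss = bce(tf.ones_like(real_logits), real_logits) + \
             bce(tf.zeros_like(fake_logits), fake_logits)
    # encoder loss: make its codes indistinguishable from prior samples
    e_loss = bce(tf.ones_like(fake_logits), fake_logits)
    return d_loss, e_loss
```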
We implemented our 4 generative models with neural networks, whose architectures are illustrated in Fig. 2. Hidden layers are fully connected (FC) with leaky-relu non-linearity [25] (leakage coefficient of 0.2). For the models using a classifier (the AC-GAN and the adversarial auto-encoder), the classifier is linear (fully connected layer + softmax activation function). The loss used to measure the quality of the reconstruction in the two auto-encoders is the L2 norm.

Regarding how to sample the noise $z$, we did not observe any difference between sampling it from a Gaussian distribution and sampling it from a uniform distribution.
3. Experiments
In this section, after presenting the datasets and the experimental settings, we start by comparing the different generative models described in the previous section. We then show how our approach can be used for the Generalized Zero-Shot Classification task, which is one of the key contributions of the paper, provide some experiments on a large-scale zero-shot classification task, and finally compare our approach with state-of-the-art Zero-Shot approaches on the regular Zero-Shot Classification task.
3.1. Datasets and experimental settings

A first experimental evaluation is done on 4 standard ZSC datasets: Animals with Attributes (AwA) [22], SUN attributes (SUN) [31], aPascal&aYahoo (aP&Y) [14] and Caltech-UCSD Birds-200-2011 (CUB) [38]. These benchmarks exhibit a great diversity of concepts: SUN and CUB are for fine-grained categorization, and include scene and bird images respectively; AwA contains images of animals from 50 different categories; finally, aP&Y has broader concepts, from cars to animals. For each dataset, attribute descriptions are given either at the class level or at the image level. aP&Y, CUB and SUN have per-image binary attributes, which we average to produce per-class real-valued representations. In order to allow comparisons with other works, we follow the same training/testing splits for AwA [22], CUB [2] and aP&Y [14]. For SUN we experiment with two different settings: one with 10 unseen classes as in [20], and a second, more competitive one, with ten different folds randomly chosen and averaged, as proposed by [8] (72/71 splits).

Image features are computed using two deep networks, VGG-VeryDeep-19 [34] and GoogLeNet [36]. For VGG-19 we use the 4,096-dim top-layer hidden unit activations (fc7), while for GoogLeNet we use the 1,024-dim top-layer pooling units. We keep the weights learned on ImageNet fixed, i.e., we do not apply any fine-tuning.

The classifiers are obtained by adding a standard fully connected layer with softmax to the pre-trained networks. We purposely chose a simple classifier to better observe the behavior of the generators. In all our experiments we generated 500 artificial image features per class, which we consider to be a reasonable trade-off between accuracy and training time; we have not observed any significant improvement when adding more images.

Each architecture has its own set of hyper-parameters (typically the number of units per layer, the number of hidden layers, the learning rate, etc.). They are obtained through a 'zero-shot' cross-validation procedure, in which 20% of the seen classes are considered as unseen (hence used as a validation set), allowing us to choose the hyper-parameters maximizing the accuracy on this validation set. In practice, typical values for the number of neurons (resp. the number of hidden layers) are in the range [500, 2000] (resp. 1 or 2).

Model parameters are initialized according to a centered Gaussian distribution. They are optimized with the Adam solver [21] with a cross-validated learning rate, using mini-batches of size 128, except for the GMMN, where each batch contains all the training images of one class to make the estimation of the statistics more reliable. In order to avoid over-fitting, we used dropout [35] at every layer (drop probability of 0.2 for the input layers and of 0.5 for the hidden layers). Input data (both image features and w2v vectors) are scaled to [0,1] by applying an affine transformation. With the TensorFlow framework [13] running on a Nvidia Titan X Pascal GPU, the learning stage takes around 10 minutes for a given set of hyper-parameters. Our code will be made publicly available at https://github.com/maximebucher.

3.2. Comparing the generative models

Our first round of experiments consists in comparing the performance of the 4 generative models described in Section 2 on the regular Zero-Shot classification task. Our intention is to select the best one for further experiments.
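Before turning to the results, here is a hypothetical end-to-end sketch of the generate-then-classify pipeline evaluated in these experiments: a trained conditional generator (such as the sketch in Section 2) produces 500 artificial features per unseen class, and a simple fully connected + softmax classifier is trained on them. Every name, dimension and hyper-parameter below is an assumption for illustration:

```python
import numpy as np
import tensorflow as tf

d_feature, d_attr, d_noise = 1024, 85, 100
n_unseen = 10
unseen_attributes = np.random.rand(n_unseen, d_attr).astype("float32")

def generate_training_set(generator, class_attributes, per_class=500):
    """Generate `per_class` artificial features for each class description."""
    xs, ys = [], []
    for label, a in enumerate(class_attributes):
        z = tf.random.normal((per_class, d_noise))
        a_batch = np.tile(a, (per_class, 1))
        xs.append(generator([a_batch, z]).numpy())
        ys.append(np.full(per_class, label))
    return np.concatenate(xs), np.concatenate(ys)

# A linear classifier (FC layer + softmax), as in the paper's protocol.
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(d_feature,)),
    tf.keras.layers.Dense(n_unseen, activation="softmax"),
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])

# `generator` is assumed to be a trained model like the earlier sketch:
# x_fake, y_fake = generate_training_set(generator, unseen_attributes)
# classifier.fit(x_fake, y_fake, batch_size=128, epochs=20)
```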
Performance on the validation set is reported in Table 1. We can see that the GMMN model outperforms the 3 others on average, with a noticeable 5% improvement on aP&Y. Its optimization is also computationally more stable than that of the adversarial versions. We consequently chose this generator for the following experiments.

We explain the superiority of the GMMN model by the fact that it aligns the distributions using an explicit model of their divergence, while the adversarial auto-encoder and the AC-GAN have to learn it. For its part, the denoising auto-encoder has no guarantee that the distributions are aligned, explaining its weak performance compared to the 3 other generators.

Table 1: Zero-Shot classification accuracy on the validation set.

Model                     aP&Y   AwA    CUB    SUN    Avg
Denois. Auto-encod. [4]   62.0   66.4   42.8   82.5   63.4
AC-GAN [30]               55.2   66.0   44.6   83.5   62.3
Adv. Auto-encod. [26]     59.5   –      –      –      –

Table 2: Generalized Zero-Shot classification accuracy on AwA. Image features are obtained with the GoogLeNet [36] CNN.

Method                          u→u     s→s     u→a     s→a
Lampert et al. [22] (DAP)       –       –       –       –
Lampert et al. [22] (IAP)       –       –       –       –
Norouzi et al. [29]             63.7    76.9    9.5     75.9
Changpinyo et al. [8] (o-vs-o)  –       –       –       –
Changpinyo et al. [8] (struct)  –       –       –       –
Ours (VGG-19)                   87.78   85.61   38.21   83.14

Table 3: Generalized Zero-Shot classification accuracy on CUB. Image features are obtained with the GoogLeNet [36] CNN.

Method                          u→u     s→s     u→a     s→a
Lampert et al. [22] (DAP)       –       –       –       –
Lampert et al. [22] (IAP)       –       –       –       –
Norouzi et al. [29]             35.8    70.5    1.8     69.9
Changpinyo et al. [8] (o-vs-o)  –       –       –       –
Changpinyo et al. [8] (struct)  –       –       –       –
Ours (VGG-19)                   59.70   71.21   20.12   69.45
3.3. Generalized Zero-Shot Classification

In this section, we follow the Generalized Zero-Shot Learning (GZSC) protocol introduced by Chao et al. [9]. In this protocol, test data can come from any class, seen or unseen. This task is more realistic and harder, as the number of candidate classes is larger. We follow the notations of [9], i.e.:

u → u: test images from unseen classes, labels of unseen classes (conventional ZSC);
s → s: test images from seen classes, labels of seen classes (multi-class classification for seen classes);
u → a: test images from unseen classes, labels of seen and unseen classes (GZSC);
s → a: test images from seen classes, labels of seen and unseen classes (GZSC).

In the first two cases, only the seen/unseen classes are used in the training phase. In the last two cases, the classifier is learned with training data combining images generated for all classes (seen and unseen).

Most of the recent ZSC works, e.g., [2, 6, 5, 32], are focused on improving the embedding or the scoring function. However, [9] has shown that this type of approach is impractical for GZSC: the scoring function is in that case biased toward seen classes, leading to very low accuracy on the unseen classes. This can be seen in Tables 2 and 3 (u → a column), where the accuracy drops significantly compared to regular ZSC performance. The data distributions of the ZSC datasets are strongly subject to this bias, as unseen classes are very similar to seen classes both in terms of visual appearance and attribute description. When both seen and unseen classes are candidates, it becomes much harder to distinguish between them. For example, the horse (seen) and zebra (unseen) classes of the AwA dataset cannot be distinguished by standard ZSC methods.

As we can see in Tables 2 and 3, our generative approach outperforms all previous approaches. In the hardest case, u → a, its accuracy is 30% (resp. 10%) higher than state-of-the-art approaches on the AwA (resp. CUB) dataset. This is easily explained by the fact that our approach does not suffer from the scoring-function problem mentioned above, as the softmax classifier is learned to discriminate both seen and unseen classes, offering a decisive solution to the bias problem.

3.4. Large-scale Zero-Shot Classification

We compared our approach with state-of-the-art methods on a large-scale Zero-Shot classification task. These experiments mirror those presented in [15]: 1,000 classes from the ImageNet 2012 1K set [33] are chosen for training (seen classes), while 20,345 others are considered to be unseen classes with no image available. Image features are computed with the GoogLeNet network [36].

In contrast with the ZSC datasets, no attributes are provided for defining the unseen classes. We represent those categories using a skip-gram language model [27], learned on a dump of the Wikipedia corpus.
We use the hidden layer of this model to describe each class label, embedding the class name into this 500-dimensional space. Some classes cannot be represented, as their name is not contained in the vocabulary established by parsing the Wikipedia corpus. Such classes are ignored, bringing the number of classes from 20,842 down to 20,345. For fair comparison, we take the same language model as [8], with the same classes excluded.

As in [8, 15], our model is evaluated on three different scenarios with an increasing number of unseen classes: i) 2-hop: 1,509 classes; ii) 3-hop: 7,678 classes; iii) All: all unseen categories.

For this task we use the Flat-Hit@K metric, the percentage of test images for which the model returns the true label in the top K prediction scores (a sketch of this metric is given at the end of this section).

Table 4: Large-scale Zero-Shot classification: Flat-Hit@K (%).

Scenario       Method           1      2      5      10     20
2-hop          Frome [15]       6.0    10.0   18.1   26.4   36.4
               Norouzi [29]     9.4    15.1   24.7   32.7   41.8
               Changpinyo [8]   10.5   16.7   28.6   40.1   52.0
               Ours             –      –      –      –      –
2-hop (+1K)    Frome [15]       0.8    2.7    7.9    14.2   22.7
               Norouzi [29]     0.3    7.1    17.2   24.9   33.5
               Ours             –      –      –      –      –
3-hop          Frome [15]       1.7    2.9    5.3    8.2    12.5
               Norouzi [29]     2.7    4.4    7.8    11.5   16.1
               Changpinyo [8]   2.9    4.9    9.2    14.2   20.9
               Ours             –      –      –      –      –
3-hop (+1K)    Frome [15]       0.5    1.4    3.4    5.9    9.7
               Norouzi [29]     0.2    2.4    5.9    9.7    14.3
               Ours             –      –      –      –      –
All            Frome [15]       0.8    1.4    2.5    3.9    6.0
               Norouzi [29]     1.4    2.2    3.9    5.8    8.3
               Changpinyo [8]   1.5    2.4    4.5    7.1    10.9
               Ours             –      –      –      –      –
All (+1K)      Frome [15]       0.3    0.8    1.9    3.2    5.3
               Norouzi [29]     0.2    1.2    3.0    5.0    7.5
               Ours             –      –      –      –      –

Table 4 summarizes the performance on the 3 hops. As one can see, our model obtains state-of-the-art performance for each configuration. As can be observed from these experiments, our generative model is very well suited to this large-scale GZSC problem; e.g., our approach improves on the best competitors by 5% for the Flat-Hit@1 metric on the 2-hop scenario.
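As a minimal sketch (the array names and sizes are illustrative assumptions), the Flat-Hit@K metric can be computed as follows:

```python
import numpy as np

def flat_hit_at_k(scores, true_labels, k):
    """Percentage of test images whose true label is in the top-k scores.

    scores: (n_images, n_classes) prediction scores.
    true_labels: (n_images,) integer ground-truth labels.
    """
    # indices of the k highest-scoring classes for each image
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_labels[:, None]).any(axis=1)
    return 100.0 * hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 1000))        # hypothetical model outputs
labels = rng.integers(0, 1000, size=1000)     # hypothetical ground truth
print(flat_hit_at_k(scores, labels, k=5))
```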
3.5. Classical Zero-Shot Classification

In this last section, we follow the protocol of the standard ZSC task: during training, only data from the seen classes are available, while at test time new images (from unseen classes only) have to be assigned to one of the unseen classes.

As explained in the introduction, the recent ZSC literature [2, 6, 5, 32] mostly focuses on developing a good embedding for comparing attributes and images. One of our motivations for generating training images was to make the training of discriminative classifiers possible, assuming it would result in better performance. This section aims at validating this hypothesis on the regular ZSC task.

Table 5 summarizes our experiments, reporting the accuracy obtained by state-of-the-art methods on the 4 ZSC datasets, with 2 different deep image features. Each entry is the mean/standard deviation computed on 5 different runs. With the VGG network, our method gives above state-of-the-art performance on each dataset, with a noticeable improvement of more than 15% on CUB. On the SUN dataset, Changpinyo et al. [8] seem to give better performance, but they used the MIT Places dataset to learn the features. It has recently been pointed out, in Sec. 5.1 of Xian et al. [42], that this database "intersects with both training and test classes of SUN", which could explain their better results compared to ours.

Table 5: Zero-Shot classification accuracy (mean ± std) on 5 runs, with GoogLeNet and VGG-19 features. The SUN dataset is evaluated on 2 different splits (see Sec. 3.1). * [8] features extracted from an MIT Places [45] pre-trained model.

GoogLeNet [36] features:
Method                   aP&Y    AwA     CUB     SUN
Lampert et al. [22]      -       60.5    39.1    -/44.5
Akata et al. [2]         -       66.7    50.1    -/-
Changpinyo et al. [8]    -       72.9    54.7    -/- *
Xian et al. [41]         -       71.9    45.5    -
Ours                     –       –       –       –

VGG-19 [34] features:
Method                   aP&Y    AwA     CUB     SUN
Lampert et al. [22]      38.16   57.23   -       72.00/-
Romera-Paredes [32]      24.22   75.32   -       82.10/-
Zhang et al. [43]        46.23   76.33   30.41   82.50/-
Zhang et al. [44]        50.35   80.46   42.11   83.83/-
Wang et al. [39]         -       78.3    48.6    -/-
Bucher et al. [5]        53.15   77.32   43.29   84.41/-
Bucher et al. [6]        56.77   86.55   45.87   86.21/-
Ours                     –       –       –       –
4. Conclusions
This paper introduces a novel way to address the Zero-Shot Classification and Generalized Zero-Shot Classification tasks: learning a conditional generator from seen data and generating artificial training examples for the categories without exemplars, which turns ZSC into a standard supervised learning problem. This novel formulation addresses the two main limitations of previous ZSC methods, i.e., their intrinsic bias in Generalized Zero-Shot Classification tasks and their inability to use discriminative classifiers in the deep image feature space. Our experiments with 4 generative models and 5 datasets experimentally validate the approach and give state-of-the-art performance.

Acknowledgements
M. Bucher was in part supported by Région Normandie.
References

[1] Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. Multi-cue Zero-Shot Learning with Strong Supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 59-68. IEEE, June 2016.
[2] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of Output Embeddings for Fine-Grained Image Classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[3] Mohamed Bahy Bader-El-Den, Eleman Teitei, and Mo Adda. Hierarchical classification for dealing with the class imbalance problem. IJCNN, 2016.
[4] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pages 899-907, 2013.
[5] M. Bucher, S. Herbin, and F. Jurie. Improving semantic embedding consistency by metric learning for zero-shot classification. In European Conference on Computer Vision, 2016.
[6] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Hard negative mining for metric learning based zero-shot classification. In Computer Vision - ECCV 2016 Workshops, pages 524-531. Springer, 2016.
[7] Maria E. Cabrera and Juan P. Wachs. Embodied gesture learning from one-shot. In IEEE RO-MAN, pages 1092-1097. IEEE, 2016.
[8] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5327-5336. IEEE, 2016.
[9] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In European Conference on Computer Vision, pages 52-68. Springer, 2016.
[10] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode Regularized Generative Adversarial Networks. arXiv, December 2016.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248-255. IEEE, 2009.
[12] Christian Eggert, Anton Winschel, and Rainer Lienhart. On the Benefit of Synthetic Data for Company Logo Detection. In ACM Multimedia, 2015.
[13] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[14] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[15] Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. In Conference on Neural Information Processing Systems (NIPS), 2013.
[16] I. Goodfellow, J. Pouget-Abadie, and M. Mirza. Generative adversarial nets. In NIPS, 2014.
[17] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.
[18] Hongyu Guo and Herna L. Viktor. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explorations, 2004.
[19] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading Text in the Wild with Convolutional Neural Networks. International Journal of Computer Vision, 116(1):1-20, 2016.
[20] Dinesh Jayaraman and Kristen Grauman. Zero-shot recognition with unreliable attributes. In Conference on Neural Information Processing Systems (NIPS), 2014.
[21] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-Based Classification for Zero-Shot Visual Object Categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453-465, 2014.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.
[24] Yujia Li, Kevin Swersky, and Richard S. Zemel. Generative moment matching networks. In ICML, pages 1718-1727, 2015.
[25] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[26] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[27] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[28] Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. arXiv, November 2014.
[29] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[30] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
[31] Genevieve Patterson and James Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2751-2758. IEEE, 2012.
[32] Bernardino Romera-Paredes and Philip H. S. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152-2161, 2015.
[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[34] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2014.
[35] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[36] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[37] Ilkay Ulusoy and Christopher M. Bishop. Generative versus Discriminative Methods for Object Recognition. In CVPR, 2005.
[38] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[39] Qian Wang and Ke Chen. Zero-shot visual recognition via bidirectional latent embedding. arXiv preprint arXiv:1607.02104, 2016.
[40] Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pages 2764-2770. AAAI Press, 2011.
[41] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 69-77, 2016.
[42] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning - the good, the bad and the ugly. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[43] Ziming Zhang and Venkatesh Saligrama. Zero-Shot Learning via Semantic Similarity Embedding. In IEEE International Conference on Computer Vision (ICCV), 2015.
[44] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via joint latent similarity embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6034-6042, 2016.
[45] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487-495, 2014.