Transductive Zero-Shot Learning by Decoupled Feature Generation
Federico Marmoreo, Jacopo Cavazza and Vittorio Murino
Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Italy
University of Genova, Italy
Huawei Technologies Ltd., Ireland Research Center, Ireland
Department of Computer Science, University of Verona, Italy
{federico.marmoreo,jacopo.cavazza,vittorio.murino}@iit.it

Abstract
In this paper, we address zero-shot learning (ZSL), the problem of recognizing categories for which no labeled visual data are available during training. We focus on the transductive setting, in which unlabelled visual data from unseen classes is available. State-of-the-art paradigms in ZSL typically exploit generative adversarial networks to synthesize visual features from semantic attributes. We posit that the main limitation of these approaches is to adopt a single model to face two problems: 1) generating realistic visual features, and 2) translating semantic attributes into visual cues. Differently, we propose to decouple such tasks, solving them separately. In particular, we train an unconditional generator to solely capture the complexity of the distribution of visual data, and we subsequently pair it with a conditional generator devoted to enriching the prior knowledge of the data distribution with the semantic content of the class embeddings. We present a detailed ablation study to dissect the effect of our proposed decoupling approach, while demonstrating its superiority over the related state-of-the-art.
1. Introduction
The field of visual object recognition has seen significant progress in recent years, mainly because of the availability of large-scale annotated datasets. However, labelling data is not only difficult and costly, but is also prone to errors since it requires human intervention. Furthermore, annotating a big corpus of data for each of the categories to be recognized in a balanced way is simply unfeasible, due to the well-known long-tail distribution problem [22].

As a promising solution to the aforementioned issues, zero-shot learning (ZSL) algorithms tackle the problem of recognizing novel categories, even if a classifier has not been directly trained on them [16, 15]. In a fully supervised paradigm, each single category is assumed to be (evenly) represented by a set of annotated visual data (images). ZSL methods, instead, allow this assumption to hold only for a restricted set of seen categories. The goal is then to recognize a disjoint set of target unseen categories when, in the transductive setting, only unlabelled visual data is available (differently from the inductive setting, in which no visual data from unseen classes are available). To transfer from seen to unseen categories, auxiliary information is typically adopted in the form of either manually-defined attributes [15] or distributed word embeddings [29]. When transferring from the seen to the unseen classes, the main challenge is handling such category shift: in this paper, we evaluate this in the generalized
ZSL (GZSL) setup, in which a method is evaluated on both seen and unseen classes, requiring to learn the latter without forgetting the former ones. In fact, given the separation of the data into labelled seen and unlabelled unseen instances, supervised training can be done for seen classes only, resulting in an unbalanced performance in GZSL. To address this issue, various recent works proposed to augment the unseen-class data with synthetic labeled data [18, 3, 6, 13, 17, 31, 24, 37, 30, 23, 8].

In this paper, we address the transductive GZSL problem by introducing a novel, more effective feature synthesis method able to balance the training process. In detail, our approach builds upon the possibility of mimicking the human brain in hallucinating a mental imagery of a certain unknown category while reading a textual description of it. As pursued by a number of recent works [18, 3, 6, 13, 17, 31, 24, 37, 30, 23, 8], conditional feature generation is adopted for this purpose. Specifically, images from the seen classes are fed into a backbone ResNet-101 network which pre-computes a set of real visual features [29]. Subsequently, through a Generative Adversarial Network (GAN) [10, 2, 11], the following min-max optimization game is solved: a generator network is asked to synthesize visual embeddings which should look real to a discriminator module. Since the generator network is conditionally dependent upon semantic embeddings, the trained model can be exploited to create synthetic features of the classes for which we lack visual labeled data. Afterwards, GZSL can be solved as a simple classification problem through a softmax classifier trained on top of real features (from the seen classes) and generated features (from the unseen ones); a minimal code sketch of this pipeline is given at the end of this Section.

In the literature, several variants have been attempted to improve the pipeline from the architectural point of view, with the aim of solving GZSL efficiently: using an attribute regression module [13], easing the generator with a variational auto-encoder [18, 3, 31, 24, 8], adopting cycle-consistency [6], designing intermediate latent embeddings [17] or employing feature-to-feature translation methods [18, 3, 6, 13, 17, 31, 24, 37, 30, 23, 8]. On the one hand, as commonly done in adversarial training, we need to generate synthetic descriptors indistinguishable from a pool of pre-trained real features used as reference. On the other hand, synthetic features are required to translate the semantic information into visual patterns which are discriminative for the seen and unseen classes to be recognized.

We posit that resolving these two tasks within a single architecture is arguably difficult, and we claim that this is the major limitation affecting the performance of the currently available feature generating schemes for GZSL. In fact, since a single architecture is adopted for two tasks, one of them may be suboptimally solved with respect to the other, resulting in a poor modelling of either the visual or the semantic space.

Hence, differently to prior work, in this paper we propose to separately solve the two tasks, decoupling the feature generation stage to better tackle transductive GZSL. First, we train an unconditional generative adversarial network with the purpose of synthesizing features visually similar to the real ones, in order to properly model their distribution in an unsupervised manner.
Since our generation is not conditioned on semantic embeddings, we are sure to specifically model the visual appearance of our feature representations. Second, we encapsulate such visual information into a structured prior, which is used in tandem with a conditioning factor (here, the semantic embedding) to (conditionally) generate synthetic feature vectors. Because of the improved source of noise, we expect to enhance the semantic-to-visual translation as well, yielding visual descriptors with richer semantic content. The resulting architecture for decoupled feature generation is named DecGAN.

Since our DecGAN is decoupled into one unconditional and one conditional GAN-like branch, it is capable of exploiting the unlabeled visual data which are available in transductive GZSL. In fact, while we can compare the generated seen features with the real ones, both conditionally and unconditionally (since we have access to labels), we cannot do it for the unseen ones. Unseen classes are, in fact, not supported by annotated visual data, hence the conditional discriminator cannot “verify” them. We deem that our proposed architecture contributes in addressing this problem by cross-connecting the conditional branch with the unconditional one. In other words, we use the unconditional discriminator to evaluate the “quality” of the conditionally generated features. In this way, we decouple the feature generation, not only for the seen categories, but also for the unseen ones, resulting in a better modelling for both and improving the GZSL performance.

In summary, this work provides the following original contributions.

• We introduce the idea of decoupling feature generation for transductive zero-shot learning by encapsulating visual patterns into a structured prior, which is subsequently adopted to boost the semantically conditioned synthesis of visual features.

• We implement our idea through a novel architecture, termed DecGAN, which combines an unconditional and a conditional feature generation module, introducing a novel cross-connected branch mechanism able to decouple feature generation for both seen and unseen categories.

• Through an extensive ablation study, we analyze each single component of our architecture. As compared to the transductive GZSL state-of-the-art, DecGAN outperforms it on the CUB [27] and SUN [32] datasets (see Table 4, Section 4).

The rest of the paper is organised as follows. In Section 2, we outline the most relevant related work. In Section 3, we present the proposed approach, which is then experimentally validated in Section 4. The final conclusions are drawn in Section 5.
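To make the feature-generating pipeline above concrete, here is a minimal PyTorch sketch of conditional feature synthesis followed by unseen-class augmentation. The 2048-dim features match the ResNet-101 pooling units used in the paper; the module name, the per-class sample count and the remaining defaults are our illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Synthesizes a visual feature from noise z and a class embedding c(y)."""
    def __init__(self, noise_dim=512, embed_dim=312, feat_dim=2048, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim), nn.ReLU())

    def forward(self, z, c):
        return self.net(torch.cat([z, c], dim=1))

@torch.no_grad()
def synthesize_unseen(gen, unseen_embeds, per_class=300, noise_dim=512):
    """Hallucinates labeled features for each unseen class from its embedding."""
    feats, labels = [], []
    for y, c in unseen_embeds.items():               # {class id: embedding tensor}
        z = torch.randn(per_class, noise_dim)
        feats.append(gen(z, c.expand(per_class, -1)))
        labels.append(torch.full((per_class,), y, dtype=torch.long))
    return torch.cat(feats), torch.cat(labels)

# GZSL is then solved by a plain softmax classifier over seen + unseen classes,
# trained on real seen features plus the synthetic unseen ones, e.g.:
# clf = nn.Linear(2048, num_seen + num_unseen)  # optimize cross-entropy on the union
```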
2. Related Work
In this Section, we cover the most relevant related work in the field of transductive zero-shot learning. For more general details on other zero-shot paradigms, the reader can refer to [29].

Classical approaches in GZSL aim at learning a compatibility function between visual features and class embeddings, projecting them into a common space [29]. In order to exploit unlabeled data for the unseen classes in transductive GZSL, [33] proposes a procedure to perform label propagation [7] so as to simultaneously learn a representation for both seen and unseen classes. As shown in [29], this label propagation procedure can be extended to all the methods based on compatibility functions that map the visual features into the class embedding space [1]. Due to the shift between seen and unseen classes, the projection may struggle when switching from the seen to the unseen domain. To this end, in [26, 35, 36], the projection function is improved, while [34] tries to alleviate the issue using an ensemble of classifiers. Differently, we perform synthetic feature generation to produce labeled visual features for the unseen classes.

Generative approaches for transductive GZSL have been recently proposed [31, 23, 8, 25]. Using the taxonomy of generative models [9], the method proposed in [25] can be categorized as an explicit density model with tractable density. In fact, it models the density function of the conditional probability of the visual features given the class embedding through an exponential family of distributions. Among the profitable benefits of a tractable density function, the computational pipeline becomes simpler and more efficient. However, constraining the density function limits the possibility to capture all the complexity of the data. Instead, our framework is based on GANs [10, 2, 11], a direct implicit density model [9]: therefore, we do not impose any density function for the distribution from which we want to generate the visual features, but we let the model learn it directly from the data.

In [23], a constraint is introduced in GAN training to improve the discriminative properties of the generated features. Specifically, a compatibility function f between visual features and class embeddings is learned; then, the correlation between gradients of real and generated features with respect to f is maximized during GAN training.

In [31] and [8], a mixture of explicit and implicit models, a Variational Autoencoder (VAE) [14] and a GAN, is proposed. Specifically, a single generator/decoder is conditioned on the attribute embeddings and used to approximate the numerically intractable distribution of the visual features. By directly minimizing the divergence between the real visual features and the generated ones, the model learns 1) how to extract visual features for those classes which are not seen during training but only described through their attributes, and 2) how to mimic the distribution of visual features (with the addition of one adversarial categorization network in [8]).

Differently to [31, 23, 8], in our work, we propose a decoupled feature generation framework.
Hence, instead of training one single conditional generator, we train an unconditional generator to solely capture the complexity of the visual data distribution, and we subsequently pair it with a conditional generator devoted to enriching the prior knowledge of the data distribution with the semantic content of the class embeddings.

To generalize the generative deep networks on the (unlabeled) unseen domain: [23] uses an unconditional discriminator for both seen and unseen data, and implicitly learns the class label information through the compatibility function; [8] applies a pseudo-labeling strategy; [31] uses an additional unconditional discriminator for the unseen data. Differently, we cross-connect our conditional and unconditional branches.
3. Decoupled Feature Generation

We consider two disjoint sets of classes: the seen classes $\mathcal{Y}^s$ and the unseen ones $\mathcal{Y}^u$, such that $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$. For the seen classes, a dataset of triplets $(x^s, y^s, c(y^s))$ is available: $x^s \in \mathcal{X}$ is the visual feature vector, $y^s \in \mathcal{Y}^s$ is its class label and $c(y^s)$ is the corresponding class embedding. Differently, for the unseen classes, in transductive GZSL we only have unlabeled visual features $x^u$. The sets of labels $y^u$ of the unseen classes are described in terms of their semantic embeddings $c(y^u)$, as for the seen ones. Given a test visual feature $x$, the goal is to predict the corresponding class label $y$, which can belong either to the seen or to the unseen classes.

For feature generation approaches, a conditional generator $G$ is fed with random noise $z$ and a class embedding $c(y)$, and it synthesizes a feature vector which will be denoted by $\tilde{x}$. Once $G$ is trained, synthetic features $\tilde{x}^u$ are generated for the unseen classes and are used, together with $x^s$, to train a softmax classifier which is responsible for the final recognition task.

Looking at Figure 1, our proposed DecGAN architecture is composed of two cross-connected branches, which consist of two GANs, one unconditional (in yellow in the figure) and one conditional (in light blue), which are cross-connected forming a third cross-branch (in violet). It can be noticed that there are two main ingredients: a structured prior and a cross-connection between the conditioned and unconditioned branches. Since the unconditional branch learns how to mimic the feature representation, regardless of the semantic class embeddings, this allows to generate a structured prior which can be shared across classes and adopted by the conditional branch to better perform the semantic-to-visual mapping. The cross-connection is fundamental as well: once synthetic features are conditionally generated, they can be checked to be realistic by the conditional discriminator only if they belong to the seen classes, for which we have labels. But, with the additional usage of the unconditional discriminator, we can also verify if the synthetic features from unseen classes are similar to real ones in distribution. This framework fully exploits the possibilities of transductive zero-shot learning. The reader can refer to Figure 1 for a visualization and to the next paragraphs for the details on the design of our architecture.

Figure 1. Our proposed DecGAN architecture is composed of two cross-connected branches consisting of two GANs: unconditional and conditional. The unconditional branch (yellow) is composed of generator $G$ and discriminator $D$, and the conditional branch (light blue) is composed of generator $G_c$ and discriminator $D_c$. An additional cross-branch (violet) is composed of $G_c$ and $D$. Generator $G$ is decomposed into $G_1$ and $G_2$, such that, given some random noise $z$, $G_2(G_1(z)) = G(z) = \tilde{x}$. The structured prior $s = G_1(z)$ is fed as input into $G_c$ together with the class embeddings $c(y)$, for the sake of conditionally generating visual descriptors $\tilde{x}_c$. Best viewed in colors.

The unconditional branch is composed of the generator $G$ and the discriminator $D$. The generator $G$ is decomposed into $G_1$ and $G_2$ such that, given some random noise $z$, $G_2(G_1(z)) = G(z)$. We refer to the output of $G_1$ as the structured prior $s$, that is, $s = G_1(z)$. The concatenation of $s$ and the class embeddings $c(y)$ is passed as input to the conditional branch (see next paragraph).
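For concreteness, the following is a minimal PyTorch sketch of the modules just described, wired according to Figure 1 and using the layer sizes reported in the implementation details at the end of this Section (noise of size 512, structured prior of size 1024, hidden layers of size 4096, 2048-dim visual features). The class names and the 312-dim attribute embedding (the CUB dimension, Table 1) are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class G1(nn.Module):
    """z -> structured prior s: a single linear layer with leaky ReLU."""
    def __init__(self, noise_dim=512, prior_dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim, prior_dim), nn.LeakyReLU(0.2))
    def forward(self, z):
        return self.net(z)

class G2(nn.Module):
    """s -> unconditionally generated feature x_tilde (so G = G2 o G1)."""
    def __init__(self, prior_dim=1024, hidden=4096, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(prior_dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, feat_dim), nn.ReLU())
    def forward(self, s):
        return self.net(s)

class Gc(nn.Module):
    """(s, c(y)) -> conditionally generated feature x_tilde_c."""
    def __init__(self, prior_dim=1024, embed_dim=312, hidden=4096, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(prior_dim + embed_dim, hidden),
                                 nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, feat_dim), nn.ReLU())
    def forward(self, s, c):
        return self.net(torch.cat([s, c], dim=1))

class Critic(nn.Module):
    """D (in_dim = 2048) or D_c (in_dim = 2048 + embed_dim):
    one hidden layer of size 4096, unconstrained scalar output."""
    def __init__(self, in_dim, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, 1))
    def forward(self, x):
        return self.net(x)

# The three branches of Figure 1:
#   unconditional: x_tilde   = G2(G1(z))      scored by D
#   conditional:   x_tilde_c = Gc(G1(z), c)   scored by D_c on (x_tilde_c, c)
#   cross-branch:  x_tilde_c = Gc(G1(z), c)   scored by the unconditional D
```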
The unconditional branch is dedicated to learning the data distribution and we model it as a Wasserstein GAN (WGAN) [2, 11]. Hence, optimization is performed by minimizing the Wasserstein distance between the real data distribution and the synthetic one, by playing the following two-player game between $G$ and $D$ [2]:

$$\min_G \max_D \; \mathbb{E}_x[D(x)] - \mathbb{E}_{\tilde{x}}[D(\tilde{x})], \tag{1}$$

where $\tilde{x} = G(z)$ denotes the features generated by the unconditional generator $G$. To regularize the min-max optimization, we use the following penalty term [11]:

$$R = \mathbb{E}_{\hat{x}}\big[(\|\nabla D(\hat{x})\|_2 - 1)^2\big], \tag{2}$$

where $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}$ with $\alpha \sim U(0, 1)$.

To learn how to translate the semantic content of the class embeddings $c(y)$, we model the conditional branch with the extension of the WGAN to a conditional model [30]. The conditional branch is composed of the generator $G_c$ and the discriminator $D_c$. In this architecture, both the generator and the discriminator are conditioned on the class embeddings. The generator $G_c$ takes as input the structured prior $s$ and is conditioned on the class embeddings, learning how to enrich the information about the data distribution contained in $s$ with the semantic content of $c(y)$. The generated features $\tilde{x}_c = G_c(s, c(y))$ are then evaluated by the discriminator $D_c$ together with the class embedding that generated them, and compared to real data pairs $(x, c(y))$. With this architecture, $G_c$ learns how to enrich $s$ with the content of the class embeddings. The quality of the relation between the generated visual features and the semantic content is then evaluated by $D_c$. The optimization is carried out through

$$\min_{G_c} \max_{D_c} \; \mathbb{E}_x[D_c(x, c(y))] - \mathbb{E}_{\tilde{x}_c}[D_c(\tilde{x}_c, c(y))], \tag{3}$$

with the regularization term [30]:

$$R = \mathbb{E}_{\hat{x}}\big[(\|\nabla D_c(\hat{x}, c(y))\|_2 - 1)^2\big], \tag{4}$$

where $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}_c$ with $\alpha \sim U(0, 1)$. We also add the regularization term introduced by [6]. That is, given a pre-trained linear module $A$ and $\tilde{a} = A(\hat{x})$ the reconstruction of $c(y)$ given $\hat{x}$, we add the reconstruction loss:

$$R_{rec} = \|c(y) - \tilde{a}\|^2. \tag{5}$$

Because labeled data are not available for the unseen classes, we cannot feed the conditional discriminator $D_c$ with them. To exploit the unlabeled data, we propose to conditionally generate the visual features $\tilde{x}_c$ and evaluate them only by their distribution using $D$. Thus, in this setting, we do not condition both the generator and the discriminator, as is commonly done in GAN-based conditional generation, but we only condition the generator. Hence, optimization is obtained by

$$\min_{G_c} \max_D \; \mathbb{E}_x[D(x)] - \mathbb{E}_{\tilde{x}_c}[D(\tilde{x}_c)], \tag{6}$$

adapting as a consequence the regularization term on the gradients as

$$R = \mathbb{E}_{\hat{x}}\big[(\|\nabla D(\hat{x})\|_2 - 1)^2\big], \tag{7}$$

where $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}_c$ with $\alpha \sim U(0, 1)$, and adding the reconstruction loss defined in equation (5).

The origin of the proposed architecture. Our work is inspired by FusedGAN [4], which combines two GANs to improve image generation in a semi-supervised setup. Differently, in our case, we handle feature generation in the zero-shot case, so we have no annotated data at all for some of the classes and we need to generate them. To solve this problem, differently from FusedGAN, we cross-connect the two branches to transfer the knowledge of the seen domain to the unseen one.
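The losses above translate directly into code. Below is a hedged sketch of the WGAN-GP terms of Eqs. (1)-(7), reusing the modules sketched earlier. The penalty weight lam = 10 follows common WGAN-GP practice [11] and is our assumption, since the paper does not report its value; for the conditional critic $D_c$, one can pass a closure that concatenates the class embedding, so that the gradient in Eq. (4) is taken with respect to $\hat{x}$ only.

```python
import torch

def gradient_penalty(critic, x_real, x_fake):
    """Eqs. (2)/(4)/(7): R = E[(||grad D(x_hat)||_2 - 1)^2],
    with x_hat = alpha * x + (1 - alpha) * x_tilde, alpha ~ U(0, 1)."""
    alpha = torch.rand(x_real.size(0), 1)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

def critic_loss(critic, x_real, x_fake, lam=10.0):
    # The critic maximizes E[D(x)] - E[D(x_tilde)] (Eqs. (1), (3), (6)); we
    # minimize the negative, regularized by the gradient penalty.
    x_fake = x_fake.detach()
    return critic(x_fake).mean() - critic(x_real).mean() \
           + lam * gradient_penalty(critic, x_real, x_fake)

def generator_loss(critic, x_fake):
    # The generator maximizes E[D(x_tilde)], i.e. minimizes its negative.
    return -critic(x_fake).mean()

def reconstruction_loss(A, x, c):
    # Eq. (5): squared error between the class embedding c(y) and its
    # reconstruction A(x) from the pre-trained linear module A [6].
    return ((A(x) - c) ** 2).sum(dim=1).mean()
```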
To train the proposed DecGAN, we propose a three-stage training strategy, which is explained beneath and sketched in Figure 2.

1. In the first stage, we optimize both the conditional and the unconditional branch using only data from the seen classes. We seek to achieve decoupled feature generation for the seen categories in a way that, while $G$ learns how to model the data distribution, $G_c$ learns how to enrich the structured prior with the content of the class embeddings. We perform an alternate training strategy in which, first, we update $D$ and $G$ using equations (1) and (2). Then, we update $D_c$ and $G_c$ using equations (3) and (4). A full update step consists of $k$ updates of $D$, 1 update of $G$, $k$ updates of $D_c$ and 1 update of $G_c$ in sequence, with $k > 1$ [11]. Here, we chose $k = 5$ as done in [30].

2. In the second training stage, we want to take advantage of the unseen unlabeled data to add into the structured prior the information of the unseen data distribution. To reach our goal, we use the unseen data to fine-tune $D$ and $G$ using equations (1) and (2).

3. The third stage consists in the fine-tuning of the conditional generator $G_c$ on the unseen data. Using the structured prior, generalized over the unseen classes in the previous stage, we condition $G_c$ with the embeddings of the unseen classes to reinforce its ability to translate semantic content into visual features in the unseen domain. That is, we use equations (6) and (7) to update $D$ and $G_c$ using unseen data.

We implement $G_2$ and $G_c$ as single-hidden-layer neural networks with a hidden layer of size 4096 and leaky ReLU activation, and an output layer that has the size of the visual feature vectors, 2048, with ReLU activation. In $G_1$, a leaky ReLU is used as the activation function (without hidden layer). Overall, $G$ is thus a 2-hidden-layer neural network and we use its first hidden layer as the structured prior. The size of the structured prior (the output of $G_1$) is fixed to 1024. The size of the noise $z$ is fixed to 512 and it is sampled from a multivariate normal distribution $\mathcal{N}(0, I)$, where $0$ is the 512-dimensional vector of zeros and $I$ is the 512-dimensional identity matrix. $D$ and $D_c$ are neural networks composed of a single hidden layer of size 4096 (with leaky ReLU activation) and of an unconstrained real number as output.

Dataset     att   stc    Y^s      Y^u
FLO [19]     -    1024   62+20    20
SUN [32]    102    -     580+65   72
CUB [27]    312    -     100+50   50

Table 1. Statistics of the considered datasets: number of seen classes Y^s (training + validation), unseen classes Y^u, dimension of the attribute (att) annotations and of the sentence (stc) extracted features.
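As an illustration of the schedule, here is a minimal sketch of Stage 2, the fine-tuning of the unconditional branch on unlabeled unseen features. It reuses gradient_penalty from the loss sketch above; the choice of Adam (with default hyper-parameters) is our assumption, as the paper does not specify the optimizer. Stage 1 alternates the same k-critic/1-generator cycle on both branches with seen data, and Stage 3 applies the same cycle to $G_c$ scored by the unconditional $D$ (the cross-branch).

```python
import itertools
import torch

def stage2_finetune(G1, G2, D, unseen_loader, epochs=1, k=5, noise_dim=512, lam=10.0):
    """Fine-tune the unconditional branch (Eqs. (1)-(2)) on unseen features,
    with k = 5 critic updates per generator update, as in the paper."""
    opt_g = torch.optim.Adam(itertools.chain(G1.parameters(), G2.parameters()))
    opt_d = torch.optim.Adam(D.parameters())
    for _ in range(epochs):
        for step, x_real in enumerate(unseen_loader):
            z = torch.randn(x_real.size(0), noise_dim)
            if step % (k + 1) != k:          # k critic updates ...
                x_fake = G2(G1(z)).detach()
                loss_d = (D(x_fake).mean() - D(x_real).mean()
                          + lam * gradient_penalty(D, x_real, x_fake))
                opt_d.zero_grad()
                loss_d.backward()
                opt_d.step()
            else:                             # ... then one generator update
                loss_g = -D(G2(G1(z))).mean()
                opt_g.zero_grad()
                loss_g.backward()
                opt_g.step()
```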
4. Experiments
We evaluate the proposed DecGAN on standard benchmark datasets for GZSL. We considered Oxford Flowers (FLO) [19], SUN Attribute (SUN) [32] and Caltech-UCSD-Birds 200-2011 (CUB) [27]. FLO consists of 8,189 images of 102 different types of flowers, SUN of 14,340 images of scenes from 717 classes and CUB of 11,788 images of 200 different types of birds.

For SUN and CUB, we use manually annotated attributes [29]. Because for FLO the attributes are not available, we follow [31] in using 1024-dim sentence embeddings extracted by the character-based CNN-RNN [20] from fine-grained visual descriptions of the images. Statistics of the datasets are available in Table 1. For a fair comparison, we split the classes of SUN and CUB between seen and unseen using the splits proposed by [29]. For FLO, we use the splits as in [20]. For all datasets, the visual features are chosen as the 2048-dim top-layer pooling units of the ResNet-101 [12], provided by [29].

As evaluation metric for the GZSL setup, we measure the performance as the harmonic mean between seen and unseen accuracy, each one computed as top-1 classification accuracy on seen and unseen classes, respectively [29].
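For reference, the metric is straightforward to compute; the worked check below uses the non-decoupled baseline's FLO accuracies reported later in Table 3.

```python
def harmonic_mean(a_u: float, a_s: float) -> float:
    """GZSL score: harmonic mean of unseen (a_u) and seen (a_s) top-1 accuracies."""
    return 2 * a_u * a_s / (a_u + a_s)

# Worked check (non-decoupled baseline on FLO, Table 3):
# harmonic_mean(69.5, 91.4) = 2 * 69.5 * 91.4 / 160.9 ≈ 79.0, as reported.
```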
In this Section, we perform an accurate ablation analysis on the different components of our proposed DecGAN architecture. Specifically, we separately evaluate the impact on performance of each of the three branches which DecGAN is endowed with: namely, the unconditional branch, the conditional branch and the cross-branch, as presented in Section 3.2. We also pay attention to the effects of the different stages of our training pipeline: the first stage is evaluated in a stand-alone fashion, while we also provide experimental evidence for the effect of removing the second stage.

Figure 2. DecGAN training is performed in 3 stages. Stage 1 performs an alternate training on the conditional and the unconditional branch using seen data. Stage 2 uses the unconditional branch to fine-tune $G$ using unseen data to improve the structured prior $s$. Finally, Stage 3 carries out the fine-tuning of $G_c$, feeding the cross-branch with unseen data.

To better analyse our composite training pipeline, in Table 2 we present an ablation study to assess the impact on performance of each of the three stages of our training pipeline. Precisely, we evaluate the drop in performance resulting from removing any of the aforementioned stages from the full training pipeline of DecGAN: when either removing the first, second or third stage, we obtain DecGAN(−Stg1), DecGAN(−Stg2) and DecGAN(−Stg3), respectively. We also assess the performance of the first and third stage separately (DecGAN(Stg1) and DecGAN(Stg3)). Note that we cannot evaluate Stage 2 in a standalone fashion, since such a stage lacks a conditional feature generation pipeline from which we can sample features for the seen/unseen classes (see Figure 2).
Discussion.
With respect to the performance of the full DecGAN model, the first stage alone always achieves a suboptimal performance, and this can be clearly explained by the fact that DecGAN(Stg1) exploits data from the seen classes only (inductive setup). The poor performance of DecGAN(Stg3) reveals that, without the conditional discriminator, which takes as input the visual features and the related class embeddings (consequently learning the relation between visual features and class embeddings), the generator does not learn to generate class-dependent visual features. When removing the first stage, the performance is comparable with DecGAN(Stg3): this shows how pivotal this first step is for the whole pipeline. During the second stage, we fine-tune the structured prior, adding information about the data distribution of the unseen visual features, but we update the conditional generator on the new structured prior only in Stage 3. The misalignment between the structured prior that the conditional generator expects as input and the updated one it receives leads to a performance drop when comparing DecGAN(Stg1) and DecGAN(−Stg3). However, the importance of the transitory Stage 2 is highlighted by the difference in performance between DecGAN and DecGAN(−Stg2). The third stage has a clear effect on performance: fine-tuning the conditional generator over the unseen data always boosts the performance, no matter if the second stage was performed or not.
We posit that the main advantage of DecGAN is the possibility of decoupling the feature generation stage, so as to tackle two problems separately: 1) the generation of features which are visually similar to the real ones, and 2) the translation of semantic patterns from attributes to features. We now want to prove that the aforementioned separation of the tasks leads to a superior performance if compared to a model which tries to perform both tasks jointly. To do so, we compare the proposed decoupled feature generation (achieved through DecGAN) with a baseline model in which we perform a similar staged training, without performing decoupling.

Specifically, we consider the architecture represented in Figure 3, which is described beneath. It is composed of a single conditional generator $G_c$ and two discriminators, one conditional $D_c$ and one unconditional $D$, implemented through Wasserstein GANs subject to the gradient penalty loss and the reconstruction loss in the same way as our DecGAN. Moreover, similarly to our DecGAN, we train this architecture in two stages (see Figure 3). In the first stage, we train $G_c$ together with $D_c$ using only seen data and, in the second stage, we train $G_c$ together with $D$ using unseen data. Differently to DecGAN, generation is performed with a single generator that has to learn both the data distribution of the real visual features and how to translate the semantic content of the class embeddings into them, without taking advantage of the structured prior and the decoupled feature generation. As for DecGAN, $G_c$, $D$ and $D_c$ for this baseline are one-hidden-layer neural networks with a hidden layer of size 4096 with leaky ReLU activation. The size of the noise is fixed to 1024, the same as the structured prior $s$.

Table 2. We assess the impact on performance of the presence/absence of each stage of the training pipeline of our DecGAN model (rows: DecGAN(Stg1), DecGAN(Stg3), DecGAN(−Stg1), DecGAN(−Stg2), DecGAN(−Stg3) and the full DecGAN, on FLO, SUN and CUB). We report top-1 accuracy on seen classes a^s and unseen classes a^u and their harmonic mean H. Best H values are highlighted in bold. All results are reported by averaging accuracies over 5 different runs.

                   FLO                 SUN                 CUB
                   a_u   a_s   H       a_u   a_s   H       a_u   a_s   H
Not decoupled      69.5  91.4  79.0    52.7  44.3  48.1    54.3  66.7  59.9
Decoupled

Table 3. The effect of decoupling the feature generation stage. We compare the decoupled approach of DecGAN to the not decoupled baseline. We report top-1 accuracy on seen classes a^s and unseen classes a^u and their harmonic mean H. Best H values are highlighted in bold. All results are reported by averaging accuracies over 5 different runs.

The results of this analysis are presented in Table 3 and discussed in the following.
Discussion.
The effect of decoupling is clearly visible on all the 3 benchmarks, showing that it is always advantageous, since it leads to a superior performance when considering all the error metrics adopted in this paper: accuracy over unseen and seen classes a^u, a^s and harmonic mean H. When we do not perform decoupling, the resulting performance is comparable to the one achieved by DecGAN(−Stg2) (check Tables 2 and 3). We think that this is an effect of what happens to the structured prior, which is first trained on the seen data and the conditional generator, and then fine-tuned on the unseen ones. Such a discontinuous usage of seen and unseen data leads the generator to use the structured prior as a random input, since it is not able to read the visual information which is encapsulated inside.
In this Section, we report the key benchmark against the state-of-the-art in transductive GZSL. The methods are: the projection with visual structure constraint (CDVSc) [26], the effective deep embedding (EDE ex) [35], the progressive ensemble network (PREN) [34], the domain-invariant projection (Full DIPL) [36], the attribute-based latent embedding (ALE) [1], the generative framework based on a family of Gaussian distributions (GFZSL) [25], the discriminative semantic representation learning (DSRL) based on a non-negative matrix factorization approach [33], the feature generation approaches pairing a GAN with a variational auto-encoder [31, 8] (f-VAEGAN-D2 and Z-VAE-GAN) and the addition of the gradient matching loss during GAN training [23] (GMN). We additionally report some classical inductive methods such as [21, 1, 5, 28], as well as some generative inductive methods based on GANs [30, 6], VAEs [3, 24] or a combination of them [31]. We selected the three publicly available benchmark datasets (FLO, SUN and CUB) presented in Section 4.1. We present the results as means over five different runs at a fixed number of DecGAN training epochs. The results obtained are reported in Table 4.
In Table 4, we show how our proposed decoupled feature generation, implemented through our DecGAN model, is capable of improving in performance over prior methods. Among the inductive zero-shot learning methods, DecGAN sets up sharp improvements in performance: methods such as CADA-VAE [24] are improved in the scored harmonic mean H by +9.3% on SUN and by +11.0% on CUB. Similarly, DecGAN is capable of outperforming cycle-WGAN [6] on FLO (+16.3%), SUN (+10.5%) and CUB (+10.4%). Actually, even the performance of DecGAN in the first training stage is superior to cycle consistency: DecGAN(Stg1) improves over cycle-WGAN by +2.0% on FLO.

A solid performance is shown also when benchmarking the state-of-the-art in the transductive generalized zero-shot learning setup. On FLO, DecGAN improves several prior methods by a margin, in terms of H: +59.3% with respect to ALE [29], +47.7% with respect to GFZSL [25] and +43.6% with respect to DSRL [33]. The only method reported in Table 4 which is slightly superior to us is f-VAEGAN-D2 [31], and the reason for that is the usage of two feature generation schemes: a variational autoencoder and a GAN.

Figure 3. Baseline architecture. Left: a visualization of the architectural design of the baseline. Right: similarly to our proposed DecGAN, we adopt a staged training to optimize the baseline.
                        FLO                 SUN                 CUB
                        a_u   a_s   H       a_u   a_s   H       a_u   a_s   H
Inductive (I):
ESZSL [21]               -     -     -      11.0  27.9  15.8    12.6  63.8  21.0
ALE [1]                  -     -     -      21.8  33.1  26.3    23.7  62.8  34.4
SynC [5]                 -     -     -       7.9  43.3  13.4    11.5  70.9  19.8
LATEM [28]               -     -     -      14.7  28.8  19.5    15.2  57.3  24.0
f-CLSWGAN [30]          59.0  73.8  65.6    42.6  36.6  39.4    43.7  57.7  49.7
f-VAEGAN [31]           56.8  74.9  64.6    38.0  45.1  41.3    48.4  60.1  53.6
cycle-WGAN [6]          61.6  69.2  65.2    33.8  47.2  39.4    47.9  59.3  53.0
SyntE [3]                -     -     -      40.9  30.5  34.9    41.5  53.3  46.7
CADA-VAE [24]            -     -     -      47.2  35.7  40.6    51.6  53.5  52.4
Transductive (T):
DSRL [33]               26.9  64.3  37.9    17.7  25.0  20.7    17.3  39.0  24.0
GMN [23]                 -     -     -      57.1  40.7  47.5     †     †     †
CDVSc [26]               -     -     -      27.8  63.2  38.6    37.0  84.6  51.4
PREN [34]                -     -     -      35.4  27.2  30.8    35.2  55.8  43.1
Z-VAE-GAN [8]            -     -     -      53.1  35.8  42.8    64.1  57.9  60.8
ALE trans [29]          13.6  61.4  22.2    19.9  22.6  21.2    23.5  45.1  30.9
Full DIPL [36]           -     -     -       -     -     -      41.7  44.8  43.2
EDE ex [35]              -     -     -      47.2  38.5  42.4    54.0  62.9  58.1
GFZSL [25]              21.8  75.8  33.8     0.0  41.6   0.0    24.9  45.8  32.2
f-VAEGAN-D2 [31]        78.7  87.2  82.7
DecGAN (ours)           73.0  92.2  81.5

Table 4. Results in GZSL. We report top-1 accuracy on seen classes a^s and unseen classes a^u and their harmonic mean H. First and second best values are highlighted in bold and italic, respectively, for H. Inductive (I) and Transductive (T) methods are reported. †: results not reported because of the usage of different class embeddings, which are not comparable. Our results are presented by averaging performance scores over 5 different runs.

We therefore deem that our idea is still competitive: by means of our structured prior, we can almost match the performance of a method which uses twice the number of feature generators. This evidently shows DecGAN as a balance between model light-weighting and performance. On the other two benchmark datasets, DecGAN improves f-VAEGAN-D2 [31] by +0.3% on SUN and +0.2% on CUB and, similarly, surpasses in performance recent prior art such as PREN [34] (+19.1% on SUN and +20.3% on CUB) or Z-VAE-GAN [8] (+7.1% on SUN and +2.6% on CUB).
5. Conclusions
In this paper, we address a major limitation of the mainstream approach in (generalized) zero-shot learning, consisting in the necessity of solving two problems with a single computational pipeline: 1) capturing the distribution of visual features in order to generate realistic descriptors, and 2) translating semantic attributes into visual patterns. Therefore, we proposed DecGAN, which decouples the aforementioned problems by means of an unconditional GAN generating a structured prior. The latter can be used to improve the conditional generation of visual features. The overall architecture has a staged training, whose steps have been validated in a broad experimental comparison, assessing that this computational setup is particularly favorable for the transductive GZSL setup. In fact, DecGAN improves in performance over the previous state-of-the-art on challenging public benchmark datasets.

References

[1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1425–1438, 2015.
[2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ArXiv, abs/1701.07875, 2017.
[3] Gundeep Arora, Vinay Kumar Verma, Ashish Mishra, and Piyush Rai. Generalized zero-shot learning via synthesized examples. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[4] Navaneeth Bodla, Gang Hua, and Rama Chellappa. Semi-supervised FusedGAN for conditional image generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 669–683, 2018.
[5] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, pages 5327–5336, 2016.
[6] Rafael Felix, Vijay BG Kumar, Ian Reid, and Gustavo Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In The European Conference on Computer Vision (ECCV), 2018.
[7] Yasuhiro Fujiwara and Go Irie. Efficient label propagation. In ICML, 2014.
[8] Rui Gao, Xingsong Hou, Jie Qin, Jiaxin Chen, Li Liu, Fan Zhu, Zhao Zhang, and Ling Shao. Zero-VAE-GAN: Generating unseen features for generalized and transductive zero-shot learning. IEEE Transactions on Image Processing, 29:3665–3680, 2020.
[9] Ian J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. ArXiv, abs/1701.00160, 2017.
[10] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[11] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] He Huang, Changhu Wang, Philip S. Yu, and Chang-Dong Wang. Generative dual adversarial network for generalized zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[14] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
[15] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2009.
[16] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. Zero-data learning of new tasks. In Conference on Artificial Intelligence (AAAI). AAAI, 2008.
[17] Jingling Li, Mengmeng Jing, Ke Lu, Zhengming Ding, Lei Zhu, and Zi Huang. Leveraging the invariant side of generative zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[18] Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A. Murthy. A generative model for zero shot learning using conditional variational autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2188–2196, 2018.
[19] M-E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[20] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016.
[21] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In The International Conference on Machine Learning (ICML), 2015.
[22] Ruslan Salakhutdinov, Antonio Torralba, and Josh Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR 2011, pages 1481–1488. IEEE, 2011.
[23] Mert Bülent Sariyildiz and Ramazan Gokberk Cinbis. Gradient matching generative networks for zero-shot learning. In CVPR, pages 2163–2173, 2019.
[24] Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero- and few-shot learning via aligned variational autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[25] Vinay Kumar Verma and Piyush Rai. A simple exponential family framework for zero-shot learning. In ECML/PKDD, 2017.
[26] Ziyu Wan, Dongdong Chen, Yan Li, Xingguang Yan, Junge Zhang, Yizhou Yu, and Jing Liao. Transductive zero-shot learning with visual structure constraint. In Advances in Neural Information Processing Systems 32, pages 9972–9982, 2019.
[27] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[28] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 69–77, 2016.
[29] Yongqin Xian, Christoph H. Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. The IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[30] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[31] Yongqin Xian, Saurabh Sharma, Bernt Schiele, and Zeynep Akata. f-VAEGAN-D2: A feature generating framework for any-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[32] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[33] Meng Ye and Yuhong Guo. Zero-shot classification with discriminative semantic representation learning. In CVPR, pages 5103–5111, 2017.
[34] Meng Ye and Yuhong Guo. Progressive ensemble networks for zero-shot recognition. In CVPR, pages 11720–11728, 2019.
[35] Lei Zhang, Peng Wang, Lingqiao Liu, Chunhua Shen, Wei Wei, Yanning Zhang, and Anton van den Hengel. Towards effective deep embedding for zero-shot learning. ArXiv, abs/1808.10075, 2018.
[36] An Zhao, Mingyu Ding, Jiechao Guan, Zhiwu Lu, Tao Xiang, and Ji-Rong Wen. Domain-invariant projection learning for zero-shot recognition. In NeurIPS, 2018.
[37] Yizhe Zhu, Jianwen Xie, Bingchen Liu, and Ahmed Elgammal. Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In The IEEE International Conference on Computer Vision (ICCV), 2019.