End-to-end Generative Zero-shot Learning via Few-shot Learning
Georgios Chochlakis, Efthymios Georgiou, Alexandros Potamianos
School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Attica, Greece; Behavioral Signal Technologies, Los Angeles, CA, USA. Correspondence to: Georgios Chochlakis <[email protected]>.
Abstract
Contemporary state-of-the-art approaches to Zero-Shot Learning (ZSL) train generative nets to synthesize examples conditioned on the provided metadata. Thereafter, classifiers are trained on these synthetic data in a supervised manner. In this work, we introduce Z2FSL, an end-to-end generative ZSL framework that uses such an approach as a backbone and feeds its synthesized output to a Few-Shot Learning (FSL) algorithm. The two modules are trained jointly. Z2FSL solves the ZSL problem with a FSL algorithm, reducing, in effect, ZSL to FSL. A wide class of algorithms can be integrated within our framework. Our experimental results show consistent improvement over several baselines. The proposed method, evaluated across standard benchmarks, shows state-of-the-art or competitive performance in ZSL and Generalized ZSL tasks.
1. Introduction
Deep Learning has seen great success in various settings and disciplines, like Computer Vision (Krizhevsky et al., 2012; Zhou et al., 2017; Chen et al., 2019), Speech and Language Processing (Devlin et al., 2019; Brown et al., 2020), Computer Graphics (Starke et al., 2019; Park et al., 2019) and Medical Science (Ronneberger et al., 2015; Rajpurkar et al., 2017). Despite the variety of applications, there is a common denominator: resources. The overwhelming majority of applications not only benefit from but necessitate voluminous data along with the associated hardware resources to achieve their reported results. This is evident both from their unprecedented, "superhuman" performance in many narrow tasks compared to traditional algorithms (He et al., 2015; Silver et al., 2017) and their surprising deficiencies in low-data regimes. Consequently, resource requirements have skyrocketed. For instance, BERT's (Devlin et al., 2019) training requirements reach 256 TPU days.

A class of problems that deal with small and medium-sized data sets and distribution shifts are Zero-Shot Learning (ZSL) and
Few-Shot Learning (FSL). These can prove interesting testing grounds for efficient learning. In these problems, and particularly in ZSL for Computer Vision tasks, Deep Learning has failed to have an immediate impact. It has only been applied in indirect ways, most notably by replacing image features extracted with traditional Computer Vision algorithms, like Bag of Visual Words (Weston et al., 2010), with features extracted by deep nets such as ResNet (He et al., 2016). Further integration of Deep Learning techniques is important to advance these settings.

A step towards that direction has been made with the inclusion of generative models. For FSL, various forms of autoencoders have been leveraged to provide additional, synthetic examples given the actual support set (Antoniou et al., 2017; Wang et al., 2018; Xian et al., 2019). In ZSL, generative networks conditioned on some form of class descriptions are trained so as to generate synthetic samples of the test classes. In this way, the classification task is transformed into a standard supervised classification task. As a result, classifiers can be trained in a supervised manner (Xian et al., 2018b; Zhu et al., 2018). More elaborate training techniques have been proposed (Li et al., 2019b; Xian et al., 2019; Li et al., 2019a; Keshari et al., 2020; Narayan et al., 2020), yet without much progress on extending the basic approach.

In this work, we combine these two low-data regimes, namely ZSL and FSL. Specifically, we use the generative ZSL pipeline as a backbone and feed its synthesized output to a FSL classifier. The two modules can be trained jointly, rendering the overall process end-to-end. Formally, the FSL classifier's loss is combined with the prior loss of the generative ZSL framework to form our proposed framework, Z2FSL. Z2FSL conceptually reduces ZSL to FSL by structuring the generator's output as a support set for the FSL algorithm. Using the same FSL classifier during both training and testing is possible because of the flexibility of its output label space. This property holds because the FSL classifier can classify input patterns based on the examples and the classes present in its support set.

Our motivation and rationale for making the process end-to-end is threefold. First, the generative net gains access to the classification loss of the final classifier. This is beneficial because sample generation becomes more discriminative in a manner that explicitly helps the FSL classifier, since the latter's loss drives the generation. Secondly, thanks to the aforementioned FSL property, the FSL classifier's training is not reliant on the generated samples of the generator, so the former can be pre-trained on real examples. Additionally, this pre-training is not restricted to the corresponding training set of each task. For example, we can do so on ImageNet (Deng et al., 2009). Lastly, few-shot learners perform favorably compared to other alternatives in low-shot classification tasks (Vinyals et al., 2016; Snell et al., 2017; Wang et al., 2018).

Our contributions to the study of ZSL are:

1. The coupling of two standard research benchmarks, ZSL and FSL, by our novel framework, Z2FSL, which makes generative ZSL approaches end-to-end by using a FSL classifier.

2. Formulating our framework in a manner that allows for a wide class of ZSL and FSL algorithms to be seamlessly integrated.
3. Achieving state-of-the-art or competitive performance on ZSL and Generalized ZSL benchmarks and analyzing the contributions of each component of our framework.

We have open-sourced our code at https://github.com/gchochla/z2fsl.
2. Related Work
Earlier works address ZSL by splitting inference into two stages, inferring the attributes – the auxiliary description – of an image and then assigning the image to the closest given attribute vector. Examples are DAP (Lampert et al., 2013) and the technique presented by Al-Halah et al. (2016). Alternatively, IAP (Lampert et al., 2013) predicts the class posteriors and these are used to calculate the attribute posteriors of any image. Word2Vec (Mikolov et al., 2013) descriptions have also been used instead of attributes, an example being CONSE (Norouzi et al., 2013).

More recent research concentrates on learning a linear mapping from the image-feature space to a semantic space. ALE (Akata et al., 2015a) learns a compatibility function between attributes and image features that is a bilinear form. SJE (Akata et al., 2015b), DEVISE (Frome et al., 2013), ESZSL (Romera-Paredes & Torr, 2015) and Qiao et al. (2016) learn a bilinear form as a compatibility function as well. SAE (Kodirov et al., 2017) tackles ZSL with a linear autoencoder. Extending linear mappings, Xian et al. (2016) introduced LATEM, which is a piecewise-linear compatibility function.

Other recent approaches can be categorized as prototypical because, at least conceptually, a prototype per class is computed. SYNC (Changpinyo et al., 2016) aligns graphs in semantic and image-feature space, calculating prototypes in the process. CVCZSL (Li et al., 2019b) learns a neural net that maps directly from attributes to image features, that is, to class prototypes.

Currently, generative ZSL stands as the state of the art. The inaugural works of Xian et al. (2018b); Zhu et al. (2018) laid out the foundation, the basic generative approach, which can be broken down into three stages: first, a generative network is trained to generate instances of seen classes conditioned on the provided class descriptions. A differentiable classifier (e.g., a pre-trained linear classifier or AC-GAN (Odena et al., 2017)) can also be used to drive discriminative generation. In the second stage, given the description of every test class, the generative net is used to create a synthetic data set, transforming the problem into a supervised one. Then, a supervised classifier (SVM, linear, etc.) is trained on this data set. In the last stage, the classifier is tasked with classifying the actual test samples.

This basic approach has been somewhat enriched to improve performance. CIZSL (Elhoseiny & Elfeki, 2019) uses creative generation during training. LisGAN (Li et al., 2019a) borrows from prototypical approaches and utilizes class representatives to anchor generation. f-VAEGAN (Xian et al., 2019) shares weights between the decoder of a VAE and the generator of a WGAN to leverage the better aspects of both. GDAN (Huang et al., 2019) uses cycle consistency. ZSML (Verma et al., 2020) introduces meta-learning techniques. OCD (Keshari et al., 2020) uses an over-complete distribution to generate hard examples and render the synthetic data set more informative. TF-VAEGAN (Narayan et al., 2020) uses feedback to augment the f-VAEGAN.

In all cases, the proposed algorithms deviate from the basic approach mainly in the training regime of the generative net. In this paper, we describe Z2FSL, a generative ZSL framework which, via FSL, allows for improvements to the whole pipeline. In the next section, we describe in detail how this is achieved.
3. Preliminaries
In this section, we give all the necessary definitions, introduce notation and briefly discuss the necessary background.
Let X_tr be the training samples and X_ts the test samples. Let Y_tr be the corresponding set of training labels and Y_ts the corresponding set of test labels.

In the ZSL setting, we have Y_tr ∩ Y_ts = ∅, i.e., there are no common classes between training and testing, leading to the terms seen and unseen to describe the classes of the train and test setting respectively. In the Generalized ZSL (GZSL) setting, that restriction becomes Y_tr ⊂ Y_ts, meaning that there are samples from both unseen classes and every seen class during testing. Both in ZSL and GZSL, an auxiliary description of each class is provided to counterbalance the absence of training examples for unseen test classes. Such a description can be the Word2Vec representation of the class, an attribute vector or the Wikipedia article for the class.

In a FSL setting, it is sufficient to impose the restriction Y_tr ⊂ Y_ts, same as GZSL. However, no descriptions are provided. Rather, during testing, we are provided with a set of examples per test class, named the support set. A support set essentially consists of labeled examples from all n_W test classes, with n_S examples each, where n_W and n_S are arbitrary natural numbers referred to as way and shot respectively. Also, let S_k denote the set of examples of class k in a support set. Given a support set, we are tasked with classifying a set of unlabeled samples of the same n_W classes, called the query set. For convenience, we consider query sets to contain n_Q samples per class, which is also an arbitrary natural number. Randomly sampling a support set and a corresponding query set from a (usually significantly) larger test data set constitutes an episode. The purpose of an episode is to test the ability of a model to generalize to possibly unseen classes given only a small number of examples of each class, seen or unseen, and its classification accuracy on the episode is naturally used as the metric of success. It is customary to report the average accuracy over many episodes to capture a more robust metric for a particular data set, where the way and the shot of the episodes remain constant. That is to say, a specific FSL setting is characterized by its way and shot and described as n_W-way n_S-shot, e.g., 25-way 4-shot refers to the regime where the support sets contain n_W = 25 classes, with n_S = 4 samples each.

Wasserstein GAN: Wasserstein Generative Neural Nets (WGAN) (Arjovsky et al., 2017), an evolution of Generative Adversarial Nets (Goodfellow et al., 2014), are a framework to train a generator G, formulating p(x|z) of real samples x with noise inputs z, in a minimax fashion with another network, a discriminator D. Gulrajani et al. (2017) added a regularization term to improve performance. We present the formulation of p(x|a, z), which includes a conditioning variable a, namely

\[
\mathcal{L}_{WGAN}(G, D; p_R, p_Z) = \mathbb{E}_{(x,a) \sim p_R,\, z \sim p_Z}\left[ D(x, a) - D(G(a, z), a) \right] - \lambda\, \mathbb{E}_{(\hat{x}, a) \sim p_{\hat{X}}}\left[ \left( \| \nabla_{\hat{x}} D(\hat{x}, a) \|_2 - 1 \right)^2 \right], \quad (1)
\]

where p_R is the distribution of the real data, p_Z a "noise" distribution, p_X̂ the joint distribution of the conditioning variable and the uniform distribution on the line between x and G(a, z), i.e., intuitively x̂ = u x + (1 − u) G(a, z) with (x, a) ∼ p_R, z ∼ p_Z, u ∼ U(0, 1), and λ is a hyperparameter. The minimax game is defined as:

\[
\min_G \max_D \mathcal{L}_{WGAN}(G, D; p_R, p_Z). \quad (2)
\]

We use generator and generative net interchangeably.
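For illustration, the following is a minimal PyTorch sketch of the conditional WGAN objective with gradient penalty in Equation 1. The module interfaces (D and G taking batched feature, attribute and noise tensors) and all names are assumptions made for the example, not the released implementation.

```python
import torch

def wgan_gp_loss(D, G, x, a, z, lambda_gp=10.0):
    """Conditional WGAN objective with gradient penalty (Eq. 1), as a sketch.

    x: real features [N, d], a: class descriptions [N, d_a], z: noise [N, d_z].
    The returned value is maximized by D and minimized by G (Eq. 2).
    """
    x_fake = G(a, z)
    critic = D(x, a).mean() - D(x_fake, a).mean()

    # Gradient penalty on random interpolates between real and fake samples.
    u = torch.rand(x.size(0), 1, device=x.device)
    x_hat = (u * x + (1 - u) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat, a).sum(), x_hat, create_graph=True)[0]
    penalty = ((grads.norm(2, dim=1) - 1) ** 2).mean()

    return critic - lambda_gp * penalty
```

In practice, the discriminator's optimizer would minimize the negative of this value, while the generator's optimizer minimizes it directly.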
Variational Autoencoder: The Variational Autoencoder (VAE), introduced by Kingma & Welling (2014), is an autoencoder that maximizes the variational lower bound of the marginal likelihood. An autoencoder consists of an encoder E, which compresses its input to a "latent" variable z, and a decoder D, which reconstructs the original input of E based on z. We present the formulation of the VAE with a conditioning variable a,

\[
\mathcal{L}_{VAE}(E, D; p_R, p_\theta) = - \mathbb{E}_{(x,a) \sim p_R}\left[ x \cdot \log D(E(x, a), a) + (1 - x) \cdot \log\left(1 - D(E(x, a), a)\right) \right] + \mathbb{E}_{(x,a) \sim p_R}\left[ D_{KL}\left( p_E(z \mid x, a) \,\|\, p_\theta(z) \right) \right], \quad (3)
\]

where D_KL is the Kullback-Leibler (KL) divergence, p_R the distribution of the real data, p_E the output distribution of E, which is a Gaussian Feedforward Neural Network (FFNN), i.e., it outputs the mean and the diagonal elements of the covariance of a Gaussian distribution, which makes the KL divergence analytical, and p_θ is the prior distribution of the "latent" variable. For practical purposes, the prior is set to N(·; 0, I), and the reparameterization trick (Bengio et al., 2013) is used to sample from the encoder as z = μ(x, a) + ε ⊙ σ(x, a), ε ∼ N(ε; 0, I), where ⊙ is the Hadamard product and μ, σ the encoder's outputs.

f-VAEGAN: f-VAEGAN (Xian et al., 2019) is a generative ZSL approach that deploys both a WGAN and a VAE to train the generator, by sharing its weights between the VAE's decoder and the WGAN's generator. The overall loss function of the approach is

\[
\mathcal{L}_{VAEGAN}(G, E, D; p_R, p_Z, \beta) = \mathcal{L}_{VAE}(E, G; p_R, p_Z) + \beta \cdot \mathcal{L}_{WGAN}(G, D; p_R, p_Z), \quad (4)
\]

where E is the encoder of the VAE, D the discriminator of the WGAN, G the generator of the WGAN and the decoder of the VAE, and β a hyperparameter.

Prototypical Network: Prototypical Networks (PN) (Snell et al., 2017) present a simple, differentiable framework for FSL. A neural network, f_φ, is used to map the input samples to a metric space. The support set is mapped and the embeddings are averaged per class so as to get a prototype c_k for each. Then, each sample in the query set is mapped to the metric space and classified to the nearest prototype based on the Euclidean distance d(·, ·). The formulation is

\[
c_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} f_\phi(x_i), \qquad
p_\phi(y = k \mid x) = \frac{\exp\left(-d(f_\phi(x), c_k)\right)}{\sum_{k'} \exp\left(-d(f_\phi(x), c_{k'})\right)},
\]
\[
\mathcal{L}_{PN}(f_\phi; S, Q) = -\frac{1}{|Q|} \sum_{(x_i, y_i) \in Q} \log p_\phi(y = y_i \mid x_i), \quad (5)
\]

where p_φ(y|x) is the softmax output distribution of x belonging to some class. To compute L_PN, we have to sample episodes instead of batches. In this manner, training resembles testing. We refer to that manner of training as episodic.
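Equation 5 translates almost directly into code. The following is a minimal PyTorch sketch; the tensor shapes and the integer-label convention are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(f_phi, support, support_labels, query, query_labels):
    """Prototypical Network episode loss (Eq. 5), as a sketch.

    support/query: [N, d] feature tensors; labels: integer class
    indices in [0, n_way) identifying the episode's classes.
    """
    emb_s, emb_q = f_phi(support), f_phi(query)
    n_way = int(support_labels.max()) + 1

    # Prototype of each class: mean embedding of its support examples.
    protos = torch.stack(
        [emb_s[support_labels == k].mean(dim=0) for k in range(n_way)]
    )

    # Negative squared Euclidean distances act as logits for the softmax.
    dists = torch.cdist(emb_q, protos) ** 2
    return F.cross_entropy(-dists, query_labels)
```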
4. Method
We now describe the proposed Z2FSL framework in detail. In Z2FSL, a generative ZSL pipeline, used as a backbone, is coupled with a FSL classifier. We can train the two modules jointly by conceptually reducing ZSL to FSL, which simply means that the backbone generates the classifier's support set. To get an episode, real examples of the classes that are present in this support set can be used as the query set. During testing, the test samples become the query and the support is again provided by the backbone. The framework is also presented in Figure 1, where a graphical representation of how the novel component of our framework, the FSL classifier, affects the pipeline can be seen. The backbone provides the support set of the FSL classifier in all settings. During training, this allows the back-propagation of the FSL loss to the generator. During testing, synthesized examples of the test classes are provided in order to enable the classification of unseen classes.

This is possible because the FSL algorithm can classify its input dynamically, in the sense that its output distribution is based on the classes present in the support set. This allows us to train the classifier on classes other than the unseen classes and simply provide the necessary synthetic support set during testing. Another advantage is that the FSL algorithm can actually be pre-trained before the joint training with the backbone and/or fine-tuned afterwards, or even trained completely separately of the backbone.
We train all the components for multiple iterations. As an example, the components in this work, other than the generator and the FSL classifier, which are also visible in Figure 1, include the discriminator of a WGAN and the encoder of a VAE. In each iteration, we take steps training the components: training the FSL classifier, training the generator and training all the other components as necessary.

First, when training the FSL algorithm, we randomly sample query sets from the training set, and support sets are generated by the backbone, conditioned on the metadata of the classes that appear in the query set. The FSL loss, denoted L_FSL, remains intact (e.g., Equation 5 for PNs). This is a slightly modified version of the episodic training we defined in Section 3.1, since the support set is now synthetic. Episodes during training, n_W, n_S and n_Q in particular, can be set arbitrarily, similar to batches. The process can be seen indirectly in Figure 1. For this step, we simply back-propagate L_FSL only to the FSL classifier.

Second, when training the generator of the ZSL backbone, we use the loss within the generative framework of the backbone, which we denote as L_ZSL (e.g., Equation 4 if the backbone is the f-VAEGAN). By feeding the generated samples to the FSL algorithm, we can back-propagate L_FSL to the generator. This also requires sampling a query set based on the classes the generator provides. The overall objective of the ZSL generator is then described as:

\[
\mathcal{L}_{Z2FSL} = \mathcal{L}_{ZSL} + \gamma\, \mathcal{L}_{FSL}, \quad (6)
\]

where γ is a hyperparameter. This is the step depicted in Figure 1 if we take into account all arrows.

The rest of the components can be trained before or after the two aforementioned steps. In this work, for example, we update the discriminator multiple times before each generator update, as suggested by Goodfellow et al. (2014); Arjovsky et al. (2017).

However, components of the backbone can even be trained along with the generator. A component that does not affect the generation of the support set in the forward pass can be trained along with the generator, but it remains unaffected by L_FSL. Such a component is the encoder of a VAE, which we also use in our work, or a regressor for cycle consistency (Huang et al., 2019). A component that affects generation in the forward pass, such as a feedback module (Narayan et al., 2020), can be trained along with the generator and updated based on the objective in Equation 6 rather than L_ZSL alone.

In Figure 1, we can more generally see that both real examples and corresponding descriptions are provided to the backbone, while only real and generated examples are provided to the classifier. Real samples are useful to the backbone only during training, e.g., to compute the VAE reconstruction loss.
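The steps of one iteration can be summarized in the following sketch. The helpers l_fsl, l_wgan and l_zsl are assumed to implement Equations 5, 1 and 4 respectively, and all names are illustrative rather than taken from the released code.

```python
def z2fsl_iteration(G, E, D, f_phi, query, attrs, opts, gamma, noise_fn):
    """One Z2FSL training iteration (illustrative sketch).

    query: real features of the episode's classes; attrs: their class
    descriptions; opts: dict of per-module optimizers; noise_fn samples
    from the noise prior p_Z.
    """
    # 1) FSL classifier step: synthetic support, real query. Only L_FSL
    #    flows into f_phi, so the generator's graph is detached.
    support = G(attrs, noise_fn()).detach()
    loss = l_fsl(f_phi, support, query)
    opts["fsl"].zero_grad(); loss.backward(); opts["fsl"].step()

    # 2) Several discriminator (critic) updates per generator update;
    #    D maximizes Eq. 1, hence minimizes its negative.
    for _ in range(5):
        loss = -l_wgan(D, G, query, attrs, noise_fn())
        opts["disc"].zero_grad(); loss.backward(); opts["disc"].step()

    # 3) Generator step: backbone loss plus the FSL loss (Eq. 6); the
    #    generation graph is kept this time so L_FSL reaches G.
    support = G(attrs, noise_fn())
    loss = l_zsl(G, E, D, query, attrs) + gamma * l_fsl(f_phi, support, query)
    opts["gen"].zero_grad(); loss.backward(); opts["gen"].step()
```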
Figure 1.
Graphical representation of the Z2FSL framework and pipeline. We use a generative Zero-shot Learning backbone with a Few-shot Learning classifier. During training, we train the generator G with a combination of its training within the backbone and the Few-shot Learning classification task. During testing, the forward pass is only altered in that real examples are not provided to the backbone, and no backward pass is performed.

For the evaluation, the pipeline is modified in two ways. First, as shown in Figure 1, we stop providing real samples to the backbone. Only the descriptions are necessary so as to generate the test support set. Second, episodes are restricted by the test setting. The number of classes in each set, n_W, is fixed and equal to the number of test classes. n_Q is equal, for each test class, to the number of test samples available. n_S remains a hyperparameter, as we can choose the number of samples to generate per class. In this manner, the query set contains all test samples and the FSL classification accuracy is exactly the final ZSL accuracy.

We make three assumptions about the backbone and the classifier altogether. First, the backbone is a generative ZSL approach. Second, the classifier is trained through a differentiable process. Third, the classifier can classify its input dynamically, by matching support and query classes (Vinyals et al., 2016; Snell et al., 2017; Wang et al., 2018). This fact allows the usage of the classifier during testing without any training on unseen classes, as long as corresponding supporting examples are provided at that time. This means that the classifier is not reliant on the generator and, consequently, the modules can be trained separately or jointly.

As a result, our framework is suitable for a wide class of ZSL and FSL algorithms. We formulate our framework as agnostic to the generative ZSL and the FSL algorithm, and refer to a specific implementation of it by the following macro: Z2FSL(z, f), where z is the generative ZSL backbone and f the FSL classifier.
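For concreteness, the test-time support-set synthesis could look as follows; the function, its arguments and the noise dimensionality are assumed for illustration. One call of the generator per test class, conditioned on that class's attribute vector, yields n_S synthetic examples, and the FSL classifier then treats all real test samples as the query set of a single large episode.

```python
import torch

@torch.no_grad()
def build_test_support(G, test_attrs, n_s, noise_dim):
    """Synthesize the test support set (sketch; names assumed).

    test_attrs: one attribute vector per test class.
    Returns stacked synthetic features and their class labels.
    """
    support, labels = [], []
    for k, a in enumerate(test_attrs):
        z = torch.randn(n_s, noise_dim)
        support.append(G(a.unsqueeze(0).expand(n_s, -1), z))
        labels.append(torch.full((n_s,), k, dtype=torch.long))
    return torch.cat(support), torch.cat(labels)
```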
5. Experiments
We first present the data sets, followed by implementation details, and finally our experimental results along with a comparison to the state of the art.

5.1. Data Sets
We use the Caltech-UCSD Birds 200 (CUB, Wah et al. 2011) data set, which consists of 11788 images of birds belonging to 200 species. One 312-dimensional attribute vector per class is provided as well. We also use Animals with Attributes 2 (AwA2, Xian et al. 2018a). It contains 37322 images from 50 categories and 85-dimensional attributes. Our last data set is the SUN Scene Classification (SUN, Patterson & Hays 2012) data set, with 14340 images of 717 categories and 102-dimensional attributes.

We use the provided attributes as our auxiliary descriptions, and in particular the continuous attributes after normalizing them w.r.t. their L2 norm. We use 10 crops per image (original image, top-right, top-left, bottom-right, bottom-left and their horizontally flipped counterparts) as augmentation. We use the original image for testing. Instead of images, we use features extracted by a ResNet-101 (He et al., 2016) trained on ImageNet (Deng et al., 2009). We choose the 2048-dimensional output of its adaptive average pooling layer. Additionally, we perform min-max normalization of the features to [0, 1] and use the train-test and seen-unseen splits proposed by Xian et al. (2018a).
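A sketch of this preprocessing, assuming torchvision's ResNet-101 and per-dimension min-max statistics, is the following; the exact normalization details of the released code may differ.

```python
import torch
import torchvision.models as models

# ImageNet-pretrained ResNet-101 as a fixed feature extractor.
resnet = models.resnet101(pretrained=True)
resnet.fc = torch.nn.Identity()        # keep the 2048-d pooled features
resnet.eval()

@torch.no_grad()
def extract_features(images):          # images: [N, 3, H, W], ImageNet-normalized
    return resnet(images)              # [N, 2048] adaptive-avg-pool output

def minmax_normalize(feats, eps=1e-12):
    lo = feats.min(dim=0).values
    hi = feats.max(dim=0).values
    return (feats - lo) / (hi - lo + eps)          # features in [0, 1]

def l2_normalize_attributes(attrs, eps=1e-12):
    return attrs / (attrs.norm(dim=-1, keepdim=True) + eps)
```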
Figure 2.
Comparison of Z2FSL to baseline generative ZSL backbones. Linear(z) describes the setting where the generative ZSL approach z is used with a linear classifier trained on the generated data. We skip the second argument of the Z2FSL macro for convenience, since it is the PN in all cases. Linear(VAEGAN) results for CUB and SUN are reported from Xian et al. (2019). The metric presented is the average per class top-1 accuracy. Performance gains denote absolute improvement compared to the corresponding baseline.

5.2. Implementation Details

Since we present the results of Z2FSL(f-VAEGAN, PN) in comparison to the state of the art, we present details for that specific architecture in this section.

We pre-train the PN in an episodic manner as suggested by Snell et al. (2017), where support and query sets are sampled from the real training data of the corresponding data set. The PN is implemented as a FFNN with n_h hidden layers with square weight matrices and ReLU activations. The FSL classifier's learning rate is kept the same in pre-training and the joint training with the generator.

The generator and the encoder are FFNNs with 2 hidden layers each, 4096 followed by 8192 units for the generator and the reverse for the encoder, Leaky ReLU hidden activations (0.2 slope), linear output for the encoder and sigmoid for the generator. The noise dimension is chosen to be equal to the dimension of the attributes. The discriminator is a FFNN with one hidden layer of 4096 neurons and Leaky ReLU hidden activations (0.2 slope). We set the coefficient of the regularization term of the WGAN to λ = 10 and the number of training updates of the discriminator per generator update equal to 5.

When we sample support sets from the generative net, we set n_S = 5 during training. During testing, we set it to n_S = 1800 for unseen classes. For seen classes in the GZSL test setting, we experiment with both seen and unseen support and select the better alternative for each benchmark. We also consider the shot of the test support set for seen classes a different hyperparameter m_S.
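Under these descriptions, and assuming the conditioning variables are concatenated to the inputs, the modules could be sketched as follows; the CUB dimensionalities are used as an example, and all names are illustrative.

```python
import torch.nn as nn

attr_dim, feat_dim = 312, 2048          # e.g. CUB attributes and ResNet features
noise_dim = attr_dim                    # noise dimension equals attribute dimension

generator = nn.Sequential(              # (attributes, noise) -> synthetic features
    nn.Linear(attr_dim + noise_dim, 4096), nn.LeakyReLU(0.2),
    nn.Linear(4096, 8192), nn.LeakyReLU(0.2),
    nn.Linear(8192, feat_dim), nn.Sigmoid(),     # features live in [0, 1]
)

encoder = nn.Sequential(                # (features, attributes) -> Gaussian stats
    nn.Linear(feat_dim + attr_dim, 8192), nn.LeakyReLU(0.2),
    nn.Linear(8192, 4096), nn.LeakyReLU(0.2),
    nn.Linear(4096, 2 * noise_dim),     # linear output: mean and (log-)variance
)

discriminator = nn.Sequential(          # (features, attributes) -> critic score
    nn.Linear(feat_dim + attr_dim, 4096), nn.LeakyReLU(0.2),
    nn.Linear(4096, 1),
)
```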
Table 1. Comparison of our approach, Z2FSL(f-VAEGAN, PN), to previous work. The metric presented is the average per class top-1 accuracy.

Zero-Shot Learning

APPROACH                             CUB    AwA2   SUN
CVCZSL (Li et al., 2019b)            54.4   71.1   62.6
f-CLSWGAN (Xian et al., 2018b)
LisGAN (Li et al., 2019a)            58.8   -      61.7
f-VAEGAN (Xian et al., 2019)         61.0   -      64.7
OCD (Keshari et al., 2020)           60.3   71.3   63.5
TF-VAEGAN (Narayan et al., 2020)
Z2FSL(f-VAEGAN, PN)                  62.5   68.0   66.5

The optimizer in all cases is Adam (Kingma & Ba, 2015). We apply gradient clipping, restricting the gradient for each parameter to a symmetric interval around zero. Our implementation is in PyTorch (Paszke et al., 2019). The tunable hyperparameters that we either search for or vary by data set are presented in the Supplementary Material.

5.3. Comparison to Baselines

In this section, we present the performance of our framework using various generative ZSL approaches as backbones and compare that with the plain generative ZSL approaches, i.e., without the FSL algorithm, as baselines. The comparison can be seen in Figure 2. The baselines we have chosen are a VAE, a WGAN and a f-VAEGAN. Z2FSL improves the performance of all baselines consistently across all benchmarks. It is interesting to see that, in most cases, the simple backbones, the VAE and the WGAN, enhanced by our framework, exceed the performance of the more elaborate and superior – on its own – plain f-VAEGAN.
5.4. Comparison to the State of the Art

In this section, we present our results in comparison with the current state-of-the-art approaches that use the same test setting as our approach (feature extractor, dimensionality of features, splits, etc., described in Sections 5.1 and 5.2).

5.4.1. Zero-Shot Learning
For ZSL, our experimental evaluation in Section 5.3 shows that Z2FSL(f-VAEGAN, PN) outperforms the f-VAEGAN. This performance compares favorably to the rest of the state-of-the-art approaches as well, as can be seen in Table 1. In particular, even though the TF-VAEGAN itself builds on top of the f-VAEGAN and improves performance, our approach outperforms it on SUN, where it achieves state-of-the-art performance, improving the previous best by an absolute margin.
Table 2.
Comparison of our approach, Z2FSL(f-VAEGAN, PN), to previous work. The metrics presented are: u is the average per class top-1 accuracy of unseen classes, s is the average per class top-1 accuracy of seen classes and H their harmonic mean. H is considered the main metric of this setting.

Generalized Zero-Shot Learning

                                       CUB               AwA2              SUN
APPROACH                           u     s     H     u     s     H     u     s     H
CVCZSL (Li et al., 2019b)          47.4  47.6  47.5  56.4
f-CLSWGAN (Xian et al., 2018b)     43.7  57.7  49.7  -     -     -     42.6  36.6  39.4
LisGAN (Li et al., 2019a)          46.5  57.9  51.6  -     -     -     42.9  37.8  40.2
f-VAEGAN (Xian et al., 2019)       48.4  60.1  53.6  -     -     -     45.1  38.0  41.3
OCD (Keshari et al., 2020)         44.8  59.9  51.3  59.5  73.4  65.7  44.8
TF-VAEGAN (Narayan et al., 2020)
Z2FSL(f-VAEGAN, PN)                47.2  61.2  53.3  57.4  80.0  66.8

5.4.2. Generalized Zero-Shot Learning
For GZSL, results can be seen in Table 2. Performance in AwA2 is marginally better than the previous state of the art, CVCZSL and TF-VAEGAN. In contrast, in CUB and SUN it is harder to balance seen and unseen accuracies, as decreasing the s metric, which is partly controlled by m_S, is required to achieve the best H possible. This leads to inferior performance compared to our specific backbone in SUN and marginally worse accuracy in CUB. The state-of-the-art results in AwA2 can be partly explained by the small number of classes, which are 50 in total, only 10 of which are unseen.

5.5. Ablation Studies

5.5.1. Component Analysis
We perform a more detailed analysis of how the usage of the novel component of our approach, the FSL classifier, affects performance. We experiment with using the FSL classifier solely during the training of the backbone. During testing, we follow the standard practice of training a linear classifier on the synthetic data. We also experiment with using the FSL classifier only during testing, i.e., simply setting γ = 0 in Equation 6. Results are presented in Figure 3. We can observe that both regimes yield improvement compared to the plain backbone. Additionally, regarding the gains in performance compared to the backbone, the gain of Z2FSL is greater than the sum of the gains of the two ablation studies. This shows that end-to-end training yields a significant improvement and validates that this process renders the generation discriminative in a manner that explicitly helps the classifier.

Figure 3.
FSL classifier's effects on performance. We compare the performance of the f-VAEGAN, the backbone generative ZSL approach, to the performance of Z2FSL(f-VAEGAN, PN) when we discard the FSL classifier during the evaluation and use a linear classifier instead, the performance of Z2FSL(f-VAEGAN, PN) when γ = 0 (Equation 6), i.e., discarding the FSL classifier during training only, and the complete Z2FSL(f-VAEGAN, PN) (with the FSL classifier in both settings). The metric presented is the average per class top-1 accuracy. Performance gains denote absolute improvement compared to the backbone.

5.5.2. Synthetic vs. Real Support Set
We also examine the performance of Z2FSL when using real supporting examples of seen classes compared to performance with synthetic ones. Results are presented in Table 3. In AwA2 and SUN, the synthetic support results in a better harmonic mean. On the other hand, in CUB, real support leads to an increase in performance. Moreover, as we noted in Section 5.2, the shot for seen classes in the test support set is controlled by a different hyperparameter than that of unseen classes, m_S and n_S (for testing) respectively. (The results in Table 3 are achieved with an n_S that is roughly two orders of magnitude greater than m_S.)
Table 3.
Comparison of our approach, Z2FSL(f-VAEGAN, PN), with real and synthetic support for seen classes. The metrics presented are: u is the average per class top-1 accuracy of unseen classes, s is the average per class top-1 accuracy of seen classes and H their harmonic mean.

Generalized Zero-Shot Learning

                                           CUB               AwA2              SUN
APPROACH                               u     s     H     u     s     H     u     s     H
Z2FSL(f-VAEGAN, PN) (with real support) 47.2  61.2  53.3
Z2FSL(f-VAEGAN, PN)                     44.4  58.0  50.3  57.4  80.0  66.8

Table 4.
Comparison of our approach, Z2FSL(f-VAEGAN, PN), with and without pre-training the FSL classifier. The metric presented is the average per class top-1 accuracy (ZSL).

APPROACH                    CUB    SUN
Z2FSL (no pre-training)     58.0   61.3
Z2FSL                       62.5   66.5

These facts clearly demonstrate the bias towards seen classes that the naive approaches of using all the available training data in the support set, or too many synthetic examples, could lead to.

5.5.3. Pre-training
We also test the effects of pre-training the FSL classifier of our framework. It makes sense, intuitively, to pre-train it in an actual FSL setting before the joint training with the generator, where we choose to train it with a combination of real and synthetic data. Table 4 shows a decrease in performance without pre-training, a 5.2% absolute decrease in SUN and 4.5% in CUB to be exact. This illustrates that training in the FSL setting is essential to avoid overfitting.
6. Conclusions
In this paper, we introduce a novel, end-to-end generative ZSL framework, Z2FSL. Z2FSL uses the same supervised classifier during both training and testing. We choose a FSL algorithm to fill that role, since the choice of classifier is restricted by the ZSL setting. In this manner, we also couple the two low-data regimes. We formulate our framework so as to allow a broad class of generative ZSL approaches and FSL classifiers to be integrated. Empirically, we show that our framework improves upon the results of the plain generative approach. Extensive ablation studies reveal that the improvement originates from the fact that the generated samples of the generative net are rendered more discriminative in a way that explicitly helps the classifier. In addition, these studies demonstrate the advantages of being able to use a pre-trained classifier. Our results are state of the art or competitive across all benchmarks. We also show that using synthetic samples for seen classes, as well as decreasing the number of samples of these classes compared to unseen ones, can improve performance in GZSL.

Our future plans include further investigating and mitigating the bias towards seen classes in GZSL. Another research direction we plan to investigate is techniques to better train the FSL classifier separately of the generator. Initial thoughts include consistent fine-tuning on generated samples of unseen classes after the joint training and extensive pre-training on other data sets. Since we now have an end-to-end process to train the generative net, it is possible to completely dispose of the generative framework and, therefore, we also plan to investigate this training regime.

References
Akata, Z., Perronnin, F., Harchaoui, Z., and Schmid, C. Label-embedding for image classification. IEEE Trans. Pattern Anal. Mach. Intell., 38(7):1425–1438, 2015a.

Akata, Z., Reed, S., Walter, D., Lee, H., and Schiele, B. Evaluation of output embeddings for fine-grained image classification. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2927–2936, 2015b.

Al-Halah, Z., Tapaswi, M., and Stiefelhagen, R. Recovering the missing link: Predicting class-attribute associations for unsupervised zero-shot learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 5975–5984, 2016.

Antoniou, A., Storkey, A., and Edwards, H. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. In Int. Conf. Mach. Learn., pp. 214–223, 2017.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Changpinyo, S., Chao, W.-L., Gong, B., and Sha, F. Synthesized classifiers for zero-shot learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 5327–5336, 2016.

Chen, W., Ling, H., Gao, J., Smith, E., Lehtinen, J., Jacobson, A., and Fidler, S. Learning to predict 3D objects with an interpolation-based differentiable renderer. In Adv. Neural Inform. Process. Syst., pp. 9609–9619, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 248–255, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pp. 4171–4186. Association for Computational Linguistics, 2019.

Elhoseiny, M. and Elfeki, M. Creativity inspired zero-shot learning. In Int. Conf. Comput. Vis., pp. 5784–5793, 2019.

Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. DeViSE: A deep visual-semantic embedding model. In Adv. Neural Inform. Process. Syst., pp. 2121–2129, 2013.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Adv. Neural Inform. Process. Syst., pp. 2672–2680, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Adv. Neural Inform. Process. Syst., pp. 5767–5777, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Int. Conf. Comput. Vis., pp. 1026–1034, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 770–778, 2016.

Huang, H., Wang, C., Yu, P. S., and Wang, C.-D. Generative dual adversarial network for generalized zero-shot learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 801–810, 2019.

Keshari, R., Singh, R., and Vatsa, M. Generalized zero-shot learning via over-complete distribution. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 13300–13308, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Int. Conf. Learn. Represent., 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Int. Conf. Learn. Represent., 2014.

Kodirov, E., Xiang, T., and Gong, S. Semantic autoencoder for zero-shot learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3174–3183, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Adv. Neural Inform. Process. Syst., pp. 1097–1105, 2012.

Lampert, C. H., Nickisch, H., and Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(3):453–465, 2013.

Li, J., Jing, M., Lu, K., Ding, Z., Zhu, L., and Huang, Z. Leveraging the invariant side of generative zero-shot learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7402–7411, 2019a.

Li, K., Min, M. R., and Fu, Y. Rethinking zero-shot learning: A conditional visual classification perspective. In Int. Conf. Comput. Vis., pp. 3583–3592, 2019b.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In Int. Conf. Learn. Represent., 2013.

Narayan, S., Gupta, A., Khan, F. S., Snoek, C. G., and Shao, L. Latent embedding feedback and discriminative features for zero-shot classification. arXiv preprint arXiv:2003.07833, 2020.

Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G. S., and Dean, J. Zero-shot learning by convex combination of semantic embeddings. In Int. Conf. Learn. Represent., 2013.

Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Int. Conf. Mach. Learn., pp. 2642–2651, 2017.

Park, S., Ryu, H., Lee, S., Lee, S., and Lee, J. Learning predict-and-simulate policies from unorganized human motion data. ACM Trans. Graph., 38(6):1–11, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Adv. Neural Inform. Process. Syst., pp. 8026–8037, 2019.

Patterson, G. and Hays, J. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2751–2758, 2012.

Qiao, R., Liu, L., Shen, C., and Van Den Hengel, A. Less is more: Zero-shot learning from online textual documents with noise suppression. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2249–2257, 2016.

Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.

Romera-Paredes, B. and Torr, P. An embarrassingly simple approach to zero-shot learning. In Int. Conf. Mach. Learn., pp. 2152–2161, 2015.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Adv. Neural Inform. Process. Syst., pp. 4077–4087, 2017.

Starke, S., Zhang, H., Komura, T., and Saito, J. Neural state machine for character-scene interactions. ACM Trans. Graph., 38(6), Article 209, 2019.

Verma, V. K., Brahma, D., and Rai, P. Meta-learning for generalized zero-shot learning. In AAAI, pp. 6062–6069, 2020.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Adv. Neural Inform. Process. Syst., pp. 3630–3638, 2016.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. 2011.

Wang, Y.-X., Girshick, R., Hebert, M., and Hariharan, B. Low-shot learning from imaginary data. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7278–7286, 2018.

Weston, J., Bengio, S., and Usunier, N. Large scale image annotation: Learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, 2010.

Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., and Schiele, B. Latent embeddings for zero-shot classification. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 69–77, 2016.

Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell., 41(9):2251–2265, 2018a.

Xian, Y., Lorenz, T., Schiele, B., and Akata, Z. Feature generating networks for zero-shot learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 5542–5551, 2018b.

Xian, Y., Sharma, S., Schiele, B., and Akata, Z. f-VAEGAN-D2: A feature generating framework for any-shot learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 10275–10284, 2019.

Zhou, T., Brown, M., Snavely, N., and Lowe, D. G. Unsupervised learning of depth and ego-motion from video. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 1851–1858, 2017.

Zhu, Y., Elhoseiny, M., Liu, B., Peng, X., and Elgammal, A. A generative adversarial approach for zero-shot learning from noisy texts. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 1004–1013, 2018.
A. Evaluation Metrics
For the sake of completeness, we formally define the Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) metrics we use. We evaluate our framework with the average per-class top-1 accuracy for ZSL, defined as

\[
acc_{\mathcal{Y}} = \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \frac{\#\ \text{correct predictions in class } y}{\#\ \text{samples in class } y}, \quad (7)
\]

where in the case of ZSL, Y are the unseen classes. For GZSL, we use the harmonic mean of the average per-class accuracy of seen classes and that of unseen classes, defined as

\[
H = \frac{2\, u \cdot s}{u + s}, \quad (8)
\]

where we define u = acc_Y for unseen classes and s = acc_Y for seen classes, for convenience and adherence to established notation.
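A direct Python transcription of these two metrics, assuming plain lists (or 1-D arrays) of ground-truth and predicted labels, is the following:

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Average per-class top-1 accuracy (Eq. 7)."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def harmonic_mean(u, s):
    """GZSL harmonic mean of unseen (u) and seen (s) accuracies (Eq. 8)."""
    return 2 * u * s / (u + s)
```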
B. Hyperparameters

We present the rest of the hyperparameters for the pre-training of the Few-Shot Learning (FSL) algorithm, the Prototypical Network (PN), in Table 5, and for the training within our framework, Z2FSL(f-VAEGAN, PN), in Table 6, both for Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL).

C. Classifier Fine-tuning
After the joint training of the FSL classifier and the generator of the backbone, we fine-tune the FSL classifier on samples generated by the generator conditioned on attributes of the unseen classes. This basically means that the generator provides both the support and the query set. We train for 25 episodes, using the same learning rate as in the other two settings in which the classifier is trained. We also keep the hyperparameters of the episode in this training regime the same as in the joint training of the FSL classifier and the generator. This process provides marginal improvement, if any, and is generally inconsistent. Further work is required to stabilize it.
D. Prototypical Network Initialization
We initialize the weight matrices of the PN by setting all the diagonal elements equal to 1, while the rest are randomly sampled i.i.d. from a zero-mean Gaussian with small standard deviation. We do so to bias the PN to preserve its input space structure as much as possible, which we expect to be somewhat discriminative due to ResNet-101. Notice that we can do so because ResNet-101 yields non-negative values and we use ReLU activations. This structure can be thought of as similar to a residual layer (He et al., 2016) routinely used in Convolutional Neural Nets. It is for this reason that all weight matrices in the PN are square. We observed an increase in validation accuracy in all settings using this initialization trick. We are unaware of another similar approach in the literature, so further investigation may be warranted.
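A sketch of this near-identity initialization for a single linear layer is given below; the standard deviation of the off-diagonal noise is an assumed placeholder value.

```python
import torch

def init_near_identity(linear, std=0.01):
    """Identity diagonal plus small Gaussian off-diagonal noise.

    Assumes a square weight matrix, as in the PN described above;
    std is an assumed value, not the one used in the paper.
    """
    with torch.no_grad():
        w = torch.randn_like(linear.weight) * std   # small off-diagonal noise
        w += torch.eye(linear.weight.size(0))       # ones on the diagonal
        linear.weight.copy_(w)
        if linear.bias is not None:
            linear.bias.zero_()
```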
E. Data Augmentation

For the extra crops besides the original image, we crop the original image starting from the desired corner and extending up to 80% of each dimension, and finally resize the crop to match the original image's dimensions.
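A PIL-based sketch of this ten-crop augmentation follows; the function and variable names are illustrative.

```python
from PIL import Image

def ten_crops(img: Image.Image, ratio: float = 0.8):
    """Original image, four 80% corner crops (resized back to the
    original dimensions), and the horizontal flips of all five."""
    w, h = img.size
    cw, ch = int(ratio * w), int(ratio * h)
    boxes = [(0, 0, cw, ch), (w - cw, 0, w, ch),          # top-left, top-right
             (0, h - ch, cw, h), (w - cw, h - ch, w, h)]  # bottom-left, bottom-right
    crops = [img.crop(b).resize((w, h)) for b in boxes] + [img]
    return crops + [c.transpose(Image.FLIP_LEFT_RIGHT) for c in crops]
```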
Table 5. Hyperparameter configuration per setting and data set for the Prototypical Network's pre-training. From top to bottom, the hyperparameters presented are the learning rate of the FSL algorithm α_h, the number of episodes N_h, the number of hidden layers n_h, the number of classes in an episode n_W, the number of support examples per class n_S, and the number of queries per class n_Q.

                          ZSL                   GZSL
HYPERPARAMETER     CUB    AwA2   SUN      CUB    AwA2   SUN
α_h
N_h
n_h
n_W                25     10     40       25     10     50
n_S
n_Q                10     15     5        10     15     2

Table 6. Hyperparameter configuration per setting and data set for our main experiments. We have the ZSL learning rate α_f, the coefficient of the WGAN loss β, the coefficient of the FSL loss γ, the number of classes in a training episode n_W, the number of generations per class during training n_S, the number of generations per seen class during testing m_S, the number of "queries" per class in a training episode n_Q, and the number of episodes N.

                          ZSL                   GZSL
HYPERPARAMETER     CUB    AwA2   SUN      CUB    AwA2   SUN
α_f
β                  100    100    100      100    100    100
γ                  100    100    100      10     10     10
n_W                25     10     80       25     10     80
n_S
m_S                -      -      -        5      2      5
n_Q
N