Generalised Zero-Shot Learning with Domain Classification in a Joint Semantic and Visual Space
Rafael Felix, Ben Harwood, Michele Sasdelli, Gustavo Carneiro
Australian Centre for Robotic Vision, The University of Adelaide; Monash University
{rafael.felixalves, michele.sasdelli, gustavo.carneiro}@adelaide.edu.au, [email protected]

Abstract
Generalised zero-shot learning (GZSL) is a classification problem where the learning stage relies on a set of seen visual classes and the inference stage aims to identify both the seen visual classes and a new set of unseen visual classes. Critically, both the learning and inference stages can leverage a semantic representation that is available for the seen and unseen classes. Most state-of-the-art GZSL approaches rely on a mapping between latent visual and semantic spaces without considering whether a particular sample belongs to the set of seen or unseen classes. In this paper, we propose a novel GZSL method that learns a joint latent representation that combines both visual and semantic information. This mitigates the need for learning a mapping between the two spaces. Our method also introduces a domain classification that estimates whether a sample belongs to a seen or an unseen class. Our classifier then combines a class discriminator with this domain classifier with the goal of reducing the natural bias that GZSL approaches have toward the seen classes. Experiments show that our method achieves state-of-the-art results in terms of harmonic mean, the area under the seen and unseen curve, and unseen classification accuracy on public GZSL benchmark data sets. Our code will be available upon acceptance of this paper.
1. Introduction
Humans have a powerful ability to learn about new visual objects without actually seeing them. This process generally involves the use of language to describe what a new visual object would look like. The textual description then allows a new class of object to be formed in a person's mind. Our understanding of exactly how the human brain functions for this task is limited, but it is clear that humans make some sort of association between visual objects and semantic textual descriptions. Conceptually, objects with similar descriptions can naturally be viewed as being near to each other in some latent space representing visual and semantic information. The research topic known as generalised zero-shot learning (GZSL) aims to mimic this recognition ability of humans. In general, GZSL approaches employ an auxiliary set of semantic information that describes a set of visual classes. This additional information, such as tags or descriptions, can be utilised to overcome missing visual information in some of the classes [1].

Traditional GZSL approaches aim to recognise the visual classes available during the training process (i.e. the seen, source or known classes), and also classes that are not available during training (i.e. unseen, target or novel classes). Due to this constraint, GZSL approaches are intrinsically divided into two main tasks: (1) the training of a model that learns a transformation from the visual to the semantic space, using the visual samples and semantic information from seen classes; and (2) the transformation of a new test image by the model above into the semantic space, followed by a search for the closest semantic sample representing a seen or unseen class. In recent years, GZSL researchers have become increasingly interested in pairwise functions for disentangling these domains [2], and in deep generative models [3], [4] for learning to transform between the visual and semantic representations. In general, GZSL methods do not try to estimate if a test sample belongs to the set of seen or unseen classes; this issue inevitably biases GZSL approaches toward seen classes. Only recently has this issue been acknowledged, with a method that automatically combines the classification of Zero-Shot Learning (ZSL) for unseen classes with the classification of seen classes, by automatically weighting (using the test sample) the contribution of each classifier [5]. Although that approach is in the right direction, it has the issue of relying on the training of multiple classifiers. Another issue with the methods above is that they do not consider a latent space jointly optimised for the visual and semantic representation, which we believe is a crucial part of the inference process performed by humans that should be imitated by GZSL methods.

In Fig. 1, we illustrate the idea explored in this paper for GZSL. The visual and semantic samples are represented in a joint latent space. This space is used to learn a classifier of visual classes and a domain classifier for seen and unseen domains.

In this paper, we aim to explore two observations about the latent space for the domain classification. The first observation is that samples from unseen classes that are visually similar to one of the seen classes tend to be projected relatively close to the distributions of seen classes, instead of outside the distribution of seen classes, as proposed by Socher et al. [6].
Our second observation is that samples from unseen classes that are visually different from any of the seen classes tend to be projected outside the distribution of seen classes [6]. Atzmon and Chechik [5] propose a general framework that combines domain expert classifiers, such as DAP [7] for the unseen classes and LAGO for the seen classes [5]. However, this method relies on the disjoint training of both expert models, and on the assumption that unseen samples are projected outside the distribution of seen classes [6]. Hence, this method can be considered generally sub-optimal. We propose a general framework for learning and combining the visual and domain classifiers using the latent space. More specifically, we first introduce a general framework for latent space learning from cycle-WGAN [3] and CADA-VAE [4]. Then, we propose a novel method for the seen and unseen domain classification from this latent space. Finally, we introduce a way to combine the visual and domain classifiers. The empirical results show that our proposed framework outperforms previous approaches in terms of unseen accuracy and harmonic mean (H-mean) on several GZSL benchmark data sets, such as CUB [8], SUN [1], AWA1 [7], [1] and AWA2 [7], [1]. In terms of unseen accuracy, our method shows improvements of . %, . %, . % and . % for CUB, SUN, AWA1 and AWA2, respectively. Moreover, our method shows substantial improvements in terms of the area under the curve of seen and unseen accuracy (AUSUC) [9]. For AUSUC, we improved from . , . , . to . , . , . on CUB, AWA1 and AWA2, respectively.
2. Related Work
In this section, we discuss relevant literature that motivates and contextualises our work.
Zero-shot learning (ZSL) is similar to GZSL, with a crucial difference: during inference, only the visual samples from the unseen classes are considered [10], [1]. This difference makes ZSL a special case of GZSL. Therefore, critical problems present in GZSL are not considered in this approach, such as the natural bias of the visual classifier toward the seen classes. Unfortunately, this setup not only reduces the applicability of ZSL methods but also makes them unrealistic for real-world applications [11], [3]. Also, ZSL fails to handle the seen and unseen data jointly [9], [12]. Due to the simplicity and unrealistic assumptions of ZSL, the whole field moved toward the GZSL problem, which is introduced next.
In GZSL, the algorithm is trained using visual samples from the seen classes, but the inference involves the analysis of samples from the seen and unseen classes. The main issue faced by GZSL methods is the bias toward the seen classes naturally present during inference, so a great deal of research has focused on mitigating this problem [3], [11]. Particularly important examples of this type of research are anomaly detection [6], domain balancing [9] and generative data augmentation for GZSL [3], [4], [11]. Despite the advances in GZSL with the approaches mentioned above, we note that little attention has been devoted to addressing the seen/unseen domain classification in GZSL based on a latent space that is jointly learned to represent the visual and semantic representations. Moreover, we argue that the multi-modal nature of this joint latent space carries interesting properties to perform domain classification. In this paper, we show that classifying the seen and unseen domains plays an important role in improving domain balancing in GZSL.

Figure 1: Depiction of the method proposed in this paper – our approach learns the latent space for the visual and semantic modalities. We train two classifiers using samples from this latent space: one to classify all the seen and unseen visual classes, and another to classify between the seen and unseen domains. The final classification combines the results of these two classifiers.
A particularly successful GZSL method is based on data augmentation, where artificial visual samples of the unseen classes are generated from the semantic representation to train the visual classifier [3], [4], [11], [13]. This approach has produced the current state-of-the-art results on GZSL benchmark data sets. Overall, these studies focus on how to learn generative models, conditioned on the semantic information, that are used to augment the data set for the unseen classes. Among the main approaches, we observe the use of Generative Adversarial Networks (GAN) [3], [11] and Variational Autoencoders (VAE) [4], [13]. In this paper, we formalise these approaches as a framework for generative probabilistic latent space learning. Additionally, we show that these latent spaces have interesting properties that allow our approach to classify samples into the seen or unseen domains for GZSL.
Recent research has tackled the problem of GZSL as a novelty detection problem [6]. This approach assumes that unseen classes are projected out of the distribution of seen classes. Therefore, samples from these unseen classes can be handled as outliers of the seen classes distribution [6]. However, this approach fails to notice that samples from unseen classes can be projected relatively close to one of the seen classes. Atzmon and Chechik [5] aim to tackle this novelty detection issue by providing a framework that handles domain classification for GZSL. The gist of that approach consists of a gating method that performs domain adaptation to combine an unseen class classifier (e.g., DAP [10], DeViSE [2] and CMT [6]) and a seen class classifier [5]. Even though this method achieves remarkable performance in GZSL, it still relies on a sub-optimal disjoint training of multiple classifiers. In this paper, we mitigate these two issues by combining a seen/unseen class discriminator with a domain classifier that uses samples from a latent space that is trained to represent both the visual and semantic spaces.
3. Method
In this section, we introduce the problem formulation and our proposed approach.
In order to formulate the method of learning a classifier that can recognise visual samples from unseen visual classes, we define a visual data set $\mathcal{D} = \{(\mathbf{x}, y)_i\}_{i=1}^{N}$, where $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^K$ denotes the visual representation, and $y \in \mathcal{Y} = \{1, ..., C\}$ denotes the visual class. Recent research shows that such a visual representation, $\mathbf{x}$, can be acquired from networks specialised in feature extraction. These are widely available in the literature, such as pre-trained deep residual nets [14].

In GZSL, the set of classes $\mathcal{Y}$ is split into two domains: the seen domain $\mathcal{Y}_S = \{1, ..., |S|\}$, and the unseen domain $\mathcal{Y}_U = \{(|S|+1), ..., (|S|+|U|)\}$. Hence, the total number of classes is $C = |S| + |U|$, with $\mathcal{Y} = \mathcal{Y}_S \cup \mathcal{Y}_U$ and $\mathcal{Y}_S \cap \mathcal{Y}_U = \emptyset$. During training, we can only access visual samples from $\mathcal{Y}_S$, but during testing, samples can come from any class in $\mathcal{Y}$. This lack of visual samples from unseen classes during training is compensated with a semantic data set that includes semantic information for the seen and unseen classes. Therefore, we introduce the semantic data set $\mathcal{R} = \{(\mathbf{a}, y)_j\}_{j \in \mathcal{Y}}$, which associates visual classes with semantic samples, where $\mathbf{a} \in \mathcal{A} \subseteq \mathbb{R}^L$ represents a semantic feature (e.g., a set of continuous features such as word2vec [1], or BoW). Note that the semantic data set only has a single element per class.

In comparison with the supervised learning paradigm, the problem of GZSL has a distinct setup. The data set $\mathcal{D}$ is divided into mutually exclusive training and testing visual subsets $\mathcal{D}^{Tr}$ and $\mathcal{D}^{Te}$, respectively. $\mathcal{D}^{Tr}$ contains a subset of the visual samples belonging to the seen classes, and $\mathcal{D}^{Te}$ contains the visual samples from the seen classes that are held out from training and all samples from the unseen classes. The training data set is composed of the semantic data set $\mathcal{R}$ and the training visual subset $\mathcal{D}^{Tr}$, while the testing data set relies only on the testing visual subset $\mathcal{D}^{Te}$.

In this section, we first introduce the components for latent space learning applied to GZSL models, then we describe CADA-VAE and cycle-WGAN. Finally, we introduce the domain classification for these latent spaces.

In recent years, we note an increasing number of models that use data augmentation for GZSL [3], [4], [11], [15], [16], [17]. Overall, these methods aim to learn a generative model that produces artificial samples from unseen visual classes conditioned on their semantic representation. These artificial samples lie in a latent space. In this paper, we aim to demonstrate that our proposed domain classification can be adapted to GZSL models that rely on data augmentation, such as CADA-VAE [4] and cycle-WGAN [3]. Although these two models consist of different training approaches, we observe that their components can be generally described as a framework for latent space learning. Below, we introduce three components of such models: the encoder (or generator), the decoder (or regressor), and the discriminator.

The encoder transforms samples from an input space (i.e., visual or semantic) into a latent space. We represent the encoder with

$$\mathbf{z}_x = \text{Encoder}_x(\mathbf{x}) \quad (1)$$

for the visual space, and similarly for the semantic space with $\mathbf{z}_a = \text{Encoder}_a(\mathbf{a})$, where the vectors $\mathbf{z}_{\{x,a\}} \in \mathbb{R}^Z$ lie in the latent space. The decoder transforms from the latent space into one of the input modalities. We represent the decoder with

$$\tilde{\mathbf{x}} = \text{Decoder}_x(\mathbf{z}), \quad (2)$$

and similarly for the semantic space with $\tilde{\mathbf{a}} = \text{Decoder}_a(\mathbf{z})$.
The latent space discriminator, used to determine whether a sample $\mathbf{z}$ belongs to the latent space given the input $\mathbf{x}$, is represented by

$$p(\mathbf{z}|\mathbf{x}) = \text{Discriminator}(\mathbf{z}; \mathbf{x}). \quad (3)$$

We consider the simplified models above to describe CADA-VAE [4] and cycle-WGAN [3] as latent space learning models.
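To make the three components concrete, the following minimal PyTorch sketch instantiates the encoder, decoder and discriminator of (1)-(3) as simple feed-forward networks. The layer sizes, activations and class names are illustrative assumptions for exposition; they are not the exact architectures of CADA-VAE [4] or cycle-WGAN [3], which are specified in Sec. 4.

```python
# A minimal sketch of the latent-space framework in Eqs. (1)-(3).
# Layer sizes and module names are illustrative assumptions only.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input modality (visual or semantic) to the latent space."""
    def __init__(self, in_dim, latent_dim, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim))

    def forward(self, x):
        return self.net(x)          # z = Encoder(x), Eq. (1)

class Decoder(nn.Module):
    """Maps a latent vector back to one of the input modalities."""
    def __init__(self, latent_dim, out_dim, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim))

    def forward(self, z):
        return self.net(z)          # x~ = Decoder(z), Eq. (2)

class Discriminator(nn.Module):
    """Scores whether z is a plausible latent code for input x, Eq. (3)."""
    def __init__(self, latent_dim, in_dim, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + in_dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, z, x):
        return self.net(torch.cat([z, x], dim=-1))

# Example instantiation: 2048-d ResNet features, 85-d attributes, 64-d latent.
enc_x, enc_a = Encoder(2048, 64), Encoder(85, 64)
dec_x, dec_a = Decoder(64, 2048), Decoder(64, 85)
```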
CADA-VAE: This model is a special type of variational autoencoder (VAE) for GZSL [4]. In this approach, the VAE aims to learn the latent space with cross-alignment and distribution-alignment losses, as depicted in Fig. 2. The overall loss by Schonfeld et al. [4] can be described with

$$\mathcal{L} = \mathcal{L}_{VAE} + \gamma \Big( \sum_{i}^{L} \sum_{j \neq i}^{L} \| \mathbf{x}^{(j)} - \tilde{\mathbf{x}}^{(i)} \| \Big) + \delta \Big( \| \boldsymbol{\mu}^{(j)} - \boldsymbol{\mu}^{(i)} \| + \| \boldsymbol{\Sigma}^{(j)} - \boldsymbol{\Sigma}^{(i)} \|_{\text{Frobenius}} \Big), \quad (4)$$

where the first term represents the VAE loss [4], and the second term denotes the reconstruction error between the $L$ modalities – that is, during training, the encoder projects input samples into the latent space (e.g. $\text{Encoder}_x$ for $\mathbf{x}$), then the decoder of a different modality is used (e.g. $\text{Decoder}_a$ from $\mathbf{z}_x$ – see Fig. 2), which constrains the visual and semantic projections to be in the same region of the latent space, represented by the mean $\boldsymbol{\mu}$ and variance $\boldsymbol{\Sigma}$ of the samples produced by the encoder [4].

Figure 2: Depiction of the method CADA-VAE [4]. In this method, encoders for the visual and semantic representations project samples into a shared latent space.
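As an illustration of the alignment terms in (4), the sketch below computes the cross-alignment and distribution-alignment losses for the two-modality case (visual and semantic), assuming diagonal-Gaussian VAE encoders that output a mean and standard deviation per modality. Function and variable names are our own assumptions; the authors' implementation may differ.

```python
# Hedged sketch of the alignment terms in Eq. (4) for two modalities
# (visual x, semantic a). Assumes diagonal-Gaussian VAE encoders.
import torch

def cada_vae_alignment(x, a, mu_x, sig_x, mu_a, sig_a,
                       dec_x, dec_a, gamma=1.0, delta=1.0):
    # Reparameterised latent codes from each modality's encoder.
    z_x = mu_x + sig_x * torch.randn_like(sig_x)
    z_a = mu_a + sig_a * torch.randn_like(sig_a)

    # Cross-alignment: decode each modality from the *other* modality's code.
    cross = (torch.norm(x - dec_x(z_a), dim=-1).mean() +
             torch.norm(a - dec_a(z_x), dim=-1).mean())

    # Distribution alignment: pull the two latent Gaussians together
    # (a diagonal stand-in for the distance used in [4]).
    dist = ((mu_x - mu_a).pow(2).sum(-1) +
            (sig_x - sig_a).pow(2).sum(-1)).sqrt().mean()

    # Added to the per-modality VAE losses to form Eq. (4).
    return gamma * cross + delta * dist
```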
cycle-WGAN: Fig. 3 depicts the model cycle-WGAN [3]. This model is optimised as a Generative Adversarial Network (GAN), regularised by a cycle-consistency term, described with

$$\mathcal{L} = \mathcal{L}_{WGAN} + \gamma \big( \| \mathbf{a} - \tilde{\mathbf{a}} \| \big), \quad (5)$$

where the first term, $\mathcal{L}_{WGAN}$, represents a Wasserstein Generative Adversarial loss (WGAN [3]), and the second term denotes the reconstruction loss (cycle) for the semantic representation. Thus, the generative projection of a given semantic representation into the latent space is encouraged to be back-projected near the original semantic representation.
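Analogously, a hedged sketch of the cycle-WGAN objective in (5) is shown below: a Wasserstein critic loss over visual features plus the semantic cycle-consistency term. The generator G, critic D and regressor R are hypothetical stand-ins for the modules in Fig. 3, and the gradient penalty of the WGAN formulation is omitted for brevity.

```python
# Hypothetical sketch of the cycle-WGAN objective in Eq. (5).
# All module names (G, D, R) are illustrative assumptions.
import torch

def cycle_wgan_losses(a, x_real, G, D, R, gamma=1.0):
    """
    a:       [B, L] semantic vectors
    x_real:  [B, K] real visual features
    G:       generator mapping (a, noise) -> generated visual feature
    D:       critic scoring visual features (real vs generated)
    R:       regressor mapping visual features back to semantics
    """
    noise = torch.randn(a.size(0), a.size(1))
    x_fake = G(torch.cat([a, noise], dim=-1))

    # Wasserstein critic loss (gradient penalty omitted for brevity).
    loss_critic = D(x_fake).mean() - D(x_real).mean()

    # Cycle term: the generated feature should map back to its semantics.
    loss_cycle = torch.norm(a - R(x_fake), dim=-1).mean()

    # Generator objective: fool the critic while respecting the cycle.
    loss_gen = -D(x_fake).mean() + gamma * loss_cycle
    return loss_critic, loss_gen
```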
Figure 3: Depiction of the cycle-WGAN method [3]. This method encodes the semantic space into a latent visual space. The decoder produces semantic vectors that are used to regularise the learning process.

From the previous section, we note that the latent space is an embedding space for visual and semantic samples. Therefore, we can use this latent space to learn a discriminative model given by

$$f(y|\mathbf{x}) = \int_v \int_{\mathbf{z}} p(\mathbf{z}|\mathbf{x}) f(y, v|\mathbf{z}) \, dv \, d\mathbf{z}, \quad (6)$$

where the function $f(.)$ represents the GZSL classifier and can be described in terms of domains, $v \in \{s, u\}$ ($s$ = seen and $u$ = unseen), with

$$f(y|\mathbf{x}) = \sum_{v \in \{s,u\}} p(y|\mathbf{z}_x) f(v|\mathbf{z}_x, y), \quad (7)$$

where we assume from (6) that $p(\mathbf{z}|\mathbf{x})$ is a delta function at $\mathbf{z}_x = \text{Encoder}_x(\mathbf{x})$. The term $p(y|\mathbf{z}_x)$ in (7) is represented by a simple deep learning classifier with softmax activation. We define the function $f(.)$ in (7) by

$$f(v|\mathbf{z}_x, y) = \begin{cases} p(v|\mathbf{z}_x, y), & \text{if } v, y \text{ are in the same domain} \\ 0, & \text{otherwise,} \end{cases} \quad (8)$$

where "same domain" means the domain of the seen or the unseen classes, and $p(v|\mathbf{z}_x, y)$ is denoted by a deep learning classifier with softmax activation. The function in (8) represents our proposed domain classifier (DC). During the DC training, for training samples of the seen domain, we optimise $p(v = s|\mathbf{z}, y)$ with samples drawn from the latent space. These samples are acquired from visual and semantic representations projected in the latent space. For the unseen domain, $p(v = u|\mathbf{z}, y)$, we use the semantic projections in the latent space.
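The combination rule in (7)-(8) reduces to gating the class posterior with the domain posterior, as the following sketch shows. The tensor names, and the assumption that both classifiers have already been run on the latent code and output logits, are ours.

```python
# A minimal sketch of the final classification rule in Eqs. (7)-(8):
# the class posterior p(y | z_x) is gated by the domain posterior.
import torch

def gzsl_predict(class_logits, domain_logits, seen_mask):
    """
    class_logits:  [B, C] logits of the visual classifier p(y | z_x)
    domain_logits: [B, 2] logits of the domain classifier (seen, unseen)
    seen_mask:     [C] boolean, True where class y belongs to the seen domain
    """
    p_y = torch.softmax(class_logits, dim=-1)   # p(y | z_x) in Eq. (7)
    p_v = torch.softmax(domain_logits, dim=-1)  # p(v | z_x) for v in {s, u}

    # Eq. (8): keep p(v | z_x, y) only when v and y are in the same domain.
    gate = torch.where(seen_mask, p_v[:, 0:1], p_v[:, 1:2])  # -> [B, C]

    # Eq. (7): the sum over v collapses to the matching-domain term.
    return (p_y * gate).argmax(dim=-1)
```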
4. Experiments
In this section, we present the benchmark data sets, as well as the evaluation criteria for our experimental setup. We then show the results of our method and compare them with the current state of the art. Finally, we provide ablation studies to explore our method.

Table 1: The benchmarks for GZSL: CUB [8], SUN [18], AWA1 [1], and AWA2 [1]. Column (1) shows the number of seen classes, denoted by |Y_S|, split into the number of training and validation classes (train+val), (2) presents the number of unseen classes |Y_U|, (3) displays the number of samples available for training |D^Tr| and (4) shows the number of testing samples that belong to the unseen classes |D^Te_U| and the number of testing samples that belong to the seen classes |D^Te_S|, from [3], [11].

Name   |Y_S| (train+val)   |Y_U|   |D^Tr|   |D^Te_U| + |D^Te_S|
CUB    150 (100+50)        50      7057     1764+2967
SUN    645 (580+65)        72      14340    2580+1440
AWA1   40 (27+13)          10      19832    4958+5685
AWA2   40 (27+13)          10      23527    5882+7913
We assess our method on four publicly available benchmark GZSL data sets: CUB-200-2011 [8], SUN [1], AWA1 [7], [1], and AWA2 [7], [1]. To guarantee that our experiments are reproducible, we use the GZSL experimental setup described by Xian et al. [1]. As the CUB data set is generally regarded as fine-grained, there is an intrinsic expectation that the novel unseen classes tend to have their class modes close to the seen classes. Such a dense visual representation space is therefore a challenging problem for GZSL approaches. We also explore the use of coarse data sets, such as AWA1, AWA2, and SUN. Given the diversity of classes in such coarse data sets, there is an intrinsic expectation that novel classes will be projected far away from the samples of seen classes in the latent space, making the domain classification a trivial task. However, we argue that this statement does not always hold, particularly for classes that are visually similar (e.g. zebra/horse, whale/dolphin, leopard/bobcat), as depicted in Fig. 4. Table 1 contains some basic information about the data sets in terms of the number of seen and unseen classes and the number of training and testing images.

Figure 4: Example of two classes that are visually similar from the benchmark data set AWA1 [1]. (A) The sample leopard belongs to the seen classes, and (B) the sample bobcat belongs to the unseen classes. We speculate that samples from these two classes will lie close to each other in the latent space even though they come from different domains, challenging the view that samples from new unseen classes will lie far from samples of the seen classes in the latent space.

We represent the visual space by extracting image features from the activation of the 2048-dimensional top pooling layer of ResNet-101 [14]. For the semantic representation of the data set CUB-200-2011 [1], we use the 1024-dimensional vector produced by CNN-RNN [19]. These semantic samples represent a written description of each image using 10 sentences per image. To define a unique semantic sample per class, we average the semantic samples of all images belonging to each class [1]. We use manually annotated semantic samples containing 102 dimensions for SUN [1], and 85 dimensions for AWA1 [1] and AWA2 [1]. To prevent a violation of the ZSL constraints, where the test classes should not be accessed during training, all the features were extracted according to the training splits proposed in [1].
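As a concrete illustration of this feature extraction step, the sketch below pulls the 2048-dimensional pooling activations from a torchvision ResNet-101. The preprocessing shown is the standard ImageNet pipeline; the exact preprocessing used by [1] may differ.

```python
# Hedged sketch of visual feature extraction: activations of the 2048-d
# top pooling layer of an ImageNet-pretrained ResNet-101.
import torch
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet101(pretrained=True).eval()
# Drop the final classification layer, keeping the global-pooling output.
feature_net = torch.nn.Sequential(*list(resnet.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])])

def extract_features(image_path):
    img = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        return feature_net(img).flatten(1)   # [1, 2048] visual feature x
```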
Xian et al. [1] formalised the current evaluation protocol for GZSL. We first compute the average per-class top-1 accuracy, measured independently for each class, then we calculate the overall mean. We calculate the mean-class accuracy for each domain separately, i.e., for the seen ($\mathcal{Y}_S$) and the unseen ($\mathcal{Y}_U$) classes. Then, we also compute the harmonic mean (H-mean) of the seen and unseen domain accuracies [1]. Furthermore, we show results by measuring the area under the seen and unseen curve (AUSUC) [9], obtained by varying the domain expertise [9]. This domain expertise consists of a hyper-parameter that trades off the performance on the seen and unseen classes [9].
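For clarity, the per-class top-1 accuracy and the harmonic mean described above can be computed as in the following short sketch (function names are illustrative).

```python
# Illustrative implementations of the evaluation metrics: per-class top-1
# accuracy averaged over classes, computed separately per domain, and the
# harmonic mean (H-mean) of the two.
import numpy as np

def per_class_accuracy(y_true, y_pred, class_ids):
    """Mean of the top-1 accuracies computed independently per class."""
    accs = [np.mean(y_pred[y_true == c] == c)
            for c in class_ids if np.any(y_true == c)]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """H-mean of the seen and unseen per-class accuracies."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```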
In this section, we describe the architecture and training procedures for learning the proposed latent space. As described in Sec. 3, we extend the following two models for our experimental setup: CADA-VAE [4] and cycle-WGAN [3]. The model CADA-VAE contains the following components that are parameterised as neural networks: $\text{Encoder}_x(.)$ and $\text{Encoder}_a(.)$ in (1), and $\text{Decoder}_x(.)$ and $\text{Decoder}_a(.)$ in (2). The training of CADA-VAE aims to produce a latent space that satisfies (4). In terms of the model architecture and hyper-parameters (e.g. the number of epochs, batch size, the number of layers, learning rate, and weight decay), we followed the specifications provided by [4]. The encoder for the semantic representation is parameterised with 1450 hidden neurons, and the encoder for the visual representation and the two decoders follow the layer sizes specified in [4]. For both modalities, the encoders project samples into a latent space represented by 64-dimensional vectors. The model is optimised with Adam for 100 epochs [20]. We use an adaptive scheduling rate for the hyper-parameters $\gamma$ and $\delta$, following [4].

We also extended cycle-WGAN [3], as explained in Sec. 3. The model cycle-WGAN contains the following functions that are parameterised as neural networks: $\text{Encoder}_a(.)$ in (1), $\text{Decoder}_a(.)$ in (2), and $\text{Discriminator}(.)$ in (3). We followed the hyper-parameter choices (e.g. number of epochs, batch size, number of layers, learning rate, weight decay, and learning rate decay) defined in [3]. The encoder is parameterised with a single hidden layer containing 4096 nodes with LeakyReLU activation [21], and the output layer, with 2048 nodes, has a ReLU activation [22]. The decoder is parameterised with a linear layer, and the discriminator is a network with a single hidden layer with 4096 nodes. The network has a LeakyReLU activation, and the output layer has no activation.

The domain classifier (DC) is implemented as a neural network with binary output, representing the seen and unseen domains.² The model is trained with the Adam optimiser [20] to recognise the domains. The output probability of the domain classifier tends not to be well calibrated [23], [5]. Therefore, we calibrate the model output using the validation set [23], [1]. Then, the domain classification is performed as described in (7) [9].

² The code will be available upon acceptance, and we intend to add the link to the GitHub repository here.
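The paper calibrates the DC output on the validation set [23], [1] without detailing the procedure here. As an illustrative stand-in, the sketch below fits a single temperature on held-out validation logits (temperature scaling), a common post-hoc calibration method; the authors' exact procedure may differ.

```python
# Illustrative post-hoc calibration: fit a temperature T by minimising
# the negative log-likelihood on validation logits (temperature scaling).
# This is an assumed stand-in for the calibration cited as [23], [1].
import torch

def fit_temperature(val_logits, val_labels, iters=200, lr=0.01):
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    nll = torch.nn.CrossEntropyLoss()
    for _ in range(iters):
        opt.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Usage: divide test-time domain logits by T before the softmax in Eq. (7).
```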
In this section, we present the results for our proposed approach. The first question to be answered in this paper is whether the proposed latent space contains relevant information that enables our approach to learn the domain classifier for GZSL. Thus, we provide numerical evidence that our method outperforms both baselines (i.e., CADA-VAE and cycle-WGAN) and previous GZSL methods. In Table 2, we show the results in terms of unseen class accuracy $\mathcal{Y}_U$, seen class accuracy $\mathcal{Y}_S$ and harmonic mean $H$, as described in Sec. 4.2. These results are given for the data sets CUB, SUN, AWA1 and AWA2. We compare our approach with 12 leading GZSL methods, which are divided into three groups: semantic (SJE [24], ALE [25], LATEM [26], ESZSL [27], SYNC [12], DEVISE [2]), latent space learning (SAE [15], f-CLSWGAN [11], cycle-WGAN [3] and CADA-VAE [4]) and domain classification (CMT [6] and DAZSL [5]). The semantic group contains methods that only use the seen class visual and semantic samples to learn a transformation function from the visual to the semantic space, with classification based on nearest neighbour search in that semantic space. The latent space learning group relies on visual samples from seen classes and semantic samples from seen and unseen classes during training, as detailed in Sec. 3. The domain classification group relies on methods that weight the classification of seen and unseen classes. We discuss the numerical results in Table 2 in Section 5.

In Table 3, we report the area under the curve of seen and unseen accuracy (AUSUC) [9] for the benchmark data sets CUB, SUN, AWA1, and AWA2. We compare the results of the original CADA-VAE [4] and cycle-WGAN [3] with and without the DC. Similar to the harmonic mean, the AUSUC is an evaluation metric that measures the trade-off between the seen and unseen domains.

Table 2: GZSL results using per-class average top-1 accuracy on the test sets of unseen classes Y_U, seen classes Y_S, and the H-mean result H; and ZSL results on the unseen classes exclusively – all results shown in percentage. The results from previously proposed methods in the field were extracted from [1]. The highlighted values represent the best ones in each column. The methods below the double horizontal line represent the ones that use the semantic vectors from unseen classes during training. The notation * represents the results that we reproduced, and results represented with − were not available in the literature, or hyper-parameters were not given.

                          CUB              SUN              AWA1             AWA2
Classifier                Y_S   Y_U   H    Y_S   Y_U   H    Y_S   Y_U   H    Y_S   Y_U   H
Semantic approach
SJE [24]                  .     .     .    .     .     .    .     .     .    .     .     .
ALE [25]                  .     .     .    .     .     .    .     .     .    .     .     .
LATEM [26]                .     .     .    .     .     .    .     .     .    .     .     .
ESZSL [27]                .     .     .    .     .     .    .     .     .    .     .     .
SYNC [12]                 .     .     .    .     .     .    .     .     .    .     .     .
DEVISE [2]                .     .     .    .     .     .    .     .     .    .     .     .
Generative approach
SAE [15]                  .     .     .    .     .     .    .     .     .    .     .     .
f-CLSWGAN [11]            .     .     .    .     .     .    .     .     .    .     .     .
cycle-WGAN [3]            .     .     .    .     .     .    .     .     .    −     −     −
CADA-VAE [4]              .     .     .    .     .     .    .     .     .    .     .     .
CADA-VAE [4] *            .     .     .    .     .     .    .     .     .    .     .     .
Domain Classification
CMT [6]                   .     .     .    .     .     .    .     .     .    .     .     .
DAZSL [5]                 .     .     .    .     .     .    .     .     .    −     −     −
cycle-WGAN + DC (ours)    .     .     .    .     .     .    .     .     .    −     −     −
CADA-VAE + DC (ours)      .     .     .    .     .     .    .     .     .    .     .     .

Table 3: Area under the curve of seen and unseen accuracy (AUSUC). The highlighted values per column represent the best results in each data set. The notation * represents the results that we reproduced.

Classifier          CUB    SUN    AWA1   AWA2
ESZSL [27]          .      .      .      −
DAZSL [5]           .      .      .      −
f-CLSWGAN [11]      .      .      .      −
cycle-WGAN [3] *    .      .      .      −
CADA-VAE [4] *      .      .      .      .
cycle-WGAN + DC     .      .      .      −
CADA-VAE + DC       .      .      .      .
5. Discussions
In this section, we discuss the main contributions presented by our approach. We performed our experiments by combining previous GZSL approaches (such as CADA-VAE [4] and cycle-WGAN [3]) with our domain classification in order to enhance the balancing of the seen and unseen domains for GZSL.

Firstly, in Table 2 we provide quantitative evidence that our method outperforms existing methods in terms of unseen accuracy, $\mathcal{Y}_U$. This demonstrates that, by learning to classify the domain for each sample, our method improves the classification of the unseen classes. Specifically, for the CUB, SUN, AWA1 and AWA2 data sets, the baseline unseen classification results of . %, . %, . % and . % have become . %, . %, . % and . %. This improvement was achieved given a minor trade-off with the seen classes.

Secondly, despite the trade-off mentioned above, our approach is still able to achieve minor improvements in terms of H-mean. Table 2 shows an improvement of . %, . %, . % and . % when compared to the baseline CADA-VAE. Although these results can be considered minor, we argue that our model does not directly optimise the H-mean. Thus, this improvement indicates that our approach has a more balanced performance than previous models.

We note similar behaviour for the cycle-WGAN model [3], where the proposed method achieves improvements in H-mean from . % to . % for CUB, from . % to . % for SUN, and from . % to . % for AWA1. However, such improvement is achieved due to a positive trade-off towards the seen domain. We argue that this difference, when compared to CADA-VAE, is due to the inherent differences in the latent space learning of each of the approaches. In fact, the approach CADA-VAE is directly optimised by a variational autoencoder, where the control of the latent space is guided by a divergence measure for the visual and semantic representations jointly. On the other hand, the cycle-WGAN model is directly optimised by an adversarial loss from a generative adversarial network conditioned mainly on the semantic representation.

In terms of AUSUC, the proposed approach achieves improvements for both cycle-WGAN [3] and CADA-VAE [4]. For CADA-VAE, the domain classification yielded improvements from . , . , . , . to . , . , . , . for CUB, SUN, AWA1 and AWA2, respectively. Likewise, for cycle-WGAN [3], the DC provided improvements from . , . to . , . for CUB and AWA1, respectively.
6. Conclusion and Future Work
In this paper, we introduced a principled method to classify the seen and unseen domains in GZSL. In particular, we presented a domain classifier that learns directly from the latent space of visual and semantic information. We demonstrated that our proposed approach can be combined with previous latent space learning models, such as CADA-VAE and cycle-WGAN. Our approach yielded improvements for each one of those models by automatically balancing the seen and unseen domains in benchmark experiments on four available data sets: CUB, SUN, AWA1, and AWA2.

Our experimental results show that our proposed approach has achieved state-of-the-art H-mean results for CUB, AWA1 and AWA2, and state-of-the-art unseen accuracy for CUB, SUN, AWA1, and AWA2. In particular, our results are substantially better than the state of the art on CUB and SUN, which contain a large number of classes. On AWA1 and AWA2, which are smaller data sets, our results are marginally better. Furthermore, our model produces substantial improvements in terms of AUSUC results for CUB and AWA1, and is marginally better on AWA2.

As stated previously, our domain classification learns to discriminate between samples from the seen and unseen domains. We observe that the improvements for CADA-VAE and cycle-WGAN are different: the CADA-VAE model tends to improve in terms of the unseen domain when the DC is applied, whereas cycle-WGAN tends to improve in terms of the seen domain. On one hand, we note that the training strategies for the two models follow different guidelines, VAE and GAN. On the other hand, our model does not impose direct constraints in order to optimise GZSL metrics, such as accuracy or H-mean. In fact, we believe that these aspects are the main factors behind the contrasting outcomes for the CADA-VAE and cycle-WGAN models. With that in mind, we believe that the differences between these two data augmentation approaches should be studied in future generalised zero-shot learning research.

In the future, we intend to further study the reasons behind the performance differences observed between the data sets. Moreover, we also plan to develop a more extensive framework that can incorporate domain classification for approaches that do not rely on latent space learning.
Acknowledgments
Supported by the Australian Research Council through grants DP180103232, CE140100016 and FL130100102. We would like to acknowledge the donation of a Titan Xp by Nvidia.
References

[1] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, "Zero-shot learning - A comprehensive evaluation of the good, the bad and the ugly," CoRR, vol. abs/1707.00600, 2017.
[2] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., "DeViSE: A deep visual-semantic embedding model," in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
[3] R. Felix, B. V. Kumar, I. Reid, and G. Carneiro, "Multi-modal cycle-consistent generalized zero-shot learning," in European Conference on Computer Vision. Springer, 2018, pp. 21–37.
[4] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata, "Generalized zero- and few-shot learning via aligned variational autoencoders," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8247–8255.
[5] Y. Atzmon and G. Chechik, "Adaptive confidence smoothing for generalized zero-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11671–11680.
[6] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, "Zero-shot learning through cross-modal transfer," in Advances in Neural Information Processing Systems, 2013, pp. 935–943.
[7] C. H. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009, pp. 951–958.
[8] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, "Caltech-UCSD Birds 200," 2010.
[9] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha, "An empirical study and analysis of generalized zero-shot learning for object recognition in the wild," in European Conference on Computer Vision. Springer, 2016, pp. 52–68.
[10] C. H. Lampert, H. Nickisch, and S. Harmeling, "Attribute-based classification for zero-shot visual object categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2014.
[11] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata, "Feature generating networks for zero-shot learning," arXiv, 2017.
[12] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha, "Synthesized classifiers for zero-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5327–5336.
[13] V. K. Verma, G. Arora, A. Mishra, and P. Rai, "Generalized zero-shot learning via synthesized examples," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[15] E. Kodirov, T. Xiang, and S. Gong, "Semantic autoencoder for zero-shot learning," IEEE CVPR 2017, July 2017.
[16] A. Mishra, S. Krishna Reddy, A. Mittal, and H. A. Murthy, "A generative model for zero shot learning using conditional variational autoencoders," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2188–2196.
[17] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal, "A generative adversarial approach for zero-shot learning from noisy texts," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1004–1013.
[18] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3485–3492.
[19] S. Reed, Z. Akata, H. Lee, and B. Schiele, "Learning deep representations of fine-grained visual descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 49–58.
[20] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv, 2014.
[21] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
[22] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[23] Y. Gal, J. Hron, and A. Kendall, "Concrete dropout," in Advances in Neural Information Processing Systems, 2017, pp. 3581–3590.
[24] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, "Evaluation of output embeddings for fine-grained image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2927–2936.
[25] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid, "Label-embedding for image classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 7, pp. 1425–1438, 2016.
[26] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele, "Latent embeddings for zero-shot classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 69–77.
[27] B. Romera-Paredes and P. Torr, "An embarrassingly simple approach to zero-shot learning," in International Conference on Machine Learning, 2015, pp. 2152–2161.