Zero-shot Learning with Deep Neural Networks for Object Recognition
Le Cacheux, Yannick, Le Borgne, Hervé, and Crucianu, Michel
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
CEDRIC – CNAM, Paris, France
Abstract
Zero-shot learning deals with the ability to recognize objects without any visual training sample. To counterbalance this lack of visual data, each class to recognize is associated with a semantic prototype that reflects the essential features of the object. The general approach is to learn a mapping from visual data to semantic prototypes, then use it at inference to classify visual samples from the class prototypes only. Different settings of this general configuration can be considered depending on the use case of interest, in particular whether one only wants to classify objects that have not been employed to learn the mapping or whether one can use unlabelled visual examples to learn the mapping. This chapter presents a review of the approaches based on deep neural networks to tackle the ZSL problem. We highlight findings that had a large impact on the evolution of this domain and list its current challenges.
The core problem of supervised learning lies in the ability to generalize the prediction of a model learned on samples seen in the training set to other, unseen samples in the test set. A key hypothesis is that the samples of the training set allow a fair estimation of the distribution of the test set, since both result from the same independent and identically distributed random variables. Beyond the practical issues linked to the exhaustiveness of the training samples, such a paradigm is not adequate for all needs, nor does it reflect the way humans seem to learn and generalize. Although, to our knowledge, nobody has ever seen a real dragon, unicorn or any other beast of classical fantasy, one could easily recognize one of them if met: from the textual description of these creatures alone, and by inference from knowledge of real wildlife, the entertainment industry has produced many drawings and other visual representations of them.

Zero-shot learning (ZSL) addresses the problem of recognizing categories of the test set that are not present in the training set [LEB08, LNH09, PPHM09, FEHF09]. The categories used at training time are called seen and those at testing time are unseen, and contrary to classical supervised learning, no sample of the unseen categories is available during training. To compensate for this lack of information, each category is nevertheless described semantically, either with a list of attributes, a set of words or sentences in natural language. The general idea of ZSL is thus to learn some intermediate features from the training data that can be used at test time to map a sample to the unseen classes. These intermediate features can reflect colors or textures (fur, feathers, snow, sand...) or even parts of objects (paws, claws, eyes, ears, trunk, leaf...). Since such features are likely to be present in both seen and unseen categories, and one can expect to infer a discriminative description of more complex concepts from them (e.g. some types of animals, trees, flowers...), the problem becomes tractable.

∗ This is a preprint of the following chapter: Yannick Le Cacheux, Hervé Le Borgne, Michel Crucianu, Zero-shot Learning with Deep Neural Networks for Object Recognition, published in Multi-faceted Deep Learning: Models and Data, edited by Jenny Benois-Pineau, Akka Zemmari, 2021, Springer, reproduced with permission of Springer. The final authenticated version is available online at: http://dx.doi.org/

Figure 1: Illustration of a basic ZSL model, with two seen classes fox and zebra, and two unseen classes horse and tiger. Each class is represented by a 3-dimensional semantic prototype corresponding to attributes "has stripes", "is orange" and "has hooves". During the training phase, the model learns the relations between the visual features and the attributes using the seen classes. During the evaluation phase, the model estimates the attributes for each test image and predicts the unseen class having the closest prototype.

Formally, let us note the set of seen classes C_S and that of unseen classes C_U. The set of all classes is C = C_S ∪ C_U, with C_S ∩ C_U = ∅. For each class c ∈ C, semantic information is provided. It can consist of binary attributes, such as "has stripes", "is orange" and "has hooves". With this example, the semantic representation of the class tiger would be (1 1 0)^⊤, while the representation of class zebra would be (1 0 1)^⊤.
For a given class c, we write its corresponding semantic representation vector s_c; such a vector is also called the class prototype. The prototypes of all classes have the same dimension K and represent the same attributes. More generally, the semantic information does not have to consist of binary attributes, and may not correspond to attributes at all. More details on the most common types of prototypes are provided in Section 4.

For each class c, a set of images is available. One can extract a feature vector x_i ∈ R^D from an image, usually using a pre-trained deep neural network such as VGG [SZ14] or ResNet [HZRS16]. In the latter case, the feature vector of an image corresponds to the internal representation in the network after the last max-pooling layer, before the last fully-connected layer. It is of course also possible to train a deep network from scratch on the available training images. In the following, we will refer to an image, a sample of a class or its feature vector with this unique notation x.

During the training phase, the model only has access to the semantic representations of seen classes {s_c}_{c∈C_S} and to N images belonging to these classes. Hence, the training dataset is D_tr = ({(x_n, y_n)}_{n∈{1,...,N}}, {s_c}_{c∈C_S}), where y_n ∈ C_S is the label of the n-th training sample. During the testing phase, the model has access to the semantic representations of unseen classes {s_c'}_{c'∈C_U}, and to the N' unlabeled images belonging to unseen classes {x_n'}_{n'∈{1,...,N'}}. The objective for the model is to make a prediction ŷ_n' ∈ C_U for each test image x_n', assigning it to the most likely unseen class.

As a first basic example, a simple ZSL model may consist in predicting the attributes ŝ corresponding to an image x such that ŝ = Wx; the parameters W can be estimated on the training set D_tr using a least squares loss. To make predictions, we can simply predict the unseen class c whose attributes s_c are closest to the estimated attributes ŝ, as measured by the Euclidean distance. Such a model is illustrated in Fig. 1, and will be presented in more detail in Section 3.1.
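To make this pipeline concrete, below is a minimal NumPy sketch of such a baseline, assuming the visual features and attribute prototypes have already been extracted into arrays; the function and variable names are purely illustrative and do not come from a specific library.

```python
import numpy as np

def fit_linear_zsl(X_tr, y_tr, S_seen, lam=1.0):
    """Ridge-regularized least-squares mapping from visual features to attributes.

    X_tr:   (N, D) visual features of training samples
    y_tr:   (N,) integer labels indexing rows of S_seen
    S_seen: (C_seen, K) semantic prototypes of the seen classes
    Returns W of shape (K, D) such that s_hat = W @ x.
    """
    T = S_seen[y_tr]                       # (N, K) target prototype for each sample
    D = X_tr.shape[1]
    # closed-form ridge solution for W^T
    W_t = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(D), X_tr.T @ T)
    return W_t.T                           # (K, D)

def predict_zsl(W, X_te, S_unseen):
    """Assign each test sample to the unseen class with the closest prototype."""
    S_hat = X_te @ W.T                     # (N', K) predicted attributes
    dists = np.linalg.norm(S_hat[:, None, :] - S_unseen[None, :, :], axis=-1)
    return dists.argmin(axis=1)            # index into the unseen classes
```

The same two-step structure (learn a visual-to-semantic mapping on seen classes, then match test samples to unseen prototypes) underlies most of the regression methods of Section 3.1.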
More generally, most ZSL methods in the literature are based on a compatibility function f : R^D × R^K → R assigning a "compatibility" score f(x, s) to a pair composed of a visual sample x ∈ R^D and a semantic prototype s ∈ R^K, which reflects the likelihood that x belongs to class c (if s is s_c). This function may be parameterized by a vector w or a matrix W, or by a set of parameters {w_i}_i, leading to the notation f_w(x, s) or f(x, s; {w_i}_i) in the following. These parameters are generally learned by selecting a suitable loss function L and minimizing the total training loss L_tr over the training dataset D_tr with respect to the parameters w:

    L_tr(D_tr) = (1/N) ∑_{n=1}^{N} ∑_{c∈C_S} L[(x_n, y_n, s_c), f_w] + λ Ω[f_w]        (1)

where Ω[f] is a regularization penalty based on f and weighted by λ. Once the model is learned, the predicted label ŷ of a test image x can be selected among the candidate testing classes based on their semantic representations {s_c}_{c∈C_U}:

    ŷ = argmax_{c∈C_U} f_w(x, s_c)        (2)

In the standard setting of ZSL, the only data available during the training phase consists of the class prototypes of the seen classes and the corresponding labeled visual samples. The class prototypes of unseen classes, as well as the unlabeled instances from these unseen classes, are only provided during the testing phase, after the model has been trained. Moreover, the test samples for which we make predictions only belong to these unseen classes.

When class prototypes of both seen and unseen classes are available during the training phase, [WZYM19] considers it as a class-transductive setting, as opposed to the standard class-inductive setting, in which unseen class prototypes are only made available after the training of the model is completed. In a class-transductive setting, the prototypes of unseen classes can for example be leveraged by a generative model, which attempts to synthesize images of objects from unseen classes based on their semantic description (Section 3.3). They can also simply be used during training to ensure that the model does not misclassify a sample from a seen class as a sample from an unseen class. Access to this information as early as the training phase may be legitimate for some use cases, but new classes cannot be added as seamlessly as in a class-inductive setting, in which a new class can be introduced by simply providing its semantic representation (without any retraining).

A more permissive setting considers that unlabeled instances of unseen classes are available during training. Such a setting is called instance-transductive in [WZYM19], as opposed to the instance-inductive setting. These two settings are often simply referred to as respectively transductive and inductive, even though there is some ambiguity on whether the (instance-)inductive setting designates a class-inductive or a class-transductive setting. Some methods specifically take advantage of the availability of these unlabeled images, for example by extracting additional information on the geometry of the visual manifold [FYH+19] or a hierarchical structure [RSS11]. Others make use of information regarding the environment of the object, for example by detecting surrounding objects [ZBS+19] or by computing co-occurrence statistics using an additional multilabel dataset [MGS14]. Other methods consider that instead of a semantic representation per class, a semantic representation per image is available, for example in the form of text descriptions [RALS16] or human gaze information [KASB17].

Another classification of ZSL settings is concerned with which classes have to be recognized during the testing phase. Indeed, one may legitimately want to recognize both seen and unseen classes. The setting in which testing instances may belong to both seen and unseen classes is usually called generalized zero-shot learning (GZSL) and was introduced by [CCGS16b]. Approaches to extend ZSL to GZSL can be divided into roughly two categories: (1) approaches which explicitly try to identify when a sample does not belong to a seen class, and use either a standard classifier or a ZSL method depending on the result, and (2) approaches that employ a unified framework for both seen and unseen classes.

In [SGMN13], the authors explicitly estimate the probability g_u(x) = P(y ∈ C_U | x) that a test instance x belongs to an unseen class c ∈ C_U. They first estimate the class-conditional probability density p(x | c) for all seen classes c ∈ C_S, by assuming the projections ŝ(x) of visual features in the semantic space are normally distributed around the semantic prototype s_c. We can then consider that an instance x does not belong to a seen class if its class-conditional probability is below a threshold γ for all seen classes:

    g_u(x) = [∀c ∈ C_S, p(x | c) < γ]        (3)

If one sees the compatibility f(x, s_c) as the probability that the label of visual instance x is c, i.e. P(y = c | x) ∝ f(x, s_c), the compatibilities of seen and unseen classes can be weighted by the estimated probabilities that x belongs to a seen or unseen class.

Most recent GZSL methods [VAMR18, XLSA18b, CCGS20] adopt a more direct approach: the unweighted compatibility function f is used to directly estimate compatibilities of seen and unseen classes, so that we simply have

    ŷ = argmax_{c∈C_S∪C_U} f(x, s_c)        (4)

This approach has the advantage that using a trained ZSL model in a GZSL setting is straightforward, as all there is to do is add the seen class prototypes to the list of prototypes whose compatibility with x needs to be evaluated. However, it has been empirically demonstrated [CCGS16b, XSA17] that many ZSL models suffer from a bias towards seen classes. With the example of Fig. 1, many models would thus tend to consider zebras as "weird" horses rather than members of a new, unseen class. To address this problem, a straightforward solution consists in penalizing seen classes to the benefit of unseen classes by decreasing the compatibility of the former by a constant value γ, similarly to Equation (5). In [LCLBC19a], a simple method was put forward to select a suitable value of γ based on a training-validation-testing split specific to GZSL, which enabled a slight reduction in the accuracy on seen classes to result in a large improvement of the accuracy on unseen classes, thus significantly improving the GZSL score of any model.
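As a hedged illustration of the first family of GZSL approaches (novelty gating in the spirit of Eqs. 3–4), the sketch below models seen classes with isotropic Gaussians around their prototypes and falls back to a ZSL nearest-prototype step when no seen class explains the sample; the Gaussian assumption, the threshold value and all names are illustrative, not the exact procedure of [SGMN13].

```python
import numpy as np

def gzsl_predict_gated(s_hat, S_seen, S_unseen, seen_scores, gamma=1e-3, sigma=1.0):
    """Two-stage GZSL prediction sketch: gate between a seen-class classifier and a ZSL step.

    s_hat:       (K,) semantic projection of a test image
    S_seen:      (C_s, K) seen prototypes; S_unseen: (C_u, K) unseen prototypes
    seen_scores: (C_s,) scores of a standard supervised classifier for this image
    gamma:       density threshold below which the sample is deemed 'unseen' (cf. Eq. 3)
    """
    K = S_seen.shape[1]
    # isotropic Gaussian density of s_hat around each seen-class prototype
    sq_dists = ((S_seen - s_hat) ** 2).sum(axis=1)
    densities = np.exp(-0.5 * sq_dists / sigma**2) / ((2 * np.pi * sigma**2) ** (K / 2))
    if np.all(densities < gamma):                      # no seen class explains the sample
        dists = np.linalg.norm(S_unseen - s_hat, axis=1)
        return ("unseen", int(dists.argmin()))         # nearest unseen prototype
    return ("seen", int(seen_scores.argmax()))         # fall back to the supervised classifier
```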
Other, even less restrictive tasks may be considered during the testing or application phase. For instance, one may want a model able to answer that a visual instance matches neither a seen nor an unseen class. Or one may aim to recognize entities that belong to several non-exclusive categories, a setting known as multilabel ZSL [MGS14, FYH+15, LFYFW18]. Other works are interested in the ZSL setting applied to other tasks such as object detection [BSS+18] or semantic segmentation [XCH+].

Most of the (G)ZSL works to date address a classification task on mutually exclusive classes, thus the performance is evaluated with a classification rate. The standard accuracy nevertheless computes the score per sample (micro-average accuracy). Although many publicly available ZSL datasets [WBW+11, LNH14] have well-balanced classes, other datasets or use cases do not necessarily exhibit this property.
[XSA17] therefore proposed to compute the score per class (macro-average accuracy), and most recent works have adopted this metric.

For GZSL, the performance measure is a more subtle issue. Of course, using y_n ∈ C_S ∪ C_U for each of the N testing instances, the micro- and macro-average accuracy can still be employed. However, this does not always provide the full picture regarding the performance of a (G)ZSL model: assuming per-class accuracy is used and 80% of classes are seen classes, a perfect supervised model could achieve 80% accuracy with absolutely no ZSL abilities. This is all the more important as many GZSL models suffer from a bias towards seen classes, as mentioned previously.

To take the trade-off between seen and unseen classes into account, performance is often measured separately on each type of classes. Chao et al. [CCGS16b] defined A_{U→U} as the (per-class) accuracy evaluated only on test instances of unseen classes when the candidate classes are the unseen classes C_U. Also, A_{U→C} is the accuracy evaluated on test instances of unseen classes when the candidate classes are all classes C, seen and unseen. A_{S→S} and A_{S→C} are defined correspondingly. Before the GZSL setting, test classes were all unseen classes, so the (per-class) accuracy was A_{U→U}. A_{S→S} corresponds to what is measured in a standard supervised learning setting. A_{C→C} would correspond to the standard per-class accuracy in a GZSL setting. A_{S→C} and A_{U→C} respectively measure how well a GZSL model is performing on seen and on unseen classes. [XSA17] proposes to use the harmonic mean of the two as a trade-off, to penalize models with a high score in one of these two sub-tasks but low performance in the other. This measure is the most commonly employed in the recent GZSL literature [CZX+18, VAMR18, LCLBC19b, MYX+]. In some cases, only A_{U→C} is evaluated [HAT19] in order to still provide some measure of GZSL performance. Alternatively, [CCGS16b] introduced calibrated stacking, where a weight γ is used as a trade-off between favoring A_{U→C} (when γ > 0) and A_{S→C} (when γ < 0):

    ŷ = argmax_{c∈C} f(x, s_c) − γ [c ∈ C_S]        (5)

[CCGS16b] defined the Area Under Seen-Unseen accuracy Curve (AUSUC) as the area under the curve of the plot with A_{U→C} on the x-axis and A_{S→C} on the y-axis, when γ goes from −∞ to +∞. Similarly to the area under a receiver operating characteristic curve, the AUSUC can be used as a metric to evaluate the performance of a GZSL model.
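The sketch below illustrates these evaluation conventions: per-class accuracy, the harmonic mean, and a calibrated-stacking sweep of γ (Eq. 5) whose resulting curve approximates the AUSUC. Array shapes and function names are assumptions made for illustration.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, classes):
    """Macro-averaged (per-class) accuracy, as advocated by [XSA17]."""
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes if np.any(y_true == c)]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of A_S->C and A_U->C, the usual GZSL summary score."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen + 1e-12)

def calibrated_stacking_curve(compat, y_true, seen_mask, gammas):
    """Sweep the calibration constant gamma of Eq. (5) and return (A_U->C, A_S->C) pairs.

    compat:    (N, C) compatibility scores for all (test sample, class) pairs
    seen_mask: (C,) boolean array, True for seen classes
    """
    points = []
    for g in gammas:
        scores = compat - g * seen_mask[None, :]          # penalize seen classes
        y_pred = scores.argmax(axis=1)
        a_u = per_class_accuracy(y_true, y_pred, np.where(~seen_mask)[0])
        a_s = per_class_accuracy(y_true, y_pred, np.where(seen_mask)[0])
        points.append((a_u, a_s))
    return points   # the area under this curve approximates the AUSUC
```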
We briefly describe a few datasets commonly used to benchmark ZSL models, provide the rough accuracy obtained on these datasets by mid-2020 and mention a few common biases to avoid when measuring ZSL accuracy on such benchmarks. Some examples of typical images from these datasets are shown in Fig. 2. The dataset list is by no means exhaustive, as many other ZSL evaluation datasets can be found in the literature.

Figure 2: Examples of typical images from the AwA2 (top), CUB (middle) and ImageNet (bottom) datasets.
Animals with Attributes or AwA [LNH14] is one of the first proposed benchmarks for ZSL [LNH09]; it has recently been replaced by the very similar AwA2 [XLSA18a] due to copyright issues on some images. It consists of 37322 images of 50 animal species such as antelope, grizzly bear or dolphin, 10 of which are used as unseen test classes, the rest being seen training classes. Class prototypes have 85 binary attributes such as brown, stripes, hairless or claws. As mentioned in Sec. 2.1, visual features are typically extracted from images using a deep network such as ResNet pre-trained on a generic dataset like ImageNet. As evidenced in [XSA17], this can induce an important bias on the AwA2 dataset. Indeed, 6 of the 10 unseen test classes are among the 1000 classes of ImageNet used to train the ResNet model; thus, such classes cannot be considered as truly "unseen". In [XSA17] it is therefore proposed to employ a different train/test split, called the proposed split, such that no unseen (test) class is present among the 1000 ResNet training classes. This setting has been widely adopted by the ZSL community. Recent ZSL models in a standard ZSL setting can reach an accuracy of around 71% [XSSA19] on the 10 test classes of this proposed split.
Caltech UCSD Birds 200-2011 or CUB [WBW+11] is referred to as a "fine-grained" dataset, as its 200 classes all correspond to bird species (black footed albatross, rusty blackbird, eastern towhee...) and are considered to be fairly similar (Fig. 2). Fifty classes are used as unseen testing classes; similarly to AwA2, the standard train/test split was proposed in [XSA17]. The class prototypes consist of 312 usually continuous attributes with values between 0 and 1. Examples of attributes include "has crown color blue", "has nape color white" or "has bill shape cone". Recent models can reach a ZSL accuracy of around 64% [LCLBC19b] on the 50 test classes.
The ImageNet [DDS+09] dataset has also been used as a large-scale ZSL benchmark [RSS11]. Contrary to AwA or CUB, the usual semantic prototypes do not consist of attributes but rather of word embeddings of the class names – more details are provided in Sec. 4. This dataset contains classes as diverse as coyote, goldfish, lipstick or speedboat. The training classes usually consist of the 1000 classes of the ILSVRC challenge [RDS+15].
There exist several surveys of the ZSL literature, each with its own classification of existing approaches [XLSA18a, FXJ+18, WZYM19]. Here we separate the state of the art into three main categories: regression methods (Section 3.1), ranking methods (Section 3.2) and generative methods (Section 3.3). We start by presenting the simplest methods, which can be considered as baselines. The methods described below sometimes differ slightly from their initial formulation in the original articles, for the sake of brevity and simplicity. We aim at giving a general overview of the strengths, weaknesses and underlying hypotheses of these types of methods, not at diving deep into specific implementation details. Also, we mainly address the GZSL and standard ZSL settings, since they are the most easily applicable to real use cases.
The Direct Attribute Prediction or DAP [LNH09] approach consists in training K standard classifiers which provide the probability P(a_k | x) that attribute a_k is present in visual input x. At test time, we predict the class c which maximizes the probability of having attributes corresponding to its class prototype s_c. Assuming deterministic binary attributes, identical class priors P(c), uniform attribute priors and independence of attributes, we have:

    argmax_{c∈C_U} P(c | x) = argmax_{c∈C_U} ∏_{k=1}^{K} P(a_k = (s_c)_k | x)        (6)

Similar results may be obtained with continuous attributes by using probability density functions and regressors instead of classifiers.

The Indirect Attribute Prediction or IAP was also proposed in [LNH09] and is very close to DAP. A notable difference is that it does not require any model training beyond a standard multi-class classifier on seen classes, and in particular does not require any training related to the attributes. As such, it enables to seamlessly convert any pre-trained standard supervised classification model to a ZSL setting, provided a semantic representation is available for each seen and unseen class. In Equation (6), writing P(a_k | x) as P(a_k | c) P(c | x) and considering that, on the one hand, P(c | x) for seen classes can be obtained using any supervised classifier trained on the training dataset and, on the other hand, P(a_k | c) is 1 if a_k = (s_c)_k and 0 otherwise, we finally have:

    ŷ = argmax_{c∈C_U} ∏_{k=1}^{K} ∑_{c'∈C_S : (s_c)_k = (s_{c'})_k} P(c' | x)        (7)

Similarly to IAP, a method based on Convex Semantic Embeddings, or ConSE [NMB+14], only requires a standard classifier trained on the seen classes: given an image x, we estimate its semantic representation ŝ(x) ∈ R^K as a convex combination of the semantic prototypes s_{ĉ_t} of the best predictions ĉ_t(x) for x, each prototype being weighted by its classification score. For a test instance x, we can then simply predict ŷ as the class whose class prototype is closest to the estimated semantic representation, as measured with cosine similarity. We can notice that, contrary to DAP and IAP, ConSE does not make any implicit assumption regarding the nature of the class prototypes, and can be used with semantic representations having binary or continuous components. It is also interesting to note that if the convex combination is restricted to one prototype, this method is equivalent to simply finding the best matching seen class for the (unseen) test instance x and predicting the unseen class whose prototype is closest to the prototype of the best matching seen class.
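A minimal ConSE-style prediction for a single test image could look as follows; the top-T truncation, the use of softmax scores and all identifiers are assumptions of this sketch rather than the exact original formulation.

```python
import numpy as np

def conse_prediction(probs_seen, S_seen, S_unseen, top_t=5):
    """ConSE-style prediction sketch for one test image.

    probs_seen: (C_s,) softmax scores of a supervised classifier over seen classes
    S_seen:     (C_s, K) seen prototypes; S_unseen: (C_u, K) unseen prototypes
    """
    top = np.argsort(probs_seen)[::-1][:top_t]             # best seen-class predictions
    weights = probs_seen[top] / probs_seen[top].sum()
    s_hat = weights @ S_seen[top]                           # convex combination of prototypes
    # cosine similarity between the estimated embedding and each unseen prototype
    sims = (S_unseen @ s_hat) / (np.linalg.norm(S_unseen, axis=1) * np.linalg.norm(s_hat) + 1e-12)
    return int(sims.argmax())
```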
One simple approach to ZSL is to view this task as a regression problem, where we aim to predict continuous attributes from a visual instance. Linear regression is a straightforward baseline. Given a visual sample x and the corresponding semantic representation s, we aim to predict each semantic component s_k of s as ŝ_k = w_k^⊤ x, so as to minimize the squared difference between the prediction and the true value, (ŝ_k − s_k)², w_k ∈ R^D being the parameters of the model. If we write W = (w_1, ..., w_K)^⊤ ∈ R^{K×D}, we can directly estimate the entire prototype with ŝ = Wx. We can also directly compare how close ŝ is to s with ‖ŝ − s‖² = ∑_k (ŝ_k − s_k)². As with a standard linear regression, we determine the optimal parameters W by minimizing the error over the training dataset D_tr. Let us note X = (x_1, ..., x_N)^⊤ ∈ R^{N×D} the matrix whose N lines correspond to the visual features of training samples, and T = (t_1, ..., t_N)^⊤ ∈ R^{N×K} the matrix containing the class prototypes associated to each image, so that t_n = s_{y_n}. To simplify notations, we denote ‖·‖ both the ℓ2 norm of a vector and the Frobenius norm of a matrix. The regularized loss to minimize, with regularization weight λ, is:

    (1/N) ‖XW^⊤ − T‖² + λ ‖W‖²        (8)

Such a loss has a closed-form solution, which directly gives the value of the optimal parameters:

    W = T^⊤ X (X^⊤X + λN I_D)^{-1}        (9)

At test time, given an image x belonging to an unseen class, we estimate its corresponding semantic representation ŝ = Wx and predict the class with the closest semantic prototype. Note that it is also possible to use other distances or similarity measures, such as a cosine similarity, during the prediction phase.
The Embarrassingly Simple approach to Zero-Shot Learning [RPT15], often abbreviated ESZSL, makes use of a similar idea with a few additional steps. Similarly, the projection t̂_n = Wx_n of an image x_n should be close to the expected semantic representation. This similarity is nevertheless estimated by a dot product t̂_n^⊤ t_n, which should be close to 1 for the ground truth t_n = s_{y_n} and to −1 for the prototypes of other classes. Noting Y ∈ {−1, 1}^{N×|C_S|} the matrix that is 1 on line n and column y_n and −1 elsewhere, and S = (s_1, ..., s_{|C_S|})^⊤ ∈ R^{|C_S|×K} the matrix that contains the prototypes of seen classes, the loss to minimize is (1/N)‖XW^⊤S^⊤ − Y‖². In [RPT15], it is further regularized such that visual features projected on the semantic space, XW^⊤, have similar norms to allow for fair comparison, and similarly for the semantic prototypes projected on the visual space, W^⊤S^⊤. Adding an ℓ2 penalty on W itself, the loss becomes:

    (1/N) ‖XW^⊤S^⊤ − Y‖² + γ ‖W^⊤S^⊤‖² + (λ/N) ‖XW^⊤‖² + γλ ‖W‖²        (10)

λ and γ being hyperparameters controlling the weights of the different regularization terms. The minimization of this expression also leads to a closed-form solution:

    W = (S^⊤S + λN I_K)^{-1} S^⊤ Y^⊤ X (X^⊤X + γN I_D)^{-1}        (11)
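Below is a NumPy sketch of the corresponding closed-form training and scoring, following Eq. (11) as written above; the ±1 label matrix and all names are assumptions made for illustration.

```python
import numpy as np

def fit_eszsl(X, Y, S, lam=1.0, gam=1.0):
    """ESZSL-style closed-form solution (cf. Eq. 11) sketch.

    X: (N, D) visual features; Y: (N, C_s) matrix with +1 for the true class, -1 elsewhere;
    S: (C_s, K) seen-class prototypes. Returns W of shape (K, D).
    """
    N, D = X.shape
    K = S.shape[1]
    left = np.linalg.solve(S.T @ S + lam * N * np.eye(K), S.T @ Y.T @ X)   # (K, D)
    W = left @ np.linalg.inv(X.T @ X + gam * N * np.eye(D))                # Eq. (11)
    return W

def eszsl_scores(W, X_test, S_unseen):
    """Compatibility scores x^T W^T s for every test sample and unseen prototype."""
    return X_test @ W.T @ S_unseen.T      # (N_test, C_u); prediction = argmax over columns
```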
15] thatthis situation tends to happen. However, [SSH +
15] shows that this effect is mitigated whenprojecting from the semantic to the visual space, compared to the opposite situation. It shouldbe noted that the hubness problem does not occur exclusively when using ridge regression, andmore complex ZSL methods such as [ZXG17] make use of the findings of [SSH + SAE ) [KXG17] approach can be seen as a combination of thetwo ridge regression projections, from the semantic space to the visual one and the opposite.The idea consists in first encoding a visual sample by linearly projecting it onto the semanticspace and then decoding it by projecting the result into the visual space again. Contrary to theprevious proposals, there is no immediate closed-form solution to this problem. However, itcan be expressed as a Sylvester equation and a numerical solution can be computed efficientlyusing the Bartels-Stewart algorithm [BS72]. During the testing phase, predictions can be madeeither in the semantic space or in the visual space.All previous methods project linearly from one modality (visual or semantic) to the other,but they can be adapted to non-linear regression methods, as proposed by
the Cross-Modal Transfer approach, or CMT [SGMN13]. It consists in a simple fully-connected, 1-hidden-layer neural network with hyperbolic tangent non-linearity, which is used to predict semantic prototypes from visual features. Equation (8) can therefore be re-written as

    (1/N) ∑_n ‖t_n − W_2 tanh(W_1 x_n)‖²        (13)

W_1 ∈ R^{H×D} and W_2 ∈ R^{K×H} being the parameters of the model, and H the dimension of the hidden layer, which is a hyperparameter. Similar or more complex adaptations can easily be made for other methods. The main drawback of such non-linear projections compared to the linear methods presented earlier is that there is no general closed-form solution, and iterative numerical algorithms must be used to determine suitable values for the parameters.
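For illustration, a CMT-like non-linear projection of Eq. (13) could be written with PyTorch as below; the class and argument names, the hidden size and the absence of bias terms are assumptions of this sketch, not the original implementation.

```python
import torch
import torch.nn as nn

class VisualToSemanticMLP(nn.Module):
    """One-hidden-layer tanh network mapping visual features (D) to prototypes (K)."""
    def __init__(self, D, H, K):
        super().__init__()
        self.w1 = nn.Linear(D, H, bias=False)
        self.w2 = nn.Linear(H, K, bias=False)

    def forward(self, x):
        return self.w2(torch.tanh(self.w1(x)))

def cmt_loss(model, x, t):
    """Mean squared distance between predicted and true prototypes for a batch.

    x: (B, D) visual features, t: (B, K) target class prototypes.
    """
    return ((model(x) - t) ** 2).sum(dim=1).mean()
```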
Triplet loss methods make a more direct use of the compatibility function f. The main idea behind these methods is that the compatibility of matching pairs should be much higher than the compatibility of non-matching pairs. More specifically, given a visual sample x with label y, we expect its compatibility with the corresponding prototype s_y to be much higher than its compatibility with s_c, the prototype of a different class c ≠ y. How "much higher" can be more precisely defined through the introduction of a margin m, such that f(x, s_y) ≥ m + f(x, s_c). To enforce this constraint, we can penalize triplets (x, s_y, s_c), c ≠ y, for which this inequality does not hold, using the triplet loss

    L_triplet(x, s_c, s_y; f) = [m + f(x, s_c) − f(x, s_y)]_+        (14)

where [·]_+ = max(0, ·). This way, for a given triplet (x, s_y, s_c), c ≠ y, the loss is 0 if f(x, s_c) is much smaller than f(x, s_y), and is all the higher as f(x, s_c) gets close to, or surpasses, f(x, s_y). In general, it is not possible to derive a solution analytically for methods based on a triplet loss, so we must resort to numerical optimization.

In many triplet loss approaches to ZSL, the compatibility function f is simply defined as a bilinear mapping between the visual and semantic spaces parameterized by a matrix W ∈ R^{D×K}, so that f(x, s) = x^⊤Ws. This compatibility function is actually the same as with ESZSL, even though the loss function used to learn its parameters W is different. The Deep Visual-Semantic Embedding model or DeViSE [FCS+13] is one of the most direct applications of a triplet loss with a linear compatibility function to ZSL: the total loss is simply the sum of the triplet loss over all training triplets (x_n, s_{y_n}, s_c), c ≠ y_n:

    L_tr(D_tr) = (1/N) ∑_{n=1}^{N} ∑_{c∈C_S, c≠y_n} [m + f(x_n, s_c) − f(x_n, s_{y_n})]_+        (15)

DeViSE can also be viewed as a direct application of the Weston-Watkins loss [WW+99] to ZSL.
It can be noted that the link with the generic loss framework in Equation (1) is this time quite straightforward, as with many triplet loss methods. Although no explicit regularization Ω on f is mentioned in the original publication – even though the authors make use of early stopping in the gradient descent – it is again straightforward to add an ℓ2 regularization term on W.
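A hedged sketch of this ranking loss with a bilinear compatibility is given below for a single sample; batch handling and names are illustrative.

```python
import numpy as np

def devise_triplet_loss(W, x, y, S_seen, margin=0.1):
    """Triplet (Weston-Watkins style) loss of Eq. (15) for one sample, with f(x, s) = x^T W s.

    W: (D, K) compatibility matrix; x: (D,) visual features;
    y: index of the true class in S_seen, which has shape (C_s, K).
    """
    scores = x @ W @ S_seen.T                        # (C_s,) compatibilities f(x, s_c)
    hinge = np.maximum(0.0, margin + scores - scores[y])
    hinge[y] = 0.0                                   # the true class does not contribute
    return hinge.sum()                               # an SJE-style variant would use hinge.max()
```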
The Structured Joint Embedding approach, or SJE [ARW+15], relies on a similar triplet loss, but for each training sample only the most strongly violating class contributes to the loss:

    L_tr(D_tr) = (1/N) ∑_{n=1}^{N} max_{c∈C_S, c≠y_n} ([m + f(x_n, s_c) − f(x_n, s_{y_n})]_+)        (16)
The Attribute Label Embedding approach or ALE [APHS13, APHS15] considers the ZSL task as a ranking problem, where the objective is to rank the correct class c as high as possible in the list of candidate unseen classes. From this perspective, we can consider that SJE only takes into account the top element of the ranking list, provided the margin m is close to 0. By contrast, DeViSE penalizes all ranking mistakes: given a labeled sample (x, y), for all classes c mistakenly ranked higher than y, we have f(x, s_c) > f(x, s_y), which contributes to the loss. The ALE approach aims to be somewhere in between these two proposals, so that a mistake on the rank when the true class is close to the top of the ranking list weighs more than a mistake when the true class is lower in the list.

Similarly to CMT in the previous section, all triplet-loss models can be extended to the non-linear case. Such an extension is even more straightforward since this time, having no closed-form solution, all models already require the use of numerical optimization. One example of a non-linear model worth describing due to its historical significance and still fair performance is the Synthesized Classifiers approach, or SynC [CCGS16a, CCGS20].
Based on a manifold learning framework, SynC aims to learn phantom classes in both the semantic and visual spaces, so that linear classifiers for seen and unseen classes can be synthesized as a combination of such phantom classes. More precisely, the goal is to synthesize linear classifiers w_c in the visual space such that the compatibility between image x and class c can be computed with f(x, s_c) = w_c^⊤ x. The prediction is then ŷ = argmax_c w_c^⊤ x. Let us note respectively {x*_p}_{p∈{1,...,P}} and {s*_p}_{p∈{1,...,P}} the P phantom classes in the respective visual and semantic spaces. These phantom classes are learned and constitute the parameters of the model. Each visual classifier is synthesized as a linear combination of visual phantom classes, w_c = ∑_{p=1}^{P} v_{c,p} x*_p. The value of each coefficient v_{c,p} is set so as to correspond to the conditional probability of observing phantom class s*_p in the neighborhood of real class s_c in the semantic space. Following works on manifold learning [HR03, MH08], this can be expressed according to s_c and s*_p. The parameters of the model, i.e. the phantom classes {(x*_p, s*_p)}_p, can be estimated by making use of the Crammer-Singer loss, with adequate regularization, to obtain the following objective:

    minimize_{{(x*_p, s*_p)}_p}  (1/N) ∑_{n=1}^{N} max_{c∈C_S, c≠y_n} ([m + w_c^⊤ x_n − w_{y_n}^⊤ x_n]_+) + λ ∑_{c∈C_S} ‖w_c‖² + γ ∑_{p=1}^{P} ‖s*_p‖²        (17)

where λ and γ are hyperparameters. (A number of simplifications were made for the sake of clarity and brevity: in the original article [CCGS16a], phantom classes are actually sparse linear combinations of semantic prototypes, v_{c,p} can further use a Mahalanobis distance, other losses such as the squared hinge loss can be employed instead of the Crammer-Singer loss, Euclidean distances between semantic prototypes can be used instead of a fixed margin in the triplet loss, additional regularization terms and hyperparameters are introduced, and optimization between {x*_p}_p and {s*_p}_p is performed alternatingly.) It is interesting to note that ALE can actually be considered as a special case of SynC, where the classifiers are simply a linear combination of semantic prototypes.

Recently, [LCLBC19b] showed that modifications to the triplet loss can enable models trained with this loss to reach (G)ZSL accuracy competitive with generative models (Sec. 3.3). Such modifications include a margin that depends on the similarity between s_y and s_c in Equation (14), so that confusions between very similar classes are not penalized as much as confusions between dissimilar classes during training, as well as a weighting scheme that gives "representative" training samples more impact than outliers.

Generative methods applied to ZSL aim to produce visual samples belonging to unseen classes based on their semantic description; these samples can then be used to train standard classifiers. Partly for this reason, most generative methods directly produce high-level visual features, as opposed to raw pixels – another reason being that generating raw images is usually not as effective [XLSA18b]. Generative methods have gained a lot of attention in the last few years: many if not most recent high-visibility ZSL approaches [VAMR18, XLSA18b, XSSA19] rely on generative models.
This is partly because such approaches have interesting properties which make them particularly suitable to certain settings such as GZSL. However, a disadvantage is that they can only operate in a class-transductive setting, since the class prototypes of unseen classes are needed to generate samples belonging to these classes; contrary to methods based on regression or explicit compatibility functions, at least some additional training is necessary to integrate novel classes into the model. We divide generative approaches into two main categories: methods generating a parametric distribution, which consider that visual samples follow a standard probability distribution such as a multivariate Gaussian and attempt to estimate its parameters so that visual features can be sampled from this distribution, and non-parametric methods, where visual samples are directly generated by the model.

Methods based on parametric distributions assume that visual features for each class follow a standard parametric distribution. For example, one may consider that for each class c, visual features are samples from a multivariate Gaussian with mean μ_c ∈ R^D and covariance Σ_c ∈ R^{D×D}, such that for samples x from class c we have p(x; μ_c, Σ_c) = N(x | μ_c, Σ_c). If one can estimate μ and Σ for unseen classes, it is possible to generate samples belonging to these classes. Zero-shot recognition can then be performed by training a standard multi-class classifier on the labeled generated samples. Alternatively, knowing the (estimated) distribution of samples from unseen classes, one may determine the class of a test visual sample x using maximum likelihood or similar methods [VR17]:

    ŷ = argmax_{c∈C_U} p(x; μ_c, Σ_c)        (18)

Other approaches [XLSA18b] also propose to further train a ZSL model based on an explicit compatibility function using the generated samples and the corresponding class prototypes, and then perform zero-shot recognition as usual with Equation (2).
The Generative Framework for Zero-Shot Learning [VR17] or GFZSL assumes that visual features are normally distributed given their class. The parameters of the distribution (μ_c, σ_c) (to simplify, we assume that Σ_c = diag(σ_c), with σ_c ∈ R_+^D) are easy to obtain for seen classes c ∈ C_S using e.g. maximum likelihood estimators, but are unknown for unseen classes. Since the only information available about unseen classes consists of class prototypes, one can assume that the parameters μ_c and σ_c of class c depend on the class prototype s_c. [VR17] further assumes a linear dependency, such that μ_c = W_μ^⊤ s_c and ρ_c = log(σ_c) = W_σ^⊤ s_c. The model's parameters W_μ ∈ R^{K×D} and W_σ ∈ R^{K×D} can then be obtained with ridge regression, using the class distribution parameters {(μ̂_c, ρ̂_c)}_{c∈C_S} estimated on seen classes as training samples. Similarly to previous approaches, this consists in minimizing an ℓ2-regularized least squares loss; noting M = (μ̂_1, ..., μ̂_{|C_S|})^⊤ ∈ R^{|C_S|×D} and R = (ρ̂_1, ..., ρ̂_{|C_S|})^⊤ ∈ R^{|C_S|×D}, we have:

    W_μ = (S^⊤S + λ_μ I_K)^{-1} S^⊤ M        (19)
    W_σ = (S^⊤S + λ_σ I_K)^{-1} S^⊤ R        (20)

We can thus predict parameters (μ̂_c, ρ̂_c) for all unseen classes c ∈ C_U, and sample visual features of unseen classes accordingly. Predictions can then be made using either a standard classifier or the estimated distributions themselves. [VR17] also extends this approach to include more generic distributions belonging to the exponential family, as well as non-linear regressors.
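A minimal sketch of this idea (ridge-regressing Gaussian parameters from prototypes as in Eqs. 19–20, then sampling synthetic features for unseen classes) is given below; all names and the diagonal-Gaussian simplification follow the text above, but the exact implementation details are assumptions.

```python
import numpy as np

def fit_gfzsl(S_seen, Mu, LogSigma, lam=1.0):
    """Ridge regressors mapping prototypes to Gaussian parameters (cf. Eqs. 19-20).

    S_seen:   (C_s, K) seen prototypes
    Mu:       (C_s, D) empirical class means; LogSigma: (C_s, D) empirical log std deviations
    """
    K = S_seen.shape[1]
    gram = S_seen.T @ S_seen
    W_mu = np.linalg.solve(gram + lam * np.eye(K), S_seen.T @ Mu)          # (K, D)
    W_sigma = np.linalg.solve(gram + lam * np.eye(K), S_seen.T @ LogSigma) # (K, D)
    return W_mu, W_sigma

def sample_unseen_features(W_mu, W_sigma, S_unseen, n_per_class=100, rng=None):
    """Draw synthetic visual features for each unseen class from its predicted Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    mus = S_unseen @ W_mu                     # (C_u, D) predicted means
    sigmas = np.exp(S_unseen @ W_sigma)       # (C_u, D) predicted diagonal std deviations
    return [rng.normal(mu, sig, size=(n_per_class, mus.shape[1]))
            for mu, sig in zip(mus, sigmas)]
```

The generated features can then be used to train any standard classifier over the unseen (or all) classes.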
The Synthesized Samples for Zero-Shot Learning [GDHG17] or SSZSL method similarly assumes that p(x | c) is Gaussian, estimates parameters (μ, Σ) for seen classes with techniques similar to GFZSL, and aims to predict parameters (μ̂, Σ̂) for unseen classes. In a way reminiscent of the ConSE method, the distribution parameters are estimated using a convex combination of the parameters of the seen classes d, such that μ̂ = (1/Z) ∑_{d∈C_S} w_d μ_d and σ̂ = (1/Z) ∑_{d∈C_S} w_d σ_d, with Z = 1^⊤w = ∑_d w_d. The model therefore has one vector parameter w_c ∈ R^{|C_S|} to determine per unseen class c. These are set such that the semantic prototype s_c^te of unseen class c is approximately a convex combination of the prototypes of seen classes, i.e. s_c^te ≈ S^⊤ w_c / Z_c, while preventing classes dissimilar to s_c^te from being assigned a large weight. This results in the following loss for unseen class c:

    ‖s_c^te − S^⊤ w_c‖² + λ w_c^⊤ d_c        (21)

where each element (d_c)_i of d_c is a measure of how dissimilar unseen class c is to seen class i, and λ is a hyperparameter. (In [GDHG17], the authors use (d_c)_i = (exp(−‖s_c^te − s_i^tr‖ / ᾱ))^{-1} to measure how dissimilar unseen class c is to seen class i, where ᾱ is the mean value of the distances between any two prototypes from seen classes.) Minimizing the second term in Equation (21) naturally leads to assigning smaller weights to classes dissimilar to c. A closed-form solution can then be obtained as:

    w_c = (SS^⊤)^{-1} (λ d_c − S s_c^te)        (22)

The non-parametric approaches do not explicitly make simplifying assumptions about the shape of the distribution of visual features, and use powerful generative methods such as variational auto-encoders (VAEs) [KW14] or generative adversarial networks (GANs) [GPAM+14].
The Synthesized Examples for GZSL method [VAMR18], or SE-GZSL, is based on a conditional VAE [SLY15] architecture. It consists of two main parts: an encoder E(·) which maps an input x to an R-dimensional internal representation or latent code z ∈ R^R, and a decoder D(·) which tries to reconstruct the input x from the internal representation. An optional third part can be added to the model: a regressor R(·) which estimates the semantic representation t of the visual input x. See Chapter 2 for more details on the VAE architecture. To help the decoder produce class-dependent reconstructed outputs, the corresponding class prototype t_n = s_{y_n} is concatenated to the representation z_n for input x_n.

Other approaches such as [MKRMM18] consider that the encoder outputs a probability distribution, assuming that the true distribution of visual samples given the latent representation is an isotropic Gaussian, i.e. p(x | z, t) = N(x | μ(z, t), σ²I). In this case, the output of the decoder should be x̂ = μ(z, t), and it can be shown that minimizing −log(p(x | z, t)) is equivalent to minimizing ‖x − x̂‖². Furthermore, in [MKRMM18], the class prototype is appended to the visual sample as opposed to the latent code.

The authors of [VAMR18] further propose to use the regressor R to encourage the decoder to generate discriminative visual samples. An example of such a component consists in evaluating the quality of the attributes predicted from synthesized samples, and takes the form L = −E_{p(x̂|z,t) p(z) p(t)}[log(p(a | x̂))]. The regressor itself is trained on both labeled training samples and generated samples, and the parameters of the encoder/decoder and of the regressor are optimized alternatingly.

f-GAN [XLSA18b] is based on a similar approach, but makes use of conditional GANs [MO14] to generate visual features. It consists of two parts: a discriminator D which tries to distinguish real images from synthesized images, and a generator G which tries to generate images that D cannot distinguish from real images. Both the generator and the discriminator are multilayer perceptrons. The generator is similar to the decoder from the previous approach in that it takes as input a latent code z ∈ R^R and the semantic representation s_c of a class c, and attempts to generate a visual sample x̂ of class c: G : R^R × R^K → R^D. The key difference is that the latent code is not the output of an encoder but consists of random Gaussian noise. The discriminator takes as input a visual sample, either real or generated, of a class c as well as the prototype s_c, and predicts the probability that the visual sample is real rather than generated: D : R^D × R^K → [0, 1]. G and D compete in a two-player minimax game, such that the optimization objective is:

    min_G max_D E_{p(x,y), p(z)} [log(D(x, s_y)) + log(1 − D(G(z, s_y), s_y))]        (23)

The authors of [XLSA18b] further propose to train an improved Wasserstein GAN [ACB17, GAA+17].
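As a purely illustrative sketch of such a conditional feature generator G(z, s), here is a minimal PyTorch module; the layer sizes, activations and names are assumptions, not the architecture used in [XLSA18b] or [VAMR18].

```python
import torch
import torch.nn as nn

class ConditionalFeatureGenerator(nn.Module):
    """Minimal sketch of a conditional generator G(z, s) -> visual features (cf. Eq. 23)."""
    def __init__(self, noise_dim, proto_dim, feat_dim, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + proto_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim),
            nn.ReLU(),   # features from a ReLU-based CNN are non-negative
        )

    def forward(self, z, s):
        return self.net(torch.cat([z, s], dim=1))

# Usage sketch: generate 100 synthetic features for one unseen-class prototype s_u of shape (1, K)
# z = torch.randn(100, noise_dim); fake_feats = G(z, s_u.expand(100, -1))
```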
In most of the work on ZSL, the semantic features s_c were usually assumed to be vectors of attributes such as "is red", "has stripes" or "has wings". Such attributes can either be binary or continuous with values in [0, 1]; a high value for "is red" could for instance mean that the animal or object is mostly red. But for a large-scale dataset with hundreds or even thousands of classes, or an open dataset where novel classes are expected to appear over time, it is impractical or even impossible to define a priori all the useful attributes and manually provide semantic prototypes based on these attributes for all the classes. It is all the more time-consuming as fine-grained datasets may require hundreds of different attributes to reach a satisfactory accuracy [LCPLB20].

For large-scale or open datasets it is therefore necessary to identify appropriate sources of semantic information and means to extract this information, in order to obtain relevant semantic prototypes. In the case of ImageNet, readily available sources are the word embeddings of the class names and the relations between them according to WordNet [Mil95], a large lexical database of English that has been developed for many years by human effort, but is now openly available. Word embeddings are obtained in an unsupervised way, and such embeddings of the class names have been employed in ZSL as semantic class representations since [RSS11]. The word embedding model is typically trained on a large text corpus, such as Wikipedia, where a neural architecture is assigned the task of predicting the context of a given word. For instance, the skip-gram objective [MSC+13] aims to find word representations containing predictive information about the words occurring in the neighborhood of a given word. Given a sequence {w_1, ..., w_T} of T training words and a context window of size S, the goal is to maximize

    (1/T) ∑_{t=1}^{T} ∑_{−S≤i≤S, i≠0} log p(w_{t+i} | w_t)        (24)

Although deep neural architectures could be used for this task, it is much more common to use a shallow network with a single hidden layer. In this case, each unique word w is associated with an "input" vector v_w and an "output" vector v'_w, and p(w_i | w_t) is computed as

    p(w_i | w_t) = exp(v'_{w_i}^⊤ v_{w_t}) / ∑_w exp(v'_w^⊤ v_{w_t})        (25)

The internal representation corresponding to the hidden layer, i.e. the input vector representation v_w, can then be used as the word embedding. Other approaches such as [BGJM17] or [PSM14] have also been proposed.
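In practice, class prototypes for such large-scale datasets are often built directly from pre-trained word vectors. The following sketch shows one common recipe (averaging the vectors of the words of a class name and L2-normalizing); the names, the averaging strategy and the fallback behavior are assumptions for illustration, not a prescribed procedure.

```python
import numpy as np

def class_prototypes_from_embeddings(class_names, embedding, dim=300):
    """Build semantic prototypes from pre-trained word embeddings (sketch).

    embedding: dict-like mapping a word to a (dim,) vector (e.g., loaded skip-gram vectors).
    Multi-word class names are averaged; names with no known word fall back to a zero vector.
    """
    protos = []
    for name in class_names:
        vecs = [embedding[w] for w in name.lower().split() if w in embedding]
        protos.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    P = np.stack(protos)
    return P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-12)   # L2-normalize each prototype
```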
Semantic information regarding ImageNet classes can also be provided by WordNet subsumption relationships between the classes (or IS-A relations). They were exploited by several methods, e.g. with graph convolutional neural networks, and employed for ZSL on ImageNet with state-of-the-art results [WYG18, KCL+19], which nevertheless correspond to an accuracy of about 14% according to [HAT19]. However, although using word or graph embeddings can reduce the additional human effort required to obtain class prototypes to virtually zero – pre-trained word embeddings can easily be found online – there still exists a large performance gap between the use of such embeddings and manually crafted attributes [LCPLB20]. Such a difference may be due in part to the text corpora used to train the word embeddings.

The use of complementary sources to produce semantic class representations for ZSL relies on the assumption that the information these sources provide reflects visual similarity relations between classes. However, the word embeddings are typically learned from generic text corpora, like Wikipedia, that do not focus on the visual aspect. Also, the subsumption relationships issued from WordNet and supporting the graph-based class embeddings represent hierarchical conceptual relations. In both cases, the sought-after visual relations are at best represented in a very indirect and incomplete way.

To address this limitation and include more visual information in the semantic prototypes, it was recently proposed to employ text corpora of a more visual nature, by constructing such a corpus from Flickr tags [LCPLB20]. Following [PG11], the authors of [LCPLB20] further suggested to address the problem of bulk tagging [OM13] – users attributing the exact same tags to numerous photos – by ensuring that a tuple of words (w_i, w_j) can only appear once per user during training, thus preventing a single user from having a disproportionate weight on the final embedding. Also, [LCLBC20] suggested to exploit the sentence descriptions of WordNet concepts, in addition to the class name embedding, to produce semantic representations better reflecting visual relations. Either of these two proposals allows to reach an accuracy between 17.2% and 17.8% on the 500 test classes of the ImageNet ZSL benchmark with the linear model from the semantic to the visual space (Section 3.1), compared to 14.4% with semantic prototypes based on standard embeddings.

Zero-shot learning addresses the problem of recognizing categories that are missing from the training set.
ZSL has grown from an endeavor of some machine learning and computer vision researchers to find approaches that come closer to how humans learn to identify object classes. It now aims to become a radical answer to the concern that the amount of labeled data grows much slower than the volume of data in general, so supervised learning alone cannot produce satisfactory solutions for many real-world applications.

Key to the possibility of recognizing previously unseen categories is the availability, for all categories, both seen and unseen, of more than just conventional labels. For each category we should have complementary information (or features) reflecting the characteristics of the modality used for recognition (visual if recognition is directed to images). The relation between these features and the target modality can thus be learned from the seen categories and then employed for recognizing unseen categories.

Most of the work on ZSL took advantage of the existence of some small or medium-size datasets for which the complementary information, under the form of attributes, was devised and manually provided to support the development of ZSL methods. However, in general applications one has to deal with large and even open sets of categories, so other approaches should be found for identifying associated sources of complementary information and exploiting them in ZSL.

For the large ImageNet dataset several readily available complementary sources were found, including word embeddings of class names, WordNet-based concept hierarchies including the classes as nodes, and short textual definitions from WordNet. While this allowed to extend ZSL methods to such large-scale datasets, the state-of-the-art accuracy obtained on the unseen categories of ImageNet is as yet disappointing. This is because the information provided by these sources reflects mostly the conceptual relations and not so much the visual characteristics of the categories. To go beyond this level of performance, we consider that two important steps should be taken. First, it is necessary to assemble large corpora including rather detailed textual descriptions of the visual characteristics of a large number of object categories. Partial corpora do exist in various domains (e.g. flora descriptions) and different languages. Second, zero-shot recognition should rely on a deeper, compositional analysis (e.g. [PNGR19]) of an image, as well as on visual reasoning.

References

[AB17] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
[ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[APHS13] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In Computer Vision and Pattern Recognition, pages 819–826, 2013.
[APHS15] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE T. Pattern Analysis and Machine Intelligence, 38(7):1425–1438, 2015.
[ARW+15] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In Computer Vision and Pattern Recognition, 2015.
[BGJM17] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[BS72] Richard H. Bartels and George W Stewart. Solution of the matrix equation AX + XB = C [F4]. Communications of the ACM, 15(9):820–826, 1972.
[BSS+18] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In European Conference on Computer Vision, 2018.
[CCGS16a] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In Computer Vision and Pattern Recognition, pages 5327–5336, 2016.
[CCGS16b] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In European Conference on Computer Vision, pages 52–68. Springer, 2016.
[CCGS20] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Classifier and exemplar synthesis for zero-shot learning. International Journal of Computer Vision, 128(1):166–201, 2020.
[CS01] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.
[CZX+18] Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Computer Vision and Pattern Recognition, pages 1043–1052, 2018.
[DDS+09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[FCS+13] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, 2013.
[FEHF09] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, pages 1778–1785. IEEE, 2009.
[FXJ+18] Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35(1):112–125, 2018.
[FYH+15] Yanwei Fu, Yongxin Yang, Tim Hospedales, Tao Xiang, and Shaogang Gong. Transductive multi-label zero-shot learning. arXiv preprint arXiv:1503.07790, 2015.
[GAA+17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[GDHG17] Yuchen Guo, Guiguang Ding, Jungong Han, and Yue Gao. Synthesizing samples for zero-shot learning. In IJCAI, 2017.
[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[HAT19] Tristan Hascoet, Yasuo Ariki, and Tetsuya Takiguchi. On zero-shot recognition of generic objects. In Computer Vision and Pattern Recognition, pages 9553–9561, 2019.
[HR03] Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pages 857–864, 2003.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pages 770–778, 2016.
[KASB17] Nour Karessli, Zeynep Akata, Bernt Schiele, and Andreas Bulling. Gaze embeddings for zero-shot image classification. In Computer Vision and Pattern Recognition, pages 4525–4534, 2017.
[KCL+19] Michael Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, and Eric P. Xing. Rethinking knowledge graph propagation for zero-shot learning. In Computer Vision and Pattern Recognition, pages 11487–11496, 2019.
[KW14] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[KXG17] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. In Computer Vision and Pattern Recognition, pages 4447–4456. IEEE, 2017.
[LCLBC19a] Yannick Le Cacheux, Hervé Le Borgne, and Michel Crucianu. From classical to generalized zero-shot learning: A simple adaptation process. In International Conference on Multimedia Modeling, pages 465–477. Springer, 2019.
[LCLBC19b] Yannick Le Cacheux, Hervé Le Borgne, and Michel Crucianu. Modeling inter and intra-class relations in the triplet loss for zero-shot learning. In Computer Vision and Pattern Recognition, pages 10333–10342, 2019.
[LCLBC20] Yannick Le Cacheux, Hervé Le Borgne, and Michel Crucianu. Using sentences as semantic embeddings for large scale zero-shot learning. In ECCV 2020 Workshop: Transferring and Adapting Source Knowledge in Computer Vision. Springer, 2020.
[LCPLB20] Yannick Le Cacheux, Adrian Popescu, and Hervé Le Borgne. Webly supervised semantic embeddings for large scale zero-shot learning. arXiv preprint arXiv:2008.02880, 2020.
[LDB15] Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 270–280, 2015.
[LEB08] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. Zero-data learning of new tasks. In AAAI Conference on Artificial Intelligence, volume 2, pages 646–651, 2008.
[LFYFW18] Chung-Wei Lee, Wei Fang, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. Multi-label zero-shot learning with structured knowledge graphs. In Computer Vision and Pattern Recognition, pages 1576–1585, 2018.
[LG15] Xin Li and Yuhong Guo. Max-margin zero-shot learning for multi-class classification. In Artificial Intelligence and Statistics, pages 626–634, 2015.
[LNH09] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.
[LNH14] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE T. Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
[MGS14] Thomas Mensink, Efstratios Gavves, and Cees GM Snoek. COSTA: Co-occurrence statistics for zero-shot classification. In Computer Vision and Pattern Recognition, pages 2441–2448, 2014.
[MH08] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.
Journal of machine learning research , 9(Nov):2579–2605, 2008.[Mil95] George A. Miller. Wordnet: A lexical database for english.
Commun. ACM ,38(11):39–41, November 1995.[MKRMM18] Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A Murthy.A generative model for zero shot learning using conditional variational autoen-coders. In
Proceedings of the IEEE Conference on Computer Vision and PatternRecognition Workshops , pages 2188–2196, 2018.[MO14] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXivpreprint arXiv:1411.1784 , 2014.[MSC +
13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean.Distributed representations of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems , pages 3111–3119, 2013.[MYX +
20] Shaobo Min, Hantao Yao, Hongtao Xie, Chaoqun Wang, Zheng-Jun Zha, andYongdong Zhang. Domain-aware visual bias eliminating for generalized zero-shot learning. In
Computer Vision and Pattern Recognition , pages 12664–12673,2020.[NMB +
14] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, JonathonShlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learningby convex combination of semantic embeddings. In
International Conference onLearning Representations , pages 488–501, 2014.[OM13] Neil O’Hare and Vanessa Murdock. Modeling locations with social media.
In-formation retrieval , 16(1):30–62, 2013.[PG11] Adrian Popescu and Gregory Grefenstette. Social media driven image retrieval.In
Proceedings of the 1st ACM International Conference on Multimedia Re-trieval , pages 1–8, 2011.[PLH20a] Antoine Plumerault, Herv´e Le Borgne, and C´eline Hudelot. Avae: Adversarialvariational auto encoder. In
International Conference on Pattern Recognition(ICPR) , 2020.[PLH20b] Antoine Plumerault, Herv´e Le Borgne, and C´eline Hudelot. Controlling gener-ative models with continuous factors of variations. In
International Conferenceon Learning Representations , 2020.[PNGR19] Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc’AurelioRanzato. Task-driven modular networks for zero-shot compositional learning.In
International Conference on Computer Vision , pages 3593–3602, 2019.[PPHM09] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. Zero-shot learning with semantic output codes. In
Advances in Neural InformationProcessing Systems , pages 1410–1418, 2009.[PSM14] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Globalvectors for word representation. In
Proceedings of the conference on empiricalmethods in natural language processing , pages 1532–1543, 2014.[RALS16] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep rep-resentations of fine-grained visual descriptions. In
Computer Vision and PatternRecognition , pages 49–58, 2016.[RDS +
15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, lexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual RecognitionChallenge. International Journal of Computer Vision (IJCV) , 115(3):211–252,2015.[RNI10] Miloˇs Radovanovi´c, Alexandros Nanopoulos, and Mirjana Ivanovi´c. Hubs inspace: Popular nearest neighbors in high-dimensional data.
Journal of MachineLearning Research , 11(Sep):2487–2531, 2010.[RPT15] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple ap-proach to zero-shot learning. In
International Conference on Machine Learning ,pages 2152–2161, 2015.[RSS11] Marcus Rohrbach, Michael Stark, and Bernt Schiele. Evaluating knowledgetransfer and zero-shot learning in a large-scale setting. In
Computer Vision andPattern Recognition , pages 1641–1648, 2011.[SGMN13] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In
Advances in Neural InformationProcessing Systems , pages 935–943, 2013.[SLY15] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output rep-resentation using deep conditional generative models. In
Advances in neuralinformation processing systems , pages 3483–3491, 2015.[SSH +
15] Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Mat-sumoto. Ridge regression, hubness, and zero-shot learning. In
Joint EuropeanConference on Machine Learning and Knowledge Discovery in Databases , pages135–151. Springer, 2015.[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks forlarge-scale image recognition. arXiv preprint arXiv:1409.1556 , 2014.[THJA04] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Al-tun. Support vector machine learning for interdependent and structured outputspaces. In
Proceedings of the twenty-first international conference on Machinelearning , page 104, 2004.[TJHA05] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Al-tun. Large margin methods for structured and interdependent output variables.
Journal of machine learning research , 6(Sep):1453–1484, 2005.[VAMR18] Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. Gen-eralized zero-shot learning via synthesized examples. In
Computer Vision andPattern Recognition , pages 4281–4289, 2018.[VR17] Vinay Kumar Verma and Piyush Rai. A simple exponential family frameworkfor zero-shot learning. In
Joint European Conference on Machine Learning andKnowledge Discovery in Databases , pages 792–808. Springer, 2017.[WBW +
11] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Be-longie. The Caltech-UCSD Birds-200-2011 dataset, 2011.[WW +
99] Jason Weston, Chris Watkins, et al. Support vector machines for multi-classpattern recognition. In
European Symposium on Artificial Neural Networks ,volume 99, pages 219–224, 1999.[WYG18] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via se-mantic embeddings and knowledge graphs. In
Computer Vision and PatternRecognition , pages 6857–6866, 2018.[WZYM19] Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao. A survey of zero-shotlearning: Settings, methods, and applications.
ACM Transactions on IntelligentSystems and Technology (TIST) , 10(2):1–37, 2019.[XCH +
19] Y. Xian, S. Choudhury, Y. He, B. Schiele, and Z. Akata. Semantic projectionnetwork for zero- and few-label semantic segmentation. In
Computer Vision andPattern Recognition , pages 8248–8257, 2019.[XLSA18a] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly.
IEEE T. Pattern Analysis and Machine Intelligence , 41(9):2251–2265, 2018. XLSA18b] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature gener-ating networks for zero-shot learning. In
Computer Vision and Pattern Recog-nition , pages 5542–5551, 2018.[XSA17] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good,the bad and the ugly. In
Computer Vision and Pattern Recognition , pages 4582–4591, 2017.[XSSA19] Yongqin Xian, Saurabh Sharma, Bernt Schiele, and Zeynep Akata. f-VAEGAN-D2: A feature generating framework for any-shot learning. In
Computer Visionand Pattern Recognition , pages 10275–10284, 2019.[ZBS +
19] Eloi Zablocki, Patrick Bordes, Laure Soulier, Benjamin Piwowarski, and PatrickGallinari. Context-aware zero-shot learning for object recognition. In
Interna-tional Conference on Machine Learning , pages 7292–7303, 2019.[ZXG17] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding modelfor zero-shot learning. In
Computer Vision and Pattern Recognition , pages 2021–2030, 2017., pages 2021–2030, 2017.