An Efficient Framework for Zero-Shot Sketch-Based Image Retrieval
Osman Tursun∗, Simon Denman, Sridha Sridharan, Ethan Goan, Clinton Fookes
Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT), Queensland University of Technology, Australia
Abstract
Recently, Zero-shot Sketch-based Image Retrieval (ZS-SBIR) has attracted the attention of the computer vision community due to its real-world applications, and the more realistic and challenging setting than found in SBIR. ZS-SBIR inherits the main challenges of multiple computer vision problems including content-based Image Retrieval (CBIR), zero-shot learning and domain adaptation. The majority of previous studies using deep neural networks have achieved improved results through either projecting sketches and images into a common low-dimensional space or transferring knowledge from seen to unseen classes. However, those approaches are trained with complex frameworks composed of multiple deep convolutional neural networks (CNNs) and are dependent on category-level word labels. This increases the requirements on training resources and datasets. In comparison, we propose a simple and efficient framework that does not require high computational training resources, and can be trained on datasets without semantic categorical labels. Furthermore, at training and inference stages our method only uses a single CNN. In this work, a pre-trained ImageNet CNN (i.e. ResNet50) is fine-tuned with three proposed learning objectives: domain-aware quadruplet loss, semantic classification loss, and semantic knowledge preservation loss. The domain-aware quadruplet and semantic classification losses are introduced to learn discriminative, semantic and domain-invariant features by considering ZS-SBIR as an object identification and verification problem. To preserve semantic knowledge learned with ImageNet and utilise it on unseen categories, the semantic knowledge preservation loss is proposed. To reduce the computational cost and increase the accuracy of the semantic knowledge distillation process, ground-truth semantic knowledge is prepared in a class-oriented fashion prior to training. Extensive experiments are conducted on three challenging ZS-SBIR datasets: Sketchy Extended, TU-Berlin Extended and QuickDraw Extended. The proposed method achieves state-of-the-art results, and outperforms the majority of related works by a large margin.
Keywords: Sketch-based Image Retrieval, Zero-shot Learning, Knowledge Distillation, Similarity Learning

∗ Corresponding author
Email addresses: [email protected] (Osman Tursun), [email protected] (Simon Denman), [email protected] (Sridha Sridharan), [email protected] (Ethan Goan), [email protected] (Clinton Fookes)
1. Introduction
Searching for images using an image query has increased in popularity as content-based image retrieval (CBIR) techniques have improved in recent years. However, the thrust of CBIR research has considered the scenario where both query and gallery images are real photos (i.e. scenes [1], faces [2]) or digital images (i.e. logos [3]). With the widespread popularity of touch-screen devices, free-hand sketch-based image retrieval (SBIR) tasks have drawn the attention of the computer vision (CV) community, as sketches are a convenient, universal, easy and fast method for image description [4, 5, 6, 7, 8, 9]. See Figure 1 for examples of SBIR results.

Figure 1: Examples of SBIR results. In each row, the figure on the left is the query while the others are retrieved results.
The domain-gap and information-gap between the sketch and photo domains present a challenge to existing CBIR approaches. Sketches contain sparse and abstract information, while photos carry dense and precise information. Deep learning models have been applied to reduce these gaps, either in the latent space or the pixel space. A variety of complex architectures such as multiple independent networks [10, 11, 4, 12], semi-heterogeneous networks [13, 14], generative adversarial networks (GAN) [15, 16, 17, 18] and networks with domain-invariant layers [5, 12] have been proposed to address the domain gap. Although these approaches have shown substantial improvements over hand-crafted features, the increased model complexity requires extra resources during training and inference.

Model complexity has increased further as SBIR methods are often evaluated under zero-shot settings, where testing queries are unseen during training. Related studies found that existing SBIR models tend to fail under zero-shot settings [12]. To tackle this problem, commonly, the mapping or joint embedding space between the visual representation and the class semantic representation is modelled. To achieve this, language models [4, 19, 20, 21] are often used to extract semantic class embeddings, while auxiliary networks such as auto-encoders [19, 20, 4], GANs [20, 22, 23] or graph convolutional neural networks (GNN) [21, 12] are trained to learn the joint representation or mapping.

Moreover, the high training cost is not the only drawback of the aforementioned methods. They require that all classes of the training set have descriptive text labels that can be modelled by a language model. However, in some practical applications, classes may be only labelled with numerical values, or uncommon (i.e. unknown) word labels which cannot be modelled by a language model.

This paper aims to tackle ZS-SBIR with a simple, efficient, and language model-free framework. The recent state-of-the-art (SOTA) work, SAKE [5], is a concise and simple framework. Its feature extraction encoder is a single-stream Convolutional Neural Network (CNN) that ensures an efficient and simple inference process. However, during training it also requires a language model and another ImageNet pre-trained CNN to generate valid teacher signals for knowledge distillation. The authors argue that preserving knowledge learned from ImageNet via knowledge distillation is beneficial for ZS-SBIR. Although we also find that rich features learned from ImageNet are essential for ZS-SBIR, we find a language model and an online teacher network are not necessary. SAKE generates a teacher signal for each input item, whether from the sketch or photo domain. It therefore requires a language model to align teacher signals that are otherwise invalid due to domain-shift. The alignment is based on the semantic similarity matrix of ImageNet labels and target dataset labels, which is constructed using WordNet [24]. In comparison, we generate teacher signals for each class by averaging the activations of a pretrained ImageNet network over images from the photo domain, whose distribution is close to ImageNet. We therefore do not require any semantic labels or language models, and the teacher signal generation is a one-time offline process.

We also find SAKE treats the ZS-SBIR problem as an object identification task, where the learning objective is a categorical classification loss, while other related works [4, 23] consider the problem a verification task where metric learning, i.e. a triplet loss, is applied.
In this work, we not only unify these two objectives in a single framework, but also propose a domain-aware quadruplet loss for metric learning.

We have tested the proposed method on two popular SBIR datasets (Sketchy Extended [25] and TU-Berlin Extended [26]) and a newly proposed challenging SBIR dataset, QuickDraw Extended [4]. In all benchmarks, we have achieved state-of-the-art (SOTA) performance only through fine-tuning a ResNet50 [27] model with our three proposed learning objectives: Domain-aware Quadruplet, Semantic Classification and Semantic Knowledge Preservation losses.

The remainder of the paper is organized as follows. Section 2 presents a literature review where we discuss related ZS-SBIR studies. In Section 3, the proposed model and learning objectives are introduced. Section 4 outlines the experimental setups, results and related discussions; finally, Section 5 concludes the paper.
2. Related Work
Early SBIR studies mainly focus on the challenges raised by the large domain gap between the sketch and photo domains. Both hand-crafted features and deep features have been explored. Hand-crafted features include edge/shape-based features [28, 29, 30] with a bag-of-words representation, as in some aspects, strong edges in a photo correspond to the contours of sketches. On the other hand, deep features seek to learn a joint representation of the sketch and photo domains through metric learning [31, 32, 33, 14], style-content disentangled representations [6, 22] and style-transfer [18, 15, 10, 34]. However, related studies discovered that the accuracy of these models decreases in real-life and challenging scenarios where either the queries or the gallery images are unseen. To tackle this problem, zero-shot SBIR approaches [12, 5, 4, 19, 20, 21] have been proposed.

The majority of ZS-SBIR approaches leverage semantic information embedded in seen data (i.e. word labels) to learn a generalised representation for both seen and unseen categories. The main difference between these methods lies in the architecture of the mapping network and the embedding method in the semantic space. GNNs [12, 21], Multi-Layer Perceptrons (MLP) [4, 35] and GANs [36, 20, 37, 19] have all been used for the mapping network. On the other hand, word2vec [4, 19, 21, 37, 20, 35, 36] and hierarchical models [19, 20, 36] are common embedding methods used to construct the semantic space.

In comparison, the recent work SAKE [5] uses a visual semantic representation learned from ImageNet. However, to avoid an incorrect representation caused by domain-shift, SAKE aligns the visual semantic representation with a semantic similarity matrix constructed with WordNet. We propose an alternative method for extracting a visual semantic representation that is free from alignment. As such, our method does not require a language model. To the best of our knowledge, this makes our proposed approach one of a very small number of ZS-SBIR studies that do not require a language model. Other such methods are typically generative approaches based on GANs [6] and variational auto-encoders (VAE) [38].
3. The Proposed Method
In this section, we present our proposed method for zero-shot sketch-based image retrieval (ZS-SBIR). In the following sub-sections, we first outline the overall structure of the proposed method, then explain the network structures and learning objectives, and discuss implementation details.
The objective of the proposed method is to learn discriminative and domain-invariant CNN encoders that map semantically similar images from the sketch and photo domains into the same region of a common embedding space. An overall diagram of the proposed approach is shown in Figure 2. The diagram is composed of two parts: the Online Training Student Network and the Offline Soft Label Extraction with Teacher Network. The student network, E_st, is trained with quadruplets and the three proposed learning objectives. The teacher network, E_te, generates the ground-truth for the knowledge distillation that prevents E_st from forgetting semantic knowledge learned from pre-training on ImageNet.
Figure 2: A diagram of the proposed method. The approach includes an Online Training Student Network and an Offline Soft Label Extraction with Teacher Network. The student, E_st, and teacher, E_te, networks use ResNet50 as the backbone. A quadruplet composed of two images from the sketch domain and two images from the photo domain is the input of the student network, while the inputs to the teacher network are only from the photo domain. The teacher network generates soft labels to prevent the student network from forgetting previously learned knowledge from ImageNet, and the soft label extraction is a one-time offline process.

Unlike [5], our approach does not require a teacher network during training or a language model for alignment. Our approach, therefore, has a simple and efficient training process. For simplicity and efficiency, two encoders with ResNet50 [27] backbones are used as E_st and E_te. However, for E_st, we replace the fully-connected layer of ResNet50 with three new fully connected layers, FC_id, FC_sim and FC_soft, that correspond to the three proposed learning objectives.
The size of FC_soft is 1,000 (the number of ImageNet classes), while the sizes of FC_id and FC_sim are equal to the number of training classes (i.e. 80, 100, 104 or 220) and the size of the embedded feature (i.e. 64, 512 or 1024), respectively. In this work, a four-stream encoder where all streams share weights is used as E_st. However, semi-heterogeneous networks, or the special domain-invariant layers widely used by previous works [4, 12, 5] to process photos and sketches separately, are easily integrated into E_st. Global average pooling (GAP) is applied to extract latent features from the last convolutional layer of the backbone network. Similar to [4], we tried adding an attention [39] mechanism to E_st, although we observed that it did not yield any improvements during ablation studies.
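To make the encoder structure concrete, the following is a minimal PyTorch sketch; it is our own illustration rather than the authors' released code, and the class and attribute names (StudentEncoder, fc_id, fc_sim, fc_soft) are assumptions made for exposition.

```python
import torch.nn as nn
from torchvision import models

class StudentEncoder(nn.Module):
    """ImageNet-pretrained ResNet50 with its classifier replaced by three heads."""

    def __init__(self, num_classes: int, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        # Keep all layers up to and including global average pooling (GAP).
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        feat_dim = backbone.fc.in_features  # 2048 for ResNet50
        self.fc_sim = nn.Linear(feat_dim, embed_dim)   # embedding for the quadruplet loss
        self.fc_id = nn.Linear(feat_dim, num_classes)  # semantic classification logits
        self.fc_soft = nn.Linear(feat_dim, 1000)       # logits matched to the ImageNet soft labels

    def forward(self, x):
        f = self.features(x).flatten(1)  # GAP output of the last convolutional block
        return self.fc_sim(f), self.fc_id(f), self.fc_soft(f)
```

Because all four streams share weights, each member of a quadruplet is simply passed through this same module.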
To learn a discriminative and domain-invariant encoder with general semantic knowledge, we introduce the following learning objectives: the Domain-aware Quadruplet Loss, the Semantic Classification Loss and the Semantic Knowledge Preservation Loss.

Domain-aware Quadruplet Loss is a modified version of the triplet loss, which has been widely used to maximise the inter-class distance and minimise the intra-class distance in the embedding space for various image retrieval tasks [1, 40]. Here, our objective is also to minimise the distance between sketches and photos from the same semantic category, while maximising the distance between sketches and photos from different categories in the target embedding space.

With the triplet loss, this inter-class and intra-class distance relationship is formulated with triplets where a sketch and a photo are selected from the same category, while another photo is from a different category. For example, T(i) = {I_s^a(i), I_p^+(i), I_p^-(i)} is the i-th triplet, where l(I_s^a(i)) = l(I_p^+(i)) and l(I_s^a(i)) ≠ l(I_p^-(i)) (the notation l represents the label). The Euclidean distance between the anchor sketch and the positive (same class) photo image is δ^+(i) = ||E_st(I_s^a(i)) − E_st(I_p^+(i))||, while the Euclidean distance between the anchor sketch and its negative (different class) photo image is δ_p^-(i) = ||E_st(I_s^a(i)) − E_st(I_p^-(i))||. δ_p^-(i) should be larger than δ^+(i) by a margin α, which is set to 0.2. The triplet loss for a batch of N triplets is defined as

L_{sim} = \frac{1}{N} \sum_{i=1}^{N} \max\left(\delta^{+}(i) - \delta^{-}_{p}(i) + \alpha, 0\right). \quad (1)

The proposed domain-aware quadruplet loss deploys an extra negative sketch image, I_s^-, such that the quadruplet is defined as Q = {I_s^a, I_p^+, I_p^-, I_s^-}. The additional image is used to calculate the Euclidean distance between the anchor sketch and the additional negative (different class) sketch image,

\delta^{-}_{s}(i) = \left\| E_{st}(I^{a}_{s}(i)) - E_{st}(I^{-}_{s}(i)) \right\|. \quad (2)

Therefore, the proposed loss is

L_{sim} = \frac{1}{2N} \sum_{i=1}^{N} \left( \max\left(\delta^{+}(i) - \delta^{-}_{p}(i) + \alpha, 0\right) + \max\left(\delta^{+}(i) - \delta^{-}_{s}(i) + \alpha, 0\right) \right). \quad (3)

We propose the quadruplet for the following reasons (a code sketch of this loss is given at the end of this subsection):
1. To overcome the domain imbalance which can appear in the triplet loss and the classification losses (discussed later), as the total number of sampled photos is twice the number of sketches.
2. Related studies [40, 41] demonstrate that an extra negative image is beneficial for learning discriminative features. However, these studies do not consider domain differences and imbalance in their formulations.

Semantic Classification Loss is introduced to ensure that the hidden features extracted with E_st are composed of signals that are sufficient for identifying the semantic classes of inputs from both the sketch and photo domains. Additionally, with this semantic loss, E_st implicitly learns to minimise the intra-class distance. Specifically, a soft-max cross-entropy loss is utilised. As Equation 4 shows, every input to E_st is a quadruplet, Q, that includes two images from the sketch domain and two images from the photo domain. This equal domain sampling ensures domain balance. The output from E_st is sent to FC_id for the soft-max calculation. Here, we simply use the notation φ to represent this whole process,

L_{cls} = \frac{1}{N} \sum_{i=1}^{N} \sum_{I \in Q} -\log p\left(l(I(i)) \mid \phi(I(i))\right), \quad (4)

where p represents the probability.
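As referenced above, the following is a minimal PyTorch sketch of the domain-aware quadruplet loss of Equation 3; the function and argument names are our own, and the inputs are assumed to be FC_sim embeddings of the four quadruplet members.

```python
import torch.nn.functional as F

def domain_aware_quadruplet_loss(anc_sketch, pos_photo, neg_photo, neg_sketch,
                                 alpha: float = 0.2):
    """All arguments are (N, D) batches of FC_sim embeddings."""
    d_pos = F.pairwise_distance(anc_sketch, pos_photo)     # delta^+(i)
    d_neg_p = F.pairwise_distance(anc_sketch, neg_photo)   # delta^-_p(i)
    d_neg_s = F.pairwise_distance(anc_sketch, neg_sketch)  # delta^-_s(i)
    # Two hinge terms; .mean() plus the 0.5 factor gives the 1/(2N) of Eq. (3).
    return 0.5 * (F.relu(d_pos - d_neg_p + alpha)
                  + F.relu(d_pos - d_neg_s + alpha)).mean()
```

The semantic classification loss of Equation 4 then reduces to a standard call such as F.cross_entropy applied to the FC_id logits of all four quadruplet members.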
Semantic Knowledge Preservation Loss. Transfer learning plays a key role in SBIR tasks. Networks pre-trained on ImageNet have been fine-tuned for ZS-SBIR problems in previous works [12, 6, 4, 5]. However, Liu et al. [5] claim fine-tuning will cause catastrophic forgetting that decreases the ability of the fine-tuned network to adapt back to the original domain. To prevent a network from forgetting previously learned knowledge, Liu et al. generate a teacher signal for each of the training inputs for knowledge distillation. However, this requires extra training resources as inputs are also sent to a teacher network to generate the teacher signals. Moreover, their method also requires a language model for alignment. Here, we implement a similar knowledge distillation approach which is efficient and does not require a language model. As shown in Figure 2, we only use a class-based teacher signal rather than item-based teacher signals. The teacher signals are the softmax of the average activation of the teacher network E_te for each semantic class. The teacher signals can be considered as soft labels. The notation q(l(I)) represents the soft label of image I. To reduce the errors caused by domain shifts, we calculate q with the softmax of the average activation of each class that exists in the photo domain, as shown in Figure 2. These soft labels are only calculated once, so the process is efficient (a code sketch of this offline extraction is given at the end of this section). We use the cross-entropy loss with soft labels to calculate the knowledge preservation loss L_knowledge,

L_{knowledge} = \frac{1}{N} \sum_{i=1}^{N} \sum_{I \in Q} -q\left(l(I(i))\right) \log \sigma\left(E_{st}(I(i))\right), \quad (5)

where σ denotes the softmax function.

In summary, E_st is trained using the loss L in Equation 6, which is a combination of the three proposed objectives. For simplicity, the weights of each objective are set to 1:

L = L_{knowledge} + L_{cls} + L_{sim}. \quad (6)

PyTorch [42] is used as our implementation framework, and all models are trained with a single GTX 1080Ti GPU. We select an ImageNet pretrained ResNet50 as the backbone for both the teacher and student networks. We applied the SGD optimiser with momentum = 0.9 and weight decay = 5 × 10^{-4}. The batch size is 16, but it includes 64 images as each input is a quadruplet. The initial learning rate is λ = 1 × 10^{-3}, and it is decayed by a factor of 10 after every ten epochs. We trained all models for up to 25 epochs, which is fewer than previous works [4, 22, 5] require, as our model starts to converge after only a few training epochs. We also used early stopping based on validation accuracy: if the model's validation accuracy has not improved within 5 epochs, training is stopped.
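To make the offline soft-label extraction and Equation 5 concrete, the following is a minimal sketch under assumed data-loading details; the loader interface, function names and the photo-only sampling loop are our own illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

@torch.no_grad()
def extract_soft_labels(photo_loader, num_classes, device="cuda"):
    """One-time offline pass: average the 1,000-d ImageNet logits of a frozen
    teacher over the photos of each class, then apply a softmax per class."""
    teacher = models.resnet50(pretrained=True).eval().to(device)
    sums = torch.zeros(num_classes, 1000, device=device)
    counts = torch.zeros(num_classes, 1, device=device)
    for images, labels in photo_loader:  # photo-domain images only
        logits = teacher(images.to(device))
        sums.index_add_(0, labels.to(device), logits)
        counts.index_add_(0, labels.to(device),
                          torch.ones(len(labels), 1, device=device))
    return F.softmax(sums / counts, dim=1)  # q(l): one fixed soft label per class

def knowledge_loss(soft_logits, labels, soft_labels):
    """Cross-entropy of Eq. (5) between the per-class soft labels q(l(I))
    and the student's FC_soft outputs."""
    q = soft_labels[labels]
    return -(q * F.log_softmax(soft_logits, dim=1)).sum(dim=1).mean()
```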
4. Experiments
We evaluated our method on well-known large-scale SBIR datasets: Sketchy Extended, TU-Berlin Extended and QuickDraw Extended. An overall comparison of these datasets is given in Table 1.
Sketchy Extended is an extended version of the Sketchy dataset [25] by Liu et al. [14]. The Sketchy dataset has 125 categories. Each category is composed of 100 natural images and at least 600 sketches. Its photo domain is extended by adding an extra 60,502 natural images collected from ImageNet. The extended version has an average of 604 sketches and 584 images in each class, and it is a balanced dataset as the variance between the number of items in each class is relatively small. To adapt this dataset for zero-shot studies, the dataset is partitioned into seen and unseen sets. There exist two partition protocols in the literature. For clarity, we refer to them as SK-SH and SK-YE. SK-SH was proposed by Shen et al. [12], who created an unseen set by randomly selecting 25 classes, with the remaining 100 classes used as training classes. However, some of those randomly selected classes might have already been seen by networks initialised with ImageNet pretrained weights, and thus this violates the zero-shot setting. SK-YE was introduced by Yelamarthi et al. [38], who carefully selected 21 classes that are not present in ImageNet.
TU-Berlin Extended includes 20,000 sketches from the TU-Berlin dataset [26] and an extra 204,489 natural images collected by Liu et al. [14]. Its sketch domain has a uniform class distribution but with only 80 items per class, while the photo domain has around 787 items per class, but is highly imbalanced. It is, therefore, a challenging dataset. The partition protocol introduced by Shen et al. [12] is used for creating the zero-shot training and testing sets. We refer to this protocol as TUB-SH, where 30 randomly picked classes that include at least 400 photo images are used for testing, and the other classes are used for training.
QuickDraw Extended is a challenging dataset created by Dey et al. [4]. Compared to the Sketchy Extended and TU-Berlin Extended datasets, it includes more sketches (an average of 3,022 per class) and photos (an average of 1,853 per class). All sketches are drawn by amateurs, so they are very abstract and highly variable. Moreover, all classes are carefully selected to avoid ambiguity and overlap. A partition following a similar protocol to that proposed by Yelamarthi et al. [38] is provided. We name this partition QD-DE. With this partition, the dataset is split into 80 training and 30 testing classes.
Precision (P) and mean average precision (mAP) are the two main metrics for evaluating the ranked retrieval results for testing queries in related SBIR studies. Precision is calculated for the top K (i.e. 100, 200) ranked results, and mAP values are calculated for the top K or all ranked results. P@K is equal to the ratio between the number of relevant documents and the total number of documents in the top K retrieved results. P@K is also used for calculating the AP value of each query as follows:

AP@K = \sum_{i=1}^{K} \frac{P@i \times \gamma(i)}{N}, \quad (7)

where N is the total number of relevant documents and γ(i) is 1 if the i-th ranked result is relevant, and 0 otherwise. mAP@K is the mean of AP@K over all queries (a code sketch of this metric is given below).

We have compared the proposed method with SOTA methods on ZS-SBIR and its generalised version (where the search space includes both seen and unseen categories [20]). The results of the ZS-SBIR task on the Sketchy Extended and TU-Berlin Extended datasets are shown in Table 2, and the ZS-SBIR results on the QuickDraw Extended dataset are listed in Table 4. The results of the generalised ZS-SBIR task on the Sketchy Extended and TU-Berlin Extended datasets are shown in Table 3.

In all these experiments, we have shown improvements compared to SOTA methods. We have surpassed methods that have not utilised a language model by a large margin. We also compare the embedding feature sizes used by the methods. Our feature size is 512, which we note is relatively small compared to many other methods, and equal to the feature size of the previous best SOTA method, SAKE [5].
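For completeness, a minimal Python sketch of the metric in Equation 7 follows; the function names and the binary relevance-vector representation are our own choices.

```python
import numpy as np

def ap_at_k(rel, k, n_relevant):
    """AP@K of Eq. (7): rel is a binary vector, ordered by descending
    similarity, marking whether each ranked result is relevant (gamma)."""
    rel = np.asarray(rel, dtype=float)[:k]
    p_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # P@i for i = 1..K
    return float((p_at_i * rel).sum() / n_relevant)      # divide by N

def map_at_k(rels, k, n_relevants):
    """mAP@K: the mean of AP@K over all queries."""
    return float(np.mean([ap_at_k(r, k, n) for r, n in zip(rels, n_relevants)]))
```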
We also visualise the top-10 and top-5 results of some success and failure cases. These results are displayed in Figure 3 and Figure 4. Figure 3 displays both ZS-SBIR and GZS-SBIR results on the Sketchy Ext. and TU-Berlin Ext. datasets, while Figure 4 shows the top-10 results on the QuickDraw Ext. dataset. The proposed method returns perfect results when the given query is unambiguous, whereas it returns acceptable false-positive results when the query is unclear. This is more likely to happen when the search space is as large as in the generalised ZS-SBIR cases.
Here, we investigate the impact of each proposed learning objective and attention (Att.) [39] on the proposed approach using the TU-Berlin Extended and Sketchy Extended datasets. As shown in Table 5, we provide the results of models trained with several combinations of these losses. The model trained with the triplet loss is the baseline, and its results show that it is a challenging baseline which outperforms most of the state-of-the-art methods listed in Table 2. Each learning objective improves the results of the baseline. However, the attention mechanism did not bring any improvements to the final results. We suppose that, with the proposed learning objectives, the network is trained to pay attention to important information without an attention module.

Table 5: Ablation study for the proposed approach. The same backbone network trained with the triplet loss is used as the baseline.

Quad.  ID.  Know.  Att.   Sketchy Ext. (Split: SK-YE)           TU-Berlin Ext. (Split: TU-SH)
                          mAP@all  mAP@200  P@100  P@200        mAP@all  mAP@200  P@100  P@200
-      -    -      -      47.7     43.6     55.1   52.0         44.8     46.2     56.8   55.0
✓      -    -      -      48.5     44.4     56.0   52.9         46.3     47.5     58.0   56.2
✓      ✓    -      -      51.1     48.9     61.7   57.5         46.8     48.1     58.1   56.0
✓      -    ✓      -      51.8     49.5     62.7   58.3         47.2     48.7     58.7   56.7
✓      ✓    ✓      -      52.7     50.2     64.2   59.6         48.0     50.5     60.8   58.6
✓      ✓    ✓      ✓      –        –        –      –            –        –        –      –
Figure 3: Top-5 ZS-SBIR (a, c) and generalised ZS-SBIR (b, d) results retrieved by our model on the Sketchy Ext. (a, b) and TU-Berlin Ext. (c, d) datasets: (a) zero-shot, Sketchy (Split: SK-YE); (b) generalised zero-shot, Sketchy (Split: SK-YE); (c) zero-shot, TU-Berlin (Split: TUB-SH); (d) generalised zero-shot, TU-Berlin (Split: TUB-SH). Correct results are shown with a green border, while false results are shown with a red border. The top two rows are all correct, the third row is partially correct, while the bottom row is all incorrect.

Table 1: Comparison of public SBIR datasets. These datasets include images from the sketch and photo domains. For zero-shot studies, they are split into train (seen) and test (unseen) classes.

                     Sketchy Ext. [14]   TU-Berlin Ext. [10]   QuickDraw Ext. [4]
Sketches per class   604 ± 61            80 ± 0                3,022
Photos per class     584 ± 76            787 ± 489             1,853

Split name           SK-SH [12]   SK-YE [38]            TUB-SH [12]   QD-DE [4]
Type                 Random       ImageNet-orthogonal   Random        ImageNet-orthogonal
Train classes        100          104                   220           80
Test classes         25           21                    30            30
Table 2: A performance comparison of recent state-of-the-art ZS-SBIR methods.

Method          Dim.    Sketchy Ext. (Split: SK-SH)   Sketchy Ext. (Split: SK-YE)   TU-Berlin Ext. (Split: TU-SH)
                        mAP@all    P@100              mAP@200    P@200              mAP@all    P@100
ZSIH [12]       64      –          –                  –          –                  16.5       25.2
–               512     –          –                  –          –                  25.9       36.9
CVAE [38]       4,096   19.6       28.4               22.5       33.3               –          –
GZS-SBIR [16]   2,048   28.9       35.8               –          –                  23.8       33.4
SEM-PCYC [20]   64      34.9       46.3               –          –                  29.7       42.6
Ours            512     –          –                  –ᵃ         –ᵃ                 –          –

ᵃ These mAP@200 evaluations use a different formulation to ours. If we follow the same mAP@200 evaluation protocol, our mAP@200 value for Sketchy Ext. (Split: SK-YE) is 72.24.
Table 3: Comparison results of generalised ZS-SBIR on the Sketchy Extended and TU-Berlin Extended datasets.

Method            Dim.    Sketchy Ext. (Split: SK-SH)   TU-Berlin Ext. (Split: TU-SH)
                          mAP@all      P@100            mAP@all      P@100
ZSIH [12]         64      21.9         29.6             14.2         21.8
SEM-PCYC [20]     64      30.7         36.4             19.2         29.8
Style-guide [6]   4,096   33.1         38.1             14.9         22.6
Ours              512     –            –                –            –

Figure 4: Top-10 ZS-SBIR results retrieved by the proposed model on the QuickDraw Ext. dataset. Correct results are shown with a green border, while incorrect results are shown with a red border. The top two rows are all correct, the third row is partially correct, while the bottom row is all incorrect.

Table 4: Comparison results on the QuickDraw Extended dataset.

Method              Dim.    mAP@all   P@200
CVAE [38]           4,096   0.30      0.30
Doodle2Search [4]   4,096   7.52      6.75
Ours                512     11.88     16.65

5. Conclusion

In this work, we propose a simple and efficient framework for zero-shot sketch-based image retrieval (ZS-SBIR). The model is trained in an end-to-end fashion with three introduced losses: the domain-aware quadruplet loss, semantic classification loss and semantic knowledge preservation loss. The domain-aware quadruplet loss addresses the issue of domain imbalance that occurs when using the vanilla triplet loss that is frequently used to reduce the domain gap and learn a shared low-dimensional feature space. In addition, categorical semantic classification is also used to learn semantic features. To enhance the zero-shot ability of the learned model, the semantic knowledge preservation loss is introduced. This loss is formulated to prevent the rich knowledge learned from the ImageNet dataset from being forgotten during fine-tuning of the pre-trained ImageNet model that is used by the network. Experiments on three challenging ZS-SBIR datasets show that the proposed framework is more efficient and effective than related works. Moreover, extensive ablation studies show each introduced loss brings non-trivial improvements and contributes to the state-of-the-art performance.
References

[1] A. Gordo, J. Almazán, J. Revaud, D. Larlus, Deep image retrieval: Learning global representations for image search, in: European Conference on Computer Vision, Springer, 2016, pp. 241–257.
[2] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[3] O. Tursun, S. Denman, S. Sivapalan, S. Sridharan, C. Fookes, S. Mau, Component-based attention for large-scale trademark retrieval, IEEE Transactions on Information Forensics and Security.
[4] S. Dey, P. Riba, A. Dutta, J. Llados, Y.-Z. Song, Doodle to search: Practical zero-shot sketch-based image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2179–2188.
[5] Q. Liu, L. Xie, H. Wang, A. L. Yuille, Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3662–3671.
[6] T. Dutta, S. Biswas, Style-guided zero-shot sketch-based image retrieval, in: BMVC, 2019, p. 209.
[7] T. Dutta, A. Singh, S. Biswas, Adaptive margin diversity regularizer for handling data imbalance in zero-shot SBIR, in: European Conference on Computer Vision, Springer, 2020, pp. 349–364.
[8] Y. Wang, F. Huang, Y. Zhang, R. Feng, T. Zhang, W. Fan, Deep cascaded cross-modal correlation learning for fine-grained sketch-based image retrieval, Pattern Recognition 100 (2020) 107148.
[9] F. Huang, C. Jin, Y. Zhang, K. Weng, T. Zhang, W. Fan, Sketch-based image retrieval with deep visual semantic descriptor, Pattern Recognition 76 (2018) 537–548.
[10] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, X. Cao, Sketchnet: Sketch classification with web images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1105–1113.
[11] P. Sangkloy, N. Burnell, C. Ham, J. Hays, The sketchy database: learning to retrieve badly drawn bunnies, ACM Transactions on Graphics (TOG) 35 (4) (2016) 1–12.
[12] Y. Shen, L. Liu, F. Shen, L. Shao, Zero-shot sketch-image hashing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3598–3607.
[13] J. Lei, Y. Song, B. Peng, Z. Ma, L. Shao, Y.-Z. Song, Semi-heterogeneous three-way joint embedding network for sketch-based image retrieval, IEEE Transactions on Circuits and Systems for Video Technology.
[14] L. Liu, F. Shen, Y. Shen, X. Liu, L. Shao, Deep sketch hashing: Fast free-hand sketch-based image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2862–2871.
[15] J. Zhang, F. Shen, L. Liu, F. Zhu, M. Yu, L. Shao, H. Tao Shen, L. Van Gool, Generative domain-migration hashing for sketch-to-image retrieval, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 297–314.
[16] V. Kumar Verma, A. Mishra, A. Mishra, P. Rai, Generative model for zero-shot sketch-based image retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[17] K. Pang, Y.-Z. Song, T. Xiang, T. M. Hospedales, Cross-domain generative learning for fine-grained sketch-based image retrieval, in: BMVC, 2017, pp. 1–12.
[18] L. Guo, J. Liu, Y. Wang, Z. Luo, W. Wen, H. Lu, Sketch-based image retrieval using generative adversarial networks, in: Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1267–1268.
[19] J. Zhu, X. Xu, F. Shen, R. K.-W. Lee, Z. Wang, H. T. Shen, Ocean: A dual learning approach for generalized zero-shot sketch-based image retrieval, in: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2020, pp. 1–6.
[20] A. Dutta, Z. Akata, Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval, in: CVPR, 2019.
[21] Z. Zhang, Y. Zhang, R. Feng, T. Zhang, W. Fan, Zero-shot sketch-based image retrieval via graph convolution network, in: AAAI, 2020, pp. 12943–12950.
[22] J. Li, Z. Ling, L. Niu, L. Zhang, Bi-directional domain translation for zero-shot sketch-based image retrieval, arXiv preprint arXiv:1911.13251.
[23] A. Pandey, A. Mishra, V. K. Verma, A. Mittal, H. Murthy, Stacked adversarial network for zero-shot sketch based image retrieval, in: The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2540–2549.
[24] G. A. Miller, WordNet: An electronic lexical database, MIT Press, 1998.
[25] P. Sangkloy, N. Burnell, C. Ham, J. Hays, The sketchy database: Learning to retrieve badly drawn bunnies, ACM Transactions on Graphics (proceedings of SIGGRAPH).
[26] M. Eitz, J. Hays, M. Alexa, How do humans sketch objects?, ACM Transactions on Graphics (TOG) 31 (4) (2012) 1–10.
[27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[28] R. Hu, J. Collomosse, A performance evaluation of gradient field HOG descriptor for sketch based image retrieval, Computer Vision and Image Understanding 117 (7) (2013) 790–806.
[29] J. M. Saavedra, Sketch based image retrieval using a soft computation of the histogram of edge local orientations (S-HELO), in: 2014 IEEE International Conference on Image Processing (ICIP), IEEE, 2014, pp. 2998–3002.
[30] J. M. Saavedra, J. M. Barrios, S. Orand, Sketch based image retrieval using learned keyshapes (LKS), in: BMVC, Vol. 1, 2015, p. 7.
[31] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, C.-C. Loy, Sketch me that shoe, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 799–807.
[32] Y. Qi, Y.-Z. Song, H. Zhang, J. Liu, Sketch-based image retrieval via siamese convolutional neural network, in: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, 2016, pp. 2460–2464.
[33] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, T. M. Hospedales, Deep spatial-semantic attention for fine-grained sketch-based image retrieval, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5551–5560.
[34] C. Bai, J. Chen, Q. Ma, P. Hao, S. Chen, Cross-domain representation learning by domain-migration generative adversarial network for sketch based image retrieval, Journal of Visual Communication and Image Representation 71 (2020) 102835.
[35] U. Chaudhuri, B. Banerjee, A. Bhattacharya, M. Datcu, A simplified framework for zero-shot cross-modal sketch data retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 182–183.
[36] C. Deng, X. Xu, H. Wang, M. Yang, D. Tao, Progressive cross-modal semantic network for zero-shot sketch-based image retrieval, IEEE Transactions on Image Processing 29 (2020) 8892–8902.
[37] X. Xu, K. Lin, H. Lu, L. Gao, H. T. Shen, Correlated features synthesis and alignment for zero-shot cross-modal retrieval, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1419–1428.
[38] S. K. Yelamarthi, S. K. Reddy, A. Mishra, A. Mittal, A zero-shot framework for sketch based image retrieval, in: European Conference on Computer Vision, Springer, 2018, pp. 316–333.
[39] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International Conference on Machine Learning, 2015, pp. 2048–2057.
[40] W. Chen, X. Chen, J. Zhang, K. Huang, Beyond triplet loss: a deep quadruplet network for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 403–412.
[41] A. Khatun, S. Denman, S. Sridharan, C. Fookes, Joint identification-verification for person re-identification: A four stream deep learning approach with improved quartet loss function, Computer Vision and Image Understanding (2020) 102989.
[42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035.