SHREWD: Semantic Hierarchy-based Relational Embeddings for Weakly-supervised Deep Hashing
Heikki Arponen and Tom E Bishop
Intuition Machines Inc.
{heikki,tom}@intuitionmachines.com

Abstract
Using class labels to represent class similarity is a typical approach to training deep hashing systems for retrieval; samples from the same or different classes take binary 1 or 0 similarity values. This similarity does not model the full rich knowledge of semantic relations that may be present between data points. In this work we build upon the idea of using semantic hierarchies to form distance metrics between all available sample labels; for example cat to dog has a smaller distance than cat to guitar. We combine this type of semantic distance into a loss function to promote similar distances between the deep neural network embeddings. We also introduce an empirical Kullback-Leibler divergence loss term to promote binarization and uniformity of the embeddings. We test the resulting SHREWD method and demonstrate improvements in hierarchical retrieval scores using compact, binary hash codes instead of real valued ones, and show that in a weakly supervised hashing setting we are able to learn competitively without explicitly relying on class labels, but instead on similarities between labels.
1 Introduction
Content-Based Image Retrieval (CBIR) on very large datasets typically relies on hashing for efficient approximate nearest neighbor search; see e.g. Wang et al. (2016) for a review. Early methods such as locality-sensitive hashing (LSH) (Gionis et al., 1999) were data-independent, but data-dependent methods (either supervised or unsupervised) have since shown better performance. Recently, deep hashing methods using CNNs have had much success over traditional methods; see e.g. HashNet (Cao et al., 2017) and DADH (Li et al., 2018). Most supervised hashing techniques rely on a pairwise binary similarity matrix $S = \{s_{ij}\}$, whereby $s_{ij} = 1$ for images $i$ and $j$ taken from the same class, and 0 otherwise.

A richer set of affinities is possible using semantic relations, for example in the form of class hierarchies. Yan et al. (2017) consider the semantic hierarchy for non-deep hashing, minimizing the deviation of the inner product distance of hash codes from the distance in the semantic hierarchy. In the SHDH method (Wang et al., 2017), the pairwise similarity matrix is defined from such a hierarchy according to a weighted sum of weighted Hamming distances.

In Unsupervised Semantic Deep Hashing (USDH, Jin (2018)), semantic relations are obtained by looking at embeddings of a model pre-trained on ImageNet (VGG). The goal of the semantic loss there is simply to minimize the distance between binarized hash codes and their pre-trained embeddings, i.e. neighbors in hashing space are neighbors in pre-trained feature space. This is somewhat similar to our notion of semantic similarity, except for using a pre-trained embedding instead of a pre-labeled semantic hierarchy of relations.

Zhe et al. (2019) consider class-wise deep hashing, in which a clustering-like operation is used to form a loss between samples both from the same class and from different levels of the hierarchy.

Recently, Barz & Denzler (2018) explored image retrieval using semantic hierarchies to design an embedding space, in a two-step process. First, they directly find embedding vectors of the class labels on a unit hypersphere, using a linear-algebra-based approach, such that the distances of these embeddings are similar to the supplied hierarchical similarity. In the second stage, they train a standard CNN encoder model to regress images towards these embedding vectors. They do not consider hashing in their work.

2 Formulation
We also make use of hierarchical relational distances in a similar way to constrain our embeddings. However, compared to our work, Barz & Denzler (2018) consider continuous representations and require the embedding dimension to equal the number of classes, whereas we learn compact quantized hash codes of arbitrary length, which are more practical for real-world retrieval performance. Moreover, we do not directly find fixed target embeddings for the classes, but instead require that the neural network embeddings be learned in conjunction with the network weights, to best match the similarities derived from the labels. And unlike Zhe et al. (2019), in our weakly supervised SHREWD method, we do not require explicit class membership, only relative semantic distances to be supplied.

Let $(x, y)$ denote a training example pair consisting of an image and some (possibly weakly) supervised target $y$, which can be a label, tags, captions, etc. The embeddings are defined as $\hat{z} = f_\theta(x)$ for a deep neural network $f$ parameterized by weights $\theta$. Instead of learning to predict the target $y$, we assume that there exists an estimate of similarity between targets, $d(y, y')$. The task of the network is then to learn this similarity by attempting to match $\|\hat{z} - \hat{z}'\|$ with $d(y, y')$ under some predefined norm in the embedding space.

While in this work we use class hierarchies to implicitly inform our loss function via the similarity metric $d$ (a concrete sketch is given at the end of this section), in general our formulation is weakly supervised in the sense that these labels themselves are not directly required as targets. We could equally well replace this target metric space with any other metric, based for instance on web-mined noisy tag distances in a word embedding space such as GloVe or word2vec, as in Frome et al. (2013), or on ranked image similarities according to recorded user preferences.

In addition to learning similarities between images, it is important to try to fully utilize the available hashing space, in order to facilitate efficient retrieval by using the Hamming distance to rank the most similar images to a given query image. Consider for example a perfect ImageNet classifier. We could trivially map all 1000 class predictions to a 10-bit hash code, which would yield a perfect mAP score. The retrieval performance of such a "mAP-miner" model would however be poor, because the model is unable to rank examples either within a given class or between different classes (Ding et al., 2018). We therefore introduce an empirical Kullback-Leibler (KL) divergence term between the embedding distribution and a (near-)binary target distribution, which we add as an additional loss term. The KL loss serves an additional purpose in driving the embeddings close to binary values in order to reduce the information loss due to binarizing the embeddings.

We next describe the loss function, $\mathcal{L}(\theta)$, that we minimize in order to train our CNN model. We break our approach down into the following three parts:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{sim}} + \lambda_1 \mathcal{L}_{\mathrm{KL}} + \lambda_2 \mathcal{L}_{\mathrm{cls}}, \qquad (1)$$

where $\mathcal{L}_{\mathrm{cls}}$ represents a traditional categorical cross-entropy loss on top of a linear layer with softmax placed on the non-binarized latent codes. The meaning and use of each of the other two terms are described in more detail below. Similarly to Barz & Denzler (2018), we consider variants with and without $\mathcal{L}_{\mathrm{cls}}$, giving variants of the algorithm we term SHREWD (weakly supervised, no explicit class labels needed) and SHRED (fully supervised).
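For illustration, one concrete choice of $d(y, y')$ over a label hierarchy is the height of the lowest common ancestor of the two labels, normalized by the height of the hierarchy, similar in spirit to Barz & Denzler (2018). The minimal Python sketch below uses a toy hierarchy; the tree, node names and normalization are illustrative assumptions rather than the exact metric used in our experiments.

```python
# Sketch: one possible hierarchy-based label distance d(y, y').
# The toy hierarchy and the normalized-LCA-height metric are
# illustrative assumptions; any label metric (e.g. word-embedding
# distances) could be plugged in instead.

# parent[node] maps each class or superclass to its parent.
parent = {
    "cat": "mammal", "dog": "mammal",
    "mammal": "animal", "animal": "entity",
    "guitar": "instrument", "instrument": "entity",
}

def ancestors(node):
    """Return the path from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def d(y1, y2):
    """Semantic distance: height of the lowest common ancestor,
    normalized by the depth of y1 (assumes a balanced hierarchy,
    so the distance is symmetric)."""
    a1, a2 = ancestors(y1), ancestors(y2)
    lca = next(n for n in a1 if n in a2)  # lowest common ancestor
    return a1.index(lca) / (len(a1) - 1)

print(d("cat", "dog"))     # small: 'mammal' is a close shared ancestor
print(d("cat", "guitar"))  # large: only the root 'entity' is shared
```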
2.1 Semantic Similarity Loss

In order to weakly supervise using a semantic similarity metric, we seek to match the normalized distances in the learned embedding space to the normalized distances in the semantic space. We therefore define

$$\mathcal{L}_{\mathrm{sim}} = \frac{1}{B} \sum_{b, b'=1}^{B} \left| \tau_z \|\hat{z}_b - \hat{z}_{b'}\|_M - \tau_y\, d(y_b, y_{b'}) \right| w_{bb'}, \qquad (2)$$

where $B$ is the minibatch size, $\|\cdot\|_M$ denotes the Manhattan distance (because in the end we will measure similarity in the binary space by Hamming distance), $d(y_b, y_{b'})$ is the given ground-truth similarity, and $w_{bb'}$ is an additional weight, which is used to give more weight to similar example pairs (e.g. cat-dog) than to distant ones (e.g. cat-moon). $\tau_z$ and $\tau_y$ are normalizing scale factors estimated from the current batch. We use a slowly decaying form for the weight, $w_{bb'} = \gamma^\rho / (\gamma + d(y_b, y_{b'}))^\rho$, with parameter values $\gamma = 0.$ and $\rho = 2$.
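A minimal PyTorch sketch of Eq. (2) follows; estimating $\tau_z$ and $\tau_y$ from inverse batch mean distances is one simple choice of estimator, and the default value of $\gamma$ is a placeholder assumption.

```python
import torch

def similarity_loss(z, d_y, gamma=0.5, rho=2.0, eps=1e-8):
    """Distance-matching loss of Eq. (2).

    z:   (B, L) continuous embeddings for the minibatch.
    d_y: (B, B) ground-truth semantic distances d(y_b, y_b').
    gamma, rho: weight parameters; the gamma default here is a
    placeholder assumption (its exact value is truncated in the text).
    """
    B = z.shape[0]
    # Pairwise Manhattan distances between embeddings.
    d_z = torch.cdist(z, z, p=1)                    # (B, B)
    # Normalizing scale factors estimated from the current batch;
    # inverse mean distances are one simple choice.
    tau_z = 1.0 / (d_z.mean() + eps)
    tau_y = 1.0 / (d_y.mean() + eps)
    # Slowly decaying weight, emphasizing semantically close pairs.
    w = (gamma / (gamma + d_y)) ** rho
    return (torch.abs(tau_z * d_z - tau_y * d_y) * w).sum() / B
```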
Method                                 mAP      mAHP@250   Classification accuracy
DeViSE (Frome et al., 2013)†           –        –          –
Barz & Denzler (2018)†, L_CORR         –        –          –
Barz & Denzler (2018)†, L_CORR + CLS‡  –        –          n/a
L_sim only                             0.2204   0.7007     10.01%
L_cls only                             0.5647   0.8124     73.00%
L_sim + L_cls only                     0.5292   0.8188     69.68%
L_KL + L_cls only                      0.3010   0.6215     69.25%
SHREWD [Ours], L_KL + L_sim            –        –          –
SHREWD [Ours], L_KL + L_sim + L_cls    –        –          –

Table 1: Comparison with previous methods and ablation study on CIFAR-100. † indicates non-quantized embedding codes. ‡ mAHP@2500 measured with this method, so not equivalent. Note that while L_cls performs best on supervised classification, L_sim allows for better retrieval performance; however, this is degraded unless L_KL is also included to regularize towards binary embeddings. For measuring classification accuracy on methods that do not include L_cls, we use a linear classifier with the same structure as in L_cls, trained on the output of the first network.
2.2 KL-Divergence Based Distribution Matching Loss

Our empirical loss for minimizing the KL divergence $\mathrm{KL}(p \,\|\, q) \doteq \int \mathrm{d}z\, p(z) \log(p(z)/q(z))$ between the sample embedding distribution $p(z)$ and a target distribution $q(z)$ is based on the Kozachenko-Leonenko estimator of entropy (Kozachenko & Leonenko, 1987), and can be defined as

$$\mathcal{L}_{\mathrm{KL}} = \frac{1}{B} \sum_{b=1}^{B} \left[ \log\left(\nu(\hat{z}_b; z)\right) - \log\left(\nu(\hat{z}_b; \hat{z})\right) \right], \qquad (3)$$

where $\nu(\hat{z}_b; z)$ denotes the distance of $\hat{z}_b$ to the nearest vector $z_{b'}$, and $z$ is a sample (of e.g. size $B$) of vectors from a target distribution. We employ the beta distribution with parameters $\alpha = \beta = 0.$ as this target distribution, which is thus moderately concentrated towards binary values in the embedding space. The result is that our embedding vectors are regularized towards uniform binary values, whilst still enabling continuous backpropagation through the network and giving some flexibility in allowing the distance matching loss to perform its job. When quantized, the resulting embeddings are likely to be similar to their continuous values, meaning that the binary codes will have distances more similar to their corresponding semantic distances.
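A PyTorch sketch of Eq. (3) is given below; the within-batch nearest-neighbor search, the Euclidean norm for $\nu$, and the default Beta parameter are illustrative assumptions.

```python
import torch

def kl_loss(z_hat, alpha=0.1, eps=1e-8):
    """Kozachenko-Leonenko-style empirical KL term of Eq. (3).

    z_hat: (B, L) embeddings, assumed to lie in [0, 1] (e.g. after
    a sigmoid) so that a Beta target distribution makes sense.
    alpha: Beta(alpha, alpha) target concentrated near {0, 1};
    the value here is a placeholder assumption.
    """
    # Sample a batch of target vectors from the near-binary target.
    beta = torch.distributions.Beta(alpha, alpha)
    z_tgt = beta.sample(z_hat.shape).to(z_hat.device)
    # nu(z_hat_b; z): distance to the nearest target sample.
    nu_tgt = torch.cdist(z_hat, z_tgt).min(dim=1).values
    # nu(z_hat_b; z_hat): distance to the nearest *other* embedding.
    d_self = torch.cdist(z_hat, z_hat)
    d_self.fill_diagonal_(float("inf"))
    nu_self = d_self.min(dim=1).values
    return (torch.log(nu_tgt + eps) - torch.log(nu_self + eps)).mean()
```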
3 Experimental Results

Metrics
As discussed in Section 2, the mAP score can be a misleading metric for retrieval performance when using class information only. Similarly to other works such as Deng et al. (2011) and Barz & Denzler (2018), we focus on measuring retrieval performance with semantic hierarchical relations taken into account, via the mean Average Hierarchical Precision (mAHP). However, more in line with other hashing works, we use the Hamming distance of the binary codes for ranking the retrieved results, as sketched below.
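For concreteness, retrieval with binary codes amounts to thresholding the embeddings and sorting the database by Hamming distance to the query. A small NumPy sketch (the 0.5 threshold assumes embeddings regularized towards {0, 1} as in Section 2.2):

```python
import numpy as np

def binarize(z):
    """Quantize continuous embeddings in [0, 1] to binary codes."""
    return (z > 0.5).astype(np.uint8)

def rank_by_hamming(query_code, db_codes):
    """Return database indices sorted by Hamming distance to the query."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")

# Toy usage with 4-bit codes for a database of three items.
db = binarize(np.array([[0.9, 0.1, 0.8, 0.2],
                        [0.1, 0.1, 0.9, 0.9],
                        [0.8, 0.2, 0.7, 0.1]]))
q = binarize(np.array([0.9, 0.2, 0.9, 0.1]))
print(rank_by_hamming(q, db))  # indices of nearest codes first
```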
CIFAR-100
We first test on CIFAR-100 (Krizhevsky & Hinton, 2009) using the same semantic hierarchy and ResNet-110w architecture as in Barz & Denzler (2018), where only the top fully connected layer is replaced to return embeddings at the size of the desired hash length. See Tables 1 and 2 for comparisons with previous methods, an ablation study, and the effects of hash code length.
Code length   mAP      mAHP     Classification accuracy
16 bits       0.3577   0.7478   65.65%
32 bits       0.5114   0.8202   65.00%
64 bits       0.6514   0.8690   70.79%
128 bits      –        –        –

Table 2: Effect of hash code length on CIFAR-100.
Method                                              mAP      mAHP@250   Classification accuracy
Barz & Denzler (2018) (floating point embeddings)   0.3037   0.7902     48.97%
SHREWD [Ours] (64-bit binary)                       0.4604   0.8676     —
SHREWD [Ours] (128-bit binary hash codes)           0.5067   0.8674     —
SHREWD [Ours] (floating point embeddings)           —        0.8733     63.28%
SHRED [Ours] (64-bit binary)                        –        –          —

Table 3: Performance on ILSVRC 2012, floating point vs. quantized hash codes (NB: the classifier is only trained using floating point embeddings as features).
ILSVRC 2012
We also evaluate on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 dataset. For similarity labels, we use the same tree-structured WordNet hierarchy as in Barz & Denzler (2018). We use a standard ResNet-50 architecture with a fully connected hashing layer as before. Retrieval results are summarized in Table 3. We compare the resulting hierarchical precision scores with and without $\mathcal{L}_{\mathrm{KL}}$, for binarized and continuous values, in Figure 1. We see that our results improve on the previously reported hierarchical retrieval results whilst using quantized embeddings, enabling efficient retrieval.
Figure 1: Hierarchical precision @k for CIFAR-100 (left) and ILSVRC-2012 (right) for 64-bit SHREWD. We see a substantial drop in the precision after binarization when not using the KL loss. Also, binarization does not cause as severe a drop in precision when using the KL loss.
4 Conclusions
We approached deep hashing for retrieval, introducing novel combined loss functions that balance code binarization with equivalent distance matching from hierarchical semantic relations. We have demonstrated new state-of-the-art results for semantic hierarchy based image retrieval (mAHP scores) on CIFAR and ImageNet with both our fully supervised (SHRED) and weakly supervised (SHREWD) methods.

References
Björn Barz and Joachim Denzler. Hierarchy-based Image Embeddings for Semantic Image Retrieval. September 2018.

Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. HashNet: Deep Learning to Hash by Continuation. February 2017.

Jia Deng, Alexander C. Berg, and Fei-Fei Li. Hierarchical semantic indexing for large scale image retrieval. CVPR, pp. 785–792, 2011.

Pak Lun Kevin Ding, Yikang Li, and Baoxin Li. Mean Local Group Average Precision (mLGAP): A New Performance Metric for Hashing-based Retrieval. November 2018.

Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. NIPS, 2013.

Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity Search in High Dimensions via Hashing. VLDB, 1999.

Sheng Jin. Unsupervised Semantic Deep Hashing. March 2018.

L. F. Kozachenko and N. N. Leonenko. Sample Estimate of the Entropy of a Random Vector. Probl. Peredachi Inf., 23:2, 9–16, 1987.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

Jinxing Li, Bob Zhang, Guangming Lu, and David Zhang. Dual Asymmetric Deep Hashing Learning. January 2018.

Xu Sun, Bingzhen Wei, Xuancheng Ren, and Shuming Ma. Label Embedding Network: Learning Label Representation for Soft Training of Deep Networks. October 2017.

Dan Wang, Heyan Huang, Chi Lu, Bo-Si Feng, Liqiang Nie, Guihua Wen, and Xian-Ling Mao. Supervised Deep Hashing for Hierarchical Labeled Data. April 2017.

Jingdong Wang, Ting Zhang, Jingkuan Song, Nicu Sebe, and Heng Tao Shen. A Survey on Learning to Hash. June 2016.

Cheng Yan, Xiao Bai, Jun Zhou, and Yun Liu. Hierarchical Hashing for Image Retrieval, 2017.

Xuefei Zhe, Shifeng Chen, and Hong Yan. Deep Class-wise Hashing: Semantics-Preserving Hashing via Class-wise Loss, 2019.