Embarrassingly Simple Binary Representation Learning
Yuming Shen, Jie Qin, Jiaxin Chen, Li Liu, and Fan Zhu
Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE
{ymcidence, qinjiebuaa, chenjiaxinx, liuli1213, fanzhu1987}@gmail.com

Abstract
Recent binary representation learning models usually require sophisticated binary optimization, similarity measure or even generative models as auxiliaries. However, one may wonder whether these non-trivial components are needed to formulate practical and effective hashing models. In this paper, we answer the above question by proposing an embarrassingly simple approach to binary representation learning. With a simple classification objective, our model only incorporates two additional fully-connected layers onto the top of an arbitrary backbone network, whilst complying with the binary constraints during training. The proposed model lower-bounds the Information Bottleneck (IB) between data samples and their semantics, and can be related to many recent 'learning to hash' paradigms. We show that, when properly designed, even such a simple network can generate effective binary codes, by fully exploring data semantics without any held-out alternating updating steps or auxiliary models. Experiments are conducted on conventional large-scale benchmarks, i.e., CIFAR-10, NUS-WIDE, and ImageNet, where the proposed simple model outperforms the state-of-the-art methods. Our codes are available at https://github.com/ymcidence/JMLH.
1. Introduction
Approximate nearest neighbour search with binary representations has been regarded as an effective and efficient solution to large-scale multimedia data retrieval. Conventionally termed as learning to hash, this family of techniques aims at (a) shrinking the embedding size of data and (b) producing binary features to speed up the computation of distance-based pair-wise data relevance. Similar to many other machine learning tasks, learning to hash can be either unsupervised or supervised. The former requires less labeling effort for training, while the latter obtains better performance in retrieval. We focus on supervised hashing to fully leverage the semantic information of data.

Recent research in this field largely boosts the performance of the produced hash codes by introducing deep learning techniques. Deep hashing models typically employ an indifferentiable sign activation on the top of the encoding network. Various methods have been proposed to empower the encoder with the ability to properly locate data in the Hamming space. A typical approach is to employ a held-out code learner as a complement to network training [11, 29, 40]. The code learner performs discrete optimization and alternately updates the semantic-based target codes to govern the behavior of the encoding network. This approach generally requires longer training time, since the held-out discrete optimization step cannot be effectively parallelized, and it consumes additional memory to cache the target codes during each round of update. Alternatively, some propose to decouple unrelated data representations by introducing similarity-based penalties to the encoders [7, 42, 43, 44]. To train an encoder with these regularizers, one may resort to continuous relaxation on the codes, which arguably degrades the training quality. One recent fashion in deep hashing is to employ generative adversarial models [5, 13, 34, 45]. By distinguishing synthesized data from real ones, the encoder implicitly acknowledges the respective data distribution. However, the above carefully designed approaches raise another question:
How to build an effective supervised hashing model with minimum auxiliary components?
We attempt to find the answer by carefully considering the following main challenges of learning to hash:

• Keeping the discrete nature of binary codes.
The binary constraints usually lead to an NP-hard optimization problem in parameterized models, and cannot be directly solved by gradient-based methods. This is usually addressed by conventional methods using held-out discrete optimization or relaxation techniques.

• Enriching the information carried by the codes. It is always essential to make the encoder aware of the semantic information (e.g., labels or tags) of data.

As a result, in this paper, we propose a simple but powerful deep hashing network. In our model, the above problems are tackled by relating data and their semantics with a binary representation bottleneck, which is thereafter used as the final hash code. A single recognition penalty is applied for training. With a reasonable regularization term, the final learning objective forms a variational lower bound of the Information Bottleneck (IB) [2, 36] between observed data and their semantics. Importantly, one can impose stochasticity on the binary bottleneck to keep the binary constraints and apply gradient estimation methods during training. Therefore, the whole framework can be optimized end-to-end with Stochastic Gradient Descent (SGD). To this end, we find our design leads to an embarrassingly simple solution, which basically shapes a single classification neural network.

Regardless of the regularization, the proposed model just maximizes the label likelihood of data. Thus, we name our model Just-Maximizing-Likelihood Hashing (JMLH). The contributions of this paper are summarized as follows:

• We propose a simple and novel deep hashing model, i.e., JMLH, and theoretically base it on the Variational Information Bottleneck (VIB) [2] method. To the best of our knowledge, JMLH is the first attempt in deep hashing to employ the IB methods.

• We show that, when properly designed and trained, a classification neural network with a discrete bottleneck already produces effective binary representations. Therefore, the proposed model requires no auxiliary components and can be optimized directly.

• Relations between JMLH and many existing hashing models are discussed in detail.

• JMLH successfully outperforms state-of-the-art hashing techniques on several benchmark datasets, i.e., CIFAR-10 [20], NUS-WIDE [9] and ImageNet [28].

In the rest of this paper, we first describe our model in detail in Section 2. Subsequently, the relationships between JMLH and existing works are elaborated in Section 3. Section 4 presents the implementation details and experimental findings, with a brief conclusion given in Section 5.
2. Model
The goal of learning to hash is to find an optimal encoding function f : X → {0, 1}^m to represent data. Here X is the variable space of data observations and m refers to the length of the hash code space B. In the context of supervised hashing, training is usually supported by the data labels Y. We intendedly use capitalized notations, i.e., X, Y and B, for the (random) variable spaces, and denote the respective variable instances with lower-cased letters, i.e., x, y and b.

Figure 1. The directed graphical model of JMLH. We treat the hash code b as the latent bottleneck between data x and their labels y. The dotted lines define the stochastic encoding procedure of q(B|X), and the solid lines denote the approximated likelihood q(Y|B). n is the total number of observed data points. Note that the respective parameters θ and φ are jointly learned, forming an extremely simple training model.

JMLH involves a stochastic encoder q(B|X) and a classifier q(Y|B). An additional deterministic distribution p(B) is used as the prior of B. This model is illustrated in Figure 1 as a directed graphical model. Particularly, each datum x ∈ X is firstly associated with a latent binary code b ∈ B according to q(B|X), and then the respective label y ∈ Y can be predicted by feeding q(Y|B) with b. Therefore, B can be regarded as the bottleneck between X and Y. Successively applying q(B|X) and q(Y|B) according to the above procedure specifies a single-task neural network with a binary layer in between, which makes JMLH extremely simple. We firstly describe the above-mentioned probabilistic models and then discuss how they are combined as a whole for efficient end-to-end training.

Given a training pair (x, y), the corresponding probability models of q(b|x) and q(y|b) in JMLH are defined as

q(b|x) = P(b | κ(x; θ)),
q(y|b) = Cat(y | π(b; φ)) or P(y | π(b; φ)),
p(b) = B(b | m, 0.5).   (1)

Here P(b | κ(x; θ)) indicates the Poisson binomial distribution, parameterized by a neural network κ(x; θ) as follows:

P(b | κ(x; θ)) = ∏_{i=1}^{m} κ_i^{b_i} (1 − κ_i)^{1 − b_i}.   (2)

We use q(·) to denote an approximated posterior when one cannot directly model the corresponding true distribution, e.g., q(B|X). On the other hand, p(·) is used when the distribution can be deterministically defined or computed, e.g., the pre-defined prior p(B).

Table 1. Network settings of JMLH. All layers are sequentially applied.
Notation | Specification | Variable
Input | Arbitrary data; images in our experiments | X
κ(x; θ) | Arbitrary network backbone (AlexNet [21] before fc7 in our experiments), followed by a fully-connected layer of size m and a binary stochastic activation | B
π(b; φ) | Fully-connected layer whose size equals the label length, with softmax (single-label datasets) or sigmoid (multi-label datasets) | Y

On the other hand, q(y|b) can be either categorical for single-label classification, i.e., Cat(y | π(b; φ)), or Poisson binomial for multi-label classification, i.e., P(y | π(b; φ)), implemented by another network π(b; φ). We additionally introduce p(b) of a binomial distribution B(b | m, 0.5) as the code prior for regularization purposes.

Note that we choose discrete probability models for B to avoid the use of continuous relaxation. That is to say, the input to the classifier π(·) is already binarized. Continuous relaxation, e.g., activating the neurons with a sigmoid non-linearity, is not considered here, as it skews the observation of the classifier, propagating biased semantic information measurement back to the encoder.

Sequentially stacking κ(x; θ) and π(b; φ) empirically forms a classification neural network with a binary bottleneck B, whose brief structure is illustrated in Table 1. It can be seen that JMLH only introduces two additional layers on the top of an arbitrary network backbone, which makes it easy to adopt with different pre-trained models and convenient to implement.

Then we define the learning objective of this single network with n given training pairs {(x, y)}^n as

L = (1/n) Σ_{(x, y)} E_{q(b|x)}[−log q(y|b)] + λ KL(q(b|x) || p(b)),   (3)

where the first term is the classification objective, the second term is the regularization, and λ is a hyper-parameter. All the probability models are defined in Eq. (1). We first elaborate each component of Eq. (3) in this subsection and later show that this learning objective is supported by VIB [2] in Section 2.2.1.

The first Right-Hand-Side (RHS) term of Eq. (3), i.e., −log q(y|b), is actually a negative log-likelihood classification penalty since q(y|b) is categorical. This loss conveys semantic label information of data to their codes during training.
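To make the structure in Table 1 concrete, the following is a minimal sketch of the JMLH head in TensorFlow/Keras. The names (JMLHHead, binary_stochastic) and the straight-through-style surrogate gradient are illustrative assumptions, not the released implementation; the paper itself uses the distributional derivative estimator of [10] discussed below.

```python
import tensorflow as tf

def binary_stochastic(kappa, training=True):
    # b_i = 1 if kappa_i >= eps_i with eps_i ~ U(0, 1) during training (cf. Eq. (5));
    # at test time the threshold is fixed to 0.5, matching Eq. (7).
    threshold = tf.random.uniform(tf.shape(kappa)) if training else 0.5
    b = tf.cast(kappa >= threshold, tf.float32)
    # Surrogate so gradients can reach kappa; a straight-through-style stand-in
    # for the distributional derivative estimator of [10].
    return kappa + tf.stop_gradient(b - kappa)

class JMLHHead(tf.keras.Model):
    """Two fully-connected layers on top of an arbitrary backbone (Table 1)."""
    def __init__(self, backbone, code_length, num_labels):
        super().__init__()
        self.backbone = backbone                                    # e.g. AlexNet up to fc7
        self.fc_code = tf.keras.layers.Dense(code_length,
                                             activation='sigmoid')  # kappa(x; theta) in [0, 1]^m
        self.fc_cls = tf.keras.layers.Dense(num_labels)             # pi(b; phi), class logits

    def call(self, x, training=True):
        kappa = self.fc_code(self.backbone(x, training=training))
        b = binary_stochastic(kappa, training=training)             # binary bottleneck B
        logits = self.fc_cls(b)                                      # classifier only observes binary codes
        return kappa, b, logits
```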
Algorithm 1: The Training Procedure of JMLH
Input:
Data observations X, the corresponding labels Y and the maximum number of iterations T.
Output:
Network parameters θ.
repeat
  Randomly pick a batch of {(x, y)} from the training data
  Sample ε ∼ U(0, 1)^m for each datum
  L ← Eq. (3)
  (θ, φ) ← (θ − Γ(∇_θ L), φ − Γ(∇_φ L)) according to Eq. (6)
until convergence or reaching the maximum iteration T;

The second RHS term of Eq. (3) acts as a regularizer. By minimizing the Kullback-Leibler (KL) divergence between the posterior q(b|x) and the prior p(b), the entropy carried by B is preserved. As the prior and the posterior are basically binomial, the KL divergence can be deterministically computed from two entropy terms H(·):

KL(q(b|x) || p(b)) = H(q(b|x), p(b)) − H(q(b|x), q(b|x)).   (4)

The whole network of JMLH is trained only using Eq. (3). This makes the optimization extremely simple, requiring no auxiliary module or additional complex loss function. The only problem comes from the gradient computation of the intractable expected negative log-likelihood w.r.t. θ, which is discussed in Section 2.1.3.

Computing the gradient of the negative log-likelihood expectation term ∇_θ E_{q(b|x)}[−log q(y|b)] of Eq. (3) is intractable. One needs to traverse the latent space of B for each sample x to accurately obtain the loss and the corresponding gradients. Inspired by [10], we use the following reparametrization of B:

b̃_i = 1 if κ_i(x; θ) ≥ ε_i, and b̃_i = 0 if κ_i(x; θ) < ε_i, for i = 1 ... m,   (5)

where each ε_i ∼ U(0, 1) is a random signal. Eq. (5) is conventionally termed the stochastic binary neural activation. With this reparametrization, the gradient of L w.r.t. the encoder parameters θ can be estimated by the distributional derivative estimator [10]:

∇_θ L = (1/n) Σ_{(x, y)} ( E_ε[−∇_θ log q(y | b̃)] + λ ∇_θ KL(q(b|x) || p(b)) ).   (6)

Although the reparametrization trick [19] is initially designed for continuous variables, we keep using this terminology here, because the trick proposed in [10] leads to a gradient estimator similar to the one of [19].
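For illustration, here is a hedged sketch of how Eqs. (3)-(6) could be assembled into a single training step, reusing the hypothetical JMLHHead sketch above. The per-bit KL term uses the closed form KL(Bernoulli(κ_i) || Bernoulli(0.5)) = κ_i log(2κ_i) + (1 − κ_i) log(2(1 − κ_i)); the value of λ below is an assumed placeholder, and the straight-through surrogate inside binary_stochastic stands in for the distributional derivative estimator of [10].

```python
import tensorflow as tf

LAMBDA = 0.1  # assumed placeholder for the regularization weight lambda

def jmlh_loss(kappa, logits, labels_onehot, lam=LAMBDA):
    # Eq. (3): classification NLL (one sampled code per datum) + lambda * KL(q(b|x) || p(b)).
    nll = tf.nn.softmax_cross_entropy_with_logits(labels=labels_onehot, logits=logits)
    eps = 1e-7
    kl_per_bit = (kappa * tf.math.log(2.0 * kappa + eps)
                  + (1.0 - kappa) * tf.math.log(2.0 * (1.0 - kappa) + eps))
    return tf.reduce_mean(nll) + lam * tf.reduce_mean(tf.reduce_sum(kl_per_bit, axis=-1))

@tf.function
def train_step(model, optimizer, x, labels_onehot):
    # One iteration of Algorithm 1: sample the stochastic codes, evaluate Eq. (3),
    # and update (theta, phi) jointly with the gradient scaler Gamma(.), e.g. Adam.
    with tf.GradientTape() as tape:
        kappa, _, logits = model(x, training=True)
        loss = jmlh_loss(kappa, logits, labels_onehot)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```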
Figure 2. An analogy of the JMLH computation graphs for (a) training and (b) test.
With this estimator, the network of JMLH can be trained with SGD end-to-end. Note that ∇_φ L can be deterministically obtained and does not require approximation, since π(b; φ) does not involve stochasticity. The whole training process is illustrated in Algorithm 1, and the respective variable feed path is illustrated in Figure 2 (a). Here we use Γ(·) to denote the gradient scaler, which is the Adam optimizer [18] in this work. It can be seen that, during training, JMLH performs identically to a normal neural classifier. The only additional step is to sample the random signals ε.

Given a query datum x^(q), the corresponding hash code is produced by the encoder, i.e.,

b^(q) = (sign(κ(x^(q); θ) − 0.5) + 1) / 2,   (7)

which is shown in Figure 2 (b).

In this subsection, we show that JMLH defines a special discrete extension of VIB [2] to learn information-rich codes. By empirically assigning the joint probability of X and Y with the Dirac delta function p(x, y) = (1/n) Σ_i δ(x − x_i) δ(y − y_i) = p(y|x) p(x), i.e., data samples are independent, the negative learning objective of JMLH can be rewritten as

−L = (1/n) Σ_{(x, y)} Σ_b ( p(x) p(y|x) q(b|x) log q(y|b) − λ p(x) q(b|x) log [q(b|x) / p(b)] ),   (8)

where the first RHS term is the variational lower bound of the mutual information I(B, Y) and the second RHS term is the lower bound of the negative mutual information −λ I(B, X), according to [2]. Consequently, −L literally lower-bounds the IB [36] objective R_IB(X, Y, B):

R_IB(X, Y, B) = I(B, Y) − λ I(B, X) ≥ −L.   (9)

We refer to the related articles [2, 36] for more detailed definitions. Intuitively, our learning objective allows B to maximally represent the semantic meaning of the label space Y by ascending I(B, Y). Note that, though −λ I(B, X) acts as a penalty in Eq. (9), we are not expecting zero mutual information between X and B, otherwise the produced codes would be data-independent. The purpose of introducing −λ I(B, X) is to filter out redundant information not related to the semantic meanings of data during encoding, and simultaneously preserve the essential part to support I(B, Y). In this way, the learned codes can be compressed and discriminative.

In the context of large-scale data retrieval, relevant data pairs are usually and conveniently defined by sharing labels/tags, which is generally reasonable. It is trivial and inefficient to traverse all data points in a dataset and explicitly assign pair-wise similarity marks to each of them, while the labels/tags can be regarded as the similarity 'anchors' to ease this process. JMLH favors this setting, as it is literally a special classifier during training. The bottleneck latents B are directly linked to the data labels. When the model is well-trained, the codes of relevant data are naturally located with short Hamming distances. This idea has also been proved in many label-based hashing approaches [17, 29].
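As a usage illustration, the snippet below sketches how query and database codes could be produced per Eq. (7) and ranked by Hamming distance; it reuses the hypothetical JMLHHead sketch above and plain NumPy for the ranking.

```python
import numpy as np

def encode(model, x):
    # Eq. (7): threshold kappa(x; theta) at 0.5 to obtain codes in {0, 1}^m.
    kappa, _, _ = model(x, training=False)
    return (np.sign(kappa.numpy() - 0.5) + 1.0) / 2.0

def hamming_rank(query_code, db_codes):
    # Hamming distance = number of disagreeing bits; smaller means more relevant.
    dists = np.sum(query_code[None, :] != db_codes, axis=1)
    return np.argsort(dists)  # database indices sorted from nearest to farthest
```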
3. Related Work
Our work is related to various hashing techniques, of which the most popular and related ones are selectively discussed according to our motivation and design.
Traditional solutions.
We firstly look at the problem of discrete optimization. A typical example is SDH [29], which also sequentially performs encoding and classification. However, as SDH [29] resorts to Discrete Cyclic Coordinate descent (DCC) for alternating code updating, a held-out optimization step is involved. Practically, this is hard to parallelize and to optimize batch-wise. Additionally, training errors of the classification step cannot be efficiently propagated back to the encoder. A similar paradigm can be found in [39], while its objective is based on pair-wise data similarity. In both single-modal hashing [40, 11] and cross-modal hashing [23, 31], alternating code updating is widely adopted. For those methods that have held-out code learners, the network is regularized by the produced target codes. The disadvantage of this disarticulated process is the low training quality. On the other hand, regularizing the network by quantization is also widely considered [6, 12, 17, 30]. However, these approaches ignore a severe problem of the different presence of codes: the network observes continuous codes during training, which may carry different meanings from their discrete counterparts at test time. This problem is explicitly solved in JMLH, as our code bottleneck is exactly binary.
Gradient estimation solutions.
Some existing hashing models handle the discrete constraints for SGD by gradient estimation techniques, so that the hashing model can be conveniently trained. In SGH [10], a distributional derivative estimator is proposed based on the Taylor expansion of the gradient, and the discreteness is kept by the stochastic neuron. This approach has a similar presence to the reparametrization trick [19], and is unbiased and stable during training. It is also adopted in [32], and JMLH follows the same idea. An alternative simple choice here is the Straight-Through (ST) estimator [3], which is used in GreedyHash [35]. The REINFORCE algorithm [38] is also employed for the same purpose in [41], while it undergoes high variance during training.
JMLH is not the first model that trains the hashing network with classification objectives. For instance, SUBIC [17] also employs a classification loss as its learning objective. Specifically, SUBIC [17] separates the hash code into l blocks and grounds each code block on a ∆^{m/l−1} simplex in order to favor the discreteness. This approach considerably limits the maximal information carried by the codes. Besides, the supervised version of GreedyHash [35] is similar to JMLH both in terms of the classification objective and keeping the discrete constraints. However, GreedyHash [35] only uses the quantization loss on the code bottleneck, ignoring the entropy of the codes, while we consider minimizing KL(q(b|x) || B(b | m, 0.5)) to preserve the entropy. Moreover, GreedyHash [35] provides no theoretical clue of how the trained codes are related to data semantics. MIHash [4] borrows the concept of mutual information as does JMLH, but ends up with a different design. Our model reflects the mutual information between codes and data semantics as a part of VIB [2], while MIHash [4] considers relevant-irrelevant code distribution discrepancy and requires complex histogram binning [37] during training.

Recently, a popular idea in deep representation learning is to employ Generative Adversarial Networks (GANs) [16] during training, which has been attempted in [5, 13, 34, 45]. The discriminators or the encoders in GANs are aware of the data distribution p(X) without explicitly parameterizing p(X). The problem is that the auxiliary generator significantly increases the training complexity, as more parameters are introduced. We experimentally show that the above sophisticated designs are not always necessary, as the simple network of JMLH can already achieve state-of-the-art retrieval performance.
4. Experiments
Extensive image retrieval experiments are conducted in this section, mainly according to the following themes:

• Comparison with existing methods.
We show that, simple as JMLH is, it still outperforms state-of-the-art hashing models.

• Ablation study.
The importance of each part of JMLH is evaluated and discussed.

• Intuitive results.
Some illustrative results are provided to implicitly justify the effectiveness of JMLH.
JMLH is implemented with the popular deep learning toolbox Tensorflow [1]. The network specifics are provided in Table 1. For our image retrieval task, AlexNet [21] before the fc7 layer is adopted as the network backbone, whose parameters are initialized with the ImageNet [28] pre-trained weights and are jointly updated during training. For multi-labeled datasets, i.e., NUS-WIDE [9], we activate the last layer of π(b; φ) with the sigmoid non-linearity, while the softmax activation is used when training JMLH on CIFAR-10 [20] and ImageNet [28]. JMLH involves one hyper-parameter, i.e., the regularization factor λ, which we set empirically (its impact is analyzed in the ablation study). The learning rate of the Adam optimizer Γ(·) [18] is kept fixed, and the training batch size is set to 256. The codes can be found at https://github.com/ymcidence/JMLH.

Table 2. Performance comparison (w.r.t. mAP@k) of JMLH and the state-of-the-art hashing methods on CIFAR-10 (mAP@all), NUS-WIDE (mAP@5000) and ImageNet (mAP@1000) with 16, 32 and 64 bits. The respective retrieval sequence length k is adopted according to the most popular settings [13, 35, 41]. All baselines are reported according to the identical setting.

CIFAR-10 [20] consists of 60,000 images from 10 classes. We follow the common setting [13, 22, 35] and select 1,000 images (100 per class) as the query set. The remaining 59,000 images are regarded as the database. The training set contains 5,000 images, uniformly selected from the database.

NUS-WIDE [9] is a collection of nearly 270,000 Web images of 81 categories downloaded from Flickr. Following the settings in [26, 39, 22], we adopt the subset of images from the 21 most frequent categories. 100 images of each class are utilized as a query set and the remaining images form the database. For training, we employ 10,500 images uniformly selected from the 21 classes.
ImageNet [28] was originally released for large-scale image classification, and has recently been used in deep hashing evaluation. Following [8, 41], we randomly select 100 categories to perform our retrieval task. All the original training images are used as the database, and all the validation images form the query set. For each category, 130 images are used for training.
We compare JMLH with existing methods using conventional evaluation metrics, including top-k mean Average Precision (mAP@k), precision of the top-k retrieved samples (Precision@k), precision within a Hamming radius of 2 (P@H ≤ 2) and Precision-Recall (P-R) curves. Note that, for mAP@k, we adopt the most popular settings of k = all, 5,000 and 1,000 for CIFAR-10, NUS-WIDE and ImageNet respectively, according to [13, 35, 41].
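For reference, a possible NumPy implementation of mAP@k under the label-sharing relevance criterion used here is sketched below; conventions for mAP vary slightly across papers, so this is only one common variant, not necessarily the exact evaluation script of the paper.

```python
import numpy as np

def mean_average_precision(query_codes, query_labels, db_codes, db_labels, k=None):
    # query_labels / db_labels are multi-hot vectors; two items are relevant
    # if they share at least one label, and ranking is by Hamming distance.
    aps = []
    for q_code, q_label in zip(query_codes, query_labels):
        dists = np.sum(q_code[None, :] != db_codes, axis=1)
        order = np.argsort(dists)[: (k or len(db_codes))]
        relevant = (db_labels[order] @ q_label) > 0
        if relevant.sum() == 0:
            aps.append(0.0)
            continue
        precision_at_i = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        aps.append(np.sum(precision_at_i * relevant) / relevant.sum())
    return float(np.mean(aps))
```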
JMLH is compared with various widely recognized hashing baselines, including ITQ [14], AGH [26], DGH [24], KSH [25], ITQ-CCA [15], SDH [29], CNNH [39], DNNH [22], DHN [43], HashNet [8], HashGAN [5], PGDH [41] and the supervised version of GreedyHash [35]. Note that the term
HashGAN is used both in [13] and [5]. Here we refer to the latter one, as it is a supervised approach and thus more related to our work. For feature-based models, e.g., shallow hashing models, we use the AlexNet [21] fc7 pre-trained features to represent data for training and test. As for the end-to-end baseline frameworks, we directly adopt the training settings described in their original papers, and pre-trained weights are also applied for fine-tuning when possible.
The retrieval mAP@k results are reported in Table 2, and the respective P-R curves, Precision@k and P@H ≤ 2 scores on CIFAR-10 are shown in Figure 3. JMLH outperforms the compared methods, including the more sophisticated generative models, e.g., HashGAN [5]. This result aligns with our motivation, and suggests that, with the current evaluation metrics, one may not require an extremely complex model to obtain the best-performing deep hashing function.

The performance margin between JMLH and GreedyHash [35] is not significant on CIFAR-10 [20], but this gap gets larger when it comes to a relatively more difficult situation, i.e., ImageNet [28]. This raises the concern of a proper regularization term for training. Both GreedyHash [35] and JMLH are trained with classification-oriented objectives. The former literally involves a quantization penalty, while JMLH considers equally distributed {0, 1} bits to maximize the expected code entropy. This factor becomes essential when the data label space is large and the training samples are limited, as the codes need to be expressive enough to be successfully classified. We find our design has better generalization ability in this case.
Figure 3. Left: Precision-Recall curves on CIFAR-10 [20]. Middle: Precision of the top-k returned samples on CIFAR-10 [20]. Right: Precision within Hamming radius of 2 (P@H ≤ 2) w.r.t. code length on CIFAR-10 [20]. The compared methods include ITQ, AGH, DGH, KSH, ITQ-CCA, SDH, CNNH, DNNH, DHN, HashNet and HashGAN.
In this subsection, we evaluate different components in terms of formulating a simple deep hashing model, and empirically show which ones are important for good performance.
JMLH-Cont. We firstly look at the influence of quantization. By dropping the binary stochastic neuron and employing the sigmoid activation on the code bottleneck B, a regular deep neural classifier is built. The regularization term is kept; the regularizer itself is analyzed by the other baselines below.

JMLH-QR.
The KL term of Eq. (3) is replaced by a quantization regularizer between the activated binary codes B and their real-valued counterparts before the stochastic neurons.

JMLH-NR.
The regularizer is dropped in this baseline, and the whole learning objective is formulated by the classification cross-entropy alone.
JMLH-VAE.
We replace the classifier π(·) with a decoder, and use the reconstruction error instead of the classification loss during training. Therefore, the model collapses to an unsupervised Variational Auto-Encoder (VAE) [19], with a negative Evidence Lower-BOund (ELBO) of

(1/n) Σ_x ( E_{q(b|x)}[−log q(x|b)] + KL(q(b|x) || p(b)) ).   (10)

For simplicity of training, the encoder and decoder of this baseline are both implemented with two-layer neural networks and are fed with AlexNet [21] fc7 features.

The mAP results of the above-mentioned baselines are shown in Table 3. Since JMLH-VAE is an unsupervised model, its performance is relatively lower than the others.
Table 3. mAP@all results of different variants of the proposed JMLH on CIFAR-10.
Baseline | 16 bits | 32 bits | 64 bits
JMLH (full model) | 0.805 | 0.841 | 0.837
We experience a 20% performance drop when using the continuous relaxation during training, i.e., JMLH-Cont. As discussed in Section 3, the binary constraints are essential for models like JMLH, since they directly influence the classifier's observation. Without regularization, JMLH-NR struggles with training-test generalization. Though not competing with our full model, JMLH-QR still performs closely to GreedyHash [35], as the learning objectives are similar. The difference between JMLH-QR and GreedyHash [35] lies in the stochasticity of gradient estimation. Both ST [3] and the distributional derivative [10] work for this case as long as the binary constraints are not violated. Hence, a proper learning objective becomes more important.
The regularization penalty of JMLH is scaled by a hyper-parameter λ. By default, it is set empirically for the overall best performance. The impact of λ is illustrated in Figure 4 (a). The performance drops quickly when λ grows larger, which actually reflects the penalty on the mutual information between data X and codes B, i.e., I(X, B). A large value of λ over-regularizes the model by decorrelating X from B, making the produced codes less informative.

One key claim of this paper is to build a simple deep hashing model. Training JMLH is non-trivial and efficient. Our classification likelihood learning objective provides a straightforward way to convey data semantics to the encoder. We show a training efficiency comparison between JMLH and MIHash [4] in Figure 4 (b). It can be observed that JMLH converges to its best performance more quickly than MIHash [4].

Following [35], we also explore the minimal size of codes needed to represent data semantics. The experiments are conducted by setting the code length m to a series of very short values starting from m = 4, and the corresponding results are shown in Figure 4 (c). We can see that, compared with GreedyHash [35] and DHN [43], JMLH obtains better performance even when the encoding length is very short. The entropy-preserving regularization term plays the key role here, since the maximum number of concepts that the code space can cover is limited.

Figure 4. (a) mAP@all results of 32-bit JMLH on CIFAR-10 [20] with different values of λ. (b) Training efficiency of JMLH and MIHash [4] on CIFAR-10 [20]. (c) Encoding performance comparison with extremely short code lengths on CIFAR-10 [20].

Figure 5. (a) t-SNE visualization of the 32-bit JMLH codes on CIFAR-10 [20]. (b) Examples of the top-10 retrieved candidates of 32-bit JMLH on CIFAR-10 [20].
The t-SNE [27] visualization of 32-bit JMLH on CIFAR-10 [20] is shown in Figure 5 (a). Even though the proposed model is simple both in terms of network structure and learning objective, the resulting binary codes are still clearly scattered in the feature space according to their semantic meanings. We further provide several image retrieval examples where the top-10 retrieved candidates are shown together with the query image in Figure 5 (b). Obviously, JMLH successfully finds related images at the top of the retrieval list. Here we only show the 32-bit results to keep the content concise.
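A figure like Figure 5 (a) can be reproduced with a few lines of scikit-learn and matplotlib; the sketch below is a generic recipe under that assumption, not the exact script used for the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_code_tsne(codes, class_ids, out_path='tsne_codes.png'):
    # Embed the {0, 1}^m codes into 2-D and colour each point by its class id.
    emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(codes.astype(np.float32))
    plt.figure(figsize=(6, 6))
    plt.scatter(emb[:, 0], emb[:, 1], c=class_ids, s=3, cmap='tab10')
    plt.axis('off')
    plt.savefig(out_path, bbox_inches='tight')
```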
5. Conclusion
In this paper, we proposed a simple but effective deep hashing model called JMLH. Our model shaped a conventional deep neural network with a single likelihood maximization learning objective. A differentiable binary bottleneck was plugged in, making the whole network end-to-end trainable using SGD. JMLH was linked to the information bottleneck methods, which aim at learning maximally representative features for a given task. We showed that, when applying proper binary-preserving gradient estimators and suitable regularization terms, a single classification model could generate high-quality hash codes for similarity search, outperforming state-of-the-art models.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. In International Conference on Learning Representations (ICLR), 2016.
[3] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[4] F. Cakir, K. He, S. Adel Bargal, and S. Sclaroff. MIHash: Online hashing with mutual information. In IEEE International Conference on Computer Vision (ICCV), 2017.
[5] Y. Cao, B. Liu, M. Long, and J. Wang. HashGAN: Deep learning to hash with pair conditional Wasserstein GAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[6] Y. Cao, M. Long, B. Liu, and J. Wang. Deep Cauchy hashing for Hamming space retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[7] Z. Cao, M. Long, J. Wang, and P. S. Yu. HashNet: Deep learning to hash by continuation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[8] Z. Cao, M. Long, J. Wang, and P. S. Yu. HashNet: Deep learning to hash by continuation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[9] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In ACM International Conference on Image and Video Retrieval (CIVR), 2009.
[10] B. Dai, R. Guo, S. Kumar, N. He, and L. Song. Stochastic generative hashing. In International Conference on Machine Learning (ICML), 2017.
[11] T.-T. Do, A.-D. Doan, and N.-M. Cheung. Learning to hash with binary deep neural network. In European Conference on Computer Vision (ECCV), 2016.
[12] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[13] K. Ghasedi Dizaji, F. Zheng, N. Sadoughi, Y. Yang, C. Deng, and H. Huang. Unsupervised deep generative adversarial hashing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
[15] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[17] H. Jain, J. Zepeda, P. Perez, and R. Gribonval. SUBIC: A supervised, structured binary code for image search. In IEEE International Conference on Computer Vision (ICCV), 2017.
[18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[19] D. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[22] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[23] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[24] W. Liu, C. Mu, S. Kumar, and S.-F. Chang. Discrete graph hashing. In Advances in Neural Information Processing Systems (NIPS), 2014.
[25] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[26] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In International Conference on Machine Learning (ICML), 2011.
[27] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[29] F. Shen, C. Shen, W. Liu, and H. Tao Shen. Supervised discrete hashing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[30] Y. Shen, L. Liu, and L. Shao. Unsupervised binary representation learning with deep variational networks. International Journal of Computer Vision, DOI: 10.1007/s11263-019-01166-4.
[31] Y. Shen, L. Liu, L. Shao, and J. Song. Deep binaries: Encoding semantic-rich cues for efficient textual-visual cross retrieval. In IEEE International Conference on Computer Vision (ICCV), 2017.
[32] Y. Shen, L. Liu, F. Shen, and L. Shao. Zero-shot sketch-image hashing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[34] J. Song, T. He, L. Gao, X. Xu, A. Hanjalic, and H. T. Shen. Binary generative adversarial networks for image retrieval. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
[35] S. Su, C. Zhang, K. Han, and Y. Tian. Greedy hash: Towards fast optimization for accurate hash coding in CNN. In Advances in Neural Information Processing Systems, 2018.
[36] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Annual Allerton Conference on Communication, Control, and Computing, 1999.
[37] E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems (NIPS), 2016.
[38] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[39] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In AAAI Conference on Artificial Intelligence (AAAI), 2014.
[40] Y. Yang, Y. Luo, W. Chen, F. Shen, J. Shao, and H. T. Shen. Zero-shot hashing via transferring supervised knowledge. In ACM International Conference on Multimedia (MM), 2016.
[41] X. Yuan, L. Ren, J. Lu, and J. Zhou. Relaxation-free deep hashing via policy gradient. In European Conference on Computer Vision (ECCV), 2018.
[42] X. Zhou, F. Shen, L. Liu, W. Liu, L. Nie, Y. Yang, and H. T. Shen. Graph convolutional network hashing. IEEE Transactions on Cybernetics, pages 1–13, 2018.
[43] H. Zhu, M. Long, J. Wang, and Y. Cao. Deep hashing network for efficient similarity retrieval. In AAAI Conference on Artificial Intelligence (AAAI), 2016.
[44] B. Zhuang, G. Lin, C. Shen, and I. Reid. Fast training of triplet-based deep binary embedding networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[45] M. Zieba, P. Semberecki, T. El-Gaaly, and T. Trzcinski. BinGAN: Learning compact binary descriptors with a regularized GAN. In