Adaptive Online Multi-modal Hashing via Hadamard Matrix
Jun Yu, Xiao-Jun Wu, Donglin Zhang, Josef Kittler
The School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China. The Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence. Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, GU2 7XH, Guildford, UK.
[email protected]; wu [email protected]; [email protected]; [email protected]
Abstract
Hashing plays an important role in information retrieval, due to its low storage cost and high processing speed. Among the techniques available in the literature, multi-modal hashing, which can encode heterogeneous multi-modal features into compact hash codes, has received particular attention. Existing multi-modal hashing methods introduce hyper-parameters to balance the many regularization terms designed to make the models more robust in the hash learning process. However, setting proper values for these hyper-parameters is time-consuming and labor-intensive. In this paper, we propose a simple yet effective method, inspired by the Hadamard matrix, which captures the multi-modal feature information in an adaptive manner and preserves the discriminative semantic information in the hash codes. Our framework is flexible and involves very few hyper-parameters. Extensive experimental results show that the method is effective and achieves superior performance compared to state-of-the-art algorithms.
Introduction
As an effective technique to deal with the challenges posed by the explosive growth of multimedia data, hashing has attracted increasing attention in information retrieval and related areas (Wang et al. 2014). Existing hashing methods mainly focus on uni-modal hashing (Gong et al. 2013; Datar et al. 2004; Shen et al. 2015; Ji et al. 2017; Lin et al. 2018) and cross-modal hashing (Liu et al. 2016; Wang, Wang, and Gao 2018; Li, Tang, and Mei 2019; Hu et al. 2019; Yu, Wu, and Kittler 2019; Zhang, Wu, and Yu 2020). Different from uni-modal and cross-modal hashing, where only one of the multiple modalities is given at the query stage, multi-modal hashing combines multiple modalities to comprehensively represent query data for multimedia retrieval. A simple way to extend uni-modal hashing to the multi-modal setting is to concatenate the multiple uni-modal features into a single representation that forms the input of a uni-modal hashing method. However, such an extension may fail to exploit the complementarity of the modalities. To handle this problem, various learning methods (Song et al. 2013; Liu et al. 2012; Xiaobo et al. 2018; Shen et al. 2015) have
been developed. Multiple Feature Hashing (MFH) (Song et al. 2013) explores the local structure of the individual features and fuses them in a joint framework. Multiple Kernel Hashing (MKH) (Liu et al. 2012) fuses multiple features by an optimised linear combination. Multi-view Latent Hashing (MVLH) (Shen et al. 2015) aims to find a unified kernel feature space in which the weights of the different modalities are adaptively learned. Multi-view Discrete Hashing (MVDH) (Xiaobo et al. 2018) jointly performs matrix factorization and spectral clustering to learn compact hash codes. In MVDH, the weight imposed on each modality is adaptively learned to reflect the importance of that modality to the learning process. Although these approaches have achieved promising performance in many applications, they have the following two shortcomings: (1) Most of the above multi-modal hashing methods model each modality by constructing a similarity graph, which costs $O(n^2)$. This high computational complexity is not scalable to large-scale multimedia retrieval problems. (2) The weighting of the different modalities, learned in an offline stage, cannot effectively support dynamic data. Some online adaptive hashing methods (Zhu et al. 2020; Lu et al. 2019) attempt to tackle these problems. An example is Online Multi-modal Hashing with Dynamic Query-adaption (OMH-DQ) (Lu et al. 2019), whose parameter-free online mode can adaptively learn the hash codes for dynamic queries. Additionally, the use of semantic labels avoids the high computational complexity. However, these approaches usually incorporate additional regularization terms to enhance the discriminative capability. The hyper-parameters introduced to balance these terms require an inordinate amount of tuning time to obtain optimal performance, which makes the methods inapplicable in practice.
Based on this observation, in this paper we propose a simple yet very effective multi-modal hashing method to overcome this challenging problem. Inspired by recent works (Koutaki, Shirai, and Ambai 2018; Yuan et al. 2020; Lin et al. 2020), in which the Hadamard matrix has proven effective for hash learning, we introduce a Hadamard matrix to generate discriminative target codes for the data, which induces the samples with the same label information to approach their common target codes in the offline hash function learning stage. In the online search stage, we adopt an adaptive self-weighting scheme to capture the dynamic information. The advantages of our method are summarized as follows:
• We introduce a Hadamard matrix into the multi-modal retrieval process to guide the hash learning. We show that this enables discriminative semantic information to be preserved in the hash codes, although our model is relatively simple.
• The method is easy to implement and requires little computational time. It does not involve the setting of hyper-parameters.
• A comparative evaluation of the proposed method against state-of-the-art hashing methods on three publicly available datasets shows that our method boosts the retrieval performance.
Figure 1: The overview of the proposed method. The proposed framework is divided into two parts: offline training and online search. In the offline training stage, the multi-modal data is collaboratively projected into the common Hamming space, where samples of the same category converge to their common hash center generated by a Hadamard matrix. Based on the projection matrices learned in the offline stage, we adopt an adaptive weighting scheme to obtain hash codes for new data in order to reflect the variations in their dynamics during the online search process.
The Proposed Method
Model Formulation
Assume that the training dataset is comprised of $n$ multimedia instances represented with $M$ different modalities and accompanied by a set of class labels $L = \{l_i\}_{i=1}^n$. The $m$-th modality is denoted as $X^{(m)} = [x_1^{(m)}, ..., x_n^{(m)}] \in \mathbb{R}^{d_m \times n}$, where $d_m$ is the dimensionality of the $m$-th modality. Our method aims to learn discriminative hash codes $B \in \{-1, 1\}^{r \times n}$ to represent the multimedia instances, where $r$ is the length of the output codes in Hamming space. We pre-define a set of points $C = \{c_1, c_2, ..., c_k\} \in \mathbb{R}^{r^* \times k}$ as the class-specific hash centers, where $k$ is the number of categories and $r^*$ indicates the dimension of the hash centers. We encourage data points with the same class information to be close to a common hash center, and those with different semantic information to be associated with different hash centers. Intuitively, the pre-defined hash centers in the Hamming space should conform to the following requirement: a sufficient mutual distance between the centers should ensure that samples from different classes are well separated in the Hamming space. The concept of valid hash centers is specified in Definition 1.

Definition 1. Hash centers $C = \{c_i\}_{i=1}^s \subset \{-1, 1\}^{r^*}$ in the $r^*$-dimensional Hamming space satisfy that the average pairwise Hamming distance is greater than or equal to $r^*/2$, i.e.,
$$\frac{1}{V} \sum_{i \neq j} D_H(c_i, c_j) \geq \frac{r^*}{2} \quad (1)$$
where $V$ is the number of combinations of different $c_i$ and $c_j$, $s$ is the size of the set $C$, and $D_H$ denotes the Hamming distance.

Inspired by recent works (Lin et al. 2020; Koutaki, Shirai, and Ambai 2018), which have been shown to be very promising in the field of hash learning, we introduce the Hadamard matrix to guide the hash learning. A Hadamard matrix constructed via the Sylvester method (Sylvester 1867) has the following properties:
• It is an $r^*$-order ($r^* = 2^n$, $n = 1, 2, ...$) square matrix whose elements are either +1 or -1.
The coding length $r^*$ of the generated Hadamard matrix is
$$r^* = \min \{ l \mid l = 2^n, l \geq r, l \geq k, n = 1, 2, ... \} \quad (2)$$
• Both its rows and columns are pairwise orthogonal, which ensures that the Hamming distance between any two column vectors is $r^*/2$. Thus, each column of the Hadamard matrix can serve as a hash center satisfying Definition 1.

Referring to Eq. (2), there may be cases in which the output code length $r$ does not satisfy $r = r^*$. To mitigate this problem, Locality Sensitive Hashing (LSH) is adopted to transform the hash centers generated by the Hadamard matrix, so that the dimension of the centers is consistent with the output codes:
$$\tilde{c}_i = sign(\tilde{W}^T c_i) \quad (3)$$
where $\tilde{W} = \{\tilde{w}_i\}_{i=1}^r \in \mathbb{R}^{r^* \times r}$ is sampled from the standard Gaussian distribution. The transformed Hadamard matrix preserves the main properties of the original Hadamard matrix and complies with the requirement of minimal Hamming distance between columns. The detailed theory is developed in (Lin et al. 2020).

Semantic hash centers for multi-label data
Hash centers $\{c_1, c_2, ..., c_k\}$, corresponding to the $k$ categories respectively, have been obtained as described above. For data classified into two or more categories, the corresponding hash center is the centroid of the multiple centers, each of which relates to a single category.

Offline Training Stage
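As a concrete illustration of how the target codes that drive this stage can be assembled, the following NumPy sketch builds hash centers from a Sylvester Hadamard matrix, applies the Gaussian (LSH-style) resizing of Eq. (3) when $r \neq r^*$, and forms a centroid-based center for multi-label samples. The function names, shapes, and the sign-based tie-breaking of the centroid are our own illustrative choices, not the authors' code:

```python
import numpy as np

def sylvester_hadamard(order):
    """Build an order x order Hadamard matrix (order = 2^n) by
    Sylvester's recursive construction H_{2n} = [[H, H], [H, -H]]."""
    H = np.ones((1, 1), dtype=int)
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def make_hash_centers(num_classes, code_len, rng=None):
    """One hash center per class. r* is the smallest power of two
    >= max(code_len, num_classes); if r* != code_len, a Gaussian
    projection resizes the Hadamard columns as in Eq. (3)."""
    rng = np.random.default_rng(rng)
    r_star = 1
    while r_star < max(code_len, num_classes):
        r_star *= 2
    H = sylvester_hadamard(r_star)
    C = H[:, :num_classes].astype(float)            # one column per class
    if r_star != code_len:
        W_tilde = rng.standard_normal((r_star, code_len))
        C = np.sign(W_tilde.T @ C)                  # code_len x num_classes
    return C

def center_for_labels(C, label_ids):
    """Multi-label sample: centroid of the single-category centers,
    binarized by sign (tie-breaking at 0 is an assumption; the paper
    does not specify it)."""
    c = C[:, label_ids].mean(axis=1)
    return np.where(c >= 0, 1.0, -1.0)
```

With $r^* = r$ (no projection needed), the columns inherit the exact pairwise Hamming distance $r^*/2$ guaranteed by the orthogonality of the Hadamard matrix.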
Through the above process, we obtain the hash center representation $H^* \in \mathbb{R}^{r \times n}$ for all training data in the Hamming space. $H^*$ is also termed the target codes for the training data. For the $m$-th modality $X^{(m)} = [x_1^{(m)}, ..., x_n^{(m)}] \in \mathbb{R}^{d_m \times n}$ of the training set, we calculate a nonlinearly transformed representation
$$\phi(x_i^{(m)}) = [\exp(-\frac{\|x_i^{(m)} - a_1^{(m)}\|_F}{\sigma_m}), ..., \exp(-\frac{\|x_i^{(m)} - a_p^{(m)}\|_F}{\sigma_m})]$$
where $\{a_j^{(m)}\}_{j=1}^p$ are $p$ anchors randomly selected from the $m$-th modality of the training data, and $\sigma_m$ denotes the Gaussian kernel parameter. The representation $\phi(X^{(m)}) = [\phi(x_1^{(m)}), ..., \phi(x_n^{(m)})] \in \mathbb{R}^{p \times n}$ preserves the intra-modality correlation among the data within a single modality. The time complexity of this preprocessing phase is $O(Mnp)$.

The heterogeneous modalities are projected into a common Hamming space. In this space, data points of the same category are encouraged to migrate towards a common hash center, and those of different categories converge to distinct hash centers. Thus, we have the following:
$$\min_{W^{(m)}} \sum_{m=1}^{M} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F \quad (4)$$
where $W^{(m)} \in \mathbb{R}^{r \times p}$ is the projection matrix of the $m$-th modality, each column of $H^*$ is the target code of the corresponding training sample, and $\| \cdot \|_F$ denotes the Frobenius norm of a matrix.

In multimedia retrieval, there may be discrepancies between the heterogeneous modalities. Accordingly, it is necessary to gauge the importance of the different modalities so as to learn an effective and discriminative hash function. To handle this problem, we transform Eq. (4) to its equivalent form (see Proof 1):
$$\min_{\mu^{(m)}, W^{(m)}} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F^2 \quad s.t. \; \sum_{m=1}^{M} \mu^{(m)} = 1 \quad (5)$$
As formulated in Eq. (5), $1/\mu^{(m)}$ can be considered as the weight of the $m$-th modality.
The more discriminative the $m$-th modality, the smaller the value of $\| H^* - W^{(m)} \phi(X^{(m)}) \|_F$ and the larger the corresponding weight $1/\mu^{(m)}$, and vice versa.

Proof 1:
Eq. (4) is equivalent to Eq. (5). According to the Cauchy-Schwarz inequality, the following (6) holds:
$$(\sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F^2)(\sum_{m=1}^{M} \mu^{(m)}) \geq (\sum_{m=1}^{M} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F)^2 \quad (6)$$
Thus, with the constraint $\sum_{m=1}^{M} \mu^{(m)} = 1$, we can obtain
$$(\sum_{m=1}^{M} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F)^2 = \min_{\mu^{(m)}} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F^2$$
and then
$$\min_{W^{(m)}} \sum_{m=1}^{M} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F \Leftrightarrow \min_{W^{(m)}} (\sum_{m=1}^{M} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F)^2 \Leftrightarrow \min_{\mu^{(m)}, W^{(m)}} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F^2 \quad (7)$$
To avoid over-fitting, a regularization term is added to (5). The overall learning framework then becomes
$$\min_{\mu^{(m)}, W^{(m)}} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \| H^* - sign(W^{(m)} \phi(X^{(m)})) \|_F^2 + \delta \sum_{m=1}^{M} \| W^{(m)} \|_F^2 \quad s.t. \; \sum_{m=1}^{M} \mu^{(m)} = 1 \quad (8)$$
where $\delta$ is a penalty parameter. Since the sign function makes it difficult to optimize (8) directly, we relax the objective function to make it computationally tractable. The relaxed objective function can be written as
$$\min_{\mu^{(m)}, W^{(m)}} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \| H^* - W^{(m)} \phi(X^{(m)}) \|_F^2 + \delta \sum_{m=1}^{M} \| W^{(m)} \|_F^2 \quad s.t. \; \sum_{m=1}^{M} \mu^{(m)} = 1 \quad (9)$$
We adopt the alternating optimization method to solve the relaxed problem in (9).
• Step 1: Update $W^{(m)}$ with the other variables fixed. We set the derivative of the objective function with respect to $W^{(m)}$ ($m = 1, ..., M$) to zero and obtain
$$W^{(m)} = \frac{1}{\mu^{(m)}} H^* \phi^T(X^{(m)}) (\frac{1}{\mu^{(m)}} \phi(X^{(m)}) \phi^T(X^{(m)}) + \delta I)^{-1} \quad (10)$$
• Step 2: Update $\mu^{(m)}$ with the other variables fixed.
Gathering the terms relating to $\mu^{(m)}$, we get the subproblem
$$\min_{\mu^{(m)} \geq 0} \sum_{m=1}^{M} \frac{(G^{(m)})^2}{\mu^{(m)}} \quad s.t. \; \sum_{m=1}^{M} \mu^{(m)} = 1 \quad (11)$$
where $G^{(m)} = \| H^* - W^{(m)} \phi(X^{(m)}) \|_F$. According to the Cauchy-Schwarz inequality, the optimal $\mu^{(m)}$ can be obtained as
$$\mu^{(m)} = \frac{G^{(m)}}{\sum_{m=1}^{M} G^{(m)}} \quad (12)$$

Online Search Process with Dynamic Weights
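Putting the offline alternation (Eqs. (10) and (12)) together with the online self-weighting scheme developed in this section, the whole procedure can be sketched on synthetic data as follows. The sizes, iteration counts and random inputs are illustrative assumptions, not values from the paper; the $1/\mu$ factors realize the residual-based weighting, so the more discriminative modality receives the larger weight:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p, M, delta = 200, 16, 32, 2, 1e-3                  # illustrative sizes and penalty
H_star = np.sign(rng.standard_normal((r, n)))             # target codes from the hash centers
Phi = [rng.standard_normal((p, n)) for _ in range(M)]     # phi(X^(m)): anchor features

# ---- Offline: alternate Eq. (10) and Eq. (12) ----
mu = np.full(M, 1.0 / M)
W = [None] * M
for _ in range(10):
    for m in range(M):                                    # Eq. (10): closed-form ridge update
        A = (1.0 / mu[m]) * Phi[m] @ Phi[m].T + delta * np.eye(p)
        W[m] = (1.0 / mu[m]) * H_star @ Phi[m].T @ np.linalg.inv(A)
    G = np.array([np.linalg.norm(H_star - W[m] @ Phi[m]) for m in range(M)])
    mu = G / G.sum()                                      # Eq. (12): residual-proportional mu

# ---- Online: self-weighting for a batch of newly arriving instances ----
n_q = 5
Phi_q = [rng.standard_normal((p, n_q)) for _ in range(M)] # phi(X_q^(m)) of the new data
B_q = np.sign(sum(W[m] @ Phi_q[m] for m in range(M)))     # initialize with uniform fusion
for _ in range(5):
    G_q = np.array([np.linalg.norm(B_q - W[m] @ Phi_q[m]) for m in range(M)])
    mu_q = G_q / G_q.sum()                                # Eq. (14)
    B_q = np.sign(sum((1.0 / mu_q[m]) * W[m] @ Phi_q[m] for m in range(M)))  # Eq. (15)
```

Each closed-form step minimizes its subproblem exactly, so the relaxed objective decreases monotonically, consistent with the convergence behaviour reported in the experiments.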
In the online stage, we assume that the data appear as a data stream. Newly arriving data are hash coded and archived in the database. Unfortunately, the fixed weights learned from Eq. (12) cannot capture the variations of dynamic data during hash coding. Thus, the weights should be adjusted dynamically for the content of each specific instance. Motivated by this intuition, we adopt a self-weighting scheme based on the projection matrices $W^{(m)}$ ($m = 1, ..., M$) learned in the offline training stage to obtain more accurate hash codes for newly arriving multimedia data. The adaptive online hash learning process is given as
$$\min_{B_q, \mu_q^{(m)}} \sum_{m=1}^{M} \frac{1}{\mu_q^{(m)}} \| B_q - W^{(m)} \phi(X_q^{(m)}) \|_F^2 \quad s.t. \; \sum_{m=1}^{M} \mu_q^{(m)} = 1, \; B_q \in \{-1, 1\}^{r \times n_q} \quad (13)$$
where $B_q$ and $n_q$ are the hash codes and the number of the new instances, respectively, and $\phi(X_q^{(m)})$ is the nonlinearly transformed representation of the $m$-th modality of the newly arriving instances. We solve the problem in Eq. (13) by updating the following variables alternately.

Update $\mu_q^{(m)}$ with $B_q$ fixed. The optimal solution of $\mu_q^{(m)}$ is obtained as
$$\mu_q^{(m)} = \frac{G_q^{(m)}}{\sum_{m=1}^{M} G_q^{(m)}} \quad (14)$$
where $G_q^{(m)} = \| B_q - W^{(m)} \phi(X_q^{(m)}) \|_F$.

Update $B_q$ with $\mu_q^{(m)}$ fixed. We can obtain a closed-form solution as
$$B_q = sgn(\sum_{m=1}^{M} \frac{1}{\mu_q^{(m)}} W^{(m)} \phi(X_q^{(m)})) \quad (15)$$

Experiment
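In the experiments that follow, retrieval is performed by ranking database items by the Hamming distance between their codes and the query codes. This standard protocol (not code from the paper) can be sketched as:

```python
import numpy as np

def hamming_rank(B_db, B_q):
    """Rank database items by Hamming distance to each query.
    Codes are +/-1 matrices of shape (r, n); for such codes the
    distance follows from the inner product: D_H = (r - B_q^T B_db) / 2."""
    r = B_db.shape[0]
    D = (r - B_q.T @ B_db) / 2        # (n_q, n_db) Hamming distances
    return np.argsort(D, axis=1)      # ranked database indices per query
```

Because the distance reduces to an inner product for binary codes, ranking an entire database amounts to a single matrix multiplication, which is what makes Hamming-space retrieval fast.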
In this section, we conduct retrieval experiments on three widely used multi-modal datasets to verify the performance of the proposed method. Our experiments are executed on a Windows 10 desktop machine with 12 GB of memory and a 4-core 3.6 GHz CPU.
Algorithm 1
Adaptive Online Multi-modal Hashing
Input: Training set $X^{(m)} = [x_1^{(m)}, ..., x_n^{(m)}] \in \mathbb{R}^{d_m \times n}$ ($m = 1, ..., M$), label matrix $L$.
Generate the output target codes $H^*$ from the hash centers as described above.
Calculate the transformed representations $\phi(X^{(m)}) = [\phi(x_1^{(m)}), ..., \phi(x_n^{(m)})] \in \mathbb{R}^{p \times n}$.
Initialize $W^{(m)}$ and $\mu^{(m)}$ ($m = 1, ..., M$).
repeat
  Update $W^{(m)}$ ($m = 1, ..., M$) according to (10);
  Update $\mu^{(m)}$ ($m = 1, ..., M$) according to (12);
until convergence
Output: $W^{(m)}$ ($m = 1, ..., M$)
for q = 1, ..., T do
  Receive newly arriving $X_q$;
  repeat
    Update $\mu_q^{(m)}$ ($m = 1, ..., M$) according to (14);
    Update $B_q$ according to (15);
  until convergence
  Output $B_q$;
end for

Datasets
WiKi (Rasiwasia et al. 2010) is a multi-modal single-label dataset which consists of 2866 multimedia documents in 10 categories. We directly generate one hash center for each category. Each image is represented by a 128-dimensional SIFT histogram vector, while each text is represented as a 10-dimensional feature vector generated by latent Dirichlet allocation. A random subset of 2173 multimedia samples is used as the offline training set and the retrieval set, and the remaining 693 samples serve as the query set.
Pascal VOC 2007 (Wei et al. 2017) contains 9963 images in 20 categories. Each image and its associated 399 tags compose a multimedia sample. In this dataset, we employ the 4096-dimensional CNN feature to represent the visual object, and the 798-dimensional tag ranking feature is employed as the text feature. A random subset of 2000 samples serves as the offline training set, and the remaining samples are divided into a query set and a retrieval set, containing 963 and 7000 samples respectively.
NUS-WIDE (Chua et al. 2009) is comprised of 269648 multi-modal samples covering 81 concepts. In our experiments, we only keep the 186577 samples belonging to the ten most frequent concepts. The image modality is represented by a 500-dimensional bag-of-visual-words vector, and the 1000-dimensional tag occurrence vector is employed as the text modality feature. A random subset of 1866 samples forms the query set and the remaining 184711 samples the retrieval set; 5000 samples are randomly selected from the retrieval set for the offline training stage.

Pascal VOC 2007 and NUS-WIDE are two multi-label datasets. For those multimedia samples with multiple labels, we first generate the hash centers for the single categories, then calculate the centroid of the multiple centers as the semantic target codes of the sample.

Table 1: mAP comparison of different methods for different code lengths (16 / 32 / 64 / 128 bits)

Methods | WiKi | Pascal VOC 2007 | NUS-WIDE
ITQ | 0.5122 / 0.5359 / 0.5490 / 0.5532 | 0.7586 / 0.7975 / 0.8053 / 0.8061 | 0.3724 / 0.3751 / 0.3776 / 0.3789
LSH | 0.4306 / 0.4712 / 0.5085 / 0.5276 | 0.4402 / 0.5591 / 0.6628 / 0.7262 | 0.3421 / 0.3554 / 0.3544 / 0.3672
DLLE | 0.5234 / 0.5330 / 0.5466 / 0.5506 | 0.7629 / 0.8068 / 0.8131 / 0.8193 | 0.3738 / 0.3782 / 0.3794 / 0.3823
HCOH | 0.5450 / 0.5474 / 0.5494 / 0.5490 | 0.2436 / 0.6050 / 0.6070 / 0.6072 | 0.3232 / 0.3451 / 0.3434 / 0.3645
MFH | 0.4630 / 0.5040 / 0.5455 / 0.5569 | 0.5364 / 0.6376 / 0.6941 / 0.7216 | 0.3673 / 0.3752 / 0.3803 / 0.3815
MVLH | 0.3027 / 0.3166 / 0.3000 / 0.3045 | 0.5469 / 0.6324 / 0.6995 / 0.7203 | 0.3363 / 0.3339 / 0.3324 / 0.3284
OMH-DQ | 0.4117 / 0.4393 / 0.4556 / 0.4319 | 0.5673 / 0.7040 / 0.8096 / 0.8542 | 0.5223 / 0.5381 / 0.5823 / 0.5957
Ours | | |
Figure 2: Performance variation with respect to δ.

Baselines and Evaluation Scheme
We compare our method with several state-of-the-art hashing methods. These baselines can be divided into two categories: (1) multi-modal hashing methods, including MFH (Song et al. 2013), MVLH (Shen et al. 2015), and OMH-DQ (Lu et al. 2019); (2) single-modal hashing methods, including ITQ (Gong et al. 2013), LSH (Datar et al. 2004), DLLE (Ji et al. 2017), and HCOH (Lin et al. 2018). Since the single-modal methods cannot deal with multiple modalities simultaneously, we concatenate the multiple modalities into the input feature for a fair comparison. We adjust the parameters of each method over the candidate ranges given in the original papers and report the best results. The performance is evaluated by Mean Average Precision (mAP) (Yi and Yeung 2012; Zhang and Li 2014). For a query $q$, the Average Precision (AP) is defined as follows:
$$AP(q) = \frac{1}{l_q} \sum_{m=1}^{R} P_q(m) \delta_q(m) \quad (16)$$
where $P_q(m)$ denotes the precision of the top $m$ retrieval results; $\delta_q(m) = 1$ if the $m$-th position is a true neighbour of the query $q$, and $\delta_q(m) = 0$ otherwise; $l_q$ is the number of true neighbours among the top $R$ retrieval results. The mAP is defined as the mean of the average precisions of all the queries.
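Eq. (16) can be sketched as follows; `relevant` is a ground-truth indicator over a ranked retrieval list, and the function names are our own:

```python
import numpy as np

def average_precision(relevant, R):
    """AP of Eq. (16): score only the top-R positions of the ranked
    list; l_q is the number of true neighbours among them."""
    rel = np.asarray(relevant[:R], dtype=float)
    l_q = rel.sum()
    if l_q == 0:
        return 0.0
    precision_at = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # P_q(m)
    return float((precision_at * rel).sum() / l_q)

def mean_average_precision(relevance_lists, R):
    """mAP: mean of the per-query average precisions."""
    return float(np.mean([average_precision(r, R) for r in relevance_lists]))
```

For example, a ranked list with true neighbours at positions 1 and 3 gives $AP = \frac{1}{2}(\frac{1}{1} + \frac{2}{3}) = \frac{5}{6}$.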
Figure 3: The results of the ablation experiments on the adaptive online stage.
Accuracy Comparison
The experimental results on WiKi, Pascal VOC 2007 and NUS-WIDE are presented in Table 1. We can clearly observe that our method consistently outperforms all the baselines used in the comparison when the code length varies from 16 bits to 128 bits. As the code length increases, the performance of our method improves slightly. This behaviour demonstrates that our method is not sensitive to the code length and can achieve satisfactory performance even with short codes. Compared with OMH-DQ, our method achieves an average improvement of 23%, 13% and 7% on WiKi, Pascal VOC 2007 and NUS-WIDE respectively. This indicates that our method can generate effective hash codes in large-scale applications.
Ablation study
In our method, the projection matrices are learned during the offline stage, and the hash codes of newly arriving multimedia data are generated in an online mode in order to capture the variations inherent in the data. To validate the effectiveness of the proposed adaptive online strategy, we conduct ablation experiments. Let 'Fixed' denote the variant in which the weights learned in the offline stage are applied unchanged to generate hash codes for newly arriving data. As shown in Fig. 3, our method exhibits a considerable improvement over 'Fixed'. Fig. 4 shows the dynamic variation of the weight of each modality for a new data batch. Note that the variations for Pascal VOC 2007 and NUS-WIDE are larger than for WiKi. From Fig. 3, it is apparent that the improvement gained in precision is correlated with the extent of variation of the dynamic weights. This implies that the weight adaptation is very important: it benefits the performance, especially for data with large diversity.

Figure 4: Visualization of modality weights adapted to dynamic data ((a) WiKi, (b) Pascal VOC 2007, (c) NUS-WIDE; image and text modalities).

Figure 5: Convergence curves on WiKi (a), Pascal VOC 2007 (b) and NUS-WIDE (c).
Run Time Comparison
In this subsection, we investigate the training time of the proposed method and compare it with the baselines by conducting experiments on the WiKi, Pascal VOC 2007 and NUS-WIDE datasets respectively. The statistics of the results are reported in Table 2. We can see that our method requires less training time than OMH-DQ. LSH is a popular data-independent method, and its computation cost is obviously relatively low. HCOH is also a supervised method based on the Hadamard matrix, but its optimization does not involve a matrix inverse operation. Except for LSH and HCOH, our method is faster than all the other methods compared. Although the training time of our method is slightly longer than that of LSH and HCOH, its performance is much better.
Parameter sensitivity and Convergence analysis
There is only one penalty parameter, δ, introduced to avoid overfitting in our model. In order to explore its effect on the performance of our model, we vary its value in the range {1e-5, 1e-4, 1e-3, 1e-2, 1e-1}. The performance curves are plotted in Fig. 2. We can see that a degradation commences on WiKi from 1e-3. In contrast, the performance is relatively stable over a large range of values on Pascal VOC 2007 and NUS-WIDE, which may be because overfitting is less likely to happen on larger datasets. In conclusion, our model is insensitive to the parameter and can be applied flexibly, especially to larger-scale multimedia retrieval problems.

Table 2: Comparison of training time (seconds)

Methods | WiKi | Pascal VOC 2007 | NUS-WIDE
ITQ | 0.5033 | 87.7270 | 2.5625
LSH | 0.0129 | 2.9982 | 0.1014
DLLE | 139.1312 | 147.1619 | 1461.2236
HCOH | 0.2302 | 9.9715 | 3.1992
MFH | 2.7651 | 25.9216 | 19.0934
MVLH | 184.7249 | 452.2433 | 913.1369
OMH-DQ | 8.3657 | 192.0446 | 70.1140
Ours | 0.3724 | 23.8603 | 1.6507

The optimisation process based on the updating rules (see Eq. (10) and Eq. (12)) decreases the objective function monotonically and rapidly converges to the minimum. This is shown by the results of the experiments on WiKi, Pascal VOC 2007 and NUS-WIDE using our model with codes of 128-bit length. The convergence curves obtained on the three datasets are plotted in Fig. 5. (The convergence trend for hash codes of other lengths is similar.) As shown in Fig. 5, our model converges within 5 iterations on WiKi, Pascal VOC 2007 and NUS-WIDE respectively.

Conclusion
In this paper, we proposed a novel multi-modal hashing method in which a Hadamard matrix is introduced to generate a discriminative hash center for each content category. Our model exhibits strong discriminative capability and is computationally light. As it is not highly sensitive to hyper-parameters, it can be applied very flexibly. The results of the experiments conducted on several public multi-modal datasets demonstrate the superior accuracy and efficiency of the proposed method compared to state-of-the-art algorithms.
References
Chua, T.-S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; and Zheng, Y. 2009. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval.
Datar, M.; Immorlica, N.; Indyk, P.; and Mirrokni, V. S. 2004. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, 253–262.
Gong, Y.; Lazebnik, S.; Gordo, A.; and Perronnin, F. 2013. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-Scale Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence.
IEEE Transactions on Image Processing.
IEEE Transactions on Image Processing.
IEEE Transactions on Image Processing.
IEEE Transactions on Pattern Analysis and Machine Intelligence.
International Journal of Computer Vision.
Proceedings of the 26th ACM International Conference on Multimedia, 1635–1643.
Liu, H.; Ji, R.; Wu, Y.; and Hua, G. 2016. Supervised Matrix Factorization for Cross-Modality Hashing. In International Joint Conference on Artificial Intelligence, 1767–1773.
Liu, X.; He, J.; Liu, D.; and Lang, B. 2012. Compact kernel hashing with multiple features. In Proceedings of the 20th ACM International Conference on Multimedia, 881–884.
Lu, X.; Zhu, L.; Cheng, Z.; Nie, L.; and Zhang, H. 2019. Online Multi-Modal Hashing with Dynamic Query-Adaption. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 715–724.
Rasiwasia, N.; Costa Pereira, J.; Coviello, E.; Doyle, G.; Lanckriet, G. R.; Levy, R.; and Vasconcelos, N. 2010. A New Approach to Cross-Modal Multimedia Retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, 251–260.
Shen, F.; Shen, C.; Liu, W.; and Shen, H. T. 2015. Supervised Discrete Hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 37–45.
Shen, X.; Shen, F.; Sun, Q.-S.; and Yuan, Y.-H. 2015. Multi-view Latent Hashing for Efficient Multimedia Search. In
Proceedings of the 23rd ACM International Conference on Multimedia, 831–834.
Song, J.; Yang, Y.; Huang, Z.; and Shen, H. 2013. Effective Multiple Feature Hashing for Large-Scale Near-Duplicate Video Retrieval. IEEE Transactions on Multimedia.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science.
IEEE Transactions on Circuits and Systems for Video Technology.
arXiv: Data Structures and Algorithms.
Wei, Y.; Zhao, Y.; Lu, C.; Wei, S.; Liu, L.; Zhu, Z.; and Yan, S. 2017. Cross-Modal Retrieval With CNN Visual Features: A New Baseline.
IEEE Transactions on Systems, Man, and Cybernetics.
ACM Transactions on Intelligent Systems and Technology (TIST).
International Conference on Neural Information Processing Systems.
Yu, J.; Wu, X.; and Kittler, J. 2019. Discriminative Supervised Hashing for Cross-Modal Similarity Search. Image and Vision Computing, 89: 50–56.
Yuan, L.; Wang, T.; Zhang, X.; Tay, F. E.; Jie, Z.; Liu, W.; and Feng, J. 2020. Central Similarity Quantization for Efficient Image and Video Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Zhang, D.; and Li, W.-J. 2014. Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2177–2183. AAAI Press.
Zhang, D.; Wu, X.-J.; and Yu, J. 2020. Learning latent hash codes with discriminative structure preserving for cross-modal retrieval. Pattern Analysis and Applications (4).
Zhu, L.; Lu, X.; Cheng, Z.; Li, J.; and Zhang, H. 2020. Flexible Multi-modal Hashing for Scalable Multimedia Retrieval.