Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Donghuo Zeng
National Institute of Informatics
Tokyo, Japan, [email protected]
Yi Yu
National Institute of Informatics
Tokyo, Japan, [email protected]
Keizo Oyama
National Institute of Informatics
Tokyo, Japan, [email protected]
Abstract—Deep learning has successfully shown excellent performance in learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where the temporal structures of different data modalities, such as audio and video, should be taken into account. Music video retrieval by a given musical audio is a natural way to search and interact with music contents. In this work, we study cross-modal music video retrieval in terms of emotion similarity. In particular, an audio of arbitrary length is used to retrieve a longer or full-length music video. To this end, we propose a novel audio-visual embedding algorithm based on Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a shared space to bridge the semantic gap between them. This also preserves the similarity between audio and visual contents from different videos with the same class label, as well as the temporal structure. The contribution of our approach is mainly manifested in two aspects: i) we propose to select the top k audio chunks with an attention-based Long Short-Term Memory (LSTM) model, which provides a good audio summarization with local properties; ii) we propose an end-to-end deep model for cross-modal audio-visual learning where S-DCCA is trained to learn the semantic correlation between audio and visual modalities. Due to the lack of a music video dataset, we construct a 10K music video dataset from the YouTube-8M dataset. Promising results in terms of MAP and precision-recall show that our proposed model can be applied to music video retrieval.
Index Terms—Deep learning, Cross-modal retrieval, Deep CCA
I. INTRODUCTION

Deep cross-modal learning is a very important research topic in the area of multimedia and computer vision, with the goal of learning joint representations between different data modalities such as image-text [21], [24] and audio-lyrics [25]. In cross-modal music video retrieval, taking a piece of music audio as a query to retrieve visual contents is a natural way to find an interesting music video, which facilitates and improves people's music experience. Imagine the following scenario: a user sits in a bar and a song attracts his attention. He instantly records the song with his cellphone and uses it as a query to find semantically similar music videos, as shown in Fig. 1. Correlation learning between visual and audio sequences is non-trivial, and little work has contributed to this task where the temporal structures of different modalities should be considered. The large volume of music videos available on the Internet provides a nice opportunity to learn the correlation between visual and audio temporal sequences.
Fig. 1. Overview of music video retrieval: select one or more representative audio chunks as the query to find similar music videos, based on content similarity.

A music video contains visual and audio modalities, which are embedded in musical temporal sequences to express the music theme and story. Moreover, as a special form of expression, a music video also conveys strong feelings and emotions, which are semantically contained in the audio and visual modalities. That is to say, music emotion is delivered by both audio and visual modalities in a music video. This motivates us to learn a joint embedding space where music audio and visual contents are assumed to have the same semantic meaning.

In this work, we study how to use audio to retrieve music video under a realistic setting: with a segment of music audio of variable length as a query, the system automatically finds the music video that is similar to this audio with regard to emotions. In other words, an audio of arbitrary length can retrieve a longer or full-length music video. It is natural for users to search music videos in this way. However, it is a challenging research issue because audio and video are different modalities with different low-level features and different properties of temporal structure. To this end, we propose a novel audio-visual embedding algorithm based on Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a joint feature space to bridge the gap across modalities. This also preserves the similarity between audio and visual contents from different videos with the same class label, as well as the temporal structure. In addition to selecting 10K music videos from the YouTube-8M dataset, the main contributions made in this paper are as follows:

i) To the best of our knowledge, this is the first work that studies how to retrieve a full video by an audio of variable length.

ii) We propose to select k representative audio chunks based on emotion features extracted by a Long Short-Term Memory (LSTM)-based attention model, which serve as an audio summary while preserving the temporal structure.

iii) We propose an end-to-end deep architecture for cross-modal audio-visual embedding where S-DCCA is trained to learn the semantic correlation between audio and visual modalities.

iv) Evaluation demonstrates that our deep model has competitive performance compared with state-of-the-art approaches.

The rest of this paper is structured as follows. Section II introduces related work on deep cross-modal embedding and multimedia retrieval. Section III presents the architecture of our model and Section IV reports the experimental results. Finally, Section V draws the conclusion and points out future work.

II. RELATED WORK

Cross-modal music retrieval intensively focuses on studying music and visual modalities [2], [4], [7], [14], [18], [22]. Similarities between audio features extracted from songs and image features extracted from album covers are trained with the Java SOMToolbox framework in [14]. Based on this similarity, people can easily manage a music collection and use album covers as visual content to search for a song in a music collection. Based on multimodal mixture models, a statistical method is applied to jointly model music, images, and text [4] for facilitating multimodal music retrieval. In another line of work, sensor data streams are mapped to geo-features and visual features are calculated from video content.
With a trained SVM_hmm model, mood tags associated with visual-aware likelihoods are generated. Then, the likelihoods of the mood tags associated with location information and video content are combined by late fusion. Mood tags with large likelihoods are regarded as the scene moods of the video. Finally, the songs matching the user's listening history are extracted as personalized recommendations [18]. To learn the semantic correlation between music and video, a method for choosing features and detecting statistical novelty based on kernel methods [7] is suggested to segment music songs. Co-occurring changes in the audio and video of music videos can be found, and these correlations can be applied to cross-modal audio-visual music retrieval.

The key idea of cross-modal correlation learning is to learn a joint space where different modalities can be correlated semantically. In particular, recent progress mainly focuses on cross-modal learning between text and image, such as [11], [26]. Most existing deep architectures with two sub-networks exploit a pre-trained convolutional neural network (CNN) [19] as the image branch [23] and utilize a pre-trained text-level embedding model [12] or hand-crafted feature extraction such as bag of words [11] as the text branch. Image and text features are then projected to a shared space to compute a ranking loss in a feed-forward way. Image-text benchmarks such as [13], [16] are used to evaluate the performance of cross-modal matching and retrieval.

Existing deep cross-modal retrieval methods have two properties: i) little work on cross-modal correlation learning takes into account the temporal structure of the different modalities; ii) pre-trained models are directly used to extract image or text features. Distinguished from existing deep cross-modal retrieval architectures, this work takes temporal structure into account to learn the correlation between audio and video for cross-modal music video retrieval, where sequential audio and visual contents are projected to the same canonical space. An end-to-end neural network architecture with two-branch sequential structures for audio and video is investigated. Most importantly, we propose a novel method that extracts representative chunks from audio, which is able to summarize audios of different lengths. In addition, we propose a supervised deep CCA method to learn their semantic correlation.

III. ARCHITECTURE

Ideally, continuous audio segments (called chunks in this paper) that are short enough share the same musical properties, such as emotion attributes. This motivates us to equally divide a long audio sequence into chunks of the same length. Then, the emotion information of each chunk is computed, and the chunks with the highest attention intensity are used to represent the whole audio sequence. Through the cross-modal correlation between the best audio chunks and the visual features, the most similar videos can be found.
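As a minimal illustration of this chunking step, the sketch below assumes the frame-level audio feature is a (seconds × dimensions) array, as released with YouTube-8M; the 128-dimensional feature size and the function name are illustrative, not the authors' exact code.

```python
import numpy as np

def split_into_chunks(frame_features, n_chunks):
    """Equally divide a frame-level feature sequence (seconds x dims)
    into n_chunks contiguous chunks of the same length."""
    seconds, dims = frame_features.shape
    assert seconds % n_chunks == 0, "audio length must be divisible by the chunk count"
    chunk_len = seconds // n_chunks
    return frame_features.reshape(n_chunks, chunk_len, dims)

# e.g. a 216-second audio divided into 3 chunks of 72 seconds each
audio = np.random.randn(216, 128)          # placeholder frame-level audio feature
chunks = split_into_chunks(audio, 3)       # shape (3, 72, 128)
```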
A. Neural Attention Modeling
The main part of the attention computation is realized by a Long Short-Term Memory (LSTM) network [9] with a bi-directional extension. An LSTM model contains self-loops which can keep the gradient flowing for long periods. The weights in the self-loops are updated based on the context and can change dynamically according to the input sequence through the four components of the LSTM structure:

1) The input gate decides which values will be updated, depending on the current input x_t and the previous hidden state h_{t-1}:

i_t = \sigma(b_i + W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1}).   (1)

2) The forget gate decides what kind of information should be discarded from the cell state:

f_t = \sigma(b_f + W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1}).   (2)

3) The cell state c_t is updated from the old state c_{t-1} as:

c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c).   (3)

4) The output gate is computed as:

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o).   (4)

The hidden state is then

h_t = o_t \tanh(c_t),   (5)

where x_t is the current input, h_{t-1} denotes the previous hidden state, and W and b are the weight and bias matrices, respectively.

Fig. 2. (a) Main structure of the neural attention model, which takes a sequence of audio chunks as input, processes it with forward and backward LSTMs, and uses the bi-directional LSTM outputs to compute the attention scores and a 72-dimensional attention distribution. (b) An LSTM memory block with three gates.

Fig. 3. Emotion learning model for evaluating the contribution of each chunk to emotions. When an original 216-second audio is divided into 3 chunks, the model calculates the contribution score of each chunk, which helps to obtain the top k chunks.

LSTM is a one-way computation. In order to consider both past and future information, the extension of the LSTM network adds one more layer with the opposite temporal direction and is named bi-directional LSTM, as shown in Fig. 2. In our work, each audio is divided into 72 chunks of 3 seconds each. Then, the bi-directional LSTM model is applied to each chunk. In the attention model, the input of the bi-directional LSTMs is the output of a global max-pooling layer, which forms the first attention layer to compute the contribution scores of different audio chunks. The attention score u_t of the t-th chunk is computed as

u_t = W^T \tanh(W_f h_{tf} + W_b h_{tb} + \beta),   (6)

where h_{tf} and h_{tb} are the outputs of the forward and backward LSTM for the t-th chunk, and W, W_f, W_b, and \beta are the weight parameters of the attention score function. Once the attention scores are obtained, the attention distribution \theta is calculated by a softmax function:

\theta = softmax(u_t).   (7)

We regard this architecture as an emotion learning model [10], which is trained over the MER31K dataset, using emotion tags from AllMusic. The procedure for selecting audio segments with the emotion learning model is shown in Fig. 3. Firstly, the emotion learning model is used to evaluate the contribution of each chunk to emotions. The contribution scores allow us to rank the chunks. Secondly, among the ranked chunks, the top k are selected. For instance, the first audio in Fig. 3 is divided into 3 chunks and, based on the contribution scores, the third chunk is selected as the best one because it has the highest score within the audio.
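The sketch below is a hedged PyTorch rendering of the bi-directional LSTM attention scoring of Eqs. (6)-(7) followed by top-k chunk selection; the class name, layer sizes, and the omission of the global max-pooling front end are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ChunkAttention(nn.Module):
    """Bi-directional LSTM over a sequence of chunk features, followed by an
    additive attention score per chunk and a softmax attention distribution."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, hidden)      # plays the role of W_f, W_b and beta
        self.score = nn.Linear(hidden, 1, bias=False)  # plays the role of the weight W in Eq. (6)

    def forward(self, chunk_feats):                    # (batch, n_chunks, feat_dim)
        h, _ = self.bilstm(chunk_feats)                # (batch, n_chunks, 2*hidden)
        u = self.score(torch.tanh(self.proj(h)))       # attention scores u_t
        theta = torch.softmax(u.squeeze(-1), dim=1)    # attention distribution, Eq. (7)
        return theta

# Rank chunks by attention and keep the top-k as the audio summary
model = ChunkAttention()
chunks = torch.randn(1, 72, 128)               # 72 chunks of 3-second features (illustrative)
theta = model(chunks)
topk = torch.topk(theta, k=3, dim=1).indices   # indices of the 3 most attended chunks
```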
B. Supervised Deep Canonical Correlation Analysis and Distance Similarity

CCA [3] is a classical approach for correlation analysis between two or more modalities. Its core idea is to learn projection matrices that map features of different modalities into the same space, where the correlation between similar items of different modalities is maximized.

Denote X ∈ R^k as an audio feature, Y ∈ R^l as a visual feature, and denote W_x and W_y as matrices that linearly map X and Y to the same space. Then W_x and W_y are found by maximizing the correlation between W_x^T X and W_y^T Y, as follows:

(W_x, W_y) = \arg\max_{(W_x, W_y)} \frac{W_x^T \Sigma_{xy} W_y}{\sqrt{W_x^T \Sigma_{xx} W_x \cdot W_y^T \Sigma_{yy} W_y}},   (8)

where \Sigma_{xx} and \Sigma_{yy} represent the covariance matrices of X and Y, respectively, and \Sigma_{xy} is their cross-covariance matrix.

DCCA extends CCA, realizing non-linear projections by deep neural networks (DNNs). Assume the output of the (i-1)-th layer is X_{i-1} and Y_{i-1} (with X_0 = X and Y_0 = Y), and W_{xi}, W_{yi}, b_{xi}, b_{yi} are the weights and biases of the i-th layers. Then, the i-th layer outputs X_i = s(W_{xi}^T X_{i-1} + b_{xi}) and Y_i = s(W_{yi}^T Y_{i-1} + b_{yi}) at the two branches, where s: R → R is a nonlinear function. The outputs of the final (d-th) layer are f_x = s(W_{xd} X_{d-1} + b_{xd}) and f_y = s(W_{yd} Y_{d-1} + b_{yd}). Let \theta_x represent the parameters W_{xi}, b_{xi}, i = 1, ..., d, and \theta_y the parameters W_{yi}, b_{yi}, i = 1, ..., d. They are optimized by

(\theta_x^*, \theta_y^*) = \arg\max_{(\theta_x, \theta_y)} corr(f_x(X, \theta_x), f_y(Y, \theta_y)).   (9)

Supervised deep CCA does not merely consider the one-to-one match between all pairs of audio-visual data and apply deep CCA to learn the correlation. In order to preserve the similarity among items with the same class label, audio and visual contents from different videos with the same class label are formed into new relevant pairs to increase the number of training samples.

In the training process, the CCA objective function G(W_x^T \Sigma_{xy} W_y) is maximized to obtain the linear projection weights W_x, W_y and the non-linear functions f_x, f_y as follows:

(W_x, W_y, f_x, f_y) = \arg\max_{(W_x, W_y, f_x, f_y)} G(W_x^T \Sigma_{xy} W_y), s.t. W_x^T \Sigma_{xx} W_x = I, W_y^T \Sigma_{yy} W_y = I,   (10)

where the covariance matrices \Sigma_{xx}, \Sigma_{xy}, and \Sigma_{yy} are computed over the N training pairs. Different from DCCA, S-DCCA considers pairs between audio and visual contents from videos with the same class label, including pairs formed from different videos, when computing these covariance matrices; the number of such cross-video pairs determines the size of the training set. Similar to DCCA, all parameters are then optimized as in (9). The left side of Fig. 4 shows the whole process.

C. K-means Clustering

k-means clustering is a very popular unsupervised learning method for cluster analysis in data mining. It separates n variables into k clusters based on the nearest mean, where k is usually pre-defined by the user. Given a set of variables X = (x_1, x_2, ..., x_n), where each variable x_i ∈ X is a d-dimensional vector, in order to cluster them into k groups G = {g_1, g_2, ..., g_k} (k < n), a common method is to first randomly choose k values from X as initial cluster centers and then iteratively update the cluster centers after assigning each variable x_i to its closest cluster, until the cluster centers no longer change. The objective function is defined as follows:

\arg\min_G \sum_{i=1}^{k} \sum_{x \in g_i} ||x - u_i||^2,   (11)

where u_i is the mean of the points, i.e., the cluster center of g_i.
In our experiments, we allocate 3 annotated audios to each of the 10 pre-defined categories (angry, tender, bitter, cheerful, fun, bright, happy, anxious, calm, and warm) to compute the initial means u. We then use the k-means method to cluster all audios into 10 semantic classes based on their emotion features.
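A minimal numpy sketch of this seeded k-means step is given below; the feature dimension and array names are illustrative, and the initial centers are taken as the mean of the few annotated audios per emotion category, as described above.

```python
import numpy as np

def kmeans(X, init_centers, n_iter=100):
    """Plain k-means: X is (n, d) emotion features; init_centers is (k, d),
    here the mean of a few annotated audios per emotion category."""
    centers = init_centers.copy()
    for _ in range(n_iter):
        # assign each audio to the nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old center if a cluster becomes empty
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Illustrative: 10 emotion categories, initial centers from 3 annotated audios each
emotion_feats = np.random.randn(10000, 64)   # one emotion feature per audio (dims assumed)
seed_feats = np.random.randn(10, 3, 64)      # 3 annotated audios per category
labels, centers = kmeans(emotion_feats, seed_feats.mean(axis=1))
```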
D. Matching and Ranking

It is not easy to recognize emotion from the visual modality, because the visual features of the dataset are high-level semantic features without explicit emotional cues such as facial expression changes or body movement. However, the high-level semantic information extracted by a complex deep network is able to represent the emotion attributes contained in music. Based on this, we design the S-DCCA model to learn the correlation between audio and video, which enables us to use audio to retrieve video clips.

The audio-visual embedding maps audio chunks and visual features to a common space. This space links audio chunks and visual features in terms of emotion, and enables us to implement cross-modal music video retrieval based on emotion similarity. In cross-modal retrieval, given one or more audio chunks as the query, we calculate the similarity between the query chunks and each of the visual features in the database in the emotion-based embedding space. We use the cosine similarity between f_x(X, \theta_x) and f_y(Y, \theta_y) as the similarity metric, defined as

Cos(f_x, f_y) = \frac{f_x \cdot f_y}{||f_x|| \, ||f_y||}.   (12)

The details of our architecture are shown in Fig. 4. It consists of two branches: an audio branch and a visual branch. Firstly, the pre-trained VGG16 model is used to extract frame-level audio features and the pre-trained Inception model is used to extract frame-level visual features for all data in the dataset. Secondly, the frame-level visual features are reduced to video-level features by max pooling. For the audio branch, we feed the frame-level audio features into the pre-trained emotion learning model [10] to extract emotion features, based on which the top k chunks are selected for music video retrieval; these are then fed into Sub-Net1 and Sub-Net2, respectively. Thirdly, based on the extracted emotion features, we apply k-means to cluster the audios into 10 groups. Fourthly, the video-level visual features and the emotion features of the top k audio chunks are fed into 4 fully connected layers, which generate compact features. Finally, the CCA components of these compact features are used to compute the similarity between videos and audio chunks.
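The ranking step of Eq. (12) is easy to sketch once the branch outputs are available; the snippet below assumes the audio and video embeddings in the shared space have already been computed, and the 30-dimensional size is only illustrative.

```python
import numpy as np

def rank_videos(audio_emb, video_embs):
    """Rank database videos by cosine similarity (Eq. 12) to the query
    audio embedding in the shared embedding space."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ a                       # cosine similarity to every video
    return np.argsort(-sims), sims     # video indices, most similar first

# Illustrative: a query embedding and 10,000 video embeddings of dimension 30
query = np.random.randn(30)
database = np.random.randn(10000, 30)
ranking, sims = rank_videos(query, database)
```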
IV. EXPERIMENTS

In this section, we evaluate the performance of the proposed S-DCCA for cross-modal music video retrieval, studying the influence of the number of chunks and cross-modal music video retrieval by audio.
A. Dataset and Evaluation Metrics
1) Dataset: The second version of the YouTube-8M dataset [1] is a large-scale video dataset which includes more than 7 million videos with 4716 classes labeled by an annotation system. The dataset consists of three parts: a training set, a validation set, and a test set. In the training set, each class contains at least 100 training videos. Features of these videos are extracted by popular state-of-the-art pre-trained models and released for public use. Each video contains both an audio and a visual modality. Based on the visual information, videos are divided into 24 topics, such as sports, games, arts & entertainment, etc. In particular, the arts & entertainment topic contains the "music video" label, which allows us to construct a music dataset. A video included in our music video dataset (MV-10K) should satisfy two conditions:

1) The video includes the "music video" label and no other labels.
2) The length of the video ranges from 213 to 219 seconds.

In order to keep enough information in each chunk, the number of chunks per audio is set to 3, 6, or 9. We select videos whose length is around 216 seconds, because 216 is a common multiple of 3, 6, and 9. In our experiments, we separately obtain 4 subsets of videos based on different video length spans, with details shown in Table I.

YouTube-8M has already released frame-level and video-level features for both the audio and visual information. The frame-level visual features are extracted by the public Inception model trained on ImageNet. One visual frame is computed per second for the first 6 minutes of each video. After transfer learning and feature dimension reduction with PCA, the dimension of the frame-level visual features is len × 1024, where len is the video length in seconds. The video-level visual feature is obtained by the DBoF approach [1].
TABLE I
INFORMATION ABOUT THE SELECTED MUSIC DATASETS

Length span              Selected size
216 ± 3:  [213, 219]     10,000
216 ± 6:  [210, 222]     20,000
216 ± 9:  [207, 225]     30,000
216 ± 12: [204, 228]     40,000

The frame-level audio features are extracted by a VGG-like model, as described in [8], and their average is computed as the video-level audio feature.

2) Evaluation Metrics: In this paper, we choose recall, precision, and MAP as the main metrics for the quantitative evaluation of our method.
Precision and Recall [15] are a pair of metrics related to the numbers of relevant and retrieved documents. In our experiments, precision is the fraction of retrieved music videos that are relevant to the audio query, and recall is the fraction of the relevant music videos that are correctly retrieved.
Mean Average Precision (MAP) [6] over all audio queries is the mean of the average precision for each audio query. When a music audio is used as a query, over its N ranked retrieved music videos the average precision (AP) is defined as

AP = \frac{1}{R} \sum_{i=1}^{N} p(i) \cdot rel(i),   (13)

where R is the number of relevant music videos that belong to the same cluster as the query, p(i) is the precision of the top i music videos, and rel(i) is a binary value which is 1 if the i-th music video belongs to the same cluster as the query, and 0 otherwise. The cluster label of each audio-visual pair is only used in the training process. During testing, we assume all music videos that have the same cluster label as the query audio are relevant.
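The sketch below is a minimal numpy rendering of Eq. (13) and its mean over queries; it assumes the ranked list is given as a binary relevance vector, and the example numbers are purely illustrative.

```python
import numpy as np

def average_precision(ranked_relevance, R):
    """AP of Eq. (13): ranked_relevance[i] is 1 if the i-th retrieved music
    video shares the query's cluster label, R is the number of relevant videos."""
    rel = np.asarray(ranked_relevance, dtype=float)
    precision_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_i * rel).sum() / R)

def mean_average_precision(all_ranked_relevance, all_R):
    """MAP: mean of the per-query average precisions."""
    return float(np.mean([average_precision(r, R)
                          for r, R in zip(all_ranked_relevance, all_R)]))

# e.g. a query with 3 relevant videos and a retrieved list of length 5
print(average_precision([1, 0, 1, 0, 1], R=3))  # (1/3)*(1 + 2/3 + 3/5) ≈ 0.756
```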
B. Experiment Setting

The frame-level video features in YouTube-8M are computed at one frame per second, matching the pre-trained emotion learning model. We divide the 216-second frame-level audio feature into 72 chunks. The attention model is applied to each chunk to calculate its emotion contribution score, and each 3-second segment shares the same score. Finally, the result of max pooling is taken as the emotion score of each chunk. The following parameters are used in our experiments:
• Network parameters. Both the audio and the visual branch have 4 hidden layers. The number of units per layer is 512, 512, 256, 256 in the visual branch, and 128, 128, 64, 64 in the audio branch. The number of CCA components is 30. We set the dropout probability to 0.2, apply tanh as the activation function in each hidden layer, and use the sigmoid function in the final layer.
• Experiment parameters. The training batch size is 512 and the test batch size is 64. The number of training epochs is 50.
• We run the experiments with 5-fold cross-validation and report the average performance.
• The RMSProp optimizer is used and the learning rate is set to 0.001.

Fig. 5. Precision-recall curve with the number of chunks set to 3, where "mean" denotes using the average of the frame-level audio features as the query, and k (= 1, 2) is the number of audio chunks selected as the query.

TABLE II
MAP RESULTS OF DIFFERENT METHODS UNDER DIFFERENT CONFIGURATIONS

k/chunks      1/3     2/6     3/9     mean
Multi-views   14.02   14.36   14.25   14.58
CCA           18.34   18.39   18.32   18.35
KCCA          17.54   17.04   17.49   17.80
DCCA          18.35   18.39   18.22   18.40
C-CCA         18.51   19.60   19.73   19.72
S-DCCA        21.38   21.43   21.24   21.76
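As a rough sketch of the two branches described in the Experiment Setting above, the PyTorch snippet below uses the stated layer widths, dropout, activations, and optimizer; the input dimensions (1024 visual, 128 audio) follow the YouTube-8M features described earlier, and the 30-dimensional output is an assumption aligned with the CCA component count rather than a detail stated by the authors.

```python
import torch
import torch.nn as nn

def branch(in_dim, widths, out_dim):
    """One branch: tanh hidden layers with dropout 0.2, sigmoid output layer."""
    layers, prev = [], in_dim
    for w in widths:
        layers += [nn.Linear(prev, w), nn.Tanh(), nn.Dropout(0.2)]
        prev = w
    layers += [nn.Linear(prev, out_dim), nn.Sigmoid()]
    return nn.Sequential(*layers)

# Visual branch: 512-512-256-256 units; audio branch: 128-128-64-64 units.
visual_net = branch(1024, [512, 512, 256, 256], 30)
audio_net = branch(128, [128, 128, 64, 64], 30)

# RMSProp with learning rate 0.001, as in the experiment settings
optimizer = torch.optim.RMSprop(list(visual_net.parameters()) +
                                list(audio_net.parameters()), lr=0.001)
```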
C. Baseline
Multi-view learning [27] is a technique in machine learning that learns one function per view to model multiple views and jointly optimizes all functions to remove the cross-view gap.
The CCA [20] algorithm finds the correlations between two multivariate sets of vectors by linear projections, and relies on singular value decomposition.
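For reference, the snippet below is a standard numpy sketch of linear CCA via SVD of the whitened cross-covariance; it is not the authors' implementation, and the small regularizer added for numerical stability is an assumption.

```python
import numpy as np

def linear_cca(X, Y, n_components=30, reg=1e-4):
    """Classical CCA: X is (n, dx) audio features, Y is (n, dy) visual features
    with paired rows; returns the projection matrices and canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # inverse matrix square root of a symmetric positive-definite matrix
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    Wx = inv_sqrt(Sxx) @ U[:, :n_components]   # projection for X
    Wy = inv_sqrt(Syy) @ Vt[:n_components].T   # projection for Y
    return Wx, Wy, s[:n_components]            # s holds the canonical correlations
```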
KCCA [5] is also a method for extracting common features from two data sets. Instead of linear correlation, KCCA obtains non-linear correlation through the kernel method; we use a Gaussian kernel and set the parameter β = 0.4.

DCCA [3] learns nonlinear transformations of two data sets such that the outputs are highly correlated.
C-CCA [17] (Cluster-CCA) is a CCA variant. Different from standard CCA, the C-CCA algorithm clusters each data set into several groups or classes and tries to enhance the intra-cluster correlation.
Fig. 6. Precision-recall curve with the number of chunks set to 6, where "mean" denotes using the average of the frame-level audio features, and k (= 1, 2, 3) is the number of audio chunks selected as the query.

Fig. 7. Precision-recall curve with the number of chunks set to 9, where "mean" denotes using the average of the frame-level audio features, and k (= 1, 2, 3) is the number of audio chunks selected as the query.
D. Experiment Result and Analysis
Our S-DCCA experiments use three different training sets to obtain three different models. The basic C-CCA and S-DCCA models are trained on the 8,000 one-to-one pairs. To enhance the intra-cluster correlation, we further consider the correlation between audios and visual contents from different videos of the same cluster to learn the relationship between the two modalities, and we construct more audio-visual pairs during training. The C-CCA-extend1 and S-DCCA-extend1 models are trained on around 0.8 million pairs, and the C-CCA-extend2 and S-DCCA-extend2 models on around 1.5 million pairs, where the former (-extend1) models use 50% of all music videos of a cluster to form training pairs with each audio in the cluster, and the latter (-extend2) models use 100% of the music videos in the same cluster.

Fig. 8. Precision-recall curve, achieved by changing the number of outputs, where k (= 1, 2, 3) is the number of chunks selected from all chunks (c) of an audio as the query; for example, k/c = 1/3 denotes selecting 1 chunk from an audio that is divided into 3 chunks. "mean" denotes using the average of the whole audio as the query.

Fig. 9. Mean average precision when using different numbers of audio chunks selected as query for video retrieval; k denotes the number of chunks selected as the query, and c denotes the number of overall chunks into which the audio is divided.

We use the precision-recall curve to show how the results change as the number of outputs increases, and to compare our S-DCCA model with the DCCA model and the S-DCCA-extend2 model. Our model tries to leverage the temporal structure of the query audio: each query audio is divided into 3, 6, or 9 chunks, from which k chunks are selected as the actual query. In order to investigate the overall performance of our S-DCCA, we use MAP as the metric and compare S-DCCA with other CCA variants (DCCA, C-CCA, KCCA); we use the same embedding dimension for all methods, and the same hidden layer structure for DCCA, S-DCCA, S-DCCA-extend1, and S-DCCA-extend2. A retrieved video in the ranked list is considered correct if it has the same category as the query, and incorrect otherwise.

Figs. 5, 6, and 7 show the precision-recall curves, comparing the DCCA and S-DCCA-extend2 models. Each precision-recall pair is obtained by changing the number of music videos output. Generally, as the number of output music videos increases, recall increases and precision decreases. For the S-DCCA-extend2 model, these three figures show that precision starts at its highest value and then sharply decreases until recall reaches about 0.2, after which precision remains almost stable as recall increases to 1.0. The query and the model are the two main factors controlling the curve trend. As for the query, when each audio is divided into 3 or 6 chunks, the precision-recall curves of the selected chunks and the full-length audio are very close. But when each audio is divided into 9 chunks and 3 chunks are selected as the query, the performance is better than other configurations when the number of outputs is small. This suggests that these 3 chunks carry most of the emotion contribution and that this kind of information is helpful for cross-modal retrieval. As for the model, S-DCCA-extend2 is better than DCCA, which indicates that more videos in the output belong to the same cluster as the query for S-DCCA-extend2 than for DCCA.

We also investigate the influence of the number of overall chunks and the number of chunks selected.
Fig. 8 shows that, with the same volume of audio information as the query, when the audio is divided into 9 chunks and 3 chunks are selected as the query, the S-DCCA-extend2 model achieves the best performance (precision ranges from 26.6% to 23.8%; recall ranges from 0.20 to 0.41).

In order to further study the influence of the number of overall chunks and the number of chunks selected as the query, the MAP results of different models are compared in Table II and Fig. 9. As for the number of chunks selected, there is generally no big difference in MAP when the same model is used. When the same audio information is used as the query, comparing the MAP results among different models shows that training processes that explicitly exploit the cluster information generally outperform those without cluster information. As a result, S-DCCA (and S-DCCA-extend1, S-DCCA-extend2) and C-CCA (and C-CCA-extend1, C-CCA-extend2) achieve higher MAP than Multi-views, CCA, KCCA, and DCCA. This indicates that correlation learning based on both cluster information and instance features is better than correlation learning using instance features only. With the increase in the volume of training data, within the two groups (group 1: C-CCA, C-CCA-extend1, C-CCA-extend2; group 2: S-DCCA, S-DCCA-extend1, S-DCCA-extend2), the MAP gets higher and higher. This shows that considering all possible pairs within the two data sets for each label cluster yields better performance than one-to-one pairs, and it also illustrates that limited training data cannot learn the correlation between audio and visual features well in this case. Generally, using parts of the audio as queries achieves performance close to using the full-length audio as the query.

V. CONCLUSIONS
We proposed a supervised deep CCA model to learn a semantic space where audio and visual data from music videos, which are in different modalities, are linked to learn the cross-modal correlation. Besides the pairwise similarity, the semantic similarity between audio and visual contents from different videos in the same cluster is also explicitly considered. An end-to-end deep architecture that represents an audio sequence by representative chunks is studied. The experimental evaluation on the MV-10K data selected from YouTube-8M demonstrates the effectiveness of the proposed deep audio-visual embedding algorithm in cross-modal music video retrieval. In future work, we will try to integrate more user preference information into our deep architecture for personalized cross-modal music video recommendation, and we will investigate the task of taking a short video as a query to retrieve a longer or full audio.

ACKNOWLEDGEMENT

This work was partially supported by JSPS KAKENHI Grant Number 16K16058. The first author would like to thank Francisco Raposo for discussing how to implement CCA.
REFERENCES

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[2] Esra Acar, Frank Hopfgartner, and Sahin Albayrak. Understanding affective content of music videos through learned representations. In International Conference on Multimedia Modeling, pages 303–314. Springer, 2014.
[3] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.
[4] Eric Brochu, Nando De Freitas, and Kejie Bao. The sound of an album cover: Probabilistic multimedia and information retrieval. In Artificial Intelligence and Statistics (AISTATS), 2003.
[5] Nello Cristianini, John Shawe-Taylor, et al. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[6] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 7–16. ACM, 2014.
[7] Olivier Gillet, Slim Essid, and Gaël Richard. On the correlation of automatic audio and visual segmentations of music videos. IEEE Transactions on Circuits and Systems for Video Technology, 17(3):347–355, 2007.
[8] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 131–135. IEEE, 2017.
[9] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] Yu-Siang Huang, Szu-Yu Chou, and Yi-Hsuan Yang. Music thumbnailing via neural attention modeling of music emotion. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pages 347–350. IEEE, 2017.
[11] Qing-Yuan Jiang and Wu-Jun Li. Deep cross-modal hashing. CoRR, 2016.
[12] Jey Han Lau and Timothy Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368, 2016.
[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[14] Rudolf Mayer. Analysing the similarity of album art with self-organising maps. In International Workshop on Self-Organizing Maps, pages 357–366. Springer, 2011.
[15] David Martin Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[16] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R. G. Lanckriet, Roger Levy, and Nuno Vasconcelos. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, pages 251–260. ACM, 2010.
[17] Nikhil Rasiwasia, Dhruv Mahajan, Vijay Mahadevan, and Gaurav Aggarwal. Cluster canonical correlation analysis. In Artificial Intelligence and Statistics, pages 823–831, 2014.
[18] Rajiv Ratn Shah, Yi Yu, and Roger Zimmermann. ADVISOR: Personalized video soundtrack recommendation by late fusion with heuristic rankings. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 607–616. ACM, 2014.
[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[20] Bruce Thompson. Canonical correlation analysis. Encyclopedia of Statistics in Behavioral Science, 2005.
[21] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013, 2016.
[22] Yi Yu, Zhijie Shen, and Roger Zimmermann. Automatic music soundtrack generation for outdoor videos from contextual sensor information. In Proceedings of the 20th ACM International Conference on Multimedia, pages 1377–1378. ACM, 2012.
[23] Yi Yu, Suhua Tang, Kiyoharu Aizawa, and Akiko Aizawa. VenueNet: Fine-grained venue discovery by deep correlation learning. In Multimedia (ISM), 2017 IEEE International Symposium on, pages 288–291. IEEE, 2017.
[24] Yi Yu, Suhua Tang, Kiyoharu Aizawa, and Akiko Aizawa. Category-based deep CCA for fine-grained venue discovery from multimodal data. IEEE Transactions on Neural Networks and Learning Systems, pages 1–9, 2018.
[25] Yi Yu, Suhua Tang, Francisco Raposo, and Lei Chen. Deep cross-modal correlation learning for audio and lyrics in music retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 2017.
[26] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. Video captioning and retrieval models with semantic attention. arXiv preprint arXiv:1610.02947, 2016.
[27] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges.