Automatic Music Highlight Extraction using Convolutional Recurrent Attention Networks
Jung-Woo Ha, Adrian Kim, Chanju Kim, Jangyeon Park, and Sunghun Kim
Clova AI Research and Clova Music, NAVER Corp., Korea
Hong Kong University of Science and Technology, China
ABSTRACT
Music highlights are valuable content for music services. Most existing methods have focused on low-level signal features. We propose a method for extracting highlights using high-level features from convolutional recurrent attention networks (CRAN). CRAN utilizes convolution and recurrent layers for sequential learning, together with an attention mechanism. The attention allows CRAN to capture the snippets most significant for distinguishing between genres, which are then used as a high-level feature. CRAN was evaluated on over 32,000 tracks that were popular in Korea over a two-month period. Experimental results show that our method outperforms three baseline methods in both quantitative and qualitative evaluations. We also analyze the effects of attention and sequence information on performance.
1. INTRODUCTION
Identifying music highlights is an important task for online music services. For example, most online music services provide the first minute of a track as a free preview. If we can identify the highlight of each track, it is much better to play the highlight as the preview: users can quickly browse tracks by listening to highlights and select their favorites. Highlights can also contribute to music recommendation [1, 2, 3]. Using highlights, users can efficiently check discovery-based playlists containing unknown or newly released tracks.

Most existing methods have focused on low-level signal features such as pitch and loudness obtained via MFCC and FFT [4, 5]. Such approaches are therefore limited in extracting snippets that reflect high-level properties of a track, such as genre and theme. Although extraction by human experts guarantees high-quality results, it fundamentally does not scale.

In this paper, we assume that high-level information such as genre contributes to extracting highlights, and we propose a new deep learning-based technique for extracting music highlights. Our approach, convolutional recurrent attention-based highlight extraction (CRAN), uses both mel-spectrogram features and high-level acoustic features generated by an attention model [6]. First, CRAN finds highlight candidates by focusing on the regions that characterize different genres. This is achieved by setting track genres as the output of CRAN and learning to attend to the parts significant for characterizing the genres. Then, the highlights are determined by summing the energy of the mel-spectrogram and the attention scores. The genre classification loss is back-propagated and the weights, including the attention layer, are updated in an end-to-end manner. In addition, CRAN is trained in an unsupervised way with respect to finding highlights, because it does not use ground-truth highlight regions for training.

We evaluate CRAN on 32,000 popular tracks served from December 2016 to January 2017 through a well-known Korean online music service, NAVER Music. The evaluation dataset consists of songs of various genres, including K-pop and world music. For the experiments, we extract a 30-second highlight clip per track using CRAN and conduct a qualitative evaluation with a Likert scale (1 to 5) as well as a quantitative verification using ground-truth data generated by human experts. The results show that CRAN's highlights outperform three baselines: the first one minute, an energy-based method, and an attention model with no recurrent layer (CAN). CRAN also outperforms CAN and models without attention with respect to genre classification. Furthermore, we analyze the relationships between the attention and traditional low-level signals of tracks to show the attention's role in identifying highlights.
2. MUSIC DATA DESCRIPTION
We select 32,083 popular songs spanning 10 genres, played from December 2016 to January 2017 (two months) on NAVER Music. The data are summarized in Table 1. Note that some tracks belong to more than one genre, so the sum of tracks over genres is larger than the number of tracks. The data are separated into training, validation, and test sets. Considering a real-world service scenario, we separate the data based on the ranking of each track, as shown in Table 2; we use two ranking criteria, popularity and release date. For quantitative evaluation, eight human experts annotated a ground-truth dataset with the highlights of 300 tracks, as explained in Section 4.1. The experts marked the times at which they believe the highlight starts and stops while listening to each track.

Table 1. Constitution of tracks per genre
Table 2. Data separation for experiments
Data    Ratio (%)    Rank range (%)
We convert mp3 files to mel-spectrograms, two-dimensional matrices whose rows and columns correspond to mel-frequency bins and time slots. Each mel-spectrogram is generated from a time sequence sampled from an mp3 file at a sample rate of 8372 Hz using librosa [7]. The sample rate was set to twice the frequency of C8 in equal temperament (12-TET). The number of mel bins is 128 and the FFT window size is 1024, which makes a single time slot of a mel-spectrogram about 61 milliseconds long. The input representation x is generated as follows:

i. PT(x) ≥ 240 s: use the first 240 seconds of x;
ii. PT(x) < 240 s: fill in the missing part with the last 240 − PT(x) seconds of the track,

where PT(x) denotes the playing time of x. Therefore, we obtain a 128 × T mel-spectrogram matrix as the input, where T is the number of time slots covering 240 seconds.
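As a concrete illustration, the following is a minimal preprocessing sketch using librosa under the settings above. The hop length of 512 samples is our assumption, inferred from the reported slot duration (512 / 8372 Hz ≈ 61 ms); the function name and padding helper are ours rather than the authors' code.

import numpy as np
import librosa

SR = 8372          # sampling rate: twice the frequency of C8 in 12-TET
N_MELS = 128       # number of mel bins
N_FFT = 1024       # FFT window size
HOP = 512          # assumed hop length (512 / 8372 Hz is roughly 61 ms per slot)
TARGET_SEC = 240   # fixed input length in seconds

def track_to_mel(mp3_path):
    # Load the track and enforce the 240-second rule (steps i and ii above).
    y, _ = librosa.load(mp3_path, sr=SR)
    target_len = SR * TARGET_SEC
    if len(y) >= target_len:
        y = y[:target_len]                       # i: first 240 seconds
    else:
        pad = y[-(target_len - len(y)):]         # ii: last (240 - PT(x)) seconds
        y = np.concatenate([y, pad])             # (assumes PT(x) > 120 s)
    # 128 x T mel-spectrogram used as the input representation
    return librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT,
                                          hop_length=HOP, n_mels=N_MELS)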
3. ATTENTION-BASED HIGHLIGHT EXTRACTION

3.1. Convolutional Recurrent Attention Networks
Fig. 1. Structure of CRAN

CNN has been applied in many music pattern recognition approaches [8, 9, 10]. A low-level feature such as the mel-spectrogram can be abstracted into a high-level feature by repeated convolution operations and then used to compute the genre probabilities in the output layer. The attention layer aims to find which regions of the learned features play a significant role in distinguishing between genres; the attention results are later used to identify track highlights. Instead of 2D convolution [10], we use 1D convolution, treating each mel bin as a channel, to reduce training time without losing accuracy. Specifically, given a mel-spectrogram x of a track, an intermediate feature u is generated through convolution and pooling operations:

u = \mathrm{Concatenate}(\mathrm{Maxpooling}_n(\mathrm{Conv}_k(x)))    (1)

where n and k denote the numbers of pooling and convolution layers. We use the exponential linear unit (ELU) as the non-linear function [11]. After that, u is separated into a sequence of T time-slot vectors, U = \{u^{(t)}\}_{t=1}^{T}, which are fed into a bidirectional LSTM [12, 13]. We then obtain a set of T similarity vectors, V, from the tanh values of u^{(t)} and of a vector transformed from the LSTM output u':

u' = \mathrm{BiLSTM}(U)    (2)
V = \{v^{(t)}\}_{t=1}^{T} = g(U) \otimes f(u')    (3)
g(U) = \tanh(W_{TS} U)    (4)
f(u') = \mathrm{Re}(\tanh(W_{FC} u'), T)    (5)

where \otimes denotes element-wise multiplication, \mathrm{Re}(x, T) is a function that makes T duplicates of x, and W_{TS} and W_{FC} are the weight matrices of the time-separated connection for attention and of the fully connected layer (FC1) on the LSTM output, as shown in Fig. 1. CRAN uses the soft attention approach [14]. The attention score of \{u^{(t)}\} is the softmax value of \{v^{(t)}\} computed by a two-layer network:

\alpha_i = \mathrm{Softmax}(\tanh(W_{A} v^{(i)}))    (6)

where W_{A} is the weight matrix connecting the similarity vectors to each node of the attention score layer. Then, z is calculated as the attention-score-weighted summation of the similarity vectors over all time slots:

z = P \sum_{t=1}^{T} \alpha_t v^{(t)}    (7)

where P is a matrix for dimensionality compatibility. The context vector m is then obtained by element-wise multiplication between the tanh values of z and of the FC vector:

m = \tanh(z) \otimes \tanh(W_{FC} u')    (8)

Finally, the probability of a genre y is defined as the softmax function of m. The loss function is the categorical cross-entropy [15].
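To make the structure in Eqs. (1)-(8) concrete, the following is a minimal PyTorch sketch of the network, not the authors' implementation: the channel counts, kernel sizes, LSTM width, attention dimension, and the choice of the last LSTM step as u' are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CRAN(nn.Module):
    # Sketch of the convolutional recurrent attention network (Eqs. 1-8).
    def __init__(self, n_mels=128, n_genres=10, conv_ch=64, hid=128, att_dim=64):
        super().__init__()
        # Eq. (1): 1D convolutions over time with mel bins as channels, ELU, max pooling
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, conv_ch, kernel_size=4, stride=2), nn.ELU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=4, stride=2), nn.ELU(),
            nn.MaxPool1d(2))
        self.bilstm = nn.LSTM(conv_ch, hid, batch_first=True, bidirectional=True)  # Eq. (2)
        self.W_ts = nn.Linear(conv_ch, att_dim)           # g(U), Eq. (4)
        self.W_fc = nn.Linear(2 * hid, att_dim)           # f(u') and the FC term of Eq. (8)
        self.W_a = nn.Linear(att_dim, 1)                  # attention scoring, Eq. (6)
        self.P = nn.Linear(att_dim, att_dim, bias=False)  # dimensionality compatibility, Eq. (7)
        self.out = nn.Linear(att_dim, n_genres)

    def forward(self, x):                          # x: (batch, 128, time)
        u = self.conv(x).transpose(1, 2)           # U: (batch, T, conv_ch)
        h, _ = self.bilstm(u)                      # (batch, T, 2*hid)
        u_prime = h[:, -1]                         # assumed summary of the BiLSTM output
        v = torch.tanh(self.W_ts(u)) * torch.tanh(self.W_fc(u_prime)).unsqueeze(1)  # Eq. (3)
        alpha = F.softmax(torch.tanh(self.W_a(v)).squeeze(-1), dim=1)               # Eq. (6)
        z = self.P((alpha.unsqueeze(-1) * v).sum(dim=1))                            # Eq. (7)
        m = torch.tanh(z) * torch.tanh(self.W_fc(u_prime))                          # Eq. (8)
        return self.out(m), alpha   # genre logits (trained with cross-entropy) and attention scores

The returned alpha holds the per-time-slot attention scores used below for highlight scoring.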
Fig. 2. Model performance for various model parameters

Fig. 3. Distribution (mean and standard deviation) of attention scores according to genres. Values decrease in the order of white, yellow, red, and black.

We use the mel-spectrogram and the attention scores of a track together for highlight extraction. The highlight score of each frame is computed by summing the attention scores and the mean energies:

\tilde{e}^{(n)} = \gamma \alpha_n + (1 - \gamma) \sum_{i=1}^{128} e_i^{(n)}    (9)

H_n = \beta \sum_{s=0}^{S-1} \tilde{e}^{(n+s)} + (1 - \beta) \left( \Delta e^{(n)} + \Delta^2 e^{(n)} \right)    (10)

where e_i^{(n)} denotes the energy of the i-th mel channel in the n-th time frame and S denotes the duration of a highlight. \beta and \gamma are constants in (0, 1). \Delta e^{(n)} and \Delta^2 e^{(n)} denote the differences of e^{(n)} and \Delta e^{(n)}, respectively, and they make the model prefer rapid energy increases.
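A small numerical sketch of Eqs. (9) and (10) follows. It assumes the attention scores have already been aligned (upsampled) to the mel-spectrogram frame grid, sums the mel-channel energies as in Eq. (9), and returns the frame with the largest highlight score as the start of the 30-second highlight; function and variable names are ours, and the beta and gamma defaults follow Table 3.

import numpy as np

def highlight_start(mel, alpha, frame_sec=0.061, highlight_sec=30.0, beta=0.5, gamma=0.1):
    # mel:   (128, N) mel-spectrogram energies e_i^(n)
    # alpha: (N,) attention scores per frame (assumed aligned to the mel grid)
    S = int(round(highlight_sec / frame_sec))             # highlight duration in frames
    energy = mel.sum(axis=0)                              # energy summed over the 128 mel channels
    e_tilde = gamma * alpha + (1.0 - gamma) * energy      # Eq. (9)
    d1 = np.gradient(energy)                              # Delta e^(n)
    d2 = np.gradient(d1)                                  # Delta^2 e^(n)
    window = np.convolve(e_tilde, np.ones(S), mode="valid")                  # sum over s = 0..S-1
    H = beta * window + (1.0 - beta) * (d1[:len(window)] + d2[:len(window)]) # Eq. (10)
    return int(np.argmax(H))                              # frame n maximizing the highlight score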
4. EXPERIMENTAL RESULTS

4.1. Parameter Setup and Evaluation Methodology
Hyperparameters of CRAN are summarized in Table 3. We compare the highlights extracted by CRAN to those generated by a method that sums the energy of the mel-spectrogram (Mel), the first one-minute snippet (F1M), and a convolutional attention model without a recurrent layer (CAN). In addition, for genre classification, CRAN and CAN are compared to models without attention, called CRN and CNN, respectively.

Table 3. Parameter setup of CRAN
- Convolution & pooling layers: 2 & 1, 4 pairs
- \beta, \gamma: 0.5, 0.1

We define two metrics for evaluating the extracted highlights against the three baselines. One is the time overlap between the ground-truth and extracted highlights; the other is the recall of extracted highlights. Given a track x and an extracted highlight H, the two metrics are defined as follows:

O(x, H) = PT(GT(x) \cap H)    (11)

\mathrm{Recall}(x, H) = \begin{cases} 1, & \text{if } O(x, H) > 0.5 \cdot PT(H) \\ 0, & \text{otherwise} \end{cases}    (12)

where PT(x) and GT(x) denote the playing time and the ground-truth highlight of x. In addition, five human experts rated the highlights extracted by each model on a [1, 5] scale as the qualitative evaluation.

Table 4. Comparison of quantitative performance
Models                  Overlap (s)   Recall   Qual
First 1 minute (F1M)    6.96

Table 4 presents the results. CRAN yields the best accuracy with respect to both the qualitative and quantitative evaluations. This indicates that the high-level features improve the quality of the extracted music highlights.
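For clarity, the following is a direct sketch of the two metrics in Eqs. (11) and (12), treating the ground truth and the extracted highlight as single (start, end) intervals in seconds; the 0.5 threshold follows the reconstruction of Eq. (12) above.

def overlap_seconds(gt, extracted):
    # Eq. (11): overlapped playing time between ground-truth and extracted highlights,
    # with gt and extracted given as (start_sec, end_sec) intervals.
    start = max(gt[0], extracted[0])
    end = min(gt[1], extracted[1])
    return max(0.0, end - start)

def recall(gt, extracted, threshold=0.5):
    # Eq. (12): 1 if the overlap exceeds threshold * PT(H), else 0.
    pt_h = extracted[1] - extracted[0]
    return 1 if overlap_seconds(gt, extracted) > threshold * pt_h else 0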
Fig. 4. Correlation coefficient between attention scores and energy of mel-spectrograms for time frames per genre
Table 5. Comparison of mean overlapped time (s) per genre

Genre        Size   F1M     Mel     CAN     CRAN
Dance        57     6.56    20.72   22.37
Ballad       113    3.41    22.14   23.37
Teuroteu     5      18.8    20.0    21.20
Hiphop       42     13.33   14.52   15.79
Rock         27     7.15    19.78   22.11
Jazz         6      10.0    20.33
R&B          53     7.56    18.89
Electronic   9      7.22    17.33   17.11
Table 6. Comparison of quantitative performance

Recall@3      CNN     CAN     CRN     CRAN
Popularity    0.804   0.898   0.858
NewRelease    0.802   0.831   0.791

Interestingly, we find that using F1M leads to very poor performance even though its playing time is twice as long. This indicates that the conventional preview should be improved with an automatic highlight extraction method to enhance the user experience.

Table 5 presents the results with respect to overlap and recall according to genre; the values in Table 5 denote the overlapped time. Overall, CRAN performs slightly better than CAN and outperforms the mel energy-based method and F1M. This indicates that the attention scores help improve highlight quality in most genres. Interestingly, all models show relatively low performance on the hiphop and indie genres, which we attribute to their rap-oriented or informal composition.
We investigate the effect of the attention mechanism on genre learning and classification performance. Recall@3 was used as the evaluation metric, considering the ambiguity and similarity between genres [17, 18]. Table 6 shows the classification performance of each model. As shown in Table 6, the attention mechanism considerably improves performance, by at least 0.05 on both test datasets. In addition, CRAN achieves better accuracy than CAN, which indicates that sequential learning is useful for classifying genres. Fig. 2 shows the classification performance for different model types and parameters with respect to time and accuracy. From Figs. 2(a) and (b), using both sequential modeling and the attention mechanism prevents overfitting, comparing the loss of CRAN to the other models. Interestingly, the number of recurrent layers and their hidden size contribute little to improving the loss, as seen in Figs. 2(c) and (d). The use of attention does not require much additional training time, while the use of recurrent layers slightly increases the model size, as shown in Fig. 2(d).
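As a reference for the metric, the following is a small sketch of Recall@3 under one natural reading: a track counts as correct when any of its (possibly multiple) true genres appears among the three highest-probability predictions. The exact tie-breaking and multi-label handling are assumptions.

import numpy as np

def recall_at_3(probs, true_genres):
    # probs:       (n_tracks, n_genres) predicted genre probabilities
    # true_genres: list of sets of true genre indices per track
    hits = 0
    for p, truth in zip(probs, true_genres):
        top3 = set(np.argsort(p)[-3:])   # three highest-probability genres
        if top3 & truth:                 # hit if any true genre is among them
            hits += 1
    return hits / len(true_genres)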
Fig. 3 presents the distribution of the mean and the variance of the attention scores derived from CRAN for each genre. As shown in Fig. 3, the time slots with large attention scores vary by genre. In particular, ballad, rock, and R&B tracks show similar attention patterns. The hiphop and classical genres show relatively low standard deviations of attention scores due to their characteristics [19]. This result indicates that the attention learned by CRAN captures the properties of a genre.

Fig. 4 presents the correlation coefficient between the attention scores and the energy of the mel-spectrogram for each genre. We find that regions with higher energy in the latter part of a track are likely to be a highlight. In addition, in classical music, high-energy regions obtain larger attention scores across the entire track, compared to other genres. Considering the different patterns between attention scores and low-level signals, we infer that high-level features can play a complementary role in extracting information from tracks.
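The correlation analysis of Fig. 4 can be reproduced along the following lines, under our reading that, for each genre, the Pearson correlation between attention score and frame energy is computed across tracks at every time frame; the paper does not spell out the exact procedure, so this is an assumption.

import numpy as np

def attention_energy_correlation(alphas, energies):
    # alphas, energies: (n_tracks, T) arrays for one genre, aligned on the same time grid
    T = alphas.shape[1]
    # Pearson correlation across tracks, computed independently at each time frame
    return np.array([np.corrcoef(alphas[:, t], energies[:, t])[0, 1] for t in range(T)])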
5. CONCLUDING REMARKS
We presented a new music highlight extraction method that uses high-level acoustic information as well as low-level signal features, based on convolutional recurrent attention networks (CRAN) trained in an unsupervised manner. We evaluated CRAN on 32,083 tracks spanning 10 genres. Quantitative and qualitative evaluations show that CRAN outperforms the baselines. The results also indicate that the attention scores generated by CRAN play an important role in extracting highlights. As future work, CRAN-based highlights will be applied to the Clova Music service on Clova, the AI platform of NAVER and LINE.

6. REFERENCES

[1] Rui Cai, Chao Zhang, Lei Zhang, and Wei-Ying Ma, "Scalable music recommendation by search," in Proceedings of the 15th ACM International Conference on Multimedia. ACM, 2007, pp. 1065-1074.
[2] Ja-Hwung Su, Hsin-Ho Yeh, Philip S. Yu, and Vincent S. Tseng, "Music recommendation using content and context information mining," IEEE Intelligent Systems, vol. 25, no. 1, 2010.
[3] Oscar Celma, "Music recommendation," in Music Recommendation and Discovery, pp. 43-85. Springer, 2010.
[4] Lie Lu and Hong-Jiang Zhang, "Automated extraction of music snippets," in Proceedings of the Eleventh ACM International Conference on Multimedia. ACM, 2003, pp. 140-147.
[5] JiePing Xu, Yang Zhao, Zhe Chen, and ZiLi Liu, "Music snippet extraction via melody-based repeated pattern discovery," Science in China Series F: Information Sciences, vol. 52, no. 5, pp. 804-812, 2009.
[6] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015, vol. 14, pp. 77-81.
[7] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, 2015.
[8] Jan Schlüter and Sebastian Böck, "Improved musical onset detection with convolutional neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 6979-6983.
[9] Karen Ullrich, Jan Schlüter, and Thomas Grill, "Boundary detection in music structure analysis using convolutional neural networks," in ISMIR, 2014, pp. 417-422.
[10] Keunwoo Choi, George Fazekas, Mark Sandler, and Kyunghyun Cho, "Convolutional recurrent neural networks for music classification," arXiv preprint arXiv:1609.04243, 2016.
[11] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[12] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[13] Alex Graves, "Long short-term memory," in Supervised Sequence Labelling with Recurrent Neural Networks, pp. 37-45. Springer, 2012.
[14] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim, "Dual attention networks for multimodal reasoning and matching," arXiv preprint arXiv:1611.00471, 2016.
[15] Lih-Yuan Deng, "The cross-entropy method: A unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning," 2006.
[16] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[17] Ioannis Panagakis, Emmanouil Benetos, and Constantine Kotropoulos, "Music genre classification: A multilinear approach," in ISMIR, 2008, pp. 583-588.
[18] Carlos N. Silla Jr., Celso A. A. Kaestner, and Alessandro L. Koerich, "Automatic music genre classification using ensemble of classifiers," in 2007 IEEE International Conference on Systems, Man and Cybernetics (ISIC). IEEE, 2007, pp. 1687-1692.
[19] Marina Gall and Nick Breeze, "Music composition lessons: the multimodal affordances of technology,"