Singer Identification Using Deep Timbre Feature Learning with KNN-Net
Xulong Zhang
School of Computer Science and Technology, Fudan University, Shanghai 200438, China. [email protected]

Jiale Qian
School of Computer Science and Technology, Fudan University, Shanghai 200438, China. [email protected]

Yi Yu
Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo 101-8430, Japan. [email protected]

Yifu Sun
School of Computer Science and Technology, Fudan University, Shanghai 200438, China. [email protected]

Wei Li ∗
School of Computer Science and Technology, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200438, China. [email protected]
February 23, 2021

Abstract
In this paper, we study the issue of automatic singer identification (SID) in popular music recordings, which aims to recognize who sang a given piece of song. The main challenge for this investigation lies in the fact that a singer's singing voice changes and intertwines with the signal of the background accompaniment in the time domain. To handle this challenge, we propose KNN-Net for SID, a deep neural network model with the goal of learning local timbre feature representations from the mixture of singing voice and background music. Unlike other deep neural networks that use a softmax layer as the output layer, we instead utilize KNN as a more interpretable layer to output target singer labels. Moreover, an attention mechanism is introduced for the first time to highlight crucial timbre features for SID. Experiments on the existing artist20 dataset show that the proposed approach outperforms the state-of-the-art method by 4%. We also create the singer32 and singer60 datasets, consisting of Chinese pop music, to evaluate the reliability of the proposed method. These more extensive experiments additionally indicate that our proposed model achieves a significant performance improvement compared to the state-of-the-art methods.

Keywords: KNN · Singer identification · Attention · Deep learning · Music information retrieval

∗ This work was supported by the National Key R&D Program of China (2019YFC1711800) and NSFC (61671156). Corresponding author: Wei Li, [email protected]
1 Introduction

Automatic singer identification (SID) is an increasingly important research topic in the field of audio signal processing, with the aim of recognizing who sang a given piece of song. It has many potential applications in AI music. For example, an effective SID system can recommend similar singers to users based on the vocal characteristics of each singer. It is also very helpful for reducing the human effort needed to manage unlabeled or insufficiently labeled songs, and for rapidly checking huge numbers of suspect songs. Generally, a singer's voice, depending on the artistry of the work, tends to change arbitrarily in the time domain, which is significantly different from normal audio signals. What makes the task more challenging is that the singing voice is inevitably intertwined with the background accompaniment [1, 2].

The research focuses of existing SID approaches can be grouped into three categories: i) simply treating SID as an instance of speaker identification (SPID) [3], ii) directly identifying singers while ignoring the influence of background music on the singing voice [4, 5], and iii) distinguishing singers by voice characteristics after removing the interference of background music [6]. Hamid et al. [3] used the i-vector, first introduced in the SPID task, which aims to extract song-level descriptors built from frame-level timbre features. Their experiments achieved an F1-score of 0.846 on artist20 [7], a public dataset for the SID task. Loni et al. [4] used a combination of timbre and vibrato features with different attributes to describe the vocal characteristics of the singer, which achieved an accuracy of 0.805 on an a cappella database of 23 singers. Nasrullah et al. [5] proposed another supervised method using a CRNN architecture that achieved an accuracy of 0.935 on artist20. Some new methods use singing voice separation [8, 9] as pre-processing. Sharma et al. [6] extract the singing vocals from polyphonic songs using a Wave-U-Net based approach to overcome the interference of the background accompaniment, which outperforms the baseline without audio source separation by a large margin. Our method belongs to the second category.

Figure 1: The overview of the proposed deep network architecture for singer identification.

Generally, a softmax layer as the output layer of a DNN is the most common choice for classification tasks such as SID [5, 10, 11], but in this manner all the singers are improperly taken into account as candidates in each decision. Inspired by KNN [12], a traditional machine learning method, we confine each decision to a relatively appropriate range, i.e., a fixed number of target singers or song clips. The KNN in our proposed model functions as the final layer, where neighbor distances can be computed in parallel on a GPU [13]. For the distance matrix [14] of the KNN algorithm, a timbre space is introduced to represent the vocal features of each singer. Additionally, current SID datasets are limited, which has hindered the development of large-scale SID study [1, 5, 15]. We therefore create our own datasets, named singer32 and singer60, that contain more singer labels than artist20, and evaluate our model on these datasets.
2 Proposed Method

This section introduces KNN-Net for SID, a deep neural network architecture aimed at timbre feature learning and interpretable classification. This deep model can be extended to other audio classification tasks.
As indicated by the red and blue arrows in Fig. 1, the proposed method requires two stages. The model takes the time-frequency representation corresponding to each clip as input. This is fed into four CNN layers to extract local timbre features, followed by a GRU module to extract time-domain features. After the CRNN layers, an attention layer is employed to strengthen important feature representations [16].

In the first stage, a softmax layer is used to predict the probability of each singer on the training dataset, which serves to train the feature extraction block. In the next stage, the softmax layer is replaced by a linear dense layer, which we also refer to as the KNN layer, and the weights of the attention and CRNN layers are fixed. The KNN layer following the attention layer then returns the high-dimensional cosine distance vectors as output. Subsequently, the top K song clips are chosen according to these cosine distance vectors, where K represents the threshold value of the KNN algorithm. Target singer labels are finally estimated by voting for the most frequent singer among the top K song clips.

The workflow of the proposed deep network architecture for SID is summarized as follows (a code sketch of steps iii)-v) follows the list):
i) Learn the timbre space based on the attention-CRNN model.
ii) Use the model learned in i) to output the reference matrix of the training samples, which represents the timbre features of each singer.
iii) Use the reference matrix as the input weights to construct a dense layer without bias, also called the KNN layer.
iv) Construct KNN-Net by integrating the learned feature extraction component and the KNN layer, without retraining.
v) Predict target singer labels with KNN-Net, which searches for the most similar singers in the timbre space.
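The following is a minimal PyTorch sketch of steps iii)-v), assuming a trained attention-CRNN extractor has already produced one embedding per training clip; all function and variable names here are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_knn_layer(reference_embeddings: torch.Tensor) -> nn.Linear:
    """Steps ii)-iii): turn training-clip embeddings (one row per clip)
    into a bias-free dense layer whose outputs are cosine distances."""
    # L2-normalize each reference vector so that q . w equals cos(q, w).
    W = F.normalize(reference_embeddings, dim=1)
    layer = nn.Linear(W.shape[1], W.shape[0], bias=False)
    layer.weight = nn.Parameter(W, requires_grad=False)  # step iv): no retraining
    return layer

def predict_singer(q: torch.Tensor, knn_layer: nn.Linear,
                   clip_labels: torch.Tensor, k: int = 11) -> int:
    """Step v): vote the most frequent singer among the K nearest clips
    (K = 11 follows the setting reported later in the paper)."""
    distances = knn_layer(q)            # cosine distance to every training clip
    _, top_idx = distances.topk(k)      # indices of the K most similar clips
    return int(torch.mode(clip_labels[top_idx]).values)  # majority vote
```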
The differences between singers lie in their vocal characteristics and the frequency bands of their spectral features [17]. Typically, the voice information of male singers is mainly in the low and medium frequency bands, while that of female singers tends to be in higher frequency bands [18], which indicates the importance of giving different frequency bands different attention weights. For example, the network should learn to focus on the low and medium frequency bands when handling songs performed by male singers, and on the higher frequency bands for female singers. Consequently, we introduce the attention layer to explore the contribution of each frequency band to the whole task and to highlight key parts [19].

The structure of the attention layer is presented in Fig. 1. Assuming that a sequence of $N$ units is returned from the GRU module, the output of the attention layer can be represented as

$c_i = \sum_{j=1}^{N} \alpha_{ij} h_j$ (1)

where $c_i$ represents the hidden vector obtained from the weighted summation of $S_{i-1} = \{h_1, ..., h_N\}$, the hidden vectors output by the last GRU layer. The attention weight of each feature vector can be obtained by equations (2) and (3):

$\alpha_{ij} = \frac{\exp(h_{ij})}{\sum_{k=1}^{N} \exp(h_{ik})}$ (2)

$h_{ij} = f_c(S_{i-1}, h_j)$ (3)

where $h_{ij}$ represents the score of the dependence relationship between $S_{i-1}$ and $h_j$, obtained by the learned mapping function $f_c$. After normalization using the softmax function shown in equation (2), the final output of the attention layer is obtained as the weighted sum.

The attention block is independent of the mixed model of CNN and GRU; it captures the important features extracted by the CRNN and strengthens their relationships.
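As one possible realization of equations (1)-(3), the sketch below scores each GRU output vector with a learned linear map standing in for $f_c$; the paper's $f_c$ also conditions on $S_{i-1}$, so this single-input scorer is a simplifying assumption.

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Attention over the N hidden vectors returned by the GRU,
    following equations (1)-(3); layer sizes are illustrative."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Stand-in for the learned mapping f_c that scores each h_j.
        self.f_c = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, N, hidden_dim), the GRU output sequence.
        scores = self.f_c(h)                  # eq. (3): score h_ij per step
        alpha = torch.softmax(scores, dim=1)  # eq. (2): normalized weights
        return (alpha * h).sum(dim=1)         # eq. (1): weighted sum c_i
```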
The traditional KNN method is integrated into the deep learning model, and the key idea is to design a network structure related to distance measurement. The following is the detailed derivation of the distance computation in the KNN layer. Let $q$ be the feature vector of a test sample extracted by the attention-CRNN network, $W$ be the reference matrix obtained from the training samples, and $w$ one specific column of $W$. According to the cosine distance measure [20], the distance between $q$ and $w$ can be easily computed as

$\cos(q, w) = \frac{q \cdot w}{|q| \cdot |w|}$ (4)

For each given test sample, $|q|$ can be regarded as a constant value. After a normalization operation ($|w| = 1$), the cosine distance in equation (4) can be converted to

$\cos(q, W) = q \cdot W$ (5)

In the neural network, the reference matrix $W$ is passed into the dense layer as

$A = g(q \cdot W + b)$ (6)

where $A$ represents the result matrix and $g$ the activation function. When $g$ is a linear function and $b$ is 0, $A$ is computed as

$A = q \cdot W$ (7)

Equation (7), describing the dense layer, is therefore equal to equation (5), which indicates that the dense layer can be interpreted as the KNN method.
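A small NumPy check of the equivalence between equations (5) and (7): with column-normalized $W$ and no bias, a linear layer's output differs from the cosine distance only by the constant factor $|q|$. Shapes and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=128)             # test-clip feature vector
W = rng.normal(size=(128, 500))      # 500 reference columns, one per clip
W = W / np.linalg.norm(W, axis=0)    # enforce |w| = 1, as in eq. (5)

dense_out = q @ W                    # eq. (7): bias-free linear layer
cosine = (q @ W) / (np.linalg.norm(q) * np.linalg.norm(W, axis=0))

# For a fixed test sample, |q| is constant, so the rankings are identical.
assert np.allclose(dense_out / np.linalg.norm(q), cosine)
```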
3 Experiments

Our experiments are designed to investigate whether our proposed KNN-Net model achieves performance comparable with previous methods, and to measure the effect of the introduced attention mechanism.

We use the artist20 (A20) dataset, which contains 20 singer labels and 1413 songs. Each singer has 6 albums, of which 4 are used for the training set and the remaining 2 for the validation and test sets. We extend artist20
into singer32 (S32) and singer60 (S60). Singer32 contains 32 Chinese pop singers, each with 70 music recordings extracted from their music videos. Similarly, singer60 consists of 60 singers, including 30 male singers and 30 female singers, each with 70 songs. The singer32 and singer60 collections are available at https://zenodo.org/record/3822288 and https://zenodo.org/record/3823866, respectively, and both are divided into training, validation, and test sets in the ratio 8:1:1.

To quantitatively evaluate the classification performance of the proposed model, accuracy, precision, recall, and F1 score are taken as the performance metrics.
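For reference, these four metrics can be computed with scikit-learn as below; the labels are toy values, and the macro averaging mode is our assumption, since the paper does not state how per-class scores are averaged.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy per-song labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)
# average="macro" is an assumption, not specified in the paper.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```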
As the convolutional recurrent neural network (CRNN) proposed in [5] achieved the best results among existing state-of-the-art SID works, we take it as our main comparison method. In addition, we add the attention mechanism to CRNN and treat the result as another important comparison method. Results obtained by i-vector [3] and SVS-SID [6] on the artist20 dataset are also taken as comparisons.

In our method, target singer labels are first estimated at the frame-block level (each block covers 1 second and consists of 32 frames). The recognition result for an entire song is then voted from all of its blocks (a voting sketch is given at the end of this section). The detailed settings of the network layers and parameters of CRNN and attention-CRNN are available at https://gitlab.com/exp_codes/tut-attention, and the value of K in the proposed KNN-Net method is set to 11 based on initial experiments.

For the experimental results on the public artist20 dataset, we present the confusion matrix of the recognition results of the proposed method at the frame-block level in Fig. 3, where the recognition accuracy averages 85%.

Figure 3: The confusion matrix of the recognition results at the frame-block level, evaluated on artist20.

Figure 2: Visualization of the timbre features of each singer using t-SNE. Singers with similar timbre features and music genres tend to be close, e.g., the lead singers of Led Zeppelin and Aerosmith.

Table 1: Evaluation on artist20 (A20), singer32 (S32), and singer60 (S60). The attention-CRNN-KNN (KNN-Net) results achieve the highest accuracy compared to the baselines, and the accuracy on artist20 is significantly higher than on the other two datasets.

Method              | Accuracy        | Precision       | Recall          | F1
                    | A20  S32  S60   | A20  S32  S60   | A20  S32  S60   | A20  S32  S60
i-vector [3]        | 0.85  -    -    | 0.86  -    -    | 0.86  -    -    | 0.85  -    -
SVS-SID [6]         | 0.90  -    -    | -    -    -     | -    -    -     | -    -    -
CRNN [5]            | 0.94 0.69 0.57  | 0.93 0.72 0.57  | 0.93 0.69 0.56  | 0.93 0.68 0.51
Attention-CRNN      | 0.95 0.72 0.63  | 0.96 0.75 0.62  | 0.95 0.69 0.62  | 0.95 0.69 0.58
Attention-CRNN-KNN  | 0.99 0.74  -    | 0.99 0.76  -    | 0.99 0.70  -    | 0.99 0.71  -

Table 1 shows the singer identification results for our proposed model and the comparison methods. Our KNN-based deep architecture achieves the best performance, improving significantly on the baseline model. Specifically, for artist20, a small improvement in recognition accuracy is achieved when introducing the attention mechanism to CRNN, while a 5% increase is achieved with our attention-CRNN-KNN model. The experimental results show notably higher accuracy on artist20 than on the other two datasets. For artist20, the relatively small capacity of the dataset is one reason. In addition, artist20 contains music pieces of various genres, which provide additional information to the neural network and may improve recognition accuracy to a certain extent. In contrast, singer32 and singer60 include only Chinese pop music and therefore carry no genre-related information. Unfortunately, due to the large dimension of the reference matrix, we cannot give the experimental results on singer60.
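A minimal sketch of the song-level decision described above: per-block predictions (one per 1-second, 32-frame block) are aggregated by majority vote; the function name is illustrative.

```python
import numpy as np

def song_label(block_predictions: np.ndarray) -> int:
    """Majority vote over per-block singer predictions to label the song."""
    values, counts = np.unique(block_predictions, return_counts=True)
    return int(values[np.argmax(counts)])

# e.g. a 10-block song where singer 3 wins the vote
print(song_label(np.array([3, 3, 7, 3, 3, 1, 3, 7, 3, 3])))  # -> 3
```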
4 Conclusion

We have presented a novel KNN-based deep architecture that learns a timbre feature extraction network and recognizes singers with a KNN approach, which can be regarded as a query-search task in a timbre feature space. Moreover, the large-scale reference matrix is computed efficiently in parallel by performing the cosine distance computation with a dense layer. We evaluated our method on three datasets for singer identification and confirmed that it outperforms the state-of-the-art methods.

Considering the problem that a large-dimensional reference matrix cannot be stored in memory when the dataset is enlarged, future work will implement a centralization method on the reference matrix. We would like to utilize a clustering method to reduce the dimension of the reference matrix so that the requirements of large-scale singer identification can be satisfied.
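One possible realization of this future work, sketched under the assumption that per-singer k-means centroids replace individual clip embeddings; this is not implemented or evaluated in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_reference_matrix(embeddings: np.ndarray, labels: np.ndarray,
                              clusters_per_singer: int = 10):
    """Shrink the KNN reference matrix by replacing each singer's clip
    embeddings with k-means centroids (a sketch of the proposed future work)."""
    new_refs, new_labels = [], []
    for singer in np.unique(labels):
        clips = embeddings[labels == singer]
        k = min(clusters_per_singer, len(clips))
        km = KMeans(n_clusters=k, n_init=10).fit(clips)
        new_refs.append(km.cluster_centers_)   # centroids stand in for clips
        new_labels.extend([singer] * k)
    return np.vstack(new_refs), np.array(new_labels)
```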
References

[1] Seyed Kooshan, Hashemi Fard, and Rahil Mahdian Toroghi. Singer identification by vocal parts detection and singer classification using LSTM neural networks. In Proceedings of the 4th International Conference on Pattern Recognition and Image Analysis, pages 246–250. IEEE, 2019.

[2] Zebang Shen, Binbin Yong, Gaofeng Zhang, Rui Zhou, and Qingguo Zhou. A deep learning method for Chinese singer identification. Tsinghua Science and Technology, 24(4):371–378, 2019.

[3] Hamid Eghbal-Zadeh, Bernhard Lehner, Markus Schedl, and Gerhard Widmer. I-vectors for timbre-based music similarity and music artist classification. In Proceedings of the 16th International Society for Music Information Retrieval Conference, pages 554–560, 2015.

[4] Deepali Y. Loni and Shaila Subbaraman. Timbre-vibrato model for singer identification. Information and Communication Technology for Intelligent Systems, pages 279–292, 2019.

[5] Zain Nasrullah and Yue Zhao. Music artist classification with convolutional recurrent neural networks. In Proceedings of the 2019 International Joint Conference on Neural Networks, pages 1–8. IEEE, 2019.

[6] Bidisha Sharma, Rohan Kumar Das, and Haizhou Li. On the importance of audio-source separation for singer identification in polyphonic music. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, pages 2020–2024, 2019.

[7] Daniel P. W. Ellis. Classifying music audio with timbral and chroma features. In Proceedings of the 8th International Conference on Music Information Retrieval, 2007.

[8] Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. Monoaural audio source separation using deep convolutional neural networks. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, pages 258–266. Springer, 2017.

[9] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, 2017.

[10] Aitor Arronte Alvarez and Francisco Gomez-Martin. Singer identification using convolutional acoustic motif embeddings. arXiv preprint arXiv:2008.00198, 2020.

[11] Tsung-Han Hsieh, Kai-Hsiang Cheng, Zhe-Cheng Fan, Yu-Ching Yang, and Yi-Hsuan Yang. Addressing the confounds of accompaniments in singer identification. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5, 2020.

[12] Zhongheng Zhang. Introduction to machine learning: k-nearest neighbors. Annals of Translational Medicine, 4(11), 2016.

[13] Ricardo J. Barrientos, Fabricio Millaguir, José L. Sánchez, and Enrique Arias. GPU-based exhaustive algorithms processing kNN queries. The Journal of Supercomputing, 73(10):4611–4634, 2017.

[14] Hao Zhang, Alexander C. Berg, Michael Maire, and Jitendra Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 2126–2136. IEEE, 2006.

[15] Rajitha Amarasinghe and Lakshman Jayaratne. Supervised learning approach for singer identification in Sri Lankan music. European Journal of Computer Science and Information Technology, 4(6):1–14, 2016.

[16] Minz Won, Sanghyuk Chun, and Xavier Serra. Toward interpretable music tagging with self-attention. arXiv preprint arXiv:1906.04972, 2019.

[17] Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, and Yifan Gong. End-to-end attention based text-dependent speaker verification. In Proceedings of the IEEE Spoken Language Technology Workshop, pages 171–178. IEEE, 2016.

[18] Özgür Devrim Orman and Levent M. Arslan. Frequency analysis of speaker identification. In A Speaker Odyssey - The Speaker Recognition Workshop, pages 219–222, 2001.

[19] Ye Yuan and Kebin Jia. FusionAtt: deep fusional attention networks for multi-channel biomedical signals. Sensors, 19(11):2429, 2019.

[20] Kittipong Chomboon, Pasapitch Chujai, Pongsakorn Teerarassamee, Kittisak Kerdprasop, and Nittaya Kerdprasop. An empirical study of distance metrics for k-nearest neighbor algorithm. In