Deep Learning Approach for Singer Voice Classification of Vietnamese Popular Music
Toan Pham Van
R&D Lab, Sun* Inc, Hanoi, [email protected]
Ngoc Tran Ngo Quang
R&D Lab, Sun* Inc, Hanoi, [email protected]
Ta Minh Thanh
Le Quy Don Technical University, Hanoi, [email protected]
ABSTRACT
Singer voice classification is a meaningful task in the digital era. With a huge number of songs available today, identifying a singer is very helpful for music information retrieval, music property indexing, and so on. In this paper, we propose a new method to identify the singer's name based on analysis of Vietnamese popular music. We employ vocal segment detection and singing voice separation as pre-processing steps. The purpose of these steps is to extract the singer's voice from the mixture sound. In order to build a singer classifier, we propose a neural network architecture working with Mel Frequency Cepstral Coefficients (MFCC) extracted from the vocal as input features. To verify the accuracy of our methods, we evaluate on a dataset of 300 Vietnamese songs from 18 famous singers. We achieve an accuracy of 92.84% with 5-fold stratified cross-validation, the best result compared to other methods on the same dataset.
CCS CONCEPTS
• Applied computing → Sound and music computing; • Computing methodologies → Classification and regression trees.

KEYWORDS
vocal extraction, singer classification, deep learning, music information retrieval
ACM Reference Format:
Toan Pham Van, Ngoc Tran Ngo Quang, and Ta Minh Thanh. 2019. Deep Learning Approach for Singer Voice Classification of Vietnamese Popular Music. In The Tenth International Symposium on Information and Communication Technology (SoICT 2019), December 4–6, 2019, Hanoi - Ha Long Bay, Vietnam. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3368926.3369700
With the growing collections of digital music, the number of published songs increases day by day. As a result, we need an automatic system that can classify each song with useful information such as the singer or its categories. This task would be useful for music retrieval problems, automatic database indexing, or content-based music recommendation systems [3]. It is also an important contribution to future research on music index retrieval (MIR) for Vietnamese songs. Each singer's vocal has different acoustic features such as timbre, pitch, and frequency range, and all of them are useful for singer classification. From these features, we can use machine learning methods to classify the singer's gender or age [25], or timbre [23]. Similarly, our problem could also be solved with a machine learning approach. Specifically, from the audio characteristics of vocals, we may use traditional algorithms such as Support Vector Machines (SVM), Naive Bayes, or k-Nearest Neighbors (KNN) to classify the singers. On another note, the application of deep learning methods has become very popular lately: it is used in most fields of artificial intelligence such as computer vision, natural language processing, audio processing, etc. [9]. Applying this technology to the singer classification problem is a new approach. In light of this, we propose a new scheme for singer classification in this paper, using cutting-edge deep neural network models to achieve state-of-the-art results, exceeding those of classical methods. We divide this process into 3 phases: vocal segmentation, vocal extraction, and vocal classification. The overall architecture of our system is demonstrated in Fig 1. Each of these steps is solved by a different neural network architecture trained on a different dataset. We have improved on each of these components with our customized neural network architectures to help the overall system achieve better accuracy.
Our contributions are as follows:
• We propose a new method to identify singers based on their songs. This is a new method in the music index retrieval (MIR) research field.
• We design a new neural network architecture (NNA) to achieve better accuracy in each sub-problem. With the proposed NNA, our paper achieves an overall result superior to that of traditional methods such as SVM (support vector machine), KNN (k-nearest neighbors), Naive Bayes, and so on.
• We build a dataset based on Vietnamese popular songs that is publicly available for research purposes. Our dataset is published for non-commercial applications.
The rest of this paper is organized as follows. Section 2 presents a brief review of related works. All the deep learning methods used to identify singers are presented in Section 3. Section 4 describes the system setup for the experiments and our dataset. Experimental results and evaluations are presented in Section 5. Finally, we conclude the paper in Section 6.
Nowadays, in order to meet the demands of human entertainment, music works are produced with increasing frequency. The demand for searching and storing information related to songs has also increased accordingly. Music databases are rapidly expanding as a result, storing millions of tracks in the digital cloud. Meanwhile, there is a lack of meta-information for a majority of them. The problem of automatically classifying music information (e.g. singer, music genres, etc.) therefore becomes very meaningful. This is also a research topic that deserves attention in the field of computer science. The purpose of our research is to predict the singer with the highest accuracy from any short song snippet. Our classification method can be applied to music index retrieval, music searching systems, and music categorization. The mentioned problem of automatically classifying music can be separated into the following steps:
• Step 1 - Vocal Segmentation: The main purpose of this task is splitting the input audio into vocal and non-vocal (instrumental) regions. This gives prior information that the audio passing through the next step, vocal separation, has a vocal component.
• Step 2 - Vocal Separation: After distinguishing between vocal and non-vocal segments from the whole song, we proceed to vocal separation. This step extracts the vocal (the singer's voice) from the mixture sound combining drums, guitar, background music, and so on. This is an important phase that determines the accuracy of the later singer classification. A clear vocal and non-vocal separation produces clean features for the input data of the last step.
• Step 3 - Vocal Classification: After obtaining a clean vocal (i.e. with most of the noise removed), we use a neural network to classify singers. This problem is quite similar to speaker identification, but for singing vocals instead. In this step, the features of the singers' voices can be clearly classified using our proposed method.
For any machine learning model to work properly and effectively, good feature extraction is required. We employ two standard feature-generating operations in audio processing: the Short-Time Fourier Transform (STFT) [22] and the Mel-Frequency Cepstrum (MFC) [7]. The STFT provides the frequency properties in short time frames and describes how they change over time. The MFC feature, on the other hand, makes use of the magnitude of that information, maps it onto a more natural-sounding frequency domain (the Mel scale), and keeps only the informative parts by discarding some coefficients. Existing literature provides a plethora of fingerprinting methods based on spectrograms [2]. In this paper, we opt to extract deeper features from such existing schemes through the use of deep neural networks (DNN). STFT and MFCC features are employed for efficient extraction of audio acoustic features.
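As a concrete illustration, both front-ends can be computed with librosa; this is a minimal sketch, and the file name, sampling rate, and frame sizes are illustrative assumptions rather than values from our pipeline.

```python
# Minimal sketch of the two audio front-ends (STFT and MFCC) using librosa.
# File name, sampling rate, and frame parameters are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=22050, mono=True)

# Short-Time Fourier Transform: complex matrix of shape (1 + n_fft/2, n_frames)
stft = librosa.stft(y, n_fft=1024, hop_length=256)
magnitude, phase = np.abs(stft), np.angle(stft)

# Mel-frequency cepstral coefficients: matrix of shape (n_mfcc, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=256)
```

The complex STFT splits into a magnitude matrix (used later by the separation model) and a phase matrix (kept for reconstruction), while the MFCC matrix feeds the classifiers.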
In any song, some segments are purely instrumental (e.g. intro, bridge, and outro), and these are not needed for the vocal classification task. Instead of processing the whole song, it makes sense to ignore these non-vocal segments via a vocal segmentation algorithm. There have been numerous studies and methods in this field, such as formant-based [16], frequency-based [12], and Hidden Markov model [17] approaches. In this work, we use a deep convolutional neural network on input audio features.
Before singer classification, we need to extract the vocal-only parts from a master mix. There has been a proliferation of literature on the subject from the music information retrieval and singing voice separation communities; beyond our use, this topic is worth billions in the entertainment industry. Classical methods use techniques like variants of non-negative matrix factorization (NMF) [4], where one can think of each component in the decomposition as an instrumental track. More modern research uses ideal/soft binary masks on the spectrogram of the master mix, obtained with either image segmentation methods like U-Net [10] or convolutional neural networks followed by a simple flatten-dense layer.
And last but not least is our final step: outputting the singer from the extracted vocal track. Previous works in speaker detection used classical methods like Gaussian Mixture Models [28], while newer ones employ modern convolutional neural networks [26]. For singer detection, the work in [12] used acoustic features like frequency responses to achieve decent results. As in most classification problems, after extracting deep features, those features are fed into a differentiator such as a multiclass SVM or a densely connected network with softmax activation.

During the process of finalizing this paper, we came across new works on singer classification [24]. The main contribution of [24] is a deep learning approach based on LSTM and MFCC features to identify the singer of a song in large datasets. In another work on the singer classification problem, the method in [19] employed MFCC and LPC (linear predictive coding) coefficients from Indian video songs as the singers' features; the singer models were then trained using a Naive Bayes classifier and a back-propagation neural network. Both related works attempt to separate the vocal and non-vocal parts from the background soundtrack, but their pre-processing is not considered carefully. While the actual configurations of the networks contain notable differences, the general idea behind the architecture design is relatively similar. We plan to compare the models and analyze the effects of those differences, but that is beyond the scope of this paper.

Figure 1: The overview of our singer classification system.
Figure 2: The architecture of our vocal/non-vocal segmentation model using a Convolutional Neural Network.
In this section, we present the deep learning algorithms used to solve our defined problem above. For each task in the workflow shown in Fig 1, we have a different neural network architecture. The vocal segmentation model is trained with a convolutional neural network [13]. The vocal separation task is solved with our custom network based on the U-Net architecture [21]. Finally, we propose a Bidirectional LSTM network architecture designed to perform well at classification.
The goal is to automatically detect vocal boundaries in audio signals so that the results are close to human annotation. In our problem, we applied a Convolutional Neural Network (CNN) to predict the type of each music segment. We train the network on human annotations to predict likely musical boundary locations in audio data. Our classifier is trained with an in-house collection of 2034 Karaoke songs. For every song in this dataset, the start point and end point of each vocal segment are annotated. We use this dataset because it is easy to differentiate the vocal segments from the non-vocal ones in Karaoke songs, and these segments serve as the training data for the vocal/non-vocal classifier. In our research, we used a CNN architecture combined with a few fully connected layers at the end; an overview of this network is shown in Fig 2. The input of the model is a block of 50 MFCC frames corresponding to a 500 millisecond audio segment. The input frames are passed through two convolutional layers: the first has 128 filters of size 10 × 10 and the second has 32 filters of size 5 × 5. After that, the output is reduced by a max-pooling layer of size 5 × 2. These layers are followed by two densely connected layers of 128 neurons, associated with dropout rates of 0.75 and 0.5. The ReLU activation function [6] is used between layers. The output layer is composed of two neurons, normalized using a softmax function to classify the vocal and non-vocal classes.
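A minimal PyTorch sketch of this segmentation network is given below. The number of MFCC coefficients per frame, the padding scheme, and the exact placement of dropout are assumptions made only so the layer sizes line up; the filter counts, kernel sizes, pooling size, dense widths, and dropout rates follow the description above.

```python
# Sketch of the vocal/non-vocal segmentation CNN described above.
# Assumptions: 2-D convolutions over an (n_mfcc x 50) patch with n_mfcc = 40,
# "same"-style padding, and dropout applied after each dense layer.
import torch
import torch.nn as nn

class VocalSegmentCNN(nn.Module):
    def __init__(self, n_mfcc: int = 40, n_frames: int = 50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=10, padding=5),  # 128 filters, 10 x 10
            nn.ReLU(),
            nn.Conv2d(128, 32, kernel_size=5, padding=2),  # 32 filters, 5 x 5
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(5, 2)),              # 5 x 2 max-pooling
        )
        # Infer the flattened size once so the dense layers stay consistent.
        with torch.no_grad():
            flat = self.features(torch.zeros(1, 1, n_mfcc, n_frames)).numel()
        self.classifier = nn.Sequential(
            nn.Linear(flat, 128), nn.ReLU(), nn.Dropout(0.75),
            nn.Linear(128, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 2),                             # vocal / non-vocal logits
        )

    def forward(self, x):                                  # x: (batch, 1, n_mfcc, n_frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)                          # softmax is applied by the loss

model = VocalSegmentCNN()
logits = model(torch.randn(8, 1, 40, 50))                  # 8 random 500 ms patches
probs = torch.softmax(logits, dim=1)
```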
We first process the obtained vocal-present tracks before feeding them into our vocal separation network. The audio files are chopped into 6-second snippets, then passed through the Short-Time Fourier Transform (STFT) [22]. The STFT captures temporal frequency properties in every short time frame. After obtaining the magnitude and phase matrices from the STFT, the magnitude matrix is normalized to a logarithmic scale while preserving non-negativity using log1p:

log1p(x) = log(1 + x),   (1)

intuitively, we are making a spectrogram similar to one on the decibel scale, the one in everyday use. The phase matrix, if needed, is used to reconstruct the vocal track from the spectrogram output of our vocal separation model. Most current research on vocal extraction uses pixel-wise segmentation on the spectrogram of the master track, then combines the result with the original phase matrix to obtain the final result [20]. Notably, Jansson et al. used a deep U-Net architecture for the task [10], yielding decent results without blowing up the number of parameters. In light of such papers, we opted to use a similar but customized deep neural network, described as follows. From the obtained spectrogram, we start with three 1-D convolution layers along the time domain for feature extraction. After that, three 1-D transposed convolution layers expand and convert the features back into the same dimension as the output of the corresponding convolutional layer. And last but not least, for the skip-connection layers, instead of purely adding/concatenating the encoded outputs from the convolutional layers into the inputs of the transposed-convolutional layers, we pass each such output through a Gated Recurrent Unit (GRU) [5] layer. These layers learn to map the matrices from the convolutional output spaces to the transposed convolutional input spaces, effectively bringing through information that was lost in the downsizing process. The skip connection is a design choice borrowed from the famous U-Net architecture; however, we decided on our little GRU twist, since it makes little sense to combine features of different latent spaces. Although this works in practice, in our opinion it would have made more sense to reuse the convolutional matrix in the transposed convolutional layer, which would decrease the number of tunable parameters and, as a result, the model's capacity. Fig 3 shows a visualization of the model.
Figure 3: The vocal separation model architecture based on U-Net with GRU skip connections.
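The sketch below illustrates the general shape of such a model in PyTorch. Channel counts, kernel sizes, and strides are assumptions chosen so the encoder and decoder time axes match; only the overall layout (three 1-D convolutions, three transposed 1-D convolutions, GRU layers on the skip connections) follows the description above.

```python
# Rough sketch of the separation network: Conv1d encoder, ConvTranspose1d decoder,
# and GRU-mapped skip connections. Channel counts, kernel sizes, and strides are
# assumptions; the paper text does not give exact values.
import torch
import torch.nn as nn

class GRUSkip(nn.Module):
    """Maps an encoder feature map through a GRU before it joins the decoder."""
    def __init__(self, channels: int):
        super().__init__()
        self.gru = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x):                      # x: (batch, channels, time)
        out, _ = self.gru(x.transpose(1, 2))   # GRU runs over the time axis
        return out.transpose(1, 2)

class VocalSeparator(nn.Module):
    def __init__(self, n_bins: int = 513):     # 513 bins matches an n_fft of 1024
        super().__init__()
        conv = lambda i, o: nn.Conv1d(i, o, kernel_size=5, stride=2, padding=2)
        deconv = lambda i, o: nn.ConvTranspose1d(i, o, kernel_size=5, stride=2,
                                                 padding=2, output_padding=1)
        self.enc1, self.enc2, self.enc3 = conv(n_bins, 256), conv(256, 128), conv(128, 64)
        self.dec3, self.dec2, self.dec1 = deconv(64, 128), deconv(128, 256), deconv(256, n_bins)
        self.skip2, self.skip1 = GRUSkip(128), GRUSkip(256)
        self.act = nn.ReLU()

    def forward(self, spec):                   # spec: (batch, n_bins, time), time % 8 == 0
        e1 = self.act(self.enc1(spec))
        e2 = self.act(self.enc2(e1))
        e3 = self.act(self.enc3(e2))
        d3 = self.act(self.dec3(e3)) + self.skip2(e2)
        d2 = self.act(self.dec2(d3)) + self.skip1(e1)
        return torch.relu(self.dec1(d2))       # non-negative log1p vocal spectrogram

model = VocalSeparator()
vocal_spec = model(torch.randn(2, 513, 512))   # two log1p spectrogram snippets
```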
The spectrogram from the last transposed convolutional layer, after undoing the log1p operation, will be the spectrogram of the extracted vocal track, and we can recover the actual track by passing that spectrogram and the phase matrix of the original track into the inverse STFT. However, we will not be needing that, since only the spectrogram is passed on to the next step.
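A short sketch of this pre-processing and of the optional waveform reconstruction, assuming the same librosa front-end as before (STFT sizes and file names are illustrative):

```python
# Sketch of the log1p normalization in Eq. (1) and of the optional inverse-STFT
# reconstruction; STFT sizes and file names are illustrative assumptions.
import numpy as np
import librosa

snippet, sr = librosa.load("mix_snippet.wav", sr=22050, duration=6.0)  # 6-second chunk
stft = librosa.stft(snippet, n_fft=1024, hop_length=256)
magnitude, phase = np.abs(stft), np.angle(stft)
log_mag = np.log1p(magnitude)          # log1p(x) = log(1 + x), keeps values non-negative

# If the separation model predicts a vocal spectrogram `vocal_log_mag`, the
# waveform could be recovered with the original phase (not needed in our pipeline):
# vocal = librosa.istft(np.expm1(vocal_log_mag) * np.exp(1j * phase), hop_length=256)
```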
With the vocal-only spectrogram obtained from the last step, we extract the Mel-Frequency Cepstral Coefficients (MFCCs) as the features passed on to our model. The MFCCs, in contrast to a normal magnitude spectrogram, capture more detailed low-frequency features that correspond to the human voice, while discarding the less informative parts (the amount of information kept is a hyperparameter). For each frame, 13 MFC coefficients are extracted using 26 filter bands. To better model the behavior of the signal, the differentials and accelerations of the MFC coefficients are calculated. All these features are combined into a feature vector of size 39. The feature vectors serve as input to the LSTM model. A Bidirectional LSTM model with 3 hidden layers of 25 hidden units each is used for vocal classification. The result is then passed through a dense layer activated by the softmax function to obtain a predicted probability distribution over the artists. Training is done using a backpropagation length of 20 time steps with a batch size of 64. The Adam optimizer with an initial learning rate of 0.001 is used to train this model.
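The sketch below shows how the 39-dimensional feature vectors and the classifier could be assembled. For brevity it computes MFCCs from a vocal waveform with librosa, whereas our pipeline works from the separated spectrogram, and the batching shown is illustrative; the feature layout (13 MFCCs plus deltas and delta-deltas), the 3-layer bidirectional LSTM with 25 hidden units, the batch size, sequence length, and Adam learning rate follow the description above.

```python
# Sketch of the vocal classification stage: 39-dim MFCC + delta + delta-delta
# features fed to a 3-layer BiLSTM with 25 hidden units and a softmax output.
# Feature extraction from a raw waveform and the random batch are illustrative.
import numpy as np
import librosa
import torch
import torch.nn as nn

def mfcc_features(vocal: np.ndarray, sr: int = 22050) -> np.ndarray:
    """13 MFCCs (26 mel bands) plus first and second derivatives -> (n_frames, 39)."""
    mfcc = librosa.feature.mfcc(y=vocal, sr=sr, n_mfcc=13, n_mels=26)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T

class SingerBiLSTM(nn.Module):
    def __init__(self, n_singers: int = 18, n_features: int = 39):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 25, num_layers=3,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 25, n_singers)

    def forward(self, x):                      # x: (batch, time, 39)
        h, _ = self.lstm(x)
        return self.out(h[:, -1])              # logits from the last time step

model = SingerBiLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
logits = model(torch.randn(64, 20, 39))        # batch size 64, 20 time steps
probs = torch.softmax(logits, dim=1)
```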
In any machine learning or deep learning task, data plays an important role in the accuracy of the whole system. Thus, the data preparation phase must be carried out carefully. Unlike other problems, the singer classification problem is divided into 3 subproblems, and each of them needs a different dataset. Specifically, for the vocal segmentation problem, we use the 2034 Karaoke songs dataset to train the model. This dataset includes Vietnamese karaoke songs with annotations of the vocal segments' starting and ending points. The vocal separation model is trained on the MUSDB18 [18] (https://sigsep.github.io/datasets/musdb.html) and DSD100 [14] (https://sigsep.github.io/datasets/dsd100.html) datasets. The two datasets contain full-length music tracks of different styles along with their isolated drums, bass, vocals, and other stems. To train the model for the singer classification task, we collected 300 Vietnamese songs of 18 singers. Details of this dataset are described in Fig 4.

Table 1: Vocal and non-vocal segmentation result (precision, %)

Song genre    CNN Precision                     CNN + Viterbi Precision
              Vocal    Nonvocal   Mean          Vocal    Nonvocal   Mean
Country       91.30    97.20      94.25         97.82    99.64      98.73
Ballad        92.85    94.24      93.55         98.65    99.86      99.26
Bolero        94.32    90.24      92.28         96.30    98.12      97.21
Rock          88.23    97.15      90.69         90.64    90.67      97.10

Our experiment is conducted on a computer with an Intel Core i5-7500 CPU @3.4GHz, 32GB of RAM, a GeForce GTX 1080 Ti GPU, and a 1TB SSD hard disk. All three subnetworks are implemented with the well-known PyTorch framework [11].
The raw frame-level vocal/non-vocal probabilities are obtained using a step size of 10 milliseconds. The Viterbi [8] algorithm is used to infer the most likely voice segments (vocal or non-vocal) from these raw data, allowing an increase in the accuracy of our model. With the dataset of 2034 Karaoke songs, we use 1500 of those for training and 534 for testing. We also divide this dataset into 4 music genres to analyze the output of our model.
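A small, self-contained sketch of this smoothing step is shown below: the frame-level vocal probabilities (one value per 10 ms frame) are decoded with a two-state Viterbi pass. The self-transition probability is an assumed value, not one taken from our experiments.

```python
# Two-state Viterbi smoothing of frame-level vocal probabilities.
# p_stay (probability of remaining in the same state) is an assumption.
import numpy as np

def viterbi_smooth(p_vocal: np.ndarray, p_stay: float = 0.99) -> np.ndarray:
    """Return the most likely vocal(1)/non-vocal(0) label per frame."""
    emit = np.log(np.stack([1.0 - p_vocal, p_vocal]) + 1e-9)    # (2, T) log-emissions
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))             # (prev, cur) log-transitions
    T = p_vocal.shape[0]
    score = np.zeros((2, T))
    back = np.zeros((2, T), dtype=int)
    score[:, 0] = emit[:, 0]
    for t in range(1, T):
        cand = score[:, t - 1][:, None] + trans                  # best previous state per state
        back[:, t] = cand.argmax(axis=0)
        score[:, t] = cand.max(axis=0) + emit[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = score[:, -1].argmax()
    for t in range(T - 2, -1, -1):                               # backtrace
        path[t] = back[path[t + 1], t + 1]
    return path

labels = viterbi_smooth(np.random.rand(3000))                    # ~30 seconds of 10 ms frames
```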
Table 1 shows the detailed results of our model. According to the results, we found that genres like country, ballad, and bolero gave better classification results (approximately 98%). This gives us a suggestion for improving our method in the next steps.
Figure 4: The number of songs for each singer in our collected dataset.

Table 2: The result of vocal separation (GRU skip connection; DSD100 and MUSDB18)
Table 3: The result of vocal classification with two audio signals

                    Mean precision   Mean recall   Mean F1 score
Raw signal          85.4             82.6          83.96
Separated signal
As we can see, the precision for the first three genres was decent; however, for the Rock genre it fell short. For future work, we would like to improve performance on such badly performing genres, as well as on genres that are not in the current dataset (e.g., electronic music, hip-hop, etc.).
The evaluation is conducted using the MUSDB18 dataset (100 songs for training and 50 songs for testing) and the official packages from SiSEC2018 [29]. We used the signal-to-distortion ratio (SDR) [27] [15], as it is the most widely used metric for this research problem. The model's evaluation on the two datasets, DSD100 and MUSDB18, is introduced in Section 4. The details of the results are shown in Table 2. Audibly, the separated result is only decent enough for our task, but not of the production-level quality we hoped for. Currently we are experimenting with both improvements to the current model (adding attention, changing the skip-connection layer) and other promising architectures. These considerations are, however, too specific for this paper, and will be further analyzed in a future paper on this sole subtask.
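For reference, a hedged example of computing SDR for a single separated vocal track with mir_eval's BSS Eval implementation is given below; the file names are placeholders, and the official SiSEC 2018 package (museval) provides the same family of metrics.

```python
# Sketch of computing SDR for one estimated vocal track with mir_eval.
# File names are placeholders; reference stems must be available.
import numpy as np
import librosa
import mir_eval

reference, sr = librosa.load("true_vocals.wav", sr=None, mono=True)
estimate, _ = librosa.load("estimated_vocals.wav", sr=sr, mono=True)
n = min(len(reference), len(estimate))          # BSS Eval expects equal lengths

sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
    reference[np.newaxis, :n], estimate[np.newaxis, :n])
print(f"SDR: {sdr[0]:.2f} dB")
```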
After the vocal segmentation step, we perform vocal classification experiments with MFCCs from two audio signals. In the first experiment, we tried raw features after concatenating the vocal segments; this audio signal includes vocals, instruments, and other sounds. In the second experiment, we pass the raw audio signal through the vocal separation model, and the output of this step is the input of our vocal classification model. Both experiments were performed with the same network architecture using a Bidirectional LSTM with MFCC features, described in Section 3. The comparison of these experiments is shown in Table 3. We use the F1-score for this evaluation. Our result leaves more to be desired. For starters, we can experiment with other features such as linear predictive coding, which is widely used in speaker recognition [1]. Also, the current model can only handle songs with a single singer; with little change, we can adapt this code to songs with multiple singers, given that each section of the song has only one singer. Another improvement would be adding multiple-singer detection in vocal mixes, say, when the voices are harmonizing, similar to speaker source separation. Further, we can experiment with singer embeddings: given a singer's extracted features, we may be able to generalize about the vocal properties, the song style of that artist, etc., which is a hot topic in the music information retrieval community.
In this paper we have employed deep learning techniques to build neural networks that solve the singer vocal classification problem. We have proposed a method to solve this problem with the following steps: vocal segmentation, vocal extraction, and vocal classification. Each of the steps above is addressed with an appropriate neural network architecture. This makes it easy for us to individually optimize each subproblem. The overall accuracy for this problem is approximately 93% on the dataset of 300 songs from Vietnamese singers. This dataset was also collected manually and published for the scientific community to conduct similar studies.
ACKNOWLEDGMENTS
This work is partially supported by Sun-Asterisk Inc. We would like to thank our colleagues at Sun-Asterisk Inc for their advice and expertise. Without their support, this experiment would not have been accomplished.
REFERENCES
[1] Homayoon Beigi. 2011. Fundamentals of Speaker Recognition. https://doi.org/10.1007/978-0-387-77592-0
[2] Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma. 2003. A Review of Algorithms for Audio Fingerprinting. (03 2003).
[3] Pedro Cano, Markus Koppenberger, and Nicolas Wack. 2005. Content-based Music Audio Recommendation. In Proceedings of the 13th Annual ACM International Conference on Multimedia (MULTIMEDIA '05). ACM, New York, NY, USA, 211–212. https://doi.org/10.1145/1101149.1101181
[4] Angkana Chanrungutai and Chotirat Ratanamahatana. 2008. Singing Voice Separation for Mono-Channel Music Using Non-negative Matrix Factorization. 243–246. https://doi.org/10.1109/ATC.2008.4760565
[5] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. ArXiv abs/1406.1078 (2014).
[6] George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. IEEE, 8609–8613.
[7] Jonathan Foote. 1997. Content-based retrieval of music and audio. In Other Conferences.
[8] G. David Forney. 1973. The Viterbi algorithm. Proc. IEEE 61, 3 (1973), 268–278.
[9] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.
[10] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. 2017. Singing Voice Separation with Deep U-Net Convolutional Networks. In ISMIR.
[11] Nikhil Ketkar. 2017. Introduction to PyTorch. Apress, Berkeley, CA, 195–208. https://doi.org/10.1007/978-1-4842-2766-4_12
[12] Youngmoo E. Kim and Brian Whitman. 2002. Singer identification in popular music recordings using voice coding features. In Proc. International Symposium on Music Information Retrieval.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems 25 (01 2012). https://doi.org/10.1145/3065386
[14] Antoine Liutkus, Fabian-Robert Stöter, Zafar Rafii, Daichi Kitamura, Bertrand Rivet, Nobutaka Ito, Nobutaka Ono, and Julie Fontecave. 2017. The 2016 Signal Separation Evaluation Campaign. In Latent Variable Analysis and Signal Separation - 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, August 25-28, 2015, Proceedings, Petr Tichavský, Massoud Babaie-Zadeh, Olivier J.J. Michel, and Nadège Thirion-Moreau (Eds.). Springer International Publishing, Cham, 323–332.
[15] Antoine Liutkus, Fabian-Robert Stöter, Zafar Rafii, Daichi Kitamura, Bertrand Rivet, Nobutaka Ito, Nobutaka Ono, and Julie Fontecave. 2017. The 2016 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation. Springer, 323–332.
[16] Y. V. Srinivasa Murthy, Shashidhar G. Koolagudi, and Vishnu G. Swaroop. 2017. Vocal and Non-vocal Segmentation based on the Analysis of Formant Structure. (2017), 1–6.
[17] Tin Lay Nwe and Ye Wang. 2004. Automatic Detection Of Vocal Segments In Popular Songs. In ISMIR.
[18] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. 2017. The MUSDB18 corpus for music separation. https://doi.org/10.5281/zenodo.1117372
[19] Tushar Ratanpara and Narendra Patel. 2015. Singer Identification Using MFCC and LPC Coefficients from Indian Video Songs. In Emerging ICT for Bridging the Future - Proceedings of the 49th Annual Convention of the Computer Society of India (CSI) Volume 1, Suresh Chandra Satapathy, A. Govardhan, K. Srujan Raju, and J. K. Mandal (Eds.). Springer International Publishing, Cham, 275–282.
[20] Gerard Roma, Emad M. Grais, Andrew J. R. Simpson, and Mark D. Plumbley. 2016. Singing voice separation using deep neural networks and F0 estimation.
[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (Eds.). Springer International Publishing, Cham, 234–241.
[22] Ervin Sejdić, Igor Djurović, and Jin Jiang. 2009. Time–frequency feature representation using energy concentration: An overview of recent advances. Digital Signal Processing 19, 1 (2009), 153–183. https://doi.org/10.1016/j.dsp.2007.12.004
[23] Cheng-Ya Sha, Yi-Hsuan Yang, Yu-Ching Lin, and Homer H. Chen. 2013. Singing voice timbre classification of Chinese popular music. (2013), 734–738.
[24] Zebang Shen, Binbin Yong, Gaofeng Zhang, Rui Zhou, and Qingguo Zhou. 2019. A deep learning method for Chinese singer identification. Tsinghua Science and Technology 24 (08 2019), 371–378. https://doi.org/10.26599/TST.2018.9010121
[25] Zhengshan Shi. 2015. Singer Traits Identification using Deep Neural Network.
[26] Amirsina Torfi, N. M. Nasrabadi, and J. Dawson. 2017. Text-Independent Speaker Verification Using 3D Convolutional Neural Networks. arXiv:1705.09422 [cs.CV] (07 2017).
[27] Stefan Uhlich, Marcello Porcu, Franck Giron, Michael Enenkl, Thomas Kemp, Naoya Takahashi, and Yuki Mitsufuji. 2017. Improving music source separation based on deep neural networks through data augmentation and network blending. IEEE, 261–265.
[28] Félicien Vallet, Jim Uro, Jérémy Andriamakaoly, Hakim Nabi, Mathieu Derval, and Jean Carrive. 2016. Speech Trax: A Bottom to the Top Approach for Speaker Tracking and Indexing in an Archiving Context. In