Boosting the Predictive Accuracy of Singer Identification Using Discrete Wavelet Transform For Feature Extraction
Victoire Djimna Noyum a,∗, Younous Perieukeu Mofenjou a, Cyrille Feudjio a, Alkan Göktug b and Ernest Fokoué c
a School of Mathematical Sciences, African Institute for Mathematical Sciences, Crystal Garden, Limbe
b School of Mathematical Sciences, ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
c School of Mathematical Sciences, Rochester Institute of Technology, Rochester, NY 14623
ARTICLE INFO
Keywords: DWT, Singer Identification, RPCA, SVM, GMM
Abstract
Facing the diversity and growth of the musical field nowadays, the search for precise songs becomes more and more complex. The identity of the singer facilitates this search. In this project, we focus on the problem of identifying the singer by using different methods for feature extraction. In particular, we introduce the Discrete Wavelet Transform (DWT) for this purpose. To the best of our knowledge, DWT has never been used this way before in the context of singer identification. This process consists of three crucial parts. First, the vocal signal is separated from the background music by using Robust Principal Component Analysis (RPCA). Second, features are extracted from the obtained vocal signal. Here, the goal is to study the performance of the Discrete Wavelet Transform (DWT) in comparison to the Mel-Frequency Cepstral Coefficient (MFCC), which is the most widely used technique for audio signals. Finally, we proceed with the identification of the singer, for which two methods are evaluated: the Support Vector Machine (SVM) and the Gaussian Mixture Model (GMM). We conclude that, for a dataset of 4 singers and 200 songs, the best identification system consists of the DWT (db4) feature extraction introduced in this work combined with a linear support vector machine for identification, resulting in a mean accuracy of 83.96%.
1. Introduction
Music is a universal art form and cultural activity which can have several effects on the listener, depending on the intention of the artist as well as on the state of mind of the listener. Hence, with music, it is possible to express criticism of politics or society, to mobilize people for a cause, or to convey feelings arising from love, happiness, sadness, or loneliness. With the increasing possibilities to access and share art, the world of music is becoming more and more vast and diverse. Querying this world quickly and collecting precise information is a challenge data scientists face today. In this sense, by listening to a song, one could develop an interest in the biography of the artist and may want to access other songs from this artist. The issue on which this project is based is known as the identification of the singer. The identification of the singer is done in three phases: the separation of the singer's voice from the background music, the feature extraction, and the identification process using the features extracted from the vocal signal obtained from the separation procedure.
2. Background
A great deal of research has been done in the field of singer identification. In 2002, (Liu and Huang, 2002) proposed a singer identification technique for the classification of MP3 musical objects according to their content. They used phoneme segmentation for signal separation. Unfortunately, the signal of the singer's voice at the output of this method still contains a lot of background music (noise), which makes singer identification difficult.

⋆ This document is the result of a research project funded by AIMS CAMEROON with the help of the MASTERCARD Foundation. In this work, we show that the Discrete Wavelet Transform is the best method for feature extraction in vocal signals.
∗ Corresponding author
[email protected] (V.D. Noyum); [email protected] (Y.P. Mofenjou); [email protected] (C. Feudjio); [email protected] (A. Göktug); [email protected] (E. Fokoué)
ORCID (s): (V.D. Noyum)
Victoire Djimna et al.:
Preprint submitted to Elsevier
Page 1 of 17

In 2004, a spectrum-based method of identifying the singer, proposed by (Bartsch and Wakefield, 2004), worked well only for ideal cases that contained audio samples with the singer's voice only. The test set accuracy was 70-80%. In the "Identification of the singer based on vocal and instrumental models" proposed by (Maddage et al., 2004) in the same year, the singer was identified using both low-level characteristics and knowledge of musical structure. Using a dataset with 100 popular songs of solo singers, they obtained an accuracy of over 87%. However, this method was not suitable for music that was more instrumental than vocal.
A systematic approach to identify and separate the unvoiced singing voice from the musical accompaniment was proposed by (Hsu and Jang, 2009). For the separation of the singer's voice, they used the spectral subtraction method. This method follows the framework of Computational Auditory Scene Analysis (CASA), which includes segmentation and clustering steps. It considerably improved the clarity of the singing voice signal but was not always sufficient because, during clustering, a lot of information is lost.
To solve the problem of identifying the singer based on the acoustic variables of the singer's voice, (Yang, 2016) used the Gaussian Mixture Model (GMM) and the Support Vector Machine (SVM) in 2016. He obtained accuracies of 96.42% and 81.23% with a dataset of one hundred (100) songs by ten (10) singers. For signal separation, he used Robust Principal Component Analysis (RPCA), which is an improved version of Principal Component Analysis (PCA) and gives a better result than NMF. For feature extraction, he used the Mel-Frequency Cepstral Coefficient (MFCC).
In 2017, (Xing, 2017) proposed an effective system of singer identification with the human voice separated from the original music.
He first used Robust Principal Component Analysis (RPCA), with its high performance, for music separation. After clear enough human voices were extracted, the Linear Predictive Coding (LPC) method was chosen as the experimental method for feature extraction. Finally, the singer was identified by a Gaussian Mixture Model (GMM) with 63.6% accuracy on a dataset of 100 singers.
In 2019, the work of (NAMEIRAKPAM et al., 2019) implemented the discrete wavelet transform (DWT) as a pre-processing step (denoising) prior to feature extraction, to investigate the performance of singer identification with and without DWT. It was found that after applying the wavelet transform the accuracy decreases. However, the decrease in percentage accuracy is minimal (5.79%, 0.72% and 0.72% for 8, 16 and 32 Gaussians respectively), while the computational time is drastically reduced.
3. Problem Statement and Contribution of this Study
The recent and improved research presented above shows that singer identification using DWT for pre-processing and MFCC for feature extraction is done in a much reduced time, but the accuracy decreases compared to the results obtained without DWT. Unfortunately, MFCC uses the Fast Fourier Transform (FFT) for the change from the time domain to the frequency domain. The FFT does not retain the time-domain information, which results in a loss of data during the change. In this study, we will use DWT for the whole feature extraction process to see if it improves feature extraction more than MFCC, because this method retains time-domain information through its ability to operate in both the time and frequency domains simultaneously. Robust Principal Component Analysis (RPCA) will be used to separate the singer's voice from the background music. To identify the singer, we will apply both the Support Vector Machine (SVM) and the Gaussian Mixture Model (GMM).
4. Study Organization
The objective of this study is to build a model allowing the identification of the singer using DWT for feature extraction. To achieve this goal, we present Robust Principal Component Analysis (RPCA) as the best technique for separating the singer's voice, together with its methodology, followed by the description and process of feature extraction using the Discrete Wavelet Transform (DWT). Then, we explain the learning techniques used for singer identification, namely the Support Vector Machine (SVM) and the Gaussian Mixture Model (GMM). Finally, we present the experiments and results, conclude our research, and propose recommendations for future work.
5. Singing Voice Separation Technique: RPCA
Proposed by (Candès et al., 2011), Robust Principal Component Analysis (RPCA) is a modification of the Principal Component Analysis (PCA) method. RPCA has been proven to perform well for noise-corrupted data compared to PCA. The idea of RPCA is to decompose a data matrix V (V ∈ ℝ^{n×p}) into two other matrices, L and S, as follows:

V = L + S  (1)

In the case of sound, L (L ∈ ℝ^{n×p}) is the low-rank matrix corresponding to the background music and S (S ∈ ℝ^{n×p}) is the sparse matrix characterizing the singing voice. Indeed, in music, the incorporated noise part (background music) often varies more slowly with time than the singer's voice. In other words, the singer's voice is likely to be more non-stationary than the noise. This phenomenon can be easily observed by analyzing the spectrograms in Figure 1. The spectral structure of pure noise is usually fixed or slowly varying, while the vocal part changes rapidly over time. This hypothesis implies that the noise part appears to be of low rank, while the pure voice part is sparse (Hung et al., 2018). Therefore, extracting the sparse component from the music signal matrix tends to separate the background music from the voice of the singer. This separation is made by convex optimization, which can be written as (Candès et al., 2011):

minimize ‖L‖_* + λ‖S‖_1  subject to  L + S = V,  (2)

where λ > 0 is a trade-off parameter between the rank of L and the sparsity of S, ‖·‖_* is the nuclear norm (the sum of the singular values of the matrix) and ‖·‖_1 is the L1-norm (the sum of the absolute values of the matrix entries) (Candès et al., 2011).
To solve the problem given in equation 2, we use the Augmented Lagrange Multiplier (ALM) method. The corresponding formula is given by (Candès et al., 2011):

L_a(L, S, Y, μ) = ‖L‖_* + λ‖S‖_1 + ⟨Y, V − L − S⟩ + (μ/2)‖V − L − S‖²_F  (3)

In equation 3, μ is a penalty parameter (always positive), Y is the matrix of Lagrange multipliers, ‖·‖_F is the Frobenius norm, and ⟨Y, V − L − S⟩ denotes the standard trace inner product. At the end, we obtain the low-rank matrix L and the sparse matrix S (Candès et al., 2011).
Figure 1: Example RPCA results for Garou2.mp3: (a) the original matrix V, (b) the sparse matrix S, and (c) the low-rank matrix L.
To obtain the signals of the background music and the singing voice represented in Figure 2, the Inverse Short-Time Fourier Transform (ISTFT) is performed to return to the temporal domain. The signal separation process is summarized in Figure 3.
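The decomposition above can be computed with a standard singular-value-thresholding scheme. The following is a minimal numpy sketch of RPCA via the inexact augmented Lagrange multiplier method; the default λ = 1/√max(n, p) and the penalty update schedule are common choices from the literature, not values prescribed by this paper.

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise soft-thresholding (shrinkage) operator."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca_ialm(V, lam=None, tol=1e-7, max_iter=500):
    """Split V into a low-rank part L and a sparse part S (V = L + S)
    with the inexact augmented-Lagrange-multiplier scheme of equation 3."""
    m, n = V.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))      # common default trade-off parameter
    norm_V = np.linalg.norm(V)
    Y = np.zeros_like(V)                     # Lagrange multiplier matrix
    S = np.zeros_like(V)
    mu = 1.25 / np.linalg.norm(V, 2)         # penalty parameter (spectral norm)
    mu_bar = mu * 1e7
    rho = 1.5                                # penalty growth factor
    for _ in range(max_iter):
        # Singular-value thresholding solves the nuclear-norm subproblem for L
        U, s, Vt = np.linalg.svd(V - S + Y / mu, full_matrices=False)
        L = (U * soft_threshold(s, 1.0 / mu)) @ Vt
        # Elementwise shrinkage solves the L1-norm subproblem for S
        S = soft_threshold(V - L + Y / mu, lam / mu)
        R = V - L - S                        # constraint residual
        Y = Y + mu * R
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(R) / norm_V < tol:
            break
    return L, S
```

In the singer-separation setting, `V` would hold the magnitude spectrogram obtained from the STFT; `S` then carries the rapidly varying vocal part and `L` the background music.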
6. Discrete Wavelet Transform (DWT) as a feature extraction technique
The extraction of acoustic characteristics plays an essential role in the construction of a singer identification system. The objective is to select variables that have high discriminative power between labels and low variability within a label. The discriminating power of a characteristic, or of a set of characteristics, indicates the extent to which it can discriminate between labels. The selection of characteristics is usually done by examining the discriminative power of the variables. The performance of a set of features depends on the application at hand; thus, designing them for a specific application is the main challenge in building singer identification systems. In this section, we present the theoretical background of the Discrete Wavelet Transform (DWT) method. DWT is based on dividing the signal into several sub-bands before performing feature extraction.
Figure 2: Example of the signals after ISTFT for Garou2.mp3: (a) the original signal V, (b) the sparse (singing voice) signal S, and (c) the low-rank (background music) signal L.
Figure 3: Signal separation process (Li and Akagi, 2019)
The Wavelet Transform (WT) is a very powerful tool for the analysis and classification of time-series signals. It is unfortunately not well known or popular in the field of data science. This is partly because one needs some prior knowledge of signal processing and the Fourier Transform, as well as a solid mathematical background, before one can understand the theory underlying the wavelet transform. However, we believe that it is also due to the fact that most books, articles and papers are far too theoretical and do not provide enough practical information illustrating how they could be used. WT has many applications in the analysis of stationary and non-stationary signals. These applications include removing noise from signals, detecting abrupt discontinuities, and compressing large amounts of data (Wang et al., 2013).
WT decomposes a signal into a group of constituent signals, called wavelets, each having a well-defined dominant frequency, similar to the Fourier Transform (FT), in which a signal is represented by sine and cosine functions of unlimited duration. In WT, wavelets are transient functions of short duration, i.e. of limited duration centered around a specific time. The drawback of the FT is that, as the time domain transitions to the frequency domain, information about what is happening in the time domain is lost. From the frequency spectrum obtained using the FT, it is easy to identify the frequency content of the analyzed signal, but it is not possible to deduce at what time the components of the frequency spectrum appear or disappear. Unlike the FT, WT allows both time-domain and frequency-domain analysis, providing information on the evolution of the frequency content of a signal over time (Montejo and Suárez, 2007). There are many families of WT, but the two principal ones are:

• Continuous Wavelet Transform (CWT): the values of the scaling and translation factors are continuous, which means that there can be an infinite number of wavelets. It performs a multi-resolution analysis by contraction and dilation of the wavelet functions (Aggarwal et al., 2011). Its different sub-families are: Mexican hat wavelet, Morlet wavelet, complex Gaussian wavelets, and Gaussian wavelets.

• Discrete Wavelet Transform (DWT): it uses filter banks for the construction of the multi-resolution time-frequency plane and special wavelet filters for the analysis and reconstruction of signals (Merry and Steinbuch, 2005). Its different sub-families are: Daubechies, Symlets, Coiflets, and Biorthogonal.
The DWT is defined by equation 4:

W(j, k) = Σ_n x(n) 2^{−j} ψ(2^{−j} n − k),  (4)

where ψ(t) is a time function with finite energy and fast decay called the mother wavelet, and W(j, k) represents the wavelet coefficients, with k denoting location and j denoting level.
DWT has four main families: (1) Daubechies; (2) Symlets; (3) Coiflets; (4) Biorthogonal. Each type has a different shape, smoothness, and compactness and is useful for a different application. Since a wavelet has to satisfy only two mathematical conditions, the so-called normalization and orthogonalization constraints, it is easy to generate a new type of wavelet.
DWT-based feature extraction contains three major steps:

1. Wavelet Threshold De-Noising
In general, after separation, the voice signal still contains some small noises. The elimination of this noise is very important for the accuracy of the characteristics that will be extracted from the signal. Indeed, a singer has many different sounds and therefore, if the vocal signal extracted from these sounds contains noise, the identification will not be optimal. Donoho introduced the use of wavelets to denoise signals. He developed linear denoising for noise composed of high-frequency components and non-linear denoising (wavelet shrinkage) for noise also present in the low frequencies (Donoho, 1995). Schremmer et al. developed software for real-time wavelet noise canceling of audio signals. Noise suppression is achieved by soft or hard thresholding of the DWT coefficients (Schremmer et al., 2001). The success criterion for noise suppression is the difference between the original signal and the denoised signal. A speech enhancement system based on a wavelet denoising framework was introduced by Fu Qiang and Wan Eric. In this system, noisy speech is first pre-processed using a generalized spectral subtraction method to initially reduce the noise level with negligible speech distortion. Then, the resulting speech signal is decomposed into critical bands using the perceptual wavelet transform (Fu and Wan, 2003). Denoising using the DWT is developed in (Saric et al., 2005), where the threshold is given by the equation:

λ = σ_n √(2 log N),  (5)

where λ is the wavelet threshold, σ_n is the standard deviation of the noise, and N is the length of the sample signal.
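Wavelet threshold denoising can be sketched in a few lines of numpy. The example below uses the Haar wavelet for simplicity (the paper's experiments use db4) and estimates σ_n from the finest-scale detail coefficients with Donoho's median rule, an assumption since the paper does not state how σ_n is obtained:

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt(x):
    """One level of the Haar DWT: approximation and detail coefficients."""
    a = (x[0::2] + x[1::2]) / SQRT2
    d = (x[0::2] - x[1::2]) / SQRT2
    return a, d

def haar_idwt(a, d):
    """Invert one Haar DWT level (perfect reconstruction)."""
    x = np.empty(2 * a.size)
    x[0::2] = (a + d) / SQRT2
    x[1::2] = (a - d) / SQRT2
    return x

def wavelet_denoise(x, levels=3):
    """Soft-threshold the detail coefficients with lambda = sigma*sqrt(2 log N)
    (equation 5). Signal length must be divisible by 2**levels."""
    N = x.size
    a, details = x.copy(), []
    for _ in range(levels):
        a, d = haar_dwt(a)
        details.append(d)
    # Robust noise estimate from the finest-scale details (Donoho's rule)
    sigma = np.median(np.abs(details[0])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(N))
    details = [np.sign(d) * np.maximum(np.abs(d) - lam, 0.0) for d in details]
    for d in reversed(details):
        a = haar_idwt(a, d)
    return a
```

On a slowly varying signal corrupted with white noise, thresholding removes most of the noise energy carried by the fine-scale details while leaving the approximation band largely untouched.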
2. Wavelet Decomposition
As illustrated in Figure 4, the DWT breaks a signal down into several scales representing different frequency bands. Short-duration wavelets are used to extract information from the high-frequency components, while long-duration wavelets can be used to extract information from the low frequencies (Chang et al., 2000). The process continues over multiple levels, the approximation coefficients of one level being decomposed again at the next. At each level, the frequency resolution is doubled by the filters while the time resolution is halved. In the end, we keep all the high-frequency bands (H1, H2, H3, H4) and the last low-frequency band (L4).

Figure 4: DWT process, level four (dwt, accessed April 22, 2020)

3. Feature Extraction
Multi-resolution analysis (MRA) is used to extract feature vectors from the signal data. Time-frequency-domain DWT-based statistical features that are very common for the classification of vocal signals include the mean average value, the standard deviation, and the spectral entropy.
• Mean average value: the mean of each vector of the sub-bands obtained in the previous step. It is given by

μ = (1/N) Σ_{i=1}^{N} x_i  (6)

• Standard deviation: it characterizes the spread of the signal around its mean. It is given by

σ = √( (1/N) Σ_{i=1}^{N} (x_i − μ)² )  (7)

• Power spectral density: it is calculated in two steps: first, by finding the Fast Fourier Transform (FFT) F(ξ_i) of the time series, and then by taking the squared modulus of the FFT coefficients:

P(ξ_i) = |F(ξ_i)|² / N  (8)
• Spectral entropy: the measure of the randomness and information content of a signal. To calculate the entropy of a given vocal signal, we use the Shannon entropy formula

E = − Σ_{i=1}^{N} x_i log(x_i)  (9)
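The four statistics above can be computed per sub-band as follows. This is a minimal sketch; the normalisation of the power distribution before the entropy step is an assumption on our part, since equation 9 requires values in (0, 1]:

```python
import numpy as np

def subband_features(band):
    """Statistical features of one wavelet sub-band (Eqs. 6-9): mean,
    standard deviation, total spectral power, and spectral entropy."""
    N = band.size
    mean = band.mean()                           # Eq. (6)
    std = band.std()                             # Eq. (7)
    psd = np.abs(np.fft.rfft(band)) ** 2 / N     # Eq. (8), per-bin power
    # Spectral entropy (Eq. 9) on the normalised power distribution;
    # the normalisation step is an assumption, since Eq. 9 needs p_i in (0, 1]
    p = psd / psd.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))
    return np.array([mean, std, psd.sum(), entropy])
```

Concatenating these vectors over all retained sub-bands (H1 to H4 and L4) yields one feature vector per singing-voice segment.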
7. Learning Techniques: Classification Models
In the world of machine learning, two main areas can be distinguished: supervised learning and unsupervised learning. The main difference between the two lies in the nature of the data and the approaches used to process them. In this section, we present two learning techniques that are widely used for audio: the Support Vector Machine (SVM) and the Gaussian Mixture Model (GMM).
The Support Vector Machine (SVM) was introduced by Boser, Guyon, and Vapnik (Boser et al., 1992) and further developed by Vapnik (Vapnik, 1998); it is useful for solving supervised classification problems in high dimensions. The SVM approach searches directly for a separating plane or surface by an optimization procedure that finds the points that form the boundaries of the classes. These points are called support vectors. Besides, the SVM approach uses the kernel method to map the data with a nonlinear transformation to a high-dimensional space and tries to find a separating surface between the two classes in this new space. When we have two labels (classes), we use the binary SVM, and in cases with more than two labels, we apply the multi-class SVM.
The binary SVM is used when the data has exactly two classes. For classification, the SVM finds the best hyperplane separating all data points of one class from those of the other class, as illustrated in Figure 5 by the red line. The best hyperplane is the one with the largest margin between the classes.
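A minimal numpy sketch of a linear binary SVM trained by sub-gradient descent on the regularised hinge loss follows. It uses the paper's convention f(x) = sign(w·x − b); the learning rate, regularisation strength, and epoch count are illustrative assumptions, not values from the paper:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.001, epochs=200):
    """Primal sub-gradient descent on hinge loss + L2 regularisation.
    X: (m, d) samples; y: labels in {-1, +1}."""
    m, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in range(m):
            w -= lr * 2 * lam * w            # regularisation sub-gradient
            if y[i] * (X[i] @ w - b) < 1:    # point inside margin: hinge active
                w += lr * y[i] * X[i]
                b -= lr * y[i]
    return w, b

def predict(w, b, X):
    """Decision rule f(x) = sign(w.x - b)."""
    return np.sign(X @ w - b)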
Figure 5: SVM graph (svm, accessed April 19, 2020)
The hyperplane equation is given by

w^T x − b = 0,  (10)

where w is the weight vector, b is the bias, and x is an input variable. There are three cases of the binary SVM:

1. Hard margin
Here, the two classes can be separated linearly (Figure 5). The goal is to maximize the margin 2/‖w‖, which is equivalent to minimizing ‖w‖. Hence, the problem can be reformulated as w* = argmin ‖w‖. We have two constraints:
(a) w^T x^(i) − b ≥ 1, if y^(i) = 1
(b) w^T x^(i) − b ≤ −1, if y^(i) = −1
We combine these two constraints and get y^(i)(w^T x^(i) − b) ≥ 1, for i = 1, ..., m, where m is the number of samples and y is the label. So, the optimization problem becomes:

w* = argmin ‖w‖  subject to  y^(i)(w^T x^(i) − b) ≥ 1

The SVM classifier is given by:

f(x) = sign(w · x − b)  (11)

2. Soft margin
This case occurs when the data are not linearly separable because some data points fall inside the margin. Consequently, the loss function becomes the hinge loss:

ℓ(w; x, y) = max[0, 1 − y^(i)(w^T x^(i) − b)],  i = 1, ..., m.  (12)

Given the hinge loss of equation 12, the expected loss is

L(w) = (1/m) Σ_{i=1}^{m} ℓ(w; x^(i), y^(i)) + λ‖w‖²,

where λ is the trade-off between increasing the size of the margin and ensuring that the data points lie on the correct side of the margin. Hence, the optimization problem becomes:

w* = argmin L(w)

3. Nonlinear classification (kernel SVM)
Figure 6 depicts a case where the data cannot be separated by a hyperplane. We then look for a map φ: x^(i) ⟶ φ(x^(i)) from the data space to a feature space such that the data are linearly separable in the feature space, by applying the so-called "kernel trick":

k(x, x_i) = φ(x) · φ(x_i)  (13)

The kernel function may be any symmetric function that satisfies Mercer's conditions (Brunner et al., 2012). In the feature space, one can write:

w^T = Σ_{i=1}^{n} α_i y^(i) φ(x^(i))^T

so that

w^T φ(x) = Σ_{i=1}^{n} α_i y^(i) φ(x^(i)) · φ(x)  (14)

Using 13 in 14, we obtain:

w^T φ(x) = Σ_{i=1}^{n} α_i y^(i) k(x^(i), x)

So the margin constraint y^(j)(w^T φ(x^(j)) − b) becomes

y^(j) [ Σ_{i=1}^{n} α_i y^(i) k(x^(i), x^(j)) − b ]

There are several SVM kernel functions.

(a)
Polynomial kernel: a non-stationary kernel, well suited to problems where all training data are normalized. It is given by equation 15:

K(x, x_i) = (α x^T x_i + c)^d,  (15)

where the slope α is an adjustable parameter, d is the polynomial degree and c is a constant. The dimension of the feature space vector φ(x) for the polynomial kernel of degree p and for an input pattern of dimension d is (p + d)! / (p! d!).

(b) Gaussian kernel: an example of a radial basis function (RBF) kernel. It is characterized by the equation:

k(x, x′) = exp[−γ ‖x − x′‖²]  (16)

Usually, γ = 1/(2σ²), so equation 16 becomes:

k(x, x′) = exp[−‖x − x′‖² / (2σ²)]  (17)

The adjustable parameter σ plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand. If it is overestimated, the exponential behaves almost linearly and the high-dimensional projection begins to lose its non-linear power. On the contrary, if it is underestimated, the function lacks regularization and the decision boundary becomes highly sensitive to noise in the training data (Ramalingam and Dhanalakshmi, 2014). The kernel SVM classifier is given by:

f(x) = sign(w^T φ(x) + b)  (18)

The SVM was designed for binary classification, but in the real world we deal with classification problems with more than two classes. Multi-category classification problems are usually divided into a series of binary problems so that the binary SVM can be directly applied (Mathur and Foody, 2008). One representative method is the "One-Against-All" approach. Consider an M-class problem, where we have N training samples: {x^(1), y^(1)}, ..., {x^(N), y^(N)}. Here, x^(i) ∈ ℝ^m is an m-dimensional feature vector and y^(i) ∈ {1, 2, ..., M} is the corresponding class label.
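The two kernels of equations 15 and 17 can be written directly in numpy; the parameter defaults below are illustrative assumptions:

```python
import numpy as np

def polynomial_kernel(x, z, alpha=1.0, c=1.0, d=3):
    """Polynomial kernel, Eq. (15): K(x, z) = (alpha * x.z + c)^d."""
    return (alpha * np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel, Eq. (17): k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```

Both are symmetric, and the Gaussian kernel equals 1 when its arguments coincide; shrinking σ sharpens the kernel, which is exactly the sensitivity-to-noise trade-off described above.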
Preprint submitted to Elsevier
Page 9 of 17 igure 6:
Kernel SVM graph (svm, (accessed April 19, 2020)
The One-Against-All approach constructs M binary SVM classifiers, each of which separates one class from all the rest. The i-th SVM is trained with all the training examples of the i-th class with positive labels and all the others with negative labels. Mathematically, the i-th SVM solves the following problem, which yields the i-th decision function f_i(x) = sign(w_i^T φ(x) + b_i):

minimize: L(w_i, ξ_{ij}) = (1/2)‖w_i‖² + C Σ_{j=1}^{N} ξ_{ij}
subject to: ỹ_j (w_i^T φ(x_j) + b_i) ≥ 1 − ξ_{ij},  ξ_{ij} ≥ 0,  (19)

where ỹ_j = 1 if y_j = i and ỹ_j = −1 otherwise. At the classification phase, a sample x is predicted to be in the class i* whose f_{i*} produces the largest value:

i* = argmax_i f_i(x) = argmax_i (w_i^T φ(x) + b_i),  i = 1, ..., M  (20)

The Gaussian Mixture Model (GMM) is a parametric probability density function expressed as a weighted sum of Gaussian component densities. In biometric systems, GMMs are widely used as a parametric model of the probability distribution of continuous measurements or features, such as the spectral features related to the vocal tract in a speaker recognition system. GMM parameters are estimated from training data using the iterative expectation-maximization (EM) algorithm (Reynolds, 2009).
Here, the initialization of the GMM parameters is carried out using the number of clusters, which determines the different centers. In fact, a GMM is a function comprised of several Gaussians, each identified by k ∈ {1, ..., K}, where K is the number of clusters in the dataset. Each component k has the following parameters (Figure 7):
1. A mean vector μ_k that defines its center.
2. A covariance matrix Σ_k that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.
3. A mixture weight w_k that defines how big or small the Gaussian component is.
The mixture weights must satisfy the constraint:

Σ_{k=1}^{K} w_k = 1  (21)
Figure 7: GMM parameters (Reynolds, 2009).
The GMM is given by

p(x | λ) = Σ_{k=1}^{K} w_k g(x | μ_k, Σ_k),  k = 1, ..., K,  (22)

where x is a D-dimensional continuous-valued data vector (i.e. measurements or features), w_k are the mixture weights, and g(x | μ_k, Σ_k) are the Gaussian densities. Each component density is a D-variate Gaussian function of the form:

g(x | μ_k, Σ_k) = 1 / ((2π)^{D/2} |Σ_k|^{1/2}) exp[ −(1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k) ]  (23)

A collective representation of the parameters is defined as:

λ = {w_k, μ_k, Σ_k},  k = 1, ..., K  (24)

There are several variants of the GMM presented in equation 24. The Σ_k can be full rank or forced to be diagonal. Also, parameters can be shared or tied between Gaussian components, for example by having a common covariance matrix for all components. The choice of model configuration (the number of components, full or diagonal covariances, and parameter tying) is usually determined by the volume of data available for the estimation of the GMM parameters and by the way in which the GMM is used in a particular biometric application. It is important to note that even if the features are not statistically independent, the Gaussian components act together to model the overall density of the features. Correlations between the components of the feature vectors can be modeled by a linear combination of diagonal-covariance Gaussians: the effect of employing a set of K full-covariance Gaussians can also be obtained with a larger set of diagonal-covariance Gaussians. The use of a GMM to represent feature distributions can also be motivated by the intuitive idea that the individual component densities model an underlying set of hidden classes. For example, in the case of speech, it is reasonable to assume that the acoustic space of the spectral features can be partitioned into classes corresponding to a speaker's major phonetic events, such as vowels or fricatives.
These acoustic classes reflect certain general speaker-dependent vocal-tract configurations that are useful in characterizing speaker identity. The spectral shape of the k-th acoustic class can, in turn, be represented by the mean μ_k of the k-th component density, and variations of the mean spectral shape can be represented by the covariance matrix Σ_k. Since the features used to form the GMM are not labeled, the acoustic classes are hidden, in the sense that the class of an observation is unknown. The observation density of feature vectors derived from these hidden acoustic classes is a Gaussian mixture (assuming the feature vectors are independent) (Reynolds, 2009).

Given the training vectors and a configuration of the GMM, we wish to estimate the parameters of the GMM, λ, which in some sense best match the distribution of the training vectors. There are several techniques for estimating the parameters of a GMM (McLachlan and Basford, 1988). By far the most popular and best-established method is the expectation-maximization (EM) algorithm. The objective of EM is to find the model parameters that maximize the likelihood of the GMM given the training data. The mixture density of a data vector x can be written as

p(x | Θ) = Σ_{k=1}^{K} w_k p_k(x | z_k, θ_k),  (25)

where p(x | Θ) is the finite mixture model and p_k(x | z_k, θ_k) is the Gaussian density of the k-th mixture component. z = (z_1, ..., z_K) is a vector of K binary indicator variables which are mutually exclusive and exhaustive, and Θ = {w_1, ..., w_K, θ_1, ..., θ_K} is the complete set of parameters. For a sequence of T training vectors X = x_1, ..., x_T, the likelihood of the GMM, assuming independence between the vectors, is the product of the mixture densities p(x_t | Θ).

Step 1: E-step
The goal here is to compute the membership weights w_{ik}, which are the probabilities reflecting the uncertainty, given x_i and Θ. The membership weight of a data point x_i in cluster k can be written as:

w_{ik} = p(z_{ik} = 1 | x_i, Θ) = w_k p_k(x_i | z_k, θ_k) / Σ_{m=1}^{K} w_m p_m(x_i | z_m, θ_m)  (26)

Step 2: M-step
This step uses the membership weights obtained in the E-step (equation 26) to calculate new parameter values, given by equations 27, 28 and 29:

w_k = N_k / N,  1 ≤ k ≤ K,  (27)

μ_k = (1/N_k) Σ_{i=1}^{N} w_{ik} x_i,  1 ≤ k ≤ K,  (28)

Σ_k = (1/N_k) Σ_{i=1}^{N} w_{ik} (x_i − μ_k)(x_i − μ_k)^T,  1 ≤ k ≤ K,  (29)

where N_k = Σ_{i=1}^{N} w_{ik} is the column sum of the membership weight matrix.
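The E- and M-steps above can be sketched for the one-dimensional case in a few lines of numpy. The quantile-based initialisation of the means is our own assumption, since the paper only says the initialisation uses the number of clusters:

```python
import numpy as np

def gmm_em(x, K=2, iters=100):
    """EM for a 1-D Gaussian mixture, following the E-step (Eq. 26)
    and M-step (Eqs. 27-29). x: 1-D array of samples."""
    N = x.size
    # Initialisation (assumption): quantile-spread means, global variance,
    # uniform mixture weights
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))
    var = np.full(K, x.var())
    w = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: membership weights w_ik (Eq. 26)
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances (Eqs. 27-29)
        Nk = r.sum(axis=0)
        w = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return w, mu, var
```

In the identification system, one such model is trained per singer on that singer's feature vectors, and a test segment is assigned to the singer whose GMM gives the highest likelihood.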
8. Implementation and Analysis
This section presents the implementation of the techniques studied in the previous sections for singer identification. We first give an overview of the procedure from separation to identification. Then, we present the experimental data used in this study. Finally, we show the experiments and results obtained at each step of the singer identification process, with and without feature extraction.
As shown in Figure 8, the inputs are audio files with the .mp3 extension. After importing these files, we obtain the musical signals. We apply the STFT to each signal to obtain its matrix representation in the frequency domain. The RPCA technique is then applied to this matrix, separating it into a low-rank matrix and a sparse matrix. After performing the ISTFT on the sparse matrix, the vocal signal is obtained. The vocal signals obtained from every sound in the dataset are used to build a data frame. The purpose of this study is to show, first, the importance of feature extraction and, then, to compare the two techniques DWT and MFCC. Hence, we perform three experiments: (1) training the data without feature extraction; (2) using MFCC for feature extraction; (3) using DWT for feature extraction, in each case before training with the SVM and GMM techniques.
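The low-rank/sparse split at the heart of this pipeline can be sketched with a minimal principal component pursuit solver. The singular-value-thresholding iteration below is a simplified stand-in for the RPCA solver used in the experiments, demonstrated on a small synthetic matrix rather than an STFT magnitude matrix; the parameter choices follow the common defaults for this formulation and are assumptions, not the paper's settings.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding (shrinkage) operator."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def rpca(M, n_iter=200):
    """Split M into low-rank L plus sparse S via a simplified
    augmented-Lagrangian iteration for principal component pursuit."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))            # usual sparsity trade-off
    mu = m * n / (4.0 * np.abs(M).sum())      # usual penalty parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        # Low-rank update: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(sig, 1.0 / mu)) @ Vt
        # Sparse update: elementwise shrinkage
        S = shrink(M - L + Y / mu, lam / mu)
        # Dual update drives M - L - S toward zero
        Y += mu * (M - L - S)
    return L, S

# Rank-1 "background" plus a few large "vocal" spikes.
rng = np.random.default_rng(0)
L0 = np.outer(rng.normal(size=50), rng.normal(size=40))
S0 = np.zeros((50, 40))
S0[rng.integers(0, 50, 30), rng.integers(0, 40, 30)] = 10.0
L, S = rpca(L0 + S0)
```

In the pipeline described above, `M` would be the STFT magnitude matrix, the low-rank part models the repetitive accompaniment, and the sparse part feeds the ISTFT to recover the vocal signal.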
We created a database of test recordings by selecting four popular singers, two men and two women, each with 50 excerpts. These songs go through the pre-processing phase, where missing values are removed and twelve-second singing voice segments are obtained from each musical recording. As a result, after separating the signals using RPCA, each singer has a total of 263232 singing voice samples, which are then introduced into the feature extraction phase.
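The segmentation step can be sketched as a simple reshaping of the sample array; `split_segments` is a hypothetical helper (not from the paper), and the 8320 Hz rate is taken from the sub-band analysis presented later.

```python
import numpy as np

def split_segments(signal, fs, seg_seconds=12):
    """Cut a 1-D signal into non-overlapping fixed-length segments,
    dropping the incomplete tail."""
    seg_len = int(seg_seconds * fs)
    n_segs = len(signal) // seg_len
    return signal[: n_segs * seg_len].reshape(n_segs, seg_len)

# A 50-second recording at 8320 Hz yields four complete 12-second segments.
segs = split_segments(np.zeros(50 * 8320), fs=8320)
```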
Figure 8:
Block diagram of the overall process of singer’s identification
For de-noising, we investigate the maximum gain for input signals with different levels of degradation. The amount of white noise added to the original signal is controlled by the standard deviation of the noise, $\sigma_n$. The maximum gain is obtained by replacing the threshold $\lambda$ of equation (5) with
$$\lambda = k\, \sigma_n \sqrt{2 \log N}, \qquad (30)$$
where $0 < k < 1$. In fact, the universal threshold given by equation (5) is too high for audio signals, and it cuts out a large part of the original signal; it is therefore modified by the factor $k$ to obtain a higher-quality output signal. The value of $k$ is changed gradually in steps of 0.1, and we keep the $k$ that gives the best result, depicted in Figure 9.
Figure 9:
De-noising of the signal of Celine's song
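The thresholding rule of equation (30) can be sketched as follows. For brevity the example uses a single-level Haar transform implemented directly in NumPy rather than the db4 wavelet used in the paper, and soft-thresholds only the detail coefficients; it is a minimal sketch of the idea, not the paper's de-noiser.

```python
import numpy as np

def denoise_haar(signal, k, sigma_n):
    """De-noise with eq. (30): lambda = k * sigma_n * sqrt(2 log N),
    applied to the detail coefficients of a 1-level Haar DWT."""
    N = len(signal)
    lam = k * sigma_n * np.sqrt(2 * np.log(N))
    s = signal.reshape(-1, 2)
    approx = (s[:, 0] + s[:, 1]) / np.sqrt(2)        # low-pass coefficients
    detail = (s[:, 0] - s[:, 1]) / np.sqrt(2)        # high-pass coefficients
    # Soft-threshold the detail coefficients, where most noise lives
    detail = np.sign(detail) * np.maximum(np.abs(detail) - lam, 0)
    out = np.empty_like(signal, dtype=float)         # inverse Haar transform
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

# Noisy slow sine: thresholding should reduce the mean squared error.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024)
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + rng.normal(0, 0.3, t.size)
denoised = denoise_haar(noisy, k=0.5, sigma_n=0.3)
```

With $k = 1$ this reduces to the universal threshold; smaller $k$ trades residual noise for less distortion of the signal, which is the tuning described above.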
8.3.2. Decomposition of signal
Following the literature, we use for this implementation the Daubechies-4 (db4) DWT at level 4. Each signal is sampled at 8320 Hz after de-noising. In Figure 10, we can easily see the five (05) sub-bands of the previously de-noised song: $L_4$ (0–260 Hz), $H_4$ (260–520 Hz), $H_3$ (520–1040 Hz), $H_2$ (1040–2080 Hz), and $H_1$ (2080–4160 Hz).
Figure 10:
Decomposition of the signal of Celine’s song previously de-noised
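The sub-band edges listed above follow directly from halving the remaining band at each level of a dyadic DWT; a small sketch, assuming the 8320 Hz sampling rate implied by the sub-band boundaries:

```python
def dwt_subbands(fs, levels):
    """Frequency ranges of the detail bands H_1..H_levels and the final
    approximation band L_levels of a dyadic DWT."""
    bands = {}
    hi = fs / 2.0                       # Nyquist frequency
    for lev in range(1, levels + 1):
        lo = hi / 2.0
        bands[f"H{lev}"] = (lo, hi)     # detail band at this level
        hi = lo
    bands[f"L{levels}"] = (0.0, hi)     # remaining approximation band
    return bands

bands = dwt_subbands(fs=8320, levels=4)
# e.g. bands["H1"] == (2080.0, 4160.0) and bands["L4"] == (0.0, 260.0)
```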
After the decomposition of the signal into sub-bands, we extract the following features to build our final data frame:
• Time-frequency domain: mean and spectral entropy.
• Time domain: mean, median, and standard deviation.
• Frequency domain: power spectral density.
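Per sub-band, these statistics can be computed with NumPy alone; the spectral-entropy and power-spectral-density formulas below are common textbook definitions and are assumptions about the paper's exact choices, not its verified implementation.

```python
import numpy as np

def subband_features(coeffs):
    """Summary features for one DWT sub-band (1-D coefficient array)."""
    spectrum = np.abs(np.fft.rfft(coeffs)) ** 2          # periodogram
    psd = spectrum / spectrum.sum()                      # normalized PSD
    p = psd[psd > 0]
    return {
        "mean": float(np.mean(coeffs)),                  # time-domain mean
        "median": float(np.median(coeffs)),              # time-domain median
        "std": float(np.std(coeffs)),                    # standard deviation
        "spectral_entropy": float(-(p * np.log2(p)).sum()),
        "psd_mean": float(spectrum.mean()),              # average spectral power
    }

rng = np.random.default_rng(0)
feats = subband_features(rng.normal(size=2048))
```

Applying this to each of the five sub-bands yields a fixed-length feature vector per segment, which is what the data frame described below is built from.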
In this step, we first perform feature engineering to see which of the 18 features are correlated, and we use PCA to keep the features that represent 99.99% of the variation: 15 features. Then, we split our data frame into features as inputs and singer names as output labels. We check the model parameters that can influence the result and keep those that give the best result. Finally, knowing the audio signal, we train the machine learning models using 10-fold cross-validation. We also shuffle the data fifteen (15) times to obtain realistic results; the overall accuracy is then the mean of the fifteen (15) accuracies.

First, we train without feature extraction. We observe that the best model is the SVM-RBF model, with a mean accuracy of 36.78%. It should be noted that, without feature extraction, the 263232 columns of the data frame are taken as inputs for model training. It then becomes impossible to build the covariance matrix for the GMM, because the number of inputs is greater than the number of observations. A dimensionality reduction is not possible either, because these columns represent the signal vector, and reducing it would amount to muting the signal; therefore, the implementation of this method was not possible.

Second, we train using MFCC for feature extraction. We find that, in general, the SVM models perform better than the GMM. The best model is SVM-Linear, with a mean accuracy of 61.49%.

Finally, we train the models using the final data frame obtained at the previous step, with DWT for feature extraction. The best model is SVM-Linear, with a mean accuracy of 83.96%. The results are summarized in Figure 11.
Figure 11:
Boxplot of performance (accuracies) with DWT
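The evaluation protocol (repeated shuffling plus 10-fold cross-validation of a linear SVM) can be sketched with scikit-learn. The synthetic data below stands in for the 15-component DWT feature data frame (4 singers, 15 features), so the resulting numbers are illustrative only, not the paper's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the DWT feature data frame: 4 singers, 15 features.
X, y = make_classification(n_samples=200, n_features=15, n_informative=10,
                           n_classes=4, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Fifteen shuffles, each scored with 10-fold cross-validation.
accs = []
for shuffle in range(15):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=shuffle)
    accs.extend(cross_val_score(model, X, y, cv=cv, scoring="accuracy"))
mean_acc = float(np.mean(accs))
```

Reporting the mean over 150 fold accuracies, as here, matches the "mean of the fifteen accuracies" protocol described above.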
9. Conclusion and Recommendation
The objective of this work was to apply the DWT for feature extraction and to compare the results with the MFCC, to see which of the two improves singer identification in terms of accuracy. We first gave the physical and mathematical description of the different techniques, ranging from the separation of the vocal signal from the background signal to the singer identification. Then, we implemented these techniques on a dataset of 200 songs (50 songs per singer). RPCA was used for the separation of signals; DWT and MFCC were used for feature extraction; and SVM and GMM were used for singer identification. For a set of 200 observations of audio signals, this study shows that:
• The vocal signals are very unstable, because the same artist sings in different styles and uses different nuances, which influences the results of a study using a small number of recordings. As a consequence, low accuracies are obtained.
• Feature extraction is essential for the singer identification process.
• DWT performs better than MFCC for feature extraction, in terms of accuracy and training time.
• SVM performs better than GMM for singer identification.
• The best configuration of techniques for singer identification is DWT + SVM-Linear, with a mean accuracy of 83.96% and a training time of 1 068 s.

However, to generalize these results, it is essential to perform the same study on a much larger set of recordings. Since DWT is better at extracting characteristics from audio signals, future work could be an in-depth study of the different families of DWT, to investigate the effects of their individual properties and to use appropriate DWTs for different kinds of datasets. Besides, DWT can be used over extended periods to study other non-stationary signals, such as those in the human body (electroencephalogram (EEG), electrocardiogram (ECG), electro-oculography (EOG)). This would allow abnormalities to be detected fairly quickly, and diseases to be predicted and treated before complications arise.
CRediT authorship contribution statement
Victoire Djimna Noyum: Conceptualization of this study, Methodology, Software, Data curation, Writing - original draft.
Younous Perieukeu Mofenjou: Software, Result compilation, Writing - review.
Cyrille Feudjio: Software, Result compilation, Writing - review.
Alkan Göktug: Supervision, Validation, Writing - review & editing.
Ernest Fokoué: Supervision, Software, Validation, Writing - review & editing.
References
Merry, R., Steinbuch, M., 2005. Wavelet theory and applications. Literature study, Eindhoven University of Technology, Department of Mechanical Engineering, Control Systems Technology Group.
Montejo, L.A., Suárez, L.E., 2007. Aplicaciones de la transformada ondícula ("wavelet") en ingeniería estructural. Mecánica Computacional 26, 2742–2753.
Nameirakpam, J., Biswas, S., Bonjyostna, A., 2019. Singer identification using wavelet transform, in: 2019 2nd International Conference on Innovations in Electronics, Signal Processing and Communication (IESC), IEEE. pp. 238–242.
Ramalingam, T., Dhanalakshmi, P., 2014. Speech/music classification using wavelet based feature extraction techniques. Journal of Computer Science 10, 34.
Reynolds, D.A., 2009. Gaussian mixture models. Encyclopedia of Biometrics, 741.
Saric, M., Bilicic, L., Dujmic, H., 2005. White noise reduction of audio signal using wavelets transform with modified universal threshold. University of Split, R. Boskovica b.b., HR 21000.
Schremmer, C., Haenselmann, T., Bomers, F., 2001. A wavelet based audio denoiser, in: Proc. IEEE International Conference on Multimedia and Expo, pp. 145–148.
Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.
Wang, X., Wang, J., Fu, C., Gao, Y., 2013. Determination of corrosion type by wavelet-based fractal dimension from electrochemical noise. Int. J. Electrochem. Sci. 8, 7211–7222.
Xing, L., 2017. Singer identification of pop music with singing-voice separation by RPCA.
Yang, S., 2016. Statistical approaches for signal processing with application to automatic singer identification.
Victoire Djimna et al.: Preprint submitted to Elsevier