Data-Driven Audio Recognition: A Supervised Dictionary Approach

A Preprint

Imad Rida
Laboratoire BMBI, Université de Technologie de Compiègne, Compiègne, France

2021-01-01

Abstract
Machine hearing, or machine listening, is an emerging area. Conventional approaches rely on handcrafted features specialized to a specific audio task, which can hardly generalize to other audio fields. Unfortunately, these predefined features may be of variable discrimination power when extended to other tasks, or even within the same task, due to the varying nature of the clips. Motivated by the need for a principled framework applicable across machine listening domains, we propose a generic, data-driven representation learning approach. To this end, a novel and efficient supervised dictionary learning method is presented. Experiments are performed on both computational auditory scene (East Anglia and Rouen) and synthetic music chord recognition datasets. The results show that our method matches state-of-the-art hand-crafted features on both applications.

Keywords: Audio · Dictionary learning · Music · Scene

1 Introduction

Humans have a very high perception capability through physical sensation, which can include sensory input from the eyes, ears, nose, tongue, or skin. A lot of effort has been devoted to developing intelligent computer systems capable of interpreting data in a manner similar to the way humans use their senses to relate to the world around them. While most efforts have focused on visual perception, which is the dominant sense in humans, machine hearing, also known as machine listening or computer audition, is an emerging area [1].

Machine hearing refers to the ability of a computer or machine to process audio data. There is a wide variety of audio application domains, including music, speech, and environmental sounds. Depending on the application domain, several tasks can be performed, such as speech/speaker recognition, music transcription, computational auditory scene recognition, etc. (see Table 1).

Table 1: Machine hearing tasks based on different application domains.
Tasks          | Environmental                            | Speech                        | Music
Description    | Environment                              | Emotion                       | Music recommendation
Classification | Computational auditory scene recognition | Speech or speaker recognition | Music transcription
Detection      | Event detection                          | Voice activity detection      | Music detection

In this chapter, we are interested in the classification of audio signals in both the environmental and music domains, and more particularly in Computational Auditory Scene Recognition (CASR) and music chord recognition. The former refers to the task of associating a semantic label with an audio stream identifying the environment in which it was produced, while the latter aims to recognize music chords, which represent the most fundamental structure and the backbone of occidental music.

In the following, we briefly review different approaches to audio signal classification. A novel method able to learn discriminative feature representations is then introduced. Extensive experiments are performed on CASR and music chord benchmark databases, and the results are compared to conventional state-of-the-art hand-crafted features.
The problem of audio signal classification is becoming more and more common, ranging from speech to non-speech signal classification. The usual approach is first to extract discriminative feature representations from the signals, and then feed them to a classifier. Features are chosen so as to enforce similarities within a class and disparities between classes: the more discriminative the features, the better the classifier performs.

For each audio signal classification problem, specific hand-crafted features have been proposed. For instance, chroma vectors are the dominant representation developed to extract the harmonic content of music signals for different applications [2, 3, 4, 5, 6, 7, 8, 9]. In audio scene recognition, recorded signals can potentially contain a very large number of sound events while only a few of them are informative. Furthermore, the sound events can be of different nature depending on the location (street, office, restaurant, train station, etc.). To tackle this problem, features such as Mel-Frequency Cepstral Coefficients (MFCCs) [10, 11, 12, 13, 14, 15] have been successfully applied and combined with different classification techniques [16, 17].

These predefined features may be of variable discrimination power, according to the signal nature and the learning task, when extended to other application domains. For this reason, machine hearing systems should be able to learn the suited feature representations automatically. Time-frequency features have shown a good ability to represent real-world signals [18], and methods have been designed to learn them.
They can be broadly divided into four main approaches [19]: wavelets, Cohen distribution design, dictionary learning, and filter bank learning, summarized in Table 2.

Table 2: Non-exhaustive time-frequency representation learning for classification [19].

Approach           | Methods
Wavelets           | [20], [21], [22]
Cohen distribution | [18], [23]
Dictionary         | [24], [25]
Filter bank        | [26], [19]

Wavelets have shown very good performance in the context of compression [27, 28], where one minimizes the error between the original and the approximate signal representation. While the latter may be a salutary goal, it does not address classification problems well. [20] suggested a classification-based cost function maximizing the minimum probability of correct classification along the confusion-matrix diagonal; this cost function is optimized using a genetic algorithm (GA) [29]. [21] tuned their introduced wavelet by maximizing the distance, in the wavelet feature space, between the means of the classes to be classified. This is done by constructing shape-adapted Local Discriminant Bases (LDBs), also called morphological LDBs (MLDBs), as an extension of LDBs [30]; in other words, they aim to select from a dictionary the bases that maximize the dissimilarities among classes. [22] learned the shape of the mother wavelet, since classical wavelets such as the Haar or Daubechies ones may not be optimal for a given discrimination problem. The wavelet coefficients most useful for the discrimination problem are then selected, and features obtained from different wavelet shapes and coefficient selections are combined to learn a large-margin classifier.
In the Cohen distribution design approach, [18] proposed to classify Cohen's group Time-Frequency Representations (TFRs) with a Support Vector Machine (SVM). The main problem is that the classification performance depends on the choice of the TFR and of the SVM kernel. To tackle this, they presented a simple optimization procedure to determine the optimal SVM and TFR kernel parameters. [23] proposed a method for selecting a Cohen class time-frequency distribution appropriate for classification tasks, based on kernel-target alignment [31].

Motivated by its success in image denoising [32, 33] and inpainting [34], dictionary learning was further extended to classification tasks. It consists in finding a linear decomposition of a raw signal, or potentially of its time-frequency representation, using a few atoms of a learned dictionary. While conventional dictionary learning techniques minimize the signal reconstruction error, [24, 35] introduced supervised dictionaries by embedding a logistic loss function to simultaneously learn a classifier, the dictionary D, and the decomposition coefficients of the signals over D. [25] introduced a dictionary learning method that adds a structured incoherence penalty term to learn C dictionaries for C classes while enforcing incoherence, in order to make these dictionaries as different as possible.

In the filter bank approach, [26] designed a method named Discriminative Feature Extraction (DFE) in which both the feature extractor and the classifier are learned with the objective of minimizing the recognition error. The designed feature extractor is a filter bank where each filter's frequency response has a Gaussian form determined by three kinds of parameters (center frequency, bandwidth, and gain factor); the classifier is a prototype-based distance [36]. [19] proposed to build features by designing a data-driven filter bank and pooling the time-frequency representations to provide time-invariant features.
For this purpose, they jointly learn the filters of the filter bank with a support vector machine; the resulting optimization problem boils down to a generalized version of a Multiple Kernel Learning (MKL) problem [37].

It can be seen that the wavelet, Cohen distribution, and filter bank approaches solely seek a suitable time-frequency feature representation for signal classification. Although time-frequency representations have proved efficient for classifying temporal signals (audio, electroencephalography, etc.), there is no guarantee of effectiveness for all types of signals. On the other hand, dictionary learning can be combined with any initial feature representation and hence has the flexibility to deal with signals of different nature. Indeed, supervised dictionary learning can be seen as a task-driven approach for learning discriminative representations.

In this chapter, based on an initial time-frequency representation, the audio signal recognition problem is formulated as a supervised dictionary learning problem. The resulting optimization problem is non-convex and solved using a proximal gradient descent method. In the following, we introduce our representation learning method based on dictionary learning, as well as the experiments performed on both music chord recognition and computational auditory scene recognition databases.

Sparse representation of signals and images has attracted great interest from researchers for analyzing, extracting, or selecting features. A "sparse representation" means that a signal or image can be represented as a linear combination of a few representative elements, called dictionary atoms. The main challenges of sparse representation are the choice of the dictionary over which the signal is represented, and the type of sparsity.
The simplest approach to tackle this problem is to take a predefined dictionary, such as wavelet bases, Gabor atoms, or the Discrete Cosine basis, but there is no guarantee that such predefined dictionaries can represent and extract useful information for the problem at hand.

An alternative approach is to learn the suited set of atoms from the data. From the compressed sensing viewpoint, dictionary learning was originally designed to learn an adaptive codebook that faithfully represents the signals under a sparsity constraint. Dictionary learning has been applied to different applications such as image denoising [32, 38], inpainting [34, 38], clustering [39, 40], and classification [41, 24, 35, 42, 43, 44, 45, 46, 47, 48].

In the following, we review conventional dictionary learning based on a single dictionary and the different approaches to build a supervised dictionary for classification. We also introduce our class-based dictionary learning method.
Let us suppose a dictionary $D \in \mathbb{R}^{M \times K}$ composed of $K$ atoms $\{d_k \in \mathbb{R}^M\}_{k=1}^K$. We seek a sparse representation $a_n \in \mathbb{R}^K$ of a signal $x_n \in \mathbb{R}^M$ over $D$ such that:

$$x_n \approx \sum_{k=1}^{K} a_{nk} d_k = D a_n \qquad (1)$$

Given a set of $N$ signals $\{x_n\}_{n=1}^N$, the coefficients $a_n$ as well as the dictionary $D$ are obtained by solving the following optimization problem:

$$\min_{D, \{a_n\}_{n=1}^N} \; \sum_{n=1}^{N} \|x_n - D a_n\|^2 + \lambda \|a_n\|_1 \quad \text{s.t.} \quad \|d_k\|^2 \le 1, \;\; \forall k = 1, \dots, K \qquad (2)$$

It can be seen that the original formulation of dictionary learning is based on minimizing the reconstruction error between a signal and its sparse representation over the learned dictionary. Although this formulation is well suited to problems such as denoising and inpainting, it may not lead to an optimal solution for classification tasks, where the ultimate goal is to make the learned dictionary and the corresponding sparse representations as discriminative as possible, since it does not take the label information into consideration. This motivated the emergence of supervised dictionary learning techniques [49, 50, 51].

Supervised dictionary learning can be organized into six main groups [52]: learning one dictionary per class, unsupervised dictionary learning followed by supervised pruning, joint dictionary and classifier learning, embedding class labels into the learning of the dictionary, embedding class labels into the learning of the sparse coefficients, and learning a histogram of dictionary elements over signal constituents. In the following we briefly introduce these approaches as well as the main works belonging to them. Note that the advantages and drawbacks of each approach are summarized in Table 3.
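Before turning to the supervised variants, the alternating scheme behind problem (2) — Lasso sparse coding followed by a least-squares dictionary update with atom normalization — can be sketched as follows. This is a minimal illustrative sketch, not the implementation used in this chapter; the function name and its parameters are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

def dictionary_learning(X, K=8, lam=0.1, n_iter=5, seed=0):
    """Sketch of problem (2): alternate a Lasso sparse-coding step and a
    least-squares dictionary update, keeping every atom in the unit ball."""
    M, N = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((M, K))
    D /= np.linalg.norm(D, axis=0)              # enforce ||d_k|| <= 1
    A = np.zeros((K, N))
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=2000)
    for _ in range(n_iter):
        for n in range(N):                      # sparse coding, one signal at a time
            coder.fit(D, X[:, n])
            A[:, n] = coder.coef_
        # Least-squares dictionary update (MOD-style), then projection
        G = A @ A.T + 1e-8 * np.eye(K)
        D = X @ A.T @ np.linalg.inv(G)
        norms = np.linalg.norm(D, axis=0)
        D /= np.maximum(norms, 1.0)             # project atoms onto the unit ball
    return D, A
```

Note that scikit-learn's Lasso rescales the data-fit term by the number of samples, so `lam` here is not numerically identical to the λ of (2); the structure of the alternation is what the sketch illustrates.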
The first and simplest approach is to compute one dictionary per class: using the training samples of each class, a dictionary is constructed, and the overall dictionary is obtained by concatenating the individual class dictionaries. In this framework, [53] proposed the so-called Sparse Representation-based Classification (SRC), where the training samples of each class serve as a dictionary. The sparse representation of a test sample over each dictionary is computed with the Lasso, and the test sample is then assigned to the class whose dictionary yields the minimal residual reconstruction error. [54], instead of using dictionaries made of training samples, proposed to learn a dictionary per class with the conventional approach (2). Although this approach can perform well, the learned dictionaries may capture similar properties for different classes, leading to poor classification performance. To tackle this problem, [25] suggested making the learned dictionaries as different as possible, so that they capture distinct information, by minimizing the pairwise similarity between dictionaries. [55] proposed to learn a dictionary per class to capture class-specific information and a shared dictionary to capture the commonalities. Once the overall dictionary is found, test samples are classified in the same way as with SRC.
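The SRC decision rule — code the test sample over each class dictionary and keep the class with the smallest residual — can be sketched as follows (a hypothetical helper, with the per-class dictionaries given as a list):

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_predict(x, class_dicts, lam=0.01):
    """SRC sketch: Lasso-code x over each class dictionary and return the
    index of the class whose dictionary gives the smallest residual."""
    residuals = []
    for D in class_dicts:                       # one dictionary per class
        coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        coder.fit(D, x)
        residuals.append(np.linalg.norm(x - D @ coder.coef_))
    return int(np.argmin(residuals))
```

For instance, a signal synthesized from the atoms of one class dictionary should yield a much smaller residual over that dictionary than over the dictionaries of the other classes.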
In this approach, a very large dictionary is learned following the conventional approach (2); the dictionary atoms are then merged based on a predefined criterion so as to obtain a reduced, discriminative dictionary. For instance, [56] used the Agglomerative Information Bottleneck (AIB), which iteratively merges the two atoms that cause the smallest decrease in the mutual information between the dictionary atoms and the class labels. In the same context, [57] proposed another method that merges two dictionary atoms so as to minimize the loss of mutual information between the histogram of dictionary atoms and the class labels.
This approach has shown very good performance and represented a big advance in the field. It seeks to jointly learn the dictionary and the classifier. In [24], a linear classifier with a logistic loss function was used. [58] suggested a technique called discriminative K-SVD (DK-SVD), which also jointly learns the classifier parameters and the dictionary. However, instead of solving the optimization problem by alternating between the classifier parameters and the dictionary, a sub-optimal learning process is built upon two main steps: the first learns a conventional dictionary and the sparse representation coefficients of the signals over it; the second uses the resulting sparse coefficients to learn a linear classifier.
In this framework we can cite the approach of [59], which first projects the data into an orthogonal space where the intra- and inter-class reconstruction errors are respectively minimized and maximized, and subsequently learns the dictionary and the sparse representation of the data in this new space. [60] seeks to minimize the information loss incurred by predicting the class labels from a supervised learned dictionary instead of from the original training samples.
This approach seeks to include the class labels in the learning of the coefficients. Supervised coefficient learning is based on minimizing the within-class covariance of the coefficients while maximizing their between-class covariance. [61] learns one dictionary per class simultaneously by decomposing every signal x_n with label y_n over the C dictionaries and enforcing sparsity of the coefficients related to the dictionaries D_j such that y_n ≠ j. A new sample is classified in the same way as in SRC [53].

There are situations where a signal is made of local constituents, e.g., an image is made up of patches or speech is made of phonemes. In this case, a histogram of dictionary atoms learned on the local constituents is computed. The resulting histograms are used to train a classifier and predict the class label of unknown signals.

Table 3: Summary of supervised dictionary learning techniques for data classification [52].

Approach                                  | Advantages (+) / Drawbacks (−)
A. Dictionary per class                   | (+) easy dictionary computation; (−) very large dictionary
B. Prune large dictionaries               | (+) easy dictionary computation; (−) low performance
C. Joint dictionary & classifier learning | (+) good performance; (−) too many parameters
D. Labels in dictionary                   | (+) good performance; (−) complex optimization
E. Labels in coefficients                 | (+) good performance; (−) complex optimization
F. Histograms of dictionary elements      | (+) good performance; (−) only based on local constituents

[62] aggregated small patches over all the images of a class and clustered them with the k-means algorithm; the resulting cluster centers form a dictionary. Although this method gives good results, it does not really include the label information in the learning process, which motivated exploiting the class information to learn dictionaries in a supervised way [63].

Based on this brief study of supervised dictionary approaches, we now introduce a novel supervised dictionary method. Our method exploits the strong points of the previous methods, namely: i) learning one dictionary per class, and ii) embedding the class labels to structure the sparse coefficients. To this end, we encourage dissimilarity between the dictionaries by penalizing their pairwise similarity.
To reach superior discrimination power, we push toward zero the coefficients of a signal's representation over the dictionaries of classes other than its own.
Let us consider $\{(x_n, y_n)\}_{n=1}^N$, where $x_n \in \mathbb{R}^M$ is a signal and $y_n \in \{1, \dots, C\}$ its label. We associate a dictionary $D_c \in \mathbb{R}^{M \times K'}$ with each class $c$. The global dictionary $D = [D_1 \cdots D_C] \in \mathbb{R}^{M \times K}$ is the concatenation of the class-based dictionaries $\{D_c\}_{c=1}^C$. Each dictionary $D_c$ is composed of $K'$ atoms $\{d_{ck} \in \mathbb{R}^M\}_{k=1}^{K'}$; for simplicity's sake, we take $K'$ the same for all classes, so that $K = CK'$. The sparse representation of $x_n$ over the global dictionary $D$ is $a_n^T = [a_{n1}^T \cdots a_{nc}^T \cdots a_{nC}^T]$, where $a_{nc}$ is the sparse representation over the class-specific dictionary $D_c$. The sparse representations of the whole training set $\{x_n\}_{n=1}^N$ are gathered in $A = [a_1 \cdots a_N]$. The dictionary learning problem we address is formulated as follows:

$$\min_{\{D_c\}_{c=1}^C, \{a_n\}_{n=1}^N} J = J_1 + J_2 + \lambda J_3 + \gamma_1 J_4 + \gamma_2 J_5 \quad \text{s.t.} \quad \|d_{ck}\|^2 \le 1, \;\; \forall c = 1, \dots, C, \;\; \forall k = 1, \dots, K' \qquad (3)$$

where, in problem (3),

$$J_1 = \sum_{n=1}^N \|x_n - D a_n\|^2$$

represents the global reconstruction error over the global dictionary $D$;

$$J_2 = \sum_{c=1}^C \sum_{n \,:\, y_n = c} \|x_n - D_c a_{nc}\|^2$$

stands for the class-specific reconstruction error over the dictionary $D_c$; in other words, $J_2$ measures the quality of reconstructing a sample $(x_n, y_n = c)$ over the sole dictionary $D_c$;

$$J_3 = \sum_{n=1}^N \|a_n\|_1$$

is the classical sparsity penalization;

$$J_4 = \sum_{n=1}^N \sum_{c \ne y_n} \|a_{nc}\|^2$$

pushes toward zero the coefficients $a_{nc}$ of the representation of $x_n$ over the non-class-specific dictionaries $D_c$, $c \ne y_n$; and

$$J_5 = \sum_{c=1}^C \sum_{\substack{c'=1 \\ c' \ne c}}^C \|D_c^T D_{c'}\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm, encourages pairwise orthogonality between the different dictionaries.

To sum up, our dictionary learning problem (3) seeks to:
• capture as much information as possible from the signal, by minimizing the global reconstruction error;
• specialize the extracted information per class, by minimizing the class-specific reconstruction error, akin to minimizing intra-class variations;
• make the extracted class-specific information dissimilar, by promoting orthogonality between dictionaries and "zeroing" the coefficients not specific to the sample's label; in other words, we attempt to maximize inter-class variations;
• promote coefficient sparsity, to maintain generalization ability.

λ, γ_1, and γ_2 are regularization parameters controlling, respectively, the sparsity, the structure of the sparse coefficients, and the pairwise orthogonality of the learned dictionaries. We could have associated a regularization parameter with the term J_2, but to avoid multiplying the hyper-parameters we fix it to 1; moreover, conducted experiments show that it has no significant impact on the performance.

Compared to [55], who learn a shared dictionary combined with class-specific ones, we rely only on the latter. Furthermore, their optimization scheme is based on the simplifying assumption that $\sum_{c \ne y_n} \|a_{nc}\| = 0$, which eases the optimization but harms the convergence. Our formulation does not rely on this assumption, and we provide a more general optimization algorithm, described in the next section.

At first sight, the objective function in (3) seems complex, but it can be solved with an alternating optimization scheme involving a sparse coding step and a dictionary optimization step. Indeed, problem (3) is convex in the $D_c$ for fixed coefficients $a_{nc}$, and conversely convex in the coefficients when the $D_c$ are fixed.

In the sparse coding step, we fix $\{D_c\}_{c=1}^C$ and estimate the coefficients $\{a_n\}_{n=1}^N$.
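For concreteness, the five terms of problem (3) can be evaluated directly; below is a minimal numpy sketch (the function name, the shapes, and the block layout of A are illustrative assumptions):

```python
import numpy as np

def objective(X, y, dicts, A, lam, gam1, gam2):
    """Evaluate J = J1 + J2 + lam*J3 + gam1*J4 + gam2*J5 of problem (3).
    X: M x N signals, y: length-N integer labels, dicts: list of C arrays
    (M x K'), A: (C*K') x N codes stacked class-block by class-block."""
    C = len(dicts)
    Kp = dicts[0].shape[1]
    D = np.concatenate(dicts, axis=1)                        # global dictionary
    blocks = [A[c * Kp:(c + 1) * Kp, :] for c in range(C)]   # the a_nc blocks
    J1 = np.sum((X - D @ A) ** 2)                            # global reconstruction
    J2 = sum(np.sum((X[:, y == c] - dicts[c] @ blocks[c][:, y == c]) ** 2)
             for c in range(C))                              # class-specific reconstruction
    J3 = np.abs(A).sum()                                     # l1 sparsity
    J4 = sum(np.sum(blocks[c][:, y != c] ** 2)
             for c in range(C))                              # wrong-class coefficients
    J5 = sum(np.sum((dicts[c].T @ dicts[cp]) ** 2)
             for c in range(C) for cp in range(C) if cp != c)  # pairwise incoherence
    return J1 + J2 + lam * J3 + gam1 * J4 + gam2 * J5
```

As a sanity check, with all codes set to zero and γ_2 = 0, the objective reduces to J_1 + J_2 = 2 Σ_n ||x_n||².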
For each signal $x_n$ of class $y_n$, the related vector $a_n$ decouples in the optimization problem. Let $y_n = c'$; this leads us to solve the following problem:

$$\min_{a_n} \; \|x_n - D a_n\|^2 + \|x_n - D_{c'} a_{nc'}\|^2 + \gamma_1 \left( \|a_n\|^2 - \|a_{nc'}\|^2 \right) + \lambda \|a_n\|_1 \qquad (4)$$

where

$$\|a_n\|^2 = \sum_{c=1}^C \|a_{nc}\|^2 \quad \text{and} \quad \sum_{\substack{c=1 \\ c \ne c'}}^C \|a_{nc}\|^2 = \|a_n\|^2 - \|a_{nc'}\|^2.$$

It can be seen that (4) consists of quadratic error terms and an elastic-net-type penalization. The problem is therefore amenable to a Lasso problem, which can be solved by a classical Lasso solver [64].
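Concretely, (4) reduces to a single Lasso by stacking the extra quadratic terms as additional rows of an augmented design matrix; a scikit-learn sketch (names and shapes are illustrative, and sklearn's `alpha` is rescaled to match the λ of (4)):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code_step(x, dicts, c_true, lam=0.1, gam1=0.5):
    """Solve problem (4) for one signal by rewriting the extra quadratic
    terms as stacked rows of an augmented Lasso system."""
    C = len(dicts)
    Kp = dicts[0].shape[1]
    K = C * Kp
    D = np.concatenate(dicts, axis=1)
    # Rows for ||x - D_{c'} a_{c'}||^2: D with the non-class columns zeroed.
    D_class = np.zeros_like(D)
    D_class[:, c_true * Kp:(c_true + 1) * Kp] = dicts[c_true]
    # Rows for gam1 * sum_{c != c'} ||a_c||^2: a scaled identity acting on
    # the wrong-class coefficients, with zero targets.
    mask = np.ones(K)
    mask[c_true * Kp:(c_true + 1) * Kp] = 0.0
    R = np.sqrt(gam1) * np.diag(mask)
    D_aug = np.vstack([D, D_class, R])
    x_aug = np.concatenate([x, x, np.zeros(K)])
    # sklearn's Lasso minimises (1/(2m))||y - Xw||^2 + alpha ||w||_1,
    # so alpha = lam / (2m) recovers ||y - Xw||^2 + lam ||w||_1.
    m = D_aug.shape[0]
    coder = Lasso(alpha=lam / (2 * m), fit_intercept=False, max_iter=5000)
    coder.fit(D_aug, x_aug)
    return coder.coef_
```

The stacked rows reproduce each quadratic term of (4) exactly, so any Lasso solver applies unchanged.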
Here we describe the estimation of $\{D_p\}_{p=1}^C$ for fixed $\{a_n\}_{n=1}^N$. It can be seen that (3) involves quadratic terms with respect to the dictionaries. The derivative of the objective function with respect to $D_p$ is:

$$\nabla_{D_p} J = \nabla_{D_p} J_1 + \nabla_{D_p} J_2 + \gamma_2 \nabla_{D_p} J_5 \qquad (5)$$

with the involved terms defined below using the matrix derivation formulas [65]. For the first term,

$$J_1 = \sum_{n=1}^N \|x_n - D a_n\|^2 = \sum_{n=1}^N \|\tilde{x}_n - D_p a_{np}\|^2, \qquad \nabla_{D_p} J_1 = \sum_{n=1}^N -2\, \tilde{x}_n a_{np}^T + 2\, D_p a_{np} a_{np}^T \qquad (6)$$

where

$$\tilde{x}_n = x_n - \sum_{\substack{c=1 \\ c \ne p}}^C D_c a_{nc}.$$

For the second term, $\nabla_{D_p} J_2$, we can write

$$J_2 = \sum_{n \,:\, y_n = p} \|x_n - D_p a_{np}\|^2 + \sum_{c \ne p} \sum_{n \,:\, y_n = c} \|x_n - D_c a_{nc}\|^2, \qquad \nabla_{D_p} J_2 = \sum_{n \,:\, y_n = p} -2\, x_n a_{np}^T + 2\, D_p a_{np} a_{np}^T \qquad (7)$$

Finally, the expression of the last term is given by

$$J_5 = 2 \sum_{c \ne p} \|D_p^T D_c\|_F^2 + \sum_{c \ne p} \sum_{\substack{c' \ne c \\ c' \ne p}} \|D_c^T D_{c'}\|_F^2, \qquad \nabla_{D_p} J_5 = 4 \left( \sum_{c \ne p} D_c D_c^T \right) D_p \qquad (8)$$

Algorithm 1 summarizes our optimization approach, which is based on an alternating scheme: the first step is the sparse coding of the signals with the Lasso algorithm; the second step is the dictionary optimization, based on a proximal gradient descent approach. The proximal procedure handles the atom normalization constraint $\|d_{ck}\|^2 \le 1$ of problem (3).
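The proximal operator in question is simply a projection of each atom onto the unit ℓ2 ball: atoms whose norm is at most 1 are left untouched, longer atoms are rescaled. A minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def prox_unit_ball(D):
    """Project each atom (column of D) onto the unit l2 ball: atoms with
    norm <= 1 are unchanged, longer atoms are rescaled to unit norm."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

def dictionary_step(D, G, eta):
    """One gradient step on the dictionary followed by the projection."""
    return prox_unit_ball(D - eta * G)
```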
Algorithm 1: The optimization algorithm
Initialization: D_0, t ← 1, initialize η_0 and α
while t ≤ T do
    Solve A_t ← argmin_A J(D_{t−1}, A) using the Lasso algorithm
    Compute the gradient G_{D_{t−1}} ← ∇_D J(D_{t−1}, A_t) based on equations (5) to (8)
    η ← η_0
    repeat
        D_t ← D_{t−1} − η G_{D_{t−1}}
        D_t ← Prox(D_t), with Prox acting on each atom: d_k ← d_k if ||d_k|| ≤ 1, and d_k / ||d_k|| otherwise
        η ← η × α
    until J(D_t, A_t) < J(D_{t−1}, A_{t−1})
    t ← t + 1
end while

2.5 Classification

Once the dictionaries are learned, they are used to encode both the training and the test samples with the Lasso. The resulting coefficients are used to feed an SVM classifier. Figures 1 to 3 show the processing flow of dictionary learning on the training data, and of the coding of the training and test data over the learned dictionary, respectively.
Figure 1: Processing flow of dictionary learning on the training set.
Figure 2: Processing flow of SVM training over the learned dictionary and training set.
Figure 3: Processing flow of classification over the testing set.

Let $\mathcal{H}$ be the Hilbert space induced by a kernel $k(\cdot, \cdot)$. The decision function of a binary classification problem is $h(a) = h_0(a) + b$, with $h_0 \in \mathcal{H}$ and $b \in \mathbb{R}$, obtained as the solution of [66]:

$$\min_{h_0, b} \; \|h_0\|_{\mathcal{H}}^2 + C_{svm} \sum_{n=1}^N \xi_n \quad \text{s.t.} \quad y_n h(a_n) \ge 1 - \xi_n, \;\; \xi_n \ge 0, \;\; \forall n = 1, \dots, N \qquad (9)$$

where $\{(a_n, y_n) \in \mathcal{A} \times \{-1, +1\}\}_{n=1}^N$ are the labelled training samples, and $\xi_n$ and $C_{svm}$ are the slack variables and the tuning parameter balancing margin and training error. The solution is given by

$$h_0(a) = \sum_{n=1}^N \alpha_n y_n k(a_n, a)$$

where the parameters $\alpha_n$ are the solution of the dual quadratic problem:

$$\max_{\alpha} \; \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{n'=1}^N \alpha_n \alpha_{n'} y_n y_{n'} k(a_n, a_{n'}) \quad \text{s.t.} \quad 0 \le \alpha_n \le C_{svm} \;\; \forall n, \quad \sum_{n=1}^N \alpha_n y_n = 0 \qquad (10)$$

To solve our C-class audio classification problem we employ the one-against-all strategy: we construct C binary SVMs, each separating one class from all the others. The c-th SVM solves the decision problem $h^{(c)}(a) = h_0^{(c)}(a) + b^{(c)}$, with the data from class c taken as positive samples and the remaining training samples as negatives.
Note that in our case we use a simple linear kernel, since the non-linear aspect of the classification problem is already taken into account by the dictionary learning. This is customary in supervised dictionary classification [24, 35].
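The classification stage — Lasso encoding over the learned dictionary, followed by a one-against-all linear SVM — can be sketched with scikit-learn. The data below are random stand-ins for the learned dictionary and the audio features, and `LinearSVC` implements the one-vs-rest strategy internally:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVC

def encode(X, D, lam=0.05):
    """Lasso-encode every column of X over the learned dictionary D."""
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=2000)
    A = np.empty((D.shape[1], X.shape[1]))
    for n in range(X.shape[1]):
        coder.fit(D, X[:, n])
        A[:, n] = coder.coef_
    return A

# Toy stand-ins for the learned dictionary and the training features.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 12))
D /= np.linalg.norm(D, axis=0)
X_train = rng.standard_normal((20, 30))
y_train = rng.integers(0, 3, 30)

# One binary SVM per class (one-against-all) on the sparse codes.
clf = LinearSVC(C=1.0).fit(encode(X_train, D).T, y_train)
pred = clf.predict(encode(X_train, D).T)
```

Test samples are classified the same way: encode them over the same dictionary, then call `clf.predict` on the resulting codes.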
We conduct our experiments on two different audio signal classification problems: Computational Auditory Scene Recognition (CASR) and music chord recognition. For each problem, dictionary learning based on an initial time-frequency representation is compared to conventional hand-crafted features.
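The initial time-frequency representation can be obtained, for instance, with a short-time Fourier spectrogram; a minimal scipy sketch, with synthetic noise standing in for an audio clip (the parameter values are illustrative, not those used in our experiments):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 22050                       # sampling rate of a hypothetical 1 s clip
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)      # white noise standing in for an audio signal

# Frequency-by-frame power matrix whose columns can be fed to the
# dictionary learning stage as the initial representation.
f, t, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=256)
log_S = np.log1p(Sxx)            # log compression, common for audio
```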
In this section we briefly review different approaches to the CASR problem, and evaluate our proposed dictionary learning technique against conventional hand-crafted features on the East Anglia (EA) and LITIS Rouen datasets.

Several categories of audio features have been employed in CASR systems. [67] divided the features into 12 categories, summarized in Table 4.

Table 4: Main audio feature categories for audio scene recognition [67].

Approach                               | Features
Low-level time-based & frequency-based | Zero crossing rate; spectral centroid
Frequency-band energy                  | Magnitude or power spectrum
Auditory filter banks                  | Gammatone filters; Mel-scale filter bank
Cepstral                               | Mel-frequency cepstral coefficients
Spatial                                | Interaural time/level difference
Voicing                                | Tone-fit features
Linear predictive model                | Linear predictive coefficients
Parametric approximation               | Convolution spectrogram and Gabor filters
Feature learning                       | Learned features from MFCCs
Matrix factorization                   | Non-negative matrix factorization; probabilistic latent component
Image processing                       | HOG of time-frequency representation
Event detection                        | Analysis of event occurrences

From the organization in Table 4, we can distinguish four main categories: low-level time/frequency features, frequency-band energy features, features learned from a time-frequency representation, and speech-based features. Among the low-level features, we find simple, easy-to-compute features such as the zero crossing rate [68]. Frequency-band energy features are based on computing the energy in different frequency bands using the Fourier transform [68] or filter banks, such as the Gammatone [69] and Mel-scale filter banks [70], which seek to mimic the response of the human auditory system. The goal of the learning methods is to describe an acoustic signal as a linear combination of elementary functions that capture salient spectral components [71]. Beside the first three categories, speech-based features, and more particularly Mel-Frequency Cepstral Coefficients (MFCCs), represent the most prominent features considered for audio scene recognition.

A considerable amount of work has applied MFCCs to CASR. [72] used a Gaussian Mixture Model (GMM) to estimate the distribution of the MFCC coefficients. [73] combined MFCCs with Hidden Markov Models (HMMs). [79] used Non-negative Matrix Factorization (NMF) with MFCC features. [83] employed MFCC features in a two-stage framework based on GMM and SVM. [71] used a sparse restricted Boltzmann machine to capture relevant MFCC coefficients. [84] extracted a large set of features, including MFCCs, using a short sliding-window approach; an SVM classifies the short segments, and a majority voting scheme yields the whole-sequence decision. [85] applied Recurrence Quantification Analysis (RQA) to the MFCCs to supply additional information on the temporal dynamics of the signal.

Another trend is to extract discriminative features from time-frequency representations.
[86] applied NMF to extract time-frequency patches. [80] used, instead of NMF, temporally-constrained Shift-Invariant Probabilistic Latent Component Analysis (SIPLCA) to extract time-frequency patches from the spectrogram. [87] proposed a method treating time-frequency representations of audio signals as image textures. In the same context, [88] introduced a novel sound event image representation called the Subband Power Distribution (SPD). The SPD captures the distribution of the sound's log-spectral power over time in each subband, such that it can be visualized as a two-dimensional image representation. Recently, [81] proposed to use Histograms of Oriented Gradients (HOG) to extract information from time-frequency representations.
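The decomposition idea behind NMF-based features can be sketched as follows; this is a minimal illustration on a synthetic signal, and the window sizes and component count are arbitrary choices, not those of the cited works:

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.decomposition import NMF

# Synthetic two-tone signal standing in for an audio clip.
fs = 8000
t = np.arange(fs * 2) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# Magnitude/power spectrogram (non-negative, as NMF requires).
f, frames, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)

# Factorize S ~= W @ H: the columns of W are spectral basis vectors
# (the "elementary functions"), the rows of H their activations over time.
model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(S)   # (n_freq_bins, 4)
H = model.components_        # (4, n_frames)

print(W.shape, H.shape)
```

The activations H (or statistics thereof) can then serve as a learned feature representation of the clip.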
Our experiments rely on two representative datasets, described below.
• East Anglia (EA): this dataset provides environmental sounds [89] from 10 different locations: bar, beach, bus, car, football match, launderette, lecture, office, rail station, street. In each location a 4-minute recording at 22.1 kHz has been collected. The 4-minute recordings are split into 8 recordings of 30 seconds, so that in total we have 10 locations (classes), each with 8 examples of 30 seconds.
• LITIS Rouen: this dataset provides environmental sounds [81] recorded in 19 locations. Each location has a different number of 30-second examples, downsampled at 22.5 kHz. Table 5 summarizes the content of the dataset.

Table 5: Summary of the LITIS Rouen audio scene dataset (number of 30-second examples per class).
  plane 23, busy street 143, bus 192, cafe 120, car 243, train station hall 269, kid game hall 145, market 276, metro-paris 139, metro-rouen 249, billiard pool hall 155, quiet street 90, student hall 88, restaurant 133, pedestrian street 122, shop 203, train 164, high-speed train 147, tube station 125
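The EA preprocessing step, cutting a 4-minute recording into eight 30-second examples, amounts to simple reshaping; a sketch with numpy, assuming the recording is already loaded as a 1-D array (22050 Hz is used here as a stand-in for the 22.1 kHz rate quoted above):

```python
import numpy as np

fs = 22050                  # assumed sampling rate
clip_len = 30 * fs          # samples per 30-second example
recording = np.zeros(8 * clip_len)  # stand-in for a loaded 4-minute recording

# Drop any trailing remainder, then cut into equal 30-second clips.
n_clips = recording.size // clip_len
clips = recording[: n_clips * clip_len].reshape(n_clips, clip_len)
print(clips.shape)  # (8, 661500)
```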
In the following we introduce the different features used in our experiments as well as the data partition and protocols.

EA: http://lemur.cmp.uea.ac.uk/Research/noise_db/
LITIS Rouen: https://sites.google.com/site/alainrakotomamonjy/home/audio-scene
Features
Based on an initial time-frequency representation (spectrogram) computed on sliding windows of fixed size and hop, we apply our class-based dictionary learning method introduced in Section 2.3. In order to evaluate the efficiency of our proposed method, we compare its performance to the following conventional features:
• Spectrogram pooling: the temporal pooling of the spectrogram computed on sliding windows of fixed size and hop.
• Bag of MFCC: consists in calculating the MFCC features on short overlapping windows. For each window, cepstra over a set of Mel bands between fixed lower and upper frequencies are computed. The final feature vector is obtained by concatenating the average and standard deviation of the batch of windows.
• Bag of MFCC-D-DD: in addition to the average and standard deviation, the first-order and second-order differences of the MFCCs over the windows are concatenated to the feature vector.
• Texture-based time-frequency representation: consists of extracting features from time-frequency textures [87].
• Recurrence Quantification Analysis (RQA): aims to extract from the MFCCs additional information on temporal dynamics. RQA features are computed for all MFCCs obtained over overlapping windows [85]. Afterwards, the MFCC and RQA features are averaged over time, and the MFCC averages, standard deviations and RQA averages are concatenated to form the final feature vector.
• HOG of time-frequency representation: applies HOG to time-frequency representations transformed into images. The time-frequency representations are calculated with the Constant-Q Transform (CQT). HOG provides information about the occurrence of gradient orientations in the resulting images [81].
More details on how to extract these features can be found in [81]. Note that for classification, a Support Vector Machine (SVM) with linear kernel is applied.
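The bag-of-MFCC pooling can be sketched with numpy, assuming the per-window MFCC matrix has already been computed (e.g. with librosa.feature.mfcc); the 13-coefficient setting is an illustrative choice, not the one used in the experiments:

```python
import numpy as np

def bag_of_mfcc_d_dd(mfcc):
    """Pool a (n_coeffs, n_frames) MFCC matrix into one fixed-size vector:
    mean and std of the coefficients plus those of their first- and
    second-order time differences (the 'D' and 'DD' of MFCC-D-DD)."""
    d = np.diff(mfcc, n=1, axis=1)    # first-order differences over time
    dd = np.diff(mfcc, n=2, axis=1)   # second-order differences
    feats = [m for block in (mfcc, d, dd)
             for m in (block.mean(axis=1), block.std(axis=1))]
    return np.concatenate(feats)

# Toy MFCC matrix: 13 coefficients over 100 analysis windows.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(13, 100))
vec = bag_of_mfcc_d_dd(mfcc)
print(vec.shape)  # (78,) = 13 coeffs x (mean, std) x (static, D, DD)
```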
Protocols and parameters tuning
For the sake of comparison, we have performed the same experiments using the same data repartitions and protocols as in [81]. We have averaged the performances over the different splits of the initial data into training and test sets. The training set represents a fixed percentage of the data, the rest forming the test set. Our proposed dictionary learning technique requires the following parameters:
• λ, γ₁, γ₂, controlling respectively the sparsity, the structure of the sparse coefficients and the pairwise orthogonality of the learned dictionaries. These parameters are selected from a small grid of candidate values.
• K′, the size of each dictionary D_c, also explored over a small grid.
Beyond that, we use a linear SVM classifier whose regularization parameter C_svm is selected among logarithmically scaled values. All these parameters are tuned according to a validation scheme: model selection is performed by repeatedly resampling the training set into learning and validation sets of equal size, and the best parameters are those maximizing the averaged performance on the validation sets. Note that K-SVD [90] has been used to initialize the class-based dictionaries, and the parameter T = 200 together with fixed values of α and η was applied for the optimization scheme (see Section 2.4).

Table 6 reports the performance (classification accuracy) comparison between the different conventional features, as reported in [81], and our class-based dictionary method on the Rouen and EA datasets. Texture denotes the work of [87], while MFCC-D-DD denotes MFCC with derivative features. MFCC, MFCC-RQA, MFCC-900 and MFCC-RQA-900 denote, respectively, MFCC features and MFCC combined with RQA with a cut-off frequency of 10 kHz, and the same two representations with the upper frequency set at 900 Hz. Spectrogram pooling stands for the temporal pooling of the time-frequency spectrogram.
HOG-full and HOG-marginalized represent, respectively, the concatenation of the histograms obtained from the different cells, resulting in a very high-dimensional feature vector, and the concatenation of the histograms averaged over time and frequency.

Table 6: Comparison of performances related to different feature representations on the Rouen and EA audio scene classification datasets. Bold values stand for the best value on each dataset.
  Features      Rouen    EA
  Texture       –        0.57
  MFCC-900      0.60
  (remaining entries illegible in this copy)

It can be seen in Table 6 that HOG-marginalized outperforms all competing features on the Rouen dataset; the temporal pooling of the spectrogram also gives good results and almost reaches those obtained by HOG-marginalized. Surprisingly, the temporal pooling of the spectrogram over all analysis windows helps to estimate the energy variation over time for a raw signal assumed to represent a single scene. Indeed, it has been found that the use of analysis windows improves recognition performance; moreover, the small size of the windows helps to capture the stable characteristics of the signal [91]. Note also that the MFCC+RQA features perform better than the other MFCC-based features; however, the 900 Hz cut-off frequency leads to a large loss in performance.

We can also notice that our proposed dictionary learning method gives very promising results, outperforming texture features and the conventional speech recognition features, MFCC and MFCC-D-DD, which have been widely used in the literature and have shown their ability to tackle audio scene recognition problems.

Figure 4 and Figure 5 show the learned dictionaries per class on the Rouen dataset and the pairwise similarity between them. The idea behind estimating the similarity between the learned dictionaries is to verify the initial goal of learning dissimilar dictionaries able to extract diverse information from the classes for discrimination purposes. It can be seen that there is some similarity between certain learned dictionaries, which could affect the classification accuracy, since these dictionaries tend to provide similar information for different classes. This may be related to the increasing number of classes, which makes enforcing pairwise dictionary dissimilarity hardly feasible.

Figure 4: Example of learned dictionaries per class on the Rouen dataset. Rows correspond to learned dictionary atoms.
Figure 5: Similarity between different learned dictionaries on the Rouen dataset. X-axis and Y-axis stand for the class numbers, organized in the same order as in Table 5.

On the East Anglia dataset, all features, including our proposed dictionary learning, perform well except texture; we should however note a slight advantage for MFCC.
The simplest definition of a chord is a few musical notes played at the same time. In western music, each chord can be characterized by:
• its root or fundamental: the note on which the chord is built,
• its number of notes,
• its type: the interval scheme between the notes.
A music signal can be deemed to be composed of sequences of these different chords. Commonly, the duration of the chords in the sequence varies over time, rendering their recognition difficult. Given a raw audio signal, a chord recognition system attempts to automatically determine the sequence of chords describing the harmonic information.

To recognize chords, most approaches rely on features crafted from a time-frequency representation of the raw signal, the most common and dominant being chroma [2]. Pitch Class Profiles (PCP), or chroma vectors, were introduced by [4]. A chroma vector is a 12-dimensional vector representing the energy within an equal-tempered chromatic scale {C, C♯, D, ..., B}. Chroma has several variations, among them the Harmonic Pitch Class Profiles (HPCPs), which extend the PCPs by estimating the harmonics [92, 93], and the Enhanced Pitch Class Profile (EPCP), which is calculated using the harmonic product spectrum [94]. Chroma vectors were combined with different machine learning models such as Hidden Markov Models and Support Vector Machines [5, 95].

We will focus on third, triad and seventh chords, which are respectively composed of 2, 3 and 4 notes. When a note B has twice the frequency of a note A, the interval [A, B] forms an octave. In tempered occidental music, the smallest subdivision of an octave is the semitone, which corresponds to one twelfth of an octave, that is, a multiplication by the twelfth root of 2 (2^(1/12)) in terms of frequency. To be tertian, i.e. a standard harmony, each interval between consecutive notes in a chord must be composed of 3 or 4 semitones; these intervals are respectively called minor and major.
Thus, for a given root, there are 2 possible thirds, 4 possible triads, and 8 possible sevenths. Table 7 sums up all the possible tertian third, triad and seventh chords. The goal pursued in this work is to guess the type, and not the fundamental, of a chord, leading to 14 possible labels (14 = 2 + 4 + 8). For this purpose, we have created a dataset containing 2156 music chord samples of fixed duration, covering the 14 different classes. Each class contains 154 samples from different instruments at different fundamentals.
Table 7: Different kinds of tertian chords; intervals are in semitones.
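Since a tertian chord stacks intervals of 3 or 4 semitones, the 14 chord types of Table 7 can be enumerated exhaustively; a short sketch (the interval tuples, not chord names, are shown):

```python
from itertools import product

def tertian_chords(n_notes):
    """A tertian chord with n notes stacks n-1 intervals,
    each of 3 (minor third) or 4 (major third) semitones."""
    return list(product((3, 4), repeat=n_notes - 1))

thirds = tertian_chords(2)    # 2 types: (3,) and (4,)
triads = tertian_chords(3)    # 4 types, e.g. (3, 4) = minor, (4, 3) = major
sevenths = tertian_chords(4)  # 8 types

print(len(thirds), len(triads), len(sevenths),
      len(thirds) + len(triads) + len(sevenths))  # 2 4 8 14
```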
In the following we introduce the different features used in our experiments as well as the data partition and protocols.
Features
Similarly to the previous application, we compute an initial time-frequency representation (spectrogram) on sliding windows of size 4096 samples with hops of 32 samples. Then we apply our dictionary learning method. The resulting sparse representations are used as inputs to an SVM. The following conventional features serve as competitors to our approach.
• Spectrogram pooling: the temporal pooling of the spectrogram, as previously.
• Interpolated power spectral density: music notes follow an exponential scale, whereas the Power Spectral Density (PSD), based on the Fourier transform, follows a linear scale. To address this mismatch, the PSD is sampled at the specific frequencies corresponding to 96 notes, leading to an exponential representation more suitable for chord recognition [96].
• Chroma: a 12-dimensional vector in which every component represents the spectral energy of a semitone within the chromatic scale. Chroma vector entries are calculated by summing the spectral density corresponding to the frequencies belonging to the same chroma [2].

Protocols and parameters tuning
We have averaged the performances over 10 different splits of the initial data into training and test sets. The training set represents 2/3 of the data. Model selection is performed by repeatedly resampling the training set into learning and validation sets of equal size. The best parameters are those maximizing the averaged performance on the validation sets. Note that the parameters are chosen from the same ranges used above in the computational auditory scene recognition problem.

Table 8 reports the performance (classification accuracy) of the evaluated features on the music chord dataset. It can be seen that our dictionary learning method outperforms all the other features.

Table 8: Comparison of performances related to different feature representations on the music chord dataset based on a linear SVM. Bold value stands for best performance.
  Features              Music chord
  Chroma                0.19
  (remaining entries illegible in this copy)

Table 9 reports the performance (classification accuracy) of the evaluated features on the music chord dataset based on a polynomial kernel. It can be seen that the interpolated PSD outperforms chroma and spectrogram pooling. It can also be noticed that the polynomial kernel overcomes the linear one in this particular task of chord recognition based on the conventional hand-crafted features.

Table 9: Comparison of performances related to different feature representations on the music chord dataset based on a polynomial kernel. Bold value stands for best performance.
  Features              Music chord
  Chroma                0.70
  Spectrogram pooling   0.72
  (remaining entries illegible in this copy)

Figure 6: Example of learned dictionaries per class on the music chord dataset.

Figure 6 and Figure 7 show the learned dictionaries and the pairwise similarity between them. Contrary to the CASR Rouen dataset, the highest similarity between learned dictionaries lies on the diagonal. This means that the resulting dictionaries differ from one another, leading to the extraction of diverse information per class. While chroma, interpolated PSD and spectrogram pooling totally failed to reach good performances with a linear SVM, our dictionary learning method achieved very promising results.

Figure 7: Similarity between different learned dictionaries on the music chord dataset. X-axis and Y-axis stand for the class numbers.

Linear classification is a computationally efficient way to categorize test samples. It consists in finding a linear separator between two classes. Linear classification has been the focus of much research in machine learning for decades, and the resulting algorithms are well understood. However, many datasets cannot be separated linearly and require complex nonlinear classifiers, which is the case of our music chord dataset.

A popular solution to enjoy the benefits of linear classifiers is to embed the data into a high-dimensional feature space where a linear separator eventually exists. The feature-space mapping is chosen to be nonlinear in order to convert nonlinear relations into linear ones. This nonlinear classification framework is at the heart of the popular kernel-based methods [97]. Despite the popularity of kernel-based classification, its computational complexity at test time strongly depends on the number of training samples [98], which limits its applicability to large-scale datasets.

An eventual alternative to kernel methods is sparse coding, which consists in finding a compact representation of the data in an overcomplete learned dictionary and can be seen as a nonlinear feature mapping. This is confirmed by our experiments, which clearly show that our proposed dictionary learning method outperforms the other hand-crafted features. A success story of automatically learning useful features is represented by deep learning techniques [99, 100], which aim to learn several hierarchical layers, each of which can be seen as a mapping operation akin to the one performed by dictionary learning.
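The gap between linear and kernel-based classification described above can be illustrated on a toy nonlinearly separable problem; the two-moons dataset and the hyperparameters below are illustrative choices, not those of our chord experiments:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable.
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

# Linear SVM vs. inhomogeneous polynomial-kernel SVM of degree 3.
linear = SVC(kernel="linear", C=1.0).fit(Xtr, ytr)
poly = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(Xtr, ytr)

print("linear accuracy:", linear.score(Xte, yte))
print("poly   accuracy:", poly.score(Xte, yte))
```

On such data the polynomial kernel typically recovers the curved boundary that the linear separator cannot, mirroring the behavior observed in Tables 8 and 9.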
We have proposed a novel supervised dictionary learning method for audio signal recognition. The proposed method seeks to minimize the intra-class variations, maximize the inter-class variations, and promote sparsity to control the complexity of the signal decomposition over the dictionary. This is done by learning a dictionary per class, minimizing the class-based reconstruction error and promoting the pairwise orthogonality of the dictionaries. The learned dictionaries are thus expected to provide diverse information per class. The resulting problem is non-convex and is solved using a proximal gradient descent method.

Our proposed method was extensively tested on two different audio recognition applications: computational auditory scene recognition and music chord recognition. The obtained results were compared to different conventional hand-crafted features. While there is no universal hand-crafted feature representation able to successfully tackle different audio recognition problems, our proposed dictionary learning method combined with a simple linear classifier showed very promising results on these two diverse recognition problems.

Despite the simplicity and good performance of our approach, we noticed that the task of making the learned dictionaries as different as possible is hardly feasible when dealing with a large number of classes. An example is human identity recognition based on gait, where each individual is seen as a class.

A possible alternative is to jointly learn the dictionary and the classifier by incorporating a classification cost term. However, this would lead to many parameters to tune, making the approach computationally expensive.
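The per-class objective summarized above (class-wise reconstruction error, sparsity, pairwise orthogonality of the dictionaries) plausibly takes a form along the following lines; this is a schematic restatement of Section 2.3, not the exact formulation, and the weights λ, γ are the trade-off parameters tuned in the experiments:

```latex
\min_{\{D_c, A_c\}_{c=1}^{C}} \;
\sum_{c=1}^{C} \| X_c - D_c A_c \|_F^2
\;+\; \lambda \sum_{c=1}^{C} \| A_c \|_1
\;+\; \gamma \sum_{c \neq c'} \| D_c^{\top} D_{c'} \|_F^2
```

where X_c gathers the training samples of class c, D_c is the dictionary of class c, A_c the corresponding sparse codes, and the last term penalizes pairwise non-orthogonality between the class dictionaries.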
References

[1] Richard F Lyon. Machine hearing: An emerging field [exploratory dsp].
IEEE Signal Processing Magazine, 27(5):131–139, 2010.[2] Laurent Oudre, Yves Grenier, and Cédric Févotte. Template-based chord recognition: Influence of the chord types. In
ISMIR , pages 153–158, 2009.[3] Laurent Oudre, Yves Grenier, and Cédric Févotte. Chord recognition by fitting rescaled chroma vectors to chordtemplates.
IEEE Transactions on Audio, Speech, and Language Processing , 19(7):2222–2233, 2011.[4] Takuya Fujishima. Realtime chord recognition of musical sound: A system using common lisp music. In
Proc.ICMC , volume 1999, pages 464–467, 1999.[5] Alexander Sheh and Daniel PW Ellis. Chord segmentation and recognition using em-trained hidden markovmodels. In
ISMIR , volume 3, pages 183–189, 2003.[6] Matthias Mauch and Simon Dixon. Simultaneous estimation of chords and musical context from audio.
IEEETransactions on Audio, Speech, and Language Processing , 18(6):1280–1289, 2010.[7] Daniel PW Ellis. Classifying music audio with timbral and chroma features. In
ISMIR , volume 7, pages 339–340,2007.[8] Riccardo Miotto and Nicola Orio. A music identification system based on chroma indexing and statisticalmodeling. In
ISMIR , pages 301–306, 2008.[9] Mark A Bartsch and Gregory H Wakefield. Audio thumbnailing of popular music using chroma-based represen-tations.
IEEE Transactions on multimedia , 7(1):96–104, 2005.[10] Steven Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognitionin continuously spoken sentences.
IEEE transactions on acoustics, speech, and signal processing , 28(4):357–366,1980.[11] Tomi Kinnunen, Rahim Saeidi, Filip Sedlák, Kong Aik Lee, Johan Sandberg, Maria Hansson-Sandsten, andHaizhou Li. Low-variance multitaper mfcc features: a case study in robust speaker verification.
IEEE Transactionson Audio, Speech, and Language Processing , 20(7):1990–2001, 2012.[12] Fang Zheng, Guoliang Zhang, and Zhanjiang Song. Comparison of different implementations of mfcc.
Journalof Computer Science and Technology , 16(6):582–589, 2001.[13] Mohamed Benzeghiba, Renato De Mori, Olivier Deroo, Stephane Dupont, Teodora Erbes, Denis Jouvet, LucianoFissore, Pietro Laface, Alfred Mertins, Christophe Ris, et al. Automatic speech recognition and speech variability:A review.
Speech Communication , 49(10):763–786, 2007.[14] Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach. An overview of noise-robust automatic speechrecognition.
IEEE/ACM Transactions on Audio, Speech, and Language Processing , 22(4):745–777, 2014.[15] Imad Rida. Feature extraction for temporal signal recognition: An overview. arXiv preprint arXiv:1812.01780 ,2018.[16] Dan Ellis. Computational auditory scene analysis exploiting speech-recognition knowledge. In
Applications ofSignal Processing to Audio and Acoustics, 1997. 1997 IEEE ASSP Workshop on , pages 4–pp. IEEE, 2004.[17] Vesa Peltonen, Juha Tuomi, Anssi Klapuri, Jyri Huopaniemi, and Timo Sorsa. Computational auditory scenerecognition. In
Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on ,volume 2, pages II–1941. IEEE, 2002.[18] Manuel Davy, Arthur Gretton, Arnaud Doucet, Peter JW Rayner, et al. Optimized support vector machines fornonstationary signal classification.
IEEE Signal Processing Letters , 9(12):442–445, 2002.[19] Maxime Sangnier, Jérôme Gauthier, and Alain Rakotomamonjy. Filter bank learning for signal classification.
Signal Processing , 113:124–137, 2015.[20] Eric Jones, Paul Runkle, Nilanjan Dasgupta, Luise Couchman, and Lawrence Carin. Genetic algorithm waveletdesign for signal classification.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8):890–895, 2001.
[21] Daniel J Strauss, Gabriele Steidl, and Wolfgang Delb. Feature extraction by shape-adapted local discriminant bases.
Signal Processing , 83(2):359–376, 2003.[22] Florian Yger and Alain Rakotomamonjy. Wavelet kernel learning.
Pattern Recognition , 44(10):2614–2629,2011.[23] Paul Honeiné, Cédric Richard, Patrick Flandrin, and J-B Pothin. Optimal selection of time-frequency representa-tions for signal classification: A kernel-target alignment approach. In , volume 3, pages III–III. IEEE, 2006.[24] Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R Bach. Supervised dictionarylearning. In
Advances in neural information processing systems , pages 1033–1040, 2009.[25] Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictionary learningwith structured incoherence and shared features. In
Computer Vision and Pattern Recognition (CVPR), 2010IEEE Conference on , pages 3501–3508. IEEE, 2010.[26] Alain Biem, Shigeru Katagiri, Erik McDermott, and Biing-Hwang Juang. An application of discriminativefeature extraction to filter-bank-based speech recognition.
IEEE Transactions on Speech and Audio Processing ,9(2):96–110, 2001.[27] Ahmed H Tewfik, Deepen Sinha, and Paul Jorgensen. On the optimal choice of a wavelet for signal representation.
IEEE Transactions on information theory , 38(2):747–765, 1992.[28] Roger L Claypoole, Richard G Baraniuk, and Robert D Nowak. Adaptive wavelet transforms via lifting. In
Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on ,volume 3, pages 1513–1516. IEEE, 1998.[29] David E Goldberg and John H Holland. Genetic algorithms and machine learning.
Machine learning , 3(2):95–99,1988.[30] Naoki Saito, Ronald R Coifman, Frank B Geshwind, and Fred Warner. Discriminant feature extraction usingempirical probability density estimation and a local basis library.
Pattern Recognition, 35(12):2841–2852, 2002.[31] N. Cristianini, A. Elisseeff, J. Shawe-Taylor, and J. Kandola. On kernel-target alignment. NIPS, 2002.[32] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries.
Image Processing, IEEE Transactions on , 15(12):3736–3745, 2006.[33] Imad Rida, Somaya Almaadeed, and Ahmed Bouridane. Gait recognition based on modified phase-onlycorrelation.
Signal, Image and Video Processing , 10(3):463–470, 2016.[34] Michael Elad, Mario AT Figueiredo, and Yi Ma. On the role of sparse and redundant representations in imageprocessing.
Proceedings of the IEEE , 98(6):972–982, 2010.[35] Julien Mairal, Francis Bach, and Jean Ponce. Task-driven dictionary learning.
IEEE Transactions on PatternAnalysis and Machine Intelligence , 34(4):791–804, 2012.[36] Erik McDermott and Shigeru Katagiri. Prototype-based minimum classification error/generalized probabilisticdescent training for various speech units.
Computer Speech & Language , 8(4):351–368, 1994.[37] Alain Rakotomamonjy, Francis R Bach, Stéphane Canu, and Yves Grandvalet. Simplemkl.
Journal of MachineLearning Research , 9(Nov):2491–2521, 2008.[38] Julien Mairal, Michael Elad, and Guillermo Sapiro. Sparse representation for color image restoration.
ImageProcessing, IEEE Transactions on , 17(1):53–69, 2008.[39] Bin Cheng, Jianchao Yang, Shuicheng Yan, Yun Fu, and Thomas S Huang. Learning with-graph for imageanalysis.
Image Processing, IEEE Transactions on , 19(4):858–866, 2010.[40] John Wright, Yi Ma, Julien Mairal, Guillermo Sapiro, Thomas S Huang, and Shuicheng Yan. Sparse representa-tion for computer vision and pattern recognition.
Proceedings of the IEEE, 98(6):1031–1044, 2010.[41] David M Bradley and J Andrew Bagnell. Differential sparse coding. 2008.
[42] Imad Rida, Somaya Al-Maadeed, Arif Mahmood, Ahmed Bouridane, and Sambit Bakshi. Palmprint identification using an ensemble of sparse representations.
IEEE Access , 6:3241–3248, 2018.[43] Imad Rida, Xudong Jiang, and Gian Luca Marcialis. Human body part selection by group lasso of motion formodel-free gait recognition.
IEEE Signal Processing Letters , 23(1):154–158, 2015.[44] Somaya Al Maadeed, Xudong Jiang, Imad Rida, and Ahmed Bouridane. Palmprint identification using sparseand dense hybrid representation.
Multimedia Tools and Applications , 78(5):5665–5679, 2019.[45] Imad Rida, Somaya Al Maadeed, Xudong Jiang, Fei Lunke, and Abdelaziz Bensrhair. An ensemble learningmethod based on random subspace sampling for palmprint identification. In , pages 2047–2051. IEEE, 2018.[46] Imad Rida, Romain Herault, Gian Luca Marcialis, and Gilles Gasso. Palmprint recognition with an efficient datadriven ensemble classifier.
Pattern Recognition Letters , 126:21–30, 2019.[47] Imad Rida, Larbi Boubchir, Noor Al-Maadeed, Somaya Al-Maadeed, and Ahmed Bouridane. Robust model-freegait recognition by statistical dependency feature selection and globality-locality preserving projections. In , pages 652–655. IEEE,2016.[48] Imad Rida, Somaya Al Maadeed, and Ahmed Bouridane. Unsupervised feature selection method for improvedhuman gait recognition. In , pages 1128–1132.IEEE, 2015.[49] Imad Rida, Noor Al-Maadeed, Somaya Al-Maadeed, and Sambit Bakshi. A comprehensive overview of featurerepresentation for biometric recognition.
Multimedia Tools and Applications , pages 1–24, 2018.[50] Imad Rida, Noor Al Maadeed, and Somaya Al Maadeed. A novel efficient classwise sparse and collaborativerepresentation for holistic palmprint recognition. In , pages 156–161. IEEE, 2018.[51] Frédéric Dehais, Imad Rida, Raphaëlle N Roy, John Iversen, Tim Mullen, and Daniel Callan. A pbci to predictattentional error before it happens in real flight conditions. In , pages 4155–4160. IEEE, 2019.[52] Mehrdad J Gangeh, Ahmed K Farahat, Ali Ghodsi, and Mohamed S Kamel. Supervised dictionary learning andsparse representation-a review. arXiv preprint arXiv:1502.05928 , 2015.[53] John Wright, Allen Y Yang, Arvind Ganesh, S Shankar Sastry, and Yi Ma. Robust face recognition via sparserepresentation.
IEEE transactions on pattern analysis and machine intelligence , 31(2):210–227, 2009.[54] Meng Yang, Lei Zhang, Jian Yang, and Dejing Zhang. Metaface learning for sparse representation based facerecognition. In
Image Processing (ICIP), 2010 17th IEEE International Conference on , pages 1601–1604. IEEE,2010.[55] Shu Kong and Donghui Wang. A dictionary learning approach for classification: separating the particularity andthe commonality. In
European Conference on Computer Vision , pages 186–199. Springer, 2012.[56] Brian Fulkerson, Andrea Vedaldi, and Stefano Soatto. Localizing objects with smart dictionaries. In
EuropeanConference on Computer Vision , pages 179–192. Springer, 2008.[57] John Winn, Antonio Criminisi, and Thomas Minka. Object categorization by learned universal visual dictionary.In
Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1 , volume 2, pages 1800–1807.IEEE, 2005.[58] Qiang Zhang and Baoxin Li. Discriminative k-svd for dictionary learning in face recognition. In
ComputerVision and Pattern Recognition (CVPR), 2010 IEEE Conference on , pages 2691–2698. IEEE, 2010.[59] Haichao Zhang, Yanning Zhang, and Thomas S Huang. Simultaneous discriminative projection and dictionarylearning for sparse representation based classification.
Pattern Recognition , 46(1):346–354, 2013.[60] Svetlana Lazebnik and Maxim Raginsky. Supervised learning of quantizer codebooks by information lossminimization.
IEEE transactions on pattern analysis and machine intelligence, 31(7):1294–1309, 2009.
[61] Meng Yang, Lei Zhang, Xiangchu Feng, and David Zhang. Fisher discrimination dictionary learning for sparse representation. In
Computer Vision (ICCV), 2011 IEEE International Conference on , pages 543–550. IEEE,2011.[62] Manik Varma and Andrew Zisserman. A statistical approach to material classification using image patchexemplars.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2032–2047, 2009.
[63] Xiao-Chen Lian, Zhiwei Li, Changhu Wang, Bao-Liang Lu, and Lei Zhang. Probabilistic models for supervised dictionary learning. In CVPR, pages 2305–2312, 2010.
[64] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2006.
[65] Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 7:15, 2008.
[66] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[67] Daniele Barchiesi, Dimitrios Giannoulis, Dan Stowell, and Mark D Plumbley. Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Processing Magazine, 32(3):16–34, 2015.
[68] Antti J Eronen, Vesa T Peltonen, Juha T Tuomi, Anssi P Klapuri, Seppo Fagerlund, Timo Sorsa, Gaëtan Lorho, and Jyri Huopaniemi. Audio-based context recognition. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):321–329, 2006.
[69] Nitin Sawhney and Pattie Maes. Situational awareness from environmental sounds. Project Rep. for Pattie Maes, 1997.
[70] Brian Clarkson, Nitin Sawhney, and Alex Pentland. Auditory context awareness via wearable computing. Energy, 400(600):20, 1998.
[71] Kyogu Lee, Ziwon Hyung, and Juhan Nam. Acoustic scene classification using sparse feature learning and event-based pooling. In , pages 1–4. IEEE, 2013.
[72] Jean-Julien Aucouturier, Boris Defreville, and François Pachet. The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music. The Journal of the Acoustical Society of America, 122(2):881–891, 2007.
[73] Ling Ma, Ben Milner, and Dan Smith. Acoustic environment classification. ACM Transactions on Speech and Language Processing (TSLP), 3(2):1–22, 2006.
[74] Robert G Malkin and Alex Waibel. Classifying user environment for mobile applications using linear autoencoding of ambient audio. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), volume 5, pages v–509. IEEE, 2005.
[75] Waldo Nogueira, Gerard Roma, and Perfecto Herrera. Sound scene identification based on MFCC, binaural features and a support vector machine classifier. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2013.
[76] Johannes D Krijnders and GA ten Holt. A tone-fit feature representation for scene classification. Energy [dB], 400(450):500, 2013.
[77] Selina Chu, Shrikanth Narayanan, and C-C Jay Kuo. Environmental sound recognition with time–frequency audio features. IEEE Transactions on Audio, Speech, and Language Processing, 17(6):1142–1158, 2009.
[78] Kailash Patil and Mounya Elhilali. Multiresolution auditory representations for scene classification. Cortex, 87(1):516–527, 2002.
[79] Benjamin Cauchi. Non-negative matrix factorisation applied to auditory scenes classification. Master's thesis, Master ATIAM, Université Pierre et Marie Curie, 2011.
[80] Emmanouil Benetos, Mathieu Lagrange, and Simon Dixon. Characterisation of acoustic scenes using a temporally constrained shift-invariant model. In DAFx, 2012.
[81] Alain Rakotomamonjy and Gilles Gasso. Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 23(1):142–153, 2015.
[82] Toni Heittola, Annamaria Mesaros, Antti Eronen, and Tuomas Virtanen. Audio context recognition using audio event histograms. In 2010 18th European Signal Processing Conference, pages 1272–1276. IEEE, 2010.
[83] Pengfei Hu, Wenju Liu, Wei Jiang, et al. Combining frame and segment based models for environmental sound classification. In INTERSPEECH, pages 2502–2505, 2012.
[84] Jürgen T Geiger, Björn Schuller, and Gerhard Rigoll. Large-scale audio feature extraction and SVM for acoustic scene classification. In , pages 1–4. IEEE, 2013.
[85] Gerard Roma, Waldo Nogueira, and Perfecto Herrera. Recurrence quantification analysis features for environmental sound recognition. In , pages 1–4. IEEE, 2013.
[86] Courtenay V Cotton and Daniel PW Ellis. Spectral vs. spectro-temporal features for acoustic event detection. In , pages 69–72. IEEE, 2011.
[87] Guoshen Yu and Jean-Jacques Slotine. Audio classification from time-frequency texture. arXiv preprint arXiv:0809.4501, 2008.
[88] Jonathan Dennis, Huy Dat Tran, and Eng Siong Chng. Image feature representation of the subband power distribution for robust sound event classification. IEEE Transactions on Audio, Speech, and Language Processing, 21(2):367–377, 2013.
[89] Ling Ma, DJ Smith, and Ben P Milner. Context awareness using environmental noise classification. In INTERSPEECH, 2003.
[90] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
[91] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.
[92] Hélène Papadopoulos and Geoffroy Peeters. Simultaneous estimation of chord progression and downbeats from an audio file. In , pages 121–124. IEEE, 2008.
[93] Hélène Papadopoulos and Geoffroy Peeters. Large-scale study of chord estimation algorithms based on chroma representation and HMM. In , pages 53–60. IEEE, 2007.
[94] Kyogu Lee. Automatic chord recognition from audio using enhanced pitch class profile. In Proc. of the International Computer Music Conference, page 26, 2006.
[95] Adrian Weller, Daniel Ellis, and Tony Jebara. Structured prediction models for chord transcription of music audio. In International Conference on Machine Learning and Applications (ICMLA'09), pages 590–595. IEEE, 2009.
[96] Imad Rida, Romain Herault, and Gilles Gasso. Supervised music chord recognition. In 2014 13th International Conference on Machine Learning and Applications (ICMLA), pages 336–341. IEEE, 2014.
[97] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[98] Christopher JC Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[99] Yoshua Bengio, Aaron Courville, and Pierre Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[100] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.