A COMPARISON OF DEEP LEARNING METHODS FOR ENVIRONMENTAL SOUND DETECTION

Juncheng Li*, Wei Dai*, Florian Metze*, Shuhui Qu, and Samarjit Das
{junchenl, wdai, fmetze}@cs.cmu.edu, [email protected], [email protected]

ABSTRACT
Environmental sound detection is a challenging application of machine learning because of the noisy nature of the signal and the small amount of (labeled) data that is typically available. This work presents a comparison of several state-of-the-art deep learning models on the IEEE DCASE 2016 (Detection and Classification of Acoustic Scenes and Events) challenge task and data, classifying sounds into one of fifteen common indoor and outdoor acoustic scenes, such as bus, cafe, car, city center, forest path, library, and train. In total, 13 hours of stereo audio recordings are available, making this one of the largest datasets available. We perform experiments on six sets of features, including standard Mel-frequency cepstral coefficients (MFCC), binaural MFCC, the log Mel-spectrum, and two different large-scale temporal pooling features extracted using OpenSMILE. On these features, we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Deep Neural Network (CNN), and i-vector. Using a late-fusion approach, we improve on the baseline accuracy of 72.5% by 15.6% in 4-fold cross-validation (CV) average accuracy and by 11% in test accuracy, which matches the best result of the DCASE 2016 challenge. With large feature sets, deep neural network models outperform traditional methods and achieve the best performance among all the studied methods. Consistent with other work, the best performing single model is the non-temporal DNN model, which we take as evidence that sounds in the DCASE challenge do not exhibit strong temporal dynamics.
Index Terms — audio scene classification, DNN, RNN, CNN, i-vectors, late fusion
1. INTRODUCTION
Increasingly, machines in various environments can hear: smartphones, security systems, and autonomous robots. The prospect of human-like sound understanding could open up a wide range of applications, including intelligent machine state monitoring using acoustic information, acoustic surveillance, cataloging and information retrieval applications such as search in audio archives [1], as well as audio-assisted multimedia content search. Compared with speech, environmental sounds are more diverse and span a wider range of frequencies. Moreover, they are often less well defined. Existing works for this task largely use conventional classifiers such as GMMs and SVMs, which lack the feature abstraction capability found in deeper models. Furthermore, conventional models do not capture temporal dynamics. For example, the winning solutions [2][3] for the DCASE 2013 and 2016 challenges extract MFCC and i-vector features without relying on deeper models for temporal relation analysis.

In this work, we focus on the task of acoustic scene identification, which aims to characterize the acoustic environment of an audio stream by selecting a semantic label for it. We apply state-of-the-art deep learning (DL) architectures to various feature representations generated by signal processing methods. Specifically, we use the following architectures: (1) Deep Neural Network (DNN); (2) Recurrent Neural Network (RNN); (3) Convolutional Deep Neural Network (CNN). Additionally, we explore combinations of these models (DNN, RNN, and CNN). We also compare the DL models with the Gaussian mixture model (GMM) and i-vectors. We further use several feature representations based on signal processing methods: Mel-frequency cepstral coefficients (MFCC), the log Mel-spectrum, the spectrogram, and other conventional features such as pitch, energy, zero-crossing rate, and mean-crossing rate. There are several studies using DL in sound event detection [4][5]. However, to the best of our knowledge, this is the first comprehensive study of a diverse set of deep architectures on the acoustic scene recognition task, borrowing ideas from signal processing as well as recent advances in automatic speech recognition. We use the dataset from the DCASE challenge. The dataset contains 15 diverse indoor and outdoor locations (classes), such as bus, cafe, car, city center, forest path, library, and train, totaling 13 hours of audio recordings (see Section 2.1 for details). In this paper, we present a comparison of the most successful and complementary approaches to sound event detection on DCASE, which we implemented on top of our evaluation system [6] in a systematic and consistent way.
2. EXPERIMENTS

2.1. Dataset
We use the dataset from the IEEE challenge on Detection and Classification of Acoustic Scenes and Events [7], and we also use the evaluation setup from the contest. The training dataset contains 15 diverse indoor and outdoor locations (labels), totaling 9.75 hours of recording (1170 files) and 8.7 GB in WAV format (dual channel, sample rate 44100 Hz, precision 24 bit, duration 30 s each). We do 4-fold CV for model selection and parameter tuning. The evaluation dataset (390 files) contains the same classes of audio as the training set, totaling 3.25 hours of recording and 2.5 GB in the same WAV format.

2.2. Features

We create six sets of features using audio signal processing methods:
1. Monaural and Binaural MFCC: Same as the winning solution of the DCASE 2016 challenge [3], we take 23 Mel-frequency cepstral coefficients (excluding the 0th) over a window length of 20 ms. We augment the feature with first- and second-order differences computed over a 60 ms window, resulting in a 61-dimensional vector. We also compute the MFCCs on the right channel, the left channel, and the channel difference (BiMFCC).
2. Smile983 & Smile6k: We use OpenSMILE [8] to generate MFCCs, Fourier transforms, zero-crossing rate, energy, and pitch, among others. We also compute first- and second-order differences, resulting in 6573 features. We select 983 features recommended by domain experts to create the 983-dimensional feature. Note that this is a much larger feature set than the MFCC features, and each feature represents a longer time window of 100 ms.
3. LogMel: We use LibROSA [9] to compute the log Mel-spectrum, with the same parameters as the MFCC setup. This is the Mel log power spectrum before the discrete cosine transform step of the MFCC computation. We take 60 and 200 Mel frequencies, resulting in 60-dim and 200-dim LogMel features.

All features are standardized to have zero mean and unit variance on the training set. The same standardization is applied at validation and test time (a minimal sketch of this feature pipeline appears below).
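As a concrete illustration, the following sketch extracts LogMel and MFCC-with-deltas features for one clip using librosa and standardizes them; the hop size, delta window width, and resulting dimensionality are illustrative assumptions rather than our exact configuration, and the BiMFCC variant simply repeats the MFCC computation on the left, right, and difference channels.

```python
import numpy as np
import librosa

def logmel_and_mfcc(wav_path, sr=44100, n_mels=60, n_mfcc=23):
    """Sketch: log Mel-spectrum and MFCCs with delta features for one clip."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)  # mono; repeat per channel for BiMFCC
    n_fft = int(0.020 * sr)   # 20 ms analysis window
    hop = int(0.010 * sr)     # assumed 10 ms hop
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                             # (n_mels, frames)
    mfcc = librosa.feature.mfcc(S=logmel, n_mfcc=n_mfcc + 1)[1:]  # drop the 0th coefficient
    d1 = librosa.feature.delta(mfcc, width=5, order=1)            # first-order differences
    d2 = librosa.feature.delta(mfcc, width=5, order=2)            # second-order differences
    feats = np.vstack([mfcc, d1, d2]).T                           # (frames, feature_dim)
    return logmel.T, feats

def standardize(train_feats, other_feats):
    """Zero-mean, unit-variance scaling fit on the training set only."""
    mu, sigma = train_feats.mean(axis=0), train_feats.std(axis=0) + 1e-8
    return (train_feats - mu) / sigma, (other_feats - mu) / sigma
```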
We use the GMMs provided by the DCASE challenge committee [7] as the baseline system for acoustic scene recognition. Each audio clip is represented as a bag of acoustic features extracted from audio segments, and for each class label, a GMM is trained on this bag of acoustic features using only audio clips from that class.
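A minimal sketch of this baseline, using scikit-learn's GaussianMixture as a stand-in for the challenge committee's implementation; the component count and covariance type here are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(features_by_class, n_components=16):
    """Fit one GMM per scene class on the pooled frame-level features."""
    gmms = {}
    for label, frames in features_by_class.items():   # frames: (n_frames, dim) array
        gmms[label] = GaussianMixture(n_components=n_components,
                                      covariance_type='diag').fit(frames)
    return gmms

def classify_clip(gmms, clip_frames):
    """Sum per-frame log-likelihoods and pick the highest-scoring class."""
    scores = {label: gmm.score_samples(clip_frames).sum()
              for label, gmm in gmms.items()}
    return max(scores, key=scores.get)
```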
We replicate the i-vector [10] pipeline from [3]. The universal background model (UBM) is a GMM with 256 components trained on the development dataset using the BiMFCC feature. The mean supervector M of the GMM can be decomposed as M = m + T · y, where m is an audio-scene-independent vector and T · y is an offset. The low-dimensional (400-dim) subspace vector y is an audio-scene-dependent latent variable with a normal prior. The i-vector w is the maximum a posteriori (MAP) estimate of y. We use the Kaldi Toolkit [11] to compute the T matrix and perform Linear Discriminant Analysis (LDA).

Multi-layer perceptrons have recently been successfully applied to speech recognition and audio analysis and show superior performance compared to GMMs [12]. Here we tried various sets of hyperparameters, including depth (2-10 layers), number of hidden units (256-1024), dropout rate (0-0.4), regularizer (L1, L2), various optimization algorithms (stochastic gradient descent, Adam [13], RMSProp [14], Adagrad [15]), batch normalization [16], etc. All the deep models described in the next two sections are tuned via cross-validation (CV) to achieve their best performance.
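For illustration, a minimal Keras sketch of one point in this hyperparameter grid (5 dense layers of 256 units, dropout 0.2, batch normalization, Adam); these specific values are assumptions, since the best configuration is selected by CV for each feature set.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

def build_dnn(input_dim, n_classes=15, n_layers=5, units=256, dropout=0.2):
    """Fully connected DNN over a single (non-temporal) feature vector."""
    model = Sequential()
    model.add(Dense(units, activation='relu', input_shape=(input_dim,)))
    for _ in range(n_layers - 1):
        model.add(BatchNormalization())
        model.add(Dropout(dropout))
        model.add(Dense(units, activation='relu'))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```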
Bidirectional architectures generally perform better than their unidirectional counterparts. We tried both LSTM [17] and GRU [18] bidirectional layers. Our network has only 2 layers (one per direction) due to convergence time and the limited improvement from deeper RNN models [19].
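A minimal Keras sketch of such a bidirectional GRU classifier; the sequence length and the 512 units per direction (matching the model analyzed in the Discussion) are assumptions where not otherwise stated.

```python
from keras.models import Sequential
from keras.layers import GRU, Bidirectional, Dense

def build_rnn(timesteps, feat_dim, n_classes=15, units=512):
    """Bidirectional GRU (one forward and one backward layer) over a feature sequence."""
    model = Sequential()
    model.add(Bidirectional(GRU(units, return_sequences=False),
                            input_shape=(timesteps, feat_dim)))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```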
Lately, CNNs have been applied to speech recognition using spectrogram features [20] and achieve state-of-the-art speech recognition performance. We employ architectures similar to the VGG net [21] to keep the number of model parameters small. The activation function we use is the popular rectified linear unit (ReLU), which models non-linearity. We also found that dense layers at the bottom do not help but only slow down computation, so we do not include them in most experiments. Dropout layers significantly improve performance, which is consistent with CNN behavior on natural images. Overall, CNNs take significantly longer to train than RNNs and DNNs due to the convolutional layers. Table 1 shows an example of the architectures of all the DL models described above.
Table 1: Model specifications. The DNN column consists of stacked Dense-256 layers, the RNN column of GRU-256 layers, and the CNN column of 32-filter convolutional layers; the input dimension depends on the feature set. BN: Batch Normalization; ReLU: Rectified Linear Activation Function.
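As an illustration of the VGG-style design described above, a minimal Keras sketch with stacks of 3x3 convolutions, 32 filters in the first block, dropout, and no dense layers before the final softmax; the exact filter counts, pooling sizes, and input patch size are assumptions.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

def build_cnn(n_mels=60, frames=200, n_classes=15):
    """Small VGG-like CNN over a single-channel log Mel-spectrogram patch."""
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', padding='same',
                     input_shape=(n_mels, frames, 1)))
    model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```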
For each audio clip (train and test), our processing pipeline consists of the following steps: 1) apply the various transforms (Section 2.2) to each audio clip to extract the feature representations; 2) for non-temporal models such as GMMs, treat each feature vector as a training example, while for temporal models such as RNNs, a sequence of feature vectors forms one training example; 3) at test time, apply the same pipeline as in training and break the audio clip into multiple instances; the likelihood of a class label for a test audio clip is the sum of the predicted class likelihoods of its segments, and the class with the highest predicted likelihood is the predicted label for the test audio clip. We train our deep learning models with the Keras library [4] built on Theano [22] and TensorFlow, using 4 Titan X GPUs on an Intel Core i7 node with 128 GB of memory.
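The segment-level aggregation in step 3 can be summarized by the following sketch, assuming a trained model that exposes a Keras-style predict method returning per-class likelihoods for each segment.

```python
import numpy as np

def predict_clip(model, segments):
    """segments: array of feature instances extracted from one audio clip.
    The clip-level score of each class is the sum of the per-segment
    predicted likelihoods; the argmax gives the predicted scene label."""
    per_segment = model.predict(np.asarray(segments))  # (n_segments, n_classes)
    clip_scores = per_segment.sum(axis=0)
    return int(np.argmax(clip_scores))
```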
Finally, we ensemble all the models mentioned above. In total, we have thirty models for the problem across five different architectures. We rank the models by performance, and only the best-performing models that pass a predefined accuracy threshold are included in the fusion; for example, the baseline GMM is excluded due to its poor performance. To further stabilize the result, we construct ensembles of the ensembles. We test with random forests, extremely randomized trees, AdaBoost, gradient tree boosting, weighted average probabilities, and other model selection methods [23] in the late fusion.
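A sketch of the simplest of these fusion rules, a weighted average of clip-level class probabilities over the models that pass the accuracy threshold; the weighting scheme and threshold value are assumptions, and the tree-based ensemble selection methods above are alternatives to this averaging step.

```python
import numpy as np

def late_fusion(clip_probs, cv_accuracies, threshold=0.75):
    """clip_probs: dict model_name -> (n_clips, n_classes) probabilities.
    cv_accuracies: dict model_name -> mean CV accuracy, used as the weight.
    Models below the accuracy threshold (e.g. the GMM baseline) are dropped."""
    kept = [m for m, acc in cv_accuracies.items() if acc >= threshold]
    weights = np.array([cv_accuracies[m] for m in kept])
    weights = weights / weights.sum()
    fused = sum(w * clip_probs[m] for w, m in zip(weights, kept))
    return fused.argmax(axis=1)   # predicted class index per clip
```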
3. RESULTS
Figure 1 shows the cross-validation (CV) accuracy for 5 classifiers over 6 features; the 60-dim and 200-dim LogMel results are listed in a single column. GMM with the MFCC feature is the official baseline provided in the DCASE challenge, achieving a mean CV accuracy of 72.5%, while our best performing single model (DNN with the Smile6k features) achieves a mean CV accuracy of 84.2% and a test accuracy of 84.1%. The best late fusion model has an 88.1% mean CV accuracy and 88.2% test accuracy, which is competitive with the winning solution of the DCASE challenge [3].
Figure 1: 4-fold CV average accuracy of the GMM, i-vector, DNN, RNN, CNN, and late fusion models on the MFCC, BiMFCC, Smile983, Smile6k, and LogMel features.
Table 2: Class-wise accuracy (%) of the best CV-average models (GMM, i-vector, DNN, RNN, CNN, fusion) for the 15 scene classes (beach, bus, cafe/restaurant, car, city center, forest path, grocery store, home, library, metro station, office, park, residential area, train, tram) and their average. Colored rows correspond to the most challenging classes in the confusion matrix from [6].
4. DISCUSSION
Figure 1 shows that the feature representation is critical for classifier performance. For neural network models (RNNs, DNNs), a larger set of features extracted from the signal processing pipeline improves performance. Among the neural network models, it is interesting to note that RNNs and CNNs outperform DNNs using the MFCC, BiMFCC, and Smile983 features, but DNNs outperform RNNs and CNNs on the Smile6k feature. It is possible that with a limited feature representation (e.g., MFCC and BiMFCC), modeling temporally adjacent pieces enhances the local feature representation and thus improves the performance of temporal models like RNNs. However, with a sufficiently expressive feature (e.g., Smile6k), temporal modeling becomes less important, and it becomes more effective to model local dynamics rather than long-range dependencies. Unlike speech, which has long-range dependencies (a sentence utterance can span 6-20 seconds), environmental sounds generally lack a coherent context, as events in the environment occur more or less randomly from the listener's perspective. A human listener of environmental noise is unlikely to be able to predict what sound will occur next, in contrast to speech.

Table 2 shows that most locations are relatively easy to identify, except for a few classes that are difficult to distinguish, such as park and residential area, or train and tram. We can also see that the various models perform differently across classes, and thus late fusion tends to compensate for individual models' errors, leading to improved overall performance.
The performance of the non-neural-network models, particularly the GMMs, suffers from the curse of dimensionality: in a high-dimensional space, the volume grows exponentially while the amount of available data stays constant, leading to highly sparse samples. In spite of being GMM-based, this issue is less prominent for the i-vector approach, since its factor analysis procedure keeps the dimensionality low. The i-vector pipeline performs best among all the models using the BiMFCC feature, and we observe that it outperforms the DL models on low-dimensional features. We also observe that the i-vector pipeline tends to do better on noisier classes such as train and tram, while suffering on relatively quiet classes such as home and residential area.

Figure 2: DNN first layer after the input: (a) weight after FFT, (b) weight after smoothing.
Deep classifiers are able to learn a more abstract representation of the feature. Figure 2 shows the first layer of a fully connected DNN model. Here, the feature is BiMFCC (61-dim), and the DNN model has 5 dense layers, each with 256 hidden units. Figure 2(a) shows the FFT of the weights of the first layer, indicating the responsiveness of the 256 corresponding hidden units. We note that the DNN's neurons are more active in the MFCC range (dimensions 0-23) and less active in the delta (24-41) and double-delta (42-61) dimensions. If we apply a Savitzky-Golay smoothing function [24], which acts like a low-pass filter, to each neuron's weight vector (61-dim), we obtain Figure 2(b), the de-noised weights of the layer (each colored line corresponds to one neuron's vector), which looks like a filter bank. The chaotic responses of the DNN neurons also demonstrate that the DNN is not capable of capturing temporal information in the feature.
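This analysis can be reproduced along the following lines; the sketch assumes a trained Keras model dnn_model whose first Dense layer holds a (61, 256) weight matrix, and the Savitzky-Golay window length and polynomial order are illustrative choices.

```python
import numpy as np
from scipy.signal import savgol_filter

# W: (61, 256) weight matrix of the first dense layer (one column per neuron)
W = dnn_model.layers[0].get_weights()[0]

# Figure 2(a)-style view: magnitude spectrum of each neuron's 61-dim weight vector
W_fft = np.abs(np.fft.rfft(W, axis=0))

# Figure 2(b)-style view: Savitzky-Golay (low-pass) smoothing of each weight
# vector, which reveals the filter-bank-like shape of the learned weights
W_smooth = savgol_filter(W, window_length=11, polyorder=3, axis=0)
```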
Figure 3: RNN neuron (512-dim) activation.
Figure 4: BiMFCC (61-dim) over 100 frames.
Our RNN model consists of 2 bidirectional GRU layers, each with 512 hidden units. Figure 3 shows the neuron activation of the forward layer of the bidirectional GRU network over 100 frames, and Figure 4 shows the corresponding input feature (BiMFCC). For a train audio clip, it shows that the RNN neurons are stable across the time domain as long as there is no variation in the feature over time. This sheds light on why our RNN performs better on relatively monotonous audio scenes such as train and tram than on event-rich audio scenes like park and residential area. There could be a potential gain from incorporating an attention-based RNN [25] here to tackle those event-rich audio scenes based on audio events.

Figure 5: CNN first convolutional (Conv2D) layer: (a) log Mel-spectrum input, (b) weight after FFT.
Figure 5(a) shows the input to the CNN, which is a log Mel energy spectrum (60-dim). Figure 5(b) shows the weights of the first convolutional layer (32 convolutional filters) after FFT, which strongly resembles a bank of bandpass filters. We notice a sharp transition in the filters around the 40th Mel band, due to the weak energy beyond the 40th Mel band visible in Figure 5(a). Our finding is consistent with prior work on speech data [26]; the filter bank we learn is relatively wider than those learned for speech.
5. CONCLUSION
We find that deep learning models compare favorably with traditional pipelines (GMM and i-vector). Specifically, GMMs with the MFCC feature, the baseline model provided by the DCASE contest, achieve 77.2% test accuracy, while the best performing model (hierarchical DNN with the Smile6k feature) reaches 88.2% test accuracy. RNNs and CNNs generally have performance in the range of 73-82%. Fusing the temporally specialized models (e.g., CNNs, RNNs) with resolution-specialized models (DNNs, i-vector) improves the overall performance significantly. We train the classifiers independently first to maximize model diversity, and then fuse these models for the best performance. We find that no single model outperforms all other models across all feature sets, showing that model performance can vary significantly with the feature representation. The fact that the best performing model is the non-temporal DNN model is evidence that environmental (or "scene") sounds do not necessarily exhibit strong temporal dynamics. This is consistent with our day-to-day experience that environmental sounds tend to be random and unpredictable.

6. REFERENCES
[1] R. Ranft, "Natural sound archives: past, present and future," Anais da Academia Brasileira de Ciências, vol. 76, no. 2, 2004.
[2] G. Roma, W. Nogueira, P. Herrera, and R. de Boronat, "Recurrence quantification analysis features for auditory scene classification," DCASE Challenge, Tech. Rep., 2013.
[3] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, "CP-JKU submissions for DCASE-2016: a hybrid approach using binaural i-vectors and deep convolutional neural networks," Tech. Rep., DCASE2016 Challenge, September 2016.
[4] E. Cakir et al., "Polyphonic sound event detection using multi label deep neural networks," in 2015 International Joint Conference on Neural Networks (IJCNN), 2015.
[5] A. Mesaros et al., "Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations," in ICASSP, IEEE, 2015.
[6] W. Dai, J. Li, P. Pham, S. Das, and S. Qu, "Acoustic scene recognition with deep neural networks (DCASE challenge 2016)," Tech. Rep., DCASE2016 Challenge, September 2016.
[7] T. Heittola, A. Mesaros, and T. Virtanen, "DCASE2016 baseline system," Tech. Rep., DCASE2016 Challenge, September 2016.
[8] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: The Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia (MM '10), New York, NY, USA, 2010, pp. 1459-1462.
[9] B. McFee, M. McVicar, C. Raffel, D. Liang, O. Nieto, E. Battenberg, J. Moore, D. Ellis, R. Yamamoto, R. Bittner, D. Repetto, P. Viktorin, J. F. Santos, and A. Holovaty, "librosa: 0.4.1," Oct. 2015.
[10] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345-354, 2005.
[11] D. Povey, A. Ghoshal, G. Boulianne, N. Goel, M. Hannemann, Y. Qian, P. Schwarz, and G. Stemmer, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.
[12] A. Graves, A. Mohamed, and G. E. Hinton, "Speech recognition with deep recurrent neural networks," CoRR, vol. abs/1303.5778, 2013.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.
[14] Y. N. Dauphin, H. de Vries, and Y. Bengio, "Equilibrated adaptive learning rates for non-convex optimization," in NIPS'15, Cambridge, MA, USA, 2015, pp. 1504-1512, MIT Press.
[15] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.
[17] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.
[18] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[19] O. Irsoy and C. Cardie, "Opinion mining with deep recurrent neural networks," in EMNLP, 2014.
[20] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[22] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[23] R. Caruana et al., "Ensemble selection from libraries of models," in Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004.
[24] A. Savitzky and M. J. E. Golay, "Smoothing and differentiation of data by simplified least squares procedures," Analytical Chemistry, vol. 36, no. 8, pp. 1627-1639, July 1964.
[25] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: first results," CoRR, vol. abs/1412.1602, 2014.
[26] P. Golik, Z. Tüske, R. Schlüter, and H. Ney, "Convolutional neural networks for acoustic modeling of raw time signal in LVCSR," in Interspeech, 2015.