Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification
Zhichao Zhang, Shugong Xu*, Tianhao Qiao, Shunqing Zhang, and Shan Cao
Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, 200444, China
{zhichaozhang, shugong, qiaotianhao, shunqing, cshan}@shu.edu.cn

* Corresponding author: Shugong Xu, Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, China (email: shugong@shu.edu.cn).

Abstract.
Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. ESC performance depends heavily on the effectiveness of the representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model that focuses on the semantically relevant and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend this convolutional RNN with a frame-level attention mechanism to learn discriminative feature representations for ESC. Experiments conducted on the ESC-50 and ESC-10 datasets demonstrate the effectiveness of the proposed method, which achieves state-of-the-art classification accuracy.
Keywords:
Environmental Sound Classification · Convolutional Recurrent Neural Network · Attention Mechanism
1 Introduction

Environmental sound classification (ESC) is an important branch of sound recognition and is widely applied in surveillance [17], home automation [22], scene analysis [4] and machine hearing [13].

Thus far, a variety of signal processing and machine learning techniques have been applied to ESC, including dictionary learning [7], matrix factorization [5], Gaussian mixture models (GMMs) [8] and, recently, deep neural networks [19, 27]. For traditional machine learning classifiers, selecting proper features is key to effective performance. For instance, audio signals have traditionally been characterized by Mel-frequency cepstral coefficient (MFCC) features and classified with a GMM classifier.

In recent years, deep neural networks (DNNs) have shown outstanding performance in feature extraction for ESC.
Compared to hand-crafted features, DNNs have the ability to extract discriminative feature representations from large quantities of training data and generalize well to unseen data. McLoughlin et al. [14] proposed a deep belief network to extract high-level feature representations from the magnitude spectrum, which yielded better results than traditional methods. Piczak [15] first evaluated the potential of convolutional neural networks (CNNs) for classifying short audio clips of environmental sounds and showed excellent performance on several public datasets. Takahashi et al. [20] created a three-channel input for a CNN by combining the log mel spectrogram with its delta and delta-delta information, in a manner similar to the RGB channels of an image. To model the sequential dynamics of environmental sound signals, Vu et al. [24] applied a recurrent neural network (RNN) to learn temporal relationships. Moreover, there is a growing trend to combine CNN and RNN models into a single architecture. Bae et al. [2] proposed to train an RNN and a CNN in parallel in order to learn sequential correlations and local spectro-temporal information.

In addition, attention mechanism-based models have shown outstanding performance in learning relevant feature representations for sequence data [6]. Recently, attention-based RNNs have been successfully applied to a wide variety of tasks, including speech recognition [6], machine translation [3] and document classification [25]. In principle, attention-based RNNs are well suited to ESC tasks. First, environmental sound is essentially sequence data that contains correlation information between adjacent frames. Second, not all frame-level features contribute equally to the representations of environmental sounds. In public ESC datasets, signals usually contain many periods of silence, with only a few intermittent frames associated with the sound class. Thus, it is important to select the semantically relevant frames for a specific class. Similar to attention-based RNNs, we can also compute a frame-level attention map from CNN features to focus on the semantically relevant frames. In the field of ESC, several works [9, 11, 12, 18] have studied the effectiveness of attention mechanisms and obtained promising results on several datasets. Different from previous works, we explore the performance of a frame-level attention mechanism for both CNN layers and RNN layers.

In this paper, we propose an attention mechanism-based convolutional RNN architecture (ACRNN) that focuses on semantically relevant frames and produces discriminative features for ESC. The main contributions of this paper are summarized as follows.

– To deal with silent frames and semantically irrelevant frames, we employ an attention model that automatically focuses on the semantically relevant frames and produces discriminative features for ESC. We explore the frame-level attention mechanism for both CNN layers and RNN layers.

– To analyze temporal relations, we propose a novel convolutional RNN model that first uses a CNN to extract high-level feature representations and then feeds these features into bidirectional GRUs. We combine the convolutional RNN and the attention model in a unified architecture.
– To demonstrate the effectiveness of the proposed method, we conduct experiments on the ESC-10 and ESC-50 datasets and achieve current state-of-the-art performance.
Fig. 1: Architecture of the convolutional recurrent neural network for environmental sound classification: a log Gammatone spectrogram (static + delta channels) is processed by eight convolutional layers with max pooling (Conv1-Conv8), two bidirectional GRU layers and a fully connected layer that outputs the prediction probability distribution (e.g., dog 0.7, siren 0.2, rain 0.1).
2 Proposed Method

In this section, we introduce the proposed method for ESC. First, we generate log Gammatone spectrogram (Log-GT) features from the environmental sounds as the input of ACRNN, as shown in Fig. 1. Then, we introduce the architecture of ACRNN, which combines a convolutional RNN with a frame-level attention mechanism; we describe the convolutional RNN and the attention mechanism in detail, respectively. Finally, we introduce the data augmentation methods we use.

2.1 Log Gammatone Spectrogram Extraction

Given a signal, we first use the short-time Fourier transform (STFT) with a 23 ms Hamming window (1024 samples at 44.1 kHz) and 50% overlap to extract the energy spectrogram. Then, we apply a 128-band Gammatone filter bank [23] to the energy spectrogram and convert the resulting spectrogram to a logarithmic scale. To make efficient use of the limited data, the spectrogram is split into segments of 128 frames (approximately 1.5 s in length) with 50% overlap. We also calculate the delta information of the original spectrogram, i.e., the first temporal derivative of the static spectrogram. Afterwards, we concatenate the log Gammatone spectrogram and its delta information into a 3-D feature representation X ∈ R^(128×128×2) (Log-GT), which serves as the input of the network.
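The following Python sketch illustrates this pipeline under stated assumptions. It is a reconstruction, not the authors' code: `gammatone_filterbank` is a hypothetical helper standing in for a real Gammatone filterbank (e.g., one built with the `gammatone` toolbox), while librosa supplies the STFT and delta computations.

```python
import numpy as np
import librosa


def log_gammatone_features(path, sr=44100, n_fft=1024, n_bands=128):
    """Sketch of the Log-GT pipeline described above (not the authors' code)."""
    y, _ = librosa.load(path, sr=sr)
    # Energy spectrogram: 23 ms Hamming window (1024 samples), 50% overlap.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 2,
                               window='hamming')) ** 2
    # Hypothetical helper returning an (n_bands, 1 + n_fft // 2) weight matrix.
    fb = gammatone_filterbank(sr=sr, n_fft=n_fft, n_bands=n_bands)
    gt = np.log(fb @ spec + 1e-10)        # 128-band log Gammatone spectrogram
    delta = librosa.feature.delta(gt)     # first temporal derivative
    # Split into 128-frame segments (~1.5 s) with 50% overlap and stack
    # static + delta channels into (128, 128, 2) Log-GT inputs.
    segments = []
    for start in range(0, gt.shape[1] - 128 + 1, 64):
        segments.append(np.stack([gt[:, start:start + 128],
                                  delta[:, start:start + 128]], axis=-1))
    return np.array(segments)             # (n_segments, 128, 128, 2)
```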
2.2 Architecture of Convolutional RNN

In this section, we propose a convolutional RNN to analyze the Log-GTs for ESC. We first use a CNN to learn high-level feature representations from the Log-GTs. Then, the CNN-learned features are fed into bidirectional gated recurrent unit (GRU) layers, which learn the temporal correlation information. Finally, these features are fed into a fully connected layer with a softmax activation function that outputs the probability distribution over the classes. The convolutional RNN comprises eight convolutional layers (l1-l8) and two bidirectional GRU layers (l9-l10). The architecture and parameters of the network are as follows (a sketch of the network follows the list):

– l1-l2: The first two stacked convolutional layers use 32 filters with a receptive field of (3,5) and a stride of (1,1), followed by max pooling with a (4,3) stride to reduce the dimensions of the feature maps. The ReLU activation function is used.

– l3-l4: The next two convolutional layers use 64 filters with a receptive field of (3,1) and a stride of (1,1), learning local patterns along the frequency dimension. They are followed by max pooling with a (4,1) stride. The ReLU activation function is used.

– l5-l6: The following pair of convolutional layers uses 128 filters with a receptive field of (1,5) and a stride of (1,1), learning local patterns along the time dimension. They are followed by max pooling with a (1,3) stride. The ReLU activation function is used.

– l7-l8: The subsequent two convolutional layers use 256 filters with a receptive field of (3,3) and a stride of (1,1) to learn joint time-frequency characteristics. They are followed by max pooling with a (2,2) stride. The ReLU activation function is used.

– l9-l10: Two bidirectional GRU layers with 256 cells each are used for temporal summarization, with the tanh activation function. Dropout is applied to the GRU layers.
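A compact Keras/TensorFlow sketch of this network, reconstructed from the list above, is given below. Details the paper does not specify (padding mode, pooling windows matching the stated strides, and the omitted dropout rate) are assumptions.

```python
from tensorflow.keras import layers, models


def build_conv_rnn(n_classes=50, input_shape=(128, 128, 2)):
    """Reconstruction of the convolutional RNN (l1-l10) described above."""
    def conv_block(x, filters, kernel, pool):
        # Two stacked conv layers with ReLU, followed by max pooling.
        x = layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)
        x = layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)
        return layers.MaxPooling2D(pool_size=pool)(x)

    inp = layers.Input(shape=input_shape)       # (freq, time, channels)
    x = conv_block(inp, 32, (3, 5), (4, 3))     # l1-l2
    x = conv_block(x, 64, (3, 1), (4, 1))       # l3-l4: frequency patterns
    x = conv_block(x, 128, (1, 5), (1, 3))      # l5-l6: temporal patterns
    x = conv_block(x, 256, (3, 3), (2, 2))      # l7-l8: joint time-frequency
    # Collapse the remaining frequency axis so each time step is one vector.
    x = layers.Permute((2, 1, 3))(x)            # (time, freq, channels)
    x = layers.Reshape((x.shape[1], -1))(x)
    x = layers.Bidirectional(layers.GRU(256, activation='tanh',
                                        return_sequences=True))(x)   # l9
    x = layers.Bidirectional(layers.GRU(256, activation='tanh'))(x)  # l10
    out = layers.Dense(n_classes, activation='softmax')(x)
    return models.Model(inp, out)
```

Note the Permute/Reshape step: it flattens the frequency and channel axes into a single feature vector per frame so that the GRUs summarize the sequence along time, as described above.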
2.3 Frame-Level Attention Mechanism

Not all frame-level features contribute equally to the representations of environmental sounds. As shown in Fig. 2 for dog bark, baby cry and clock tick examples, besides the semantically relevant frames, recordings contain semantically irrelevant frames as well as silent or noisy frames. We therefore apply a frame-level attention mechanism, first to the CNN layers and then to the RNN layers.

Attention for CNN layers:
As shown in Fig. 3(a), given CNN features M ∈ R^(F×T×C), we first use a 3×3 convolution filter to learn a hidden representation, followed by average pooling of size (F,1) to reduce the frequency dimension to one. Then, we use a softmax function to form a normalized attention map A ∈ R^(1×T×1), which holds the frame-level attention weights for the CNN features. With the attention map A, the attention-weighted CNN features are obtained as

M' = M · A    (1)

The attention is applied by multiplying the attention vector A with each feature vector of M along the frequency and channel dimensions. A minimal sketch of this module follows.
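The TensorFlow layer below is an illustration of this CNN-layer attention rather than the authors' implementation; in particular, the single-channel hidden representation produced by the 3×3 convolution is an assumption, since the paper does not state the hidden channel count.

```python
import tensorflow as tf
from tensorflow.keras import layers


class FrameAttentionCNN(layers.Layer):
    """Sketch of frame-level attention for CNN features (Eq. 1)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # 3x3 convolution learning a hidden representation (1 channel assumed).
        self.hidden = layers.Conv2D(1, (3, 3), padding='same')

    def call(self, m):
        # m: (batch, F, T, C) CNN features.
        h = self.hidden(m)                              # (batch, F, T, 1)
        h = tf.reduce_mean(h, axis=1, keepdims=True)    # average-pool over F
        a = tf.nn.softmax(h, axis=2)                    # normalize over T frames
        return m * a                                    # broadcast over F and C
```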
Attention for RNN layers: As shown in Fig. 3(b), we first feed the GRU output h_t (the concatenation of the forward and backward hidden states) through a one-layer MLP to obtain a hidden representation of h_t. We then calculate the normalized importance weight β_t with a softmax function (2), and compute the feature vector v as a weighted sum of the frame-level convolutional RNN features based on these weights (3). The feature vector v is forwarded to the fully connected layer for the final classification.

β_t = exp(W · h_t) / Σ_{t'=1}^{T} exp(W · h_{t'})    (2)

v = Σ_{t=1}^{T} β_t h_t    (3)

Fig. 3: Frame-level attention for (a) CNN layers and (b) RNN layers. For CNN layers, frame-level attention yields an attention map that is multiplied frame-wise with the CNN features, producing the attention-weighted features. For RNN layers, frame-level attention yields attention weights that are multiplied frame-wise with the input features; the weighted representations are then aggregated into a feature vector, which can be seen as a high-level representation of a sound such as "dog bark".

2.4 Data Augmentation

Limited data easily leads a model towards overfitting. In this paper, we use time stretch with a factor randomly selected from [0.8, 1.3] and pitch shift with a factor randomly selected from [-3.5, 3.5] to increase the size of the raw training data. In addition, the efficient mixup [26] augmentation method is used to construct virtual training data and extend the training distribution. In mixup, a feature and a target (x̂, ŷ) are generated by mixing two feature-target examples:

x̂ = λ x_i + (1 − λ) x_j,   ŷ = λ y_i + (1 − λ) y_j    (4)

where x_i and x_j are two features randomly selected from the training Log-GTs, and y_i and y_j are their one-hot labels. The mix factor λ is determined by a hyper-parameter α, with λ ∼ Beta(α, α). A minimal sketch of mixup follows.
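The NumPy sketch below implements Eq. (4); the value α = 0.2 is only an illustrative choice, since the text above does not report the α used.

```python
import numpy as np


def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two feature-target pairs into one virtual example (Eq. 4)."""
    lam = np.random.beta(alpha, alpha)   # mix factor lambda ~ Beta(alpha, alpha)
    x_hat = lam * x1 + (1 - lam) * x2    # virtual Log-GT feature
    y_hat = lam * y1 + (1 - lam) * y2    # mixed one-hot target
    return x_hat, y_hat
```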
3 Experiments

3.1 Datasets and Experiment Setup

To evaluate the performance of the proposed methods, we carry out experiments on two publicly available datasets, ESC-50 and ESC-10 [16]. ESC-50 is a collection of 2000 environmental recordings covering 50 classes in 5 major categories: animals; natural soundscapes and water sounds; human non-speech sounds; interior/domestic sounds; and exterior/urban noises. All audio samples are 5 seconds long with a 44.1 kHz sampling frequency. ESC-10 is a subset of 10 classes (400 samples) selected from the ESC-50 dataset (dog bark, rain, sea waves, baby cry, clock tick, person sneeze, helicopter, chainsaw, rooster, fire crackling).

We use a sampling rate of 44.1 kHz for all samples in order to exploit the rich high-frequency information. For training, all models minimize the cross-entropy loss using mini-batch stochastic gradient descent with Nesterov momentum of 0.9. Each batch consists of 64 segments randomly selected from the training set without repetition. All models are trained for 300 epochs, starting from an initial learning rate of 0.01 that is divided by 10 every 100 epochs. We initialize the network weights with zero-mean Gaussian noise with a standard deviation of 0.05. In the test phase, a whole sample is assigned the class with the highest prediction probability averaged over its segments. Both the training and testing features are normalized by the global mean and standard deviation of the training set. All models are trained with the Keras library (TensorFlow backend) on an Nvidia P100 GPU with 12 GB memory.

3.2 Comparison with Existing Methods

Table 1: Comparison of ACRNN and existing methods. We perform 5-fold cross-validation (CV) using the official fold settings and report the average results.

Model                   ESC-10   ESC-50
PiczakCNN [15]          80.5%    64.9%
SoundNet [1]            92.1%    74.2%
WaveMsNet [28]          93.7%    79.1%
EnvNet-v2 [21]          91.4%    84.9%
Multi-Stream CNN [12]   93.7%    83.5%
ACRNN (proposed)        93.7%    86.1%
We compare our model with the existing networks PiczakCNN [15], SoundNet [1], WaveMsNet [28], EnvNet-v2 [21] and Multi-Stream CNN [12]. According to [15], PiczakCNN consists of two convolutional layers and three fully connected layers, with input features generated by combining the log mel spectrogram and its delta information; we take PiczakCNN as the baseline method.

The results are summarized in Table 1. ACRNN outperforms PiczakCNN with an absolute improvement of 13.2% and 21.2% on the ESC-10 and ESC-50 datasets, respectively. We then compare our model with several state-of-the-art methods: SoundNet [1], WaveMsNet [28], EnvNet-v2 [21] and Multi-Stream CNN [12]. On both ESC-10 and ESC-50, ACRNN obtains the highest classification accuracy. Note that WaveMsNet [28] and Multi-Stream CNN [12] achieve the same classification accuracy as ACRNN on ESC-10 but use feature fusion (raw waveforms and spectrogram features), whereas ACRNN only utilizes spectrogram features.

In Fig. 4, we provide the confusion matrix generated by ACRNN on the ESC-50 dataset. Most classes achieve an accuracy higher than 80% (i.e., more than 32 of the 40 samples per class). In particular, Church bells obtains a 100% recognition rate. However, only 52.5% (21/40) of the Helicopter samples are correctly recognized, with 17.5% (7/40) misclassified as Airplane. We attribute these mistakes to the similar characteristics of the two environmental sounds.

Fig. 4: Confusion matrix of ACRNN with an average classification accuracy of 86.1% on the ESC-50 dataset.

3.3 Effects of Attention Mechanism
Table 2: Classification accuracy of the proposed convolutional RNN with and without the attention mechanism. 'augment' denotes the combination of time stretch, pitch shift and mixup.

Model Settings                        ESC-10   ESC-50
convolutional RNN                     89.2%    79.9%
convolutional RNN-attention           91.7%    81.3%
convolutional RNN-augment             93.0%    84.6%
convolutional RNN-attention-augment   93.7%    86.1%
To investigate the effects of the attention mechanism, we compare the results of the proposed convolutional RNN with and without it. As Table 2 shows, the attention mechanism delivers a significant accuracy improvement even when the data augmentation scheme is used. In addition, data augmentation brings a further improvement of 2.0% and 4.8% on the ESC-10 and ESC-50 datasets, respectively.
Table 3: Classification accuracy of applying the attention mechanism to the output of different layers of the proposed convolutional RNN.

Model Settings     ESC-10   ESC-50
no attention       93.0%    84.6%
attention at l10   93.7%    86.1%

In this section, we investigate the classification performance when applying the frame-level attention mechanism to different CNN and RNN layers. As shown in Table 3, we obtain the highest classification accuracy, an absolute improvement of 0.7% and 1.5% on ESC-10 and ESC-50 respectively, when applying the attention mechanism at l10 (the RNN layers). On the ESC-50 dataset, the classification accuracy improves slightly when the attention mechanism is applied at some of the CNN layers, while for the other CNN layers it decreases. On the ESC-10 dataset, applying attention at a single CNN layer brings an improvement of at most 0.5%. Furthermore, on both ESC-10 and ESC-50, applying attention at one of the CNN layers also improves the accuracy over the standard convolutional RNN.

4 Conclusion

In this paper, we proposed an attention mechanism-based convolutional recurrent neural network (ACRNN) for ESC. We explored the frame-level attention mechanism and described its application to CNN layers and RNN layers, respectively. Experimental results on the ESC-10 and ESC-50 datasets demonstrated the effectiveness of the proposed method, which achieved state-of-the-art classification accuracy. In addition, we compared the classification accuracy obtained when applying attention at different layers, including CNN layers and RNN layers. The experiments showed that applying attention to the RNN layers yields the highest accuracy, whereas applying attention to the CNN layers does not always improve performance. We plan to explore this further in future work.
References
1. Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: Learning sound representations from unlabeled video. In: Proc. Int. Conf. Neural Inf. Process. Syst., pp. 892–900 (2016)
2. Bae, S.H., Choi, I., Kim, N.S.: Acoustic scene classification using parallel combination of LSTM and CNN. DCASE2016 Challenge, Tech. Rep. (2016)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
4. Barchiesi, D., Giannoulis, D., Stowell, D., Plumbley, M.D.: Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Process. Magazine (3), 16–34 (2015)
5. Bisot, V., Serizel, R., Essid, S., Richard, G.: Feature learning with matrix factorization applied to acoustic scene classification. IEEE/ACM Trans. Audio, Speech, Language Process. (6), 1216–1229 (2017)
6. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Proc. Int. Conf. Neural Inf. Process. Syst., pp. 577–585 (2015)
7. Chu, S., Narayanan, S., Kuo, C.C.J.: Environmental sound recognition with time-frequency audio features. IEEE Trans. Audio, Speech, Language Process. (6), 1142–1158 (2009)
8. Dhanalakshmi, P., Palanivel, S., Ramalingam, V.: Classification of audio signals using AANN and GMM. Applied Soft Computing (1), 716–723 (2011)
9. Guo, J., Xu, N., Li, L.J., Alwan, A.: Attention based CLDNNs for short-duration acoustic scene classification. In: Proc. Interspeech, pp. 469–473 (2017)
10. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
11. Jun, W., Shengchen, L.: Self-attention mechanism based system for DCASE2018 challenge task1 and task4. DCASE2018 Challenge, Tech. Rep. (2018)
12. Li, X., Chebiyyam, V., Kirchhoff, K.: Multi-stream network with temporal attention for environmental sound classification. arXiv preprint arXiv:1901.08608 (2019)
13. Lyon, R.F.: Machine hearing: An emerging field [exploratory DSP]. IEEE Signal Process. Magazine (5), 131–139 (2010)
14. McLoughlin, I., Zhang, H., Xie, Z., Song, Y., Xiao, W.: Robust sound event classification using deep neural networks. IEEE/ACM Trans. Audio, Speech, Language Process. (3), 540–552 (2015)
15. Piczak, K.J.: Environmental sound classification with convolutional neural networks. In: Proc. 25th Int. Workshop Mach. Learning Signal Process., pp. 1–6 (2015)
16. Piczak, K.J.: ESC: Dataset for environmental sound classification. In: Proc. 23rd ACM Int. Conf. Multimedia, pp. 1015–1018 (2015)
17. Radhakrishnan, R., Divakaran, A., Smaragdis, A.: Audio analysis for surveillance applications. In: Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., pp. 158–161 (2005)
18. Ren, Z., et al.: Attention-based convolutional neural networks for acoustic scene classification. DCASE2018 Challenge, Tech. Rep. (2018)
19. Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Letters (3), 279–283 (2017)
20. Takahashi, N., Gygli, M., Pfister, B., Van Gool, L.: Deep convolutional neural networks and data augmentation for acoustic event detection. arXiv preprint arXiv:1604.07160 (2016)
21. Tokozume, Y., Ushiku, Y., Harada, T.: Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282 (2017)
22. Vacher, M., Serignat, J.F., Chaillol, S.: Sound classification in a smart room environment: an approach using GMM and HMM methods. In: Proc. 4th IEEE Conf. Speech Technology, Human-Computer Dialogue, vol. 1, pp. 135–146 (2007)
23. Valero, X., Alias, F.: Gammatone cepstral coefficients: Biologically inspired features for non-speech audio classification. IEEE Trans. Multimedia 14(6), 1684–1689 (2012)