Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification
Zhichao Zhang, Shugong Xu*, Tianhao Qiao, Shunqing Zhang, and Shan Cao
Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, 200444, China
{zhichaozhang, shugong, qiaotianhao, shunqing, cshan}@shu.edu.cn

* Corresponding author: Shugong Xu, Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, China (email: shugong@shu.edu.cn).

Abstract.
Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. ESC performance depends heavily on the effectiveness of the representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model that focuses on the semantically relevant and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend this convolutional RNN with a frame-level attention mechanism to learn discriminative feature representations for ESC. Experiments conducted on the ESC-50 and ESC-10 datasets demonstrate the effectiveness of the proposed method, which achieves state-of-the-art classification accuracy.
Keywords:
Environmental Sound Classification · Convolutional Recurrent Neural Network · Attention Mechanism
1 Introduction

Environmental sound classification (ESC) is an important branch of sound recognition and is widely applied in surveillance [17], home automation [22], scene analysis [4] and machine hearing [13].

Thus far, a variety of signal processing and machine learning techniques have been applied to ESC, including dictionary learning [7], matrix factorization [5], Gaussian mixture models (GMMs) [8] and, recently, deep neural networks [19, 27]. For traditional machine learning classifiers, selecting proper features is key to effective performance. For instance, audio signals have traditionally been characterized by Mel-frequency cepstral coefficient (MFCC) features and classified with a GMM classifier.

In recent years, deep neural networks (DNNs) have shown outstanding performance in feature extraction for ESC.
Compared to hand-crafted features, DNNs have the ability to extract discriminative feature representations from large quantities of training data and generalize well to unseen data. McLoughlin et al. [14] proposed a deep belief network to extract high-level feature representations from the magnitude spectrum, which yielded better results than traditional methods. Piczak [15] first evaluated the potential of convolutional neural networks (CNNs) for classifying short audio clips of environmental sounds and showed excellent performance on several public datasets. Takahashi et al. [20] created a three-channel input for a CNN by combining the log mel spectrogram with its delta and delta-delta information, in a manner similar to the RGB channels of an image. To model the sequential dynamics of environmental sound signals, Vu et al. [24] applied a recurrent neural network (RNN) to learn temporal relationships. Moreover, there is a growing trend to combine CNN and RNN models into a single architecture. Bae et al. [2] proposed to train an RNN and a CNN in parallel in order to learn sequential correlations and local spectro-temporal information.

In addition, attention mechanism-based models have shown outstanding performance in learning relevant feature representations for sequence data [6]. Recently, attention-based RNNs have been successfully applied to a wide variety of tasks, including speech recognition [6], machine translation [3] and document classification [25]. In principle, attention-based RNNs are well suited to ESC tasks. First, environmental sound is essentially sequence data that contains correlation information between adjacent frames. Second, not all frame-level features contribute equally to the representations of environmental sounds. In public ESC datasets, signals usually contain many periods of silence, with only a few intermittent frames associated with the sound class. Thus, it is important to select the semantically relevant frames for a specific class. Similar to attention-based RNNs, we can also compute a frame-level attention map from CNN features to focus on the semantically relevant frames. In the field of ESC, several works [9, 11, 12, 18] have studied the effectiveness of attention mechanisms and obtained promising results on several datasets. Different from previous works, we explore the performance of a frame-level attention mechanism for both CNN layers and RNN layers.

In this paper, we propose an attention mechanism-based convolutional RNN architecture (ACRNN) that focuses on semantically relevant frames and produces discriminative features for ESC. The main contributions of this paper are summarized as follows.

– To deal with silent frames and semantically irrelevant frames, we employ an attention model that automatically focuses on the semantically relevant frames and produces discriminative features for ESC. We explore the frame-level attention mechanism for both CNN layers and RNN layers.

– To analyze temporal relations, we propose a novel convolutional RNN model that first uses a CNN to extract high-level feature representations and then feeds these features into bidirectional GRUs. We combine the convolutional RNN and the attention model in a unified architecture.
– To demonstrate the effectiveness of the proposed method, we conduct experiments on the ESC-10 and ESC-50 datasets and achieve current state-of-the-art performance.
Fig. 1: Architecture of the convolutional recurrent neural network for environmental sound classification: a log Gammatone spectrogram (static + delta channels) is processed by eight convolutional layers with max pooling (Conv1-Conv8), two bidirectional GRU layers and a fully connected layer that outputs the prediction probability distribution (e.g., dog 0.7, siren 0.2, rain 0.1).
2 Proposed Method

In this section, we introduce the proposed method for ESC. First, we generate log Gammatone spectrogram (Log-GT) features from the environmental sounds as the input of ACRNN, as shown in Fig. 1. Then, we introduce the architecture of ACRNN, which combines a convolutional RNN with a frame-level attention mechanism; we describe the convolutional RNN and the attention mechanism in detail, respectively. Finally, we introduce the data augmentation methods we use.

2.1 Log Gammatone Spectrogram Extraction

Given a signal, we first use the short-time Fourier transform (STFT) with a 23 ms Hamming window (1024 samples at 44.1 kHz) and 50% overlap to extract the energy spectrogram. Then, we apply a 128-band Gammatone filter bank [23] to the energy spectrogram and convert the resulting spectrogram to a logarithmic scale. To make efficient use of the limited data, the spectrogram is split into segments of 128 frames (approximately 1.5 s in length) with 50% overlap. We also calculate the delta information of the original spectrogram, i.e., the first temporal derivative of the static spectrogram. Afterwards, we concatenate the log Gammatone spectrogram and its delta information into a 3-D feature representation X ∈ R^(128×128×2) (Log-GT), which serves as the input of the network.
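The following Python sketch illustrates this pipeline under stated assumptions. It is a reconstruction, not the authors' code: `gammatone_filterbank` is a hypothetical helper standing in for a real Gammatone filterbank (e.g., one built with the `gammatone` toolbox), while librosa supplies the STFT and delta computations.

```python
import numpy as np
import librosa


def log_gammatone_features(path, sr=44100, n_fft=1024, n_bands=128):
    """Sketch of the Log-GT pipeline described above (not the authors' code)."""
    y, _ = librosa.load(path, sr=sr)
    # Energy spectrogram: 23 ms Hamming window (1024 samples), 50% overlap.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 2,
                               window='hamming')) ** 2
    # Hypothetical helper returning an (n_bands, 1 + n_fft // 2) weight matrix.
    fb = gammatone_filterbank(sr=sr, n_fft=n_fft, n_bands=n_bands)
    gt = np.log(fb @ spec + 1e-10)        # 128-band log Gammatone spectrogram
    delta = librosa.feature.delta(gt)     # first temporal derivative
    # Split into 128-frame segments (~1.5 s) with 50% overlap and stack
    # static + delta channels into (128, 128, 2) Log-GT inputs.
    segments = []
    for start in range(0, gt.shape[1] - 128 + 1, 64):
        segments.append(np.stack([gt[:, start:start + 128],
                                  delta[:, start:start + 128]], axis=-1))
    return np.array(segments)             # (n_segments, 128, 128, 2)
```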
2.2 Architecture of Convolutional RNN

In this section, we propose a convolutional RNN to analyze the Log-GTs for ESC. We first use a CNN to learn high-level feature representations from the Log-GTs. Then, the CNN-learned features are fed into bidirectional gated recurrent unit (GRU) layers, which learn the temporal correlation information. Finally, these features are fed into a fully connected layer with a softmax activation function that outputs the probability distribution over the classes. The convolutional RNN comprises eight convolutional layers (l1-l8) and two bidirectional GRU layers (l9-l10). The architecture and parameters of the network are as follows (a sketch of the network follows the list):

– l1-l2: The first two stacked convolutional layers use 32 filters with a receptive field of (3,5) and a stride of (1,1), followed by max pooling with a (4,3) stride to reduce the dimensions of the feature maps. The ReLU activation function is used.

– l3-l4: The next two convolutional layers use 64 filters with a receptive field of (3,1) and a stride of (1,1), learning local patterns along the frequency dimension. They are followed by max pooling with a (4,1) stride. The ReLU activation function is used.

– l5-l6: The following pair of convolutional layers uses 128 filters with a receptive field of (1,5) and a stride of (1,1), learning local patterns along the time dimension. They are followed by max pooling with a (1,3) stride. The ReLU activation function is used.

– l7-l8: The subsequent two convolutional layers use 256 filters with a receptive field of (3,3) and a stride of (1,1) to learn joint time-frequency characteristics. They are followed by max pooling with a (2,2) stride. The ReLU activation function is used.

– l9-l10: Two bidirectional GRU layers with 256 cells each are used for temporal summarization, with the tanh activation function. Dropout is applied to the GRU layers.
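A compact Keras/TensorFlow sketch of this network, reconstructed from the list above, is given below. Details the paper does not specify (padding mode, pooling windows matching the stated strides, and the omitted dropout rate) are assumptions.

```python
from tensorflow.keras import layers, models


def build_conv_rnn(n_classes=50, input_shape=(128, 128, 2)):
    """Reconstruction of the convolutional RNN (l1-l10) described above."""
    def conv_block(x, filters, kernel, pool):
        # Two stacked conv layers with ReLU, followed by max pooling.
        x = layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)
        x = layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)
        return layers.MaxPooling2D(pool_size=pool)(x)

    inp = layers.Input(shape=input_shape)       # (freq, time, channels)
    x = conv_block(inp, 32, (3, 5), (4, 3))     # l1-l2
    x = conv_block(x, 64, (3, 1), (4, 1))       # l3-l4: frequency patterns
    x = conv_block(x, 128, (1, 5), (1, 3))      # l5-l6: temporal patterns
    x = conv_block(x, 256, (3, 3), (2, 2))      # l7-l8: joint time-frequency
    # Collapse the remaining frequency axis so each time step is one vector.
    x = layers.Permute((2, 1, 3))(x)            # (time, freq, channels)
    x = layers.Reshape((x.shape[1], -1))(x)
    x = layers.Bidirectional(layers.GRU(256, activation='tanh',
                                        return_sequences=True))(x)   # l9
    x = layers.Bidirectional(layers.GRU(256, activation='tanh'))(x)  # l10
    out = layers.Dense(n_classes, activation='softmax')(x)
    return models.Model(inp, out)
```

Note the Permute/Reshape step: it flattens the frequency and channel axes into a single feature vector per frame so that the GRUs summarize the sequence along time, as described above.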
2.3 Frame-Level Attention Mechanism

Not all frame-level features contribute equally to the representations of environmental sounds. As shown in Fig. 2 for dog bark, baby cry and clock tick examples, besides the semantically relevant frames, recordings contain semantically irrelevant frames as well as silent or noisy frames. We therefore apply a frame-level attention mechanism, first to the CNN layers and then to the RNN layers.

Attention for CNN layers:
As shown in Fig. 3(a), given CNN features M ∈ R^(F×T×C), we first use a 3×3 convolution filter to learn a hidden representation, followed by average pooling of size (F,1) to reduce the frequency dimension to one. Then, we use a softmax function to form a normalized attention map A ∈ R^(1×T×1), which holds the frame-level attention weights for the CNN features. With the attention map A, the attention-weighted CNN features are obtained as

M' = M · A    (1)

The attention is applied by multiplying the attention vector A with each feature vector of M along the frequency and channel dimensions. A minimal sketch of this module follows.
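The TensorFlow layer below is an illustration of this CNN-layer attention rather than the authors' implementation; in particular, the single-channel hidden representation produced by the 3×3 convolution is an assumption, since the paper does not state the hidden channel count.

```python
import tensorflow as tf
from tensorflow.keras import layers


class FrameAttentionCNN(layers.Layer):
    """Sketch of frame-level attention for CNN features (Eq. 1)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # 3x3 convolution learning a hidden representation (1 channel assumed).
        self.hidden = layers.Conv2D(1, (3, 3), padding='same')

    def call(self, m):
        # m: (batch, F, T, C) CNN features.
        h = self.hidden(m)                              # (batch, F, T, 1)
        h = tf.reduce_mean(h, axis=1, keepdims=True)    # average-pool over F
        a = tf.nn.softmax(h, axis=2)                    # normalize over T frames
        return m * a                                    # broadcast over F and C
```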
Attention for RNN layers: As shown in Fig. 3(b), we first feed the GRU output h_t (the concatenation of the forward and backward hidden states) through a one-layer MLP to obtain a hidden representation of h_t. We then calculate the normalized importance weight β_t with a softmax function (2), and compute the feature vector v as a weighted sum of the frame-level convolutional RNN features based on these weights (3). The feature vector v is forwarded to the fully connected layer for the final classification.

β_t = exp(W · h_t) / Σ_{t'=1}^{T} exp(W · h_{t'})    (2)

v = Σ_{t=1}^{T} β_t h_t    (3)

Fig. 3: Frame-level attention for (a) CNN layers and (b) RNN layers. For CNN layers, frame-level attention yields an attention map that is multiplied frame-wise with the CNN features, producing the attention-weighted features. For RNN layers, frame-level attention yields attention weights that are multiplied frame-wise with the input features; the weighted representations are then aggregated into a feature vector, which can be seen as a high-level representation of a sound such as "dog bark".

2.4 Data Augmentation

Limited data easily leads a model towards overfitting. In this paper, we use time stretch with a factor randomly selected from [0.8, 1.3] and pitch shift with a factor randomly selected from [-3.5, 3.5] to increase the size of the raw training data. In addition, the efficient mixup [26] augmentation method is used to construct virtual training data and extend the training distribution. In mixup, a feature and a target (x̂, ŷ) are generated by mixing two feature-target examples:

x̂ = λ x_i + (1 − λ) x_j,   ŷ = λ y_i + (1 − λ) y_j    (4)

where x_i and x_j are two features randomly selected from the training Log-GTs, and y_i and y_j are their one-hot labels. The mix factor λ is determined by a hyper-parameter α, with λ ∼ Beta(α, α). A minimal sketch of mixup follows.
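The NumPy sketch below implements Eq. (4); the value α = 0.2 is only an illustrative choice, since the text above does not report the α used.

```python
import numpy as np


def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two feature-target pairs into one virtual example (Eq. 4)."""
    lam = np.random.beta(alpha, alpha)   # mix factor lambda ~ Beta(alpha, alpha)
    x_hat = lam * x1 + (1 - lam) * x2    # virtual Log-GT feature
    y_hat = lam * y1 + (1 - lam) * y2    # mixed one-hot target
    return x_hat, y_hat
```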
3 Experiments

3.1 Datasets and Experiment Setup

To evaluate the performance of the proposed methods, we carry out experiments on two publicly available datasets, ESC-50 and ESC-10 [16]. ESC-50 is a collection of 2000 environmental recordings covering 50 classes in 5 major categories: animals; natural soundscapes and water sounds; human non-speech sounds; interior/domestic sounds; and exterior/urban noises. All audio samples are 5 seconds long with a 44.1 kHz sampling frequency. ESC-10 is a subset of 10 classes (400 samples) selected from the ESC-50 dataset (dog bark, rain, sea waves, baby cry, clock tick, person sneeze, helicopter, chainsaw, rooster, fire crackling).

We use a sampling rate of 44.1 kHz for all samples in order to exploit the rich high-frequency information. For training, all models minimize the cross-entropy loss using mini-batch stochastic gradient descent with Nesterov momentum of 0.9. Each batch consists of 64 segments randomly selected from the training set without repetition. All models are trained for 300 epochs, starting from an initial learning rate of 0.01 that is divided by 10 every 100 epochs. We initialize the network weights with zero-mean Gaussian noise with a standard deviation of 0.05. In the test phase, a whole sample is assigned the class with the highest prediction probability averaged over its segments. Both the training and testing features are normalized by the global mean and standard deviation of the training set. All models are trained with the Keras library (TensorFlow backend) on an Nvidia P100 GPU with 12 GB memory.

3.2 Comparison with Existing Methods

Table 1: Comparison of ACRNN and existing methods. We perform 5-fold cross-validation (CV) using the official fold settings and report the average results.

Model                   ESC-10   ESC-50
PiczakCNN [15]          80.5%    64.9%
SoundNet [1]            92.1%    74.2%
WaveMsNet [28]          93.7%    79.1%
EnvNet-v2 [21]          91.4%    84.9%
Multi-Stream CNN [12]   93.7%    83.5%
ACRNN (proposed)        93.7%    86.1%
We compare our model with the existing networks PiczakCNN [15], SoundNet [1], WaveMsNet [28], EnvNet-v2 [21] and Multi-Stream CNN [12]. According to [15], PiczakCNN consists of two convolutional layers and three fully connected layers, with input features generated by combining the log mel spectrogram and its delta information; we take PiczakCNN as the baseline method.

The results are summarized in Table 1. ACRNN outperforms PiczakCNN with an absolute improvement of 13.2% and 21.2% on the ESC-10 and ESC-50 datasets, respectively. We then compare our model with several state-of-the-art methods: SoundNet [1], WaveMsNet [28], EnvNet-v2 [21] and Multi-Stream CNN [12]. On both ESC-10 and ESC-50, ACRNN obtains the highest classification accuracy. Note that WaveMsNet [28] and Multi-Stream CNN [12] achieve the same classification accuracy as ACRNN on ESC-10 but use feature fusion (raw waveforms and spectrogram features), whereas ACRNN only utilizes spectrogram features.

In Fig. 4, we provide the confusion matrix generated by ACRNN on the ESC-50 dataset. Most classes achieve an accuracy higher than 80% (i.e., more than 32 of the 40 samples per class). In particular, Church bells obtains a 100% recognition rate. However, only 52.5% (21/40) of the Helicopter samples are correctly recognized, with 17.5% (7/40) misclassified as Airplane. We attribute these mistakes to the similar characteristics of the two environmental sounds.

Fig. 4: Confusion matrix of ACRNN with an average classification accuracy of 86.1% on the ESC-50 dataset.

3.3 Effects of Attention Mechanism
Table 2: Classification accuracy of the proposed convolutional RNN with and without the attention mechanism. 'augment' denotes the combination of time stretch, pitch shift and mixup.

Model Settings                        ESC-10   ESC-50
convolutional RNN                     89.2%    79.9%
convolutional RNN-attention           91.7%    81.3%
convolutional RNN-augment             93.0%    84.6%
convolutional RNN-attention-augment   93.7%    86.1%
To investigate the effects of the attention mechanism, we compare the results of the proposed convolutional RNN with and without it. As Table 2 shows, the attention mechanism delivers a significant accuracy improvement even when the data augmentation scheme is used. In addition, data augmentation brings a further improvement of 2.0% and 4.8% on the ESC-10 and ESC-50 datasets, respectively.
Table 3: Classification accuracy of applying the attention mechanism to the output of different layers of the proposed convolutional RNN.

Model Settings     ESC-10   ESC-50
no attention       93.0%    84.6%
attention at l10   93.7%    86.1%

In this section, we investigate the classification performance when applying the frame-level attention mechanism to different CNN and RNN layers. As shown in Table 3, we obtain the highest classification accuracy, an absolute improvement of 0.7% and 1.5% on ESC-10 and ESC-50 respectively, when applying the attention mechanism at l10 (the RNN layers). On the ESC-50 dataset, the classification accuracy improves slightly when the attention mechanism is applied at some of the CNN layers, while for the other CNN layers it decreases. On the ESC-10 dataset, applying attention at a single CNN layer brings an improvement of at most 0.5%. Furthermore, on both ESC-10 and ESC-50, applying attention at one of the CNN layers also improves the accuracy over the standard convolutional RNN.

4 Conclusion

In this paper, we proposed an attention mechanism-based convolutional recurrent neural network (ACRNN) for ESC. We explored the frame-level attention mechanism and described its application to CNN layers and RNN layers, respectively. Experimental results on the ESC-10 and ESC-50 datasets demonstrated the effectiveness of the proposed method, which achieved state-of-the-art classification accuracy. In addition, we compared the classification accuracy obtained when applying attention at different layers, including CNN layers and RNN layers. The experiments showed that applying attention to the RNN layers yields the highest accuracy, whereas applying attention to the CNN layers does not always improve performance. We plan to explore this further in future work.
References
1. Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: Learning sound representations from unlabeled video. In: Proc. Int. Conf. Neural Inf. Process. Syst., pp. 892–900 (2016)
2. Bae, S.H., Choi, I., Kim, N.S.: Acoustic scene classification using parallel combination of LSTM and CNN. DCASE2016 Challenge, Tech. Rep. (2016)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
4. Barchiesi, D., Giannoulis, D., Stowell, D., Plumbley, M.D.: Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Process. Magazine (3), 16–34 (2015)
5. Bisot, V., Serizel, R., Essid, S., Richard, G.: Feature learning with matrix factorization applied to acoustic scene classification. IEEE/ACM Trans. Audio, Speech, Language Process. (6), 1216–1229 (2017)
6. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Proc. Int. Conf. Neural Inf. Process. Syst., pp. 577–585 (2015)
7. Chu, S., Narayanan, S., Kuo, C.C.J.: Environmental sound recognition with time-frequency audio features. IEEE Trans. Audio, Speech, Language Process. (6), 1142–1158 (2009)
8. Dhanalakshmi, P., Palanivel, S., Ramalingam, V.: Classification of audio signals using AANN and GMM. Applied Soft Computing (1), 716–723 (2011)
9. Guo, J., Xu, N., Li, L.J., Alwan, A.: Attention based CLDNNs for short-duration acoustic scene classification. In: Proc. Interspeech, pp. 469–473 (2017)
10. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
11. Jun, W., Shengchen, L.: Self-attention mechanism based system for DCASE2018 challenge task1 and task4. DCASE2018 Challenge, Tech. Rep. (2018)
12. Li, X., Chebiyyam, V., Kirchhoff, K.: Multi-stream network with temporal attention for environmental sound classification. arXiv preprint arXiv:1901.08608 (2019)
13. Lyon, R.F.: Machine hearing: An emerging field [exploratory DSP]. IEEE Signal Process. Magazine (5), 131–139 (2010)
14. McLoughlin, I., Zhang, H., Xie, Z., Song, Y., Xiao, W.: Robust sound event classification using deep neural networks. IEEE/ACM Trans. Audio, Speech, Language Process. (3), 540–552 (2015)
15. Piczak, K.J.: Environmental sound classification with convolutional neural networks. In: Proc. 25th Int. Workshop Mach. Learning Signal Process., pp. 1–6 (2015)
16. Piczak, K.J.: ESC: Dataset for environmental sound classification. In: Proc. 23rd ACM Int. Conf. Multimedia, pp. 1015–1018 (2015)
17. Radhakrishnan, R., Divakaran, A., Smaragdis, A.: Audio analysis for surveillance applications. In: Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., pp. 158–161 (2005)
18. Ren, Z., et al.: Attention-based convolutional neural networks for acoustic scene classification. DCASE2018 Challenge, Tech. Rep. (2018)
19. Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Letters (3), 279–283 (2017)
20. Takahashi, N., Gygli, M., Pfister, B., Van Gool, L.: Deep convolutional neural networks and data augmentation for acoustic event detection. arXiv preprint arXiv:1604.07160 (2016)
21. Tokozume, Y., Ushiku, Y., Harada, T.: Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282 (2017)
22. Vacher, M., Serignat, J.F., Chaillol, S.: Sound classification in a smart room environment: an approach using GMM and HMM methods. In: Proc. 4th IEEE Conf. Speech Technology, Human-Computer Dialogue, vol. 1, pp. 135–146 (2007)
23. Valero, X., Alias, F.: Gammatone cepstral coefficients: Biologically inspired features for non-speech audio classification. IEEE Trans. Multimedia 14(6), 1684–1689 (2012)