Learning How to Listen: A Temporal-Frequential Attention Model for Sound Event Detection
Yu-Han Shen, Ke-Xin He, Wei-Qiang Zhang
Department of Electronic Engineering, Tsinghua University, Beijing, China
[email protected], [email protected], [email protected]
The corresponding author is Wei-Qiang Zhang.
ABSTRACT
In this paper, we propose a temporal-frequential attention model for sound event detection (SED). Our network learns how to listen with two attention models: a temporal attention model and a frequential attention model. The proposed system learns when to listen using the temporal attention model, while it learns where to listen on the frequency axis using the frequential attention model. With these two models, we attempt to make our system pay more attention to important frames or segments and to important frequency components for sound event detection. Our proposed method is demonstrated on task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge and achieves competitive performance.
Index Terms — sound event detection, convolutional neural network, recurrent neural network, attention model, temporal-frequential attention
1. INTRODUCTION
Nowadays, sound event detection (SED), also known as acoustic event detection (AED), is a popular topic in the field of acoustic signal processing. The aim of SED is to temporally locate the onset and offset times of target sound events present in an audio recording.

The Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge is an international challenge concerning SED and has been held for several years. In the DCASE 2017 Challenge, the theme of task 2 is "detection of rare sound events" [1]. It provides a dataset [2] and a baseline system for rare sound event detection in synthesized recordings. Here, "rare" means that a target sound event (babycry, glassbreak, gunshot) occurs at most once within a 30-second recording. Moreover, the mean duration of the target sound events is very short: 2.25 s for babycry, 1.16 s for glassbreak, and 1.32 s for gunshot, which leads to a serious problem of data imbalance. All audio recordings are annotated with ground-truth labels of event class, onset time, and offset time. According to the task description, a separate system should be developed for each of the three target event classes to detect the temporal occurrences of these events [1].

Among the submissions to DCASE 2017, most models are based on deep neural networks. Both of the top 2 teams [3, 4] utilized Convolutional Recurrent Neural Networks (CRNN) as their main architecture. They combined Convolutional Neural Networks (CNN) with Recurrent Neural Networks (RNN) to make frame-level predictions for target events and then adopted post-processing to get the onset and offset times of the sound events. Kao et al. [5] proposed a Region-based Convolutional Recurrent Neural Network (R-CRNN)
to improve on the previous work in 2018. In our work, we follow the main architecture of those three models and use a CRNN as the main classifier.

Inspired by the excellent performance of attention models in machine translation [6], image captioning [7], speaker verification [8], and audio tagging [9], we propose an attention model for SED. Currently, most attention models in speech and audio processing concentrate only on the time domain. We propose a temporal-frequential attention model that focuses on important frequency components as well as important frames or segments. Our attention model can learn how to listen by extracting not only temporal information but also spectral information. Besides, we visualize the weights of the attention models to show what they have actually learnt.

The rest of this paper is organized as follows: in Section 2, we introduce our methods in detail, mainly including feature extraction, the baseline, and the temporal-frequential attention model. The dataset, experimental setup, and evaluation metric are described in Section 3. The results and analysis are presented in Section 4. Finally, we conclude our work in Section 5.
2. METHODS

2.1. System overview
As shown in Figure 1, our proposed system is a CRNN architecture with a temporal-frequential attention model. The input of our system is a 2-dimensional acoustic feature. It is fed into a frequential attention model to produce frequential attention weights; using these weights, our system learns to focus on specific frequency components of the audio. The input acoustic feature is multiplied with these attention weights and then passed through the CRNN architecture. Compared with a traditional CRNN [3, 4], we add a temporal attention model to let our system pay different amounts of attention to different frames. The temporal attention weights are multiplied element-wise with the outputs of the CRNN, and a sigmoid activation is used to obtain normalized probabilities. Finally, we apply post-processing to get the detection outputs.
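For orientation, the following is a minimal PyTorch-style sketch of this data flow. It is only an illustration under our own naming and shape assumptions: the backbone here is a stand-in GRU rather than the full CRNN of Figure 2, and the temporal weights are computed from the backbone output rather than from the CNN output as in the paper.

```python
import torch
import torch.nn as nn

class AttentiveCRNNSketch(nn.Module):
    """Pipeline sketch: frequential attention on the input Fbank feature,
    a (placeholder) backbone, temporal attention, and a sigmoid output."""

    def __init__(self, n_mels=128, gru_units=32):
        super().__init__()
        self.freq_att = nn.Linear(n_mels, n_mels)         # frequential attention weights
        # Placeholder backbone; the real model uses four conv blocks with
        # residual connections and pooling before a bi-GRU (Figure 2).
        self.backbone = nn.GRU(n_mels, gru_units, batch_first=True, bidirectional=True)
        self.temp_att = nn.Linear(2 * gru_units, 1)       # temporal attention weights
        self.fc = nn.Linear(2 * gru_units, 1)             # per-segment score

    def forward(self, fbank):                             # fbank: (batch, frames, n_mels)
        m = torch.sigmoid(self.freq_att(fbank))           # candidate frequential weights
        m = m.size(-1) * m / m.sum(dim=-1, keepdim=True)  # normalize along frequency
        h, _ = self.backbone(fbank * m)                   # weighted feature -> backbone
        a = torch.relu(self.temp_att(h)).squeeze(-1)      # candidate temporal weights
        a = a.size(-1) * a / (a.sum(dim=-1, keepdim=True) + 1e-8)  # normalize along time
        return torch.sigmoid(a * self.fc(h).squeeze(-1))  # per-segment probabilities

probs = AttentiveCRNNSketch()(torch.randn(2, 1500, 128))  # -> shape (2, 1500)
```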
2.2. Feature extraction

The acoustic feature used in our work is the log filter bank energy (Fbank). The sampling rate of the input audio is 44.1 kHz. To extract the Fbank feature, each audio is divided into frames of 40 ms duration with shifts of 20 ms. We then apply 128 mel-scale filters covering the frequency range from 300 to 22050 Hz to the magnitude spectrum of each frame. Finally, we take the logarithm of the amplitude to get the Fbank feature. The extracted Fbank feature is normalized to zero mean and unit standard deviation before being fed into the neural networks.

Fig. 1. Illustration of the overall system.

2.3. CRNN baseline

We adopt a state-of-the-art CRNN as the baseline. The input is the Fbank feature of 30-second audios, and the output of our system gives binary predictions for each segment with a time resolution of 80 ms (4 times the input frame shift of 20 ms).

The CRNN architecture consists of three parts: a convolutional neural network (CNN), a recurrent neural network (RNN), and a fully-connected layer. The architecture of our CRNN is similar to that in [5], and it is shown in Figure 2.

The CNN part contains four convolutional layers, and each layer is followed by batch normalization [10], a ReLU activation unit, and a dropout layer [11]. We add two residual connections [12] to improve the performance of the CNN. Max-pooling layers (on both the time axis and the frequency axis) are used to retain the most important information on each feature map. At the end of the CNN, the extracted features over the different convolutional channels are stacked along the frequency axis.

The RNN part is a bi-directional gated recurrent unit (bi-GRU) layer. Compared with a uni-directional GRU, the bi-GRU can better capture the temporal structure of sound events. We add the outputs of the forward GRU and the backward GRU to get the final outputs of the bi-GRU. The size of the output of the bi-GRU is (375, U), where U is the number of GRU units.

After the bi-GRU, a single fully-connected layer with sigmoid activation is used to give a classification result for each segment (4 frames). The output denotes the presence probability of the target event in each segment.

In order to determine the presence of an event, a binary prediction is given for each segment with a constant threshold of 0.5. These predictions are post-processed with a median filter of length 240 ms. Since at most one event can occur in a 30-s audio, we select the longest continuous sequence of positive predictions to get the onset and offset of the target event.
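As a concrete illustration of this post-processing step, here is a minimal Python sketch under the settings stated above (threshold 0.5, 80 ms segments, so the 240 ms median filter corresponds to a kernel of 3 segments). The function and variable names are ours, not taken from the authors' code.

```python
import numpy as np
from scipy.signal import medfilt

def detect_event(probs, threshold=0.5, segment_sec=0.08, median_kernel=3):
    """Turn per-segment probabilities into one (onset, offset) pair in seconds,
    or None if no positive segment survives the median smoothing."""
    binary = (np.asarray(probs) > threshold).astype(float)
    binary = medfilt(binary, kernel_size=median_kernel)   # 3 segments ~ 240 ms
    # Keep only the longest continuous run of positive predictions.
    best_start, best_len, start, run = 0, 0, 0, 0
    for i, b in enumerate(binary):
        if b > 0:
            if run == 0:
                start = i
            run += 1
            if run > best_len:
                best_start, best_len = start, run
        else:
            run = 0
    if best_len == 0:
        return None
    return best_start * segment_sec, (best_start + best_len) * segment_sec
```

For a 30-second recording with 80 ms segments, probs has length 375 and the returned onset and offset are directly comparable with the ground-truth annotations.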
Fig. 2. The architecture of the CRNN. The first and second dimensions of the convolutional kernels and strides represent the time axis and the frequency axis respectively.

Table 1. Performance of the proposed models and other methods, in terms of ER and F-score (%), for babycry, glassbreak, gunshot, and their average on the development and evaluation datasets. *** indicates that class-wise results are not given in the related paper. We compare the following models: (1) Baseline: our bi-GRU-based CRNN; (2) CRNN+TA: our bi-GRU-based CRNN with temporal attention model; (3) Proposed: our bi-GRU-based CRNN with temporal-frequential attention model; (4) R-CRNN: Region-based CRNN; (5) 1d-CRNN: DCASE 1st place model; (6) CRNN: DCASE 2nd place model.

2.4. Temporal attention model

As shown in Figure 1, we add a temporal attention model at the end of the CNN to enable our system to learn when to listen. This attention model is designed to ignore irrelevant sounds and focus more on important segments. Unlike the attention model in audio classification [9], which only focuses on positive segments (containing events), our temporal attention pays more attention to both positive segments and hard negative segments (containing only background, but easily misclassified as events), because they should be further differentiated.

The output of the CNN passes through a fully-connected layer with N_t hidden units, followed by an activation unit (sigmoid, ReLU, or softmax). Then a global max-pooling on the frequency axis is used to get one weight for each segment. These attention weights are normalized along the time axis. In our experiments, this normalization has shown great effectiveness because it takes into account the variation of the weight factors along the time axis instead of considering only the current segment. Then we multiply the temporal attention weights with the output of the fully-connected layer after the bi-GRU, and a sigmoid function is used to normalize the probabilities to [0, 1]. The final output can be computed as follows:

    \hat{a}_t = \max_{n \in \{1, 2, \ldots, N_t\}} \{ \sigma(W_n C_t + b_n) \},    (1)

    a_t = \frac{T \hat{a}_t}{\sum_t \hat{a}_t},    (2)

    y_t = \frac{1}{1 + \exp(-a_t h_t)},    (3)

where σ(·) is an activation function, C_t denotes the output of the CNN, W_n and b_n represent the weights and bias for the n-th hidden unit respectively, n ∈ {1, 2, ..., N_t}, N_t is the number of hidden units in the temporal attention model, \hat{a}_t is the candidate temporal attention weight, T is the total number of segments in an audio, a_t is the normalized temporal attention weight, h_t is the output of the fully-connected layer after the bi-GRU, and y_t is the final output probability.

2.5. Frequential attention model

Apart from the temporal attention model, we propose a frequential attention model. Various sound events may have different spectral characteristics, so we assume that different frequency components should receive different amounts of attention. The input Fbank feature passes through a fully-connected layer with N_f hidden units, followed by an activation function (sigmoid, ReLU, or softmax). Here, N_f is set to 128 to correspond with the number of mel-filters. The output is then normalized along the frequency axis to get the frequential attention weights. Finally, an element-wise multiplication is applied between the frequential attention weights and the input Fbank feature before the feature is fed into the CRNN architecture. The weighted feature is computed as follows:

    \hat{M}_{n,t} = \sigma(V_n F_t + c_n),    (4)

    M_{n,t} = \frac{N_f \hat{M}_{n,t}}{\sum_n \hat{M}_{n,t}},    (5)

    \tilde{F}_t = M_t \otimes F_t,    (6)

where σ(·) is an activation function, F_t is the input acoustic feature, V_n and c_n represent the weights and bias for the n-th hidden unit respectively, \hat{M}_{n,t} is the candidate frequential attention weight, M_{n,t} is the normalized frequential attention weight, ⊗ represents element-wise multiplication, and \tilde{F}_t is the weighted feature.
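To make equations (1)-(6) concrete, the following NumPy sketch evaluates both attention models for a single recording. The parameter names (W, b, V, c) follow the notation above, but the code and the array shapes are our illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frequential_attention(F, V, c):
    """Eqs. (4)-(6). F: (T, N_f) Fbank feature, V: (N_f, N_f), c: (N_f,)."""
    M_hat = sigmoid(F @ V.T + c)                                   # candidate weights, (T, N_f)
    M = M_hat.shape[1] * M_hat / M_hat.sum(axis=1, keepdims=True)  # normalize over frequency
    return M * F                                                   # element-wise weighted feature

def temporal_attention(C, W, b, h):
    """Eqs. (1)-(3). C: (T, D) CNN output, W: (N_t, D), b: (N_t,),
    h: (T,) output of the fully-connected layer after the bi-GRU."""
    a_hat = np.max(sigmoid(C @ W.T + b), axis=1)   # one candidate weight per segment
    a = a_hat.shape[0] * a_hat / a_hat.sum()       # normalize over time (Eq. 2)
    return sigmoid(a * h)                          # per-segment probability y_t (Eq. 3)

# Toy shapes only; real values come from the trained network.
T, N_f, D, N_t = 375, 128, 256, 32
F_weighted = frequential_attention(np.random.randn(T, N_f),
                                   np.random.randn(N_f, N_f), np.zeros(N_f))
y = temporal_attention(np.random.randn(T, D), np.random.randn(N_t, D),
                       np.zeros(N_t), np.random.randn(T))
```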
3. EXPERIMENTS

3.1. Dataset
We demonstrate the proposed model on task 2 of the DCASE 2017 Challenge [1]. The task dataset consists of isolated sound events for each target class and recordings of everyday acoustic scenes that serve as backgrounds [2]. There are three target event classes: babycry, glassbreak, and gunshot. A synthesizer for creating mixtures at different event-to-background ratios is also provided. The dataset is comprised of a development dataset and an evaluation dataset; the development dataset itself consists of two parts, a train subset and a test subset. Participants are allowed to use any combination of the provided data for training and evaluate their models on the test subset of the development dataset. The ranking of submitted systems is based on their performance on the evaluation dataset. Detailed information about this task and dataset is available in [1][2].

We use the synthesizer to generate 3000 mixtures for each class. The event-to-background ratios are -6, 0, and 6 dB, and the event presence probability is set to 0.9 (default value: 0.5) in order to gain more positive samples and mitigate the problem of data imbalance. We use the development test subset to optimize our model and finally evaluate it on the evaluation dataset.
3.2. Experimental setup

Our model is trained using Adam [13] with a learning rate of 0.001. Due to the data imbalance, we use a weighted cross-entropy loss function to reduce the deletion error. The loss function is computed as follows:

    \mathrm{Loss} = -\frac{1}{N} \sum_t \left[ w \, \hat{y}_t \log(y_t) + (1 - \hat{y}_t) \log(1 - y_t) \right],    (7)

where N is the number of segments, y_t is the output score of each segment, \hat{y}_t is the ground-truth label, and w is the loss weight for positive samples. In our experiments, the value of w is 10.
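As an illustration of Eq. (7), a minimal NumPy sketch of this weighted cross-entropy might look as follows; it is our own transcription of the formula, not the authors' training code.

```python
import numpy as np

def weighted_cross_entropy(y_pred, y_true, w=10.0, eps=1e-7):
    """Eq. (7): cross-entropy averaged over segments, with positive (event)
    segments weighted by w to counter the class imbalance.

    y_pred: predicted probabilities y_t, shape (N,)
    y_true: ground-truth labels in {0, 1}, shape (N,)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # numerical stability
    loss = -(w * y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()
```

In a PyTorch training loop, roughly the same effect can be obtained with torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(10.0)) applied to the pre-sigmoid scores.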
In order to accelerate training, we adopt a pre-training strategy: we first train the baseline CRNN for 10 epochs and then use the pre-trained CRNN to initialize the weights when training the proposed model. The training is stopped after 200 epochs. The batch size is 64. The number of hidden units in the temporal attention model, N_t, is 32, and the number of GRU units U is 32. Because our system performs a 0/1 classification, we use sigmoid and ReLU activations in the attention models. According to the experimental results, our system achieves the best performance with ReLU activation in the temporal attention model and sigmoid activation in the frequential attention model.

3.3. Evaluation metric

We evaluate our method with two event-based metrics: the event-based error rate (ER) and the event-based F-score. Both metrics are computed as defined in [14], using a collar of 500 ms and considering only the event onset. If the output correctly predicts the presence of the target event and its onset, we count it as a correct detection; the onset is considered accurate only when it is predicted within 500 ms of the actual onset time. ER is the sum of the deletion error and the insertion error, and the F-score is the harmonic mean of precision and recall. We compute these metrics using the sed_eval toolbox [14] provided by the DCASE organizers.
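For intuition, the following simplified sketch computes these event-based metrics for the single-event-per-recording setting of this task (onset-only matching with a 500 ms collar, ER as the sum of deletion and insertion rates, F-score as the harmonic mean of precision and recall). The official numbers are computed with the sed_eval toolbox [14]; this is only an approximation of its behaviour for this special case.

```python
def event_based_metrics(references, estimates, collar=0.5):
    """references / estimates: one onset time (in seconds) per recording,
    or None when no event is annotated / detected."""
    tp = fp = fn = 0
    for ref, est in zip(references, estimates):
        if ref is not None and est is not None and abs(ref - est) <= collar:
            tp += 1                       # correct detection: onset within the collar
        else:
            if est is not None:
                fp += 1                   # insertion error
            if ref is not None:
                fn += 1                   # deletion error
    n_ref = sum(r is not None for r in references)
    error_rate = (fn + fp) / max(n_ref, 1)            # deletion rate + insertion rate
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-9)
    return error_rate, f_score
```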
4. RESULTS

4.1. Experimental results
The performance of the proposed models and other methods, in terms of ER and F-score, is shown in Table 1. The results show that the temporal attention model improves the performance of the bi-GRU based CRNN baseline, and the frequential attention model brings a further improvement. Compared with the baseline, the proposed method improves the performance on all classes on both the development dataset and the evaluation dataset.

Compared with other state-of-the-art methods, the performance of our model is also competitive. Note that both of the top 2 teams adopted model ensembles. Lim et al. [3] combined the output probabilities of more than four models with different time steps and different data mixtures to make the final decision, and Cakir et al. [4] utilized an ensemble of seven architectures. We achieve comparable results on the development dataset without any model ensemble. Moreover, the average ER only increases slightly from 0.09 to 0.13 on the evaluation dataset, so we believe that our proposed model has a better capability of generalization. The proposed model achieves the lowest average ER (0.13) and the highest average F-score (93.4%) on the evaluation dataset, outperforming all other methods.

Fig. 3. Visualization of attention models: (a) temporal attention weights; (b) frequential attention weights.
4.2. Visualization of attention weights

In order to learn more about our attention models, we visualize the weights of both the temporal attention model and the frequential attention model. Figure 3 presents a good example of what our proposed temporal-frequential attention model has actually learnt: Figure 3 (a) and (b) are visualizations of the temporal attention weights and the frequential attention weights respectively.

In Figure 3, (i) is the mel-spectrogram of an audio in the evaluation dataset. In this audio, a babycry occurs from 23.13 s to 26.16 s with a "bus" background, and there is a "beep" sound at around the 9th second. In (ii), the blue line denotes the output probability and the orange line denotes the temporal attention weights. We can notice that the weight value is larger when the "beep" and the "babycry" occur, which conforms with our previous assumption that the temporal attention model gives more attention to positive segments and hard negative segments. (iii) is the visualization of the frequential attention weights and (iv) is the spectrogram of the weighted feature. We can see that the frequential attention weights are larger in the low-frequency area, which means that our frequential attention pays less attention to high-frequency components. It therefore acts like a low-pass filter, allowing the frequential attention model to ignore some high-frequency noise.
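A figure in the style of Figure 3 can be produced with a few matplotlib calls. The sketch below assumes the Fbank feature, the output probabilities, and both sets of attention weights have already been extracted from a trained model; the random arrays are placeholders for those values.

```python
import numpy as np
import matplotlib.pyplot as plt

T, n_mels = 375, 128
fbank = np.random.randn(T, n_mels)      # placeholder for the input Fbank feature
probs = np.random.rand(T)               # placeholder for output probabilities
temporal_w = np.random.rand(T)          # placeholder for temporal attention weights
freq_w = np.random.rand(T, n_mels)      # placeholder for frequential attention weights

fig, axes = plt.subplots(4, 1, figsize=(8, 9), sharex=True)
axes[0].imshow(fbank.T, aspect="auto", origin="lower")             # (i) mel-spectrogram
axes[0].set_title("(i) input Fbank feature")
axes[1].plot(probs, label="output probability")                    # (ii)
axes[1].plot(temporal_w, label="temporal attention weight")
axes[1].set_title("(ii) output probability and temporal attention weight")
axes[1].legend(loc="upper right")
axes[2].imshow(freq_w.T, aspect="auto", origin="lower")            # (iii)
axes[2].set_title("(iii) frequential attention weights")
axes[3].imshow((freq_w * fbank).T, aspect="auto", origin="lower")  # (iv)
axes[3].set_title("(iv) weighted feature")
axes[3].set_xlabel("segment index")
plt.tight_layout()
plt.show()
```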
5. CONCLUSION
In this paper, we proposed a temporal-frequential attention model for sound event detection. The proposed model is tested on DCASE 2017 task 2, and our system achieves the best performance on the DCASE evaluation dataset even without model ensemble. In addition to sound event detection, our temporal-frequential attention model can be applied to speaker verification, speech recognition, and audio tagging in future research.
6. REFERENCES

[1] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: tasks, datasets and baseline system," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 85–92.

[2] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," in 24th European Signal Processing Conference (EUSIPCO), Aug. 2016, pp. 1128–1132.

[3] H. Lim, J. Park, K. Lee, and Y. Han, "Rare sound event detection using 1D convolutional recurrent neural networks," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 80–84.

[4] E. Cakir and T. Virtanen, "Convolutional recurrent neural networks for rare sound event detection," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, 2017, pp. 803–806.

[5] C.-C. Kao, W. Wang, M. Sun, and C. Wang, "R-CRNN: Region-based convolutional recurrent neural network for audio event detection," arXiv preprint arXiv:1808.06627, Aug. 2018.

[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," arXiv preprint arXiv:1706.03762, June 2017.

[7] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3242–3250.

[8] F. A. Rezaur rahman Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5359–5363.

[9] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio Set classification with attention model: A probabilistic perspective," CoRR, vol. abs/1711.00927, 2017.

[10] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456.

[11] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[13] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[14] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.