A Global-local Attention Framework for Weakly Labelled Audio Tagging
Helin Wang, Yuexian Zou*, Wenwu Wang
ADSPLAB, School of ECE, Peking University, Shenzhen, China
Peng Cheng Laboratory, Shenzhen, China
Centre for Vision, Speech and Signal Processing, University of Surrey, UK
ABSTRACT
Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have used the multiple instance learning (MIL) framework and exploited the information of the whole audio clip by MIL pooling functions. However, detailed information of sound events, such as their durations, may not be considered under this framework. To address this issue, we propose a novel two-stream framework for audio tagging that exploits both the global and local information of sound events. The global stream analyzes the whole audio clip in order to capture the local clips that need to be attended, using a class-wise selection module. These clips are then fed to the local stream to exploit the detailed information for a better decision. Experimental results on AudioSet show that our proposed method can significantly improve the performance of audio tagging under different baseline network architectures.
Index Terms — Audio tagging, weak labels, two-stream framework, class-wise attentional clips
1. INTRODUCTION
Audio tagging is a technique for predicting the presence or absence of sound events within an audio clip [1]. The Detection and Classification of Acoustic Scenes and Events (DCASE) challenges [2, 3] provide strongly labelled datasets for audio tagging, where the onset and offset times of sound events are annotated. However, such annotation is time-consuming and hard to obtain, and these audio tagging datasets are relatively small. Recently, weakly labelled audio tagging has attracted increasing interest in the audio signal processing community [4, 5], where the datasets (e.g. AudioSet [6]) only annotate the types of sound events present in each audio clip but do not provide any timestamp information about their onsets and offsets.
This paper was partially supported by Shenzhen Science & Technology Fundamental Research Programs (No: JCYJ20170817160058246, JCYJ20180507182908274 & JSGG20191129105421211).
*Corresponding author: [email protected]
As the durations of sound events can be very different and overlapping sound events often occur in an audio clip, audio tagging with weakly labelled data is a challenging problem. A popular approach to this problem is based on multiple instance learning (MIL) [7, 8]. In MIL, the input sequence is treated as a bag and split into a set of instances, where multiple instances in the same bag share the same labels. There are two main MIL strategies, i.e. the instance-level approach [9] and the embedding-level approach [10, 11]. The embedding-level approach integrates the instance-level feature representations into a bag-level contextual representation and then directly carries out bag-level classification, which shows better performance than the instance-level approach [12]. The methods for aggregating information from the instances play an important part in MIL frameworks. The default choices are global max pooling (GMP) [13] and global average pooling (GAP) [14], but they are often not flexible enough for practical applications and may lose detailed information relevant to acoustic events. For example, GMP cannot capture the information of a long-duration event such as keyboard typing well, while a short-duration event such as mouse click may be ignored by GAP. More recently, attention mechanisms have been employed to detect the occurrence of sound events [15–18] and have achieved promising results. However, these methods attempt to make a decision on the whole audio clip and are limited in capturing the detailed information of the acoustic events.

To address this issue, in this paper we propose a novel global-local attention (GL-AT) framework for weakly labelled audio tagging, in which the global and local information of an audio clip are modelled successively. Our method is inspired by the behaviors of human annotators of an audio dataset [6]. At first, an annotator may glimpse over an audio clip roughly and determine some possible categories and their temporal regions. These possible regions then guide the annotator to make refined decisions on specific categories following a region-by-region inspection. In a similar fashion, we solve audio tagging with a two-stream framework consisting of a global stream and a local stream. More specifically, the global stream takes an audio clip as the input to a deep neural network and learns global representations supervised by the weak labels. Several class-wise attentional sub-clips are then selected according to the global representations and fed to another neural network to learn local representations. The final class distributions are obtained by aggregating the predicted global class distributions and local class distributions. The optimization of the local stream is influenced by the global stream, because the sub-clips are selected according to the prediction results of the global stream. In turn, the local stream improves the optimization of the global stream, in which both sub-clip selection and classification are performed.

Fig. 1. Overall architecture of our two-stream framework for weakly labelled audio tagging (GL-AT).

The contributions of this paper can be summarized in three aspects. Firstly, we present a two-stream global-local attention framework that can efficiently recognize sound events within an audio clip with weakly labelled data. Secondly, we propose an effective class-wise sub-clip selection module which can dynamically generate several attentional sub-clips with low complexity and high diversity.
Thirdly, experimental results show that our proposed framework can significantly improve the performance of AudioSet tagging and can be used with different baselines.
2. PROPOSED METHOD
In this section, we first present the two-stream framework for weakly labelled audio tagging (i.e. GL-AT). Then, a class-wise clip selection module is presented, which bridges the gap between the global and local streams.
The overall architecture of our proposed framework is shown in Fig. 1.

Global Stream. Given an input audio clip A ∈ R^{T×S}, where T and S are the duration (e.g. 10 s) and the sampling rate, respectively, let us denote its corresponding label as y ∈ R^L, where y_i ∈ {0, 1} indicates whether label i appears or not and L denotes the number of labels. The feature extractor F(·) is applied first, which can be either a convolutional neural network (CNN) [19–23] or a convolutional recurrent neural network (CRNN) [9]. We assume that M = F(A; θ_F) is the global frame-wise feature after the feature extractor, where θ_F denotes the parameters of the feature extractor and M ∈ R^{T×C}. Here, T and C denote the number of output frames and the dimension of the features for each frame, respectively. Then a global pooling function P(·) is applied to obtain the global clip-wise feature M′ ∈ R^{1×C}. Following [22], both maximum and average operations are used for global pooling. In order to obtain the prediction score ŷ ∈ R^L, a classifier C(·) containing two fully-connected layers is applied [22]:

ŷ = C(M′; θ_C)    (1)

where θ_C denotes the parameters of the classifier. We then use a sigmoid function σ(·) to map ŷ into the range [0, 1], obtaining the global clip-wise prediction score ŷ_g ∈ R^L:

ŷ_g = 1 / (1 + exp(−ŷ))    (2)
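To make the pipeline concrete, below is a minimal PyTorch sketch of the global stream. The module name GlobalStream, the layer sizes and the summed max-plus-average pooling are illustrative assumptions under the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GlobalStream(nn.Module):
    """Sketch of the global stream: feature extractor F, global pooling P,
    and a two-layer classifier C, following Eqs. (1)-(2)."""

    def __init__(self, feature_extractor: nn.Module, feat_dim: int, num_labels: int):
        super().__init__()
        self.feature_extractor = feature_extractor  # F(.; theta_F): audio -> (B, T, C)
        self.classifier = nn.Sequential(            # C(.; theta_C): two FC layers [22]
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_labels),
        )

    def forward(self, audio: torch.Tensor):
        M = self.feature_extractor(audio)              # global frame-wise feature (B, T, C)
        M_prime = M.max(dim=1).values + M.mean(dim=1)  # global pooling P: max + average [22]
        y_hat = self.classifier(M_prime)               # Eq. (1): clip-wise logits
        y_g = torch.sigmoid(y_hat)                     # Eq. (2): clip-wise score in [0, 1]
        return y_g, M                                  # M is reused for clip selection
```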
Local Stream. Let {A_1, A_2, ..., A_N} be a set of N local clips selected from the input audio clip A. These local clips have the same duration (shorter than that of A) and are fed to another feature extractor with the same structure as that of the global stream. Then the local prediction scores {ŷ_1, ŷ_2, ..., ŷ_N} are obtained using (1) and (2). Finally, these local prediction scores are aggregated by the global pooling function:

ŷ_l = P(ŷ_1, ŷ_2, ..., ŷ_N)    (3)

where ŷ_l ∈ R^L is the local clip-wise prediction score. Note that this two-stream framework can be trained end-to-end and transferred easily to different feature extractor networks. During the training stage, the two streams are jointly trained. At the inference stage, we fuse the predictions of the global stream (ŷ_g) and the local stream (ŷ_l) with the global pooling function to generate the final prediction score for the audio clip.
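The aggregation in (3) and the global-local fusion at inference could be sketched as follows, continuing the sketch above. The helper pool_scores and its halved max-plus-average combination (which keeps pooled probabilities in [0, 1]) are hypothetical choices, since the paper only states that the same global pooling function P is reused on the score vectors.

```python
import torch

def pool_scores(scores: torch.Tensor, dim: int) -> torch.Tensor:
    # Max + average pooling over `dim`, halved so that pooled
    # probabilities stay in [0, 1] (an assumption for this sketch).
    return 0.5 * (scores.max(dim=dim).values + scores.mean(dim=dim))

def predict(global_model, local_model, audio, local_clips):
    y_g, _ = global_model(audio)                    # global clip-wise score (B, L)
    local = torch.stack([local_model(c)[0] for c in local_clips], dim=1)  # (B, N, L)
    y_l = pool_scores(local, dim=1)                 # Eq. (3): local clip-wise score
    return pool_scores(torch.stack([y_g, y_l], dim=1), dim=1)  # fused final score
```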
Two-stream Learning. Given a training dataset {(A_i, y_i)}_{i=1}^D, where D denotes the number of training examples, A_i is the i-th audio clip and y_i represents its corresponding labels, the overall loss function of our two-stream learning is formulated as the sum over the two streams:

L = L_g + L_l    (4)

where L_g and L_l represent the global and the local loss, respectively. Specifically, the binary cross-entropy loss is applied for both streams:

L_g = − Σ_{i=1}^D Σ_{j=1}^L [ y_i^j log(ŷ_{g,i}^j) + (1 − y_i^j) log(1 − ŷ_{g,i}^j) ]    (5)

L_l = − Σ_{i=1}^D Σ_{j=1}^L [ y_i^j log(ŷ_{l,i}^j) + (1 − y_i^j) log(1 − ŷ_{l,i}^j) ]    (6)

where ŷ_{g,i}^j and ŷ_{l,i}^j are the prediction scores of the j-th category of the i-th audio clip from the global stream and local stream, respectively. Adam [24] is employed as the optimizer.

Class-wise Attentional Clips Selection. The temporal locations of potential audio clips are not available under weak labels, so in this paper we propose a simple but efficient method to dynamically generate candidate audio clips following two basic principles. On the one hand, the diversity of the candidate clips should be as high as possible, to cover all possible sound events within an audio clip. On the other hand, the number of candidate clips should be as small as possible, to reduce computational complexity and storage space.

In order to generate the candidate clips, we first calculate the class-wise activation. A classifier is directly applied to the global frame-wise feature M, and the global frame-wise prediction score S_g ∈ R^{T×L} can be obtained using (1) and (2), where T denotes the number of frames within an audio clip. The class-wise activation of the i-th category is denoted as S_g^i ∈ R^T, which indicates the importance of the frames leading to the classification of an audio clip to class i.

Such class-wise activation S_g is discriminative among different categories, and we can employ the activation to localize the potential candidate clips. However, there are often a large number of categories (e.g. 527 categories in AudioSet), while only a small number of them appear in a given audio clip. If the activations of all the categories were used, the generated clips would be too many to be computationally efficient. To address this issue, we sort the predicted scores in descending order and select the top-N class-wise attentional activations (denoted as {S_g^i}_{i=1}^N). N is a hyperparameter whose choice is discussed in our experiments.

The value of S_g^i(t) represents the probability that the sub-clip belongs to the i-th category at timestamp t. In order to localize the clips of interest with low computational complexity, we employ a temporal window of size τ to select the candidate clips. For each S_g^i, the frame with the maximum activation (denoted as m) is set as the middle frame, and the range of the candidate clip is set as [m − τ/2, m + τ/2]. If the maximum boundary of the candidate clip exceeds the duration of the audio clip, the range is reset to [T − τ, T], where T denotes the duration of the audio clip. Likewise, if the minimum boundary is less than 0, we reset the range to [0, τ]. Here τ is a hyperparameter, which is discussed in our experiments.
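The selection procedure above amounts to a top-N argmax search with boundary clamping. A sketch is given below; the frame hop, the array shapes and the use of the clip-wise scores for ranking categories are assumptions made for illustration.

```python
import numpy as np

def select_candidate_clips(S_g: np.ndarray, clip_scores: np.ndarray,
                           N: int, tau: float, clip_len: float, hop: float):
    """Class-wise attentional clip selection (sketch).

    S_g:         (T_frames, L) frame-wise class activations.
    clip_scores: (L,) clip-wise scores used to rank categories.
    tau:         temporal window size in seconds.
    clip_len:    audio clip duration in seconds (10 s for AudioSet).
    hop:         seconds covered by one frame.
    Returns N (start, end) ranges in seconds.
    """
    top_classes = np.argsort(clip_scores)[::-1][:N]   # top-N attentional categories
    ranges = []
    for c in top_classes:
        m = float(np.argmax(S_g[:, c])) * hop         # frame of maximum activation
        start, end = m - tau / 2, m + tau / 2         # centre the window on m
        if end > clip_len:                            # clamp at the right boundary
            start, end = clip_len - tau, clip_len
        if start < 0:                                 # clamp at the left boundary
            start, end = 0.0, tau
        ranges.append((start, end))
    return ranges
```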
3. EXPERIMENTS
In this section, we report experimental results and comparisons that demonstrate the effectiveness of the proposed method. In addition, ablation studies are carried out to show the contribution of the crucial components.
Dataset. AudioSet [6] is used in our experiments, which is a large-scale dataset with over 2 million 10-second audio clips from YouTube videos, covering a total of 527 categories. The same dataset divisions and pre-processing approaches (e.g. re-sampling and data-balancing) are applied as in [22].

Metrics. Mean average precision (mAP), mean area under the curve (mAUC) and d-prime are used as our evaluation metrics, which are the most commonly used metrics for audio tagging. These metrics are calculated on individual categories and then averaged across all categories.
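For reference, these metrics can be computed per class and then averaged as sketched below; deriving d-prime from the class-wise AUC via the inverse normal CDF follows common AudioSet evaluation practice, though the paper does not spell out the exact procedure.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import average_precision_score, roc_auc_score

def tagging_metrics(y_true: np.ndarray, y_score: np.ndarray):
    """y_true: (num_clips, L) binary labels; y_score: (num_clips, L) scores.
    Returns macro-averaged mAP, mAUC and d-prime over the L classes."""
    L = y_true.shape[1]
    ap = np.array([average_precision_score(y_true[:, j], y_score[:, j]) for j in range(L)])
    auc = np.array([roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(L)])
    d_prime = np.sqrt(2.0) * norm.ppf(auc)   # d' = sqrt(2) * Phi^{-1}(AUC), per class
    return ap.mean(), auc.mean(), d_prime.mean()
```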
Implementation Details.
We compare the proposed method with recent state-of-the-art models, including TALNet [9], CNN10 [22], ResNet38 [22] and AT-SCA [25]. Specifically, each of these models is applied as the global stream of our framework, and the local stream uses the same feature extractor and classifier as the global stream. We thus obtain the results of these models under our two-stream learning framework, denoted TALNet + GL-AT, CNN10 + GL-AT, ResNet38 + GL-AT and AT-SCA + GL-AT (code available at https://github.com/WangHelin1997/GL-AT). All networks are trained with the same batch size and the same total number of iterations. Unless otherwise stated, the hyperparameters N and τ are fixed to the values selected in the ablation study below.

Table 1 compares the performance of our proposed method (GL-AT) with other state-of-the-art methods on AudioSet.

Table 1. Accuracy comparisons of our method and state-of-the-art methods on AudioSet.
Method                   mAP     mAUC    d-prime
TALNet (2019) [9]        0.362   0.965   2.554
TALNet*                  –       –       –
TALNet + GL-AT (ours)    0.401   0.970   2.659
CNN10 (2019) [22]        0.380   0.971   2.678
CNN10*                   –       –       –
CNN10 + GL-AT (ours)     0.408   0.974   2.742
ResNet38 (2019) [22]     0.434   0.974   2.737
ResNet38*                –       –       –
ResNet38 + GL-AT (ours)  0.438   0.975   2.774
AT-SCA (2020) [25]       0.390   0.970   2.652
AT-SCA*                  –       –       –
AT-SCA + GL-AT (ours)    0.413   0.971   2.677

* The results of TALNet, CNN10, ResNet38 and AT-SCA are reproduced by us, with all experimental setups the same as in the original papers [9, 22, 25].
Table 2. Ablation study of the two streams in CNN10 + GL-AT.

Method    Global    Local    mAP    mAUC
CNN10       √                –      –
GL-AT       √                –      –
GL-AT                 √      –      –
GL-AT       √         √      –      –

In order to explore the effectiveness of the two streams, we jointly train the global and local streams in GL-AT and, at the inference stage, demonstrate the influence of each stream in Table 2. Thanks to the joint training strategy, GL-AT outperforms the baseline method (CNN10) even with only the global stream. This is because the class-wise attentional clip selection module connects the two streams, so the optimization of the global stream is influenced by the local stream, which leads to better robustness. In addition, using the local stream alone performs better than using the global stream alone, because the local stream is able to focus on the detailed information of the audio. Nonetheless, the global stream plays an important role in guiding the learning of the local stream, and employing both the global and local streams achieves the best results in our experiments.

Fig. 2. Accuracy comparisons of CNN10 + GL-AT with different values of N and τ (metric: mAP).

Furthermore, we explore the influence of the number of local clips (N) and of their duration (τ). As shown in Fig. 2(a), the mAP performance shows an upward trend as N gradually increases, which means that more local clips help to improve the audio tagging performance. However, more clips also increase the computational cost, because all the selected clips are fed to the networks. The performance tends to become stable for the larger values of N tested, so we choose N to balance accuracy and complexity. In addition, we test different values of τ and report the results in Fig. 2(b). As τ increases, the accuracy first rises and then drops, peaking at an intermediate window length of a few seconds (similar to [26]). We argue that such a length matches the duration of most sound events. If the duration of the clips is too short, the complete information of a sound event cannot be captured; on the other hand, if the duration is too long, the detailed information is diluted. In the extreme case where the duration is 10 s (the whole length of the audio clip), the local stream degenerates into the global stream and no longer captures the detailed information intended.
4. CONCLUSIONS
We have presented a two-stream framework that takes advantage of both the global and local information of audio, which resembles the multi-task learning principle. Experimental results on AudioSet show that our method can boost the performance of different state-of-the-art methods in audio tagging.

5. REFERENCES

[1] T. Virtanen, M. D. Plumbley, and D. Ellis, Computational Analysis of Sound Scenes and Events. Springer, 2018.
[2] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: An IEEE AASP Challenge," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. IEEE, 2013, pp. 1–4.
[3] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
[4] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," in Workshop on Detection and Classification of Acoustic Scenes and Events, 2018.
[5] N. Turpault, R. Serizel, A. P. Shah, and J. Salamon, "Sound event detection in domestic environments with weakly labeled data and soundscape synthesis," in Workshop on Detection and Classification of Acoustic Scenes and Events, 2019.
[6] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[7] J. Amores, "Multiple instance classification: Review, taxonomy and comparative study," Artificial Intelligence, vol. 201, pp. 81–105, 2013.
[8] S.-Y. Tseng, J. Li, Y. Wang, F. Metze, J. Szurley, and S. Das, "Multiple instance deep learning for weakly supervised small-footprint audio event detection," in Proc. Interspeech, 2018, pp. 3279–3283.
[9] Y. Wang, J. Li, and F. Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 31–35.
[10] C.-C. Kao, M. Sun, W. Wang, and C. Wang, "A comparison of pooling methods on LSTM models for rare acoustic event classification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 316–320.
[11] L. Lin, X. Wang, H. Liu, and Y. Qian, "Specialized decision surface and disentangled feature for weakly-supervised polyphonic sound event detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1466–1478, 2020.
[12] X. Wang, Y. Yan, P. Tang, X. Bai, and W. Liu, "Revisiting multiple instance neural networks," Pattern Recognition, vol. 74, pp. 15–24, 2018.
[13] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Is object localization for free? – weakly-supervised learning with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 685–694.
[14] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[15] Q. Kong, C. Yu, Y. Xu, T. Iqbal, W. Wang, and M. D. Plumbley, "Weakly labelled AudioSet tagging with attention neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1791–1802, 2019.
[16] C. Yu, K. S. Barsim, Q. Kong, and B. Yang, "Multi-level attention model for weakly supervised audio classification," in Workshop on Detection and Classification of Acoustic Scenes and Events, 2018.
[17] M. Ilse, J. M. Tomczak, and M. Welling, "Attention-based deep multiple instance learning," in Proceedings of the 35th International Conference on Machine Learning (ICML). International Machine Learning Society (IMLS), 2018, pp. 3376–3391.
[18] H. Wang, Y. Zou, D. Chong, and W. Wang, "Environmental sound classification with parallel temporal-spectral attention," in Proc. Interspeech, 2020, pp. 821–825.
[19] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "CNN architectures for large-scale audio classification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
[20] K. Choi, G. Fazekas, and M. Sandler, "Automatic tagging using deep convolutional neural networks," in Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 805–811.
[21] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, "Large-scale weakly supervised audio classification using gated convolutional neural network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 121–125.
[22] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.
[23] L. Ford, H. Tang, F. Grondin, and J. Glass, "A deep residual network for large-scale acoustic scene analysis," in Proc. Interspeech, 2019, pp. 2568–2572.
[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[25] S. Hong, Y. Zou, W. Wang, and M. Cao, "Weakly labelled audio tagging via convolutional networks with spatial and channel-wise attention," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 296–300.
[26] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 1041–1044.