Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection
Soham Deshmukh, Bhiksha Raj, Rita Singh
Abstract—Weakly labelled learning has garnered a lot of attention in recent years due to its potential to scale Sound Event Detection (SED) and is formulated as a Multiple Instance Learning (MIL) problem. This paper proposes a Multi-Task Learning (MTL) framework for learning from weakly labelled audio data which encompasses the traditional MIL setup. To show the utility of the proposed framework, we use reconstruction of the input Time-Frequency (T-F) representation as the auxiliary task. We show that the chosen auxiliary task de-noises the internal T-F representation and improves SED performance under noisy recordings. Our second contribution is a two-step Attention Pooling mechanism. By having two steps in the attention mechanism, the network retains better T-F level information without compromising SED performance. Visualising the first-step and second-step attention weights helps in localising the audio events in the T-F domain. For evaluating the proposed framework, we remix the DCASE 2019 Task 1 acoustic scene data with DCASE 2018 Task 2 sound event data at 0, 10 and 20 dB SNR, resulting in a multi-class weakly labelled SED problem. The proposed framework outperforms existing benchmark models over all SNRs, with 22.3%, 12.8% and 5.9% improvement over the benchmark model at 0, 10 and 20 dB SNR respectively. We carry out an ablation study to determine the contribution of the auxiliary task and of the two-step Attention Pooling to the SED performance improvement. The code is publicly released.
Index Terms—weakly labelled sound event detection, multi-task learning, deep neural networks, attention, autoencoder
I. INTRODUCTION
The goal of Sound Event Detection (SED) is to determine the presence, nature and temporal location of sound events in audio signals. This is usually accomplished using machine learning and signal processing algorithms, and is of great benefit to applications like wearable devices [1], mobile robots [2] and public security [3]. Many SED algorithms rely on strongly labelled data [4] [5] [6] for training in order to perform accurate audio event detection. Here the term strongly labelled refers to audio in which events are annotated with their corresponding onset and offset times. Usually, only the audio segments within the onset-offset time boundaries are used as training data, while segments outside the annotated onset and offset time boundaries are considered to be non-target events. However, producing strongly labelled data for SED is quite expensive in terms of the expertise, time and human resources required for the purpose.
S. Deshmukh is with the Department of Electrical and Computer Engineering, Carnegie Mellon University. B. Raj is Professor at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. R. Singh is Associate Research Professor at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

For instance, the annotations produced are often restricted to minutes for every few hours [6] of actual data and time spent listening to it. To address the scalability problem associated with generating strongly labelled data, researchers have developed methods to train SED models from unstructured, publicly available multimedia data. In this setting, the annotation generally takes the form of weak labels. The term 'weak' information or label refers to information which simply indicates the presence or absence of particular events in the video or audio and does not provide any information with respect to the number of audio events, the duration of events, or the time localisation of events in the audio or video. Kumar et al. [7] proposed a way of training SED models based on weakly labelled data, where event detection is framed as a Multiple-Instance Learning (MIL) problem [8] and labels are available for a 'bag' of instances and not for individual instances. Specifically for SED, this results in every audio clip being considered as a 'bag' of instances, where the individual instances are the segments of the audio clip and clip level event information (the label) is available.

A specific type of MIL formulation which provides benchmark performance is the neural MIL formulation. In the neural MIL formulation, the model consists of two sequential components: a temporal information aggregator followed by a pooling mechanism. In short, the first component of the model (neural networks) produces temporal predictions which are then aggregated by the second half of the model, usually a pooling operator, to produce audio clip level predictions. The temporal predictions are generally generated by encoding audio using a Convolutional Neural Network (CNN) [9] based architecture. The following pooling mechanism generally takes the form of Global Max Pooling (GMP), Global Average Pooling (GAP) [10], Global Weighted Rank Pooling (GWRP) [11] or Attention pooling [12] to obtain audio level predictions. How the pooling is performed has a significant effect on both SED performance and the intermediate Time-Frequency (T-F) representation obtained for each class. GMP and GAP consistently underestimate and overestimate the sound source time predictions respectively, and to overcome this problem smart and adaptive pooling mechanisms like Adaptive pooling [10], Global Weighted Rank Pooling (GWRP) [11] and Attention pooling [12] were developed. However, the developed pooling mechanisms still lack the granularity of prediction required for inference of each audio event in the T-F domain and are sometimes known to be unstable with the Binary Cross Entropy loss usually used for multi-class weakly labelled SED [12].

Fig. 1. Multiple-Instance Learning problem setup for Weakly Labelled SED. The setup is divided into three parts: input T-F representation, Segmentation Network and Classification Network, which outputs audio clip level sound event presence probabilities.
To make the MIL setup more flexible and combat the above-mentioned challenges, we propose a Multi-Task Learning (MTL) framework which uses a two-step attention pooling mechanism and a signal reconstruction based auxiliary task. The chosen auxiliary task results in a cleaned T-F representation of audio events, which can be further used for source separation. The two central contributions of the paper are:

• Multi-Task Learning formulation for WLSED: The paper introduces a Multi-Task Learning setup for Weakly Labelled Sound Event Detection (WLSED) where MIL based SED is the primary task and is assisted by an auxiliary task. The auxiliary task can be any secondary task whose labels are implicitly available or require external meta-information. We choose input T-F signal reconstruction as the auxiliary task as it does not require external labels or data. The input T-F reconstruction auxiliary task forces the CNN encoded representation to retain and enhance each sound source's temporal presence in the audio clip. Ablation studies are performed to quantify the effect of the chosen auxiliary task on the SED performance improvement obtained. To the best of our knowledge, this is the first work formulating Multi-Task Learning for weakly labelled SED data.

• Two-step Attention Pooling mechanism: The paper introduces a stable and interpretable two-step Attention Pooling mechanism along with the MTL formulation. This helps in bringing the intermediate T-F representation obtained from the initial CNN network to localised predictions of audio events in both the temporal and frequency domains. The first step of attention operates over Mel bins and learns the contribution of each Mel bin to each time step for each audio event. The second step of attention operates over time and learns the contribution of audio events at each time step. The audio clip level predictions are obtained by summing up the temporal contributions made by each audio event. This particular way of breaking the attention into two steps helps the network to easily focus on particular parts of the input and localise in both the time and frequency domains without making the training unstable, and leads to benchmark performance.

The code is publicly released and open-sourced. The paper is structured as follows: Section II introduces the Multiple-Instance Learning setup and previous work in weakly labelled SED. Section III introduces the Multi-Task Learning formulation with SED as the primary task. Section III-C goes into the details of the proposed 2-step Attention Mechanism. Section IV contains the experimental details about the dataset, feature extraction and network. Section V contains the weakly labelled SED results under different SNRs, the ablation test results and the end-to-end interpretable visualisation. Section VI contains future work and directions to build on this work, followed by the Conclusion in Section VII.
II. PROBLEM SETUP AND RELATED WORK
Weakly labelled learning in the context of SED is generally formulated as a Multiple Instance Learning (MIL) [8] problem where a single binary class label is assigned to a collection (bag) of similar training examples (instances). The overall learning setup used for WLSED is shown in Fig. 1.
A. MIL formulation for SED
Let the raw audio be represented as $\{x_i\}_{i=1}^{T}$, where $x_i$ is an individual frame out of the $T$ frames making up the audio. Let the features extracted from the raw audio $\{x_i\}_{i=1}^{T}$ be represented by $\{\hat{x}_i\}_{i=1}^{T}$. The extracted features constitute a bag $B = \{\hat{x}_i\}_{i=1}^{T}$. The MIL assumption states that the weak label of bag $B$ is $y = \max_i \{y_i\}_{i=1}^{T}$, where $y_i$ is the strong label corresponding to feature $\hat{x}_i$. Thus the training pairs of weakly labelled data consist of $\{B, y\}$. In neural MIL, the mapping of frame level features to audio level event predictions is learned by neural networks. The neural network can be divided into two smaller sequential neural networks. The first network learns a function $g_1$ which generates a T-F segmentation mask. Ideally, the output generated by this network is a T-F segmentation mask for each audio event. The paper will refer to this first network as the 'Segmentation Network'. This is followed by a second network which learns a mapping $g_2$ from T-F segmentation masks to probabilities of the corresponding sound events. The output of the second network is the audio clip level class probabilities. As the second network's role is to classify the T-F representation into the appropriate audio events, the paper will refer to this second network as the 'Classification Network'.
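As a toy illustration of the MIL assumption above, the following minimal sketch (which assumes hypothetical frame-level labels that are never actually available in the weakly labelled setting) shows how the weak clip-level label relates to the unknown strong labels:

```python
# Minimal sketch of the MIL assumption: the weak (clip-level) label is the
# element-wise max over the (unobserved) frame-level labels in the bag.
import numpy as np

T, K = 240, 41                      # frames per clip, number of event classes
frame_labels = np.zeros((T, K))     # y_i: strong labels (unknown in practice)
frame_labels[10:60, 3] = 1          # e.g. event 3 active in frames 10..59
frame_labels[120:180, 17] = 1       # e.g. event 17 active in frames 120..179

weak_label = frame_labels.max(axis=0)   # y = max_i y_i, shape (K,)
print(weak_label[[3, 17]])              # -> [1. 1.]
```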
B. Segmentation Network

The initial work [7] proposes an SVM max-margin formulation and a CNN based architecture as two alternatives for learning the segmentation mapping. To improve the quality of the segmentation maps learned, [13] [14] used a Time-Distributed CNN (TD-CNN) followed by Global Max Pooling (GMP) to pick out the locations of relevant temporal events. In most of the relevant work, log Mel spectrograms are used as features to represent the raw audio. [15]–[17] show that using a log Mel spectrogram representation of the input followed by a CNN based architecture leads to improved performance compared to machine learning methods like SVM. To better capture the temporal event information present in audio, [18] used a Convolutional Recurrent Neural Network (CRNN). A CRNN consists of CNN layers which operate on the log Mel spectrogram input to extract relevant source level features, followed by a Bidirectional Recurrent Neural Network (Bi-RNN) to capture the temporal context information between frames in the T-F representation. Recently, [12] showed that the size of the receptive field is more important than performing intermediate local pooling, and used atrous CNNs [19] with a global attention pooling layer to achieve benchmark performance. In this paper, we intend to achieve better explainability in the T-F domain without compromising SED performance, by means of a reconstruction based auxiliary task and 2-step Attention Pooling. The proposed methodology helps in denoising the internal T-F representations and provides audio event localisation in the temporal and frequency domains, along with achieving benchmark performance in WLSED.
C. Classification Network
The classification mapping takes the intermediate T-F representation as input and employs pooling operators such as Global Max Pooling (GMP), Global Average Pooling (GAP) [20], Global Weighted Rank Pooling (GWRP) [11], global attention pooling [21] [12] or other poolings [22] [23] [10], and even fully connected layers, to predict the presence of specific sound events in the audio. The GMP operator is known to under-predict sound events, as it only takes into account the most prominent feature while ignoring others. On the other hand, GAP is consistently known to over-predict sound events [11]. To address this challenge, [11] uses the GWRP operator, which can be seen as an extension of both GAP and GMP. The GWRP operator reduces to GMP when r = 0 and to GAP when r = 1, where r is a hyperparameter that varies according to the frequency of occurrence of sound events. However, any type of global pooling can only predict the time domain segmentation, not the T-F segmentation. To make the pooling operator more flexible and aware of each contributing location in the T-F domain, an attention mechanism was proposed [21] [12]. An attention mechanism is more flexible and weighs each contributing location in the T-F output to form the final predictions. However, training using the attention proposed in [21] [12] over the entire K × T × F output (where K is the number of sound events, T the number of time frames and F the number of frequency bins) is unstable with the commonly used Binary Cross Entropy loss. Both the challenges of fine-grained sound event contribution in the T-F domain and instability are addressed by the two-step Attention Pooling (2AP) mechanism proposed in this paper.
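For concreteness, the sketch below applies the global pooling operators discussed above to hypothetical frame-level class probabilities; the GWRP implementation follows the weighted-rank definition with decay hyperparameter r, so that r = 0 recovers GMP and r = 1 recovers GAP. The array shapes and toy input are assumptions for illustration only.

```python
# Clip-level pooling operators over frame-level class probabilities p of shape (T, K).
import numpy as np

def gmp(p):                      # Global Max Pooling: keeps only the peak frame
    return p.max(axis=0)

def gap(p):                      # Global Average Pooling: averages all frames
    return p.mean(axis=0)

def gwrp(p, r=0.5):              # Global Weighted Rank Pooling
    d = np.sort(p, axis=0)[::-1]              # sort frames per class, descending
    w = r ** np.arange(d.shape[0])            # rank-dependent weights r^(j-1)
    return (w[:, None] * d).sum(axis=0) / w.sum()

p = np.random.rand(240, 41)      # toy frame-level probabilities (T=240, K=41)
print(gmp(p).shape, gap(p).shape, gwrp(p).shape)   # each -> (41,)
```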
D. Multi-Task Learning
Multi-Task Learning has been associated with many other names; some of the common ones are joint learning, learning to learn, and learning with auxiliary tasks. Multi-Task Learning (MTL) is a type of inductive transfer, where the inductive bias generally takes the form of an auxiliary task which forces the network to learn a representation that jointly solves more than one task. This generally leads to solutions that generalise better [24] and has been shown to work across many applications of machine learning: natural language processing, speech recognition and computer vision [25] [26] [27]. In the audio domain, MTL has recently been applied to jointly learn features for multiple speech classification tasks, namely speaker identification, emotion classification and automatic speech recognition [28] [29], in which the shared network learns a representation that solves all the downstream tasks equally well. Our MTL setup employs a different formulation: rather than multiple tasks, we employ an auxiliary task which does not require additional data or labels and whose performance is not of interest in itself. The key is selecting appropriate auxiliary tasks for Multi-Task Learning; if incorrectly selected, the auxiliary task can hurt the performance of the primary task. To the best of our knowledge, this is the first work which formulates Multi-Task Learning for weakly labelled SED data and determines appropriate auxiliary tasks for the same.
III. PROPOSED METHODOLOGY
This section contains the details of the Multi-Task Learning formulation for SED, its corresponding segmentation mapping network $g_1$, classification mapping network $g_2$, and the auxiliary T-F reconstruction autoencoder task. The paper's Multi-Task Learning formulation is depicted in Fig. 2.

A. Multi-Task Learning SED formulation
In this section we formalise the Multi-Task Learning setup for weakly labelled SED. Let the raw audio be represented as $X = \{x_i\}_{i=1}^{T}$, where $x_i \in \mathbb{R}$ and $i$ indicates a specific frame out of the total $T$ frames. Features are extracted from the raw audio and brought to a T-F form $\hat{X} = \{\hat{x}_i\}_{i=1}^{T}$, where $\hat{x}_i \in \mathbb{R}^{F}$ and $F$ is the number of frequency bins used to represent a frame of audio. Every bag of frames $B = \{\hat{x}_i\}_{i=1}^{T}$ has a weak label $y \in \mathbb{R}^{K}$, where $K$ is the number of audio events. The weakly labelled training data in terms of bags is then represented as

$B_j = \left( \{\hat{x}_i\}_{i=1}^{T}, y \right) \Big|_{j=0}^{N}$    (1)

where $N$ is the number of examples or data points.

As the primary task in our Multi-Task Learning framework is Sound Event Detection, the first part of the network for the SED task focuses on learning a segmentation mapping $g_1(\cdot)$ where

$g_1 : \hat{X} \mapsto Z$    (2)

The segmentation network maps the feature set $\{\hat{x}_i\}_{i=1}^{T}$ to $Z = \{z_i\}_{i=1}^{T}$, where $z_i \in \mathbb{R}^{K \times F}$ and $K$ is the number of audio events. The second part of the SED task is a network which classifies $\{z_i\}_{i=1}^{T}$ to $P = \{p_k\}_{k=1}^{K}$, where $P \in \mathbb{R}^{K}$. This network learns a mapping

$g_2 : Z \mapsto P$    (3)

where $g_2$ maps each class's T-F segmentation to the presence probability $p_k$ of the $k$-th event.

Fig. 2. Multi-Task Learning with T-F reconstruction as the auxiliary task. The upper part of the network is the primary WLSED task, with the main blocks being the input T-F representation, the Segmentation Network and the Classification Network. The lower part of the network belongs to the auxiliary task, which is the reconstruction of the internal T-F representation.

The auxiliary task in the MTL setup is reconstruction of the input T-F representation. For the reconstruction, an autoencoder structure is used. The first part of the autoencoder is an encoder network which compacts the data into an intermediate T-F representation. The aim of the compacted features is to allow reconstruction of the data with minimal error. The encoder learns a function $g_e(\cdot)$ where $g_e : \hat{X} \mapsto Z$.

We make the assumption that the audio's T-F representation $\{\hat{x}_i\}_{i=1}^{T}$ can be completely explained as a linear or non-linear combination of each audio event's independent T-F representation. This assumption allows the network to have a shared encoder for both the SED and the auxiliary reconstruction task. From now on we represent $g(\cdot) = g_1(\cdot) = g_e(\cdot)$ as the shared segmentation mapping function, such that $g : \hat{X} \mapsto Z$. The shared encoder performs compaction in the number of filters and keeps the T-F dimensions untouched.

The second part of the auxiliary task is a decoder network which learns a mapping $g_d$ such that $g_d : Z \mapsto \bar{X}$, where $\bar{X}$ is the reconstructed T-F representation. Specifically,

$\{\bar{x}_i\}_{i=1}^{T} = g_d(\{z_i\}_{i=1}^{T})$    (4)

Here the mapping function $g_d(\cdot)$ is ideally such that

$g_d(g(\cdot)) = g(g_d(\cdot)) = I$    (5)

To enforce the above relation in the auxiliary task, we introduce a loss function $L_2$ to minimise the difference between the true T-F representation $\{\hat{x}_i\}_{i=1}^{T}$ and the predicted T-F representation $\{\bar{x}_i\}_{i=1}^{T}$ of the audio.

To solve the SED task, the network should learn the shared mapping $g(\cdot)$ such that the mask $\{z_i\}_{i=1}^{T}$ accurately segments each audio event, and the classification mapping $g_2(\cdot)$ should map it to the correct audio event. Let the loss enforcing these conditions for the primary task be $L_1$.
The functions $g(\cdot)$, $g_2(\cdot)$ and $g_d(\cdot)$ will be expressed using neural networks. In neural network terminology, this is equivalent to learning weights $W = [w, w_1, w_2]$, where $w$, $w_1$ and $w_2$ are the weights corresponding to the functions $g(\cdot)$, $g_2(\cdot)$ and $g_d(\cdot)$ respectively. The optimisation problem can be framed in terms of these weights $W$ over all data points as

$\min_{W} \; L_1(P, y \mid w, w_1) + \alpha \, L_2(\{\bar{x}_i\}_{i=1}^{T}, \{\hat{x}_i\}_{i=1}^{T} \mid w, w_2)$    (6)

The two-component loss function will be referred to as $L(W)$, and the parameter $\alpha$ accounts for the scale difference between the losses $L_1$ and $L_2$. It also helps in adjusting the weight given to the auxiliary task relative to the primary task to guide the learning of the weights.
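A minimal sketch of the two-component objective in Eq. (6), assuming a PyTorch implementation with Binary Cross Entropy as the primary loss $L_1$ and Mean Squared Error as the auxiliary reconstruction loss $L_2$ (as used later in the paper); tensor shapes and names here are illustrative assumptions:

```python
# Combined MTL loss: primary weakly labelled SED loss plus a scaled reconstruction loss.
import torch
import torch.nn.functional as F

def mtl_loss(clip_probs, weak_labels, recon, logmel, alpha=1e-3):
    l1 = F.binary_cross_entropy(clip_probs, weak_labels)   # primary SED loss L1
    l2 = F.mse_loss(recon, logmel)                          # auxiliary T-F reconstruction loss L2
    return l1 + alpha * l2

clip_probs  = torch.rand(24, 41)            # batch of clip-level event probabilities
weak_labels = torch.randint(0, 2, (24, 41)).float()
logmel      = torch.rand(24, 1, 240, 64)    # input log mel spectrograms
recon       = torch.rand(24, 1, 240, 64)    # decoder reconstructions
print(mtl_loss(clip_probs, weak_labels, recon, logmel))
```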
B. Shared Segmentation Network

The segmentation mapping processes the audio clip and converts it into a processed T-F (Time-Frequency) representation for each class. The audio clip $X$ is first converted into a T-F representation $\hat{X} = \{\hat{x}_i\}_{i=1}^{T}$; here $\hat{X}$ is the log mel spectrogram. A CNN based network is used for modelling the mapping $g(\cdot)$. The CNN based network has multiple CNN blocks, and each CNN block consists of three subparts. In the first part, the features $\hat{X} = \{\hat{x}_i\}_{i=1}^{T}$ are processed using two repeating units of 2-dimensional linear convolution, Batch Normalisation [30] and ReLU [31] nonlinearity. This is followed by a 2-dimensional Average Pooling operator with appropriate stride and padding to maintain the original T-F dimensions of the audio. Batch Normalisation helps to stabilise training by reducing the layer's internal covariate shift, and recent work suggests it makes the optimisation landscape significantly smoother, which results in stable behaviour of the gradients [32]. This two-unit CNN block is repeated to form the segmentation mapping section of the network.

The segmentation network also acts as the encoder of the autoencoder framework and has to jointly encode features relevant for audio event detection and for audio T-F representation reconstruction. Having a common encoder forces the network to learn a shared representation, helps the network exploit commonalities and differences across SED and T-F reconstruction, and enables the network to generalise better on the original task. The Multi-Task Learning (MTL) setup uses the auxiliary task to introduce an inductive bias, which causes the network to prefer solutions that generalise better [33]. This results in improved learning and predictive power for SED. The MTL setup used here employs hard parameter sharing instead of soft parameter sharing, which greatly reduces the risk of overfitting [34].
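The following is a sketch of one such CNN block under the stated design (two convolution-BatchNorm-ReLU units followed by a 2D average pooling that preserves the T-F dimensions); the kernel and pooling sizes are assumptions, not the exact published configuration:

```python
# One encoder CNN block: 2 x (Conv2d -> BatchNorm -> ReLU) followed by size-preserving AvgPool.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),  # keeps T and F dimensions
        )

    def forward(self, x):
        return self.block(x)

encoder = nn.Sequential(ConvBlock(1, 32), ConvBlock(32, 64),
                        ConvBlock(64, 128), ConvBlock(128, 256))
x = torch.rand(4, 1, 240, 64)          # (batch, channel, time, mel bins)
print(encoder(x).shape)                # -> torch.Size([4, 256, 240, 64])
```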
C. Classification Network
The modelling choice for the classification mapping results in different intermediate T-F representations of the audio. Traditional choices for modelling the classification mapping are Global Max Pooling [20], Global Average Pooling [35] and Global Weighted Rank Pooling [36], which result in a compromise between time level precision and classification performance. We instead propose a 2-step Attention Pooling mechanism which retains interpretability in the T-F domain and provides better SED classification performance. The two-step Attention Pooling mechanism converts the segmentation mapping $\{z_i\}_{i=1}^{T}$ into audio level predictions $P$.

The 2-step Attention Pooling takes an input $Z = \{z_i\}_{i=1}^{T}$ of size $K \times F \times T$, where $K$, $F$ and $T$ denote the number of classes, frequency bins and time frames respectively. The first step operates on $Z$ and produces $Z_{p1}$ as an intermediate output, followed by the second step which operates on $Z_{p1}$ and produces $Z_{p2}$ as the final output indicating the presence probability of the different sound events in the audio clip. In the first step, two independent linear neural networks operate on $Z$ to produce an attention weight $Z_{a1}$ and an intermediate classification output $Z_{c1}$ respectively. A softmax is used to ensure the attention weights sum to 1 along the frequency dimension, producing normalised attention weights $\hat{Z}_{a1}$. The normalised attention weights are used to weigh the intermediate classification outputs to produce $Z_{p1}$ of size $K \times T$ with a squashed frequency dimension.

$Z_{a1} = \sigma(Z W_{a1}^{T} + b_{a1})$    (7)

$Z_{c1} = Z W_{c1}^{T} + b_{c1}$    (8)

$\hat{Z}_{a1} = \dfrac{e^{Z_{a1}}}{\sum_{i=1}^{F} e^{Z_{a1}}}$    (9)

$\hat{Z}_{a1}$ are the probabilities used to weigh the classification branch results:

$Z_{p1} = \sum_{i=1}^{F} Z_{c1} \cdot \hat{Z}_{a1}$    (10)

Subsequently, $Z_{p1}$ is passed as input to the second step of attention pooling. The second step performs the same general process as the first step, with the difference that the softmax and the squashing operate on the time dimension.

$Z_{a2} = \sigma(Z_{p1} W_{a2}^{T} + b_{a2})$    (11)

$Z_{c2} = Z_{p1} W_{c2}^{T} + b_{c2}$    (12)

$\hat{Z}_{a2} = \dfrac{e^{Z_{a2}}}{\sum_{t=1}^{T} e^{Z_{a2}}}$    (13)

$Z_{p2} = \sum_{t=1}^{T} Z_{c2} \cdot \hat{Z}_{a2}$    (14)

$Z_{p2} \in (0, 1)$ is of size $K$ and denotes the probability of each sound event $k \in K$ being present in the audio clip. Breaking the attention into two steps makes the pooling more interpretable: it answers which frequency bins and which time steps contribute to which audio events, by visualising the normalised attention weights $\hat{Z}_{a1}$, $\hat{Z}_{a2}$ and each step's attention output $Z_{p1}$, $Z_{p2}$. Adding $\sigma$ to both steps' intermediate classification outputs ensures the output is bounded between 0 and 1, leading to stable training with the Binary Cross Entropy loss used for computing the error between the predicted probabilities of audio events $P$ and the true weak labels $y$. A quantitative performance comparison between traditional attention pooling and the 2-step attention is given in the results section.
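A hedged PyTorch sketch of the 2-step Attention Pooling is given below. The exact layer shapes of the published model are not specified here, so the linear-layer dimensions (acting over the mel axis in step 1 and over the class axis in step 2) are assumptions; sigmoids are applied to both branches following the stability note above.

```python
# Two-step attention pooling: step 1 attends over mel bins (softmax along F),
# step 2 over time (softmax along T); sigmoids keep intermediate outputs in (0, 1).
import torch
import torch.nn as nn

class TwoStepAttentionPooling(nn.Module):
    def __init__(self, num_classes, num_mels):
        super().__init__()
        self.att1, self.cls1 = nn.Linear(num_mels, num_mels), nn.Linear(num_mels, num_mels)
        self.att2, self.cls2 = nn.Linear(num_classes, num_classes), nn.Linear(num_classes, num_classes)

    def forward(self, z):                       # z: (batch, K, F, T) class-wise T-F maps
        z = z.permute(0, 1, 3, 2)               # -> (batch, K, T, F)
        a1 = torch.softmax(torch.sigmoid(self.att1(z)), dim=-1)   # normalised over F
        zp1 = (torch.sigmoid(self.cls1(z)) * a1).sum(dim=-1)      # -> (batch, K, T)
        zp1 = zp1.transpose(1, 2)                                  # -> (batch, T, K)
        a2 = torch.softmax(torch.sigmoid(self.att2(zp1)), dim=1)   # normalised over T
        zp2 = (torch.sigmoid(self.cls2(zp1)) * a2).sum(dim=1)      # -> (batch, K)
        return zp2                               # clip-level event probabilities in (0, 1)

pool = TwoStepAttentionPooling(num_classes=41, num_mels=64)
print(pool(torch.rand(4, 41, 64, 240)).shape)   # -> torch.Size([4, 41])
```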
D. Decoder Network

The decoder of the auxiliary task takes $Z = \{z_i\}_{i=1}^{T}$ as input and tries to reconstruct $\{\hat{x}_i\}_{i=1}^{T}$. The decoder uses a CNN based network for down-sampling the number of filters from $K$ (the number of classes). The CNN based network has multiple CNN blocks, where each block consists of three subparts. In the first part, the features are processed using two repeating units of 2-dimensional linear convolution, Batch Normalisation [30] and ReLU [31] nonlinearity. This is followed by a 2-dimensional Average Pooling layer with appropriate stride and padding to maintain the original T-F dimensions of the audio. The general architecture of the decoder closely follows the shared encoder architecture, with the exception that the number of filters decreases from one block to the next in the decoder, while it increases in the encoder. For training the weights, the loss used for the auxiliary task is the Mean Squared Error (MSE), which computes the error between the input T-F representation $\{\hat{x}_i\}_{i=1}^{T}$ and the reconstructed T-F representation $\{\bar{x}_i\}_{i=1}^{T}$.
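A minimal sketch of the decoder and its MSE objective, assuming a stack of channel-reducing convolution blocks that preserve the T-F dimensions; the channel counts follow the description of the network given later, but the exact layers are assumptions:

```python
# Decoder sketch: channels shrink 256 -> 128 -> 64 -> 32 -> 1 while T-F size is preserved.
import torch
import torch.nn as nn
import torch.nn.functional as F

def up_block(in_ch, out_ch):
    # one decoder block: Conv2d -> BatchNorm -> ReLU, size-preserving
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

decoder = nn.Sequential(up_block(256, 128), up_block(128, 64),
                        up_block(64, 32), nn.Conv2d(32, 1, 3, padding=1))
z = torch.rand(4, 256, 240, 64)          # internal T-F representation from the shared encoder
recon = decoder(z)                       # -> (4, 1, 240, 64)
print(F.mse_loss(recon, torch.rand(4, 1, 240, 64)))
```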
TABLE I: SHARED SEGMENTATION NETWORK (layer-wise output sizes; the input is the log mel spectrogram).

TABLE II: CLASSIFICATION NETWORK (class T-F representation with 41 classes, followed by the 1st step and 2nd step Attention Pooling).

TABLE III: DECODER NETWORK (layer-wise output sizes; the input is the 256-channel internal T-F representation).

IV. EXPERIMENTS
A. Dataset
The dataset is made by mixing DCASE 2019 Task 1 Acoustic Scene Classification data [37] and DCASE 2018 Task 2 General-purpose Audio Tagging data [38]. The dataset creation methodology is the same as in [11]. However, we use the new DCASE 2019 Task 1 instead of the DCASE 2018 data used in [11]. We use the DCASE 2019 variant as it extends the TUT Urban Acoustic Scenes 2018 dataset with another 6 cities to a total of 12 large European cities, making the dataset more diverse and challenging. Specifically, the TUT Urban Acoustic Scenes 2018 dataset contains recordings from Barcelona, Helsinki, London, Paris, Stockholm and Vienna, to which the TAU Urban Acoustic Scenes 2019 dataset adds Lisbon, Amsterdam, Lyon, Madrid, Milan and Prague. This provides the background or environmental noise needed to simulate a real-world audio event background. DCASE 2018 Task 2 provides annotated audio clips associated with one of 41 events such as 'Flute', 'Gunshot' and 'Bus' from the Google AudioSet Ontology [39]. DCASE 2019 Task 1, on the other hand, contains 10-second clips from 10 different scenes such as 'airport', 'metro station' and 'urban park'. From DCASE 2018 Task 2, two-second clips are extracted and mixed with the 10-second background from DCASE 2019 Task 1 at SNRs of 0 dB, 10 dB and 20 dB. Each audio clip contains three non-overlapping audio events. For each SNR, the 8000 mixed audio clips are divided into 4 cross-validation folds. The mixed audio clips are single channel with a sampling rate of 32 kHz, and the mixing procedure can be found in [11].
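A hedged sketch of the kind of mixing step described above (the exact recipe follows [11]): a 2-second event clip is scaled so that its power relative to the 10-second background matches the target SNR, then added at an offset. The stand-in signals and the offset are assumptions for illustration only.

```python
# Mix a short event into a longer background scene at a target SNR (in dB).
import numpy as np

def mix_at_snr(event, background, snr_db, sr=32000, offset_s=4.0):
    """Scale `event` so that 10*log10(P_event / P_background) == snr_db, then add it."""
    p_event = np.mean(event ** 2)
    p_background = np.mean(background ** 2)
    scale = np.sqrt(p_background / p_event * 10 ** (snr_db / 10.0))
    start = int(offset_s * sr)
    mixture = background.copy()
    mixture[start:start + len(event)] += scale * event
    return mixture

sr = 32000
background = np.random.randn(10 * sr) * 0.1     # stand-in for a DCASE 2019 scene clip
event = np.random.randn(2 * sr)                 # stand-in for a 2 s DCASE 2018 event clip
mixture = mix_at_snr(event, background, snr_db=0)
print(mixture.shape)                            # -> (320000,)
```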
B. Evaluation Metric
The metrics used to evaluate the proposed framework and the benchmark models are the Area Under the Receiver Operating Characteristic Curve (ROC AUC) [40], micro Precision and macro Precision. The metrics are chosen such that they are threshold independent and characterise the performance of the network in both balanced and imbalanced class settings. The metrics are defined as follows:

1) ROC AUC: A receiver operating characteristic curve plots the true positive rate (TPR) against the false positive rate (FPR). The area under the ROC curve is computed, which summarises the ROC curve. Using the AUC does not require manual selection of a threshold. A bigger AUC indicates better performance; a random guess has an AUC of 0.5.

2) Precision: Precision is intuitively the ability of the classifier not to label as positive a sample that is negative:

$P = \dfrac{TP}{TP + FP}$    (15)

There are different ways of adapting this metric to multi-label/multi-class classification. We use both 'Micro' and 'Macro' precision for evaluation [41]. Micro Precision calculates the metric globally by counting the total true positives, false negatives and false positives. Macro Precision, on the other hand, calculates the metric for each label and takes their unweighted mean. Macro Precision does not take label imbalance into account, and hence we consider Micro Precision to be the metric of primary importance here.
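As an illustration, these metrics can be computed with scikit-learn [41]; the toy labels, the 0.5 threshold used to binarise predictions for precision, and the macro averaging of the AUC are assumptions made only for this sketch:

```python
# ROC AUC plus micro and macro precision for multi-label clip predictions.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score

y_true = np.random.randint(0, 2, size=(100, 41))        # toy weak labels, 41 events
y_score = np.random.rand(100, 41)                       # toy clip-level probabilities
y_pred = (y_score > 0.5).astype(int)                    # thresholded only for precision

auc = roc_auc_score(y_true, y_score, average='macro')   # threshold-independent
micro_p = precision_score(y_true, y_pred, average='micro', zero_division=0)
macro_p = precision_score(y_true, y_pred, average='macro', zero_division=0)
print(auc, micro_p, macro_p)
```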
C. Feature Extraction
The raw audio is converted to a T-F representation by applying an FFT with a window size of 2048 and an overlap of 1024 between consecutive windows. This is followed by applying a mel filter bank with 64 bands and converting the result to a log scale to obtain the log mel spectrogram. This is used as the input to the shared segmentation mapping network and has been found to work well with neural networks [11], [42].
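A minimal sketch of this feature extraction step, assuming librosa as the implementation (any STFT/mel pipeline with the stated parameters would do):

```python
# Log mel spectrogram with window 2048, hop 1024 (i.e. 1024 overlap) and 64 mel bands.
import numpy as np
import librosa

def log_mel(audio, sr=32000):
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048,
                                         hop_length=1024, n_mels=64)
    return librosa.power_to_db(mel)          # log scale, shape (64, n_frames)

audio = np.random.randn(10 * 32000).astype(np.float32)   # 10 s clip at 32 kHz
print(log_mel(audio).shape)                               # roughly (64, 313)
```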
D. Network
This subsection provides a detailed description of the shared segmentation, classification and decoder networks along with the training details. The segmentation network details are shown in Table I. It takes the log mel spectrogram as input and employs a CNN block structure similar to VGG [43]. The smallest unit of the network consists of a CNN layer of kernel size 3, Batch Normalisation and ReLU, sequentially called a CNN sub-block. This sub-block is repeated twice and is followed by 2D Average Pooling of stride 1 and padding 1 to form a CNN block. Empirically, 2D Max Pooling and 2D Average Pooling showed similar performance and therefore 2D Average Pooling was arbitrarily chosen. The CNN block is repeated 4 times with increasing channel size (32, 64, 128, 256). This output is passed to the class convolution block, as shown in Fig. 2, to reduce it down to the number of classes. Note that the decoder network takes the input of the class convolution as its input and not its output. This is determined empirically, as the performance of both is comparable, and using the input of the class convolution saves an additional convolution block which would otherwise be necessary to upscale the class T-F representation in the decoder network.

The classification network (Table II) consists of the 2-step attention pooling to summarise and reduce the class T-F representation to audio clip level predictions. The 2-step attention pooling is described in Section III-C. The decoder network (Table III) emulates the encoder network with a decreasing number of channels to reduce the representation back to the log mel spectrogram of the input. Using the terminology of sub-block and block, the decoder consists of 3 CNN blocks with decreasing channel size (256, 128, 64, 32). This is followed by a reversed class convolution block consisting of one CNN block with only one channel to reduce the T-F representation to a log mel spectrogram. The entire network is trained end-to-end with a batch size of 24 and a learning rate of 1e-3 using the Adam optimiser [44] on 4 GPU cards with 12 GB RAM each.

TABLE IV: WEAKLY LABELLED SED RESULTS

Networks | SNR 20 dB (micro-P / macro-P / AUC) | SNR 10 dB (micro-P / macro-P / AUC) | SNR 0 dB (micro-P / macro-P / AUC)
GAP | 0.5067 / 0.6127 / 0.9338 | 0.4291 / 0.5390 / 0.5390 | 0.3295 / 0.4093 / 0.8694
GMP | 0.5390 / 0.5186 / 0.8497 | 0.5263 / 0.5023 / 0.8422 | 0.4640 / 0.4441 / 0.8189
GWRP [11] | 0.7018 / 0.7522 / 0.9362 | 0.6538 / 0.7129 / 0.9265 | 0.5285 / 0.6084 / 0.8985
Atrous AP [12] | 0.7391 / 0.7586 / 0.9279 | 0.6740 / 0.7404 / 0.9211 | 0.5714 / 0.6341 / 0.9014
2APAE | 0.7829 / - / - | 0.7603 / - / - | 0.6986 / - / -

TABLE V: ABLATION STUDY over the auxiliary-task weight α ∈ {0.0, 0.001, 0.01}.
V. RESULTS
A. Sound Event Detection Results
The proposed architecture is compared with traditional methods and the current benchmark architectures for weakly labelled sound event detection, specifically GMP, GAP, GWRP [11] and Attention pooling with Atrous convolution [12]. The performance of the different architectures is compared in Table IV, where each metric is an average cross-validation score obtained across 4 folds. The important metric here is Micro Precision (micro-P), as it calculates the metric globally by counting the total true positives, false negatives and false positives. This is a better indicator of network performance as it takes class imbalance into account, rather than the simple unweighted averaging that macro Precision performs. 2APAE (2-step Attention Pooling with Auto Encoder auxiliary task) has micro-Precision scores of 0.7829, 0.7603 and 0.6986 at SNR 20, 10 and 0 dB respectively. The results show that the 2APAE network achieves the best score across all SNRs and all metrics. In terms of micro-Precision, 2APAE outperforms the existing benchmark of Atrous AP (Atrous Attention Pooling [12]) at SNR 20, 10 and 0 dB by 5.9%, 12.8% and 22.3% respectively.

The second benefit of breaking the attention into two steps, apart from improving performance, is providing stable training. The attention pooling used with atrous convolution resulted in overflow issues during training, where the final prediction probabilities went over 1 for some audio events or classes. This does not fit well with the Binary Cross Entropy loss function, which expects input probabilities between 0 and 1. Empirically, adding a squashing function after the attention hampers learning. Breaking the attention into two steps allows for the intermediate use of a sigmoid, which helps ensure the outputs do not exceed 1.
B. Ablation Study
To determine the contributions of the 2-step Attention Pooling and the reconstruction based auxiliary task to the SED performance, an ablation study is performed. For the ablation study, we change the value of α in the total loss described in Section III-A:

$L = L_1(P, y \mid w, w_1) + \alpha \, L_2(\{\bar{x}_i\}_{i=1}^{T}, \{\hat{x}_i\}_{i=1}^{T} \mid w, w_2)$    (16)

The value of α determines the contribution of the auxiliary task to the weakly labelled SED. By varying the value of α, we can determine the performance improvement due to the 2-step attention alone (α = 0.0) and the contribution of the auxiliary task (α > 0).
TABLE VI: WEAKLY LABELLED SED AUDIO-EVENT-SPECIFIC RESULTS FOR SNR = 0 dB

Model Guitar Applause Bark Bassdrum Burping Bus Cello Chime Clarinet Comp.keyb. Cough Cowbell Doublebass Drawer Elec.piano Fart Fingersnapp. Firework Flute Glockensp.
GAP 0.549 0.848 0.477 0.161 0.508 0.168 0.361 0.626 0.289 0.502 0.384 0.447 0.199 0.212 0.251 0.386 0.409 0.36 0.286 0.539
GMP 0.517 0.539 0.53 0.535 0.426 0.145 0.378 0.406 0.466 0.356 0.208 0.872 0.275 0.077 0.31 0.393 0.623 0.322 0.384 0.889
GWRP 0.728 0.933 0.742 0.242 0.741 0.254 0.511 0.766 0.449 0.587 0.629 0.768 0.262 0.296 0.349 0.652 0.514 0.517 0.418 0.893
AtrousAP 0.72 0.956 0.782 0.169 0.804 0.2 0.562 0.767 0.502 0.685 0.756 0.781 0.17 0.214 0.187 0.691 0.734 0.566 0.318 0.902
2APAE 0.869 0.942 0.865 0.82 0.849 0.572 0.71 0.633 0.542 0.59 0.628 0.921 0.579 0.386 0.552 0.569 0.907 0.579 0.473 0.907
2APAE e-3 0.792 0.951 0.839 0.812 0.874 0.627 0.669 0.606 0.503 0.699 0.631 0.94 0.59 0.403 0.453 0.562 0.941 0.565 0.535 0.807
2APAE e-2 0.759 0.943 0.787 0.789 0.81 0.605 0.677 0.637 0.485 0.68 0.632 0.916 0.563 0.377 0.522 0.589 0.867 0.61 0.522 0.853

Model Gong Gunshot Harmonica Hi-hat Keys Knock Laughter Meow Micro.oven Oboe Saxophone Scissors Shatter Snaredrum Squeak Tambourine Tearing Telephone Trumpet Violinfiddle Writing
GAP 0.34 0.473 0.698 0.717 0.384 0.42 0.396 0.3 0.193 0.288 0.477 0.456 0.527 0.344 0.174 0.512 0.357 0.272 0.514 0.474 0.377
GMP 0.416 0.43 0.375 0.887 0.493 0.52 0.406 0.314 0.215 0.485 0.566 0.344 0.416 0.462 0.077 0.911 0.39 0.345 0.569 0.674 0.192
GWRP 0.576 0.645 0.851 0.847 0.624 0.543 0.585 0.548 0.367 0.495 0.654 0.545 0.684 0.513 0.207 0.866 0.552 0.49 0.664 0.594 0.52
AtrousAP 0.692 0.684 0.861 0.919 0.78 0.694 0.628 0.583 0.157 0.565 0.684 0.713 0.777 0.446 0.223 0.952 0.573 0.52 0.793 0.743 0.538
2APAE 0.643 0.651 0.848 0.97 0.744 0.845 0.581 0.483 0.499 0.609 0.729 0.627 0.702 0.791 0.146 0.964 0.612 0.425 0.727 0.748 0.565
2APAE e-3 0.663 0.679 0.81 0.973 0.742 0.789 0.651 0.532 0.447 0.671 0.716 0.597 0.731 0.829 0.138 0.947 0.528 0.424 0.682 0.747 0.577
2APAE e-2 0.69 0.696 0.877 0.981 0.7 0.787 0.583 0.437 0.42 0.598 0.698 0.633 0.69 0.791 0.159 0.948 0.522 0.441 0.736 0.72 0.503
TABLE VII: WEAKLY LABELLED SED AUDIO-EVENT-SPECIFIC RESULTS FOR SNR = 10 dB

Model Guitar Applause Bark Bassdrum Burping Bus Cello Chime Clarinet Comp.keyb. Cough Cowbell Doublebass Drawer Elec.piano Fart Fingersnapp. Firework Flute Glockensp.
GAP 0.69 0.974 0.691 0.238 0.642 0.373 0.57 0.763 0.372 0.648 0.529 0.507 0.394 0.438 0.447 0.573 0.461 0.481 0.391 0.644
GMP 0.604 0.691 0.626 0.732 0.63 0.163 0.494 0.508 0.581 0.399 0.284 0.862 0.421 0.083 0.414 0.267 0.667 0.386 0.528 0.881
GWRP 0.777 0.969 0.868 0.454 0.873 0.49 0.685 0.809 0.597 0.668 0.766 0.842 0.512 0.553 0.527 0.665 0.567 0.643 0.552 0.921
AtrousAP 0.816 0.982 0.893 0.38 0.908 0.459 0.72 0.81 0.66 0.713 0.815 0.349 0.431 0.587 0.539 0.768 0.738 0.656 0.622 0.931
2APAE 0.921 0.953 0.861 0.904 0.942 0.672 0.775 0.674 0.583 0.728 0.747 0.894 0.78 0.614 0.652 0.652 0.954 0.73 0.69 0.85
2APAE e-3 0.891 0.963 0.874 0.87 0.906 0.815 0.792 0.725 0.67 0.726 0.741 0.922 0.766 0.565 0.689 0.571 0.896 0.618 0.672 0.932
2APAE e-2 0.856 0.937 0.816 0.905 0.884 0.715 0.762 0.565 0.659 0.684 0.627 0.918 0.805 0.516 0.673 0.647 0.911 0.716 0.696 0.929

Model Gong Gunshot Harmonica Hi-hat Keys Knock Laughter Meow Micro.oven Oboe Saxophone Scissors Shatter Snaredrum Squeak Tambourine Tearing Telephone Trumpet Violinfiddle Writing
GAP 0.513 0.537 0.854 0.802 0.486 0.579 0.569 0.437 0.261 0.403 0.554 0.538 0.645 0.617 0.219 0.598 0.422 0.396 0.728 0.553 0.499
GMP 0.494 0.539 0.487 0.915 0.193 0.563 0.507 0.444 0.266 0.58 0.614 0.362 0.237 0.645 0.097 0.854 0.353 0.408 0.654 0.725 0.312
GWRP 0.684 0.705 0.884 0.913 0.679 0.714 0.71 0.684 0.514 0.541 0.751 0.597 0.77 0.763 0.246 0.873 0.506 0.569 0.819 0.712 0.534
AtrousAP 0.805 0.732 0.918 0.929 0.776 0.793 0.783 0.734 0.242 0.716 0.795 0.756 0.826 0.746 0.289 0.918 0.594 0.644 0.862 0.834 0.662
2APAE 0.656 0.749 0.869 0.975 0.773 0.849 0.675 0.481 0.534 0.659 0.765 0.604 0.723 0.873 0.175 0.943 0.593 0.506 0.803 0.727 0.596
2APAE e-3 0.778 0.733 0.863 0.99 0.809 0.845 0.646 0.477 0.475 0.792 0.777 0.597 0.828 0.903 0.127 0.968 0.586 0.562 0.847 0.789 0.613
2APAE e-2 0.659 0.72 0.851 0.983 0.774 0.83 0.661 0.445 0.527 0.698 0.747 0.552 0.796 0.947 0.127 0.941 0.628 0.522 0.813 0.775 0.571
TABLE VIII: WEAKLY LABELLED SED AUDIO-EVENT-SPECIFIC RESULTS FOR SNR = 20 dB

Model Guitar Applause Bark Bassdrum Burping Bus Cello Chime Clarinet Comp.keyb. Cough Cowbell Doublebass Drawer Elec.piano Fart Fingersnapp. Firework Flute Glockensp.
GAP 0.72 0.986 0.747 0.399 0.699 0.56 0.64 0.803 0.485 0.707 0.571 0.554 0.501 0.532 0.597 0.652 0.481 0.593 0.498 0.766
GMP 0.507 0.843 0.654 0.838 0.631 0.336 0.565 0.489 0.657 0.344 0.44 0.889 0.42 0.137 0.579 0.328 0.653 0.226 0.54 0.931
GWRP 0.83 0.986 0.922 0.529 0.869 0.649 0.727 0.813 0.657 0.728 0.742 0.875 0.696 0.626 0.627 0.7 0.636 0.722 0.697 0.934
AtrousAP 0.877 0.991 0.922 0.562 0.924 0.622 0.773 0.819 0.746 0.77 0.89 0.716 0.573 0.708 0.703 0.806 0.746 0.755 0.745 0.957
2APAE 0.903 0.969 0.911 0.936 0.959 0.761 0.787 0.642 0.666 0.736 0.605 0.936 0.825 0.592 0.665 0.589 0.956 0.681 0.834 0.913
2APAE e-3 0.908 0.955 0.867 0.9 0.946 0.755 0.845 0.648 0.738 0.716 0.707 0.936 0.81 0.611 0.708 0.682 0.909 0.712 0.849 0.961
2APAE e-2 0.89 0.97 0.908 0.922 0.93 0.737 0.874 0.598 0.647 0.643 0.668 0.946 0.826 0.656 0.667 0.583 0.931 0.707 0.74 0.84

Model Gong Gunshot Harmonica Hi-hat Keys Knock Laughter Meow Micro.oven Oboe Saxophone Scissors Shatter Snaredrum Squeak Tambourine Tearing Telephone Trumpet Violinfiddle Writing
GAP 0.664 0.552 0.895 0.829 0.535 0.688 0.642 0.501 0.308 0.548 0.622 0.554 0.717 0.696 0.236 0.67 0.432 0.445 0.825 0.626 0.531
GMP 0.529 0.523 0.517 0.804 0.335 0.599 0.441 0.302 0.121 0.668 0.653 0.371 0.289 0.509 0.076 0.873 0.4 0.404 0.72 0.726 0.277
GWRP 0.811 0.713 0.917 0.92 0.733 0.827 0.791 0.674 0.498 0.728 0.801 0.624 0.812 0.79 0.277 0.915 0.571 0.647 0.869 0.839 0.65
AtrousAP 0.86 0.743 0.945 0.954 0.782 0.832 0.776 0.724 0.123 0.84 0.862 0.74 0.859 0.762 0.239 0.955 0.613 0.646 0.909 0.888 0.686
2APAE 0.751 0.676 0.891 0.966 0.79 0.838 0.637 0.498 0.552 0.79 0.833 0.633 0.846 0.932 0.323 0.965 0.609 0.503 0.903 0.855 0.566
2APAE e-3 0.725 0.666 0.87 0.989 0.76 0.808 0.705 0.455 0.616 0.83 0.833 0.613 0.809 0.943 0.177 0.966 0.56 0.495 0.885 0.819 0.578
2APAE e-2 0.719 0.707 0.849 0.971 0.82 0.869 0.674 0.436 0.532 0.735 0.817 0.546 0.781 0.904 0.182 0.967 0.566 0.465 0.83 0.834 0.6
The L2 loss term (Mean Squared Error) is of much greater magnitude than the L1 loss. The low α value helps in adjusting for this large scale difference between the two losses. When α = 0.0, the network has no contribution from the auxiliary task, which lets us evaluate the performance of the 2-step Attention Pooling on its own. In terms of micro-Precision, the 2-step Attention Pooling outperforms the existing benchmark of Atrous AP (Atrous Attention Pooling [12]) from Table V at SNR 20, 10 and 0 dB by 5.2%, 10.2% and 21.4% respectively. Adding the auxiliary task contribution with a relative weight of α = 0.001 yields a further improvement of 0.7%, 2.3% and 0.7% over the 2-step attention alone. These numbers indicate that the 2-step attention accounts for most of the performance improvement, with extra performance gains from the auxiliary task. When α is increased to 0.01, the performance decreases compared to α = 0.001. This indicates that the auxiliary task's loss contribution starts to overpower the primary SED task's loss contribution rather than improving generalisation. The shared encoder then learns features more relevant to the auxiliary task than to the primary SED task.

C. Audio-event specific SED results
Tables VI, VII and VIII show the precision values of audio event specific SED for each of the 41 audio events at SNR = 0, 10 and 20 dB respectively. The names of the audio events are abbreviated to fit the tables. In the tables, 2APAE refers to the base model with no contribution from the auxiliary task, while 2APAE 1e-3 and 2APAE 1e-2 refer to the 2APAE model with an auxiliary task contribution of α = 0.001 and α = 0.01 respectively. For almost all audio events, variants of 2APAE have the best precision scores against GMP, GAP, GWRP and Atrous for
Apart from improved performance, using 2-step AttentionPooling, provides a way to visualise the contribution of eachMel bin and each time frame in the T-F representation to eachaudio event and final predictions. Before visualising attentionweights, generally a class-dependent threshold is applied [11][45] to remove False positive and fake activations. But thispost-processed varies from network to network and doesn’t paint an accurate picture of the contribution of each componentin attention as it becomes highly dependent on thresholdchosen and adds author’s bias for choosing threshold. Hence inthis paper we plot the raw attention weights without applyingany thresholds.We pick a random example with SNR 20 db and showthe end to end visualisation of the 2 Step Attention Poolingmechanism. The first subplot shows the log mel spectrogramthat is fed to the segmentation network. The first subplot hasthree distinct audio events happening in it, namely: telephoneringing, cello playing and cat meowing. The second subplotshows the output of reconstruction based auxiliary tasks. Wewill get back to the description of the second subplot later.The third, fourth, fifth subplot shows the st Step AttentionPooling weights for top 3 predicted events by the network:telephone ringing, cello playing and cat meowing respectivelywith time on x-axis and mel bins on y-axis. The lighter colorindicates higher weights in time and frequency domain. Thesesubplots tell us what frequency bands are getting activated atwhich instance of time for the particular class. For example,in subplot 3, the lower frequency bands are getting activated for initial time steps where the telephone ringing is occurringin the audio clip. The fifth subplot shows the output of st step Attention Pooling where y-axis is audio events and x-axisis time with squashed frequency axis. This shows that aftertaking into account the contribution from every frequency bin,there are three main candidate time chunks where the audioevents are located. Comparing this to subplot one of input, theyalign with the three audio events happening in the audio clip,with different audio events getting activated at different timechunks. The sixth subplot visualises the nd Step Attentionweights where y-axis is audio events and x-axis is time. Theygive high weight to audio events telephone ringing, celloplaying and cat meowing at different time steps. However,the audio events are not perfectly time-aligned. This is dueto the use of 2D-Average pooling per two convolution layersin the shared encoder. The addition of 2D average poolingimproves weakly labelled SED performance, but results inlosing time-level precision. The seventh subplot, shows theoutput of nd step Attention Pooling where y-axis is the audioevent prediction probability and the x-axis is the audio event.This is the final output of the classification network and entirenetwork for SED.Coming back to subplot two, it is the output of the decoderused for reconstruction auxiliary task. In our case, the goalof the auxiliary task and the decoder is to reconstruct thelog mel spectrogram of input from the shared encoder’s T-Frepresentation. From the subplot, we can see that the decoderis not only able to reconstruct the audio events clearly butit is also denoising the log mel spectrogram in 20 dB SNR.For the decoder to perform well and denoise the output, theinternal T-F representation has to be denoised as well. 
Apart from improved performance, using the 2-step Attention Pooling provides a way to visualise the contribution of each Mel bin and each time frame in the T-F representation to each audio event and to the final predictions. Before visualising attention weights, a class-dependent threshold is generally applied [11] [45] to remove false positives and spurious activations. However, this post-processing varies from network to network and does not paint an accurate picture of the contribution of each component in the attention, as it becomes highly dependent on the threshold chosen and adds the author's bias in choosing the threshold. Hence, in this paper we plot the raw attention weights without applying any thresholds.

We pick a random example with SNR 20 dB and show the end-to-end visualisation of the 2-step Attention Pooling mechanism in Fig. 3. The first subplot shows the log mel spectrogram that is fed to the segmentation network. It contains three distinct audio events: telephone ringing, cello playing and cat meowing. The second subplot shows the output of the reconstruction based auxiliary task; we return to it later. The third, fourth and fifth subplots show the 1st step Attention Pooling weights for the top 3 events predicted by the network: telephone ringing, cello playing and cat meowing respectively, with time on the x-axis and mel bins on the y-axis. Lighter colours indicate higher weights in the time and frequency domains. These subplots tell us which frequency bands are activated at which instant of time for a particular class. For example, in subplot 3, the lower frequency bands are activated for the initial time steps, where the telephone ringing occurs in the audio clip. The sixth subplot shows the output of the 1st step Attention Pooling, where the y-axis is audio events and the x-axis is time with the frequency axis squashed. This shows that, after taking into account the contribution from every frequency bin, there are three main candidate time chunks where the audio events are located. Comparing this to the input in subplot 1, they align with the three audio events happening in the audio clip, with different audio events activated in different time chunks. The seventh subplot visualises the 2nd step Attention weights, where the y-axis is audio events and the x-axis is time. They give high weight to the audio events telephone ringing, cello playing and cat meowing at different time steps. However, the audio events are not perfectly time-aligned. This is due to the use of 2D Average Pooling after every two convolution layers in the shared encoder. The addition of 2D Average Pooling improves weakly labelled SED performance, but results in losing time-level precision. The eighth subplot shows the output of the 2nd step Attention Pooling, where the y-axis is the audio event prediction probability and the x-axis is the audio event. This is the final output of the classification network and of the entire network for SED.

Coming back to subplot 2, it is the output of the decoder used for the reconstruction auxiliary task. In our case, the goal of the auxiliary task and the decoder is to reconstruct the log mel spectrogram of the input from the shared encoder's T-F representation. From the subplot, we can see that the decoder is not only able to reconstruct the audio events clearly but is also denoising the log mel spectrogram in the 20 dB SNR recording. For the decoder to perform well and denoise the output, the internal T-F representation has to be denoised as well. This lets us infer two things. First, in the Multi-Task Learning framework, adding the reconstruction auxiliary task to the primary weakly labelled SED task leads to denoising of the internal T-F and encoder representations of the log mel spectrogram. This is backed by the ablation study's result in Section V-B, which shows that adding the reconstruction auxiliary task improves performance; that improved performance can be attributed to this denoising of the internal T-F representations. Second, the denoised log mel representation will help in improving the performance of audio source separation systems. This setup, either pre-trained or jointly trained with source separation, can be used to produce cleaned audio log mel representations for better source separation.
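A minimal matplotlib sketch of this kind of raw-weight visualisation is given below, assuming the first-step and second-step attention weights for one clip have already been extracted from a trained model; the arrays here are random stand-ins and no thresholding is applied:

```python
# Plot 1st-step attention (mel bin x time, per class) and 2nd-step attention (event x time).
import numpy as np
import matplotlib.pyplot as plt

K, F, T = 41, 64, 240
att1 = np.random.rand(K, F, T)            # stand-in for 1st-step attention weights
att2 = np.random.rand(K, T)               # stand-in for 2nd-step attention weights
top_class = 5                             # e.g. the most probable predicted event

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 5))
ax1.imshow(att1[top_class], aspect='auto', origin='lower')
ax1.set(title='1st-step attention', xlabel='time frame', ylabel='mel bin')
ax2.imshow(att2, aspect='auto', origin='lower')
ax2.set(title='2nd-step attention', xlabel='time frame', ylabel='sound event')
plt.tight_layout()
plt.savefig('attention_weights.png')
```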
VI. FUTURE WORK AND DIRECTIONS

The paper shows that weakly labelled SED can be formulated as Multi-Task Learning where Multiple Instance Learning based SED is the primary task coupled with a constructive auxiliary task. This opens up interesting directions for extending and building upon this work. To enumerate a few: first, exploring different auxiliary tasks to help weakly labelled SED. The paper shows one such task, and there could be single or multiple auxiliary tasks which might help create a better shared segmentation mapping for audio events. Second, jointly training sound source separation along with weakly labelled SED in the MTL framework. This would provide an end-to-end weakly labelled sound event detection and source separation system. Third, developing better loss functions and dynamic α computation to optimally use multiple audio tasks in the MTL framework for improved performance.
VII. CONCLUSION

The paper proposes a Multi-Task Learning formulation for learning from weakly labelled audio data which incorporates the MIL formulation of SED as the primary task. In the MTL framework, we use input T-F reconstruction as the auxiliary task, which helps in denoising the intermediate T-F representation and encourages the shared segmentation network to retain source specific information. To make the pooling mechanism more interpretable in the T-F domain, we introduce a 2-step Attention Pooling mechanism. To show the utility of the proposed framework, we remix the DCASE 2019 Task 1 acoustic scene data with DCASE 2018 Task 2 sound event data at 0, 10 and 20 dB SNR. The proposed network outperforms the existing benchmark model over all SNRs, with 22.3%, 12.8% and 5.9% improvement over the benchmark model at 0, 10 and 20 dB SNR respectively. The results of the ablation study indicate that the 2-step Attention Pooling is the major factor improving SED performance, followed by about a 1% improvement from the addition of the reconstruction based auxiliary task. The work lays the foundation for future work on end-to-end sound source separation and sound event detection for weakly labelled data.
REFERENCES

[1] Y. Xu, W. J. Li, and K. K. Lee, Intelligent Wearable Interfaces. USA: Wiley-Interscience, 2008.
[2] S. Chu, M. Mataric, C. Kuo, and S. Narayanan, "Where am I? Scene recognition for mobile robots using audio features," Los Alamitos, CA, USA: IEEE Computer Society, Jul. 2006, pp. 885–888. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICME.2006.262661
[3] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, "Scream and gunshot detection and localization for audio-surveillance systems," 2007, pp. 21–26.
[4] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
[5] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291–1303, 2017.
[6] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," 2016, pp. 1128–1132.
[7] A. Kumar and B. Raj, "Audio event detection using weakly labeled data," in Proceedings of the 24th ACM International Conference on Multimedia, ser. MM '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 1038–1047. [Online]. Available: https://doi.org/10.1145/2964284.2964310
[8] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artif. Intell., vol. 89, no. 1–2, pp. 31–71, Jan. 1997. [Online]. Available: https://doi.org/10.1016/S0004-3702(96)00034-3
[9] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[10] B. McFee, J. Salamon, and J. P. Bello, "Adaptive pooling operators for weakly labeled sound event detection," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 26, no. 11, pp. 2180–2193, Nov. 2018. [Online]. Available: https://doi.org/10.1109/TASLP.2018.2858559
[11] Q. Kong, Y. Xu, I. Sobieraj, W. Wang, and M. D. Plumbley, "Sound event detection and time–frequency segmentation from weakly labelled data," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 27, no. 4, pp. 777–787, Apr. 2019. [Online]. Available: https://doi.org/10.1109/TASLP.2019.2895254
[12] Z. Ren, Q. Kong, J. Han, M. D. Plumbley, and B. W. Schuller, "Attention-based atrous convolutional neural networks: Visualisation and understanding perspectives of acoustic scenes," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 56–60.
[13] S.-Y. Tseng, J. Li, Y. Wang, J. Szurley, F. Metze, and S. Das, "Multiple instance deep learning for weakly supervised audio event detection," 12 2017.
[14] A. Kumar and B. Raj, "Deep CNN framework for audio event recognition using weakly labeled web data," 07 2017.
[15] K. J. Piczak, "Environmental sound classification with convolutional neural networks," 2015, pp. 1–6.
[16] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
[17] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. [Online]. Available: https://arxiv.org/abs/1609.09430
[18] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, "Large-scale weakly supervised audio classification using gated convolutional neural network," 2018, pp. 121–125.
[19] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," ArXiv, vol. abs/1706.05587, 2017.
[20] K. Choi, G. Fazekas, and M. Sandler, "Automatic tagging using deep convolutional neural networks," 06 2016.
[21] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Audio Set classification with attention model: A probabilistic perspective," 2018, pp. 316–320.
[22] S.-Y. Chou, J.-S. R. Jang, and Y.-H. Yang, "FrameCNN: A weakly-supervised learning framework for frame-wise acoustic event detection and classification," 2017.
[23] T. Su, J. Liu, and Y. Yang, "Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks," 2017, pp. 791–795.
[24] R. Caruana, "Multitask learning," Machine Learning, vol. 28, pp. 41–75, 1997.
[25] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: an overview," 2013, pp. 8599–8603.
[26] R. Girshick, "Fast R-CNN," 2015, pp. 1440–1448.
[27] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning, ser. ICML '08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 160–167. [Online]. Available: https://doi.org/10.1145/1390156.1390177
[28] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," ArXiv, vol. abs/1904.03416, 2019.
[29] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio, "Multi-task self-supervised learning for robust speech recognition," 01 2020.
[30] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ser. ICML'15. JMLR.org, 2015, pp. 448–456.
[31] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on International Conference on Machine Learning, ser. ICML'10. Madison, WI, USA: Omnipress, 2010, pp. 807–814.
[32] S. Santurkar, D. Tsipras, A. Ilyas, and A. Mądry, "How does batch normalization help optimization?" in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS'18. Red Hook, NY, USA: Curran Associates Inc., 2018, pp. 2488–2498.
[33] S. Ruder, "An overview of multi-task learning in deep neural networks," ArXiv, vol. abs/1706.05098, 2017.
[34] J. Baxter, "A Bayesian/information theoretic model of learning to learn via multiple task sampling," Machine Learning, 1997, pp. 7–39.
[35] M. Lin, Q. Chen, and S. Yan, "Network in network," Y. Bengio and Y. LeCun, Eds., 2014. [Online]. Available: http://arxiv.org/abs/1312.4400
[36] A. Kolesnikov and C. H. Lampert, "Seed, expand and constrain: Three principles for weakly-supervised image segmentation," in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 695–711.
[37] A. Mesaros, T. Heittola, and T. Virtanen, "A multi-device dataset for urban acoustic scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), November 2018, pp. 9–13. [Online]. Available: https://arxiv.org/abs/1807.09840
[38] E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, X. Favory, J. Pons, and X. Serra, "General-purpose tagging of Freesound audio with AudioSet labels: Task description, dataset, and baseline," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), November 2018, pp. 69–73. [Online]. Available: https://arxiv.org/abs/1807.09902
[39] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," 2017, pp. 776–780.
[40] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognit., vol. 30, pp. 1145–1159, 1997.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[42] K. Choi, G. Fazekas, and M. B. Sandler, "Automatic tagging using deep convolutional neural networks," in ISMIR, 2016.
[43] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2015.
[44] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2015.
[45] L. Ford, H. Tang, F. Grondin, and J. Glass, "A deep residual network for large-scale acoustic scene analysis," in