A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection

Qing Wang, Jun Du, Hua-Xin Wu, Jia Pan, Feng Ma, and Chin-Hui Lee, Fellow, IEEE
Abstract—In this paper, we propose a novel four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection (SELD). First, we explore two spatial augmentation techniques, namely audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with data sparsity in SELD. ACS and MCS focus on augmenting the limited training data by expanding direction of arrival (DOA) representations such that the acoustic models trained with the augmented data are robust to localization variations of acoustic sources. Next, time-domain mixing (TDM) and time-frequency masking (TFM) are also investigated to deal with overlapping sound events and data diversity. Finally, ACS, MCS, TDM and TFM are combined in a step-by-step manner to form an effective four-stage data augmentation scheme. Tested on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 data sets, our proposed augmentation approach greatly improves the system performance, ranking our submitted system in the first place in the SELD task of DCASE 2020 Challenge. Furthermore, we employ a ResNet-Conformer architecture to model both global and local context dependencies of an audio sequence to yield further gains over those architectures used in the DCASE 2020 SELD evaluations.
Index Terms—Spatial data augmentation, sound event detection, sound source localization, direction of arrival, Conformer.
I. INTRODUCTION

SOUND event localization and detection (SELD) is a task to detect the presence of individual sound events and localize their arriving directions. Humans can correctly identify and localize multiple sound events overlapping both temporally and spatially in an audio signal, but it is very challenging for machines. However, effective SELD is of great importance in many applications. For instance, SELD-enabled robots are able to perform search and rescue missions when detecting the presence of a fire, an alarm, or a scream, and localizing them. In teleconferences, an active speaker can be recognized and tracked, making it possible to use beamforming techniques for enhancing speech and for improving automatic speech recognition (ASR) [1, 2]. Intelligent homes and smart
Q. Wang and J. Du are with the National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]; [email protected]). H.-X. Wu, J. Pan, and F. Ma are with iFlytek, Hefei 230088, China (e-mail: [email protected]; [email protected]; [email protected]). C.-H. Lee is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250 USA (e-mail: [email protected]).

cities can also employ SELD for acoustic scene analysis and audio surveillance [3, 4].

To solve the SELD problem, two key issues denoted as sound event detection (SED) and sound source localization (SSL) have to be addressed. SED aims to recognize individual sound events in an audio sequence together with their onset and offset times. Early SED methods [5–7] were developed from the ASR field, with Gaussian mixture models (GMMs) and hidden Markov models (HMMs) used for acoustic modeling. However, when overlapping events occur, the detection results were often unsatisfactory. Non-negative matrix factorization (NMF) based algorithms were used to learn a dictionary of basis vectors and then separate sound sources [8–10]. Nevertheless, this method is not robust in noisy environments. Recently, deep neural network (DNN) architectures in various forms have been successfully employed for SED. A feed-forward neural network was used for sound event classification, greatly outperforming support vector machines (SVMs) at low signal-to-noise ratio (SNR) levels [11]. Convolutional neural networks (CNNs) [12–14] and recurrent neural networks (RNNs) [15–17] were also adopted for the SED task. The capsule neural network (CapsNet) [18], originally proposed for image classification, was used to separate individual sound events from an overlapped mixture by selecting the most representative spectral features of each sound event [19, 20]. State-of-the-art results for the SED task were achieved by the convolutional recurrent neural network (CRNN) [21–23], a recently published architecture which combines CNN, RNN, and DNN together.

SSL aims to estimate the direction of arrival (DOA) for each sound source. Various algorithms have been proposed for DOA estimation. These approaches can be categorized into two kinds: parametric-based and DNN-based. Parametric DOA estimation approaches include multiple signal classification (MUSIC) [24], estimation of signal parameters via rotational invariance technique (ESPRIT) [25, 26], and steered response power phase transform (SRP-PHAT) [27–29], all of which rely on a sound field model. DNN-based approaches, however, do not rely on preassumptions about array geometries and have superior generalization ability to unseen scenarios because of their high regression capability [30–32]. The authors of [35] proposed a DOA estimation method for overlapping sources by combining sound intensity vector methods [33, 34] and DNN-based separation. A DNN-based phase difference enhancement for DOA estimation was proposed in [36], showing better results than direct regression to DOA representations.

Recent challenges on Detection and Classification of Acoustic Scenes and Events (DCASE) have attracted research attention, supporting the development of computational scene and event analysis methods by comparing different approaches using a common publicly available data set. The DCASE Challenge (http://dcase.community/challenge2020) consists of several audio-related tasks, one of which is SELD.
For supervised methods, one key factor that affects the system performance is the size of the training data. The development data set for the SELD task consists of only 600 60-second audio sequences recorded in noisy environments, making it very challenging to apply DNN-based techniques. In machine learning, data augmentation is an effective strategy to overcome the lack of training data and to alleviate overfitting. Such augmentation approaches have been widely used in many areas, such as ASR, sound classification, image classification, and computer vision [37–40]. For the SED task, time stretching, pitch shifting, equalized mixture data augmentation (EMDA) [41], and mixup [42] are effective in improving system performance [16, 43–45]. However, it is not appropriate to adopt these techniques for DOA estimation. The main reason is that when the abovementioned augmentation approaches are used to modify the audio signal, the spatial information may be affected in a hardly predictable way, and the new DOA labels must be updated correctly. For example, SpecAugment [46], a simple yet effective method to improve SED, brought only a minor gain for DOA estimation [47]. To the best of the authors' knowledge, there exist only a few data augmentation studies for DOA estimation. Mazzon's team first proposed a spatial augmentation method based on the properties of first-order Ambisonics (FOA) sound encoding [48]. It focused on expanding the representation of the DOA subspace for the FOA data set and was effective in reducing DOA errors. However, the authors did not investigate its effectiveness in dealing with overlapping sound events, which is necessary for future SELD applications in real-life acoustic scenes with potentially overlapping events.

In this study, we investigate a few novel approaches to spatial data augmentation for acoustic modeling in SELD. We first propose two techniques, namely audio channel swapping (ACS) and multi-channel simulation (MCS), to increase DOA representations of the limited training data. ACS is based on the physical and rotational properties of two data formats, tetrahedral microphone array (MIC) and FOA. Spatial augmentation using ACS for both MIC and FOA data sets is discussed in our previous study [49, 50], whereas the method proposed in [51] merely focused on the FOA data set. The MCS approach aims to simulate new multi-channel data by estimating spatial information carried by static non-overlapping sound events. A complex Gaussian mixture model (CGMM) [52] is used to estimate time-frequency (T-F) masks, and a generalized eigenvalue (GEV) beamformer [53] is employed to obtain enhanced spectra, which are combined with spatial information to simulate multi-channel data. In addition to ACS and MCS, we also adopt two other augmentation techniques for SELD, namely time-domain mixing (TDM), which randomly mixes two individual sound events in the time domain and is similar to EMDA [41], and time-frequency masking (TFM), which randomly drops several consecutive frames or frequency bins of spectral features [46, 47]. By combining these four complementary techniques in a stage-by-stage manner without the Conformer, our submitted ResNet-GRU based system [49] achieved the best performance for the SELD task of DCASE 2020 Challenge [54]. To further improve acoustic modeling in this study, we also adopt a Conformer which combines convolution and transformer and achieves state-of-the-art results in ASR [55].
The Conformer module is a novel combination of self-attention and convolution, with self-attention capturing global dependencies and convolution learning local features in an audio sequence. We incorporate the Conformer framework into the ResNet network used in our DCASE 2020 system to train acoustic models with the proposed four-stage data augmentation scheme. The resulting ResNet-Conformer architecture yields further gains over those systems used in the DCASE 2020 SELD evaluation.

Our major contributions can be summarized as follows:
1) presenting a novel ACS spatial augmentation method to expand both MIC and FOA data sets of DCASE 2020 Challenge based on the symmetrical distribution characteristics of the tetrahedral microphone array in the MIC data set;
2) proposing a novel MCS spatial augmentation technique to increase DOA representations for static sound events by estimating spatial information carried by audio signals;
3) incorporating a Conformer module into a ResNet system to form a ResNet-Conformer architecture that captures both global and local context dependencies in an audio sequence;
4) designing a set of comprehensive experiments for the DCASE 2020 SELD task to show the effectiveness of the proposed four-stage data augmentation approach to acoustic modeling for the proposed ResNet-Conformer architecture.

The remainder of the paper is organized as follows. Section II describes the spatial data augmentation approaches, especially ACS and MCS. Section III details the Conformer architecture. Experimental results and analysis are presented in Section IV. Finally, we conclude the paper in Section V.

II. FOUR-STAGE DATA AUGMENTATION
A. Audio Channel Swapping (ACS)
TAU-NIGENS, the development data set for the SELD task of DCASE 2020 Challenge [56], contains multiple spatial sound-scene recordings, generated by convolving randomly chosen isolated sound event examples from the NIGENS General Sound Events Database [57] with real-life room impulse responses (RIRs) collected using an em32 Eigenmike (https://mhacoustics.com/products) composed of 32 professional-quality microphones positioned on the surface of a rigid sphere. The corresponding reference DOAs are estimated acoustically from the extracted RIRs using a subspace MUSIC algorithm. Furthermore, each scene recording is delivered in two 4-channel spatial recording formats, MIC and FOA. The MIC set is extracted directly by selecting channels 6, 10, 26, and 22 of the Eigenmike, corresponding to a tetrahedral capsule arrangement.

Fig. 1. Top view of the arrangement of the four microphones in spherical coordinates from the z-axis for the MIC format data set: M1(45°, 35°), M2(−45°, −35°), M3(135°, −35°), and M4(−135°, 35°). Note that the azimuth angle increases counter-clockwise.
Defining φ and θ as the azimuth and elevation angles of the sound source, and R = 4.2 cm as the spherical microphone array radius, the MIC format has microphones arranged in the right-handed spherical coordinates of (φ, θ, R) = (45°, 35°, 4.2 cm), (−45°, −35°, 4.2 cm), (135°, −35°, 4.2 cm), and (−135°, 35°, 4.2 cm) as shown in Fig. 1, encoding a DOA with both time and level differences. For these four microphones mounted on a spherical baffle, an analytical expression for the directional array response is given by the expansion for the MIC format [56]:

H^{MIC}_m(φ_m, θ_m, φ, θ, ω) = \frac{1}{(ωR/c)^2} \sum_{n=0}^{∞} \frac{i^{n-1}}{h^{(2)\prime}_n(ωR/c)} (2n+1) P_n(\cos(γ_m))   (1)

where m is the channel number, (φ_m, θ_m) is the pair of azimuth and elevation angles of the specific microphone as shown in Fig. 1, ω = 2πf is the angular frequency, i is the imaginary unit, c = 343 m/s is the speed of sound, γ_m is the angle between the m-th microphone position and the DOA, P_n is the unnormalized Legendre polynomial of degree n, and h^{(2)\prime}_n is the derivative with respect to the argument of the spherical Hankel function of the second kind. From Eq. (1), we can see that the spatial response of the m-th channel is a function of the cosine of the angle between the microphone position and the DOA:

cos(γ_m) = sin(θ)sin(θ_m) + cos(θ)cos(θ_m)cos(φ − φ_m).   (2)

Ambisonics is another data format, which decomposes a sound field on the orthogonal basis of spherical harmonic functions. In this study, a first-order decomposition is used to generate the FOA data set. It is obtained by converting the 32-channel microphone array signals by means of encoding filters based on anechoic measurements of the Eigenmike array response, as detailed in [58]. The FOA signal consists of four channels (W, Y, Z, X), with W corresponding to an omnidirectional microphone and (Y, Z, X) corresponding to three bidirectional microphones aligned on the Cartesian axes. All four channels in the FOA format are space-coincident, offering only level differences and no time differences for a single DOA. With t and f as the T-F bin indexes, considering a point p(t, f) from a DOA in the short-time Fourier transform (STFT) domain given by azimuth angle φ and elevation angle θ, the sound field on the four FOA channels can be decomposed as

[W(t,f), Y(t,f), Z(t,f), X(t,f)]^T = [1, sin(φ)cos(θ), sin(θ), cos(φ)cos(θ)]^T p(t,f).   (3)

Using the SN3D normalization scheme of Ambisonics [59], the frequency-independent spatial response (steering vector) of the m-th channel H^{FOA}_m(φ, θ, f) for FOA is given by

H^{FOA}_1(φ, θ, f) = 1,   (4)
H^{FOA}_2(φ, θ, f) = sin(φ)cos(θ),   (5)
H^{FOA}_3(φ, θ, f) = sin(θ),   (6)
H^{FOA}_4(φ, θ, f) = cos(φ)cos(θ).   (7)

New DOA representations can be generated based on the spatial responses of the MIC and FOA data sets by applying transformations to the audio channels. For data with the MIC format, not only level but also time differences are encoded, thus the spatial responses of the augmented data must be exactly the same as those of the original data. Only level differences are encoded for the FOA format data, which means that there may exist sign inversions for the spatial responses of the augmented data. There is only a limited set of transformations that can be applied to the audio channels in order to keep the spatial responses of the MIC data unchanged. Specifically, channel swapping is used for the MIC data, and there are only eight allowable transformations to obtain effective audio data and the corresponding DOA representations.
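To make the geometry concrete, the following short Python sketch (our own illustration, not code from the paper) evaluates Eq. (2) for the four MIC capsules and the FOA steering vector of Eqs. (4)-(7); the last line verifies the FOA sign pattern that the (φ + π, θ) transformation of Table I below relies on.

```python
import numpy as np

# Microphone angles of the MIC format (Fig. 1), in radians: (azimuth, elevation).
MIC_ANGLES = np.radians([(45, 35), (-45, -35), (135, -35), (-135, 35)])

def cos_gamma(phi, theta):
    """Eq. (2): cosine of the angle between each microphone and a DOA (phi, theta)."""
    phi_m, theta_m = MIC_ANGLES[:, 0], MIC_ANGLES[:, 1]
    return (np.sin(theta) * np.sin(theta_m)
            + np.cos(theta) * np.cos(theta_m) * np.cos(phi - phi_m))

def foa_steering(phi, theta):
    """Eqs. (4)-(7): frequency-independent FOA (W, Y, Z, X) spatial responses."""
    return np.array([1.0,
                     np.sin(phi) * np.cos(theta),
                     np.sin(theta),
                     np.cos(phi) * np.cos(theta)])

# Rotating the DOA by pi in azimuth flips the signs of Y and X only,
# matching the corresponding row of Table I below.
phi, theta = np.radians(30.0), np.radians(20.0)
print(foa_steering(phi + np.pi, theta) / foa_steering(phi, theta))  # [1, -1, 1, -1]
```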
To obtain the same DOA labels for the FOA data, channel transformations can be applied to the FOA channels according to the spatial responses. Table I lists all eight DOA transformations (including the original one) for ACS spatial augmentation.

TABLE I
THE ACS AUGMENTATION APPROACH FOR BOTH MIC AND FOA DATA SETS. C_m AND C_m^{new} DENOTE THE m-TH CHANNEL DATA OF THE ORIGINAL AND AUGMENTED DATA SETS, RESPECTIVELY. EACH ENTRY LISTS (C_1^{new}, C_2^{new}, C_3^{new}, C_4^{new}).

DOA Transformation | MIC Dataset | FOA Dataset
φ → φ − π/2, θ → −θ | (C3, C1, C4, C2) | (C1, −C4, −C3, C2)
φ → −φ − π/2, θ → θ | (C4, C2, C3, C1) | (C1, −C4, C3, −C2)
φ → φ, θ → θ | (C1, C2, C3, C4) | (C1, C2, C3, C4)
φ → −φ, θ → −θ | (C2, C1, C4, C3) | (C1, −C2, −C3, C4)
φ → φ + π/2, θ → −θ | (C2, C4, C1, C3) | (C1, C4, −C3, −C2)
φ → −φ + π/2, θ → θ | (C1, C3, C2, C4) | (C1, C4, C3, C2)
φ → φ + π, θ → θ | (C4, C3, C2, C1) | (C1, −C2, C3, −C4)
φ → −φ + π, θ → −θ | (C3, C4, C1, C2) | (C1, C2, −C3, −C4)

Take one transformation, φ → φ + π, θ → θ, shown in Fig. 2, as an example. M1, M2, M3, and M4 are four microphones arranged on a spherical baffle to extract the MIC data. The azimuth and elevation angles of the four microphones are shown in Fig. 1. Considering an original sound source S from a DOA given by azimuth angle φ and elevation angle θ, the MIC format data can be denoted as (C1, C2, C3, C4), which means that the m-th channel data C_m is extracted by the m-th microphone M_m. By applying the DOA transformation, the newly generated sound source S_new has an azimuth angle φ + π and an elevation angle θ. It can be seen in Fig. 2 that the relative location relationship between the newly generated sound source S_new and the spherical microphone array stays unchanged. Due to the symmetry of the four-microphone arrangement, it is equivalent to obtain the multi-channel data (C4, C3, C2, C1) for sound source S_new, corresponding to swapping the 1st and 4th channels plus the 2nd and 3rd channels. From a theoretical perspective, after applying the DOA transformation (φ → φ + π, θ → θ), the spatial response of each channel for both the MIC and FOA data can be calculated according to Eqs. (2) and (4)-(7) to generate the augmented data accordingly. Note that ACS for the FOA data is also discussed in [51].

Fig. 2. One DOA transformation example: (φ → φ + π, θ → θ). M1, M2, M3, and M4 are the four microphones in the MIC dataset. R is the spherical microphone array radius. S denotes the original DOA with azimuth angle φ and elevation angle θ, and S_new denotes the new source with angles φ + π and θ.

The ACS augmentation approach is simple to implement. It can be applied to any sound event sample, whether non-overlapping or overlapping, whether static or moving, by directly performing transformations on the audio channels. The original DOA labels are limited to the domain where azimuth φ ∈ [−180°, 180°] and elevation θ ∈ [−90°, 90°]. After applying DOA transformations, it is easy to keep the augmented DOA labels in the same domain.
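As an illustration of how one row of Table I is applied in practice, the following minimal Python sketch (our own; array layouts and function names are assumptions, not the authors' code) transforms a pair of 4-channel MIC/FOA waveforms and the corresponding DOA label for the pattern (φ → φ + π, θ → θ):

```python
import numpy as np

# One ACS pattern from Table I: a new->old channel permutation for MIC and a
# (permutation, signs) pair for FOA, here for (phi -> phi + pi, theta -> theta).
MIC_PERM = [3, 2, 1, 0]            # C1new=C4, C2new=C3, C3new=C2, C4new=C1
FOA_PERM = [0, 1, 2, 3]            # FOA channels keep their order for this pattern
FOA_SIGN = [1.0, -1.0, 1.0, -1.0]  # (W, -Y, Z, -X)

def acs_transform(mic_wave, foa_wave, azimuth_deg, elevation_deg):
    """Apply one ACS pattern to 4-channel MIC/FOA waveforms (shape: [4, samples])
    and return the transformed waveforms together with the new DOA label."""
    mic_new = mic_wave[MIC_PERM, :]
    foa_new = foa_wave[FOA_PERM, :] * np.asarray(FOA_SIGN)[:, None]
    # Update the DOA label and wrap azimuth back into [-180, 180).
    az_new = (azimuth_deg + 180.0 + 180.0) % 360.0 - 180.0
    return mic_new, foa_new, az_new, elevation_deg
```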
B. Multi-channel Simulation (MCS)

Sound event samples are delivered as multi-channel data containing both spectral and spatial information. We propose a novel MCS augmentation technique to increase the diversity of DOA labels for non-overlapping and non-moving sound event segments. MCS consists of two steps, as shown in Fig. 3.
Fig. 3. Proposed multi-channel simulation (MCS) work flow. Step 1: a CGMM mask estimator and a GEV beamformer extract spectral information, and a covariance matrix captures spatial information from the original multichannel dataset. Step 2: randomly selected spectral and spatial information, after random phase perturbation and eigenvalue decomposition, are combined by a multichannel data simulator to generate the simulated multichannel dataset.

In the first step, a CGMM is used to estimate T-F masks that represent the probabilities of the T-F units being sound source or only noise [52]. Then we adopt a GEV beamformer [53] to extract the desired spectral vector. Meanwhile, the spatial vector is estimated by calculating the covariance matrix of the source signal. In the second step, we randomly select the spectral and spatial information and perform a random phase perturbation on the spectral part to guarantee a full-rank covariance matrix. Eigenvalue decomposition is used to calculate the eigenvalues and eigenvectors. Finally, a multi-channel simulator is adopted to generate simulated multi-channel data by combining spectral and spatial information.

In order to exploit the spatial features of both MIC and FOA data, we concatenate the two 4-channel spatial formats, resulting in an 8-channel signal. First, we collect all non-overlapping and non-moving sources in the official set. Considering a signal s(t) in the time domain, we have an array of M = 8 microphones in total to record sound samples, thus the observed signal at the m-th microphone can be written as

x_m(t) = \sum_τ h_m(τ) s(t − τ) + n_m(t)   (8)

where s(t) and n_m(t) denote the source and noise signal recorded at the m-th microphone, respectively, and h_m(τ) denotes the impulse response between the source and the m-th microphone. Via the STFT, the microphone array observation vector transformed into the T-F domain is given by

x(f,t) = h(f) S(f,t) + n(f,t)   (9)

with

x(f,t) = [X_1(f,t), X_2(f,t), ..., X_M(f,t)]^T   (10)
h(f) = [H_1(f), H_2(f), ..., H_M(f)]^T   (11)
n(f,t) = [N_1(f,t), N_2(f,t), ..., N_M(f,t)]^T   (12)

where x(f,t), h(f), and n(f,t) are the mixture vector, steering vector, and noise vector, respectively. S(f,t) is the target source signal and [·]^T denotes non-conjugate transposition. For simplicity, we use subscripts to denote f and t in the following formulations. Then the observed multi-channel data can be written as

x_{f,t} = [X_{f,t,1}, X_{f,t,2}, ..., X_{f,t,M}]^T.   (13)

We use a CGMM-based method proposed in [52] to estimate T-F masks representing the probabilities of the T-F units being a sound or only noise. Using the CGMM, the observed signals can be clustered into either one category containing sounds or the other containing only noise, and expressed as

x_{f,t} = h^{(v)}_f S^{(v)}_{f,t}  (where d_{f,t} = v)   (14)

where d_{f,t} denotes the category index at time frame t and frequency bin f. When v takes s, the category represents the sound source; when v takes n, the category represents noise. S^{(v)}_{f,t} ∼ N_c(0, φ^{(v)}_{f,t}) is assumed to follow a complex Gaussian distribution, so the observed multi-channel vector is assumed to follow a multivariate complex Gaussian distribution

x_{f,t} | d_{f,t} = v ∼ N_c(0, φ^{(v)}_{f,t} H^{(v)}_f)   (15)

where

H^{(v)}_f = h^{(v)}_f (h^{(v)}_f)^H   (16)

N_c(x | μ, Σ) = \frac{1}{|πΣ|} \exp(−(x − μ)^H Σ^{-1} (x − μ))   (17)

with (·)^H denoting conjugate transposition. φ^{(v)}_{f,t} and H^{(v)}_f are two CGMM parameters that can be estimated using a maximum likelihood (ML) criterion. Through expectation-maximization (EM), the T-F masks can be updated as follows:

λ^{(v)}_{f,t} ← \frac{p(x_{f,t} | d_{f,t} = v)}{\sum_v p(x_{f,t} | d_{f,t} = v)}   (18)

where p(x_{f,t} | d_{f,t} = v) = N_c(x_{f,t} | 0, φ^{(v)}_{f,t} H^{(v)}_f).
The probability of a T-F unit (f, t) being sound source or only noise can be measured by λ^{(v)}_{f,t} after convergence.

The spatial information of the sound source is contained in the multi-channel data. We estimate it by calculating the covariance matrix of the enhanced source, written as

R^{(s)}_f = \frac{1}{\sum_t λ^{(s)}_{f,t}} \sum_t λ^{(s)}_{f,t} x_{f,t} x^H_{f,t}   (19)

where λ^{(s)}_{f,t} denotes the probability of the T-F unit (f, t) being a sound source. Finally, we perform an energy normalization on the covariance matrix R^{(s)}_f to extract the spatial matrix S_f as follows:

S_f = \frac{M R^{(s)}_f}{\mathrm{tr}(R^{(s)}_f)}.   (20)

To estimate the spectral vector, we adopt a GEV beamformer and a single-channel post-filter as done in [53]. Our goal is to find a vector of optimal filter coefficients w_f = [W_{f,1}, W_{f,2}, ..., W_{f,M}]^T with which the beamformer output achieves the maximum signal-to-noise ratio (SNR) and is distortionless at the same time. The output can be written as

Ŝ_{f,t} = w^H_f x_{f,t}.   (21)

According to [53], the filter coefficient vector of the GEV beamformer, w_{SNR,f}, is the eigenvector corresponding to the largest eigenvalue of (Φ^n_f)^{-1} Φ^x_f, where Φ^x_f and Φ^n_f denote the cross power spectral density (PSD) matrices of the observed signal and the noise, respectively.

The optimal coefficient vector of the GEV beamformer is computed by maximizing the output SNR, which may introduce speech distortion. To obtain a distortionless source signal, a single-channel post-filter ω_f is added as follows:

w_f = ω_f w_{SNR,f}.   (22)

According to the blind analytical normalization method [53], the post-filter ω_f is obtained as

ω_f = \frac{\sqrt{w^H_{SNR,f} Φ^n_f Φ^n_f w_{SNR,f} / M}}{w^H_{SNR,f} Φ^n_f w_{SNR,f}}.   (23)

So far we have only estimated the spectral and spatial information for all non-overlapping and non-moving sound event segments. To simulate multi-channel data, two sound event segments containing such information are chosen. For the spatial case, it is obvious that S_f = S^H_f. The eigenvalue decomposition of such a conjugate-symmetric matrix S_f can be written as follows:

S_f = U Λ U^H   (24)

where Λ = diag(λ_1, λ_2, ..., λ_M) contains the eigenvalues and U = [u_1, u_2, ..., u_M] contains the corresponding eigenvectors. Eq. (24) can also be expressed as a sum of M components:

S_f = \sum_{m=1}^{M} λ_m u_m u^H_m.   (25)

We have spectral information Ŝ_{f,t} extracted from one sound event segment and spatial information S_f extracted from another. We want to simulate a new sound event segment whose SED label corresponds to that of the segment Ŝ_{f,t} belongs to and whose DOA label corresponds to that of the segment S_f belongs to. The simulated multi-channel data can be written as

x̂_{f,t} = \sum_{m=1}^{M} \sqrt{λ_m} Ŝ_{f,t} \exp(−T_m π j) u_m   (26)

where T_1 = 0 and T_m ∈ (0, T), m = 2, 3, ..., M. The term exp(−T_m π j) is used as a random phase perturbation to make the covariance matrix of the simulated signal full rank. A compact sketch of this simulation step is given below.
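The following numpy sketch reflects our own reading of Eqs. (19)-(20) and (24)-(26), with an assumed range for the random phase offsets T_m; it shows the two MCS building blocks, extracting the normalized spatial matrix and recombining it with an enhanced spectrum:

```python
import numpy as np

def spatial_template(X, lam_s, eps=1e-6):
    """Eqs. (19)-(20): mask-weighted covariance, energy-normalized per frequency.
    X: (M, F, T) complex STFT; lam_s: (F, T) sound mask. Returns S_f, shape (F, M, M)."""
    M = X.shape[0]
    XX = np.einsum('mft,nft->ftmn', X, X.conj())
    R = (lam_s[..., None, None] * XX).sum(axis=1) \
        / (lam_s.sum(axis=1)[:, None, None] + eps)
    return M * R / (np.trace(R, axis1=1, axis2=2)[:, None, None] + eps)

def simulate(S_hat, S_f, rng=np.random.default_rng()):
    """Eqs. (24)-(26): combine the enhanced spectrum S_hat (F, T) of one segment
    with the spatial template S_f (F, M, M) of another segment."""
    F, T = S_hat.shape
    M = S_f.shape[-1]
    eigval, eigvec = np.linalg.eigh(S_f)      # S_f is Hermitian, Eq. (24)
    eigval = np.maximum(eigval, 0.0)
    # Random phase offsets T_m (with T_1 = 0; the range is our assumption)
    # make the covariance of the simulated signal full rank.
    phases = np.exp(-1j * np.pi * np.concatenate(([0.0], rng.uniform(0, 2, M - 1))))
    X_sim = np.zeros((M, F, T), dtype=complex)
    for m in range(M):                        # Eq. (26), one eigen-component at a time
        X_sim += np.sqrt(eigval[:, m])[None, :, None] * S_hat[None] \
                 * phases[m] * eigvec[:, :, m].T[:, :, None]
    return X_sim
```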
C. Time-Domain Mixing (TDM)

When two sound events occur close to each other in time, it is more difficult to perform SED and SSL. To improve the generalization of our model in handling overlapping sources, we perform TDM by mixing two non-overlapping sources, as sketched below. The SELD labels for the augmented data are the union of the original labels of the two sound events. Although no new DOA is generated, TDM increases the number of overlapping training samples, which proves to be effective.
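A minimal sketch of TDM follows (our own illustration; the random gain is an assumption loosely following EMDA [41], and the label layout is hypothetical):

```python
import numpy as np

def time_domain_mix(wave_a, labels_a, wave_b, labels_b,
                    gain_db_range=(-3.0, 3.0), rng=np.random.default_rng()):
    """Mix two non-overlapping multi-channel event segments (shape: [channels, samples])
    and take the union of their SELD labels."""
    n = min(wave_a.shape[1], wave_b.shape[1])
    gain = 10.0 ** (rng.uniform(*gain_db_range) / 20.0)
    mixed = wave_a[:, :n] + gain * wave_b[:, :n]
    # Each label could be, e.g., (class_index, azimuth, elevation, onset, offset);
    # the augmented segment simply carries both event annotations.
    return mixed, list(labels_a) + list(labels_b)
```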
D. Time-Frequency Masking (TFM)
SpecAugment is a simple yet helpful augmentation method in ASR [46]. We find it useful for SED, but it may sometimes cause performance degradation for SSL. Nevertheless, we found SpecAugment to be effective for both SED and SSL when the training set is large. In this study, masks are applied randomly to the time and frequency dimensions of each input log Mel-spectrogram feature in each batch during training, as sketched below.
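A possible implementation of TFM on a single feature map, using the mask lengths reported in Section IV (the sampling details are our own assumptions):

```python
import numpy as np

def time_freq_mask(feat, max_t=35, hop_t=100, max_f=30, rng=np.random.default_rng()):
    """Randomly zero out time frames and frequency bins of one log Mel-spectrogram
    (shape: [frames, mel_bins]). One time mask of up to max_t frames is applied
    every hop_t frames, plus one frequency mask of up to max_f bins."""
    feat = feat.copy()
    n_frames, n_bins = feat.shape
    for start in range(0, n_frames, hop_t):
        t = rng.integers(0, max_t + 1)
        t0 = start + rng.integers(0, max(hop_t - t, 1))
        feat[t0:min(t0 + t, n_frames), :] = 0.0
    f = rng.integers(0, max_f + 1)
    f0 = rng.integers(0, max(n_bins - f, 1))
    feat[:, f0:f0 + f] = 0.0
    return feat
```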
III. RESNET-CONFORMER BASED SELD SYSTEM
In our submitted system for the DCASE 2020 Challenge [49, 50], we investigated several deep learning based acoustic models for the SELD task, which consist of high-level feature representation, temporal context representation, and full connection. The high-level feature representation module usually contains a series of CNN blocks, each having a 2D convolution layer followed by batch normalization, a rectified linear unit (ReLU), and a max-pooling operation. The temporal context representation module is adopted to model the temporal structures within sound events. We use two parallel branches in the fully-connected (FC) module to perform SED and SSL simultaneously, similar to the official baseline SELD system [56]. Moreover, we used modified versions of ResNet [60] and Xception [61] to learn local shift-invariant features. Besides the bidirectional gated recurrent unit (GRU) used in the baseline system, we also adopted a factorized time delay neural network (TDNN-F) [62] to exploit longer temporal context dependency in the audio signal.

The Conformer architecture, combining convolution and transformer [63], was proposed in [55] and achieved state-of-the-art results for ASR. Soon afterward, it was applied to continuous speech separation [64] and to sound event detection and separation in domestic environments [65]. Convolution layers are effective at extracting local fine-grained features, while transformer models are good at capturing long-range global context. Thus the Conformer is expected to be able to model both local and global context dependencies in an audio sequence.

In this paper, we examine the use of the Conformer for the SELD task. We use ResNet to extract local shift-invariant features. Then the Conformer is adopted to learn both local and global context representations. We call our acoustic model ResNet-Conformer. Fig. 4 shows an overview of the proposed architecture for the SELD task and a detailed Conformer implementation. As shown in the left panel of Fig. 4, two parallel branches contain two FC layers, each performing one of the SED and SSL subtasks, with output sizes N×1 and N×3, respectively. Note that N is equal to the number of sound event classes.
Fig. 4. A flow chart of the proposed ResNet-Conformer architecture for the SELD task (input FOA+MIC features → ResNet → B stacked Conformer modules → two fully-connected branches for SED and SSL) and a detailed implementation of the Conformer module.
Shown in the right dashed box in Fig. 4, the Conformer is composed of two feed-forward blocks that sandwich a multi-head self-attention (MHSA) block and a convolution block. The second feed-forward block is followed by a layer normalization process. A residual connection is added around each block. Assuming z is the input to the Conformer, the output o can be calculated through the intermediate representations ẑ, z′, and z″ as

ẑ = z + \frac{1}{2} FFN(z)   (27)
z′ = ẑ + MHSA(ẑ)   (28)
z″ = z′ + Conv(z′)   (29)
o = Layernorm(z″ + \frac{1}{2} FFN(z″))   (30)

where FFN(·), MHSA(·), Conv(·), and Layernorm(·) denote a feed-forward network block, a multi-head self-attention block, a convolution block, and a layer normalization process, respectively. The input ẑ to MHSA is first processed by layer normalization and then converted to the query Q, key K, and value V by performing linear projections as follows:

Multihead(Q, K, V) = [H_1, H_2, ..., H_h] W^O   (31)
H_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)   (32)
Attention(q, k, v) = softmax(\frac{q k^T}{\sqrt{d_k}}) v   (33)

where Q, K, V = Layernorm(ẑ) and h denotes the number of attention heads. W^Q_i, W^K_i ∈ R^{d×d_k} and W^V_i ∈ R^{d×d_v} are learnable parameter matrices for the i-th head. W^O ∈ R^{(h×d_v)×d} is the final linear parameter matrix applied to the concatenated feature vector. d, d_k, and d_v denote the dimensions of the input, key, and value, respectively, and [·] denotes the concatenation operation.

Conv(·) is illustrated in Fig. 5, which contains two pointwise convolution layers sandwiching a depthwise convolution layer. The first pointwise convolution layer is followed by ReLU activation, and the second pointwise convolution layer is followed by a dropout operation. Following the depthwise convolution layer are a batch normalization process and Swish activation. The feed-forward network, illustrated in Fig. 6, consists of two linear layers with a nonlinear activation in between.
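For concreteness, here is a compact PyTorch sketch of one Conformer module following Eqs. (27)-(33) and the block structures of Figs. 5 and 6; the feed-forward dimension, dropout rates, and default sizes are our own assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution block of Fig. 5: Layernorm, pointwise Conv + ReLU,
    depthwise Conv + Batchnorm + Swish, pointwise Conv + Dropout."""
    def __init__(self, d_model=512, kernel_size=51, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, d_model, 1)
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                       # x: (batch, frames, d_model)
        y = self.norm(x).transpose(1, 2)        # Conv1d expects (batch, d, frames)
        y = torch.relu(self.pw1(y))
        y = nn.functional.silu(self.bn(self.dw(y)))   # Swish activation
        y = self.drop(self.pw2(y))
        return y.transpose(1, 2)

class ConformerBlock(nn.Module):
    """One Conformer module implementing Eqs. (27)-(30)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.LayerNorm(d_model),           # Fig. 6 block
                                    nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Dropout(dropout),
                                    nn.Linear(d_ff, d_model), nn.Dropout(dropout))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.norm_mhsa = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.conv = ConvBlock(d_model, dropout=dropout)
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, z):                        # z: (batch, frames, d_model)
        z = z + 0.5 * self.ffn1(z)               # Eq. (27)
        q = self.norm_mhsa(z)
        z = z + self.mhsa(q, q, q, need_weights=False)[0]   # Eq. (28)
        z = z + self.conv(z)                     # Eq. (29)
        return self.norm_out(z + 0.5 * self.ffn2(z))        # Eq. (30)
```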
Fig. 5. A detailed implementation of the convolution block: Layernorm → pointwise Conv → ReLU activation → depthwise Conv → Batchnorm → Swish activation → pointwise Conv → Dropout.
Fig. 6. A detailed implementation of the feed forward network block: Layernorm → linear layer → ReLU activation → Dropout → linear layer → Dropout.

ReLU activation and dropout are used to help regularize the network.

Multitask learning is used to train the ResNet-Conformer model, as shown in Fig. 4. The output layers of the two branches consist of multiple targets to be predicted, including the active sound classes and the corresponding DOAs. A joint loss function is adopted to solve the SED and SSL subtasks simultaneously. The SED subtask is performed as a multi-label classification with a binary cross-entropy (BCE) loss. The SSL subtask is performed as a multi-output regression with a masked mean squared error (MSE) loss [23]. The multi-objective loss function to be minimized can be expressed as

L = −\frac{α_1}{T} \sum_t \sum_n y^{SED}_{t,n} \log ŷ^{SED}_{t,n} + \frac{α_2}{T} \sum_t \sum_n ‖(ŷ^{SSL}_{t,n} − y^{SSL}_{t,n}) y^{SED}_{t,n}‖^2   (34)

where ŷ^{SED}_{t,n} and ŷ^{SSL}_{t,n} are the active-probability estimate and DOA estimate for the n-th sound event at the t-th frame, respectively. Correspondingly, y^{SED}_{t,n} and y^{SSL}_{t,n} are the reference versions. Both ŷ^{SSL}_{t,n} and y^{SSL}_{t,n} are 3-dimensional Cartesian representations of the DOA. T denotes the total number of frames in a minibatch. The SED classification loss and SSL regression loss are combined for joint optimization during training, with loss weights α_1 equal to 1 and α_2 equal to 10.
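A minimal PyTorch reading of Eq. (34) follows (a sketch; the exact normalization and the use of the full BCE form mentioned in the text are our interpretation):

```python
import torch

def seld_loss(sed_pred, ssl_pred, sed_ref, ssl_ref, alpha1=1.0, alpha2=10.0):
    """Joint SELD loss in the spirit of Eq. (34): BCE on active-class probabilities
    plus a masked MSE on Cartesian DOA vectors, averaged over frames.
    Shapes: sed_* (T, N) with sed_pred in [0, 1]; ssl_* (T, N, 3)."""
    bce = torch.nn.functional.binary_cross_entropy(sed_pred, sed_ref,
                                                   reduction='mean')
    # The y^SED mask zeroes the DOA error of inactive events.
    masked_sq = ((ssl_pred - ssl_ref) * sed_ref.unsqueeze(-1)) ** 2
    return alpha1 * bce + alpha2 * masked_sq.sum(dim=(1, 2)).mean()
```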
IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

We evaluate SELD on the official development set of Task 3 in the DCASE 2020 Challenge, called TAU-NIGENS Spatial Sound Events 2020 [56]. It contains 600 60-second audio recordings with a 24 kHz sampling rate. They are divided into six splits, with four for training, one for validation, and the last split for testing. In total, there are 14 sound classes of spatial events, as listed in Table II. The four proposed data augmentation approaches described in Section II are used to expand the development data set.

We extract two types of features for each of the two data sets, FOA and MIC. Using the STFT with a Hamming window of length 1024 samples and a 50% overlap, a linear spectrogram is extracted for each channel. Then a 64-dimensional log Mel-spectrogram feature vector is extracted for both data sets. The second type of features is format-specific. For the FOA data set, the acoustic intensity vector (IV) computed in each of the 64 Mel-bands is extracted, while for the MIC data set, the generalized cross-correlation phase transform (GCC-PHAT) computed in each of the 64 Mel-bands is extracted, similar to [23]. Finally, there are 4 channels of log Mel-spectrogram features and 3 channels of IV features, hence up to 7 feature maps for FOA signals. For MIC signals, there are 4 channels of log Mel-spectrogram features and 6 channels of GCC-PHAT features, hence up to 10 feature maps. We use both the FOA and MIC data sets, so 17 input feature maps are used to train the models.

The TFM augmentation approach is applied to each acoustic feature in each batch. For every acoustic feature, we multiply masks on the time and frequency dimensions of the first 11 feature maps. The last 6 feature maps containing DOA information are not masked. The time mask length is randomly selected from zero to 35 frames, and masking is applied every 100 frames. The frequency mask length is randomly selected from zero to 30 bins.

A joint measurement of the localization and detection performance of sound events is performed as suggested in [66]. Location-dependent detection metrics that count correct and erroneous detections within certain spatial error allowances, and classification-dependent localization metrics that measure the spatial error between sound events with the same label, are used to evaluate the SED and SSL performances, respectively. To compute the SED metrics, some intermediate statistics, such as true positives (TP), false positives (FP, or insertion errors I), false negatives (FN, or deletion errors D), and substitution errors S, need to be counted first. Considering that a TP is predicted only when the spatial error for the detected event is within the given threshold of 20° from the reference, two location-dependent detection metrics, error rate (ER_{20°}) and F-score (F_{20°}), are then calculated as follows:

P = \frac{TP}{TP + FP},   R = \frac{TP}{TP + FN}   (35)

F_{20°} = \frac{2PR}{P + R},   ER_{20°} = \frac{D + I + S}{N}   (36)

where N is the total number of reference sound events, and P and R denote the precision and recall metrics, respectively. Classification-dependent localization metrics are computed only within each class, instead of across all outputs. The first is the localization error LE_{CD}, which expresses the average angular distance between predictions and references of the same class and can be calculated as

LE_{CD} = \arccos(u_{ref} · u_{pre})   (37)

where u_{ref} and u_{pre} denote the unit Cartesian position vectors of the reference sound event and the predicted sound event, respectively. The subscript CD refers to classification-dependent.
The second is a simple localization recall metric, LR_{CD}, which expresses the true positive rate of how many localization estimates are detected in a class out of the total class instances. All metrics are computed in one-second non-overlapping segments to alleviate the effect of onset/offset subjectivity in the reference annotations. With these four metrics, an early stopping SELD score, SELD_{score}, can be computed as follows:
SELD_{score} = \frac{ER_{20°} + (1 − F_{20°}) + LE′_{CD} + (1 − LR_{CD})}{4}   (38)

where LE′_{CD} = LE_{CD}/π, with LE_{CD} expressed in radians. The SELD_{score} is an overall performance metric for the SELD task. The model with the smallest SELD_{score} on the validation split is chosen as the best model.
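The classification-dependent angular distance and the aggregate score of Eqs. (37)-(38) reduce to a few lines (a sketch; the per-segment matching and thresholding logic of [66] is omitted):

```python
import numpy as np

def localization_error(u_ref, u_pre):
    """Eq. (37): angular distance (radians) between unit DOA vectors, shape (..., 3)."""
    dot = np.clip(np.sum(u_ref * u_pre, axis=-1), -1.0, 1.0)
    return np.arccos(dot)

def seld_score(er20, f20, le_cd, lr_cd):
    """Eq. (38): aggregate early-stopping metric; le_cd is in radians."""
    return (er20 + (1.0 - f20) + le_cd / np.pi + (1.0 - lr_cd)) / 4.0
```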
TABLE II
THE SOUND CLASSES OF THE SPATIAL EVENTS IN DCASE 2020 CHALLENGE.

Index | Sound Class
0 | Alarm
1 | Crying Baby
2 | Crash
3 | Barking Dog
4 | Running Engine
5 | Female Scream
6 | Female Speech
7 | Burning Fire
8 | Footsteps
9 | Knocking Door
10 | Male Scream
11 | Male Speech
12 | Ringing Phone
13 | Piano
TABLE III
A PERFORMANCE COMPARISON FOR DIFFERENT MODELS ON THE DEVELOPMENT SET WITHOUT DATA AUGMENTATION.

System | ER_{20°} | F_{20°} | LE_{CD} | LR_{CD} | SELD_{score}
Baseline-MIC | 0.78 | 31.4% | – | – | –
Audio clips with a length of 60 seconds are used for training all model architectures with an Adam optimizer [67]. The learning rate is set to 0.001 and is decreased by 50% if the SELD_{score} on the validation split does not improve within 80 consecutive epochs. A threshold of 0.5 is used to assess the predicted results of the SELD model. For the Conformer module, the number of attention heads h is set to 8, and the dimension of the attention vector d is set to 512. For simplicity, d_k and d_v are both set to 64. We use a kernel size of 51 for the depthwise convolution. The module number B shown in Fig. 4 is set to 8. All experiments in this study were performed using the PyTorch toolkit [68].
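The optimization setup described above can be wired up as follows (a sketch with a placeholder model and a dummy validation score; not the authors' training code):

```python
import torch
import torch.nn as nn

model = nn.Linear(17 * 64, 14)                       # placeholder, not the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate when the validation SELD score stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=80)

best = float('inf')
for epoch in range(200):
    val_score = torch.rand(1).item()                 # replace with the real SELD_score
    scheduler.step(val_score)                        # smaller SELD_score is better
    if val_score < best:                             # early-stopping model selection
        best = val_score
        torch.save(model.state_dict(), 'best_seld.pt')
```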
A. Results Based on Different Acoustic Models

First, in Table III, we compare different acoustic models on SELD without using any data augmentation. The official baseline SELDnet system [56] is compared with the best ResNet-GRU model proposed in our submitted system for the DCASE 2020 Challenge [49, 50]. In addition, the performance of the ResNet-Conformer is also compared.

The first two rows represent the official baseline and our ResNet-GRU systems trained with only the MIC data set. "Baseline-FOA" and "ResNet-GRU-FOA" are compared in the third and fourth rows of Table III using the FOA data. The proposed ResNet-GRU architecture outperforms the baseline SELDnet for both MIC and FOA formats on all four evaluation metrics. The main reason may be that the use of residual connections helps the models capture more useful shift-invariant local features from the input acoustic features. "ResNet-GRU-Both", shown in the next row, is trained with the concatenated features extracted from both the MIC and FOA data sets, yielding higher scores than the models trained with the separate feature sets. By replacing the GRU module with the Conformer module, "ResNet-Conformer-Both" achieves consistent improvements on both SED and SSL metrics over "ResNet-GRU-Both", which demonstrates that the Conformer is more effective in modeling context dependencies than the GRU. Compared to the two official baseline systems, the best ResNet-Conformer model achieves 37.2% and 31.9% relative improvements on the SELD scores, respectively.
Fig. 7. An example comparison of ResNet-GRU and ResNet-Conformer: (a) spectrogram; (b) SED prediction of ResNet-GRU; (c) SED reference; (d) SED prediction of ResNet-Conformer.

Fig. 7 illustrates a visualization of the SED prediction of the class indices of sound events, as listed in Table II, using ResNet-GRU and ResNet-Conformer without data augmentation. The SED-predicted indices of ResNet-Conformer are more accurate than those of ResNet-GRU. In the first 20 seconds, two sound events, i.e., "Crying Baby" (with index 1) and "Running Engine" (with index 4), occur at the same time. ResNet-GRU could not reliably detect the "Crying Baby" event segments, but ResNet-Conformer predicts these segments correctly, as shown in the green dashed rectangular boxes, which proves the effectiveness of the Conformer networks. As shown in Fig. 7(a) and Fig. 7(c), the "Barking Dog" (with index 3) event happens twice with a period of time between the two occurrences. However, ResNet-GRU only recognizes the second instance and wrongly predicts the first occurrence of "Barking Dog" as "Crying Baby". ResNet-Conformer detects both occurrences, as shown in the black dashed rectangular box of Fig. 7(d), indicating its superiority in modeling short segments. From about 35 to 50 seconds, there are two sound events, "Alarm" (with index 0) and "Footsteps" (with index 8). ResNet-Conformer wrongly recognizes them as "Piano" at the beginning, but once a longer sequence can be observed, it corrects the error, as shown in the blue dashed rectangular box of Fig. 7(d), and correctly predicts the two sound events. ResNet-GRU, however, wrongly recognizes them as "Piano" even after a long duration and predicts them as three separate sound events. This example shows the Conformer's superiority over the GRU: the Conformer is more likely than the GRU to capture both local and global context dependencies.
TABLE IV
A PERFORMANCE COMPARISON OF DIFFERENT ACS APPROACHES USING THE RESNET-GRU MODEL.

System | ER_{20°} | F_{20°} | LE_{CD} | LR_{CD} | SELD_{score}
[51] | 0.44 | 63.9% | – | – | –

TABLE V
A PERFORMANCE COMPARISON WHEN APPLYING FOUR AUGMENTATION APPROACHES INDIVIDUALLY USING THE RESNET-GRU MODEL.

System | ER_{20°} | F_{20°} | LE_{CD} | LR_{CD} | SELD_{score}
ResNet-GRU | 0.63 | 47.6% | – | – | –

B. Results Based on ACS Spatial Augmentation
In [51], the authors proposed an augmentation method using the FOA data set, containing sixteen patterns of spatial augmentation, whereas our proposed ACS approach can be applied not only to the FOA but also to the MIC data set. A performance comparison between the two is shown in Table IV. To make a fair comparison, we apply the method in [51] to augment the same amount of training data as our proposed ACS approach. "ACS-FOA" denotes the system trained only with the FOA set, while "ACS" denotes the system trained with both the FOA and MIC sets. The difference of "ACS-FOA" from [51] is that only the eight patterns/transformations in Table I were adopted in our approach. It is noted that similar results are obtained by these two systems. This indicates that the eight patterns adopted by ACS already contain enough useful DOA information, and adding the other eight patterns may lead to information redundancy, since they just apply reflections with respect to the xy-plane (z = 0) when compared with the eight patterns adopted by ACS. By comparing the bottom two rows in Table IV, we can see that ACS outperforms ACS-FOA when applying spatial augmentation to the MIC set as well. This result verifies the usefulness of feature fusion of both the FOA and MIC data sets, which is also shown in subsection IV-A.

C. Results Based on Individual Data Augmentation
We next adopt ResNet-GRU as the acoustic model to compare the system performances, listed in Table V, when applying the four data augmentation approaches, namely ACS, MCS, TDM, and TFM, individually. All four augmentation approaches generate a similar amount of training data.

We can make the following observations: (i) all four metrics yield gains, except for the LE_{CD} metric of the TFM approach. Since only several hours of audio data are available, applying masks on log Mel-spectrogram features may not bring a performance gain to the SSL task, which has also been observed in [47]; (ii) the ACS and MCS approaches achieve consistent improvements on both SED and SSL metrics. This demonstrates that increasing DOA representations is very effective for the SELD task. The ACS approach can be applied to all sound event segments in the development data set, while the MCS approach can only be applied to the non-overlapping and non-moving sound event segments. Thus the overall SELD_{score} of the ACS approach is slightly better than that of the MCS approach; and (iii) for the TDM approach, both SED and SSL metrics improve even though no new DOA representation is generated, indicating that mixing two non-overlapping sound signals in the time domain helps model robustness to unseen samples. In summary, relative to the ResNet-GRU system with no data augmentation, ACS, MCS, TDM, and TFM individually yield 32.5%, 30.0%, 22.5%, and 10.0% relative SELD_{score} reductions, respectively.
D. Results Based on Four-Stage Data Augmentation
We next evaluate the system performance when using all four augmentation techniques. The four-stage data augmentation scheme was used to exploit the complementarity among the four approaches, and our submitted ResNet-GRU ensemble system ranked first for the SELD task of the DCASE 2020 Challenge [54]. Since ACS can be applied to the whole development data set, we perform ACS on the original data in the first stage. MCS aims to simulate new DOA representations for static non-overlapping sound events, on which the TDM approach can then be applied. So MCS is performed in the second stage and TDM in the third stage. With a larger data set now available, we apply TFM in the final stage.

Tables VI and VII list performance comparisons when applying the four-stage data augmentation scheme using ResNet-GRU and ResNet-Conformer, respectively. The first and second columns denote the systems and the corresponding training data sizes, respectively. ACS is performed on the original data in the first stage, generating a 55-hour training set. Then we apply MCS to the 55-hour set, generating a larger 155-hour set. TDM and TFM are conducted in a similar way, and finally a 255-hour training set is obtained. For ResNet-GRU, each augmentation approach achieves performance gains on the SED and SSL metrics. When applying ACS on the original data set, "S2" achieves a SELD_{score} of 0.27, down from 0.40 for S1 without any augmentation. When applying MCS, TDM, and TFM separately on the original data set used in S1, the SELD_{score}s are worse than with the ACS approach, as shown in Table V. However, by using the proposed four-stage data augmentation scheme, consistent performance gains are obtained in "S3", "S4", and "S5". Compared to the model without data augmentation in S1, these four systems achieve 32.5%, 40.0%, 45.0%, and 55.0% relative SELD_{score} reductions, respectively.

Next, the four-stage framework is also evaluated using the proposed ResNet-Conformer model. As shown in Table VII, the SELD_{score} without using data augmentation for S6 is 0.32 in the top row, yielding a 20% relative reduction from 0.40 for ResNet-GRU in S1. When comparing performances between ResNet-Conformer and ResNet-GRU, the gains from Table VI to Table VII are gradually reduced when ACS, MCS, TDM and TFM are brought in step-by-step, demonstrating the effectiveness of the proposed data augmentation approaches. The bottom rows of Tables VI and VII, after applying all four techniques, show only slight performance differences. Clearly, for deep learning, the four-stage scheme can largely increase the data diversity and thus improve the generalization ability of acoustic models.
TABLE VI
A PERFORMANCE COMPARISON BY COMBINING FOUR AUGMENTATION APPROACHES. (S1: RESNET-GRU, S2: S1+ACS, S3: S2+MCS, S4: S3+TDM, S5: S4+TFM)

System | Size (h) | ER_{20°} | F_{20°} | LE_{CD} | LR_{CD} | SELD_{score}
S1 | 8 | 0.63 | 47.6% | – | – | –

TABLE VII
A PERFORMANCE COMPARISON BY COMBINING FOUR AUGMENTATION APPROACHES. (S6: RESNET-CONFORMER, S7: S6+ACS, S8: S7+MCS, S9: S8+TDM, S10: S9+TFM)

System | Size (h) | ER_{20°} | F_{20°} | LE_{CD} | LR_{CD} | SELD_{score}
S6 | 8 | 0.51 | 58.6% | – | – | –

Fig. 8 shows an example of the SED prediction using the ResNet-Conformer model with and without four-stage data augmentation. For the two segments from the beginning to 15 seconds and from 25 to 40 seconds, ResNet-Conformer predicts correct results both with and without data augmentation. As shown in the blue dashed rectangular boxes, when data augmentation is not used, the model tends to wrongly predict the shorter sound events; with data augmentation, the model is able to output correct predictions. In the last 20 seconds, the model trained without data augmentation cannot recognize the overlapping sound events, and instead predicts two wrong events, as shown in the black dashed rectangular box of Fig. 8(b). However, the model correctly predicts the overlapping sound events when adopting the proposed four-stage data augmentation approach.

Fig. 8. An example comparison of ResNet-Conformer with or without four-stage data augmentation: (a) spectrogram; (b) SED prediction without data augmentation; (c) SED reference; (d) SED prediction with data augmentation.
V. CONCLUSION

This study focuses on data augmentation and acoustic modeling for the SELD task. Two novel spatial augmentation approaches, namely ACS and MCS, are proposed to deal with data sparsity in deep learning based acoustic modeling. The ACS approach can be applied to all sound event segments, while the MCS approach is suitable for static non-overlapping audio segments; both aim at increasing DOA representations. We adopt a four-stage data augmentation scheme
to improve the performance step-by-step. We also employ a Conformer architecture, which combines convolution and transformer together to model both global and local context dependencies in an audio sequence, and propose a ResNet-Conformer architecture. Experiments carried out on the development data set of the DCASE 2020 Challenge have shown the effectiveness of the data augmentation approaches. Further improvement is achieved by the proposed ResNet-Conformer, yielding significant gains over our best deep architectures.
ACKNOWLEDGMENT
The authors would like to thank Yuxuan Wang, Tairan Chen, Zijun Jing and Yi Fang for their help with some of the experiments.
REFERENCES

[1] H. Wang and P. Chu, "Voice source localization for automatic camera pointing system in videoconferencing," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., vol. 1, 1997, pp. 187–190.
[2] P. Swietojanski, A. Ghoshal, and S. Renals, "Convolutional neural networks for distant speech recognition," IEEE Signal Process. Lett., vol. 21, no. 9, pp. 1120–1124, 2014.
[3] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, "Scream and gunshot detection and localization for audio-surveillance systems," in Proc. IEEE Conf. Adv. Video Signal Based Surveillance, 2007, pp. 21–26.
[4] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, "Audio surveillance of roads: A system for detecting anomalous sounds," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 1, pp. 279–288, 2015.
[5] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Audio context recognition using audio event histograms," in Proc. Eur. Signal Process. Conf., 2010, pp. 1272–1276.
[6] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, "Acoustic event detection in real life recordings," in Proc. Eur. Signal Process. Conf., 2010, pp. 1267–1271.
[7] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Context-dependent sound event detection," EURASIP J. Audio, Speech, Music Process., vol. 2013, no. 1, p. 1, 2013.
[8] J. F. Gemmeke, L. Vuegen, P. Karsmakers, B. Vanrumste et al., "An exemplar-based NMF approach to audio event detection," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2013, pp. 1–4.
[9] A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, "Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 151–155.
[10] T. Komatsu, T. Toizumi, R. Kondo, and Y. Senda, "Acoustic event detection method using semi-supervised non-negative matrix factorization with a mixture of local dictionaries," in Proc. Detection Classification Acoust. Scenes Events Workshop, 2016, pp. 45–49.
[11] I. McLoughlin, H. Zhang, Z. Xie, Y. Song, and W. Xiao, "Robust sound event classification using deep neural networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 3, pp. 540–552, 2015.
[12] K. J. Piczak, "Environmental sound classification with convolutional neural networks," in Proc. IEEE Int. Workshop Mach. Learning Signal Process., 2015, pp. 1–6.
[13] H. Zhang, I. McLoughlin, and Y. Song, "Robust sound event recognition using convolutional neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 559–563.
[14] H. Phan, L. Hertel, M. Maass, and A. Mertins, "Robust audio event recognition with 1-max pooling convolutional neural networks," in Proc. Interspeech, 2016, pp. 3653–3657.
[15] Y. Wang, L. Neves, and F. Metze, "Audio-based multimedia event detection using deep recurrent neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, pp. 2742–2746.
[16] G. Parascandolo, H. Huttunen, and T. Virtanen, "Recurrent neural networks for polyphonic sound event detection in real life recordings," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, pp. 6440–6444.
[17] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, and K. Takeda, "Duration-controlled LSTM for polyphonic sound event detection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 11, pp. 2059–2070, 2017.
[18] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3856–3866.
[19] F. Vesperini, L. Gabrielli, E. Principi, and S. Squartini, "Polyphonic sound event detection by using capsule neural networks," IEEE J. Sel. Topics Signal Process., vol. 13, no. 2, pp. 310–322, 2019.
[20] Y. Liu, J. Tang, Y. Song, and L. Dai, "A capsule based approach for polyphonic sound event detection," in Proc. Asia-Pacific Signal Inf. Process. Assoc., 2018, pp. 1853–1857.
[21] E. Cakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 6, pp. 1291–1303, 2017.
[22] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 34–48, 2018.
[23] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, "Polyphonic sound event detection and localization using a two-stage strategy," in Proc. Detection Classification Acoust. Scenes Events Workshop, 2019, pp. 30–34.
[24] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Trans. Antennas Propag., vol. 34, no. 3, pp. 276–280, 1986.
[25] R. Roy and T. Kailath, "ESPRIT-estimation of signal parameters via rotational invariance techniques," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 7, pp. 984–995, 1989.
[26] H. Teutsch and W. Kellermann, "Acoustic source detection and localization based on wavefield decomposition using circular microphone arrays," J. Acoust. Soc. Amer., vol. 120, no. 5, pp. 2724–2736, 2006.
[27] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Process., vol. 24, no. 4, pp. 320–327, 1976.
[28] M. S. Brandstein and H. F. Silverman, "A robust method for speech signal time-delay estimation in reverberant rooms," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., vol. 1, 1997, pp. 375–378.
[29] H. Do, H. F. Silverman, and Y. Yu, "A real-time SRP-PHAT source location implementation using stochastic region contraction (SRC) on a large-aperture microphone array," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., vol. 1, 2007, pp. 121–124.
[30] E. L. Ferguson, S. B. Williams, and C. T. Jin, "Sound source localization in a multipath environment using convolutional neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2018, pp. 2386–2390.
[31] Z.-M. Liu, C. Zhang, and S. Y. Philip, "Direction-of-arrival estimation based on deep neural networks with robustness to array imperfections," IEEE Trans. Antennas Propag., vol. 66, no. 12, pp. 7315–7327, 2018.
[32] S. Adavanne, A. Politis, and T. Virtanen, "Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network," in Proc. Eur. Signal Process. Conf., 2018, pp. 1462–1466.
[33] D. Pavlidi, S. Delikaris-Manias, V. Pulkki, and A. Mouchtaris, "3D localization of multiple sound sources with intensity vector estimates in single source zones," in Proc. Eur. Signal Process. Conf., 2015, pp. 1556–1560.
[34] S. Hafezi, A. H. Moore, and P. A. Naylor, "Augmented intensity vectors for direction of arrival estimation in the spherical harmonic domain," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1956–1968, 2017.
[35] M. Yasuda, Y. Koizumi, S. Saito, H. Uematsu, and K. Imoto, "Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2020, pp. 651–655.
[36] J. Pak and J. W. Shin, "Sound localization based on phase difference enhancement using deep neural networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1335–1345, 2019.
[37] X. Cui, V. Goel, and B. Kingsbury, "Data augmentation for deep neural network acoustic modeling," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1469–1477, 2015.
[38] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Process. Lett., vol. 24, no. 3, pp. 279–283, 2017.
[39] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," arXiv preprint arXiv:1712.04621, 2017.
[40] P. Y. Simard, D. Steinkraus, J. C. Platt et al., "Best practices for convolutional neural networks applied to visual document analysis," in Proc. Int. Conf. Docum. Anal. Recognit., 2003, pp. 958–963.
[41] N. Takahashi, M. Gygli, and L. Van Gool, "Aenet: Learning deep audio features for video analysis," IEEE Trans. Multimedia, vol. 20, no. 3, pp. 513–524, 2017.
[42] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.
[43] R. Lu and Z. Duan, "Bidirectional GRU for sound event detection," DCASE2017 Challenge, Tech. Rep., September 2017.
[44] N. Takahashi, M. Gygli, B. Pfister, and L. Van Gool, "Deep convolutional neural networks and data augmentation for acoustic event detection," arXiv preprint arXiv:1604.07160, 2016.
[45] K. Shimada, N. Takahashi, S. Takahashi, and Y. Mitsufuji, "Sound event localization and detection using activity-coupled cartesian DOA vector and RD3Net," arXiv preprint arXiv:2006.12014, 2020.
[46] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.
[47] J. Zhang, W. Ding, and L. He, "Data augmentation and prior knowledge-based regularization for sound event localization and detection," DCASE2019 Challenge, Tech. Rep., June 2019.
[48] L. Mazzon, Y. Koizumi, M. Yasuda, and N. Harada, "First order ambisonics domain spatial augmentation for DNN-based direction of arrival estimation," in
Proc. Detection Classification Acoust. ScenesEvents Workshop , 2019, pp. 154–158.[49] Q. Wang, H. Wu, Z. Jing, F. Ma, Y. Fang, Y. Wang,T. Chen, J. Pan, J. Du, and C.-H. Lee, “The USTC-iFlytek system for sound event localization and detection ofDCASE2020 challenge,” DCASE2020 Challenge, Tech. Rep., July2020. [Online]. Available: http://dcase.community/challenge2020/task-sound-event-localization-and-detection-results
Accepted by 12th Int. Symp. Chinese SpokenLang. Process. , 2021.[51] L. Mazzon, M. Yasuda, Y. Koizumi, and N. Harada, “Sound eventlocalization and detection using foa domain spatial augmentation,”DCASE2019 Challenge, Tech. Rep., June 2019.[52] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, “Robust MVDRbeamforming using time-frequency masks for online/offline ASR innoise,” in
Proc. IEEE Int. Conf. Acoust., Speech Signal Process. , 2016,pp. 5210–5214.[53] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming basedon generalized eigenvalue decomposition,”
IEEE/ACM Trans. Audio,Speech, Lang. Process. , vol. 15, no. 5, pp. 1529–1539, 2007.[54] DCASE2020, “Sound event localization and detection chal-lenge results,” DCASE2020 Challenge, Tech. Rep., July2020. [Online]. Available: http://dcase.community/challenge2020/task-sound-event-localization-and-detection-results[55] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han,S. Wang, Z. Zhang, Y. Wu et al. , “Conformer: Convolution-augmentedtransformer for speech recognition,” arXiv preprint:2005.08100 , 2020.[56] A. Politis, S. Adavanne, and T. Virtanen, “A dataset of reverberant spatialsound scenes with moving sources for sound event localization anddetection,” arXiv preprint:2006.01919 , 2020.[57] I. Trowitzsch, J. Taghia, Y. Kashef, and K. Obermayer, “The NIGENSgeneral sound events database,” arXiv preprint arXiv:1902.08314 , 2019.[58] A. Politis and H. Gamper, “Comparing modeled and measurement-basedspherical harmonic encoding filters for spherical microphone arrays,” in
Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. , 2017, pp.224–228. [59] J. Daniel, “Repr´esentation de champs acoustiques, application `a latransmission et `a la reproduction de sc`enes sonores complexes dansun contexte multim´edia,” Ph.D. dissertation, Univ. of Paris VI, France,2000. [Online]. Available: http://gyronymo.free.fr[60] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recogit. , 2016,pp. 770–778.[61] F. Chollet, “Xception: Deep learning with depthwise separable convolu-tions,” in
Proc. IEEE Conf. Comput. Vision Pattern Recogit. , 2017, pp.1251–1258.[62] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, andS. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deepneural networks.” in
Proc. Interspeech , 2018, pp. 3743–3747.[63] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
Adv. NeuralInf. Process. Syst. , 2017, pp. 5998–6008.[64] S. Chen, Y. Wu, Z. Chen, J. Li, C. Wang, S. Liu, and M. Zhou,“Continuous speech separation with conformer,” arXiv preprintarXiv:2008.05773 , 2020.[65] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, andK. Takeda, “Convolution-augmented transformer for semi-supervisedsound event detection,” DCASE2020 Challenge, Tech. Rep., June 2020.[66] A. Mesaros, S. Adavanne, A. Politis, T. Heittola, and T. Virtanen, “Jointmeasurement of localization and detection of sound events,” in
Proc.IEEE Workshop Appl. Signal Process. Audio Acoust. , 2019, pp. 333–337.[67] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 , 2014.[68] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation inPyTorch,” 2017.
Qing Wang received the B.S. and Ph.D. degrees from the Department of Electronic Engineering and Information Science, University of Science and Technology of China (USTC), Hefei, China, in 2012 and 2018, respectively. From July 2018 to February 2020, she worked at Tencent on single-channel speech enhancement. She is currently a postdoctoral researcher at USTC. Her research interests include speech enhancement, robust speech recognition, acoustic scene classification, and sound event localization and detection.
Jun Du received the B.Eng. and Ph.D. degrees from the Department of Electronic Engineering and Information Science, University of Science and Technology of China (USTC), in 2004 and 2009, respectively. From 2004 to 2009, he was with the iFlytek Speech Lab of USTC. During that period, he worked twice as an intern, for a total of nine months, with Microsoft Research Asia (MSRA), Beijing. In 2007, he was also a Research Assistant for six months with the Department of Computer Science, The University of Hong Kong. From July 2009 to June 2010, he was with iFlytek Research, working on speech recognition. From July 2010 to January 2013, he was with MSRA as an Associate Researcher, working on handwriting recognition, OCR, and speech recognition. Since February 2013, he has been with the National Engineering Laboratory for Speech and Language Information Processing (NEL-SLIP) of USTC.
Hua-Xin Wu received the B.E. degree from Southeast University in 2016. Since 2016, he has been with iFlytek Research, working on multimodal speech recognition and keyword spotting. His current research interests include keyword spotting and sound event detection.
Jia Pan received the B.S. and M.S. degrees in 2006 and 2009, respectively, from the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China, where he is currently working toward the Ph.D. degree. Since 2009, he has been with iFlytek Research, working on speech recognition and spoken dialogue systems. His current research interests include speech recognition and machine learning.
Feng Ma received the B.Eng. and M.S. degrees from the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China, in 2009 and 2012, respectively. He is currently with iFlytek Research, Hefei, China. His current research interests include acoustic echo cancellation, microphone arrays, and robust speech recognition.