Detection and Classification of Acoustic Scenes and Events 2020, 2–3 November 2020, Tokyo, Japan
CAPTURING SCATTERED DISCRIMINATIVE INFORMATION USING A DEEP ARCHITECTURE IN ACOUSTIC SCENE CLASSIFICATION
Hye-jin Shim∗, Jee-weon Jung∗, Ju-ho Kim, Ha-jin Yu†
University of Seoul, School of Computer Science, Seoul, South Korea
ABSTRACT
Frequently misclassified pairs of classes that share many common acoustic properties exist in acoustic scene classification (ASC). To distinguish such pairs of classes, trivial details scattered throughout the data could be vital clues. However, these details are less noticeable and are easily removed by conventional non-linear activations (e.g. ReLU). Furthermore, making design choices that emphasize trivial details can easily lead to overfitting if the system is not sufficiently generalized. In this study, based on an analysis of the ASC task's characteristics, we investigate various methods to capture discriminative information and simultaneously mitigate the overfitting problem. We adopt a max feature map method to replace conventional non-linear activations in a deep neural network, thereby applying an element-wise comparison between different filters of a convolution layer's output. Two data augmentation methods and two deep architecture modules are further explored to reduce overfitting and sustain the system's discriminative power. Various experiments are conducted using the detection and classification of acoustic scenes and events (DCASE) 2020 task 1-a dataset to validate the proposed methods. Our results show that the proposed system consistently outperforms the baseline, where the single best performing system achieves an accuracy of 70.4% compared to 65.1% for the baseline.
Index Terms — LCNN, CBAM, ASC, deep neural network
1. INTRODUCTION
The detection and classification of acoustic scenes and events (DCASE) community has been hosting multiple challenges to utilize sound event information generated in everyday environments and physical events [1–3]. DCASE challenges provide not only datasets for various audio-related tasks, but also a platform to compare and analyze the proposed systems. Among the many tasks covered in DCASE challenges, acoustic scene classification (ASC) is a multi-class classification task that classifies an input recording into one of the predefined scenes.

In the process of developing an ASC system, two major issues have been widely explored in the recent research literature. One is the generalization of the system under domain mismatch conditions that could arise from different recording devices [4–6]. More specifically, if an ASC system is not generalized towards unknown devices, performance on those devices degrades in the test phase. Another critical issue is the occurrence of frequently misclassified classes (e.g. shopping mall - airport, tram - metro) [7, 8]. Many acoustic characteristics coincide for these pairs of classes. Trivial details can be decisive clues for accurate classification; however, focusing on such details easily leads to a trade-off, thereby degrading generalization. In particular, due to the characteristics of the ASC task (see Section 2), discriminative information is scattered throughout the recording. However, widely used convolutional neural network (CNN)-based models that exploit the ReLU activation function make feature representations sparse, as ReLU discards negative values [9].

∗ Equal contribution. † Corresponding author. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (2020R1A2C1007081).

Figure 1: t-SNE visualization results of embeddings. (a) Devices, (b) Scenes.

To investigate the aforementioned problems, we present a visualization of the representation vectors (i.e. embeddings, codes) of the baseline using the t-SNE algorithm [10], depicted in Figure 1. Here, (a) and (b) refer to the plotted embeddings where different colors denote different device and scene labels, respectively. Figure 1-(a) shows that the devices do not form noticeable clusters, indicating good generalization. However, it can be seen in Figure 1-(b) that the scenes do not have clear decision boundaries. Therefore, leveraging this analysis, we focus on mitigating the frequently misclassified classes.

In this study, we explore several methods to reduce the removal of information and the overfitting based on the characteristics of the ASC task; this analysis is presented in Section 2. First, instead of a common CNN, we utilize a light CNN (LCNN) architecture [11]. LCNN is an architecture that adopts a max feature map (MFM) operation instead of non-linear activation functions such as ReLU or tanh. LCNN demonstrates state-of-the-art performance in spoofing detection for automatic speaker verification (i.e. audio spoofing detection) [12, 13]. Second, to mitigate overfitting, data augmentation and attention-based deep architecture modules are explored. Two data augmentation techniques, mix-up and SpecAugment, are investigated [14, 15]. The convolutional block attention module (CBAM) and squeeze-and-excitation (SE) networks are studied for enhancing the discriminative power using a minimum number of additional parameters [16, 17].
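For reference, a visualization like Figure 1 can be reproduced with an off-the-shelf t-SNE implementation. The following is a minimal sketch using scikit-learn, not the authors' exact code; the `embeddings.npy` and `labels.npy` files are hypothetical placeholders for the baseline's extracted embeddings and their device or scene labels.

```python
# Minimal sketch (assumed workflow, not the authors' code): project baseline
# embeddings to 2-D with t-SNE and colour the points by device or scene label.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load("embeddings.npy")  # hypothetical (N, D) embedding matrix
labels = np.load("labels.npy")          # hypothetical (N,) label array

# Project the D-dimensional codes to 2-D for plotting.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for lab in np.unique(labels):
    idx = labels == lab
    plt.scatter(points[idx, 0], points[idx, 1], s=4, label=str(lab))
plt.legend(fontsize=6)
plt.show()
```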
2. CHARACTERISTICS OF ASC
In this section, we present an analysis of the characteristics of the ASC task. We assume that the discriminative information for the ASC task included in an audio recording is scattered. Sound cues could occur either consistently or occasionally. For example, consistently occurring sound cues, such as a low degree of reverberation and the sound of the wind, imply an outdoor location. Various sound events such as the chirping of birds and the barking of dogs are also important cues, but they are impulsive and short, and they may only occur in some recordings that are labeled as "park". Therefore, important cues can have multiple characteristics; they are not concentrated in specific parts of the data, and they occur irregularly. In our analysis, gathering the scattered information that resides in an input recording is of interest.

In tasks such as speaker and image classification, the target information in the data is relatively clear. As speaker classification utilizes the human voice to identify speaker identity, the discriminative information is concentrated in the speech rather than in non-speech segments. Therefore, many studies on speaker classification attempt to remove non-speech segments using techniques such as voice activity detection (VAD). Similarly, many tasks in the image domain adopt various methods to focus only on the target object. Because of these differences between tasks such as speaker and image classification and the ASC task, we argue that different modeling approaches should be considered.

Audio spoofing detection is a task that shares similar characteristics with the ASC task in our analysis. Audio spoofing detection makes a binary decision on whether an input utterance is spoofed. In the case of audio spoofing detection, discriminative information is even more scattered because distortions occur over the entire audio file during the spoofing process. Therefore, non-speech segments are also important because the distortion is not limited to the speech segments. Previous studies also show that VAD can eliminate useful information [18, 19]. Considering these characteristics, and in order not to miss information, LCNN has been demonstrated to be particularly effective in audio spoofing detection [12, 13]. This is because, with a common CNN, relatively less informative parts (i.e. negative values) are removed by the ReLU activation, producing a sparse representation (as illustrated in Figure 2-(a)). This phenomenon has been reported in [11] to occur especially in the first few convolution layers.

We hypothesize that this phenomenon also applies to ASC systems, because the ASC task shares the property that important information is scattered across the data, similar to audio spoofing detection. To mitigate the problem of sparse representations in the ASC task, we propose to utilize the MFM operation included in the LCNN architecture. As the MFM operation selects feature maps through an element-wise competitive relationship, trivial information can be retained if its value is relatively high. Furthermore, focusing on trivial details could also lead to overfitting. Hence, in this study, we aim to adopt regularization methods, while introducing a minimum number of additional parameters and retaining the discriminative power of the system by applying state-of-the-art deep architecture modules.
3. PROPOSED FRAMEWORK

3.1. LCNN
LCNN is a deep learning architecture initially designed for face recognition with noisy labels [11]. Its main feature is a novel operation referred to as the max feature map (MFM), which replaces the non-linear activation function of a deep neural network (DNN). The MFM operation extends the concept of maxout activation [20] and adopts a competitive scheme between filters of a given feature map. In this study, we introduce the MFM operation to the ASC task based on two assumptions. To the best of our knowledge, this is the first report on such an implementation. First, we hypothesize that, compared to the widely used ReLU non-linearity that discards negative values, the MFM operation better preserves scattered discriminative information residing throughout an input feature map. Second, we note that MFM operations demonstrate state-of-the-art performance in audio spoofing detection, a task that shares common properties with ASC.

The MFM operation is implemented as follows. Let a be a feature map derived from a convolution layer, a ∈ R^{K×T×F}, where K, T, and F refer to the number of output channels, time-domain frames, and frequency bins, respectively. We split a along the channel axis into two feature maps a_1 and a_2, with a_1, a_2 ∈ R^{K/2×T×F}. The MFM-applied feature map is obtained by max(a_1, a_2), element-wise. Figure 2-(b) illustrates the MFM operation.

Figure 2: Comparison of the ReLU activation function (left) and MFM (right). Orange, green, and white indicate negative, positive, and zero values, respectively. ReLU removes all negative values, while MFM takes the element-wise maximum based on a competitive relationship. (a) ReLU, (b) MFM.

Specifically, our design of the LCNN is similar to that of [12], with some modifications. The architecture of [12] is itself a modified version of the original LCNN [11], applying additional batch normalization after each max pooling layer. Table 1 provides details of the architecture of the proposed system that adopts an LCNN.

Table 1: The LCNN architecture. The numbers in the output shape column refer to the frame (time), frequency, and the number of kernels. MFM, MaxPool, and FC indicate max feature map, max pooling layer, and fully-connected layer, respectively.

Conv_a, MFM_a, BatchNorm, Conv, MFM, and CBAM can be seen as one block, and four blocks are stacked to contain an adequate number of parameters. The number of blocks was determined based on comparative experiments.
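To make the Conv-MFM pairing concrete, the following is a minimal PyTorch-style sketch written for this description; it is not the authors' released implementation, and the kernel size and channel counts are placeholders rather than the values in Table 1.

```python
# Minimal sketch of a Conv + MFM pair as described above (not the authors'
# code). The convolution produces 2K channels, which are split into two
# K-channel halves; their element-wise maximum replaces a ReLU activation.
import torch
import torch.nn as nn

class ConvMFM(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # Produce 2 * out_channels so the max over two halves yields out_channels maps.
        self.conv = nn.Conv2d(in_channels, 2 * out_channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.conv(x)                    # (batch, 2K, T, F)
        a1, a2 = torch.chunk(a, 2, dim=1)   # two (batch, K, T, F) halves
        return torch.maximum(a1, a2)        # element-wise competitive selection

x = torch.randn(8, 1, 250, 128)             # e.g. 250-frame, 128-bin inputs
print(ConvMFM(1, 32)(x).shape)               # torch.Size([8, 32, 250, 128])
```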
With limited labelled data and recent DNNs with many parameters, overfitting easily occurs in DNN-based ASC systems [3, 8, 14, 15, 21]. To account for overfitting, our design choices include data augmentation methods for generalization and deep architecture modules for enhanced model capacity. For regularization, we adopt two data augmentation methods: mix-up [14] and SpecAugment [15]. Let x_i and x_j be two audio recordings that belong to classes y_i and y_j, respectively, where y is a one-hot vector. A mix-up operation creates an augmented audio recording with a corresponding soft label from the two recordings. Formally, the augmented recording and its label can be denoted as:

x' = λ x_i + (1 − λ) x_j,
y' = λ y_i + (1 − λ) y_j,        (1)

where λ is a real value between 0 and 1, drawn from a Beta(α, α) distribution with α ∈ (0, ∞). Despite its rather simple implementation, mix-up effectively regularizes the model.

SpecAugment masks blocks of consecutive time frames and frequency bins in the input feature. Let x ∈ R^{T×F} be a Mel-filterbank energy feature extracted from an input audio recording, where T and F are the number of frames and Mel-frequency bins, respectively, and t and f are indices over T and F. To apply time masking, we randomly select t_stt and t_end, t_stt ≤ t_end ≤ T, where stt and end denote the start and end indices, and mask the selected frames with 0. To apply frequency masking, we randomly select f_stt and f_end, f_stt ≤ f_end ≤ F, and mask the selected bins with 0. In this study, we sequentially apply SpecAugment and mix-up for better generalization.

To increase model capacity while introducing a minimum number of additional parameters, we investigate two recent deep architecture modules: SE [16] and CBAM [17]. SE focuses on the relationship between the different channels of a given feature map. It first squeezes the input feature map via a global average pooling layer to derive a channel descriptor that includes the global spatial (time and frequency in ASC) context. Then, using a minimal number of additional parameters, SE re-calibrates channel-wise dependencies via an excitation step. Specifically, the excitation step adopts two fully-connected layers that take the derived channel descriptor as input and output a re-calibrated channel descriptor. SE transforms the given feature map by multiplying it with the re-calibrated channel descriptor, where each value in the channel descriptor is broadcast to conduct element-wise multiplication with the corresponding filter of the feature map. In experiments that incorporate the SE module, we apply SE to the output of each residual block, following the literature. Further details regarding the SE module can be found in [16].

CBAM is a deep architecture module that sequentially applies channel attention and spatial attention. To derive a channel attention map, CBAM applies global max and average pooling over the spatial domain and then uses two fully-connected layers. Channel attention is applied by element-wise multiplication of the input feature map with the channel attention map, where each value of the channel attention map is broadcast to fit the spatial domain. To derive a spatial attention map, CBAM applies two global pooling operations over the channel domain and then adopts a convolution layer. Spatial attention is applied by element-wise multiplication of the feature map after channel attention with the derived spatial attention map. In experiments using the CBAM module, we apply it to the output of each residual block, following the literature. Further details regarding the CBAM module can be found in [17].
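As an illustration of the attention modules described above, here is a minimal PyTorch-style sketch of a CBAM-like block, simplified from [17] rather than taken from the authors' implementation; SE [16] corresponds roughly to the channel-attention step using only the average-pooled branch.

```python
# Minimal sketch of a CBAM-style block (simplified reading of [17], not the
# authors' code). Channel attention re-weights channels from pooled global
# descriptors; spatial attention re-weights time-frequency positions.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Shared two-layer MLP for channel attention (SE uses a similar
        # excitation step on the average-pooled descriptor only).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: average- and max-pool over the time-frequency plane.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: channel-wise mean and max, then a convolution.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))
```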
4. EXPERIMENTS

4.1. Dataset
We use the DCASE2020 task 1-a dataset for all our experiments. It includes 23,040 audio recordings sampled at 44.1 kHz with 24-bit resolution, where each recording has a duration of 10 s. The dataset contains audio recordings from three real devices (A, B, and C) and six augmented devices (S1-S6). Unless explicitly mentioned, all performances in this paper are reported using the official DCASE2020 fold 1 configuration, which assigns 13,965 recordings to the training set and 2,970 recordings to the test set.
Mel-spectrograms with 128 Mel-filterbanks are used for all experiments, where the number of FFT bins, window length, and shift size are set to 2,048, 40 ms, and 20 ms, respectively. During the training phase, we randomly select 250 consecutive frames (5 s) instead of using the whole recording. In the test phase, an audio recording is split into three overlapping sub-recordings (i.e. 0-5 s, 2.5-7.5 s, and 5-10 s), and the mean of the output layer is used to perform classification. This technique has been reported to mitigate overfitting in previous works [7, 22].

We use an SGD optimizer with a batch size of 24. The initial learning rate is set to 0.001 and scheduled with warm restarts of stochastic gradient descent [23]. For a single system, we train the DNN in an end-to-end fashion. For the ensemble system, support vector machine (SVM) classifiers are employed. Further technical details are provided in our technical report to facilitate reproduction of the conducted experiments [24].
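A minimal sketch of the stated front-end is given below, assuming a standard librosa pipeline rather than the authors' exact feature-extraction code; the function and file names are placeholders.

```python
# Minimal sketch (assumed librosa pipeline, not the authors' code) of the
# described front-end: 128-bin log Mel-spectrograms with a 2,048-point FFT,
# 40 ms window and 20 ms shift at 44.1 kHz, plus a random 250-frame crop.
import numpy as np
import librosa

SR = 44100
WIN = int(0.040 * SR)   # 40 ms window -> 1764 samples
HOP = int(0.020 * SR)   # 20 ms shift  -> 882 samples

def extract_logmel(path: str) -> np.ndarray:
    y, _ = librosa.load(path, sr=SR, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=2048, win_length=WIN, hop_length=HOP, n_mels=128)
    return librosa.power_to_db(mel).T        # (frames, 128)

def random_crop(feat: np.ndarray, frames: int = 250) -> np.ndarray:
    # Randomly select 250 consecutive frames (about 5 s) during training.
    start = np.random.randint(0, max(1, feat.shape[0] - frames))
    return feat[start:start + frames]
```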
Table 2: Baseline comparison with other systems. Classification accuracies are reported using the DCASE2020 fold 1 configuration.

System | Acc (%)
DCASE2019 baseline [2] | 46.5
DCASE2020 baseline [25] | 54.1
Ours-baseline | 65.1
5. RESULTS AND ANALYSIS

Table 2 compares the baseline of this study with two official baselines of the DCASE community. The DCASE2019 baseline takes log Mel-spectrograms as input and uses convolution and fully-connected layers. The DCASE2020 baseline takes L3 embeddings [26] extracted from another DNN as input and uses fully-connected layers for classification. Our baseline uses Mel-spectrograms as input and comprises convolution, batch normalization [27], and Leaky ReLU [28] layers with residual connections [29], where an SE module follows each residual block. The results show that our baseline outperforms the DCASE2020 baseline by over 10% in classification accuracy. (The model architecture and the accuracies per device and per scene are presented in our technical report for the DCASE2020 challenge.)

Table 3 describes the effectiveness of the proposed approaches using LCNN, SE, and CBAM. It also compares the effects of using mix-up and/or SpecAugment data augmentation. First, comparing the system architectures without any data augmentation or deep architecture modules, ResNet and LCNN achieve accuracies of 65.1% and 67.1%, respectively. To optimize the LCNN system, we also adjusted the number of blocks and found that the original LCNN with four blocks achieves the best performance. Second, we validate the effectiveness of data augmentation. The results show that mix-up and SpecAugment are both effective, and that using a combination of the two methods is the best choice. Third, we apply the deep architecture modules SE and CBAM. From these results, we observe that CBAM is slightly better than SE.

Table 4 presents a comparison of the frequently misclassified pairs of acoustic scenes between the baseline and the proposed system.
Table 4: Comparison of the number of frequently misclassified pairs of acoustic scenes between the baseline and the proposed system. Reduction refers to the reduction in the number of confusions between the two classes.

Class | Baseline | Proposed | Reduction
Metro - Tram | 114 | 81 | 33
Shopping - Airport | 107 | 101 | 6
Shopping - Metro st | 84 | 56 | 28
Shopping - Street ped | 83 | 88 | -5
Public square - Street ped | 74 | 70 | 4
Total | 462 | 396 | 66

Table 5: Results of our submitted systems for the DCASE2020 challenge task 1-a.

With the ensemble system, classification accuracy increased to 71.7%.
6. CONCLUSION
In this paper, we assumed that the information enabling classification between different scenes with similar characteristics is scattered throughout the recordings of the ASC task. In the case of a shopping mall and an airport, the two scenes share common characteristics: both are indoors, reverberant, and filled with a babble of voices. Therefore, trivial details could be important cues for distinguishing the two classes. Based on this hypothesis, we proposed a method that is expected to better capture such discriminative information. We applied a deep architecture combining LCNN and the CBAM module, together with two data augmentation methods, mix-up and SpecAugment. The proposed method helped to improve system performance with little additional computation and few overhead parameters. We achieved an accuracy of 70.4% with the single best performing system, compared to 65.1% for the baseline.

Also submitted to the DCASE2020 workshop; the authors will add a citation if accepted.
7. REFERENCES

[1] M. D. Plumbley, C. Kroos, J. P. Bello, G. Richard, D. P. Ellis, and A. Mesaros, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). Tampere University of Technology, Laboratory of Signal Processing, 2018.
[2] M. Mandel, J. Salamon, and D. P. W. Ellis, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). NY, USA: New York University, October 2019.
[3] M. D. McDonnell and W. Gao, "Acoustic scene classification using deep residual networks with late fusion of separated high and low frequency paths," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 141–145.
[4] S. Gharib, K. Drossos, E. Cakir, D. Serdyuk, and T. Virtanen, "Unsupervised adversarial domain adaptation for acoustic scene classification," arXiv preprint arXiv:1808.05777, 2018.
[5] P. Primus and D. Eitelsebner, "Acoustic scene classification with mismatched recording devices," Tech. Rep., DCASE2019 Challenge, 2019.
[6] M. Kosmider, "Calibrating neural networks for secondary recording devices," in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), New York, NY, USA, 2019, pp. 25–26.
[7] H.-S. Heo, J.-w. Jung, H.-j. Shim, and H.-J. Yu, "Acoustic scene classification using teacher-student learning with soft-labels," Proc. Interspeech 2019, pp. 614–618, 2019.
[8] J.-w. Jung, H. Heo, H.-j. Shim, and H.-J. Yu, "Distilling the knowledge of specialist deep neural networks in acoustic scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, NY, USA, October 2019, pp. 114–118.
[9] X. Wu, R. He, and Z. Sun, "A lightened CNN for deep face representation," ArXiv, vol. abs/1511.02683, 2015.
[10] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[11] X. Wu, R. He, Z. Sun, and T. Tan, "A light CNN for deep face representation with noisy labels," IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
[12] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, "STC antispoofing systems for the ASVspoof2019 challenge," arXiv preprint arXiv:1904.05576, 2019.
[13] C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, "ASSERT: Anti-spoofing with squeeze-excitation and residual networks," arXiv preprint arXiv:1904.01120, 2019.
[14] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.
[15] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," Proc. Interspeech 2019, pp. 2613–2617, 2019.
[16] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[17] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[18] I. Lapidot and J.-F. Bonastre, "Effects of waveform PMF on anti-spoofing detection," Proc. Interspeech 2019, pp. 2853–2857, 2019.
[19] H. Dinkel, Y. Qian, and K. Yu, "Small-footprint convolutional neural network for spoofing detection," IEEE, 2017, pp. 3086–3091.
[20] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," arXiv preprint arXiv:1302.4389, 2013.
[21] S. Mun, S. Park, D. K. Han, and H. Ko, "Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane," Proc. DCASE, pp. 93–97, 2017.
[22] J.-w. Jung, H.-s. Heo, H.-j. Shim, and H.-j. Yu, "DNN based multi-level feature ensemble for acoustic scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), November 2018, pp. 113–117.
[23] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2016.
[24] H.-j. Shim, J.-h. Kim, J.-w. Jung, and H.-j. Yu, "Audio tagging and deep architectures for acoustic scene classification: UOS submission for the DCASE 2020 challenge," DCASE2020 Challenge, Tech. Rep., 2020.
[25] T. Heittola, A. Mesaros, and T. Virtanen, "Acoustic scene classification in DCASE 2020 challenge: generalization across devices and low complexity solutions," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 2020, submitted. [Online]. Available: https://arxiv.org/abs/2005.14623
[26] J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello, "Look, listen and learn more: Design choices for deep audio embeddings," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 3852–3856. [Online]. Available: https://ieeexplore.ieee.org/document/8682475
[27] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448–456.
[28] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.