An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances
Hu Hu, Sabato Marco Siniscalchi, Yannan Wang, Xue Bai, Jun Du, Chin-Hui Lee
School of Electrical and Computer Engineering, Georgia Institute of Technology, USA
Computer Engineering School, University of Enna, Italy
Tencent Media Lab, Tencent Corporation, Shenzhen, Guangdong, China
University of Science and Technology of China, Hefei, China
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
In this paper, we propose a sub-utterance unit selection framework to remove acoustic segments in audio recordings that carry little information for acoustic scene classification (ASC). Our approach is built upon a universal set of acoustic segment units covering the overall acoustic scene space. First, those units are modeled with acoustic segment models (ASMs) used to tokenize acoustic scene utterances into sequences of acoustic segment units. Next, paralleling the idea of stop words in information retrieval, stop ASMs are automatically detected. Finally, acoustic segments associated with the stop ASMs are blocked, because of their low indexing power in retrieval of most acoustic scenes. In contrast to building scene models with whole utterances, the ASM-removed sub-utterances, i.e., acoustic utterances without stop acoustic segments, are then used as inputs to the AlexNet-L back-end for final classification. On the DCASE 2018 dataset, scene classification accuracy increases from 68%, with whole utterances, to 72.1%, with segment selection. This represents a competitive accuracy without any data augmentation and/or ensemble strategy. Moreover, our approach compares favourably to AlexNet-L with attention.
Index Terms: acoustic scene classification, acoustic segment models, stop words detection, convolutional neural network
1. Introduction
Acoustic scene classification (ASC) refers to the task of classifying real-life sounds into environment classes, such as metro station, street traffic, or public square. An acoustic scene sound contains much information and rich content, which makes accurate scene prediction difficult. ASC has been an active research field for decades, and the IEEE Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [1, 2, 3] provides benchmark data and a competitive platform to promote sound scene research and analyses. In recent years, we have witnessed that deep neural networks (DNNs) have gradually dominated the design of top ASC systems, and the main ingredient of their success is the application of deep convolutional neural networks (CNNs) [4, 5, 6, 7]. Furthermore, with the use of advanced deep learning techniques, such as attention mechanisms [8, 9, 10] and deep network based data augmentation [11, 12, 13], a further boost in ASC system performance can be obtained.

In this study, we leverage acoustic segment models (ASMs) as an indicator of the indexing power of the input audio segment units with respect to the acoustic scenes being classified. A set of ASM models is employed to carry out acoustic segment selection in the front-end. An initial ASM sequence for each given audio recording is obtained by unsupervised segmentation and clustering. Next, we use Gaussian mixture model (GMM)- or deep neural network (DNN)-based hidden Markov models (GMM/DNN-HMMs) [14, 15] to model the ASM sequences in a semi-supervised manner. Thus, each audio recording is segmented (tokenized) into a sequence of acoustic segment units, each having its relative ASM unit index. In this work, the terms acoustic segment units, acoustic segments, sub-utterance units, and tokens are used interchangeably. A strategy similar to detecting stop words in information retrieval [16] is adapted, and a set of stop ASMs is identified using the training data. Stop ASMs represent meaningless ASM units, which carry very low indexing power in retrieving most acoustic scenes. Just like the words 'the', 'an', and 'or' in document retrieval problems, stop ASMs are therefore not useful to identify the target scene class. Those stop ASMs are used at the front-end level to block all audio segments, consisting of sequences of acoustic frames, belonging to those stop ASM models in both the training and evaluation stages. In doing so, noisy acoustic segments are eliminated when building models at the training stage, and are not sent to the back-end acoustic scene classifier during the classification stage. The proposed approach is evaluated on the DCASE 2018 Task1a data set. Our experiments demonstrate that our solution improves an AlexNet-like system, dubbed AlexNet-L, boosting the classification accuracy from 68.0% to 72.1%. The latter is a competitive result since neither data augmentation nor system combination are used. Furthermore, our segment-selection scheme with ASMs compares favourably with a recently proposed CNN classifier using an ASM-based attention mechanism [17].
2. Related Work and Our Contributions
Several approaches based on feature learning have been proposed for the ASC task. For example, low-level features [18, 19, 20], which are directly extracted from the input signal at the front-end level, have been thoroughly investigated to boost ASC performance. With deep models, mid-level features [21, 22, 23, 24] are instead induced from a DNN hidden layer, which takes into account the overall information embedded in the training set. In addition to these low-level or mid-level features, the use of raw waveforms to feed end-to-end systems has also been investigated in [25, 26]. However, to the best of the authors' knowledge, all of those feature learning works on ASC employ the whole input audio recording at the input layer to obtain high-dimensional feature vectors, and there are no investigations concerned with segment selection at the front-end level.

From a human listening perspective, sound recognition is often guided by detecting prominent acoustic events and/or audio cues useful to identify particular acoustic scenes [27]. For example, human listeners may leverage a car horn sound to determine that it comes from a street traffic scene, or a loud plane engine sound to determine that it comes from an airport. Those sounds, generated from car horns and plane engines, have stronger indexing power than other sounds for classifying these two acoustic scenes. Hence we argue that we are bound to get better ASC accuracy if we can block acoustic segments with little indexing power. Our idea could be related to an attention mechanism [8, 9, 10, 17], which uses an ad-hoc internal connectionist block and a huge amount of data to weight hidden internal representations according to their salience to the target outputs. However, an attention mechanism requires an extra amount of parameters, and its performance highly depends on model tuning. Our approach instead introduces a well-known technique from the information retrieval field [16] to detect meaningless sound events, and the experimental evidence confirms our claim.

In order to find the semantic salience of sound events, we use ASMs. ASMs are a set of self-organized sound units that are intended to cover the overall acoustic characteristics using available training data [28]. The ASM framework has recently been adopted in many audio sequence classification tasks, such as language identification [29], speaker identification [30], emotion recognition [31], music genre classification [32] and acoustic scene classification [33]. As for ASC, it makes the assumption that the acoustic characteristics of all scenes can be covered by a universal set of acoustic units. Thus, input audio recordings can be transformed into ASM sequences, which are in turn processed by latent semantic analysis (LSA) [34] to obtain feature vectors with semantic information. Finally, a CNN based ASC system with an attention mechanism using ASM units is proposed in [17]. Different from the conventional ASM framework, in this work, ASM sequences are not used for a follow-up feature extraction process. In the experimental section, we demonstrate that our front-end solution outperforms that with the attention mechanism in [17].
3. Acoustic Segment Modeling
Like the phoneme representation of a speech utterance, we assume that the sound characteristics of acoustic scenes can also be covered by a universal set of acoustic units. The ASM approach aims to build a tokenizer that transforms the scene audio into a sequence of ASMs, i.e., the acoustic units specified in an acoustic inventory. The ASM sequence is generated in two main steps: (i) an unsupervised approach is used to seed the initial ASMs, with each acoustic unit having a fixed length (acoustic segment), and (ii) either a GMM-HMM or a DNN-HMM system is built on top of the initial ASMs and then used to generate the ASM sequence for a given audio recording.
ASM initialization is a critical factor for the success of the ASM framework. The initial ASM sequence generation is performed at the feature level, i.e., on log-mel filter bank (LMFB) energies or mel frequency cepstral coefficients (MFCCs). First, a given input audio recording is divided into a sequence of fixed-length segments, each consisting of a fixed number of LMFB or MFCC frames (20 frames in our experiments; see Section 5). The arithmetic mean of these frames is used to generate a single feature vector representing the whole segment. Next, all generated feature vectors for the training material are used with the K-means clustering algorithm to find a set of centroids. Audio segments are thereby grouped into a small number of acoustic classes (each class is an ASM) to represent the whole acoustic space scattered by the training data. We do not leverage any prior knowledge when building our acoustic inventory; therefore, the set of ASMs and its corresponding model arise in an unsupervised manner. Finally, the centroids can be used to map a given audio recording into an initial ASM sequence. Each initial ASM sequence has a fixed number of ASM units, since we have split the input recording into fixed-length segments. A minimal code sketch of this initialization step is given at the end of this section.

Figure 1: The framework of the proposed ASM-guided segment selection approach for ASC. [Block diagram: LMFB features feed Initial Segmentation, K-means Clustering, GMM/DNN-HMM Training and Stop ASMs Detection; the resulting ASM sequences and stop ASMs drive the "remove if its ASM is in stop ASMs" step, followed by Re-segment and Padding, AlexNet-L (vote), and the final Scene Class.]

The initial ASMs provide only a rough segmentation of the audio scene, where segments have a fixed length in terms of the number of audio frames, which does not adhere to real scenarios. A finer segmentation can be obtained by leveraging the HMM framework, in which the state probability density functions (pdfs) can be obtained with a GMM or a DNN. We therefore deployed both GMM-HMM and DNN-HMM tokenizers as follows:

1. Seed a GMM-HMM for each ASM unit based on the initial ASM segmentation.
2. Use the model obtained in Step 1 and perform Viterbi decoding on all the training utterances.
3. Train the model with the new transcriptions generated in Step 2.
4. Repeat Step 2 and Step 3 until convergence.
5. Build a DNN-HMM using the GMM-HMM and the new ASM segmentation.
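The following is a minimal sketch of the initialization step described above, assuming each recording is available as a (frames x mel-bins) LMFB matrix. The 20-frame segment length and the 64-unit inventory follow the settings reported in Section 5; the function names and the use of scikit-learn's KMeans are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of initial ASM sequence generation (assumed setup: each
# recording is a NumPy array of shape (num_frames, num_mels) of LMFB features).
# Function names are illustrative; only the 20-frame segments and the
# 64-unit inventory follow the settings reported in Section 5.
import numpy as np
from sklearn.cluster import KMeans

def segment_means(features, seg_len=20):
    """One mean vector per consecutive fixed-length segment of a recording."""
    num_segs = features.shape[0] // seg_len
    segs = features[:num_segs * seg_len].reshape(num_segs, seg_len, -1)
    return segs.mean(axis=1)                      # shape: (num_segs, num_mels)

def train_initial_asms(train_features, seg_len=20, num_asms=64, seed=0):
    """Fit K-means centroids over segment-mean vectors of all training recordings."""
    all_means = np.concatenate([segment_means(f, seg_len) for f in train_features])
    return KMeans(n_clusters=num_asms, n_init=10, random_state=seed).fit(all_means)

def initial_asm_sequence(features, kmeans, seg_len=20):
    """Map one recording to its initial ASM sequence (one unit index per segment)."""
    return kmeans.predict(segment_means(features, seg_len))
```

The per-segment labels returned by initial_asm_sequence play the role of the initial tokenization; the GMM/DNN-HMM training loop above would then re-estimate the segment boundaries via Viterbi decoding.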
4. Front-end Segment Selection via ASM
For acoustic scenes, differently from spoken utterances, only a few really meaningful segments characterize the whole scene. For example, a car horn sound is an important cue to determine a street traffic scene, but many other segments in that acoustic scene do not carry any key information to make the correct classification. However, it is not easy to detect meaningful segments for most acoustic scenes. Therefore, more and more ASC systems simply use a CNN end-to-end approach to learn the mapping from the input audio recordings to the output scene class. Nonetheless, CNNs are good at extracting local features but not at overall segment selection. ASMs are used in our work to find useful acoustic segments, which have high indexing power with respect to the target acoustic scene. If useless segments can be removed in the front-end processing stage, that is beneficial to the back-end classifier, as proven in the experimental section.

The proposed ASM-based front-end segment selection framework for sub-utterance acoustic scene classification is shown in Figure 1. The dashed lines indicate where ASM sequences and stop ASMs are used. Stop ASMs and ASM sequences are generated in the segment selection block to remove consecutive feature frames not useful for final scene classification. The stop rules and segment selection steps are described in detail in the following subsections.
4.1. Stop ASMs Detection

Stop ASMs, as the name reveals, take inspiration from stop words in information retrieval [16]. Given the inventory of ASMs, stop ASMs are a subset of the original ASMs that does not carry much information for retrieving the target acoustic scenes. Compared with other ASMs in the inventory, stop ASMs have either a lower indexing power, or no indexing power at all, when it comes to making the final classification decision. We use D to denote the total number of ASMs in the inventory, and N to denote the total number of utterances in the training set. In this study, we consider four different methods to detect the stop ASMs [35, 36, 37]:

• Mean Probability (MP): the average probability of each unit $M_j$ over the ASM sequences of the training set. MP considers the frequency of each ASM and is calculated by
$$ MP(M_j) = \frac{\sum_{i=1}^{N} P_{i,j}}{N}, \quad (1) $$
in which $P_{i,j}$ is the probability of the ASM unit $M_j$ in utterance $U_i$, calculated by dividing its frequency by the total number of ASMs in $U_i$.

• Inverse Document Frequency (IDF): it measures how much information the ASM provides, to reflect the importance of each ASM. IDF is calculated by
$$ IDF(M_j) = \log \frac{N+1}{N_j+1}, \quad (2) $$
in which $N_j$ is the total number of times the ASM unit $M_j$ appears in the training utterances.

• Variance of Probability (VP): it considers the variance of each ASM unit. VP is calculated by
$$ VP(M_j) = \frac{\sum_{i=1}^{N} \left( P_{i,j} - MP(M_j) \right)^2}{N}. \quad (3) $$

• Statistical Values (SAT): this metric considers both the mean and the variance. If an ASM unit has a high SAT value, it implies that $M_j$ occurs frequently and uniformly in all the training utterances, and $M_j$ is very likely to be a stop ASM. SAT is calculated by
$$ SAT(M_j) = \frac{MP(M_j)}{VP(M_j)^{1/2}}. \quad (4) $$

In our experiments, we select the top $P$ ASMs per metric to function as stop ASMs. The segments whose ASMs are in the stop ASMs set dominate in most utterances.
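As an illustration of how these metrics can be computed, here is a minimal sketch assuming each training utterance has already been tokenized into a sequence of ASM indices. Function names are hypothetical, and $N_j$ is read here as the number of training utterances containing $M_j$, one possible interpretation of Eq. (2); adjust if a raw occurrence count is intended.

```python
# A minimal sketch of the four stop-ASM detection metrics, assuming each
# training utterance is a sequence of ASM indices in [0, D). Names are
# illustrative; N_j is computed as the number of utterances containing M_j.
import numpy as np

def asm_probabilities(sequences, num_asms):
    """P[i, j]: relative frequency of ASM j within utterance i."""
    P = np.zeros((len(sequences), num_asms))
    for i, seq in enumerate(sequences):
        counts = np.bincount(np.asarray(seq), minlength=num_asms)
        P[i] = counts / max(len(seq), 1)
    return P

def stop_asm_scores(sequences, num_asms):
    P = asm_probabilities(sequences, num_asms)   # shape (N, D)
    N = len(sequences)
    mp = P.mean(axis=0)                          # Eq. (1): mean probability
    n_j = (P > 0).sum(axis=0)                    # utterances containing ASM j
    idf = np.log((N + 1) / (n_j + 1))            # Eq. (2): inverse document frequency
    vp = ((P - mp) ** 2).mean(axis=0)            # Eq. (3): variance of probability
    sat = mp / np.sqrt(vp + 1e-12)               # Eq. (4): mean over standard deviation
    return mp, idf, vp, sat

def select_stop_asms(sequences, num_asms, top_p=3, metric="SAT"):
    """Top-P stop ASMs: highest MP/SAT, or lowest IDF/VP."""
    mp, idf, vp, sat = stop_asm_scores(sequences, num_asms)
    score = {"MP": mp, "IDF": -idf, "VP": -vp, "SAT": sat}[metric]
    return np.argsort(score)[::-1][:top_p]
```

With the settings of Section 5, one would call select_stop_asms(train_sequences, num_asms=64, top_p=3) once per metric.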
4.2. Segment Selection and Re-segmentation

The front-end processing is performed with the stop ASMs and the ASM sequences. As shown in Figure 1, for a given input with an ASM sequence, the corresponding LMFB feature frames are removed if their ASMs are stop ASMs. The remaining feature fragments are then re-segmented and padded into a group of fixed-length acoustic segments. In our experiments, if a fragment is longer than the fixed segment length, we divide it into segments of that length; otherwise, we perform zero-padding to bring it to the fixed length. Hence, each acoustic scene eventually has a different number of segments, which depends on the number of frames that have been blocked at the front-end level. After the re-segmenting and padding process, the newly generated fixed-length segments are fed into the back-end classifier. Each segment is assigned to a scene class by the AlexNet-L back-end classifier, and the final scene class is obtained via majority voting among all segments.
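The following is a minimal sketch of this selection and re-segmentation step, assuming a frame-level ASM label is available for every LMFB frame (e.g., read off the HMM alignment) and a fixed target segment length. The names, the tail zero-padding, and the NumPy voting are illustrative assumptions rather than the authors' exact recipe.

```python
# A minimal sketch of the front-end selection step: drop frames aligned to
# stop ASMs, re-segment the remainder into fixed-length chunks, and vote.
import numpy as np

def remove_stop_frames(features, frame_asms, stop_asms):
    """Drop all LMFB frames whose aligned ASM unit is a stop ASM."""
    keep = ~np.isin(frame_asms, list(stop_asms))
    return features[keep]

def resegment_and_pad(features, seg_len):
    """Cut the remaining frames into fixed-length segments, zero-padding the tail."""
    num_mels = features.shape[1]
    segments = []
    for start in range(0, features.shape[0], seg_len):
        seg = features[start:start + seg_len]
        if seg.shape[0] < seg_len:                              # short tail fragment
            pad = np.zeros((seg_len - seg.shape[0], num_mels))
            seg = np.concatenate([seg, pad], axis=0)
        segments.append(seg)
    return np.stack(segments) if segments else np.zeros((0, seg_len, num_mels))

def majority_vote(per_segment_classes):
    """Final scene class: the most frequent class over all segment decisions."""
    values, counts = np.unique(per_segment_classes, return_counts=True)
    return values[np.argmax(counts)]
```

Each row of the tensor returned by resegment_and_pad would be classified by AlexNet-L, and majority_vote would then produce the scene label for the whole recording.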
5. Experiments and Result Analysis
5.1. Experimental Setup

The proposed approach is evaluated on the DCASE 2018 Task1a development data set [3]. It contains 24 hours of acoustic scene audio, recorded with the same device at a 48 kHz sampling rate in 10 different acoustic scenes. Following the official recommendation, the development data set is divided into training and test sets containing 6122 and 2518 utterances, respectively. For each 10-second binaural signal, the STFT is applied separately to the left and right channels, and mel filter banks are then applied to obtain log-mel filter bank (LMFB) features. Our ASC baseline system is based on the AlexNet [38] model. Nonetheless, different from the original AlexNet, we reduce the parameter size due to internal resource constraints. The baseline is denoted as AlexNet-L in the rest of this work. It has five convolutional layers and two fully connected layers; each convolutional layer consists of convolution, batch normalization, a ReLU activation function and max pooling. AlexNet-L is trained with a stochastic gradient descent (SGD) algorithm with a cosine-based learning rate scheduler. Each input utterance is segmented into fixed-length segments; after AlexNet-L classification, the final scene class of the input waveform is decided by majority voting over the classification results of all segments.

In the acoustic segment modeling stage, the initial segment length is set to 20 frames, which is the same as the segment length in the baseline AlexNet-L, so each utterance is divided into a fixed number of segments. The size D of the ASM inventory is set to 64; according to our experiments, performance is robust to the setting of D. GMM-HMMs are powerful for modeling sequential data, and we use them here to refine the initial tokenization. A left-to-right HMM topology is used in each 6-state GMM-HMM. In the DNN-HMM, the DNN has six hidden layers, each having 2048 neurons, and the output layer estimates the state probability density functions (pdfs) of the 64 HMMs. During stop ASMs detection, the top-3 ASMs function as stop ASMs for each metric discussed in Section 4.1. After removing segments whose ASMs are in the stop ASMs set, the remaining acoustic fragments are re-segmented and padded into fixed-length acoustic segments. As done with AlexNet-L, majority voting is used to decide the final scene class.

5.2. Stop ASMs Detection Results

The stop ASMs detection results are shown in Table 1. Different detection criteria lead to different sets of stop ASMs. We selected the top three ASMs, according to highest MP, lowest IDF, lowest VP and highest SAT, as our stop ASMs. From Table 1, we can notice that some ASMs appear independently of the metric used, which implies that some acoustic segments do satisfy our assumption, i.e., that there are acoustic segments having low or no indexing power for scene classification. Stop ASMs found by GMM-HMM and DNN-HMM are similar, and when the SAT criterion is used, the same set of stop ASMs is found independently of the tokenizer. However, although different metrics can select the same stop ASMs, the technique used to obtain the ASMs leads to different segment boundaries, which eventually affects the final classification results.

Table 1: Stop ASMs detection results with different metrics. M_i indicates the i-th ASM in the ASM inventory. Initial ASMs implies that initial ASM sequences are used in stop ASMs detection; GMM-HMM and DNN-HMM refer to sequences obtained with those models.
Metric   Initial ASMs   GMM-HMM    DNN-HMM
MP       M, M, M        M, M, M    M, M, M
IDF      M, M, M        M, M, M    M, M, M
VP       M, M, M        M, M, M    M, M, M
SAT      M, M, M        M, M, M    M, M, M

5.3. Classification Results

The proposed ASC system is shown in Figure 1, and the related experimental results are given in Table 2. For a comprehensive evaluation, the high-resolution attention network with ASM (HRAN-ASM) system proposed in [17] is also implemented, and its classification results are reported; in particular, we have adopted its two attention modules with ASM embedding into our AlexNet-L baseline model. The first two rows in Table 2 are the official baseline [3] and our AlexNet-L baseline. AlexNet-L attains a classification accuracy of 68.0%, which is improved to 69.5% with HRAN-ASM in the third row. Although our experiments are conducted with the same data sets adopted in [17], the experimental results are slightly different because of the different speech features and models used in our work.

Table 2 lists, in the last three rows, the experimental results obtained with the proposed approach when SAT is used as the metric to extract the stop ASMs. Our acoustic segment blocking approach can be used with three different ASM tokenizers, namely (i) the initial unsupervised ASM (initial ASM), (ii) GMM-HMM, and (iii) DNN-HMM. In case (i), stop ASMs are first detected using the tokenization obtained with the initial ASM. Since the segment length of the initial ASM sequences and that of the input sequences to AlexNet-L are the same, we can directly block a whole segment of the input sequence if its corresponding ASM token is in the set of stop ASMs. Thus, the blocking operation is performed at the segment level, and re-segmenting and padding are not needed. Although the initial ASMs are simple models, AlexNet-L accuracy is boosted from 68.0% to 70.1% (compare the second and fourth rows of Table 2). In cases (ii) and (iii), with either the GMM-HMM or the DNN-HMM tokenizer, we obtain more precise ASM sequences, which in turn improve the final ASC accuracy, as shown in the last two rows of Table 2. In detail, GMM-HMM boosts AlexNet-L classification accuracy up to 71.6%. DNN-HMM delivers more accurate alignments and boundaries for each ASM segment, which leads to a final scene classification accuracy of 72.1%, representing a 4.1% absolute improvement over AlexNet-L.

Table 2: Evaluation results on the DCASE 2018 Task1a data set.

Model                     Accuracy %
Official baseline [3]     59.7
AlexNet-L baseline        68.0
+ HRAN-ASM [17]           69.5
+ initial ASM (SAT)       70.1
+ ASM-GMM-HMM (SAT)       71.6
+ ASM-DNN-HMM (SAT)       72.1

The above results allow us to conclude that: (a) the ASM-based segment selection approach can significantly improve the ASC system, attaining a final classification accuracy of 72.1%, which is a competitive performance given that data augmentation and ensemble methods have not been used in our work; and (b) the proposed solution outperforms AlexNet-L with the attention mechanism initialised with ASM, which is reported to compare favourably against self-attention in [17]. The latter result demonstrates that our front-end segment selection approach outperforms a more standard attention scheme.

5.4. Effect of Stop ASMs Detection Metrics

The effect of different stop ASMs detection metrics is shown in Table 3, where the initial ASM sequences are used to evaluate the different metrics. From Table 3, we can see that different stop ASMs detection metrics result in different classification outcomes. Using MP with initial ASM sequences does not lead to any improvement over AlexNet-L, whereas SAT shows the best performance among all metrics. These results make sense, since SAT considers both the mean and the variance of the distribution of each ASM unit.

Table 3: Results with different stop ASMs detection metrics using initial ASMs.

Model                     Accuracy %
AlexNet-L baseline        68.0
+ initial ASM (MP)        68.0
+ initial ASM (IDF)       68.7
+ initial ASM (VP)        69.3
+ initial ASM (SAT)       70.1
6. Summary
In this paper, instead of using whole utterances for scene modeling, we propose an ASM based front-end segment selection approach to acoustic scene classification. The overall framework is based on two modules: (i) acoustic segment modeling and selection, and (ii) CNN based classification. ASMs are first generated in an unsupervised manner and refined with GMM/DNN-HMM models. Then stop ASMs detection is performed using the ASM sequences of the training data. The ASM sequences and stop ASMs are used for segment selection before the CNN classifier, which implies that segments with low or no indexing power are removed. The proposed approach is evaluated on DCASE 2018 Task1a, and the experimental evidence demonstrates the viability of sub-utterance ASC. A classification accuracy of 72.1% is obtained, which is highly competitive for a single system with no data expansion.

7. References

[1] A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2, pp. 379–393, 2018.
[2] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, pp. 85–92.
[3] A. Mesaros, T. Heittola, and T. Virtanen, "A multi-device dataset for urban acoustic scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), November 2018, pp. 9–13.
[4] Y. Han and K. Lee, "Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation," arXiv preprint arXiv:1607.02383, 2016.
[5] D. Battaglino, L. Lepauloux, and N. Evans, "Acoustic scene classification using convolutional neural networks," DCASE2016 Challenge, Tech. Rep., September 2016.
[6] K. Koutini, H. Eghbal-zadeh, M. Dorfer, and G. Widmer, "The receptive field as a regularizer in deep convolutional neural networks for acoustic scene classification," CoRR, vol. abs/1907.01803, 2019.
[7] H. Hu, C.-H. H. Yang, X. Xia, X. Bai, X. Tang, Y. Wang, S. Niu, L. Chai, J. Li, H. Zhu, F. Bao, Y. Zhao, S. M. Siniscalchi, Y. Wang, J. Du, and C.-H. Lee, "Device-robust acoustic scene classification based on two-stage categorization and data augmentation," DCASE2020 Challenge, Tech. Rep., June 2020.
[8] J. Wang and S. Li, "Self-attention mechanism based system for DCASE2018 challenge task1 and task4," in Proc. DCASE Challenge, 2018, pp. 1–5.
[9] Z. Ren, Q. Kong, K. Qian, M. D. Plumbley, B. Schuller et al., "Attention-based convolutional neural networks for acoustic scene classification," in DCASE 2018 Workshop Proceedings, 2018.
[10] J. Guo, N. Xu, L.-J. Li, and A. Alwan, "Attention based CLDNNs for short-duration acoustic scene classification," in INTERSPEECH, 2017, pp. 469–473.
[11] T. Nguyen and F. Pernkopf, "Acoustic scene classification with mismatched devices using cliquenets and mixup data augmentation," in Interspeech, 2019.
[12] H. Chen, Z. Liu, Z. Liu, P. Zhang, and Y. Yan, "Integrating the data augmentation scheme with various classifiers for acoustic scene modeling," DCASE2019 Challenge, Tech. Rep., June 2019.
[13] T. Zhang, K. Zhang, and J. Wu, "Data independent sequence augmentation method for acoustic scene classification," in Interspeech, 2018, pp. 3289–3293.
[14] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[16] W. J. Wilbur and K. Sirotkin, "The automatic identification of stop words," Journal of Information Science, vol. 18, pp. 45–55, 1992.
[17] X. Bai, J. Du, J. Pan, H.-S. Zhou, Y.-H. Tu, and C.-H. Lee, "High-resolution attention network with acoustic segment model for acoustic scene classification," in ICASSP, 2020.
[18] J.-J. Aucouturier, B. Defreville, and F. Pachet, "The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music," The Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881–891, 2007.
[19] A. Rakotomamonjy and G. Gasso, "Histogram of gradients of time-frequency representations for audio scene classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 142–153, 2014.
[20] V. Bisot, R. Serizel, S. Essid, and G. Richard, "Feature learning with matrix factorization applied to acoustic scene classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1216–1229, 2017.
[21] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, "CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks," IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[22] H. Zeinali, L. Burget, and J. Cernocky, "Convolutional neural networks and x-vector embedding for DCASE2018 acoustic scene classification challenge," arXiv preprint arXiv:1810.04273, 2018.
[23] P. Sharma, V. Abrol, and A. Thakur, "ASE: Acoustic scene embedding using deep archetypal analysis and GMM," in Interspeech, 2018, pp. 3299–3303.
[24] J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello, "Look, listen, and learn more: Design choices for deep audio embeddings," in ICASSP. IEEE, 2019, pp. 3852–3856.
[25] L. Pham, I. McLoughlin, H. Phan, and R. Palaniappan, "A robust framework for acoustic scene classification," Proc. Interspeech 2019, pp. 3634–3638, 2019.
[26] J. Chen, J. Hao, K. Chen, D. Xie, S. Yang, and S. Pu, "An end-to-end audio classification system based on raw waveforms and mix-training strategy," arXiv preprint arXiv:1911.09349, 2019.
[27] V. T. Peltonen, A. J. Eronen, M. P. Parviainen, and A. P. Klapuri, "Recognition of everyday auditory scenes: potentials, latencies and cues," Preprints - Audio Engineering Society, 2001.
[28] C.-H. Lee, F. K. Soong, and B.-H. Juang, "A segment model based approach to speech recognition," in ICASSP, 1988, pp. 501–541.
[29] H. Li, B. Ma, and C.-H. Lee, "A vector space modeling approach to spoken language identification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 271–284, 2006.
[30] Y. Tsao, H. Sun, H. Li, and C.-H. Lee, "An acoustic segment model approach to incorporating temporal information into speaker modeling for text-independent speaker recognition," in ICASSP. IEEE, 2010, pp. 4422–4425.
[31] H.-y. Lee, T.-y. Hu, H. Jing, Y.-F. Chang, Y. Tsao, Y.-C. Kao, and T.-L. Pao, "Ensemble of machine learning and acoustic segment model techniques for speech emotion and autism spectrum disorders recognition," in INTERSPEECH, 2013, pp. 215–219.
[32] M. Riley, E. Heinen, and J. Ghosh, "A text retrieval approach to content-based audio retrieval," in Int. Symp. on Music Information Retrieval (ISMIR), 2008, pp. 295–300.
[33] X. Bai, J. Du, Z.-R. Wang, and C.-H. Lee, "A hybrid approach to acoustic scene classification based on universal acoustic models," Proc. Interspeech 2019, pp. 3619–3623, 2019.
[34] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic analysis," Discourse Processes, vol. 25, no. 2-3, pp. 259–284, 1998.
[35] W. J. Wilbur and K. Sirotkin, "The automatic identification of stop words," Journal of Information Science, vol. 18, no. 1, pp. 45–55, 1992. [Online]. Available: https://doi.org/10.1177/016555159201800106
[36] D. Na and C. Xu, "Automatically generation and evaluation of stop words list for Chinese patents," Telkomnika, vol. 13, no. 4, p. 1414, 2015.
[37] R. T.-W. Lo, B. He, and I. Ounis, "Automatically building a stopword list for an information retrieval system," in
Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR), vol. 5, 2005, pp. 17–24.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.