A punishment voting algorithm based on super categories construction for acoustic scene classification
Weiping Zheng, Zhenyao Mo, Jiantao Yi
School of Computer, South China Normal University, Guangzhou 510631, China
[email protected], [email protected], [email protected]
Abstract
In acoustic scene classification research, an audio segment is usually split into multiple samples, and majority voting is then used to ensemble the results of the samples. In this paper, we propose a punishment voting algorithm based on a super categories construction method for acoustic scene classification. Specifically, we propose a DenseNet-like model as the base classifier. The base classifier is trained on the CQT spectrograms generated from the raw audio segments. Taking advantage of the results of the base classifier, we propose a super categories construction method using spectral clustering. Super classifiers corresponding to the constructed super categories are then trained. Finally, the super classifiers are used to enhance the majority voting of the base classifier through punishment voting. Experiments show that punishment voting clearly improves the performance on both the DCASE2017 Development dataset and the LITIS Rouen dataset.
Keywords: acoustic scene classification, DenseNet, punishment voting, CQT, spectral clustering
Acoustic scene classification (ASC) is attracting more and more attention in the artificial intelligence research community. Taking advantage of audio signals, ASC is able to infer information about the environment in which the audio was produced. This ability is very helpful for many applications, such as surveillance [1], robotic navigation [2] and recognition of a cyclist's route [3]. In recent years, with the upsurge of deep learning research, deep learning based solutions have become more and more popular in acoustic scene classification. Many popular deep learning architectures have been applied to this problem, e.g., CNN [4, 5], RNN [6, 7], LSTM [4], DNN [8] and their combinations [4, 7]. These solutions have achieved promising results, surpassing the performance of most traditional machine learning methods.

As we all know, human identification proceeds from coarse to fine. For example, when we have to classify a large number of objects (such as car, plane, train, dog, cat, bird, etc.), we first divide them into several coarse categories intuitively. Specifically, car, plane and train are classified into the transport category, while dog, cat and bird are regarded as the animal category. We can then perform finer classification within these coarse categories. Moreover, as we can observe from machine learning work, the accuracy of coarse category classification tends to be higher than that of the corresponding fine-grained tasks. Another motivation is the inter-class similarity among acoustic scene classes. It is easy to find acoustic scene classes that are similar in their acoustic properties, which inspires us to cluster such classes into a coarse category. In this way, we can construct several coarse categories from the original acoustic scene categories.
As these constructed coarse categories are made up of the original scene categories (or classes), we call them super categories. We can train a super classifier for each super category. These super classifiers are responsible for discriminating the coarse categories; as a result, they have fewer outputs than the base classifier. In this paper, the base classifier refers to the classification model that discriminates all the original categories in a fine-grained manner. As mentioned above, the accuracies of the super classifiers are higher, which gives us more confidence in them. To this end, we propose a punishment voting algorithm that uses these super classifiers to enhance the base classifier and improve ASC performance.

The framework of our method is as follows. Firstly, we generate CQT spectrograms [9] from the audio segments. A DenseNet-like [10] architecture is then proposed as the base classifier, trained on the CQT spectrograms. On the basis of the base classifier, we construct several super categories through spectral clustering [11]. For each super category, we further train a corresponding super classifier. Finally, a special voting algorithm, which we call punishment voting, is proposed to combine the base classifier and the super classifiers into a strong classifier. The flowchart of our method is shown in Figure 1.

Figure 1: Flowchart of our proposed method

In this paper, we propose a super categories construction method and a punishment voting algorithm. To verify their effectiveness, we evaluate our results on the DCASE2017 Development dataset as well as the LITIS Rouen dataset. The experiments show that punishment voting clearly improves the performance on both datasets. The remainder of this paper is organized as follows. The transformation of the raw audio is introduced in Section 2.
Section 3 describes the details of our proposed methods, including the design of the base classifier, the construction of the super categories, the training of the super classifiers and the punishment voting algorithm. In Section 4, we present the experimental results. Finally, we conclude in Section 5.

Recently, CNNs have become very popular in ASC [4, 5]. In these solutions, the audio signals are transformed into special time-frequency representations, namely spectrograms. Inspired by [12], we select the CQT spectrogram [9] as the input of the proposed CNN. To generate CQT spectrograms from the raw audio segments, we utilize the cqt function in the Python library Librosa 0.5.0. For each audio segment, we generate N CQT spectrograms, where N is the number of auditory channels: N is 2 for the DCASE2017 Development dataset and 1 for the LITIS Rouen dataset. For the convenience of calculation, we resize every spectrogram to 832 × 143. We then split the spectrograms into patches with a width of 143 and a shift step of 80, which yields 10 patches of size 143 × 143 for each CQT spectrogram. In other words, each audio segment generates N × 10 patches (samples). The CQT spectrogram patches are generated in the same way for both datasets.

DenseNet is an excellent deep learning model proposed by [10] in 2017. It has many advantages, such as encouraging feature reuse, alleviating the vanishing-gradient problem and substantially reducing the number of parameters. DenseNet has been very successful in image recognition tasks, even outperforming ResNet [13] on some datasets. In this paper, we build a DenseNet-like network for acoustic scene classification. We first use this network to build the base classifier; the super classifiers are then trained with the same deep network except for some minor modifications.
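The patch extraction described earlier (832-frame spectrograms, width-143 windows, shift step 80) can be sketched as follows. This is a minimal NumPy sketch; the actual CQT computation with librosa.cqt is omitted, and the end-aligned final patch is our assumption about how the stated 10 patches per 832-frame spectrogram are obtained, since 80-frame shifts alone yield only 9 full windows.

```python
import numpy as np

def split_into_patches(spec, patch_width=143, shift=80):
    """Split a (freq, time) spectrogram into fixed-width patches.

    Windows slide along the time axis with the given shift; a final
    end-aligned window is appended so the tail of the spectrogram is
    covered (an assumption made to match the 10 patches per 832-frame
    spectrogram reported in the text).
    """
    _, time_frames = spec.shape
    starts = list(range(0, time_frames - patch_width + 1, shift))
    last_start = time_frames - patch_width
    if starts[-1] != last_start:
        starts.append(last_start)
    return [spec[:, s:s + patch_width] for s in starts]

# A resized CQT spectrogram: 143 frequency bins x 832 time frames.
spec = np.random.rand(143, 832)
patches = split_into_patches(spec)
print(len(patches))  # 10 patches, each 143 x 143
```

Each audio segment then contributes N × 10 such patches, where N is the number of auditory channels.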
As shown in Table 1, the proposed network has three dense blocks, two transition layers and a growth rate of 12. The three dense blocks have 4, 4 and 32 bottleneck layers respectively. Note that, compared to the standard DenseNet [10], we have removed the initial 7 × 7 convolution layer.

Some acoustic scenes are very similar in their acoustic properties and are frequently misclassified. It is natural to cluster these similar scenes into a coarse category, which we call a super category. For some datasets, the super category information is explicitly provided. For example, the DCASE2017 Development dataset provides the 'indoor', 'outdoor' and 'vehicle' super categories. Specifically, the 'indoor' category includes the Cafe/Restaurant, Grocery store, Home, Library, Metro station and Office scenes; the 'vehicle' category contains the Bus, Car, Train and Tram scenes; the remaining scenes belong to the 'outdoor' category. However, not all datasets provide super category labels. For example, the LITIS Rouen dataset (https://sites.google.com/site/alainrakotomamonjy/home/audio-scene) offers no such information. It is therefore necessary to construct super categories from the original acoustic scenes.

When constructing super categories, it is challenging to identify the acoustic scenes with similar acoustic properties directly. However, it is easy to collect the misclassified cases of each acoustic scene under a given classification model. In our proposed method, we use this misclassification information to approximate the similarities among acoustic scenes, as similar scenes are prone to be misclassified as one another. Specifically, using the trained DenseNet-like deep model, we compute a confusion matrix M over the testing samples. Let M_ij be the element in the i-th row and the j-th column of M.
M_ij (i ≠ j) represents the number of patches whose ground-truth label is i but which are wrongly predicted as j. On the basis of the confusion matrix, we apply the spectral clustering method [11] to construct the super categories.

Note that the values in the confusion matrix are averaged over multiple testing sets. For example, on the DCASE2017 Development dataset, the confusion matrix is averaged across the test results of the 4 training/testing splits provided by the challenge organizer. The test results are obtained by the base classifier described in Section 3.1 with majority voting [14]. It is worth mentioning that the super category outputs for the DCASE2017 Development dataset in our experiments are exactly the same as the official 'indoor', 'outdoor' and 'vehicle' divisions, with the cluster number set to 3 and the K parameter in KNN set to 2. This demonstrates the effectiveness of our super categories construction method. We also cluster the LITIS Rouen dataset into 3 super categories. The clustering results are as follows: {Metro-rouen, High-speed train}, {Restaurant, Shop, Market, Café, Billiard pool hall} and {Quiet street, Plane, Bus, Train, Car, Tube station, Kid game hall, Metro-paris, Student hall, Pedestrian street, Busy street, Train station hall}.

For each super category, we further train a corresponding super classifier. The super classifier uses the same DenseNet-like architecture as the base classifier, and the weights of the trained base classifier are transferred to initialize it [15]. The only difference lies in the output layer. Taking the 'vehicle' super category in DCASE2017 as an example, five output nodes are set in the output layer of the super classifier: four nodes serving as the indicators of Car, Bus, Train and Tram respectively, and a fifth node acting as the negative flag. The negative flag is fired when the testing sample is considered not to belong to the 'vehicle' super category.
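The idea of clustering classes by their mutual confusions can be sketched as follows. This is a toy, NumPy-only stand-in for the spectral clustering of [11]: it symmetrizes the confusion matrix into an affinity matrix and splits the classes into two groups by the sign of the Fiedler vector. The paper clusters into 3 super categories with a KNN-based affinity (K = 2), which is not reproduced here; the eps floor is an assumption added to keep the toy graph connected.

```python
import numpy as np

def super_categories_from_confusions(M, eps=0.1):
    """Split classes into 2 coarse groups from a confusion matrix M.

    Classes that confuse each other are treated as similar; the
    second-smallest eigenvector of the normalized graph Laplacian
    (the Fiedler vector) separates the two groups by sign.
    """
    A = M + M.T                 # symmetric affinity from confusions
    A = A + eps                 # small floor keeps the graph connected
    np.fill_diagonal(A, 0.0)    # no self-similarity
    d = A.sum(axis=1)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)            # eigenvalues ascending
    return (vecs[:, 1] > 0).astype(int)    # sign split of Fiedler vector

# Toy confusion counts: classes 0/1 confuse each other, as do 2/3.
M = np.array([[0., 5., 0., 0.],
              [5., 0., 0., 0.],
              [0., 0., 0., 5.],
              [0., 0., 5., 0.]])
labels = super_categories_from_confusions(M)
```

On this toy matrix the two confusable pairs land in different groups, mirroring how frequently confused scenes end up in the same super category.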
The patches used to train the base classifier are reused to train the new super classifier, but the labels of some patches must be modified. Specifically, the labels of the patches of the Car, Bus, Train and Tram acoustic scenes are left unchanged, while the labels of all other patches are changed to 'NON-VEHICLE'. In this way, each super classifier is responsible for discriminating only a small range of acoustic scenes; as a result, the accuracies of these super classifiers are improved. As in the base classifier, SoftMax is applied in the output layer of the super classifiers.
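The relabeling step above can be sketched as a one-line mapping; this is a minimal illustration using the 'vehicle' super category of DCASE2017, with the function name being our own.

```python
def relabel_for_super_classifier(labels, super_category,
                                 negative_label="NON-VEHICLE"):
    """Relabel training patches for one super classifier: labels inside
    the super category are kept, all others collapse to the negative
    flag label."""
    return [y if y in super_category else negative_label for y in labels]

vehicle = {"Car", "Bus", "Train", "Tram"}
patch_labels = ["Car", "Home", "Tram", "Office"]
print(relabel_for_super_classifier(patch_labels, vehicle))
# ['Car', 'NON-VEHICLE', 'Tram', 'NON-VEHICLE']
```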
For convenience of description, we introduce two data structures before presenting the algorithm: the voting vector and the negative flag vector. Assume that there are AN audio segments in the testing set; PN patches are produced for each segment; SN super categories have been constructed from the CN original acoustic scene classes; and one base classifier and SN super classifiers have been built as described above. For the i-th audio segment (i ∈ [1, AN]), we generate a voting vector VV_i = (vv_{i,1}, vv_{i,2}, ..., vv_{i,CN}), initialized with all zero elements. For the p-th patch produced by the i-th audio segment, we feed it into the base CNN model; if the k-th node (k ∈ [1, CN]) in the output layer attains the maximum value, vv_{i,k} is increased by 1. After all PN patches are processed, the voting vector
VV_i is complete, with vv_{i,1} + vv_{i,2} + ... + vv_{i,CN} = PN. For the i-th audio segment (i ∈ [1, AN]), we also calculate SN negative flag vectors, denoted NF_{ij} = (nf_{ij,1}, nf_{ij,2}, ..., nf_{ij,PN}), j ∈ [1, SN], one for each super classifier. For the p-th patch produced by the i-th audio segment, we feed it to the j-th super classifier; if the negative flag node (see Section 3.3) attains the maximum value, nf_{ij,p} is set to 1; otherwise it is set to 0. After all PN patches have been processed by the SN super classifiers and the related elements of NF_{ij} (j ∈ [1, SN]) have been set, the negative flag vectors of the i-th audio segment are ready.

Algorithm 1
Punishment Voting
Input: CN; SN; PN; VV_i = (vv_{i,1}, vv_{i,2}, ..., vv_{i,CN}); NF_{ij} = (nf_{ij,1}, nf_{ij,2}, ..., nf_{ij,PN}), j ∈ [1, SN]; AS_j, j ∈ [1, SN]: the original acoustic scene set corresponding to the j-th super category.
Output: R_i: the resultant acoustic scene of the i-th audio segment.

for j = 1 : SN do
    count = 0
    for k = 1 : PN do
        if nf_{ij,k} == 1 then
            count++
        end if
    end for
    if count > PN / 2 then
        for p = 1 : CN do
            if p ∈ AS_j then
                vv_{i,p} = vv_{i,p} * γ
            end if
        end for
    end if
end for
R_i = argmax_t({vv_{i,t} | t ∈ [1, CN]})
return R_i

These SN negative flag vectors are used to cast punishment votes on the voting vector. Specifically, for the n-th super classifier, we examine the negative flag vector NF_{in}. If a majority of the elements of NF_{in} are equal to 1 (the threshold is set to PN/2), the i-th audio segment is judged not to belong to the n-th super category according to the n-th super classifier. As we have more confidence in the super classifiers, this judgement is used to rectify the voting vector: all elements corresponding to the n-th super category in the voting vector are multiplied by a punishment factor γ (γ is set to 0.25 in the experiments). Once all the super classifiers have cast their punishment votes, the final result is read from the voting vector. The pseudocode is shown in Algorithm 1.

All experiments are executed on an Intel(R) Core(TM) i7 system with 64 GB RAM and an NVIDIA GTX 1080 Ti GPU. We implement the DenseNet-like models in TensorFlow with a learning rate of 0.0001, a batch size of 32 and a dropout rate of 0.2. Adam is applied as the optimizer with a maximum of 1000 epochs. We evaluate the proposed punishment voting algorithm on the DCASE2017 Development dataset and the LITIS Rouen dataset. There are 15 acoustic scenes in the DCASE2017 Development dataset, each with 312 10-second audio segments. We follow the 4-fold cross validation protocol on this dataset.
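The punishment voting of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the experimental code; the dict-based argument layout (one negative flag array and one scene-index set per super classifier) is our own choice, and classes are 0-indexed here rather than 1-indexed as in the pseudocode.

```python
import numpy as np

def punishment_voting(vv, nf, scene_sets, gamma=0.25):
    """Punishment voting (a sketch of Algorithm 1).

    vv         : voting vector of length CN from the base classifier.
    nf         : dict {j: array of PN negative flags} per super classifier.
    scene_sets : dict {j: set of class indices in super category j}.
    If a majority of a super classifier's patches fire the negative
    flag, the votes of that super category's classes are punished by
    the factor gamma; the final class is the argmax of the rectified
    voting vector.
    """
    vv = np.asarray(vv, dtype=float).copy()
    for j, flags in nf.items():
        PN = len(flags)
        if np.sum(flags) > PN / 2:      # majority says "not this super category"
            for p in scene_sets[j]:
                vv[p] *= gamma
    return int(np.argmax(vv))

# Toy example: 4 classes, 2 super categories, 10 patches per segment.
vv = [4, 2, 3, 1]
nf = {0: np.array([1] * 8 + [0] * 2),   # super classifier 0: 8/10 patches negative
      1: np.zeros(10)}                  # super classifier 1: never negative
scene_sets = {0: {0, 1}, 1: {2, 3}}
print(punishment_voting(vv, nf, scene_sets))  # 2
```

In the toy run, class 0 would win plain majority voting with 4 votes, but super classifier 0 rejects its super category, so its votes shrink to 1 and class 2 wins instead.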
The LITIS Rouen dataset contains 3026 30-second audio segments, categorized into 19 acoustic scene classes. We randomly select three of the 20 standard splits of this dataset and evaluate the average accuracy over these three splits.
We trained two DenseNet-like base classifiers, one for each dataset. The classifier for the DCASE2017 Development dataset has 15 output nodes, each corresponding to an acoustic scene class, while the one for the LITIS Rouen dataset has 19 output nodes. The average accuracy achieved by the DCASE2017 base classifier with majority voting is 79.4% (the baseline accuracy is 74.8% according to the DCASE2017 challenge). The average accuracy of the LITIS Rouen base classifier with majority voting is 92.13%.
As mentioned above, we construct three super categories for both datasets. The constructed super categories for the DCASE2017 Development dataset are exactly the same as the 'indoor', 'outdoor' and 'vehicle' separations originally provided by the official organizer. We trained a super classifier for each super category. With majority voting, as shown in Figure 2, the average accuracies of the three super classifiers on the DCASE2017 Development dataset are 88.05%, 88.51% and 93.34% respectively. Compared to the accuracy of the base classifier, the accuracies of the super classifiers are significantly improved, as expected. As shown in Figure 3, the average accuracies of the three super classifiers with majority voting on the LITIS Rouen dataset are 96.19%, 93.36% and 93.57% respectively.

Figure 2: Accuracies of the three super classifiers on the DCASE2017 Development dataset

Figure 3: Accuracies of the three super classifiers on the LITIS Rouen dataset
Figure 4 compares the results of punishment voting with those of the base classifier with majority voting on the DCASE2017 Development dataset. An accuracy of 81.92% is obtained by our proposed method, which outperforms the base classifier by 2.52%. Similarly, Figure 5 shows the superiority of our proposed punishment voting on the LITIS Rouen dataset: our method raises the accuracy to 94.83%, an increase of 2.7% over the base classifier with majority voting.

Figure 4: Comparison of majority voting and punishment voting on the DCASE2017 Development dataset

Figure 5: Comparison of majority voting and punishment voting on the LITIS Rouen dataset
In the acoustic scene classification research domain, majority voting is frequently used. In this paper, we have proposed a punishment voting algorithm for acoustic scene classification. There are two main contributions in this work: the punishment voting algorithm and the super categories construction method. Firstly, we transform the audio segments into CQT spectrograms [9]. Using these CQT spectrograms, we train a DenseNet-like model as the base classifier. Based on the results of the base classifier, we construct the super categories by spectral clustering. For each super category, we train a corresponding super classifier. Finally, we develop a punishment voting algorithm that combines the results of the base classifier with those of the super classifiers to obtain the final results. Compared to the base classifier, the punishment voting method brings a 2.52% improvement on the DCASE2017 Development dataset and a 2.7% boost on the LITIS Rouen dataset. We believe that the punishment voting method is also useful for other recognition tasks, such as image recognition and behavior recognition, and we will extend our method to these tasks in future work.
References

[1] Stavros Ntalampiras, Ilyas Potamitis, and Nikos Fakotakis. Probabilistic novelty detection for acoustic surveillance under real-world conditions. IEEE Transactions on Multimedia, 13(4):713–719, 2011.
[2] Wei He, Zhijun Li, and C. L. Philip Chen. A survey of human-centered intelligent robots: issues and challenges. IEEE/CAA Journal of Automatica Sinica, 4(4):602–609, 2017.
[3] Björn Schuller, Florian Pokorny, Stefan Ladstätter, Maria Fellner, Franz Graf, and Lucas Paletta. Acoustic geo-sensing: Recognising cyclists' route, route direction, and route progress from cell-phone audio. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 453–457. IEEE, 2013.
[4] Soo Hyun Bae, Inkyu Choi, and Nam Soo Kim. Acoustic scene classification using parallel combination of LSTM and CNN. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pages 11–15, 2016.
[5] Daniele Battaglino, Ludovick Lepauloux, Nicholas Evans, France Mougins, and France Biot. Acoustic scene classification using convolutional neural networks. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[6] Seongkyu Mun, Suwon Shon, Wooil Kim, David K. Han, and Hanseok Ko. A novel discriminative feature extraction for acoustic scene classification using RNN based source separation. IEICE Transactions on Information and Systems, 100(12):3041–3044, 2017.
[7] Juncheng Li, Dai Wei, Phuong Pham, Samarjit Das, and Shuhui Qu. Acoustic scene recognition with deep neural networks (DCASE challenge 2016). Robert Bosch Research and Technology Center, 3, 2016.
[8] Rohit Patiyal and Padmanabhan Rajan. Acoustic scene classification using deep learning. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[9] Christian Schörkhuber and Anssi Klapuri. Constant-Q transform toolbox for music processing. In Proceedings of the 7th Sound and Music Computing Conference, pages 3–64, 2010.
[10] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.
[11] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
[12] Thomas Lidy and Alexander Schindler. CQT-based convolutional neural networks for audio scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), volume 90, pages 1032–1048. DCASE2016 Challenge, 2016.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Lionel S. Penrose. The elementary statistics of majority voting. Journal of the Royal Statistical Society, 109(1):53–57, 1946.
[15] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.