A punishment voting algorithm based on super categories construction for acoustic scene classification
Weiping Zheng, Zhenyao Mo, Jiantao Yi
School of Computer, South China Normal University, Guangzhou 510631, China
[email protected], [email protected], [email protected]
Abstract
In acoustic scene classification research, an audio segment is usually split into multiple samples, and majority voting is then used to ensemble the results of the samples. In this paper, we propose a punishment voting algorithm based on a super categories construction method for acoustic scene classification. Specifically, we propose a DenseNet-like model as the base classifier. The base classifier is trained on the CQT spectrograms generated from the raw audio segments. Taking advantage of the results of the base classifier, we propose a super categories construction method using spectral clustering. Super classifiers corresponding to the constructed super categories are then trained. Finally, the super classifiers are used to enhance the majority voting of the base classifier through punishment voting. Experiments show that punishment voting clearly improves the performance on both the DCASE2017 Development dataset and the LITIS Rouen dataset.
Keywords: acoustic scene classification, DenseNet, punishment voting, CQT, spectral clustering
Acoustic scene classification (ASC) is attracting more and more attention in the artificial intelligence research community. Taking advantage of audio signals, ASC is able to infer information about the environment in which the audio was produced. This ability is very helpful for many applications, such as surveillance [1], robotic navigation [2] and recognition of a cyclist's route [3]. In recent years, with the upsurge of deep learning research, deep learning based solutions have become more and more popular in acoustic scene classification. Many popular deep learning architectures have been applied to this problem, e.g., CNN [4, 5], RNN [6, 7], LSTM [4], DNN [8] and their combinations [4, 7]. These solutions have achieved promising results, surpassing the performance of most traditional machine learning methods.

As we all know, human identification proceeds from coarse to fine. For example, when we have to classify a large number of objects (such as car, plane, train, dog, cat, bird, etc.), we first divide them into several coarse categories intuitively. Specifically, car, plane and train are classified into the transport category, while dog, cat and bird are regarded as the animal category. We can then perform finer classification within these coarse categories. Moreover, as we can observe from machine learning work, the accuracy of coarse category classification tends to be higher than that of the corresponding fine-grained tasks. Another motivation is the inter-class similarity among acoustic scene classes. It is easy to find acoustic scene classes that are similar in their acoustic properties, which inspires us to cluster such classes into a coarse category. In this way, we can construct several coarse categories from the original acoustic scene categories.
As these constructed coarse categories are made up of the original scene categories (or classes), we call them super categories. We can train a super classifier for each super category. These super classifiers are responsible for discriminating the coarse categories; as a result, they have fewer outputs than the base classifier. In this paper, the base classifier refers to the classification model that discriminates all the original categories in a fine-grained manner. As mentioned above, the accuracies of the super classifiers are higher, which gives us more confidence in them. To this end, we propose a punishment voting algorithm that uses these super classifiers to enhance the base classifier and improve ASC performance.

The framework of our method is as follows. Firstly, we generate CQT spectrograms [9] from the audio segments. A DenseNet-like [10] architecture is then proposed as the base classifier, trained on the CQT spectrograms. On the basis of the base classifier, we construct several super categories through spectral clustering [11]. For each super category, we further train a corresponding super classifier. Finally, a special voting algorithm, which we call punishment voting, is proposed to combine the base classifier and the super classifiers into a strong classifier. The flowchart of our method is shown in Figure 1.

Figure 1: Flowchart of our proposed method

In this paper, we propose a super categories construction method and a punishment voting algorithm. To verify their effectiveness, we evaluate our results on the DCASE2017 Development dataset as well as the LITIS Rouen dataset. The experiments show that punishment voting clearly improves the performance on both datasets. The remainder of this paper is organized as follows. The transformation of the raw audio is introduced in Section 2.
Section 3 describes the details of our proposed methods, including the design of the base classifier, the construction of the super categories, the training of the super classifiers and the punishment voting algorithm. In Section 4, we present the experimental results. Finally, we conclude in Section 5.

Recently, CNNs have become very popular in ASC [4, 5]. In these solutions, the audio signals are transformed into special time-frequency representations, namely spectrograms. Inspired by [12], we select the CQT spectrogram [9] as the input of the proposed CNN. To generate CQT spectrograms from the raw audio segments, we utilize the cqt function in the Python library Librosa 0.5.0. For each audio segment, we generate N CQT spectrograms, where N is the number of auditory channels: N is 2 for the DCASE2017 Development dataset and 1 for the LITIS Rouen dataset. For the convenience of calculation, we resize every spectrogram to 832 × 143. We then split the spectrograms into patches with a width of 143 and a shift step of 80, which yields 10 patches of size 143 × 143 for each CQT spectrogram. In other words, each audio segment generates N × 10 patches (samples). The CQT spectrogram patches are generated in the same way for both datasets.

DenseNet is an excellent deep learning model proposed by [10] in 2017. It has many advantages, such as encouraging feature reuse, alleviating the vanishing-gradient problem and substantially reducing the number of parameters. DenseNet has been very successful in image recognition tasks, even outperforming ResNet [13] on some datasets. In this paper, we build a DenseNet-like network for acoustic scene classification. We first use this network to build the base classifier; the super classifiers are then trained with the same deep network except for some minor modifications.
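The patch extraction described earlier (832-frame spectrograms, width-143 windows, shift step 80) can be sketched as follows. This is a minimal NumPy sketch; the actual CQT computation with librosa.cqt is omitted, and the end-aligned final patch is our assumption about how the stated 10 patches per 832-frame spectrogram are obtained, since 80-frame shifts alone yield only 9 full windows.

```python
import numpy as np

def split_into_patches(spec, patch_width=143, shift=80):
    """Split a (freq, time) spectrogram into fixed-width patches.

    Windows slide along the time axis with the given shift; a final
    end-aligned window is appended so the tail of the spectrogram is
    covered (an assumption made to match the 10 patches per 832-frame
    spectrogram reported in the text).
    """
    _, time_frames = spec.shape
    starts = list(range(0, time_frames - patch_width + 1, shift))
    last_start = time_frames - patch_width
    if starts[-1] != last_start:
        starts.append(last_start)
    return [spec[:, s:s + patch_width] for s in starts]

# A resized CQT spectrogram: 143 frequency bins x 832 time frames.
spec = np.random.rand(143, 832)
patches = split_into_patches(spec)
print(len(patches))  # 10 patches, each 143 x 143
```

Each audio segment then contributes N × 10 such patches, where N is the number of auditory channels.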
As shown in Table 1, the proposed network has three dense blocks, two transition layers and a growth rate of 12. The three dense blocks have 4, 4 and 32 bottleneck layers respectively. Note that, compared to the standard DenseNet [10], we have removed the initial 7 × 7 convolution layer.

Some acoustic scenes are very similar in their acoustic properties and are frequently misclassified. It is natural to cluster these similar scenes into a coarse category, which we call a super category. For some datasets, the super category information is explicitly provided. For example, the DCASE2017 Development dataset provides the 'indoor', 'outdoor' and 'vehicle' super categories. Specifically, the 'indoor' category includes the Cafe/Restaurant, Grocery store, Home, Library, Metro station and Office scenes; the 'vehicle' category contains the Bus, Car, Train and Tram scenes; the remaining scenes belong to the 'outdoor' category. However, not all datasets provide super category labels. For example, the LITIS Rouen dataset (https://sites.google.com/site/alainrakotomamonjy/home/audio-scene) offers no such information. It is therefore necessary to construct super categories from the original acoustic scenes.

When constructing super categories, it is challenging to identify the acoustic scenes with similar acoustic properties directly. However, it is easy to collect the misclassified cases of each acoustic scene under a given classification model. In our proposed method, we use this misclassification information to approximate the similarities among acoustic scenes, as similar scenes are prone to be misclassified as one another. Specifically, using the trained DenseNet-like deep model, we compute a confusion matrix M over the testing samples. Let M_ij be the element in the i-th row and the j-th column of M.
M_ij (i ≠ j) represents the number of patches whose ground-truth label is i but which are wrongly predicted as j. On the basis of the confusion matrix, we apply the spectral clustering method [11] to construct the super categories.

Note that the values in the confusion matrix are averaged over multiple testing sets. For example, on the DCASE2017 Development dataset, the confusion matrix is averaged across the test results of the 4 training/testing splits provided by the challenge organizer. The test results are obtained by the base classifier described in Section 3.1 with majority voting [14]. It is worth mentioning that the super category outputs for the DCASE2017 Development dataset in our experiments are exactly the same as the official 'indoor', 'outdoor' and 'vehicle' divisions, with the cluster number set to 3 and the K parameter in KNN set to 2. This demonstrates the effectiveness of our super categories construction method. We also cluster the LITIS Rouen dataset into 3 super categories. The clustering results are as follows: {Metro-rouen, High-speed train}, {Restaurant, Shop, Market, Café, Billiard pool hall} and {Quiet street, Plane, Bus, Train, Car, Tube station, Kid game hall, Metro-paris, Student hall, Pedestrian street, Busy street, Train station hall}.

For each super category, we further train a corresponding super classifier. The super classifier uses the same DenseNet-like architecture as the base classifier, and the weights of the trained base classifier are transferred to initialize it [15]. The only difference lies in the output layer. Taking the 'vehicle' super category in DCASE2017 as an example, five output nodes are set in the output layer of the super classifier: four nodes serving as the indicators of Car, Bus, Train and Tram respectively, and a fifth node acting as the negative flag. The negative flag is fired when the testing sample is considered not to belong to the 'vehicle' super category.
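The idea of clustering classes by their mutual confusions can be sketched as follows. This is a toy, NumPy-only stand-in for the spectral clustering of [11]: it symmetrizes the confusion matrix into an affinity matrix and splits the classes into two groups by the sign of the Fiedler vector. The paper clusters into 3 super categories with a KNN-based affinity (K = 2), which is not reproduced here; the eps floor is an assumption added to keep the toy graph connected.

```python
import numpy as np

def super_categories_from_confusions(M, eps=0.1):
    """Split classes into 2 coarse groups from a confusion matrix M.

    Classes that confuse each other are treated as similar; the
    second-smallest eigenvector of the normalized graph Laplacian
    (the Fiedler vector) separates the two groups by sign.
    """
    A = M + M.T                 # symmetric affinity from confusions
    A = A + eps                 # small floor keeps the graph connected
    np.fill_diagonal(A, 0.0)    # no self-similarity
    d = A.sum(axis=1)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)            # eigenvalues ascending
    return (vecs[:, 1] > 0).astype(int)    # sign split of Fiedler vector

# Toy confusion counts: classes 0/1 confuse each other, as do 2/3.
M = np.array([[0., 5., 0., 0.],
              [5., 0., 0., 0.],
              [0., 0., 0., 5.],
              [0., 0., 5., 0.]])
labels = super_categories_from_confusions(M)
```

On this toy matrix the two confusable pairs land in different groups, mirroring how frequently confused scenes end up in the same super category.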
The patches used to train the base classifier are reused to train the new super classifier, but the labels of some patches must be modified. Specifically, the labels of the patches of the Car, Bus, Train and Tram acoustic scenes are left unchanged, while the labels of all other patches are changed to 'NON-VEHICLE'. In this way, each super classifier is responsible for discriminating only a small range of acoustic scenes; as a result, the accuracies of these super classifiers are improved. As in the base classifier, SoftMax is applied in the output layer of the super classifiers.
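The relabeling step above can be sketched as a one-line mapping; this is a minimal illustration using the 'vehicle' super category of DCASE2017, with the function name being our own.

```python
def relabel_for_super_classifier(labels, super_category,
                                 negative_label="NON-VEHICLE"):
    """Relabel training patches for one super classifier: labels inside
    the super category are kept, all others collapse to the negative
    flag label."""
    return [y if y in super_category else negative_label for y in labels]

vehicle = {"Car", "Bus", "Train", "Tram"}
patch_labels = ["Car", "Home", "Tram", "Office"]
print(relabel_for_super_classifier(patch_labels, vehicle))
# ['Car', 'NON-VEHICLE', 'Tram', 'NON-VEHICLE']
```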
For convenience of description, we introduce two data structures before presenting the algorithm: the voting vector and the negative flag vector. Assume that there are AN audio segments in the testing set; PN patches are produced for each segment; SN super categories have been constructed from the CN original acoustic scene classes; and one base classifier and SN super classifiers have been built as described above. For the i-th audio segment (i ∈ [1, AN]), we generate a voting vector VV_i = (vv_{i,1}, vv_{i,2}, ..., vv_{i,CN}), initialized with all zero elements. For the p-th patch produced by the i-th audio segment, we feed it into the base CNN model; if the k-th node (k ∈ [1, CN]) in the output layer attains the maximum value, vv_{i,k} is increased by 1. After all PN patches are processed, the voting vector
VV_i is complete, with vv_{i,1} + vv_{i,2} + ... + vv_{i,CN} = PN. For the i-th audio segment (i ∈ [1, AN]), we also calculate SN negative flag vectors, denoted NF_{ij} = (nf_{ij,1}, nf_{ij,2}, ..., nf_{ij,PN}), j ∈ [1, SN], one for each super classifier. For the p-th patch produced by the i-th audio segment, we feed it to the j-th super classifier; if the negative flag node (see Section 3.3) attains the maximum value, nf_{ij,p} is set to 1; otherwise it is set to 0. After all PN patches have been processed by the SN super classifiers and the related elements of NF_{ij} (j ∈ [1, SN]) have been set, the negative flag vectors of the i-th audio segment are ready.

Algorithm 1
Punishment Voting
Input: CN; SN; PN; VV_i = (vv_{i,1}, vv_{i,2}, ..., vv_{i,CN}); NF_{ij} = (nf_{ij,1}, nf_{ij,2}, ..., nf_{ij,PN}), j ∈ [1, SN]; AS_j, j ∈ [1, SN]: the original acoustic scene set corresponding to the j-th super category.
Output: R_i: the resultant acoustic scene of the i-th audio segment.

for j = 1 : SN do
    count = 0
    for k = 1 : PN do
        if nf_{ij,k} == 1 then
            count++
        end if
    end for
    if count > PN / 2 then
        for p = 1 : CN do
            if p ∈ AS_j then
                vv_{i,p} = vv_{i,p} * γ
            end if
        end for
    end if
end for
R_i = argmax_t({vv_{i,t} | t ∈ [1, CN]})
return R_i

These SN negative flag vectors are used to cast punishment votes on the voting vector. Specifically, for the n-th super classifier, we examine the negative flag vector NF_{in}. If a majority of the elements of NF_{in} are equal to 1 (the threshold is set to PN/2), the i-th audio segment is judged not to belong to the n-th super category according to the n-th super classifier. As we have more confidence in the super classifiers, this judgement is used to rectify the voting vector: all elements corresponding to the n-th super category in the voting vector are multiplied by a punishment factor γ (γ is set to 0.25 in the experiments). Once all the super classifiers have cast their punishment votes, the final result is read from the voting vector. The pseudocode is shown in Algorithm 1.

All experiments are executed on an Intel(R) Core(TM) i7 system with 64 GB RAM and an NVIDIA GTX 1080 Ti GPU. We implement the DenseNet-like models in TensorFlow with a learning rate of 0.0001, a batch size of 32 and a dropout rate of 0.2. Adam is applied as the optimizer with a maximum of 1000 epochs. We evaluate the proposed punishment voting algorithm on the DCASE2017 Development dataset and the LITIS Rouen dataset. There are 15 acoustic scenes in the DCASE2017 Development dataset, each with 312 10-second audio segments. We follow the 4-fold cross validation protocol on this dataset.
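The punishment voting of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the experimental code; the dict-based argument layout (one negative flag array and one scene-index set per super classifier) is our own choice, and classes are 0-indexed here rather than 1-indexed as in the pseudocode.

```python
import numpy as np

def punishment_voting(vv, nf, scene_sets, gamma=0.25):
    """Punishment voting (a sketch of Algorithm 1).

    vv         : voting vector of length CN from the base classifier.
    nf         : dict {j: array of PN negative flags} per super classifier.
    scene_sets : dict {j: set of class indices in super category j}.
    If a majority of a super classifier's patches fire the negative
    flag, the votes of that super category's classes are punished by
    the factor gamma; the final class is the argmax of the rectified
    voting vector.
    """
    vv = np.asarray(vv, dtype=float).copy()
    for j, flags in nf.items():
        PN = len(flags)
        if np.sum(flags) > PN / 2:      # majority says "not this super category"
            for p in scene_sets[j]:
                vv[p] *= gamma
    return int(np.argmax(vv))

# Toy example: 4 classes, 2 super categories, 10 patches per segment.
vv = [4, 2, 3, 1]
nf = {0: np.array([1] * 8 + [0] * 2),   # super classifier 0: 8/10 patches negative
      1: np.zeros(10)}                  # super classifier 1: never negative
scene_sets = {0: {0, 1}, 1: {2, 3}}
print(punishment_voting(vv, nf, scene_sets))  # 2
```

In the toy run, class 0 would win plain majority voting with 4 votes, but super classifier 0 rejects its super category, so its votes shrink to 1 and class 2 wins instead.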
The LITIS Rouen dataset contains 3026 30-second audio segments, categorized into 19 acoustic scene classes. We randomly select three of the 20 standard splits of this dataset and evaluate the average accuracy over these three splits.
We trained two DenseNet-like base classifiers, one for each dataset. The classifier for the DCASE2017 Development dataset has 15 output nodes, each corresponding to an acoustic scene class, while the one for the LITIS Rouen dataset has 19 output nodes. The average accuracy achieved by the DCASE2017 base classifier with majority voting is 79.4% (the baseline accuracy is 74.8% according to the DCASE2017 challenge). The average accuracy of the LITIS Rouen base classifier with majority voting is 92.13%.
As mentioned above, we construct three super categories for both datasets. The constructed super categories for the DCASE2017 Development dataset are exactly the same as the 'indoor', 'outdoor' and 'vehicle' separations originally provided by the official organizer. We trained a super classifier for each super category. With majority voting, as shown in Figure 2, the average accuracies of the three super classifiers on the DCASE2017 Development dataset are 88.05%, 88.51% and 93.34% respectively. Compared to the accuracy of the base classifier, the accuracies of the super classifiers are significantly improved, as expected. As shown in Figure 3, the average accuracies of the three super classifiers with majority voting on the LITIS Rouen dataset are 96.19%, 93.36% and 93.57% respectively.

Figure 2: Accuracies of the three super classifiers on the DCASE2017 Development dataset

Figure 3: Accuracies of the three super classifiers on the LITIS Rouen dataset
Figure 4 compares the results of punishment voting with those of the base classifier with majority voting on the DCASE2017 Development dataset. An accuracy of 81.92% is obtained by our proposed method, which outperforms the base classifier by 2.52%. Similarly, Figure 5 shows the superiority of our proposed punishment voting on the LITIS Rouen dataset: our method raises the accuracy to 94.83%, an increase of 2.7% over the base classifier with majority voting.

Figure 4: Comparison of majority voting and punishment voting on the DCASE2017 Development dataset

Figure 5: Comparison of majority voting and punishment voting on the LITIS Rouen dataset
In the acoustic scene classification research domain, majority voting is frequently used. In this paper, we have proposed a punishment voting algorithm for acoustic scene classification. There are two main contributions in this work: the punishment voting algorithm and the super categories construction method. Firstly, we transform the audio segments into CQT spectrograms [9]. Using these CQT spectrograms, we train a DenseNet-like model as the base classifier. Based on the results of the base classifier, we construct the super categories by spectral clustering. For each super category, we train a corresponding super classifier. Finally, we develop a punishment voting algorithm that combines the results of the base classifier with those of the super classifiers to obtain the final results. Compared to the base classifier, the punishment voting method brings a 2.52% improvement on the DCASE2017 Development dataset and a 2.7% boost on the LITIS Rouen dataset. We believe that the punishment voting method is also useful for other recognition tasks, such as image recognition and behavior recognition, and we will extend our method to these tasks in future work.
References

[1] Stavros Ntalampiras, Ilyas Potamitis, and Nikos Fakotakis. Probabilistic novelty detection for acoustic surveillance under real-world conditions. IEEE Transactions on Multimedia, 13(4):713–719, 2011.
[2] Wei He, Zhijun Li, and C. L. Philip Chen. A survey of human-centered intelligent robots: issues and challenges. IEEE/CAA Journal of Automatica Sinica, 4(4):602–609, 2017.
[3] Björn Schuller, Florian Pokorny, Stefan Ladstätter, Maria Fellner, Franz Graf, and Lucas Paletta. Acoustic geo-sensing: Recognising cyclists' route, route direction, and route progress from cell-phone audio. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 453–457. IEEE, 2013.
[4] Soo Hyun Bae, Inkyu Choi, and Nam Soo Kim. Acoustic scene classification using parallel combination of LSTM and CNN. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pages 11–15, 2016.
[5] Daniele Battaglino, Ludovick Lepauloux, Nicholas Evans, France Mougins, and France Biot. Acoustic scene classification using convolutional neural networks. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[6] Seongkyu Mun, Suwon Shon, Wooil Kim, David K. Han, and Hanseok Ko. A novel discriminative feature extraction for acoustic scene classification using RNN based source separation. IEICE Transactions on Information and Systems, 100(12):3041–3044, 2017.
[7] Juncheng Li, Dai Wei, Phuong Pham, Samarjit Das, and Shuhui Qu. Acoustic scene recognition with deep neural networks (DCASE challenge 2016). Robert Bosch Research and Technology Center, 3, 2016.
[8] Rohit Patiyal and Padmanabhan Rajan. Acoustic scene classification using deep learning. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[9] Christian Schörkhuber and Anssi Klapuri. Constant-Q transform toolbox for music processing. In Proceedings of the 7th Sound and Music Computing Conference, pages 3–64, 2010.
[10] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.
[11] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
[12] Thomas Lidy and Alexander Schindler. CQT-based convolutional neural networks for audio scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), volume 90, pages 1032–1048. DCASE2016 Challenge, 2016.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Lionel S. Penrose. The elementary statistics of majority voting. Journal of the Royal Statistical Society, 109(1):53–57, 1946.
[15] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.