Mixup-Based Acoustic Scene Classification Using Multi-Channel Convolutional Neural Network
Kele Xu, Dawei Feng, Haibo Mi, Boqing Zhu, Dezhi Wang, Lilun Zhang, Hengxing Cai, Shuwen Liu
Kele Xu
School of Information Communication, National University of Defense Technology
Wuhan, China
[email protected]
Dawei Feng, Haibo Mi, Boqing Zhu
School of Computer, National University of Defense Technology
Changsha, China
[email protected], [email protected], [email protected]
Dezhi Wang, Lilun Zhang
College of Meteorology and Oceanography, National University of Defense Technology
Changsha, China
[email protected], [email protected]
Hengxing Cai
School of Engineering, Sun Yat-Sen University
Guangzhou, China
[email protected]
Shuwen Liu
School of Computer Science, Nanjing University of Technology
Nanjing, China
[email protected]
Abstract—Audio scene classification, the problem of predicting class labels of audio scenes, has drawn much attention during the last several years. However, it remains challenging in terms of both accuracy and efficiency. Recently, Convolutional Neural Network (CNN)-based methods have achieved better performance in comparison with traditional methods. Nevertheless, a conventional single-channel CNN may fail to exploit the additional cues that may be embedded in multi-channel recordings. In this paper, we explore the use of a multi-channel CNN for the classification task, which extracts features from the different channels in an end-to-end manner. We evaluate it against a conventional CNN and a traditional Gaussian Mixture Model-based method. Moreover, to improve the classification accuracy further, this paper explores the use of the mixup method. In brief, mixup trains the neural network on linear combinations of pairs of audio scene representations and their labels. By employing the mixup approach for data augmentation, the proposed model provides higher prediction accuracy and robustness than previous models, while the generalization error on the evaluation data is also reduced.
Index Terms—Multi-channel, convolutional neural network, acoustic scene classification, mixup
I. INTRODUCTION
Acoustic scene classification (ASC) refers to identifying the environment in which an audio recording was acquired, associating a semantic label with each audio. In 1997, Sawhney proposed the first method to address the ASC problem in an MIT technical report [1]. A set of classes, including "people", "voices", "subway", and "traffic", was recorded, and an overall classification accuracy of 68% was obtained using recurrent neural networks and the K-nearest neighbor criterion. Indeed, the recognition of environments has become an important application in the field of machine listening, as ASC enables devices to make sense of their surroundings. The potential applications of ASC are evident in several fields, such as security surveillance and context-aware services.

To address the lack of common benchmarking datasets, the first Detection and Classification of Acoustic Scenes and Events (DCASE) 2013 challenge [2] was organized by the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee. Many audio processing techniques have been proposed in the years since, and applications of deep learning to ASC, especially the convolutional neural network (CNN), have increased dramatically during the last five years. Compared to the traditional approach, which commonly trains a Gaussian Mixture Model (GMM) on frame-level features such as Mel-Frequency Cepstral Coefficients (MFCCs) [3], CNN-based methods achieve better performance. However, most previous attempts applied deep learning to a single channel (or simply the average of the left and right channels) [4]. A robust audio scene classification model should be able to capture temporal patterns in the different channels, as additional cues may be embedded in multi-channel recordings [5]. In this paper, we explore the use of a multi-channel CNN for the ASC task, which achieves better accuracy in comparison with the standard CNN.

On the other hand, deep neural network architectures have a large number of parameters and are prone to overfitting. The easiest and most widely used way to reduce overfitting is to employ larger datasets. As an alternative, data augmentation can improve the performance of a neural network by artificially enlarging the dataset through label-preserving transformations. However, only a few attempts have been made at data augmentation for audio scene classification. In this paper, we explore the use of the mixup method for data augmentation [6], with the goal of obtaining superior accuracy and robustness. In brief, mixup constructs virtual training examples, and the neural network is trained on linear combinations of pairs of example representations and their labels. Theoretically, mixup extends the training distribution by incorporating the prior knowledge that linear interpolations of audio feature vectors should lead to linear interpolations of the associated targets [6]. Mixup can be implemented in a few lines of code and induces minimal computational overhead. Despite its simplicity, mixup yields a performance improvement on the DCASE 2017 audio scene classification dataset.

The paper is organized as follows. Section 2 discusses the relationship between our method and prior work, while the multi-channel CNN classification method is presented in Section 3. Section 4 describes the mixup method, and the experimental results are given in Section 5.
Section 6 concludes the paper.

II. RELATION TO PRIOR WORK
Scene classification (detection) has been explored in computer vision using different techniques, and dramatic progress has been made over the last two decades. However, compared with image- or video-based scene classification, audio-based approaches remain under-explored, and the state-of-the-art audio-based techniques cannot yet match their image/video counterparts. In fact, audio can sometimes be more descriptive than video or images, especially when it comes to describing an event.

Recently, owing to the release of relatively large labeled datasets, a plethora of efforts have been made on the audio scene classification task [7], [8]. In brief, the main contributions can be divided into three parts: the representation of the audio signal (or handcrafted feature design) [9], [10], [11]; more sophisticated shallow-architecture classifiers [12], [13], [14]; and applications of deep learning to the ASC task [15], [16].

Indeed, deep learning has witnessed dramatic progress during the last decade and achieved success in several different fields, such as image classification [16], speech recognition [17], and natural language processing [18]. Although there are some attempts that employ CNNs to solve the ASC task, most of them address the problem using monaural signals. In [11], the authors proposed to concatenate the different channels, resulting in a one-channel file with longer duration; this method employed a one-channel CNN architecture. In [20], the authors proposed an all-convolutional neural network with masked global pooling for the ASC task; however, only the left channel was employed for classification. Here, we argue that additional cues may be embedded in the binaural recordings [11], and that combining the information in multiple channels may lead to better feature representations for classification.

On the other hand, the trend in deep neural network architectures is to become deeper and wider, with millions of parameters to be trained. To improve the generalization ability of neural networks, plenty of regularization approaches have been used, including batch normalization and dropout. When only limited training data are available, data augmentation using label-preserving transformations is a widely used technique to improve the robustness of neural network training. Although it follows the same goal of improving the prediction invariance of deep neural networks, data augmentation for audio scene classification differs from that for image classification: traditional augmentations such as rotation, flipping, distortion, and deformation cannot be applied directly. The procedure is dataset-dependent and requires expert knowledge [6].

In this paper, we explore the mixup data augmentation approach proposed in [6]. In brief, new samples are created by mixing two inputs of the neural network with a given ratio, and the labels of the new samples are the corresponding between-class labels. Normally, the ratio ranges from 0 to 1. On the DCASE 2017 audio scene classification dataset, improved performance is observed after employing the mixup approach.

III. MULTI-CHANNEL CONVOLUTIONAL NEURAL NETWORK
Owing to their ability to automatically learn complex feature representations, CNNs have achieved great success. A CNN has the potential to identify the various salient patterns of audio signals. In more detail, the processing units in the lower layers obtain local features of the signals, while the higher layers extract high-level representations.

The input to a CNN architecture can be the raw audio signal or a time-frequency representation of the raw signal (for example, MFCCs, the short-time Fourier transform, or spectrograms). In our experiments, we employ a widely used feature representation, the Mel-filter bank features of the audio signal segments, as the input to the CNN; however, it is straightforward to extend our framework to other kinds of input. Unlike attempts that keep a one-channel CNN architecture [11], we extract features from three different channels: the left channel, the right channel, and the mean of the left and right channels. The Mel-filter bank features of the different channels are concatenated as a multi-channel image, so that the system can be trained in an end-to-end manner. Note that the Mel-filter bank configuration was kept the same for each channel throughout our experiments. The Mel-filter bank features are calculated for each channel, using the first half of a symmetric Hann window as the window function with a window size of 25 ms and a hop size of 25 ms.

The input to the network is the three-channel Mel-filter bank features of size 3 × 128 × 128, where 128 × 128 denotes the size of the Mel-filter bank features for a single channel. The input sizes are kept the same throughout the experiments. The flowchart of multi-channel CNN-based audio scene classification is given in Fig. 1.

Fig. 1. Multi-channel CNN-based audio scene classification.

There are numerous variants of CNN architectures in the literature; however, their basic components are very similar.
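As an illustration, a minimal feature-extraction sketch is given below, assuming librosa is available. The helper name multichannel_logmel and the cropping to 128 frames are illustrative assumptions, not details from the original implementation; the 44.1 kHz rate, 128 Mel bands, 25 ms window, 25 ms hop, and the left/right/mean channel stacking follow the text.

    import numpy as np
    import librosa

    def multichannel_logmel(path, sr=44100, n_mels=128, n_frames=128):
        # Load the stereo recording without downmixing to mono.
        y, sr = librosa.load(path, sr=sr, mono=False)
        left, right = y[0], y[1]
        mean = 0.5 * (left + right)
        win = int(0.025 * sr)  # 25 ms window and 25 ms hop, as in the paper
        channels = []
        for ch in (left, right, mean):
            mel = librosa.feature.melspectrogram(
                y=ch, sr=sr, n_fft=win, hop_length=win,
                n_mels=n_mels, window='hann')
            channels.append(librosa.power_to_db(mel))
        x = np.stack(channels, axis=0)  # shape: (3, n_mels, n_windows)
        return x[:, :, :n_frames]       # crop to 3 x 128 x 128 (illustrative)

Stacking the three log-Mel images along the channel axis mirrors how RGB images are fed to standard CNNs, which is what allows off-the-shelf VGG-style and Xception architectures to be reused unchanged.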
Since LeNet-5 [23], convolutional neural networks have typically had a standard structure: stacked convolutional layers (optionally followed by batch normalization and max-pooling) followed by fully-connected layers. In this paper, we follow the VGG-style [24] and Xception [25] networks because of their relatively high accuracy and simplicity. The main contribution of the VGG net is to increase the depth using an architecture with very small (3 × 3) convolution filters.

While VGG achieves impressive accuracy on image classification tasks, its deployment on even modest-sized GPUs is problematic because of its huge computational requirements, both in terms of memory and time; it is inefficient due to the large width of its convolutional layers.

As the state-of-the-art model in the Inception family, the Xception architecture employs depthwise separable convolutions in place of the regular Inception modules. It performs excellently on larger image classification datasets such as ImageNet and has become a cornerstone of convolutional neural network architecture design. Another change the Xception model made was to replace the fully-connected layers at the end with a simple global average pooling layer, which averages the values of each 2D feature map after the last convolutional layer. This drastically reduces the total number of parameters; in VGGNet, for instance, the fully-connected layers contain about 90% of the parameters.

The only changes we made to VGG were to the final layers (using a global average pooling layer) and the use of batch normalization instead of Local Response Normalization (LRN). The parameters of the CNN models are optimized with stochastic gradient descent, with cross-entropy as the objective function; moreover, an L2 weight decay penalty of 0.002 is employed in our models. To train the CNNs, we used Keras with the TensorFlow backend, which can fully utilize GPU resources; CUDA and cuDNN were also used to accelerate the system.

It is worthwhile to note that each layer consists of many convolution or pooling operators, and the convolutional filters can be interpreted as learned filter banks. For the activation layers, the rectified linear unit is used to introduce non-linearity into the network. The last layer is the probability output layer, which converts the output vector of the fully-connected layer into a vector of probabilities that sum to 1, each corresponding to one class. These probabilities are used to predict the scene label of an audio segment. For the final prediction on an input instance, there are many widely used aggregation approaches, for example maximum probability, median probability, average probability, and majority vote. In this paper, for the evaluation of the CNN-based methods, we use the maximum probability to obtain the label.
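To make the modifications above concrete, the following is a minimal sketch of a VGG-style network in Keras with the TensorFlow backend. The filter counts, learning rate, and function name vgg_style_asc are illustrative assumptions; the batch normalization, global average pooling in place of the fully-connected stack, softmax output, SGD optimizer, cross-entropy loss, and L2 penalty of 0.002 follow the text.

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    def vgg_style_asc(input_shape=(128, 128, 3), n_classes=15,
                      weight_decay=0.002):
        # Channels-last input: the 3 x 128 x 128 features are assumed
        # to be transposed to 128 x 128 x 3 to match the Keras default.
        reg = regularizers.l2(weight_decay)
        inputs = keras.Input(shape=input_shape)
        x = inputs
        for filters in (32, 64, 128, 256):  # illustrative filter counts
            x = layers.Conv2D(filters, 3, padding='same',
                              kernel_regularizer=reg)(x)
            x = layers.BatchNormalization()(x)  # instead of LRN
            x = layers.Activation('relu')(x)
            x = layers.MaxPooling2D(2)(x)
        x = layers.GlobalAveragePooling2D()(x)  # replaces the FC stack
        outputs = layers.Dense(n_classes, activation='softmax',
                               kernel_regularizer=reg)(x)
        model = keras.Model(inputs, outputs)
        model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                      loss='categorical_crossentropy', metrics=['accuracy'])
        return model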
IV. MIXUP FOR DATA AUGMENTATION
We evaluate the multi-channel CNN on the TUT sound events detection 2017 database [7]. The database consists of stereo recordings collected at a 44.1 kHz sampling rate with 24-bit resolution. The recordings come from 15 different acoustic scenes with distinct recording locations, for example office, train, and forest path. For each location, 3-5 minutes of audio was recorded, and the audio files were split into 30-second segments. The acoustic scene classes considered in this task were: bus, cafe/restaurant, car, city center, forest path, grocery store, home, lakeside beach, library, metro station, office, residential area, train, tram, and urban park.

Currently, most publicly available ASC datasets are of limited size [3], [7], and the disadvantage of a small dataset is that the model is prone to overfitting. In the DCASE 2017 audio scene classification task, the generalization gap is large: the accuracy difference between the development dataset and the evaluation dataset ranges from 4% to 30% across different approaches. Generalization ability remains an open research topic for deep neural networks. To improve the generalization ability of deep neural networks, especially CNNs, a plethora of approaches have been proposed, such as dropout [26] and batch normalization [27]. Data augmentation is another explicit form of regularization that is widely used in deep neural networks. In more detail, for deep CNNs, random cropping and random flipping are the two most popular data augmentation approaches; however, these methods cannot be applied to ASC directly. Recently, it was found that generative adversarial networks can be used for ASC data augmentation, with impressive performance on the task [28]. Indeed, data augmentation was under-explored in previous ASC studies.

In this paper, we explore the use of mixup data augmentation. In more detail, virtual training examples are constructed using the following formulas:

x = α × x_i + (1 − α) × x_j    (1)
y = α × y_i + (1 − α) × y_j    (2)

where (x_i, y_i) and (x_j, y_j) are two examples randomly selected from the training data of the DCASE 2017 ASC task, and α is the mixing ratio. In our experiments, α ∈ [0, 1]. A mixup example is given in Fig. 2, with α set to 0.2: two labeled audio scenes are selected randomly, and a new training sample is constructed as the weighted average of the two given samples.

Fig. 2. An example of mixup data augmentation for audio scene classification.
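A direct implementation of Eqs. (1)-(2) takes only a few lines. The sketch below, with the hypothetical helper name mixup_batch, mixes a batch with a randomly permuted copy of itself using a fixed ratio α, matching the equations above. (Note that the original mixup formulation [6] instead draws the ratio from a Beta distribution for every pair.)

    import numpy as np

    def mixup_batch(x, y, alpha=0.2):
        # Eqs. (1)-(2): convex combinations of random example pairs
        # and of their one-hot label vectors.
        idx = np.random.permutation(len(x))
        x_mix = alpha * x + (1.0 - alpha) * x[idx]
        y_mix = alpha * y + (1.0 - alpha) * y[idx]
        return x_mix, y_mix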
Despite its simplicity, mixup data augmentation has provided state-of-the-art performance on many datasets, including the CIFAR-10, CIFAR-100, and ImageNet-2012 image classification datasets. By creating such between-class examples, mixup also increases the robustness of deep CNNs when the training samples contain corrupted labels. In the following section, we demonstrate that mixup data augmentation can also improve ASC performance.
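Putting the pieces together, a hypothetical training loop could feed mixup batches to the model from Section III. The generator below reuses the illustrative mixup_batch and vgg_style_asc sketches from earlier; the batch size and epoch count are arbitrary choices, not values from the paper.

    import numpy as np

    def mixup_generator(x_train, y_train, batch_size=32, alpha=0.2):
        # Yields an endless stream of mixup batches for model.fit.
        n = len(x_train)
        while True:
            idx = np.random.choice(n, batch_size, replace=False)
            yield mixup_batch(x_train[idx], y_train[idx], alpha=alpha)

    # model = vgg_style_asc()
    # model.fit(mixup_generator(x_train, y_train),
    #           steps_per_epoch=len(x_train) // 32, epochs=100)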
V. EXPERIMENTAL RESULTS
As the DCASE 2017 scene classification dataset provides cross-validation splits, we follow the provided 4-fold cross-validation splits, and we use the same experimental settings from the development set on the evaluation set. In the development stage, results are reported as the average accuracy over the 4 folds; the performance on the evaluation data is also given in this section. In our experiments, we made two sets of comparisons: between the single-channel CNN and the multi-channel CNN, and between the multi-channel CNN with and without mixup data augmentation.
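For reference, a small sketch of the maximum-probability rule from Section III, under the assumption that each 30-second segment is scored window by window (the helper name segment_label is hypothetical):

    import numpy as np

    def segment_label(window_probs):
        # window_probs: (n_windows, n_classes) softmax outputs for one
        # segment. Maximum-probability rule: the class holding the
        # single most confident window prediction wins.
        return int(np.argmax(window_probs.max(axis=0)))

    # Development accuracy is then the mean over the 4 fold accuracies.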
A. Single/Multi-channel CNN
The first set of experiments evaluates single-channel and multi-channel audio scene classification using convolutional neural networks. As mentioned above, the architectures used for the comparison are VGGNet and Xception, and all of the CNNs are trained from scratch without any pre-trained initialization. Table I presents the validation results of the 4-fold cross-validation as well as the performance on the evaluation data. The performance of the baseline is also given in Table I; in more detail, the baseline system consists of 60 MFCC features and a Gaussian mixture model (GMM)-based classifier.
TABLE I
AUDIO SCENE CLASSIFICATION ACCURACY USING SINGLE/MULTI-CHANNEL CONVOLUTIONAL NEURAL NETWORK

Method                         Cross-validation  Evaluation
Baseline (GMM)                 74.8%             61.0%
Single-channel using VGGNet    82.4%             67.8%
Multi-channel using VGGNet     84.7%             71.5%
Single-channel using Xception  83.3%             72.1%
Multi-channel using Xception   85.4%             74.5%

As can be seen from the table, both the single-channel and multi-channel CNNs outperform the baseline. It can also be observed that, for both architectures, the multi-channel CNN performs better than the single-channel CNN, increasing the accuracy by about 2% to 3%. This may imply that additional features can be extracted from the multiple channels, improving the accuracy of the ASC task. Similar to the results obtained on ImageNet classification, the Xception architecture provides better performance than VGGNet; the reason is that the Xception architecture can take multi-scale information into account, as kernels of different sizes are learned in the model. On the other hand, the accuracy difference between the development dataset and the evaluation dataset is significant, ranging from 10.9% to 14.2% across the approaches. This indicates that the trained CNN models are prone to overfitting during the training procedure.
B. Multi-channel CNN with/without mixup
The second set of experiments aims to show that mixup-based data augmentation can improve the performance of ASC. Moreover, it is also shown that mixup can reduce the generalization gap.
TABLE II
AUDIO SCENE CLASSIFICATION ACCURACY USING MIXUP DATA AUGMENTATION

Method                  α    Cross-validation  Evaluation
Multi-channel VGGNet    0    84.7%             71.5%
Multi-channel VGGNet    0.2  85.2%             73.4%
Multi-channel VGGNet    0.5  86.9%             73.2%
Multi-channel VGGNet    0.8  85.8%             72.1%
Multi-channel Xception  0    85.4%             74.5%
Multi-channel Xception  0.2  86.7%             75.6%
Multi-channel Xception  0.5  87.2%             76.7%
Multi-channel Xception  0.8  86.9%             74.8%
Table II gives the performance of the different approaches employing mixup data augmentation with different mixing ratios. Due to computational resource constraints, only three nonzero ratios are used in our experiments (when α = 0, mixup is not employed). As can be seen from the table, without mixup data augmentation the cross-validation accuracy of the multi-channel VGGNet is 84.7% and its evaluation accuracy is 71.5%, while the evaluation accuracy ranges from 72.1% to 73.4% when mixup data augmentation is employed. For the multi-channel CNN using the Xception architecture, the accuracy is also improved by mixup data augmentation, which demonstrates that the mixup approach is effective despite its simplicity. In our experimental results, mixup with ratio 0.5 provides the best overall performance.

VI. CONCLUSION
In this paper, we have presented a multi-channel convolutional neural network-based method for multi-class acoustic scene classification. To summarize, the contributions of this paper are twofold: first, we present a multi-channel CNN architecture for the classification task; second, we explore the mixup data augmentation method, and experiments demonstrate that employing mixup improves classification accuracy while also reducing the generalization error. To the best of the authors' knowledge, this is the first attempt to employ mixup for the audio scene classification task. For future work, we will investigate CNN architectures that utilize the multi-scale information embedded in the audio signal, so as to further improve classification accuracy. The mixup approach also needs to be explored more fully: presently, the mixup processing relies on the log-Mel spectrum of the audio signal, and we did not observe significant improvement when mixing the raw audio signals directly. Moreover, mixup may also be useful for audio event tagging and detection.

ACKNOWLEDGMENT