Mixup-Based Acoustic Scene Classification Using Multi-Channel Convolutional Neural Network
Kele Xu, Dawei Feng, Haibo Mi, Boqing Zhu, Dezhi Wang, Lilun Zhang, Hengxing Cai, Shuwen Liu
Kele Xu
School of Information Communication, National University of Defense Technology
Wuhan, China
[email protected]
Dawei Feng, Haibo Mi, Boqing Zhu
School of Computer, National University of Defense Technology
Changsha, China
[email protected], [email protected], [email protected]
Dezhi Wang, Lilun Zhang
College of Meteorology and Oceanography, National University of Defense Technology
Changsha, China
[email protected], [email protected]
Hengxing Cai
School of Engineering, Sun Yat-Sen University
Guangzhou, China
[email protected]
Shuwen Liu
School of Computer Science, Nanjing University of Technology
Nanjing, China
[email protected]
Abstract—Audio scene classification, the problem of predicting class labels of audio scenes, has drawn much attention during the last several years. However, it remains challenging in terms of both accuracy and efficiency. Recently, Convolutional Neural Network (CNN)-based methods have achieved better performance in comparison with traditional methods. Nevertheless, a conventional single-channel CNN may fail to exploit the additional cues that may be embedded in multi-channel recordings. In this paper, we explore the use of a multi-channel CNN for the classification task, which extracts features from the different channels in an end-to-end manner. We evaluate it against a conventional CNN and a traditional Gaussian Mixture Model-based method. Moreover, to improve the classification accuracy further, this paper explores the use of the mixup method. In brief, mixup trains the neural network on linear combinations of pairs of audio scene representations and their labels. By employing the mixup approach for data augmentation, the proposed model provides higher prediction accuracy and robustness than previous models, while the generalization error on the evaluation data is also reduced.
Index Terms—Multi-channel, convolutional neural network, acoustic scene classification, mixup
I. INTRODUCTION
Acoustic scene classification (ASC) refers to identifying the environment in which an audio recording was acquired, associating a semantic label with each audio. In 1997, Sawhney proposed the first method to address the ASC problem in an MIT technical report [1]. A set of classes, including "people", "voices", "subway", and "traffic", was recorded, and an overall classification accuracy of 68% was obtained using recurrent neural networks and the K-nearest neighbor criterion. Indeed, the recognition of environments has become an important application in the field of machine listening, as ASC enables devices to make sense of their surroundings. The potential applications of ASC are evident in several fields, such as security surveillance and context-aware services.

To address the lack of common benchmarking datasets, the first Detection and Classification of Acoustic Scenes and Events (DCASE) 2013 challenge [2] was organized by the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee. Many audio processing techniques have been proposed in the years since, and applications of deep learning to ASC, especially the convolutional neural network (CNN), have increased dramatically during the last five years. Compared to the traditional approach, which commonly trains a Gaussian Mixture Model (GMM) on frame-level features such as Mel-Frequency Cepstral Coefficients (MFCCs) [3], CNN-based methods achieve better performance. However, most previous attempts applied deep learning to a single channel (or simply the average of the left and right channels) [4]. A robust audio scene classification model should be able to capture temporal patterns in the different channels, as additional cues may be embedded in multi-channel recordings [5]. In this paper, we explore the use of a multi-channel CNN for the ASC task, which achieves better accuracy in comparison with the standard CNN.

On the other hand, deep neural network architectures have a large number of parameters and are prone to overfitting. The easiest and most widely used way to reduce overfitting is to employ larger datasets. As an alternative, data augmentation can improve the performance of a neural network by artificially enlarging the dataset through label-preserving transformations. However, only a few attempts have been made at data augmentation for audio scene classification. In this paper, we explore the use of the mixup method for data augmentation [6], with the goal of obtaining superior accuracy and robustness. In brief, mixup constructs virtual training examples, and the neural network is trained on linear combinations of pairs of example representations and their labels. Theoretically, mixup extends the training distribution by incorporating the prior knowledge that linear interpolations of audio feature vectors should lead to linear interpolations of the associated targets [6]. Mixup can be implemented in a few lines of code and induces minimal computational overhead. Despite its simplicity, mixup yields a performance improvement on the DCASE 2017 audio scene classification dataset.

The paper is organized as follows. Section 2 discusses the relationship between our method and prior work, while the multi-channel CNN classification method is presented in Section 3. Section 4 describes the mixup method, and the experimental results are given in Section 5.
Section 6 concludes the paper.

II. RELATION TO PRIOR WORK
Scene classification (detection) has been explored in computer vision using different techniques, and dramatic progress has been made over the last two decades. However, compared with image- or video-based scene classification, audio-based approaches remain under-explored, and the state-of-the-art audio-based techniques cannot yet match their image/video counterparts. In fact, audio can sometimes be more descriptive than video or images, especially when it comes to describing an event.

Recently, owing to the release of relatively large labeled datasets, a plethora of efforts have been made on the audio scene classification task [7], [8]. In brief, the main contributions can be divided into three parts: the representation of the audio signal (or handcrafted feature design) [9], [10], [11]; more sophisticated shallow-architecture classifiers [12], [13], [14]; and applications of deep learning to the ASC task [15], [16].

Indeed, deep learning has witnessed dramatic progress during the last decade and achieved success in several different fields, such as image classification [16], speech recognition [17], and natural language processing [18]. Although there are some attempts that employ CNNs to solve the ASC task, most of them address the problem using monaural signals. In [11], the authors proposed to concatenate the different channels, resulting in a one-channel file with longer duration; this method employed a one-channel CNN architecture. In [20], the authors proposed an all-convolutional neural network with masked global pooling for the ASC task; however, only the left channel was employed for classification. Here, we argue that additional cues may be embedded in the binaural recordings [11], and that combining the information in multiple channels may lead to better feature representations for classification.

On the other hand, the trend in deep neural network architectures is to become deeper and wider, with millions of parameters to be trained. To improve the generalization ability of neural networks, plenty of regularization approaches have been used, including batch normalization and dropout. When only limited training data are available, data augmentation using label-preserving transformations is a widely used technique to improve the robustness of neural network training. Although it follows the same goal of improving the prediction invariance of deep neural networks, data augmentation for audio scene classification differs from that for image classification: traditional augmentations such as rotation, flipping, distortion, and deformation cannot be applied directly. The procedure is dataset-dependent and requires expert knowledge [6].

In this paper, we explore the mixup data augmentation approach proposed in [6]. In brief, new samples are created by mixing two inputs of the neural network with a given ratio, and the labels of the new samples are the corresponding between-class labels. Normally, the ratio ranges from 0 to 1. On the DCASE 2017 audio scene classification dataset, improved performance is observed after employing the mixup approach.

III. MULTI-CHANNEL CONVOLUTIONAL NEURAL NETWORK
Owing to their ability to automatically learn complex feature representations, CNNs have achieved great success. A CNN has the potential to identify the various salient patterns of audio signals. In more detail, the processing units in the lower layers obtain local features of the signals, while the higher layers extract high-level representations.

The input to a CNN architecture can be the raw audio signal or a time-frequency representation of the raw signal (for example, MFCCs, the short-time Fourier transform, or spectrograms). In our experiments, we employ a widely used feature representation, the Mel-filter bank features of the audio signal segments, as the input to the CNN; however, it is straightforward to extend our framework to other kinds of input. Unlike attempts that keep a one-channel CNN architecture [11], we extract features from three different channels: the left channel, the right channel, and the mean of the left and right channels. The Mel-filter bank features of the different channels are concatenated as a multi-channel image, so that the system can be trained in an end-to-end manner. Note that the Mel-filter bank configuration was kept the same for each channel throughout our experiments. The Mel-filter bank features are calculated for each channel, using the first half of a symmetric Hann window as the window function with a window size of 25 ms and a hop size of 25 ms.

The input to the network is the three-channel Mel-filter bank features of size 3 × 128 × 128, where 128 × 128 denotes the size of the Mel-filter bank features for a single channel. The input sizes are kept the same throughout the experiments. The flowchart of multi-channel CNN-based audio scene classification is given in Fig. 1.

Fig. 1. Multi-channel CNN-based audio scene classification.

There are numerous variants of CNN architectures in the literature; however, their basic components are very similar.
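As an illustration, a minimal feature-extraction sketch is given below, assuming librosa is available. The helper name multichannel_logmel and the cropping to 128 frames are illustrative assumptions, not details from the original implementation; the 44.1 kHz rate, 128 Mel bands, 25 ms window, 25 ms hop, and the left/right/mean channel stacking follow the text.

    import numpy as np
    import librosa

    def multichannel_logmel(path, sr=44100, n_mels=128, n_frames=128):
        # Load the stereo recording without downmixing to mono.
        y, sr = librosa.load(path, sr=sr, mono=False)
        left, right = y[0], y[1]
        mean = 0.5 * (left + right)
        win = int(0.025 * sr)  # 25 ms window and 25 ms hop, as in the paper
        channels = []
        for ch in (left, right, mean):
            mel = librosa.feature.melspectrogram(
                y=ch, sr=sr, n_fft=win, hop_length=win,
                n_mels=n_mels, window='hann')
            channels.append(librosa.power_to_db(mel))
        x = np.stack(channels, axis=0)  # shape: (3, n_mels, n_windows)
        return x[:, :, :n_frames]       # crop to 3 x 128 x 128 (illustrative)

Stacking the three log-Mel images along the channel axis mirrors how RGB images are fed to standard CNNs, which is what allows off-the-shelf VGG-style and Xception architectures to be reused unchanged.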
Since LeNet-5 [23], convolutional neural networks have typically had a standard structure: stacked convolutional layers (optionally followed by batch normalization and max-pooling) followed by fully-connected layers. In this paper, we follow the VGG-style [24] and Xception [25] networks because of their relatively high accuracy and simplicity. The main contribution of the VGG net is to increase the depth using an architecture with very small (3 × 3) convolution filters.

While VGG achieves impressive accuracy on image classification tasks, its deployment on even modest-sized GPUs is problematic because of its huge computational requirements, both in terms of memory and time; it is inefficient due to the large width of its convolutional layers.

As the state-of-the-art model in the Inception family, the Xception architecture employs depthwise separable convolutions in place of the regular Inception modules. It performs excellently on larger image classification datasets such as ImageNet and has become a cornerstone of convolutional neural network architecture design. Another change the Xception model made was to replace the fully-connected layers at the end with a simple global average pooling layer, which averages the values of each 2D feature map after the last convolutional layer. This drastically reduces the total number of parameters; in VGGNet, for instance, the fully-connected layers contain about 90% of the parameters.

The only changes we made to VGG were to the final layers (using a global average pooling layer) and the use of batch normalization instead of Local Response Normalization (LRN). The parameters of the CNN models are optimized with stochastic gradient descent, with cross-entropy as the objective function; moreover, an L2 weight decay penalty of 0.002 is employed in our models. To train the CNNs, we used Keras with the TensorFlow backend, which can fully utilize GPU resources; CUDA and cuDNN were also used to accelerate the system.

It is worthwhile to note that each layer consists of many convolution or pooling operators, and the convolutional filters can be interpreted as learned filter banks. For the activation layers, the rectified linear unit is used to introduce non-linearity into the network. The last layer is the probability output layer, which converts the output vector of the fully-connected layer into a vector of probabilities that sum to 1, each corresponding to one class. These probabilities are used to predict the scene label of an audio segment. For the final prediction on an input instance, there are many widely used aggregation approaches, for example maximum probability, median probability, average probability, and majority vote. In this paper, for the evaluation of the CNN-based methods, we use the maximum probability to obtain the label.
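To make the modifications above concrete, the following is a minimal sketch of a VGG-style network in Keras with the TensorFlow backend. The filter counts, learning rate, and function name vgg_style_asc are illustrative assumptions; the batch normalization, global average pooling in place of the fully-connected stack, softmax output, SGD optimizer, cross-entropy loss, and L2 penalty of 0.002 follow the text.

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    def vgg_style_asc(input_shape=(128, 128, 3), n_classes=15,
                      weight_decay=0.002):
        # Channels-last input: the 3 x 128 x 128 features are assumed
        # to be transposed to 128 x 128 x 3 to match the Keras default.
        reg = regularizers.l2(weight_decay)
        inputs = keras.Input(shape=input_shape)
        x = inputs
        for filters in (32, 64, 128, 256):  # illustrative filter counts
            x = layers.Conv2D(filters, 3, padding='same',
                              kernel_regularizer=reg)(x)
            x = layers.BatchNormalization()(x)  # instead of LRN
            x = layers.Activation('relu')(x)
            x = layers.MaxPooling2D(2)(x)
        x = layers.GlobalAveragePooling2D()(x)  # replaces the FC stack
        outputs = layers.Dense(n_classes, activation='softmax',
                               kernel_regularizer=reg)(x)
        model = keras.Model(inputs, outputs)
        model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                      loss='categorical_crossentropy', metrics=['accuracy'])
        return model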
IV. MIXUP FOR DATA AUGMENTATION
We evaluate the multi-channel CNN on the TUT sound events detection 2017 database [7]. The database consists of stereo recordings collected at a 44.1 kHz sampling rate with 24-bit resolution. The recordings come from 15 different acoustic scenes with distinct recording locations, for example office, train, and forest path. For each location, 3-5 minutes of audio was recorded, and the audio files were split into 30-second segments. The acoustic scene classes considered in this task were: bus, cafe/restaurant, car, city center, forest path, grocery store, home, lakeside beach, library, metro station, office, residential area, train, tram, and urban park.

Currently, most publicly available ASC datasets are of limited size [3], [7], and the disadvantage of a small dataset is that the model is prone to overfitting. In the DCASE 2017 audio scene classification task, the generalization gap is large: the accuracy difference between the development dataset and the evaluation dataset ranges from 4% to 30% across different approaches. Generalization ability remains an open research topic for deep neural networks. To improve the generalization ability of deep neural networks, especially CNNs, a plethora of approaches have been proposed, such as dropout [26] and batch normalization [27]. Data augmentation is another explicit form of regularization that is widely used in deep neural networks. In more detail, for deep CNNs, random cropping and random flipping are the two most popular data augmentation approaches; however, these methods cannot be applied to ASC directly. Recently, it was found that generative adversarial networks can be used for ASC data augmentation, with impressive performance on the task [28]. Indeed, data augmentation was under-explored in previous ASC studies.

In this paper, we explore the use of mixup data augmentation. In more detail, virtual training examples are constructed using the following formulas:

x = α × x_i + (1 − α) × x_j    (1)
y = α × y_i + (1 − α) × y_j    (2)

where (x_i, y_i) and (x_j, y_j) are two examples randomly selected from the training data of the DCASE 2017 ASC task, and α is the mixing ratio. In our experiments, α ∈ [0, 1]. A mixup example is given in Fig. 2, with α set to 0.2: two labeled audio scenes are selected randomly, and a new training sample is constructed as the weighted average of the two given samples.

Fig. 2. An example of mixup data augmentation for audio scene classification.
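A direct implementation of Eqs. (1)-(2) takes only a few lines. The sketch below, with the hypothetical helper name mixup_batch, mixes a batch with a randomly permuted copy of itself using a fixed ratio α, matching the equations above. (Note that the original mixup formulation [6] instead draws the ratio from a Beta distribution for every pair.)

    import numpy as np

    def mixup_batch(x, y, alpha=0.2):
        # Eqs. (1)-(2): convex combinations of random example pairs
        # and of their one-hot label vectors.
        idx = np.random.permutation(len(x))
        x_mix = alpha * x + (1.0 - alpha) * x[idx]
        y_mix = alpha * y + (1.0 - alpha) * y[idx]
        return x_mix, y_mix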
Despite its simplicity, mixup data augmentation has provided state-of-the-art performance on many datasets, including the CIFAR-10, CIFAR-100, and ImageNet-2012 image classification datasets. By creating such between-class examples, mixup also increases the robustness of deep CNNs when the training samples contain corrupted labels. In the following section, we demonstrate that mixup data augmentation can also improve ASC performance.
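Putting the pieces together, a hypothetical training loop could feed mixup batches to the model from Section III. The generator below reuses the illustrative mixup_batch and vgg_style_asc sketches from earlier; the batch size and epoch count are arbitrary choices, not values from the paper.

    import numpy as np

    def mixup_generator(x_train, y_train, batch_size=32, alpha=0.2):
        # Yields an endless stream of mixup batches for model.fit.
        n = len(x_train)
        while True:
            idx = np.random.choice(n, batch_size, replace=False)
            yield mixup_batch(x_train[idx], y_train[idx], alpha=alpha)

    # model = vgg_style_asc()
    # model.fit(mixup_generator(x_train, y_train),
    #           steps_per_epoch=len(x_train) // 32, epochs=100)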
V. EXPERIMENTAL RESULTS
As the DCASE 2017 scene classification dataset provides cross-validation splits, we follow the provided 4-fold cross-validation splits, and we use the same experimental settings from the development set on the evaluation set. In the development stage, results are reported as the average accuracy over the 4 folds; the performance on the evaluation data is also given in this section. In our experiments, we made two sets of comparisons: between the single-channel CNN and the multi-channel CNN, and between the multi-channel CNN with and without mixup data augmentation.
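For reference, a small sketch of the maximum-probability rule from Section III, under the assumption that each 30-second segment is scored window by window (the helper name segment_label is hypothetical):

    import numpy as np

    def segment_label(window_probs):
        # window_probs: (n_windows, n_classes) softmax outputs for one
        # segment. Maximum-probability rule: the class holding the
        # single most confident window prediction wins.
        return int(np.argmax(window_probs.max(axis=0)))

    # Development accuracy is then the mean over the 4 fold accuracies.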
A. Single/Multi-channel CNN
The first set of experiments evaluates single-channel and multi-channel audio scene classification using convolutional neural networks. As mentioned above, the architectures used for the comparison are VGGNet and Xception, and all of the CNNs are trained from scratch without any pre-trained initialization. Table I presents the validation results of the 4-fold cross-validation as well as the performance on the evaluation data. The performance of the baseline is also given in Table I; in more detail, the baseline system consists of 60 MFCC features and a Gaussian mixture model (GMM)-based classifier.
TABLE I
AUDIO SCENE CLASSIFICATION ACCURACY USING SINGLE/MULTI-CHANNEL CONVOLUTIONAL NEURAL NETWORK

Method                         Cross-validation  Evaluation
Baseline (GMM)                 74.8%             61.0%
Single-channel using VGGNet    82.4%             67.8%
Multi-channel using VGGNet     84.7%             71.5%
Single-channel using Xception  83.3%             72.1%
Multi-channel using Xception   85.4%             74.5%

As can be seen from the table, both the single-channel and multi-channel CNNs outperform the baseline. It can also be observed that, for both architectures, the multi-channel CNN performs better than the single-channel CNN, increasing the accuracy by about 2% to 3%. This may imply that additional features can be extracted from the multiple channels, improving the accuracy of the ASC task. Similar to the results obtained on ImageNet classification, the Xception architecture provides better performance than VGGNet; the reason is that the Xception architecture can take multi-scale information into account, as kernels of different sizes are learned in the model. On the other hand, the accuracy difference between the development dataset and the evaluation dataset is significant, ranging from 10.9% to 14.2% across the approaches. This indicates that the trained CNN models are prone to overfitting during the training procedure.
B. Multi-channel CNN with/without mixup
The second set of experiments aims to show that mixup-based data augmentation can improve the performance of ASC. Moreover, it is also shown that mixup can reduce the generalization gap.
TABLE II
AUDIO SCENE CLASSIFICATION ACCURACY USING MIXUP DATA AUGMENTATION

Method                  α    Cross-validation  Evaluation
Multi-channel VGGNet    0    84.7%             71.5%
Multi-channel VGGNet    0.2  85.2%             73.4%
Multi-channel VGGNet    0.5  86.9%             73.2%
Multi-channel VGGNet    0.8  85.8%             72.1%
Multi-channel Xception  0    85.4%             74.5%
Multi-channel Xception  0.2  86.7%             75.6%
Multi-channel Xception  0.5  87.2%             76.7%
Multi-channel Xception  0.8  86.9%             74.8%
Table II gives the performance of the different approaches employing mixup data augmentation with different mixing ratios. Due to computational resource constraints, only three nonzero ratios are used in our experiments (when α = 0, mixup is not employed). As can be seen from the table, without mixup data augmentation the cross-validation accuracy of the multi-channel VGGNet is 84.7% and its evaluation accuracy is 71.5%, while the evaluation accuracy ranges from 72.1% to 73.4% when mixup data augmentation is employed. For the multi-channel CNN using the Xception architecture, the accuracy is also improved by mixup data augmentation, which demonstrates that the mixup approach is effective despite its simplicity. In our experimental results, mixup with ratio 0.5 provides the best overall performance.

VI. CONCLUSION
In this paper, we have presented a multi-channel convolutional neural network-based method for multi-class acoustic scene classification. To summarize, the contributions of this paper are twofold: first, we present a multi-channel CNN architecture for the classification task; second, we explore the mixup data augmentation method, and experiments demonstrate that employing mixup improves classification accuracy while also reducing the generalization error. To the best of the authors' knowledge, this is the first attempt to employ mixup for the audio scene classification task. For future work, we will investigate CNN architectures that utilize the multi-scale information embedded in the audio signal, so as to further improve classification accuracy. The mixup approach also needs to be explored more fully: presently, the mixup processing relies on the log-Mel spectrum of the audio signal, and we did not observe significant improvement when mixing the raw audio signals directly. Moreover, mixup may also be useful for audio event tagging and detection.

ACKNOWLEDGMENT