Deep Learning of Human Perception in Audio Event Classification
Yi Yu, Samuel Beuret, Donghuo Zeng, Keizo Oyama
National Institute of Informatics, Tokyo; SOKENDAI; École Polytechnique Fédérale de Lausanne
(Samuel Beuret was involved in this work during their internship at the National Institute of Informatics (NII), Tokyo.)

Abstract—In this paper, we introduce our recent studies on human perception in audio event classification with different deep learning models. In particular, the pre-trained VGGish model is used as a feature extractor for audio data, and DenseNet is trained on and used as a feature extractor for our electroencephalography (EEG) data. The correlation between audio stimuli and EEG is learned in a shared space. In the experiments, we record the brain activities (EEG signals) of several subjects while they listen to music events of 8 audio categories selected from Google AudioSet, using a 16-channel EEG headset with active electrodes. Our experimental results demonstrate that i) audio event classification can be improved by exploiting the power of human perception, and ii) the correlation between audio stimuli and EEG can be learned to complement audio event understanding.
Index Terms—EEG, deep learning of human perception, audio event classification, canonical correlation analysis
I. BACKGROUND AND MOTIVATION
Audio event classification is an interesting problem in machine perception, which mainly targets recognizing and relating sounds from audio. Various Convolutional Neural Networks (CNNs) have demonstrated promising results in audio classification [1]. In contrast, human perception of and responses to audio events, e.g., how audio events encountered in real-world environments are understood and interpreted through specific semantic categorization or more detailed description, are tied to human cognitive processes. Recent research in cognitive neuroscience [1][2] has learned discriminative features from EEG recordings to distinguish music audio stimuli with CNN techniques, showing the possibility of using brain signals and deep learning to classify music audio. However, little research studies the following problems: i) how to measure the differences between audio events by exploiting audio and/or EEG, and ii) how to measure the correlation between audio stimuli and the corresponding EEG data.

Motivated by Google research, which recently released a sound vocabulary and dataset aiming to provide a common, realistic-scale evaluation platform for audio event classification covering human sounds, music genres, and environmental sounds (see https://research.google.com/audioset/), in this work we build a new EEG dataset that annotates a Google AudioSet subset of 160 singing-related segments. EEG data are recorded while subjects are listening to the selected music audios, for the purpose of annotating audio events. In particular, alignments between audio and EEG are obtained during EEG data collection. We study not only the capabilities of deep learning in classifying audio stimuli by using the evoked EEG data, but also the correlation between audio features and EEG features to understand audio events. This paper has two major contributions: i) several models are compared to evaluate the performance of audio event classification; ii) the correlation between audio features and EEG features is learned to help audio event understanding and classification, which generates competitive results.
II. METHODOLOGY
This paper aims to demonstrate how well audio stimuli, EEG, and their combination can distinguish different audio events. To this end, we train different deep models to learn the correlation between audio stimuli and EEG, and we investigate several scenarios of audio event classification.
A. Audio event classification using EEG data alone
We first train a convolutional neural network, DenseNet [3], which takes EEG data as input, has a dense layer as its output layer, and uses a softmax activation function for the classification of audio events. We minimize the cross-entropy between the predicted probabilities and the reference probabilities (the class labels). The network trained here (without the last dense layer) is reused in all the other learning models as an EEG feature extractor; it allows us to reduce each EEG recording to a feature vector of 512 dimensions. For comparison, we also perform event classification with an equivalent pipeline: the first step reduces the 512-dimensional EEG feature to 20 dimensions using PCA [4]; the second step trains an SVM classifier [5] on the 20-dimensional compact feature and the class label.
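To make the pipeline concrete, the following Python sketch mirrors the two EEG classifiers described above. It is a minimal illustration, not the authors' code: the small CNN stands in for the DenseNet of [3], the input layout (16 channels x 1250 samples, i.e., 10 s at 125 Hz) is an assumption, and random arrays replace the real recordings.

# Minimal sketch: CNN classifier over EEG windows with a softmax output,
# whose penultimate 512-d layer is reused as an EEG feature extractor,
# followed by PCA + SVM (Sec. II-A). The CNN body and input shape are
# assumptions; the paper uses a DenseNet-style network.
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA
from sklearn.svm import SVC

NUM_CLASSES = 8
EEG_SHAPE = (16, 1250, 1)   # channels x time x 1 (assumed layout)

def build_eeg_classifier():
    inputs = tf.keras.Input(shape=EEG_SHAPE)
    x = tf.keras.layers.Conv2D(32, (1, 7), activation="relu")(inputs)
    x = tf.keras.layers.MaxPool2D((1, 4))(x)
    x = tf.keras.layers.Conv2D(64, (16, 1), activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    feat = tf.keras.layers.Dense(512, activation="relu", name="eeg_feat")(x)
    out = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(feat)
    model = tf.keras.Model(inputs, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# toy data standing in for real EEG recordings and their class labels
X = np.random.randn(100, *EEG_SHAPE).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=100)

model = build_eeg_classifier()
model.fit(X, y, epochs=1, batch_size=16, verbose=0)

# Reuse the network without its last dense layer as a feature extractor,
# then classify with PCA (512 -> 20 dims) + SVM.
extractor = tf.keras.Model(model.input, model.get_layer("eeg_feat").output)
feats = extractor.predict(X, verbose=0)                 # (N, 512)
feats_20 = PCA(n_components=20).fit_transform(feats)    # (N, 20)
svm = SVC(kernel="rbf").fit(feats_20, y)
print("train accuracy:", svm.score(feats_20, y))

The same extractor (the 512-dimensional feature layer) is the one reused by every other model in this paper.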
B. Audio event classification using audio data alone
The class labels are predicted by using the audio data alone. The first step is to extract the necessary features from the raw audio. This is done with the pre-trained VGGish model [6], which extends the well-established VGG16 architecture [7] and is trained on the large-scale AudioSet [8]. In this way, each song is reduced to a 1152-dimensional vector, which is further reduced to a 20-dimensional vector using PCA. On this basis, we train an SVM classifier for classifying audio events.

Fig. 1: Confusion matrix for audio event classification: (a) EEG only, (b) audio only, (c) audio and EEG.
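A rough sketch of the audio-only branch in Sec. II-B is given below. The TensorFlow Hub VGGish model and the toy waveforms are assumptions made for illustration; the paper only states that the pre-trained VGGish reduces each clip to a 1152-dimensional vector, which is consistent with flattening 9 frame embeddings of 128 dimensions for a roughly 10-second clip.

# Minimal sketch (assumptions flagged): VGGish frame embeddings (128-d per
# ~0.96 s frame) flattened into a clip-level 1152-d vector, then PCA + SVM.
# The Hub URL and the random waveforms are placeholders, not the authors' setup.
import numpy as np
import tensorflow_hub as hub
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

vggish = hub.load("https://tfhub.dev/google/vggish/1")  # assumed model source

def clip_embedding(waveform: np.ndarray, num_frames: int = 9) -> np.ndarray:
    """waveform: mono float32 samples at 16 kHz (VGGish's expected input)."""
    frames = vggish(waveform).numpy()        # (n_frames, 128)
    return frames[:num_frames].reshape(-1)   # (1152,)

# toy stand-ins for the 160 ten-second AudioSet clips and their 8 class labels
clips = [np.random.uniform(-1, 1, 16000 * 10).astype("float32") for _ in range(160)]
labels = np.random.randint(0, 8, size=160)
X = np.stack([clip_embedding(w) for w in clips])   # (160, 1152)

clf = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
clf.fit(X, labels)                                 # audio-only classifier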
C. Audio event classification using both audio and EEG
An EEG recording and its paired audio signal are reduced to 512 and 1152 dimensions, respectively, using the previously defined feature extractors. They are then concatenated into a 1664-dimensional vector, which is further reduced to 20 dimensions using PCA. On this basis, an SVM classifier is trained.
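A minimal sketch of this early-fusion step follows, assuming the EEG and audio feature matrices come from the extractors sketched above (random placeholders are used here).

# Minimal sketch of the fusion pipeline in Sec. II-C: concatenate the 512-d EEG
# feature and the 1152-d audio feature per pair, reduce to 20 dims with PCA,
# and train an SVM. Features and labels are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

eeg_feats = np.random.randn(720, 512)     # placeholder DenseNet EEG features
audio_feats = np.random.randn(720, 1152)  # placeholder VGGish audio features
labels = np.random.randint(0, 8, size=720)

fused = np.concatenate([eeg_feats, audio_feats], axis=1)  # (N, 1664)
clf = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
clf.fit(fused, labels)                                    # audio + EEG classifier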
D. Correlation learning between audio and EEG
We use canonical correlation analysis (CCA) [9], Deep CCA (DCCA) [10], and Category-based Deep CCA (C-DCCA) [11] to project audio and EEG features into a shared space. We expect that the information contained in the EEG data can help extract meaningful features from the audio data through canonical correlation analysis. To learn the correlation between audio and EEG, two tasks are investigated: using EEG as a query to retrieve audio, and vice versa.
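The linear CCA baseline can be sketched as below; DCCA and C-DCCA replace the linear projections with neural networks trained on a related correlation objective. The feature matrices, the component count, and the cosine-similarity retrieval rule are illustrative assumptions, not details taken from the paper.

# Minimal sketch of the linear CCA baseline in Sec. II-D: project paired EEG
# and audio features into a shared space, then rank audio clips for an EEG
# query by cosine similarity. Features are random placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

eeg_feats = np.random.randn(160, 512)      # one EEG feature per paired example
audio_feats = np.random.randn(160, 1152)   # one audio feature per paired example

cca = CCA(n_components=10, max_iter=1000)
cca.fit(eeg_feats, audio_feats)
eeg_proj, audio_proj = cca.transform(eeg_feats, audio_feats)

# retrieve audio for one EEG query: rank clips by similarity in the shared space
query = eeg_proj[0:1]
ranking = np.argsort(-cosine_similarity(query, audio_proj)[0])
print("top-5 retrieved audio indices:", ranking[:5])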
III. EVALUATION
Our audio events, selected from the Google large-scale AudioSet [8], contain 8 audio categories (Chant, Child singing, Choir, Female singing, Male singing, Rapping, Synthetic singing, and Yodeling) with 160 10-second audio segments. We conduct the EEG data collection on 9 male subjects, using EEG devices produced by OpenBCI (http://openbci.com/), where 16 channels are used to sample EEG data at a frequency of 125 Hz. Each subject listens to a category-based session with 20 audios (with a 2-second pause between audios) 5 times while his EEG signal is recorded. A total of 7200 EEG signals are acquired. We randomly split our dataset into ten folds, making sure that each category is equally represented in each fold. All test results are then averaged over the 10 cross-validation folds.
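The fold construction can be sketched as follows; features and labels are small random placeholders rather than the actual 7200 EEG recordings.

# Minimal sketch of the evaluation protocol: a stratified 10-fold split so that
# each of the 8 categories is equally represented in every fold, with test
# accuracy averaged over the folds. Data are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.standard_normal((800, 512))   # placeholder feature vectors
labels = np.repeat(np.arange(8), 100)        # 8 equally represented classes

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(features, labels):
    clf = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
    clf.fit(features[train_idx], labels[train_idx])
    accuracies.append(clf.score(features[test_idx], labels[test_idx]))
print("mean accuracy over 10 folds:", np.mean(accuracies))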
A. Comparisons among proposed learning models
Using DenseNet with a fully connected layer as a classifier, we obtain a training accuracy of 94% and a testing accuracy of 61%. When we use DenseNet as a feature extractor and a combination of PCA and SVM to perform the classification, we obtain a training accuracy of 98% and a testing accuracy of 59%. The difference between the two testing accuracies is small, showing that classification using PCA and SVM is only slightly worse than that of the dense layer used to train the network. This suggests that the results obtained by the PCA-SVM classifiers in the following experiments are reasonable compared to an optimal classifier. The confusion matrix associated with this experiment is shown in Fig. 1(a). It shows that the error is more or less equally distributed among all the classes, without any obvious error pattern.

For the classification considering only audio data, we obtain a training accuracy of 100% and a testing accuracy of 67%. The confusion matrix associated with these results is shown in Fig. 1(b). We can notice some interesting error patterns, such as the frequent confusions between female singing and child singing, or between female singing and male singing. These observations are consistent with reality, where the average frequency range of women's voices lies between those of children and men. For the classification considering both audio and EEG data, we obtain a training accuracy of 99% and a testing accuracy of 81%. This testing accuracy is much higher than that achieved by the EEG-only and audio-only methods, showing that part of the information contained in the two modalities is mutually complementary. The confusion matrix corresponding to this experiment is shown in Fig. 1(c). It combines the characteristics of the two previous experiments, although the recurrent errors are attenuated. All results are summarized in Table I.

TABLE I: Accuracies of audio event classification with different learning methods and data modalities
Data modality:   Audio only   EEG only   EEG only   Audio & EEG
Model:           PCA-SVM      DenseNet   PCA-SVM    PCA-SVM
Accuracy:        67%          61%        59%        81%
B. Results on correlation learning between audio and EEG
We use cross-modal retrieval tasks to evaluate the correlation between audio and EEG: we try to find the relevant audio for a given EEG signal, and vice versa. In the former task there is only one relevant audio, and the performance is measured by the MRR1 metric (the mean of the inverse of the rank of the relevant item), while the performance of the latter task is measured by the MAP metric (there are 45 relevant EEG data for each audio, based on which the mean average precision is computed). In the correlation analysis, we compare CCA [9], DCCA [10], and C-DCCA [11]; their results are summarized in Tables II and III. According to these results, the retrieval performance is almost independent of the number of CCA components, indicating that 10 CCA components are sufficient to achieve good performance. CCA and DCCA produce similar results. In the task of retrieving audio from EEG, 720 EEG queries correspond to 16 audios in the database, and many EEG signals (45) share the same audio. This is equivalent to classifying an EEG signal into one of the 16 audios, and MRR1 is relatively high. In contrast, given an audio as query, there are 45 relevant EEG data in the database, but not all of them are similar even in the shared canonical space. Therefore, the MAP performance is relatively low for all methods. In both tasks, C-DCCA achieves a much better performance than the other two methods by stressing the intra-class similarity.
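For clarity, here is a minimal sketch of the two metrics as described above (reciprocal rank of the single relevant audio for MRR1, and mean average precision over the 45 relevant EEG items for MAP); the similarity matrices are random placeholders standing in for similarities computed in the shared space.

# Minimal sketch of the retrieval metrics used in this section.
import numpy as np

def mrr1(sim, relevant):
    """sim: (n_queries, n_items); relevant[q] = index of the single relevant item."""
    recip_ranks = []
    for q, rel in enumerate(relevant):
        order = np.argsort(-sim[q])                      # items ranked by similarity
        recip_ranks.append(1.0 / (np.where(order == rel)[0][0] + 1))
    return float(np.mean(recip_ranks))

def mean_average_precision(sim, relevance):
    """relevance: boolean (n_queries, n_items) mask of relevant items."""
    aps = []
    for q in range(sim.shape[0]):
        order = np.argsort(-sim[q])
        hits, precisions = 0, []
        for rank, item in enumerate(order, start=1):
            if relevance[q, item]:
                hits += 1
                precisions.append(hits / rank)
        aps.append(np.mean(precisions))
    return float(np.mean(aps))

# toy example: 3 queries over 4 items, relevant items are 2, 0, 3
sim = np.random.rand(3, 4)
print("MRR1:", mrr1(sim, relevant=[2, 0, 3]))
rel_mask = np.zeros((3, 4), dtype=bool)
rel_mask[0, 2] = rel_mask[1, 0] = rel_mask[2, 3] = True
print("MAP:", mean_average_precision(sim, rel_mask))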
TABLE II: MRR1 of audio retrieval with EEG data as query

Number of components   CCA     DCCA    C-DCCA
10                     0.287   0.284   0.369
15                     0.289   0.248   0.372
20                     0.283   0.267   0.373
25                     0.283   0.251   0.370
30                     0.283   0.288   0.368
35                     0.282   0.283   0.368
40                     0.283   0.273   0.370
TABLE III: MAP of EEG retrieval with audio as query
Number of components   CCA     DCCA    C-DCCA
10                     0.112   0.115   0.182
15                     0.112   0.092   0.182
20                     0.109   0.105   0.188
25                     0.108   0.099   0.186
30                     0.111   0.129   0.184
35                     0.109   0.109   0.183
40                     0.109   0.114   0.183
IV. CONCLUSION AND FUTURE WORK
Experimental results confirm that using EEG helps to increase the precision of audio event classification. Meanwhile, the large gap between training and testing accuracy shows that we could increase the performance of all the classifiers by adding some regularization to avoid overfitting. We also notice that the detected correlation remains weak, and further experiments or data collection are necessary to show more meaningful results. In the future, we will also investigate how to leverage the power of human perception to refine audio event recommendation.
V. ACKNOWLEDGEMENT
The authors would like to thank Francisco Raposo for his help with EEG data collection and processing during his internship at NII. Many thanks go to the volunteers who helped us record EEG signals while they were listening to audio events.
REFERENCES
[1] S. Stober, D. J. Cameron, and J. A. Grahn, "Using convolutional neural networks to recognize rhythm stimuli from electroencephalography recordings," in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 1449-1457.
[2] F. Raposo, D. M. de Matos, R. Ribeiro, S. Tang, and Y. Yu, "Towards deep modeling of music semantics using EEG regularizers," CoRR, vol. abs/1712.05197, 2017. [Online]. Available: http://arxiv.org/abs/1712.05197
[3] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, "DenseNet: Implementing efficient convnet descriptor pyramids," arXiv preprint arXiv:1404.1869, 2014.
[4] I. Jolliffe, "Principal component analysis," in International Encyclopedia of Statistical Science. Springer, 2011, pp. 1094-1096.
[5] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, "Support vector machines," IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18-28, 1998.
[6] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[7] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[8] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
[9] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3/4, pp. 321-377, 1936.
[10] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep canonical correlation analysis," in International Conference on Machine Learning, 2013, pp. 1247-1255.
[11] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, "Category-based deep CCA for fine-grained venue discovery from multimodal data."