Deep Learning of Human Perception in Audio Event Classification
Yi Yu, Samuel Beuret, Donghuo Zeng, Keizo Oyama
National Institute of Informatics, Tokyo; SOKENDAI; École Polytechnique Fédérale de Lausanne
(Samuel Beuret was involved in this work during their internship at the National Institute of Informatics (NII), Tokyo.)

Abstract—In this paper, we introduce our recent studies on human perception in audio event classification with different deep learning models. In particular, the pre-trained VGGish model is used as a feature extractor for audio data, and DenseNet is trained on and used as a feature extractor for our electroencephalography (EEG) data. The correlation between audio stimuli and EEG is learned in a shared space. In the experiments, we record the brain activities (EEG signals) of several subjects while they listen to music events of 8 audio categories selected from Google AudioSet, using a 16-channel EEG headset with active electrodes. Our experimental results demonstrate that i) audio event classification can be improved by exploiting the power of human perception, and ii) the correlation between audio stimuli and EEG can be learned to complement audio event understanding.
Index Terms—EEG, deep learning of human perception, audio event classification, canonical correlation analysis
I. BACKGROUND AND MOTIVATION
Audio event classification is an interesting problem in machine perception, which mainly targets recognizing and relating sounds from audio. Various Convolutional Neural Networks (CNNs) have demonstrated promising results in audio classification [1]. In contrast, human perception of and responses to audio events, e.g., how audio events encountered in real-world environments are understood and interpreted through specific semantic categorization or more detailed description, are tied to human cognitive processes. Recent research in cognitive neuroscience [1][2] has learned discriminative features from EEG recordings to distinguish music audio stimuli with CNN techniques, showing the possibility of using brain signals and deep learning to classify music audio. However, little research studies the following problems: i) how to measure the differences between audio events by exploiting audio and/or EEG, and ii) how to measure the correlation between audio stimuli and the corresponding EEG data.

Motivated by Google research, which recently released a sound vocabulary and dataset aiming to provide a common, realistic-scale evaluation platform for audio event classification covering human sounds, music genres, and environmental sounds (see https://research.google.com/audioset/), in this work we build a new EEG dataset that annotates a Google AudioSet subset of 160 singing-related segments. EEG data are recorded while subjects are listening to the selected music audios, for the purpose of annotating audio events. In particular, alignments between audio and EEG are obtained during EEG data collection. We study not only the capabilities of deep learning in classifying audio stimuli by using the evoked EEG data, but also the correlation between audio features and EEG features to understand audio events. This paper has two major contributions: i) several models are compared to evaluate the performance of audio event classification; ii) the correlation between audio features and EEG features is learned to help audio event understanding and classification, which generates competitive results.
II. METHODOLOGY
This paper aims to demonstrate how well audio stimuli, EEG, and their combination can distinguish different audio events. To this end, we train different deep models to learn the correlation between audio stimuli and EEG, and we investigate several scenarios of audio event classification.
A. Audio event classification using EEG data alone
We first train a convolutional neural network, DenseNet [3], which takes EEG data as input, has a dense layer as its output layer, and uses a softmax activation function for the classification of audio events. We minimize the cross-entropy between the predicted probabilities and the reference probabilities (the class labels). The network trained here (without the last dense layer) is reused in all the other learning models as an EEG feature extractor; it allows us to reduce each EEG recording to a feature vector of 512 dimensions. For comparison, we also perform event classification with an equivalent pipeline: the first step reduces the 512-dimensional EEG feature to 20 dimensions using PCA [4]; the second step trains an SVM classifier [5] on the 20-dimensional compact feature and the class label.
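To make the pipeline concrete, the following Python sketch mirrors the two EEG classifiers described above. It is a minimal illustration, not the authors' code: the small CNN stands in for the DenseNet of [3], the input layout (16 channels x 1250 samples, i.e., 10 s at 125 Hz) is an assumption, and random arrays replace the real recordings.

# Minimal sketch: CNN classifier over EEG windows with a softmax output,
# whose penultimate 512-d layer is reused as an EEG feature extractor,
# followed by PCA + SVM (Sec. II-A). The CNN body and input shape are
# assumptions; the paper uses a DenseNet-style network.
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA
from sklearn.svm import SVC

NUM_CLASSES = 8
EEG_SHAPE = (16, 1250, 1)   # channels x time x 1 (assumed layout)

def build_eeg_classifier():
    inputs = tf.keras.Input(shape=EEG_SHAPE)
    x = tf.keras.layers.Conv2D(32, (1, 7), activation="relu")(inputs)
    x = tf.keras.layers.MaxPool2D((1, 4))(x)
    x = tf.keras.layers.Conv2D(64, (16, 1), activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    feat = tf.keras.layers.Dense(512, activation="relu", name="eeg_feat")(x)
    out = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(feat)
    model = tf.keras.Model(inputs, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# toy data standing in for real EEG recordings and their class labels
X = np.random.randn(100, *EEG_SHAPE).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=100)

model = build_eeg_classifier()
model.fit(X, y, epochs=1, batch_size=16, verbose=0)

# Reuse the network without its last dense layer as a feature extractor,
# then classify with PCA (512 -> 20 dims) + SVM.
extractor = tf.keras.Model(model.input, model.get_layer("eeg_feat").output)
feats = extractor.predict(X, verbose=0)                 # (N, 512)
feats_20 = PCA(n_components=20).fit_transform(feats)    # (N, 20)
svm = SVC(kernel="rbf").fit(feats_20, y)
print("train accuracy:", svm.score(feats_20, y))

The same extractor (the 512-dimensional feature layer) is the one reused by every other model in this paper.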
B. Audio event classification using audio data alone
The class labels are predicted by using the audio data alone. The first step is to extract the necessary features from the raw audio. This is done with the pre-trained VGGish model [6], which extends the well-established VGG16 architecture [7] and is trained on the large-scale AudioSet [8]. In this way, each song is reduced to a 1152-dimensional vector, which is further reduced to a 20-dimensional vector using PCA. On this basis, we train an SVM classifier for classifying audio events.

Fig. 1: Confusion matrix for audio event classification: (a) EEG only, (b) audio only, (c) audio and EEG.
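A rough sketch of the audio-only branch in Sec. II-B is given below. The TensorFlow Hub VGGish model and the toy waveforms are assumptions made for illustration; the paper only states that the pre-trained VGGish reduces each clip to a 1152-dimensional vector, which is consistent with flattening 9 frame embeddings of 128 dimensions for a roughly 10-second clip.

# Minimal sketch (assumptions flagged): VGGish frame embeddings (128-d per
# ~0.96 s frame) flattened into a clip-level 1152-d vector, then PCA + SVM.
# The Hub URL and the random waveforms are placeholders, not the authors' setup.
import numpy as np
import tensorflow_hub as hub
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

vggish = hub.load("https://tfhub.dev/google/vggish/1")  # assumed model source

def clip_embedding(waveform: np.ndarray, num_frames: int = 9) -> np.ndarray:
    """waveform: mono float32 samples at 16 kHz (VGGish's expected input)."""
    frames = vggish(waveform).numpy()        # (n_frames, 128)
    return frames[:num_frames].reshape(-1)   # (1152,)

# toy stand-ins for the 160 ten-second AudioSet clips and their 8 class labels
clips = [np.random.uniform(-1, 1, 16000 * 10).astype("float32") for _ in range(160)]
labels = np.random.randint(0, 8, size=160)
X = np.stack([clip_embedding(w) for w in clips])   # (160, 1152)

clf = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
clf.fit(X, labels)                                 # audio-only classifier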
C. Audio event classification using both audio and EEG
An EEG recording and its paired audio signal are reduced to 512 and 1152 dimensions, respectively, using the previously defined feature extractors. They are then concatenated into a 1664-dimensional vector, which is further reduced to 20 dimensions using PCA. On this basis, an SVM classifier is trained.
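A minimal sketch of this early-fusion step follows, assuming the EEG and audio feature matrices come from the extractors sketched above (random placeholders are used here).

# Minimal sketch of the fusion pipeline in Sec. II-C: concatenate the 512-d EEG
# feature and the 1152-d audio feature per pair, reduce to 20 dims with PCA,
# and train an SVM. Features and labels are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

eeg_feats = np.random.randn(720, 512)     # placeholder DenseNet EEG features
audio_feats = np.random.randn(720, 1152)  # placeholder VGGish audio features
labels = np.random.randint(0, 8, size=720)

fused = np.concatenate([eeg_feats, audio_feats], axis=1)  # (N, 1664)
clf = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
clf.fit(fused, labels)                                    # audio + EEG classifier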
D. Correlation learning between audio and EEG
We use canonical correlation analysis (CCA) [9], Deep CCA (DCCA) [10], and Category-based Deep CCA (C-DCCA) [11] to project audio and EEG features into a shared space. We expect that the information contained in the EEG data can help extract meaningful features from the audio data through canonical correlation analysis. To learn the correlation between audio and EEG, two tasks are investigated: using EEG as a query to retrieve audio, and vice versa.
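The linear CCA baseline can be sketched as below; DCCA and C-DCCA replace the linear projections with neural networks trained on a related correlation objective. The feature matrices, the component count, and the cosine-similarity retrieval rule are illustrative assumptions, not details taken from the paper.

# Minimal sketch of the linear CCA baseline in Sec. II-D: project paired EEG
# and audio features into a shared space, then rank audio clips for an EEG
# query by cosine similarity. Features are random placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

eeg_feats = np.random.randn(160, 512)      # one EEG feature per paired example
audio_feats = np.random.randn(160, 1152)   # one audio feature per paired example

cca = CCA(n_components=10, max_iter=1000)
cca.fit(eeg_feats, audio_feats)
eeg_proj, audio_proj = cca.transform(eeg_feats, audio_feats)

# retrieve audio for one EEG query: rank clips by similarity in the shared space
query = eeg_proj[0:1]
ranking = np.argsort(-cosine_similarity(query, audio_proj)[0])
print("top-5 retrieved audio indices:", ranking[:5])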
III. EVALUATION
Our audio events, selected from the Google large-scale AudioSet [8], contain 8 audio categories (Chant, Child singing, Choir, Female singing, Male singing, Rapping, Synthetic singing, and Yodeling) with 160 10-second audio segments. We conduct the EEG data collection on 9 male subjects, using EEG devices produced by OpenBCI (http://openbci.com/), where 16 channels are used to sample EEG data at a frequency of 125 Hz. Each subject listens to a category-based session with 20 audios (with a 2-second pause between audios) 5 times while his EEG signal is recorded. A total of 7200 EEG signals are acquired. We randomly split our dataset into ten folds, making sure that each category is equally represented in each fold. All test results are then averaged over the 10 cross-validation folds.
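The fold construction can be sketched as follows; features and labels are small random placeholders rather than the actual 7200 EEG recordings.

# Minimal sketch of the evaluation protocol: a stratified 10-fold split so that
# each of the 8 categories is equally represented in every fold, with test
# accuracy averaged over the folds. Data are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.standard_normal((800, 512))   # placeholder feature vectors
labels = np.repeat(np.arange(8), 100)        # 8 equally represented classes

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(features, labels):
    clf = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
    clf.fit(features[train_idx], labels[train_idx])
    accuracies.append(clf.score(features[test_idx], labels[test_idx]))
print("mean accuracy over 10 folds:", np.mean(accuracies))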
A. Comparisons among proposed learning models
Using DenseNet with a fully connected layer as a classifier, we obtain a training accuracy of 94% and a testing accuracy of 61%. When we use DenseNet as a feature extractor and a combination of PCA and SVM to perform the classification, we obtain a training accuracy of 98% and a testing accuracy of 59%. The difference between the two testing accuracies is small, showing that classification using PCA and SVM is only slightly worse than that of the dense layer used to train the network. This suggests that the results obtained by the PCA-SVM classifiers in the following experiments are reasonable compared to an optimal classifier. The confusion matrix associated with this experiment is shown in Fig. 1(a). It shows that the error is more or less equally distributed among all the classes, without any obvious error pattern.

For the classification considering only audio data, we obtain a training accuracy of 100% and a testing accuracy of 67%. The confusion matrix associated with these results is shown in Fig. 1(b). We can notice some interesting error patterns, such as the frequent confusions between female singing and child singing, or between female singing and male singing. These observations are consistent with reality, where the average frequency range of women's voices lies between those of children and men. For the classification considering both audio and EEG data, we obtain a training accuracy of 99% and a testing accuracy of 81%. This testing accuracy is much higher than that achieved by the EEG-only and audio-only methods, showing that part of the information contained in the two modalities is mutually complementary. The confusion matrix corresponding to this experiment is shown in Fig. 1(c). It combines the characteristics of the two previous experiments, although the recurrent errors are attenuated. All results are summarized in Table I.

TABLE I: Accuracies of audio event classification with different learning methods and data modalities
Data modality:   Audio only   EEG only   EEG only   Audio & EEG
Model:           PCA-SVM      DenseNet   PCA-SVM    PCA-SVM
Accuracy:        67%          61%        59%        81%
B. Results on correlation learning between audio and EEG
We use cross-modal retrieval tasks to evaluate the correlation between audio and EEG: we try to find the relevant audio for a given EEG signal, and vice versa. In the former task there is only one relevant audio, and the performance is measured by the MRR1 metric (the mean of the inverse of the rank of the relevant item), while the performance of the latter task is measured by the MAP metric (there are 45 relevant EEG data for each audio, based on which the mean average precision is computed). In the correlation analysis, we compare CCA [9], DCCA [10], and C-DCCA [11]; their results are summarized in Tables II and III. According to these results, the retrieval performance is almost independent of the number of CCA components, indicating that 10 CCA components are sufficient to achieve good performance. CCA and DCCA produce similar results. In the task of retrieving audio from EEG, 720 EEG queries correspond to 16 audios in the database, and many EEG signals (45) share the same audio. This is equivalent to classifying an EEG signal into one of the 16 audios, and MRR1 is relatively high. In contrast, given an audio as query, there are 45 relevant EEG data in the database, but not all of them are similar even in the shared canonical space. Therefore, the MAP performance is relatively low for all methods. In both tasks, C-DCCA achieves a much better performance than the other two methods by stressing the intra-class similarity.
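For clarity, here is a minimal sketch of the two metrics as described above (reciprocal rank of the single relevant audio for MRR1, and mean average precision over the 45 relevant EEG items for MAP); the similarity matrices are random placeholders standing in for similarities computed in the shared space.

# Minimal sketch of the retrieval metrics used in this section.
import numpy as np

def mrr1(sim, relevant):
    """sim: (n_queries, n_items); relevant[q] = index of the single relevant item."""
    recip_ranks = []
    for q, rel in enumerate(relevant):
        order = np.argsort(-sim[q])                      # items ranked by similarity
        recip_ranks.append(1.0 / (np.where(order == rel)[0][0] + 1))
    return float(np.mean(recip_ranks))

def mean_average_precision(sim, relevance):
    """relevance: boolean (n_queries, n_items) mask of relevant items."""
    aps = []
    for q in range(sim.shape[0]):
        order = np.argsort(-sim[q])
        hits, precisions = 0, []
        for rank, item in enumerate(order, start=1):
            if relevance[q, item]:
                hits += 1
                precisions.append(hits / rank)
        aps.append(np.mean(precisions))
    return float(np.mean(aps))

# toy example: 3 queries over 4 items, relevant items are 2, 0, 3
sim = np.random.rand(3, 4)
print("MRR1:", mrr1(sim, relevant=[2, 0, 3]))
rel_mask = np.zeros((3, 4), dtype=bool)
rel_mask[0, 2] = rel_mask[1, 0] = rel_mask[2, 3] = True
print("MAP:", mean_average_precision(sim, rel_mask))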
TABLE II: MRR1 of audio retrieval with EEG data as query

Number of components   CCA     DCCA    C-DCCA
10                     0.287   0.284   0.369
15                     0.289   0.248   0.372
20                     0.283   0.267   0.373
25                     0.283   0.251   0.370
30                     0.283   0.288   0.368
35                     0.282   0.283   0.368
40                     0.283   0.273   0.370
TABLE III: MAP of EEG retrieval with audio as query
Number of components   CCA     DCCA    C-DCCA
10                     0.112   0.115   0.182
15                     0.112   0.092   0.182
20                     0.109   0.105   0.188
25                     0.108   0.099   0.186
30                     0.111   0.129   0.184
35                     0.109   0.109   0.183
40                     0.109   0.114   0.183
IV. CONCLUSION AND FUTURE WORK
Experimental results confirm that using EEG helps to increase the precision of audio event classification. Meanwhile, the large gap between training and testing accuracy shows that we could increase the performance of all the classifiers by adding some regularization to avoid overfitting. We also notice that the detected correlation remains weak, and further experiments or data collection are necessary to show more meaningful results. In the future, we will also investigate how to leverage the power of human perception to refine audio event recommendation.
V. ACKNOWLEDGEMENT
The authors would like to thank Francisco Raposo for his help with EEG data collection and processing during his internship at NII. Many thanks go to the volunteers who helped us record EEG signals while they were listening to audio events.
REFERENCES
[1] S. Stober, D. J. Cameron, and J. A. Grahn, "Using convolutional neural networks to recognize rhythm stimuli from electroencephalography recordings," in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 1449-1457.
[2] F. Raposo, D. M. de Matos, R. Ribeiro, S. Tang, and Y. Yu, "Towards deep modeling of music semantics using EEG regularizers," CoRR, vol. abs/1712.05197, 2017. [Online]. Available: http://arxiv.org/abs/1712.05197
[3] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, "DenseNet: Implementing efficient convnet descriptor pyramids," arXiv preprint arXiv:1404.1869, 2014.
[4] I. Jolliffe, "Principal component analysis," in International Encyclopedia of Statistical Science. Springer, 2011, pp. 1094-1096.
[5] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, "Support vector machines," IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18-28, 1998.
[6] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[7] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[8] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
[9] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, no. 3/4, pp. 321-377, 1936.
[10] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep canonical correlation analysis," in International Conference on Machine Learning, 2013, pp. 1247-1255.
[11] Y. Yu, S. Tang, K. Aizawa, and A. Aizawa, "Category-based deep CCA for fine-grained venue discovery from multimodal data."