GuessTheMusic: Song Identification from Electroencephalography response
Dhananjay Sonawane, Krishna Prasad Miyapuram, Bharatesh RS, Derek J. Lomas
Dhananjay Sonawane, Krishna Prasad Miyapuram, Bharatesh RS, and Derek J. Lomas
Computer Science and Engineering, Indian Institute of Technology Gandhinagar, Gujarat - 382355, India
Centre for Cognitive and Brain Sciences, Indian Institute of Technology Gandhinagar, Gujarat - 382355, India
Industrial Design Engineering, Delft University of Technology, Netherlands
Abstract.
A music signal comprises different features such as rhythm, timbre, melody, and harmony. Its impact on the human brain has been an exciting research topic for the past several decades. Electroencephalography (EEG) enables non-invasive measurement of brain activity. Leveraging recent advancements in deep learning, we propose a novel approach for song identification from EEG responses using a Convolutional Neural Network (CNN). We recorded EEG signals from a group of 20 participants while they listened to a set of 12 song clips, each approximately 2 minutes long, presented in random order. The repeating nature of music is captured by a data slicing approach in which brain signals of 1-second duration are treated as representative of each song clip. More specifically, we predict the song corresponding to one second of EEG data given as input, rather than a complete two-minute response. We also discuss preprocessing steps to handle the large dimensionality of the dataset, and various CNN architectures. For all experiments, we considered each participant's EEG response for each song in both the training and test data, obtaining 84.96% accuracy. The observed performance supports the notion that listening to a song creates specific patterns in the brain, and that these patterns vary from person to person.
Keywords:
EEG · CNN · neural entrainment · music · frequency following response · brain signals · classification

1 Introduction

Audio is a type of time-series signal characterized by frequency and amplitude. Music signals are a particular type of audio signal that possess specific acoustic and structural features. Accordingly, one would expect that music
affects different parts of the brain compared to other audio signals. Nevertheless, how closely is the pattern of brain activity related to the perception of a periodic signal such as music? Electroencephalography (EEG) is a method to measure the electrical activity generated by the synchronized activity of neurons, and there is a plethora of published evidence linking EEG responses and music. Brattico et al. [1] showed that the brain anticipates melodic information before the onset of the stimulus and that it is processed in the secondary auditory cortex. A study on the processing of rhythms by Snyder et al. found that gamma activity in the EEG response corresponds to the beats in simple rhythms [2]. A recent study showed that it is possible to extract tempo, a critical feature of the music stimulus, from the EEG signal [3]; the authors concluded that the quality of the tempo estimation was highly dependent on the music stimulus used. The frequency of the neural response generated by entrainment to music is highly related to its beat frequency [4]. Further, a few researchers have carried out Canonical Correlation Analysis to estimate the correlation of music stimuli with EEG data [5,6].

However, work on the patterns of brain activity reflecting neural entrainment during music listening, and on their recognition, is still at an early stage. These patterns are intricate, and it is thus hard to interpret what is happening in the human brain when a person is listening to a song. Moreover, the aesthetic experience associated with music listening is highly subjective: it varies from person to person and also from time to time, depending on contextual factors such as the mood of the listener. That is why the song identification task is challenging. Previous research has focused on the relationship between a song and its brain (EEG) responses, mostly using engineered features for processing the EEG data, which depend on domain knowledge. There have been few attempts at automatic feature extraction from EEG data using neural networks for the song classification task [7].

Building on the notion of resonance between the EEG signal and music stimuli, in this paper we hypothesize the following: 1) music stimuli create identifiable patterns in the EEG response; 2) for a given song, these patterns vary from person to person. We pose these hypotheses as a song identification task using a deep learning architecture. To study the first hypothesis, we split each participant's EEG response for each song into training and test datasets. We explored how large the training data should be and the effect of training data size on the performance of the model. For a given participant, the model learns the song pattern present in the EEG response from the training data and tries to predict the song ID for the test data. For the second hypothesis, we exclude some participants entirely from the training dataset. During data preprocessing, the raw EEG response is divided into segments, where each chunk corresponds to a 1-second-long EEG response. Our model predicts the song ID for each such chunk in the test data. Each chunk is represented as a 2D matrix, which we call a "song image". Data preprocessing is discussed in more detail in Section 3. This formulation allows us to use 2D and 3D convolutional neural networks, which are usually used in the computer vision and image processing fields.
The features extracted from the song image by the CNN are fed to a multilayer perceptron network for the classification task. Our results outperform the state-of-the-art accuracy.

The remaining part of the paper is organized as follows: Section 2 describes prior work on the song classification problem using EEG data and its results. Section 3 reports our methods, including data collection, preprocessing steps, and the CNN architectures used. In Section 4, we discuss the performance of our model, and in Section 5, we draw conclusions on the cognitive process behind music perception and suggest possible future work.
2 Related Work

It has been shown that human mental states can be unraveled from non-invasive measurements of brain activity such as functional MRI and EEG [8]. Several researchers have documented the frequency following response (FFR), the potential induced in the brain while listening to a periodic or nearly periodic audio signal [9]. A successful attempt has also been made to reconstruct perceptual input from EEG responses. In [10], a person looks at an image while brain activity is captured by EEG in real time; the EEG signals are then analyzed and transformed into an image semantically similar to the one the person is looking at. The authors modeled this Brain2Image system using variational autoencoders (VAE) and generative adversarial networks (GAN). The work in [11] relates music and the corresponding brain activity using a statistical framework. The authors study the classification of musical content from individual EEG responses by targeting three tasks: stimulus-specific classification; group classification, i.e., songs recorded with lyrics, songs recorded without lyrics, and instrumental pieces; and meter classification, i.e., 3/4 vs. 4/4 meter. They used the OpenMIIR dataset [12], which includes response data of 10 subjects who listened to 12 music fragments, with durations ranging from 7 s to 16 s, taken from popular musical pieces. They proposed a Hidden Markov Model with a probabilistic computation method; the model was trained on 9 subjects and tested on the 10th subject. They achieved 42.7%, 49.6%, and 68.7% classification rates for the three tasks, respectively. Foster et al. investigated the correlation between the EEG response and music features extracted with the librosa library in Python [13]. Using representational similarity analysis, they report a correlation coefficient of the EEG data with normalized tempogram features of 0.63 and with MFCCs of 0.62. They also address song identification from EEG data and obtained 28.7% accuracy using a logistic regression model. Our study stands out in terms of methodology, as we exploit the power of a deep learning architecture for automatic feature extraction from the EEG response.

Yi Yu et al. used a convolutional neural network called DenseNet [16] for audio event classification [7]. The EEG responses were collected from 9 male participants. The audio stimuli were 10 seconds long and spanned 8 different categories (chant, child singing, choir, female singing, male singing, rapping, synthetic singing, and yodeling). They achieved 69% accuracy using EEG data only; however, the best result was 81%, where they used audio features extracted from another convolutional network, VGG16 [15], along with the EEG response. Sebastian Stober et al. aimed to classify 24 music stimuli [14]. Each music segment comprised 2 unique rhythms played at different pitches. Due to the small amount of data, they processed and classified each EEG channel individually. The CNN was trained on 84% of the complete response (approximately 27 seconds out of 32 seconds), validated on 8% (approximately 2.5 seconds), and tested on 8% (approximately 2.5 seconds). They report 24.4% accuracy. In this study, we deal with more complex data: our music segments comprise diverse tones, rhythms, and pitches, and some of them also include vocals, which makes the song identification task more challenging. The proposed architecture is also much simpler than DenseNet [16] and VGG16 [15].
3 Methods

The aim of this work is to create an approach to classify the EEG response corresponding to the respective song event. The complete experiment can be described in 4 phases: data collection, preprocessing, CNN architecture and model development, and model testing.
Participants were made to sit in a dimly lit room. We then collected demographic information such as age, gender, and handedness. Brief information regarding the EEG collection setup, the time the experiment would take, and the responses they had to make was discussed with all participants. We then measured the circumference of each participant's head to select a suitable EEG cap. The 128-channel high-density Geodesic electrode net cap (HydroCel Geodesic Sensor Net platform, Electrical Geodesics Inc., USA, now Philips) was chosen according to the head-size measurement. The cap was immersed in a KCl electrolyte solution prepared in 1 litre of pure distilled water. The reference electrode position is measured as the intersection of the line between the nasion (the point between the eyebrows) and the inion (the midpoint at the back of the skull) with the line between the preauricular points on both sides, and then it is marked. The other electrodes are placed according to the International 10-20 system, as depicted in Fig. 1.

Fig. 1: Illustration of the 128-channel system and electrode positions

After this setup, participants were asked to close their eyes on a single beep tone. This was followed by 10 seconds of silence, after which the song stimulus was presented. At the end of each stimulus, a double beep tone was sounded, at which the participants were instructed to open their eyes and make a response. They were asked two questions:
– How familiar are you with the song? The participants were asked to rate it in the range of 1 to 5 (where one indicated strongly familiar and five indicated strongly unfamiliar).
– How much did you enjoy the song? The participants were asked to rate it in the range of 1 to 5 (where one indicated extremely enjoyable and five indicated incredibly dull).
These responses were collected within the 10-second silence window before presenting the next song. Since the maximum length of a song is 132 seconds, all other responses were zero-padded accordingly. Therefore, all song responses are 142 seconds long after considering the above window. In total, data from 20 participants was collected on 12 music stimuli. The songs used in the experiment are listed in Table 1.

Table 1: Songs used in EEG data collection
Song ID | Song Name | Artist | Song Length (in seconds)
The songs contain some tonal and vocal excerpts. The sampling rate for 11 participants was 1000 Hz, and for the remaining 9 participants it was 250 Hz. 16 participants were male and 4 were female. All of them were right-handed, with an average age of 25.3 years and a standard deviation of 3.38. All music stimuli were presented to the subjects in random order.
EEG is highly sensitive to noise. It also captures eye blinks and high-frequency muscle movements. Therefore, it is necessary to clean the EEG data before using it for any application. The EEGLAB toolbox was used to implement the majority of the preprocessing steps. Once the channel locations were provided, we performed average re-referencing with respect to channel number 129. The raw EEG signal was then loaded as epochs for each presented song, creating 12 epochs per participant. By this, we eliminated the signals that do not lie in our area of interest. We then used independent component analysis to remove artifact data, using the 'runica' algorithm in MATLAB together with the 'ADJUST' toolbox for artifact removal. For simplicity, positive infinities, negative infinities, and NaN (not a number) values were replaced by zero. However, taking the average value of the surrounding electrodes for these outliers would be a better approach and may improve performance. The above steps create the final ready-to-use data.

For deep learning models, we would need extensive data with reasonable feature dimensions to detect a pattern. In our task, we had the opposite situation. Our data contains 240 EEG responses corresponding to 20 participants and 12 song stimuli. However, the number of samples collected by one electrode for one song of one participant is greater than 27,000 at the 250 Hz sampling rate, and goes beyond 100,000 at the 1000 Hz sampling rate. Thus, our data consists of a few examples with very high dimensionality.
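As a minimal sketch of the outlier handling described above, assuming each cleaned epoch has been exported from EEGLAB as a channels x samples NumPy array (the function name and array layout are our own):

```python
import numpy as np

def clean_epoch(epoch):
    """Replace non-finite samples (NaN, +inf, -inf) with zero, as done in our
    preprocessing; `epoch` is a channels x samples array exported from EEGLAB.
    Averaging the surrounding electrodes instead, as suggested above, would be
    a possible refinement."""
    return np.nan_to_num(epoch, nan=0.0, posinf=0.0, neginf=0.0)
```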
Fig. 2: Song images of the 26th second for participant ID 1902 in the time domain ((a) Song ID 6, (b) Song ID 7)

A data augmentation technique was used to increase the amount of data. We split the EEG response of each participant for each song into chunks of 1-second-long windows, giving us 2D matrices of dimension 128 (electrodes) x 250 or 1000 (samples per second). We call them "song images" and label them with the song ID of the corresponding song. This formulation not only increases the number of examples in the original dataset but also allows us to use 2D and 3D convolutional networks. Fig. 2 shows the song images of the 26th second of two different songs for one of the participants in the time domain. However, the window size is a design parameter, and it is difficult to decide the optimal window size for obtaining song images. A larger window size increases the dimensionality of a song image and decreases the total number of song examples, thereby defeating the purpose of the data augmentation; high-dimensional input to a Convolutional Neural Network (CNN) also drastically increases the number of trainable parameters, provided the rest of the architecture remains the same. A smaller window size carries less information about the EEG signal, and the CNN may perform poorly. Moreover, as the sampling rate changes, the 1-second time window carries a different number of samples in the song image, causing inconsistency in the input data to the CNN. We needed more concrete preprocessing steps so that a smaller window size would increase the number of examples in the dataset, but not at the cost of performance. Fig. 3 illustrates the Fast Fourier Transform (FFT) of all 12 song responses for one participant. It is worth noting that, in all the FFTs, the maximum frequency component is less than 100 Hz. This is expected, as EEG is characterized by the frequency bands 0-4 Hz (delta), 4-8 Hz (theta), 8-15 Hz (alpha), 15-32 Hz (beta), and frequencies higher than 32 Hz (gamma), and exhibits high power in the low-frequency ranges.

Fig. 3: FFT of the EEG response of participant ID 1902 for all 12 songs
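The slicing step described above can be sketched as follows; this is only an illustration under our own naming, not the exact code used in the study:

```python
import numpy as np

def make_song_images(epoch, fs, song_id, window_sec=1):
    """Slice one epoch (channels x samples) into non-overlapping windows of
    `window_sec` seconds; each window becomes one "song image" labelled with
    the song ID of the epoch (128 x 250 at 250 Hz, 128 x 1000 at 1000 Hz)."""
    win = int(fs * window_sec)
    n_windows = epoch.shape[1] // win
    images = [epoch[:, w * win:(w + 1) * win] for w in range(n_windows)]
    labels = [song_id] * n_windows
    return np.stack(images), np.array(labels)
```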
Using the spectopo function in EEGLAB, we converted the time-domain EEG data into the frequency domain. Spectopo calculates the amplitude of the frequency components present in each 1-second window of the EEG response. The maximum frequency component in the frequency-domain representation of the data is chosen as per Nyquist's criterion: it is 125 Hz and 500 Hz when the sampling rate is 250 Hz and 1000 Hz, respectively. Regardless of the sampling frequency of the EEG, we can therefore safely choose 125 Hz as the maximum frequency component. Fig. 4 explains the dimensionality and the conversion of time-domain data to frequency-domain data. Frequency-domain data helps with dimensionality reduction and makes the input dataset consistent and compact. However, the time window over which the FFT is calculated is again a design parameter. The effect of the time window on the performance of the CNN in both the time and frequency domains is addressed in the next section.

Fig. 4: Time-domain to frequency-domain conversion for one participant's data using spectopo
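The spectopo computation itself runs inside EEGLAB; a rough NumPy equivalent for one 1-second time-domain song image (an amplitude spectrum rather than EEGLAB's PSD estimate, with our own naming) could look like this:

```python
import numpy as np

def song_image_fft(window, fs, f_max=125):
    """Convert a time-domain song image (channels x samples, 1-second window)
    into a frequency-domain image. With a 1-second window the FFT bins are
    1 Hz apart, so keeping 0..f_max Hz yields f_max + 1 columns
    (128 x 126 for f_max = 125), independent of the sampling rate."""
    spec = np.abs(np.fft.rfft(window, axis=1))             # per-channel amplitude spectrum
    freqs = np.fft.rfftfreq(window.shape[1], d=1.0 / fs)   # bin frequencies in Hz
    return spec[:, freqs <= f_max]
```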
Fig. 5: Song images of the 26th second for participant ID 1902 in the frequency domain ((a) Song ID 6, (b) Song ID 7)

In a frequency-domain song image, each row denotes an electrode while each column denotes a frequency. Fig. 5 shows the song images of the 26th second of the two different songs for participant 1902 in the frequency domain.

We follow standard practice in machine learning to develop the model. The test-train split is 0.3, with a random selection of song images in the training (70%) and testing (30%) data. The validation split parameter is set to 0.2.

A Convolutional Neural Network is the core part of our model because it learns the underlying pattern in the song image. It includes many hyperparameters that need to be set carefully. We apply the CNN to both the time-domain and frequency-domain song image datasets. The CNN architecture remains the same except for the input layer, where the shape of the input song image changes as per the domain and time window.

Fig. 6: 2D CNN architecture followed by a dense network for song classification

To study our first hypothesis, that music creates an identifiable pattern in the brain's EEG signals, we consider all participants' data in the training data. An immediate problem is how much of one participant's response should be included in the training data to predict the song ID. To answer this question, we vary the train-test split from 20% to 95%. An x% train-test split means x% of the data is chosen as test data while (100 - x)% is treated as training data. For each split, we randomly select the test samples. We created a 3-layer CNN network for feature extraction and a 2-layer dense neural network for song classification. Except for the last output layer, each convolution layer as well as the dense layer has the ReLU [17] activation function, which brings non-linearity into the architecture and helps to detect complex patterns. Since we are doing multiclass classification, the output layer has a softmax activation function. The loss function used is categorical cross-entropy, given by

CrossEntropy = - \sum_{c=1}^{C} y_{o,c} \log(p_{o,c}) ,   (1)

where C is the total number of classes (12 in our case), y_{o,c} is a binary indicator of whether class label c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o is of class c.

We used the Adam optimizer to minimize the categorical cross-entropy loss. The kernel size is 3x3, and we use 16 such filters at each convolution layer. Two max-pooling layers have been added after convolution layers 2 and 3; their exclusion almost doubles the total number of trainable parameters, thereby increasing the network complexity and training time. Fig. 6 shows the 2D CNN architecture. Removing one or more layers from the architecture mentioned above resulted in an underdetermined system that failed to learn all the patterns, while adding extra layers led to overfitting and thus reduced the performance.

We have analyzed the effect of the time window on the performance of the CNN in the time domain. We created 3 separate datasets with the time window set to 1 second, 2 seconds, and 3 seconds, having song image shapes 128x250, 128x500, and 128x750, respectively. We did not increase the time window beyond 3 seconds, as the number of examples in the dataset reduced significantly. The 9 participants with an EEG sampling frequency of 250 Hz were chosen for the above datasets. Similar steps were applied to the participants whose EEG sampling rate was 1000 Hz; to maintain the symmetry, we randomly chose 9 participants out of 11. In the frequency domain, we did not increase the time window over which the FFT is calculated, because the 1-second window already gave reasonable results.
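A sketch of this architecture in Keras is given below. The paper specifies three 3x3 convolution layers with 16 filters each, max-pooling after convolution layers 2 and 3, a 2-layer dense classifier with a softmax output, ReLU activations, the Adam optimizer, and categorical cross-entropy; the hidden dense width (64), the modern tensorflow.keras import path, and the default frequency-domain input shape are our assumptions.

```python
from tensorflow.keras import layers, models

def build_2d_cnn(input_shape=(128, 126, 1), n_classes=12):
    """3-layer 2D CNN feature extractor followed by a 2-layer dense classifier."""
    model = models.Sequential([
        layers.Conv2D(16, (3, 3), activation='relu', input_shape=input_shape),
        layers.Conv2D(16, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),                      # pooling after conv layer 2
        layers.Conv2D(16, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),                      # pooling after conv layer 3
        layers.Flatten(),
        layers.Dense(64, activation='relu'),              # hidden width is our assumption
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Training settings stated in the paper (one-hot labels assumed):
# model = build_2d_cnn()
# model.fit(x_train, y_train, epochs=30, batch_size=16, validation_split=0.2)
```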
We have also analyzed the effect of a higher sampling frequency. For this, we chose the 11 participants for whom the sampling rate was 1000 Hz. This results in a maximum frequency component of 500 Hz, so the song image shape changes to 128x501. The previous model was retrained on this new data. We could have compared this result with the earlier model, where the song image was 128x126, but the earlier data had 20 participants. For a fair comparison, we chose the same 11 participants, discarded all frequencies from 127 Hz to 501 Hz, and trained the model on this data as well.

For the investigation of our second hypothesis, that music creates a different pattern in different persons, we used 5 randomly chosen participants' responses as the test dataset. The training data included the remaining 15 participants' responses. To improve the result for this task, we developed a 3D CNN model. All the parameters remain the same as in the previous model, except that the kernel changed to 3x3x3 and the max-pooling layer to 2x2x2. We stacked 10 consecutive song images and fed them to the first layer of the 3D CNN as input. The choice of 10 was made by considering the trade-off between the number of samples in the data and the dimensionality of each 3D input sample.

We used 30 epochs for training the CNN. The network architecture we propose is implemented in the Keras framework, and the weights are initialized with random numbers by the Keras framework [18] itself. The GPU used for this experiment, an NVIDIA GTX 1050, has 4 GB of RAM. The batch size was kept at 16 for all experiments because of memory constraints.
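The 3D variant can be sketched in the same way. Stacking 10 consecutive song images, the 3x3x3 kernels, and the 2x2x2 max-pooling follow the description above; the 'same' padding (needed in this sketch so that the depth of 10 survives two poolings) and the dense width are our assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

def stack_song_images(images, depth=10):
    """Group consecutive 1-second song images (N x 128 x 126) into blocks of
    `depth` seconds for the 3D CNN; incomplete trailing blocks are dropped.
    Add a trailing channel axis (stacked[..., None]) before feeding the model."""
    n = (len(images) // depth) * depth
    return images[:n].reshape(-1, depth, *images.shape[1:])

def build_3d_cnn(input_shape=(10, 128, 126, 1), n_classes=12):
    """Same layout as the 2D model, with 3x3x3 kernels and 2x2x2 max-pooling."""
    model = models.Sequential([
        layers.Conv3D(16, (3, 3, 3), activation='relu', padding='same',
                      input_shape=input_shape),
        layers.Conv3D(16, (3, 3, 3), activation='relu', padding='same'),
        layers.MaxPooling3D((2, 2, 2)),
        layers.Conv3D(16, (3, 3, 3), activation='relu', padding='same'),
        layers.MaxPooling3D((2, 2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```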
4 Results and Discussion

When each participant's response is included in both the train and test data, our model achieves an outstanding training accuracy of 90.90% and a test accuracy of 84.96% in the frequency domain (model 5). The confusion matrix is shown in Fig. 8a; it shows that the test data is well spread across all 12 classes and that almost all samples are correctly classified. Fig. 7b shows the accuracy vs. epoch curve for this model. Table 2 summarizes the accuracy of all the CNN models.

Table 2: Performance of the CNN models
CNN Model | Domain | Song image shape | Total number of song images | CNN trainable parameters | Train accuracy (%) | Test accuracy (%)

The model trained on the time-domain dataset hardly learnt anything for the song identification task; changing the time window and the sampling frequency did not change the performance of the CNN in the time domain. However, the same CNN architecture obtained high accuracy when trained on the frequency-domain dataset. The performance of the CNN in the frequency domain could be due either to learning the temporal pattern in the EEG or to learning from other participants' responses. To examine the latter cause, we retrained the same CNN model, this time excluding 5 participants entirely from the training data. We got 86.95% training accuracy, but the model reported only 7.73% test accuracy (model 10). We extended this experiment by training the 3D CNN model and observed 9.44% test accuracy for the cross-participant data (model 11). This shows that the CNN depends on the temporal features in each participant's response for the song prediction. It also suggests that the EEG pattern generated by music entrainment differs from person to person for the same song.
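For reference, the cross-participant evaluation above amounts to splitting by participant ID rather than by individual song image; a minimal sketch (the function name and participant IDs are hypothetical):

```python
import numpy as np

def split_by_participant(images, labels, participant_ids, held_out):
    """Leave-participants-out split: song images from `held_out` participants
    form the test set, everything else is used for training."""
    test = np.isin(participant_ids, held_out)
    return images[~test], labels[~test], images[test], labels[test]

# e.g. hold out 5 of the 20 participants entirely (IDs here are made up):
# x_tr, y_tr, x_te, y_te = split_by_participant(X, y, pids,
#                                               held_out=[1901, 1905, 1910, 1912, 1918])
```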
Fig. 7: Accuracy plots. (a) Change in the test accuracy for different train-test split values. (b) Training and validation curves.

We also studied the high-frequency signals generated in the brain due to music entrainment. For this, we chose the participants whose data was collected at 1000 Hz, which ensures a high maximum frequency component (up to 500 Hz) in the song image. All participants' responses were included in both the train and test data. Two models of the same architecture were developed: one trained on data containing all frequencies up to 500 Hz, and the other trained on data containing only the first 126 frequencies out of 500. Both performed almost equally well, giving 80.99% and 76.19% accuracy, respectively (models 8 and 9). This indicates that higher EEG frequencies do not contribute much to the pattern generated while listening to music.

To investigate how much of each participant's data should be included in the training data to predict the song ID on the test data, we varied the train-test split from 20% to 95%. Fig. 7a shows the accuracy plot for different train-test split values. We obtained a remarkable test accuracy of 78.12% by training the CNN model on only 20% of the total data (train-test split = 0.8). In other words, by learning from approximately 17 seconds of EEG response to a 120-second music stimulus, we were able to predict the song ID for the remaining 103 seconds with 78% correct prediction probability. Even at a 0.95 train-test split, the accuracy of 22.12% is much better than a random guess, which is 8.33% for this 12-class classification problem. Fig. 8a, 8b, and 8c show the confusion matrices for the 0.3, 0.5, and 0.95 train-test split ratios, respectively.

Fig. 8: Confusion matrices for the 0.3, 0.5, and 0.95 train-test splits

We have also visualized the intermediate CNN outputs. Fig. 9 shows the output of the 3rd convolution layer of model 7. For the same song (ID 9), participants 1901 and 1905 have learnt different features; filters 1, 7, 14, and 15 show different patterns in Fig. 9a and 9b. This supports our second hypothesis that the EEG patterns vary from person to person for a given song; the same contrast is visible in Fig. 9c and 9d.
Fig. 9: Convolution layer 3 outputs. (a) Participant ID 1901, Song ID 9; (b) Participant ID 1905, Song ID 9; (c) Participant ID 1901, Song ID 3; (d) Participant ID 1905, Song ID 3.
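Such intermediate activations can be extracted in Keras by building a sub-model that outputs a chosen layer; the sketch below assumes the 2D model sketched earlier (where the third convolution layer sits at index 3) and uses our own naming:

```python
import matplotlib.pyplot as plt
from tensorflow.keras import models

def show_conv_filters(model, song_image, layer_index=3, n_filters=16):
    """Plot each filter's activation at `layer_index` for one song image."""
    activation_model = models.Model(inputs=model.input,
                                    outputs=model.layers[layer_index].output)
    acts = activation_model.predict(song_image[None, ..., None])  # add batch and channel dims
    for i in range(n_filters):
        plt.subplot(4, 4, i + 1)
        plt.imshow(acts[0, :, :, i], aspect='auto')
        plt.axis('off')
    plt.show()
```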
5 Conclusion

In this paper, we proposed an approach to identify a song from the brain activity recorded while a person is listening to it. We worked on our own dataset, collected from 20 participants listening to 12 two-minute songs with diverse tones, pitches, rhythms, and vocals. In particular, we were able to classify songs from only 1-second-long EEG responses in the frequency domain, whereas the CNN model failed in the time domain. We developed a simple yet efficient 3-layer deep learning model in the Keras framework. The results show that identifiable patterns are generated in the brain during music entrainment; we were able to detect them when each participant's EEG response was included in both the train and test data. Our model performed poorly when some participants were completely excluded from the training data, which suggests that different patterns are created when different persons listen to the same song. A possible reason is that people focus on different tones and vocals during music entrainment, thereby reducing performance on the cross-participant song identification task. As future work, we aim to acquire more data and explore other preprocessing methods and CNN architectures to improve the accuracy for cross-participant data. Nevertheless, the results achieved in this paper are encouraging and provide an essential step towards the ambitious goal of mind reading.
Author contributions
DS developed the CNN model presented in this paper. KPM and DL were involved in the experimental design and the discussion of results. BRS was involved in data collection and preprocessing of the EEG data. We would like to thank Ms. Esha Sharma for her contribution towards data collection. Data are available on reasonable request to KPM.
References
1. Brattico, E., Tervaniemi, M., Näätänen, R., Peretz, I.: Musical scale properties are automatically processed in the human auditory cortex. Brain Research 1117, 162-174 (2006). https://doi.org/10.1016/j.brainres.2006.08.023
2. Snyder, J., Large, E.: Gamma-band activity reflects the metric structure of rhythmic tone sequences. Cognitive Brain Research 24, 117-126 (2005). https://doi.org/10.1016/j.cogbrainres.2004.12.014
3. Stober, S., Prätzlich, T., Müller, M.: Brain Beats: Tempo extraction from EEG data. In: ISMIR, pp. 276-282 (2016)
4. Nozaradan, S.: Exploring how musical rhythm entrains brain activity with electroencephalogram frequency-tagging. Philosophical Transactions of the Royal Society B: Biological Sciences 369, 20130393 (2014). https://doi.org/10.1098/rstb.2013.0393
5. Gang, N., Kaneshiro, B., Berger, J., Dmochowski, J.P.: Decoding neurally relevant musical features using canonical correlation analysis. In: ISMIR, pp. 131-138 (2017)
6. Sanyal, S., Nag, S., Banerjee, A., Sengupta, R., Ghosh, D.: Music of brain and music on brain: a novel EEG sonification approach. Cognitive Neurodynamics 13, 13-31 (2019). https://doi.org/10.1007/s11571-018-9502-4
7. Yu, Y., Beuret, S., Zeng, D., Oyama, K.: Deep learning of human perception in audio event classification. In: 2018 IEEE International Symposium on Multimedia (ISM) (2018). https://doi.org/10.1109/ISM.2018.00-11
8. Haynes, J.-D., Rees, G.: Decoding mental states from brain activity in humans. Nature Reviews Neuroscience 7, 523-534 (2006)
9. Bidelman, G., Powers, L.: Response properties of the human frequency-following response (FFR) to speech and non-speech sounds: level dependence, adaptation and phase-locking limits. International Journal of Audiology 57, 665-672 (2018). https://doi.org/10.1080/14992027.2018.1470338
10. Kavasidis, I., Palazzo, S., Spampinato, C., Giordano, D., Shah, M.: Brain2Image: Converting brain signals into images. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1809-1817 (2017)
11. Ntalampiras, S., Potamitis, I.: A statistical inference framework for understanding music-related brain activity. IEEE Journal of Selected Topics in Signal Processing 13, 275-284 (2019). https://doi.org/10.1109/JSTSP.2019.2905431
12. Stober, S., Sternin, A., Owen, A.M., Grahn, J.A.: Towards music imagery information retrieval: Introducing the OpenMIIR dataset of EEG recordings from music perception and imagination. In: ISMIR, pp. 763-769 (2015)
13. Foster, C., Dharmaretnam, D., Xu, H., Fyshe, A., Tzanetakis, G.: Decoding music in the human brain using EEG data. In: 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pp. 1-6 (2018). https://doi.org/10.1109/MMSP.2018.8547051
14. Stober, S., Cameron, D.J., Grahn, J.A.: Using convolutional neural networks to recognize rhythm stimuli from electroencephalography recordings. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 1449-1457. Curran Associates, Inc. (2014)
15. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
16. Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., Keutzer, K.: DenseNet: Implementing efficient ConvNet descriptor pyramids. arXiv preprint arXiv:1404.1869 (2014)
17. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807-814 (2010)
18. Chollet, F., et al.: Keras, https://github.com/fchollet/keras