Decoding Imagined Speech using Wavelet Features and Deep Neural Networks
Jerrin Thomas Panachakel
Indian Institute of Science
Bangalore, India
[email protected]

A.G. Ramakrishnan
Indian Institute of Science
Bangalore, India
[email protected]

T.V. Ananthapadmanabha
Voice and Speech Systems
Bangalore, India
[email protected]
Abstract—This paper proposes a novel approach that uses deep neural networks for classifying imagined speech, significantly increasing the classification accuracy. The proposed approach employs only the EEG channels over specific areas of the brain for classification, and derives distinct feature vectors from each of those channels. This gives us more data to train a classifier, enabling us to use deep learning approaches. Wavelet and temporal domain features are extracted from each channel. The final class label of each test trial is obtained by applying majority voting on the classification results of the individual channels considered in the trial. This approach is used for classifying all 11 prompts in the KaraOne dataset of imagined speech. The proposed architecture and the approach of treating the data have resulted in an average classification accuracy of 57.15%, which is an improvement of around 35% over the state-of-the-art results.
Index Terms—imagined speech, brain-computer interface, deep neural network, common spatial pattern, EEG
I. INTRODUCTION
Speech is one of the most basic and natural forms of human communication. However, nearly 70 million people around the world have speech disabilities. Speech disability due to complete paralysis prevents people from communicating with others through any modality. It would greatly help such people if, by some means, we were able to decode their thoughts, commonly referred to as "imagined speech" [1].

The interest in imagined speech dates back to the days of Hans Berger, who invented the electroencephalogram (EEG) as a tool for synthetic telepathy [2]. Although it is almost a century since the first EEG recording, the success in decoding imagined speech from EEG signals is rather limited. One of the major reasons is the very low signal-to-noise ratio (SNR) of EEG signals.

The potential of recent developments in the field of machine learning, such as deep neural networks (DNN), has not been fully exploited in the field of decoding imagined speech, since such techniques require a huge amount of training data. In this paper, we select 11 EEG channels that cover the cortical areas involved in speech imagery. For each imagined word, each of the EEG channels so selected is considered as an independent input signal, thus providing more training data. This is in contrast to the earlier approaches, which concatenate the features from all the channels to form a single feature vector.

Our new approach has been validated on the KaraOne dataset [3], and we have obtained accuracy values better than the state-of-the-art results reported in the literature for the same dataset.

The rest of the paper is organized as follows: Section II describes prior work in the literature in the field of decoding imagined speech. Section III describes the dataset and the procedure for generating the feature vectors. Section IV describes the proposed DNN classifier in some detail. The results obtained are given in Section V.

II. RELATED WORK IN THE LITERATURE
This section briefly describes the work in the field of decoding imagined speech reported over the last decade.

C. S. DaSalla et al. developed a brain-computer interface (BCI) system based on vowel imagery [4] in 2009. The objective was to discriminate between the imagined vowels /a/ and /u/. The experimental paradigm consisted of three parts:
1) Imagined mouth opening and imagined vocalisation of the vowel /a/.
2) Imagined lip rounding and imagined vocalisation of the vowel /u/.
3) Control state with no action.
Using common spatial pattern (CSP) generated spatial filter vectors as features and a nonlinear support vector machine (SVM) as the classifier, they achieved accuracies in the range of 56% to 72% for different subjects. As noted by Brigham et al. [1], the relatively high accuracy obtained might have arisen because of the additional involvement of motor imagery.

Following a similar approach, Wang Li et al. in 2013 developed a system to distinguish between two monosyllabic Chinese characters meaning "left" and "one" [5]. A visual cue was provided to instruct the subject on the character to be imagined. When the cue disappeared, the subject had to repeatedly imagine the character in his/her mind as many times as possible for a duration of 4 s. They obtained an accuracy of around 67%.

In 2010, Brigham et al. came up with an algorithm based on autoregressive (AR) coefficients and the k-nearest neighbor (k-NN) algorithm for classifying two imagined syllables, /ba/ and /ku/ [1]. In this experiment, the subjects were given an auditory cue on the syllable to be imagined, followed by a series of click sounds. After the last click, the subjects were instructed to imagine the syllable once every 1.5 s for a period of 6 s. They reported an accuracy of around 61%.

In 2016, Min et al. used statistical features such as mean, variance, standard deviation, and skewness for pairwise classification of vowels (/a/, /e/, /i/, /o/, and /u/) using an extreme learning machine (ELM) with radial basis function. In their experimental paradigm, an auditory cue was provided at the beginning of the trial to inform the subject which vowel was to be imagined. After the auditory cue, two beeps were played, after which the subject had to imagine the vowel heard at the beginning of the trial. An average accuracy of about 72% was reported.

In 2017, Nguyen, Karavas and Artemiadis [6] came up with an approach based on Riemannian manifold features for classifying four different sets of prompts:
1) Vowels (/a/, /i/ and /u/).
2) Short words ("in" and "out").
3) Long words ("cooperate" and "independent").
4) Short-long words ("in" and "cooperate").
The accuracies reported for the four sets of prompts are 49.2%, 50.1%, 66.2% and 80.1%, respectively. This dataset is one among the few imagined speech datasets available in the public domain and is referred to as the "ASU dataset".

Balaji et al. in 2017 investigated the use of bilingual imagined speech, namely English "Yes" & "No" and Hindi "Haan" (meaning "yes") & "Na" (meaning "no"), for an imagined speech based BCI system [7]. Principal component analysis (PCA) was used for data reduction and an artificial neural network (ANN) was used as the classifier. Two specific sets of EEG channels, corresponding to language comprehension and decision making, were utilized.
An interesting part of the experimental protocol was that there was no auditory or visual cue; the subjects were instructed to imagine the answer to a binary question posed either in English or Hindi. The study reported an accuracy of 75.4% for the combined English-Hindi task and a surprisingly high accuracy of 85.2% for classifying the decision.

In 2017, Sereshkeh et al. came up with an algorithm based on features extracted using the discrete wavelet transform (DWT) and regularized neural networks for classifying the imagined decisions of "yes" and "no" [8], similar to the work by Balaji et al. They reported an accuracy of about 67%.

In 2018, Cooney et al. [9] used Mel frequency cepstral coefficients (MFCC) as features and an SVM as the classifier to classify all 11 prompts in the KaraOne dataset [3]. The prompts consisted of seven phonemic/syllabic prompts (/iy/, /uw/, /piy/, /tiy/, /diy/, /m/, /n/) and four words ("pat", "pot", "knew" and "gnaw"). A maximum accuracy of only 33.33% was achieved. The lower accuracy might have arisen because of the larger number of choices, unlike the binary choice employed in the previous works.

In a recent work [10], Jerrin et al. used deep neural networks (DNN) for the first time to classify imagined speech. The specific task was to classify the imagined words "in" and "cooperate". The features used were based on the discrete wavelet transform, and a DNN with three hidden layers was employed. The highest accuracy reported was around 86%.

III. DATASET USED FOR THE STUDY AND METHODS
A. The KaraOne Dataset
The KaraOne dataset [3] has been used for our study. It consists of EEG data captured during the imagination and articulation of 11 prompts, which include 7 phonemic/syllabic prompts (iy, uw, piy, tiy, diy, m, n) and 4 words derived from Kent's list of phonetically similar pairs (pat, pot, knew, and gnaw). The data was captured at a 1 kHz sampling rate using a SynAmps RT amplifier. The electrode placement was based on the 10/10 system [11]. Each data recording trial had four stages:
1) A 5-second rest state.
2) A stimulus state, in which an auditory and a visual cue were provided to the participant.
3) A 5-second imagined speech state.
4) An articulation state.
We followed the same preprocessing steps as in [3], which included ocular artifact removal using blind source separation [12], band-pass filtering from 1 to 50 Hz, and Laplacian spatial filtering. A minimal sketch of such a pipeline is given below.
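The sketch below uses MNE-Python, which is our tooling choice; the file name is hypothetical, and ICA stands in for the BSS-based ocular artifact removal of [12]. It is an illustration of the pipeline described above, not the authors' code.

import mne
from mne.preprocessing import ICA, compute_current_source_density

raw = mne.io.read_raw_cnt("karaone_subject.cnt", preload=True)  # hypothetical file
raw.set_montage("standard_1005")        # 10/10 electrode positions [11]

raw.filter(l_freq=1.0, h_freq=50.0)     # band-pass from 1 to 50 Hz

# Ocular artifact removal via ICA (assumes an EOG channel was recorded).
ica = ICA(n_components=20, random_state=0)
ica.fit(raw)
eog_inds, _ = ica.find_bads_eog(raw)
ica.exclude = eog_inds
raw = ica.apply(raw)

# Surface Laplacian (current source density) as the spatial filter.
raw = compute_current_source_density(raw)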
B. Wavelet Feature Extraction
In our work, instead of concatenating the features obtained from several channels, each channel is treated as a distinct input. This is possible because of the high correlation present between the signals of various channels [13]. Only the following 11 channels have been chosen for our work, based on the involvement of the underlying brain regions in the production of speech [14], [15]:
1) 'C4': postcentral gyrus
2) 'FC3': premotor cortex
3) 'FC1': premotor cortex
4) 'F5': inferior frontal gyrus, Broca's area
5) 'C3': postcentral gyrus
6) 'F7': Broca's area
7) 'FT7': inferior temporal gyrus
8) 'CZ': postcentral gyrus
9) 'P3': superior parietal lobule
10) 'T7': middle temporal gyrus, secondary auditory cortex
11) 'C5': Wernicke's area, primary auditory cortex
This choice of channels is also backed by the common spatial patterns (CSP) analysis of imagined speech vs. rest state EEG data [6].

Since each EEG channel is considered as an independent data vector, algorithms that extract a single feature vector from the entire set of EEG channels (such as the Riemannian manifold features used by Nguyen et al. [6] and fuzzy entropy features [16]) cannot be used with the proposed architecture.

For each trial, only the first 3000 samples (3 seconds) of collected data have been used for feature extraction. For extracting the temporal features, we divided the first 3000 samples into 4 equal blocks and extracted the following statistical features for each block (a code sketch follows the list):
1) Root-mean-square
2) Variance
3) Kurtosis
4) Skewness
5) 3rd order moment
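A minimal sketch of the temporal feature computation; NumPy/SciPy are our tooling choice (the paper does not specify an implementation), and reading the 3rd order moment as a central moment is our assumption.

import numpy as np
from scipy.stats import kurtosis, skew, moment

def temporal_features(x):
    """x: first 3000 samples (3 s at 1 kHz) of one EEG channel."""
    feats = []
    for block in np.split(x[:3000], 4):      # four equal blocks of 750 samples
        feats.extend([
            np.sqrt(np.mean(block ** 2)),    # root-mean-square
            np.var(block),                   # variance
            kurtosis(block),                 # kurtosis
            skew(block),                     # skewness
            moment(block, 3),                # 3rd-order central moment (our reading)
        ])
    return np.array(feats)                   # shape: (20,)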
TABLE I
COMPARISON OF THE MEAN CROSS-VALIDATION ACCURACIES IN PERCENTAGE OBTAINED USING DIFFERENT METHODS (GIVEN IN EACH COLUMN) IN CLASSIFYING IMAGINED PROMPTS IN THE KARAONE DATASET. "s01" TO "s08" ARE THE PARTICIPANT IDS.

Subject     SVM+MFCC [9]    DT+MFCC [9]    Proposed Method (DNN)
s01         22.27           24.52          43.02
s02         33.33           31.06          60.91
s03         23.62           15.12          84.23
s04         15.31           21.14          45.78
s05         14.84           11.41          37.43
s06         20.86           21.17          60.81
s07         26.08           26.84          75.07
s08         23.15           18.37          49.98
Average:    22.43           21.20          57.15

The Daubechies-4 (db4) wavelet is extensively used to extract features from EEG signals [17]. For extracting the wavelet domain features, the 3-second EEG signals are decomposed into 7 levels using the db4 wavelet. The above-mentioned statistical features are extracted from the last approximation coefficients and from each of the last three detail coefficients. This is performed to capture specific frequency bands that carry information on the cortical activity corresponding to the speech imagery. Hence, there are 20 temporal domain features and another 20 wavelet domain features, adding up to feature vectors of dimension 40. A sketch of this computation follows.
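The sketch below uses PyWavelets (our tooling choice). We read "the last three detail coefficients" as the three deepest subbands cD7, cD6 and cD5, which at a 1 kHz sampling rate roughly cover 4-8, 8-16 and 16-31 Hz (cA7 covers 0-4 Hz); this interpretation is our assumption.

import numpy as np
import pywt
from scipy.stats import kurtosis, skew, moment

def five_stats(v):
    """RMS, variance, kurtosis, skewness and 3rd-order central moment."""
    return [np.sqrt(np.mean(v ** 2)), np.var(v), kurtosis(v), skew(v),
            moment(v, 3)]

def wavelet_features(x):
    """x: first 3000 samples (3 s at 1 kHz) of one EEG channel."""
    # wavedec returns [cA7, cD7, cD6, ..., cD1]
    coeffs = pywt.wavedec(x[:3000], "db4", level=7)
    feats = []
    for band in coeffs[:4]:          # cA7 and the three deepest detail bands
        feats.extend(five_stats(band))
    return np.array(feats)           # shape: (20,)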
IV. DETAILS OF THE DNN CLASSIFIER
A DNN with two hidden layers is used as the primary classifier. Since the dimension of the feature vector is 40, the number of neurons in the input layer is also 40. Each dense hidden layer consists of 40 neurons. Dropout and batch normalization layers are added after each dense layer. The dropout ratio is 10% for the two hidden layers. The activation function of all the layers except the first hidden layer is the rectified linear unit; the activation function of the first hidden layer is the hyperbolic tangent. The activation function of the output layer is softmax, and the loss function is categorical cross-entropy. This DNN architecture was adopted based on the cross-validation performance of several DNN architectures. A sketch of the architecture is given below.

Because of the availability of very limited data, only cross-validation is performed. Since we derive 11 feature vectors (one per chosen channel) per trial, 11 outputs are obtained for each trial, one for each channel. The final decision for each trial is then based on majority (hard) voting over the 11 outputs.
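The described architecture maps onto a few lines of Keras. This is a sketch consistent with the description above, not the authors' code; the choice of Keras and of the Adam optimizer (the paper does not name one) are our assumptions.

from tensorflow.keras import layers, models

def build_dnn(n_features=40, n_classes=11):
    model = models.Sequential([
        layers.Dense(40, activation="tanh",
                     input_shape=(n_features,)),  # first hidden layer: tanh
        layers.BatchNormalization(),
        layers.Dropout(0.10),                     # 10% dropout
        layers.Dense(40, activation="relu"),      # second hidden layer: ReLU
        layers.BatchNormalization(),
        layers.Dropout(0.10),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Optimizer is not specified in the paper; Adam is our assumption.
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model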
V. RESULTS AND COMPARISON WITH THE LITERATURE
Five-fold cross-validation is performed on the preprocessed data of each participant. During cross-validation, it is ensured that all the channels corresponding to a trial are either in the training set or in the test set. This is important, since the presence of even a couple of channels from the test trials in the training set can lead to spuriously high accuracy due to data leakage. The cross-validation results obtained are listed in Table I, along with other results reported in the literature. A sketch of this grouped cross-validation is given below.
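The sketch below combines trial-grouped five-fold cross-validation with the per-trial hard voting of Section IV. scikit-learn's GroupKFold, the epoch count and the one-hot encoding are our assumptions; `build_dnn` refers to the sketch in Section IV.

import numpy as np
from sklearn.model_selection import GroupKFold

def grouped_cv(X, y, trial_ids, build_model, n_classes=11):
    """X: (n_rows, 40) channel-level features; y: integer prompt label per
    row; trial_ids: trial index per row (the 11 channels of a trial share
    one id, so GroupKFold keeps them in the same fold -- no leakage)."""
    accs = []
    for tr, te in GroupKFold(n_splits=5).split(X, y, groups=trial_ids):
        model = build_model()                          # e.g. build_dnn()
        # One-hot targets for categorical cross-entropy; epoch count is a guess.
        model.fit(X[tr], np.eye(n_classes)[y[tr]], epochs=100, verbose=0)
        chan_pred = model.predict(X[te]).argmax(axis=1)
        hits = []
        for t in np.unique(trial_ids[te]):
            mask = trial_ids[te] == t
            vote = np.bincount(chan_pred[mask]).argmax()   # hard majority vote
            hits.append(vote == y[te][mask][0])
        accs.append(np.mean(hits))
    return np.mean(accs)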
VI. CONCLUSION
The present work shows that it is possible to treat each EEG channel as an independent data vector in order to increase the size of the training set for the purpose of decoding imagined speech using deep learning techniques. The proposed method gives around 35% improvement in accuracy on average over the state-of-the-art results.
VII. ACKNOWLEDGEMENT
The authors thank Dr. Frank Rudzicz, University of Toronto, Mr. Ciaran Cooney, Ulster University, Dr. Kanishka Sharma, Mr. Pradeep Kumar G. and Ms. Ritika Jain, Indian Institute of Science, for the support extended to this work.
REFERENCES
[1] K. Brigham and B. V. Kumar, "Imagined speech classification with EEG signals for silent communication: a preliminary investigation into synthetic telepathy," in Bioinformatics and Biomedical Engineering (iCBBE), 2010 4th International Conference on. IEEE, 2010, pp. 1-4.
[2] T. La Vaque, "The history of EEG Hans Berger: psychophysiologist. A historical vignette," Journal of Neurotherapy, vol. 3, no. 2, pp. 1-9, 1999.
[3] S. Zhao and F. Rudzicz, "Classifying phonological categories in imagined and articulated speech," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 992-996.
[4] C. S. DaSalla, H. Kambara, M. Sato, and Y. Koike, "Single-trial classification of vowel speech imagery using common spatial patterns," Neural Networks, vol. 22, no. 9, pp. 1334-1339, 2009.
[5] L. Wang, X. Zhang, X. Zhong, and Y. Zhang, "Analysis and classification of speech imagery EEG for BCI," Biomedical Signal Processing and Control, vol. 8, no. 6, pp. 901-908, 2013.
[6] C. H. Nguyen, G. K. Karavas, and P. Artemiadis, "Inferring imagined speech using EEG signals: a new approach using Riemannian manifold features," Journal of Neural Engineering, vol. 15, no. 1, p. 016002, 2017.
[7] A. Balaji, A. Haldar, K. Patil, T. S. Ruthvik, C. Valliappan, M. Jartarkar, and V. Baths, "EEG-based classification of bilingual unspoken speech using ANN," in Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE. IEEE, 2017, pp. 1022-1025.
[8] A. R. Sereshkeh, R. Trott, A. Bricout, and T. Chau, "EEG classification of covert speech using regularized neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2292-2300, 2017.
[9] C. Cooney, R. Folli, and D. Coyle, "Mel frequency cepstral coefficients enhance imagined speech decoding accuracy from EEG." IEEE, 2018, pp. 1-7.
[10] J. T. Panachakel, A. G. Ramakrishnan, and T. V. Ananthapadmanabha, "A novel deep learning architecture for decoding imagined speech from EEG," in Proceedings of the IEEE Austria International Biomedical Engineering Conference (AIBEC 2019). IEEE, 2019.
[11] G. Chatrian, E. Lettich, and P. Nelson, "Ten percent electrode system for topographic studies of spontaneous and evoked EEG activities," American Journal of EEG Technology, vol. 25, no. 2, pp. 83-92, 1985.
[12] G. Gómez-Herrero, W. De Clercq, H. Anwar, O. Kara, K. Egiazarian, S. Van Huffel, and W. Van Paesschen, "Automatic removal of ocular artifacts in the EEG without an EOG reference channel," in Proceedings of the 7th Nordic Signal Processing Symposium (NORSIG 2006). IEEE, 2006, pp. 130-133.
[13] A. G. Ramakrishnan and J. V. Satyanarayana, "Reconstruction of EEG from limited channel acquisition using estimated signal correlation," Biomedical Signal Processing and Control, vol. 27, pp. 164-173, 2016.
[14] W. D. Marslen-Wilson and L. K. Tyler, "Morphology, language and the brain: the decompositional substrate for language comprehension," Philosophical Transactions of the Royal Society of London B: Biological Sciences, vol. 362, no. 1481, pp. 823-836, 2007.
[15] B. Alderson-Day, S. Weis, S. McCarthy-Jones, P. Moseley, D. Smailes, and C. Fernyhough, "The brain's conversation with itself: neural substrates of dialogic inner speech," Social Cognitive and Affective Neuroscience, vol. 11, no. 1, pp. 110-120, 2015.
[16] S. Raghu, N. Sriraam, G. P. Kumar, and A. S. Hegde, "A novel approach for real-time recognition of epileptic seizures using minimum variance modified fuzzy entropy," IEEE Transactions on Biomedical Engineering, vol. 65, no. 11, pp. 2612-2621, 2018.
[17] L. F. Nicolas-Alonso and J. Gomez-Gil, "Brain computer interfaces, a review," Sensors, vol. 12, no. 2, pp. 1211-1279, 2012.