Deep Feature Mining via Attention-based BiLSTM-GCN for Human Motor Imagery Recognition
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, APRIL 2020
Yimin Hou, Shuyue Jia, Student Member, IEEE, Shu Zhang, Xiangmin Lun, Yan Shi, Yang Li, Senior Member, IEEE, Hanrui Yang, Rui Zeng, and Jinglei Lv
Abstract—Recognition accuracy and response time are both critically essential ahead of building a practical electroencephalography (EEG) based brain-computer interface (BCI). Recent approaches, however, have compromised either classification accuracy or response time. This paper presents a novel deep learning approach designed towards remarkably accurate and responsive motor imagery (MI) recognition based on scalp EEG. Bidirectional Long Short-term Memory (BiLSTM) with the Attention mechanism manages to derive relevant features from raw EEG signals. The connected graph convolutional neural network (GCN) promotes the decoding performance by cooperating with the topological structure of the features, which is estimated from the overall data. The 0.4-second detection framework has shown effective and efficient prediction based on individual and group-wise training, with 98.81% and 94.64% accuracy, respectively, which outperformed all the state-of-the-art studies. The introduced deep feature mining approach can precisely recognize human motion intents from raw EEG signals, which paves the road to translating EEG-based MI recognition to practical BCI systems.
Index Terms—Brain-computer Interface (BCI), Electroencephalography (EEG), Motor Imagery (MI), Bidirectional Long Short-term Memory (BiLSTM), Graph Convolutional Neural Network (GCN), Attention Mechanism
This work was supported by the National Natural Science Foundation of China under Grant 31772059. Yimin Hou, Yan Shi, and Hanrui Yang are with the School of Automation Engineering, Northeast Electric Power University, Jilin, 132012, PR China. Shuyue Jia, the corresponding author, is with the School of Computer Science, Northeast Electric Power University, Jilin, 132012, PR China, and the School of Computer Science, The University of Sydney, Sydney, 2006, Australia (E-mail: [email protected]). Shu Zhang is with the School of Computer Science, Northwestern Polytechnical University, Xi'an, 710129, PR China. Xiangmin Lun is with the School of Automation Engineering, Northeast Electric Power University, Jilin, 132012, PR China, and the College of Mechanical and Electric Engineering, Changchun University of Science and Technology, Changchun, 130022, PR China. Yang Li is with the School of Electrical Engineering, Northeast Electric Power University, Jilin, 132012, PR China. Jinglei Lv and Rui Zeng are with the Sydney Imaging & School of Biomedical Engineering, The University of Sydney, Sydney, 2006, Australia.

I. INTRODUCTION

Recently, the brain-computer interface (BCI) has played a promising role in assisting and rehabilitating patients with paralysis, epilepsy, and brain injuries by interpreting neural activities to control peripherals [1, 2]. Among non-invasive brain activity acquisition systems, electroencephalography (EEG)-based BCI has gained extensive attention recently given its high temporal resolution and portability. Hence, it has been popularly employed to assist the recovery of patients with motor impairments, e.g., amyotrophic lateral sclerosis (ALS), spinal cord injury (SCI), or stroke [3, 4]. Specifically, researchers have focused on recognizing motor imagery (MI) from EEG and translating brain activities into specific motor intentions. In such a way, users can further manipulate external devices or exchange information with their surroundings [4]. Although researchers have developed several MI-based prototype applications, there is still room for improvement before practical clinical translation can proceed [2, 5]. De facto, to achieve effective and efficient control via MI alone, both precise EEG decoding and quick response are eagerly expected. However, few existing works of literature are competent in both respects.
In this study, we explore the possibility of a deep learning framework to tackle this challenge.
A. Related Work
Lately, Deep Learning (DL) has attracted increasing attention in many disciplines because of its promising performance in classification tasks [6]. A growing number of works have shown that DL will play a pivotal role in the precise decoding of brain activities [2]. In particular, recent works have been carried out on EEG motion intention detection. A primary current focus is to implement DL-based approaches to decode EEG MI tasks, which have attained promising results [7]. Due to the high temporal resolution of EEG signals, methods related to the recurrent neural network (RNN) [8], which can analyze time-series data, were extensively applied to filter and classify EEG sequences, i.e., time points [9, 10, 11, 12, 13]. In reference [9], a novel RNN framework with spatial and temporal filtering was put forward to classify EEG signals for emotion recognition, and achieved 95.4% accuracy for three classes with a 9-second segment as a sample. Wang et al. and Luo et al. applied Long Short-term Memory (LSTM) [14] to handle time-sliced signals, and achieved 77.30% and 82.75% accuracy, respectively [10, 11]. Reference [13] presented an Attention-based RNN for EEG-based person identification, which attained 99.89% accuracy for 8 participants at the subject level with 4-second signals as a sample. However, in these studies, signals over the whole experimental duration were treated as samples, which resulted in slow responsive prediction.

Apart from RNN, the Convolutional Neural Network (CNN) [15, 16] has been applied to decode EEG signals as well [17, 18]. Hou et al. proposed ESI and CNN, and achieved competitive results, i.e., 94.50% and 96.00% accuracy at the group and subject levels, respectively, for four-class classification. What is more, by combining CNN with graph theory, the Graph Convolutional Neural Network (GCN) [19, 20, 21, 22, 23] approach was presented lately, taking into consideration the functional topological relationship of EEG electrodes [24, 25, 26, 27]. In references [24] and [25], a GCN with a broad learning approach was proposed, which attained 93.66% and 94.24% accuracy, respectively, for EEG emotion recognition. Song et al. and Wang et al. introduced a dynamical GCN (90.40% accuracy) and a phase-locking value-based GCN (84.35% accuracy) to recognize different emotions [26, 27]. Highly accurate prediction has been accomplished via the GCN model, but few researchers have investigated the approach in the area of EEG MI decoding.

Fig. 1: The schematic overview, consisting of (i) 64-channel raw EEG signal acquisition, (ii) the BiLSTM with Attention model for feature extraction, and (iii) the GCN model for classification.
B. Contribution of This Paper
Towards accurate and fast MI recognition, an Attention-based BiLSTM-GCN was introduced to mine effective features from raw EEG signals. The main contributions are summarized as follows.
(i) To the best of our knowledge, this work is the first to combine BiLSTM with the GCN to decode EEG tasks.
(ii) The Attention-based BiLSTM successfully derived relevant features from raw EEG signals. The subsequent GCN model enhanced the decoding performance by considering the internal topological structure of the features.
(iii) The proposed feature mining approach managed to decode EEG MI signals with stably reproducible results, yielding remarkable robustness and adaptability even in the face of considerable inter-trial and inter-subject variability.
C. Organization of This Paper
The rest of this paper is organized as follows. The preliminary knowledge of the BiLSTM, Attention mechanism, and GCN is introduced in Section II, which is the foundation of the presented approach. In Section III, experimental details and numerical results are presented, followed by the conclusion in Section IV.

II. METHODOLOGY
A. Pipeline Overview
The framework of the proposed method is demonstrated in Figure 1.
(i) The 64-channel raw EEG signals were acquired and sliced into segments of 64 channels × 64 timesteps.
(ii) The Attention-based BiLSTM was put forward to filter the 64-channel (spatial information) and 0.4-second (temporal information) raw EEG data and derive features from the fully-connected neurons.
(iii) The Pearson, Adjacency, and Laplacian matrices of the overall features were introduced sequentially to represent the topological structure of the features, i.e., as a graph. With the features and their corresponding graph representation as the input, the GCN model was applied to classify the four-class MI tasks.
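As a concrete illustration of step (i), the slicing can be sketched in Python; the function name, random data, and non-overlapping windowing convention are illustrative assumptions, not the authors' code.

```python
import numpy as np

SAMPLE_RATE = 160          # Hz, per the PhysioNet dataset description
WINDOW = 64                # 0.4 s * 160 Hz = 64 time points
N_CHANNELS = 64

def slice_trial(recording: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """Split a (channels, T) trial into non-overlapping (channels, window) samples."""
    n_channels, t = recording.shape
    n_slices = t // window
    # Drop the tail that does not fill a complete window.
    trimmed = recording[:, : n_slices * window]
    return trimmed.reshape(n_channels, n_slices, window).transpose(1, 0, 2)

# A 4-second trial (64 channels x 640 time points) yields 10 samples of shape (64, 64).
trial = np.random.randn(N_CHANNELS, 4 * SAMPLE_RATE)
samples = slice_trial(trial)
```

Each 64 × 64 sample then enters the BiLSTM of step (ii) as a 64-step sequence of 64-dimensional inputs.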
Fig. 2: Presented BiLSTM with the Attention mechanism for feature extraction.
B. BiLSTM with Attention

1) BiLSTM Model:
RNN-based approaches have been extensively applied to analyze EEG time-series signals. An RNN cell, though similar to a feedforward neural network, has connections pointing backward, which send its output back to itself. The learned features of an RNN cell at time step $t$ are influenced not only by the input signals $x(t)$, but also by the output (state) at time step $t-1$. This design dictates that RNN-based methods can handle sequential data, e.g., time-point signals, by unrolling the network through time. The LSTM and the Gated Recurrent Unit (GRU) [28] are the most popular variants of the RNN-based approaches. In Section II-D, the paper compares the performance of these popular models experimentally; the BiLSTM with Attention displayed in Figure 2 outperformed the others owing to better detection of the long-term dependencies of raw EEG signals.

$$ i(t) = \sigma\left(W_{xi}^{T} \cdot x(t) + W_{hi}^{T} \cdot h(t-1) + b_i\right) \quad (1) $$
$$ f(t) = \sigma\left(W_{xf}^{T} \cdot x(t) + W_{hf}^{T} \cdot h(t-1) + b_f\right) \quad (2) $$
$$ o(t) = \sigma\left(W_{xo}^{T} \cdot x(t) + W_{ho}^{T} \cdot h(t-1) + b_o\right) \quad (3) $$
$$ g(t) = \tanh\left(W_{xg}^{T} \cdot x(t) + W_{hg}^{T} \cdot h(t-1) + b_g\right) \quad (4) $$
$$ c(t) = f(t) \otimes c(t-1) + i(t) \otimes g(t) \quad (5) $$
$$ y(t) = h(t) = o(t) \otimes \tanh\left(c(t)\right) \quad (6) $$

As illustrated by Figure 2, three kinds of gates manipulate and control the memories of EEG signals, namely the input gate, the forget gate, and the output gate. Denoted by $i(t)$, the input gate partially stores the information of $x(t)$ and controls which part of it should be added to the long-term state $c(t)$. The forget gate, controlled by $f(t)$, decides which piece of $c(t)$ should be forgotten. And the output gate, controlled by $o(t)$, selects which part of the information from $c(t)$ should be output, denoted $y(t)$, also known as the short-term state $h(t)$. Manipulated by the above gates, two kinds of states are stored. The long-term state $c(t)$ travels through the cell from left to right, dropping some memories at the forget gate and adding new ones from the input gate. After that, the information passes through a non-linear activation function, usually tanh, and is then filtered by the output gate; in such a way, the short-term state $h(t)$ is produced.

Equations (1)-(6) describe the procedure of an LSTM cell, where $W$ and $b$ are the weights and biases of the different layers that store the memory and learn a generalized model, and $\sigma$ is a non-linear activation function, i.e., the sigmoid function used in the experiments. For bidirectional LSTM (BiLSTM for short), the signals $x(t)$ are input from left to right into the forward LSTM cell; they are also reversed and input into another LSTM cell, the backward LSTM. This leaves two output vectors, which store more comprehensive information than a single LSTM cell, and they are concatenated as the final output of the cell.
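Equations (1)-(6) can be sketched numerically as follows. The weight shapes and random initialization are illustrative assumptions, and each gate's $W_x$ and $W_h$ are stacked into a single matrix, which is algebraically equivalent to the separate terms above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 64, 8     # 64 input features (one per channel); toy hidden size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One stacked weight matrix and bias per gate (input i, forget f, output o, candidate g).
W = {k: rng.standard_normal((n_in + n_hidden, n_hidden)) * 0.1 for k in "ifog"}
b = {k: np.zeros(n_hidden) for k in "ifog"}

def lstm_step(x_t, h_prev, c_prev):
    """Advance the cell by one time step; returns (y_t, h_t, c_t)."""
    z = np.concatenate([x_t, h_prev])      # stack input and previous short-term state
    i = sigmoid(z @ W["i"] + b["i"])       # Eq. (1): input gate
    f = sigmoid(z @ W["f"] + b["f"])       # Eq. (2): forget gate
    o = sigmoid(z @ W["o"] + b["o"])       # Eq. (3): output gate
    g = np.tanh(z @ W["g"] + b["g"])       # Eq. (4): candidate memory
    c = f * c_prev + i * g                 # Eq. (5): long-term state
    h = o * np.tanh(c)                     # Eq. (6): short-term state / output
    return h, h, c

y, h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hidden), np.zeros(n_hidden))
```

A BiLSTM simply runs one such cell over the sequence left-to-right, a second cell over the reversed sequence, and concatenates the two outputs at each step.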
2) Attention Mechanism:
The Attention mechanism, inspired by human visual attention, plays a vital part in the fields of Computer Vision (CV), Natural Language Processing (NLP), and Automatic Speech Recognition (ASR) [29, 30, 31, 32]. Not all signals contribute equally to the classification. Hence, an Attention output $s(t)$ is jointly trained as a weighted sum of the outputs of the BiLSTM, based on learned weights.

$$ u(t) = \tanh\left(W_w y(t) + b_w\right) \quad (7) $$
$$ \alpha(t) = \frac{\exp\left(u(t)^{\top} u_w\right)}{\sum_t \exp\left(u(t)^{\top} u_w\right)} \quad (8) $$
$$ s(t) = \sum_t \alpha(t)\, y(t) \quad (9) $$

$u(t)$ is produced by a fully-connected (FC) layer that learns features of $y(t)$, followed by a Softmax layer which outputs a probability distribution $\alpha(t)$. $W_w$, $u_w$, and $b_w$ denote trainable weights and biases, respectively. The mechanism selects and extracts the most significant temporal and spatial information from $y(t)$ by weighting it with $\alpha(t)$ according to its contribution to the decoding task.

C. Graph Convolutional Neural Network

1) Graph Convolution:
In graph theory, a graph is represented by its graph Laplacian $L$, computed as the Degree Matrix $D$ minus the Adjacency Matrix $A$, i.e., $L = D - A$. In this work, the Pearson matrix $P$ was utilized to measure the inner correlations among features:

$$ P_{X,Y} = \frac{E\left((X-\mu_X)(Y-\mu_Y)\right)}{\sigma_X \sigma_Y} \quad (10) $$

where $X$ and $Y$ are two variables regarding different features, $P_{X,Y}$ is their correlation, $\sigma_X$ and $\sigma_Y$ are their standard deviations, and $\mu_X$ and $\mu_Y$ are their expectations. Besides, the Adjacency Matrix $A$ is defined as:

$$ A = |P| - I \quad (11) $$

where $|P|$ is the element-wise absolute value of the Pearson matrix $P$, and $I \in \mathbb{R}^{N \times N}$ is an identity matrix. In addition, the Degree Matrix $D$ of the graph is computed as follows:

$$ D_{ii} = \sum_{j=1}^{N} A_{ij} \quad (12) $$

Then, the normalized graph Laplacian is computed as:

$$ L = I_N - D^{-1/2} A D^{-1/2} \quad (13) $$

It is then decomposed over the Fourier basis $U = [u_0, \ldots, u_{N-1}] \in \mathbb{R}^{N \times N}$, i.e., $L = U \Lambda U^T$, where $\Lambda = \mathrm{diag}([\lambda_0, \ldots, \lambda_{N-1}]) \in \mathbb{R}^{N \times N}$ holds the eigenvalues of $L$. The convolutional operation for a graph is defined as:

$$ y = g_\theta(L)\, x = g_\theta\left(U \Lambda U^T\right) x = U g_\theta(\Lambda) U^T x \quad (14) $$

in which $g_\theta$ is a non-parametric filter. Specifically, the operation is as follows:

$$ y_{k,:,j} = \sigma\left(\sum_{i=1}^{f_{k-1}} U g_\theta(\Lambda) U^T x_{k,:,i}\right) \quad (15) $$

in which $x_k \in \mathbb{R}^{N \times f_{k-1}}$ denotes the input signals, $N$ is the number of vertices of the graph, $f_{k-1}$ and $f_k$ are the numbers of input and output channels respectively, and $\sigma$ denotes a non-linear activation function. What is more, $g_\theta$ is approximated by Chebyshev polynomials because the non-parametric filter is not localized in space and is very time-consuming to evaluate [33]. The Chebyshev recurrence is $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0 = 1$ and $T_1 = x$, and the filter can be presented as $g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$, in which $\theta \in \mathbb{R}^K$ is a set of coefficients, and $T_k(\tilde{\Lambda})$ is the $k$-th order polynomial evaluated at $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_N$, a diagonal matrix of eigenvalues scaled into $[-1, 1]$. The convolution can be rewritten as:

$$ y = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\, x \quad (16) $$
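A minimal numerical sketch of the graph construction and Chebyshev filtering of Eqs. (10)-(16) follows. The node count, toy data, and filter coefficients are illustrative assumptions, not the paper's trained values, and $\lambda_{max} = 2$ is taken as the standard upper bound for the normalized Laplacian.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8                                   # number of graph nodes (features)
X = rng.standard_normal((N, 200))       # toy observations, one row per node

# Eqs. (10)-(11): Pearson correlations -> adjacency A = |P| - I.
P = np.corrcoef(X)
A = np.abs(P) - np.eye(N)

# Eqs. (12)-(13): degree matrix and normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt

# Eq. (16): y = sum_k theta_k T_k(L_tilde) x, avoiding the eigendecomposition
# of Eq. (14). lambda_max = 2 bounds the normalized Laplacian's spectrum.
lam_max = 2.0
L_tilde = 2.0 * L / lam_max - np.eye(N)
theta = np.array([0.5, -0.2, 0.1])      # K = 3 illustrative coefficients

def cheb_conv(x: np.ndarray) -> np.ndarray:
    """Apply the Chebyshev-parameterized graph filter to a node signal x."""
    T_prev, T_curr = x, L_tilde @ x     # T_0 x = x, T_1 x = L_tilde x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):      # recurrence: T_k = 2 L_tilde T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
        out += theta[k] * T_curr
    return out

y = cheb_conv(rng.standard_normal(N))
```

The recurrence keeps every operation a sparse matrix-vector product, which is why the Chebyshev form is both localized (K-hop) and cheap compared with the spectral form of Eq. (14).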
2) Graph Pooling:
The graph pooling operation can be achieved via the Graclus multilevel clustering algorithm, which consists of node clustering and one-dimensional pooling [34]. A greedy algorithm is implemented to compute successively coarser versions of the graph and minimize the clustering objective, for which the normalized cut is chosen [35]. In this way, meaningful neighborhoods on graphs are acquired. Reference [23] proposed to use a balanced binary tree to store the neighborhoods, after which one-dimensional pooling is applied for precise dimensionality reduction.
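Once the Graclus matching has reordered the vertices so that each cluster occupies consecutive positions, with fake nodes padded where a vertex is unmatched, graph pooling reduces to ordinary 1-D max-pooling. The following toy sketch (ordering, signal values, and the negative-infinity padding convention are illustrative assumptions) shows that final step:

```python
import numpy as np

NEG = -np.inf  # padding value for fake nodes, ignored by max-pooling

def graph_max_pool(signal: np.ndarray, cluster_size: int = 2) -> np.ndarray:
    """1-D max-pool over consecutive groups of `cluster_size` vertices."""
    assert signal.size % cluster_size == 0
    return signal.reshape(-1, cluster_size).max(axis=1)

# Six real vertices plus two fake padding nodes -> four pooled vertices.
x = np.array([0.3, 1.2, -0.5, 0.8, 2.0, NEG, 0.1, NEG])
pooled = graph_max_pool(x)  # -> array([1.2, 0.8, 2.0, 0.1])
```

Padding with negative infinity guarantees that a fake node can never win the max, so unmatched vertices pass through the pooling layer unchanged.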
D. Proposed Approach
The presented approach is a combination of the Attention-based BiLSTM and the GCN, as illustrated in Figure 1. The BiLSTM with the Attention mechanism derives relevant features from raw EEG signals; during this procedure, the features are obtained from the neurons at the FC layer. The GCN is then applied to classify the extracted features. It is this combination of the two models that promotes and enhances the decoding performance by a significant margin compared with existing studies. Details are provided in the following.

First of all, an optimal RNN-based model was explored to obtain relevant features from raw EEG signals. As detailed in Figure 3, the BiLSTM with Attention model performed best in this work, achieving 77.86% Global Average Accuracy (GAA). The input size of $x(t)$ was 64, denoting the 64 channels (electrodes) of raw EEG signals. The maximum time $t$ was chosen as 64, corresponding to a 0.4-second segment. According to Figure 3b, higher accuracy was obtained while increasing the number of cells of the BiLSTM model. It should, however, be noted in Figure 3f that when there were more than 256 cells, the loss showed an upward trend, which indicated a concern of overfitting due to the increase in model complexity. As a result, 256 LSTM cells (76.67% GAA) were chosen to generalize the model. Meanwhile, it was apparent in Figure 3c that, as for the linear size of the Attention weights, the majority of the choices did not make a difference; thus, 8 neurons, with 79.40% GAA, were applied in the experiments empirically. Comparing Figure 3d and Figure 3h shows that a compromise should be struck between performance and the input size of the GCN; as a result, a linear size of 64 (76.73% GAA) was utilized at the FC layer.

Besides, to prevent overfitting, a 25% dropout [36] was implemented for the BiLSTM and FC layers. The model carried out batch normalization (BN) [37] at the FC layer, which was activated by the softplus function [38].
An L2-norm penalty was applied to the Euclidean-distance loss function. A batch size of 1,024 was used to maximize the usage of GPU resources, and the Adam optimizer [39] was applied.

Furthermore, a 2nd-order Chebyshev polynomial was applied to approximate the convolutional filters in the experiments. The GCN consisted of six graph convolutional layers with 16, 32, 64, 128, 256, and 512 filters, respectively, each followed by a graph max-pooling layer, and a softmax layer derived the final prediction.

Fig. 3: Models and hyperparameter comparison w.r.t. the RNN-based methods for feature extraction. Panels (a)-(d): GAA w.r.t. the RNN-based models, the BiLSTM cell size, the Attention size of the BiLSTM, and the number of extracted features; panels (e)-(h): the corresponding loss.
Fig. 4: The Pearson, Absolute Pearson, Adjacency, and Laplacian Matrices for Subject Nine.
In addition, for the GCN model, BN was utilized before the non-linear softplus activation at all layers except the final softmax. An L2-norm penalty was added to the loss function, which was a cross-entropy loss. Mini-batch stochastic gradient descent [40] with a batch size of 16 was performed, optimized by Adam.

All the experiments above were performed and implemented with Google TensorFlow [41] 1.14.0 on an NVIDIA RTX 2080 Ti with CUDA 10.0.

III. RESULTS AND DISCUSSION
A. Description of Dataset
The data from the EEG Motor Movement/Imagery Dataset [42] were employed in this study. Numerous EEG trials were acquired from 109 participants performing four MI tasks, i.e., imagining left fist (L), right fist (R), both fists (B), and both feet (F) (21 trials per task). Each trial lasts 4 seconds (160 Hz sampling rate) and contains one single task [17]. In this work, a 0.4-second temporal segment of the 64-channel signals, i.e., 64 channels × 64 time points, was regarded as a sample. In Section III-B, we used a group of 20 subjects' data to train and validate our method, with 10-fold cross validation. Further, 50 subjects were selected to verify the repeatability and stability of our approach. In Section III-C, the data of individual subjects were utilized to perform subject-level adaptation. For all the experiments, the dataset was randomly divided into 10 parts, where 90% formed the training set and the remaining 10% was regarded as the test set. In Section III-B, the above procedure was carried out 10 times, yielding 10 results of 10-fold cross validation.

B. Group-wise Prediction
It has been suggested that inter-subject variability remains one of the concerns in interpreting EEG signals [43]. Firstly, a small group size (20 subjects) was adopted for group-wise prediction. In Figure 3a, 63.57% GAA was achieved by the BiLSTM model. Applying the Attention mechanism enhanced the decoding performance to 77.86% GAA (a 14.29% improvement). Further, the Attention-based BiLSTM-GCN model employed in this work attained a maximum GAA of 94.64% [17] (a 31.07% improvement over the BiLSTM model) and a 93.04% median accuracy.

Fig. 5: Box plot and confusion matrix for 10-fold cross validation.
By adding signals from an additional 30 subjects (50 subjects in total), the robustness of the method was validated, as shown in Figure 6.

Fig. 6: GAA and ROC curve for 20 and 50 subjects, separately.
Towards practical EEG-based BCI applications, it is essential to develop a robust model that counters serious individual variability [43]. Figure 6a illustrates the GAA of our method over the iterations. As shown in Figure 6b, 94.64% and 91.40% GAA were obtained for the groups of 20 and 50 subjects, respectively, and the Areas under the Curves (AUC) were 0.964 and 0.943. These results indicate that the presented approach can successfully filter the distinctions among signals even when the dataset is extended; in other words, by increasing the inter-subject variability, the robustness and effectiveness of the method were evaluated. The group-wise evaluation was compared with several state-of-the-art methods, measured by the maximum GAA [17] attained during the experiments [44, 17], in Table I.
TABLE I: Comparison on group-wise evaluation
Related Work        Max. GAA   Approach                     Num. of Subjects   Database
Ma et al. (2018)    68.20%     RNNs                         12                 Physionet Database
Hou et al. (2019)   94.50%     ESI + CNNs                   10                 Physionet Database
                    92.50%                                  14
This work           94.64%     Attention-based BiLSTM-GCN   20                 Physionet Database
Table I lists the performance of related methods. Hou et al. achieved competitive results; however, our method obtained higher performance (a 0.14% accuracy improvement) even with double the number of subjects. Our approach outperformed the others, giving the highest accuracy in decoding EEG MI signals.
C. Subject-specific Adaptation
The performance of individual adaptation has witnessed a flourishing increase [45, 46, 47, 48, 49, 50, 18, 17]. The results of our method on subject-level adaptation are reviewed in Table II, and comparisons with other studies are given in Table III.
TABLE II: Subject-level Evaluation
No. of Subject   GAA      Kappa    Precision   Recall   F1 Score
Average          95.48%   93.94%   95.50%      95.61%   95.35%
As given in Table II, the highest GAA was 98.81% and the lowest was 90.48%. On average, the presented approach can handle the challenge of subject-specific adaptation. It achieved competitive results, with an average accuracy of 95.48%; moreover, Cohen's Kappa coefficient (Kappa), precision, recall, and F1-score were 93.94%, 95.50%, 95.61%, and 95.35%, respectively. These promising results indicate that the introduced method filtered raw EEG signals and succeeded in classifying MI tasks.

Fig. 7: Loss and ROC curve for subject-level evaluation.
As can be seen from Figure 7a, the model converged for the subject-specific adaptation. The Receiver Operating Characteristic (ROC) curve with its corresponding AUC is shown in Figure 7b. The comparison of subject-level prediction was put forward between the presented approach and competitive models [45, 46, 47, 48, 49, 50, 18, 17]. The Attention-based BiLSTM-GCN approach achieved highly accurate results, suggesting robustness and effectiveness for EEG signal processing, as shown in Table III. The presented approach improved classification accuracy and obtained state-of-the-art results. The reason for the outstanding performance is that the Attention-based BiLSTM model managed to extract relevant features from raw EEG signals, and the GCN model that followed successfully classified the features by cooperating with the topological relationship of the overall features.

IV. CONCLUSION
To address the challenge of inter-trial and inter-subject variability in EEG signals, an innovative Attention-based BiLSTM-GCN approach was proposed to accurately classify four-class EEG MI tasks, i.e., imagining left fist, right fist, both fists, and both feet. First of all, the BiLSTM with Attention model succeeded in extracting relevant features from raw EEG signals. The GCN model that followed intensified the decoding performance by cooperating with the internal topological relationship of the relevant features, which was estimated from the Pearson matrix of the overall features. Besides, the results provide compelling evidence that the method converged for both subject-level and group-wise predictions and achieved state-of-the-art performance, i.e., 98.81% and 94.64% accuracy, respectively, in handling individual variability, far ahead of existing studies. The 0.4-second sample size proved effective and efficient in prediction compared with the traditional 4-second trial length, which means that our proposed framework can provide a time-resolved solution towards fast response. Results on a group of 20 subjects were derived by 10-fold cross validation, indicating repeatability and stability. The proposed method is expected to advance the clinical translation of the EEG MI-based BCI technology to meet diverse demands, such as those of paralyzed patients. In summary, unprecedented performance with the highest accuracy and time-resolved prediction was fulfilled via the introduced feature mining approach.

ACKNOWLEDGMENTS
The authors would like to thank the Brain Team at Googlefor developing TensorFlow. We further acknowledge Phys-ioNet for open source the EEG Motor Movement/ImageryDataset to promote the research.R
EFERENCES[1] C. E. Bouton, A. Shaikhouni, N. V. Annetta, M. A. Bockbrader, D. A.Friedenberg, D. M. Nielson, G. Sharma, P. B. Sederberg, B. C. Glenn,W. J. Mysiw, et al. , “Restoring cortical control of functional movementin a human with quadriplegia,”
Nature , vol. 533, no. 7602, p. 247, 2016.[2] M. A. Schwemmer, N. D. Skomrock, P. B. Sederberg, J. E. Ting,G. Sharma, M. A. Bockbrader, and D. A. Friedenberg, “Meeting brain–computer interface user performance expectations using a deep neuralnetwork decoding framework,”
Nature medicine , vol. 24, no. 11, p. 1669,2018. [3] J. J. Daly and J. R. Wolpaw, “Brain–computer interfaces in neurologicalrehabilitation,”
The Lancet Neurology , vol. 7, no. 11, pp. 1032–1043,2008.[4] J. Pereira, A. I. Sburlea, and G. R. M¨uller-Putz, “Eeg patterns of self-paced movement imaginations towards externally-cued and internally-selected targets,”
Scientific reports , vol. 8, no. 1, p. 13394, 2018.[5] M. Mahmood, D. Mzurikwao, Y.-S. Kim, Y. Lee, S. Mishra, R. Herbert,A. Duarte, C. S. Ang, and W.-H. Yeo, “Fully portable and wirelessuniversal brain–machine interfaces enabled by flexible scalp electronicsand deep learning algorithm,”
Nature Machine Intelligence , vol. 1, no. 9,pp. 412–422, 2019.[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature , vol. 521,no. 7553, pp. 436–444, 2015.[7] F. Lotte, L. Bougrain, A. Cichocki, M. Clerc, M. Congedo, A. Rako-tomamonjy, and F. Yger, “A review of classification algorithms for eeg-based brain–computer interfaces: a 10 year update,”
Journal of neuralengineering , vol. 15, no. 3, p. 031005, 2018.[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning represen-tations by back-propagating errors,” nature , vol. 323, no. 6088, pp. 533–536, 1986.[9] T. Zhang, W. Zheng, Z. Cui, Y. Zong, and Y. Li, “Spatial–temporalrecurrent neural network for emotion recognition,”
IEEE transactionson cybernetics , vol. 49, no. 3, pp. 839–847, 2018.[10] P. Wang, A. Jiang, X. Liu, J. Shang, and L. Zhang, “Lstm-based eegclassification in motor imagery tasks,”
IEEE Transactions on NeuralSystems and Rehabilitation Engineering , vol. 26, no. 11, pp. 2086–2095,2018.[11] T.-j. Luo, F. Chao, et al. , “Exploring spatial-frequency-sequential re-lationships for motor imagery classification with recurrent neural net-work,”
BMC bioinformatics , vol. 19, no. 1, p. 344, 2018.[12] N. F. G¨uler, E. D. ¨Ubeyli, and I. G¨uler, “Recurrent neural networksemploying lyapunov exponents for eeg signals classification,”
Expertsystems with applications , vol. 29, no. 3, pp. 506–514, 2005.[13] X. Zhang, L. Yao, S. S. Kanhere, Y. Liu, T. Gu, and K. Chen,“Mindid: Person identification from brain waves through attention-basedrecurrent neural network,”
Proceedings of the ACM on Interactive,Mobile, Wearable and Ubiquitous Technologies , vol. 2, no. 3, p. 149,2018.[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[15] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[16] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[17] Y. Hou, L. Zhou, S. Jia, and X. Lun, "A novel approach of decoding EEG four-class motor imagery tasks via scout ESI and CNN," Journal of Neural Engineering, vol. 17, p. 016048, Feb. 2020.
[18] H. Dose, J. S. Møller, H. K. Iversen, and S. Puthusserypady, "An end-to-end deep learning approach to MI-EEG signal classification for BCIs," Expert Systems with Applications, vol. 114, pp. 532–542, 2018.
[19] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," arXiv preprint arXiv:1506.05163, 2015.
[20] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in International Conference on Learning Representations (ICLR 2014), CBLS, April 2014.
[21] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," in Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.
[22] M. Niepert, M. Ahmed, and K. Kutzkov, "Learning convolutional neural networks for graphs," in International Conference on Machine Learning, pp. 2014–2023, 2016.
[23] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.
[24] X.-h. Wang, T. Zhang, X.-m. Xu, L. Chen, X.-f. Xing, and C. P. Chen, "EEG emotion recognition using dynamical graph convolutional neural networks and broad learning system," in , pp. 1240–1244, IEEE, 2018.
[25] T. Zhang, X. Wang, X. Xu, and C. P. Chen, "GCB-Net: Graph convolutional broad network and its application in emotion recognition," IEEE Transactions on Affective Computing, 2019.
TABLE III: Comparison of current studies on subject-level prediction

Related Work                    Max. GAA   Approach                      Database
Ortiz-Echeverri et al. (2019)   94.66%     Sorted-fast ICA-CWT + CNNs    BCI Competition IV-a Dataset
Sadiq et al. (2019)             95.20%     EWT + LS-SVM
Taran et al. (2018)             96.89%     TQWT + LS-SVM
Zhang et al. (2019)             83.00%     CNNs-LSTM                     BCI Competition IV-2a Dataset
Ji et al. (2019)                95.10%     SVM
Amin et al. (2019)              95.40%     MCNNs
Dose et al. (2018)              68.51%     CNNs                          Physionet Database
Hou et al. (2019)               96.00%     ESI + CNNs
This work                       98.81%     Attention-based BiLSTM-GCN

[26] T. Song, W. Zheng, P. Song, and Z. Cui, "EEG emotion recognition using dynamical graph convolutional neural networks,"
IEEE Transactions on Affective Computing, 2018.
[27] Z. Wang, Y. Tong, and X. Heng, "Phase-locking value based graph convolutional neural networks for emotion recognition," IEEE Access, vol. 7, pp. 93711–93722, 2019.
[28] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Doha, Qatar), pp. 1724–1734, Association for Computational Linguistics, Oct. 2014.
[29] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[30] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, pp. 2048–2057, 2015.
[31] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, 2016.
[32] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, pp. 577–585, 2015.
[33] D. K. Hammond, P. Vandergheynst, and R. Gribonval, "Wavelets on graphs via spectral graph theory," Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
[34] I. S. Dhillon, Y. Guan, and B. Kulis, "Weighted graph cuts without eigenvectors: a multilevel approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 1944–1957, 2007.
[35] J. Shi and J. Malik, "Normalized cuts and image segmentation," Departmental Papers (CIS), vol. 22, no. 8, pp. 888–905, 2000.
[36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[37] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[38] R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung, "Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit," Nature, vol. 405, no. 6789, p. 947, 2000.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[40] T. Zhang, "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in Proceedings of the Twenty-First International Conference on Machine Learning, p. 116, ACM, 2004.
[41] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., "TensorFlow: A system for large-scale machine learning," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 265–283, 2016.
[42] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, "PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals," Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
[43] H. Tanaka, "Group task-related component analysis (gTRCA): a multivariate method for inter-trial reproducibility and inter-subject similarity maximization for EEG data analysis," Scientific Reports, vol. 10, no. 1, pp. 1–17, 2020.
[44] X. Ma, S. Qiu, C. Du, J. Xing, and H. He, "Improving EEG-based motor imagery classification via spatial and temporal recurrent neural networks," in , pp. 1903–1906, IEEE, 2018.
[45] C. J. Ortiz-Echeverri, S. Salazar-Colores, J. Rodríguez-Reséndiz, and R. A. Gómez-Loenzo, "A new approach for motor imagery classification based on sorted blind source separation, continuous wavelet transform, and convolutional neural network," Sensors, vol. 19, no. 20, p. 4541, 2019.
[46] M. T. Sadiq, X. Yu, Z. Yuan, Z. Fan, A. U. Rehman, G. Li, and G. Xiao, "Motor imagery EEG signals classification based on mode amplitude and frequency components using empirical wavelet transform," IEEE Access, vol. 7, pp. 127678–127692, 2019.
[47] S. Taran and V. Bajaj, "Motor imagery tasks-based EEG signals classification using tunable-Q wavelet transform," Neural Computing and Applications, vol. 31, no. 11, pp. 6925–6932, 2019.
[48] R. Zhang, Q. Zong, L. Dou, and X. Zhao, "A novel hybrid deep learning scheme for four-class motor imagery classification," Journal of Neural Engineering, vol. 16, no. 6, p. 066004, 2019.
[49] N. Ji, L. Ma, H. Dong, and X. Zhang, "EEG signals feature extraction based on DWT and EMD combined with approximate entropy," Brain Sciences, vol. 9, no. 8, p. 201, 2019.
[50] S. U. Amin, M. Alsulaiman, G. Muhammad, M. A. Bencherif, and M. S. Hossain, "Multilevel weighted feature fusion using convolutional neural networks for EEG motor imagery classification,"