A Robust Deep Learning Approach for Automatic Classification of Seizures Against Non-seizures
Xinghua Yao a, Xiaojin Li a, Qiang Ye b, Yan Huang a, Qiang Cheng a,*, Guo-Qiang Zhang c,*

a Institute of Biomedical Informatics, University of Kentucky, Lexington, Kentucky, USA
b Department of Mathematics, University of Kentucky, Lexington, Kentucky, USA
c The University of Texas Health Science Center at Houston, Houston, Texas, USA
Abstract
Identifying epileptic seizures through analysis of the electroencephalography (EEG) signal has become a standard method for the diagnosis of epilepsy. Manual seizure identification on EEG by trained neurologists is time-consuming, labor-intensive and error-prone, so a reliable automatic seizure/non-seizure classification method is needed. One of the challenges in automatic seizure/non-seizure classification is that seizure morphologies exhibit considerable variabilities. In order to capture essential seizure patterns, this paper leverages an attention mechanism and a bidirectional long short-term memory (BiLSTM) network to exploit both spatial and temporal discriminating features and overcome seizure variabilities. The attention mechanism captures spatial features according to the contributions of different brain regions to seizures, while the BiLSTM extracts discriminating temporal features in both the forward and the backward directions. Cross-validation experiments and cross-patient experiments on the noisy data of CHB-MIT are performed to evaluate our proposed approach. The obtained average sensitivity of 87.00%, specificity of 88.60% and precision of 88.63% in the cross-validation experiments are higher than those of the current state-of-the-art methods, and the standard deviations of our approach are lower. The results of the cross-patient experiments indicate that our approach performs better than the current state-of-the-art methods and is more robust across patients.
Keywords: attention mechanism, bidirectional LSTM, seizure/non-seizure classification, deep learning
1. Introduction
More than 50 million people in the world suffer from epilepsy [1]. Epilepsy is a central nervous system disorder in which brain activity becomes abnormal, causing seizures or periods of unusual behaviors, sensations, and sometimes loss of awareness. An important technique for diagnosing epilepsy is electroencephalography (EEG). An EEG signal records the electrical activities of the brain, and may reveal patterns of normal or abnormal brain electrical activity. In current clinical practice, EEG signals are collected from the brain using either non-intrusive or implanted devices. The collected off-line EEG signals are then reviewed and analyzed by trained neurologists to identify characteristic patterns of the disease, such as pre-ictal spikes and seizures (a seizure is a sudden, uncontrolled electrical disturbance in the brain, which signifies epilepsy), and to capture disease information, such as seizure frequency and seizure type. The obtained disease information supports therapeutic decisions. This manual way of reviewing and analyzing is labor-intensive and error-prone, for it usually takes several hours for a well-trained expert to analyze one day of recordings from one patient [2, 3, 4, 5, 6].

* Corresponding authors
Email addresses:
[email protected] (Qiang Cheng), [email protected] (Guo-Qiang Zhang)
These limitations have motivated researchers to develop automated techniques to recognize seizures. In this paper, we focus on developing an automatic approach to classifying seizure segments and non-seizure segments from off-line EEG signals to assist neurologists in making diagnoses. One of the critical challenges in seizure/non-seizure classification is that seizure morphologies exhibit considerable inter-patient and intra-patient variabilities. Different machine learning methods and computational technologies have been applied to address this challenge. Seizure detection is often converted into a problem of seizure/non-seizure classification, but with more of a real-time flavor. Extensive studies have been conducted on constructing patient-specific detectors capable of detecting seizures [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. In early studies, hand-crafted features were usually used as characteristics of seizure manifestations in EEG. More recent studies focus on applying deep learning models to seizure detection [4, 13, 16, 17]. Most of these studies adopt interesting technologies to help extract seizure features. For example, signal processing techniques are used to filter the data; certain modules need to be pre-trained; multiple channels are utilized to extract spatial features; and temporal features are extracted with sliding windows. However, to the best of our knowledge, the data over channels are processed in the same way; i.e., the channels are not differentiated. For extracting temporal features, most studies only work in the forward direction. In fact, for seizure/non-seizure
classification, the EEG signals can potentially provide some additional information in the backward direction [13].

Different brain regions are likely to have different contributions to a seizure. The characteristics of epileptic EEG data differ across brain regions. The features of EEG signals at a time point are correlated with both the past data and the future data. Besides, though EEG signals are in general dynamic and non-linear, during a sufficiently small time period the signal may be considered stationary. Based on the above three observations and inspired by an architecture in [18], we design a new approach using bidirectional long short-term memory (BiLSTM) integrated with an attention mechanism. Firstly, we introduce an attention mechanism over EEG channels. Different weights are automatically assigned to signal channels at different brain regions according to how much they affect seizures. Secondly, the bidirectional long short-term memory technique is adopted to extract temporal features of EEG signals in both the forward and the backward directions. Thirdly, output sequences of the BiLSTM module are split into patches according to time steps. Each patch only contains data at one time step. All the patches are separately processed to extract features. With these three new ideas, we develop a novel approach for seizure/non-seizure classification in EEG signals. Cross-validation and cross-patient experiments are performed using the proposed approach. In the cross-validation experiments, we obtain average sensitivity, specificity and precision of 87.00%, 88.60% and 88.63%, respectively, with corresponding standard deviations of 0.0363, 0.0463 and 0.0388.

Preprint submitted to Biomedical Signal Processing and Control, June 7, 2019
In the cross-patient experiments, average sensitivity, specificity and precision of 83.72%, 84.06% and 85.36% are achieved, with standard deviations of 0.1349, 0.1379 and 0.1020, respectively. These results exceed the current state-of-the-art performances on the noisy data of CHB-MIT in [17], [18] and [4]. The extensive experimental results show that the performance of the proposed approach is promising and highly stable, with smaller variations compared to existing methods.

In brief, the main novelties of our paper include the following:

(1) An attention mechanism is utilized to capture spatial features of seizures for the first time. It distinguishes EEG signals from different brain regions and generates different attention weights for EEG data over different channels. The attention weights are explained using example EEG data segments.

(2) Bidirectional long short-term memory is combined with the attention mechanism to extract temporal features. At each time step, the past spatially-weighted data and the future spatially-weighted data are analyzed.

(3) Experimental results on the noisy EEG data of CHB-MIT demonstrate that the new approach can capture more robust seizure patterns than current state-of-the-art deep learning approaches, and better overcome inter-patient seizure variations.

The rest of this paper is organized as follows. Section 2 describes related research work on automatic seizure/non-seizure classification. Section 3 presents our designed approach of BiLSTM with attention. In Section 4, the proposed approach is evaluated in cross-validation and cross-patient experiments. Section 5 explains the attention mechanism and validates the main modules of the proposed approach. Section 6 discusses the approach of BiLSTM with attention. Conclusions and future work are described in Section 7.
2. Related work
There is extensive research on seizure/non-seizure classification, which distinguishes seizure segments from non-seizure segments. Seizure detection, which often has a real-time flavor, is often viewed as a seizure/non-seizure classification problem. Studies of seizure detection can be divided into three categories. The first category uses traditional machine learning methods [7, 8, 10, 11, 12, 19, 20, 21, 22, 23]. The second category is based on signal processing methods and network techniques [6, 9, 15, 24]. The third category uses deep learning methods [4, 13, 16, 17, 18, 25].
With traditional machine learning methods, many previous works focus on developing patient-specific seizure detection methods.

Shoeb and Guttag proposed a patient-specific seizure detection method using the support vector machine (SVM) [7]. The method leverages filters to extract spectral features over each channel, and then concatenates the feature vectors according to a fixed time length. The SVM model is then trained with the obtained feature vectors as input. The method achieved a sensitivity of 96%, a median detection delay of 3 seconds and a median false detection rate of 2 per 24 hours. The sensitivity result is often used as a benchmark for patient-specific seizure detection on the CHB-MIT data set. The authors observed that the identity of channels could help differentiate between seizure and non-seizure activity.

Amin and Kamboh [8] designed an algorithm, RUSBoost, to process imbalanced seizure/non-seizure data, and used RUSBoost and a decision tree classifier to conduct patient-specific experiments on the CHB-MIT data set. The method was fast in training and achieved good performance, with a seizure detection accuracy of 97% and a false detection rate of 0.08 per hour.

Hunyadi et al. [10] presented a seizure detection algorithm which uses nuclear norm regularization to convey spatial distribution information of ictal patterns. The algorithm extracted features from each channel, and then stacked them to be analyzed as one entity.

Truong et al. [12] proposed an automatic seizure detection method over intracranial electroencephalography (iEEG) data. First, supervised classifiers were used to select the channels that contribute the most to seizures. Features in the frequency and time domains were extracted, including spectral power and correlations between channel pairs. Then, a Random Forest classifier was utilized for classification. This method has state-of-the-art computational efficiency while maintaining accuracy.
In this method, selecting the channels that contribute the most to seizures reduces the number of channels, thereby improving the computational efficiency.

The work in [7, 8, 10, 12] used data over multiple channels to extract spatial features. However, it did not apply different processing to the data from different channels.

Esbroeck et al. [11] proposed a multi-task learning framework to detect patient-specific seizure onset in the presence of intra-patient variability in seizure morphology. They considered distinguishing the windows of each seizure from non-seizure data as a separate task, and treated the individual-seizure discrimination as another task. Compared to the standard SVM, testing results on the CHB-MIT data set indicated that their approach performed better in most cases.

Kiranyaz et al. [26] presented a systematic approach for patient-specific classification of long-term EEG. In the approach, EEG data were processed through band-pass filtering, feature extraction, epileptic seizure aggregation and morphologic filtering. Results of the data processing were input into a collective network of binary classifiers to classify the signal from each channel. Then, the initial classification results over each channel were further learned and weighted by a dedicated classifier which makes the final classification decision for each EEG frame. Over the CHB-MIT data set, [26] achieved an average sensitivity of 89.01% and an average specificity of 94.71%. The high number of classifiers increased the computational complexity of the approach.

In the patient-specific case, the data have no variations caused by different subjects. The performances of patient-specific seizure/non-seizure classifiers are better than 90%. However, patient-specific classifiers have a limitation of poor generalizability.

In [19], Fergus et al.
presented a method for seizure detection across subjects based on traditional machine learning techniques, and obtained 88% sensitivity and 88% specificity on the CHB-MIT data set by selecting features in multiple brain regions. The method mainly consists of four steps: data filtering, feature extraction, feature selection and training classifiers. In cross-validation experiments, the EEG signals in CHB-MIT were segmented with a segment length of 60 seconds; one seizure segment was truncated for each seizure, and as many non-seizure segments as seizure segments were extracted from non-seizure EEG records. The produced experiment data consist of 171 seizure segments and 171 non-seizure segments. On average, each seizure segment contains 40 s of seizure data. Additionally, after segmenting the EEG signals, [19] used a band-pass filter and second-order Butterworth filters to extract the EEG data in the bandwidth 0.5 Hz-30 Hz.
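A 0.5-30 Hz band-pass step of this kind can be sketched with SciPy; the function name, the zero-phase filtering choice and the test signal below are our own illustration, not the exact pipeline of [19]:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(signal, fs=256.0, low=0.5, high=30.0, order=2):
    """Zero-phase second-order Butterworth band-pass (0.5-30 Hz by default)."""
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, signal)  # forward-backward filtering avoids phase shift

# Example: a 10 Hz component passes, a 60 Hz component is attenuated.
fs = 256.0
t = np.arange(0, 4.0, 1.0 / fs)
x = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 60 * t)
y = bandpass(x, fs)
```

Zero-phase filtering (filtfilt) is a common choice for off-line EEG preprocessing because it does not shift seizure onsets in time.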
Based on signal processing techniques, Zandi et al. proposed a wavelet-based algorithm for real-time detection of epileptic seizures using scalp EEG [6]. In this algorithm, the EEG from each channel was decomposed by the wavelet packet transform, and a patient-specific measure was developed using wavelet coefficients to separate the seizure and non-seizure states. Utilizing the measure, a combined seizure index was derived for each epoch of every EEG channel. Appropriate channel alarms were generated by inspecting the combined seizure index.

Acharya et al. [24] presented a method for the automatic detection of normal, pre-ictal, and ictal conditions from EEG signals. Four entropy features, including approximate entropy, sample entropy, and two phase entropies, were extracted. The extracted features were input into a classifier for classification. Over the EEG data set provided by the University of Bonn, seven classifiers were fed with the extracted entropies to show the effectiveness of the features.

Zhou et al. [15] proposed a seizure detection algorithm using lacunarity and Bayesian linear discriminant analysis (BLDA). In the algorithm, wavelet decomposition of the EEG was conducted with five scales, and the wavelet coefficients at scales 3, 4, and 5 were selected. Features including lacunarity and fluctuation index were extracted from the selected scales, and they were then fed to the BLDA for training and classification. Patient-specific experiments were performed on intracranial EEG data from the Epilepsy Center of the University Hospital of Freiburg. The obtained average sensitivity was 96.25%, with an average false detection rate of 0.13 per hour and a mean delay time of 13.8 s.
The obtained precision results for eleven patients were less than 50%.

By leveraging network technologies, Fan and Chou [9] utilized a complex network model to represent EEG signals, and integrated it with spectral graph theory to extract spatial-temporal synchronization patterns for detecting seizure onsets in real time. The method was tested on 23 patients from the CHB-MIT data set. The resulting patient-specific sensitivity surpassed the benchmark methods.
Recently, deep learning techniques have developed rapidly and been applied to the seizure/non-seizure classification problem.

Vidyaratne et al. [13] proposed a deep recurrent architecture combining a cellular neural network and a bidirectional recurrent neural network. The bidirectional recurrent neural network was deployed into each cell of the cellular neural network, and was utilized to extract temporal features in the forward and the backward directions. Each cell interacts with its neighbor cells to extract local spatial-temporal features. The computed results of the cellular neural network were output into a multi-layered perceptron, in which samples were classified based on a trained threshold. In order to satisfy the input requirements of the cellular neural network, the authors proposed a mapping which organizes EEG signals into a 2D grid arrangement. Patient-specific experiments were conducted on the EEG data of five patients from the CHB-MIT data set. The obtained sensitivities were all 100% for the five patients. In their experiments, the raw EEG data were preprocessed with a bandpass filter between 3 Hz and 30 Hz in order to extract seizure activity data.

Golmohammadi et al. [16] explored the seizure-detection performances of two neural networks over the TUH EEG Corpus. Their experimental results showed that the convolutional long short-term memory (LSTM) network is better than the convolutional GRU network. The impacts of initialization methods and regularization methods on the performance were also examined. The two models in [16] did not utilize an attention mechanism.

Hussein et al. [18] designed a deep neural network for seizure/non-seizure classification using LSTM as the main module. The approach extracts temporal features using LSTM. Evaluation was performed on the EEG data set provided by the University of Bonn. Testing results mostly reached 100%. In [17], Acharya et al.
presented a 13-layer deep neural network for seizure/non-seizure classification using a convolutional neural network (CNN). Over the Bonn EEG data set, the obtained average sensitivity and specificity were 95% and 90%, respectively. In the experiments of [18] and [17], the two approaches extracted seizure features from the data of one channel to conduct classification; each record in the Bonn EEG data set contains data from only one channel.

In [4], Thodoroff et al. designed a recurrent convolutional neural network to capture spectral, spatial and temporal patterns of seizures. The EEG signals were first transformed into images using polar projection, cubic interpolation, and the fast Fourier transform. The image-based representation of EEG signals was to exploit the spatial locality of seizures. The created images were fed to the convolutional neural network, whose output vectors were organized into sequences in chronological order. The sequences were then input into a bidirectional recurrent neural network to produce seizure/non-seizure classification results. Both patient-specific experiments and cross-patient experiments were performed. The patient-specific experiment results were similar to the results in [7], and the cross-patient testing sensitivity was 85% on average. In the two kinds of experiments, the convolutional neural network was pre-trained alone, and transfer learning was utilized to overcome the problem of the small amount of data in the patient-specific experiments. The proposed recurrent convolutional neural network in [4] is complicated.

Ansari et al. [25] aimed to automatically optimize feature selection for seizure detection. They utilized a deep CNN to extract optimal features, and then fed the features to a random forest for classification. In the evaluation experiments, EEG recordings of 26 and 22 neonates were taken as training data and testing data, respectively.
A false alarm rate of 0.9 per hour and a sensitivity of 77% were achieved. The proposed method needed no predefined features, and surpassed three classic feature-based approaches.
3. Methods
EEG signal data is an important modality for the diagnosis of epilepsy. It is generally collected by placing electrodes on the scalp. Each electrode records brain activities in the brain region where it is located. As different brain regions play different roles in the seizure process, the data collected at different brain regions record different characteristics of seizures. From the observations in [7], the differences between seizure data and non-seizure data are related to channels. To exploit the differences of signals from different brain regions, we use an attention mechanism to assign different weights to data from different channels.

Brain activities are continuous, and EEG signals can be regarded as continuous records of brain activities when ignoring the sampling effects. The brain activity at a time point is correlated with past signal data, and can also be analyzed from future signal data. This two-direction analysis helps extract more seizure features. To leverage correlations from both directions, we use a BiLSTM to analyze the EEG sequence data.

The EEG signal is dynamic and non-linear. Due to the dynamic nature, certain statistical characteristics of EEG signals change over time. However, EEG signal segments have similar statistical temporal and spectral features over a sufficiently small time duration [18, 27]. After the bidirectional processing, the sequence is split into time-step patches. Each patch only contains data in one time step. Features are further extracted from the patches through separate, concurrent fully connected operations.

Based on the above three ideas and inspired by [18], we develop a new approach of BiLSTM with attention (shortly, attention BiLSTM) to classify seizure segments and non-seizure segments. Raw EEG signals are split into data segments according to a fixed time span.
The split data segments are automatically weighted through an attention mechanism, i.e., for each segment, signal data from different channels are multiplied by different weights. The weights are obtained through a fully connected module and a non-linear function during training. After the weighting, the data segments are fed to the bidirectional LSTM module. The BiLSTM module extracts features in both the forward and backward directions. For the output sequences of the BiLSTM, the data at each time step are separately input into a fully connected module. Then, the extracted features are averaged over all the time steps to obtain global features of a segment. Finally, the labels of data segments are calculated by a fully connected module with the Softmax function.
Our model architecture consists of five modules: an attention layer, a BiLSTM module, a time-distributed fully-connected layer, a pooling layer and a fully-connected layer with Softmax. The designed architecture is presented in Fig. 1.
The attention layer, shown in Fig. 2, generates attention weights for each channel and then executes an element-wise multiplication. The original data are input into a fully connected module with a non-linear activation function. The outputs of the fully connected module are averaged over all the time steps. Then, the obtained average values are copied to be shared across all time steps. In this way, an attention weight matrix is achieved. Finally, the attention matrix is element-wise multiplied with the original inputs.

Fig. 1. Architecture of the proposed approach. T7-P7, F3-C3, P4-O2 and F8-T8 represent channels. W_1, W_2, W_3 and W_4 are the weights on the four channels, respectively.

The attention layer is computed using the following equations:

Y_1 = f_re1(X)   (1)
Y_2 = σ(Y_1 * W_al + B_al)   (2)
Y_3 = f_re2(Y_2)   (3)
Y_4 = f_av(Y_3)   (4)
Y_5 = f_cy(Y_4)   (5)
Y_al = X ⊙ Y_5   (6)

Here, X denotes an input tensor of size (n_sm, n_sp, n_ch). The symbols n_sm, n_sp and n_ch represent the number of samples, the number of time steps, and the number of signal channels, respectively. Y_1 is a matrix of size (n_ss, n_ch) with n_ss = n_sm * n_sp; W_al is a weight matrix of size (n_ch, n_ch); B_al is a bias matrix of size (n_ss, n_ch); and Y_2 has size (n_ss, n_ch). The symbol σ(·) represents a non-linear function, such as softmax(·) or sigmoid(·). Y_3 is a matrix of size (n_sm, n_sp, n_ch), Y_4 of size (n_sm, n_ch), Y_5 of size (n_sm, n_sp, n_ch), and Y_al is the output matrix of the attention layer with shape (n_sm, n_sp, n_ch). The functions f_re1(·) and f_re2(·) reshape a matrix, f_av(·) computes averages along the second axis of a matrix, and f_cy(·) is a copying operation that shares the averages over all the time steps. The symbol ⊙ means element-wise multiplication between matrices.

Fig. 2.
Workflow of the attention layer.
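The computation in Eqs. (1)-(6) can be sketched in NumPy as follows; the choice of sigmoid for σ(·), the random initialization and all dimension values are illustrative assumptions, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_layer(X, W_al, B_al):
    """Channel attention, a sketch of Eqs. (1)-(6).

    X: (n_sm, n_sp, n_ch) segments; W_al: (n_ch, n_ch); B_al: (n_ss, n_ch)."""
    n_sm, n_sp, n_ch = X.shape
    Y1 = X.reshape(n_sm * n_sp, n_ch)             # Eq. (1): flatten time steps
    Y2 = sigmoid(Y1 @ W_al + B_al)                # Eq. (2): non-linear channel scoring
    Y3 = Y2.reshape(n_sm, n_sp, n_ch)             # Eq. (3): restore the segment shape
    Y4 = Y3.mean(axis=1)                          # Eq. (4): average over time steps
    Y5 = np.repeat(Y4[:, None, :], n_sp, axis=1)  # Eq. (5): copy to every time step
    return X * Y5                                 # Eq. (6): element-wise re-weighting

rng = np.random.default_rng(0)
n_sm, n_sp, n_ch = 4, 23, 17                      # e.g. 23 time steps, 17 channels
X = rng.standard_normal((n_sm, n_sp, n_ch))
W_al = rng.standard_normal((n_ch, n_ch)) * 0.1
B_al = np.zeros((n_sm * n_sp, n_ch))
Y_al = attention_layer(X, W_al, B_al)
```

Note that because of the averaging and copying in Eqs. (4)-(5), each sample gets one weight per channel that is shared across all time steps of the segment.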
The BiLSTM module processes the input sequence separately in the forward order and the backward order, and synthesizes the forward outputs and the backward outputs [28, 29]. Its main procedure is presented in Fig. 3. In either the forward or the backward order, the sequence is computed in the same way as in an LSTM, in which the computation can be described by Eqs. (7)-(12) according to [30] and [31]. The synthesizing operation can be concatenation or summation.

Fig. 3.
Workflow of the BiLSTM module.
Block input:   C̃_t = φ(X_in,t * W_ce + Y_bo,t−1 * R_ce + B_ce)   (7)
Input gate:    G_ig,t = σ(X_in,t * W_ig + Y_bo,t−1 * R_ig + B_ig)   (8)
Forget gate:   G_fg,t = σ(X_in,t * W_fg + Y_bo,t−1 * R_fg + B_fg)   (9)
Output gate:   G_og,t = σ(X_in,t * W_og + Y_bo,t−1 * R_og + B_og)   (10)
Cell:          C_t = C_{t−1} ⊙ G_fg,t + C̃_t ⊙ G_ig,t   (11)
Block output:  Y_bo,t = ψ(C_t) ⊙ G_og,t   (12)

Here, X_in,t is an input matrix of size (n_sm, n_ch) at time step t, and Y_bo,t is an output matrix of size (n_sm, n_fe1) at time step t, where n_fe1 is the dimensionality of the extracted feature space. The matrices G_ig,t, G_fg,t, G_og,t, C̃_t, and C_t represent the input gate state, forget gate state, output gate state, block input, and cell state at time step t, respectively. The input weight matrices W_ce, W_ig, W_fg and W_og have shape (n_ch, n_fe1). The recurrent weight matrices R_ce, R_ig, R_fg, and R_og are of size (n_fe1, n_fe1). The bias matrices B_ce, B_ig, B_fg, and B_og are of size (n_sm, n_fe1). φ(·), σ(·), and ψ(·) are non-linear activation functions. The symbol ⊙ means element-wise multiplication.

The output Y_al of the attention layer is split into n_sp components according to time steps, i.e., X_1, X_2, ..., X_{n_sp}, with each one being a matrix of size (n_sm, n_ch). These components form a sequence X_1 X_2 ··· X_{n_sp} in chronological order. For this sequence, the variable X_in,t in Eq. (7) takes different values in the forward and the backward order: its value at time step t in the forward order is X_t, and in the backward order it is X_{n_sp−t+1}. Based on Eqs. (7)-(12), a forward output sequence Y_fd is obtained in the forward order, and a backward output sequence Y_bd in the backward order. We use Y_fd(t) to denote the t-th item in the sequence Y_fd, i.e., the forward output at time step t, and Y_bd(t) for the backward output at time step t.
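A minimal NumPy sketch of one LSTM step, Eqs. (7)-(12), assuming φ = ψ = tanh and σ = sigmoid (common choices; the equations above leave the activations generic), with illustrative random parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X_t, Y_prev, C_prev, P):
    """One LSTM time step, a sketch of Eqs. (7)-(12)."""
    C_tilde = np.tanh(X_t @ P["W_ce"] + Y_prev @ P["R_ce"] + P["B_ce"])  # Eq. (7)
    G_ig = sigmoid(X_t @ P["W_ig"] + Y_prev @ P["R_ig"] + P["B_ig"])     # Eq. (8)
    G_fg = sigmoid(X_t @ P["W_fg"] + Y_prev @ P["R_fg"] + P["B_fg"])     # Eq. (9)
    G_og = sigmoid(X_t @ P["W_og"] + Y_prev @ P["R_og"] + P["B_og"])     # Eq. (10)
    C_t = C_prev * G_fg + C_tilde * G_ig                                 # Eq. (11)
    Y_t = np.tanh(C_t) * G_og                                            # Eq. (12)
    return Y_t, C_t

rng = np.random.default_rng(1)
n_sm, n_ch, n_fe1 = 4, 17, 8                     # illustrative dimensions
P = {}
for gate in ("ce", "ig", "fg", "og"):            # block input and the three gates
    P["W_" + gate] = rng.standard_normal((n_ch, n_fe1)) * 0.1
    P["R_" + gate] = rng.standard_normal((n_fe1, n_fe1)) * 0.1
    P["B_" + gate] = np.zeros((n_sm, n_fe1))
X_t = rng.standard_normal((n_sm, n_ch))
Y_t, C_t = lstm_step(X_t, np.zeros((n_sm, n_fe1)), np.zeros((n_sm, n_fe1)), P)
```

Running the same step over the sequence forward and over its reversal yields the two output sequences Y_fd and Y_bd used by the BiLSTM module.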
The two output sequences Y_fd and Y_bd are then synthesized as follows:

Y_blm(t) = Φ(Y_fd(t), Y_bd(n_sp − t + 1))   (13)

Here, t = 1, ..., n_sp. Φ(·) denotes an operation with two options, concatenation or summation. Y_blm represents the synthesized sequence of the forward output sequence and the backward output sequence, and Y_blm(t), of size (n_sm, n_fe2), is the t-th item in the sequence Y_blm, i.e., the output of the BiLSTM module at time step t. n_fe2 is the dimensionality of the output space of the BiLSTM module.

The time-distributed fully-connected layer further extracts features at each time step. It executes fully-connected operations separately and simultaneously for the inputs at each time step, and these fully-connected operations use linear activation functions. The time-distributed layer helps improve execution efficiency when processing signal data with a high sampling frequency. At each time step, the computation is described as follows:

Y_dl(t) = Y_blm(t) * W_dl + B_dl   (14)

Here, t = 1, 2, ..., n_sp. The matrix Y_dl(t), of size (n_sm, n_fe3), is the output at time step t of the time-distributed fully-connected layer, where n_fe3 is the dimensionality of the extracted feature space in the time-distributed layer. W_dl denotes a weight matrix of size (n_fe2, n_fe3), and B_dl a bias matrix of size (n_sm, n_fe3). All the time-step components {Y_dl(t), t = 1, ..., n_sp} compose a matrix Y_dl of size (n_sm, n_sp, n_fe3), which is the output of the time-distributed fully-connected layer.

The pooling layer in our architecture executes an average pooling operation in order to extract the global features of each sample. The operation computes the mean value over the time-step data for each sample in the output matrix Y_dl of the time-distributed fully-connected layer, and outputs a matrix Y_ap of size (n_sm, n_fe3).
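Eqs. (13)-(14) and the average pooling step can be sketched as follows; Φ is taken to be concatenation, the forward/backward sequences are random stand-ins, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sm, n_sp, n_fe1, n_fe3 = 4, 23, 8, 6
# Stand-ins for the forward and backward LSTM output sequences.
Y_fd = [rng.standard_normal((n_sm, n_fe1)) for _ in range(n_sp)]
Y_bd = [rng.standard_normal((n_sm, n_fe1)) for _ in range(n_sp)]

# Eq. (13) with Phi = concatenation: pair the forward output at step t
# with the backward output at step n_sp - t + 1 (0-based: n_sp - 1 - t).
Y_blm = [np.concatenate([Y_fd[t], Y_bd[n_sp - 1 - t]], axis=1) for t in range(n_sp)]
n_fe2 = 2 * n_fe1                                 # concatenation doubles the width

# Eq. (14): one shared linear map applied separately at every time step.
W_dl = rng.standard_normal((n_fe2, n_fe3)) * 0.1
B_dl = np.zeros((n_sm, n_fe3))
Y_dl = np.stack([Y_blm[t] @ W_dl + B_dl for t in range(n_sp)], axis=1)

# Average pooling over time steps yields the global features of each sample.
Y_ap = Y_dl.mean(axis=1)
```

With summation instead of concatenation, n_fe2 would equal n_fe1 and the two terms in Eq. (13) would simply be added.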
The fully connected layer executes a fully connected operation to extract further features and to reduce the last dimension of the input matrix to the number of classes. It uses a linear activation function. Based on the outputs of the fully-connected layer, the Softmax layer computes the probability that each sample belongs to each class. Eq. (15) and Eq. (16) present the computations in the fully-connected layer and in the Softmax layer:

Y_fcl = Y_ap * W_fcl + B_fcl   (15)
Y_sl = softmax(Y_fcl)   (16)

Here, W_fcl and B_fcl denote a weight matrix of size (n_fe3, n_c) and a bias matrix of size (n_sm, n_c), respectively, where n_c is the number of classes. Y_fcl is the output matrix, of size (n_sm, n_c), of the fully-connected layer. The function softmax(·) calculates the probability that each sample belongs to each class. Y_sl is the output of the Softmax layer.

The pseudo-code of the proposed seizure/non-seizure classification approach of BiLSTM with attention is shown in Algorithm 1.

Algorithm 1.
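A sketch of Eqs. (15)-(16) with illustrative dimensions and random stand-ins for the pooled features and the layer parameters:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
n_sm, n_fe3, n_c = 4, 6, 2                    # two classes: seizure / non-seizure
Y_ap = rng.standard_normal((n_sm, n_fe3))     # stand-in for the pooled features
W_fcl = rng.standard_normal((n_fe3, n_c)) * 0.1
B_fcl = np.zeros((n_sm, n_c))
Y_fcl = Y_ap @ W_fcl + B_fcl                  # Eq. (15): linear fully-connected layer
Y_sl = softmax(Y_fcl)                         # Eq. (16): class probabilities
Y_pred = Y_sl.argmax(axis=1)                  # column of the row maximum = predicted label
```

The final argmax corresponds to the last step of Algorithm 1, which takes the column position of the maximal element in each row of Y_sl.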
Seizure/Non-seizure Classification over EEG Data using the Attention BiLSTM Approach
Input: X, the matrix of EEG data segments
Output: Y_pred, the matrix of classification results

Initialize the matrices W_al, B_al, W_ce, W_ig, W_fg, W_og, R_ce, R_ig, R_fg, R_og, B_ce, B_ig, B_fg, B_og, W_dl, B_dl, W_fcl, B_fcl
Compute the output matrix Y_al using the input X and Eqs. (1)-(6)
Split Y_al into n_sp components {X_1, X_2, ..., X_{n_sp}} according to time steps, and compose a sequence X_1 X_2 ··· X_{n_sp} in chronological order
Compute a forward output sequence Y_fd for the sequence X_1 X_2 ··· X_{n_sp} based on Eqs. (7)-(12)
Compute a backward output sequence Y_bd for the inverse sequence X_{n_sp} ··· X_2 X_1 based on Eqs. (7)-(12)
Synthesize the sequences Y_fd and Y_bd using Eq. (13) to obtain a sequence Y_blm
Compute a sequence Y_dl using Eq. (14), and then compose the matrix Y_dl according to time steps
Compute the matrix Y_ap by averaging values over time steps for each sample in Y_dl
Compute the matrix Y_sl according to Eqs. (15) and (16)
Compute the column position of the maximal element in each row of Y_sl to obtain the classification results Y_pred
Return Y_pred

4. Evaluation

In this section, we evaluate the approach of BiLSTM with attention by performing cross-validation experiments and cross-patient experiments on the noisy scalp EEG data set CHB-MIT. Our evaluation mainly adopts three standard metrics: sensitivity, specificity and precision. In a cross-validation experiment, the data from all the patients are randomly split into three mutually disjoint sets, i.e., a training set, a validation set and a testing set. The training set and validation set are used to train a model, and the testing set is used to assess the trained model. To reduce variability, ten rounds of cross-validation are performed for each seizure/non-seizure classification approach in our experiments. Then, average values and standard deviations over the results of the ten rounds are calculated.
A cross-patient experiment selects one patient as the testing subject and uses all the other patients as training and validation subjects. Data from the training and validation subjects are used to train a model, and data from the testing subject are used to test the trained model. In our cross-patient experiments, each of the 23 patients in CHB-MIT is in turn selected as the test subject to assess the performance of our proposed approach, and then the overall performance over the 23 patients is analyzed.
The data set CHB-MIT contains 686 EEG recordings from 23 patients with ages ranging from 1.5 to 22 years. The recordings include 198 seizures. The sampling frequency is 256 Hz. Each recording contains a set of EEG signals over different channels. Most recordings are one hour long, and some last two or four hours. The EEG recordings are grouped into 24 cases and stored in EDF files, with each EDF file corresponding to one EEG recording. In each case, the signal data were recorded from a single patient; Case Chb21 was obtained 1.5 years after Case Chb01 from the same patient. Each data file contains data over 23 or more channels. In some data files, the data over some channels are missing, and some data files, for example Chb12_27.edf, Chb12_28.edf and Chb12_29.edf, have channel montages different from the other seizure files. We did not use the data in these three EDF files in our experiments.
To extract effective seizure features, 17 common channels were selected; that is, for each patient, the data of these 17 channels were used for seizure/non-seizure feature extraction. The 17 common channels were P4-O2, FP2-F4, P7-O1, C4-P4, F7-T7, C3-P3, FP1-F7, F8-T8, FZ-CZ, CZ-PZ, F3-C3, T7-P7, P8-O2, FP1-F3, F4-C4, FP2-F8, and P3-O1. Each data record was split from beginning to end into non-overlapping segments of 23 seconds. According to the annotation files, which mark the starting and ending time of each seizure, it could be determined whether a data segment contains a seizure or not. In our experiments, a segment containing seizure signal was considered a seizure segment; otherwise, it was a non-seizure segment. In the seizure segments, the lengths of seizure data varied from 1 s to 23 s, with an average length of 16.9 s. Among all seizure segments, 14.7% contained less than 7 s of seizure signal, 76.1% contained more than 10 s, and 59.8% contained more than 17 s. The splitting yielded 665 seizure segments, which were taken as one part of our experiment data. For evaluation over balanced data, 665 non-seizure segments were randomly selected from all the non-seizure segments in each experiment.
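The segmentation and labeling procedure described above can be sketched as follows, assuming EEG records as NumPy arrays of shape (channels, samples) and seizure annotations as (start, end) intervals in seconds; `segment_record` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def segment_record(record, fs=256, win_s=23, seizures=()):
    """Split a (channels, samples) record into non-overlapping win_s-second
    segments and label each segment 1 if it overlaps an annotated seizure."""
    win = fs * win_s                       # samples per segment
    n_seg = record.shape[1] // win         # drop the incomplete tail
    segs, labels = [], []
    for k in range(n_seg):
        t0, t1 = k * win_s, (k + 1) * win_s        # segment span in seconds
        segs.append(record[:, k * win:(k + 1) * win])
        # seizure label from annotated (start_s, end_s) intervals
        labels.append(int(any(s < t1 and e > t0 for s, e in seizures)))
    return np.stack(segs), np.array(labels)
```

A segment counts as a seizure segment as soon as any annotated interval overlaps it, which matches the paper's observation that seizure content per segment varies from 1 s to 23 s.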
The deep learning approach in [18] uses LSTM as its main module (the LSTM approach, for short) to detect seizures. The LSTM approach was evaluated through cross-validation experiments over the EEG data set from the University of Bonn [32], showing state-of-the-art performance. We compare our approach with the LSTM approach, and also with a convolutional neural network approach (the CNN approach, for short) in [17]. Since the Bonn EEG data set is strictly processed, contains no artifacts, and is small in size, we chose the noisy CHB-MIT data set for the cross-validation experiments. The LSTM approach [18] and the CNN approach [17] do not provide all of their source code, so we implemented the two approaches according to their descriptions. The implemented LSTM and CNN approaches were tested, and the testing results reproduced the performances reported in [18] and [17]. Based on the two implementations, we then experimented with the CHB-MIT data set to compare them with our attention BiLSTM approach. In each cross-validation experiment, all the seizure segments were used as one part of the experiment data, and the same number of non-seizure segments were randomly selected. The training, validation and testing sets were obtained by randomly splitting the experiment data in the ratio 70:15:15. We tuned the parameters of the three approaches, the LSTM approach, the CNN approach, and our attention BiLSTM approach, to achieve the best performance of each, and for each approach ten cross-validation experiments were carried out with the well-tuned parameters. For the cross-validation experiments using the LSTM approach, the parameters were set as follows: the number of hidden states in the LSTM layer was 120, that in the time-distributed computing layer was 60, the optimizer was RMSprop, the learning rate was 0.0007, the batch size was 30, and the number of epochs was 30.
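The random 70:15:15 split can be sketched as follows (`split_70_15_15` is our own illustrative name, not from the paper):

```python
import numpy as np

def split_70_15_15(n, seed=0):
    """Randomly partition n sample indices into train/val/test by 70:15:15."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_va = round(0.70 * n), round(0.15 * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
```

Shuffling before slicing makes the three sets mutually disjoint and randomly drawn, as the protocol requires.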
The CNN approach in [17] contains five convolutional layers, five max pooling layers, and three fully connected layers. Its parameter setting in our cross-validation experiments was as follows: the number of hidden states in each of the first two convolutional layers was 100, that in each of the next two convolutional layers was 200, that in the fifth convolutional layer was 260, that in the first fully connected layer was 100, that in the second fully connected layer was 50, the parameter alpha of the LeakyReLU activation function was 0.01, the optimizer was Adam, the learning rate was 0.001, the batch size was 30, and the number of epochs was 50. For the proposed approach of BiLSTM with attention, the well-tuned parameters in the cross-validation experiments were as follows: the number of hidden states in the bidirectional LSTM layer was 140, that in the time-distributed layer was 70, the merging mode in the bidirectional LSTM was concatenation, the optimizer was RMSprop, the learning rate was 0.0013, the batch size was 30, and the number of epochs was 35. The total number of trainable parameters was 197,078. The cross-validation results using the LSTM approach, including sensitivity, specificity, F1 score, precision, accuracy, and their averages and standard deviations, are shown in Table 1. The results using the CNN approach and our attention BiLSTM approach are presented in Tables 2 and 3, respectively.
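As an illustration of the attention BiLSTM pipeline summarized in the algorithm above (attention layer, forward/backward LSTM passes, time-distributed dense layer, average pooling, softmax classifier), the following is a minimal NumPy sketch of the forward pass with toy sizes and random, untrained parameters. The exact forms of Eqs. (1)-(16) are assumed here (tanh scores with softmax normalization for the attention layer, standard LSTM gates), so this shows the computation flow rather than the trained model:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(seq, Wx, Wh, b):
    """One-directional LSTM over a (timesteps, features) sequence."""
    h = np.zeros(Wh.shape[0])
    c = np.zeros_like(h)
    out = []
    for x in seq:
        z = x @ Wx + h @ Wh + b
        i, f, g, o = np.split(z, 4)           # input/forget/cell/output parts
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        out.append(h)
    return np.stack(out)

def init_params(n_ch, h, d_dl, n_classes, seed=0):
    """Random (untrained) parameters with the shapes the pipeline needs."""
    r = np.random.default_rng(seed)
    g = lambda *s: 0.1 * r.standard_normal(s)
    return {"W_al": g(n_ch, n_ch), "B_al": g(n_ch),
            "Wx_f": g(n_ch, 4 * h), "Wh_f": g(h, 4 * h), "b_f": g(4 * h),
            "Wx_b": g(n_ch, 4 * h), "Wh_b": g(h, 4 * h), "b_b": g(4 * h),
            "W_dl": g(2 * h, d_dl), "B_dl": g(d_dl),
            "W_fcl": g(d_dl, n_classes), "B_fcl": g(n_classes)}

def classify_segment(X, P):
    """Forward pass over one (timesteps, channels) segment X."""
    # attention layer: input-dependent channel weights
    Y_al = softmax(np.tanh(X @ P["W_al"] + P["B_al"]), axis=-1) * X
    # forward and backward LSTM passes, concatenated per time step
    Y_fd = lstm_pass(Y_al, P["Wx_f"], P["Wh_f"], P["b_f"])
    Y_bd = lstm_pass(Y_al[::-1], P["Wx_b"], P["Wh_b"], P["b_b"])[::-1]
    Y_blm = np.concatenate([Y_fd, Y_bd], axis=-1)
    # time-distributed dense layer, then averaging over time steps
    Y_dl = np.tanh(Y_blm @ P["W_dl"] + P["B_dl"])
    y_ap = Y_dl.mean(axis=0)
    # softmax classifier and argmax decision
    y_sl = softmax(y_ap @ P["W_fcl"] + P["B_fcl"])
    return y_sl, int(np.argmax(y_sl))
```

With the paper's tuned configuration, `h` would be 140 hidden states per LSTM direction and `d_dl` 70 units in the time-distributed layer.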
Table 1
Cross-validation results using the LSTM approach.
Item   Sens.    Spec.    F1 Sco.  Prec.    Accu.
1      0.8500   0.8800   0.8629   0.8763   0.8650
2      0.7700   0.8500   0.8021   0.8370   0.8100
3      0.7900   0.8700   0.8229   0.8587   0.8300
4      0.7100   0.9300   0.7978   0.9103   0.8200
5      0.8200   0.8900   0.8497   0.8817   0.8550
6      0.9100   0.7900   0.8585   0.8125   0.8500
7      0.8600   0.8300   0.8473   0.8350   0.8450
8      0.8600   0.8400   0.8515   0.8431   0.8500
9      0.9400   0.7200   0.8468   0.7705   0.8300
10     0.9300   0.8300   0.8857   0.8455   0.8800
Ave.   0.8440   0.8430   0.8425   0.8470   0.8435
Std.   0.0696   0.0550   0.0259   0.0368   0.0201
Sens. is an abbreviation for Sensitivity, Spec. for Specificity, F1 Sco. for F1 Score, Prec. for Precision, Accu. for Accuracy, Ave. for Average, and Std. for Standard Deviation. These abbreviations are also used in Tables 2, 3, and 4.
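The five reported metrics follow from the binary confusion counts; a minimal sketch (the function name `metrics` is our own):

```python
import numpy as np

def metrics(y_true, y_pred):
    """Sensitivity, specificity, precision, F1 and accuracy from binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # seizures correctly flagged
    tn = np.sum((y_true == 0) & (y_pred == 0))   # non-seizures correctly passed
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    acc = (tp + tn) / y_true.size
    return sens, spec, prec, f1, acc
```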
For the LSTM approach, the achieved average sensitivity, specificity and precision are 84.4%, 84.3% and 84.7%, respectively. With the attention BiLSTM approach, the obtained average sensitivity of 87%, specificity of 88.6% and precision of 88.63% are better than those of the LSTM approach. In F1 score and accuracy, the attention BiLSTM approach also exceeds the LSTM approach, and its standard deviations are mostly smaller. The proposed attention BiLSTM approach thus not only classifies seizures more accurately than the LSTM approach but is also more stable. For the CNN approach, the obtained average sensitivity, specificity and precision are 84.8%, 81.0% and 82.56%, respectively. Our model outperforms the CNN approach in sensitivity, specificity and precision, and also achieves higher average accuracy and average F1 score.

Table 2
Cross-validation results using the CNN approach.

Item   Sens.    Spec.    F1 Sco.  Prec.    Accu.
1      0.8400   0.8500   0.8442   0.8485   0.8450
2      0.9200   0.7700   0.8558   0.8000   0.8450
3      0.8000   0.8400   0.8163   0.8333   0.8200
4      0.9000   0.6900   0.8145   0.7438   0.7950
5      0.9200   0.8000   0.8679   0.8214   0.8600
6      0.7900   0.8500   0.8144   0.8404   0.8200
7      0.6300   0.9700   0.7590   0.9545   0.8000
8      0.8500   0.8700   0.8586   0.8673   0.8600
9      0.8700   0.7700   0.8286   0.7909   0.8200
10     0.9600   0.6900   0.8458   0.7559   0.8250
Ave.   0.8480   0.8100   0.8305   0.8256   0.8290
Std.   0.0891   0.0809   0.0301   0.0571   0.0217

Table 3
Cross-validation results using the attention BiLSTM approach.

Item   Sens.    Spec.    F1 Sco.  Prec.    Accu.
1      0.8800   0.9000   0.8889   0.8980   0.8900
2      0.8400   0.9200   0.8750   0.9130   0.8800
3      0.8600   0.8400   0.8515   0.8431   0.8500
4      0.9400   0.7900   0.8744   0.8174   0.8650
5      0.9100   0.8600   0.8878   0.8667   0.8850
6      0.8800   0.9000   0.8889   0.8980   0.8900
7      0.8200   0.8600   0.8367   0.8542   0.8400
8      0.8900   0.9500   0.9175   0.9468   0.9200
9      0.8200   0.9000   0.8542   0.8913   0.8600
10     0.8600   0.9400   0.8958   0.9348   0.9000
Ave.   0.8700   0.8860   0.8771   0.8863   0.8780
Std.   0.0363   0.0463   0.0228   0.0388   0.0230

The standard deviations of our method are also smaller than those of the CNN approach. These experimental results show that the proposed attention BiLSTM approach performs better in seizure/non-seizure classification than the CNN approach.
For cross-patient seizure/non-seizure classification, each experiment takes the data of one patient as testing data, and the data of the other patients as training and validation data in the ratio 85:15. Because the two cases Chb01 and Chb21 are records from the same patient, they were used together, either as testing data or as training-validation data. In each experiment, all the seizure data segments from each patient were used, and the same number of non-seizure data segments were randomly selected, so the data in each experiment were balanced. For each patient, we used her/his EEG data as testing data and the data of the other patients as training-validation data, and obtained the sensitivity, specificity, F1 score, precision, and accuracy listed in Table 4. Fig. 4 shows the sensitivities, specificities and precisions as a bar chart. For the 23 patients in CHB-MIT, the average sensitivity, specificity, precision, and accuracy are 83.72%, 84.06%, 85.36%, and 83.89%, respectively, and the standard deviations of sensitivity, specificity and precision are 0.1349, 0.1379, and 0.1020, respectively.
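The leave-one-patient-out protocol, with Chb01 and Chb21 kept together because they come from the same patient, can be sketched as follows (`cross_patient_folds` is our own name):

```python
def cross_patient_folds(cases):
    """Yield (test_cases, train_cases) folds over CHB-MIT cases,
    keeping Chb01 and Chb21 in the same fold."""
    units = [("Chb01", "Chb21")] + [(c,) for c in cases
                                    if c not in ("Chb01", "Chb21")]
    for test in units:
        # every case not in the test unit goes to training/validation
        train = [c for u in units if u is not test for c in u]
        yield list(test), train
```

Over the 24 CHB-MIT cases this yields 23 folds, matching the 23 patients evaluated in Table 4.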
Table 4
Cross-patient experiment results using the attention BiLSTM.
Case       Sens.    Spec.    F1 Sco.  Prec.    Accu.
Chb01,21   0.8974   0.7179   0.8235   0.7609   0.8077
Chb02      0.8000   1.0000   0.8889   1.0000   0.9000
Chb03      0.8846   0.9615   0.9200   0.9583   0.9231
Chb04      0.9524   0.8095   0.8889   0.8333   0.8810
Chb05      1.0000   0.4286   0.7778   0.6364   0.7143
Chb06      0.8125   0.7500   0.7879   0.7647   0.7813
Chb07      0.9412   0.8824   0.9143   0.8889   0.9118
Chb08      0.9556   0.7333   0.8600   0.7818   0.8444
Chb09      0.9375   0.6250   0.8108   0.7143   0.7813
Chb10      0.9600   0.8800   0.9231   0.8889   0.9200
Chb11      0.9730   0.8649   0.9231   0.8780   0.9189
Chb12      0.5211   0.8451   0.6218   0.7708   0.6831
Chb13      0.6000   0.8571   0.6885   0.8077   0.7286
Chb14      0.6429   0.9286   0.7500   0.9000   0.7857
Chb15      0.7379   0.9223   0.8128   0.9048   0.8301
Chb16      0.6875   0.6250   0.6667   0.6471   0.6563
Chb17      1.0000   0.8125   0.9143   0.8421   0.9063
Chb18      0.9000   0.9000   0.9000   0.9000   0.9000
Chb19      0.7857   1.0000   0.8800   1.0000   0.8929
Chb20      0.7273   0.9545   0.8205   0.9412   0.8409
Chb22      0.9167   0.9167   0.9167   0.9167   0.9167
Chb23      0.9200   1.0000   0.9583   1.0000   0.9600
Chb24      0.7027   0.9189   0.7879   0.8966   0.8108
Ave.       0.8372   0.8406   0.8363   0.8536   0.8389
Std.       0.1349   0.1379   0.0888   0.1020   0.0833
Fig. 4. (Color online) Bar chart illustrations of cross-patient sensitivity, specificity and precision over 24 cases for the attention BiLSTM.
In [4], Thodoroff et al. utilize a recurrent convolutional neural network (recurrent CNN) and obtain an average sensitivity of 85% in cross-patient experiments over the CHB-MIT data set. According to Fig. 7(a) and Fig. 7(c) in [4], for the six cases Chb06, Chb12, Chb13, Chb14, Chb15 and Chb16, the obtained sensitivities are poor, only around 20% for Chb06 and Chb14, while for the other seventeen cases the sensitivities are mostly 100%. The two cases Chb01 and Chb21 are tested separately for the recurrent CNN. Our method achieved better sensitivities in the above six cases, all exceeding 50%, although the sensitivities of the remaining cases were less than 100%. Fig. 5 presents the sensitivity comparisons between the recurrent CNN method and our approach of BiLSTM with attention for the above six cases, and Fig. 6 shows the sensitivities of the 21 commonly tested cases, which do not include Chb01, Chb21 and Chb24. Over the commonly tested cases, our standard deviations for sensitivity and specificity are 0.1374 and 0.1407, respectively. The results indicate that our sensitivity results are more concentrated, and in this sense the proposed attention BiLSTM approach is more stable.
Fig. 5. (Color online) Comparison of cross-patient sensitivity over 6cases between attention BiLSTM and recurrent CNN.
Fig. 6. (Color online) Comparison of cross-patient sensitivity over 21common cases between attention BiLSTM and recurrent CNN.
5. Model analysis
Our attention mechanism is designed to distinguish signals from different brain regions and produces different weights for the signals. In the attention layer, a kernel matrix and a bias matrix are needed; they are trained together with the other modules in our model. Based on the two matrices, the weights of the channels, which correspond to different brain regions, are calculated from the input data. In fact, different epilepsy patients have different seizure patterns, and EEG signals are dynamic. For one patient, experienced seizures may be of different types and may come from different brain regions. Therefore, it is reasonable for our attention mechanism to calculate channel weights adaptively. Fig. 7 and Fig. 8 show the attention weight distributions over the 17 channels in two data segments from two patients (Chb11 and Chb03), computed by the attention mechanism in the same trained model. These two figures show that our attention mechanism can adaptively calculate the channel weights of signal data from different patients.
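A minimal sketch of this input-dependent channel weighting, assuming (as in common attention formulations, not necessarily the paper's exact Eqs. (1)-(6)) tanh scores normalized by a softmax: the trained kernel and bias are fixed, yet the resulting weights vary with the input segment.

```python
import numpy as np

def channel_weights(segment, W, B):
    """Softmax channel weights for one (timesteps, channels) segment.
    The same trained kernel W and bias B yield different weights for
    different inputs, i.e., the weighting adapts to the segment."""
    scores = np.tanh(segment @ W + B).mean(axis=0)   # pool scores over time
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n_ch = 17
W, B = rng.standard_normal((n_ch, n_ch)), rng.standard_normal(n_ch)
seg_a = rng.standard_normal((23, n_ch))   # stand-ins for two patients'
seg_b = rng.standard_normal((23, n_ch))   # pre-processed segments
w_a, w_b = channel_weights(seg_a, W, B), channel_weights(seg_b, W, B)
```

With fixed (W, B), `w_a` and `w_b` are different distributions over the 17 channels, mirroring the adaptivity illustrated by Fig. 7 and Fig. 8.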
Fig. 7.
Attention weights on channels for a seizure segment in Chb11.
Fig. 8.
Attention weights on channels for a seizure segment in Chb03.
In some areas of the brain, EEG signals during seizures differ considerably from signals at non-seizure times. The differences, such as in frequency and magnitude, can be used to identify seizure and non-seizure. The attention mechanism captures signal characteristics and assigns large weight values to the channels that can distinguish seizure and non-seizure segments. Generally, the greater the differences between the seizure signal and the non-seizure signal over a channel, the greater the weight assigned to that channel. An example of the attention weights of the 17 channels for a seizure segment is shown in Fig. 7; the channels F8-T8, P3-O1 and FP2-F8 have large weights compared to the other channels. In Fig. 9(a) and Fig. 9(b), the actual signals over these three channels (the six purple panels) change much in either frequency or magnitude. For the actual signals over channels P4-O2 and P8-O2 (the four green panels), the changes in Fig. 9(a) and Fig. 9(b) are relatively small, and as shown in Fig. 7, the weights assigned to these two channels are small.
(a) Signals in a non-seizure segment from Chb11. (b) Signals in a seizure segment from Chb11.
Fig. 9. (Color online) Visualization of signals on channels in anon-seizure segment and a seizure segment from Chb11. Purplepanels represent channels with large signal changes, and green panelsfor channels with small signal changes.
The actual signals of channel P4-O2 (the two green panels) in Fig. 10(c) and Fig. 10(d) show small differences in magnitude. The attention mechanism produces a small weight for channel P4-O2, so the corresponding signal data are not treated as critical evidence for classifying seizure/non-seizure. The signals over channels T7-P7, FP2-F8 and P3-O1 (the six purple panels) change a lot from the non-seizure segment in Fig. 10(c) to the seizure segment in Fig. 10(d). Such changes can differentiate seizure and non-seizure segments, so the three channels are assigned large attention weights, as shown in Fig. 8.
(c) Signals in a non-seizure segment from Chb03. (d) Signals in a seizure segment from Chb03.
Fig. 10. (Color online) Visualization of signals on channels in anon-seizure segment and a seizure segment from Chb03. The purplepanels and green panels have the same meanings as in Fig. 9.
The attention BiLSTM approach was developed inspired by the LSTM approach in [18]. During development, the performance of the bidirectional LSTM and of the attention mechanism were explored separately. Using the best-performing parameters from the tuning procedures, ten rounds of cross-validation experiments were performed separately to test the two modules. When testing the bidirectional LSTM module, the parameters were set as follows: the learning rate was 0.001, the number of hidden states in the bidirectional LSTM was 100, that in the time-distributed layer was 50, the optimizer was RMSprop, the batch size was 30, and the number of epochs was 30. For testing the attention mechanism, the parameters were: the learning rate was 0.001, the number of hidden states in the LSTM module was 100, that in the time-distributed layer was 50, the optimizer was RMSprop, the batch size was 30, and the number of epochs was 25. The obtained cross-validation results are shown in Table 5. The results indicate that, compared with the LSTM approach results in Table 1, the bidirectional LSTM alone improves only the sensitivity and the attention mechanism alone enhances only the specificity. After combining the two modules in the attention BiLSTM approach, the sensitivity and the specificity are improved by 2.6% and 4.3%, respectively. Thus, both the bidirectional LSTM and the attention mechanism play important roles in the attention BiLSTM approach for seizure/non-seizure classification.
6. Discussion
In this paper, we design a novel approach of BiLSTM with attention for seizure/non-seizure classification on off-line EEG data. Cross-validation experiments and cross-patient experiments are separately applied to evaluate it on the pediatric data set CHB-MIT. For segmentation, a length of 23 seconds was selected by reference to the segment length in the Bonn EEG data set [32], and each data record in each case was split from beginning to end without overlapping. As a result, 665 seizure segments were obtained, with the lengths of seizure data in them varying from 1 s to 23 s. This diversity of seizure lengths reflects the real-world situation. In each experiment, the 665 seizure segments were taken as one part of the experimental data, and 665 non-seizure segments were randomly selected from the extracted non-seizure segments. The randomness and sparsity of this selection reduce temporal correlations among non-seizure data segments and avoid overly optimistic specificity results [7]. This way of segmenting data records and selecting non-seizure segments makes the evaluation of our approach more reliable. In the cross-validation experiments, the sensitivity, specificity and precision of our approach were better than those of the LSTM approach in [18] and the CNN approach in [17]; the improvements over these two state-of-the-art approaches were 2.6%, 4.3% and 3.93%, and 2.2%, 7.6% and 6.07%, respectively, and the standard deviations were smaller than those of both approaches. As Table 5 shows, the better performance of our approach is attributed to the attention mechanism and to the feature extraction in both forward and backward directions. Among the cross-patient experiment results in Table 4, there exist gaps. For six patients, Chb05, Chb09, Chb12, Chb13, Chb14, and Chb16, either sensitivity or specificity was less than 70%.
For seven patients, Chb03, Chb07, Chb10, Chb11, Chb18, Chb22 and Chb23, all testing results were over 85%. A possible reason is that, for a child, the brain, meninges, skull and head size change over time [33]. Compared with the recurrent CNN method proposed in [4], the performance of our method of BiLSTM with attention was more stable. In [4], the convolutional neural network module in the recurrent CNN is pre-trained before training the whole model; our attention BiLSTM approach needs no pre-training, and it directly processes raw data and extracts features. The REVEAL algorithm proposed in [34] achieved an average sensitivity of 61%, and [5] applied the automatic seizure detection system EpiScan to the CHB-MIT data set and obtained an average sensitivity of 67%. The average sensitivity of our approach is much better than those of REVEAL and EpiScan.

Table 5
Cross-validation results for modules in the attention BiLSTM approach.

Module               Sensitivity  Specificity  F1 Score  Precision  Accuracy
Bidirectional LSTM   0.8630 ± …

The application scenario of our approach is to automatically select all the seizure segments from off-line EEG data records for neurologists' review and analysis. Because the EEG data segments are off-line, extracting features in both the forward and the backward direction and performing analyses are feasible in practice. In this application, our target is to select as many seizure segments as possible and as accurately as possible. For this target, the performance of a seizure-segment-selection method can be measured by sensitivity, specificity and precision rather than by the number of false alarms per hour, so the false alarm rate is not calculated or compared for the proposed approach. Instead of directly training weights on channels, we utilize an attention mechanism to generate the weights. With directly trained weights, the obtained channel weights are the same for all patients. In fact, the seizure patterns of different patients differ, different types of seizures have different patterns, and one patient may have different types of seizures. Therefore, for data segments from different patients, the weights on channels, which describe the strength with which signals signify seizures, need to differ. In our attention mechanism, a kernel matrix and a bias matrix are obtained by training, and transformations combining the two trained matrices with the data segments produce the attention weights for the segments. The attention mechanism thus produces different weights for data segments from different patients, which efficiently helps extract seizure features. When designing the attention mechanism, we tried different ways: one was adding different attention weights over time steps, and another was adding different attention weights over both time steps and channels. Our experimental results using these two ways were not good.
One possible reason is that the role of each brain region in the whole brain state is generally stable over a short duration such as 23 s. Finally, we chose to apply the attention mechanism to channels and to share the attention weights among time steps. Indeed, different channels contribute differently to a seizure, and the contributions turn out to be correlated with the locations of brain regions rather than with time. In addition, we applied our method to single-channel data; the results were not good. This agrees with the observation in [7] that, for some channels, the data morphology in the seizure state is similar to that in the non-seizure state.
7. Conclusions
This paper focuses on the problem of automatic seizure/non-seizure classification. Inspired by the architecture in [18], we analyze both spatial and temporal characteristics of seizures, and propose a novel deep learning-based approach using a BiLSTM model integrated with attention. The attention mechanism is integrated to better capture spatial features, and the BiLSTM model is employed to extract more temporal features. The proposed approach is evaluated on the noisy EEG data set CHB-MIT; the evaluation is across multiple patients and uses data from multiple brain regions. In the cross-validation experiments, we obtain a sensitivity of 87%, a specificity of 88.6% and a precision of 88.63%, which are better than those of the LSTM approach in [18] and the CNN approach in [17]. In the cross-patient experiments, the testing results are, on average, 83.72% sensitivity, 84.06% specificity and 85.36% precision. Compared with the recurrent CNN model in [4], our BiLSTM with attention model is more stable. In the BiLSTM with attention approach, the pooling layer adopts a global-averaging way to extract holistic features of data segments. Whether this is the best way for seizure/non-seizure classification will be explored in the future. We also want to investigate whether the length of data segments affects the sensitivity, the specificity and the precision.
References

[1] I. Megiddo, A. Colson, D. Chisholm, T. Dua, A. Nandi, R. Laxminarayan. Health and economic benefits of public financing of epilepsy treatment in India: An agent-based simulation model. Epilepsia, vol. 57, no. 3, pp. 464-474, 2016. https://doi.org/10.1111/epi.13294.
[2] J. Gotman, J. R. Ives, P. Gloor. Automatic recognition of inter-ictal epileptic activity in prolonged EEG recordings. Electroencephalography and Clinical Neurophysiology, vol. 46, no. 5, pp. 510-520, 1979. https://doi.org/10.1016/0013-4694(79)90004-X.
[3] J. Gotman. Automatic recognition of epileptic seizures in the EEG. Electroencephalography and Clinical Neurophysiology, vol. 54, no. 5, pp. 530-540, 1982. https://doi.org/10.1016/0013-4694(82)90038-4.
[4] P. Thodoroff, J. Pineau, A. Lim. Learning robust features using deep learning for automatic seizure detection. Proceedings of the 1st Machine Learning for Healthcare Conference; Los Angeles, CA, USA; 2016. Journal of Machine Learning Research, vol. 56, pp. 178-190, 2016.
[5] F. Fürbass, P. Ossenblok, M. Hartmann, H. Perko, A. M. Skupch, G. Lindinger, L. Elezi, E. Pataraia, A. J. Colon, C. Baumgartner, T. Kluge. Prospective multi-center study of an automatic online seizure detection system for epilepsy monitoring units. Clinical Neurophysiology, vol. 126, no. 6, pp. 1124-1131, 2015. https://doi.org/10.1016/j.clinph.2014.09.023.
[6] A. S. Zandi, M. Javidan, G. A. Dumont, R. Tafreshi. Automated real-time epileptic seizure detection in scalp EEG recordings using an algorithm based on wavelet packet transform. IEEE Transactions on Biomedical Engineering, vol. 57, no. 7, pp. 1639-1651, 2010. https://doi.org/10.1109/TBME.2010.2046417.
[7] A. Shoeb, J. Guttag. Application of machine learning to epileptic seizure detection. Proceedings of the 27th International Conference on Machine Learning; pp. 975-982; Haifa, Israel; 2010.