A Novel Use of Discrete Wavelet Transform Features in the Prediction of Epileptic Seizures from EEG Data
Cyrille Feudjio (a,*), Victoire Djimna Noyum (a), Younous Perieukeu Mofendjou (a), Rockefeller (b), and Ernest Fokoué (c,*)

(a) School of Mathematical Sciences, African Institute for Mathematical Sciences, Crystal Gardens, Limbe, Cameroon
(b) School of Mathematical Sciences, Stellenbosch University, South Africa
(c) School of Mathematical Sciences, Rochester Institute of Technology, Rochester, NY 14623
(*) Corresponding author
Keywords: Feature Extraction, DWT, MFCC, EEG signals, Epileptic seizures
Abstract
This paper demonstrates the predictive superiority of the discrete wavelet transform (DWT) over previously used methods of feature extraction in the diagnosis of epileptic seizures from EEG data. Classification accuracy, specificity, and sensitivity are used as evaluation metrics. We specifically show the immense potential of two combinations (DWT-db4 combined with SVM, and DWT-db2 combined with RF) as compared to others when it comes to diagnosing epileptic seizures, on either the balanced or the imbalanced dataset. The results also highlight that MFCC performs worse than all the DWT variants used in this study, and that the mean differences are statistically significant in both the imbalanced and balanced datasets. Finally, in both the balanced and the imbalanced datasets, the feature extraction techniques, the models, and the interaction between them have a statistically significant effect on the classification accuracy.
1. Introduction
Nowadays, people face various kinds of stress in their daily lives, and many people around the world suffer from a range of neurological disorders. Epilepsy affects up to 1% of the population, making it, alarmingly, the third most common encephalopathy (Usman et al., 2017). It can affect males and females of all races, ethnic backgrounds, and ages. Approximately 50 million people worldwide suffer from epilepsy, and 90% of them are from developing countries (Kandar et al., 2012). It is not one disorder but rather a syndrome with widely divergent symptoms involving episodic abnormal electrical activity within the brain. Patients with epilepsy can be treated with medication or surgical procedures (Guenot, 2004). However, these methods are not fully effective. Unfortunately, seizures that cannot be fully treated medically limit the patient's active life; in these cases, patients cannot work independently or perform certain activities. This ends in social isolation and economic difficulties. However, early prediction of epileptic seizures can provide sufficient warning time before they occur. Considerable effort has been put in place by researchers and institutions to make this possible, but interestingly, the main cause of epilepsy remains a mystery. Only an early diagnosis poses as a secure and plausible way to treat it. Therefore, several methods have been developed to detect an epileptic seizure before it starts. Machine learning models are used for this task, which incorporates electroencephalography (EEG) signal acquisition and preprocessing, feature extraction from the signals, and finally, classification between different seizure states. Electroencephalography may be a useful method to watch the nonlinear electrical function of the brain's nerve cells; hence, it is a valuable tool for epilepsy evaluation and treatment (Wang et al., 2013).

Feature extraction, which involves tidying the data, is usually said to represent where 80% of the time is spent working on a data science project. In this case, for instance, preprocessing and feature extraction from EEG signals have an excellent effect on maximizing prediction time.

⋆ This document is the result of a research project funded by AIMS CAMEROON with the help of the Mastercard Foundation. In this work, we demonstrate the predictive superiority of the discrete wavelet transform over previously used methods of feature extraction in the diagnosis of epileptic seizures from EEG data.
∗ Corresponding authors: [email protected] (C. Feudjio); [email protected] (V.D. Noyum); [email protected] (Y.P. Mofendjou); [email protected] (Rockefeller); [email protected] (E. Fokoué)
The literature (Rasekhi et al., 2013; Teixeira et al., 2014; Bandarabadi et al., 2015; Zandi et al., 2013) states that no machine learning model provides a universally reliable method for pre-processing and feature extraction. Instead, each of these processes is tailored to specific problems, which makes them indispensable before building the model. Therefore, this project aims to look at the predictive effect of feature extraction methods on the EEG dataset, especially in the case of epileptic seizures.

Some researchers conducted studies to detect the phase of seizures through EEG signal processing, as reported in the study of (Ullah et al., 2018). In that study, the seizure phases, called pre-ictal, ictal, inter-ictal, and post-ictal, were classified in order to analyze the differences in characteristics of each phase. The signal processing was done in the time, frequency, or time-frequency domains. Furthermore, another important study addressed early detection of the phase before seizures to supply an alarm to epilepsy patients, as reported in the study of (Saputro et al., 2019). In that research, they detected the kind of seizures as opportunities and challenges to help the neurologist in classifying the seizure from EEG recordings.

(Golmohammadi et al., 2017) conducted an epileptic EEG signal processing simulation to differentiate the types of seizures. The seizure types studied in that research were generalized non-specific seizure, non-specific seizure, and tonic-clonic seizure. The methods utilized were Mel Frequency Cepstral Coefficients (MFCC), the Hjorth Descriptor, and Independent Component Analysis (ICA) for feature extraction, while a Support Vector Machine (SVM) was used as the classifier.

(Mursalin et al., 2017) presented a hybrid approach where features from the time and frequency domains were analyzed to detect epileptic seizures from EEG signals. They started by applying an Improved Correlation-based Feature Selection (ICFS) method to capture relevant features from the time domain, the frequency domain, and entropy-based features. Then, the classification of the selected features was done by an ensemble of Random Forest (RF) classifiers. Results revealed that the proposed method performed better than the conventional correlation-based method and some other state-of-the-art methods of epileptic seizure detection.

An automatic epilepsy diagnosis framework based on the combination of multi-domain feature extraction and nonlinear analysis of EEG signals was proposed by (Wang et al., 2017). EEG signals were pre-processed using the wavelet threshold method to remove the artifacts, and representative features in the time domain, frequency domain, time-frequency domain, and nonlinear analysis features were extracted. The optimal combination of the extracted features was identified and evaluated via different classifiers. Experimental results demonstrated that the proposed epileptic seizure detection method can achieve a high average accuracy of 99.25%.

Keeping in mind the fact that preprocessing of the EEG signals can improve prediction sensitivity and average anticipation time, (Saputro et al., 2019) proposed an efficient machine learning method for epilepsy prediction. In their research, they classified three sorts of seizures: Generalized Non-Specific Seizure (GNSZ), Focal Non-Specific Seizure (FNSZ), and Tonic-Clonic Seizure (TCSZ).
They used a combination of three feature extraction methods: Mel Frequency Cepstral Coefficients (MFCC), the Hjorth Descriptor, and Independent Component Analysis (ICA). The most effective result was obtained by combining MFCC and Hjorth descriptors, which detected the seizure type with 91.4% average accuracy.

(Paul, 2018) proposed a method for automatic seizure detection based on the mean and minimum value of energy. The algorithm was tested on the CHB-MIT database on three subjects, with 60% and 40% of the data used as training and test data, respectively. They obtained an average detection accuracy of 99.81%.

(Bandarabadi et al., 2015) proposed an algorithm to predict epileptic seizures which can extend the lifetime of epilepsy-affected patients. They extracted spectral power features, and after an appropriate selection, these features were passed into Support Vector Machines for classification. They observed a sensitivity of 75.8% and concluded that reducing the proposed feature subset can improve seizure prediction performance.

(Teixeira et al., 2014) proposed a model for the prediction of epileptic seizures by choosing six channels of EEG signals and extracting 22 linear univariate features for each channel. They tested their model for prediction by varying multiple combinations of electrodes and also with four different pre-ictal state durations. They used three classifiers and approximately predicted every seizure. After selecting suitable features, the training data was fed into a Support Vector Machine for training, then the test data was passed in to determine classification accuracy and sensitivity. They observed a sensitivity of 75.8% for detecting seizures.

Many researchers used EEG signals to detect the beginning of the pre-ictal state of epilepsy. However, only a few
have reliably detected it. (Rasekhi et al., 2013) proposed an algorithm for seizure prediction with the help of univariate linear features. They used six EEG channels in their proposed model and extracted 22 univariate linear properties. A Support Vector Machine was used as a classifier to classify the pre-ictal and ictal states of EEG signals. On average, the prediction sensitivity after applying this algorithm was 73.90%.

(Gadhoumi et al., 2012) used a wavelet method for the prediction of seizures. They extracted features including wavelet energy and wavelet entropy. Two or three channels were selected for testing purposes on a dataset of six patients. Sensitivity was reported as 88%, with a mean anticipation time of twenty-two minutes.

(Bhople and Tijare, 2012) proposed an epileptic seizure detection method using the Fast Fourier Transform (FFT). The FFT-based features were extracted and fed to neural networks. A Multilayer Perceptron (MLP) and a Generalized Feed-Forward Neural Network (GFFNN) were used as classifiers. The algorithm was tested on the Bonn database, and the results show they were able to achieve 100% accuracy.

(Acharya et al., 2012) designed a method for the detection of three states of the EEG signal (normal, pre-ictal, and ictal conditions) from recorded EEG signals. They combined features from two domains, the time domain and the frequency domain, and found that this combined-features method performs well in situations where the signal has a nonlinear and non-stationary nature.
Early detection is an important step in assisting people with epilepsy to take preventive measures against the upcoming manifestation of the disease/disorder, such as finding a secure place before the seizures occur. Classification of seizures could be a significant milestone in the journey toward a potential or proper treatment and, if possible, prognosis prediction. In this regard, several automatic methods for detecting epileptic activity have been proposed recently (Wang et al., 2017). Most of them use Fourier spectral analysis for the extraction of EEG signals under the assumption that EEG signals are stationary (Polat and Güneş, 2007), allowing the transformation of signals from the time domain to the frequency domain. Also, wavelet transformation approaches for time-frequency estimation are generally interesting. For instance, the Discrete Wavelet Transform (DWT) method, a classical method of time-frequency analysis similar to the Short-Term Fourier Transform, has been used to extract features from EEG signals (Acharya et al., 2012). In addition to the extraction of time-frequency characteristics, nonlinear analysis of EEG signals has also received considerable attention for detecting seizures, which can be considered as a transition of the human brain (Gajic et al., 2015). There are also several discrete wavelet transformations using multi-domain characteristics and nonlinear analysis to improve the performance of EEG seizure detection. Besides that, other feature extraction methods such as Mel Frequency Cepstral Coefficients (MFCC), the Hjorth descriptor, and Independent Component Analysis (ICA) could also be genuinely used. All of these methods are designed to remove redundant and irrelevant features so that the classification of new instances becomes more accurate. Researchers continue to explore these methods because the accuracy or sensitivity of classification models is highly dependent on the features used for prediction. Therefore, our contribution is to establish, or at least intelligently speculate on, the predictive powers of the feature extraction methods used. Tackling this will help us to:

• Learn about different machine learning methods that can be combined with the feature extraction process and interpret the outcomes, to build a kind of hybrid model that could hopefully generalize well.

• Potentially build a whole method that works well on the EEG dataset and could be extended to other domains where time series or wave signals are used.

• In the long run, build a package to make the whole process (feature extraction, fitting, and evaluating models) easy to use through dialogue boxes, both for medical purposes and for social good.

Contextually, the focus throughout the study will be on MFCC and the three best wavelets as feature extraction methods.
2. Feature Extraction Techniques
Feature extraction techniques are methods that select and/or combine variables into relevant features, effectively reducing the quantity of information that has to be processed, while still accurately and completely describing the original dataset. In this section, we present two feature extraction techniques used on the EEG dataset, namely the Wavelet Transform and the Mel Frequency Cepstral Coefficient (MFCC).
2.1. Wavelet Transforms
A wavelet $\psi(t)$ is a small wave, which must be oscillatory in some way to discriminate between different frequencies (Merry, 2005). It allows complex information content to be decomposed into elementary form at different positions and scales, and subsequently reconstructed back again with high accuracy. Figure 1 shows some examples of possible wavelets.

Figure 1: Examples of wavelet functions
The Wavelet Transform can operate continuously (CWT) or discretely (DWT). Given a time-domain signal $f(t)$ and a wavelet function $\psi(t)$, the Continuous Wavelet Transform is defined by equation (1):

$$\Psi_f^{\psi}(\tau, s) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t-\tau}{s}\right) dt, \qquad (1)$$

where $\tau$ and $s$ represent the translation and scale parameters respectively, while $\psi(t)$ is called the mother wavelet. The symbol $*$ indicates that, in the case of a complex wavelet, the complex conjugate is used. By discretizing these parameters, the DWT is obtained (al Qerem et al., 2020).

Several transforms, namely the Discrete Wavelet Transform (DWT), the Discrete Fourier Transform (DFT), Singular Value Decomposition (SVD), Empirical Mode Decomposition (EMD), and their variants, are widely used for seizure detection and prediction applications. Although many other time-frequency feature engineering approaches are prevailing for signal processing, such as EMD, SVD, ICA, and PCA (al Qerem et al., 2020), DWT-based wavelet feature analysis is identified as effective for time-frequency domain analysis because of its multiscale approximation feature. Another highlight of DWT-based feature engineering is that it is employed both for signal noise reduction as pre-processing and for feature extraction. The main characteristic of DWT, which makes it the most effective method for the analysis of EEG signals, is its resolution in frequency and time. This property leads to optimality status for frequency-time resolution (Chen et al., 2017). However, there exist many families of DWT, described below.

There are many types of DWT, which are considered as mathematical and statistical functions. These types are divided into families according to frequency components. Seven different types of common wavelets appear in the literature (John Martin et al., 2018): Discrete Meyer (dmey), Reverse biorthogonal (rbio), Biorthogonal (bior), Daubechies (db), Symlets (sym), Coiflets (coif), and Haar (haar) (al Qerem et al., 2020).

Four main factors have a direct impact on discrete wavelet transform (DWT) performance: the DWT coefficient feature, the mother wavelet, the frequency band, and the decomposition level. As mentioned in (Zhang et al., 2017), based on classification accuracy and computational time, it was found that the Coiflet of order 1 (coif1) is the best wavelet family for the analysis of EEG signals. According to (John Martin et al., 2018), this argument is being challenged by many researchers. They therefore recommend the Haar and the second- and fourth-order Daubechies (db2, db4) wavelets for signal preprocessing and feature extraction, since these wavelets provided better accuracy in recent classifications. Figure 2 presents the different wavelet families and their associated mother wavelets.
Figure 2:
Wavelets Families (al Qerem et al., 2020)
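As a quick orientation, these families can be enumerated programmatically. The sketch below assumes the PyWavelets library; the paper does not name its toolkit, so this is an illustrative choice:

```python
import pywt

# Print the built-in wavelet families and a few members of each.
for family in pywt.families():
    print(f"{family}: {pywt.wavelist(family)[:5]}")

# Inspect the three wavelets compared in this study.
for name in ("db2", "db4", "coif1"):
    w = pywt.Wavelet(name)
    print(name, "- filter length:", w.dec_len, "- orthogonal:", w.orthogonal)
```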
The DWT-based methodology comprises three critical steps, which are described below.

1. Wavelet Threshold De-Noising:
Generally, physiological signals are nonlinear and contaminated (Wang et al., 2017): this can be due to background noise around the facilities, the disposition of the electrodes, the mobility of the patient during the recording, etc. Removing noise is, therefore, an important step. The wavelet threshold method can perform well in denoising non-stationary EEG signals (John Martin et al., 2018). The word "noise" is a standard term in signal processing, but in EEG signal processing noise takes the form of sharp waves that are not significant for identification (John Martin et al., 2018). Thus, getting rid of some frequency bands appearing within the decomposed bands by the use of the wavelet threshold becomes a critical step to unveil the relevant features of the raw signal. The threshold is expressed as (Gilda and Slepian, 2019):

$$\lambda = \sigma \sqrt{2 \log N}, \qquad (2)$$

where $\lambda$ is the wavelet threshold, $\sigma$ is the standard deviation of the noise, and $N$ is the length of the sample signal. The denoised EEG signal facilitates the extraction of more distinguishable features than the original signal, especially for epileptic event detection (John Martin et al., 2018). A minimal de-noising sketch is given below.
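The sketch assumes PyWavelets; the median-based estimate of the noise level $\sigma$ from the finest detail band is a common convention (Donoho's rule) rather than something specified in the paper:

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    """De-noise a 1-D signal with the universal threshold of Eq. (2)."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Estimate sigma from the finest detail coefficients (MAD rule;
    # an assumed choice, since the paper does not specify the estimator).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma * np.sqrt(2 * np.log(len(signal)))          # Eq. (2)
    # Soft-threshold every detail band; keep the approximation untouched.
    denoised = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]
```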
2. Wavelet Decomposition:

The process of wavelet decomposition is described below:

• The raw EEG signal first goes through a filter pair, a combination of a high-pass filter (HPF) and a low-pass filter (LPF), to obtain the requested result. This process is categorized as the first level, which produces two corresponding sets of coefficients: the Approximation (A) and the Detail (D).

• The output of the low-pass filter (LPF) goes into another filter pair, and so on, up to level 4, as shown in Figure 3.

Note that the process continues over multiple levels, each level operating on the approximation coefficients from the previous one. At each level, the frequency resolution is doubled by the filters while the time resolution is halved, as the signal is down-sampled by two.
Figure 3: Four-level EEG signal decomposition (Gilda and Slepian, 2019)

3. Features Extraction:
Multi-Resolution Analysis (MRA) is used to extract feature vectors from the signal data. Commonly, when DWT is used as a feature extraction method, the extracted features for classification include the mean average value, standard deviation, energy, and spectral entropy. Below are the formulas to compute these quantities (Ahammad et al., 2014).

• The variance can be defined as the deviation of the signal from its mean. It is given as

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \qquad (3)$$

• The mean signal energy of seizure data generally tends to be higher than that of normal data, due to higher amplitudes. It is given as

$$E = \frac{1}{N} \sum_{i=1}^{N} x_i^2 \qquad (4)$$

• The power spectral density is calculated in two steps: first, by finding the fast Fourier transform $X(w_i)$ of the time series, and then taking the squared modulus of the FFT coefficients:

$$P(w_i) = \frac{|X(w_i)|^2}{N} \qquad (5)$$

From this, the maximum and minimum values are used.

• Entropy is the measure of randomness and the information content of a signal. To calculate the entropy of a given EEG signal, the Shannon entropy formula is used:

$$ENT = -\sum_{i=1}^{N} x_i \log(x_i) \qquad (6)$$

• The interquartile range of the EEG signal gives the statistical dispersion of the signal, which is a measure of how squeezed or stretched a distribution is. To calculate the interquartile range, the signal is divided into two parts, one containing values lower than the median and another containing values higher than the median; the interquartile range is then the difference between the median of the upper half and that of the lower half.

• Kurtosis is used to measure how much weight the tail of the probability distribution has. Hence, the presence of spikes is indicated by increased kurtosis. Kurtosis can be calculated by

$$K = \frac{\sum_{i=1}^{N} (x_i - \mu)^4}{N \sigma^4} \qquad (7)$$

where $x_i$ denotes the time series of an EEG data set and $N$ the number of samples in the signal. A minimal sketch of these per-band computations is given below. The second feature extraction method that we are going to use in our implementation is the Mel Frequency Cepstral Coefficient (MFCC), described in the next subsection.
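Before turning to MFCC, here is a minimal sketch of how such per-band features could be computed with NumPy/SciPy and PyWavelets; the exact feature set and normalizations used in the study may differ (the use of normalized magnitudes as pseudo-probabilities in the entropy is one common reading of Eq. (6)):

```python
import numpy as np
import pywt
from scipy import stats

def dwt_features(signal, wavelet="db4", level=4):
    """Compute per-sub-band statistics of the kind listed above (Eqs. 3-7)."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # [A4, D4, D3, D2, D1]
    features = []
    for band in coeffs:
        p = np.abs(band) / np.sum(np.abs(band))           # pseudo-probabilities
        features += [
            np.var(band),                                 # variance, Eq. (3)
            np.mean(band ** 2),                           # mean energy, Eq. (4)
            -np.sum(p * np.log(p + 1e-12)),               # Shannon entropy, Eq. (6)
            stats.iqr(band),                              # interquartile range
            stats.kurtosis(band),                         # kurtosis, Eq. (7)
            band.max(), band.min(),                       # extrema of the coefficients
        ]
    return np.array(features)
```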
2.2. Mel Frequency Cepstral Coefficient
MFCC was originally designed to study real speech registered by human ears, but it can process quasi-stationary signals in the form of sound signals and EEG signals (Nguyen et al., 2012). This method is widely used in EEG signal processing with high accuracy (Othman et al., 2009). The methodology for its implementation is described below (Figure 4).
Figure 4:
Extraction flowchart of MFCC coefficients (Gong et al., 2015)
According to (Ren et al., 2018), the steps of the MFCC are as follows:

1. Pre-emphasis and Framing: The signal is passed through a filter that emphasizes higher frequencies and is thereafter divided into $N$ frames.

2. Hamming window: The Hamming window is applied to each frame to obtain windowed frames, and can be calculated as follows:

$$y(k) = x(k) \times H(k), \qquad (8)$$

where $H(k)$ is given by

$$H(k) = a - b \cos\left(\frac{2\pi k}{N-1}\right), \qquad k = 0, 1, \ldots, N-1, \qquad (9)$$

$N$ is the number of points in a frame, and $a$, $b$ denote the parameters of the Hamming window ($a = 0.54$, $b = 0.46$). $x(k)$ and $y(k)$ are respectively the input and output signals, while $H(k)$ is the Hamming window.

3. Changing from the time domain to the frequency domain:
The output of this step is called a spectrum. The frequency spectrum $F(w)$ of the $i$-th frame $x_i(n)$ can be calculated using the Fast Fourier Transform. The short-time power spectrum $|F(w)|^2$ can then be computed and filtered by a Mel-filter bank $B_{Mel}$. The mapping from a linear frequency $f$ to the Mel-frequency $f_{Mel}$ is

$$f_{Mel} = \delta \ln\left(1 + \frac{f}{\nu}\right), \qquad (10)$$

where $f$ and $f_{Mel}$ denote the linear frequency and the Mel-frequency respectively, and $\delta$, $\nu$ are parameters ($\delta = 2595$, $\nu = 700$).
4. Mel-frequency wrapping:
The log amplitude of the spectrum is mapped onto the Mel scale using a triangular filter bank. The output of the short-time power spectrum $|F(w)|^2$ passed through this Mel-filter bank is

$$\theta(M_m) = \ln\left[\sum_{k=1}^{N} |F(w_k)|^2 H_m(k)\right], \qquad m = 1, 2, \ldots, M, \qquad (11)$$

where $H_m(k)$ is the filter bank.

5. Cepstrum:
The Mel-spectrum coefficients are transformed using the Discrete Cosine Transform (DCT), producing fourteen cepstral coefficients for every frame. The Mel-Frequency Cepstrum $c_n$ can be calculated by applying the Inverse Discrete Cosine Transform (IDCT) in the Mel-frequency coordinate spectrum, which can be described by the following formula:

$$c_n = \sum_{k=1}^{M} M_k \cos\left(\frac{n (k - 0.5) \pi}{M}\right), \qquad n = 1, 2, \ldots, p, \qquad (12)$$

where $p$ is the dimension of the MFCC, $c_n$ denotes the $n$-th MFCC, and $p$ is less than the number $M$ of Mel filters.
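A compact way to reproduce this pipeline is the `librosa` library, assumed here for illustration (the paper's implementation is not named); the frame and hop lengths are illustrative choices, not values from the paper:

```python
import numpy as np
import librosa

fs = 173.61                  # sampling rate of the Bonn recordings (Hz)
eeg = np.random.randn(4097)  # placeholder for one 23.6 s EEG segment

# 14 cepstral coefficients per frame, as in step 5 above.
mfcc = librosa.feature.mfcc(y=eeg, sr=fs, n_mfcc=14, n_fft=256, hop_length=128)
print(mfcc.shape)            # (14, number_of_frames)
```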
3. Classification Techniques and Performance Evaluation
Classification is a supervised learning technique for categorizing a given set of data (structured or unstructured) into classes. The main goal of a classification problem is to identify the category/class under which a new observation will fall. In this section, seven classifiers will be used: Linear and Quadratic Discriminant Analysis, Naive Bayes, Support Vector Machine, K-Nearest Neighbor, Random Forest, and Gradient Boosting. The coming subsections present how they work as well as how their performances are assessed.
3.1. Linear and Quadratic Discriminant Analysis

Let us assume that the data set is defined by

$$\{(x_i, y_i)\}_{i=1}^{n}, \qquad (13)$$

where $n$ is the sample size, and $x_i$ and $y_i$ represent respectively the different feature vectors and the class labels. For simplicity, we note that $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. The objective of the following methods is to classify the data into $k$ classes using Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).

To better understand how these two methods work, let us start by considering one-dimensional data, which means $x \in \mathbb{R}$, and just two classes, denoted $\mathcal{C}_1$ and $\mathcal{C}_2$. Let us denote by $G_1(x)$ and $G_2(x)$ the cumulative distribution functions of these two classes. We can derive the corresponding probability density functions by

$$g_1(x) = \frac{\partial G_1(x)}{\partial x} \qquad (14)$$

$$g_2(x) = \frac{\partial G_2(x)}{\partial x} \qquad (15)$$

Let us assume that the two classes have normal (Gaussian) distributions (Ghojogh and Crowley, 2019) and that the mean of the first class is smaller than that of the second ($\mu_1 < \mu_2$). Let us denote by $x^*$ the point where the probabilities of the two classes are equal. We can write $\mu_1 < x^* < \mu_2$, since we know that $\mu_1 < \mu_2$. This means that

$$\begin{cases} \text{if } x < x^*, & \text{then } x \text{ belongs to } \mathcal{C}_1 \\ \text{if } x > x^*, & \text{then } x \text{ belongs to } \mathcal{C}_2 \end{cases}$$

The probability $P_e$ of error in estimating the class to which $x$ belongs can be written as

$$P_e = P(x > x^*, x \in \mathcal{C}_1) + P(x < x^*, x \in \mathcal{C}_2) \qquad (16)$$
Using the conditional probability identity (17),

$$P(A, B) = P(A \mid B) \, P(B), \qquad (17)$$

we can rewrite (16) as

$$P_e = P(x > x^* \mid x \in \mathcal{C}_1) P(x \in \mathcal{C}_1) + P(x < x^* \mid x \in \mathcal{C}_2) P(x \in \mathcal{C}_2) \qquad (18)$$

The aim of these methods is to minimize $P_e$ by finding $x^*$. Using the definition of the cumulative distribution function, we can write

$$\begin{cases} P(x > x^* \mid x \in \mathcal{C}_1) = 1 - G_1(x^*) \\ P(x < x^* \mid x \in \mathcal{C}_2) = G_2(x^*) \end{cases} \qquad (19)$$

Denoting the prior probabilities of the two classes by $\sigma_1$ and $\sigma_2$, we have

$$\begin{cases} P(x \in \mathcal{C}_1) = \sigma_1 \\ P(x \in \mathcal{C}_2) = \sigma_2 \end{cases} \qquad (20)$$

By replacing (19) and (20) in (18), we obtain

$$P_e = \left[1 - G_1(x^*)\right] \sigma_1 + G_2(x^*) \sigma_2 \qquad (21)$$

Let us now take the derivative of $P_e$ for the sake of minimization:

$$\frac{\partial P_e}{\partial x^*} = -g_1(x^*) \sigma_1 + g_2(x^*) \sigma_2$$

If we set this derivative equal to zero, we obtain the relation

$$g_1(x^*) \sigma_1 = g_2(x^*) \sigma_2 \qquad (22)$$

where $g_i(x^*)$ and $\sigma_i$ are the likelihood (class-conditional) and prior probabilities, respectively. Now let us suppose that the data is multivariate with dimensionality $d$. The probability density function of the multivariate Gaussian distribution is given by

$$g(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2}\right), \qquad (23)$$

where $x \in \mathbb{R}^d$, $\mu \in \mathbb{R}^d$ is the mean, $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix, and $|\cdot|$ is the determinant of the matrix. Replacing (23) in (22), we obtain

$$\frac{1}{\sqrt{(2\pi)^d |\Sigma_1|}} \exp\left(-\frac{(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}{2}\right) \sigma_1 = \frac{1}{\sqrt{(2\pi)^d |\Sigma_2|}} \exp\left(-\frac{(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)}{2}\right) \sigma_2 \qquad (24)$$

This last equation is used for both the LDA and QDA methods.

3.1.1. Linear Discriminant Analysis (LDA)

In this case, we assume that the two classes have equal covariance matrices ($\Sigma_1 = \Sigma_2 = \Sigma$) (Ghojogh and Crowley, 2019). Therefore, equation (24) becomes

$$\exp\left(-\frac{(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)}{2}\right) \sigma_1 = \exp\left(-\frac{(x - \mu_2)^T \Sigma^{-1} (x - \mu_2)}{2}\right) \sigma_2$$
By taking the logarithm of both sides, we obtain

$$-\frac{(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)}{2} + \ln(\sigma_1) = -\frac{(x - \mu_2)^T \Sigma^{-1} (x - \mu_2)}{2} + \ln(\sigma_2) \qquad (25)$$

The quadratic form on the left-hand side can be expanded and rewritten as

$$(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = x^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - 2 \mu_1^T \Sigma^{-1} x, \qquad (26)$$

because $\mu_1^T \Sigma^{-1} x = x^T \Sigma^{-T} \mu_1$ and $\Sigma^{-T} = \Sigma^{-1}$, since $\Sigma^{-1}$ is symmetric. We can do the same with the right-hand side and obtain

$$(x - \mu_2)^T \Sigma^{-1} (x - \mu_2) = x^T \Sigma^{-1} x + \mu_2^T \Sigma^{-1} \mu_2 - 2 \mu_2^T \Sigma^{-1} x \qquad (27)$$

By replacing (26) and (27) in (25) and rearranging, we obtain

$$2 \left(\Sigma^{-1} (\mu_2 - \mu_1)\right)^T x + (\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) + 2 \ln\left(\frac{\sigma_2}{\sigma_1}\right) = 0 \qquad (28)$$

This last equation is the equation of a line. Thus, if we consider Gaussian distributions for the two classes where the covariance matrices are assumed to be equal, the decision boundary is linear. Because of this linearity, which discriminates the two classes, the method is called Linear Discriminant Analysis (LDA). If we define $\Gamma(x) : \mathbb{R}^d \longmapsto \mathbb{R}$ such that

$$\Gamma(x) = 2 \left(\Sigma^{-1} (\mu_2 - \mu_1)\right)^T x + (\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) + 2 \ln\left(\frac{\sigma_2}{\sigma_1}\right), \qquad (29)$$

then the class of an instance $x$ is estimated as

$$\hat{C}(x) = \begin{cases} \mathcal{C}_1 & \text{if } \Gamma(x) < 0 \\ \mathcal{C}_2 & \text{if } \Gamma(x) > 0 \end{cases} \qquad (30)$$

3.1.2. Quadratic Discriminant Analysis (QDA)

In this case, we do not assume that the two classes have equal covariance matrices, so $\Sigma_1 \neq \Sigma_2$ (Ghojogh and Crowley, 2019). Taking the natural logarithm of both sides of equation (24), we obtain

$$-\frac{1}{2} \ln(|\Sigma_1|) - \frac{(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}{2} + \ln(\sigma_1) = -\frac{1}{2} \ln(|\Sigma_2|) - \frac{(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)}{2} + \ln(\sigma_2) \qquad (31)$$

According to (26), we can rewrite (31) as

$$-\frac{1}{2} \ln(|\Sigma_1|) - \frac{1}{2} x^T \Sigma_1^{-1} x - \frac{1}{2} \mu_1^T \Sigma_1^{-1} \mu_1 + \mu_1^T \Sigma_1^{-1} x + \ln(\sigma_1) = -\frac{1}{2} \ln(|\Sigma_2|) - \frac{1}{2} x^T \Sigma_2^{-1} x - \frac{1}{2} \mu_2^T \Sigma_2^{-1} \mu_2 + \mu_2^T \Sigma_2^{-1} x + \ln(\sigma_2) \qquad (32)$$

Let us multiply (32) by 2. After some rearrangement, we obtain

$$x^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) x + 2 \left(\Sigma_2^{-1} \mu_2 - \Sigma_1^{-1} \mu_1\right)^T x + \left(\mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2\right) + \ln\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right) + 2 \ln\left(\frac{\sigma_2}{\sigma_1}\right) = 0,$$

which is in quadratic form. Hence, if we consider Gaussian distributions for the two classes, the decision boundary of the classification is quadratic, which is why this method is called Quadratic Discriminant Analysis (QDA). Let us define $\Gamma(x) : \mathbb{R}^d \longmapsto \mathbb{R}$ such that

$$\Gamma(x) = x^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) x + 2 \left(\Sigma_2^{-1} \mu_2 - \Sigma_1^{-1} \mu_1\right)^T x + \left(\mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2\right) + \ln\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right) + 2 \ln\left(\frac{\sigma_2}{\sigma_1}\right) \qquad (33)$$

For the estimation of the class of an instance $x$, we use equation (30).
Preprint submitted to Elsevier
Page 10 of 28 .1.3. LDA and QDA for Multi-Class Classification
In this general case, we consider multiple classes (possibly more than two) indexed by $k \in \{1, \ldots, |C|\}$. According to equations (22) and (23), we have

$$g_k(x)\, \sigma_k = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\left(-\frac{(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}{2}\right) \sigma_k \qquad (34)$$

If we take the logarithm of (34), we have

$$\ln\left(g_k(x)\, \sigma_k\right) = -\frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln(|\Sigma_k|) - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \ln(\sigma_k)$$

Let us drop the first term because it is the same for all the classes. So we have

$$\Gamma_k(x) = -\frac{1}{2} \ln(|\Sigma_k|) - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \ln(\sigma_k)$$

$\Gamma_k(x)$ is the scaled posterior of the $k$-th class. In QDA, the class of an instance $x$ is estimated as

$$\hat{C}(x) = \arg\max_k \Gamma_k(x) \qquad (35)$$

In LDA, we assume that $\Sigma_1 = \Sigma_2 = \ldots = \Sigma_{|C|} = \Sigma$. Therefore, $\Gamma_k(x)$ becomes

$$\Gamma_k(x) = -\frac{1}{2} \ln(|\Sigma|) - \frac{1}{2} x^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \mu_k^T \Sigma^{-1} x + \ln(\sigma_k)$$

We drop the first and second terms of the right-hand side because they are the same for all the classes. Hence, we have

$$\Gamma_k(x) = \mu_k^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln(\sigma_k)$$

The class of the instance $x$ is again determined by (35)
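In practice, both classifiers are available off the shelf. A minimal sketch with scikit-learn (an assumed toolkit, on stand-in data) follows:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

# X: feature matrix (n_samples x d); y: 0 = non-seizure, 1 = seizure (stand-in data).
X, y = np.random.randn(200, 10), np.random.randint(0, 2, 200)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance -> quadratic boundary
print(lda.score(X, y), qda.score(X, y))
```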
3.2. Naive Bayes

The Naive Bayes classifier is based on Bayes' theorem, with the naive assumption that, given their belonging to a specific class, the features are independent of each other. Let us see how the model can be derived.
Definition 3.3.
Bayes' Theorem (Stuart et al., 1994). Given a feature vector $X = (x_1, x_2, \ldots, x_n)$ and a class variable $C_k$, Bayes' theorem states that

$$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)}, \qquad k = 1, 2, \ldots, K, \qquad (36)$$

where $P(C_k \mid X)$ is called the posterior probability, $P(X \mid C_k)$ the likelihood, $P(C_k)$ the prior probability of the class, and $P(X)$ the prior probability of the predictor.

We are interested in calculating the posterior probability from the likelihood and the prior probabilities. Using the chain rule, the likelihood $P(X \mid C_k)$ can be decomposed as

$$P(X \mid C_k) = P(x_1 \mid x_2, \ldots, x_n, C_k)\, P(x_2 \mid x_3, \ldots, x_n, C_k) \cdots P(x_{n-1} \mid x_n, C_k)\, P(x_n \mid C_k) \qquad (37)$$

The above sets of probabilities can be hard and expensive to compute. However, we can use the naive independence assumption, which is given by

$$P(x_i \mid x_{i+1}, \ldots, x_n, C_k) = P(x_i \mid C_k) \qquad (38)$$
Using (38) in (37), we have

$$P(X \mid C_k) = P(x_1, \ldots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k) \qquad (39)$$

Therefore, the posterior probability (36) can be written as

$$P(C_k \mid X) = \frac{P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)}{P(X)} \qquad (40)$$

Knowing that the prior probability of the predictor $P(X)$ is constant given the input, we can write (40) as

$$P(C_k \mid X) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k), \qquad (41)$$

where $\propto$ means "proportional to". The Naive Bayes classification problem is: for the different class values $C_k$, find the maximum of $P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$. Mathematically, we can formulate this problem as

$$\hat{C} = \arg\max_{C_k} P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \qquad (42)$$

The prior probability of the class, $P(C_k)$, can be estimated as the relative frequency of class $C_k$ in the training data.

Advantages:

• When the assumption of independent predictors holds true, a Naive Bayes classifier performs better compared to other models.

• Naive Bayes requires only a small amount of training data to estimate its parameters, so the training period takes less time.

• It can be used for both binary and multi-class classification problems.

Limitations:

• The main limitation of Naive Bayes is the assumption of independent predictor features. Naive Bayes implicitly assumes that all the attributes are mutually independent. In real life, it is almost impossible to get a set of predictors that are completely independent of one another.

• If a categorical variable has a category in the test dataset that was not observed in the training dataset, the model will assign it a zero probability and will be unable to make a prediction.

3.3. Support Vector Machine

The Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. In the SVM algorithm, we plot each data item as a point in $n$-dimensional space (where $n$ is the number of features), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best differentiates the classes. It can be used for binary classification and multi-class classification. Below we present the mathematics behind binary classification. Let us assume that the dataset is defined by

$$\{(x_i, y_i)\}_{i=1}^{n}, \qquad (43)$$
Preprint submitted to Elsevier
where $n$ is the sample size, and $x_i$ and $y_i$ represent respectively the feature vectors and the class labels. We note that $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. According to (Nguyen et al., 2012), the SVM algorithm finds the optimal hyperplane given by (44),

$$f(x) = w^T \Phi(x) + b, \qquad (44)$$

to separate the training data by solving the optimization problem (45),

$$\min_{w, b, \psi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \psi_i, \qquad (45)$$

subject to the constraints (46):

$$y_i \left(w^T \Phi(x_i) + b\right) \geq 1 - \psi_i \quad \text{and} \quad \psi_i \geq 0, \qquad i = 1, \ldots, n. \qquad (46)$$

The optimization problem (45) maximizes the hyperplane margin while minimizing the cost of errors, where the $\psi_i$, $i = 1, \ldots, n$, are non-negative slack variables introduced to relax the constraints of the separable data problem into the constraints (46) of the non-separable data problem. For an error to occur, the corresponding $\psi_i$ must exceed unity, so $\sum_i \psi_i$ is an upper bound on the number of training errors. Hence an extra cost $C \sum_i \psi_i$ for errors is added to the objective function, where $C$ is a parameter chosen by the user.

For nonlinear classification (the case where the data cannot be separated by a hyperplane), the kernel function $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$ is introduced, and the optimal hyperplane becomes

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(s_i, x) + b, \qquad (47)$$

where $s_i$ is the $i$-th support vector. The function $\Phi : x \longmapsto \Phi(x)$ is a map from the data space to the feature space, such that the data are linearly separable in the feature space. Note that there are several kernel functions, such as the polynomial kernel, the Gaussian kernel, and the sigmoidal kernel.

3.4. K-Nearest Neighbor

K-Nearest Neighbor (KNN) is the simplest classification algorithm. The approach is to plot all data points in space and, for any new sample, observe its $k$ nearest points and make a decision based on majority voting. Thus, the KNN algorithm involves no training, and it takes the least calculation time when implemented with an optimal value of $k$. The steps of the KNN algorithm are as follows (Sutton, 2012):

• For a given instance, find its distance from all other data points, using an appropriate distance metric for the problem instance.

• Sort the computed distances in increasing order. Depending on the value of $k$, observe the nearest $k$ points.

• Identify the majority class among the $k$ points, and declare it as the predicted class.

Choosing an optimal value of $k$ is a challenge in this approach. Most often, the process is repeated for several different trials of $k$; the evaluation scores are then plotted to find the optimal value of $k$ (see the sketch below).
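A minimal sketch of both classifiers with scikit-learn follows; the RBF kernel, the value of C, and the candidate values of k are illustrative choices on stand-in data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = np.random.randn(200, 10), np.random.randint(0, 2, 200)  # stand-in data

# Soft-margin SVM with a Gaussian (RBF) kernel; C is the error-cost
# parameter of Eq. (45).
svm = SVC(kernel="rbf", C=1.0)

# KNN: try several k and keep the one with the best cross-validated accuracy.
best_k = max(range(1, 16, 2),
             key=lambda k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                           X, y, cv=5).mean())
print("best k:", best_k, "SVM CV accuracy:", cross_val_score(svm, X, y, cv=5).mean())
```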
3.5. Random Forest

A random forest can be defined as a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The steps for building a random forest classifier are as follows (Cutler et al., 2012); a minimal usage sketch follows Figure 5:

• Select a subset of features from the dataset.

• From the selected subset of features, using the best-split method, pick a node.

• Continue the best-split method to form child nodes from the subset of features.

• Repeat the steps until all nodes are used as splits.

• Iteratively create $n$ trees using steps 1-4 to form a forest.
Figure 5:
Steps of Random Forest (Cutler et al., 2012)
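A minimal scikit-learn sketch of this procedure, with illustrative hyperparameters on stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X, y = np.random.randn(200, 10), np.random.randint(0, 2, 200)  # stand-in data

# n_estimators = number of trees grown by repeating steps 1-4 above;
# max_features controls the random feature subset examined at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)
print(rf.score(X, y))
```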
3.6. Gradient Boosting

Gradient Boosting is one of the techniques for performing supervised machine learning tasks, like classification and regression. Like Random Forests, it is an ensemble learner: it creates a final model based on a collection of individual models. The magic of this model is described in the name: "Gradient" plus "Boosting". Boosting builds the ensemble from individual models in an iterative way. In boosting, the individual models are not built on completely random subsets of data and features, but sequentially, by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances which are hard to predict correctly will be focused on during learning, so that the model learns from past mistakes. When each ensemble member is trained on a subset of the training set, the method is called Stochastic Gradient Boosting, which can help improve the generalizability of the model (Natekin and Knoll, 2013). Similar to how neural networks utilize gradient descent to optimize ("learn") weights, the gradient is used to minimize a loss function. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The error of the model is estimated by the distance between prediction and truth, which is used to calculate the gradient. The gradient is basically the partial derivative of the loss function; thus, it describes the steepness of the error function (Natekin and Knoll, 2013).
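A minimal scikit-learn sketch; the learning rate and subsampling fraction are illustrative, with subsample < 1.0 giving the stochastic variant described above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X, y = np.random.randn(200, 10), np.random.randint(0, 2, 200)  # stand-in data

# learning_rate scales each gradient step; subsample < 1.0 trains every
# tree on a random fraction of the data (stochastic gradient boosting).
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                subsample=0.8).fit(X, y)
print(gb.score(X, y))
```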
3.7. Performance Evaluation

To train and evaluate the performance of the different models, we will use K-fold cross-validation with 5 replications, followed by the confusion matrix, which is described below.
Cross-validation, also known as rotation estimation, is the statistical practice of partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis. The initial subset of data is called the training set; the other subset(s) are called validation or testing sets. In K-fold cross-validation, the original sample is partitioned into K sub-samples. Of the K sub-samples, a single sub-sample is retained as the validation data for testing the model, and the remaining $K - 1$ sub-samples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K sub-samples used exactly once as the validation data. The K results from the folds can then be averaged to produce a single estimate.

• The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. The variance of the resulting estimate is reduced as K is increased.
• The disadvantage of this method is that the training algorithm has to be re-run from scratch K times, which means it takes K times as much computation to make an evaluation.
For performance evaluation, confusion matrix metrics are often used. The criteria usually employed comprise three parts: sensitivity (the proportion of the total number of positive cases that are correctly classified), specificity (the proportion of the total number of negative cases that are correctly classified), and classification accuracy (the proportion of the total number of EEG signals that are correctly classified). A sketch of these computations is given below.
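A minimal sketch of how these three metrics can be obtained from cross-validated predictions, assuming scikit-learn and stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

X, y = np.random.randn(200, 10), np.random.randint(0, 2, 200)  # stand-in data

# Out-of-fold predictions from 10-fold cross-validation.
y_pred = cross_val_predict(RandomForestClassifier(), X, y, cv=10)
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()

sensitivity = tp / (tp + fn)    # proportion of positives correctly classified
specificity = tn / (tn + fp)    # proportion of negatives correctly classified
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy)
```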
4. Experimental Results and Analysis
In this section, we apply the different methods presented above to the EEG dataset. As feature extraction methods, we implement the Discrete Wavelet Transform (DWT) and the Mel Frequency Cepstral Coefficient (MFCC), and as classification methods, we implement QDA, LDA, RF, NB, GB, KNN, and SVM. We also make a comparison not only between the different feature extraction methods but also of the interaction between feature extraction and classifier, based on the final scores of the classifiers. The overall process of classification is described by Figure 6.
Figure 6:
Process of EEG signal classification (Wen and Zhang, 2017)
5. Methodology
The flowchart of the proposed classification framework is shown in Figure 7.
The database used in this work was recorded at the University Hospital Bonn, Germany. It is composed of five different sections of EEG signals, represented by the symbols S, F, N, O, and Z, as shown in Table 1. Each of these sections consists of 100 signals, with a recording time of about 23.6 s each. In order to record the data in the most accurate way, an amplifier system with 128 signal channels was used, with a sampling rate of 173.61 Hz. The EEG samples in the O and Z datasets are derived from healthy volunteers with external surface electrodes, for open and closed eye conditions. The F and N datasets were acquired during seizure-free intervals, and the dataset S contains only seizure activity. The five data sets S, F, N, O, and Z are classified into two distinct groups in our study: the epileptic seizure class (S) is composed of the subset S, and the non-seizure class (FNOZ) is composed of the subsets F, N, O, and Z.
Figure 7:
The flowchart of the proposed classification framework
Table 1
The definitions and descriptions for the electroencephalographic (EEG) signals from the University of Bonn, Germany
Information          Dataset O                       Dataset Z                        Dataset F      Dataset N      Dataset S
State                Awake, eyes open (healthy)      Awake, eyes closed (healthy)     Seizure-free   Seizure-free   Seizure activity
Electrode type       Surface                         Surface                          Intracranial   Intracranial   Intracranial
No. of channels      100                             100                              100            100            100
Recording time (s)   23.6                            23.6                             23.6           23.6           23.6

In this study, we built and used two datasets from the raw database. The first is an imbalanced dataset with a positive-class prevalence of 0.2 (Figure 8), and the second is a balanced dataset with a positive-class prevalence of 0.5 (Figure 9).
Figure 8:
Flowchart for the building of the imbalanced dataset
Figure 9:
Flowchart for the building of the balanced dataset
Feature engineering is the process of converting raw data into features that better represent the underlying observations for predictive models. Thus, it can improve the accuracy of the model on unseen data. The flowchart of our feature engineering is described in Figure 10.
Figure 10:
The flowchart of Feature Engineering
For feature extraction, we implemented four methods, namely:

• the discrete wavelet transform DWT-db4,
• the discrete wavelet transform DWT-db2,
• the discrete wavelet transform DWT-coif1,
• the Mel Frequency Cepstral Coefficient (MFCC).

We follow the wavelet transform steps for the first three and the MFCC steps described in Section 2 for the last one.

•
Wavelet Threshold De-Noising :
We can see in Figure 11 one signal before and after de-noising.

•
Wavelet Decomposition: The DWT is used to split a signal into different frequency sub-bands, as many as needed or as many as possible. Figures 12-16 show the five decomposed bands, from one signal, that we are going to use.
Figure 11:
Signal before and after de-noising
Figure 12:
Sub-band signals using four-level wavelet decomposition (db4) from an original EEG signal: the approximation in 0-4 Hz.
Figure 13:
Sub-band signals using four-level wavelet decomposition (db4) from an original EEG signal: the detail in 4-8 Hz.

• Feature Extraction: After decomposition, we extract features in the time-frequency domain (mean average value, standard deviation, relative band power, spectral entropy), in the frequency domain (relative power spectral density estimated from the FFT coefficients), and in the time domain (mean, median, standard deviation, and total variation). Additionally, the maximum, minimum, and total variation of the DWT coefficients are also estimated, in order to describe the non-stationary signals.
Some of the original features extracted are correlated and redundant. To select an optimal feature subset from the original feature set, we use a dimensionality reduction method, Principal Component Analysis (PCA).
Figure 14:
Sub-band signals using four-level wavelet decomposition (db4) from an original EEG signal: the detail in 8-16 Hz.
Figure 15:
Sub-band signals using four-level wavelet decomposition (db4) from an original EEG signal: the detail in 16-32 Hz.
Figure 16:
Sub-band signals using four-level wavelet decomposition (db4) from an original EEG signal: the detail in 32-64 Hz.

The PCA algorithm is implemented to obtain a relatively low-dimensional but significantly discriminative feature set, which improves the classification performance.
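A minimal scikit-learn sketch of this reduction step; the 95% retained-variance target is an illustrative choice, not a value taken from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.randn(500, 40)   # stand-in for the extracted feature matrix

# Standardize, then keep enough components to explain 95% of the variance.
X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=0.95).fit_transform(X_std)
print(X.shape, "->", X_pca.shape)
```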
Seven classifiers are used after the four feature extraction methods to distinguish epileptic seizures from non-seizure signals. The different classifiers are:

• Linear Discriminant Analysis (LDA),
• Quadratic Discriminant Analysis (QDA),
• K-Nearest Neighbor (KNN),
• Naive Bayes (NB),
• Random Forest (RF),
• Gradient Boosting (GB),
• Support Vector Machine (SVM).

The classification performance, measured by sensitivity (SEN), specificity (SPE), and accuracy (ACC) using 10-fold cross-validation, is shown in the tables below. In addition, boxplots are used to compare the different models and feature extraction methods.
Table 2
The classification performance for 10-fold CV without feature extraction

         Imbalanced data            Balanced data
Model    ACC     SPE     SEN        ACC     SPE     SEN
LDA      69.60   94.50   40.00      86.80   94.00   44.00
QDA      92.90   99.75   86.00      96.92   100.00  85.00
KNN      75.50   100.00  42.00      88.48   100.00  52.00
NB       94.70   96.50   88.00      95.16   99.00   90.00
RF       94.30   99.00   81.00      95.44   95.00   96.00
GB       95.10   95.36   82.00

Table 3
The classification performance of 10-fold CV with the "db4" wavelet method

         Imbalanced data            Balanced data
Model    ACC     SPE     SEN        ACC     SPE     SEN
LDA      98.50   99.25   83.00      95.24   99.00   99.00
QDA      96.40   99.25   87.00      97.04   100.00  94.00
KNN      97.90   99.50   86.00      97.00   100.00  94.00
NB       91.80   93.75   87.00      92.24   99.00   95.00
RF
According to Figure 17, we can draw the conclusions below for the balanced dataset:

• Without feature extraction:

– The best model is Quadratic Discriminant Analysis (QDA), which has 96.92% accuracy, 100% specificity, and 85% sensitivity.

– LDA and KNN have the smallest accuracy and can be avoided when performing classification without feature extraction.
Table 4
The classification performance of 10-fold CV with the "db2" wavelet method

         Imbalanced data            Balanced data
Model    ACC     SPE     SEN        ACC     SPE     SEN
LDA      96.10   99.25   89.00      95.60   100.00  98.00
QDA      94.90   97.25   91.00      97.84   99.00   91.00
KNN      94.80   97.50   84.00      96.92   100.00  89.00
NB       92.70   94.00   84.00      91.88   93.00   92.00
RF       96.20   99.50   85.00
GB       96.10   98.75   93.00      98.48   98.00   97.00
SVM

Table 5
The classification performance of 10-fold CV with the "coif1" wavelet method

         Imbalanced data            Balanced data
Model    ACC     SPE     SEN        ACC     SPE     SEN
LDA      94.20   99.50   88.00      95.28   100.00  96.00
QDA      93.30   98.50   90.00      97.52   99.00   91.00
KNN      95.30   99.75   82.00      96.04   100.00  92.00
NB       92.80   95.00   84.00      92.40   91.00   95.00
RF       95.90   98.50   92.00      97.04   95.00   98.00
GB       96.20   98.25   94.00      97.20   95.00   98.25
SVM

Table 6
The classification performance of 10-fold CV with the MFCC method
         Imbalanced data            Balanced data
Model    ACC     SPE     SEN        ACC     SPE     SEN
LDA      93.70   99.00   82.00      95.24   97.00   90.00
QDA      94.00   99.05   83.00      94.00   95.00   93.00
KNN      90.70   98.25   78.00      94.20   95.50   93.00
NB       92.30   96.75   85.00      94.72   95.00   94.00
RF       94.40   98.50   77.00      95.80   96.50   94.00
GB

• Among the feature extraction methods:

– Wavelet db4 is the best method when it is used with SVM, which attains 98.99% accuracy, 99% sensitivity, and 99% specificity.

– Wavelet db2 associated with RF challenges db4 associated with SVM, with 98.56% accuracy, 97% sensitivity, and 98.10% specificity.

– Wavelet coif1 brings its best results when associated with SVM: 97.92% accuracy, 98% sensitivity, and 98% specificity.

– MFCC performs worse than all the DWT variants used here, but brings its best results when associated with SVM.
According to Figure 18, we can draw the conclusions below for the imbalanced dataset:
Figure 17:
Box-plot for model comparison in terms of accuracy in the balanced data
Figure 18:
Box-plot for model comparison in terms of accuracy in the imbalanced dataset

• Without feature extraction:

– The best model is GB, which has 95.1% accuracy, 82.00% sensitivity, and 95.36% specificity.

– LDA and KNN have the smallest accuracy and can be avoided when performing classification without feature extraction.

• Among the feature extraction methods:

– Wavelet db2 is the best method when associated with RF, which brings 98.99% accuracy, 99.25% specificity, and 95% sensitivity.

– Wavelet db4 can challenge wavelet db2 when used with SVM or RF, which here bring 98.90% accuracy, 99.10% specificity, and 98% sensitivity.
Table 7
ANOVA analysis for the imbalanced dataset

                   Df     Sum Sq     Mean Sq    F value   Pr(>F)
Models             6      8947.34    1491.22    60.10     0.0000
feat_extr          4      18879.49   4719.87    190.23    0.0000
Models:feat_extr   24     28916.51   1204.85    48.56     0.0000
Residuals          1715   42552.50   24.81
Table 8
Omega-squared effect sizes (imbalanced dataset)

– Wavelet coif1 brings its best results when associated with SVM: 97.80% accuracy, 99.25% specificity, and 93% sensitivity.

– MFCC performs worse than all the DWT variants used here, but brings its best results when associated with SVM.

Looking at the results presented above, we observe many differences in the predictive performances. In the next part, we check the statistical significance of the differences in predictive performances and the effect size.
There are two factors to evaluate: the feature extraction method and the model, with five and seven levels respectively. Therefore, the two-way Analysis Of Variance (ANOVA) is suitable for our analysis. Using the two-way ANOVA, we can simultaneously evaluate how the type of feature extraction and the model affect the accuracy of classification. Hence, we can test the three effects below on classification accuracy (a sketch of this analysis follows the list):

• the effect of feature extraction,

• the effect of the models,

• the effect of the interaction between feature extraction and models.
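A minimal sketch of this two-way ANOVA with statsmodels, on a synthetic stand-in for the results table (the factor names Models and feat_extr follow Tables 7 and 9; the real data frame must be built from the cross-validation results):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Stand-in results table: one accuracy value per CV replicate,
# crossed over the two factors of the study.
rng = np.random.default_rng(0)
models = ["LDA", "QDA", "KNN", "NB", "RF", "GB", "SVM"]
extractors = ["wfe", "mfcc", "coif1", "db2", "db4"]
rows = [(m, f, 90 + rng.normal()) for m in models for f in extractors for _ in range(50)]
df = pd.DataFrame(rows, columns=["Models", "feat_extr", "accuracy"])

# Two-way ANOVA with interaction, as in Tables 7 and 9.
fit = ols("accuracy ~ C(Models) * C(feat_extr)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))
```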
From Table 7, the P-values obtained from the ANOVA analysis for feature extraction, models, and their interaction are statistically significant ($P \leq 0.05$). We conclude that the type of feature extraction, the type of model, and the interaction of both significantly affect the accuracy of classification. Each factor has an independent significant effect on classification accuracy. While it is good to know whether some models or feature extraction techniques have a statistically significant effect on the accuracy, it is just as important to know the size of the effect they have on the outcome. To check this, we can calculate the effect size, which is estimated by the omega-squared measures presented in Table 8. These estimate how much variance in the response variable is accounted for by the explanatory variables. The following interpretation of omega-squared is suggested by (Field, 2013):

• omega-squared 0-0.01: very small;

• omega-squared 0.01-0.06: small;

• omega-squared 0.06-0.14: medium;

• omega-squared > 0.14: large.
Table 9
ANOVA analysis for the balanced dataset

                   Df     Sum Sq     Mean Sq   F value   Pr(>F)
Models             6      4607.28    767.88    111.95    0.0000
feat_extr          4      2564.63    641.16    93.48     0.0000
Models:feat_extr   24     4953.55    206.40    30.09     0.0000
Residuals          1715   11762.96   6.86
Table 10
Omega-squared effect sizes (balanced dataset)

Table 11
HSD test for pairwise comparison of feature extraction methods (imbalanced dataset)

     term        comparison   estimate   conf.low   conf.high   adj.p.value
1    feat_extr   mfcc-wfe     5.31       4.29       6.34        0.00
2    feat_extr   coif1-wfe    7.33       6.30       8.36        0.00
3    feat_extr   db2-wfe      7.89       6.86       8.91        0.00
4    feat_extr   db4-wfe      9.49       8.46       10.51       0.00
5    feat_extr   coif1-mfcc   2.01       0.99       3.04        0.00
6    feat_extr   db2-mfcc     2.57       1.54       3.60        0.00
7    feat_extr   db4-mfcc     4.17       3.14       5.20        0.00
8    feat_extr   db2-coif1    0.56       -0.47      1.59        0.58
9    feat_extr   db4-coif1    2.16       1.13       3.19        0.00
10   feat_extr   db4-db2      1.60       0.57       2.63        0.00
According to this, the model has a medium effect on the mean accuracy, while feature extraction and the interaction between model and feature extraction have a large effect.
From Table 9, the P-values obtained from the ANOVA analysis for feature extraction, models, and their interaction are statistically significant ($P \leq 0.05$). We conclude that the type of feature extraction, the type of model, and the interaction of both significantly affect the accuracy of classification. As previously seen, each factor has an independent significant effect on the classification accuracy. For the effect size, the model and the interaction between model and feature extraction have a large effect on the mean accuracy, while feature extraction has a medium effect (Table 10).

Now, given that the feature extraction methods alongside the suitable models are statistically significant with a certain effect size, it is important to mention that ANOVA does not tell us which levels differ from one another. To identify the pairs of significantly different feature extraction methods and model types, we can perform a multiple pairwise comparison analysis.
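A minimal sketch of such a comparison with statsmodels' Tukey HSD test, reusing the stand-in results table df from the ANOVA sketch above:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Pairwise comparison of the feature extraction methods, as in Tables 11-12.
tukey = pairwise_tukeyhsd(endog=df["accuracy"], groups=df["feat_extr"], alpha=0.05)
print(tukey.summary())
```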
Table 12
HSD test for pairwise comparison of feature extraction methods (balanced dataset)

     term        comparison   estimate   conf.low   conf.high   adj.p.value
1    feat_extr   mfcc-wfe     1.45       0.91       1.99        0.00
2    feat_extr   coif1-wfe    2.66       2.12       3.20        0.00
3    feat_extr   db4-wfe      3.06       2.52       3.60        0.00
4    feat_extr   db2-wfe      3.22       2.68       3.76        0.00
5    feat_extr   coif1-mfcc   1.22       0.68       1.76        0.00
6    feat_extr   db4-mfcc     1.62       1.08       2.16        0.00
7    feat_extr   db2-mfcc     1.77       1.23       2.31        0.00
8    feat_extr   db4-coif1    0.40       -0.14      0.94        0.26
9    feat_extr   db2-coif1    0.55       0.01       1.09        0.04
10   feat_extr   db2-db4      0.15       -0.39      0.69        0.94
Table 13
HSD-test for pairwise comparison of models (imbalanced dataset)

     term    comparison   estimate   conf.low   conf.high   adj.p.value
1    Models  KNN-LDA          0.42      -0.90        1.74          0.97
2    Models  NB-LDA           2.44       1.12        3.76          0.00
3    Models  QDA-LDA          3.96       2.64        5.28          0.00
4    Models  RF-LDA           5.28       3.96        6.60          0.00
5    Models  GB-LDA           5.46       4.14        6.78          0.00
6    Models  SVM-LDA          5.92       4.60        7.24          0.00
7    Models  NB-KNN           2.02       0.70        3.34          0.00
8    Models  QDA-KNN          3.54       2.22        4.86          0.00
9    Models  RF-KNN           4.86       3.54        6.18          0.00
10   Models  GB-KNN           5.04       3.72        6.36          0.00
11   Models  SVM-KNN          5.50       4.18        6.82          0.00
12   Models  QDA-NB           1.52       0.20        2.84          0.01
13   Models  RF-NB            2.84       1.52        4.16          0.00
14   Models  GB-NB            3.02       1.70        4.34          0.00
15   Models  SVM-NB           3.48       2.16        4.80          0.00
16   Models  RF-QDA           1.32       0.00        2.64          0.05
17   Models  GB-QDA           1.50       0.18        2.82          0.01
18   Models  SVM-QDA          1.96       0.64        3.28          0.00
19   Models  GB-RF            0.18      -1.14        1.50          1.00
20   Models  SVM-RF           0.64      -0.68        1.96          0.78
21   Models  SVM-GB           0.46      -0.86        1.78          0.95
Table 14
HSD-test for pairwise comparison of models (balanced dataset)

     term    comparison   estimate   conf.low   conf.high   adj.p.value
1    Models  LDA-NB           0.35      -0.34        1.04          0.74
2    Models  KNN-NB           1.25       0.56        1.94          0.00
3    Models  QDA-NB           3.39       2.70        4.08          0.00
4    Models  GB-NB            3.40       2.71        4.09          0.00
5    Models  RF-NB            3.64       2.95        4.33          0.00
6    Models  SVM-NB           4.31       3.62        5.00          0.00
7    Models  KNN-LDA          0.90       0.20        1.59          0.00
8    Models  QDA-LDA          3.04       2.35        3.73          0.00
9    Models  GB-LDA           3.05       2.36        3.74          0.00
10   Models  RF-LDA           3.29       2.60        3.98          0.00
11   Models  SVM-LDA          3.96       3.27        4.65          0.00
12   Models  QDA-KNN          2.14       1.45        2.84          0.00
13   Models  GB-KNN           2.15       1.46        2.84          0.00
14   Models  RF-KNN           2.39       1.70        3.08          0.00
15   Models  SVM-KNN          3.06       2.37        3.76          0.00
16   Models  GB-QDA           0.01      -0.68        0.70          1.00
17   Models  RF-QDA           0.25      -0.44        0.94          0.94
18   Models  SVM-QDA          0.92       0.23        1.61          0.00
19   Models  RF-GB            0.24      -0.45        0.93          0.95
20   Models  SVM-GB           0.91       0.22        1.60          0.00
21   Models  SVM-RF           0.67      -0.02        1.36          0.06
6. Discussion and Conclusion
In this paper, four feature extraction techniques, namely DWT-db4, DWT-db2, DWT-coif1, and MFCC, were investigated and combined with seven machine learning classifiers for classifying epileptic seizures in both a balanced and an imbalanced dataset. Stochastic hold-out with 50 replications was used to generate the predictive performances. Two-way and one-way ANOVA tests were used for the statistical significance analysis of the differences in predictive performance and for effect size, and the Tukey HSD test was used for pairwise comparison of models and feature extraction methods.

The results indicate that, in the imbalanced dataset, without feature extraction, Gradient Boosting (GB) performed best with a classification accuracy of 95.1%, but this accuracy is only significantly higher in comparison with LDA and KNN. Among the feature extraction methods, DWT-db2 combined with Random Forest (RF) is the best combination, with a classification accuracy of 98.99%. However, this combination is closely challenged by DWT-db4 combined with Support Vector Machine (SVM) or Random Forest (RF), each with a classification accuracy of 98.90%.

In the balanced dataset, without feature extraction, Quadratic Discriminant Analysis (QDA) performed best with a classification accuracy of 96.92%, which is only significantly higher in comparison with LDA and KNN. Among the feature extraction methods, DWT-db4 combined with Support Vector Machine (SVM) is the best combination, with a classification accuracy of 98.99%. Nevertheless, this combination is challenged by DWT-db2 combined with Random Forest (RF), with a classification accuracy of 98.56%.

The results also highlight that MFCC performs worse than all the DWT variants used here, in both the balanced and the imbalanced dataset. The mean differences are statistically significant, with minimum mean differences of 2.01 and 1.22 when MFCC is compared to coif1 in the imbalanced and balanced datasets, respectively (Tables 11 and 12). Whether in the balanced or the imbalanced dataset, the feature extraction methods, the models, and the interaction between them have a statistically significant effect on the classification accuracy. In the imbalanced dataset, the model has a medium effect while feature extraction and the interaction between model and feature extraction have a large effect; in the balanced dataset, the model and the interaction have a large effect while feature extraction has a medium effect.

To improve on this work, we plan to analyze the mean differences of the different interactions between feature extraction methods and classifiers. We also plan to study larger databases, in order to assess whether the present results hold in any EEG dataset of epileptic seizures and to establish the magnitude of the difference in predictive performance between balanced and imbalanced datasets. Another direction is to extend the number of feature extraction methods and classifier models, and to test our approach on ECG datasets.
CRediT authorship contribution statement
Cyrille Feudjio:
Investigation, Conceptualization, Methodology, Software, Data curation, Experimentation, Result compilation, Writing - original draft.
Victoire Djimna Noyum:
Software, Result compilation, Writing - review.
Younous Perieukeu Mofendjou:
Software, Result compilation, Writing - review.
Rockefeller:
Supervision, Validation, Writing - review & editing.
Ernest Fokoué:
Conceptualization of this study, Supervision, Validation, Writing - review & editing.
References
Acharya, U.R., Sree, S.V., Ang, P.C.A., Yanti, R., Suri, J.S., 2012. Application of non-linear and wavelet based features for the automated identification of epileptic EEG signals. International Journal of Neural Systems 22, 1250002.
Ahammad, N., Fathima, T., Joseph, P., 2014. Detection of epileptic seizure event and onset using EEG. BioMed Research International 2014.
Bandarabadi, M., Teixeira, C.A., Rasekhi, J., Dourado, A., 2015. Epileptic seizure prediction using relative spectral power features. Clinical Neurophysiology 126, 237–248.
Bhople, A.D., Tijare, P., 2012. Fast Fourier transform based classification of epileptic seizure using artificial neural network. Int J Adv Res Comput Sci Softw Eng 2.
Chen, D., Wan, S., Xiang, J., Bao, F.S., 2017. A high-performance seizure detection algorithm based on discrete wavelet transform (DWT) and EEG. PLoS ONE 12.
Cutler, A., Cutler, D.R., Stevens, J.R., 2012. Random forests, in: Ensemble Machine Learning. Springer, pp. 157–175.
Field, A., 2013. Discovering Statistics Using IBM SPSS Statistics. Sage.
Gadhoumi, K., Lina, J.M., Gotman, J., 2012. Discriminating preictal and interictal states in patients with temporal lobe epilepsy using wavelet analysis of intracerebral EEG. Clinical Neurophysiology 123, 1906–1916.
Gajic, D., Djurovic, Z., Gligorijevic, J., Di Gennaro, S., Savic-Gajic, I., 2015. Detection of epileptiform activity in EEG signals based on time-frequency and non-linear analysis. Frontiers in Computational Neuroscience 9, 38.
Ghojogh, B., Crowley, M., 2019. Linear and quadratic discriminant analysis: Tutorial. arXiv preprint arXiv:1906.02590.
Gilda, S., Slepian, Z., 2019. Automatic Kalman-filter-based wavelet shrinkage denoising of 1D stellar spectra. Monthly Notices of the Royal Astronomical Society 490, 5249–5269.
Golmohammadi, M., Shah, V., Lopez, S., Ziyabari, S., Yang, S., Camaratta, J., Obeid, I., Picone, J., 2017. The TUH EEG seizure corpus, in: Proceedings of the American Clinical Neurophysiology Society Annual Meeting, p. 1.
Gong, S., Dai, Y., Ji, J., Wang, J., Sun, H., 2015. Emotion analysis of telephone complaints from customer based on affective computing. Computational Intelligence and Neuroscience 2015.
Guenot, M., 2004. Surgical treatment of epilepsy: outcome of various surgical procedures in adults and children. Revue Neurologique 160, 5S241–50.
John Martin, R., Sujatha, S., Swapna, S., 2018. Multiresolution analysis in EEG signal feature engineering for epileptic seizure detection. International Journal of Computer Applications 975, 8887.
Kandar, H., Das, S.K., Ghosh, L., Gupta, B.K., 2012. Epilepsy and its management: A review. Journal of PharmaSciTech 1, 20–26.
Merry, R., 2005. Wavelet theory and applications: a literature study. DCT Rapporten 2005.
Mursalin, M., Zhang, Y., Chen, Y., Chawla, N.V., 2017. Automated epileptic seizure detection using improved correlation-based feature selection with random forest classifier. Neurocomputing 241, 204–214.
Natekin, A., Knoll, A., 2013. Gradient boosting machines, a tutorial. Frontiers in Neurorobotics 7, 21.
Nguyen, P., Tran, D., Huang, X., Sharma, D., 2012. A proposed feature extraction method for EEG-based person identification, in: Proceedings of the International Conference on Artificial Intelligence (ICAI), The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), p. 1.
Othman, M., Wahab, A., Khosrowabadi, R., 2009. MFCC for robust emotion detection using EEG, in: 2009 IEEE 9th Malaysia International Conference on Communications (MICC), IEEE, pp. 98–101.
Paul, Y., 2018. Various epileptic seizure detection techniques using biomedical signals: a review. Brain Informatics 5, 6.
Polat, K., Güneş, S., 2007. Classification of epileptiform EEG using a hybrid system based on decision tree classifier and fast Fourier transform. Applied Mathematics and Computation 187, 1017–1026.
al Qerem, A., Kharbat, F., Nashwan, S., Ashraf, S., Blaou, K., 2020. General model for best feature extraction of EEG using discrete wavelet transform wavelet family and differential evolution. International Journal of Distributed Sensor Networks 16, 1550147720911009.
Rasekhi, J., Mollaei, M.R.K., Bandarabadi, M., Teixeira, C.A., Dourado, A., 2013. Preprocessing effects of 22 linear univariate features on the performance of seizure prediction methods. Journal of Neuroscience Methods 217, 9–16.
Ren, H., Qu, J., Chai, Y., Huang, L., Tang, Q., 2018. Cepstrum coefficient analysis from low-frequency to high-frequency applied to automatic epileptic seizure detection with bio-electrical signals. Applied Sciences 8, 1528.
Saputro, I.R.D., Maryati, N.D., Solihati, S.R., Wijayanto, I., Hadiyoso, S., Patmasari, R., 2019. Seizure type classification on EEG signal using support vector machine, in: Journal of Physics: Conference Series, IOP Publishing, p. 012065.
Stuart, A., Arnold, S., Ord, J.K., O'Hagan, A., Forster, J., 1994. Kendall's Advanced Theory of Statistics. Wiley.
Sutton, O., 2012. Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction. University Lectures, University of Leicester, 1–10.
Teixeira, C.A., Direito, B., Bandarabadi, M., Le Van Quyen, M., Valderrama, M., Schelter, B., Schulze-Bonhage, A., Navarro, V., Sales, F., Dourado, A., 2014. Epileptic seizure predictors based on computational intelligence techniques: A comparative study with 278 patients. Computer Methods and Programs in Biomedicine 114, 324–336.
Ullah, I., Hussain, M., Aboalsamh, H., et al., 2018. An automated system for epilepsy detection using EEG brain signals based on deep learning approach. Expert Systems with Applications 107, 61–71.
Usman, S.M., Usman, M., Fong, S., 2017. Epileptic seizures prediction using machine learning methods. Computational and Mathematical Methods in Medicine 2017.
Wang, L., Xue, W., Li, Y., Luo, M., Huang, J., Cui, W., Huang, C., 2017. Automatic epileptic seizure detection in EEG signals using multi-domain feature extraction and nonlinear analysis. Entropy 19, 222.
Wang, Y., Zhou, W., Yuan, Q., Li, X., Meng, Q., Zhao, X., Wang, J., 2013. Comparison of ictal and interictal EEG signals using fractal features. International Journal of Neural Systems 23, 1350028.
Wen, T., Zhang, Z., 2017. Effective and extensible feature extraction method using genetic algorithm-based frequency-domain feature search for epileptic EEG multiclassification. Medicine 96.
Zandi, A.S., Tafreshi, R., Javidan, M., Dumont, G.A., 2013. Predicting epileptic seizures in scalp EEG based on a variational Bayesian Gaussian mixture model of zero-crossing intervals. IEEE Transactions on Biomedical Engineering 60, 1401–1413.
Zhang, Y., Liu, B., Ji, X., Huang, D., 2017. Classification of EEG signals based on autoregressive model and wavelet packet decomposition. Neural Processing Letters 45, 365–378.